Abstract
Multi-particle interference is a key resource for quantum information processing, as exemplified by Boson Sampling. Hence, given its fragile nature, an essential desideratum is a solid and reliable framework for its validation. However, while several protocols have been introduced to this end, the approach remains fragmented and lacks a unified picture to guide future developments. In this work, we propose an operational approach to validation that encompasses and strengthens the state of the art for these protocols. To this end, we consider Bayesian hypothesis testing and the statistical benchmark as the most favorable protocols for small- and large-scale applications, respectively. We numerically investigate their operation with finite sample size, extending previous tests to larger dimensions, and against two adversarial algorithms for classical simulation: the mean-field sampler and the metropolized independent sampler. To evidence the actual need for refined validation techniques, we show how the assessment of numerically simulated data depends on the available sample size, as well as on the internal hyper-parameters and other practically relevant constraints. Our analyses provide general insights into the challenge of validation, and can inspire the design of algorithms with a measurable quantum advantage.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
A quantum computational advantage occurs when a quantum device starts outperforming its best classical counterpart on a given specialized task [1, 2]. Intermediate models [3–6] and platforms [7–12] have been proposed to achieve this regime, largely reducing the physical resources required by universal computation. The technological race towards quantum computational advantage nonetheless goes hand-in-hand with the development of classical protocols capable of discerning genuine quantum information processing [13–17]. The intertwined evolution of these two aspects has been highlighted in particular by Boson Sampling [3, 18], where several protocols have been introduced [19–35] and experimentally tested [31–46] to rule out non-quantum processes. Boson Sampling, in its original formulation [3], consists in sampling from a probability distribution that can be related to the evolution of indistinguishable photons in a linear-optical interferometer. Recent analyses have suggested thresholds in the number of photons n beyond which classical algorithms should be surpassed [47, 48, 50].
While the sampling task itself has been thoroughly analyzed in computational complexity theory, we still lack a comparable understanding when it comes to its validation. However, it is clear from a practical perspective that any computational problem designed to demonstrate quantum advantage needs to be formulated together with a set of validation protocols that account for the physical ramifications and resources required for its implementation. For instance, while small-scale examples can be validated by direct solution of the Schrödinger equation and statistical measures such as the cross-entropy [6], this approach is prohibitively expensive for debugging a faulty Boson sampler. Moreover, for Boson Sampling a deterministic certification is impossible [24] by the very definition of the problem [20]. Hence, it is crucial to develop debugging tools, as well as tests to exclude undesired hypotheses on the system producing the output, that are computationally affordable and experimentally feasible. Furthermore, due to the random fluctuations inherent to any finite-size problem, a validation cannot be considered reliable until sufficient physical resources are spent to reach reasonable experimental uncertainties. Ultimately, no computational problem can provide evidence of quantum advantage unless quantitative validation criteria can be stated.
In this work, we investigate the problem of validating multi-photon quantum interference in realistic scenarios with finite data. The paper is structured as follows: first, we discuss possible ambiguities in the validation of Boson Sampling, which play a crucial role in large-size experiments. Then, building upon state-of-the-art validation protocols, we address the above considerations with a more quantitative analysis. We describe a practical approach to validation that makes the most of the limited physical resources available. Specifically, we study the use of the statistical benchmark [30] and the Bayesian hypothesis testing [31] to validate n-photon interference for large and small n, respectively. We numerically investigate their operation against classical algorithms to simulate quantum interference, with a particular focus on the number of measurements. The reported analysis strengthens the need for a well-defined approach to validation, both to demonstrate quantum advantage and to assist applications that involve multi-photon states.
2. Validation of Boson Sampling: framework
Our aim, in the context of Boson Sampling, consists in the unambiguous identification of a quantum advantage in a realistic scenario. We focus on the task of validation, or verification, whose aim is to check whether measured experimental data are compatible with what can be expected from a given physical model. Validation generally requires fewer resources and is, thus, more appropriate for practical applications than full certification, which is exponentially hard in n for Boson Sampling [20, 51]. In both cases, these claims must follow a well-defined protocol to distill experimental evidence that is accepted by the community under jointly agreed criteria [52] (figure 1). As we discuss below and in section 3, we propose an application-oriented approach to validation that takes into consideration the limited physical resources, be they related to the evaluation of permanents [53] or to the finite sample size [51]. In fact, without such well-defined approaches, obstacles or ambiguities may arise in large-scale experiments, as we highlight in the following. For instance, not all validation protocols are computationally efficient, which is a strong limitation for future multi-photon applications or high-rate real-time monitoring. Also, a theoretically scalable validation protocol may still be experimentally impractical due to large instrumental overheads or large prefactors that enter the scaling law.
Given two validation protocols meant to rule out the same physical hypothesis or model, which conclusion can be drawn if they agree for a data set of given size and unexpectedly disagree when we add more data? In principle we can accept or reject a data set once we reach a certain level of confidence, but which action is to be taken if this threshold is not reached after a large number of measurement events (which hereafter we refer to as the 'sample size')? Shall we proceed until we pass that level, shall we reject the data set, or shall we make a guess based on the available data? Finally, what if the classical algorithm becomes more effective in simulating Boson Sampling for larger data sets, as for Markov chains [47], or for longer processing times, as for adversarial machine-learning algorithms [55] that could exploit specific vulnerabilities of validation protocols?
However artificial some of the above questions may seem, such a skeptical approach was indeed already adopted [25] and addressed [26–30, 35–37] with the mean-field sampler (see appendix A).
- (a) Sample size. The strength of a validation protocol is affected by the limited number of collected events, as compared to the total number of distinct n-photon output events. While this limitation is not relevant for small-scale implementations, due to (i) the low dimension of the Hilbert space, (ii) a high level of control and (iii) reduced losses, it represents one of the main bottlenecks for the actually targeted large-scale instances [56]. It is thus desirable to assess the robustness and resilience of a protocol under such incomplete-sampling effects, to quantify the impact of the always strictly finite experimental resources on the protocol's actual range of applicability. We therefore propose to define a (minimal) threshold sample size which must be available for validation. Given such a set of events, a validation protocol must be capable of giving a reliable answer within a certain confidence level.
- (b) Available sampling time. While the sampling rate is nearly constant for current quantum and classical approaches [48], de facto making the sampling time not relevant, it cannot be excluded that future algorithms may process data and output all events at once. The very quality of the simulation, i.e. the similarity to quantum Boson Sampling in a given metric, could also improve with processing time [47, 55]. Ultimately, the available sampling time must be treated as a parameter independent of the sample size, while at the same time it should be adapted to the sample size required for a reliable validation.
- (c) Unitary. Unitary evolutions should be drawn Haar-randomly by a third agent at the start of the competition, to avoid any preprocessing. This agent, the validator, uses specific validation protocols to decide whether a sample is compatible with quantum operation.
In the thus defined setting, a data set is said to be validated according to the following rule (figure 1(a)):
Boson Sampling is validated if, collecting the agreed number of events within the available time from some random unitary, the data set is accepted by all selected validators.
Given a unitary and a set of validation protocols, we are then left with the choice of the sample size and the sampling time, which need to be plausible for technological standards. Demanding that the events be sampled within the available time, these thresholds in fact limit the size of the problem (n, m) for an experimental implementation. As for the time, one possibility, feasible for quantum experiments, could be for instance one hour. Within this time, a quantum device will probably output events at a nearly constant rate, while a classical computer can output them at any rate allowed by its clock cycle time. The choice of the sample size is instead more intricate, since a value too high collides with the limited sampling time, while a value too low implies an unreliable validation. With these or further considerations [57], classical and quantum samplers should agree upon a combination of photon number n, mode number m, sample size and sampling time that allows them to validate their operation.
3. Validation with finite sample size
In this section, we investigate a convenient approach to validation that distinguishes between two regimes: up to n ∼ 30 (section 3.1) and beyond n ∼ 30 (section 3.2). In each case, we first summarize the main ideas behind the protocol's operation. Then, we discuss its performance for various (n, m), highlighting strengths and limitations, by numerically simulating experiments with finite sample size and distinguishable or indistinguishable photons.
3.1. Bayesian tests for small-scale experiments
The Bayesian approach to Boson Sampling validation, introduced in reference [31] and recently investigated also in reference [58], aims to identify the more likely of two alternative hypotheses that model the multi-photon states under consideration. In particular, it tests the Boson Sampling hypothesis (HQ), which assumes fully indistinguishable n-photon states, against an alternative hypothesis (HA) for the source that produces the measurement outcomes {x}. Equal probabilities are assigned to the two hypotheses prior to the experiment. Let us denote with pQ(xk) (pA(xk)) the scattering probability associated with the output state xk for HQ (HA). The intuition is that, if HQ is most suitable to model the experiment, it is more likely to collect events for which pQ(xk) > pA(xk). The idea is made quantitative by considering the confidence P(Hhypo|{x}) we assign to each hypothesis, with hypo being either Q or A. By applying Bayes' theorem, after N events we have

P(HQ|{x}) / P(HA|{x}) = ∏k=1…N [pQ(xk) / pA(xk)].      (1)
By combining equation (1) and P(HQ|{x}) + P(HA|{x}) = 1, it follows that our confidence in the hypothesis HQ becomes P(HQ|{x}) = 1/(1 + χ), with χ = ∏k=1…N [pA(xk) / pQ(xk)].
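As an illustration of the update rule above, the following minimal sketch accumulates the log-likelihood ratio and returns the confidence P(HQ|{x}) = 1/(1 + χ); the likelihood values below are made-up placeholders, not outputs of an actual sampler.

```python
import math

def bayesian_confidence(likelihood_pairs):
    """Confidence P(HQ|{x}) after events with likelihoods (pQ(x_k), pA(x_k)).

    chi is the product of the ratios pA(x_k)/pQ(x_k); its logarithm is
    accumulated for numerical stability.
    """
    log_chi = sum(math.log(p_a) - math.log(p_q) for p_q, p_a in likelihood_pairs)
    return 1.0 / (1.0 + math.exp(log_chi))

# Hypothetical run: 20 events that HQ consistently explains better than HA
events = [(0.6, 0.4)] * 20
conf = bayesian_confidence(events)
```

With no events the confidence stays at the prior value 1/2, and a run that consistently favours HA drives it symmetrically towards 0.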
This test requires the evaluation of permanents of n × n scattering matrices [50, 53], since

pQ(xk) = |perm(US,T)|² / (s1! ⋯ sm! t1! ⋯ tm!),      (2)

where US,T is the matrix obtained by repeating sk (tk) times the kth column (row) of the unitary U, sk (tk) being the occupation number in the input (output) mode k. The presence of the permanent in equation (2) sets an upper limit to the number of photons that can be studied in practical applications [40–47]. Indeed, it is foreseeable that real-time monitoring or feedback-loop stabilization of quantum optics experiments will only have access to portable platforms with limited computational power. However, an interesting advantage of this validation protocol is its broad versatility, due to the absence of assumptions on the alternative distributions. Importantly, when applied to validate Boson Sampling against distinguishable photons, it requires very few measurements for a reliable assessment. In figure 2, for instance, we numerically investigate its application as a function of sample size, extending previous simulations from n = 3 [31] to n = 3, 6, 9, 12 and m = n².
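The exponential cost quoted above comes from the permanent itself. As a self-contained reference point, the sketch below implements Ryser's O(n 2^n) formula, the standard algorithm behind such runtime estimates; the test matrices are arbitrary examples, not scattering matrices from a real experiment.

```python
def permanent(A):
    """Permanent of a square matrix via Ryser's formula, O(n * 2^n):
    perm(A) = (-1)^n * sum over non-empty column subsets S of
              (-1)^{|S|} * prod_i sum_{j in S} A[i][j].
    """
    n = len(A)
    total = 0.0
    for mask in range(1, 1 << n):            # all non-empty subsets of columns
        prod = 1.0
        for i in range(n):
            prod *= sum(A[i][j] for j in range(n) if (mask >> j) & 1)
        total += (-1) ** bin(mask).count("1") * prod
    return (-1) ** n * total
```

For a 2 × 2 matrix this reduces to ad + bc, and for the n × n all-ones matrix it returns n!, which is a quick sanity check. The same routine works for complex entries, from which |perm(US,T)|² gives, up to the factorials, the scattering probability of equation (2).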
Data for distinguishable (HC) and indistinguishable (HQ) photons were generated using exact algorithms, respectively by Aaronson and Arkhipov [20] and by Clifford and Clifford [48]. The analysis shows how the validation protocol becomes even more effective for increasing n, being able to output a reliable verdict after only ∼20 events. However, as mentioned, its power comes at the cost of being computationally inefficient in n. Also, it is not possible to preprocess and store information for successive re-use, since the confidence depends on the specific unitary U and on the sampled events, through pQ(xk). Hence, in the regime n ∼ 25–35 [46, 47] it becomes rapidly harder to perform a validation in real time. Eventually, since classical supercomputers cannot assist quantum experiments in everyday applications, the Bayesian test becomes prohibitive from n ∼ 35.
3.2. Statistical benchmark for large-scale experiments
In the previous section we described how the Bayesian test is effective in validating small- and mid-scale experiments with very few measurement events. However, the evaluation of permanents hinders its application for large n, whether due to overly large scattering matrices or to the need for speed in real-time evaluations. To overcome this limitation, further validation protocols have been proposed in the last few years, seeking a convenient compromise between predictive power and physical resources. All these approaches have their own strengths and limitations, and tackle the problem from different angles [16], e.g. using suppression laws [24–28], machine learning [29, 33] or statistical properties related to multi-particle interference [30]. In this section we focus on the latter protocol, which arguably represents the most promising solution for the reasons we outline below.
Statistical benchmark with finite sample size. Validation based on the statistical benchmark looks at statistical features of the C-dataset, the set of two-mode correlators

Cij = ⟨n̂i n̂j⟩ − ⟨n̂i⟩⟨n̂j⟩,      (3)

where (i, j) are distinct output ports and n̂i is the bosonic number operator of port i. Two statistical features that are effective to discriminate states of indistinguishable and distinguishable photons are its normalized mean NM (the mean divided by n/m²) and its coefficient of variation CV (the standard deviation divided by the mean). For any unitary transformation and input state we can retrieve a point in the plane (NM, CV), where alternative models tend to cluster in separate clouds located via random matrix theory (figure 3(a)) [30]. Validation would then consist in (i) collecting a suitable number of events, (ii) evaluating the experimental point (NM, CV) associated with the Cij and (iii) identifying the cluster that the point is assigned to. For a sufficiently large sample size, the point will be attributable with large confidence to only one of the models, thus ruling out the others (figure 3(b)).
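To make steps (i)–(iii) concrete, the sketch below estimates the correlators of equation (3) and the point (NM, CV) from a list of occupation-number samples. The toy data are drawn from a classical multinomial model (each photon lands in a uniformly random mode), not from a Boson Sampling distribution, so the resulting point only illustrates the mechanics of the estimator.

```python
import random

def correlators(samples, m):
    """C_ij = <n_i n_j> - <n_i><n_j> for all distinct pairs i < j,
    estimated from a list of length-m occupation-number lists."""
    N = len(samples)
    mean = [sum(s[i] for s in samples) / N for i in range(m)]
    C = []
    for i in range(m):
        for j in range(i + 1, m):
            nn = sum(s[i] * s[j] for s in samples) / N
            C.append(nn - mean[i] * mean[j])
    return C

def nm_cv(C, n, m):
    """Normalized mean NM = mean(C) / (n / m^2) and coefficient of
    variation CV = std(C) / mean(C) of the C-dataset."""
    mu = sum(C) / len(C)
    std = (sum((c - mu) ** 2 for c in C) / len(C)) ** 0.5
    return mu / (n / m ** 2), std / mu

# Toy data: n photons dropped independently into m modes
random.seed(42)
n, m, N = 2, 4, 20000
samples = []
for _ in range(N):
    counts = [0] * m
    for _ in range(n):
        counts[random.randrange(m)] += 1
    samples.append(counts)

NM, CV = nm_cv(correlators(samples, m), n, m)
```

For this multinomial model the exact covariance is −n/m², so NM converges to −1; genuinely interfering photons would shift (NM, CV) away from such a classical point.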
The statistical benchmark represents the state of the art among validation protocols that do not require the evaluation of permanents. Indeed, this approach has several advantages [39]: (a) it is computationally efficient (one only needs to compute two-point correlators), (b) it can reveal deviations from the expected behaviour (manifest in the NM-CV plane), (c) it makes more reliable predictions for larger n (clouds become more separate), (d) it is sample-efficient (clouds separate relatively early, after few measurement events). However, despite points (c, d) above, in actual conditions the experimental point is not always easy to validate. In fact, as mentioned in point (b), hardware imperfections and partial distinguishability make the point move away from the average route shown in figure 3(a). These issues can be addressed and mitigated by numerically generating, for a fixed sample size, clouds from unitary transformations that take these aspects into account. This intuition applies to all imperfections that can be described by an error model, for instance controlled by a set of parameters that quantify the noise level. Specifically, whenever it is possible to numerically generate events from a probability distribution that models these imperfections, we can use this data to train a classifier to recognize them in actual experiments. Relevant examples of error models include the above-mentioned partial distinguishability [16, 23–26, 60–62], unbalanced losses and fabrication imperfections in the optical elements of an interferometer [59].
As suggested in reference [39], and more closely investigated in figures 3(b) and (c), a convenient approach is to employ machine learning to assign experimental points to one of the two clouds, with a certain confidence level. Specifically, one can train a classifier with numerically generated data [20, 48] for a certain (n, m) and sample size, possibly including error models, and then deploy it for all applications in that regime. In this sense, the classifier can be seen as a label of the model that can classify (validate) data for a given (n, m). This intuition can be extended to a classifier that is trained on data from multiple sample sizes [see figure 3(c)], which is likely more practical. For a fixed sample size, the computational resources to sample events from a distribution given by n distinguishable (indistinguishable) photons scale polynomially [20] (exponentially [48]) in n. However, once trained, this classifier can be considered as an off-the-shelf tool that is readily applicable to validate multi-photon interference with no additional computational overhead, which is ideal for large-size experiments. In appendix B we explore how such machine-learning classifiers can be combined to boost their joint performance.
Finite-size effects in validation protocols. So far, we have qualitatively discussed the role of a limited sample size in the validation of multi-photon quantum interference. To provide a more quantitative analysis of finite-size effects for the task of validation, and in particular for the statistical benchmark, in the following we study how the parameters involved in the above validation protocol scale with the sample size. The goal of this section is to elaborate on a standard test which should be implemented in all validation protocols, to guarantee their experimental feasibility.
Let us start by considering a fixed unitary circuit U, for which we calculate the correlators Cij from equation (3). Such an evaluation in principle assumes the possibility to collect an arbitrary number of measurement events. In practical applications, however, sample sizes will always be limited. Hence, finite-size effects play a role in the estimation of the above correlators. According to the central limit theorem, the correlator retrieved from the experimental data can be represented as C̃ij = Cij + Xij/√N, where N is the sample size and Xij is a random variable, normally distributed with zero mean and variance σij². The σij depend on the unitary evolution U and should either be evaluated from the data or be estimated using random matrix theory. Now, to infer, from noisy C-datasets [30], the centre of the cloud of points in the NM-CV plane, we need to average not only over the Haar measure, but also over Xij.
Consequently, we have to assess the impact of finite-size effects on the estimate of the moments (NM, CV). First, since the noise induced by the finite sample size averages out, namely ⟨Xij⟩ = 0, the estimate of NM is unbiased. The estimation of CV is a bit more subtle, because we need to evaluate the mean of C̃ij². Since ⟨Xij⟩ = 0, then

⟨C̃ij²⟩ = Cij² + σij²/N

and, hence, the estimated CV approaches its asymptotic value with a correction that vanishes as 1/N. Note that the noise-averaged and Haar-averaged quantities cannot be easily compared, since the latter involves averaging the distribution of Xij over the unitary group. However, using the properties of the normal distribution under convex combinations, we can deduce that both orders of averaging yield approximately the same result (and the same scaling in N), in particular once N is large and the distribution is concentrated close to its mean. Numerical simulations for 3 ⩽ n ⩽ 15 and m = n² indeed confirm this picture (figure 4). Specifically, we observe that, upon averaging over different Haar-random unitaries with N events per realization, the deviation of the experimentally measured from the analytically predicted values decreases as fast as 1/√N. Hence, their estimation from finite-size data sets shows no exponential overhead that would hinder a practical application of the validation protocol.
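The 1/√N scaling can be checked directly on a classical toy model for which the exact correlator is known in closed form (multinomial occupation numbers, covariance −n/m²): quadrupling the sample size should halve the root-mean-square error of the estimated correlator. This is only a consistency check of the central-limit argument sketched above, not a Boson Sampling simulation.

```python
import random

random.seed(1)
m, n = 4, 2
C_exact = -n / m ** 2   # multinomial covariance: -n * p_i * p_j with p = 1/m

def estimate_C01(N):
    """Estimate C_01 = <n_0 n_1> - <n_0><n_1> from N finite samples."""
    tot0 = tot1 = tot01 = 0
    for _ in range(N):
        counts = [0] * m
        for _ in range(n):
            counts[random.randrange(m)] += 1
        tot0 += counts[0]
        tot1 += counts[1]
        tot01 += counts[0] * counts[1]
    return tot01 / N - (tot0 / N) * (tot1 / N)

def rms_error(N, trials=200):
    """Root-mean-square deviation of the estimator from the exact value."""
    return (sum((estimate_C01(N) - C_exact) ** 2 for _ in range(trials)) / trials) ** 0.5

# Central-limit scaling: a 16x larger sample should give a ~4x smaller error
ratio = rms_error(200) / rms_error(3200)
```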
4. Discussion
Validation of multi-photon quantum interference is expected to play an increasing role as the dimensionality of photonic applications grows, both in the number of photons and in the number of modes. To this end, and as notably emphasized by the race towards quantum advantage via Boson Sampling, it is necessary to define a set of requirements for a validation protocol to be meaningful. Ultimately, these requirements should make it possible to establish strong experimental evidence of quantum advantage that is accepted by the community within a jointly agreed framework.
In the present work, we implement such a program and describe a set of critical points that experimenters will need to agree upon in order to validate the operation of a quantum device. With the goal of building a solid framework for validation, we then discuss a practical approach that applies the most suitable state-of-the-art protocols in realistic scenarios. We report numerical analyses on the application of two key validation protocols, the Bayesian hypothesis testing and the statistical benchmark, with finite-size data, providing compelling evidence in support of this approach.
A clear and illustrative example of the above considerations is provided in appendix A, where we analyze the role of the sample size in an adversarial classical simulation of Boson Sampling.
Finally, in appendix B we show how validation protocols based on machine learning can be combined, and their performance boosted, by means of a meta-algorithm.
Acknowledgments
This work was supported by ERC Advanced Grant QU-BOSS (QUantum advantage via non-linear BOSon Sampling; Grant Agreement No. 884676); by the QuantERA ERA-NET Cofund in Quantum Technologies Project HiPhoP (High dimensional quantum Photonic Platform, Project ID 731473) and by project PRIN 2017 'Taming complexity via QUantum Strategies a Hybrid Integrated Photonic approach' (QUSHIP) Id. 2017SRNBRK. AB acknowledges support by the Georg H Endress foundation. MW is funded through Research Fellowship WA 3969/2-1 of the German Research Foundation (DFG). This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant agreement No. 801110 and the Austrian Federal Ministry of Education, Science and Research (BMBWF). It reflects only the author's view and the Agency is not responsible for any use that may be made of the information it contains.
Appendix A: Classical simulation and the role of sample size
To shed some light on the critical aspects of validation, and as a benchmark of the state of the art in this context, we now provide a qualitative analysis inspired by the metropolized independent sampler (MIS), a recent algorithm to classically simulate Boson Sampling [47]. The idea behind the MIS is reminiscent of the mean-field (MF) sampler [25], an adversarial classical algorithm that was capable of hacking one of the first validation protocols [32] using limited classical resources. In the race towards quantum computational supremacy, the introduction of the MIS has prompted the development of more sophisticated techniques to tackle classical simulations. For instance, besides the Bayesian test (see inset in figure 2), also the statistical benchmark is highly effective in validating Boson Sampling against the MF sampler (see figure A1(a)). For our scope, the key difference between the two algorithms is that, while for the MF sampler the quality of the simulation does not really change over time, the MIS samples from a distribution that gets closer to the ideal one the more events are evaluated (i.e. for a larger sample size).
The goal of the metropolized independent sampler is to generate a sequence of n-photon events {ei} from a Markov chain that mimics the statistics of an ideal Boson Sampling experiment. Given a sampled event ei, a new candidate event ei+1 is efficiently picked according to the probability distribution of distinguishable photons pD, and accepted with probability

A(ei → ei+1) = min{1, [pI(ei+1) pD(ei)] / [pI(ei) pD(ei+1)]},
where pI(ei) is the output probability corresponding to event ei for indistinguishable photons. While the approach remains computationally hard, since it requires the evaluation of permanents [53, 63], the advantage is that only a limited number of them needs to be evaluated to output a new event, rather than the full distribution as in a brute-force approach. Ultimately, after a certain number of steps in the chain, the sampler is guaranteed to sample close to the ideal Boson Sampling distribution pI [64]. Hence, not only does the sample size play a key role in improving the reliability of validation protocols, as shown in section 3, but it can also be crucial to increase the quality of the outcome of a classical simulation. This is a relevant point to keep in mind, even though the metropolized independent sampler has since been surpassed by an algorithm that is both provably faster and exact [48, 49]. In fact, in the future, novel classical algorithms might be developed [54] that exploit the sample size more efficiently.
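The acceptance rule above is the Metropolis–Hastings step with an independent proposal. The sketch below runs it on a hypothetical two-state problem (made-up target and proposal distributions, no permanents involved) to illustrate the key property exploited by the sampler: the chain converges to the target distribution pI even though candidates are drawn from the easy distribution pD.

```python
import random

def metropolized_independent_sampler(p_target, p_proposal, states, steps, seed=0):
    """Markov chain with independent proposals from p_proposal, accepted
    with probability min(1, [pT(e')/pP(e')] / [pT(e)/pP(e)])."""
    rng = random.Random(seed)

    def propose():
        # draw a state from p_proposal by inverting its cumulative distribution
        r, acc = rng.random(), 0.0
        for s in states:
            acc += p_proposal[s]
            if r < acc:
                return s
        return states[-1]

    e = propose()
    chain = []
    for _ in range(steps):
        cand = propose()
        ratio = (p_target[cand] / p_proposal[cand]) / (p_target[e] / p_proposal[e])
        if rng.random() < ratio:    # accept; otherwise keep the current event
            e = cand
        chain.append(e)
    return chain

# Toy target far from the uniform proposal
target = {"a": 0.9, "b": 0.1}
proposal = {"a": 0.5, "b": 0.5}
chain = metropolized_independent_sampler(target, proposal, ["a", "b"], steps=20000, seed=3)
freq_a = chain[1000:].count("a") / len(chain[1000:])   # burn-in of 1000 discarded
```

After discarding the burn-in, the empirical frequency of 'a' settles near the target value 0.9; in the actual sampler, pD and pI would be the distributions of distinguishable and indistinguishable photons.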
The aim of our present analysis is to investigate the role of the sample size in validating, via the statistical benchmark, the samples generated by the metropolized independent sampler. Indeed, a crucial issue in a hypothetical competition between sampler and validator concerns the number of events available to accept or reject a data set. While larger sets provide deeper information for the validator to identify fingerprints of quantum interference, on the other hand the sampler approaches the target distribution pI as more steps are made along the chain. However, in order to output a large number of events in the available time, the sampler requires physical and computational resources that set a limit to the tractable dimension of the problem. We are then interested in the intermediate regime, the one relevant for experiments, to determine whether convergence is reached fast enough to mislead the validator. In the specific case of the metropolized independent sampler, we then need to look at the scaling in n of its hyper-parameters: burn-in (the number Bn of events to be discarded at the beginning of the chain) and thinning (the number Tn of steps to skip to reduce correlations between successive events). Eventually, the time required to classically simulate Boson Sampling will scale as the number of evaluated permanents times τp, where τp is the time to evaluate a single scattering amplitude according to equation (5). Considering the estimate provided by the supercomputer Tianhe-2 [50], and for a fixed sample size and time, we find a constraint on the affordable number of steps, with a prefactor α ∼ c × 0.8782 × 10^11, where c is the number of processing nodes. If we assume Tn = 100 [47] for all n and all sample sizes, we get an estimate of the maximum Bn allowed by the available resources. The key issue is that this estimate does not guarantee that the sampler achieves the target distribution fast enough, since the allowed Bn decreases (exponentially) in n. Moreover, the minimum required Bn is expected to increase with n, since on average the Markov chain needs to explore more states before picking a good one.
To better clarify the above considerations, we simulate a competition between the metropolized independent sampler and the statistical benchmark for n = 10 photons in m = 100 modes in figure A2. Data for distinguishable and indistinguishable photons were generated with exact algorithms, respectively by Aaronson and Arkhipov [20] and by Clifford and Clifford [48]. The analysis proceeds through five main steps: (1) randomly pick a unitary transformation according to the Haar measure; (2) simulate the generation of the n-particle output events; (3) extract the C-dataset from these events; (4) evaluate the corresponding (NM, CV) point and plot it in figure A2(a); (5) repeat steps 1–4 200 times, to simulate as many different experiments. Upon completion, evaluate the average and variance of these points and plot them in figure A2(b). With this analysis, we get a quantitative intuition of how the confidence of a validation changes with the sample size, as does the quality of the classical simulation. Similar behaviour is found also for other choices of n and m. In particular, we observe how a stronger thinning (up to T10 = 100, as in reference [47]) is reflected in the quality of the simulation, where the sampler behaves very similarly to the ideal Boson sampler for small as well as for large sample sizes. Conversely, a faster sampler that trades quality for speed by computing fewer permanents (T10 = 10, 30) is more easily detected by the validator. Constraints due to a speed vs quality compromise (figures 3(b)–(d)) define a generic scenario for a classical simulation which is run with a specific choice of Bn and Tn.
Appendix B: Combining and boosting validation protocols
So far, validation protocols have always been applied separately and independently. Certainly, this fact shows the multifaceted nature of this line of research, where effective solutions have been developed using very different strategies. Yet, it also reflects its somewhat fragmented condition, since no protocol benefits from the potential insights provided by the others. This limitation becomes relevant in realistic scenarios with noise and finite data sets, since each validation protocol suits some tasks better than others, with different degrees of sample efficiency and resilience.
In this section, we present a novel, synergistic approach to validation, which aims at combining the strengths of these protocols to form a joint, enhanced validator. Specifically, we focus on validation protocols that make use of machine learning, and propose to combine them with a meta-algorithm (AdaBoost [65]) that attempts an adaptive boosting of their individual performance. The output of AdaBoost is a weighted sum of the predictions of these learning algorithms ('weak learners'), which are asked, sequentially, to pay more attention to the instances that were incorrectly classified by the previous learners. As long as the performance of each learner is slightly better than chance, the classifier resulting from AdaBoost provably converges to a better validation protocol.
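A minimal sketch of adaptive boosting in the spirit described above, with hand-rolled decision stumps as weak learners on a hypothetical one-dimensional data set (an interval concept that no single stump can classify perfectly). It only illustrates how re-weighting misclassified instances lets the weighted vote outperform each individual learner, not how the actual validators are combined.

```python
import math

def train_stump(X, y, w):
    """Best threshold stump (feature, threshold, sign) under weights w; labels are +/-1."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if sign * (1 if xi[f] >= thr else -1) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def adaboost(X, y, rounds):
    N = len(X)
    w = [1.0 / N] * N
    ensemble = []
    for _ in range(rounds):
        err, f, thr, sign = train_stump(X, y, w)
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, f, thr, sign))
        # boost the weight of misclassified instances, then renormalize
        w = [wi * math.exp(-alpha * yi * sign * (1 if xi[f] >= thr else -1))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    vote = sum(a * sg * (1 if x[f] >= t else -1) for a, f, t, sg in ensemble)
    return 1 if vote >= 0 else -1

# Interval concept: +1 inside [3, 6], -1 outside; the best single stump gets 70%
X = [[i] for i in range(10)]
y = [1 if 3 <= i <= 6 else -1 for i in range(10)]
acc_stump = sum(predict(adaboost(X, y, 1), x) == t for x, t in zip(X, y)) / len(X)
acc_boost = sum(predict(adaboost(X, y, 8), x) == t for x, t in zip(X, y)) / len(X)
```

Here eight rounds drive the training accuracy from 70% to 100%; reference [65] shows that such an improvement is generic whenever each weak learner performs slightly better than chance.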
We numerically test this approach by combining two validation protocols that employ machine learning: the statistical benchmark [30] [equipped with a simple neural network classifier trained on numerically generated data, as in figures 3(b) and (c)] and the visual assessment [29], which uses dimensionality reduction algorithms and convolutional neural networks. Here we do not consider the Bayesian approach, since, in its current formulation, it does not fit the framework of machine learning. A schematic description of our proof-of-concept analysis, which we carry out for n = 10 and m = 100, is shown in figure A3.
Since the two protocols require different numbers of events to validate ideal, noiseless experiments [20, 48], to perform this test we trained them on data sets with a tunable amount of noise, purposely assembled to be hard to validate. To this end, samples for 500 Haar-random unitary transformations were constructed by sampling with a certain probability p (or 1 − p) from a Boson sampler with fully indistinguishable (or distinguishable) photons. This probability p was then varied in time, to simulate, for instance, a periodic drift in the synchronization of the input photons. As expected with these settings, we find that AdaBoost maintains the original accuracy of each protocol when applied to batches of classifiers that are already highly accurate. This is mainly due to the complexity of these classifiers, which are already strong learners and, hence, hard to enhance by AdaBoost. Analogous results are found with mixed batches of the two classifiers, for which AdaBoost returns a joint classifier that practically focuses on the most accurate one in the set. A different result is obtained, instead, by combining several weak classifiers based on the visual assessment, for which we purposely spoil the training of the convolutional neural network (accuracy A ∼ 51% instead of A ∼ 98%) by reducing the number of training epochs. In this case, AdaBoost does in fact enhance the accuracy up to A ∼ 57%.
In the future, we expect that this approach will prove useful in non-ideal conditions with experimental noise, where validation protocols do not operate in the ideal settings for which they were conceived. Furthermore, the above analyses can show larger boosts if applied to actual experiments that involve structured (non-Haar-random) interferometers, for which the above protocols can have lower accuracies and different behaviors. Finally, still in non-ideal settings, more favorable boosts can be obtained if new validation protocols are developed that are as sample-efficient as the statistical benchmark.