Amortized Simulation-Based Frequentist Inference for Tractable and Intractable Likelihoods

High-fidelity simulators that connect theoretical models with observations are indispensable tools in many sciences. When coupled with machine learning, a simulator makes it possible to infer the parameters of a theoretical model directly from real and simulated observations without explicit use of the likelihood function. This is of particular interest when the latter is intractable. In this work, we introduce a simple extension of the recently proposed likelihood-free frequentist inference (LF2I) approach that has some computational advantages. Like LF2I, this extension yields provably valid confidence sets in parameter inference problems in which a high-fidelity simulator is available. The utility of our algorithm is illustrated by applying it to three pedagogically interesting examples: the first is from cosmology, the second from high-energy physics and astronomy, both with tractable likelihoods, while the third, with an intractable likelihood, is from epidemiology.


Introduction
Simulation-based, or likelihood-free, inference is now ubiquitous in the sciences (see, for example, Refs.[1][2][3]).Recently, Dalmasso et al. introduced likelihood-free frequentist inference, [4] (LF2I) a simulation-based method featuring provable frequentist [5] guarantees.Let D = {X i | i = 1, ..., N } be the set of "observable" data sampled from a simulator F θ and D = {x i | i = 1, ..., N } be the set of observed data.Consider a large, in principle infinite, collection of data-driven statements of the form θ ∈ R(D) with parameter θ a point in the parameter space of a theoretical model, and R(D) is a data-dependent subset of that space.Given a function of the data λ(D; θ), called a test statistic, the LF2I approach constructs data-driven statements that are either true or false with the guarantee that over a large collection of such statements a fraction p ≥ τ of them will be true.Parameter subsets that satisfy this condition are called confidence sets.
The confidence level τ associated with these subsets is the minimum probability to pick a true statement, at random, from the collection.The fraction p, which may vary over the parameter space, is called the coverage probability, or coverage for short.The interesting aspect of LF2I is that the guarantee holds for any data sample size.
Confidence sets and classical hypothesis tests are closely related.A classical hypothesis test [5] is a procedure for deciding between two hypotheses: a null hypothesis H 0 (parameterized by the parameter θ 0 ) and an alternative hypothesis H 1 (parameterized by parameters that differ from θ 0 ).For example, we may wish to perform the following test: The hypothesis test in Eq. ( 1) is equivalent to where Θ 0 ∩ Θ 1 = ∅, and Θ 0 could be a single parameter point (which defines a simple hypothesis or more than one point (thereby defining a composite hypothesis).
A good test statistic is one that provides a clear distinction between the hypotheses we wish to test.When the two hypotheses are simple the likelihood ratio test statistic is known to be optimal [6], while experience indicates that the likelihood ratio, or a function thereof, works well even when one or both hypotheses are composite.Depending on the value of the test statistic, evaluated at a set of observed data D = D, the null hypothesis will either be rejected or will fail to be rejected.If the null hypothesis is rejected, we may choose to accept the alternative.
Assuming that large values of λ = λ(D; θ) cast doubt on the null hypothesis θ = θ 0 , the latter is rejected if the p-value, P(λ > λ 0 |θ), is less than a given threshold α, called the size or significance level of the test.[7] For example, if the test statistic is a χ 2 ‡ α is also the threshold at which one is willing to commit a Type I error, the probability of erroneously rejecting the null hypothesis when it is in fact true function, large values of χ 2 disfavor the hypothesis θ = θ 0 .If the parameter point θ 0 is not rejected it is added to the confidence set R(D).A confidence set is, therefore, the set of parameter points that have not been rejected at level α.The key idea of LF2I is to approximate the p-value with a neural network in such a way that the resulting confidence sets satisfy p ≥ τ or, more realistically, p ≈ τ .
The approximated p-value function in the LF2I approach is computed for a specific data set D, therefore, it cannot be used for other similar data sets with the same sample size.Consequently, it cannot be used to check the coverage explicitly at a given parameter point.The LF2I approach provides an algorithm to train another neural network to approximate the coverage probability over the parameter space of the theoretical model.Unfortunately, a reliable way to quantify the accuracy of the approximated coverage probability function is not available as is true of the neural network approximation of the p-value.The saving grace, however, is that the quality of the confidence sets can be indirectly assessed by directly computing the coverage at any given point.If the coverage probabilities within the neighborhood of the estimated parameters agree with the desired confidence level τ to, say, 10% then one may conclude that the confidence sets, R(D), are satisfactory.The motivation for the extension introduced in this paper is the desire to have the p-value do double duty: 1) determine the confidence sets and 2) permit the explicit checking of the coverage using the same neural network.
In practice, we choose to approximate the cumulative distribution function (cdf), P(λ ≤ λ 0 |θ), with a neural network.Our key idea is to make the latter a function of both the parameter point θ and the "observed" test statistic λ 0 .This simple extension makes it possible to apply the approximated cdf to any data set that is similar to the observed data D and of the same sample size.Moreover, the cost of approximating the cdf is amortized over its subsequent use in constructing confidence sets and in explicitly checking the coverage of these sets.To distinguish the modified LF2I from the original, we refer to the former as amortized likelihood-free frequentist inference (ALFFI).The ALFFI approach is illustrated in three pedagogical examples: the first is from cosmology, the second from high-energy physics and astronomy, and the third from epidemiology.The first two examples feature likelihoods that are tractable, while for the third example the real power of LF2I and ALFFI is illustrated with a problem in which the likelihood is intractable.The paper is organized as follows.In Sec. 2, we describe the ALFFI approach.This is followed, in Sec. 3, with the three examples.Section 3.1 uses ALFFI to infer the parameters of a simple cosmological model that is fitted to Type 1a supernova data, while Sec.3.2 applies ALFFI to the prototypical signal/background problem in highenergy physics, which in astronomy is known as the On/Off problem.[8] For both of these problems, the likelihood is tractable.Section 3.3 illustrates the application of ‡ Meaning that even though the neural network is trained for a specific kind of data set D, it can be used both for the observed data set and other similar data sets with the same sample size, without the need to re-train the network.
ALFFI to a well-known epidemiological model, which, though simple, has an intractable likelihood.The paper ends with a brief discussion in Sec. 4 and our conclusions in Sec. 5.

From cumulative distribution function to confidence sets
A classical hypothesis test is designed to reject hypotheses.Consider the α-level hypothesis θ = θ 0 , as in Eq. ( 1).This hypothesis is to be rejected if P(λ λ 0 |θ 0 ) < α.The corollary is that θ 0 is not to be rejected if P(λ > λ 0 |θ 0 ) ≥ α, that is, if The set of points θ 0 that have not been rejected at level α, and therefore remain as potentially viable hypotheses for the true value of θ, is by definition a confidence set R(D) at 100τ % confidence level (CL).The boundary of the confidence set R(D) is determined by the equation By construction, the coverage probability of such sets is Note that R(D) is a random set.Under repeated observations the relative frequency with which these sets include the true value of θ (ideally) never falls below the stated confidence level τ = 1 − α whatever the true value of θ.This is a classic example of the frequentist principle, whereby in an infinite ensemble of statements, which need not be of the same kind but which are constructed using the same protocol, there is an a priori guarantee that nowhere will the coverage probability fall below the stated confidence level.To the degree that the probability P(λ ≤ λ 0 |θ) is accurately modeled, the LF2I and ALFFI approaches yield random sets that satisfy the frequentist principle.

Data Preparation for ALFFI
The LF2I and ALFFI approaches are applicable to any test statistic λ(D; θ) that is monotonic in the following sense.The test statistic is constructed so that a particular direction in its 1-dimensional space corresponds to hypotheses that are increasingly disfavored.In ALFFI, we consider test statistics for which large values correspond to disfavored hypotheses, or equivalently, small values of the p-value = P(λ > λ 0 |θ = θ 0 ), where λ 0 ≡ λ(D = D; θ 0 ) is the observed value of the test statistic for the specified hypothesis.ALFFI (see Sec. 5) approximates the probability [9] for any data set D sampled from a statistical model p(D|θ), which may or may not be tractable.However, as in the LF2I approach, [4] it is assumed that one has access to a large collection S of simulated pairs (D, θ) ∈ S where for every point θ sampled from any convenient prior π θ a single instance of a data set D is simulated with the same characteristics, including sample size, as the real data D. In contrast to LF2I, in ALFFI a second collection of data sets S ′ ∋ D ′ is created from S by randomly shuffling its data sets so that they are de-correlated with respect to the parameter points in S. The data sets in S ′ serve as instances of "observed" data sets.
For every parameter θ in S, we compute two values of the test statistic, namely, λ(D, θ) and λ 0 = λ(D ′ , θ), as well as the discrete variable Z which is unity if λ(D, θ) ≤ λ(D ′ , θ) and zero otherwise.This procedure results in a large collection of triplets T = {(Z, λ 0 , θ)}, which constitute the training data.In LF2I, the observed test statistic-the second component of the triplet-is computed using a fixed data set D ′ = D, namely, the one actually observed, while ALFFI uses the data sets D ′ ∈ S ′ .

Approximating the cumulative distribution function
The cumulative distribution function P(λ ≤ λ 0 |θ) is the expectation value E(Z|λ 0 , θ) of the discrete variable Z, a fact that suggests a straightforward way to approximate the cdf for a fixed data set D: histogram the parameter points θ, thereby yielding the histogram H 1 , and using the same bins as H 1 histogram the parameter points again, but this time weighted by Z, yielding the histogram H Z .The ratio H Z /H 1 approximates Following LF2I, a more convenient, and one hopes better, approximation of E(Z|λ 0 , θ) is created with a deep neural network (DNN) trained (that is, fitted) by minimizing the empirical risk where L(t, f ) is a loss function, f (x; ω) is a DNN with inputs x and free parameters ω, and t denotes known targets.The empirical risk or average loss, Eq. ( 6), is a Monte Carlo approximation of the risk functional where p(x, t) = p(t|x) p(x) is the (typically unknown) probability distribution of the data x, t.From the calculus of variations, the function f that minimizes Eq. ( 7) is the solution of assuming that p(x) > 0 ∀ x.In the examples below, and following LF2I, we use the quadratic loss L(t, f ) = (t − f ) 2 , which, using Eq. ( 8), leads to the well-known result [10,11] f where ω * are the best-fit parameters of the neural network model.Setting x = λ 0 , θ and the targets t = Z in Eq. ( 9) yields that is, it yields the quantity that we wish to approximate.One of the key virtues of the LF2I and ALFFI approaches, as is evident in the result in Eq. (10), is that the neural network f is conditioned on θ, which means that it is independent of the prior π θ .The form of the prior affects only the accuracy of the approximation.The accuracy of the approximation will be greatest where the density of the prior is greatest.

Results
The ALFFI approach is illustrated in the following three diverse examples: the first is from cosmology, the second from high-energy physics and astronomy, and the third from epidemiology.Our choice of the particular problem in each field highlights typical statistical inference problems in each field.We demonstrate that the ALFFI method yields valid multi-parameter confidence sets for all three examples in a computationally efficient manner, both in cases where the likelihood is tractable (the first two examples), and where the likelihood is intractable (the third example).We also demonstrate the use of different test statistics, demonstrating the compatibility with binned and un-binned analyses.Table 1 summarizes key attributes of each example.
Table 1.Cosmological Model : x i ± σ i and z i are the measured distance moduli and redshifts, respectively, while H 0 and n are the model parameters.Signal/Background : N and M are the observed counts for signal and background, respectively, while n and m are the expected signal and background counts, respectively.µ and ν are the unknown signal mean (parameter of interest) and unknown background mean (nuisance parameter), respectively.SIR Model : {x i } are the 13 observed counts of infected children on the 13 days of observation, and α and β are the model parameters.CTMC is a continuous time Markov chain model of the epidemic.

Example 1: Cosmological Model
In the late 1990s, fits of cosmological models to Type 1a supernova data led to the conclusion that the expansion of the universe is accelerating.[12,13] The fits then, as now, were performed using tractable likelihoods, typically a multivariate normal.In this example, we fit a cosmological model to the Union 2.1 data compilation of the Supernova Cosmology Project [14] via maximum likelihood and also with ALFFI.The Union 2.1 data set comprises measured distance moduli, x ± σ, and redshifts, z, for 580 Type 1a supernovae.Given the size of the data sample, it is expected that accurate confidence sets for the cosmological parameters can be constructed using standard methods such as maximum likelihood.ALFFI is therefore not needed for this problem; it is simply used to showcase the algorithm.
Our cosmological model is defined by the equation of state where n and b are free parameters, and a(t), Ω(a), and P are the dimensionless universal scale factor, the dimensionless energy density, and the dimensionless pressure, respectively, and t is the time since the Big Bang.For n > 1 and a ≪ 1, the equation of state is that of a pressureless dust of particles as in the ΛCDM model.[15] However, at later times the energy density becomes dominated by so-called phantom energy.[16] Our model is consistent with the Friedmann-Lemaître-Robertson-Walker (FLRW) metric with zero curvature, and the Friedmann equations, [15] 1 a da dt and a dΩ da are assumed to hold, where H 0 is the Hubble constant.The first Friedmann equation, Eq. ( 12), incorporates the convention a(t 0 ) = 1 at the present epoch t 0 .By definition, the Hubble constant is the present value of the Hubble parameter H(a) = a −1 da/dt, which implies Ω(1) = 1 for all cosmological models.Combining Eqs.( 11) and ( 13), integrating the latter, and imposing the constraint for the dimensionless energy density.This model is defined by the three parameters n, b, and H 0 , but we reduce it to a 2-parameter model by choosing b = n/3.From the first Friedmann equation, Eq. ( 12), the energy density Ω(a) and the dimensionless time H 0 t are related as follows where γ(a, x) = x 0 t a−1 e −t dt is the lower incomplete gamma function.Setting a = 1 yields the dimensionless age of the universe If the small correlations between the 580 Typa 1a data points are neglected, the likelihood for these data is a diagonal multivariate normal.Maximizing this likelihood with respect to its parameters is equivalent to minimizing the function where µ(z, θ), the distance modulus [15]-an astronomical measure of distance, is given by The quantity c is the speed of light in vacuum in km/s, Ω K is the curvature parameter, and When the model is fitted to the Union 2.1 data set by minimizing Eq. ( 17) an excellent fit is obtained, as shown in Fig. 1.For t < t 0 , the time dependence of the scale factor a(t) is similar to that of the standard ΛCDM model.But the model exhibits a future singularity (a Big Rip) characterized by the condition a → ∞ at a finite time given by that is, at We now apply ALFFI to the same problem using the test statistic where χ 2 is defined by the Eqs.( 17), (19), and (19).The form of the test statistic is chosen so that it is O(1).by Eq. ( 3), which requires a good approximation to the cumulative distribution function P(λ ≤ λ 0 |θ).The latter is approximated in two ways: with histograms and with a deep neural network (DNN) as described in Sec.2.3.The 2D histograms have 10 bins in both dimensions n and H 0 with the parameter n scaled down by a factor 10 and H 0 scaled down by a factor 100. Since the same simulated Type 1a data are used for both the histogram and the DNN-based approximations, the parameters are scaled so that all inputs to the DNN are O(1), which is considered good practice.
The DNN is 1,781-parameter fully-connected feed-forward neural network, with 3 input features, x = {λ 0 , n, H 0 }, 5 hidden layers with 20 nodes each, and a single output.The elements of the matrices b i and w i are the free parameters of the DNN, and the non-linear function, h, at each hidden node is a ReLU applied element-wise, that is, to every element of its matrix input.The output node is a sigmoid that constrains the output to lie within the unit interval as befits a probability.The DNN is trained as described in Sec. 2 using PyTorch [17].
In practice, at each gradient descent step in the "landscape" defined by the empirical risk (that is, average loss), Eq. ( 6), the gradient is computed from a batch of data of size K = 50 ≪ N randomly sampled from N = 250, 000 sets of 580 simulated distance moduli.The steps are computed using the Adam [18] optimizer with a fixed learning rate of 10 −3 .As the training proceeds, the network with the smallest average loss is saved.The average loss is computed using a validation data set of size 5,000 that is not used by the optimizer.The training stops if after 50,000 steps no network is found with a smaller validation loss than the saved network.The best network as defined by this training protocol is found after about 100,000 iterations.If one defines an epoch to be N/K iterations, which for this example is 5,000, then 100, 000 iterations corresponds to 20 epochs of training.The simulated sets of 580 distance moduli are sampled from 580 independent normal distributions using the standard deviations taken from the observed data.
Approximating P(λ ≤ λ 0 |θ) using histograms and using ALFFI leads to the confidence sets shown in Fig. 2  θ = {n, H 0 } are taken to be the location of the minimum of P(λ ≤ λ 0 |θ), which as indicated in Fig. 2 agrees with the values obtained from the likelihood fit.
In the LF2I approach, the coverage over the parameter space is checked by modeling the coverage probability with another neural network as a function of the parameters of the theoretical model.There are pros and cons to that approach.It is certainly convenient to have a functional approximation of the coverage probability as a function of the parameter space point because one can then estimate the coverage at any given point, not only at points for which there are sufficient simulated data.Unfortunately, however, as is true of most machine learning models, a reliable estimate of the accuracy of the trained machine learning model is not available.In ALFFI, the coverage is checked explicitly by direct enumeration at all points for which there are sufficient data.Since the problem consists of counting how often a particular statement is true, the problem is binomial; therefore, a reliable estimate of the accuracy of the coverage calculation is easy to compute.On the other hand, the coverage is available only at the parameter points for which there are sufficient simulated data.In this example, for a given parameter point, T = 4, 000 sets of 580 Type1a supernovae data are simulated.For each data set, a test statistic, λ 0 , is computed.If P(λ ≤ λ 0 |θ) ≤ τ then, by definition, θ lies within the confidence set associated with λ 0 .If S is the number of times this statement is true over the collection of T simulated data sets, then the coverage probability is p± p(1 − p)/T , where p = S/T .If the confidence sets produced by ALFFI are accurate then we should find that p ≈ τ .This calculation is performed at 500 randomly selected points within the 95% CL set associated with the observed Type 1a data, as shown in the left panel of Fig. 3.The right panel shows that the coverage of the confidence sets computed using ALFFI do indeed have correct coverage and are, therefore, in that sense accurate.

Example 2: Signal/Background or ON/OFF Model
The cosmological example served to illustrate the ALFFI algorithm, but, as noted, ALFFI is not really needed because the cosmological data are numerous, the likelihood function is tractable, and parameter estimation via maximum likelihood works well.Our second example addresses the signal/background problem in high-energy physics (see, for example, Ref. [19]), which in astronomy is referred to as the On/Off problem.[8] We choose an example in which the likelihood is tractable but the data are sparse and, consequently, asymptotic methods may not be reliable.[7] We demonstrate that the ALFFI method, which is based on the Neyman construction, yields valid results even when the regularity conditions that underpin Wilks' theorem [20] and its variants [21] are violated.
The signal/background problem in high-energy physics and astronomy is as follows.An observation is made, for given period of time, which consists of counting N events: typically, photons in astronomy and particle collisions in high-energy physics.The count is potentially a sum of counts from signal and background sources.A second independent observation is made for the same duration (in the simplest case) where by design the background has the same characteristics as in the first observation but no signal is present.The second observation yields a count M .It is generally assumed that the likelihood function for the data D = {N, M } is the product of two Poisson distributions, where µ and ν are the mean signal and background counts, respectively.In the signal/background problem the parameter of interest is µ, while ν is a nuisance parameter.We shall comment on how one might deal with such parameters in the discussion.
Our specific example is from the first experiment to search for neutron-antineutron oscillations using free neutrons, [22] which took place in the 1980s at the Institut Laue-Langevin (ILL) in Grenoble, France.Neutron-antineutron (nn) oscillations are predicted by many proposed theories of physics beyond the Standard Model of particle physics.[23] For our purposes it suffices to note that if nn oscillations can occur then a pure neutron state, when observed at time t ≪ than the mean neutron lifetime, will be observed with probability as an antineutron state, where 2∆E is the difference in neutron and antineutron energies in external fields and ϵ = τ −1 nn is the energy characteristic of whatever new physics is responsible for the oscillations; τ nn is referred to as the oscillation time.We use units in which ℏ = 1.The experimental conditions are such that ∆E ≫ ϵ, which justifies the approximation in Eq. (25).Furthermore, in the Grenoble experiment the quasi-free condition ∆Et ≪ 1 could be realized, leading to a transition probability, independent of the energy perturbation ∆E arising from the neutron and antineutron interactions with the ambient magnetic field.In the quasi-free condition N = 3 events were recorded in this experiment.The background in the Grenoble experiment was directly measured by applying a magnetic field to suppress the transition probability P t by making ∆E large enough.This condition yielded M = 7 events.The maximum likelihood estimate of the signal is μ = N − M = −4 events.However, since µ ≥ 0, we choose to take the best estimate of the signal in such experiments to be The sparsity of the data in the Grenoble experiment and our choice of signal estimate explicitly violate two of the regularity conditions for standard asymptotic results to hold [7]: the data should be sufficiently numerous and estimates must not lie on the boundary of the parameter space.The violation of these conditions, however, is not a problem for LF2I and ALFFI.
To construct confidence sets in the parameter space of θ = {µ, ν}, we use the test statistic where μ is given by Eq. ( 27) and ν(µ) by The cdf, P(λ ≤ λ 0 |θ), was again approximated with a fully-connected feed-forward DNN with 3 input features x = {λ 0 , µ, ν}, 6 hidden layers with 12 nodes each, and a single output, estimating E [Z | λ 0 , µ, ν].The activation function at each hidden node is a PReLU, [24] and the network was trained with the Adam optimization algorithm with a fixed learning rate of 6×10 −4 .The training set is composed of 10 7 examples, which were used in batches of size 5 × 10 3 , for the duration of 10 5 iterations, that is, for 50 epochs.A batch normalization [25] layer was added after every hidden layer as a regularization technique.
The DNN was used to compute the confidence sets shown in Fig. 4 and the associated coverage probabilities shown in Fig. 5.
The fact that the coverage probabilities are within about 10% of the confidence levels confirms the accuracy of the confidence sets obtained with ALFFI.

Example 3: SIR Model
In the cosmology and signal/background examples, the likelihood functions are tractable.In our third example, from the field of epidemiology, the likelihood is  intractable.[26] Therefore, this example is one for which LF2I and ALFFI are the most useful.
In this example, we fit the well-studied Susceptible-Infected-Recovered (SIR) epidemiological model to a classic data set, reproduced in Fig. 6, from a flu outbreak at an English Boarding School.[27] The SIR model has a nearly 100-year history, beginning with the work of Kermack and McKendrick in 1927.[28] The simplest version of the model comprises coupled ordinary differential equations (ODEs) and assumes a population of individuals that is closed and well-mixed in which individuals fall into one of three classes or compartments: susceptible (S), infectious (I), and recovered or removed (R).The rate of change of susceptible individuals depends on the rate at which new infections arise as a result of contact between infectious and susceptible individuals.It is typically assumed that the contact rate (number of contacts per unit time) is proportional to the total population size N or constant.The assumption that contacts are proportional to the total population size leads to a mass action term βSI for the number of new infections per unit time (called the incidence in epidemiology).Infectious (I) individuals can leave the infectious class through recovery (or death) and progress to the the Recovered (or Removed) class, R. It is often assumed that the number of recoveries/removals per unit time is linear in I: αI.Under this assumption, the time spent in I is exponentially distributed with mean 1/α.The resulting system of ODEs is The qualitative dynamics of this model are governed by an epidemiological quantity called the basic reproduction number, R 0 , which for this SIR model is R 0 = β/α.If R 0 > 1, the model exhibits a single outbreak, with I increasing to a peak, then approaching zero as time tends to infinity.Under this scenario, S will decay and approach a positive, constant value, meaning that the epidemic does not infect the whole population.If R 0 ≤ 1, I will decay and approach zero as time tends to infinity.This model has played an invaluable role in understanding infectious disease dynamics, informing control strategies, and serving as a building block for more complex compartmental epidemiological modeling frameworks.
The SIR model is often fitted to data by minimizing a weighted least squares function, where the data D = {x 1 , • • • , x N } are the number of infected individuals at the corresponding reporting times t 1 , • • • , t N , I n = I(t n , θ) is the predicted mean infection count at time t n , found by solving Eqs.(30) for the given θ, and w n are weights.Following example 1, we choose a test statistic that is transformed and scaled so that the statistic is O(1).The weights are set to w n = I −1 n , which we hasten to add should not be taken to imply that the counts x n are Poisson distributed.On the contrary, the counts x n are correlated and their fluctuations are super-Poissonian.
We follow a similar training protocol as for example 1, except that the training sample size for this example is 750,000, a fully-connected DNN is used with 6 hidden layers and 25 nodes each, and the best model is found after about 350,000 iterations, that is, after 23 epochs.The confidence sets obtained are shown in Fig. 7 and the coverage probabilities are shown in Fig. 8. Again, we find that ALFFI does an excellent job of creating accurate confidence sets.

Discussion
In the cosmological and epidemiological models all parameters are of interest.But in the signal/background-On/Off problem one is typically interested only in the mean signal  µ.The mean background, ν, is a nuisance parameter.Unfortunately, constructing confidence intervals for µ when there are nuisance parameters and when the data are sparse is challenging, though approximate methods exist (see, for example, Ref. [21]) including in LF2I [4].Given the success of LF2I and ALFFI in producing reasonably accurate confidence sets, it is of interest to explore whether the reasoning that underlies these approaches can be extended to an algorithm that can yield provably valid confidence intervals for individual parameters, ideally constructed from the associated multi-parameter confidence set.
One possible approach to construct confidence intervals from the confidence sets created with LF2I and ALFFI is to mimic the way that confidence intervals can be constructed from confidence ellipsoids for a multivariate normal density.In the two dimensional example, the objective would be to map a 2-dimensional confidence set to a one-dimensional confidence interval, as is done in the bivariate normal density.For example, in the signal/background problem, one can construct an interval for µ as follows: I(D) = [min({µ i }), max({µ i })], where {(µ i , ν i )} are points from the associated confidence set.It should be possible to use an algorithm like ALFFI to map from the confidence level of the set to that of the associated interval for the parameter of interest.If this can be done, then one would be able to determine what confidence level is needed for the confidence set to obtain the desired confidence level for the associated interval.To the best of our knowledge, if such a mapping could be devised in the general case it would constitute the first method in which valid multidimensional confidence sets can be mapped to valid one-dimensional confidence intervals without the need for explicit knowledge of the underlying statistical model.Ideas along these lines are under investigation.

Conclusions
The LF2I and ALFFI approaches are useful when asymptotic results [7,21] may not be applicable, and are particularly useful when the likelihood function is intractable.The ALFFI approach introduced in this paper, which follows LF2I, approximates the cumulative distribution function of a test statistic with a neural network in a way that permits a direct check, with the same neural network, of the coverage probability of confidence sets at any point in the parameter space of a theoretical model at which sufficient simulated data are available.This provides an a posteriori assessment of the accuracy of the confidence sets constructed with ALFFI.In addition, by directly binning the point cloud of theoretical model parameters (presumably, in a more sophisticated way than is done in this paper) it is possible to check directly the quality of the neural network approximation.Draw parameter θ i ∼ π θ 4: Compute test statistic λ i ← λ(D i , θ i )

Figure 3 .
Figure 3. Example 1 (cosmology): (left) 500 points, from the 95% CL confidence set for the Type 1a data, at which the coverage has been computed.(right) the dots are the coverage probabilities for each of the 500 points in the cosmological model n, H 0 parameter space and fhe horizontal lines are the values of the confidence levels τ .

Figure 5 .
Figure 5. Example 2 (Signal/Background): (left) 500 points, from 95% CL confidence set for the Grenoble data,[22] at which the coverage has been computed.(right) the coverage probabilities for each of the 500 points in the signal/background {µ, ν} parameter space.

Figure 6 .
Figure 6.Example 3 (SIR): English Boarding School data.The number of infected individuals at the reported times in days from the start of the flu outbreak.

Figure 7 .
Figure 7. Example 3 (SIR): Confidence sets R(D) for τ = 0.68, 0.80, 0.90, and 0.95.(dashed lines) Boundaries of confidence sets computed with the histogram-based approximation of the cdf.(solid lines) Boundaries of the confidence sets computed with the DNN-based approximation of the cdf.(black dot) Location of the minimum of the DNN-based cdf, which is taken to be the best-fit point.

Figure 8 .
Figure 8. Example 3 (SIR): (left) 500 points, from the 95% CL confidence set for the Boarding School data, at which the coverage has been computed in the SIR α, β parameter space.(right) the coverage probabilities for each of the 500 points in the SIR parameter space.
. The best-fit values of the cosmological parameters