Parametric level-set inverse problems with stochastic background estimation

We study parametric shape reconstruction inverse problems in which the object of interest is embedded in a heterogeneous background medium that is known only approximately. We model the background medium as a Gaussian random field and pose shape reconstruction as a stochastic programming problem in which we seek to minimize the expected value, with respect to the background field, of a stochastic objective function. We develop a computationally efficient algorithm based on the sample average approximation that reduces the effect of uncertainty in the background medium on shape recovery. We demonstrate that by using accelerated stochastic gradient descent, we can apply our method to large-scale problems. The capabilities of our method are demonstrated on a simple two-dimensional model problem and in a more demanding application to a three-dimensional inverse conductivity problem in geophysical imaging.

Keywords: geophysical inversion, sample average approximation, shape reconstruction

Introduction
In many cases, the goal of an inverse problem is to estimate the physical properties and geometry of relatively simple objects that can be described by small numbers of parameters. Unfortunately, these simple objects are often embedded in heterogeneous background media that are not of interest. To solve such problems it is often necessary to estimate a potentially large number of parameters that define the background model. This background model estimation can be viewed as nuisance parameter estimation. The quality of the nuisance parameter estimates may then adversely affect the recovered estimates of the parameters of interest.
In this paper, we take the view of parametric level-set methods in order to discretize the simple object and estimate its properties. These methods fall under the general setting of inverse shape reconstruction methods, which seek to estimate the position and shape of an object embedded in some background medium from indirect measurements such as geophysical or medical imaging data. Level set methods have been a popular approach to shape reconstruction inverse problems. The original level set method was introduced by Osher and Sethian (1988) to model front propagation. See Burger and Osher (2005) for a survey of early work on the subject and, e.g. Iglesias et al (2016), Prieto and Dorn (2017) for examples of more recent work, with applications to inverse problems. In inverse level set methods, the boundary of the object of interest is represented by the zero level-set of some function. The level-set function may be represented and manipulated directly, in which case it must be represented cell-wise over the computational domain (Burger and Osher 2005). This approach is quite flexible but creates a high-dimensional parameter space to optimize over. This can lead to the inverse problem becoming highly ill-posed, requiring methods such as regularization of the level-set function to address the ill-posedness (Doel and Ascher 2006).
Alternatively, the level set function can be represented parametrically, using some basis function expansion or geometric shape to represent the object of interest. Such low-dimensional parametrizations implicitly regularize the level set function. Aghasi et al (2011) introduced the use of radial basis functions for parametric level set reconstruction of general 2D shapes. Their approach has been applied to electrical impedance tomography (Liu et al 2018) and seismic full-waveform inversion (Kadu et al 2017b). In our approach, we represent the object(s) of interest as a combination of one or more simple geometric objects. We use an ellipse in our first example problem and rectangular prisms in our second. Our approach is based on the method of McMillan et al (2014), who used Gaussian ellipsoids to represent high-contrast subsurface bodies in electromagnetic geophysical inversions.
Despite their simplicity, geometric bodies such as ellipsoids and prisms can be useful models for objects of real practical interest in geophysical imaging, the application area by which we are primarily motivated. For example, McMillan et al (2014, 2016) use real field geophysical data to show how single and multiple ellipsoid models can be used to image geological structures such as massive sulphide deposits and banded iron formations. Day-Lewis et al (2004) use multiple rectangular prisms to model oil seepage in physical scale model experiments. Hoversten et al (2017) use rectangular prisms to model the extent of fluid injection in hydraulic fracturing of oil reservoirs. In some of these examples useful results are obtained using only a single geometric body, and in other cases, multiple bodies are used. Using multiple geometric bodies allows for the recovery of more complex shapes. We choose to work with these geometric bodies for this combination of simplicity and practical utility. Although we have not tested it, our method could in principle be easily applied to other level set parametrizations such as radial basis functions.
In geophysical imaging, the background medium in which the object of interest is hosted is usually heterogeneous, and for an accurate reconstruction of the object, good prior knowledge of the background is necessary. Typically the object of interest exhibits a high contrast with the background, but its response cannot be completely decoupled from the background response. Therefore, incorrect background estimation will introduce error into the estimate of the target object. Kadu et al (2017a) used a bi-level optimization approach to simultaneously estimate a parametric shape and heterogeneous background, but the bi-level optimization problem is difficult to solve and it is not clear whether the background model can always be accurately reconstructed by this approach. A small amount of work has been done on the quantification of uncertainties in shape reconstruction (see e.g. Cardiff and Kitanidis 2009, Iglesias et al 2016), but to our knowledge, no work has specifically examined or tried to mitigate the effect of uncertainty in background estimation.
There are important practical applications in which it is reasonable to assume that one might have some reliable but imperfect background information based on data other than that used in the shape reconstruction problem. An example from geophysical imaging is the time-lapse inversion of hydraulic fractures in the earth's subsurface. The physical properties of the background rocks before fracturing may be imaged by electromagnetic or seismic geophysical surveying. Then, after the injection of conductive fluid into the earth during the hydraulic fracturing process, it is possible to estimate the volume of rock effectively fractured by electromagnetic imaging, modeling the fractured zone as a thin conductive rectangular prism embedded in the previously estimated heterogeneous background medium (Hoversten et al 2017). In both these examples, there will be uncertainty in the estimate of the background used in the shape reconstruction.
In shape reconstruction with uncertain background information, the parameters describing the background model are effectively nuisance parameters. They are not of direct interest, but the accuracy of their estimation affects the accuracy of the shape reconstruction. To our knowledge, there has been very little work done in general on estimating high-dimensional nuisance parameters in inverse problems. The one notable example of which we are aware, which is very different in character from the present contribution, is the work of Aravkin and Van Leeuwen (2012). They use variable projection to eliminate nuisance parameters from regularized least-squares inverse problems. Their technique applies only when the number of nuisance parameters is very small relative to the number of parameters of interest. We are interested in the opposite regime, where the object of interest is described by a small number of parameters, and the background is a cell-by-cell discretization of a continuous function onto some sort of computational mesh, making the number of nuisance parameters very large.
In this paper, we present a stochastic optimization based method for improving the quality of shape recovery in parametric level-set shape reconstruction problems in the presence of uncertainty in the background parameters. In section 2, we will describe the method. In section 3, we present a simple linear inverse problem that we use as a first application. This example illustrates the properties of our algorithm and examines the relative strengths and weaknesses of the two optimization techniques we propose for its solution. Section 4 discusses an application to a non-linear three-dimensional (3D) inverse problem, before concluding and summarizing the paper.

Method
In general, we are concerned with the inverse problem of estimating the location, shape, and material properties of a homogeneous object embedded in a heterogeneous background medium in two-dimensional (2D) or 3D space from a noisy set of indirect observations d ∈ R^{n_d}, where the number of data n_d is a natural number that can vary quite widely depending on the problem. We assume that the data are sensitive to some physical property θ of the material being probed. The data are related to θ through the forward problem

d = F(θ) + ε, (1)

where ε is additive noise, and F is the forward modeling operator, which is typically a functional of the solution of a partial differential equation (PDE).
In level-set shape reconstruction methods, the boundary of the object of interest is represented by the zero level-set of a function τ(x), where x represents position in 2D or 3D space. The overall physical property model can then be written

θ(x) = θ_1 H(τ(x)) + θ_0(x) (1 − H(τ(x))), (2)

where H is the Heaviside step function, θ_1 is the value of θ inside the object of interest, and θ_0 is the heterogeneous background physical property model. In traditional level-set methods the level-set function is discretized onto a grid, and the inverse problem attempts to reconstruct it directly.
In parametric level-set methods, the level-set function is represented by a low-dimensional parametrization. For example, in the application problem in section 3 of this paper, we model the object of interest as an ellipse, which can be described by its centre position, radii, and angle of orientation. The level-set function for this model can be written

τ(x) = 1 − (x − x_0)^T M (x − x_0), (3)

where x_0 is the location of the centre of the ellipse and M = R^T S R is a rotation and scaling matrix. M depends on the radii of the ellipse and its angle of rotation φ about the horizontal axis of the mesh. Explicitly, we have

R = [cos φ, −sin φ; sin φ, cos φ], S = diag(1/r_1^2, 1/r_2^2), (4)

where r_1 and r_2 are the radii of the ellipse. The goal of the parametric inverse problem is to estimate the shape and location parameters that describe the object of interest as well as the value of the diagnostic physical property inside the object. It should be noted here that, in practice, in both traditional and parametric level-set methods, the Heaviside function in (2) is replaced by a smooth approximation in order to facilitate the optimization. This will be discussed further in the application sections.
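As a concrete illustration, the ellipse parametrization described above can be sketched in a few lines of Python (a hedged sketch, not the authors' Julia implementation; the function name and the sign convention, with τ positive inside the ellipse and zero on its boundary, are our assumptions):

```python
import numpy as np

def ellipse_level_set(x, x0, r1, r2, phi):
    """Parametric level-set value for an ellipse with centre x0, radii
    r1, r2 and orientation angle phi: positive inside the ellipse, zero
    on the boundary, negative outside (an assumed sign convention)."""
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s], [s, c]])          # rotation by phi
    S = np.diag([1.0 / r1**2, 1.0 / r2**2])  # scaling by inverse squared radii
    M = R.T @ S @ R                          # combined rotation and scaling
    dx = np.asarray(x, float) - np.asarray(x0, float)
    return 1.0 - dx @ M @ dx
```

For instance, with phi = 0 the level set vanishes exactly at a horizontal distance r1 from the centre, which places the zero level-set on the ellipse boundary as intended.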
We formulate the inverse problem as the least-squares optimization problem

min_m Φ(m) = (1/2) ∥W(F(m) − d)∥_2^2 subject to m_l ≤ m ≤ m_h, (5)

in which we seek the parametric model m that minimizes the discrepancy between measured data and simulated data computed by solving the forward problem, subject to bound constraints on the parameter values. ∥·∥_2 represents the Euclidean norm. W in (5) is a diagonal matrix containing the inverse variances of the data. The individual data may vary widely in magnitude and noise level. The weighting matrix W ensures that we fit each datum to an appropriate level. The vectors m_l and m_h represent the lower and upper bounds, respectively, on the parameter values. These bounds may represent essential logical constraints on the parameter values (e.g. ellipse radii must be positive) or prior knowledge about a particular problem (e.g. the inversionist may know that the physical properties of the anomalous body lie within a particular range). We assume that our parametrization of the level-set function is sufficiently simple that it will have a regularizing effect and that explicit regularization will therefore not be required. This is a well-known and well-studied approach; see, e.g. Li and Oldenburg (1996), Santosa (1996), Aghasi et al (2011), McMillan et al (2014). The optimization problem (5) can be solved by standard techniques. We have had success in solving such problems with the projected Gauss-Newton (GN) method, solving for the search direction using the preconditioned conjugate gradient algorithm with a large stopping tolerance.
Of course, perfect knowledge of the background is unlikely. To account for uncertainty in the knowledge of the background, we model it as a random field. Then, since the objective function in (5) is a function of the background model, we treat it as a random variable. This recasts the problem of estimating the model parameters of interest m as a stochastic optimization problem. Rather than minimizing the objective function for a fixed background, we seek the m that minimizes the expected value of the objective function with respect to the stochastic background model.
Let the background physical property model be a random field θ_0(x, ω), where x represents location in some spatial domain D ⊆ R^d. We consider 2D and 3D domains (d = 2, 3). ω is an element of the sample space Ω of a probability space (Ω, F, P), where F and P are, respectively, the event space and probability measure of the space. For fixed ω (i.e. a realization of the random field), θ_0 is a deterministic function of space. We can write the expected value of our now random least-squares objective function as a Lebesgue integral with respect to the probability measure P:

E[Φ(m, θ_0)] = ∫_Ω Φ(m, θ_0(·, ω)) dP(ω). (6)

We assume that P is continuous and that we can efficiently compute realizations of the background. Having defined the expectation, we can define our stochastic shape-reconstruction problem

min_m E[Φ(m, θ_0)] subject to m_l ≤ m ≤ m_h. (7)

Stochastic optimization has been widely used in operations research for performing optimization under uncertainty in the formulation of, or input to, optimization problems; see, e.g. Diwekar (2003), Wallace and Ziemba (2005). To our knowledge, it has not been used for the solution of inverse problems. A version of stochastic optimization has been used to reduce the computational complexity of solving large-scale inverse problems involving many PDEs, e.g. Haber et al (2012), Roosta-Khorasani et al (2014). We are not aware of its use in improving the quality of the solution of an inverse problem.
The high-dimensional integral in the expectation (6) cannot be evaluated analytically for most problems, so it must be approximated. We use the sample average approximation (SAA) to estimate this expectation and form a tractable optimization problem. The SAA approximates the expectation using a sample average (Shapiro et al 2009):

E[Φ(m, θ_0)] ≈ (1/n) Σ_{i=1}^{n} Φ(m, θ_0^{(i)}), (8)

where Φ(m, θ_0^{(i)}) denotes the misfit function evaluated with respect to a particular realization of the background model θ_0^{(i)}. Applying this approximation to (7) gives the optimization problem

min_m (1/n) Σ_{i=1}^{n} Φ(m, θ_0^{(i)}) subject to m_l ≤ m ≤ m_h. (9)

Intuitively, the misfit function is now a sample average of the individual misfits computed with different realizations of the background. The variance in the estimate of the parameters of interest will decrease as the number of background samples n increases. Let m*_n be the minimizer of the SAA problem (9) and m* the minimizer of the true problem. The theory of stochastic programming tells us (Shapiro et al 2009) that m*_n converges to m* almost surely. Under mild regularity conditions on Φ, the convergence is guaranteed to occur at the rate

∥m*_n − m*∥ = O_p(1/√n), (10)

where ∥·∥ is the Euclidean norm. Despite the slow theoretical convergence rate, we have found that, in practice, we can improve results in shape reconstruction problems using a computationally tractable number of samples. Given a set of background samples {θ_0^{(i)}}, (9) may be solved using methods of deterministic optimization. We have used the GN and accelerated mini-batch stochastic gradient descent (SGD) algorithms. In the next section, we will analyze the performance of these two optimization approaches on a simple linear inverse problem. We note that, computationally, each background sample requires separate forward problem, gradient, and possibly Hessian-vector product evaluations at each iteration of the optimization procedure. However, the computations for each background are trivially parallel.
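The SAA objective is simply an average of per-background misfits, which makes its implementation straightforward. A minimal Python sketch (the names forward, d_obs, and W are hypothetical stand-ins for a forward operator, observed data vector, and data weighting matrix; this is not the authors' code):

```python
import numpy as np

def saa_misfit(m, backgrounds, forward, d_obs, W):
    """Sample average approximation of the expected misfit: average the
    weighted least-squares misfit over the given background samples."""
    total = 0.0
    for theta0 in backgrounds:  # each term is independent: trivially parallel
        r = W @ (forward(m, theta0) - d_obs)
        total += 0.5 * (r @ r)
    return total / len(backgrounds)
```

Each loop iteration requires its own forward solve, which is why the cost per optimization iteration grows linearly with the number of background samples unless the terms are computed in parallel.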
Looking ahead, we have found that given sufficient computing resources, the GN approach provides an efficient and robust solution while SGD can still give good results while using limited computational resources.

Computing background samples
Before turning to an analysis of the performance of the SAA approach on example problems, we consider the problem of choosing background model probability distributions and generating sample background models. Ideally the distributions would be determined empirically, based on real examples from past experiments. This is not typically possible in geophysical imaging. As an alternative, we model the background as a Gaussian random field (GRF) with mean and covariance functions that encode our prior knowledge of the background. The stochastic optimization methods used in this paper do not depend on the background distribution being Gaussian. We use GRFs in our examples because of their flexibility and the simplicity of sampling from them.
We use a truncated Karhunen-Loève expansion (KLE) to approximately sample from GRFs. The KLE provides a way to represent a random field via a spectral expansion of its covariance function (Ghanem and Spanos 1991). Consider a GRF m_0(x, ω) with mean m̄_0 and covariance function κ(x, y), where, as above, x ∈ D ⊆ R^d is the spatial variable. Our examples deal with the 2D and 3D cases (d = 2, 3), and ω is an element of the sample space of some probability space. The truncated KLE is

m_0(x, ω) ≈ m̄_0(x) + Σ_{n=1}^{n_λ} √λ_n ξ_n(ω) φ_n(x), (11)

where the λ_n and φ_n are the eigenvalues and eigenfunctions of the covariance function and the ξ_n(ω) are independent standard normal random variables. For covariance function κ(x, y), the eigenvalues and eigenfunctions solve the eigenvalue problem

∫_D κ(x, y) φ_n(y) dy = λ_n φ_n(x). (12)

In practice, except for special covariance functions, this must be solved approximately. We use the Nyström method (Betz et al 2014) to approximate the KLE eigenvalues and eigenfunctions. This method approximates the integral in (12) by a quadrature rule. This leads to an n_λ × n_λ matrix eigenvalue problem that gives the first n_λ eigenvalues and the values of the corresponding eigenfunctions at the quadrature points. The eigenfunctions can then be evaluated at points of interest by interpolation. See Betz et al (2014) for a detailed description. For smaller problems the quadrature points may be evenly distributed over the domain and the eigenfunctions evaluated for each cell in the computational domain. However, for larger 3D problems, distribution of quadrature points over the full domain becomes prohibitively expensive. We address this problem in the discussion of our non-linear 3D application problem in section 4.
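The Nyström construction and KLE sampling described above can be sketched compactly in Python for a one-dimensional domain with a uniform quadrature rule (the function names are ours, and this omits the interpolation step; see Betz et al (2014) for the full method):

```python
import numpy as np

def nystrom_kle(kappa, points, weights, n_terms):
    """Approximate the leading KLE eigenpairs of covariance kappa by the
    Nystrom method: discretize the integral eigenvalue problem with a
    quadrature rule and solve the resulting symmetric matrix problem."""
    K = np.array([[kappa(x, y) for y in points] for x in points])
    sw = np.sqrt(weights)
    A = sw[:, None] * K * sw[None, :]      # symmetrized quadrature matrix
    lam, V = np.linalg.eigh(A)
    idx = np.argsort(lam)[::-1][:n_terms]  # keep the largest eigenvalues
    lam = lam[idx]
    phi = V[:, idx] / sw[:, None]          # eigenfunction values at the nodes
    return lam, phi

def sample_grf(mean, lam, phi, rng):
    """Draw one approximate GRF realization from the truncated KLE."""
    xi = rng.standard_normal(len(lam))     # independent standard normals
    return mean + phi @ (np.sqrt(np.maximum(lam, 0.0)) * xi)
```

Each realization costs only a matrix-vector product once the eigenpairs are computed, which is what makes drawing many background samples cheap.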

Implementation
The software required to perform the numerical experiments in the remainder of this paper was developed in the Julia programming language (Bezanson et al 2017), using the jInv framework (Ruthotto et al 2017). jInv is a set of open-source Julia software packages designed to simplify the process of building software for PDE-constrained parameter estimation problems by providing modular building block routines commonly needed in these problems, including parallelized GN optimization and tools for finite volume PDE discretization. We used these building blocks to develop routines to solve our specific application problems. To perform the Karhunen-Loève basis function computations we used the open-source Julia package GaussianRandomFields (Robbe 2017), modified to our needs.

Application to a simple model problem
We use 2D straight-ray cross-hole seismic tomography as an example problem that is computationally simple enough to permit extensive experimentation. Straight-ray tomography is a linearization of seismic tomography in which seismic waves (pressure waves moving through the earth's subsurface) are assumed to propagate in straight lines from the source of excitation to a receiver point (Bregman et al 1989). Transmitters are placed in a borehole in the earth and receivers are placed in a separate borehole. The experiment then measures the times required for the signal to propagate from each transmitter to each receiver. The speed of wave propagation depends on the material properties of the propagation medium, allowing seismic tomography to be used as an imaging method. See figure 1 for an illustration of a typical survey setup.
The straight-ray tomography forward problem is

t = As, (13)

where the data vector t represents the travel times between each source and receiver pair, s is the inverse velocity, or slowness, of each mesh cell, and each entry a_ij of A gives the distance travelled by the straight ray through cell j along the path connecting transmitter-receiver pair i. We attempt to recover an elliptical anomaly embedded in a heterogeneous background from travel time data. Our synthetic experiment is performed on a unit domain in R^2, with 50 transmitters spaced evenly along the left edge of the domain and 50 receivers spaced evenly along the right edge.
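For illustration, the path-length matrix A can be approximated by sampling points densely along each straight ray and accumulating segment lengths per cell, a simple stand-in for exact cell-intersection formulas (a sketch on the unit square; the function name and the sampling-based approximation are our assumptions, not the authors' implementation):

```python
import numpy as np

def ray_matrix(sources, receivers, n_cells, n_samples=2000):
    """Approximate straight-ray path-length matrix on the unit square:
    entry (i, j) is the length of ray i inside cell j, estimated by
    sampling points densely along the ray."""
    rays = [(s, r) for s in sources for r in receivers]
    A = np.zeros((len(rays), n_cells * n_cells))
    t = (np.arange(n_samples) + 0.5) / n_samples   # midpoint samples on [0, 1]
    for i, (s, r) in enumerate(rays):
        s, r = np.asarray(s, float), np.asarray(r, float)
        seg = np.linalg.norm(r - s) / n_samples    # ray length per sample
        pts = s[None, :] + t[:, None] * (r - s)[None, :]
        ix = np.clip((pts[:, 0] * n_cells).astype(int), 0, n_cells - 1)
        iy = np.clip((pts[:, 1] * n_cells).astype(int), 0, n_cells - 1)
        np.add.at(A, (i, iy * n_cells + ix), seg)  # accumulate lengths per cell
    return A
```

Travel times then follow from (13) as a single matrix-vector product with the cell slownesses.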
The background slowness model is described by a GRF with mean and covariance based on the geological scenario of a horizontally layered earth with seismic wave velocity increasing with depth, a scenario commonly encountered in seismic imaging applications (Okko 1994). The mean slowness model is composed of five homogeneous layers. It is shown in figure 2(a). The covariance function is

κ(x, y) = σ^2 exp(−r(x, y)), (14)

with

r(x, y) = √((x_1 − y_1)^2/l_x^2 + (x_2 − y_2)^2/l_y^2) (15)

for characteristic length scales l_x and l_y. σ is the standard deviation, which is constant, giving the overall scale of fluctuations from the mean background. In this example, we use σ = 1, l_x = 0.3 and l_y = 0.1, encoding our assumption that the slowness should vary more smoothly in the horizontal direction than the vertical. The slowness model is completed by adding a homogeneous ellipse to the background model. The boundary of the ellipse is given by the zero level-set of the parametric level-set function

τ(x) = 1 − (x − x_0)^T M (x − x_0), (16)

where x_0 is the location of the centre of the ellipse and M is a rotation and scaling matrix that depends on the radii of the ellipse and its angle of rotation about the horizontal axis of the mesh. For more details on the level-set parametrization, see McMillan et al (2014). In order to use gradient-based optimization techniques, the slowness model must be differentiable with respect to the ellipse parameters across the entire domain. To achieve this, we use a smooth approximation Ĥ to the Heaviside function to describe the transition from the ellipse to the background medium. The total slowness at a point x in the domain is given (for fixed background slowness model s_0) by

s(x) = s_0(x) + (s_e − s_0(x)) Ĥ(a τ(x)), (17)

where s_e is the slowness inside the ellipse and a is a scaling factor that controls the sharpness of the transition between the ellipse and the background. The model used to generate synthetic data for our inversion consisted of an ellipse in the centre of the domain embedded in a randomly chosen slowness model from the background GRF. We discretized the domain onto a 100 × 100 regular grid of square cells.
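The pieces above combine into a differentiable slowness model. In the Python sketch below, the logistic form of the smoothed Heaviside and the anisotropic exponential covariance are assumed forms consistent with the description, not necessarily the authors' exact choices:

```python
import numpy as np

def aniso_exp_cov(x, y, sigma=1.0, lx=0.3, ly=0.1):
    """Anisotropic exponential covariance (assumed form): correlation
    decays over length scale lx horizontally and ly vertically."""
    r = np.sqrt(((x[0] - y[0]) / lx) ** 2 + ((x[1] - y[1]) / ly) ** 2)
    return sigma**2 * np.exp(-r)

def smooth_heaviside(tau, a=10.0):
    """Logistic approximation to the Heaviside step; larger a gives a
    sharper ellipse/background transition."""
    return 1.0 / (1.0 + np.exp(-a * tau))

def total_slowness(s0, s_e, tau, a=10.0):
    """Blend background slowness s0 and ellipse slowness s_e via the
    smoothed indicator of the level-set value tau."""
    return s0 + (s_e - s0) * smooth_heaviside(tau, a)
```

The blend is smooth in the ellipse parameters through tau, which is what makes gradient-based optimization of the shape parameters possible.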
A 176 term KLE was used to compute a sample from the zero-mean GRF with covariance (14). This was then added to the mean background to generate a sample from the background model GRF. The zero-mean perturbation and resulting background sample are shown in figures 2(b) and (c), respectively. To this background model, we added an ellipse with an anomalously high slowness of 8 s m⁻¹. The ellipse was centred at the centre of the domain with an angle of 30° from vertical and radii of 0.1 m and 0.2 m. The total model is shown in figure 2(d). In the rest of this section, this will be referred to as the true model. Using a random sample from the background GRF, rather than the mean background, as the true model in our inversion study implies that the sample backgrounds used during inversion are being drawn from a distribution with the correct covariance function but wrong mean. In practice, both the mean and covariance could only be approximately known. We have not attempted to assess the success of our method when the background covariance function is only approximately known. Generally, the accuracy of the background model mean and covariance functions required for our SAA technique to be effective in practice will depend on specific characteristics of the application problem, and in particular, on how well separated the material properties of the object of interest are from the background. When the vast majority of the signal in the data comes from the object of interest, adequate shape reconstruction might be possible without a particularly precise characterization of the background. On the opposite end of the spectrum, if the background generates signals similar to that of the object of interest, then our approach will likely fail and shape reconstruction will only be possible with a very accurate description of the background. It is in the middle ground where we believe our method can be most useful.

SAA inversion
We conducted several numerical experiments to study the behaviour of our method. All these experiments attempted to fit the same synthetic data, which were generated by applying the straight-ray tomography forward modeling operator to the true model and then adding artificial independent Gaussian noise to each measurement. The added noise on each datum had standard deviation equal to 1% of its magnitude. We attempted to recover the orientation angle, centre position, and slowness of the anomalous ellipse. For simplicity, and to ensure that our inverse problem had adequate sensitivity to all parameters active in the inversion, we did not attempt to recover the radii of the ellipse; they were fixed to their true values. The starting guesses and bounds on the active inversion parameters were the same in all inversions. They are listed in table 1.
Before studying the effectiveness of the SAA on the straight-ray example problem, we wanted to characterize the variability of results in single-background inversions caused by uncertainty in background estimation. We assume our background model random field accurately captures the uncertainty in our knowledge of the background. Under this assumption, we can conduct many inversions, each using a single sample from the background model random field. We performed 100 such inversions. For each, we solved the deterministic optimization problem (5) using the projected GN algorithm with a backtracking Armijo line search (Nocedal and Wright 1999). Inversions could end after attaining the desired data fit as determined by a χ² statistic, when the magnitude of the gradient was reduced below a threshold value relative to its initial magnitude, or when the line search failed to find a step that reduced the objective function. Results from all these categories were included in assessing the variance in the recovered parameters as a function of the background. The results are summarized in figure 3, which shows histograms and kernel density estimates of the recovered values of each inversion parameter. There was a wide range of results. Excellent recovery was achieved for a few sample backgrounds, but the variance in results was large, and many inversions achieved very poor recovery of the anomalous ellipse. The orientation angle was particularly poorly resolved; the modal result was an angle pushed to its lower bound. This seems to show that the angle is poorly resolved by this particular straight-ray tomography survey, rather than to give insight into the effect of uncertainty in the background model. Even with perfect knowledge of the background, we have seen that it is difficult to resolve the angle. Perhaps a greater variety of source-receiver ray paths (e.g. 
on the top and bottom of the domain) would help to improve recovery, regardless of the level of uncertainty in the background estimation. Overall, these single background inversion results show that the level of uncertainty in our knowledge of the background slowness model is too high for parametric inversion using a fixed background to be reliably effective for this particular problem.
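For reference, the projection-plus-line-search logic used in these inversions can be illustrated with a simplified stand-in: projected gradient descent with a backtracking Armijo line search and bound projection (the paper itself uses projected Gauss-Newton; this sketch and its names are ours):

```python
import numpy as np

def project(m, lo, hi):
    """Clip parameters to their bound constraints."""
    return np.minimum(np.maximum(m, lo), hi)

def projected_gradient(f, grad, m0, lo, hi, n_iter=200, c1=1e-4):
    """Projected gradient descent with backtracking Armijo line search,
    a simplified stand-in for the projected GN method used in the paper."""
    m = project(np.asarray(m0, float), lo, hi)
    for _ in range(n_iter):
        g = grad(m)
        alpha = 1.0
        while alpha > 1e-12:
            trial = project(m - alpha * g, lo, hi)
            # Armijo sufficient-decrease test along the projected step
            if f(trial) <= f(m) + c1 * (g @ (trial - m)):
                break
            alpha *= 0.5
        else:
            break          # line search failed to find a decrease: stop
        if np.allclose(trial, m):
            break          # projected step made no progress: converged
        m = trial
    return m
```

The three exit conditions (sufficient data fit would be checked inside f in practice, a stalled projected step, and a failed line search) mirror the stopping criteria described above.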
To show the effect of the SAA approach in improving ellipse recovery in the presence of uncertainty in background estimation, we performed repeated multiple-background SAA inversions and studied the variability of the inversion results. For each of two SAA sample sizes, n_s = 10 and n_s = 100, we performed 100 inversions, each with n_s new independent background models sampled from the background GRF. Kernel density plots showing the range of recovered parameter values for the SAA and single-background inversions are plotted together in figure 4. It is clear from these results that the SAA approach did not help recovery of the orientation angle, in line with our expectations, but it did reduce the variance in the recovered values of the other parameters. The density plots also show that there are small but distinct biases in the distributions of recovered values. This stems from the fact that we do not know the true background model distribution and are thus not drawing our sample backgrounds from the correct distribution. Thus, as the SAA sample size is increased, we see the sample means of the recovered parameters converging to values slightly different from the true values.
The sample means and standard deviations of the recovered parameter values for each SAA sample size are shown in table 2. The values in the table show the O(1/√n_s) convergence in standard deviation expected from the SAA. Overall, these statistics show that a single SAA inversion with a large enough sample size can be expected to give much better parameter recovery than a single-background inversion.
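The O(1/√n_s) behaviour is the familiar Monte Carlo rate, and it is easy to demonstrate numerically. The toy sketch below (our own illustration, unrelated to the inversion code) measures the standard deviation of an n-sample average of standard normal draws, which theory predicts to be 1/√n:

```python
import numpy as np

def sample_mean_std(n_samples, n_repeats=2000, rng=None):
    """Empirical standard deviation of an n_samples-point average of
    N(0, 1) draws, estimated over n_repeats independent repetitions."""
    if rng is None:
        rng = np.random.default_rng(0)
    means = rng.standard_normal((n_repeats, n_samples)).mean(axis=1)
    return means.std()
```

Increasing the sample size tenfold shrinks the spread of the average by roughly √10, matching the convergence in standard deviation seen in table 2.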

Optimization by accelerated SGD
A major downside of the SAA method described in this paper, when using traditional deterministic optimization methods, is that the computational cost grows linearly with the number of samples n_s, while the parameter estimates converge only at a rate of O(1/√n_s). Each term in the SAA objective function requires separate forward modeling, gradient and Hessian computations. When large-scale parallel computing resources are available, a large number of terms can be handled in parallel, allowing the approach to remain efficient. But when such resources are not available, the computational cost per iteration becomes prohibitive as the number of samples increases. In order to gain the statistical benefit of using a large number of samples in the SAA technique without incurring a prohibitively large computational cost, we turn to stochastic gradient methods. These algorithms converge much more slowly than the GN method, but the per-iteration cost is drastically reduced. They can perform SAA inversions with large numbers of background samples when only modest computational resources are available, since only a small number of background models are accessed at each iteration. SGD and related algorithms, such as Polyak's momentum method (Polyak and Juditsky 1992), have long been used in stochastic programming (Shapiro et al 2009). Recently, these methods, along with other variants, have gained great popularity in the machine learning community due to their ability to work with small subsets of large datasets and their surprising performance on difficult deterministic non-convex machine learning optimization problems; see e.g. Bottou (2010), Lecun et al (2015). We are interested in SGD and related methods because of their ability to use information from a large number of sample background models without needing to process many of these models at each iteration.
In its original context of stochastic optimization, SGD was defined as an algorithm to solve the expectation minimization problem (7). It estimates the optimal model m* via the iteration

m_{k+1} = m_k − α ∇Φ(m_k, θ_0^(i_k)),

where α is the step size and ∇Φ(m_k, θ_0^(i_k)) is the gradient of the objective function with respect to m computed at a single realization of the background model (Murphy 2022).
SGD can also be seen as a way to solve the SAA problem (9) by approximating the full gradient at each iteration by the gradient of a single term. This idea can be extended to the algorithm known as mini-batch gradient descent, where the model is updated as

m_{k+1} = m_k − (α/n_B) Σ_{i∈B_k} ∇Φ(m_k, θ_0^(i)),

where B_k is some subset containing n_B of all of the available background model samples (Murphy 2022). Although, technically, SGD refers only to the case of using a single sample per iteration, the term is sometimes used to refer to mini-batch gradient descent as well. In order to improve convergence, several accelerated SGD methods have been proposed, such as the momentum method (Polyak and Juditsky 1992), RMSProp (Hinton 2014) and ADAM (Kingma and Ba 2015), to name a few popular examples. These methods aim to improve convergence by including information from past iterates in the parameter updates, by adaptively scaling the gradients based on past gradients, or by some combination of these approaches.
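A minimal sketch of mini-batch gradient descent on a toy SAA objective may make the update concrete. The quadratic objective, its sample set, and all parameter values here are illustrative (the SAA minimizer of this toy problem is simply the sample mean of the draws), not the paper's Φ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stochastic objective: Phi(m, theta) = 0.5*||m - theta||^2, so the
# minimizer of the SAA sum is the sample mean of the "background" draws.
thetas = rng.normal(loc=3.0, scale=1.0, size=(500, 2))

def grad_phi(m, theta):
    return m - theta  # gradient of 0.5*||m - theta||^2 with respect to m

def minibatch_sgd(m0, thetas, alpha=0.1, n_B=10, n_epochs=50, seed=2):
    rng = np.random.default_rng(seed)
    m = m0.copy()
    for _ in range(n_epochs):
        # One pass over the samples in randomly permuted mini-batches.
        order = rng.permutation(len(thetas))
        for start in range(0, len(thetas), n_B):
            batch = thetas[order[start:start + n_B]]
            g = grad_phi(m, batch).mean(axis=0)  # mini-batch gradient
            m = m - alpha * g
    return m

m_hat = minibatch_sgd(np.zeros(2), thetas)
```

With a fixed step size the iterates settle into a small noise ball around the SAA minimizer; n_B = 1 recovers plain SGD.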
In this work, we have used the ADAM accelerated SGD algorithm (Kingma and Ba 2015). The ADAM method combines ideas from momentum methods and gradient descent with adaptive step size control. The method is shown in algorithm 1. The model is updated by an exponentially weighted average of past gradients, with the individual components scaled by an exponentially weighted average of the magnitudes of past gradients. After reviewing several accelerated gradient descent algorithms, we chose ADAM because of its proven performance on many difficult large-scale problems (see, e.g. Zhang (2018), Karpathy (2017), Reddi et al (2018)) and the fact that in our experience it seems to perform well with less hyperparameter tuning than other related methods. In order to achieve satisfactory performance it was still necessary to tune the step size. After some experimentation, we settled on a value of α = 0.002, which we used in the experiments discussed below. We set the other parameters to the default values recommended in Kingma and Ba (2015). These are β_1 = 0.9, β_2 = 0.99, and ϵ = 1 × 10⁻⁸. Algorithm 1. The ADAM method; algorithm presentation adapted from algorithm 1 of Kingma and Ba (2015). Note that β_1^k and β_2^k denote β_1 and β_2 raised to the power of k. The • symbol represents element-wise multiplication.
Choose hyperparameters α, β_1, β_2, ϵ.
Choose initial model m_0 (abuse of notation; do not confuse with the background model).
v_0 ← 0, r_0 ← 0, k ← 0
while not converged do
    Compute a random permutation of the samples and divide into mini-batches
    for j in mini-batches do
        k ← k + 1
        g_k ← gradient of the SAA objective approximated on mini-batch j
        v_k ← β_1 v_{k−1} + (1 − β_1) g_k
        r_k ← β_2 r_{k−1} + (1 − β_2) g_k • g_k
        v̂_k ← v_k / (1 − β_1^k), r̂_k ← r_k / (1 − β_2^k)
        m_k ← m_{k−1} − α v̂_k / (√r̂_k + ϵ)
    end for
end while

At each iteration of the optimization, the approximate gradient of the SAA objective function is computed using a subset of the available background samples, called a mini-batch. We studied the performance of the ADAM method in solving the SAA optimization problem (9) for the straight-ray tomography problem by studying the variability of the recovery of ellipsoid parameters obtained with independent sets of background samples, as we did with GN. We studied the performance of ADAM using SAA sample sizes of n_s = 100, 500, with mini-batch sizes of 1 and 10. We performed 100 inversions in each of these configurations. The optimizations generally converged faster with the larger mini-batch size, as expected. The parameter recovery was equivalent in both cases. The summary statistics of the results (computed from the runs with mini-batch size 1) are shown in table 3. Kernel density estimates for each parameter value, for both SAA sample sizes, are plotted in figure 5. The standard deviations of the ellipsoid position and slowness from the 100-background inversions were very close to what was observed in the 100-background GN experiments. Unexpectedly, the orientation angle was much better resolved by the ADAM inversions. The biases of the x and y position variables were similar to the GN results, while the ADAM inversions tended to overestimate the slowness more than the GN inversions. We cannot definitively say why the biases in the ADAM results were different from the GN biases and why the ADAM results better recovered the orientation angle.
It may be that the ADAM algorithm is better able to escape from local optima and saddle points, or that, because of its much smaller step sizes and the scaling of the problem, ADAM is able to better explore the interior of the parameter space and find reasonable values for the position and slowness parameters without quickly pushing the angle to its upper bound value.
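For concreteness, a compact sketch of the ADAM update described in algorithm 1 is given below, checked on a simple deterministic quadratic; the test function and step size are illustrative and unrelated to the inversion code used in the experiments:

```python
import numpy as np

def adam(grad, m0, alpha=0.01, beta1=0.9, beta2=0.99, eps=1e-8, n_iter=5000):
    """Minimal sketch of the ADAM update (Kingma and Ba 2015). `grad`
    returns a (possibly stochastic) gradient estimate at the model m."""
    m = np.asarray(m0, dtype=float).copy()
    v = np.zeros_like(m)  # exponentially weighted average of gradients
    r = np.zeros_like(m)  # exponentially weighted average of squared gradients
    for k in range(1, n_iter + 1):
        g = grad(m)
        v = beta1 * v + (1 - beta1) * g
        r = beta2 * r + (1 - beta2) * g * g        # element-wise square
        v_hat = v / (1 - beta1 ** k)               # bias corrections
        r_hat = r / (1 - beta2 ** k)
        m -= alpha * v_hat / (np.sqrt(r_hat) + eps)
    return m

# Sanity check on a deterministic quadratic with minimizer (1, -2).
m_star = adam(lambda m: 2.0 * (m - np.array([1.0, -2.0])), np.zeros(2))
```

The adaptive scaling makes the per-component step magnitude at most about α regardless of the raw gradient scale, which is why tuning only the step size often suffices.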

Application to non-linear 3D problem
We illustrate the application of our method to a larger scale non-linear problem using another example from geophysics, direct current resistivity (DCR) imaging. In DCR imaging, a steady electric current is injected into the ground and the resulting electrical potential differences due to electric currents flowing in the ground are measured at a small number of locations on the earth's surface. These potential differences can be used to estimate the electrical conductivity (inverse of resistivity) of the earth's subsurface. Mathematically, DCR surveys are governed by an elliptic boundary value problem for the electrical potential ϕ in the earth,

∇ · (θ∇ϕ) = ∇ · j_s in Ω,
θ∇ϕ · n̂ = 0 on ∂Ω_1,
ϕ = 0 on ∂Ω_2, (20)

where θ is the earth's electrical conductivity and j_s is the electric current driving the experiment. ∇ϕ is the spatial gradient of the potential and ∇· indicates the divergence operator. The conductivity θ is here assumed to be a scalar function of space. The source current is divergence free except at a finite number of isolated points where current electrodes are connected to the ground. ∂Ω_1 denotes the portion of the domain boundary that represents the surface of the earth and n̂ is the unit vector normal to the domain boundary. ∂Ω_2 is the rest of the domain boundary. The Dirichlet conditions model the fact that the potential tends toward zero far from sources of current. In order to approximately meet this condition in a finite domain, we must make the domain large enough that the non-surface boundaries are indeed far from any sources of current. We discretize (20) on locally refined rectilinear meshes called OcTrees (see e.g. Horesh and Haber 2011), which make it simple to enlarge the domain without significantly increasing the computational cost of solving the boundary value problem. We approximate (20) using a finite-volume discretization on an OcTree mesh, discretizing the potential at the mesh nodes (Haber 2015).
We use the open-source jInv software package (Ruthotto et al 2017) to implement this discretization of the forward problem as well as matrix vector products with the Jacobian matrix of the forward problem, and its adjoint.
The potential difference measurements, which are the data for our inverse problem, are linear functionals of the potential ϕ, a measurement being simply the difference in the potential between two locations on the earth's surface. For general information on the mathematics of the DCR problem, see, e.g. Ward and Hohmann (1987), Haber (2015).
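As a rough illustration of this class of discretization, the sketch below solves a 2D analogue of the DCR boundary value problem on a uniform grid, using a five-point stencil with harmonic averaging of conductivities at cell faces. It is a simplified stand-in for the OcTree finite-volume scheme; the grid size, conductivity, and dipole source placement are invented for the example:

```python
import numpy as np

def solve_dc_2d(theta, h, q):
    """Solve -div(theta * grad(phi)) = q on a uniform n x n cell grid with
    homogeneous Dirichlet boundaries (phi = 0 outside the domain)."""
    n = theta.shape[0]
    N = n * n
    A = np.zeros((N, N))
    b = q.flatten().astype(float)
    idx = lambda i, j: i * n + j
    for i in range(n):
        for j in range(n):
            diag = 0.0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    # Harmonic mean of the two neighbouring cell conductivities.
                    t = 2.0 * theta[i, j] * theta[ii, jj] / (theta[i, j] + theta[ii, jj]) / h**2
                    A[idx(i, j), idx(ii, jj)] = -t
                    diag += t
                else:
                    diag += theta[i, j] / h**2  # Dirichlet boundary contribution
            A[idx(i, j), idx(i, j)] = diag
    return np.linalg.solve(A, b).reshape(n, n)

# Dipole source: +1 and -1 point currents in a uniform 0.01 S/m background.
n = 31
theta = np.full((n, n), 0.01)
q = np.zeros((n, n)); q[15, 8] = 1.0; q[15, 22] = -1.0
phi = solve_dc_2d(theta, h=1.0, q=q)
```

A potential-difference datum is then simply phi evaluated at one receiver location minus phi at another.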
We apply our method to two DCR examples. In the first example we characterize a plate-like highly conductive body embedded in a heterogeneous vertically layered background medium. This example shows that our approach can still be practical when computational cost makes running SAA with many samples per iteration infeasible, and when the background model is more complex than in the simple straight-ray tomography example. In the second example, we characterize an anomalous body consisting of two overlapping conductive rectangular prisms, one tube-like and the other plate-like. This shows how our method can be applied to the recovery of more complex objects than simple individual ellipses and prisms.
We model the conductive bodies in these two examples as rectangular prisms. The prisms are incorporated into the background conductivity model using the parametric level-set approach. Similarly to the previous example, the overall conductivity model for a single prism is

θ(x) = θ_0(x) + (θ_p − θ_0(x)) H̃(x),

where θ_0 is the heterogeneous background conductivity model, θ_p is the conductivity of the prism, and H̃ is a smooth approximation to an indicator function that takes value 1 inside the prism and 0 outside. The indicator function is a product of one-dimensional (1D) indicator functions in the three Cartesian coordinate directions,

H̃(x) = ∏_{i=1}^{3} H̃_i(r_i), r = R(x − x_0),

where x_0 is the position of the centre of the prism and R is a rotation matrix that transforms x − x_0 to a coordinate frame whose axes are the principal axes of the prism. The 1D indicator function for coordinate i is

H̃_i(r_i) = ½ [tanh(a_i(r_i + h_i/2)) − tanh(a_i(r_i − h_i/2))],

where r_i is the i component of r, h_i is the side length of the prism in the i direction, and a_i is a scaling factor that controls the sharpness of the prism boundary transition in the i direction.
In summary, each prism is parametrized by ten numbers: the three coordinates of its centre location, the three rotation angles about the coordinate axes, defining its orientation, its three side lengths, and its conductivity.
In the case of multiple prisms, the overall conductivity model is

θ(x) = θ_0(x) + w(x) Σ_{i=1}^{n_p} (θ_{p_i} − θ_0(x)) H̃_i(x),

where n_p is the number of prisms and w(x) = min(1, 1/Σ_{i=1}^{n_p} H̃_i(x)) is a normalization factor. H̃_i and θ_{p_i} indicate the approximate indicator function and conductivity of the ith prism.
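The prism parametrization and the overlap normalization can be sketched as follows. The tanh-based smooth box used for the 1D indicator is one plausible choice of smooth approximation, and the dictionary layout of the prism parameters is invented for the example:

```python
import numpy as np

def prism_indicator(x, x0, R, h, a):
    """Smooth prism indicator H-tilde: a product of 1D smooth box functions
    along the prism's principal axes, evaluated at points x (shape (N, 3))."""
    r = (x - x0) @ R.T                 # rotate into the prism's principal frame
    H = np.ones(x.shape[0])
    for i in range(3):
        H *= 0.5 * (np.tanh(a[i] * (r[:, i] + h[i] / 2))
                    - np.tanh(a[i] * (r[:, i] - h[i] / 2)))
    return H

def conductivity_model(x, theta0, prisms):
    """Blend n_p prisms into the background, normalizing overlaps with
    w(x) = min(1, 1 / sum_i H_i(x))."""
    Hs = np.array([prism_indicator(x, p["x0"], p["R"], p["h"], p["a"])
                   for p in prisms])
    w = np.minimum(1.0, 1.0 / np.maximum(Hs.sum(axis=0), 1e-12))
    theta = theta0.copy()
    for H, p in zip(Hs, prisms):
        theta += w * H * (p["cond"] - theta0)
    return theta

# One axis-aligned prism of 1 S/m in a 0.01 S/m background.
pts = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
prism = {"x0": np.zeros(3), "R": np.eye(3),
         "h": np.array([4.0, 4.0, 4.0]), "a": np.array([5.0, 5.0, 5.0]),
         "cond": 1.0}
theta = conductivity_model(pts, np.full(2, 0.01), [prism])
```

Inside the prism the model evaluates to the prism conductivity; far outside it reverts to the background, with the normalization averaging the contributions where prisms overlap.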
The background conductivity θ_0 is modelled as a log-normal random field. The covariance function of log(θ_0) is

C(x, x′) = σ(x)σ(x′) exp(−d(x, x′)²), (25)

where

d(x, x′)² = (x₁ − x₁′)²/l_x² + (x₂ − x₂′)²/l_y² + (x₃ − x₃′)²/l_z²,

with l_x = l_y = 2.5 l_z. We take approximate samples of the log(θ_0) GRF by using the KLE. In order to make the computation of the Karhunen-Loève eigenfunctions tractable and to make sampling efficient, we sample the background on a subset of the full computational domain. We use a large enough volume such that the lack of randomization outside it does not significantly affect the data. We then simply take the exponential of these GRF samples to get samples from the background random field θ_0.
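Sampling via a truncated KLE can be sketched as below. A squared-exponential kernel with constant σ is assumed here for illustration, and the grid, variance, and mode count are arbitrary:

```python
import numpy as np

def kle_sample_lognormal(coords, sigma, lengths, n_modes, rng, mu=0.0):
    """Draw an approximate log-normal random field sample via a truncated
    Karhunen-Loeve expansion of log(theta0) on the given points."""
    # Anisotropic squared-exponential covariance matrix on the sample points.
    d = (coords[:, None, :] - coords[None, :, :]) / lengths
    C = sigma**2 * np.exp(-0.5 * (d**2).sum(axis=-1))
    lam, V = np.linalg.eigh(C)                            # ascending eigenvalues
    lam, V = lam[::-1][:n_modes], V[:, ::-1][:, :n_modes] # keep leading modes
    xi = rng.standard_normal(n_modes)                     # KLE coefficients
    log_theta = mu + V @ (np.sqrt(np.maximum(lam, 0.0)) * xi)
    return np.exp(log_theta)                              # log-normal field

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 10.0, 8)
X, Y, Z = np.meshgrid(xs, xs, xs[:4], indexing="ij")
coords = np.column_stack([X.ravel(), Y.ravel(), Z.ravel()])
theta0 = kle_sample_lognormal(coords, sigma=0.5,
                              lengths=np.array([5.0, 5.0, 2.0]),
                              n_modes=50, rng=rng, mu=np.log(0.01))
```

Truncating to the leading eigenmodes is what keeps repeated sampling cheap: each new background realization costs only a fresh draw of the coefficient vector.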

Single plate model
The 3D conductivity model used to generate synthetic data for our first DCR experiment (which we will call the true model below) was taken to be a thin dipping rectangular prism embedded in a random sample from the background field. The mean of the background GRF is constant in the y direction and piecewise constant in the x and z directions. It is illustrated in figure 6 by a slice in the x-z plane showing the core region of the model. The standard deviation σ(x) is piecewise constant, being 50% of the mean value in each constant region of the mean model. The full true model including the background and anomalous plate is illustrated by orthogonal 2D slices in figure 7.
We generated synthetic data using our true model and a survey configuration consisting of 65 transmitter locations arranged in a regular grid inside the core region of our model. These transmitters are pairs of electrodes, connected by wire to a current source, through which electric current is injected into the ground. For each transmitter, we measure potential differences between pairs of receiver electrodes arranged parallel to the transmitter electrode pairs. This is known in applied geophysics as a dipole-dipole survey (Binley and Kemna 2005). In total, across our 65 transmitters, we have 390 potential difference measurements. We added random Gaussian noise to the true synthetic data with standard deviation 1% of the value of each datum. In this example, due to computational resource constraints, we generated the true data on the same OcTree mesh as was used in the inversion. As in our previous example, we use the SAA to estimate the parameters of our conductive prism, embedded in a background medium known only approximately. Explicitly, we wish to solve the problem

min_m (1/n) Σ_{i=1}^{n} ‖W(F_dc(m, θ_0^(i)) − d_obs)‖² subject to m_l ⩽ m ⩽ m_h,

where m represents the vector of prism parameters and θ_0^(i) is the ith sample from the background model GRF. We assume that n background samples are available. F_dc is the DCR forward modeling operator, taking prism parameters and a background model as input and returning the predicted potential differences between receiver electrodes. As in (5), W is a data weighting matrix, and m_l and m_h are vectors of the lower and upper bounds on the prism parameter values.
With our computational resources, using the GN SAA inversion technique with many sample backgrounds was computationally intractable for this problem. However, we found that the stochastic gradient technique introduced in the previous section worked well. As in the last section, we performed a set of inversions using single random samples from the background distribution. Due to the computational cost, we did not perform a series of SAA inversions. Instead, we performed a single SAA inversion using the ADAM stochastic optimization algorithm, with 400 background models and a mini-batch size of 4. This mini-batch size was chosen simply because it was the number of model realizations that we could process concurrently. We ran the ADAM optimization for 1500 iterations, at a step size of α = 2 × 10⁻³. This is the same step size we used in the straight-ray tomography example, and we found that it worked well for this harder problem. We chose the number of iterations empirically, observing that the objective function value for a single mini-batch had stagnated by that point and that, for a randomly chosen mini-batch, the inversion had fit the data according to a χ² test.
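The combination of mini-batch ADAM with the bound constraints can be sketched as below, with the bounds enforced by projection after each update. The toy SAA objective and all parameter values are illustrative; this is not the implementation used in the experiments:

```python
import numpy as np

def projected_adam_saa(grad_i, n_samples, m0, lo, hi, alpha=2e-3, n_B=4,
                       n_iter=1500, beta1=0.9, beta2=0.99, eps=1e-8, seed=0):
    """Mini-batch ADAM on an SAA objective with box constraints handled by
    projection. `grad_i(m, i)` returns the gradient of the i-th SAA term."""
    rng = np.random.default_rng(seed)
    m = np.clip(np.asarray(m0, dtype=float), lo, hi)
    v = np.zeros_like(m)
    r = np.zeros_like(m)
    for k in range(1, n_iter + 1):
        batch = rng.choice(n_samples, size=n_B, replace=False)
        g = np.mean([grad_i(m, i) for i in batch], axis=0)
        v = beta1 * v + (1 - beta1) * g
        r = beta2 * r + (1 - beta2) * g * g
        step = alpha * (v / (1 - beta1**k)) / (np.sqrt(r / (1 - beta2**k)) + eps)
        m = np.clip(m - step, lo, hi)   # project back onto the box constraints
    return m

# Toy SAA problem: each term is 0.5*||m - t_i||^2 for noisy targets t_i,
# with the second component's target mean lying outside the bounds.
rng = np.random.default_rng(1)
targets = rng.normal([0.5, 1.5], 0.1, size=(400, 2))
m_hat = projected_adam_saa(lambda m, i: m - targets[i], 400,
                           np.zeros(2), lo=0.0, hi=1.0)
```

Components whose unconstrained solution lies outside the box end up pinned at the bound, mirroring the thickness parameter that was pushed to its upper bound in the inversion.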
In this case the anomalous parametric body was better separated from the background than in the 2D example. Because of this, the single background DCR inversions gave better results than in the 2D tomography case. However, there was still a significant variability in the results. This is illustrated in figure 8, which shows the recovered parameter values for the single background and SAA inversions. The single background inversions were able to somewhat reliably recover the position of the anomalous block but showed much larger variability in the recovery of the other parameters. None of the inversions were able to recover the thickness (z extent) well, suggesting that the data may not be very sensitive to this parameter. The single background inversions were able to reduce misfit by tweaking the side lengths of the block and varying its orientation angle but they struggled to capture its highly conductive nature. In most cases, they increased the signal from the block only by increasing its thickness, and not by increasing the conductivity. The SAA inversion still pushed the thickness to its upper bound, but perhaps because it could not find descent directions by fitting small signals from background features, it was better able to capture the essential features of the anomalous block, leading to good recovery of its position, orientation, and conductivity.

Two prism model
Our second DCR example model consisted of a more complex anomalous body, modeled by two rectangular prisms: a large thin plate with conductivity of 1 S m⁻¹ and a long tube-like structure with conductivity of 2 S m⁻¹. As in the first DCR example, the background model GRF had the covariance function (25). However, for this model, the mean conductivity was a constant 0.01 S m⁻¹, with a standard deviation of 0.05 S m⁻¹. The background model used to generate synthetic data for this example was a random sample from the background GRF. The true model is shown in figure 9.
We generated synthetic data for this model on an OcTree mesh with 10 m × 10 m × 10 m cells in the core survey area. We used the same general layout of transmitters and receivers as in the previous example, but with a range of transmitter and receiver electrode spacings, giving us a total of 780 data points. We added 1% Gaussian noise to the data. The inversions for this example were performed on a coarser OcTree mesh, with 20 m cells in the core area, to avoid the inverse crime. The inversion mesh in this example had a coarser discretization and a smaller core area than the single plate DCR model. This allowed us to run both the single-background and SAA inversions for this example using the GN method with a set of 24 background model samples, all used at each iteration. This was the maximum number of background model samples we could simultaneously process with our available computational resources.
We performed 20 single-background inversions, using a new randomly sampled background model each time, to show the baseline level of variability of inversion results due to background uncertainty. We used the GN optimization method with backtracking Armijo line search to solve these inverse problems (Nocedal and Wright 1999). We used the projected conjugate gradient algorithm to solve the GN search direction system of equations and enforce the bound constraints at each iteration. For the first prism, the y and z direction thicknesses were fixed to their true values, in order to simulate the assumption that an inversionist has some a priori information indicating that the anomaly is a thin plate. All other parameters were active in the inversion. The results are shown in figure 10. Generally, the single-background inversions gave a satisfactory recovery of the prism parameters. The recovered conductivities are well separated from the background. However, there was still significant variability in the recovered values. Figure 10 also shows the prism parameter values recovered when the true background is known.
Note that even in this case we do not precisely recover the true values of the prism parameters, with the single-sample background inversion results sometimes getting closer to the true parameter values. This occurs because our noisy data do not actually contain the information needed to perfectly resolve the true parameter values from our starting guesses, and because we are dealing with a non-convex optimization problem, there is no guarantee that we will converge to the true solution, even without adding noise to the data. In order to improve our results we would have either to adjust the starting guess or to generate synthetic data for a wider range of transmitter and receiver configurations, giving us greater sensitivity to the prism parameters.
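The line search used to safeguard the GN step can be sketched as a backtracking Armijo search with each trial point projected onto the box constraints; the projected-CG computation of the GN direction is not reproduced here, and a steepest-descent direction stands in for it in the usage example:

```python
import numpy as np

def armijo_backtrack(f, grad_f, m, d, lo, hi, alpha0=1.0, c1=1e-4, rho=0.5,
                     max_back=30):
    """Backtracking Armijo line search along a descent direction d, with the
    trial points projected onto the box [lo, hi]. The sufficient-decrease
    test is taken against the actual (projected) step."""
    f0, g0 = f(m), grad_f(m)
    alpha = alpha0
    for _ in range(max_back):
        m_trial = np.clip(m + alpha * d, lo, hi)
        if f(m_trial) <= f0 + c1 * g0 @ (m_trial - m):
            return m_trial
        alpha *= rho  # halve the step and try again
    return m  # no acceptable step found; keep the current iterate

# Quadratic misfit whose unconstrained minimizer (0.3, 2.0) lies partly
# outside the bounds [0, 1].
f = lambda m: 0.5 * np.sum((m - np.array([0.3, 2.0]))**2)
grad_f = lambda m: m - np.array([0.3, 2.0])
m = np.array([1.0, 0.0])
for _ in range(50):
    m = armijo_backtrack(f, grad_f, m, -grad_f(m), lo=0.0, hi=1.0)
```

The iterates settle at the projection of the unconstrained minimizer onto the box, with the constrained component held at its bound.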
We also performed 20 SAA inversions, with 24 background samples per inversion, using the GN method to solve each SAA inversion problem. The results are shown in figure 11. As expected, the spread in results is much tighter than for the single background inversions, clustering much closer to the parameter values recovered in the true background inversions, and in most cases, closer to the true parameter values. Due to computational resource limitations we could not run with a larger number of background samples per inversion to fully assess convergence of the SAA approximation on this example.

Conclusions
In parametric shape reconstruction problems, estimating the properties of the background medium can be a major stumbling block to accurate reconstruction. We have shown that the effect of uncertainty in the background properties can be mitigated by the use of stochastic optimization methods. This can be thought of as a means to improve the shape recovery possible given a specified characterization of the background or as a justification to limit the effort one might expend on achieving a very accurate representation of the background.
The efficiency of this method could likely be improved in future work. In this work, we used ADAM, a first-order stochastic optimization method, to make our approach tractable for large-scale problems. There may be an opportunity to use stochastic Newton methods to further improve the efficiency of the optimization.

Data availability statement
The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available upon reasonable request from the authors.