Subsampling in ensemble Kalman inversion

We consider ensemble Kalman inversion (EKI), which has recently been introduced as an efficient, gradient-free optimisation method to estimate unknown parameters in an inverse setting. In the case of large data sets, the EKI becomes computationally infeasible, as the data misfit needs to be evaluated for each particle in each iteration. Here, randomised algorithms like stochastic gradient descent have been demonstrated to successfully overcome this issue by using only a random subset of the data in each iteration, so-called subsampling techniques. Based on a recent analysis of a continuous-time representation of stochastic gradient methods, we propose, analyse, and apply subsampling techniques within EKI. In particular, we propose two different subsampling techniques: either every particle observes the same data subset (single subsampling) or every particle observes a different data subset (batch subsampling).


Introduction
A large variety of physical, biological, and social systems and processes have been described by mathematical models. Those models can be used to analyse and predict the behaviour of the associated processes and systems. If a model is to be employed to describe a particular system, it needs to be calibrated with respect to observations of that system. This calibration process that fits the model to data is often called inversion. Inversion forms the basis for, e.g., numerical weather prediction, medical image processing, and many machine learning methods. Several inversion techniques have been proposed and studied: we often distinguish variational/optimisation-based approaches and Bayesian/statistical approaches. Throughout this work, we consider a class of methods that can be seen as lying in between those two approaches: the Ensemble Kalman Inversion (EKI) framework going back to [20,31]. EKI is based on an Ensemble Kalman-Bucy Filter that is iteratively applied to solve an inverse problem. In the linear setting, the resulting algorithm is usually given in the form of a preconditioned gradient flow, that is, an ordinary differential equation describing the dynamics of an ensemble of particles.
EKI becomes computationally infeasible if the considered amount of data is too large: the data cannot be stored in memory at once and, thus, EKI is not applicable. The same problem also arises in other traditional optimisation algorithms, like gradient descent or the Gauss-Newton method. In the past decades, randomised algorithms that optimise in each time step only with respect to a subsample of the data set have become popular. A subsample is a (often randomly chosen) subset of the considered data set. The foundation for all of these stochastic optimisation algorithms is the stochastic gradient descent (SGD) algorithm going back to [28]. Stochastic gradient descent and its variants have become especially popular in the machine learning community.
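As a point of reference for the subsampling idea, the following is a minimal sketch of SGD on a least-squares loss whose data are split into blocks. All names and sizes (the matrix `X`, the number of blocks `n_sub`, the step-size schedule) are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares problem whose data are split into n_sub blocks.
n, d, n_sub = 120, 5, 6
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)
X_blocks, y_blocks = np.array_split(X, n_sub), np.array_split(y, n_sub)

theta = np.zeros(d)
for step in range(5000):
    i = rng.integers(n_sub)                     # draw a random subsample index
    Xi, yi = X_blocks[i], y_blocks[i]
    grad = Xi.T @ (Xi @ theta - yi) / len(yi)   # gradient on that subsample only
    theta -= 0.02 / (1 + 0.01 * step) * grad    # decaying learning rate
```

Each step touches only one data block, which is exactly the memory advantage the subsampled EKI below aims for.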
The idea of randomised subsampling in the EKI framework has been proposed in [22], where it is, indeed, applied to train a neural network. There, the subsampling has been introduced after discretising the preconditioned gradient flow, but no analysis has been presented. A recent work by [23] explains how subsampling can be represented in continuous-time settings and how these representations can be analysed. In the present work, we aim at using this theory in the context of EKI to analyse subsampling methodology at the ODE level.

Literature Overview
Since its introduction in [15], the Ensemble Kalman Filter (EnKF) has been widely used in both inverse problems and data assimilation problems. The EnKF is very appealing for many applications due to its straightforward implementation and robustness with respect to small ensemble sizes [3,4,18,19,20,24]. Stability results can be found in [32,33]. Convergence analysis based on the continuous-time limit of the Ensemble Kalman Inversion (EKI) has been developed in [5,6,9,30,31]. However, to obtain convergence results in the parameter space, some form of regularisation is usually needed. We mainly consider Tikhonov regularisation, which was analysed for the EKI in [12], for example. Recently, there has been further analysis of Tikhonov regularisation for the stochastic EKI, as well as adaptive Tikhonov strategies to improve the original variant [35]. Considering large ensemble sizes, an analysis of the mean-field limit is presented in [10,14]. A historical overview of the Kalman filter and some of its extensions can be found in [10]. After its introduction by Robbins and Monro [28], the stochastic gradient descent method has in the recent past been further analysed by, e.g., [8]. Stochastic gradient descent is often computationally advantageous compared to full gradient descent [26] due to computational efficiency as well as its ability to escape local minimisers in non-convex optimisation problems [13,34]. As mentioned earlier, the theory employed in this work is based on the continuous-time analysis of stochastic gradient descent by Latz [23], which was further generalised in [21], but is somewhat orthogonal to the diffusion-based continuous-time analysis of SGD of, e.g., [25].

Contributions and outline
In the following, we focus on the case where the data misfit is computationally infeasible due to large data. Inspired by the success story of randomised gradient descent methods, we introduce subsampling strategies to EKI to ensure feasibility of the method also in the large data regime. We summarise our contributions below:
1. We introduce two subsampling techniques for EKI: single subsampling and batch subsampling.
2. We present an analysis of the subsampling techniques for linear forward operators; in particular, we analyse stability of the subsampling schemes and give conditions under which we obtain asymptotic stability. Indeed, the resulting dynamical system approximates the EKI solution.
3. We illustrate our results with two examples: estimation of the source term for a parabolic partial differential equation and estimation of the diffusion coefficient for an elliptic partial differential equation.
Whilst analysing the subsampling EKI, we generalise some results from [23] to more general flows. These generalisations may be of independent interest. This work is structured as follows. We introduce the problem setting and EKI methods in Section 2. We discuss stochastic approximations of certain flows in general and the subsampling in EKI in particular in Section 3, before analysing them in Section 4. We show numerical examples in Section 5 and conclude the work in Section 6.
Problem setting
We consider the inverse problem of recovering an unknown parameter from noisy data,

y† = Aθ† + η†, (2.1)

where θ† ∈ X is the true parameter, y† ∈ Y is the observed data set, η† ∈ Y is observational noise, and A : X → Y is a compact operator. In a Bayesian setting, we model θ†, η† as random variables θ : Ω → X and η : Ω → Y, where θ ⊥ η. Assuming that the noise is normally distributed, i.e. η ∼ N(0, Γ), and non-degenerate, the posterior µ^y is characterised through Bayes' formula:

µ^y(dθ) = Z^{-1} exp(−(1/2)∥y − Aθ∥²_Γ) µ0(dθ),

where µ0 denotes the prior distribution on the unknown parameters, ∥·∥_Γ := ∥Γ^{-1/2}·∥ denotes the noise-weighted norm, and Z = E_{µ0}[exp(−(1/2)∥y − Aθ∥²_Γ)] is the normalisation constant. We will focus in the following on the computation of a point estimate of the unknown parameters, the maximum a posteriori (MAP) estimate, which is a minimiser of a regularised version of the potential

Φ(θ; y†) := (1/2)∥y† − Aθ∥²_Γ. (2.2)

In order to handle large data settings, i.e. N_obs large, we introduce a subsampling strategy: we partition the data y† into multiple subsets y†_1, ..., y†_{N_sub} and set I := {1, ..., N_sub}. To this end, we define data subspaces Y_1, ..., Y_{N_sub} such that Y := ⊕_{i∈I} Y_i. Moreover, we assume that the noise η has independent entries with respect to this splitting of the data space Y. In particular, we assume that there are covariance matrices Γ_i : Y_i → Y_i, i ∈ I, such that Γ has the block-diagonal structure Γ = diag(Γ_1, ..., Γ_{N_sub}). Finally, we split the operator A into a family (A_i)_{i∈I}, where A_i : X → Y_i denotes the component of A mapping into the data subspace Y_i. Then, we can equivalently represent the inverse problem (2.1) by the family of inverse problems

y†_i = A_i θ† + η†_i, i ∈ I,

where η†_i is a realisation of η_i ∼ N(0, Γ_i) for i ∈ I.
Remark 2.1 Note that the diagonal structure of the noise can always be guaranteed by multiplying (2.1) with the inverse square root of Γ. To simplify notation we will, w.l.o.g., assume for the remaining discussion that Γ = Id_k and correspondingly Γ_i = Id_{k_i}.
We will then consider the potentials of the respective data subsets,

Φ_i(θ; y†_i) := (1/2)∥y†_i − A_i θ∥², i ∈ I.

To overcome the ill-posedness of the inverse problem, we often consider a regularised version of the potential in the form of

Φ^reg(θ; y†) := Φ(θ; y†) + (1/2)∥θ∥²_{αC0}, (2.3)

where C0 is a self-adjoint, trace-class operator and α > 0. This corresponds to the MAP estimate in the case of a Gaussian prior with covariance αC0, cp. [29,12]. Assuming a Gaussian prior distribution with mean equal to zero, we can incorporate the regularisation via the extended operator and data

Ã := (A, (αC0)^{-1/2})^T, ỹ† := (y†, 0)^T,

so that Φ^reg(θ; y†) = (1/2)∥ỹ† − Ãθ∥². Note that the forward operator A is usually not injective, whereas the regularisation results in an injective operator Ã.
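The block decomposition of the misfit can be checked numerically: with Γ = Id, the full misfit is exactly the sum of the subsampled misfits. A minimal Python illustration, with all sizes and names chosen arbitrarily for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small linear inverse problem with Gamma = Id.
d, k, n_sub = 4, 12, 3
A = rng.normal(size=(k, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=k)

def phi(theta, A_blk, y_blk):
    """Least-squares data misfit 0.5 * ||y_i - A_i theta||^2 for one block."""
    r = y_blk - A_blk @ theta
    return 0.5 * r @ r

# Partition the data and the operator rows into n_sub blocks (A_i, y_i).
A_blocks, y_blocks = np.array_split(A, n_sub), np.array_split(y, n_sub)

theta = rng.normal(size=d)
full = phi(theta, A, y)
split = sum(phi(theta, Ai, yi) for Ai, yi in zip(A_blocks, y_blocks))
# full and split agree up to floating-point rounding
```

This additive structure is what makes it possible to replace Φ by a randomly chosen Φ_i later on.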
Similarly, when we split the forward operator, we consider the regularised subsampled potentials Φ^reg_i(θ; y†_i) := Φ_i(θ; y†_i) + (1/2)∥θ∥²_{αC0}, with corresponding extended operators Ã_i := (A_i, (αC0)^{-1/2})^T and data ỹ†_i := (y†_i, 0)^T.

Ensemble Kalman inversion and its variants
We aim to solve the inverse problem (2.1) using the Ensemble Kalman inversion (EKI) framework. We will focus in the following on the continuous-time limit of the Ensemble Kalman inversion, cp. [30].
We define the initial ensemble to be θ0 = (θ0^(j))_{j∈J} such that (θ0^(j) − θ̄0)_{j∈J} is a linearly independent family, with Nens ∈ N, Nens ≥ 2, and J := {1, ..., Nens}, where θ̄0 denotes the empirical mean of the initial ensemble. The basic Ensemble Kalman inversion proceeds by moving the particles in the parameter space according to the following dynamical system:

dθ^(j)(t)/dt = −Ct A^T (Aθ^(j)(t) − y†), j ∈ J,

where Ct denotes the empirical covariance of the ensemble (θ^(j)(t))_{j∈J}. The linearity of the forward model leads to the following equivalent reformulation of the dynamical system:

dθ^(j)(t)/dt = −Ct ∇θ Φ(θ^(j)(t); y†), j ∈ J. (2.4)

Intuitively, the dynamics represent parallel gradient flows minimising (2.2), which are coupled through the empirical covariance Ct. This empirical covariance can be viewed as a preconditioner. A reformulation of the right-hand side reveals the so-called subspace property, i.e. the particles (θ^(j)(t))_{j∈J} lie in the span of the initial ensemble (θ0^(j))_{j∈J} for any time t ≥ 0, cp. [20]. Hence, we can assume that the parameter space is finite-dimensional with X := R^{Nens} and will, w.l.o.g., do so in the following.
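The preconditioned gradient flow above can be sketched with an explicit Euler discretisation; a minimal illustration with Γ = Id and arbitrary, made-up problem sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative linear inverse problem (sizes arbitrary), Gamma = Id.
d, k, n_ens = 3, 8, 5
A = rng.normal(size=(k, d))
y = A @ rng.normal(size=d) + 0.05 * rng.normal(size=k)

ens = rng.normal(size=(n_ens, d))     # rows are the particles theta^(j)

def emp_cov(ens):
    """Empirical covariance C_t of the particle ensemble."""
    dev = ens - ens.mean(axis=0)
    return dev.T @ dev / len(ens)

misfit0 = np.linalg.norm(A @ ens.mean(axis=0) - y)

# Explicit Euler discretisation of the preconditioned gradient flow
# d theta^(j)/dt = -C_t A^T (A theta^(j) - y).
dt = 5e-3
for _ in range(2000):
    C = emp_cov(ens)
    resid = ens @ A.T - y                 # rows: A theta^(j) - y
    ens = ens - dt * (resid @ A) @ C      # row form of -C_t A^T resid

misfit = np.linalg.norm(A @ ens.mean(axis=0) - y)
```

The coupling through `C` is what distinguishes this from `n_ens` independent gradient flows; as the ensemble collapses, `C` shrinks and the dynamics slow down, which is the algebraic-rate phenomenon discussed below.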
We define the Tikhonov-regularised Ensemble Kalman inversion (TEKI) as the solution of the following ODE:

dθ^(j)(t)/dt = −Ct ∇θ Φ^reg(θ^(j)(t); y†), j ∈ J. (2.6)

The ensemble of particles converges to the empirical mean following the dynamics given by (2.6). This results in the so-called ensemble collapse, and thus the degeneracy of the ensemble covariance operator, which leads to an algebraic convergence rate rather than an exponential rate (compared to gradient flows). Variance inflation is a technique mitigating this effect by adding an operator to the degenerate Ct. Let Cvi : X → X be a covariance operator on X. We define the variance-inflated Ensemble Kalman inversion as the solution of the ODE

dθ^(j)(t)/dt = −(Ct + αvi Cvi) ∇θ Φ(θ^(j)(t); y†), j ∈ J, (2.7)

for αvi > 0. By replacing Φ by Φ^reg in (2.7), one obtains the variance-inflated Tikhonov-regularised Ensemble Kalman inversion.
The convergence of the EKI estimate to the true parameter is restricted to the span of the initial ensemble. More precisely, the particles stay in the affine space θ0^⊥ + E for all t ≥ 0, cp. [12, Corollary 3.8], where E = span{e^(1)(0), ..., e^(Nens)(0)} with e^(j) = θ^(j) − θ̄ (j ∈ {1, ..., Nens}), θ0^⊥ = θ(0) − P_E θ(0), and P_E being the projection onto the subspace E. This implies that the accuracy of EKI is bounded below by the accuracy of the best approximation in θ0^⊥ + E. We summarise in the following the main convergence results for the various variants of EKI.
Theorem 2.3 ([30, Theorem 3.3], [12, Theorem 3.13]) Assume that the mapped ensemble spread span{Ae^(1)(0), ..., Ae^(Nens)(0)} is sufficiently rich. Then, the residuals of EKI mapped under the forward operator converge to 0. It holds that
• the rate of convergence for EKI without variance inflation is algebraic;
• the rate of convergence for EKI with variance inflation is exponential, i.e. of order exp(−ct) for a constant c > 0.
Assume that E = X. Then, the particles of TEKI converge to the minimiser of the regularised least-squares problem θ†_reg, which is given by θ†_reg := (A^T A)^{-1} A^T y and, respectively, θ†_reg := (Ã^T Ã)^{-1} Ã^T ỹ when considering regularisation. It holds that
• the rate of convergence for TEKI without variance inflation is algebraic;
• the rate of convergence for TEKI with variance inflation is exponential, i.e. of order exp(−ct) for a constant c > 0.
The assumption E = X is rather restrictive and usually not satisfied in practice. We discuss in the following the generalisation of the convergence result to the more general setting θ0^⊥ + E ⊂ X. The best approximation in this space is given by the solution θ†_E of the constrained optimisation problem

min_{θ ∈ θ0^⊥ + E} Φ^reg(θ; y†), (2.8)

which can be equivalently formulated as the unconstrained optimisation problem

min_w Φ^reg(θ0^⊥ + Ew; y†),

with E denoting a basis of E (w.l.o.g. dim(E) = Nens − 1). Then, TEKI can be formulated in the coordinate system of E, and the convergence results can be straightforwardly generalised. Note that convergence then follows to the minimiser of (2.8), i.e. the best approximation of (2.3) in the affine space θ0^⊥ + E. We refer to [12] for more details on the derivation of the convergence result. We will denote in the following the optimiser of the constrained optimisation problem by θ⋆.

Subsampling in continuous time
In this work, we are interested in certain stochastic approximations of ODEs; indeed, we study EKIs in which we randomly replace the potential Φ^reg by one of the (Φ^reg_i)_{i∈I}. We now introduce and study a framework in which we are able to consider the subsampling of (actually, more general) flows, before then discussing the subsampling in EKI.

A general framework and result
Let F_i : X × [0, ∞) → X be Lipschitz continuous for i ∈ I. Moreover, we define F := Σ_{i∈I} F_i / N_sub. We study the full dynamical system (θ(t))_{t≥0}, given by

dθ(t)/dt = −F(θ(t), t), (3.1)

and also the subsampled dynamical systems

dθ(t)/dt = −F_i(θ(t), t), i ∈ I. (3.2)

We denote the flows with respect to (F_i)_{i∈I} by (φ^i_t)_{t≥0, i∈I}, i.e. φ^i_0(θ0) = θ0. In the same way, we denote the flow with respect to F by φ_t.
Solving or approximating the full dynamical system (3.1) can be computationally expensive, especially if N_sub is large. We will now discuss an approximation strategy for this full dynamical system that replaces the full system at any point in time by a randomly selected subsampled system. Hence, in any small time interval, we only need to evaluate (3.2) for a single i ∈ I.

The index process (i(t))_{t≥0} can be generated as follows:
1. sample i(0) uniformly from I and set t := 0;
2. sample a waiting time τ from an exponential distribution whose rate is determined by the current learning rate;
3. set i(s) := i(t) for s ∈ (t, t + τ], sample i(t + τ) uniformly from I, set t := t + τ, and go to 2.
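The waiting-time procedure above can be simulated directly; a minimal sketch with a constant mean waiting time `eta` and uniform resampling of the index (one simple choice consistent with the description):

```python
import random

random.seed(7)

def index_process(n_sub, t_end, eta):
    """Simulate the piecewise-constant index process (i(t)):
    hold the current index for an exponential waiting time with mean eta,
    then jump to a uniformly resampled index (Gillespie-style)."""
    t = 0.0
    i = random.randrange(n_sub)
    jump_times, states = [0.0], [i]
    while True:
        t += random.expovariate(1.0 / eta)   # step 2: exponential waiting time
        if t >= t_end:
            break
        i = random.randrange(n_sub)          # step 3: jump to a uniform index
        jump_times.append(t)
        states.append(i)
    return jump_times, states

jump_times, states = index_process(n_sub=6, t_end=1.0, eta=0.1)
```

In the subsampling schemes below, `eta` is replaced by a time-dependent learning rate η(t), so the jumps become more frequent as η(t) decreases.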
Hence, the process (i(t))_{t≥0} is a piecewise constant process that jumps from one state to another after random waiting times. There are several other characterisations of the process (i(t))_{t≥0}; we refer the reader to [1] for general CTMPs on discrete spaces. The algorithmic procedure above goes back to Gillespie [17]. Properties of this particular CTMP have been studied in [23]. We can now define the stochastic approximation process for (F_i)_{i∈I} and (i(t))_{t≥0}.

Definition 3.1 We define the stochastic approximation process given by the family of flows (F_i)_{i∈I} and the index process (i(t))_{t≥0} by the tuple (i(t), θ(t))_{t≥0}, with

dθ(t)/dt = −F_{i(t)}(θ(t), t), θ(0) = θ0.

In the following, we are interested in the long-time behaviour of the stochastic approximation process.

Assumption 3.2 Let K ∈ N and X := R^K. Let (i)-(ii) hold for any i ∈ I:
(i) F_i(·, ·) is continuously differentiable;
(ii) there is a measurable function h : [0, ∞) → R such that ⟨θ0 − θ1, F_i(θ0, t) − F_i(θ1, t)⟩ ≥ h(t)∥θ0 − θ1∥² for any two initial values θ0, θ1 ∈ X.
Note that Assumption 3.2(ii) already implies that the flow of −F is exponentially contracting. The Banach fixed-point theorem implies that the flow has a unique stationary point, which we denote by θ* ∈ X. We now generalise one of the main results of [23] by showing that the stochastic process converges to the unique stationary point θ* of the flow (φ_t)_{t≥0} if the learning rate goes to zero. Convergence is measured in terms of the Wasserstein distance

W_q(π, π′) := inf_{γ ∈ C(π,π′)} ∫ ∥θ − θ′∥^q γ(dθ, dθ′),

where q ∈ (0, 1] and C(π, π′) is the set of couplings of the probability measures π, π′ on (X, B_X).

Ensemble Kalman inversion with subsampling
We introduce two possibilities of subsampling: EKI with single subsampling and EKI with batch subsampling. In the first case, we choose a single subsample y†_i from y† and essentially replace the potential in (2.4) by the subsampled potential Φ_i. In the batch subsampling case, we pick a total of Nens subsamples y†_{i(1)}, ..., y†_{i(Nens)}, i.e. one subsample for each particle in the ensemble. Then, we evolve each of the ensemble members with respect to its data subsample, i.e. θ^(j)(t) is evolved with respect to Φ_{i(j)}, for j = 1, ..., Nens.
After one remark, we continue by defining the single subsampling.
Remark 3.4 When defining the subsampling algorithms, we will only refer to the basic Ensemble Kalman inversion. Of course, it is possible to combine subsampling with Tikhonov regularisation. In this case, we replace the potential Φ_i in (3.4) with its regularised counterpart Φ^reg_i, for i ∈ I. In the same way, one can combine subsampling with variance inflation, by replacing the empirical covariance Ct with the inflated covariance (Ct + α′ Cvi).

Single subsampling
The essential idea is now the following: at every point in time, we follow the EKI flow of the potential Φ_i, for one random i ∈ I, as determined by (i(t))_{t≥0}. Indeed, the Ensemble Kalman inversion with single subsampling is defined via the dynamical system

dθ^(j)(t)/dt = −Ct ∇θ Φ_{i(t)}(θ^(j)(t); y†_{i(t)}), j ∈ J. (3.4)

We illustrate this single subsampling strategy in Figure 3.1.
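A minimal discretised sketch of this single subsampling flow, with made-up problem sizes and with deterministic switching as a crude stand-in for the random waiting times of the index process:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative noise-free linear problem split into n_sub blocks (Gamma = Id).
d, k, n_sub, n_ens = 3, 12, 4, 5
A = rng.normal(size=(k, d))
theta_true = rng.normal(size=d)
y = A @ theta_true
A_blocks, y_blocks = np.array_split(A, n_sub), np.array_split(y, n_sub)

ens = rng.normal(size=(n_ens, d))
dt, switch_every = 1e-2, 50    # fixed switching interval instead of the
                               # random exponential waiting times

misfit0 = np.linalg.norm(A @ ens.mean(axis=0) - y)
i = rng.integers(n_sub)
for step in range(4000):
    if step % switch_every == 0:
        i = rng.integers(n_sub)                  # resample the active subset
    dev = ens - ens.mean(axis=0)
    C = dev.T @ dev / n_ens
    resid = ens @ A_blocks[i].T - y_blocks[i]    # residuals on active block
    ens = ens - dt * (resid @ A_blocks[i]) @ C   # EKI flow of Phi_i only

misfit = np.linalg.norm(A @ ens.mean(axis=0) - y)
```

All particles feel the same block at any time; the batch variant would instead draw one index per particle inside the loop.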

Analysis of the ensemble Kalman inversion with subsampling in the linear setting
We present in the following a convergence theory for the two subsampling strategies in the linear setting. Our goal is to verify Assumption 3.2, in particular the condition

⟨θ0 − θ1, F_i(θ0, t) − F_i(θ1, t)⟩ ≥ h(t)∥θ0 − θ1∥²

for t large enough, with −F_i(θ(t), t) denoting the right-hand side of the dynamical system and h : [0, ∞) → R being a measurable function. We will see that the analysis of the ensemble collapse plays a central role in the construction of the function h. In order to derive convergence results in the parameter space, we will focus in the following on the regularised setting, i.e. we consider the potential Φ^reg.

TEKI with variance inflation
The variance inflation allows us to explicitly control the ensemble collapse by controlling the preconditioner in the following way:

Theorem 4.1 (Single subsampling TEKI with variance inflation) Let (θ^(j)(t))_{t≥0, j∈J} satisfy

dθ^(j)(t)/dt = −(Ct + αvi Cvi) ∇θ Φ^reg_{i(t)}(θ^(j)(t); y†_{i(t)}), j ∈ J,

with index process (i(t))_{t≥0} and with α, αvi > 0, Ct denoting the empirical covariance matrix of the particles θ^(j)(t), and C0, Cvi being symmetric positive definite matrices. The tuple (i(t), θ(t))_{t≥0} denotes the single subsampling TEKI solution with variance inflation. Then, the process converges to the minimiser θ⋆ in the sense of Theorem 3.3.

Proof. By Theorem 3.3, we know that convergence of (θ^(j)(t)), for j ∈ J, follows if Assumption 3.2 is satisfied. The continuous differentiability of the right-hand side follows straightforwardly from the definition of the TEKI. The contractivity inequality holds with

h(t) = (λmin(Ct) + αvi λmin(Cvi)) min_{i∈I} λmin(Ã_i^T Ã_i),

where obviously h(t) ≥ αvi λmin(Cvi) min_{i∈I} λmin(Ã_i^T Ã_i) > 0. The details on the construction of h are given in Lemma A.5. □

TEKI without variance inflation
We have seen that control of the smallest eigenvalue of the preconditioner, i.e. the empirical covariance, is crucial in order to prove convergence. We will consider in the following the more general case, where the smallest eigenvalue converges to 0 with a rate such that ∫_0^∞ h(t) dt = ∞ still holds true and convergence follows by Theorem 3.3.
Theorem 4.3 (Single subsampling TEKI without variance inflation) Let (θ^(j)(t))_{t≥0, j∈J} satisfy

dθ^(j)(t)/dt = −Ct ∇θ Φ^reg_{i(t)}(θ^(j)(t); y†_{i(t)}), j ∈ J,

with index process (i(t))_{t≥0} and with α > 0, Ct denoting the empirical covariance matrix of the particles θ^(j)(t), and C0 being a symmetric positive definite matrix. The tuple (i(t), θ(t))_{t≥0} denotes the single subsampling TEKI solution without variance inflation. Then, the process converges to the minimiser θ⋆ in the sense of Theorem 3.3.

Proof. The continuous differentiability of the right-hand side follows with the same argument as in Theorem 4.1. The contractivity inequality holds with h(t) = λmin(Ct) min_{i∈I} λmin(Ã_i^T Ã_i), where the minimal eigenvalue of Ct decays at an algebraic rate. The details on the minimal eigenvalue are given in Appendix A.6. □

In the case of batch subsampling, the rate of convergence of each particle is exponential for a fixed data set, as the ensemble collapse is prevented (under suitable assumptions on the data). This is in contrast to the single subsampling case, where the rate of convergence is algebraic due to the ensemble collapse. This can be shown as follows:

Lemma 4.4 Given the N_sub subsamples ỹ†_1, ..., ỹ†_{N_sub}, assume that the centred initial ensemble is a generator of the full space X, i.e. span{e^(j)_0, j ∈ J} = X. We further assume that Σ_{j=1}^{Nens} (θ†_j − θ̄†)(θ†_j − θ̄†)^T has full rank d. Then, for fixed data subsample i ∈ I, the particles converge exponentially fast, i.e. θ^(j) → θ†_i, with θ†_i denoting the minimiser of Φ_i.

Proof. Please see Appendix A.4. □
The assumption on the ensemble spread being a generating set of the parameter space X is rather restrictive and usually not satisfied in practice. The generalisation of the result is, however, straightforward when working in the coordinate system of the linear subspace spanned by the initial ensemble, using the projection P_E.
The batch subsampling case leads to exponential convergence rates for fixed data; however, as the minimum eigenvalue depends on the initial data, the convergence result cannot be readily applied. We modify the process such that we can control the minimum eigenvalue, similarly to the variance inflation technique above. However, instead of considering a static lower bound, we now allow the minimum eigenvalue to decrease at a certain rate.

Theorem 4.5 (Batch subsampling TEKI with diminishing variance inflation) Let (θ^(j)(t))_{t≥0, j∈J} satisfy

dθ^(j)(t)/dt = −(Ct + (αvi/t) Cvi) ∇θ Φ^reg_{i(t;j)}(θ^(j)(t); y†_{i(t;j)}), j ∈ J, (4.2)

with index processes (i(t; j))_{t≥0, j∈{1,...,Nens}} and with α, αvi > 0, Ct denoting the empirical covariance matrix of the particles θ^(j)(t), and C0, Cvi being symmetric positive definite matrices. The tuple (i(t; j), θ(t))_{t≥0} denotes the batch subsampling TEKI solution with variance inflation. Then, the process converges to the minimiser θ⋆ in the sense of Theorem 3.3.

Proof. The proof follows the same lines as the proof of Theorem 4.1, the only difference being that here we have h(t) = (αvi/t) λmin(Cvi) min_{i∈I} λmin(A_i^T A_i). The diminishing rate, however, is slow enough to obtain ∫_0^∞ h(t) dt = ∞. □

We now move on to proving the convergence of the subsampled EKI. We start by defining an auxiliary EKI subsampling process. Let ε ∈ (0, 1) and let (θ^(•,ε)(t))_{t≥0} denote the corresponding auxiliary process, where (i′(t, ε))_{t≥0} is the CTMP on I with transition rate matrix B(t ∧ ε^{-1}).

Proposition 4.6 Let ε > 0. Then, the process (θ^(•,ε)(t), i′(t, ε))_{t≥0} has a unique stationary measure µ_ε. Moreover, for every θ0 ∈ X^{Nens} and i0 ∈ I there are c, c′ > 0 such that the process converges to µ_ε at exponential rate c exp(−c′t).

Proof. 1. We note that for any initial value θ0 ∈ X^{Nens} there is a compact set M ⊆ X^{Nens}, M ∋ θ0, from which the process (θ^(j,ε)(t), i′(t, ε))_{t≥0} cannot escape. See [2] for details. Moreover, we note that under the hypothesis of Theorem 4.4, we can ensure that θ^(•,ε)(t) ∉ X_diag := {(θ1, ..., θ_{Nens}) ∈ X^{Nens} : ∃ j, j′ ∈ J, θ_j = θ_{j′}}.
2.
We now show that two coupled processes (θ^(j,ε)(t), i′(t, ε))_{t≥0} and (θ^(j,ε)_*(t), i′(t, ε))_{t≥0}, starting at different initial points θ0, θ0,*, contract in the Wasserstein distance. Let (i′(t, ε; ·))_{t≥0} be a realisation of the index process. Then, again, we try to find a continuous function h bounding the contraction rate; Lemma A.5 gives us h_{i′}(t) = (αvi/(1+t)) λmin(Cvi) min_{i∈I} λmin(A_i^T A_i). Note that we are using variance inflation, however one that is diminishing at rate t^{-1}. This is slow enough so that we obtain ∫_0^∞ h_{i′}(t) dt = ∞. □

In Proposition 4.6, we showed that the auxiliary process (θ^ε(t))_{t≥0} is ergodic (ε > 0) and converges to a stationary measure. In the following, we will show that (θ^ε(t))_{t≥0} → (θ(t))_{t≥0} as ε → 0 and then also that θ_t → P_Y θ† as t → ∞.

Theorem 4.7 Under the assumptions of Proposition 4.6, the subsampled EKI process converges in the limits ε → 0 and t → ∞.

Proof. The result follows from [23]. □

Numerical Experiments
We now test our methodology in two numerical experiments: first, we aim to estimate the source term in a 1D parabolic PDE using measurements of its solution. Then, we estimate the log-diffusion coefficient in a 2D elliptic PDE, again using measurements of the solution.

1D-Heat equation
In this experiment we consider a one-dimensional heat equation, given by the following differential equation:

∂_t u(t, x) = ∂_xx u(t, x) + f(x), (t, x) ∈ (0, T] × (0, 1),

with suitable initial and boundary conditions. Our goal in this inverse problem is to estimate the unknown forcing f from perturbed measurement data that we have obtained from the solution. Indeed, the inverse problem is given by

y = O(G(f)) + η,

where G is the solution operator of the PDE, O is the observation operator, and η ∼ N(0, Γ), with Γ = 0.1² Id_K. The forcing is assumed to be a Gaussian random field with zero mean and covariance C(s, t) = σ² exp(−|s − t|²/L_sc), where (s, t) ∈ [0, 1] × [0, 1], σ² = 10 and L_sc = 0.1. We simulate the random field using a KL expansion, which is truncated after 8 terms, i.e. f(x, ω) = Σ_{i=1}^{8} λ_i^{1/2} e_i(x) ξ_i(ω), where λ_i are the largest eigenvalues of C(s, t), e_i(x) the corresponding eigenfunctions, and ξ_i standard normally distributed random variables. Furthermore, we use a spatial step size of h = 0.01, a time step size of ∆ = 0.05 and a time horizon of T = 0.3. Hence, we have 6 time steps. The PDE solution is then computed with the Crank-Nicolson method.
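The truncated KL sampler described above can be sketched on the discrete grid; a minimal Python version that eigendecomposes the covariance matrix on the 99 interior points and keeps the 8 leading modes (the exact length-scale convention of the kernel is taken verbatim from the text):

```python
import numpy as np

rng = np.random.default_rng(3)

# Covariance matrix of the Gaussian field prior on the interior grid points,
# with sigma^2 = 10 and L_sc = 0.1 as in the experiment.
x = np.linspace(0.01, 0.99, 99)
sigma2, L_sc = 10.0, 0.1
Cmat = sigma2 * np.exp(-np.subtract.outer(x, x) ** 2 / L_sc)

# Discrete Karhunen-Loeve expansion: keep the 8 leading eigenpairs and set
# f = sum_i sqrt(lambda_i) e_i xi_i with xi_i standard normal.
lam, E = np.linalg.eigh(Cmat)                 # ascending eigenvalues
lam, E = lam[::-1][:8], E[:, ::-1][:, :8]     # top 8 modes
xi = rng.standard_normal(8)
f_sample = E @ (np.sqrt(np.maximum(lam, 0)) * xi)
```

The discrete eigenvectors here are unit vectors in R^99; a quadrature-weighted eigendecomposition would be needed to match the continuous eigenfunctions exactly.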
To solve the inverse problem, we consider the numerically computed solutions of u(t, x) at each time step as one subsample. Therefore, we have N_sub = 6 subsamples. Additionally, the choice of the step size h yields dim(X) = 99 interior points that we seek to estimate. Furthermore, we have N_obs = 6 · 99 = 594 observations and use Nens = 5 particles. Our initial ensemble is assumed to have the same distribution as the Gaussian random field that we choose for the forcing f. We sample from the random field by making Nens independent draws from f(x, ω). Therefore, the generated solutions will lie in the subspace f0^⊥ + E, where E is the linear span of the centred initial ensemble and f0^⊥ = f(0) − P_E f(0). The best approximation in this space is given by the solution f†_E of the constrained optimisation problem (2.8).
Here, f†_E minimises the equivalent reduced problem min_w (1/2)∥ỹ − Ã(f0^⊥ + Ew)∥², with E denoting a basis of E, and Ã and ỹ corresponding to the regularised versions of the respective variables. This solution can be computed analytically via the normal equations of the reduced least-squares problem, which yields the reference solution in the parameter space. We simulate N = 32 runs and illustrate the mean absolute error of the runs in the parameter space as well as in the observation space for single subsampling and batch subsampling, and compare them to the EKI. Lastly, we also illustrate the mean ensemble collapse for each particle. The ODE solutions of the EKI as well as of our subsampling methods are computed using MATLAB's ode45 ODE solver.
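The reference solution and its best approximation in the ensemble subspace can be computed exactly along these lines; a small sketch with made-up sizes and C0 = Id for simplicity (the basis `E` and offset below are stand-ins, not the experiment's actual ensemble):

```python
import numpy as np

rng = np.random.default_rng(4)

d, k, alpha = 6, 20, 0.5
A = rng.normal(size=(k, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=k)

# Extended operator and data: A_tilde = [A; sqrt(alpha) Id], y_tilde = [y; 0],
# turning the Tikhonov functional into a plain least-squares fit (C0 = Id).
A_t = np.vstack([A, np.sqrt(alpha) * np.eye(d)])
y_t = np.concatenate([y, np.zeros(d)])

# Unconstrained Tikhonov minimiser theta_reg = (A~^T A~)^{-1} A~^T y~.
theta_reg = np.linalg.solve(A_t.T @ A_t, A_t.T @ y_t)

# Best approximation in the affine ensemble subspace theta0_perp + span(E):
# parameterise theta = theta0_perp + E w and solve the reduced least squares.
n_ens = 4
E = rng.normal(size=(d, n_ens - 1))     # stand-in basis of the ensemble span
theta0_perp = np.zeros(d)               # zero offset for this sketch
w = np.linalg.lstsq(A_t @ E, y_t - A_t @ theta0_perp, rcond=None)[0]
theta_E = theta0_perp + E @ w

def res(th):
    """Residual of the extended (regularised) least-squares problem."""
    return np.linalg.norm(A_t @ th - y_t)
```

By construction the subspace solution can never beat the unconstrained Tikhonov minimiser, which is the lower bound on the EKI accuracy discussed in Section 2.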

TEKI with variance inflation
The first experiment uses constant variance inflation of magnitude αvi = 0.01 and a regularisation parameter of β = 10. Furthermore, the learning rate decays at exponential speed, i.e. η(t) = a exp(−bt), with a = 0.01 and b = 10. We compute the solution up until T = 1. With those parameter choices, we obtain approximately 2 · 10^5 data switches. We illustrate our results in semi-log plots, i.e. we only apply the log scale to the y-axis.
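The reported number of data switches can be sanity-checked from the schedule: assuming the index process jumps at rate 1/η(t), the expected number of switches up to time T is the integral of η(t)^{-1}, which for the exponential schedule has a closed form:

```python
import math

# With learning rate eta(t) = a * exp(-b t) and jumps at rate 1/eta(t),
# the expected number of subsample switches up to time T is
#   E[#switches] = int_0^T exp(b t) / a dt = (exp(b T) - 1) / (a b).
a, b, T = 0.01, 10.0, 1.0
expected_switches = (math.exp(b * T) - 1) / (a * b)  # roughly 2.2e5
```

This matches the roughly 2 · 10^5 switches observed in the experiment.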
In Figure 5.2, the mean ensemble collapse of the particles is illustrated. Again, we can see that the collapse happens at an exponential rate. It is interesting to note that there seem to be more fluctuations in batch subsampling than in single subsampling, and that batch subsampling is also a bit slower than single subsampling (note the different scaling of the y-axis).

TEKI with diminishing variance inflation
In this simulation we consider diminishing variance inflation. We illustrate the results using the same parameters as chosen in the experiment with constant variance inflation. We let the variance inflation vanish at a linear rate, i.e. the gradient flow is given by (4.2) with αvi = 0.01. We use again an exponentially decaying learning rate and illustrate the results for EKI, single subsampling, and batch subsampling. We can see in Figure 5.3 similar results as in Figure 5.1: both subsampling methods converge at a similar rate towards the solution. However, they both converge at a slower rate than the EKI. Furthermore, this experiment shows less noise in the subsampling approaches as opposed to constant variance inflation. Figure 5.4 shows the ensemble collapse of all methods. We obtain similar results to the experiment conducted before.

TEKI without variance inflation
In this subsection we consider experiments without variance inflation. In Theorem 4.3 we proved convergence for single subsampling. However, we were not able to obtain a result for batch subsampling. We illustrate numerical results indicating that we can also expect convergence for batch subsampling without variance inflation.
We consider again β = 10 as regularisation parameter and compute the solution up until T = 10^6. This time we use a linearly decaying learning rate η(t) = (at + b)^{-1}, with a, b = 100. However, due to the decreasing switching times of the subsets, the algorithm becomes computationally slow. We therefore only use the linearly decaying learning rate up until T = 10^1; afterwards we consider 10^5 equidistant switching times. In those experiments we also included the mean errors ± one standard deviation. Both subsampling methods show similar results to the EKI. Our methods are therefore suitable alternatives to the EKI, since we obtain the same convergence rate at lower computational cost. Note that the mean minus standard deviation is negative in the right picture and therefore not shown. We can see in Figure 5.6 that the ensemble collapse also happens at an algebraic rate for EKI as well as for both of our subsampling methods. Furthermore, we illustrate the results of one single run: we simulate our prior distribution also by the same KL expansion, using 18 terms. We then make Nens = 20 independent draws to simulate our initial ensemble. Moreover, we do not use variance inflation, use a regularisation parameter of β = 10, compute the solution until time T = 10^7, and use, until time T = 10, the same linearly decaying learning rate as in the previous experiment with a, b = 10. Afterwards we consider again a constant learning rate. We obtain around 600 switching times until T = 10 using those parameters. In Figure 5.7, we can see again that the errors of all three solutions behave similarly. Single and batch subsampling compute nearly identical solutions; therefore, the errors of those two methods are almost identical. The only difference to EKI is the starting time at which the solution is evaluated. The randomly computed starting time of our subsampling approaches is a little higher than the starting point of the EKI. Therefore, there is a slight shift in the error curves. Furthermore, we illustrate how the computed solution behaves over time. We compare the solutions to the Tikhonov solution. Here one can see that there is a slight difference between single and batch subsampling at T = 0.1. However, both methods quickly converge to one another, as one can see in the computed results at T = 100. As mentioned above, the EKI is slightly faster due to beginning a bit earlier. However, at T = 10^7 all methods approximate the Tikhonov solution quite similarly.

Nonlinear 2D Darcy flow
We introduce in this section one experiment with a non-linear forward operator G. Even though our theory only covers the linear case, we illustrate that subsampling also leads to good results in the nonlinear setting. As subsampling strategy, we only consider single subsampling. The example is motivated by [11] and [16]. Consider the following elliptic PDE:

−∇ · (exp(u) ∇p) = f in D, p = 0 on ∂D, (5.1)
where D = (0, 1)². We seek to recover the unknown log-diffusion coefficient u given observations of the solution p ∈ H¹_0(D) ∩ H²(D) =: V. Furthermore, we assume that the scalar field f ∈ R is known. The observations are given by y = O(p) + η; the forward operator is the composition G = O ∘ G, where G denotes the solution operator of the PDE (5.1) and O : V → R^K the observation operator. We solve the PDE on a uniform mesh with a grid size of h = 2^{-8} using a FEM method with continuous, piecewise linear finite element basis functions. We model our prior distribution as the random field

u(x, ω) = Σ_{j=1}^{s} λ_j^{1/2} e_j(x) ξ_j(ω),

where λ_j = (π²(k_j² + l_j²) + τ²)^{-α} and e_j(x) = cos(π x_1 k_j) cos(π x_2 l_j), with τ = 0.01, α = 2, s = 25, (k_j, l_j)_{j∈{1,...,s}} ∈ {1, ..., s}². The variables ξ_j are i.i.d. standard normal variables. Afterwards, we make Nens = 10 independent draws for our initial ensemble. The dimension of the parameter space is d = 2^8 due to the grid size. We take K = 30 observations and divide them into N_sub = 5 subsets. We use a linearly decaying learning rate γ(t) = (a + bt)^{-1}, where a, b = 10. As regularisation factor we consider β = 10 and compute the solution up until time T = 10^5. Again, we note that due to the decrease of the switching times of the data sets, the algorithm becomes computationally very slow. Therefore, we only use a linearly decaying switching rate up until time T = 10^1; from there on, we consider 10^5 equidistant switching times.
Note that we again work in a subspace that is smaller than X. Therefore, we need to compare the computed solution with the corresponding reference solution in that subspace. The particles stay in the affine space u⊥_0 + E for all t ≥ 0; therefore, the reference solution is given by u⊥_0 + u†_{E,j}. We formulate this as a constrained optimisation problem and use MATLAB's fmincon solver to compute it.
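For intuition, the analogous reference computation in the linear case can be done with plain linear algebra: minimise the Tikhonov functional over the affine space spanned by the (centred) initial ensemble. The following numpy sketch is our own illustrative stand-in with random A, y and ensemble; it is not the paper's nonlinear fmincon setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n_ens, beta = 20, 8, 5, 1.0

A = rng.standard_normal((K, d))           # stand-in linear forward operator
y = rng.standard_normal(K)                # stand-in data
theta0 = rng.standard_normal((d, n_ens))  # initial ensemble (columns)

# orthonormal basis of the subspace E spanned by the centred initial ensemble
mean0 = theta0.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(theta0 - mean0, full_matrices=False)
E = U[:, S > 1e-12]
# component of the ensemble mean orthogonal to E (the affine offset u0_perp)
u0_perp = mean0[:, 0] - E @ (E.T @ mean0[:, 0])

# Tikhonov reference solution restricted to the affine space u0_perp + E:
# minimise ||A(u0_perp + E c) - y||^2 + beta ||u0_perp + E c||^2 over c
M = A @ E
lhs = M.T @ M + beta * np.eye(E.shape[1])
rhs = M.T @ (y - A @ u0_perp) - beta * (E.T @ u0_perp)
c = np.linalg.solve(lhs, rhs)
theta_ref = u0_perp + E @ c
```

Because the objective is quadratic, the restricted minimiser is obtained from a small normal-equations solve in the coefficient vector c rather than a general constrained optimiser.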
Figure 5.9 shows the computed solutions given by the different algorithms. The left picture shows the solution given by MATLAB's fmincon solver, the middle picture the solution computed by the standard EKI, and the right picture the result of the single-subsampling algorithm. One can see that both algorithms compute solutions visually very similar to that of the optimiser.

Conclusions
We have introduced subsampling schemes for EKI to allow the application of the method also in the large-data regime. Based on recent results on continuous-time stochastic gradient processes [23], two subsampling approaches have been considered in the continuous-time setting: (i) single subsampling, where each particle obtains the same data set when switching the data, and (ii) batch subsampling, where data sets may differ for each particle. By applying Tikhonov regularisation and variance inflation to both methods (i) and (ii), we were able to show convergence of the schemes to the solution of the original EKI version. For the non-variance-inflated variant of (i), convergence results with an algebraic rate could be proven. For batch subsampling we were only able to show convergence when using a variance inflation that vanishes over time. However, our numerical experiments in section 5.1 also showed similar convergence behaviour for non-variance-inflated batch subsampling. The corresponding analysis requires control of the eigenvalues of the empirical covariance with respect to the initial ensemble; this will be subject to future work. Further, in section 5.2 we also considered a numerical experiment with a nonlinear forward operator. Single subsampling without variance inflation shows convergence behaviour similar to the original EKI. The analysis of subsampling techniques for nonlinear forward operators will also be subject to future work.
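The difference between the two schemes amounts to how the data-subset index is drawn at each switching time. A minimal sketch, under our own naming rather than the paper's implementation:

```python
import numpy as np

def draw_subsets(n_sub, n_ens, mode, rng):
    """Return the data-subset index used by each particle at a switching time.

    single: all particles observe the same randomly drawn subset;
    batch:  each particle observes its own independently drawn subset.
    """
    if mode == "single":
        i = rng.integers(n_sub)
        return np.full(n_ens, i)
    elif mode == "batch":
        return rng.integers(n_sub, size=n_ens)
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(42)
print(draw_subsets(5, 4, "single", rng))  # e.g. all four entries identical
print(draw_subsets(5, 4, "batch", rng))   # e.g. four independent indices
```

In single subsampling one data-misfit evaluation per switching interval is shared by all particles, whereas batch subsampling decorrelates the particles' data streams at the price of per-particle bookkeeping.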
We will prove in the following that g^(j) → 0 exponentially fast as t → ∞. This is a sufficient and necessary optimality condition, as Ã^⊤_(i;j) Ã_(i;j) is positive definite due to the regularisation. We obtain that the gradients are monotonically decreasing. To prove convergence, we now show that the right-hand side of (A.2) is strictly negative. Note that, if g^(j) ≠ 0 and {e^(k)}_{k=1}^{N_ens} is still a generating set of X at time t, then there exists at least one k ∈ {1, ..., N_ens} such that ⟨e^(k), g^(j)⟩ ≠ 0. The quantity e^(j) satisfies
de^(j)(t)/dt = −C̃_t v^(j) = −(1/N_ens) Σ_{k=1}^{N_ens} ⟨v^(j), e^(k)⟩ e^(k),
with v^(j) = g^(j) − ḡ. Thus, the dynamical behaviour of the empirical covariance follows, the rank of the empirical covariance stays constant over time (cp. [27]), and the members {e^(j)}_{j=1}^{N_ens} still form a generating set of X at time t. Then, there exists at least one k in (A.2) such that ⟨e^(k), g^(j)⟩^2 ≠ 0, i.e. the gradients converge to 0. This implies the convergence θ^(j) → θ†_j due to the strong convexity. By assumption, the limit of the empirical covariance has full rank d since θ_j → θ†_j, i.e. the minimal eigenvalue λ_min(C̃_t) of the empirical covariance is bounded from below uniformly in time. Thus, we obtain exponential convergence, where λ_min > 0 denotes the lower bound on the minimal eigenvalue of the empirical covariance.

□
Lemma A.5 The particles θ^(j) converge at exponential speed to the unique solution of the regularised data misfit, i.e. ρ^(j)(t) → 0. Hence, there exists a (unique) θ†_j ∈ S = span{θ^(1)_0, ..., θ^(N_ens)_0}. Furthermore, let θ_1 and θ_2 be two coupled processes with different initial values θ_1(0), θ_2(0). Then there exists a measurable function h : [0, ∞) → R such that the following holds for t large enough.
Proof. 1. We first consider single subsampling with variance inflation. The gradients Ã^⊤_(i;j) ρ^(j) for a data stream that is constant (with respect to time and particle) satisfy the following differential equation for all subsets i.
The norm of the gradients thus satisfies a differential inequality which implies the exponential convergence of the mapped residuals and, by the injectivity of the modified forward operator, the exponential convergence in the parameter space to the (unique) solution of the regularised data misfit.
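The Grönwall step behind this exponential convergence statement can be sketched as follows, where c > 0 is a constant collecting the minimal-eigenvalue bounds from the surrounding lemmas (our notation):

```latex
\frac{\mathrm{d}}{\mathrm{d}t}\,
\bigl\|\tilde{A}_{(i;j)}^{\top}\rho^{(j)}(t)\bigr\|^{2}
\;\le\; -2c\,\bigl\|\tilde{A}_{(i;j)}^{\top}\rho^{(j)}(t)\bigr\|^{2}
\quad\Longrightarrow\quad
\bigl\|\tilde{A}_{(i;j)}^{\top}\rho^{(j)}(t)\bigr\|
\;\le\; \bigl\|\tilde{A}_{(i;j)}^{\top}\rho^{(j)}(0)\bigr\|\, e^{-ct}.
```

The implication is Grönwall's inequality applied to the squared gradient norm; injectivity of the modified forward operator then transfers the rate to the parameter error.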
Hence, there exists a function κ(t) ≥ 0 for all t > 0 that converges exponentially fast to 0 as t → ∞ such that ∥θ^(j)_t − P_Y θ†_j∥ ≤ κ(t) for all t ≥ 0. Due to linearity with respect to the initial values, the same bound holds for the coupled processes. The next step is to consider equation (ii) from Assumption 3.2. Note that θ_1 and θ_2 are column vectors consisting of the stacked particle vectors.
We want to show the estimate in Assumption 3.2 (ii). We can split the left-hand side into two parts and substitute the corresponding gradient flows, then consider both terms separately. For the first part, we use the positive definiteness of C̃^1_t for every t ≥ 0 in the third step. For the second term, we compare the rates of convergence using the results from above. Finally, for the covariance matrices we obtain a bound whose second term is ∥ū_1(ū_1)^⊤ − ū_2(ū_2)^⊤∥. Since we know that all particles converge with rate κ(t), the mean values also converge with the same rate. By the triangle inequality we see that this second term converges faster and can therefore be neglected.
2. In batch subsampling with variance inflation, the only difference is that the forward operator and the data both depend on the particle, i.e. the mapped residuals satisfy
dρ^(j)/dt = −Ã_{i(t;j)} (C̃_t + α_vi C_vi) Ã^⊤_{i(t;j)} ρ^(j)(t)
with ρ^(j) = Ã_(i;j) θ^(j) − ỹ_(i;j). The exponential convergence of each particle to the minimiser of the functional
(1/2) ∥Ã_(i;j) θ^(j) − ỹ_(i;j)∥^2 + (α/2) ∥θ∥^2_{C_0}
follows again from standard arguments with the Lyapunov function ∥Ã^⊤_(i;j) ρ^(j)∥^2. Hence, the convexity analysis does not change.
3. Single subsampling without variance inflation: again, by basic Lyapunov theory we obtain convergence at an algebraic speed, which is however enough to carry out the same analysis as above. Similarly to the above we obtain h(t) = λ_min(C̃_t) min_{i ∈ {1,...,N_sub}} λ_min(A^⊤_i A_i).
4. Batch subsampling with diminishing variance inflation: Theorem 4.4 gives the exponential convergence to the respective solution. The convexity analysis is then similar to the above calculations and we obtain h(t) = (α_vi/(1+t)) λ_min(C_vi) min_{i ∈ {1,...,N_sub}} λ_min(A^⊤_i A_i). □

A.1 Subsampling without variance inflation
Lemma A.6 For the regularised single-subsampling EKI flow, given by the solution of (4.1), the following lower bound for the smallest eigenvalue λmin(t) of the empirical covariance C(t) holds.

Figure 5.1: Mean absolute errors of computed solutions in the parameter (left) and observation space (right). The red line illustrates the EKI, the blue line single subsampling and the green line batch subsampling.

Figure 5.1 shows the mean relative error over all N = 32 runs with respect to the Tikhonov solution. The left figure shows the mean error in the parameter space and the right figure in the observation space. We can see that both methods converge towards the Tikhonov solution at an exponential rate, as indicated by the linear decay in the semi-log plots. They differ only in the constants.

Figure 5.3: Mean absolute errors of computed solutions in the parameter (left) and observation space (right). The red line illustrates the EKI, the blue line single-subsampling and the green line batch-subsampling.

Figure 5.5: Mean absolute errors ± standard deviation of computed solutions in the parameter (left) and observation space (right). The red line illustrates the EKI, the blue line single-subsampling and the green line batch-subsampling.

Figure 5.5 depicts the mean absolute error in the parameter and observation space relative to the Tikhonov solution. In these experiments we also include the mean errors ± one standard deviation. Both subsampling methods show results similar to the EKI. Our methods are therefore suitable alternatives to the EKI, since we obtain the same convergence rate at lower computational cost. Note that the mean minus standard deviation is negative in the right picture and therefore not shown.

Figure 5.6: Mean ensemble collapse of all N_ens = 5 particles for the three methods. The different colors represent the ensemble collapse of one ensemble member, respectively. Left figure: EKI; middle figure: single-subsampling; right figure: batch-subsampling.

Figure 5.7: Absolute error of computed solutions in the parameter (left) and observation space (right). The red line illustrates the EKI, the blue line single-subsampling and the green line batch-subsampling.

Figure 5.8: Development of the computed solution over time in comparison with the Tikhonov solution at times T = 10^{-1} (upper left), T = 10^2 (upper right), T = 10^5 (lower left) and T = 10^7 (lower right). The red line illustrates the EKI, the blue line single-subsampling, the green line batch-subsampling, and purple is the reference solution θ⋆.

Figure 5.8 shows the development of the computed solution over time. Here one can see that there is a slight difference between single and batch subsampling at T = 0.1. However, both methods quickly converge to one another, as one can see in the computed results at T = 100. As mentioned above, the EKI is slightly faster since it starts converging a bit earlier. However, at T = 10^7 all methods approximate the Tikhonov solution quite similarly.

Figure 5.10: Mean absolute errors ± standard deviation of computed solutions in the parameter (left) and observation space (right). The red line illustrates the EKI and the blue line single-subsampling.

Figure 5.10 depicts the mean errors of the N = 32 runs ± one standard deviation in the parameter space (left subplot) as well as of the functionals evaluated at the corresponding solutions (right subplot). We can see that both errors behave similarly and converge towards zero at an algebraic rate. As in the linear example, we note that the mean minus standard deviation is negative in the right picture and therefore not shown.

Figure 5.11: Mean ensemble collapse of all N_ens = 10 particles for the two methods. The different colors represent the ensemble collapse of one ensemble member, respectively. Left figure: EKI; right figure: single-subsampling.

Figure 5.11 shows the mean ensemble collapse of the N = 32 runs. Again, the left figure shows the results of the EKI, whereas the right figure shows single subsampling. We can see that in both methods the collapse occurs at a similar rate. One should note that single subsampling yields a smoother curve; this, however, is only due to the number of points at which we evaluate our solutions: the frequent subset switches required by subsampling yield more evaluation points.
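The ensemble-collapse quantity plotted here is a spread diagnostic: the distance of each particle to the ensemble mean. A minimal sketch under our own naming (the paper's exact definition may differ):

```python
import numpy as np

def ensemble_collapse(theta):
    """Per-particle distance to the ensemble mean.

    theta: array of shape (d, n_ens) with particles as columns.
    Returns an array of length n_ens; all entries tending to zero
    over time indicates ensemble collapse.
    """
    mean = theta.mean(axis=1, keepdims=True)
    return np.linalg.norm(theta - mean, axis=0)

# toy usage: three particles in two dimensions
theta = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 3.0]])
print(ensemble_collapse(theta))
```

Recording this vector at every evaluation time and plotting one curve per particle reproduces the structure of the collapse figures above.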