Gradient flows and randomised thresholding: sparse inversion and classification

Sparse inversion and classification problems are ubiquitous in modern data science and imaging. They are often formulated as non-smooth minimisation problems. In sparse inversion, we minimise, e.g., the sum of a data fidelity term and an L1/LASSO regulariser. In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy. Standard (sub)gradient descent methods have been shown to be inefficient when approaching such problems. Splitting techniques are much more useful: here, the target function is partitioned into a sum of two subtarget functions -- each of which can be efficiently optimised. Splitting proceeds by performing optimisation steps alternately with respect to each of the two subtarget functions. In this work, we study splitting from a stochastic continuous-time perspective. Indeed, we define a differential inclusion that follows one of the two subtarget functions' negative subdifferentials at each point in time. The choice of the subtarget function is controlled by a binary continuous-time Markov process. The resulting dynamical system is a stochastic approximation of the underlying subgradient flow. We investigate this stochastic approximation for an L1-regularised sparse inversion flow and for a discrete Allen-Cahn equation minimising a Ginzburg--Landau energy. In both cases, we study the longtime behaviour of the stochastic dynamical system and its ability to approximate the underlying subgradient flow to any accuracy. We illustrate our theoretical findings in a simple sparse estimation problem and also in low- and high-dimensional classification problems.


Introduction
Inverse and classification problems arise in a multitude of scientific disciplines. In inverse problems, we aim at estimating parameters of mathematical models through indirect measurements. A usual example is the reconstruction of an image of a patient's brain structure given measurements from X-rays that are attenuated by the patient's brain. In classification, our goal is to partition data into two or more different classes. In the previous example, we could aim to partition the pixels in the reconstructed image with respect to the area of the brain they refer to.
Inverse and classification problems can often be phrased as optimisation problems: minimising the sum of a negative log-likelihood -- representing the relationship between data and parameter -- and a regulariser -- forcing the parameter to satisfy certain regularity properties. In image reconstruction and classification, certain non-smooth regularisers have gained popularity. In image reconstruction or, generally, sparse inversion, an ℓ1/LASSO regulariser ensures that the image or object of interest is represented sparsely, e.g., within a dictionary or on a basis, see [16]. In binary classification, a (possibly non-smooth) Ginzburg-Landau functional is used to separate data into two classes by forcing the values to concentrate close to one of two values, e.g., −1 and 1, see [9]. Both sparse inversion and binary classification are highly relevant in modern data science and imaging, see, e.g., [2,50,51] and [36,46,48], respectively.
To understand the solution procedure of these optimisation problems, the associated subgradient flows have been studied extensively, see, e.g., [6,12,15,34]. The subgradient flows are interesting as they give a continuous-time representation of a subgradient descent-type algorithm. Due to the non-smoothness of the minimisation problem, the associated flows are often non-smooth as well. Thus, when attempting to construct an optimisation algorithm from a subgradient flow, one has to be very careful: non-smooth dynamical systems tend to be difficult to solve with traditional numerical techniques. In practice, the non-smoothness is often relaxed or optimisation methods are constructed with more advanced numerical techniques. In this work, we follow the second approach: we propose and discuss randomised splitting techniques in a continuous-time setting.

Problem setting and motivation.
Stochastic approximations of differential inclusions. Throughout this work, we study differential inclusions on a space X := R^n that are of the form

    dx/dt(t) ∈ F(x(t)) + G(x(t)) (t ≥ 0),

where F, G : X → 2^X describe set-valued maps (usually negative subdifferentials of potentials, see [11] for details). Here, we are particularly interested in cases where the two subflows

    dx_F/dt(t) ∈ F(x_F(t)), dx_G/dt(t) ∈ G(x_G(t)) (t ≥ 0), x_F(0), x_G(0) ∈ X,

are well-understood, i.e. they can be solved analytically or simulated in an efficient, stable, and accurate way. In general, there is no exact way of combining the paths (x_F(t))_{t≥0} and (x_G(t))_{t≥0} in a way that reconstructs (x(t))_{t≥0}. In this work, we study the stochastic approximation of (x(t))_{t≥0} that we obtain by following one of the trajectories (x_F(t))_{t≥0} or (x_G(t))_{t≥0} for a random waiting time before switching to the other, which we then again follow for a random waiting time, and so on. This stochastic approximation is indeed a randomised splitting technique. We now describe the dynamical system in more detail. Let (Ω, F, P) be a probability space and let i : Ω × [0, ∞) → {0, 1} be a continuous-time Markov process (CTMP) on {0, 1} that has transition rate matrix

    ( −λ    λ )
    (  λ   −λ ),

for some switching rate λ > 0. That means, (i(t))_{t≥0} is a piecewise constant, right-continuous process that stays in a state for an exponentially distributed waiting time δT ∼ Exp(λ) with mean E[δT] = 1/λ before switching to the other state.
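The switching process (i(t))_{t≥0} is straightforward to simulate by sampling the exponential waiting times directly. The following sketch (function names are ours, not the paper's) illustrates the construction:

```python
import random

def simulate_ctmp(lam, t_end, i0=0, rng=None):
    """Sample a path of the binary continuous-time Markov process (i(t))_{t>=0}.

    The process stays in its current state for an Exp(lam) waiting time
    (mean 1/lam) and then switches to the other state.  Returns the list of
    (switch_time, state_after_switch) pairs up to time t_end.
    """
    rng = rng or random.Random()
    t, i, path = 0.0, i0, [(0.0, i0)]
    while True:
        t += rng.expovariate(lam)   # Exp(lam) waiting time between switches
        if t >= t_end:
            return path
        i = 1 - i                   # switch to the other state
        path.append((t, i))
```

With rate λ, the expected number of switches on [0, T] is λT, so larger λ means faster alternation between the two subflows.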
The value i(t) determines which of the flows (x_F(t))_{t≥0} or (x_G(t))_{t≥0} we follow at time t ≥ 0. Indeed, we consider the following dynamical system:

    dx̃/dt(t) ∈ (1 − i(t)) F(x̃(t)) + i(t) G(x̃(t)) (t ≥ 0), x̃(0) ∈ X.

Following the work by Robbins and Monro [39], we refer to (x̃(t))_{t≥0} as a stochastic approximation of (x(t))_{t≥0}. In a setting where the differential inclusions are actually differential equations and sufficiently smooth, one can sometimes show that (x̃(t))_{t≥0} → (x(t))_{t≥0} in a weak sense, as λ → ∞. We refer to [26,30] for results of this type and a general perspective on stochastic approximation in continuous time.
Sparse inversion and classification. Throughout this work, we study the stochastic approximation of two specific subgradient flows that are used to solve two different non-smooth optimisation problems: a sparse inverse problem and a binary classification problem, respectively.
In both cases, we minimise the sum of a linear-quadratic data misfit and a certain regulariser. This regulariser is an ℓ1-norm in the sparse inversion case and a Ginzburg-Landau energy in the binary classification problem. We introduce the particular problems and subgradient flows in Sections 2 and 3, respectively. In both these cases, a direct (explicit) numerical simulation of the flows is highly inefficient due to non-smoothnesses in both differential inclusions. However, both differential inclusions can be separated into a linear ODE (the gradient of the data misfit and the linear-quadratic part of the regulariser) and a non-smooth part (the subgradient of the non-smooth part of the regulariser). The linear part alone can be solved analytically or (more reasonably) discretised at a high accuracy in a stable way. The non-smooth part alone can, in both cases, be solved analytically and results in a thresholding; while this is natural for the ℓ1-norm, the Ginzburg-Landau energy is modelled in a way that achieves this property. This thresholding naturally becomes a randomised thresholding within the stochastic approximation. Indeed,
• in sparse inversion, we subtract or add a random variable to move towards zero by a random amount, and
• in binary classification, we subtract or add a random variable to move towards −1 or 1 (the one that is already closer) by a random amount.
We discuss this thresholding in more detail in Subsections 2.1 and 3.2, respectively. For both these problems, (discrete-time) splitting techniques that separate the linear and the non-smooth part to alternately follow the linear flow and threshold are known and popular in practice: forward-backward splitting for sparse inversion and Merriman-Bence-Osher (MBO) for classification. In the present work, we analyse the non-discretised stochastic approximation, leading to randomised versions of these splitting techniques.

Stochastic?
The introduction of stochasticity appears a bit artificial in this setting; indeed, the algorithms mentioned above are fully deterministic. We employ it for four different reasons. Firstly, the stochasticity allows us to define a Markovian, homogeneous-in-time dynamical system. This not only simplifies the analysis, but also matches the discrete splitting algorithms, which are also Markovian, homogeneous-in-time dynamical systems. Secondly, in the continuous setting, we cannot usually hope for convergence of such two-step methods to a single point. Analysing ergodicity and the convergence to a probability measure feels easier for a random process. This probability measure describes what is called implicit regularisation in the machine learning context and is sometimes also of interest from an uncertainty quantification perspective. Speaking of machine learning: thirdly, stochastic optimisation methods have recently gained popularity in machine learning, where non-convex problems need to be regularised and optimised. When sparsity is to be enforced, stochastic optimisation methods could profit from a randomised thresholding instead of a deterministic thresholding, consider, e.g., [37]. Fourthly, deep neural networks are often compositions of smooth functions and thresholdings. Ideas presented here could possibly be used to analyse randomisation in neural networks; a connection that is particularly interesting as neural networks are popular tools in both image reconstruction and (binary) classification.

Contributions and outline.
The main contributions of the present work are:
• We propose and study stochastic approximations of subgradient flows in sparse inversion and binary classification. The stochastic approximations are continuous-time, randomised models for usual splitting schemes.
• We analyse the long-time behaviour of the processes. Indeed, we see that the stochastic approximation is usually ergodic in the case of sparse inversion, while we need strict assumptions to show ergodicity in the classification case.
• We study the convergence of the stochastic approximations to the underlying subgradient flows, as the switching rate λ → ∞. Thus, we show that the stochastic approximations can approximate the subgradient flow to any accuracy.
• We illustrate our findings in a simple sparse estimation problem and two spatial classification problems.
We consider non-smooth piecewise deterministic Markov processes, where the non-smoothness arises from the deterministic processes rather than only the stochastic switching/jumping (as opposed to, e.g., the Zigzag method, [10]). Thus, the techniques employed in our analysis may be of independent interest. This work is organised as follows. In Sections 2 and 3, we introduce and discuss the considered sparse inversion and classification problems, respectively, and derive our stochastic approximations. We show that all the considered processes possess the so-called Feller property and derive their generators in Section 4. We use this information to show ergodicity in Section 5 and the convergence of the stochastic approximations to their related subgradient flows in Section 6. In Section 7, we illustrate our results in numerical examples, before concluding in Section 8.

Sparse inversion
We commence with the sparse inversion problem. In the following, we introduce this problem, the subgradient flow, and its stochastic approximation.
We consider an optimisation problem of the form

    min_{θ∈X} Φ(θ), where Φ(θ) := (1/2)‖Aθ − y‖₂² + ‖θ‖₁ (θ ∈ X), (2.1)

with a linear operator A and data y. In the following, we assume that rank(A) = n. We denote the two parts of the target function Φ by Φ_d := (1/2)‖A· − y‖₂² and Φ_r := ‖·‖₁. We abbreviate Ā := AᵀA and b̄ := Aᵀy. As the optimisation problem (2.1) is non-smooth, we need to introduce a subdifferential to discuss possible solutions. Throughout this work, we consider the proximal subdifferential, which is defined for a potential Φ : X → R by

    ∂Φ(x) := { v ∈ X : there are c, s > 0 with Φ(x′) ≥ Φ(x) + ⟨v, x′ − x⟩ − c‖x′ − x‖² for all x′ ∈ B(x, s) } (x ∈ X).

To solve problem (2.1), we need to find a stationary point of Φ, i.e. a point θ* ∈ X at which

    Āθ* − b̄ + g = 0,

where g ∈ ∂Φ_r(θ*). The subdifferential is given by ∂Φ_r(θ) = S(θ₁) × ··· × S(θ_n), where

    S(z) := {sgn(z)} if z ≠ 0, and S(0) := [−1, 1].

A natural way to find this stationary point θ* is to follow the subgradient flow, which we call the sparse inversion flow:

    dζ/dt(t) ∈ −(1/2)(Āζ(t) − b̄) − (1/2)∂Φ_r(ζ(t)) (t ≥ 0), ζ(0) = ζ^(0), (2.3)

for some initial value ζ^(0) ∈ X. In this and other dynamical systems, we sometimes emphasise the dependence on the initial value by writing (ζ(t, ζ^(0)))_{t≥0} := (ζ(t))_{t≥0}. We scale the right-hand side in (2.3) by 1/2 for convenience. This dynamical system has a unique, locally absolutely continuous solution, see, e.g., Theorems 2.9 and 3.2 in [34]. The solution is also locally Lipschitz continuous and (ζ(t))_{t≥0} converges to the stationary point of Φ; we discuss this in Proposition 5.1. Essentially based on this subgradient flow, several techniques have been proposed in the past decades to find θ*, such as subgradient descent [43] or the already mentioned forward-backward splitting [18,25]. Indeed, the basic subgradient descent method follows the subgradient flow by discretising the system with a forward Euler discretisation

    θ_{k+1} := θ_k − h_k g_k, g_k ∈ (1/2)∂Φ(θ_k) (k = 0, 1, ...),

where (h_k)_{k=0}^∞ is a sequence of step sizes. This method has disadvantages in practice that are easily explained by the missing Lipschitz continuity of ∂Φ_r. The forward-backward splitting algorithm is considerably better: the idea is to split the system into two parts such that each of them can be treated easily. Thus, forward-backward splitting proceeds by alternately minimising Φ_d and Φ_r using a gradient and a proximal step, respectively, see also Algorithm 1. In the algorithm, we denote the signum function sgn(x) := 1[x > 0] − 1[x < 0], where 1[true] = 1, 1[false] = 0, the positive part f₊ = max{f, 0} of a function f, and the Hadamard (element-wise) product ⊙. Here and throughout the rest of this work, we sometimes apply scalar functions to vectors, in which case we mean componentwise application. The 'argmin' in the last line of the algorithm is the proximal operator of Φ_r with stepsize h_k.
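The forward-backward iteration can be sketched as follows (a minimal implementation of the alternating gradient/proximal idea; the function names and the constant step size are our choices, not the paper's Algorithm 1 verbatim):

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: sgn(z) * (|z| - tau)_+ ."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def forward_backward(A, y, h=0.1, n_iter=500):
    """Forward-backward splitting for min 1/2 ||A theta - y||^2 + ||theta||_1.

    Alternates a gradient step on the data misfit Phi_d with the proximal
    (soft-thresholding) step on the l1-regulariser Phi_r.
    """
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - y)              # gradient of Phi_d
        theta = soft_threshold(theta - h * grad, h)  # proximal step on Phi_r
    return theta
```

For A = Id, the iteration converges to the componentwise soft-thresholding of y, which is the known minimiser of (2.1) in that special case.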
In the following, we utilise this splitting idea to derive our stochastic approximation.

The stochastic approximation
We now split the dynamical system (2.3) into two subflows:

    dζ_d/dt(t) = −(1/2)(Āζ_d(t) − b̄), ζ_d(0) = ζ_{d,0}, (2.4)
    dζ_r/dt(t) ∈ −(1/2)∂Φ_r(ζ_r(t)), ζ_r(0) = ζ_{r,0}, (2.5)

for initial values ζ_{d,0}, ζ_{r,0} ∈ X. The linear differential equation (2.4) has a closed-form solution. We have:

    ζ_d(t) = exp(−(1/2)Āt) ζ_{d,0} + (Id − exp(−(1/2)Āt)) Ā⁻¹b̄ (t ≥ 0),

where, of course, exp(M) := Σ_{j=0}^∞ (1/j!) M^j is the matrix exponential of M ∈ R^{n×n}. For the subgradient flow (2.5), we can find the solution:

    ζ_r(t) = sgn(ζ_{r,0}) ⊙ (|ζ_{r,0}| − t/2)₊ (t ≥ 0).

For the uniqueness of this solution, we refer the reader again to [34]. Note that (ζ_r(t))_{t≥0} equals the proximal operator of the ℓ1-norm in Algorithm 1.
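Both closed-form solutions are easy to evaluate numerically. The sketch below assumes, as in the text, that rank(A) = n, so Ā = AᵀA is symmetric positive definite; the matrix exponential is computed via an eigendecomposition (an implementation choice of ours):

```python
import numpy as np

def linear_flow(A, y, zeta0, t):
    """Closed-form solution of d zeta/dt = -1/2 (Abar zeta - bbar), where
    Abar = A^T A and bbar = A^T y:
        zeta(t) = expm(-Abar t/2) zeta0 + (Id - expm(-Abar t/2)) Abar^{-1} bbar.
    """
    Abar, bbar = A.T @ A, A.T @ y
    w, V = np.linalg.eigh(Abar)                 # Abar symmetric positive definite
    E = V @ np.diag(np.exp(-0.5 * w * t)) @ V.T # matrix exponential expm(-Abar t/2)
    return E @ zeta0 + (np.eye(len(zeta0)) - E) @ np.linalg.solve(Abar, bbar)

def shrinkage_flow(zeta0, t):
    """Closed-form solution of d zeta/dt in -1/2 d||zeta||_1: each component
    moves towards 0 by t/2 and stops at 0 (componentwise shrinkage)."""
    return np.sign(zeta0) * np.maximum(np.abs(zeta0) - 0.5 * t, 0.0)
```

As t → ∞, the linear flow tends to Ā⁻¹b̄, the minimiser of the data misfit, while the shrinkage flow thresholds every component to 0 in finite time.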
In the following, we study the stochastic approximation of (2.3) that alternately follows the flows (2.4) and (2.5). To control which subflow is employed at which time, we employ the continuous-time Markov process (i(t))_{t≥0} that we have defined in Subsection 1.1. The stochastic approximation (i(t), θ(t))_{t≥0} of (ζ(t))_{t≥0} is then defined by

    dθ/dt(t) ∈ −(1 − i(t)) (1/2)(Āθ(t) − b̄) − i(t) (1/2)∂Φ_r(θ(t)) (t ≥ 0), θ(0) = θ^(0),

where θ^(0) ∈ X is an initial value. This stochastic process is well-defined with probability 1, as it is a piecewise construction of well-defined functions on non-empty intervals; note that the waiting times between two jumps of (i(t))_{t≥0} are exponentially distributed and, thus, strictly positive with probability 1.
Remark 2.1 Following the flow (2.5) for a random waiting time is a randomised thresholding operation, where the threshold is determined by δT ∼ Exp(λ). Thus, the dynamical system uses a randomised thresholding to approach the minimiser of the sparsity-enforcing regulariser Φ_r. We give a simple example for a subgradient flow (ζ(t))_{t≥0} and its stochastic approximation (θ(t))_{t≥0} in Figure 2.1. For small λ, we are able to recognise the structure of the stochastic approximation: smooth curves targeting the minimiser of Φ_d and linear curves pointing towards the minimiser of Φ_r. For large λ, we see that the stochastic approximation approaches the underlying subgradient flow. We finish this section by mentioning several related stochastic splitting methods. The idea of randomly choosing waiting times between proximal steps appears in the discrete-time method proposed by Mishchenko et al. [37]. There, the proximal step is skipped with a high probability, leading to a random amount of time that is spent in the gradient flow part. The thresholding is still deterministic. More traditional stochastic splitting methods have been discussed in several works, e.g., [4,40]. In these works, similar to Stochastic Gradient Descent [39], only random approximations of ∇Φ_d are available but not the gradient itself. The stochasticity there arises from those noisy gradients.
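The full stochastic approximation can be simulated exactly by gluing together the two closed-form subflows over exponential waiting times. The following self-contained sketch (our own naming and seeding conventions; the convention that i = 0 selects the linear flow is an assumption) illustrates the construction:

```python
import numpy as np

def sparse_inversion_sa(A, y, theta0, lam, t_end, seed=0):
    """Simulate the stochastic approximation (theta(t)) of the sparse
    inversion flow: follow the linear flow while i(t) = 0 and the shrinkage
    (randomised thresholding) flow while i(t) = 1, with Exp(lam) waiting
    times between switches.  Both pieces are evaluated in closed form."""
    rng = np.random.default_rng(seed)
    Abar, bbar = A.T @ A, A.T @ y
    w, V = np.linalg.eigh(Abar)                 # Abar symmetric positive definite
    theta, t, i = theta0.astype(float), 0.0, 0
    while t < t_end:
        dt = min(rng.exponential(1.0 / lam), t_end - t)  # Exp(lam) waiting time
        if i == 0:   # linear subflow, via the matrix exponential
            E = V @ np.diag(np.exp(-0.5 * w * dt)) @ V.T
            theta = E @ theta + (np.eye(len(theta)) - E) @ np.linalg.solve(Abar, bbar)
        else:        # shrinkage subflow: randomised thresholding by dt/2
            theta = np.sign(theta) * np.maximum(np.abs(theta) - 0.5 * dt, 0.0)
        t, i = t + dt, 1 - i
    return theta
```

For large λ, the simulated path should stay close to the sparse inversion flow (2.3), in line with the convergence results of Section 6.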

Classification
Next, we move on to the classification problem. We introduce the problem, the subgradient flow, and its stochastic approximation. Moreover, we show well-posedness of the subgradient flow.
We consider an optimisation problem of the form

    min_{η∈X} Φ(η),

where Φ is the sum of a data misfit on a subspace X′, a discrete gradient penalty, and a (scaled) double-well term; here, α, ε > 0, X′ is a subspace of X, d ∈ X′, P_{X′} : X → X′ is the canonical projection onto X′, ∇′ is some discrete gradient, and W : R → R is the non-smooth double-well potential; see also the top of Figure 3.1, where we plot the double-well potential.
Here, d ∈ X′ represents a possibly noisy training data set which we intend to use to classify the remainder of the data in η. The double-well potential is responsible for leading the entries of η to either 1 or −1, thereby constructing a classification of the entries of η. The penalisation of the gradient implies an additional localisation by forcing neighbouring components of η to be (approximately) constant 1 or −1. The meaning of 'neighbouring' is determined by the choice of the gradient, see also Remark 3.1.
A similar double-well potential has been employed by [13,14]; they actually used a related double-obstacle formulation. This double-well potential is rather non-classical; we employ it as its subgradient flow is a thresholding operation that we can evaluate analytically, see Subsection 3.2 below. The parameter ε > 0 controls whether the process is closer to a diffusion (ε ≫ 0) or to a strict binary classification (ε ≈ 0). The sum of the gradient penalty and the double-well term is a so-called Ginzburg-Landau energy, see, e.g., [47].
The gradient flow of a Ginzburg-Landau energy is an Allen-Cahn equation. In our discrete setting, we obtain the discrete Allen-Cahn equation (with fidelity forcing) (3.1), where we abbreviate the linear part by P, denote by △′ a certain discrete Laplacian, and where the scaling by 1/2 is again a matter of convenience. The subdifferential of W is set-valued at the non-smooth points of the double well. Note that W is not a convex function; thus, here it is especially important that we compute the proximal subdifferential, see [34, Section 1]. We show in Proposition 3.2 below that this subdifferential coincides with other usual subdifferentials by proving that W is primal lower nice. We illustrate the subdifferential in the bottom of Figure 3.1.
Remark 3.1 Up to here and throughout the rest of this work, we do not specify properties of the discrete gradient ∇′ and the discrete Laplacian △′. In the following, we only require △′ to be a symmetric, strictly negative definite matrix. In particular, it needs to be neither a localised operation nor a discretised differential operator: fractional operators, integral operators, and diffusions on graphs are allowed, as are different discretisation strategies.
Before moving on to the analysis of the discrete Allen-Cahn equation, we briefly give more context: the Allen-Cahn equation was first proposed to model the motion of an antiphase boundary [3]. The connection to the Ginzburg-Landau energy has been discussed by [41]. The numerical treatment of the Allen-Cahn equation has been discussed by, e.g., [24,32]. Finally, the idea of using Allen-Cahn in classification and image segmentation has become popular in recent years; see, e.g., [6,13,14,31].

Well-posedness of the discrete Allen-Cahn equation
We now prove existence, uniqueness, and continuous dependence on the initial value for the solution to (3.1). To the best of our knowledge, this has not been done for this particular dynamical system. We proceed by proving that the double-well potential we employ is convex up to a quadratic functional.
Proposition 3.2 The differential inclusion (3.1) has a unique solution that is locally Lipschitz continuous. Hence, the problem is well-posed.
Proof. One way to show the existence and the uniqueness of the solution of (3.1) is by proving that W : R → R is primal lower nice. In this case, Theorems 2.9 and 3.2 in [34] apply and prove the assertion of the proposition. A function f : R → R is primal lower nice at u₀ if there are s₀, c₀, Q₀ > 0 such that for all x ∈ B(u₀, s₀) := {u : |u₀ − u| ≤ s₀} and for all q ≥ Q₀, v ∈ {u ∈ ∂f(x) : |u| ≤ c₀q}, one has

    f(x′) ≥ f(x) + ⟨v, x′ − x⟩ − q|x′ − x|² (x′ ∈ B(u₀, s₀)).

We first show that we can represent the function W as the sum of a convex function and a negative square term. Then, we show that such combinations are always primal lower nice. Indeed, we can write 2W(η) = g(η) − η², where g is an appropriate piecewise-defined function. One can easily see that g is continuous and convex. We now show that 2W is primal lower nice: the defining inequality for 2W = g − (·)² is equivalent to the analogous inequality for g with the quadratic term shifted by |x′ − x|², which holds for sufficient s₀, c₀, Q₀ as g is convex. ✷

There are again multiple ways to discretise the differential inclusion (3.1) via explicit, semi-implicit, and fully implicit methods, see, e.g., [14,32]. Again, we make the observation that we can split the right-hand side of (3.1) into two parts that are relatively simple to solve separately. We discuss this and the implied stochastic approximation in the next subsection.
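The decomposition argument can be made explicit. The following sketch (our reconstruction, in scalar notation) verifies that f := g − (·)² is primal lower nice whenever g is convex:

```latex
% If v \in \partial f(x), then v + 2x \in \partial g(x), and convexity of g gives
% g(x') \ge g(x) + (v + 2x)(x' - x) for all x'. Hence
\begin{align*}
f(x') &= g(x') - (x')^2
        \ge g(x) + (v + 2x)(x' - x) - (x')^2 \\
      &= f(x) + v(x' - x) - \big((x')^2 - 2 x x' + x^2\big)
        = f(x) + v(x' - x) - (x' - x)^2 \\
      &\ge f(x) + v(x' - x) - q\,(x' - x)^2 \qquad (q \ge 1),
\end{align*}
% so the primal-lower-nice inequality holds for any s_0, c_0 > 0 and Q_0 \ge 1.
```

The constants are even uniform in u₀ here, which is stronger than what the definition requires.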

The stochastic approximation
The differential inclusion (3.1) contains a linear/smooth and a non-linear/non-smooth part. We denote the potentials of these two parts by Φ_ℓ (the linear-quadratic part) and Φ_w (the double-well part); the corresponding dynamical systems are given by (3.2) and (3.3). The linear part can, again, be solved analytically via the matrix exponential. The non-smooth part is a bit more involved, but we can find an analytical solution also in this case.
We write the solution coordinate-wise. For i ∈ {1, ..., n}, the i-th component of (ξ_w(t))_{t≥0} moves towards the nearer of the two wells −1 and 1 and remains there once it is reached; this is the unique solution of the differential inclusion. The uniqueness can be proven in the same way as we show Proposition 3.2. We obtain the stochastic approximation (i(t), η(t))_{t≥0} of the discrete Allen-Cahn equation (ξ(t))_{t≥0} by alternating (ξ_ℓ(t))_{t≥0} and (ξ_w(t))_{t≥0}: while i(t) = 0, the process follows the linear flow (3.2); while i(t) = 1, it follows the double-well flow (3.3); and η(0) = η^(0) ∈ X is an initial value. The stochastic approximation is, again, well-defined.
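The coordinate-wise thresholding can be sketched as follows. The exact piecewise formulas of the paper are not reproduced here; the code assumes, consistent with the qualitative description in Section 5, that components in (−1, 0) and (0, 1) expand like e^{t/ε} towards the nearer well and are capped there, that −1, 0, and 1 are stationary, and (an additional assumption of ours) that components outside [−1, 1] are pulled back towards the nearer well exponentially:

```python
import math

def double_well_flow(xi0, t, eps=0.1):
    """Hedged sketch of the coordinate-wise subgradient flow of the
    double-well part: each component is thresholded towards {-1, 0, 1}."""
    out = []
    for x in xi0:
        if x in (-1.0, 0.0, 1.0):
            out.append(x)  # the wells and the origin are stationary
        elif -1.0 < x < 1.0:
            # expansion like dx/dt = x / eps, capped at the nearer well
            y = x * math.exp(t / eps)
            out.append(max(-1.0, min(1.0, y)))
        else:
            # assumed: exponential pull-back towards the nearer well
            s = 1.0 if x > 1.0 else -1.0
            out.append(s + (x - s) * math.exp(-t / eps))
    return out
```

For a random waiting time t = δT ∼ Exp(λ), this is exactly the randomised thresholding towards −1 or 1 described in the introduction.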
Remark 3.3 What we discuss in Remark 2.1 also holds true in this setting: we randomise the waiting times between two switches in (i(t))_{t≥0}. Thus, the subgradient flow with respect to the double-well potential Φ_w, again, leads to a randomised thresholding.
A multitude of splitting schemes for classification and segmentation tasks of this type have been proposed in the literature. The idea of alternating diffusion and thresholding steps, which we use here as well, is particularly iconic: it is the underlying principle of the MBO scheme [35]. There, this idea has been used in the context of mean curvature flows. Its use in segmentation and classification has been considered by, e.g., [13,14,20,23]. The connection that we make here between MBO and Allen-Cahn has been thoroughly discussed by [13] in the setting of graph diffusions. Other splittings are also possible, see, for instance, [8] and the discussion therein. Finally, we note that through our stochastic approximation, we introduce a certain kind of noise into the Allen-Cahn equation. This piecewise deterministic noise produces, in some sense, a stochastic Allen-Cahn equation. This stochastic Allen-Cahn equation, however, is rather different from the usual stochastic Allen-Cahn equation, see, e.g., [7], which is driven by white noise.
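The deterministic MBO principle of alternating diffusion and thresholding steps can be sketched as follows (a generic version with a symmetric negative semi-definite 'Laplacian' L and fixed diffusion time tau, both our notation; the paper's schemes additionally include fidelity forcing):

```python
import numpy as np

def mbo(L, u0, tau=0.5, n_iter=10):
    """Sketch of the Merriman-Bence-Osher (MBO) scheme: alternate a
    diffusion step exp(tau * L) with thresholding to the nearer of -1, 1."""
    w, V = np.linalg.eigh(L)
    E = V @ np.diag(np.exp(tau * w)) @ V.T    # diffusion semigroup exp(tau L)
    u = u0.astype(float)
    for _ in range(n_iter):
        u = E @ u                             # diffusion step
        u = np.where(u >= 0.0, 1.0, -1.0)     # thresholding step
    return u
```

The output is, by construction, a strict binary labelling in {−1, 1}, which is what makes the scheme attractive for classification.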
Feller property and infinitesimal generators

We start with the definition of these terms. Then, we discuss the Feller property and the infinitesimal generators of the (proper) ODEs and of (i(t))_{t≥0}. Moreover, we explain how we obtain the infinitesimal generators for the subgradient processes. Ultimately, we are able to find the generators of the stochastic approximations (θ(t))_{t≥0} and (η(t))_{t≥0} in Subsections 4.1 and 4.2, respectively.

Definition 4.1 A homogeneous-in-time stochastic process (x(t))_{t≥0} is Feller if its Markov transition kernel P_t(·|x′) := P(x(t) ∈ ·|x(0) = x′) (x′ ∈ X, t ≥ 0) maps from C₀(X) into C₀(X) and satisfies the following three assumptions: (i) P₀ = Id, (ii) P_{t+s} = P_t ∘ P_s (s, t ≥ 0), and (iii) lim_{t↓0} ‖P_t f − f‖_∞ = 0 (f ∈ C₀(X)). Moreover, we define the (infinitesimal) generator of (x(t))_{t≥0} by

    A f := lim_{t↓0} (P_t f − f)/t,

which is well-defined for any Feller process and f in an appropriate subspace of C₀(X).
In this definition, we denoted by C₀(X) := C₀⁰(X) one of the spaces C₀ᵏ(X) of k-times continuously differentiable functions that vanish at infinity. Moreover, for some f ∈ C₀(X), we define P_t f := ∫_X f(x) P_t(dx|·) to be the expected value of f(z), where z ∼ P_t.
For some of the processes mentioned above, it is easy to show the Feller property and derive the infinitesimal generator. Those are:
1. the ODEs (ζ_d(t))_{t≥0} and (ξ_ℓ(t))_{t≥0}. The generator of a well-defined ODE dx/dt = ϕ(x) is given by A f = ⟨ϕ(·), ∇f(·)⟩ for f in an appropriate subspace of C₀¹(X);
2. the continuous-time Markov process (i(t))_{t≥0}. The Markov transition kernel of (i(t))_{t≥0} is given by P_t({i′}|i) = (1 + e^{−2λt})/2 if i′ = i and (1 − e^{−2λt})/2 otherwise; see [30, Lemma 5] for details. The infinitesimal generator is the transition rate matrix that we give in Subsection 1.1; its domain is the space of all functions f : {0, 1} → R.

The more challenging cases are the (proper) subgradient flows (ζ(t))_{t≥0}, (ζ_r(t))_{t≥0}, (ξ(t))_{t≥0}, and (ξ_w(t))_{t≥0}. We know from the discussion in Sections 2 and 3 that all of these paths are uniquely defined and locally Lipschitz continuous. This already implies that the processes are Feller. However, the paths are not necessarily differentiable. Thus, also t → P_t f is not differentiable for an arbitrary f ∈ C₀¹(X) that is not in the correct subspace of C₀¹(X). Instead of determining this subspace and then taking the derivatives, we discuss below why it is sufficient to work with right semi-derivatives when discussing the generator.
A function g : [0, ∞) → R is right semi-differentiable if the limit

    dg(t)/dt⁺ := lim_{h↓0} (g(t + h) − g(t))/h

exists for t ≥ 0. The limit dg(·)/dt⁺ is the right semi-derivative of g. Then, we indeed have that the infinitesimal operator A of a Markov transition operator P_t is given by A f = d(P_t f)/dt⁺ at t = 0: as in the existence proof of the infinitesimal generator, the right semi-derivatives coincide with the derivatives for test functions f from the appropriate subspace of C₀¹(X), see Theorem 17.6 in [27]. Let now P_t be the Markov transition operator for some process (x(t))_{t≥0} that is a right semi-differentiable, Lipschitz continuous, deterministic Feller process. Then,

    A f(x) = ⟨ dx(t, x)/dt⁺ |_{t=0}, ∇f(x) ⟩.

Thus, we obtain the infinitesimal generators for (ζ(t))_{t≥0}, (ζ_r(t))_{t≥0}, (ξ(t))_{t≥0}, and (ξ_w(t))_{t≥0} after showing right semi-differentiability and by computing their right semi-derivatives with respect to t. We do this in the next two subsections, in which we also derive the infinitesimal generators for the stochastic approximations (θ(t))_{t≥0} and (η(t))_{t≥0}.

Sparse inversion
We now discuss the right semi-derivatives of the processes (ζ(t))_{t≥0} and (ζ_r(t))_{t≥0}.
Assume it is not and dζ_i(t)/dt⁺ > 0. Then, for all h > 0 sufficiently small, ζ_i(t + h) > 0 but dζ_i(t + h)/dt⁺ < 0, which is a contradiction; a contradiction can also be obtained if dζ_i(t)/dt⁺ < 0. This concludes the proof. ✷

The semi-derivatives in Proposition 4.2 and the discussion in the beginning of Section 4 allow us to define the infinitesimal generators of (ζ(t))_{t≥0}, (ζ_d(t))_{t≥0}, and (ζ_r(t))_{t≥0}; in this order, they are given by the inner product of the respective right semi-derivative with the gradient of the test function. Using the infinitesimal generators of (ζ_d(t))_{t≥0} and (ζ_r(t))_{t≥0}, we can now discuss the stochastic approximation. Note that (i(t), θ(t))_{t≥0} is a simple example for a piecewise-deterministic Markov process in the sense of Davis [21]; see also [5,17]. Thus, the construction of the generators can, for instance, be found similarly in those articles. We summarise:

Proposition 4.3 (i(t), θ(t))_{t≥0} is a Markov process and Feller; its infinitesimal generator combines the generators of the two subflows with that of the switching process (i(t))_{t≥0}.
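For orientation, the generator of such a two-state piecewise-deterministic Markov process presumably takes the standard form below (a hedged reconstruction consistent with the subflows (2.4), (2.5) and the switching rate λ, not the paper's exact display):

```latex
(\mathcal{A} f)(i,\theta)
  = (1-i)\,\Big\langle -\tfrac{1}{2}\big(\bar{A}\theta - \bar{b}\big),\,
      \nabla_\theta f(i,\theta)\Big\rangle
  + i\,\big\langle v_r(\theta),\, \nabla_\theta f(i,\theta)\big\rangle
  + \lambda\,\big(f(1-i,\theta) - f(i,\theta)\big),
```

where v_r(θ) is the right semi-derivative of the shrinkage flow, i.e. the vector with components −(1/2) sgn(θ_j) for θ_j ≠ 0 and 0 for θ_j = 0. The first two terms are transport along the active subflow; the last term accounts for the switching of (i(t))_{t≥0}.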

Classification
We now discuss the right semi-derivatives of the processes (ξ(t))_{t≥0} and (ξ_w(t))_{t≥0}.
Proposition 4.4 The functions (ξ(t))_{t≥0} and (ξ_w(t))_{t≥0} are right semi-differentiable; their semi-derivatives are given componentwise for t ≥ 0, ξ ∈ X, and i ∈ {1, ..., n}, analogously to Proposition 4.2.

Proof. The proof runs along the same lines as that of Proposition 4.2. ✷

Given these semi-derivatives and the discussion about ODEs in the beginning of Section 4, we know that the infinitesimal generators of (ξ(t))_{t≥0}, (ξ_ℓ(t))_{t≥0}, and (ξ_w(t))_{t≥0} are given by the corresponding inner products with the gradient of the test function. Again, we can follow [21] to show that (i(t), η(t))_{t≥0} is Feller and to obtain its infinitesimal generator:

Proposition 4.5 (i(t), η(t))_{t≥0} is a Markov process and Feller; its infinitesimal generator combines the generators of the two subflows with that of the switching process (i(t))_{t≥0}.

Longtime behaviour
We now study the longtime behaviour and convergence of the subgradient flows (ζ(t))_{t≥0}, (ξ(t))_{t≥0} and their stochastic approximations (θ(t))_{t≥0}, (η(t))_{t≥0}. Indeed, we see that the subgradient flows converge to a point. In general, the stochastic approximations do not converge to a single point. Instead, we study their convergence to potentially existing stationary probability distributions. The longtime behaviour of subgradient flows of this form has been studied by, e.g., [12,34]. From these works, we obtain the following result:

Proposition 5.1 (i) The sparse inversion flow (ζ(t))_{t≥0} converges to the stationary point θ* of Φ, as t → ∞. (ii) The discrete Allen-Cahn equation (ξ(t))_{t≥0} converges to the minimal point of its trajectory, i.e. Φ(ξ(t)) → inf_{T>0} Φ(ξ(T)), as t → ∞.
Proof. (i) is a classical result by Bruck [12], where it is stated in Theorem 4. For (ii), we employ Proposition 4.1 from [34], which holds as Φ is bounded below and primal lower nice, see Proposition 3.2. ✷

In the sparse inversion problem, we converge to the unique minimiser of Φ. In the classification problem, this is not so clear. Our proposition only states that, along the discrete Allen-Cahn equation, the target function converges to its infimum over all visited states. We should note that, in general, Φ has no unique minimiser. Consider, for instance, the case where X′ = X, d = 0, and P − ε△′ = (1/n) Id_X. Then, the vectors (−1, ..., −1)ᵀ and (1, ..., 1)ᵀ are both minimisers of Φ. We should also note that the dynamical system (ξ(t))_{t≥0} does not need to get close to a minimiser. In the aforementioned setting, (0, ..., 0)ᵀ is a stationary point of the dynamical system. A process starting from there would not converge to a minimiser of the target function.
We now move on to the stochastic approximations. As mentioned before, we usually cannot expect convergence to a single stationary point. We discuss the convergence of the stochastic approximations in terms of the Wasserstein distance. It is defined as follows: let π, π′ ∈ Prob(X) be two probability measures on (X, ℬ(X)) and let Coup(π, π′) ⊆ Prob(X × X) be the set of couplings of π, π′. Let p ∈ (0, 1). Then, the Wasserstein distance of π, π′ is defined as

    W_p(π, π′) := inf_{H ∈ Coup(π,π′)} ∫_{X×X} ‖x − x′‖^p dH(x, x′).

Convergence in this Wasserstein distance is equivalent to weak convergence of the probability measures, see, e.g., [49]. We start with the stochastic approximation of the sparse inversion flow in the next subsection and then discuss the discrete Allen-Cahn equation after that.
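For p ∈ (0, 1), the cost |x − x′|^p is concave, so the monotone coupling familiar from p ≥ 1 need not be optimal; for two equal-size empirical measures, however, the infimum over couplings is attained at a permutation (the objective is linear over the Birkhoff polytope of doubly stochastic matrices). The following brute-force sketch (our own helper, feasible only for very small samples) uses exactly that fact:

```python
import itertools

def wasserstein_p(xs, ys, p=0.5):
    """W_p distance (p in (0,1)) between two equal-size empirical measures
    on R: minimise the average transport cost |x - y|^p over all
    permutation couplings."""
    n = len(xs)
    assert len(ys) == n
    return min(
        sum(abs(x - ys[s]) ** p for x, s in zip(xs, perm)) / n
        for perm in itertools.permutations(range(n))
    )
```

Note that no 1/p-th root is taken: for p < 1, the quantity above is itself a metric, matching the definition in the text.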

Sparse inversion
To understand the asymptotic behaviour of (θ(t))_{t≥0}, we first collect properties of the two subflows and the boundedness of the process.

Lemma 5.2 Let θ^(0), θ^(1) ∈ X be initial values and let σ_min be the minimal singular value of A. Then:
(i) The flow of (2.5) is non-expansive: ‖ζ_r(t, θ^(0)) − ζ_r(t, θ^(1))‖ ≤ ‖θ^(0) − θ^(1)‖ (t ≥ 0).
(ii) The flow of (2.4) contracts: ‖ζ_d(t, θ^(0)) − ζ_d(t, θ^(1))‖ ≤ exp(−σ_min² t/2) ‖θ^(0) − θ^(1)‖ (t ≥ 0).
(iii) The process (θ(t))_{t≥0} is bounded whenever the initial value θ(0) = θ^(0) ∈ X is bounded.
Proof. (i) follows immediately from the definition of the subgradient flow. (ii) is given by a simple calculation using the closed-form solution: the difference of two solutions of (2.4) is exp(−Āt/2)(θ^(0) − θ^(1)), whose norm is at most exp(−σ_min² t/2)‖θ^(0) − θ^(1)‖. (iii) We can show boundedness for the two subprocesses in terms of the respective initial value; thus, the process (θ(t))_{t≥0} is bounded as well. ✷

With this result, we now have all the ingredients to study the asymptotic behaviour of (θ(t))_{t≥0}.
Theorem 5.3 The stochastic approximation (θ(t))_{t≥0} has a unique stationary measure π* ∈ Prob(X); for any initial value θ(0) = θ^(0) ∈ X and any p ∈ (0, 1), there are constants c, c′ > 0 such that

    W_p(P_t(·|θ^(0)), π*) ≤ c exp(−c′ t) (t ≥ 0).

Proof. The convergence follows from Theorem 1.4 in [17], which requires the results of Lemma 5.2 and the stochastic approximation to be Feller. Uniqueness follows from (θ(t))_{t≥0} contracting in the Wasserstein distance and the Banach Fixed Point Theorem. ✷

Hence, we see that the algorithm converges exponentially to a stationary measure. We refer to this stationary measure as the implicit regularisation. It determines how the stochasticity in the algorithm regularises the optimisation problem and is especially important in non-convex optimisation, see, e.g., [44]. Moreover, the probability distribution is sometimes used for uncertainty quantification, see, e.g., [33].
It is noteworthy that the stationary measure π* is often not concentrated in a single point, in contrast to the discrete forward-backward splitting given in Algorithm 1 with sufficiently small step sizes. Thus, either the randomisation of the waiting times prevents the contraction to the minimiser of (2.1), or the contraction in the discrete algorithm is a numerical artefact (which seems more likely).
The assumption that the operator A has full rank is vital for the proof. While it holds for many sparse regression problems, it clearly fails in several standard inverse problems (such as inpainting or sparse dictionary learning). Moreover, we would usually expect the regulariser to improve the convergence behaviour. One way to address this problem is ‖·‖₂² + ‖·‖₁ regularisation, which is not untypical in practice; see, e.g., [45].

Classification
As in the sparse inversion case, we first discuss the asymptotic properties of the subflows (ξ_ℓ(t))_{t≥0} and (ξ_w(t))_{t≥0}, as well as the boundedness of the stochastic approximation (η(t))_{t≥0}.
Proof. The dynamical system (ξ_w(t))_{t≥0} in (i) behaves differently for initial values in different regions of X. As the process has the same convergence behaviour in each component, we assume without loss of generality that X = R. In the intervals (−∞, −1) and (1, ∞), the process contracts toward −1 and 1, respectively; −1, 0, and 1 are stationary points. Within the intervals (−1, 0) and (0, 1), the process expands and converges toward −1 and 1, respectively. Where the process expands, it expands like the ODE dx/dt = ε⁻¹x, by definition. Thus, we obtain the bound in (i). (ii) follows from the fact that the components of (ξ_w(t))_{t≥0} converge to {−1, 0, 1} in finite time. The proof of (iii) is identical to that of Lemma 5.2(ii). (iv) is implied by (ii) and (iii). ✷ Note that (i) does not give a contraction of the dynamical system, but rather a bound on its expansion. In (ii), we show additionally that after some finite time this expansion is bounded by a constant that depends only on the dimension of X. Given these statements, we can show the following theorem; its proof is identical to that of Theorem 5.3.
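The one-dimensional picture described in the proof can be checked numerically. The sketch below assumes the standard double-well potential W(x) = (x² − 1)²/4, so that the flow reads dx/dt = −ε⁻¹W′(x) = ε⁻¹(x − x³); near 0 it indeed expands like dx/dt = ε⁻¹x, while away from 0 it contracts toward the wells at ±1:

```python
def double_well_flow(x0, eps=0.1, dt=1e-3, T=2.0):
    """Explicit Euler integration of dx/dt = eps^{-1} (x - x^3), the
    (sub)gradient flow of the double-well potential W(x) = (x^2 - 1)^2 / 4.
    0 is an unstable stationary point; -1 and 1 are stable wells."""
    x = float(x0)
    for _ in range(int(T / dt)):
        x += dt / eps * (x - x ** 3)
    return x

# initial values in (0, 1) first expand away from 0, then settle at the well 1;
# by symmetry, initial values in (-1, 0) settle at -1, and 0 stays at 0
```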
Theorem 5.5 Let µ_min > ε⁻¹. Then, the stochastic approximation (η(t))_{t≥0} has a unique stationary measure π† ∈ Prob(X); for any initial value η(0) = η^(0) ∈ X and any p ∈ (0, 1), there are constants c, c′ > 0 such that W_p(P(η(t) ∈ ·), π†) ≤ c e^{−c′t} for all t ≥ 0. As for the discrete Allen-Cahn equation (ξ(t))_{t≥0}, convergence to a unique stationary measure is not automatic for the stochastic approximation (η(t))_{t≥0}: it depends on the choice of the discrete Laplacian and of ε. Note that P is usually not of full rank, which is why the smallest eigenvalue of −△′ is likely a good approximation to µ_min/ε. If ε ≈ 0, we cannot expect the inequality in Theorem 5.5 to hold.

Approximation properties
In this last theoretical section of this work, we study the approximation properties of the stochastic approximations. In particular, we are looking for results stating that the stochastic approximations converge weakly ('⇒') to the underlying deterministic dynamics as λ → ∞, where the weak convergence holds uniformly in time t ≥ 0. Indeed, we aim to show that the stochastic approximations can approximate the deterministic dynamics at any accuracy. Results of this type have been discussed in [26,30] in the context of smooth stochastic optimisation and in [22] in the context of Markov chain Monte Carlo. The theoretical foundation is given by Kushner's perturbed test function theory [28,29]. Indeed, we aim to employ Theorem 2 in Chapter 3 of [28]. Intuitively, this theorem states that we can show weak convergence of a family of Feller processes to another Feller process if the family is tight and if the corresponding infinitesimal generators converge in a certain sense. Here, we are allowed to test the family of generators with perturbed test functions. We start with the tightness, which follows from uniform equicontinuity of the processes.
Proof. (i(t))_{t≥0} is certainly tight, as it lives on a finite state space. We show tightness of (θ(t))_{t≥0} by proving that the paths of (θ(t))_{t≥0} are uniformly equicontinuous with respect to λ and bounded. Boundedness is implied by the boundedness of θ_0 and the contractivity of the piecewise processes; see Lemma 5.2. Note that, independently of λ > 0, the paths of (θ(t))_{t≥0} are constructed piecewise from the Lipschitz continuous functions (ζ_d(t))_{t≥0} and (ζ_r(t))_{t≥0}. Indeed, (ζ_d(t))_{t≥0} is Lipschitz continuous due to its differentiability and contractivity. A function that is constructed piecewise from Lipschitz continuous functions is Lipschitz continuous itself; see Proposition 4.1.2 in [42]. Here, the Lipschitz constant does not depend on λ. Hence, the paths are uniformly equicontinuous and, thus, tight; see Theorem 4 in Chapter 2 of [28]. (η(t))_{t≥0} satisfies the same boundedness and smoothness conditions as (θ(t))_{t≥0}. Thus, (i(t), η(t))_{t≥0} is tight as well. ✷ Now, we need to define the test functions and perturbed test functions. First, we focus on the sparse inversion case. To this end, let f ∈ C²₀(X) be a test function. We define the associated perturbed test function by f_λ(t) = f(θ(t)) (t ≥ 0), along with the action of A_θ on f_λ. The following lemma is the main technical result needed for the weak convergence motivated above; it holds for any t ≤ T.
Proof. 1. The boundedness in (6.1) follows from the boundedness of the initial value, the contractivity of the process, and the boundedness of f. (6.2) is trivial by definition. Note that (i(t))_{t≥0} is a Markov process. Let now i ∈ I; the second equality holds for Unif[t, T]-almost every u, due to the absolute continuity of (θ(u))_{u∈[t,T]}. Finally, taking expectations, the right-hand side goes to 0 as λ → ∞. ✷ Second, we discuss the same result for the stochastic approximation of the discrete Allen-Cahn flow. We again choose f ∈ C²₀(X) and set f_λ(t) := f(η(t)) (t ≥ 0), along with the action of A_η on f_λ. As for the sparse inversion problem, we can show the following statement for the Allen-Cahn case.
The analogous bounds hold for any t ≤ T.
Proof. The proof proceeds identically to that of Lemma 6.2. ✷ We can now formulate the main approximation result.

Illustrations
In the following, we illustrate the use of the stochastic approximations of the sparse inversion flow and the discrete Allen-Cahn equation. We start with simplified examples that allow us to stay close to the setting discussed above; thus, in Subsections 7.1 and 7.2, we solve all ODEs and differential inclusions via their analytical solutions. In Subsection 7.3, we then move on to a higher-dimensional classification problem, where we solve the linear ODEs with a stable time stepping scheme.

Sparse inversion
We consider again the example used as an illustration in Figure 2.1. Here, we minimise the target function Φ(θ) := ½(θ − 4)² + |θ|, defined for θ ∈ X := R. The minimiser can be computed analytically in this case, giving θ* = 3. Our stochastic approximation uses, of course, the potentials Φ_d(θ) := ½(θ − 4)² and Φ_r(θ) := |θ|. We now simulate (θ(t))_{t≥0} for λ ∈ {0.25, 2.5, 25, 250}. We generate a total of 10⁴ realisations of these processes starting at θ^(0) = 0 and store θ(20). We use these samples to depict the probability distributions P(θ(20) ∈ ·) in Figure 7.1; these distributions act as approximations to the stationary distribution π* discussed in Theorem 5.3. In the case λ = 0.25, we essentially see the process going back and forth between 0 and 4, which are the minimisers of Φ_r and Φ_d, respectively. This happens due to the very large mean waiting times in (i(t))_{t≥0}. As λ increases, the mean waiting times decrease and the probability distribution P(θ(20) ∈ ·) concentrates around θ* = 3, which is exactly what we expect due to Theorem 6.4. This can also be seen in Table 7.1, where we report the sample mean and sample variance of the realisations. There, we also see that the mean of the process is a fairly good estimate of θ* in most cases.
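Since both subflows have closed-form solutions, the experiment is easy to reproduce. The following sketch (function names are ours; we use 2·10³ rather than 10⁴ realisations to keep it quick) alternately follows the exact solutions of the two subflows for Exp(λ) waiting times:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta0=0.0, lam=25.0, T=20.0):
    """One path of the randomised splitting for Phi(theta) = (theta-4)^2/2 + |theta|,
    switching between the two subflows at Exp(lam)-distributed waiting times."""
    theta, t, follow_data = theta0, 0.0, True
    while t < T:
        s = min(rng.exponential(1.0 / lam), T - t)
        if follow_data:
            # exact gradient flow of Phi_d(theta) = (theta - 4)^2 / 2
            theta = 4.0 + (theta - 4.0) * np.exp(-s)
        else:
            # exact subgradient flow of Phi_r(theta) = |theta|:
            # move toward 0 at unit speed and stick at 0 (thresholding)
            theta = np.sign(theta) * max(abs(theta) - s, 0.0)
        follow_data = not follow_data
        t += s
    return theta

samples = np.array([simulate() for _ in range(2000)])
# for large lam, the samples concentrate near the minimiser theta* = 3
```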

Classification in 1D
We now study a simple classification problem on the set I = {1, . . ., 200}. Indeed, we aim to find a classification vector η† ∈ {−1, 1}^200, given that we have observed a (small) part of the entries of η† and that we assume a certain spatial clustering of the classes −1 and 1. We plot η† in Figure 7.2; we made sure that η† is non-trivial in the sense that it exhibits non-linear, non-monotonic, and non-oscillating behaviour.
To retrieve the classification from the data, we employ the stochastic approximation of the discrete Allen-Cahn equation on X = R^200. Here, we use a very basic discrete Laplacian △′₁, given by the tridiagonal matrix with diagonal values −50 and off-diagonal values 25. This discrete Laplacian can be derived through a centred finite difference approximation of the Laplacian on an open interval with homogeneous Dirichlet boundary conditions and discretisation width h = 0.2. We set the weight α := 1.
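As a sanity check, △′₁ can be assembled from the centred-difference stencil; with h = 0.2 one obtains −2/h² = −50 on the diagonal and 1/h² = 25 on the off-diagonals, as stated above (a sketch; the function name is ours):

```python
import numpy as np

def discrete_laplacian_1d(n=200, h=0.2):
    """Centred finite-difference Laplacian on an interval with homogeneous
    Dirichlet boundary conditions: tridiagonal with -2/h^2 on the diagonal
    and 1/h^2 on the first off-diagonals."""
    L = np.zeros((n, n))
    np.fill_diagonal(L, -2.0 / h ** 2)
    idx = np.arange(n - 1)
    L[idx, idx + 1] = 1.0 / h ** 2
    L[idx + 1, idx] = 1.0 / h ** 2
    return L

L = discrete_laplacian_1d()  # for h = 0.2: diagonal -50, off-diagonals 25
```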
In the figures, we see overall that the larger value λ = 10 helps the system to stabilise and exhibit physical behaviour after some time; 'physical' means that the function is approximately piecewise constant with values close to −1 and 1. The smaller value λ = 1 leads to more noisy (ε ∈ {10⁻⁴, 10⁻⁶}) or completely non-physical (ε = 10⁻²) behaviour. Before the system stabilises with a piecewise constant solution around {−1, 1}, it appears to be dominated by the linear process. This is expected, as we start at 0, where the double-well subgradient flow moves slowly, if at all. The combination ε = 10⁻², λ = 10 appears particularly efficient: the relatively large ε leads the diffusion/fidelity process to converge quickly to stationarity, which then amplifies the thresholding procedure within [0, 1]. Computing a single realisation of η(16) in this setting took, in our experiment on a MacBook Pro (Intel i7, 4 cores @ 2.6 GHz; 16 GB RAM), on average 0.22 (±0.05) seconds. This appears to be very fast, given that we solve the linear part by computing matrix exponentials.
None of the stochastic approximations is able to retrieve the precise classification. This is expected due to the sparsity of the data. However, the stochasticity of the algorithm does not seem to pick up these regions of uncertainty: the variance is only insignificantly increased around the jumps. Hence, we see also here that the randomisation of what is essentially an MBO scheme does not necessarily lead to a useful uncertainty quantification.

Classification in 2D
To employ the discrete Allen-Cahn equation, we construct a two-dimensional version of the one-dimensional Laplacian △′₁ given in Subsection 7.2 by defining △′₂ := Id_X ⊗ △′₁ + △′₁ ⊗ Id_X. The discrete Allen-Cahn equation is now posed on the space X := R^(4·10⁴). Solving the linear part of the problem (3.2) by computing the matrix exponential at very high accuracy is hardly possible on such a high-dimensional space. Thus, we approximate a linear step, to which we switch at time T and which has a random step size δT ∼ Exp(λ), by a single step of the implicit midpoint rule or the Crank-Nicolson method [19] applied to the ODE (3.2). See [38] for other techniques to compute matrix exponentials and [1] for a discussion of ODE integrators with randomised step sizes. The thresholding step (3.3) is still evaluated exactly. We simulate the stochastic approximation for λ = 20, α := 1, ε ∈ {0.05, 0.005}, and initial condition η(0) = 0. We compute means and standard deviations over 100 independent runs of η(2), η(4), η(8), and η(16). We present the results in Figures 7.
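The approximate linear step can be sketched as follows for a generic linear ODE dη/dt = Mη + g; here M and g are placeholders for the linear part of (3.2), whose exact form we do not restate, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def crank_nicolson_step(eta, M, g, lam):
    """One Crank-Nicolson step for d(eta)/dt = M @ eta + g with a random
    step size dT ~ Exp(lam):
        (I - dT/2 M) eta_next = (I + dT/2 M) eta + dT g."""
    dT = rng.exponential(1.0 / lam)
    n = len(eta)
    A = np.eye(n) - 0.5 * dT * M
    b = (np.eye(n) + 0.5 * dT * M) @ eta + dT * g
    return np.linalg.solve(A, b), dT
```

Crank-Nicolson is A-stable, so the step remains stable even for the occasional large waiting time drawn from the exponential distribution; in practice, one would solve the sparse linear system rather than form dense matrices.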
The dynamical systems appear to behave similarly to the 1D setting, where the ODEs were not discretised. In the beginning, the function fits the data at the points of observation but remains 0 everywhere else. Over time, the function is slowly smoothed out, leading to a good reconstruction of the true classification. In case ε = 0.005, we see smaller variances in the dynamical system throughout. Moreover, at time t = 16, we see an increase of the standard deviation around the interface in the reconstruction, indicating the uncertainty about the interface position. Unfortunately, this area is very small and does, e.g., not contain the truth. The numerical approximation of the linear part does not appear to severely damage the dynamical system, as long as the waiting times are sufficiently small with high probability. Thus, a numerical approximation of the linear ODE appears to be a suitable technique if λ is sufficiently large.

Conclusions and outlook
In this work, we have proposed and analysed randomised, continuous-time splitting methods for sparse inversion and binary classification. The methods arise as stochastic approximations of the original dynamical systems: a sparse inversion flow and a discrete Allen-Cahn equation. To obtain the stochastic approximation, we split the dynamics into two differential equations/inclusions, one for each of the two subtarget functions. Each of these dynamical systems can be solved analytically or numerically in an accurate and stable way. To approximate the original flow, we alternately follow one of these subflows for an exponentially distributed waiting time before switching. In both cases, the non-smooth part leads to a randomised thresholding.
We give assumptions under which the stochastic approximations are exponentially ergodic, i.e., they converge quickly to their unique stationary measures. Moreover, we show that the stochastic approximations can indeed approximate the underlying dynamical systems at any accuracy. We finally demonstrate the applicability of our method in simple numerical experiments.
We finish this work by mentioning two possible extensions of the presented methods. Data subsampling. If the data sets b, d are very large, even a simple numerical approximation of the linear parts of the algorithms may be prohibitive. The traditional stochastic gradient descent method addresses this problem by subsampling the data, i.e., partitioning the data vectors and optimising only with respect to one of the data subsets at a time. The recent work [30] presents a natural way to introduce subsampling into our stochastic approximations of the sparse inversion flow and the discrete Allen-Cahn equation.
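As an illustration of the subsampling idea (a generic sketch, not the scheme of [30]): for a least-squares fidelity ½‖Aθ − b‖², an unbiased gradient estimate can be formed from a random subset of the data rows.

```python
import numpy as np

rng = np.random.default_rng(2)

def subsampled_gradient(theta, A, b, batch):
    """Unbiased estimate of the gradient A^T (A theta - b) of
    (1/2) ||A theta - b||^2, using `batch` uniformly drawn data rows.
    The factor n / batch corrects for the subsampling."""
    n = A.shape[0]
    idx = rng.choice(n, size=batch, replace=False)
    A_s, b_s = A[idx], b[idx]
    return (n / batch) * A_s.T @ (A_s @ theta - b_s)
```

With batch equal to the full data size, the estimate coincides with the exact gradient; smaller batches trade accuracy for cheaper linear algebra.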
Stochastic approximation of primal lower nice functions. In this work, we have focussed on two particular dynamical systems, as opposed to the general setting we started with in Subsection 1.1. Moreover, we have noticed that the theory of primal lower nice functions appears to be particularly suitable for our objectives. It would be interesting to see which of the statements made in this work still hold when the given potentials are replaced by general primal lower nice functions. Practically, this would allow us to analyse stochastic splitting techniques for much more complicated non-smooth dynamical systems, such as the ROF flow and classification with other double-well potentials.