Dynamic Fourier ptychography with deep spatiotemporal priors

Fourier ptychography (FP) involves the acquisition of several low-resolution intensity images of a sample under varying illumination angles. They are then combined into a high-resolution complex-valued image by solving a phase-retrieval problem. The objective in dynamic FP is to obtain a sequence of high-resolution images of a moving sample. There, the application of standard frame-by-frame reconstruction methods limits the temporal resolution due to the large number of measurements that must be acquired for each frame. In this work, we instead propose a neural-network-based reconstruction framework for dynamic FP. Specifically, each reconstructed image in the sequence is the output of a shared deep convolutional network fed with an input vector that lies on a one-dimensional manifold that encodes time. We then optimize the parameters of the network to fit the acquired measurements. The architecture of the network and the constraints on the input vectors impose a spatiotemporal regularization on the sequence of images. This enables our method to achieve high temporal resolution without compromising the spatial resolution. The proposed framework does not require training data. It also recovers the pupil function of the microscope. Through numerical experiments, we show that our framework paves the way for high-quality ultrafast FP.


Introduction
In Fourier ptychography (FP) [1], hundreds of low-resolution intensity images are acquired by illuminating the object of interest with a coherent light source at varying incidence angles. This task is typically performed using an LED array and a microscope with a low-numerical-aperture (NA) objective lens, which makes FP a low-cost and label-free imaging modality. The collection of measurements is then algorithmically combined into a high-resolution complex-valued image of the sample over a large field of view. Thus, FP has a high space-bandwidth product.
Building upon the pioneering work of Zheng et al [1], the capabilities of FP have been extended in a variety of ways by improving the optical acquisition setup. For instance, in [2, 3], the sequence of illuminations is optimized via an importance metric and neural networks, respectively. Multiplexed FP is introduced in [4], where one illuminates the sample with multiple LEDs at once and is thus able to reduce the number of measurements. Further, optimal combinations of LEDs are studied in [3, 5-7].
There have also been several improvements on the computational side of FP. At its core, the reconstruction process involves the solution of a phase-retrieval (PR) problem: the recovery of phase information from intensity measurements. In [1], this task is performed using the iterative Gerchberg-Saxton (GS) algorithm [8]. As PR is a non-convex problem, the solution obtained by GS depends on the starting point. This problem of initialization is tackled in [9]. In [10-12], PR is formulated as a convex optimization problem with the help of a lifting scheme. However, this elegant approach comes at the cost of a large computational burden. As the acquired measurements are typically corrupted by noise, maximum-likelihood estimation offers an adequate framework to incorporate the noise statistics [13]. The resulting optimization problems are solved efficiently by gradient-based or higher-order methods [14, 15]. A thorough comparative study of different methods for PR can be found in [16]. In addition to solving the PR problem, algorithms that include the estimation of the pupil function of the microscope [17] and the correction of the LED positions [18, 19] have also been proposed.
While FP has matured into a versatile modality with numerous applications [20], high-quality high-speed imaging remains a challenge. The temporal resolution in FP is inherently limited by the large number of measurements that need to be acquired in order to reconstruct the high-resolution image of the sample. To alleviate this problem, ad hoc acquisition setups [5, 6, 21] have been devised. They allow one to obtain a higher temporal resolution without a significant deterioration of the spatial resolution. Alternatively, there has been a lot of interest in the development of sophisticated computational methods that solve the PR problem with only a few measurements. In such ill-posed scenarios, regularization techniques can be used to incorporate prior knowledge about the sample of interest. These are typically applied by formulating PR as an optimization problem whose cost functional consists of a data-fidelity term and a regularization term. The data-fidelity term ensures that the solution is consistent with the observed data, while the regularization promotes solutions with the desired properties. For example, the popular total-variation (TV) regularization [22] favors piecewise-constant images and has been adapted to FP in several works [23-26]. Group-sparsity-based priors have been successfully deployed in FP as well [27]. An online plug-and-play approach for FP has also been proposed in [28], where sophisticated denoisers such as BM3D [29] are used for (implicit) regularization.
Over the past few years, deep-learning-based methods have yielded impressive results, outperforming the model-based regularized methods in a variety of imaging modalities, especially in ill-posed settings [30, 31]. In the context of FP, deep neural networks have been trained in a supervised manner as nonlinear mappings that take the low-resolution measurements and output the high-resolution image of interest [32-34]. Further, in [35, 36], pre-trained deep generative priors are used to solve the PR problem. For more details regarding FP, we refer the reader to recent comprehensive reviews [20, 37].
In dynamic FP, where it is desired to image a moving sample, the computational methods described above must be applied in a frame-by-frame manner to obtain the sequence of high-resolution images, without accounting for the temporal dependencies in the measurements. Yet, one can decrease the number of measurements required per frame (thus increasing the effective imaging speed) by exploiting the temporal correlations in the sequence of images to be recovered. Based on this idea, the concept of low-rank FP is introduced in [38], where a low-rank constraint is enforced on the matrix formed by stacking the (vectorized) images.

Contributions
In this work, we propose a novel computational framework for dynamic FP. Inspired by the method developed in [39] for dynamic magnetic resonance imaging, we use a deep neural network (deep prior) to impose a spatiotemporal regularization on the sequence of complex-valued images to be recovered. More specifically, we parameterize each image in the sequence as the output of a single convolutional network corresponding to some fixed latent input vector. These input vectors are chosen to lie on a one-dimensional manifold. The parameters of the network are then optimized such that the generated images collectively fit the acquired measurements under the action of the specified forward model. The architecture of the generative network imposes an implicit spatial prior on the images, while the constraints on the input latent vectors allow the network to associate their proximity with temporal variations in the sequence. Our method does not require any training data. It also estimates the pupil function together with the complex-valued images, which means that it can be readily applied in different settings. We assess the performance of our framework on simulated data with a single measured low-resolution image per reconstructed frame and show that it paves the way for high-quality ultrafast FP.
The paper is organized as follows. In section 2, we describe a continuous-domain physical model for FP along with its computationally efficient discretization. We present the proposed reconstruction framework in section 3 and the experimental results in section 4.

Physical model
In this section, we first formulate the physical model that relates the acquired measurements and the sample of interest in the continuous domain. Then, we present a discretized version of the forward model that can be implemented in a computationally efficient manner.

Continuous-domain formulation
The optical system in FP usually involves an array of L LEDs (see figure 1), where the lth LED illuminates the specimen with a tilted plane wave with wave vector k_l ∈ R^2 (l ∈ L = {1, 2, . . . , L}) and wavelength λ > 0. In this work, we consider the case where only one LED is turned on for each measured image. However, our framework is also compatible with more sophisticated acquisition settings [4].
We model the sample of interest as a 2D complex object, which is a valid assumption for thin samples. Therefore, we can represent the moving sample as a complex-valued function x : Ω_X × R_{⩾0} → C, where Ω_X ⊂ R^2 includes the region of interest of the sample. Let {t_q}_{q=1}^{Q} be the uniformly spaced timestamps, with spacing ∆_t, at which we are interested in observing the sample. We assume that the sample moves very slowly within the intervals T_q around the timestamps t_q. Thus, during T_q, we can acquire multiple measurements {y_{q,w} : Ω_Y → R}_{w=1}^{W} of the object x(·, t_q), where W ⩽ L and where Ω_Y ⊂ R^2 includes the support of the measurement. Here, the tradeoff between the temporal resolution and the spatial resolution can be understood in terms of ∆_t and W: a small value of ∆_t (high temporal resolution) implies a small value of W, which yields a low spatial resolution.
Let I_q ⊂ L, where q ∈ {1, 2, . . . , Q}, be the set of LEDs that are switched on during T_q; the cardinality of this set is |I_q| = W. Further, for w ∈ {1, 2, . . . , W}, we introduce l_{q,w} = I_q(w) ∈ L to denote the wth entry of I_q. The measurement image y_{q,w} is obtained when x(·, t_q) is illuminated by the l_{q,w}th LED with the tilted plane wave r ↦ e^{j⟨k_{l_{q,w}}, r⟩}. As mentioned in [4, 16], it is given by

y_{q,w}(r) = |F^{-1}{p(k) x̂(k − k_{l_{q,w}}, t_q)}(r)|^2.   (1)

Here, the operators F and F^{-1} denote the Fourier transform and its inverse, respectively, k ∈ R^2 is the 2D spatial-frequency variable, and x̂(k, t_q) denotes the Fourier transform of x(r, t_q) with respect to r. The pupil function⁵ p : R^2 → C models the pupil aperture and is compactly supported on a disk of radius 2π NA/λ, where NA is the numerical aperture of the system, thus cutting off high frequencies.

Camera sampling
In practice, the camera in the acquisition setup samples y_{q,w} on a uniform grid with stepsize ∆ and records a discrete image ȳ^im_{q,w} of size⁶ (M × M), given by

ȳ^im_{q,w} = Noise(y^im_{q,w}),   (2)

where the (M × M) image y^im_{q,w} is the sampled version of y_{q,w} given by

y^im_{q,w}[m_1, m_2] = y_{q,w}(m_1 ∆, m_2 ∆)   (3)

for m_1 = 0, . . . , (M − 1) and m_2 = 0, . . . , (M − 1), and the operator Noise(·) models the corruption of y^im_{q,w} by noise. Consider the quantity

u_{q,w}(r) = F^{-1}{p(k) x̂(k − k_{l_{q,w}}, t_q)}(r),   (4)

so that y_{q,w} = |u_{q,w}|^2. Due to the compact support of the pupil function p, the maximum angular frequency of u_{q,w} is 2π NA/λ. Note that the Fourier transform of y_{q,w} can be written as

ŷ_{q,w} = (1/(2π)^2) û_{q,w} * conj(û^∨_{q,w}),   (5)

where conj(û^∨_{q,w}) denotes the complex conjugate of û^∨_{q,w}, which is given by û^∨_{q,w}(k) = û_{q,w}(−k), and * denotes the convolution operation. Thus, the maximum angular frequency of y_{q,w} is 4π NA/λ. Consequently, the Nyquist criterion dictates that the sampling step ∆ of the camera should satisfy

∆ ⩽ λ/(4 NA).   (6)

Discretized forward model
In this work, we obtain a discrete version x^im_q of x(r, t_q) by sampling it on a uniform (N × N) grid with pixel size ∆_r, as

x^im_q[n_1, n_2] = x((n_1 ∆_r, n_2 ∆_r), t_q)   (7)

for n_1 = 0, . . . , (N − 1) and n_2 = 0, . . . , (N − 1). The image size is given by N = r_p M, where r_p = ∆/∆_r ∈ N is the upsampling factor. Now, consider the 2D discrete Fourier transform (DFT) of x^im_q. The corresponding pixel size in the Fourier domain (or angular-frequency resolution) is ∆_k = 2π/(N ∆_r). Thus, we discretize the pupil function such that

p^im[k_1, k_2] = p(∆_k (k_1 − M/2), ∆_k (k_2 − M/2))   (8)

for k_1 = 0, . . . , (M − 1) and k_2 = 0, . . . , (M − 1). Note that the choice of ∆ and ∆_k ensures that the support of the pupil function lies within the (M × M) sampling grid for p.

⁵ The pupil function p is described directly in the Fourier domain.
⁶ We consider square even-sized images for the sake of simplicity.
In our discretization scheme, we further assume that the wave vector k_{l_{q,w}} lies on the Fourier grid, in the sense that

k_{l_{q,w}} = ∆_k m_{l_{q,w}} for some m_{l_{q,w}} = (m^{(1)}_{l_{q,w}}, m^{(2)}_{l_{q,w}}) ∈ Z^2,

so that the corresponding spectral shift amounts to an integer number of pixels. We now introduce some additional notation to specify the discrete forward model. Let y_{q,w} ∈ R^{M^2}, ȳ_{q,w} ∈ R^{M^2}, x_q ∈ C^{N^2}, and p ∈ C^{M^2} be the vectorized versions of y^im_{q,w}, ȳ^im_{q,w}, x^im_q, and p^im, respectively, where ȳ^im_{q,w} denotes the recorded (noisy) image. Then, for S ∈ {M, N}, let F_S, F_S^{-1} ∈ C^{S^2 × S^2} be matrices that represent the 2D DFT of an (S × S) image and its inverse, respectively. Next, we define diag(p) ∈ C^{M^2 × M^2} to be a diagonal matrix whose entries are the values in p. Finally, C_{k_{l_{q,w}}} ∈ {0, 1}^{M^2 × N^2} is a boolean matrix that restricts an N^2-dimensional vector to an M^2-dimensional one depending on the illumination wave vector k_{l_{q,w}}.

Proposition 1. The discrete counterpart of (1) can be computed as

y_{q,w} = |F_M^{-1} diag(p) C_{k_{l_{q,w}}} F_N x_q|^2.   (9)

Proof. Consider the quantity

g_{q,w}(r) = F^{-1}{p(k) x̂(k − k_{l_{q,w}}, t_q)}(r) = (1/(2π)^2) ∫_{R^2} p(k) x̂(k − k_{l_{q,w}}, t_q) e^{j⟨k, r⟩} dk,   (10)

so that y_{q,w} = |g_{q,w}|^2. We discretize the integrals in (10) using Riemann sums. A step size ∆_k is used for the integral with respect to k and a step size ∆_r is used for the integral with respect to s in x̂(k, t_q) = ∫_{R^2} x(s, t_q) e^{−j⟨k, s⟩} ds. The first discretization yields

g_{q,w}(m_1 ∆, m_2 ∆) ≈ (∆_k^2/(2π)^2) Σ_{k_1=0}^{M−1} Σ_{k_2=0}^{M−1} p^im[k_1, k_2] x̂(k_{k_1,k_2} − k_{l_{q,w}}, t_q) e^{j⟨k_{k_1,k_2}, (m_1 ∆, m_2 ∆)⟩},   (11)

where k_{k_1,k_2} = ∆_k (k_1 − M/2, k_2 − M/2), while the second one yields

x̂(k_{k_1,k_2} − k_{l_{q,w}}, t_q) ≈ ∆_r^2 Σ_{n_1=0}^{N−1} Σ_{n_2=0}^{N−1} x^im_q[n_1, n_2] e^{−j⟨k_{k_1,k_2} − k_{l_{q,w}}, (n_1 ∆_r, n_2 ∆_r)⟩}.   (12)

The limits of the sums in (11) and (12) are dictated by the supports of p and x(·, t_q), respectively. By rearranging some terms and using the fact that ∆_k ∆_r = 2π/N, we rewrite (12) as

x̂(k_{k_1,k_2} − k_{l_{q,w}}, t_q) ≈ ∆_r^2 X̂_q[(k_1 − M/2 − m^{(1)}_{l_{q,w}}) mod N, (k_2 − M/2 − m^{(2)}_{l_{q,w}}) mod N],   (13)

where X̂_q is the (N, N)-point DFT of x^im_q and the shifts in the DFT are applied in a circular manner. On plugging (13) into (11), we get that

g_{q,w}(m_1 ∆, m_2 ∆) ≈ (∆_k^2 ∆_r^2/(2π)^2) Σ_{k_1=0}^{M−1} Σ_{k_2=0}^{M−1} p^im[k_1, k_2] X̂_q[(k_1 − M/2 − m^{(1)}_{l_{q,w}}) mod N, (k_2 − M/2 − m^{(2)}_{l_{q,w}}) mod N] e^{j⟨k_{k_1,k_2}, (m_1 ∆, m_2 ∆)⟩}.   (14)

Next, we group all the exponential terms involving k_1 and k_2 and use ∆_k ∆ = 2π/M to obtain that

g^im_{q,w}[m_1, m_2] := g_{q,w}(m_1 ∆, m_2 ∆) ≈ (1/r_p^2) e^{−jπ(m_1 + m_2)} IDFT_M{p^im ⊙ X̂^shift_q}[m_1, m_2],   (15)

where X̂^shift_q denotes the restriction of the circularly shifted X̂_q to the (M × M) grid, ⊙ denotes the pointwise product, and the per-pixel unimodular factor e^{−jπ(m_1 + m_2)} stems from the centering of the frequency grid. Then, the discrete measurements can be expressed as

y^im_{q,w}[m_1, m_2] = |g^im_{q,w}[m_1, m_2]|^2,   (16)

in which the unimodular factor vanishes and the constant 1/r_p^2 can be absorbed into p. Note that the computation of g^im_{q,w} involves taking the (N, N)-point DFT of x^im_q, (circularly) shifting it according to the wave vector k_{l_{q,w}}, restricting the shifted DFT to an (M × M) image, performing pointwise multiplication with p^im, and then taking the (M, M)-point IDFT. This allows us to write (16) in vectorized form as in the right-hand side of (9).
While the discrete forward model (9) has previously been used in works such as [16], to the best of our knowledge, a systematic derivation of (9) from the continuous model (1) has not been presented in the literature.
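To make the sequence of operations in the proof concrete, the noiseless forward computation can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `fp_forward`, the integer-pixel `shift` argument (the wave vector divided by ∆_k), and the centered-grid conventions are our own assumptions.

```python
import numpy as np

def fp_forward(x, pupil, shift):
    # Sketch of the discrete forward model: (N, N)-point DFT of the sample,
    # circular shift by the illumination wave vector (in pixels), restriction
    # to the (M x M) passband, pointwise pupil multiplication, (M, M)-point
    # inverse DFT, and squared modulus.
    N, M = x.shape[0], pupil.shape[0]
    X = np.fft.fftshift(np.fft.fft2(x))                  # centered DFT
    X = np.roll(X, (-shift[0], -shift[1]), axis=(0, 1))  # circular shift
    lo = N // 2 - M // 2
    X_crop = X[lo:lo + M, lo:lo + M]                     # (M x M) restriction
    g = np.fft.ifft2(np.fft.ifftshift(pupil * X_crop))
    return np.abs(g) ** 2                                # camera intensity
```

With an all-pass pupil (M = N, p ≡ 1) and no shift, the output reduces to |x|^2, which provides a quick sanity check of the implementation.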

Reconstruction framework
The goal in dynamic FP is to reconstruct the images {x_q ∈ C^{N^2}}_{q=1}^{Q} from the recorded measurements {ȳ_{q,w} ∈ R^{M^2}}. We first present our neural-network-based framework for the case of a well-characterized pupil function. Then, we describe a way to incorporate the recovery of the pupil function into our reconstruction algorithm.

Deep spatiotemporal priors
The concept of using untrained convolutional neural networks (CNNs) as regularization for solving inverse problems was first introduced in [40] under the name 'deep image prior' (DIP). There, the image of interest is represented as the output of a CNN with adjustable parameters and some fixed input. It is then shown that the fitting of the network to the measurements yields high-quality image reconstructions for several applications such as denoising, superresolution, and inpainting. This is attributed to the observation that CNNs have a remarkable tendency to favor natural-looking images ('good' solutions) over noisy ones ('bad' solutions). In some scenarios, DIP is deployed with early stopping as deep networks have the capacity to fit noise. In other words, there is a point beyond which running the optimization process for more iterations degrades the quality of the reconstruction. Thus, the architecture of a CNN (and the optimization procedure) can be used as an implicit prior for natural images.
In our reconstruction framework, we propose to use an extended version of DIP to impose spatiotemporal regularization on the sequence of images. We parameterize each of the Q images as the output of a single CNN f_θ : R^{N_z^2} → C^{N^2}, with adjustable parameters θ ∈ R^P, applied to some fixed input latent vector z_q ∈ R^{N_z^2}, q = 1, . . . , Q. We choose these latent vectors such that they lie on a straight line, in accordance with

z_q = z_1 + ((q − 1)/(Q − 1)) (z_Q − z_1),   q = 1, . . . , Q,   (17)

where the end-points z_1, z_Q are fixed beforehand (for example, by drawing two samples from some multivariate probability distribution). We then optimize the parameters of the network to fit the measurements according to

θ* ∈ arg min_θ Σ_{q=1}^{Q} Σ_{w=1}^{W} D(ȳ_{q,w}, |H_{l_{q,w}} f_θ(z_q)|^2),   (18)

where D : R^{M^2} × R^{M^2} → R_+ is a data-fidelity term and H_l = F_M^{-1} diag(p) C_{k_l} F_N is the linear part of the forward model (9). The rationale behind our choice of the latent vectors is to allow the CNN to associate the spatial proximity between them with the temporal proximity of the images. In this manner, the architecture of the network imposes spatial regularization while the use of a shared network for all images and the design of the latent space impose temporal regularization. A schematic illustration of our framework is given in figure 2.
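The latent-space construction described above is plain linear interpolation between the two fixed end-points, as a minimal sketch shows (the function name is ours):

```python
import numpy as np

def latent_line(z1, zQ, Q):
    # Latent vectors z_q = z_1 + (q - 1) / (Q - 1) * (z_Q - z_1), q = 1, ..., Q,
    # i.e. Q points spaced uniformly on the segment joining z1 and zQ.
    return [z1 + a * (zQ - z1) for a in np.linspace(0.0, 1.0, Q)]
```

The shared network then maps each z_q to one frame, so that nearby latent vectors produce temporally nearby images.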

Algorithm 1. Initialization of network parameters.
Input: low-quality reconstructions {x̂_q}_{q=1}^{Q}, latent vectors {z_q}_{q=1}^{Q}, tolerance ε_tol, maximum number of iterations n_max
Initialize the network parameters θ and set n ← 0
while n < n_max do
    Update θ with one minibatch stochastic gradient step on the cost in (19); set n ← n + 1
    if the cost in (19) falls below ε_tol then
        Exit the while loop
    end if
end while
Output: Network parameters θ

Optimization strategy
The relation between the measurements and the underlying images is nonlinear, which makes the inverse problem very challenging. The fact that only one LED is switched on for each measurement further adds to the difficulty. Thus, in order to avoid bad local minima while solving the optimization problem in (18), we initialize the parameters of the network according to

θ_0 ∈ arg min_θ Σ_{q=1}^{Q} (‖ |f_θ(z_q)| − |x̂_q| ‖_2^2 + ‖ arg(f_θ(z_q)) − arg(x̂_q) ‖_2^2),   (19)

where {x̂_q}_{q=1}^{Q} are low-quality reconstructions obtained via a standard frame-by-frame method. The magnitude |·| and phase arg(·) operations in (19) are applied component-wise. We can solve (19) using off-the-shelf minibatch stochastic gradient-descent algorithms. However, it is not desirable to run these algorithms till convergence, as the network then overfits the artifacts present in the low-quality reconstructions. Thus, in our initialization routine, which is described in algorithm 1, we deploy early stopping by choosing suitable values for the tolerance ε_tol and the maximum number of iterations n_max (see section 4.2.3 for details).
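The structure of this early-stopped fitting stage can be illustrated with a toy scalar example; the function and its arguments are our own illustrative abstraction of algorithm 1, not the authors' code:

```python
def fit_with_early_stopping(theta0, grad, loss, eps_tol=1e-3, n_max=1000, lr=0.1):
    # Gradient steps until the fitting loss drops below the tolerance eps_tol
    # or the iteration budget n_max is exhausted (early stopping).
    theta = theta0
    for _ in range(n_max):
        if loss(theta) < eps_tol:
            break  # exit the loop early: the fit is good enough
        theta = theta - lr * grad(theta)
    return theta
```

In the actual routine, θ collects the network parameters, the loss is the fitting cost of (19) evaluated on a minibatch of frames, and the step is taken by a stochastic optimizer.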
After the initialization, we can solve (18) using again some minibatch stochastic gradient-descent algorithm. In some cases (for example, when the measurements are corrupted by a non-negligible amount of noise), running the optimization process beyond a certain number of iterations leads to a deterioration of the reconstruction quality as the network begins to overfit the measurements. Thus, we also adopt early stopping when necessary.
For both the initialization and reconstruction tasks, we use (minibatch) stochastic gradient-descent algorithms instead of deterministic ones. This introduces additional hyperparameters (batch sizes) that must be set appropriately. However, stochastic methods with small batch sizes require much less memory than deterministic ones. In fact, if the number of frames Q is large, applying a deterministic gradient-descent method is infeasible. Further, such stochastic methods are also more likely to escape bad local minima and thus reach better solutions. Indeed, in our experiments, we observed that using reasonably small batch sizes (B_Q = 10) led to better reconstructions than using large batch sizes (B_Q = 40).

Recovery of the pupil function
So far, we have assumed complete knowledge of the pupil function in our reconstruction framework. However, the pupil function is typically not well-characterized in FP. Thus, similar to the work in [17, 26], we estimate it along with the sequence of images. Following [26], we use Zernike polynomials to represent the pupil function with only a few parameters. These polynomials are orthogonal on the unit disk and are often used in optics for modeling aberrations. We express the pupil function in polar coordinates (ρ, φ) as

p(ρ, φ) = exp(j Σ_{a=1}^{A} c_a Z_a(ρ, φ)) for ρ ⩽ 1, and p(ρ, φ) = 0 otherwise,   (20)

where Z_a is the ath Zernike polynomial according to Noll's sequential indices (refer to the appendix for details), c = (c_a)_{a=1}^{A} ∈ R^A (A ≪ M^2) contains the Zernike coefficients, and the radial coordinate is normalized so that ρ = 1 corresponds to the cutoff frequency 2π NA/λ. The pupil function is discretized as in (8) by evaluating (20) on the required Cartesian grid. We denote the vectorized discrete pupil function by p(c) ∈ C^{M^2} to explicitly indicate the dependence on the Zernike coefficients. Similarly, our forward model (9) is then written as

y_{q,w} = |H_{l_{q,w}}(c) x_q|^2,  with  H_l(c) = F_M^{-1} diag(p(c)) C_{k_l} F_N.   (21)

Finally, the optimization problem for the joint recovery of the pupil function and the sequence of images is

(θ*, c*) ∈ arg min_{θ, c} Σ_{q=1}^{Q} Σ_{w=1}^{W} D(ȳ_{q,w}, |H_{l_{q,w}}(c) f_θ(z_q)|^2),   (22)

where D : R^{M^2} × R^{M^2} → R_+ is the data-fidelity term. We can solve (22) using a minibatch stochastic gradient-descent algorithm coupled with early stopping if required. Our complete reconstruction algorithm is summarized in algorithm 2.
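A sketch of a phase-only Zernike pupil on a normalized disk is given below. The first four Noll-indexed polynomials are hard-coded for illustration; the function name, the grid centering, and the restriction to four modes are our own assumptions.

```python
import numpy as np

def pupil_from_zernike(c, M, rho_max=1.0):
    # Phase-only pupil exp(j * sum_a c_a Z_a(rho, phi)) on a disk, with the
    # first four Noll-indexed Zernike polynomials hard-coded for illustration
    # (normalizations follow Noll's convention).
    k = (np.arange(M) - M // 2) / (M // 2)   # normalized Cartesian grid
    kx, ky = np.meshgrid(k, k)
    rho, phi = np.hypot(kx, ky), np.arctan2(ky, kx)
    Z = [np.ones_like(rho),                       # Z1: piston
         2.0 * rho * np.cos(phi),                 # Z2: tilt
         2.0 * rho * np.sin(phi),                 # Z3: tip
         np.sqrt(3.0) * (2.0 * rho ** 2 - 1.0)]   # Z4: defocus
    phase = sum(ca * Za for ca, Za in zip(c, Z))
    return np.where(rho <= rho_max, np.exp(1j * phase), 0.0)
```

Evaluating the polynomials on the Cartesian Fourier grid, as in (8), then yields the discrete pupil p(c).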

Experimental results

Simulated setup
We demonstrate the advantages of our reconstruction method on simulated data. We consider an FP setup consisting of L = 100 LEDs arranged in a (10 × 10) uniform grid with a spacing of d_L = 4 mm. The maximum illumination NA of the LED array, which is placed at a distance h = 90.88 mm from the sample, is 0.27. The LEDs emit light with wavelength λ = 532 nm. The NA of the objective is NA = 0.1. We have chosen these values of d_L, h, λ, and NA based on the experimental setup in [20]. The pupil function is defined according to (20) using the first nine Zernike polynomials with coefficients c = (0, 0.15, 0.3, −0.1, 0.2, 0, 0, 0, 0) ∈ R^9. We take the low-resolution measurements acquired by the camera to be of size (64 × 64) with pixel size ∆ = λ/(4 NA) = 1.33 µm and we set the upsampling factor to r_p = 4. Consequently, the pixel size for the high-resolution image is ∆_r = 332.5 nm and the step size for discretizing the pupil function is ∆_k = 0.074 µm^{-1}. The LED array and the pupil function are shown in figure 3.
Our ground truth is a sequence of complex-valued images {x_q ∈ C^{256^2}}_{q=1}^{100} of size (256 × 256), which we created from experimental phase images. We place ourselves in the extremely challenging ultrafast regime where only one measurement is acquired for each image in the sequence. For each measurement, a single LED of index l_q is randomly activated and a low-resolution image y_q ∈ R^{64^2} is simulated according to (21). The recorded measurement image ȳ_q ∈ R^{64^2} is then generated according to

ȳ_q = y_q + n_q,   (23)

where n_q ∈ R^{64^2} is a realization of a zero-mean Gaussian random vector with covariance matrix Σ_q ∈ R^{64^2 × 64^2}. Specifically, we consider two settings for our simulations. In the first case, Σ_q is the zero matrix, which means that the recorded measurements are noiseless. In the second case, Σ_q is a diagonal matrix with entries [y_q]_m / 1000 for m = 1, . . . , 64^2. There, (23) corresponds to a Gaussian approximation of the Poisson noise model with a photon budget of 1000.
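The noisy setting can be simulated with a zero-mean Gaussian perturbation whose per-pixel variance is the clean intensity divided by the photon budget; a sketch follows (the function name and the clipping of negative intensities are our own choices):

```python
import numpy as np

def add_poisson_like_noise(y, budget=1000, rng=None):
    # Zero-mean Gaussian noise with per-pixel variance y_m / budget, a
    # Gaussian approximation of Poisson noise at the given photon budget.
    rng = np.random.default_rng(rng)
    return y + rng.normal(scale=np.sqrt(np.maximum(y, 0.0) / budget))
```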
We show some of the frames in the ground-truth sequence and the corresponding measurements for both the settings (noiseless and noisy) in figure 4. The full sequences are provided in the supplementary material.

Implementation of the deep spatiotemporal prior
In this subsection, we describe the implementation of our reconstruction method-the deep spatiotemporal prior (DSTP).

Network architecture.
It has been observed that the choice of the network architecture can greatly affect the performance of DIP [40]. Therefore, the common practice when deploying DIP (or other related schemes) is to select the architecture in an empirical, trial-and-error manner for the specific task at hand. For our experiments, inspired by [39], we adopt a convolutional decoder-like architecture for f_θ, which, as we demonstrate in sections 4.5 and 4.6, yields high-quality reconstructions. It takes a low-dimensional input vector z ∈ R^{8^2} and outputs a complex-valued (vectorized) image f_θ(z) ∈ C^{256^2}. The architectural details are described in table 1. In particular, the complex-valued image is generated from a pair of magnitude and phase images. The initial part of the network creates feature maps of size (128 × 256 × 256). These are then fed into both the magnitude and phase branches of the network. The magnitude branch consists of a convolutional layer followed by the pointwise differentiable rectified linear unit (DReLU) activation function, which we define as

DReLU(x) = γ log(1 + e^{x/γ}),   (24)

where γ > 0 is set a priori. We use DReLU (with γ = 0.1) instead of ReLU to avoid the 'dead-neuron' issue during the first few iterations of the optimization, while ensuring that the magnitude is positive. Meanwhile, the phase branch consists of a convolutional layer followed by the π tanh(·) nonlinearity to constrain the phase to lie within the range [−π, π].
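The two output branches can be combined as sketched below. Here, DReLU is implemented as a scaled softplus, which is one positive, everywhere-differentiable surrogate of ReLU consistent with the stated requirements, but an assumption on our part; the function names are also ours.

```python
import numpy as np

def drelu(x, gamma=0.1):
    # Softplus-style stand-in for DReLU: strictly positive, differentiable
    # everywhere, and close to ReLU for small gamma (no 'dead neurons').
    return gamma * np.log1p(np.exp(np.asarray(x, dtype=float) / gamma))

def combine(mag_logits, phase_logits):
    # Combination layer: a complex image from the magnitude branch (positive
    # via DReLU) and the phase branch (range [-pi, pi] via pi * tanh).
    return drelu(mag_logits) * np.exp(1j * np.pi * np.tanh(phase_logits))
```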

Latent vectors.
As mentioned in section 3.1, the latent vectors {z_q ∈ R^{8^2}}_{q=1}^{100} are chosen such that they lie on the straight line defined in (17). We fix the end-points z_1, z_100 of this line by drawing two samples from the standard multivariate normal distribution in 8^2 dimensions.

Initialization.
In all our experiments, we initialize the parameters of the network using reconstructions obtained from the GS algorithm (briefly described in section 4.3). We run algorithm 1 using the AMSGrad solver [42] with a learning rate of 10^{-3}, a batch size B_Q = 10, a tolerance ε_tol = 0.1 × (B_Q × 256^2), and a maximum number of iterations n_max = 1000. We then freeze the tunable parameters of the batch-normalization layers. For the experiments involving the estimation of the pupil function, we initialize the Zernike coefficients as c = 0.
We have observed that the initialization of the network parameters has an impact on the reconstruction quality. For example, randomly initializing the parameters does not lead to satisfactory results. However, initializing the network by simply fitting it to low-quality solutions of the GS algorithm (along with early stopping to avoid overfitting the artifacts) allows us to obtain excellent reconstructions (see sections 4.5 and 4.6).

Table 1. Architecture of the network f_θ. Size of input: (1 × 8^2). Conv: convolutional layer with (3 × 3) kernels and reflective boundary conditions. BN: batch-normalization layer. Upsampling: nearest-neighbor interpolation. The amplitude and phase branches take the same input of size (128 × 256 × 256) and output the magnitude and phase images of size (1 × 256 × 256), respectively. DReLU is described in (24). The combination layer generates a complex-valued image from the magnitude and phase images. This network consists of 1 628 546 learnable parameters.

(Table 1 columns: Layers | Output shape.)

Choice of the data-fidelity term.
The data-fidelity term D(·, ·) in (18) and (22) measures the discrepancy between the observed and the simulated measurements. For optimization-based FP reconstruction, it has been experimentally shown in [16] that a cost function of the form

D(a, b) = (1/2) ‖ √a − √b ‖_2^2,   (25)

where a, b ∈ R_+^{M^2} and the square root is applied component-wise, is robust and leads to better reconstructions than the popular mean-squared-error cost function D(a, b) = (1/2) ‖a − b‖_2^2. In particular, the cost function in (25) has a gradient similar to that of the Poisson-likelihood-based cost function, which suggests that it can also handle Poisson-like noise well [16]. In our framework, we use the slightly modified version of (25) given by

D(a, b) = (1/2) ‖ √(a + ε1) − √(b + ε1) ‖_2^2,   (26)

where 1 ∈ R^{M^2} is a vector with all entries equal to 1 and ε = 10^{-10} helps us avoid numerical instabilities in the computation of the gradient.
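A direct NumPy transcription of the modified amplitude-based data fidelity (26) reads as follows (the function name is ours; a and b are vectorized intensity images):

```python
import numpy as np

def amplitude_loss(a, b, eps=1e-10):
    # D(a, b) = 0.5 * || sqrt(a + eps) - sqrt(b + eps) ||_2^2,
    # where eps keeps the square root differentiable at zero intensities.
    return 0.5 * np.sum((np.sqrt(a + eps) - np.sqrt(b + eps)) ** 2)
```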
Note. The details regarding the optimization process for (18) and (22) are provided in sections 4.5 and 4.6.

Comparisons
We compare our proposed framework to the following methods.

GS algorithm.
The GS algorithm [8] is a classical method for phase retrieval. Assuming that the Zernike coefficient vector c is known, it aims at solving the feasibility problem

x*_{GS,q} ∈ {x ∈ C^{N^2} : ȳ_q = |H_{l_q}(c) x|^2}   (27)

for q = 1, 2, . . . , 100, by alternately updating the image plane and the object plane. We refer the reader to [8] for more details. When the pupil function is not well-characterized, we do not incorporate its recovery within the GS algorithm. Instead, we solve (27) assuming an idealized pupil function with no phase aberrations, which corresponds to c = 0.
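For intuition, a toy single-image alternating-projection loop in the spirit of GS is sketched below. This is a simplification (one fixed illumination, a binary pupil support, a random initial phase) and not the multi-illumination FP update of [8].

```python
import numpy as np

def gerchberg_saxton(y, pupil_mask, n_iter=50, rng=0):
    # Toy GS-style loop: alternately project onto the pupil support in the
    # Fourier plane and restore the measured amplitude sqrt(y) in the
    # object plane, starting from a random phase estimate.
    rng = np.random.default_rng(rng)
    g = np.sqrt(y) * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, y.shape))
    for _ in range(n_iter):
        G = np.fft.fft2(g) * pupil_mask               # pupil-support projection
        g = np.fft.ifft2(G)
        g = np.sqrt(y) * np.exp(1j * np.angle(g))     # restore measured amplitude
    return g
```

Each iteration enforces one constraint in each plane; the returned estimate always satisfies the amplitude constraint exactly.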

Data-consistency estimator (DC).
Based on the work in [16], we consider a data-consistency (DC) estimator that minimizes the (slightly modified) 'amplitude-based' cost function (26). For the joint recovery of the images and the pupil function, it is given by

({x*_{DC,q}}_{q=1}^{100}, c*) ∈ arg min_{{x_q}, c} Σ_{q=1}^{100} D(ȳ_q, |H_{l_q}(c) x_q|^2),   (28)

where D(·, ·) is defined in (26).

Spatially total-variation-regularized estimator (STV).
In our numerical simulations, we also consider a regularized estimator where the cost function in (28) is augmented with spatial anisotropic TV regularization for each frame. It is given by

({x*_{STV,q}}_{q=1}^{100}, c*) ∈ arg min_{{x_q}, c} Σ_{q=1}^{100} (D(ȳ_q, |H_{l_q}(c) x_q|^2) + τ_{amp,q} ‖ L|x_q| ‖_1 + τ_{phase,q} ‖ L arg(x_q) ‖_1),   (29)

where the operator L : R^{N^2} → R^{2N^2} computes finite differences in both directions for the underlying image, and {τ_{amp,q}, τ_{phase,q}}_{q=1}^{100} ⊂ R_+ are hyperparameters that control the strength of the regularization.
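The spatial penalty applies the anisotropic TV to the (component-wise) magnitude and phase images; for reference, the anisotropic TV of a single real image can be computed as follows (the function name is ours):

```python
import numpy as np

def anisotropic_tv(img):
    # Anisotropic TV: l1 norm of horizontal and vertical finite differences,
    # as in the spatial terms of the STV and STTV penalties.
    dh = np.abs(np.diff(img, axis=1)).sum()
    dv = np.abs(np.diff(img, axis=0)).sum()
    return dh + dv
```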

Spatiotemporally total-variation-regularized estimator (STTV).
Finally, we also implement a spatiotemporally regularized estimator where the cost function in (28) is augmented with both spatial and temporal TV regularization. It is given by

({x*_{STTV,q}}_{q=1}^{100}, c*) ∈ arg min_{{x_q}, c} Σ_{q=1}^{100} (D(ȳ_q, |H_{l_q}(c) x_q|^2) + τ_{amp,s} ‖ L|x_q| ‖_1 + τ_{phase,s} ‖ L arg(x_q) ‖_1) + Σ_{q=2}^{100} (τ_{amp,t} ‖ |x_q| − |x_{q−1}| ‖_1 + τ_{phase,t} ‖ arg(x_q) − arg(x_{q−1}) ‖_1),   (30)

where L : R^{N^2} → R^{2N^2} is the finite-difference operator and {τ_{amp,s}, τ_{phase,s}, τ_{amp,t}, τ_{phase,t}} ⊂ R_+ are the regularization hyperparameters.

Evaluation metric
We quantify the performance of a method by computing the regressed signal-to-noise ratio (RSNR) for the entire reconstructed sequence of images. Let x and x* denote vectorized versions of the ground truth and the reconstruction, respectively. These are created by concatenating the vectorized representations of each frame in the sequence. The RSNR is computed as

RSNR(x*, x) = max_{a, b} 20 log_10(‖x‖_2 / ‖x − (a x* + b 1)‖_2).   (31)

We also report the SNR for the pupil function whenever it is jointly estimated with the sequence of images. This metric is computed as

SNR(c*, c) = 20 log_10(‖c‖_2 / ‖c − c*‖_2),   (32)

where c and c* are the ground-truth and estimated Zernike coefficients, respectively.
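Reading 'regressed' as the best affine fit a x* + b of the reconstruction onto the ground truth, the metric can be sketched for real-valued vectors as follows (the least-squares formulation and the function name are our own illustrative assumptions):

```python
import numpy as np

def rsnr_db(x_rec, x_true):
    # SNR after the best affine regression a * x_rec + b of the
    # reconstruction onto the ground truth (ordinary least squares).
    A = np.stack([x_rec, np.ones_like(x_rec)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, x_true, rcond=None)
    err = x_true - (a * x_rec + b)
    return 20.0 * np.log10(np.linalg.norm(x_true) / np.linalg.norm(err))
```

The regression makes the metric invariant to global affine distortions of the reconstruction, which the plain SNR would penalize.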

Reconstruction from noiseless measurements
We now present two experiments involving noiseless measurements. In both of them, we run the iterative algorithm for each method for a sufficient number of iterations (details are provided below), beyond which the reconstruction does not change significantly. In other words, we do not deploy early stopping for any method as the measurements are noiseless.

Perfectly characterized pupil function.
We first consider an idealized setting where the pupil function is perfectly characterized and is therefore not estimated during the reconstruction of the images of interest. In this scenario, the DC and STV estimators can be computed in a frame-by-frame manner (similar to the GS method) by decomposing the overall optimization problems into Q = 100 smaller ones. We solve these by running AMSGrad with a learning rate of 10^{-3} for 1000 iterations. In order to improve their performance, we initialize the GS, DC, and STV methods for the timestamp t_q with the reconstructed images from the previous timestamp t_{q−1}. The GS solution is used for initializing the STTV method. We solve (30) by using AMSGrad for 10 000 epochs with a learning rate of 10^{-3} and a full batch size of 100. The optimal hyperparameters {τ_{amp,q}, τ_{phase,q}}_{q=1}^{100} and {τ_{amp,s}, τ_{phase,s}, τ_{amp,t}, τ_{phase,t}} for the STV and STTV methods, respectively, are chosen via a grid search. For DSTP, the network parameters are initialized with the help of the GS solution. We then solve (18) by running the AMSGrad optimizer for 10 000 epochs with a learning rate of 5 × 10^{-5} and a batch size of B_Q = 10.
We present the RSNR values for all the methods in table 2. Further, we display some slices of the (2D + time) reconstructions in figure 5. The entire reconstructed sequences can be found in the supplementary material. We observe that the proposed method significantly outperforms the GS, DC, STV and STTV methods. Even though only one measurement is acquired per frame, it yields a high-quality reconstruction, unlike the other methods, which exhibit various artifacts (for example, the features marked by arrows in figure 5).

4.5.2. Joint recovery of dynamic sample and pupil function

Next, we consider a setting where the pupil function is not well-characterized and is therefore estimated jointly with the dynamic sample in our framework and in the DC, STV and STTV methods. (We do not adapt the GS algorithm for the recovery of the pupil function; we simply assume the idealized pupil function c = 0.) For the DC, STV and STTV methods, the sequence of images is initialized with the GS solution and the Zernike coefficients are initialized as c = 0. We solve (28) and (29) by running the AMSGrad optimizer for 10 000 epochs with a learning rate of 10^-3 and a batch size of 10. For solving (30), we run AMSGrad for 10 000 epochs with a learning rate of 10^-3 and a full batch size of 100. In the STV method, we select two global hyperparameters {τ_amp, τ_phase} via grid search and share them among all frames. The hyperparameters {τ_amp,s, τ_phase,s, τ_amp,t, τ_phase,t} for the STTV method are also tuned for best performance with the help of a grid search. In our method, we initialize the network parameters using the GS solution and we initialize the Zernike coefficients as c = 0. We solve (18) by running AMSGrad for 10 000 epochs with a learning rate of 5 × 10^-5 and a batch size of B_Q = 10.

We present the RSNR and SNR values for the reconstructed sequence and the pupil function, respectively, in table 3. We also show some slices of the (2D + time) reconstructions and the recovered pupil functions (phase) in figure 6, as well as the recovered Zernike coefficients in figure 7. The full reconstructed sequences are provided in the supplementary material. Here, the DC, STV and STTV methods fail to recover the Zernike coefficients (i.e. the pupil function) accurately and yield poor reconstructions of the dynamic sample. On the contrary, our method provides a good estimate of the pupil function along with a high-quality reconstruction of the moving sample.

Figure 7. Recovered Zernike coefficients from noiseless measurements. The first (Noll index = 1) Zernike mode only contributes a constant phase factor, which has no effect on the intensity measurements and can thus be ignored.
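All of the formulations above are minimized with AMSGrad. For concreteness, here is a minimal sketch of the AMSGrad update rule, the Adam variant that keeps a non-decreasing estimate of the second moment; `grad_fn` is a generic gradient oracle and stands in for the gradients of the data-fidelity objectives, so this is an illustration of the optimizer rather than the paper's actual code. In practice, one would use a library implementation (e.g. the `amsgrad` option of Adam in common deep-learning frameworks) with the network parameters and the Zernike coefficients c gathered into the optimized variables.

```python
import numpy as np

def amsgrad(grad_fn, theta0, lr=1e-3, beta1=0.9, beta2=0.999,
            eps=1e-8, epochs=1000):
    """Minimal AMSGrad sketch. Unlike plain Adam, the second-moment
    estimate v_hat is the running maximum of v, which guarantees a
    non-increasing effective step size per coordinate."""
    theta = theta0.astype(float).copy()
    m = np.zeros_like(theta)        # first-moment estimate
    v = np.zeros_like(theta)        # second-moment estimate
    v_hat = np.zeros_like(theta)    # running max of v (AMSGrad twist)
    for t in range(1, epochs + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        m_hat = m / (1 - beta1 ** t)          # bias correction
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```

On a simple quadratic objective, the iterates converge to the minimizer, which is the sanity check one would run before plugging in the FP data-fidelity gradients.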

Reconstruction from noisy measurements
Finally, we consider the joint recovery of the sequence of images and the pupil function from noisy measurements. In this case, we observe that the GS, DC and DSTP methods require early stopping, as running the corresponding iterative algorithm beyond a certain point leads to overfitting the noisy measurements. Thus, we run each method for a sufficiently large number of epochs (10 000) and report the reconstruction that achieves the best RSNR during these epochs.
For each method, we use the initialization, optimizer, learning rate and batch size described in section 4.5.2. The hyperparameters for the STV and STTV methods are also tuned in the same way as in section 4.5.2. We summarize the quantitative results for all the methods in table 4. We display some slices of the (2D + time) reconstructions and the estimated pupil functions (phase) in figure 8, and we present the recovered Zernike coefficients in figure 9. The entire reconstructed sequences are available in the supplementary material. In this setting, as shown in figure 4, the darkfield measurements are corrupted by significant amounts of noise, which makes the recovery problem quite challenging. Remarkably, our method still yields reconstructions of very good quality and outperforms the DC, STV and STTV methods by a large margin.

Figure 9. Recovered Zernike coefficients from noisy measurements. The first (Noll index = 1) Zernike mode only contributes a constant phase factor, which has no effect on the intensity measurements and can thus be ignored.
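The best-RSNR reporting described above amounts to oracle early stopping: the optimization runs for a fixed budget while the reconstruction with the highest score against the ground truth is retained. A minimal, framework-agnostic sketch is given below; `step`, `reconstruct` and `score` are hypothetical placeholders for one optimizer epoch, the current reconstruction, and the quality metric (here, the RSNR), not functions from the paper.

```python
import copy

def run_with_oracle_stopping(step, reconstruct, score, x_true, epochs=10_000):
    """Run `epochs` optimization steps and keep the reconstruction that
    achieves the best score against the ground truth `x_true`. This
    mimics oracle early stopping, which avoids overfitting the noisy
    measurements; it is only applicable in simulation, where the ground
    truth is available."""
    best_score, best_x = float("-inf"), None
    for _ in range(epochs):
        step()                     # one epoch of the iterative algorithm
        x = reconstruct()          # current estimate of the sequence
        s = score(x, x_true)
        if s > best_score:
            best_score, best_x = s, copy.deepcopy(x)
    return best_x, best_score
```

In a real acquisition without ground truth, a proxy criterion (e.g. a validation set of held-out measurements) would have to replace the oracle score.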

Computational cost
In all our experiments, we used an Intel Xeon Gold 6240R (2.6 GHz) CPU for the GS method and an NVIDIA V100 GPU for the DC, STV, STTV and DSTP methods. While DSTP achieves substantially better reconstruction quality than the other methods, its computational cost is also higher. For example, when jointly estimating the sequence and the pupil function from noiseless measurements, the run time for DSTP was around 5.5 h, as opposed to 3-30 min for the other approaches.

Conclusion
We have presented a neural-network-based framework, requiring no training data, for the reconstruction of high-resolution complex-valued images of a moving sample in dynamic FP. In our method, we have parameterized the sequence of images to be reconstructed with a shared convolutional network with adjustable parameters. We have encoded the temporal behavior of the sample in the input vectors of the network by constraining them to lie on a one-dimensional manifold. In this manner, we have leveraged both the structural prior of a neural network and the temporal regularity between consecutive frames. Further, we have incorporated the recovery of the pupil function of the microscope within our framework. Finally, with the help of simulations, we have shown that the proposed approach drastically improves the quality of reconstruction over standard frame-by-frame methods.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/pakshal23/DynamicFP.

Appendix. Zernike polynomials
In the polar coordinates (ρ, ϕ), the Zernike polynomials are given by

$$Z_v^u(\rho, \phi) = \begin{cases} R_v^{|u|}(\rho)\cos(u\phi), & u \geq 0,\\ R_v^{|u|}(\rho)\sin(|u|\phi), & u < 0, \end{cases}$$

where the radial polynomial is

$$R_v^{|u|}(\rho) = \sum_{k=0}^{(v-|u|)/2} \frac{(-1)^k\,(v-k)!}{k!\left(\frac{v+|u|}{2}-k\right)!\left(\frac{v-|u|}{2}-k\right)!}\,\rho^{v-2k}.$$

For $a \in \mathbb{Z}^+ \setminus \{0\}$, Noll's sequential indexing defines a mapping $Z_v^u \to Z_a$ such that the modes are ordered by increasing radial degree $v$ and, within each degree, by increasing $|u|$, with even indices $a$ assigned to $u \geq 0$ and odd indices to $u < 0$ whenever $u \neq 0$.
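For concreteness, the inverse of Noll's mapping, from the sequential index a back to the index pair (v, u), can be computed with the following standard algorithm; this is a generic implementation of the convention, not code from the paper.

```python
def noll_to_zernike(a):
    """Invert Noll's sequential indexing: map a = 1, 2, ... to the
    radial degree v and azimuthal frequency u of the Zernike mode Z_v^u."""
    assert a >= 1, "Noll indices start at 1"
    v, j = 0, a - 1
    # Find the radial degree v: degree v contributes a block of v + 1 modes.
    while j > v:
        v += 1
        j -= v
    # Within a degree, even a maps to u >= 0 and odd a to u <= 0.
    u = (-1) ** a * ((v % 2) + 2 * ((j + ((v + 1) % 2)) // 2))
    return v, u
```

For instance, a = 1 gives the piston mode (v, u) = (0, 0), consistent with the remark that the first Zernike mode only contributes a constant phase factor.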