Qualitative and quantitative enhancement of parameter estimation for model-based diagnostics using automatic differentiation with an application to inertial fusion

Parameter estimation using observables is a fundamental concept in the experimental sciences. Mathematical models that represent the physical processes can enable reconstructions of the experimental observables and greatly assist in parameter estimation by turning it into an optimization problem, which can be solved by gradient-free or gradient-based methods. In this work, the recent rise in flexible frameworks for developing differentiable scientific computing programs is leveraged in order to dramatically accelerate data analysis of a common experimental diagnostic relevant to laser–plasma and inertial fusion experiments, Thomson scattering. A differentiable Thomson-scattering data analysis tool is developed that uses reverse-mode automatic differentiation (AD) to calculate gradients. By switching from finite differencing to reverse-mode AD, three distinct outcomes are achieved. First, gradient descent is accelerated dramatically, to the extent that it enables near real-time usage in laser–plasma experiments. Second, qualitatively novel quantities that require O(10³) parameters can now be included in the analysis, enabling unprecedented measurements of small-scale laser–plasma phenomena. Third, uncertainty estimation approaches that leverage the Hessian become accurate and efficient because reverse-mode AD can be used to calculate the Hessian.


Introduction
Parameter estimation is a fundamental component of many experimental scientific disciplines. Measurements are often performed using parameter estimation, where the objective is to estimate the set of parameters that result in the observable from nature or a controlled experiment. In experimental plasma physics, a significant part of the discipline involves matching diagnostic models, physics-based mathematical functions that map between input parameters and the observable, to data in order to extract physical parameters of interest. Colloquially, this process is referred to as fitting. Examples of measurements that utilize diagnostic models include interferometry [1], proton radiography [1-4], spectroscopy [5], and Thomson scattering [6].
Measurement using a model-based approach is a type of optimization problem where the objective is to minimize the difference between the prediction and the observed data. Optimization algorithms can be grouped into gradient-based and gradient-free methods. Gradient-free methods include genetic algorithms, Bayesian optimization, Monte Carlo search, and surrogate methods like neural networks. When a gradient can be calculated, gradient-based algorithms generally outperform gradient-free methods, taking fewer iterations to reach a minimum [7]. Gradient-free methods often do not readily allow uncertainty to be quantified, as is the case with neural networks and genetic algorithms, or do not scale to many free parameters, as in the case of Bayesian optimization using Gaussian processes. Gradient-based optimization with a diagnostic model allows many free parameters, uncertainty analysis, and interpretability based on known physics.
Accurate and efficient calculation of gradients with respect to many parameters remains the largest obstacle for gradient-driven methods. Symbolic differentiation is not generally applicable because it not only requires a closed-form expression for the objective function but also requires that expression to be well-posed enough to not suffer from expression swell when differentiated. Finite differencing (FD) is frequently used [8]; however, this technique suffers from truncation errors and requires at least two function evaluations to determine the gradient with respect to each parameter of interest. Both methods, symbolic differentiation and finite differencing, become significant obstacles when attempting to determine a large number of parameters [9].
The recent explosion in deep learning has been driven by the flexibility and performance of high-level libraries such as PyTorch [10] and TensorFlow [11]. These libraries have enabled the training of neural networks with O(10⁹) parameters. Training at this scale is made possible by efficient implementations of automatic differentiation (AD) [12]. Specifically, reverse-mode AD enables the calculation of the gradient with respect to many parameters efficiently, requiring only a single function evaluation [12].
Applying reverse-mode AD to general-purpose numerical programs is often termed differentiable programming. It has been applied extensively in, for example, the geosciences [13], computational fluid dynamics [14-17], and accelerator physics [18, 19]. Nascent applications to plasma physics include the development of theoretical plasma physics [20] and the implementation of reduced models in multiscale plasma physics [21].
In this article, reverse-mode AD is applied to quantitatively improve data analysis and to qualitatively enable new regimes of study using Thomson-scattering diagnostics for experimental plasma physics. It is shown that AD combined with graphics processing units (GPUs) allows all the physics of a diagnostic model to be preserved while improving computational efficiency by >140× over previous algorithms and enabling rapid uncertainty estimation. This opens a qualitatively new regime of data analysis in three ways. First, because data are analyzed in near real time, the experiment can be informed as it is performed. Second, because the size of the parameter space is effectively unrestricted when using reverse-mode AD, parameter estimation was performed using O(1000) parameters. For certain applications, such as Thomson scattering for inertial fusion, it is shown that such a large increase in the number of parameters enables direct interrogation of the particle velocity distribution functions, a qualitatively new capability that was previously computationally prohibitive. Third, uncertainty estimation approaches built on second derivatives of the fit metric become practical because of the accuracy and efficiency of arbitrary-order derivative calculation using AD. Finally, the use of AD enables future enhancements, such as representing missing physics, e.g. relativistic effects, in the physical model through the use of neural networks.
The article is structured as follows. In section 2 the general algorithm is described and a comparison is made between implementations using a CPU vs a GPU, as well as using reverse-mode AD vs finite differencing. Section 2.1 discusses the changes to the methods, while sections 2.2 and 2.3 discuss the resulting improvements in computation time and uncertainty analysis. Section 3 shows the improvements when applied to Thomson scattering in laser-plasma experiments. Section 3.1 gives background on the measurement technique, while section 3.2 describes the resulting improvements to Thomson-scattering data analysis. Section 3.3 describes how AD opens the door to extracting even more information from existing data, such as measuring electron velocity distribution functions. Finally, the article is summarized in section 4.

Reverse-mode AD enables batched parameter estimation
Figure 1(a) shows a typical analysis algorithm for many model-based diagnostics. In the typical algorithm, raw data is preprocessed to account for instrumental effects, such as the spectral transmission of the diagnostic, and then background is subtracted from the data. Background subtraction is often a manual process that requires the user to try different models or amplitudes until they are satisfied with the result. Often the diagnostic model describes the data in terms of one diagnostic axis; in the case of Thomson scattering this is scattered frequency. If data is collected on a camera or similar device, then there may be a second diagnostic axis, such as time or space, so a lineout of the data is selected. This lineout may be a single row of pixels defining a single time or the average over a few rows representing a small window in time. After the user selects a lineout and provides a set of initial parameters for the model, a fit metric, also called a loss metric, is minimized, providing the parameters that best represent the data. If the plasma conditions are desired for an entire data set, this process must be repeated for every lineout. Sometimes stepping forward to the next lineout is automated using the previous fit's best parameters as the new initial parameters, but often this is a manual loop where the code must be restarted with new initial conditions and a new lineout. In figure 1 the manual steps are shown in gray while the automated steps are shown in orange.
Figure 1(b) shows an analysis algorithm modified to fully leverage the advantages of AD. In this algorithm the data is split into batches, which are minimized simultaneously. A batch is a group of lineouts whose agreement with the model is given by a single fit metric. Splitting the entire data set into batches circumvents the process of manually looping over each lineout in a dataset. Instead of manually subtracting the background from the preprocessed data, the background is determined once by a separate algorithm and added to the model during minimization in order to improve the statistical behavior when determining uncertainty, e.g. ensuring Poisson statistics are calculated based on the total read signal. This transition from manual to automatic can be seen in figure 1 with the transition from gray to orange. The initial parameters are still supplied by the user, but only one set is required and it is applied to all lineouts in all batches. These changes have two effects: first, user intervention is removed as much as possible, since waiting for user input tends to add significant down time; second, by moving to batches, fewer minimizations have to be performed at the expense of higher-dimensional gradients, where AD is most efficient (a minimal batching sketch is shown below). Optionally, the set of fits can be passed through a filter labeling each fit as 'good' or 'bad' and redoing the 'bad' fits using new initial parameters determined by neighboring lineouts. This step, called 'refitting', is not shown in figure 1(b).
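As an illustration, the sketch below shows one way the batching step could be implemented. The array shapes, function names, and the default batch size of six (motivated in section 2.2) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def make_batches(image, batch_size=6):
    """Group the lineouts of a 2D dataset (pixels x lineouts) into batches.

    Each batch is fit simultaneously under a single fit metric, so one
    minimization covers `batch_size` lineouts at once.
    """
    n_lineouts = image.shape[1]
    return [image[:, j:j + batch_size] for j in range(0, n_lineouts, batch_size)]

# Example: a 360-lineout dataset becomes 60 batches of 6 lineouts each.
batches = make_batches(np.zeros((1024, 360)), batch_size=6)
```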
Fitting is accomplished by minimizing a fit metric, or the value of the loss function, which quantifies the difference between a model and data by a scalar. This is an optimization problem that can be solved using gradient-free or gradient-based methods. As motivated in the introduction, gradient-based methods were used here. The implementation is as follows. An experimental observable, for example in the form of an image, is denoted by $D_{ij}$, where $i$ is the index of the pixel and $j$ is the index of the independent observation, i.e. different temporal or spatial locations. A diagnostic model is denoted by the function $M$; this model recreates the experimental observable based on physical parameters, that is, $M$ is a function of $\vec{p}$. Using these quantities, a gradient descent approach is used. The square of the relative $l_2$-norm was used as the scalar fit metric,

$$L(D_{ij}, \vec{p}_j) = \sum_{i=1}^{N_\mathrm{pixel}} \frac{\left[D_{ij} - M_i(\vec{p}_j)\right]^2}{D_{ij}}, \tag{1}$$

where the loss function $L(D_{ij}, \vec{p}_j)$ is the sum of the squared differences between the experimental data $D_{ij}$ and the calculation from the diagnostic model $M(\vec{p}_j)$, relative to the square of the uncertainty in the experimental data at pixel $i$. This summation is performed over all pixels in the region of interest; the total number of pixels is given by $N_\mathrm{pixel}$. As is common in measurements made with charge-coupled devices and similar detectors, the data is assumed to be Poisson distributed with an uncertainty of $\sqrt{D_{ij}}$.
The objective is to determine a $\vec{p}^{\,*}_j$ that will minimize $L(D_{ij}, \vec{p}_j)$ given experimental data $D_{ij}$. The relation is given by

$$\vec{p}^{\,*}_j = \underset{\vec{p}_j}{\arg\min}\; L(D_{ij}, \vec{p}_j). \tag{2}$$

When gradient descent is performed using finite differencing, it is often computationally prohibitive to simultaneously fit many independent measurements (fit over many $j$) because the number of free parameters (the size of $\vec{p}$) increases with the number of independent measurements ($j$). Typically, each independent measurement is fit independently. However, it is reasonable to assume that the sum of equation (1) over a batch of independent measurements can also be minimized. Formally, this is given by

$$\vec{p}^{\,*} = \underset{\vec{p}}{\arg\min}\; \sum_{j=1}^{J} L(D_{ij}, \vec{p}_j), \tag{3}$$

where $J$ is the size of the batch of independent measurements. Assuming the model-based diagnostic is not trivially cheap, gradient descent using finite differencing is limited to small values of $J$ because as $J$ increases, the number of free parameters, that is, the size of $\vec{p}$, increases in proportion.
Because reverse-mode AD enables the calculation of the gradient with respect to many parameters at nearly the same cost as calculating one [12], it enables a quantitative enhancement of this approach by allowing the search for $\vec{p}^{\,*}$ to occur over larger batch sizes, $J$, than is afforded by finite differencing. In section 3, the acceleration of data analysis due to the batching of Thomson-scattering data is discussed.
The remaining components of gradient descent remain the same. Explicitly, this means calculating the gradient of equation (3) with respect to $\vec{p}$, $\partial L(D_{ij}, \vec{p}_j)/\partial \vec{p}$, and using the optimization algorithm to update $\vec{p}$.
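To make the batched formulation concrete, the following is a minimal JAX sketch of equations (1)-(3) and their gradient. The Gaussian stand-in model, the spectral grid, and all names are illustrative assumptions, not the diagnostic model used in this work.

```python
import jax
import jax.numpy as jnp

wavelengths = jnp.linspace(520.0, 530.0, 256)  # hypothetical spectral axis

def model(p):
    # Stand-in forward model: amplitude, center, and width of one spectral line.
    amp, center, width = p
    return amp * jnp.exp(-(((wavelengths - center) / width) ** 2)) + 1.0

def loss(params, data):
    # Equation (3): equation (1) summed over a batch of J lineouts,
    # with the Poisson variance approximated by the measured counts.
    predictions = jax.vmap(model)(params)          # shape (J, N_pixel)
    return jnp.sum((data - predictions) ** 2 / data)

# One reverse-mode sweep returns the loss and its gradient with respect
# to every free parameter of every lineout in the batch.
value_and_grad = jax.jit(jax.value_and_grad(loss))

params = jnp.tile(jnp.array([100.0, 525.0, 1.0]), (6, 1))  # J = 6 lineouts
data = jax.vmap(model)(params) + 1.0                        # synthetic data
val, grads = value_and_grad(params, data)                   # grads: (6, 3)
```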

Efficiency improvements of batched parameter estimation
Figure 2 shows that the computation time for a fit is improved by two orders of magnitude by using AD and switching to computations on a GPU. Computation of the gradient is the largest computational expense in gradient descent minimization, making it the primary avenue for accelerating analysis. AD and GPU support were provided by the JAX [22] Python library, which allows Python code to be compiled using the XLA [11] compiler at execution time, allowing the code to run on GPUs.
AD is known to improve computation time and dramatically change the scaling with the number of free parameters, from order-$N$ scaling with finite differencing to order-unity scaling with AD [9]. These improvements are largely shown to hold for this application to Thomson-scattering analysis, providing roughly an order of magnitude speedup, with another order of magnitude provided by performing computations on a GPU. The improved scaling with gradient dimension motivates the algorithmic change from single lineouts to batches of lineouts. These timing tests were performed on an A10G Tensor Core GPU and a 2nd-generation AMD EPYC processor (AMD EPYC 7R32) CPU from AWS; the exact computation time depends on static parameters such as the number of points computed and the clock speeds of the specific processors used. Further improvements in computation speed can be achieved by using faster processors.
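The scaling difference can be seen in a few lines. The toy objective below is an illustrative assumption used only to contrast the two gradient methods; it is not the Thomson-scattering fit metric.

```python
import jax
import jax.numpy as jnp

def f(p):
    # Toy scalar objective standing in for the fit metric.
    return jnp.sum(jnp.sin(p) ** 2)

def fd_grad(p, eps=1e-6):
    # Forward differences: N + 1 evaluations of f for N parameters,
    # plus a truncation error of order eps.
    f0 = f(p)
    return jnp.array([(f(p.at[i].add(eps)) - f0) / eps for i in range(p.size)])

ad_grad = jax.jit(jax.grad(f))  # a single reverse sweep for all N parameters

p = jnp.linspace(0.0, 1.0, 1000)
g_fd = fd_grad(p)  # cost grows linearly with the 1000 parameters
g_ad = ad_grad(p)  # cost nearly independent of the parameter count
```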
A batch size of six lineouts was found to be optimal for balancing computation speed and convergence. Increasing the number of lineouts per batch reduces the number of minimizations needed while minimally increasing the computation time per batch, as shown in figure 2, where a two-lineout batch takes ∼30 s and an 18-lineout batch takes ∼40 s. A decreased number of minimizations therefore results in faster analysis of a dataset but increases the susceptibility to local minima. Local minima are a common problem in gradient descent minimization. Each lineout was given its own set of plasma conditions, meaning the more lineouts, the larger the number of fitting parameters. As the number of fitting parameters increases, the fit-metric phase space, or loss landscape, becomes increasingly complex with more local minima. Therefore the computation time must be balanced against the ability to converge to a global minimum. Reverse-mode AD effectively removes size restrictions on $\vec{p}$. Batching is one way to exploit this outcome; it also opens the door to increasing the complexity of the model by adding additional fitting parameters.

Reverse-mode AD accelerates uncertainty estimation
Differentiable programming provides multiple improvements in uncertainty estimation. A common method for determining the uncertainty in gradient descent minimization makes use of the matrix of second derivatives, called the Hessian. By assuming the fitting parameters are normally distributed in the vicinity of their minima, the Hessian can be used to determine the covariance matrix. AD was used to efficiently calculate these second derivatives. Just as the computation of first derivatives with FD scales with order $N$, the computation time for the Hessian scales as order $N^2$, since second derivatives require FD steps in two directions. With AD this scaling is again reduced to order unity, and the accuracy is improved through exact derivatives, avoiding truncation error or the use of approximate Hessians from quasi-Newton methods [23].
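A minimal sketch of this Hessian-based estimate follows, assuming the fit metric is the chi-squared statistic of equation (1), so that the covariance is twice the inverse Hessian at the minimum. The stand-in model, grid, and best-fit values are illustrative.

```python
import jax
import jax.numpy as jnp

pixels = jnp.arange(256.0)

def forward(p):
    # Stand-in spectral model: amplitude, center, width.
    return p[0] * jnp.exp(-(((pixels - p[1]) / p[2]) ** 2)) + 5.0

def loss(p, data):
    # Chi-squared fit metric with Poisson variance, as in equation (1).
    return jnp.sum((data - forward(p)) ** 2 / data)

p_best = jnp.array([100.0, 128.0, 10.0])
data = forward(p_best)  # stand-in for measured counts near the minimum

# Exact second derivatives in one call; no N^2 finite-difference stencil.
H = jax.hessian(loss)(p_best, data)

# Gaussian (Laplace) approximation: covariance from the inverse Hessian.
# The factor of 2 assumes L is chi-squared rather than a log-likelihood.
cov = 2.0 * jnp.linalg.inv(H)
sigma3 = 3.0 * jnp.sqrt(jnp.diag(cov))  # 3-sigma uncertainty per parameter
```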
With the improved computation time, data can be analyzed as finely as one pixel per lineout. The diagnostic used for this data had a temporal resolution of 25 ps, corresponding to 5 pixels, making it possible to determine the variance within a resolution unit. The uncertainty in parameters can alternatively be calculated using the standard deviation of a moving window the size of the diagnostic resolution unit.
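A sketch of that moving-window estimate, assuming a uniform 5-pixel resolution unit; the names and synthetic data are illustrative.

```python
import numpy as np

def moving_std(values, window=5):
    # Standard deviation over a sliding window the size of one
    # diagnostic resolution unit (here 25 ps, i.e. 5 pixels).
    pad = window // 2
    out = np.empty(len(values))
    for i in range(len(values)):
        lo, hi = max(0, i - pad), min(len(values), i + pad + 1)
        out[i] = np.std(values[lo:hi])
    return out

# Example: per-lineout electron temperatures -> per-lineout scatter estimate.
te = 0.5 + 0.01 * np.random.randn(360)
sigma3 = 3.0 * moving_std(te)
```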
Speeding up the computation also makes other methods of error analysis more tractable. For example, multiple fits can be performed on the same data in order to understand the variability in the minimization. As another example, multiple fits can be performed on randomized subsets of the data, a process often referred to as 'bootstrapping'. Finally, the ability to run on GPUs can also accelerate gradient-free approaches to error analysis, such as Monte Carlo techniques.

Thomson scattering theory
Thomson scattering [6] is a common technique for measuring plasma conditions such as density, temperature, and flow velocity. Data can be described by a physical model for the spectrum of light $P_s(\lambda_s, \theta, \vec{A}, \vec{p})$ scattered from a laser by a plasma. The model is given by

$$P_s(\lambda_s, \theta, \vec{A}, \vec{p}) \propto \vec{A}(\lambda_s)\, n_e \left(1 + \frac{2\omega}{\omega_0}\right) S(\vec{k}, \omega), \tag{4}$$

where the free physical parameters of the model are the components of the parameter vector $\vec{p} = [\lambda_0, T_e, T_i, Z, n_e, V_a, U_d]$. These physical parameters are: probe laser wavelength ($\lambda_0$), electron temperature ($T_e$), ion temperature ($T_i$), average ionization state ($Z$), electron density ($n_e$), fluid velocity in the direction of $\vec{k}$ ($V_a$), and electron drift relative to the ions in the direction of $\vec{k}$ ($U_d$). The electron charge ($q_e = e$), ion charge ($q_i = Ze$), mass of the electron ($m_e$), mass of the ion ($m_i$), and speed of light ($c$) were constants in this work. The remaining terms are functions or functionals of the free parameters, constants, and the independent variables of scattering angle ($\theta$) and scattered wavelength ($\lambda_s$). $\vec{k}$ is the wavevector of the probed fluctuation, determined by the geometry of the probe and collection, and is calculated by

$$\vec{k} = \vec{k}_s - \vec{k}_0, \qquad k = |\vec{k}| = \sqrt{k_0^2 + k_s^2 - 2 k_0 k_s \cos\theta}, \tag{5}$$

where $k_{0,s} = \sqrt{\omega_{0,s}^2 - \omega_{pe}^2}/c$ are the wavenumbers of the probe and scattered light in the plasma. The spectral density function,

$$S(\vec{k}, \omega) = \frac{2\pi}{k} \left|1 - \frac{\chi_e}{\epsilon}\right|^2 f_e\!\left(\frac{\omega}{k}\right) + \frac{2\pi Z}{k} \left|\frac{\chi_e}{\epsilon}\right|^2 f_i\!\left(\frac{\omega}{k}\right), \tag{6}$$

describes the shape of the spectrum and is dependent on the wavevector being measured ($\vec{k}$) and the electron and ion velocity distribution functions ($f_e$, $f_i$), projected along $\hat{k}$,

$$f_s(v) = \int f_s(\vec{v}\,)\, \delta\!\left(v - \frac{\vec{k}\cdot\vec{v}}{k}\right) \mathrm{d}^3v. \tag{7}$$

These distribution functions describe the kinetic behavior of the plasma, determining the growth or damping of various waves. The fluid parameters (those used in $\vec{p}$) of the plasma can be related to the moments of these functions. These distribution functions were also used to calculate the dielectric function,

$$\epsilon(\vec{k}, \omega) = 1 + \chi_e(\vec{k}, \omega) + \chi_i(\vec{k}, \omega), \tag{8}$$

and the electron and ion susceptibilities,

$$\chi_s(\vec{k}, \omega) = \frac{\omega_{ps}^2}{k^2} \int \frac{\partial f_s/\partial v}{\omega/k - v}\, \mathrm{d}v, \tag{9}$$

where $\omega_{ps}$ is the plasma frequency of species $s$ and the integral is taken along the Landau contour. Some dependence of this physical model on electron density and probe wavelength can be seen in equations (4), (5) and (9), but the majority of the dependence is in the normalized velocities, $x_e$ and $x_i$, given by

$$x_e = \frac{\omega/k - V_a - U_d}{v_{te}}, \qquad x_i = \frac{\omega/k - V_a}{v_{ti}}, \tag{10}$$

where the frequency, $\omega$, and thermal velocities, $v_{te,ti}$, are given by

$$\omega = \omega_s - \omega_0, \qquad v_{te} = \sqrt{\frac{2T_e}{m_e}}, \qquad v_{ti} = \sqrt{\frac{2T_i}{m_i}}. \tag{11}$$

A set of amplitude normalizations is used to prevent the need to absolutely calibrate the diagnostic. These normalization factors differentiate between three sections of the spectrum, corresponding to the ion-acoustic wave feature close to the probe wavelength, and the blue-shifted and red-shifted electron plasma wave features at large distances from the probe wavelength.
A typical spacing of 3 nm between these spectral regions is used here, but this can be modified in the code. The need for different amplitude normalizations between the blue-shifted and red-shifted electron plasma waves may indicate missing physics in the model. In future work, these amplitudes could be replaced with neural networks of the physical parameters to study this missing physics.
In addition to the amplitude normalization, the scattered power $P_s$ must be convolved with instrumental effects to achieve agreement with experimental data. The scattered power is calculated over a range of scattering angles ($\theta$). The scattering angles, and their weightings in a weighted sum, account for the variation in scattering angle across the collection optics and the range of probe wavevectors. Optionally, each of the physical parameters in $\vec{p}$ can be treated as a set of values representing changes in the plasma conditions over the small volume that is being measured. Commonly, a functional form is assumed for the variation of the parameters, often called a gradient in the plasma conditions. Finally, the scattered power is convolved with a point-spread function that describes the blurring into nearby pixels due to the internal structure of the diagnostic. Recent developments in Thomson-scattering analysis have allowed a better understanding of plasmas by loosening the assumptions on the shape of these velocity distribution functions [24, 25]. These more complex models have additional free parameters, further complicating and slowing the analysis. In order to compute the scattered power (equation (4)) for non-Maxwellian velocity distribution functions, the susceptibilities (equation (9)) are calculated on the fly each time a spectrum is computed. For a Maxwellian or some other predefined distribution function, the susceptibility can be tabulated or analytically simplified. On-the-fly computation allows any functional or numerical distribution function to be used. By generalizing the model, Thomson scattering has been used to address a wider range of problems, increasing the amount of data being collected and analyzed. In this case, computation time becomes a bottleneck for the amount of data that can be analyzed and the complexity of the models that can be used.
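The instrumental step at the end of this pipeline reduces to a convolution. The sketch below shows a minimal version, assuming a Gaussian point-spread function on a uniform wavelength grid; both assumptions are for illustration only.

```python
import jax.numpy as jnp

def apply_instrument(spectrum, wavelengths, psf_width=0.1):
    # Gaussian point-spread function on the same (uniform) wavelength grid,
    # normalized so the convolution conserves total signal.
    center = wavelengths[wavelengths.size // 2]
    psf = jnp.exp(-(((wavelengths - center) / psf_width) ** 2))
    psf = psf / jnp.sum(psf)
    # Blurring of the modeled scattered power into nearby pixels.
    return jnp.convolve(spectrum, psf, mode="same")
```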

Assessing the performance of the algorithm
Figure 3 shows the qualitative agreement achieved by this new algorithm when applied to Thomson scattering. Figure 3(a) shows an example of temporally resolved Thomson-scattering data, where the scattered light is resolved as a function of time and wavelength. Two spectral peaks associated with the blue-shifted and red-shifted electron plasma waves were observed as a function of time. Figure 3(b) shows that the fit qualitatively reproduced the changes in the separation of these two features, their width, and the light observed between them. The SciPy [26] implementation of the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with bounds (L-BFGS-B) [23] was used as the optimization algorithm for these fits. To maximize the performance of this algorithm, the parameter vector $\vec{p}$ was normalized as $\vec{p}_\mathrm{norm} = (\vec{p} - \vec{lb})/(\vec{ub} - \vec{lb})$, where $\vec{lb}$ and $\vec{ub}$ are user-supplied lower and upper bounds, respectively, for each parameter. In this fit the electron density had a lower bound of 10¹⁷ cm⁻³ and an upper bound of 10²⁰ cm⁻³, and the electron temperature had a lower bound of 0.01 keV and an upper bound of 1.5 keV.
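In code, that normalization and its inverse might look like the following sketch; the bound values are the ones quoted above, and the function names are illustrative.

```python
import jax.numpy as jnp

lb = jnp.array([1e17, 0.01])  # lower bounds: n_e (cm^-3), T_e (keV)
ub = jnp.array([1e20, 1.5])   # matching upper bounds

def normalize(p):
    # Map physical parameters onto [0, 1] for the bounded optimizer.
    return (p - lb) / (ub - lb)

def denormalize(p_norm):
    # Map optimizer variables back before evaluating the physical model.
    return p_norm * (ub - lb) + lb
```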
This fit represents 360 lineouts, a resolution of 1 pixel per lineout along the temporal direction. Using the optimal batch size of 6, the fit was completed in 11.2 min, including postprocessing, refitting, and plotting the results shown in figures 3-6. Previous analysis of this dataset using finite differencing with a batch size of 1 required 90 min to analyze 20 lineouts. This represents an 8× speedup and an 18× resolution improvement, or a 144× acceleration in the required time per lineout. This is a generous comparison because the postprocessing, refitting, and plotting times are included for the AD calculation but not for the FD calculation. This highlights the effectiveness of applying AD to accelerate the analysis as a whole. Because the two analyses, finite-difference based and AD based, utilize the same diagnostic model and similar quasi-Newton gradient-descent minimization algorithms, their accuracy is roughly equivalent.
The traditional model-based diagnostic algorithm, figure 1(a), is a generalization of many techniques and reflects the algorithm used in some previous work [24, 25]. However, there are no standard analysis tools for Thomson scattering. Most analyses still rely on manual effort, where inputs are modified by hand and iterated until agreement is deemed 'good' by eye, without the use of quantitative metrics.
Figure 4 shows examples of a 'good' fit and a 'bad' fit. Figure 4(a) is an example of a good fit ($L = 151$). Features of the spectrum are reproduced quantitatively as well as qualitatively. The model agrees with the data within the uncertainty of the individual points as given by Poisson statistics. χ² was set to zero outside the data window and in the central region, where a notch filter was used in the diagnostic to reject wavelengths too close to the laser wavelength. Applying a window to the analysis was found to improve the convergence of the fits and can be easily adjusted based on the dataset. Figure 4(b) shows an example of a bad fit, where the minimizer converged to a local minimum that is not the global minimum. Here, discrepancies can be seen in the width and heights of both peaks, as well as a mismatch between the peaks, with the model under-predicting the data. This is one realization of the most common failure mode, where a parameter is pushed into a regime of relative insensitivity, resulting in a local minimum. In this case, the temperature was lowered to a point where the width of the feature is mostly due to instrumental broadening; once in this regime, modifying the temperature has little to no effect and can make the fit worse by moving the location of the peak. In the other common realization of this failure mode, the density is increased to the point where the model's peaks are outside the spectral range of the analysis, preventing parameter changes from modifying the fit metric.
When analyzing a few lineouts, it was important that every lineout had an excellent fit so the parameters could be trusted. However, with the process automated, hundreds of lineouts are analyzed, making it infeasible to determine the quality of each fit by eye. Instead, a histogram was used to determine the quality of the set of fits. In this paradigm, a few poor fits can be tolerated as increased uncertainty in the parameters or removed based on their fit metric. Figure 5 demonstrates that a small portion of the fits are 'bad' and can mostly be eliminated through refitting. For this analysis, 'good' fits were defined to have χ² < 1510, and the initial fitting pass produced five inadequate fits. During the refitting step, each bad fit was refit using the previous lineout's solution as the initial condition for the minimizer and a batch size of 1. With 360 fits, a histogram was found to be an efficient way of determining the quality of the dataset, while individual lineouts (as seen in figure 4) were checked to determine an appropriate threshold for a 'good' fit. A universal definition of a 'good' fit remains complicated by the differing signal-to-noise ratios between datasets.
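A sketch of that quality filter follows; the threshold is the one quoted above, while the neighbor-seeding rule and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def refit_candidates(losses, threshold=1510.0):
    # Lineouts whose fit metric exceeds the 'good' threshold are flagged
    # for a second pass with a batch size of 1.
    losses = np.asarray(losses)
    bad = np.flatnonzero(losses > threshold)
    # Pair each bad lineout with a neighboring lineout whose best-fit
    # parameters become the new initial guess for the refit.
    return [(i, i - 1 if i > 0 else i + 1) for i in bad]
```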
Figure 6 shows the level of detail in plasma conditions that can be extracted from a dataset and some options for uncertainty analysis of these plasma conditions. Plasma conditions were extracted for every pixel in the dataset, showing statistical variation in the extracted conditions as well as long-term trends due to the plasma evolution. Here, the electron temperature and density rise early in time due to the laser heating and ionizing the plasma. This is followed by a period where the conditions are quasi-stationary, as laser heating is balanced by thermal conduction losses. Once the laser turns off at 1000 ps, the temperature and density drop as the plasma cools and expands. By defining the electron velocity distribution function as a super-Gaussian ($\exp[-(x/x_0)^m]$) in 3D and projecting it onto the measured $\vec{k}$, the super-Gaussian order can be extracted from the data as well [24] (figure 6(c)). This similarly showed an evolution under heating to a super-Gaussian order of 3.2, followed by cooling back to a super-Gaussian order of 2 once the heating laser is turned off.
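For reference, a normalized 1D super-Gaussian can be written as follows. This sketch is an illustrative assumption; the analysis in [24] uses the 3D form projected onto $\vec{k}$.

```python
import jax.numpy as jnp
from jax.scipy.special import gammaln

def super_gaussian(v, v0, m):
    # Normalized 1D super-Gaussian exp[-(|v|/v0)^m]; m = 2 recovers a
    # Maxwellian, while m > 2 flattens the core and sharpens the tails.
    norm = m / (2.0 * v0 * jnp.exp(gammaln(1.0 / m)))
    return norm * jnp.exp(-((jnp.abs(v) / v0) ** m))

v = jnp.linspace(-5.0, 5.0, 201)
f_cold = super_gaussian(v, 1.0, 2.0)    # Maxwellian limit
f_heated = super_gaussian(v, 1.0, 3.2)  # order reported during heating
```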
The statistical variations in the fits were analyzed in two ways: using the Hessian and using a moving window. The Hessian-based technique was used to determine the 3σ contour shown in blue in figure 6; AD was used to efficiently calculate these second derivatives. A new option enabled by the resolution of this analysis is to determine the variance within a diagnostic resolution unit. The diagnostic used for this data had a temporal resolution of 25 ps, corresponding to 5 pixels. A standard deviation was calculated for a 5-pixel moving window to determine the red 3σ contour shown in figure 6. These two techniques for determining the statistical uncertainty in the fitting are in good agreement.

Fitting to novel quantities using angularly resolved Thomson scattering
The algorithmic changes made to improve the data analysis of traditional Thomson-scattering data scale favorably when more parameters are extracted in a single fit (figure 2). These changes allow this algorithm to have an even bigger impact on problems where more information is being extracted, such as the newly developed angularly resolved Thomson scattering [25]. This technique has been used to directly measure the electron velocity distribution function by treating the distribution function as a set of points (representing the number of particles within a small velocity band), all of which are free parameters in the fit. Replacing the electron distribution function in equations (7) and (9) with a set of free parameters necessitates on-the-fly computation of the susceptibility. Due to the >60 points required for a smooth realization of the distribution function, fitting angularly resolved Thomson-scattering data is a computationally expensive problem that is even more sensitive to local minima. Optional penalty functions can be added to the fit metric $L$, depending on the physics being measured, to constrain parameters or improve the fitting behavior. In this case, the penalty terms $(1 - \int f_e(x)\,\mathrm{d}x)^2$ and $(1 - \int x^2 f_e(x)\,\mathrm{d}x)^2$ were used to maintain the normalization of the distribution function. Other penalty functions could be used to enforce symmetry or monotonicity.
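Those penalty terms are straightforward to add to a differentiable fit metric. The sketch below assumes a uniform velocity grid in normalized units and is illustrative rather than the authors' implementation.

```python
import jax.numpy as jnp

def normalization_penalties(f_e, x):
    # Zeroth and second moments on a uniform grid (simple Riemann sum).
    dx = x[1] - x[0]
    density = jnp.sum(f_e) * dx
    second_moment = jnp.sum(x ** 2 * f_e) * dx
    # Quadratic penalties keeping the free-point distribution normalized;
    # these are added directly to the fit metric L before differentiation.
    return (1.0 - density) ** 2 + (1.0 - second_moment) ** 2
```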
Figure 7 shows the result of applying AD to the problem of angularly resolved Thomson scattering. In this data (figure 7(a)), the scattered light is measured as a function of wavelength and angle. The fit (figure 7(b)) is able to qualitatively reproduce the measurement in 15.4 min using 256 points in the distribution function. Discrepancies remain noticeable in the multiple peaks on the blue-shifted side of the fit around 100°. These are due to small modulations on the distribution function around $v/v_{th} = 2$ (figure 7(c)). However, this fit still has a χ² per degree of freedom of 4.2. This represents a 30× speedup and a 4× resolution improvement over the previous analysis, which required at least 8 h to determine the distribution function with 64 points.

Conclusion
AD combined with GPUs allows a dramatic improvement in the efficiency of model-based diagnostic analysis. For Thomson scattering, analysis of a spectrum to extract plasma conditions can be made >140× more efficient. This improved analysis retains the full physical model developed for diagnostic analysis, allowing a complete understanding of the analysis, including detailed instrumental corrections, the assumptions in the model, and error analysis. Moreover, this improved efficiency allows three changes to the current paradigm of Thomson-scattering analysis: (1) faster and higher-resolution analysis, which allows a better understanding of the evolution of plasma conditions in an experiment; (2) near real-time data analysis, where performing analysis at a lower resolution gives an accurate picture of plasma conditions in 1-2 min that can then be used to inform the next experiment during an experimental campaign; and (3) the ability to employ more complicated models, such as the entire distribution function or additional models for missing physics, which enables the field to tackle new and interesting problems. The streamlining necessary for an efficient AD algorithm has the side benefit of lowering the barrier to performing Thomson-scattering analysis. With access to a code built to handle most of the details of the analysis, a new user can be trained in a couple of hours, as opposed to needing to learn all the theory and then implement their own by-hand analysis routine. Finally, this differentiable programming approach has been applied to Thomson scattering as a proof of principle of its application to cutting-edge diagnostic analysis and hopefully stands as a blueprint for the application of these principles to other diagnostics.

Figure 1 .
Figure 1. (a) Typical model-based diagnostic analysis algorithm. Raw data is preprocessed to account for some instrumental effects. Background is subtracted, usually manually, and a single row of pixels (lineout) is selected for analysis. The model is matched to the data by minimizing a loss function, producing the desired plasma conditions. This process is manually repeated to obtain plasma conditions for an entire dataset. (b) An algorithm optimized for AD. Data is split into batches, which consist of multiple lineouts that are analyzed simultaneously. Background subtraction is performed algorithmically and a single set of initial conditions is applied to all batches in order to minimize human intervention. Manual steps, which require user input, are shown in gray, while automated steps are shown in orange. The minimization loop, which contains multiple steps, is shown in blue when using finite differencing and purple for automatic differentiation.

Figure 2 .
Figure 2. Fitting time vs number of lineouts for minimizations using finite differencing (blue) and automatic differentiation (red). Computation time is shown for runs performed on both a CPU (solid curves) and a GPU (dashed curves).

Figure 3 .
Figure 3. (a) Raw Thomson-scattering data as a function of time compared to (b) the result of the analysis code.

Figure 4 .
Figure 4. (a) Example of a 'good' and (b) a 'bad' fit. (a), (b) Data (blue) is shown in units of photocathode photoelectrons, the minimal counting unit in the diagnostic system, and compared to the code's best fit (orange).

Figure 5 .
Figure 5. Histogram of the fit metric. The histogram is shown before (blue) and after (orange) refitting. One count corresponds to an individual lineout that was fit, and the counts are shown logarithmically.

Figure 6 .
Figure 6. Plasma conditions (a) electron density, (b) electron temperature, and (c) super-Gaussian order as a function of lineout. Plasma conditions (blue curve) are shown with a 3σ uncertainty calculated from the Hessian (blue shaded region) and from a 5-pixel moving standard deviation (red shaded region).

Figure 7 .
Figure 7. (a) Raw angularly resolved Thomson-scattering data compared to (b) the result of the analysis code. The electron velocity distribution function found by this analysis is shown on (c) logarithmic and (d) linear scales.