A Gaussian process guide for signal regression in magnetic fusion

Extracting reliable information from diagnostic data in tokamaks is critical for understanding, analyzing, and controlling the behavior of fusion plasmas and validating models describing that behavior. Recent interest within the fusion community has focused on the use of principled statistical methods, such as Gaussian process regression (GPR), to develop sharper, more reliable, and more rigorous tools for examining the complex observed behavior in these systems. While GPR is an enormously powerful tool, there is also the danger of drawing fragile or inconsistent conclusions from naive GPR fits that are not driven by principled treatments. Here we review the fundamental concepts underlying GPR in a way that may be useful for broad-ranging applications in fusion science. We also revisit how GPR is developed for profile fitting in tokamaks. We examine various extensions and targeted modifications applicable to experimental observations in the edge of the DIII-D tokamak. Finally, we discuss best practices for applying GPR to fusion data.


Introduction and motivation
The importance of experimental data in fusion science is difficult to overstate. Experimental data is used, among other things, for analyzing and predicting machine performance [1]; validating theoretical and numerical models [2]; as the basic inputs that drive and inform physics studies [3]; and for the control of hazardous transient phenomena [4]. Fusion plasmas routinely reach temperatures an order of magnitude beyond that of the core of the Sun. Due to this hostile environment, acquiring experimental data in fusion devices is a longstanding challenge. The complexity of plasma diagnostic systems, the extreme spatio-temporal scales of plasma dynamics, and the inherent uncertainties in these experimental measurements all contribute to a closely related challenge: maximizing the utility of the diagnosed data and interpreting it in a statistically meaningful manner.
Because of the critical importance of extracting reliable information from observation, there is a long history of techniques that aim to provide physically interpretable representations of acquired data. While many of these techniques are well-constructed and provide insight into the experiment, a recent push within fusion science has been towards statistically principled techniques that rely on rigorous statistical and mathematical foundations in order to understand the inherently uncertain observational data obtained from experiments.
One such method, which is the subject of this paper, is Gaussian process regression (GPR) [5-8]. GPR is a statistical inference technique based on Gaussian processes (GPs). A GP is an extension of the concept of a Gaussian random vector to probability distributions over function spaces. Specifically, it is defined as a collection of random variables, any finite subset of which has a joint Gaussian distribution [6]. In most applications of GPR, and in this work, a GP is used to approximate an underlying deterministic, but uncertain, function, which is inferred from a given set of observations. The resulting calibrated GPs can be readily utilized to inform theoretical analyses [9, 10], by inputting experimental data fits to theoretical and numerical physics models, and can also be powerful in guiding experimental exploration, optimization, and design.
GPR has been receiving increased attention in recent years in fusion and plasma applications. In [11], GPs are used to infer soft x-ray emissivity distributions from noisy line-integral measurements, while in [12, 13] GPs are similarly used to develop a GP tomography method to reconstruct local soft x-ray emissivity in WEST and EAST, with similar approaches applied at ASDEX Upgrade [14] and Wendelstein 7-X [15]. In [16-18], GPR is applied to simulation studies with an eye towards eventual application to full experimental regimes, while in [19] studies are performed on three-dimensional equilibrium reconstruction from both experimental data and simulation codes. In [20], GPs are used for determining properties of micro-tearing modes in spherical tokamaks, while in [21] GPs are used for inferring axisymmetric plasma equilibrium from magnetic field and plasma pressure measurements, and in [22] Bayesian inference is used to resolve Z_eff profiles from line-integrated bremsstrahlung spectra. In [3, 23-25], GPR is used to fit experimental core temperature and density profiles, to infer second-order derivative information from these profiles, and to propagate uncertainties, while in [26] GPs are explored for use in quasi-coherent noise suppression. At JET, GPs have been used to infer electron cyclotron emission spectra [27], and high-resolution Thomson scattering and far-infrared interferometer data in [28], while work in [29] is focused on using GPR to quantify edge plasma evolution from experimental data. A general overview of Bayesian inference in fusion can be found in [10, 30].
In this paper, we apply GPR to an important problem in magnetic confinement fusion: fitting the profiles of temperature and density in the pedestal. The pedestal is the narrow region at the edge of the plasma in which transport barriers arise. Our focus on the pedestal is motivated by its importance for confinement, which is largely determined by the peak temperature attained at the inner boundary of the pedestal. We focus on fitting the pedestal region of the profile and examining the quality of the fits. One compelling application of GPR in this context is assessing what physical information can be extracted from the fitted profiles, for instance whether there is experimental evidence of profile flattening, which is a signature of certain fluctuations that arise in the pedestal [31]. This use case is not only intrinsically compelling but also serves as a starting point for exploring GPR applications to broad-ranging experimental fusion data sets.
While GPR can make assessing gradients (as well as higher-order derived features) and associated uncertainties in data fits very easy, there is also the danger of drawing fragile or inconsistent conclusions from naive GPR fits that are not driven by a proper, statistically principled treatment of the GPR framework being used. We highlight some of these potential pitfalls as well as the immense potential of GPR, ultimately establishing a framework that is fruitfully applied to the problem at hand. We expect that much of the framework outlined here can also be generalized to many fusion problems beyond this use case. That said, the current paper explicitly develops a custom pipeline for addressing the specific needs and constraints required for examining edge behaviors in tokamak plasmas. These custom features are not, to our knowledge, available in the literature. In summary, the contributions of the present work are as follows:
• We provide a generalized, consistent, self-contained, and detailed mathematical and numerical description of how to implement and apply GPR to profile fitting. The accompanying code will be made available to the broad community via GitHub.
• In the context of tokamak edge plasmas, GPR fits are compared with common parametric fits and analyzed in detail, including evaluation and analysis of common built-in assumptions.
• A novel 'monotonicity-promoting' prior is developed for edge plasmas, incorporating the physical constraint of non-positive dT_e/dψ.
• A deep analysis is performed on hyperparameter sensitivities, exposing how ignoring (or over-simplifying) these can lead to fragile and/or physically over-interpreted results.
• An analysis of the sensitivities to the prior mean is included, again showing how important these are for the final physically interpretable results.
A brief outline of the paper is as follows: in section 2, GPs are introduced in a self-contained notation for future reference, and the discretization process along with various evaluation procedures are explained. In section 3, the electron temperature experimental procedures and data are examined, and various fitting approaches are analyzed, along with a cursory examination of their physics interpretations. In section 4, modifications and potential improvements to standard GP approaches are discussed, and some examples are given. Throughout these developments, the dependence of the results on various parameters that the GP depends upon is examined. In section 5, a Bayesian approach to setting these parameters is introduced and used to examine the data from section 3. This leads to a more complete Bayesian approach, since GPR itself can be viewed as a Bayesian inference. Finally, additional discussion of the application of GPR methods is given in section 6, and conclusions are provided in section 7.

Profile fitting with GPs
This section provides the necessary background on GPs and GP-based regression techniques. Specifically, we formally define a GP and show how to calibrate a GP by conditioning on data (section 2.1), discretize the GP on a grid and introduce discrete approximations of derivatives (section 2.2), show how to take draws from a GP and evaluate statistics (section 2.3), and finally, introduce a credibility metric for use in evaluating GP fits (section 2.4). While much of this material covers well-established techniques (see, e.g. [6]), some of it has not been directly applied to fusion and plasma applications before, and so it is included here for the benefit of readers not familiar with GPs and to make the paper self-contained.

The GP framework
We present GPR in a general framework here, to emphasize the fact that this methodology can be applied to any quantity of interest. In that context, let ψ denote the independent coordinate and f denote the dependent variable, i.e. the function we wish to infer from the measurements. In results shown beginning in section 4, ψ denotes the poloidal magnetic flux, and f corresponds to the electron temperature. The goal of the framework presented in this paper is to recover a distribution over functions f, such that for any finite number of points ψ_i, i = 1, ..., N, the random variables f(ψ_i) are jointly Gaussian. Such a distribution over functions is referred to as a GP. Like a Gaussian random variable, a GP is entirely defined by its mean and covariance functions. Thus, to indicate that f is distributed according to a GP, it is common to write

f ∼ GP(µ, k),   (1)

where µ is the mean of f and k the covariance of f, i.e. µ(ψ) = E[f(ψ)] and k(ψ, ψ′) = E[(f(ψ) − µ(ψ))(f(ψ′) − µ(ψ′))], for E the expected value. GPR operates by conditioning an existing (or prior) GP on experimental observations. This conditioning is equivalent to a Bayesian update. Importantly, if the observation error is Gaussian and the map from the GP to the observable quantity is linear (in this work it is generally the identity), then the resulting conditional distribution (or posterior) is also a GP. To derive this conditional GP, the outcome of the experiment is modeled as a deterministic function plus Gaussian noise:

f_i^exp = f_true(ψ_i) + ϵ_i,   i = 1, ..., N_e,   (2)

where N_e is the number of experimental observations and the ϵ_i are jointly Gaussian. Modeling f_true using a GP gives

f_true ∼ GP(µ, k).   (3)

Although the true function f_true is deterministic, using a random model (here a GP) is appropriate to represent the lack of knowledge about this deterministic truth, a lack of knowledge which is induced by the fact that only finitely many uncertain observations are available.
According to the model (3), the vector f^exp = [f_1^exp, ..., f_{N_e}^exp] of observations at the points Ψ = [ψ_1, ..., ψ_{N_e}] is jointly Gaussian with f evaluated at an arbitrary set of test points Ψ* = [ψ*_1, ..., ψ*_{N_t}], where N_t is also arbitrary. Letting f* = [f(ψ*_1), ..., f(ψ*_{N_t})], K_exp denote the covariance of the observation noise ϵ_i, and N(m, K) denote a vector normal distribution with mean m and covariance K, we have

[f^exp; f*] ∼ N( [µ(Ψ); µ(Ψ*)], [K(Ψ, Ψ) + K_exp, K(Ψ, Ψ*); K(Ψ*, Ψ), K(Ψ*, Ψ*)] ),   (4)

where µ(Ψ) is the prior mean evaluated at the observation points, K(Ψ, Ψ) is the prior covariance evaluated between those points, and µ(Ψ*), K(Ψ, Ψ*), K(Ψ*, Ψ), and K(Ψ*, Ψ*) are defined analogously. Conditioning on the experimentally observed values then gives

f* | f^exp ∼ N(µ_post, K_post),   (5)

where µ_post and K_post are the posterior mean and covariance:

µ_post = µ(Ψ*) + K(Ψ*, Ψ)[K(Ψ, Ψ) + K_exp]^{-1}(f^exp − µ(Ψ)),
K_post = K(Ψ*, Ψ*) − K(Ψ*, Ψ)[K(Ψ, Ψ) + K_exp]^{-1}K(Ψ, Ψ*).   (6)

Equations (2) through (6) reveal the assumptions that underlie the use of GPR as applied in this work. First, it is assumed that the data provide a realization of an uncertain process that can be modeled as in equation (2), with ϵ_i that are jointly Gaussian with known mean and covariance. It is critical that the experimental uncertainty is appropriately characterized, as otherwise the conclusions from the resulting analysis will be contaminated. For instance, if the experimental uncertainty is under-estimated, the uncertainty in the posterior GP will be smaller than it should be, and the confidence in any conclusions drawn from it stronger than it should be. While it is possible to relax this assumption by inferring some parameters of the experimental uncertainty model, these parameters are generally conflated with the GP model being inferred, and thus it is best if the experimental uncertainty is well-characterized independent of the GP analysis.
Second, the true function f_true, which is the goal of the inference process, is assumed to be deterministic but unknown. Because of a lack of complete information, it cannot be completely known, and this lack of knowledge is modeled stochastically through the GP, as discussed previously. As with the first assumption, it is possible to relax this assumption to cases where the underlying truth is inherently stochastic, but this is beyond the scope of the current paper.
Third, it is assumed that the prior does not make the truth impossible. In particular, as the prior is supported on a specified function space, it is assumed that the truth is in this space. More generally speaking, in equation (6), it is clear that the choice of the prior has an impact on the posterior beyond simply setting the function space. In Bayesian probability, this is natural and can be viewed in a number of different ways. For instance, one can view the prior as simply a type of regularization, which may be used to eliminate unwanted features from the posterior even though the data do not rule them out. As an example, by choosing an appropriate covariance, oscillations on wavelengths smaller than the spacing between the data points can be eliminated, even though the data alone do not contain information about these wavelengths. Alternatively, in the so-called 'subjective' Bayesian approach, each individual analyst is responsible for formulating his own prior, which represents his own personal state of knowledge about the quantities being inferred before seeing the data [32].
There is a substantial literature and debate surrounding the specification of prior information in Bayesian inference [33, 34]. Our purpose here is not to give even a brief review of this literature and debate, only to note that the prior can substantially impact the posterior, particularly in situations where the data are sparse. Thus, it is important to think about the prior. Here we make some general comments regarding our choices for the priors used in this work, with more details and justification provided along with the example calculations in sections 3-5.
While a common choice for the prior mean in equation (5) is µ = 0, this choice can result in unwanted sensitivities developing relative to the hyperparameters in the covariance, particularly in situations where the data are strictly non-negative and far from zero relative to their variability. This sensitivity can lead to significant dangers when trying to physically interpret the resulting quantity of interest (QoI) relative to selected covariance hyperparameters. We will discuss these issues in more detail below.
A common choice of covariance k is the squared-exponential covariance (or radial basis function (RBF) kernel), which we write

k(ψ_i, ψ_j) = σ² exp( −(ψ_i − ψ_j)² / (2ℓ²) ),   (7)

where σ² is the prior variance and ℓ the prior covariance lengthscale. This covariance (7) enforces smoothly varying C^∞ functions. In other words, fits that utilize this kernel are predicated on the assumption that the underlying function being fit is globally smooth and infinitely differentiable, i.e. there are no functional discontinuities in the observed quantity, and while the QoI may vary rapidly over small lengthscales ℓ, this variation is always smooth. In this work, the functions being fit are assumed to satisfy these properties, and thus the squared-exponential covariance is a reasonable first choice. Note also that this covariance is stationary, meaning that it depends only on the difference between ψ_i and ψ_j, not the absolute position, and is thus translation invariant.
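To make equations (5)-(7) concrete, the following minimal sketch (in Python with NumPy) evaluates the squared-exponential covariance and the resulting posterior mean and covariance at a set of test points, assuming a constant prior mean. The arrays psi_exp, f_exp, and err (observation locations, values, and one-standard-deviation uncertainties) are hypothetical placeholders, not the actual analysis pipeline.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma, ell):
    """Squared-exponential covariance, equation (7): sigma^2 exp(-(x1 - x2)^2 / (2 ell^2))."""
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(psi_exp, f_exp, err, psi_test, mu, sigma, ell):
    """Posterior mean and covariance of equations (5)-(6) for a constant prior mean mu."""
    K_exp = np.diag(err**2)                               # observation-noise covariance
    K_dd = rbf_kernel(psi_exp, psi_exp, sigma, ell) + K_exp
    K_td = rbf_kernel(psi_test, psi_exp, sigma, ell)
    K_tt = rbf_kernel(psi_test, psi_test, sigma, ell)
    # Solve linear systems rather than forming explicit inverses for numerical stability.
    alpha = np.linalg.solve(K_dd, f_exp - mu)
    mu_post = mu + K_td @ alpha
    K_post = K_tt - K_td @ np.linalg.solve(K_dd, K_td.T)
    return mu_post, K_post
```

In practice a Cholesky factorization of the noisy covariance is preferable to repeated linear solves, but the structure of the update is the same.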
There are many alternative covariance kernels that can be chosen [35], however, and often these choices are justified by prior knowledge about the system in question. Another common covariance is the Matérn kernel, which provides a parameter controlling the smoothness of realizations [36]. Non-stationary kernels may also be used when supported by prior information. For example, in [23] a non-stationary kernel is chosen to enforce a change in lengthscale dependencies between the core and edge plasma in the tokamak.

Discrete GPs
Equations (5) and (6) denote general expressions that are valid for determining the posterior at arbitrary points Ψ* given data at arbitrary points Ψ. However, it is also often critical to determine the posterior distributions of derivatives of the inferred function and, more generally, to use derivative data in the inference process and/or build constraints on the derivatives into the prior distribution. While these tasks can be approached by evaluating analytic derivatives of the prior mean and covariance, in this work we use discrete approximations of these derivatives, which eliminates the need for these analytic derivatives, thereby reducing the coding burden and enabling straightforward testing with different covariance structures.
To support these discrete derivatives, we discretize the GP. Specifically, we define a grid over the input coordinate and an associated basis for f. Denote this representation of f by f_λ:

f_λ(ψ) = Σ_{i=1}^{n} F_i φ_i(ψ),

where n is the number of basis functions. In particular, we will take a uniformly spaced grid and use piecewise linear Lagrange basis functions. Specifically, given n grid points {ψ_1, ..., ψ_n},

φ_i(ψ) = (ψ − ψ_{i−1})/(ψ_i − ψ_{i−1})   for ψ_{i−1} ⩽ ψ ⩽ ψ_i,
φ_i(ψ) = (ψ_{i+1} − ψ)/(ψ_{i+1} − ψ_i)   for ψ_i ⩽ ψ ⩽ ψ_{i+1},
φ_i(ψ) = 0   otherwise.

In this case, the coefficient F_i corresponds to the value of f at the ith grid point. The goal now is to infer the values of F_i from observations of f. This inference is performed using Bayes' rule, as in section 2.1, and the results represent a discrete approximation of those from section 2.1, namely equations (5) and (6).
To derive the posterior distribution for the coefficients F_i, let A denote the matrix containing the basis functions evaluated at the points where experimental observations are available:

A_{ij} = φ_j(ψ_i^exp),   i = 1, ..., N_e,   j = 1, ..., n.

Then,

f_λ(Ψ) = A F.

As in section 2.1, we model the experiment as f_λ contaminated by observational noise:

f_i^exp = (A F)_i + ϵ_i,   i = 1, ..., N_e,

where N_e is the number of experimental observations and the ϵ_i are jointly Gaussian with mean zero and covariance K_exp. Further, let the prior for F be Gaussian with mean µ and covariance K, where, using the notation from section 2.1, µ_i = µ(ψ_i) and K_{ij} = k(ψ_i, ψ_j). Then, it is straightforward to show that the posterior for F is the following Gaussian distribution:

F | f^exp ∼ N(µ_post, K_post),   (8)

where

µ_post = µ + K A^T (A K A^T + K_exp)^{-1} (f^exp − A µ),
K_post = K − K A^T (A K A^T + K_exp)^{-1} A K.

These expressions are exactly analogous to equations (5) and (6). By taking the grid spacing sufficiently small (e.g. much less than the prior covariance length scale and the distance between the data points), this discrete representation can be made sufficient for any purpose.
In particular, one may approximate derivatives of the GP using any consistent finite difference method. For instance, using a first-order, left-sided difference,

(df/dψ)_i ≈ (F_i − F_{i−1}) / λ_i,

where λ_i is the grid spacing of the ith grid element, i.e. λ_i = ψ_i − ψ_{i−1}. Again, by taking the grid spacing to be small relative to any length scale of interest, including the distance between data points and length scales in the prior, the error in this finite difference approximation can be made vanishingly small, such that the discrete derivatives are indistinguishable from the analytical derivatives computed based on derivatives of the covariance kernel. In the results that follow, the grid is uniform over the interval spanned by the data and contains the same number of points as the data set, for the full data set shown in section 3, or 20 times the number of points in the data set, for the individual profile analysis shown subsequently, where the data are much sparser.
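A sketch of this discretized construction, reusing the rbf_kernel helper above, is given below. The grid, hat-function design matrix A, left-sided derivative matrix D, and posterior update for the coefficients F mirror equation (8); the function names are illustrative, not those of the accompanying code.

```python
import numpy as np

def hat_basis_matrix(psi_obs, grid):
    """A[i, j] = phi_j(psi_obs[i]) for piecewise-linear Lagrange (hat) basis functions."""
    A = np.zeros((len(psi_obs), len(grid)))
    for i, p in enumerate(psi_obs):
        j = int(np.clip(np.searchsorted(grid, p) - 1, 0, len(grid) - 2))
        w = (p - grid[j]) / (grid[j + 1] - grid[j])
        A[i, j], A[i, j + 1] = 1.0 - w, w
    return A

def derivative_matrix(grid):
    """First-order, left-sided difference: (D F)_i = (F_i - F_{i-1}) / (psi_i - psi_{i-1})."""
    n = len(grid)
    D = np.zeros((n, n))
    for i in range(1, n):
        dl = grid[i] - grid[i - 1]
        D[i, i], D[i, i - 1] = 1.0 / dl, -1.0 / dl
    D[0, 0] = 1.0   # top row set to [1, 0, ..., 0] so that D is invertible (used in section 3.3)
    return D

def discrete_posterior(A, f_exp, K_exp, mu, K):
    """Posterior mean and covariance for the coefficients F, as in equation (8)."""
    S = A @ K @ A.T + K_exp
    gain = K @ A.T @ np.linalg.inv(S)
    mu_post = mu + gain @ (f_exp - A @ mu)
    K_post = K - gain @ A @ K
    return mu_post, K_post
```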

Using the posterior GP
Given the posterior distribution in equation (8), it is straightforward to evaluate the probabilities of events or statistics of quantities of interest. To illustrate, let Q : R^n → R denote a quantity of interest functional. Specifically, Q takes the coefficients F and returns a scalar quantity of interest. Then, statistical quantities of interest take the form of integrals of functions of Q against the posterior distribution for F. For instance, the mean of Q is

E[Q] = ∫ Q(F) p(F | f^exp) dF,

and the probability that a < Q(F) < b is given by

P(a < Q(F) < b) = ∫ I_(a,b)(Q(F)) p(F | f^exp) dF,

where I_(a,b) denotes the indicator function on the interval (a, b).
In special cases, e.g. when Q is linear, such calculations can be performed analytically using known properties of Gaussian distributions. More generally, they can be evaluated using Monte Carlo sampling, which requires generating draws of F. It is also useful to plot draws from the posterior to gain insight into its behavior.
Samples of F can be obtained from samples of an n-dimensional standard Gaussian using the following formula:

F = µ_post + V Λ^{1/2} z.

Here, z denotes a sample from N(0, I), µ_post is the posterior mean, and V and Λ are the eigenvectors and eigenvalues of the posterior covariance,

K_post = V Λ V^T.

Since K_post is symmetric positive semidefinite, the eigenvalues are known to be real and non-negative.
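A sketch of this sampling procedure, assuming mu_post and K_post from the sketches above, is given below; the eigendecomposition is computed once and reused for all draws.

```python
import numpy as np

def sample_posterior(mu_post, K_post, n_samples, rng=None):
    """Draw samples F = mu_post + V Lambda^{1/2} z, with K_post = V Lambda V^T and z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    lam, V = np.linalg.eigh(K_post)          # eigenvalues are real and non-negative
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative round-off values
    L = V * np.sqrt(lam)                     # columns of V scaled by sqrt(eigenvalues)
    z = rng.standard_normal((len(mu_post), n_samples))
    return mu_post[:, None] + L @ z

# Example Monte Carlo estimate of a scalar QoI, e.g. the steepest gradient of sampled profiles:
# samples = sample_posterior(mu_post, K_post, 1000)
# q = (D @ samples).min(axis=0)   # dT_e/dpsi is negative, so the minimum is the steepest gradient
```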

Evaluating the GP fit
Given a probabilistic model, it is of course necessary to assess whether the model is appropriate for its intended uses. There is a significant literature on evaluating Bayesian statistical models; see for instance [37-39]. Two ideas are used in this work, although this should not be taken to mean that other possible methods or evaluations are not relevant.
First, a qualitative sense of the performance of the model can be obtained by simply plotting the model predictions and the data simultaneously, or, similarly, by plotting the residuals, i.e. the discrepancies between the model predictions and the data. This approach offers immediate insight and, in cases where the mismatches are stark, may be sufficient to justify modifying or even discarding the model. Examples of such cases are shown in sections 3.2 and 3.3. In these examples, both a traditional fitting approach and a specific GPR approach are shown to give results where the mismatch between the data and the mean predictions is large relative to either the model or observational uncertainties (see figures 1 and 2). Such results lead to the conclusion that the model is not a credible representation of the data.
Second, this evaluation can be made more quantitative by examining the Bayesian p-value of a test statistic of interest [38]. The main idea is to assess the probability, according to the model, that the outcome of the experiment is more extreme than the observed data. When this probability is small, the implication is that the experimental observation was an outlier according to the model, which decreases confidence in the model's predictive capability. In general, one should select a test statistic that represents the intended uses of the model. In the present situation, the model is intended to represent the entire edge temperature profile, and thus we choose the identity as the test statistic, indicating that we wish to fit all aspects of the data in order to have confidence in the model profiles.
To compute the Bayesian p-value, denote the probability density implied by the model as p(f). For now this model could be constructed in any fashion, but subsequently we will use the posterior GP constructed in section 2.3. Letting p_obs = p(f^exp), where f^exp denotes the experimental observation, the probability that the model assigns to values of f where the PDF is greater than p_obs is given by

η = ∫_{p(f) > p_obs} p(f) df.

Qualitatively speaking, if f^exp is near the peak of p(f), assuming the distribution is unimodal, η will be near zero, while if f^exp is far out on the tail, η will be near one. Thus, we choose C = 1 − η as a credibility metric, where larger values of C indicate better agreement between the model and data.
Quantitatively speaking, one must be cautious in interpreting the credibility. If one were to draw data according to the probabilistic model, i.e. if the model were an ideal representation of the stochastic process generating the data, the implied distribution of C is uniform on the interval (0, 1). Thus, even the perfect model would lead to small credibility in a small fraction of cases and, conversely, very high credibility in only a small fraction of cases. For instance, 1% of draws from the perfect model would generate a computed credibility of less than 0.01. Because of this, we choose to interpret the credibility essentially as a sanity check on the probabilistic model. That is, we reject models with credibility below some small threshold (e.g. C < 0.05 or 0.01), but do not attempt to fit any model by maximizing the credibility for given data, since this is likely to lead to overfitting that data, and do not recommend using the credibility to distinguish between models with C above the chosen rejection tolerance.
In general situations, it is difficult to evaluate the credibility, since it involves evaluating a high-dimensional integral over a potentially complicated domain. However, when p(f) is Gaussian, it is analytically tractable, since the domain of integration is a hyper-ellipse. This can be transformed to a hyper-sphere, allowing the credibility to be written in terms of the incomplete gamma function. Specifically, when p(f) is Gaussian with mean µ and covariance K, the region where p(f) > p_obs is the hyper-ellipse (f − µ)^T K^{-1} (f − µ) < r_obs², with r_obs² = (f^exp − µ)^T K^{-1} (f^exp − µ). Transforming to the whitened variable ξ = K^{-1/2}(f − µ) gives

η = (2π)^{-N_e/2} ∫_{ξ^T ξ < r_obs²} exp(−ξ^T ξ / 2) dξ.

Since the integrand depends only on the magnitude of ξ, it is straightforward to evaluate this integral in hyper-spherical coordinates. Letting r² = ξ^T ξ,

η = S_{N_e} (2π)^{-N_e/2} ∫_0^{r_obs} r^{N_e − 1} exp(−r²/2) dr,

where S_{N_e} represents the surface area of the unit hyper-sphere in N_e dimensions, which is given by

S_{N_e} = 2π^{N_e/2} / Γ(N_e/2),

and Γ denotes the usual gamma function. Thus,

η = γ(N_e/2, r_obs²/2) / Γ(N_e/2),

where γ denotes the lower incomplete gamma function:

γ(n, x) = ∫_0^x t^{n−1} e^{−t} dt.

Finally, the credibility can be written as

C = 1 − η = Γ(N_e/2, r_obs²/2) / Γ(N_e/2),

where Γ(n, x) = Γ(n) − γ(n, x) denotes the upper incomplete gamma function.
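Since C is a regularized upper incomplete gamma function of the squared Mahalanobis distance of the observation, it can be evaluated in a few lines; the sketch below assumes the Gaussian model mean mu and covariance K are available as arrays.

```python
import numpy as np
from scipy.special import gammaincc

def credibility(f_exp, mu, K):
    """Credibility C = Gamma(N_e/2, r_obs^2/2) / Gamma(N_e/2) for a Gaussian model N(mu, K)."""
    resid = f_exp - mu
    r2 = resid @ np.linalg.solve(K, resid)   # squared Mahalanobis distance of the observation
    n_e = len(f_exp)
    return gammaincc(0.5 * n_e, 0.5 * r2)    # regularized upper incomplete gamma function
```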

Experimental results on T_e
As a primer for examining the state of the art of GPR in nuclear fusion, we choose as a prototyping example the electron temperature profile T_e in the edge as experimentally measured in a tokamak. The T_e profile is a good example of a 1D field that is determined through experimental measurement, and that is then extensively utilized, both as a diagnostic of machine performance and behavior and as a basic input for simulations.

Experimental procedure
Here we examine the electron temperature profile T_e as a dependent quantity of the independent variable ψ (i.e. T_e = T_e(ψ)). The variable ψ is the normalized poloidal magnetic flux, which serves as the radial coordinate. We define the edge pedestal for these studies as the region ψ = [0.94, 1.0]. The electron temperature profile measurements were taken on the DIII-D tokamak [1] using the Thomson scattering diagnostic system (see [40] for a detailed description of the system). Thomson scattering is the elastic scattering of light from free charged particles; here the charged particles are the plasma electrons and the light source is a laser system. The scattered light spectrum is collected and fitted with a Maxwellian electron velocity distribution, with the mean velocity neglected compared to thermal motions (f ∼ exp(−mv²/2T_e)). Strictly speaking, the collected signal is also proportional to the electron density. Here, we focus on the electron temperature, which can be determined without knowledge of the density.
Electron temperature uncertainties for each radial point are typically determined from both the number of photons collected and the errors on the Maxwellian fit. The largest component of the uncertainty stems from the number of points used for the Maxwellian fits (which maps directly to the number of spectral channels). In this work, the fractional error in T_e due to the number of channels is less than 5% [41], with only small systematic errors.
The work presented below used a high-confinement mode (H-mode, #153764) discharge on the DIII-D tokamak. Details of this discharge have been reported in [42]. The goal of this experiment was to capture the recovery of the edge pedestal after an edge localized mode (ELM) crash using the recently upgraded Thomson scattering system. After an ELM crash the edge radial profiles evolve until saturation is reached (i.e. the profiles reach a steady state). Once this saturation is reached, the profiles are collected. ELM cycles are quasi-periodic and somewhat repetitive, which motivates collection and ensemble-averaging of profiles during the saturated phase. The standard approach is to collect profiles prior to an ELM crash, with the analysis windows chosen to obtain conditions prior to an ELM that are as similar as possible. In particular, ELM cycles can be normalized from 0 to 1, with 1 being the ELM event. Once this normalization is done, one can tag all the data points that are within 80%-99% of an ELM cycle. These data can then be analyzed, for instance, to determine a representative radial profile of the electron temperature prior to the ELM crash. However, despite the generally repetitive nature of ELM cycles, there can be systematic sources of variation between the profiles. These variations make interpretation of typical ensemble-average fits difficult, and may cause problems for the GPR approach as well, as discussed in the results below.
The data resulting from this observation and tagging process, as applied to the H-mode #153764 discharge on DIII-D, are shown in figure 1. Specifically, the blue points, with green error bars corresponding to one standard deviation, represent the 'instantaneous' measurements collected from within the 80%-99% window of the ELM cycle. These instantaneous measurements can be analyzed in a number of ways. To ground the discussion of GPs later, we begin by examining one common fitting procedure in section 3.2, the results of which are shown as the smooth blue curve in figure 1.

Common fitting techniques
A typical objective of the fitting techniques to be discussed is to determine the ensemble-averaged profile in the 80%-99% portion of the ELM cycle. Since the instantaneous T_e measurements that one wishes to use to determine this ensemble-averaged profile do not share the same ψ locations, it is not possible to explicitly construct the sample average of the data to approximate the average profile. Instead, it is common to implicitly approximate the ensemble-average by performing a least-squares fit of some assumed form to the ensemble of instantaneous measurements.
The resulting fit can be thought of as an approximation of the ensemble-average, and, assuming the chosen form is sufficiently flexible, it would converge to the 'true' average in the theoretical limit of infinitely many observations. However, in practice, it is not clear that commonly used forms are sufficiently flexible, and therefore the results are influenced by the selected form. As discussed in [23], in fusion tokamaks the assumed form is often based on splines [43-46]. However, spline fitting can suffer from the fact that knot insertion and degree elevation are challenging to perform systematically over many different fits, as discussed in [23]. An equally common choice, particularly for approximating edge T_e profiles in the pedestal, is based on a modified tanh form, as optionally utilized in the OMFIT framework for example [47, 48]. Specifically, the form is a composite mapping built from a modified hyperbolic tangent, parameterized by constants c_i, i = 0, ..., 8. In this article we generally refer to this form as a modified tanh function; see [49, 50] for example. This form has also been frequently utilized in simulation studies meant to predict the behavior of plasma in tokamaks [51, 52]. Further, modified tanh fits can be partially justified by the fact that they preserve the physically relevant constraint of monotonic temperature profiles, without arbitrarily introducing fine-scale artefacts from over-fitting or experimental error. Table 1 gives brief descriptions of the constants, along with values determined from a least-squares fit to the data from shot #153764. The resulting fit is shown in figure 1. Examining the fit, it is not immediately clear whether this function represents an adequate approximation of the ensemble-mean. Over much of the domain, it seems to run roughly through the middle of the data cloud, such that it seems reasonable as an approximation of the mean. However, there are regions where concern is warranted. For instance, near ψ = 0.962 and ψ = 0.969 there are regions where the data are entirely below the fit, even accounting for the uncertainty in the measurements. Alternatively, near ψ = 0.975, the bulk of the data appears to cluster above the fit, although there are a few observations below. Thus, while this is very difficult to evaluate since we do not have direct measurements of the ensemble-average, it appears that the modified tanh form may be overly restrictive in the sense that it has too few degrees of freedom to match the true ensemble-average. Alternatively, it is possible, perhaps even likely, that these apparent discrepancies between the raw data and the fit are due to real systematic variations between ELM cycles, i.e. systematic variations in T_e due to different conditions prevailing in different analysis windows, or to systematic measurement variations, e.g. instrument calibration variations between different chords.
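For illustration, the sketch below fits a commonly used five-parameter variant of the modified tanh (symmetry point, width, height, offset, and core slope) by weighted least squares; the nine-parameter form referenced above adds further shape terms, and the parameter names, initial guess, and reduced form here are illustrative assumptions rather than the exact parameterization of table 1.

```python
import numpy as np
from scipy.optimize import curve_fit

def mtanh(z, b):
    """Modified hyperbolic tangent with a linear 'core' term on the inboard side."""
    return ((1.0 + b * z) * np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def tanh_profile(psi, c0, c1, c2, c3, c4):
    """Five-parameter pedestal form: c0 = symmetry point, c1 = width, c2 = height,
    c3 = offset, c4 = core slope."""
    z = 2.0 * (c0 - psi) / c1
    return 0.5 * (c2 - c3) * (mtanh(z, c4) + 1.0) + c3

# Hypothetical usage on the tagged (psi_exp, f_exp) data with 1-sigma uncertainties err:
# popt, pcov = curve_fit(tanh_profile, psi_exp, f_exp, sigma=err,
#                        p0=[0.98, 0.02, 0.8, 0.05, 0.1])
```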
Regardless of the cause, it is at best unclear based on the data whether the modified tanh form is sufficiently flexible. This is concerning in that the interpretation of experimental results or simulations based on such fits could then be influenced by features determined by the implicit assumptions embedded in the chosen form rather than by the data or by explicit assumptions, which may at least be openly acknowledged and debated. By moving to a GPR framework, we simultaneously dramatically enrich the flexibility of the fit and require explicit statement, in the form of the prior distribution, of the assumptions embedded in the fit.

GP-based ensemble averaging
In order to assess this sensitivity and demonstrate the performance of the GP-based techniques introduced in section 2, we perform GP fits to the same data as in section 3.2 and figure 1 (shot #153764) with the same objective (i.e. to determine the ensemble-average profile). Relative to the modified tanh approach, GP-based regression is attractive due to its flexibility. While the modified tanh form has only ten degrees of freedom, as described in section 3.2, GP-based regression is a non-parametric method that has the same number of degrees of freedom as observation locations. While we use the discretized approach of section 2 here, we can always choose the grid to be fine enough such that the discretization error is negligible. Further, even with the fairly restrictive squared-exponential covariance, it is known that the GP is flexible enough to represent any smooth function.
To proceed with a naive GPR, we simply set f = T_e in sections 2.1 and 2.2, such that the experimental input is a vector T_e over the magnetic coordinate support points Ψ. We are confronted at this point with a choice of covariance kernel. As mentioned briefly above, there are any number of covariance kernels that can be chosen. For example, a particularly interesting choice is given in [23], where a non-stationary covariance kernel with a lengthscale that varies between the core and edge regions of the plasma is used along with a fairly exotic spline basis. This type of choice can be well-suited to a circumstance where the theoretical physics gives sharp and reliable estimates of the minimum characteristic lengthscales ℓ involved in the plasma dynamics of interest, and the experimental observations are sampled at a spatial resolution ℓ_obs that is more dense than that, i.e. ℓ > ℓ_obs. However, given that we focus exclusively on the pedestal region, there is no a priori evidence that a non-stationary kernel is necessary; i.e. the pedestal is a narrow, distinct region, which is likely characterized by a single length scale between the pedestal top and separatrix. Further, as discussed in [24], it was found that this spline-based non-stationary covariance had a tendency to over-constrain the GPR fit and generate 'artifacts' in higher-order derivatives.
Any structure in the pedestal profiles on length scales shorter than the pedestal width is likely a result of fluctuations. Motivated by several recent studies [31, 51, 53, 54] unambiguously connecting the most studied magnetic fluctuations [55] to microtearing modes (MTMs), we can appeal to MTMs to set the GPR scale lengths. We can approximate a lower bound on the relevant physical length scales ℓ in the mid-pedestal region of the profile by noting that the magnetic skin depth, L_skin = c/ω_pe = (5.31 × 10^5)/√n_e cm (with n_e in cm^-3), is, for a typical DIII-D shot, about L_skin ∼ 0.08 cm. More specifically, noting the localization of the MTM by the variation of ω_* over the pedestal, we can estimate the relevant length scale as L_MTM,global ∼ √(L_skin w_ped), with w_ped the pedestal width. For the pedestal above, this roughly agrees with the width found in GENE simulations: L_MTM,global ∼ 0.35 cm, or L_MTM,global,ρ ∼ 0.006. Similarly, [31] reports that the relevant radial length scale for MTMs in the pedestal is ∼(0.005-0.01) in ρ_tor.
If we wish to expand our set of candidates to additional fluctuations, we can consider ITG/TEM and KBM. For ITG/TEM, the minimum width would be ∼ρ_i ∼ 0.014√(T_e,eV)/B_Tesla cm, which corresponds to ∼0.14 cm for typical DIII-D parameters. For KBM, we use k_y ρ_i < 0.1 as the range of KBM (using k_y at the outboard midplane, relevant to the data), and k_x ∼ k_y. Combining this with the classic Fourier estimate for half widths, δ_x,KBM δ_kx,KBM ∼ 2, this becomes δ_x,KBM ∼ 10πρ_i ∼ 20ρ_i, which is on the order of the pedestal width and much broader than the other modes.
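These scale estimates can be checked with a few lines of arithmetic; the density, temperature, field, and pedestal width below are representative mid-pedestal values assumed for illustration, not measurements from the shot analyzed here.

```python
import numpy as np

# Representative mid-pedestal values (illustrative assumptions only)
n_e = 4.0e13      # electron density [cm^-3]
T_e = 4.0e2       # electron temperature [eV]
B = 2.0           # magnetic field [T]
w_ped = 1.5       # pedestal width [cm]

L_skin = 5.31e5 / np.sqrt(n_e)       # magnetic skin depth c / omega_pe [cm]  -> ~0.08 cm
L_mtm = np.sqrt(L_skin * w_ped)      # global MTM length-scale estimate [cm]  -> ~0.35 cm
rho_i = 0.014 * np.sqrt(T_e) / B     # ion gyroradius estimate [cm]           -> ~0.14 cm
```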
In addition to these physics-motivated length scales, there is a length scale associated with the instrument function of the Thomson scattering procedure used to make the measurements. For the data used here, this scale is expected to be on the order of a few millimeters up to one centimeter, with the dominant contribution coming from, for example, the optical fiber bundle, followed by the laser radius [56].
Consequently, for the sake of prototyping an edge solution, we choose the common RBF covariance function of equation (7), and investigate correlation length scales of roughly two millimeters and larger. Given that the edge region in DIII-D, i.e. where 0.95 < ψ < 1, has a width of two to three centimeters over the bulk of the device, this corresponds to correlation scales in ψ of approximately ℓ = 0.005 and larger. Using ℓ = 0.005 and a standard deviation of σ = 0.5, we arrive at the GPR fit shown in figure 2, which also shows the GP fit for ℓ = 0.01, σ = 0.5 and the modified tanh fit from section 3.2 for comparison. Comparing the ℓ = 0.005, σ = 0.5 fit to the modified tanh result in figure 2(a), the issues previously raised regarding the tanh-like fit in section 3.2 have largely been addressed, at least superficially. That is, near ψ = 0.962 and ψ = 0.969 the GP fit is closer to the data, and in the vicinity of ψ = 0.975 the GP fit runs through the bulk of the data. One drawback of the GP fit here, however, is its non-monotonic nature (i.e. it can produce localized inverted temperature gradients). This is generally thought to be unphysical for ensemble-averaged profiles, since an inverted temperature gradient seems thermodynamically implausible given the strong heating applied to the core plasma in a fusion discharge. Stated differently, an inverted temperature gradient would require a dominant heat pinch in the most strongly driven thermodynamic channel. Such a mechanism has never been proposed theoretically.
It could be that these oscillations are caused by the systematic variations and/or measurement issues discussed previously in the context of the discrepancies in the modified tanh fit. Regardless, the GPR result indicates that the data themselves do not rule out such oscillatory behavior. Thus, to obtain a monotone fit, for this data set at least, one must impose the assumption of monotonicity during the fitting process. In the tanh-like approach, this assumption is imposed by choosing a fixed functional form that cannot oscillate rapidly. In the GP approach there are multiple ways to accomplish this.
The first is to raise the value of the prior length scale ℓ, for example as shown in figure 2(b). The effect of raising the prior length scale is to discourage the kind of oscillatory behavior seen in the GP fit in figure 2(a). With ℓ = 0.01 the fit is still non-monotonic, but is clearly less oscillatory, at the expense of demonstrating a poorer match to the data. This trend continues for ℓ = 0.02, as shown in figure 2(c).
A second approach is to build the assumption that dT_e/dψ is negative more strongly into the prior. We refer to this as the 'monotonicity-promoting' prior, since it gives larger prior probability to monotonic profiles. A similar idea was introduced in [57], where a 'weak' monotonicity constraint was introduced through the prior distribution in the context of inferring electron density profiles. However, the specific formulation in that work is not Gaussian, and thus it is not applicable if we wish to use GPs. Instead, to develop a GP-based monotonicity-promoting prior, we build a custom covariance that strongly favors profiles where dT_e/dψ is negative everywhere. In particular, we construct a prior GP for dT_e/dψ. Since T_e is a linear function of dT_e/dψ, the prior GP for dT_e/dψ induces a GP prior for T_e. Then, we can make profiles with dT_e/dψ ⩽ 0 more likely by manipulating the parameters of the prior for dT_e/dψ. It is important to note that this prior should only be used for datasets where monotonic behavior is expected. Otherwise, the prior will inappropriately weight monotonic profiles, which could lead to poor fits. Thus, while it may be reasonable for the pedestal region, it is not necessarily appropriate for more general use.
To be concrete about the formulation, let

g = D T_e,

where D is the discrete derivative operator introduced in section 2. To make D invertible, the top row of the matrix is replaced by [1, 0, ..., 0]. If g were modified accordingly, this would correspond to a Dirichlet condition. Then, given g, one could solve for T_e(ψ):

T_e = D^{-1} g.

Of course, g is not well known, so we assign a prior:

g ∼ N(µ_g, K_g).   (13)

Then, the induced GP for T_e is given by

T_e ∼ N(µ, K),

where

µ = D^{-1} µ_g,   K = D^{-1} K_g D^{-T}.

Taking µ_g = 0.5 and K_g to be the squared-exponential covariance with σ = 0.5 and ℓ = 0.005 or 0.01, we find the results shown in figure 3. Qualitatively speaking, the results of the GP-based fit with the monotonicity-promoting prior resemble those obtained using the modified tanh approach. The profiles show essentially none of the oscillatory nature of the original GP fits shown in figure 2. Since the goal was to eliminate these oscillations, the approach is successful in this sense. However, as with the original modified tanh fits, the overall data fit is difficult to defend. Again there are regions where the bulk of the data fall entirely above (as near ψ = 0.975) or below (near ψ = 0.962) the mean. In these regions the prior essentially overwhelms the data, keeping the fit from following the data more closely. While this is not necessarily wrong, it is troubling for two reasons. First, there appears to be a substantial sensitivity to the choice of prior, despite what would qualitatively appear to be a large number of data points.
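A sketch of the construction of this induced prior, reusing the rbf_kernel and derivative_matrix helpers from the earlier sketches, is given below; the resulting mean and covariance are then used in the standard update of equation (8).

```python
import numpy as np

def monotonicity_prior(grid, mu_g, sigma, ell):
    """Prior mean and covariance for T_e induced by a GP prior on g = D T_e (equation (13))."""
    D = derivative_matrix(grid)                 # left-sided difference; top row [1, 0, ..., 0]
    D_inv = np.linalg.inv(D)
    K_g = rbf_kernel(grid, grid, sigma, ell)    # squared-exponential prior on the gradient
    mu = D_inv @ (mu_g * np.ones(len(grid)))    # induced prior mean for T_e
    K = D_inv @ K_g @ D_inv.T                   # induced prior covariance for T_e
    return mu, K
```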
Second, and possibly more importantly, it is quite difficult to evaluate the quality of the fit in the present situation. The reason for this is that the fitting procedure is designed to obtain the ensemble-averaged profile, but we have no independent measurements of this ensemble average. To illustrate this point, we may evaluate the credibility metric described in section 2.4 for the GP fits shown in figures 2 and 3, but we find that in all cases the credibility is essentially zero. This result is not surprising at all, since the credibility is assessing whether or not the observed data represent likely draws from our stochastic model for the mean profile. But even if the GP-based profile were a very good representation of the true mean, the variance of the underlying stochastic process generating the data, which is not represented by the GP fit as constructed here, would lead the credibility to be exceedingly small. This only implies that the posterior GP is not a good representation of the stochastic process generating the data, not that it is not a good representation of the ensemble-mean.
Thus, we are forced to discard the credibility in this situation. One can imagine deploying techniques from Bayesian model selection [58], which would evaluate the evidence for different prior assumptions, which would be identified as different models. In the present situation, it is expected that such methods would strongly favor priors that allow the posterior mean to more closely follow the data, i.e. they would favor GP fits like those shown in figure 2 over those shown in figure 3. We do not explore this approach here because it does not resolve the fundamental problem that the results and their implications are strongly dependent on the prior assumptions. That is, if one asserts that it is physically impossible for the ensemble-mean to have the oscillations shown in figure 2, this could be expressed as a prior in the Bayesian model selection that no amount of data could overcome. Instead, we choose to examine a different approach to fitting the ensemble-mean altogether, one that does not require us to use the data in a way where the resulting stochastic model and input data are not directly comparable.

Evaluations and modifications
In the approach to determining the ensemble mean in section 3, the underlying assumption is that the fitting process is implicitly computing the ensemble-averaged profile. The difficulty that this introduces is that we then have no direct observations from the process modeled by our stochastic model, making it very challenging to evaluate the fit. Here we propose an approach in which the fitting and the averaging are separated. In particular, rather than concatenating all measurements during the saturation period, i.e. within 80%-99% of an ELM cycle, as described in section 3.1, we fit a model to each instantaneous profile measurement and then explicitly average to approximate the ensemble mean. Thus, if we have N_p instantaneous profiles, the process involves fitting a GP to each measured profile. Then, denoting the mean function of each of these profiles as µ_p(ψ) for p = 1, ..., N_p, the ensemble-mean is approximated by

µ_ens(ψ) ≈ (1/N_p) Σ_{p=1}^{N_p} µ_p(ψ).   (14)

By treating each profile individually, in the language of Bayesian hierarchical modeling, this approach corresponds to 'no pooling.' Alternatively, the approach from section 3, in which a single model is built for the entire data set, corresponds to 'complete pooling.' In this work, we do not apply the method to the full set of data where the instantaneous profiles are all distinguished. Instead, we focus on only a single instantaneous profile, on which we (1) demonstrate the process of fitting, (2) observe how the results differ from using the full data set observed during the 80%-99% window of an ELM cycle, and (3) deal with issues introduced by the relative sparsity of data in the instantaneous profile.
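Schematically, the 'no pooling' procedure amounts to the following sketch, reusing the gp_posterior helper above and assuming profiles is a hypothetical list of (psi, T_e, error) arrays, one per instantaneous profile.

```python
import numpy as np

def ensemble_mean(profiles, grid, mu, sigma, ell):
    """Fit each instantaneous profile separately, then average the posterior means (equation (14))."""
    means = []
    for psi_p, f_p, err_p in profiles:        # one (psi, T_e, err) triple per instantaneous profile
        mu_p, _ = gp_posterior(psi_p, f_p, err_p, grid, mu, sigma, ell)
        means.append(mu_p)
    return np.mean(means, axis=0)
```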

Demonstration on a single profile with fixed hyperparameters
Figure 4 shows the GP-based fits, using the original squared-exponential prior (i.e. without the monotonicity-promoting modification) from section 3.3. The first thing to notice here is that reducing to the instantaneous profile eliminates all of the statistical inconsistencies we observed in the ELM-averaged profiles. The experimental data distributions now very cleanly overlap the posterior GPR distributions and follow the observation data correctly. For comparison, figure 5 shows results for the same data with the monotonicity-promoting prior as well. Similarly, the data and fit appear to be consistent. To make this quantitative, all the fits in figures 4 and 5 give a credibility >99% (note: except the monotonicity-promoting one, which has a credibility of ∼83%, but requires a much larger σ prior to achieve it, because, as shown in equation (13), the σ in the monotonicity-promoting prior corresponds to the variance in the gradient rather than the base temperature), while, as noted previously, the implicitly ELM-averaged fits in figure 2 have credibilities of 0.00%. As discussed in section 3.3, this low credibility does not imply that the fits in figure 2 are bad representations of the ensemble-mean, but this is very challenging to evaluate. Instead, the high credibility of the results shown in figures 4 and 5 demonstrates that these fits are plausible representations of this individual profile. This means that they are not unreasonable choices to support determination of the ensemble-mean as described in equation (14).
Nevertheless, by solving the problems caused by implicitly ELM-averaging the data, we have uncovered a new set of statistical concerns. As seen in figure 4, the substantial reduction in the number of observation points changes the effect of the parameters in the covariance in very important ways. First, the experimental sampling scale ℓ_obs is now coarser than the physics-driven choice of lengthscale ℓ = 0.005 in the covariance. While the difference between these two fits does not substantially change the mean of the posterior fit, the posterior distributions (which can be far more important) dramatically diverge. That is, setting the covariance lengthscale ℓ < ℓ_obs leads to large uncertainties between measured points. While this is completely consistent with what is expected, the immediate question becomes: what values should one choose for ℓ given such sensitivities in the fitting? Another way of 'tuning' this response in the posterior distribution is to reduce σ, but this can also significantly change the mean value of the fit, especially when observations are sparse. These complications lead to the critically important question: how does one choose the parameters ℓ and σ in a way that is consistent with the physics, informative relative to the observation data, and yet does not impart unjustifiable assumptions into the regression process? As we will discuss in section 5 below, the answers to these questions are what make GP fits justifiable and interpretable, and must be made explicit in order to understand what the GP fit itself even means.

The role of the hyperparameters
The previous sections provide GP solutions of the form where the hyperparameters, specifically σ and ℓ here, are fixed and set independently of the GP fitting process. These sections have highlighted some of the perils involved in trying to determine values for the hyperparameters in a fully physically principled way. We have shown that different hyperparameter values can lead to significantly different GPR fits, depending on the choice of σ, ℓ, as well as µ, and that while different physical constraints (such as monotonicity) can also be enforced, it can become challenging to understand, or interpret in a physically insightful way, which GPR posteriors are most credible. In contrast to previous sections, where σ and ℓ were set based on expert opinion prior to the Bayesian update for T*_e, in this section approaches to inform the hyperparameters based on the data are examined. Specifically, we will examine an approach where the hyperparameter uncertainty is accounted for in the profile model by marginalizing over the hyperparameters when making predictions, and this approach is compared to two point-estimate approaches, namely the maximum likelihood estimate (MLE) and the maximum a posteriori (MAP) point estimate. All these approaches are defined in section 5.1. Then, section 5.2 shows sample results using a GP prior mean that is a constant function of zero. Finally, section 5.3 examines the sensitivity of the results to the choice of the GP prior mean.

The joint posterior
Let θ denote the hyperparameters. Here we show results for inferring the prior variance σ² and covariance length scale ℓ, such that θ = (σ, ℓ). However, the analysis extends automatically to other hyperparameters, which may arise due to different prior covariance choices or if parameters of the prior mean are also inferred. The joint posterior of T*_e and θ is given by

p(T*_e, θ | T^exp_e) = p(T*_e | T^exp_e, θ) p(θ | T^exp_e),   (15)

where the first term on the right-hand side is simply the posterior GP for given θ, as discussed in section 2.1, and p(θ | T^exp_e) is the marginal posterior for the hyperparameters.
Consider a QoI of the following form:

q = ∫∫ f(T*_e, θ) p(T*_e, θ | T^exp_e) dT*_e dθ.

Using (15), this can be rewritten as

q = ∫ [ ∫ f(T*_e, θ) p(T*_e | T^exp_e, θ) dT*_e ] p(θ | T^exp_e) dθ,

where the inner integral (over T*_e) is over a Gaussian distribution and can often be evaluated analytically. Alternatively, the outer integral will in general require numerical integration, e.g. Monte Carlo, quadrature, etc. To continue, it is necessary to further characterize the marginal posterior for the hyperparameters.
Along these lines, Bayes' theorem requires that

p(θ | T^exp_e) = p(T^exp_e | θ) p(θ) / M,

where the marginal likelihood M = ∫ p(T^exp_e | θ) p(θ) dθ is simply a normalization factor. In the following, we will use Markov chain Monte Carlo (MCMC) to draw samples of θ from the marginal posterior. In MCMC, this normalization factor is irrelevant, and thus it will be neglected from now on. Instead, we need only specify the likelihood p(T^exp_e | θ) and prior p(θ). The prior is problem-dependent and will be specified further below. We evaluate the likelihood as follows:

p(T^exp_e | θ) = ∫ p(T^exp_e | T_e, θ) p(T_e | θ) dT_e,   (18)

where T_e denotes the model-predicted temperature at the points where the observations are measured. While (18) is true for any random variable, using T_e is particularly convenient, since p(T^exp_e | T_e, θ) is simply the likelihood from the original Bayesian update (to derive the posterior GP for fixed θ) and p(T_e | θ) is obtained from the prior GP. Since the uncertainty in the data is taken to be Gaussian, this integral can be evaluated analytically. Specifically, letting K_exp denote the covariance of the experimental uncertainty, µ denote the prior mean, and K denote the prior covariance (see section 2.1),

p(T_e | θ) = N(µ, K)   and   p(T^exp_e | T_e, θ) = N(T_e, K_exp).

Thus, the likelihood (18) can be written as

p(T^exp_e | θ) = N(µ, K + K_exp).

Finally, this expression can be rewritten as follows:

p(T^exp_e | θ) = (2π)^{-N_e/2} det(K + K_exp)^{-1/2} exp[ −(1/2)(T^exp_e − µ)^T (K + K_exp)^{-1} (T^exp_e − µ) ].   (22)

Note that throughout the remainder of the paper, for computational convenience, we report the log-likelihood, i.e. the natural log of (22).
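A sketch of the log-likelihood evaluation is given below; it is written directly in terms of the prior covariance at the observation points (rather than the discretized representation actually used for the results), and reuses the rbf_kernel helper and the hypothetical data arrays from the earlier sketches.

```python
import numpy as np

def log_marginal_likelihood(theta, psi_exp, f_exp, K_exp, mu):
    """Natural log of equation (22) for theta = (sigma, ell) and a constant prior mean mu."""
    sigma, ell = theta
    S = rbf_kernel(psi_exp, psi_exp, sigma, ell) + K_exp
    resid = f_exp - mu
    sign, logdet = np.linalg.slogdet(S)
    quad = resid @ np.linalg.solve(S, resid)
    n_e = len(f_exp)
    return -0.5 * (n_e * np.log(2.0 * np.pi) + logdet + quad)
```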
In place of sampling the posterior for θ, point estimates can also be used. Two common estimates are the point that maximizes the likelihood, known as the maximum likelihood estimate (MLE),

θ_MLE = argmax_θ p(T^exp_e | θ),

and the point that maximizes the posterior, known as the maximum a posteriori (MAP) point,

θ_MAP = argmax_θ p(θ | T^exp_e).

Profiles computed with these point approximations will be compared to results from the full posterior in the next section.
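A minimal sketch of the two point estimates, reusing log_marginal_likelihood and the hypothetical data arrays above, is shown below; for brevity a flat hyperparameter prior is used here, in which case the MAP coincides with the MLE, whereas the Gaussian hyperpriors of section 5.2 would make them differ.

```python
import numpy as np
from scipy.optimize import minimize

K_exp = np.diag(err ** 2)                      # observation-noise covariance from the 1-sigma errors
log_prior = lambda t: 0.0                      # flat hyperprior (placeholder; see section 5.2)
neg_log_like = lambda t: -log_marginal_likelihood(t, psi_exp, f_exp, K_exp, mu)
neg_log_post = lambda t: neg_log_like(t) - log_prior(t)

theta_mle = minimize(neg_log_like, x0=[0.5, 0.01], method="Nelder-Mead").x
theta_map = minimize(neg_log_post, x0=[0.5, 0.01], method="Nelder-Mead").x
```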
We note that, if a uniform prior were used, the MLE and the MAP would coincide, assuming that the support of the prior includes the MLE point. However, since Gaussian priors are used here, the MLE and MAP are not the same and may or may not be close, depending on the effect of the prior.

Numerical examples
Figures 6-8 illustrate the hyperparameter estimates from section 5.1. Specifically, the hyperparameters are informed using data from the edge from a single shot, specifically shot 153764-2615-97, which was also used in section 4.1.
To enforce the positivity of the hyperparameters, the inference was performed in terms of the new variables ξ = log(σ²) and η = log(ℓ²). In this transformed space, the hyperparameter prior was chosen to be Gaussian with ξ and η independent and ξ ∼ N(−1.4, 1.0) and η ∼ N(−8, 3). In terms of σ and ℓ this leads to the prior shown in figure 6(a). This prior is intended to promote ℓ values near 0.01 and σ values near 0.5 without strongly ruling out significantly different values, so that the data are able to set σ and ℓ. The data enter through the likelihood, which is shown in figure 6(b), and it is clear that for this case the data prefer a slightly larger σ and significantly larger ℓ. Specifically, the maximum likelihood values are σ_MLE = 0.453 and ℓ_MLE = 0.0320. Finally, figure 6(c) shows the posterior distribution. This distribution was obtained via MCMC sampling using the emcee package [59]. A total of 480 000 samples were taken, with the first third being discarded and the last two-thirds being used to generate results. The MAP values are σ_MAP = 0.462 and ℓ_MAP = 0.0314. In this case, the MLE and MAP are very similar, and the posterior is fairly concentrated around the MAP. Because of this, the MLE, MAP, and posterior marginal profiles for both T_e and dT_e/dψ are very similar to each other, as shown in figures 7 and 8. However, in general this is not necessarily the case. Also note that the ℓ values determined from the data, either the MLE or MAP, are fairly large (>0.03), and that the posterior assigns essentially zero probability to ℓ values below 0.01. This can be interpreted to mean that, for the given sparse data set in the edge, phenomena such as high-frequency MTM modes with ℓ < 0.008 cannot be justifiably inferred from the single set of experimental data considered in this examination. This result is not surprising given the resolution of the data, but nonetheless reinforces the notion that GPR, or any inference technique, can only provide insight into the physical processes which are captured by the data. Alternatively, even when the data are sparse, it is possible to pose and answer questions about what data would be required to distinguish between different hypotheses in the GPR context, which could help guide experimental design.
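A sketch of the sampling step, in the transformed variables ξ and η and using the emcee package, is shown below; the hyperprior spreads are interpreted here as standard deviations, and the walker count and chain length are illustrative choices rather than those used to generate figure 6.

```python
import numpy as np
import emcee

def log_prob(x, psi_exp, f_exp, K_exp, mu):
    """Log-posterior in the transformed variables x = (xi, eta) = (log sigma^2, log ell^2)."""
    xi, eta = x
    sigma, ell = np.exp(0.5 * xi), np.exp(0.5 * eta)
    # Independent Gaussian hyperpriors as in the text; spreads treated here as standard deviations.
    log_prior = -0.5 * ((xi + 1.4) ** 2 / 1.0 ** 2 + (eta + 8.0) ** 2 / 3.0 ** 2)
    return log_prior + log_marginal_likelihood((sigma, ell), psi_exp, f_exp, K_exp, mu)

# Illustrative usage (walker count and chain length are not those used for figure 6):
# nwalkers, ndim = 32, 2
# x0 = np.array([-1.4, -8.0]) + 0.1 * np.random.randn(nwalkers, ndim)
# sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, args=(psi_exp, f_exp, K_exp, mu))
# sampler.run_mcmc(x0, 15000)
# chain = sampler.get_chain(discard=5000, flat=True)   # discard roughly the first third
```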

Influence of the GP prior mean function
As seen in section 3, the data aggregation technique (e.g. ELM-averaging), the kernel constraints, and the hyperparameter values-along with the approach used to set σ and ℓ-can all dramatically impact the regressed GP fit obtained from the data, and hence the plasma physics interpretation of the observation. Here, we show that the prior mean function µ also affects the results, both the profiles and the inferred hyperparameters. To demonstrate this sensitivity, three prior mean functions are used: a constant value of zero, a constant value of 0.5, and a linear fit to the data.
Table 2 shows the MLE of σ and ℓ for different choices of prior mean function. Clearly, the choice of prior mean function µ has a significant impact on the optimal values of the hyperparameters. For example, if one starts with a constant prior mean µ = 0.0 the maximum likelihood value of ℓ is ℓ ≈ 0.03, while if one sets the prior mean to a best-fit line through the experimental support points the MLE is ℓ ≈ 0.008. Similarly, the MLE σ value decreases, from 0.453 for µ = 0 to 0.304 for µ = 0.5 to only 0.0576 for the best-fit line.
It is not surprising that the hyperparameters change. When the prior mean function is altered, the function that the GP must fit, which is the difference between the prior mean function and the data, also changes. Thus, it makes sense, for example, that the MLE σ value decreases, since the data are closer to the prior mean as we go from µ = 0 to the best-fit line case. However, it must be noted that these choices do influence the resulting fit and any derived quantities, as demonstrated in figures 9 and 10, which show T_e and dT_e/dψ, respectively, for different prior mean functions, each using its own maximum likelihood estimates of the hyperparameters. This is a particularly delicate and important observation, since it is very difficult, if not impossible, to tell which of these fits should be preferred without additional information. All of the results appear to match the data reasonably well, with the uncertainty interval containing all or most of the data, and all have high credibility I. This means that all of these fits are statistically credible representations of the experimental data, and thus the experimental data, by itself, cannot determine with high confidence which is more correct. This result is not uncommon in situations where the data are sparse, as they are here, since we focus on a single profile, and it indicates that additional data or some external information-e.g. information from theory-must be incorporated if the inference is to be better constrained.
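As an illustration of how these prior mean choices enter the hyperparameter optimization, the sketch below re-computes the MLE for each of the three mean functions. It reuses the hypothetical log_likelihood and data arrays from the earlier sketches; the printed values depend entirely on the actual data and are not the values reported in table 2.

```python
import numpy as np
from scipy.optimize import minimize

# Three candidate prior mean functions evaluated at the measurement locations:
# identically zero, a constant 0.5, and a least-squares line through the data.
slope, intercept = np.polyfit(psi, T_exp, 1)
prior_means = {
    "mu = 0":        np.zeros_like(T_exp),
    "mu = 0.5":      0.5 * np.ones_like(T_exp),
    "best-fit line": slope * psi + intercept,
}

for label, mu_i in prior_means.items():
    res = minimize(lambda th: -log_likelihood(th, psi, T_exp, err, mu_i),
                   x0=[0.5, 0.01], method="Nelder-Mead")
    print(f"{label}: sigma_MLE = {res.x[0]:.3g}, ell_MLE = {res.x[1]:.3g}")
```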

Discussion
In addition to the analysis shown here, the ideas and techniques introduced and applied in sections 2 through 5 can be used more generally. Below we describe further uses of the edge profile GPR analysis and provide general guidance for applying these tools in new situations.

Further uses of GPR models for the edge profile
The GPR results for the edge profile data can be used directly to provide probabilistic answers to questions that the data can address on their own. For instance, the data alone can in principle answer questions regarding the characteristics of the edge profile, although, as shown in section 5, one possible outcome of the analysis is that the data are too sparse to support a statistically meaningful conclusion.
However, for many other questions, it is necessary to combine the edge profile with further modeling and analysis. Examples include numerical solutions of gyrokinetic or fluid models, or analysis of MHD stability. In such situations, the GPR model provides the input profiles for further simulations or analysis. When it is computationally tractable to evaluate the map from the input profile to the resulting quantities of interest (QoIs) many times, one can sample the posterior GP, as described in section 2.3, to generate an ensemble of input profiles for use in a Monte Carlo analysis to approximate the distribution of the QoIs. In cases where this map is computationally expensive, it may be necessary to combine this sampling with advanced uncertainty quantification algorithms to make the forward propagation tractable. In the extreme case where the forward model is so computationally expensive that only a single simulation is possible, the maximum a posteriori (MAP) result-i.e. the profile that maximizes the posterior distribution, which is the posterior mean in the GP case-should be used.
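A minimal sketch of this Monte Carlo propagation is given below. Here forward_model is a hypothetical stand-in for the expensive simulation, and m_post, K_post, and psi_grid denote the posterior GP mean, posterior covariance, and evaluation grid, all assumed to be available from the fitting step.

```python
import numpy as np

def forward_model(Te_profile):
    # Hypothetical placeholder for a map from a temperature profile to a scalar
    # quantity of interest; in practice this would be, e.g., a stability analysis.
    return np.max(np.abs(np.gradient(Te_profile, psi_grid)))

# Draw an ensemble of profiles from the posterior GP and propagate each one.
rng = np.random.default_rng(0)
profiles = rng.multivariate_normal(m_post, K_post, size=200)
qoi_samples = np.array([forward_model(p) for p in profiles])
print("QoI mean:", qoi_samples.mean(),
      "95% interval:", np.percentile(qoi_samples, [2.5, 97.5]))
```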
In this extreme case, while the ability to quantify the uncertainty in the QoIs due to input uncertainty in the profiles-which is a significant advantage of GPR over traditional deterministic methods-is obviously lost, the GPR approach still provides a coherent method for determining the most likely profile, given the measurements. Further, it still has two important benefits over more traditional methods. First, the MAP profile is equivalent to a non-parametric, weighted least-squares fit to the data; since it is non-parametric, the method is generally able to fit the data better than parametric forms with only a few parameters, such as the traditional modified tanh form, as shown in section 4. Second, the prior distribution provides a natural method for incorporating additional information that is separate from the data itself. For example, this flexibility was used in section 3.3 to impose a 'monotonicity-promoting' prior, which helped to avoid oscillatory profiles.

Applying GPR to new datasets
Clearly GPR may be applied to understand data beyond the tokamak edge T_e profile measurements considered here. However, it is important that these tools be properly utilized in order to ensure that reliable insights and interpretively coherent conclusions are drawn from observational data. To promote such utilization, a recommended set of phases in a GPR application is given below. This list is intended to be a loose guide for the application of GPR and other data-driven modeling techniques, with the understanding that the process is necessarily dependent on application-specific details and that it is not possible to consider all contingencies.
1. Initial model exploration and refinement. The goal of this stage is to explore the characteristics of the data and determine what GP features will be necessary to generate a reasonable model. This process generally begins by fitting with 'off-the-shelf' models-e.g. a squared-exponential covariance with user-set hyperparameters and a zero-mean prior-and the results are examined in mostly qualitative ways, such as plotting the data and the posterior model. The initial GPR modeling attempt described in section 3.3 falls into this phase. As was the case in that study, often the simple initial models are not even qualitatively credible. However, this initial phase is nonetheless useful because it allows one to rapidly begin examining the characteristics of the relationship between the data and the model. This relationship, specifically the coarse characteristics of the mismatch between the data and the initial models, will generally suggest potential improvements. For instance, it may be necessary to use a richer prior covariance or a parametric model for the mean. Or, process improvements, such as Bayesian calibration of the covariance hyperparameters, may be required. Or, it may be necessary to revise the interpretation of the data, which was the case here, as discussed in section 4, or even to gather more or different data. Regardless of the detailed course of action, at the end of this phase a qualitatively reasonable model has been constructed.
2. Model validation and further refinement. Once a qualitatively reasonable model has been constructed, it can be evaluated more systematically. This validation can take different forms depending on the goals of the study. One metric, the credibility, defined in section 2.4, has been used here, but other approaches may be dictated by the goals of the study. For some other possibilities, see the literature on 'model checking', e.g. [60]. Further, the development of a model that passes a validation check is not necessarily the endpoint of this phase, because the analysis undertaken as part of the validation assessment may itself lead to other improvements in the model or approach. For instance, in the present work, the dual observations in section 4 that (1) models with a wide range of covariance lengthscales ℓ give high credibility for the data and (2) the model predictions for output quantities of interest are sensitive to this parameter led to the application of a Bayesian inference framework to better inform the GPR hyperparameters. As in the initial model exploration phase, the refinements necessary to achieve a validated model may take many forms. However, at the end of this phase, the model has been validated and deemed mature enough to make predictions of the quantities of interest.
3. Probabilistic analysis. The final phase is to use the validated model for probabilistic analysis. Conceptually, this phase is the simplest. It involves forward propagation of the stochastic model developed in the first two phases through auxiliary relationships or models to determine the probability distributions of the output quantities of interest. As a basic example, the stochastic gradients implied by the stochastic T_e model were evaluated in section 4, and the GPR hyperparameters themselves were regarded as output quantities of interest in section 5 (a minimal sketch of such a propagation is given after this list). As noted above, the results of this phase may indicate that it is not possible to differentiate between alternative behaviors from the given data alone. In this case, additional sensitivity analysis may point the way to what data should be gathered to better constrain the results.
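As a concrete instance of this final phase, the short sketch below propagates the hypothetical posterior profile ensemble from the previous sketch through a numerical derivative to obtain a pointwise distribution for dT_e/dψ; the arrays profiles and psi_grid are the placeholder objects introduced above.

```python
import numpy as np

# Differentiate each posterior sample numerically and summarize pointwise;
# a simple alternative to constructing the derivative GP analytically.
grad_samples = np.gradient(profiles, psi_grid, axis=1)
grad_mean = grad_samples.mean(axis=0)
grad_lo, grad_hi = np.percentile(grad_samples, [2.5, 97.5], axis=0)
```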

Conclusion
Principled statistical methods such as GPR hold enormous potential for extracting a deeper and sharper understanding of data from magnetic confinement fusion experiments. However, the results they provide do not always lead to the insights one might like. Sometimes, for example, instead of clearly determining that a particular behavior is present within an experimental campaign, they may show that it is not possible to differentiate between alternative behaviors from the given data alone, due, for example, to a combination of underlying constraints imposed by the theory and limitations in the experimental instrumentation itself. As a result, these tools can be impactful not only for analyzing data, validating theory, and informing experimental control, but also for developing and supplementing robust diagnostic systems and for experimental design.

Figure 1. Ensemble-averaged profile in the 80%-99% portion of the ELM cycle, with the edge Te modified tanh fit profile from experimental DIII-D data and 1σ Gaussian error recorded at each point.

Figure 5. On top, the local GPR profile and gradient with σ = 0.5, and on bottom, the monotone-constrained fit with σ = 5.0. Both use a lengthscale ℓ = 0.01.

Figure 7. Te distribution implied by the MLE and MAP parameter estimates, and as computed by marginalizing over the posterior for σ and ℓ, using a constant prior mean function (µ = 0.0).

Figure 8. dTe/dψ distribution implied by the MLE and MAP parameter estimates, and as computed by marginalizing over the posterior for σ and ℓ, using a constant prior mean function (µ = 0.0).

Figure 9. Te distribution implied by the MLE parameter estimate for different choices of prior mean function.

Figure 10. dTe/dψ distribution implied by the MLE parameter estimate for different choices of prior mean function.

Table 1. Key for tanh-like fit coefficients and example prototype value.

Table 2. Effect of the GP prior mean function on the MLE values of the hyperparameters.