Machine learning and Bayesian inference in nuclear fusion research: an overview

This article reviews applications of Bayesian inference and machine learning (ML) in nuclear fusion research. Current and next-generation nuclear fusion experiments require analysis and modelling efforts that integrate different models consistently and exploit information found across heterogeneous data sources in an efficient manner. Model-based Bayesian inference provides a framework well suited for the interpretation of observed data given physics and probabilistic assumptions, also for very complex systems, thanks to its rigorous and straightforward treatment of uncertainties and modelling hypotheses. On the other hand, ML methods, in particular neural networks and deep learning models, are based on black-box statistical models and allow the handling of large volumes of data and computation very efficiently. For this reason, approaches which make use of ML and Bayesian inference separately and also in conjunction are of particular interest for today's experiments and are the main topic of this review. This article also presents an approach where physics-based Bayesian inference and black-box ML work in tandem, mitigating each other's drawbacks: the former is made more efficient, the latter more interpretable.


Introduction and motivation
The progress of nuclear fusion research depends on a number of scientific modelling and analysis efforts, including theoretical, experimental, computational and inferential expertise.
The operation of an experiment and, in the future, of a reactor also relies on algorithms whose computational efficiency satisfies strict time requirements. The algorithms typically consist of a combination of plasma physics simulations and data analysis models aimed at predicting the plasma state. Limitations on the available computational power usually drive the development along a trade-off between accuracy and efficiency.
In this article, we review the application of two subjects, Bayesian modelling and machine learning (ML), both of which, in our opinion, are game-changing, powerful and mature fields that should be standard approaches for the complex inferential tasks facing nuclear fusion research. This is especially so in light of current trends suggesting that the amount and complexity of the collected data are going to increase.
The two subjects have an increasing number of applications in the nuclear fusion field: one of the goals of this review is therefore to provide a selection of example applications to gain insight into how they can be used and what can be achieved with them in fusion research.
Why Bayesian inference and ML? The scientific method itself relies on extracting knowledge about the world using the raw data provided by the outcome of an experiment. The Bayesian interpretation of probability theory allows us to do this in a principled way, using mathematical formulations of all aspects of a model (physics assumptions, uncertainties in models and nuisance parameters, data prediction) and using such models for inference from experimental data. Probabilities are here interpreted as degrees of belief, hence the language of model uncertainties: a quantitative description of the limited possibility of predicting natural phenomena with theoretical models. As emerges from this description, the rules for scientific Bayesian inference do not depend on the specific nature of a model or experiment. As such, they are ideal for an efficient exploitation of the information shared among heterogeneous experimental data sources regarding a common underlying description, or model, of a phenomenon.
On the other hand, as we will show, ML provides computational methods that are also suitable for efficient implementations of Bayesian inference as well as for automation of the process of analysis and learning from available experimental data. In particular, recent advances in what is now called deep learning (DL) allow the development of algorithms which can autonomously learn to approximate sophisticated tasks and can execute them at time scales often useful for real-time applications. These tasks range from scientific discovery [1] to engineering-oriented applications of complex autonomous control [2].
ML is most commonly used to learn an empirical model of the data from their statistical properties in cases where a first-principled theoretical model is lacking. In this case, the models learnt from data are 'black-box' and not readily interpretable, but have the advantage of computational efficiency. There are also important cases where ML can be used to implement first-principled models and learn through Bayesian inference. We will cover such cases in nuclear fusion research, and describe an approach to alleviate the computational inefficiency of Bayesian models by developing and training approximate and fast 'black-box' ML models.
We hope to show that the combination of Bayesian inference and ML can provide powerful tools to perform accurate scientific inference also within practical constraints of execution time, as is often required in nuclear fusion experiments.
Reviewing every single application of ML and Bayesian inference is virtually impossible, and a selection has to be made, including some authors and excluding others. We are aware that there are more contributions relevant to the topics of this article. Differently from most ML algorithms, which require a multitude of examples to learn from, we base our writing on a different principle: the limited number of examples reviewed in this article should be enough for readers to form the fundamental intuitions allowing them to interpret and apply Bayesian inference and ML to their own research work.
This review is structured in the following way: section 2 introduces basic elements of Bayesian inference and section 3 reviews its application to physics models of nuclear fusion measurements. Section 4 introduces in some detail a few ML algorithms which have seen multiple applications to nuclear fusion. Section 5 discusses the application of Bayesian inference to neural network models, in analogy with the previously discussed physics-based models, and shows the advantages it brings to uncertainty quantification (UQ) for black-box models. In section 6, we review applications of ML addressing problems in nuclear fusion for which a first-principles understanding is lacking or computational efficiency is of concern. We conclude in section 7 with a discussion of some of the limitations inherent to ML methods, possible solutions to overcome them, and opportunities in fusion research for the application of novel ML methods.

Bayesian probability
According to the Bayesian interpretation of probability theory, probabilities describe degrees of certainty about the assumptions of a model of reality and the predictions that a model can make. We refer to the model as a generative model with associated prior, posterior and predictive uncertainties. The model is called generative because it includes all model assumptions necessary to generate potential observable data. It is thus a full model of the observables. Conditioning the model on actual observations directly gives a probability distribution over the free parameters of the model.

Bayes rule. In a physics context, the model is a mathematical model of a process whose outcome can be observed, having free parameters h whose values can be adjusted to compute predictions of expected observations d. The model's hypotheses include the possible values that the free parameters can have according to a priori knowledge, i.e. before any experiment is conducted. This is the prior probability distribution p(h). The possible outcomes of the experiment which can be predicted with the model, given a choice of parameter values, are distributed according to a likelihood function p(d|h). Typically, the model has a functional form which allows the computation of the likelihood as

d = f(h) + ϵ,     (1)

where f corresponds to a deterministic simulation or forward model, and ϵ represents the remaining uncertainty of the observations. This could be a Gaussian distribution with mean zero and standard deviation σ:

p(ϵ) = N(0, σ²).     (2)

Bayes rule, shown in equation (3), gives the updating formula that transforms the prior to the posterior distribution, taking into account the information provided by the observations:

p(h|d) = p(d|h) p(h) / p(d).     (3)

This updated probability distribution of the model's parameters, p(h|d), is the posterior probability distribution.
In the equation, the term p(d) is a normalization factor, known under the names of model evidence or marginal likelihood, and is computed as:

p(d) = ∫ p(d|h) p(h) dh.     (4)

Equation (3) also shows that it is possible to write the posterior distribution p(h|d) in terms of p(d, h), the joint probability distribution of the model's parameters and observations. Parameter uncertainties given by the posterior distribution are particularly interesting for models whose parameters have a clear interpretation and therefore are not black-box. Such models can be, for example, mechanistic models of plasma parameters or plasma diagnostics which allow the prediction of the raw measurements expected from an experiment, where the model's free parameters correspond to relevant plasma parameters. Section 3 reviews the application of Bayesian inference based on simulations of processes which can be observed in fusion plasmas.

Prediction and marginalization. The model with updated knowledge of the parameters' distribution can be used to make new predictions t. The associated predictive uncertainties are then computed by marginalization over the model parameters of the a posteriori distribution, as shown in equation (5). The distribution p(t|d) of new predictions conditioned on the observed data d is known as the posterior predictive distribution:

p(t|d) = ∫ p(t|h) p(h|d) dh.     (5)

Marginalization is a common operation of probabilistic inference which allows computing the distribution of a quantity t, depending on another one h, by taking into account all possible variations of h. It can be seen as the probabilistic solution to propagate uncertainty from h to t. Predictive uncertainties of this kind are particularly useful in the context of black-box models, such as neural networks, whose parameters, or weights, have no straightforward interpretation: their posterior distribution, inferred from the observed training data, contains little meaningful information from the point of view of understanding the model, but can give full uncertainties of predicted quantities. In section 5 we will show how limitations inherent to black-box approaches can be mitigated in this way.
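To make equations (1)-(5) concrete, the following minimal sketch (our own illustration, not taken from the reviewed literature) evaluates the posterior and the posterior predictive distribution on a grid for a one-dimensional toy problem; the forward model f, the prior bounds and the noise level σ are arbitrary assumptions chosen only for the example.

```python
import numpy as np

# Toy setup: scalar parameter h, forward model f(h), Gaussian noise with std sigma.
# All numbers here are arbitrary choices for illustration only.
f = lambda h: h**2          # hypothetical deterministic forward model
sigma = 0.5                 # assumed observation noise (equation (2))
d_obs = 2.3                 # a single observed datum

# Prior p(h): flat over a grid (an assumption of this sketch).
h = np.linspace(0.0, 3.0, 1000)
prior = np.ones_like(h) / np.trapz(np.ones_like(h), h)

# Likelihood p(d|h) from d = f(h) + eps, eps ~ N(0, sigma^2)  (equations (1)-(2)).
likelihood = np.exp(-0.5 * ((d_obs - f(h)) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Evidence p(d) by marginalization over h (equation (4)), then Bayes rule (equation (3)).
evidence = np.trapz(likelihood * prior, h)
posterior = likelihood * prior / evidence

# Posterior predictive p(t|d) for a new prediction t = f(h) + eps (equation (5)):
# marginalize over h, weighting each value by its posterior probability.
t = np.linspace(-1.0, 10.0, 500)
pred = np.trapz(
    np.exp(-0.5 * ((t[:, None] - f(h)[None, :]) / sigma) ** 2)
    / (sigma * np.sqrt(2 * np.pi)) * posterior[None, :],
    h, axis=1)

print("posterior mean of h:", np.trapz(h * posterior, h))
print("predictive mean of t:", np.trapz(t * pred, t))
```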

Bayesian modelling for fusion experiments
Some of the most successful applications of Bayesian inference in nuclear fusion regard the interpretation of diagnostic raw data, an early example of which is the Thomson scattering (TS) diagnostic [3]. Under a Bayesian interpretation, each diagnostic measurement can be seen as a constraint on the distribution of possible plasma parameter values, as is shown by writing the joint probability distribution of the system. As an example, consider measurements of interferometer (IF) and TS diagnostics together [4]: they depend on the plasma electron density n_e and the plasma electron temperature T_e, constraining the possible values these can have. Therefore, given a predictive model f_i for the IF signal d_i = f_i(n_e), and one for the TS signal d_t = f_t(n_e, T_e), the joint probability distribution of a consistent model of the electron density can be written as:

p(d_i, d_t, n_e, T_e) = p(d_i|n_e) p(d_t|n_e, T_e) p(n_e) p(T_e),     (6)

where p(n_e), p(T_e) are the prior probability distributions of the electron density and temperature and p(d_i|n_e) and p(d_t|n_e, T_e) are the likelihood functions of the signals. The posterior probability distribution of electron density values consistent with both diagnostic measurements can be written as:

p(n_e|d_i, d_t) ∝ ∫ p(d_i|n_e) p(d_t|n_e, T_e) p(n_e) p(T_e) dT_e,     (7)

where we have also done a marginalization over the electron temperature. The challenge in practical implementations of Bayesian inference is twofold: on one side, the accurate development of the computational models f_t and f_i requires a thorough understanding of the physical processes in play and a detailed accounting of experimental configurations, such as instrument functions, calibration and geometrical features; on the other side, the computation of the posterior distribution with numerical recipes is not trivial: it often requires tailoring a problem-specific implementation and a trade-off between accuracy and efficiency. As shown by the work reviewed in this section, the painstaking modelling of diagnostic measurements and the implementation of robust inference algorithms pays off with accurate and consistent results.
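As a hedged illustration of equations (6) and (7), the sketch below builds the joint distribution of two synthetic diagnostics on a coarse (n_e, T_e) grid and marginalizes the electron temperature numerically; the forward models f_i, f_t, the noise levels and the grid ranges are invented placeholders, not the actual diagnostic models used in the reviewed work.

```python
import numpy as np

# Hypothetical scalar forward models for one IF channel and one TS channel.
f_i = lambda ne: 2.0 * ne                    # IF signal depends on density only
f_t = lambda ne, te: ne * np.sqrt(te)        # TS signal depends on density and temperature
sig_i, sig_t = 0.3, 0.5                      # assumed measurement noise levels
d_i_obs, d_t_obs = 4.1, 3.2                  # synthetic observations

# Grids standing in for the prior support of n_e and T_e (flat priors here).
ne = np.linspace(0.5, 4.0, 400)
te = np.linspace(0.2, 5.0, 400)
NE, TE = np.meshgrid(ne, te, indexing="ij")

# Joint distribution, equation (6): product of the two likelihoods and the flat priors.
log_joint = (-0.5 * ((d_i_obs - f_i(NE)) / sig_i) ** 2
             - 0.5 * ((d_t_obs - f_t(NE, TE)) / sig_t) ** 2)
joint = np.exp(log_joint - log_joint.max())

# Posterior of n_e, equation (7): marginalize T_e and normalize over n_e.
post_ne = np.trapz(joint, te, axis=1)
post_ne /= np.trapz(post_ne, ne)

print("posterior mean n_e:", np.trapz(ne * post_ne, ne))
```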
Inference of plasma equilibria. An early application of Bayesian inference to achieve consistency across heterogeneous measurements is reported in [5, 6]. The authors create a joint model of a number of diagnostics together with an equilibrium constraint, where both the uncertainty of flux surfaces and other nuisance parameters are modelled to achieve consistency between diagnostics: the Bayesian approach is to include all systematic uncertainties in the model as probability distributions that can in some cases be inferred from data, but in all cases allow the inclusion of their uncertainties in the final marginalized posterior over the plasma parameters of interest. In other words, if a set of parameters a = {a_i} identifies the nuisance parameters with probability distribution p(a_i), the posterior distribution of the plasma parameters h_p can be obtained by marginalization:

p(h_p|d) = ∫ p(h_p, a|d) da.     (8)

In this way, the authors succeed in performing Bayesian inference for a joint model of a TS diagnostic, diamagnetic loop measurement, neutral particle analyzer and an IF system modelled together with a surrogate equilibrium model for the W7-AS stellarator. The free parameters of the model are electron density, temperature and ion pressure as functions of the toroidal magnetic flux, which are then inferred self-consistently with the uncertain magnetic coordinate system. The nuisance parameters include uncertainties of the flux surfaces and the calibration of the YAG Thomson system, which allows the joint inference of consistent electron density, temperature and pressure profiles. Figure 1 shows posterior samples of the position of the flux surfaces: the simultaneous inference of flux coordinates and profiles is necessary to exclude systematic errors arising from a fixed magnetic coordinate system. Reference [7] presents another Bayesian approach, current tomography, where the posterior distribution of the plasma current distribution is inferred using a tomographic approach assuming only magnetostatics and axisymmetry. The work described in [8] demonstrates another Bayesian approach to the inference of tokamak plasma equilibria at JET. The prior distributions of the toroidal current distributions, plasma pressure and current profiles, modelled with parametric functions, are constrained by the magnetohydrodynamic force balance equation and the magnetic measurements. The authors show that the time evolution of the pedestal pressure inferred by this approach follows the kinetic measurements well in both trend and absolute magnitude throughout an H-mode plasma discharge. Kwak et al [9] further develop this approach to obtain plasma equilibria consistent with both magnetic and kinetic measurements from various plasma diagnostics at JET: magnetic probes including pickup coils, saddle loops and flux loops, IF, polarimeter, TS and lithium beam diagnostics. The model allows one to predict volume-averaged quantities, such as magnetic inductance and poloidal beta, from the posterior samples, over a plasma discharge including L-mode and H-mode phases. Figure 2 shows the posterior samples of the magnetic flux surfaces, electron density and temperature and q-profiles from the equilibrium inference.

Gaussian process (GP) tomography. Svensson et al [10] develop a Bayesian non-parametric tomographic method based on GPs. Owing to the incomplete understanding of the transport phenomena in fusion plasmas, the underlying functional form of plasma profiles is unknown, and the authors use a GP to define the underlying function for the case of a tomographic inversion.
In this formulation, the function is expressed as a random vector, in which each entry contains the function value f(x) at a particular domain point x, following a multivariate Gaussian distribution. Although the function is not defined by a parametric formula, the GP constrains the function through covariances between domain points: Σ(x, x′). The covariance is determined by a function with hyperparameters determining the properties of the function, e.g. the smoothness. In this application, the observations are not function evaluations of the GP, f(x), but quantities derived by integrating over the function. Since these tomographic integrals are linear operators, the distribution of the observations is given by another GP, yielding a closed-form expression for the Bayesian posterior of the unknown function for the tomographic problem. In this way, properties of the function such as smoothness can be explicitly specified by using different GP families. In the example of [10], the free parameters modelled with GPs are the 2D distribution of toroidal current density in the plasma cross-section and electron density profiles as a function of the flux coordinate, and the corresponding observations are the Biot-Savart integrals to predict magnetic measurements, and the integrals along the lines of sight of an IF diagnostic. Figure 3 shows the inference process of the posterior distribution of smooth profiles of the electron density obtained with GPs. The initial, rather uniform, prior distribution on the top left already contains smooth functions over a large value range. The space of plausible functions is further constrained by the addition of observations, as is well shown by the evolution of the posterior distribution as more measurements are used for the inference. The smoothness of the underlying function can also be determined directly from the data by Bayesian Occam's razor through maximization of the evidence term. The bottom plot of figure 4 shows the posterior distribution of density profiles obtained with fixed values of the hyperparameters. The top plot, instead, shows that the inference of the hyperparameter values from the data embodies the principle of Occam's razor: overly complex functions with small length scales, such as the ones obtained with fixed hyperparameters, are discarded, and the preferred solution is a family of slowly varying functions, more plausible from a physical point of view, in good agreement with independent measurements of a TS diagnostic, shown in red. Alternatively, the hyperparameters can be marginalized from the posterior. GP tomography has been applied in a number of experiments, including [11-14], where it is used to infer 2D emissivity contours from soft x-ray diagnostics at W7-AS and W7-X, effective ion charge profiles, and the 2D toroidal current distribution in a field reversed configuration device. The latter application was used to study the behavior of Alfvénic transients.
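The closed-form GP posterior under linear (line-integrated) observations described above can be written in a few lines; the following sketch, with an assumed squared-exponential covariance, a toy geometry matrix R and an invented noise level, is our own minimal illustration rather than the implementation used in [10].

```python
import numpy as np

# Discretize the unknown profile f(x) on a grid; model it as a zero-mean GP prior.
x = np.linspace(0.0, 1.0, 100)
ell, amp = 0.15, 1.0                      # assumed GP hyperparameters (length scale, amplitude)
K = amp**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

# Linear observation operator R: each row is a crude "line integral" (a boxcar sum),
# standing in for Biot-Savart or interferometer line-of-sight integrals.
n_obs = 8
R = np.zeros((n_obs, x.size))
for k in range(n_obs):
    R[k, k * 12:(k + 1) * 12] = x[1] - x[0]

sigma = 0.01                               # assumed observation noise
d = R @ np.sin(np.pi * x) + sigma * np.random.randn(n_obs)   # synthetic data

# Because the observations are linear in f, the posterior is Gaussian with closed form.
S = R @ K @ R.T + sigma**2 * np.eye(n_obs)
mean_post = K @ R.T @ np.linalg.solve(S, d)
cov_post = K - K @ R.T @ np.linalg.solve(S, R @ K)

print("posterior std at mid-profile:", np.sqrt(cov_post[50, 50]))
```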
Uncertainty propagation. A major benefit of the application of Bayesian inference is the straightforward estimation of uncertainties. This is well exemplified in [15], where the authors discuss different methods to obtain samples from the posterior distribution of the electron density, temperature and pressure profiles in order to propagate uncertainties into derived quantities. The computation of these quantities can involve simple operations, such as the calculation of gradients, or computationally expensive simulations. As an example of the latter case, consider simulation codes such as GENE [16] and SOLPS-ITER [17], which usually require plasma profiles and their gradients as input. In this case, the simulations are run for each sample, generating a distribution of outputs corresponding to the uncertainties. This allows an accurate study of the sensitivity of modelling codes. The authors describe two methods to obtain samples from the posterior distribution. In the first one, an expensive Markov chain Monte Carlo (MCMC) sampling algorithm is used to sample from the posterior distribution of the electron density and temperature profiles with a joint model of interferometry, lithium beam, electron cyclotron emission (ECE) and TS diagnostics at the ASDEX Upgrade tokamak [18]. The MCMC approach can be computationally demanding since it requires executing the diagnostic forward models for a number of iterations ranging from 10⁴ up to 10⁶. Figure 5 shows n_e, T_e and p_e profiles and the corresponding uncertainties obtained with MCMC. The red lines correspond to the mean and standard deviation of the posterior samples. The black lines show the maximum a posteriori (MAP) solution and corresponding uncertainty bands. The MAP solution is the profile which maximizes the posterior distribution and is found with an optimization algorithm, usually faster than MCMC. The uncertainty bands of the MAP solution are not, in this case, the Bayesian posterior, but are computed with the χ² method: this is defined as the local change in the profile necessary to obtain a deviation of the data residuals, the χ², equal to 1. The figure also shows uncertainties in the gradients of the profiles, computed from the MCMC samples. The second sampling method makes use of GP regression for the inference of interpolated and smooth ion temperature and velocity profiles from noisy data obtained from charge exchange recombination spectroscopy at ASDEX Upgrade.

Electron density and temperature profiles at JET and W7-AS. Fischer et al show in [3] a very detailed modelling of the TS diagnostic at W7-AS. The model includes a wide range of nuisance parameters corresponding to uncertain diagnostic features, such as laser power, scattering angle, TS cross section and polarization factors. The authors use Bayesian inference to infer the posterior distribution of electron density and temperature profiles obtained by marginalizing the nuisance parameters and therefore accounting for these additional uncertainties. Kwak et al show in [4] a Bayesian model of TS and multichannel IF at JET. The main targets of the inference are the electron density and temperature profiles. The authors use a GP to constrain the prior distribution of the possible profile shapes: the parameters of the GP describe the correlation between the values of the profiles at different locations, hence controlling their smoothness and introducing physically plausible regularization. The authors infer the posterior distribution of the joint system by marginalization over the GP hyperparameters, thus considering all possible parameter values allowed for that GP family. The joint system can also be used to infer the TS calibration factor without a Raman calibration. This factor can also be marginalized in the calculation of the final posterior distribution of the profiles. Figure 6 shows a simplified Minerva graphical model of the high resolution Thomson scattering (HRTS) system and far infrared (FIR) interferometer at JET.
Other applications. Numerous detailed models of ECE diagnostic measurements can be found in the literature: in [18], the authors infer temperature and density profiles simultaneously from multiple diagnostics at ASDEX Upgrade; in [20], the authors employ a Bayesian model of three ECE diagnostics at JET, including ray-tracing, Martin-Puplett IFs, a heterodyne radiometer and a reflectometer, to infer temperature and density profiles, discussing also local posterior correlations between the profiles; in [21], the authors describe the model of a general radiometer calibration that can be used to model the radiation temperature of the ECE of W7-X plasmas using a 3D ray tracing model; in [22], the authors apply Bayesian inference to investigate higher harmonics of the ECE obtained with a Michelson IF at W7-X. Langenberg et al [23] show a Bayesian forward model of an x-ray imaging crystal spectrometer system, which allows one to infer kinetic profiles and study impurity transport at W7-X. In [24], a Bayesian model of W7-X helium beam measurements is developed, which includes uncertainties of rate coefficients and allows one to infer electron density and temperature. In [25], Bayesian modelling is used to validate the consistency of the Grad-Shafranov force balance model. In [26], Bayesian inference of T_e and n_e profiles from TS measurements at W7-X is accelerated with a hardware implementation in a field-programmable gate array (FPGA). Appel et al [27] propose a novel description of the filamentary dynamics of JET plasmas based on the uncertainty in the filament rise times, obtained with Bayesian inference from the signals of a Langmuir probe. Effective ion charge profiles are inferred from a consistent interpretation of multiple independent measurements of bremsstrahlung and charge exchange diagnostics at ASDEX Upgrade [28]. Modelling the collisional-radiative processes allows the inference of the edge electron density profiles from beam emission diagnostics [29-31] as well as the electron distribution function [32]. Sciortino et al [33] use Bayesian inference with simulations of impurity transport to infer transport coefficients consistent with measurements of x-ray emission at Alcator C-Mod. They also show that Bayesian model selection allows them to infer the most appropriate parameterizations of transport coefficients and other nuisance parameters. Similarly, Chilenski et al [34] use Bayesian model selection to infer the appropriate level of complexity in the description of impurity profiles compatible with x-ray spectrometer data.

ML
A wide variety of computer algorithms fall under the term ML. Despite their differences, they share a common feature: they are not explicitly programmed to solve a specific task; instead, they start from an initially poor guess and automatically improve through experience, by trial and error [35]. Typically, an ML algorithm implements a statistical model with free parameters that can be adjusted to solve a training task. Training involves input data and a loss or cost function which assigns a score to the output of the algorithm. The objective of training is to adapt the values of the internal parameters in order to minimize the loss function.
As an example, consider a regression task where the output of the algorithm should match a target value. In this case, the error function can be chosen as the mean square error between output and target values, a rather common measure to quantify the performance of a fitting routine. Indeed, the training procedure can be regarded as a fitting routine where the model used to fit the data is a very flexible ML model. For example, a deep neural network (DNN) is a highly non-linear function which can be parameterized with millions of coefficients, or weights, and is able to fit very complex patterns in the data. Most of the time, the goal of training an ML algorithm is to use it to make new predictions given novel input data: for the predictions to be accurate, the algorithm must learn a general solution which is valid also for input data not seen during training, as long as it is not asked to extrapolate too far away from the training data. The ability to generalize is the main criterion used to evaluate an ML model. Generally speaking, ML models can be trained relatively easily to fit the training set, and the real challenge is avoiding overfitting and improving generalization. For this reason, DL has recently gained a lot of popularity: the ability of these algorithms to approximate very complex functions for high dimensional input and output data allows them to learn solutions to complex real-world problems [36]. Because a review of every kind of ML algorithm would be infeasible, and out of the scope of this article, our approach for surveying the subject is the following: we aim at providing the basics of the most trending ML methods, although our visitation of fusion applications also includes more traditional approaches, as we find them relevant to certain topics in this field. We start in section 4.1 by providing a slightly more formal introduction to the class of artificial neural network (ANN) algorithms, followed by section 4.2 describing convolutional neural networks (CNNs) and section 4.3 describing recurrent neural networks (RNNs). This should provide the reader with an understanding of the mathematical background and open issues relating to more sophisticated DL algorithms. We then describe in a more qualitative way the approach of physics-informed ML in section 4.5, which can be particularly interesting for fusion-related applications, and approaches dealing with uncertainty estimation, in section 5, relevant to any real-world application of ML.

ANNs
ANNs were introduced as a computational analogy to biological neural networks in the brain, aimed at solving the complex tasks of perception and cognition. In the 1960s, Rosenblatt wrote 'Principles of neurodynamics. Perceptrons and the theory of brain mechanisms' [37], where he introduced the multi-layer perceptron (MLP) architecture, the foundation of the more complex architectures of DL. The original perceptron model could only learn a linear discriminator function. For this reason, more complex architectures such as the MLP, which is made of multiple layers and non-linear basis functions, were later introduced. Several mathematical theorems, known as universal approximation theorems, establish the approximation capabilities of MLPs, as well as of other kinds of feedforward neural networks. An important result of this kind is provided by [38] in a publication titled 'Multilayer feedforward networks are universal approximators'-a title already disclosing the potential of MLPs. This early result highlights the importance of this kind of network and justifies our choice of introducing them now more formally. Figure 7 shows a typical representation of an MLP architecture where circles represent units and arrows represent the MLP learnable weights or parameters. The terminology regarding the number of hidden layers found in the literature is ambiguous: the MLP in the figure can be said to have three hidden layers of weights or two hidden layers of units. The figure also shows a four-dimensional input vector x = {x_i} and a two-dimensional output vector y = {y_l}. Such an MLP is a case of a fully connected (FC) feed-forward neural network because every unit in one layer is connected to every unit in the following layer, and all connections are in the forward direction, from input to output. Different connection patterns correspond to different network functions which can perform better on different kinds of problems. Some of the most successful examples covered in this review are CNNs (see section 4.2) and RNNs (see section 4.3). The history of discovery of different neural network architectures has often involved a mix of trial and error, intuitions arising from empirical findings in the attempt to solve a certain class of problems, and first-principled investigations driven by a higher-level understanding of the learning principles. The lack of solid first-principled approaches is a notorious theme in ML research, especially in the field of DL. Therefore, it should not be surprising if, in the development of new models, empirical approaches, i.e. trial and error, often come first, followed only later by theoretical interpretations explaining why things work the way they do.
The analytical expression of the neural network function corresponding to the feed-forward FC MLP of figure 7 is given in equation (9):

y_l(x, w) = g*( Σ_k w^(3)_lk g( Σ_j w^(2)_kj g( Σ_i w^(1)_ji x_i ) ) ),     (9)

where x represents the input vector and w the neural network weights. The input of a hidden unit is the linear combination of the weights and the outputs of the units of the previous layer: for example, for the first layer of units it can be written as:

z_j = Σ_i w_ji x_i.     (10)

The output of the corresponding unit can be written as g(z_j), where g denotes the unit activation function. The activation function is usually non-linear. At present, one very popular choice is the rectified linear unit:

g(z) = max(0, z),     (11)

and other popular choices are the hyperbolic tangent and the sigmoid function. The procedure is then repeated recursively up to the output layer units. Hidden layers and hidden units act on the input space by distorting it in a non-linear way so that the projection operated by the last layer can be mapped linearly into the target space [36]. We denoted with g* the activation functions of the units in the output layer to emphasize that they need not be the same function as used for the hidden units. Indeed, they are typically chosen according to the constraints of the problem. For regression problems, they are often taken to be the identity function, so that g*(z) = z. For classification problems, a common choice is the softmax activation function, such that the value of each output unit can be interpreted as the probability for the input to belong to that unit's class:

g*(z_l) = exp(z_l) / Σ_k exp(z_k).     (12)

In this way, each output unit value is bounded to the interval (0, 1) and the sum of all outputs is equal to 1. At training time, the neural network weights are learnt in order to minimize the mismatch between outputs and targets. For a regression task, for example, the mismatch can be measured with a loss function such as the mean-squared error in equation (13):

E(w) = (1/N) Σ_n ‖y(x_n, w) − t_n‖²,     (13)

where N is the number of training samples, x_n represents the nth input sample and t_n the corresponding target sample. Provided an initial random guess for the values of the neural network (NN) weights, their values are updated during training so as to minimize E with a gradient descent algorithm. The partial derivative of the error function with respect to the network weights, ∂E/∂w_ij, can be calculated for each layer of the network using the backpropagation formula [39], which simply uses the chain rule for partial derivatives to express the gradients at layer l with respect to those at layer l + 1, as shown in equation (14):

δ_j^(l) = g′(z_j^(l)) Σ_k w_kj^(l+1) δ_k^(l+1),   ∂E/∂w_ji^(l) = δ_j^(l) g(z_i^(l−1)),     (14)

where δ_j^(l) = ∂E/∂z_j^(l). The backpropagation algorithm is perhaps the most successful algorithm for training large networks and is widely used for all sorts of architectures, including those described below.
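To make equations (9)-(14) concrete, the following self-contained NumPy sketch (our own illustration, with arbitrary layer sizes and random data) implements the forward pass of a small MLP with ReLU hidden units and one backpropagation update of the mean-squared error; it is a pedagogical toy, not the implementation of any of the networks reviewed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: 4-dimensional inputs, 2-dimensional targets (as in figure 7).
X = rng.normal(size=(64, 4))
T = rng.normal(size=(64, 2))

# Two layers of weights (one hidden layer of 8 units); biases omitted for brevity.
W1 = rng.normal(scale=0.5, size=(4, 8))
W2 = rng.normal(scale=0.5, size=(8, 2))

relu = lambda z: np.maximum(0.0, z)          # equation (11)

def forward(X, W1, W2):
    Z1 = X @ W1                              # linear combination, equation (10)
    H1 = relu(Z1)
    Y = H1 @ W2                              # identity output activation (regression)
    return Z1, H1, Y

# One gradient-descent step on the mean-squared error, equation (13),
# with gradients obtained by the chain rule (backpropagation, equation (14)).
lr = 0.05
Z1, H1, Y = forward(X, W1, W2)
E = np.mean(np.sum((Y - T) ** 2, axis=1))

dY = 2.0 * (Y - T) / X.shape[0]              # dE/dY
dW2 = H1.T @ dY                              # gradient for the output-layer weights
dH1 = dY @ W2.T                              # propagate the error backwards
dZ1 = dH1 * (Z1 > 0)                         # ReLU derivative
dW1 = X.T @ dZ1                              # gradient for the first-layer weights

W1 -= lr * dW1
W2 -= lr * dW2
print("loss before update:", E)
```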

CNNs
The CNN architecture was designed ad hoc for an image recognition task [40]. The main feature of a CNN architecture is the convolutional layers (CLs). These layers implement weight sharing with convolution operations: each unit in a CL is connected only to a local region of the previous layer, known as its receptive field, and all units share the same weight values. This set of weights is known as a kernel or filter. The input to a unit in a CL is obtained by applying the kernel to the local region of the previous layer output. The kernel size, determining the number of connecting weights shared in a CL, is a parameter of the CNN algorithm. The application of the kernel corresponds to a linear combination of the weight values with the unit values in the local region, such as those shown in equation (9). The input to a subsequent unit in a CL is then obtained by sliding the filter onto another local region. The sliding stride is also a parameter of the CNN. The sliding stride and the kernel size together determine the size, i.e. the number of units, of a CL. Each CL can be made of multiple feature maps. Feature maps are computed by sliding a different kernel across the same output of the previous layer. The number of feature maps is also a parameter of the CNN. Using different feature maps allows each CL to learn multiple sets of shared weights from the same input data, therefore allowing it to learn multiple relevant features from the previous layer output. Feature maps are represented with different shades of blue in figure 18. A guide to the arithmetic of convolution for neural networks with helpful visualizations is provided by [41]. One visualization is also reported here in figure 8. It depicts the sliding of a 3 × 3 kernel over a 4 × 4 input using unit strides. The input to each green square unit z_j is then computed as a linear combination of the kind z_j = Σ_il w_il x_il, where w_il is the kernel, a 3 × 3 matrix of weight values indexed with i and l, and x_il is the corresponding receptive field. Weight sharing is useful for two reasons: (1) it allows for reducing the total number of connections in the network, therefore reducing the number of free parameters and allowing faster training with a smaller model less prone to overfitting; (2) the hierarchical internal representations are built upon recurrent features extracted from the layers below. The latter is a useful property when the input data present structures which are relevant to the solution of the task across different locations. In the case of a 2D image of a real-world object, basic low-level structures are, for example, the edges. A representation of a CNN is shown in figure 18. Such networks were used in nuclear fusion applications to reconstruct ion and electron temperature profiles from 2D x-ray spectral images at the Wendelstein 7-X experiment [42]. In this case, the input images present recurrent features in their 2D structure, such as the Doppler broadening of the emission lines directly related to the ion temperature. The architecture shown in figure 18 and used in [43] also presents subsampling or pooling layers. Pooling is used to reduce the size of feature maps by applying a function to a subregion of the map, such as taking the average or maximum value across the region, and then repeating the operation by sliding the pooling window across the entire feature map. The pooling size, i.e. the size of the feature map subregion where pooling is applied, and the pooling stride are also parameters of the network.
The role of pooling is to reduce the overall number of network parameters and obtain representations that are invariant with respect to small translations of the input [43, 44]. The CNN architecture considered here also shows a cascade of multiple CLs. The effect of a cascade of layers can be thought of in this way: at each successive layer, features of the input data which are less relevant to the final task are filtered out. This intuitive explanation has been investigated experimentally in [45] and a mathematical interpretation has also been proposed [46]. The architecture in figure 18 also shows two final FC layers, as they are used in [43]. Connecting CLs to FC layers is common practice and serves the purpose of projecting the representation into the desired output space [46]. Finally, CNNs make use of operations which can be implemented very efficiently on GPU cards and specialized hardware, allowing parallel computation and therefore reducing the overall training time.
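The kernel-sliding arithmetic described above (figure 8: a 3 × 3 kernel over a 4 × 4 input with unit stride) can be written out explicitly; the short sketch below is a naive NumPy illustration of a single convolutional feature map followed by 2 × 2 max pooling, not an efficient or library implementation.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid convolution of a 2D input with a 2D kernel (no padding, no flipping)."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # z_j = sum_il w_il x_il
    return out

def max_pool(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)          # the 4x4 input of figure 8
k = np.ones((3, 3)) / 9.0                  # a single 3x3 kernel (one feature map)
fmap = conv2d(x, k)                        # -> 2x2 feature map, as in the figure
print(fmap)
print(max_pool(fmap, size=2, stride=2))    # pooling reduces it to a single value
```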

RNNs
The term RNN refers to a type of network suited to modelling the generation of sequences, such as data occurring in a time series. RNNs are designed to have a memory of previous time steps: recent events are stored as activations and used as inputs for the generation of the next time step through feedback connections [47]. In this sense, they are different from a standard feedforward neural network, although it is possible to achieve the same behavior with an equivalent feedforward network [39]. Figure 9 represents an RNN with a feedback loop on the left-hand side and the equivalent feedforward representation on the right-hand side. The repetition of the feedback loop for t time steps is equivalent to the replication of a feedforward network block t times, where at each time step the network block receives as input both the input vector x_t and the activation from the previous time step, and the block weights are shared across time. The internal structure of the block can be as simple as a single layer with weights and non-linear activation functions, as in the MLP architecture, or a more complex one. Simple single-layer blocks have fundamental limitations in their ability to learn longer-term influences [48, 49]. The limitation is related to a problem known as the vanishing or exploding error gradient: the weight adjustments computed with back-propagation tend to very small or oscillating values as they are propagated backwards through the network, preventing the learning of long-distance references. More complex structures can alleviate this problem: one such architecture is the long short-term memory (LSTM) RNN [47]. The fundamental unit of an LSTM is not a simple activation; it also comprises a state: a value which is passed on recursively at each time step with relatively small perturbations, therefore allowing it to memorize longer-range references. The state value is modified using weights learnt during training and it is used in the computation of the unit output. In this way, the network units output an internal state and an output value: the former flows relatively unchanged through time steps, while the latter is computed by applying a non-linear function to the state. Both state and output values are used as input to the unit computing the following time step, in addition to the time-specific input vector. A bound on the computations carried out by each unit prevents vanishing and oscillating behavior.
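As a minimal illustration of the recurrence described above (our own sketch with arbitrary dimensions, showing a simple single-layer block rather than an LSTM, and not any network used in the reviewed work), the following loop unrolls a recurrent block over a short input sequence, reusing the same weights at every time step.

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_hidden, t_steps = 3, 5, 10
W_in = rng.normal(scale=0.3, size=(n_in, n_hidden))       # input-to-hidden weights
W_rec = rng.normal(scale=0.3, size=(n_hidden, n_hidden))  # hidden-to-hidden (feedback) weights

x_seq = rng.normal(size=(t_steps, n_in))                   # a toy input time series
h = np.zeros(n_hidden)                                      # memory state at t = 0

outputs = []
for x_t in x_seq:
    # The same weights are shared across time; h carries information from past steps.
    h = np.tanh(x_t @ W_in + h @ W_rec)
    outputs.append(h.copy())

print(np.array(outputs).shape)   # (t_steps, n_hidden)
```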

Reinforcement learning (RL)
To be precise, the term RL does not refer to a network architecture, but to a paradigm for learning the optimal interaction between an agent and an environment, aimed at achieving some prescribed results. An extensive introduction to RL is provided in [50], whereas a more concise review can be found in [51]. Here we will give a high-level overview. The goal of RL is teaching a goal-oriented agent to perform appropriate actions to operate successfully in an environment. During training, the agent learns a policy of actions it executes in the environment. The environment's response is summarized by the state, which determines the reward of that action in relation to the defined goals. The core idea is as follows: at time step t, the agent performs action a_t. In reaction, the environment produces state s_t, from which the reward

r_t = r(s_t, a_t)     (15)

is computed. The transition from state s_{t−1} to state s_t due to action a_t is governed by the, typically unknown, environment dynamics f:

s_t = f(s_{t−1}, a_t, ϵ).     (16)

Afterwards, the current state and reward, s_t and r_t, are input to the agent, which determines the action for the following time step, a_{t+1}, according to the policy π:

a_{t+1} = π(s_t).     (17)

Notably, the agent's objective is not learning a policy for maximization of the immediate reward, i.e. the reward r_t obtained at every time step. Instead, the optimal policy should maximize the long-run expected cumulative reward, also called value, often computed as:

V = E_ϵ [ Σ_t r_t ],     (18)

where the expectation is taken over the noise ϵ possibly present in the environment dynamics. Figure 11 shows a schematic of an RL agent applied to a control task at the KSTAR tokamak. We will discuss this application in further detail later on. One challenge in RL is the collection of data by interaction with the environment, to learn the long-term value of possible action-state configurations. Different approaches exist for collecting and learning from the data. They fall into two major categories: model-based and model-free approaches.
Model-based approaches assume knowledge of a model of the environment: this can be learned from data collected by the agent while exploring the space of possible (s_t, a_t) trajectories. The model is then used to search for the optimal policy which maximizes the cumulative reward. The theoretical framework to do this is known as dynamic programming. Alternatively, model-free approaches attempt to estimate the optimal policy and value function directly from the experience of (s, a) pairs. The experience can occur in the form of interaction with the environment or a simulation. An example of this approach makes use of Monte Carlo methods: a starting pair (s, a) is randomly sampled, together with a policy π. The policy is then followed until a terminal state. After visiting all configurations available under π and storing the corresponding rewards, an estimate of the action-value function q can be made by averaging across them and, correspondingly, an approximate optimal policy can be estimated. The process is then repeated until an accurate estimate of the optimal policy is obtained. In other cases, the optimal policy is expressed in a parameterized form π_θ, and the data collected by interacting with the environment are used to update the parameter values according to an update rule. These methods are known as policy gradient methods. Recent advances in RL also make use of DL models, such as the deep MLP and CNN architectures described in this section. The DL models are used to learn functions from the collected data (a_t, s_t), such as the environment dynamics f_θ, the value functions q_θ and the policy π_θ, where θ now refers to the parameters of the DL model. In this case, the RL approach is known as deep RL. Deep RL has been used in fusion research and we will discuss a few example applications of this kind in section 6.1.
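The Monte Carlo idea sketched above can be written compactly for a tiny tabular problem; the toy environment, the random policy, the discount factor and all dimensions below are invented for illustration and do not correspond to any controller discussed in this review.

```python
import numpy as np

rng = np.random.default_rng(2)

n_states, n_actions, gamma = 5, 2, 0.95

def step(s, a):
    """Hypothetical toy environment dynamics and reward (in the spirit of (15)-(16))."""
    s_next = (s + (1 if a == 1 else -1)) % n_states
    r = 1.0 if s_next == 0 else 0.0
    return s_next, r

# Monte Carlo evaluation of the action-value function q(s, a) under a random policy.
q = np.zeros((n_states, n_actions))
counts = np.zeros_like(q)

for episode in range(2000):
    s, a = rng.integers(n_states), rng.integers(n_actions)   # random starting pair
    s0, a0 = s, a
    rewards = []
    for t in range(20):                                       # roll the policy forward
        s, r = step(s, a)
        rewards.append(r)
        a = rng.integers(n_actions)                           # random policy pi
    # Discounted return from the starting pair (a simple value estimate, cf. (18)).
    g = sum(gamma**t * r for t, r in enumerate(rewards))
    counts[s0, a0] += 1
    q[s0, a0] += (g - q[s0, a0]) / counts[s0, a0]             # running average

greedy_policy = q.argmax(axis=1)                              # improved policy from q
print(q.round(2), greedy_policy)
```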

Physics-informed ML
Physics-informed ML [52] attempts to train neural network models by introducing physics constraints. The conventional way of training a network on samples does not guarantee that the output is consistent with the physics laws which might underlie the distribution of the training data. There are different ways to embed physics information in ML. In [52], they are divided into three different categories: observational, inductive and learning biases. Observational bias simply uses training data from a distribution of physically meaningful samples. This is a weak approach to embedding physics knowledge into the model, because there are no guarantees that the output will be physically meaningful when the network is evaluated on out-of-sample input data. Inductive biases correspond to enforcing the physics knowledge in the network architecture. In other words, the network functional form is carefully chosen to represent a physics-consistent solution by construction. CNNs, for example, are designed to respect invariance along certain groups of symmetries [53]. In other cases [52], the activation functions, weights, biases and composition operations applied to the network's units are chosen so that the resulting composite neural network function is equivalent to the solution of a certain partial differential equation (PDE). Learning biases embed physics information in the construction of the training loss function. For example, physics-informed neural networks (PINNs) [54] use a loss function which includes a penalty term corresponding to a PDE, in addition to the traditional mean-squared error, so that the output matches the training targets and at the same time is a solution to the PDE:

L(w) = w_mse L_mse + w_pde L_pde.     (19)

Equation (19) shows the loss function for a PINN trained to solve a PDE of the generic form

∂u/∂t + N[u] = 0,     (20)

where the network function is used to parameterize u = f_nn(x, t; w), and

L_mse = (1/N_b) Σ_i |f_nn(x_b^i, t_b^i; w) − u_b^i|²     (21)

and

L_pde = (1/N_c) Σ_i |∂f_nn/∂t + N[f_nn]|² evaluated at (x_c^i, t_c^i)     (22)

are the typical mean-squared error loss functions. The training data provide domain samples for the initial and boundary conditions, (x_b^i, t_b^i), and the collocation points, (x_c^i, t_c^i). The weights w_pde and w_mse in equation (19) balance the interplay between the PDE and mean-squared error terms in the loss function. One advantage of using a PINN to solve a PDE is that derivatives can be calculated very efficiently with automatic differentiation, and the network output is driven towards satisfying the PDE introduced in the loss function.
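The composite loss of equation (19) can be sketched without committing to a specific DL framework; in the toy example below, a small NumPy function of (x, t) stands in for the network, the PDE residual of a linear advection equation u_t + c u_x = 0 is approximated with finite differences instead of automatic differentiation, and all data, features and weights are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny stand-in for f_nn(x, t; w): a linear model on fixed random features.
rng_w = [rng.normal(size=16), rng.normal(size=16), rng.normal(size=16)]
feats = lambda x, t: np.tanh(np.outer(x, rng_w[0]) + np.outer(t, rng_w[1]) + rng_w[2])
w = rng.normal(scale=0.1, size=16)
f_nn = lambda x, t, w: feats(x, t) @ w

c, eps = 1.0, 1e-4                 # advection speed; finite-difference step
w_mse, w_pde = 1.0, 0.1            # loss weights of equation (19)

# Boundary/initial-condition data (x_b, t_b, u_b) and collocation points (x_c, t_c).
x_b, t_b = rng.uniform(0, 1, 32), np.zeros(32)
u_b = np.sin(2 * np.pi * x_b)
x_c, t_c = rng.uniform(0, 1, 128), rng.uniform(0, 1, 128)

def loss(w):
    # L_mse: match the boundary/initial data (equation (21)).
    l_mse = np.mean((f_nn(x_b, t_b, w) - u_b) ** 2)
    # L_pde: residual of u_t + c u_x at the collocation points (equation (22)),
    # with derivatives approximated by central finite differences.
    u_t = (f_nn(x_c, t_c + eps, w) - f_nn(x_c, t_c - eps, w)) / (2 * eps)
    u_x = (f_nn(x_c + eps, t_c, w) - f_nn(x_c - eps, t_c, w)) / (2 * eps)
    l_pde = np.mean((u_t + c * u_x) ** 2)
    return w_mse * l_mse + w_pde * l_pde

print("initial composite loss:", loss(w))
```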

Bayesian inference for neural networks
UQ is a crucial and often neglected topic in neural network applications. Notoriously, neural networks provide black-box models, in the sense that the function encoded by the network cannot be easily interpreted and its output is of dubious significance for out-of-sample input data. UQ allows for quantifying the degree of confidence related to the network output. This is very desirable for real-world applications, such as those of nuclear fusion, where networks are trained and meant to be used for very practical tasks, such as plasma control [55] and prediction of disruptive events [56]. UQ is also needed to propagate uncertainties from a NN to subsequent analysis: in cases where the output of the network is a plasma parameter obtained from diagnostic data, it can also be the input to modelling codes. Bayesian inference, as a framework for learning, can be used in this context under the name of Bayesian neural networks (BNNs) [57, 58]. The target of the inference is the posterior of the network's weights:

p(w|D) = p(D|w) p(w) / p(D),     (23)

where D represents the observed training data set samples and p(w) the prior distribution of the network's weights. The posterior distribution of the weights determines the uncertainties of the network output for a novel input x. For example, the posterior predictive uncertainties are obtained by marginalizing over all possible values of the weights w, as in equation (24):

p(y|x, D) = ∫ p(y|x, w) p(w|D) dw.     (24)

The expected value and variance of the neural network regression f_nn can be calculated in a similar way as the first and second moments of f_nn under p(w|D). In general, the posterior p(w|D) is not tractable and it needs to be estimated with approximations. Below we review some recent approaches to develop these approximations.
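In practice, equation (24) is almost always approximated by Monte Carlo: draw weight samples from (an approximation of) p(w|D), evaluate the network for each sample and summarize the spread. The sketch below assumes such weight samples are already available (here they are simply invented, and the network is a placeholder function) and is only meant to illustrate the marginalization step.

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend these are samples from an approximate weight posterior p(w|D)
# (e.g. produced by the Laplace approximation, MC dropout or MCMC).
n_samples, n_weights = 500, 20
w_samples = rng.normal(size=(n_samples, n_weights))

# A toy "network output" as a function of input x and weights w (placeholder only).
def f_nn(x, w):
    return np.tanh(x * w[:10]).sum() + 0.1 * w[10:].sum()

x_new = 0.7
outputs = np.array([f_nn(x_new, w) for w in w_samples])

# Monte Carlo estimate of the posterior predictive mean and uncertainty (equation (24)).
print("predictive mean:", outputs.mean())
print("predictive std :", outputs.std())
```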
Laplace approximation (LA). The LA is an established approach [58]. It approximates the posterior with a Gaussian distribution centered at the weight values found with training, w_MP, and with the covariance matrix given by the inverse of the Hessian matrix A of the loss function with respect to the network weights. The choice of a Gaussian prior of the kind

p(w) ∝ exp( −(α/2) Σ_i w_i² )     (25)

and a Gaussian likelihood term

p(D|w) ∝ exp( −(β/2) Σ_n (y(x_n; w) − t_n)² )     (26)

leads to the loss function in equation (27), where the parameters α and β are hyperparameters to be optimized for, and the second summation runs over the entire set of network weights:

E(w) = (β/2) Σ_n (y(x_n; w) − t_n)² + (α/2) Σ_i w_i².     (27)

The parameters α and β balance the interplay of the two terms in the loss function. The second term, arising from the prior distribution, can be regarded as a regularizing term favoring small values of the weights. Such regularization is also known as L2 regularization, discovered independently of the BNN framework to penalize large weight values and reduce overfitting. By substituting the above expressions in equation (24), and taking the linear expansion of y around w_MP, it is possible to write an analytical expression for the output error bar:

σ_y² = 1/β + gᵀ A⁻¹ g,     (28)

where

g = ∇_w y|_{w_MP},   A = ∇_w ∇_w E(w)|_{w_MP}.     (29)

An alternative for finding predictive uncertainties, while avoiding the first-order approximation of y, consists of computing the integral in equation (24) with a Monte Carlo approach, i.e. by drawing weight values from the approximated posterior:

w_i ∼ N(w_MP, A⁻¹),     (30)

and computing for each of them independently the corresponding network output y_i, leading to a realization from the distribution p(y|x, D). The LA requires the calculation of the Hessian matrix which, for large networks, involves computing second derivatives with respect to hundreds of thousands of weights; practical implementations often consider diagonal or other approximations [59]. The number of simplifying approximations involved in the LA procedure, and the fact that in real-world problems the weight posterior distribution is not well approximated by a Gaussian, are the main limitations of this approach.

Monte Carlo dropout (MC dropout). An alternative, known as MC dropout [60], is based on variational inference, an established approximate Bayesian inference method [61]. It is based on an extension of dropout training, a technique used to prevent neural networks from overfitting [62]. Intuitively, dropout works as follows: at each training step, each network unit is multiplied by a random value drawn from a Bernoulli distribution with probability p, z ∼ Bernoulli(p). The end result is that some units, along with their weight connections, are dropped with probability p. The effect is that, at each training step, a slightly different network is used to compute the output, and the final predictions can be regarded as obtained by averaging across all the different network realizations, leading to a smoother function. MC dropout casts a mathematical interpretation onto traditional dropout, showing that minimization of the training loss function with dropout corresponds to the minimization of the Kullback-Leibler (KL) divergence between the true posterior distribution of the weights and an approximating distribution q_θ(w), as in equation (31):

KL( q_θ(w) ‖ p(w|D) ) = ∫ q_θ(w) log [ q_θ(w) / p(w|D) ] dw.     (31)

The KL divergence is a measure of the dissimilarity between two distributions, and q_θ(w) is derived from the distribution used to drop the units and weights. In the equation, w indicates the weight random variable and θ the deterministic parameterization of the network. The minimization of the KL divergence is performed with respect to θ.
The main advantage of MC dropout is that it allows obtaining predictive uncertainties from the network without any modification to the standard procedure of dropout training, an approach widely available in many ML frameworks. Predictive uncertainties can be obtained with standard and fast feedforward passes through the network.

Stochastic gradients. Stochastic gradient Langevin dynamics [63] relies on an MCMC method for estimating the posterior p(w|D). It is based on a modification of the update step in the back-propagation rule of equation (14), for the loss function in equation (27). The formula is shown in equation (32), where ϵ is a step size and η is Gaussian noise:

Δw = (ϵ/2) ∇_w [ log p(w) + log p(D|w) ] + η,   η ∼ N(0, ϵ).     (32)

Welling and Teh [63] show that the stochastic optimization process follows Langevin dynamics converging to the correct posterior distribution of the weights. The main disadvantage related to this approach is the memory footprint it requires, as is common for MCMC methods.

Deep ensembles. The deep ensembles [64] approach is different as it is not based on BNNs: predictive uncertainties, in this case, do not arise from an approximate reconstruction of the weight posterior; instead, they are computed from the statistics of the predictions of an ensemble of networks. According to [64], the uncertainty quality of deep ensembles can outperform that achieved by using MC dropout on some test cases. The training recipe is the following: (i) Choose a loss function to estimate the conditional distribution of the output p_θ(y|x). For regression problems, this can be achieved by assuming a Gaussian distribution for the observed values and interpreting the network output as the mean and variance. The negative log-likelihood −log p_θ(y|x) can then be used as a loss function. Modelling the variance at the output is simple in the case of 1D output because it corresponds to a single scalar value; the extension to multidimensional output values is not obvious, and a possible way to address this is by using a mixture density network model [65]. (ii) Independently train each of the networks in the ensemble and aggregate the predictions in a manner consistent with the chosen training objective. In the case of the negative log-likelihood, the final output is the mean and standard deviation of a Gaussian mixture, computed with standard textbook formulas (a minimal sketch of this aggregation step is given below).
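As a hedged illustration of the ensemble aggregation in step (ii) (our own sketch; the member predictions are random placeholders rather than the outputs of trained networks), the mean and variance of an equally weighted Gaussian mixture are combined as follows.

```python
import numpy as np

rng = np.random.default_rng(5)

# Pretend each of M ensemble members outputs a predictive mean and variance
# for the same input x (in practice these come from M independently trained networks).
M = 5
means = rng.normal(loc=2.0, scale=0.3, size=M)
variances = rng.uniform(0.05, 0.2, size=M)

# Equally weighted Gaussian mixture: textbook formulas for its mean and variance.
mix_mean = means.mean()
mix_var = (variances + means**2).mean() - mix_mean**2

print("ensemble mean: %.3f" % mix_mean)
print("ensemble std : %.3f" % np.sqrt(mix_var))
```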
Finally, their results on classification datasets show that the ensemble statistics can be used to detect unreliable predictions, for which the network output can then, for example, be discarded. They also show that predictive uncertainties are larger for out-of-sample input data compared to known input data.

Active learning. Besides allowing the identification of overconfident predictions of a DL model, UQ also provides a framework for active learning, a paradigm for defining which data are most informative for training and for learning from small amounts of data. This can be useful in settings where the collection of training data is expensive: for example, when it requires performing new experiments or running computationally expensive simulations. The idea is that training data are collected with a larger priority in the regions of input space corresponding to larger predictive uncertainties, so as to increase the model accuracy. Informative areas in input space are identified with an acquisition function. This function can be based on the estimation of the predictive variance or on information measures [66, 67]. Depeweg et al [68] also show that BNNs allow distinguishing between two contributions to the uncertainties: epistemic uncertainties, arising from the weight posterior distribution p(w|D), can be reduced by the collection of new training data; aleatoric uncertainties, arising from intrinsic noise present in the data, cannot be reduced by collecting more samples. This allows more efficient data collection, targeted at the reduction of epistemic uncertainties. Their work also shows a promising application to risk-sensitive RL, where an agent learns an optimal control policy within a real or simulated environment and takes into account risk factors estimated with the BNN uncertainties. In this way, the agent is able to learn a policy which optimizes reward while at the same time mitigating risks from both epistemic and aleatoric uncertainty. By accounting for both contributions, the agent can optimally reduce the epistemic component. Risk-sensitive and, in general, safety-constraint-aware RL applications are going to have significant relevance in fusion research: they allow the automation of complex control systems, such as those required for fusion device control, while carefully accounting for machine safety.
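A hedged sketch of the variance-based acquisition idea mentioned above: given predictive uncertainties on a pool of candidate inputs (here random placeholders standing in for BNN or ensemble outputs), the next training points are simply the most uncertain ones.

```python
import numpy as np

rng = np.random.default_rng(6)

# Candidate input points not yet in the training set, and a stand-in for the
# model's predictive standard deviation at each of them (e.g. from a BNN or ensemble).
candidates = np.linspace(0.0, 1.0, 200)
predictive_std = 0.1 + 0.5 * np.abs(np.sin(3 * candidates)) + 0.05 * rng.random(200)

# Variance-based acquisition: query the points where the model is most uncertain.
n_query = 5
query_idx = np.argsort(predictive_std)[-n_query:]
print("next inputs to label/simulate:", np.sort(candidates[query_idx]))
```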
Several other approaches to UQ can be found in the ML literature: a review covering practical aspects can be found in [69], whereas [70] provides a survey oriented towards scientific ML. Surveys comparing Bayesian and non-Bayesian approaches to UQ in DL can also be found in [71,72], which show that, in general, Bayesian approaches provide better-calibrated uncertainties, i.e. uncertainties that scale properly for data affected by a domain shift or for out-of-sample data. Benchmarks with code pipelines are also demonstrated and available at [73]. In general, the ML community provides several valuable research directions, with the limitation that such applications are mostly demonstrated on standard ML benchmark datasets. The application of such methods to nuclear fusion data is therefore an opportunity for further research work, which can be properly exploited by practitioners with domain-specific knowledge.

ML for fusion
There is an increasing volume of published research work on the application of ML to nuclear fusion. This indicates that ML can provide innovative and useful solutions to research questions that are hard to address with more conventional tools. The following sections are dedicated to applications of ML for plasma control, disruption prediction, plasma parameter inference from diagnostic data, and surrogate models of theory-based simulation codes. These topics are not truly independent: ML methods developed to solve one of them can be used in the context of a different one. For example, a plasma control system can make use of different ML components: DNNs for the reconstruction of the plasma parameters from diagnostic measurements, as well as fast surrogate models of complex modelling tools to predict the plasma response rapidly enough to enable real-time control. These research questions are good candidates for ML methods mainly for two reasons: either they concern physics phenomena for which a detailed, first-principles model is missing, or they concern time-sensitive applications that benefit from the efficient execution of ML algorithms, especially in the case of DL. The list of work reviewed here is not complete, but it should provide examples of a good part of the questions in nuclear fusion where ML can be useful. Possibly, this will serve as inspiration for continuing with such a promising approach.

Tokamak plasma control
An early MLP at COMPASS. The accurate control of the position and shape of the plasma is one of the main challenges of modern nuclear fusion experiments, especially in tokamak devices. Because existing models of plasma dynamics are limited in their predictive power, feedback systems are used to provide reliable and efficient automatic control of the plasma. Typically, they use measurements from magnetic diagnostics, possibly in conjunction with temperature and density signals, to estimate in real time the plasma boundary position and shape. The mismatch between the desired position and shape and the current estimate determines how to adjust the actuators, such as the voltage of the control coils, according to a linear model of the plasma response [74]. For this to work, the reconstruction of the plasma boundary shape must occur in a time-sensitive manner, typically ≤100 µs. For this reason, ML algorithms can play a role here: early applications of neural network models date back to the nineties [75-77], making them some of the very first applications of NNs in fusion. Bishop et al [75] integrated an NN in the control system at the COMPASS tokamak as shown in figure 10. The neural network, trained on a database of 2000 equilibrium solutions of the Grad-Shafranov equation found with the EFIT code, takes as input 16 scalar magnetic diagnostic signals and outputs the vertical and radial position of the plasma center and the elongation. The error between the NN output and the desired target position and shape is fed to a linear controller which can then adapt the current flowing in the control coils. The authors show that the MLP does not assume a specific functional form for the mapping of magnetic diagnostic data to geometrical parameters, and for this reason, when evaluated on a wide variety of equilibrium configurations, it performs better than the existing approaches. The MLP was also implemented in hardware, with a flexible system allowing changes in the network architecture. Despite the increase in flexibility in the functional representation of the equilibrium mapping provided by a simple single-layer perceptron, the control system still involves a linear feedback controller to represent the plasma response.

Deep RL at TCV. Degrave et al [55] show that deep RL can provide even more flexibility to control the plasma at the TCV experiment. RL makes it possible to generate non-linear feedback controllers and can significantly simplify the overall control system. The control actions of the RL agent regulate the voltages of the power supplies of the full set of control coils. The agent learns the control policy by interacting with a simulated environment which models the dynamic state of the plasma shape and current with a forward Grad-Shafranov solver, the magnetic diagnostic sensors, the coil currents, and the controller dynamics [78]. The simulations use varying plasma parameters such as plasma resistivity, normalized plasma pressure and safety factor. The control targets are specified at a high level in terms of plasma shape (such as elongation, triangularity, etc), position and current.
Two different DL architectures are necessary to learn the final control policy: a deep MLP parameterizes the actual control policy, determining the next action a_{t+1} from the current plasma state s_t and the previous action a_t, and an RNN learns the Q-value function by executing a fixed policy on the simulated environment and collecting the experience of the actual reward r for each (a_t, s_t) configuration visited. The reward is computed from the mismatch between the actual plasma position, shape and current and the target ones, and it penalizes undesired plasma conditions, such as unsafe ones. The training occurs in two phases: in the policy evaluation phase, the RNN learns the Q^{π_k}-value function from simulated data collected with the fixed policy π_k; in the policy improvement phase, Q^{π_k} is used to estimate a better policy π_{k+1}. The process is iterated until convergence. This approach is known as the actor-critic algorithm [79], where the actor refers to the MLP network learning the actual policy and the critic to the RNN network learning the Q-value function (a structural sketch of this split is given at the end of this subsection). The RNN is a large network suited to learning complex temporal dynamics, but it is also more computationally expensive; the MLP network is smaller and computationally efficient. Only the latter is deployed for control of the power supply voltages, so as to satisfy the strict control time requirements. The authors demonstrated the control capabilities in a few scenarios: a plasma with low elongation is handed over to the RL agent, which then increasingly elongates the plasma while containing the vertical-instability growth rate; in another scenario, it stabilizes plasma position and current with neutral beam injection (NBI) heating; finally, it realizes negative triangularity and snowflake configurations, demonstrating accurate time-varying control of the X-point locations. In conclusion, the RL algorithm simplifies the overall procedure: the conventional system requires extensive manual intervention, for example for the offline feedforward generation of the plasma discharge. The RL system, instead, learns from simulations autonomously and receives as input the time-varying desiderata of plasma shapes and currents. Moreover, the method is of general validity and certainly relevant for different tokamak devices.

Deep RL for T_e and q profiles for DEMO. Deep RL has demonstrated the possibility of adjusting the electron heating power in order to control the electron temperature T_e and safety factor q profiles in a simulated environment for a DEMO tokamak [80]. The simulation models the time-dependent turbulent transport with prescribed density profiles. The authors show that the RL agent can learn to control q and T_e profiles during the current ramp-up phase to minimize the poloidal flux consumption of the central solenoid, which allows designing tokamaks with smaller central solenoids. They also show that the system can learn to control the optimal T_e profiles for varying Z_eff profiles, even without a direct measurement of Z_eff. They manage this by providing the agent with information about the loop voltage on the magnetic axis at the current and previous time steps, which in turn depends on Z_eff. This allows the RL agent to learn the dependence of the T_e profiles on Z_eff.
Deep RL for β_N at KSTAR. The application described in [81] makes use of DL for multiple components of the system: an actor-critic RL approach uses MLP neural networks to model the policy and Q-value function from data collected by interaction with a simulated environment of the plasma response. The environment is also modelled with ML: a long short-term memory (LSTM) network is trained on 5 years of experimental data, providing a more flexible and efficient prediction of the plasma state compared to conventional physics-based modelling codes. In order to improve generalization, the authors use an ensemble of networks for the final prediction. This plasma simulator takes as input the control variables: plasma current, shape and position, NBI and electron heating power, and it predicts the plasma state variables, among which is β_N. The reward is computed as the difference between the target and observed β_N values. The RL system, shown in figure 11, is used to plan plasma discharges achieving values of β_N up to 3.5, near the stability boundary of KSTAR. The control demonstrations in the presence of external NBI require a high confinement enhancement, which the agent is capable of achieving with accurate control of the plasma shape. The authors also show a limitation of the training procedure: approaching β_N = 3.5 triggers magnetohydrodynamic (MHD) instabilities which the system cannot handle. The limitation arises from the fact that the training data only included 100 ms-averaged global quantities, whereas the precursors of the instabilities evolve over shorter time scales.
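The following sketch illustrates the actor-critic division of roles described above (in PyTorch, with hypothetical network sizes); it shows the structural split only, not the actual architecture or training algorithm of [55] or [81].

```python
# Hedged, structural sketch of the actor-critic split: a small MLP actor is cheap
# enough for the real-time control loop, while a larger recurrent critic is used
# only during training to estimate the Q-value of whole trajectories.
import torch
import torch.nn as nn

class Actor(nn.Module):                      # deployed for real-time control
    def __init__(self, n_state, n_action, n_hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_state + n_action, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_action), nn.Tanh())

    def forward(self, state, prev_action):
        # next action a_{t+1} from current state s_t and previous action a_t
        return self.net(torch.cat([state, prev_action], dim=-1))

class Critic(nn.Module):                     # used only during training
    def __init__(self, n_state, n_action, n_hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_state + n_action, n_hidden, batch_first=True)
        self.q = nn.Linear(n_hidden, 1)

    def forward(self, states, actions):
        # Q-value estimate along a whole (s_t, a_t) trajectory
        h, _ = self.rnn(torch.cat([states, actions], dim=-1))
        return self.q(h)
```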

Disruption prediction
Disruptions are harmful events which can occur in tokamaks and cause substantial damage to the device. Forces arising from current and pressure gradients can destabilize the plasma and lead to MHD instabilities. Disruptive phenomena typically include vertical displacement events and thermal quenches which can lead to high plasma resistivity and, in turn, cause a current quench producing highly energetic runaway electrons due to drastic changes in the toroidal electric field [82]. If not controlled, such events may lead to high thermal and electromagnetic loads on the device components, for example through the formation of eddy currents in conducting structures such as the blanket and vessel. Disruption avoidance is one fundamental open problem in fusion research, and it is especially relevant for large devices, such as ITER, where disruptions can cause serious damage to the device [82,83]. This topic is linked to the control problem: ideally, a sophisticated control system would make use of a disruption prediction component and would initiate an appropriate response to avoid the possibility of melting or damaging the plasma-facing materials. For example, a disruption mitigation system could trigger the injection of high-Z impurities to increase radiation and consequently reduce the thermal loads. A prediction system should trigger an alarm within strict time requirements to allow effective mitigation. For example, in the case of ITER, the warning time is estimated at around 30 ms: this is the expected response time for the shattered pellet injection system to inject high-Z impurities, such as neon or argon, into the plasma [56]. Several years of research have made it possible to identify distinctive features of these events, as well as their precursors, which can be used to predict their occurrence with a useful advance warning. Figure 12 shows the qualitative temporal trend of some precursors up to the disruption time. A mitigation system can be triggered either by detection or by prediction of such events. Detection is, in general, more straightforward: a system based on a threshold of the plasma current derivative, for example, can trigger mitigation early enough. Therefore, detection provides a reliable and robust fallback alternative in case prediction fails. On the other hand, prediction allows more time for triggering a response. This might be particularly desirable for a device like ITER, which can tolerate only a very small number of disruptive events. Unfortunately, the complexity of the physics of disruptions makes first-principles predictive models difficult to develop and impractical because of their inefficiency. Therefore, data-driven models which can learn complex patterns leading to disruptions from experimental data have a high chance of making good predictions. The lack of causal models also implies that it is hard to say which signals are the best predictors of a disruption: there is general consensus on a few of them but, overall, different ML models pick up different signals to make their predictions; table 1 reports a summary of the signals most commonly found in the literature. Moreover, a desirable requirement for such models is to be able to yield predictions for devices on which they were not trained. Indeed, one of the main applications driving this research is the development of a predictive system to safeguard the operation of ITER experiments.
In order to achieve this, it is necessary that input quantities are normalized to be dimensionless and have a similar numerical scale across devices. As exemplified by the studies reviewed below, different normalization procedures can be useful. Two examples are: using the ratio of the electron density to the Greenwald density, and the ratio of the radiated power to the input power. Another requirement for the ML models is to learn efficiently from a relatively scarce amount of training data: this is particularly true for ITER, where the number of disruptive events must be kept at a bare minimum and a corresponding database will be particularly scant.
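As an illustration, the two normalizations mentioned above can be computed as follows; the function and argument names are hypothetical, with the Greenwald density expressed in units of 10^20 m^-3 given the plasma current in MA and the minor radius in m.

```python
# Hedged sketch of two device-independent input normalizations for disruption
# predictors: the Greenwald density fraction and the radiated power fraction.
import numpy as np

def greenwald_fraction(n_e, i_p, a_minor):
    """n_e: line-averaged electron density [10^20 m^-3], i_p: plasma current [MA],
    a_minor: minor radius [m]."""
    n_greenwald = i_p / (np.pi * a_minor ** 2)   # Greenwald density limit [10^20 m^-3]
    return n_e / n_greenwald

def radiated_fraction(p_rad, p_input):
    """Radiated power normalized to the total input power (same units for both)."""
    return p_rad / p_input
```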

DL across JET and DIII-D.
Kates-Harbeck et al [56] attack this problem using DL. Their algorithm makes use of a combination of convolutional and recurrent neural networks (CNNs and RNNs), trained on several thousands of shots from the JET and DIII-D experiments. The CNN learns a low-dimensional representation of 1D diagnostic plasma profiles such as electron density and temperature. These features are concatenated with multiple 0D signals: plasma current, input power, core and total radiated power and normalized plasma pressure. The 1D + 0D concatenated features are input to a long short-term memory (LSTM) network, which can learn the temporal dynamics of the plasma and predicts the likelihood of the upcoming state being a disruptive state. If the LSTM output value exceeds a certain threshold, an alarm triggers the activation of a mitigation action. The system is well summarized in figure 13. In particular, a sensitivity analysis of the network to the inputs shows that the inclusion of the 1D profile information is crucial in enhancing the quality of the prediction. Compared to other ML models trained on the same task, the DL architecture also shows superior generalization abilities: the authors show that when the networks are trained on data from one machine only, JET or DIII-D, and tested on a different one, the DL model provides significantly more reliable predictions compared to shallow models. Another approach also combines a CNN, for the extraction of salient features from diagnostic data, and an RNN, to learn the sequence of events which can lead to disruptions at JET [85]. The model, comprising CNN and RNN, is trained as a single network. The input, a time sequence of line-integrated bolometer measurements, is processed by a stack of convolutional layers. The output of the convolutions is an intermediate internal representation of the overall network, which is processed by the RNN part. The final output of the RNN can be discrete, labelling a state as disruptive or not disruptive, or continuous, corresponding to the time until the disruption. The innovation is the exclusive use of the bolometer signal: according to the authors, this is one of the most frequent precursors of disruption at JET because it contains information related to impurity transport and accumulation. Taking this into consideration, the algorithm performs well compared to most methods using multiple 0D signals.
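A structural sketch of such an architecture is shown below (in PyTorch, with hypothetical layer sizes and feature dimensions); it illustrates the CNN + LSTM idea, not the actual architecture of [56] or [85].

```python
# Hedged sketch of a profile + scalar disruption predictor: a 1D CNN compresses each
# radial profile, the features are concatenated with the 0D signals, and an LSTM
# turns the time sequence into a disruptivity score at each time step.
import torch
import torch.nn as nn

class DisruptionPredictor(nn.Module):
    def __init__(self, n_profiles, n_scalars, n_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_profiles, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(8, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(4), nn.Flatten())
        self.lstm = nn.LSTM(16 * 4 + n_scalars, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, 1)

    def forward(self, profiles, scalars):
        # profiles: (batch, time, n_profiles, n_radial); scalars: (batch, time, n_scalars)
        b, t = profiles.shape[:2]
        feats = self.cnn(profiles.reshape(b * t, *profiles.shape[2:])).reshape(b, t, -1)
        h, _ = self.lstm(torch.cat([feats, scalars], dim=-1))
        return torch.sigmoid(self.head(h))       # disruptivity at each time step

# an alarm is raised when the score exceeds a tuned threshold, e.g.
# alarm = (model(profiles, scalars) > threshold).any(dim=1)
```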
Churchill et al [86] make use of a modified CNN architecture to learn the time sequence of possible disruptive scenarios from an ECE imaging diagnostic at DIII-D. The ECE imaging diagnostic collects data across multiple lines of sight with multiple channels and high temporal resolution. The CNN architecture makes use of so-called dilated convolution operations, which allow learning longer sequences of temporal patterns and compete well against RNN and LSTM networks for modelling sequential data. The network shows a satisfactory true positive rate, above 90%, comparable to that of models using multiple diagnostics. Nevertheless, the false positive rate is quite high: above 15%, compared to more competitive results around 5%. According to the authors, this might be due to the class imbalance in the training data: samples from non-disruptive discharges are more frequent than those from disruptive ones. This might be mitigated by the addition of further diagnostic data, which would allow capturing more complex patterns based on distinct physics processes and enable more accurate classifications. Still, a critical challenge yet to be addressed is the possibility of applying this model to a different device such as ITER.

Random forests (RFs) at DIII-D and Alcator C-Mod. This interesting approach does not use DL and succeeds well at predicting disruptions across multiple devices: Rea et al [87] use an RF algorithm trained on a database of DIII-D and Alcator C-Mod disruptions. The 0D inputs include plasma parameters directly measured by diagnostics, EFIT-reconstructed quantities and control parameters. The generalization is poor when training on data from only one device and testing on a different one. The reason might relate to the different time scales of the physics processes involved. This suggests that deep architectures, such as the LSTM network used in [56], are indeed able to learn a more flexible representation of the temporal dynamics present in the training data. Interestingly, we find that the signal importance analyses carried out on both shallow and deep architectures confirm the importance of the safety factor at 95% of the minor radius, q_95, above all the others, and the relatively low information content of measurements related to the radiated power. The importance analysis also shows another advantage of the deep architecture: the CNN used in the algorithm allows the processing of high-dimensional 1D plasma profiles, not included in the training of shallow models, whose information content ranks high compared to many other 0D signals.

MLPs at JET. Windsor et al [88] also use a shallow MLP, with one single hidden layer, to predict disruptions in one tokamak based on training data from a different one. The devices considered are JET and ASDEX-Upgrade. Interestingly, the authors find that an MLP with one single layer of weights performs better on the cross-tokamak validation than a deeper MLP having one layer of hidden units. The network is trained with 0D inputs and the output is the time interval up to the disruption. Still, the performance of the MLP is worse than that of the DL architecture in [56]: the number of discharges correctly labelled as disruptive is ≲70%, compared to the success rate >80% of the deep architecture making use of 1D plasma profiles. The MLP model has been integrated into the control system for online use at ASDEX-Upgrade [89]. When deployed on actual experimental discharges, the network alarm triggers the injection of killer pellets to mitigate the disruptive loads. Cannas et al [90] also developed an MLP model to predict disruptions at JET. In particular, we find the pre-processing of the input training data interesting: the usual signals describing the plasma state are clustered by a self-organizing map. In this way, the authors manage to reduce the inherent imbalance in the training data, where the non-disruptive samples are more frequent than the disruptive ones. The self-organizing map allows grouping multiple samples into one cluster, therefore reducing the overall number of effective samples for each class. At JET, the operational disruption prediction system, integrated within the real-time data network, is based on a more conventional ML model: a two-layer stack of support vector machines (SVMs) performs a binary classification of alarm/no-alarm triggers [91]. The first layer of SVMs takes as input 0D plasma parameters at three recent time points and outputs three binary classification values, one for each time step; the second layer takes these as input and returns the final alarm/no-alarm trigger. The system also includes a pre-processing stage to identify and remove outlier signals. The training data include more than 10 000 discharges from the JET campaign with the carbon wall.
The algorithm provides overall reliable predictions, with a low false positive rate (<1%), also on discharges performed with the ITER-like wall.

Adapting to new devices. In a recent overview [92], Vega et al suggest a possible solution to the limited amount of training data and to the problem of applying ML models to new devices, such as ITER. The solution is based on adaptive learning, extensively studied and applied to JET discharges by Murari et al [93]. Adaptive learning strategies can be applied to the RF and neural network algorithms mentioned above. They provide a scheme to retrain a model online between discharges. For example, a predictor which incorrectly labels a data point as non-disruptive is immediately retrained including the new data point correctly labelled; similarly, including new non-disruptive data points in an existing training set can decrease the false positive rate. A different approach uses different decision functions in parallel to label a discharge as 'disruptive/non-disruptive'. The performance of the different functions is compared between shots, and the one performing better is deployed in the next experimental scenario. Figure 14 is a schematic showing how multiple classifiers are combined together from a single dataset to produce one final classification. Murari et al [93] tested these two strategies on data from the initial discharges at JET with the ITER-like wall, in order to simulate the beginning of the life of a new device, and they showed promising improvements in detection accuracy.
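As a simple illustration of the shallow-model and between-shot retraining ideas discussed above (feature names, hyperparameters and the retraining rule are hypothetical, not those of [87] or [93]):

```python
# Hedged sketch of a random-forest disruption classifier on 0D features, with the
# built-in importance ranking and a naive adaptive step that folds a newly observed,
# correctly relabelled shot back into the training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

features = ["q95", "greenwald_fraction", "radiated_fraction", "li", "locked_mode"]

def train(X, y):
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced")
    clf.fit(X, y)
    # rank the input signals by their importance for the classification
    for name, score in sorted(zip(features, clf.feature_importances_),
                              key=lambda pair: -pair[1]):
        print(f"{name}: {score:.3f}")
    return clf

def adapt(clf, X, y, x_new, y_new):
    # between-shot update: append the mislabelled sample with its correct label
    X = np.vstack([X, x_new[None, :]])
    y = np.append(y, y_new)
    clf.fit(X, y)            # retrain on the augmented data set
    return clf, X, y
```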
Let us conclude this section with a pragmatic remark. At present, most devices rely on disruption detection systems, rather than prediction, in order to trigger mitigation procedures such as massive gas injection [82,94,95]. The detection of disruptions is based on the identification of anticipatory features in diagnostic data. Despite providing shorter warning times, these systems are reliable because they detect simple trends in plasma parameters, such as plasma current, density, or vertical position. This highlights how, perhaps, the major drawback of more complex algorithms is inherent to their black-box nature, and the fact that they learn to respond only to those features which they are exposed to during training. Finally, it is worth pointing out that the distinction between detection and prediction in ML algorithms is only a loose one: as shown by the examples above, the training and good performance of ML 'prediction' algorithms often rely on the inclusion of precursor signals, such as the locked mode, which are themselves good indicators of disruptive states.

Surrogate models
An emerging research direction for the application of DL in fusion concerns the development of fast surrogates of computationally intensive models. The idea of making demanding simulation codes faster is not new to the field: so-called reduced models have been a subject of investigation for a long time, especially in relation to computations of turbulent transport. Such models are often based on numerical approximations and theoretical first principles. Also in this case, faster and more reliable predictive models of turbulent transport can enable more accurate and refined control systems. More recently, DL has been used to generate 'black-box' reduced models of computationally intensive simulations.
Reduced models of gyrokinetic theory. The work by Citrin et al [96] offers a perspective where, in the first stage, a reduced model of the gyrokinetic theory of transport, QuaLiKiz, is derived with a first-principles quasi-linear approximation. In the second stage, this model, up to 10^3 times faster than a full non-linear gyrokinetic calculation, is used to generate a large database of simulations for training a neural network [97]. The data set is generated by scanning a 9D grid of turbulence-relevant parameter values, such as ion and electron temperature gradients and ratios, safety factor, magnetic shear, local inverse aspect ratio, collisionality, and effective charge. Lower and upper bounds and the number of samples are defined for each parameter. In order to cover experimentally more relevant regimes more densely, the samples are not uniformly distributed. A total of 3 × 10^8 calculations of the evolution of multiple flux quantities are used for training. The network, an MLP with three hidden layers and 128 units per layer, learns desired physics features thanks to a careful choice of target quantities and the addition of a penalization term to the conventional regularized mean-squared-error loss function. The trained network is then integrated within an integrated modelling simulation to calculate the evolution of ion and electron temperature profiles and electron density profiles. The network provides the values of the main ion and electron heat fluxes and the electron particle flux. Similar calculations are carried out using the quasilinear reduced model QuaLiKiz and compared against each other for three different experimental discharges at JET. The discrepancy is moderately low, mostly below 10%, whereas the gap in execution time is quite remarkable: QuaLiKiz calculations took 3.6 h on 16 cores, whereas the neural network ones only took 15 s on a single core.

Surrogates of NBI models in tokamaks. In a different application, neural networks are trained to deliver a fast approximation of a computational model of neutral beam injection in tokamaks [98]. The model, called NUBEAM, allows computing quantities such as beam heating and density profiles, fast ion diffusivity profiles, fast ion pressure and total neutron rate. The neural network surrogate allows the real-time computation of such quantities, given as input plasma measurements such as the effective charge, the plasma position and shape, electron density and temperature profiles, and the plasma current. The training data are generated by running the full NUBEAM analysis on data collected across hundreds of experimental shots at the NSTX-U tokamak. In the conventional approach, NUBEAM is used within the accurate transport simulation code TRANSP to fit experimental measurements of, for example, neutron rates by varying the fast ion diffusivity profiles, and observations of the plasma current by varying the effective charge. This process is iterative and time-consuming because of the long computation times required by full NUBEAM simulations. Replacing NUBEAM calculations with the neural network surrogate, which is 10^4-10^5 times faster, allows making use of an accurate predictive transport code such as TRANSP in real-time applications and feedback control systems. The authors also demonstrate how uncertainties of the neural network reconstruction can be estimated: they train an ensemble of networks and combine their outputs by computing the mean and variance of the individual predictions.
Figure 15 shows a comparison between plasma profiles calculated with full TRANSP calculations and those obtained using the neural network NubeamNet including error bars. The variance of the predictions also increases when the input data fall outside the training region, therefore providing a measure for possibly unreliable network outputs.

Core turbulent transport and pedestal calculations.
Meneghini et al [99] present an NN surrogate of core turbulent transport fluxes and pedestal features. The network is trained on simulations performed with theory-based models: the TGLF code, a quasilinear transport model that can predict turbulent transport fluxes with high fidelity, and the EPED1 code, which predicts the pedestal height and width in H-mode plasmas. The final goal is to perform efficient and accurate whole-device modelling: the NN algorithm is a good fit for this because running self-consistent models of core and pedestal plasmas requires executing both the TGLF and EPED1 codes. Transport simulations require thousands of evaluations of 10 CPU-second TGLF runs, whereas the EPED1 model requires tens of CPU hours. An NN-based reduced model can perform complete self-consistent simulations more than 10^5 times faster. The creation of the training data set remains an expensive task, which can be made cheaper by performing the simulations only for a subset of plasma parameter values relevant to a specific application. The range of values can be chosen, for example, from experimental datasets or prospective design principles. This can lead to the production of a sparse dataset: this limitation can be mitigated by training an ensemble of networks and using the mean and variance of the ensemble for the final prediction. The variance is larger when the networks extrapolate outside of the training region, therefore it can detect possibly unreliable output. A further advantage of the NN ensemble is that it can provide smooth reconstructions: Meneghini et al show that it is able to compute smoother functions of heat, particle and momentum fluxes, as well as pedestal height and width, in comparison to the theory-based counterpart. This also allows the self-consistent simulation of the core-pedestal plasma to converge faster. Finally, the training and evaluation of the NN surrogates are integrated conveniently within the OMFIT framework, which is also used to perform the self-consistent whole-machine modelling. The possibility to execute simulations according to different trade-offs of time and accuracy within a single framework is a desirable feature for a practical and user-friendly solution.

Surrogates of particle-in-cell gyrokinetic code XGC. The work of Miller et al [100] provides a good example of physics-informed training of a DL model. The task is to train a neural network to reproduce the calculations of XGC, a particle-in-cell gyrokinetic code used to simulate the plasma behavior in the edge region of fusion devices. The bottleneck is the calculation of the nonlinear Fokker-Planck-Landau collision operator, which acts on 5D particle distribution functions (parallel and perpendicular velocity, plus 3D position coordinates). In order to speed up the calculations, an encoder-decoder neural network is used. For this particular architecture, the input and output dimensionality is the same. The input is 2D velocity grids (parallel and perpendicular velocity) of the distribution function of ions and electrons for each set of 3D coordinates considered in the calculation. The output is 2D velocity grids of the change in the distribution function for each species. The initial stack of hidden layers performs a convolution of the input to extract lower dimensional features. The final stack up-samples, or decodes, such features back into an output of the same size as the input. Figure 16 shows a schematic of the network architecture.

Figure 16. Schematic of the encoder-decoder architecture used to reproduce XGC calculations. On the left side, the input ion and electron distribution functions as 2D matrices in velocity space; on the right end, the output change in the distribution functions. Reproduced with permission from [100].

The network is trained to minimize a loss function with two terms: one corresponding to the minimization of the conventional mean-square error between the outputs and the targets obtained with the full XGC calculation, the other corresponding to the conservation laws of density, momentum and energy of the particle distribution function. In this way, the fast network surrogate provides solutions satisfying conservation laws. The paper shows preliminary results from training with JET data, where the conservation laws are satisfied with a tolerance up to ≈10^-4, to be compared with ≈10^-10 achieved with the full Picard calculation of XGC. The discrepancy is quite large: according to the authors, though, the surrogate can be used safely with an accuracy on the order of ≈10^-5, which makes the gap smaller. Considering that the network calculations are multiple orders of magnitude faster, this reconstruction can provide a valuable trade-off between speed and accuracy.
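As an illustration of such a physics-constrained objective, the sketch below combines a mean-squared-error term with a penalty on the violation of the conservation laws; the velocity-space moments and the weighting factor are schematic placeholders, not the actual XGC discretization of [100].

```python
# Hedged sketch of a physics-constrained training loss: standard MSE plus a penalty
# on violations of density, momentum and energy conservation. The moment operators
# assume a uniform (v_par, v_perp) velocity grid and are schematic only.
import torch

def moments(df, v_par, v_perp):
    # density-, momentum- and energy-like moments of the predicted change df,
    # with df of shape (batch, n_vpar, n_vperp)
    w = v_perp[None, None, :]                        # Jacobian-like weight (schematic)
    n = (df * w).sum(dim=(-2, -1))
    p = (df * v_par[None, :, None] * w).sum(dim=(-2, -1))
    e = (df * (v_par[None, :, None] ** 2 + v_perp[None, None, :] ** 2) * w).sum(dim=(-2, -1))
    return n, p, e

def physics_loss(pred, target, v_par, v_perp, lam=1.0):
    mse = torch.mean((pred - target) ** 2)
    # the collision operator conserves density, momentum and energy, so the
    # corresponding moments of the predicted change should vanish
    conservation = sum(torch.mean(m ** 2) for m in moments(pred, v_par, v_perp))
    return mse + lam * conservation
```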

Surrogates of Bayesian inference
As mentioned previously, a notorious drawback of Bayesian inference is the time required to compute and sample from the posterior distribution. This is the case, for example, for some of the physics-based models of plasma diagnostics reviewed in section 3. MCMC methods can sample from the posterior distribution, but they often require a large number of iterations, which makes them impractical for many real-world complex scenarios. The issue is that computing the likelihood function involves running a forward model or simulation code of the expected measurements. In the case of complex systems, such computations are often very slow.

The method.
ML models can be trained to reconstruct the posterior more efficiently and infer plasma parameters from diagnostic data on short time scales [42,101-103]. The training data for the DL model are generated with the forward model used in the conventional inference approach. Given a Bayesian model of a certain diagnostic process, the distribution of the data it generates is defined by the model joint probability distribution in equation (34), p(d, h) = p(d|h) p(h), where d represents the model observations and h the model free parameters. The terms p(d|h) and p(h) are the likelihood function and the prior probability distribution. The likelihood function is evaluated using a deterministic forward or generative model of the expected measurements, which we denote with f(h). For example, if the likelihood is a Gaussian, the mean can be defined as the value of the forward model: p(d|h) = N(µ = f(h), σ). The training data are samples from the joint probability distribution as in equation (35), where h_t and d_t represent the free parameter and observation training samples: h_t ∼ p(h), d_t ∼ p(d|h_t). An ML model can be trained on different learning tasks with these training data. The tasks used in [42,102,103] are shown in table 2, where the function column indicates the DL function, the mapping column indicates the input-to-output mapping, and the training conditional indicates the distribution that the network learns during training. For brevity, the bottom entry of the table is grouped under 'training conditional', although it does not represent a proper conditional distribution, but rather a joint probability distribution. A DL model trained on task f* is a fast surrogate of the physics-based deterministic forward model f: it takes as input the model parameters and outputs the expected observations. This surrogate can replace the full physics-based forward model and be used within a conventional MCMC algorithm to sample from the posterior. Consider task g*: in this case, the DL model learns the 'inverse' function, mapping input measurements to model parameters. In practical terms, this surrogate can accelerate Bayesian inference in multiple ways: when a sum-of-squares error function is used as a training objective, the DL model learns to reconstruct the mean of the true posterior under a Gaussian approximation. Such a fast reconstruction can be used for inter-shot analysis or as the initial guess for conventional Bayesian inference algorithms such as MCMC. An illustration of the sampling procedure for this learning task is shown in figure 17. Consider now learning task h*: a DL model trained in this way learns the value of the joint probability distribution, taking as input both model parameters and observations. This surrogate can also be used within an MCMC algorithm, such as Metropolis-Hastings, that requires the computation of the joint probability distribution value in order to sample from the posterior. In principle, it is possible to consider also different learning tasks exploiting different training strategies and models. For example, joint probability distributions can also be modelled successfully with variational autoencoders (AEs) [104] and generative adversarial networks [105]. To summarize, in all the learning tasks above, the DL model learns an internal representation of the joint probability distribution from a large number of training data samples generated with a simulator, and it is then used on novel experimental data to infer plasma parameters efficiently.
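A minimal sketch of learning task g* is given below: parameters are drawn from the prior, noisy observations are generated with a placeholder forward model, and a network is fitted with a sum-of-squares objective so that it approximates the posterior mean. The forward model, prior ranges, noise level and architecture are illustrative assumptions, not a specific diagnostic model.

```python
# Hedged sketch of the surrogate-of-inference idea for task g*: build (h_t, d_t)
# training pairs from the joint distribution, then fit the inverse mapping d -> h.
import numpy as np
import torch
import torch.nn as nn

def forward_model(h):
    # placeholder f(h): expected measurements (32 channels) for parameters h
    x = np.linspace(0.0, 1.0, 32)
    return h[:, [0]] * np.exp(-x[None, :] / h[:, [1]])

def sample_joint(n_samples, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.uniform([0.5, 0.1], [2.0, 1.0], size=(n_samples, 2))       # h_t ~ p(h)
    d = forward_model(h) + sigma * rng.standard_normal((n_samples, 32))  # d_t ~ p(d|h_t)
    return torch.tensor(d, dtype=torch.float32), torch.tensor(h, dtype=torch.float32)

d_train, h_train = sample_joint(20000)
net = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    # sum-of-squares objective: the network approximates the posterior mean of h given d
    loss = torch.mean((net(d_train) - h_train) ** 2)
    loss.backward()
    opt.step()
```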
Overall, this method can make Bayesian inference approaches such as those in section 3 more efficient and ready to scale. Also in this case, issues related to generalization and to the performance on out-of-sample input data remain open; they can be addressed with novelty detection and the techniques mentioned in section 5.

Applications to diagnostic models.
In this section, we review the applications of the method described above to models of diagnostic data.
The x-ray imaging system at W7-X. In [42,101], the authors discuss a neural-network-based surrogate of an x-ray imaging crystal diagnostic at W7-X for the inference of the electron and ion temperature profiles. The Bayesian modelling of the diagnostic within the Minerva framework is described in [23,106]. The diagnostic collects 2D images, with one dimension corresponding to the spatial resolution of the different lines of sight and the other to the energy/wavelength resolution. The model predicts the line-of-sight-integrated spectra from the values of the electron and ion temperature profiles and the electron and impurity density profiles. The main impurity species considered is Ar^16+. In order to generate the training data for the neural network, it is necessary to specify the joint probability distribution of the model. The distribution of the plasma profiles is defined with a GP prior with a squared exponential kernel [107], which generates smooth profiles (see the following paragraph on the JET Li-beam system for a depiction). The profiles are also constrained to have lower values at the last closed flux surface, as expected from physics. In [42] the authors also discuss a method for novelty detection. The method uses the K-nearest neighbors algorithm as a distance-based method to estimate the likelihood of a novel input under the distribution of the training data. The average distance of the new input data to the ten nearest training data samples is compared to the average distance of the test set samples used to evaluate the network performance during training. If the distance of the new input data is much larger than that of the test set samples, the input data are assumed to belong to a region of input space not well covered by the training data, which can lead to unreliable network output. The network architecture chosen for this task was a CNN. The choice is motivated by the fact that the input data present a 2D structure with recurrent relevant features, e.g. the broadening of different emission lines carrying information about the plasma ion temperature. The CNN is trained to reconstruct ion and electron temperature profiles given the 2D x-ray images as input. This corresponds to learning task g* in table 2. A depiction of the network function together with input and output data is shown in figure 18. An early evaluation on data collected during the first experimental campaign at W7-X showed promising results: a good agreement is found between the CNN reconstruction and the solution found with MCMC sampling of the posterior. Discrepancies fall within error bars in most cases. The error bars of the network reconstruction are computed with the BNN framework for uncertainty estimation.

The Li-beam diagnostic at JET. In [102], the authors consider the inference of the edge electron density profile from measurements collected with a lithium beam diagnostic at the JET experiment. The Bayesian model implemented within the Minerva framework allows predicting the line intensity emitted by the interaction of the injected lithium atoms with the plasma species, observed with a spectrometer along a ≈40 cm vertical beam. The model is described in detail in [30,31]. A multi-state collisional radiative model predicts the spectral measurements starting from the values of the electron density and temperature profiles. The training data samples are generated from the joint distribution p(I_Li, n_e, T_e) = p(I_Li|n_e, T_e) p(n_e) p(T_e), where p(I_Li|n_e, T_e) is the likelihood function of the line emission intensity.
The priors p(n_e) and p(T_e) are modelled with a GP to generate smooth and realistic profile functions; examples of functions drawn from these GP priors are shown in figure 19. The data are used to train a neural network on task g* from table 2. The input is the lithium line intensities and the target data are the corresponding electron density profile samples from the prior. The trained network is used to reconstruct n_e profiles given as input experimentally observed values of I_Li. Also for this case, the authors provide an estimate of the uncertainties with BNNs. In particular, the posterior distribution of the network's weights is approximated with MC dropout [60]. This method can be used to generate samples from the output distribution with repeated forward passes of the network for fixed input data. This is particularly convenient because no additional calculation is required; moreover, each forward pass is fast, and output samples can be generated with little overhead. The method is therefore suitable for real-time applications.
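For illustration, smooth profile samples of this kind can be drawn from a zero-mean GP prior with a squared exponential kernel as follows; the radial grid and kernel hyperparameters are illustrative choices, not those used in [102,107].

```python
# Hedged sketch: draw smooth 1D profile samples from a zero-mean GP prior with a
# squared exponential kernel, of the kind used to generate synthetic training profiles.
import numpy as np

def squared_exponential(x1, x2, amplitude=1.0, length_scale=0.1):
    d = x1[:, None] - x2[None, :]
    return amplitude ** 2 * np.exp(-0.5 * (d / length_scale) ** 2)

rho = np.linspace(0.0, 1.0, 100)                              # normalized radial grid
K = squared_exponential(rho, rho) + 1e-8 * np.eye(rho.size)   # jitter for stability
rng = np.random.default_rng(0)
profiles = rng.multivariate_normal(mean=np.zeros(rho.size), cov=K, size=5)
```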

Challenges
ML is a rapidly growing field that extensively exploits the improvement in computing power. As such, it has the potential to bring substantial innovation to the applied sciences. Where and how best to apply ML to nuclear fusion still represents relatively uncharted territory. However, it is clear that the field can benefit from practices that the ML community has established. These relate, for example, to the generation of standardized benchmark datasets and the opening of domain-specific challenges to a broad audience of talents. Moreover, novel techniques developed within the ML community have the potential to address open questions in fusion research. It would also be interesting for ML specialists to address real-world applications, such as those of nuclear fusion research. In this concluding section, we attempt a portrayal of the landscape of opportunities for future developments. Some ML methods referenced here have been developed only recently; their inner workings are not described in full detail, and the interested reader is invited to consult the references for a more thorough understanding.

Outreach and engagement
The number of students and young scientists in a given scientific discipline limits progress in that discipline. If new generations are not trained to use ML models, ML models will not be adopted by the community. Outreach activities are thus crucial to educate current fusion students to make use of ML, and to attract ML-trained students to work in fusion. Outreach and engagement efforts can be directed towards a variety of audiences, including colleges and research institutes, industries, and the public. In academia, training courses, boot camps, and hackathons in applied ML can be held on a regular basis [108][109][110]; mini-conferences or workshops in existing ML conferences can be organized to attract the right students [111,112]. PhD programs with shared supervision between an academic and an industrial partner can be developed to bridge the gap in applied ML between academia and industry (where more applied ML expertise may lie). Educational and training efforts can reach audiences outside of academia and industry. Open competitions, for example, can be hosted on dedicated platforms [113,114]. Furthermore, outreach materials that are easy to digest by a larger audience should be created, such as podcasts, blog posts, and tutorials.
Fusion is a real-world application that offers interesting challenges for using ML [55], which can attract talented students with a computer science background looking for uncharted territory to practice their skills.

Data set
The lack of a culture and practice of developing open, carefully curated, high-quality benchmark data sets has a wide-ranging impact on fusion research: progress can be made only by researchers who have access to the data, repetitive work aimed at assembling and cleaning the data is performed multiple times by different researchers, results cannot be easily replicated, and models cannot be easily benchmarked. Shared fusion data sets would democratize the value of experimental and simulation data to a broader scientific audience, encourage the development of better models, push peer-reviewed articles to require reproducibility practices [115], and foster collaboration around shared tasks (e.g. disruption prediction). Furthermore, open data sets can provide materials for teaching, tutorials, and training, as well as materials to fuel collaborative challenges (see section 7.1). Moreover, rather than simply hosting and sharing the data set, evaluation metrics and simple working baselines should be given as a starter kit.
We also note the absence of any inter-machine data sets. To effectively deploy ML models on future, yet-to-be-built devices, the models must be able to generalize across (unseen) devices. The lack of an inter-machine data set hinders progress in transfer learning within the fusion ML community (see section 7.4).
Recently, a few attempts have started in this direction: [116,117] are working on an inter-device disruption database, and [118] has compiled a data set of 5 million samples of reduced turbulence simulations for tokamak plasmas. Outside the setting of ML, the fusion community is not unfamiliar with the requirement to create shared data sets for physics code validation. TCV-X21 [119] is an outstanding example of this: a data set for the validation of edge turbulence models in a tokamak geometry. The authors make experimental and simulation data easily accessible, interoperable, and reusable. Furthermore, an easy-to-use web interface is offered to become acquainted with the data without the need to install anything on a local computer.
Open data sets have been collected in different scientific domains: high-energy physics data at the CERN Open Data portal [125], protein structure database [126], weather forecasting [127,128], and turbulence modeling [129].

Data efficiency
The total error of a learning model is characterized by three forms of error: approximation error, optimization error, and generalization error [130]. The approximation error represents how effectively the learning model can represent the target function, the optimization error describes how well the model can minimize the training loss, and the generalization error describes how well the model can generalize to unseen data.
The size of the training data set bounds the generalization error of any learning model: the more data you have, the larger the model you can train without overfitting, and the better the model can generalize to unseen data. In ML, scaling laws have been observed: the performance of a learning model can be accurately predicted in terms of the model size and the data set size [131-133]. Moreover, performance scales favorably in both dimensions. In particular, if the training procedure is not limited by one of the two dimensions, a power law has been observed: e.g. in the limit of an indefinitely large data set, the generalization error decreases as the model size raised to a negative power. These works suggest that such scaling laws are neither domain- nor task-dependent; fusion tasks can therefore benefit from them as well.
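Schematically, and with constants and exponents that are fitted, task-specific quantities (the notation below is illustrative rather than taken from [131-133]), such power laws can be written as:

```latex
% schematic form of the observed scaling laws: generalization error E versus the
% number of model parameters P (data-unlimited regime) and the data set size D
% (model-unlimited regime); \alpha_P, \alpha_D, P_c, D_c are fitted constants
E(P) \approx \left(\frac{P}{P_c}\right)^{-\alpha_P}, \qquad
E(D) \approx \left(\frac{D}{D_c}\right)^{-\alpha_D}
```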
However, in fusion, both experimental and simulation data can be costly to acquire. For example, because turbulent transport limits the performance of present magnetically confined devices, being able to predict it beyond existing configurations is extremely valuable. ML-based surrogate models of high-fidelity turbulence codes would allow making use of turbulence simulations at scale. However, non-linear turbulence simulations are among the most computationally expensive simulations in magnetic confinement, requiring up to O(10^4) CPUh for a single flux tube simulation [134], and up to O(10^6) CPUh for a whole-device simulation [135] (both in the case of a three-dimensional stellarator geometry).
Two approaches exist to address this problem: bring down the cost to generate sufficient training data, or improve the model's data efficiency. Assuming we have no control over the generation of training data, how can we incorporate training data more efficiently or increase the model's data efficiency?
Not all data are created equal. The training data may contain redundant information, and some regions of the input space may be underrepresented. Active learning [136] seeks to address the following question: what is the best strategy to select the highest value samples to build the training data set? As a recent example, [137] developed algorithms that dynamically pick and add samples to the training data, focusing on regions of the input space where the learning model is more uncertain. Similarly, an existing training data set can be pruned to improve the model's data efficiency [138]: the same test error can be achieved when training on a carefully selected subset of the original training data set.
In magnetic confinement, RL offers the opportunity to build a single general-purpose feedback system to control the plasma. However, to learn an optimal policy, RL requires a very large amount of samples (i.e. it is extremely data inefficient) [139]. This is especially important in fusion, where samples are expensive to obtain. Recently, [140] suggested a sample-efficient RL agent for tokamak control: the approach balances the exploration-exploitation dilemma by taking actions that maximize the expected information gain about the optimal trajectory. In order to bring RL-based control technology to maturity such that it can be reliably used in mission-critical applications, the fusion community could benefit from the definition of standards for validation and certification to assess the readiness level of different proposed solutions in an operational environment.

Domain adaptation
In a real-world application such as fusion, the test data usually come from different feature spaces or distributions with respect to the training data [141]: models are usually trained on simulated data to be later used on experimental data [142], or they are trained on experimental data from current devices to be later used on future devices [87,143]. Collecting training data that are close to the expected test data may be expensive (e.g. experiment time is limited) or impossible (e.g. the experiment has not yet been performed). 'Domain adaptation' strategies seek to mitigate this problem. The training data come from the 'source' domain, and the test data represent the 'target' domain.
Addressing the problem of domain adaptation will be necessary to avoid catastrophic events in ITER. Disruptions are a show-stopper for a tokamak reactor. State-of-the-art approaches rely on data-driven models to predict and avoid them. ITER will require a functioning disruption prediction and avoidance system from the outset. In addition, no interruptions will be allowed. In other words, the model must generalize well to ITER plasmas.
A simulation model usually has input parameters that influence the simulation result. In general, system identification can be used to adjust the simulation parameters to make the simulation more similar to reality. However, this process may be expensive to perform. In domain randomization, a model is trained on data generated by randomly varying the simulation parameters. In this way, at training time, the model is exposed to a wide range of 'simulated worlds' so that 'with enough variability in the simulator, the real world may appear to the model as just another variation of the simulation' [144]. Domain randomization has been successfully applied in real-world applications, such as flight collision avoidance [145].
Another mitigation strategy is to split the training process into two stages: a high-capacity model is first trained on a large amount of simulated data, followed by the fine-tuning of a subset of the model on a limited set of experimental data. This method has been employed successfully in both computer vision [146] and inertial confinement fusion [147]. Furthermore, this process can be iterated online: when more experiments are conducted, the experimental data set on which the model is tuned can be augmented.
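A minimal sketch of this two-stage strategy, assuming a generic regression network and placeholder data tensors, could look as follows:

```python
# Hedged sketch of sim-to-experiment fine-tuning: pre-train a network on abundant
# simulated data, then freeze its body and fine-tune only the final layer on a small
# experimental data set. Architecture and data tensors are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 1))

def fit(model, x, y, params, n_epochs=200, lr=1e-3):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = torch.mean((model(x) - y) ** 2)
        loss.backward()
        opt.step()

# stage 1: pre-training on a large simulated data set (x_sim, y_sim)
# fit(model, x_sim, y_sim, model.parameters())

# stage 2: freeze all but the last layer and fine-tune on scarce experimental data
# for p in model[:-1].parameters():
#     p.requires_grad = False
# fit(model, x_exp, y_exp, model[-1].parameters(), n_epochs=50, lr=1e-4)
```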
One source of generalization error when shifting from the source domain to the target domain is that the model learns to exploit latent features present only in the source domain and hence fails to learn resilient features also present in the target domain. Adversarial training can be used to build models that are robust to hidden latent variables (e.g. the experimental device, the simulated or experimental data) [148].

Interpretability
ML models are usually regarded as 'black-boxes'. How to understand when and why they fail? When to trust their predictions? Will they be able to generalize when deployed? How to draw insights from the learned mapping?
Interpretable models will be needed to deploy fusion energy into the grid. The tokamak is the leading magnetic confinement configuration, but it suffers from disruptive current instabilities (i.e. disruptions). ML-based models are expected to predict and avoid disruptive events in future tokamak reactors [92,149]. Interpretable models may be needed from the perspective of safety regulations, and may catalyze their adoption [150].
Interpretability can be classified into two types: intrinsic and post hoc [150]. An intrinsically interpretable model is a model with a simplified structure such that it is interpretable by construction (e.g. linear regression, decision trees). Post hoc interpretability involves the use of interpretation methods on a trained (possibly complex) model (e.g. permutation feature importance, SHAP [151]).
While interpretable models are already routinely used in tokamak experiments, deep ML models (not inherently interpretable) have also been proposed [84]. In this context, interpretation methods can 'whiten' the black-box, for example: feature visualization [157], class activation mapping [158,159], network dissection [160].
Moreover, and not only in the context of classification, non-interpretable deep models can be distilled into simpler interpretable models [161]. For example, when developing surrogate models for expensive simulations (e.g. turbulence transport), first, a high-capacity, bias-free, black-box surrogate (i.e. the teacher) can be trained to interpolate the input space; then, a simpler, interpretable model (i.e. the student) can be trained to mimic the teacher's predictions. The student's training data can also be augmented by the teacher model, i.e. data can be queried and interpolated from the high-capacity model.
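A minimal sketch of such a distillation step, assuming a generic black-box teacher and a decision tree as the interpretable student, could look as follows:

```python
# Hedged sketch of model distillation: fit an interpretable student (a shallow
# decision tree) to the predictions of a black-box teacher on inputs sampled from
# the region of interest. The teacher, bounds and sample count are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def distill(teacher_predict, x_low, x_high, n_samples=10000, max_depth=5):
    rng = np.random.default_rng(0)
    X = rng.uniform(x_low, x_high, size=(n_samples, len(x_low)))  # query the teacher
    y = teacher_predict(X)                                        # teacher labels
    return DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
```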
In addition, a future fusion reactor must operate safely. If operations depend critically on a non-interpretable ML model, the robustness of the model against adversarial attacks [162-164] must be evaluated. To the authors' knowledge, no investigation has been conducted in this direction.

Stellarator optimization
Stellarators may offer a faster and safer path to fusion energy than tokamaks [165]. Stellarators do not require an internal plasma current to achieve confinement, so they do not suffer from disruptive current instabilities (i.e. disruptions). They are designed for steady-state operation and require less recirculating power than tokamaks. Stellarators are not subject to the Greenwald density limit: in a reactor, they can operate at higher density, and thus at lower temperature, which benefits the divertor and the first wall.
Yet, a classical stellarator does not fulfil all the properties needed for a reactor. The magnetic field of a stellarator must be carefully tuned to meet many reactor requirements (e.g. low neoclassical transport, confinement of fast particles). In addition, the magnetic field should be robust enough to allow for tolerances in the manufacturing and assembly of the coils [166]. Nevertheless, optimized stellarator configurations can be obtained: state-of-the-art stellarator designs can confine energetic particles over long timescales with minimal losses [167].
The problem of stellarator optimization presents two main difficulties (among others): a large, non-convex optimization space with O(10²) free parameters, and the need to evaluate multiple equilibrium properties with computationally expensive physics codes [168].
To reduce the dimension of the search space, dimensionality reduction techniques can be used: interpretable random or Gaussian matrices to project input parameters into a lower dimensional space [169], or deep AEs to learn a reduced latent representation of the input space [170]. To more efficiently navigate the non-convex optimization space, Bayesian optimization (BO) could be used: BO is usually employed in problems where it is necessary to optimize a 'black box' function that is expensive to evaluate [171]. To speed up the computation of the physics metrics needed to evaluate the goodness-of-fit of a candidate configuration, data-driven surrogate models can be developed and used instead of expensive physics solvers.
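These ingredients can be combined in a simple, purely illustrative loop: a random Gaussian matrix maps a low-dimensional search vector to the full parametrization, and a Gaussian-process surrogate with a lower-confidence-bound acquisition selects the next expensive evaluation. The objective below is a toy stand-in for a real figure of merit.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
D, d = 100, 4                               # full and reduced design-space dimensions
A = rng.normal(size=(D, d)) / np.sqrt(d)    # random projection matrix

def objective(z):
    """Stand-in for an expensive figure of merit (e.g. a transport or fast-particle
    loss proxy) evaluated on the full parametrization x = A @ z."""
    x = A @ z
    return float(np.sum(x ** 2) + 0.1 * np.sin(5 * x[0]))

# Simple BO loop in the reduced space: a GP surrogate of the objective plus a
# lower-confidence-bound acquisition evaluated on random candidate points.
Z = rng.uniform(-1, 1, size=(10, d))
f = np.array([objective(z) for z in Z])
for _ in range(30):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(Z, f)
    cand = rng.uniform(-1, 1, size=(2000, d))
    mu, sigma = gp.predict(cand, return_std=True)
    z_next = cand[np.argmin(mu - 2.0 * sigma)]      # exploration/exploitation trade-off
    Z = np.vstack([Z, z_next])
    f = np.append(f, objective(z_next))
print("best value found:", f.min())
```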

Other opportunities
Fusion is a big-data science, and future experiments will generate a significant amount of experimental data [172]. However, the majority of this data will be unlabelled or will have only few labels. For example, processing all shots to identify plasma states or to detect anomalous events is time-consuming. How can the most be made of the scarce labelled data?
Semi-supervised learning is one option [173]: semi-supervised learning seeks to use unlabelled data to develop a learner that is superior to a learner trained solely on labelled data. The wrapper method [174], for example, alternates two stages: a supervised model is trained on the labelled data; then the unlabelled data are labelled by the learned model, inserted back into the training data, and the model is retrained on the augmented data set. Optionally, the model can be fine-tuned on the original labelled data to avoid propagating incorrect biases into the model. Semi-supervised learning for event detection has already been used in fusion [175].
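A minimal self-training (wrapper) example with scikit-learn is shown below, using synthetic stand-in data in which only a small fraction of samples carry labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
# Stand-in data: plasma-state labels are available for only a small fraction of samples.
X = rng.normal(size=(5000, 6))
y_true = (X[:, 0] + X[:, 3] > 0).astype(int)
y = np.full(5000, -1)                     # -1 marks unlabelled samples (sklearn convention)
labelled = rng.choice(5000, size=200, replace=False)
y[labelled] = y_true[labelled]

# Wrapper/self-training: train on labelled data, pseudo-label confident unlabelled
# samples, reinsert them, and retrain on the augmented set until convergence.
base = RandomForestClassifier(n_estimators=200, random_state=0)
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y)
print("agreement with the (normally unknown) ground truth:",
      (model.predict(X) == y_true).mean())
```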
Self-supervised learning has also been proposed in recent years to exploit enormous collections of unlabelled data. In self-supervised learning, first, a synthetic (self-supervised) pre-training task is designed to learn robust features from a large unlabelled data set; then, the model is fine-tuned (typically only the last layers) on the target task (e.g. classification) with sparse labelled data. The model's performance on the pre-training task is of no interest in itself; its primary aim is to derive robust features from the enormous amount of available unlabelled data, which are later exploited in the 'downstream' task. Masked prediction, transformation prediction, instance discrimination and clustering are examples of pre-training tasks [176].
Let us consider the downstream task of detecting plasma states based on 0D and 1D plasma parameters. A masked pre-training task could be to train a model to recover the full set of plasma parameters from a random subset of the 0D and 1D parameters. A transformation pre-training task could be to randomly shuffle the timestamps within a single shot and train the model to recover the original order. After the model has been trained on the pre-training task, a single classifier layer replaces the model's final layers and is fine-tuned to classify plasma states using the available labelled shots.
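A possible sketch of the masked pre-training variant in PyTorch is given below; the data, the number of parameters and the number of plasma states are stand-ins, and in practice the last encoder layers would typically be fine-tuned together with the classifier head.

```python
import torch
from torch import nn

n_params, n_states = 12, 4          # number of 0D/1D features and of plasma states (stand-ins)

encoder = nn.Sequential(nn.Linear(n_params, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
decoder = nn.Linear(64, n_params)   # pre-training head: reconstruct all parameters

# Self-supervised pre-training: mask a random subset of parameters and train the
# model to recover the full parameter vector from the remaining ones.
x_unlab = torch.randn(20000, n_params)      # abundant unlabelled time slices (stand-in)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    mask = (torch.rand_like(x_unlab) < 0.4).float()
    loss = nn.functional.mse_loss(decoder(encoder(x_unlab * (1 - mask))), x_unlab)
    loss.backward()
    opt.step()

# Fine-tuning: replace the reconstruction head with a classifier layer and train it
# on the few labelled shots (here, stand-in labelled samples).
classifier = nn.Linear(64, n_states)
x_lab, y_lab = torch.randn(300, n_params), torch.randint(0, n_states, (300,))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(x_lab)), y_lab)
    loss.backward()
    opt.step()
```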
Traditional local storage solutions may have to be reconsidered in future fusion experiments [172]. The long duration of the plasmas, along with the significant use of camera-based diagnostics, will generate a vast amount of experimental data to store.
ML can provide solutions for more efficient data storage and representation. Flexible ML-based compression algorithms (e.g. AEs) can be deployed at the diagnostic front-end to reduce bandwidth requirements while retaining the essential information [177], while neural compression and implicit neural representations can provide a more compact and differentiable view of scalar and vector fields [178-180]. Vega [181] discusses methods for data retrieval when massive amounts of data are stored in fusion databases. The issue is particularly relevant for large time series and video data formats, as described in [182], and can be addressed with ML methods that automatically detect potentially relevant events [183].
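As an illustration of the first point, a small autoencoder can be trained so that only the low-dimensional latent code needs to be transmitted or stored, with the decoder used offline to reconstruct the signal; the profile length and latent size below are arbitrary stand-ins.

```python
import torch
from torch import nn

class AE(nn.Module):
    """Toy autoencoder compressing 1D diagnostic profiles (e.g. 256-point signals)
    into an 8-dimensional latent code (a 32x reduction in this stand-in example)."""
    def __init__(self, n_in=256, n_latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_latent))
        self.dec = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(), nn.Linear(64, n_in))
    def forward(self, x):
        return self.dec(self.enc(x))

model = AE()
x = torch.randn(4096, 256)                      # stand-in for raw diagnostic frames
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction error drives training
    loss.backward()
    opt.step()

code = model.enc(x)                 # compact code to transmit or store
reconstruction = model.dec(code)    # recovered offline when needed
```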
Many fusion simulation codes have parameters that can be tweaked to trade off simulation computing cost against simulation fidelity. Solver-in-the-loop approaches incorporate a low-fidelity numerical solver into the training of a neural network model, which learns to correct the low-fidelity solution so as to reproduce the high-fidelity target solution [184]. Similarly, if the simulation model does not match reality, residual physics models supplement the simulation model with learnt corrections to bridge the simulation-to-reality gap [185].
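The correction idea can be sketched as follows: a small network is trained so that the corrected low-fidelity output matches the high-fidelity reference. Both solvers below are toy stand-ins; in a genuine solver-in-the-loop setting the (differentiable) low-fidelity solver is called inside the training loop exactly as shown.

```python
import torch
from torch import nn

def low_fidelity_solver(x):
    """Stand-in for a cheap, coarse solver (e.g. a reduced transport model)."""
    return torch.tanh(x.sum(dim=1, keepdim=True))

def high_fidelity_solution(x):
    """Stand-in for the expensive reference solution used as the training target."""
    return torch.tanh(x.sum(dim=1, keepdim=True)) + 0.2 * torch.sin(3 * x[:, :1])

# Correction network: learns the discrepancy between the cheap solver and the
# reference, so that (cheap solver + correction) approximates the expensive one.
corrector = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)
x = torch.randn(4096, 5)
for _ in range(500):
    opt.zero_grad()
    pred = low_fidelity_solver(x) + corrector(x)
    loss = nn.functional.mse_loss(pred, high_fidelity_solution(x))
    loss.backward()
    opt.step()
```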
In fusion, many tasks can be cast as time series prediction (e.g. a data-driven digital twin). LSTM models have previously been used to predict future plasma states [81,186]. Transformers have demonstrated exceptional performance in natural language processing, speech processing and computer vision, and they have also been proposed for predicting time series [187]. Despite their effectiveness, transformer models have so far been used only infrequently in fusion [188].
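For completeness, a minimal sequence-to-one LSTM predictor of the plasma parameters at the next time step, trained on stand-in shot segments, could look as follows.

```python
import torch
from torch import nn

class NextStateLSTM(nn.Module):
    """Toy sequence-to-one predictor: from a window of past 0D plasma parameters,
    predict the parameters at the next time step."""
    def __init__(self, n_params=8, n_hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_params, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, n_params)
    def forward(self, x):                 # x: (batch, time, n_params)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # use the last hidden state

model = NextStateLSTM()
seq = torch.randn(512, 50, 8)             # stand-in shot segments
target = torch.randn(512, 8)              # stand-in next-step parameters
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(seq), target)
    loss.backward()
    opt.step()
```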

Data availability statement
No new data were created or analyzed in this study.