Efficient training sets for surrogate models of tokamak turbulence with Active Deep Ensembles

Model-based plasma scenario development lies at the heart of the design and operation of future fusion powerplants. Including turbulent transport in integrated models is essential for delivering a successful roadmap towards operation of ITER and the design of DEMO-class devices. Given the highly iterative nature of integrated models, fast machine-learning-based surrogates of turbulent transport are fundamental to fulfil the pressing need for faster simulations opening up pulse design, optimization, and flight simulator applications. A significant bottleneck is the generation of suitably large training datasets covering a large volume in parameter space, which can be prohibitively expensive to obtain for higher fidelity codes. In this work, we propose ADEPT (Active Deep Ensembles for Plasma Turbulence), a physics-informed, two-stage Active Learning strategy to ease this challenge. Active Learning queries a given model by means of an acquisition function that identifies regions where additional data would improve the surrogate model. We provide a benchmark study using available data from the literature for the QuaLiKiz quasilinear transport model. We demonstrate quantitatively that the physics-informed nature of the proposed workflow reduces the need to perform simulations in stable regions of the parameter space, resulting in significantly improved data efficiency compared to non-physics-informed approaches which consider a regression problem over the whole domain. We show up to a factor of 20 reduction in the training dataset size needed to achieve the same performance as random sampling. We then validate the surrogates on multichannel integrated modelling of ITG-dominated JET scenarios and demonstrate that they recover the performance of QuaLiKiz to better than 10%. This matches the performance obtained in previous work, but with two orders of magnitude fewer training data points.


Introduction
Turbulent transport is the dominant transport mechanism in tokamak plasmas. Understanding and predicting it is essential to achieving fusion power [1,2]. Transport models constitute a fundamental tool towards the delivery of ITER, DEMO-class reactors and beyond. However, the computational cost associated with integrating transport models in highly iterative applications such as multichannel integrated models (e.g., JINTRAC [3], RAPTOR [4]) requires the delivery of fast and accurate surrogates, particularly for many-query applications such as simulation uncertainty quantification, scenario optimization and controller design. Feed-forward neural network (NN) surrogate models of the quasi-linear gyrokinetic model QuaLiKiz [5,6,7] and the gyrofluid model TGLF [8,9,10] have shown a factor $10^4$ prediction speedup, thus enabling real-time capable profile prediction [11], discharge optimisation studies [12] and integrated core-pedestal transport models [10,13] at a fraction of the computational cost.
Due to the cost associated with generating large simulation databases to be adopted as training sets for these neural networks, previous works have focused on spanning a small volume in the input space. This restricts some of the current applications to small dimensionality and a narrow range in parameter space [14,15], or to medium dimensionality [7], also sometimes based on experiments [10,16]. At the same time, in [17], where linear GKW [18] simulations were used to derive semi-empirical saturation rules based on JT60 discharges, an increase in data availability was indicated as a major contributor to the success of the derived reduced-order model in integrated models. In recent work on developing bespoke gyrokinetic surrogates for ITER [19], increased data efficiency was also identified as a top priority for the development of surrogate models of higher fidelity gyrokinetic codes.
Big data is not necessarily informative. Indeed, current datasets obtained from experimental parameter spaces are severely oversampled. For example, [20] devised a clustering algorithm for the dataset presented in [16], demonstrating that a well-performing surrogate can be trained on a carefully selected subsample of the full dataset, with up to a factor of 10 reduction in training set size. Thus, the amount of information needed to obtain an actionable surrogate is contained in a significantly smaller subset of current gyrokinetic databases. While extremely useful to uncover the oversampling problem typical of current approaches, the work by [20] was performed a posteriori, once the costly training set had already been generated.
Moreover, by the nature of the critical threshold characteristic of tokamak turbulence, not all plasma states result in unstable modes. In previous work [7,16], the consistency of the surrogate with the critical threshold behaviour was enforced by means of a physics-based loss function that encouraged a controlled extrapolation to negative values where the true output fluxes were null. Negative predictions would then be clipped to zero at inference time. Although effective, this strategy was inefficient, as it resulted in a large fraction of the computational budget being spent to obtain stable modes (roughly 40% in [16] across all the electrostatic modes resolved). Instead, we hypothesise that the boundary manifold between stable and unstable inputs may be learned more efficiently using a separate surrogate model. This idea first appeared in our previous work [21], and it was developed concurrently by Hornsby et al. [22] for data-efficient surrogates of micro-tearing modes.
This study proposes to build NN surrogate models of gyrokinetic turbulence by leveraging Active Learning (AL, [23]) methods. Active Learning is a sequential sampling strategy that queries an expensive black box function (in our case a gyrokinetic model) by means of an acquisition function that identifies regions where additional data would improve the NN performance. Contrary to Bayesian Optimisation approaches, which aim to perform sequential optimisation with only a few function evaluations (see for example [14,24,25,26] for applications relevant to Fusion), Active Learning enables learning of the function to be approximated over the entire parameter space.
Here we develop ADEPT (Active Deep Ensembles for Plasma Turbulence), a two-stage AL framework where a surrogate of the critical gradient threshold in the form of a classifier determines whether a given input will result in growing modes, and a regressor predicts the output turbulent transport fluxes. We focus on an acquisition function that queries inputs for which the output uncertainty of the NN is highest, thus maximising informativeness [27]. Deep Ensembles [28], which provide state-of-the-art uncertainty quantification capabilities for NNs, are adopted as the surrogate model.
We provide a demonstration of the ADEPT pipeline using an existing large database of QuaLiKiz simulations obtained from JET inputs [16]. For this proof-of-concept work we focus on ITG turbulence only. As the input-output mappings are already available in the dataset, we can easily test the performance of the two-stage workflow. Explicit integration of gyrokinetic models in ADEPT will follow in upcoming work.
The paper outline is as follows. We describe the dataset in Section 2, we introduce the ADEPT methodology in Section 3.1 and we outline the integrated modelling framework in Section 4. In Section 5 we give the first main result of the paper. We demonstrate that the inclusion of the classifier stage and the adoption of more powerful deep learning models such as Deep Ensembles are, on their own, sufficient to obtain actionable performance for turbulent transport surrogates with around 200,000 simulations for 15 input dimensions, that is, a reduction of two orders of magnitude from the original dataset. The physics-informed nature of the proposed sampling strategy only queries a minority of the inputs in the stable regions, which are instead dominant in the original dataset, thus enabling the surrogate to focus on accurate modelling of non-zero transport fluxes. Sequentially building the training dataset via AL results in a further large reduction in training sample size. In Section 6 we validate ADEPT on a representative parameter scan and on integrated modelling of ITG-dominated JET scenarios. We find that ADEPT and previous work [16] agree with JINTRAC runs that adopt the original QuaLiKiz model to better than 10%, although ADEPT was trained with two orders of magnitude less data than the surrogates in [16]. Finally, in Section 7 we discuss the results obtained, identify remaining issues and propose potential solutions to be explored in future work.

Data
We use the existing JET-Exp-15D dataset devised in [16,29]. The dataset contains the input-output mappings of the QuaLiKiz [5,6] quasilinear model. The inputs are based on 2135 JET experimental discharges including a variety of plasma scenarios, augmented taking into account measurement uncertainties for the parameters that turbulence is most sensitive to. The dataset generation took approximately 150 kCPUh.
The input space is 15-dimensional and it includes: the species charge number, the species mass number, the fractional species density, the logarithmic electron density gradient, the ion and electron temperature gradients, the rotation Mach number, the rotation gradient, the radial coordinate, the tokamak aspect ratio, the safety factor, the magnetic shear, the pressure gradient (via $\alpha_{\mathrm{MHD}}$), the collisionality and the ExB shearing rate. The output encompasses the multiple channels of transport of ITG, ETG and TEM turbulence obtained from QuaLiKiz. The raw dataset produced from all the available inputs was subjected to consistency checks to either enforce physical consistency within the data (i.e., ambipolar particle fluxes, consistency between the predicted fluxes and the fluxes calculated from combining diffusive and convective terms computed separately) or discard abnormally large heat fluxes and abnormally small particle fluxes; see Table 6 of [16] for more details.
In the remainder of this work the focus will be on ITG turbulence, for which less than 25% of the inputs in the JET-Exp-15D dataset develop turbulent transport. The transport fluxes considered (in GyroBohm units, cf. Table 3 of [16]) are the heat flux of ions ($q_{i,\mathrm{ITG}}$), the heat flux of electrons ($q_{e,\mathrm{ITG}}$), the momentum flux of ions ($\Pi_{i,\mathrm{ITG}}$), the particle flux of electrons ($\Gamma_{e,\mathrm{ITG}}$) and the particle flux of ions ($\Gamma_{i,\mathrm{ITG}}$).

Data-efficient surrogate models
3.1. Active Learning

3.1.1. Basics
Active Learning (AL, e.g., [23] for a review) is a sampling strategy that aims at reducing the amount of training data needed to obtain a well-performing surrogate. An AL system comprises three components: a learner, an oracle and a query strategy. The learner is a ML method, such as a NN or Gaussian Process [30], that improves its performance as more data is collected from the oracle according to the query strategy. The decision on which learner to use depends on the nature of the problem: Gaussian Processes are more suitable in the low-data regime, while NNs are more effective in the big-data limit. The oracle is a costly data acquisition system that provides the training data for the learner; the oracle might be, for example, a simulator (which is the case this paper is focused on) or a human annotator (such as in the GalaxyZoo project, [31]). The main focus of the AL literature is on defining efficient query strategies [23].
AL can be applied in a pool setting and a streaming setting. In the first case, the query strategy acts on a pre-existing pool of unlabelled data (that is, data for which only the inputs are available) and the distribution of the input space is fixed. In the second setting, the decision on which data to label is made on a source of streaming data, potentially from a non-stationary distribution. Although digital twinning applications that build surrogate models from streaming data from fusion devices may benefit from AL, the aim of this paper is to demonstrate the approach in the simpler pool setting.

Maximum informativeness and uncertainty sampling
The goal of AL is to obtain a machine learning predictive model by identifying training points that are more efficient than random selection. Space-filling methods, such as Latin Hypercube Sampling (LHS, [32]), have been shown to improve upon random selection; however, space-filling algorithms sample the input space just once, and therefore do not account for potential redundancy in the information provided by different inputs. A more efficient query strategy consists in maximising the informativeness of the training sample as a whole. The simultaneous placement of N points to obtain optimal coverage of the parameter space of interest is, unfortunately, computationally intractable [33]. Popular alternatives, which include sequential acquisition strategies that account for changes in the model induced by the newly collected training data, are still more advantageous than the fixed design space offered by space-filling algorithms.
The sequential strategy proposed in [27] queries the inputs for which the surrogate model's predictive uncertainty is largest,

$$x^* = \arg\max_{x \in U} \sigma(x; D_{\mathrm{train},t}), \tag{1}$$

$$D_{\mathrm{train},t+1} = D_{\mathrm{train},t} \cup \{(x^*, y(x^*))\}, \tag{2}$$

where $\sigma(x; D_{\mathrm{train},t})$ is the output uncertainty of a learner trained on a dataset $D_{\mathrm{train},t}$, $t$ is the current iteration, $U$ is a pool of inputs (e.g., the plasma states) for which the outputs (e.g., the turbulent fluxes) are not available, and $y(x^*)$ is the output returned by the oracle for the queried input. As indicated in the expressions above, the dataset at the next iteration is enriched with data obtained from the query. The uncertainty is the standard deviation of a regression model.
Here, we adopt Batch Mode AL (e.g., [34]), which consists in performing the acquisition for the M inputs that rank highest in the model uncertainty. Batch Mode AL is more suitable for NNs, as retraining a NN with just one new sample is impractical.
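As an illustration of batch-mode uncertainty sampling, the following minimal sketch ranks a pool of inputs by a user-supplied uncertainty function and keeps the top `batch_size` entries. The names `select_batch`, `sigma` and the toy pool are hypothetical and not taken from the ADEPT code; in practice `sigma` would be the standard deviation predicted by the trained surrogate.

```python
import heapq

def select_batch(pool, sigma, batch_size):
    """Return the indices of the `batch_size` pool points with the largest sigma."""
    scores = [sigma(x) for x in pool]
    return heapq.nlargest(batch_size, range(len(pool)), key=scores.__getitem__)

# Toy usage: a hypothetical uncertainty that grows with distance from x = 0.5.
pool = [0.0, 0.1, 0.45, 0.5, 0.9, 1.0]
picked = select_batch(pool, sigma=lambda x: abs(x - 0.5), batch_size=2)
```

Here the two points furthest from 0.5 (the pool extremes) are selected; a single-point acquisition corresponds to `batch_size=1`.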
The literature on AL strategies is vast (see [35,23] for two excellent reviews). In the following, we will adopt the acquisition function in eq. 1 for the following reasons. First, it is good practice to develop surrogate models that offer uncertainty estimates on their predictions, especially in view of incorporating surrogates of gyrokinetic models, the topic of this paper, into integrated suites to enable uncertainty quantification studies. Moreover, the implementation of uncertainty sampling by exploiting surrogate models with such capabilities is trivial and, as we will show, it performs well in practice. Furthermore, while conceptually very simple, uncertainty-driven AL is widely used with great success in other fields, such as, for example, drug discovery [36].

Figure 1. Given a data pool for which only inputs are available, a classifier evaluates the likelihood of a given input in the pool resulting in unstable modes. The acquisition function is evaluated on the unstable inputs, and a batch of the most uncertain ones is selected to be run through the gyrokinetic model. The newly obtained input-output mappings are used to train both NNs. This strategy is repeated until the computational budget has been exhausted or the performance of the surrogates is deemed actionable.
As a final note, it is worth pointing out that AL tends to induce a shift between the distribution of the unlabelled pool U and that of the training set over time, as only the most informative points are selected for labelling [27,37]. It is therefore crucial to ensure that the NN uncertainties are well-calibrated also out of distribution. A discussion on this matter is carried out in Section 3.3.

Physics-informed Active Learning for gyrokinetic models with ADEPT
Linear gyrokinetic turbulence exhibits a critical gradient behaviour, whereby growing modes and the resulting turbulent transport are triggered only above a certain threshold in the driving gradients that depends on the plasma conditions. This creates a further complication for surrogate models, as the predicted fluxes need to be exactly zero in the stable region to avoid spurious transport that would alter the predictions of integrated models. Previous work [7] showed that the sharp transition between the stable and unstable regions is smoothed out in naive approaches where a single regressor surrogate model is trained on the entire space, which leads to oversmoothing around the critical gradient and therefore to overpredicted transport. The solution proposed in [7] was to identify the critical gradient threshold by encouraging a NN to predict negative values whenever the true flux was null. These were then clipped to zero for use in the integrated model. For positive fluxes, instead, the NN would be trained using a standard Mean Squared Error loss function. The physics-informed training adopted in [7,16] elegantly enables the NN to perform both classification tasks (i.e., whether an input results in growing modes) and regression tasks (to predict the turbulent fluxes). While effective, the method of [7] results in the computational budget spent to obtain the training set being overly focused on points well within the stable region of the input space.
Below we propose ADEPT (Active Deep Ensembles for Plasma Turbulence), a two-stage, physics-informed active learning strategy that delivers a significant reduction in the volume of data required to train well-performing surrogates. Contrary to previous work, we assign the classification and regression tasks to two separate neural networks. This setup preserves the physics-informed nature of the framework proposed by [7], but it splits the burden of identifying the critical gradients and regressing to turbulent fluxes between two highly specialised NNs. Given a data pool U of inputs, a NN classifier and a NN regressor are pretrained on a small (20,000 points) random sample of data for which the input-output mapping is available. This pretraining captures a general initial representation of the space. Thereafter, for each iteration, the networks and the labelled dataset are updated following the strategy shown in Figure 1:
• The classifier is tasked with screening a sample of candidate points in the data pool U. This is the physics-informed stage of the workflow. The entire pool may be screened, but this may slow down the acquisition process in the case of large data pools, such as that of the JET-Exp-15D dataset. Therefore the classifier is used to evaluate 300,000 inputs randomly sampled from the pool;
• The acquisition function in eq. 1 uses the regressor's uncertainty (in our case, the epistemic uncertainty in eq. 8) to select an acquisition batch of candidates. Extending the acquisition strategy to account for the uncertainty of the classifier will be the subject of future work;
• The outputs for the input candidates selected are queried from the model of choice (QuaLiKiz in the case of this paper);
• The newly available input-output mappings are appended to the training data;
• Both the regressor and the classifier NNs are trained again.
The loop above is repeated until the computational budget has been exhausted, or the surrogates have reached the desired performance.
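The loop above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the authors' implementation: the oracle is a hypothetical one-dimensional function with a critical threshold `X_CRIT`, the classifier is a simple threshold halfway between the labelled classes, and the regressor is a bootstrap ensemble of linear fits whose spread stands in for the epistemic uncertainty of a Deep Ensemble. The initial sample is a coarse grid for reproducibility, whereas the text uses a random sample.

```python
import random
import statistics

X_CRIT = 0.75  # toy critical gradient; an assumption of this sketch

def oracle(x):
    """Stand-in for the expensive simulator: zero flux below the threshold."""
    return 10.0 * max(0.0, x - X_CRIT) ** 2

def fit_line(points):
    """Least-squares line through (x, y) pairs; returns (slope, intercept)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def train(labelled, rng, n_members=5):
    """Fit both stand-in models on the labelled (input, flux) pairs."""
    stable = [x for x, y in labelled if y == 0.0]
    unstable = [(x, y) for x, y in labelled if y > 0.0]
    # Classifier stand-in: a threshold halfway between the two classes.
    threshold = (max(stable) + min(x for x, _ in unstable)) / 2
    # Regressor stand-in: bootstrap ensemble of linear fits, unstable data only.
    ensemble = [fit_line(rng.choices(unstable, k=len(unstable)))
                for _ in range(n_members)]
    return (lambda x: x > threshold), ensemble

def epistemic_std(ensemble, x):
    """Spread of the member predictions at x."""
    return statistics.pstdev([a * x + b for a, b in ensemble])

def adept_loop(pool, labelled, n_iterations=5, n_screen=200, batch_size=8, seed=0):
    rng = random.Random(seed)
    pool, labelled = list(pool), list(labelled)
    is_unstable, ensemble = train(labelled, rng)  # pretraining
    for _ in range(n_iterations):
        # Stage 1 (physics-informed): screen a random subset of the pool.
        screened = rng.sample(pool, min(n_screen, len(pool)))
        candidates = [x for x in screened if is_unstable(x)]
        # Stage 2: keep the candidates with the largest regressor uncertainty.
        candidates.sort(key=lambda x: epistemic_std(ensemble, x), reverse=True)
        batch = candidates[:batch_size]
        # Query the oracle and append the new input-output pairs.
        labelled += [(x, oracle(x)) for x in batch]
        queried = set(batch)
        pool = [x for x in pool if x not in queried]
        # Retrain both models on the enlarged training set.
        is_unstable, ensemble = train(labelled, rng)
    return labelled, is_unstable

pool = [i / 999 for i in range(1000)]
initial = [(i / 10, oracle(i / 10)) for i in range(11)]
labelled, is_unstable = adept_loop(pool, initial)
```

Because the regressor only ever ranks points that the classifier flags as unstable, newly acquired samples concentrate above the threshold, which is the mechanism the two-stage design relies on.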
As gyrokinetic turbulence involves multichannel transport, it is necessary to maximise the information gain for all the fluxes, and therefore the acquisition function in eq. 1 becomes (3)

Surrogate uncertainty via Deep Ensembles and its uses within integrated modelling
It has long been established that NN models can give overconfident predictions that are factually wrong (e.g., [38]). Equipping NNs with a notion of uncertainty in their own predictions has since become a mainstream line of research producing a rich literature [39]. The calibration of NN uncertainties is currently debated in the community [40], with new frequentist methods on the rise ([41] and references therein). In particular, although NN uncertainties generally increase moving away from the training distribution across all uncertainty estimation methods, the issue of their calibration remains a point of concern. [28] proposed to train NNs using proper scoring rules [42] to obtain calibrated uncertainties. Given a NN distribution $p_\theta(y|x)$ that approximates the true distribution $q(y|x)$, a scoring rule $S(p_\theta, (x, y))$ assigns to a learned supervised model a score based on the quality of the model's uncertainty for a particular input-output pair. A scoring rule can be formalised as a global metric by integrating over the full probability space,

$$S(p_\theta, q) = \int q(x, y)\, S(p_\theta, (x, y))\, \mathrm{d}x\, \mathrm{d}y.$$
A scoring rule is strictly proper if $S(p_\theta, q) \leq S(q, q)$ with equality if and only if $p_\theta = q$; that is, the learned approximation $p_\theta$ attains the best score only when it perfectly reproduces $q$.
A NN trained with a proper scoring rule is encouraged to provide better calibrated uncertainties compared to one trained on MSE. The standard MSE loss routinely used to train NNs (see eq. 9) is not a strictly proper scoring rule [43], and therefore cannot provide calibrated uncertainties out of the box. Instead, [28] showed that the log-likelihood of the data under a learned NN, $\log p_\theta(y|x)$, is always a strictly proper scoring rule and it provides calibrated uncertainties also in practice. Ensembles of NNs trained with a proper scoring rule are termed Deep Ensembles. If the ensemble is treated as a uniformly weighted mixture model, then the proper scoring rule for the ensemble is the log-likelihood of the data under that mixture, where we have taken the negative of the log-likelihood as an objective to minimise. For classification, the usual binary cross-entropy loss is also a proper scoring rule, and therefore deep ensembles and regular NN ensembles coincide. For regression problems, the Gaussian negative log-likelihood below is a proper scoring rule,

$$-\log p_\theta(y|x) = \frac{1}{2}\log \sigma_\theta^2(x) + \frac{\left(y - \mu_\theta(x)\right)^2}{2\sigma_\theta^2(x)} + \mathrm{const}. \tag{6}$$

A NN with two output neurons trained with the objective above will explicitly learn the mean $\mu_\theta(x)$ and variance $\sigma_\theta^2(x)$, where the subscript indicates that these quantities are parametrised by the same NN with parameters $\theta$. With this expression, the NN is encouraged to learn that, in order to have a low variance to minimise the first term of eq. 6, the predictions $\mu_\theta$ need to be very accurate to keep the second term small.
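The trade-off between the two terms can be made concrete with a small numerical sketch. It assumes the standard per-sample Gaussian negative log-likelihood, $\frac{1}{2}\log\sigma^2 + \frac{(y-\mu)^2}{2\sigma^2}$, with the additive constant dropped; the function name and the toy numbers are illustrative only.

```python
import math

def gaussian_nll(y, mu, var):
    """Per-sample Gaussian negative log-likelihood, without the constant 0.5*log(2*pi)."""
    log_var_term = 0.5 * math.log(var)      # rewards small predicted variance
    error_term = 0.5 * (y - mu) ** 2 / var  # penalises errors, more so when var is small
    return log_var_term + error_term

# Same prediction error, different claimed variance.
confident_wrong = gaussian_nll(y=1.0, mu=0.0, var=0.01)
cautious_wrong = gaussian_nll(y=1.0, mu=0.0, var=1.0)
# Accurate prediction with a small claimed variance.
confident_right = gaussian_nll(y=1.0, mu=1.0, var=0.01)
```

A confident but wrong prediction is penalised far more heavily than a cautious wrong one, while a confident and accurate prediction attains the lowest loss of the three.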
The mean $\mu_E$ and variance $\sigma_E^2$ of the deep ensemble as a whole can be computed under the assumption of a uniformly weighted mixture of M members:

$$\mu_E(x) = \frac{1}{M}\sum_{m=1}^{M} \mu_{\theta_m}(x), \tag{7}$$

$$\sigma_E^2(x) = \frac{1}{M}\sum_{m=1}^{M} \sigma_{\theta_m}^2(x) + \frac{1}{M}\sum_{m=1}^{M} \mu_{\theta_m}^2(x) - \mu_E^2(x). \tag{8}$$

On the other hand, [16] used a slightly different NN ensembling approach compared to Deep Ensembles to obtain a notion of uncertainty. The approach consisted in training a committee of ten NNs with identical architecture but different random initialisation, and the mean and variance of the predictions were then used for downstream applications. The NNs in [16] were trained to minimise the Mean Squared Error (MSE) between each NN prediction, $\hat{y}$, and the target, $y_{\mathrm{true}}$,

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{y}_i - y_{\mathrm{true},i}\right)^2. \tag{9}$$

Note that the expression in eq. 6 allows for heteroskedasticity in the variance estimate (i.e., the variance can vary based on each individual input, and this is explicitly modelled). It is important to realise that, without this feature, the expressions in eq. 6 and eq. 9 would coincide (up to a constant) after identifying $\mu_\theta \equiv \hat{y}$. Although this may seem only a subtle difference between Deep Ensembles and regular NN committees, the objective in eq. 9 does not explicitly capture NN uncertainty. Therefore, the uncertainties obtained by considering the standard deviation of the ensemble outputs are not guaranteed to be valid. Instead, training Deep Ensembles involves the optimisation of the negative log-likelihood, which improves MSE with the constraint of fitting sensible uncertainty estimates. Hence, Deep Ensembles strike a balance between uncertainty quantification capabilities and predictive power, which are both equally important in downstream applications.

The variance of a Deep Ensemble regressor (eq. 8) is composed of two contributions. The first one is the average variance between all members. The second one is the variance of the means of the ensemble, as shown in the last two terms on the right hand side of eq. 8. The uncertainty of the deep ensembles (eq. 8) is sometimes interpreted as the sum of the epistemic uncertainty (i.e., the uncertainty of the model) and the aleatoric uncertainty (i.e., the irreducible noise in the data), e.g. [44]. The epistemic uncertainty is the natural choice to use in the acquisition function, as we seek to improve the inherent accuracy of the model regardless of data noise [45]. Conversely, the total uncertainty should be used to assess how much trust should be placed in the surrogate predictions for downstream applications such as integrated models.
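The decomposition can be illustrated with a short sketch for a single input. It assumes the uniformly weighted mixture described above and follows the common interpretation in which the average member variance is labelled aleatoric and the variance of the member means is labelled epistemic; the function name and the toy member outputs are hypothetical.

```python
import statistics

def ensemble_moments(mus, variances):
    """Mixture mean and variance for M members predicting (mu_m, var_m) at one input."""
    mu_e = statistics.fmean(mus)
    # Average of the member variances (commonly read as aleatoric).
    aleatoric = statistics.fmean(variances)
    # Variance of the member means (commonly read as epistemic).
    epistemic = statistics.fmean([m * m for m in mus]) - mu_e ** 2
    return mu_e, aleatoric, epistemic, aleatoric + epistemic

# Toy example: three members that disagree on the mean.
mu_e, aleatoric, epistemic, total = ensemble_moments(
    mus=[1.0, 2.0, 3.0], variances=[0.1, 0.2, 0.3]
)
```

In an acquisition step only `epistemic` would be ranked, while `total` is the quantity a downstream user would inspect before trusting a prediction.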
Uncertainty quantification capabilities are also a natural feature of the classifier NN. The confidence of the classifier can be defined as its output probability of a point being unstable. Probabilities close to a value of 0.5 inform downstream applications that performing a run of the original QuaLiKiz model is recommended. Entropy [46], which measures the disagreement between the members of the ensemble, may also be used as an information-theoretical measure of uncertainty; it is computed from $p_i(x)$, the output probability of the $i$-th ensemble member. Both probability and entropy will be shown as measures of uncertainty for the classifier for a few parameter scans in Section 6.1.
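A minimal sketch of both classifier uncertainty measures is given below. The exact entropy expression used in the paper is not reproduced here, so the sketch assumes one common choice, the binary entropy of the ensemble-averaged probability; the `margin` rule for flagging inputs near 0.5 is likewise a hypothetical illustration.

```python
import math

def mean_probability(member_probs):
    """Ensemble-averaged probability of the input being unstable."""
    return sum(member_probs) / len(member_probs)

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def needs_simulator(member_probs, margin=0.1):
    """Flag inputs whose mean probability lies within `margin` of 0.5 (hypothetical rule)."""
    return abs(mean_probability(member_probs) - 0.5) < margin

ambiguous = [0.4, 0.6, 0.5, 0.55, 0.45]
confident = [0.95, 0.97, 0.96, 0.98, 0.94]
h_ambiguous = binary_entropy(mean_probability(ambiguous))
h_confident = binary_entropy(mean_probability(confident))
```

The ambiguous input has an entropy close to the maximum of $\ln 2$ and is flagged for a simulator run, while the confident one is not.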

Details of the training procedure
We borrow from [7] the idea of fitting NN surrogate models to the "leading flux" of a given turbulence type and the flux ratio between the leading flux and the secondary fluxes. This methodology was devised to ensure the same critical gradient behaviour for all fluxes of a given turbulent mode. While this is not strictly necessary in our case, as the classifier takes care of identifying the critical gradients, we opted for this option to minimise changes in the JINTRAC integration. We train a suite of deep ensembles, each regressing to one turbulent flux, and one deep ensemble classifier for the stability boundary. We adopt 5 ensemble members per model, each consisting of 8 layers with 512 parameters each and ReLU activation functions. Each model is trained for 200 epochs with 100 epochs of patience, a weight decay of $\lambda = 10^{-4}$ and a batch size of 512. For the regressor we adopt the NLL loss and for the classifier the binary cross-entropy loss. Each acquisition batch consists of 512 training samples, doubling every 30 acquisitions due to the costs associated with retraining NNs on large datasets.
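The acquisition schedule can be written down explicitly. The sketch below assumes 0-indexed acquisitions, a doubling that takes effect at acquisitions 30, 60, and so on, and the 20,000-point pretraining sample mentioned earlier; the function names are hypothetical and the indexing convention is an assumption rather than a detail stated in the text.

```python
def acquisition_batch_size(iteration, base=512, double_every=30):
    """Batch size at a 0-indexed acquisition: `base`, doubling every `double_every` steps."""
    return base * 2 ** (iteration // double_every)

def total_labelled(n_acquisitions, pretrain=20_000):
    """Training-set size after `n_acquisitions` acquisitions, including pretraining data."""
    return pretrain + sum(acquisition_batch_size(i) for i in range(n_acquisitions))

size_after_60 = total_labelled(60)
```

Under these assumptions the first 30 acquisitions add 512 points each and the next 30 add 1,024 each, so `size_after_60` is 20,000 + 15,360 + 30,720 = 66,080.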
Extensive hyperparameter tuning was not performed in this study. Optimizing hyperparameters should ideally occur during each iteration of AL. However, even in AL research this is rarely done due to its high computational cost: performing hyperparameter tuning at every iteration can be prohibitively time-consuming, especially as the training dataset grows. Nevertheless, the extra computational cost incurred is expected to be small compared to the data acquisition for expensive high fidelity codes (e.g., [47,18]).

Integrated modelling
The JINTRAC integrated modelling suite [3] was chosen for this study both due to its history of integration with QLKNN [16,7] and its relevance for ITER scenarios [48]. The JINTRAC integrated model test cases and settings used in this study were taken from those used to validate QLKNN-jetexp [16], specifically selecting:

The particle transport options are only applicable when using the QLKNN model. Further details about the different options available within QLKNN are given in Ref. [16].
A summary of the major JINTRAC settings used for these test cases is provided in Table 1. In the following Sections we will compare the results of adopting ADEPT, QLKNN-jetexp or the original QuaLiKiz within JINTRAC. Specifically, the critical gradient threshold in ADEPT will be estimated using the trained classifier, while it is estimated according to the methodology summarised in Section 3.2 for the QLKNN-jetexp surrogates.
As outlined in Section 4.2 of Ref. [16], the ion transport coefficients for the JET-Exp-15D dataset were derived only for a pure deuterium plasma, and therefore some assumptions need to be made to model impurity transport.While the differences in heat transport among different ion species can usually be neglected, this is not typically true for particle fluxes [49].
A first condition to allow treatment of the particle transport coefficients of impurities stems from the ambipolarity constraint,

$$\Gamma_e = \sum_s Z_s \Gamma_s, \tag{11}$$

where the sum runs over the ion species $s$ with charge number $Z_s$. However, a second condition needs to be specified for eq. 11 to admit a unique solution. In this work, we follow Ref. [16] and assume a proportionality between the electron and the impurity particle fluxes.

ADEPT generates surrogate models that inherently include a measure of uncertainty. This characteristic can be leveraged in integrated modelling to evaluate the level of confidence that should be placed in the surrogate model's predictions, and to perform a QuaLiKiz run whenever the uncertainty of the surrogate is not considered acceptable. An in-depth study of the impact of the precise acceptance threshold on the integrated modelling results is outside the scope of this work, but it is highly recommended for future investigation. In particular, in this study the average predictions of the surrogates are used regardless of the surrogate uncertainty.
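The uncertainty-gated fallback described above, which is not used in this study, could look like the following sketch. The callables `surrogate` and `simulator`, the threshold `max_std` and the toy models are all hypothetical placeholders; choosing the acceptance threshold is exactly the open question the text leaves for future work.

```python
def predict_flux(x, surrogate, simulator, max_std):
    """Use the surrogate mean unless its uncertainty exceeds `max_std`."""
    mean, std = surrogate(x)
    if std > max_std:
        # Uncertainty not acceptable: fall back to the original model.
        return simulator(x), "simulator"
    return mean, "surrogate"

# Toy stand-ins: the surrogate is confident only for x < 1.
toy_surrogate = lambda x: (2.0 * x, 0.05 if x < 1.0 else 0.5)
toy_simulator = lambda x: 2.0 * x + 0.1

inside = predict_flux(0.5, toy_surrogate, toy_simulator, max_std=0.1)
outside = predict_flux(1.5, toy_surrogate, toy_simulator, max_std=0.1)
```

With these toy models the first call returns the surrogate mean and the second one triggers a simulator run.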

Results: data-efficient training sets with active learning
An important benchmark of performance of any Active Learning strategy is given by random sampling.Specifically, the extra costs incurred in Active Learning due to retraining the surrogate at every acquisition are justified solely if Active Learning outperforms random sampling, which requires training only once.The metrics used to assess the surrogate performance are described in detail in Appendix B.
Figure 2 shows the performance of ADEPT on ITG turbulence compared to random selection as a function of the number of training samples collected. For both ADEPT and random selection the same NN architectures and training hyperparameters were adopted. It can be seen that ADEPT provides up to a factor of 20 data reduction compared to random sampling. As shown in Section 5.1, an important contribution to this success is the inclusion of the classifier stage, which allows for a more data-efficient learning of the manifold where unstable turbulent fluxes develop.
A performance comparison on a test set between ADEPT and the surrogates presented in [16], which were trained using approximately 20,000,000 data points, is given in Tables 2 and 3. Although the surrogates in Ref. [16] do not explicitly employ a separate classifier NN to model the critical gradient, they achieve a comparable effect by zeroing out all fluxes when the leading flux is predicted to be negative. Therefore, we can evaluate the F1 performance of these surrogates, as they effectively exhibit classifier-like behavior in this context.

Table 2. The performance of ADEPT trained with up to 200,000 samples compared to the NNs presented in [16] (where 20,000,000 samples were used), in terms of R² score for the fluxes (see Appendix B for a description of the performance metrics).

Surrogate  q_{i,ITG}  q_{e,ITG}/q_{i,ITG}  Γ_{i,ITG}/q_{i,ITG}  Γ_{e,ITG}/q_{i,ITG}  Π_{i,ITG}/q_{i,ITG}
ADEPT      0.9585     0.9366               0.9174               0.9554               0.9108
[16]       0.9518     0.9505               0.6987               0.5268               0.9140

Table 3. The performance of ADEPT trained with up to 200,000 samples compared to the NNs presented in [16] (where 20,000,000 samples were used), for the classifier (see Appendix B for a description of the performance metrics).

Surrogate  F1      Recall  Precision
ADEPT      0.9504  0.9412  0.9602
[16]       0.7791  0.9880  0.6431

It can be seen that the performance of ADEPT in terms of the F1 score is comparable or superior to that of the NNs in Ref. [16], albeit with two orders of magnitude fewer data. In particular, a high classifier performance in the case of ADEPT is crucial in ensuring that sampling does not occur deep in the stable regions. As a demonstration, we have computed that the contribution of stable inputs to the training set of the classifier is 75% in the random sampling case (and, indeed, the case of Ref. [16]), while this drops to around 20% in the case of ADEPT. The surrogates of [16], instead, feature a poor Precision, showing a high number of non-zero flux predictions in the stable region. The latter surrogates, however, achieve a better Recall, albeit with two orders of magnitude more training data points. ADEPT would need more data to reach the same kind of performance. As further discussed in Section 7, the classifier is not currently included in the acquisition function explicitly; including it will be crucial to improve the data efficiency of the classifier compared to random sampling. This feature will be explored in future work.
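The classifier-like behaviour attributed to the surrogates of Ref. [16] can be illustrated with a minimal sketch. The function name, array shapes and the strict `> 0` threshold are assumptions made for illustration only; this is not the implementation used in [16].

```python
import numpy as np

def clip_by_leading_flux(leading_flux, other_fluxes):
    """Zero all fluxes wherever the predicted leading flux is not positive.

    leading_flux: shape (n_samples,), predicted leading flux (e.g. q_{i,ITG}).
    other_fluxes: shape (n_samples, n_other), remaining predicted fluxes.
    Returns the implied unstable mask and the clipped fluxes.
    """
    leading_flux = np.asarray(leading_flux, dtype=float)
    other_fluxes = np.asarray(other_fluxes, dtype=float)
    # Assumption: a prediction of exactly zero is treated as stable.
    unstable = leading_flux > 0.0
    clipped_leading = np.where(unstable, leading_flux, 0.0)
    # Broadcast the per-sample mask over the remaining flux channels.
    clipped_others = np.where(unstable[:, None], other_fluxes, 0.0)
    return unstable, clipped_leading, clipped_others
```

The boolean mask returned by this kind of clipping can be compared with the true stability labels, which is what makes it possible to quote F1, Recall and Precision for a surrogate that has no explicit classifier.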

The effect of abandoning the physics-informed approach
The importance of ensuring that the critical gradient of turbulent transport is preserved by surrogates was already discussed in [7]. There, a "naive" regressor surrogate model that predicted all output fluxes, including in the stable region, and without the clipping strategy for negative leading fluxes proposed in [7], was shown to oversmooth the critical gradient behaviour and to produce unphysical results within integrated modelling.
In this Section we further demonstrate the following two points: (i) providing an estimate for the critical gradient (i.e., utilising a physics-informed approach) results in increased data efficiency within ADEPT compared to naive surrogates, and (ii) as a consequence of (i), the seemingly good integrated performance of the naive approach masks poor performance in the unstable region compared to ADEPT.
To this end, we performed an experiment where Active Learning was run in a naive fashion, where one regressor was trained on both stable and unstable regions, using only the regressor uncertainty to drive the acquisition. No classifier was used for this experiment. The test set that is natural to use for this method is drawn from the entire space (red line in Figure 3) and, at face value, the performance of the naive methodology seems actionable. It is however instructive to inspect the performance solely on the unstable region. Figure 3 demonstrates that the data efficiency of the naive method degrades significantly when specifically tested on unstable inputs. A crucial observation that justifies the observed behaviour is that the JET-Exp-15D dataset used in this work contains a significant proportion of stable inputs, accounting for over 75% of the data available for ITG turbulence. Thus, the representation learned by the naive approach is not capable of accurately capturing the mapping for both stable and unstable regions.

Figure 3. The performance of the regressor for q_{i,ITG} in the physics-informed two-stage ADEPT workflow (teal) versus a naive approach where both zero and non-zero fluxes are fit by the same regressor NN (red). Curves shown: QLKNN-ADEPT (tested on unstable region), Naive (tested on entire space), Naive (tested on unstable region). It can be seen that, for a fixed amount of training data, when tested on the unstable region only (gray line), the naive approach achieves a much poorer performance than the physics-informed approach.
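The difference between testing on the entire space and testing on the unstable region only can be reproduced with a toy calculation. The numbers below are hypothetical and chosen by hand; they are not taken from the JET-Exp-15D dataset, and the sketch only assumes the usual definition of the R² score.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy test set: 75% stable inputs (zero flux), 25% unstable.
y_true = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 8.0])
# Hypothetical naive predictions that smooth across the stability boundary.
y_naive = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 3.0, 6.0])

unstable = y_true > 0.0
r2_all = r2_score(y_true, y_naive)
r2_unstable = r2_score(y_true[unstable], y_naive[unstable])
print(f"R2 on entire space:    {r2_all:.3f}")
print(f"R2 on unstable region: {r2_unstable:.3f}")
```

In this toy case the score over the entire space is about 0.895, while the score restricted to the unstable inputs is 0.375. The many easy zero-flux points inflate the variance in the denominator and hide the error made where the fluxes are non-zero.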
In contrast, the classifier stage of ADEPT helps prevent querying points inside the stable region and allows the regressor to focus on the unstable region, thus resulting in improved data efficiency. In line with [7], our findings show that integrated performance metrics must be handled with care when assessing suitability for downstream applications.

Training dynamics dependence on number of fluxes.
The acquisition function in Eq. 3 fully accounts for the multichannel nature of gyrokinetic turbulent transport. It is instructive to inspect the training dynamics induced by the training samples collected iteratively by the acquisition function when using a different number of fluxes.
In Figure 4 we show the test performance of the two-stage ADEPT pipeline when only the leading flux, q_{i,ITG}, is used. We compare the results to the case where all five fluxes are considered. While the performance on predicting the fluxes in the unstable region is greatly improved compared to the multichannel case, it can be seen that the classifier performance degrades significantly, performing even worse than random sampling. A possible explanation for this behaviour is that in the multichannel case the uncertainties from the different fluxes combine to query a batch that carries high information for the classifier, but not for the regressor surrogate of q_{i,ITG}.
Ultimately, the patterns evident in Figure 4 are driven by the acquisition function, which relies solely on the uncertainty of the regressors. We believe that this behavior can be controlled by developing an alternative acquisition function that explicitly takes into account classifier uncertainty.
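A minimal sketch of an acquisition step of this kind is given below. It assumes that the regressor uncertainty is approximated by the standard deviation across ensemble members, that the per-flux uncertainties are summed, and that only pool inputs classified as unstable are eligible. The function name, the array layout and the 0.5 probability threshold are hypothetical, and the sketch is not a transcription of Eq. 3.

```python
import numpy as np

def select_batch(ensemble_preds, p_unstable, batch_size, threshold=0.5):
    """Pick the most uncertain pool inputs among those classified as unstable.

    ensemble_preds: shape (n_members, n_pool, n_fluxes), regressor outputs.
    p_unstable:     shape (n_pool,), classifier probability of instability.
    Returns the pool indices of the selected batch, most uncertain first.
    """
    ensemble_preds = np.asarray(ensemble_preds, dtype=float)
    p_unstable = np.asarray(p_unstable, dtype=float)
    # Assumed uncertainty proxy: spread of the ensemble members per flux.
    sigma = ensemble_preds.std(axis=0)      # shape (n_pool, n_fluxes)
    # Multichannel score: sum of the uncertainties over all fluxes.
    score = sigma.sum(axis=1)               # shape (n_pool,)
    # Only inputs predicted to be unstable are candidates for new runs.
    candidates = np.flatnonzero(p_unstable > threshold)
    order = np.argsort(score[candidates])[::-1]
    return candidates[order[:batch_size]]
```

Passing an array with a single flux channel reproduces the leading-flux-only setting, so the same function can be used to compare the two acquisition regimes discussed here.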

Results: Validation
Bearing in mind that a large-scale evaluation study including uncertainty quantification is outside the scope of the present paper, in this Section we validate ADEPT on parameter scans (Section 6.1) and JINTRAC modelling of selected JET discharges (Section 6.2), as introduced in Section 4. We use surrogates that were trained on a final dataset of 200,000 input-output pairs collected using the ADEPT strategy and compare their performance to the surrogates of [16] (QLKNN-jetexp in the following), which were trained using approximately 20,000,000 input-output pairs.

Parameter scans
In this Section, we validate the ADEPT surrogates on parameter scans obtained by running the original QuaLiKiz model. For each output flux, we fix 14 of the 15 input dimensions of the dataset to their median value and we perform a scan in the remaining dimension. While both surrogates broadly follow the QuaLiKiz trends, there are important differences in how the two approaches perform around the critical gradient. In particular, QLKNN-jetexp tends to provide a smoother behaviour, while the two-stage nature of ADEPT sometimes results in overly sharp discontinuities (see Figure C1). However, in some instances (see, e.g., Figures C2, C4) QLKNN-jetexp oversmooths the trends around the critical gradient, although it does so outside the training distribution. It is also important to note that the classifier uncertainty for ADEPT peaks around the critical gradient, which is highly desirable as it provides a way to refine the critical gradient estimation. This is a new feature that was not present in QLKNN-jetexp.
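The construction of the scan inputs can be sketched as follows. The function name, the number of scan points and the choice of scanning between the minimum and maximum of the dataset in the selected dimension are assumptions made for illustration; only the use of the median for the fixed dimensions comes from the text.

```python
import numpy as np

def make_scan(inputs, scan_dim, n_points=50):
    """Build a 1D parameter scan around the median point of a dataset.

    inputs:   shape (n_samples, n_dims), e.g. n_dims = 15.
    scan_dim: index of the dimension to scan; all others stay at the median.
    """
    inputs = np.asarray(inputs, dtype=float)
    median = np.median(inputs, axis=0)
    # Repeat the median point once per scan step.
    scan = np.tile(median, (n_points, 1))
    # Assumed scan range: the extent of the dataset in the scanned dimension.
    lo, hi = inputs[:, scan_dim].min(), inputs[:, scan_dim].max()
    scan[:, scan_dim] = np.linspace(lo, hi, n_points)
    return scan
```

Each row of the returned array can be passed both to the original model and to the surrogates, so that the predicted fluxes can be compared point by point along the scanned dimension.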

Validation on ITG-dominated JET discharges
Figures 6 and 7 show the steady-state profiles obtained by adopting ADEPT, the original QLKNN-jetexp surrogates of Ref. [16] and the original QuaLiKiz model. The experimental data is shown for reference; however, the purpose of this test is to verify whether the surrogate models are able to reproduce the behaviour of QuaLiKiz within JINTRAC. Table 4 provides the profile-averaged relative RMS (RRMS) for these JINTRAC runs using their respective networks, with the QLKNN-jetexp reference given inside the square brackets. The RRMS is computed from a sum over the radial points of each profile, where Y_{NN,i} and Y_{QLK,i} indicate the profiles computed using the NN prediction and QLK, respectively. Both surrogate models achieve a match with QuaLiKiz that is better than 10%. The experiments carried out in this Section suggest that both the ADEPT and QLKNN-jetexp models can serve as drop-in replacements for the original transport model for obtaining steady-state profiles, even though the training dataset acquired by ADEPT was two orders of magnitude smaller than that of QLKNN-jetexp.
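A profile-averaged relative RMS can be computed along the following lines. The normalisation used here, a point-wise relative difference averaged over the radial points before taking the square root, is an assumption; it is one common form and may differ from the exact definition used for Table 4.

```python
import numpy as np

def rrms(y_nn, y_qlk):
    """Relative RMS difference between two profiles on the same radial grid.

    Assumed form: sqrt(mean(((Y_NN,i - Y_QLK,i) / Y_QLK,i)**2)).
    y_qlk is assumed to be non-zero at every radial point.
    """
    y_nn = np.asarray(y_nn, dtype=float)
    y_qlk = np.asarray(y_qlk, dtype=float)
    rel = (y_nn - y_qlk) / y_qlk
    return float(np.sqrt(np.mean(rel ** 2)))
```

With this convention, a surrogate-driven profile that differs from the QuaLiKiz-driven one by 10% at every radial point gives a value of 0.1, which corresponds to the 10% level quoted in the text.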

Summary, conclusions and future work
We presented ADEPT (Active Deep Ensembles for Plasma Turbulence), a two-stage physics-informed Active Learning framework for data-efficient surrogate models of gyrokinetic turbulence. ADEPT consists of a classifier NN that learns the boundary manifold between regions that are stable and unstable under linear gyrokinetic turbulence, thus simultaneously providing a model for the critical gradient and limiting the search space of a second surrogate regressing to the turbulent transport fluxes. Using an existing large dataset of QuaLiKiz simulations based on experimental JET discharges [29,16], we have demonstrated a sizeable reduction in the size of the training dataset needed to obtain surrogates with integrated performance metrics comparable to surrogates trained by sampling at random. The reduction can reach a factor of 10 or more, and it is due to both the adoption of Active Learning and the physics-informed nature of ADEPT enforced by the classifier. The classifier alone results in increased data efficiency, as only a minority of the QuaLiKiz runs would have been performed deep in the stable regions of the parameter space, thus limiting the need to run costly and uninformative simulations. Compared to previous work, ADEPT delivers similar or superior performance while using two orders of magnitude fewer data points and a proportionally lower compute time. We also showed agreement with QuaLiKiz and previous surrogates in parameter scans and in integrated modelling applications to two very different JET discharges. The classifier stage of ADEPT may be relevant for any other model where restricting the parameter space to a certain region with desirable properties is useful, such as the case of building surrogate models of codes modelling magnetohydrodynamic instabilities [50].
While our results are extremely encouraging, obtaining a well-performing surrogate valid over the sizeable but not extreme parameter space of JET still required of the order of hundreds of thousands of simulations, even with Active Learning in a physics-informed setting. Much more efficient strategies should be employed to deliver actionable surrogates of higher fidelity models with high dimensionality and over wide parameter spaces. For instance, acquisition functions that do not employ the surrogate uncertainty should also be considered (e.g., [51,52,34]). Moreover, as noted in Section 5.2, the acquisition function adopted in this work depends solely on the sum of the uncertainties of the NNs regressing to the turbulent fluxes. As a result, the performance of the classifier for a fixed amount of training data depends on the number of fluxes that contribute to the acquisition function; explicitly accounting for classifier uncertainty in the acquisition function may alleviate or resolve this issue. Lastly, distances between the requested sample and the training distribution may also be explored as a means to quantify uncertainty [53], as well as to discourage (or encourage) exploration out of the pool distribution in the acquisition function.
In this work, we are interested in using NNs for supervised learning. In supervised learning, a machine learning algorithm is trained on a dataset for which both x_train and y_train are known. In probabilistic terms, the algorithm learns the distribution p(y|x) of labels y given an input x. During training, the discrepancy between the output of the NN, ŷ|x_train, and the true output y_train is quantified by means of a loss function, and this information is used to adjust the weights and biases of the NN.
Supervised learning includes both regression and classification tasks. For regression, the labels y ∈ R are real numbers which can assume any value in the real domain. Instead, in a classification task the labels are discrete.
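As an illustration of the distinction between the two tasks, the sketch below evaluates two commonly used loss functions: the mean squared error for regression and the binary cross-entropy for a two-class classifier. These are generic textbook choices, shown under the assumption that they are representative; they are not necessarily the losses used to train the ADEPT networks.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error between real-valued labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy between 0/1 labels and predicted probabilities."""
    labels = np.asarray(labels, dtype=float)
    # Clip to avoid taking the logarithm of exactly 0 or 1.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))
```

In the regression case the labels are continuous fluxes, whereas in the classification case the labels are discrete, for example 1 for an unstable input and 0 for a stable one.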

Appendix B. NN integrated performance metrics
We evaluate the surrogates in terms of the R² score for the regressors and the F1 score for the classifier, defined as follows:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where y_i and ŷ_i are the true and predicted values of the target flux for the i-th test sample. The F1 score is more suitable than Accuracy to evaluate performance on imbalanced datasets like ours, where only 25% of inputs are unstable, and it is in general recommended for AL workflows as the relative proportion of positive and negative labels (i.e., unstable and stable regions in our case) is unknown a priori. To be more precise, Accuracy provides a measure of correct predictions across both positive and negative labels. In situations where datasets are heavily imbalanced, the model might focus solely on performing well with the majority class, resulting in seemingly good or even almost perfect performance, while the minority class is never predicted correctly. Thus, Accuracy alone does not shed light on the occurrences of false positives and false negatives, which are equally important to capture. Recall specifically addresses false negatives, indicating how often the model erroneously classifies something as negative when it is actually positive. Conversely, Precision deals with false positives, revealing how frequently the model incorrectly labels something as positive when it is truly negative. Therefore, in regions of instability, having a high Recall is beneficial as it maximizes the detection of truly unstable points while minimizing false negatives. Conversely, in stable regions a high Precision is valuable because it reduces the occurrence of spurious fluxes (see also Appendix G of [7]). The F1 score is the harmonic mean of Precision and Recall.
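The argument about Accuracy on imbalanced data can be checked with a small sketch that uses the standard definitions of Precision, Recall and F1 in terms of TP, FP and FN. The labels below are hypothetical, and returning 0 when a denominator vanishes is a convention assumed here for convenience.

```python
import numpy as np

def classifier_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall and F1 for binary labels (1 = unstable)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))      # unstable, predicted unstable
    fp = int(np.sum(~y_true & y_pred))     # stable, predicted unstable
    fn = int(np.sum(y_true & ~y_pred))     # unstable, predicted stable
    tn = int(np.sum(~y_true & ~y_pred))    # stable, predicted stable
    # Assumed convention: metrics with a zero denominator are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": (tp + tn) / y_true.size,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set with 25% unstable inputs.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_stable = classifier_metrics(y_true, [0, 0, 0, 0, 0, 0, 0, 0])
mixed = classifier_metrics(y_true, [1, 0, 1, 0, 0, 0, 0, 0])
```

Both hypothetical predictors reach an Accuracy of 0.75, yet the one that always predicts the majority (stable) class has an F1 of 0, while the second has an F1 of 0.5. This is the behaviour that motivates reporting F1 rather than Accuracy.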

Figure 1. Schematic diagram of the two-stage physics-informed AL workflow used in this work. Given a data pool for which only inputs are available, a classifier evaluates the likelihood of a given input in the pool resulting in unstable modes. The acquisition function is evaluated on the unstable inputs, and a batch of the most uncertain ones is selected to be run through the gyrokinetic model. The newly obtained input-output mappings are used to train both NNs. This strategy is repeated until the computational budget has been exhausted or the performance of the surrogates is deemed actionable.

Figure 2. ADEPT vs random sampling performance. We use the R² score for the deep ensembles regressing to the electron heat flux and the flux ratios involving the ion heat flux and the ion momentum flux. The F1 score for the stability boundary classifier is also shown. The shaded areas represent the standard deviation of 5 runs with different random seeds. Active Learning improves data efficiency by at least a factor of 2 (and up to a factor of 20) compared to random sampling. ADEPT acquisitions were run until exhaustion of the computational budget (36 hours).
Figure 4. The performance of the ADEPT pipeline for q_{i,ITG} when the surrogates are trained either on q_{i,ITG} only (black lines) or on all five fluxes considered (teal lines, reproduced from Figure 2 for convenience). The behaviour of both the regressor and the classifier depends strongly on the number of fluxes used. ADEPT acquisitions for both the black and teal lines were run until exhaustion of the computational budget (36 hours).

Figure 5. Parameter scans in R/L_{Ti} for the five ITG fluxes used in this work. The QuaLiKiz runs are shown in orange, while the predictions of the surrogates are shown in teal for ADEPT and magenta for QLKNN-jetexp. The shaded areas indicate the 1σ confidence levels. Dotted, solid and dashed lines indicate the 2.5%, 50% and 97.5% percentiles of the distribution in R/L_{Ti}. The bottom right panel shows the uncertainty for the Deep Ensemble classifier in terms of the probability of an input being unstable and the entropy of the ensemble. Note that the uncertainty estimates provided by the committees in [16] and by ADEPT differ significantly. See main text for discussion.

Figure 6. Comparison of the steady-state profiles from the simulation of JET#73342.

Figure 7. Comparison of the steady-state profiles from the simulation of JET#92436.
In the metric definitions of Appendix B, ȳ is the mean of the target flux in the test dataset, and TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. For all metrics, higher values indicate better quality of the surrogates, with a maximum value of 1.

Figure C1. Parameter scans in R/L_{ne} for the five ITG fluxes used in this work.

Figure C2. Parameter scans in R/L_{Te} for the five ITG fluxes used in this work.

Figure C4. Parameter scans in γ_E for the five ITG fluxes used in this work.

Table 1. Summary table of the most pertinent JINTRAC settings of the base case simulation.

Table 4. Summary table of the JINTRAC predicted profile RRMS within the QuaLiKiz evaluation region. The values are given for the QLKNN-ADEPT simulation, with the reference QLKNN-jetexp simulation provided within the square brackets.