Predicting resistive wall mode stability in NSTX through balanced random forests and counterfactual explanations

Recent progress in the disruption event characterization and forecasting framework has shown that machine learning guided by physics theory can be easily implemented as a supporting tool for fast computations of ideal stability properties of spherical tokamak plasmas. In order to extend that idea, a customized random forest (RF) classifier that takes into account imbalances in the training data is hereby employed to predict resistive wall mode (RWM) stability for a set of high beta discharges from the NSTX spherical tokamak. More specifically, with this approach each tree in the forest is trained on samples that are balanced via a user-defined over/under-sampler. The proposed approach outperforms classical cost-sensitive methods for the problem at hand, in particular when used in conjunction with a random under-sampler, while also resulting in a threefold reduction in the training time. In order to further understand the model’s decisions, a diverse set of counterfactual explanations based on determinantal point processes (DPP) is generated and evaluated. Via the use of DPP, the underlying RF model infers that the presence of hypothetical magnetohydrodynamic activity would have prevented the RWM from concurrently going unstable, which is a counterfactual that is indeed expected by prior physics knowledge. Given that this result emerges from the data-driven RF classifier and the use of counterfactuals without hand-crafted embedding of prior physics intuition, it motivates the usage of counterfactuals to simulate real-time control by generating the β N levels that would have kept the RWM stable for a set of unstable discharges.


Introduction
The resistive wall mode (RWM) is a global mode of instability of high pressure tokamak fusion plasmas that can lead to disruption of the plasma current and termination of the discharge * Author to whom any correspondence should be addressed.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. [1,2]. Since these disruptions can lead to physical damage to the machines, a method of early detection and forecasting of RWM stability is desired. In past post-discharge analysis efforts [3,4], an exponential rise in an n = 1 toroidal mode number poloidal magnetic field measurement (known as the RWM sensor) on the time scale of magnetic flux penetration through conducting surfaces surrounding the plasma, τ w , was used as the primary indicator of RWM instability. However, while this signal defines the existence of the RWM it requires that the mode be unstable. It is highly preferred to give an ear-lier warning of the mode potentially becoming unstable, therefore other predictors such as the normalized beta, β N , level and the lack of indication of a locked tearing mode (LTM) were also examined. A tearing mode is another mode of instability of tokamak plasmas that is more localized to rational magnetic surfaces rather than having a more global eigenfunction such as the RWM [5,6]. The presence of low frequency rotating magnetohydrodynamic (MHD) activity, which can lead to an LTM, has been seen to almost always preclude an RWM from going unstable at the same time [3].
Analysis of the physics of RWM instabilities identified other important plasma characteristics such as plasma rotation and collisionality [4,7], although the dependencies were too complex to handle via manual identification efforts. For example, a trained specialist could not simply look at a plasma rotation profile and use it as an indication of RWM stability or instability; what mattered were resonances between those rotation profiles and certain particle motions [8]. Complex physics software tools that analysed these kinetic resonances, such as MISK [9] and MARS-K [10] were developed and benchmarked [11,12]. However, while these tools were useful to understand the physics of RWM stability, they were too complex to execute in real-time to predict instabilities in a manner such that they could be avoided. An alternative approach was then developed, where the physics of RWM instability uncovered by these codes was distilled into a reduced model that maintained the major physics, albeit in a more tractable form that could potentially be executed in real-time [13]. This reduced kinetic model was included in the disruption event characterization and forecasting (DECAF) code [13][14][15][16], which includes many other models of physical events that can predict potential disruptions in addition to RWM destabilization.
Recently, machine learning (ML) algorithms have been explored for disruption prediction and avoidance as well [17][18][19][20][21]. In one extreme, these could be completely 'black box' approaches where available data is indiscriminately fed in and used to train a disruption predictor. In other cases the models have been built in order to produce results that are less difficult to interpret and allow the gain of physics insight (i.e. feature contribution, stability maps etc) [22,23]. An alternative is to use a physics-guided ML approach [24,25]. In such cases, the known physics of the problem (i.e. domainspecific knowledge) is used to pre-process input data, to constrain the learning objective as well as to interpret the output. ML techniques have now been used for one piece of the kinetic RWM stability problem-determining the ideal stability [26], including a physics-guided framework [27], and this piece has been incorporated into DECAF.
This paper extends this work to use ML to predict whether the RWM will go unstable in NSTX. Specifically, a random forest (RF) based algorithm is tested on discharges where human analysis has determined the time of RWM instability, or lack thereof. Inputs to the algorithm include the previous physics-guided neural network (PGNN) determination of the ideal no-wall β N limit [27], an expression for the with-wall limit, the measured β N , two measured quantities related to the rotation and collisionality, and finally a signal indicating the presence of a rotating MHD mode. The latter signal is expected to provide a better compromise between true and false positives in the proposed approach and is also used in DECAF as part of an MHD/LTM warning module [15]. In DECAF the absence of an MHD warning is used in conjunction with a simple threshold test on the RWM sensor signal to indicate the possible presence of a growing RWM.
The method presented here serves four main purposes: (a) To propose a different approach than the reduced kinetic model (RKM) [13] for RWM stability forecasting, which uses experimental data inspired by the full and reduced kinetic models without requiring the computation of kinetic and potential energies, (b) To show that the RKM and the ML approach have the potential to cooperate towards a more reliable RWM stability forecaster since the latter extends the former's domain of applicability, (c) To prove that the combination of sampling approaches with tree-based classifiers helps in tackling strong imbalances in the dataset, an aspect of crucial importance since some events leading to disruptions occur with very low prevalence [28], and (d) To propose a model-agnostic tool to interpret ML-based disruption predictions via counterfactual explanations and determinantal point processes (DPP) [29]. The exploration of what if conditions and how they affect the outcome has a twofold benefit. Firstly, it can assess the reliability of an ML algorithm by verifying that its counterfactual predictions agree with the known underlying physics. Secondly, one can use the counterfactual generation process to explore situations that are actionable in a real control system, such as lowering the β N level slightly above the computed no-wall limit.

The reduced kinetic RWM model
Tokamak fusion plasmas are theoretically stable to global, ideal MHD instabilities up to the so-called no-wall beta limit, β n=1 N,no−wall . Specifically, if no electrically conducting wall is present, the plasma is theoretically unstable above this limit to external kink-ballooning modes driven by the free energy of current or pressure gradients. Successful wall stabilization of kink/ballooning instabilities uncovered the reduced growth rate, yet still disruptive, RWM [2]. The RWM exists between the no-wall and the higher with-wall limit, β n=1 N,with−wall , at which point not even wall effects can stabilize the ideal kink/ballooning modes. Stabilizing the RWM is of paramount importance since it grows on a timescale (τ w ∼ ms) that is still fast compared to the duration of the plasma shot. Since magnetically confined plasma fusion devices need to operate at high β in order to improve plasma confinement efficiency, it is strongly desirable to operate above the no-wall limit. Past tokamak experiments found the fortuitous result that plasmas could be stably operated above the no-wall limit [6,30] with either passive stabilization or active mode control [31]. Understanding the physics of passive stabilization is key to relying on it and projecting it to the operation of future devices, and theoretical attention turned to kinetic modifications of ideal MHD theory [32].
In the DECAF code, the RKM was implemented and performed well on a limited number of discharges from the NSTX tokamak [13]. In the RKM, the growth rate of the RWM was calculated using an energy principle approach [33], converting the force balance into an equation to determine mode stability based on a potential energy functional δW. The resulting dispersion relation for the complex mode frequency is: Three terms needed to complete the RKM calculation are: the no-wall ideal term, δW n=1 no−wall , the with-wall ideal term, δW n=1 with−wall , and finally the more complex kinetic term δW K . Rather than replicate the calculation of the RKM, the approach taken here is to use signals inspired by the physics model as inputs to an ML algorithm which will then return an RWM warning level. This can be considered an alternative to the growth rate from the RKM calculation.

No-wall and with-wall beta limits
The evaluation of the change in plasma potential energy due to a perturbation of the confining magnetic field without the presence of a conducting wall, δW n=1 no−wall , was covered extensively in previous work [27]. More specifically, a RF regressor and a PGNN were tested on a large database of equilibria from the NSTX tokamak and used to reproduce the output of the DCON stability code [34] and to get an improved closed form equation for the no-wall limit. The physics guidance of the ideal MHD model in its extension outside the domain of the NSTX experimental data helped in carrying cross-device calculations to the MAST tokamak as well. Therefore, for the no-wall part of the equation, the quantity β n=1 N,no−wall from the previous PGNN [27] is used (see appendix).
Similarly to the no-wall limit, there is a higher withwall beta limit used in the calculation of δW n=1 with−wall in the RKM. However, unlike the no-wall case, there does not presently exist a large database of calculations of δW n=1 with−wall or β n=1 N,with−wall to train a neural network on. Rather, a somewhat simple expression dependent on the 'pressure peaking factor', i.e. the ratio of central to average pressure, was found to be adequate for NSTX [13]. The same expression will be used in this work (see appendix), although it should be possible in the future to train an ML model to give β n=1 N,with−wall based on more plasma parameters.

Kinetic effects
The term δW K results from a volume integration of the plasma displacement eigenfunction dotted with the divergence of the perturbed kinetic pressure tensor. The perturbed kinetic pressure tensor is found by taking a moment of the perturbed distribution function of the particles. In the end, the expression for δW K is an integral over particle's energy, pitch angle, and magnetic surface of a fraction which includes the frequencies of the particle's motion including bounce, precession drift, E × B Average E cross B frequency inside pedestal ω E Average ion collision frequency inside pedestal ν ii Low frequency, odd n MHD Odd-n MHD frequency (or plasma rotation), and collisionality [35]. When these frequencies match with opposite sign in the denominator, a resonance between particle motions is expected, which leads to a large fraction, a large δW K , and a stabilizing effect. Physically, the stabilizing effect can be thought of as a match between particle motions and the mode, which then allows efficient transfer of energy from the potential growth of the mode to the particle motions, thus damping the growth. Consideration of the effect of collisions showed that reduction of collisionality will reduce the collisional dissipation that is important when plasma rotational resonances are not present, but conversely can also reduce the damping of resonant kinetic stabilizing effects, allowing them to be more powerful [7]. In order to reduce the complexity of this calculation for potential use as a real-time disruption warning, the RKM model implemented in DECAF [13] dispensed with the integration and instead used Gaussian functional forms for δW K resonances with particle precession and bounce frequencies.
These functional forms depend on the measured values of the E × B frequency and collisionality, ω E and ν ii , both averaged inside the plasma pressure pedestal. These quantities are calculable in real-time if measurements such as plasma rotation, ion and electron density and temperature profiles are available. These profiles are also available in the DECAF database of analysed discharges and will be used in the present work as inputs to the described ML approach. The full expression for δW k and the RKM can be found, for reference, in the appendix. However, neither of these expressions are used in the present work. Instead, the ML approach essentially creates its own use of the ω E and ν ii input features for predicting RWM stability.

Data considerations
In order to develop an ML-based RWM predictor, we have gathered a set of 134 NSTX discharges where a human validation step had been carried out to assess whether the primary DECAF indicator of RWM stability was in agreement with an expert's judgement. For 44 shots, the coupled analysis identified the time at which the RWM went unstable, while the remaining are experimentally RWM stable. Listed in table 1 are the aforementioned plasma parameters with the DECAF aliases and the corresponding symbols used later in this paper. The ω E and ν ii quantities have been causally smoothed for visualization purposes. Human analysis has identified an unstable RWM at t = 0.537 s, marked as a red vertical dashed line in the bottom frame. The shaded grey area represents the cutting window determined by the preprocessing step. Table 2. Sub-categories identified in the NSTX database after performing the preprocessing step. The ones in which the no-wall limit is never crossed are kept in order to let the model be able to discern these cases from the more common ones.

Category
Description Count I Above the no-wall limit 58 (stable) + 42 (unstable) II Never above the no-wall limit or crossed after disruption 27 (stable) + 1 (unstable) III No-wall limit crossed outside the cutting window 5 (stable) + 1 (unstable) For this dataset, rotation profiles were available as a function of normalized poloidal magnetic flux (Ψ N ). However, we have decided to deploy a model based only on 0D signals since the required radial profiles might not be fully available in a future real-time system [13,36]. Moreover, the usage of averaged quantities allows us to run ML tools in DECAF in parallel with the RKM without having to interact with a completely different set of plasma parameters. Additionally, two more features are obtained by processing a low frequency odd-n MHD signal currently used in DECAF to assess the presence of a rotating MHD mode, as shown later. Finally, another substantial difference between the RKM and the model developed here is that the latter uses β N along with the nowall and with-wall limits without combining them into the well-known [13]. In fact, since C β is normally only computed in the range of β n=1 N,no−wall < β N < β n=1 N,with−wall , we are in practice extending the domain of applicability of the DECAF RWM stability module.

Data preprocessing
A typical discharge in NSTX lasts up to 1.0-1.3 s and most of the measurements are available with a time resolution of 5 ms or less, hence the dataset is comprised of roughly 2 × 10 4 individual time slices. In this work, we have decided to not restrict our analysis to the flat-top phase or to the time window in which β N is above the no-wall limit, but rather we cut the data to the range of times during which all the signals are available, followed by an interpolation on a common time scale of 5 ms. The resampling process never uses future information and it is done by looking at the previous and closest time point in the original timebase [19]. It is important to highlight that all the signals but ω E and ν ii are always available for the entire duration of the shot. In a few unstable cases (see figure 1), the latter are absent at the very end of the discharge leading to cutting out the β N collapse, which is a non-trivial indicator of a possible RWM instability. To address this issue, we initially considered to forward-fill the last valid point, however assuming that the collisionality and the E × B frequency do not change towards the end of the shot is not physically plausible. Fortunately, these few cases did not undermine the performance of our model. Table 2 lists the three different types of discharges uncovered by our preliminary analysis. It is worth noticing that in 34 shots the no-wall limit is either never reached during the discharge, crossed after the disruption, or crossed before the disruption but outside the cutting window. Moreover, in just 2 of those 34 has the human analysis identified an unstable RWM. The choice of not ruling out low beta discharges is key from the physics side as well as from the ML point of view. In fact, although an unstable RWM below the computed nowall limit is a rare event, a properly trained model should have access to all the possible cases.
With regards to the odd-n MHD signal, we have applied a short-time fourier transform to find mode peak frequency within the previous 5 ms. The root-mean-square (RMS) amplitude for the same time frame is also extracted using the librosa package. The usefulness of these two signals will be shown in detail in section 6. Figure 2 displays the magnetic spectrogram for (a) an RWM-stable discharge in which the mode rotation slowly decreases and locks at t = 0.96 s, and for (b) a shot in which the RWM went unstable at t = 0.79 s. One can notice the substantial difference in the time evolution of the peak frequency, as well as in the amplitude level (in log 10 -scale) towards the end of the discharges.

Combination of RFs and balancing techniques
ML approaches have been widely used recently in the fusion context given their ability to learn complex patterns in problems where events evolve over multi spatio-temporal scales.
They proved themselves useful tools in order to support firstprinciples physics models in the prediction of electron density and pressure profile shapes in tokamaks [37] and in disruption forecasting [19,38,39]. The latter are normally trained to predict with sufficient warning the likelihood that an instability will occur in the near future by treating the problem as a time-wise binary classification task. That is, individual time slices from a stable shot are labelled as negative (stable phase or far from instability), whereas an unstable discharge is split into two regions labelled as negative and positive (or close to instability), respectively. Although there is no ubiquitous way of defining the class label separation time in the unstable cases, it is quite common to choose 300 ms in advance of the instability, especially for disruption forecasting. In our case, such a choice would lead to an abundance of false positive warnings. In fact, previous analyses of the RKM revealed that while the collisionality drops in all the discharges due to increasing temperature, a pronounced turn towards higher ω E is generally observed at the end of the shot in the unstable cases [13]. Given the length of a typical NSTX discharge and that the RWM grows on a fast time scale (normally milliseconds), we have found that the optimal choice for the transition from negative to positive class is at t * = t RWM − t = 100 ms.
One of the main consequences of this common setup is that the training data will be inevitably imbalanced with the risk of increasing the bias towards the majority class and jeopardizing the final performance; an effect that is further exacerbated by the fact that the minority class is often the one of interest. The simplest way of fixing this issue could be training the model via cost-sensitive learning (i.e. misclassified positive samples are heavily weighted), but we have found this approach leading to worse performance, as shown later. The use of random under/over-sampling [38] is an effective alternative to solely weighting the objective function but comes with some complications that must be taken into account. In fact, down-sampling by randomly deleting elements from the majority class might result in loss of valuable information, which means that its Architecture of the BRF-based RWM predictor's pipeline, in which the final outputŶ is the probability of being close to instability. The first five features in table 1 are directly fed into the model, whereas the odd-n MHD signal undergoes a preprocessing step to extract mode peak frequency and RMS amplitude. Each tree in the forest analyzes a different subset in which the training data has been balanced using a sampler chosen from the imbalanced-learn library. Note that the class ratio is roughly 16 : 1 for the best performing class label separation time of 100 ms. usage would be more suited for large datasets. On the other hand, up-sampling may increase the likelihood of overfitting since the model might learn rules that are tailored on replicas of the minority class. Previous work [40] has focussed on the possibility of combining ensemble learning with random sampling in order to better learn from strongly imbalanced datasets. Ensemble learning is the combination of multiple machine learning methods to obtain a model that is more stable and less prone to overfitting. RFs ensure the required diversity and robustness and have been extensively used in several applications-including plasma physics and fusion energy-by proving to be robust tools for real-time disruption prediction on the DIII-D [23] and EAST [41] tokamaks, as well as in a crossdevice fashion [20]. In the case of RFs, the imbalanced-learn package [42] provides an implementation of the balanced random forest (BRF) in which for each tree a bootstrapped sample from the minority class is drawn and then a random set of equal size is selected with replacement from the majority class. This process is different from classical sampling techniques because it involves reducing the number of samples at a tree-level rather than as a preprocessing step over the entire training set. Therefore, with a sufficient amount of estimators the ensemble will explore the dataset in its entirety regardless of the sampling strategy. Here we propose a customization of the original implementation of the BRF that allows pluggingin any sampling technique and choosing the one that performs best.
The learning phase has been performed via a stratifiedby-shot cross-validator that not only splits the data in order to have non-overlapping sets of discharges in the different folds, but also takes into account the distribution of negative over positive time slices to stabilize the subsequent sampling step. Figure 3 illustrates the general pipeline followed during the training routine. The analysis carried out in the following section aims at showing the impact of several sampling techniques in terms of predictive performance as well as computational requirements. In particular, we have compared the classical RF using cost-sensitive learning (often referred to as weighted random forest, WRF [43]) with the ones that are regarded as being the most effective approaches, namely random under/over-sampling, SMOTE [44] and ADASYN [45].

Results on individual time slices
The dataset has been split into training and testing sets comprised of 106 and 28 discharges, respectively. For each predictor, the best hyperparameters are chosen to be the ones that return the operating point on the cross-validation ROC curve which is closest to a perfect true positive rate (TPR = 1) and false positive rate (FPR = 0), with the hope that such parameters will perform well on the unseen test data. We have found the choice of this objective function to be the most reliable since its minimum represents the best compromise between a high TPR and a low FPR. In order to speed up the training process, we have employed a multivariate tree Parzen estimator (TPE) [46], which sequentially constructs a model to identify  promising new configurations based on the interdependencies between hyperparameters. Table 3 summarizes the classification results obtained for prediction of stable or unstable individual time slices in the holdout test set using the different predictors. It is worth noticing that, although the WRF gives the lowest FPR, the BRF with random under-sampling (BUSRF) appears to have an edge over all the other approaches employed in terms of generalization performance and training time. As expected, the explored sampling techniques tend to improve the classification performance on the minority class, although the objective function gives the same weight to TPR and FPR. Fortunately, in all the cases the improvement in the TPR outweighs the worsening of the FPR, with a particularly favorable compromise for the BUSRF. This is an important aspect since in a fusion reactor we want no disruptions at the expense of a lower capacity factor. Although a false positive is undesirable, a wrongly triggered alarm would let active control steer the plasma back to a safe operational region.
In order to find the best operating point on the ROC curve, since the classifiers' predictions are continuous, we need to determine the best cutoff that maps the output to discrete values and allows for minimization of our chosen metric. On each curve, the optimal threshold chosen via crossvalidation is indicated by a circle, which again shows that the BUSRF's estimate of the best threshold is the closest to the northwest corner. In this regard, it is also found that for the over-sampling techniques the optimal cutoff value is in the range 0.2-0.275, which tells us that these models are somewhat under-confident on their positive class predictions. This could be a consequence of the artificially generated data that might make these predictors more sensitive to boundary effects.
Finally, not only the searching method, but also the time required to train the underlying predictor strongly affects the optimization routine. A representative comparison between the learning curves for the WRF and BUSRF classifiers with a budget of 1000 trials is shown in figure 4(b). Rather than plotting the objective value for every single iteration, only those that have improved upon previous ones are marked. Evidently, the BUSRF converges to a minimum much faster than the WRF, with the latter also being generally more prone to noise [43], an effect that is visible by looking at the large standard deviation (blue shaded area) among the K folds. This issue is far less pronounced for the BUSRF even though the two predictors use exactly the same splitting. Although in our case the size of the dataset does not make the computational requirements of paramount importance, the potential advantage is the possibility of obtaining improved performance on large databases without having to deal with very long training times.

Results on a per-shot basis
The process of training the model on individual time slices would certainly give a sense of how suitable the method is to the problem at hand. However, while it is important to have a warning level for each time step in order to track the evolution of a discharge, the intuition is that a predictor should either trigger an alarm or not during the entire duration of a shot. The usage of a hard threshold has been observed to be highly sensitive to noise in the input signals as well as to random spikes in the predicted output. For this reason, previous works [38,47] have proposed a 'softer' approach in which the model is re-trained with the addition of three new hyperparameters following the criterion that the alarm of impending instability is triggered only if the predictions stay above a low threshold (k 1 ) for a minimum amount of time (Δt w ) after having crossed a high threshold (k 2 ). An ideal warning criterion should counteract the issue of having alarms highly separated in time from the time of the actual instability as well as giving enough time for the plasma control system to act and avoid the instability. Although we could consider a warning raised any time during the shot as a success, we have decided to follow the DECAF RKM rule, in which if an alert is triggered more than 400 ms in advance without any related minor disruption, then it is considered too early. Moreover, in some of the tested predictors the alarm is raised too late (i.e. less than 10 ms before the RWM instability, regarded as missed).
Again, the algorithm that gave the lowest cross-validation objective was the BUSRF, with the model also resulting in the best test set performance by capturing 10 out 11 unstable RWMs (no late or early warnings) and just 2 out of 17 wrongly triggered stable discharges, as listed in table 4. We must point out that for the BUSRF one of the two false positives is a borderline case in which the alarm is triggered very early and then the warning level remains stable until the end of the shot, although it is unclear why this happened.
It is also interesting to notice that the best hyperparameters for the chosen predictor (BUSRF) slightly differ from the typical combinations in an RF-based disruption predictor, which further confirms that not only the machine but also the temporal evolution of plasma quantities associated with a specific instability affects the warning criterion. For example, in reference [47] the authors found that an ideal alarm should be triggered as soon as the output crosses the high threshold, with the low threshold at the lowest end of their search grid. On the other hand, Churchill et al [38] have partially confirmed this behaviour as their deep neural network triggers the alarm just 1 ms after exceeding a high threshold (0.96) which is quite large and also coincides with the low threshold. In our case the situation is somewhat in between, with k 1 being in the lowto-mid range, k 2 in the upper range, while Δt w tells us that the predictor should wait at least 60 ms before triggering the alarm.

Comparison with the RKM and limitations
In addition to analyzing performance on the holdout test set, we are also interested in examining what we should expect from this new model and whether its predictions agree or not with the RKM. Although the warning times are slightly different, we have found the two models in general agreement apart from some peculiar cases. First of all, the RKM correctly identifies the unstable RWM missed by the BUSRF and predicts stability for one of the two discharges that are wrongly triggered by the ML model. On the other hand, the BUSRF appears to benefit from the usage of the absolute values of β N , no-wall and with-wall limits rather than condensed into the C β parameter, as well as from the addition of the odd-n MHD signal. In fact, in two unstable cases (curves in cyan and olive in figure 5) the BUSRF triggers an instability at the very end of the discharge with warning times of roughly 50 ms before the human identified RWM. Interestingly, both cases are not triggered by the RKM, possibly because of the low C β or the high collisionality. In one unstable case (green curve) missed by the Comparison between a true negative discharge (Left) and a false positive one (Right). Both discharges are characterized by rising ω E , decreasing ν ii and β N crossing the no-wall limit. However, the very high β N and low collisionality for shot 140102 seem to mislead our classifier, which wrongly triggers an early warning at t = 0.441 s. RKM, the discharge evolved in the ν ii vs ω E space in what appears to be a stable range, but it is triggered 30 ms before the disruption by our model. Conversely, neither the BUSRF nor the RKM seem to identify stable discharges that evolve in what should be a highly unstable ν ii , ω E , β N region.
The single test set shot belonging to this category is displayed on the right in figure 6 and compared to another stable discharge. By looking at the two plots, it is possible to highlight some of the limitations of the proposed model. The stable discharge on the left is characterized by steadily rising ω E along with β N reaching roughly 5.2 at t = 0.67 s. No MHD activity is observed in the [0.4-0.7] s range, followed by increasing RMS amplitude and decreasing mode frequency that led to an LTM during the plasma current rampdown at t = 0.851 s. In contrast, shot 140102 had a broader rotation profile, ω Φ , and remained RWM stable as ω Φ was reduced by n = 3 magnetic braking [48]. For this specific shot, the proximity to marginal stability as rotation is reduced was evaluated by active MHD spectroscopy. Active n = 1 RWM feedback control was turned off at around 0.6 s, followed by the application of an n = 1, 40 Hz co-NBI propagating tracer field to evaluate the resonant field amplification (RFA) due to the high β N level. Although during this time the plasma remained at a high and relatively constant β N of roughly 6.2, the RFA amplitude peaks at around 0.82 s and then decreases as ω Φ is reduced. This denotes that the plasma first approaches, then departs from RWM marginal stability [49]. Evidently, these early signs of possible RWM instability led the BUSRF to trigger an early false alarm despite the presence of rotating MHD activity at the beginning of the shot. Interestingly, the BUSRF is then picking up in advance that something is changing in the plasma evolution, with the warning level monotonically decreasing as the plasma rotation is reduced. Unfortunately, the model raises another warning right at the end of the shot, corresponding to a steep rise in ω E . This analysis further shows that ML models should be used in conjunction with plasma diagnostics and existing control systems to define paths for improvement, especially when it comes to marginal cases such as the one presented here. Although just one of these cases is observed in the testing set, 4 other similar discharges are also wrongly triggered during the cross-validation process. Presumably, both models are also not capturing other stabilizing aspects such as the effect of energetic particles (i.e. temperature and pressure of the thermal ions). One possibility in the future could be using fast ion pressure profiles produced by the NUBEAM module of the TRANSP code [50], or reconstructed by a neural network [51], and then injecting this additional data into a larger RWM stability module. Generation of odd-n MHD counterfactuals for NSTX shot 140134, which was experimentally RWM unstable. In the top and middle panels are displayed the original and simulated odd-n activities in black and colour, respectively. Each colour represents one of the ten possible levels of MHD activity. The bottom panel shows the original BUSRF predictions as coloured dots, while the blue shaded area represents the range of counterfactual predictions if strong MHD activity was present. Note that, for the sake of visualization, only the frequencies where the RMS amplitude is at least 2 G are displayed in black in the top panel.

Interpretation of the warning level via diverse counterfactuals
Post-hoc interpretation of complex models are key for physicists to understand algorithmic predictions and judge whether they agree with the underlying physics or not. In the RKM, it is possible to follow plasma trajectories over time in the ν ii , ω E space by drawing contours of γτ w [13]. In our case, the model's decision boundary lies on a sevev-dimensional manifold, making us unable to visualize it unless we performed a scan of two features at a time while keeping the others at a constant or highly representative value. Other works [23,47] have explained the decisions made by multi-dimensional RF classifiers by decomposing the predictions into feature contributions. Here we propose the use of an alternative and model-agnostic method of interpreting the outcome based on counterfactuals, or what-if explanations. The concept of counterfactuals relies on the generation of hypothetical combinations of the input features that would return a specified output. In our case, the most obvious choice would be the generation of plasma parameters that produce predictions on the stable side of the decision boundary. Such a framework should ideally respect three conditions, which are: the ability to suggest a diverse set of combinations, the possibility of practically performing these changes (i.e. proximity to the original input space), and the consideration of the causal links in a real case scenario. The diverse counterfactual explanations (DiCE) algorithm has proved to satisfy all three requirements and provides the right flexibility to take into account user-defined constraints and experts' knowledge. For the sake of brevity, we will not report its complete formulation [29], 3 but rather highlight the main points and show how to customize it to the problem at hand. For non-gradient based ML methods, DiCE uses genetic programing to minimize an objective function over the entire set of potential candidates as follows: where c 1 , . . . , c k is a set of k counterfactual examples, L is the hinge loss between the desired output y and the ML model's predictions ( f (c i )) associated with the set of proposed counterfactuals, D 1 is a measure of perturbation defined as the normalized average feature-wise l 1 -distances from the original input (x), D 2 is a diversity metric based on DPP (see appendix) that avoids repetition of the same c i in a set of given size k, and λ 1 and λ 2 can be considered as regularization coefficients that are used to balance the impact of diversity and proximity. The hinge loss guarantees a smaller perturbation of the original set of features x since in a single-threshold model a counterfactual only needs to be on the other side of the decision boundary in order to be valid. As a starting point, we have customized the DiCE algorithm in several ways. First, physics-informed counterfactuals are generated in order to gain understanding into the model's decisions. For example, based on physics knowledge we expect that the usage of the odd-n MHD signal should improve the model's capabilities, but we do not know a priori how it is actually helping. Second, this approach can be modified and potentially used in a real-time control algorithm by stepping down the most influential parameters, such as β N , in order to see which levels keep the plasma RWM stable. Additionally, in order to speed up the counterfactual generation process, a TPE-based optimizer was added to the DiCE framework.

Usage of counterfactuals to generate hypothetical odd-n MHD activity
Altogether, the extracted mode frequency and RMS amplitude cover roughly 20% of the relative feature importance, hence we can assume that these features definitely influence the decisions made by our classifier, as expected. Notwithstanding the fact that only some of the RWM-stable discharges had strong MHD activity, we can still explore the ability of the BUSRF in understanding how RWM stability is affected by the presence of low order MHD modes, in particular slowing ones. As shown previously in figure 2(b), the MHD mode spectrum from a toroidal array of magnetic pickup loops for an unstable RWM shows no mode activity, indicative of stable rotating MHD. As an additional example, figure 7 displays one of the analysed RWM unstable shots, correctly identified by the BUSRF at t trigger = 0.376 s. By looking at the black points in the top and middle panels, one can notice that the absence of MHD activity towards the end of the shot is followed by the warning level (bottom panel) rising above the high threshold for the first time at t 0 = 0.316 s, indicated by a change to a grey background.
Starting from this point, we have asked the algorithm to produce a set of 10 possible combinations of MHD frequencies and amplitudes per time step that would have prevented the BUSRF from triggering an RWM warning. Since the original main loss only ensures that a counterfactual lies below or above a fixed threshold of 0.5, we chose to replace the hinge loss in equation (2) with a custom step function that penalizes counterfactual predictions above the high threshold, as follows: where hence the optimization will tend to prioritize DPP-diversity while keeping counterfactuals below the high threshold. Previous DECAF characterization of LTMs in NSTX discharges has found that before the disruption the frequency decreases and linearly halves, then it bifurcates and abruptly drops towards zero [14]. An opposite trend is generally observed with the RMS amplitude. In order to make these hypothetical situations as realistic as possible, in this application the algorithm has been constrained within the red curves, which follow these considerations and whose expressions can be found in table 5. Specifically, for this shot the bounds change from linear trends to exponential decays (or growths) at t 1 = 0.416 s and t 2 = 0.436 s, respectively. Moreover, we have found that the choice of λ 0 = 1 and λ 2 = 4 was the best in order to generate a diverse set of opposite MHD scenarios while keeping counterfactuals inside the bounds. For this specific application, proximity is not a requirement, hence λ 1 was set to 0.
The generated MHD activity in the top and middle panels of figure 7, when applied to the RWM unstable discharge, produces the counterfactual BUSRF warning level shown in the bottom panel, which is now stable. The choice of setting the initial bounds to 7.5 and 15 kHz is just an example of such application. In fact, we have also tested the same approach on other 5 unstable discharges by narrowing or widening the initial gap between 5 and 20 kHz, which are values that come from experimental observation of MHD activity in NSTX discharges [14]. On the other hand, the RMS amplitude bounds have been left unchanged in all the combinations as well as the halving times, which are set to 100 and 120 ms, respectively. In all the experiments DiCE was able to find a set of counterfactuals that would have kept the warning level below k 2 until the end of the discharge. Therefore, the counterfactual approach is confirming the physics understanding that the global RWM mode cannot grow to instability while local MHD activity is present, growing in amplitude and slowing to potentially disrupt the plasma itself. The same technique might be used to simulate MHD scenarios in which both the bounds and the time scales can vary based on NSTX experience as well as to evaluate the reliability of our model if run in parallel with the DECAF LTM warning module. It must be stated that such alternative scenarios might still lead to a disruption, which however should not be caused by an unstable RWM and we expect our model to not trigger any alarm. However, other possible causes of disruptions forecast by the DECAF code could be evaluated separately given the parameter space spanned by the counterfactual analysis for RWM.

ML-informed safe scenarios for potential real-time control
The second application of this approach lays the basis for a potential usage of counterfactuals for real-time control of the main RWM instability drivers, which are β N and rotation. Normalized beta control has already been accomplished in NSTX [52] by means of a proportional-integral-derivative gain controller that changes the injected power in order to achieve a user-defined β N level. In NSTX, all the beams are pointed in the same direction, hence in addition to more heating, more beam power necessarily means more plasma rotation. Moreover, because of the different angles at which the different beam sources were injected, an increase or decrease in power in one beam or the other would also influence pressure peaking and internal inductance (the two are correlated). For example, the beam that penetrated farther into the plasma core would tend to increase pressure peaking whereas one that was more towards the edge would tend to decrease it. Therefore, such a modulation would also modify the no-wall and with-wall beta limits, which would in turn play a role in the global RWM stability. For example, a less peaked pressure profile generally increases the with-wall limit allowing access to higher β N . Although other parameters could change simultaneously as β N is stepped down, an ML-informed safe β N level represents a valuable piece of information that could be input to a more sophisticated control method [53] already tested and planned for future NSTX-U operation. Alternatively, an allencompassing counterfactual generation algorithm for RWM stability control can be built, but this is mostly future work.
For the present purpose, we will generate the time-wise β N levels that would have kept the plasma RWM stable for a set of NSTX discharges for which the BUSRF had triggered an alarm, and we will do so by replacing equation (2) with an objective function that constrains counterfactuals in the probability domain rather than in the input space. Since a high β N is desirable to achieve good fusion performance, we will find  figure 8) is analysed in terms of the most relevant feature contributions while β N is reduced. Each error bar represents the contribution associated with the two levels β 0 N , β 1 N found by minimizing equation (4). The collisionality dropping below 1 kHz for shot 140123 allows access to higher rotation, but also results in a heavier impact of the latter towards the final prediction.
the β N region that keeps the warning level between 0.5 and the high threshold, k 2 , as follows: Apart from speeding up the counterfactual generation process, this objective function also maintains the diversity and proximity properties. In fact, given the desired probability range we expect the two β N values to not coincide as well as to lie in the high beta region. Figure 8 shows the application of such a method to two similar high beta discharges in which the RWM became unstable in the chain of events leading to a disruption. The main characteristic that one can notice is that the blue shaded safe β N region lies around/slightly above the NN-modeled no-wall limit. The behaviour is not surprising, as the goal was to find a safe operating region and the counterfactual approach is confirming that, to first order, increasing beta above the no-wall limit increases the disruption probability. Two other things emerge from the analysis of the two discharges as well. First of all, the slightly higher with-wall limit for shot 134836-indicative of a less peaked pressure profile-seems to partially justify the maximum safe β N of roughly 4.8, somewhat above the no-wall limit, in the grey counterfactual region.
On the same grounds, one can notice that the blue area moves upwards between 0.55-0.6 s for shot 140123, corresponding to the times at which the with-wall limit is almost 7. Second, while both discharges are characterized by decreasing collisionality, the rise in ω E is more pronounced for shot 140123, in which it crosses 6 kHz at around 0.62 s. This may further explain the much narrower safe β N operating space towards the end of this discharge.
All these considerations can be confirmed by producing the time-wise feature contributions [23] for the BUSRF classifier. As a reminder, this analysis essentially decomposes the prediction into a bias term plus the sum of each feature's contribution 4 . The bias term in this case is 0.5 since trees in the BUSRF are trained on a perfectly balanced dataset. We have found that the contributions of both β n=1 N,no−wall and the MHD activity are roughly the same between these two specific discharges since the evolution in time of these two features is almost identical between the two shots. Therefore, the corresponding contributions do not inform us about why the counterfactual β N region is different between the two discharges. Conversely, we compare the effect of β n=1 N,with−wall , ω E and ν ii towards the warning level in the following figure 9.
Since the counterfactual generated β N is a region that changes in time, the individual contributions are generated as error bars as well. By looking at the top panel, one can notice that the higher with-wall limit for shot 134836 is effectively damping the warning level, in particular around 0.57 s when the counterfactual β N reaches its maximum. This effect is less pronounced in the red shaded area, related to a β n=1 N,with−wall that barely reaches 7. Moreover, it is worth noticing the contributions of ν ii and ω E . Physics expectation based on MISK calculations has shown that RWM instability is possible at very low rotation and also at a higher rotation level between stabilizing resonances. This higher rotation level at which the RWM becomes unstable tends to increase as collisionality decreases [49]. This appears to be consistent with the behaviour of shot 140123. As ν ii drops below 1 kHz access to higher rotation is allowed, so ω E steeply rises. Meanwhile collisionality starts to oscillate. One can see that the contribution of ν ii first decreases and then increases again, which confirms that it is not possible to assume that stability decreases monotonically with collisionality [7]. The joint effect of these three contributions is an increase in the warning level that, in the counterfactual scenario, can only be damped by a reduction in the β N level below the no-wall limit, as seen on the top right panel in figure 8. Overall, the warning level is 0.05 to 0.15 higher for this discharge, hence the lower and narrower counterfactual region suggested by the DiCE algorithm.
As was previously mentioned, β N modulation also influences rotation in a device like NSTX. Therefore, in the future this approach can be applied in a way that takes into account, for example, the simultaneous variation of β N and ω E . In addition, the reliability of any counterfactual scenario relies on the accuracy of the underlying model. Therefore, it is really important building a robust counterfactual generator as well as perfecting the RWM stability forecaster. The implementation of ML-based controllers has been gaining attention in recent years with through the usage of reinforcement learning, for example to determine the optimal control scenario to obtain a constant user-defined β N level in the KSTAR tokamak [54]. The direction we have taken in the present work follows the observation that the disruptive β N is not constant when it comes to RWM stability and the usage of counterfactuals might lay the basis for a control algorithm that takes this variability into account. Moreover, recent research [55] has shown that contribution-based explanations can be jointly used with counterfactual generation algorithms and disclose which features are necessary or sufficient towards the model's decisions.

Discussion and conclusions
The prediction of impending instabilities is crucial for the safe operation of future commercial fusion devices. The combination of physics knowledge and the capabilities of ML algorithms is arguably one of the best solutions to this grand challenge. Recently, particular attention has been given to interpretable ML models, such as RFs, which can easily give an insight into the underlying causes of instabilities by providing the importance of each input feature as well as their contribution to the final output. A new RWM stability forecaster based on the RF approach has been developed here to predict whether the RWM will go unstable or not for a set of human-labelled NSTX discharges. The present approach outperforms classical weighted RFs by including a random per-tree undersampler to balance individual time slices. The algorithm is a valuable addition to DECAF and can be regarded as a supporting tool for the existing RKM model, since a processed signal indicative of the presence of low frequency MHD activity has been added to the input features.
Additionally, a model-agnostic tool based on counterfactuals has been tested for the first time to explain the effect of the hypothetical presence of MHD activity as well as to simulate β N levels that would have kept the plasma RWM stable. Regarding the first usage, it has been observed that the underlying model is understanding that strong MHD activity is generally unobserved when the RWM goes unstable, as expected by prior experimental evidence. On the other, suggested counterfactuals have interestingly shown that in discharges where the RWM warning level rises above 70%, the first line of defense is to reduce β N in a range that is slightly above the no-wall limit.
Both the BRF and the counterfactual approaches are applicable to other tokamaks as well. While it may be necessary to retrain the underlying ML model on data from each machine, it should be noted that the PGNN-determined β n=1 N,no−wall from NSTX data was applied with some initial success to the MAST tokamak [27]. Although more research is needed to determine the reliability of cross-device applications, the physics in the kinetic stability terms, ω E and ν ii , are applicable to other devices as well [12]. For example, comparisons of the theory to dedicated DIII-D experiments were successful. Other terms, like the no-wall and with-wall limits, are based on physics parameters, such as the aspect ratio, that can scale to other machines. Finally, recent calculations [56] for the MAST-U tokamak show potential high beta stability regions that could provide valuable information to better characterize RWM stability across multiple devices.
There are various elements in this approach that may be improved. First of all, the underlying RF model might be refined with the inclusion of the effect of energetic particles from TRANSP NUBEAM runs, which would likely help in reducing the FPR. Secondly, although the current usage of counterfactuals is encouraging, it still needs to include interdependencies between the main plasma parameters. Work in the near future will cover the implementation of counterfactual explanations to further gain understanding of RWM stability, exploring actionable scenarios and implementing a more sophisticated β N /rotation ML-based controller. Table 6. Example of the counterfactuals generated by DiCE for shot 140134 at t = 0.366 s. The last two columns are the suggested set of alternative scenarios that would have prevented the warning level from rising above 0.7 and triggering an RWM alarm. counterfactuals in table 6 are reported below:  (15) in which one can notice that the per-column similarity metric decreases the farther we move from the i = j entry, indicative of increasing diversity.