Integrated deep learning framework for unstable event identification and disruption prediction of tokamak plasmas

The ability to identify underlying disruption precursors is key to disruption avoidance. In this paper, we present an integrated deep learning (DL) based model that combines disruption prediction with the identification of several disruption precursors, such as rotating modes, locked modes, H-to-L back transitions and radiative collapses. The first part of our study demonstrates that the DL-based unstable event identifier trained on 160 manually labeled DIII-D shots can achieve, on average, an 84% event identification rate for various frequent unstable events (H-L back transition, locked mode, radiative collapse, rotating MHD mode, large sawtooth crash), and that the trained identifier can be adapted to label unseen discharges, thus expanding the original manually labeled database. Based on these results, the integrated DL-based framework is developed using a combined database of manually labeled and automatically labeled DIII-D data, and it shows state-of-the-art (AUC = 0.940) disruption prediction and event identification abilities on DIII-D. Through cross-machine numerical disruption prediction studies using this new integrated model and leveraging the C-Mod, DIII-D, and EAST disruption warning databases, we demonstrate the improved cross-machine disruption prediction ability and extended warning time of the new model compared with a baseline predictor. In addition, the trained integrated model shows qualitatively good cross-machine event identification ability. Given a labeled dataset, the strategy presented in this paper, i.e. one that combines a disruption predictor with an event identifier module, can be applied to upgrade any neural-network-based disruption predictor. The results presented here inform possible development strategies for machine-learning-based disruption avoidance algorithms on future tokamaks and highlight the importance of building comprehensive databases with unstable event information on current machines.


Introduction
A disruption is the sudden, sometimes unexpected termination of a magnetically confined plasma, typically resulting from the growth of large magnetohydrodynamic (MHD) instabilities or strong radiation from impurities inside the plasma. When a disruption occurs, the plasma loses its stored magnetic and thermal energy on a very short timescale, and this rapid loss of plasma energy can lead to serious damage to the device; this is especially concerning for next generation tokamaks like ITER or SPARC. Therefore, it is important to predict imminent disruptions and then avoid or mitigate them in advance. Due to the various causes of disruptions [1] and the complexity of fusion reactors, predicting disruptions using first-principles methods has proven to be very difficult [2]. The large amount of experimental data from the operation of various tokamaks makes data-driven solutions to the disruption prediction problem possible. Many disruption prediction studies [3][4][5][6][7][8][9][10][11][12][13][14] have proven the effectiveness of data-driven prediction methods. Furthermore, recent modeling efforts based on deep learning (DL) algorithms [8][9][10] have shown improved performance and the potential cross-machine transferability of such predictive methods. However, DL approaches often lack the ability to identify disruption precursors, making them less explainable. This not only undermines the confidence of tokamak operators in the results themselves but also hinders the implementation of disruption avoidance strategies.
On the other hand, there exist outstanding examples of physics-driven approaches to predict disruptions and their precursors [15][16][17][18]: the disruption event characterization and forecasting (DECAF) suite [15] incorporates various physics-based modules for the identification and forecasting of tearing modes, locked modes, resistive wall modes, and edge localized modes, among other unstable events. These physics modules are designed for stability boundary detection on different devices, and some modules are accelerated via ML surrogate models [19]. Building upon these physics models, the DECAF suite can provide the proximity of the plasma state to different disruption precursors and to the final disruption, which gives it much better interpretability than data-driven methods and enables machine operators to avoid disruptions instead of simply mitigating them.
In this paper, we present a new integrated DL-based framework that can detect several unstable events and, at the same time, predict plasma disruptions. Although both our integrated framework and physics-driven models like DECAF can output unstable levels of various disruption precursors, the data-driven nature of our integrated framework allows it to be straightforwardly adapted to new devices or new unstable events, given sufficient labeled data (i.e. by retraining the model with new labeled data added to the original database). In contrast, adapting a physics-driven model to a new operational regime or to unseen plasma instabilities requires physics understanding of the new phenomena, and there is no standard way to do this. Through extensive numerical experiments using data from the C-Mod, DIII-D, and EAST tokamaks, we demonstrate four major advantages of such an integrated framework: (a) Any DL-based predictor can be adapted to an integrated model that combines an event detector and a disruption predictor using our framework with little extra computation. (b) Numerical experiments show that the integrated model gives longer warning times for predicting disruptions when compared with a baseline disruption prediction model. (c) The integrated model is able to identify the whole chain of events leading to the disruption instead of just predicting the final major disruption. The identification of precursors allows appropriate actuators (a set of control knobs integrated in the real-time plasma control system, e.g. increasing the electron density or decreasing the plasma current) to be employed to actively avoid disruptions.
(d) Finally, our cross-machine numerical experiments suggest that combining unstable event identification with disruption prediction can strongly improve the cross-machine portability of the deep learning model.
In the following sections, we first introduce the dataset used in the development and testing of the integrated model in section 2. Then, an iterative labeling process for generating event labels for non-disruptive shots is discussed in detail in section 3. In section 4, the design of the integrated DL model is presented and its performance on the manually labeled DIII-D dataset is shown. Preliminary cross-machine disruption prediction studies using the integrated model are described in section 5. Finally, we present the conclusions and future plans for the integrated DL model in section 6.

Dataset description
Our disruption prediction and unstable event identification studies are conducted on disruption warning datasets coming from three experimental devices, i.e. C-Mod, DIII-D, and EAST [7]; additionally, a DIII-D dataset with manually identified and labeled unstable events [20] is used. The three disruption warning datasets have been well described in our previous work [7,9]. The dataset compositions and sampling rates of these three databases are shown in table 1 [9]. We interpolate the signals from DIII-D and EAST onto uniform 10 ms and 25 ms time bases, respectively. This is necessary because the DIII-D and EAST disruption warning databases have nonuniform sampling for disruptive discharges [5,7], while our DL-based model requires uniformly sampled data. The manually labeled event identification dataset consists of 287 DIII-D disruptive shots (from the DIII-D 2015-2016 experimental campaigns), with manually labeled start times for different unstable events across the whole plasma current flattop of each shot [20]. We include 22 classes of different unstable events when building this database, and all event names are consistent with the events described in [1]. Given the limited size of the database and the frequency of different unstable events, we choose to include in our unstable event identification study the ten classes of unstable events that occur in at least ten different shots (table 2). The 'event occurrence' of a particular event is the number of disruptive shots that have that event during the flattop, divided by the total number of disruptive discharges in the manually labeled DIII-D dataset (287). Since multiple unstable events can happen during the flattop of a single disruptive discharge, the sum of the 'event occurrence' fractions can be larger than 1.
As for the selection of plasma signals included in our analysis, we first use all plasma signals considered in our previous disruption prediction study [9]. Furthermore, to better detect different unstable events, we add two more signals to the original list of plasma signals used by our model. The first additional plasma parameter is Te_width_norm, which is the half width of the parabola fitted to all measurement points from the core Thomson system, normalized by the minor radius. All these points have the same R coordinate, and a parabola is fitted to T_e as a function of Z. The second additional plasma signal is Prad_peaking_CVA, which is the radiated power from the core plasma divided by the total plasma radiated power [21,22]. The full list of input plasma signals is given in table 3. The set of plasma signals included in this study is informed by three factors: (a) the suggestions from machine operators of C-Mod, DIII-D and EAST; (b) the analysis of the non-disruptive and close-to-disruption distributions of the plasma signals included in our databases, as some signals have different distributions when disruption is imminent (for example, the normalized internal inductance, l_i, increases before the final current quench on C-Mod, DIII-D and EAST [5,7,9]); (c) the need to characterize the plasma state and its evolution across the 'events' or precursors considered for event identification. Plasma signals that are closely related to important disruption precursors should be included in our analysis; for example, the n = 1 locked mode signal is needed for the detection of locked modes, which often precede disruptions.
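To make the Te_width_norm definition concrete, the following is a minimal numpy sketch of the parabola fit described above. The function name, the use of a least-squares quadratic fit, and the choice of the vertex-to-zero-crossing distance as the 'half width' are our own illustrative assumptions; the exact width definition used in the actual signal pipeline may differ.

```python
import numpy as np

def te_width_norm(z, te, minor_radius):
    """Illustrative sketch of the Te_width_norm signal: fit a parabola
    Te(Z) = c2*Z^2 + c1*Z + c0 to core Thomson measurement points at
    fixed R, and return the parabola half-width normalized by the
    minor radius.  (Assumed definition, for illustration only.)"""
    c2, c1, c0 = np.polyfit(z, te, 2)
    if c2 >= 0:
        return np.nan                    # no well-defined peaked profile
    z0 = -c1 / (2.0 * c2)                # vertex (profile peak) location
    te_max = np.polyval([c2, c1, c0], z0)
    half_width = np.sqrt(-te_max / c2)   # distance from peak to zero crossing
    return half_width / minor_radius
```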
The manually labeled database of DIII-D disruptive discharges is then randomly divided into a training set (160 shots) and a test set (127 shots). Our previous work [9] suggests that sequence-based models have a clear advantage over models based on the categorization of individual time slices. Therefore, we use plasma sequences of ten consecutive time steps as input to our models. Since prior to each major disruption there is a sequence of unstable events that finally leads to the final loss of control, both the disruption prediction and the event detection problems are formalized as sequence-to-label supervised machine learning tasks. To this end, we need to assign two labels to each plasma sequence: (a) a disruption label, encoded as 1 if the plasma sequence is close to disruption or 0 if it is far from disruption; (b) a ten-dimensional event label vector, where each coordinate is independently linked to a score for one of the ten unstable precursors considered in table 2. Each label vector element is encoded as 1 if the training plasma sequence is unstable with respect to the corresponding event or 0 if the sample is stable.
For disruption label assignment, we use the same procedure as our previous study [9]. However, for event label assignment, the procedure is not straightforward: (a) We only record the start time of each unstable event in our manually labeled dataset, but the end time of the event is missing. (b) All manually labeled shots are disruptive shots, but we need both disruptive and non-disruptive training shots for the development of our integrated model. In the following we present the solutions to both these problems.
After testing different label assignment schemes, we find that the best approach is to label all plasma sequences that encompass the start time (onset point) of the unstable events as belonging to the unstable category of the corresponding event. All other plasma sequences that are either before or after the onset time belong to the stable category of the corresponding event. Under this labeling scheme, our target is to identify the onset of unstable events instead of the unstable events themselves. The predicted onset time is the point at which (a) the predicted event's level exceeds the threshold corresponding to the unstable event and (b) this event's level is larger than the level of the event on the previous time step.
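The onset rule above, i.e. conditions (a) and (b), can be sketched as a few lines of numpy-free Python. This is an illustrative reimplementation of the stated criterion, not code from the actual model; in practice the first returned index is taken as the predicted onset time.

```python
def predicted_onsets(event_level, threshold):
    """Return the time-step indices satisfying the onset rule:
    (a) the predicted event level exceeds the event's threshold, and
    (b) the level is larger than at the previous time step."""
    level = list(event_level)
    return [t for t in range(1, len(level))
            if level[t] > threshold and level[t] > level[t - 1]]
```

For example, with levels [0.1, 0.2, 0.6, 0.7, 0.65] and a threshold of 0.5, steps 2 and 3 satisfy both conditions, and step 2 is the predicted onset.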
Finally, in order to complement our disruptive dataset of labeled unstable events, we randomly select 900 non-disruptive shots from the 2015-2018 DIII-D experimental campaigns and assign unstable event labels to these 900 non-disruptive shots using a trained event predictor through an iterative labeling process that is discussed in detail in section 3.
By solving these two problems, we construct a database with event and disruption labels for both disruptive (manually labeled, 160) and non-disruptive (automatically labeled, 900) shots, which forms the training set for the development of the integrated DL model. Notice that 160/900 is close to the ratio of disruptive to non-disruptive discharges in our DIII-D disruption warning database [7,9]. In addition, 700 non-disruptive shots randomly selected from the DIII-D disruption warning database are combined with the 127 manually labeled disruptive shots to form a test set. Both disruptive and non-disruptive test data are used to evaluate the disruption prediction performance of the model, while only disruptive test data are used to test the disruption precursor detection performance of the model. Finally, 127 disruptive shots and 700 non-disruptive shots are randomly selected from the DIII-D disruption warning database as the validation set. For the disruption prediction problem, the time threshold that determines the unstable phase of each disruptive training sequence (described in [9]) is uniquely chosen as the time at which the first unstable precursor event appears.
The training samples are (x, (y_dis, y_event)) pairs, where x is a ten-step consecutive temporal sequence of the 14 plasma signals in table 3, and y_dis, y_event are the disruption label and event label, respectively. The training samples are extracted from each training set via a scheme equivalent to that of [9]. For each disruptive training shot, 20 disruptive samples are randomly selected from those sequences that intersect the unstable phase of the shot. For each non-disruptive training shot, 20 non-disruptive samples are randomly selected from all sequences during the flattop phase of the plasma current.
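The sampling scheme above can be sketched as follows. This is a minimal numpy illustration under our own naming conventions (the function and its arguments are not from the paper's codebase): for a disruptive shot, only sequences intersecting the unstable phase are eligible; passing an all-True mask reproduces the non-disruptive case, where any flattop sequence may be drawn.

```python
import numpy as np

def sample_training_sequences(signals, unstable_mask, n_samples=20,
                              seq_len=10, rng=None):
    """Draw n_samples sequences of seq_len steps from a shot's flattop
    signals (a T x 14 array).  A sequence is eligible if it intersects
    at least one time step where unstable_mask is True."""
    rng = np.random.default_rng(rng)
    T = signals.shape[0]
    starts = [s for s in range(T - seq_len + 1)
              if unstable_mask[s:s + seq_len].any()]
    chosen = rng.choice(starts, size=n_samples, replace=True)
    return np.stack([signals[s:s + seq_len] for s in chosen])
```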

Labeling non-disruptive shots through iterative labeling process
As mentioned in section 2, assigning event labels to non-disruptive data in the training set is necessary for the development of the integrated model. Previous studies like [20] give examples of using data-driven methods to generate event labels for unseen shots given very limited manually labeled shots. Therefore, given a manually labeled dataset with ∼300 shots, we want to develop an event identifier to automatically assign event labels to the non-disruptive training shots. To this end, we designed an event identifier, and the trained event identifier is used to generate event labels for all 900 non-disruptive discharges in the training set via an iterative labeling process. We note that the size of the manually labeled dataset is relatively small, and it only includes disruptive shots from the DIII-D 2015-2016 campaigns. Given this limitation, the distribution of different events in this dataset might be incomplete or biased, and it might miss some event chains that can lead to disruption. Generating event labels for non-disruptive discharges using a model trained on this manually labeled dataset can result in biased event labels, because the trained model is affected by the event occurrence in the training set and can only recognize those patterns of unstable events that appear in the training set. Therefore, the event identification performance of the final trained model on the manually labeled dataset can be exaggerated. Nevertheless, adding automatically labeled data to the training set should still improve the event detection performance of the trained model as long as the generated labels are accurate enough. In addition, a larger training set provides higher statistical significance. Furthermore, since the disruption label for each training shot is already known, and disruption labels are independent from event labels, the biased event labels should only have small effects on the disruption prediction results.
This is because disruption prediction does not require us to detect all events, but rather only those typical/frequent events in the event chains that lead to disruptions. As long as our biased event dataset covers the most frequent unstable events that lead to disruptions (e.g. HL, ML, MHD on DIII-D), the trained event identifier should be able to detect these frequent events and give us extra warning time. The biased event labels, and hence event identifiers, might miss some infrequent events, but these missed events should have a negligible effect on disruption prediction. In this section, the event identifier and iterative labeling process are discussed in detail.

The hybrid deep learning (HDL) event identifier
Modified with respect to our previous work [9,10], a hybrid deep learning (HDL) model is developed for unstable event identification. The HDL event identifier (HDL-EI) consists of six multi-scale temporal convolution (MSTConv) layers and two dense_bn layers, plus the input layer and the classification layer (with sigmoid activation [23] for each coordinate). A dense_bn layer (figure 1(c)) contains one fully connected layer followed by a batch normalization layer [24] and a rectified linear unit (ReLU) activation [25]. The MSTConv layer described in [9] is a novel neural network layer designed for time-series processing. It contains six 1D temporal convolution layers as well as batch normalization and ReLU activation. The architecture of the HDL-EI is shown in figure 1(a) and the structure of the MSTConv layer is detailed in figure 1(b). Empirically, the deep neural network is designed to have wider layers in the middle of the model, which allows the network to learn more complex patterns in the input data. Thus, the 3rd, 4th and 5th MSTConv layers in the middle of the neural network have 15 convolutional filters in each of their 1D temporal convolution layers, while the 1st, 2nd and 6th MSTConv layers have ten convolutional filters in each of their 1D temporal convolution layers. This architecture gives better performance than a model that has ten filters for all MSTConv layers.
The HDL-EI transforms an input ten-step consecutive temporal sequence of 14 plasma signals into an output 10D event level vector at the last time step of the sequence. Each coordinate of the event level vector provides the unstable level of one event in table 2, ranging between 0 and 1, where 1 is the unstable class and 0 is the stable class; the training loss of the HDL-EI is the average mean square error (MSE) over the individual unstable events. To label non-disruptive data, each shot was divided into batches of sequences, with neighboring sequences having nine steps of overlap. Therefore, given a non-disruptive shot with N flattop time steps from t_1 to t_N, the HDL-EI will generate N − 9 event level vectors corresponding to the time steps between t_10 and t_N. If one coordinate (e.g. the 3rd coordinate) of the output event vector exceeds the pre-set threshold corresponding to that event (e.g. HL, 0.5) at a flattop time step while it is below the event threshold at the previous time step, the time of this step is the predicted onset time t_onset,p of the corresponding event (e.g. HL); t_onset,t denotes the actual onset time of the corresponding event. A simple illustration of this process is shown in figure 2. (In figure 1, the feature extractor of the HDL-EI is marked by a green dashed box; note that the six 1D temporal convolution layers contained in the MSTConv layer have window lengths L from one to six to extract local temporal information at different levels, see [9] for a detailed explanation.) To evaluate the shot-by-shot performance of the HDL event identifier, we focus on the first onset of each unstable event during the test shot. If the predicted first onset time is close (within uncertainty) to the true first onset time, |t_onset,t − t_onset,p| < 0.03 s, then it is considered a true positive. Different tolerances were considered, and 30 ms represents the best trade-off, allowing us to achieve good average accuracy (above 80%) for the five most frequent events (HL, ML, RC, MHD, SAW). Furthermore, we find that the class membership probabilities for each particular event (i.e. the instability levels from the HDL-EI) corresponding to these five events usually ramp up within 30 ms of the unstable event onset. These observations suggest that 30 ms is a good choice for the definition of the TP criterion for these five most frequent events, and a 30 ms time interval is a good match to the time scale of these five events on DIII-D. If the output event level corresponding to an event does not exceed the threshold for the whole flattop of a shot, and this event does not happen during the flattop of this shot, this is regarded as a true negative. The HDL-EI is optimized to achieve the highest TPR at a fixed FPR (typically FPR = 0.1). From table 2, it is clear that, among the ten selected events, HL, ML, RC, MHD, and SAW have the highest frequencies. To maximize the overall accuracy of the model, a good model should give higher weight to the TPR in ML detection (since the occurrence of ML is 77%) to avoid missed alarms, while giving more weight to the FPR in IMC detection (since the IMC probability is low) to avoid false alarms. Due to these considerations, when we define the performance metric for each event, we choose different target FPRs for frequent and infrequent events, allowing us to rebalance the class frequencies for different events.
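The per-shot scoring rule can be sketched as follows. This is an illustrative reimplementation under our own naming: each mapping takes an event name to the first onset time in seconds, and we make the (assumed) choice of counting a detection that falls outside the tolerance, or that has no corresponding true event, as a false positive.

```python
def score_event_detection(true_onsets, pred_onsets, tolerance=0.03):
    """Shot-by-shot scoring sketch: for each event, the first predicted
    onset must fall within `tolerance` seconds (30 ms here) of the first
    true onset to count as a true positive (TP)."""
    outcome = {}
    for ev in set(true_onsets) | set(pred_onsets):
        if ev in true_onsets and ev in pred_onsets:
            close = abs(true_onsets[ev] - pred_onsets[ev]) < tolerance
            outcome[ev] = 'TP' if close else 'FP'
        elif ev in true_onsets:
            outcome[ev] = 'FN'   # event occurred but was never detected
        else:
            outcome[ev] = 'FP'   # alarm for an event that never occurred
    return outcome
```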

The iterative labeling process
The iterative labeling process evolves in two stages. During the first stage, the initial training set X_1 of the HDL-EI is constructed by sampling 20 sequences (10 × 14 matrices) from the unstable phase of each manually labeled disruptive training shot. The HDL-EI is then trained using this initial training set. The trained model is subsequently applied to the manually labeled disruptive test shots, and the optimal threshold corresponding to each event is obtained by optimizing the performance of the HDL-EI on the manually labeled test set. The performance of the HDL-EI model in the first stage is given in table 4. After this, the predicted event labels of all 900 non-disruptive shots are generated using the trained model and the optimized event thresholds. During the second stage, the training set is obtained by sampling 20 sequences from the unstable phase of each manually labeled disruptive training shot plus randomly sampling 20 sequences from the flattop of each non-disruptive training shot (with generated event labels). The HDL-EI is then trained using this combined training set. Given the trained model, the optimized threshold corresponding to each event and the predicted label of each non-disruptive training shot are obtained using the same method as in stage 1.
The second stage of the labeling process is run iteratively until the obtained thresholds and the performance on the manually labeled test set converge. The ensemble method is well known in the machine learning community and has been shown to significantly increase the performance and reduce the uncertainty of a model [9,26]. In our previous work [9], we have shown that using the ensemble method can significantly improve the performance of a data-driven disruption predictor. Therefore, we independently trained ten different HDL-EIs on the same dataset; each HDL-EI has different initial parameters (i.e. a different initialization) and different training random seeds. Then, we combine these ten independently trained HDL-EIs into an ensemble; the final output of our model is the average output of the ensemble of ten HDL-EIs. The diagram of this iterative labeling process is shown in figure 3, and the final optimized event thresholds are summarized in table 5. Notice that infrequent events tend to have lower thresholds. This comes from the fact that, for infrequent events, the HDL-EI sees mostly negative samples during training, so it learns to output low event levels to achieve high accuracy; the low output event levels in turn lead to low optimized event thresholds. An example of an automatically labeled non-disruptive training shot is given in figure 4.
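The control flow of the two-stage process can be summarized in a short skeleton. The three callables stand in for the actual HDL-EI training, event labeling and threshold optimization steps, which are far more involved; only the stage structure and the convergence test on the thresholds are meant to reflect the procedure described above.

```python
import numpy as np

def iterative_labeling(train_fn, tune_thresholds_fn, label_fn,
                       manual_set, nondisruptive_shots,
                       max_iters=10, tol=1e-3):
    """Skeleton of the two-stage iterative labeling loop.
    train_fn, tune_thresholds_fn and label_fn are placeholders for the
    real training, threshold-optimization and labeling steps."""
    # Stage 1: train on manually labeled disruptive shots only.
    model = train_fn(manual_set, [])
    thresholds = tune_thresholds_fn(model)
    labels = label_fn(model, thresholds, nondisruptive_shots)
    # Stage 2: retrain with generated labels until thresholds converge.
    for _ in range(max_iters):
        model = train_fn(manual_set, labels)
        new_thresholds = tune_thresholds_fn(model)
        labels = label_fn(model, new_thresholds, nondisruptive_shots)
        converged = np.max(np.abs(new_thresholds - thresholds)) < tol
        thresholds = new_thresholds
        if converged:
            break
    return model, thresholds, labels
```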

The integrated deep learning framework for disruption prediction and unstable event identification
The integrated DL framework, combining the predictive ability for disruptions as well as several precursors, is developed using the training set that includes both manually labeled and automatically labeled shots. This integrated framework is designed to map an input plasma sequence to two connected outputs: a scalar indicating the disruption risk and a 10-D event level vector that corresponds to the levels of all ten classes of unstable events. The model's loss function includes two terms that need to be minimized at the same time. Figure 5 shows the architectural details of this deep learning framework. Since the disruption level is closely related to the unstable levels of each disruption precursor, we want the intermediate representation of the input signals to contain information about both the precursors and the major disruption itself. Therefore, the integrated model is built upon the HDL-EI described in section 3 by adding a separate disruption prediction branch after the intermediate layer of the original HDL-EI. This allows the model to output both the disruption level, i.e. the 'disruptivity', and the predicted event level vector based on the intermediate representation of the input plasma signals. The integrated model adopts a composite loss function (a function that measures the difference between the predicted label and the ground truth) that includes contributions from both the unstable event identification branch and the disruption prediction branch. This loss function can be represented as:

loss_integrated model = loss_dis + λ · loss_event    (1)

where loss_event is the average mean squared error (MSE) loss of the unstable event task, while loss_dis is the average negative log-likelihood (NLL) loss of the predicted disruption risk. λ is a hyperparameter of the framework balancing these two terms, and we chose λ = 1 throughout this paper.
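Equation (1) can be written out as a minimal numpy sketch, combining a binary NLL term for the disruption branch with an MSE term for the 10-D event branch. The function name and clipping constant are our own; in the actual model the loss would be computed inside the DL framework's autograd machinery.

```python
import numpy as np

def composite_loss(dis_pred, dis_true, event_pred, event_true, lam=1.0):
    """Numpy sketch of equation (1): average binary negative
    log-likelihood of the predicted disruptivity plus lam times the
    mean squared error of the event-level branch (lam = 1 in the paper)."""
    eps = 1e-7                           # avoid log(0)
    p = np.clip(dis_pred, eps, 1.0 - eps)
    loss_dis = -np.mean(dis_true * np.log(p)
                        + (1.0 - dis_true) * np.log(1.0 - p))
    loss_event = np.mean((event_pred - event_true) ** 2)
    return loss_dis + lam * loss_event
```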
By removing the event branch of the integrated model and setting λ = 0 in equation (1), we can convert an integrated DL model into a baseline HDL disruption predictor. The shot-by-shot testing scheme of the integrated framework follows the two-stage approach of the HDL-EI iterative labeling. If the level of any unstable event (e.g. ML) or the disruptivity exceeds the corresponding pre-set threshold at any flattop time step, the whole shot is classified as unstable (with respect to that event, e.g. ML) or as a disruptive shot, respectively. A successfully predicted DIII-D disruptive shot from the test set is shown in figure 6, and the event identification performance of the model is given in table 6. The average TPR of the four most frequent unstable events (HL, ML, RC, MHD) reaches 84%, which is significantly better than the performance of the HDL-EI trained only with manually labeled data (see table 4); this confirms the effectiveness of using automatically generated event labels.
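The shot-level classification rule above can be sketched as follows; the function and argument names are illustrative. A shot is flagged disruptive if the disruptivity ever crosses its threshold during the flattop, and unstable with respect to event k if the k-th event level ever crosses that event's threshold.

```python
import numpy as np

def classify_shot(disruptivity, event_levels, dis_threshold, event_thresholds):
    """Shot-by-shot test sketch.  disruptivity has shape (T,);
    event_levels has shape (T, n_events).  Returns a disruptive flag
    and a per-event boolean vector of unstable classifications."""
    disruptivity = np.asarray(disruptivity)
    event_levels = np.asarray(event_levels)
    disruptive = bool((disruptivity > dis_threshold).any())
    unstable = (event_levels > np.asarray(event_thresholds)).any(axis=0)
    return disruptive, unstable
```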

Comparing the disruption prediction performance between the integrated model and the baseline disruption predictor
To investigate the advantage of an integrated DL model, we compare the performance of the integrated model with that of the baseline HDL disruption predictor, using the same test set for each approach. Both the integrated model and the baseline model are trained using the same DIII-D training set. The baseline model does not need event labels from the training shots; it only uses the threshold for the major disruption (the disruptivity threshold). The event thresholds are obtained via the iterative labeling process, and they are fixed during this experiment. The performance metric chosen for these numerical experiments is the area under the receiver operating characteristic (ROC) curve (AUC), where the ROC curve plots the true positive rate (TPR, the ratio of correctly predicted disruptive shots to all disruptive shots) against the false positive rate (FPR, the false alarm rate) [27]. The disruption prediction performances of all numerical experiments reported in this paper are evaluated at 50 ms before the current quench, as this is the reference warning time required to successfully trigger the mitigation system on future tokamaks like ITER [28]. The comparison results are shown in figure 7. To make this comparison fair, the hyperparameters of both disruption predictors are optimized independently using a separate validation set: we independently tune the hyperparameters of the baseline HDL model and of the integrated DL model to maximize their disruption prediction performance on this validation set. In addition, the cumulative distributions of warning times, i.e. the difference between the triggered alarm time t_alarm and the disruption time t_dis for true positive shots, returned by the two models are reported in figure 8. In this comparison, the integrated DL model gives AUC = 0.940 (TPR = 0.88 at FPR = 0.1) while the baseline HDL model gives AUC = 0.920 (TPR = 0.85 at FPR = 0.1).
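The TPR-at-fixed-FPR operating point used above can be computed as in the following sketch, where the alarm threshold is chosen on the non-disruptive population to hit the target false alarm rate. The function name and the quantile-based threshold choice are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def tpr_at_fpr(scores_dis, scores_nondis, target_fpr=0.1):
    """Pick the alarm threshold giving target_fpr on non-disruptive
    shot scores, then report the TPR on disruptive shot scores at
    that threshold (one operating point on the ROC curve)."""
    thr = np.quantile(scores_nondis, 1.0 - target_fpr)
    tpr = float(np.mean(np.asarray(scores_dis) > thr))
    return tpr, float(thr)
```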
Note that the 0.940 AUC achieved by the integrated model is close to the performance of the original HDL model, trained on a much larger dataset (reported in [9]). Three major factors contribute to this: (a) adding event information; (b) the improved network design, substituting a GRU layer with an MSTConv layer and adding short-cut connections; (c) adding two useful 1D features (Te_width_norm and Prad_peaking_CVA). Considering that the baseline predictor has already achieved high accuracy on DIII-D, the 3% TPR improvement is significant: by using the integrated DL model, we reduce the number of missed alarms by 20% (from 15 missed alarms to 12 missed alarms every 100 disruptions). The integrated DL model also gives longer median warning times compared with the baseline disruption predictor. The median warning time increases by roughly 200 ms when we use the integrated DL model, and the longer warning time could allow the plasma control system to take actions to avoid disruptions instead of simply mitigating them. Furthermore, detecting unstable events together with disruptions during plasma experiment operation enables disruption avoidance and the analysis of plasma physics. All these considerations clarify the advantage of the integrated HDL model over the baseline version.
The conclusions above suggest the advantage of providing unstable event information to such DL frameworks and verify the close correlation between the unstable event identification and disruption prediction tasks. In figure 8, most of the DIII-D shots with very long warning times (greater than one second) have a locked mode onset in the early or middle stage of the flattop, but this initial locked mode onset does not immediately result in a large thermal quench of the plasma.
Having demonstrated that adding event information helps the integrated model achieve higher accuracy on disruption prediction, we ask a further question: to improve disruption prediction, how accurate does the unstable event identifier need to be? To answer this question, we reduced the size of the manually labeled training set (from 160 shots to 110 shots) and used the iterative labeling process on this reduced training set to label the 50 remaining disruptive shots and the 900 non-disruptive shots. Then, we combine these 50 disruptive shots and 900 non-disruptive shots, with generated labels, with the 110 manually labeled shots into a new 'degraded' training set. Finally, we train a 'degraded' integrated DL model using this new combined dataset. The event identification performance of the 'degraded' integrated DL model is shown in table 7; the average TPR for the four most frequent unstable events (HL, ML, RC, MHD) is 0.74. The disruption prediction performance of the 'degraded' integrated DL model, and the comparison with both the complete integrated DL model (trained with all event information) and the baseline HDL model, are given in table 8. From the comparison, the 'degraded' integrated DL model gives disruption prediction performance similar to that of the baseline HDL model, which suggests that a poor event identifier might not be able to provide extra information for disruption prediction.
Results from table 8 show that the event identifier needs to achieve higher than 75%-80% accuracy for the most frequent unstable events in order to improve disruption prediction. We note that this 75%-80% accuracy estimate is not directly applicable to other devices, because different devices have different frequent events and different event occurrence rates. This empirical accuracy should also depend on the signals considered by the model and on the accuracy of the baseline model; the required accuracy will decrease if the baseline model performance is lower. Knowing the statistics of the root causes of disruptions on a given tokamak [1,29] might help us obtain an upper bound on this required accuracy. However, since this value depends on many factors, a more accurate estimate of the required accuracy needs to be obtained via numerical experiments.

Cross-machine performance of the integrated model
Given the fact that a few, or even just one, unmitigated full-current, high-stored-energy disruption can significantly damage future tokamaks like ITER, it is strongly desirable to develop a disruption predictor that can operate reliably and accurately before the first high-performance run day of such tokamaks [2]. Therefore, a disruption prediction model with better cross-machine transferability represents a suitable candidate for the DMS trigger algorithm for future devices like ITER, assuming that enough knowledge is extracted from other tokamaks' data and that only minimal data is required from the future tokamak itself. From section 4, we find that unstable event information can provide extra information and improve the disruption prediction performance of the data-driven model. Next, we investigate the cross-machine transferability of the integrated framework by setting up extensive numerical experiments and comparing disruption prediction performance against the baseline DL model.
In this section, we consider DIII-D as the 'existing' device with C-Mod or EAST chosen as the 'new' device and investigate how the integrated model (with event information) and baseline disruption predictor (without event information) trained on DIII-D data perform on either C-Mod or EAST. The description of C-Mod and EAST disruption warning databases can be found in [7,9].

Cross-machine prediction performance of the integrated model and baseline disruption predictor
The cross-machine transferability of the trained integrated model and the baseline one is considered in the comparison experiments: both models are tested on the EAST and C-Mod datasets. Besides the prediction accuracy, we are also interested in whether providing unstable event information from the 'existing' device allows the trained model to find early disruption precursors on a different, 'new' machine and hence give longer warning times. To this end, the distributions of warning times returned by the integrated model and the baseline disruption predictor are also analyzed. Results on the test set are shown in figures 9 and 10. From these test results, we find that the integrated DL model outperforms the baseline HDL in the cross-machine scheme and provides longer warning times: when the FPR is set to 0.2, the integrated DL gives a 120 ms median warning time on EAST, while the baseline HDL gives a 50 ms median warning time (figures 9(b) and 10(b)).
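Comparing warning times "when FPR is set to 0.2" implies calibrating each model's alarm threshold on the non-disruptive shots first, so that both models are evaluated at the same false-alarm cost. The sketch below is our own simplified version of that calibration, assuming a shot raises a false alarm whenever its peak predictor output reaches the threshold; it is illustrative, not the paper's evaluation code.

```python
import numpy as np

def threshold_for_fpr(nondisruptive_peaks, target_fpr):
    """Choose the alarm threshold so that (approximately) a fraction
    `target_fpr` of non-disruptive shots raise a false alarm, where a
    shot alarms if its peak predictor output reaches the threshold."""
    return np.quantile(np.asarray(nondisruptive_peaks), 1.0 - target_fpr)

def false_positive_rate(nondisruptive_peaks, threshold):
    """Fraction of non-disruptive shots whose peak output triggers an alarm."""
    return float(np.mean(np.asarray(nondisruptive_peaks) >= threshold))

# Toy example: peak disruptivity of ten non-disruptive shots; at a
# target FPR of 0.2, two of the ten shots are allowed to alarm.
peaks = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
thr = threshold_for_fpr(peaks, target_fpr=0.2)
```

With the threshold fixed this way on each model separately, the warning-time distributions on the disruptive shots become directly comparable between the integrated and baseline predictors.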
These results imply that information about unstable events, i.e. disruption precursors, contains general, machine-independent knowledge about disruptions. The common physics contained in the event information can be learned by an integrated model and gives better transferability across devices when predicting disruptions.

Cross-machine unstable event identification
Beyond cross-machine disruption prediction transferability, we also want to investigate whether the integrated model trained with data from one machine can identify disruption precursors on different devices. To this end, we manually labeled the unstable events of a few shots from EAST and C-Mod and applied our integrated model trained on labeled DIII-D shots. While the number of manually labeled EAST and C-Mod test shots is small, we can still draw some preliminary cross-machine conclusions from the test. The test results are shown in figures 11 and 12. The results from these experiments point to the following qualitative conclusions:

• The integrated model trained with data from one device can qualitatively identify disruption precursors on different tokamaks. This observation again suggests that the underlying physics driving the unstable events is similar on different tokamaks (figures 11 and 12).
• The integrated model trained on data from only one device seems to have a large numerical bias when it is directly applied to another device. This bias can make the absolute values of the disruptivity and unstable event levels returned by the model meaningless. Nevertheless, increases in the output instability levels from the model still indicate increasing risks of the corresponding plasma instabilities (figure 12). The underlying reason for this cross-machine bias is that different machines occupy very different operational regimes in the plasma signal space, even after normalization [9]. Therefore, a data-driven model trained on data from device A can be unconstrained on a disjoint new operational regime (e.g. the operational regime of a different tokamak B) because it has never seen a sample from this new regime. Extrapolating the trained model to this new regime can result in a large numerical bias. This numerical bias can be greatly reduced by adding a few shots from the target domain/device [8-10] and/or some simulation data to the training set. Previous studies [5,7,9,10] have shown that different devices usually exhibit similar behaviors when a disruption/unstable event is imminent. These similar dynamics among devices can be captured by data-driven models (especially sequence-based models), and hence the changes in the output instability levels from a data-driven model trained on other devices can still reflect the risks of the corresponding unstable events [9].

Summary of cross-machine numerical conclusions
Given all the conclusions in sections 5.1 and 5.2, it is possible to state that disruption prediction and unstable event identification are two closely related tasks from the machine learning perspective. Furthermore, the physics of disruption precursors is largely similar among different tokamaks, and the unstable event information provides general knowledge about disruptions. Therefore, integrated models trained with additional disruption precursor information provide better disruption prediction performance as well as better cross-machine transferability compared with baseline disruption predictors. Given a labeled dataset, this strategy of combining a disruption predictor and a disruption precursor identifier into a single integrated framework can easily be applied to upgrade any neural-network based disruption predictor and improve its performance.

Summary and future plans
In this paper, we have discussed an iterative labeling method to automatically assign event labels to unlabeled shots using a deep learning based event identifier and a manually labeled DIII-D database with a few hundred disruptive shots. Because all manually labeled shots are disruptive, while both disruptive and non-disruptive shots are needed to train the integrated DL model, we used the iterative labeling method to construct a training database with 160 manually labeled disruptive DIII-D shots and 900 automatically labeled non-disruptive DIII-D shots (with generated event labels). The generated event labels might be biased, because the HDL-EI model used to generate them is trained using only 160 manually labeled disruptive shots, all from the 2015-2016 DIII-D campaigns. In this context, we assume that the limited statistical representation of DIII-D disruption dynamics might lead to a biased dataset. Using the manually labeled disruptive set together with the automatically labeled non-disruptive discharge set, we have developed an integrated deep learning framework that outputs the disruptivity score and unstable event levels simultaneously. Through numerical experiments, the integrated model is found to give higher disruption prediction accuracy as well as longer warning times compared with the baseline version aimed solely at predicting disruptions. Cross-machine numerical studies using C-Mod, DIII-D, and EAST data further demonstrate that the integrated model provides better cross-machine transferability, and that the integrated model trained using data from one device can qualitatively identify disruption precursors on a different tokamak.
All of these conclusions confirm the close correlation between the disruption prediction and disruption precursor identification tasks and suggest that the physics mechanisms of disruption-related events share large similarity on different tokamaks. Therefore, combining a disruption predictor and a disruption precursor identifier into a single model is a promising strategy for the development of disruption predictors on future devices, and it highlights the importance of including unstable event information when constructing databases for data-driven disruption prediction studies. Future efforts will focus on the following topics. Firstly, we note that the disruption prediction performance of the integrated model shown in this paper does not meet the requirements foreseen for future tokamaks like ITER (TPR roughly 0.99 with FPR ⩽ 0.05) and SPARC (likely TPR roughly 0.90-0.95 with FPR ⩽ 0.05 when including considerations of thermal loads from disruptions). Given the iterative labeling method, and considering the value of unstable event information, we plan to manually label unstable events for a few hundred DIII-D, EAST and C-Mod disruptive and non-disruptive shots. We can then further expand the manually labeled databases by using our iterative labeling method with the original manually labeled datasets to automatically label more shots. The expanded DIII-D, C-Mod and EAST databases will allow us to investigate the limits of our integrated model's disruption prediction capability on existing devices. In addition, cross-machine numerical experiments using these databases and the integrated model can further confirm the efficiency of our labeling method, and also allow us to develop a more robust cross-machine integrated model using databases from all three tokamaks.
Secondly, a real-time test of the integrated model's capabilities should be conducted on an existing tokamak (DIII-D is a good candidate) to further investigate robust disruption prediction and avoidance strategies, and to facilitate the development of a disruption handling system on future tokamaks. Finally, we want to investigate the most dangerous paths to disruption on future tokamaks, including ITER and SPARC, and perform tokamak discharge simulations for these conditions (e.g. MHD simulation). We can then apply the integrated model to the synthetic signals and quantify the performance of the predictor. The integrated DL model can also be used together with a tokamak discharge simulator to find good operating scenarios (stable, high fusion power/fusion gain plasmas) for future tokamaks.