Simultaneous control of safety factor profile and normalized beta for JT-60SA using reinforcement learning

Plasma with an internal transport barrier (ITB) will be developed in JT-60SA as an attractive operation appropriate for a steady-state fusion reactor. To achieve the ITB plasma while avoiding magnetohydrodynamic instabilities, it is advantageous to simultaneously control the safety factor (q) profile and the normalized beta (β_N). In this study, a control system for simultaneous control of the q profile and β_N is studied in simulations prior to the real experiment in JT-60SA. The bootstrap current dominates the total current in the ITB region, which results in a coupling between the pressure profile and the q profile. Thus, it is crucial to control the q profile and β_N according to the strength of the ITB. A two-stage neural network (NN)-based control system was developed to address this problem. The first stage estimates the transport properties (i.e. the ITB strength) of the plasma from measurements. The second stage consists of several NNs for control of the q profile and β_N. According to the ITB strength estimated by the NN in the first stage, the appropriate NN for control is selected from those in the second stage. Each NN in the second stage is trained to control plasmas with a different ITB strength through reinforcement learning employing RAPTOR, an integrated transport code. To validate this system, it is tested in simulations employing another integrated transport code, TOPICS, to mimic the plasma control in JT-60SA plasmas with various ITB strengths. Stable control of the q profile and β_N is achieved in ITB plasmas simulated by both the RAPTOR and TOPICS codes.


Introduction
The safety factor (q) profile is a crucial parameter in tokamak plasmas since it affects the confinement performance and magnetohydrodynamic (MHD) stability. The q profile typically decreases monotonically from the edge to the core, and the q at the magnetic axis falls below 1 unless additional heating or non-inductive current drive is used to change the profile. This type of magnetic configuration is employed in the ITER standard operation [1]. In this configuration, the magnetic shear s, defined as s = (r/q)dq/dr, is positive throughout the plasma. By modifying the q profile to have a weak or negative magnetic shear in the inner half of the plasma, the anomalous transport can be reduced and the confinement performance can be enhanced [2]. Advanced tokamak operations are operation modes with such modified q profiles. These advanced tokamak operations are crucial for enhancing the overall performance of the fusion plasma as a power source, since they not only enhance the energy confinement but also enable steady-state operation with a high bootstrap current fraction [3,4]. The control of the q profile has been extensively investigated in the context of sustaining advanced tokamak operations [5-7]. Another crucial parameter in tokamak plasmas is the normalized beta, β_N = β aB_T/I_p, where a, B_T, and I_p represent the minor radius, the toroidal magnetic field, and the plasma current, respectively. The fusion power is approximately proportional to the square of β_N. However, MHD instabilities tend to emerge if β_N exceeds certain limits, such as the no-wall limit or the ideal-wall limit [8-12]. Thus, it is crucial to control β_N to a certain value according to the target operation scenario.
One of the main goals in JT-60SA [13,14] is to develop a high-beta steady-state scenario that is appropriate for a steady-state fusion reactor. Advanced tokamak operations with an internal transport barrier (ITB) are envisioned to achieve this goal [15-17]. JT-60SA will be equipped with various non-inductive current actuators, such as neutral beams (NBs) and electron cyclotron (EC) waves, which can be used for approximately 100 s. Those actuators will allow us to perform long pulse operations that far exceed the current relaxation time. These features make JT-60SA an ideal device for investigating the simultaneous control of the q profile and β_N necessary for sustainment of the advanced tokamak.
Real-time q profile control has been investigated in many tokamaks, including Tore Supra [5], JET [18], JT-60U [6,19], and DIII-D [7]. In more recent studies, model-based methods have been widely investigated to realize more reliable and accurate control [20-27]. Furthermore, model-based simultaneous control of the q profile and the kinetic profile or β_N has been investigated [20,21,28,29]. In ITB plasmas, a large amount of localized bootstrap current is generated due to a strong and localized pressure gradient. This bootstrap current modifies the q profile, which in turn affects the confinement quality. Thus, the response characteristics of the q profile and β_N become closely coupled, showing a self-regulating nature [30,31]. Simultaneous control of the q profile and β_N becomes more challenging under these conditions. Control in ITB plasmas will need to be developed based on models that consider the coupling between the q profile and the pressure profile through ITB formation.
In this study, reinforcement learning is introduced as an alternative method to develop a model-based control system. Through trial and error in integrated transport simulations, a neural network (NN)-based controller for simultaneous control of the q profile and β_N is trained. Recently, reinforcement learning has been employed for magnetic equilibrium control in TCV [32], operation trajectory design in KSTAR [33,34], vertical stabilization in EAST [35], q profile control in DEMO [36], and ion temperature gradient control [37]. The controller is trained by employing various randomized models to bridge the gap between the model dynamics and the real plasma. This method is called domain randomization in reinforcement learning. However, the performance of the controller deteriorates if it is trained using models with high uncertainty [38]. In this study, a two-stage NN-based control system is newly developed to address this issue.
This study is organized as follows. The approach for training the two-stage NN-based control system is explained in section 2, including the detailed structure of the control system and the setup of reinforcement learning. The results of simultaneous control of the q profile and β N in simulations are presented in section 3. The trained control system is tested in the simulations using a code employed in training and a different code with high fidelity to mimic the plasma control in JT-60SA plasmas. Finally, in section 4, the conclusions of this study are presented.

Training of two-stage control system employing the RAPTOR code
A control system that consists of two stages of NNs is developed in this study, as illustrated in figure 1. The first stage, called the analyzer NN, uses measurements to estimate the transport characteristics of the current plasma. The NNs in the second stage, called the controller NNs, determine the control output for the actuators of NB and EC. Several controller NNs are prepared, each of which is trained for plasmas with different transport characteristics. A controller NN is selected based on the transport characteristics deduced by the analyzer NN during plasma control. This method enables each controller NN to be trained for plasmas with smaller model uncertainty while maintaining the overall applicability of the control system. In this section, the reinforcement learning approach employed to train the controller NNs is described, followed by the supervised learning approach employed to train the analyzer NN.

Training of the controller NN
The controller NN is trained to control the q profile and β_N employing NB and EC. JT-60SA will be equipped with on-axis and off-axis negative ion-based NBs (on-axis NNB and off-axis NNB), and three groups of positive ion-based NBs that will be injected in co-current tangential (co-PNB), counter-current tangential (ctr-PNB), and perpendicular (perp-PNB) directions. The heating and current drive profiles computed employing the integrated transport code TOPICS [39,40] for a typical plasma at the plasma current flat-top phase are shown in figures 2(a) and (b) as a function of the normalized minor radius, ρ.
A set of input parameters of the NN in reinforcement learning is called a state since it describes the state of the controlled object, which in this study is the plasma. The state at the ith control time step is defined as s_i = [q(ρ, t_i), P_act(t_i), β̄_N(t_i), β_N^limit(t_i), θ_Flattop], where ρ = [0.1, 0.15, ..., 0.6] represents the normalized radius from 0.1 to 0.6 with an interval of 0.05, P_act(t_i) = [P_EC, P_on-NNB, P_off-NNB, P_co, P_ctr, P_perp] is the heating power of EC and each group of NBs, β̄_N(t_i) = 0.8 β̄_N(t_{i-1}) + 0.2 β_N(t_i) is the smoothed value of the past β_N values, β_N^limit(t_i) is the control target of β_N, and θ_Flattop is a Boolean function that takes 0 in the current ramp-up phase and 1 in the current flat-top phase. In this study, the control interval is 0.1 s, which is shorter than the energy confinement time (∼0.4 s). P_act(t_{i+1}) is the output parameter of the NN. The smoothed β_N is included in the state since it can be employed to reduce oscillations in β_N control. The parameters included in the state are selected such that the NN can estimate the present response characteristics of the q profile and β_N to the heating and current drive by EC and NBs. In this study, the interaction between the controller and the plasma is modeled as a transition from the present state s_i to the next state s_{i+1} as a result of the controller's output, P_act(t_{i+1}).
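The state construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names, ordering of the concatenated vector, and the data types are assumptions; only the ρ grid, the six actuators, and the 0.8/0.2 smoothing come from the text.

```python
# Sketch of the state vector s_i described above; names and ordering are
# illustrative assumptions, not the authors' implementation.
RHO_GRID = [round(0.1 + 0.05 * k, 2) for k in range(11)]  # 0.1 .. 0.6, step 0.05
ACTUATORS = ["EC", "on-NNB", "off-NNB", "co-PNB", "ctr-PNB", "perp-PNB"]

def smooth_beta_n(prev_smoothed, current):
    """Exponential smoothing used for beta_N in the state:
    smoothed(t_i) = 0.8 * smoothed(t_{i-1}) + 0.2 * beta_N(t_i)."""
    return 0.8 * prev_smoothed + 0.2 * current

def build_state(q_profile, p_act, beta_n_smoothed, beta_n_target, flattop):
    """Flat state vector: q on the rho grid, actuator powers,
    smoothed beta_N, its control target, and the flat-top flag."""
    assert len(q_profile) == len(RHO_GRID) and len(p_act) == len(ACTUATORS)
    return list(q_profile) + list(p_act) + [beta_n_smoothed,
                                            beta_n_target,
                                            float(flattop)]
```

With 11 radial points, 6 actuator powers, and 3 scalars, the state here has 20 entries; the smoothing means a step change in β_N reaches the state with a lag of a few control intervals.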
The NN is trained using a value called a reward, which evaluates the control result based on the state transition at each time step. The reinforcement learning approach aims to maximize the sum of rewards obtained in a series of time-dependent simulations. In this study, Soft Actor-Critic [41] is employed as the reinforcement learning algorithm since it has two features, parallel learning and experience replay, that help accelerate the learning process. Parallel learning enables us to increase the number of trials per unit time, and experience replay enables us to reuse the results of previous trials for current learning, which enhances control performance even with a relatively small number of trials. A feed-forward NN with two fully connected hidden layers of 256 units each is used for the controller NN. The hyperparameters used in the training of the controller NN are unchanged from the original Soft Actor-Critic paper [41]. The transport simulation for NN training is performed using the control-oriented integrated transport code RAPTOR [42,43], which enables rapid predictive simulations. Only the current profile and the electron temperature profile are solved to minimize run time. The ion temperature profile is assumed to have the same shape as the electron temperature profile, but multiplied by c_scale, which is changed randomly during training. The time evolutions of the electron density and the 2D MHD equilibrium are prescribed. At the current flat-top, the density profile is assumed as n_e = n_edge + n_e0(1 − ρ^2)^{r_n}, where n_edge = 5 × 10^17 m^−3, and n_e0 and r_n are randomly set to values between 3 × 10^19 and 4 × 10^19 m^−3 and between 0.1 and 0.2, respectively. The plasma current is ramped up from 0.9 MA to 1.9 MA in 3.5 s from t = 1.5 s and is maintained at 1.9 MA for 45.0 s from t = 5.0 s. The simulation starts from t = 1.5 s with the same initial plasma parameters and ends at t = 50.0 s.
The time step in the RAPTOR simulation is 10 ms, while the control time step is 100 ms. The toroidal magnetic field at the plasma center is 1.9 T, and q_95 is about 7.5 at the current flat-top.
As illustrated in figures 2(c) and (d), the heating and current drive sources of NBs and EC are given as prescribed profiles scaled with the input power. These profiles are defined to be similar to typical profiles computed in TOPICS. In RAPTOR, all the heating powers are assumed to be absorbed by electrons or ions in predefined ratios. For instance, EC is fully absorbed by electrons, while PNBs are primarily absorbed by ions. RAPTOR does not consider fast ion pressure, unlike TOPICS. However, a significant portion of the applied power is stored as fast ion energy in JT-60SA due to the large amount of power injected by NBs. The ion and electron temperatures would be overestimated if this injected NB power were assumed to be directly absorbed by thermal ions or electrons. To reconcile this discrepancy, an ad hoc fast ion model is introduced, in which, for the kth beamline, the fast ion stored energy W_{k,fast} relaxes with an effective slowing-down time τ_k, and P_{k,abs} is the resulting power absorbed by thermal ions and electrons.
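The equation of the ad hoc fast ion model is not reproduced above; a plausible form consistent with the quantities it defines is an exponential slowing-down of the stored energy, dW_fast/dt = P_inj − W_fast/τ with P_abs = W_fast/τ. Both this form and the explicit-Euler discretization below are assumptions for illustration, not the paper's exact implementation.

```python
def step_fast_ion_energy(w_fast, p_inj, tau, dt):
    """One explicit-Euler step of an assumed slowing-down model for one beamline:
        dW_fast/dt = P_inj - W_fast / tau,   P_abs = W_fast / tau,
    so the power absorbed by the thermal plasma lags injection by roughly tau."""
    p_abs = w_fast / tau          # power released to thermal ions/electrons
    w_next = w_fast + dt * (p_inj - p_abs)
    return w_next, p_abs
```

Under constant injection, W_fast relaxes toward P_inj·τ and the absorbed power approaches the injected power, which is the delayed-absorption behavior the model is meant to capture.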
An ad hoc transport model defined in [44] is employed for the anomalous heat transport. The thermal diffusivity χ_e is defined as a sum of several terms, in which χ_neo represents the neoclassical diffusion term and is assumed to have a spatially constant value for simplicity.
This model describes the enhancement in confinement when the magnetic shear is low and I_p increases, while having higher diffusivity toward the plasma edge. The second term, χ_central, is added to increase the thermal diffusivity at the plasma center to represent the experimental T_e profiles, which are typically flat near the center. The coefficients χ_central = 10 and δ_0 = 0.1 are employed. The last term represents the enhancement in confinement due to the modification of the q profile; it governs the decrease in thermal diffusivity when the magnetic shear is small. Two sets of parameters are defined for a_ic, w_ic, d_ic, c_ano, and χ_neo. It is assumed that the ITB is formed when the q profile becomes a reversed shear profile, and the parameters are switched with a probability of 10% per time step when the normalized minor radius at which q takes its minimum (ρ_qmin) is larger than 0.4. Once the parameters are switched, they remain in effect until ρ_qmin becomes less than 0.4. This allows two different levels of confinement performance to be defined before and after the formation of the ITB. The second set of coefficients is introduced to keep the confinement performance high once the ITB is formed. This restriction is required to achieve the desired q profile and β_N in the stationary state within the available power from NBs and EC. On the other hand, a reversed shear q profile can, sometimes transiently, be formed in plasmas with a wider range of confinement characteristics. Therefore, the range of coefficients in the first set, which is used before ITB formation, is made wider than that of the second set.
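The latching switch between the two parameter sets can be sketched as follows. The 10% probability and the ρ_qmin thresholds are from the text; treating the check as once per time step, and the exact reversion condition, are assumptions.

```python
import random

def update_itb_flag(itb_formed, rho_qmin, rng, p_switch=0.10):
    """Latching parameter switch described above: while rho_qmin > 0.4 the
    transport parameters move to the post-ITB set with probability p_switch
    per step, and revert only once rho_qmin drops back below 0.4."""
    if itb_formed:
        return rho_qmin >= 0.4   # stay on the second set until rho_qmin < 0.4
    return rho_qmin > 0.4 and rng.random() < p_switch
```

Because the switch is probabilistic, ITB formation is not tied to a single deterministic instant, which exposes the controller to some variability in when the confinement improves.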
The parameters, such as c_scale, n_e0, r_n, c_ano, and χ_neo, are randomly selected at the beginning of each simulation to simulate plasmas with various confinement characteristics. These parameters are fixed during the simulation of a single discharge from 1.5 s to 50.0 s. The ranges of parameters employed in the training are shown in table 1. After the formation of the ITB, it is assumed that the range of c_ano is reduced to almost half. The parameter a_ic determines the strength of the ITB and affects the response characteristics of the q profile and β_N. It is difficult to achieve the control goal if a single NN is trained for plasmas with various ITB strengths, as mentioned in the introduction. Thus, the range of a_ic from 0.2 to 1.0 is divided into three ranges to train three NNs for plasmas with different ITB strengths. The ranges overlap, which is necessary to avoid unfavorable chattering of the control output, as demonstrated in the example shown in section 3. In addition to the controller NNs employed after the ITB formation, another controller NN to be used before ITB formation is trained, since the control required to achieve the ITB is different from that required to maintain it.
The control goal is to achieve and maintain a q profile with a minimum value (q_min) higher than 2, to avoid the deleterious MHD modes associated with the rational surface of q = 2, and to keep β_N at 3, which is slightly less than the empirical limit of 3.5 [8,10]. Although the ideal MHD limit with a conducting wall is above β_N = 3 in the reversed shear plasmas investigated in this study, a precise MHD stability analysis is beyond this study's scope. The design of the reward is crucial to achieving the control goal since the NN is trained to maximize the sum of the rewards. The reward at each time step consists of six terms (r_{l=1-6}), as illustrated in figure 3. The purpose of the terms r_1 and r_2 is to provide the NN with information about the target range of q_min and the target value of β_N. r_1 is added only when ρ_qmin is higher than 0.4 to control q_min in reversed shear profiles. The terms r_3 and r_4 are added to achieve a stationary state and suppress the fluctuations of q_min and β_N. To prevent the q profile from staying at off-target profiles, r_4 is added only when ρ_qmin is greater than 0.4 and q_min is between 2.1 and 2.8. The term r_5 is added to achieve a reversed shear q profile, which takes its minimum at ρ greater than 0.4. To avoid a strongly reversed shear q profile, the term r_6 is added. The key point is that the target for β_N is a specific value of 3, whereas no strict target is given for the q profile; instead, a target range for q_min is provided. This is because β_N is related to the fusion power in the reactor and should be regulated to a designated value, whereas for the q profile it is sufficient to regulate certain key parameters, such as q_min, to a designated range to enhance confinement and avoid MHD instabilities.
Additionally, if a strict target for the q profile is given, it may not be attainable since a significant portion of the plasma current is driven as the bootstrap current, which depends on plasma confinement characteristics and shows a highly self-regulated nature. Furthermore, a limited number of actuators are available to drive the non-inductive current. In other words, the current profile or the q profile cannot be controlled to an arbitrary profile, so it is inappropriate to set a strict target for the q profile.
In addition to the rewards described above, large additional rewards and penalties are given. An additional reward, equaling the number of remaining time steps multiplied by two, is given when a reversed shear q profile is formed and the parameters are switched from the first set to the second. This reward is given to achieve a reversed shear q profile as early as possible. A penalty of −100 is given when q_min is less than 1 or β_N is more than 3.5. This penalty is given to avoid MHD instabilities. Under the simulation settings described above, a controller NN trained for each range of a_ic can achieve good control results for q_min and β_N after about one million time steps of simulations.
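The overall shape of the reward can be illustrated as follows. Only the target conditions (q_min in the 2 to 3 range, β_N = 3, ρ_qmin > 0.4, the r_4 window of 2.1 < q_min < 2.8, and the −100 penalty for q_min < 1 or β_N > 3.5) come from the text; the functional forms, weights, and the q_max threshold are invented here purely for illustration.

```python
def reward(q_min, rho_qmin, beta_n, d_qmin, d_beta_n, q_max):
    """Illustrative six-term reward; forms and weights are assumptions,
    only the target conditions and the -100 penalty follow the text."""
    if q_min < 1.0 or beta_n > 3.5:                      # penalty from the text
        return -100.0
    r = 0.0
    if rho_qmin > 0.4:                                   # r1: q_min toward (2, 3)
        r += 1.0 - min(abs(q_min - 2.5) / 0.5, 1.0)
    r += 1.0 - min(abs(beta_n - 3.0) / 0.5, 1.0)         # r2: beta_N toward 3
    r += 1.0 - min(abs(d_beta_n) / 0.1, 1.0)             # r3: stationary beta_N
    if rho_qmin > 0.4 and 2.1 < q_min < 2.8:             # r4: stationary q_min on target
        r += 1.0 - min(abs(d_qmin) / 0.1, 1.0)
    if rho_qmin > 0.4:                                   # r5: reversed shear achieved
        r += 1.0
    r += 1.0 - min(max(q_max - 6.0, 0.0) / 2.0, 1.0)     # r6: not too strongly reversed
    return r                                             #     (threshold 6 is assumed)
```

A state exactly on target with no fluctuation scores the maximum of all six terms, while crossing either safety boundary overrides everything with the large penalty.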

Training of the analyzer NN
The analyzer NN is introduced to select the most suitable controller NN for the present transport characteristics. Since the controller NNs are trained with different ranges of a_ic, the analyzer NN is trained to infer the model parameters used in the RAPTOR simulations from the states s_i. More than 5 million combinations of the states s_i and the model parameters are generated alongside the RAPTOR simulations performed during reinforcement learning of the four controller NNs. Training of the analyzer NN can then be performed as supervised learning using these samples, separately from the training of the controller NNs. The analyzer NN is a feed-forward NN with five hidden fully connected layers. The first and the last hidden layers have 256 units each, the second and the fourth have 128 units each, and the third has 64 units. The optimizer used is Adam [45], with hyperparameters set to default values except for the learning rate, which is set to 3 × 10^−4. The batch size used for training is 1024. After more than 15 000 training epochs, the analyzer NN is able to predict the model parameters with root mean squared errors of around 0.1 for a_ic and 0.6 for c_ano, as illustrated in figure 4. The analyzer NN also predicts the other model parameters, such as c_scale, n_e0, r_n, τ_k, w_ic, d_ic, and χ_neo, although they are not employed in the control system.
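The analyzer architecture described above (hidden layers of 256, 128, 64, 128, and 256 units) can be sketched as a plain forward pass. The weights below are random stand-ins, the input width of 20 is illustrative, the two outputs stand for (a_ic, c_ano) estimates, and the ReLU activation is an assumption the paper does not state.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(n_in, hidden=(256, 128, 64, 128, 256), n_out=2):
    """Layer shapes of the analyzer NN described in the text; the weights
    here are random stand-ins, not trained values."""
    sizes = [n_in, *hidden, n_out]
    return [(rng.normal(0.0, (2.0 / m) ** 0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """ReLU hidden activations (assumed), linear outputs,
    e.g. estimates of (a_ic, c_ano)."""
    for w, b in layers[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = layers[-1]
    return x @ w + b
```

With roughly 100k weights, such a network is small enough for the analyzer to run comfortably within the 100 ms control interval.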

Control results using trained NN
This section shows the test results of the trained NN-based control system. In the first part, the control results in RAPTOR simulations demonstrate that this system can achieve the control goal in the simulations employed in the training. The control result in the TOPICS simulation is shown in the second part. Although both RAPTOR and TOPICS solve 1D transport equations, they employ different physics models. For instance, they employ different heating and current drive models and anomalous heat transport models. Thus, testing the trained NN using TOPICS enables us to verify its validity for plasmas with response characteristics different from those employed in the training.

Control results in RAPTOR simulations
An example of the control test in the RAPTOR simulation at a_ic = 0.8 is illustrated in figure 5. The control starts at t = 1.5 s employing the controller NN trained to control plasmas before the ITB formation. Off-axis NNB and ECH are primarily injected during the I_p ramp-up to delay the current diffusion and drive the non-inductive current at the off-axis location. Thus, a reversed shear q profile is formed at t = 5.7 s, and the controller NN is switched to one that is trained to control plasmas after the ITB formation. As the analyzer NN estimates a_ic to be about 0.8, the controller NN trained for 0.6 < a_ic < 0.9 is selected accordingly. After the ITB formation, the off-axis NNB power is slightly reduced to decrease q_min below 3 before 10 s; then, 2 < q_min < 3 and β_N ∼ 3 are stably maintained. There is a relatively large oscillation in β_N between 2 s and 5 s. This is because an L-H transition takes place at 2.2 s, and the shape of the reward function for β_N changes at 5 s, when the current flat-top phase starts.
Another example of the control test in the RAPTOR simulation is illustrated in figure 6. Here, a_ic is changed from 0.8 to 0.2 at t = 25 s. The waveforms before t = 25 s are the same as in the case demonstrated in figure 5. After t = 25 s, the analyzer NN follows the sudden change in the value of a_ic employed in the simulation, and the estimated value falls below 0.4 in about 1 s. At this point, the controller NN is switched to the one trained for 0.2 < a_ic < 0.7. The switchover of the controller NNs occurs at a_ic = 0.4 and a_ic = 0.7 when a_ic decreases and increases, respectively. This hysteresis is introduced to avoid unfavorable chattering of the control due to frequent switchover. After the switchover, perp-PNB power is increased to sustain β_N despite the reduced confinement due to the weakened ITB.
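The hysteresis described here amounts to a simple two-threshold switch. The sketch below is simplified to the two adjacent controllers named in the text ("low" for 0.2 < a_ic < 0.7, "high" for 0.6 < a_ic < 0.9); the string labels are illustrative, and only the 0.4 and 0.7 thresholds come from the paper.

```python
def select_controller(current, a_ic_est):
    """Hysteresis between the controller NN trained for 0.2 < a_ic < 0.7
    ('low') and the one for 0.6 < a_ic < 0.9 ('high'): switch down only
    when the estimate falls below 0.4, up only when it rises above 0.7."""
    if current == "high" and a_ic_est < 0.4:
        return "low"
    if current == "low" and a_ic_est > 0.7:
        return "high"
    return current  # inside the overlap region, keep the current controller
```

Because the down and up thresholds differ, an estimate that dithers anywhere inside the overlap (0.4 to 0.7) never triggers a switch, which is exactly the chattering suppression the overlap of the training ranges is designed for.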
In the lower panels of figure 7, it is demonstrated that the two-stage controller can achieve the control goal in the RAPTOR simulation at a_ic = 0.2. The two-stage controller achieves high rewards in all six terms after about 35 s, as illustrated in figure 7(g). The controller tries to achieve a stationary profile while maintaining the other reward terms high. However, a single controller NN that is trained for the wide range of 0.2 < a_ic < 1.0 fails to maintain a reversed shear q profile, as illustrated in the upper panels of figure 7. When the two-stage controller is used, r_6, the reward related to q_max, becomes small at the beginning, meaning the two-stage controller accepts a reduced reward early on in order to reach a stationary state sooner and increase the sum of rewards by the end of the simulation. This kind of long-time-scale strategy is valid only when the transport properties are well predicted; therefore, it appears only in the two-stage controller. This is a typical example that demonstrates the difficulty of learning control for plasmas with various ITB strengths using a single NN. Although it might be possible to achieve the same level of control using a single large-scale NN, the required number of trials in reinforcement learning would increase, and it might not be feasible to obtain them within a realistic calculation time. One of the advantages of the two-stage structure is that the analyzer NN is trained using supervised learning, which greatly improves sample efficiency. As a result, a good prediction capability can be achieved from a relatively small number of samples. From these results, control of q_min and β_N for plasmas with various ITB strengths is enabled by the two-stage structure proposed in this study.

Control results in TOPICS simulations
Although it has been demonstrated that the trained system can achieve the target q_min and β_N in RAPTOR simulations, it is unclear whether this system will be applicable to real experiments, as there will always be a finite difference between the simulation and the experiment. Thus, the system is tested in TOPICS simulations to check its validity for plasmas that behave differently from those simulated in RAPTOR. In this study, the primary differences between simulations performed by TOPICS and RAPTOR are the auxiliary heating and current drive models and the anomalous heat transport models. The heating and driven current profiles of NB are computed based on the one-dimensional Fokker-Planck equation [46], and the NB current drive is calculated using Mikkelsen and Singer's model [47]. Orbit loss of fast ions is not calculated in the Fokker-Planck code. The fast ion pressure is also computed in TOPICS. The heating and driven current profiles of EC are computed using the EC-Hamamatsu code [48], in which the ray trajectory of the EC wave is calculated by the ray-tracing method and the EC-driven current is calculated by the relativistic Fokker-Planck equation. NB and EC heating and current drive profiles are computed consistently with the time evolution of plasma parameters using these models. For the anomalous heat transport model, the CDBM model [49,50] is employed, as this model has been validated for its predictive performance against ITB plasmas in JET and JT-60U [15,16]. The Matrix-Inversion method [51] is employed for the neoclassical heat transport model. Additionally, the ion temperature profile, which is not solved in the RAPTOR simulation, is solved in the TOPICS simulation. The time evolution of the electron density is prescribed in the same way as in the RAPTOR simulation. At the current flat-top, the density profile is assumed as n_e = n_edge + n_e0(1 − ρ^2)^{r_n}, where n_edge = 5 × 10^17 m^−3, n_e0 = 3.5 × 10^19 m^−3, and r_n = 0.1.
The control results are illustrated in figure 8. The controller NN trained for plasmas before the ITB formation controls the plasma at the beginning. Unlike the result demonstrated in figure 5, EC and off-axis NNB are not simultaneously injected, as β_N reaches the target value. This is because the confinement is better than in the simulation employing the RAPTOR code, as can be confirmed by the lower predicted c_ano in figure 8(f).
A reversed shear q profile is formed just after the beginning of the I_p flat-top phase. The analyzer NN estimates a_ic to be about 0.4, so the controller NN trained for 0.2 < a_ic < 0.7 is chosen. The controller reduces the power of off-axis NNB since q_min is above the target range. As q_min decreases gradually, the power of off-axis NNB is increased, and after that, a stationary state is achieved. q_min stays below 3 and β_N remains around 3 for approximately 30 s. These findings demonstrate that the trained system can be used for plasmas even when there are errors in the predicted transport characteristics and the heating and current drive profiles.
Although it is demonstrated that the trained system can be used when the confinement performance is sufficiently high, it is also confirmed that the system fails to sustain the reversed shear plasma when the confinement performance is poor and outside the range assumed in the training. When the Bohm/gyro-Bohm (BgB) model [52] or the BgB model including magnetic shear stabilization [53] is used as the anomalous heat transport model, the rewards become small during the flat-top phase, as shown in figure 9(a). This is because the confinement is not high enough to sustain the target plasma within the available heating and current drive power. Additional simulations are performed with anomalous thermal diffusivities set to half of the original BgB models, with and without shear stabilization, and to double that of the original CDBM model. These show that the averaged reward recovers when the c_ano predicted by the analyzer NN stays within the range assumed in the training, as shown in figure 9(a).
Although the averaged reward becomes comparable to the results using the CDBM model, the control residual for β_N is large when half of the BgB diffusivity is used, as shown in figure 10. In this case, a slightly reversed shear q profile is sustained by off-axis heating and current drive from off-axis NNB and ECH. However, the confinement is not sufficient to achieve β_N = 3, since the confinement improvement in the low-magnetic-shear region is not included. Note that on-axis NNB is not injected at its maximum input power; the controller NN evidently avoids driving current in the near-axis (ρ < 0.3) region to sustain the reversed shear q profile. On the other hand, the control residual for β_N is smaller in the simulation with half of the BgB diffusivity including shear stabilization, as shown in figure 11. This is due to the confinement improvement with ITB formation. Although the control residual for β_N is reduced, the averaged reward remains comparable to the results using half of the BgB model without shear stabilization, since the q profile keeps evolving slowly.
It is worth mentioning that the predicted a_ic differs between the BgB simulations with and without shear stabilization when the anomalous thermal diffusivities are halved, as shown in figure 9(b). This means that the analyzer NN can assess the level of confinement improvement due to reduced magnetic shear, as intended, provided a sufficiently large low-shear region is generated. From these results, it can be seen that the trained system can be robustly used to sustain the ITB plasma, with a reversed shear q profile and β_N = 3, if sufficiently high confinement is achieved.

Conclusions
Simultaneous control of q_min and β_N for ITB plasmas in JT-60SA was developed employing a two-stage NN-based control system. In ITB plasmas, the q and pressure profiles are closely coupled, depending on the strength of the ITB. Under these conditions, the two-stage control system achieves simultaneous control of q_min and β_N. The first stage is the analyzer NN, which is trained to estimate the transport characteristics in terms of the transport models in the RAPTOR code. The second stage consists of multiple controller NNs, which are trained to achieve 2 < q_min < 3 and β_N = 3. Each controller NN is trained for plasmas with a different ITB strength. With this structure, the appropriate controller NN can be selected according to the ITB strength, and control for plasmas with various ITB strengths is realized.
The trained system has demonstrated simultaneous control of q_min and β_N not only in the RAPTOR simulations but also in the TOPICS simulations. It is expected that the trained system can be used in real experiments, assuming that the difference between the RAPTOR and TOPICS simulations corresponds to the modeling error between RAPTOR simulations and real experiments. This two-stage system has an additional merit when applied to experiments: an estimate of the transport characteristics of the experimental plasma is obtained at the same time as a control test. Another controller NN, trained for plasmas with transport characteristics close to the estimated ones, can then be added, which will improve the control result. Control results can thus be improved iteratively. Realistic treatment of measurement errors and of the hardware specifications of the NB and EC systems remains future work.