Physics-regularized neural network of the ideal-MHD solution operator in Wendelstein 7-X configurations

The computational cost of constructing 3D magnetohydrodynamic (MHD) equilibria is one of the limiting factors in stellarator research and design. Although data-driven approaches have been proposed to provide fast 3D MHD equilibria, the accuracy with which equilibrium properties are reconstructed is unknown. In this work, we describe an artificial neural network (NN) that quickly approximates the ideal-MHD solution operator in Wendelstein 7-X (W7-X) configurations. This model fulfils equilibrium symmetries by construction. The MHD force residual regularizes the solution of the NN to satisfy the ideal-MHD equations. The model predicts the equilibrium solution with high accuracy, and it faithfully reconstructs global equilibrium quantities and proxy functions used in stellarator optimization. We also optimize W7-X magnetic configurations, where desiderable configurations can be found in terms of fast particle confinement. This work demonstrates with which accuracy NN models can approximate the 3D ideal-MHD solution operator and reconstruct equilibrium properties of interest, and it suggests how they might be used to optimize stellarator magnetic configurations.


Introduction
The computational cost of constructing 3D magnetohydrodynamic (MHD) equilibria is one major limiting factor in stellarator theory, experimentation and design. Depending on the desired resolution and accuracy of the solution, such computations require up to O(10) CPUh [1,2].
Fast and accurate equilibrium reconstructions are crucial to interpret experimental results in magnetically confined devices. At Wendelstein 7-X (W7-X), the Bayesian framework MINERVA [3,94] reconstructs the plasma state (e. g., ion and electron temperature). For each forward evaluation of the simulation model, a free-boundary ideal-MHD equilibrium has to be constructed. For each reconstructed state (i. e., a single timestamp), between O(10 3 ) and O(10 6 ) forward evaluations are required [4,5]. Due to the computational cost of constructing free-boundary ideal-MHD equilibria, a self-consistent Bayesian inversion of a single plasma state has been attempted only once [4] and is not part of the standard data evaluation procedure.
W7-X is currently the largest stellarator experiment in operation. Its scientific mission is to assess the stellarator line on the path towards a fusion reactor [6]. The configuration space of W7-X provides a large experimental test bed, however, only nine reference configurations are mainly investigated [7]. Experimental time is limited, and priorities must be established to fit financial and human resources. Flight simulators provide a parallel and cheaper path to explore and exploit current and future experiments [8,9]. In a stellarator, 3D MHD equilibria bound the accuracy and cost of simulation.
Differentiable and less expensive MHD equilibrium solutions would allow a more extensive exploration of the stellarator optimization space , in which the number of degrees of freedom is usually large ∼ O(10 2 ) [10,11]. The computational cost of the objective function, which is largely dominated by deriving ideal-MHD equilibria, limits the extent to which we can explore the optimization space. According to recent works, each optimization run requires up to O(10 3 ) function evaluations [10,12,13].
Data-driven approaches (e. g., artificial neural networks (NNs)) can provide fast 3D ideal-MHD equilibria [14][15][16][17][18][19]. When benchmarked against equilibria from non-linear codes (e. g., the variational moments equilibrium code (VMEC)), the geometry of the flux surfaces is in excellent agreement. However, it is unclear how equilibrium properties (e. g., plasma stability) can be faithfully reproduced by the NN model. Furthermore, it is unclear how well the NN equilibria satisfy the MHD equations. To make use of such NN models in downstream tasks (e. g., Bayesian inference, optimization), it is necessary to investigate how well NNs capture equilibrium properties.
The ideal-MHD equilibrium problem is a system of Partial Differential Equations (PDEs). In general, initial and boundary conditions, as well as input parameters define the PDE problem. The solution operator maps these variable terms to the corresponding PDE solution.
In this paper, we extend [19] by learning the solution operator of the ideal-MHD equilibrium problem in the subspace of W7-X magnetic configurations with a deep operator network (DeepONet) [20]. A DeepONet is a NN macro-architecture with two pathways: a branch network to encode the PDE problem inputs, and a trunk network to represent the domain of solution function. Here, the NN model maps an externally applied field (defined by the currents in the coil system) and a plasma state (defined by the pressure and toroidal current profiles) to an approximation of the ideal-MHD equilibrium solution. Within the boundary of the training data (i. e., W7-X magnetic configurations), the NN model approximates the solution of unseen equilibrium problems at a fraction of the computational cost currently required by a traditional code. Summarizing the paper's contributions, we: • propose a DeepONet like architecture to represent the solution operator of the ideal-MHD equilibrium problem (see Section 3); • train the model to approximate the solution of ideal-MHD equilibrium problem obtained from a non-linear code (e. g., VMEC), while the ideal-MHD force residual regularizes the model's solution (see Section 4); • investigate how well the model's solution satisfies the ideal-MHD equations, and propose a strategy to improve the solution at inference time by approximating an equivalent fixed-boundary equilibrium solution (see Section 5.3); • investigate to which degree equilibrium properties of interest are faithfully reproduced, in the context of ideal-MHD stability (e. g., magnetic well), neoclassical transport (e. g., the effective ripple), and fast particle confinement (e. g., extrema of the magnetic field strength along a field line) (see Sections 5.4 to 5.6); • show the use of the model in the a posteriori optimization of W7-X magnetic configurations (see Section 5.7); 2 Ideal-MHD equilibria

The ideal-MHD force balance
The equilibrium magnetic field under the ideal-MHD model is characterized by the force balance equation, Ampere's and Gauss's law: where B is the magnetic field, J is the current density, p is the plasma pressure, and µ 0 is the permeability of free space.
In this work, we assume that a set of nested toroidal flux surfaces exists, and that the pressure is a flux function. Let x * = (R, φ, Z) be a cylindrical coordinate system where R is the major radius, φ is the cylindrical toroidal angle, and Z is the height above mid-plane. α = (s, θ, ϕ) is a flux coordinate system where s = ψ ψ edge is a radial-like coordinate, θ is a poloidal-like angle, and ϕ is a toroidal-like angle (in this work, ϕ = φ). 2πψ is the toroidal flux enclosed by a flux surface, and Φ edge = 2πψ edge is the toroidal flux enclosed by the plasma boundary (i. e., s = 1). In the inverse equilibrium formulation [21], α describes the independent variables of the computational grid, and the flux surfaces locations are described by the coordinate transformation f : α → x * .
Because magnetic fields are divergence free ( ∇ · B = 0) and flux surfaces are assumed to be nested ( B · ∇s = B s = 0), the magnetic field can be written in contravariant form following [22]: where 2πχ is the poloidal flux, θ * = θ + λ(s, θ, ϕ) is the poloidal angle for which the magnetic field lines are straight in (s, θ * , ϕ), and λ is a periodic function of θ and ϕ with zero average. e i = ∂ x * ∂αi are the covariant basis vectors, and e i = ∇α i are the contravariant basis vectors.
If we define x = (R, λ, Z), the MHD force residual can be then computed from the mapping x(s, θ, ϕ), and the p(s) andι(s) flux functions (see section 9.3).

Fixed-and free-boundary equilibria
The ideal-MHD problem admits two kinds of boundary conditions: fixed-and free-boundary conditions. In the fixed-boundary case, the shape (R and Z) of the outermost flux surface (i. e., s = 1) is fixed. In the free-boundary case, the shape of the outermost flux surface is determined by the continuity of the total pressure B 2 /2µ 0 + p and by the vanishing of the normal component of the vacuum field at the plasma-vacuum interface (thus enforcing the s = 1 surface to be a flux surface). Multiple quantities parametrize the equilibrium problem: Φ edge ∈ R defines the total toroidal flux enclosed by the plasma, p : [0, 1] → R defines the pressure profile, andι : [0, 1] → R defines the rotational transform profile. The toroidal current profile I tor : [0, 1] → R can be also specified in place of the iota profile: since I tor (s) = s 0 2π 0 J ϕ √ gds dθ, and J ϕ linearly depends onι, the rotational transform profile consistent with a given toroidal current profile can be computed by solving a linear algebraic equation.
In case of free-boundary equilibria, a vacuum field, which is usually generated by a set of external coils, can be provided. A reference field generated by each coil is commonly precomputed on the cylindrical grid, and the total vacuum field is given as the superposition of each coil field is the reference field generated by each coil given a reference current I 0 i , N c is the number of independent coils, and i i = I i /I 0 i is the current ratio between the actual and reference coil current. The set of current ratios i ∈ R Nc additionally parametrizes the equilibrium problem.
In this work, since in equilibrium reconstruction routines the plasma boundary is not known a priori, we are only interested in free-boundary equilibria. In addition, no diagnostics to probe the iota profile are installed at W7-X, whereas continuous Rogowski coils measure the net toroidal current. Therefore, the toroidal current profile is specified instead of the iota profile.

The ideal-MHD solution operator
In this work, a set of scalar parameters (i i and Φ edge ) and plasma profiles (p and I tor ), represented by onedimensional functions of the radial coordinate s, define a instance of the ideal-MHD equilibrium problem. The equilibrium solution is represented by the mapping (s, θ, ϕ) → x(s, θ, ϕ) (where x = (R, λ, Z)) and by the one-dimensional function s →ι(s). Let us define the solution operator as the operator that maps the set of scalar parameters and plasma profiles to the MHD solution: where i ∈ R Nc , Φ edge ∈ R, {p, I tor , -ι} : [0, 1] → R are one-dimensional functions defined on the radial plasma domain, and x : [0, 1] × [0, 2π] × [0, 2π] → R 3 is the mapping between flux coordinates and (R, λ, Z).

NNs to approximate non-linear operators
NNs can approximate non-linear continuous functions [23][24][25][26][27], discontinuous functions [28,29], and nonlinear continuous operators [30][31][32]. In this work, we propose a NN model to learn the non-linear ideal-MHD operator. A set of free parameters Θ ∈ R N (i. e., trainable weights) parametrizes the NN model: where x(s, θ, ϕ) andι(s) compose the equilibrium solution provided by the NN model. Three sources of error characterize the accuracy of a NN: approximation, optimization and generalization errors [33]. The approximation error describes how well a NN represents the target function or operator, the optimization error describes how well the NN training procedure minimizes the training loss (i. e., empirical risk), and the generalization error describes how well the NN generalizes to unseen data.
The approximation theorem in [30] shows how any NN with a specific structure and enough capacity (i. e., number of free parameters) guarantees a small approximation error. However, no guarantees are given on the optimization and generalization errors. Also, from a practical standpoint, estimating the required number of free parameters to meet "enough capacity" is difficult. Fulfilling known symmetries of the solution by construction should ease the training process (i. e., it should reduce the optimization error), and the addition of a physics constraint loss should improve generalization (i. e., it should reduce the generalization error). The next section will introduce them separately.

Imposing symmetries in the NN model
In a learning algorithm, the available computational budget is divided into three baskets: the capacity (i. e., size) of the model sets the approximation error, the length of the training affects the optimization error, and the size of the data set impacts the generalization error.
The space of possible functions that the model can represent is large. The training process is essentially a "search" in this space. An inductive bias "allows a learning algorithm to prioritize one solution over another, independent of the observed data" [34]. An inductive bias encodes the solution's prior knowledge (e. g., a solution's symmetry); it allows budget to be relocated from the capacity basket to the other two baskets, thus, reducing the optimization and generalization errors.
The equilibrium solution should satisfy multiple symmetries. Let us define (X 1 , X 2 , X 3 ) = (R, λ, Z). Each quantity X j (with j ∈ {1, 2, 3}) should be a periodic function of the poloidal and toroidal angles: where N fp is the number of toroidal field period. Moreover, if stellarator symmetry is assumed (which is the case for most experimental devices, including W7-X), the solution should satisfy the following relationships [35]: For example, the periodicity of X j and stellarator symmetry can be imposed if the solution is expanded in Fourier series along the poloidal and toroidal directions: where X j mn are the associated Fourier coefficients, m is the poloidal mode number, and n is the toroidal mode number. Since λ is a periodic function with zero average, λ 00 = 0.
At the magnetic axis (s = 0), the flux surface geometry must be regular (i. e., infinitely differentiable).
If it is expanded in Fourier series, the Fourier coefficients X j mn must have the form [36]: Section 3.2 introduces how the NN model inherently satisfies this property.
Up to m = 8 poloidal and |n| = 11 toroidal modes represent each X j . For W7-X, previous works showed that up to 6 modes are sufficient to represent the geometry of the flux surfaces [17][18][19]. However, using VMEC equilibria with up to m = 11 and |n| = 12 as a "high-fidelity" reference, a Fourier resolution scan revealed that up to m = 8 and |n| = 11 modes are required to represent the equilibrium field with a field error below 1 % (see section 9.1).
For each Fourier coefficients and for iota, a "trunk" network parametrizes a set of non-linear functions of s, [T j 1mn (s), . . . , T j Lmn (s)] T ∈ R L and [T 1 (s), . . . , T L (s)] T ∈ R L , respectively. A "branch" network outputs a set of scalar coefficients as a function of u, [X j 1mn , . . . , X j Lmn ] T ∈ R L and [ι 1 , . . . ,ι L ] T ∈ R L . Finally, the iota and the Fourier coefficients are obtained as a linear combination of trunk network functions, which are weighted by the branch network coefficients.
Iota highly depends on the toroidal current [37]. To let the model exploit this dependency, the toroidal current profile substitutes the last trunk network function in the expansion of the iota profile: T L (s) = I tor (s).
The trunk network expands the radial coordinate s into a set of "basis" functions, independent for each Fourier coefficient X j mn and iotaι. We inappropriately refer to them as a set of basis functions on the interval [0, 1] since these functions are not guaranteed to form a basis (a set of orthogonal functions). To obey the required form needed to represent X j (see section 3.1), each function I is multiplied by a √ s m factor, and the two first trunk network functions are the constant and identity functions, respectively: T j 1mn (s) = T 1 (s) = 1 and T j 2mn (s) = T 2 (s) = s. Imposing the identity function as one of the trunk network functions effectively bypasses the trunk network layers, and it preserves the model ability to build simple functions (e. g., the identity function). This construction is known as a "skip connection" in machine learning [38]. A multilayer perceptron (MLP) parametrizes the trunk network. Please refer to section 9.2 for a detailed description of the trunk network architecture.
The branch network maps the scalar parameters and the plasma profiles of the equilibrium problem to the set of coefficients associated to the trunk functions. The one-dimensional plasma profiles are observed at a set of sensors [s 1 , . . . , s Ns ] T . A set of one-dimensional convolutional layers extract high-level features from the plasma profiles into a compressed latent representation, which is then concatenated to the set of scalar parameters. Finally, a MLP outputs the coefficients associated to the trunk functions. Please refer to section 9.2 for a detailed description of the branch network architecture.
The model can represent a global ideal-MHD solution (i. e., a solution that is defined everywhere in the plasma volume) and fulfils the required symmetries and regularity condition at the magnetic axis by construction. Because of the Fourier expansion along the poloidal and toroidal directions, the periodic boundary conditions and stellarator symmetry are satisfied. The required radial form of each Fourier coefficients is satisfied due to the √ s m factor and the trunk network skip connection.

Training
The model is trained in a supervised-learning fashion: VMEC provides the ground-truth ideal-MHD equilibria. To restrict the space in which to approximate the ideal-MHD solution operator, only W7-X configurations are considered.
I only the functions that expand the Fourier coefficients X j

Data set
The boundaries of the training data limit the application of any data-driven method. In this work, the training data are sampled from a multivariate distribution that approximates the experimental distribution (table 1). The coil current ratios define the vacuum magnetic configuration, the plasma volume determines the total toroidal flux at the edge. One-dimensional Gaussian processes (GPs) on the [0, 1] interval represent the normalized pressure and toroidal current profiles. The volume averaged plasma beta informs the scale of the pressure profile, and the total integrated toroidal current defines the scale of the current profile. See [19, Section 2.2] for a more in depth description of the methodology.

Vacuum configuration
The current ratios between the W7-X coils define the vacuum magnetic configuration. Two symmetric sets of coils compose each half toroidal field period: five non-planar coils, two planar (tilted) coils, and one control coil constitute each set. Defining I 1 as the current flowing in the first non-planar coil, i i = Ii I1 for i ∈ {1, . . . , N c } are the coil current ratios. In the data set, the current of the first non-planar coil is held fixed at I 1 = 13 068 A (current value to obtain B axis (ϕ = 0) 2.52 T in case of the standard configuration), and the current in the control coil is null. The coil current ratios are uniformly sampled to cover a broad configuration space around the nine reference W7-X configurations [7].
Plasma volume For a given magnetic field strength, Φ edge sets the plasma volume: where a is the effective minor radius, R the major radius, and V p the plasma volume. The plasma volume is sampled from a normal distribution with mean 30 m 3 and standard variation 3 m 3 [39]. Once a plasma volume has been sampled, a linear fit estimates Φ edge . Plasma beta For a given magnetic field strength and pressure profile shape, the pressure on axis sets the volume averaged plasma beta: β = 2µ 0 p/B 2 Vol = 2µ 0 p 0 p/B 2 Vol , where p 0 is the pressure on axis and p = p/p 0 is the normalized pressure profile. In previous operational campaigns, W7-X has reached a maximum β of roughly 1 %. However, W7-X has been optimized for higher plasma beta (up to β 5 % [6]). Therefore, the volume averaged plasma beta is sampled from a uniform distribution between 0 % and 5 %. Once a plasma beta has been sampled, a linear fit estimates p 0 .

Net toroidal current
The plasma bootstrap current and the externally induced current (e. g., via electron cyclotron current drive (ECCD)) mainly define the net toroidal current. The net toroidal current should be minimized for the robust operation of an island divertor. In W7-X operations, two current-free scenarios are envisioned [39]: a scenario in which the bootstrap current is minimized via the magnetic configuration, and a scenario in which the bootstrap current is balanced by a strong ECCD current. During the first operational campaign, the measured bootstrap current ranged between −7 kA and 17 kA [41]. Therefore, the total toroidal current is sampled from a uniform distribution between −20 kA and 20 kA.
Pressure profile shape For electrons and ions, the gradient of the density and temperature (p ∝ nT ) profiles can be substantially different in the core and edge regions. To account for such difference, a GP with a non-stationary covariance function models the pressure profile. The GP is parametrized by a set of hyper-parameters (HPs) [19, Section 2.2]: the covariance function scale factor σ f , the core length scale l core , the edge length scale l edge , the transition location s 0 , and the transition width s w . In W7-X, the pressure profile is generally flat, however, high-performance operations are correlated with a peaked density profile [42]. To account for both, flat and peaked profiles are included in the data set. Profiles with a non-physical positive pressure gradient (i. e., ∇p > 0) are discarded.
Toroidal current profile shape In W7-X, the toroidal current has two main contributions: the ECCD induced current and the bootstrap current. The bootstrap current density is generally flat across the plasma volume, however, the ECCD current density is peaked at the deposition location. To capture both shapes, a GP with a non-stationary covariance function is used to model the toroidal current profile.
14 281 equilibria have been sampled in the data set, but equilibria for which VMEC was not able to minimize the MHD variational forces up to a tolerance of 1 × 10 −7 N H m −1 were discarded II . The data set thus contains 9857 equilibria. Figures 2a to 2c show the histograms of a subset equilibrium properties for all converged equilibria: the volume averaged beta β , the pressure peaking factor p 0 / p , and the peak toroidal current density max s |j tor (s)|. The β distribution is not uniform due to the not converged equilibria at high beta (figure 2a). Nevertheless, the data set holds both flat (i. e., small peaking factor) and peaked (i. e., large peaking factor) pressure profiles (figure 2b).
The data set has been split into a training set (80 %), a validation set (10 %), and a test set (10 %). The training set has been used to train the model, the validation set has been used to assess the best model HPs, and the test set has been used to assess the model performance on held-out data.
To improve the training procedure, the model inputs have been normalized with the mean and standard deviation computed on 20 % of the training data.

Loss function
Three terms form the model loss function: a data term, a gradient term, and a physics-based regularization term. The data term biases the model to provide equilibria that are close to the ground truth (computed by VMEC), the gradient term regularizes the model equilibria to exhibit the same radial derivatives as the ground truth, and the physics-based term regularizes the model equilibria to satisfy the ideal-MHD equations.

Data loss
The model equilibrium solution is globally defined in the whole plasma volume. However, to numerically compute a residual measure against the ground truth, an equally spaced grid is adopted: N s = 99 flux surfaces, N θ = 32 poloidal locations, and N ϕ = 36 toroidal locations compose the loss grid. The iota profile, being a flux surface quantity, is evaluated only at the radial grid locations. On each equilibrium, the data loss term is: whereX j andι are the model outputs, and α X j and αι are the set of coefficients that weight the data loss terms.

Gradient loss
An equilibrium solution is fully defined by the mapping X j (s, θ, ϕ) for j ∈ {1, 2, 3} and byι(s). However, many equilibrium properties depend also on equilibrium first and second radial, poloidal and toroidal derivatives. For example, the flux surface Jacobian √ g, which is the local linear approximation of the coordinate transformation (s, θ, ϕ) → (R, φ, Z), is: To faithfully reconstruct the equilibrium properties, the gradients of the equilibrium solution need to match the gradients of the ground truth. Due to the Fourier expansion of the solution in the poloidal and toroidal directions, the poloidal and toroidal derivatives are already constrained by the Fourier coefficients: for example, A similar relationship holds for any poloidal and toroidal derivatives of (R, λ, Z).
However, the radial derivatives of the equilibrium solution are not guaranteed to match those of the ground truth. The radial derivatives depend on the learnt trunk network functions, which are derived during training to minimize a L 2 residual with the ground truth on a finite-spacing radial grid. Even if the L 2 residual vanishes, numerical artefacts (e. g., numerical oscillations) are known to occur [43]. Therefore, the radial derivatives of the model solution can substantially differ from the ground truth.
To ameliorate these phenomena, a gradient term regularizes the radial derivatives of the model equilibrium: where α X j s and αι are a set of coefficients that weight the gradient loss terms. The radial derivatives are approximated with a central finite difference scheme for both the model and ground truth solutions.

Physics regularization
The ideal-MHD equilibrium problem exhibits a high condition number [44]. In the framework of MHD equilibria, the condition number is defined as ratio between the largest and the smallest eigenvalue of the MHD force residual linearized around an equilibrium solution. Due to the presence of second-order radial derivatives in the residual, the condition number scales with the square of the number of radial grid locations [44], namely, P ∼ N 2 s (O(10 4 ) in this work). A high condition number means that the MHD force residual is highly sensitive to the equilibrium solution: small deviations of the solution away from the ground truth may lead to a large MHD residual. Even if the model equilibria are close (in the L 2 sense) to the ground truth equilibria, they are not guaranteed to well minimize the residual.
The MHD force residual in the loss function regularizes the model to provide equilibria that better satisfy the ideal-MHD equations. For an equilibrium solution, F 2 = 0 everywhere in the domain. The volume-averaged residual norm can be computed as: where A Vol = 1 0 2π 0 2π 0 A √ gdsdθdϕ denotes a volume-averaged operator.
In this work, a proxy for the MHD force residual is used instead. VMEC (the code that provides the ground truth equilibria) does not directly minimize the force residual, instead, it minimizes the total energy of the system (i. e., it uses a variational approach). It has been shown that equilibria computed by VMEC only poorly minimize the ideal-MHD force residual [2]. Therefore, adding the full F 2 Vol as regularization term may cause instabilities in the training procedure. The volume averaged radial force balance assuming a vanishing helical component (F β = 0) is used instead. This quantity is also used in VMEC to determine the goodness of a converged solution. Let us firstly define where A Fs = 2π 0 2π 0 A √ gdθdϕ is a flux surface averaged operator. Then, the model is regularized with the following term: where s min = 0.02 to avoid the sensitivity of the MHD force residual at the magnetic axis. Please refer to section 9.4 for the derivation of f * .

Training stages
Three subsequent training stages compose the training procedure. In every training stage, early stopping is employed during training [45], and the best model accordingly to the validation loss is used as a warm start for the next stage.
Data stage At the beginning of the training, only the data loss L d is used. A set of flux surfaces randomly sampled across all equilibria in the training set compose each batch (i. e., individual flux surfaces from different equilibria are randomly shuffled into the same batch).

Gradient stage
In this stage, the gradient loss L g is included in the overall loss. To compute the radial derivatives of the solution, the flux surfaces of the same equilibrium are aggregated together in the same batch. However, multiple equilibria are still randomly shuffled into each batch.
Physics regularization stage Finally, the physics regularization term L MHD is included in the loss function. As in the gradient stage, and to compute f * , the flux surfaces of the same equilibrium are aggregated together in the same batch. Still, multiple equilibria form each batch.
During the physics regularization stage, a curriculum learning approach is used [46]. The high condition number limits the convergence rate and imposes a superior bound on the learning rate (i. e., the step size): even if the model commits a finite, but small error on the equilibrium solution, the regularization term might lead to a diverging loss.
To mitigate this issue, the model is gradually regularized, starting with well-predicted equilibria. At every batch, the physics regularization term is included into the loss function only for the subset of equilibria where the mean-squared error (mse) on (R, Z) (the key quantities affecting the MHD residual) is below a given threshold. When the regularization term has not decreased in the previous 50 epochs, the threshold is progressively increased by a factor of 10. More formally, the physics regularization term for the generic batch B at the k-th step is: where L MHD,i is the physics regularization term for the i-th equilibrium in the batch, H(·) is the Heaviside function, L k is the loss threshold value at the k-th step, L (R,Z),i is the mse on (R, Z) for the i-th equilibrium. The loss threshold is progressively increased, i. e., L k = 10 k L 0 . An initial threshold of L 0 = 3 × 10 −7 m 2 is used. At the fourth step, the physics regularization term is computed on all the equilibria in the batch, namely, L 4 = +∞.

Hyper-parameters
Due to the computational cost of the whole training procedure ( 8 days on a single NVIDIA Tesla-V100 graphical processing unit (GPU)), a simplified scenario is used to identify the best model HPs. Given a fixed budget of 6 M free parameters, HP grid search is performed for the depth and width of both trunk and branch networks. In addition, only half of the training data is used. The optimum HPs combination is chosen based on the loss on the validation set in the data stage.
The trunk and branch networks are then proportionally upscaled to reach 40 M free parameters. Tables 7 and 8 summarize the HPs of the final model.
The AdamW [47,48] optimizer is used to train the model. In each stage, due to the different loss function, the learning rate, the learning rate scheduler, and the L 2 regularization term are derived with a limited grid search. The loss on the validation set is used to inform the search. In addition, early stopping [45] is employed during training. Tables 9 to 11 report the HPs used in each training stage.

Results
In the following chapter, the model's accuracy is evaluated by how well it predicts the equilibrium problem solution and reconstructs equilibrium properties of interest (which are not explicitly included in the loss function) across many physics domains (table 2): flux surface geometry (section 5.1); iota profile, shear profile and magnetic field structure (section 5.2); MHD force residual (section 5.3); ideal-MHD stability (section 5.4); neoclassical transport (section 5.5); fast particle confinement (section 5.6).
The root-mean-square error (rmse) and mean absolute percentage error (mape) summarize the error on the test set. Given a quantity of interest y ∈ R K (e. g., the magnetic well), the rmse evaluates the average residual between the ground truth and predicted values for that quantity. If Y = {y ∈ R K } is the set of true values, andŶ = {ŷ ∈ R K } is the set of related model predictions, the rmse on y is: Given a quantity of interest y ∈ R K , the mape measures the average relative deviation of ground truth with respect to predicted values for that quantity. Using the same notation as for the rmse, the mape on y is: To further investigate the model accuracy, and to validate its use in downstream applications, section 5.7 shows the a posteriori optimization of W7-X coil currents to search for ideal-MHD unstable and improved fast particle confinement configurations. In this task, the model replaces a traditional non-linear ideal-MHD code (e. g., VMEC) in constructing free-boundary equilibria.
The results of the model on the test set are now presented. Because of the adopted finite differences scheme, numerical noise affects quantities that rely on the radial derivative of (R, λ, Z) at the magnetic axis and on the nearest flux surface; those values are invalid. A note is present in the caption of all affected figures.

Flux Surface Geometry
The geometry of the flux surfaces (R, λ, Z) andι represent the solution of the ideal-MHD equilibrium problem: they fully describe the equilibrium magnetic field and its properties. Therefore, the accuracy  with which the model predicts (R, λ, Z) andι bounds the accuracy with which it reconstructs equilibrium properties.
The flux surface geometry is predicted with an exceptional accuracy (figure 3). The flux surface error between the ground truth and the model prediction is evaluated as: where ∆R i and ∆Z i are the differences in flux surfaces relative to the ground truth, and the brackets denote the mean across all equilibria and grid locations. In other words, ε Fs represents the average absolute distance (∆R) 2 + (∆Z) 2 across all equilibria and grid locations. The model achieves ε Fs = 616 µm. Similarly, the model successfully predicts the angle renormalization parameter λ, achieving rmse λ = 2.41 mrad.
Low-beta equilibria are better regressed than high-beta equilibria (figure 4). As expected from the distribution of the volume averaged beta in the training data set (see figure 2a), the flux surface error increases towards the boundary of the training set (i. e., where the plasma beta increases). No clear patterns are visible in terms of the net toroidal current enclosed by the plasma. The flux surface error does not considerably change when the physics regularization is applied (see table 3).
As the plasma beta increases, the plasma column moves outwards. This effect is called the Shafranov  shift, and it is usually measured as ∆/a, where ∆ = R axis (β) − R axis (0) and a is the effective minor radius. The Shafranov shift defines the equilibrium beta limit (∆/a 0.5) [49], and it causes an outward movement of the last closed flux surface (LCFS), which affects the edge magnetic field topology. In W7-X, the minimization of the Shafranov shift was an optimization criterion [50]. Nevertheless, a Shafranov shift of ∆ 5 cm has been observed in high-β operations [51].
The determination R 2 captures the reconstruction quality.

Magnetics
ι sets the location of the resonant surfaces in the plasma volume. It also defines the island chain and magnetic edge topology, which affects how the W7-X island divertor operates. Although equilibria with the assumption of nested flux surfaces cannot resolve islands, VMEC equilibria are an important requirement in the investigation of plasma phenomena involving resonant surfaces [52][53][54].
The model effectively predicts theι profile (figure 6a). A relative error of mapeι < 0.1 % is achieved. In case of the median regressed equilibrium, the predicted and ground truthι profiles overlap (figure 6b). The magnetic shear is the radial derivative ofι: where 2πψ(s) = Φ edge s is the toroidal flux. It represents the change of direction of the magnetic field lines from one flux surface to another. The magnetic shear influences multiple properties of the equilibrium: it affects MHD stability [55,56], it is inversely proportional to the island width (i. e., large islands are present at low shear) [55], and a low shear leads to long connection lengths allowing efficient divertor operation [57].
The model qualitatively reconstructs the shear profile (figure 7a). Figure 7b shows that the ground truth and predicted shear profile only partially overlap. Across all equilibria in the test set, the relative error on the magnetic shear is high mapeι = 84.4 %. The structure of the magnetic field plays a crucial role in many physics aspects: neoclassical and turbulence transport, confinement of energetic particles, plasma instabilities, and deposition of electron cyclotron resonance heating (ECRH) heating power.
The model correctly reconstructs the magnetic field structure (figure 9). The model achieves a rmse B = 23.6 mT on all equilibria in the test set. When compared with an average W7-X field of B 0 = 2.5 T, this value represents a relative error below 1 %.
However, the magnetic field strength error differs along the radial profile ( figure 9). On average, the field strength error is below 10 mT for a considerable fraction of the plasma volume, up to s 0.3 (i. e., ρ 0.55). The accuracy of the magnetic field strength in the core region is sufficient to navigate the magnetic configuration space of W7-X to obtain fast particle optimized configurations (see section 5.7.2). Moreover, an accuracy of 10 mT is expected to be enough to perform equilibrium reconstruction routines [59].
The model introduces artificial field ripples in the equilibrium solution ( figure 10): at the plasma edge, and also to a less extent close to the magnetic axis, the predicted B field is not as smooth as the ground truth solution. The flux surface geometry Jacobian √ g and the metric tensor elements g ij mainly define the magnetic field B (section 2.1). These quantities are highly sensitive to the radial derivative of the flux surface coordinates (R, λ, Z). High frequency components in the learned trunk network basis functions might be the cause of such field ripples.

Ideal-MHD loss
A high condition number, which has several negative implications, characterizes the ideal-MHD equilibrium problem (see section 4.2.3). Firstly, a high condition number limits the rate of convergence. During the physics regularization training stage, the weights of the NN model are adjusted to minimize the MHD force residual for the equilibria in the training set. The condition number sets the convergence rate of this gradient descent method. To accelerate the descent, non-linear MHD codes adopt a preconditioning matrix, which decreases the condition number of the problem [44]. In this work, this technique is not available: in general, preconditioning algorithms require an approximation of the Hessian matrix, which is impractical due to the large number of free parameters of the NN model. Secondly, a high condition number implies that the MHD force residual is highly sensitive to the equilibrium solution. For example, if we define x 0 to be that solution that satisfies the MHD force residual F MHD ( x 0 ) = 0 and κ as the condition number, then F MHD ( x 0 + ) κ 2 1, where is a small displacement to the equilibrium solution. Even if the model provides a good approximation of the equilibrium solution, the MHD force residual is not guaranteed to be null.
To quantify how well the equilibria satisfy the MHD equations, two measures are adopted. The proxy and full MHD force residuals: represents the expected accuracy needed on the magnetic field strength for equilibrium reconstruction routines [59]. Numerical noise in the finite difference scheme invalidates the values on the magnetic axis (see section 5).
The model trained with the physics regularization term better satisfies the ideal-MHD force residual compared with a model that has not been regularized (table 3). Both models show similar flux surface and iota errors, however, the physics regularized model shows a reduced ideal-MHD force proxy residual. However, this number is still three orders of magnitude higher than the value of ground truth equilibria from VMEC. Even though the model provides the solution of the ideal-MHD equilibrium problem (the geometry of the flux surfaces and the iota profile) with high accuracy, the ideal-MHD force residual is poorly satisfied. In case of the median predicted equilibrium (at β = 1.39 %), the normalized MHD force proxy residual is η f * = 23.3 % ( figure 11). For comparison, the ground truth equilibrium has a normalized residual of 3.04 × 10 −2 %.

Fine-tuning at inference time
Let us call a single-pass equilibrium an equilibrium solution provided by the model with a single forward pass.
Single-pass equilibria poorly satisfy the ideal-MHD equations (section 5.3). However, without access to the ground truth solution, equilibria can be improved at inference time. The original free-boundary equilibrium problem is cast into an equivalent fixed-boundary one: the plasma boundary of the single-pass free-boundary equilibrium defines the boundary condition on (R, Z) for the equivalent fixed-boundary equilibrium. The error on (R, Z) of the single-pass equilibrium is below 1 mm (Section 5.1), therefore, the plasma boundary can be safely assumed to be close to the correct.
The solution of the equivalent fixed-boundary equilibrium problem is approximated by minimizing the MHD force residual at inference time. The loss function has three terms: the full MHD force residual, a distance between the model and the fixed boundary (R b , Z b ) derived from the single-pass equilibrium, and the satisfaction of the required toroidal current profile I tor (s). The loss function is then: where α MHD = 1 × 10 −3 , α b = 0.2495, α I = 0.5, and s min = 0.02. The single-pass equilibrium is used as initial guess for the minimization problem. The AdamW algorithm minimizes the loss function. The initial learning rate is set to 10 −4 , and it is decreased to 10 −6 at 10 4 steps with an exponential decay. The training is halted after 2 × 10 4 steps. The branch network, apart from the bias of the last layer, is frozen. For a complete description of the HPs, see Section 9.2. We call this optimization at the inference step fine-tuning.
In terms of the full MHD force residual, the model can provide better-than-ground-truth equilibria (table 4 and figures 12a and 12b). As an example, we consider a W7-X standard configuration at β = 2 %, which is not included in the training data. Both the single-pass and fine-tuned equilibrium show low flux surface errors, however, the fine-tuned equilibrium satisfies the ideal-MHD equations much better than the single-pass equilibrium: the normalized error is only η f * 1 %. Surprisingly, the fine-tuned equilibrium minimizes the full MHD force residual better than the ground truth equilibrium computed by VMEC. Table 4: Comparison between the fine-tuned, not-regularized and regularized (single-pass) model in minimizing the MHD force residual in case of a W7-X standard configuration at β = 2 %, which is not part of the training data. ε f * and εF denote the proxy and full MHD force residuals, η f * and ηF denote the proxy and full normalized MHD force residuals. Please refer to section 5.3 for the definition of these quantities. NN denotes a model that has not be regularized with the MHD force residual (i. e., trained only in the data and gradient stages), NN regularized denotes the model that has been regularized with MHD force residual (i. e., the single-pass model), and NN fine-tuned denotes the fine-tuned model. VMEC denotes the value for the ground truth equilibrium. For each quantity, the best value is highlighted in bold.  In addition, the fine-tuning procedure also improves the reconstruction of the magnetic field strength: the average error is below 10 mT for the entire plasma volume ( figure 13).

NN
The fine-tuning procedure is currently computationally expensive: on a single GPU, the fine-tuning procedure reaches the MHD force residual of the VMEC ground truth equilibrium after 4807 iterations (i. e., gradient update step) and 3610 s, whereas VMEC converges after 13 783 iterations and 127 s (on 20 central processing unit (CPU) cores). Both the fine-tuning procedure and VMEC use a gradient descent method, however, even if the model requires fewer iterations to achieve the same MHD force residual, the model runtime is longer: the network and loss function are described in python, while VMEC is written in FORTRAN (thus resulting in compiled code). The fine-tuning procedure simply shows how the single-pass model predictions can be improved in a selfsupervised fashion (i. e., without the use of a ground truth equilibrium). The reduction of the fine-tuning runtime (e. g., just-in-time compilation of both the network and loss function) and the improvement of the fine-tuning convergence (e. g., higher learning rate) are left for future investigations.

Ideal-MHD stability
The magnetic well W [60,61] represents a fast proxy for ideal-MHD stability [58,62]. The configuration is taken to be stable if [63]: Therefore, for a standard pressure profile with p < 0, a positive W is favourable for stability. In the literature, multiple definitions can be found. When considering vacuum configurations, the magnetic well can be expressed as the second radial derivative of the plasma volume: where V p (s) is the plasma volume within the flux surface labelled s: When a finite plasma pressure is introduced, the expression is modified into [60]: Moreover, the magnetic well term as defined in the Mercier stability criterion is [63,64]: where the s 2 ι 2 π 2 prefactor common to all Mercier terms has been omitted, as employed in VMEC. The model only qualitatively reconstructs the local magnetic well ( figures 14a to 14c); it correctly provides the trend of the profile, however, it introduces artificial wiggles: because the magnetic well depends on the solution's second radial derivatives, the magnetic well is particularly sensitive to the solution's radial dependency. The relative error for all three expression of the magnetic well is high (see table 2). The MHD force residual regularization aids the model to faithfully reconstruct equilibrium properties. As an example, figure 15 shows the reconstructed local magnetic well by the not-regularized, the regularized (i. e., single-pass), and the fine-tuned models. We again consider the W7-X standard configuration equilibrium at β = 2 % previously examined (see section 5.3.1). The amplitude of the radial wiggle decreases from the not-regularized to the regularized models, and it almost vanishes in the fine-tuned model. Since the MHD force residual depends on the equilibrium solution's second radial derivatives, the MHD force residual regularization also implicitly regularizes the equilibrium solution's radial derivatives. To assess the MHD stability of an equilibrium, a globally defined magnetic well is usually considered. For example, stellarator optimization frameworks often use the magnetic well depth as a fast proxy for plasma stability [7,11]: where W vacuum does not consider finiteβ effects. A magnetic well depth that considers also a finite pressure can be defined as: where p * = 2µ 0 p + B 2 is the sum of the fluid and the magnetic pressure. Despite only a qualitative agreement of the local magnetic well, the model faithfully reconstructs the magnetic well depth ( figures 16a and 16b). In case of the vacuum magnetic well depth, the relative error is remarkably low: mape W vacuum = 3.43 %. The accuracy on the magnetic well depth is sufficient to effectively navigate the magnetic configuration space of W7-X in finding precise, negative-well configurations (see section 5.7.1).

Neoclassical transport
In a stellarator, one critical transport regime is the so called 1/ν regime, where the neoclassical transport increases with decreasing collision frequency. A measure of such transport is the effective ripple coefficient, eff [65]. In this work, the drift-kinetic code NEO [66,67] evaluates the eff . No acceptable agreement is present between the predicted and ground truth eff values (figure 17a). The effective ripple is greatly affected by the local structure of the magnetic field, which the model struggles to smoothly reconstruct (figure 10). It seems coherent that the artificial field ripples in the model equilibrium solution affect the accuracy on the effective ripple.
A poor, qualitative agreement between the ground truth and predicted eff is present only in the core region ( figure 17b). In such region, namely for s ≤ 0.33 (ρ ≤ 0.57), the magnetic field strength is reconstructed with an accuracy below 10 mT. However, the eff relative error is still high, mape eff = 54.6 %. Simplified proxies of the neoclassical transport do exist. For example, in case of W7-X magnetic configurations with a low mirror term (b 01 0), the effective ripple can be approximated as [39]: where b mn = B mn /B 0 , B mn are the Fourier coefficients of the magnetic field strength B in Boozer coordinates, B 0 is a reference magnetic field value (e. g., B 00 (s = 1)), κ = −b 10 R r eff is the toroidal curvature term, R is the major radius, and r eff = ρa is the effective radius.
The model well reconstructs the epsilon effective proxyˆ eff ( figure 19).ˆ eff relies on few geometrical quantities and on the leading Fourier components of the magnetic field strength B, which the model faithfully reconstructs (see sections 5.1 and 5.2). In case of the epsilon effective proxyˆ eff , the model shows a relative error of mapeˆ eff = 13.3 %, lower than compared with eff .

Fast particle confinement
Another critical measure of transport is the confinement of fast α-particles (energetic ions), which are the products of fusion reactions. Any fusion facility should confine fast particles so that they can heat the bulk plasma during the plasma burn, and to avoid damage to the first wall as a result of their losses. Codes like the symplectic integration methods for particle loss estimation (SIMPLE) compute the loss fraction of test particles, for both trapped and passing particles, when injected with an initial velocity in the plasma volume. The collisionless guiding centers of particle motion are followed in time, till a stopping condition is met or the particle leaves the plasma volume. However, such computations are expensive. Stellarator optimization frameworks instead use explicit symmetries (e. g., quasi-symmetry), equilibrium properties (e. g., omnigenity) or simpler proxies that correlate with the confinement of fast particles (e. g., Γ c ) [62,[68][69][70][71][72].
For example, W7-X is a quasi-isodynamic (QI) stellarator: a quasi-omnigeneous (QO) stellarator with magnetic field strength contours that close poloidally. Omnigeneity means that the bounce-averaged radial drift of locally trapped particles vanishes [73]. For a confining field B, the conditions of omnigineity are [74] To assess property (2), the standard deviation of the extrema of B on a flux surface across all field lines are assessed, σ Bmax and σ Bmin . In a QO configuration, σ Bmax (s) = σ Bmin (s) = 0 across the whole plasma volume.
The model equilibria correctly reproduce the standard deviation of the extrema of B (figures 21a and 21b): rmse σ Bmax = 9.43 mT, and rmse σ B min = 6.57 mT.
The Boozer angular separation δ(s, B, α) is the angular distance along the field line α between the B contours (see figure 28). In a QO configuration, the angular separation δ(s, B, α) should be independent of the field line label α, namely, ∂δ(s,B,α) ∂α | s,B = 0. Numerically, such condition can be verified by observing σ δ (s, B) = 0.
The Boozer angular separation δ relates to the bounce-averaged radial drift that trapped particles experience. Therefore, it makes sense to assess it only if a clear B field well, in which particles can be trapped, exists. Not all W7-X magnetic configurations exhibit such a B field well: the median regressed equilibrium in the test set is not guaranteed to possess it. Therefore, only to assess property (3), the model is evaluated on a small (58 equilibria) out-of-sample set of high-mirror W7-X configurations. The high-mirror W7-X configuration has been constructed to feature a magnetic-mirror like B field structure, and it is the closest configuration to QI out of the nine W7-X reference configurations. This test set has been constructed as the main data set (see section 4), however, the coil current ratios have been fixed to generate only high-mirror W7-X configurations. Still, finite-β and toroidal current effects are present in the data set. For completeness, section 9.5 presents the δ reconstruction in case of the median regressed equilibrium in the larger test set.
Even if the predicted magnetic field structure does not perfectly overlap with the ground truth field (figure 10), the model preserves the Boozer angular separation ( figure 22). However, the reconstruction accuracy is better in the core region (figure 23).

Optimization in the configuration space of W7-X
The exploration and optimization of the W7-X magnetic configuration space serve to further investigate how faithfully the NN model reproduces equilibrium properties, and to validate its use in downstream tasks. For example, [69,75,76] proposed the a posteriori optimization of coil currents to further improve the equilibrium properties, and to recover inevitable errors that occur during the coils manufacturing and placement. Moreover, especially in case of an island divertor that relies on a robust edge magnetic topology, it is desirable to include finite-β effects in the optimization. A fast, free-boundary ideal-MHD equilibrium model benefits such optimization. In this work, W7-X configurations are targeted due to the limitations of the training data. However, the proposed technique can be applied to any present or future device once a training data set is generated. The optimizations carried out in this work represent toy examples of real-world applications. Indeed, the proposal of new, optimized W7-X configurations is beyond the scope of this paper.
The equilibrium solution provided by the NN is differentiable. The analytic gradient of equilibrium properties (e. g., the magnetic well) with respect to the magnetic configuration and plasma profiles is available at no additional cost via automatic differentiation (AD). In a gradient-based optimization, a single equilibrium evaluation is sufficient in each optimization step (regardless of the number of free parameters). In contrast, when finite-differences approximate the objective function gradient, the number of equilibrium evaluations each step scales linearly with the number of free parameters.
Both magnetic configuration and finite-β effects are included in the optimization. The search space is a subset of the NN model input parameters: the W7-X non-planar i [2,...,5] and planar i [A,B] coil current ratios, the toroidal magnetic flux at the edge Φ edge , and the pressure on axis p 0 . The pressure profile shape is fixed, and it is assumed to be of the form p(s) = p 0 (1 − s) 2 . Moreover, we assume the toroidal plasma current to vanish (i. e., I tor (s) = 0).
To avoid extrapolation, the search space is bounded by the distribution of the input parameters as seen during training. Denoting p = [p 1 , . . . , p d ] ∈ R d the optimization vector, each input parameter p i is bounded by [µ i −cσ i , µ i +cσ i ], where µ i and σ i are the mean and the standard variation of the distribution of p i , respectively. c = √ 3 for the parameters that have been uniformly sampled in the training data, and c = 2 for all the others. The configuration with p i = µ i ∀i serves as initial guess (i. e., the average configuration seen during training).
The objective function is the weighted sum of the squared residual between the proposed and target equilibrium properties:  where f * i is a target equilibrium property, f i ( p) is the proposed equilibrium property, and w i is the weight of each residual term. Such form of the objective function is usually employed in stellarator optimization frameworks [62,77,78].
A combination of the global tree-structured Parzen estimator (TPE) algorithm from hyperopt [79] and the local limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [80] algorithm from scipy [81] drives the optimization. Whenever possible, AD computes the analytical Jacobian of the objective function. On the other hand, if the analytical Jacobian is not available, (e. g., an objective function term requires an external software package that breaks the model computational graph), finite differences approximate the Jacobian.

Negative well
Almost all W7-X configurations in the training set feature a positive magnetic well (figure 24). W7-X was optimized with respect to good MHD stability [6], therefore, in general, a positive magnetic well is to be expected.
Can a small, but negative magnetic well depth is targeted: W * vacuum = −0.01 and w W vacuum = 10 4 . Finite-β effects deepen the magnetic well, therefore, the search is performed in the limit of vacuum configurations. After 500 iterations of TPE, L-BFGS refines the best configuration. Although the NN model has rarely seen a negative well configuration during training, it can reliably predict equilibria with a negative well (table 5). The optimization converged to a configuration for which the model predicts a magnetic well depth of −0.01. The configuration has been evaluated also with VMEC, which reports a magnetic well depth of −0.009. Indeed, the vacuum magnetic well (W (ψ) =

Improved fast particle confinement
A classical stellarator does not confine fast particles as good as a tokamak, therefore, the magnetic geometry has to be optimized to do so. W7-X standard and high-mirror configurations have been optimized to improve fast particle confinement at high-β. Moreover, [69,75] have also found improved configurations.
Can we use the model to search for a W7-X fast particle optimized configuration? To address this question, we employ a similar objective function to the function used for the min-max configuration in [69], and we investigate whether the optimized configuration shows the same performance in terms of fast particle confinement.
In a perfect QI configuration, the variance of the extrema of the magnetic field strength across all field lines vanishes. As in the case of the min-max configuration, the variance of B at the bean-shaped cross-section (i. e., ϕ = 0 or the B max -contour) and at the triangular cross-section (i. e., ϕ = π/N fp or the where Var Bmax and Var Bmin are the variances of the magnetic field strength at the bean-shaped and triangular cross-section, respectively, and is a mirror factor to ensure that the extrema of the magnetic field strength are located at the correct angular position (maxima at ϕ = 0 and minima at ϕ = π/N fp ). β * = 2 % and mr * = 10 % (like in the W7-X high-mirror configuration). Finally, w B = 10 2 , w β = 10 3 , and w mr = 10 2 . The optimization is carried out for 500 iterations of the TPE algorithm. SIMPLE [82,83] assess the fraction of lost particles from the plasma volume. 10 4 particles at 60 keV are launched from s = 0.06, and their trajectories are followed for 0.1 s, or till they are lost from the computational domain. The optimized configuration is compared against the W7-X standard, high-mirror and min-max configurations. All configurations have been scaled to have the same magnetic field strength on axis B(ϕ = 0) 2.52 T, the same plasma beta β 2 %, and the same plasma volume V p 30 m 3 . The optimized configuration confines fast particles better than the standard and high-mirror configurations, and as well as the min-max configuration (figure 26): the total fraction of lost particles after 0.1 s is only 8.86 % (table 6). Table 6: Relative magnitude of the non-planar and planar W7-X coil currents for the β = 2 % fast particle confinement optimized configuration. The total fraction of lost 60 keV fast ions initialized at s = 0.06 is reported.

Discussion and conclusion
We propose a NN model that quickly approximates the ideal-MHD solution operator in a stellarator geometry. The model correctly predicts the equilibrium solution, and it faithfully reconstructs global equilibrium properties and proxy functions that are commonly used in stellarator optimization frameworks. Compared with previous data-driven approaches, the model fulfils equilibrium symmetries by construction. In addition, the MHD force residual regularizes the model to better satisfy the ideal-MHD equations.
The error between the model and the ground truth equilibrium solution is less than 1 mm in case of the flux surface locations. However, mainly due to the high condition number of the equilibrium problem, the model equilibria only poorly minimize the MHD force residual. Yet, the model predictions can be improved at inference time to yield better than ground truth equilibria (i. e., equilibria that minimize the MHD force residual better than the ground truths).
Equilibrium properties that are highly sensitive to the magnetic field structure cannot be precisely reconstructed. The magnetic field depends on first-order derivatives of the equilibrium solution, therefore, even small errors in the equilibrium geometry might lead to large discrepancies in the computed magnetic field. Nevertheless, the magnetic field strength is smoothly reconstructed in the core region, but the model introduces artificial field ripples at the edge. As a consequence, the model fails to meet the required magnetic field accuracy to faithfully compute some quantities of interest (e. g., the effective ripple, which is a measure of neoclassical transport). Still, the accuracy of the model is sufficient to explore and optimize W7-X magnetic configurations, in terms of ideal-MHD stability and fast particle confinement. This result suggests that NN-based surrogate models can be used to find optimized configurations for current and future stellarator devices.
The model has still many limitations. The NN is trained in a supervised learning fashion, therefore, the training data limit the applicability of the model (the model can only approximate W7-X equilibria). Moreover, noise-free plasma profiles are assumed. Finally, the model accuracy on the magnetic field limits its use in tasks that are highly sensitive to it.
This work can be improved along multiple dimensions. The robustness of the model with respect to experimental noise can be investigated and enhanced with established robustness and regularization techniques: injecting noise to the input plasma profiles and training the convolutional layers with dropout. The artificial field ripples introduced by the model can be ameliorated by adding a regularization term on the second-order radial derivative (e. g., ∂ 2 R ∂s 2 ), or, by replacing the learned trunk network functions with an orthonormal set of basis functions (e. g., Legendre polynomials). In addition, the model can be trained on an extended space of stellarator equilibria, relaxing the constraint of W7-X configurations. Finally, equilibrium reconstructions in a stellarator geometry must be demonstrated. Contrary to a tokamak, the computation of MHD equilibria is a limiting factor in stellarator research and design. Fast, differentiable and accurate three-dimensional MHD equilibrium solutions would reduce the gap between tokamaks and stellarators. In particular, they would enable: real-time equilibrium reconstructions [84,85], first-principles flight simulators [8], data-driven plasma control [86], and extensive understanding and exploration of the stellarator optimization space.

Author Statement
The contributions to this paper are described using the CRediT taxonomy [87]:
The data sets were generated on the Max Planck computing and data facility (MPCDF) cluster "CO-BRA", Germany. Financial support by the European Social Fund (ID: ESF/14-BM-A55-0007/19) and the Ministry of Education, Science and Culture of Mecklenburg-Vorpommern, Germany via the project "NEISS" is gratefully acknowledged. This work has been carried out within the framework of the EUROfusion Consortium, funded by the European Union via the Euratom Research and Training Programme (Grant Agreement No 101052200 -EUROfusion). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.

Equilibria Fourier scaling
A trade-off must be struck between model capacity and computational complexity. Given the poloidal and toroidal expansions of the equilibrium solution in Fourier series, the number of Fourier modes utilized to represent the solution limits its accuracy: a large number of Fourier modes enhances the model accuracy. On the other hand, it increases the computational complexity of the model, as well as the training and inference times. How many Fourier modes are adequate?
To examine such trade-off, the approximation error on magnetic field strength is investigated. One of the objectives of this study is to investigate the precision with which the model reconstructs equilibrium properties. The magnetic field strength is one of the critical equilibrium properties. Therefore, the Fourier resolution of the equilibrium solution should not be the largest source of error in representing the field strength.
To represent the magnetic field strength with an error of less than 1 %, up to m = 8 poloidal and |n| = 11 toroidal models are required. Figure 27 shows the magnetic field approximation error when the Fourier resolution is varied, using VMEC equilibria with up to m = 11 and |n| = 12 Fourier modes as "high-fidelity" references.

Model architecture and HPs
The trunk network maps the radial dimension s into a set of non-linear basis functions, in which the X j mn Fourier coefficients andι profile are radially expanded (see section 3.2). Each Fourier coefficient andι has its own set of basis functions. The input dimension is N ti = 1, and the output dimension is N to = (N X j mn + Nι )N tb = 4696, where N X j mn = 3[(m pol + 1)(2n tor + 1) − n tor ] − 2 = 586 is the number of Fourier modes used to represent (R, λ, Z), m pol = 8 is the highest Fourier poloidal mode, n tor = 11 is the highest Fourier toroidal mode, Nι = 1 is the single output needed to represent the iota profile, and N tb = 8 is the number of trunk basis functions per output.
The trunk network is a MLP network with sigmoid linear unit (SiLU) as non-linear activation functions (table 7). Each layer is initialized accordingly to [93].
The branch network maps the MHD parameters and input plasma profiles into the weighting coefficients of the trunk basis functions (see section 3.2). A profile head extracts high-level features from the input profiles into a latent representation. A set of blocks made of one 1D convolutional and pooling layers composes the profile head. The branch body then processes the concatenation of the profile latent representation with the input scalar parameters. The output dimension is N bo = (N X j mn + Nι )(N tb + 2) = 5870, where the "+2" is for the constant and identity functions. 5-95% quantiles mean Figure 27: The approximation error on the magnetic field strength when representing the equilibrium solution {R, λ, Z} with a reduced number of Fourier modes. The equilibrium with up to m = 11 poloidal and |n| = 12 toroidal mode is taken as a reference. The solid line represents the average error across 128 randomly sampled equilibria in the data set, and the shaded region represents the 95 % and 5 % quantiles. W7-X has a mean field 2.5 T, therefore, a field error of 25 mT represents a 1 % error. The branch body is a MLP network with SiLU as non-linear activation functions (table 7). Each layer is initialized accordingly to [93]. The last layer is initialized at the median value of the model outputs evaluated on 20 % of the training data set.
The overall training process is divided in three different stages (see section 4.3): data, gradient, and physics regularization stages. During the data stage, the learning rate is decreased accordingly to an exponential decay schedule with γ as decay constant. During the gradient and physics regularization stages, the learning is kept constant. Only after the first step of the curriculum learning procedure in the physics regularization stage, the learning rate is decreased by a factor of 10. In addition, in every training stage, the norm of the model gradients is clipped. Tables 9 to 11 list the HPs in each training  HP value batch size 128 initial learning rate 8.0 × 10 −4 γ 9.9 × 10 −1 L 2 regularization 6.0 × 10 −2 α R 2.5 × 10 −1 α Z 2.5 × 10 −1 α λ 2.5 × 10 −1 αι 2.5 × 10 −1 patience epochs 150 gradient clip 1.0 × 10 −2 stage. Table 12 lists the HPs for the fine-tuning procedure.

Derivation of MHD force residual
This section shows how the MHD force residual can be derived once the mapping x = (R, λ, Z), as well as the p(s) andι(s) profiles, are known. As an example, F β is derived; F s can be derived in a similar manner.
Let us recall that: where √ g is the Jacobian of the 3D coordinate transformation f : (s, θ, ϕ) → (R, φ, Z), and it is expressed as √ g = ( e s · e θ × e ϕ ). Recalling the definition of the covariant basis vectors (see section 2.1), the choice of ϕ = φ effectively yields a 2D Jacobian: The covariant magnetic field components (see equations (8) and (9)) are connected to the contravariant magnetic field components (see equations (5) and (6)) through the covariant metric tensor elements:    Finally, the expression for F β reads: F β depends only on x, its first and second derivatives, andι.

Physics regularization
The proxy of the MHD force residual f * is derived as follows.
The magnetic field can be written in contravariant form as: By substituting the expression of B into F , the covariant form of the MHD force is: where β = √ g(B ϕ ∇θ − B θ ∇ϕ).
The contravariant components of the current density, J i = J · ∇α i = 1 µ0 ( ∇ × B) · ∇α i , are: By substitute the contravariant current density components in the radial force component, it follows that: Let us define the proxy of the MHD residual as the flux surface averaged force residual norm when F β = 0: where A = 2π 0 2π 0 Adθdϕ denotes a flux surface average operator (note that the √ g factor is not included in the average operator here). Substituting the contravariant magnetic field components (see equations (5) and (6)) in equation (79), the following terms result: √ gB ϕ ∂B ϕ ∂s √ gB ϕ ∂B s ∂ϕ The ∂λ ∂ϕ and ∂λ ∂θ terms have been integrated by parts exploiting the fact that λ and the covariant component of the B field are periodic functions in the poloidal and toroidal directions. The second and fourth terms cancel each other.
In addition, when F β = 0 and √ g = 0, it follows that: Therefore, also the second addenda of the first and third terms cancel each other. Then, the MHD force residual proxy is:

Boozer angular separation δ on test set
The evaluation of the reconstruction of the Boozer angular separation δ on the test set is shown here. The median regressed equilibrium in the test set does not have a minimum of the magnetic well strength for ϕ = π N fp , therefore, the Boozer angular separation δ depicted in figure 28 does not represent the angular separation between bounce points. Nevertheless, figure 28 shows how the model smoothly reconstructs the local magnetic field strength close to the magnetic axis, and it introduces artificial field ripples in the edge region.