A new approach to the formulation and validation of scaling expressions for plasma confinement in tokamaks

A. Murari; E. Peluso; M. Gelfusa; I. Lupelli; P. Gaudio

doi:10.1088/0029-5515/55/7/073009

1. The extrapolation of the energy confinement time to future devices

The problem of extrapolating the energy confinement time from present devices to the future generation of machines is a long standing issue in the tokamak community. For several decades various approaches have been pursued, both theoretically and experimentally. Theoretically, the main theories typically make one of two quite extreme assumptions. Either the transport is dominated by short wavelength turbulence (drift waves), to be scaled to intrinsic plasma parameters, or the cross field transport is due to long wavelength turbulence (MHD), scaled to the main dimensions of the devices. The first family comprises the so called gyro-Bohm scalings and the second is indicated with the name Bohm scalings [1]. Unfortunately, even if substantial progress has been made in characterizing the transport mechanisms in tokamaks, the physics is not sufficiently established to allow performing first principle calculations to interpret the results of the present devices. In any case, even if the mechanisms dominating transport were well understood, modelling them in sufficient detail to obtain realistic values of the global confinement time would remain a quite challenging task. Therefore, since many decades, the theoretical studies have been complemented by experimental efforts aimed at collecting enough data to derive data-driven scaling expressions for the confinement time, sufficiently robust to be extrapolated to the next generation of devices. The obtained databases have typically been analysed using traditional log regression. The resulting scaling expressions are therefore in the form of power law monomials:

$\begin{equation} A=\frac{B^{\beta}\,C^{\chi}\cdots \;D^{\delta}}{G^{\gamma}\,H^{\lambda}\cdots \;K^{\mu}}, \end{equation} \tag{ 1 }$

where the capital letters represent physical quantities [2]. This assumption on the form of these scaling equations is not justified neither on theoretical nor on experimental ground. Indeed theoretically there is no justification to limit the analysis to scalings obeying power law expressions. Experimentally, the vast majority of the used databases do not present statistical distributions compatible only with power laws.

In general, it must be considered that the formulation of the scaling expressions as power laws can be unsatisfactory for several reasons. In power law scalings, there is no saturation of the effects even when the independent variables grow to infinity or decay to zero. These scalings are also monotonic and tend to overestimate the relevance of the variables with the longest tails. The interaction of all the variables is also assumed to be multiplicative, which normally results in non-integer exponents of the independent variables not always easy to reconcile with theory. It is also important to appreciate that there are many dynamical interactions which can give rise to power laws and therefore this statistics in not very informative about the underlying physics. These inadequacies of power laws have become evident recently and are motivating efforts to extract more sophisticated scaling expressions from the data [3]. In the case of tokamak physics, the limitations of the power law scalings have clearly been shown in [4, 5]. The rigidities' of power laws can be particularly problematic in case of significant extrapolations, as it is the case of trying to model ITER on the basis of present day machines.

In order to alleviate this problem of uncertainties in the extrapolation and to obtain a more reliable physical basis for the results, the scaling expressions of the confinement time have also been formulated in terms of dimensionless variables [6]. In this case, the main idea consists of the observation that, if the basic equations governing the plasma behaviour are invariant under a certain class of transformations, also the scaling expressions for these plasmas must show the same invariances. In the literature, the plasma behaviour has been assumed to obey the Vlasov equation in various approximations and limits. These assumptions allow on the one hand to identify the dimensionless variables to be used in the scalings and, on the other hand, they provide constraints on the exponents of the various terms [7]. Again the scaling expressions are assumed 'a priori' to be power law monomials. All the mathematical limitations of the power laws as scaling functions apply to the case of dimensionless variables. Again in practice, the databases expressed in terms of dimensionless variables do not support the assumption of distributions of the power law form.

In this paper advanced techniques of symbolic regression via genetic programming are applied to the problem of deriving scaling expressions for the confinement time from large databases without practically any assumption about their mathematical form. The proposed methods allow identifying the most appropriate mathematical expression for the regression equations. Moreover, the versatility and robustness of these techniques is such that they allow regressing also the non-dimensional quantities; it is worth pointing out that this can be problematic for the databases analysed in this paper using log regression. In any case, independently from the choice of the variables, dimensional or dimensionless, the best scaling identified by symbolic regression, in the case of the studied ITPA H-mode database, are not power law monomials. This has been confirmed by a series of statistical methods deployed to assess how well the candidate scaling expressions interpret the available experimental evidence. A combination of non-linear fitting tools and statistical criteria will be shown to address this aspect of the analysis in a quite reliable manner. The extrapolation of the identified scaling expressions indicate that the confinement time in ITER could be about 20% lower than predicted by the traditional power law scalings (but the various estimates agree within their respective confidence intervals).

With regard to the structure of the paper, section 2 provides an overview of the main methodology developed. In particular, symbolic regression via genetic programming, the main tool for the exploration of the databases, is introduced. The main mathematical details about the methodology are provided in appendix A1. In section 3 the ITPA database is reviewed. The obtained scaling expressions in non-power law form and expressed in terms of dimensional quantities are reported and discussed in section 4. Section 5 contains the same analysis for the scalings expressed in terms of dimensionless variables. The conclusions and prospects of future activities are discussed in the last section 6 of the paper.

2. The statistical tools for the exploration of the databases, the extraction of the scaling expressions and their selection

As mentioned in the introduction, this paper describes the application of advanced statistical techniques to the problem of deriving scaling expressions for the confinement time from large databases. The main advantage of the proposed approach consists of practically eliminating any assumption about the form of the scaling expressions. The methods developed indeed allow identifying the most appropriate mathematical expression for the regression equations and to demonstrate that it has the potential to better interpret the present experimental data for the confinement time in comparison with power laws (PLs).This section describes briefly the mathematical basis of the tools implemented to perform the analysis used in the rest of the work. All the details can be found in the references and are summarized in appendix A1.

The objective of the method consists of testing various mathematical expressions to interpret a given database. The main stages to perform such a task are reported in figure 1. First of all, the various candidate formulas are expressed as trees, composed of functions and terminal nodes. The function nodes can be standard arithmetic operations and/or any mathematical functions, squashing terms as well as user-defined operators [8, 9]. The terminal nodes can be independent variables or constants (integer or real). This representation of the formulas allows an easy implementation of the next step, symbolic regression with genetic programming. Genetic programs (GPs) are computational methods able to solve complex and non-linear optimization problems [8, 9]. They have been inspired by the genetic processes of living organisms. In nature, individuals of a population compete for basic resources such as water, food and shelter. Those individuals that achieve better surviving capabilities have higher probabilities to generate descendants. As consequence, best adapted individuals' genes have a higher probability to be passed on to the next generations.

**Figure 1.** The main steps of the proposed methodology to identify the best scaling laws without assumption on their mathematical form.
Download figure:
Standard image High-resolution image

GPs emulate this behaviour. They work with a population of individuals, e.g. mathematical expressions in our case. Each individual represents a possible solution of a problem. A fitness function (FF) is used to measure how good an individual is with respect to the environment. A higher probability to have descendants is assigned to those individuals with better FF. Therefore, the better the adaptation (the value of the FF) of an individual to a problem, the higher is the probability that its genes can be propagated. The line of reasoning of the standard procedure of GP, to solve a specific optimization problem, is the one summarized below.

(1)
Generate a random population of individuals.
(2)
Evaluate each individual of the population with a fitness function.
(3)
Select individuals (parents) to create a new population; the better their fitness function, the more likely they are chosen for this parental role.
(4)
Pass the genes of the chosen parents, obtaining 'children', forming the new generation.
(5)
Repeat the steps 2 to 4 till an ending condition is fulfilled.

In our application, the role of the genes is played by the basis functions used to build the trees. The list of basis functions used to obtain the results described in the rest of the paper is reported in table 1. Evolution is achieved by using operators that perform genetic like operations such as reproduction, crossover and mutation. Reproduction involves selecting an individual from the current population and allowing it to survive by copying it into the new population. The crossover operation involves choosing nodes in two parent trees and swapping the respective branches thus creating two new offspring. Mutations are random modifications of parts of the trees.

Table 1. Types of function nodes included in the symbolic regression used to derive the results presented in this paper, x_i and x_j are the generic independent variables.

Function class	List
Arithmetic	c (real and integer constants),+,−,*,/
Exponential	exp(x_i), log(x_i), power(x_i, x_j), power(x_i,c)
Squashing	logistic(x_i), step(x_i),sign(x_i),gauss(x_i), tanh(x_i), erf(x_i),erfc(x_i)

To derive the results presented in this paper, the Akaike Information [Criterion (AIC) has been adopted [10] for the FF. The AIC, is a well-known model selection criterion that is widely used in statistics, to identify the best models and avoid overfitting. The AIC form used in the present work is:

$\begin{equation} {\rm AIC}=2k+n\cdot {\rm ln}\left( {{\rm RMSE}} \right),~ \end{equation} \tag{ 2 }$

where RMSE is the root mean square error, k is the number of nodes used for the model and n the number of y_data provided, so the number of entries in the database (DB). The AIC allows rewarding the goodness of the models, by minimizing their residuals, and at the same time penalizing the model complexity by the dependence on the number of nodes. The better a model, the smaller its AIC. All the mathematical justification for its adoption can be found in the references and in appendix A1.

At this stage, the best mathematical expression for the scaling expression has been identified. In order to improve the model, the parameters of the equation are adjusted with traditional tools of non-linear regression [11]. This has proved to be important not only to improve the quality of the models but also to provide confidence intervals in their parameters. It is worth pointing out that our methodology tends to provide more reliable confidence interval than log regression, since its hypotheses are not satisfied by the available databases [4, 5]. All these aspects, together with the techniques for non-linear fitting, are discussed in more detail in appendix A1.

Having optimized the models with non-linear fitting, now what remains is the qualification of the statistical quality of the obtained scaling expressions. To this end a series of statistical indicators have been implemented. They range from model selection criteria, such as the Bayesian information criterion (BIC), to statistical indicators, such as the Kullback–Leibler divergence (KLD) [12, 13]. These are nowadays quite standard indicators and in our case they all agree that the new proposed scaling expressions are better than the previous power laws. In any case, for the benefit of the interest readers, they are described in some detail in appendix A1.

3. The ITPA database and weight selection

3.1. The main characteristics of the ITPA database DB3v13f

To maximize the generality of the results obtained with the methodology described in the previous sections, an international database has been considered [14]. This database was explicitly conceived to support advanced studies of the confinement time and includes validated signals from the vast majority of the most relevant tokamak machines ever operated in the world. In line with the previous literature on the subject, the following quantities have been considered good candidate regressors in the present work.

Dimensionless variables: ρ^*, β, ν^*, κ_a, , q₉₅, M
Dimensional variables: B(T), I(MA), n(10¹⁹ m⁻³), R(m), M, ε, k_a; P(MW).

In the previous lists, ρ^* indicates the normalized ion Larmour radius, β the normalized plasma pressure, ν^* the normalized collision frequency, k_a the volume elongation measurement, ε the inverse aspect ratio, q₉₅ the plasma safety factor evaluated at the flux surface enclosing the 95% of the poloidal flux, M the effective atomic mass in a.m.u, n the central line average plasma density, B the toroidal magnetic field, R the plasma major radius, I the plasma current and finally P the estimated power loss [15]. In agreement with previous works [15], in the case of regressions in terms of dimensionless variables, the dependent quantity considered is the product of the ion cyclotron frequency, conveniently rescaled, times the confinement time, y = τω_ci. The definitions of the other dimensionless quantities are the same normally used to interpret the experiments and reported in the literature [15]:

$\begin{eqnarray*}&&\fl \rho \ast =\left( {\frac{2m_{p}}{3e^{2}}} \right)^{\frac{1}{2}}\cdot \left( {\frac{MW}{Vn}} \right)^{\frac{1}{2}}\cdot \frac{1}{Ba}\quad \beta =\frac{4\mu _{0}}{3}\cdot \frac{W}{VB^{2}}, \end{eqnarray*}$

$\begin{eqnarray*}&&\fl k_{a} =\frac{V}{2\pi^{2}a^{2}R}\quad \epsilon =\frac{a}{R}\quad q=q( a )=\frac{2\pi a^{2}Bk}{\mu_{0} IR},\\ &&\nu \ast =\bigg( {\frac{15e^{4}ln({\Lambda_{i}})}{4\pi ^{\frac{3}{2}}\epsilon_{0}^{2} \mu_{0}}}\bigg)\cdot \frac{V^{2}R^{2}B\epsilon^{\frac{1}{2}}n^{3}k}{W^{2}I}. \end{eqnarray*}$

The entries of the database have been selected adopting exactly the same criteria used in [15] to build the database DB3, from which the reference scalings in power law form were derived. This choice has been motivated by the main goal of the paper, which consists of demonstrating the higher potential of the new data analysis tools compared to the traditional log regression. In this perspective, it is only appropriate to apply the new tools exactly to the same database DB3, used to obtain the reference power law scalings [15]. The final database analysed consists therefore of 3093 entries for the dimensional scalings and a total of 2806 examples for the dimensionless set. Figure 2 shows the trend of the confinement time versus four dimensional physical quantities, while figure 3 shows the behaviour of the non-dimensional product τ · ω_ci with four non-dimensional quantities.

**Figure 2.** Trend of the confinement time with four physical quantities considered in the pools of selected data: (a) (I(MA)); (b) τ(B(T)); (c) τ(n(10¹⁹ m⁻³)); (d) τ(P(MW)).
Download figure:
Standard image High-resolution image

**Figure 3.** Trend of the confinement time with four physical quantities considered in the pools of selected data: (a) ·ω_ci(ρ^*); (b) τ · ω_ci(β); (c) τ · ω_ci(v^*); (d) τ · ω_ci(κ_a).
Download figure:
Standard image High-resolution image

All the quantities selected as regressors are routinely available in all the major tokamaks, providing enough data for a sound statistical analysis. The uncertainties in the dimensional quantities can be derived from [15] and the information provided in the database. These errors are summarized in table 2.

Table 2. Uncertainties in the dimensional variables in the ITPA Database.

	ε_I	ε_B	ε_R	ε_n	ε_a	ε_M	ε_W	ε_P	ε_k	ε_V	ε_q95	ε_τ
Error %	1.3	1.5	1.3	5	0.9	8.4	14.1	14.2	3.7	4.6	8.1	28.3

From the information about the dimensional variables, the uncertainties in the dimensionless quantities can be estimated with traditional error propagation. The results are reported in table 3. It is worth mentioning that it has been checked numerically that SR via GP can handle this level of uncertainties. First, a series of systematic numerical tests, with hundreds of models of the widest mathematical form and noise level up to 50%, have been performed. Given enough data and computational time, the proposed method can certainly handle level of uncertainties of 40% and performs always better than log regression, particularly against collinearities and outliers. Second, our methodology, thanks to the non-linear fitting step, provides robust estimates of the uncertainties in all the derived quantities as reported in the rest of the paper.

Table 3. Errors in the non-dimensional variables, computed using table 2.

	ε_ρ	ε_β	ε_ν	ε	ε_κa	ε_τ·ωci
Error %	34.5	20.2	33.7	2.2	6.8	38.2

3.2. The selection of the weights

For all the results reported in the rest of the paper, the weights have been chosen with the method of the percentiles. This technique allows weighting data depending on their distribution. The objective consists in fact of weighting more the data falling in the tails of their distributions, since those data carry more relevant information for the determination of scaling expressions. For instance high plasma current values (I > 2.5 (MA)) of JET are extremely valuable since they are the closest to ITER operational region of 15 MA. Even if statistically they could be considered as outliers, physically they are the most relevant points.

For each physical quantity, five percentiles have been computed in order to divide the distribution function in six partitions, which can be independently weighted. In our case we have chosen the inverse of the cumulative probability, defining the percentiles themselves, as the weight for data falling in each different partition. This can be repeated for all the selected physical quantities to obtain a final weight for each entry of the DB. These weights have been used consistently in all the various steps of the data analysis method reported in figure 1. More sophisticated methods are under study but it is worth mentioning that, for this database, the traditional choice of the weights, using the number of entries per machine [15], does not change qualitatively the results obtained in the paper. A sensitivity study has indeed been performed showing that small variations in the weights do not alter significantly the final results. Therefore the qualitative difference between the scaling expressions found with symbolic regression and the ones in the literature remain valid independently from the choice of the weights.

4. Results in terms of dimensional quantities

The best models found with symbolic regression via genetic programming in general are not in power law form but they present additional terms. In particular, in the case of scalings with dimensional quantities, the most performing models include one or more multiplicative saturation\squashing terms in the plasma current, plasma density or magnetic field. It is interesting to note that the method by itself selects these three quantities as the ones responsible for the saturation of the confinement time. However, the plasma volume, represented by the major radius R, is not found to cause any saturation as intuitively expected, since there is no physical reason for the confinement to present non-linear saturation terms scaling with volume.

Among the best scalings found by symbolic regression, the statistically most performing functional form for the confinement time, in terms of dimensional quantities, is reported in table 4. Also the power laws (PL1 and PL2) typically used as reference by the community are reported: PL1 is IPB98(y,2) and PL2 is EIV of [15]. The most important aspect of the non-power law (NPL) functional form is the presence of a single saturation term depending only on the ratio between density and magnetic field. Since the various physical quantities appear as products in this NPL expression, their multiplicative constants can be considered to have the right units to obtain the appropriate dimensionality as discussed in [15]. The scaling with density of the models in table 4 can be seen in figure 4, once the other physical quantities have been set to their values at ITER. To obtain the right dimensionality for NPL, it must be assumed that the multiplicative constant in front of the expression presents the dimensions: [I]^−1.00[l]^−0.2[P]^0.74[t] (where l indicates the unit of length, t the unit of time, etc). The multiplicative constant in the exponent of the squashing term h(n, B) has dimensions of [B]^−1.37[l]^4.1.

**Figure 4.** Comparison of the PLs and NPLs scaling expression behaviour with the plasma density at the parameters of ITER.
Download figure:
Standard image High-resolution image

Table 4. Power laws (PLs) and non-power law model (NPL). PL1 is the IPB98(y,2) scaling while PL2 is (EIV) [15]. The term h(n,B) is: $h({n,B}) =n^{0.448_{0.436}^{0.460}} \cdot ({1+{\rm e}^{-9.403_{-9.694}^{-9.112} \cdot ( {\frac{n}{B}})^{-1.365_{-1.412}^{-1.318}}}})^{-1}$ .

PL1	$5.62\times 10^{-2}I^{0.93} B^{0.15}n^{0.41}M^{0.19}R^{1.97}\epsilon^{0.58}\kappa_{a}^{0.78} P^{-069}$
PL2	$5.55\times 10^{-2}I^{0.75} B^{0.32}n^{0.35}M^{0.06}R^{2.0}\epsilon^{0.76}\kappa_{a}^{1.14} P^{-062}$
NPL	$0.0367_{0.0366}^{0.0369} \cdot I^{1.006_{0.996}^{1.016}} R^{1.731_{1.712}^{1.751}} \kappa_{a}^{1.450_{1.411}^{1.489}} P^{-0.735_{-0.744}^{-0.727}}h\left( {n,B} \right)$

The equations of the scaling expressions have been used to generate their estimate corresponding to the values of the database and their pdfs have been compared with the experimental ones. The Kullback–Leibler divergence, the MSE and the BIC and AIC criteria all show that the NPL scaling is better than the PLs in interpreting the experimental data available. The comparison between the traditional PL and the NPL scalings, in terms of statistical indicators, is summarized in table 5, proving the best quality of the NPL regression. The good fit of the experimental data obtained with the NPL model is illustrated also graphically in figure 5.

**Figure 5.** Log–Log plot of the energy confinement time of the proposed NPL model versus the experimental values.
Download figure:
Standard image High-resolution image

Table 5. Statistical estimators used to qualify the scaling reported in table 2. The KLD has been computed in a range of ±6σ around the mean value of the data. MSE is the mean square error in units of second².

	k	AIC	BIC	MSE (s²)	KLD
PL1	10	−19 416.86	−19 362.86	1.866 × 10⁻³	0.0337
PL2	10	−19 084.36	−19 203.68.	2.077 × 10⁻³	0.0802
NPL	10	−19 660.03	−19 599.04	1.724 ×10⁻³	0.0254

In some cases, particularly in comparison with the scaling PL1, the improvement in the statistical indicators is not dramatic. On the other hand, all the indicators show consistently that the NPL is the best model, which gives a quite high degree of confidence in this scaling. Moreover, the improvement in the quality of the NPL expression in terms of KLD, which takes into account the probability distribution of the residuals, is quite significant. In any case, even if the superior properties of the NPL scaling are quite consolidated in statistical terms, its extrapolation capability could still be questioned. In reality, the better quality of the NPL expression in this respect is in a certain sense proved by its best performance in terms of AIC and BIC. These two indicators, being model selection criteria, already take into account the complexity of the equations and therefore tend to avoid overfitting (see appendix A1).

On the other hand, extrapolation is always a delicate matter, particularly when the gap from existing examples to new devices is as large as in the case of ITER. Therefore, to assess the potential of NPL also in this respect, a first extrapolation test has been performed consisting of fitting the NPL model on the database for currents below 2.5 MA and then extrapolating this fitted model to higher current data. The NPL scaling, once fitted and extrapolated, shows similar statistical values respect to the PLs, but improves clearly the reconstruction of the high current data distribution (KLD_NPL = 0.1258, while KLD_PL1 = 0.3210 and KLD_PL2 = 0.7478). A second and more relevant test has been performed, by non-linearly fitting the various models using data of smaller devices and testing the results with JET data. Since the PL1,2 models have the same variables, only one fit has been performed and labelled as PLs. The non-linear fits for the PLs and the NPL have been performed using the weights obtained by the percentiles method described above. The PLs perform slightly better on the small machines (MSE_PLs = 3.322× 10⁻⁴ s² and KLD_PLs = 0.0516; while MSE_NPL = 3.531 × 10⁻⁴ s² and KLD_NPL = 0.0689), but when the scaling are applied to JET data, the superiority of the NPL model in terms of the indicators considered can be clearly seen; the fitted models are reported in table 6 and their statistical estimators for JET data in table 7.

Table 6. Power laws (PLs) and Non-Power Law model (NPL) fitted on the subset of data without the JET's entries. The term h(n,B) is: $h({n,B}) =n^{0.279_{0.269}^{0.290}}\cdot ({1+{\rm e}^{-13.645_{-14.961}^{-12.328} \cdot ({\frac{n}{B}} )^{-0.802_{-0.877}^{-0.728}}}})^{-1}$ .

PLs	$5.35_{5.30}^{5.39} \times 10^{-2}I^{0.67_{0.62}^{0.73}}B^{0.12_{0.07}^{0.17}}n^{0.38_{0.35}^{0.41}}M^{0.43_{0.35}^{0.51}}R^{1.69_{1.59}^{1.78}}\epsilon^{0.53_{0.44}^{0.63}}\kappa_{a}^{0.39_{0.29}^{0.48}} P^{-0.54_{-0.57}^{-0.52}}$
NPL	$0.0647_{0.0643}^{0.0651} \cdot I^{0.959_{0.947}^{0.972}}R^{1.216_{1.199}^{1.233}}\kappa_{a}^{0.280_{0.244}^{0.316}} P^{-0.503_{-0.490}^{-0.516}}h\left( {n,B} \right)$

Table 7. Statistical estimators used to qualify the scaling extrapolated on JET data.

	k	AIC	BIC	MSE (s²)	KLD
PLs	10	−5842.20	−6720.79	1.578 × 10⁻²	5.895
NPL	10	−6005.91	−6975.85	1.406 × 10⁻²	3.829

The higher extrapolation capability of the scaling expressions in NPL form motivates a revision of the expected performance in terms of confinement time for ITER. Using the equations of table 4, the predicted value of the confinement time for ITER (n_e = 10.3 10¹⁹ m⁻³, κ_a = 1.70, I_p = 15 MA, R = 6.2 m, P = 87 MW) is about $2.96_{2.53}^{3.46}$ seconds to be compared to the $3.6_{4.14~}^{3.13}$ seconds of the traditional extrapolations obtained with power law scalings (IPB98). The non-linear scaling expression foresees a similar value but slightly lower confinement time for ITER compared to the traditional scaling expressions. The two estimates agree within the confidence intervals but would tend to diverge more for larger devices such as DEMO. This is due to the effect of the saturation term, which tends to be more significant at higher densities.

5. Results in terms of dimensionless quantities

The procedure described in section 3, consisting of symbolic regression plus non-linear fitting, has been also applied to the ITPA database in terms of dimensionless variables. The best scaling expression obtained is given by the following equation (3) and will be referred as AdNPL from now on:

$\begin{eqnarray} &&\fl \tau \cdot \omega_{ci} =\left( {1.13} \right)_{1.11}^{1.15} \cdot 10^{-6}\cdot \frac{\kappa_{a}^{1.93_{1.70}^{2.12}} \beta ^{0.37_{0.33}^{0.41}}M^{0.57_{0.46}^{0.67}}}{\rho^{2.19_{2.16}^{2.22}}\nu^{0.40_{0.39}^{0.42}}q^{0.16_{0.08}^{0.23}}}\nonumber\\ &&-0.072_{-0.085}^{-0.060} \cdot \kappa_{a}^{1.18_{0.94}^{1.40}}\nonumber\\ &&+ -0.009_{-0.011}^{-0.006} \cdot q^{1.08_{0.94}^{1.21}}+0.15_{0.13}^{0.17} \cdot M^{0.07_{-0.05}^{0.19}}. \end{eqnarray} \tag{ 3 }$

This time, in addition to a traditional power law factor, some saturation terms are included which are now additive. With regard to the dimensionality, it should be observed that all the additive terms are dimensionless and therefore the multiplicative constants are pure numbers.

To appreciate better the quality of the scaling expression (3), the residuals have been plotted in figure 6, which shows that their pdf can be quite well approximated with a zero centred Gaussian. It is worth emphasizing that regressing in terms of these non-dimensional quantities is another added value of our proposed methods. The used ITPA database presents too many collinearities and the term τω_ci is not sufficiently independent from the dimensionless quantities to simply apply log regression. Since, as will be shown later, the new scaling given by equation (3) provides the same confinement time for ITER as the one using the dimensional quantities, this fact increases significantly the confidence in the obtained results.

Relation (3) has been compared with some of the most credited scaling expressions reported in the literature. In particular the two formulations chosen for the comparison are given in [15].

Since they have been derived starting from PL1 and PL2 [15], they have been respectively labelled as AdPL1, equation (4), and AdPL2, equation (5), from now on. Equation AdPL1 is:

$\begin{equation} \tau \cdot \omega_{{\rm ci}} =7.214\times 10^{-8}\cdot \frac{M^{0.96}\epsilon^{0.73}\kappa_{a}^{3.3}}{\rho^{2.70}\beta ^{0.90}\nu^{0.01}q^{3.0}} \end{equation} \tag{ 4 }$

Again, to better appreciate the quality of the scaling expression, the residuals have been plotted in figure 7, which shows that their pdf cannot be well approximated with a zero centred Gaussian.

The other main scaling expression suggested in [15] is AdPL2:

$\begin{equation} \tau \cdot \omega_{{\rm ci}} =1.836\times 10^{-8}\cdot \frac{M^{0.49}\kappa _{{\rm a}}^{3.34}}{\rho^{2.67}\beta^{0.57}\nu^{0.14}q^{1.84}\epsilon ^{0.19}}. \end{equation} \tag{ 5 }$

Also in this case the indicators are significantly worse than those of the equation in AdNPL form. The lower quality of this model for the interpretation of the ITPA database is easy to appreciate by inspection of the residuals reported in figure 8.

**Figure 8.** Residuals of equations AdPL2 and AdNPL.
Download figure:
Standard image High-resolution image

Finally table 8 summarizes the statistical estimators of the models expressed in terms of dimensionless quantities above:

Table 8. Statistical estimators used to qualify the scalings reported in terms of dimensionless quantities.

	k	AIC	BIC	MSE	KLD
AdPL1	9	−1650.59	−2533.00	0.5518	0.3316
AdPL2	9	−6034.11	−6744.95.	0.1157	0.1875
AdNPL	14	−13 833.82	−13 758.91	0.715 × 10⁻²	0.0567

The extrapolation capability of the non-power model has been tested as well. Again the non-linear fits have been performed on the subset of data where all the JET's entries have been removed and the fitted model can be found in table 9. The subset of data used to fit the free parameters of the AdNPL model considered contains 1411 entries. For the power law scaling, the original expressions, obtained over the entire database, have been retained (equations (4) and (5)).

Table 9. Non-dimensional non-power law model (AdNPL) fitted on the subset of data without JET entries.

AdNPL

$\left( {1.13} \right)_{1.11}^{1.16} \times 10^{-6}\cdot \frac{\kappa_{a}^{1.57_{1.33}^{1.82}} \beta^{0.29_{0.24}^{0.35}}M^{1.00_{0.78}^{1.22}}}{\rho^{2.17_{2.11}^{2.24}}\nu^{0.39_{0.36}^{0.41}}q^{0.55_{0.45}^{0.65}}}-0.062_{-0.070}^{-0.053} \cdot \kappa_{a}^{0.76_{0.55}^{0.97}} + -0.007_{-0.009}^{-0.004} \cdot q^{0.72_{0.53}^{0.91}}+0.13_{0.12}^{0.14} \cdot M^{-0.06_{-0.15}^{0.03}}$ $\left( {1.13} \right)_{1.11}^{1.16} \times 10^{-6}\cdot \frac{\kappa_{a}^{1.57_{1.33}^{1.82}} \beta^{0.29_{0.24}^{0.35}}M^{1.00_{0.78}^{1.22}}}{\rho^{2.17_{2.11}^{2.24}}\nu^{0.39_{0.36}^{0.41}}q^{0.55_{0.45}^{0.65}}}-0.062_{-0.070}^{-0.053} \cdot \kappa_{a}^{0.76_{0.55}^{0.97}} + -0.007_{-0.009}^{-0.004} \cdot q^{0.72_{0.53}^{0.91}}+0.13_{0.12}^{0.14} \cdot M^{-0.06_{-0.15}^{0.03}}$

The statistical estimators described above (AIC, BIC, MSE) have then been computed using the subset of JET's data previously removed. The results are reported in the table 10, from which the superior extrapolation capability of the AdNPL scaling is evident in all indicators.

Table 10. Statistical estimators for the extrapolation to JET of the dimensionless scaling expressions According to all statistical indicators AdNPL outperforms the traditional power laws.

	k	AIC	BIC	MSE	KLD
AdPL1	9	−294.94	−1102.55	79.905 × 10⁻²	0.573
AdPL2	9	−2435.49	−3072.78	17.226 × 10⁻²	0.283
AdNPL	14	−5610.85	−5723.52	1.756 × 10⁻²	0.230

It is worth pointing out that the power law scalings AdPL1 and AdPL2 were derived in [15] using the entire database, including JET data. The fact that the new scaling expression AdNPL, derived by applying our approach only to the smaller devices, manages to extrapolate to JET better than such scalings is another very important proof of the quality of the new proposed methodology.

The quality of the AdNPL scaling and its good extrapolation behaviour to JET data have motivated the analysis of its estimate for ITER. In the following table 11, the prediction of the AdNPL scaling expression for ITER is compared to the ones of the other two AdPLs scalings from the literature. Moreover, fixing the ITER's parameters, the trends of equations (3), (4) and (5) with the elongation κ_a and plasma volume have been reported in figure 9.

**Figure 9.** Comparison of AdPL1,2 and AdNPL scaling behaviour with the elongation and plasma volume at ITER's parameters. The plasma volume has been chosen because it enters into other three dimensionless quantities; β, ν*, ρ*.
Download figure:
Standard image High-resolution image

Table 11. Comparison of the extrapolated confinement times to ITER.

Equation	τ (s)
AdNPL	$2.97_{2.78}^{3.16}$
AdPL1	3.66
AdPL2	3.29
NPL	$2.96_{2.53{\rm ~}}^{3.46}$

Another observation, which confirms the quality of the results obtained with the proposed methodology, is the general agreement between the estimates of the confinement time obtained both in terms of dimensional and dimensionless quantities. Indeed both scalings, NPL and AdNPL, provide in general very similar estimates of the confinement time over the experimental database available. This is shown graphically in figure 10, which reports the energy confinement time estimated by the expression NPL versus AdNPL. Also the extrapolation to ITER of the two scaling expressions are very similar as reported in figure 11. This remarkable agreement not only increases the confidence in the results but also highlights another advantage of the proposed methodology. As mentioned at the beginning of the section, in the literature the scaling expressions as functions of dimensionless quantities are typically derived by simple mathematical manipulation of the ones expressed in terms of dimensional quantities. They are therefore not the results of an independent exercise in regression. This is due to the limits of the databases, which present too many collinearities and dependencies of τ from the regressors when expressed in dimensionless form. Indeed the technique used to derive power laws, log regression, is based on the following assumptions: (a) no dependencies between the independent variable and the regressors, (b) no collinearity between the regressors and (c) normal distribution of the uncertainties (error bars) in all the variables. The available datasets typically do not satisfy any of these conditions.

**Figure 10.** The estimates of the scaling expressions NPL and AdNPL for the entries of the ITPA database. The general agreement between the two empirical scalings is evident since they do not present any systematic discrepancy.
Download figure:
Standard image High-resolution image

**Figure 11.** Comparison of NPL and AdNPL scalings with elongation at ITER's parameters.
Download figure:
Standard image High-resolution image

On the other hand, symbolic regression is not based on the assumptions of log regression. It is has indeed been tested with a series of numerical tests that the methodology proposed in this paper is robust against the aforementioned limitations of the available databases. A confirmation can be also obtained by the inspection of the obtained models, which are statistically much better than the ones in power law form obtained by rearranging the dimensional scalings.

The saturation terms tend to smooth the trend of the confinement time when the quantities approach ITER's values. In particular, as can be seen clearly in figure 9, the AdPL1 and AdPL2 foresee a quite abrupt increase of τ close to ITER's values of various quantities, outside of the range where there can be any experimental confirmation of such an improvement. This is a general and quite problematic behaviour of all power law scalings.

6. Discussion, conclusions and further developments

In this paper, symbolic regression via genetic programming has been applied to the derivation of empirical scaling expressions for the energy confinement time in Tokamaks. The analysis has been particularized for the H-mode of confinement in terms of both dimensional and dimensionless quantities. The examples used for the present investigation belong to the ITPA international database, which include all of the most relevant machines in the world.

Contrary to the main results reported in the literature, the obtained empirical scalings are not in the form of power laws. Indeed they present either multiplicative or additive saturation terms.

These new expressions have been obtained with a methodology, symbolic regression via genetic programming, which generalizes previous attempts to find non-power law scalings [17–22]. Indeed in the previous studies devoted to non-power law scalings, the mathematical form of the scaling expressions had been assumed 'a priori' from empirical or theoretical considerations. On the contrary, using the techniques presented in this paper, even the mathematical forms of the scalings are purely data-driven, being selected by symbolic regression on the basis of the fitness function.

The superior quality of these new scalings, compared to the traditional power laws, has been demonstrated first of all with the help of a series of statistical indicators. To complement this analysis, the extrapolation capability of the new scalings has been verified by dedicated investigations of different groups of devices (small and large machines) and of different parameters ranges (low and high currents). On the basis of the new found scalings, the confinement time to be expected in an ITER class device is about 20% lower than the predictions of the traditional power laws. This is due to the excess rigidity of the power law scalings (irrespective of the fact that they are express in term of dimensional or dimensionless quantities), which probably tend to overestimate the confinement time in the case of large extrapolations. In any case, for ITER the estimates of the non-power law scalings and the IPB98 overlap within the confidence intervals. However, the difference is more significant for the generation of devices of the size of DEMO. This is due to the fact that the saturation term tends to become more important the larger the device.

On the other hand, the effect of the saturation with increasing density can be seen even in the present largest devices, particularly JET, as shown in figure 12, which shows a comparison of the NPL scaling expression, found using the ITPA database with the usual DB3 selection, and the traditional IPB98 and EIV scalings. From the point of view of the physical interpretation, in our opinion the available database is not of sufficient quality to derive detailed conclusions. Anyway, this saturation seems to be linked to the plasma collisionality; above a certain level of this parameter, the confinement does not continue to improve with density as a power law. The more flexible symbolic regression identifies this dependence on collisionality as a saturation term, instead of an inverse power as is the case for the power law scalings IPB98(y,2) and EIV (see their dimensionless version reported in equations (4) and (5)). This saturation of the confinement with density is in line with the evidence reported in [16]. On the other hand, the available database does not contain a sufficient scan in triangularity to determine how this effect would be affected by the plasma shape, as was found again in [16] (the average triangularity in the ITPA database is only 0.22 with a σ of about 0.1).

**Figure 12.** Scaling with the density for JET:= 2.239 T;I = 2.2 MA;M = 2.0; P = 10.129 MW;R = 2.917 m; = 0.321;k_a = 1.589. The green line is the new non-power scaling, obtained applying the DB3 selection to the entire ITPA database.
Download figure:
Standard image High-resolution image

epsilon — **Figure 12.** Scaling with the density for JET:= 2.239 T;I = 2.2 MA;M = 2.0; P = 10.129 MW;R = 2.917 m; = 0.321;k_a = 1.589. The green line is the new non-power scaling, obtained applying the DB3 selection to the entire ITPA database.
Download figure:
Standard image High-resolution image

With respect to future developments, the obtained results are expected to provide significant inputs to the developments of physics based models of the energy transport [23–25]. Indeed the validation of transport models is one of the main scientific drives of research in JET, given the programme of experiments with different fuel mixtures on the route to the next D–T campaign. In this perspective, the proposed technique should be also applied to new databases, explicitly conceived to study the confinement in metallic devices. Indeed, even if the devised methodology is very powerful and the obtained results are statistically quite sound, these data-driven methods can learn only what is contained in the data. Therefore, particularly in the perspective of providing reliable extrapolations to ITER, specific and more updated databases should be built, to take into account the recent progress in the development of scenarios and the effects of the different wall materials. Dedicated studies should also address the issue of finding the most appropriate experiments to be carried out in order to discriminate between the various scalings, beyond what can be done with a statistical analysis of the presently available data sets. In this direction, certainly JET operation at high current with the new ITER Like Wall [26] should receive high priority, since validated data for optimized discharges around 4 and 5 MA is completely missing. A full D–T campaign in JET should also provide essential new information and allow answering most of the questions remained open after the DTE1 in 1997 [27].

The same methodology, which is absolutely general, can also been applied to other problems, since almost all the scaling expressions in the Tokamak community are expressed in terms of power laws. It is also worth mentioning that the developed technique of symbolic regression via genetic programming can also been used to extract symmetries and invariants from experimental databases [8]; therefore one natural application could be the validation of the dimensionless variables which are normally used to investigate the energy confinement in Tokamaks. Indeed the dimensionless quantities normally used in the literature, and therefore adopted also in this paper, have been derived from theoretical considerations but their adequacy has never been experimentally verified.

Appendix A.: Symbolic regression via genetic programming for the identification of scaling expressions and their selection

As mentioned in the introduction, this paper describes the application of advanced techniques of symbolic regression (SR) via genetic programming (GP) to the problem of deriving scaling expressions for the confinement time from large databases. The main advantage of the proposed approach consists of practically eliminating any assumption about the form of the scaling expressions. The methods developed indeed allow identifying the most appropriate mathematical expression for the regression equations, the so called best unconstrained empirical model structure (BUEMS), and to demonstrate that it has the potential to better interpret the present experimental data for the confinement time in comparison with power laws (PLs).This appendix describes the mathematical basis of the tools implemented to perform the analysis used in the rest of the work. SR via GP is described in some detail in section A.1. The statistical criteria to qualify the obtained scaling expressions are described in section A.2 and the non-linear fitting procedures are briefly introduced in section A.3.

A.1. Genetic programming for database exploration

As mentioned before, the main objective of SR is to identify, for a given finite dataset, a class of appropriate model structures to describe the system under study without 'a priori' hypotheses (BUEMS). Solutions of varying levels of complexity can be generated and evaluated to obtain the best trade-off between accuracy and computational complexity. In this study, SR analysis has been performed using a genetic programming approach. SR via genetic programming is a non-parametric, non-linear technique that looks for both the appropriate model structure and the optimal model parameters simultaneously [8]. This approach provides a natural extension of the traditional linear and non-linear regression methods that fit parameters to an equation of a given model structure.GP is a systematic, domain-independent method that merely creates and searches for the best individual computer program (CP), i.e. the best mathematical expression for modelling the available database, among several possible randomly generated ones. The basic steps in a GP algorithm are shown in figure A1. The first step is the generation of the initial population of CPs (formulas in our case) and then the algorithm finds out how well an element of the population works according to some appropriate metrics. This performance of each model is quantified by a numeric value called fitness function (FF). In the second phase, as with most evolutionary algorithms, genetic operators (reproduction, crossover and mutation) are applied to individuals that are probabilistically selected on the basis of the FF, in order to generate the new population. That is, better individuals are more likely to have more child elements than inferior individuals. When a stable and acceptable solution, in terms of complexity, is found or some other stopping condition is met (e.g. a maximum number of generations or a tolerated error limits are reached), the algorithm provides the solution with best performance in terms of the FF.

**Figure A1.** The basic flow chart of symbolic regression via genetic programming for the best unconstrained empirical model selection.
Download figure:
Standard image High-resolution image

In this work, CPs are composed of functions and terminal nodes and can be represented as a combination of syntax trees, see figure A2. The function nodes can be standard arithmetic operations and/or any mathematical functions, squashing terms as well as user-defined operators. The function nodes included in the analysis performed in this paper are reported in table 1.

The terminal nodes can be independent variables or constants (integer or real). An example of a program representing the expression 3 · P + B · I^0.5 is shown in figure A2. In this particular case, the function nodes set (F) is composed of multiplication, power, addition F = *, +, ∧}.

The terminal nodes set (T) in this example is composed of three variables and two constants T = {B, P, I, 3, 0.5}. The functions and terminals must satisfy two important conditions, in order to find an appropriate representation of the problem. These conditions are the closure property and the sufficiency property. The closure property includes protection of the function set and the terminal set against all inadmissible argument values, e.g. protection against negative square roots, division by zero, etc. The sufficiency property consists of the selection of the appropriate functions and terminals to solve the problem at hand.

The initial population of CP is formed by stochastic combining functions and terminals nodes and then it is evolved stochastically, generation by generation, to be converted into new, better populations. As mentioned before, evolution is achieved by using genetic operations such as reproduction, crossover and mutation. Reproduction involves selecting the program from the current population and allowing it to survive by copying it into the new population. The crossover operation involves choosing nodes in two parent trees and swapping the respective branches thus creating two new offsprings. Figure A3 illustrates the crossover operation. For example, if the expressions for parent I (a) and parent II (b) are as follows before crossover:

$\begin{equation*} {\rm Parent~I}:{\rm ~}\frac{\varepsilon \cdot \left( {B-R+1} \right)}{{\rm n}};{\rm ~Parent~II}:{\rm ~}3\cdot P+B\cdot I^{0.5} \end{equation*}$

the resulting offsprings (c) and (d) after the crossover operation are:

$\begin{equation*} {\rm Offspring~I}:{\rm ~}\frac{\varepsilon \cdot \left( {B-R+1} \right)}{B\cdot I^{0.5}};{\rm ~Offspring~II}:{\rm ~}3\cdot P+n. \end{equation*}$

A mutation operation consists of selecting a random node from the parent tree and substituting it with a newly generated random tree within the terminals and functions available.

**Figure A3.** Crossover operation in genetic programming. (a) and (b) are the two selected models of the population used to produce the models (c) and (d) of the following iteration. The grey areas intersect the black branches, randomly selected during the genetic operation. Then all the leaves attached to these branches are swapped to produce the offspring (c) and (d).
Download figure:
Standard image High-resolution image

As with most data-driven modelling tools, special attention must be paid to avoid overfitting the candidate models to the provided data. In this work, various forms of parsimony have been introduced and they operate at least at three different levels [9]. Already at this stage of building the trees, a method has been implemented which optimizes both fitness and syntax tree size. The shortest individual, the one having fewer nodes in its tree, is selected as the winner when two individuals are equally fit. This technique is particularly effective in controlling code dimension bloat, which is a phenomenon consisting of an excessive syntax tree growth without a corresponding improvement in fitness.

The first step in GP consists of selecting the initial population and therefore, together with the choice of the basis functions and operators, defines the search space, or phase space, which the algorithm will explore. This includes all the CPs, i.e. candidate BUEMS, which can be constructed by composing the initial population in all possible ways. In order to assure that the most appropriate solution is found, it is important to start with a sufficiently large initial population. Of course there is no theoretical criterion for determining the appropriate size of the initial population but a practical approach consists of increasing it until no significant new solutions are identified. At this point one can reasonably assume that the limitations are due to the database available more than to the exploratory technique. In its turn, the FF allows knowing which elements are appropriate to model the system under study. Fitness can be measured in many ways. To derive the results presented in this paper, the AIC criterion has been adopted [10] for the FF. The AIC form used is:

$\begin{equation} {\rm AIC}=2k+n\cdot {\rm ln}\left( {{\rm RMSE}} \right)~. \end{equation} \tag{ A1 }$

In equation (A1), RMSE is the root mean square error, k is the number of nodes used for the model and n the number of y_data provided, so the number of entries in the database (DB). The FF parametrized above allows considering the goodness of the models, thanks to the RMSE, and at the same time their complexity is penalized by the dependence on the number of nodes. It is worth pointing out that the better the model, the lower its AIC. This is the second level of protection against overfitting. Another relevant comment regards the choice of the RMSE to determine the quality of the fit of the various models. This choice is driven by the fact that the RMSE satisfies very general statistical properties. In particular, it can be easily demonstrated that the RMSE is an unbiased estimator that maximizes the likelihood of the results in the case of Gaussian errors. This is the main reason why the RMSE is so widely used in the statistical community [11, 13]; moreover its unbiased character makes the RMSE the best candidate to the present application, meant at verifying what is the most suitable mathematical expression for the scaling expressions. Indeed these classical statistical properties are not satisfied by the alternative indicators used in the literature and based on the logarithmic or relative deviations between the models and the data [15]. The choice of these alternative indicators was motivated by the 'a priori' choice of trying to fit power laws to the data and they are not the most adequate alternative for a more general exploration of the databases, as undertaken with the methodology proposed in this paper. Moreover these alternative indicators, based on the logarithmic or relative deviations, tend to penalize the contributions of large devices, in particular the entries of large τ (due to the saturating trend of the log function). This is not advisable in the case of extrapolations to ITER. Indeed the examples at large τ are already only a minority and if anything their importance should be emphasized and not reduced, as it happens if the quantity to be minimized by the fit is not the RMSE but it is proportional to log τ. In any case, it is also worth mentioning that more advanced but statistically sound indicators for the FF have been developed, in particular to better take into account both the distribution of the residuals and the errors in the measurements. Particularly powerful is the method of the geodesic distance on Gaussian manifolds [22]. On the other hand, it has already been checked that the implementation of such more sophisticated techniques to the ITPA database would not alter in any significant way the conclusions of the present paper.

A.2. Final model selection

However, the best result provided by the algorithm is not necessarily the best solution to the given problem. A more solid output can be found computing the Pareto frontier (PF). The PF is another tool designed to find the best compromise between the complexity of the models and the quality of their fit to the data. The complexity of the models is quantified by the number of nodes they contain; their goodness of fit is given by a suitable quantifier, the Bayesian information criterion (BIC) in our case. The PF is a plot of the BIC versus the complexity of the models. Typically, for each level of complexity, the three best models (according to BIC) are retained. Therefore, the PF is a reduced collection of BUEMS where each element represents the best models (according to the FF) for the subgroup of individuals having the same number of nodes. The PF is then visualized as reported in figure A4, plotting fitness versus complexity. The final solution, i.e. the required model, is finally chosen looking at the saturation part of the PF curve, after which an increase in the complexity of the models does not translates into a significant improvement in their quality (does not translate into a significant reduction in the BIC). This is the third level of protection against overfitting and allows finding a trade-off between the goodness of fit and complexity.

**Figure A4.** The plot shows an example of a Pareto frontier: the best models are plotted inn green (the whole pool of models is in blue). In our case, fitness is quantified by the BIC indicator and complexity by the number of nodes in the model.
Download figure:
Standard image High-resolution image

As mentioned, in order to increase confidence in the selection, these models are classified using a Bayesian criterion. Indeed, while the AIC criterion is used for the FF, the BIC is used for the PF to increase the generalization capability of the model selection. The BIC form used is:

$\begin{equation} {\rm BIC}=n\cdot {\rm ln}\left( {\sigma_{\left( \epsilon \right)}^{2}} \right)+k\cdot {\rm ln}(n), \end{equation} \tag{ A2 }$

where = y_data − y_model are the residuals, $\sigma_{\left( \epsilon \right)}^{2}$ their variance, calculated with respect to the mean of the residuals, and the other symbols are defined in analogy with the AIC expression. Again the better the model, the lower its BIC.

With regard to the use of the two indicators AIC and BIC, it should be noticed that AIC is typically defined in terms of the sum of the square residuals (the application of the square root is immaterial). BIC is traditionally defined in terms of the variance, in which the difference is taken from the average of the residuals. So conceptually the two indicators are different and it is reassuring when, as in our analysis, they both give the same response on the quality of the models. We would also like to point out that, in any case, minor variations in the definition of these criteria are immaterial since the absolute values of the indicators are not important and what matters is the relative ranking of the models (which we have checked is unaffected by differences in the definitions).

The confidence and the stability of the chosen BUEMS are also important aspects of the analysis. To achieve this goal, several runs of SR via GP are launched varying the maximum number of trees and varying the initial population size. The analysis is performed along the lines previously described each time and in this way the chances of finding a good BUEMS are enhanced. To reduce the probability that relevant solutions are missed, a typical approach (the one that we have adopted) consists of increasing the number of trees and enlarging the initial population until the results are consistent and not vary with these parameters. In general, the models discussed in this paper have been selected out of a population of about 200 000 BUEMS.

A.3. Non-linear fitting

Once selected the best model with symbolic regression as described in the previous section, a non-linear fitting technique, discussed briefly in this section, has been applied to improve the behaviour of the BUEMS on the data. Basically SR via GP is used to determine the functional form of the BUEMS and then the exact parameters of the models are determined by the non-linear fitting. Indeed optimized routines for non-linear fitting exist, which guarantee good results and are computationally efficient. They also provide confidence intervals for the parameters of the models. It is worth mentioning that the previous criteria, AIC and BIC, used to quantify the quality of the models, have been computed for the non-linearly fitted models.

The algorithm implemented for the non-linear fits [11, 23, 24, 25] aims at finding the best parameters $\vec{c}$ minimizing the sum of squares of the residuals:

$\begin{eqnarray} &&\fl \mathop{{\rm min}}\limits_{LB \le \vec{c} \le UB}\Big[\sum(\vec{\varepsilon}(\vec{x};\vec{c},\vec{c}_0))^2\Big]\nonumber\\&&=\mathop{{\rm min}}\limits_{LB \le \vec{c} \le UB}\Big[\sum(\vec{y}_{\rm data}(\vec{x})-\vec{y}_{\rm model}(\vec{x};\vec{c},\vec{c}_0))^2\Big]. \end{eqnarray} \tag{ A3 }$

The algorithm uses a trust region method based on an interior reflective Newton method. In equation (1) $\vec{c}_{0}$ stands for the vector of initial conditions provided by the genetic algorithm, the upper bounds (UB) and the lower bounds (LB) define instead the ranges of variation where the requested parameters $\vec{c}$ can be found and are obtained after the first iteration. Those boundaries correspond in fact to the 99% of confidence level of their initial conditions $\vec{c}_{0}$ . The recursive procedure developed prevents the parameters to exceed those initial boundaries; moreover it determines the stability of both constants and exponents in $\vec{c}$ to reach convergence. For this reason, a tolerance of 10⁻⁶ has been set for the difference between the values assumed by the fitted parameters between two successive iterations. Once the difference falls below the previous value, again the Jacobian and the residuals are used to estimate the parameters' confidence levels (95%). Finally, the weights (previously calculated in the first step of the procedure) can be easily applied by multiplying each term in the previous sum by the required weight.

Once fitted, in order to increase the generalization of the analysis and to make easier the comparison between different models, the RSS (sum of squared residuals) is used instead of the RMSE in equation (A1) and at the same time the number of parameters (p) plus one replaces the number of nodes for the complexity (k) both in equation (A1) and in equation (A2), so k is k = p + 1.

The best BUEMS, after the non-linear fitting, are then qualified by an additional estimator, the Kullback–Leibler divergence (KLD), calculated using the kernel density estimation (KDE) [12]. The KDE is the representation of the continuous probability distribution function of the selected model $p( {\vec{y}}_{{\rm model}} \left( {\vec{x}} \right))$ and of the experimental data $q( {\vec{y}} _{{\rm data}} ( {\vec{x}}))$ . Then the aim of the KLD is to quantify the difference between the computed KDEs, in other words to quantify the information lost when $p( {\vec{y}}_{{\rm model}} \left( {\vec{x}} \right))$ is used to approximate $q( {\vec{y}} _{{\rm data}} \left( {\vec{x}} \right))$ [13]. The KLD is defined as:

$\begin{equation} {\rm KLD}\left( {P||Q} \right)=\mathop \smallint\nolimits p\left( x \right)\cdot {\rm ln}\left( {p(x)/q(x)} \right)\,{\rm d}x, \end{equation} \tag{ A4 }$

Where the symbols have been defined as above. The Kullback–Leibler divergence assumes positive values and is zero only when the two pdfs, p and q, are exactly the same. Therefore the smaller the KLD is, the better the BUEM approximates the data, i.e. the less information is lost by representing the data with the model.

A new approach to the formulation and validation of scaling expressions for plasma confinement in tokamaks

Article metrics

Permissions

Author e-mails

Author affiliations

Dates

Abstract

1. The extrapolation of the energy confinement time to future devices

2. The statistical tools for the exploration of the databases, the extraction of the scaling expressions and their selection

3. The ITPA database and weight selection