Learning physical descriptors for materials science by compressed sensing

The availability of big data in materials science offers new routes for analyzing materials properties and functions and achieving scientific understanding. Finding structure in these data that is not directly visible with standard tools, and exploiting the scientific information they contain, requires new and dedicated methodology based on approaches from statistical learning, compressed sensing, and other recent methods from applied mathematics, computer science, statistics, signal processing, and information science. In this paper, we explain and demonstrate a compressed-sensing based methodology for feature selection, specifically for discovering physical descriptors, i.e., physical parameters that describe the material and its properties of interest, and associated equations that explicitly and quantitatively describe those relevant properties. As a showcase application and proof of concept, we describe how to build a physical model for the quantitative prediction of the crystal structure of binary compound semiconductors.


I. INTRODUCTION
Big-data-driven research offers new routes towards scientific insight. This big-data challenge is not only about storing and processing huge amounts of data; it is also, and in particular, an opportunity for new methodology for modeling, describing, and understanding.
Let us explain this by a simple, but most influential, historic example. About 150 years ago, D. Mendeleev and others organized the 56 elements that were known at the time into a table, according to their weight and chemical properties. Many elements had not yet been identified, but from the table it was clear that they should exist, and even their chemical properties were anticipated. The scientific reason behind the table, i.e., the shell structure of the electrons, was unclear and was understood only 50 years later, when quantum mechanics had been developed. Finding structure in the huge space of materials, e.g., one table (or map) for each property or function of interest, is one of the great dreams of (big-)data-driven materials science.
Obviously, the challenge of sorting all materials is much bigger than that of the mentioned example of the Periodic Table of Elements (PTE). To date, the PTE contains 118 elements that have been observed. About 200 000 materials are "known" to exist 1, but only for very few of these "known" materials have the basic properties (elastic constants, plasticity, piezoelectric tensors, thermal and electrical conductivity, etc.) been determined. When considering surfaces, interfaces, nanostructures, organic materials, and hybrids of these systems, the number of possible materials is practically infinite. It is, therefore, highly likely that new materials with superior and up to now simply unknown characteristics exist that could help solve fundamental issues in the fields of energy, mobility, safety, information, and health.
For materials science it is already clear that big data are structured: in terms of materials properties and functions, the space of all possible materials is sparsely populated.
Finding this structure in the big data, e.g., asking for efficient catalysts for CO2 activation or oxidative coupling of methane, good thermal barrier coatings, shape memory alloys as artery stents, or thermoelectric materials for energy harvesting from temperature gradients, just to name a few examples, may be possible even if the actuating physical mechanisms of these properties and functions are not yet understood in detail. Novel big-data analytics tools, e.g., based on machine learning or compressed sensing, promise to do so.
Machine learning of big data has been used extensively in a variety of fields ranging from biophysics and drug design to social media and text mining. It is typically considered a universal approach for learning (fitting) a complex relationship y = f(x). Some, though few, machine-learning based works have been done in materials science, e.g., Refs. 2-17. Most of them use kernel ridge regression or Gaussian processes, where in both cases the fitted/learned property is expressed as a weighted sum over all or selected data points. A clear breakthrough, demonstrating a 'game change', is so far missing, and a key problem, emphasized in Ref. 18, is that good feature selection for descriptor identification is central to achieving a good-quality, predictive, statistically learned equation.
One of the purposes of discovering structure in materials-science data is to predict a property of interest for a given complete class of materials. In order to do this, a model needs to be built that maps some accessible, descriptive input quantities, identifying a material, onto the property of interest. In statistical learning, this set of input parameters is called the descriptor 18. For our materials-science study, this descriptor (elsewhere also termed "fingerprint" 5,12) is a set of physical parameters that uniquely describe the material and its function of interest. It is important that materials that are very different (similar) with respect to the property or properties of interest are characterized by very different (similar) values of the descriptor. Moreover, the determination of the descriptor must not involve calculations as intensive as those needed for the evaluation of the property to be predicted.
In most papers published so far, the descriptor has been introduced ad hoc, i.e., without performing an extensive, systematic analysis and without demonstrating that it is the best (in some sense) within a certain broad class. The risk of such an ad hoc approach is that the learning step is inefficient (i.e., very many data points are needed), the governing physical mechanisms remain hidden, and the reliability of predictions is unclear. Indeed, machine learning is understood as an interpolation, not an extrapolation, approach, and it will do well if enough data are available. However, quantum mechanics and materials science are so multifaceted and intricate that we are convinced we will hardly ever have enough data. Thus, it is crucial to develop these tools as domain-specific methods and to introduce some scientific understanding, while keeping the bias as low as possible.
In this paper, we present a methodology for discovering physically meaningful descriptors and for predicting physically (materials-science) relevant properties. The paper is organized as follows: In section II, we introduce compressed-sensing [19-22] based methods for feature selection; in section III, we introduce efficient algorithms for the construction and screening of feature spaces in order to single out "the best" descriptor. "Feature selection" is a widespread set of techniques used in statistical analysis in different fields 23. In the following sections, we analyze the significance of the descriptors found by the statistical-learning methods, by discussing the interpolation ability of the model based on the found descriptors (section IV), the stability of the models in terms of sensitivity analysis (section V), and the extrapolation capability, i.e., the possibility of predicting new materials (section VI).
As a showcase application and proof of concept, we use the quantitative prediction of the crystal structure of binary compound semiconductors, which are known to crystallize in zincblende (ZB), wurtzite (WZ), or rocksalt (RS) structures. The structures and energies of ZB and WZ are very close, and for the sake of clarity we will not distinguish them here.
For many materials, the energy difference between ZB and RS is larger, though still very small, namely just 0.001% or less of the total energy of a single atom. Thus, high accuracy is required to predict this difference. The property P that we aim to predict is the difference in energy between the RS and ZB structures of a given AB compound, ∆E_AB. The energies are calculated within the Kohn-Sham formulation 24 of density-functional theory 25, with the local-density approximation (LDA) exchange-correlation functional 26,27. Since in this paper we are concerned with the development of a data-analytics approach, it is not relevant that approximations better than LDA exist; only the internal consistency and the realistic complexity of the data are important. The methodology described below applies to these better approximations as well.
Clearly, the sign of ∆E_AB gives the classification (negative for RS and positive for ZB), but the quantitative prediction of the ∆E_AB values gives a better understanding. The separation of RS and ZB structures into distinct classes, on the basis of their chemical formula alone, is difficult and has been a key problem in materials science for over 50 years [28-38]. In the rest of the paper, we will always write the symbol ∆E, without the subscript AB, to denote the energy of the RS structure minus the energy of the ZB structure.

II. COMPRESSED-SENSING BASED METHODS FOR FEATURE SELECTION
Let us assume that we are given data points {d_1, P_1}, ..., {d_N, P_N}, with d_1, ..., d_N ∈ R^M and P_1, ..., P_N ∈ R. We would like to analyze and understand the dependence of the output P (for "property") on the input d (for "descriptor") that reflects the specific measurement or calculation. Specifically, we seek an interpretable equation P = f(d), e.g., an explicit analytical formula, where P depends on a small number of input parameters, i.e., the dimension of the input vector d is small. Below, we impose constraints in order to arrive at a method for finding a low-dimensional linear solution for P = f(d). Later, we introduce nonlinear transformations of the vector d; dealing with such nonlinearities is the main focus of this paper.
When looking for a linear dependence P = f(d) = d·c, the simplest approach to the problem is the method of least squares. This is the solution of

argmin_{c ∈ R^M} ||P − Dc||₂²,    (1)

where P = (P_1, ..., P_N)^T is the column vector of the outputs, D = (d_{j,k}) is the (N×M)-dimensional matrix of the inputs, and the column vector c is to be determined. Three properties of Eq. (1) are worth noting. First, the solution of Eq. (1) is given by the explicit formula c = (D^T D)^{-1} D^T P if D^T D is a non-singular matrix. Second, it is a convex problem; therefore the solution can be found efficiently even in large dimensions (i.e., if M and N are large), and many solvers are available 39. Finally, it does not put any more constraints on c (or f) than that it is linear and minimizes Eq. (1).
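As a minimal illustration of the closed-form solution, the normal equations can be solved directly with NumPy. The data here are synthetic (not the 82-compound set of the later sections), and noise-free, so the planted coefficients are recovered exactly:

```python
import numpy as np

# Synthetic example: N = 8 data points, M = 3 input features.
rng = np.random.default_rng(0)
D = rng.normal(size=(8, 3))
c_true = np.array([1.5, -2.0, 0.5])
P = D @ c_true  # noise-free outputs

# Closed-form least-squares solution c = (D^T D)^{-1} D^T P,
# valid when D^T D is non-singular (here D has full column rank).
c = np.linalg.solve(D.T @ D, D.T @ P)
print(np.allclose(c, c_true))  # True for noise-free data
```

In practice one would use a dedicated solver (e.g. `np.linalg.lstsq`) rather than forming D^T D explicitly, but the closed form makes the first property above concrete.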
A general linear function f of d usually also involves a constant term, i.e., it has the form f(d) = d·c + c₀. It is easy to add this constant term (also called bias) to Eq. (1), in the form

argmin_{c ∈ R^M, c₀ ∈ R} ||P − Dc − c₀1||₂²,    (2)

where 1 denotes the vector of all ones. This actually coincides with Eq. (1) if we first enlarge the matrix D by adding a column full of ones. We will tacitly assume in the following that this modification is included in the implementation and shall not repeat it any more.
If pre-knowledge is available, it can (under some conditions) be incorporated into Eq. (1). For example, we might want to prefer those vectors c which are small (in some sense). This gives rise to regularized problems. In particular, ridge regression with a penalty parameter λ > 0,

argmin_{c ∈ R^M} ||P − Dc||₂² + λ||c||₂²,    (3)

weights the magnitude of the vector c and the error of the fit against each other. The larger λ, the smaller the minimizing vector c. The smaller λ, the better the least-squares fit (the first addend in Eq. (3)). More specifically, if λ → 0, the solutions of Eq. (3) converge to the solution of Eq. (1), and if λ → ∞, the solutions of Eq. (3) tend to zero.
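The behavior of the solution norm as λ varies can be checked with a few lines of scikit-learn; its `alpha` parameter plays the role of the penalty λ in Eq. (3). The data below are synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 5))
P = D @ np.array([3.0, -1.0, 2.0, 0.5, -2.5]) + 0.1 * rng.normal(size=50)

norms = []
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    # fit_intercept=False: the bias column is assumed to be handled
    # separately, as discussed around Eq. (2).
    c = Ridge(alpha=lam, fit_intercept=False).fit(D, P).coef_
    norms.append(np.linalg.norm(c))

# The larger lambda, the smaller the norm of the minimizing vector c.
print(norms[0] > norms[3])
```

The monotone shrinkage of ||c||₂ with growing λ is exactly the λ → ∞ limit described in the text.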
A. Sparsity, NP-hard problems, and convexity of the minimization problem

The kind of pre-knowledge which we discuss next is that we would like f(d) to depend only on a small number of components of d. In the notation of Eq. (1), we would like to achieve that most of the components of the solution c are zero. Therefore, we denote the number of non-zero coordinates of a vector c by

||c||₀ = #{j : c_j ≠ 0}.    (4)

Here, {j : c_j ≠ 0} stands for the set of all coordinates j such that c_j is non-zero, and #{...} denotes the number of elements of the set {...}. Thus, ||c||₀ is the number of non-zero elements of the vector c, and it is often called the ℓ₀ norm of c. We say that c ∈ R^M is k-sparse if ||c||₀ ≤ k, i.e., if at most k of its coordinates are not equal to zero.
When trying to regularize Eq. (1) in such a way that it prefers sparse solutions c, one has to minimize:

argmin_{c ∈ R^M} ||P − Dc||₂² + λ||c||₀.    (5)

This is a rigorous formulation of the problem that we want to solve, but it has a significant drawback: it is computationally infeasible when M is large. This can easily be understood as follows. The naive way of solving Eq. (5) is to look first at all index sets T ⊂ {1, ..., M} with one element and minimize over vectors supported on such sets. Then one looks at all subsets with two, three, and more elements. Unfortunately, their number grows quickly with M and the level of sparsity k. It turns out that this naive approach cannot be improved much: the problem is "NP-hard" 40. In a "non-deterministic polynomial-time hard" problem, a good candidate for the solution can be checked in polynomial time, but the solution itself cannot be found in polynomial time. The basic reason is that Eq. (5) is not convex.
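The naive exhaustive ℓ₀ search just described can be sketched in a few lines; the loop over all binomial(M, k) column subsets makes the combinatorial blow-up explicit. The helper name and data are illustrative only:

```python
import itertools
import numpy as np

def best_k_sparse(D, P, k):
    """Naive exhaustive l0 search: least-squares fit on every size-k
    subset of columns. The cost grows as binomial(M, k), which is why
    the general problem is computationally infeasible for large M."""
    M = D.shape[1]
    best_err, best_T = np.inf, None
    for T in itertools.combinations(range(M), k):
        DT = D[:, T]
        c, *_ = np.linalg.lstsq(DT, P, rcond=None)
        err = np.sum((P - DT @ c) ** 2)
        if err < best_err:
            best_err, best_T = err, T
    return best_T, best_err

rng = np.random.default_rng(2)
D = rng.normal(size=(30, 10))
P = 2.0 * D[:, 3] - 1.0 * D[:, 7]  # planted 2-sparse signal
T, err = best_k_sparse(D, P, 2)
print(T)  # recovers the planted support on noise-free data
```

For M = 10 and k = 2 this is only 45 subsets; for the feature spaces of thousands of columns built later in the paper, the same loop over larger tuples becomes astronomically expensive.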

B. Methods based on the ℓ₁ norm
Reaching a compromise between convexity and promoting sparsity is made possible by the LASSO (Least Absolute Shrinkage and Selection Operator) 41 approach, in which the ℓ₀ regularization of Eq. (5) is replaced by the ℓ₁ norm:

argmin_{c ∈ R^M} ||P − Dc||₂² + λ||c||₁.    (6)

The use of the ℓ₁ norm (||c||₁ = Σ_k |c_k|), also known as the "Manhattan" or "taxicab" norm, is crucial here. On the one hand, the optimization problem is convex 42. On the other hand, the geometry of the ℓ₁ unit ball {x ∈ R^M : ||x||₁ ≤ 1} shows that it promotes sparsity 21,22.
Similarly to Eq. (3), the larger λ > 0, the smaller the ℓ₁ norm of the solution of Eq. (6), and vice versa. Actually, there exists a smallest λ₀ > 0 such that the solution of Eq. (6) is zero for all λ ≥ λ₀. If λ falls below this threshold, one or more components of c become non-zero.
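This threshold behavior is easy to verify numerically. In scikit-learn's parametrization of the LASSO objective (which divides the fit term by 2N), the threshold is max|D^T P|/N; the sketch below, on synthetic data, checks that the solution is identically zero just above this value and non-zero just below it:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
D = rng.normal(size=(40, 6))
D = (D - D.mean(axis=0)) / D.std(axis=0)  # standardized columns
P = D @ np.array([1.0, 0.0, -0.5, 0.0, 0.0, 2.0])
P = P - P.mean()

# sklearn minimizes (1/2N)||P - Dc||_2^2 + alpha*||c||_1; the solution
# is identically zero if and only if alpha >= max|D^T P| / N.
alpha_max = np.max(np.abs(D.T @ P)) / len(P)

c_above = Lasso(alpha=1.01 * alpha_max, fit_intercept=False).fit(D, P).coef_
c_below = Lasso(alpha=0.50 * alpha_max, fit_intercept=False).fit(D, P).coef_
print(np.count_nonzero(c_above), np.count_nonzero(c_below))
```

The value of the threshold depends on the chosen normalization of Eq. (6), but its existence does not.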

C. Compressed sensing
The minimization problem in Eq. (6) is an approximation of the problem in Eq. (5), and their respective results do not necessarily coincide. The relation between these two minimization problems has been studied intensively [43-45], and we summarize the results, developed in the context of the recent theory of compressed sensing (CS) 19,21,22. These will be useful for justifying the implementation of our method (see below). The literature on CS is extensive, and we believe that the following notes are a useful compendium.
We are especially interested in conditions on P and D which guarantee that the solutions of Eqs. (5) and (6) coincide, or at least do not differ much. We concentrate on a simplified setting, namely attempting to find c ∈ R^N fulfilling Dc = P with the fewest non-zero entries. This is written as

argmin_c ||c||₀ subject to Dc = P.    (7)

As mentioned above, this minimization is computationally infeasible, and we are interested in finding the same solution by its convex reformulation:

argmin_c ||c||₁ subject to Dc = P.    (8)

The answer can be characterized by the notion of the Null Space Property (NSP) 46. Let D ∈ R^{M×N} and let Ω ∈ {1, ..., N}. Then D is said to have the NSP of order Ω if

||v_T||₁ < ||v_{T^c}||₁    (9)

for every non-zero vector v in the null space of D and every index set T ⊂ {1, ..., N} with at most Ω elements; here v_T denotes the restriction of v to the coordinates in T, and T^c is the complement of T. It is shown 46 that every Ω-sparse vector x is the unique solution of Eq. (8) with P = Dx if, and only if, D has the NSP of order Ω. As a consequence, the minimization of the ℓ₁ norm as given by Eq. (8) recovers the unknown Ω-sparse vectors c ∈ R^N from D and P = Dc only if D satisfies the NSP of order Ω. Unfortunately, when given a specific matrix D, it is not easy to check whether it indeed has the NSP.
However, the CS analysis gives us some guidance for the admissible dimension M of D. It is known 22,46 that there exists a constant C > 0 such that whenever

M ≥ C Ω ln(N/Ω),    (10)

there exist matrices D ∈ R^{M×N} with the NSP of order Ω. Actually, a random matrix with these dimensions satisfies the NSP of order Ω with high probability. On the other hand, this bound is tight, i.e., if M falls below this bound, no stable and robust recovery of c from D and P is possible, and this is true for every matrix D ∈ R^{M×N}. Later on, M will correspond to the number of AB compounds considered and will be fixed to M = 82. Furthermore, N will be the number of input features, and Eq. (10) tells us that we can expect reasonable performance of ℓ₁ minimization only if their number is at most of the order of e^{M/(CΩ)}.
Furthermore, the CS analysis sets forth that the NSP is surely violated already for Ω = 2 if any two columns of D are multiples of each other. In general, one can expect the solution of Eq. (8) to differ significantly from the solution of Eq. (7) if any two columns of D are highly correlated or, more generally, if any column of D lies close to the span of a small number of the other columns.
Historically, LASSO appeared first in 1996 as a machine-learning (ML) technique, able to provide stable solutions to underdetermined linear systems thanks to the ℓ₁ regularization.
Ten years later, the CS theory appeared, sharing with LASSO the concept of ℓ₁ regularization, but with the emphasis on the reconstruction of a possibly noisy signal c from a minimal number of observations. CS can be considered the theory of LASSO, in the sense that it gives conditions (on the matrix D) under which it is reasonable to expect that the ℓ₁ and ℓ₀ regularizations coincide. If the data and a, typically overcomplete, basis set are known, then LASSO is applied as an ML technique, where the task is to identify the smallest number of basis vectors that yield a model of a given accuracy. If the setting is such that a minimal set of measurements should be performed to reconstruct a signal, as in quantum-state tomography 47, then CS tells how the data should be acquired. As will become clear in the following, our situation is somewhat intermediate, in the sense that we have the data (here, the energies of the 82 materials), but we have a certain freedom to select the basis set for the model. Therefore, we use CS to justify the construction of the matrix D, and we adopt LASSO as a means to find the low-dimensional basis set.
D. A simple LASSO example: the energy differences of crystal structures

In this and the following sections, we walk through the application of LASSO to a specific materials-science problem, in order to introduce step by step a systematic approach for finding simple analytical formulas, built from simple input parameters (the primary features), for approximating physical properties of interest. We aim at predicting ∆E_AB, the difference in DFT-LDA energy between the ZB and RS structures, for a set of 82 octet binary semiconductors. Calculations were performed using the all-electron full-potential code FHI-aims 48 with highly accurate basis sets, k-meshes, and integration grids, and scaled ZORA 49 for the relativistic treatment.
The order of the two atoms in the chemical formula AB is chosen such that element A is the cation, i.e., it has the smallest Mulliken electronegativity, EN = −(IP + EA)/2. Here, IP and EA are the ionization potential and electron affinity of the free, isolated, spinless, and spherically symmetric atom. As noted, the calculation of the descriptor must involve less intensive calculations than those needed for the evaluation of the property to be predicted.
Therefore, we consider only properties of the free atoms A and B that build the binary material, and properties of the gas-phase dimers. In practice, we identified the following primary features, exemplified for atom A: the ionization potential IP(A); the electron affinity EA(A) 50; H(A) and L(A), i.e., the energies of the highest-occupied and lowest-unoccupied Kohn-Sham (KS) levels; and r_s(A), r_p(A), and r_d(A), i.e., the radii at which the radial probability densities of the valence s, p, and d orbitals are maximal. The same features were used for atom B. These primary features were chosen because it has long been well known that the relative ionization potentials, the atomic radii, and sp-hybridization govern the bonding in these materials. Consequently, some authors [51-53] already recognized that certain combinations of r_s and r_p, called r_σ and r_π (see below), may be crucial for constructing a descriptor that predicts the RS/ZB classification. Note that in those works just the sign of ∆E was predicted, while the model we present here aims at a quantitative prediction of ∆E. In contrast to previous work, we analyze how LASSO performs in this task and how much better the corresponding description is. We should also note that the selected set of atomic primary features contains redundant physical information: H(A) and L(A) contain similar information to IP(A) and EA(A). In particular, H(A) − L(A) is correlated, on physical grounds, with IP(A) − EA(A). However, since the values of these two differences are not the same 50, all four features were included. As will be shown below, and in particular in Appendix A, the two pairs of features are not interchangeable.
We start with a simplified example, in order to show in an easily reproducible way the performance of LASSO in solving our problem. To this end, we describe each compound AB by the vector of the following six quantities:

d_AB = (r_s(A), r_p(A), r_d(A), r_s(B), r_p(B), r_d(B)).    (11)

Collecting these data gives an 82×6 matrix D, one row for each compound. We standardize D to have zero mean and unit variance, i.e., we subtract from each column its mean and divide by its standard deviation. Standardization of the data is common practice 42 and is aimed at controlling the numerical stability of the solution. However, when non-linear combinations of the features and cross-validation are involved, the standardization has to be performed carefully (see below). The energy differences are stored in a vector P ∈ R^82.
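The standardization step can be sketched in a few lines; the data below are random placeholders with the same shape as the 82×6 matrix D of the text:

```python
import numpy as np

def standardize(D):
    """Subtract each column's mean and divide by its standard deviation,
    so that every feature has zero mean and unit variance."""
    return (D - D.mean(axis=0)) / D.std(axis=0)

rng = np.random.default_rng(4)
D = rng.normal(loc=5.0, scale=3.0, size=(82, 6))  # 82 compounds, 6 features
Ds = standardize(D)
print(np.allclose(Ds.mean(axis=0), 0.0), np.allclose(Ds.std(axis=0), 1.0))
```

Note that when cross-validation is used, the column means and standard deviations must be computed on the training split only and then applied to the test split, which is the caveat alluded to in the text.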
We are aiming at two goals:
• On the one hand, we would like to minimize the mean squared error (MSE), a measure of the quality of the approximation, given by

MSE = (1/N) ||P − Dc||₂².    (12)

In this tutorial example, the size N of the dataset is 82 and, with the above choice of Eq. (11), M = 6.
• On the other hand, we prefer sparse solutions c† with small Ω, as we would like to explain the dependence of the energy difference P by a low-dimensional (small Ω) descriptor.

These two tasks are obviously closely connected, and they work against each other. The larger the coefficient λ in Eq. (6), the sparser the solution c will be; the smaller λ, the smaller the mean squared error will be. The choice of λ is at our disposal and lets us weight the sparsity of c against the size of the MSE.
In the following, we will rather report the root mean square error (RMSE) as the quality measure of a model. The reason is that the RMSE has the same units as the predicted quantity and is therefore easier to understand in absolute terms. Specifically, when a sparse solution c† of Eq. (6) with Ω non-zero components is mentioned, we report the RMSE

RMSE = sqrt( (1/82) ||P − D†c*||₂² ),    (13)

where D† = (d†_{j,k}) contains the columns corresponding to the non-zero components of c†, and c* = (c*_k) is the solution of the least-squares fit

c* = argmin_c ||P − D†c||₂².    (14)

Let us first state that the MSE obtained when setting c = 0 is 0.1979 eV², which corresponds to an RMSE of 0.4449 eV. This means that if one "predicts" ∆E = 0 for all the materials, the RMSE of this "model" is 0.4449 eV. This number acts as a baseline for judging the quality of new models.
In order to run this and all the numerical tests in this paper, we have written Python scripts and used the linear-algebra solvers, including the LASSO solver, from the scikit-learn (sklearn) library (see Appendix D for the actual script used for this section). We calculate the coefficients for a decreasing sequence of one hundred λ values, starting from the largest value λ_1, at which the first component of c becomes non-zero. The sequence of λ values is constructed on a log scale, and the smallest value is set to λ_100 = 0.001 λ_1. In our case, λ_1 = 0.3169, and the first non-zero component of c is the second one, corresponding to r_p(A). At λ_19 = 0.0903, the second non-zero component of c appears, corresponding to r_d(B) (see Eq. (11)). For decreasing λ, more and more components of c become different from zero until, from the column corresponding to λ_67 = 0.0032 down to λ_100 = 0.0003, all entries are occupied, i.e., no sparsity is left. The method clearly suggests r_p(A) to be the most useful feature for predicting the energy difference. Indeed, by enumerating all six linear (least-squares) models constructed with only one of the components of the vector d at a time, the smallest RMSE is given by the model based on r_p(A). We conclude that LASSO really finds the coordinate that best describes (in a linear way) the dependence of P. Let us now proceed to the best pair. The second non-zero component to appear is r_d(B).
The RMSE for the pair (r_p(A), r_d(B)) is 0.2927 eV. An exhaustive search over all fifteen pairs of elements of d_AB reveals that the pair (r_p(A), r_d(B)) indeed yields the lowest (least-squares) RMSE among all possible pairs in d_AB.
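The λ sweep described above can be reproduced in outline with scikit-learn's `lasso_path` routine. This is only a sketch on synthetic data, not the Appendix D script, so the λ values and the ordering of the features differ from those quoted in the text:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
D = rng.normal(size=(82, 6))
D = (D - D.mean(axis=0)) / D.std(axis=0)  # standardized 82x6 matrix
# Synthetic target dominated by feature 1, with a weaker feature 5.
P = 1.2 * D[:, 1] + 0.4 * D[:, 5] + 0.05 * rng.normal(size=82)
P = P - P.mean()

# 100 lambda values on a log scale spanning three decades, as in the text
# (eps is the ratio of the smallest to the largest lambda).
alphas, coefs, _ = lasso_path(D, P, n_alphas=100, eps=1e-3)

# For each feature, find the first lambda index (lambda decreasing) at
# which its coefficient becomes non-zero; sort features by entry order.
first = [np.flatnonzero(coefs[j]) for j in range(6)]
entry = sorted(range(6),
               key=lambda j: first[j][0] if first[j].size else np.inf)
print(entry[0])  # the feature entering first is the strongest single predictor
```

The feature that enters the path first corresponds to the column with the largest absolute correlation with P, which mirrors the role played by r_p(A) in the text.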
The outcome for the best triplet reveals the weak point of the method. The third non-zero component of c is r_p(B), and the RMSE of the triplet (r_p(A), r_p(B), r_d(B)) is 0.2897 eV, while an exhaustive search over all triplets suggests (r_s(B), r_p(A), r_p(B)) to be optimal, with an RMSE of 0.2834 eV, i.e., some 2% better. The reason is that Eq. (6) is only a convex proxy of the actual problem in Eq. (5). Their solutions have similar performance in terms of RMSE, but they do not have to coincide. Now let us compare the norm ||c||₁ of both triplets, for c* obtained by least-squares regression with standardized columns.
For the optimal triplet (r_s(B), r_p(A), r_p(B)), ||c||₁ = 3.006, and for (r_p(A), r_p(B), r_d(B)), ||c||₁ = 0.5843. We observe that the first one needs large coefficients for a small ||P − Dc||₂², while the second one provides a better compromise between a small ||P − Dc||₂² and a small λ||c||₁ in Eq. (6). The reason for the large coefficients of the former is that the 82-dimensional vectors whose components are the values of r_s(B) and r_p(B) for all the materials (in the same order) are almost parallel (their Pearson correlation coefficient is 0.996) 54. In order to understand this, let us have a look at both least-squares models. In the optimal ℓ₀ model, the highly correlated vectors r_s(B) and r_p(B) appear approximately as the difference r_s(B) − r_p(B): their coefficients in the linear expansion have almost the same magnitude, ∼1.3, and opposite signs. The difference of these two almost collinear vectors of the same magnitude (due to the applied standardization) is a vector much shorter than both r_s(B) and r_p(B), but also much shorter than r_p(A) and P; therefore r_s(B) − r_p(B) needs to be multiplied by a relatively large coefficient in order to be comparable with the other vectors. Indeed, while the coefficient of r_s(B) − r_p(B) is about 1.3, all the other coefficients, in particular in the ℓ₁ solution, are much smaller. Such a large coefficient is penalized by the λ||c||₁ term, and therefore a sub-optimal (according to ℓ₀-regularized minimization) triplet is selected. This example shows how, with highly correlated features, λ||c||₁ may not be a good approximation of λ||c||₀.
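The mechanism just described, namely the ℓ₁ penalty suppressing the large compensating coefficients that nearly collinear features require, can be reproduced on synthetic data (the correlation level and coefficients below are chosen to imitate the r_s(B)/r_p(B) situation, not taken from the paper's dataset):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
base = rng.normal(size=82)
x1 = base + 0.05 * rng.normal(size=82)  # two almost collinear features,
x2 = base + 0.05 * rng.normal(size=82)  # Pearson correlation close to 1
x3 = rng.normal(size=82)
D = np.column_stack([x1, x2, x3])
D = (D - D.mean(axis=0)) / D.std(axis=0)
# The target depends on the small difference of the collinear pair, which
# forces large, opposite-sign coefficients in the exact fit.
P = 1.3 * (D[:, 0] - D[:, 1]) + 0.3 * D[:, 2]

c_ls, *_ = np.linalg.lstsq(D, P, rcond=None)  # exact, large coefficients
c_l1 = Lasso(alpha=0.05, fit_intercept=False).fit(D, P).coef_

# The l1 penalty forbids the large compensating coefficients, so LASSO
# settles for a smaller-norm (and hence sub-optimal-fit) solution.
print(np.linalg.norm(c_l1, 1) < np.linalg.norm(c_ls, 1))
```

The least-squares fit recovers coefficients of magnitude ∼1.3 with opposite signs on the collinear pair, while the ℓ₁-penalized solution has a much smaller ||c||₁, at the price of a worse fit.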
Repeating the LASSO procedure for a matrix D consisting once only of the elements of the first triplet, and once of the second triplet, shows that only for λ ≤ 0.00079 does the first triplet have a smaller LASSO error, since only then is the contribution of the large coefficients to the LASSO error sufficiently small. When D contains all features, at this λ all coefficients are already non-zero.
For the sake of completeness, let us mention that the RMSE for the pair of descriptors defined by John and Bloch 31 as

r_σ = |r_p(A) + r_s(A) − r_p(B) − r_s(B)|,   r_π = |r_p(A) − r_s(A)| + |r_p(B) − r_s(B)|,    (15)

is 0.3055 eV. For predicting the energy difference in a linear way, the pair of descriptors (r_p(A), r_d(B)) is already slightly better (by 4%) than this.

E. A more complex LASSO example: non-linear mapping
Until this point, we have treated only linear models, i.e., models in which the function f is a linear combination of the components of the input vector d. This is a clear limitation.
In this section, we describe how one can easily introduce non-linearities without losing the simplicity of the linear-algebra solution. For this purpose, the vector d is mapped by a (in general non-linear) function Φ : R^{M1} → R^{M2} into a higher-dimensional space, and only then are the linear methods applied in R^{M2}. This idea is well known from ℓ₂-regularized kernel methods 55, where the so-called kernel trick is exploited. In our case, we aim at explicitly defining and evaluating a higher-dimensional mapping of the initial features, where each new dimension is a non-linear function of one or more initial features.
We stay with the case of 82 compounds. To keep the presentation simple, we leave out the inputs r_d(A) and r_d(B). Hence, every compound is first described by the following four primary features:

d_AB = (r_s(A), r_p(A), r_s(B), r_p(B)).    (16)

We will construct a simple, but non-trivial, non-linear mapping Φ : R^4 → R^{M*} (with M* to be defined) and apply LASSO afterwards. The construction of Φ involves some of our pre-knowledge; for example, on dimensional grounds, we expect that expressions like r_s(A)² − r_p(A) have no physical meaning and should be avoided. Therefore, in practice, we allow for sums and differences only of quantities with the same units. We hence consider the 46 features listed in Table I: the 4 primary plus 42 derived features, building Φ(d_AB), represented by the matrix D.

[Table II caption: Coefficients along the LASSO path; vanishing coefficients are marked by a '−'. The value of ||c||₀ counts the non-zero coefficients at each reported λ value. In a rather non-intuitive fashion, the number of non-zero coefficients fluctuates with decreasing λ, rather than increasing monotonically; this is an effect of linear correlations in the feature space.]
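The unit-aware construction of derived features can be sketched as follows. The combination rules below only approximate those of Table I (the actual table is not reproduced here), and the feature names are illustrative:

```python
import itertools
import numpy as np

def build_features(primary):
    """Sketch of a unit-aware feature construction: primary features of
    the same unit (all radii here) are combined by sums and absolute
    differences, and single features are squared. This avoids unit-mixing
    expressions like rs(A)^2 - rp(A), as required in the text."""
    names = list(primary)
    feats = dict(primary)
    for a, b in itertools.combinations(names, 2):
        feats[f"{a}+{b}"] = primary[a] + primary[b]
        feats[f"|{a}-{b}|"] = np.abs(primary[a] - primary[b])
    for a in names:
        feats[f"{a}^2"] = primary[a] ** 2  # units of length^2
    return feats

rng = np.random.default_rng(7)
primary = {k: rng.uniform(0.5, 2.5, size=82)
           for k in ["rs(A)", "rp(A)", "rs(B)", "rp(B)"]}
feats = build_features(primary)
print(len(feats))  # 4 primary + 6 sums + 6 differences + 4 squares = 20
```

Extending the rule set (ratios, further powers, combinations of combinations) quickly grows this toward, and beyond, the 46 features of Table I; the systematic construction is formalized in the next section.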
Applying LASSO gives the result shown in Table II, where we list the coordinates as they appear when λ decreases. At λ = 0.028, where we have truncated the list, the five features with non-zero coefficients are those given in Eq. (17). Let us remark that features 2 and 3, and features 3 and 5, are strongly correlated, with covariance greater than 0.8 for both pairs 56. This makes it difficult for LASSO to identify the right features.
With these five descriptor candidates, we ran an exhaustive ℓ₀ test over all 5·4/2 = 10 pairs. We discovered that the pair of the second and third selected features, i.e., the pair of Eq. (18), achieves the smallest RMSE of 0.1520 eV, improving on John and Bloch's descriptors 31 by a factor of 2. We also ran an exhaustive ℓ₀ search for the optimal pair over all 8 features that were singled out in the LASSO sweep, i.e., including also r_p(A), r_π, and r_s(A). The best pair was still the one in Eq. (18). We then also performed a full search over all pairs of the 46 features, where the pair in Eq. (18) still turned out to be the one yielding the lowest RMSE.
We conclude that even though LASSO is not able to directly find the optimal Ω-dimensional descriptor, it can efficiently be used to filter the feature space and single out the "most important" features, whereas the optimal descriptor is then identified by enumeration over the subset identified by LASSO.
The numerical test discussed in this section shows the following:
• A promising strategy is to build an even larger feature space by combining the primary features via analytic formulas. In the next section, we walk through this strategy for systematically constructing a large feature space.
• LASSO cannot always find the best Ω-dimensional descriptor as the first Ω columns of D acquiring non-zero coefficients upon decreasing λ. This is understood as being caused by features that are (nearly) linearly correlated. However, an efficient strategy emerges: first, use LASSO to extract a number Θ > Ω of "relevant" components, namely the first Θ columns of D with non-zero coefficients found when decreasing λ; then, perform an exhaustive search over all Ω-tuples that are subsets of the Θ extracted columns. The latter is in practice the problem formulated in Eqs. (4) and (5). In the next section, we formalize this strategy, which we henceforth call LASSO+ℓ₀, because it combines LASSO (ℓ₁) and ℓ₀ optimization.
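The two-step strategy just outlined can be condensed into a short function. This is a sketch on synthetic data, with illustrative choices of Θ and Ω, and is not the implementation used in the paper:

```python
import itertools
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_plus_l0(D, P, theta, omega):
    """Sketch of LASSO+l0: (1) sweep lambda downwards and keep the first
    `theta` features to acquire non-zero coefficients; (2) perform an
    exhaustive least-squares search over all `omega`-tuples drawn from
    them, i.e., the problem of Eqs. (4) and (5) on the reduced set."""
    alphas, coefs, _ = lasso_path(D, P, n_alphas=200, eps=1e-4)
    order = []
    for i in range(coefs.shape[1]):               # lambda is decreasing
        for j in np.flatnonzero(coefs[:, i]):
            if j not in order:
                order.append(j)
        if len(order) >= theta:
            break
    candidates = order[:theta]
    best = (np.inf, None)
    for T in itertools.combinations(candidates, omega):
        DT = D[:, T]
        c, *_ = np.linalg.lstsq(DT, P, rcond=None)
        rmse = np.sqrt(np.mean((P - DT @ c) ** 2))
        best = min(best, (rmse, T))
    return best  # (rmse, selected omega-tuple)

rng = np.random.default_rng(8)
D = rng.normal(size=(82, 40))
D = (D - D.mean(axis=0)) / D.std(axis=0)
P = 1.0 * D[:, 4] - 0.8 * D[:, 17] + 0.05 * rng.normal(size=82)
P = P - P.mean()
rmse, T = lasso_plus_l0(D, P, theta=8, omega=2)
print(sorted(T))
```

The exhaustive step costs only binomial(Θ, Ω) least-squares fits, which remains tractable even when the full feature space has thousands of columns.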
The feature-space construction and LASSO+ℓ0 strategies presented in the next section are essentially those employed in our previous paper18. The purpose of this extended presentation is to describe a general approach for the solution of diverse problems, where the only requirement is that the set of basic ingredients (the primary features) is known.
As a concluding remark for this section, we note that compressed sensing and LASSO were successfully demonstrated to help solve quantum-mechanics and materials-science problems in Refs. 57-61. In all those papers, an ℓ1-based optimization was adopted to select, from a well-defined set of functions (in some sense the "natural basis set" for the specific problem), a minimal subset of "modes" that maximally contribute to an accurate approximation of the property under consideration. In our case, the application of the ℓ1 (and subsequent ℓ0) optimization must be preceded by the construction of a basis set, or feature space, for which a construction strategy is not at all a priori evident.
In the numerical tests discussed up to this point, we have always looked for the low-dimensional model that minimizes the squared error (the square of the ℓ2 norm of the fitting residual). Another quantity of physical relevance that one may want to minimize is the maximum absolute error (MaxAE) of the fit. This corresponds to the infinity norm, defined for a vector e as ‖e‖∞ = max_i |e_i|. The minimization problem of Eq. (6) then becomes argmin_c {‖P − Dc‖∞ + λ‖c‖1}, which is still a convex problem. Starting from the feature space of size 46 defined in Table I, we have looked for the model that gives the lowest MaxAE. Table III also reports the performance of the models in terms of RMSE and MaxAE.
For the specific example presented here, we find (see Table III) that the 2D model is the same for both settings, i.e., the model that minimizes the RMSE also minimizes the MaxAE. This is, of course, not necessarily true in general. In fact, the 1D model that minimizes the RMSE differs from the 1D model that minimizes the MaxAE.
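The difference between the two objectives can be seen on a toy one-dimensional fit. This is a sketch of ours, not from the paper: the least-squares slope minimizes the RMSE, while a minimax slope, found here by ternary search on the convex objective, minimizes the MaxAE:

```python
import numpy as np

def rmse(r):
    return float(np.sqrt(np.mean(r ** 2)))

def maxae(r):
    return float(np.max(np.abs(r)))

def minimax_slope(d, P, lo=-10.0, hi=10.0, n_iter=200):
    """Minimize max_i |P_i - a*d_i| over the slope a; the objective is
    convex in a, so ternary search converges to the minimizer."""
    for _ in range(n_iter):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if maxae(P - m1 * d) < maxae(P - m2 * d):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
d = rng.uniform(1.0, 2.0, 30)
P = 1.7 * d + rng.normal(0.0, 0.05, 30)
P[0] += 0.8                                   # a single outlier

a_l2 = float(d @ P / (d @ d))                 # least-squares (l2-optimal) slope
a_linf = minimax_slope(d, P)                  # minimax (l-infinity-optimal) slope
```

With an outlier in the data, the two slopes generally differ: the ℓ2 fit trades a larger worst-case error for a smaller average error, and vice versa.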

III. GENERATION OF A FEATURE SPACE
For the systematic construction of the feature space, we first divide the primary features into groups according to their physical meaning. In particular, a necessary condition is that the elements of each group are expressed in the same units. We start from atomic features, see Table IV. Next, we define the combination rules. We note here that building algebraic functions over a set of input variables (in our case, the primary features) by using a defined dictionary of algebraic (unary and binary) operators, and finding the optimal function with respect to a given cost functional, is the strategy of symbolic regression62. In that field of statistical learning, the optimal algebraic function is searched for via an evolutionary algorithm, in which the analytic expression is evolved by replacing parts of the test functions with more complex functions. In other words, in symbolic regression the evolutionary algorithm guides the construction of the algebraic functions of the primary features. In our case, we borrow from symbolic regression the idea of constructing functions by combining 'building blocks' in more and more complex ways, but the selection of the optimal function (in our language, the descriptor) is performed by the LASSO+ℓ0 algorithm after the whole set of candidate functions has been generated.
Our goal is to create "grammatically correct" combinations of the primary features. This means that, besides applying the usual syntactic rules of algebra, we add a physically motivated constraint: we exclude linear combinations of inhomogeneous quantities, such as "IP + r_s" or "r_s + r_p^2". In practice, quantities are inhomogeneous when they are expressed in different units. Apart from this exclusion of physically unreasonable combinations, we produce as many combinations as possible. However, compressed-sensing theory poses a limit on the maximum size M of the feature space from which the best (low-)Ω-dimensional descriptor can be extracted by sampling the feature space with the knowledge of N data points: N = CΩ ln(M)19,63,64, when the M candidate features are uncorrelated. C is not a universal constant, but it is typically estimated to be in the range between 4 and 8 (see Ref. 22). For Ω = 2 and N = 82, this implies a range of M between ∼200 and ∼30000. Therefore, we regarded values of a few thousand as an upper limit for M.
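A minimal sketch of such a unit-aware ("grammatically correct") combination step might look as follows. The values and unit tags are illustrative assumptions of ours, not data from the paper; only homogeneous quantities are allowed in sums and differences:

```python
import itertools

# Primary features with illustrative values and unit tags (our assumptions)
primary = {"IP(B)": (8.3, "eV"), "EA(B)": (-0.7, "eV"),
           "r_s(A)": (1.1, "Angstrom"), "r_p(A)": (1.8, "Angstrom")}

features = dict(primary)

# Binary stage: absolute differences only between homogeneous quantities
for (na, (va, ua)), (nb, (vb, ub)) in itertools.combinations(primary.items(), 2):
    if ua == ub:                                  # dimensional-homogeneity check
        features["|%s-%s|" % (na, nb)] = (abs(va - vb), ua)

# Unary stage: squaring is always allowed, but it changes the unit tag
for name, (v, u) in list(features.items()):
    features["(%s)^2" % name] = (v * v, u + "^2")
```

Combinations such as "IP(B) + r_s(A)" are never generated, because the unit tags differ; iterating these stages produces more and more complex candidate features.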
Since the number of conceivable features is certainly larger than a few thousand, we proceeded iteratively in several steps, learning from each step what to include in, and what to exclude from, the candidate-feature list of the next step. In the following, we describe how a set of ∼4500 features was created. In Appendixes A and B, we summarize how different feature spaces can be constructed, starting from different assumptions.
First of all, we form sums and absolute differences of homogeneous quantities and apply some unary operators (powers, exponential), see Table V. One may think of screening the resulting features for pairwise linear correlations, but such a screening would require defining a threshold for the absolute value of the covariance, in order to decide whether or not any two features are correlated, and then possibly discarding one of the two. A similar problem would appear if more refined techniques, like singular-value decomposition, were tried in order to discard eigenvectors with low eigenvalues: a threshold would still have to be defined and thus tuned.
We adopted instead a simple yet effective solution: the best Θ = 30 features with non-zero coefficients that emerge from the application of LASSO at decreasing λ are grouped, and among them an exhaustive ℓ0 minimization is performed (Eqs. (4) and (5) for ‖c‖0 = 1, 2, 3, . . .). The single features, pairs, triplets, etc. that score the lowest RMSE are the outcome of the procedure as 1D, 2D, 3D, etc. descriptors. The validity of this approach was tested by checking that running it on smaller feature spaces (M ∼ a few hundred), where the direct search among all pairs and all triplets could be carried out, gave the same result.
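The exhaustive ℓ0 step over the prescreened features can be sketched as follows (our own minimal illustration; the function name is ours and the fit includes an intercept):

```python
import itertools
import numpy as np

def best_omega_subset(D, P, candidates, omega):
    """Among all omega-tuples of the prescreened candidate columns,
    return (rmse, indices) of the least-squares fit with lowest RMSE."""
    best_rmse, best_idx = np.inf, None
    ones = np.ones((len(P), 1))
    for idx in itertools.combinations(candidates, omega):
        A = np.hstack([D[:, list(idx)], ones])          # features + intercept
        coef, *_ = np.linalg.lstsq(A, P, rcond=None)
        r = P - A @ coef
        rm = float(np.sqrt(np.mean(r ** 2)))
        if rm < best_rmse:
            best_rmse, best_idx = rm, idx
    return best_rmse, best_idx
```

The combinatorics stays cheap after prescreening: for Θ = 30 and Ω = 2 this is only 30·29/2 = 435 least-squares fits.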
Our procedure, applied to the above-defined set of features (Tables IV-VI), found the best 1D, 2D, and 3D descriptors, as well as the coefficients of the corresponding equations for predicting the property ∆E, as shown in Eqs. (20)-(22) (energies are in eV and radii in Å). We removed the absolute value from "IP(B) − EA(B)", as this difference is always negative.
In Fig. 1, we show a structure map obtained by plotting the 82 materials, using the two components of the 2D descriptor (Eq. (21)) as coordinates. We note that the descriptor we found contains physically meaningful quantities, like the band gap of B in the numerator of the first component and the size mismatch between valence s- and p-orbitals (numerators of the second and third components). In closing this section, we note that the algorithm described above can be run interactively in a web-based graphical application at: https://analytics-toolkit.nomad-coe.eu/tutorial-LASSOL0.

IV. CROSS VALIDATION, SENSITIVITY ANALYSIS, AND EXTRAPOLATION
In this section, we discuss in detail a series of analyses performed on our algorithm. We start with the adopted cross-validation (CV) scheme. Then, we investigate how the cross-validation error varies with the size of the feature space (more precisely, its "complexity", as defined below). We proceed by discussing the stability of the choice of the descriptor under sensitivity analysis. Finally, we test the extrapolation capabilities of the model.
In the numerical tests described above, we always used all the data for training the models; the RMSEs given as figures of merit were therefore fitting errors. However, in order to assess the predictive ability of a model, it is necessary to test it on data that have not been used for the training; otherwise, one can incur so-called overfitting42. Overfitting is in general signaled by a noticeable discrepancy between the fitting error (the RMSE over the training data) and the test error (the RMSE over control data, i.e., data that were not used during the fitting procedure). If a large amount of data is available, one can simply partition the data into a training and a test set, fully isolated from each other. In our case, which is not unusual, the amount of data is too small for such a partitioning; therefore, we adopted a cross-validation strategy.
In general, the data set is still partitioned into training and test data, but this procedure is repeated several times, choosing different test data, in order to achieve good statistics. In our case, we adopted a leave-10%-out cross-validation scheme, where the data set is divided randomly into ∼90% training data (75 data points, in our case) and ∼10% test data. The model is trained on the training data and the RMSE is evaluated on the test set. By "training", we mean the whole LASSO+ℓ0 procedure that selects the descriptor and determines the model (the coefficients of the linear equation) as in Eqs. (20)-(22).
Another figure of merit that we monitored is the maximum absolute error over the test set. The random selection, training, and error evaluation were repeated until the average RMSE and maximum absolute error did not change significantly. In practice, we typically performed 150 iterations, although the quantities had actually converged well before that.
We note that, at each iteration, the standardization is applied by calculating the average and standard deviation only over the data points in the training set. In this way, no information from the test set is used in the training; if, instead, the standardization were computed once and for all over all the available data, some information on the test set would enter the training. In fact, it can be shown42 that such an approach can lead to spuriously optimistic performance estimates.
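The leave-10%-out scheme with training-fold-only standardization can be sketched as follows. This is a minimal numpy illustration of ours, with a plain least-squares fit standing in for the full LASSO+ℓ0 training:

```python
import numpy as np

def leave_10pct_out_rmse(D, P, n_rounds=50, seed=0):
    """Average test RMSE of a linear fit under leave-10%-out CV.
    Mean/std are computed from the training fold only, so no
    information from the test set leaks into the training."""
    rng = np.random.default_rng(seed)
    n = len(P)
    n_test = max(1, n // 10)
    errs = []
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        mu, sd = D[train].mean(axis=0), D[train].std(axis=0)
        sd[sd == 0.0] = 1.0
        p_mean = P[train].mean()
        D_tr = (D[train] - mu) / sd           # training-fold statistics only
        D_te = (D[test] - mu) / sd            # test fold scaled with the same stats
        coef, *_ = np.linalg.lstsq(D_tr, P[train] - p_mean, rcond=None)
        r = P[test] - (D_te @ coef + p_mean)
        errs.append(np.sqrt(np.mean(r ** 2)))
    return float(np.mean(errs))
```

Recomputing `mu` and `sd` inside the loop is the point of the sketch: the test fold is scaled with statistics it never contributed to.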
The cross-validation test can serve different purposes, depending on the adopted framework. For instance, in kernel ridge regression with a Gaussian kernel, the fitted property is expressed as a weighted sum of Gaussians (see also section V): P(d) = Σ_{i=1}^N c_i exp(−‖d_i − d‖²/(2σ²)), where N is the number of training data points, i.e., there are as many coefficients as (training) data points, and ‖d_i − d‖² is the squared ℓ2 norm of the difference between the descriptors of different materials. The coefficients c_i are determined by a regularized least-squares fit (see section V). A recommended strategy42 is to use cross validation to determine the optimal values of the so-called hyper-parameters λ and σ, in the sense that the pair (λ, σ) that minimizes the average RMSE upon cross validation is selected. In our scheme, we can regard the dimensionality Ω of the descriptor and the size M of the feature space as hyper-parameters. By increasing both parameters, we do not observe a minimum; rather, we reach a plateau where no significant improvement in the cross-validation average RMSE is achieved (see below).
Since in our procedure the descriptor is found by the algorithm itself, a fundamental aspect of the cross-validation scheme is that the whole procedure, including the selection of the descriptor, is repeated from scratch for each training set. This means that, potentially, the descriptor changes with each training-set selection. We found a remarkable stability of the 1D and 2D descriptors. The 1D descriptor was the same as the all-data descriptor (Eq. (20)) for 90% of the training sets, while the 2D descriptor was the same as in Eq. (21) in all cases. For 3D and higher dimensionality, as expected from the N = CΩ ln(M) relationship, the selection of the descriptor becomes more unstable, i.e., for different training sets the selected descriptor often differs in at least one of the three components. The RMSE, however, does not change much from one training set to another, i.e., the instability of the descriptor selection just reflects the presence of many competing models. We show in Table VIII the cross-validation figures of merit, average RMSE and maximum absolute error, as a function of increasing dimensionality. For comparison, we also show the fitting error.

A. Complexity of the feature space
Our feature space is subdivided into 5 tiers.
• In tier zero, we have the 14 primary features as in Table IV.
• In tier one, we group features obtained by applying only one unary (e.g., A², exp(A)) or binary (e.g., |A − B|) operation to primary features, where A and B stand for any primary feature. Note that in this scheme the absolute value, applied to differences, is not counted as an extra operation, i.e., we consider |A − B| as a single operator.
• Our procedure was executed with tier 0 only, then with tiers 0 and 1, then with tiers 0 to 2, and so on. The results are shown in Table IX. A clear outcome of this test is that little is gained, in terms of RMSE, when going beyond tier 3. The reason why the MaxAE may increase at larger tiers is that the choice of the descriptor becomes more unstable (i.e., different descriptors may be selected) the larger the feature space is. This leads to less controlled maximal errors, and is a reflection of overfitting.
Incidentally, while for the 1D and 2D descriptors the results presented in Eqs. (20) and (21) contain only features up to tier 3, the third component of the 3D descriptor shown in Eq. (22) belongs to tier 4.

TABLE X: Number of times the 2D descriptor of the noiseless model (Eq. (21)) is found when noise is applied to the primary features. The noise is measured in terms of the standard deviation of the Gaussian-distributed set of random numbers that multiply the feature. Only for the 5 features contained in the 2D noiseless model does the noise affect the selection of the descriptor; therefore, only 5 out of 14 primary features are listed. The last line shows the effect of noise applied to all primary features simultaneously. The results are displayed for both leave-one-out (LOOCV) and leave-10%-out CV schemes, as indicated by the "CV scheme" column.
The selected descriptors may differ from iteration to iteration (each iteration is characterized by a different realization of the random noise). For the LOOCV, the RMSE goes from 0.09 eV (σ = 0.001) to 0.12 eV (σ = 0.3), while for the L-10%-OCV the RMSE goes from 0.11 to 0.15 eV. So, even when, at the largest noise level, the selected descriptor may vary each time, the RMSE is only mildly affected. This reflects the fact that many competitive models are present in the feature space, so that a model yielding a similar RMSE can always be found. It is interesting to note that, upon applying noise to IP(B), in all cases the new 2D descriptor is very similar to the descriptor in Eq. (21), but with IP(B) simply missing. It is also surprising that, for quite large levels of noise (10-13%) applied to EA(B), the descriptor containing this feature is still selected. In general, up to noise levels of 5%, the descriptor of the noiseless model is recovered in the majority of cases or more often. Therefore, we conclude that the model selection is not very sensitive to noise applied to isolated features. When the noise is applied to all features, however, the frequency of recovery of the 2D descriptor of Eq. (21) drops quickly. Still, for noise levels up to 1%, which could be related, e.g., to computational inaccuracies (not fully converged basis sets, or other numerical settings), the model is recovered almost always.
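A stripped-down version of such a sensitivity test could read as follows (our own sketch: multiplicative Gaussian noise on one feature, with descriptor selection reduced to picking the best single feature by least squares):

```python
import numpy as np

def best_1d_feature(D, P):
    """Index of the single feature whose least-squares fit
    (with intercept) yields the lowest RMSE."""
    best, best_j = np.inf, -1
    ones = np.ones(len(P))
    for j in range(D.shape[1]):
        A = np.column_stack([D[:, j], ones])
        coef, *_ = np.linalg.lstsq(A, P, rcond=None)
        rm = np.sqrt(np.mean((P - A @ coef) ** 2))
        if rm < best:
            best, best_j = rm, j
    return best_j

def recovery_fraction(D, P, j_noisy, sigma, n_trials=50, seed=0):
    """Fraction of noisy trials in which the selected 1D descriptor
    equals the noiseless one; the noise multiplies feature j_noisy."""
    rng = np.random.default_rng(seed)
    reference = best_1d_feature(D, P)
    hits = 0
    for _ in range(n_trials):
        Dn = D.copy()
        Dn[:, j_noisy] *= 1.0 + rng.normal(0.0, sigma, len(P))
        hits += int(best_1d_feature(Dn, P) == reference)
    return hits / n_trials
```

The same loop structure applies to the full LASSO+ℓ0 selection; only `best_1d_feature` would be replaced by the complete descriptor-selection procedure.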

Adding noise to P = ∆E
We have added uniformly distributed noise of magnitude δ = ±0.01, ±0.03, . . . eV to the DFT data for ∆E. Here, we selected two feature spaces of sizes 2924 and 1568, constructed from two different sets of functions, but always including the descriptors of Eqs. (20)-(22). The results are shown in Table XI. For Ω = 2, we report the fraction of trials for which the 2D descriptor of the unperturbed data was found in a L-10%-OCV test. (10 realizations of the random noise were generated, and for each realization L-10%-OCV was run for 50 random selections of the training set, so the statistics are over 500 independent selections.) The selection of the descriptor is remarkably stable up to uniform noise of ±0.1 eV (incidentally, around the value of the RMSE), and then it drops rapidly. We note that the errors stay constant as long as the noise is in the "physically meaningful" regime, i.e., as long as the relative ordering of the materials along the ∆E scale is not much perturbed. Only when the noise starts mixing the relative order of the materials does the prediction become less and less accurate in terms of RMSE.

C. Extrapolation: (re)discovering diamond
Most machine-learning models, in particular kernel-based models, are known to yield unreliable performance for extrapolation, i.e., when predictions are made in a region of the input space where there are no training data. We note that, in condensed-matter physics, identifying the regions where training data are available and those where they are not is difficult if not impossible. (In Table XI, besides the RMSE and the AveMaxAE, we also report the number of times the 2D descriptor of the unperturbed data is recovered.) We test the extrapolation capabilities of our LASSO+ℓ0 methodology by setting up two exemplary numerical tests. In the first test, we remove from the training set the two most stable ZB materials, namely C-diamond and BN (the two rightmost points in Fig. 1), and then calculate for both of them ∆E, as predicted by the trained model. Although the prediction errors of 1.2 and 0.34 eV for C and BN, respectively, are very large, both materials are still qualitatively predicted to be very stable ZB materials. We conclude that a LASSO+ℓ0 based model is likely to have good, at least qualitative, extrapolation capabilities. This is owing to the stability of the linear model and the physical meaningfulness of the descriptor, which contains elements of the chemistry of the chemical species building the material.
In closing this section, we note that we cannot draw general conclusions from the particularly robust performance of the descriptors found by our LASSO+ℓ0 algorithm when applied to the feature space constructed as explained above. The tests described in this section, however, form a useful basis for assessing the robustness of a found model, and we regard such a strategy, or a similar one, as good practice. Two aspects give us confidence that the found models may have a physical meaning: the particular nature of the models found by our methodology, i.e., the fact that they are expressed as explicit analytic functions of the primary features, and the evidence of robustness with respect to perturbations of the training set and of the primary features. We also note that the functional relationships between a subset of the primary features and the property of interest that are found by our methodology cannot automatically be regarded as physical laws. In fact, both the primary features and ∆E are determined by the Kohn-Sham equations, whose physically relevant input consists only of the atomic charges.
V. COMPARISON TO GAUSSIAN-KERNEL RIDGE REGRESSION WITH VARIOUS DESCRIPTORS
In this section, we use Gaussian-kernel ridge regression to predict the DFT-LDA ∆E for the 82 octet binaries, with various descriptors built from our primary features (see Table IV) or from functions of them. The purpose of this analysis is to point out the pros and cons of kernel ridge regression (KRR) when compared to an approach such as our LASSO+ℓ0. In the growing field of data analytics applied to materials-science problems, KRR is perhaps the machine-learning method most widely used to predict properties of a given set of molecules or materials5,8,68,69.
KRR solves the nonlinear regression problem argmin_c {Σ_j [P_j − Σ_i c_i k(d_i, d_j)]² + λ Σ_{i,j} c_i c_j k(d_i, d_j)}, where the P_j are the data points, k(d_i, d_j) is the kernel matrix built with the descriptor d, and λ is the regularization parameter, with a role similar to that of λ in Eqs. (3), (5), and (6). In KRR, λ is determined by minimizing the cross-validation error. The fitting function determined by KRR is therefore P(d) = Σ_{i=1}^N c_i k(d_i, d), i.e., a weighted sum over all the data points. The crucial steps in applying this method are the selection of the descriptor and of the kernel.
The most commonly used kernel is the Gaussian kernel, k(d_i, d_j) = exp(−‖d_i − d_j‖²/(2σ²)). The parameter σ, determining the width of the Gaussian, is recommended42,70 to be determined together with λ by minimizing the cross-validation error, and this is the strategy used here. The results are summarized in Table XIII. In each case, the optimal (λ, σ) pair was determined by running LOOCV.
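A compact numpy sketch of Gaussian-kernel KRR with (λ, σ) selected on a LOOCV grid might look as follows. This is our own illustration, not the paper's code; it uses the standard closed-form leave-one-out residuals of kernel ridge regression instead of an explicit refitting loop:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_loocv_rmse(X, P, lam, sigma):
    """Exact leave-one-out RMSE of KRR: e_i = c_i / [(K + lam*I)^-1]_ii."""
    M = np.linalg.inv(gaussian_kernel(X, X, sigma) + lam * np.eye(len(P)))
    e = (M @ P) / np.diag(M)
    return float(np.sqrt(np.mean(e ** 2)))

def krr_fit_predict(X_tr, P_tr, X_te, lam, sigma):
    """Fit KRR coefficients on the training set and predict on X_te."""
    K = gaussian_kernel(X_tr, X_tr, sigma)
    c = np.linalg.solve(K + lam * np.eye(len(P_tr)), P_tr)
    return gaussian_kernel(X_te, X_tr, sigma) @ c

def select_hyperparams(X, P, lams, sigmas):
    """Grid search: the (lam, sigma) pair with lowest LOOCV RMSE."""
    scored = [(krr_loocv_rmse(X, P, l, s), l, s) for l in lams for s in sigmas]
    _, lam, sigma = min(scored)
    return lam, sigma
```

The closed-form LOOCV residual avoids retraining N times per grid point, which makes the (λ, σ) search cheap even for a dense grid.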
We make the following observations:
• With several atom-based descriptors, KRR fits reach levels of RMSE comparable to, or slightly better than, our fit with LASSO+ℓ0.
• However, the performance of KRR does not improve with the dimensionality of the descriptor: descriptor 7 contains the same features as descriptors 5 and 6, plus other, possibly relevant, features. One would thus expect a better performance, which is not the case. The same happens when going to all 23 atomic and dimer features (descriptor 8).

A. Prediction test with KRR
We have repeated the tests of section IV.C, i.e., we have trained a KRR model on all materials except C and BN, and on all materials except all four carbon compounds. We have then evaluated the predicted ∆E for the excluded materials. This test was done using descriptors 1, 2, 4, 5, and 7 from Table XIII. Furthermore, the 2D LASSO+ℓ0 descriptors were evaluated for the two data sets described above (details on these descriptors are given in Appendix C). The performance of KRR in predicting the ∆E of selected subsets of materials depends strongly on the descriptor. In particular, when KRR is used for extrapolation (descriptor CBN, for which the C and BN data points are distant from the other data points in the metric defined by this 2D descriptor), the performance is rather poor in terms of quantitative error, even though BN and C are still correctly predicted to be very stable ZB materials. Some descriptors expected to carry relevant physical information, such as the set (r_s(A), r_p(A), r_s(B), r_p(B)), also show good predictive ability in these examples.
In summary, KRR is a viable alternative to the analytical models found by LASSO+ℓ0. An advantage of our LASSO+ℓ0 approach is that the descriptor is determined by the method itself. Most importantly, LASSO+ℓ0 is not fooled by features that are redundant or useless (i.e., that carry little or no information on the property): such features are simply discarded. In contrast, KRR cannot discard components of a descriptor, which results in decreasing predictive quality when a mixture of relevant and irrelevant features is naively used as the descriptor.

VI. CONCLUSIONS
We have presented a compressed-sensing methodology for identifying physically meaningful descriptors, i.e., physical parameters that describe a material and its properties of interest, and for quantitatively predicting properties relevant to materials science. The methodology starts by introducing possibly relevant primary features, suggested by physical intuition and prior knowledge of the specific physical problem. Then, a large feature space is generated by listing nonlinear functions of the primary features. Finally, a few features are selected with a compressed-sensing based method that we call LASSO+ℓ0, because it uses the Least Absolute Shrinkage and Selection Operator for a prescreening of the features, followed by an ℓ0-norm minimization for the identification of the few most relevant ones. This approach can deal well with linear correlations among different features.
We analyzed the significance of the descriptors found by the LASSO+ℓ0 methodology by discussing the interpolation ability of the models based on the found descriptors, the robustness of the models under stability analysis, and their extrapolation capability, i.e., the possibility of predicting new materials.

• As scalar features describing the valence orbitals, we use the radii at which the radial probability densities of the valence s, p, and d orbitals have their maxima. This type of radii was, in fact, selected by our procedure, as opposed to the average radii (i.e., the quantum-mechanical expectation value of the radius). To this purpose, two feature spaces, starting from the two sets of radii as primary features, were first constructed.
In practice, in one case we started from the same primary features as in Table IV, but without group A2 (in order to reduce the dimensionality of the final space), in the other case we substituted the average radii in group A3, again without group A2.
We then constructed both spaces following the rules of Tables V and VI. Finally, we joined the two spaces and applied LASSO+ℓ0 to the joint space. Only features containing the radii at the maximum of the radial probability density were selected among the best.
• Similarly, we also defined three other radii-derived features for the atoms: the radius of the highest occupied orbital of the neutral atom, r0, and the analogously defined radii for the anions, r−, and the cations, r+. r0 is either r_s or r_p as defined above, depending on which valence orbital is the HOMO. As in the previous point, we constructed a feature space containing both {r0, r−, r+} and {r_s, r_p, r_d} and their combinations, and found that only the latter radii were selected among the best.
• We have considered, in addition, features related to the AA, BB, and AB dimers (see Table XVI). These new features were combined in the same way as the groups A1, A2, and A3, respectively (see Tables IV-VI). After running our procedure, we found that features containing dimer-related quantities are never selected among the most prominent ones.
• We constructed in sequence several alternative sets of features, in particular by systematically varying the elements of group G (see Table VI).

Here, we added r_σ and r_π31 to the primary features, in order to see whether, combined with other features according to our rules, they yielded a more accurate and/or more robust model.
The feature space was reduced, since plainly adding all the combinations involving these two extra features made the whole procedure unstable (recall that we have only 82 data points).
By optimizing over a feature space of size 2924 and using all 82 materials for learning and testing, LASSO+ 0 identified the same 2D descriptor as in Eq. (21).In other words, no function of r σ or r π won over the known descriptor.For the L-10%-OCV, in about 10% of the cases, a descriptor containing a function of r σ or r π was selected.
3. Use of force constants derived from the tetrahedral pentamers AB4 and BA4 as primary features

Here, we build a feature space by exploring the idea that the difference in energy between RS and ZB may depend on the mechanical stability of the basic structural unit of either crystal structure. For instance, choosing ZB, we look at the mechanical stability of the tetrahedral pentamers AB4 and BA4. In practice, we look at the elastic constants for the symmetric and antisymmetric stretching deformations.
We write the elastic energy of deformation of a tetrahedral pentamer XY4 for a symmetric stretching deformation. The performance in terms of RMSE and CV is summarized in the corresponding table.

The equality defines the ℓ2, or Euclidean, norm (‖·‖2) of the vector P − Dc. The resulting function would then be just the linear relationship f(d) = ⟨d, c⟩ = Σ_{k=1}^M d_k c_k. It is the function that minimizes the error ‖P − Dc‖²2 among all linear functions of d. Let us point out a few properties of this equation.

1-4: r_s(A), r_p(A), r_s(B), r_p(B)
5-16: all ratios of all pairs of r's, e.g., r_s(A)/r_p(A)
17-22: differences of pairs, e.g., r_s(A) − r_p(A)
23-34: all differences divided by the remaining r's, e.g., (r_s(A) − r_p(A))/r_s(B)
35-40: absolute values of differences, e.g., |r_s(A) − r_p(A)|
41-43: sums of absolute values of differences with no r appearing twice, e.g., |r_s(A) − r_p(A)| + |r_s(B) − r_p(B)|
44-46: absolute values of sums of differences with no r appearing twice, e.g., |r_s(A) − r_p(A) + r_s(B) − r_p(B)|

The features labeled 3 and 5 in the text correspond to |r_s(A) − r_p(B)| and |r_s(B) − r_p(B)|, respectively.

TABLE III: Comparison of descriptors selected by minimizing the ℓ2 norm of the fitting function (the usual LASSO problem) or the ℓ∞ (maximum) norm, over the feature space described in Table I.
ℓ2: r_s(B)/r_p(A), (r_s(B) − r_p(B))/r_p(A)
ℓ∞: r_s(B)/r_p(A), (r_s(B) − r_p(B))/r_p(A)

FIG. 1 :
Predicted energy differences between RS and ZB structures of the 82 octet binary AB materials, arranged according to our optimal two-dimensional descriptor. The parallel straight lines are isolevels of the predicted model, from left to right at −0.2, 0, 0.2, 0.5, and 1.0 eV. The distance from the 0 line is proportional to the difference in energy between RS and ZB. The color/symbol code indicates the reference (DFT-LDA) energies.

Details on these descriptors are given in Appendix C. These two 2D descriptors are analytical functions of five primary features each, and they were also used together as a 5D descriptor. In all cases, we determined the hyper-parameters λ and σ by minimizing the RMSE over a LOOCV run, and the descriptors were normalized component by component: each component of the descriptor is normalized by the ℓ2 norm of the vector of the values of that component over the whole training set. The results are shown in Tables XIV and XV.
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 676580, The NOMAD Laboratory, a European Center of Excellence, the BBDC (contract 01IS14013E), and the Einstein Foundation Berlin (project ETERNAL).J.V. was supported by the ERC CZ grant LL1203 of the Czech Ministry of Education and by the Neuron Fund for Support of Science.This research was initiated while CD, LMG, MS, and JV were visiting the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation (NSF).
Feature space with primary features R (Row or Period) and G (Group) of the PTE. Performance over the whole set of 82 materials. For the case where the primary features are 14, including R(A), R(B), G(A), and G(B), we constructed a feature space of size 4605, and LASSO+ℓ0 found the following 1D model:

∆E = −0.376 + 0.0944 |R(B) − G(B)| / r_p(A)².    (B1)

In essence, |R(B) − G(B)| replaced |IP(B) − EA(B)| in Eq. (20), while the denominator remained unchanged. The RMSE of this new 1D model is only slightly better than that of the model in Eq. (20), namely 0.13 eV (vs. 0.14 eV), but the MaxAE is much worse, i.e., 0.43 vs. 0.32 eV. This finding is nevertheless remarkable, as it implies some correlation between |R(B) − G(B)| and |IP(B) − EA(B)|, where the latter is a DFT-LDA property while the former comes simply from the number of electrons and the Aufbau principle. Indeed, there is a Pearson correlation of 0.87 between the two quantities and, when both quantities are divided by r_p(A)², the Pearson correlation becomes as high as 0.98. The optimal 2D and 3D descriptors, though, are the same as in Eqs. (21) and (22). Even though it is expected that the difference IP−EA grows when moving to the right in the PTE, a linear correlation between the difference |R(B) − G(B)| and IP−EA is unexpected. Figure 2 shows in the bottom panel this correlation for the p elements (the anions B in the AB materials) of the PTE, while the top panel shows an even stronger linear correlation between IP−EA and the group G in the PTE, for each period.
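The correlation analysis above relies only on the Pearson coefficient, which can be computed in a few lines. This is our own sketch; the arrays shown are dummy placeholders, not the actual |R(B) − G(B)| and IP−EA data:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Dummy example: a perfectly linear relation gives correlation 1
a = np.array([1.0, 2.0, 3.0, 4.0])
assert abs(pearson(a, 2.0 * a + 1.0) - 1.0) < 1e-12
```

Applying `pearson` to the two feature vectors, with and without the division by r_p(A)², reproduces the kind of comparison reported above.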

2. Adding John and Bloch's r_σ and r_π as primary features

FIG. 2 :
Plot of the difference IP−EA for the anions (B atoms in the AB materials) vs. their Group (top panel) or the difference Group−Row (bottom panel) in the PTE. The straight lines are linear least-squares fits of the data points.

TABLE I :
Definition of the feature space for the tutorial example described in section II.E. The data are standardized, so that each of the 46 columns of D has mean zero and variance 1. Note that for columns 5-46 the standardization is applied only after the analytical function of the primary features is evaluated, i.e., for physical consistency, the primary features enter the formulas unshifted and unscaled.
The descriptors of John and Bloch31, r_π and r_σ, are both included in this set, with indexes 41 and 46, respectively.

TABLE II: Bookkeeping of (decreasing) λ values at which either a new feature acquires a non-zero coefficient (marked by a '+' in the Action column) or a feature passes from a non-zero to a zero coefficient.

TABLE IV: Set of atomic primary features used for constructing the feature space, divided into groups. The 'ID' column labels the group and '#' indicates the number of features in the group.

TABLE V: First set of operators applied to the primary features (Table IV). Each group, labeled by a different ID, is formed by starting from a different group of primary features and/or by applying a different operator. The label A stands for both A and B of the binary material.

Next, the above combinations are further combined, see Table VI. LASSO was then applied to this set of ∼4500 candidate features. If the features had low linear correlation, the first two features appearing upon decreasing λ would be the best 2D descriptor, i.e., the one that minimizes the RMSE [66]. Unfortunately, checking all pairs of features for linear correlation would scale with the size M as unfavorably as just performing the brute-force search for the best 2D descriptor by trying all pairs. Furthermore, such a
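The λ bookkeeping described above (Table II) can be sketched with a toy feature matrix. `lasso_cd` is a minimal coordinate-descent LASSO solver written purely for illustration, and the data are synthetic; the paper's actual feature space has ∼4500 columns.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 82, 40
X = rng.normal(size=(n, m))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized, like D
y = 3.0 * X[:, 3] - 2.5 * X[:, 7] + 0.1 * rng.normal(size=n)
y = y - y.mean()

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal coordinate-descent LASSO with soft thresholding."""
    c = np.zeros(X.shape[1])
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ c + X[:, j] * c[j]      # residual excluding feature j
            rho = X[:, j] @ r
            c[j] = np.sign(rho) * max(abs(rho) - len(y) * lam, 0.0) / col_sq[j]
    return c

active_order = []                               # bookkeeping of activations
for lam in np.geomspace(3.0, 0.05, 20):         # decreasing lambda values
    coef = lasso_cd(X, y, lam)
    for j in np.flatnonzero(np.abs(coef) > 1e-8):
        if j not in active_order:
            active_order.append(int(j))
print(active_order[:2])
```

Because the two informative columns here are nearly uncorrelated with the rest, they are the first to become active; with strongly correlated candidate features this ordering is no longer reliable, which is exactly the difficulty discussed in the text.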

TABLE VI: Second set of operators applied to the groups defined in Tables IV and V.

TABLE VII: Root mean square error (RMSE) and maximum absolute error (MaxAE), in eV, for the fit of all data (first two lines) and for the test set in a leave-10%-out cross validation (L-10%-OCV), averaged over 150 random selections of the training set (last two lines), according to our LASSO+ℓ0 algorithm.
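The L-10%-OCV protocol can be sketched as follows. All inputs are synthetic placeholders, and a plain least-squares fit on a fixed 2D descriptor stands in for rerunning the full LASSO+ℓ0 model selection at every repetition.

```python
import numpy as np

# Leave-10%-out CV: hold out a random 10% of the 82 materials, fit on the
# rest, and average test RMSE / MaxAE over 150 random selections.
rng = np.random.default_rng(3)
n = 82
D = rng.normal(size=(n, 2))                     # hypothetical 2D descriptor values
dE = D @ np.array([0.1, -0.2]) + 0.05 + 0.02 * rng.normal(size=n)

rmses, maxaes = [], []
for _ in range(150):
    test = rng.choice(n, size=n // 10, replace=False)
    train = np.setdiff1d(np.arange(n), test)
    A = np.column_stack([np.ones(len(train)), D[train]])
    coef, *_ = np.linalg.lstsq(A, dE[train], rcond=None)
    pred = np.column_stack([np.ones(len(test)), D[test]]) @ coef
    err = np.abs(pred - dE[test])
    rmses.append(float(np.sqrt(np.mean(err**2))))
    maxaes.append(float(err.max()))
print(round(np.mean(rmses), 3), round(np.mean(maxaes), 3))
```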

TABLE VIII: Errors after L-10%-OCV. "Tier x" means that all tiers up to tier x are included in the feature space.

TABLE IX: Number of times the 2D descriptor of the noiseless model (see Eq. 21) is found when

B. Sensitivity Analysis

Cross validation tests whether the found model is good only for the specific set of data used for the training or whether it is stable enough to predict the value of the target property for unseen data. Sensitivity analysis is a complementary test of the stability of the model, in which the data are perturbed, typically by random noise. The purpose of sensitivity analysis can be to find out which of the input parameters maximally affect the output of the model, but also how much the model depends on the specific values of the training data. In practice, the training data can be affected by measurement errors even if they are calculated by an ab initio model, because numerical approximations are used to calculate the actual values of both the primary features and the property. Since, through our LASSO+ℓ0

TABLE X: Performance of the model for increasing uniform noise added to the calculated ∆E.
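A noise-perturbation test of this kind can be sketched as follows. The data and the fixed 2D linear model are illustrative stand-ins; the paper instead reruns the full LASSO+ℓ0 descriptor search at every noise level.

```python
import numpy as np

# Sensitivity analysis sketch: add uniform noise of increasing half-width to
# the target values, refit, and watch how coefficients and errors respond.
rng = np.random.default_rng(4)
n = 82
D = rng.normal(size=(n, 2))                     # hypothetical 2D descriptor
dE = D @ np.array([0.1, -0.2]) + 0.05           # noiseless synthetic target

for width in (0.0, 0.05, 0.1, 0.2):             # half-width of uniform noise (eV)
    noisy = dE + rng.uniform(-width, width, size=n)
    A = np.column_stack([np.ones(n), D])
    coef, *_ = np.linalg.lstsq(A, noisy, rcond=None)
    rmse = float(np.sqrt(np.mean((A @ coef - noisy) ** 2)))
    print(width, np.round(coef, 3), round(rmse, 3))
```

If the recovered coefficients stay close to their noiseless values as the noise width grows, the model is stable against perturbations of the training data.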

TABLE XI: Performance of the model found by LASSO+ℓ0 when diamond and BN are excluded from the training. The rightmost column reports the LDA cohesive energy per atom of the ZB structure, referred to spinless atoms as used for determining the primary features in this work.

As shown in Table XI for the 2D descriptor, the model still predicts C and BN as the most stable ZB structures. Thus, in a setup where C and BN were unknown, the model would have predicted them as good candidates for the most stable ZB materials. In the other test, we remove from the training set all four carbon-containing materials, namely C-diamond, SiC, GeC, and SnC, and then calculate ∆E for all of them, as predicted by the trained model. The results are reported in Table XII for the model based on the 2D descriptor. The prediction error for C-diamond is comparable to that of the first numerical test, and the other errors are also relatively large. Remarkably, however, although the trained model knows nothing about carbon as a chemical element, it is able to predict that carbon forms ZB materials, and the relative magnitude of ∆E is respected.
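An extrapolation test of this kind can be sketched as follows; the held-out indices stand in for the four carbon compounds, and all values are synthetic placeholders.

```python
import numpy as np

# Leave-one-element-out sketch: remove all compounds of one element from the
# training set, fit on the remainder, and predict Delta-E for the held-out set.
rng = np.random.default_rng(5)
n = 82
D = rng.normal(size=(n, 2))                     # hypothetical 2D descriptor
dE = D @ np.array([0.1, -0.2]) + 0.05 + 0.01 * rng.normal(size=n)

held_out = np.array([0, 1, 2, 3])               # stand-ins for C, SiC, GeC, SnC
train = np.setdiff1d(np.arange(n), held_out)
A = np.column_stack([np.ones(len(train)), D[train]])
coef, *_ = np.linalg.lstsq(A, dE[train], rcond=None)
pred = np.column_stack([np.ones(len(held_out)), D[held_out]]) @ coef
print(np.round(np.abs(pred - dE[held_out]), 3))
```

This is a harder test than random cross validation, since the model must predict materials containing a chemical element it has never seen.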

TABLE XII: Performance of the model found by LASSO+ℓ0 when the carbon atom is excluded from the training, i.e., C-diamond, SiC, GeC, and SnC are excluded. The rightmost column reports the LDA cohesive energy per atom of the ZB structure, referred to spinless atoms as used for determining the primary features in this work.

TABLE XIII: Performance of the KRR method with various descriptors, in terms of the minimum cross-validation RMSE over a grid of 30 × 30 (λ, σ) values. Descriptor 4 is built with the IUPAC group (G, from 1 to 18) and period (R, from 'row', to avoid confusion with the property P) of the elements in the PTE; the performance of this set of four possible primary features with LASSO+ℓ0 is discussed in Appendix B 1. Descriptor 5 contains the 4 primary features used for building descriptor 2. Similarly, descriptor 6 contains the 5 primary features found in descriptor 3. Descriptor 7 contains all 14 atomic features listed in Table IV. Descriptor 8 contains the 14 atomic features plus 9 dimer features (see Table XVI in Appendix A).
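The (λ, σ) grid search for KRR with a Gaussian kernel can be sketched in pure NumPy. The data are synthetic placeholders for the descriptors, and the grid is 5 × 5 here (rather than 30 × 30) to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 82
X = rng.normal(size=(n, 2))                     # hypothetical 2D descriptor
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # synthetic target

def krr_cv_rmse(X, y, lam, sigma, k=5):
    """k-fold CV RMSE of kernel ridge regression with a Gaussian kernel."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))            # Gaussian (RBF) kernel matrix
    errs = []
    for test in folds:
        tr = np.setdiff1d(idx, test)
        # Ridge solution in kernel space: (K + lam*I) alpha = y
        alpha = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(len(tr)), y[tr])
        pred = K[np.ix_(test, tr)] @ alpha
        errs.append(np.mean((pred - y[test]) ** 2))
    return float(np.sqrt(np.mean(errs)))

# Pick (lam, sigma) minimizing the CV RMSE over a logarithmic grid.
grid = [(lam, sig) for lam in np.geomspace(1e-4, 1, 5)
                   for sig in np.geomspace(0.1, 10, 5)]
best = min(grid, key=lambda p: krr_cv_rmse(X, y, *p))
print(best, round(krr_cv_rmse(X, y, *best), 3))
```

The reported Table XIII numbers correspond to this minimum-CV-RMSE point of the grid for each descriptor.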

TABLE XIV: KRR prediction of ∆E (all in eV) for C and BN, when these two materials are not included in the training set, for various descriptors, compared to the LASSO+ℓ0 result and the LDA reference. Descriptor CBN is the same 2D descriptor found by LASSO+ℓ0 for this dataset (see Table XIX, row b) and descriptor CBN* is built with the 5 primary features found in descriptor CBN.

TABLE XV: KRR prediction of ∆E (all in eV) for all four carbon compounds, when these materials are not included in the training set, for various descriptors, compared to the LASSO+ℓ0 result and the LDA reference. Descriptor xC is the same 2D descriptor found by LASSO+ℓ0 for this dataset (see Table XIX, row d) and descriptor xC* is built with the 5 primary features found in descriptor xC.

TABLE XVIII: Performance of the model built from primary features based on force constants.

The results are summarized in Table XVIII. We conclude that the features based on the elastic energy do not yield a model with good predictive ability.

TABLE XIX: Summary of the training and CV errors for 1-, 2-, and 3-dimensional descriptors for different datasets. The last column reports the descriptor found using all data in the dataset for training, and the column "Ratio" reports the fraction of times the same descriptor was found over the L-10%-OCV iterations. CC means C-diamond.