Quantifying transfer learning synergies in infinite-layer and perovskite nitrides, oxides, and fluorides

We combine density functional theory simulations and active learning (AL) of element-embedding neural networks (NNs) to explore the sample efficiency for the prediction of vacancy layer formation energies and lattice parameters in ABX n infinite-layer (n = 2) versus perovskite (n = 3) nitrides, oxides, and fluorides in the spirit of transfer learning. Following a comprehensive data analysis from different thermodynamic, structural, and statistical perspectives, we show that NNs model these observables with high precision, using merely ∼30% of the data for training and exclusively the A-, B-, and X-site element names as minimal input devoid of any physical a priori information. Element embedding autonomously arranges the chemical elements with a characteristic recurrent topology, such that their relations are consistent with human knowledge. We compare two different embedding strategies and show that these techniques render additional input such as atomic properties negligible. Simultaneously, we demonstrate that AL is largely independent of the initial training set, and exemplify its superiority over randomly composed training sets. Despite their highly distinct chemistry, the present approach successfully identifies fundamental quantum-mechanical universalities between nitrides, oxides, and fluorides that enhance the combined prediction accuracy by up to 16% with respect to three specialized NNs at equivalent numerical effort. This quantification of synergistic effects provides an impression of the transfer learning improvements one may expect for similarly complex materials. Finally, by embedding the tensor product of the B and X sites and subsequent quantitative cluster analysis, we establish from an unbiased artificial-intelligence perspective that oxides and nitrides exhibit significant parallels, whereas fluorides constitute a rather distinct materials class.


Introduction
Artificial intelligence (AI) algorithms receive increasing attention in computational materials science. Following a decade of high-throughput materials discovery [1][2][3][4][5], which was paralleled by the emergence of different materials databases [6][7][8][9], machine learning [10][11][12][13][14][15][16][17][18][19] and, more specifically, deep learning techniques [20][21][22][23] open the perspective towards intriguing and often unconventional approaches for the identification of novel and possibly exotic stable and metastable materials with optimized properties.
The overall success story of AI is so far dominated by supervised methods, e.g., for the understanding and prediction of chemical trends and physical properties from materials data [20,21,24,25]. However, these techniques are usually highly data-intensive, and their accuracy is detrimentally affected by errors in the data. Semi- and unsupervised approaches [26] such as active learning (AL) [11,27] proved to be more robust and efficient strategies [28]. A further concept that will become increasingly relevant is transfer learning [22], where the AI knowledge gained while solving one problem is exploited in addressing a different, yet related problem, resulting in a boosted sample efficiency and higher accuracy. In the context of materials physics, this requires the existence of underlying universal systematics on the atomic scale, which are generally nontrivial, as quantum mechanics entangles the impact of the constituent chemical elements on the physical properties of a given compound far beyond a simple superposition of atomic properties. Nevertheless, the identification of such hidden patterns without external a priori information is one of the key strengths of AI.
In this comprehensive paper, we explore this aspect by complementing our first-principles data on infinite-layer (IL, ABX 2 ) and perovskite (P, ABX 3 ) oxides [28] by the respective nitrides and fluorides (figure 1). The IL oxides, which are experimentally derived from the P phase by topotactic reduction of the apical X anions, constitute a highly topical materials class, and contrasting them with their nitride and fluoride analogues is of fundamental interest. Specifically, fluoride perovskites are an important sub-class of the technologically relevant halides [50], and stable nitride perovskites have been predicted only recently [51]. On the other hand, this represents an extremely challenging extension of the already complex transition metal oxides: the strong differences in formal anion oxidation states, Pauling electronegativities (χ N = 3.04, χ O = 3.44, and χ F = 3.98), and bond dissociation energies of the respective diatomic molecules lead to a highly distinct chemistry in these three materials classes, and unraveling their universalities and synergies by AI in the spirit of transfer learning must be considered a nontrivial task.
We begin with a detailed data exploration for the three materials classes, comparing the relative thermodynamic stability of the P vs the IL phases, quantifying their structural response to the reduction reaction, and discussing chemical stability trends and statistical correlations. Subsequently, we demonstrate that element-embedding neural networks (NNs) are capable of modeling the formation energies of V N , V O , and V F vacancy layers as well as the IL and P lattice parameters with high accuracy, despite their complexity: these observables act as a fingerprint of the reduction reaction, encoding the quantum-mechanical essence of symmetry breaking, four-fold versus six-fold B-site coordination, the redistribution of electrons released by the anion vacancies, the different orbital sequence due to changes of the crystal field, and the modified hybridization. Interestingly, it turns out to be sufficient to provide solely the A-, B-, and X-site element names as input to the NNs. This highly reduced and abstract input is devoid of any physical a priori knowledge. Subsequently, element embedding [24,28,52] automatically arranges the chemical elements with a characteristic topology, such that their relations mirror the conventional picture of the periodic table. We compare two different embedding input strategies, and show to what extent complementing this approach with atomic properties (atomic radii and electronegativities) of the constituting elements influences the prediction accuracy. Detailed tracking of the prediction accuracy during the iterative AL process shows that a relative error of <1% is achieved by using merely ∼30% of the data for NN training. Simultaneously, we demonstrate that AL is largely independent of the initial training set and therefore highly reliable, and highlight its superiority over randomly composed training sets, which becomes particularly evident for the intricate structural anisotropy of the IL phase.
We also provide insight into how AL iteratively composes the training sets, which reflects the varying complexity of the considered observables between the different materials classes. By systematically comparing the performance of three specialized NNs trained exclusively on the nitride, oxide, and fluoride data versus a single NN trained on the combined data, we show that AI indeed identifies latent universalities that improve the prediction accuracy, i.e., enhance the sample efficiency. These synergistic effects are found to be particularly pronounced for lattice parameter prediction, with an accuracy improvement of up to 16% at equivalent numerical effort, and most relevant for small training set sizes. Finally, we reverse the perspective and exemplify how the unconventional combination of BX tensor product embedding and subsequent cluster analysis allows us to quantify abstract meta-level terms such as the similarity of entire materials classes. This shows that fluorides constitute a rather distinct materials class, whereas oxides and nitrides bear higher similarities.

Figure 2. (a) The phase diagrams compare the relative stability of the IL (n = 2) versus the P (n = 3) structure as a function of the respective P heat of formation for the ABX n nitride, oxide, and fluoride data sets. A selection of interesting compounds is highlighted for guidance (cf table 1). The inset in the fluoride panel shows a magnification of the top-left part. (b) The circular charts illustrate the distribution of the compounds across the three stability sectors and thereby quantify fundamental differences between the three materials classes. (c) Structural perspective on the data, comparing apical to basal changes upon reduction and superimposing them with E V X f (color scale). The insets show the back-to-front perspective for clarity, plotting data points with low E V X f in the foreground.

Methodology
We performed first-principles simulations in the framework of density functional theory [53] (DFT) as implemented in the Vienna ab initio simulation package [54,55], using the generalized gradient approximation as parameterized by Perdew, Burke, and Ernzerhof [56] to construct a database of ground-state energies and optimized lattice parameters for 4692 combinations of different elements at the A and B sites (figure 1) for both the P and the IL geometry, which were modeled by using cubic [17] and tetragonal [34,36] unit cells, respectively. Exploring X = N, O, and F at the anion sites, this results in 14 076 combinations and 28 152 simulated compounds in total. Accordingly, we refer to the individual data sets as N, O, and F, whereas NOF = N ∪ O ∪ F denotes the combined data set. Consistent with previous work [28], we largely adopted the DFT + U standards [57] of the Materials Project database [6,58,59] and used the respective ground-state crystal structures to construct our elemental bulk references. The NNs were realized in Keras/Tensorflow 2 [60,61], and the AL algorithm was developed in Python 3. The formation energies of the V X vacancy layers are determined from DFT ground-state energies via

E_f^{V_X} = E(ABX_2) + ½ E(X_2) − E(ABX_3),

where ½ E(X_2) models the N-, O-, and F-rich limit. We employ an anion energy correction to mitigate the well-known DFT overbinding of the gas-phase N 2 , O 2 , and F 2 molecules, reproducing the experimental binding energies of 9.79, 5.15, and 1.63 eV, respectively [62][63][64]. The heats of formation of the P phase from the constituent bulk elements are obtained as

E_f^P = E(ABX_3) − E(A) − E(B) − (3/2) E(X_2).

An analogous quantity E_f^{IL} can be defined for the IL phase. All energies are given per formula unit.
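The energy bookkeeping above can be condensed into a few lines of Python. This is an illustrative sketch only: the function names are invented here, and the simple shift-type molecular correction is an assumption about how the experimental binding energies are enforced, not the code used in this work.

```python
# Experimental X2 binding energies (eV) used to correct the DFT gas-phase
# overbinding of N2, O2, and F2, as quoted in the text.
E_BIND_EXP = {"N": 9.79, "O": 5.15, "F": 1.63}

def e_x2_corrected(e_x2_dft, e_x_atom_dft, x):
    """Shift the DFT X2 energy so that the molecule reproduces the
    experimental binding energy (one plausible correction scheme)."""
    e_bind_dft = 2.0 * e_x_atom_dft - e_x2_dft  # DFT binding energy
    return e_x2_dft + (e_bind_dft - E_BIND_EXP[x])

def e_f_vacancy_layer(e_il, e_p, e_x2):
    """E_f^{V_X} = E(ABX2) + 1/2 E(X2) - E(ABX3), per formula unit;
    the 1/2 E(X2) term models the X-rich limit."""
    return e_il + 0.5 * e_x2 - e_p

def e_f_perovskite(e_p, e_a_bulk, e_b_bulk, e_x2):
    """Heat of formation of the P phase, ABX3, from the elemental bulk
    references and the (corrected) X2 molecule."""
    return e_p - e_a_bulk - e_b_bulk - 1.5 * e_x2
```

An analogous `e_f_infinite_layer` for the IL phase would simply replace the ABX 3 total energy by that of ABX 2 and the factor 3/2 by 1.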

Data exploration: nitrides and oxides versus fluorides
We start our discussion with an overview of the three data sets from a thermodynamic, a structural, and a statistical perspective, identifying similarities and differences between ABX n nitrides, oxides, and fluorides. Simultaneously, this provides an impression of the data complexity and of the challenges it poses for subsequent modeling with AI methods. Figure 2(a) displays the N, O, and F data sets in three E V X f versus E P f phase diagrams, comparing the relative stability of the IL and the P structure as well as their stability with respect to the constituent bulk elements. This perspective on the data is motivated by recent work on IL oxides, specifically superconducting nickelates [29,35,41,45,48,49], which are initially stabilized as P films on SrTiO 3 (001) via heteroepitaxy, followed by a topotactic reduction of the apical X ions. Overall, E V X f ranges from −12 to +8 eV, while E P f covers ∼40 eV. The individual N, O, and F data clouds exhibit linear trends, i.e., a correlation between the P stability and the corresponding reduction energy to the IL phase. However, the data scatter broadly around the three regression lines.

Thermodynamic and structural perspective
Moreover, we observe that the three data clouds are concentrated in different sectors of the phase diagrams, the nitrides (fluorides) occupying predominantly the bottom-right (top-left) corner. The oxides are less compact than the fluorides and appear more balanced around the origin. The circular charts in figure 2(b) quantify the distribution of the compounds across the three stability sectors considered here. Most of the nitrides (68.3%) are labeled as thermodynamically unstable; 31% are located in the IL sector, and only 0.7% are labeled as P. In contrast, 66% of the oxides are located in the P sector, and even 92% of the fluorides. This can be understood from the highly different X 2 dissociation energies, the N-N triple bond being much more difficult to break than the O-O and F-F bonds (see above), and the much higher reactivity of F. Moreover, N demands excessively high cation oxidation states; we will explore this aspect in more detail below. Hence, the oxides emerge as the most balanced of the three materials classes, which is interesting since the P vs IL classification also reflects preferences with respect to the six-fold vs four-fold B-site coordination. Notably, these considerations do not directly imply statements on the absolute stability of the different phases, which would require a convex-hull analysis [17,51]. However, they highlight first fundamental differences between the three materials classes, and some metastable compounds may be accessible via heteroepitaxy. The structural analysis provided in figure 2(c) confirms our earlier finding for oxides [28] that most materials tend to contract vertically upon reduction with respect to their cubic phase (the c IL 0 /a IL 0 ratios can drop below 0.5, particularly for some nitrides), expanding simultaneously in the plane (up to 20%) with reduced volume. This applies in particular to those materials that feature a low E V X f [see insets in figure 2(c)].
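The sector assignment underlying the circular charts in figure 2(b) can be sketched as follows. The boundary conditions used here are a plausible reading of the phase diagrams, not necessarily the exact definition employed for the charts; note that E_f^{V_X} = E_f^{IL} − E_f^{P} follows directly from the formation-energy definitions in the methodology section.

```python
def stability_sector(e_f_p, e_f_vx):
    """Assign a compound to one of the three stability sectors of
    figure 2(b), given the two phase-diagram axes: the P heat of
    formation e_f_p and the reduction energy e_f_vx (both in eV).
    The sector boundaries are an illustrative assumption."""
    e_f_il = e_f_p + e_f_vx  # since E_f^VX = E_f^IL - E_f^P
    if e_f_p > 0.0 and e_f_il > 0.0:
        return "unstable"    # neither phase stable vs. the bulk elements
    # otherwise, the sign of the reduction energy decides the sector
    return "IL" if e_f_vx < 0.0 else "P"
```

For example, a strongly bound perovskite such as LaTiO 3 (E P f ≈ −16.5 eV, positive E V X f ) falls into the P sector, whereas a compound with negative reduction energy is labeled IL.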
For some very stable and therefore difficult-to-reduce compounds, the structural changes are rather modest, as one can observe from the E V X f maxima that are consistently located close to the centers of the panels. In sharp contrast, some materials expand massively in the apical direction (c IL 0 /a IL 0 ∼ 2-3) with 10%-25% basal contraction relative to the P phase. For the fluorides, one notices that these materials align closer with the volume conservation curve than the nitrides and oxides, but simultaneously feature more pronounced basal contraction (i.e., extend further to the left of the panel). The nitrides and oxides listed in table 1 consistently contract in the vertical direction with respect to their cubic phase and simultaneously expand in the basal plane upon reduction. The listed fluorides, on the other hand, show a more complex behavior. Specifically, the structural response can vary within isoelectronic families. For instance, NaMgF n contracts vertically and expands in the basal plane, whereas KMgF n expands vertically. RbMgF n expands vertically and contracts in the basal plane upon reduction. This already foreshadows that c IL 0 is the most intricate observable, as we will see below.

Table 1. Energies and lattice parameters for a selection of ABX n nitrides, oxides, and fluorides (cf figure 2). E V X f < 0 indicates that the tetragonal IL structure (n = 2) is preferred over the cubic P phase (n = 3). All listed materials are stable as P and IL phases with respect to A-B cation interchange; the associated energy is reflected exemplarily for the P phase by the corresponding energy difference.
It is worthwhile to explore those materials that are highlighted in figure 2(a) in more detail, also in conjunction with table 1, and compare some of the present results to the literature. For the nitrides, the top-left (i.e., most stable) part of the data cloud is formed by the A 3+ (Ta/W/Re)N n families, where A 3+ denotes a group-3 element including the rare-earth metals. This is in line with earlier work that predicted LaReN 3 , LaWN 3 , and YReN 3 to feature (distorted) perovskite structures as ground state [51], albeit no perovskite ground state was found for LaTaN 3 . All three families are located near the boundary between the P and IL regimes. The oxide panel highlights the formally d 9 IL nickelates and cuprates [36,40,46] next to some well-known and highly stable perovskites. The nickelates appear as a compact family in the phase diagram, exhibiting a stable P phase, but being simultaneously close to the IL regime. In contrast, the cuprate family extends widely over the IL region, which reflects the naturally preferred fourfold-coordinated plaquette structure that is a common structural motif in high-T C cuprate superconductors. In the IL region, the A 3+ LiO 2 compounds emerge as stable, strongly anisotropic, and insulating materials [28].

Figure 3. Group-resolved stability trends of the P phase for nitrides, oxides, and fluorides. Each matrix element displays an averaged heat of formation E P f (g A , g B , X) (in color and explicit values), where averaging is performed over all combinations of chemical elements at the A and B sites that belong to the groups g A and g B of the periodic table, respectively (cf figure 1). The transition metals are highlighted by blue labels. Comparison of a matrix element with its counterpart mirrored at the main diagonal provides an impression of the energy associated with A-B cation interchange.
For the fluorides, which constitute a sub-class of the technologically important halides [50], we observe a series of well-known P materials in the top-left (i.e., most stable) part of the data cloud: for instance, the A 1+ (Mg/Ca/Sr)F n families, where A 1+ denotes an alkali metal (e.g., neighborite NaMgF 3 and parascandolaite KMgF 3 ). NaMgF 3 is of specific interest due to its role as a low-pressure analog for phase transitions in bridgmanite MgSiO 3 [65], which is abundant in the lower part of the Earth's mantle. Of comparable stability are members of the A 2+ LiF n family, where A 2+ denotes an alkaline earth. Characteristic of these materials, often referred to as 'inverse' perovskites (not to be confused with antiperovskites), is that the A-site cation has a higher oxidation state than the B-site cation, an example being BaLiF 3 . In the same part of the phase diagram, we also find the (Zn/Cd/Hg)RF n families, where R denotes a group-3 element, particularly a rare-earth metal [2]; we will further comment on these compounds below. Near the boundary to the IL regime, we observe the A 1+ LiF n family, which is located at a similar position in the phase diagram as the A 3+ LiO n oxides. Table 1 summarizes the DFT-predicted energies and lattice parameters for a selection of nitrides, oxides, and fluorides. As we pointed out earlier [28], the oxide lattice parameters are in close agreement with the experimental and theoretical literature, for instance for NdNiO 2 (a IL 0 = 3.92 Å [51]). The fluoride perovskite KMgF 3 crystallizes in the cubic Pm3m structure [50] with a P 0 = 4.00 Å [74], and also NaMgF 3 in its high-temperature cubic phase agrees nicely with our results (a P 0 = 3.955 Å [75]), whereas the lattice parameters in the tilted orthorhombic phase are naturally smaller (a P 0 ∼ 3.834 Å [76]). The layer reduction energies E V X f can be related to vacancy formation energies.
For instance, in SrTiO 3 , the oxygen vacancy formation energy amounts to ∼5.5 eV according to first-principles calculations [77]. In LaAlO 3 , this energy is even higher (6.9 eV [78]), whereas the values in nickelates and their heterostructures [40,63] are much lower (∼2.8 eV in bulk LaNiO 3 [62]). These theory results align perfectly with table 1. In addition, the energy differences provide an impression of how stable the listed materials are with respect to A-B cation interchange, exemplarily given for the P phase. For instance, the 'inverse' perovskite BaLiF 3 is more stable than its non-inverse analogue LiBaF 3 by 3.2 eV per formula unit.

Stability trends and statistical correlations
Next, we explore the stability trends of different A-B cation combinations, and particularly how they are modified by varying the X-site anions. In order to highlight the key aspects, we focus on the P phase. The matrices in figure 3 show the averaged P heats of formation E P f (g A , g B , X), where averaging is performed over all combinations of chemical elements at the A and B sites that belong to the groups g A and g B of the periodic table, respectively. X is fixed for each panel to either N, O, or F. Since the properties of oxide perovskites are well known, they serve as a benchmark. One can observe minima at the g A /g B combinations 3/3 (−15.0 eV; e.g., LaScO 3 ) and 2/4 (−15.7 eV; e.g., SrTiO 3 ), and an even deeper minimum for 3/4 (−16.5 eV; e.g., LaTiO 3 ). In general, low values (i.e., strongly bound compounds) are achieved by combining a group-3 element (including the rare-earth metals) at the A site with a transition metal or group-13 element (predominantly Al) at the B site. These results reflect the perfectly balanced 2− oxidation state of the three oxygen ions.

Figure 4. The correlation matrix r i j analyzes the interdependence of the different observables in the NOF data set, including atomic properties (atomic number Z, periodic table group g, atomic radius r, and electronegativity χ), the derived Goldschmidt tolerance factor t, and different energies and lattice parameters as determined from first principles (marked by blue labels).
It is interesting to compare these results with the nitrides. As we discussed above [figure 2(a)], nitride perovskites are far less stable than the oxides. However, similar to the latter, we observe a clear preference for a group-3 element at the A site ( figure 3). Assuming the standard oxidation state of 3− for nitrogen, this would imply a very high oxidation state of 6+ at the B site. Cr, Mo, and W from group 6 are the first elements in the transition metal series that feature such an oxidation state. Indeed, stable nitride perovskites have so far only been predicted for 3/6 and 3/7 combinations involving W and Re [51]. However, we find lower heats of formation already for 3/4 and 3/5 (−4.0 and −4.8 eV). This underlines the competition in nitride perovskites between inducing extremely high cation oxidation states and accepting a lower anion oxidation state than 3−, which however impedes the compensation of the high N 2 dissociation energy. The overall similar structure of the nitride and oxide matrices suggests that nitrides can be considered, to some degree, as destabilized oxides.
For the fluorides, a substantially different picture emerges. The oxide and nitride preference for group-3 elements at the A site is replaced by a preference for 1+ alkali metals, combined with 2+ alkaline earths (−16.9 eV) or group-3 elements (−16.3 eV; note that some of these elements feature 2+ oxidation states) at the B site. The minimum at 3/13 observed for oxides, predominantly with Al at the B site, is destabilized; instead, mirrored at the main diagonal of the matrix, a strong minimum emerges around 12/3 (−16.4 eV), combining the late transition metals (Zn, Cd, Hg) at the A site with group-3 elements at the B site.
Another statistical perspective on the data is provided in figure 4. The symmetric matrix displays the Pearson product-moment correlation coefficients r i j between different observables x i , the latter comprising atomic properties of the A-, B-, and X-site elements as well as energies and lattice parameters as determined from first principles. Here, r_ij = ⟨(x_i − ⟨x_i⟩)(x_j − ⟨x_j⟩)⟩/(σ_i σ_j), where the averages ⟨·⟩ are performed over the entire NOF data set and σ_i denotes the standard deviation of x_i. The energies E P f and E V X f are predominantly impacted by the X site, whereas the influence of the other sites (reaching values of 0.6 for oxides alone [28]) is largely quenched. This highlights the distinct chemistry induced by varying X. E V X f is significantly anticorrelated with E P f (−0.9), which reflects the linear trends observed in figure 2(a). The (basal) lattice parameters a P 0 and a IL 0 correlate predominantly with the B and the X sites, particularly r B (0.7), and are also significantly intercorrelated (0.9). In sharp contrast, the vertical lattice parameter c IL 0 exhibits almost no correlations with the other quantities, not even with the Goldschmidt tolerance factor t = (r A + r X )/[√2 (r B + r X )], which we find to be generally less characteristic here than for oxides alone [28]. Optimized descriptors [15,16,18] may enhance the correlation. In any case, this shows that a complex nonlinear methodology (such as NNs) is required to model these observables.
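Both the Goldschmidt tolerance factor and the Pearson correlation matrix are straightforward to compute; a minimal NumPy sketch (independent of the actual analysis scripts used in this work) reads:

```python
import numpy as np

def goldschmidt_t(r_a, r_b, r_x):
    """Goldschmidt tolerance factor t = (r_A + r_X) / [sqrt(2) (r_B + r_X)],
    built from the ionic radii of the A-, B-, and X-site species."""
    return (r_a + r_x) / (np.sqrt(2.0) * (r_b + r_x))

def pearson_matrix(x):
    """Pearson product-moment correlation matrix r_ij for observables
    stored column-wise in x (rows: compounds, columns: observables),
    averaging over the entire data set as in figure 4."""
    xc = x - x.mean(axis=0)            # center each observable
    cov = xc.T @ xc / len(x)           # covariance matrix
    sigma = np.sqrt(np.diag(cov))      # standard deviations
    return cov / np.outer(sigma, sigma)
```

For an ideal cubic perovskite, t = 1; values computed this way can be correlated against the first-principles observables exactly as in the r i j matrix of figure 4.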
More in-depth analysis unravels that the specific correlation matrix of nitrides is similar to that of oxides [28] concerning energies and lattice parameters, whereas fluorides exhibit substantially different correlations and thereby quench many matrix elements in figure 4 (not shown). This provides additional evidence that nitrides and oxides are rather similar materials classes, whereas fluorides are distinct.

Active learning of element-embedding neural networks
We showed recently that it is possible to arrive at the insights presented so far without explicitly calculating the entire data set, but only a fraction of it, even for such complex materials classes as transition metal oxides [28]. In this context, element-embedding artificial NNs emerged as an efficient means to predict the observables of interest for all materials in a parameter space on the basis of just a subset of it. The latter can be iteratively constructed on the basis of entropy and information by AL.

Figure 5. The key difference between the two NN types lies in the input strategy: (a) In NN type 1, the A and B sites are element embedded, whereas the X vector is directly passed to the dense layers. (b) In contrast, in NN type 2, the tensor product of the B and X sites is embedded (light blue). In both NN types, these categorical input channels can be complemented by a parallel numerical (scalar) input channel that provides the elements' atomic properties (atomic radii r and Pauling electronegativities χ; dark blue). The small numbers denote the vector dimensions and neurons per layer. (c) The NNs are trained within an AL cycle, which operates with two NNs of the same type and iteratively expands the training set based on local entropy estimates Σ(A, B, X) in the parameter space [28].

Neural network architectures and active learning strategy
The two distinct NN architectures investigated here are displayed in figures 5(a) and (b). In both cases, the NNs take the one-hot-encoded names of the elements (e.g., 'La', 'Ti', and 'O') at the A, B, and X sites as categorical input. In NN type 1 [figure 5(a)], the A and B sites are element embedded, whereas the X vector is directly passed to the subsequent dense layers. The element embedding approach [24,28] is inspired by word embedding [52], a deep-learning language-processing technique where words are represented in a semantically insightful way in a vector space of compact dimension. In the present context of materials physics, embedding autonomously establishes a new and efficient representation of the high-dimensional one-hot-encoding vectors that describe the materials composition, i.e., it maps each of them onto a new vector in the embedding space (here: 16-dimensional [28]). While this is at first sight only an abstract mathematical transformation, we will show below that the established representation unambiguously encodes information about the elements' chemical properties. Note that the norm of the embedding vectors may also encode information, owing to the nonlinearity of the NNs. In this spirit, NN type 2 [figure 5(b)] constitutes an interesting variation of NN type 1, as the tensor product of sites B and X is embedded. This is physically motivated by the fact that in transition metal oxides, conventionally the B site is the active site that determines the physical properties of the compound, and is simultaneously also most susceptible to external stimuli such as, in this case, variation of the X-site anion. Element embedding constitutes a highly minimized input strategy in the sense that it is entirely devoid of a priori physical knowledge.
Optionally, both NN types feature a parallel numerical (scalar) input channel that complements the output of the embedding layers with the elements' atomic properties (atomic radii r and Pauling electronegativities χ). This input layer is followed by a sequence of hidden layers, featuring 512 (1024), 256 (512), and 128 (256) densely connected neurons in the case of energy (lattice parameter) prediction, respectively. These numbers emerged from hyperparameter optimization. Finally, the output layer provides energies (E V X f , E P f , E IL f ) or lattice parameters (a P 0 , a IL 0 , c IL 0 ). We apply error backpropagation on the training set to automatically adapt the weights that connect the individual neurons, until an optimal mapping from input to output is achieved.
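While the actual NNs were realized in Keras/Tensorflow 2, the type-1 forward pass can be illustrated framework-free. The layer widths (512/256/128 for energy prediction) and the 16-dimensional embedding follow the text; the element count, the random (untrained) weights, and all names are placeholders of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ELEMENTS, EMB_DIM, N_X = 68, 16, 3   # element count is illustrative;
                                       # 16-dim embedding as in the text

def dense(w, b, h, relu=True):
    """One densely connected layer with optional ReLU activation."""
    h = h @ w + b
    return np.maximum(h, 0.0) if relu else h

# Trainable parameters (random here; adapted by backpropagation in practice).
emb_a = rng.normal(size=(N_ELEMENTS, EMB_DIM))   # A-site embedding table
emb_b = rng.normal(size=(N_ELEMENTS, EMB_DIM))   # B-site embedding table
widths = [512, 256, 128]                         # energy-prediction head
layers, n_in = [], 2 * EMB_DIM + N_X
for n_out in widths + [3]:                       # 3 outputs: the energies
    layers.append((rng.normal(size=(n_in, n_out)) * 0.05, np.zeros(n_out)))
    n_in = n_out

def nn_type1(a_onehot, b_onehot, x_onehot):
    """NN type 1: embed the A and B one-hot vectors, pass X directly,
    then run the concatenated features through the dense stack."""
    h = np.concatenate([a_onehot @ emb_a, b_onehot @ emb_b, x_onehot])
    for i, (w, b) in enumerate(layers):
        h = dense(w, b, h, relu=(i < len(layers) - 1))  # linear output layer
    return h

a = np.eye(N_ELEMENTS)[10]
b = np.eye(N_ELEMENTS)[20]
x = np.eye(N_X)[1]
out = nn_type1(a, b, x)   # three predicted energies (untrained, random)
```

NN type 2 would replace the two separate B and X inputs by a single one-hot vector of dimension N_ELEMENTS × N_X indexing the BX tensor product, looked up in one embedding table.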
Here, we train these NNs within an AL cycle [11,27,28]. Starting from a small subset of the parameter space (here initially ∼20%), AL iteratively extends the training set in a statistically optimal way and thus assists in approaching a high prediction accuracy rapidly and efficiently. This type of algorithm can be applied to (at least) two scenarios: (i) If extensive data is available, AL separates information-rich data from redundant or erroneous data. (ii) If data has to be generated first (e.g., by time-consuming quantum-chemical or DFT simulations), AL efficiently drives this data generation, predicting and requesting iteratively what materials to simulate next to extend the training set. In either case, one obtains a small but optimal subset of the materials parameter space that contains a maximum of non-redundant information, which is subsequently used by the NNs to accurately predict the properties of all materials in the entire parameter space.
The iterative AL process is illustrated in figure 5(c). The key question in each step is how to optimally increase the training set size to efficiently enhance the prediction accuracy. We solve it by training two NNs of the same type in parallel and subsequently comparing their individual predictions to estimate which materials are attractive candidates: given the observables φ i 1,2 as predicted by NN 1 and NN 2 and the respective DFT ground truth x i DFT , where i labels either different energies (E V X f , E P f , E IL f ) or lattice parameters (a P 0 , a IL 0 , c IL 0 ), we define by averaging over i the local entropy estimate

Σ(A, B, X) = ⟨|φ_i^1(A, B, X) − φ_i^2(A, B, X)|⟩_i

from the disagreement of the two NNs, and the mean absolute error

MAE(A, B, X) = ⟨|φ_i(A, B, X) − x_i^DFT(A, B, X)|⟩_i

from the deviation between prediction and ground truth.

Figure 6. Each curve displays the mean of the overall MAE [equation (1)] over a series of AL runs starting from different random initial training sets, and the colored areas display the corresponding standard deviations. This demonstrates that the reliability and accuracy of AL is largely independent of the initial training set. In parallel, the insets monitor the evolution of the NOF training set composition, starting from a 33% distribution of N, O, and F materials. Particularly for lattice parameter prediction, one can see that the AL algorithm preferentially includes the more challenging fluorides.

Site averaging (i.e., integrating over the entire parameter space) yields the overall mean absolute error

MAE = ⟨MAE(A, B, X)⟩_{A, B, X}.  (1)

Naturally, if X is fixed, averaging over X is omitted. In the AL cycle [figure 5(c)], the training set is iteratively updated, appending ∼130 N, O, or F (∼400 NOF) materials per step that exhibit the highest Σ(A, B, X), followed by further NN training. Interestingly, Σ can be interpreted as an estimate of the local entropy in the parameter space [28]. In this spirit, the present AL algorithm statistically maximizes the information entailed in the training set. From the definition of Σ it follows that the DFT ground truth beyond the current training set is not required by the AL algorithm to select interesting materials candidates; we use it only a posteriori to obtain the MAE and thereby analyze the AL performance (figures 6-8).
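A single AL selection step based on the committee disagreement can be sketched as follows. The function names are illustrative; Σ is taken here as the mean absolute disagreement of the two NNs over the predicted observables, consistent with the remark that no DFT ground truth is needed for the selection.

```python
import numpy as np

def local_entropy(pred1, pred2):
    """Sigma(A, B, X): mean absolute disagreement between the two
    committee NNs, averaged over the predicted observables i (last axis).
    pred1, pred2: arrays of shape (n_materials, n_observables)."""
    return np.mean(np.abs(pred1 - pred2), axis=-1)

def select_candidates(pred1, pred2, n_add):
    """One AL step: request the n_add not-yet-computed materials with the
    highest Sigma (~130 per step for N, O, or F; ~400 for NOF)."""
    sigma = local_entropy(pred1, pred2)
    return np.argsort(sigma)[::-1][:n_add]  # indices, highest Sigma first
```

The selected indices are then simulated ab initio, appended to the training set, and both NNs are retrained, closing the cycle of figure 5(c).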
Finally, the presented AL algorithm can be stopped as soon as the desired accuracy is reached, establishing the latter as a systematic control parameter. Since only the autonomously selected materials need to be calculated ab initio in each iteration, a substantial gain in performance and energy efficiency as compared to conventional high-throughput calculations can be achieved. We note again that the overall MAE [equation (1)] is only accessible since all the data has been calculated in advance in this case. In a 'real' AL materials prediction run, this information would not be available. Therefore, we have the opportunity here to explore in great detail the reliability of this AI technique.

Evolution of the prediction accuracy during active learning
First of all, three MAE curves for N, O, and F are shown in each panel of figure 6. They have been obtained from three individual NNs specialized on either N, O, or F prediction; i.e., they are exclusively trained and evaluated on the respective data sets. As one can see from the definition in equation (1), the MAE is a highly integrated measure that averages the NN prediction difference with respect to the DFT ground truth over different observables and the parameter space, and thus condenses a sizable amount of information in a single quantity. Moreover, each curve represents the mean of 25 consecutive and independent AL runs, where each AL run started from a different random initial training set. The colored areas visualize ± the corresponding standard deviations. Interestingly, these standard deviation corridors are rather small and furthermore narrow rapidly with increasing training set size. This demonstrates that the AL process is largely independent of the initial training set composition, and can therefore be considered highly reliable. We can also see that the accuracy improves systematically, since the MAE curves decrease monotonically with the number of AL iterations.
Second, each panel presents an additional MAE curve that corresponds to a NN of identical architecture which, in contrast, is trained and evaluated on the combined NOF data. In this case, we operate with three times larger training sets, so that the overall number of compounds used for training (and therefore the time and energy invested in DFT calculations) at each step is equal and the accuracy therefore legitimately comparable. Again, each curve constitutes the mean of 25 consecutive AL runs, and is displayed together with the respective standard deviation. At first sight, these NOF curves resemble a simple average of the individual N, O, and F curves. However, the key aspect here is that the NOF NNs can identify and learn universal properties of the N, O, and F data sets and benefit from their synergies. We will explore and quantify this important difference below.
In order to assess the AL performance, we use a relative error of 1% as reference, which we calculate for the energies relative to the ∼20 eV range of $E_f^{V_X}$ and for the lattice parameters relative to the average of the ranges of $a_0^{\mathrm{P}}$, $a_0^{\mathrm{IL}}$, and $c_0^{\mathrm{IL}}$, which amounts to ∼4 Å [see also figures 2(a) and 7]. Figure 6 shows that the N, O, and F energy curves drop below 1% relative error between ∼30% and 35% training set size, and the NOF curve at ∼32%. Here, the oxides always exhibit the highest MAE, whereas the fluoride curve is consistently the lowest (most precise). Overall, the curves are close together, in particular for training set sizes beyond 35%. In contrast, the lattice parameter MAE curves exhibit a clearly larger spread. The N, O, and F curves drop below 1% relative error between ∼25% and 35%, and the NOF curve at ∼29% training set size. The fluoride curve lies by far above the oxide and nitride curves, the latter being consistently the lowest. This reflects on the one hand how challenging the prediction of energies versus lattice parameters is in general, but simultaneously how the complexity varies between the three classes. Furthermore, we can state that while atomic properties improve the accuracy during the first AL iterations, the impact of this scalar input reduces substantially for training set sizes beyond ∼30%. This underlines the strength of element embedding, which renders a priori knowledge largely unnecessary.
Interestingly, in the NOF case, the AL algorithm has the flexibility to dynamically control the relative composition of the training set. Therefore, it is worthwhile to monitor in parallel how it evolves during AL (figure 6, insets). For energy prediction, the contributions from the N, O, and F subsets remain largely balanced around the initial 33%, with a slight preference emerging for O and F at the expense of N. A fundamentally different behavior can be observed for lattice parameter prediction, which, as we observed above, is most challenging for the fluorides. This is automatically identified by the AL algorithm and mitigated by increasing the fraction of fluorides in the training set. Already after 4-5 AL iterations, the training set composition reaches the final values ∼40% F versus ∼30% O and N. The inclusion of atomic properties as additional NN input does not change this picture significantly.

Observable-resolved prediction accuracy: superiority of active learning
A more detailed perspective on the performance of the NOF NNs with respect to $E_f^{V_X}$, $a_0^{\mathrm{P}}$, $a_0^{\mathrm{IL}}$, and $c_0^{\mathrm{IL}}$ prediction is shown in figure 7. In the top row we can see results for single NNs of type 1, suppressing additional atomic-properties input. AL-iterating towards a training set size of ∼50%, we already obtain an MAE of ∼0.1 eV per vacancy for $E_f^{V_X}$. Hence, we achieve a comparable accuracy on the NOF data set as before on oxides alone (0.072/0.126 eV MAE for $E_f^{V_{\mathrm{O}}}$ on seen/unseen data [28]), which is also in line with the uniform convergence of the MAE curves in figure 6. Relative to their overall range of ∼20 eV, this corresponds to an error of only ∼0.5%. The heats of formation $E_f^{\mathrm{P}}$ and $E_f^{\mathrm{IL}}$ are predicted with even higher accuracy, reaching ∼30 meV/atom (not shown), which is comparable to recent work on perovskite oxides (20-34 meV/atom [21]) and well within DFT accuracy [20,79]. For elpasolites, a heat-of-formation accuracy of 150 meV/atom was obtained [24]. These observations reflect that $E_f^{V_X}$ is a fingerprint of the complex reduction reaction and consequently more demanding to predict. Among the lattice parameters, the prediction of $a_0^{\mathrm{P}}$ and $a_0^{\mathrm{IL}}$ proved to be straightforward, reaching MAEs around 0.01 Å (figure 7). In contrast, $c_0^{\mathrm{IL}}$ turned out to be more challenging, with a higher MAE of ∼0.03 Å. This can be traced back to the sparse data available for vertically expanding materials [figure 2(c)] and the almost vanishing correlations of $c_0^{\mathrm{IL}}$ with the other observables (figure 4). Again, we achieve a comparable accuracy on the NOF data set as before on oxides alone (0.01/0.01 Å for $a_0^{\mathrm{IL}}$ and 0.02/0.03 Å for $c_0^{\mathrm{IL}}$ on seen/unseen data [28]), which are only slightly better than the MAEs reported here. As we discussed above, the positive impact of atomic properties in the NN input for small training sets reduces as more and more data becomes available.
The results displayed in the middle row are even slightly less accurate (i.e., they exhibit a higher MAE) than those obtained for element embedding alone (top row).
The results shown in the bottom row have been obtained for a randomly chosen training set of equal size (∼50%). While the overall accuracy on the training data is good, the performance on the unseen data is by far inferior. We confirmed that this effect is not related to overfitting. In particular for $c_0^{\mathrm{IL}}$ it becomes obvious that AL considerably enhances the prediction accuracy and thus emerges as the superior strategy (figure 7). Note that the overall numerical effort is quasi-identical, since the DFT simulations are the time-consuming step, whereas the AL overhead is negligible. In summary, our results show how difficult these observables are to predict in general for such complex materials classes, even at ∼50% training set size, and that a naive random approach is insufficient.

Transfer learning synergies
Next, we address the fundamental question to what extent NNs that are trained on the combined NOF data can identify universal quantum-mechanical systematics and thereby utilize the synergies between nitrides, oxides, and fluorides to enhance their prediction accuracy. Simultaneously, this implies that a smaller training set is sufficient to obtain a desired precision, i.e., it enhances the sample efficiency in the spirit of transfer learning. Figures 8(a) and (b) compare the prediction accuracy of NOF NNs of type 1 and type 2 (the latter featuring the BX tensor product embedding), both without atomic properties in the input, to the averaged MAE of three specialized NNs that were trained exclusively on the N, O, and F data. Again, in order to provide a statistically representative impression of the reliability, each curve displays the mean of 25 consecutive and independent AL runs (see above), whereas the colored area visualizes ± the corresponding standard deviation. For lattice parameter prediction [figure 8(b)], the NOF NN exhibits a consistently higher accuracy than the three specialized NNs on average, irrespective of the input encoding and even including the standard deviation. For energy prediction [figure 8(a)], the precision of NN type 1 is better than the N, O, F average below ∼30% training set size, whereas the MAE of NN type 2 is slightly higher. For training set sizes >30%, the two NN types exhibit a quasi-equivalent prediction accuracy, despite the fundamental difference in their architecture and physical motivation. NN type 2 may require slightly more data than NN type 1 to establish an accurate representation of the chemical elements in the more complex BX embedding space, but in turn provides highly interesting insights, as we will discuss below. Figure 8(c) specifically emphasizes the aspect of universality, now focusing exclusively on NNs of type 1, but additionally exploring the impact of the atomic-properties input channel.
From the curves shown in figure 6, we define the following measure to quantify the synergistic effect:

$$\Delta\mathrm{MAE} = \frac{\mathrm{MAE}_{\mathrm{NOF}} - \overline{\mathrm{MAE}}_{\mathrm{N,O,F}}}{\overline{\mathrm{MAE}}_{\mathrm{N,O,F}}},$$

i.e., we use the averaged MAE of the specialized N, O, and F NNs defined above, $\overline{\mathrm{MAE}}_{\mathrm{N,O,F}} = (\mathrm{MAE}_{\mathrm{N}} + \mathrm{MAE}_{\mathrm{O}} + \mathrm{MAE}_{\mathrm{F}})/3$, as reference, and ΔMAE < 0 indicates a superior performance of the NOF NNs. Indeed, the ΔMAE curves displayed in figure 8(c) demonstrate an overall improvement of the prediction accuracy by up to 16% due to unraveled synergies at equivalent numerical effort, in particular for smaller training sets. For lattice parameter prediction, the curves range from −16% to −2% if element embedding alone is employed, and from −8% to +1% if atomic properties are included in the input. For energy prediction, the results based exclusively on element embedding range from −13% to +8%, and from −4.5% to +9% if atomic properties are included in the input. The overall positive slope exhibited by all ΔMAE curves in figure 8(c) indicates that the enhancing impact of synergistic effects reduces as more and more training data becomes explicitly available. For energy prediction, we consistently obtain ΔMAE > 0 beyond 35% training set size. However, the energy curves also offer a different, yet speculative interpretation. The better accuracy of the specialized NNs for larger training sets may trace back to their higher memory-to-data ratio, as they feature the identical architecture as the NOF NNs. If we tentatively eliminate this aspect by considering the final ΔMAE values as reference (reaching up to ∼+8%), which almost aligns the energy curves with the lattice parameter curves, we can identify considerable synergistic effects of up to ∼20% for small training sets also for energy prediction.
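A minimal sketch of this synergy measure (the percentage convention and the function name are illustrative; the input MAE values below are invented toy numbers, not data from figure 8):

```python
import numpy as np

def delta_mae(mae_nof, mae_n, mae_o, mae_f):
    """Relative synergy measure in percent: Delta MAE < 0 means the combined
    NOF network outperforms the average of the three specialized NNs."""
    ref = (np.asarray(mae_n) + np.asarray(mae_o) + np.asarray(mae_f)) / 3
    return 100 * (np.asarray(mae_nof) - ref) / ref

# Toy example at a single training-set size (arbitrary units):
d = delta_mae(0.084, 0.10, 0.11, 0.09)
```

Passing arrays of MAE values over the AL iterations instead of scalars would directly reproduce curves of the kind shown in figure 8(c).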
Another interesting observation is that the impact of synergies reduces to some extent if atomic properties are provided as additional input. This evidences that the atomic radii and the electronegativities constitute a sizable part of the underlying systematics. Specifically, one could naively expect that the lattice parameters are largely determined by the atomic radii (despite the complexity of $c_0^{\mathrm{IL}}$ discussed above; see figure 4). However, particularly for these observables the enhancement is clearly not quenched, which highlights that the AI identifies further, possibly more abstract synergies.
These results imply that the systematics unraveled for one materials class enhance the prediction accuracy for the other two materials classes. This boosted sample efficiency is the very essence of transfer learning. We found that the positive impact of these synergies is largest for smaller training sets. From a practical perspective, this early phase of AL plays an important role, as one seeks to keep the training set (and thus the number of required DFT calculations) as small as possible. This maximizes the advantage and efficiency with respect to a conventional high-throughput approach, where the entire parameter space is explicitly calculated from first principles.

[Figure 9 caption, partly truncated: … [28], their overall topology unambiguously exhibits universal trends. For clarity, blue and violet arrows guide the eye by marking the exemplary elements Hf, Hg, and Rb. Note that here no atomic properties are input to the NNs, so that the AI arranges the chemical elements autonomously in a 16-dimensional vector space without any external a priori knowledge. As indicated by the insets, panels (a) and (b) belong to NN type 1 displayed in figure 5(a), whereas panel (c) refers to NN type 2 displayed in figure 5(b) featuring the BX tensor product embedding (cf table 2). The A-site embedding analysis for NN type 2 is quasi identical to panel (a) and therefore not shown.]

Element embedding analysis
As the final step, we analyze how the NNs perceive the materials data. Figure 9 explores the element embedding vectors that emerged automatically in the two types of energy-prediction NNs during AL training towards ∼50% of the NOF data. We performed a dimensionality reduction via a principal component analysis (PCA [80]), which represents the element vectors from the 16-dimensional embedding space (see above) by points in two dimensions, while simultaneously retaining as much of their variance as possible. The two dimensions are rather of mathematical character and do not necessarily have a physical correspondence. This technique is less complex than t-distributed stochastic neighbor embedding (t-SNE), which we employed recently to analyze the embedding vectors trained on oxides alone [28]. Consequently, the clusters of chemically similar elements in figures 9(a) and (b) are not as pronounced, distinct, and well separated. However, being a linear transformation, the PCA dimensionality reduction has the advantage of modifying any structure in the data rather gently. One can clearly see in the analysis of the A- and the B-site embedding layers [figures 9(a) and (b)] that the overall topology exhibited by the chemical elements unambiguously shows universal trends. For instance, the transition metal cluster shows an evolution from right to left that aligns largely with the periodic table of elements (from Ti, Zr, and Hf to Zn, Cd, and Hg), the lighter 3d elements being organized more towards the center, the heavier 5d elements more at the outside. A distinct cluster of rare-earth metals forms that interestingly also comprises the group-3 elements Sc and Y. Moreover, the alkali metals are consistently located in the top-left corner (e.g., Rb), next to the alkaline earths (e.g., Ca, Sr, and Ba), which appear more centered.
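The PCA projection itself can be sketched in a few lines (the random 16-dimensional vectors below merely stand in for the actual learned embeddings, which are not reproduced here):

```python
import numpy as np

def pca_2d(emb):
    """Project high-dimensional embedding vectors onto their two leading
    principal components via an SVD of the centered data matrix."""
    emb = np.asarray(emb, dtype=float)
    centered = emb - emb.mean(axis=0)       # PCA requires mean-centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T              # 2D coordinates, maximal variance first

# Stand-in for ~60 element vectors from the 16-dimensional embedding space:
rng = np.random.default_rng(0)
points = pca_2d(rng.normal(size=(60, 16)))
```

Because the projection is linear, distances and cluster shapes are distorted only gently, in contrast to the nonlinear t-SNE mapping.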
This universal structure is particularly compelling since (i) the A and the B sites have a highly distinct impact on the compounds' chemistry and (ii) the X site fundamentally modifies this impact, as discussed above. Moreover, we stress that no atomic properties are input to the NNs here, and that the AI is forced to arrange the chemical elements completely autonomously in a 16-dimensional vector space, unbiased by any external a priori knowledge. The NNs develop their individual understanding of the chemical similarity between the different elements, being agnostic about concepts such as the atomic number or the group of a particular element.
An even more fascinating question is how the entangled B and X sites in NN type 2 are interpreted and ordered in their combined embedding space. Albeit rather complex at first sight, the analysis of this BX tensor product embedding displayed in figure 9(c) paints a consistent, yet different and more realistic picture of the (dis-)similarity of the three distinct materials classes than the data clouds in figure 2(a), from which one could naively conclude that the rather stable fluorides and oxides bear higher similarities, whereas the unstable nitrides are consequently different. From top to bottom, figure 9(c) shows an evolution from nitrides over oxides to fluorides, i.e., the materials data is automatically structured into (two to) three distinct clusters. Each of the three materials clusters reflects, in broader strokes and partly modified, internally the universal topology observed in figures 9(a) and (b).
Interestingly, the fluoride cluster is almost entirely separated from the oxides and nitrides, while the oxide and nitride clusters exhibit substantial overlap. This aspect is numerically analyzed in table 2, where we alternatingly compare two of the three clusters. In order to quantify their similarity, which represents the similarity of the respective materials classes, two different measures are employed: the Euclidean distance and the silhouette coefficient.

[Table 2 caption: Numerical analysis of the three materials clusters emerging during the BX tensor product embedding quantifies the (dis-)similarity of nitrides, oxides, and fluorides by employing different techniques. Alternatingly, two of the three clusters are compared. Smaller (larger) Euclidean distances and silhouette coefficients indicate a higher (dis-)similarity. While '16 dim.' denotes that the analysis has been performed in the entire embedding space, '2 dim.' refers to an analysis subsequent to the PCA dimensionality reduction [cf figure 9(c)].]

The (dimensionless) Euclidean distances of the centroids (i.e., the geometric 'centers of mass') clearly show that fluorides are much more different from nitrides and oxides (reaching values ∼1) than oxides and nitrides are from each other (0.616). Simultaneously, fluorides are slightly less distant from oxides than from nitrides. This holds both in the abstract 16-dimensional embedding space as well as in the PCA-reduced two-dimensional space shown in figure 9(c). Moreover, we can infer that the similarity of oxides and nitrides is artificially enhanced by the PCA, as the Euclidean distance is reduced from 0.616 to 0.357. Thus, in 16 dimensions it becomes clearer that three (and not only two) clusters are constructed by the AI. Hence, it is indispensable to concomitantly explore the results obtained from the full 16-dimensional data. A complementary and more involved characteristic is the (averaged) silhouette coefficient [80,81].
It is based on the clusters' tightness and separation and ranges from −1 to +1, where a high positive value is indicative of disentangled and distant clusters, whereas a near-zero value corresponds to overlapping clusters. (Negative values are of minor relevance here, as they arise predominantly when more than two clusters are compared.) Overall, the silhouette coefficient confirms the (dis-)similarities unraveled by the Euclidean distance measure in 16 and two dimensions, and also that PCA tunes the contrast to some extent (table 2). We conclude from these observations that (i) the two-dimensional PCA paints a representative picture of the situation in the abstract 16-dimensional embedding space, and (ii) that fluorides constitute a rather distinct materials class, whereas oxides and nitrides bear higher similarities.
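Both cluster measures can be sketched in plain NumPy as follows (each cluster is assumed to be given as an array of embedding points; scikit-learn's `silhouette_score` provides an equivalent routine for the general multi-cluster case):

```python
import numpy as np

def centroid_distance(a, b):
    """Euclidean distance between the centroids of two point clusters."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def silhouette(a, b):
    """Averaged silhouette coefficient of two clusters: close to +1 for tight,
    well-separated clusters, near 0 for strongly overlapping ones."""
    def mean_dist(p, cluster, exclude_self=False):
        d = np.linalg.norm(cluster - p, axis=1)
        return d.sum() / (len(d) - 1) if exclude_self else d.mean()
    scores = []
    for own, other in ((a, b), (b, a)):
        for p in own:                                 # clusters of size >= 2 assumed
            cohesion = mean_dist(p, own, exclude_self=True)
            separation = mean_dist(p, other)
            scores.append((separation - cohesion) / max(cohesion, separation))
    return float(np.mean(scores))

# Toy clusters far apart in 2D:
a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[10.0, 0.0], [10.0, 1.0]])
d_ab = centroid_distance(a, b)
s_ab = silhouette(a, b)
```

For these well-separated toy clusters the silhouette coefficient approaches +1; moving the clusters on top of each other drives it towards 0, mirroring the contrast between the fluoride cluster and the overlapping oxide/nitride clusters.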

Summary
Combining DFT simulations and AL of element-embedding NNs, we investigated the sample efficiency for the prediction of vacancy layer formation energies and lattice parameters in infinite-layer versus perovskite nitrides, oxides, and fluorides in the spirit of transfer learning. First, we provided a detailed data analysis from different thermodynamic, structural, and statistical perspectives. Subsequently, we showed that NNs model these observables with high precision, using just ∼30% of the data for training and based exclusively on the constituent element names as minimal input devoid of any physical a priori knowledge. Element embedding autonomously arranges the chemical elements with a characteristic and recurrent topology, such that their relations align with the conventional picture of the periodic table. We compared two different embedding strategies and showed that these techniques render additional scalar input such as atomic properties negligible. AL was found to be largely independent of the initial training set and thereby emerged as a reliable and robust algorithm. Additionally, we exemplified its superiority over randomly composed training sets. The present methodology successfully identified fundamental quantum-mechanical universalities between the three materials classes despite their largely distinct chemistry that enhanced the combined prediction accuracy by up to 16% with respect to three specialized NNs that were trained exclusively on the isolated data sets, highlighting an increased sample efficiency. This quantification of synergistic effects provides an impression of the transfer learning improvements one may expect for other materials of similar complexity.
Finally, traditional data analysis and unconventional artificial-intelligence methodology in the form of BX tensor product embedding combined with subsequent quantitative cluster analysis converged on the picture that fluorides constitute a rather distinct materials class, whereas oxides and nitrides bear substantial parallels.