ScGAN: a generative adversarial network to predict hypothetical superconductors

Despite having been discovered more than three decades ago, high temperature superconductors (HTSs) lack both an explanation for their mechanisms and a systematic way to search for them. To aid this search, this project proposes ScGAN, a generative adversarial network (GAN) to efficiently predict new superconductors. ScGAN was trained on compounds in Open Quantum Materials Database and then transfer learned onto the SuperCon database or a subset of it. Once trained, the GAN was used to predict superconducting candidates, and approximately 70% of them were determined to be superconducting by a classification model–a 23-fold increase in discovery rate compared to manual search methods. Furthermore, more than 99% of predictions were novel materials, demonstrating that ScGAN was able to potentially predict completely new superconductors, including several promising HTS candidates. This project presents a novel, efficient way to search for new superconductors, which may be used in technological applications or provide insight into the unsolved problem of high temperature superconductivity.


I. INTRODUCTION
In recent years, superconductors have been applied in a variety of important technologies such as power transmission lines, MRI magnets, Maglev trains, and quantum computers.Quantum computers are especially important as they are expected to solve problems that are too computationally expensive for current classical computers.However, to be used in these applications the superconductors must be cooled below their critical temperatures (T c ), which for most current superconductors are very low.For instance, Google's quantum computer must be maintained at 0.02 K, which severely limits its general use [1].This points to a growing need for superconductors with higher T c , which has made them an active research topic for the last couple of decades.
The mechanisms of superconductivity in most materials with relatively high T c is not fully understood, which means there is no systematic way to search for new materials or to predict their critical temperatures [2].Thus, the current procedure for finding HTSs is essentially trial-and-error, which is extremely inefficient.This was exemplified in a recent study, which found that only about 3% of the approximately 1000 materials surveyed were superconducting [3].Furthermore, the study failed to find any superconductors with T c > 60 K.This extreme inefficiency means that the likelihood of manually finding new superconductors, especially HTSs, is extremely low.
To address this difficulty, the use of computational tools in superconductivity research has become popular in recent years [4].In particular, there have been several studies utilizing machine learning to predict whether a given chemical compound will be superconducting or not.The earliest such study was done by Stanev et al. in which two random forest-based models were built: a classification model for predicting superconductivity and a regression model for predicting superconducting transition temperature [5].The models were successfully applied on the SuperCon database, achieving a 92% accuracy in classification and a R 2 = 0.88 for regression.They did run across one limitation of their machine learning model, however.When trained on a certain class of superconductors (e.g.cuprates), the model was unable to make good predictions on other classes of superconductors (e.g.pnictides).
Following this pioneering study, there have been several other studies applying machine learning methods, such as a K-nearest neighbors algorithm (Ref.[6]) and a deep learning model (Ref.[7]).The K-nearest neighbors algorithm reported improvements on the previous study from Ref. [5] in terms of overall performance: an R 2 of 0.93 and a classification accuracy of 96.5%.The deep learning model, on the other hand, showed that it might be possible to overcome the limitation that Stanev et al. faced, as they were able to make predictions about pnictide superconductors from the training data that did not contain them.
The way each of these previous studies predicted new superconducting materials was by running their trained model on a database of known existing chemical compounds and finding which compounds the model indicated could be superconducting.This procedure has several notable limitations.First, these studies miss out on the vast chemical composition space that is not already contained in existing databases.Second, commonly used databases (such as ICSD and OQMD) contain mostly stoichiometric compounds, whereas many superconductors are non-stoichiometric (cuprates and pnictides, in par-ticular), and so many possibilities were missed in that manner as well (see for example Table 3 in Ref. [5] and Table 1 in Ref. [6]).Finally, the discovery of superconductors in the lab usually does not happen that way; they are discovered by synthesizing new materials and testing them, rather than checking the known ones.
In order to overcome these limitations, in this work we employ Generative Adversarial Networks (GANs) [8].Generative models refer to a general class of machine learning models which are able to generate things that resemble the input dataset.They have proven to be extremely powerful, and in recent years have found numerous applications such as science, engineering, medicine, art, video games, deepfakes, etc. [9].In most cases GANs performed better than other generative models, such as variational autoencoders, because they are able to learn more hidden rules in the input data set.For example, Dan et al. [10] reported a GAN model which generated new chemical compounds with 92.53% novelty and 84.5% validity.Another work, Hu et al. [11] applied GANs to the SuperCon dataset, but for the purpose of characterization rather than prediction.
In this work, we combine the general idea of GANs with the previous superconductor models, and propose the first GAN to predict new superconductors (ScGAN).Our ScGANs are based on chemical composition only, and are able to generate new superconducting materials with high novelty and validity.The paper is organized as follows.In Section II we present the details of the creation of the ScGANs.In section III the main results of the study are discussed.In particular, we present a list of hypothetical superconducting materials generated by our ScGANs, as well as their predicted critical temperatures.Finally, in Section IV we summarize the most important findings.

II. METHODOLOGY
As stated in the introduction, we chose the GAN as our generative model.Its structure is shown in Fig. 1.A GAN is composed of two competing neural networks, the generator and discriminator.The generator takes in random noise and generates a "fake" compound, which the discriminator attempts to determine if it is real or not using existing data of real compounds.Each of them then updates their parameters based on the performance of the discriminator.The two networks improve their performance iteratively until a generator can generate realistic looking compounds, while the discriminator can detect unrealistic looking ones [8].
Fig. 2  is to allow a model to learn even with a limited amount of data, which is the case here as the general compounds dataset (OQMD) has on the order of ∼ 10 6 data points, while the superconductor datasets will only have on the order of ∼ 10 4 data points.

A. Data Collection
Data were sourced from two open-source databases: SuperCon [12] and the Open Quantum Materials Database (OQMD) [13,14].SuperCon is the largest database for superconducting materials with around 30, 000 superconductors before filtering.Similar to what was done in some previous studies [6,15], we only used the chemical compositions of the materials extracted from SuperCon.On the other hand, OQMD is a much larger database with around 10 6 DFT-calculated compounds, most of which are not superconductors.

B. Data Processing
Before filtering the data, all compounds had to be expressed in a common format so that unwanted datapoints could be detected regardless of small differences in formatting across the databases, such as the ordering of elements.First, all datapoints were formatted as 1 × 96 matrices [6] (the 96 is the maximum atomic number present across all the compounds in the database) and then combined so that each dataset was a matrix in R 96×m , where m is the number of data points (see Fig. 3).Then, both datasets were filtered for duplicates, which reduced OQMD to around 800, 000 datapoints.The SuperCon dataset required further processing as a number of entries were incomplete, and they were either corrected or removed, leaving around 16,000 datapoints.Lastly, the SuperCon dataset was split into three different groups (classes): cuprates, pnictides (iron-based) and others (anything else that is neither a curpate nor a pnictide).Table I has the quantitative counts of these different groups.The purpose of this was to test the model's ability to learn the different classes of superconductors, especially the high temperature classes of cuprates and pnictides.Once the data was filtered, the chemical compositions were then transformed into a form for the GAN to train on.A previous study, Ref. [10], used a one-hot encoding for general chemical compounds.However, that encoding was designed for stoichiometric compounds, i.e. for integer values of parameters, and many superconductors are non-stoichiometric, i.e. have decimal compositions.Instead, we propose the use of a "adjusted one-hot" encoding that works for decimals, in which each compound is represented by a 96 × 8 matrix of real numbers (Fig. 4).As shown in the figure, from the 1 × 96 vectors in the columns of Fig. 3, each nonzero component of that vector v i was expanded into an 8-dimensional vector with the following process.First the nearest integer k between 1 FIG. 3. Chemical composition data represented as a matrix [6] in R 96×m , where m is the number of datapoints.Each column is a single compound, with each entry representing the number of each element present in the compound.Note that that the numbers in the matrix are for illustration purposes only; they do not represent any real compounds or superconductors.and 7 inclusive to v i was found.Then, the matrix values were set as where δ here is the Kronecker delta, we index from 0 (top left is A 00 ), and m ranges from 0 to 7. For zero components (v i = 0), a 1 was simply placed at A 0i .We point out that we tested other encodings, such as the one from Ref. [7], but these were susceptible to mode collapse.The encoding proposed in this work successfully encodes decimal values both through the actual matrix entry and its location.

C. Model
A GAN is a type of generative model that has two competing neural networks, a generator and a discriminator, as shown in Fig. 1.Traditional GANs, however, FIG. 4.An "adjusted one-hot" encoding of the chemical composition as a matrix in R 96×8 .The idea here is that the amount of each element is encoded through both the vertical location of the yellow box and the value in it, which allows the GAN to learn the chemical compositions better.
can suffer from issues such as mode collapse and gradient vanishing, so the Wasserstein GAN with Gradient Penalty [16] was used instead.In the Wasserstein GAN with gradient penalty, the loss functions are (2) Here D(x) represents the output of the discriminator and E is the expectation value (average).Then the parameters are updated with an optimizer, where w are the discriminator's parameters, θ is the generator's parameters, and α is the learning rate.RMSProp was chosen as the optimizer [17], after testing out several other options, such as Adam [18].

Training
The model was first trained on the OQMD dataset for 400 epochs.It was then transfer learned onto the Super-Con dataset or a subset of it, on which it would train for another 500 epochs.Transfer learning onto four different datasets (cuprates, pnictides, others, and everything together) resulted in four different versions of the GAN.
Afterwards, the testing procedure and the data analysis were conducted, and then the hyperparameters were updated based on the results.The training curves for the final model on each of these sets are displayed in Fig. 5. Notably, they were all able to converge and stabilize over the 500 epochs.

Testing
After each training process, 5, 000 hypothetical compounds were generated from the versions of the GAN trained on the smaller sets and 30, 000 from the version trained on everything.The generated predictions were then inspected with various quality checks.Each compound was first tested for validity using the charge neutrality and electronegativity check features of the SMACT package [19].The package tests the compound for electronegativity and charge balance to determine whether it is a valid compound or not.Each prediction was then checked for uniqueness-whether the compound showed up earlier in the generated list-and novelty-whether the compound was in the training dataset.These three checks looked at the general quality of the model.Then, more specific to superconductivity, each compound was run through the model from Ref. [6] to check whether it is a superconductor or not, as well as to predict its critical temperature.Of course, to be sure of superconductivity, the compounds must be synthesized and tested.Lastly, the formation energy of each generated compound was calculated using the model from Ref. [20], which indicates the stability of the compound.These tests will be discussed further in Section III, along with the actual results.

D. Clustering
In order to further assess the quality of predictions, clustering analysis was performed.Clustering is an unsupervised machine learning technique whose main goal is to unveil hidden patterns in the data.It was recently applied to superconductors from SuperCon database [15].Depending on the data set, different clustering algorithms were used, such as k-means, hierarchical, Gaussian mixtures, self-organizing maps, etc.The results showed that in case of superconductors, clustering methods can achieve, and in some cases exceed, human level performance.
In order to visualize clustering results, different techniques can be used.It was shown that for superconductors the so-called t-SNE produces the best results [15].t-SNE is a non-linear dimensionality reduction technique which allows higher dimensional data to be represented in 2D or 3D [21].In case of superconductors, the data points are 96-dimensional (each compound is represented by a 1 × 96 matrix), as discussed in Section II B. t-SNE reduces these dimensions down to either two or three, which allows easy visualization.We point out, however, that these reduced dimensions do not have any physical meaning.

III. RESULTS
After training, from the four different versions of the GAN, we generated four superconductor candidate lists with either 5,000 or 30,000 chemical compositions.We then ran the predicted superconductors through a series of tests to evaluate their quality.The first few were general tests, and the rest were in the context of superconductivity using existing computational models, as experimentally testing all of them is unfeasible.

A. Duplicates and Novelty
We first screened the output for duplicates within the generated sets themselves and then for duplicates between the generated set and the dataset of known superconductors that it trained on.The results are tabulated in Table II.As seen in the table, the number of repeats within the generated samples were relatively low (high uniqueness), with the exception of pnictides, which had more duplicates than the rest.This is likely due to the fact that the dataset of pnictides was significantly smaller than the rest (see Table I).We speculate that this overall low rate of duplicates stems from the fact that the model is able to handle decimals (see Section IIB and Fig. 4), which opens up a large composition space for it to explore.The percent of predicted compounds that were novel, i.e. not already known to exist, listed in Table II is also very high across all four GANs.These two results demonstrate that all versions of ScGAN can generate a diverse array of new compositions.

B. Formation Energy
As mentioned in the previous section, we found the formation energies of the predicted compounds from the GANs using ElemNet [20].It was indicated in Ref. [20] that a negative value of formation energy is a good indicator of stability, i.e. the possibility of being synthesized in the lab.In Fig. 6 we display the values of formation energy for all predictions.We see from the distributions of formation energies that most of predicted compounds have calculated formation energies less than zero.Even though this does not provide definitive proof of stability, it is a general indication that most of the predicted compounds are stable and can be synthesized in the lab.

C. Superconductivity
As a next test, we ran the predicted compounds through the K-Nearest Neighbors (KNN) classification model from Ref. [6] for predicting superconductivity based on elemental composition, in order to to check if the predictions of two machine learning models (GAN and KNN) would agree.However, the probabilistic nature of the machine learning model had to be taken into account.If p sc is the proportion of predictions that came up superconducting according to the model, then we can estimate ρ sc , the true proportion of superconducting entries, using Bayesian statistics.Denoting tp and fp as the true positive and false positive rates of the classification model, respectively, we can write Solving for ρ sc gives The true positive and false positive rates here are reported from Ref. [6]: tp = 98.69% and fp = 16.94%.The output percentages along with the estimates of the true proportions calculated from equation 7 are tabulated in Table III.All four GANs achieved very high percentages of superconducting material according to the KNN model, especially when compared to the 3% figure from the manual search in Ref. [3].However, the only definite test of superconductivity can come from experimental measurements.

D. Critical Temperature Estimates
We also calculated the critical temperatures of our predictions using the regression model from Ref. [6].Similar to the superconductivity tests in the previous subsection, these calculated values can only be taken as approximations.However, the regression model can still provide us with a general understanding of the capabilities of Sc-GAN.The critical temperature outputs from the model are summarized in Table IV and the distributions are in shown in Fig. 7.While the distributions are somewhat broad, the GANs were still able to find several superconductors with predicted critical temperatures higher than 100 K, which exceeds the manual search maximum of 58 K (though that search was mostly restricted to pnictides) and several of the previous machine learning approaches [5,6].This is not surprising, as those previous searches were limited to existing databases of stoichiometric compounds, which meant that these forward design approaches could only produce a limited number of candidates, mostly with low critical temperatures.

E. Ability to Learn Features
We then looked at the types of superconductors that were generated by the different versions of ScGAN.The distributions are given in Table V, and we can see that each versions of ScGAN generated mostly superconductors that matched their training data.This indicates the GAN was able to detect the different underlying features behind these different major classes of superconductors.

G. Promising Candidates
After running through the T c prediction model, we manually identified the ones that looked the most promising, including some with very high critical temperatures.It turns out that most of these were cuprates, which is not surprising, considering that superconductors with highest critical temperatures in the training set are cuprates.We then checked the Crystallography Open Database (COD) [22] to see if those compounds were in the database.The ones listed in Table VI could not be found neither in COD nor in SuperCon, showing that this model overcomes the limitations of the previous forward design models and finds completely new superconductors.A more comprehensive list of predictions is available upon reasonable request.

IV. CONCLUSION
For decades the search for new superconductors has relied on the serendipity of material scientists to synthesize a new material with superconducting proprieties.This paper introduced a novel method to search for superconductors-discovering candidates with a generative adversarial network.In contrast to previous computational methods which attempted to predict new superconductors off of existing datasets, this model predicted compounds directly.This "inverse design" approach proved to be far more powerful than manual search methods and previous computational methods, with the model being able to generate thousands of candidates with a wide-range of critical temperatures that lied outside of existing databases (both superconductor and general inorganic compound databases).Even while only training on chemical compositions, more than 70% of the GANs predictions were cross-checked with a sep- arate model to be potentially superconducting (the only way to know for sure, however, would be to synthesize and check these compounds in a lab).Of these, several were promising HTS candidates listed in Table VI.We point out that previous models would have been unable to find such candidates as they were outside of current databases.
While the compounds generated were new, our clustering showed that the GAN did not generate any new families of superconductors.However, it was still able to generate non-stoichiometric compounds, widening the scope of the computational search.Future studies should look into some improvements that can be made, such as being able to account for charge neutrality and crystal structure in the compound encodings.Chemical checks on the predicted compounds (using SMACT as detailed earlier) revealed that while there was electronegativity balance, charge balance was not always exact for the superconductor candidates.Furthermore, crystal structure has been known to play a significant role in superconductivity, so including it in calculations would most likely result in improvements.However, this may prove to be a difficult endeavor as crystal structure is not well-documented in existing databases.Active transfer learning could also be attempted to narrow down the predictions to high temperature superconductors only [23].However, even without these possible improvements, the model at its current version is still very promising and can be applied to search for superconductors, starting with the candidates identified in this paper.

FIG. 5 .
FIG. 5.The generator loss against training epoch for each of the four datasets the GAN trained on: (a) All of SuperCon; (b) Others (i.e.not cuprates or pnictides); (c) Cuprates; and (d) Pnictides.

FIG. 6 .
FIG. 6. Distribution of the formation energies of the predicted compounds from the four versions of the GAN: (a) everything, (b) others, (c) cuprates, (d) pnictides.

F
. Clustering resultsIn Fig.8we display the results of clustering analysis on three sets of predictions: panel (a) for cuprates, panel (b) for pnictides and panel (c) for others.The results are visualized with the help of t-SNE.As pointed out above, the two t-SNE dimensions, Y 1 and Y 2 , do not have any physical meaning.Full circles of different colors represent different families of superconductors from Su-perCon database, whereas purple open circles represent GAN predictions.As can be seen from all three panels, GANs were able to generate new superconductors from all known families of cuprates, pnictides and other superconductors.However, GANs did not predict any new families of superconductors.

FIG. 8 .
FIG. 8. Clustering of the predicted compounds from various versions of the GAN: (a) cuprates, (b) pnictides and (c) others.Full circles represent the data points from SuperCon and purple open circles are GAN predictions.

TABLE II .
The percentage of generated predictions that were novel (not in the training set) and unique (distinct from others in the given generated set) for each of the versions of ScGAN trained on the given training datasets on the left.

TABLE IV .
Summary statistics of the predicted critical temperatures of the generated predictions that were determined by the regression model for different training sets.

TABLE VI .
[6]mising superconductor candidates generated by ScGANs, that do not exist in current databases.Also shown are their predicted critical temperatures[6].