Lightweight and high-precision materials property prediction using pre-trained Graph Neural Networks and its application to a small dataset

Large datasets are essential for building deep learning models. However, generating large datasets at higher theoretical levels and with larger computational models remains difficult due to the high cost of first-principles calculations. Here, we propose a lightweight and highly accurate machine learning approach that uses pre-trained Graph Neural Networks (GNNs) for systems that are industrially important but whose datasets are difficult to scale up. The proposed method was applied to a small dataset of graphene surface systems containing surface defects, and achieved accuracy comparable to a GNN trained from scratch with six orders of magnitude faster training.

Open Catalyst 2020 (OC20), 8) which contains over two million first-principles calculations for catalytic materials, is a prime example of a large dataset that facilitates materials discovery. The release of such huge datasets spurred the development of new GNNs, such as GemNet, 9) PaiNN, 10) and SCN. 11) On the other hand, the cost of training GNNs has been increasing as the datasets have grown and the GNN algorithms have become more complex. In addition, all deep learning methods, including GNNs, require millions to tens of millions of training parameters, so a large dataset is essential for their training. 6,11,12) In materials research, however, the cost of the first-principles calculations needed to create the reference datasets remains high, making it difficult to create large datasets for practically important materials that require large simulation models, such as structures containing defects. 13,14) Similarly, materials that require expensive approximation methods to simulate their electronic states accurately are also difficult to scale up. The graphene surface system is one such example. Graphene and defected graphene have been utilized as very good catalyst supports due to their large specific surface area. 15,16) When dealing with van der Waals (vdW) materials such as graphene, in which dispersion forces make an important contribution to the properties, accurate simulations based on a vdW density functional 17,18) are essential, which is another factor that prevents the dataset from being scaled up. If a GNN is trained on a dataset that is not large enough, the number of training parameters may be too large, resulting in overfitting and limited applicability (namely, poor generalization performance) of the trained GNN model.
When applying deep learning methods to such small datasets, a technique called transfer learning is often used. Transfer learning fine-tunes a pre-trained deep learning model to solve problems in different subtasks, and has proven effective in various fields such as image recognition and natural language processing. 19,20) It has also been applied to materials datasets. 23-25) This study proposes a machine learning method for constructing high-precision models at low cost by utilizing pre-trained GNNs for small material datasets that are important for practical use but difficult to scale up. Specifically, we devised a method to predict rotationally invariant material properties, such as the formation energy, by linear regression: a GNN pre-trained on OC20 is used to obtain a material-specific embedded representation, and each dimension of that representation serves as an explanatory variable. The proposed method thus adapts models trained on a large dataset constructed at a lower theoretical level to a smaller dataset simulated at a higher theoretical level.
The proposed method was validated using monatomic adsorption structures on the graphene and defected graphene surfaces. We found that the proposed method achieves an accuracy equivalent to that of training the GNN from scratch, while training is six orders of magnitude faster. Furthermore, it was revealed that higher accuracy could be achieved when the embedding was generated by simpler GNN algorithms.
A GNN architecture generally consists of three blocks: the embedding block, the interaction block, and the output block. Given this architecture, the embedding vectors obtained from the final interaction block can be regarded as descriptors that fully reflect the local environment of the system as it pertains to the GNN's target variable. Therefore, utilizing these per-atom embedding representations from a pre-trained GNN should be beneficial even for predicting properties of materials in a domain different from the pre-training domain.
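As a concrete illustration of this three-block layout, the following is a minimal sketch in PyTorch; it is a generic message-passing model written for this explanation, not any of the published architectures examined below. The per-atom vectors returned alongside the scalar output correspond to the embeddings from the final interaction block.

```python
# Minimal sketch of the generic three-block GNN layout (illustrative only).
import torch
import torch.nn as nn

class MinimalGNN(nn.Module):
    def __init__(self, num_elements=100, dim=64, num_layers=3):
        super().__init__()
        # Embedding block: atomic number -> initial node feature vector.
        self.embedding = nn.Embedding(num_elements, dim)
        # Interaction blocks: message passing between neighboring atoms.
        self.interactions = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(num_layers))
        # Output block: per-atom readout summed into a scalar (e.g. energy).
        self.output = nn.Linear(dim, 1)

    def forward(self, atomic_numbers, edge_index):
        h = self.embedding(atomic_numbers)          # (n_atoms, dim)
        src, dst = edge_index                       # neighbor list
        for layer in self.interactions:
            m = layer(torch.cat([h[dst], h[src]], dim=-1))
            agg = torch.zeros_like(h)
            agg.index_add_(0, dst, m)               # sum messages per atom
            h = h + torch.tanh(agg)
        # h now holds the per-atom embeddings from the final interaction
        # block, i.e. the descriptors reused by the proposed method.
        return self.output(h).sum(), h
```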
Based on this concept, we propose a methodology for predicting material properties by leveraging the embedding representations of pre-trained GNNs, as schematically illustrated in Fig. 1. The construction of the model is divided into three steps. In the first step, the site-specific embedding vectors of the material are computed from the graph embedded vector obtained from the last interaction block of a GNN pre-trained on OC20. In the present study, previously reported GNN models with different levels of complexity were examined: CGCNN, 4) SchNet, 26) DimeNet++, 27) SpinConv, 28) PaiNN, 10) and eSCN 29) (from simple to complex). In the second step, the material-specific embedded representation is computed by averaging the atomic embedding vectors obtained in the previous step over the entire material; the per-atom feature vectors are stacked and each column component is averaged. This averaging allows simple methods such as linear regression to make predictions regardless of the number of atoms in the material. The dimensionality of the final material-specific embedded representation equals the dimensionality of the graph embedded vector of the pre-trained GNN. In the last step, using the obtained pairs of embedded representations and the material property values to be predicted as the training data, the parameters of a linear regression are fitted using each dimension of the embedded representation as an explanatory variable. With this configuration, if the GNN used to obtain the embedded representation is invariant with respect to translation, rotation, and site permutation, the output of the linear regression also satisfies these invariance conditions.
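The three steps can be summarized in a few lines of Python. The sketch below assumes a hypothetical helper, get_atom_embeddings(structure), that returns the (n_atoms, dim) output of the last interaction block of a pre-trained GNN; everything else uses standard NumPy and scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def material_embedding(structure, get_atom_embeddings):
    """Steps 1-2: per-atom embeddings, mean-pooled into a fixed-length vector."""
    h = get_atom_embeddings(structure)   # (n_atoms, dim), hypothetical helper
    return h.mean(axis=0)                # (dim,) regardless of atom count

def fit_property_model(structures, targets, get_atom_embeddings):
    """Step 3: linear regression with each embedding dimension as a feature."""
    X = np.stack([material_embedding(s, get_atom_embeddings) for s in structures])
    y = np.asarray(targets)              # e.g. formation energies
    return LinearRegression().fit(X, y)
```

Because mean pooling is permutation invariant, the composed model inherits whatever invariances the underlying GNN embedding satisfies.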
The role of the embedded representation here is to represent the information of the environment surrounding each atom in a material as a fixed-length vector. The use of pre-trained GNNs eliminates the need for the manual adjustments required by hand-crafted descriptors. Since GNNs have already proven useful as material graph transformers, 30) we expected the pre-trained GNN parameters to extract a better representation than structural descriptors. Because the GNNs were trained on OC20, which contains 55 elemental species, the number of applicable elemental species is large enough, and we considered them appropriate for transfer learning on the surface system dataset. Conventional structural-descriptor approaches, in contrast, face a challenge when applied to a general-purpose representation of diverse systems in which various elemental species may be present, owing to the combinatorial increase in representation dimensionality with the number of elemental species of interest.
Using the proposed method, prediction models were constructed with a graphene surface monatomic adsorption structure dataset as the reference. 38) To treat the surface adsorption accurately, rev-vdW-DF2 18) was used as the vdW density functional, because it shows the best accuracy for the simulation of various two-dimensional materials. 39) The details of generating the dataset are described in the Supplementary data. The reference dataset contains structures of elements up to the third period of the Periodic Table and major noble metals, such as Ag, Au, and Pt, adsorbed on graphene. It consists of approximately 100 structures for each adsorbed element, for a total of 3,617 structures.
The structures obtained after structural relaxation, with adatoms just above the three high-symmetry adsorption sites on the graphene surface (Bridge, Hollow, and OnTop, shown in Fig. 2), were used as test structures, while structures in which the adsorbed atom sits sufficiently far above the surface were used as training and validation data. The adsorption energy and formation energy were estimated from the calculated total energies; the details of these simulations are described in the Supplementary data. To avoid data leakage, only structures at least 1 Å away from the adsorption distance after structural relaxation at each site were used as training and validation data. By dividing the data by adsorption distance in this way, the physical properties of the relaxed material are predicted from structures with larger adsorption distances, which is equivalent to the problem of predicting the physical properties after structural relaxation from the structure before relaxation. 8) The resulting data were split into 3,007 training, 493 validation, and 117 test structures.
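A hedged sketch of this leakage-avoiding split is shown below; the field name adsorption_offset is a hypothetical stand-in denoting how far (in Å) an adatom sits from its relaxed adsorption distance at that site.

```python
def split_by_adsorption_offset(entries, threshold=1.0):
    """Relaxed-site structures become test data; structures at least
    `threshold` angstroms from the relaxed adsorption distance become
    training/validation data (hypothetical field names)."""
    train_val, test = [], []
    for entry in entries:
        if entry["adsorption_offset"] >= threshold:
            train_val.append(entry)   # far from the relaxed geometry
        else:
            test.append(entry)        # at or near the relaxed site
    return train_val, test
```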
In addition to adsorption on the "perfect" graphene surface, we also investigated adsorption on the "defected" graphene surface because of its importance. These structures, with Ag, Au, and Pt adsorbed on the surface defects, were used to check the extrapolative prediction ability of the present models.

Fig. 1. Schematic diagram of the lightweight machine learning method proposed in this study. First, an embedding vector representing the local environment of each atom is computed from the structure using the pre-trained GNNs; its dimension corresponds to the dimensionality of the graph embedded vector of the pre-trained GNNs. Then, the embedding vectors for all atoms are averaged column-wise over the entire system to generate a material-specific embedded representation. Finally, a linear regression using each dimension of the embedded representation as an explanatory variable predicts rotationally invariant material properties.
First, we tested the prediction of the formation energies for the graphene surface system dataset using various training methods and different embedded-representation generation methods, and compared model performance in terms of the mean absolute error (MAE). With respect to the training method, the following five methods with different pre-trained/re-trained levels were examined; their levels are schematically shown in Fig. 3.
(1) Re-trained From Scratch: the GNNs were trained from scratch using the graphene surface system dataset.
(2) Re-trained Output: only the parameters of the output block of the pre-trained GNNs were re-trained on the graphene surface system dataset.
(3) No Re-trained: the pre-trained GNNs were used directly, without re-training, to predict the test data for the graphene surface systems.
(4) Linear: the parameters of a linear regression were optimized by the least-squares method after obtaining an embedded representation from the pre-trained GNNs.
(5) Linear with SGD: the parameters of a linear regression were optimized by stochastic gradient descent (SGD) after obtaining an embedded representation from the pre-trained GNNs.

For Re-trained From Scratch and Re-trained Output, the neural network weight parameters were optimized by Adam 40) with a learning rate of 0.01, using the mean squared error per batch between the first-principles and predicted values as the loss function. Training was stopped when the MAE on the validation dataset did not improve for 20 epochs, and the learning rate was decayed by a factor of 0.5 when the validation MAE did not improve for 5 epochs (sketched below). In both cases, training used the same hyperparameters and network structure as the pre-trained GNNs. In other words, the network structures of No Re-trained, Re-trained From Scratch, and Re-trained Output are identical; only the parameters of the re-trained part differ.
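The re-training schedule described above corresponds to a standard PyTorch loop, sketched here under the assumption of generic model and data-loader objects; evaluate_mae is a hypothetical helper that returns the validation MAE.

```python
import torch

def finetune(model, train_loader, val_loader, evaluate_mae, max_epochs=1000):
    # Adam at lr 0.01 with per-batch MSE loss, as described in the text.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    # Halve the learning rate after 5 epochs without validation improvement.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5)
    loss_fn = torch.nn.MSELoss()
    best_mae, stale_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
        val_mae = evaluate_mae(model, val_loader)
        scheduler.step(val_mae)
        if val_mae < best_mae:
            best_mae, stale_epochs = val_mae, 0
        else:
            stale_epochs += 1
            if stale_epochs >= 20:   # early stopping after 20 stale epochs
                break
    return model
```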
For Linear with SGD, the weight parameters of the linear regression were optimized by stochastic gradient descent with the squared error between the first-principles and predicted values as the loss function. Hyperparameters such as the regularization term, the learning rate, and the number of iterations were optimized using Bayesian optimization. 41) For Linear, the weight parameters of the linear regression were determined by the least-squares method using all the training data.
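The Linear with SGD variant can be prototyped with off-the-shelf tools; the sketch below pairs scikit-learn's SGDRegressor with Optuna as a stand-in Bayesian optimizer, since the paper does not specify its tooling. The tuned hyperparameters and their ranges are illustrative assumptions.

```python
import optuna
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

def tune_linear_sgd(X_train, y_train, X_val, y_val, n_trials=100):
    def objective(trial):
        model = SGDRegressor(
            penalty="l2",
            # Regularization strength, learning rate, and iteration count
            # are tuned, mirroring the hyperparameters named in the text.
            alpha=trial.suggest_float("alpha", 1e-8, 1e-1, log=True),
            eta0=trial.suggest_float("eta0", 1e-5, 1e-1, log=True),
            max_iter=trial.suggest_int("max_iter", 100, 5000),
        )
        model.fit(X_train, y_train)
        return mean_absolute_error(y_val, model.predict(X_val))

    study = optuna.create_study(direction="minimize")  # TPE-based Bayesian search
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```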
The reason for using SGD-based optimization in addition to simple least-squares linear regression is that SGD is empirically known to be an effective estimator for small datasets. 42) With respect to the method used to generate the embedded representation, we used GNNs pre-trained on the OC20 dataset 8) (CGCNN, 4) SchNet, 26) DimeNet++, 27) SpinConv, 28) PaiNN, 10) and eSCN 29)). All of the pre-trained GNNs are available from the OCP repository. 43) All of the results are summarized in Table I.
The columns of Table I represent the pre-trained GNN architectures; the further to the right, the higher the accuracy on OC20 after pre-training. First of all, it is clear that the No Re-trained method, namely the direct use of the pre-trained GNN models, could not achieve sufficient accuracy for the graphene surface systems. This is reasonable, since the OC20 dataset includes neither graphene surface systems nor data that properly handle the vdW interactions.
Furthermore, when the embedded representation was generated using the pre-trained CGCNN and linear regression was performed, the accuracy was comparable to that achieved when the CGCNN was trained from scratch: CGCNN-Re-trained From Scratch reached an MAE of 268 meV, while CGCNN-Linear with SGD and CGCNN-Linear achieved 376 meV and 374 meV, respectively. Surprisingly, when combined with linear regression, the GNNs with simple algorithms showed better accuracy than those with advanced algorithms; for instance, eSCN-Linear showed a poor accuracy of 2,250 meV, whereas the simple models CGCNN-Linear and SchNet-Linear with SGD achieved better accuracies of 374 meV and 588 meV, respectively. This indicates that the simpler GNN models, such as CGCNN and SchNet, are more suitable as pre-trained models.
We found that although the GNNs with advanced algorithms, such as eSCN and PaiNN, were accurate on OC20, they were not accurate when transferred with linear regression to the subtask of predicting the present systems. This is likely because the advanced GNNs overfit OC20 and fail to correctly represent the present systems, which are not included in OC20. The differing numbers of embedding dimensions may also contribute to this error, but it is highly likely that accuracy tends to be higher with the simpler GNNs.
Figure 4 also shows the prediction results for the formation, adsorption, and Fermi energies using the proposed CGCNN-Linear method. It is clear that the proposed lightweight method is able to predict all of these properties with high accuracy. The training times are compared in Table II. CGCNN-Linear was found to be six orders of magnitude faster than CGCNN-Re-trained From Scratch. This gap is more pronounced for newer GNNs; in the case of eSCN, the training time differs by eight orders of magnitude between eSCN-Re-trained From Scratch and eSCN-Linear.
Finally, we examined the "extrapolation" capability of each training method and pre-trained GNN by predicting the formation energy of noble metals adsorbed at graphene defects, which are not included in the training dataset. Since the GNNs with advanced algorithms were found to be unsuitable as pre-trained models, only the simple models, CGCNN and SchNet, were examined for the defect case.
The prediction results for the defect cases are shown in Table III. For these extrapolative structures, a lightweight model that generates the embedded representation with SchNet and performs linear regression is more accurate than a model that learns the GNN from scratch. In particular, SchNet-Linear with SGD achieved a notably high accuracy of 307 meV, whereas the same GNN trained from scratch, SchNet-Re-trained From Scratch, gave an almost three-fold worse MAE of 892 meV. In other words, training a GNN from scratch on a small dataset with only about 4,000 training structures leads to overfitting and degrades the generalization performance. This result indicates that combining pre-trained models with lightweight machine learning methods is effective for improving accuracy and generalization performance even on a small dataset.
In this study, we described in detail how to construct a lightweight and highly accurate model using pre-trained GNNs. We confirmed that the proposed method achieves the same accuracy as training a GNN from scratch, with a learning time six orders of magnitude shorter. The proposed method and the model built from the small dataset were then applied to the extrapolative defect systems. While a large dataset is essential for increasing the versatility of GNNs, the present method could construct a model capable of a certain degree of extrapolative prediction even when trained on a small dataset.
The lightweight and accurate model construction method proposed in this study provides a guideline for building machine learning models for complex structures for which it is difficult to construct large datasets. Furthermore, by utilizing pre-trained GNNs, we can dramatically reduce the training time, which has been the bottleneck of GNNs. In the future, we would like to verify the dependence on the dataset used for pre-training the GNNs and to expand the range of applicable materials.

Table III. Extrapolative predictions for graphene surface-defect adsorption structures using models trained on the graphene monatomic adsorption system. For the lightweight models, only CGCNN and SchNet were validated for generating the embedded representations, since they showed particularly good prediction accuracy on the test data of the graphene monatomic adsorption system. The values in the table are MAEs of the formation energies (in meV). Bold type denotes the method with the highest accuracy.

Fig. 3. Schematic diagram of each learning method. $h_i^{(L)}$ represents the embedding vector of node $i$ output from the interaction block of the $L$-th layer, and $\hat{y}$ represents the output of the model. $n$ represents the number of atoms in the material. The details of methods (1)-(5) are described in the text.

Notes to Table II: Re-trained From Scratch and Re-trained Output were trained on a GPU (GeForce RTX 3090). Linear and Linear with SGD were computed on a single core of an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz. For Linear with SGD, the recorded training time covers the hyperparameter search, with the model optimized up to 100 times; for Linear, it is the time taken to optimize the parameters by the least-squares method.

Fig. 4. The formation energy, adsorption energy, and Fermi energy predicted by the CGCNN-Linear method, compared with those obtained by the DFT simulations.

Table I. MAE of the formation energy prediction (in meV) for the test data by the different embedded-representation generation methods and training methods, together with the number of embedding dimensions for each method. Results in bold indicate our methods that achieve accuracy comparable to re-training the GNNs from scratch. The six columns on the right correspond to the GNN methods; the newer algorithms toward the right of the table provide better prediction accuracy on the OC20 dataset.

Table II. Training time for each training method and embedded-representation generation method. All values in the table are in seconds.