Graph attention neural networks for mapping materials and molecules beyond short-range interatomic correlations

Bringing advances in machine learning to chemical science is revolutionizing the way materials discovery and atomic-scale simulations are accelerated. Currently, most successful machine learning schemes can be largely traced to the use of localized atomic environments in the structural representation of materials and molecules. However, this may undermine the reliability of machine learning models for mapping complex systems and describing long-range physical effects because of the lack of non-local correlations between atoms. To overcome such limitations, here we report a graph attention neural network as a unified framework to map materials and molecules into a generalizable and interpretable representation that combines local and non-local information of atomic environments across multiple scales. As an exemplary study, our model is applied to predict the electronic structure properties of metal-organic frameworks (MOFs), which have notable diversity in compositions and structures. The results show that our model achieves state-of-the-art performance. Clustering analysis further demonstrates that our model enables high-level identification of MOFs with spatial and chemical resolution, which would facilitate the rational design of promising reticular materials. Furthermore, the application of our model to predicting the heat capacity of complex nanoporous materials, a critical property in carbon capture processes, showcases its versatility and accuracy in handling diverse physical properties beyond electronic structures.


Introduction
The past few years have witnessed a surge of interest in applying machine learning (ML) technologies to power many aspects of both computational chemistry and materials science [1-3]. For example, ML techniques open new avenues for constructing complicated potential energy surfaces from quantum-mechanical data in an automated fashion, tackling long-standing computational challenges (e.g. the realistic modelling of chemical reactions or of complex materials and interfaces) that are inaccessible to either poorly transferable empirical force fields or computationally demanding ab initio methods [4-8]. Moreover, ML approaches are revolutionizing the discovery and design of materials and molecules at an astounding rate through direct in silico screening and statistical analysis of massive chemical data sets [9-15].
The representation of materials and molecules is a crucial ingredient in constructing effective ML models dedicated to statistical property regression, clustering of chemical structures, or visualization of material phase space in a low-dimensional manifold. Conventional ML models adopt hand-crafted descriptors that encode the raw information about atomic systems (such as the chemical nature and coordinates of each atom) into a suitable representation with physical symmetry, such as the widely used smooth overlap of atomic positions (SOAP) [6, 16], the Coulomb matrix [17], atom-centered symmetry functions [5, 18], and composition-based features [19]. Arguably, the design of effective descriptors usually requires both considerable domain expertise and human effort, which proves challenging for an immense number of materials with complex structures and diverse compositions.
Recently, tremendous attention has been devoted to the development of deep learning (DL) models that automatically discover flexible representations of materials and molecules with minimal human intervention. Notably, DL takes raw data directly as inputs, from which complex and abstract material representations can be learned by a series of hierarchically nested neural networks [20]. To date, a number of DL models have been proposed to address problems in materials and chemical science, including the deep tensor neural network (DTNN) [21], ANI-1 [22], the crystal graph convolutional neural network (CGCNN) [23], SchNet [24], PhysNet [25], and MolNet-3D [26]. These deep models have shown strong and flexible capability to represent complex systems (such as protein-like compounds [25] and drug-like molecules [27]) and generally outperform conventional ML methods in predicting various quantum-chemical properties of small organic molecules, crystals, disordered materials, and surfaces [28-31].
Aside from powerful representation learning, the idea of locality undoubtedly lays a solid foundation for current state-of-the-art (SOTA) ML-based potentials and property regression schemes. Locality, supported by the principle of electronic nearsightedness [32, 33], is associated with the description of atom-centered short-range chemical environments to infer complex many-body interactions, and it renders ML models interpretable, scalable, and robust for extensive properties [34, 35]. In the context of locality, most current SOTA models only build global representations as a collection of atom-centered local environments and neglect long-range interatomic correlations. However, this could significantly undermine the reliability of ML schemes when long-range interactions beyond the cutoff radius (such as electrostatics and van der Waals dispersion) dominate the properties of systems like ionic solids and electrolyte solutions [36-38]. Moreover, encoding only local information may lose global shape characteristics and prominently weaken pattern recognition for complex systems with wide structural and configurational diversity [29, 39, 40]. Indeed, there is a great and increasing demand for ML to accelerate the discovery of complex materials. One class of representative materials is metal-organic frameworks (MOFs), which have great potential in many applications, such as gas storage and separation, sensing, thermoelectrics, catalysis, and photovoltaics [41-44]. Notably, over 80 000 nanoporous MOFs have been synthesized over the past decade by assembling organic linkers and metal clusters [43, 45], but these are just a small part of the myriad of possible MOF structural motifs that could be realized [46]. This results in an urgent need for ML in this area. However, the overwhelming chemical space, plus the large number of atoms in MOF structures, makes it difficult to obtain a high-level global representation that incorporates only short-range interatomic correlations.
In this work, we introduce a unified graph attention neural network (GANN) architecture that captures local and non-local features of atomic environments, aiming to provide a deep and high-level representation of complex materials and molecules. Our model is assessed by predicting the electronic bandgaps of MOFs from their crystal structures based on a recent quantum-chemical database for MOFs [40], through comparison with density-functional theory (DFT) calculations and prior SOTA models. We further illustrate how the learned information-rich representations can be used in high-fidelity chemical clustering to sharply narrow down the candidate space for fast searching of desired materials. Finally, the versatility, data effectiveness, and accuracy of our model are demonstrated by tackling a more challenging task with a limited training dataset, i.e. the prediction of the heat capacity of a diverse range of nanoporous materials, including MOFs, covalent-organic frameworks (COFs), and zeolites.

Graph attention neural networks
Figure 1 depicts the proposed GANN architecture. Inspired by the interpretability, generalizability, and remarkable performance of deep graph networks in predicting material properties [28, 30, 47-49], GANN receives the atomistic structures of materials through a graph-based descriptor in which the atoms, and the bonds that connect them, are regarded as the nodes and edges of the graph, respectively. Following a routine protocol in graph-based models [23, 47], the initial node attributes are F-dimensional one-hot encodings of the chemical properties of the elements, independent of the atomic environments for now. The edge attributes between nodes i and j are encoded as a set of Gaussian-expanded distances whose components take the form

$$ g_{ij}(\mu_k) = \exp\left[-\frac{(r_{ij}-\mu_k)^2}{\gamma^2}\right], \qquad k = 1, \ldots, K, $$

where $r_{ij}$ is the interaction distance between atoms i and j, $\mu_k$ are the centers of the Gaussian basis functions distributed up to the cutoff radius $R_c$, $\gamma$ is an adjustable parameter specifying the width of the Gaussian basis, and K is the number of bond features. Herein, an undirected graph $G_i = (\Gamma_i, g_i)$ describes the local configuration of atom i, where $\Gamma_i = \{v_1, v_2, \ldots, v_{N_{li}}\} \subseteq \mathbb{R}^{F \times N_{li}}$ and $g_i = \{g_{i1}, g_{i2}, \ldots, g_{iN_{li}}\} \subseteq \mathbb{R}^{N_{li} \times K}$ are the collections of node and edge attributes in the local graph, respectively, and $N_{li}$ denotes the number of neighbors of the ith atom within the cutoff radius $R_c$ or, usually, a fixed number of nearest neighbors chosen to save computer memory, especially for unit cells of large size. The whole system is naturally described by an undirected multigraph consisting of all the local graphs. A graph convolutional neural network module is built to update the atomic embeddings by passing messages from neighbors and bonds. Distinct from previous SOTA deep graph networks such as CGCNN [23] and SchNet [24], we introduce new bond convolution (BondConv) operations to directly extract interaction features from all the bonds emanating from each atom. The updated atomic embeddings are produced by a channel-wise symmetric aggregation operation, namely max pooling, which preserves permutational invariance. In the first convolutional layer (l = 0, where l denotes the current layer), the BondConv operation combines the atomic features and bond interactions through the learnable weight parameters $(\varphi_1, \ldots, \varphi_M, \theta_1, \ldots, \theta_M, \phi_1, \ldots, \phi_M)$. In the subsequent convolutional layers, we change the BondConv to a simpler but equally effective form. Both forms can be implemented by a shared multilayer perceptron, guaranteeing permutational invariance with respect to the ordering of neighbors. Similar convolutional operations have been used successfully in visual tasks on 3D point clouds [50], but have not yet been applied to atomic structures. Finally, the graphs are updated through each convolutional layer of the network, embedding increasingly more information about the local environments into the atomic features. Meanwhile, besides the aforementioned permutational invariance, the outputs are also strictly invariant to translation and rotation because atom-centered descriptors and only pairwise distances are used in the networks.
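To make the graph construction concrete, the following minimal sketch (in PyTorch, not the authors' code) builds Gaussian-expanded edge attributes and a fixed-size neighbor list for each atom; the evenly spaced Gaussian centers and the simple non-periodic distance matrix are illustrative assumptions.

```python
import torch

def gaussian_expand(r_ij, r_cut=8.0, n_features=40, gamma=0.5):
    """Expand interatomic distances r_ij (any shape) into K Gaussian basis features.

    Assumes the Gaussian centers mu_k are evenly spaced on [0, r_cut];
    gamma controls the width of each Gaussian.
    """
    mu = torch.linspace(0.0, r_cut, n_features)             # centers mu_k
    return torch.exp(-((r_ij.unsqueeze(-1) - mu) ** 2) / gamma ** 2)

def build_local_graphs(positions, r_cut=8.0, max_neighbors=12):
    """Return neighbor indices and Gaussian-expanded edge attributes per atom.

    positions: (N, 3) Cartesian coordinates. A simple O(N^2) distance matrix is
    used here for clarity; periodic images are omitted in this sketch.
    """
    dist = torch.cdist(positions, positions)                 # (N, N) pairwise distances
    dist.fill_diagonal_(float("inf"))                        # exclude self-interactions
    d_nn, idx_nn = dist.topk(max_neighbors, largest=False)   # nearest neighbors per atom
    edge_attr = gaussian_expand(d_nn, r_cut=r_cut)           # (N, max_neighbors, K)
    return idx_nn, edge_attr
```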
To address the loss of long-range information correlations between atoms in previous local graph representations, we introduce the self-attention mechanism into our architecture. The self-attention mechanism was originally proposed in the Transformer architecture to boost the performance of neural machine translation and the speed of model training [51]. The biggest benefit of self-attention is that it processes word sequences in parallel and captures dependencies between words regardless of their distance in the input sequence. We note that a few studies [52-55] have recently introduced self-attention to map the space of chemical reactions from text-based representations, namely SMILES [56, 57], but to the best of our knowledge none involves mapping atomic configurations.
We now illustrate how self-attention is implemented in our model. Let the node-level embeddings outputted by the graph networks be $G_{em} \in \mathbb{R}^{N \times d_g}$, where N is the number of atoms in the system and $d_g$ is the feature dimensionality. Regardless of the graph structure, the node set of the graph can serve as an N-component, out-of-order sequence. To compute self-attention, three matrices (the query Q, key K, and value V, as defined in the original literature [51]) need to be created by linear transformations of the input features $G_{em}$ as follows:

$$ Q = G_{em} W_Q, \qquad K = G_{em} W_K, \qquad V = G_{em} W_V, $$

where $W_Q$, $W_K$, and $W_V$ denote the shared learnable linear transformation matrices and $d_a$ is the dimension of the query and key vectors. With the dot product between the query and key matrices, we can evaluate the attention weights of any local atomic environment against itself and each of all the others in the whole system, over arbitrarily long distances. The attention weight matrix A takes the form

$$ A = Q K^{\mathsf{T}}. $$

The attention weights determine how relevant the information of a certain local atomic environment is to that of the other local and non-local ones, from which long-range correlations of information can be built. To make the gradients more stable and all attention weights positive, the attention weights are further scaled by a factor $1/\sqrt{d_a}$ and normalized by a softmax operation:

$$ \tilde{A} = \mathrm{softmax}\!\left(\frac{A}{\sqrt{d_a}}\right). $$
The output $F_{sa}$ of the self-attention layer is obtained by summing up the weighted value vectors:

$$ F_{sa} = \tilde{A}\, V. $$

Here, multiplying each value vector by the softmax weights automatically drowns out irrelevant information between local atomic environments and keeps intact the information that is worth attending to. As all operators in self-attention are independent of the order and size of the inputs, this endows our model with strict permutational invariance and scalability. The self-attention module is shown in figure 1(c). In fact, there are many available variants of self-attention that can be used to enhance the model. In this work, we employ an evolved self-attention module, offset-attention, to replace the original one; it shares a similar conception with the Laplacian operator used in graph convolution networks [58]. One can refer to the recent literature for details on offset-attention [59].
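The scaled dot-product attention above, together with the offset-attention variant, could be sketched as follows. This is an illustrative PyTorch implementation under the stated equations; the residual linear-ReLU block used for offset-attention is an assumption based on the cited point-cloud literature [59], not the authors' exact layer.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over a set of atomic embeddings.

    Operates on G_em of shape (N, d_g); permuting the N atoms only permutes
    the rows of the output, so the layer is order-independent.
    """
    def __init__(self, d_g=512, d_a=512, offset=True):
        super().__init__()
        self.w_q = nn.Linear(d_g, d_a, bias=False)   # query transform W_Q
        self.w_k = nn.Linear(d_g, d_a, bias=False)   # key transform W_K
        self.w_v = nn.Linear(d_g, d_g, bias=False)   # value transform W_V
        self.offset = offset
        self.post = nn.Sequential(nn.Linear(d_g, d_g), nn.ReLU())

    def forward(self, g_em):
        q, k, v = self.w_q(g_em), self.w_k(g_em), self.w_v(g_em)
        a = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # A / sqrt(d_a)
        a = torch.softmax(a, dim=-1)                       # normalized attention weights
        f_sa = a @ v                                       # weighted sum of value vectors
        if self.offset:
            # offset-attention: pass the difference between input and attention
            # output through a small network and add it back (residual form)
            return g_em + self.post(g_em - f_sa)
        return f_sa
```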
After the raw descriptors flow through the stack of graph and self-attention layers, local and non-local information is extracted hierarchically over large length scales, eventually arriving at a high-level, global embedding of the atomic structure. Finally, the non-linear mapping from atomic structures to material properties is established by three fully connected hidden layers.
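As a rough sketch of how the per-atom features are reduced to a property prediction, the snippet below pools the stacked block outputs into a single global vector and passes it through three fully connected layers; the mean pooling and the concatenated input width are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class GANNHead(nn.Module):
    """Sketch of reducing per-atom embeddings to a global vector and a property.

    The pooling operation (mean over atoms) and the layer widths are
    illustrative assumptions, not the authors' exact choices.
    """
    def __init__(self, d_in=512, d_hidden=(1024, 512), n_out=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * d_in, d_hidden[0]), nn.ReLU(),
            nn.Linear(d_hidden[0], d_hidden[1]), nn.ReLU(),
            nn.Linear(d_hidden[1], n_out),
        )

    def forward(self, atom_feats):
        # atom_feats: (N, 3 * d_in) per-atom features collected from the stacked
        # graph and self-attention blocks (e.g. by concatenation)
        global_emb = atom_feats.mean(dim=0)   # size-invariant pooling over atoms
        return self.mlp(global_emb)
```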

Quantum MOF (QMOF) database
MOFs are a class of promising porous materials, and their fascinating aspects are their synthetic versatility, chemical tunability, and stability. The isoreticular principle enables the size of MOFs to vary over a wide range without changing their underlying topology [60]. For instance, the tunable pore aperture and surface area of MOFs can range from <10 Å to ∼100 Å and from 1000 to 10 000 m² g⁻¹ [42, 61], respectively. This allows one to fine-tune the structures of these materials with respect to selectivity and activity [62]. To date, more than ten thousand kinds of MOFs have been synthesized by assembling organic linkers (benzene-1,4-dicarboxylate, 2,5-dihydroxybenzene-1,4-dicarboxylate, biphenyl-4,4′-dicarboxylate, etc) and metal clusters (Mn, Fe, Co, Cu, Zn, Ni, etc) [45, 46]. The QMOF database is a collective MOF subset of the Cambridge Structural Database [45] and the 2019 CoRE MOF database [63]. All crystal structures in the QMOF database are experimentally synthesized [64-66]. Compared with previous MOF databases, such as the OQMD [67] and the CoRE MOF database [68], the QMOF database notably provides an important electronic structure property of MOFs, the bandgap $E_g$ in eV. The bandgap is an excellent indicator for classifying MOFs into metals or semiconductors. Given that the majority of MOFs are electrical insulators, it is essential to identify metallic MOFs, or those with low bandgaps, in order to expand the applications of MOFs into (opto)electronic devices and to reveal novel quantum-chemical insights into MOFs [42, 69, 70]. The color-coded periodic table (figure 2(a)) shows the 78 chemical elements covered in the QMOF database. The violin plots (figure 2(b)) further illustrate the statistical distribution of sizes per primitive unit cell in the QMOF database. The multiple chemical elements (figure 2(a)) associated with the diverse structures (figure 2(b)) show the complexity of MOF chemistry and make the QMOF database an excellent target for assessing the generality of our GANN model.

Learning bandgaps of MOFs
We now evaluate the performance of the GANN model on the QMOF benchmark set and make a comprehensive comparison with other common ML models. Here, we divide the benchmark models into two categories: classical machine learning (CML) models with hand-crafted descriptors and deep learning (DL) models with learnable representations. The original work on the QMOF database [40] provides benchmarks for one DL model (i.e. CGCNN) and five CML models on the QMOF-2 dataset. The five CML models are constructed using the same kernel ridge regression method but different descriptors, i.e. the Sine Coulomb matrix [71], 'Stoichiometric-45' (SM-45) [72], 'Stoichiometric-120' (SM-120) [19], the orbital field matrix [73], and SOAP. Additionally, SchNet, a representative graph-based DL model, is also included as a comparison benchmark. For direct comparison with the benchmarks from the QMOF database, we also train GANN and SchNet on the same QMOF-2 dataset, with the mean absolute error (MAE) and the Spearman rank-order correlation coefficient (ρ) as joint metrics to quantitatively gauge the performance of the different models. The dataset is randomly split into 80% for training, 10% for validation, and 10% for testing. The random splitting is repeated for five parallel runs, over which the statistics of the MAEs and ρ on the testing sets are obtained.
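This evaluation protocol can be summarized in a short script like the following sketch, where `train_and_predict` is a placeholder for fitting the model on the training/validation sets and returning test-set predictions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split

def evaluate_split(structures, targets, train_and_predict, seed):
    """One random 80/10/10 split; returns test-set MAE and Spearman rho.

    targets: numpy array of reference values; `train_and_predict` is a
    placeholder callable standing in for model training and inference.
    """
    idx = np.arange(len(targets))
    train, rest = train_test_split(idx, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    y_pred = train_and_predict(structures, targets, train, val, test)
    mae = np.mean(np.abs(y_pred - targets[test]))
    rho, _ = spearmanr(y_pred, targets[test])
    return mae, rho

# statistics over five parallel runs:
# maes, rhos = zip(*[evaluate_split(X, y, train_and_predict, s) for s in range(5)])
# print(np.mean(maes), np.std(maes), np.mean(rhos))
```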
As shown in figure 3(a), on the one hand, SOAP achieves the best performance among the CML models. This indicates that descriptors sensitive to atomic structures and chemical elements are essential for improving model performance. In fact, SOAP has been proven to perform as well as other SOTA models in building machine-learning potentials for systems containing a few elements [74]. On the other hand, the DL models (CGCNN, SchNet, GANN) substantially outperform all CML models in predicting the properties of structurally complex and elementally diverse MOFs. The three DL models simply take raw molecular graphs consisting of atomic attributes and Gaussian bases as inputs, yet they achieve SOTA performance. This highlights the strength of DL models in learning complex and abstract molecular representations from simple inputs, which can significantly reduce the human effort spent designing suitable descriptors. Remarkably, our GANN model outperforms the prior SOTA DL models, i.e. CGCNN and SchNet. This shows that our proposed graph convolutional operators are effective at capturing the configurational information of local atomic environments and at embedding atoms in a suitable way. In addition, the introduction of the self-attention mechanism builds strong correlations between atoms over long distances and incorporates more non-local information into the global representations, thus boosting the accuracy of the GANN model. Figure 3(b) provides a correlation plot of the quantities predicted by the GANN model. We further train and evaluate GANN on the QMOF-3 dataset, which contains larger unit cells with up to 500 atoms. Although the structures of the MOFs become more complex, the added data greatly assist the learning. Eventually, we observe a 5% decrease in testing-set MAEs when training is moved from the QMOF-2 dataset to the QMOF-3 dataset. These results demonstrate the generality and extensibility of the GANN model.
The latent features inside DL models usually serve as the final learned representations of molecules and materials.
Gaining insights into the learned latent space is essential for efficient data mining and analysis, which is most often related to rational materials design and accelerated materials discovery. We use the unsupervised t-distributed stochastic neighbor embedding (t-SNE) [75] technique to project the high-dimensional latent features into a 2D space for visualization. As illustrated in figure 4, the distribution of MOF structures in the 2D latent representation space shows a discernible pattern with respect to property values. MOFs with high and low bandgaps are located in distinctly different regions. This shows that the representations learned by our GANN model lead to groupings of MOFs that share similarities in their atomic structures, elemental compositions, and chemical properties. These findings collectively substantiate that the representations generated by the GANN model are sensitive to the atomic structures, elemental compositions, and chemical properties of materials. This powerful tool enables one to identify promising underlying topologies and to infer local chemical trends with respect to various elemental compositions. Eventually, rational materials design can be achieved within a candidate space that has been sharply narrowed down.
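A minimal sketch of this latent-space visualization is given below, assuming the learned per-structure embeddings and bandgap labels are already available as arrays; the t-SNE hyperparameters shown are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_space(latent_features, bandgaps):
    """Project high-dimensional GANN embeddings to 2D with t-SNE and color
    the points by their DFT bandgap.

    latent_features: (n_structures, d) array of learned representations.
    bandgaps:        (n_structures,) array of property values for coloring.
    """
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(latent_features)
    sc = plt.scatter(coords[:, 0], coords[:, 1], c=bandgaps, s=8, cmap="viridis")
    plt.colorbar(sc, label="bandgap (eV)")
    plt.xlabel("t-SNE 1")
    plt.ylabel("t-SNE 2")
    plt.show()
```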

Learning heat capacity of nanoporous materials
Heat capacity is a fundamental and important physical property of nanoporous materials. For instance, it directly affects the energy efficiency of carbon capture processes using temperature swing adsorption [80]. However, there is a significant knowledge gap in understanding how heat capacity relates to the atomic structure of nanoporous materials, underscoring the need for fast and accurate methods to predict the heat capacity of these materials. In this regard, the GANN model is further applied to predict the heat capacity of nanoporous materials, showcasing its versatility in handling different tasks. Here, we employ a dataset [80] of ∼230 structures with different chemical environments to train and test our model. In addition to MOFs, this dataset includes experimentally synthesizable COFs and zeolites. Moreover, the specific heat of each structure in the dataset has been labeled using DFT. To facilitate the analysis, the dataset is also randomly split into 80% for training, 10% for validation, and 10% for testing. The overall performance of the GANN model is shown in figure 5. Surprisingly, with merely around 180 training data points, our model achieves a high accuracy, with an MAE of 0.039 J g⁻¹ K⁻¹ on the testing set. This highlights the data effectiveness of our model. Additionally, it suggests that a joint graph and attention mechanism can effectively embed the structures into a high-level representation, which is advantageous for enhancing the predictive capability of ML models.

Conclusions
In summary, we have proposed a multiscale graph attention neural network architecture for hierarchically learning deep representations of materials, aiming to address the loss of long-range correlations between atoms in present local graph representations. The introduction of the self-attention mechanism enables our model to capture multi-scale characteristics of systems over potentially long distances on top of the local graph. Meanwhile, thanks to the attention mechanism, this method retains highly parallelizable efficiency during training and prediction. The SOTA performance achieved by our model in predicting the quantum-chemical properties of MOFs demonstrates its generality and extensibility for complex materials with large unit cells and widely diverse structures and compositions. Moreover, the latent space analysis substantiates the high fidelity of our model in the chemical clustering of materials that share chemical and structural similarities. The successful application of our model to the prediction of the heat capacity of nanoporous materials further shows its versatility and data effectiveness. In conclusion, our model makes it possible to accelerate the screening and design of complex materials by quickly predicting material properties and by exploring the latent representations to gain more chemical insights and useful knowledge with which to sharply narrow down the search space.

Training details
The GANN architecture mainly consists of graph convolutional layers, self-attention layers, and fully connected layers. The graph part contains three convolutional layers, each followed by batch normalization and ReLU activation. These three layers progressively increase the atomic feature dimension from 32 to 64, 128, and 512 dimensions. Subsequently, a single-headed attention layer with 512-dimensional channels follows. The final part of the architecture consists of three fully connected layers, each except the last followed by batch normalization and ReLU activation. The dimensions of these three layers are as follows: the first maps from 1536 to 1024 dimensions, the second reduces this from 1024 to 512 dimensions, and the last maps the 512-dimensional features to the final output channels. All weights and other learnable parameters of the networks are iteratively updated using mini-batch stochastic gradient descent, with a batch size of 16 and the Adam optimizer, by minimizing a loss function defined as the mean squared error (squared L2 norm) between the predicted properties and the reference data computed by DFT. The whole training process is conducted over 200 epochs, allowing the model to adequately learn and adjust its parameters for optimal performance. All training tasks in this work are carried out using PyTorch.
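Following the training details above, a bare-bones training loop might look like the sketch below; the learning rate and the exact data-loader interface are assumptions, while the optimizer, loss, batch size, and epoch count follow the text.

```python
import torch
import torch.nn as nn

def train_gann(model, train_loader, val_loader, epochs=200, lr=1e-3, device="cuda"):
    """Mini-batch training with Adam and an MSE loss, as described above.

    The learning rate is an assumed value; the batch size (16) is set when the
    DataLoader is constructed; `model` stands in for the GANN module.
    """
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()   # squared L2 norm between prediction and DFT reference
    for epoch in range(epochs):
        model.train()
        for graphs, targets in train_loader:
            graphs, targets = graphs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(graphs), targets)
            loss.backward()
            optimizer.step()
        # simple validation pass to monitor generalization
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(g.to(device)), t.to(device)).item()
                           for g, t in val_loader) / len(val_loader)
        print(f"epoch {epoch + 1}: validation MSE = {val_loss:.4f}")
    return model
```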

Figure 1.
Figure 1. Illustration of our graph attention neural networks for mapping materials and molecules. (a) Architecture of the overall GANN model. GANN receives the atomistic structures of materials through graph-based descriptors in which the atoms, and the bonds that connect them, are regarded as the nodes and edges of the graph, respectively. After the raw descriptors flow through a stack of graph blocks and self-attention blocks, local and non-local information is extracted hierarchically over large length scales, eventually arriving at a high-level, global embedding of the atomic structure. (b) Architecture of the GCNN blocks. (c) Architecture of the self-attention blocks. (d) Schematic of hierarchically mapping molecules by the GANN model. (e) Evaluation of the attention weights of any local atomic environment (LAE) against itself and each of all the others in the whole system, over long distances, by the scaled dot-product attention operators.

Figure 3.
Figure 3. Performance of the GANN model on the QMOF-2 dataset. (a) Comparison of the MAEs and ρ in the bandgaps between the GANN model and prior benchmark models. All models have been divided into two categories: classical machine learning (CML, blue boxes) models with hand-crafted descriptors and deep learning (DL, yellow boxes) models with learnable representations. Note that all results presented here are the average of five parallel runs, with standard deviations illustrated as error bars. (b) Comparison of DFT-computed (from the QMOF-2 database) and GANN-predicted bandgaps of MOFs on a test set.

Figure 5.
Figure 5. Comparison of DFT-computed and GANN-predicted heat capacities of nanoporous materials. The DFT data are obtained from the literature [80].