End-to-end AI framework for interpretable prediction of molecular and crystal properties



August 2023
Abstract. We introduce an end-to-end computational framework that allows for hyperparameter optimization using the DeepHyper library, accelerated model training, and interpretable AI inference. The framework is based on state-of-the-art AI models including CGCNN, PhysNet, SchNet, MPNN, MPNN-transformer, and TorchMD-NET. We employ these AI models along with the benchmark QM9, hMOF, and MD17 datasets to showcase how the models can predict user-specified material properties within modern computing environments. We demonstrate transferable applications in the modeling of small molecules, inorganic crystals, and nanoporous metal-organic frameworks with a unified, standalone framework. We have deployed and tested this framework on the ThetaGPU supercomputer at the Argonne Leadership Computing Facility and on the Delta supercomputer at the National Center for Supercomputing Applications to provide researchers with modern tools to conduct accelerated AI-driven discovery in leadership-class computing environments. We release these digital assets as open source scientific software on GitLab, together with ready-to-use Jupyter notebooks in Google Colab.

Introduction
With the explosion of AI models [1][2][3][4][5] developed to predict various material properties in recent years, it has become difficult to keep track of the available AI models and the datasets used for their training and inference. Numerous efforts [6,7] have been made to integrate AI models and their associated datasets in one place, streamlining their use for a wide range of applications and a broad community of users [8][9][10]. In the best scenario, AI models and datasets are available through open repositories so that a user can download, deploy, and reproduce their putative capabilities. Unfortunately, this is a time-consuming and laborious process, which can be further complicated when the tools and libraries used to develop the AI models are unavailable, deprecated, or incompatible with the computing environments of new users. Furthermore, most existing packages specialize in predicting quantum mechanical properties of small molecules, and few of them support crystals. To address these shortcomings, here we report the construction of a computational framework that consolidates libraries, AI models, and AI interpretability tools to study molecules, crystals, and metal-organic frameworks. The framework enables hyperparameter tuning through the open source library DeepHyper [11], model training, and interpretable inference of small-molecule quantum mechanics (QM) properties from public datasets such as QM9 [12], and of crystal properties from datasets such as hMOF [13]. Key aspects of this computational framework include:

Novel features of AI models. The node and edge embedding schemes of two graph neural network models, PhysNet [2] and CGCNN [5], were modified from the original adjacency matrix format to an adjacency list format to reduce redundant information and enable faster training. We also adapted small-molecule property prediction models to take in crystal structures as input, such as the crystal version of SchNet [1].
Transferable AI applications. We demonstrate the transferability of the learned force fields by training the TorchMD-NET [3] model on selected molecular dynamics (MD) trajectory data of a given set of molecules in the MD17 dataset [14], and then using it to perform MD simulations of similar molecules. In particular, we show that a model trained on ethanol is transferable to both n-propanol and iso-propanol, and that a model trained on uracil is transferable to pyrimidine and naphthalene. All results are automatically logged to Weights and Biases (WandB) [15], a machine learning tracking tool, for simple access.
Interpretable AI inference. Interpretation methods provide a novel pathway to deepen our understanding of the structure-property relationships of materials. Previous work [16] has shown great success in applying interpretation methods, including Excitation Backpropagation [17], CAM [18], Grad-CAM [19], and Contrastive Gradient [20], to identify key functional groups in molecules that contribute to toxicity. Moreover, the UMAP method has been applied to effectively visualize the distribution shift of sampled molecules on a 2D plane with and without transfer learning [21]. To better understand model predictions, we provide two interpretation methods to explain the learned features. The first, Grad-CAM, highlights the atoms of a molecule that are most significant for the model's prediction. The second, uniform manifold approximation and projection (UMAP) [22], maps the last hidden layer of the model into a three-dimensional space. In this way, clusters of molecules with similar properties become easier to identify and interpret.
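As a concrete illustration, a Grad-CAM-style per-atom attribution can be sketched in a few lines. The `model.last_conv` attribute and single-graph `data` object below are illustrative assumptions rather than part of a specific model in our suite:

```python
import torch

def grad_cam_atom_scores(model, data):
    """Grad-CAM-style per-atom relevance for a GNN property prediction.

    Assumes `model.last_conv` is the final graph-convolution module
    (hypothetical attribute) producing node features of shape
    [num_atoms, hidden_dim], and that `model(data)` returns a scalar property.
    """
    cache = {}

    def hook(module, inputs, output):
        output.retain_grad()  # keep gradients of the node activations
        cache["h"] = output

    handle = model.last_conv.register_forward_hook(hook)
    model(data).backward()    # gradient of the prediction w.r.t. activations
    handle.remove()

    h, g = cache["h"], cache["h"].grad
    weights = g.mean(dim=0)                        # channel-wise importance
    scores = torch.relu((h * weights).sum(dim=1))  # per-atom relevance
    return scores / (scores.max() + 1e-12)         # normalize to [0, 1]
```

The normalized scores can then be mapped back onto the atoms of the molecular graph to produce highlights like those shown later in Figure 8.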
This AI suite and scientific software are released with this manuscript, and may be found in GitLab [23]. To facilitate the use of these resources, we have prepared Jupyter notebooks in Google Colab [24], which have been tested independently by researchers not involved in this work to ensure that these resources are easy to follow and use. To ensure that all the resources used in this article are self-contained, we have also published the datasets used for these studies in Zenodo [25]. We expect that this collection of state-of-the-art graph neural networks, transformer models, and analysis methods for small molecules and crystals will empower AI practitioners to seamlessly perform hyperparameter optimization, accelerated training, and interpretable AI inference in modern computing environments with a unified, standalone computational framework.

Related work
Graph neural networks have shown great success in modeling molecular and crystal structures. For small molecules, a suite of models has been proposed, including DimeNet [4], GemNet [26], SphereNet [27], ComENet [28], SchNet [1], and PhysNet [2]. These models take atomic coordinates and atomic numbers as input, and represent atoms as nodes and bonds as edges. Typical target properties for these models are QM properties of molecules such as internal energy, heat capacity, and zero point vibrational energy (ZPVE). For crystal structures, periodic boundary conditions need to be considered, and therefore crystal graph representations are typically used. Example graph neural networks that take crystal structures as input include ALIGNN [29], CGCNN [5], and MEGNet [30]. These models first extract crystal graphs from the structures, then generate atom and edge embeddings for the center atoms and their neighbors. The node and edge information is then updated via message passing (see the sketch below). The target properties for these models are typically QM properties of crystals, e.g., formation energy and band gap. The growing number of graph neural networks available for this purpose underscores the need for an end-to-end AI framework. Previous efforts toward such a goal have typically lacked one or more important capabilities. For example, MatDeepLearn [7] integrates a suite of graph neural networks, including CGCNN, MEGNet, MPNN [31], GCN [32], and SchNet. Although it can be used for hyperparameter tuning, model training, and inference, it lacks an explainability feature, which limits the chemical insight that can be extracted from the results. Another example is Dive into Graphs (DIG) [6], which enables model training and explanation. However, it does not allow for hyperparameter tuning, so only models with preset hyperparameters can be used. A complete package offering all of the aforementioned functionalities is therefore needed. Our AI framework also offers the functionality to perform MD simulations for small molecules, enabled by TorchMD-NET, an SE(3)-equivariant transformer-based interatomic potential model that establishes a relationship between atomic configurations and potential energies and forces. The MD trajectories of selected molecules taken from the MD17 dataset were used for training the TorchMD-NET models.
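As a schematic of the message-passing update common to these architectures (a generic sketch, not the exact layer used by CGCNN, MEGNet, or ALIGNN), consider the following:

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One generic message-passing step over a molecular or crystal graph.

    Node features h: [num_nodes, d]; edge_index: [2, num_edges] in
    adjacency-list (COO) form; edge features e: [num_edges, d_e].
    """
    def __init__(self, d, d_e):
        super().__init__()
        self.msg = nn.Linear(2 * d + d_e, d)  # message from (h_src, h_dst, e)
        self.upd = nn.GRUCell(d, d)           # node update from aggregated messages

    def forward(self, h, edge_index, e):
        src, dst = edge_index
        m = torch.relu(self.msg(torch.cat([h[src], h[dst], e], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum messages per node
        return self.upd(agg, h)

# Example: 4 atoms, 3 undirected bonds stored as 6 directed edges.
h = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
e = torch.randn(6, 8)
h_new = SimpleMessagePassing(16, 8)(h, edge_index, e)
```

Stacking several such layers lets each atom aggregate information from progressively larger neighborhoods before a readout layer predicts the target property.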

Methods
Here we describe the key building blocks of our general-purpose AI framework: (i) it provides built-in datasets and neural networks that we modified to take in adjacency-list-format node and edge embeddings, a more memory-efficient scheme than the adjacency matrix format; (ii) it enables distributed hyperparameter tuning of neural networks via the scalable and computationally efficient library DeepHyper; (iii) model training and interpretable inference are performed by specifying a few command line arguments; (iv) results are auto-logged to WandB, a machine learning tool for easy tracking and visualization; and (v) molecular dynamics simulations can be performed for small molecules using TorchMD-NET models trained with MD trajectories from the MD17 dataset, enabled by the Atomic Simulation Environment (ASE) library [33]. This framework has been deployed and tested on leadership computing platforms to reduce the overhead for researchers who require access to hyperparameter tuning, model training, and explainable inference tools in a single, unified framework. Below we describe each of these components in further detail.
Hyperparameter tuning. Hyperparameter tuning is performed with the DeepHyper library [11]. In this method, hyperparameters of interest are given prior distributions, and their posterior distributions are adjusted by the Centralized Bayesian Optimization (CBO) algorithm with a given acquisition function and surrogate model. The graph neural networks in this framework are coupled with DeepHyper to enable faster hyperparameter tuning.
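A minimal sketch of this workflow, assuming DeepHyper's documented quickstart API (module paths can vary across DeepHyper versions, and `run` below is a dummy stand-in for actual model training):

```python
from deephyper.problem import HpProblem
from deephyper.evaluator import Evaluator
from deephyper.search.hps import CBO

# Define prior ranges for the hyperparameters of interest.
problem = HpProblem()
problem.add_hyperparameter((16, 256), "batch_size")            # uniform integer
problem.add_hyperparameter((1e-5, 1e-2, "log-uniform"), "lr")  # log-scaled float
problem.add_hyperparameter(["true", "false"], "amp")           # categorical

def run(config):
    # Stand-in for real training; return the objective to MAXIMIZE.
    # We use the negative validation error, so values closer to 0 are better.
    val_error = (config["lr"] - 1e-3) ** 2
    return -val_error

# Centralized Bayesian Optimization with a random-forest surrogate model
# and the upper-confidence-bound acquisition function.
evaluator = Evaluator.create(run, method="serial")
search = CBO(problem, evaluator, surrogate_model="RF", acq_func="UCB")
results = search.search(max_evals=100)
```

In production, the `run` function trains one of the graph neural networks with the sampled configuration and the evaluator is parallelized across GPUs.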
Datasets. The QM9 and MD17 datasets were used as input to the graph neural networks.

Node and edge embedding schemes. Instead of using the original adjacency matrix format for node and edge embeddings, we modified it to an adjacency list format. The term embedding, for both molecular and crystal graphs, refers to the information attached to a node (an atom) or an edge (a bond). Both node and edge embeddings can be scalars, vectors, or higher order tensors. Node embeddings encode information such as mass, charge, and orbital hybridization, whereas edge embeddings encode information such as interatomic distance and bond order. Depending on the model architecture, some embeddings are physics- or chemistry-based while others are learned. Physics- or chemistry-based embeddings encode fixed information such as hybridization, mass, atomic radius, and whether the fragment is part of an aromatic ring. Learned embeddings, on the other hand, are iteratively optimized by a neural network model via stochastic gradient descent.

In the adjacency matrix format, edge embeddings are encoded into a fixed-size matrix whose size is determined by the largest molecule in the dataset; the representations of all other molecules are padded to the same dimension. Each element of the adjacency matrix indicates whether the two corresponding nodes are connected, as determined by a distance-based criterion. Since padding is applied to smaller molecules, users need to know a priori the largest molecule size and then mask out the padding values, which can be burdensome for GPU memory. The adjacency list format, by contrast, preserves only the information for connected atoms, thereby avoiding the need for padding and taking less memory to load. In this case, faster loss convergence and higher prediction accuracy are expected. The adjacency list format has been implemented in a number of Python libraries such as Deep Graph Library (DGL) [34] and PyTorch Geometric [35]. In this work, we use the CGCNN model as an example to demonstrate a boost in model training performance when an adjacency list format is used in place of an adjacency matrix format (a minimal comparison of the two formats is sketched below).

Our AI framework allows users to perform hyperparameter tuning, model training, and interpretable inference with pre-trained models, or to train new models, with a few arguments passed. The main improvement over previously proposed general-purpose machine learning model libraries is the explainability feature, which consists of two parts. First, by extracting high-dimensional hidden-layer information from the learned models and projecting it onto low dimensions via the UMAP method, we can effectively visualize the clustering of molecules, following similar practice as in [36][37][38]. Second, the Saliency Map [39], CAM, and Grad-CAM methods are used to highlight important atoms in molecular graphs, as described in [40].
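The difference between the two formats can be made concrete with a small NumPy sketch; the coordinates and cutoff below are toy values for illustration:

```python
import numpy as np

def adjacency_matrix(pos, n_max, cutoff=1.8):
    """Padded adjacency matrix: fixed (n_max, n_max) size regardless of
    molecule size, so smaller molecules carry zero padding."""
    n = len(pos)
    A = np.zeros((n_max, n_max), dtype=np.float32)
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    A[:n, :n] = (d < cutoff) & (d > 0)  # distance-based connectivity
    return A

def adjacency_list(pos, cutoff=1.8):
    """Adjacency (edge) list in COO format, as used by DGL and PyTorch
    Geometric: only connected pairs are stored, no padding needed."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    src, dst = np.where((d < cutoff) & (d > 0))
    return np.stack([src, dst])  # shape [2, num_edges]

# Toy 3-atom fragment in a dataset whose largest molecule has 29 atoms.
pos = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [1.1, 1.2, 0.0]])
print(adjacency_matrix(pos, n_max=29).size)  # 841 entries, mostly padding
print(adjacency_list(pos).size)              # 2 * num_edges entries only
```

For a dataset whose largest molecule has 29 atoms, as in QM9, the padded matrix always stores 841 entries per molecule, whereas the adjacency list stores only the edges that actually exist.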

Results
Below we present a comprehensive analysis of our results, from hyperparameter optimization to interpretable AI inference.

Hyperparameter Optimization
The DeepHyper library is used for hyperparameter optimization of the graph neural networks. DeepHyper is easy to use and can be readily deployed on GPU-based high-performance computing platforms; CPUs can be used if GPUs are not available. If the user has access to multiple GPUs, the GPU option is chosen automatically, with each worker performing hyperparameter search using the CBO algorithm, given an acquisition function such as the upper confidence bound and a surrogate model, e.g., random forest. The list of hyperparameters considered in this work, along with their ranges, is summarized in Table 1. Hyperparameter optimization results for PhysNet with ZPVE as the target property are shown in Tables 2 and 3. It is worth mentioning that, since DeepHyper tries to maximize the objective of the search, the negative of the validation error was used as the objective; an objective value closer to zero therefore corresponds to a better combination of hyperparameters. The hyperparameter tuning results for PhysNet with HOMO as the target property are shown in Tables A1 and A2. For hyperparameters with integer or floating-point values, the ranges represent the lower and upper bounds; for the hyperparameters amp and optimizer, the ranges represent all available options. Among the hyperparameters, agb, or accumulated grad batches, helps overcome memory constraints; amp, or automatic mixed precision, speeds up neural network training; and gradient clip, a technique in which gradients are rescaled when their norm exceeds a chosen threshold, is known to stabilize neural network training by avoiding sudden changes in parameter values (the exploding gradient problem) [41].

The optimal hyperparameter combinations found by DeepHyper are listed in the top rows of Tables 2 and 3, with the optimal objective being -0.9226. We notice that this set of hyperparameters includes f32 precision (amp="false"), a standard learning rate (0.00296), and a low gradient norm clipping value (0.00245). These result in small gradient accumulation, which may help mitigate sudden gradient updates.

We have tested multiple sets of hyperparameters with varying ranges and prior distributions. Our hyperparameter tuning configuration input file is prepared in YAML format. Discrete hyperparameter values, such as the number of epochs and the batch size, are sampled from uniform distributions, whereas continuous hyperparameters, such as the learning rate and the gradient clip, are sampled from normal distributions with or without log scaling. The ranges of the hyperparameters and the prior distributions used for sampling are both user-customizable. Once the hyperparameter configuration and the prior distributions are in place, DeepHyper can use multiple GPUs to perform hyperparameter tuning, taking full advantage of GPU parallelization. All optimization results are saved and automatically logged to Weights and Biases. If the tuning step is interrupted, it can be resumed from the last saved checkpoint by specifying the --resume flag.

For the PhysNet model with HOMO and ZPVE as target properties, we compared the hyperparameter tuning performance of DeepHyper with a naive algorithm that randomly selects hyperparameters. Since DeepHyper utilizes the CBO algorithm to optimize hyperparameters, the target property values are used for decision making.
For the naive algorithm, however, hyperparameters were randomly selected from the hyperparameter grid in Table 1. A total of 20 models were trained for 30 epochs with hyperparameters given by the two methods. The distributions of the losses (mean squared error) are compared in Figure 1, and the metrics are summarized in Table 4.

Model uncertainty quantification was performed for the PhysNet model with HOMO and ZPVE as target properties. Five PhysNet models were initialized with weights drawn from different random seeds. The optimal hyperparameter combinations found by DeepHyper in Section 4.1 were used, and the models were trained for 100 epochs to ensure convergence of the loss function. Figure 3 shows that PhysNet makes consistent predictions regardless of the random initial weights: the standard deviations of the losses for the five models with HOMO and ZPVE as target properties are 0.0379 eV and 3.9646e-05 eV, respectively, and the mean absolute errors are comparable to those reported in the literature [42].
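The random-seed ensembling used here can be sketched as follows; the small regressor and synthetic data stand in for PhysNet and QM9, since only the seeding logic and the spread computation are the point:

```python
import numpy as np
import torch
import torch.nn as nn

def train_eval_with_seed(seed, X, y, epochs=100):
    """Train a small regressor from a given random seed and return its MAE.
    A stand-in for retraining PhysNet from different initializations."""
    torch.manual_seed(seed)  # controls the random weight initialization
    model = nn.Sequential(nn.Linear(X.shape[1], 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.l1_loss(model(X), y)
        loss.backward()
        opt.step()
    return nn.functional.l1_loss(model(X), y).item()

# Synthetic data; in practice, molecules and their HOMO/ZPVE labels.
gen = torch.Generator().manual_seed(123)
X = torch.randn(256, 8, generator=gen)
y = X.sum(dim=1, keepdim=True)

maes = [train_eval_with_seed(s, X, y) for s in range(5)]
print(f"MAE mean={np.mean(maes):.4f}, std={np.std(maes):.4f}")
```

A small standard deviation across seeds, as reported above for PhysNet, indicates that the predictions are robust to the choice of initial weights.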
Key findings: Our suite of AI models provides state-of-the-art results. Novel features that we added to the models, such as attention in the MPNN-transformer, further improve model performance. We have also demonstrated that hyperparameter optimization leads to stable, statistically robust AI predictions.

Model improvement via modified node and edge embedding schemes
We modified the node and edge embedding schemes of the CGCNN model from the original adjacency matrix format to an adjacency list format. There are two main advantages to the adjacency list format. First, compared to the adjacency matrix format, it takes up less memory for loading, which speeds up model training. Second, the redundant information (zero padding) in the representation is removed, resulting in higher training accuracy and stability. As an example, CGCNN models with the two embedding schemes were trained on a subset of the hMOF database [43] containing 5,000 randomly selected MOF structures along with their CO₂ working capacities at 2.5 bar. The 5,000 MOFs were split into an 80% training set, a 10% validation set, and a 10% test set. The mean absolute error on the test set as a function of training steps is shown for both models in the left panel of Figure 4; to smooth out local fluctuations, a thirty-point moving average was applied to both curves. The modified CGCNN model achieved faster convergence, higher training stability, and a lower mean absolute error than the original model. The right panel of Figure 4 shows that the modified CGCNN model predicts CO₂ working capacity with an R² score of 0.93 and a mean absolute error of 0.53 mmol/g (a minimal sketch of this evaluation is provided at the end of this subsection).

To better understand the predictive performance of the modified CGCNN model, we benchmarked it against two recently proposed machine learning models for predicting the CO₂ working capacity of MOFs, namely ALIGNN [44] and a random forest regressor [45]. When trained on the entire hMOF dataset, ALIGNN predicts CO₂ working capacity at 2.5 bar with a mean absolute error of 0.48 mmol/g [44]. ALIGNN uses both normal-graph and line-graph embedding schemes for training and inference. For the normal graph, it uses physical and chemical features as node embeddings and interatomic distances as edge embeddings; for the line graph, distances correspond to nodes whereas bond angles correspond to edges. This scheme, however, can cause occasional CUDA memory issues, and the training batch size may have to be reduced (64 in [44]; 32 in our independent experiment), hence slower training. Our modified CGCNN model, by contrast, takes only crystal structures as input (i.e., atomic species and Cartesian coordinates) and accommodates a larger batch size (256 in our model).
The random forest regression model takes in topological, structural, and word-embedding features as input. When trained on the entire hMOF dataset, it achieves an R² score of roughly 0.95 and an RMSE of roughly 0.65 for the prediction of CO₂ working capacity at 2.5 bar. In other works, extensive physical and chemical featurization schemes were used [46][47] to featurize the MOF structures, whereas our modified CGCNN model captures MOF information solely from atomic species and coordinates. It is worth noting that we trained the modified CGCNN model on 5,000 randomly selected MOFs instead of all of the structures; a lower prediction error is therefore expected if the model is trained on the entire hMOF dataset. Overall, the modified CGCNN model achieves competitive predictive performance compared to state-of-the-art machine learning models.

Key findings: Adopting the adjacency list format for the node and edge embedding scheme improves the predictive capabilities of our modified CGCNN model. When performing inference on a test set of 500 MOFs from the hMOF dataset, we found that our modified CGCNN model provides state-of-the-art predictions for CO₂ working capacities at 2.5 bar.
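For completeness, here is a minimal sketch of the evaluation reported in this subsection, i.e., the thirty-point moving average used to smooth the test-error curves and the MAE and R² test metrics. The arrays below are synthetic stand-ins for the quantities plotted in Figure 4:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def moving_average(curve, window=30):
    """Thirty-point moving average used to smooth the test-error curves."""
    kernel = np.ones(window) / window
    return np.convolve(curve, kernel, mode="valid")

# Synthetic stand-ins: a noisy test-error curve over training steps, and
# predicted vs. true CO2 working capacities (mmol/g) on a held-out set.
steps = np.linspace(0, 1, 500)
test_error_curve = np.exp(-3 * steps) + 0.05 * np.random.rand(500)
smoothed = moving_average(test_error_curve)

y_true = np.random.uniform(0, 10, 500)
y_pred = y_true + np.random.normal(0, 0.5, 500)
print(f"MAE = {mean_absolute_error(y_true, y_pred):.2f} mmol/g")
print(f"R^2 = {r2_score(y_true, y_pred):.2f}")
```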

Transferable AI Applications
Molecular dynamics simulations of two sets of small molecules were performed to demonstrate the transferability of TorchMD-NET: from ethanol to n-propanol/iso-propanol, and from uracil to pyrimidine/naphthalene. In Figure 6, within each set, the TorchMD-NET model trained on MD trajectories of the molecule on the left was used to perform MD simulations of the molecules on the right. The NVE ensemble was used, in which the total number of particles in the simulation box and the box volume are fixed and the total energy is conserved. For all molecules, the timestep and the total simulation time were chosen to be 0.1 fs and 10 ps (100,000 timesteps), respectively.

Figure 7 shows that the C-C and C-O bond length distributions of ethanol, n-propanol, and iso-propanol have similar means, whereas the latter two have a larger spread. It is worth noting that for n-propanol, the C-C bond closer to the oxygen atom has a length distribution similar to that of ethanol, which is expected because their local environments are similar. For bond angles, the C-C-O bond angle distribution of ethanol exhibits two peaks, whereas the other two have only one peak. For uracil, pyrimidine, and naphthalene, the C-C bond length distribution of naphthalene is shifted to a higher range compared to the other two, which may be due to the absence of N atoms in its ring structure. The C-N bond length distributions of uracil and pyrimidine have similar means, whereas the latter has a larger spread. Similarly, we observe comparable C-C-C bond angle distributions in uracil and pyrimidine, whereas the same distribution for naphthalene is shifted to a higher range, again an effect that may be due to the absence of N atoms in naphthalene's ring structure. The similarities and differences of these bond length and angle distributions demonstrate that a TorchMD-NET model trained on one type of molecule is transferable to other, similar molecules.

Key findings: We present a novel application of TorchMD-NET, in which this AI model was fine-tuned to describe a given small molecule, accurately predicting its potential energy and forces to perform NVE MD simulations, and then seamlessly used to describe other molecules with different structures while still capturing physically realistic bond length and angle distributions.
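The MD setup described above can be sketched with ASE as follows. `TorchMDNetCalculator` is a hypothetical wrapper exposing a trained TorchMD-NET checkpoint through ASE's Calculator interface (the exact class depends on the TorchMD-NET version in use), and the 300 K initial temperature is an illustrative choice not specified above:

```python
from ase.io import read
from ase import units
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.md.verlet import VelocityVerlet

# Load the target molecule (file name illustrative) and attach the
# potential trained on a different but similar molecule (here ethanol).
atoms = read("n_propanol.xyz")
atoms.calc = TorchMDNetCalculator("ethanol_model.ckpt")  # hypothetical wrapper

# Initialize velocities, then run NVE dynamics with the settings used above:
# 0.1 fs timestep for 100,000 steps (10 ps total).
MaxwellBoltzmannDistribution(atoms, temperature_K=300)
dyn = VelocityVerlet(atoms, timestep=0.1 * units.fs)
dyn.run(100_000)
```

Bond length and angle distributions like those in Figure 7 can then be accumulated from the positions saved along the trajectory.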

Interpretable Inference
Here we explore the use of explainable AI and dimension reduction techniques. This is motivated by recent studies in which explainable AI approaches, such as Excitation Backpropagation [17], CAM [18], Grad-CAM [19], and Contrastive Gradient [20], were used to identify key functional groups in molecular structures that contribute to toxicity [16]. Similarly, dimension reduction has been used to visualize the distribution shift of sampled molecules in feature space with or without transfer learning [21].
We explore the use of these tools to gain new insights into the information that our AI suite extracts from input data to produce reliable predictions.

Model performance attribution. By projecting the high-dimensional vector representation of the second-to-last layer of a graph neural network onto the molecular structure and visualizing the projection, we can better understand the physical and chemical properties of the input data that affect the predictions of our AI models. The top panels of Figure 8 present model interpretation results showing how the PhysNet model predicts HOMO from molecular structures via the Grad-CAM method. The N atoms and the H atoms connected to them are highlighted, possibly indicating that, for PhysNet, these atoms carry more weight in the prediction of HOMO. We do not claim that the explanations found by our deep learning model are the definitive reasons for the accurate prediction of QM properties such as HOMO or ZPVE, since these QM properties may not be simply determined by atomic species and coordinates. However, AI-explained visualization can help us make better sense of the patterns behind complex molecular property predictions.

Dimension reduction. To reveal the correlation between model features and target properties, we applied the UMAP dimension reduction technique to find the distributions of small molecules by projecting their high-dimensional structural and chemical data onto low-dimensional spaces (a minimal sketch of this projection is provided at the end of this subsection). UMAP has been shown to achieve comparable or even better performance than other dimension reduction techniques, such as principal component analysis (PCA) [48] and t-distributed stochastic neighbor embedding (t-SNE) [49], on non-linear datasets [50]. Dimension reduction results for the PhysNet model with HOMO as the target property are shown in the bottom panel of Figure 8, where each dot represents a molecule, color-coded by its HOMO value. A 10% randomly selected subset (13.4k molecules) of the QM9 dataset, which consists of stable small molecules composed of C, H, O, N, and F, was used to produce these results. The scatter plot shows that molecules with similar HOMO values are clustered together, and that there is a clear separation between molecules with low and high HOMO values. We present additional illustrative results in Appendix B.

Roughness of the molecular property landscape. For molecular property prediction, the predictive performance of graph neural networks has been shown to correlate with the roughness of the molecular property landscape [51][52][53]. We adopted the recently proposed state-of-the-art roughness index (ROGI) [54] to measure how rough the HOMO and ZPVE landscapes are for four graph neural network models: PhysNet, SchNet, MPNN, and MPNN-transformer. The calculation of molecular landscape roughness involves specifying a molecular representation and a distance metric. Molecular representations can either be learned by the graph neural network, with values extracted from the second-to-last layer, or be calculated from molecular structures, as represented by SMILES strings or 3D Cartesian coordinates. The distance metric measures how different two molecular representations are; example metrics include Tanimoto similarity, Euclidean distance, cityblock distance, and cosine distance. To calculate the ROGI values, a learned molecular representation and one of the aforementioned distance metrics are used. Using Euclidean distance as the distance metric and HOMO as the target property, we show the ROGI values and mean absolute errors of the four graph neural networks in Figure 9.
We observe that a lower ROGI value generally corresponds to a lower mean absolute error, that is, higher predictive performance. We attribute this trend to the direct relationship between a smoother molecular property landscape and a higher-performing model. The exception is MPNN, which has a lower ROGI value despite a higher MAE than the MPNN-transformer; this may be because the addition of transformer layers roughens the molecular property landscape while facilitating model training.

Key findings: Our proposed approach brings together disparate AI interpretability tools to explore and make sense of AI model predictions, encompassing model performance attribution and scientific visualization; dimension reduction with UMAP to explore the clustering of molecules with similar properties; and metrics such as the roughness index to quantify the predictive performance of our AI models for QM properties. These complementary tools provide valuable insights into the features and patterns of the input data that are relevant for AI inference.
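A minimal sketch of the UMAP projection described in this subsection, using the umap-learn package; the arrays below are synthetic stand-ins for the learned PhysNet representations and HOMO labels:

```python
import numpy as np
import umap  # umap-learn package

# Stand-in for the learned representations: in our workflow these are the
# second-to-last-layer activations of PhysNet for ~13.4k QM9 molecules.
hidden_features = np.random.rand(13400, 128)
homo_values = np.random.rand(13400)  # used only for color-coding the plot

# Project the high-dimensional features to 3D; n_neighbors and min_dist
# control the local/global balance of the embedding.
reducer = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1)
embedding = reducer.fit_transform(hidden_features)  # shape (13400, 3)
```

Coloring each embedded point by its target value, as in the bottom panel of Figure 8, reveals whether molecules with similar properties cluster together in the learned feature space.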

Conclusion
The rise of AI in the early 2010s was made possible by a combination of elements, including disruptive technologies and computing approaches, as well as the desire to advance state-of-the-art practice through collaborative and friendly competitions in which high-quality datasets and AI models were freely shared. Similar approaches have been mirrored in science and engineering in recent years. These efforts are now being formalized through FAIR (findable, accessible, interoperable, and reusable) initiatives [55,56] in the context of scientific datasets [57], research software [58], and AI models [8,59]. This study represents another significant step in this direction. We have assembled benchmark datasets; added novel features to state-of-the-art graph neural networks and transformer models; coupled them with robust libraries for hyperparameter tuning to improve their capabilities for scientific discovery; and developed and adapted a set of visualization and interpretability tools to make sense of the AI predictions. All these elements are unified within a single computational framework that has been deployed and extensively tested on leadership-class, high-performance computing platforms. Researchers using this computational framework will be able to conduct scientific discovery by combining state-of-the-art AI models and datasets on advanced supercomputing platforms. We expect that this approach will catalyze the sharing of AI knowledge and tools in the context of molecular and crystal property prediction applications.

Code availability
The AI models presented in this article are open source and available on GitLab [23]. We also provide a standalone Colab tutorial [24] that shows users how to use our framework, i.e., loading data and AI models, performing AI inference, and interpreting AI model predictions. Following best practices, the datasets used in these studies are published in Zenodo [25]. We provide an exemplary case of our published AI suite in Appendix B.3.