Chemistry-informed macromolecule graph representation for similarity computation, unsupervised and supervised learning

The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymer topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed a chemistry-informed graph representation of macromolecules that enables quantification of structural similarity and interpretable supervised learning for macromolecules. Our work enables quantitative chemistry-informed decision-making and iterative design in the macromolecular chemical space.

An individual macromolecule is distinguished by the identity and spatial arrangement of its monomers and linkages [12]. Monomer and linkage identity are functions of atomic composition, connectivity, and stereochemistry, while the spatial arrangement of the monomers and linkages dictates the topology. Experimentalists and theoreticians have explored a vast chemical space by varying monomers [13,14], linkages [15], and topologies, both linear [4] and non-linear, such as branched [16], star [17], and bottle-brush [18]. As a result of such chemical diversity, representing, comparing, and learning over macromolecules with different monomers, linkages, and topologies emerges as a critical challenge.
Linear biological macromolecules, such as proteins and DNA/RNA, are easily represented as strings of one- or three-letter monomer codes. However, machine-readable macromolecule representations, such as the hierarchical editing language for macromolecules [19], the IUPAC international chemical identifier [20], CurlySMILES [21] (where SMILES is the simplified molecular-input line-entry system), and BigSMILES [22], do not always support non-linear topologies, require a fair amount of customization, and have non-canonical variants. In a recent attempt, glycans, which are non-linear biological macromolecules, were represented as SMILES-like sequences, in which groups of monosaccharides were binned into 'glycowords' and placed in hierarchical brackets [23].
Likewise, similarity computation for macromolecules has mostly been limited to linear sequences [23][24][25][26], leveraging sequence alignment using the Smith-Waterman [27] or Needleman-Wunsch [28] algorithms, and scoring with substitution matrices, such as BLOSUM62 [29] (for proteins) and GLYSUM [23] (for glycans). These substitution matrices are based on evolutionary statistics and cannot be used for non-natural building blocks since they lack a general way to quantify chemical similarity. In the case of non-linear macromolecules, alignment of glycans has been explored using q-grams [30], tree matching methods [31][32][33], and tree kernels [34]. Unfortunately, these methods are tailored to specific classes of macromolecules and do not extend to the general macromolecular chemical space.
Unsupervised and supervised machine learning (ML) applications to individual macromolecule classes, such as proteins, have been very successful but typically rely on sequence-based representations that are tailored for linear architectures [35,36]. For instance, unsupervised learning of protein sequences has resulted in functional annotation and identification of sub-families for yet unseen sequences [37], in addition to creating information-rich embeddings for downstream property prediction tasks with scarcer data [36,38]. On the supervised learning front, for artificial polymers, PolymerGenome and similar works have used hierarchical fingerprints to predict glass transition temperature, dielectric constant, and other macromolecular properties [39,40], and there have been attempts to extrapolate macromolecular properties by training over monomer input features [41][42][43]. However, these methods do not extend well to macromolecules with complex topologies, such as glycans or biohybrid sequence-defined polymers, which exhibit non-linear structure and higher levels of monomer and linkage diversity.
In this work, we developed a graph representation for macromolecules. Graphs are a natural and more general macromolecule representation that can handle linear, branched, and cyclic topologies along with any monomer and linkage composition. The representation generalizes macromolecule similarity from sequence alignment to structural similarity between complex topologies. Using chemical similarity between monomers through cheminformatic fingerprints, and exact graph edit distances (GEDs) or graph kernels to compare topologies, the representation allows for quantification of the chemical and structural similarity of two arbitrary macromolecule topologies. To investigate the relationship between chemical similarity and functionality, we used unsupervised learning over similarity vectors obtained from the aforementioned similarity computation methods.
ML over graph representations has achieved state-of-the-art results across several fields [44], including chemistry and the life sciences, where graph neural networks (GNNs) have become the modern workhorse for molecular property prediction [45][46][47][48][49]. We leveraged these advancements by coupling macromolecule graphs to supervised GNN models to learn structure-property relationships. Further, we used attribution methods compatible with GNNs to highlight how input features are relevant to model predictions of target properties [50,51].

Results and discussion
2.1. Text file system converts macromolecule structure to machine-readable graph
We developed a generalized text file system to convert a macromolecule structure into a machine-readable format (figure 1(A) and SI section 2 available online at stacks.iop.org/MLST/3/015028/mmedia). The text file has three sections: SMILES, MONOMERS, and BONDS, inspired by the PDB file format [52]. Under SMILES, monomer and bond names followed by the stereochemical SMILES are noted. MONOMERS enumerates indices of all nodes numbered from 1 to n, where n is the total number of monomers, followed by the monomer names. Similarly, BONDS lists indices of connections between monomer indices, followed by bond names. In this way, we are able to incorporate complexity from the level of individual atoms to the full macromolecular structure.
For our experiments, the macromolecule text files were then processed into attributed NetworkX graphs [53], with monomers as nodes and bonds as edges (SI section 3). In line with our earlier work, where fingerprint-based monomer representations worked effectively for macromolecule property prediction, the monomer and bond molecules were featurized using standard extended-connectivity fingerprints (ECFPs) [35]. For each constituent node or edge of the macromolecule graph, the fingerprints capture the atomic connectivity of the monomer/bond molecule as a series of bits using circular atom neighborhoods, encoding the macromolecules in their native structure with explicit featurization of stereochemistry and topology. We optimized the radius of the atomic neighborhood and the dimension of the fingerprint by analyzing the distribution of Tanimoto similarity for the individual monomers and bonds. This system allows us to represent any macromolecule structure, irrespective of monomer type, linkage type, and topology, within the same framework. Alongside fingerprints, we benchmarked models with discrete one-hot encodings of the monomers, in order to understand the effect of these chemical features.

Graph similarity measures enable similarity computation for arbitrary macromolecules
To compute the (dis)similarity between macromolecule graphs, we used exact GED and graph kernels (figure 1(B)). GED [54] computes the dissimilarity between two graphs by assigning node and edge insertion/deletion/substitution costs. For insertion and deletion of a node/edge, we add a fixed cost to the distance, while for substitution, we multiply a constant cost by the Tanimoto dissimilarity of the molecules being substituted (SI section 4.2). Tanimoto dissimilarity is a metric for comparing two bit-vectors on a scale of 0 to 1, where self-dissimilarity is zero. This process is analogous to sequence alignment using methods like BLAST [24]. However, instead of scoring with evolutionary statistics-based substitution matrices, such as BLOSUM62, the use of Tanimoto dissimilarity over molecular fingerprints allows us to extend the similarity computation to unnatural monomers. We demonstrate the similarity computation between a linear glycan and six additional glycans of different topology and/or monomer chemistry, as well as with itself, using the Tanimoto chemical similarity matrix (figures 2(A) and (B)).
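As a concrete illustration, the Tanimoto dissimilarity between two binary fingerprints can be computed directly from the shared and total "on" bit positions. This is a minimal pure-Python sketch (the actual pipeline uses RDKit fingerprints and their built-in similarity routines):

```python
def tanimoto_dissimilarity(fp_a, fp_b):
    """Tanimoto dissimilarity between two binary fingerprints given as
    iterables of 0/1 bits; 0.0 means identical bit patterns."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = len(on_a | on_b)
    if union == 0:           # two all-zero fingerprints: treat as identical
        return 0.0
    return 1.0 - len(on_a & on_b) / union

# self-dissimilarity is zero, as required for the GED baseline
print(tanimoto_dissimilarity([1, 0, 1, 1], [1, 0, 1, 1]))  # 0.0
# one shared "on" bit out of three total -> dissimilarity 2/3
print(tanimoto_dissimilarity([1, 1, 0, 0], [1, 0, 1, 0]))
```

This dissimilarity is the quantity that scales the substitution cost in the GED computation.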
As the sizes of both the individual graphs and the dataset increase, computing exact GED becomes computationally intractable, since GED belongs to the class of non-deterministic polynomial-time-hard (NP-hard) problems. To scale the similarity computation to large datasets, we used graph kernels, specifically propagation attribute kernels, to obtain approximate similarity matrices [55,56] (figure 2(C) and SI section 4.3). The propagation attribute kernel captures local monomer node information and propagates it along the bond edges, thereby capturing both local and global information, to produce a similarity score. The information flow in the propagation attribute kernel is similar to message passing in graphs, making it an ideal choice for macromolecule graphs with featurized nodes and edges, given the success of graph convolutional NNs for supervised tasks on macromolecule graphs.

Unsupervised learning separates functional macromolecules into distinct regions
We used dimensionality reduction methods, such as principal component analysis (PCA) [57], t-distributed stochastic neighbor embedding (t-SNE) [58] and uniform manifold approximation and projection (UMAP) [59] in combination with our similarity computations for unsupervised learning (SI section 5). Conventional implementations of dimensionality reduction methods are based on feature vectors, so macromolecule graphs cannot be processed directly. Instead, we used an approach inspired by multidimensional scaling and applied linear and non-linear dimensionality reduction to the similarity matrix obtained using graph kernels [60]. The graph similarity matrices make for a powerful representation, since they encode chemical and topological pairwise similarity across the dataset, without resorting to representing each macromolecule as a vector.
For glycans with immunogenicity labels, we observed that the non-immunogenic and immunogenic glycans fall in nearly distinct regions (figure 3(A)). In a similar experiment, we colored glycans by domain in two- and three-component UMAP plots (figures 3(B) and (C)). Noting that the taxonomic complexity was not adequately captured by the two-component plot, we used a three-component plot. In the individual plots, we observed that glycans belonging to eukarya, bacteria, and viruses clustered in distinct regions, with the bacterial glycans at the core, eukaryal glycans spreading out of the core, and viral glycans at the fringes (figure 3(D)). Benchmarking against PCA and t-SNE, we observed that PCA was able to capture the global structure separating the immunogenic and non-immunogenic glycans, while t-SNE was better at capturing the local structure (SI figures 24 and 25). In contrast, UMAP performed better in separating functional glycans, for both local and global structure.
To check whether more components in the dimensionality reduction could help in finding distinct clusters, we performed dimensionality reduction for {2, 3, 5, 10, 30, 50} components and used hierarchical density-based spatial clustering of applications with noise (HDBSCAN), an unsupervised clustering algorithm, to determine the number of clusters [61]. We found that the numbers of clusters were similar across the different numbers of components, in the low 400s (SI figure 22). The high number of clusters indicates the diversity of the space and the differences in graph similarity of glycans with distinct taxonomy. As a further check of the validity of the clustering approach, we plotted histograms of the glycans assigned to each cluster. Across all component counts, the histograms were consistent with the number of glycans in each cluster (SI figure 23).

GNNs predict macromolecule properties with high accuracy
We evaluated five different GNN model architectures to classify glycans by immunogenicity and taxonomy levels and to predict the anti-microbial activity of peptides (figures 4(A), (B) and SI section 6). Each model architecture was trained over fingerprint and one-hot node and edge attributes on 60% of the data, validated on 20%, and tested on a held-out 20%, all determined using random splits. To report final metrics on the validation and test datasets, for each model architecture we averaged values obtained from 25 individual models: the top five hyperparameter sets, each trained with five random seeds. Unlike previous implementations with all-atom GNNs [62], we used a hierarchical GNN model trained over monomer nodes and linkage edges, featurized using fingerprints. The hierarchical approach learns over a coarse-grained representation rather than an all-atomistic one. Moreover, it enables attribution analysis at the level of chemical substructures, which is more intuitive than weighted importance of atoms in a large macromolecule.
[Figure 4 caption: the node weights were obtained using the AttentiveFP-IG combination. (E) Attribution analysis using the AttentiveFP model architecture and the integrated gradients attribution method for a representative glycan shows that xylose (Xyl) contributes the most to immunogenicity; in xylose, the substructure centered on the anomeric carbon, C1, is the key contributor, and a similar substructure analysis is shown for fucose. In the glycan graph, node size corresponds to importance and color to the monomer; in the fingerprint, only indices with a positive contribution to immunogenicity are visualized, with color corresponding to normalized importance.]
We obtained a receiver operating characteristic-area under curve (ROC-AUC) greater than 0.95 on the held-out test dataset for all glycan immunogenicity and taxonomy classification tasks (SI tables 3 and 4). Against results reported in the literature, our models outperformed the classification metrics for four out of eight tasks and achieved comparable results for the remaining four (SI table 5) [62]. We noted that for most tasks the performance of the one-hot-featurized graphs was comparable to that of the ECFP-featurized graphs.

Attribution analysis finds features key to the model's decision-making
Graph attribution methods attempt to crack open black-box supervised GNNs, allowing us to infer which specific features (subgraphs, monomers, and chemical moieties) impact the predicted property. The critical features revealed through graph attribution help elucidate the fundamental structure-function relationships that underpin otherwise opaque chemical/biological properties like immunogenicity, postulate explicit hypotheses that can be validated in the lab, and may inform further design of immunogenic or non-immunogenic scaffolds.
To avoid spurious hypotheses [63] that may occur in NN attribution on molecular models, we chose the optimal combination of GNN architecture and attribution method, following the implementation invariance [64] axiom which proposes that attributions using the same method with different model implementations should be identical. In other words, all implementations of the same model type should attribute similar features for the same predicted property. In an ideal scenario, all attributions over different implementations of a model should be equal, or have a standard deviation of zero.
We evaluated three attribution methods, integrated gradients [64], input × gradients [65] and attention weights. To quantify implementation invariance, we calculated the standard deviation of the node attribution weights across all immunogenic glycans using different combinations of model architecture and attribution method. For attribution of immunogenicity in glycans, we noted that AttentiveFP-integrated gradients had the smallest deviation between implementations and is thus the best choice (figure 4(C)). All further attribution analysis was done using node weights obtained from AttentiveFP-integrated gradients.
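The invariance criterion can be sketched numerically: for each node, take the standard deviation of its attribution weight across retrained implementations of the same architecture, then average over nodes. The helper name and the averaging scheme below are our assumptions, not the paper's code:

```python
from statistics import pstdev

def implementation_invariance(attributions_per_model):
    """Mean per-node standard deviation of attribution weights across
    several retrained implementations of the same architecture;
    smaller values indicate stronger implementation invariance."""
    per_node = list(zip(*attributions_per_model))   # group weights by node
    return sum(pstdev(vals) for vals in per_node) / len(per_node)

# three seeds, two nodes: identical attributions are perfectly invariant
print(implementation_invariance([[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]]))  # 0.0
```

Under this reading, the architecture-method combination with the smallest score (here, AttentiveFP with integrated gradients) is preferred.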
Across all immunogenic glycans present in the dataset, N-glycolylneuraminic acid, xylose, and fucose were found to be the key monomers responsible for immunogenicity, consistent with experimental findings and in line with their low prevalence in human glycans (figure 4(D)) [66].
For a single immunogenic glycan, we observed that xylose, followed by galectin, were the monomers that contributed most significantly to immunogenicity (figure 4(E)). Attribution identifies, in addition to the importance of individual nodes, the critical substructures in the monomers that contribute most to immunogenicity, such as the substructure centered on the anomeric carbon of xylose. To assess the sensitivity [64] of attribution analysis, we performed ablation analysis, where we muted individual xylose monomers and then all xylose monomers (SI figure 45). When features for a single xylose monomer were muted, we noted that attributions of other xylose monomers remained unchanged. Similarly, when all xylose monomers were muted, galectin was the key monomer responsible for immunogenicity.

Conclusion
This work provides a generalized method for representing macromolecules as hierarchical graphs, with molecular fingerprints capturing the chemical information, which can be used to compute structural similarity between macromolecules of different composition and topology and to perform unsupervised and supervised learning. The unsupervised learning enables visualization of the complex landscape of different classes of macromolecules and an understanding of the subtle differences between similar macromolecules. The attribution analysis helps crack open the black box of supervised GNN models, which in turn can help elucidate fundamental design principles of otherwise opaque structure-property relationships and assist with hypothesis generation for future experimental studies. We therefore expect that this toolkit will be used by both experimentalists and computational practitioners in chemistry, biology, and materials science for a variety of macromolecule property prediction tasks. Because accurate property prediction is key for design applications, these models could drive the design of improved macromolecules when combined with directed evolution, Monte Carlo tree search, or similar optimization methods that seek to maximize scores predicted by supervised models.

Glycans
A dataset of 19 299 glycans was downloaded from GlycoBase (accessed on 2 November 2020) [23]. The file contained the GlycoBase ID, sequence, link (N, O, free, or none), species, and immunogenicity information for each glycan. In each glycan sequence string, brackets denote branches, with the point of attachment/bonding of the branch being the monomer immediately after the brackets. The first element within the brackets is the monomer most distant from the point of attachment, and the last element within the brackets is the abbreviation of the bond that connects the branch to the main chain. Nested brackets indicate additional sub-branches off of branches, and multiple sets of brackets next to each other indicate several branches off of the same monomer.
Seven modifications and 152 glycan sequences were found to be erroneous, i.e. invalid data owing to an unequal number of opening and closing brackets or dangling branches without specified connectivity, and were thus deleted. Additional glycan sequences were removed due to missing SMILES sequences for a number of monomers. The curated dataset resulted in a total of 19 147 glycans.
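The bracket-balance check used to flag invalid sequences can be sketched as a single pass over the string; the bracket characters and helper name here are illustrative, and the actual curation scripts may differ:

```python
def has_balanced_brackets(seq, pair="[]"):
    """Return True if every opening bracket in a glycan sequence string
    has a matching closing bracket and no closing bracket appears
    before its opener."""
    open_ch, close_ch = pair
    depth = 0
    for ch in seq:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:          # dangling closing bracket
                return False
    return depth == 0              # nonzero depth: unmatched opener

print(has_balanced_brackets("a[b[c]]d"))  # True
print(has_balanced_brackets("a[b"))       # False
```

Sequences failing this check (or lacking specified connectivity for a branch) would be dropped from the dataset.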

Anti-microbial peptides (AMPs)
A dataset of 15 864 AMPs, including 15 450 monomers, 200 multimers, and 214 multi-peptides, was downloaded from the database of antimicrobial activity and structure of peptides (DBAASP) (accessed on 6 October 2020) [67]. Each peptide was represented in the dataset as an individual JavaScript Object Notation (JSON) file containing information about the peptide ID, name, sequence(s), unusual amino acids, connectivity, terminal modifications, complexity, synthesis type, target groups and objects, and target species. The term 'target species' is used loosely and actually encompasses both species and sub-classification information such as subspecies, strain for bacteria, and forma speciales for fungi. For each target species for any given peptide, the dataset provides the AMP concentration, mostly in units of either µM or µg ml−1, as a function of four unique variables: activity measure, salt type, medium, and CFU. The dataset includes three different types of peptide complexity: monomers, multimers, and multi-peptides. In the DBAASP dataset, monomers consist of a single sequence, multimers of between two and four separate sequences connected via interchain bonds, and multi-peptides of between two and four separate sequences connected not via covalent bonds but by weaker intermolecular forces.
The information from each JSON file was combined into a single table, and the AMP concentrations for each target species were converted into numerical values by removing symbols like '>' and '<', taking the average whenever the dataset provides a range, and disregarding uncertainty values. Nine peptide monomer types included in a total of 86 peptides were removed due to ambiguity of the molecular structure, bringing the total number of peptides to 15 778. This condensed dataset was further processed to visualize the distribution of target species data points. For each 'target species', the species name was separated from any sub-classifications (subspecies, strain, forma speciales, serovar, pathovar, biovar, etc).
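The concentration cleaning rules can be sketched as a small parser; the exact qualifier syntax in DBAASP may vary, so the rules below ('>'/'<' stripping, 'low-high' averaging, an ASCII '+/-' uncertainty suffix) are illustrative rather than the paper's script:

```python
def parse_concentration(raw):
    """Convert a reported AMP concentration string to a single float:
    strip '>'/'<' qualifiers, average 'low-high' ranges, and drop
    '+/-' uncertainty terms (illustrative rules only)."""
    text = raw.replace('>', '').replace('<', '').strip()
    text = text.split('+/-')[0].strip()    # discard uncertainty suffix
    if '-' in text:                        # range such as '2-8'
        low, high = (float(part) for part in text.split('-', 1))
        return (low + high) / 2.0
    return float(text)

print(parse_concentration('>128'))   # 128.0
print(parse_concentration('2-8'))    # 5.0
```

Applying such a function per target-species entry yields the numeric table used downstream for regression.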

Text file to machine-readable graph conversion
A custom parser was developed to convert the macromolecules from the SMILES-MONOMERS-BONDS text files to machine-readable NetworkX graphs, with monomers expressed as nodes and bonds expressed as edges. The parser goes through the .txt file line by line, stores the monomer information in a dictionary with integer positions as keys and monomer abbreviations as values, and stores the bond information in a dictionary with tuples of bond connectivities as keys and bond abbreviations as values. Afterwards, the parser uses NetworkX to add each key in the monomer dictionary as a node and each key in the bond dictionary as an edge, storing the abbreviations as attributes for the corresponding node or edge. The resulting NetworkX graphs include both linear and highly branched architectures. Before using the parser, all glycans and peptides obtained from the various datasets were converted to the standardized text file format.
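The two-dictionary parsing step can be sketched as follows. The exact line layout of the text files (section headers on their own lines, whitespace-separated fields) is our assumption, and plain dictionaries stand in for the NetworkX graph so the sketch is dependency-free:

```python
def parse_macromolecule(text):
    """Parse a SMILES-MONOMERS-BONDS text block into three dicts:
    name -> SMILES, node index -> monomer name, (i, j) -> bond name
    (hypothetical layout; the real files may differ)."""
    smiles, monomers, bonds = {}, {}, {}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in ('SMILES', 'MONOMERS', 'BONDS'):
            section = line          # switch to the named section
            continue
        fields = line.split()
        if section == 'SMILES':     # name, then stereochemical SMILES
            smiles[fields[0]] = fields[1]
        elif section == 'MONOMERS': # node index, then monomer name
            monomers[int(fields[0])] = fields[1]
        elif section == 'BONDS':    # two node indices, then bond name
            i, j = int(fields[0]), int(fields[1])
            bonds[(i, j)] = fields[2]
    return smiles, monomers, bonds

example = """SMILES
Gly NCC(=O)O
amide NC=O
MONOMERS
1 Gly
2 Gly
BONDS
1 2 amide
"""
smiles, monomers, bonds = parse_macromolecule(example)
print(monomers)   # {1: 'Gly', 2: 'Gly'}
print(bonds)      # {(1, 2): 'amide'}
```

In the actual pipeline, the monomer keys become NetworkX nodes and the bond keys become edges, with the abbreviations stored as node/edge attributes.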

Macromolecule graph representation and featurization
The macromolecule was represented as an attributed graph, G(V, E), where V stands for the monomers/nodes and E for the bonds/edges. Stereochemical extended-connectivity fingerprints, generated using RDKit, were used to featurize the monomers and bonds [68,69]. The radius and number of bits were optimized by calculating the mean and standard deviation, and visualizing the distribution, of the Tanimoto similarity [70] of all monomers in the glycans dataset (SI section 3.1).

Similarity computation
4.4.1. Exact GEDs
GED is a measure of dissimilarity between two graphs, computed as the cost of transforming one graph into another through basic edit operations: insertion, deletion, and substitution of nodes and edges [54]. Dissimilarity is the more intuitive quantity here, with a natural baseline: the edit distance of a macromolecule with itself is zero, and with anything else is greater than zero. We performed a grid search over combinations of possible node/edge substitution costs and multipliers for node/edge insertion/deletion costs to find an optimal set of values. For node/edge substitution, the Tanimoto distance between the stereochemical fingerprints, a value in the range of 0 to 1, is multiplied by the substitution cost to obtain the edit distance. For node/edge insertion/deletion, a constant cost is added to the edit distance.
The insertion/deletion and substitution costs can be tuned to accurately reflect the differences in topology and monomer chemistry across the macromolecules in the dataset. Using a higher insertion/deletion cost than substitution cost penalizes changes in topology more than changes in monomer chemistry, and vice versa. When the insertion/deletion and substitution costs are the same, changes to topology and to monomer chemistry are treated equally. The magnitude of the costs helps tune the range of edit distances. We found that setting both the insertion/deletion and substitution costs to three provided an optimal balance for our dataset, accounting for changes in both chemistry and topology (SI table 1).
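With those choices, the cost scheme reduces to two small functions. This is a sketch of the cost model only; the NP-hard edit-path search that consumes these costs (e.g. via NetworkX's graph edit distance routines) is not shown:

```python
SUBSTITUTION_COST = 3.0     # both costs set to three worked well here
INSERT_DELETE_COST = 3.0

def substitution_cost(tanimoto_distance):
    """Cost of substituting one monomer/bond for another: the constant
    cost scaled by the Tanimoto distance between their fingerprints."""
    return SUBSTITUTION_COST * tanimoto_distance

def insertion_or_deletion_cost():
    """Fixed cost added for every inserted or deleted node/edge."""
    return INSERT_DELETE_COST

# substituting a monomer for itself (Tanimoto distance 0) costs nothing
print(substitution_cost(0.0))   # 0.0
print(substitution_cost(0.5))   # 1.5
```

Raising `INSERT_DELETE_COST` relative to `SUBSTITUTION_COST` would penalize topology changes more than chemistry changes, as described above.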

Graph kernel
A graph kernel is a kernel function that computes an inner product on graph-structured data [71]. The kernel approach produces a similarity matrix for a set of graphs, analogous to GED for a pair of graphs. In this work, we used the propagation attribute kernel implemented in GraKeL [55,56] to compute the (n × n) similarity matrix. All glycans with labels on at least one taxonomic level were considered for the similarity computation. Each pairwise graph similarity was computed for a maximum of 100 iterations. This resulted in 5% of the pairs being assigned zero similarity (10% of all indices in the similarity matrix are zero). To benchmark against GED, we performed a grid search over the hyperparameters of the propagation attribute kernel: bin width {1, 3, 10, 100} and the preserved distance metric for locality-sensitive hashing {'L1-norm', 'L2-norm'}.
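A toy version of the propagation idea conveys the information flow: node attributes are repeatedly averaged with their neighbors', and at each round matching attribute bins between the two graphs are counted. This uses scalar attributes and crude fixed-width binning in place of GraKeL's vector attributes and locality-sensitive hashing, so it is an illustration, not the implementation used in the paper:

```python
from collections import Counter

def propagation_similarity(graph_a, graph_b, iterations=2, bin_width=0.5):
    """Toy propagation-kernel similarity between (node_attrs, edges)
    graphs: propagate scalar attributes along edges, bin them, and
    count matching bins at every iteration."""

    def neighbors(edges, node):
        return [b if a == node else a for a, b in edges if node in (a, b)]

    def step(attrs, edges):
        # replace each attribute by the mean of itself and its neighbors
        return {n: (v + sum(attrs[m] for m in neighbors(edges, n)))
                   / (1 + len(neighbors(edges, n)))
                for n, v in attrs.items()}

    score = 0
    attrs_a, attrs_b = dict(graph_a[0]), dict(graph_b[0])
    for _ in range(iterations + 1):
        bins_a = Counter(int(v // bin_width) for v in attrs_a.values())
        bins_b = Counter(int(v // bin_width) for v in attrs_b.values())
        score += sum((bins_a & bins_b).values())   # matched node bins
        attrs_a = step(attrs_a, graph_a[1])
        attrs_b = step(attrs_b, graph_b[1])
    return score

# two identical 3-node paths match all 3 node bins in each of 3 rounds
path = ({1: 0.0, 2: 1.0, 3: 2.0}, [(1, 2), (2, 3)])
print(propagation_similarity(path, path))  # 9
```

Because local information spreads outward with each iteration, later rounds compare increasingly global structure, mirroring how message passing aggregates neighborhoods in GNNs.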

Unsupervised learning
4.5.1. Uniform manifold approximation and projection (UMAP)
UMAP is a method for dimensionality reduction based on manifold learning and topological data analysis, capturing both the local and global structure of the data [59]. This method was used for dimensionality reduction of the similarity matrices. The number of neighbors was optimized over {2, 4, 8, 16, 32, 64, 128, 256} for two-component UMAP dimensionality reduction of the similarity vectors [59]. By visual inspection, UMAP with 128 neighbors was observed to resolve clusters of suitable size and number, with the resulting plot showing distinct regions for the immunogenic and non-immunogenic glycans. In addition to two-component UMAP, we constructed UMAPs for increasing numbers of components and analyzed the segmentation at higher component counts.

t-distributed stochastic neighbor embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique that calculates the pairwise similarities between points and minimizes the difference between the similarities in the higher and lower dimensions [58]. We benchmarked the dimensionality reduction results obtained from UMAP against a broad range of t-SNE models, varying perplexity as {2, 5, 30, 50, 100} and the number of steps as {500, 1000, 5000}. From the scatter plots, colored by immunogenicity labels, we noted that dimensionality reduction using t-SNE was not able to resolve the differences and separate the glycans into distinct regions.

Principal component analysis (PCA)
PCA is a linear dimensionality reduction technique that uses singular value decomposition to transform high-dimensional data into low-dimensional embeddings [57]. We used two-component PCA, with default hyperparameters, to benchmark the dimensionality reduction of immunogenic glycans.

Supervised learning
4.6.1. Tasks
For classification, we used the glycans dataset and classified immunogenicity and eight taxonomy levels (domain, kingdom, phylum, class, order, family, genus, and species). For regression, we trained over the minimum inhibitory concentration of AMPs against Escherichia coli and Staphylococcus aureus.

Model architectures
Five different model architectures were used for classification and regression: graph convolutional networks (GCNs) [72], Weave [73], message passing neural networks (MPNNs) [74], graph attention networks (GATs) [75], and AttentiveFP [76], all as implemented in the deep graph library (DGL) LifeSci library [77]. GCN uses convolutional aggregation operations over node features in the graph. The Weave model architecture is an extension of GCNs to learn over molecular graphs, convolving over both atom/node and bond/edge features. MPNN updates the node features by summing over the node and edge features in the node neighborhood. GAT utilizes self-attention layers to implicitly focus on key node features, unlike GCNs, which give equal weight to all node features. AttentiveFP learns both the local neighborhood, by propagation of information at the nodes, and non-local information, via the GAT mechanism. While Weave, MPNN, and AttentiveFP utilize both node and edge features, GCN and GAT only consider node features.

Adapting NetworkX graphs to be trained using DGL
NetworkX graphs were converted into undirected, unweighted, and homogeneous DGL graphs [77]. For the GCN and GAT model architectures, self-loops were added to the DGL graphs to prevent silent performance regression due to zero-in-degree nodes during training.
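The motivation for the self-loops can be seen on a plain edge list: a node with no incoming edges receives no messages during aggregation. This minimal sketch only patches the problematic nodes (DGL's `add_self_loop` adds a loop to every node):

```python
def add_self_loops(num_nodes, edges):
    """Give every zero-in-degree node a (n, n) self-loop so that
    message-passing layers still aggregate at least one message there."""
    edges = list(edges)
    in_degree = {n: 0 for n in range(num_nodes)}
    for _, dst in edges:
        in_degree[dst] += 1
    for n in range(num_nodes):
        if in_degree[n] == 0:
            edges.append((n, n))
    return edges

# node 0 has no incoming edge, so it receives a self-loop
print(add_self_loops(3, [(0, 1), (1, 2)]))  # [(0, 1), (1, 2), (0, 0)]
```

Without such loops, GCN/GAT-style layers would propagate zero (or undefined) representations for those nodes, silently degrading performance.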

Optimization of model
For classification, the optimization was done by minimization of average cross-entropy loss between batches and additional metrics such as F1 score, recall, precision and accuracy were noted. For regression, the optimization was done by minimization of root-mean-squared-error loss on the validation dataset, and additional metrics such as R², Pearson's correlation, Spearman's correlation, and mean absolute error were noted. Hyperparameter optimization was carried out for 1000 iterations using SigOpt [78].

Attribution analysis
4.7.1. Graph attribution methods
The integrated gradients [64] and input × gradients [65] attribution methods were used over Weave, AttentiveFP, and MPNN for attribution analysis. Additionally, node attention weights were analyzed for AttentiveFP. The model architectures were selected to cover one of each type: Weave (graph convolution), AttentiveFP (graph attention), and MPNN (message passing).
Integrated gradients interpolates between the input graph and a baseline graph, where all features are zero, and accumulates the gradient values for each node (equation (1)). The notation follows [50].
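The accumulation step referenced as equation (1) matches the standard integrated-gradients definition; a reconstruction in our own notation (which may differ from [50]) for feature index $i$ is:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\,
  \mathrm{d}\alpha
```

where $x$ collects the input graph's node features, $x'$ is the all-zero baseline graph, $F$ is the trained GNN, and $\alpha$ interpolates between baseline and input; in practice the integral is approximated by a finite Riemann sum over interpolation steps.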
Input × gradients attribution is the element-wise product of the input graph and the gradient.
For attention weights, the node attention weights were obtained by averaging over the attention scores of the adjacent nodes.
For each attribution method, we obtained the node weights by multiplying the positive weights with the input fingerprint vectors. The node weights were then normalized by the maximum node weight to obtain the normalized weights.
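Read literally, this weighting step can be sketched as follows; the function name and the element-wise reading of "multiplying the positive weights with the input fingerprint vectors" are our assumptions:

```python
def normalized_node_weights(attribution_weights, fingerprints):
    """For each node, sum the positive attribution weights over the
    'on' fingerprint bits, then scale all node weights by the maximum
    so the most important node has weight 1.0."""
    raw = []
    for weights, fp in zip(attribution_weights, fingerprints):
        raw.append(sum(w * bit for w, bit in zip(weights, fp) if w > 0))
    peak = max(raw) or 1.0   # avoid dividing by zero for all-zero weights
    return [w / peak for w in raw]

weights = [[0.6, -0.2, 0.2], [0.1, 0.3, -0.5]]   # per-node attributions
fps = [[1, 0, 1], [1, 1, 1]]                      # per-node fingerprints
print(normalized_node_weights(weights, fps))      # [1.0, 0.5]
```

The normalized weights are what set the node sizes in the attribution plots.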

Visualization of key substructures
To visualize the responsible substructures in the monomers, we used G_A in equation (1) and multiplied the node weights with the respective monomer fingerprint. This resulted in a weight vector of the same size as the fingerprint, ranging from the most positively to the most negatively influencing substructures for the prediction. Using RDKit, we visualized the chemical substructures at the different fingerprint indices and mapped them to the weights.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: 10.5281/zenodo.5237237.