Standardizing chemical compounds with language models

With the growing amount of chemical data stored digitally, it has become crucial to represent chemical compounds accurately and consistently. Harmonized representations facilitate the extraction of insightful information from datasets and are advantageous for machine learning applications. To achieve consistent representations throughout datasets, one relies on molecule standardization, which is typically accomplished using rule-based algorithms that modify descriptions of functional groups. Here, we present the first deep-learning model for molecular standardization. We enable custom standardization schemes based solely on data, which, as an additional benefit, support standardization options that are difficult to encode into rules. Our model achieves over 98% accuracy in learning two popular rule-based standardization protocols. We then follow a transfer learning approach to standardize metal-organic compounds (for which there is currently no automated standardization practice), based on a human-curated dataset of 1512 compounds. This model predicts the expected standardized molecular format with a test accuracy of 80.7%. As standardization can be considered, more broadly, a transformation from undesired to desired representations of compounds, the same data-driven architecture can be applied to other tasks. For instance, we demonstrate the application to compound canonicalization and to the determination of major tautomers in solution, based on computed and experimental data.


Introduction
From deep learning algorithms for forward reaction prediction [1][2][3], retrosynthesis [1,4,5], de novo molecular generation [6,7], to the prediction of yields [8] and molecular properties [9], artificial intelligence is increasingly prevalent in chemical discovery pipelines. This was enabled by the abundance of freely available molecular databases, with hundreds of millions of compounds relevant to drug and materials discovery [10][11][12][13]. However, the size of the datasets makes human curation campaigns impractical, resulting in the frequent presence of incorrect and inconsistent molecular representations [14]. In fact, a 2010 study [14] compiled a series of investigations which concluded that even minor structural errors and inconsistencies within a dataset could result in significant losses in the predictive ability of structure-activity relationship models. Because the quality of the input data limits the performance of machine learning models, the development of tools to address this issue has received increased attention in recent years [15].
The molecular representations that exist today often fail to map the real identity of a molecule into computer-readable format [16]. For example, the simplified molecular-input line-entry system (SMILES) notation [17,18] was introduced to represent molecular structure as a linear string of symbols, which later allowed efficient data storage and search, and more recently brought the progress achieved by language models to chemistry [3,6,19]. However, SMILES strings cannot encode extensive information about the geometry of a molecule, and fail to represent chemical species that do not fit valence bond theory, or polymeric species. Molfile formats rely on connection tables [20], which include exhaustive details about the nature of the atoms present, their spatial arrangement and properties, but they require a large amount of disk space and are not well adapted for basic cheminformatics analysis [21]. Graphs are interpretable and can easily be equipped with 3D details, but they fail to accurately describe delocalized or metal-metal bonds, among others [21]. Without a universal molecular representation that is truthful to the real identity of a molecule, irrespective of the molecule's environment, and with machine learning algorithms already building on existing molecular representations, one has to engineer existing representations so that they are tailored to the application at hand. One solution is given by standardization practices, which aim to correct errors in chemical structure representation, while also generating uniform and self-consistent configurations of atoms and bonds, charges and bond orders, aromaticity and stereochemistry.
Chemical data standardization is commonly achieved by formatting compounds according to a set of hard-coded rules and conventions [22][23][24]. These assume the occurrence of specific patterns in the arrangements of elements, bonds, and charges, and necessitate the development of algorithms to convert them to a standard format (which often varies across organizations). Manually crafted and coded rules have inherent disadvantages, the most notable of which is the need for programming expertise and time resources. Even more importantly, it is not always possible to formulate an adequate set of rules to automate a chosen standardization protocol, even when experts would be able to do so manually following specific guidelines.
Metal-organic compounds, which have become crucial for chemical catalysis, stand out in this regard. While abundant, catalyst data are highly individualized and stored in unique formats [25,26]. This is in part due to the lack of defined conventions for representing metal-organic compounds, but also due to their complex structures. These factors have made standardization tools difficult to develop, and most databases do not enforce standard file formats or representations upon data upload [27]. In fact, the community calls for an infrastructure that supports digitalization, the use of repositories, and standards, to facilitate data exchange and reuse, and to bridge the gap between experimental and computational methods [28].
Standardization is also of interest in the case of compounds with isomerism, which entails multiple non-identical representations of the same molecule. Tautomers are a class of isomers that readily interconvert, usually through rearrangement of bonds accompanied by migration of hydrogen atoms, a phenomenon known as prototropic tautomerism. Often, major isomers are not known experimentally, and database curators must decide which of the possible structures best defines the compound. This is a difficult task, which also depends on the use case (for example, one tautomer might be predominant in polar solvents and another in nonpolar solvents). The choice of representative tautomers has been shown to have major consequences in substructure searches [29] and in computed properties, such as pharmacophoric features influenced by the assignment of hydrogen bond acceptor and donor functionalities [30][31][32]. This, in turn, can have significant effects on the success rate in drug research and virtual screening.
In this work, we propose a deep learning method based on the transformer architecture [33] that converts a molecule represented with SMILES to its standardized format. We demonstrate the versatility of the model by training it with two different standardization protocols and allowing the user to select the preferred protocol when standardizing new molecules. Importantly, training the model on existing, successful rule-based standardization protocols provides a base model that can be fine-tuned on annotations, to then learn standardization processes which cannot be reduced to a set of rules. As such, we accomplish standardization for metal-organic compounds and tautomer identification by exposing the pre-trained models to datasets compiled through human annotation and computational/experimental methods, respectively. In doing so, our model can capture commonalities in molecular structure representation and codify a specific set of guidelines for consistently modifying compounds to match the desired standardized outputs.

Model
We employ a transformer architecture [33] adapted from the reaction prediction model by Schwaller et al [3]. Accordingly, we formulate compound standardization as a translation task, taking as input a non-standardized compound, and producing as output the same compound in standardized representation. Tokenized SMILES strings are used as input and output format to the model; tokenization is performed using a custom regular expression pattern (see appendix A). Among other aspects, the tokenization separates metal atoms from their charges, which allows the model to learn atom identity independently of charge state.
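As a sketch, a tokenizer in the spirit of the molecular transformer [3] can be written as follows; the exact custom pattern used in this work (which additionally separates metal symbols from their charges) is the one given in appendix A, not this one:

```python
import re

# Base SMILES token pattern in the spirit of the molecular transformer [3].
# Bracket atoms (e.g. [O-], [Na+]) are kept as single tokens here.
SMILES_TOKENS = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens; tokenization must be lossless."""
    tokens = SMILES_TOKENS.findall(smiles)
    assert "".join(tokens) == smiles  # no characters may be dropped
    return tokens
```

The losslessness assertion guards against characters silently falling outside the pattern, which would corrupt source/target pairs.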
The transformer model was implemented using the OpenNMT-py library (version 1.0.0). The parameters used are reported in appendix B.

Data
We represent all the compounds (before and after standardization) in SMILES format. This choice is grounded in a series of factors. First, it allows us to leverage significant progress achieved recently in chemical language modelling [3,6,19]. Second, using SMILES strings enables us to accurately represent the diverse bond types found in catalysts, such as coordination bonds, metal-metal or ionic bonds [21]. Lastly, this choice extends the applicability of the trained models to many popular chemical formats (graphs, Molfile [20], etc), thanks to the broad availability of algorithms converting molecules to and from their SMILES representation. Table 1 shows an overview of the datasets used in this work, and illustrates the format of an input string and its associated standardized representation. Model pre-training was performed using molecules deposited in the PubChem database [10,11], which is a rich open archive of chemical compounds. We extracted ∼200k molecules at random in both non-standardized (as deposited by users in PubChem and recorded in the Substance entries) and standardized formats (after the PubChem standardization protocol was applied, as recorded in the corresponding Compound entries), which served as source and target strings to the sequence-to-sequence model, respectively. The same source compounds were standardized using the ChEMBL protocol, to generate a second dataset used to train and validate the model developed here. Importantly, the PubChem protocol modifies the stereochemistry of compounds in ∼17% of the cases, and ∼23% of all compounds in the PubChem database [22] have annotated stereochemistry. The SMILES molecular representation employed in this work does not allow the encoding of 3D information (only relative stereochemistry annotations are possible). Accordingly, the model cannot assign stereochemical information in cases where it is absent in the input. Therefore, all transformations implying the addition, removal, and conversion of stereocenters were neglected in the assessment and training of the model.
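The filtering of stereochemistry-altering pairs can be approximated with a simple heuristic; the sketch below is our own simplification (not the exact procedure used in this work), comparing the stereo annotations of source and target SMILES:

```python
import re
from collections import Counter

# Stereo annotations in SMILES: tetrahedral ('@', '@@') and
# double-bond geometry ('/', '\') markers.
STEREO_MARKS = re.compile(r"@{1,2}|/|\\")

def stereo_signature(smiles: str) -> Counter:
    """Count the stereo annotations present in a SMILES string."""
    return Counter(STEREO_MARKS.findall(smiles))

def keep_pair(source: str, target: str) -> bool:
    """Drop source/target pairs whose stereo annotations differ, i.e. pairs
    where standardization added, removed, or converted stereocenters."""
    return stereo_signature(source) == stereo_signature(target)
```

A pair like `C[C@H](N)O` → `CC(N)O` (stereocenter removed) would be rejected, while pairs with unchanged stereo marks pass through.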
The catalyst dataset, which we introduce in this work, comprises 1512 unique metal-containing molecules which we selected from the Pistachio dataset [34] and standardized manually. All catalysts containing transition metals and a few additional p-block metals (see appendix D) were selected. Manual standardization was performed following a set of guidelines highlighted in figure 2 and detailed in appendix D. Nearly half (772) of the initial representations of the compounds in this dataset underwent modification upon manual standardization, while the rest remained unaltered. It is also worth noting that all the compounds in this dataset (non-standardized and standardized) have undergone RDKit [35] sanitization and are represented as canonical SMILES.
The last dataset employed in this work was compiled from Tautobase [36], a recently published database of tautomer ratios determined in solution, both experimentally and through theoretical methods. We selected tautomer pairs whose ratios were determined in water; the minor and major tautomers served as source and target inputs, respectively. We further augmented the dataset by duplicating each entry, with both the source and the target represented by the major tautomer.
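The augmentation step amounts to adding an identity pair for each entry; a minimal sketch (the function name is our own):

```python
def augment_with_identity(tautomer_pairs):
    """Duplicate each (minor, major) tautomer pair with an identity entry
    (major -> major), so the model also sees inputs that need no change."""
    augmented = []
    for minor, major in tautomer_pairs:
        augmented.append((minor, major))  # minor tautomer -> major tautomer
        augmented.append((major, major))  # already-major input is left as-is
    return augmented
```

Applied to the 755 Tautobase pairs, this yields the augmented training set of 1510 entries used below.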
For the latter two datasets (catalysts and Tautobase), we used five-fold cross-validation to split the data, resulting in five models trained in parallel. The results that follow report the average performance of the five models. The PubChem and ChEMBL datasets (see table 1) were split either randomly or based on Tanimoto similarities of the compounds present (see section 3.1).
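The five-fold split can be sketched as follows; this is an illustrative implementation (the shuffling and seeding details of the actual experiments are not specified here):

```python
import random

def five_fold_splits(n_samples: int, seed: int = 0):
    """Yield (train_indices, test_indices) for 5-fold cross-validation:
    each sample appears in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::5] for k in range(5)]  # five near-equal folds
    for k in range(5):
        test = folds[k]
        train = [i for j in range(5) if j != k for i in folds[j]]
        yield train, test
```

Each of the five (train, test) pairs trains one model; reported figures are the mean over the five held-out folds.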

Model pre-training
The molecular transformer [3] model was first repurposed to learn two popular rule-based standardization procedures, namely the PubChem [22] and ChEMBL [23] protocols (see details in appendix C). The purposes of this experiment are two-fold: first, it exposes the model's ability to selectively modify SMILES syntaxes according to a set of learned rules, and second, the resulting model can act as a base model that can be fine-tuned to learn specific and complex standardization transformations. Two types of splits were used when training and testing the standardization models: a random split and a Tanimoto split. The latter employs Tanimoto indices, which are a popular measure of structural similarity between compounds [37]. As such, we adopted the method of Kovács et al [38] to allocate compounds to the training/test datasets so that no compound in the test set is within Tanimoto similarity σ = 0.6 of any compound in the training set. The intent of such a split is to avoid structural bias and to make a robust evaluation of the model's ability to generalize to unseen structures.

Table 2. Performance of standardization models trained on different rule-based protocols. Accuracies are reported for the whole test dataset ('overall'), as well as for the subset of test set compounds that get modified during rule-based standardization ('modified'). The last column shows the validity of the generated SMILES strings, as recognized by RDKit [35].
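A simplified sketch of such a dissimilarity-based split follows. The actual work computes Tanimoto similarity on molecular fingerprints and follows the allocation scheme of Kovács et al [38]; here, plain feature sets and a greedy filter stand in:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two binary feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dissimilarity_filter(train_features, candidate_features, threshold=0.6):
    """Keep only candidates below `threshold` similarity to every training
    compound, so the test set contains no near-duplicates of the train set."""
    return [
        c for c in candidate_features
        if all(tanimoto(c, t) < threshold for t in train_features)
    ]
```

With real data, each feature set would be the on-bits of a molecular fingerprint (e.g. as computed with RDKit).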
Besides learning the PubChem and ChEMBL standardization protocols in separate models, we explored a model that combines the two procedures in a prompt-based fashion. That is, each input string is accompanied by a token ('[CHEMBL]' or '[PUBCHEM]') denoting the preferred standardization type. Such a universal method could accommodate the plethora of distinct preferences in standardization, allowing the user to query a single model effectively. Table 2 contains a summary of the training results. It includes the accuracy on the entire test dataset as well as that solely for compounds modified during rule-based standardization. The PubChem standardization protocol was reproduced with an overall test set accuracy of 98.0% for a random split. For compounds that require modification, 91.5% of the predictions match the expected standardized structure. When using a Tanimoto split, the test set accuracy drops to 80.1%, which is a testament to the scaffold bias introduced by the nature of the dataset. In the case of the ChEMBL protocol, the model achieves a standardization accuracy of 94.5% for modified compounds, outperforming the learning of the PubChem protocol. Higher performance was also registered with a Tanimoto split, namely 87.8%. Additionally, the model recognizes the molecules that do not require standardization and achieves overall test accuracies of 98.8% (random split) and 96.7% (Tanimoto split). In appendix E, we show examples of correct and incorrect model predictions. The high validity of the generated SMILES strings shows that, for most models, the predicted strings represent chemically correct compounds.
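The prompt-based input construction amounts to prepending a control token to the tokenized source string; a minimal sketch (the helper name is our own):

```python
def with_protocol_token(src_tokens, protocol: str):
    """Prepend a control token ('[PUBCHEM]' or '[CHEMBL]') selecting the
    standardization scheme the combined model should apply."""
    if protocol not in ("PUBCHEM", "CHEMBL"):
        raise ValueError("unknown protocol: " + protocol)
    return ["[" + protocol + "]"] + list(src_tokens)
```

The target side is unchanged; the model learns to condition its output on the leading token, so one set of weights serves both protocols.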
When the two protocols are learned in a combined model, a similar test accuracy is achieved. Only the performance on the ChEMBL standardization with random splits decreases slightly, while it stays the same or improves in all other cases. These results are in line with the expected behavior of multitasking [39], with the main benefit of reducing the number of models to train.

Catalyst standardization
We leverage the knowledge gained by the models trained above (see table 2) by fine-tuning them on a new human-curated dataset of metal-organic compounds ('catalyst dataset'). Not only do the compounds in this set deviate from the pre-training data in terms of class and vocabulary, but the standardization rules used here are also unique. Figure 1 illustrates the failed attempt of the (rule-based) PubChem standardization protocol to tackle metal-organic compounds. The protocol fails to preserve the chemical identity of the inputs, demonstrating that the rules designed for formatting PubChem compounds are inadequate for the chemical structure of the catalysts.
We attempt to learn the catalyst standardization guidelines outlined in figure 2(b) through a series of training setups, the results of which are summarized in table 3. We observe that model pre-training offers significant insight into catalyst standardization. The fine-tuned models showed an improvement in accuracy of ∼8% on the whole dataset, being superior at distinguishing between catalysts that require modification and those that do not. Additionally, a multi-task training approach was adopted in order to preserve information on the standardization rules learned during pre-training, but no benefits on learning catalyst transformations were noted. The best training results register a test set accuracy of 60.8 ± 2.4% for compounds that require modification, and 78.1 ± 0.9% overall (see table 3). When ensembling the models obtained from five-fold cross-validation, we obtained accuracies of 80.7% (overall) and 63.7% (modified compounds) on the held-out test set. We note that when ensembling, the model averaging is computed on hidden model states, rather than on the outputs. As such, no uncertainty estimates can be given for the ensemble model.
Further inspection of the model predictions highlights its ability to learn standardization preferences for metal-organic compounds. In particular, the model appears to learn the ligation preferences of metal centers. Figure 2(a) shows how distinct metals exhibit different coordination behaviors. Also, the model accuracy remains relatively high despite the large changes in the SMILES strings during standardization (see figure 2(c)). An inspection of incorrect predictions (some of which are shown in figure 3) reveals some interesting insights. First, we note that 22% of the incorrect predictions have an invalid SMILES syntax. The other predictions differing from the ground truth cannot be easily categorized, and we therefore inspected them one by one. Doing so, we observed that in roughly 28% of the incorrect predictions, the model prediction is actually sensible and matches the identity of the source molecule, corresponding to structures one could consider to be alternative standardization choices.

Table 3. Evaluation of the model's ability to perform standardization of catalysts. The test set top-1 accuracies are reported. The first row refers to a model which is only trained on the catalyst dataset, whereas the next rows refer to models pre-trained on ChEMBL or PubChem standardized data, and fine-tuned on the catalyst dataset. The fourth row corresponds to a model fine-tuned to learn both PubChem and catalyst standardizations in a multi-task manner. We performed five-fold cross-validation and report average held-out test set performance and standard deviation. The last entry in the table corresponds to the performance of the ensemble model obtained from the five models constructed for cross-validation. As ensembling averages hidden model states rather than their outputs, no standard deviation can be provided for the ensemble model.
In the remaining cases (∼50% of the incorrect predictions), we observed more significant structure mismatches, such as missing chunks of the compounds, changes in the identity of the metal center, or incorrectly branched molecular scaffolds. These errors notwithstanding, we note that the chosen data-driven formulation allows for correcting them and improving the model accuracy, provided that additional annotations are available.
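The two accuracy figures reported in tables 2 and 3 (overall, and restricted to compounds that the reference standardization modifies) can be computed as in this minimal sketch (the function name and result layout are our own):

```python
def accuracy_breakdown(sources, targets, predictions):
    """Top-1 accuracy overall and on the subset of compounds that the
    reference standardization actually modifies (source != target)."""
    overall_hits = modified_total = modified_hits = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        hit = pred == tgt
        overall_hits += hit
        if src != tgt:  # compound changed by standardization
            modified_total += 1
            modified_hits += hit
    return {
        "overall": overall_hits / len(targets),
        "modified": modified_hits / modified_total if modified_total else None,
    }
```

Exact string match on canonical SMILES is a strict criterion: any deviation, including an invalid output string, counts as an error.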
A closer look at the dataset reveals additional insight. In figure 4, we illustrate the differences between the compositions of the PubChem and catalyst datasets. The increased diversity of the species in the catalysts stands out, as well as their more uniform distribution within the dataset. Namely, while more than 80% of the compounds in PubChem contain O or aromatic C atoms, that is the case for only ∼40% of the catalysts, which reveals the lower degree of similarity between the compounds and the larger range of species present. Upon comparison of the vocabularies of the two datasets (i.e. the collection of unique tokens occurring in the training set of the language model), the catalysts contained more than 25 additional tokens compared to PubChem. Accordingly, the high diversity and heterogeneity of the catalyst dataset vouch for the few-shot learning ability of the language model and inherently contribute to model validation, as they introduce a stress test of learning standardization upon limited exposure to similar compounds and transformation types.
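The vocabulary comparison reduces to collecting unique tokens per corpus; a toy sketch with two hypothetical one-molecule corpora (the real comparison is run over the full tokenized training sets):

```python
def vocabulary(tokenized_corpus):
    """Set of unique tokens occurring in a tokenized corpus."""
    return {token for tokens in tokenized_corpus for token in tokens}

# Toy stand-ins for the tokenized PubChem and catalyst training sets.
pubchem_vocab = vocabulary([["C", "C", "(", "=", "O", ")", "O"]])
catalyst_vocab = vocabulary([["[Pd]", "C", "C", "[Rh]"]])
new_tokens = catalyst_vocab - pubchem_vocab  # tokens unseen during pre-training
```

Tokens in `new_tokens` (here the metal atoms) are exactly those the fine-tuned model must handle with no pre-training exposure.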

Determining major tautomers
We further tested the adaptability of our model by tackling the determination of major tautomers in solution. Strictly speaking, identifying major tautomers is not a standardization task, as it depends on electronic effects and on the properties of the solvent in which compounds are dissolved [41], rather than on human choices on the representation of compounds. This exercise, however, illustrates that one can apply the standardization model architecture to problems that can be loosely formulated as a standardization task. Also, the data-driven standardization approach introduced in this work has the advantage that it can be trained either on experimental data, or on the preferred tautomer representations selected by experts based on several prioritization criteria (as is the case for the PubChem protocol [22]).
We trained our standardization model to learn major tautomers present in water, as defined in Tautobase [36]. The model was trained on a small set of 755 tautomeric pairs. To enable the model to distinguish between compounds that are already represented by the major tautomeric form and those that require modification, we duplicated each entry in the training dataset with the source string identical to the target string, reaching an augmented training set of size 1510. The training results outlined in table 4 reveal that, in this case, pre-training the model on PubChem standardization rules adds a significant improvement to the results. While the overall accuracy of 53.0% may seem low, it is important to note that in many cases, more than two tautomer candidates need to be considered when enumerating all the possibilities. In addition, the model accuracy would increase, should a more comprehensive dataset be available.

Table 4. Evaluation of the model's ability to identify major tautomers in water. Averaged results are reported from five-fold cross-validation and the last row reports the performance of an ensemble of the five models. As ensembling averages hidden model states rather than their outputs, no standard deviation can be provided for the ensemble model.

Learning canonicalization
As a final experiment, we randomized SMILES strings by doing a cyclic rotation of the atomic indices, and presented the model with the task of recovering the canonical analogues of the molecules. This was achieved with an accuracy of 93.6% on a random test set. Canonicalization was performed using the algorithm provided by RDKit [35].
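The data generation for this experiment can be sketched with RDKit as below; `rotate_atom_indices` and `canonicalize` are our own helper names, built on RDKit's `RenumberAtoms` and canonical SMILES writer:

```python
from rdkit import Chem

def rotate_atom_indices(smiles: str, shift: int = 1) -> str:
    """Emit a (generally non-canonical) SMILES by cyclically rotating
    the atom order of the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    order = [(i + shift) % n for i in range(n)]  # cyclic permutation
    rotated = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(rotated, canonical=False)

def canonicalize(smiles: str) -> str:
    """RDKit canonical SMILES, the target of the canonicalization task."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
```

The rotated string and its canonical form give the source/target pair; by construction both describe the same molecule, which the round trip through `canonicalize` confirms.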

Conclusions
The increasing adoption of digital technologies for chemistry has highlighted the importance of how chemical compounds are represented. In particular, consistency in their representation has been shown to facilitate database queries and machine learning applications. While the standardization of chemical compounds is already widely applied, it has so far relied on handcrafted rules only, limiting its applicability to relatively simple transformations of organic compounds.
In this work, we present the first data-driven attempt for chemical standardization. Our approach relies on a transformer-based model to learn from a set of chemical structures and their standardized representations, enabling users to achieve the desired standardization transformations without the need for hard-coded rules. This approach offers several advantages, including flexibility (as it does not require expert rules), wider applicability (as it can be applied to various classes of chemical compounds), and refinability (as more diverse training data will improve the model).
In a first step, our model separately learned two popular rule-based standardization procedures, with accuracies above 98%. For the compounds that undergo modifications during rule-based protocols, the model predicts the correct outcome with test set accuracies above 91%. We then followed a multi-task learning approach, where the model was trained on both standardization schemes simultaneously, enabling one single model to support either scheme in a prompt-based fashion, and without reducing the standardization accuracy.
In a second step, we turned our attention to the standardization of metal-organic compounds. Standardizing this class of compounds is more challenging and especially difficult to address with rules. Furthermore, it is often subject to more subjective standardization preferences. To address this standardization setting, we fine-tuned our model on a new custom set of 1512 metal-organic compounds. The resulting model predicted the preferred standardizations with a test accuracy of 80.7%.
We also showed that the architecture presented in this work exhibits transferability to various tasks. For instance, we applied the data-driven standardization approach to learn tautomeric preferences in water. While this task depends on physical properties and electronic interactions, rather than human choices of preferred chemical representation, the model managed to learn the major tautomer with an accuracy of 53.0% (out of several candidates) after learning from less than a thousand samples of a dataset of non-trivial tautomers.
We validated the performance of the models in out-of-distribution domains by employing a Tanimoto split, which partitioned the data into train and test sets so as to increase the dissimilarity between them. We further believe that training and testing the model on the highly diverse and heterogeneous catalyst dataset which we introduce here represents a stress test of, and evidence for, the model's few-shot learning abilities.

Data availability statement
The data that support the findings of this study are openly available at the following URL: https://github.com/rxn4chemistry/rxn-standardization.

Appendix C. Description of rule-based standardization protocols

C.1. PubChem protocol
The PubChem standardization protocol has emerged as a curation tool for the PubChem chemical data repository [10,11], mapping all deposited chemical compounds to standardized analogues [22]. The protocol uses an extensive list of rules relying on routines from the OpenEye C++ toolkits [42], which are mostly designed for organic compounds (as the majority of compounds deposited in PubChem are organic, as of 2018 [22]). Besides verifying the validity of molecules, the protocol generates preferred tautomers for chemical compounds and determines canonical Kekulé structures and stereochemistry, among others. It achieves the latter by evaluating available 2D and 3D information about stereocenters and attempting to construct a canonical configuration by following Cahn-Ingold-Prelog priorities [43] or by employing symmetry classes [44].

C.2. ChEMBL protocol
The ChEMBL standardization protocol is designed for bioactive compounds and is less restrictive than PubChem in its SMILES modification practices. ChEMBL does not attempt to generate canonical tautomers for compounds and it maintains the original geometries of double bonds. Figure 7 exemplifies such distinctions between the two approaches.

Some incorrect predictions correspond to variations in tautomeric representation. Additionally, in one instance, the model only standardizes one out of two functional groups. Lastly, certain incorrect predictions can be attributed to syntax errors in the output SMILES, such as a misplaced closing parenthesis or incorrectly reproduced functional groups.