Decoupled coordinates for machine learning-based molecular fragment linking

Recent developments in machine learning-based molecular fragment linking have demonstrated the importance of informing the generation process with structural information specifying the relative orientation of the fragments to be linked. However, such structural information has so far not been provided in the form of a complete relative coordinate system. We present a decoupled coordinate system consisting of bond lengths, bond angles and torsion angles, and show that it is complete. By incorporating this set of coordinates in a linker generation framework, we show that it has a significant impact on the quality of the generated linkers. To elucidate the advantages of such a coordinate system, we investigate the amount of reliable information within the different types of degrees of freedom using both detailed ablation studies and an information-theoretical analysis. The presented benefits suggest the application of a complete and decoupled relative coordinate system as a standard good practice in linker design.


Introduction
Computational drug design remains a challenging problem, foremost due to the vast size of the drug-like chemical space [1,2,3,4].A subset of the discipline of molecular generation is constituted by the task of fragment linking.Given a pair of structures, the goal of the design process is their connection via an appropriate linker.Chemical properties like toxicity and solubility are of general concern in order for a compound to turn out administrable as a drug.Furthermore, in order to maintain high activity and specificity, it is of major importance that the designed linker maintains the binding mode of the underlying fragments [5,6].Therefore, an effective linker design process should incorporate structural information in the form of the relative orientation between the molecules to be joined.Recently, DeLinker [7], a machine learning framework operating on molecular graphs, was proposed to incorporate such structural information.While the results of the method arguably constituted a breakthrough in machine learning-based fragment linking, unfortunately, the method employs a problematic set of coordinates.
The benefits of a maximally decoupled coordinate system are well known in computational chemistry and biophysics.For instance, they have turned out especially crucial in the calculation of configurational entropy [8,9,10,11,12,13,14,15] from molecular mechanics simulations.Here, in Cartesian coordinates, the adequate sampling of the soft constraints present in rigid chemical groups (e. g. a methyl group) quickly becomes computationally prohibitive with increasing molecular size.Recently, similar insights have led to a mathematically rigorous separation of the placeholder atom from the physical partition function in alchemical free energy simulations [16].Coordinate systems based on bonds, bond angles and torsions feature decoupling naturally by accommodating the molecular topology made up from rigid bonds.Here, we apply such a bond-angle-torsion (BAT) coordinate system in order to specify the relative orientation of the molecular fragments to be linked.As opposed to the coordinates applied in the investigated previous work [7], the coordinate system proposed in this article is mathematically well-defined, complete and geometrically decoupled.Here, geometrically decoupled denotes the absence of mathematical coordinate constraints [16], which constitutes a necessary criterium for the dynamical decoupling mentioned above in the context of configurational entropy.

Methods
In this article, we propose a beneficial choice of a relative coordinate system for linking pairs of molecular fragments by specifying their relative orientation in 3D space.In order to benchmark our proposed coordinate system, the details of which will be given below, we follow DeLinker [7], a recent pioneering machine learning framework for linking molecular fragments.

The framework
Here, a brief summary of the DeLinker [7] framework is given.For further details, the interested reader is referred to said publication [7] as well as references therein (in particular [17,18,19]).
On the top level, DeLinker [7] is embedded in a Variational Autoencoder (VAE) [20] framework; Linker generation is performed by seeding the latent variables of the decoder of the VAE.First, the graph representation of the fragment pair is encoded by a standard gated graph neural network [19].Then, a set of atoms is initialized in order to serve as expansion nodes for linking a pair of fragments.The maximum linker length is given by the chosen size of this set of expansion nodes.For each atom, a multidimensional hidden state is drawn from standard normal distributions.The atom type of the respective node is derived via sampling from a learned mapping from the hidden state to a Boltzmann distribution (i.e.softmax or, termed more precisely, softargmax).
Successively, the fragments are linked by attaching atoms from the set of expansion nodes in an iterative manner.Following the breadth-first paradigm, a first-in-first-out queue is initialized with two exit atoms (Figure 1), one per fragment.The focused node, i. e. the first node in the queue, forms covalent bonds to candidate atoms, which are themselves added to the queue upon bond formation.The selection process of the focused node is terminated upon forming a bond to a special stop node.Then, the focused node is removed from the queue and the next atom in turn initiates bond formation.Nodes become closed once they are focused, which means they are no longer considered as candidate nodes for bond formation throughout the entire generation process.The bond selection procedure is accomplished by computing feature vectors between the focused atom and all candidate atoms.These feature vectors comprise both atom types, their hidden states as well as their graph distance.Furthermore, the bond formation process takes as an input the average of the hidden states across all nodes during node initialization (i.e. iteration zero) as well as at the current iteration.The current iteration number is supplied in addition.Importantly, the feature vector is augmented with coordinates specifying the relative orientation between the fragments, the details of which will be given below.The actual chosen bond and its type (single, double or triple) are sampled from Boltzmann distributions, which are based on learned mappings from the feature vectors and are masked with valency constraints [17].
After each bond formation, the hidden states of all nodes are updated with respect to the newly formed graph topology.This update is performed via a standard gated graph neural network [19], taking as input the initial states of all nodes.Starting from the states of the nodes at iteration zero, instead of the current state, prevents the network from learning assembly pathways [17].
The generative process ends when the first-in-first-out queue is empty.Note that the described procedure does not prevent the generation of unlinked fragments, which, however, constitutes the only mechanism leading to invalid molecules as an outcome.The angles between r E1→L1 and r L1→L2 as well as between r L1→L2 and r L2→E2 , termed α 1 and α 2 respectively, yield another two BAT coordinates.In order to obtain six BAT coordinates, a well-known number for specifying relative molecular orientations, we add the dihedral angle φ of the atom chain E Fragment linking constitutes a multimodal problem.Two fragments can be linked in many different ways.Inspired by Jin et al. [18], a low-dimensional latent vector z is introduced in order for the model to accommodate this one-to-many mapping.To avoid difficulties known from Computer Vision [21], during training, z is derived from the encoding of the ground-truth linked fragments.In this manner, the model is encouraged to pay attention to the latent vector.Furthermore, z is regularized via the KL-divergence to the standard normal distribution.
Referring to Figure 1, let us introduce the following nomenclature.The atoms of the fragments to which the linker is covalently bound are referred to as exit atoms; E 1 and E 2 for the two fragments, respectively.The linker atoms L 1 and L 2 are attached to E 1 and E 2 via a rotable bond.Naturally for the problem of linking molecular fragments, we are not concerned with the global position and orientation of the fragment pair as a whole.Rather, we are interested in the position and orientation of the fragments relative to each other.Thus, without loss of generality, exit atom 1 (E 1 ) is positioned at the origin of the Cartesian coordinate system, linker atom 1 (L 1 ) on the x-axis and linker atom 2 (L 2 ) in the x-y plane.Then, these anchored Cartesian coordinates [22,23] are given as follows: In order to achieve ease of notation, we constitute the following definitions.
Then, our implemented BAT coordinates are given as: Matrix multiplications and outer vector products are denoted by • and ×, respectively.The back-transformation from BAT (Equation 3) to anchored Cartesian coordinates [22,23] (Equation 1), demonstrating the completeness of our proposed BAT coordinate system, is given in the Appendix.In this context, note that we do not explicitly feed the model the lengths of the exit vectors, i. e.
As physical bond lengths, their values hardly vary in comparison to the pseudo-bond length which dominates the geometry.Thus, they barely carry any useful information for the model.One topic is left to be discussed.In contrast to bond angles, i. e. α 1 and α 2 in the present case, dihedral angles are periodic.This means α 1 , α 2 ∈ [0, π], but φ ∈ [0, 2π).Periodicity poses a challenge for machine learning algorithms, since the implied distance metric does not correctly reflect the underlying physical geometry.
As done commonly (see e. g. [29,30,31]), we feed the model the sine and cosine of φ, instead of the plain dihedral angle value.One way to motivate this treatment states as follows.There is a periodic discontinuity [see Figure 2(a)] at 0°= 360°in the mapping of the underlying geometry to its numerical dihedral angle value.A solution is inspired by the complex unit circle [Figure 2(b)]: One maps the dihedral φ to exp[i(π/2 − φ)] = sin(φ) + i cos(φ).The machine learning model is fed the real and imaginary part of this transformation, i. e. both sin(φ) and cos(φ).These angular functions, as a representation of the periodic dihedral angle φ, are both continuous as well as faithful with respect to the underlying geometry.Additionally, they naturally map to the interval [−1, 1], which constitutes another desirable property.Note that the transformation exp[i(π/2−φ)] was chosen over just exp(iφ) in order to match the complex unit circle with the canonical definition of dihedral angles [Figure 2(b)].However, as the order of the input features is irrelevant for the machine-learning algorithm, this choice of swapping sine with cosine is arbitrary.

Data preparation
Following DeLinker [7], we are working with two datasets, a selected [32] set of 250 000 ZINC [33] compounds as well as CASF-2016 (i.e., the PDBbind core set) [34].A major difference between these two sets is the origin of the 3D structural information.While for CASF, the structures stem from 285 high-quality crystal structures of ligands bound to proteins, the ZINC structures are generated in silico using RDKit [35], as further detailed below.
For both data sets, preparation is performed in the same manner and sketched as follows.The ligands are split into three parts, i. e. two fragments and the corresponding linker.Cuts were performed on acyclic single bonds outside of functional groups [36].Triplet splits for which either of the three components is unrealistically small, or the linker is inflated with respect to the fragments, were removed.The remaining triplet splits were filtered further by molecular graph (2D) properties, in particular synthetic accessibility (SA) [37], ring aromaticity as well as pan-assay interference compounds (PAINS) [38].This procedure yielded ∼420 000 triplet splits for the ZINC data set and 309 triplet splits for CASF.Training is performed purely on the ZINC data set.400 compounds each were randomly selected from the whole ZINC set and held back as a validation and test set.The CASF data is used solely as a test set for evaluation.
As outlined above, unlike for CASF, where experimental 3D structures are available, the ZINC 3D coordinates need to be generated.This is accomplished using RDKit [35] via employing the Merck molecular force field (MMFF) [39,40].The DeLinker [7] source code ships without the 3D structures of the training set (obviously due to storage space reasons).For calculating our proposed relative coordinates of the training set, we reproduced the 3D structures.Referring to Figure 1, DeLinker [7] uses the distance from E 1 to E 2 as well as the angle between the exit vectors (i.e. r E1→L1 and r E2→L2 ) as relative coordinates.In order to ensure exact reproduction, we recalculated these relative coordinates alongside the proposed BAT coordinates and compared them to the data shipped with DeLinker [7].Our comparison demonstrated an exact match.For further details, the interested reader is referred to the DeLinker publication [7].

Training
The model is trained under a VAE framework over 10 epochs, exclusively on the ZINC training set, using the Adam optimizer with a learning rate of 10 −3 and a minibatch size of 16.The encoding of the nodes hidden states as well as the latent vector z encoding the ground-truth molecule are regularized via KL loss to follow standard normal distributions.A two-fold cross-entropy reconstruction loss is applied, one part measuring the error in the prediction of the atom types, the second part judging the sequence of bond-formation steps in order to reconstruct the ground truth molecule.Further details can be found in the DeLinker publication [7] as well as in Liu et al. [17].

Evaluation
For each of the fragment pairs in the triplet cuts of the test sets, i. e. 400 in the case of ZINC and 309 for CASF, 250 linkers are generated.This amounts to 100 000 and 77 250 linked molecules generated, respectively.We apply the standard metrics for molecular generation tasks, i. e. fractional validity, uniqueness and novelty.Validity is assessed by RDKit [35] being able to parse the generated SMILES [41] strings as connected molecules.Uniqueness is calculated by the cardinality of the (unique) set of generated molecules divided by the total number generated.Novelty describes the fraction of generated molecules which are not contained in the training set.
Furthermore, we assess the generated molecules via the 2D filtering metrics of subsection 2.3.SA [37] is a measure for the difficulty of physically synthesizing the molecule in the laboratory.Ring aromaticity amounts to the properties of a molecule constituting a drug.PAINS [38] assesses the reactivity of a compound by performing a knowledge-based analysis on its substructures.
Arguably the most significant estimator of the impact of our coordinate system is the models capability to recover the ground-truth linker from the original triplet cut.This statement is motivated by the fact that the structural information provided to the generation process is derived from the respective ground-truth linker.In this manner, the model is conditioned to reproduce the ground-truth linker more frequently.

Information-theoretical analysis
Input features, referring to the coordinates in the present case, should be uncorrelated, i. e. decoupled, in order to assure efficient learning for the model.Therefore, the mutual information between the input features should be low [42].In order to investigate said decoupling, we calculate pairwise mutual information values, given as [8,9,10,13,14,15] p x,y i,j ln p x,y i,j The formulas given in Equation 5refer to discretized differential entropy values.This means the underlying continuous probability densities are approximated by sampling to discrete histogram bins.The indices x and y designate an arbitrary coordinate, i. e. a bond, an angle or a dihedral in the present case.p x i denotes the probability for the coordinate x to assume a value in bin i.If N samples for coordinate x are taken, and N i of them fall into bin i, then p x i = N i /N (and analogous for p y j ).Furthermore, p x,y i,j = N i,j /N if N i,j of the N samples taken fall into bin i for the x-coordinate and bin j for the y-coordinate in the according 2D histogram.∆x and ∆y are the widths of the equally spaced bins.J x i is the Jacobian of coordinate x in the middle of bin i.Using the definition of the marginals we can write Equation 4as p x,y i,j ln p x,y i,j /(J x i J y j ∆x∆y) p x i /(J x i ∆x) * p y j /(J y j ∆y) = i,j p x,y i,j ln p x,y i,j The last equality exhibits interesting features to be discussed.First, the Jacobian determinants have canceled out, meaning the type of degree of freedom has become irrelevant.Furthermore, the bin sizes have vanished.This means that the mutual information calculated from discretized differential entropy values resembles discrete information in Shannon's [43] sense.Note, however, that the mutual information still is dependent on bin sizes; the probabilities p x,y i,j , p x i and p y j are affected by this choice.For example, let us denote the limit where the binning becomes binary, meaning the bin sizes are chosen so small that there is either no or only one data point in each bin.Assuming that both the x-as well as the y-marginal take on such a binary form, we can write for N data points: Furthermore, note that we write the equations without the Boltzmann or Gas constant.Spatial entropies, as in the equations 5, do not bear physically meaningful units.The reason is the fact that the semi-classical entropy integral cannot be split in a manner that either the spatial or the momentum entropy bear physically meaningful units [10].Upon taking entropy differences, however, the problematic terms cancel out [10].This means that the mutual information in equations 4, 7 and 8 can indeed be multiplied with the Boltzmann or Gas constant to yield physical entropy values.Stated without such a physical constant, the units obtained here are natural units of information (nats, similar to bits, however, using the natural logarithm).

Results and discussion
As mentioned in section 2.5, arguably the most indicative metric for the quality of the provided coordinates is the rate of recovery of the ground-truth linker, as it quantifies the models capability to follow the supplied information.This metric improves across all test sets, as shown in Table 1, with the most drastic enhancement for ZINC.The coordinate system carries information about the ground-truth linker.With higher quality information supplied, the model is expected to recover the original linker more frequently.This demonstrates that the model takes advantage of the augmented information by following the provided target geometry.Since the generation process is more directed, the uniqueness metrics drops, which serves as another indicator for the quality of the information content in our proposed coordinate system.
Considering the ablation study of DeLinker [7] on the ZINC test set, the bulk of the geometric information obviously is contained in the distance coordinate.Compared to providing no relative coordinates at all, the distance takes the recovery rate from 74.5 to 78.3 percent.Adding the angle coordinate, i. e., considering the complete DeLinker [7] coordinate set, yields an additional gain of comparatively low 0.7 percent.Figure 3 provides insight on this outcome.When using a different random seed for RDKit to generate conformers, the DeLinker [7] distances are preserved rather robustly with a Pearson correlation of 0.86.The angles, on the other hand, show significant deviations.Their Pearson correlation to the angles of the conformers generated with the previous random seed is given as a relatively low value of 0.47.This indicates that the reliability of the angular information is low compared to the distance information, which attributes to only minor gains when feeding them to the model.
In order to further investigate the information content in the relative coordinates proposed here, analogue  1 are investigated and the impact of the different types of coordinates is dissected.As was the case for the DeLinker [7] coordinates, the 2D filter and the valid criteria are generally high.Referring to the recovered metric, similar to Table 1, the bulk of our information is carried in the distance coordinate, albeit the improvement appears significantly more distinct.Furthermore, the unique metric continues to behave opposite to the recovery metric, supporting the hypothesis of the coordinate information providing valuable guidance for the model.
ablation studies were performed for the BAT coordinates.Table 2 lists the results.Similar to the DeLinker [7] coordinate system, the bulk of the BAT information is carried in the distance coordinate.However, the BAT distance (| r L1→L2 | in Figure 1) takes the recovered metric from 74.5 to 84.5 (Table 2), as opposed to only 78.3 (Table 1) for the DeLinker [7] distance | r E1→E2 |.Given the similarity of the two distances, this discrepancy appears rather drastic at a first glance.The reason is argued to lie in the fact that the DeLinker [7] distance fails to decouple from the angular and dihedral coordinate systems.For the 6 BAT coordinates in Figure 1, one can vary each of them while keeping the others constant.However, any such variation will change the DeLinker [7] distance | r E1→E2 |.Furthermore, varying the BAT angles or the BAT dihedral will change the DeLinker [7] angle.When comparing the distances in Figure 3 to those in Figure 4, the effect of this coupling becomes evident.The DeLinker [7] distance shows a Pearson correlation of 0.86 between coordinates of conformers generated with different random seeds.The decoupled BAT distance, on the other hand, demonstrates excellent robustness to the choice of random seed with a Pearson correlation of 0.99.This explains the arguably drastic improvement of the recovered metric by 10 percent using the BAT distance only.
Feeding the model the BAT angles additionally, we gain another 2.5 percent, as opposed to 0.7 for the DeLinker [7] angle.Comparing the correlation plots, BAT provides two angles with a Pearson correlation of 0.67 and 0.72 versus one angle with 0.47 for DeLinker [7]; the BAT angles are decoupled from the distance and dihedral system.
Inspecting Figure 4, the dihedral angle information is of low quality.Interestingly, even given this low quality information, the model improves the recovered metric by 1.3 percent.Note that this value exceeds the angular improvement of the DeLinker [7] coordinates (0.7) almost by a factor of two.The reason, as above, arguably lies in the decoupling from the distance and angular system; while the information is certainly non-reliable on its own, it is indeed orthogonal, i. e. non-redundant, with respect to the other coordinates.Altogether, the BAT coordinate system takes the recovery metric to 88.3 percent, which is an improvement of 9.3 percent over the DeLinker [7] coordinate system.
Last but not least, we have performed an information-theoretical analysis comparing the two input coordinate sets.Given the fact that the DeLinker [7] coordinates are given as a distance and angle pair, we compare the mutual information using the training set between the Delinker [7] coordinates with the mutual information between the BAT distance and either BAT angle.Figure 5 shows that, with few exceptions at large bin sizes where values have not converged, the BAT coordinates exhibit lower mutual information, which demonstrates an increased amount of decoupling.Given these insights on an information-theoretic level, we suggest the presented graphs as strong evidence for the decoupling of the coordinate system as a major factor for the improvements.

Conclusion
Inspired by recent pioneering work, we have demonstrated a considerable beneficial effect for the application of a well-behaved coordinate system on machine learning-based molecular linker generation.The enhancements of our proposed relative coordinate system, which roots in the bond-angle-torsion (BAT) coordinate formalism, were established by comparing to said pioneering work.In this context, we roughly halved (a reduction of 44.3 percent) the number of test set examples for which the ground-truth linker could not be recovered.A comparative analysis on various indicative aspects was performed.First, commonly used metrics for generative models as well as molecular graph evaluation were presented.Furthermore, ablation studies on the included coordinate types were compared in order to identify the coordinates which provide the most valuable information to the model.The reasons for the performance of the different coordinates were elaborated by performing a robustness analysis with respect to the conformer generation process given different random seeds.By means of information-theoretical calculations, the amount of decoupling in the two input coordinate sets was compared.The results support the hypothesis that the decoupled nature of our presented coordinate system plays a major role for the improved performance of the generative process.
The application of structural information for future state-of-the-art models in molecular fragment linking arguably suggests itself.Given the considerable improvements demonstrated, we propose the presented coordinate system as a standard technique for linking molecular fragments.

Acknowledgements
We thank Anton A. Polyansky for a critical reading of the manuscript.

Transformation from BAT to anchored Cartesian coordinates
In order to demonstrate the completeness of the proposed BAT coordinate system (Equation 3), the backtransformation to anchored Cartesian coordinates [22,23] (Equation 1) is given as follows.
Here, b 3 ′ represents b 3 for the case φ = 0.In order to obtain b 3 from b 3 ′ , we rotate b 3 ′ around an axis parallel to b 2 by an angle φ.Following previous work [27], this coordinate transformation is applied by using Rodrigues' rotation formula.Note that Equation 9 is never explicitly calculated.It is given here for the sole purpose of demonstrating the equivalence of the BAT and internal Cartesian coordinate systems.

Figure 1 :
Figure 1: Bond-angle-torsion (BAT) coordinates.For specifying the relative orientation of two molecular fragments, we first define the exit atom E 1 of the first fragment as the atom which is connected to the initial linker atom L 1 .E 2 and L 2 are defined analogously for the second fragment.Two of the BAT coordinates are constituted by the exit vectors | b 1 | = | r E1→L1 | and | b 3 | = | r L2→E2 |, bridging the exit atoms to the initial linker atoms.A third BAT coordinate is defined by the length of the pseudo-bond | b 2 | = | r L1→L2 |.The angles between r E1→L1 and r L1→L2 as well as between r L1→L2 and r L2→E2 , termed α 1 and α 2 respectively, yield another two BAT coordinates.In order to obtain six BAT coordinates, a well-known number for specifying relative molecular orientations, we add the dihedral angle φ of the atom chain E 1 -L 1 -L 2 -E 2 .

Figure 2 :
Figure2: Treating periodicity.a) shows an illustration of the problematic distance metric.Both the top and bottom angles measure 10 degrees geometrically.For the bottom angle, this is correctly reflected in its numerical value, i. e. |∆α bottom | = |185°− 175°| = 10°.For the top angle, however, the numerical value deviates from the underlying geometry due to the periodic discontinuity at 0°= 360°.Here, we have |∆α top | = |5°− 355°| = 350°.This discontinuity of the mapping from geometric angles to their numerical values poses a challenge for machine learning algorithms.b) The periodic discontinuity can be avoided by considering the dihedral an angle in the complex plane.Supplying both the real and imaginary part, i. e. sin(φ) and cos(φ), respectively, dihedral angles are presented to the model in continuous form and without loss of information.

Figure 3 :
Figure 3: Robustness of DeLinker coordinates.The plots compare two random seeds used for generating conformers for the test set.The distance as well as the angle of the DeLinker coordinate set are plotted.The distances demonstrate a somewhat adequate robustness between random seeds, with Pearson correlation of 0.86 between seeds.For the angles, we observe a considerable fraction of outliers, decreasing the correlation to 0.47 only.The robustness to altering random seeds is proposed as a proxy for the quality of the provided structural information.

Figure 4 :
Figure 4: BAT coordinate quality.Analogous plot to Figure 3 for the proposed BAT coordinates.The robustness to varying random seeds for the distances appears excellent with a Pearson R = 0.99.Compared to Figure 3, both our angles demonstrate enhanced reliability.Referring to the dihedrals, the estimated information content is considerably low.

Figure 5 :
Figure 5: Mutual information comparison.The graphs compare the mutual information between the DeLinker[7] distance and angle as well as between the BAT distance and bond angles.The training set of ground-truth molecules is analyzed.Since, as discussed in the Methods section, the mutual information is dependent on the bin size, the plots show the mutual information as a function of the granularity in terms of the number of bins used.Different regimes of the number of bins are shown.Note that the mutual information curves for both our angles lie on top of each other.With the exception of rare events in the regime of low number of bins (top graph), our coordinates exhibit lower mutual information, i. e. enhanced decoupling.

Table 1 :
[7]ph metrics.2Dqualitycriteriaarecomparedformolecules generated using BAT coordinates versus the values from the DeLinker[7]publication.Values which are not given in the DeLinker[7]publication are marked via an "N/A" entry.For the ZINC dataset, the DeLinker[7]ablation study values are included.The two 5+ test sets at the bottom constitute subsets of target linkers with at least 5 atoms in length.Best in category (row) values have been printed in bold font.

Table 2 :
BAT ablation study.The same metrics as in Table