
Estimating the probability of coincidental similarity between atomic displacement parameters with machine learning


Published 14 July 2021 © 2021 The Author(s). Published by IOP Publishing Ltd
Citation: Viktor Ahlberg Gagner et al 2021 Mach. Learn.: Sci. Technol. 2 035033. DOI: 10.1088/2632-2153/ac022d


Abstract

High-resolution diffraction studies of macromolecules incorporate the tensor form of the anisotropic displacement parameter (ADP) of atoms from their mean position. The comparison of these parameters requires a statistical framework that can handle the experimental and modeling errors linked to structure determination. Here, a Bayesian machine learning model is introduced that approximates ADPs with the random Wishart distribution. This model allows for the comparison of random samples from a distribution that is trained on experimental structures. The comparison revealed that the experimental similarity between atoms is larger than predicted by the random model for a substantial fraction of the comparisons. Different metrics between ADPs were evaluated and categorized based on how useful they are at detecting non-accidental similarity and whether they can be replaced by other metrics. The most complementary comparisons were provided by the Euclidean, Riemann and Wasserstein metrics. The analysis of ADP similarity and the positional distance of atoms in bovine trypsin revealed a set of atoms with striking ADP similarity over a long physical distance, and in general the physical distance between atoms and their ADP similarity do not correlate strongly. A substantial fraction of long- and short-range ADP similarities do not form by coincidence and are reproducibly observed in different crystal structures of the same protein.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Experimental research often addresses the question of whether or not the properties of systems are the same. If the properties are discrete, one can decide this without much of a problem, but nature does not reveal continuous properties to arbitrary precision. The true nature of these properties is clouded from the experimenter's vision, and a statistical framework is indispensable for reaching a conclusion. Machine learning has many applications in macromolecular crystallography, and it has been used for improving the estimates of structure factor amplitudes [1] and their differences [2], with applications in experimental phasing [3], serial crystallography [1] and time-resolved structural studies [4]. These examples employed supervised machine learning tools. Clustering, as an unsupervised method, is also widespread in crystallographic analysis [5, 6]. One example of the usefulness of an unsupervised method is the hierarchical clustering of anisotropic displacement parameters (ADPs) in atomic resolution crystal structures [7].

The model of a crystal structure is inherently probabilistic. Rather than observing individual atomic positions that are later aggregated by descriptive multivariate statistics, atoms are described by the parameters of a distribution from the beginning. In this paper, we focus on the commonly used multivariate normal distribution description of atoms with spatial dimensions. Within this standard framework, there is no deeper layer of physical description. The analysis of the statistical relation between atoms has to start with their distribution parameters, the mean atomic coordinate vectors and scale matrices in the form of ADPs (equivalent to covariance matrices). A statistical model requires a distribution that can describe ADP matrices and a likelihood function that estimates the probability of ADPs given the various a priori distribution parameters. Wishart's distribution is a multivariate gamma distribution [8], and it is frequently used as a conjugate prior for covariance matrices. Wishart-distributed matrices can be constructed stepwise from simpler random variables with Bartlett's method [9]. There are other routes to generate covariance matrices. Cholesky decomposition [10] of a covariance matrix results in a vector of variances and a correlation matrix. For these components, suitable prior distributions can be assigned, for example, the Lewandowski–Kurowicka–Joe distribution [11] for the correlation matrix. This route can be computationally efficient, but its current implementation is incomplete in supporting libraries. Using a model of ADPs, it is possible to infer the joint posterior parameter distribution of a Wishart distribution using Bayesian methods. In turn, we can generate random predictions of ADPs by sampling from the joint posterior distribution and propagating the samples through the deterministic steps of the model. The randomly generated ADPs can be used to predict the baseline probability of coincidental pairwise similarity between random matrices.
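As an illustration of Bartlett's construction, the sketch below draws a single Wishart-distributed matrix from simpler random variables with NumPy. The function name and parameter choices are ours and only illustrate the stepwise construction; they do not reproduce the implementation used in this work.

```python
import numpy as np

def wishart_bartlett_sample(V, nu, rng=None):
    """Draw one sample from a Wishart(V, nu) distribution via Bartlett's construction.

    V  : (k, k) positive-definite scale matrix
    nu : degrees of freedom (nu > k - 1)
    """
    rng = np.random.default_rng(rng)
    k = V.shape[0]
    L = np.linalg.cholesky(V)
    # Lower-triangular Bartlett factor: chi-distributed diagonal,
    # standard normal entries below the diagonal
    A = np.zeros((k, k))
    for i in range(k):
        A[i, i] = np.sqrt(rng.chisquare(nu - i))
        for j in range(i):
            A[i, j] = rng.standard_normal()
    LA = L @ A
    return LA @ LA.T  # symmetric and positive definite by construction

# Example: a random 3x3 ADP-like matrix with identity scale and nu = 5
U_random = wishart_bartlett_sample(np.eye(3), nu=5)
```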

We also explore which types of comparison provide a detailed map between ADPs and analyze what level of similarity does not simply arise in the random model of ADPs. To define similarity, different definitions of distance between the ADPs were used [12, 13]. To this end, there is a need to use proper metrics, which incorporate multiple aspects of ADPs. It should be noted that there is no universally better or worse definition of distance. Hence, the choice depends on what property one wishes to highlight, namely the direction of the displacements or their (relative) magnitude. Which definition one uses matters less when the ADPs are highly similar. Regardless of the metric, we still have to determine the magnitude of the distance that can arise through coincidence.

In this short report, we used a machine learning model to predict the shortest distance at which coincidental agreements occur with a pre-defined probability. The Bayesian model is trained on the ADPs of a protein structure and tested with a number of different distance definitions. Finally, we take a closer look at the recurring properties among atoms that show significant similarity to a selected atom.

2. Material and methods

2.1. Location of atoms in high-resolution crystal structures

The location of an atom is usually defined by its coordinate vector μ and its symmetric and (ideally) positive semi-definite ADP tensor U once the resolution of the crystallographic data is better than 1.2 Å and the atoms appear as separate electron density peaks. Crystallographic refinement algorithms may or may not enforce a meaningful U. For the purpose of this analysis, non-positive definite atom records are ignored. Beq and Best are then defined as [14]:

Equation (1):  $B_{eq} = \frac{8\pi^{2}}{3}\left(E_1 + E_2 + E_3\right)$

Equation (2):  $B_{est} = 8\pi^{2}\left(E_1 E_2 E_3\right)^{1/3}$

where E1, E2 and E3 are eigenvalues of U. U is equivalent to the covariance matrix of a multivariate normal distribution. Using the multivariate normal description of the atoms, the probability of their positions can be estimated in the crystal structure. The graphical representation of U is an ellipsoid. The eigenvectors and eigenvalues of the U tensor define the axis directions and their relative lengths, respectively. The surface of the ellipsoid is drawn at an arbitrary probability level of the multivariate normal distribution. The machine learning treatment of correlated diffraction intensity observations [2] and atom positions is very similar.
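As a minimal sketch, equations (1) and (2) could be evaluated for a single ADP tensor as below, assuming the definitions above in terms of the arithmetic and geometric means of the eigenvalues scaled by 8π²; the function name is ours.

```python
import numpy as np

def b_eq_and_b_est(U):
    """Beq and Best of an ADP tensor U (in Å^2), as in equations (1) and (2)."""
    E = np.linalg.eigvalsh(U)                           # eigenvalues E1, E2, E3 of the symmetric tensor
    b_eq = 8.0 * np.pi**2 * E.mean()                    # equation (1): 8*pi^2 * (E1 + E2 + E3) / 3
    b_est = 8.0 * np.pi**2 * np.prod(E) ** (1.0 / 3.0)  # equation (2): 8*pi^2 * (E1 * E2 * E3)^(1/3)
    return b_eq, b_est
```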

2.2. Bayesian model of ADPs

The Bayesian model (equations (3)–(6)) approximated U matrices as fixed Wishart distributions [8]. The Wishart distribution is frequently used as a conjugate prior for covariance matrices. The prior distributions are described in equations (3) and (5). Equation (4) shows a deterministic transformation and equation (6) defines the probability of the observation U. Equations (5) and (6) represent the Wishart probability distributions:

Equation (3):  $\nu_0 \sim \mathrm{HalfNormal}(\sigma)$

where σ is the scale parameter of the half-normal distribution (σ = 3 a priori),

Equation (4):  $\nu = \nu_0 + 5$

Equation (5):  $V \sim \mathcal{W}\left(I_3,\; 5\right)$

Equation (6):  $U \sim \mathcal{W}\left(V/\nu,\; \nu\right)$, with density $p\left(U \mid V/\nu, \nu\right) = \dfrac{\lvert U\rvert^{(\nu-k-1)/2}\, \exp\left(-\tfrac{1}{2}\,\mathrm{tr}\left(\nu V^{-1} U\right)\right)}{2^{\nu k/2}\, \lvert V/\nu\rvert^{\nu/2}\, \Gamma_p\left(\nu/2\right)}$

where U are the ADP observations, and V and ν are the positive-definite scale matrix and the degrees of freedom of the Wishart distribution, respectively. The rank of the matrix is k = 3. Γp is the multivariate gamma function (p = 3).

The relationship between the model parameters is illustrated in figure 1. In order to generate the ν0 parameter, a half-normal prior distribution with a σ parameter of 3 was assumed. The prior on the scale matrix V is another Wishart distribution with an identity scale matrix and its ν parameter set to 5. The samples of the matrix V were generated by Bartlett decomposition [9] in order to maintain the symmetry and positive semi-definiteness of its posterior samples. The ν0 parameter of the observations was incremented by a fixed value of 5 to ensure that the Wishart likelihood function is always defined. The Wishart likelihood function in equation (6) has degrees of freedom ν and a scale parameter V/ν. The training set consisted of 150 protein atoms randomly selected from the 0.93 Å resolution crystal structure of bovine trypsin in complex with a canonical inhibitor (PDB entry 2xtt) [15]. The remaining protein atoms formed the test set.


Figure 1. (a) Schematic representation of the Bayesian model used for machine learning. Variable names Vz and Vc are the off-diagonal and diagonal elements of the scale matrix V of the Wishart distribution, respectively. These are linked by Bartlett decomposition. (b) Degrees of freedom of the Wishart distribution are represented by ν, and ν0 is its prior distribution before it is incremented by 5. For simplicity, only three of the 150 Wishart distributions are illustrated. Each corresponds to a randomly selected atom of the protein.


The model was implemented with the Python library PyMC3 [16], and the posterior distribution was estimated with the No-U-Turn Sampling (NUTS) algorithm [17] using the default parameters. The tuning was performed for 2000 Markov Chain Monte Carlo (MCMC) iterations, and the joint posterior distribution was represented by the last 1000 iterations. Four MCMC chains were generated to evaluate the reproducibility of convergence. The ADPs of the randomly selected atoms were assigned to separate 3 × 3 matrices, as vectorization of covariance matrices was not implemented in version 3.6 of PyMC3. This likely limited the performance of the algorithm and increased the memory usage. A total of 150 observations were already sufficient for reproducible convergence to very similar posterior distributions even if a different set of atoms was selected. Another limitation of this implementation was that the MCMC algorithm could only use one processor core, and calculating four MCMC chains took 190 s on a Linux workstation (i7 3970X CPU at 3.50 GHz clock frequency). This runtime corresponds to 12 000 iterations and excludes initialization/compilation time.
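A condensed sketch of how the model in figure 1 could be set up with PyMC3 3.x is shown below. The priors (half-normal with σ = 3, a Wishart prior on V with identity scale and ν = 5 drawn via Bartlett decomposition, and a Wishart likelihood with scale V/ν) follow the description above, but the array name U_train and the sampler settings are illustrative assumptions, not the authors' code.

```python
import numpy as np
import pymc3 as pm

# U_train: (150, 3, 3) array of dimensionless ADP tensors of the training atoms
U_train = np.load("adp_train.npy")  # hypothetical file name

with pm.Model() as adp_model:
    # Half-normal prior on the raw degrees of freedom (sigma = 3 a priori), equation (3)
    nu0 = pm.HalfNormal("nu0", sd=3.0)
    # Shift by 5 so that the Wishart likelihood is always defined, equation (4)
    nu = pm.Deterministic("nu", nu0 + 5.0)
    # Wishart prior on the scale matrix V, sampled via Bartlett decomposition, equation (5)
    V = pm.WishartBartlett("V", S=np.eye(3), nu=5)
    # One Wishart likelihood per training atom with scale matrix V/nu, equation (6)
    for i, U_obs in enumerate(U_train):
        pm.Wishart("U_%d" % i, nu=nu, V=V / nu, observed=U_obs)
    # NUTS sampling: 2000 tuning iterations, the last 1000 kept, four chains
    trace = pm.sample(draws=1000, tune=2000, chains=4, cores=1)
```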

2.3. ADP predictions

For the ADP predictions, one posterior parameter sample was used for each prediction.

2.3.1. Distance calculations

For the distance calculations, the ADPs were converted to dimensionless matrices by dividing the tensor element magnitudes by 1 Å2. The atomic positions were converted into dimensionless vectors by dividing the elements by 1 Å. The metrics between the dimensionless UA and UB were defined as:

  • (a)  
    Euclidean distance is Frobenius's norm of the difference matrix between UA and UB :
    Equation (7):  $d_E(U_A, U_B) = \lVert U_A - U_B \rVert_F$
  • (b)  
    Riemann distance:
    Equation (8):  $d_R(U_A, U_B) = \sqrt{\sum_{i=1}^{3} \ln^{2} \lambda_i}$
    where λi are the joint eigenvalues of UA and UB (assuming common eigenvectors).
  • (c)  
    Log Euclidean distance is Frobenius's norm of the difference matrix between log(UA ) and log(UB ):
    Equation (9):  $d_{lE}(U_A, U_B) = \lVert \log U_A - \log U_B \rVert_F$
  • (d)  
    Log det distance:
    Equation (10):  $d_{lD}(U_A, U_B) = \left[\ln\det\left(\tfrac{U_A + U_B}{2}\right) - \tfrac{1}{2}\ln\det\left(U_A U_B\right)\right]^{1/2}$
  • (e)  
    Wasserstein distance [18]:
    Equation (11):  $d_W(U_A, U_B) = \left[\mathrm{tr}(U_A) + \mathrm{tr}(U_B) - 2\,\mathrm{tr}\left(\left(U_B^{1/2} U_A U_B^{1/2}\right)^{1/2}\right)\right]^{1/2}$
  • (f)  
    Symmetric Kullback–Leibler distance [19]:
    Equation (12):  $d_{KL}(U_A, U_B) = \tfrac{1}{2}\,\mathrm{tr}\left(U_A^{-1} U_B + U_B^{-1} U_A - 2I\right)$
  • (g)  
    Beq distance:
    Equation (13):  $d_B(U_A, U_B) = \left| B_{eq}(U_A) - B_{eq}(U_B) \right|$
  • (h)  
    Eigenvalue distance:

Equation (14):  $d_\lambda(U_A, U_B) = \lVert \lambda_A - \lambda_B \rVert$

where λA and λB are the eigenvalues of UA and UB , respectively. The distance is defined as the vector norm of the differences between λA and λB .

In addition, the distance between atom positions was defined as:

Equation (15):  $d_r = \lVert \mu_A - \mu_B \rVert$

where μA and μB are the coordinates of atoms A and B. The distance is defined as the vector norm of the differences between μA and μB .

The matrix logarithm of U is defined as:

Equation (16):  $\log U = E\,\ln(\Lambda)\,E^{T}$, where $U = E\,\Lambda\,E^{T}$ is the eigendecomposition of U and $\ln(\Lambda)$ applies the natural logarithm to the diagonal matrix of eigenvalues.

The metrics were implemented using the Python libraries NumPy [20, 21], SciPy [22] and pyRiemann [23]. The Python libraries Matplotlib [24], Graphviz [25] and Seaborn [24] were used for data visualization. Validation of the method using synthetic data is detailed in the supporting online information (figures S1–S4 and tables S1 and S2 (available online at stacks.iop.org/MLST/2/035033/mmedia)).
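For concreteness, a few of the metrics above could be implemented with NumPy and SciPy as in the sketch below (the paper additionally used pyRiemann). The function names are ours; the log-det and symmetric Kullback–Leibler variants are omitted here because their normalization conventions vary between references.

```python
import numpy as np
from scipy.linalg import eigvalsh, logm, sqrtm

def d_euclid(UA, UB):
    """Euclidean distance, equation (7): Frobenius norm of the difference."""
    return np.linalg.norm(UA - UB, ord="fro")

def d_riemann(UA, UB):
    """Riemann distance, equation (8): from the joint (generalized) eigenvalues."""
    lam = eigvalsh(UA, UB)          # solves UA v = lam * UB v
    return np.sqrt(np.sum(np.log(lam) ** 2))

def d_log_euclid(UA, UB):
    """Log-Euclidean distance, equation (9)."""
    return np.linalg.norm(logm(UA) - logm(UB), ord="fro")

def d_wasserstein(UA, UB):
    """2-Wasserstein distance between zero-mean Gaussians with covariances UA and UB."""
    cross = sqrtm(sqrtm(UB) @ UA @ sqrtm(UB))
    val = np.trace(UA) + np.trace(UB) - 2.0 * np.trace(cross).real
    return np.sqrt(max(val, 0.0))   # clip tiny negative values from round-off

def d_beq(UA, UB):
    """Beq distance, equation (13): difference of equivalent isotropic B-factors."""
    return abs(8.0 * np.pi**2 / 3.0 * (np.trace(UA) - np.trace(UB)))

def d_eigen(UA, UB):
    """Eigenvalue distance, equation (14): vector norm of eigenvalue differences."""
    return np.linalg.norm(np.linalg.eigvalsh(UA) - np.linalg.eigvalsh(UB))
```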

3. Results and discussion

Based on the ADP tensors of the protein structure, the posterior probabilities of the Bayesian model parameters were estimated using the NUTS algorithm. Convergence was reached after tuning, and in the subsequent 1000 iterations the parameter trace shows low autocorrelation (figure 2). Using the posterior probabilities of the ν and V parameters, it was possible to draw samples from the Wishart random distribution, which here serve as a simulated reference equivalent to the ADP tensors. We compare the random pool of samples to the protein ADPs in figures 3–5. In figure 3, the ADP matrix elements are compared between the protein atoms and posterior predictions, and in figure 4 their Best values are compared. Although the distribution of protein ADPs was more complex than a single Wishart distribution can describe, the model captured the central tendency of the distribution well. For example, the mean Beq of the real samples was 8.3 Å2, while the mean of the predicted Beq was 8.0 Å2. The scale matrix also adapted to the specific distribution of the off-diagonal elements in the ADPs. For example, the U13 element was shifted towards more negative values in the observations, and the predicted Wishart distribution displayed a similar tendency (means −0.010 and −0.011 Å2, respectively). The predicted off-diagonal elements have heavier tails than the experimental ADPs. In addition to the successful scaling of selected tensor elements, the predicted tensors always maintained symmetry and positive definiteness. The tensors are best visualized as ellipsoids (figure 5). Both the real atoms and the machine learning model showed a similar variety of ellipsoid sizes and shapes, and the principal axes did not show a strong directional preference.
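As a sketch, forward prediction from the posterior could be performed by drawing one Wishart sample per posterior parameter sample, for example with scipy.stats.wishart; the variable names posterior_nu and posterior_V below are placeholders for the flattened MCMC trace.

```python
import numpy as np
from scipy.stats import wishart

def predict_adps(posterior_nu, posterior_V, seed=None):
    """One forward-predicted ADP tensor per posterior (nu, V) sample."""
    rng = np.random.RandomState(seed)
    preds = []
    for nu, V in zip(posterior_nu, posterior_V):
        # Wishart likelihood of equation (6): degrees of freedom nu, scale matrix V/nu
        preds.append(wishart.rvs(df=nu, scale=V / nu, random_state=rng))
    return np.array(preds)
```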


Figure 2. Posterior distribution of the model parameters after MCMC sampling. Kernel density estimates of parameters (a) Vz and (b) Vc and (c) ν, respectively. Parameter values of (d) Vz, (e) Vc and (f) ν0 as a function of MCMC iterations. Four independent MCMC chains are plotted with different line styles.


Figure 3. Comparison of observed (test set, blue) and predicted ADP elements (random model, orange).


Figure 4. Histogram of Best values of the observed (test set, blue) and predicted (random model, orange) ADPs.


Figure 5. Comparison of 16 randomly selected (a) protein atoms from the test set and (b) 16 samples from the posterior predictions of the random model. ADPs are represented as ellipsoids where the axes are scaled by the corresponding eigenvalues of the matrices.


By using a mixture model of multiple Wishart distributions, it may be possible to describe the ADPs of protein atoms more accurately and improve the agreement further. One should be cautious, however, as the increase in the number of parameters may lead to overfitting, which is why this type of treatment is not presented here.

3.1. Comparison of ADP metrics

We determine the distance between two samples from the same Wishart random distribution repeatedly and evaluate what distance corresponds to a certain level of cumulative probability. Many metrics are available for comparing covariance matrices. All definitions assume that the distance between identical covariance matrices is 0, and most definitions are commutative, i.e. the distance is the same in both directions. The Euclidean distance (dE) is perhaps the most computationally efficient to calculate. A generalization of the Euclidean distance is the Riemann distance (dR), defined between two points on a Riemann manifold. Locally, the geometry of a smooth Riemann manifold approaches Euclidean geometry, which makes the two interchangeable when very similar ADPs are compared. As ADPs become less similar, however, dE and dR diverge rapidly. Atom pairs with similar dE can have different dR and vice versa. These metrics do not have to be particularly large to observe this tendency. Figure 6 shows a comparison of different covariance distances between pairs of randomly selected atoms in the protein structure and between two samples from the simulated Wishart distribution, respectively. For dE, dR, dlE, dW and dlD, the two distance probability distributions differ to a large extent. Our null hypothesis is that there is no relationship between the ADPs of atoms and that their similarity can be predicted by our statistical random model. Table 1 compares distances at different significance levels, i.e. the distances at which the pairwise distances derived from our random model reach 1% and 5% cumulative density. When protein atoms are compared, the probability of finding pairs with very short distances is much higher. For example, 1% cumulative density is reached at a dE of 0.052 in the random model, while 5% cumulative density of the protein comparisons is already reached at a dE of 0.034. For any of these 5% of atom pairs we can reject the null hypothesis. To put the cumulative density into context, we can focus on the pairwise comparisons of a single, generic atom to all other atoms in the structure. The crystal structure contains 1876 protein atoms. If we accept 1% false positive pairwise relationships, 27.7% (520 atoms) of the protein atoms are significantly more similar to a single atom than expected by chance. This number of atoms is much larger than the first and second coordination shell around a typical atom, and a substantial fraction of similarities develops over a relatively long distance. Clearly, a protein is more inhomogeneous than this simple model suggests. While some atoms have almost no significantly similar partners, other atoms belong to an even larger network of significantly similar atoms.
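The significance thresholds in table 1 correspond to quantiles of the random-model distance distribution. A sketch of how such a threshold could be estimated from forward-predicted ADPs is given below; the function and variable names are ours.

```python
import numpy as np

def coincidence_threshold(random_adps, metric, level=0.01, n_pairs=100_000, seed=0):
    """Distance below which only a fraction `level` of random ADP pairs fall.

    random_adps : (N, 3, 3) array of forward-predicted ADPs from the random model
    metric      : callable distance, e.g. d_euclid or d_riemann from the metric sketch
    """
    rng = np.random.default_rng(seed)
    n = len(random_adps)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                                   # exclude self-comparisons
    d = np.array([metric(random_adps[a], random_adps[b])
                  for a, b in zip(i[keep], j[keep])])
    return np.quantile(d, level)

# Example: protein pairs with d_E below this value reject the null hypothesis at the 1% level
# threshold_1pct = coincidence_threshold(predicted_adps, d_euclid, level=0.01)
```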


Figure 6. Cumulative normalized histogram of distances dE, dR, dlE, dKL, dlD, dW, dB and dλ (a)–(h) calculated for randomly selected pairs of atoms of the protein (blue) and two random samples of the posterior Wishart distribution (orange). Red line indicates 5% of the total observations.


Table 1. Statistical comparison of pairwise distances between ADPs of protein atoms and between posterior predicted ADPs. The abbreviation CDF indicates 'cumulative distribution function'.

| Metric | Distance with CDF = 1% (random) | Distance with CDF = 5% (random) | Distance with CDF = 1% (protein) | Distance with CDF = 5% (protein) | Protein CDF at d with random CDF = 5% (%) | Protein CDF at d with random CDF = 1% (%) |
|---|---|---|---|---|---|---|
| dE | 0.052 | 0.073 | 0.022 | 0.034 | 27.7 | 15.1 |
| dR | 0.639 | 0.882 | 0.403 | 0.583 | 21.7 | 7.1 |
| dlE | 0.630 | 0.871 | 0.403 | 0.581 | 21.3 | 6.9 |
| dKL | 0.210 | 0.405 | 0.082 | 0.173 | 21.6 | 7.2 |
| dW | 0.091 | 0.125 | 0.046 | 0.071 | 25.8 | 11.8 |
| dlD | 0.225 | 0.309 | 0.142 | 0.205 | 21.7 | 7.1 |
| dB | 0.031 | 0.197 | 0.047 | 0.240 | 4.1 | 0.7 |
| dλ | 0.013 | 0.023 | 0.009 | 0.017 | 10.1 | 2.4 |

The cumulative density of dB at short distances is very similar when the experimental data and the random model are compared. It may be counterintuitive that pairwise distances between samples drawn from the same distribution do not have a maximum around zero; this is a characteristic of multivariate distributions. For the univariate Beq distance (dB), the most frequent pairwise distances are close to zero both for the random and the experimental data. This means that a high similarity of Beq is not surprising and is expected by chance, i.e. the similarity does not carry substantial information. The metric dB is therefore not good for detecting similarities. The distance distributions of the random and experimental models do diverge when distances are long, but in this study we focus on coincidental similarities, not on coincidental differences.
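This behaviour is easy to reproduce in a small simulation: for samples drawn from a single Wishart distribution, the pairwise Frobenius distances of the full tensors should peak away from zero, whereas the univariate Beq-type distances should peak near zero. The parameter choices below are arbitrary and only illustrate the effect.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
# 2000 samples from one Wishart distribution (arbitrary parameters)
samples = wishart.rvs(df=10, scale=np.eye(3) / 10, size=2000,
                      random_state=np.random.RandomState(1))
idx = rng.integers(0, len(samples), size=(10_000, 2))
idx = idx[idx[:, 0] != idx[:, 1]]                       # drop self-pairs

d_tensor = [np.linalg.norm(samples[i] - samples[j], "fro") for i, j in idx]
d_beq = [abs(np.trace(samples[i]) - np.trace(samples[j])) * 8 * np.pi**2 / 3
         for i, j in idx]

# Expectation: the first histogram bin is sparsely populated for d_tensor
# (mode away from zero) but the most populated one for d_beq (mode near zero)
print(np.histogram(d_tensor, bins=20)[0])
print(np.histogram(d_beq, bins=20)[0])
```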

3.2. Evaluation of ADP metrics and their correlation with the physical distance of atom pairs

Figure 7 shows the correlation between ADP metrics; the distance between atom coordinates (dr) is also included. The diagonal plots represent the kernel density estimates of each distance distribution. The dlD, dlE, dR and dKL metrics are highly correlated with one another. Therefore, only one of them, dR, was retained in the final analysis. The kernel density estimate of dr represents the pair distribution function, p(r), which provides information about the shape of the molecule. When dr is plotted against different ADP metrics, it is apparent that atoms can be far apart and yet have nearly identical ADPs, and such pairs are much more similar than expected by chance. The converse does not have to be true. When atoms are close together, their ADPs do have increased similarity. This can be observed as a relative lack of points under the diagonal. There are almost no atom pairs within bonding distances (1–4 Å) that have highly dissimilar ADPs. This local similarity could originate from crystallographic restraints, rigid body motions and other types of local constraints. There is also a low frequency of short distances in the p(r) function. Notably, metrics that are less correlated with dB, such as dR, show more diverse atom pairs at closer interatomic distances. Beq is a pure measure of displacement amplitude without any directional information. This indicates that the local ADP similarity is mostly restricted to the displacement amplitudes rather than to the displacement directions. This is in line with the general observation that crystallographic B-factors tend to increase from the core towards the surface of proteins.
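A plot in the style of figure 7 could be produced with Seaborn's pairplot from a table of per-pair distances, as in the sketch below; the dictionary of distance arrays is a placeholder for values computed with the metrics of section 2.3.1.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_metric_correlations(distances, n_show=1000, seed=0):
    """Pairwise scatter plots of ADP metrics with kernel density estimates on the diagonal.

    distances : dict of equal-length 1D arrays, one per metric
                (e.g. {"dE": ..., "dR": ..., "dW": ..., "dr": ...}),
                evaluated for the same randomly selected atom pairs
    """
    df = pd.DataFrame(distances).sample(n=n_show, random_state=seed)
    sns.pairplot(df, diag_kind="kde", plot_kws={"s": 5, "alpha": 0.4})
    plt.show()
```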


Figure 7. Comparison of distances and their kernel density estimates plotted by the Seaborn library. For clarity, only 1000 randomly selected atom pairs are visualized from the approximately two million combinations.


The isotropic displacement amplitude distance dB and the metrics based on the shape parameters (eigenvalues) of U (dλ ) have poor correlation with other metrics, especially when the distances are short. This would normally be a favorable, complementary property, but figure 6 and table 1 show that short distances of dλ and dB occur easily by chance and do not carry substantial information.

3.3. Qualitative patterns of ADP similarities

Our earlier analysis, using a hierarchical clustering approach, identified a reproducible similarity between the Cβ atoms of the amino acid residues His-57 and Ser-195 in trypsin in complex with the small-molecule inhibitor benzamidine [7]. We can now verify the robustness of this claim in a different crystal structure, by looking at a different protein-inhibitor complex refined with different crystallographic software, and put the results into a statistical context. In our earlier analysis, we only used a variant of the Euclidean distance and one type of clustering algorithm. Hierarchical clustering tends to commit an atom early to a cluster, making it unavailable for other plausible groupings. This is especially problematic when the ends of the branches (small clusters) contain the most relevant information. Small changes in the metrics and algorithms used for clustering can result in a different tree. Thus, hierarchical clustering is excellent for capturing some patterns, but often misses links that are revealed in a more systematic analysis. In table 2, the similarities are ranked by increasing dR. At this high level of similarity, dE and dW do not diverge strongly, but if the same atoms are sorted according to other metrics, the ranking changes slightly. Distances of these magnitudes are well below the 1% significance level (table 1). The metric dλ and especially dB correlate poorly with dR, indicating that the displacement amplitudes alone do not contain sufficient information. The interatomic distance dr varies greatly in this group of atoms highly similar to Cβ of His-57. At ranks 1 and 2, we find atoms 16.2 and 24.9 Å apart, respectively. Only at rank 3 do we find a covalently bonded atom (Cα of His-57). All other similarities arise in the absence of covalent bonding and van der Waals interactions. We find Cβ of Ser-195 from our previous study [7] at rank 23, in the company of other carbon atoms, in particular Cβ atoms. Recurring atoms are observed from the amino acid residues Ser-195, Ser-214, Lys-87 and Ile-89. Residue Ser-195 is the most important member of the catalytic triad, as it performs the first nucleophilic attack on the substrate peptide or ester and forms covalent tetrahedral and acyl-enzyme intermediates during the catalytic cycle. Ser-214 is often assigned to the 'catalytic tetrad' [26] because its mutagenesis drastically hampers serine protease catalysis. The amide nitrogen of Ser-214 forms a hydrogen bond of variable strength to the P1 residue of the substrate [27]. Its side-chain hydroxyl group (Oγ) maintains an unconventional CH–O hydrogen bond to His-57; thus, Ser-214 provides a link between the substrate and His-57 [28].
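The ranking in table 2 amounts to sorting all atoms by their ADP distance to a query atom. A sketch of such a ranking is given below; the atom label format and the helper name are ours, and d_riemann refers to the metric sketch in section 2.3.1.

```python
import numpy as np

def rank_similar_atoms(adps, labels, query_label, metric, top=25):
    """Rank all atoms by ADP distance to a query atom (e.g. C-beta of His-57).

    adps   : (N, 3, 3) array of ADP tensors
    labels : list of N atom identifiers, e.g. "HIS57:CB" (hypothetical format)
    """
    q = labels.index(query_label)
    d = np.array([metric(adps[q], U) if i != q else np.inf
                  for i, U in enumerate(adps)])
    order = np.argsort(d)[:top]
    return [(labels[i], d[i]) for i in order]

# Hypothetical usage with the Riemann distance:
# rank_similar_atoms(adps, labels, "HIS57:CB", d_riemann)
```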

Table 2. The shortest ADP distances to Cβ of His-57. The color scale red-yellow-green illustrates the range between the lowest and highest values in the columns.

3.4. Origin of significant ADP similarities

We established a statistical framework for comparing ADPs and revealed that ADP similarities in bovine trypsin do not arise by coincidence. We have still not given a causal explanation. Unfortunately, we can only speculate about its origin if we consider the spatial distribution alone. We discuss here two kinematic processes: constrained displacements and coherence of motion. Both types of motion could result in identical positional distributions. We illustrate the difference with a theoretical experiment, where we assume that an atom is part of a planar chemical group. Suppose that this atom is constrained in its freedom: it cannot move within the plane and displaces only in the out-of-plane direction. Suppose that we have a number of identical chemical groups separated by long distances so that no direct contact between them is possible. If there is a yet unexplained process that aligns the chemical groups parallel to one another, then the displacement of the atoms will occur in similar directions. They do not have to move in phase; the constraints provided by the planar chemical group explain the uniform directional distribution. One problem with this model is that we do not have a universally accepted mechanism of protein folding that explains why chemical groups should orient themselves in similar directions. This model also requires very strong constraints; for example, the out-of-plane displacement has to be perfectly perpendicular to the planar chemical group. Moreover, we can observe empirically that chemical groups are not perfectly aligned in the folded protein structure, and they appear to be affected by steric hindrances that prevent crystalline order. However, we cannot dismiss the presence of atoms with very similar displacements.

We prefer an alternative explanation because it does not require a speculative mechanism that orients the chemical groups over a long distance. Instead, we can assume that the discussed atom in the planar chemical group displaces coherently (in phase) with the same type of atom in other groups, at least occasionally. The planar constraints on the atoms would then work the other way around: the parallel orientation of the planar chemical groups becomes a consequence of the displacement alignment of a particular atom type.

4. Conclusion

Atomic displacement parameters of protein crystal structures contain useful information for machine learning analysis. We approximated ADPs with Wishart's distribution and used MCMC sampling of their joint posterior distribution. From the posterior distribution, we generated random ADP samples by forward prediction. We used the random samples to evaluate the frequency of coincidental similarities. To evaluate similarities, we tested eight metrics, of which the Euclidean, Riemann and Wasserstein distances formed a complementary, non-redundant set of metrics sensitive to different aspects of ADPs. Comparison of ADP similarities with interatomic distances revealed poor correlation, indicating long-range constraints or coherence of protein dynamics. This study demonstrates that a substantial fraction of these long- and short-range ADP similarities does not form by coincidence. Finally, we presented further evidence that ADP similarities depend more on the chemical nature and bonding coordination of the atoms than on their physical separation.

Acknowledgments

We would like to thank Christopher Hartl for helping to develop the Bayesian model. We are grateful to Ran Friedman and Stefano A Mezzasalma for their careful reading of the manuscript. This work was supported by the Röntgen-Ångström Cluster Framework (Grant No. 2015-06099). We acknowledge financial support from the LINXS—Lund Institute of Advanced Neutron and X-ray Science.

Data availability statement

The data that support the findings of this study are openly available at the following DOIs. The implementation of the Bayesian model and the data loading algorithm are available in the Zenodo repository, DOI: https://doi.org/10.5281/zenodo.4717293. The pairwise ADP distances of the test protein structure are available in the Zenodo repository, DOI: https://doi.org/10.5281/zenodo.4717613.
