Spurious negative eigenvalues of numerical variance-covariance matrices in many-body systems correlate with the existence of frozen degrees of freedom

The principal component analysis ( PCA ) is widely used to reduce the dimensionality of a dataset to its essential components. To perform PCA, the covariance matrix is constructed and its eigenvalues and eigenvectors are computed. In practical numerical applications, the tail of the sorted eigenvalues is sometimes found to contain negative eigenvalues, which are prohibited mathematically and are a pure consequence of ﬁ nite-accuracy numerics. The present study suggests that in the case of a many-body dynamical system, the spurious negative eigenvalues of the covariance matrix may in fact be related to the frozen degrees of freedom in the system. Here, we outline the mathematical connection

The advance of data-driven sciences has opened up the new research field of machine learning and brought about ground-breaking developments including, quite recently, the astonishing results of large language models like ChatGPT [1].An important tool in this branch of science is dimensionality reduction [2], which is related to the physical concept of generalized coordinates.The proper choice of a coordinate system could significantly simplify the complexity of a problem and the appropriate coordinate systems are often found by exploiting the symmetries of the system.The classical textbook example is the motion of two bodies attracted by gravitation: a problem conveniently solvable in spherical or elliptical coordinates due to the spherical symmetry of the gravitational potential.
Data driven sciences, however, often deal with significantly more degrees of freedom and have only access to a finite amount of noisy data, instead of analytically defined forces.In order to spot the most relevant generalized coordinates, principal component analysis (PCA) [2][3][4][5][6] offers a handy approach.To perform PCA on an Ndimensional data set, the N × N-dimensional variance-covariance matrix is usually constructed and its eigenvalues and eigenvectors are calculated and analysed.The approach is rather general, but can also be applied more specifically to study the dynamics of many-body-systems.In fortunate situations, the first few eigenvectors from a PCA may already suffice to account for over 90% of the variance in the data [7][8][9].Dimensionality reduction may therefore be achieved by analysing the temporal evolution of the first few eigenvectors only, neglecting the tail of the eigenvalue decomposition.Moreover, Tikhonov regularization [10] or diagonal loading [11] are sometimes employed during the numerical diagonalization of the covariance matrix to avoid singular matrices, i.e. to avoid the occurrence of small eigenvalues.The small eigenvalues of numerical variance-covariance matrices are often neglected, as they seemingly do not provide any relevant information on the underlying system.In this study, we demonstrate that these small eigenvalues could be crucial to reveal some dynamical features of many-body-systems, and namely indicate frozen (stiff) degrees of freedom, which in turn may suggest the existence of undiscovered symmetries in the system.Specifically, our analysis reveals that the number of negative eigenvalues should be positively correlated with the existence of frozen degrees of freedom.Following a theoretical framework, we justify this idea further by first discussing a model of coupled harmonic oscillators and then studying a more specific example of the solvated protein ubiquitin using atomistic molecular dynamics (MD).
The general properties of numerical covariance matrices, can be introduced conveniently with the aid of an example.Consider a classical system of N atoms with a predefined Hamiltonian in the canonical ensemble.The positions of the atoms are then given by a 3N-dimensional vector x y z x y z x , , ,..., , , . 1 Here, the entries (x 1 , y 1 , z 1 ) are the cartesian coordinates of the first atom and the vector continues until it reaches the last of a total of N atoms.Once the system evolves in time, the atomic positions are recorded at certain time instances, such that the positions x become distributed according to a probability distribution function P(x).Consider the entropy of the system; the entropy is a measure of the number of microstates that the system can access in the limit of infinite time; the fluctuations of x can be used to calculate the entropy.Approaches for estimating the entropy contributions from different degrees of freedom for many-body systems from the underlying positional fluctuations in the phase space have been developed by, e.g.Levy et al, Schlitter et al, and Carlsson et al [12][13][14].Among these approaches is the quasiharmonic approximation which employs the covariance matrix to quantify the fluctuations in the phase space [12,15,16].
The quasiharmonic approximation assumes that the probability distribution P(x) is a multivariate Gaussian probability distribution P x ˜( ) [16]: where σ is the covariance of P ˜, which exists for every multivariate Gaussian distribution [17], x is the coordinate vector defined in equation (1) and x ¯is the time-averaged coordinate vector.The same distribution P would be produced for the potential energy of the system with the introduced effective potential Note, that the kinetic energy contribution is not relevant here as the partition function is assumed to be calculated from a Boltzmann distribution [15].Here, the Hessian matrix F (in J/m 2 ) is given as: where T is the temperature and k B is the Boltzmann constant.The vector of normal modes ω of F can then be calculated from the characteristic equation: where M is the mass matrix containing the masses of all the atoms in the system on its diagonal.With , equation (4) can be rewritten as Here, λ is the vector of the eigenvalues of the system, and  is the identity matrix.A complete derivation of equation ( 5) is provided in the supplementary information (SI).
The elements of the variance-covariance matrix are defined as Here, x i is the ith entry of the position vector x and x x i i = á ñwith 〈•〉 denoting the average over time.Analogously, x j is the jth entry of the position vector.The definition in equation (6) implies a real, symmetric bilinear map (x i , x j ) a (σ ij ) such that its diagonal form in the normal representation is unique when all eigenvalues are sorted [18].Since the inverse of a diagonal matrix is furthermore given as the inverse of its diagonal elements, equation (5) becomes equivalent to Due to the way ŝ is defined, both ŝ and λ are given in units of kg • m 2 .The effective potential V eff that describes the interactions within the N-dimensional system is then given as a collection of harmonic oscillators with the respective frequencies (with λ i indicating a specific eigenvalue) where each of these oscillators describes the motion of the system along the respective eigenvector.The covariance matrix and its specific set of eigenvectors and eigenvalues forms the connection between the quasiharmonic approximation and PCA.
In the limiting case of λ i → ∞ , the oscillator frequency ω i → 0, so that the effective potential in the corresponding direction practically allows free motion (see figure 1 left).For finite values of λ i > 0, the effective potential has a finite, real frequency, corresponding to a stable harmonic pendular motion.For λ i → 0, the frequency diverges and the effective potential becomes infinitely steep, such that the atoms effectively do not move along this direction (see figure 1 right).In other words, PCA is expected to deliver large eigenvalues for large scale motions and small eigenvalues for small scale motions.
In the limiting case where a degree of freedom has a neglibible variability, one typically speaks of a frozen (stiff) degree of freedom.The existence of a single frozen degree of freedom f restrains the motion of a 3Ndimensional system to a 3N-1 dimensional hypersurface.This hypersurface has a characteristic normal vector, coinciding with the eigenvector v f of the covariance matrix for which the eigenvalue λ f is approximately zero.In cases where several eigenvalues are (close to) zero, one expects that the corresponding frozen degrees of freedom reduce the dimensionality of the system even further.From a numerical perspective, considering a variancecovariance matrix that has small eigenvalues means performing calculations on a singular matrix.For singular matrices, numerical inaccuracies become increasingly notable such that negative eigenvalues can arise.In other words, frozen degrees of freedom lead to singular variance-covariance matrices which in turn lead to negative eigenvalues.
The hypothesis outlined above could further be illustrated through a simple physical model system.Consider N = 100 coupled harmonic oscillators, where each oscillator i has a unique spring constant k i and neighbouring oscillators are harmonically coupled, see figure 2(A).The coupling parameter c is assumed identical for all couplings.The force that acts on a given oscillator i can then be written as: where x i is the position of an oscillator i, while the neighbouring oscillators have the positions x i−1 and x i+1 .The system is modelled with periodic boundaries such that the oscillators i = 0 and i = N are coupled.For the specific case study the model of coupled oscillators was studied numerically through several simulations, where the initial positions of all oscillators i were assumed x i (0) = 1 with velocities v i (0) = 0.For the sake of simplicity, all variables in this model were considered in arbitrary units and all oscillators have the same mass m = 1.Even though the initial conditions for the positions and velocities of the oscillators were identical, the simulation results differ based on the order of the unique spring constants k i ä {0.1, 0.2,...,10.0,10.1}.The spring constants were reshuffled until the average PCA converged.Convergence was determineed when the average sorted list of normalized eigenvalues reached a threshold value of 10 −3 for every eigenvalue.Further details on the simulation protocol is provided in the SI, while all related simulation scripts are available at https://gitlab.uni-oldenburg.de/quantbiolab/coupled_harmonic_oscillators_periodic. Figure 2(B) illustrates the list of normalized eigenvalues calculated for different coupling strength values.The uncoupled oscillators at c = 0 show that all 100 eigenvalues are greater than 0.6; in this case, the risk of numerically obtaining negative eigenvalues is minimal.Small couplings of c = 0.02 to c = 0.2 also provide eigenvalues that are visibly larger than zero.A further increase of the c value however, pushes the eigenvalues closer to the ordinate, where at c = 4 the last 20 eigenvalues closely approach zero; in practical applications, where sampling and simulations may further limit the convergence of PCA, these small eigenvalues may become negative.More detail on how the number of small eigenvalues χ evolves as a function of coupling strength is provided in figure 2(C).Here, the number of eigenvalues below a certain value is counted.As the coupling strength increases, this number grows monotonically for all considered threshold values.
From a conceptual point of view, it is worthwhile to consider the transition from the limiting cases c = 0 to c → ∞ .For c = 0, all oscillators are independent and distinguishable.In the limit c → ∞ however, they transition into a single body with a total mass of N • m.In the case of the infinitely strong coupling, where x 0 = x 1 = ... = x N−1 = x N , all coupled oscillators can be represented by a single degree of freedom, such that N − 1 degrees of freedom become frozen due to the internal couplings.A gradual increase of c is thus expected to increase the number of frozen degrees of freedom from 0 to N − 1.
The mathematical considerations above could also be applied to a real molecular system.To illustrate the connection of small eigenvalues with the frozen degrees of freedom, the small protein ubiquitin was simulated and artificial restraints were imposed on some of its degrees of freedom.The dynamics of the solvated protein was atomistically modelled using classical MD simulations and the variance-covariance matrix was constructed from its motions.The approach allowed to introduce an artificial force that pulled a target atom with a fixed constant velocity until a predefined displacement was achieved, a methodology referred to as steered molecular dynamics (SMD) [19].This pulling force controls the freezing of certain degrees of freedom, associated with the affected atoms: a stronger value of the force results in a more intense coupling of the atoms to it.The strength of the force was controlled by the velocity with which the target atom was pulled.SMD simulations with higher pulling velocity are influenced by the applied force more noteably, since a larger force is needed to achieve the desired pulling effect.More details on the simulation procedure are given in the SI. Figure 3 shows the simulated protein ubiquitin and illustrates the direction of the artificial force and its target atom: the α-carbon of residue 73.α-carbons are central atoms in every amino acid that connect the backbone and the sidechain.The examplary structure is a relatively small protein for which the crystal structure is fully resolved [20].Two different pulling conditions were studied.In the first setup, different values for the constant pulling velocity (SMD velocity) were tested while the total distance pulled (SMD distance) was fixed.In the second setup, different values for the SMD distance were tested and the SMD velocity was fixed.These two setups allowed to analyze and differentiate the effects of SMD distance, SMD velocity and finite time in the simulations.Lastly, it is necessary to define projections P from the 3N eigenvectors v N 3

Î
to N  in order to compare the simulations and to differentiate between positive and negative eigenvalues.Every eigenvector is represented in the basis of the cartesian coordinates of atoms of the protein.Thus, it contains the entries v v v , , ) for every atom with the index n ä {1,K,N}.Let us denote the vector of all eigenvalues as λ and the mass of an atom n as m n .The total principal component projection (TPCP) could then be defined as Ín the above equation, the summation is performed over all eigenvalues λ i with their corresponding eigenvectors v i .The PCA leaves the original coordinate system unchanged, therefore, the eigenvectors v i in the rows 3n + 1 to 3n + 3 (with n ä {0,K,N − 1}) represent the cartesian coordinates (x n , y n , z n ) of an atom n.By summing up the values of p n in equation (9), one defines projections to a set of atoms, or even for all the atoms in the protein.
A second, similar projection will be referred to as the negative principal component projection (NPCP) and is defined as NPCP may be used as a relative measure between simulations to indicate the number of frozen degrees of freedom in a system.The comparison between NPCP and TPCP is used to test whether any potentially observed effect is unique to the negative eigenvalues.
Let us now consider the expected results based on the mathematical sketch outlined earlier.The pulling process in ubiquitin increases the correlation between some degrees of freedom: at higher pulling velocities, the trajectories of all affected atoms become dominated by their movement in the direction of the pulling force at constant velocity and the trajectories of the affected atoms become similar.Since the components of the variance-covariance matrix are mean-centered, their respective columns in the variance-covariance matrix become similar too.It should therefore be observed that more negative eigenvalues are found when the SMD velocity increases.In other words, the SMD velocity and the NPCP should be positively correlated.The TPCP on the other hand, should remain independent of the SMD velocity as it quantifies the overall motion within the protein, which is primarily influenced by the SMD distance and the coordinate fluctuations.Figure 4 displays the TPCP and NPCP summed over each amino acid residue in the ubiquitin protein normalized by the total number of atoms in the residue.Equilibrium MD simulations, which lacked the pulling force, were performed for reference and their result is shown with the black line.Different pulling velocities are colour-coded and their respective velocities are listed in the legend.The yellow vertical line highlights the location of residue 73, which was the target of the pulling force.Both the TPCP and NPCP show a strong peak towards the C-terminal end of ubiquitin (high residue numbers), especially the TPCP increases (see inset).Apart from the identification of the highly flexible C-terminus, both projections exhibit somewhat similar peaks, but these shall be of no further interest here.Instead, a trend on the entire projection is noticeable: the NPCP is consistently larger for every residue when the SMD pulling velocity increases.Figure 4 illustrates, that the stronger the force that restrains the movement along the pulling direction, the larger the NPCP, i.e. the exact behaviour which was expected based on the proposed interpretation of the negative eigenvalues.Upon closer inspection, one notices that the sum of negative eigenvalues in figure 4(B) appears to be unusually large.Each eigenvalue would normally be expected to be on the order of machine precision, while in the presented case study, the sum of the negative eigenvalues is about 1/20 of the sum of the positive eigenvalues.For the data analyzed here, this peculiarity is expected to originate from the limited sample size (number of analyzed MD simulation frames), as further outlined in the SI.
Apart from NPCP and TPCP, some effects are also expected to be apparent in the first (largest) positive eigenvalue.Since PCA is a measurement of the overall movement, the first eigenvalue should, on average, increase with the total distance pulled, which is shown in figure S3.The pulling velocity at constant total distance pulled on the other hand should, on average, result in the same first eigenvalue, which is illustrated in figure S4.The direction of the first eigenvector is another possible measurement.Because it points in the direction of the greatest movement, the first eigenvector should demonstrate velocity-dependent alignment towards the pulling force, which was also the case, see figure S5.
To highlight the key observation, the positive correlation between restraining forces and the existence of the negative eigenvalues of the numerical variance-covariance matrix, the data was condensed further in figure 5.The figure compares the relative increase of the TPCA and NPCA projections with respect to the equilibrium reference simulation.The relative increase is defined as ( ) h = å å Here, Equil.indicates the equilibrium reference simulation without the pulling force and SMD indicates the simulation including the pulling force.Figure 5(A) shows the relative increase of TPCP and NPCP as a function of the SMD pulling velocity, while figure 5(B) illustrates the dependence on the SMD pulled distance.The TPCP in figure 5(A) shows small and irregular dependence of the SMD velocity, while the NPCP shows an increasing trend.This result illustrates that NPCP positively correlates with the existence of the frozen degrees of freedom in the system.There may however be two other effects that could possibly lead to wrong interpretations of the results, namely numerical accuracy and finite-time effects.Numerical accuracy could make a difference since the observation essentially revolves around a common error in numerical PCA.To test this assumption, the variance-covariance matrix was stored in single-and double-precision formats and the obtained results were compared.Figure 5(A ) shows no apparent difference between both variants.It is therefore possible to conclude that the structure of the variance-covariance matrix is profoundly more relevant than the precision up to which it was stored.
Finite-time effects also require some attention because the simulation times differ between the datapoints: shorter SMD distances or higher SMD velocities require shorter simulation times.However, it is plausible that if a finite-time effect leads to an increase in NPCP, then NPCP should decrease with the SMD pulled distance.Figure 5(B) reveals that in fact one observes the contrary behaviour: NPCP increases with the SMD distance alongside the TPCP, so the increase of NPCP cannot solely be attributed to a finite-time effect.Figure 5 suggests that the NPCP increases with the SMD velocity and that the NPCP increases alongside the TPCP with the distance pulled.It is the first of these conclusions that was expected to be true based on the mathematical intuition outlined earlier.It appears overall consistent to say that spurious negative eigenvalues are a consequence of small eigenvalues in the variance-covariance matrix, which in turn are caused by the frozen degrees of freedom in the many body system.
To summarize, the present study, to the best of our knowledge, is the first attempt to draw a connection between the spuriously negative eigenvalues of numerical variance-covariance matrices and the underlying data set in such a way that it is possible to extract more information other than just the information that there exist numerical artefacts.More specifically, the often neglected tail of the eigenvalue decomposition could contain valuable information, similar to the identification of first integrals in classical mechanics.These first integrals are N-dimensional restraints which can have intriguing consequences like e.g.self-organization [21], where spatiotemporal structures spontaneously emerge from the restraints.

Figure 1 .
Figure1.A flat (left) and a steep (right) effective potential V eff along with the distribution of randomly sampled states of a dynamic system (coloured circles).A system of three particles (middle) connected by weak (blue) and strong (red) springs could be described by these potentials.The eigenvalues λ that are associated with the motion within these potentials are indicated.

Figure 2 .
Figure 2. A: Illustration of the model system of coupled harmonic oscillators with identical couplings c and N unique spring constants k i .The N oscillators are treated with periodic boundaries.The arrow shows the direction of the possible oscillatory motion.B: Distribution of the eigenvalues for various coupling strengths c normalised by the first eigenvalue.Shown are averages that converged below 10 −3 at every eigenvalue as described in the SI.C: Number of eigenvalues χ that appear below a given threshold value calculated as a function of the coupling strength c.Colours differentiate between thresholds.

Figure 3 .
Figure 3. Representations of the protein ubiquitin (pdb ID: 1UBQ [20]) with atom positions taken from an SMD simulation with 0.5 A ̊/ns pulling velocity and 9.9 A ̊pulling distance.The red spheres indicate the restrained α-carbons of the residues 1-40 of the protein's backbone.The yellow sphere shows the location of the α-carbon of residue 73 which was used as the target for the applied pulling force.The yellow arrow shows the direction of the pulling force.(A) Atomic positions before pulling was started.(B) Atomic positions at the end of the pulling process.The yellow arrow is fixed in space between (A) and (B) and has a length of 10 A ̊.

Figure 4 .
Figure 4. Comparison between total principal component analysis (A) and negative principal component analysis (B) for varying SMD velocity at a constant SMD pulled distance of 9.9 A ̊. Residue 73, which was the target of the additional force, is highlighted with a yellow vertical line.The inlays give a detailed representation of the residues surrounding the pulled residue 73.

Figure 5 .
Figure 5. Relative increase in the sum of principal component projections obtained for the the SMD simulations as compared to the values from the equilibrium reference simulations.Panel (A) shows η as a function of the SMD velocity and includes an analysis of the raw data at 32-bit floating point (float) and 64-bit floating point (double) accuracy.Panel (B) shows η as a functino of the SMD pulled distance.