Information bottleneck in peptide conformation determination by x-ray absorption spectroscopy

We apply a recently developed technique utilizing machine learning for statistical analysis of computational nitrogen K-edge spectra of aqueous triglycine. This method, the emulator-based component analysis, identifies spectrally relevant structural degrees of freedom from a data set, filtering out irrelevant ones. Thus, a tremendous reduction in the dimensionality of the ill-posed nonlinear inverse problem of spectrum interpretation is achieved. Structural and spectral variation across the sampled phase space is notable. Using these data, we train a neural network to predict the intensities of spectral regions of interest from the structure. These regions are defined by the temperature-difference profile of the simulated spectra, and the analysis yields a structural interpretation for their behavior. Even though the utilized local many-body tensor representation implicitly encodes the secondary structure of the peptide, our approach proves that this information is irrecoverable from the spectra. A hard x-ray Raman scattering experiment confirms the overall sensibility of the simulated spectra, but the predicted temperature-dependent effects therein remain beyond the achieved statistical confidence level.


I. INTRODUCTION
Proteins are formed of amino acids, often cited as the building blocks of life. The backbone of the amino acid chain consists of a -CO-NH-Cα- sequence, where adjacent residues are bound by the peptide bond [1]. The geometry of the backbone contains pairs of flexible dihedral angles, the Ramachandran angles, which together define the macroscopic structure and the function of the protein. The related folding of proteins is a complex question [2], affected by both intramolecular and intermolecular interactions. The process is ultimately dependent on the primary order of the amino acids [1] and the surrounding solvent [3,4].
While diffraction experiments allow for structure determination of proteins, they require a crystalline sample. Owing to its localized mechanism, X-ray spectroscopy maintains its sensitivity to atomistic structure also in the soft condensed phase, although statistical variation of the individual spectra in such environments has been found to be huge [5-13]. This variation can be accounted for by evaluation of the spectra at numerous structures, sampled from the respective statistical ensemble, and averaging over them. Furthermore, X-ray spectroscopy is sensitive to the local structure of amino acids and peptides [6,14-20]. For example, different protonation forms of glycine yield different resonant inelastic X-ray scattering spectra [19] and X-ray photoelectron spectra [6,21]. It has also been found that near-edge X-ray absorption fine structure is sensitive to a few neighbouring amino acids in poly-Gly peptides [15,16,20].
To predict the native form of proteins in their biological environment, classical molecular dynamics (MD) is often used [22]. In this technique, the atomistic composition of the system is accounted for, but the forces between the atoms are computed from a prior parametrization rather than from the electronic-nuclear system that ultimately defines them. This makes classical MD computationally much lighter and gives access to time and size scales not otherwise reachable, but it also leaves the method subject to the quality of the parametrization (the force field) used in the calculations [23,24]. For the development of these models in the solution environment, their performance assessment at the atomistic level would be valuable.
Interpretation of X-ray spectra is typically not straightforward. This is due to the complicated relation between spectra and structures, originating from the quantum nature of the electronic system. In addition, the statistical aspects of the liquid state cause an experiment to probe only the ensemble average. To tackle these problems, machine learning and the related emulator-based component analysis (ECA) [25] may prove useful. The ECA algorithm carries out dimensionality reduction in structural space to identify structural variations having the strongest influence on the spectra. Thus the method identifies recoverable (and irrecoverable) structural information without a prior hypothesis.
In this work we report a theoretical study of the smallest tripeptide, triglycine, in aqueous solution (see Figure 1) at temperatures of 300 K and 350 K, expecting a change in the distribution of the Ramachandran angles. We calculate an extensive set of N K-edge X-ray absorption spectra from MD trajectories with three different force fields, observing notable structural and spectral variation. Next, we study the dependency between intensities of spectral regions of interest (ROI) and the corresponding structures using a neural network (NN) and ECA. The results show that while X-ray absorption ROIs are sensitive to the nearest neighboring atoms of the absorption site, they are not sensitive to the secondary structure, i.e. the Ramachandran angles of the system, in agreement with Schwartz et al. [26]. The simulated spectra are in good agreement with an X-ray Raman scattering (XRS) experiment we performed. However, the predicted spectral difference profile as a function of temperature remains unconfirmed with the achieved statistical uncertainty.

A. Simulations
We simulated the NPT ensemble by classical MD at 300 K and at 350 K and 1 bar using three different force fields: AMBER-03 [28], Charmm27 [29,30] and OPLS-AA [31]. For the first two force fields we applied the TIP3P [32] water model and for the last one the TIP4P [32] model. The cubic simulation cell had an edge length of ∼ 50 Å. The input files were prepared using a desktop installation of the GROMACS [33] package (version 2020.1), and the simulations were performed using version 2020.5 [34] on a computing cluster. Rigid water molecules were used, whereas no constraints were applied on the triglycine molecule. After 1 ns of initial thermalization with the Berendsen thermostat and barostat [35], we ran MD for 51 ns (timestep 0.5 fs) using Nosé-Hoover thermostatting [36,37] (time constant τ_T = 0.5 ps) and isotropic Parrinello-Rahman barostatting [38,39] (time constant τ_P = 5.0 ps, compressibility 4.5 × 10⁻⁵ bar⁻¹). To avoid the "hot-solvent/cold-solute" problem, we used separate thermostats for the solute and for the solvent in the latter run, with a conserved-energy drift of the order of 1 kJ mol⁻¹ ns⁻¹ atom⁻¹. The last 50 ns of the run were sampled with 10 ps spacing for spectrum calculations.
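The sampling arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes that both end points of the 50 ns sampling window are included, which the text does not state explicitly. Under that assumption the count agrees with the 30006 structures for which spectra were calculated.

```python
# Sketch: number of sampled MD snapshots.
# Assumption (not stated in the text): both end points of the
# 50 ns sampling window are included in the sampling.
WINDOW_PS = 50_000        # last 50 ns of each run, in ps
SPACING_PS = 10           # sampling interval, in ps
N_FORCE_FIELDS = 3        # AMBER-03, Charmm27, OPLS-AA
N_TEMPERATURES = 2        # 300 K and 350 K

snapshots_per_run = WINDOW_PS // SPACING_PS + 1
total_structures = snapshots_per_run * N_FORCE_FIELDS * N_TEMPERATURES

print(snapshots_per_run, total_structures)
```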
From the MD snapshots we calculated the N K-edge X-ray absorption spectra of 30006 structures using the projector-augmented-wave (PAW) method [40] with a plane wave basis and density functional theory (DFT), as implemented in GPAW version 22.1.0 [40-42]. As the simulation cell of MD is too large to be used in the quantum mechanical electronic structure calculations, we used cut structures including only the water molecules within 3.0 Å of the solute. A vacuum with a radius of 3.0 Å was added around each structure. Excitations to the lowest 1500 valence single-electron states were evaluated in the transition potential half-hole (TP-HH) approximation [43] for each nitrogen site (N_ex). The spectrum onset was corrected using the ∆-DFT method for the lowest core-excited state. The calculations utilized a plane wave basis with an energy cutoff of 350 eV and the Perdew-Burke-Ernzerhof (PBE) functional [44]. To aid convergence, occupation smearing by the Fermi-Dirac distribution (width 0.25 eV) was used.
The simulations resulted in energy-intensity pairs for the transitions of each absorption site (denoted as N1, N2 and N3; see Figure 1), from which the spectra were obtained by convolution with Gaussian functions of increasing width, analogously to the procedure presented in [45]. The full width at half maximum of the function was obtained by a numerical grid search for the best match with the experiment; we used 1.4 eV for the lowest state and increased the value linearly in energy to 4.5 eV for states 5.5 eV above it, or higher. An alternative set of convolution parameters was tested (from 0.2 eV to 4.25 eV for states 10 eV above the lowest state, or higher) without qualitative changes in the results (see Supplementary Information). The spectrum of a single snapshot was evaluated as the sum of the spectra from the three nitrogen atoms, and the resulting ensemble mean spectra of all the systems matched the experiment well, as seen in Figure 2. In this set, 11 spectra were omitted from the subsequent analysis due to obviously nonsensical results.
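The energy-dependent broadening described above can be sketched as follows. The function names, the area normalization of the Gaussians and the use of the lowest transition of the given stick spectrum as the reference energy are assumptions made for illustration; the actual procedure follows [45].

```python
import numpy as np

def fwhm_for_energy(e, e0, w_lo=1.4, w_hi=4.5, ramp=5.5):
    """FWHM (eV): w_lo at e0, rising linearly to w_hi at e0 + ramp, constant above."""
    frac = np.clip((np.asarray(e, dtype=float) - e0) / ramp, 0.0, 1.0)
    return w_lo + frac * (w_hi - w_lo)

def broaden(energies, intensities, grid):
    """Sum of area-normalized Gaussians, one per transition (hypothetical helper)."""
    energies = np.asarray(energies, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    e0 = energies.min()                       # lowest state sets the reference
    sigma = fwhm_for_energy(energies, e0) / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    diff = grid[:, None] - energies[None, :]  # shape (n_grid, n_transitions)
    gauss = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return gauss @ intensities

# Made-up stick spectrum for one absorption site
step = 0.01
grid = np.arange(380.0, 440.0, step)
sticks_e = [400.0, 403.0, 410.0]
sticks_i = [1.0, 2.0, 0.5]
spectrum = broaden(sticks_e, sticks_i, grid)
area = spectrum.sum() * step                  # should be close to the summed stick intensity
```

With this form, the snapshot spectrum would be the sum of three such broadened site spectra, one for each of N1, N2 and N3.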
To validate the computational results, sets of a few hundred to a few thousand spectra were calculated while varying one of the following simulation parameters: the number of water molecules included (both with a 6.0 Å cutoff, and also without any water), the plane wave energy cutoff (600 eV) or the functional (RPBE [46]). Some of these calculations were troubled with convergence issues, but the successful ones show none of the parameter changes to have a significant effect on the temperature difference profile (see Supplementary Information). In total, the spectrum calculations took approximately 300 000 CPU hours on Intel Xeon Gold 6148 processors.

B. Data analysis
We divided the spectra into three ROIs, the selection of which was based on the mean zero-passing locations in the T-difference profiles ∆ of the three models. These areas roughly correspond to the pre peak (I), main edge (II), and continuum (III) in the spectrum, and each of them has a consistent T-dependence in the simulations. Using the full spectrum instead of ROIs would probably allow more information to be captured, but the risk of overinterpretation also grows with the grid tightness, as simulations never reproduce the experiment exactly [47]. In what follows, we analyze the dependency between the ROI intensities S and the corresponding structure R.
We apply emulator-based component analysis [25] to these data for their interpretation. The algorithm searches for a few orthogonal basis vectors of structural space by maximizing the generalized covered variance (R² score) of the prediction for projections of the data onto the spanned subspace. This procedure utilizes the known target values of the original data points in the evaluation of the score. The components of the ECA basis vectors indicate dominant structural features in terms of the variation of the resulting output (in this work the spectral ROIs). The method needs a suitable machine learning (ML) based emulator S_emu to predict the intensities of the three ROIs of any given structure R. We used an emulator consisting of nine separate NNs, one for each nitrogen atom and ROI, implemented with scikit-learn [48].
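One possible layout of such a nine-network emulator is sketched below. The class name, the use of `MLPRegressor`, the per-site input features and the summation of the site contributions into a total ROI intensity are assumptions for illustration; the text specifies only that nine separate scikit-learn NNs were used, one per nitrogen atom and ROI. The toy data are random numbers, not LMBTR features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

class SiteRoiEmulator:
    """Nine independent regressors, one per (nitrogen site, ROI) pair (hypothetical layout)."""

    def __init__(self, n_sites=3, n_rois=3, **mlp_kwargs):
        self.n_sites, self.n_rois = n_sites, n_rois
        self.models = [[MLPRegressor(**mlp_kwargs) for _ in range(n_rois)]
                       for _ in range(n_sites)]

    def fit(self, site_features, site_roi):
        # site_features: list of (n_samples, n_features) arrays, one per site
        # site_roi: array (n_samples, n_sites, n_rois) of per-site ROI intensities
        for s in range(self.n_sites):
            for k in range(self.n_rois):
                self.models[s][k].fit(site_features[s], site_roi[:, s, k])
        return self

    def predict(self, site_features):
        # Assumption: the total ROI intensity is the sum over the three sites
        total = np.zeros((site_features[0].shape[0], self.n_rois))
        for s in range(self.n_sites):
            for k in range(self.n_rois):
                total[:, k] += self.models[s][k].predict(site_features[s])
        return total

# Toy usage with random stand-in data
rng = np.random.default_rng(0)
n_samples, n_feat = 200, 5
features = [rng.normal(size=(n_samples, n_feat)) for _ in range(3)]
site_roi = np.stack([f @ rng.normal(size=(n_feat, 3)) for f in features], axis=1)
emulator = SiteRoiEmulator(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
roi_pred = emulator.fit(features, site_roi).predict(features)
print(roi_pred.shape)
```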
The individual structures R were encoded using the local many-body tensor representation (LMBTR) [49] implemented in the DScribe package [50]. The system is represented as distances between pairs and angles between triples of each element combination that includes the central atom, i.e. the absorption site. The internal hyperparameters of the LMBTR descriptor and the emulator were chosen by an alternating search for the best average ROI prediction (see Supplementary Information). In the search for the best descriptor, some LMBTR features were zero, depending on the corresponding set of hyperparameters. These features were manually removed before training the final model, without a significant effect on performance. The total dimensionality of the final LMBTR vector D(R) describing the local neighborhoods of the three nitrogen atoms was 1140. The ECA algorithm aims at dimensionality reduction, an optimization task complicated by the notably large number of dimensions of this space. We applied an alternating partial optimization with respect to 10% of the ECA vector components at a time. The routine was completed with a full optimization of all components at once. For these tasks we used the optimization toolbox of the SciPy [51] Python library and the trust-region interior point method [52] therein.
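The alternating partial optimization could look roughly like the sketch below for a single (first) ECA vector. Several things here are assumptions: the helper names, the random choice of blocks, the number of sweeps, and the toy emulator, which is a simple analytic function standing in for the trained NNs. The sketch calls SciPy's `trust-constr` method without constraints; the orthonormality constraints needed for higher ranks are omitted.

```python
import numpy as np
from scipy.optimize import minimize

def covered_variance(y_true, y_pred):
    """Generalized R2 score: 1 - residual / total sum of squares over all outputs."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def eca_score(v, X, Y, emulator):
    """Score of emulator predictions for data projected onto the unit vector v."""
    v = v / np.linalg.norm(v)
    X_proj = np.outer(X @ v, v)               # rank-one projection of standardized X
    return covered_variance(Y, emulator(X_proj))

def fit_first_component(X, Y, emulator, v0, block_frac=0.1, sweeps=2, seed=0):
    rng = np.random.default_rng(seed)
    v = np.array(v0, dtype=float)
    v /= np.linalg.norm(v)
    d = v.size
    block = max(1, int(round(block_frac * d)))
    for _ in range(sweeps):
        order = rng.permutation(d)
        for start in range(0, d, block):      # optimize about 10% of components at a time
            idx = order[start:start + block]

            def loss(sub, idx=idx):
                trial = v.copy()
                trial[idx] = sub
                return -eca_score(trial, X, Y, emulator)

            res = minimize(loss, v[idx], method="trust-constr",
                           options={"maxiter": 50})
            v[idx] = res.x
            v /= np.linalg.norm(v)
    # Final full optimization of all components at once
    res = minimize(lambda w: -eca_score(w, X, Y, emulator), v,
                   method="trust-constr", options={"maxiter": 200})
    return res.x / np.linalg.norm(res.x)

# Toy problem: the output depends on a single hidden direction u
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
u = np.zeros(20)
u[:3] = [0.6, -0.64, 0.48]
emulator = lambda D: np.tanh(D @ u)[:, None]  # hypothetical stand-in emulator
Y = emulator(X)
v0 = rng.normal(size=20)
v1 = fit_first_component(X, Y, emulator, v0)
score_start = eca_score(v0, X, Y, emulator)
score_fit = eca_score(v1, X, Y, emulator)
```

Because the score is invariant to the length of the vector, renormalizing after each block does not change the objective value, so the block updates can only improve on the starting score.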
For the analysis of general spectrum-structure relationships, we combined all the data from the three force fields and the two temperatures into one data set, which was randomly divided for model selection and training (80%) and for testing and application (20%) of the emulator.

C. Experiment
We measured XRS spectra [53] at the N K-edge of aqueous triglycine using the multi-element XRS endstation [54] at beamline ID20 at the European Synchrotron Radiation Facility (ESRF). In the experiment, a scattering signal proportional to the double differential scattering cross section (DDSCS) [55] d²σ/dΩdω₂ at finite angle element dΩ and outgoing photon energy element dω₂ is recorded with a monochromatic and collimated incident beam of hard X-rays. When carried out at small scattering angles (forward direction), the XRS core-level signal is reduced to the dipole spectrum, and thus yields a spectrum equivalent to XAS [56]. Four analyzer modules, with 12 spherically bent Si(660) crystals (bending radius R_b = 1 m) in each, were used to detect scattering in the near-forward direction. We used the spatial imaging property of the bent analyzer crystals to include signal only from the interaction region by selecting the corresponding pixels of the CCD detector in the analysis. We observed different signals from the bulk liquid region and from the capillary walls and therefore excluded the latter pixels from the analysis.
To minimize the effect of radiation damage, a heatable liquid flow cell (a modified version of a cell described elsewhere [57]) was used. In the system, a magnetically driven pump rotor is used to circulate ∼ 5 ml of sample liquid through a capillary (2 mm outer diameter, 0.01 mm wall thickness), where inelastic X-ray scattering takes place. Moreover, the sample was replaced regularly to further reduce possible degradation. The sample solution of molality 0.3 mol/kg of aqueous triglycine was prepared by dissolving the powder sample (Sigma-Aldrich, purity ≥ 99.0%, lot # BCCB1590) into de-ionized water (ρ ≈ 18.2 MΩ cm, TOC ≈ 2 ppb).

III. RESULTS
The Ramachandran angles of the central residue, ϕ and ψ (see Figure 1), can be visualized in a corresponding scatter plot. Depending on these two dihedrals, the residue can exhibit either an α-helix or a β-sheet secondary structure. Our MD simulations show that triglycine appears in both of these structural classes (Figure 2a-f), but the distributions differ between the force fields. As the molecule has no restrictive side chains, the plot is symmetric for the left-handed and right-handed isomers. The conformational variation of the liquid system gives rise to significant spectral variation. Using simulations allows studying the dependency between the shape of a spectrum and the underlying structure.
Figures 2g-i present the experimental spectra as well as the simulated ensemble-mean spectra for the corresponding force fields and the two temperatures. The temperature difference profile ∆ is also shown. The simulated curves were shifted by −2.9 eV to match the pre peak of the experiment, and the experiment was scaled for the same main peak height with the respective 300 K simulation. Due to the huge Compton scattering background, approximately subtracted from the spectra, our experiment suffers from a rather large statistical uncertainty. This prohibits further conclusions, as the predicted difference profiles lie within this uncertainty. However, a close match between the simulated and experimental spectra is obtained.

FIG. 2. The calculated and experimental results for aqueous triglycine. a-f: Scatter plot of the Ramachandran angles of the central residues from the simulated trajectories and the definitions [58] of the allowed (99.95%) and favored (99.8%) regions of glycine residue according to the Top8000 data set [59]. g-i: Computational ensemble mean and experimental spectra (background removed) with the respective temperature difference profiles ∆ and spectral regions of interest. For the experiment, the difference profile has been 4-fold binned from the spectra. The error bars and shading of the simulated curve indicate the statistical uncertainty σ (confidence level 68%). The computational spectra have been shifted by −2.9 eV in all cases for the pre peak to match with the experiment. The experiment is presented scaled for the same main peak height as the respective 300 K simulation.
We did not observe a significant linear correlation between any of the structural features (internal coordinates and water related parameters) and the ROI intensities (see Supplementary Information). However, a well-functioning emulator will enable the use of ECA decomposition as a more advanced analysis tool. In this case, the apparent collaborative action of the structural degrees of freedom is, indeed, captured by an NN-based emulator, as depicted in Figure 3. The emulator achieved covered variances (R² scores) of 0.714, 0.945, and 0.953 for regions I-III, respectively, with the test data. The prediction accuracy is systematically worse for region I, the origin of which is the N 1s → π* resonance of the N atoms of the peptide bonds. Combining the regions together, the total covered variance stood at 0.874 for z-score standardized ROIs and 0.942 for the absolute intensity ROIs. The difference can be attributed to the much larger total areas of ROIs II and III dominating the absolute score. Going forward, a score of 0.874 is the upper limit for the covered spectral variance of the standardized ROI intensities by ECA decomposition, as even with complete structural coverage the respective ML-emulator-induced error will remain.
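The difference between the standardized and the absolute total score can be illustrated with a short numerical sketch. It assumes that the total covered variance is computed from sums of squares pooled over the three ROIs, which the text does not spell out; the synthetic data are made up. With that definition, the standardized total equals the plain average of the per-ROI scores, whereas the absolute total weights each ROI by its variance. The plain average of the three reported scores is about 0.871, close to the reported 0.874; the small difference may be due to details of the standardization.

```python
import numpy as np

def covered_variance(y_true, y_pred):
    """Pooled R2: 1 - residual / total sum of squares over all columns (assumed definition)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

reported = np.array([0.714, 0.945, 0.953])    # per-ROI scores from the text
plain_average = reported.mean()

# Synthetic ROI data: ROIs II and III vary much more than ROI I,
# and ROI I has the largest relative prediction error
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 10.0])
rel_noise = np.array([0.5, 0.2, 0.2])
y = rng.normal(size=(2000, 3)) * scales
y_hat = y + rng.normal(size=(2000, 3)) * scales * rel_noise

mu, sd = y.mean(axis=0), y.std(axis=0)
z, z_hat = (y - mu) / sd, (y_hat - mu) / sd   # z-score standardization

r2_each = np.array([covered_variance(y[:, [k]], y_hat[:, [k]]) for k in range(3)])
r2_standardized = covered_variance(z, z_hat)  # equals r2_each.mean()
r2_absolute = covered_variance(y, y_hat)      # variance-weighted average of r2_each
weights = y.var(axis=0) / y.var(axis=0).sum()
```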
We split the test data further into two parts of the same size for the ECA. The first part was used for the optimization of the standardized structural descriptor space basis vectors (fit) and the latter was used for an independent test (validation). The covered variances as a function of the rank of the ECA expansion are given in Table I for both data parts. While the covered ROI intensity variance improves up to rank five (emulator limit is achieved), convergence for the validation data is practically reached with three components. The ECA components can be used for structural interpretation [25,60]. We analyze the first component (Figure 4) in terms of convoluted distributions of interatomic distances, a subset of the entire LMBTR descriptor. This component shows the predominant structural changes associated with spectral ROI variation: increase in ROIs I and III, and decrease in ROI II (see Supplementary Information). There is a striking shift in the N1-H, N_ex-C and N_ex-O interatomic distances, and also in the N_ex-N distances, although the latter remain somewhat inconclusive due to the limited distance range of the descriptor. The surrounding water affects the spectrum, as for both N_ex-H and N_ex-O there is a shift at around 3 Å near the first H₂O solvation shell. Interpretation of the angular structural features from the LMBTR is more complicated because exponential distance weighting was found to be required for best emulation performance. The weighting mode was one of the numerous hyperparameters varied in the model selection phase.

FIG. 4. The interatomic distance distributions for each of the nitrogen atoms deduced from the first component vector transformed into the descriptor space. a-d: Distances from the absorption site (denoted as N_ex) to the neighboring hydrogen, carbon, nitrogen and oxygen atoms, respectively. The ROIs of the total N K-edge spectrum are sensitive to the hydrogen and oxygen atoms at around 3 Å, which can mainly be attributed to the water molecules. In addition, the ROIs are sensitive to the nearest carbon and nitrogen atoms of each absorption site. Finally, the ROIs are also sensitive to the three nearest hydrogen atoms of N1.
We next turn our focus to the features representing the secondary structure, the Ramachandran angles. We assigned each simulated structure in the test set (which includes the ECA fit and ECA validation data) to one of three classes: α-helix, β-sheet or 'other', based on the contours [58] from the Top8000 data set [59]. Borderline cases were assigned manually as shown in Figure 5a. The fractions of α-helices and β-sheets are given in Table II. We then used principal component analysis (PCA) to reduce the dimensionality of the LMBTR-encoded structures to two (spectral R² = 0.002), as shown in Figure 5b. This plot shows clear clusters of α-helices and β-sheets, with 'others' found in between the two. The LMBTR can thus encode this information, even though the dihedrals ϕ and ψ are not directly included in it. Correspondingly, we used the ECA approach as a 2D dimensionality reduction tool for the same data (spectral R² = 0.733), as depicted in Figure 5c, without clearly observable clusters. This difference can be explained by PCA focusing only on covered structural variance, whereas ECA is guided by the spectral one. The result therefore indicates that, for reasonable structures, XAS (or XRS) is insensitive to the Ramachandran angles, which in turn cannot be reconstructed from the spectral ROIs.
Feature importance is a metric used for the significance of an input feature with respect to the output of an ML model [61]. We define an importance score for each LMBTR feature as its absolute value in the first ECA vector. Next, we trained a new emulator with the same architecture as the original one (except for the input layer) using a given number of LMBTR features of the highest importance score. We tested the corresponding models using the ECA validation set and found that most LMBTR groups (e.g. a pair-wise distance distribution) contain highly meaningful features in terms of performance, which, in turn, indicates a strong interplay of structural features of all kinds. Improvement of the covered spectral ROI variance is monotonic with respect to the importance score and strongly saturates already with 300 features (see Supplementary Information). This shows that the magnitude of a feature in the ECA vector directly measures its spectral significance.
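The feature selection by importance score amounts to ranking the components of the first ECA vector by absolute value. A minimal sketch follows; the function name and the six-component vector are made up for illustration, and in the actual analysis the vector has 1140 components.

```python
import numpy as np

def top_k_features(eca_vector, k):
    """Indices of the k components with the largest absolute value, most important first."""
    importance = np.abs(np.asarray(eca_vector, dtype=float))
    return np.argsort(-importance, kind="stable")[:k]

# Made-up first ECA vector with six components
v1 = np.array([0.05, -0.60, 0.10, 0.70, -0.02, 0.37])
keep = top_k_features(v1, 3)

# Reduced descriptor matrix: only the selected feature columns are kept
X = np.arange(12.0).reshape(2, 6)
X_reduced = X[:, keep]
```

A new emulator would then be trained on `X_reduced` instead of the full descriptor, with only its input layer changed.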

IV. DISCUSSION
Although the different force fields give different structural distributions (including those of the Ramachandran angles), the spectral effect predicted upon change of temperature is similar for all of them. This complicates the validation of the force fields by core-level excitation spectra. Moreover, the predicted difference profiles are small and cannot be confirmed or rejected by the current experiment, as all predicted changes are within the error bars that also include the possibility of zero effect. However, the experiment supports the validity of the spectrum simulations, together with the respective convergence checks.
Schwartz and co-workers used a liquid jet to measure the nitrogen K-edge X-ray absorption spectrum of triglycine(aq) at ambient temperature [26] by total electron yield. They obtained a spectrum with roughly similar features, but a notably different pre peak to main edge ratio. Interestingly, our study is in agreement with X-ray spectroscopy of solid triglycine [15,16,20]. The bump at the post edge seen only in our 350 K spectrum seems to be present in the other studies [15,16,20,26] and is also present in the N K-edge spectrum of the similar molecule diglycine [15]. The cause of this feature is not explained by our simulations or the measured 300 K spectrum, but based on these references the possibility of some solid triglycine in our 350 K heated sample cannot be excluded. The aforementioned results have been obtained with numerous yield techniques not necessarily equivalent to the XRS used here, or to the definition of XAS. Although XRS has a rather low count rate, it is known to be bulk sensitive and extremely stable.
The descriptor used in an ML work sets the 'language' for the resulting scientific discussion. Although several descriptors for the atomistic structure of molecules have been developed [49,50,62-66], we see three general requirements for the spectrum analysis by ECA:
1. The descriptor must allow for accurate ML with the available data.
2. The descriptor must allow for ECA decomposition of high spectral variance coverage with only a few dominant components.
3. The descriptor must allow for interpretation in terms of structure; preferably it is translatable into simple structural information.
The architecture of a well-performing descriptor gives insight into the physical system, as it can encode relevant structural information. With LMBTR, the distance-based weighting of angles improved the ML prediction accuracy, which is understandable as the nearby atoms should, by intuition, affect the spectrum more than the distant ones. Unfortunately, the weighting also makes the angular features of the descriptor harder to interpret, and requirement (3) is not met. Thus, effective encoding of physical information for requirement (1) may render some of it unrecoverable, and moreover, a trade-off between all three requirements can be expected. Previously, we have applied a similar method for glassy GeO₂ [60], where we found a variant of a Coulomb matrix [64] (similar to the Bag of Bonds [65]) a suitable descriptor for ML and ECA. Arguably, the well-defined covalent bond topology renders LMBTR well-suited for aqueous triglycine.
The ECA algorithm applies projection of the structural descriptor vector onto a limited number of basis vectors and relies on an emulator to predict the spectral information for these new points much faster than the corresponding electronic structure calculation would do. This allows for using iterative algorithms for the search of the optimal basis vectors. The emulator must be a sufficiently complex (likely non-linear) yet generalizable function to capture the relation between any relevant structure R and its spectrum S. The selection of this mapping S_emu(D(R)) is a complicated problem with plenty of tunable hyperparameters, including those of the descriptor D(R). In this work we settled for nine independent neural networks, one for each atomic site and ROI with a combined output, as they worked best for the LMBTR-encoded triglycine system with points (1)-(3) in mind. As is often the case in machine learning, the choice cannot be shown to be the global optimum, but only the best in the particular hyperparameter search. To date there are no generally accepted performance criteria in the X-ray spectroscopic community.
The results of the importance score analysis show that almost all groups of distance and angular distributions are necessary to maximise the covered spectral ROI variance. On the other hand, a significant number of features from each group could be ignored, as only 300 of them in total (originally 1140) are necessary to cover nearly all the accessible variance, which itself rises quickly and monotonically with respect to the number of selected features. This shows that the first ECA vector can indeed find the most relevant features of the descriptor in the order of their spectral significance and could serve as a tool for feature selection. Using the ECA, the dependence of ROI intensities, as captured by the descriptor-emulator-ROI mapping S_emu(D(R)), can practically be condensed into three structural degrees of freedom when a 1140-dimensional LMBTR is used.
The quantitative analysis of our simulations proves that while information about the Ramachandran angles is present in the LMBTR-encoded structural space, it is largely lost in the subspace covering a significant portion of the spectral ROI variance. Therefore, our results support the conclusion of Schwartz and co-workers [26] in that XAS is insensitive to the secondary structure of triglycine (within the reasonably expected structural space). Furthermore, our results are at least partly in contradiction with those of Gordon and co-workers [15], as we see at most a small dependency between the most relevant part of the backbone conformation and its spectrum (Figure 5c). This conclusion is also supported by the mean spectra of all α-helices and all β-sheets in the data (see Supplementary Information), which show a difference profile deviating from those presented in Figure 2. The advantage of K-edge spectra, locality, seems to be a limitation when it comes to the secondary structure of proteins, for which there is an information bottleneck for reasonably expected structures. The ECA shows the ROIs of the total spectrum to be mainly sensitive to the nearest C and N atoms from the solute together with the O and H atoms mostly from the water, and additionally to the nearest H atoms in the case of the N-terminus of the peptide.

V. CONCLUSIONS
Experimental N K-edge X-ray Raman scattering spectra of aqueous triglycine can be modelled with a good match by classical MD and TP-DFT calculations. However, the experiment is not able to confirm or reject the predicted temperature difference effect. A machine learning emulator can predict the intensities of spectral regions of interest from the corresponding local many-body tensor representation encoded structures. This enabled the application of the emulator-based component analysis, which was able to condense the structure-spectrum dependency practically into three degrees of freedom. Moreover, the spectral significance of structural features was found to be ordered by the magnitude of the corresponding component in the first obtained basis vector. From the analysis we conclude that the details of the secondary structure of aqueous triglycine, expressed by the Ramachandran angles, are lost in X-ray absorption spectral regions of interest due to an information bottleneck. Instead, in the structural information available for analysis from the used descriptor, the distance distributions between the absorption site and its nearest neighbors significantly account for the spectral region variance. Each of the tried force fields results in a similar temperature-difference profile, and therefore distinguishing between them by X-ray absorption seems improbable. On the other hand, the spectrum can be predicted equally well with all of them.
In this work we have pushed structural decomposition by the ECA algorithm to the maximal size scales that X-ray spectra can be hoped to have sensitivity to. The method identifies the spectrally relevant structural subspace in a complicated system without prior knowledge. Future prospects of this approach include reconstructing the maximum obtainable structural information from the spectra, the first steps of which have already been taken [60]. We also note that the method is not limited to X-ray spectroscopy, but can be applied to a wide variety of inverse problems for which an emulator for the forward problem is known.

The best LMBTR hyperparameters within the DScribe package
A full list of the selected LMBTR hyperparameters within the implementation by DScribe [50] is presented in Table III.

Correlation between structural features and ROI intensities
We evaluated Pearson's r coefficients between the ROIs and internal coordinates, i.e. bond lengths r, angles θ, and dihedrals ϕ. In addition, the following water parameters were investigated: the number of donated (D) and accepted (A) hydrogen bonds, and the number of water molecules in each of the two solvation shells SS1 and SS2 with respect to the absorbing nitrogen. The definitions of the water parameters are described in [9]. Figure 7 shows the naming convention of the atoms in this analysis.
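A bootstrap estimate of the uncertainty of Pearson's r, of the kind used for the error estimates in this analysis, can be sketched as follows. The function names, the vectorized resampling and the synthetic data are assumptions for illustration; the toy call uses fewer resamples than the analysis itself to keep memory use small.

```python
import numpy as np

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

def bootstrap_pearson(x, y, n_boot=10_000, seed=0):
    """Return Pearson's r and its bootstrap standard deviation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    idx = rng.integers(0, n, size=(n_boot, n))   # resample pairs with replacement
    xb, yb = x[idx], y[idx]
    xb = xb - xb.mean(axis=1, keepdims=True)
    yb = yb - yb.mean(axis=1, keepdims=True)
    r_boot = (xb * yb).sum(axis=1) / np.sqrt((xb ** 2).sum(axis=1) * (yb ** 2).sum(axis=1))
    return pearson_r(x, y), r_boot.std(ddof=1)

# Synthetic example: a structural feature x and a correlated ROI intensity y
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)
r, sigma = bootstrap_pearson(x, y, n_boot=2000)
sign_is_resolved = abs(r) > 2.0 * sigma          # sign of r determined within 2 sigma
```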
The results reveal that there are only a few features whose correlation is consistent with respect to the temperature difference profile for all ROIs (i.e. the sign of the correlation with a ROI intensity is always either equal or opposite to the sign of the difference profile) within the 2σ error obtained from a 10000-fold bootstrap resampling (Table IV). There are also many features whose correlation is not strictly inconsistent with respect to the difference profile (Table V). Overall, the correlation analysis of the internal coordinates and ROIs reveals the notable complexity of the problem and necessitates a more sophisticated analysis that accounts for the collaborative action of several structural features.

Spectral significance of the components in the ECA vector
We carried out an analysis of features by ordering them by the importance score (component magnitude in the first ECA vector). We trained an emulator multiple times, iteratively removing 30 of the least important features at each cycle. Only the input layer of the emulator architecture was modified between the iterations.
The results show that most of the spectral ROI variance is covered by the 150 features of the highest importance score (Figure 9), and almost all of it by the 300 features of the highest importance score (Figure 10). Almost every group of LMBTR features is present in both cases (Figure 9a and Figure 10a). The covered variance rises quickly and monotonically with respect to the number of selected features (Figure 9b and Figure 10b), which shows that ECA can find and order the most relevant structural features with respect to the spectral ROI variance. The PCA (Figure 9c and Figure 10c) of the selected features shows that the information about the Ramachandran angles is still present in the reduced feature space (150 or 300 features, respectively).

FIG. 1. A schematic illustration of aqueous triglycine. The absorption sites are denoted as N1, N2 and N3. The backbone dihedrals known as the Ramachandran angles ϕ and ψ are shown for the central residue. The molecular plot was prepared using the VMD software [27].

FIG. 3. Emulator evaluation with standardized data. a-c: Predicted intensities for ROIs I-III, respectively, plotted against the known ones. Covered variance R², Pearson's r, and the mean squared error MSE are given for each panel. Overall covered variance of the three ROIs is 0.874.

FIG. 5. Analysis of the structural classes of the test data with PCA of LMBTR features and ECA reconstruction. a: Data points classified to structural classes. The allowed and favored regions according to the Top8000 data set [59] are shown as contours. b: A 2-component PCA decomposition of the LMBTR features of the data identifies the structural classes. c: A 2-component ECA-coordinate reconstruction, based on spectral ROIs, covers drastically more spectral variation but does not distinguish the structural classes. For details, see text.

FIG. 7. The naming of the atoms. Molecular plot made by the Jmol software.

Figure 8 shows an increasing trend for the ROI I and III intensities, and a decreasing one for the ROI II intensity, along the first ECA component vector.

FIG. 9. a: The frequency of occurrence of a feature from different LMBTR groups when 150 features of the highest importance score are used. b: The covered spectral ROI intensity variation as a function of the set size. Rapid and monotonic saturation is observed and the set of 150 features covers most of the spectral variance. c: 2-component PCA of the 150-dimensional feature vectors from feature selection by the importance score. The information about the Ramachandran angles is still present in this structural descriptor.

TABLE I. Covered ROI intensity variances as a function of the rank of the ECA decomposition.