Using Gaussian process regression to simulate the vibrational Raman spectra of molecular crystals

Nathaniel Raimbault; Andrea Grisafi; Michele Ceriotti; Mariana Rossi

doi:10.1088/1367-2630/ab4509

1. Introduction

Machine-learning (ML) models are becoming increasingly popular in the field of atomistic simulations, providing a way to obtain data-driven physical insights [1–3] and reduce the cost of simulations [4, 5]. Most efforts have been concentrated into predicting total energies and forces from atomic coordinates [5–13], which are most often the largest cost in a first-principles simulation. More recently, machine-learning models have been also applied to the prediction of response properties of molecules [14–18]. When dealing with the response of a material to an applied field, the cost of a first-principles calculation is often larger than that of force evaluation. This is thus an area where one can also take advantage of supervised learning techniques in order to reduce the cost ab initio simulations that make use of such response properties.

Vibrational Raman spectra are a good example of properties that requires the knowledge of the response of the system to electric field perturbations. The Raman signal is very useful to monitor phase transitions, as well as for the identification of global and local structural patterns [19–21]. Any technique used to calculate this property requires the calculation of several instances of the polarizability tensor (in molecules) or the dielectric susceptibility (in crystals). Previously, some of the present authors have shown that anharmonic vibrational Raman spectra calculated through a time-correlation formalism can be a powerful tool to identify structural fingerprints in molecular crystals [22, 23]. Within this formalism, it is necessary to calculate ab initio molecular dynamics trajectories and compute the response quantities for subsequent atomic configuration, employing, for instance, density-functional perturbation theory (DFPT) [22, 24–27]. These calculations are computationally demanding, not only because of the tens of thousands of force evaluations that need to be performed to provide sufficient statistical sampling, but also because each DFPT calculation is typically four times more expensive than a force evaluation [23]⁵ . Furthermore, while there are several empirical potentials available that can be used to simulate the dynamics of molecular crystals [28], empirical models of the polarizability tensors are rare and poorly transferable [29].

Here we investigate frameworks to obtain accurate predictions of the dielectric response properties for a number of consecutive molecular dynamics (MD) configurations, that are necessary to converge simulated vibrational Raman spectra of molecules and molecular crystals. We compare different flavors of Gaussian process regression (GPR), which is a method that has already been proven to be efficient in predicting dielectric response properties [14, 15, 17]. In particular, we compare standard GPR schemes with symmetry-adapted (SA)-GPR [30], which is advantageous when describing tensorial quantities. For the former, we exploit the internal structural rigidity of the system in order to model each individual component of the polarizability tensor, therefore remapping the tensor learning problem onto many separate scalar regression tasks. For the latter, we employ a SA representation of the system in order to learn the irreducible spherical components of the tensor in a covariant fashion. As a trial system we consider the Paracetamol molecule and form I and form II of the Paracetamol crystal, represented in figure 1. As we demonstrate below, SA-GPR comes out as the methodology with the best performance. When predicting the Raman spectra of crystals, both methods can benefit from using a GPR trained on the monomer as a baseline. We also find that our model is transferable between different polymorphic forms. Given that empirical and machine-learned potential energy surfaces are becoming more accurate for molecular crystals, the methodology proposed here can be combined in a straightforward manner to such potentials, giving access to the electronic polarization and polarizability of crystals.

**Figure 1.** The systems considered in this work for the prediction of vibrational Raman spectra. (a) Isolated Paracetamol molecule. (b) Paracetamol crystal form I (monoclinic). (c) Paracetamol crystal form II (orthorhombic). Atomic color code: hydrogen white, nitrogen blue, carbon grey, oxygen red. The unit cell is drawn in black.
Download figure:
Standard image High-resolution image

In the following, section 2 introduces the general machinery describing GPR and SA-GPR, as well as the different representations used to map structures to the inputs of the ML scheme. In section 3, we present applications on the Paracetamol molecule and monoclinic form I crystal, showing for the latter how our models can be refined by the inclusion of molecular polarizability tensors. Finally, we illustrate in section 4 the transferability of the SA-GPR method, by predicting the Raman spectrum of the orthorhombic form II crystal without prior knowledge of the corresponding values of the dielectric response.

2. Theory

2.1. Vibrational Raman spectra

The central quantity needed for simulating vibrational Raman spectra is the electronic static polarizability tensor ${\boldsymbol{\alpha }}$ . For simplicity, we will refer to the polarizability tensor for all the rest of the paper, but one should keep in mind that for solids the quantity of interest is rather the electric susceptibility tensor of the system.

As discussed in [23] and others [31, 32], the vibrational Raman spectrum can be calculated using several approximations, the simplest of which is the harmonic approximation. The framework developed in this paper can be applied to the harmonic case, as shown in the supplemental information, figure S8 available online at stacks.iop.org/NJP/21/105001/mmedia, but we will here focus on the more challenging task of applying it to the linear-response time-correlation formalism, which fully takes into account the anharmonicity of the potential energy surface. In this formalism, vibrational Raman intensities can be obtained from the Fourier transform of the static polarizability autocorrelation function [33] at thermodynamic equilibrium. In particular, the so-called powder spectrum intensity is given by a combination of the anisotropic and isotropic contributions as

$\begin{eqnarray}\begin{array}{rcl}I(\omega ) & = & {I}_{\mathrm{iso}}(\omega )+\displaystyle \frac{7}{3}{I}_{\mathrm{aniso}}(\omega )\\ {I}_{\mathrm{iso}}(\omega ) & = & \displaystyle \frac{n}{2\pi }{\displaystyle \int }_{-\infty }^{+\infty }{{\rm{d}}{t}{\rm{e}}}^{-{\rm{i}}\omega t}\langle \bar{\alpha }(0)\bar{\alpha }(t)\rangle \\ {I}_{\mathrm{aniso}}(\omega ) & = & \displaystyle \frac{n}{2\pi }{\displaystyle \int }_{-\infty }^{+\infty }{{\rm{d}}{t}{\rm{e}}}^{-{\rm{i}}\omega t}\displaystyle \frac{1}{10}\langle \mathrm{Tr}\left[\tilde{{\boldsymbol{\alpha }}}(0)\cdot \tilde{{\boldsymbol{\alpha }}}(t)\right]\rangle ,\end{array}\end{eqnarray} \tag{ 1 }$

where n is the number of atoms in the system, the brackets $\langle \cdot \rangle$ denote an ensemble average and $\mathrm{Tr}$ represents the trace. $\bar{\alpha }$ and $\tilde{{\boldsymbol{\alpha }}}$ are, respectively, the isotropic and anisotropic parts of the polarizability tensor, defined as follows

$\begin{eqnarray}\begin{array}{rcl}{\boldsymbol{\alpha }} & = & \bar{\alpha }{\boldsymbol{I}}+\tilde{{\boldsymbol{\alpha }}}\\ \bar{\alpha } & = & \displaystyle \frac{1}{3}({\alpha }_{{xx}}+{\alpha }_{{yy}}+{\alpha }_{{zz}}),\mathrm{Tr}[\tilde{{\boldsymbol{\alpha }}}]=0.\end{array}\end{eqnarray} \tag{ 2 }$

In this paper, we do not address the problem of obtaining forces, which are necessary to calculate a full Raman spectrum from either molecular dynamics trajectories or in the harmonic approximation. Instead, we focus on predicting only the Raman intensities and combine them with pre-computed ab initio trajectories and displacements.

2.2. Component-wise GPR

GPR is a well-established method based on a kernel function that measures the similarity between structures. It is formally equivalent to kernel-ridge regression, to which it differs mostly by the fact that GPR frames the construction of the model in a Bayesian language. In this sense, the kernel represents the prior distribution for the statistical correlations of the property we aim to predict. In the usual supervised-learning framework, a dataset of structures (i.e. atomic coordinates) and associated polarizabilities is used to train the model. Once the training is complete, the model is tested on a different set of configurations for which the polarizability is in principle not known. If it is possible to align the system to a reference structure, a straightforward way to apply this procedure to tensorial quantities is to learn each component of the polarizability tensor separately. In particular, a GPR prediction of each individual polarizability component α_γ _δ reads

$\begin{eqnarray}&&{\alpha }_{\gamma \delta }^{\mathrm{ML}}({ \mathcal A })={\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai}}+\displaystyle \sum _{j=1}^{N}{w}_{j}^{\gamma \delta }k({ \mathcal A },{{ \mathcal A }}_{j}),\end{eqnarray} \tag{ 3 }$

where N is the number of configurations included in the training set, γ and δ represent Cartesian coordinates, ${w}_{j}^{\gamma \delta }$ are the regression weights that need to be determined from the training data for each component, and k is the kernel that couples the target system ${ \mathcal A }$ , with the training structure ${{ \mathcal A }}_{j}$ . The quantity ${\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai}}$ is the average (over the training set) of the γδ polarizability component computed from an ab initio method, which effectively allows the training to focus on the fluctuation of the property with respect to a known baseline value.

The kernel entering GPR is based on a Gaussian similarity measure between structures ${ \mathcal A }$ , given by

$\begin{eqnarray}&&k({ \mathcal A },{{ \mathcal A }}_{j})={{\rm{e}}}^{-\tfrac{| {\boldsymbol{u}}({ \mathcal A })-{\boldsymbol{u}}({{ \mathcal A }}_{j}){| }^{2}}{2{\sigma }^{2}}},\end{eqnarray} \tag{ 4 }$

with σ being a hyperparameter that controls the magnitude of the correlation between training points. We optimize the value of σ—as well as that of other hyperparameters entering the definition of the features or the kernel—by cross-validation (CV), even though in a GPR framework hyperparameters are often determined by likelihood maximization (see SI, section 1 for further details). ${\boldsymbol{u}}$ is a vector that has the role of mapping the atomic coordinates of the structure ${ \mathcal A }$ to a given representation of dimension M.

The regression weights ${{\bf{w}}}^{\gamma \delta }$ are obtained by minimizing a loss function regularized by an L2-norm [34] over the training set. This procedure leads to the following expression

$\begin{eqnarray}&&{{\bf{w}}}^{\gamma \delta }={\left({\mathbb{K}}+\eta {\mathbb{1}}\right)}^{-1}\cdot {\rm{\Delta }}{{\boldsymbol{\alpha }}}_{\gamma \delta },\end{eqnarray} \tag{ 5 }$

where ${\mathbb{1}}$ is the identity matrix, ${\mathbb{K}}$ is the N × N kernel matrix associated with the reference structures, such that ${{\mathbb{K}}}_{{ij}}=k({{ \mathcal A }}_{i},{{ \mathcal A }}_{j}),i,j\,=\,1,\ldots ,\,N$ , and η is the regularization parameter which controls to which extent the fitted data can deviate from the training points. The quantity ${\rm{\Delta }}{{\boldsymbol{\alpha }}}_{\gamma \delta }$ represents the vector containing all N entries in the training set of ${\rm{\Delta }}{\alpha }_{\gamma \delta }^{j}({{ \mathcal A }}_{j})={\alpha }_{\gamma \delta }^{\mathrm{ai}}({{ \mathcal A }}_{j})-{\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai}}$ .

The efficiency of any GPR model strongly depends on the quality of the representation that is encoded in ${\boldsymbol{u}}({ \mathcal A })$ . For a representation to be efficient, it should contain the least possible number of elements to express the identity of a given structure, while avoiding redundant information. Among the vast choice of representations one could conceive, we will employ two that are very similar in spirit. First, we will consider a representation of each structure in terms of its atomic density (AD), explicitly evaluated on a grid. This AD is combined with the GPR framework, and with an alignment procedure that makes it possible to learn tensor components independently. Second, we will consider a SA-GPR scheme [30], whose application is summarized in the next section. The framework is based on λ-SOAP kernels, that are built by covariant integration of the atom density [35], and therefore automatically incorporate molecular symmetries.

Atomic-density grids consist of a conceptually simple representation. The construction introduces a Cartesian reference frame centered on the system under study, and defines a three-dimensional grid around it. For each grid-point ${\bf{r}}$ , we calculate the AD distribution, defined as

$\begin{eqnarray}&&{\rho }_{s}({\bf{r}})=\displaystyle \sum _{i\in s}\exp \left(-\displaystyle \frac{| {\bf{r}}-{{\bf{r}}}_{i}{| }^{2}}{2{\gamma }_{s}^{2}}\right),\end{eqnarray} \tag{ 6 }$

where s identifies a specific atom type, ${{\bf{r}}}_{i}$ are the nuclear positions and γ_s is an adjustable parameter. The feature vector is given by ${\boldsymbol{u}}({ \mathcal A })=({\rho }_{s}({\bf{r}})),s\,=\,1,\ldots ,{N}_{s}$ , where N_s is the number of different atomic species. For the applications using the standard GPR scheme shown in this paper, we have chosen the same γ_s = 0.5 Å for all species. This choice was based on the fact that we observed very minute changes when employing different values of γ_s for different species, but this observation is likely to be system- and method- dependent. We have used a grid of evenly-spaced points spanning the maximal extension of the system, although we note that more refined methods could be utilized to define physically-motivated grid points, based on the possible directions of vibrations of atomic species [36]. Such a representation is illustrated in figure 2, which shows a two-dimensional cut of the density distribution associated with a paracetamol molecule. Note that one could use a combination of different descriptors instead of a single one, whose elements would be concatenated to form the feature vector ${\boldsymbol{u}}$ .

**Figure 2.** 2D view of the nuclear density distribution of the Paracetamol molecule for a given structure. Each grid point is represented by a colored sphere. Blue (red) indicates a low (high) density.
Download figure:
Standard image High-resolution image

As mentioned above, the components of the polarizability tensor are not invariant to rotations in Cartesian space, and therefore a Cartesian space representation of this quantity, like the atomic densities, requires an alignment to a reference structure. To do so, we have used the Kabsch algorithm considering only heavy atoms [37, 38]. We note that this alignment procedure is not applicable to all systems, in particular to very flexible molecules. For simple relatively rigid molecules, it is known to work well, as has been shown previously for the case of hyperpolarizabilities of water molecules [15]. We will show below that even for a more complex and relatively flexible molecule like paracetamol, this representation still yields accurate predictions.

2.3. SA-GPR with λ-SOAP Kernels

SA-GPR [30] represents a generalization of the GPR formalism, where the tensorial nature of the target property, together with its covariant 3D transformations, are naturally incorporated within the regression algorithm [12]. This technique removes the need of often arbitrary alignment procedures of the systems (like the one described above) in order to predict tensorial quantities of any rank. As such, the model focuses only on the portion of variability of the tensor connected with an internal structural distortion of the molecular geometry, greatly improving the regression performances. Furthermore, by effectively learning an atom-centered model of the polarizability, a SA-GPR scheme makes it possible to build models that are transferable between different molecules [17], although we did not exploit this possibility in the present work.

A simplification of the learning problem can be obtained if the target property is first decomposed in its irreducible spherical tensor components. The static polarizability (or the static susceptibility), in particular, being a symmetric rank-2 tensor, can be formally decomposed into an isotropic contribution that transforms as a spherical harmonic of angular momentum λ = 0, and an anisotropic contribution that transforms as a spherical harmonic of angular momentum λ = 2. The former contribution, ${\alpha }_{0}^{(0)}\propto \bar{\alpha }$ , being directly proportional to the trace of the tensor (which is rotationally and translationally invariant), can be learned in the usual manner by a standard GPR. The tensorial nature of the polarizability is contained in the λ = 2 term, ${{\boldsymbol{\alpha }}}^{(2)}=({\alpha }_{-2}^{(2)},{\alpha }_{-1}^{(2)},{\alpha }_{0}^{(2)},{\alpha }_{1}^{(2)},{\alpha }_{2}^{(2)})$ , which is related to the anisotropic part of the Cartesian ${\boldsymbol{\alpha }}$ -tensor $\tilde{{\boldsymbol{\alpha }}}$ of equation (2) by a linear transformation. Within SA-GPR, the prediction of this contribution is carried out by making use of a tensorial kernel function ${{\boldsymbol{k}}}^{(2)}({ \mathcal A },{ \mathcal B })$ , which is a matrix of size 5 × 5, that describes at the same time the structural similarity between molecular configurations and the tensorial geometric relationship of order λ = 2 between these configurations, as detailed in [30]. In general terms, such a kernel can be thought of as a generalization of a scalar kernel function such as the one introduced in equation (4), which makes use of the following covariant integration

$\begin{eqnarray}&&{{\boldsymbol{k}}}^{(2)}({ \mathcal A },{ \mathcal B })=\int {\rm{d}}\hat{R}\ {{\boldsymbol{D}}}^{(2)}(\hat{R})\kappa ({ \mathcal A },\hat{R}{ \mathcal B }),\end{eqnarray} \tag{ 7 }$

where $\hat{R}$ represents the rotation operator and ${{\boldsymbol{D}}}^{(2)}(\hat{R})$ the associated Wigner-D matrix which has the role of expressing the rotation of λ = 2 spherical harmonics. In this definition, the scalar kernel $\kappa ({ \mathcal A },{ \mathcal B })$ only needs to be invariant under rigid translations and rotations of the laboratory reference frame with respect to which both ${ \mathcal A }$ and ${ \mathcal B }$ are defined. In the present work, $\kappa ({ \mathcal A },{ \mathcal B })$ is given by a superposition of atom-centered Gaussian densities, equivalent to the one used in the popular smooth overlap of atomic position (SOAP) kernel [39]. As such, the kernel of equation (7) represents the tensorial generalization of SOAP, usually called λ-SOAP, which recovers the scalar case of λ = 0 as a special limit [16].

Formally, a covariant λ = 2 prediction performed with a SA kernel function of the same order reads

$\begin{eqnarray}&&{{\boldsymbol{\alpha }}}^{(2)}({ \mathcal A })=\displaystyle \sum _{i=1}^{N}{{\boldsymbol{k}}}_{{ \mathcal A },{i}}^{(2)}\ {{\boldsymbol{w}}}_{i}^{(2)}=\displaystyle \sum _{i=1}^{N}{{\boldsymbol{k}}}_{{ \mathcal A },{i}}^{(2)}\displaystyle \sum _{j=1}^{N}{\left[{{\mathbb{K}}}^{(2)}+\eta {\mathbb{1}}\right]}_{{ij}}^{-1}{{\boldsymbol{\alpha }}}_{j}^{(2)},\end{eqnarray} \tag{ 8 }$

where i and j run over the N reference configurations used to train the regression model and η is the regularization parameter. The set of tensorial regression weights $\{{{\boldsymbol{w}}}_{i}^{(2)}\}$ are determined by inverting the (5 × N)² kernel matrix ${{\mathbb{K}}}^{(2)}$ associated with the training structures, and projecting it on the vector of reference tensors $\{{{\boldsymbol{\alpha }}}_{j}^{(2)}\}$ . In doing so, the statistical average of $\{{{\boldsymbol{\alpha }}}_{j}^{(2)}\}$ over the training set is assumed to be zero by symmetry. As such, no baseline of the anisotropic part of the polarizability is adopted when doing the regression.

2.4. Errors and uncertainty estimations

In order to gauge the accuracy of the machine-learned polarizability components, we calculate the root mean square error normalized by the standard deviation (STD) of the set we want to evaluate the error on

$\begin{eqnarray}&&{\epsilon }_{\gamma \delta }[\sigma ,\eta ]=100\times \sqrt{\displaystyle \frac{\tfrac{1}{N}{\sum }_{j}^{N}{\left({\alpha }_{\gamma \delta }^{\mathrm{ML}}({{ \mathcal A }}_{j})-{\alpha }_{\gamma \delta }^{\mathrm{ai}}({{ \mathcal A }}_{j})\right)}^{2}}{\tfrac{1}{N-1}{\sum }_{j}^{N}{\left({\alpha }_{\gamma \delta }^{\mathrm{ai}}({{ \mathcal A }}_{j})-{\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai}}\right)}^{2}}},\end{eqnarray} \tag{ 9 }$

where j represents the jth configuration, N is the amount of points in the data set we consider, and the bar here denotes the average of the polarizability component over the dataset of interest.

Whenever the reference properties of the testing data are not available, one cannot make use of equation (9) to evaluate the error of the predicted polarizabilities. In these circumstances, one typically needs to estimate the error made on the predicted properties by making use of some a priori probabilistic criteria. In the particular case of GPR, the expected error associated with a testing structure ${ \mathcal A }$ , can be computed as $\varepsilon ({ \mathcal A })=k({ \mathcal A },{ \mathcal A })-{\sum }_{{IJ}}k({ \mathcal A },I){k}^{-1}(I,J)k(J,{ \mathcal A })$ , where I and J run over the training structures. As detailed in [40], this strategy is however not very practical because of its computational expense, so that other kind of methods such as bootstrapping or subsampling can rather be used to estimate the prediction errors. In addition, in the particular case of the present work one would like to propagate the error that occurs in the prediction of ${\boldsymbol{\alpha }}$ to the Raman spectrum. While this error propagation would be difficult to carry out on top of the GPR intrinsic covariance, it comes naturally from a bootstrapping or subsampling scheme.

In this work, ${N}_{\mathrm{RS}}$ subselections of the training dataset are considered to generate an ensemble of predictions for the polarizability. From these, ${N}_{\mathrm{RS}}$ Raman spectra are computed by Fourier transforming the time series of each model in the ensemble. Finally, the average and the STD of the predicted spectra over the m subselections give the final Raman spectrum prediction and the propagated estimated error respectively. The downside of this approach is that this model works under the assumption that the training data corresponds to independent identically distributed samples. This is of course not true in general, so that one needs to correct the model to take into account for the underlying correlations. Following [40], a maximum likelihood recipe can be adopted to linearly scale the variance of the predictions by a constant factor ν². The calibration of this scaling factor is carried out by computing the actual prediction errors of the polarizabilites over a suitably selected validation set ${N}_{\mathrm{val}}$ , for which the reference polarizabilities are known, and then considering

$\begin{eqnarray}&&{\nu }^{2}=\displaystyle \frac{1}{{N}_{\mathrm{val}}}\displaystyle \sum _{j=1}^{{N}_{\mathrm{val}}}\displaystyle \frac{| | {{\boldsymbol{\alpha }}}_{\mathrm{pred}}(j)-{{\boldsymbol{\alpha }}}_{\mathrm{ref}}(j)| {| }^{2}}{{\sigma }^{2}(j)},\end{eqnarray} \tag{ 10 }$

where σ²(j) are the variances of the predicted polarizabilities. Once the value of ν has been determined, each polarizability prediction of a given training model k can be updated by considering

$\begin{eqnarray}&&{{\boldsymbol{\alpha }}}_{k}^{{\prime} }=\bar{{\boldsymbol{\alpha }}}+\nu ({{\boldsymbol{\alpha }}}_{k}-\bar{{\boldsymbol{\alpha }}}),\end{eqnarray} \tag{ 11 }$

where $\bar{{\boldsymbol{\alpha }}}$ is the predicted polarizability averaged over the ${N}_{\mathrm{RS}}$ models. This scaling procedure guarantees that the variance of the models is consistent with the outcome of the likelihood maximization. By computing the Raman spectrum for each scaled model k, the propagated uncertainty estimation associated with the spectra will automatically take into account the calibration of the variance.

2.5. Simulation and training details

To perform the ab initio calculations, we used the FHI-aims [41] program package with light settings for all atomic species. We obtained aiMD trajectories using the PBE functional with many-body dispersion corrections [42, 43], and employing a time step of 0.5 fs. The electronic polarizabily/susceptibility was instead computed every 1 fs. Most of the data used here was already available from [22, 23]. For each system, we had 20 picoseconds of simulation in the NVT ensemble at 300 K. From this trajectory, a few thousands of configurations were selected to define the training and test set of our model. A full trajectory of 15 ps in the NVE ensemble was instead considered as our validation set to assess the quality of our predictions, by comparing the predicted Raman spectra to the ones obtained with ab initio polarizabilities. We chose to train our model on the NVT ensemble and predict on the NVE ensemble. This choice is justified by the fact that—particularly for this relatively short trajectory—canonical sampling should be more ergodic than microcanonical sampling, and therefore sample a larger portion of configurational space.

3. Results

In the following, we apply the previously described methods to the calculation of the Raman spectra of Paracetamol. Each time we mention GPR, it is used in combination with the AD representation. Similarly, each time SA-GPR is mentioned, it is associated to λ-SOAP kernels.

3.1. Paracetamol molecule

We first consider the case of the isolated Paracetamol monomer (figure 1(a)). We constructed the training dataset by selecting 2000 structures with farthest point sampling using the scalar SOAP metric from the full NVT trajectory. For GPR, the three-dimensional density field was constructed within a box of $6\times 4\times 2.5\,\mathring{\rm A}$ , where the molecule had its longest axis along the x direction and the equilibrium geometry lied on the xy plane, and the grid spacing was dr = 0.5 Å. The GPR was computed using σ = 10 and η = 10⁻³. Details about the optimization of the hyperparameters are given in the supplemental information. Regression performances are reported in figure 3(a), where the error , given by equation (9), computed on 500 randomly selected test structures (that are excluded from training) out of the total of 2000 is shown as a function of the number of training monomers⁶ .

Figure 3(a) shows that the learning capability of the model does not reach saturation within the training set sizes explored. The learning of all the polarizability components follow a similar slope, but they are predicted with different accuracy because of the strong anisotropy of the dielectric response of the Paracetamol molecule. Because of the π-conjugation of the system in the molecular plane, the system is much more polarizable along the x-axis rather than along other directions in this particular alignment, making it harder for the learning algorithm to capture the corresponding variations across the dataset. The α_xx component presents the largest error, going from about 40% with 300 training points to 17% with 1500 training points. The best learning performance is instead obtained for the α_yz component, where the prediction error can be brought down to about 6%.

In order to validate the model, we show in figure 4 the correlation between predicted and computed polarizability components on the validation set composed by the full independent NVE trajectory, for a representative training size of 900 molecular configurations, including for each component. Although we observe a worsening of the predictions when compared to the previous case, where we trained and predicted on the same ensemble (and trajectory), the predicted polarizabilities are still well correlated to the reference values. The remaining question is how these errors translate to the actual prediction of the vibrational Raman spectrum.

epsilon — **Figure 4.** GPR polarizability tensor components versus DFPT *ab initio* ones. The components were trained on 900 configurations coming from an NVT trajectory. The test set contains 20 000 configurations coming from an NVE trajectory. Numbers in brackets indicate for a given component.
Download figure:
Standard image High-resolution image

In figure 5, we show a machine-learned Raman spectrum averaged over 16 subselections of the training set of 900 configurations each, along with its STD (see section 2.4 for a detailed discussion about this procedure), and compare it to the one calculated from fully ab initio data. We find that the estimated variance has to be scaled by a factor of ν² = 2.0. Despite a relatively small amount of training points, the agreement with the reference spectrum is excellent in the entire frequency range. As shown in figure 3, increasing the number of training points in the model would decrease errors even further. From figures 3 and 5, it is clear that the error we make on the polarizability components does not translate directly into an error of similar magnitude on the spectrum. This is a consequence of the fact that the Raman intensities depend on the derivatives of the polarizability components with respect to atomic coordinates, and not on their absolute value. This simple procedure is able to reproduce almost perfectly a reference Raman spectrum with fewer than 1000 training points on a desktop computer in just a few minutes.

In order to assess whether the use of a model that incorporates symmetries can be advantageous even for a relatively rigid molecular system, we then contrast the GPR and the SA-GPR model. To this end, λ-SOAP kernels were constructed using a Gaussian width of 0.3 Å and an environment cutoff of 4.0 Å. Details about the SOAP parameters optimization can be found in the SI. The corresponding learning curve is shown in figure 3(b). The improvement over a standard GPR scheme is systematic at any training set size, underlining the importance of automatically incorporating the ${ \mathcal O }$ (3)-covariance of the tensor at the scale of individual atomic environments. Similarly to the case of GPR, the accuracy of predictions differs between tensor components. With 300 training points, the α_xx component presents the largest error (about 14%), that reduces to only 6% with 1500 training structures. The best learning performance is again obtained for the α_yz component, for which the prediction error can be remarkably reduced to less than 2%. As shown in figure 5, SA-GPR reproduces very accurately also the ab initio Raman spectrum with only 300 training points. Obviously, as reflected by the learning curves, increasing the amount of points reduces this error even further, as exemplified in figure S4 of the SI, where the machine-learned and ab initio spectra are virtually indistinguishable.

3.2. Paracetamol crystal

We now turn our attention to the first crystalline form of Paracetamol, containing four individual Paracetamol molecules per unit cell, as shown in figure 1(b).

3.2.1. Direct approach

Since we are now dealing with a periodic system, we first build a supercell corresponding to the appropriate crystal structure. Then, the AD representation is constructed following the same procedure discussed in the previous section. A value of σ = 40 has been selected to build the Gaussian kernel, and the regularization parameter has been set to η = 10⁻⁴ by CV optimization. The three-dimensional density field was evaluated within a box of 12 × 14 × 20 Å³, using a grid spacing of $d{\bf{r}}=0.75$ Å⁷ . For SA-GPR, the λ-SOAP kernels were constructed using the same parameters as before.

The training set is built by considering a random selection of 2500 configurations extracted from a NVT trajectory. A full NVE trajectory is once again considered to test the quality of the predicted Raman spectrum. Learning curves for both regression models are shown in figure 6 (solid lines). When comparing the two methods, we always use the same configurations in both cases.

**Figure 6.** Learning curves for the crystal polarizability tensor on an NVT trajectory, using different approaches. Here the mean error over all components is represented. Including molecular polarizability greatly improves the model, both with GPR and SA-GPR, especially with few training data.
Download figure:
Standard image High-resolution image

We observe that both the learning capability of GPR and SA-GPR do not reach saturation when increasing the number of training data, going from 81% (respectively 73% with SA-GPR) of error with 25 training points to 17% (respectively 11%) with 2000 of them; again, making use of a kernel that is built on a SA comparison between local environments brings a substantial improvement. Overall, however, for the same amount of training points, the errors are much (typically between two and three times) larger than for the monomer case.

3.2.2. Incorporating molecular polarizability

To improve our results, we refined our models by using the predictions of the non-interacting monomers that compose the molecular crystal, as we explain in the following.

Suppose we have a molecular crystal made of ${N}_{\mathrm{mol}}$ molecular units (in the case of Paracetamol I, ${N}_{\mathrm{mol}}=4$ , while ${N}_{\mathrm{mol}}=8$ for Paracetamol II). Since we have already learnt the polarizability tensors of the individual molecules, we can grasp most of the variability of the polarizability tensor of the crystal by summing up the predictions for individual monomers. Equation (3) is modified according to

$\begin{eqnarray}&&{\alpha }_{\gamma \delta }^{\mathrm{ML}}({ \mathcal A })={\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai},\mathrm{crys}}+{\alpha }_{\gamma \delta }^{{\rm{\Sigma }}}({ \mathcal A })-{\bar{\alpha }}_{\gamma \delta }^{{\rm{\Sigma }}}+\displaystyle \sum _{j=1}^{N}{w}_{j}k({ \mathcal A },{{ \mathcal A }}_{j}),\end{eqnarray} \tag{ 12 }$

where ${{\boldsymbol{\alpha }}}^{{\rm{\Sigma }}}$ denotes the sum of the molecular polarizability tensors and ${\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai},\mathrm{crys}}$ is the average of the ab initio polarizability tensors of the full crystal over the training set. Specific details about the procedure are explained in the appendix. An analogous expression to equation (12) is obtained for SA-GPR.

Figure 6 shows the advantage of using the molecular baseline for the regression models (dashed lines). The improvement is most noticeable for models based on few training points. Molecular baselining leads to a large decrease in error of about 25% for both GPR and SA-GPR. Upon increasing the number of training points, the difference diminishes, but an improvement remains visible. It is worth noticing that, starting from 250 training structures, the direct application of SA-GPR (without any molecular baseline) performs better than the GPR scheme with the baseline. Note that the prediction accuracy can be improved even further if one scales the molecular polarizability tensors so that their average matches that of the full crystal, as illustrated in the SI, figure S8.

Figure 7 shows the effect of the baselining procedure on the predicted SA-GPR Raman spectrum, when one increases the amount of training points. Several observations can be made. First, just like for the monomer, high frequencies require more training points to be reproduced. Second, including ${{\boldsymbol{\alpha }}}^{{\rm{\Sigma }}}$ greatly enhances the spectrum intensity accuracy when few training points are used. This is especially true at high frequencies, where the improved model already gives the right structure of the peaks, albeit not with the right intensity, while the direct learning does not show any peaks in this region. Overall, the predicted spectrum is extremely well reproduced when employing enough training points.

Figure 8 shows the predicted Raman spectrum and the corresponding estimated error. In this case, different learning models have been first defined by considering 16 random subselections, each of them made of 80% of the training dataset. Then, for each of these learning models, the polarizabilities of the full NVE trajectory have been predicted and the associated Raman spectra have been computed. We then estimated the STD of these predictions according to the procedure detailed in section 2.4. We find that in this case the estimated variance has to be increased by roughly an order of magnitude, i.e. ν² = 10.9. One can observe that the excellent agreement between the reference and predicted spectrum at low frequencies is consistent with a negligible estimated error, while larger discrepancies and error bars can be observed in the high-frequency domain.

Results this far show that, even for a relatively rigid molecular system, incorporating symmetries and learning all the components of the polarizability tensor in a covariant fashion can improve the accuracy and the efficiency of these ML schemes. As anticipated, an additional advantage of SA-GPR is that it is based on local atomic environments. This results in a greater transferability, as we will discuss in the following section.

4. Extrapolation on other polymorphic forms

Within the λ-SOAP formalism, the polarizability of the system is effectively decomposed in local atom-centered contributions that are summed in order to obtain the final predicted value of ${\boldsymbol{\alpha }}$ , making it possible to model the susceptibility of a molecule or a crystal through the definition of effective atomic polarizabilities. This implies that the information is learned at the local scale and, as such, can be transferred across systems that share a similar chemical nature. In the case of paracetamol polymorphs, one can think of predicting the polarizability of the form II crystal (figure 1(c)) with the model trained on form I only. Since different polymorphic forms are mainly distinguished by the different intermolecular interactions, major difficulties in this extrapolation procedure are expected to be associated with the low-frequency (intermolecular) modes of the molecular crystal. To put this idea to the test, we used the SA-GPR trained on 2000 structures of form I to make predictions for a NVE trajectory of form II. We used an ensemble of 16 subsampled models to estimate uncertainty. As expected, we observed that the rather large error in the prediction of the polarizability tensor is mostly associated to a small offset in the time series of some of the polarizability components (detailed in the SI, figure S10), that however does not have a substantial impact on the Raman spectrum.

As shown in figure 9, the general lineshape is excellent, and all the main features of the ab initio spectrum are reproduced, even though few discrepancies in terms of relative intensities can be observed, and the error in the intensities is overall higher than for the direct prediction of the first polymorph. We underline the difference in behavior in contrast to figure 8: now, high frequencies are better described and errors are more pronounced at low frequencies. This suggests that the model can reproduce accurately changes in the polarizability associated with intra-molecular vibrations, but is less accurate in predicting low-frequency components that are specific to the molecular packing of form II, which is not represented in the training set for form I. We also observe that the discrepancy between predictions and ab initio spectrum is reflected accurately in the estimated uncertainty, that can therefore be used as a reliable measure of the accuracy of the model also in an extrapolative regime.

5. Conclusions

In this work, we proposed GPR models to predict vibrational Raman spectra, based on learning polarizability and susceptibility tensors obtained from density-functional perturbation theory. As an example, we applied our methodology to predict anharmonic Raman spectra of the Paracetamol molecule and two polymorphs of the Paracetamol crystal. The methodology also works for simpler harmonic Raman intensities. The use of an ensemble of models to estimate the uncertainty in the polarizability tensors allows us to propagate the error estimation from the ML prediction of the polarizability tensors to the vibrational spectra, by generating an ensemble of spectra out of which it is simple to compute frequency-dependent confidence intervals.

We showed that for the molecule a standard GPR scheme that takes as input a nuclear density representation on a 3D grid works extremely well and enables one to reproduce Raman spectra almost perfectly with a low number of training points. For the crystal, such a scheme, albeit possible, is more difficult to apply for several reasons: the difficulty to compare crystal structures with different unit cell sizes, the redundancy of information contained in a fixed grid-based representation, and the increase of the number of grid points with system size. We have shown that a more effective solution is to use a SA-GPR scheme, used here in combination with the λ-SOAP representation [30]. Such a scheme yields accurate predictions for molecules and crystals, due to its capability to better capture the local structural information in a covariant fashion. Moreover, since λ-SOAP is a local representation, it is easy to treat larger systems sizes and even transfer models to other polymorphs. We have shown this transferability by successfully predicting the Raman spectrum of Paracetamol II with a model trained only on Paracetamol form I. This suggests the possibility of predicting Raman spectra of any polymorphic form, as long as a model trained for one of them is available. In addition, for all models presented, we observe a considerable improvement when using previously-trained GPR models for the molecular units as a baseline for the crystal prediction, thus reducing the amount of more costly condensed-phase calculations that must be performed to train the bulk model. In a similar manner, it is also straightforward to extend this framework to other ensembles (e.g. NPT) or path-integral molecular dynamics simulations, which include the quantum nature of the nuclei.

The models we presented regard the electronic electric-field response properties, and can be extended to dipoles and higher-order responses. They can thus be seamlessly combined with empirical potentials or other machine-learned potentials that give access to forces. This presents an alternative route to including the training of such quantities directly into these potentials [44], which can present a higher level of complexity. Finally, we remark that even though we applied our framework to polarizabilities and Raman spectroscopy, applying it to any other kind of spectroscopy, like infrared, sum-frequency generation, etc., would be straightforward, as long as reference electronic-structure data is available.

Acknowledgments

The authors acknowledge funding from the Max Planck-EPFL center. MC acknowledges support from the European Research Council (Horizon 2020 Grant Agreement No. 677013-HBMAP).

: Appendix. Molecular baselining with GPR

When building a model to predict the polarizability tensor of the crystal from its individual molecular components within the GPR and AD representation scheme, one should consider that each molecule is oriented in a specific direction that differs from the one we trained the molecule on. In order for the regression model to recognize the orientation at hand, we first need to find the relationship between the orientation of the molecules in the crystal and the reference one used in the molecular GPR procedure described in section 2.2. The scheme is depicted in figure A1.

**Figure A1.** Schematic explanation of the molecular baselining process for the standard GPR approach. An individual molecule of the crystal is first rotated by a rotation matrix R to match the alignment of a reference structure, on top of which we calculate the polarizability with GPR using equation (3), and we then rotate back. We repeat this process for every molecule in the unit cell, and sum the resulting polarizability tensors to obtain α^Σ.
Download figure:
Standard image High-resolution image

Each geometry G_im corresponding to the ith molecule in the crystal of the mth structure of the set is thus rotated by a rotation matrix R_im as

$\begin{eqnarray}&&{G}_{{im}}^{\mathrm{ref}}={R}_{{im}}{G}_{{im}},\end{eqnarray} \tag{ A.1 }$

where $i=1\cdots {N}_{\mathrm{mol}}$ , with ${N}_{\mathrm{mol}}$ the number of molecular blocks per unit cell. Finally, once a molecular polarizability ${{\boldsymbol{\alpha }}}_{i}^{\mathrm{ref},\mathrm{mol}}$ is predicted, we rotate the tensor back to its original orientation inside the crystal, i.e.

$\begin{eqnarray}&&{{\boldsymbol{\alpha }}}_{i,m}^{\mathrm{mol}}={R}_{{im}}^{-1}{{\boldsymbol{\alpha }}}_{i,m}^{\mathrm{ref},\mathrm{mol}}{R}_{{im}}.\end{eqnarray} \tag{ A.2 }$

Having defined the sum of molecular polarizabilities ${{\boldsymbol{\alpha }}}^{{\rm{\Sigma }}}$ as

$\begin{eqnarray}&&{{\boldsymbol{\alpha }}}^{{\rm{\Sigma }}}=\displaystyle \sum _{i=1}^{{N}_{\mathrm{mol}}}{{\boldsymbol{\alpha }}}_{i}^{\mathrm{mol}},\end{eqnarray} \tag{ A.3 }$

we consider the regression target

$\begin{eqnarray}&&{\rm{\Delta }}{\alpha }_{\gamma \delta }({ \mathcal A })={\alpha }_{\gamma \delta }^{\mathrm{crys}}({ \mathcal A })-{\bar{\alpha }}_{\gamma \delta }^{\mathrm{crys}}-({\alpha }_{\gamma \delta }^{{\rm{\Sigma }}}({ \mathcal A })-{\bar{\alpha }}_{\gamma \delta }^{{\rm{\Sigma }}}),\end{eqnarray} \tag{ A.4 }$

where, once again, the bar denotes the average over the training set. Finally, we modify accordingly equation (3), which becomes (see also equation (12))

$\begin{eqnarray}&&{\alpha }_{\gamma \delta }^{\mathrm{ML}}({ \mathcal A })={\bar{\alpha }}_{\gamma \delta }^{\mathrm{ai},\mathrm{crys}}+{\alpha }_{\gamma \delta }^{{\rm{\Sigma }}}({ \mathcal A })-{\bar{\alpha }}_{\gamma \delta }^{{\rm{\Sigma }}}+\displaystyle \sum _{j=1}^{N}{w}_{j}k({ \mathcal A },{{ \mathcal A }}_{j}).\end{eqnarray} \tag{ A.5 }$

Note that the weights in the previous equation are different than the ones in equation (3), as this time the regression target Δα contains the molecular polarizability tensors.

An analogous expression is obtained for SA-GPR, but the rotation and alignments previously described do not need to be carried out explicitly since the rotational covariance of the tensor is built in the structure of the method.

Using Gaussian process regression to simulate the vibrational Raman spectra of molecular crystals

Article metrics

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Abstract

1. Introduction