Atomic force microscopy simulations for CO-functionalized tips with deep learning

Atomic force microscopy (AFM) operating in the frequency modulation mode with a metal tip functionalized with a CO molecule is able to image the internal structure of molecules with unprecedented resolution. The interpretation of these images is often difficult, which makes the support of theoretical simulations important. Current simulation methods, particularly the most accurate ones, require expertise and resources to perform the ab initio calculations that provide the necessary inputs (i.e., the charge density and electrostatic potential of the molecule). Here, we propose a fast and computationally inexpensive alternative to the physical simulation of these AFM images, based on a conditional generative adversarial network (CGAN), that avoids all force calculations and uses as its only input a 2D ball-and-stick depiction of the molecule. We discuss the performance of the model when trained with different subsets extracted from the previously published QUAM-AFM database. Our CGAN accurately reproduces the intramolecular contrast observed in the simulated images for quasi-planar molecules, but has limitations for molecules with a substantial internal corrugation, due to the strictly 2D character of the input.


Introduction
Atomic Force Microscopy (AFM) (1) operating in one of the dynamic modes (2,3) is one of the most powerful techniques for imaging and manipulating materials at the nanoscale. Combining the atomic resolution provided by the frequency modulation mode with the functionalization of the metal tip with a CO molecule, FM-AFM is able to produce a striking view of the internal structure of molecules (4). We refer to this combination as High Resolution Atomic Force Microscopy (HR-AFM). This development, together with the ability of AFM to address individual molecules (4,5), has opened up new research avenues, including the individual discrimination of hundreds of molecules in complex mixtures (6), the determination of the bond order for each of the bonds in large polycyclic aromatic hydrocarbons (7), the visualization of atomic-scale charge distributions (8,9), and the characterization of the reaction process and the intermediate products in on-surface chemical reactions (10,11,12,13).
Understanding the contrast in AFM images is a major challenge. The frequency shift ∆f measured in the experiments depends on the complex interplay between the total force acting on the tip during the whole oscillation cycle (14) and different operational parameters, such as the tip-sample distance, the oscillation amplitude of the cantilever, the cantilever stiffness, and the details of the attachment of the CO molecule to the metal tip, which result in different values of the CO tilting stiffness. A combination of experiments and theoretical simulations has paved the way for the identification of the key factors controlling ∆f. Earlier work showed that the AFM contrast of a CO-terminated tip is mainly controlled by the Pauli repulsion between the lone pair of the oxygen atom in the CO molecule and the charge density of the sample (4,15). This main contribution is modulated by the interaction with the sample's electrostatic field (16,17,18), and the effect of both force components is enhanced by the probe tilting (7,19,20). The complex interplay of these interactions, particularly in the case of molecular systems (where they depend on the structure, chemical composition and internal torsion), makes the interpretation of the experimental features highly non-trivial. AFM simulation models with different complexity and accuracy (21,19,22,23,24,18,25,20) have been developed to compute theoretical AFM images using the geometry of the molecule as input. They have made it possible to fully understand not only the intramolecular contrast (21,19,22,23,24) but also the imaging of intermolecular features in hydrogen- and halogen-bonded systems (20,26,27).
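To make the dependence of ∆f on the force over the whole oscillation cycle concrete, the sketch below evaluates the standard first-order FM-AFM expression, ∆f = −(f0/(2π k A)) ∫ F(d + A(1 + cos θ)) cos θ dθ over one cycle. This expression is not quoted in the text; it is the usual textbook form and is used here as an assumption. The function name `frequency_shift`, the linear test force, and the numerical values (f0 = 30 kHz, k = 1800 N/m) are purely illustrative.

```python
import numpy as np

def frequency_shift(force, d, amplitude, f0, k, n=2000):
    """First-order frequency shift for a tip oscillating between d and d + 2*amplitude.

    force     : callable F(z), vertical tip-sample force (N) at height z (m)
    d         : closest tip-sample distance (m)
    amplitude : oscillation amplitude A (m)
    f0, k     : free resonance frequency (Hz) and cantilever stiffness (N/m)
    """
    # Midpoint rule over one oscillation cycle (the integrand is periodic in theta).
    theta = (np.arange(n) + 0.5) * 2.0 * np.pi / n
    z = d + amplitude * (1.0 + np.cos(theta))
    # Cycle average of F * cos(theta), i.e. (1 / 2 pi) * integral over theta.
    weighted = np.mean(force(z) * np.cos(theta))
    return -f0 / (k * amplitude) * weighted

# Hypothetical check: for a linear force F = c * z the result must reduce to
# the small-amplitude limit  -f0 / (2 k) * dF/dz, independent of the amplitude.
c = 2.0  # N/m, illustrative force gradient
df = frequency_shift(lambda z: c * z, d=3e-10, amplitude=5e-11, f0=30000.0, k=1800.0)
print(df, -30000.0 * c / (2 * 1800.0))
```

Because the force is weighted over the full cycle, larger amplitudes mix contributions from a wider range of heights into each pixel of the image.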
While some of these simulation methods are extremely fast, the most accurate ones, which retain a precision similar to density functional theory (DFT) in the determination of the tip-sample forces, require at least a calculation of the charge densities of the tip and sample and of the electrostatic potential of the sample. This can be done with standard DFT codes, but it is time-consuming and requires theoretical skills that are not always available within experimental groups.
Databases of theoretical AFM images calculated with accurate simulation methods are an alternative for the interpretation of experimental images. We have recently introduced QUAM-AFM (28,29), a data set generated from 686K organic molecules, whose geometries were downloaded from the PubChem repository (30,31). The selected molecules contain the four basic elements of organic chemistry (carbon, hydrogen, nitrogen and oxygen) plus some less common elements that are still frequent in organic compounds, such as sulfur, phosphorus and the halogens (fluorine, chlorine, bromine and iodine), and include the most relevant structures and chemical moieties. In order to support a broad range of experimental conditions, QUAM-AFM contains 165M AFM images in total, simulated for those molecules using 240 different combinations of operational parameters (tip-sample distance, oscillation amplitude, and CO tilting stiffness). They represent a computational effort that exceeds 2.5 million hours.
Given the comprehensive variety of molecular structures provided by QUAM-AFM, it is likely that AFM images for a molecule similar to the one considered in a hypothetical experiment are available, but identifying this related compound would not be easy, even with the search options provided within the data set. Furthermore, it is impossible to account for the factorial growth of possible structures compatible with the combination rules of organic chemistry.
In this paper, we introduce a new, extremely fast, and easy-to-use approach to simulate AFM images based on a deep learning model that avoids all force calculations and uses as its only input a 2D ball-and-stick depiction of the molecule, where balls of different colors and radii represent the atomic species, and sticks correspond to the covalent bonds. In particular, we have used a CGAN (32), a neural network model that maps each input image to an output image. Our CGAN has been designed and trained to take as input the 2D ball-and-stick image of a molecule and to produce as output a set of 10 constant-height AFM images at different tip-sample distances. As explained below, training, validation and testing have been carried out using information from QUAM-AFM, which provides both the ball-and-stick depiction and 3D stacks of constant-height HR-AFM images simulated under different operating conditions for each of the molecules. The results have been analyzed by visually comparing the target images (from QUAM-AFM) for molecules not included in the training set with those generated by our CGAN. This analysis demonstrates that the model provides an efficient and simple alternative to simulate HR-AFM images for a relevant range of tip-sample distances, identifies the best data for the training, and highlights the limitations of the model for molecules with a significant internal torsion, due to the strictly 2D character of the input.
CGANs have been used in a wide variety of applications, ranging from the medical field, with the detection of covid-19 (33) or brain tumors (34) from the results of different medical imaging techniques, to common deep learning tasks such as synthetic data generation (35,36), image denoising (37) or person re-identification (38). Deep learning has already been used in the AFM field. Convolutional neural networks (CNNs) have been employed for the determination of molecular geometries (39) and the prediction of electrostatic fields (40) from HR-AFM images, while graph neural networks (GNNs) have been applied to extract molecular graphs (41). The combination of AFM imaging with Bayesian inference and DFT calculations has been used to determine the adsorption configurations for a known molecule (42). In previous work, we have demonstrated that it is possible to achieve a complete chemical identification of the structure and composition of a molecule from a 3D stack of constant-height HR-AFM images using two different approaches: (i) a Multimodal Recurrent Neural Network (M-RNN), which produces as output the IUPAC name of the molecule (43), and (ii) a CGAN, which provides a 2D ball-and-stick model of the imaged molecule (44). To the best of our knowledge, deep learning has not been used before to perform theoretical AFM simulations.
The rest of the paper is organized as follows. Section 2 describes the structure and operation of the CGAN (section 2.1) and the training details (section 2.2). Results are presented in section 3, where the performance of our CGAN is analyzed for different training subsets, characterized by the maximum internal corrugation of the molecules considered during the training and by the value of the cantilever oscillation amplitude used for the simulation of the training images. We close with the conclusions and an outlook on possible future work.

Our CGAN for AFM simulations
A CGAN includes two sub-networks, known as generator and discriminator (see fig. 1). Before describing the detailed block structure of the two networks, we focus on explaining how they operate during the training process. The ball-and-stick depictions and 3D stacks of HR-AFM images contained in a subset of QUAM-AFM are used as input and target images during the training (see section 2.2 for details on the training hyperparameters).
The generator has a U-Net structure, where both the encoder and decoder are convolutional networks (see fig. 1). In our implementation, each ball-and-stick molecular depiction feeds the encoder, which transforms the input into a compressed representation of the molecule that is subsequently reconstructed by the decoder as a stack of 10 constant-height HR-AFM images for different tip-sample distances. Once the AFM images are generated, the discriminator network assesses the quality of the simulation produced by the generator network. Either the generator prediction or the real stack of simulated HR-AFM images, together with the ball-and-stick depiction, is fed in turn to the discriminator, which compares patches of the two inputs (the 3D stack and the molecular depiction) to find out whether the HR-AFM images are the real ones or have been produced by the generator network. The overall loss function of the CGAN (32), which quantifies its performance during the training, depends simultaneously on the predictions made by the generator (comparing the predicted and real images using an L1 norm) and on those made by the discriminator, which determines whether the images are predicted or real. In this way, both generator and discriminator try to minimize their losses in a confrontation in which the success of one network forces the other one to improve its predictions. This zero-sum game is a key factor in the performance of CGANs. Once the training is completed, the discriminator is discarded and the generator is ready to be used to predict the HR-AFM images for any molecule from its ball-and-stick depiction.
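For reference, the objective of Isola et al. (32) that underlies this description can be written as follows. The notation is ours and is not taken from the text: $x$ stands for the ball-and-stick depiction, $y$ for the target stack of HR-AFM images, $G$ and $D$ for the generator and discriminator, and $\lambda$ for the weight of the L1 term, whose value is given with the training details.

```latex
\begin{align}
\mathcal{L}_{\mathrm{cGAN}}(G,D) &= \mathbb{E}_{x,y}\left[\log D(x,y)\right]
  + \mathbb{E}_{x}\left[\log\left(1 - D\left(x,G(x)\right)\right)\right],\\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y}\left[\lVert y - G(x)\rVert_{1}\right],\\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{\mathrm{cGAN}}(G,D) + \lambda\,\mathcal{L}_{L1}(G).
\end{align}
```

The adversarial term rewards sharp, realistic-looking stacks, while the L1 term ties each prediction to the simulated target for the same molecule.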
The block structure of the generator and discriminator is also displayed in fig. 1. Both the generator encoder and discriminator blocks consist of a convolutional layer, followed sequentially by batch normalization and a Leaky ReLU (LReLU) activation. The generator decoder blocks follow a similar scheme, whereby convolution is replaced by transposed convolution and ReLU is used as the activation function, except in the last block, which is activated with a hyperbolic tangent function. Note that each encoder output feeds the next encoder block and, in addition, the decoder block with the same image size (see fig. 1). A detailed description of each of the blocks for the two networks can be found in appendix A.
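The pairing between encoder outputs and decoder blocks of the same image size can be illustrated with a small bookkeeping sketch of the spatial sizes. The numbers used here are assumptions, not values taken from the text: an input of 256 × 256 pixels, 8 encoder blocks, and kernels of size 4 with stride 2 and padding 1 in every block. The function names are hypothetical as well.

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Output size of a strided convolution along one spatial axis."""
    return (size + 2 * pad - kernel) // stride + 1

def tconv_out(size, kernel=4, stride=2, pad=1):
    """Output size of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * pad + kernel

def unet_shapes(input_size=256, depth=8):
    """Return encoder sizes, decoder sizes, and the skip size paired with each decoder block."""
    encoder = []
    size = input_size
    for _ in range(depth):
        size = conv_out(size)
        encoder.append(size)

    decoder, skips = [], []
    for level in range(depth):
        size = tconv_out(size)
        decoder.append(size)
        # Each decoder block is paired with the encoder output of the same size;
        # the last block (the tanh output) has no skip connection in this sketch.
        skip_index = depth - 2 - level
        skips.append(encoder[skip_index] if skip_index >= 0 else None)
    return encoder, decoder, skips

encoder, decoder, skips = unet_shapes()
for dec_size, skip_size in zip(decoder, skips):
    print(f"decoder output {dec_size:3d}  <- skip from encoder output {skip_size}")
```

With these assumed settings each encoder block halves the image size and each decoder block doubles it, so every skip connection joins feature maps of identical spatial size.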

CGAN training
As mentioned above, we rely on information from the QUAM-AFM data set for the training of our CGAN. In particular, we use the ball-and-stick depictions contained in QUAM-AFM to feed the generator and the stack of 10 AFM images at different tip-sample distances as the target output to compare with the generator predictions. During the training, a zoom factor selected at random in the range [0.85, 1.15] is applied to each input-output pair. HR-AFM images depend significantly on the internal torsion of the molecule (i.e., the differences in height among its constituent atoms) and on operational AFM parameters, such as the cantilever oscillation amplitude, that can be controlled at will during the experiment. In our work, we have explored the influence of both factors on the performance of our model. To this end, we have chosen training subsets of QUAM-AFM that only include images of molecules whose maximum internal height difference lies in a given range and that have been simulated with a particular value of the oscillation amplitude. Other factors, such as the torsional stiffness of the CO molecule, which depends on the detailed attachment of the molecule to the metal tip, have been kept fixed (in this case, at a value of 0.4 N/m).
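The zoom augmentation described above can be sketched as follows. The essential point is that the same random factor is applied to the ball-and-stick depiction and to its AFM stack, so that the pair stays aligned. The interpolation scheme (nearest neighbour), the fill value for pixels that fall outside the original image, the array shapes, and the function names are all assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def zoom_about_center(image, factor, fill=0.0):
    """Nearest-neighbour zoom of an (H, W, C) array about its center, keeping the shape."""
    h, w = image.shape[:2]
    # For every output pixel, find the source pixel it maps back to.
    rows = np.round((np.arange(h) - (h - 1) / 2) / factor + (h - 1) / 2).astype(int)
    cols = np.round((np.arange(w) - (w - 1) / 2) / factor + (w - 1) / 2).astype(int)
    inside = ((rows >= 0) & (rows < h))[:, None] & ((cols >= 0) & (cols < w))[None, :]
    rows = np.clip(rows, 0, h - 1)
    cols = np.clip(cols, 0, w - 1)
    zoomed = image[rows[:, None], cols[None, :]]
    # Pixels that map outside the original image (zoom-out) get the fill value.
    return np.where(inside[..., None], zoomed, fill)

def augment_pair(depiction, afm_stack, rng, low=0.85, high=1.15):
    """Apply one shared random zoom factor to the input and to its target stack."""
    factor = rng.uniform(low, high)
    return zoom_about_center(depiction, factor), zoom_about_center(afm_stack, factor), factor

rng = np.random.default_rng(0)
depiction = rng.random((64, 64, 3))    # hypothetical RGB ball-and-stick image
afm_stack = rng.random((64, 64, 10))   # hypothetical stack of 10 constant-height images
zoomed_depiction, zoomed_stack, factor = augment_pair(depiction, afm_stack, rng)
print(zoomed_depiction.shape, zoomed_stack.shape, round(factor, 3))
```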
The different subsets used for the training are described together with the performance of the resulting models in section 3. All the trainings were performed with the same hyperparameter selection (epochs, batches and batch size) and were optimized in the same way. In particular, an L1 norm was used for the generator (compiled as the Mean Absolute Error (MAE), using the parameter λ = 100 defined by Isola et al. (32)), while the binary cross entropy was used as the loss function for the discriminator. The model was optimized with batches of 32 inputs using the Adaptive Moment Estimation (Adam) optimiser, with the learning rate and the first-moment parameter set to 2 × 10⁻⁴ and 0.5, respectively. The model was trained for 100K iterations, displaying 300 predictions of the validation set every 10,000 iterations to estimate the optimal weights.
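A minimal NumPy sketch of the two loss terms described above is given below. It assumes that the adversarial part of the generator loss is a binary cross entropy against the label "real", as in common implementations of Isola et al. (32); the text only states that an L1 (MAE) term with λ = 100 is used for the generator and a binary cross entropy for the discriminator. All function names are hypothetical.

```python
import numpy as np

LAMBDA = 100.0  # weight of the L1 term quoted in the text

def binary_cross_entropy(prob, label, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and a constant label."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))))

def generator_loss(d_on_fake, fake_stack, real_stack, lam=LAMBDA):
    """Adversarial term (fool the discriminator) plus lambda times the MAE to the target."""
    adversarial = binary_cross_entropy(d_on_fake, 1.0)
    mae = float(np.mean(np.abs(real_stack - fake_stack)))
    return adversarial + lam * mae

def discriminator_loss(d_on_real, d_on_fake):
    """Real stacks should be scored as 1 and generated stacks as 0."""
    return binary_cross_entropy(d_on_real, 1.0) + binary_cross_entropy(d_on_fake, 0.0)

# Hypothetical patch-wise discriminator outputs and image stacks.
d_on_real = np.full((30, 30), 0.5)
d_on_fake = np.full((30, 30), 0.5)
real_stack = np.zeros((64, 64, 10))
fake_stack = np.full((64, 64, 10), 0.1)
print(generator_loss(d_on_fake, fake_stack, real_stack))
print(discriminator_loss(d_on_real, d_on_fake))
```

With λ = 100 the L1 term dominates unless the prediction is already close to the target, which is why the generator is pulled strongly toward the simulated images.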

Results and Discussion
As mentioned above, the performance of our CGAN has been analyzed for different training subsets. Each subset only includes images for molecules with internal corrugation below a given value and that have been simulated with a certain value of the cantilever oscillation amplitude.
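The selection of a training subset by corrugation can be sketched as below, taking the corrugation of a molecule as the difference between the largest and smallest atomic height. The dictionary of molecules, their heights, and the function names are invented for illustration; whether all atoms (including hydrogens) enter the definition used for QUAM-AFM is not stated here and is treated as an assumption.

```python
import numpy as np

def corrugation(z_coords):
    """Maximum internal height difference (in Å) among the atoms of a molecule."""
    z = np.asarray(z_coords, dtype=float)
    return float(z.max() - z.min())

def select_subset(molecules, max_corrugation):
    """Names of the molecules whose corrugation is below the given threshold (in Å)."""
    return [name for name, z in molecules.items() if corrugation(z) < max_corrugation]

# Hypothetical atomic heights (Å) for three molecules.
molecules = {
    "planar": [0.00, 0.10, 0.05],
    "twisted": [0.00, 0.90, 0.30],
    "bulky": [0.00, 2.00],
}
print(select_subset(molecules, 0.5))  # quasi-planar subset
print(select_subset(molecules, 1.1))  # subset extended to larger corrugation
```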
Once trained, we tested the model with a total of 1,000 structures that were randomly selected from the test set. The test was conducted by visually comparing the predictions from the trained generator network with the ground truth (the real simulated HR-AFM images). […] process seeks a balance between the sharpness of the output image and the objects in it. The generator must not only fool the discriminator, but also approach the ground truth output in an L1 sense. A priori, the results are better with this combination. However, by using the L1 norm in the minimisation, we also introduce a blur filter, well known in deep learning (45,32). This contribution is responsible for the white patches on the grey background. Secondly, the AFM contrast depends on both the chemical species in the sample and their distance to the CO molecule at the tip. Since, during the training of this model, we have fed it with ball-and-stick depictions and HR-AFM images of molecules with a corrugation (height difference) below 0.5 Å, the range of distances explored to differentiate the molecule from its surroundings is limited.
We believe that, with this training, the model does not have enough information to predict the contrast change in the background accurately. Considering the limited information provided by a 2D ball-and-stick depiction for molecules with non-negligible internal torsion, we have explored the CGAN performance for these molecules, comparing the results produced by models optimized with two different training subsets: one includes molecules with maximum corrugations below 0.5 Å (the one also used in fig. 2), while the other extends the possible internal height difference to 1.1 Å. Figure 3 shows the predictions of these two models for the 9,11-diazapentacyclo[12.8.0.0 2,7 .0 8,13.0 15,20 […] result can be understood considering how the optimization process works. We are introducing an additional variable in the final result but we are not providing the corresponding input information. Therefore, the generator is unable to reproduce the output. This leads to an increase in the value of the loss function when calculating the L1 norm between the prediction and the ground truth. In turn, the information about the torsion is not provided to the discriminator either, so the generator learns to fool the discriminator very easily. This leads to a minimisation offset, as the loss function is not balanced.
Finally, we consider the effect of changing the oscillation amplitude used in the training images. Figure 4 shows the predictions for 3-(1,3-benzoxazol-2-yl)-N-(3-hydroxyphenyl)prop-2-enamide from models trained with images simulated with oscillation amplitudes of 0.4 Å and 1.0 Å (fig. 4c and e, respectively). These results are compared to the true simulations with the corresponding oscillation amplitudes (fig. 4b and d, respectively). Although there are small variations in the intramolecular contrast, the main difference between the simulations is in the image background, which changes significantly with the tip-sample distance when the larger oscillation amplitude is used. The predictions of the HR-AFM images in fig. 4c and e show that, while the intramolecular contrast is accurately reproduced in both cases, the model predicts the background better when images calculated with smaller oscillation amplitudes are used in the training.

Conclusions
We have shown that a CGAN provides an efficient and simple alternative to simulate HR-AFM images for a relevant range of tip-sample distances, using as its only input the 2D ball-and-stick depiction of the molecule. We have determined its performance for different training subsets defined in terms of two key parameters: the maximum internal torsion of the molecules included in the set, and the oscillation amplitude used for the simulation of the training images.
Our CGAN accurately reproduces the intramolecular contrast observed in the simulated HR-AFM images for quasi-planar molecules (with a corrugation of less than 0.5 Å). The prediction of the image background deteriorates for larger tip-sample distances, but it can be improved by using small oscillation amplitudes for the training. This problem with the background can be traced back to the L1 norm introduced in the optimization of the generator. While this choice is useful for the combined minimization of the generator and the discriminator, it also introduces a blur filter that is responsible for the irregular patches observed in the background. Due to the limited information provided by the strictly 2D input, predictions for molecules with significant internal corrugation correspond to planar structures, failing to capture the contrast associated with the atomic height differences.
Looking into the future, the fact that the background prediction improves when images with small amplitudes are used in the training suggests a possible (but rather complex) alternative to improve the results: to parameterize the tip-sample forces using machine learning techniques and, later on, to use them to calculate the frequency shift. This alternative approach should also reduce the blurring problem in the background, because the integral required to calculate the frequency shift would result in a smoother output. Machine-learning force fields have already been developed for different materials and applications (see (46) for a review), but given the large number of chemical species involved in organic chemistry, we anticipate that it would be difficult to achieve the generality provided by our CGAN. The 2D character of the ball-and-stick depiction limits the molecules that can be accurately simulated with this model. The development of a data input providing information about the relative heights of the atoms is a key point to be explored in future work.

[…] kernels with size (4,4) and stride (2,2) respectively. The last layer is a 2D convolution with a single kernel of size (4,4), which is activated with the sigmoid function.

Figure 1: Representation of our CGAN, composed of two networks: (a) generator (with encoder and decoder) and (b) discriminator. During training, each ball-and-stick molecular depiction feeds the encoder, which transforms the input into a compressed representation of the molecule that is subsequently reconstructed by the decoder as a stack of 10 constant-height HR-AFM images for different tip-sample distances. Once the AFM images are generated, the discriminator network assesses the quality of the simulation produced by the generator network, trying to guess whether the HR-AFM images provided as input (together with the ball-and-stick depiction) are the real ones or have been produced by the generator network. The confrontation of the two networks in a zero-sum game, with the generator improving its performance to fool the discriminator, provides an extremely efficient training. Once training is finished, the discriminator is discarded and the generator is ready to be used to predict the HR-AFM images for any molecule from its ball-and-stick depiction. Regarding structure, the yellow boxes in the generator encoder and the discriminator represent blocks consisting of a 2D convolutional layer with batch normalization activated with Leaky ReLU (LReLU), the blue ones correspond to dropout layers, and the green boxes in the generator decoder stand for blocks with a 2D transposed convolutional layer with batch normalization activated with a Rectified Linear Unit (ReLU) (except for the one preceding the output, which is activated with a hyperbolic tangent function). A detailed description of each layer can be found in appendix A.
Figure 4: (a) Ball-and-stick depiction for 3-(1,3-benzoxazol-2-yl)-N-(3-hydroxyphenyl)prop-2-enamide, used as input to the model. The color code for the chemical species in (a) is the same one used in fig. 2. (b) AFM simulations with amplitude 0.4 Å and (c) the corresponding prediction; (d) simulations with amplitude 1.0 Å and (e) the corresponding prediction. These results show that, while the intramolecular contrast is correct in both cases, the model improves its accuracy in the prediction of the background when images calculated with smaller oscillation amplitudes are used in the training.