
A data-driven machine learning approach for electron-molecule ionization cross sections


Published 2 February 2024. © 2024 The Author(s). Published by IOP Publishing Ltd.

Citation: A L Harris and J Nepomuceno 2024 J. Phys. B: At. Mol. Opt. Phys. 57 025201. DOI: 10.1088/1361-6455/ad2185


Abstract

Despite their importance in a wide variety of applications, the estimation of ionization cross sections for large molecules continues to present challenges for both experiment and theory. Machine learning (ML) algorithms have been shown to be an effective mechanism for estimating cross section data for atomic targets and a select number of molecular targets. We present an efficient ML model for predicting ionization cross sections for a broad array of molecular targets. Our model is a 3-layer neural network that is trained using published experimental datasets. There is minimal input to the network, making it widely applicable. We show that with training on as few as 10 molecular datasets, the network is able to predict the experimental cross sections of additional molecules with an accuracy similar to experimental uncertainties in existing data. As the number of training molecular datasets increased, the network's predictions became more accurate and, in the worst case, were within 30% of accepted experimental values. In many cases, predictions were within 10% of accepted values. Using a network trained on datasets for 25 different molecules, we present predictions for an additional 27 molecules, including alkanes, alkenes, molecules with ring structures, and DNA nucleotide bases.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Atomic and molecular cross sections play a pivotal role in many areas of applied physics, including plasma physics, biophysics, and astrophysics. In these fields, cross sections are essential fundamental inputs for modeling complex problems. The success of models in these fields in capturing fundamental physical processes relies, at least in part, on the accuracy and availability of the scattering cross sections. Often, cross sections are required over a wide range of energies, target species, and collision processes. Databases such as the NIST electron elastic-scattering cross-section database [1], LxCat [2], BEAMDB [3], and others [4] have begun to address the need for large amounts of cross section data by compiling available experimental and theoretical datasets into openly accessible repositories. There are also computational methods that have proven reliable in predicting cross sections in various energy regimes, and many of these are being made publicly available through resources such as the atomic, molecular, and optical science gateway [5]. Despite the increasing availability of cross section databases, in many instances the necessary data remain inaccessible experimentally or too computationally demanding for ab initio theory. It is therefore impractical to rely exclusively on experiment or computation to obtain all of the needed data.

Machine learning (ML) algorithms have proven to be effective tools in many areas of physics. These algorithms use data from existing experiments and/or simulations to train a model that can accurately predict data for unknown systems. The use of ML algorithms in the physical sciences has exploded in recent years and is becoming commonplace in many areas of physics, such as high energy physics [6, 7], quantum many-body problems [8], quantum computing [9], molecular chemistry and materials science [10], and countless others. However, these techniques have seen only limited use in atomic and molecular collision physics. Existing applications in collision physics that predict atomic and molecular cross sections fall into two broad categories: (1) training a ML model on measured or calculated cross sections, or (2) using a ML model to solve the inverse swarm problem and predict the cross sections.

In the first category, one of the earliest applications of ML techniques to atomic and molecular cross sections was implemented by El-Bakry and El-Bakry, in which a feed-forward neural network was trained on experimental data for the total cross sections for electron and positron impact collisions with sodium and potassium [11]. Predictions were made for the same cross sections on which training was performed. The neural network predictions showed good agreement with the measured data, indicating that the network was able to successfully learn the total cross sections. Since then, several groups have developed various implementations of ML techniques. Harris and Darsey trained a neural network to predict proton-impact ionization double differential cross sections for atoms and molecules [12] and demonstrated the method's success with a limited number of inputs and training sets. El-Bakry et al developed a model that used the gradient tree boosting method to predict total electron and positron scattering cross sections for alkali atoms [13], demonstrating that ML techniques other than neural networks can successfully predict collision cross sections. Zhong trained a support vector machine algorithm to predict larger molecule cross sections from the theoretical binary encounter Bethe (BEB) cross sections of small molecules [14]. While these results showed good agreement between the ML model and the BEB model, the approach was limited because it relied on the approximate calculation of smaller molecule cross sections as inputs. Effectively, the technique trained a ML algorithm to replace the BEB model for large molecules and was therefore limited by the accuracy of the theoretical model used to create the input training data. Amaral and Mohallem applied the sure independence screening and sparsifying operator method to classify bound and unbound systems in low energy positron-molecule scattering and made predictions of possible new bound systems [15]. This is one of the few ML applications in atomic and molecular collisions that uses the algorithm for classification of a physical process involved in scattering, rather than for the prediction of cross sections. Jasinski et al used Bayesian ML to predict cross sections for inelastic scattering of two diatomic molecules [16]. Their results demonstrated that these types of algorithms can be used to improve approximate calculations using only a handful of rigorous calculations.

In the second category, known cross section data from available databases and the literature are combined with numerical solutions of the Boltzmann equation to solve the inverse swarm problem using ML techniques. The cross section data are then refined to provide complete and consistent datasets [17–19]. Stokes et al created a complete and self-consistent set of quasielastic momentum transfer, dissociative electron attachment, and neutral dissociation cross sections for nitric oxide [18]. They showed that the predicted cross section dataset has improved agreement with experimental data. Jetly and Chaudhury compared different ML algorithms, such as artificial neural networks, convolutional neural networks, and a densely connected convolutional network, for predicting cross sections from swarm data [17]. They demonstrated that the dense network approach gave more accurate predictions than the other neural network techniques. Stokes et al determined the elastic momentum transfer and ionization cross sections for helium and argon [19] and showed agreement with experiment within 4%.

In all of the above examples, the ML algorithms were able to successfully predict cross section data. However, one of the challenges with ML approaches is that they often require a large amount of training data. In general, the more data available to train the model, the better the network will predict the desired unknown data. However, it is exactly the lack of available data that motivates the use of ML algorithms for collision physics in the first place. The development of a ML model that can make accurate predictions from relatively limited amounts of data is therefore desirable and is one of the goals of the work presented here. In this work, we focus on the prediction of total ionization cross sections for molecules of any size and a wide array of shapes and chemical compositions. We test the effectiveness of the algorithm with respect to the number of training datasets and input parameters.

Molecular cross sections were chosen because they present particular challenges to theory and experiment. Compared to atomic targets, molecular targets have a much more complex electronic and nuclear structure. From the theoretical perspective, multi-electron effects and orientation effects both influence the ionization cross sections, but remain difficult to model. Experimentally, molecular targets may be difficult to produce, have high reactivity, and can be expensive. All of these factors combined lead to the limited availability of molecular ionization cross sections.

To help fill these gaps, we have developed a ML-based model that uses a relatively small number of available experimental cross sections for molecular targets to train a neural network. Because experimental data are used for training, our algorithms are independent of any limitations inherent in theoretical models. Once trained, the network can make predictions for additional molecules. We show that even with limited training input, the network is able to successfully predict molecular ionization cross sections to within 30% in the worst case, and often to within less than 10%. We also provide a database of predicted molecular ionization cross sections for 27 additional target molecules. With the goal of making cross section predictions easily available to the community, we have made our training algorithm and the fully trained network (ready to make predictions) available on Figshare [20].

2. Methods

The ML algorithm that we use for predicting molecular ionization cross sections involves several steps: (1) identification and processing of training data, (2) creation of the network, (3) training of the network, and (4) application of the trained network to predict new data. Each of these steps is described in detail in the sections below, and all datasets and codes are available on Figshare [20].

2.1. Training data

All training datasets used here were taken from published literature (see table 1). We identified experimental data sets for 25 different molecules in the form of electron-impact total ionization cross sections as a function of projectile energy. In some cases, multiple data sets were available for a given molecular target, in which case, we selected the datasets that limited the number of independent sources from which the data was taken, while simultaneously giving preference to newer datasets. In general, the variability between experimental datasets for training molecules was less than 25%. When multiple datasets were available, the datasets that we selected for training were in reasonable agreement with theoretical models and other experimental data. It is important to note that because our ML algorithm depends on the training data, a different choice in input datasets could alter the prediction. However, given that the uncertainty in the input training data is on the same order as the uncertainty of our predictions and that our model shows good success in predicting cross sections for a variety of molecules, a different selection of training sets is unlikely to produce significantly different predictions. Each dataset was converted to units of $a_0^2$ for the cross section values and electron volts for the electron projectile energies. Because the energy range and energy grid of the measurements were not consistent between published data sets, we selected only data within an energy range common to all data sets (i.e. 25–100 eV). Each dataset was then linearly interpolated to a fixed grid of 101 energies.
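As an illustration, the following Wolfram Language sketch shows one way this preprocessing could be performed. It is not the published Figshare code: the names rawData, energyGrid, and interpolateDataset are illustrative, and raw cross sections are assumed to be given in units of $10^{-16}$ cm$^2$ (a common convention; the actual published datasets vary in format).

```mathematica
(* Illustrative preprocessing sketch; rawData = {{energy (eV), xsec}, ...} *)
energyGrid = Range[25., 100., 0.75];      (* 101 energies from 25 eV to 100 eV *)
toBohrSq = 10.^-16/(0.529177*10.^-8)^2;   (* converts 10^-16 cm^2 values to a0^2 *)

interpolateDataset[rawData_] := Module[{f},
  f = Interpolation[{#[[1]], #[[2]]*toBohrSq} & /@ rawData,
    InterpolationOrder -> 1];             (* linear interpolation, as in the text *)
  f /@ energyGrid]                        (* cross section on the fixed 101-point grid *)
```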

Table 1. List of molecules whose ionization cross sections were used as training datasets. Molecules with a single asterisk (*) were excluded from the training dataset in partition 1 and made up the testing dataset for partition 1. Molecules with two asterisks (**) were excluded from the training dataset in partition 2 and made up the testing dataset for partition 2. The first ionization energy values were taken from [23].

| IUPAC molecule name | Chemical formula | First ionization energy (eV) | Experimental data reference |
| --- | --- | --- | --- |
| Ethanal** | $\text{C}_2\text{H}_4\text{O}$ | 10.23 | [24] |
| Propanal** | $\text{C}_3\text{H}_6\text{O}$ | 9.96 | [24] |
| Butanal | $\text{C}_4\text{H}_8\text{O}$ | 9.82 | [24] |
| 2-Methylpropanal* | $\text{C}_4\text{H}_8\text{O}$ | 9.71 | [24] |
| Ethoxyethane | $\text{C}_4\text{H}_{10}\text{O}$ | 9.51 | [24] |
| Propoxypropane | $\text{C}_6\text{H}_{14}\text{O}$ | 9.3 | [24] |
| 2-Isopropoxypropane | $\text{C}_6\text{H}_{14}\text{O}$ | 9.2 | [24] |
| Propanone* | $\text{C}_3\text{H}_6\text{O}$ | 9.7 | [24] |
| Butanone | $\text{C}_4\text{H}_8\text{O}$ | 9.52 | [24] |
| Pentan-2-one | $\text{C}_5\text{H}_{10}\text{O}$ | 9.38 | [24] |
| Pentan-3-one | $\text{C}_5\text{H}_{10}\text{O}$ | 9.31 | [24] |
| 3-Methylbutan-2-one** | $\text{C}_5\text{H}_{10}\text{O}$ | 9.31 | [24] |
| Hexan-3-one* | $\text{C}_6\text{H}_{12}\text{O}$ | 9.3 | [24] |
| Hexan-2-one | $\text{C}_6\text{H}_{12}\text{O}$ | 9.35 | [24] |
| 3,3-Dimethylbutan-2-one* | $\text{C}_6\text{H}_{12}\text{O}$ | 9.14 | [24] |
| 3-Methylpentan-2-one | $\text{C}_6\text{H}_{12}\text{O}$ | 9.2 | [24] |
| 4-Methylpentan-2-one | $\text{C}_6\text{H}_{12}\text{O}$ | 9.3 | [24] |
| Molecular Hydrogen | $\text{H}_2$ | 15.43 | [25] |
| Molecular Nitrogen** | $\text{N}_2$ | 15.58 | [25] |
| Carbon Monoxide | $\text{CO}$ | 14.01 | [25] |
| Nitric Oxide | $\text{NO}$ | 9.26 | [25] |
| Molecular Oxygen | $\text{O}_2$ | 12.07 | [25] |
| Methanol* | $\text{CH}_4\text{O}$ | 10.84 | [26] |
| Ethanol** | $\text{C}_2\text{H}_6\text{O}$ | 10.48 | [26] |
| Water | $\text{H}_2\text{O}$ | 12.62 | [27] |

2.2. Creating the network

To help with terminology, we introduce some definitions. We define a network as the configuration of the number of nodes, layers, and their connectivity. All of our networks were 3-layer feed-forward neural networks, consisting of an input layer, one hidden layer, and an output layer (see figure 1). The networks were fully connected, such that every node in one layer is connected to every node in the adjacent layers. The input to a given node is a weighted sum of the values passed from each of the upstream nodes plus a bias. From this input, the output of the node is calculated using a logistic sigmoid activation function. This output is then passed downstream to the nodes in the next layer, where it is processed as the input. We define a network trial as the optimized weights and biases for a network that was trained with a specific set of training data.
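Written out explicitly, the computation performed at each node $j$ described above is

$$y_j = f\left(\sum_i w_{ij}\, y_i + b_j\right), \qquad f(z) = \frac{1}{1+e^{-z}},$$

where the sum runs over the upstream nodes $i$, $w_{ij}$ and $b_j$ are the weights and bias that are optimized during training, and $f$ is the logistic sigmoid activation function.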

Figure 1.

Figure 1. Diagram of network configuration for Net10. The network has 4 input nodes, corresponding to the number of carbon, hydrogen, nitrogen, and oxygen atoms in a molecule; 3 hidden nodes, which is equal to 1/3 of the number of training sets; and 101 output nodes, corresponding to the ionization cross section (xsec) at each energy ranging from 25 eV to 100 eV. The network is fully connected with every node in a layer connected to every node in the adjacent layers.


The following are examples of different networks and network trials. Net10 has 4 input nodes, 3 hidden nodes, and 101 output nodes. In contrast, Net15 has 4 input nodes, 5 hidden nodes, and 101 output nodes. We define Net10 and Net15 as different networks. Net15 Trial 1 is the set of optimized weights and biases obtained by training Net15 on a given set of 15 training molecule datasets; Net15 Trial 2 is the set of optimized weights and biases obtained by training Net15 on a different set of 15 training molecule datasets.

Many different networks and trials were created for testing the effectiveness of our ML algorithm. In all networks, the number of nodes in the hidden layer was equal to one third of the number of training sets [21]. Because the number of training sets could vary, this resulted in different networks depending on the number of training sets. All networks had 101 nodes in the output layer, corresponding to the value of the cross section at each of the 101 projectile energies.

In some networks, the input layer consisted of 4 nodes, each corresponding to the number of carbon, hydrogen, nitrogen, and oxygen atoms in the molecule (i.e. the molecule's chemical formula). In other networks, the input layer consisted of 5 nodes, with 4 of the nodes corresponding to the number of carbon, hydrogen, nitrogen, and oxygen atoms in the molecule and a fifth node corresponding to the first ionization energy (ionization potential) of the molecule in eV. The input data were collectively normalized to a range of values between 0.05 and 0.95 to approximately match the range of the logistic sigmoid function. The biases and weights were initialized with pseudorandom values, which were optimized through backpropagation during the training phase.
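As a minimal sketch of this construction in the Wolfram Language interface used in this work (see section 2.3), the Net10 configuration could be assembled as follows. Applying the sigmoid at the output layer reflects our reading of the node description above and is an assumption about the exact architecture; the normalization helper is illustrative.

```mathematica
(* Sketch of the Net10 configuration: 4 inputs, 3 hidden nodes, 101 outputs *)
net = NetChain[{
    LinearLayer[3],                       (* hidden layer: 1/3 of 10 training sets *)
    ElementwiseLayer[LogisticSigmoid],    (* logistic sigmoid activation *)
    LinearLayer[101],                     (* one output per projectile energy *)
    ElementwiseLayer[LogisticSigmoid]},   (* output activation: our assumption *)
   "Input" -> 4];                         (* C, H, N, O atom counts *)

(* Collective normalization of inputs to the range [0.05, 0.95] *)
normalize[x_, min_, max_] := 0.05 + 0.9*(x - min)/(max - min)
```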

2.3. Training the ML model

For evaluation and testing of the ML models, the 25 cross section datasets were randomly sorted into two groups: a training group of 20 molecules (80%) and a testing group of the remaining 5 molecules (20%). A list of these molecules, their chemical formulas, their first ionization energies, and references to the original data is provided in table 1. A plot of the cross sections is shown in figure 2. Training used only the molecules in the training group; the testing group was set aside to compare the network's predictions with data that the network had never seen. Two different partitions of the 25 datasets, partition 1 and partition 2, were made so that results could be evaluated for different testing datasets. In figure 2, the cross sections for the partition 1 testing dataset are shown as solid red data points and those for the partition 2 testing dataset as solid blue data points. For partition 1, all datasets except those shown as solid red data points were included in the training set, while for partition 2, all datasets except those shown as solid blue data points were included in the training set.
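A sketch of one such random 80/20 split (the molecules in partitions 1 and 2 were fixed selections; the index-based bookkeeping here is illustrative):

```mathematica
(* Randomly split the 25 dataset indices into 20 training and 5 testing *)
shuffled = RandomSample[Range[25]];
trainingIndices = shuffled[[1 ;; 20]];
testingIndices = shuffled[[21 ;; 25]];
```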

Figure 2.

Figure 2. Interpolated experimental ionization cross sections used for training in partitions 1 and 2 (only 11 of the 101 data points are shown for clarity). See text for a description of how published datasets were processed for use in the algorithm. Solid red (blue) data points denote datasets excluded from training (and included in the testing datasets) in partition 1 (partition 2). References for the experimental datasets can be found in table 1.


During training, the input parameters for a given molecule were fed into the network. These values were propagated through the network, and a prediction was made for the cross sections (i.e. the output). This prediction was compared to the known cross section values, and the weights and biases of the connections between nodes were adjusted to optimize the agreement between the network's prediction and the known values. This process was repeated for each of the molecules in the training set. After cycling through all molecules in the training set, the process repeated, starting again with the first molecule. Each cycle through all of the training datasets is called an epoch, and we trained each network for 400 000 epochs. Convergence testing was performed with a few trials trained for 200 000 and 1 000 000 epochs. No difference was observed between predictions made with 400 000 and 1 000 000 epochs, but differences were found when only 200 000 epochs were used. Thus, we used 400 000 epochs for all network training.

All codes were written and executed in Mathematica version 12.1 [22], and the network was created with the NetChain command. Training of the network was conducted using the function NetTrain with the default settings. Details are available in the open source code at Figshare [20].
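A minimal training sketch under the same assumptions, where trainingInputs is the list of normalized input vectors and trainingTargets the corresponding interpolated cross sections (both names are illustrative):

```mathematica
(* Train for 400 000 epochs; other NetTrain settings remain at their defaults *)
trained = NetTrain[net, trainingInputs -> trainingTargets,
   MaxTrainingRounds -> 400000];
```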

2.4. Predictions

Once the network was trained, it was given new inputs that were propagated through the network to make a prediction. For evaluation purposes, the network was given inputs corresponding to the 5 testing datasets that were excluded from training, and its predictions were compared with the known, published experimental data. Six networks were used to make predictions: three that did not include the ionization potential as an input parameter and three that did. This allowed us to test the effect of the ionization potential as an additional input on the accuracy of the networks' predictions. In each case, the three networks corresponded to different numbers of training datasets, so that the effect of the number of training molecules on the networks' predictions could be examined. Lastly, 10 trials of each network were performed to evaluate how a network's predictions depended on the identity of the training molecules.
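Schematically, evaluating a testing molecule reduces to the following sketch, where testInput and testExpt are illustrative names and the network outputs are assumed to have been mapped back from the normalized range to physical cross section values:

```mathematica
(* Predict the 101 cross section values and compute the maximum percent
   difference from experiment, as reported in table 2 *)
prediction = trained[testInput];
percentDiff = 100.*Abs[prediction - testExpt]/testExpt;
Max[percentDiff]
```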

3. Results

To determine the effect of the number of training sets and the identity of the training molecules on the algorithm's ability to make accurate predictions, we compared predicted cross sections for three networks. Net10 had 4 input nodes, 3 hidden nodes, and 101 output nodes, and was trained on cross sections for 10 different molecules that were randomly selected from the 20 training datasets. Net15 had 4 input nodes, 5 hidden nodes, and 101 output nodes, and was trained on cross sections for 15 different molecules that were randomly selected from the 20 training datasets. Net20 had 4 input nodes, 6 hidden nodes, and 101 output nodes and was trained on all 20 training datasets. For Net10 and Net15, 10 trials were conducted, each with a different random selection of molecules from the training datasets. From these 10 trials, an average cross section and standard deviation were calculated.
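The trial statistics amount to elementwise averages over the 10 predictions; with trialPredictions as a hypothetical 10 × 101 array of predicted cross sections, this is simply:

```mathematica
(* Mean and standard deviation across the 10 trials at each energy *)
avgPrediction = Mean[trialPredictions];
stdPrediction = StandardDeviation[trialPredictions];
```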

Figure 3 shows the predicted cross sections from Net10, Net15, and Net20. The 5 testing molecules were 2-Methylpropanal, Propanone, Hexan-3-one, 3,3-Dimethylbutan-2-one, and Methanol. Each trial's predicted cross section is shown as a dashed-dotted line for Net10 and Net15, and the average of the 10 trials is shown as a solid blue line. For Net20, only 1 trial was calculated because all 20 training molecules were included in the training set, and this prediction is shown as a solid blue line. The light blue shaded region represents the standard deviation of the 10 trials from the average for Net10 and Net15. The experimental data for the 5 testing molecules is shown in figure 3 as open circles.

Figure 3.

Figure 3. Predicted ionization cross sections as a function of projectile energy using three different networks (Net10—column 1, Net15—column 2, Net20—column 3) with training data taken from partition 1. The ionization potential was not included as an input parameter. The testing molecules are labeled in the figure for each row. Dash-dotted lines (columns 1 and 2) are predictions from the 10 different trials of the networks. The solid blue line is the average of the predictions of the 10 trials, and the shaded blue area is the standard deviation of the predictions of the 10 trials. Interpolated experimental data are open circles (see table 1 and figure 2 for references and details).


The algorithm's predictions show the most variation among the trials when only 10 datasets were used for training (i.e. Net10). The deviation of the prediction varied with projectile energy; in the worst case (Methanol at 100 eV), the average of the trial predictions is 17.0 $a_0^2$ with a standard deviation of 0.63 $a_0^2$. When 15 training datasets were used (i.e. Net15), less variation among the trials was observed, indicating that with more training molecules, the identity of the training molecules had a decreased importance.

For all 5 testing molecules, the use of 20 training datasets yielded excellent predictions of the cross sections. For all of the molecules, the maximum percent difference between the Net20 predictions and the known experimental values was less than 14% (Propanone at 35 eV), and most percent differences were in the single digits. The maximum percent differences between the Net20 predictions and the experimental data are given in table 2 for each of the molecules. It is interesting to note that the prediction of Net20 was nearly identical to the averages of the predictions from the trials using Net10 and Net15. This indicates that the variation in prediction caused by different molecules in the training dataset was averaged out when multiple trials were conducted. For practical applications of our algorithm, one would ideally include all available datasets for training, thus eliminating the possibility of averaging over trials. However, based on the results shown in figure 3, accurate predictions can be achieved with as few as 10 training sets.

Table 2. Maximum percent difference between the Net20 and Net20Ip cross section predictions and the experimental data for the datasets in the testing groups.

Partition 1

| Molecule | Net20 maximum % difference | Net20Ip maximum % difference |
| --- | --- | --- |
| Propanone | 14 | 16 |
| 2-Methylpropanal | 4 | 5 |
| Hexan-3-one | 7 | 7 |
| 3,3-Dimethylbutan-2-one | 5 | 6 |
| Methanol | 12 | 9 |

Partition 2

| Molecule | Net20 maximum % difference | Net20Ip maximum % difference |
| --- | --- | --- |
| Ethanal | 13 | 14 |
| Ethanol | 26 | 23 |
| Propanal | 6 | 8 |
| 3-Methylbutan-2-one | 13 | 12 |
| Molecular Nitrogen | 30 | 1940 |

For Net10, Net15, and Net20, 4 inputs were used, corresponding to the number of carbon, hydrogen, nitrogen, and oxygen atoms in the molecule. The good agreement between the network predictions and experimental data for the 5 testing molecules in figure 3 demonstrated that the molecular ionization cross sections were strongly correlated with the chemical formula of the molecule. However, some discrepancies remain, particularly for the smaller molecules, such as Methanol and Propanone. Also, the agreement between prediction and experiment becomes slightly worse as the projectile energy increases. To test whether these discrepancies could be resolved with an additional input parameter, specifically the molecule's ionization potential, we created 3 new networks that included the ionization potential as an input. We refer to these networks as Net10Ip, Net15Ip, and Net20Ip. Their network configuration is identical to Net10, Net15, and Net20, except that they have an additional input node corresponding to the ionization potential. We repeated the process described above for these 3 new networks. For Net10Ip and Net15Ip, 10 trials were performed with different randomly selected training datasets and the average and standard deviation of the predictions was calculated.

Figure 4 shows the averages and standard deviations for Net10, Net15, and Net20 in blue and for Net10Ip, Net15Ip, and Net20Ip in red. For clarity, no individual trial predictions are shown in figure 4. Surprisingly, the inclusion of the ionization potential as an additional input parameter yielded greater variation among the trials and generally worse agreement with experiment. The agreement between predictions and experiment improved only for Methanol with Net20Ip and for Hexan-3-one with all networks when the ionization potential was included. This indicates that there was not a strong or useful correlation between ionization potential and cross section. In addition, because the standard deviation was larger for the predictions of Net10Ip and Net15Ip than for Net10 and Net15, more training datasets were required to achieve accurate predictions when the ionization potential was used as an input parameter.

Figure 4.

Figure 4. Same as figure 3, but includes ionization cross section predictions from Net10Ip, Net15Ip, and Net20Ip in which the ionization potential of each molecule was included as an additional input parameter to the network. The red solid line is the average of the predictions for 10 trials for networks that include the ionization potential as an input parameter and the red shaded area is the standard deviation for these predictions.


To further test the role of the training molecule identities in the algorithm's ability to accurately predict cross sections, we repeated the above training and predictions using a different partition of the 25 datasets into 20 training datasets and 5 testing datasets. In partition 2, the 5 testing datasets were for Ethanal, Ethanol, Propanal, 3-Methylbutan-2-one, and Molecular Nitrogen. Figure 5 shows the predicted cross sections from Net10, Net15, Net20, Net10Ip, Net15Ip, and Net20Ip for the 5 testing datasets in partition 2. As before, 10 trials were performed for Net10, Net15, Net10Ip, and Net15Ip, and an average and standard deviation of the cross sections were calculated.

Figure 5.

Figure 5. Same as figure 4, but using the training and testing datasets from partition 2.


Some trends observed in partition 1 were repeated in partition 2. The variation among trials was larger for the networks that included the ionization potential as an input, and this variation decreased when more training datasets were used. Agreement between the algorithm's predictions and experiment was worse for some molecules when the ionization potential was included (Ethanal, Propanal, Molecular Nitrogen) and better for others (Ethanol, 3-Methylbutan-2-one), with the best agreement at low energy. Notably, for Molecular Nitrogen, only the networks that did not include the ionization potential were able to reasonably predict the experimental data. This is likely due to a combination of factors. First, only one training dataset in partition 2 was for a molecule containing a nitrogen atom (Nitric Oxide). Second, the cross sections for diatomic molecules are at least an order of magnitude smaller than those for the larger hydrocarbons. Third, the ionization potential of Molecular Nitrogen is larger than that of most of the other molecules in the training dataset (see table 1). All of these factors likely combined to make it more difficult for the algorithm to predict the cross sections for Molecular Nitrogen. We note that despite the much smaller magnitude of the Molecular Nitrogen cross sections, the networks without the ionization potential do a remarkable job of accurately predicting them.

Overall, for the molecules in partition 2, the agreement between the algorithm's predictions and the experimental data is generally worse than for partition 1. In particular, the predictions for the testing molecules in partition 2 underestimate the high energy cross sections for Ethanal, Propanal, and 3-Methylbutan-2-one, while the prediction for Ethanol overestimates the high energy cross section. The overestimation for Ethanol is likely due to the different shape of this experimental cross section relative to those of the molecules included in the training set. Figure 2 shows that only the Ethanol experimental cross section has a maximum (around 60 eV) followed by a decreasing cross section at higher energies (greater than 60 eV). Because the networks in partition 2 had no training data with a local maximum, it was not possible for the prediction to exhibit one. One possible explanation for the underestimation of the 3-Methylbutan-2-one and Propanal cross sections is that the training datasets included cross sections for isomers of these molecules, which have smaller cross sections, leading to an underestimation in the prediction. In the case of Ethanal, the most similar training molecule was Methanol, which has a smaller cross section and likely led the Ethanal prediction to underestimate the experimental data.

The predictions from partitions 1 and 2 imply that while the identity of training molecules within a given partition does not significantly alter the network's prediction, the partition of the training datasets can influence the network's prediction accuracy. We note, however, that the partitioning of datasets was done purely to evaluate the algorithm's accuracy in predicting cross sections; any application of the algorithm would use all available datasets. In the worst case of the Net20 predictions (partition 2, Molecular Nitrogen), the algorithm very accurately predicted the cross section for energies less than 40 eV and differed from the experimental data by 30% at 100 eV. Because the algorithm's predictions improve with more training datasets, we expect that predictions from a Net25 network trained on all 25 available datasets would have an error of less than 30%. This is acceptable and comparable to the uncertainties of experimental datasets, which can be around 20% [26, 28, 29].

Given the success of our algorithm in predicting molecular ionization cross sections, we present in figure 6 predictions for additional molecules of interest. To produce these predictions, the network was trained on all 25 experimental datasets listed in table 1 for 400 000 epochs, without the ionization potential as an input parameter. The trained network was then given input values corresponding to each new molecule's chemical formula. The predictions for several classes of molecules are shown in figure 6, including alkanes, alkenes, molecules with a ring structure, and DNA nucleotide bases. In a select few cases, our algorithm's predictions are compared to published experimental data or BEB model predictions. The chemical formulae of the molecules are listed in table 3.

Figure 6.

Figure 6. Predictions for ionization cross sections for new molecules using a network trained on all 25 input datasets listed in table 1. (a) Alkanes, (b) alkenes, (c) molecules with a ring structure, (d) nucleotide bases. Comparison is made with (a) BEB model data from [31] and experiment from [30], (c) experiment from [28, 29], and (d) experiment from [32–35]. Vertical lines on the experimental data indicate uncertainty.


Table 3. List of molecules and their chemical formula for which predictions were made from a network trained with the 25 datasets from table 1 without the ionization potential as an input parameter.

| IUPAC molecule name | Chemical formula |
| --- | --- |
| Methane | $\text{CH}_4$ |
| Ethane | $\text{C}_2\text{H}_6$ |
| Propane | $\text{C}_3\text{H}_8$ |
| Butane | $\text{C}_4\text{H}_{10}$ |
| Pentane | $\text{C}_5\text{H}_{12}$ |
| Hexane | $\text{C}_6\text{H}_{14}$ |
| Heptane | $\text{C}_7\text{H}_{16}$ |
| Octane | $\text{C}_8\text{H}_{18}$ |
| Ethene | $\text{C}_2\text{H}_4$ |
| Prop-1-ene | $\text{C}_3\text{H}_6$ |
| But-1-ene | $\text{C}_4\text{H}_8$ |
| Pent-1-ene | $\text{C}_5\text{H}_{10}$ |
| Hex-1-ene | $\text{C}_6\text{H}_{12}$ |
| Hept-1-ene | $\text{C}_7\text{H}_{14}$ |
| Oct-1-ene | $\text{C}_8\text{H}_{16}$ |
| Benzene | $\text{C}_6\text{H}_6$ |
| Cyclopropane | $\text{C}_3\text{H}_6$ |
| Cyclobutane | $\text{C}_4\text{H}_8$ |
| Cyclopentane | $\text{C}_5\text{H}_{10}$ |
| 4-Hydroxy-3-methoxybenzaldehyde (Vanillin) | $\text{C}_8\text{H}_8\text{O}_3$ |
| Naphthalene | $\text{C}_{10}\text{H}_8$ |
| Pyridine | $\text{C}_5\text{H}_5\text{N}$ |
| Pyrimidine | $\text{C}_4\text{H}_4\text{N}_2$ |
| 7H-Purin-6-amine (Adenine) | $\text{C}_5\text{H}_5\text{N}_5$ |
| 5-Methyl-1H-pyrimidine-2,4-dione (Thymine) | $\text{C}_5\text{H}_6\text{N}_2\text{O}_2$ |
| 6-Amino-1H-pyrimidin-2-one (Cytosine) | $\text{C}_4\text{H}_5\text{N}_3\text{O}$ |
| 2-Amino-1,7-dihydropurin-6-one (Guanine) | $\text{C}_5\text{H}_5\text{N}_5\text{O}$ |

For both the alkanes and alkenes, the predicted cross sections increased with the size of the molecule, a trend also observed in the experimental cross section data for Ethanal, Propanal, and Butanal; Methanol and Ethanol; and Propanone and Butanone (see figure 2). For Methane, Ethane, and Butane, our model's predictions agree quite well with published experiment [30] and the BEB model predictions [31]. For Butane, the ML model's prediction is well within the error bars of the experimental data, and the ML and BEB models predict nearly identical cross sections for Ethane. Comparing the predicted cross sections for the alkanes to those for the alkenes shows that either the double bond or the reduced number of hydrogen atoms in the alkenes lowers the cross sections relative to the alkanes.

For molecules with a ring structure, the predicted cross sections again increase in magnitude as the number of atoms in the molecule increases. Comparison with experimental data for Pyrimidine [29] shows that the ML model's predictions are again within the 20% uncertainty of the experimental data. However, for Pyridine, the ML model underestimates the experimental data [28] and falls outside the 20% experimental uncertainty.

For the nucleotide bases, the cross section predictions for Adenine and Thymine are nearly identical, which is unexpected given the differences in their chemical formulae. Since Adenine is the only one of these molecules without an oxygen atom, one might expect its cross section to differ from those of the other nucleotide bases. There are two experimental datasets available for each of the nucleotide bases, which differ by as much as 50% [32–35]. Several theoretical predictions are also available that vary by up to 25% (see figure 3 of [32] for a summary), including a BEB model that agrees well with the data of [32]. For Adenine and Guanine, our model's predictions underestimate the cross section data of [32, 33], while for Thymine and Cytosine, our model underestimates the cross sections of [32] and overestimates those of [34, 35]. Overall, our model's predictions tend toward lower magnitudes for the nucleotide base ionization cross sections.

4. Conclusions

Due to their nuclear and electronic complexity, the calculation or measurement of collision cross sections for molecular targets remains a challenging task for theory and experiment. We have introduced a ML model capable of predicting ionization cross sections for a wide array of molecular targets with limited training data and input parameters. We used 3-layer feed-forward neural networks with inputs of the number of carbon, hydrogen, nitrogen, and oxygen atoms in the molecule, as well as the ionization potential, and outputs of the molecular cross section at projectile energies between 25 and 100 eV. The networks were trained using published experimental data, making them independent of any approximations required in theoretical models and ensuring that all physics captured in the experiment is included in the trained network.

Overall, our results showed that networks that used the chemical formula of the molecule as inputs, but not the first ionization energy, resulted in more accurate predictions for the cross sections, with the prediction differing from the known experimental result by at most 30% and frequently less than 10%. This indicated that there is a strong correlation between the chemical formula of the molecule and its cross section, but a weak correlation between the ionization potential and the cross section. The strong correlation of cross section with chemical formula is something that has been exploited for many decades when trying to predict molecular ionization cross sections. For example, one of the simplest models for predicting the cross sections is the Bragg additivity rule, in which the molecular cross section is estimated as the sum of the atomic cross sections for all the atoms in the molecule. A number of variations of this additivity rule have been introduced, including using a weighted sum of cross sections for atomic orbitals [36], the screening corrected additivity rule that accounts for some geometric overlap between the constituent atoms [37], and the pixel counting method [38] that includes molecular orientation and atomic overlap effects. All of these methods are based on the idea that the molecular cross section can be expressed as some function of the atomic cross sections, and our ML model exploits a similar relationship.
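In its simplest form, the Bragg additivity rule reads

$$\sigma_{\text{mol}}(E) \approx \sum_a n_a\, \sigma_a(E),$$

where $n_a$ is the number of atoms of species $a$ in the molecule and $\sigma_a(E)$ is the ionization cross section of atom $a$ at projectile energy $E$; the variants described above replace this simple sum with weighted or screening-corrected combinations of the atomic contributions.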

The weak correlation between the first ionization energy and the cross section was unexpected since ionization is an energy-dependent process and conservation of energy necessarily dictates that the kinetic energy of the final state fragments depends on the ionization potential. In the BEB model [39], reasonably accurate ionization cross sections are predicted by using the ionization energies for many molecular orbitals as inputs. Since these parameters have already been shown to be relevant for the BEB model, there is reason to expect that the inclusion of molecular orbital information and more ionization energies as inputs to our algorithm could also provide improved predictions.

As expected, networks that included more training data yielded more accurate predictions. However, even with as few as 10 training datasets, the network was able to predict the cross sections to within a reasonable degree of accuracy. The identity of the molecules included in the training set had an effect on the success of the predictions, but this variation was again within acceptable limits. The uncertainty in our model's predictions was typically similar to experimental uncertainties.

By using a network trained on 25 published molecular cross section datasets, we made predictions for a wide variety of molecules, including alkanes, alkenes, molecules with ring structures, and DNA nucleotide bases. These predicted cross sections exhibited the trends that molecules with more atoms had a larger cross section and molecules with similar chemical formulae yielded similar cross sections. These trends were consistent with published experimental data.

Prior ML models for predicting cross sections have either been limited to a small number of target molecules or have relied on theoretical models to provide training data. The algorithm introduced here is broadly applicable to any molecule whose constituent atoms are carbon, hydrogen, nitrogen, or oxygen, and it is not limited by theoretical approximations. Training a given network required less than 10 min on a laptop computer, and predictions from a fully trained network require only a fraction of a second. Thus, our trained network offers fast and accurate predictions of molecular ionization cross sections. We have also made the algorithm openly available so that other researchers may use it for molecular targets beyond those we have examined.

It is possible that further improvements to the algorithm could be made by changing the network structure, using a different ML technique, including additional input parameters such as more ionization energies or polarizability, expanding the applicable energy range, or increasing the possible constituent atoms. However, these investigations are left to future studies. The success demonstrated by the current model provides additional evidence to a growing body of work that demonstrates the value of ML models for estimating collision cross sections. These techniques will hopefully continue to provide a quick and accurate source of approximate cross section data for use in many applications.

Acknowledgments

We gratefully acknowledge the support of the National Science Foundation under Grant No. PHY-1912093.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).
