A regression model for plasma reaction kinetics

Machine learning (ML) is used to provide reaction rates appropriate for models of low-temperature plasmas, with a focus on binary chemical reactions of the form A + B → C + D. The regression model is trained on data extracted from the QDB, KIDA, NFRI and UDfA databases. The model uses a variety of data on the reactant and product species, some of which also had to be estimated using ML. The final model is a voting regressor comprising three distinct optimized regression models: a support vector regressor, a random forest regressor and a gradient-boosted trees regressor; the model is made freely available via a GitHub repository. As a sample use case, the ML results are used to augment the chemistry of a BCl3/H2 gas mixture.


Introduction
Utilizing the unique properties of low-temperature plasma has become an integral part of almost every industry sector, spanning a wide range of applications such as medicine, biotechnology, surface modification, microfabrication, energy harvesting, thrusters, ozone generation and abatement systems, to name just a few. As an example of the importance of low-temperature plasma technologies for our everyday lives, it has been estimated that as much as one-third of the steps involved in the manufacturing of microelectronic technologies are plasma-based [1]. While providing desirable properties, the very complex nature of low-temperature plasma systems also poses challenges for describing and understanding plasma
phenomena. Understanding the plasma properties is crucial for the optimization of plasma-based processes and technologies and the only way to acquire an insight of any significant depth is through numerical modeling techniques. Therefore, any work aiming to improve modeling of plasma physics phenomena has the potential to carry high impact for the field.
There are many methods available for modeling low-temperature plasma properties and behaviour, which vary in both accuracy and complexity. No matter what kind of plasma model of whatever spatial dimensionality is considered, each is built around a chemistry set which describes the volumetric interactions between all the species tracked in the model and, additionally, the interactions between the species and surfaces. A volumetric and surface chemistry set is a very important base for every plasma model, accounting for the majority of sources and sinks of species. Many pre-compiled detailed chemistry sets for various feed gases and applications can be found in the literature, see for example [2][3][4][5][6][7][8][9]. As a consequence of advances in gas kinetics, published chemistry sets are becoming increasingly larger. For plasma physics modeling applications, chemistry sets may routinely include up to a hundred species and many thousands of reactions. For example, Koelman et al [10] provide a chemistry set for the splitting of CO2 in non-equilibrium plasmas which contains 73 unique species involved in 5724 reactions. In the combustion modeling community, where very large chemistry sets have been used for longer than in the plasma modeling community, some applications require sets that may contain several hundred or even thousands of species and tens of thousands of reactions [11].
A major problem faced by plasma modelers is the availability of reaction kinetic data. Each reaction needs to have its kinetic parameters specified, and these data are not always readily available. There are a number of online databases of kinetic processes for modeling low-temperature plasmas, such as the Quantemol Database (QDB) [9] and LXCat [12], which largely contains data on electron collisions, and Phys4Entry [13] for modeling atmospheric re-entry plasmas; databases for astrochemical modeling include KIDA [14,15], UDfA [16] and BASECOL [17]; fusion-oriented databases include the Japanese NIFS database [18], the Korean NFRI database [19] and the ALADDIN database maintained by the International Atomic Energy Agency [20]. These databases provide only a finite and limited set of data. Models of new plasma chemistries (e.g. to model plasma in a novel gas mixture), or models which extend an existing chemistry set (e.g. to cover a different range of conditions than those the source chemistry set was compiled for), often cannot find published kinetic data for key reactions.
Under these circumstances plasma modellers often fall back on estimation by analogy, or an educated guess. In fact, in a typical published chemistry set, a substantial subset of the reaction kinetics may actually be estimated. For example, Turner [21] performed a review of the state-of-the-art chemistry set for helium-oxygen atmospheric pressure plasmas, carefully tracing the primary sources of kinetic data for all the reactions in the set. He found that 63 out of the total 373 reactions had estimated kinetic data. This is a high fraction considering that He-O₂ is a fairly simple system and can be expected to have better coverage than more complicated chemistries. The same phenomenon can also be observed in online databases. For example, 1298 reactions in the KIDA database, as well as 869 reactions in the UDfA database, share the same reaction rate coefficient k = 7.5 × 10⁻⁸ (T/300)^−0.5 and the same reference, pointing to the publication by Harada and Herbst [22]. This paper, which actually cites Smith et al [23] as the data source, only lists a single reaction with this value of k. All the reactions in KIDA and UDfA pointing to this source are mutual neutralisation reactions of the general form R₁⁻ + R₂⁺ → P₁ + P₂ (+P₃ + P₄), and their reaction rate coefficients have been estimated by analogy with reaction (1). The practice of making such estimates has a place in plasma modeling, but it requires researchers' insight and experience, which so far has been difficult to algorithmize and automate. An approximate, automated and fast method for estimating unknown kinetics could therefore be very beneficial; machine learning (ML) offers this possibility.
Given the expense of measuring individual reaction rates, it has been argued, for example by Mason and Tennyson in The 2017 Plasma Roadmap: Low temperature plasma science and technology [24] and by Bartschat and Kushner [25], that the majority of atomic and molecular data required by the plasma modeling community for diverse modeling applications is expected to be derived from theoretical calculations. However, such calculations remain expensive, and the accuracy needed for reliable quantitative predictions remains a challenge in many cases [25]; theory is therefore still far from providing all the data required by plasma modelers.
ML is already being used very extensively in plasma physics, processing, and modeling, as well as in computational chemistry. A sizable body of research has been done on artificial neural network (ANN) models used as surrogate models for prediction of macroscopic plasma processing outputs (such as etch rate, deposition rate, etc) from the processing reactor control variables, such as RF power, pressure, or feed gas flows. Examples from plasma etch process modeling and real-time process control include, among others, the extensive work of Kim et al [26][27][28][29], Himmel and May [30], Rietman and Lory [31], Han et al [32], Stokes and May [33], or Tudoroiu et al [34]. The same is true for other areas of plasma processing. The plasma deposition process control modeling researchers such as Rosen et al [35], Bhatikar and Mahajan [36], Chen et al [37], or Ko et al [38] have also been using ML. ANNs have been further used to model plasma spray processes (e.g. by Guessasma et al [39], Jean et al [40], and Choudhury et al [41]), for modeling of plasma sputtering (e.g. by Krueger et al [42] or Kino et al [43]), plasma-assisted nanoparticle synthesis (e.g. by Leparoux et al [44]), or plasma surface modification (e.g. by Wang et al [45], or Abd Jelil et al [46]). Finally, there is also a large amount of work dedicated to the utilization of ANNs in any plasma processing generally, such as by Rietman [47], Salam et al [48], Molga [49], Kim et al [50,51], or Mesbah and Graves [52].
Apart from modeling plasma processing, control, and diagnostics, ANNs have also been used to augment some traditional quantum chemistry calculation methods. For example, Dral et al [53] used ML models to learn the parameters for semi-empirical quantum chemistry calculation methods from molecule structure, while Komp and Valleau [54] used deep ANNs to predict quantum reaction rate constants for simple systems trained on calculated data, to overcome the high cost of ab initio calculation. Zhang [55] used ANNs to estimate the standard enthalpies of formation of several kinds of acyclic alkanes, and Hansen et al [56] used ML methods for predicting molecular atomization energies. The review paper by Goh et al [57] summarizes the use of deep learning in computational chemistry.
Pertinent to the present work, ML techniques have also been used in the calculation of chemical kinetics. Ventura et al [58] and Galvan et al [59] used ANNs for curve-fitting complex experimental kinetic data, bypassing kinetic models built around chemistry sets altogether. Bas et al [60,61] developed an ANN model for estimating the reaction rates of the catalyzed enzymatic hydrolysis of maltose into glucose, also bypassing a kinetic model. Valeh-e-Sheyda et al [62] applied an ANN trained on experimental data to estimate the reaction rate of methanol dehydration as a function of temperature, pressure, and the purity of the feed stream. Tumanov and Gaifullin [63] describe ANNs learning the activation energies of reactions of phenyl radicals with hydrocarbons at a single given temperature. Allison [64] trained an ANN to predict rate coefficients of reactions of ·OH radicals from the bonds and bends of a selected set of possible reactants. Choi et al [65] discuss the feasibility of activation energy prediction of gas-phase reactions by the gradient-boosted trees method from structural and thermodynamical properties of the molecules, as do Grambow et al [66] using deep learning. Kuang and Xu [67] showcased the use of a convolutional neural network for the prediction of kinetic triplets for pyrolysis processes from experimental data, more specifically the temperatures at preselected values of conversion degrees. Very similar work has also been done by Huang et al [68], and Vieira and Krems [69]. In most cases, a research work intersecting ML and chemical kinetics introduces ANNs and other ML model techniques (or soft computing) as an alternative to the hard kinetic model of a system, which typically integrates the differential equations governing the species densities to calculate the reaction rates. Inputs to such models are typically absorbance, concentration, temperature, pH, etc.
This 'soft' approach for chemical kinetics is reviewed nicely in the paper by Amato et al [70].
In this work we explore the use of ML to supply unknown reaction rates in plasma chemistries, thus allowing complete chemistry sets to be generated without resorting to estimation or guesswork.

ML algorithms
In this work we test the use of ML, as implemented in the Scikit-learn [71] Python library, to fill gaps in kinetic data. We concentrate on chemical reactions represented by an Arrhenius form. Almost all regressor classes offered by the Scikit-learn package were tested; three of these regressor classes showed noticeably better performance than the rest. Thus the three regression model classes used here are the Support Vector Machine (SVM) regression model, the Random Forest regression model, and the Gradient-boosted Trees regression model. These are briefly discussed below; the full theory behind these models can be found in standard textbooks on ML such as [72].
SVMs are a class of powerful and versatile algorithms capable of performing linear and non-linear classification and regression. SVMs were developed by Boser et al [73], originally for classification problems. The most common kernels used with SVMs are the linear, polynomial and Gaussian radial basis function kernels; each kernel has its own set of hyperparameters.
Random forests are among the most powerful and versatile regression and classification ML algorithms available [72]. The simpler decision tree regression model forms a fundamental component of random forests. Decision trees [74,75] are a class of ML algorithms that can perform both classification and regression. The decision tree recursively splits the dataset into two subsets, building a binary tree of such splits all the way down to the leaf nodes. Each leaf node then corresponds to its range in the feature space and fits all the targets inside this range with a single value y. The decision nodes are built greedily from the root down, and the decision feature and the decision threshold for each decision node are determined by the CART algorithm (Classification and Regression Tree) [74]. For each decision node, the CART algorithm finds the feature and the threshold, which minimizes the weighted mean square error (MSE) for both subsets created by splitting the dataset by that feature and threshold. Instead of training a single decision tree on the whole training dataset, it is possible to train many separate decision tree regressors on random subsets of the training dataset and aggregate the predictions; this is called the random forest [76].
The gradient boosting method was introduced by Breiman [77] and further developed by Friedman [78]. The gradient-boosted trees regressor follows a similar idea to random forests, that is, it combines many weak-learning trees to form a single powerful regressor. However, instead of building many trees on different subsets of the training dataset, in the gradient-boosting method the trees are added in a sequence, and each additional tree is trained on the residual errors of the previous tree. The regressors used in the present work were trained to estimate the kinetic data from available data belonging to their reactants and products (ranging from trivial, such as charges, to more sparsely available, such as enthalpies of formation).
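The three regressor classes, combined into the voting regressor described in the abstract, could be assembled in Scikit-learn along the following lines. This is only a sketch: the hyperparameter values and the synthetic data are illustrative placeholders, not the optimized values or the dataset used in this work.

```python
# Sketch of a voting regressor averaging an SVM, a random forest and a
# gradient-boosted trees regressor, mirroring the model architecture
# described in the text (illustrative hyperparameters only).
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the (features, log10 k) training data:
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
rf = RandomForestRegressor(n_estimators=100, random_state=0)
gbt = GradientBoostingRegressor(n_estimators=100, random_state=0)

model = VotingRegressor([("svr", svr), ("rf", rf), ("gbt", gbt)])
model.fit(X, y)
y_pred = model.predict(X[:5])
```

The VotingRegressor simply averages the predictions of its three constituent models, each of which would be optimized separately before being combined.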

Training data
Kinetic parameters for plasma reactions can be found in scientific publications and in online databases. Here we extracted these data from various databases. All the data used for training and testing the regression models were automatically scraped from the following databases: QDB [8], NFRI [19], KIDA [14], and UDfA [16]. These four widely-used databases provide a good quantity of kinetic data for binary heavy-species collisions at or near room temperature.
We have direct access to the QDB database, so we simply queried its underlying relational database structure, which made data extraction simple. The UDfA database provides its raw data as a simple ASCII text file with a clear structure, documented in the accompanying paper [16], which meant UDfA data could be extracted with a short text parser. The data from the NFRI and KIDA databases were extracted directly from the web user interface using the Python web-scraping package Scrapy [79]. The databases were all scraped in 2020.
The regression model developed here describes binary heavy-species collisions only. In addition, several other data-filtering criteria were established to further limit the scope of the project. These criteria naturally make the resulting trained regression model applicable only to a well-defined but fairly narrow set of cases. The full set of criteria for the training/test data set acquisition is summarised as follows.
(i) Only heavy-species reactions are considered. Electron collisions and heavy-species collisions follow completely different dynamics, which would make it impractical to mix them in a single model. Furthermore, electron collisions are usually required by plasma simulations in the form of cross-sections, which have a much more complicated form than the reaction rate coefficients that are typically sufficient to represent heavy-species collisions.
(ii) Only binary reactions are considered. This is a practical choice, as the reaction rate coefficient changes units with the number of reactants in the reaction.
(iii) Only reactions with two products are considered, as part of the dataset feature-space directly describes the species of the reactions (both reactants and products) and their physical properties; limiting the dataset to only reactions with the same number of reactants and products prevents the problems that arise with datasets that have inherently missing values.
(iv) Reactions involving photons are not considered. All the databases used to source the data support photons as species in their reactions. These, however, make up only a small fraction of the reactions listed, and were therefore excluded from the dataset.
(v) Only reactions involving stateless species are considered.
This choice disallows a great many reactions from entering the dataset, and limits the applicability of the resulting model considerably. However, the representation of the internal states of molecules is beyond the scope of an initial project.
The data collection also ignored any reactions which did not conserve charge, or elemental stoichiometry. Additional considerations for the data collection are discussed below.
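The criteria above amount to a simple per-reaction filter. A minimal sketch is given below; the record layout (a dict with lists of species-name strings) and the `*` marker for excited states are assumptions made for illustration, not the actual data structures used in this work.

```python
# Illustrative filter implementing acquisition criteria (i)-(v) for one
# reaction record. The dict layout and the "*" excited-state marker are
# assumptions for this sketch.
EXCLUDED = {"e-", "e+", "hv"}  # electrons and photons

def passes_criteria(reaction):
    reactants = reaction["reactants"]
    products = reaction["products"]
    species = reactants + products
    if len(reactants) != 2 or len(products) != 2:  # (ii) binary, (iii) two products
        return False
    if any(s in EXCLUDED for s in species):        # (i) heavy species, (iv) no photons
        return False
    if any("*" in s for s in species):             # (v) stateless species only
        return False
    return True
```

Stoichiometry and charge conservation, mentioned above, would be checked in a separate validation step once the species formulas have been parsed.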
The kinetics for heavy-species collisions are represented in QDB by the coefficients of the modified Arrhenius formula, parametrizing the temperature dependence of the reaction rate coefficient by three parameters, α, β, and γ, as

k(T) = α (T/300 K)^β exp(−γ/T).    (3)

The pre-exponential factor α is mandatory for each reaction, while the parameters β and γ are optional, and are indeed missing for many of the reactions listed. For these reactions, the rate coefficient is simply described as a constant, without any temperature dependence.
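Equation (3) reduces to a constant k = α when β and γ are absent, which a minimal helper makes explicit (defaults of zero stand in for the missing optional parameters):

```python
import math

def rate_coefficient(alpha, beta=0.0, gamma=0.0, T=300.0):
    """Modified Arrhenius form, equation (3): k(T) = alpha * (T/300)**beta * exp(-gamma/T).

    With beta = gamma = 0 (the parameters missing from many database
    entries), the rate coefficient is a temperature-independent constant.
    """
    return alpha * (T / 300.0) ** beta * math.exp(-gamma / T)
```

For example, with β = γ = 0 the function returns α at any temperature, while a reaction with β = 1 doubles its rate coefficient between 300 K and 600 K.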
In the QDB object model, the reactants and products of each reaction are instances of the species object, which can hold its own properties. The following data were collected for each reaction stored in QDB: the kinetic coefficients α, β, and γ, and, for each reactant and product of the reaction, their formula, charge, and enthalpy of formation at the normal temperature, if available. QDB does not provide any range of validity for its reactions, so every reaction adhering to the criteria listed above was collected. In some cases, QDB contains multiple data for the same reaction; in such cases, the reaction was added multiple times.
In contrast to QDB, the NFRI database does not provide Arrhenius parameters for heavy-species reactions; rather, it represents the reaction kinetics for each reaction as discrete points of either the reaction rate coefficient or the cross-section, as a function of temperature. As this work is focussed on cold plasma applications, only those reactions whose kinetic data range overlapped a temperature of T = 300 K ± 10% were collected. With this filtering criterion applied, only a handful of reactions remained in the cross-sectional form, and these were excluded. NFRI does not provide any additional structure around its reactions' species, therefore only the species names (formulas) were collected, together with the reaction kinetics in the form of one or more [T, k] pairs.
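The room-temperature filter used here (and later for KIDA and UDfA) is a simple interval-overlap test; a minimal sketch, assuming the valid range is given as a [t_min, t_max] pair:

```python
# Keep a reaction only if its valid temperature range overlaps
# T = 300 K +/- 10%, i.e. the interval [270 K, 330 K].
T_LO, T_HI = 270.0, 330.0

def overlaps_room_temperature(t_min, t_max):
    """True if [t_min, t_max] intersects [T_LO, T_HI]."""
    return t_min <= T_HI and t_max >= T_LO
```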
The NFRI data presented an additional challenge over the units of the reaction rate coefficient. Two different units for reaction rate coefficient data appear in the database: cm³·s⁻¹ and cm³·mol⁻¹·s⁻¹. The distributions of rate coefficient values for the two units should be fairly similar in shape, differing from each other only by the Avogadro number; however, a plot of the data showed otherwise, implying that some of the reactions in the NFRI database must have incorrect rate coefficient units. Units were (re-)assigned following these simple rules:
• if the value of k(300 K) < 10⁻⁶, then k is in cm³·s⁻¹,
• if the value of k(300 K) > 10⁴, the unit is cm³·mol⁻¹·s⁻¹,
• if 10⁻⁶ ⩽ k(300 K) ⩽ 10⁴, the unit cannot be trusted and the reaction is removed.
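These three rules translate directly into a small classification function (a sketch; the string unit labels are placeholders chosen here for illustration):

```python
def assign_unit(k300):
    """(Re-)assign the unit of an NFRI k(300 K) value.

    Returns "cm3.s-1" or "cm3.mol-1.s-1", or None if the value falls in
    the ambiguous band and the reaction should be removed.
    """
    if k300 < 1e-6:
        return "cm3.s-1"
    if k300 > 1e4:
        return "cm3.mol-1.s-1"
    return None  # 1e-6 <= k <= 1e4: unit cannot be trusted
```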
Another source of data rejection was reactions involving species with ambiguous formulas and charges. Although considerable work was performed parsing as many species from NFRI reactions as possible, due to the fact that species are only represented by their formula string in this database, many species formulas could not be parsed and correctly identified.
Like QDB, the KIDA database represents the heavy-species kinetics using the three parameters α, β, and γ. However, KIDA supports three different temperature dependence functions for its reaction rate coefficients: the kinetics are parametrized either by the modified Arrhenius formula, equation (3), or by one of the formulas for ion-polar systems, describing the rate coefficients for unmeasured reactions between ions and neutral species with a dipole moment, computed using the Su-Chesnavich capture approach [14,80,81]:

k(T) = αβ (0.62 + 0.4767 γ (300/T)^1/2),    (4)

or

k(T) = αβ (1 + 0.0967 γ (300/T)^1/2 + (γ²/10.526)(300/T)).    (5)

Each of the equations (4) and (5) is defined for a different temperature range; α represents the branching ratio of the reaction, β is the Langevin rate, while γ determines the temperature dependence for the given temperature range. The KIDA database also has a species model, and stores additional attributes for each reactant and product of any reaction. For each eligible KIDA reaction, the following data were collected: the kinetic parameters α, β, and γ, the type of reaction rate temperature dependence formula needed to interpret those parameters, and finally, for each reactant and product of the reaction, the mass and charge, and, if present, also the enthalpy of formation at the normal temperature, the polarizability of the species and its dipole moment.
KIDA also provides four tiers of data evaluation, assigning to each reaction one of the following values: not recommended, not rated, valid, and recommended. The reactions labeled not recommended were ignored and not added to the dataset, while the reactions with all the other evaluation labels were added and treated equally. KIDA often lists multiple sets of kinetic parameters for a single reaction, and, as in the case of QDB, these were all preserved and added into the dataset as individual data instances. Finally, each reaction in KIDA has a valid temperature range attached, and only reactions where this range of validity overlaps with the range T = 300 K ± 10% were added to the dataset.
The last database scraped for data was the UMIST Database for Astrochemistry (UDfA). The kinetics of reactions in the UDfA database are described exclusively by the modified Arrhenius formula of equation (3). Each reaction also has a temperature range of validity, and the criterion for adding reactions to the dataset was the same as in the case of QDB and KIDA: the temperature range must overlap with a range around room temperature. No additional species data are provided by UDfA, only strings representing their formulas. This means that the species charges, elemental stoichiometry, and possible states had to be parsed from the formulas.

Dataset unification
The structure of the final unified dataset, aggregating the data from all the four databases, can be summarized by the entity relationship (ER) diagram given in figure 1 (only the relevant parameters are shown). In this model, every reaction is uniquely identified by two species as reactants, two species as products, and the set of kinetic parameter values, α, and optionally β, and γ. Each species is then uniquely identified by its elemental stoichiometry and a charge.
For simplicity, different isomers having the same elemental composition and charge were collapsed into a single species, characterised by its stoichiometry and charge. As an example, the three species collected from KIDA with the unique formulas HNCCC, HCCNC, and HCNCC were all unified into a single species characterised by the elemental stoichiometry {'H': 1, 'C': 3, 'N': 1} and the charge q = 0. If the enthalpy of formation ∆fH°, the polarizability α, or the dipole moment p was found in KIDA or in QDB for more than one such isomer, the resulting species was assigned the parameters of the isomer with the lowest ∆fH°.
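The isomer-collapsing step can be sketched as a dictionary keyed on (stoichiometry, charge), keeping the record with the lowest enthalpy of formation. The species record layout here is an assumption made for illustration:

```python
# Collapse isomers: species sharing elemental stoichiometry and charge are
# merged, and the merged species inherits the properties of the isomer with
# the lowest enthalpy of formation ("delta_f_H", a placeholder field name).
def collapse_isomers(species_list):
    merged = {}
    for sp in species_list:
        key = (frozenset(sp["stoichiometry"].items()), sp["charge"])
        best = merged.get(key)
        if best is None or sp.get("delta_f_H", float("inf")) < best.get(
            "delta_f_H", float("inf")
        ):
            merged[key] = sp
    return list(merged.values())
```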
Species from all four databases were identified by parsing the (database-specific) species formulas and extracting the elemental stoichiometry and charges from the formula strings. The species were also further validated with the help of the pyvalem Python package by Hill [82], and by checking the charge and stoichiometry conservation of the reactions they appear in. The species masses were determined from the elemental stoichiometry and checked against the masses scraped from the databases, adding an additional layer of confidence in the correct parsing of the species formulas.
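The conservation check reduces to comparing summed stoichiometries and charges on the two sides of a reaction; a minimal sketch, assuming species are represented as dicts of element counts plus a charge:

```python
# Charge and elemental-stoichiometry conservation check, used to validate
# parsed species formulas. The dict-based species layout is an assumption
# for this sketch.
from collections import Counter

def conserves(reaction):
    def totals(side):
        stoich, charge = Counter(), 0
        for sp in side:
            stoich.update(sp["stoichiometry"])
            charge += sp["charge"]
        return stoich, charge
    return totals(reaction["reactants"]) == totals(reaction["products"])
```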
The polarizability and dipole moment species parameters were populated exclusively from the KIDA database, where present. The enthalpy of formation values were searched for, in order, in the KIDA database, the QDB database, and the NIST-JANAF [83] and ATcT [84] tables, which had previously been scraped by Lu [85].
Finally, in addition to the reaction criteria listed above, two additional criteria for reaction elimination were introduced based on a first analysis of the unified dataset. When creating a training dataset, it helps to eliminate obvious fringe and outlying data instances to increase the data coherence [72]. These additional criteria were:
(i) Only reactions with neutral or singly-ionized species are kept. Doubly ionized species made up less than 0.3% of all the species in the dataset.
(ii) No reactions with free electrons are kept. The associative electron detachment reactions made up only about 1.7% of all the reactions in the dataset, and were therefore eliminated for the sake of the dataset coherence.
After removing duplicate reactions the final dataset consists of 9470 reactions involving 1080 distinct species. Table 1 provides the number of reactions in the final dataset sourced from each one of the four databases. The final dataset, following the relational structure depicted in figure 1, is given as data_final.yaml file in the project GitHub repository https://github.com/martin-hanicinec-ucl/regreschem.

Targets
First we need to select the outputs ('targets') the model aims to regress. Kinetics for heavy-species reactions are usually described by the modified Arrhenius formula, equation (3), which parameterizes the temperature dependence of the reaction rate coefficient k(T) by the three parameters α, β, and γ. Ideally, these three parameters would be predicted by a multivariate regression model. However, for this to work, all three parameters also need to be present in the training dataset as targets for supervised learning, and most of our data sources do not give a full set of Arrhenius coefficients; in practice, the majority of reactions in our dataset are characterized by a single reaction rate constant parameter, α. The kinetic data in the NFRI database [19] are provided as a series of reaction rate coefficient values for different temperatures. In principle, the desired Arrhenius coefficients could be fitted to these data, but this would require at least three data points. However, less than 4% of the NFRI reactions offer three or more data points; for more than 90% of NFRI reactions, only a single data point is provided, which is the reaction rate constant for a temperature within a 10% margin around 300 K. For the QDB [8], KIDA [14,15], and UDfA [16] databases, which offer kinetic data already in Arrhenius form, only about 3% of the selected reactions actually contained all three Arrhenius parameters.
We therefore decided to limit our regression model to a single-value prediction of the reaction rate constant expressed for T = 300 K. In practice we used its logarithm, as the rate coefficients need to be well resolved over a range of many orders of magnitude; the trick of taking the logarithm of the target values has been used before on a similar problem [54]. Therefore, as the targets vector y, we used the vector of log₁₀ k(300 K) values for all the reactions in the dataset, with k in cm³·s⁻¹.
There were two more issues with the targets which needed considering: duplicate reactions, and target capping. In many cases, the same kinetic data describing a particular reaction appeared in more than one database. These duplicates were detected based solely on the set of reactants, the set of products, and k(300 K), i.e. the target. While iterating over the dataset, each reaction was removed if it had the same two reactants, the same two products, and a k(300 K) value within 10% of a reaction already present.
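The de-duplication pass can be sketched as follows; the reaction record layout is again an assumption for illustration, with species names compared as unordered sets:

```python
# Drop a reaction if an already-kept reaction has the same set of reactants,
# the same set of products, and a k(300 K) value within 10%.
def deduplicate(reactions):
    kept = []
    for r in reactions:
        duplicate = any(
            frozenset(r["reactants"]) == frozenset(k["reactants"])
            and frozenset(r["products"]) == frozenset(k["products"])
            and abs(r["k300"] - k["k300"]) <= 0.1 * k["k300"]
            for k in kept
        )
        if not duplicate:
            kept.append(r)
    return kept
```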
The regression models train to minimize the error measure between the vector of predicted values and the vector of targets, usually the root MSE (RMSE) or the mean absolute error (MAE) [72]. With logarithmic targets, however, it would be a bad strategy to treat all data instances with identical prediction errors equally. Predicting, for example, k_pred = 10⁻⁵ cm³ s⁻¹ for a data instance with the target k = 10⁻⁷ cm³ s⁻¹ is clearly more significant than predicting k_pred = 10⁻²⁵ cm³ s⁻¹ for a data instance with the target k = 10⁻²⁷ cm³ s⁻¹, even if the two instances share the same square (and absolute) error in the logarithmic target space. This is because reactions with low rate coefficients have relatively little impact on the solutions of plasma models. As a workaround, we define an effective minimal rate coefficient k_min = 10⁻²⁰ cm³ s⁻¹. The targets of all reactions with k < k_min were capped to the minimal value of log₁₀ k_min. The predicted values were capped in the same way when evaluating different model classes, or when optimizing the model hyperparameters. Figure 2 shows histograms of all the dataset targets before and after capping. The bimodal distribution of k values is discussed in section 3.3.
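The capped logarithmic target described above amounts to a one-line transform:

```python
import math

K_MIN = 1e-20  # cm3 s-1, the effective minimal rate coefficient

def capped_log_target(k300):
    """log10 of k(300 K), floored at log10(K_MIN) = -20."""
    return max(math.log10(k300), math.log10(K_MIN))
```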

Features
We tried to collect data on as many features as possible which might plausibly correlate with the reaction rate coefficients being predicted. These data form the raw dataset.
(i) Data describing the reactants and products: for neutral reactants and products, the m and q values are naturally fully populated, but the rest of the values are not present in each data instance. The atom counts per block and group are an attempt to encode the elemental composition of the species into the data instances.
(ii) Data describing the exchanged fragment: this category of raw dataset table columns regards species fragments exchanged between reactants in order to create the products. As an example, in the reaction a single H atom is exchanged. The attributes encoding the exchanged fragments are the mass, the number of atoms, and the number of atoms per block and group of the periodic table, as in the previous point. This makes 22 columns in total. In some cases, a single fragment is not enough to turn reactants into products, and the values simply sum over all the fragments exchanged. In most cases, multiple ways exist to turn reactants into products, and the fragments exchanged with the lowest total mass are picked. So in the reaction the fragments Cl (passed from CCl₄ to OH⁻) and H (passed from OH⁻ to CCl₄) are selected in favour of the fragments O and CCl₃.
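The lowest-total-mass fragment selection can be sketched for a binary, two-product reaction by trying both reactant-product pairings and keeping the one whose transferred atoms are lightest. The `MASS` table here is a tiny illustrative subset of atomic masses, and the stoichiometry-dict representation is an assumption for this sketch:

```python
# Simplified selection of exchanged fragments for a 2-reactant, 2-product
# reaction: try both reactant-product pairings and keep the one with the
# lowest total transferred mass. MASS covers only the elements of the example.
from collections import Counter

MASS = {"H": 1.008, "C": 12.011, "O": 15.999, "Cl": 35.45}

def fragment_mass(stoich):
    return sum(n * MASS[el] for el, n in stoich.items())

def exchanged_fragments(r1, r2, p1, p2):
    best = None
    for pa, pb in ((p1, p2), (p2, p1)):
        # Atoms leaving each reactant under this pairing
        # (Counter subtraction keeps only positive counts).
        frags = [Counter(r1) - Counter(pa), Counter(r2) - Counter(pb)]
        mass = sum(fragment_mass(f) for f in frags)
        if best is None or mass < best[0]:
            best = (mass, frags)
    return best[1]
```

For CCl₄ + OH⁻ with products CHCl₃ and ClO⁻, this picks the Cl and H fragments (total mass ≈ 36.5) over the heavier O and CCl₃ alternative.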
The entire raw dataset table is available in the project repository https://github.com/martin-hanicinec-ucl/regreschem as dataset_raw.csv. Apart from the columns described already, several additional columns exist, containing extra metadata, such as the reaction strings (e.g. 'SF4 + SF6 --> SF5 + SF5-'), the name of the database the reaction instance belonged to ('qdb', 'kida', 'umist', or 'nfri'), the DOI identifier of the primary source, where available in the database, or the names and source databases for the individual species in each reaction (data instance) line. These columns are not used to construct features in the regression models.

Data imputation.
ML algorithms typically cannot accept missing data [72]. This is a problem: there are many instances where at least one of the ∆fH•, ∆fH•n0, α, or p values is missing for at least one reactant or product. Limiting the dataset to only instances with all the values present would decrease the dataset size considerably. Figure 3 gives an overview of how many data instances are missing which attributes. As an example, well over half of the instances are missing α for at least one of their species, but hardly any instances are missing α for every one of their species. To prevent decreasing the dataset size to less than a half, the missing values must be imputed.
The IterativeImputer class, available in the sklearn.impute python module [71], was used to regress the missing data in a dataset from all the other attributes. In each iteration, a single column containing some missing data gets filled by an imputation regression model, which is trained on all the other completely populated columns. In this way, the imputation model is just another regression model, trained to predict the missing values in order to produce a complete feature matrix. The IterativeImputer model can use different regression model classes to perform the imputation; we used the default Bayesian Ridge regressor. The scikit-learn implementation, the BayesianRidge regression model, is based on an algorithm described by Tipping [86] and MacKay [87]. As an illustration, figure 4 shows histograms of the polarizability α before and after imputation of the missing values.
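A minimal sketch of this imputation step, with a toy matrix standing in for the species-attribute data (note that IterativeImputer is still experimental in scikit-learn and requires an explicit enabling import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# IterativeImputer (default estimator: BayesianRidge) regresses each
# partially-missing column from the other, fully-populated columns.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, np.nan],   # missing value to be imputed
    [4.0, 8.0, 12.0],
])
imputer = IterativeImputer(random_state=0)
X_full = imputer.fit_transform(X)
# The imputed entry lands close to 9.0, consistent with the linear pattern.
```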
The data instance attributes described in section 3.2.1 do not by themselves form the feature matrix for the regression models. Typically, it makes sense to manipulate the values so that the final features exploit heuristics already known about the system, or some more sensible representation [72]. This manipulation is referred to as feature engineering and is key to a successful ML model [88].

Feature engineering.
As with model selection and hyperparameter tuning, feature engineering is domain-specific and the feature matrix X needs to be optimized, often iteratively by trial and error [88]. Here we describe our final set of features, obtained from a lengthy process of optimization for the lowest prediction errors.
As the order of reactants and products in any reaction is purely a matter of chance or convention, the features encoding attributes of reactants and products should be symmetric with respect to swapping the two reactants (or products).
The features encoding the reaction species were engineered as follows:
• Masses m of both reactants were replaced by the reduced mass µ of the left-hand side (LHS) of the reaction. The same was done for the products and the right-hand side (RHS) of any reaction.
• Charges q of both reactants were replaced by a series of one-hot encoded charge combinations. For the left-hand side of the generic reaction (8), this resulted in three boolean-valued features: Q 00 LHS, Q +0 LHS, and Q +− LHS. The product charges are converted by the same token into the features Q 00 RHS and Q +0 RHS. The charge combinations characterized by the features Q −0 LHS and Q −0 RHS were dropped from the dataset, as there were very few collisions between neutrals and negative ions present in the dataset, so nearly all values for these features were zero.
Additionally, the total enthalpy of formation for the whole reaction was explicitly added as a feature. The ∆fH•n0 values were manipulated in exactly the same way.
• Polarizability values α were turned into two distinct features, following the species naming convention from the generic reaction (8). The choice of these features is motivated by the fact that the electrostatic force between a charged and a polarizable particle will, to a first approximation, be proportional to the product of the charge and the polarizability of the particles.
• Dipole moment values p of all the reactants and products were turned into the features F p LHS and F p RHS, following the same reasoning used for the polarizability: the electrostatic force between a charged particle and a particle with a dipole moment will be roughly proportional to the product of p and the square of the charge. All the other groups were dropped from the feature space. All 7 features described evaluate the numbers of atoms found on the LHS of any reaction only. As each reaction conserves the elemental stoichiometry, the corresponding RHS features are identical and so are not needed.
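The order-symmetry requirement for the reactant features above can be illustrated with a small sketch (hypothetical helper functions, not the repository's code): the two masses collapse into a reduced mass, and the charge pair into one-hot charge-combination flags, both invariant under swapping the reactants.

```python
def reduced_mass(m1: float, m2: float) -> float:
    """Reduced mass of a two-body system; symmetric in its arguments."""
    return m1 * m2 / (m1 + m2)

def charge_combo_flags(q1: int, q2: int) -> dict:
    """One-hot flags for the charge combination of a reactant pair.

    Sorting the pair makes the encoding symmetric under reactant swap.
    """
    combo = tuple(sorted((q1, q2)))
    return {
        "Q00": combo == (0, 0),    # neutral-neutral
        "Q+0": combo == (0, 1),    # cation-neutral
        "Q+-": combo == (-1, 1),   # cation-anion
    }

mu = reduced_mass(1.0, 35.0)           # equals reduced_mass(35.0, 1.0)
flags = charge_combo_flags(+1, -1)     # cation-anion collision
```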
Apart from the features encoding the reactants and products, there are 9 more features describing the elements exchanged between the two reactants in order to create the two products. Here, X refers to a hypothetical particle made of the exchanged elements (see section 3.2.1), and N X is simply the number of atoms of X, regardless of block or group. Table 2 shows the final list of features forming the feature matrix X in this work, together with the feature names consistent with the code in the project repository and the feature data types. In total, 33 features were used.
Finally, scale sensitivity is typically handled by applying a scaling to all the numeric features [72]; this was done by adding a StandardScaler instance from the sklearn.preprocessing module [71] to the data transformation pipeline. The standard scaler subtracts the mean from each feature column and scales all the values to unit variance. Figure 5 shows the distribution of the final ∆fH• total feature (using the ∆fH• values after imputation), with different horizontal axes belonging to the original and standard-scaled feature data.
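The scaling step amounts to the following (toy feature matrix, not the reaction data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Each column is shifted to zero mean and scaled to unit variance.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Every column of X_scaled now has mean 0 and (population) std 1.
```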

Dataset analysis
As mentioned above, duplicate reactions were identified as identical reactions with very close reaction rate coefficients and were filtered out of the dataset. However, the dataset still contains many reactions which share the same reactants and products but have significantly different reaction rate coefficients. These might be sourced either from different databases, or from the same database but from different source publications. In some cases, the reaction rate coefficients for identical reactions differ vastly across different data samples. Figure 6 shows the different target values found in the dataset for each of three chosen reactions. As the data samples belonging to a single reaction share the same feature vector, these form conflicting data samples in the dataset. It can be seen from figure 6 that the difference in reaction rate coefficient k between two conflicting training data samples can in some instances exceed 10 orders of magnitude. The difference between the maximal and minimal value was evaluated for each reaction having conflicting reaction rate coefficients in the dataset. Out of the total of 1916 reactions with multiple k values, the vast majority have fairly consistent k values within a single order of magnitude.
There are, however, a significant number of samples with a much wider spread. This naturally has implications for the limits of how well any regression model can actually perform.
Another thing worth analyzing is the distribution of target values. Figure 2 showed the overall distribution of k values across the dataset with a distinct bimodal appearance. The individual peaks of the bimodal distribution correlate with the charge combinations of reactants or with the features Q 00 LHS , Q +0 LHS , Q +− LHS , as can be seen in figure 7. The rate coefficients belonging to reactions of two neutrals (Q 00 LHS = 1) and to neutral-ion reactions (Q +0 LHS = 1) together form the first, broader peak, while the reactions between positive and negative ions (Q +− LHS = 1) form the second, tighter peak. Most of the anion-cation collisions in the dataset are mutual neutralization reactions.
A closer look at the anion-cation collision samples reveals two populous rate values. First, as discussed above, 953 reactions share the same target value, corresponding to the reaction rate coefficient, and all appear to be generalizations of a single mutual neutralization reaction (1) sourced from Harada and Herbst [22]. Second, 166 reactions, all acquired from QDB, share the same value of k = 1.0 × 10^-7 cm^3 s^-1.

Training the model
The performance of our trained ML model is measured as a prediction error scored on a set of data using the RMSE and MAE error functions. As the predicted target values were capped at y_min, corresponding to k_min = 1 × 10^-20 cm^3 s^-1, the error functions are defined by

RMSE = sqrt( (1/N) Σ_{i=1}^{N} [cap(y_i^pred) − y_i]^2 ),    (18)

MAE = (1/N) Σ_{i=1}^{N} |cap(y_i^pred) − y_i|,    (19)

where cap(y) = y_min if y < y_min, and cap(y) = y otherwise.
Here, y and y_i refer to the known target values, while y^pred and y_i^pred are the values predicted by the model, and N is the number of data samples the prediction error is evaluated on. Note that the known target values y are already capped at y_min. We followed the standard practice of splitting our dataset into a training and a test set, with a training error evaluated on the training set and a generalization or out-of-sample error [72] evaluated on the test set. A low training error combined with a high generalization error is indicative of over-fitting [72]. Here we withheld 20% of randomly selected samples as the test set, giving a training set of 7576 reactions and a test set of 1894 reactions.
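The capped error measures of equations (18) and (19) can be written directly (a minimal sketch; the known targets are assumed to be capped already, as stated above):

```python
import numpy as np

Y_MIN = -20.0  # log10(k_min) with k_min = 1e-20 cm^3 s^-1

def cap(y: np.ndarray) -> np.ndarray:
    """cap(y) = y_min if y < y_min, else y."""
    return np.maximum(y, Y_MIN)

def capped_rmse(y_true, y_pred):
    diff = cap(np.asarray(y_pred)) - np.asarray(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))

def capped_mae(y_true, y_pred):
    diff = cap(np.asarray(y_pred)) - np.asarray(y_true)
    return float(np.mean(np.abs(diff)))

y_true = np.array([-8.0, -20.0])
y_pred = np.array([-9.0, -30.0])   # -30 is capped to -20, giving zero error
```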
To overcome potential biases, an n-fold cross-validation technique [89] is very often used. In n-fold cross-validation, the training set is split into n non-overlapping subsets, and each subset is used both as a training and a validation set, in a sequence of n trials. We optimized the hyperparameters for the selected model classes by minimizing the mean validation error of a 5-fold cross-validation. Two techniques were used predominantly: grid search and randomized search. Hyperparameters were optimized for all three shortlisted regression models (support vector regressor, random forest regressor, and gradient-boosted trees regressor). The optimization was carried out using the mean MAE error given in equation (19) over 5-fold cross-validation, with some attention to the difference between training and validation errors. We chose to optimize for MAE, as the RMSE of equation (18) is more sensitive to outlier samples, which are definitely present in the dataset, as shown in figure 6. Finally, the three optimized models were combined into a single voting regression model, with the vector of weights of the constituent models as the only hyperparameter. For full reproducibility, the optimized regression models and their hyperparameters are given as a code snippet in figure 8.
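The structure of the final ensemble can be sketched as follows. The hyperparameter values here are placeholders, not the optimized ones from figure 8, and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.svm import SVR

# Synthetic stand-in for the reaction feature matrix and log10(k) targets.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Three constituent regressors combined by a weighted VotingRegressor; the
# weights vector is the ensemble's only hyperparameter, as in the text.
voting = VotingRegressor(
    estimators=[
        ("svr", SVR(C=1.0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    weights=[1.0, 1.0, 1.0],
)
voting.fit(X, y)
y_pred = voting.predict(X[:5])
```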
The mean cross-validation MAE errors µ MAE for each model are listed in table 3, together with the standard deviations σ MAE over the 5-fold cross-validation trials. Results for two more models are listed as benchmarks.
The LinearRegression model [71] simply performs a linear regression in the feature space. The MedianEstimator model is an extremely naive custom estimator, which assigns each sample from the validation set (or test set) the median of all the known target values from the training set, while completely ignoring the feature matrix. The results in table 3 were obtained with scikit-learn version 0.24.2 and with random (but repeatable) train/test and train/validation splits. All the code is available as a Jupyter notebook [90] in the project repository https://github.com/martin-hanicinec-ucl/regreschem for full reproducibility. The test errors are slightly but significantly lower than the mean cross-validation errors; this is surprising and in general improbable. Thorough hyperparameter tuning will typically overfit to the training subset data instances, making the cross-validation errors (evaluated on the training set) typically lower than the test set error [72]. It is a hallmark of a well-trained and optimized model that the test error is very close to the validation error (while both are as low as possible), but typically the test error is higher than the validation error. This anomaly can be explained by looking not just at the mean cross-validation error µ MAE val, but at the individual validation errors of the cross-validation folds MAE val 1 to MAE val 5. The individual fold validation errors are shown in figure 9, together with the test error MAE test.

Results and discussion
The MAE errors for individual cross-validation folds differ considerably between folds, even though these are trained on subsets randomly drawn from the same training set. It is possible that the hyperparameters of the final voting regressor (and its constituent models) were optimized conservatively enough not to cause over-fitting to the training set, and at the same time, the withheld test set just by chance consists of data instances for which the final trained model performs exceptionally well. As discussed below, the whole test (and training) dataset can be split into various subsets, each with its own significantly different test (and validation) error. For a concrete example, the neutral-neutral reactions subset of the test set has a much higher MAE than the cation-anion reactions subset, which the final model predicts relatively precisely. In this case, the final MAE error measure might be quite sensitive to the ratio of neutral-neutral reactions to cation-anion reactions among the data instances, which can be more favorable for the test set than for the training set simply by the act of the random train/test split. Table 5 gives average prediction errors for different charge combinations of the reactants. Figure 10 illustrates the distributions of prediction errors plotted for each of the subsets from table 5. It is evident that different charge combinations among the reactants translate into different mean prediction errors. The very low MAE error measure for the cation-anion reactions is hardly a surprise. This subset had a very tight distribution of target values in the first place, tightly centered around a single value, as discussed in section 3.3 and shown in figure 7. The final regression model evidently recovered this tight distribution fairly well.
The fact that neutral-neutral reaction rate coefficients span a larger range than the predominantly fast ion-neutral collisions is probably the reason that they are predicted by the model with much higher errors.
As the neutral-neutral, ion-neutral, and ion-ion collisions have such obviously different prediction error distributions, as well as target values distributions ( figure 7), it was worth exploring the idea of training a dedicated regression model for each of those subsets. Unfortunately, this did not lead to any lower prediction errors. Tests showed that the distribution of target values in the dataset as a whole and in the test set were similar.

Analysis of feature importance
The regression models based on decision trees (random forest and gradient-boosted trees in our case) offer a measure of feature importance as the average depth at which any particular feature appears as a decision node across all the constituent trees of the ensemble. Figure 11 shows the feature importance measure for every single feature in the dataset, as assessed by two parts of the final voting regressor: the random forest regressor and the gradient-boosted trees regressor.
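Scikit-learn's tree ensembles expose this kind of per-feature relevance through the feature_importances_ attribute (which, in scikit-learn's default implementation, is impurity-based); a toy sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem with 4 features, only 2 of them informative.
X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One non-negative importance value per feature, normalized to sum to 1.
importances = rf.feature_importances_
```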
Three interesting facts can be noted about the feature importance values shown in figure 11. Firstly, the features encoding the properties of reactants (lhs_ prefix) appear to be more relevant for predicting the reaction rate coefficients than the features encoding the properties of products (rhs_ prefix). Notably, the rhs_polarizability_factor (F α RHS) and the rhs_dipole_moment_factor (F p RHS) features (see equations (15) and (17)) both appear to be completely irrelevant for predicting the rate coefficients, while their reactant counterparts (F α LHS, F p LHS) proved to be somewhat important. Furthermore, the features designed to encode the fragments exchanged between reactants (exchanged_ prefix, see section 3.2.3 and table 2) all appear to be almost completely ignored by the model when predicting rate coefficients. Lastly, it can be seen that the boolean features explicitly encoding the charge combinations among reactants and products, lhs_charge_00, lhs_charge_+0, lhs_charge_+-, rhs_charge_00, rhs_charge_+0 (or Q 00 LHS, Q +0 LHS, Q +− LHS, Q 00 RHS, Q +0 RHS, respectively), are completely ignored by the random forest and gradient-boosted trees regressors. And yet, the distinct distributions of rate coefficients for the categories represented by different values of those features were correctly recovered in the predicted rate coefficient values. This implies that the same information (distinguishing the Q 00 LHS = 1, Q +0 LHS = 1, and Q +− LHS = 1 cases) must have been encoded implicitly by other features assessed as more important by the final regression model, such as F α LHS in equation (15) or F p LHS in equation (17).
Figure 10 clearly shows that some of the test set instances (mainly belonging to the neutral-neutral category) were predicted by the model with significant prediction errors. Tables 6 and 7 show the ten reactions with the most overestimated and underestimated rate coefficient predictions, respectively; both tables show reaction instances from the full dataset, not only the test set. The prediction error in the tables refers to test errors for instances of the test set and training errors for instances of the training set. Apart from the prediction errors and the predicted and target values of the reaction rate coefficients, alternative rate coefficient target values are also shown for each reaction where they exist in the dataset. It is very encouraging to see that in all the cases where alternative target values exist, they agree with the predicted value much more closely than the most diverging target value responsible for flagging these predictions as outliers. Even without inspecting the sources and credibility of the data instances, it could be argued that the data instances with the very low target values of k < 10^-20 cm^3 s^-1 from table 6 are very probably erroneous, as the same reactions can in many cases be found in the dataset with rate coefficients about ten orders of magnitude higher. The cases from table 6 could then be considered erroneous data samples, rather than erroneous predictions.
Figure 11. Feature importances for all the model features, extracted from the random forest and gradient-boosted trees regressors (the two tree-based constituent models of the final voting regressor). For clarity, the feature names from the project GitHub repository are used; see table 2 for the list of features.
Table 6. Top ten reactions with the most overestimated predicted rate coefficients. For each reaction, the prediction error is listed, as well as the predicted and target values of the reaction rate coefficient. Where alternative rate coefficient targets exist in the dataset, those are also shown.

Taking as an example the elastic reaction: (i) The first data sample comes from KIDA [15] with k = 10^-9 exp(−7850/T) cm^3 s^-1, which gives 4.32 × 10^-21 cm^3 s^-1 at T = 300 K and gets capped to the value of k = 10^-20 cm^3 s^-1. This value is very low, and a careful examination of the entry in the KIDA database reveals the reason: KIDA lists it as the H + + HCN → H + + HNC reaction, so the low rate coefficient belongs to a reaction changing the isomer from hydrogen cyanide HCN to hydrogen isocyanide HNC, rather than the elastic reaction (22) appearing in the dataset, which does not resolve different species isomers, as discussed in section 2.3. The rate coefficient of k = 10^-20 cm^3 s^-1 therefore does not apply to reaction (22), and the model was right to regress a much higher value. KIDA cites Harada et al [91] as the data source, but we could not find this reaction explicitly in the cited publication. (ii) The second was sourced from UDfA [16], which lists the reaction with the coefficient value of k(2) = 10^-9 cm^3 s^-1, without any temperature dependence (and without any citation). (iii) The third available data sample is also from KIDA, which lists it with a rate coefficient in the form of one of the formulas for ion-polar systems, see equation (4), giving k(3) = 1.47 × 10^-8 cm^3 s^-1. KIDA cites work by Woon and Herbst [80] for this coefficient value, where the authors performed quantum-chemical calculations for neutral molecules, among others for HCN. This data sample could be considered the most reliable of the three, as it actually contains a citation to a paper relevant for the reaction. The prediction error compared to this data instance is still about 1.6 orders of magnitude, but this is well within the main peak of the prediction error distribution shown in figure 10.
Similar conclusions could be drawn about the most diverging data samples from table 7. For example, in the case of the first four reactions in table 7, it is obvious that the target values responsible for such high prediction errors are far too high and clearly incorrect. In the case of the reactions with IDs 3837 and 7494 in table 7, the alternative target values are very close to the predicted one, validating the predictions. And in the case of the first reaction (ID 7041) in table 7, with the highest prediction error of the whole dataset, the data sourced from KIDA is also very obviously wrong. KIDA cites Hickson et al [92] as the source publication for the value of the reaction rate coefficient described by the functional dependence given in equation (5). KIDA lists the parameter β (which corresponds to the Langevin rate in cm^3 s^-1) as β = 3.5 × 10^9 when it clearly should have been β = 3.5 × 10^-9. This typo was corrected in February 2021, after the training data for this project were scraped. With the correct β coefficient, the rate coefficient for this sample evaluates to k = 3.76 × 10^-9 cm^3 s^-1, which is fairly close to the predicted value.
Figure 11 shows that the features ∆fH• total, F α LHS, and F p LHS appear on average fairly high in the decision trees of the random forest and the gradient-boosted trees regression models, which signals their relatively high importance for the prediction of reaction rate coefficients. These three features are also among those derived from values that were not present for all the data samples. More specifically, the dipole moment p and polarizability α of both reactants were used to evaluate these features, as were the enthalpies of formation ∆fH• of all the reactants and products in any reaction. The missing species data had to be imputed, and it is interesting to see how the prediction error correlates with feature data availability. Such a correlation is shown as a bar plot in figure 12.

Analysis of missing features
At first glance, there appears to be a negative correlation between the number of missing values relevant to the features discussed and the MAE error measure, where one might expect (if anything) a positive correlation. After all, the imputation process will likely be fairly crude in guessing missing values of a species from its other attributes. However, this correlation is in fact caused by a bias in the dataset, as shown in figure 13. The subsets of samples with more species attribute values missing tend to have a higher ratio of cation-anion reactions and a lower ratio of neutral-neutral reactions. This greatly affects the prediction errors, as the neutral-neutral reactions are predicted with much lower accuracy than the cation-anion reactions. If data similar to those in figure 12 are plotted for, e.g., ion-neutral reactions only, the negative correlation disappears and the MAE error measure appears to be independent of the number of missing values. This implies the imputation process is fairly effective in recovering the missing data.

Packaging of the final regression model
As the final step in the regression model development, we wrapped the final optimized regression model into a custom Regressor class, which takes care of capping the predicted data to the minimal value y_min = log10 k_min = −20 after prediction, and recovers the reaction rate coefficients in the original units of cm^3 s^-1 by exponentiating the predicted values back to k_pred = 10^y_pred. We chained this custom regressor behind the data transformation pipeline, which takes care of missing data imputation, feature engineering, and scaling (and which is described in section 3.2.3). The resulting final regression model pipeline was trained on the whole dataset and persistently saved as final_regression_pipeline.joblib by the joblib module. This ready-to-use trained regression model can be easily imported into any python code from the utils module in the project repository, by calling the get_final_regression_pipeline function.
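The cap-and-exponentiate wrapping step can be sketched as follows. This is an illustrative reconstruction, not the repository's actual Regressor class, and it uses a plain LinearRegression as a stand-in for the trained voting regressor:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression

Y_MIN = -20.0  # log10(k_min)

class CappedLogRegressor(BaseEstimator, RegressorMixin):
    """Wraps a regressor of y = log10(k): caps at y_min, returns k."""

    def __init__(self, base_regressor=None):
        self.base_regressor = base_regressor or LinearRegression()

    def fit(self, X, y_log):
        self.base_regressor.fit(X, y_log)
        return self

    def predict(self, X):
        y_pred = np.maximum(self.base_regressor.predict(X), Y_MIN)
        return 10.0 ** y_pred  # back to rate coefficients in cm^3 s^-1

# Toy fit: y = -8 - x, so x = 15 would give y = -23, capped to -20.
X = np.array([[0.0], [1.0], [2.0]])
y_log = np.array([-8.0, -9.0, -10.0])
model = CappedLogRegressor().fit(X, y_log)
k_pred = model.predict(X)
k_floor = model.predict(np.array([[15.0]]))
```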
The model needs to be fed a pandas.DataFrame [93] instance, with rows representing the reactions for which the rate coefficients should be estimated. The project repository provides a sample input as sample_input.csv, which can be read by pandas into the DataFrame required as the input for the trained model. Figure 14 shows a jupyter notebook python snippet detailing how the example_input DataFrame is instantiated from the sample_input.csv table and showing all the DataFrame columns required by the model. Finally, figure 15 gives a continuation of the code from figure 14, showing how the ready-to-use trained regression model can be imported from the utils python module. To obtain the predicted rate coefficients for all the reactions described by the input DataFrame, the predict method of the regression model instance must be called with the input table as a single parameter, as detailed by figure 15. This returns a NumPy array of reaction rate coefficients in cm^3 s^-1.

Example use case
In this section, we present how the algorithm can be used to fill data gaps in chemistry sets, and how this might impact the results of a plasma simulation. As a gas mixture, we chose BCl 3 /H 2; note that neither the gas mixture nor the process parameters used in this example are meant to replicate any specific plasma process, but were rather chosen to have missing reactions to be filled with the ML algorithm and to show significant differences in the result when comparing the simulations with and without the additional reaction data. A case which can be compared to experimental data would be preferable; this would, however, necessitate bespoke experiments, which are beyond the scope of this work.
A basic BCl 3 /H 2 set was created with data from QDB using its set generator described in [9]. Cross-sectional data and rate coefficients for this set are taken from . In this basic set, reactions between the various BCl x species and H are missing. For some candidates, rate coefficients have been reported: [137, 138] report data for the reaction BCl 3 + H → BCl 2 + HCl (23) and [139] gives a rate coefficient for the reaction (24). However, both of these reactions are highly endothermic, yielding rate coefficients on the order of 10^-15 cm^3 s^-1 or smaller at 300 K. Hence, they are unlikely to have a significant impact at the temperature the ML algorithm was trained for. As additional candidates we consider reaction (25) and BCl + H → B + HCl (26).
Reaction (25) is exothermic while (26) is endothermic; therefore, we add reaction (25) but neglect (26). Using the ML algorithm for reaction (25) yields a rate coefficient of 1.3 × 10^-11 cm^3 s^-1. It should be noted that there are also candidates for ion-neutral charge exchange and ion-ion recombination reactions; however, these either had no significant impact on the results or their rate coefficients as determined by the ML algorithm are close to common estimates. Therefore, they are not discussed, for the sake of brevity.
To show the impact that adding the generated data to a chemistry set might have, we conducted global plasma simulations using the pygmol plasma model as detailed in [9]. We tested the two sets, with and without reaction (25), with (arbitrary) process parameters:
• a pressure of 10 Pa,
• an absorbed power of 10 W,
• a chamber radius of 0.1 m,
• a chamber height of 0.1 m,
• a total flow of 100 sccm,
• a relative flow of BCl 3 varied between 10% and 90%, with the remainder as H 2 flow.
Figure 16 shows the densities of the species involved in reaction (25) as a function of the relative BCl 3 flow for the process parameters discussed above. We observe significant differences between the basic set and the one with reaction (25) added, namely:
• The density of BCl is between half an order of magnitude and a whole order of magnitude larger with the additional data; qualitatively, it increases somewhat more slowly than in the basic set.
• BCl 2 shows densities which are one to two orders of magnitude smaller than in the basic set. Qualitatively, its density increases faster as a function of the BCl 3 flow than in the basic set.
• HCl shows the smallest difference, with similar qualitative behaviour and higher absolute densities, up to a factor of about 5.
• H displays a very different behaviour: in the basic set it stays relatively constant and only drops off for large BCl 3 flows, while with the added reaction it decreases monotonically. In absolute terms, the densities are roughly the same for small BCl 3 contents, but differ by an order of magnitude for large ones.
Overall, the example shows how adding a reaction with a hitherto unknown rate coefficient can have a significant impact on the chemical composition of the discharge, with densities of neutrals differing by up to two orders of magnitude. This highlights the importance of taking such reactions into consideration, for which our ML algorithm likely gives better estimates than intuitive guesses, while being faster than precise measurements, calculations, or calibrations. This and similar chemistries can be constructed within the QDB environment using a combination of the assembled data and the in-house ML algorithm.

Conclusions and outlook
Here we present an ML-based regression model for fast approximation of unknown plasma reaction rate constants at T = 300 K from commonly available reaction and species data.
The model was restricted to binary reactions of heavy species. The room-temperature rate coefficients are regressed from features built from masses, charges, enthalpies of formation, polarizabilities, dipole moments, and elemental data of both reactants and products. The final model is a voting regressor consisting of three distinct optimized regression models: a support vector regressor [73], random forest regressor [75], and a gradient-boosted trees regressor [77]. This work forms a natural counterpart to our previous study [140] which presented a method of ranking the species in a chemistry set according to their importance for modeling densities of a pre-defined set of species of interest. The model was trained on a training set of over 7500 data instances acquired from four popular databases of plasma processes. The final generalization MAE error of the reaction rate coefficient prediction, evaluated on a withheld test set of around 1900 instances, was just under 0.6 orders of magnitude, compared to about 0.95 orders of magnitude for a benchmark case of a simple linear regression. The overall distribution of the reaction rate coefficients from the test set was recovered very well in the predicted values, as were the distributions for distinct subsets of neutral-neutral, ion-ion, and neutral-ion reactions.
The ability of the model to flag erroneous data instances was demonstrated: we found that, of the ten most underestimated and ten most overestimated predicted rates, the majority could be attributed to erroneous training targets rather than incorrect predictions.
The most obvious potential for future work is the expansion of the model's applicability space. The model presented was trained exclusively on reactions with two heavy-species reactants and two heavy-species products, without any excited states, which means that it applies only to such reactions. As a subject of future research, the model could be expanded to also handle three-body collisions, dissociative or associative processes with different numbers of reactants and products, ionization processes with electrons among the products, and similar. Furthermore, the model could be expanded to handle vibrationally and/or electronically excited states among the reactants and products. Such an expansion would require redesigning the feature space of the model to capture state-specific properties of species and to allow features built from more than two species per reaction side.
Electron collisions form a somewhat distinct category of plasma reactions, as full collisional cross-sections are usually required to model low-temperature plasma phenomena. This, together with the fact that electron collisions are governed by complicated underlying physics, would likely make expanding the current model to electron collisions challenging. Instead, the development and training of a separate model for electron collisions might be a more sensible approach and is another topic of possible future research.
The model presented regresses only rate coefficients at room temperature, while a much more valuable output would be a temperature dependence k(T) in some form, such as the triplet of parameters of the modified Arrhenius formula of equation (3). Training such a model would, however, require training instances with temperature-dependent target values of the rate coefficients, for which our data sources contained only limited data.
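For context, a modified Arrhenius fit is conventionally written as a three-parameter form (the exact convention of equation (3), e.g. the reference temperature, may differ):

```latex
k(T) = A \left(\frac{T}{300\,\mathrm{K}}\right)^{n} \exp\!\left(-\frac{E_{\mathrm{a}}}{T}\right)
```

so a temperature-aware model would regress the triplet $(A, n, E_{\mathrm{a}})$ rather than a single room-temperature value of $k$.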
Building a higher-quality training dataset would undoubtedly improve the predictions provided by our model. We found that our training dataset contained a number of reactions with multiple data instances with diverging target values, sometimes differing by many orders of magnitude. Curating the training dataset so that it contains only trustworthy data instances might be a very costly effort, but one which would also improve the quality and performance of any regression model trained on it.
Rejecting untrustworthy data would, however, reduce the size of the training dataset, meaning that exploring additional training data sources is another significant direction of future work. The training dataset could also benefit from an analysis of the biases present, as the selection of data sources inevitably influences which reaction classes, rate-coefficient calculation methods, experimental techniques, etc, are dominant in the dataset. A careful bias analysis would help to understand how biases in the training dataset translate into the expected prediction accuracy for, e.g., different reaction classes. Finally, testing the regression model on various well-established published chemistry sets, by replacing a fraction of the real rate coefficients with data synthesized by the model and comparing the modeling results, presents another logical next step in this research.

Data availability statement
The datasets used to undertake this study and their documentation are openly available from the project repository https://github.com/martin-hanicinec-ucl/regreschem. The same repository also contains the ready-to-use trained regression model.