Machine learning for accurate methane concentration predictions: short-term training, long-term results

Although methane emissions from Alberta's oil and gas sector have decreased in recent years, monitoring these emissions using Continuous Emission Monitoring Systems (CEMS) can be costly. Predictive Emissions Monitoring Systems (PEMS), powered by machine learning, offer an alternative to, or can supplement, CEMS. However, effective machine learning models for methane emissions prediction rely heavily on the amount of training data. To address this, we compare the prediction performance of different neural network models, including Long Short-Term Memory (LSTM), stacked LSTM, Gated Recurrent Unit (GRU), and bidirectional LSTM (BiLSTM), trained on methane concentration data from Alberta airshed stations over varying time intervals. The results show that the GRU model performed better with shorter training datasets, whereas the LSTM and stacked LSTM models outperformed the GRU and BiLSTM models when trained with more historical data. However, more training data did not necessarily result in significantly better prediction models.


Introduction
Among greenhouse gas (GHG) emissions, methane ranks second after carbon dioxide in terms of environmental impact (US EPA 2022). Despite the economic slowdown during the COVID-19 pandemic, methane levels continued to rise in 2020, as reported by the U.S. National Oceanic and Atmospheric Administration (National Oceanic and Atmospheric Administration 2021). By January 2022, the global average methane concentration had reached 1908.9 ppb, a 7.5% increase since 2000 (Dlugokencky 2022). Despite numerous national commitments to reduce methane emissions, the atmospheric concentration is still rising (Dlugokencky 2022). There is a need for enhanced capabilities for monitoring, reducing, and predicting methane emission trends. Among technologies for monitoring and reducing GHG emissions, predictive emission models utilizing machine learning approaches are anticipated to play a critical role in methane emissions control.
As a specific example, in Canada, Alberta's oil and gas industry is a major contributor to GHG emissions, with bitumen production from shallow surface oil sands mines or deep in situ oil sands recovery processes (Dones et al 2003) and hydraulically fractured natural gas production exhibiting substantial growth between 2003 and 2015 (Government of Canada 2021). Emissions from oil sands mining operations stem from various sources, including the mine itself (methane exsolution from newly exposed oil sands resource), tailings ponds, and natural gas combustion for hot water generation (Liggio et al 2019, Norgate and Haque 2010, Bhutto et al 2013). Given the growth of the oil and gas industry in Alberta over the past few decades, the province serves as a model regional environment for examining the growth of emissions from the fossil fuel industry. In response to growing concerns over methane emissions, various policies have been put in place both within Canada and internationally. The Global Methane Pledge, launched at COP26 in November 2021, aims to reduce global methane emissions by 30% before 2030 (CCAC 2021), and Canada has committed to this pledge. Alberta has committed to a methane emissions reduction objective for the oil and gas sector, positioning itself as a leader in emission reduction efforts (Government of Alberta 2021). In September 2021, the Canadian government set a target to reduce methane emissions to 45% below 2012 levels by 2025. This reduction target is part of Canada's efforts to address climate change, aligning with its commitments under the Paris Agreement. To achieve the 45% target, the Canadian government has implemented various measures, including regulations and industry-specific initiatives focused in particular on the oil and gas sector. Alberta (Government of Alberta 2021) has reported that oil and gas methane emissions dropped by 44% between 2014 and 2021, although there are uncertainties that need to be addressed with respect to potential underestimation of methane emission volumes (Bryant 2023). Beyond the 45% target, the Government of Canada has committed to reducing methane emissions from the upstream oil and gas industry by 75% from 2012 levels by 2030, whereas the Government of Alberta is considering a 75%-80% reduction from 2014 levels by 2030 (Government of Alberta 2022c). Despite these efforts, methane concentration data from monitoring stations in the Fort McMurray area show either flat or increasing trends, suggesting that further measures are necessary to curb emissions effectively.
Continuous Emissions Monitoring Systems (CEMS) offer valuable information on methane emissions but are relatively expensive, especially in harsh and remote locations like northern Alberta. Additionally, CEMS units require regular calibration to maintain accurate and reliable measurements. In contrast, Predictive Emissions Monitoring Systems (PEMS) provide a cost-effective alternative. These software-based systems rely on baselining (training) and data analytics, often utilizing state-of-the-art modelling methods to predict emissions (Chien et al 2003, Cheng and Hagen 2006). PEMS generate predicted emission estimates and, in most situations, are more economical than CEMS (Khaqan 2011). While PEMS cannot fully replace CEMS in all cases, they are often used in conjunction with CEMS data to calibrate and train the predictive models, after which they can provide accurate emission estimates (Ciarlo and Callero 2013). In Alberta, PEMS models are regulated within the CEMS Code (Alberta Environment and Parks 2021) and must be based on three process parameters or variables and at least 2880 quality-assured hours of measured CEMS emission data. PEMS models should exhibit a deviation of no more than 10% from the measured CEMS data in 95% of the validation dataset.
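As an illustration of this acceptance criterion, the following minimal sketch checks the 10%/95% rule on paired hourly CEMS measurements and PEMS predictions. The function name and array inputs are our own illustrative choices, not part of the CEMS Code:

```python
import numpy as np

def meets_pems_criterion(cems: np.ndarray, pems: np.ndarray,
                         max_rel_dev: float = 0.10,
                         required_frac: float = 0.95) -> bool:
    """Return True if the PEMS predictions deviate from the measured CEMS
    values by at most max_rel_dev for at least required_frac of the
    validation points (an illustrative reading of the 10%/95% rule)."""
    cems, pems = np.asarray(cems, float), np.asarray(pems, float)
    # Methane concentrations are assumed strictly positive (ppm scale).
    rel_dev = np.abs(pems - cems) / np.abs(cems)
    return float(np.mean(rel_dev <= max_rel_dev)) >= required_frac
```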
Machine learning methods, particularly long short-term memory (LSTM), bidirectional LSTM, stacked LSTM, and gated recurrent unit (GRU) networks, have been applied in PEMS approaches for analyzing and predicting air quality or emissions using time-series data (Xie et al 2019, Hamrani et al 2020, Tong et al 2019, Gao et al 2020, Gangopadhyay et al 2018, Hu et al 2021). Generally, the performance of these models improves with larger datasets and longer training time intervals. LSTMs tend to be more effective than static methods, such as convolutional neural networks, for systems with continuous change and multiple time scales within the data (Faruque et al 2022). However, a crucial question remains unresolved: how will training machine learning models with limited data impact the performance of methane emissions predictions?
In this study, we aim to address this question by comparing four machine learning methods: the LSTM neural network, stacked LSTM, bidirectional LSTM, and GRU. Each method is trained with varying lengths of time-series methane concentration data, ranging from one month to four years, collected from three monitoring stations and stored in the Alberta AirData Warehouse (Government of Alberta 2022a, 2022b). The primary objective is to identify the minimum length of time-series data required to develop machine learning models that achieve a given prediction performance.
By understanding the relationship between the amount of training data and prediction performance, our research endeavors to optimize the use of machine learning techniques for methane emissions prediction. Such insights could significantly contribute to the development of more efficient and cost-effective emission control strategies for Alberta's oil and gas industry, thereby advancing the overall goal of reducing greenhouse gas emissions and mitigating climate change impacts globally.

Data description
In the province of Alberta, air quality is monitored in airshed regions by industry, communities, and Alberta Environment and Parks (Alberta Environment and Parks 2021). In 1992, Canada established the National Pollutant Release Inventory (NPRI), which requires industries, businesses, and facilities to report releases and disposals of substances to the NPRI (Canada 2016). The CEMS Code in Alberta (Alberta Environment and Parks 2021) sets the requirements for installing, operating, maintaining, and certifying the equipment used in monitoring stations. These requirements provide a basis for quality assurance of the measurements and standards for reporting emissions in Alberta. As stated in the Code, there are strict requirements for design specifications, test procedures, data acquisition, data quality, certification, the quality assurance plan, missing data procedures, etc. Thus, we have assumed that all data from the AirData Warehouse have been quality checked.
For the analysis presented here, time-series methane data from three air quality monitoring stations, Calgary Southeast, Edmonton East, and Bruderheim, were selected from the Alberta AirData Warehouse; their locations are presented in figure 1. Table 1 lists the date range and the number of methane data points from each station.
Machine learning methods

The analysis of time-series data, such as methane concentration measurements, often involves the application of machine learning algorithms to make predictions, identify patterns, and extract valuable insights. In this context, three prominent algorithms are commonly used: Long Short-Term Memory (LSTM) neural networks, Gated Recurrent Unit (GRU) neural networks, and bidirectional LSTM (BiLSTM) networks.

LSTM is a type of recurrent neural network (RNN) designed to handle long-term dependencies in time-series data (Hochreiter and Schmidhuber 1997). Traditional RNNs have difficulty capturing long-term dependencies due to the vanishing or exploding gradient problem. LSTM overcomes these limitations by introducing a memory cell that allows information to be retained or forgotten over time. This memory cell is controlled by three main gates: the input gate, the forget gate, and the output gate. These gates control the flow of information within the LSTM cell, allowing it to selectively remember or discard information from previous time steps. LSTM has shown remarkable success in various time-series prediction tasks, making it a popular choice for methane concentration analysis.

GRU is another type of RNN that addresses the vanishing gradient problem (Chung et al 2014). It simplifies the LSTM architecture by combining the input and forget gates into a single 'update' gate. Additionally, it merges the cell state and hidden state, reducing the number of parameters and making training faster compared to LSTM. GRU has demonstrated competitive performance in sequence modelling tasks while being computationally efficient, making it a popular alternative to LSTM for time-series analysis (Gangopadhyay et al 2018).

Bidirectional LSTM is an extension of the traditional LSTM that considers both past and future information when making predictions (Schuster and Paliwal 1997). It processes the time-series data in two directions: from the past to the future (forward direction) and from the future to the past (backward direction). This allows the model to capture patterns and dependencies not only from past observations but also from future ones, enhancing its ability to understand complex temporal relationships in the data.

Each of the LSTM, GRU, and bidirectional LSTM neural networks possesses unique capabilities that make them effective in capturing temporal dependencies and patterns in the data, contributing to improved predictions and insights in the field of environmental monitoring and analysis.
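As an illustrative sketch of how these four architectures can be instantiated (here in Keras; the layer width and window length are placeholder values, not the tuned hyperparameters reported later in table 2):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, GRU, Bidirectional, Dense

WINDOW = 12  # look-back window length in time steps (placeholder value)

def build_model(kind: str, units: int = 12) -> Sequential:
    """Assemble one of the four recurrent architectures compared here."""
    model = Sequential()
    shape = (WINDOW, 1)  # univariate methane concentration series
    if kind == "lstm":
        model.add(LSTM(units, input_shape=shape))
    elif kind == "stacked_lstm":
        # First layer returns the full sequence so a second LSTM can consume it.
        model.add(LSTM(units, return_sequences=True, input_shape=shape))
        model.add(LSTM(units))
    elif kind == "gru":
        model.add(GRU(units, input_shape=shape))
    elif kind == "bilstm":
        model.add(Bidirectional(LSTM(units), input_shape=shape))
    else:
        raise ValueError(f"unknown model kind: {kind}")
    model.add(Dense(1))  # one-step-ahead methane concentration
    model.compile(optimizer="adam", loss="mse")
    return model
```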

LSTM and stacked LSTM
The LSTM cell is defined by:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

where $\sigma$ is the sigmoid function, $\sigma(x) = 1/(1 + e^{-x})$, $f_t$ denotes the forget gate, $i_t$ denotes the input gate, $o_t$ denotes the output gate, $\tilde{C}_t$ is the candidate cell state, $W_*$ denotes the weight matrix for the previous hidden state and current input, and $b_*$ are bias vectors. Here $\odot$ denotes the Hadamard (element-wise) product and $+$ represents pointwise addition. In this method, an input $x_t$ at time $t$ is fed to the network. The forget gate $f_t$ then decides which information from the previous output $h_{t-1}$ is discarded or retained. Then, the input gate $i_t$ decides which state will be updated. The outputs obtained from the forget and input gates, together with the candidate vector $\tilde{C}_t$ generated from a tanh layer, form the new cell state $C_t$. Finally, the result $h_t$ is obtained from the sigmoid and tanh layers. In the stacked LSTM, the output of the first hidden LSTM layer is fed into the next hidden layer as input.

GRU
The GRU cell is defined by:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\sigma$ is the sigmoid function, $z_t$ denotes the update gate, $r_t$ denotes the reset gate, $\tilde{h}_t$ denotes the candidate hidden state, $W_*$ denotes the weight matrix for the corresponding input vector, and $b_*$ is the bias. The update gate $z_t$ decides how much the unit updates its activation or content. The GRU can save computation time without sacrificing performance in some scenarios, such as long sequences.

BiLSTM
In the BiLSTM, information propagates not only backward to forward but also forward to backward using two layers of hidden states. The output layer is updated through the computation of the forward hidden sequence $\overrightarrow{h}_t$ and the backward hidden sequence $\overleftarrow{h}_t$. The output sequence $y_t$ connects the forward and backward layers, taking opposite time sequences over the same time slice $[1, T]$:

$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$$
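For concreteness, a didactic NumPy transcription of the GRU update equations above might read as follows. This is a sketch, not the study's implementation; the weight matrices are assumed to act on the concatenation of the previous hidden state and the current input:

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU update following the equations above. Each weight matrix
    maps the concatenated [h_prev, x_t] vector to the hidden size."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx + bz)   # update gate z_t
    r = sigmoid(Wr @ hx + br)   # reset gate r_t
    # Candidate hidden state uses the reset-gated previous state.
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1.0 - z) * h_prev + z * h_cand  # new hidden state h_t
```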

Measures of error
The root mean squared error (RMSE) is used to evaluate the accuracy of the model, given by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - P_i)^2}$$

where $Y_i$ is the actual value of methane concentration, $P_i$ is the predicted value, and $n$ is the number of data points. The RMSE has the benefits that larger errors are emphasized due to the squaring of differences and that the sign of the errors is eliminated, leading to a smoother training optimization space. The mean absolute error (MAE) evaluates model performance by providing an estimate of the average offset from the actual observation, given by:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - P_i|$$

where, again, $Y_i$ is the actual value and $P_i$ is the predicted value. The MAE provides a benefit compared to the RMSE in that it does not square the differences, which makes it less sensitive to outliers. Thus, the use of both RMSE and MAE provides appropriate measures of the deviation of the models from the data.
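Both measures are straightforward to compute; a minimal sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; squaring emphasizes larger errors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error; the average offset, less sensitive to outliers."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))
```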

Proposed framework and implementation details
The best-case window size, number of neurons per hidden layer, number of epochs, and batch size (the hyperparameters) were the values that resulted in the lowest RMSE and MAE. The hyperparameter values for each monitoring station and each method are listed in table 2. The best-case hyperparameters determined for training with the one-year case are used to train with the different time-interval training datasets. Predictions on the test datasets are all made for one year. Figure 5 presents the flowchart for the modelling framework.
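One common way to realize the windowing step of this framework is to slice the series into fixed-length input windows with next-step targets; the sketch below assumes a univariate series, with the actual window size being the tuned hyperparameter listed in table 2:

```python
import numpy as np

def make_windows(series: np.ndarray, window: int):
    """Slice a univariate series into overlapping look-back windows
    (inputs of shape (samples, window, 1)) and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    # Add the trailing feature dimension expected by recurrent layers.
    return np.asarray(X)[..., np.newaxis], np.asarray(y)
```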
For example, both the GRU and LSTM models use one hidden layer to model methane concentration. We found that for Calgary Southeast with one year of training data, eight neurons per hidden layer for the GRU model and twelve for the LSTM model gave the best prediction accuracy. The GRU model needs 50 epochs to reach its best accuracy, whereas the LSTM model needs only 10 epochs to reach its best performance. The BiLSTM and stacked LSTM models each have 12 neurons per hidden layer for modelling the Edmonton East and Bruderheim methane concentration data.
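A hypothetical training run combining the sketches above with the Calgary Southeast one-year settings just described (GRU: 8 neurons, 50 epochs; LSTM: 12 neurons, 10 epochs); the input file, batch size, and window length are placeholders, not the paper's actual values:

```python
import numpy as np

# train_series stands in for the station's training-period methane series.
train_series = np.loadtxt("calgary_southeast_ch4_train.csv")  # hypothetical file
X_train, y_train = make_windows(train_series, window=12)

gru = build_model("gru", units=8)
gru.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

lstm = build_model("lstm", units=12)
lstm.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
```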
The coding environment is described in table 3. Figure 6 shows the prediction results: the x-axis represents time and the y-axis the methane concentration in parts per million (ppm). The blue and green lines show the raw observed data used for training and validating the model, respectively. The orange line shows the predicted values from the machine learning model for the training phase, whereas the red line shows the predicted values for the test and validation phase. In general, the results demonstrate that the more extensive the training data, the more precise the prediction. Table 4 compares the RMSEs and MAEs of the methods versus the length of the training sample for both the training and testing datasets from the Calgary Southeast station (results for the Edmonton East and Bruderheim stations are shown in tables S1 and S2). In most cases, the MAE and RMSE decline with more training data. For example, the MAE for the LSTM is 0.363301 when trained with only one month of history, dropping to 0.334295 with two months of data. The MAE decreased to 0.225043 when four years of data were used for training, which is 38.06% better than the one-month result and 32.68% better than the two-month result. In general, the lowest RMSEs and MAEs were achieved with the GRU method when the training period was less than or equal to 1 year, whereas above that value the stacked LSTM achieved smaller RMSEs and MAEs.
The GRU and LSTM are both types of recurrent neural networks (RNNs) designed to handle sequential data and mitigate the vanishing gradient problem. While they share similarities, the GRU has a simplified architecture compared to the LSTM. There are specific circumstances where the GRU might outperform the LSTM, especially for a smaller training dataset:
1. the GRU has a simpler architecture with fewer parameters than the LSTM, leading to faster training and inference times;
2. the GRU tends to perform well on smaller datasets because it has fewer parameters to train and is less prone to overfitting;
3. the LSTM's gating mechanisms are designed to capture long-term dependencies, whereas the GRU's simpler gating can be advantageous when dealing with rapid changes in patterns; and
4. the GRU might handle noisy data or outliers better due to its simplified gating mechanism, whereas the LSTM's additional complexity could potentially make it sensitive to noisy inputs.
For longer training periods, the stacked LSTM outperforms the GRU because:
1. stacked LSTM networks consist of multiple LSTM layers stacked on top of each other; the increased depth enables the model to learn more complex representations and patterns in the data, and with a longer training dataset there is a higher likelihood of intricate temporal dependencies that can be captured by deeper architectures;
2. longer sequences often exhibit hierarchical structures, where information at different time scales contributes to the overall patterns, and stacked LSTMs, with multiple layers, can learn to capture different levels of abstraction and hierarchy within the data;
3. stacked LSTMs can facilitate a smooth flow of information and gradients across multiple layers, which is important for longer sequences since it helps mitigate the vanishing gradient problem and ensures that information from distant time steps can still influence the predictions; and
4. longer sequences may contain more noise or variability, and the multiple layers of a stacked LSTM can help filter out noise and focus on the underlying patterns, making it a more robust model for prediction.
Figure 7 presents detailed comparisons of the MAE (left) and RMSE (right) between the different machine learning models for the monitoring stations. The modelling of the Calgary Southeast methane concentration data emphasizes the importance of data quantity for modelling accuracy: in general, the greater the time interval of training data, the higher the precision of the model. Furthermore, the LSTM and stacked LSTM models outperform the BiLSTM and GRU models. Figure 8 shows the detailed evaluation of the MAE and RMSE for the different modelling approaches for all of the monitoring stations. For the Bruderheim monitoring station, the results show that the greater the amount of training data, the more accurate the prediction from the model. We also note that even though the prediction performance on the testing dataset improved with more training data, the improvement diminishes. For example, the MAE trend for the GRU model is almost a flat line for the testing dataset, and the MAE trend for the LSTM model flattens after training with six months of data. The BiLSTM and stacked LSTM models show only minor improvements after one year of data. For the Bruderheim monitoring station, one year of training data is therefore sufficient for an efficient prediction model, rather than taking the extra time to train the model with a more extended period of data.

Results and discussion
To reach an average offset of 10% between the predicted and observed values, the results illustrate that, for all three monitoring stations, six months is a reasonable length for the training dataset, achieving an offset of 3.36% for the Bruderheim data, 10.72% for the Calgary Southeast data, and 3.53% for the Edmonton East data. With one year of training, the offsets achieved are 3.13% for the Bruderheim data, 9.44% for the Calgary Southeast data, and 3.10% for the Edmonton East data. Table 5 lists the goodness-of-fit coefficients for the models for each of the datasets. The results show that the longer the training data interval, the better the fit. However, there is little improvement in the goodness-of-fit when the data are extended beyond 1 year, and for some of the methods the quality of the fit declines beyond 1 year. Among the methods, the highest goodness-of-fit with only 1 month of training data is achieved with the GRU, followed by the LSTM. The GRU method is notable in that its 1-month result is only slightly lower than its 6-year result. The most improved method between training periods of 1 month and 6 years was the stacked LSTM. For training intervals beyond 1 year, all of the methods had similar results on the testing dataset. Table 6 lists the computation time in minutes required for training the models for the different monitoring stations. The results show that the computation time is roughly linear in the number of years of training. Thus, there is an incentive to determine the minimum amount of training required to achieve a specific threshold of precision. For extended training datasets, the stacked LSTM outperforms the GRU, LSTM, and BiLSTM variants for predicting the testing datasets, but it takes more computation time to train. The computational time needed varies not only with the method but also with the input dataset: for the Calgary Southeast dataset, the GRU models have better prediction performance with smaller training datasets but require several times the computational time of the LSTM and BiLSTM methods. For this dataset, the stacked LSTM method provides better accuracy with a larger training time interval, but its computational time is significantly greater than that of the other methods. For the Edmonton East dataset, the GRU time is comparable to that of the BiLSTM; both are higher than the LSTM. For the Bruderheim dataset, the GRU and LSTM have relatively low computational times, whereas the BiLSTM has the highest time requirement.

Conclusions
There is a need to reduce methane emissions, and monitoring systems are needed to ensure that emissions are quantified. Continuous emissions monitoring systems (CEMS) provide these data but are costly; predictive emissions monitoring systems (PEMS), used together with data from CEMS, can offset the costs of CEMS while also providing projections of future emissions. Machine learning based PEMS are a potentially powerful tool for methane emissions prediction. For machine learning, it is generally recognized that the greater the amount of training data, the more accurate the predictions of the resulting models. This paper evaluates four state-of-the-art variants of recurrent neural networks, namely LSTM, stacked LSTM, bidirectional LSTM, and GRU, with different time intervals of training data, which are then used to forecast one year of methane concentration data from three air quality monitoring stations in Alberta, Canada.

First, the results show that GRU neural networks are more effective for smaller (≤1 year) training datasets, whereas stacked LSTM neural networks slightly outperform the other methods for longer periods (>1 year) of training data. Second, the experiments confirm that the more data used for training, the better the prediction performance; however, the performance gains diminish beyond 6 months of training data. This suggests that, for the methane concentration data examined here, a training interval of 6 months is sufficient for reasonably accurate forecasting of one year. This is exemplified by the Bruderheim monitoring station, where the performance from one year of training is very close to that from training with two and even four years' worth of data. Thus, the results illustrate that using more time-series data for training does not necessarily yield a significantly better model, which must be weighed against the computation time required for training. Given the datasets and methods examined here, if only 6 months of training are required, the GRU appears to yield the most accurate results with a reasonable computational time. The results suggest that, for practical application of machine learning based PEMS, the smallest dataset needed to reach a reasonable prediction RMSE and MAE should be determined. They also suggest that, for the methane datasets examined here, updates to PEMS models based on training from CEMS data should use a CEMS monitoring time interval of at least 6 months. This would especially be the case if new industrial or other developments are occurring that could lead to greater methane emissions whose behaviour would not be well represented by previously trained PEMS models. If there are no such developments, then updating a previously trained PEMS model with a shorter CEMS monitoring time interval may be effective.