A data-driven approach for PM2.5 estimation in a metropolis: random forest modeling based on ERA5 reanalysis data

Air pollution in urban environments, particularly from fine particulate matter (PM2.5), poses significant health risks. Addressing this issue, the current study developed a Random Forest (RF) model to estimate hourly PM2.5 concentrations in Ankara, Türkiye. Utilizing ERA5 reanalysis data, the model incorporated various meteorological and environmental variables. Over the period 2020–2021, the model’s performance was validated against data from eleven air quality monitoring stations, demonstrating a robust coefficient of determination (R2) of 0.73, signifying its strong predictive capability. Low root mean squared error (RMSE) and mean absolute error (MAE) values further affirmed the model’s precision. Seasonal and temporal analysis revealed the model’s adaptability, with autumn showing the highest accuracy (R2 = 0.82) and summer the least (R2 = 0.51), suggesting seasonal variability in predictive performance. Hourly evaluations indicated the model’s highest accuracy at 23:00 (R2 = 0.93), reflecting a solid alignment with observed data during nocturnal hours. On a monthly scale, November’s predictions were the most precise (R2 = 0.82), while May presented challenges in accuracy (R2 = 0.49). These seasonal and monthly fluctuations underscore the complex interplay of atmospheric dynamics affecting PM2.5 dispersion. By integrating key determinants such as ambient air temperature, surface pressure, total column water vapor, boundary layer height, forecast albedo, and leaf area index, this study enhances the understanding of air pollution patterns in urban settings. The RF model’s comprehensive evaluation across time scales offers valuable insights for policymakers and environmental health practitioners, supporting evidence-based strategies for air quality management.


Introduction
Urban air quality is a critical determinant of environmental health, reflecting pollutant levels and overall atmospheric purity within densely populated areas. Rapid urbanization, coupled with human activities such as industrial processes, transportation, energy generation, and residential emissions, is a significant source of air pollution in urban environments (Elbir et al 2010, Elbir et al 2011, Wang et al 2013, Huang et al 2014, Kara et al 2014, Shen et al 2019, Mentese et al 2020). These activities release a wide range of pollutants, such as particulate matter (PM2.5 and PM10), nitrogen oxides (NOx), sulfur dioxide (SO2), carbon monoxide (CO), and volatile organic compounds (VOCs) (Elbir and Muezzinoglu 2004, Elbir et al 2007, Ogah et al 2020). Of particular significance, PM2.5 comprises fine particles with an aerodynamic diameter of 2.5 micrometers or smaller, capable of penetrating deep into the respiratory system and harming human health (Amnuaylojaroen and Parasin 2023). Prolonged exposure to PM2.5 has been linked to various respiratory and cardiovascular ailments, including asthma, bronchitis, and myocardial infarction (Manisalidis et al 2020, Basith et al 2022). Therefore, monitoring PM2.5 concentrations is pivotal in assessing urban air quality.
Air quality monitoring networks in urban areas typically rely on measuring PM2.5 levels at designated stations. However, where such monitoring systems cannot provide comprehensive coverage or real-time data, accurate prediction of air pollutant levels becomes especially important. Traditional air quality monitoring stations may have limitations in spatial distribution, temporal coverage, and the ability to capture localized pollution patterns. In such cases, integrating machine learning techniques for air quality prediction offers a valuable solution. These methods leverage historical data, meteorological information, and other relevant factors to generate reliable predictions, filling the gaps left by monitoring networks (Liang et al 2018, Tuna Tuygun et al 2021, Gündoğdu et al 2022, Tuna Tuygun and Elbir 2023).
Machine learning techniques offer numerous advantages for predicting PM2.5. They enable the analysis of historical and real-time data to forecast PM2.5, thereby providing valuable insights for various applications. Machine learning-based early warning systems allow the detection of potential high pollution episodes in advance, facilitating preventive measures and reducing associated health risks (Ahani et al 2020). Accurate PM2.5 estimations support effective air quality management, empowering policymakers to make informed decisions on emission controls, urban planning, and targeted interventions for pollution reduction (Chen et al 2023). Additionally, estimating PM2.5 levels helps protect public health by enabling individuals, especially vulnerable populations, to make informed choices and minimize exposure to harmful levels (Zhang and Awang 2023). Machine learning-based predictions also aid in optimizing resource allocation by identifying areas with higher PM2.5 levels, allowing authorities to prioritize pollution control measures (Dong et al 2009). Leveraging machine learning algorithms for PM2.5 prediction contributes to scientific research and evidence-based policy development, deepening the understanding of air pollution dynamics and supporting the implementation of measures to improve air quality.
Recent studies have focused on developing models for PM2.5 prediction, employing various machine-learning algorithms and techniques to enhance accuracy (Gupta et al 2021, Gündoğdu et al 2022, Tuna Tuygun et al 2022, Wei et al 2020, 2022, Yang et al 2022, Karimian et al 2023, Kim et al 2023, Wang et al 2021, 2023). These models utilize meteorological data and consider temporal dependencies in air quality measurements. Whereas certain studies depend solely on the observation data generated by meteorological stations (Bera et al 2021, Yang et al 2021, Gayen et al 2022, Suriya et al 2023), many studies use meteorological information from reanalysis databases (Gündoğdu et al 2022, Tuna Tuygun et al 2022, Wang et al 2023). The fifth-generation ECMWF (European Centre for Medium-Range Weather Forecasts) atmospheric reanalysis (ERA5) is a widely used global reanalysis validated through comparisons with observational data (Hersbach et al 2020). It provides reliable insights into meteorological conditions on a worldwide scale. Zuo et al (2023) also evaluated the performance of four reanalysis datasets for satellite-based PM2.5 retrieval in China. The researchers observed that the ERA5 dataset demonstrated the highest level of agreement with in situ measurements for retrieving PM2.5 levels.
Numerous studies estimating PM2.5 concentrations from reanalysis data have reported high levels of accuracy. For instance, Wang et al (2023) achieved a notably high coefficient of determination (R2 = 0.96) in China. Such results are rooted in the application of different machine-learning methodologies. Several studies are primarily dedicated to comparing diverse prediction models (Bera et al 2021, Yang et al 2021, Gayen et al 2022, Mengfan et al 2022, Suriya et al 2023, Wang et al 2023). Other studies have diverged from this trend and employed a single prediction model, such as gradient boosting (Gündoğdu et al 2022) or artificial neural networks (Tuna Tuygun et al 2022). Specifically, in a recent investigation by Wang et al (2023), the random forest (RF) method demonstrated exceptional efficacy in constructing a PM2.5 model. The model utilized a combination of meteorological parameters, such as precipitable water vapor, water vapor pressure, and relative humidity, along with several air pollutants, such as O3, CO, NO2, SO2, and PM10. The results exhibited remarkable performance, reaffirming the potential of RF in air quality modeling and forecasting. In another study, which used seven meteorological, three land-use, and aerosol optical depth (AOD) parameters as inputs to evaluate the predictive capabilities of various models for PM2.5 in Delhi-NCT, India, the RF model outperformed other techniques, achieving an R2 value of 0.68, while gradient boosting, support vector machines, and artificial neural networks yielded lower R2 values.
In Türkiye, there has been a discernible surge in scientific research in recent years on methodologies for estimating particulate matter concentrations (Zeydan and Wang 2019, Bozdağ et al 2020, Gündoğdu 2020, Tuna Tuygun et al 2022, Yağmur 2022, Tuna Tuygun and Elbir 2023). These investigations have harnessed meteorological data to facilitate precise estimations. In some of these studies, meteorological information was sourced from the meteorological stations of the Turkish State Meteorological Service (Yağmur 2022), while others utilized data from reanalysis databases such as the National Centers for Environmental Prediction (NCEP) Climate Forecast System Reanalysis (CFSR) (Zeydan and Wang 2019) and MERRA-2 (Tuna Tuygun et al 2022). To our knowledge, no prior attempt has been made to estimate PM2.5 concentrations using ERA5 data within an urban context in Türkiye.
This study aims to develop a robust predictive model using a random forest approach to estimate hourly PM2.5 concentrations in the Ankara metropolitan area of Türkiye. The model utilized input variables from ERA5 reanalysis, consisting of meteorological, atmospheric, and land-related parameters such as ambient air temperature, surface pressure, total column water vapor, boundary layer height, forecast albedo, and leaf area index. The performance and applicability of the model were evaluated by comparing the estimated PM2.5 concentrations with observations from air quality monitoring stations across the city. This research advances air quality estimation methodologies and provides valuable insights for effective air quality management and policy development.

Study area
This investigation focused on Ankara, the capital of Türkiye, situated centrally within the country. Encompassing an area of approximately 26 thousand square kilometers, Ankara is positioned between latitudes 39.7° and 40.0° N and longitudes 32.6° and 33.0° E, as depicted in figure 1. With a population exceeding five million, Ankara is one of the largest cities in Türkiye and represents a vibrant and dynamic urban environment.
The climate in the city falls within the Köppen-Geiger Climate Classification of a warm-summer Mediterranean climate (Csb-type), characterized by hot, dry summers and warm winters (Turhan et al 2023). Between 1927 and 2022, average winter temperatures ranged from 0.7 °C (January) to 2.6 °C (December), while average summer temperatures ranged from 20.0 °C (June) to 23.5 °C (August) (TSMS 2023).
The air quality in the city is influenced by various factors, including emissions from industrial facilities, vehicular traffic, residential heating systems, and the geographical characteristics of the surrounding region (Kadioǧlu et al 2010, Bari and Kindzierski 2015, Ulutaş et al 2021). Additionally, meteorological conditions, such as wind patterns, temperature inversions, and precipitation, play a significant role in the dispersion and accumulation of air pollutants in the city (He et al 2017, Tiğli and Cangür 2019, Bei et al 2020, Deak et al 2020, Ulutaş et al 2021).
Ankara has an air quality monitoring network comprising seventeen strategically placed monitoring stations measuring various air pollutants, including PM10, PM2.5, NO2, SO2, CO, and O3. During the study period from 2020 to 2021, only eleven of these stations were equipped to measure PM2.5. The locations of these stations and their annual average PM2.5 concentrations are given in figure 1.
In figure 1, the air quality monitoring stations are categorized by their geographical context and urbanization levels, in line with observations from Huang et al (2019), who reported on the formation of urban particulate matter islands. Stations such as Bahcelievler, Demetevler, Ostim, and Siteler have been classified as 'urban' due to their dense city center locations where traffic and industrial activities are concentrated. In contrast, Etimesgut and Yaşamkent are categorized as 'semi-urban,' reflecting their position in areas with a mix of residential and industrial land use, yet less densely populated than city centers. Törekent is an example of a 'rural' station situated in open spaces with agricultural land use and lower levels of industrial activity.
This classification is instrumental in dissecting the spatial distribution of PM2.5 concentrations, revealing a gradient that correlates with urbanization intensity, as Huang et al (2019) noted. Urban stations generally report higher PM2.5 values, underscoring the influence of anthropogenic activities on air quality. Semi-urban stations exhibit intermediate values, while rural stations show the lowest concentrations, indicating fewer pollution sources.

Data collection
Air quality and meteorological data were employed in this study. The meteorological data were sourced from the ERA5 reanalysis and can be accessed at https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset (CCDS 2023). Hourly PM2.5 concentration data were obtained from the national air quality monitoring network operated by the Ministry of Environment, Urbanization, and Climate Change (NAQMN 2023). This dataset includes readings from eleven PM2.5 monitoring stations across Ankara, selected based on their capacity for accurate PM2.5 measurement within different city regions. Covering the period from 2020 to 2021, this dataset provides a comprehensive view of air quality fluctuations.

Estimation modeling approach

Preprocessing and data preparation
A spatiotemporal collocation was performed to integrate meteorological data with PM2.5 measurements. This process involved aligning ERA5 meteorological grid cells with the geographical locations of PM2.5 monitoring stations, considering the differences in resolution and temporal frequency between the datasets. Such alignment produced a dataset in which each PM2.5 monitoring station location is paired with the corresponding meteorological information. ERA5 data, with a spatial resolution of 0.25° by 0.25° (approximately 30 km), were mapped to the monitoring stations, and the relevant meteorological data were extracted from the corresponding grid cells using MATLAB R2023a. PM2.5 values within a 30 km search radius were averaged to obtain an aggregated value for each ERA5 grid cell at each hour. This data integration process enables a more holistic analysis of the relationship between meteorological conditions and PM2.5 concentrations, allowing for more accurate predictions and insights into air pollution dynamics (Zhang et al 2015, Yang et al 2017, Chen et al 2020). During sample selection, only data points with complete information for all independent variables and hourly PM2.5 were retained for evaluation.
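The collocation step described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function names, grid-cell centres, station coordinates, and PM2.5 values are all hypothetical, and the great-circle (haversine) distance is an assumed way of applying the 30 km search radius.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def collocate(grid_centres, observations, radius_km=30.0):
    """Average station PM2.5 within radius_km of each grid-cell centre, per hour.

    grid_centres: list of (lat, lon) tuples for grid-cell centres.
    observations: list of (hour, lat, lon, pm25) station records.
    Returns {(cell_index, hour): mean_pm25} for cells with at least one station.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, lat, lon, pm25 in observations:
        for idx, (glat, glon) in enumerate(grid_centres):
            if haversine_km(lat, lon, glat, glon) <= radius_km:
                sums[(idx, hour)] += pm25
                counts[(idx, hour)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical grid-cell centres and station records (not the study's data).
grid_centres = [(40.00, 32.75), (39.50, 32.75)]
observations = [
    ("2020-01-01T00", 39.95, 32.80, 20.0),
    ("2020-01-01T00", 39.93, 32.70, 30.0),
    ("2020-01-01T01", 39.95, 32.80, 10.0),
]
collocated = collocate(grid_centres, observations)
```

In this toy case both stations lie within 30 km of the first cell centre only, so the second cell receives no aggregated value and would be dropped by the completeness filter.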

Prediction model
In this study, a preliminary investigation was conducted to select the prediction model. Four distinct methods, namely Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), Support Vector Regression (SVR), and Random Forest (RF), were evaluated using the same dataset. The RF model showed the most promising results following detailed analyses and was therefore selected for this study. The results of this preliminary study, including the performance of each model, are presented comparatively in the supplementary materials (table S1).
The RF model, foundational to our analysis, is a machine-learning algorithm recognized for its efficacy in both regression and classification tasks. Originating from the work of Breiman (2001), RF enhances predictive accuracy and mitigates overfitting through an ensemble approach that combines multiple decision trees, each constructed from a randomly selected subset of data and features. This ensemble method, leveraging the strengths of individual trees, allows for improved generalization across diverse data samples.
RF's versatility is evident in its ability to manage high-dimensional data and uncover complex relationships between variables. It achieves this while offering robustness against outliers and missing values, making it particularly suited for environmental data analysis such as PM2.5 concentration estimation. Furthermore, the algorithm provides insights into feature importance, aiding in identifying significant predictors within the model.
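The ensemble idea can be made concrete with a deliberately small Python sketch. It is an illustration under simplifying assumptions, not the model used in the study: each "tree" is a one-split regression stump, the helper names are hypothetical, and the data are synthetic. A full RF grows deep trees and redraws the random feature subset at every split.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree (stump) using only the given features."""
    overall = sum(targets) / len(targets)
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= thr]
            right = [t for row, t in zip(rows, targets) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # every candidate feature was constant in this sample
        return lambda row: overall
    _, f, thr, lm, rm = best
    return lambda row: lm if row[f] <= thr else rm


def fit_forest(rows, targets, n_trees=50, max_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        feature_ids = rng.sample(range(n_features), max_features)
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], feature_ids))
    return forest


def predict(forest, row):
    """Ensemble prediction: the average of the individual tree predictions."""
    return sum(tree(row) for tree in forest) / len(forest)


# Synthetic data: the target depends on the first feature only.
rows = [(i, (7 * i) % 10) for i in range(10)]
targets = [0.0 if i < 5 else 10.0 for i in range(10)]
forest = fit_forest(rows, targets, n_trees=50, max_features=2)
low, high = predict(forest, rows[0]), predict(forest, rows[9])
```

Averaging over trees trained on different bootstrap samples is what reduces the variance of the individual trees.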
For the RF model employed in this study, key hyperparameters were carefully optimized, including the number of trees (n_estimators) and the criteria for tree construction (e.g., min_samples_split, max_features), to ensure the model's efficacy in predicting PM2.5 levels. Hyperparameter tuning was conducted using the GridSearch function, with an optimal configuration established for accurate and reliable estimations. The model incorporated 50 trees, with hyperparameter specifics and their tuning outcomes detailed in table 1.
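An exhaustive grid search of this kind can be sketched in a few lines of Python. This is a generic illustration rather than the routine used in the study: the grid values and the scoring function below are hypothetical stand-ins, and in practice `score_fn` would train an RF with the given hyperparameters and return a cross-validated score.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid; return (best_params, best_score).

    score_fn(params) must return a score where higher is better,
    for example a cross-validated R2.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values actually searched are not given in the text.
param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 5],
    "max_features": [2, 4, 6],
}


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at an arbitrary configuration."""
    return -(abs(params["n_estimators"] - 50) / 50
             + abs(params["min_samples_split"] - 2)
             + abs(params["max_features"] - 4))


best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the grid sizes (here 3 × 2 × 3 = 18 evaluations), which is why grids are usually kept coarse.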
A 10-fold cross-validation strategy was employed to enhance the robustness and reliability of the model's predictions. This method partitions the input data into ten distinct subsets and uses each subset in turn as the test set, with the remaining nine subsets used for training. Such a validation framework is particularly advantageous for datasets prone to imbalances, as it ensures that each data point contributes to both training and testing, thereby minimizing bias and enhancing the generalization ability of the model across diverse scenarios and pollution events.
This approach is instrumental in air quality estimation, where temporal and spatial variability significantly impacts model performance. Through 10-fold cross-validation, the model's performance under various conditions can be rigorously assessed, providing a more comprehensive evaluation of its predictive accuracy and consistency. Moreover, this strategy supports the refinement of model hyperparameters, further optimizing the model's effectiveness.
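The partitioning behind k-fold cross-validation can be sketched as follows. The function name is hypothetical, and random shuffling before splitting is an assumption, since the text does not state how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: folds are drawn at random
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=10))
```

Each of the ten splits trains on nine folds and tests on the remaining one, and the cross-validated metrics are the averages over the ten test folds.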
All computational procedures, including data partitioning, model training, and evaluation, and the generation of plots to visualize the model's performance, were conducted using Python 3.11 and MATLAB R2023a.
For readers seeking a deeper understanding of RF's mechanics and applications, Breiman (2001) and Zhu et al (2022) offer extensive insights into the algorithm's foundational principles and advanced implementations. Model performance was evaluated using the coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). In the formulas for these metrics, y pred represents the predicted values, y obs represents the observed values, N represents the number of data points, and y obs_avg and y pred_avg represent the average values of y obs and y pred, respectively.
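For reference, the standard forms of these metrics are given below in the notation just defined. RMSE, MAE, and MAPE follow their usual definitions. The form of R2 is an assumption: because the definitions refer to the averages of both the observed and the predicted values, it is written here as the squared Pearson correlation.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\mathrm{pred},i}-y_{\mathrm{obs},i}\right)^{2}} \\
\mathrm{MAE}  &= \frac{1}{N}\sum_{i=1}^{N}\left|y_{\mathrm{pred},i}-y_{\mathrm{obs},i}\right| \\
\mathrm{MAPE} &= \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_{\mathrm{pred},i}-y_{\mathrm{obs},i}}{y_{\mathrm{obs},i}}\right| \\
R^{2} &= \frac{\left[\sum_{i=1}^{N}\left(y_{\mathrm{obs},i}-y_{\mathrm{obs\_avg}}\right)\left(y_{\mathrm{pred},i}-y_{\mathrm{pred\_avg}}\right)\right]^{2}}
{\sum_{i=1}^{N}\left(y_{\mathrm{obs},i}-y_{\mathrm{obs\_avg}}\right)^{2}\,\sum_{i=1}^{N}\left(y_{\mathrm{pred},i}-y_{\mathrm{pred\_avg}}\right)^{2}}
\end{align}
```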

Descriptive statistics
The comprehensive analysis of PM2.5 measurements, aggregated from 11 monitoring sites over two years, yielded significant insights into the spatial and temporal dynamics of air quality within the study area. Table 2 presents the descriptive statistics of the variables considered in this study, including meteorological, atmospheric, and land-related features, alongside the PM2.5 concentrations.
These statistics reveal the variability in meteorological and atmospheric conditions across the study period, highlighting the importance of considering such variables in PM2.5 concentration estimation models. Notably, the average PM2.5 concentration of 17.6 μg m−3, with a standard deviation of 15.9 μg m−3, underscores the fluctuating nature of air quality within the region. The wide range in PM2.5 values, from as low as 0.09 μg m−3 to as high as 358.9 μg m−3, reflects both the episodic nature of high pollution events and the effectiveness of air quality management strategies in reducing pollution during certain periods. The seasonal variation in PM2.5 concentrations, with winter months exhibiting significantly higher levels than summer, can be attributed to increased heating emissions and less effective dispersion of pollutants in Ankara due to atmospheric conditions prevalent during colder months (Ulutaş et al 2021). The descriptive statistics for the predictive features, such as total cloud cover, surface pressure, and boundary layer height, further illustrate the complex interplay between meteorological conditions and air quality, emphasizing the necessity of incorporating various variables for accurate PM2.5 prediction.

Feature selection and variable importance analysis
A variable importance (VI) analysis was conducted to discern the contribution of each predictor within the estimation process. This analysis employed the VI score, a metric that reflects the frequency with which a feature is used across the decision trees within the RF model. Initially, all features were incorporated into the construction of the RF model to evaluate their importance as indicated by the VI score.
Following the VI analysis, the six most significant variables were identified: ambient air temperature at 2 meters (T2M), surface pressure (SP), total column water vapor (TQV), boundary layer height (PBLH), forecast albedo (FA), and leaf area index high vegetation (LAIHV). These variables exhibited VI scores exceeding the established threshold value of 6%, warranting their selection as input parameters for the estimation model. The ranking of features, based on their VI scores, is depicted in figure 2. In this figure, the variables with the highest scores are marked in green, signifying their crucial role as determinants in the PM2.5 prediction. Conversely, variables shown in brown registered lower VI scores and were consequently excluded from the final model inputs.
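The threshold-based selection can be expressed as a short Python sketch. The 6% threshold and the six retained variable names come from the text. The VI scores themselves, and the excluded names TCC, U10, and V10, are hypothetical placeholders rather than the values plotted in figure 2.

```python
def select_features(vi_scores, threshold=6.0):
    """Return features whose VI score (in percent) exceeds threshold, highest first."""
    ranked = sorted(vi_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Hypothetical VI scores in percent (illustrative only, not the study's results).
vi_scores = {
    "T2M": 18.0, "SP": 14.0, "TQV": 12.0, "PBLH": 11.0, "FA": 10.0, "LAIHV": 8.0,
    "TCC": 4.0, "U10": 3.0, "V10": 2.5,
}
selected = select_features(vi_scores)
```

With these placeholder scores, only the six variables above the threshold are passed on as model inputs.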

Performance evaluation of the RF model
The relationship between the hourly measured and estimated PM2.5 concentrations for both the training and test datasets is depicted in figure 3. These scatter plots offer a visual assessment of the model's performance.
During the training phase, the model achieved a cross-validated coefficient of determination (CV-R2) of 0.75, indicating a strong predictive capability. The cross-validated root mean square error (CV-RMSE) was 7.9 μg m−3, and the cross-validated mean absolute error (CV-MAE) was 4.8 μg m−3, suggesting a high level of precision in the model's predictions. Furthermore, the cross-validated mean absolute percentage error (CV-MAPE) of 41.7% reflects the variability in the model's performance across different ranges of PM2.5 concentrations.
In the test phase, the model maintained a commendable level of performance, as indicated by an RMSE of 8.7 μg m−3 and an MAE of 4.7 μg m−3. The MAPE for the test data, at 39.8%, was slightly lower than that of the training data, implying consistency in the model's estimation accuracy. The R2 value for the test data was 0.73, confirming the model's effective generalization from the training to the testing phase. Such a coefficient of determination is substantial and aligns with the benchmarks of acceptability referenced in the field of air quality modeling (Ahmad et al 2019). This value indicates that the model captures much of the variation in PM2.5 concentrations, which is essential for reliable air quality prediction.
Overall, the estimated PM2.5 concentrations align closely with the observed data, as evidenced by the density of points near the red line of perfect agreement. These results suggest that the model can reliably estimate PM2.5 concentrations with a consistent performance across both training and testing datasets.
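The four reported metrics can be computed as in the following Python sketch. The function name and the sample values are hypothetical, and R2 is taken here as the squared Pearson correlation between observed and predicted values, which is an assumption about the definition used in the study.

```python
import math


def evaluate(y_obs, y_pred):
    """Return R2 (squared Pearson correlation), RMSE, MAE, and MAPE (percent)."""
    n = len(y_obs)
    errors = [p - o for o, p in zip(y_obs, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / o) for e, o in zip(errors, y_obs)) / n
    obs_avg = sum(y_obs) / n
    pred_avg = sum(y_pred) / n
    cov = sum((o - obs_avg) * (p - pred_avg) for o, p in zip(y_obs, y_pred))
    var_obs = sum((o - obs_avg) ** 2 for o in y_obs)
    var_pred = sum((p - pred_avg) ** 2 for p in y_pred)
    r2 = cov * cov / (var_obs * var_pred)
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MAPE": mape}


# Hypothetical hourly concentrations in micrograms per cubic metre.
y_obs = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]
metrics = evaluate(y_obs, y_pred)
```

Because MAPE divides each error by the observed value, hours with very low observed concentrations can inflate it even when the absolute errors are small, which is one reason it can be high while RMSE and MAE remain low.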
Seasonal performance assessment of the RF model reveals variability in its predictive accuracy, as shown in figure 4. For the seasonal assessments, hourly PM2.5 data falling within the relevant seasons were used, ensuring that the predictions reflected the specific seasonal conditions.
At the monthly scale (figure 5), November yielded the most precise predictions (R2 = 0.82). Conversely, the model's performance dipped in the warmer months of May through August, with R2 values hovering around 0.50. This decrease in accuracy during the summer may be partly attributed to higher PBLH values, which lead to greater dispersion of pollutants and, consequently, lower PM2.5 concentrations. The relationship between PBLH and PM2.5 becomes less pronounced as the boundary layer expands, making it more challenging for the model to predict PM2.5 levels accurately (Tuna Tuygun and Elbir 2020). In winter, the inverse relationship is more marked, with high PM2.5 concentrations coinciding with a shallow boundary layer, a pattern the model captures more effectively. The importance of PBLH in model predictions is further corroborated by the SHAP (SHapley Additive exPlanations) summary plot (figure 8), which indicates a significant contribution of the PBLH feature to the model's predictions, especially when its values are low. This influence underscores the necessity of incorporating complex atmospheric dynamics into predictive models for enhanced accuracy across different seasons.
In this study, the interpretability of the RF model was enhanced using the SHAP method, a technique increasingly utilized for its ability to provide transparency into machine learning predictions (Hu et al 2022, Cakiroglu et al 2024). The SHAP summary plot, as illustrated in figure 8(a), delineates the influence of various factors on the RF model output, whether negative or positive. Each point on the plot corresponds to an individual sample within the dataset, with the color gradient from blue to red representing the feature's value in the sample. Specifically, blue points indicate lower feature values, while red points indicate higher values.
The positioning of each point on the x-axis provides insight into the feature's impact on the model's output for the given data point. Points to the right of the zero line suggest that the feature contributes to an increase in the predicted PM2.5 levels, whereas points to the left indicate a decreasing influence.
According to the SHAP analysis, FA shows the most considerable impact on the model's predictions, with points predominantly in the blue spectrum on the positive side, implying that lower FA values are associated with higher predicted PM2.5 concentrations. Similarly, lower values of PBLH also correlate with higher PM2.5 predictions. Conversely, the LAIHV and SP demonstrate a tendency for higher values to increase PM2.5 predictions, as indicated by the clustering of red points on the positive side.
The SHAP summary plot also reveals that T2M and TQV have a more nuanced influence on the model output, as indicated by their mixed color distribution on both sides of the zero SHAP value line. This suggests that the relationship between these features and PM2.5 levels is not strictly linear and may be influenced by other interacting atmospheric conditions.
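To make the quantity on the x-axis of a SHAP summary plot concrete, the sketch below computes exact Shapley values for a single sample by enumerating all feature coalitions. It is a brute-force illustration, not the algorithm used by SHAP implementations for tree models: the toy model, its coefficients, the sample, and the baseline are all hypothetical, and features outside a coalition are simply set to their baseline values.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model at x, relative to a baseline input."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their actual value, the rest the baseline.
        return model([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical linear stand-in for a PM2.5 model with inputs PBLH (m) and T2M (deg C)."""
    pblh, t2m = features
    return 30.0 - 0.01 * pblh + 0.2 * t2m


x = [200.0, 5.0]           # a sample with a shallow boundary layer
baseline = [1000.0, 12.0]  # reference input, e.g. dataset averages
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the low PBLH value receives a positive Shapley value, so it would appear as a blue point to the right of the zero line, and the contributions sum to the difference between the prediction for the sample and the prediction for the baseline.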
Further underscoring the importance of these features, figure 8(b) ranks them by their average absolute SHAP value, confirming their pivotal roles in the RF model's predictive capability for PM2.5 concentration. By integrating SHAP analysis, the study not only enhances the model's transparency but also provides valuable insights into the atmospheric dynamics influencing PM2.5 levels, facilitating more informed decision-making for air quality management.
Figure 9 illustrates the hourly predictive accuracy of the RF model for PM2.5 levels in Ankara, detailing the variation in model performance at different times throughout the day. This figure synthesizes the model's hourly performance over the full study period from January 1, 2020, to December 31, 2021. The highest R2 value, achieved at 23:00 (0.93), indicates strong agreement between observed and predicted PM2.5 levels during this hour. Notably, the model demonstrates enhanced accuracy during the night and early morning hours, from 23:00 to 05:00, with R2 values of 0.83 or higher. These findings underscore the model's ability to capture PM2.5 concentrations at night, when traffic emissions are typically reduced and atmospheric conditions are more stable. However, this diurnal variation in model accuracy cannot be solely attributed to traffic emissions, particularly considering the reduced mobility and altered traffic patterns during the COVID-19 curfews. Rather, the variability is likely attributable to a combination of factors, including atmospheric mixing dynamics, the timing of photochemical activity driven by solar radiation, and other anthropogenic factors that are not directly related to vehicle traffic but may still affect particulate matter levels.
Further, the afternoon enhancement in model performance around 16:00 could be associated with the dispersal of pollutants and the relative stabilization of atmospheric conditions as the boundary layer height increases. Additionally, the morning reduction in performance may be partially explained by the breakdown of nocturnal temperature inversions, which can lead to unpredictable dispersion of pollutants once the sun rises and the ground starts to warm.
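An hour-by-hour evaluation of this kind amounts to grouping the paired observations and predictions by hour of day and computing the metric within each group. The Python sketch below shows one way to do this. The function names and records are hypothetical, and R2 is again assumed to be the squared Pearson correlation.

```python
from collections import defaultdict


def r2(y_obs, y_pred):
    """Squared Pearson correlation; returns NaN if either series is constant."""
    n = len(y_obs)
    obs_avg, pred_avg = sum(y_obs) / n, sum(y_pred) / n
    cov = sum((o - obs_avg) * (p - pred_avg) for o, p in zip(y_obs, y_pred))
    var_obs = sum((o - obs_avg) ** 2 for o in y_obs)
    var_pred = sum((p - pred_avg) ** 2 for p in y_pred)
    if var_obs == 0 or var_pred == 0:
        return float("nan")
    return cov * cov / (var_obs * var_pred)


def r2_by_hour(records):
    """records: iterable of (hour, observed, predicted); returns {hour: R2}."""
    groups = defaultdict(lambda: ([], []))
    for hour, obs, pred in records:
        groups[hour][0].append(obs)
        groups[hour][1].append(pred)
    return {hour: r2(obs, pred) for hour, (obs, pred) in sorted(groups.items())}


# Hypothetical records: (hour of day, observed PM2.5, predicted PM2.5).
records = [
    (23, 10.0, 11.0), (23, 20.0, 19.0), (23, 30.0, 31.0),
    (8, 10.0, 20.0), (8, 20.0, 15.0), (8, 30.0, 25.0),
]
hourly_r2 = r2_by_hour(records)
```

The same grouping applies to seasonal or monthly assessments by replacing the hour with a season or month key.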
A focused assessment was also conducted to evaluate the model's prediction accuracy across the spectrum of PM2.5 concentrations. It was observed that the model's predictions were less accurate at the extremes of the PM2.5 concentration spectrum. Specifically, R2 scores decreased significantly, to 0.208 for PM2.5 concentrations below 10 μg m−3 and 0.294 for concentrations above 50 μg m−3. These findings indicate that, despite the model's general robustness, enhancements are necessary to capture air quality more accurately during periods of very low or high pollution levels. To improve the model's performance under such conditions, a targeted analysis to identify potential biases and formulate data-driven strategies for adjustment is recommended.
The overall performance of the model in this study (test R2 = 0.73) is comparable to that of Wang et al (2023), who reported a competent estimation of hourly PM2.5 using an RF model with an R2 of 0.78. Similarly, the model deployed by Gayen et al (2022) proficiently mirrored the PM2.5 oscillations with an R2 of 0.69 in the Delhi-NCT region. Notably, Jiang et al (2021) reported an R2 of 0.85 in their approach to predicting PM2.5 concentrations in China. Similarly high agreement in PM2.5 estimation has been documented in studies focusing on Türkiye (Gündoğdu et al 2022, Tuna Tuygun et al 2022, Yağmur 2022).
A comprehensive tabulation of the comparative performance of various studies focused on PM 2.5 prediction is provided in table 3, serving as a benchmark and a testament to the growing body of literature validating the effectiveness of machine-learning models, particularly the RF algorithm, in environmental monitoring and air quality assessment.
Including meteorological and environmental parameters is a critical factor in enhancing the accuracy of machine learning models for predicting PM levels. In a previous study conducted in Ankara, PM10 levels at a target station were predicted with machine learning models, including artificial neural networks (ANN) and RF, using only PM10 data from other stations as inputs (Bozdağ et al 2020). Because meteorological and environmental parameters that influence the dispersion of pollutants in the atmosphere were absent, the predictive accuracy in that study remained relatively modest.
In contrast, the current study underscores the importance of incorporating meteorological and environmental variables. Atmospheric parameters such as temperature, wind speed, and humidity provide a more comprehensive understanding of the factors that influence air pollution, and by accounting for the dynamics of pollutant dispersion, models can offer more precise estimations. Had such inputs been used in the earlier study, its predictive accuracy would likely have improved, and the performance ranking between the ANN and RF models might have shifted. This refined approach not only contributes to the field of environmental modeling but also aids policymakers and public health officials in developing more effective strategies for air quality management.

Conclusion
Based on the comprehensive analysis conducted in this study, the developed random forest (RF) model proved to be a valuable tool for accurately estimating hourly PM 2.5 concentrations in Ankara, Türkiye. The model demonstrated robust performance across different seasons, with autumn exhibiting the highest accuracies. However, it is important to note that the model's performance varied by season, and lower accuracies were observed during the summer season. This result highlights the need for further investigation and potential enhancements to improve the model's performance during this season. Moreover, the model's performance varied monthly, with December and November showing the highest accuracies. These findings emphasize the influence of temporal factors on the model's performance and highlight specific months where the model excelled in estimating PM 2.5 concentrations. The integration of various meteorological and atmospheric variables derived from the ERA5 reanalysis dataset allowed for a comprehensive understanding of their impact on PM 2.5 levels. Notably, temperature, surface pressure, total column water vapor, boundary layer height, forecast albedo, and vegetation indices emerged as significant factors in determining PM 2.5 pollution levels in the study area. The findings of this study contribute to a deeper understanding of the spatiotemporal dynamics of air pollution in Ankara, enabling evidence-based decision-making for effective air quality management in urban areas. Further research and refinement of the model can improve performance, particularly during challenging periods, such as the spring season, and enhance the accuracy of PM 2.5 predictions.
Despite the robustness of the RF model and the comprehensive dataset utilized, this study has limitations and potential biases, which may influence the interpretation and generalizability of the findings. Firstly, relying on data from urban air quality monitoring stations may introduce spatial bias, as these stations are unevenly distributed across Ankara and may not fully capture the variability in PM 2.5 concentrations in more remote or suburban areas. Secondly, the ERA5 reanalysis data, while extensive, may not perfectly represent local meteorological conditions due to its coarser spatial resolution, potentially affecting the precision of our model predictions.
Furthermore, the temporal scope of our study, spanning two years, may not sufficiently capture long-term trends in air quality or the impacts of exceptional events such as forest fires, industrial accidents, and summer dust transport on PM 2.5 levels. Therefore, future research should aim to cover a broader time frame and test various estimation methods. Additionally, the study's focus on PM 2.5 estimation using a specific set of predictors may overlook other significant factors influencing air quality, such as local emissions sources or socio-economic activities, which were not included in the model.
Recognizing these limitations, future studies can aim to incorporate a wider array of data sources, including satellite observations and more granular socio-economic data, to enhance the model's accuracy and

Figure 1. Geographical location map of the study area and distribution of air quality monitoring stations.

Figure 2. Bar plots of the feature importance.
also shows seasonal performance metrics of the RF model for PM 2.5 estimation in Ankara. Autumn emerges as the season with the most accurate predictions, reflected by the highest R 2 value (0.82) and the lowest error metrics (RMSE, MAE, and MAPE), suggesting that the model's variables are highly predictive during this season. In contrast, summer predictions are less accurate, with the lowest R 2 (0.51) and highest MAPE. Several factors may contribute to the model's poor performance during the summer months in Ankara. Raja et al (2018) have previously identified summer as the 'cleanest' period in the city. However, local factors may occasionally disrupt this trend. For instance, unexpected increases in PM 2.5 levels may complicate the model's air quality predictions. Such increases have been attributed to dust storms from the Middle East and North Africa during summer, owing to the region's position along dust transport routes (Kabatas et al 2014), and to biomass burning activities, such as forest fires and agricultural burning near the area (Tariq et al 2023). Winter and spring show moderate accuracy, with winter performance slightly superior to spring, potentially because of more consistent meteorological patterns and heating-related emissions that are easier for the model to capture.
The RF model's monthly performance analysis, depicted in figures 6 and 7, elucidates the accuracy fluctuations throughout 2020-2021. Monthly performance metrics for the PM 2.5 prediction model were obtained by aggregating the hourly prediction and observation values for each month across 2020 and 2021. November showcased the model's peak precision with an R 2 of 0.82, suggesting the model's input variables are highly attuned to conditions in late autumn. The model also performed robustly in the winter months of December, January, and February, with R 2 values over 0.70, which can be linked to the typically lower PBLH during these times, enhancing the model's ability to capture PM 2.5 concentrations (Tuna Tuygun and Elbir 2020).
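The monthly and seasonal aggregation described above can be sketched as follows. This is a minimal illustration, assuming that hourly records from both years are pooled by calendar month, that seasons follow the meteorological convention (December to February as winter, and so on), and that R 2 is computed as 1 − SS_res/SS_tot. The function names and sample values are hypothetical.

```python
import numpy as np

# Assumed meteorological season for each calendar month.
SEASON_OF_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}

def r2_score(obs, est):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    ss_res = np.sum((obs - est) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot) if ss_tot > 0 else float("nan")

def grouped_r2(timestamps, observed, estimated, by="month"):
    """Pool hourly records by calendar month (or season) and score each group."""
    ts = np.asarray(timestamps, dtype="datetime64[h]")
    obs = np.asarray(observed, dtype=float)
    est = np.asarray(estimated, dtype=float)
    # Months elapsed since 1970-01, folded to a calendar month in 1..12.
    months = ts.astype("datetime64[M]").astype(int) % 12 + 1
    if by == "season":
        keys = np.array([SEASON_OF_MONTH[int(m)] for m in months])
    else:
        keys = months
    return {k.item(): r2_score(obs[keys == k], est[keys == k])
            for k in np.unique(keys)}

# Illustrative hourly records only (not data from the study).
timestamps = ["2020-11-01T00", "2020-11-01T01", "2021-11-01T00",
              "2020-07-01T00", "2020-07-01T01", "2021-07-01T00"]
observed = [20, 30, 40, 10, 12, 14]
estimated = [22, 30, 38, 12, 12, 12]
monthly = grouped_r2(timestamps, observed, estimated, by="month")
seasonal = grouped_r2(timestamps, observed, estimated, by="season")
```

Pooling by calendar month means that, for example, November 2020 and November 2021 contribute to a single score; scoring each year separately would need the year added to the grouping key.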

Figure 3. Scatter plots of estimated versus observed PM 2.5 for train and test datasets.

Figure 4. Scatter plots for seasonal performances of the RF model.

Figure 5. Seasonal performance metrics of the RF model for PM 2.5 estimation in Ankara.

Figure 6. Scatter plots for monthly performances of the RF model based on the test data.

Figure 7. Monthly performance metrics of the RF model based on test data.

Figure 8. SHAP summary plot illustrating the top six features contributing to the random forest (RF) model. (a) The characteristics of the features within the model are depicted. The feature's placement on the y-axis corresponds to its attribute, while its Shapley value determines its position on the x-axis. (b) The importance of each feature is measured in terms of average absolute Shapley values.

Figure 9. Hourly performance metrics of the RF model based on test data.

Table 1. Hyperparameters of the RF model.

2.3.3. Evaluation metrics
The performance of the developed RF model was assessed using various evaluation metrics, including root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R 2 ). These metrics quantify the model's accuracy by measuring the discrepancies between estimated and observed PM 2.5 values. Lower values of RMSE, MAE, and MAPE indicate superior model performance, signifying smaller deviations between estimated and observed values. Conversely, a higher R 2 value indicates closer agreement between estimated and observed values, supporting the model's predictive accuracy.
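The four metrics named above can be computed as in the following minimal sketch. It assumes the standard definitions, with R 2 taken as 1 − SS_res/SS_tot (the text does not state which formula was used) and MAPE expressed in percent. The function name and sample values are hypothetical.

```python
import numpy as np

def regression_metrics(observed, estimated):
    """Return RMSE, MAE, MAPE (%) and R^2 for paired observed/estimated values."""
    obs = np.asarray(observed, dtype=float)
    est = np.asarray(estimated, dtype=float)
    err = est - obs
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # MAPE is undefined where an observation is zero; such records are assumed absent.
    mape = float(np.mean(np.abs(err / obs)) * 100.0)
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((obs - obs.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

# Illustrative values only (not data from the study), in ug/m^3.
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, estimated)
```

RMSE and MAE carry the units of the target (μg m−3), whereas MAPE and R 2 are dimensionless, which is one reason the four are often reported together.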

Table 2. List of the predictive features with PM 2.5 concentration.

Table 3. Recent studies on PM 2.5 estimation.