Predicting forest fire vulnerability using machine learning approaches in the Mediterranean region: a case study of Türkiye

Forest fires in Türkiye have devastated 2.5 million hectares of habitat over four decades, posing a grave threat to Mediterranean forest ecosystems. This study compares five machine learning techniques for predicting forest fire vulnerability: Decision Trees (DT), Naive Bayes (NB), Random Forest (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM). The models were trained and tested on a dataset encompassing precipitation, soil moisture, temperature, humidity, wind speed, land cover, elevation, aspect, slope, proximity to roads and electricity networks, and population density. The dataset classified vulnerability into four classes: very low, low, moderate, and high. Evaluation metrics included overall accuracy, precision, sensitivity, F1-score, Cohen's kappa, and cross-validation (CV). RF exhibited the highest performance (accuracy: 0.80, precision: 0.78, sensitivity: 0.80, F1-score: 0.78, Cohen's kappa: 0.71, average CV: 0.71), predicting the fire vulnerability classes very low (14.99%), low (0.68%), moderate (65.41%), and high (18.90%) with notable accuracy. DT yielded consistent results, while NB performed stably, though slightly lower than RF and DT. ANN and SVM, however, demonstrated lower performance and higher variability. These findings support RF as the most accurate of the tested algorithms for forest fire risk prediction and underline its value in proactive fire risk management strategies.


Introduction
Türkiye lies at the confluence of the Asian and European continents and of three phytogeographic regions: Euro-Siberian, Mediterranean, and Irano-Turanian [1]. The climatic diversity in Türkiye makes it one

Study Area
The research area (Figure 1) is in the Mediterranean-Türkiye climatic zone. This study focuses on five provinces located between 35° 42' 38" N and 38° 33' 43" N latitude and between 37° 31' 42" E and 39° 17' 27" E longitude. The Mediterranean forests in the study area are mostly composed of coniferous tree species such as red pine, black pine, Taurus cedar, Cilician fir, and Juniperus spp. Oak species and maquis shrublands are also distributed in this region.

Dataset
This study uses training data that include independent variables as predictors and a dependent variable as the response (Table 1). The dependent variable data were derived from the MODIS Fire Information for Resource Management System (FIRMS) dataset, which records forest fire events. The dependent variable is classified into four classes based on fire severity: 1 (very low), 2 (low), 3 (moderate), and 4 (high). Fire severity classification was conducted using the ΔNBR (Eq. 1), ΔNDVI (Eq. 2), and ΔSAVI (Eq. 3) indices calculated from a collection of Landsat satellite images [8,9].

\Delta NBR = NBR_{Prefire} - NBR_{Postfire} \qquad (1)
\Delta NDVI = NDVI_{Prefire} - NDVI_{Postfire} \qquad (2)
\Delta SAVI = SAVI_{Prefire} - SAVI_{Postfire} \qquad (3)

This classification process compares the index values of the pixels with certain thresholds to determine the severity of the fire. The thresholds were determined using values from a study [9] that tested the performance of various satellite imagery and spectral indices, as well as two ground-measured severity indices, CBI and GeoCBI, with classes of unburned, low, moderate, and high in assessing fire severity in Turkish forest ecosystems. The processed hotspot data and Landsat indices were used to locate fires during the summer period from May to October between 2001 and 2021. The independent variable data influencing fire occurrence are divided into: Environmental variables (i.e., elevation, slope, aspect, land cover, and land use); Climate variables (i.e., precipitation, temperature, wind speed, relative humidity, and soil moisture); and Socio-Economic variables (i.e., human population distribution, power lines, and road networks). Factors influencing fire occurrence were selected based on field observations found in various studies [10][11][12] and data availability. The filtered and classified hotspot dataset with coordinates and date of occurrence attributes is then run on the GEE (Google Earth Engine) program to query the values of the dependent variables automatically. The query program that has been built uses several main tools in GEE to allow users to query spectral index data and climate variables directly based on the time of hotspot occurrence.
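As an illustration of the thresholding step, the following minimal Python sketch computes ΔNBR for a single pixel as in Eq. 1 and maps it to one of the four severity classes. The NBR formula, (NIR − SWIR2) / (NIR + SWIR2), is the standard definition of the index. The threshold values and the function names are placeholders chosen for the example; they are not the thresholds taken from [9] and not the study's GEE code.

```python
def nbr(nir: float, swir2: float) -> float:
    """Normalized Burn Ratio for one pixel: (NIR - SWIR2) / (NIR + SWIR2)."""
    denom = nir + swir2
    return 0.0 if denom == 0 else (nir - swir2) / denom


def delta_nbr(pre: tuple, post: tuple) -> float:
    """dNBR = NBR_prefire - NBR_postfire (Eq. 1); each argument is (NIR, SWIR2)."""
    return nbr(*pre) - nbr(*post)


# HYPOTHETICAL upper bounds for classes 1 (very low), 2 (low), 3 (moderate).
# Placeholder values for illustration only, not the thresholds used in the study.
THRESHOLDS = [(0.10, 1), (0.27, 2), (0.44, 3)]


def severity_class(dnbr: float) -> int:
    """Return class 1..4 by comparing dNBR with the ordered thresholds."""
    for upper, cls in THRESHOLDS:
        if dnbr < upper:
            return cls
    return 4  # anything above the last threshold is treated as high severity


# Example with made-up reflectance values for one pixel.
example_dnbr = delta_nbr(pre=(0.4, 0.1), post=(0.2, 0.3))
example_class = severity_class(example_dnbr)
```

The same pattern would apply to ΔNDVI and ΔSAVI, each with its own index formula and threshold table.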

ML Classification and Predictive Modeling
Machine Learning (ML) offers a variety of classification and prediction algorithms. Classification is the process of determining a model that explains or distinguishes concepts or classes of data [13]. ML-based fire risk classifiers and predictions have been reported to be more accurate than traditional methods [14]. In this study, five algorithms were used: Decision Trees (DT), Naive Bayes (NB), Random Forests (RF), Artificial Neural Networks (ANN), and Support Vector Machines (SVM). These five algorithms were used to classify the variables and predict the probability of forest fire occurrence.

Decision Trees.
The Decision Trees method offers rules in a hierarchically consistent framework, with each item representing one decision node. DT can be used for fire danger categorization in geographical modeling of fires, mathematical modeling of their effects, and further monitoring and forecasting of natural fires [15].

Naïve Bayes.
The Naïve Bayes algorithm is a classification algorithm that uses Bayes' theorem from statistics to predict the probability of a class [16]. Classification models using NB have the potential to be used effectively in fire event classification [17].
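To make the role of Bayes' theorem concrete, the sketch below implements a small Gaussian Naive Bayes classifier in plain Python. The Gaussian likelihood is an assumption made for the example, since the NB variant used in the study is not stated, and the function names and toy values (temperature, humidity) are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior.

    Features are treated as independent given the class, so the
    per-feature log-likelihoods are simply added to the log-prior.
    """
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (made up): [temperature, relative humidity] -> severity class.
X_toy = [[35, 20], [36, 18], [22, 60], [21, 65]]
y_toy = [4, 4, 1, 1]
nb_model = fit_gaussian_nb(X_toy, y_toy)
```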

Random Forest.
The Random Forest algorithm is an ensemble learning technique derived from CARTs. The ensemble model built from the RF model comprises a few selected rules from each independent DT. After that, voting is done among the selected classes of each tree. RF has a high prediction accuracy and tolerance for outliers and "noise" and has demonstrated strong predictive abilities in forest fire prediction [18].
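The voting step can be sketched in a few lines of Python. This example shows only the aggregation of tree outputs, not the bootstrap sampling or tree training; the three "trees" are hand-written hypothetical rules with made-up thresholds, standing in for trained CARTs.

```python
from collections import Counter
from typing import Callable, Sequence

Tree = Callable[[dict], int]  # a tree maps a sample to a severity class


def majority_vote(trees: Sequence[Tree], sample: dict) -> int:
    """Return the class predicted by the largest number of trees."""
    votes = Counter(tree(sample) for tree in trees)
    # most_common orders by count; ties go to the class seen first.
    return votes.most_common(1)[0][0]


# HYPOTHETICAL stand-ins for trained trees (thresholds are made up).
def tree_a(s: dict) -> int:
    return 4 if s["temperature"] > 30 else 2


def tree_b(s: dict) -> int:
    return 4 if s["humidity"] < 30 else 3


def tree_c(s: dict) -> int:
    return 3 if s["slope"] > 15 else 1


sample = {"temperature": 34, "humidity": 25, "slope": 20}
forest_prediction = majority_vote([tree_a, tree_b, tree_c], sample)
```

Here two of the three trees vote for class 4, so the ensemble returns class 4 even though one tree disagrees.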

Artificial Neural Network.
Artificial Neural Networks are machine learning algorithms that attempt to replicate the structure and function of the human brain. These networks consist of layers of interconnected nodes, or neurons, that process and transmit data. The mathematical foundation of neural networks consists of using numerical terms to express laws, processes, and frameworks. Neural networks are essentially arithmetic models that define a function X -> Y, a function over X, or a function over X and Y, and are often associated with specialized training procedures or training rules [19].
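The idea of a network as a function from inputs X to outputs Y can be illustrated with a forward pass through one hidden layer. The sketch below is not the architecture used in the study, which is not described here; the layer sizes, the ReLU and softmax activations, and all weight values are assumptions made for the example.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Element-wise non-linearity applied to the hidden layer."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into class probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w1, b1, w2, b2):
    """Map an input vector to probabilities over the four severity classes."""
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


# Made-up weights: 2 inputs -> 3 hidden neurons -> 4 output classes.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [2.0, 0.0, 0.0]]
B2 = [0.0, 0.0, 0.0, 0.0]
probabilities = forward([0.5, -1.0], W1, B1, W2, B2)
```

Training would adjust the weights and biases so that the highest probability falls on the observed class; that step is omitted here.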

Support Vector Machine.
SVM is a machine learning technique that can be used for classification or regression analysis. SVMs are particularly useful when working with large, complex data sets with many features, as they can find the most valuable training samples prior to analysis. The hyperplane is the basic mathematical notion behind SVM: the method fits an optimal hyperplane between classes in the feature space. The set of weights and biases learned during training determines the hyperplane [20].
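How the learned weights and bias determine the hyperplane can be shown with the decision function of a linear, two-class SVM. The sketch below covers prediction only, with made-up weights; it does not show the training that finds the optimal hyperplane, and the four-class problem in this study would additionally need a multi-class scheme (for example one-vs-rest), which is an assumption and not stated in the text.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b; zero means the point lies on the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_binary(weights, bias, x):
    """Assign +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Made-up hyperplane parameters for a two-feature example.
W = [1.0, -1.0]
B = -0.5
side_a = predict_binary(W, B, [2.0, 0.0])  # score  1.5
side_b = predict_binary(W, B, [0.0, 2.0])  # score -2.5
```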

Accuracy Assessment of ML Algorithm
The trained ML model can predict the location of fires, but the performance of the model must be evaluated. For training and testing, the dataset was split in a 70:30 ratio [21]. Accuracy evaluation assesses the model's performance through several measures: confusion matrix, overall accuracy, precision, sensitivity, F-score, and kappa coefficient. The confusion matrix summarizes prediction outcomes, detailing correct and erroneous predictions across classes. Overall accuracy represents the percentage of correctly classified outcomes in this matrix, calculated using the general equation (Eq. 4).
Overall\ accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)
Precision quantifies the accuracy of a classifier's predictions of a particular class [23]. The formula for calculating classification precision is given in Eq. 5.
Precision = \frac{TP}{TP + FP} \qquad (5)
Sensitivity (recall) quantifies how much of the data from a particular class is predicted correctly [23]. The formula for calculating classification sensitivity is given in Eq. 6.
Sensitivity = \frac{TP}{TP + FN} \qquad (6)
The F-score provides a balanced measurement that combines precision and sensitivity (Eq. 7). It is useful for assessing system performance where the right balance between finding the members of a class (recall) and ensuring that the predictions are correct (precision) is crucial [24].
F\text{-}score = 2 \times \frac{Precision \times Sensitivity}{Precision + Sensitivity} \qquad (7)
Description: TP = True positive (number of correct positive predictions); TN = True negative (number of correct negative predictions); FP = False positive (number of incorrect positive predictions); FN = False negative (number of incorrect negative predictions).
The Kappa coefficient, introduced by Jacob Cohen [25], is a metric that compares observed accuracy with the accuracy expected by random chance. The kappa value is calculated with Eq. 8,
\kappa = \frac{P_o - P_e}{1 - P_e} \qquad (8)
where P_o is the relative observed agreement between the two rasters and P_e is the hypothetical probability of chance agreement.
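The accuracy measures described above can all be derived from a multi-class confusion matrix. The following Python sketch shows one way to do so, under the usual conventions that rows hold the actual classes and columns the predicted classes; the function names and the eight example labels are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count samples per (actual, predicted) pair; rows = actual classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def per_class_scores(matrix, i):
    """Precision, sensitivity, and F-score for the class at index i."""
    tp = matrix[i][i]
    fp = sum(row[i] for row in matrix) - tp  # predicted as i, actually other
    fn = sum(matrix[i]) - tp                 # actually i, predicted as other
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f_score


def cohen_kappa(matrix):
    """Kappa = (Po - Pe) / (1 - Pe), with Pe from the row and column totals."""
    total = sum(sum(row) for row in matrix)
    p_o = overall_accuracy(matrix)
    p_e = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / total ** 2
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for the four severity classes.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 1]
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
```

In this example six of eight samples are classified correctly, so the overall accuracy is 0.75, while kappa is lower (about 0.67) because part of that agreement would be expected by chance.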

Figure 2. K-fold cross-validation
Cross-validation is a machine learning technique to evaluate model performance on independent data sets (Figure 2). The data set is divided into two parts for this technique, the training data set and the validation data set. The model is trained on the training data set before being evaluated on the validation data set. In the k-fold variant, this split is repeated k times so that each fold serves once as the validation data set. Cross-validation is used to determine how well the model performs on new data that has not been seen before [26].
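A minimal Python sketch of k-fold splitting is given below. The value k = 5, the shuffling seed, and the `fit_and_score` callback are assumptions made for the example; the number of folds used in the study is not stated in this section.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average the score returned by fit_and_score over all k folds.

    fit_and_score is a hypothetical callback that trains a model on the
    training indices and returns its score on the validation indices.
    """
    scores = [
        fit_and_score(train, val)
        for train, val in k_fold_indices(n_samples, k, seed)
    ]
    return sum(scores) / len(scores)


# With 920 samples and k = 5, each fold validates on 184 samples.
splits = list(k_fold_indices(920, 5))
```

Averaging the per-fold scores gives a single figure comparable to the average CV value reported for each algorithm.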

Dataset description
A total of 920 fire hotspot sample points were obtained for Mediterranean-Türkiye for the summer periods from 2001 to 2021: 221 in the very low class (unburned), 96 in the low class (burned), 169 in the moderate class (burned), and 434 in the high class (burned) (Figure 3). In the dataset distribution of forest fire classes (Figure 3), the largest number of samples is in the high class. Among the burned classes, the moderate class has the second highest number of samples and the low class has the lowest. This indicates that high-severity fires account for the largest share of the sample points in this dataset. Mediterranean forest ecosystems are susceptible to fire, especially in the severe fire class [27,28]. From 2001 to 2021, the incidence of forest fires showed an increasing pattern (Figure 3), and the months with the most fires were July and August [29]. Of all major fires in Türkiye, 77% started in July and August. The summer season is therefore very important in relation to fire risk, and from 2020 to 2021 there was a significant increase in fires. The major fire of 2021 occurred in Antalya province, Türkiye [30]. A fire covering a wide area started at four different points in Manavgat in July and August. The fires expanded and were accompanied by new fires in the districts of Gündoğmuş, Alanya, Akseki, and Ibradı. A forest fire that burned for several days in the Serik district of Antalya province in 2008 destroyed 15,795 ha of forest; it was driven by the forest's composition of Calabrian pine (Pinus brutia) and the high wind speeds in the area at the time. This was the greatest forest fire ever recorded in Türkiye [31].
The correlation matrix shows the relationship between the various variables and fire severity, and it reveals some significant patterns. For example, temperature has a strong positive correlation with fire severity (correlation coefficient 0.60), which indicates that the higher the temperature, the higher the fire severity. Humidity has a moderately strong negative correlation with fire severity (correlation coefficient -0.48), which indicates that the lower the humidity, the higher the fire severity. Other variables such as soil moisture, wind speed, elevation, slope, and road network also correlate with fire severity, although not as strongly as temperature and humidity. Land cover is the exception, with a strong positive correlation (correlation coefficient 0.76), suggesting that certain land cover types tend to be associated with higher fire severity. However, it should be noted that correlation coefficients do not imply causation but only indicate a linear relationship between variables [32].
Feature importance in machine learning is a measure of the relative contribution of each input variable (feature) to a prediction model. It provides insight into the influence of the various features on the model results and helps identify which features have the most significant influence on the target variable. It can be used for feature selection, feature engineering, and gaining a better understanding of the underlying data [33]. Feature importance (Figure 4) for each algorithm indicates the dominant and non-dominant variables in predicting fire severity. The land cover and temperature variables are consistently important in each algorithm used (DT, NB, RF, ANN, and SVM). This suggests that land cover type and air temperature contribute significantly to fire severity. On the other hand, the precipitation and power network features have low importance in some algorithms. The low contribution of the precipitation variable arises because the study covered the summer months, when little rainfall was recorded. Daily rainfall and distance to the electricity network (in meters) did not significantly affect fire severity in these models. These results are based on feature importance analysis specific to the dataset and model settings. In addition, the importance of a feature may vary depending on the context and characteristics of the observed data and the algorithm used [34].
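The correlation coefficients discussed in this section measure linear association between a predictor and fire severity. As a reminder of what such a coefficient captures, the sketch below computes the Pearson coefficient in plain Python. The function name and the sample values are made up for illustration and are not taken from the study's dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: severity rises with temperature and falls with humidity.
temperature = [22.0, 27.0, 31.0, 36.0]
humidity = [65.0, 50.0, 35.0, 20.0]
severity = [1, 2, 3, 4]

r_temperature = pearson_r(temperature, severity)  # positive
r_humidity = pearson_r(humidity, severity)        # negative
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association; as noted above, neither case establishes causation.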

ML Model Evaluation
In the evaluation of ML model performance, the accuracy assessment method has been widely used in various studies [35]. Accuracy assessment provides a general understanding of how the model works. Therefore, the results of the selected ML models are validated with the following accuracy measures: overall accuracy, kappa coefficient, precision, sensitivity, F-score, and cross-validation (Table 2).
The RF algorithm showed the highest accuracy compared to the other algorithms in classifying the forest fire dataset in the Mediterranean region of Türkiye. The superior accuracy of RF can be attributed to its ensemble learning approach, which combines multiple decision trees and merges their predictions to make the final classification. This approach helps reduce overfitting and improves generalization performance by capturing more complex relationships and patterns in the data [36]. In addition, RF uses a randomized feature selection process during the construction of each decision tree, which helps reduce correlation among the trees and increase ensemble diversity. On the other hand, the DT and NB algorithms show relatively lower accuracy. DT faces limitations in handling complex datasets and capturing complicated relationships between features, which may lead to overfitting. NB is based on the feature independence assumption [37], which can be unrealistic for forest fire datasets with complex interactions. Another factor contributing to the lower accuracy of DT and NB is the lack of an optimization process in this comparison [38], whereas RF includes an optimization process to improve performance. The ANN and SVM algorithms also showed low accuracy due to various factors, including the complexity of the ANN training process and SVM's reliance on selecting the right kernel function to map the data to the appropriate feature space [39]. Overall, the findings of this study highlight the effectiveness of the RF algorithm in achieving high accuracy in the classification and prediction of forest fire classes in the Mediterranean region of Türkiye.
The prediction map shows that moderate risk is the category with the largest area, covering about 65.25% of the total observed Mediterranean region (Figure 5). The high-risk category also has a significant area, covering about 18.86% of the total area. The very low risk category covers about 15.21% of the total area, and the low-risk category has the smallest area, covering only about 0.68% of the total area. These figures provide an overview of the level of wildfire risk in the Mediterranean region, where most areas are in the moderate risk category, while high risk also deserves attention as it covers a significant area. This information can serve as an important basis for forest planning and management to prevent and reduce the risk of forest fires in the region.

Fire Risk Prediction
Among the land cover categories, forest plays an important role in fire dynamics and vulnerability. Coniferous forests show a significant presence, especially in the moderate and high severity levels, with percentages ranging from 11.8% to 30.4% [40]. These forests, characterized by evergreen needle-leaved trees, are highly flammable due to their resinous composition, and they contribute significantly to the spread and intensity of fires. Mixed forests, which are composed of a combination of different tree species, also showed a considerable representation at moderate and high severity, ranging from 6.5% to 16.6%. This indicates that forest land cover types are more susceptible to fire occurrence than other land cover types. Coniferous trees cover a wide region of the Mediterranean Basin and provide significant ecological and economic advantages. Data reveal that pine species account for 61% of Türkiye's total forest asset (21,678,134 ha), with the danger of forest fires increasing throughout the summer and on windy days in places with coniferous species (Pinus brutia Ten., Pinus nigra Arnold ssp. pallasiana [Lamb.] Holmboe, and Pinus pinea L.) [2]. Forest fires were shown to be more dangerous in shrubs and coniferous forests than in wetlands, crops, and otherwise burned areas in the Mediterranean region [41]. Mixed forests show varying degrees of fire susceptibility, with a considerable share falling into the low and moderate severity classes.

Conclusions
Evaluation of the performance of Machine Learning-based prediction methods for forecasting forest fire events in the Mediterranean-Türkiye region revealed that the RF algorithm performed best, with an overall accuracy of 0.80 and a Kappa coefficient of 0.71. In addition, RF achieved high precision, sensitivity, and F-measure levels of 0.78, 0.80, and 0.78, respectively. These findings confirm that RF is an effective method for predicting the incidence of forest fires in the region. The spatial model based on the Random Forest algorithm successfully predicted the risk of forest fires. Using a dataset consisting of 920 rows and 12 feature columns, we predicted the class of forest fires in four categories. The RF model used 100 decision trees and a random seed ('random_state'), with features such as "land cover," "temperature," "slope," and "elevation" playing a significant role in the model. Furthermore, the DT algorithm also showed satisfactory performance, with an overall accuracy of 0.75 and a Kappa coefficient of 0.66. The NB algorithm achieved an accuracy of 0.68 and a Kappa coefficient of 0.56. The performance of ANN and SVM was relatively low, with an accuracy of only 0.54 for each.

Figure 1. The study area of Mediterranean Türkiye.

Figure 3. Hotspot dataset distribution (a) Spatial and (b) by year.

Figure 5. Prediction of fire susceptibility of Mediterranean-Türkiye forests using the algorithm with the highest accuracy (RF algorithm).

Table 1. Dataset sources (independent and dependent variables)