Spatial Classification of Forest and Land Fire Risk using Decision Tree C5.0 Algorithm

Forest and land fires occur in Indonesia almost every year. One of the districts that often experiences them is Ogan Komering Ilir District, South Sumatra Province. One indicator of forest and land fires is hotspots. This study aims to create a classification model for forest and land fire risk in Ogan Komering Ilir District using four attributes, namely hotspot density, distance to rivers, distance to settlements, and distance to roads. The decision tree C5.0 algorithm is used to develop the classification model. It produces a model that can classify new data based on the rules formed by the tree. Modeling with C5.0 using 80% training data and 20% test data shows that the model correctly classifies around 86.49% of the test samples, which indicates a relatively high level of accuracy. The resulting decision tree shows that forest and land fires are most affected by distance to settlements.


Introduction
Forest fires occur almost every year on the Indonesian islands of Kalimantan and Sumatra [18]. Forest and land fires in 2019 covered 1,649,258 ha [16]. Ogan Komering Ilir (OKI) District, South Sumatra Province, is one of the regencies on the island of Sumatra that often experiences land and forest fires. Approximately 51% of the total area of forest and land fires in 2015 occurred in OKI District [10], while in 2019 forest and land fires in the district covered 316,472 ha [9]. Forest and land fires often occur because of human actions and natural factors [3]. Humans play an important role in forest fires and can be their main cause, whether through accidents, negligence, or intentional actions [5]. Human activity can cause or increase the risk of forest and land fires [8] by 60% [19]. Apart from the role of humans, climate factors [18] and the environment also cause forest fires. Indications of forest fires in Indonesia can be obtained from information on hotspots [12]. Forest fires increase as the distance from the hotspot to the road decreases [17]. In addition, the distance to settlements [20] [1] and the distance to the river have a significant effect on hotspots and on forest and land fires. Hotspots can be identified using remote sensing, namely by processing satellite image data such as the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Visible Infrared Imaging Radiometer Suite (VIIRS) [15]. Remote sensing and Geographic Information Systems (GIS) can be used for early warning of forest and land fire risks.
The decision tree C5.0 algorithm was developed by [21] as a refinement of the ID3 and C4.5 algorithms. It is a machine learning algorithm that builds decision trees and is easy to interpret. Its features include the ability to handle missing values, the use of pruning to reduce overfitting, improved computation time and accuracy, and models that are easier to interpret.
Research [17] examined the correlation between the number of hotspots and six independent variables, namely distance to settlements, distance to roads, distance to rivers, distance to rice fields, distance to fields, and distance to plantations. The highest spatial correlation was found for distance to the river (R² = 0.9982), followed by distance to settlements (R² = 0.9412), while the number of hotspots increased as the distance to the road decreased (R² = 0.8879). These results indicate a very strong spatial correlation between forest and land fires and human activities. Spatial modeling of forest and land fire risk [7] using the information value method (IVM) achieved an accuracy of 66.87%; areas closest to rivers (IVM = 0.123) and highways (IVM = 0.125) were more vulnerable to forest fires, and swamp land cover (IVM = 1.071) and shrubs (IVM = 0.024) were also vulnerable to fire. A spatial model of forest and land fire vulnerability using Composite Mapping Analysis (CMA) in Central Kalimantan Province achieved an accuracy of 52.6%.
This study aims to create a spatial classification model for the risk of forest and land fires, as a first step in early prevention, by using hotspot data and dominant factors, namely the distance of the hotspot to settlements, the distance to the river, and the distance to the road, with the decision tree C5.0 algorithm.

Data Collection
The data used in this study were obtained from various sources:

Research procedure
The stages of this research were data set formation, data preprocessing, data splitting, decision tree modeling, and model testing. The road map, settlement map, and river map obtained from https://tanahair.indonesia.go.id/portal-web can be seen in Figure 2. After the data are collected, the hotspot data are clipped to the study area and a grid is created to calculate the density of hotspots. The density data and hotspot data are then stored in PostgreSQL to compute the density, distance to rivers, distance to settlements, and distance to roads. The results are then saved as a CSV file.

Data Splitting.
The preprocessed data are divided into two parts, namely training data and test data: 80% training data and 20% test data, taken randomly. The training data are used to build the model, while the test data are used to evaluate the model's predictions.

Create Models.
Modeling is done with the C5.0 algorithm using the C5.0 library in R. C5.0 is a decision tree-based classification algorithm that uses recursive partitioning to build the decision tree. At each step it selects the predictor that best divides the data according to the splitting criterion. The selection of predictors considers factors such as class diversity, the accuracy of the separation, and the importance of the predictors.

Model Evaluation.
After the model is obtained, it is evaluated using accuracy and the confusion matrix from the Caret package. A confusion matrix is a table that describes the performance of a classification model by comparing the predictions made by the model with the actual data. It contains four categories of prediction results, namely true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Precision is an evaluation metric that measures the extent to which the positive predictions of a classification model are correct. It is the number of true positives (TP) divided by the total number of positive predictions made by the model (TP + FP):
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is an evaluation metric that measures the degree to which a classification model identifies all actual positive cases. It is the number of true positives (TP) divided by the total number of actual positive cases (TP + FN).

3.1.1. Hotspot.
Indications of forest fires in Indonesia can be obtained from information on hotspots [12]. A hotspot is a pixel whose temperature value, as interpreted from satellite imagery, is above a certain threshold, and it can be used as an indication of forest and land fires [14]. In other words, a hotspot is a point that has a higher temperature than the surrounding area. Hotspot density plays an important role in fire vulnerability: the greater the temporal hotspot density, the higher the level of vulnerability [6]. A hotspot does not always lead to a fire, but it needs to be handled quickly so that it does not become one. [2] recorded more than 15,000 hotspots over 15 years. Hotspots can be detected using remote sensing, namely by processing satellite image data such as the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Visible Infrared Imaging Radiometer Suite (VIIRS) [15].

3.1.2. Decision tree C5.0 algorithm.
The decision tree algorithm was first introduced in 1993 by Ross Quinlan. C5.0 is an algorithm developed from the ID3 algorithm and Classification and Regression Trees (CART) [10]. It is a machine learning algorithm used to build decision trees, which are predictive models that describe decisions and their consequences in the form of a tree structure. The C5.0 algorithm has several features that improve on the previous algorithms. These include the ability to handle missing values, the use of pruning to reduce overfitting, improved computation time and accuracy, and models that are easier to interpret. In the C5.0 algorithm, as in decision trees in general, the term "node" refers to a point in the decision tree. At each node, the C5.0 algorithm splits the data on the existing features and selects the best feature for the split using several metrics, such as entropy, gain, and gain ratio. The calculation of entropy follows [4] and gives the total entropy value in a variable.

Data Set Formation
The initial stage after the data were collected was clipping the hotspot data to the study area, namely Ogan Komering Ilir District. The clipping is done using the clip tool in QGIS. The results before and after the clip can be seen in Figure 3. The next step is to create a hotspot density grid using the tools available in QGIS. The area is divided into grid cells of 1 km x 1 km. The grid and the distribution of hotspots on the grid can be seen in Figure 4.
After the grid is formed, the hotspot density is computed with a query in PostgreSQL. The density is the number of hotspots divided by the area. The query used to obtain the density can be seen in Figure 5. The SQL command in Figure 5 performs a join between the "grid" table (with a "geom" column) and a "hotspot" table (with a "geom" column) based on the ST_Contains condition, which checks whether the geometry in the "hotspots" table is contained in the geometry in the "gridclip" table. Next, the query calculates the area using ST_Area and multiplies it by 1,000,000 to convert the units to square kilometers. The query also calculates the number of hotspots in each grid cell, the midpoint of the grid cell using ST_Centroid, and the density of hotspots by dividing the number of hotspots by the area. The results of these calculations are stored in a new table "class" with the appropriate columns. The table has 4,257 rows, each representing the density value of a grid cell that contains hotspots. The resulting density levels can be seen in the corresponding table. The next step is to find the distance to the settlement, the distance to the river, and the distance to the road, to be used as independent variables. These variables are formed using the ST_Distance function in PostGIS. The query for computing the distance for each attribute can be seen in Figure 6.
The SQL command in Figure 6 (a) selects the grid midpoint (centroid) from the "class" table (assumed to have the columns "grid midpoint" and "geom") and a linear path from the "road" table (assumed to have a column "geom"). The selection results are then stored in a new "walk distance" table with an additional "distance" column, which is the distance in kilometers between the center point of the grid cell and the linear path.
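The two spatial steps described above (hotspot density per grid cell, and the distance from a cell centroid to a line feature) can be illustrated with a small planar sketch. This is not the PostGIS query from Figures 5 and 6: it assumes projected coordinates already expressed in kilometers, square cells aligned to the origin, and hypothetical function names (`hotspot_density`, `distance_to_line`).

```python
from collections import Counter
from math import floor, hypot

def cell_of(x_km, y_km, cell_km=1.0):
    """Index (column, row) of the square grid cell containing a point."""
    return (floor(x_km / cell_km), floor(y_km / cell_km))

def hotspot_density(points, cell_km=1.0):
    """Hotspots per square kilometer for every cell that contains hotspots."""
    counts = Counter(cell_of(x, y, cell_km) for x, y in points)
    area = cell_km * cell_km
    return {cell: n / area for cell, n in counts.items()}

def centroid(cell, cell_km=1.0):
    """Midpoint of a grid cell, in kilometers."""
    col, row = cell
    return ((col + 0.5) * cell_km, (row + 0.5) * cell_km)

def point_segment_distance(p, a, b):
    """Planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_line(p, line):
    """Shortest planar distance from p to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))

# Illustrative (made-up) coordinates in kilometers.
hotspots = [(0.2, 0.3), (0.8, 0.9), (1.5, 0.5)]
road = [(0.0, 2.5), (3.0, 2.5)]
density = hotspot_density(hotspots)
distance = distance_to_line(centroid((0, 0)), road)
```

Only cells that contain at least one hotspot appear in the result, which mirrors the paper's table of rows for grid cells that have hotspots.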

Data Preprocessing
Data preprocessing groups the data by class label. In this study, the distance to the road and the distance to the river are each categorized into 5 classes, while the distance to settlements is categorized into 4 classes. The classes for hotspot density are based on the intervals of the hotspot density data. The class categories for the independent attributes and for the dependent variable are given in the corresponding tables.

The dependent variable has three classes: low vulnerability, medium vulnerability, and high vulnerability.
The next stage converts each attribute value, which was originally numeric, into the category of its class. For example, a distance to the settlement of 3 km falls into the medium category. An example of data that has been converted into class categories can be seen in the corresponding table.
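The conversion from a numeric value to a class label can be sketched as a lookup against interval boundaries. The break points below are hypothetical placeholders, chosen only so that the 3 km example from the text lands in the medium class; the actual intervals are those in the paper's class tables.

```python
from bisect import bisect_right

# HYPOTHETICAL upper bounds (km) between consecutive classes; not the paper's values.
SETTLEMENT_BREAKS_KM = [2.0, 4.0, 6.0]
SETTLEMENT_LABELS = ["Low", "Medium", "High", "Very High"]

def to_class(value, breaks, labels):
    """Map a numeric value to the label of the interval it falls in.

    A value equal to a break point is assigned to the upper interval.
    """
    return labels[bisect_right(breaks, value)]

label = to_class(3.0, SETTLEMENT_BREAKS_KM, SETTLEMENT_LABELS)
```

The same helper would serve the five-class attributes by passing four break points and five labels.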

Data Splitting
The 80% training data consist of 3,406 rows and the 20% test data consist of 851 rows. To ensure that there is no bias associated with the order of the data, the order of the rows is randomized before splitting.
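A minimal sketch of a randomized 80/20 split follows. The study performed this step in R; the function name `split_rows` and the seed are illustrative assumptions, and the rows are stand-in identifiers rather than the real data. With 4,257 rows, rounding 80% gives the 3,406 and 851 rows reported above.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # the seed is an arbitrary illustrative choice
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(4257))  # stand-in identifiers for the 4,257 grid rows
train, test = split_rows(rows)
```

Shuffling before cutting is what removes any dependence on the original row order.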

Create Models
The classification model uses the C5.0 decision tree algorithm; modeling is done in R using the C5.0 library.

Based on the resulting decision tree, the risk of land and forest fires is most strongly influenced by the distance to the settlement, followed by the distance to the road and the distance to the river. The 10 rules for forest and land fire risk classification derived from the decision tree are listed in the corresponding figure.
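To make the rule-based model concrete, the ten rules listed in the paper can be encoded as data and applied to a categorized record. The voting scheme below (each matching rule votes for its class with its confidence, and the largest total wins) is a simplified assumption about how a C5.0 rule set is applied, and the default class for records matched by no rule is not given in the text, so the sketch returns `None` in that case. The attribute keys `roads`, `settlements`, and `river` are illustrative names.

```python
from collections import defaultdict

LOW, MED, HIGH = "Low vulnerability", "Medium vulnerability", "High vulnerability"

# (conditions, predicted class, confidence), transcribed from Rules 1-10.
RULES = [
    ({"roads": {"Low", "Very High", "Very Low", "Medium"}, "settlements": {"Low"}}, LOW, 0.986),
    ({"roads": {"Very Low"}}, LOW, 0.902),
    ({"river": {"High", "Very Low", "Medium", "Low"}, "roads": {"High"},
      "settlements": {"High"}}, MED, 0.996),
    ({"river": {"High"}, "settlements": {"Medium"}}, MED, 0.953),
    ({"river": {"High", "Very Low"}, "roads": {"High"},
      "settlements": {"High", "Very High"}}, MED, 0.945),
    ({"roads": {"High"}, "settlements": {"Low", "Medium"}}, MED, 0.878),
    ({"settlements": {"Medium"}}, MED, 0.854),
    ({"river": {"Very High"}, "settlements": {"High", "Very High"}}, MED, 0.996),
    ({"roads": {"Very High"}, "settlements": {"High", "Very High"}}, HIGH, 0.995),
    ({"river": {"Very High", "Medium", "Low"}, "settlements": {"Very High"}}, HIGH, 0.974),
]

def classify(record, rules=RULES, default=None):
    """Sum the confidences of all matching rules per class; return the top class."""
    votes = defaultdict(float)
    for conditions, label, confidence in rules:
        if all(record[attr] in allowed for attr, allowed in conditions.items()):
            votes[label] += confidence
    if not votes:
        return default  # the paper does not state the default class
    return max(votes, key=votes.get)

example = classify({"roads": "High", "settlements": "Medium", "river": "High"})
```

In the example, Rules 4, 6, and 7 all match and agree on the medium vulnerability class.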

Model Evaluation
After the decision tree model is built, it is tested using accuracy, the confusion matrix, precision, and recall. The evaluation results can be seen in the evaluation table. Overall, these results give an overview of how well the C5.0 model classifies the test data. For example, a precision of 0.9609 for "High vulnerability" means that approximately 96.1% of all data classified as "High vulnerability" really belong to that class, while a recall of 0.9247 for "Low vulnerability" means that the model detects around 92.5% of all instances of the "Low vulnerability" class in the test data.
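For a three-class problem, precision and recall are computed per class by treating that class as "positive" and the other two as "negative". The sketch below shows the calculation on a small made-up confusion matrix; the counts are hypothetical and are not the paper's results, and the row/column orientation (rows are actual classes) is an assumption of this sketch rather than the layout printed by the Caret package.

```python
def per_class_metrics(matrix, labels):
    """Accuracy plus per-class precision and recall.

    matrix[i][j] is the number of cases whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fp = sum(matrix[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(matrix[k]) - tp                       # actually k, predicted another class
        metrics[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return correct / total, metrics

labels = ["Low", "Medium", "High"]
toy_matrix = [  # HYPOTHETICAL counts, for illustration only
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
accuracy, metrics = per_class_metrics(toy_matrix, labels)
```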

Visualization
The final stage of this research is to visualize the grid cells classified by the C5.0 decision tree model in spatial form (a map). The vulnerability is divided into 3 classes: the low vulnerability class is shown in yellow, the medium vulnerability class in orange, and the high vulnerability class in red. The administrative map of Ogan Komering Ilir District is also included in the vulnerability map, in green, to show the boundaries of the Ogan Komering Ilir region.

Conclusion
This study succeeded in classifying the level of vulnerability to forest and land fires using 4 attributes, namely hotspot density, distance of hotspots to settlements, distance to rivers, and distance to roads. The resulting classification model is influenced most by the distance to settlements. If the distance to the settlement is very close, then all occurrences are significant. The decision tree C5.0 algorithm produced a model, in the form of rules, that classifies the level of forest fire vulnerability with an accuracy of 86.49%. Further research could include other factors that affect hotspots, such as rainfall, temperature, and the El Niño and La Niña weather anomalies.
(a) Hotspot data for 2019-2022 were obtained from the National Aeronautics and Space Administration (NASA) Fire Information for Resource Management System (FIRMS) website (https://firms.modaps.eosdis.nasa.gov/). The data are VIIRS hotspots from the Suomi National Polar-Orbiting Partnership (S-NPP) satellite, with attributes such as latitude, longitude, brightness, bright_t31, bright_ti4, scan, track, acq_date, acq_time, satellite, instrument, confidence, version, bright_ti5, frp, daynight, and type.
(b) Settlement maps, road maps, and river maps of Ogan Komering Ilir District were obtained from Geospatial for the Country (https://tanahair.indonesia.go.id/).

Figure 2. (a) Road map (b) settlement map (c) river map

2.2.2. Preprocess Data.
Data preprocessing changes the data structure into data that are ready to be used for modeling. In this study, data preprocessing was carried out by changing interval data into ordinal data according to each attribute class. The attributes of distance to the river and distance to the road are each categorized into 5 classes, namely very high, high, medium, low, and very low. The attribute of distance to settlements is categorized into 4 classes, namely very high, high, medium, and low, while the hotspot density attribute is categorized into 3 classes, namely low vulnerability, medium vulnerability, and high vulnerability.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Information:
True Positive = data with true actual values and true predicted values
False Positive = data with incorrect actual values and true predicted values
False Negative = data with true actual values and incorrect predicted values

3. Result and discussion
3.1. Literature Review

$$\text{Entropy}(S) = \sum_{j=1}^{k} -p_j \log_2 p_j$$
where:
$S$ = set of cases
$k$ = number of classes in variable $A$
$p_j$ = proportion of $S_j$ to $S$
Next, the gain value is found using the following equation.
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{i=1}^{m} \frac{|S_i|}{|S|} \, \text{Entropy}(S_i)$$
where:
$S$ = set of cases
$S_i$ = collection of cases in the $i$-th category
$A$ = variable
$m$ = number of categories in variable $A$
$|S_i|$ = number of cases in the $i$-th category
$|S|$ = number of cases in $S$
After the entropy and gain are obtained, the gain ratio value is calculated.
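The three splitting metrics can be sketched in a few lines. The entropy and gain functions follow the definitions above; the gain ratio equation itself is missing from this text, so the sketch assumes the standard definition from the C4.5 family, in which the gain is divided by the split information of the attribute. The toy records and labels are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S): minus the sum of p_j * log2(p_j) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def partition(rows, labels, attribute):
    """Group the class labels by the value each row takes for the attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return groups

def gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum(|S_i| / |S| * Entropy(S_i))."""
    n = len(labels)
    groups = partition(rows, labels, attribute)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def split_info(rows, labels, attribute):
    """Entropy of the partition sizes (assumed standard C4.5 definition)."""
    n = len(labels)
    groups = partition(rows, labels, attribute)
    return -sum(len(g) / n * log2(len(g) / n) for g in groups.values())

def gain_ratio(rows, labels, attribute):
    """Gain divided by split information; 0.0 when the attribute has one value."""
    si = split_info(rows, labels, attribute)
    return gain(rows, labels, attribute) / si if si > 0 else 0.0

# Toy data: "settlements" separates the classes perfectly, "roads" not at all.
rows = [
    {"settlements": "Low", "roads": "High"},
    {"settlements": "Low", "roads": "Low"},
    {"settlements": "High", "roads": "High"},
    {"settlements": "High", "roads": "Low"},
]
labels = ["Low", "Low", "High", "High"]
best = max(["settlements", "roads"], key=lambda a: gain_ratio(rows, labels, a))
```

At each node the attribute with the highest value of the chosen metric is used for the split, which is why `settlements` is picked in the toy example.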

Figure 4. (a) Hotspots on the grid (b) Enlarged image of a hotspot on the grid

Figure 6. (a) Distance to road

The C5.0 library is used for modeling, and the Caret library is used to obtain the confusion matrix, precision, and recall. Modeling with C5.0 produces a rule-based model that has 10 rules. The decision tree rules are listed in the corresponding figure.

Figure. Decision tree C5.0 from the modeling

Rule 1: distance to roads in {Low, Very High, Very Low, Medium}, distance to settlements = Low. Class Low vulnerability [0.986]
Rule 2: distance to roads = Very Low. Class Low vulnerability [0.902]
Rule 3: distance to river in {High, Very Low, Medium, Low}, distance to roads = High, distance to settlements = High. Class Medium vulnerability [0.996]
Rule 4: distance to river = High, distance to settlements = Medium. Class Medium vulnerability [0.953]
Rule 5: distance to river in {High, Very Low}, distance to roads = High, distance to settlements in {High, Very High}. Class Medium vulnerability [0.945]
Rule 6: distance to roads = High, distance to settlements in {Low, Medium}. Class Medium vulnerability [0.878]
Rule 7: distance to settlements = Medium. Class Medium vulnerability [0.854]
Rule 8: distance to river = Very High, distance to settlements in {High, Very High}. Class Medium vulnerability [0.996]
Rule 9: distance to roads in {Very High}, distance to settlements in {High, Very High}. Class High vulnerability [0.995]
Rule 10: distance to river in {Very High, Medium, Low}, distance to settlements = Very High. Class High vulnerability [0.974]

The map of forest and land fire vulnerability levels in Ogan Komering Ilir is shown in the corresponding figure.

IOP Publishing, doi:10.1088/1755-1315/1315/1/012059

Figure. Map of forest and land fire vulnerability levels in Ogan Komering Ilir District.
The data used in this study are VIIRS hotspot data for the Ogan Komering Ilir District region from 2019 to 2022, taken from https://firms.modaps.eosdis.nasa.gov/. The attributes of the hotspot data obtained from NASA FIRMS can be seen in Table 1.

Table 1. Attributes of hotspot data
Attribute     Description
Scan          The size of the pixel width of the satellite image
Track         Satellite image pixel length
Acq_Date      Hotspot occurrence date
Acq_Time      Time of occurrence of hotspots
Satellite     Satellite used (Terra or Aqua)
Instrument    Instrument used (VIIRS)
Confidence    Hotspot quality
Version       VIIRS version used

Based on the confusion matrix, the number of correct classifications is 246 for the high vulnerability class, 232 for the medium vulnerability class, and 258 for the low vulnerability class. The resulting accuracy of 0.8649 indicates that the model predicts about 86.49% of the test data correctly, which is a relatively high level of accuracy. In addition to accuracy, it is also important to consider other evaluation metrics such as precision and recall, which are reported in the evaluation table.
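As a quick consistency check, the reported accuracy can be recomputed from the three correct-classification counts, assuming that all 851 test rows from the data splitting step were evaluated:

```latex
\[
\text{Accuracy}
  = \frac{246 + 232 + 258}{851}
  = \frac{736}{851}
  \approx 0.8649
\]
```

The result agrees with the stated accuracy of 86.49%.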