Data Reconstruction of Sea Surface Temperature in Indonesia’s Fish Management Area 713 (IFMA-713) Using Machine Learning

IFMA-713 in Indonesia is a body of water with dynamic temperature changes caused by interactions with the Pacific Ocean and its surroundings. Sea surface temperature data can be obtained from satellite imagery. However, satellite measurements of sea surface temperature can be incomplete due to cloud cover. In this study, a machine learning method based on a backpropagation neural network algorithm was used to reconstruct sea surface temperature data. The data used in this research were captured by the MODIS satellite sensor. The reconstruction was carried out under four scenarios for treating missing data, at several missing-data percentages: leaving the data empty, filling with zero values, filling with the average value at the point of data collection, and filling with Indonesia's average sea surface temperature. Accurate reconstructions were obtained in the scenarios that showed a positive correlation. The most accurate scenarios for reconstructing sea surface temperature data with missing values were those in which the empty data were filled with the average value at the point of data collection or with Indonesia's average sea surface temperature.


Introduction
Indonesia, as an archipelagic country, has a sea area of 5.8 million km², or around 62% of the country's total area. A sea area this large affects the climate in and around its territory. Indonesia's geographical location between the Pacific Ocean and the Indian Ocean, flanked by two large bodies of water, makes its climate maritime rather than continental [1]. This maritime climate is affected by many oceanographic phenomena, such as the El Niño-Southern Oscillation (ENSO), the Indian Ocean Dipole (IOD), and monsoon wind cycles. These phenomena can cause variations in sea surface temperature in Indonesia.
IOP Publishing, doi:10.1088/1755-1315/1245/1/012037
Sea temperature tends to vary both horizontally and vertically. In the vertical, the surface layer tends to have a homogeneous temperature. As depth increases, however, the water beneath the homogeneous layer does not have time to undergo rapid temperature changes, which creates a considerable difference from the layer above it. The layer in which temperature changes abruptly with depth is known as the thermocline. Below the thermocline, temperature changes slowly, so that layer can be regarded as almost homogeneous. The thickness of the homogeneous layer and the depth of the thermocline vary from one body of water to another. Sea surface temperature (SST) is the temperature of the water near the sea surface. Water temperature, especially at the surface, depends strongly on the amount of heat received from the sun. Knowing the SST is important because its distribution provides information about fronts, upwelling, currents, weather and climate, and fishing grounds.
Sea surface temperature (SST) data can be obtained from MODIS Aqua satellite imagery. However, the MODIS Aqua satellite has some limitations when collecting SST data. Clouds can block the satellite's sensor, preventing it from collecting data over cloud-covered areas. This can lead to missing or inaccurate data [2]. Atmospheric disturbances due to clouds interfere not only with satellite image post-processing but also with image recognition and classification [3]. Several methods can be used to fill the gaps that cloud cover leaves in satellite image data. One method is to replace cloud-covered regions with satellite images of the same area that are free of clouds. This method is not always practical because replacement data are difficult to find and the costs involved are high.
The rapid development of technology and the availability of good data archives allow SST data processing to be integrated with Artificial Intelligence (AI), both to facilitate the processing of sea surface temperature data and to reconstruct data that are blank due to cloud cover. One of the AI approaches used is machine learning (ML), whose main advantage is automation. Machine learning is a powerful tool for spatial reconstruction and has the potential to improve the quality of spatial reconstructions in a variety of applications. As machine learning algorithms continue to improve, spatial reconstruction is expected to become more accurate, more efficient, and more widely used.

Study Area
This study focuses on the Republic of Indonesia State Fisheries Management Area 713 (IFMA-713), which lies between 1° N and 8.5° S latitude and between 114.4° E and 122.7° E longitude. IFMA-713 covers several water areas, including the Makassar Strait, Bone Bay, the Flores Sea, and the Bali Sea. It is bordered by ten provinces: East Kalimantan, South Kalimantan, East Java, Bali, West Nusa Tenggara, East Nusa Tenggara, South Sulawesi, Central Sulawesi, North Sulawesi, and West Sulawesi.
The waters of IFMA-713 (WPPNRI 713 in Indonesian) are influenced by the Pacific Ocean and are part of the western equatorial Pacific system, which plays an important role in the ocean-atmosphere interaction known as ENSO (El Niño-Southern Oscillation). In addition, the waters of IFMA-713 are part of the global conveyor belt circulation (the Great Conveyor Belt) where it crosses Indonesian territory, better known as Arlindo (the Indonesian Throughflow). Arlindo carries a mass of warm water from the Pacific Ocean to the Indian Ocean via the Makassar Strait, Lombok Strait, Timor Sea, Ombai Strait and Lifamatola. Arlindo has also been identified as gathering in the warm pool north of Papua [4]. The study area is shown in Figure 1.
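As a small illustration of how the stated bounds can be used to subset gridded SST records, the following Python sketch filters points by the latitude and longitude limits given for the study area. The record layout `(lat, lon, sst)`, the function names, and the sample values are assumptions made for this example; they do not come from the study.

```python
# Bounding box of the study area as stated in the text (degrees).
# Negative latitude for the southern hemisphere is a convention assumed here.
LAT_MIN, LAT_MAX = -8.5, 1.0      # 8.5 S to 1 N
LON_MIN, LON_MAX = 114.4, 122.7   # 114.4 E to 122.7 E


def in_study_area(lat, lon):
    """Return True if a grid point lies inside the IFMA-713 bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def subset(points):
    """Keep only (lat, lon, sst) records that fall inside the box."""
    return [p for p in points if in_study_area(p[0], p[1])]


# Made-up sample records: one inside, one too far north, one too far west.
sample = [(-5.0, 119.0, 29.1), (3.0, 119.0, 29.8), (-5.0, 110.0, 28.4)]
print(subset(sample))
```

A rectangular box is only an approximation of the management area, whose real boundary follows coastlines and administrative limits.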

Data Used
The data used are Sea Surface Temperature (SST) data for the period from 1 January 2003 to 31 December 2021 (19 years). The SST data were obtained from the MODIS (Moderate Resolution Imaging Spectroradiometer) satellite sensor and can be accessed through the OPeNDAP website. The data have a spatial resolution of 4.63 km and a temporal resolution of 1 day. The MODIS SST data are derived from mid-infrared radiation captured in bands 22 and 23 at wavelengths of 3.5 to 4.2 micrometers. These bands are well suited to the purpose because they show less variation than bands 31 and 32.
In addition to the MODIS data, daily sea surface temperature data for January 2003 to December 2021 were obtained from the Optimum Interpolation Sea Surface Temperature dataset (OISSTv2), which can be accessed from NOAA. These data are daily averages. They are used to test machine learning performance when determining the hyperparameters.

Application of backpropagation method
Backpropagation is a supervised learning method that trains multilayer perceptrons to recognize patterns and respond correctly to similar patterns, even if they are not exactly the same as the patterns used during training [5].
The backpropagation algorithm accepts an input pattern and performs a computation based on randomly initialized weights. If the network output differs from the expected target, the network adjusts its weights. In principle the process continues until the output matches the target, but reaching that point takes a long time. The learning process is therefore limited and stops once the difference between the output and the target is smaller than a tolerance value (error rate). The size of the weight adjustment in each learning cycle is determined by a parameter called the learning rate [6]. The backpropagation method consists of three processes: forward pass, backpropagation, and weight adjustment.

Forward pass
During the forward pass, the network receives n inputs ($x_i$), passes them through a hidden layer of p units, and produces m outputs ($y_k$). Each output is then compared with its target ($t_k$), and the model error is calculated from the difference between $t_k$ and $y_k$. If the error is smaller than the tolerance limit, the iteration stops. If the error is still greater than the tolerance limit, the weight of each connection in the network is modified to reduce the error. The hidden-layer values in the forward pass are calculated as follows:

$$z\_net_j = v_{0j} + \sum_{i=1}^{n} x_i v_{ij} \tag{1}$$
$$z_j = f(z\_net_j) \tag{2}$$

where $z\_net_j$ is the net input to hidden unit j, $v_{0j}$ is the weight of the connection from the input-layer bias to hidden unit j, $v_{ij}$ is the weight of the connection from input unit i to hidden unit j, f is the activation function, and $z_j$ is the output of hidden unit j. Next, the network output in the $y_k$ units is calculated as follows:

$$y\_net_k = w_{0k} + \sum_{j=1}^{p} z_j w_{jk} \tag{3}$$
$$y_k = f(y\_net_k) \tag{4}$$

where $y\_net_k$ is the net input to output unit k, $w_{0k}$ is the weight of the connection from the hidden-layer bias to output unit k, $w_{jk}$ is the weight of the connection from hidden unit j to output unit k, and $y_k$ is the output of output unit k.
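The forward pass can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the names `x`, `v`, `v0`, `w`, `w0`, `z` and `y` follow the conventional notation for inputs, weights, biases and unit outputs, and a logistic sigmoid is assumed for f because the text does not name the activation function.

```python
import math


def sigmoid(x):
    """Logistic activation, assumed here; the text does not specify f."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, v, v0, w, w0, f=sigmoid):
    """One forward pass through a network with a single hidden layer.

    x  : list of n inputs
    v  : n x p weights, v[i][j] links input i to hidden unit j
    v0 : p hidden-unit biases
    w  : p x m weights, w[j][k] links hidden unit j to output unit k
    w0 : m output-unit biases
    """
    n, p, m = len(x), len(v0), len(w0)
    # Net input and output of each hidden unit.
    z_net = [v0[j] + sum(x[i] * v[i][j] for i in range(n)) for j in range(p)]
    z = [f(s) for s in z_net]
    # Net input and output of each output unit.
    y_net = [w0[k] + sum(z[j] * w[j][k] for j in range(p)) for k in range(m)]
    y = [f(s) for s in y_net]
    return z, y


# Hand-checkable case: all weights and biases are zero, so every unit
# receives a net input of 0 and the sigmoid returns 0.5.
x = [0.2, 0.7]
v = [[0.0, 0.0], [0.0, 0.0]]
w = [[0.0], [0.0]]
z, y = forward(x, v, [0.0, 0.0], w, [0.0])
print(z, y)  # [0.5, 0.5] [0.5]
```

Writing the sums as explicit loops over i, j and k keeps the code close to the index notation of the formulas, at the cost of speed on large grids.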

Backpropagation
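After the output layer is calculated in the first phase, the error term of each output unit ($\delta_k$) is calculated and used to distribute the error in unit $y_k$ back to all hidden units connected to it. $\delta_k$ is also used to change the weights of the connections that lead directly into the output units. In the same way, an error term is calculated for each hidden unit ($\delta_j$) as the basis for changing the weights between the input units and the hidden layer. This calculation is performed for all δ factors in the hidden layer units connected to the input units. The δ factor of the output units is calculated as follows:

$$\delta_k = (t_k - y_k) f'(y\_net_k) \tag{5}$$

where $t_k$ is the target to be reached, $y_k$ is the output, and $f'$ is the derivative of the activation function f. The weight change $\Delta w_{jk}$, which is used to update the weight $w_{jk}$ with the learning rate $\alpha$, is then calculated using the following equation:

$$\Delta w_{jk} = \alpha \delta_k z_j \tag{6}$$

Next, the δ factor is calculated from the error in each hidden layer unit $z_j$, together with the weight change $\Delta v_{ij}$:

$$\delta\_net_j = \sum_{k=1}^{m} \delta_k w_{jk} \tag{7}$$
$$\delta_j = \delta\_net_j f'(z\_net_j) \tag{8}$$
$$\Delta v_{ij} = \alpha \delta_j x_i \tag{9}$$

where $\delta\_net_j$ is the summed error arriving at hidden unit j, $\delta_j$ is the error term of hidden unit j, $z_j$ is the output of hidden unit j, $x_i$ is the i-th input, and $\alpha$ is the learning rate.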
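The error terms and weight changes of this phase can be illustrated as follows. The sketch assumes a sigmoid activation, for which the derivative can be written as y(1 - y) in terms of the unit output; the function names and the numbers in the example are hypothetical and chosen so the results can be checked by hand.

```python
def output_deltas(t, y):
    """Error term of each output unit: (t_k - y_k) * f'(y_net_k).

    For a sigmoid, f'(y_net_k) equals y_k * (1 - y_k).
    """
    return [(tk - yk) * yk * (1.0 - yk) for tk, yk in zip(t, y)]


def hidden_deltas(delta_k, w, z):
    """Error term of each hidden unit: (sum_k delta_k * w[j][k]) * f'(z_net_j)."""
    deltas = []
    for j, zj in enumerate(z):
        delta_net_j = sum(dk * w[j][k] for k, dk in enumerate(delta_k))
        deltas.append(delta_net_j * zj * (1.0 - zj))
    return deltas


def weight_changes(alpha, delta, activations):
    """Change for weight [a][b]: alpha * delta[b] * activations[a]."""
    return [[alpha * d * a for d in delta] for a in activations]


# Example: two inputs, two hidden units, one output unit.
x, z, y, t = [1.0, 0.0], [0.5, 0.5], [0.5], [1.0]
w = [[0.4], [-0.2]]
dk = output_deltas(t, y)         # (1 - 0.5) * 0.5 * 0.5 = 0.125
dj = hidden_deltas(dk, w, z)     # [0.125*0.4*0.25, 0.125*(-0.2)*0.25]
dw = weight_changes(0.1, dk, z)  # learning rate * delta_k * z_j
dv = weight_changes(0.1, dj, x)  # learning rate * delta_j * x_i
print(dk, dj)
```

Note that the hidden-unit error terms use the weights `w` as they were before any update, which is why the deltas are all computed before the weights are changed.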

Weight adjustment
After all the δ factors are calculated, weight adjustments are carried out on all connections. The weight of each connection is adjusted using the δ factors calculated in the previous phase, according to the following equations:

$$w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk} \tag{10}$$
$$v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij} \tag{11}$$
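Putting the three phases together, a complete training loop might look like the sketch below. It is a toy example under several assumptions that are not taken from the study: a sigmoid activation, per-pattern (online) updates, a half sum-of-squares error, two hidden units, and logical-OR patterns used only because they are easy to check. Training stops when the epoch error falls below the tolerance or when the epoch limit is reached.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(patterns, p=2, alpha=0.5, tol=0.01, max_epochs=2000, seed=0):
    """Train a one-hidden-layer network; return the error of every epoch."""
    rng = random.Random(seed)
    n, m = len(patterns[0][0]), len(patterns[0][1])
    # Random initial weights and biases.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(p)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(p)]
    w0 = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    history = []
    for _ in range(max_epochs):
        error = 0.0
        for x, t in patterns:
            # Forward pass.
            z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(n)))
                 for j in range(p)]
            y = [sigmoid(w0[k] + sum(z[j] * w[j][k] for j in range(p)))
                 for k in range(m)]
            error += 0.5 * sum((t[k] - y[k]) ** 2 for k in range(m))
            # Error terms, computed with the weights before the update.
            dk = [(t[k] - y[k]) * y[k] * (1 - y[k]) for k in range(m)]
            dj = [sum(dk[k] * w[j][k] for k in range(m)) * z[j] * (1 - z[j])
                  for j in range(p)]
            # Weight adjustment: new weight = old weight + change.
            for j in range(p):
                for k in range(m):
                    w[j][k] += alpha * dk[k] * z[j]
            for k in range(m):
                w0[k] += alpha * dk[k]
            for i in range(n):
                for j in range(p):
                    v[i][j] += alpha * dj[j] * x[i]
            for j in range(p):
                v0[j] += alpha * dj[j]
        history.append(error)
        if error < tol:  # stop once the error is within tolerance
            break
    return history


# Toy patterns (logical OR), used only for illustration.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
history = train(data)
print(len(history), history[0], history[-1])
```

The learning rate, tolerance and epoch limit are exactly the kind of hyperparameters that would need tuning on real SST data.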

Pearson's Correlation Coefficient
Correlation analysis is often used to determine the relationship between two variables. The correlation coefficient is commonly denoted by r and expresses both the magnitude and the direction of the relationship. Its value lies between -1 and +1 and is dimensionless. When r is zero, there is no measurable linear association between the two variables, whereas the closer the absolute value of r is to 1, the stronger the association. A positive correlation value shows that the two variables are directly proportional to each other, while a negative correlation value shows that they are inversely proportional. The Pearson correlation coefficient is calculated using the following equation:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where r is the correlation coefficient, $x_i$ is the sample value of the x-variable, $\bar{x}$ is the average value of the x-variable, $y_i$ is the sample value of the y-variable, and $\bar{y}$ is the average value of the y-variable.
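A direct translation of the Pearson coefficient into Python is shown below as a minimal sketch. The function name and the sample temperatures are illustrative only. The example deliberately uses a reconstructed series that follows the observed pattern exactly but is offset by 0.5 °C, to show that r measures agreement in pattern and not in absolute value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either series is constant.
    return cov / math.sqrt(var_x * var_y)


observed = [28.0, 29.0, 30.0, 31.0]
reconstructed = [28.5, 29.5, 30.5, 31.5]  # same pattern, offset by 0.5
print(pearson_r(observed, reconstructed))  # 1.0
```

Because a constant offset leaves r unchanged, a high correlation on its own does not guarantee a small reconstruction error.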

Root Mean Square Error
Performance evaluation is needed to test the accuracy of the model so that it can later be applied to other problems. Root Mean Square Error (RMSE) is one method for evaluating model performance. RMSE measures the magnitude of the prediction error on a dataset: the smaller the RMSE (the closer to zero), the more accurate the predictions and the closer they are to the actual values. RMSE is calculated using the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $y_i$ is the actual data value, $\hat{y}_i$ is the predicted value, and n is the number of data points tested.
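The RMSE calculation can be sketched in the same way. The function name and sample values are illustrative; the series are chosen so that every prediction is off by exactly 0.5 °C, which makes the result easy to verify.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


observed = [28.0, 29.0, 30.0, 31.0]
reconstructed = [28.5, 29.5, 30.5, 31.5]  # every value is 0.5 too high
print(rmse(observed, reconstructed))  # 0.5
```

RMSE is expressed in the same unit as the data (here °C) and, unlike the correlation coefficient, it does register a constant offset between the two series.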

Model scenario
The

Timeseries plot
The performance test of the machine learning program was carried out using daily SST data to determine the hyperparameters that increase accuracy, as visualized in Figure 2. The test shows a good correlation of 0.99 with an RMSE of 0.38 °C. Next, using SST data from MODIS, cloud cover over the satellite sensor was simulated by removing some data at the times with the greatest cloud cover, namely the rainy season from October to April [8]. After the data were removed, a reconstruction trial was carried out using the machine learning program with the same hyperparameters previously used on the daily SST data. As seen in Figure 4, the reconstruction still showed a high correlation of 0.99. However, the RMSE obtained was 4.56 °C, so the model was less accurate. The machine learning program therefore needed to be improved through hyperparameter tuning before being run again for data reconstruction.
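The cloud-cover simulation amounts to blanking out values that fall in the rainy season. The sketch below shows one way this might be done; the `(date, value)` record layout, the use of `None` for a missing value, the `fraction` parameter and the constant placeholder temperature are all assumptions for illustration, since the text only says that some rainy-season data were removed.

```python
import random
from datetime import date, timedelta

RAINY_MONTHS = {10, 11, 12, 1, 2, 3, 4}  # October to April, as in the text


def mask_rainy_season(series, fraction=1.0, seed=0):
    """Set a fraction of the rainy-season values to None (missing)."""
    rainy = [i for i, (d, _) in enumerate(series) if d.month in RAINY_MONTHS]
    rng = random.Random(seed)
    chosen = set(rng.sample(rainy, round(fraction * len(rainy))))
    return [(d, None if i in chosen else v) for i, (d, v) in enumerate(series)]


# Illustrative daily series for one year with a constant placeholder value.
start = date(2019, 1, 1)
series = [(start + timedelta(days=i), 29.0) for i in range(365)]
masked = mask_rainy_season(series, fraction=0.5)
missing = sum(1 for _, v in masked if v is None)
print(missing, len(masked))
```

Keeping the original series alongside the masked copy is what makes it possible to score the reconstruction afterwards, because the removed values are known.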
According to [9], hyperparameter tuning is the process of finding the optimal hyperparameter values of a machine learning model in order to improve its performance. This is done by trying various hyperparameter values and comparing the results using performance metrics such as accuracy. The reconstruction of empty sea surface temperature data in the satellite data after adjusting the hyperparameters is shown in Figure 5. If the training data is too short, the model cannot learn the patterns in the data and accuracy decreases. If the training data is too long, the model may overfit the data and accuracy decreases due to an increase in variance and a decrease in bias.
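Trying various hyperparameter values and comparing a performance metric can be organized as an exhaustive grid search, sketched below. The text does not state which hyperparameters were tuned or which search strategy was used, so the parameter names `learning_rate` and `hidden_units`, the candidate values, and the stand-in scoring function are all hypothetical.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_params, best_score).

    evaluate : callable taking the hyperparameters as keyword arguments and
               returning an error score where lower is better (e.g. RMSE)
    grid     : dict mapping hyperparameter name to a list of candidates
    """
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in scoring function. In practice this would train the network
# with the given settings and return the RMSE on held-out data.
def fake_rmse(learning_rate, hidden_units):
    return abs(learning_rate - 0.1) + abs(hidden_units - 8) / 10


grid = {"learning_rate": [0.01, 0.1, 0.5], "hidden_units": [4, 8, 16]}
print(grid_search(fake_rmse, grid))
```

Grid search is simple but its cost grows with the product of the candidate list lengths, so it is usually applied to a small number of hyperparameters.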

Reconstructed value with scenario
The reconstruction model shows good performance in reconstructing sea surface temperature data. However, further testing is needed to see how the model performs when reconstructing data under specific scenarios. The results of testing the model against the predetermined scenarios are shown in Figure 6 to Figure 9. The results from scenario 3 and scenario 4 are quite close to the initial data reconstruction. This is because the blank sea surface temperature values in scenario 3 and scenario 4 are filled with the average value of all data and with 28 °C, respectively, so the machine learning program can perform the reconstruction well compared with scenario 1, in which blank data are left unfilled, and scenario 2, in which blank data are filled with zero values. The scenario 3 and scenario 4 reconstructions also capture the pattern of increase and decrease in sea surface temperature in IFMA-713. Scenario 1 and scenario 2 performed so poorly that they cannot be used to reconstruct data. Table 2 shows that scenario 3 and scenario 4 perform equally well and close to the initial reconstruction.
Scenario 3 and scenario 4 were therefore tested again by reconstructing data with 30%, 40% and 50% of the values missing; the resulting graphs are shown in Figure 10 and the figures that follow. The difference in performance between scenario 3 and scenario 4 is not significant, as can be seen from their similar RMSE values. In both scenarios, the correlation with the sea surface temperature data without missing values decreases as the amount of missing data increases: the correlation values are 0.82, 0.79 and 0.76, a decrease of 0.03 for each additional 10% of missing data. This is because missing values can shift the input pattern, so the backpropagation network must find a more complex solution to satisfy all the reconstructed patterns [10].
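The four scenarios differ only in how missing values are treated before the data reach the network, which can be sketched as below. The function names, the use of `None` for a missing value, and the random choice of which values to remove are assumptions for illustration. Scenario 3 is implemented here as the mean of the values still available in the series, and the 28 °C constant is the figure quoted in the text for scenario 4.

```python
import random

INDONESIA_MEAN_SST = 28.0  # degrees C, value quoted in the text for scenario 4


def remove_fraction(values, fraction, seed=0):
    """Replace a given fraction of the values with None, chosen at random."""
    rng = random.Random(seed)
    k = round(fraction * len(values))
    gone = set(rng.sample(range(len(values)), k))
    return [None if i in gone else v for i, v in enumerate(values)]


def fill_missing(values, scenario):
    """Apply one of the four missing-data treatments to a series."""
    if scenario == 1:            # leave the gaps empty
        return list(values)
    if scenario == 2:            # fill with zeros
        fill = 0.0
    elif scenario == 3:          # fill with the mean of the available values
        present = [v for v in values if v is not None]
        if not present:
            raise ValueError("no values available to average")
        fill = sum(present) / len(present)
    elif scenario == 4:          # fill with Indonesia's average SST
        fill = INDONESIA_MEAN_SST
    else:
        raise ValueError("scenario must be 1, 2, 3 or 4")
    return [fill if v is None else v for v in values]


series = [29.0, None, 31.0]
for s in (1, 2, 3, 4):
    print(s, fill_missing(series, s))
```

The example makes the weakness of scenario 2 visible: a zero is far outside the range of realistic sea surface temperatures, whereas the fills in scenarios 3 and 4 stay close to plausible values.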

Spatial plot
The reconstructed data in IFMA-713 were visualized spatially in order to examine the spatial distribution of sea surface temperature and the performance of the program across the whole study area. The reconstruction was carried out by applying backpropagation to all data points in the study area for one period of time, and the results are shown as spatial plots. In Figure 16, spatial plots with empty data that is not filled with a value (NaN) show blank spaces marked in white, which makes the plots less informative. After the missing values were reconstructed, the spatial plot became more informative, with more complete data than the previous spatial plot. Although the spatial plots of the reconstruction results are good, they show that the program adjusts certain values so that the resulting data more closely resemble the seasonal patterns of the training data. This is due to insufficient training data length, which limits the ability of the program to reconstruct data with very high accuracy. As the reconstructed spatial plot shows, the program was able to capture a significant difference in sea surface temperature, with temperatures in the north of the Makassar Strait being higher than those in the Bali Sea.
Machine learning programs show better performance than statistical methods in reconstructing data. However, this does not mean that statistical methods cannot be used to reconstruct data at all. Both methods have limitations and advantages, depending on the need.
According to [11], the biggest difference between statistical methods and machine learning lies in how the model is developed. In machine learning methods, models are built by algorithms based on the available data, while in statistical methods the model is developed manually by the user. One advantage of machine learning programs is that they can develop models with complex relationships between data variables. Another advantage lies in the amount of data processed: the more data that is processed, the better the resulting model. Statistical methods, by contrast, are more often used on data series of short to medium length.

Conclusion
The machine learning model can reconstruct the data accurately, with r = 0.99 for all variations in training data length in the IFMA-713 region. The results show that the longer the training data, the better the program performs in reconstructing the data, as evidenced by the smallest RMSE of 0.75 °C being obtained with 15 years of training data. The machine learning program cannot perform data reconstruction if there are blank values in the test data, as shown by scenario 1, and does not perform well when the blank values are filled with zeros (scenario 2). However, the program can perform the reconstruction if the missing values are filled with the average value (scenario 3) or with the average sea surface temperature in Indonesia (scenario 4), and both of these scenarios show performance that is close to the initial reconstruction. Scenario 3 and scenario 4 are able to reconstruct data with 30%, 40% and 50% of values missing with performance as good as in the 20% missing data scenario. The spatial reconstruction was able to reconstruct the sea surface temperature of the IFMA-713 area by filling in the blank values with data from the machine learning program.

Figure 1. Indonesia's Fish Management Area 713

Figure 2. Timeseries plot with machine learning on daily SST data

Figure 3. Simulation of data loss in MODIS satellite imagery

Figure 5. Timeseries plot after adjusting hyperparameters
Table 2. Evaluation performance on variation of training data length. Column header: Training Data (year)

The downloaded SST data is divided into training data and test data. Training data is used as input to train the backpropagation model, while test data is used to test model performance when making predictions on actual data [7]. After the split, the SST data from MODIS satellite imagery is averaged monthly to fill in the gaps in the training data and test data. The SST data is also used as input for determining the machine learning parameters. The machine learning scenarios carried out in this study are divided according to several treatments of the training data and test data. For the training data, the length is varied over 15 years, 10 years and 5 years, with test data from the 2018-2021 period used to test program performance for each training data length. After this performance test, 20% of the available test data is removed to simulate empty data in satellite imagery and to test program performance on the test data. For details of each scenario, see Table 1.
Table 1. Scenario model
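The split into training and test data and the monthly averaging can be sketched as follows. The `(date, value)` record layout and the function names are assumptions. The sketch also assumes that each training period ends immediately before the 2018-2021 test period, which matches the 15-year case (2003 to 2017) but is not stated in the text for the 10-year and 5-year cases.

```python
from collections import defaultdict
from datetime import date


def split_by_year(series, train_years, test_start=2018, test_end=2021):
    """Split (date, sst) records into training and test sets by year."""
    train_start = test_start - train_years
    train = [(d, v) for d, v in series if train_start <= d.year < test_start]
    test = [(d, v) for d, v in series if test_start <= d.year <= test_end]
    return train, test


def monthly_mean(series):
    """Average the available values in each (year, month); skip missing ones."""
    buckets = defaultdict(list)
    for d, v in series:
        if v is not None:
            buckets[(d.year, d.month)].append(v)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# Made-up records: one before the training window, two in it (one missing),
# and two in the test period.
series = [
    (date(2002, 5, 1), 28.0),
    (date(2017, 12, 30), 29.0),
    (date(2017, 12, 31), None),
    (date(2018, 1, 1), 30.0),
    (date(2018, 1, 2), 31.0),
]
train, test = split_by_year(series, train_years=15)
print(monthly_mean(train), monthly_mean(test))
```

Splitting by date, with the test period after the training period, keeps future observations out of the training set, which matters for time series such as SST.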

Table 3. Evaluation performance