Reconstruction of Thermocline Depth based on Machine Learning

The thermocline, a distinctive structure in the ocean temperature field, strongly affects sound propagation, material transport, and human underwater activities. Because it lies below the surface, the thermocline cannot be observed in real time or at high spatiotemporal resolution with existing methods; it can only be calculated from measured data. Previous studies have demonstrated that the ocean thermocline is significantly correlated with sea surface elements. This paper builds a machine learning dataset that takes sea level anomaly, sea surface temperature, sea surface salinity, time, longitude and latitude as inputs and the thermocline depth calculated by the vertical gradient method as the target. A Random Forest model is constructed to obtain the regression relationship between the thermocline depth and the sea surface and spatiotemporal elements. The resulting reconstruction model fits the thermocline depth well, with a coefficient of determination of 0.85, and its accuracy has been evaluated. The feature importance of the six model inputs is also evaluated. The results show that the three sea surface parameters have similar importance, with sea surface temperature being the most important of the three. The final model can rapidly reconstruct thermocline depth from sea surface information.


Introduction
The thermocline is the layer in which seawater temperature changes sharply with depth and the vertical temperature gradient reaches a certain critical value [1]; it is an important indicator of the physical properties of the ocean temperature field. The temperature structure of the upper ocean is closely related to air-sea interaction, so a deeper understanding of upper-ocean structures such as the thermocline and the mixed layer is of great significance to the study of air-sea interaction [2]. Meanwhile, the ocean acoustic field is closely related to changes in the ocean temperature field. Identifying the thermocline for signal processing is therefore an important topic in underwater communication.
Because the thermocline influences the physical properties of the ocean, national defence, meteorological changes, and fishery production, it has long been of great concern to researchers. However, because the thermocline lies underwater, it cannot be directly observed by oceanographic instruments and must be calculated from measured temperature profiles, which are still insufficient in temporal and spatial resolution. Physical data at the ocean surface, by contrast, are easier to obtain than those underwater. Reconstructing the underwater structure of the ocean by combining sea surface observations with in situ measurements is therefore an essential scientific and technological issue.
For decades, numerous studies have shown a significant correlation between thermocline depth and sea surface elements [3][4][5][6][7][8]. Qiu, B. et al. found a significant correlation between sea surface temperature (SST) and thermocline depth by analysing temperature data spanning 25 years; this correlation is mainly due to the combined effects of seawater heat distribution and turbulent transport [9]. Watanabe, S. et al. found a significant negative correlation between sea surface salinity (SSS) and thermocline depth in the subtropical Northwest Pacific, with different characteristics at seasonal and interannual scales [10]. Fan, Y. et al. used satellite remote sensing data and measured data to analyse the correlation between surface temperature and thermocline depth in the tropical western Pacific Ocean, and found a significant correlation whose intensity varies markedly with season and region [11]. Building on this correlation research, this paper studies the reconstruction of the thermocline depth from sea surface elements.
In recent years, reconstructing ocean physical fields with machine learning has become a research hotspot. Wu Fangfang, Fu Zhiyi and others inverted the SSS in the Gulf of Mexico with the Random Forest method, and the inversion accuracy was higher than that of multiple linear regression, support vector machine and artificial neural network models [12]. Gou Yu proposed combining marine data with machine learning to determine the thermocline adaptively, analysing marine data features with the Random Forest method [13]. Jiang Jiale et al. inverted the SSS in Hong Kong waters using Random Forest with multi-factor parameters, and their predictions were highly accurate [14]. The Random Forest algorithm is widely used in the inversion of ocean physical parameters because of its high accuracy, strong resistance to overfitting, and easy parameter setting.
To address the low spatiotemporal resolution of thermocline data and the fact that they cannot be obtained in real time, this paper combines in situ observations with satellite remote sensing products to study the reconstruction of ocean thermocline depth with machine learning. The Random Forest algorithm is used to rapidly reconstruct high-resolution thermocline depth data from sea surface data, providing data support for scientific research on air-sea interaction and for ocean hydrographic support. The structure of this paper is as follows. Section 2 describes the data sources and methods, and Section 3 presents the results. Section 4 analyses the results and discusses the feasibility of using the Random Forest method to reconstruct the thermocline depth. Section 5 summarises the paper.

Data Sources
The sea level anomaly (SLA) data are satellite data from Aviso, distributed as a Ssalto/Duacs altimeter product that combines data from multiple satellites: Jason-3, Sentinel-3A, HY-2A, Saral/AltiKa, Cryosat-2, Jason-2, Jason-1, T/P, ENVISAT, GFO and ERS1/2. It is an along-track altimetry satellite product that provides data in both real-time and delayed formats. This paper uses the 0.25° × 0.25° monthly average gridded data, spanning 1 January 1993 to 31 August 2022, available from https://marine.copernicus.eu/ [15,16].
The in situ data used to calculate the thermocline depth are from the EN.4.2.2 dataset published by the Met Office Hadley Centre (UK). The data in EN4 come from four sources: Argo, the Arctic Synoptic Basin-wide Oceanography (ASBO), the Global Temperature and Salinity Profile Programme (GTSPP) and the World Ocean Database 2018 (WOD18). The dataset provides ocean subsurface temperature and salinity data from 1900 to the present as quality-controlled profiles and objectively analysed gridded fields, and is available from https://www.metoffice.gov.uk/hadobs/en4/download-en4-2-2.html [17].
From the EN4 objective analysis data, the uppermost layer of temperature and salinity, at a depth of 5 m, is taken as the SST and SSS data.
The objective analysis data from 1993-2020 are used to calculate the thermocline depth. To maximise the accuracy of the calculation, the gridded objective analysis data are processed as follows:
• Data points with a depth of less than 20 m or with fewer than four observed layers are excluded, as too few observations increase the error of the calculation;
• The analysis data at each grid point are interpolated to standard layers with a sampling interval of 1 m;
• The interpolated data are examined again for abnormal values, such as temperatures outside the normal range (271-306 K);
• A small portion of obviously abnormal data is rejected manually.
The choice of interpolation method has an important impact on the quality of the interpolated data. Figure 1 compares the cubic spline, Akima and linear interpolation methods at an inflexion point. Compared with the latter two methods, the cubic spline method gives a smoother result at the inflexion point, in line with the way water temperature varies; in monotonic regions, the performance gap between Akima and cubic spline interpolation is small. Therefore, in this paper, the cubic spline method is used to interpolate the vertical temperature data.
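The screening and interpolation steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `to_standard_levels` and the sample profile are hypothetical, and SciPy's `CubicSpline` is assumed as the cubic spline implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def to_standard_levels(depth_m, temp_k, step_m=1.0):
    """Interpolate one temperature profile onto evenly spaced depth levels.

    Returns None when the profile is too shallow or too sparse to be used.
    """
    depth_m = np.asarray(depth_m, dtype=float)
    temp_k = np.asarray(temp_k, dtype=float)

    # Screening rule: exclude profiles shallower than 20 m or with fewer than four layers.
    if depth_m.max() < 20.0 or depth_m.size < 4:
        return None

    # Standard layers at a 1 m sampling interval between the first and last observed depth.
    levels = np.arange(depth_m.min(), depth_m.max() + step_m / 2, step_m)
    temp_interp = CubicSpline(depth_m, temp_k)(levels)

    # Flag interpolated values outside the 271-306 K range for later rejection.
    valid = (temp_interp >= 271.0) & (temp_interp <= 306.0)
    return levels, temp_interp, valid


# Hypothetical profile (depth in metres, temperature in kelvin).
depth = [5, 15, 25, 35, 50, 75, 100]
temp = [302.0, 301.8, 301.5, 299.0, 295.0, 290.0, 288.0]
levels, temp_1m, valid = to_standard_levels(depth, temp)
```

Swapping `CubicSpline` for `scipy.interpolate.Akima1DInterpolator` or `numpy.interp` would give the other two methods compared in Figure 1.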

Dataset construction
The morphology of the thermocline in the ocean can be roughly divided into the single thermocline, the inverse thermocline, the multi-thermocline, the mixed thermocline, and so on. To simplify calculation and analysis, this paper considers only the positive thermocline. For profiles with multiple thermoclines, one layer is selected for analysis according to its strength. The thermocline depth is defined as the depth of the upper boundary of the thermocline, obtained with the vertical gradient method.
The vertical temperature gradient is given by $G = \Delta T / \Delta z$, where $\Delta T$ is the temperature difference between adjacent layers and $\Delta z$ is the corresponding depth difference. According to the minimum standard value of thermocline intensity stipulated in the Specification for Marine Surveys and the Technical Regulations for Surveying China's Exclusive Economic Zone and Continental Shelf [18], the minimum temperature gradient of the thermocline is 0.05 ℃/m in deep water (depth greater than 200 m) and 0.2 ℃/m in shallow water (depth less than 200 m). The vertical gradient of the temperature data is calculated and judged layer by layer. A water layer whose vertical gradient meets the above criterion is designated as thermocline, and the depths of its upper and lower endpoints are used as the upper and lower boundary depths of the thermocline.
All layers identified as thermocline in a profile are then analysed. Consecutive layers are combined into one thermocline segment. For two discontinuous segments, if the gap between the lower boundary of the upper segment and the upper boundary of the lower segment is less than 10 m when the upper boundary depth of the lower segment is less than 50 m, or less than 30 m when that depth is greater than 50 m, the upper boundary of the upper segment and the lower boundary of the lower segment are taken as a new segment and its gradient is calculated [19]. If the new segment still meets the thermocline criterion, it replaces the original two segments; if it does not, the gradients of the two original segments are compared and the one with the larger gradient is selected as the final thermocline. The final thermocline must be at least 10 m thick if its upper boundary depth is less than 50 m, and at least 20 m thick if its upper boundary depth is more than 50 m. If the final thermocline does not meet these criteria, the profile is judged to contain no thermocline.
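The core of the vertical gradient method can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: it flags layers that meet the gradient threshold, groups consecutive flagged layers, and returns the upper boundary of the strongest segment. It deliberately omits the merging of nearby segments and the minimum-thickness check described above. The function name and the synthetic profile are hypothetical, and the gradient is taken as positive when temperature decreases with depth.

```python
import numpy as np


def thermocline_upper_boundary(depth_m, temp_c, min_gradient=0.05):
    """Return the upper-boundary depth of the strongest positive thermocline, or None."""
    depth_m = np.asarray(depth_m, dtype=float)
    temp_c = np.asarray(temp_c, dtype=float)

    # Gradient of each layer between adjacent levels (positive = cooling with depth).
    grad = -np.diff(temp_c) / np.diff(depth_m)
    flagged = grad >= min_gradient

    # Group consecutive flagged layers into segments of level indices (top, bottom).
    segments = []
    start = None
    for i, ok in enumerate(flagged):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flagged)))
    if not segments:
        return None

    # Strength of a segment: mean gradient between its top and bottom levels.
    def strength(segment):
        top, bottom = segment
        return (temp_c[top] - temp_c[bottom]) / (depth_m[bottom] - depth_m[top])

    top, _ = max(segments, key=strength)
    return float(depth_m[top])


# Synthetic profile: mixed layer to 30 m, then cooling by 0.1 degC/m down to 60 m.
z = np.arange(0.0, 101.0, 1.0)
t = 28.0 - 0.1 * np.clip(z - 30.0, 0.0, 30.0)
upper = thermocline_upper_boundary(z, t, min_gradient=0.05)
```

For shallow water the same function would be called with `min_gradient=0.2`, following the thresholds quoted from [18].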
To construct the dataset, the thermocline depth data must be matched with the SLA data. The SLA data are on a 0.25° × 0.25° grid, whereas the EN4 analysis data are on a 1° × 1° grid. The SLA grid is therefore interpolated to the coordinates of the points that carry thermocline information, giving the thermocline depth at each point together with its corresponding SLA value.
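A minimal sketch of this matching step is given below. The paper does not state which horizontal interpolation is used, so bilinear interpolation with SciPy's `RegularGridInterpolator` is an assumption here; the function name `sla_at_points` and the synthetic SLA field are hypothetical.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator


def sla_at_points(sla_lat, sla_lon, sla_grid, target_lat, target_lon):
    """Interpolate a gridded SLA field (lat x lon) to arbitrary target coordinates."""
    interpolator = RegularGridInterpolator(
        (sla_lat, sla_lon),
        sla_grid,
        method="linear",      # assumed; the paper does not name the method
        bounds_error=False,   # points outside the grid become NaN instead of raising
        fill_value=np.nan,
    )
    points = np.column_stack([target_lat, target_lon])
    return interpolator(points)


# Synthetic 0.25 degree SLA grid and two 1 degree grid-cell centres.
lat = np.arange(0.0, 10.25, 0.25)
lon = np.arange(120.0, 130.25, 0.25)
sla = np.add.outer(0.01 * lat, 0.02 * lon)  # a simple planar field for illustration
sla_matched = sla_at_points(lat, lon, sla, [0.5, 1.5], [120.5, 121.5])
```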

Method
Random Forest has good accuracy and stability, can be trained efficiently on large-scale data, and can handle high-dimensional data without feature selection; it performs well in both classification and regression problems. It can also handle unbalanced datasets and output feature importances, which helps feature engineering and model interpretation. The Random Forest approach is therefore used to construct the reconstruction model in this study. The algorithm structure is shown in Figure 2.
The flowchart for reconstructing the thermocline depth with the Random Forest method is shown in Figure 3. The steps are described as follows.
Step 1. Data preprocessing. The matched data from 1993-2017 are used as the training set, and the data from 2018-2020 as the test set. Six normalised variables (Month, Lon, Lat, SLA, SST, SSS) are the inputs of the reconstruction model, and the thermocline depth calculated with the vertical gradient method is the output.
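The split by year and the normalisation can be illustrated as below. The paper does not say which normalisation is applied, so min-max scaling with statistics taken from the training set only is an assumption; the function name and the toy arrays are hypothetical.

```python
import numpy as np


def split_and_normalise(features, years, last_train_year=2017):
    """Split samples by year, then min-max scale using training-set statistics."""
    features = np.asarray(features, dtype=float)
    years = np.asarray(years)

    train_mask = years <= last_train_year
    x_train, x_test = features[train_mask], features[~train_mask]

    # Scaling parameters come from the training set so that no test information leaks in.
    lo = x_train.min(axis=0)
    span = x_train.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns

    return (x_train - lo) / span, (x_test - lo) / span


# Toy example with two feature columns and one sample per year.
toy_features = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0], [7.0, 40.0]]
toy_years = [2016, 2017, 2018, 2019]
x_train, x_test = split_and_normalise(toy_features, toy_years)
```

Because the test years lie outside the training range in this toy case, the scaled test values exceed 1, which is expected with this choice of scaling.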
Step 2. Model training. The Random Forest model is tuned through the number of decision trees, n_estimators, and the maximum tree depth, max_depth. The GridSearchCV or RandomizedSearchCV classes provided by the sklearn machine learning toolbox can be used to search the parameters automatically and obtain the best model parameters. The optimal parameters obtained by grid search with cross-validation are max_depth=67 and n_estimators=217.
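A grid search of this kind might look like the sketch below. The data are synthetic stand-ins for the six inputs and the thermocline depth, the parameter grid is deliberately tiny so that the example runs quickly, and the scoring metric and number of folds are assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the six normalised inputs (Month, Lon, Lat, SLA, SST, SSS).
rng = np.random.default_rng(0)
X = rng.random((300, 6))
y = 50 + 40 * X[:, 2] - 20 * X[:, 4] + rng.normal(0, 1, 300)

# Small illustrative grid; a real search would cover a much wider range.
param_grid = {"n_estimators": [20, 50], "max_depth": [5, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # assumed number of folds
    scoring="neg_mean_absolute_error",   # assumed selection metric
)
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_)
```

`RandomizedSearchCV` accepts the same estimator and samples from parameter distributions instead of trying every combination, which is usually cheaper for wide ranges such as those implied by max_depth=67 and n_estimators=217.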
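Step 3. Reconstructing the thermocline depth with the model. The 2018-2020 test data are input into the reconstruction model, which yields the thermocline depths for the corresponding months and latitude and longitude regions. The coefficient of determination (R²), mean absolute error (MAE), root mean squared error (RMSE) and mean relative error (MRE) are used to assess the reconstruction:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad \mathrm{MRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|},$$
where $y_i$ denotes the actual data, $\hat{y}_i$ denotes the reconstructed values, $\bar{y}$ is the mean of the actual data, and $n$ is the number of samples.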
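The four evaluation metrics can be computed with a few lines of NumPy. The sketch below uses their standard definitions; the function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np


def evaluate(y_true, y_pred):
    """Return R2, MAE, RMSE and MRE for actual and reconstructed depths."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # Relative error is undefined where the actual depth is zero.
    mre = np.mean(np.abs(err) / np.abs(y_true))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"r2": r2, "mae": mae, "rmse": rmse, "mre": mre}


# Hypothetical actual and reconstructed thermocline depths in metres.
metrics = evaluate([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 40.0])
```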
As shown in Table 1, the coefficient of determination R² measures how well the model explains the variance of the observed data. In this model, R² is greater than 0.85, indicating that the model explains the relationship between the input and output variables well and has a good fit. The Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) reflect the reconstruction accuracy of the model. The MAE of this model is 4.01 m, indicating that the reconstruction accuracy is relatively high. The Mean Relative Error (MRE), on the other hand, better reflects the credibility of the reconstruction model. The MRE of this model is less than 0.2, which indicates that the model still needs further improvement.
Figure 5 compares the reconstructed and actual values of the thermocline depth in December 2018 and shows the distribution of the relative errors. The reconstructed thermocline depths in the western Pacific Ocean have a pattern and distribution similar to the actual values, and the reconstruction is good in most sea areas, with relative errors below 0.1. The reconstruction is poorer in near-shore areas. This suggests that, in shallow coastal areas, thermocline depth and sea surface height are affected by many other factors: water exchange is more complicated there and the thermal influence of the continent is stronger, so the reconstruction cannot rely on the three sea surface parameters alone. In addition, there is a large difference between the north and south of the central Pacific, where the thermocline depth distribution calculated from the measured values is discontinuous. This kind of error arises easily when thermocline intensity is used as the criterion for screening multiple thermoclines, so a better screening method is needed to further improve the accuracy of the training and validation data and thus achieve better reconstruction results.

Figure 7. Reconstruction performance for monthly input data
The 2018-2020 data are divided into 12 datasets by month and input into the model one by one. The resulting evaluation indexes are shown in Figure 7. The reconstruction performance of the model has a clear seasonal trend: it is best in December and January and worst in May and June, giving a cyclical pattern that is good at the beginning of the year, poor in the middle of the year, and good again at the end of the year. Figure 8 shows that 68.8% of the reconstruction results for 2018-2020 have a relative error of less than 0.2. On the whole, the Random Forest model performs well and fits the data well, but its reconstruction accuracy is still not high enough, and further optimisation of the model is needed.
and the depth of the thermocline are weakly negative, with the correlation coefficients of -0.40, which is to be investigated in the future.
The thermocline depth and SST show a significant negative correlation in most areas of the Pacific Ocean (Figure 9b), with correlation coefficients reaching -0.67 on average; that is, the higher the SST, the shallower the thermocline. However, in the equatorial regions of the eastern and western Pacific there is a band in which the correlation between the thermocline depth and the SST is positive, and the reasons for this need to be investigated in the future.
Figure 9c shows that the correlation between SSS and thermocline depth is high in the equatorial region, with a correlation coefficient of 0.6 or more, and weakly positive in the high-latitude regions of the northern hemisphere. In the mid-latitude region of the eastern Pacific Ocean the correlation is negative, and the negative correlation is stronger in the southern hemisphere than in the northern hemisphere, reaching -0.74. It can be concluded that, outside the mid-latitude region of the eastern Pacific Ocean, the correlation between the SSS and the thermocline depth is positive and is highest in the equatorial region; that is, the larger the SSS, the deeper the thermocline.
The Pearson correlations of the thermocline depth with season (time), latitude and longitude, SLA, SST, and SSS are shown in Table 2. All correlation coefficients pass the significance test; that is, time, geographic location, and the three sea surface parameters all have a significant effect on the thermocline depth. The correlations between the thermocline depth and SLA, SST and SSS differ between sea areas. In the equatorial region, where the correlations are more significant, the thermocline depth is positively correlated with SLA, SST, and SSS, whereas in the Northwest Pacific it is weakly positively correlated with SLA and SSS and strongly negatively correlated with SST. In the mid-latitude waters of the East Pacific, the thermocline depth is negatively correlated with SLA, SST, and SSS.
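A correlation coefficient and its significance test can be obtained as in the sketch below, assuming the coefficients in Table 2 are Pearson correlation coefficients. The SST and depth series are synthetic and only illustrate a negative relationship; `scipy.stats.pearsonr` returns the coefficient together with a two-sided p-value.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic series: warmer surface water paired with a shallower thermocline.
rng = np.random.default_rng(1)
sst = rng.normal(28.0, 1.5, 200)                               # degC
depth = 120.0 - 8.0 * (sst - 28.0) + rng.normal(0.0, 5.0, 200)  # metres

r, p = pearsonr(sst, depth)
print(f"r = {r:.2f}, p = {p:.3g}")
```

Applying the same call at every grid point, over the time series at that point, would give spatial correlation maps of the kind shown in Figure 9.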
In the analysis, there are many areas where the correlation is low, which differs from theory. The reasons are as follows: a. there are many uncertainties in the in situ observations, so that even after quality control and data processing many profiles still have poor data quality, which causes errors in the calculated thermocline; b. the ocean thermocline takes many forms, such as the positive thermocline, the inverse thermocline, the multi-thermocline, and the mixed thermocline. This paper discusses only the correlation between the positive thermocline and the sea surface parameters, and the layer in a multi-thermocline profile is selected by strength, which means that the selected layer may be discontinuous in the horizontal direction; c. there is no optimal method for calculating the thermocline depth, which can only be obtained with the most applicable method identified by comparative analysis. There are still cases in which an existing thermocline fails the standard test, and this is also a major source of error.

Importance assessment of sea surface parameter features
In the Random Forest algorithm, the Gini index is generally used to assess the importance of each feature in splitting the decision trees. During node splitting, the features that minimise the Gini index of the subsets are chosen, which makes the model more robust and reliable. Random Forest integrates the results of all decision trees to obtain more accurate results, and the average Gini index reflects the relative importance of each feature in the whole model. Using the Gini index to evaluate feature importance (variable importance measures, VIM) in Random Forest therefore helps us understand how the model operates, optimise the choice of algorithms and parameter settings, and improve the reconstruction performance of the model. The six input features of the model were analysed with the Gini-based feature importance method, and the results are shown in Figure 10. Latitude has the most significant effect on the thermocline depth, with a VIM of 0.45, followed by the time element, with a VIM of 0.17. For the three surface parameters SST, SSS, and SLA, the VIMs are 0.078, 0.077, and 0.068, respectively, which shows that the influence of the three is roughly the same, with SST the most important of the three.
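Impurity-based importances of this kind can be read from a fitted scikit-learn model, as in the sketch below. Two caveats apply: the data are synthetic, with latitude made dominant by construction, and for a `RandomForestRegressor` the `feature_importances_` attribute is the mean decrease in impurity, where the impurity is the squared error by default rather than the Gini index used for classification trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["Month", "Lon", "Lat", "SLA", "SST", "SSS"]

# Synthetic inputs; the target depends mostly on the Lat column by construction.
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = 100 * X[:, 2] + 20 * X[:, 0] + 5 * X[:, 4] + rng.normal(0, 1, 500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalised by scikit-learn so that they sum to 1.
vim = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(vim.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```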

Comparison of other regression methods
Section 2.3 constructs a Random Forest thermocline depth model on the basis of previous studies, improving the ability to obtain thermocline depth in real time with high spatiotemporal accuracy. Table 3 summarises the performance of the Random Forest method (RF) and other commonly used regression methods in reconstructing the thermocline depth, including the K-Nearest Neighbor algorithm (KNN), Bagging Regression (BR), gradient boosting (GB), and extremely randomised trees (ETs). Grid search was used to tune the parameters of each regression method to obtain its best performance. The results indicate that, in terms of MAE, RMSE, and MRE, the Random Forest method reconstructs the thermocline depth better than the other machine learning methods and is therefore the more suitable choice.
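A comparison of this kind can be set up as in the sketch below, which fits the five regressors on the same synthetic data and reports the MAE of each on a held-out part. The data, the hyperparameters and the simple split are assumptions for illustration only and do not reproduce the results in Table 3; the grid search for each method is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the six inputs and the thermocline depth.
rng = np.random.default_rng(0)
X = rng.random((400, 6))
y = 50 + 40 * X[:, 2] - 20 * X[:, 4] + 10 * X[:, 3] * X[:, 5] + rng.normal(0, 1, 400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

models = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "BR": BaggingRegressor(n_estimators=50, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    "ETs": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {mae:.2f}")
```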

Summary
By analysing the influence of sea level anomaly (SLA), sea surface temperature (SST) and sea surface salinity (SSS) on the distribution of the thermocline depth, this paper finds that the sea surface parameters are significantly correlated with the thermocline depth and that the correlations vary geographically. In the equatorial region, where the correlation is more significant, the thermocline depth is positively correlated with SLA, SST, and SSS; in the northwestern Pacific, it is weakly positively correlated with SLA and SSS and strongly negatively correlated with SST; and in the mid-latitude region of the eastern Pacific, it is negatively correlated with SLA, SST, and SSS. For the correlation results that run contrary to theory, the sources of error are analysed from the data point of view. The calculation of thermocline depth involves many uncertainties, such as the quality of the profile data and the screening of multi-thermocline profiles; these introduce errors into the data and lead to differences in the correlation analysis results.
By constructing and evaluating a thermocline depth reconstruction model for the Western Pacific Ocean, this paper explores a reconstruction method based on Random Forest, evaluates the importance of each input feature, and finally assesses the effectiveness of the model with cross-validation. The results show that latitude is the most significant feature affecting the thermocline depth, followed by the time element and the sea surface parameters, among which SST has the highest importance. The Random Forest model fits the data well: its coefficient of determination reaches 0.85, its Mean Absolute Error is about 4.01 m, its Root-Mean-Square Error is 7.16 m, and its Mean Relative Error is less than 0.2, which shows that the reconstruction for thermoclines at different depths still needs to be improved. In summary, the Random Forest algorithm has good application prospects in the reconstruction of thermocline depth and considerable potential for development in ocean observation and prediction support.

Figure 2. Schematic diagram of the Random Forest model

Figure 3. The flowchart of reconstructing the thermocline depth using the RF method

Figure 4. Comparison between the reconstruction results of the first 600 samples and the actual thermocline depth

Figure 5. Comparison of reconstruction results in December 2018: (a) the reconstruction results of the Random Forest, (b) the actual thermocline depth calculation results, and (c) the distribution of MRE

Figure 8. Relative error probability density distribution of Random Forest reconstruction results

Figure 10. Comparison of the feature importance of the sea surface parameters and their correlation with the depth of the thermocline.

Table 2. Pearson correlation between sea surface parameters and thermocline depths

Table 3. Comparison of results between RF and KNN, BR, GB, and ETs.