Probabilistic forecasting of tropical cyclone intensity using a machine learning model

This study proposes a machine learning approach to probabilistic forecasting of tropical cyclone (TC) intensity. The Earth system is complex and nonlinear, so TC forecasts carry inherent uncertainty at all times, and a representation of this uncertainty should therefore be provided. Previous studies have represented this uncertainty through ensemble or statistical methods, neither of which characterizes it directly, and both of which suffer from problems such as excessive computational cost. For this reason, we propose to assess forecast uncertainty directly through the forecast distribution. Moreover, no previous study on TC intensity forecasting with artificial intelligence methods has characterized this uncertainty, so this work is a new complement to data-driven TC forecasting. During the 2010–2020 evaluation period, the model’s point forecasts outperform the current state-of-the-art operational statistical-dynamical models, and its forecast intervals provide reliable probabilistic forecasts, which are critical for disaster warnings.


Introduction
Tropical cyclones (TCs) are among the extreme weather events with the greatest impact on human life and property, causing disasters such as extreme winds and large waves. Intensity remains one of the most important elements of TC forecasting, and accurate intensity forecasts have significant social and economic value [40].
The atmospheric system is highly nonlinear and chaotic, and perturbations of the initial conditions and errors in the model both introduce uncertainty [37]. It is therefore necessary to characterize this uncertainty and move from single-value forecasts to probabilistic forecasts, enabling more comprehensive risk assessment and decision making.
Although numerous studies of TC intensity prediction have been conducted over the past few decades, progress in forecast performance has been limited, and intensity remains a bottleneck in meteorological forecasting [4]. Meanwhile, compared with the attention paid to deterministic forecasting techniques, few studies have attempted to quantify the uncertainty of TC forecasts or to produce probabilistic forecasts [23,30,31]. Currently, probabilistic forecasting of TCs mainly relies on ensemble and statistical methods [30].
Ensemble forecasts are usually obtained by perturbing the initial conditions with random noise or by running a parallel set of multiple dynamical or statistical models [2,52]. Such ensembles sample possible true states from a subspace spanned by the members and thereby estimate the uncertainty of the state evolution. However, these methods suffer from many limitations, such as limited model resolution, poor initial conditions around the TC, and insufficient knowledge of TC physical processes. Consequently, intensity predictions from ensemble numerical models are generally poor [24].
For statistical methods, uncertainty estimates are mainly obtained from probability density functions of past errors, or from the forecasts of deterministic or statistical models [44,51]. The National Hurricane Center (NHC) and the Joint Typhoon Warning Center (JTWC) used Monte Carlo methods for probabilistic forecasting, but the approach soon had to be abandoned because of the large computational resources required to run up to 1000 simulations [12].
Artificial intelligence (AI) techniques have recently made significant advances in modeling and forecasting in the Earth sciences [18]. AI has also made notable progress in TC forecasting: long-term track forecasting has surpassed the performance of operational forecast centers [49], and machine learning forecasts of TC rapid intensification (RI) exceed the skill of the NHC operational RI consensus [50]. In addition, based on the Statistical Hurricane Intensity Prediction Scheme (SHIPS) predictors, intensity prediction skill can be enhanced with deep learning methods [61].
Current AI research on TC intensity has focused on estimating and improving point forecasts [3,22]. However, point forecasts lack a credible quantification of uncertainty, which raises concerns about the reliability of their results. Quantifying the uncertainty of TC intensity predictions is critical, given the significant impact of model predictions on risk decisions. Recently, machine learning studies in the climate field have begun to address this uncertainty. For example, Petersik and Dijkstra [41] applied two neural networks to estimate El Niño-Southern Oscillation (ENSO) uncertainty directly, and Gordon and Barnes [17] used recurrent neural networks to produce probabilistic forecasts of climate uncertainty over 2-10 years.
We propose a machine learning (ML) framework for probabilistic forecasting of TC intensity, named TCP-NGBoost, based on a hybrid of the LightGBM [25] and NGBoost [14] methods. This paper uses the SHIPS predictors for probabilistic forecasting of intensity 24 h ahead. We experimentally evaluate the deterministic forecasts against multiple operational models, and assess the probabilistic forecasts, the interval forecasts, and the calibration of the model against many baseline models. Further, we discuss the importance of the predictor variables and their impact on the forecasts using explainable machine learning methods. To the best of our knowledge, this is the first study to use machine learning techniques directly for probabilistic forecasting of TC intensity.

Data and method
The dataset for this study consists of the SHIPS predictors [11], the most comprehensive environmental prediction dataset relating to changes in TC intensity. This study includes all SHIPS reanalysis data from 1982-2017 and operational data from 2010-2020 for the Atlantic Ocean, where the operational data are those available in real time while the model is running [13]. Our experiments are validated on operational data to ensure consistency between model development and deployment.

Data
SHIPS is a statistical-dynamical model that has long been operated by the NHC. Its predictors include TC-related climatological sea surface temperature (SST) data, satellite brightness temperature data, and variables derived from other model analyses and satellite observations [13].
We used 121 predictor variables selected and preprocessed from the SHIPS database by Xu et al [61]. Detailed information and statistics for these predictor variables are given in table S1 of the supporting information (SI). The SHIPS dataset contains more than 500 predictors associated with 24 h intensity change. Predictor variables included only in the reanalysis data were removed, since they are not available in real time. Through linear regression experiments, only the predictor variables at the 24 h time step and the intensity change over the previous 12 h (DELV-12), which capture the persistence and trend of the TC respectively, were ultimately retained [13,61]. Many predictor variables in the SHIPS operational data are available every 6 h, so our model also predicts at 6 h intervals. Before training, missing values are filled with the reanalysis mean, and each predictor variable is normalized by subtracting the mean of the reanalysis data for that variable and dividing by its standard deviation. Both the mean and the standard deviation are computed only from the 1982-2017 reanalysis data, so the operational data remain relatively independent; in particular, for the 2019-2020 independent test there is no data leakage at all, allowing a complete simulation of the TC processes in those two years.
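The fill-and-normalize step described above can be sketched as follows. This is a minimal illustration with made-up arrays, not the paper's actual code; the key point is that the statistics come only from the reanalysis split, so no operational-era information leaks into training:

```python
import numpy as np

# Hypothetical predictor blocks: rows are samples, columns are variables
reanalysis = np.array([[1.0, 10.0],
                       [3.0, 14.0],
                       [5.0, 18.0]])   # 1982-2017 reanalysis era
operational = np.array([[2.0, np.nan],
                        [4.0, 16.0]])  # real-time operational era

mean = reanalysis.mean(axis=0)  # per-variable mean, reanalysis only
std = reanalysis.std(axis=0)    # per-variable std, reanalysis only

def preprocess(block):
    # Fill gaps with the reanalysis mean, then z-score with reanalysis stats
    filled = np.where(np.isnan(block), mean, block)
    return (filled - mean) / std

z = preprocess(operational)
```

A missing operational value is filled with the reanalysis mean, so it normalizes to exactly zero, i.e. "no information beyond climatology".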
For the intensity record, which is the predicted value in this study, the global TC data developed by Emanuel [15] and the International Best Track Archive for Climate Stewardship (IBTrACS) are used [26]. The intensity change is calculated by subtracting the current intensity from the intensity at the target time.

The proposed model: TCP-NGBoost
This study introduces a hybrid of the LightGBM and NGBoost methods for modeling. NGBoost is a supervised machine learning algorithm for probabilistic prediction that uses natural gradients rather than standard gradients to build a boosted ensemble learner whose output is a full probability distribution over the entire outcome space [14]. LightGBM is a distributed, efficient tree-based gradient boosting framework [25]. Studies have shown the potential of algorithms such as XGBoost and LightGBM to extract valuable insights from multi-predictor data; for example, XGBoost outperforms neural networks at classifying structured data [57].
The proposed TCP-NGBoost framework is shown in figure 1. The core of NGBoost is that it uses boosting to estimate the parameters of a conditional probability distribution P_θ(y|x) as a function of x. The NGBoost algorithm consists of three main components: the base learner, the parametric probability distribution, and the scoring rule. The input to TCP-NGBoost is the SHIPS predictor variables described in the previous section, and the output is the TC intensity change. The input feature vector x is first fed into the LightGBM learner to generate a probability distribution P_θ(y|x) over the output space. The model is then optimized against the scoring rule S(P_θ, y) via maximum likelihood estimation to produce calibrated uncertainty estimates and point predictions.
The specific model parameters and training settings can be found in the text S2 of the SI.
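To make the three components concrete, the sketch below implements a stripped-down version of the natural-gradient boosting idea for a Normal output distribution, with a simple histogram regressor standing in for LightGBM. The toy data, bin count, learning rate, and iteration count are all illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heteroscedastic data: y ~ N(sin(x), (0.1 + 0.2 x)^2)
n = 2000
x = rng.uniform(0.0, 3.0, n)
y = np.sin(x) + rng.normal(0.0, 1.0, n) * (0.1 + 0.2 * x)

# "Base learner": piecewise-constant fit over fixed bins of x
n_bins = 20
idx = np.minimum((x / 3.0 * n_bins).astype(int), n_bins - 1)

def fit_predict(target):
    # mean of the target within each bin, broadcast back to the samples
    sums = np.bincount(idx, weights=target, minlength=n_bins)
    counts = np.maximum(np.bincount(idx, minlength=n_bins), 1)
    return (sums / counts)[idx]

# Boosted parameters of the Normal: theta = (mu, log_sigma)
mu = np.full(n, y.mean())
log_sigma = np.full(n, np.log(y.std()))
lr = 0.1

def mean_nll(mu, log_sigma):
    var = np.exp(2.0 * log_sigma)
    return float(np.mean(log_sigma + 0.5 * (y - mu) ** 2 / var))

nll_before = mean_nll(mu, log_sigma)
for _ in range(300):
    var = np.exp(2.0 * log_sigma)
    # Natural gradients of the Normal NLL (ordinary gradients rescaled
    # by the inverse Fisher information, as in NGBoost):
    ng_mu = -(y - mu)                          # Fisher entry: 1/sigma^2
    ng_ls = 0.5 * (1.0 - (y - mu) ** 2 / var)  # Fisher entry: 2
    # Fit base learners to the negative natural gradients and step
    mu += lr * fit_predict(-ng_mu)
    log_sigma += lr * fit_predict(-ng_ls)
nll_after = mean_nll(mu, log_sigma)
```

The fitted log_sigma recovers the input-dependent noise level, which is what the forecast intervals are built from; the production model replaces the histogram regressor with LightGBM trees over the 121 SHIPS predictors.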
Since TCs are extreme weather phenomena with limited operational data, and in order to make maximal use of the 2010-2020 operational data, we adopt a training strategy similar to cross-validation, leave-one-year-out (LOYO) testing, in which the training and testing sets are divided by year. From 2010 to 2018 we use the LOYO strategy: for 2010, for example, all years except 2010 are used for training and 2010 is used for testing. To evaluate the model on genuinely new years, the models for 2019 and 2020 are trained on all data from 1982 to 2018 and tested separately on each of those two years, so there is no data leakage and the actual operating environment is simulated.
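The LOYO strategy is straightforward to express in code; the tuple layout below is an illustrative placeholder, not the paper's data structure:

```python
def loyo_splits(samples, test_years):
    """Yield (held_out_year, train, test); each sample is a (year, x, y) tuple."""
    for held_out in test_years:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# One placeholder sample per season, 1982-2020 (features/targets omitted)
samples = [(year, None, None) for year in range(1982, 2021)]

# LOYO over 2010-2018: each of those years is held out in turn
loyo = {yr: (tr, te) for yr, tr, te in loyo_splits(samples, range(2010, 2019))}

# 2019 and 2020: fully independent tests against a 1982-2018 model
independent_train = [s for s in samples if s[0] <= 2018]
independent_tests = {yr: [s for s in samples if s[0] == yr] for yr in (2019, 2020)}
```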
To assess the accuracy of the point predictions, we use the mean absolute error (MAE) and root mean squared error (RMSE), which measure the deviation between predictions and observations. To assess the overall performance of the probabilistic forecasts, this study uses the continuous ranked probability score (CRPS) and the negative log-likelihood (NLL). To assess the usability of the interval forecasts, we use the prediction interval coverage probability (PICP).
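For a Gaussian predictive distribution N(μ, σ²), all of these metrics have closed forms. The helpers below are a minimal stdlib-only sketch (the 1.2816 factor is the two-sided 80% normal quantile, chosen here for illustration):

```python
import math

def mae(obs, preds):
    return sum(abs(y - p) for y, p in zip(obs, preds)) / len(obs)

def rmse(obs, preds):
    return (sum((y - p) ** 2 for y, p in zip(obs, preds)) / len(obs)) ** 0.5

def _pdf(z):  # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):  # standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def nll(y, mu, sigma):
    # negative log-likelihood of y under N(mu, sigma^2)
    return -math.log(_pdf((y - mu) / sigma) / sigma)

def crps(y, mu, sigma):
    # closed-form CRPS of a Gaussian forecast
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * _cdf(z) - 1.0) + 2.0 * _pdf(z) - 1.0 / math.sqrt(math.pi))

def picp(obs, mus, sigmas, z=1.2816):
    # coverage of the central 80% interval mu +/- 1.2816 sigma
    hits = sum(1 for y, m, s in zip(obs, mus, sigmas) if abs(y - m) <= z * s)
    return hits / len(obs)
```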
One of the most important plots for assessing model prediction uncertainty is the average calibration plot [9]. When making predictions, an α% prediction interval can be formed that is designed to capture the observations α% of the time. We can sweep over values of α and record the percentage of the test data that actually falls within each prediction interval. The calibration plot then shows the proportion of test data expected to fall within the interval on the x-axis against the proportion actually observed within the interval on the y-axis. Details of these metrics and how they are calculated can be found in text S1.
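The average calibration curve can be computed directly from Gaussian forecasts by sweeping the nominal coverage α; a minimal sketch (the α grid and toy data are illustrative, not the paper's evaluation):

```python
import math
import random

def inv_central(alpha, lo=0.0, hi=10.0):
    # z such that mu +/- z*sigma has nominal central coverage alpha,
    # found by bisection on 2*Phi(z) - 1 = erf(z / sqrt(2))
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if math.erf(mid / math.sqrt(2.0)) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def calibration_curve(obs, mus, sigmas, alphas):
    """Return (nominal, observed) coverage pairs for the calibration plot."""
    curve = []
    for a in alphas:
        z = inv_central(a)
        hits = sum(1 for y, m, s in zip(obs, mus, sigmas) if abs(y - m) <= z * s)
        curve.append((a, hits / len(obs)))
    return curve

# Perfectly calibrated toy forecasts should track the diagonal
random.seed(0)
obs = [random.gauss(0.0, 1.0) for _ in range(20000)]
curve = calibration_curve(obs, [0.0] * len(obs), [1.0] * len(obs),
                          [0.2, 0.5, 0.8, 0.95])
```

A well-calibrated model keeps every (nominal, observed) pair near the diagonal; overly broad intervals push the observed coverage above it.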

Comparison models
To evaluate the performance of TCP-NGBoost, we compare the 24 h intensity change error statistics with three types of models: deterministic point models, probabilistic prediction models, and operational models. This allows us to examine the validity of both the deterministic and the probabilistic results, as well as the consistency of the forecasts with operational models.
Detailed information about these models and their specific configurations can be found in text S2. Information on the NHC operational models, which include both statistical-dynamical and dynamical models, can be found in text S3, as can the official NHC forecast (OFCL), which forecasters determine subjectively by combining various sources of information. For a fair comparison, the operational numerical model results retain only events for which all models are operable. During data pre-processing, the unique storm codes and forecast periods of the hurricanes predicted by SHIPS, DSHP, LGEM, HWFI and OFCL were extracted, and the final test set was obtained by intersecting them with our data, i.e. keeping only the data points for which all models produced forecasts. This ensures that all models are compared fairly on the same data points. Table 1 compares TCP-NGBoost with all the models mentioned above, reporting the LOYO tests for 2010-2018 and the fully independent tests for 2019-2020, with 95% confidence intervals for the PICP. As the table shows, TCP-NGBoost delivers balanced performance in both deterministic and probabilistic forecasting.

Forecast performance
The results in table 1 show that, for the deterministic forecasts, the overall error is not significantly different from the NHC's official forecasts, especially in the 2010-2018 tests. TCP-NGBoost achieves results comparable to OFCL, even the best results in terms of RMSE, and obtains results similar to the statistical-dynamical models in the 2019-2020 independent tests. Meanwhile, our proposed model outperforms the other machine learning deterministic forecasts in multiple tests. It is worth noting that the MLR model consistently fails to converge in the independent tests, and the other uncertainty models perform much worse in the independent tests than in cross-validation, which further confirms the robustness of our model. Our method uses LightGBM as the base learner; several studies have shown that both LightGBM and neural networks can adapt to complex datasets, but that LightGBM outperforms neural networks on regression problems. Combining LightGBM and NGBoost for probabilistic forecasting exploits the efficiency and accuracy of LightGBM, while NGBoost provides forecast distributions and intervals without adding much computational burden, thereby better quantifying the forecast uncertainty and further improving the reliability and stability of the model.
As shown in table 1, the proposed TCP-NGBoost model presents stable results for the probabilistic metrics CRPS and NLL and achieves the best CRPS score; notably, its CRPS remains the best in the independent testing stage. It is also worth noting that the 95% confidence interval of the PICP of our model in the independent test covers 95% of the test data, indicating good interval prediction performance: the intervals can almost completely cover the intensity variation of the TCs. Figure 2 details the intensity change error of our proposed model against the currently operating models for each year. In all years of the LOYO testing phase, TCP-NGBoost outperforms all the statistical-dynamical models (SHIPS, DSHP, LGEM), exceeds the dynamical model HWFI in almost all years, and is overall comparable to the consensus performance published by NHC. In the independent test phase, TCP-NGBoost is roughly comparable to the statistical-dynamical models and weaker than the dynamical model HWRF and the consensus methods. We believe this is because the HWRF model significantly improved its spatial resolution after 2018, substantially improving its intensity forecast performance [19]. Note, however, from table 1 that TCP-NGBoost has a lower RMSE than HWFI in the independent test phase and a lower RMSE than all models in the LOYO phase. Since RMSE is more sensitive to large errors than MAE, this indicates that TCP-NGBoost performs best at predicting extreme intensity changes. We also conducted a simple validation of the predictors used in the model by taking only a subset of them: prior to modeling, we correlated TC intensity changes with all predictors using three correlation metrics and analyzed their importance.
The ten factors with the highest correlation rankings were selected for separate modeling experiments; the correlation results are in table S3 of the SI and the test results in table S4 of the SI. The model using all 121 predictors outperformed the 10-predictor model on almost all indicators. It should be noted, however, that this controlled experiment does not fully demonstrate that every factor we used is necessary, and some overfitting cannot be denied, but the experiment does demonstrate, to some extent, the need for all 121 factors.
We also examined the inference time of the model; the results, given in text S5 of the SI, show that forecasting a full hurricane life cycle takes only a few seconds, several orders of magnitude faster than dynamical and super-ensemble hurricane models. Figure 3 shows independent tests of the two most intense and impactful hurricanes of 2020, Eta and Laura. Hurricane Eta made landfall in Nicaragua as a Category 4 storm and ultimately killed at least 175 people [39]. Laura made landfall in Louisiana as a Category 4 hurricane, the strongest TC on record to make landfall in that state, causing at least $19 billion in damage and 77 deaths [38]. Figures 3(a) and (e) show the full intensity and track development of the two hurricanes over our forecast periods, as follows:

Case study
• Eta. Our forecast began as a tropical storm intensified, rapidly became a hurricane, and continued to strengthen before making landfall northeast of Nicaragua, causing catastrophic wind, flooding and storm surge damage. After peaking in strength at 00:00 UTC on 3 November, the hurricane weakened rapidly as it moved slowly westwards across northern Nicaragua and then remained largely stable. Figure 3(b) shows that our forecast of the explosive intensification was not particularly accurate, but the subsequent periods of weakening and stabilization were covered. Eta finally dissipated quickly after making landfall in Florida on 12 November.
• Laura. Our forecasts commenced after Laura had intensified into a tropical storm, during which it made landfall southeast of the Dominican Republic with minimal loss of wind speed. After remaining largely stable while crossing the Caribbean Sea, a period of RI began on 26 August as the storm entered the Gulf of Mexico, where it grew in size and developed a well-defined eye and strong deep convection, continuing to strengthen until it peaked on 27 August. It later made landfall in Louisiana, weakened rapidly as it pushed inland, and dissipated at 12:00 on 29 August, where our forecast also ends.
Figure 3 shows that the deterministic forecasts of TCP-NGBoost capture the overall trend of the intensity variation well. However, the model cannot capture extreme intensity changes well, and its predicted values are too small. We believe this is because extreme variations are rare in the sample, so the model cannot learn the characteristics of such cases well. Unfortunately, the 80% confidence interval still cannot capture the extreme intensity changes above 60 kt in the initial phase of Hurricane Eta; it is worth noting that RI is also extremely difficult to forecast in practice.
The NHC defines RI as an increase in the maximum sustained winds of a TC of at least 30 knots (35 mph; 55 km h−1) in a 24 h period.
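Under this definition, RI events can be flagged directly from a best-track intensity series; a small stdlib sketch (the 6-hourly record below is a made-up illustration):

```python
RI_THRESHOLD_KT = 30  # NHC definition: >= 30 kt increase within 24 h
STEP_H = 6            # best-track records every 6 h, so 24 h = 4 steps

def ri_times(intensities_kt):
    """Indices t where the following 24 h see an increase of >= 30 kt."""
    lag = 24 // STEP_H
    return [t for t in range(len(intensities_kt) - lag)
            if intensities_kt[t + lag] - intensities_kt[t] >= RI_THRESHOLD_KT]

# Toy series: slow start, a rapid-intensification burst, then weakening
vmax = [35, 40, 45, 55, 70, 90, 105, 110, 100, 85]
events = ri_times(vmax)
```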
The calibration plot measures the model's calibration error: its x-axis and y-axis show the expected and observed proportions of observations falling within each prediction interval. The results in figures 3(c) and (g) show that our model is miscalibrated through underconfidence: TCP-NGBoost often produces overly broad distributions, and Laura's results in particular show that the 80% prediction interval alone contains almost all of the data. Since average calibration considers the prediction intervals of the entire test set, we further investigate adversarial group calibration [63]. Figures 3(d) and (h) show the adversarial group calibration plots for the two hurricanes: many subsets are drawn at random, the miscalibration of each subset is computed, and the most severe miscalibration among the subsets is reported. The shaded area in the figure shows the standard error bands over the random validation partitions. The figure shows that the adversarial group calibration error decreases at a faster rate and remains largely small.
The results for Hurricanes Eta and Laura are shown in figures S2 and S3 of the supporting materials, respectively, which show that our proposed model yields the most reasonable deterministic forecast values and interval widths. Some models produce narrower forecast intervals (e.g. Drop Connect and MC Dropout), while Flipout performed very erratically, appearing to over-forecast the intervals. For Hurricane Laura, a comprehensive evaluation of the outcomes suggests that the neural network model outperforms the TCP-NGBoost model. However, the neural network model does not deliver the same level of performance consistently across all hurricanes, as the comparison with Hurricane Eta and the indicators in table 1 highlight. The TCP-NGBoost model, on the other hand, demonstrates greater stability in its predictions for each cyclone.
To demonstrate the effectiveness of the proposed model in probabilistic intensity forecasting, we visualized the predicted probability density curves for Hurricanes Eta and Laura, shown in figures S4 and S5 of the SI. As the figures show, all measured values fall reasonably well within the probability density region of the forecast interval, demonstrating the effectiveness of the model's probabilistic forecasts.
RI is one of the most challenging and vital phenomena to forecast for TCs [28]. We evaluate the ability of the proposed model to capture RI, measured as the proportion of RI events falling within the confidence interval: in 2019, the 95% confidence interval captures 100% of RI events and the 80% interval captures 76.92%; in 2020, the 95% interval captures 80% of RI events and the 80% interval captures 60%.

Explainable analysis
Further, we perform an explainable analysis of the importance of the variables. Figure 4 shows the variable importance analysis based on the Shapley value [32], which uses ideas from cooperative game theory to generate not only the importance of each variable but also how each feature affects the output. The specific physical explanations are as follows:
• vs0. The most critical variable is the current intensity (vs0). The SHapley Additive exPlanations (SHAP) values show that the current intensity is negatively correlated with the predicted value, meaning that the greater the current intensity, the more the intensity tends to weaken. Studies have shown that the initial intensity of a TC affects its intensity 24 h later [43,54]. In addition, the ability of a storm to undergo RI is sensitive to its initial size [5].
• DELV-12. DELV-12 is the second most important predictor. Its SHAP values show a positive correlation with the predicted value: the future intensity change is generally consistent with the intensity change over the past 12 h. Studies have found a relationship between the intensity change of TCs over the past day and their intensity the next day [21], thought to arise from both storm-internal processes and large-scale environmental conditions [21].
• VMPI. The maximum potential intensity (MPI) is positively correlated with the intensity change: the larger the value, the greater the intensity change. There is a correlation between MPI and TC intensity [20,33]. The MPI of a TC is determined by environmental parameters at the storm radius, such as sea-level pressure, SST, and relative humidity [20], and these parameters can affect the variability of the MPI [60].
• TWXC. The 850 hPa maximum symmetric tangential wind (TWXC-t24) is the fourth most important variable. The maximum symmetric tangential wind at 850 hPa is an important factor in determining the central pressure deficit and peak near-surface wind speed of a TC [7], and storm-relative, shear-relative composites suggest that the coupling between the 850 hPa maximum tangential wind and the shear vector is important for understanding TC intensity changes [42].
• DTL. The distance to land (DTL) is also an important predictor: the intensity of a TC diminishes significantly after landfall, which is well reflected in the SHAP values. Studies show that TC intensity is affected by distance from land [45,53]. Generally, the further a cyclone is from land, the more intense it can become; however, when a cyclone moves parallel to the shoreline in close vicinity to land, its intensity can be reduced by the ocean's effect [45].
• Vertical wind shear. The vertical shear of the ambient wind is an important variable affecting TC RI, and the effect of these two shear variables is relatively clear: their SHAP values present a negative correlation with the intensity change. Studies have shown that the magnitude of the 200-850 hPa wind shear is related to TC intensity change and is negatively correlated with it [55,62]. This has been studied since the 1950s, and it has been shown that ambient vertical wind shear suppresses TC intensity [42].
• CD20. The climatological depth of the 20 °C isotherm affects TC intensity by providing a measure of oceanic thermal potential, an important factor in TC intensification [1,6]. Its SHAP values show a positive correlation with the intensity change.
• ENEG. The equivalent potential temperature (theta-e) is an important factor in determining TC intensity. Research has shown that the mid-tropospheric minimum in total energy vanishes as a TC approaches, with subsequent increases in theta-e [27,47].
• CFLX. Research has shown a correlation between relative humidity and TC intensity [58,59]. Rapidly intensifying TCs are associated with higher relative humidity than weakening and neutral TCs [59]. Studies have also found that environmental moisture affects hurricane intensity, with storms becoming smaller when exposed to drier air [58].

Conclusion and discussion
This study is the first to investigate the application of machine learning to probabilistic forecasting of TC intensity. We propose a hybrid of NGBoost and LightGBM, combining the advantages of both models to obtain probabilistic forecasts. Experiments demonstrate that our model achieves performance comparable to the currently operating state-of-the-art statistical-dynamical and dynamical models on 24 h deterministic forecasts. The model also quantifies uncertainty, exceeding several popular probabilistic forecasting methods on probability and interval metrics with strong robustness. Independent tests and visualizations of the two most influential TCs of 2020 show that the model fits the actual values well and that the forecast intervals cover them well. In addition, we examined the model's ability to capture RI over the 2019-2020 hurricane seasons: the probabilistic model captures most RI events within the 95% confidence interval, indicating the practical relevance of our model for risk warning. We also examine the feature importances and explain how the variables affect the prediction results.
The most important benefit of this study over currently used models is that uncertainty forecasts can be obtained directly, without the computationally intensive super-ensemble methods otherwise needed to obtain forecast intervals. We also list the forecast times for different events in the supporting material, an improvement of orders of magnitude over the dynamical models.
Regarding the deterministic results in some years being worse than those of some established operational models, it is important to note that the core of our study is the direct prediction of distributions with machine learning. The purpose of comparing against a wide range of deterministic forecasting models is to verify that our deterministic results are reasonable; our model is valid as long as it outperforms most operational models on deterministic metrics. Notably, our forecasts outperformed the data-driven models operated by the NHC in almost all of the years tested. It must be acknowledged that in many cases TCP-NGBoost performs relatively poorly compared with the state-of-the-art dynamical model HWFI. HWFI is nearly the best current dynamical model and has improved to varying degrees each year, especially recently, so we believe the main reason our method was similar to HWFI in earlier years but inferior in 2019-2020 lies in those model advances. Future studies could improve robustness by upsampling certain special cases and by discussing more prediction intervals. Owing to the limited amount of data, our proposed model may not match a physical model at capturing complex nonlinear relationships; a more in-depth comparison with HWFI could be conducted in the future to gain more insight.
A limitation of the proposed model is that it lacks a complete and transparent mathematical and physical basis; although we have used interpretable machine learning methods to explain the role of the important factors, it is still essentially a 'black box'. As a purely data-driven model it is constrained by the limited amount of TC data available, and we believe the model would improve if the data were updated continuously.
In summary, the results suggest that TCP-NGBoost holds promise as a complement to existing operational models and provides insight into uncertainty estimation, thus helping to better inform risk warnings. Our study provides an in-depth analysis only of 24 h forecasts; future studies could construct additional datasets to forecast more lead times, consider deep learning sequence models for the time series characteristics, and incorporate advanced TC dynamical models for hybrid modeling. Moreover, we will extend uncertainty prediction to TC trajectories.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).