Predicting the Type of Narrow Bipolar Pulse: A Machine Learning Approach

Narrow Bipolar Pulses (NBPs) depict the electric field (E-field) changes due to a Compact Intracloud Discharge (CID) and are of two types, namely Positive NBP (PNBP) and Negative NBP (NNBP). In this study, 437 NBPs were statistically investigated using a dataset collected in Sri Lanka in 2013, 2015, 2016 and 2017. Seven independent variables (year of data collection, Pulse Duration (PD), Rise Time (RT), Slow Front Duration (SFD), Zero Crossing Time (ZCT), Full Width at Half Maximum (FWHM) and Ratio between the Initial and Overshoot Peak Amplitudes (RIOPA)) and one dependent variable (type of NBP) were analyzed. Two machine learning classification models, the Random Forest (RF) model and the Binary Logistic Regression (BLR) model, were used to predict the dependent variable based on the independent variables. The two models were compared in terms of accuracy (ACC), sensitivity (SE), specificity (SP), the area under the receiver operating characteristic curve (AUC-ROC) and the kappa statistic. The RF model scored the highest in terms of ACC (0.91), SP (0.94), kappa statistic (0.82) and AUC (0.98). In conclusion, the RF model had the best performance in predicting the type of NBP and can therefore be used as a suitable automated method to classify the type of NBP.


Introduction
Compact Intracloud Discharges (CIDs) are a category of lightning event that emits strong radiation in the high frequency to very high frequency range. They possess distinct features compared to other lightning events. The electric field (E-field) signatures of these discharges were first observed and reported by Le Vine in 1980 [1]. CIDs are frequent at high altitudes (>10 km) during active thunderstorm stages [2].
CIDs are associated with strong changes in E-field radio frequency radiation. Due to their short duration (10-30 µs) and distinctive initial and overshoot pulse, they are referred to as Narrow Bipolar Pulses (NBPs) [3]. Depending on the polarity of the NBP's initial pulse, they are classified into two categories: Positive NBP (PNBP) and Negative NBP (NNBP) [4], as shown in Figures 1 and 2. All directions are given according to the atmospheric electricity sign convention.
Regardless of the polarity, a set of characteristics can be defined for all NBPs, such as RT, PD, SFD, ZCT, FWHM, and RIOPA. These characteristics are based on critical points of the pulse, which are marked in Figure 3 and explained in detail in Table 1. Few studies have examined the statistical characteristics of NBPs. Studies by Gunasekara et al. [5], Sharma et al. [6], and Thabrew et al. [3] have examined the characteristics of NBPs and their associations with the type of NBP. The results of these studies suggest a significant difference between the types of NBP with respect to each characteristic. Additionally, a study based on tropical thunderstorms speculated that the polarity of the NBP and event frequency depend on the height of the thunderstorm and meteorological conditions [7].
Machine learning (ML) is a branch of artificial intelligence (AI) that analyzes data and automates the development of analytical models with minimal human intervention [8]. These ML models are developed based on a training dataset and can then be used to make predictions or decisions. Out of many available ML classification models, logistic regression (LR) and random forest (RF) are two of the most popular. Both models can be developed using training data and tested for performance using a testing dataset.
This study focuses on predicting the type of NBP using binary LR (BLR) and RF ML models and comparing their performance to determine the best model for this prediction. To make predictions, the wave characteristics of the pulses are incorporated into the model. These AI approaches have the potential to be useful in the design of automated electronic devices or mobile applications for identifying the type of NBP based on the wave characteristics.

E-field Data
The E-field data used in this study were collected from single tropical thunderstorms that occurred in Sri Lanka in the years 2013, 2015, 2016, and 2017. Similar measurement setups, consisting of a flat plate antenna, were used in all four years. The setup used in 2015 was described in detail by Abeywardhana et al. [9].
A total of 437 NBPs were identified from the E-field recordings of all four years. The year of data collection, PD, RT, SFD, ZCT, FWHM, and RIOPA were extracted from the E-field recordings and were labeled with the type of NBP, which is binary and has two levels: NNBP and PNBP. The type of NBP was considered the outcome variable. The details of the variables in the dataset are shown in Table 2.

Statistical Analysis
As a first step in the analysis, a univariate analysis was performed across the entire dataset. This was followed by a bivariate analysis based on the type of NBP and the seven explanatory variables. Pearson's Chi-squared test was used to compare the two groups of the outcome variable with respect to categorical variables, along with their frequencies and percentages. For continuous variables, the means and standard deviations were obtained and compared across the type of NBP using the independent t-test.
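The Chi-squared comparison described above can be sketched as follows. This is a minimal illustration in Python rather than the R environment used in the study. The cell counts are hypothetical and are not the study's data (only their row and column totals were chosen to match the reported totals), and the function name `chi_squared_statistic` is an assumption introduced here.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# HYPOTHETICAL cell counts (not the study's data): rows are NBP types,
# columns are the four years 2013, 2015, 2016, 2017.
hypothetical_counts = [
    [40, 70, 50, 17],   # PNBP
    [144, 60, 40, 16],  # NNBP
]
statistic, dof = chi_squared_statistic(hypothetical_counts)

# Critical value of the chi-squared distribution with 3 degrees of freedom
# at the 0.05 significance level.
CRITICAL_VALUE_DF3 = 7.815
print(f"chi2 = {statistic:.2f}, dof = {dof}, significant = {statistic > CRITICAL_VALUE_DF3}")
```

Comparing the statistic with a tabulated critical value avoids needing a distribution library; a statistics package would report the p-value directly.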
The main dataset was then split into a training set consisting of 80% of the full dataset and a test set consisting of 20% of the full dataset. The selection of observations for each training and test set was done randomly. Descriptive statistics of all variables were obtained for both the training and test data. Pearson's Chi-squared test was used to compare the training and test sets along with their frequencies and percentages. Means and standard deviations were obtained for continuous variables, and the independent t-test was used to compare the training and test sets.
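A random 80/20 split of this kind can be sketched as below, in Python rather than the R environment used in the study. The function name, the seed, and the choice to round the test-set size up are assumptions made for illustration; rounding up is what yields 349 training and 88 test observations from 437 cases.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly partition row indices into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    # Assumption: the test-set size is rounded up.
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(437)
print(len(train_idx), len(test_idx))  # prints "349 88"
```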
The training set was used to fit the models, and these models were then tested using the test data. The performance of the models was evaluated using ACC, SE, SP, the AUC-ROC curve, and the kappa statistic. All statistical analyses were conducted using RStudio. All statistical tests were two-tailed, and a p-value of less than 0.05 was considered statistically significant.
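For reference, ACC, SE, SP, and Cohen's kappa can all be derived from a 2x2 confusion matrix, as in the sketch below (Python, not the R code used in the study). The counts are hypothetical and unrelated to the study's results, and treating PNBP as the positive class is an assumption, since the text does not state which class was coded as positive.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, and Cohen's kappa from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    # Agreement expected by chance, from the marginal totals of observed and predicted classes.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {"ACC": accuracy, "SE": sensitivity, "SP": specificity, "kappa": kappa}


# HYPOTHETICAL counts (not the study's results); positive class assumed to be PNBP.
example = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print({name: round(value, 3) for name, value in example.items()})
```

Kappa discounts the agreement that would occur by chance, which is why it is lower than the accuracy for the same confusion matrix.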

Random Forest
RF is an extended version of bagging that also uses randomness to create a collection of uncorrelated decision trees for classification or regression [10]. This ensemble learning method was first proposed by Leo Breiman in 2001 [11] and can be used to predict categorical or continuous response variables. For categorical outcomes, RF is used for classification, and the output is the class selected by the majority of trees in the forest. For continuous outcomes, the average of the predictions made by each tree is returned as the output [12,13]. In classification, each tree is constructed from a random sample drawn with replacement from the training dataset. The data not included in the sample are referred to as the Out-Of-Bag (OOB) sample and are used to evaluate the performance of the grown trees in the random forest. Explanatory variables are used to create the nodes of the trees, and at each node a random subset of covariates is chosen. The covariate used to split a node is selected by the Gini Impurity Criterion (GIC). The GIC measures how often a randomly selected case would be incorrectly classified if it were labeled at random according to the class distribution in the dataset. The covariate that causes the highest reduction in GIC is selected to split at a node. After several iterations, cases assigned to the same class remain at the final nodes, and the final predicted class for a case is the class selected by the majority of trees. The variable importance plot, which orders the explanatory variables by mean decrease in Gini, can be used to identify important variables in the RF model [11].
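The building blocks described above (Gini impurity, the impurity decrease of a candidate split, bootstrap sampling with its OOB remainder, and majority voting) can be sketched in a few lines of Python. This is an illustration of the ideas only, not a full random forest and not the R implementation used in the study; all function names and the toy data are assumptions.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Probability of mislabeling a random case when labeling by the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Reduction in Gini impurity from splitting one covariate at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted_children


def bootstrap_sample(n_samples, rng):
    """Draw row indices with replacement; rows never drawn form the OOB sample."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Final class for one case: the class predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Toy data (hypothetical): one covariate and the NBP type of four cases.
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = ["NNBP", "NNBP", "PNBP", "PNBP"]
print(gini_decrease(toy_values, toy_labels, threshold=2.0))  # perfect split: 0.5
print(majority_vote(["PNBP", "NNBP", "PNBP"]))               # PNBP
```

In a forest, the candidate covariates at each node would be a random subset, and the split with the largest `gini_decrease` among them would be kept.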

Logistic Regression
LR is a popular method to model categorical outcomes [14]. BLR is used to model binary outcome variables. The model can be written as follows.
\[ \ln\left(\frac{\pi}{1-\pi}\right) = \alpha + \sum_{i=1}^{k} \beta_i x_i \]
where $\pi$ is the probability that the outcome falls in the category marked as 1, $x_i$ are the $k$ explanatory variables in the model, $\alpha$ is the intercept, and $\beta_i$ is the regression coefficient of the variable $x_i$. The ratio $\pi/(1-\pi)$ represents the odds of classifying the outcome in one category rather than the other (typically, the category marked as 1 rather than the category marked as 0) [15].
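The link between the coefficients and the odds can be made explicit with a short derivation. It is a sketch that assumes the standard BLR form, in which the log-odds are linear in the $k$ explanatory variables; $\mathrm{odds}(\cdot)$ is shorthand introduced here for $\pi/(1-\pi)$.

```latex
\begin{align*}
\frac{\pi}{1-\pi} &= \exp\!\Big(\alpha + \sum_{i=1}^{k} \beta_i x_i\Big), \\
\pi &= \frac{1}{1 + \exp\!\Big[-\Big(\alpha + \sum_{i=1}^{k} \beta_i x_i\Big)\Big]}, \\
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} &= \exp(\beta_j)
\quad \text{(all other } x_i \text{ held fixed).}
\end{align*}
```

A one-unit increase in $x_j$ therefore multiplies the odds by $\exp(\beta_j)$, the odds ratio: values above 1 favor the category marked as 1, and values below 1 favor the category marked as 0.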

Results and Discussions
Out of 437 NBPs, 177 (40.50%) were categorized as PNBPs, while 260 (59.50%) were classified as NNBPs. The highest number of NBPs was observed in 2013, making up 42.11% of the total, while the lowest number of NBPs was recorded in 2017 at 7.55%. The mean and standard deviation of continuous variables are presented in Table 3, along with other summary statistics for categorical variables.
Table 4 shows the descriptive statistics related to the two types of NBPs across the different independent variables considered in this study. Except for PD, all variables differed significantly between the two types of NBPs. Significantly higher values of RT, SFD, ZCT, FWHM, and RIOPA were observed among PNBPs compared to NNBPs (p < 0.05). The highest number of PNBPs was observed in 2015, while the 2013 dataset had the highest number of observed NNBPs.
Before fitting the RF and BLR models, the training and test datasets were explored. The training and test samples consisted of 349 (80%) and 88 (20%) NBPs, respectively. The test sample was used to evaluate the results obtained from the training sample. The distribution of data in all variables in the training and test sets is given in Table 5. According to the results, it was clear that no significant statistical differences were observed between the training and test sets, as the p-value of Pearson's Chi-squared test for all variables was greater than the 5% level of significance.
Figure 4 shows the variable importance plot of the RF model. The results indicate that SFD was the most important variable in the RF model.
The BLR model was first fitted to the training dataset, and the results are presented in Table 6. It was observed that SFD, RIOPA, RT, and ZCT were among the most significant variables in the BLR model, in addition to the "Year" variable, where all levels of Year were significant. PD and FWHM were found to be not significant at the considered significance level. The odds of an NBP being a PNBP were multiplied by a factor of 4762064 for a one-unit increase in SFD and by 1.84 for a one-unit increase in ZCT. The corresponding odds ratio for a one-unit increase in RIOPA was 125.17. For a one-unit increase in RT, the odds ratio was 0.48, meaning that the odds of a PNBP decreased. Relative to the reference year, the odds ratios of a PNBP in 2015 and 2016 were 50.92 and 4.36, respectively, while in 2017 the odds ratio was 0.02. The model was then tested using the test dataset, and the predictions of the BLR model were evaluated in terms of ACC, the kappa statistic, SE, SP, and AUC-ROC.
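The odds ratios reported for the BLR model can be translated back into regression coefficients and rescaled to smaller increments, as in the following sketch (Python, not the R code used in the study). The odds ratios are taken from the text; the function names and the 0.1-unit increment are assumptions chosen for illustration, and the rescaling relies on the log-linear form of the BLR model.

```python
import math

# Odds ratios reported in the text for a one-unit increase in each variable.
reported_odds_ratios = {"SFD": 4762064, "ZCT": 1.84, "RIOPA": 125.17, "RT": 0.48}


def coefficient_from_odds_ratio(odds_ratio):
    """Regression coefficient (log-odds scale) implied by an odds ratio."""
    return math.log(odds_ratio)


def odds_multiplier(odds_ratio, delta):
    """Factor by which the odds change when the variable changes by `delta` units."""
    return odds_ratio ** delta


for name, odds_ratio in reported_odds_ratios.items():
    beta = coefficient_from_odds_ratio(odds_ratio)
    direction = "raises" if beta > 0 else "lowers"
    print(f"{name}: beta = {beta:.3f}; a one-unit increase {direction} the odds of a PNBP")

# Illustrative increment (an assumption): a 0.1-unit increase in SFD.
print(f"SFD, +0.1 unit: odds multiplied by {odds_multiplier(4762064, 0.1):.2f}")
```

Seen this way, the very large odds ratio for SFD partly reflects the measurement scale: a tenth of a unit multiplies the odds by roughly 4.7 rather than by millions.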
A comparison between the RF model and the BLR model is shown in Table 7. According to the results, the values of ACC and SP were higher in the RF model than in the BLR model, although the SE value was slightly lower in the RF model than in the BLR model. The higher kappa statistic indicates a higher agreement between the results of the RF model and the observations in the dataset compared to that of the BLR model.
The ROC curve is shown in Figure 5, and the related AUC values indicate that the RF model has the ability to distinguish a PNBP from an NNBP based on the given variables in the model, outperforming the BLR model. In summary, the results demonstrate that the RF model had better performance compared to the BLR model.
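The AUC has a direct interpretation: it is the probability that a randomly chosen PNBP receives a higher predicted score than a randomly chosen NNBP. The sketch below computes it from that pairwise definition, in Python rather than the R environment used in the study. The scores are hypothetical, and treating PNBP as the positive class is an assumption.

```python
def auc_from_scores(scores, labels, positive="PNBP"):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == positive]
    negatives = [s for s, y in zip(scores, labels) if y != positive]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case from each class")
    correctly_ranked = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correctly_ranked += 1.0
            elif pos_score == neg_score:
                correctly_ranked += 0.5
    return correctly_ranked / (len(positives) * len(negatives))


# HYPOTHETICAL predicted probabilities of PNBP (not the study's predictions).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
toy_types = ["PNBP", "PNBP", "PNBP", "NNBP", "NNBP"]
print(round(auc_from_scores(toy_scores, toy_types), 3))  # 5 of 6 pairs ranked correctly: 0.833
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two types.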
In this study, SFD had a significant impact on the type of NBP compared to other variables and was the most important variable in the variable importance plot, with the highest decrease in the Gini index. It was also identified as a highly significant variable with the greatest change in odds for a one-unit increase in value in the BLR model. This confirms that SFD is a sensitive measure of the pulse that can be very useful in distinguishing between the two types of NBP. In contrast, PD and FWHM were insignificant in both the BLR and RF models, making them less useful in identifying the type of NBP. These results are in agreement with those of a previous study conducted in the tropics [3].
Of the two classification methods used in this study, RF outperformed the BLR model with a higher accuracy value and the strongest association between the predicted and observed values. The RF method is a popular non-parametric approach for classifying large amounts of data with many variables [16]. It is effective at handling missing data [17]. On the other hand, LR is a parametric modeling approach used in statistical analyses, and it uses odds ratios for interpretation. However, its accuracy can be compromised and it can experience collinearity issues when using many independent variables. Although statistical modeling and machine learning approaches like RF and BLR are new to NBP research, they have been used in other research areas, such as lightning and storm predictions [18-20].
Some limitations of this study include the limited previous research available for conducting a thorough literature review on the associations between the wave characteristics and other factors with the type of NBP. Additionally, the study did not consider the geographical location where the CIDs were observed in the model. Furthermore, the study sample may not be a perfect representation of NBPs that can occur at any location and at any time. The study only used two classification models, and the RF model performed better than the BLR model.

Conclusions
This study focused on exploring the impact of several variables, including the wave characteristics of NBPs, on the type of NBP. Two popular classification methods, RF and BLR, were used to fit two models and were compared in terms of several model performance assessment techniques. The results showed that the RF model was the best classifier. SFD, RIOPA, RT, and ZCT were identified as the most important variables associated with the type of NBP. These important variables can be utilized for forecasting purposes and combined with electronic systems and AI to distinguish the type of NBP. They will also be useful when analyzing the physical orientations of these electrical impulses.
Furthermore, this study can be extended or replicated in future studies by using other modeling or classification techniques and by collecting data during a specific period across different geographical locations or at different time points at one specific location to observe variations in the results with respect to geography or time.

Figure 4. Variable Importance Plot of RF

Figure 5. ROC curve comparing the two classification methods: random forest (RF) and binary logistic regression (BLR)

Table 1. Wave Characteristics

Table 2. Details of the variables

Table 3. Univariate analysis of all variables

Table 4. Descriptive statistics of all explanatory variables across the type of NBP

Table 5. Descriptive statistics of all variables across the training and test datasets

Table 6. Results of binary logistic regression (BLR)

Table 7. Comparison of RF and BLR in terms of accuracy measures