Comparative evaluation for alternative variable importance rankings for pedestrian injury severities

Little research has been dedicated to evaluating how differently various metrics rank predictor importance in the traffic safety field. To this end, the main objective of the current paper is to evaluate and quantify different methods for ranking the importance of variables related to crash severity. A comprehensive database of pedestrian-related crashes in the state of California was developed. Four popular metrics used in past studies were chosen for evaluation: Mean Decrease Accuracy (MDA), Mean Decrease Gini (MDG), the log-likelihood ratio test associated with the multinomial logit model, and Principal Component Analysis (PCA). The former two metrics both belong to the Random Forest (RF) technique, while the latter two are methods from different domains. The results show that the alternative methods yield different variable importance rankings, with PCA being isolated from the others. The two methods from the random forest domain, MDG and MDA, agree most closely but still reveal a 17% ranking difference. It is anticipated that the results could raise awareness of the importance of selecting appropriate metrics to evaluate predictor importance from different perspectives.


Introduction
Numerous past studies have considered walking an important travel mode for reducing congestion, decreasing motor-vehicle emissions, improving public health, and increasing opportunities for social connection [1]. However, the percentage of total trips undertaken by pedestrians is very low compared with other modes. In the United States, the National Household Travel Survey [2] reported that trips made by walking accounted for only 0.6% of total person-miles traveled (PMT). Researchers have indicated that one main reason for the relatively low distance traveled on foot is that pedestrians are among the most vulnerable and unsafe road users [3]. Therefore, there is a pressing need to better understand the factors affecting pedestrian safety, which can in turn lead to proper policies and strategies to encourage walking.
Given this context, pedestrian injury severities have been the focus of a large number of studies exploring contributing factors, including the roadway and built environment, pedestrian behavior, driver behavior, traffic characteristics, drug/alcohol use, social and demographic attributes, and many more. Correspondingly, a wide spectrum of modeling approaches has been adopted by researchers, such as logit/probit models [4], binary/ordinal/multinomial models [5], models incorporating random parameters [6], spatial-temporal models [7][8], and so on. It should be noted that in many circumstances the aim is not restricted to the prediction accuracy of the dependent variable, but also includes identifying the contribution of different influential variables. However, the models often identify different statistically significant variables even when similar covariates are used for model development. There are several reasons for this phenomenon. Some researchers relied on automated variable selection methods, which mainly aim to yield the best goodness of fit. Others used engineering judgment or personal preference to select the variables of interest after removing other factors based on correlation analysis to avoid multicollinearity issues.
Recently, with the rapid advancement in technology and computational capabilities, many variable importance ranking methods have been proposed to facilitate the selection of significant covariates for subsequent model development. These techniques include the log-likelihood ratio test [9], partial least squares [10], recursive partitioning [11], bagged trees and boosted trees, Random Forest (RF) [12], Principal Component Analysis (PCA) [13], and some others. The safety profession has witnessed applications of some of these methods in the past. Among them, the log-likelihood ratio test (LRT) may be the simplest one, yet it is very popular even though multiple steps are involved. Aside from the LRT, the RF technique has also been widely used to rank the importance of various crash-pertinent variables. In this technique, a number of trees are grown from the original dataset by randomly sampling observations with replacement, and a subset of variables is randomly selected at each split; the variable importance is then ranked from the resulting forest. Another prevalent method in the field is PCA, which extracts the important information through a small set of new variables, each a linear combination of highly correlated original variables.
The aforementioned studies demonstrate some popular methods proposed for variable importance ranking in the safety field. However, to the best knowledge of the authors, there is no prior research dedicated to evaluating the ranking performance of alternative methods. To fill this research gap, the main objective of this study is to assess some common algorithms used to rank the contributing variables of crash injury severities, namely Mean Decrease Accuracy (MDA), Mean Decrease Gini (MDG), the log-likelihood ratio test associated with the multinomial logit model, and PCA. The four selected criteria allow the authors to examine variable importance ranking differences both within the same domain and across different domains. It is anticipated that this study will yield more insights for safety professionals about prioritizing variables for model development. It is also noteworthy that the paper seeks merely to raise awareness of potential ranking differences among some of the frequently used methods, rather than to provide an exhaustive survey covering all feature importance ranking methods.

Data description
The data obtained for this study were provided by HSIS [14], which collected the data in the form of different raw files from the California TASAS (Traffic Accident Surveillance and Analysis System). Five years (2010 to 2014) of pedestrian-involved crash data from all California state highway segments were used to evaluate alternative methods for ranking variable importance in crash severity. In this study, the crash data were extracted from different types of files linked with road, vehicle, and crash characteristics. The data collected from these files include the crash number along with geometric factors (lane width, number of lanes, median type, etc.), traffic factors (Average Annual Daily Traffic, Design Speed, average lane length, etc.), driver attributes (race, sex and alcohol consumption), and so on. A total of 2869 pedestrian-related collisions were considered in this study. In addition to the dependent variable of crash injury, there are 52 covariates, comprising 11 numerical variables and 41 categorical ones, which are shown in Table 1.

Methodology
This study used five levels of injury severity for pedestrian-involved crashes that occurred in the 58 counties of California over a period of five years to evaluate four selected metrics for variable importance ranking. The following subsections present the methodological details associated with each of them in order.

Random Forest (RF)
RF is an ensemble learning method, that is, a method that develops many classifiers and aggregates their results, and it has been extensively used to identify important variables. It is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees [12]. Within a random forest, two main indices can be computed to capture the relative influence of variables on classification: Mean Decrease Gini (MDG) and Mean Decrease Accuracy (MDA). The former sums the decrease in node impurity attributable to a variable to measure its classification effect, while the latter represents the decrease in accuracy observed when the information carried by a variable is removed, typically by randomly permuting its values. The present paper relied on the 'randomForest' package in R for the calculation of MDA and MDG. Readers wishing for more details can refer to James et al. [13].
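To make the two indices concrete, the following minimal Python sketch computes (a) the Gini impurity decrease of a single split, which is the quantity accumulated over splits and trees to form MDG, and (b) the drop in accuracy after permuting one predictor, which is the idea behind MDA. It is not the R implementation used in this paper: the function names, the toy data and the stub classifier `rule` are assumptions for illustration only, and a real forest would average these quantities over many trees and out-of-bag samples.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - children


def accuracy(predict, rows, labels):
    """Share of observations whose predicted class matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_accuracy_drop(predict, rows, labels, column, rng):
    """Accuracy lost when one column is shuffled across observations."""
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


# Toy data (hypothetical): column 0 determines the class, column 1 is noise.
rows = [[1, 7], [2, 3], [3, 9], [4, 1], [5, 4], [6, 8]]
labels = [0, 0, 0, 1, 1, 1]


def rule(row):
    """Stub classifier standing in for a fitted forest (assumption)."""
    return int(row[0] > 3)


rng = random.Random(0)
mdg_like = gini_decrease([r[0] for r in rows], labels, threshold=3)
mda_signal = permutation_accuracy_drop(rule, rows, labels, 0, rng)
mda_noise = permutation_accuracy_drop(rule, rows, labels, 1, rng)
print(mdg_like, mda_signal, mda_noise)
```

In this toy setting the noise column cannot lose any accuracy when permuted because the stub classifier never reads it, whereas the informative column produces a perfect split and therefore the largest possible impurity decrease for two balanced classes.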

Principal Component Analysis (PCA)
PCA aims to reduce the dimensionality of a dataset consisting of a large number of interrelated variables while retaining as much of the variation in the data as possible. This is achieved by a mathematical transformation of possibly correlated variables into a number of uncorrelated variables known as principal components (PCs), whose amounts of variation are represented by eigenvalues. The first principal component has the largest eigenvalue, which corresponds to the direction with the maximum amount of variation in the dataset. The number of principal components retained is determined by the eigenvalues. In general, PCs whose eigenvalues are greater than 1 are retained, since such PCs explain more variance than is explained by one of the initial variables. For each PC, the percentage of the PC variability explained by a given variable is normally defined as the contribution of that variable to the PC. The importance of any variable can then be calculated as the weighted average of its contributions across all retained PCs, using the corresponding eigenvalues as the weights. The following formula gives the overall contribution of a variable to the dataset:

$$C_i = \frac{\sum_{j=1}^{k} C_{ij} E_j}{\sum_{j=1}^{k} E_j} \tag{1}$$

where $C_i$ is the contribution of variable $i$ to the entire dataset, $C_{ij}$ is the contribution of variable $i$ to the $j$th PC, $E_j$ is the eigenvalue of the $j$th component, and $k$ is the number of retained PCs. The value of a contribution ranges from 0 to 1. The larger the contribution to the dataset, the more important the variable is. The data analytics and visualization were conducted using the R packages 'FactoMineR' and 'factoextra' [15].
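The eigenvalue-weighted average described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the eigenvalues and contributions are made-up numbers rather than results from this study, and the contributions are taken as already computed (for example, exported from the R packages mentioned above and rescaled to the 0 to 1 range).

```python
def retained_components(eigenvalues, cutoff=1.0):
    """Indices of PCs whose eigenvalue exceeds the cutoff (eigenvalue > 1 rule)."""
    return [j for j, e in enumerate(eigenvalues) if e > cutoff]


def overall_contribution(contributions, eigenvalues, cutoff=1.0):
    """Eigenvalue-weighted average of each variable's contributions.

    contributions[i][j] is the contribution (0 to 1) of variable i to PC j.
    Only PCs retained by the eigenvalue cutoff enter the average.
    """
    keep = retained_components(eigenvalues, cutoff)
    total = sum(eigenvalues[j] for j in keep)
    return [sum(row[j] * eigenvalues[j] for j in keep) / total
            for row in contributions]


# Hypothetical numbers for three variables and three PCs (not from the paper).
eigenvalues = [2.0, 1.5, 0.5]  # the third PC is dropped because 0.5 <= 1
contributions = [
    [0.50, 0.10, 0.20],
    [0.30, 0.30, 0.70],
    [0.20, 0.60, 0.10],
]
scores = overall_contribution(contributions, eigenvalues)
# Order variable indices from most to least important.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because the contributions to each retained PC sum to 1 across variables, the overall scores also sum to 1, which makes them directly comparable as shares of the retained variability.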

Log-likelihood Ratio Test (LRT)
The likelihood-ratio test (or the likelihood-ratio chi-squared test) is a hypothesis test usually used to determine the "better" model between two nested models. In the present study, the crash severities from HSIS were divided into five different levels: fatal, severe injury, other visible injury, complaint of pain, and Property Damage Only (PDO). Out of the different modeling approaches available to address the discrete outcomes of crashes, the typical multinomial logit was used since it can relax the parameter restriction imposed by ordered-probability models and can also provide consistent parameter estimates even in the presence of crash underreporting.
The multinomial framework used to model the degree of injury severity sustained in a crash involving a pedestrian can be expressed as follows:

$$S_{in} = \alpha_i + \boldsymbol{\beta}_i \mathbf{X}_{in} + \varepsilon_{in} \tag{2}$$

where $S_{in}$ is a linear function that determines injury severity outcome $i$ for crash $n$; $\alpha_i$ is the intercept; $\boldsymbol{\beta}_i$ is the vector of coefficient estimates; $\mathbf{X}_{in}$ is the vector of explanatory factors (e.g., roadway, crash, vehicle characteristics) determining the pedestrian injury severity $i$ for crash $n$; and $\varepsilon_{in}$ is an independent and identically distributed error term that accounts for the unobserved heterogeneity. In Equation 2, if there is only the intercept $\alpha_i$, the model collapses to the null model, which is nested in alternative models where the predictors are included as well. Once the log-likelihood values of the null model and alternative model(s) are known, the LRT statistic can be computed using the following expression:

$$LRT = -2\left[LL_{0} - LL_{A}\right] \tag{3}$$

where $LL_{0}$ is the log-likelihood value of the null model, and $LL_{A}$ is the log-likelihood value at convergence for the alternative model. The LRT statistic follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the alternative model and the null model. If the p-value associated with the chi-square distribution is smaller than 0.05, then the variable is considered highly significant.
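As a worked illustration of the test, the sketch below computes the LRT statistic and its chi-square p-value from two log-likelihood values using only the Python standard library. The log-likelihoods are hypothetical, the function names are assumptions, and the degrees of freedom assume a standard multinomial logit with one base outcome, so that one numeric predictor adds one coefficient for each of the four non-base severity levels. The chi-square tail is evaluated with a simple series for the incomplete gamma function, which is adequate for a sketch but loses precision for very small p-values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df degrees of freedom.

    Uses the series expansion of the regularized lower incomplete gamma
    function; accurate enough for illustration, not for extreme tails.
    """
    if x <= 0:
        return 1.0
    a, half = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def likelihood_ratio_test(ll_null, ll_alt, df):
    """Return the statistic -2[LL(null) - LL(alternative)] and its p-value."""
    stat = -2.0 * (ll_null - ll_alt)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods (not from the paper). With five severity
# levels and one base outcome, one numeric predictor adds 5 - 1 = 4
# coefficients to the intercept-only model, hence df = 4 (assumption).
stat, p_value = likelihood_ratio_test(ll_null=-4100.0, ll_alt=-4090.0, df=4)
print(round(stat, 2), p_value, p_value < 0.05)
```

Repeating this comparison with one candidate variable at a time and sorting the resulting p-values in ascending order gives a ranking of the kind used for the LRT criterion.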

Results
This study aims to evaluate alternative statistics for variable importance ranking for pedestrian injury severity. Four different metrics were computed and compared using the statistical software R. First, the distinct metrics, MDA, MDG, PCA, and LRT, were calculated separately to rank the variables of the pedestrian-related crash data. Second, different tools such as correlation analysis, heatmaps, and dendrograms were applied to evaluate the ranking differences and similarities among the various measures.
As previously mentioned, the variable importance rankings of MDA and MDG were obtained by building a classification-type random forest on bootstrap training samples, where each split in a tree was chosen from a random sample of variables whose size is approximately equal to the square root of the total number of variables [13]. Regarding the variable importance ranking based on the LRT, alternative models (each with only one variable of interest) were compared to the corresponding null model under the framework of a multinomial logit model to obtain the p-values of the associated chi-square tests. The variable importance ranking for PCA was performed by computing the weighted average of the contributions of variables across all retained PCs, using the corresponding eigenvalues as the weights. The present study retained all statistically significant PCs whose eigenvalues were greater than one [15]. In total, 97 PCs were used, which together explain 67.28% of the variability.
CTIS-2023 Journal of Physics: Conference Series 2595 (2023) 012015
The detailed ranking of variable importance for each method is presented in Table 2. A smaller ranking value indicates greater variable importance. It can be seen that the different metrics yield some level of dissimilarity in the rankings. PCA and MDA consider "acctype" and "light" as the most important variable, respectively, while both MDG and LRT (i.e., M_p) assign first place to "ped_actn". At the other end, PCA, LRT, MDA and MDG treat "light", "lanewid", "feat_lf" and "divided" as the least important variables for pedestrian injury severity, respectively. To better understand the importance ranking of individual variables, a composite ranking was also developed based on the sum of the rankings from each alternative method. To be consistent with the individual criteria, the smaller the ranking sum, the greater the importance of the variable under the composite ranking. Under the composite index, "acctype" and "rdsurf" sit at the two ends of the spectrum, with the former being the most important and the latter being the least important. In addition to the above results, the authors also examined the detailed consistency among the four ranking metrics. A scatterplot matrix was developed for this purpose, which also displays the correlation of each ranking pair. As revealed in Figure 1, MDA and MDG show a strong correlation, with a correlation coefficient of 0.83. The LRT shows a moderate correlation with both MDA and MDG, with correlation coefficients of 0.69 and 0.60, respectively. Comparatively, PCA appears to be isolated from the others, with relatively small coefficient values (0.12, 0.20, and 0.31). This phenomenon is corroborated by the ranking values in Table 2. For instance, "light" is ranked last under PCA in terms of variable importance, while it is considered one of the top 3 variables under each of the other criteria.
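The composite ranking and the pairwise comparison of rankings described above can be sketched as follows. The variable names and rank values are hypothetical and are not the values in Table 2, and the function names are assumptions. The paper does not state which correlation coefficient underlies Figure 1; the sketch uses the Pearson coefficient, which coincides with Spearman's rank correlation when it is applied to rankings without ties.

```python
def composite_ranking(rank_table):
    """Rank variables by the sum of their ranks across methods (1 = most important).

    Ties in the rank sum are broken by variable name, an arbitrary choice.
    """
    sums = {var: sum(ranks.values()) for var, ranks in rank_table.items()}
    ordered = sorted(sums, key=lambda v: (sums[v], v))
    return {var: pos for pos, var in enumerate(ordered, start=1)}


def pearson(x, y):
    """Pearson correlation; on tie-free ranks it equals Spearman's rho."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical ranks for four variables (NOT the values in Table 2).
rank_table = {
    "var_a": {"PCA": 4, "LRT": 1, "MDA": 1, "MDG": 2},
    "var_b": {"PCA": 1, "LRT": 3, "MDA": 2, "MDG": 1},
    "var_c": {"PCA": 2, "LRT": 2, "MDA": 4, "MDG": 4},
    "var_d": {"PCA": 3, "LRT": 4, "MDA": 3, "MDG": 3},
}
composite = composite_ranking(rank_table)
mda = [ranks["MDA"] for ranks in rank_table.values()]
mdg = [ranks["MDG"] for ranks in rank_table.values()]
pca = [ranks["PCA"] for ranks in rank_table.values()]
print(composite, pearson(mda, mdg), pearson(pca, mda))
```

Summing ranks gives every method equal weight, so a method that disagrees with the other three, as PCA does in this study, can still shift the composite position of a variable noticeably.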

Conclusions and recommendations
The main objective of this study was to evaluate popular metrics used in the safety field, based on different algorithms, for ranking the contributing variables of pedestrian-related injury severities. Based on the empirical results described in this paper, the following conclusions are drawn:
1. The variable importance rankings vary, sometimes significantly, from metric to metric in this study.
2. The ranking metrics, even those from the same algorithmic domain, may lead to an unexpected level of dissimilarity.
3. It is highly recommended that multiple methods be employed and compared before the most appropriate one for variable importance evaluation is chosen.
First, the above findings may not hold true when applied to other states. It is therefore encouraged to explore different datasets to test whether the findings of the present study remain the same. Second, this study employed four different metrics to perform variable importance ranking. Future studies could extend this work by adopting more metrics frequently used in other professions to rank the variables, which might help gain new insights.

Notes: 1. R_PCA, R_M_p, R_MDA and R_MDG are the rankings of variable importance under the criteria of PCA, LRT, MDA and MDG, respectively. 2. The correlation coefficients are shown with various font sizes commensurate with the magnitude of the coefficient values. 3. *** and * signify the level of significance of 99% and 95%, respectively.

Figure 1. Scatterplot matrix of four alternative methods for variable importance ranking.

Table 1. All variables used in the study.

| Variable | Description |
| --- | --- |
| numvehs | Total number of vehicles involved in the crash |
| severity | Collision Severity |
| drv_age | The age of the driver of the vehicle involved in the crash |
| weather1 | Weather |
| miscact1 | Movement Preceding Accident |
| weekday | Day of Week |
| curb1 | Curb and Landscape |
| terrain | Terrain |
| celphone | Usage of cellphone in the vehicle |
| desg_spd | Design Speed |
| feat_rg | Right Road Border Special Feature |
| drv_race | Driver Race |
| feat_lf | Left Road Border Special Feature |
| drv_sex | Driver Sex |
| veh_invl | Involved in The Accident |
| insur | Insurance |
| contrib1 | First Associated Factor |
| rururb | Rural/Urban |
| surf_typ | Surface Type Roadway 1 of separate Highway |
| divided | Divided Highway |
| medbarty | Median Barrier Type |
| object1 | First Object Struck |
| loc_typ1 | First Collision Location |
| spec_inf | Special Information |
| sobriety | Sobriety of the driver of this vehicle |
| access | Access Control |
| func_cls | Functional Class |
| vehtype | Vehicle Type |
| acctype | Type-of-Collision |
| med_type | Median Type |
| cause1 | Primary Collision Factor (DOT) |

Table 2. The variable importance ranking under various methods.
Notes: 1. PCA represents principal component analysis; M_p represents the multinomial model p-values of the log-likelihood ratio test; MDA is Mean Decrease Accuracy; MDG is Mean Decrease Gini; Com represents the composite ranking based on the sum of the four individual rankings. 2. The bold fonts indicate the highest ranking, while the shaded cells signify the lowest ranking of variable importance.