Boosting the performance of anomalous diffusion classifiers with the proper choice of features

Understanding and identifying different types of single molecules' diffusion that occur in a broad range of systems (including living matter) is extremely important, as it can provide information on the physical and chemical characteristics of particles' surroundings. In recent years, an ever-growing number of methods have been proposed to overcome some of the limitations of the mean-squared displacement approach to tracer diffusion. In March 2020, the Anomalous Diffusion (AnDi) Challenge was launched by a community of international scientists to provide a framework for an objective comparison of the available methods for anomalous diffusion. In this paper, we introduce a feature-based machine learning method developed in response to Task 2 of the challenge, i.e. the classification of different types of diffusion. We discuss two sets of attributes that may be used for the classification of single-particle tracking data. The first one was proposed as our contribution to the AnDi Challenge. The second one is the result of our attempt to improve the performance of the classifier after the deadline of the competition. Extreme gradient boosting was used as the classification model. Although the deep-learning approach constitutes the state-of-the-art technology for data classification in many domains, we deliberately decided to pick this traditional machine learning algorithm due to its superior interpretability. After the extension of the feature set, our classifier achieved an accuracy of 0.83, which is comparable with the top methods based on neural networks.


I. INTRODUCTION
Single-particle tracking (SPT) is a popular method for observing molecular dynamics in a wide range of materials, including living cells [1,2]. Typically, an SPT experiment results in trajectories of tracers, i.e. time series of their consecutive positions, the analysis of which allows one to extract properties of the molecules and their environment (e.g. density of obstacles, velocity).
The key first step in the analysis of SPT data is to connect the observed motion of particles with the available models of diffusion [3], since it already sheds light on the mechanical properties of the particles' surroundings [4]. The most common approach to characterize the diffusion of particles is the mean-squared displacement (MSD) [3,5,6], which measures their deviations with respect to a reference position in time. An MSD growing linearly in time corresponds to Brownian motion (a.k.a. normal diffusion), which describes the movement of a microscopic particle as a consequence of thermal forces. Any deviation from that linear behavior is called anomalous diffusion [7], which can occur in physical, biological, or geological systems. Those deviations are usually represented by a power-law scaling of the MSD (i.e. $\mathrm{MSD}(t) \propto t^\alpha$) [3,5,6], with α being the anomalous diffusion exponent. A sublinear MSD (α < 1) corresponds to subdiffusion. It indicates trapped particles [8,9], particles which hit upon obstacles [10,11] or particles moderated by viscoelastic properties of the environment [12]. Subdiffusion can take place in crowded cellular fluids and membranes [13], and can be caused by crowding and complex interactions with the cytoskeleton and macromolecular complexes: in the cytoplasm [14][15][16], the nucleus [17] and the plasma membrane [18][19][20]. A superlinear MSD (α > 1) is referred to as superdiffusion. In this case, the movement of particles is usually faster than in normal diffusion and frequently directed to a specific location. Examples of superdiffusion include Lévy flights [21,22], active cytoplasmic flows and transport mediated by molecular motors [23,24], particle motion in random potentials or in turbulent flows [25,26], foraging by animals and humans [27], and many others [28].
Despite its simplicity, the MSD-based analysis of SPT trajectories is truly challenging.
The experimental trajectories are very often too short to extract meaningful information from the MSD. Moreover, the finite precision of tracers' localization adds a term to the MSD, which can limit the interpretation of data [29][30][31]. Consequently, several other approaches going beyond MSD have been introduced. For instance, the radius of gyration [32], the velocity autocorrelation function [33,34], the time-dependent directional persistence of trajectories [35], the distribution of directional changes [36], the mean maximum excursion method [37], or the fractionally integrated moving average (FIMA) framework [38] may efficiently replace the MSD estimator for classification purposes. The full distribution of displacements may be fitted to a mixed model in order to extract differences in diffusive behavior between subsets of particle ensembles [39]. Heterogeneity within single trajectories can be checked with Hidden Markov Models [40,41].
In the last few years, machine learning (ML) has become an interesting alternative for the analysis of anomalous diffusion. This approach is very appealing because, in contrast to the analytical methods, it does not require explicit rules for data processing. Instead, a ML algorithm can learn those rules directly from a series of data. ML is already known to excel in different domains, including computer vision [42], speech recognition [43] and natural language processing [44]. First attempts to apply it to SPT data turned out to be very promising, even though the characterization of diffusion remains very challenging. Bayesian approaches [45][46][47], random forests [48][49][50][51][52], gradient boosting [49][50][51][52], neural networks [53], and deep neural networks [49,[54][55][56][57] have been used either to classify the trajectories or to extract quantitative information about them.
The increasing number of methods, both analytical and ML-based, and the lack of an objective comparison between them have resulted in the launch of the Anomalous Diffusion (AnDi) challenge by a team of international scientists [58]. The goal of the challenge was at least two-fold: to provide a framework to benchmark the existing and new methods for the analysis of anomalous diffusion and to spur the invention of new high-performance methods. The challenge itself was divided into three main tasks: (1) inference of the anomalous diffusion exponent α from the trajectories provided by the organizers, (2) classification of trajectories into 5 different models of diffusion, and (3) detection of points within single trajectories, at which the anomalous exponent α and/or the diffusion coefficient D change.
From the results of the challenge, it follows that the ML methods based on neural networks outperform the more traditional statistics-based approaches [59]. However, it should be emphasized here that the latter usually offer a deeper insight into the mechanisms of classification.
The main obstacle limiting the deployment of ML to trajectory analysis is the availability of decent training data. The experimental trajectories are not labeled (otherwise we would not need any new method of analysis). As a consequence, synthetic trajectories generated by computer simulations of various diffusion models are typically used as training data.
The organizers of the AnDi challenge have identified five models of particular importance for the interpretation of experimental results: continuous time random walk (CTRW) [60], fractional Brownian motion (FBM) [61], Lévy walk (LW) [26,62,63], annealed transient time motion (ATTM) [64], and scaled Brownian motion (SBM) [65]. To ensure that all participants of the challenge use the same data for inventing/improving their methods, a Python package with the implementation of the models has been provided by the organizers. This paper describes the details of the method we proposed in response to the AnDi challenge. We focus on the classification of 2D trajectories, i.e. a part of Task 2 within the challenge. In our recent research [49,51,52,57], we already investigated the applicability of both feature-based models and deep learning techniques to anomalous diffusion. The first approach requires a set of human-engineered features to characterize each trajectory. Vectors of these features are then used as input for a classification model. Deep learning methods, in contrast, extract significant features from raw data on their own, without any effort from a human expert. The deep learning approach yielded slightly better performance than the feature-based one, in line with the general findings of the AnDi challenge, but we decided to further elaborate on the latter due to its superior interpretability. Our method turned out to be inferior to those of the winning teams, but still offered a reasonable performance [59]. Moreover, further investigation of the method after the challenge allowed us to significantly improve its accuracy by adding some new features and to achieve performance similar to the best methods (0.83 accuracy in Task 2 versus 0.88 accuracy of the winners).
The paper is divided as follows. Diffusion models used for training and the classification method are introduced in Sec. II. Sec. III briefly summarizes the dataset used for training.
Results of our analysis are presented in Sec. IV. Finally, some concluding remarks are given.

II. METHODS

A. Anomalous exponent α
The most popular method of deducing the particles' type of motion from their trajectories is based on the mean-squared displacement (MSD) [3,5,6,66], defined as
$$ \mathrm{MSD}(t) = E\left[ \| X_t - X_0 \|^2 \right], $$
where $(X_t)_{t>0}$ is a particle's trajectory, $\|\cdot\|$ stands for the Euclidean distance and $E$ is the ensemble average. Since in many experiments only a limited number of trajectories is observed, the time-averaged MSD (TAMSD) calculated from a single trajectory is usually used as the estimator of MSD,
$$ \overline{\delta^2}(n \Delta t) = \frac{1}{N-n+1} \sum_{i=0}^{N-n} \| X_{i+n} - X_i \|^2. $$
The trajectory consists of $N+1$ consecutive two-dimensional positions $X_i = (x_i, y_i)$ ($i = 0, \ldots, N$) recorded with a constant time interval $\Delta t$. The quantity $n$ is the time lag between the initial and the final positions of the particle. If the underlying process is ergodic and has stationary increments, TAMSD converges to the theoretical MSD [67].
Anomalous diffusion refers to a broad class of processes that deviate from the standard Brownian motion. Those deviations display an asymptotic power-law dependence of the MSD on the time lag,
$$ \mathrm{MSD}(t) \propto K_\alpha t^\alpha, $$
where $K_\alpha$ is the generalized diffusion coefficient and $\alpha$ is the anomalous exponent. As already mentioned in the introduction, the value of the latter may be used to discriminate the trajectories into normal diffusion (α = 1), subdiffusion (α < 1), and superdiffusion (α > 1).
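For concreteness, the TAMSD estimator can be sketched in a few lines of Python (a minimal illustration, not the code used in the challenge):

```python
import numpy as np

def tamsd(traj, n):
    """Time-averaged MSD at lag n for a trajectory of shape (N+1, 2):
    the average of squared displacements between positions n frames apart."""
    disp = traj[n:] - traj[:-n]
    return np.mean(np.sum(disp**2, axis=1))
```

For a ballistic path the returned values grow quadratically with the lag, while for Brownian motion they grow linearly, which is exactly the behavior the anomalous exponent quantifies.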

B. Stochastic models of diffusion
Particles' movements that exhibit deviations from the linear behavior of the MSD may be described by several stochastic models. In this section, the main characteristics of those models are briefly summarized.

Continuous time random walk
CTRW defines a large family of random walks with arbitrary displacement density, for which the time spent between subsequent steps (i.e. the waiting time) is a stochastic variable [60]. Here, we will consider a particular instance of CTRW with the waiting times sampled from a power-law distribution $\psi(t) \sim t^{-\sigma}$ and the displacements drawn from a Gaussian distribution with zero mean and variance $D$. In this case, we have α = σ − 1.
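A minimal sketch of such a CTRW generator is given below. It is an illustration only (the challenge data were produced with the organizers' package); in particular, the unit lower cutoff of the waiting-time distribution is our assumption:

```python
import numpy as np

def ctrw_2d(n_steps, sigma=1.5, D=1.0, dt=1.0, seed=0):
    """Sketch of a 2D CTRW: power-law waiting times psi(t) ~ t^(-sigma)
    (unit lower cutoff, sampled by CDF inversion) and Gaussian jumps with
    variance D; the position is constant between jump events and is read
    out on a regular grid of n_steps + 1 points."""
    rng = np.random.default_rng(seed)
    t_max = n_steps * dt
    times, pos = [0.0], [np.zeros(2)]
    while times[-1] < t_max:
        wait = (1.0 - rng.random()) ** (-1.0 / (sigma - 1.0))  # Pareto sample
        times.append(times[-1] + wait)
        pos.append(pos[-1] + rng.normal(0.0, np.sqrt(D), size=2))
    times, pos = np.array(times), np.array(pos)
    grid = np.arange(n_steps + 1) * dt
    return pos[np.searchsorted(times, grid, side="right") - 1]
```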

Fractional Brownian motion
FBM [61] is the solution of the stochastic differential equation
$$ dX_t = \sigma \, dB_t^H. $$
Here, $\sigma > 0$ is the scale coefficient, which relates to the diffusion coefficient $D$ via $\sigma = \sqrt{2D}$.
$H \in (0, 1)$ is the Hurst parameter and $B_t^H$ is a continuous-time, zero-mean Gaussian process starting at zero, with the following covariance function:
$$ E\left[ B_t^H B_s^H \right] = \frac{1}{2} \left( |t|^{2H} + |s|^{2H} - |t-s|^{2H} \right). $$
The value of $H$ determines the type of diffusion in the process, with the anomalous exponent given by $\alpha = 2H$. For $H < \frac{1}{2}$, FBM produces subdiffusion (regime with a negatively correlated noise). For $H > \frac{1}{2}$, FBM generates superdiffusive motion (positively correlated noise). It reduces to normal diffusion at $H = \frac{1}{2}$.
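One straightforward (if not the fastest) way to sample a process consistent with this covariance is a Cholesky factorization; a 1D sketch, for illustration only:

```python
import numpy as np

def fbm_1d(n, H, seed=0):
    """Sample FBM at times 1..n by Cholesky factorization of the covariance
    E[B_t B_s] = 0.5 * (t^{2H} + s^{2H} - |t - s|^{2H}).  The O(n^3) cost
    makes this practical only for short trajectories."""
    t = np.arange(1, n + 1, dtype=float)
    cov = 0.5 * (t[:, None] ** (2 * H) + t[None, :] ** (2 * H)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * H))
    rng = np.random.default_rng(seed)
    return np.linalg.cholesky(cov) @ rng.normal(size=n)
```

Faster exact samplers (e.g. circulant embedding) exist and are preferable for long paths.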

Lévy walk
Lévy walk [26,62,63] is another model of a random walk, but in contrast to CTRW, the waiting times and step lengths are coupled. In this work, we assume that the waiting times are power-law distributed (as in the case of CTRW, $\psi(t) \sim t^{-\sigma}$) and that the displacements and waiting times are correlated in such a way that the probability to move a certain step $\Delta x$ at time $t$ and stop at the new position to wait for a new random event is
$$ \Psi(\Delta x, t) = \frac{1}{2} \, \delta\left( |\Delta x| - v t \right) \psi(t), $$
where $v$ is the velocity. Then, one can show that the anomalous exponent is α = 2 for 0 < σ < 1 and α = 3 − σ for 1 < σ < 2. We may also observe that the distribution of displacements is not Gaussian.
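A 1D sketch of this coupled dynamics (again an illustration, with a unit lower cutoff on flight durations as our assumption):

```python
import numpy as np

def levy_walk_1d(n_steps, sigma=1.5, v=1.0, dt=1.0, seed=0):
    """Sketch of a 1D Levy walk: flight durations drawn from a power law
    psi(t) ~ t^(-sigma) (unit lower cutoff), constant speed v, a random
    direction per flight; positions are read out on a regular time grid."""
    rng = np.random.default_rng(seed)
    pos, out = 0.0, [0.0]
    while len(out) <= n_steps:
        t_flight = (1.0 - rng.random()) ** (-1.0 / (sigma - 1.0))
        direction = 1.0 if rng.random() < 0.5 else -1.0
        for _ in range(max(1, int(round(t_flight / dt)))):
            pos += direction * v * dt  # constant speed during the flight
            out.append(pos)
            if len(out) > n_steps:
                break
    return np.array(out[:n_steps + 1])
```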

Annealed transient time motion
ATTM describes the motion of a Brownian particle with a time-varying diffusion coefficient [64]. The tracer performs Brownian motion with a random diffusion coefficient $D_1$ for a random time $t_1$, then with $D_2$ for $t_2$, and so on. The diffusion coefficients are drawn from a power-law distribution $P(D) \propto D^{\sigma-1}$. If the random times are sampled from a distribution with the expected value $E[t|D] = D^{-\gamma}$ ($\sigma < \gamma < \sigma + 1$), then we have α = σ/γ. Following Ref. [59], we will assume that $P(t|D) \propto \delta(t - D^{-\gamma})$.
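A 1D sketch of this piecewise dynamics; the support $(0, 1]$ for $P(D)$ is our assumption for the illustration:

```python
import numpy as np

def attm_1d(n_steps, sigma=0.8, gamma=1.2, dt=1.0, seed=0):
    """Sketch of 1D ATTM: Brownian motion whose diffusion coefficient D is
    redrawn from P(D) ~ D^(sigma-1) (support (0, 1] assumed here) after each
    period of length t = D^(-gamma), i.e. P(t|D) = delta(t - D^(-gamma))."""
    rng = np.random.default_rng(seed)
    x, out, t_left, D = 0.0, [0.0], 0.0, 1.0
    for _ in range(n_steps):
        if t_left <= 0.0:
            D = rng.random() ** (1.0 / sigma)  # inverse CDF of P(D) ~ D^(sigma-1)
            t_left = D ** (-gamma)
        x += rng.normal(0.0, np.sqrt(2.0 * D * dt))
        out.append(x)
        t_left -= dt
    return np.array(out)
```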

Scaled Brownian motion
The scaled Brownian motion (SBM) [65] is a process described by the Langevin equation with a time-dependent diffusivity $K(t)$,
$$ \frac{dX_t}{dt} = \sqrt{2 K(t)} \, \xi(t), $$
where $\xi(t)$ is white Gaussian noise. If the diffusivity has the form $K(t) = \alpha K_\alpha t^{\alpha-1}$, the MSD follows the power law $\mathrm{MSD}(t) \propto K_\alpha t^\alpha$.
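Because SBM has independent Gaussian increments, it can be sampled exactly on a grid by integrating the diffusivity over each interval; a 1D sketch:

```python
import numpy as np

def sbm_1d(n_steps, alpha, K=1.0, dt=1.0, seed=0):
    """Sketch of 1D SBM: independent Gaussian increments whose variance over
    [t_i, t_{i+1}] equals 2*K*(t_{i+1}^alpha - t_i^alpha), so that
    MSD(t) = 2*K*t^alpha holds exactly on the sampling grid."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps + 1) * dt
    steps = rng.normal(0.0, np.sqrt(2.0 * K * np.diff(t ** alpha)))
    return np.concatenate([[0.0], np.cumsum(steps)])
```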

C. Classification model
In machine learning, all classification algorithms may be divided into two classes. The traditional machine learning is a set of statistical learning methods, which do not operate on raw data. Instead, each data sample is characterized by a vector of human-engineered features or attributes. Those vectors are then used as input for the classifier. The second class consists of deep learning methods that identify and extract features on their own. The representation of data is constructed automatically and there is no need for its complex preprocessing as in the case of the feature-based methods.
Deep learning is nowadays the state-of-the-art technology for data classification in many domains and somewhat overshadows the qualities of classical machine learning. However, the choice of a suitable classification method is usually more subtle than simply looking at its performance. Many deep learning methods indeed do an excellent job with respect to predictions but are extremely hard to interpret. To give an example, the ResNet18 architecture we considered in one of our previous attempts at trajectory classification [57] originally had 11,220,420 parameters. We were able to reduce their number down to 399,556 with a positive impact on accuracy. Although it was an impressive achievement, the interpretation of all of those remaining parameters is almost impossible. Thus, if one wants to comprehend the decisions made by an ML algorithm, one could be tempted to go for feature-based methods and gain interpretability at the expense of accuracy.
Being aware of this trade-off, we deliberately decided to follow the interpretability path in our contribution to the AnDi challenge and chose the extreme gradient boosting (XGB) algorithm with decision trees as base learners [68] as the classification method. We have already used a similar approach [49,51,52], although for different sets of data.
XGB is an ensemble method, which combines multiple decision trees to obtain better predictive performance. A single decision tree [69] is fairly simple to build. The original dataset is split into smaller subsets based on the values of a given feature. The process is recursively repeated until the resulting subsets are homogeneous (all samples belong to the same class) or further splitting does not improve the classification performance. The splitting feature at each step is chosen according to a similarity score measure [70].
Single decision trees are famous for their ease of interpretation. However, they are sensitive to even small variations of data and prone to overfitting. That is why one rather uses forests (ensembles) of trees as reliable classifiers. In the case of the XGB algorithm, the trees are built in a stage-wise fashion. We start with a single tree trained on random subsets of data and features and then at every step, a new tree is added to the ensemble. This new tree learns from the mistakes committed by the previous trees.
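The stage-wise logic can be illustrated with a toy boosting loop over regression stumps (squared loss, a single feature); this is a didactic caricature only, as XGB additionally uses second-order loss information, regularization, and subsampling:

```python
import numpy as np

def stump_fit(x, residual):
    """Best single-split regression stump on one feature: pick the threshold
    minimizing the squared error against the current residuals."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, thr, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_trees=20, lr=0.5):
    """Stage-wise boosting: each stump is fitted to the residuals (mistakes)
    left by the ensemble built so far."""
    pred, trees = np.zeros_like(y, dtype=float), []
    for _ in range(n_trees):
        thr, lval, rval = stump_fit(x, y - pred)
        pred += lr * np.where(x <= thr, lval, rval)
        trees.append((thr, lval, rval))
    return trees, pred

# toy binary problem: the ensemble prediction converges towards the labels
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
trees, pred = boost(x, y)
```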

D. Diffusion characteristics for AnDi challenge
The main challenge of feature-based classification methods is the proper choice of features that can differentiate between samples belonging to different classes. The impact of their choice on the final performance of the classifier was already discussed in Ref. [52]. In this work, we continue the search for a robust set of features for the classification of different diffusion modes.
In this section, we briefly introduce features that have been used for the AnDi challenge.
Since we were not satisfied with the performance of the resulting classifier in Task 2 [59], after the end of the challenge we further searched for features that would improve the classification results. Those additional features will be described in Sec. II E. Both the original feature set for the challenge and its extension are listed in Table I.

Anomalous exponent
In our set of features, we included 4 estimates of the anomalous exponent α (see Eq. (3)), calculated with the following methods:
• the standard estimation, based on fitting the empirical time-averaged MSD (TAMSD) from Eq. (2) to the power law of Eq. (3),
• 3 estimation methods proposed for trajectories with noise [71], under the assumption that the noise is normally distributed with zero mean: a fit of the TAMSD restricted to small lags, with $n_{max}$ equal to 0.1 times the trajectory length rounded down to the nearest integer (but not less than 4); a simultaneous fit of the parameters $\hat{D}$, $\hat\alpha$ and $\hat\sigma$ in the noise-corrected model
$$ \overline{\delta^2}(n \Delta t) = 2 d D (n \Delta t)^{\alpha} + 2 d \sigma^2, $$
where $d$ denotes the dimension, $D$ the diffusion coefficient and $\sigma^2$ the variance of the noise; and a simultaneous fit of the parameters $\hat{D}$ and $\hat\alpha$ alone in an analogous equation.
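The standard estimator can be sketched as a log-log straight-line fit to the TAMSD (an illustration, with the 10% / at-least-4 lag cutoff described above):

```python
import numpy as np

def estimate_alpha(traj, dt=1.0):
    """Standard estimator of alpha: slope of a straight-line fit of
    log TAMSD versus log lag, over lags up to 10% of the trajectory
    length (but at least 4 lags)."""
    N = len(traj) - 1
    n_max = max(4, int(0.1 * N))
    lags = np.arange(1, n_max + 1)
    tamsd = [np.mean(np.sum((traj[n:] - traj[:-n]) ** 2, axis=1)) for n in lags]
    slope, _ = np.polyfit(np.log(lags * dt), np.log(tamsd), 1)
    return slope
```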

Diffusion coefficient
We used the diffusion coefficient extracted from the fit of the empirical TA-MSD to Eq. (3).

Asymmetry
The asymmetry of a trajectory can detect directed motion. According to Saxton [32], it can be derived from the gyration tensor, which describes the second moments of the positions of a particle. For a 2D random walk of $N$ steps, the tensor is given by
$$ T = \begin{pmatrix} \langle x^2 \rangle - \langle x \rangle^2 & \langle x y \rangle - \langle x \rangle \langle y \rangle \\ \langle x y \rangle - \langle x \rangle \langle y \rangle & \langle y^2 \rangle - \langle y \rangle^2 \end{pmatrix}, $$
where $\langle x \rangle = \frac{1}{N} \sum_{j=1}^{N} x_j$ is the average of the $x$ coordinates over all steps of the random walk (and analogously for the remaining entries). We define the asymmetry as [72]
$$ a = -\log\left( 1 - \frac{(\lambda_1 - \lambda_2)^2}{2 (\lambda_1 + \lambda_2)^2} \right), $$
where $\lambda_1$ and $\lambda_2$ are the principal radii of gyration, i.e. the eigenvalues of the tensor $T$.
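A compact sketch of this feature, using the covariance matrix of positions as the gyration tensor:

```python
import numpy as np

def asymmetry(traj):
    """Asymmetry from the eigenvalues l1, l2 of the 2D gyration tensor:
    a = -log(1 - (l1 - l2)^2 / (2 * (l1 + l2)^2))."""
    T = np.cov(traj[:, 0], traj[:, 1], bias=True)  # second central moments
    l1, l2 = np.linalg.eigvalsh(T)
    return -np.log(1.0 - (l1 - l2) ** 2 / (2.0 * (l1 + l2) ** 2))
```

For a perfectly straight trajectory the eigenvalue ratio gives $a = \log 2 \approx 0.69$, while an isotropic random walk yields values close to zero.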

Efficiency
Efficiency $E$ measures the linearity of a trajectory. It relates the net squared displacement of a particle to the sum of squared step lengths,
$$ E = \frac{\| X_N - X_0 \|^2}{N \sum_{i=1}^{N} \| X_i - X_{i-1} \|^2}. $$
It may help to detect directed motion (superdiffusion).

Empirical velocity autocorrelation function
The empirical velocity autocorrelation function [73] for lag 1 and point $n$ is defined as
$$ \chi_n = \frac{1}{N-n} \sum_{i=0}^{N-n-1} \left( X_{i+n+1} - X_{i+n} \right) \cdot \left( X_{i+1} - X_i \right). $$
It can be used to distinguish subdiffusion processes. In our model, we used $\chi_n$ for points $n = 1$ and $n = 2$.

Fractal dimension
The fractal dimension is a measure of the space-filling capacity of a pattern (a trajectory in our case). According to Katz and George [74], the fractal dimension of a planar curve may be calculated as
$$ D_f = \frac{\log N}{\log\left( N d L^{-1} \right)}, $$
where $L$ is the total length of the trajectory, $N$ is the number of steps, and $d$ is the largest distance between any two positions. It takes values around 1 for straight trajectories (i.e. directed motion), around 2 for random ones (normal diffusion), and around 3 for constrained trajectories (subdiffusion).
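A direct sketch of the Katz-George estimate in this form:

```python
import numpy as np

def fractal_dimension(traj):
    """Katz-George estimate D_f = log(N) / log(N * d / L): L is the total
    path length, d the largest pairwise distance, N the number of steps."""
    steps = np.sqrt(np.sum(np.diff(traj, axis=0) ** 2, axis=1))
    L, N = steps.sum(), len(steps)
    diffs = traj[:, None, :] - traj[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1)).max()
    return np.log(N) / np.log(N * d / L)
```

A straight path of unit steps gives $d = L$ and hence $D_f = 1$, consistent with the limits quoted above.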

Maximal excursion
The maximal excursion of the particle is given by the formula
$$ \mathrm{ME} = \frac{\max_{i} \| X_{i+1} - X_i \|}{\| X_N - X_0 \|}. $$
It should detect relatively long jumps (in comparison to the overall displacement).

Mean maximal excursion
According to Ref. [37], the mean maximal excursion is usually a better observable than the MSD to determine the anomalous exponent α. Given the largest distance traveled by a particle, the mean maximal excursion is defined as its standardized value, i.e.
$$ \mathrm{MME} = \frac{\max_{0 \le i \le N} \| X_i - X_0 \|}{\hat\sigma_N \sqrt{N}}. $$
Here, $\hat\sigma_N$ is a consistent estimator of the standard deviation of the displacements,
$$ \hat\sigma_N^2 = \frac{1}{2N} \sum_{i=1}^{N} \| X_i - X_{i-1} \|^2. $$

Mean gaussianity
Gaussianity $g(n)$ [75] checks the Gaussian statistics of the increments of a trajectory and is defined (in 2D) as
$$ g(n) = \frac{\langle r_n^4 \rangle}{2 \langle r_n^2 \rangle^2} - 1, $$
where $\langle r_n^k \rangle$ denotes the $k$th moment of the displacements of the trajectory at time lag $n$,
$$ \langle r_n^k \rangle = \frac{1}{N-n+1} \sum_{i=0}^{N-n} \| X_{i+n} - X_i \|^k. $$
Gaussianity for normal diffusion is equal to 0. The same result should be obtained for FBM, since its increments follow a Gaussian distribution. Other types of motion should show deviations from zero.
Instead of looking at gaussianities at single time lags, we include the mean over all lags as one of the features,
$$ \bar{g} = \frac{1}{N} \sum_{n} g(n). $$

Mean-squared displacement ratio

The MSD ratio gives information about the shape of the corresponding MSD curve. We define it as
$$ \mathrm{MSDR}(n_1, n_2) = \frac{\langle r_{n_1}^2 \rangle}{\langle r_{n_2}^2 \rangle} - \frac{n_1}{n_2}, $$
where $n_1 < n_2$. MSDR is zero for normal diffusion (α = 1), while MSDR ≥ 0 for sub- and MSDR ≤ 0 for superdiffusion. We simply took $n_2 = n_1 + \Delta t$ and calculated an averaged ratio for every trajectory.
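The mean gaussianity can be sketched as follows (an illustration, using the 2D identity $\langle r^4 \rangle = 2 \langle r^2 \rangle^2$ for Gaussian displacements):

```python
import numpy as np

def mean_gaussianity(traj, max_lag=10):
    """Mean over lags of the 2D gaussianity g(n) = <r_n^4>/(2 <r_n^2>^2) - 1,
    which vanishes for Gaussian displacements (Brownian motion, FBM)."""
    g = []
    for n in range(1, max_lag + 1):
        r2 = np.sum((traj[n:] - traj[:-n]) ** 2, axis=1)
        g.append(np.mean(r2 ** 2) / (2.0 * np.mean(r2) ** 2) - 1.0)
    return float(np.mean(g))
```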

Kurtosis
Kurtosis gives insight into the asymmetry and peakedness of the distribution of points within a trajectory [72]. To calculate it, the position vectors $X_i$ are projected onto the dominant eigenvector $r$ of the gyration tensor (10),
$$ x_i^p = X_i \cdot r. $$
Kurtosis is then defined as the standardized fourth moment of the projections,
$$ K = \frac{1}{N} \sum_{i=1}^{N} \frac{\left( x_i^p - \bar{x}^p \right)^4}{\sigma_{x^p}^4}, $$
with $\bar{x}^p$ being the mean projected position and $\sigma_{x^p}$ the standard deviation of $x^p$.

Statistics based on p-variation
The empirical p-variation is given by the formula [76,77]
$$ V_m^{(p)} = \sum_{k=0}^{\lfloor N/m \rfloor - 1} \left\| X_{(k+1)m} - X_{km} \right\|^p. $$
These statistics can be used to detect the fractional Lévy stable motion (including FBM).
We defined 6 features based on $V_m^{(p)}$:
• the powers $\gamma_p$ obtained by fitting $V_m^{(p)} \propto m^{\gamma_p}$ for lags 1 to 5 [76],
• the statistic $P$ used in Ref. [52], based on the monotonicity changes of $V_m^{(p)}$ as a function of $m$; it distinguishes whether $V_m^{(p)}$ is convex or concave for the highest $p$ for which it is not monotonous.

Straightness
Straightness $S$ measures the average direction change between subsequent steps. It relates the net displacement of a particle to the sum of all step lengths,
$$ S = \frac{\| X_N - X_0 \|}{\sum_{i=1}^{N} \| X_i - X_{i-1} \|}. $$

Trappedness
With trappedness we will refer to the probability that a diffusing particle is trapped in a bounded region with radius $r_0$. A comparison of analytical and Monte Carlo results for confined diffusion allowed Saxton [32] to estimate this probability as
$$ P = 1 - \exp\left( 0.2048 - 0.25117 \, \frac{D t}{r_0^2} \right). $$
Here, $r_0$ is approximated by half of the maximum distance between any two positions along a given trajectory, and $D$ is estimated by fitting the first two points of the MSD curve (i.e. it is the so-called short-time diffusion coefficient).
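A sketch of this feature for a 2D trajectory (assuming $\mathrm{MSD} = 4Dt$ in 2D for the short-time fit, and clipping negative probabilities to zero):

```python
import numpy as np

def trappedness(traj, dt=1.0):
    """Saxton's estimate P = 1 - exp(0.2048 - 0.25117 * D * t / r0^2): r0 is
    half the largest pairwise distance, D the short-time diffusion
    coefficient from the first two TAMSD points (MSD = 4*D*t in 2D)."""
    diffs = traj[:, None, :] - traj[None, :, :]
    r0 = 0.5 * np.sqrt((diffs ** 2).sum(axis=-1)).max()
    msd = [np.mean(np.sum((traj[n:] - traj[:-n]) ** 2, axis=1)) for n in (1, 2)]
    D = (msd[1] - msd[0]) / (4.0 * dt)  # slope of the first two MSD points
    t = (len(traj) - 1) * dt
    return max(1.0 - np.exp(0.2048 - 0.25117 * D * t / r0 ** 2), 0.0)
```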

E. Additional diffusion characteristics
In order to improve the performance of the original classifier, we searched for further feature candidates after the AnDi challenge. The additional features extending the basic set are described below.

D'Agostino-Pearson test statistic
The D'Agostino-Pearson $\kappa^2$ test statistic [78,79] is a goodness-of-fit measure that aims to establish whether or not a given sample comes from a normal distribution. It is defined as
$$ \kappa^2 = Z_1^2(g_1) + Z_2^2(g_2), $$
where $Z_1(g_1)$ and $Z_2(g_2)$ are transformations of the sample skewness $g_1$ and the sample kurtosis $g_2$ (the latter given by Eq. (24)) designed to bring them close to normality. Their definitions may be found elsewhere [78,79]. This feature should help to distinguish ATTM and SBM from other motions.

Kolmogorov-Smirnov statistic against χ 2 distribution
The Kolmogorov-Smirnov statistic quantifies the distance between the empirical distribution function $F_n(X)$ of the sample and the cumulative distribution function $G(X)$ of a reference distribution,
$$ D_n = \sup_X \left| F_n(X) - G(X) \right|. $$
Here, $n$ is the number of observations (i.e. the length of a trajectory). The value of this statistic for the empirical distribution of squared increments of a trajectory against the $\chi^2$ distribution has been taken as the next feature. The rationale of this choice is that for a Gaussian trajectory the theoretical distribution of squared increments is the mentioned $\chi^2$ distribution.
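A sketch of this feature for 2D trajectories, exploiting the closed-form $\chi^2$ CDF with 2 degrees of freedom, $G(x) = 1 - e^{-x/2}$ (standardizing increments by their empirical standard deviation is our assumption):

```python
import numpy as np

def ks_chi2_feature(traj):
    """KS distance between the empirical CDF of squared (standardized)
    increments and the chi-square CDF with 2 dof, G(x) = 1 - exp(-x/2);
    small values indicate Gaussian increments."""
    inc = np.diff(traj, axis=0)
    r2 = np.sort(np.sum((inc / inc.std(axis=0)) ** 2, axis=1))
    n = len(r2)
    ref = 1.0 - np.exp(-r2 / 2.0)
    upper = np.abs(np.arange(1, n + 1) / n - ref).max()
    lower = np.abs(np.arange(0, n) / n - ref).max()
    return max(upper, lower)
```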

Noah, Moses, and Joseph exponents
For processes with stationary increments, there are in principle two mechanisms that violate the Gaussian central limit theorem and produce anomalous scaling of the MSD: long-time increment correlations or a fat-tailed increment distribution (in the latter case the second moment is divergent) [80]. These mechanisms are referred to as the Joseph and Noah effects, respectively. FBM is the prototypical process that exhibits the Joseph effect.
LW, on the other hand, is an example of a process with the Noah effect. An anomalous scaling can also be induced by a non-stationary increment distribution [80]. In this case, we deal with the Moses effect. The corresponding exponent should help to handle ATTM and SBM trajectories.
All three effects may be quantified by exponents, which will be used as features in our extended set of attributes. Given a stochastic process $X_t$ and the corresponding increment process $\delta_t(\tau) = X_{t+\tau} - X_t$, the Joseph exponent $J$ is estimated from the ensemble average of the rescaled range statistic,
$$ E\left[ \frac{R_t}{S_t} \right] \propto t^{J}, $$
where
$$ R_t = \max_{0 \le s \le t} \left( X_s - \frac{s}{t} X_t \right) - \min_{0 \le s \le t} \left( X_s - \frac{s}{t} X_t \right) $$
and $S_t$ is the standard deviation of the process $X_t$. The Moses exponent $M$ and the Noah exponent $L$ are estimated analogously from the scaling of partial sums of the absolute and squared increments, respectively [80].

Detrending moving average
The detrending moving average (DMA) statistic [81,82] is given by
$$ \mathrm{DMA}(\tau) = \frac{1}{N - \tau} \sum_{i=\tau}^{N} \left( X_i - \bar{X}_i^{\tau} \right)^2 $$
for $\tau = 1, 2, \ldots$, where $\bar{X}_i^{\tau} = \frac{1}{\tau+1} \sum_{j=0}^{\tau} X_{i-j}$ is a moving average of $\tau + 1$ observations. As mentioned in Ref. [82], DMA-based statistical tests can help in the detection of scaled Brownian motion. In our model, we used two values of DMA for each trajectory as input features, namely $\mathrm{DMA}(1)$ and $\mathrm{DMA}(2)$.
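A one-dimensional sketch under our reading of the DMA definition (the mean squared deviation of the series from its trailing moving average):

```python
import numpy as np

def dma(x, tau):
    """DMA statistic: mean squared deviation of the series from its
    trailing moving average over tau + 1 points."""
    ma = np.convolve(x, np.ones(tau + 1) / (tau + 1), mode="valid")
    return np.mean((x[tau:] - ma) ** 2)
```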

Average moving window characteristics
As the moving average methods have already been successfully applied to many problems (see for example Ref. [83] for change point detection), we decided to accommodate them in our feature set as well. We added eight features based on the formula
$$ MW = \frac{1}{N - m} \sum_{i} \operatorname{sgn}\left( X_{i+1}^{(m)} - X_i^{(m)} \right), $$
where $X_i^{(m)}$ denotes a statistic of the process calculated within a window of length $m$ starting at point $i$, and sgn is the signum function. In particular, we used the mean and the standard deviation for $X$ and calculated $MW$ with windows of lengths $m = 10$ and $m = 20$, separately for the $x$ and $y$ coordinates.

Maximum standard deviation
The idea of the moving window helped us to introduce another two features based on the standard deviation $\sigma_i$ of the process calculated within the $i$-th window of length $m$. They are given by
$$ \frac{\max_i \sigma_i}{\sigma} \qquad \text{and} \qquad \frac{\max_i |\sigma_{i+1} - \sigma_i|}{\sigma}, $$
where $\sigma$ denotes the overall standard deviation of the sample. We used the window of length $m = 3$ and calculated the features for both coordinates separately. They should improve the detection of the ATTM type of movements.

III. DATA
Extreme gradient boosting, i.e. the algorithm we decided to use, is an example of a supervised model. In other words, it requires a training dataset consisting of trajectories and the corresponding labels indicating their diffusion type (i.e. the diffusion model).
We used a synthetic set of trajectories produced with the andi-dataset package [84]. The features described in Sec. II were calculated for each of them in the preprocessing phase. The resulting set, consisting of vectors of attributes and their labels, was used as input for the XGB model. The parameters used to generate the trajectories are shown in Table II. Their ranges have been constrained by the organizers of the challenge. The range of diffusion exponents prevents stationary paths. All trajectories were standardized so that the distribution of displacements over the unit time is characterized by a standard deviation equal to 1.
To better simulate experimental endeavors, each trajectory was contaminated with a finite localization error. For that purpose, a random number from a normal distribution N (0, σ noise ) was added to each trajectory coordinate.

IV. RESULTS
We used the implementation of the XGB from the xgboost module [70]. The values of the hyperparameters were optimized for the original set of features. However, the same values of hyperparameters were also used with the extended set for comparison purposes.

A. Performance of the classifiers
The performance of the classifiers in Task 2 of the AnDi challenge was evaluated with the $F_1$ score,
$$ F_1 = \frac{2\,TP}{2\,TP + FP + FN}, $$
where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, respectively. Since we are dealing here with a multilabel classification problem, a micro-averaged $F_1$ score was calculated to assess the aggregated contributions of all classes (i.e. the sums of $TP$, $FP$ and $FN$ over all classes were first determined and then inserted into the above formula). For balanced sets (the same number of samples in each class), like those of the AnDi challenge, the micro-averaged $F_1$ score coincides with the accuracy of the classifier.
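The micro-averaging step can be sketched as follows (an illustration of the metric, not the challenge's evaluation code):

```python
import numpy as np

def micro_f1(y_true, y_pred, n_classes=5):
    """Micro-averaged F1: sum TP, FP, FN over all classes before applying
    F1 = 2*TP / (2*TP + FP + FN); for single-label problems this coincides
    with plain accuracy."""
    tp = fp = fn = 0
    for c in range(n_classes):
        tp += np.sum((y_pred == c) & (y_true == c))
        fp += np.sum((y_pred == c) & (y_true != c))
        fn += np.sum((y_pred != c) & (y_true == c))
    return 2 * tp / (2 * tp + fp + fn)
```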
Results for the performance of the classifiers are shown in Table IV. Inspection of the confusion matrices of the classifiers may give us additional insight into the partial contributions to the overall performance. To recall, an element $c_{ij}$ of the confusion matrix indicates how many observations known to belong to class $i$ (true label) are predicted to be in class $j$ (predicted label) [68]. Normalized confusion matrices for both models are shown in Fig. 2. The best performance of the base model is observed for Lévy walks.
Among all LW samples in the test data, 92% of them were correctly recognized. Most of the misclassified LW trajectories were assigned to FBM motion (5%) and SBM (2%). The accuracy of the classifier for CTRW (85%) and SBM (81%) is smaller, but still quite good.
The incorrectly classified CTRW trajectories have been usually confused with ATTM. In the case of SBM, many trajectories have been assigned to ATTM (11%) and FBM (6%) classes.
We had not expected such a moderate performance of the original classifier for FBM (72%). In our earlier attempts [49,51,52], we usually got much higher performance for this type of motion. However, it should be stressed that those classifiers were trained on different datasets.
Adding the features from Sec. II E to the input vectors has significantly improved the overall accuracy of the classifier. As can be seen in the right plot of Fig. 2, all of the partial contributions to the accuracy improved as well. Although the performance for ATTM remains weak (only 61%), it is much higher than in the previous case. As for the other classes, the modified classifier achieves very good accuracies for LW (98%) and CTRW (94%) and good ones for SBM (83%) and FBM (82%). The biggest challenge is still to distinguish ATTM and FBM from SBM trajectories.

B. Impact of trajectory lengths
In Fig. 3, the $F_1$ scores for different ranges of trajectory lengths in the case of the extended model are shown. Not surprisingly, the classifier struggles a lot with very short trajectories. In this regime, many of the paths are indistinguishable from normal diffusion. Moreover, in the case of the ATTM and SBM models, the information contained in short trajectories may not be enough to detect the time-varying diffusion coefficient $D$. Interestingly, the accuracy of the FBM classification increases with the sample length, reaching around 90% for trajectories longer than 500 steps.
It is worth mentioning that the difficulties with very short trajectories are not specific to extreme gradient boosting in particular or to feature-based methods in general. All methods presented in the challenge were afflicted by similar problems in this regime [59].

C. Impact of the level of noise
In Fig. 4, the $F_1$ scores for different levels of noise (i.e. different values of its standard deviation) are presented in the case of our extended model. While the classifier performs quite well for CTRW and LW even at high noise levels, its accuracy significantly decreases for the remaining diffusion models.

D. Importance of features

One of the popular feature assessment techniques is the permutation feature importance [68]. It is defined as the decrease in a classifier's score when the values of a single feature are randomly shuffled. Since this procedure breaks the relationship between the feature and the target, the drop in the score indicates how much the classifier depends on that feature. As permutation importance can be misleading in the presence of correlated features [86], and this is the case in our model (e.g. we have several features based on MSD), we decided to check the gain importance as well [87]. In this approach, the importance of a feature represents its relative contribution to the model; with this measure, the anomalous exponent α and the MSD ratio get the highest scores. But again, many of the most important features were added in the extended version of the model. Thus it seems we made some reasonable choices to improve the classification accuracy.
Last but not least, to further elaborate on that issue, we will employ another well-known technique, the SHAP values, based on feature attribution, i.e. the impact of having a certain value of a given feature in comparison to the prediction we would make if that feature took some baseline value [88,89]. Compared to the previous methods, SHAP values may be used to explain individual predictions. Moreover, they offer a more consistent approach to assessing the overall contribution of features to the final outcome of the model.
In Table VII, SHAP values for the most important features in different diffusion models are shown, in the case of the extended model. It should be noted that for every single diffusion model, the additional features turned out to be the most important ones, in agreement with the previous methods. In other words, the additional features prove to be an excellent choice. This also shows how critical the choice of features can be for a classifier (see Ref. [52] for further details).
The Moses exponent M , which relates to non-stationary increments, turned out to be the essential attribute for distinguishing ATTM and SBM from the rest of models; it seems to be quite important for FBM as well. However, in the latter case, the anomalous exponent is the most meaningful feature, followed by the D'Agostino-Pearson test statistic κ 2 .
Characteristics based on the average moving window (M W ) have the highest importance for CTRW. And finally, the maximum standard deviation feature is the top one for LW.
In other words, it would be hardly possible to build a statistical test that distinguishes these three models (ATTM, FBM, and SBM) and operates with a single feature (among the ones present in our set).

V. CONCLUSION
In this paper, we have introduced two sets of attributes that may be used for the classification of SPT data with the help of feature-based machine learning methods. The first of these sets was proposed as our contribution to Task 2 of the AnDi challenge [58,59]. The second one is the result of our attempt to improve the performance of the classifier presented in the challenge.
The extreme gradient boosting was used as the classification model. It is an ensemble method, which combines multiple decision trees to obtain better predictive performance.
Although the deep-learning approach constitutes the state-of-the-art technology for data classification in many domains, we deliberately decided to pick one of the traditional machine learning algorithms due to its superior interpretability.
Our original method turned out to be inferior to the top AnDi challenge teams, but still offered a reasonable performance of 73%. Moreover, further elaboration on the feature set after the challenge allowed us to significantly improve its accuracy by adding some new characteristics and to achieve performance similar to the winning teams (83% in 2D). Although the final accuracy is already quite good, there is still room for improvement, in particular in the case of short trajectories. Thus, more research into meaningful characteristics is encouraged.
Our results show how crucial the choice of features is for good classification performance.
Moreover, we were able to identify the most important features for all diffusion types. This not only contributes to our overall understanding of the decisions made by the classifier. The analysis of feature importance may also be helpful in the selection of meaningful features that can then be used to build simpler classifiers trained on subsets of the attributes.