Interpretable deep learning as a means for decrypting disease signature in multiple sclerosis

F Cruciani; L Brusini; M Zucchelli; G Retuci Pinheiro; F Setti; I Boscolo Galazzo; R Deriche; L Rittner; M Calabrese; G Menegaz

doi:10.1088/1741-2552/ac0f4b

1. Introduction

The mechanisms driving multiple sclerosis (MS) are still largely unknown, calling for new methods allowing to detect and characterize tissue degeneration since the early stages of the disease. MS affects the brain and the spinal cord, resulting in physical and cognitive disability due to the damage of the myelin sheath wrapping white matter (WM) axons as well as neurodegeneration and axonal loss [1]. Four principal clinical phenotypes of MS have been described, among which relapsing-remitting multiple sclerosis (RRMS) and primary progressive multiple sclerosis (PPMS) are the most common [2, 3]. While demyelination and atrophy characterize both forms, their patterns and distribution vary across the brain, suggesting that different driving mechanisms might underpin these two main clinical manifestations [4]. Therefore, there is the growing clinical need to find specific fingerprints to distinguish between them in order to enable precision medicine, that is patient-specific treatments with clear clinical impact on treatment decision making [3].

Diffusion MRI (dMRI) is increasingly exploited for assessing microstructural alterations occurring in MS [5, 6]. This technique allows to define numerical indices that describe the brain tissue microstructure based on the measurements of signal decay along a predefined set of directions, providing an in-vivo indirect measure of the geometry of the diffusion pores [7, 8]. In particular, novel acquisitions based on multi-shell schemes have opened the way to the definition of a wider set of indices capturing microstructure degeneration and informing on the underlying disease process [5].

Diffusion signal models are generally tailored on WM [9] and are well suitable for modeling WM damage and structural connectivity alterations due to the disease. However, their exploitability for deriving neuroanatomically plausible microstructural descriptors from grey matter (GM) is far from trivial and has still to be proved. In recent years, several studies have attempted the characterization of GM modulations through dMRI acquisitions in different pathologies such as Alzheimer's disease (AD) [10] and migraine [11]. In MS both classical and advanced diffusion models were employed to investigate the disease patterns in different phenotypes [12, 13] and to longitudinally monitor patients over time [14]. Their findings strengthened the hypothesis of a GM modulation in MS and highlighted the dMRI sensitivity in detecting those changes. For the specific task of disambiguating PPMS from RRMS subjects, microstructural indices derived from the 3D simple harmonic oscillator-based reconstruction and estimation (3D-SHORE) model [7, 8] were used to demonstrate that the probability density function of the return to the plane probability (RTPP) was significantly different between the two groups in Hippocampus relying on histogram features [15].

In the context of patients classification from neuroimaging data, convolutional neural networks (CNNs) have recently gained popularity thanks to their ability in solving complex classification tasks, though in general requiring large amounts of data for training due to the high number of parameters that need to be calculated. Besides the availability of big data, one of the main bottlenecks for the use of CNNs for clinical purposes is that they are notoriously hard to interpret in retrospect. For this reason, deep learning (DL) methods, including CNNs, are often criticised to be non-transparent and still considered as 'black boxes'. Therefore, the availability of a means allowing to interpret the network decisions becomes the key element for their exploitability.

In the last years, a number of solutions have been proposed for visualizing what is actually learned by a CNN. Besides straightforward methods such as the extraction of activations and weights of the different layers, among the most widespread methods are: (i) sensitivity analysis or backpropagation (BP) [16], in which the relevance score is calculated as the the gradient of the output probability given the input, computed through the BP algorithm; (ii) guided backpropagation (GBP) [17], which modifies BP by setting to zero the negative gradients; (iii) deconvolution and occlusion [18], where recursively a part of the input image is covered by a black patch and the network output recalculated in order to assess the changes in the classification probability under the assumption that the covered region was relevant for the classification; and (iv) layer-wise relevance propagation (LRP) [19], which allows to detect and visualize in a relevance heatmap the voxels of the input data that mostly contributed to the classification decision. To this end, the LRP algorithm uses the network weights and the neural activations resulting from the forward pass to propagate the output back through the network up until the input layer, in a backward pass.

Among these, we consider the LRP to be the most promising tool for two main reasons. First, it provides an individual heatmap for each subject lying in the same space as the input image, indicating the weight of each voxel for the final (individual) classification decision. Second, LRP heatmaps have proven to be more eloquent than those provided by GBP [17] in that they reflect image-specific relevance, whereas GBP, relying on gradients, tends to emphasize the areas that are more susceptible to changes that might not coincide with the areas on which the CNN bases the decision [20].

In [21], LRP was employed on CNNs for classifying between MS subjects and healthy controls based on structural MRI data while [22] compared multiple visualizations methods applied to the same task relying on susceptibility-weighted imaging (SWI). The specific problem of MS patients stratification was addressed from a DL point of view in [23] by combining CNNs and graph metrics derived from structural connectivity. Finally, Zhang et al [24] and Wang et al [25] used a 2D-CNN model on structural MRI data for classifying physiological versus pathological subjects without stratification.

To the best of our knowledge, no attempts have still been made for exploiting LRP in the PPMS versus RRMS patients stratification task. Therefore, the objective of this work is twofold: disambiguating the considered groups of MS patients relying on advanced dMRI models and DL techniques focusing on GM, and decrypting CNNs decisions through the adoption of LRP.

For the first aim, considering the increased interest in assessing the role of GM in the MS disease fingerprinting, we relied on three dMRI signal models to derive microstructural indices with the specific goal of assessing their sensitivity to the microstructural contract between the PPMS and RRMS phenotypes. Being aware of the potentials and limitations of the considered models, we specifically aimed at capturing possible feature modulations in GM leaving the biophysical interpretation of the results to further investigations relying on multimodal acquisitions.

For the second objective we build on the claim in [20], that LRP has the potential to answer the question 'What speaks for AD in this particular patient?' providing guidance to understanding the mechanisms ruling the disease. In our work, such a question can be reformulated as 'What speaks for PPMS in this particular patient?', which is the core question that was addressed. This is a key issue to be solved and a very challenging problem because of the subtle intra-pathology tissue modulations differentiating the two stages of the disease. In this respect, the goal of this work was to investigate whether CNNs were able to blindly capture such subtle differences while providing insightful information about the underlying mechanisms through the observation of LRP maps. Restraining to these two categories represents the worst case from the classifier perspective, because of the subtle microstructural differences across the two phenotypes. However, we consider this as an important task because it is close to clinical practice conditions where a matched cohort of control subjects could not be available. In particular, as LRP allows to map the value of the network decision function onto the input voxels shedding light on the reasons behind the classification decisions, it can potentially provide hints for the interpretation of the mechanisms at the basis of the MS disease course besides the primary classification task, opening new perspectives for diagnosis, prognosis and treatment.

2. Materials and methods

An overview of the complete process proposed in this work is presented in figure 1.

2.1. Dataset

The population consisted of 91 subjects, including 46 RRMS (35 females, 42.5 ± 10.4 years old) and 45 PPMS (25 females, 47.2 ± 9.5 years old) patients. Expanded Disability Status Scale (EDSS) score was 2.8 ± 1.4 and 4.8 ± 1.3 respectively for the two groups. Significant group differences in age, EDSS score ( $\textit{p} \lt 0.05$ , obtained via a t-test) and gender numerosity ( $\textit{p} \lt 0.05$ , obtained via a χ-squared test) were recorded, this last reflecting the epidemiology of the disease.

MRI acquisitions were performed on a 3T Philips Achieva scanner (Philips Medical Systems, Best, The Netherlands) equipped with an eight-channel head coil. The following sequences were used for all patients: (1) two-shell dMRI (repetition time (TR)/echo time (TE) = 9300/109 ms, flip angle = 90^∘, field of view (FOV) = 112 × 112 mm², 2 mm isotropic resolution, 62 slices, b-values = 700/2000 s mm⁻² with 32/64 gradient directions respectively and 7 b0 volumes); (2) 3D T1-w Fast Field Echo (TR/TE = 8.1/3 ms, FA = 8^∘, FOV = 240 × 240 mm², 1 mm isotropic resolution, 180 slices); (3) 3D fluid-attenuated inversion recovery (FLAIR) image (TR/TE = 8000/290 ms, inversion time (TI) = 2356 ms, flip angle = 90^∘, FOV = 256 × 256 mm², 0.9 × 0.9 × 0.5 mm³ resolution, 180 slices). The study was approved by the local ethics committee, and informed consent was obtained from all patients. All procedures were performed in accordance with the Declaration of Helsinki (2008).

2.2. Data preprocessing

dMRI data denoising, Gibbs ringing removal, motion and eddy currents distortion correction were performed using the DIFFPREP module of Tortoise software (https://tortoise.nibib.nih.gov/tortoise). These steps led to preprocessed dMRI data with a size of 90 × 125 × 125 voxels. The Brain Extraction Tool in FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/[26]) was used for skull stripping and for deriving a binary mask for each subject. In addition, the individual b0-weighted image was averaged across all the b0 volumes and the resulting image was spatially normalized to the MNI152 standard space using first the FSL epi_reg tool, to register the b0 image to the respective T1-w one, and then ANTs (http://stnava.github.io/ANTs/), to normalize the T1-w to the standard MNI152 space.

For each subject, the FLAIR image was rigidly registered to the T1-w using the FSL flirt tool. The lesion prediction algorithm [27] available in the lesion segmentation toolbox for SPM12 (www.statistical-modelling.de/lst.html) was used to automatically segment and fill the WM lesions in the native T1-w image. Each filled T1-w image was then imported in the FreeSurfer software (http://surfer.nmr.mgh.harvard.edu/, Harvard University, Boston, MA, USA) to perform a complete brain parcellation with 112 anatomical regions of interest (ROIs). Filled images were also skull stripped (FSL bet tool) and projected back to the dMRI native space by inverting the previously estimated transformation matrix. The segmentation of the registered T1-w images into GM, WM and cerebrospinal fluid was finally performed (FSL fast tool) and a binary mask representing the GM tissue probability thresholded at 95% was derived for each subject. This was applied to all the indices maps and to the T1-w, as only GM tissue was considered throughout the subsequent analyses.

2.3. Diffusion signal modeling

Microstructural indices were derived from three analytical signal models: DTI [28] (FA and MD); 3D-SHORE [7] (propagator anisotropy (PA), return to the axis probability (RTAP), return to the origin probability (RTOP) and RTPP); and the recently proposed RIF [29]. DTI and 3D-SHORE indices were calculated using DIPY [30] while an in-house software was used for RIF.

As opposed to DTI and 3D-SHORE indices, which are well known in the current literature [31–33], RIF have been only recently introduced. Basically, RIF are calculated on the Laplace-series expansion of a given spherical function and are high order rotation invariants related to the spherical mean, power-spectrum and bispectrum invariants if calculated on the diffusion signal. Moreover, they can be linked to statistical and geometrical measures of spherical functions, including the mean, the variance and the volume of the function.

In this work, we used 4th order SH to fit the diffusion signal, but only the first two RIF, I₀ and I₂₂, were considered. Indeed, due to the characteristics of the diffusion process in GM, the corresponding spherical signal is almost flat, such that all the high order invariants vanish. The only two non negligible RIF are I₀ and I₂₂ and are given by (1) and (2), respectively. In this work we will refer to them as RIF1 and RIF2 for convenience of notations:

$\begin{align} RIF1(f) = c_{00}\sqrt{4\pi} \end{align} \tag{ 1 }$

$\begin{align} RIF2(f) = \sum_{m = -l}^l|c_{lm}|^2 \end{align} \tag{ 2 }$

where, in this work, l = 2. If the RIF are calculated on the diffusion signal, as it was the case here, RIF1 corresponds to the mean of the diffusion signal across one shell, while RIF2 is related to the variance [29]. Since RIF were calculated separately on each shell, two maps were obtained for each RIF in our two-shells dMRI scheme. Overall, eleven features resulting from the DTI, 3D-SHORE, RIF and T1-w models were handcrafted and used for each patient.

2.4. Network architecture

A 3D-CNN VGG (Visual Geometry Group) net architecture [34] was used. This architecture has been well assessed in combination with MRI data in few recent studies [20, 35, 36], and it has been shown to achieve comparable performance with respect to the ResNet model [37] in distinguishing AD patients from controls [35]. In addition, the VGG model allows for a straightforward application of visualization techniques and is thus particularly suitable for this work aiming at interpretability. More details are provided in section 2.5.

The network structure consists of four volumetric convolutional blocks for feature extraction, two fully connected layers with batch normalization, and one output layer with softmax nonlinearity. Each convolutional block consists of a convolutional layer followed by a ReLU, batch normalization, and 3D pooling. A graphical representation of the 3D-CNN structure highlighting the main parameters for each layer is provided in figure 2. Of note, throughout the manuscript, we will refer to the 11 derived CNNs models as FA-CNN, MD-CNN, PA-CNN, RTAP-CNN, RTOP-CNN, RTPP-CNN, RIF1-CNN, RIF2-CNN and T1-CNN, respectively, based on the target feature.

**Figure 2.** 3D-CNN architecture with single channel diffusion MRI index input.
Download figure:
Standard image High-resolution image

2.4.1. Training, validation and testing

The feature maps resulting from the DTI, 3D-SHORE, RIF models and T1-w, separately masked to retain GM voxels only, were split in subsets to be used for training/validation (78% of the total, 71 subjects) and testing (22% of the total, 20 subjects). For the RIF, the two different maps for each index (one per shell) were considered together as separate channels of input data. Therefore, the whole input of the network was a four dimensional tensor of size 1 × 90 × 125 × 125 for DTI and 3D-SHORE indices, 2 × 90 × 125 × 125 for the RIF and 1 × 180 × 240 × 240 for T1-w, respectively. The optimal weights were learned during training by minimizing the cross-entropy loss by means of the Adam optimizer [38]. Training and validation were performed in all cases on batches of size four.

Data augmentation was performed on the training and validation sets in order to improve the generalization capabilities of the models. To this end, the following operations were applied: addition of random Gaussian noise (µ = 0, σ = 0.1); random affine rotation from −5 to +5 degrees around the z axis and from −3 to +3 degrees around the x axis; random volume translation from −3 to +3 voxels along each of the three axes and flipping across the x axis. In addition, clipping of the values to the 99th percentile was performed. It is worth mentioning that rotation over the y axis was not included as it would have changed the data orientation. The x axis corresponds to right-left, y axis to posterior-anterior, and z axis to inferior-superior direction in radiological conventions, respectively. Multiple tests were performed to fit the hyperparameter values over the training/validation phases. Their values were varied across the respective feasible and empirical range and the ones leading to the best accuracy and lower loss were retained. Validation was performed following a five-fold cross validation strategy (five-fold CV). The 71 subjects used for training/validation were randomly split in five groups, resulting in folds of 14 subjects each (except one consisting of 15 subjects). The experiments were repeated five times and, for each run, four folds were used for training and the remaining one for validation. The best model, for each fold, was chosen as the one corresponding to the lowest loss and highest accuracy values obtained over the validation sets. The remaining 20 subjects were kept unseen and used for testing using the best model resulting from each fold of the five-fold CV procedure. No data augmentation was performed on the test set.

The whole DL analysis was carried out using Pytorch. The computation was performed on a laptop (Ubuntu 18.6, Nvidia Geforce GTX 1050, Intel Core i7, 16 GB RAM). Torchsample wrapper was used as high-level interface.

2.4.2. Performance assessment

Performance was assessed following the objective of avoiding the misclassification of PPMS patients. Accordingly, we called true positives (TP) and true negatives (TN) the number of correctly classified PPMS and RRMS subjects, respectively.

2.4.2.1. Performance metrics

The following measures were calculated to assess the performance of each CNN model:

$\begin{align} Accuracy & = \frac{TP+TN}{TP+TN+FP+FN} \end{align} \tag{ 3 }$

$\begin{align} Sensitivity & = \frac{TP}{TP+FN} \end{align} \tag{ 4 }$

$\begin{align} Specificity & = \frac{TN}{TN+FP} \end{align} \tag{ 5 }$

$\begin{align} Precision_{PPMS} & = \frac{TP}{TP+FP} \end{align} \tag{ 6 }$

$\begin{align} Precision_{RRMS} & = \frac{TN}{TN+FN}.\end{align} \tag{ 7 }$

Results were reported for the testing set in terms of mean and standard deviation of the classification measures over the five best models resulting from the five-fold CV.

2.4.2.2. Controlling for confounding variables

In the biomedical field, great importance is attributed to the role of confounds, that could bias the results hiding, contrasting or annihilating other factors that could hold important clinical information. In general, when linear regression is used, deconfounding is applied before modeling, regressing out the confounds directly from the data. However, this is not a common practice when deep networks are used, relying on their ability of capturing all the discriminating features. Though, this does not provide any guarantee with respect to possible biases in the outcomes neither on the prevalence (or not) of confounding variables in shaping the results. In this work the issue was faced by following the post-hoc method described in [39] leaving the input data unchanged. In particular, logistic classification models were used to assess the role of age, sex and EDSS in the differentiation between PPMS and RRMS phenotypes. To this end, two logistic models were contrasted for each index through the likelihood ratio (LR) test, that is predicting the outcomes using either the confounding variables or the confounding variables and the CNNs model predictions. The statistical significance of the LR, assessed through a χ-squared test, would reveal that the role of the confounds in shaping the classification outcomes is not prevalent.

2.5. Layer-wise relevance propagation

LRP visualization was employed to determine which features and voxels in the input volume contributed most to the classification output. This technique relies on a backward pass ruled by the conservative relevance redistribution procedure, proceeding backwards from the CNN output values (i.e. the classification probabilities) to the input layer. In this approach, each neuron of a layer receives a relevance score from the next layer and redistributes it to its predecessors in equal amount until the input layer is reached. In this way, neurons that contribute the most to the deepest layer receive more relevance.

The core rule of LRP visualization is the relevance conservation per layer as illustrated in figure 3. Briefly, let i and j be the indices for neurons at two successive layers r and r + 1. Let $R_j^{r+1}$ be the relevance of neuron j for the prediction f(x) where x is the neural network input. The relevance $R_j^{r+1}$ is redistributed to the connected neurons in layer r such that the relevance conservation holds:

$\begin{align} \sum_{i} R_{i \leftarrow j}^{r} = R_j^{r+1}.\end{align} \tag{ 8 }$

**Figure 3.** Overview of LRP procedure. The example shows the workflow to obtain PPMS LRP heatmaps.
Download figure:
Standard image High-resolution image

By iterating equation (8) through all the layers, it is possible to decompose the relevance score of the prediction f(x), R_f in terms of the input variables of the first layer. This allows to easily visualize the relevance values as heatmaps.

Different rules have been applied in literature for redistributing the relevance [40]. In this work we used the β-rule as in [19, 41]:

$\begin{align} R_{i \leftarrow j}^{r,r+1} = \left( (1+\beta) \frac{w_{ij}^+}{w_j^+} - \beta \frac{w_{ij}^-}{w_j^-} \right) R_j^{r+1}.\end{align} \tag{ 9 }$

In this equation, $w_{i,j}^{+/-}$ is the amount of positive/negative contribution that node j transfers to node i, divided by the sum over all positive/negative contributions of the nodes in layer r. In fact $w_{j}^{+/-} = \sum_{i} w_{i,j}^{+/-}$ , such that the relevance is conserved from layer r + 1 to layer r. This approach was adopted following [20], where multiple β values were tested to assess their impact on the resulting heatmaps. The authors proved the LRP robustness relatively to the corresponding β-value. The β-rule decomposes the relevance score in positive and negative contributions, and weights the relative importance according to the β parameter. Setting β = 0 adds only positive contributions to the relevance score. Negative contributions hold an inhibitory effect highlighting the voxels that are antagonist to those having strong positive impact on the classification function. Since in this work we aimed at detecting those voxels playing in favor of the correct classification of PPMS patients, we constrained β to be zero.

In a multi-class classification task, f(x) consists of multiple values indicating the probability for the input x to belong to each of the classes c_i, e.g. $f(x) = \{f_{C_1(x)}, f_{C_2(x)}, \dots, f_{C_N(x)} \}$ where N is the total number of classes. In order to calculate the LRP map, the target class must be specified. Let n be the class index, then C_n−LRP(x) is obtained by backpropagating $f_{C_n(x)}$ through the network.

Following this notation, in this work the prediction f(x) is defined as $f(x) = \{f_{C_{PPMS}(x)}, f_{C_{RRMS}(x)} \}$ . Accordingly, two target class-driven LRP heatmaps were derived for each subject, providing complementary information about the significance of each voxel in the classification process.

It might be useful to point out here that the network behavior is not symmetric across the two classes. To get the flavor of this let us consider a toy example. Let PPMS_i be a Primary Progressive patient and let us assume that the patient is correctly classified by the network. Then, let f(PPMS_i) be the corresponding value of the decision function. To get the corresponding LRP map (that we call PPMS_i-LRP map), such value is backprojected through the network. Though, this same patient is a TN for the other class (e.g. the RRMS one) and it will contribute with the value 1 − f(PPMS_i) to the RRMS-LRP map. Then, since the TPs of one class coincide with the TNs of the other and contribute with backprojected values that sum to one, in the computation of the respective LRP maps, that is PPMS-LPR and RRMS-LRP, these two will in general be different.

2.5.1. LRP heatmaps analysis

For each CNN model, PPMS-LRP and RRMS-LRP heatmaps were generated for each subject of the test set, based on the best model among the five derived from the five-fold CV. This led to 22 LRP maps per subject, representing the performance of the six diffusion indices, RIF1 (RIF1₇₀₀ and RIF1₂₀₀₀), RIF2 (RIF2₇₀₀ and RIF2₂₀₀₀) and T1-w (thus eleven maps per target class).

Both qualitative and quantitative analyses were performed relying on a ROI-based approach. To this end, 15 brain ROIs, which were previously demonstrated to be highly relevant in the MS disease [42, 43], were selected. The chosen ROIs were: Thalamus, Caudate, Putamen, Hippocampus, Insula, Precuneous, Superior Frontal Gyrus and Cingulate Gyrus, Lateral Occipital Cortex, Pericalcarine and Lingual Gyrus, Cerebellum, Temporal Pole, Pallidum and Parahippocampal Gyrus.

Inspired to [20], size-normalized importance metrics were derived for both the PPMS- and RRMS-LRP heatmaps. In particular, the median of the relevance values of both the target and non-target classes were extracted and averaged for each ROI across the correctly classified subjects (TPs and TNs) of the test set. Then, two additional measures, that we call gain and differential gain were calculated. The first is inspired to the gain metric used in [20], that was given by the ratio between the LRP median values of the two categories. In this work we propose to use the difference between such values to avoid the divergence of the measure that could occur in sites with vanishing LRP. Following the definition of the LRP, we consider the difference in relevance to be more representative of the actual contribution of a given ROI in forming the classification decision. All these steps resulted into two values per ROI, named as PPMS-LRP gain and RRMS-LRP gain, respectively. The differential gain was calculated as the difference between RRMS-LRP gain and PPMS-LRP gain. This was computed to measure the difference in gain brought by a given ROI when switching across the target class.

As a final explorative analysis, we aimed at investigating the LRP neuroanatomical plausibility, following [20, 21]. In particular, in [20] the hippocampal volume was used as biomarker for AD, while in [21] the WM lesion load was considered for MS and the correlation between the LRP relevance sum and lesion sum was assumed to provide evidence in favor of the informative potential of the LRP. In a joint work [15], we have shown that the RTPP mean value in the Hippocampus was significantly different for PPMS and RRMS in a subset of the cohort of patients considered here. In this work, four LRP maps were available for each index (including RTPP), one pair for each target class. Relying on that, our working hypothesis was that the statistical significance of the Spearman correlation coefficient between the mean LRP value of either of the four maps in the Hippocampus and the RTPP mean value in the same region would provide first evidence in favor of the neuroanatomical plausibility of LRP as a potential staging signature. Accordingly, four correlation coefficients were assessed.

3. Results

Qualitative microstructural assessment is given in supplementary material (supplementary section 1 and supplementary figure S1) (available online at stacks.iop.org/JNE/18/0460a6/mmedia). The contrast of DTI and 3D-SHORE indices is in agreement with previously reported results [32, 44], while RIF maps appeared inline with the results in [29], showing reduced intensity at increasing b-value, RIF degree and order.

3.1. 3D-CNN classification performance

The classification performance for each CNN model is reported in table 1 for the test set in terms of accuracy, precision (for each class), sensitivity and specificity. Average values and standard deviations were calculated from the five best models resulting from the five-fold CV.

Table 1. Classification performance metrics calculated in the test set for each CNN model. Values were calculated by averaging the results of the five best models derived from five-fold cross validation and reported together with the respective standard deviation values. The best performance metrics are highlighted in bold.

Test	Accuracy	Precision RRMS	Precision PPMS	Sensitivity	Specificity
RIF1-CNN	0.75 ± 0.04	0.71 ± 0.04	0.82 ± 0.05	0.64 ± 0.08	0.86 ± 0.05
RIF2-CNN	0.58 ± 0.15	0.73 ± 0.26	0.58 ± 0.16	0.82 ± 0.16	0.34 ± 0.28
FA-CNN	0.70 ± 0.08	0.76 ± 0.15	0.77 ± 0.13	0.68 ± 0.31	0.72 ± 0.20
MD-CNN	0.70 ± 0.06	0.67 ± 0.07	0.74 ± 0.04	0.60 ± 0.13	0.80 ± 0.05
PA-CNN	0.80 ± 0.04	$\textbf{0.96} \pm \textbf{0.09}$	0.73 ± 0.04	$\textbf{0.96} \pm \textbf{0.08}$	0.64 ± 0.08
RTAP-CNN	0.76 ± 0.09	0.75 ± 0.11	0.80 ± 0.04	0.68 ± 0.20	0.84 ± 0.05
RTOP-CNN	0.75 ± 0.03	0.77 ± 0.08	0.77 ± 0.05	0.74 ± 0.15	0.76 ± 0.10
RTPP-CNN	0.81 ± 0.05	0.81 ± 0.10	0.82 ± 0.03	0.80 ± 0.11	0.82 ± 0.04
T1-CNN	$\textbf{0.84} \pm \textbf{0.10}$	0.82 ± 0.14	$\textbf{0.94} \pm \textbf{0.07}$	0.74 ± 0.24	$\textbf{0.94} \pm \textbf{0.08}$

CNNs models for PA, RTAP, RTOP, RTPP, RIF1 and T1-w reached an accuracy ${\geqslant} 0.75$ . In particular, T1-CNN was the most accurate with an average score of 0.84, while for the dMRI-based models RTPP-CNN reached an accuracy of 0.81, followed by PA-CNN and RTAP-CNN achieving a mean accuracy score of 0.80 and 0.76, respectively. RIF2-CNN showed the worst performance with an average score of 0.58. Considering the two different classes, the highest precision for RRMS was reached by PA-CNN with a mean score of 0.96, followed by T1-CNN and RTPP-CNN showing a mean score ${\geqslant} 0.80$ . For the PPMS group, while T1-CNN provided the highest precision (average precision 0.94), good performance was also reached by RIF1-, RTAP- and RTPP-CNN, all providing a precision ${\geqslant}0.80$ . The highest sensitivity was reached by PA-CNN with a score of 0.96, followed by RIF2-CNN and RTPP-CNN (having mean sensitivity of 0.82 and 0.80, respectively). Finally, T1-CNN achieved the highest specificity of 0.94. Classification performance measures for the validation set were inline with those obtained on the test set, providing evidence of the absence of overfitting despite the relatively low cardinality of the sample with respect to the number of the network parameters.

Concerning the influence of the three confounds on the CNNs classification outcomes, the LR test highlighted that all the logistic classification models to which the CNNs outcomes were added as predictor were significantly different (χ-squared test, $p\lt0.05$ ) from the logistic classification model employing only the confounds as predictors, except for the RIF1-, RIF2- and RTOP-CNN.

3.2. LRP heatmaps

Figures 4 and 5 show the group LRP heatmaps (RRMS-LRP and PPMS-LRP), averaged over the correctly classified subjects of the test set for both RRMS and PPMS classes. Only the maps derived from the best indices in terms of mean accuracy are shown (see section 3.1 for details). For ease of visualization, the maps are clipped to the range 60th–99.5th percentile calculated over the respective LRP target group mean heatmap. The group mean maps for the other indices are reported in supplementary material (supplementary figures S2 and S3).

**Figure 4.** LRP heatmaps obtained from CNNs models based on the first rotation invariant feature (RIF1) and PA. The heatmaps are shown for both target classes (relapsing-remitting multiple sclerosis (RRMS-LRP) and primary progressive multiple sclerosis (PPMS-LRP), columns), and are overlaid to the MNI152 template in coronal, axial and sagittal views (rows). Each LRP map is averaged across the correctly classified RRMS and correctly classified PPMS subjects of the test set, respectively. The reported values are clipped to the range 60th–99.5th percentile, calculated over the RRMS and the PPMS class group mean heatmaps.
Download figure:
Standard image High-resolution image

**Figure 5.** LRP heatmaps obtained from CNNs models based on RTAP, RTPP and T1-w. The heatmaps are shown for both target classes (relapsing-remitting multiple sclerosis (RRMS-LRP) and primary progressive multiple sclerosis (PPMS-LRP), columns), overlaid to the MNI152 template in coronal, axial and sagittal views (rows). Each LRP map is averaged across the correctly classified RRMS and correctly classified PPMS subjects of the test set, respectively. The reported values are clipped to the range 60th–99.5th percentile, calculated over the RRMS and the PPMS class group mean heatmap.
Download figure:
Standard image High-resolution image

As expected, higher contrast, reflecting higher relevance, characterizes the average LRP map of the target class (i.e. PPMS in PPMS-LRP and RRMS in RRMS-LRP) in all CNNs models. This follows from the LRP relevance propagation algorithm, which starts from a larger number in the output layer for the target class and the correctly classified subjects. Even if a widespread activation of the GM regions was present in all the heatmaps, the patterns of the two families of LRP maps were not overlapped and PPMS-LRP resulted in more scattered activations compared to the others. The LRP maps corresponding to the different indices revealed different activations, highlighting that these could mirror specific microstructural properties. In particular, for both the target classes and both LRP heatmaps, RIF1-CNN (RIF1₇₀₀ and RIF1₂₀₀₀) showed a widespread activation over the GM regions, involving both cortical and subcortical structures. A similar pattern appeared also in RTAP-CNN, RTPP-CNN and T1-CNN LRP maps, showing higher frontal activation in PPMS-LRP maps of T1-CNN. A different pattern can be observed in both PA-CNN derived LRP heatmaps, revealing higher activation values in deep GM structures and considerably lower values in cortical structures.

Moving to the non-target class, high relevance values were present in the PPMS average heatmap for the RRMS-LRP, which were not found in the counterpart group for the PPMS-LRP maps. This is particularly evident for RIF1 and T1-CNN derived heatmaps. Of note, these maps showed lower relevance scores compared to the others. However, this depends on the higher number of voxels on which the relevance had to be redistributed for these two inputs (two maps for the RIF1-CNN and a larger map for the T1-w).

3.3. LRP ROI-based analysis

ROI-based analyses were performed to assess the relevant areas for the classification task. Figure 6 illustrates the size-normalized importance metrics for the correctly classified patients of the test set, separately for the two classes and for the two LRP types. The ROIs mean relevance values for the wrongly classified subjects are shown in supplementary material (supplementary figure S4). Of note, the gain is always positive and the non target class (TNs for the PPMS-LRP and TPs for RRMS-LRP following the notations) featured relevance values following the same trend of the correctly predicted ones across ROIs. It is important to highlight that the non-target class still owed some relevance for all the CNNs models in both cases, which was particularly high for PPMS-LRP.

Considering the different models, the same behavior was reported in PPMS-LRP for RTAP-CNN and RTPP-CNN heatmaps. In addition, RIF1-CNN maps showed similar values between the two shells and, together with T1-CNN LRP maps, presented sensibly lower LRP values compared to the other indices. All the CNNs models highlighted almost the same regions leading the classification between RRMS and PPMS, for both RRMS-LRP and PPMS-LRP. Among the cortical ROIs, Parahippocampal Gyrus appeared among the first five most relevant ROIs in all cases except T1-CNN, showing also the greater distance between the PPMS- and RRMS-LRP values in RRMS-LRP for all CNNs models except RIF1-CNN. Temporal Pole appeared highly relevant especially for PA-, RTAP- and RTPP- and T1-CNN, for both LRP types. Moreover, it reached high relevance values also for the non-target group in PPMS-LRP. Superior Frontal Gyrus showed a large LRP value for all the CNNs models, particularly for PPMS-LRP but also for the non-target class in RRMS-LRP for RTAP- and RTPP-CNN. Finally Lateral Occipital Cortex was highly relevant for RTPP in the PPMS group of PPMS-LRP heatmaps.

Concerning deep GM ROIs, Insula was among the most relevant ROIs for all CNNs except RTPP-CNN, for both RRMS-LRP and PPMS-LRP, while Cingulate Gyrus was highly relevant for RRMS-LRP of RIF1-CNN. Of note, those deep GM ROIs showed non-overlapping sets of relevance values between groups in RRMS-LRP of all the CNNs models. Cerebellum, lastly, was a remarkably relevant ROI for all the CNNs models and both LRP types, showing also disjoint distributions of LRP values in RRMS-LRP.

In figure 7, the results for the gain values are reported. This metric revealed that the highest difference in ROI relevance between the two LRP types was found in Parahippocampal Gyrus, Temporal Pole, Superior Frontal Gyrus, Cerebellum, Cingulate Gyrus and Insula, confirming the previously presented qualitative results. This quantitative analysis demonstrated also a different sensitivity of the considered indices to tissue modulations across separate ROIs, in particular of T1-w compared to the other indices. Among the dMRI indices, PA was the most different compared to the others. The results for the differential gain are reported in the supplementary material (supplementary figure S5).

The final Spearman correlation analysis revealed a significant and positive correlation (ρ = 0.77, p = 0.016) for the non-target class (PPMS) between the mean RTPP RRMS-LRP value in the Hippocampus and the RTPP mean value in the same ROI. No other significant correlations could be detected.

4. Discussion

In this work, we introduced LRP as a forceful method for explaining individual CNNs decisions in MS patients stratification. We trained different CNNs models to detect PPMS patients and capture the microstructural features as well as the main ROIs leading to the classifier decision. The dMRI considered indices were derived from DTI (FA and MD), 3D-SHORE (PA, RTAP, RTOP and RTPP) and from a novel framework for the extraction of RIFs from dMRI signal (RIF1 and RIF2). Only the GM tissues were fed to the CNN and a CNN model based on T1-w was also trained for benchmarking. For each CNN model, two heatmaps indicating the relevance of each voxel were derived, one for each target class in the test set. The relevance of 15 selected brain regions was then evaluated region-wise using three different importance metrics: (i) the size-normalized PPMS (or RRMS) importance, which is the median value of the LRP map for the target and non-target class, respectively; (ii) the PPMS-LRP and RRMS-LRP gain, measured as the difference between size-normalized target and non-target importance measures; and (iii) the differential gain, which combined both RRMS-LRP and PPMS-LRP by measuring the difference between the RRMS-LRP gain and the PPMS-LRP gain.

Our results demonstrated that: (1) dMRI features extracted in GM tissues can help disambiguating PPMS from RRMS patients; (2) LRP heatmaps highlight areas of high relevance which relate well with what is known from literature for MS disease.

Starting from the classification performance, 3D-SHORE derived indices as well as RIF1 outperformed the DTI-based ones, reaching comparable results with T1-CNN. Moreover, while literature generally reports WM features as signature of the MS disease, our study highlighted the GM potential role in identifying PPMS from RRMS patients, opening the way to a new type of potential numerical biomarkers focusing on GM.

Moreover, the LR test between the two models associated to each index in the post-hoc assessment of the prevalence of the confounding variable revealed that the classification results were not dominated by the confounds for the DTI-, 3D-SHORE- (except for the RTOP) and T1-CNN models.

These results provide evidence of the potential improvement brought by dMRI features other than DTI for MS staging, as well as of the eloquence of microstructural information in GM. In particular, the optimal performance reached by PA was in agreement with the results reported in [45], suggesting the sensitivity of anisotropy measures to MS modulation of GM tissues, although using classical DTI indices.

Precision, sensitivity and specificity values as defined in this work were tailored on the ease of classification of PPMS patients. More specifically, the precision was calculated separately for PPMS and RRMS classes and measured, respectively, how well the CNNs models could characterize the PPMS (RRMS) cohorts by minimizing the number of RRMS (PPMS) wrongly classified subjects. PA provided the best results for precision for RRMS, that is in minimizing the number of wrongly classified PPMS subjects, which was the main objective of this work. These results prove that dMRI is highly relevant for detecting the first signs of the PPMS stage of disease. Conversely, the T1-CNN reached the highest precision for PPMS, demonstrating its ability to minimizing the number of wrongly classified RRMS. This behavior was further confirmed by the sensitivity and specificity values. Regarding sensitivity, PA reached the best performance, highlighting the index ability to distinguishing the PPMS patients by minimizing the number of misclassifications. The maximum specificity was provided by T1-w that allows to characterize the RRMS subject class.

Although these two models showed outstanding performance related to the correct classification of one of the two classes, the performance metrics scores were relatively low. The index showing the highest stability across all the proposed measures was RTPP, which achieved an average value ${\geqslant}0.80$ in all the classification performance metrics meaning that it minimized at the same time both the PPMS and RRMS wrong classifications.

Regarding the recently proposed RIF, even tough they did non achieve significance in the control for confounds analysis, are interesting to be analyzed. RIF1 outperformed RIF2 in differentiating PPMS and RRMS. This was expected because in GM diffusion signal tends to be mostly isotropic and thus poorly described by high order SH models. High order RIF would be more suitable for WM, where the signal is highly anisotropic especially in regions having complex topology (crossing, fanning). In fact, since RIF were calculated on the diffusion signal, RIF1 represents the mean of the signal across one shell and thus it is proportional to the inverse of the diffusivity, while the second is related to the variance of the signal across one shell and thus it is more sensitive to the complexity of the tissue [29].

Comparing our results to the current literature, as pointed out in section 1 the stratification of MS patients is still largely underinvestigated and few studies addressed this specific problem so far. Among these, [23] achieved an average precision of 0.84 and an average sensitivity of 0.80 on a dataset of 604 acquisitions (b-value 1000 mm² s⁻¹). Although they relied on dMRI data, their focus was on connectivity while the methods proposed in this work availed of microstructural information. Moreover, our RTPP-CNN and PA-CNN achieved comparable accuracy values on a smaller dataset.

The differentiation between healthy and pathological subjects is much more common in literature than intra-pathology stratification. In [24, 25], two different 2D-CNNs architectures were combined with conventional structural MRI data to this end reaching high accuracy scores (98.77% and 98.23%, respectively). Using the different slices of each subject as a separate input led to a much larger sample size (e.g. [25] amounting to 1357 images in total for the 64 subjects) which brings a clear advantage for training.

A 3D-CNN based approach was proposed in [21], reaching an accuracy of 87.04% on a set of 147 fully volumetric structural MRI acquisitions. Despite the lower accuracy compared to the 2D-CNN based approaches, the use of a 3D-CNN architecture facilitated the interpretation of the CNN performance through the use of feature visualization techniques. However, the difference in the research question makes these works not directly comparable to ours. Concerning the feature visualization, [22] compared different techniques applied to a 2D-CNN trained on 66 healthy controls and 66 MS patients SWI data. Their results highlighted the outstanding performance, based on the quantitative image perturbation method, of LRP maps and DeepLIFT [46] over simpler methods, strengthening the exploitability of such methods to address clinically relevant questions.

Regarding neural network visualization, the application of specific techniques, such as the LRP here adopted, provides a mean for CNNs interpretability and, when used in combination with other clinical and imaging data could support diagnosis and treatment decisions. To the best of our knowledge, this is the first work showing an application of LRP visualization on a dMRI-based classification problem. By relying on this technique, it was possible to identify the regions playing a prominent role in the classification between the two MS phenotypes, which were the regions of higher LRP gain across groups. From the analysis of these maps, it was clear that the different indices showed a selective pattern, being sensitive to modifications related to the disease in different ROIs. In particular, for dMRI-derived CNNs models and for the target class, both PPMS-LRP and RRMS-LRP group average heatmaps for RIF1-, RTPP- and RTAP-CNN showed a complementary relevance pattern compared to PA-CNN ones, suggesting that different regions encountered specific microstructural alterations. This result pushes toward their integration within a unified model accounting for all the relevant information at a time. Though, this would require a larger sample to compensate for the input size and we leave it for future investigation.

RIF1-CNN mean LRP heatmaps showed a redundancy in the information provided by the two different shells. For clinical purposes and applications, further investigation should be carried on measuring the discriminative power of RIF1 as calculated on a single shell. Determining whether one single b-value acquisition would be sufficient would allow reducing the MRI scan time with clear clinical advantages.

Concerning ROI-based analysis, the regions playing a leading role were in agreement with the literature. Indeed, the Parahippocampal Gyrus and Insular Cortex have been shown to report high probability of focal GM demyelination in MS pathology [42], while the Cerebellum has been demonstrated to be a major site for demyelination, especially in PPMS patients [47]. The Superior Frontal Gyrus has been associated with fatigue, particularly in RRMS [48], and the Temporal Pole appeared to be present in clinically relevant MS cortical atrophy patterns [49].

Finally, it is important to note that the focus in interpreting LRP maps was not on the absolute values of the relevance, but on the differences and overlaps between the violin plots of the considered ROIs in the two classes of patients. This means that the relevance values allowed to understand how the voxels of certain ROIs contributed to the classification, but still did not allow identifying the underlying reasons (lesion load, atrophy, etc) [20].

In order to investigate whether higher importance scores could correspond to relevant microstructural modulations, the association between RTPP average values in the ROI and the mean value of the RRMS-LRP heatmap for the non-target PPMS class Hippocampus was assessed. The Hippocampus was chosen in light of previous results [15] showing RTPP sensitivity to GM differences between PPMS and RRMS. Though the interpretation of this correlation is far from trivial, the straight meaning is that the PPMS subjects showed some tissue modulations typical of RRMS subjects to which RTPP-CNN was sensible and which were significantly linked to a biomarker for MS patients stratification. Although deeper investigation is needed, in a broader view, this result, together with [20, 21] provides evidence of CNNs ability to learn to identify important disease biomarkers as relevant for the classification. Of note, a significant feature for MS is a lower diffusion restriction and massive neuronal loss and demyelination in Hippocampus [50] which is indeed well captured by RTPP in other studies [44, 51]. For the future, it could be valuable indeed to perform an assessment of the performance of the dMRI indices in the tissues of the analyzed MS groups. This would enable a better understanding of the pathophysiology beyond the microstructural changes induced by the disease.

5. Limitations and future works

In what follows, some of the main limitations of this study are briefly summarized, and some among the many possible future steps that will be taken to fully exploit the potential of the proposed method are presented. First, the low numerosity of the sample could cast shadows on the statistical significance of the outcomes and also affect the hyperparameters optimization that, in an optimal setting, should be performed on separate sets. In this work data augmentation was used in both the training and the validation sets of the five-fold CV for simulating a larger sample of subjects.

The lack of healthy controls is a reason of concern as it impedes to benchmark the performance of the proposed architecture in the patients versus controls classification task. However, as mentioned in the discussion, the CNN architecture we chose was previously employed in [20] to differentiate AD patients from controls based on T1-w MRI, and a slightly modified version in [21] to classify healthy controls and MS patients based on FLAIR and T1-w MRI. We acknowledge that their datasets were different from our local cohort, however, their work can be considered as literature benchmark for the CNN in use. Our aim was to disentangle the relevance linked to the two groups of disease which is an important open issue per-se. In consequence, we focused on this and investigated whether relying on sophisticated methods, such as advanced dMRI-based indices and 3D-CNNs models, could help in differentiating the two MS groups.

The lack of a control class also impeded to assess the relevance of the voxels and ROIs in distinguishing each of the phenotypes (PPMS, RRMS) from healthy matched controls. This is an interesting issue because it could reveal shared features of the two manifestations of the pathology that could not be captured by the proposed analysis yet potentially being insightful for understanding the mechanisms of the disease. We leave this for future investigation.

Finally, as a future work, including WM in the analysis, would widen the spectrum of the microstructural features potentially distinguishing the two disease phenotypes as well as unravelling the link with GM tissue modulations.

6. Conclusion

This work provides evidence in favor of the capability of dMRI indices of distinguishing the PPMS from the RRMS state of disease in MS. We proved that 3D-SHORE based indices and RIF1 outperformed FA and MD, pushing to shift the attention on dMRI features other than DTI ones. In addition, thanks to the use of a 3D-CNN and LRP visualization, we could observe that the CNNs classification was based on clinically relevant ROIs and that different indices were sensitive to GM modulation in different brain regions. Our results support the hypothesis of dMRI based indices suitability as numerical biomarkers for the characterization of pathological brain tissues. Moreover, this work has the potential to address clinically important problems in MS, like the early identification of the clinical course for diagnosis and provides evidence in favor of the feasibility of precision medicine.

Data availability statement

The data generated and/or analysed during the current study are not publicly available for legal/ethical reasons but are available from the corresponding author on reasonable request.

Acknowledgment

This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (ERC Advanced Grant Agreement No. 694665: CoBCoM—Computational Brain Connectivity Mapping). This work has been partly supported by the French government, through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the Reference Number ANR-19-P3IA-0002. This work was also partially supported by Fondazione CariVerona (Bando Ricerca Scientifica di Eccellenza 2018, reference number 2018.0855.2019).

Interpretable deep learning as a means for decrypting disease signature in multiple sclerosis

Article metrics

Submit

Permissions

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction