Characterization of breast lesions using multi-parametric diffusion MRI and machine learning

Rahul Mehta; Yangyang Bu; Zheng Zhong; Guangyu Dan; Ping-Shou Zhong; Changyu Zhou; Weihong Hu; Xiaohong Joe Zhou; Maosheng Xu; Shiwei Wang; M Muge Karaman

doi:10.1088/1361-6560/acbde0

Abbreviations

Magnetic Resonance Imaging	MRI
Dynamic contrast-enhanced MRI	DCE-MRI
Diffusion-weighted magnetic resonance imaging	DWI
Apparent diffusion coefficient	ADC
Diffusion coefficient	D_diff
Pseudo-diffusion coefficient	D_perf
Perfusion fraction	f
Fractional order calculus	FROC
Continuous-time random-walk	CTRW
Intravoxel incoherent motion model	IVIM
Diffusion-weighted	DW
Region of interest	ROI
Support vector machine	SVM
Gaussian Processes	GP
Decision Trees	DT
Random Forest	RF
Gradient Boosted	GB
adaBoost	AB
Area under the receiver operating characteristic curve	AUC

Introduction

Breast cancer is the most frequent cancer among women and leading cause of female cancer death worldwide (Bray et al 2018). An accurate characterization of breast lesions is important for efficient risk assessment and optimized treatment. Magnetic resonance imaging (MRI) has been used to evaluate suspicious breast lesions and demonstrated superior sensitivity to ultrasound and mammography (Morris 2010). Dynamic contrast-enhanced MRI (DCE-MRI) is a prevalent MRI technique for breast cancer diagnosis (Sardanelli et al 2010). Although DCE-MRI offers a high sensitivity of 94%–100% (Orel and Schnall 2001), the reported specificity varies widely from 37% to 97% (Orel and Schnall 2001, Guo et al 2002, Woodhams et al 2005, Rubesova et al 2006, Lo et al 2009), and the use of exogenous contrast in DCE-MRI is potentially problematic in patients with compromised kidney functions, cardiac conditions, respiratory disorders, or pregnancy (Beckett et al 2015).

With its ability to probe underlying tissue microstructures, diffusion-weighted magnetic resonance imaging (DWI) has been increasingly used for characterizing breast lesions (Partridge and McDonald 2013, Partridge et al 2017, Dietzel et al 2020). Numerous studies have reported a lower apparent diffusion coefficient (ADC) in malignant breast lesions due to increased cellularity compared to benign lesions (Partridge and McDonald 2013). For differentiation between malignant and benign breast lesions, ADC has been shown to achieve a sensitivity of 85%–95% (Guo et al 2002, Woodhams et al 2005, Rubesova et al 2006, Lo et al 2009) and a specificity ranging from 50% to 90% (Guo et al 2002, Woodhams et al 2005, Rubesova et al 2006, Lo et al 2009). To improve diagnostic performance, DWI with ADC has been used conjointly with DCE-MRI (Partridge et al 2009). While the combination of the two MRI techniques has improved breast lesion characterization, the mono-exponential model that the ADC is derived from assumes Gaussian diffusion where the diffusion medium is typically homogeneous. Given the heterogeneous nature of biological tissues, particularly the tumors that exhibit a high degree of structural heterogeneity and complexity, it is well known that water diffusion in tissues does not follow a Gaussian distribution (Le Bihan et al 1988). Therefore, ADC has not been fully established as an imaging marker for breast lesion characterization (Pinker et al 2018).

ADC has been conventionally associated with tissue cellularity (Chen et al 2013), however, breast tissue has many other properties, such as vascularity and heterogeneity, that are relevant to neoplastic changes. Recent studies have indicated that these properties can be probed by utilizing a specific portion of the b-value 'spectrum' in DWI (Tang and Zhou 2019). The lower b-value range (∼0–200; and 800 s/ mm²) can reveal tissue cellularity and micro-vascularity by using an intravoxel incoherent motion (IVIM) model with three parameters: diffusion coefficient (D_diff), pseudo-diffusion coefficient (D_perf), and perfusion fraction (f) (Le Bihan et al 1988). On the other hand, high b-values (e.g. 1000–4000 s/mm²) can help probe tissue microstructures using a wealth of advanced diffusion models (Tang and Zhou 2019). One such model is the continuous-time random-walk (CTRW) model (Karaman et al 2016a, Zhong et al 2019), which has been shown to be sensitive to intravoxel diffusion heterogeneities in time and space (Magin et al 2008, Zhou et al 2010, Ingo et al 2014), as reflected by two parameters α and β, respectively. These parameters, together with those from a predecessor fractional order calculus (FROC) model (Magin et al 2008, Zhou et al 2010), have been successfully applied to tissue characterizations in adult (Sui et al 2016) pediatric brain tumors (Karaman et al 2016a, 2016b), gastric cancer (Karaman et al 2021), gastrointestinal stromal tumor (Tang et al 2018), hepatocellular carcinoma (Chen et al 2021), and bladder cancer (Feng et al 2022). For breast cancer characterization, however, only a few studies with high-b-value DWI have (Feng et al 2022) been reported and the results are inconclusive (Dietrich et al 2010, Bedair et al 2017, Bickelhaupt et al 2017). Moreover, the use of a full b-value spectrum to simultaneously probe multiple tissue properties such as cellularity, vascularity, and heterogeneity, has not been reported.

While the IVIM and CTRW model parameters can offer valuable insights into a lesion, the relative importance of these parameters for delineating malignant and benign lesions remains unknown. In this study, we use machine learning to investigate the relative importance of these parameters. Specifically, we generate histogram features obtained from the parameters of a low-b-value IVIM model and a high-b-value CTRW model, then investigate the performance of the histogram features for differentiating malignant and benign breast lesions using machine learning algorithms.

Materials and methods

Patient characteristics

The institutional review board approved this retrospective study, and written informed consent was obtained from all participating patients. Thirty-one patients (age range: 27–86) with a total of 40 pathologically verified breast lesions were enrolled. The inclusion criteria were as follows: (1) aged 18 years or older, (2) no previous neoadjuvant chemotherapy or radiotherapy before biopsy or surgery, and (3) presence of histopathologically verified breast lesions with biopsy or surgery within one month after breast lesions detected on MRI, and (4) ACR BI-RADS 4 or 5 (suspiciously malignant or highly suggestive of malignancy) and ACR BI-RADS 2 or 3 (benign or probably benign) lesions detected on breast MRI. The exclusion criteria were: (1) excessive artifacts in the MR images, (2) non-mass-like enhancement, (3) benign cysts, (4) ACR BI-RADS 0 (i.e. further evaluation) or (5) ACR BI-RADS 6 lesions (i.e. biopsy-confirmed malignancy) as the radiologists were blinded to the histologic evaluations.

Diffusion-weighted image acquisition

MRI was performed on a Discovery MR750 3T scanner (GE Healthcare, Waukesha, WI) with an 8-channel breast coil (Invivo Corp., Gainesville, FL). DWI with 11 b-values (0₁, 50₁, 100₁, 300₂, 500₂, 800₂, 1100₄, 1500₄, 2000₆, 2500₆, and 3000₈ s/mm², where the subscripts denote the number of averages) was carried out using a single-shot spin-echo echo planar imaging (EPI) sequence. Other image acquisition parameters were: TR/TE = 7000/78 ms, slice thickness = 5 mm, FOV = 32 cm × 32 cm for bi-lateral breast coverage, and reconstruction matrix size = 256 × 256. In order to mitigate the effect of diffusion anisotropy, trace-weighted images were obtained by successively applying the Stejskal-Tanner diffusion gradient along the x, y and z directions at each non-zero b-value.

Diffusion-weighted image analysis

The multi-b-value diffusion-weighted (DW) images were first analyzed with the CTRW model whose diffusion-attenuated MR signal takes the following form:

$\begin{eqnarray}&&\displaystyle \frac{S}{{S}_{0}}={E}_{\alpha }\left(-{\left(b{D}_{m}\right)}^{\beta }\right),\end{eqnarray} \tag{ 1 }$

where D_m (in mm²/s¹) is an anomalous diffusion coefficient, the parameters α and β (dimensionless) are temporal and spatial fractional orders that are related to temporal and spatial diffusion heterogeneities, respectively, and E_α is a Mittag-Leffler function (Karaman et al 2016a).

The DW images in the low b-value range (0–800 s/mm²) were also analyzed by using the IVIM model in which the diffusion-attenuated MR signal is given as (Le Bihan et al 1988):

$\begin{eqnarray}&&\frac{S}{{S}_{o}}=f{e}^{\left(-b{D}_{\mathrm{perf}}\right)}+\left(1-f\right){e}^{-b{D}_{\mathrm{diff}}},\end{eqnarray} \tag{ 2 }$

where $f$ is the perfusion fraction, ${D}_{\mathrm{diff}}$ is diffusion coefficient, and ${D}_{\mathrm{perf}}$ is the pseudo-diffusion coefficient, both in units of mm²/s¹.

The CTRW model parameters, D_m, α, and β, and the IVIM model parameters, f, D_diff and D_perf, were estimated by fitting equations (1) and (2) respectively using a nonlinear least-squares estimation with an iterative Levenberg-Marquardt method in Matlab (MathWorks, Inc., Natick, MA). After noise filtering and Rician noise correction (Zhou et al 2010), the nonlinear fitting proceeded with two major steps for the CTRW model: (a) estimating D_m by a mono-exponential model using the diffusion images at low b-values (b ≤ 1100 s/mm²) and (b) simultaneously estimating α and β from the diffusion images in the full b-value spectrum (b-values = 0–3000 s/mm²). For the IVIM model fitting, a 'segmented' approach was employed using the diffusion images at b-values up to 800 s/mm² in three steps (Luciani et al 2008).

Machine learning analysis

The following subsections discuss how IVIM and CTRW model parameters are utilized in a machine learning framework for extracting imaging characteristics that can differentiate malignant and benign breast lesions. A visualization of our machine learning framework is shown in figure 1.

**Figure 1.** Workflow of image analysis from left to right with corresponding parameter maps and histograms. The data analysis began with lesion contouring by lesion ROI selection and DWI parameter extraction within the selected ROIs. Preprocessing of the dataset was then performed by normalization and centering the histogram, followed by ROI augmentation. We extracted features consisting of the mean, kurtosis, median, 10% quantile, 25% quantile, inter quartile range, and 75% quantile values of the histogram. We then used the Boruta algorithm to determine the most significant features and used these features in various machine learning classifiers to discriminate malignant and benign lesions.
Download figure:
Standard image High-resolution image

Pre-processing

The ROIs in tumor-containing slices were manually delineated by a radiologist, and then applied to the CTRW parameter maps, D_m, α, and β; and the IVIM parameter maps, f, D_diff, and D_perf, resulting in six parameter maps for each patient, as shown in figure 1. This is followed by a max-min normalization of each distinct parameter map across samples into a range of [0, 1] (appendix figures A1–A6). This normalization procedure allows us to compare all the parameter maps and is a standard procedure for classification algorithms (Breiman 2001). For example, the f parameter maps were normalized into a range of [0, 1] with respect to the maximum and minimum of all the f parameter maps in our study. Finally, the ROIs were randomly cropped to increase the dataset size as a general technique to prevent overfitting (Takahashi et al 2018). While other augmentation techniques exist, many of these methods would neither improve the performance due to the type of features to be extracted from the diffusion parameters (i.e. random rotations) nor be applicable to diffusion MRI parameters (adding color/noise). Ultimately, the dataset consisted of the original 40 full tumor ROIs and an additional 80 cropped subsets of the ROIs, resulting in a total of 120 samples (appendix A for more details).

Feature extraction

A histogram of the voxel values within the ROI was generated for each diffusion parameter, D_m, α, β, D_diff, D_perf,and f. For each of the six parameters, nine histogram features were evaluated including kurtosis, skewness, variance, mean, median, interquartile range, 10% quantile, 25% quantile, and 75% quantile values resulting in a total of 54 quantitative features for use in the machine learning analysis.

Feature selection

To ensure the robustness of the machine learning classifiers, the Boruta algorithm, an iterative feature selection method, was used to select a subset of relevant histogram features by removing the most irrelevant and redundant features from the data (Kursa and Rudnicki 2010). To control for false positives, we used a modified two-stage multiple testing methodology process. The Benjamin Hochberg FDR test was first used to select the most significant features within an iteration, followed by the Bonferroni correction to account for multiple comparisons across multiple iterations of using the same set of features. At every iteration, feature importance was determined for all features with respective to the classification performance such that the most significant features are ranked as one.

Machine learning algorithms

The ability of the top features to predict a lesion as benign or malignant was compared using Kernel Methods, Ensemble Methods, and Naïve Bayes. Kernel methods included kernelized support vector machine (SVM) and Gaussian Processes (GP). Ensemble methods included techniques that make predictions based on either subsampling (i.e. training individual parallel classifiers using bootstrapped samples with a random subset of feature variables) such as bagging with Decision Trees (DT) and Random Forest (RF); or boosting (i.e. training sequential individual classifiers on weighted samples such that the next classifier concentrates more on the misclassified samples from the previous classifier) such as Gradient Boosted (GB) or adaBoost (AB).

Statistical analysis

We created a stringent statistical analysis to increase the validity of our results. First, we created an 80%–20% split-ratio for the training and test data sets. Second, we ensured that the test set only included samples that were mutually exclusive from the training set such that the classifier only tests on unseen samples. Such a methodology provided assurance that a model which performed well on the training set would also perform well on the test set. Third, we employed scikit-learn (Pedregosa et al 2011) to create a model selection scheme of the classifiers using a 'stratified-nested' cross-validation that reduces overfitting (Krstajic et al 2014). This also allowed us to optimize the classification and histogram parameters such as histogram bin width. Specifically, we performed this optimization by maximizing the area under the receiver operating characteristic curve (AUC) using grid search with a 'stratified-nested' cross-validation of 10 repeated 5-folds. Lastly, to account for a slightly imbalanced dataset, we under-sampled the majority class (malignant) during each grid search so that every fold contained an equal representation of each class. The optimized parameters for each classifier are given in our code (https://github.com/rmehta1987/MriDiffusion).

The classification algorithms in the testing stage were compared using AUC, F1-Score, and accuracy metrics with 10 000 bootstrapped samples to define 95% confidence intervals. We used a Mann-Whitney U-test to perform a pair-wise comparison on the performance metrics (specificity, sensitivity, or accuracy) of the classifiers with a statistical significance set at p-value < 0.05. Similarly, a comparison of the most significant quantitative features after feature selection within malignant and benign samples was performed by generating a p-value using a Mann-Whitney U-test with a statistical significance set at p-value < 0.05.

Results

Diffusion parameter maps

Figure 2 shows the maps of CTRW model parameters D_m, α, and β (a)–(c) and IVIM model parameters D_diff, D_perf, and f (d)–(f) from two representative patients with malignant (left column) and benign lesions (right column). Qualitatively, the malignant lesion exhibited lower values in D_m, α, β, and D_diff and higher values in f in some portions of the ROI than the benign lesion. The malignant lesion also showed a larger variance of the parameters within the ROI in comparison to the benign lesion.

Comparison of features

The top 18 features among the 54 features are shown in figure 3(a) with 95% confidence intervals. Of these 18 features, the top 8 features that statistically significantly improve classification performance are: median of β (β_median), skewness of β (β_skewness), mean of β (β_mean), 75% quantile of f (f_75%) , 75% quantile of D_diff $\left({D}_{\mathrm{diff}}^{75 \% }\right),$ kurtosis of D_perf $\left({D}_{\mathrm{perf}}^{\mathrm{kurtosis}}\right),$ 75% quantile of D_m ( ${D}_{m}^{75 \% }$ ), and median of D_m ( ${D}_{m}^{\mathrm{median}}$ ). The boxplots of these top features for both the malignant and benign groups are given in figure 3(b) and table 1, respectively. These features were found to be significantly different between the malignant and benign lesions with a p-value < 0.05 (Supplementary table 1). We used ${\beta }_{\mathrm{mean}}$ rather than ${\beta }_{\mathrm{kurtosis}}$ as the classifier performance metrics (AUC, F1-score, accuracy) were consistently better with ${\beta }_{\mathrm{mean}}$ than with ${\beta }_{\mathrm{kurtosis}},$ even though their importance rankings are similar.

**Figure 3.** (a) Feature importance plot of the top 18 features by feature rank where rank 1 is the best feature with their respective confidence intervals indicated by the horizontal lines on each point. (b) Paired box plots of the malignant and benign groups and the top 8 features selected from the feature importance paradigm. The top features are statistically significantly different (p-value < 0.05) between the malignant and benign groups except ${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$ where the variance is significantly different (p-value < 0.05) between the malignant and benign groups.
Download figure:
Standard image High-resolution image

Table 1. Comparison of mean and standard deviation of the top performing features respective of their class.^{^a}

Feature Name	Malignant (mean±std)	Benign (mean±std)
${D}_{m}^{75 \% }$	0.197 ± 0.187	0.255 ± 0.082
${\beta }_{\mathrm{skewness}}$	−1.218 ± 2.710	−0.579 ± 0.612
${f}_{q3}$	0.266 ± 0.105	0.374 ± 0.148
${D}_{\mathrm{diff}}^{75 \% }$	0.266 ± 0.105	0.374 ± 0.148
${\beta }_{\mathrm{median}}$	0.876 ± 0.140	0.774 ± 0.208
${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$	4.456 ± 7.884	3.087 ± 5.975
${\beta }_{\mathrm{mean}}$	0.779 ± 0.134	0.714 ± 0.174
${D}_{m}^{\mathrm{median}}$	0.157 ± 0.056	0.202 ± 0.081

^aEach feature is significantly different (p-value < 0.05) between malignant and benign groups except ${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$ where the variance is significantly different between malignant and benign groups (p-value < 0.05).

Comparison of classifier performance

In the cross-validation stage, the comparison between the ROC curves (figure 4) showed that among the eight machine learning classifiers (linear SVM, RBF SVM, GP, DT, RF, AdaBoost, NB, and GB), GB and RF classifiers performed the best with respect to AUC of ROC curves (p-value < 0.05). The swarm plots given in figures 5(a)–(c) summarize the performance metrics of all the classifiers, which again show that the GB and RF classifiers outperformed the other classifiers in all performance metrics (p-value < 0.05).

**Figure 5.** The swarm plots of AUC (a), accuracy (b), and F1 Score (c) metrics of each classifier during the cross-validation stage.
Download figure:
Standard image High-resolution image

Similar results were observed in the performance metrics of the classifiers obtained during the testing stage indicating that the machine learning algorithms can discriminate between unseen samples. The boxplots of AUC, accuracy, and F1 score of all classifiers are given in figures 6(a)–(c), respectively and summarized with the mean and standard error in table 2. Again, the GB classifier provided the best performance with the highest mean AUC of 0.942 (p-value < 0.05) with a 95% confidence interval of [0.904, 0.981]. The GB and RF classifiers both performed the best in terms of accuracy and F1 score (p-value < 0.05) where the GB classifier had a mean accuracy of 0.833 and a mean F1 score of 0.872; the RF classifier had an accuracy of 0.833 and a mean F1 score of 0.871. While the RF and GB classifiers had similar performance, we found that the GB classifier consistently had the highest AUC over a larger number of iterations performed on the test set and as such offers the best discrimination between benign and malignant lesions.

**Figure 6.** The box plots of AUC (a), accuracy (b), and F1 Score (c) metrics of each classifier during the testing stage.
Download figure:
Standard image High-resolution image

Table 2. Comparison of the eight top classifiers during the testing stage.

Classifier name	F1 Score (mean±std)	Accuracy (mean±std)	AUC (mean±std)
Linear SVM	0.723 ± 0.034	0.566 ± 0.445	0.873 ± 0.039
RBF SVM	0.731 ± 0.041	0.633 ± 0.043	0.733 ± 0.047
Gaussian Process	0.729 ± 0.041	0.649 ± 0.042	0.692 ± 0.039
Decision Tree	0.842 ± 0.034	0.824 ± 0.035	0.846 ± 0.041
Random Forest^{^a}	0.871 ± 0.030	0.833 ± 0.033	0.918 ± 0.027
AdaBoost	0.776 ± 0.037	0.764 ± 0.426	0.831 ± 0.043
Naïve Bayes	0.778 ± 0.032	0.733 ± 0.037	0.91 ± 0.028
Gradient Boosted^{^a}	0.872 ± 0.031	0.833 ± 0.037	0.942 ± 0.02

^aThe GB and RF classifiers are the best performing classifiers as they significantly outperform the other classifiers across all the metrics with a p-value < 0.05. While the RF and GB classifiers have similar performance, we found that the GB classifier consistently had the highest AUC over a larger number of iterations performed on the test set.

Discussion

We have shown that a combination of quantitative features from the parameters of the CTRW and IVIM diffusion models can characterize malignant and benign breast lesions with a machine learning-based approach. This approach has allowed us to probe multiple tissue properties such as vascularity, cellularity, and heterogeneity using a set of DW images acquired with a full b-value spectrum.

The shortcomings of using only median or mean parameter values in lesion characterization have led to recent diffusion MRI studies that extract more quantitative features obtained from the ADC or IVIM model parameters. These studies have used machine learning (Vidić et al 2018, Hu et al 2019) or other statistical paradigms (Song et al 2014) to determine molecular subtype (Vidić et al 2018) or breast lesion malignancy (Song et al 2014, Hu et al 2019). An initial histogram analysis of the quantitative features is a common tool within the studies to collate additional information by parsing the statistical distribution of the DWI parameters that is generated by the histogram. Our study extends this data-driven approach to the characterization of breast tissue by using a set of quantitative features of diffusion parameters that are related to various tissue properties such as vascularity, cellularity, and/or heterogeneity.

The relationship between the conventional ADC parameter and the underlying tissue microstructures has been extensively studied by others (Chen et al 2013, Surov et al 2017). Many of these studies have observed a correlation between ADC and tissue cellularity based on histopathologically estimated cell count (Chen et al 2013). Sharing the same units as the conventional ADC, CTRW and IVIM models' diffusion coefficient parameters, D_m and D_diff, respectively, have a close resemblance to ADC as they also reflect tissue diffusivity, and hence, are related to tissue cellularity. Utilizing the low b-values to extract the microcapillary perfusion component from the DW signal, IVIM model has been shown to probe tissue vascularity through its perfusion fraction parameter, f, based on its correlation to contrast enhancement obtained with DCE-MRI (Sigmund et al 2011) or histopathological vascular density (Lee et al 2014, Togao et al 2018). The relatively new CTRW model, on the other hand, incorporates intravoxel diffusion heterogeneity into the classical Fick's equation by using two fractional order parameters, α and β, with respect to time and space, respectively. While originating from the same underlying tissue property of structural heterogeneity, CTRW model's α and β reflect two independent and complementary aspects of diffusion heterogeneities: temporal and spatial heterogeneity. Water molecules can take a variable time to make a move, or can produce a variable displacement in each move. The β parameter which accounts for the non-Gaussian distribution of diffusion displacement (Zhou et al 2010, Magin et al 2008, 2011) has been related to the diffusion jump length (Ingo et al 2014) that is associated with intravoxel spatial heterogeneity. The α parameter, on the other hand, reflects the temporal heterogeneity of the diffusion process, that is, the likelihood of the water molecule to be 'trapped' or 'released' while it diffuses through the complex tissue structure and environment. With its generalized representation that can characterize both temporal and spatial diffusion behavior, the CTRW model has the potential to be used for studying time-dependency of water diffusion behavior, which has gained increasing attention in the diffusion MRI research community in recent years (Pyatigorskaya et al 2014, Novikov et al 2019, Wu and Zhang 2019, Aggarwal et al 2020). While the CTRW model's time-dependency has been demonstrated in Sephadex gels (Dan et al 2021), in this study we have not investigated its time-dependency on breast tissue as it is challenging to acquire clinical data over a broad range of diffusion times.

The most important quantitative feature for differentiating malignant and benign lesions in our study was the 75% quantile value of the CTRW parameter, D_m, which was found to be significantly lower in malignant lesions than the benign ones (p-value = 5.945e-10). This result is consistent with previous studies (Partridge et al 2017) which showed that malignant breast lesions exhibit lower diffusion coefficient (i.e. ADC) values. Our study differs from these previous studies with the use of higher b-values and evaluation of multiple quantitative features obtained from the diffusion coefficient for discriminating malignant and benign lesions. It is worth noting that the 75% quantile of D_m, which may have a higher sensitivity to detect focal regions of higher cellularity, outperformed the mean D_m which was not among the top performing features. We also performed a conventional DWI-based analysis that uses histogram features of ADC with a GBC classifier. The results given in appendix D demonstrated that the performance of the ADC-based histogram features was not only lower than the one based on IVIM and CTRW parameters, but the high variance suggested that the ADC-based classifier was highly sensitive to parameter perturbations.

The mean, median, and skewness of β have been found sensitive to changes in malignant breast lesions which concur with our results where the mean (p-value = 0.0099), median (p-value = 0.0003), and skewness of β (p-value = 1.98e-6) were significantly different in malignant and benign lesions. β has also been reported to be related to intravoxel tissue heterogeneity in a growing number of studies using CTRW and its predecessor FROC model (Magin et al 2008, Zhou et al 2010, Ingo et al 2014, Sui et al 2015, Karaman et al 2016a, Bickelhaupt et al 2017, Yu et al 2017, Tang et al 2018, Karaman et al 2016b, 2021, Feng et al 2022). Bickelhaupt et al (Bickelhaupt et al 2017), however, did not observe a significant change in β for discriminating breast lesions and attributed this discrepancy to relatively low b-values employed in their study. Deployment of β, particularly at high b-values, can enhance the value of diffusion MRI for tissue characterization and should be further investigated.

While the features extracted from the spatial diffusion heterogeneity parameter, β, has been found to show differences between the benign and malignant lesions, our study did not find that the temporal diffusion heterogeneity parameter, $\alpha ,$ contributed significantly to characterizing breast lesions. The exact underpinning biophysical reason remains to be investigated, but was likely related to inadequate signal-to-noise ratios at high b-values and a fixed diffusion time was used in the experiments. This suggests that the FROC model, which features on β parameter, may be adequate for breast tissue characterization under the experimental conditions of the current study.

Our analysis of the IVIM model (figure 3) showed that kurtosis of ${D}_{\mathrm{perf}}$ $\left({D}_{\mathrm{perf}}^{\mathrm{kurtosis}}\right),$ the 75% quantile of f $\left({f}_{75 \% }\right),$ and the 75% quantile of ${D}_{\mathrm{diff}}$ $\left({D}_{\mathrm{diff}}^{75 \% }\right)$ were among the top performing features to discriminate malignant and benign breast lesions. The mean values of the features, ${f}_{75 \% }$ (p-value = 5.95e-10), and ${D}_{\mathrm{diff}\,}^{75 \% }$ (p-value = 5.95e-10), were significantly higher in benign lesions (p-value = 5.95e-10 for both). While the mean ${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$ was similar in both lesion types, the variance of ${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$ was significantly higher in benign lesions (p-value = 0.007). Other studies have also reported the use of features from the $f\,\mathrm{and}{D}_{\mathrm{diff}}$ parameters to discriminate breast lesions (Song et al 2014, Vidić et al 2018). These studies showed quantitative features generated from f parameter maps were higher in malignant cases. Our results were inverse. We attribute this to the max-min normalization process in our preprocessing stage. Specifically, the size of the malignant lesions was found to be significantly larger than benign lesions (p-value = 2.114e-08), and f within the lesion ROIs had higher positive skewness in the malignant lesions than in the benign lesions. As such, while malignant lesions exhibited higher f values, most of the lesion had a very high concentration of low f values. Therefore, the max-min normalization, which was required for pre-processing the parameter maps, further increased the influences of the lower values in the f parameter maps in the malignant cases. We verified this with a simple linear regression analysis (appendix B for more details) and showed that f maps had significantly higher (positive) skewness (p-value < 0.05) when the lesion is malignant (appendix figure C1). Before normalization, the values in the f parameter maps in our study were consistent with established literature (appendix figures A6 and C2) (Song et al 2014). The underlying biological implications from the IVIM parameter maps indicate that specific areas of a lesion can be malignant, and as a result have high vascularity, while other regions within a lesion may not be malignant and as a result have low vascularity. The different regions of a lesion having different properties recapitulates cancer causes lesion heterogeneity.

A comprehensive analysis of multi-parametric information provided by the IVIM and CTRW models offers a unique perspective to the characterization of underlying breast tissue. This is consistent with previous studies which showed improved performance with a combination of quantitative features from DWI model parameters (Song et al 2014, Vidić et al 2018) in comparison to a single feature. For example, Vidic et al (Vidić et al 2018) used a RBF SVM classifier with a combination of four features from IVIM model $\left({D}_{\mathrm{diff}}^{\mathrm{mean}},{D}_{\mathrm{diff}}^{\mathrm{median}},{D}_{\mathrm{perf}}^{\mathrm{median}},{f}_{\mathrm{kurtosis}}\right)$ to classify breast lesions and achieved a cross-validation accuracy of 0.96. The RBF SVM classifier using our quantitative features from the IVIM and CTRW models performed poorly with a mean accuracy of 0.633 on the test set. However, the cross-validation (CV) accuracy on the training using our model was higher with a mean of 0.734. Furthermore, the study by Vidic et al (2018) did not investigate performance metrics on an independent test set; and their dataset was not available for a direct comparison. The difference between our results and Vidic et al (2018) may be caused by batch effects due to acquisition and fitting strategies.

Our results showed that the GB and RF classifiers are the most robust in classification performance as they provide the highest metrics for F1-score and accuracy. We find that the GB classifier is the most ideal as it has the highest AUC (although not significantly different than RF classifier) due to its consistency over a larger number of bootstrapped iterations. Moreover, neither of the classifiers underfit or overfit as the error in the independent test set is similar to the error in the cross-validation set. This can be clinically advantageous as each patient's parameter maps are unique and distinct with regards to the breast lesion. In addition, as shown in appendix A, the dataset which consists of only full tumor ROIs (appendix A, figures A1–A4, Column 3) and the one that consists of a subset of the cropped tumor ROIs (appendix A, figures A1–A4, Columns 4–6) yielded different statistics of the mean, variance, and median. This indicates that the additional cropped samples have enough differentiation from the full tumor ROI for the classifier to avoid fitting on only a specific set of features. In summary, the classifier exhibited a high degree of generalizability; features that represent malignant tumors correspond correctly regardless of whether they are extracted from a cropped or full tumor ROI.

Our study has limitations. First, the retrospective analysis of imaging results was performed at a single institution with a small sample size. We addressed this issue in this study by augmenting our dataset through random crops of the tumor ROIs which reduces the chance of overfitting, and improves classifier performance at the cost of the assumption that the randomly cropped tumor ROIs have the same pathology as the original tumor ROI (Takahashi et al 2018). Since the classifier produced a high and consistent prediction rate with low variance on both full and cropped tumor ROIs during testing, the proposed technique has potential to scale to larger datasets for discriminating benign and malignant full tumors. The results of this study can be further validated by conducting a future study with a larger sample size. Second, although our quantitative features were reproducible, it remains unclear whether the GB or RF classifier can account for variations across scanners and sites. We can address this problem by correcting for batch effects, however, this is currently infeasible as it would require repeated samples of the same subject. Third, while the gold standard employed in the study was determined by biopsy taken from a specific area of the breast lesion, the discrimination of the benign and malignant lesions based on the DWI model parameters has been performed on the entire lesion. Ideally, we would have multiple biopsies of the same tumor, however this would also be infeasible due to the invasiveness required. Finally, this study considered only two of the advanced DWI models whose parameters can be related to specific tissue properties such as cellularity, vascularity, and heterogeneity. As no direct correlation between these parameters and the tissue properties based on histology was investigated in this study, our results do not directly demonstrate any specific microstructural distinctions between lesions. Nevertheless, the purpose of this study is to offer a multi-parametric approach that utilizes a machine learning-based automated analysis technique. This approach allows for the utilization of the most significant parameter-based features related to the tissue pathology, rather than providing a full biophysical explanation.

Conclusion

In conclusion, we have shown that multiple statistical features of IVIM and CTRW model parameters can be used conjointly to characterize breast lesions. Analysis of the multiple features can show that the GB classifier performs the best to discriminate between malignant and benign breast lesions. This approach allowed us to simultaneously probe tissue properties from multiple facets and may be extended to characterization of other tissue pathologies using IVIM and CTRW model parameters.

Acknowledgments

The authors thank Drs Cui Feng, Kaibao Sun, Qingfei Luo, and Alessandro Scotti for helpful discussions.

Data availability statement

The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available upon reasonable request from the authors.

Funding

This work was supported in part by the Zhejiang Basic Public Welfare Research Project (No. LGF21H180003). The authors whose affiliation is with The University of Illinois at Chicago are not associated with this grant.

Appendix A: : Data augmentation

Our augmentation strategy that was applied on each parameter map is as follows:

1.
Centering the region of interest (ROI) on a 64 × 64 grid.
2.
Setting the parameter values outside of the ROI to 0.
3.
Cropping the original ROIs into the three random partitions called Augmented Crop set 1, Augmented Crop set 2, and Augmented Crop set 3.

The boxplots of the CTRW model parameters, D_m, α, and β and IVIM model parameters D_diff, D_perf, and f, after each augmentation step is given in appendix figures A1–A6, respectively.

**Figure A1.** $\alpha$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

$\alpha $ — **Figure A1.** $\alpha$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A2.** $\beta$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

$\beta $ — **Figure A2.** $\beta$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A3.** ${D}_{\mathrm{diff}}$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A3.** ${D}_{\mathrm{diff}}$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A4.** ${D}_{m}$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A4.** ${D}_{m}$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A5.** ${D}_{\mathrm{perf}}$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A5.** ${D}_{\mathrm{perf}}$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A6.** $f$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistic, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

**Figure A6.** $f$ parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistic, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
Download figure:
Standard image High-resolution image

Appendix B:: Significance of top features between malignant and benign groups

We used a Mann-Whitney U-test for the comparison of the following top features as discussed in the main text: median of β (β_median), skewness of β (β_skewness), mean of β (β_mean), 75% quantile of f (f_75%) , 75% quantile of D_diff $\left({D}_{\mathrm{diff}}^{75 \% }\right),$ 75% quantile of D_m ( ${D}_{m}^{75 \% }$ ), and median of D_m ( ${D}_{m}^{\mathrm{median}}$ ). We used a Chi-square test to compare the variance of the kurtosis of D_perf $\left({D}_{\mathrm{perf}}^{\mathrm{kurtosis}}\right)$ between malignant and benign groups as itsmean was not statistically significant. The top features were found to be statistically significantly different between malignant and benign groups as shown in appendix table B1.

Table B1. The p-values for the statistically significant differences between the top features between malignant and benign groups.^{^a}

Parameter name	p-value
β_median	0.003
β_skewness	1.98e-6
β_mean	0.009
f_75%	5.95e-10
${D}_{\mathrm{diff}}^{75 \% }$	5.95e-10
${D}_{m}^{75 \% }$	4.92e-10
${D}_{m}^{\mathrm{median}}$	9.24e-7
${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$	0.007

^aThe p-values were generated with Mann-Whitney U-test with the exception of ${D}_{\mathrm{perf}}^{\mathrm{kurtosis}}$ where a Chi-square test was used.

Appendix C:: Tumor size comparison between f parameter maps

To evaluate whether tumor size and histology (i.e. breast lesion malignancy) effects the skewness of the f parameter, we used a linear regression model:

$\begin{eqnarray*}&&{\mathrm{log}}_{10}{y}_{i}=\gamma +{\phi }_{1}{s}_{i}+{\phi }_{2}{x}_{i}+{\int }_{i},\end{eqnarray*}$

where ${y}_{i}$ is the skewness of the i-the sample, $\gamma$ is the intercept denoting the overall skewness, ${\phi }_{1}$ is the linear change of skewness across samples due to size, ${\phi }_{2}\,$ is the linear change of skewness across samples due to histology, ${x}_{i}$ is an indicator variable for the tumor histology (malignant is 1, benign is 0), ${s}_{i}$ is the size of the i-th sample, and ${{\epsilon }}_{i}$ is the normally distributed residual for the i-th sample with zero mean and unit variance (i.e. ${\epsilon }\sim N\left(0,1\right)$ ) . Log-transformed skewness was used as the outcome to ensure normally distributed residuals and homogeneity of variance. Thus, a statistical test (t-statistic) of the null hypothesis ${H}_{0}:{\phi }_{2}=0$ addressed the main question whether histology influences the skewness of the f parameter maps.

For the normalized parameter maps (appendix figure C1) the coefficient ${\phi }_{2}=\,0.6346\,\left(97.5{\rm{ \% }}\,{CI}\,-0.098,\,1.357\right)\,$ and p-value = 0.036 indicating that skewness increases on a logarithm scale by 0.6346 among malignant lesions. The coefficient for size, ${\phi }_{1},$ has no significant impact on the skewness p-value = 0.184. In the unnormalized case (appendix figure C2) the histology has no significant impact on the skewness at a p-value = 0.763. This demonstrates the reason why 75% quantile of f $\left({f}_{75 \% }\right)$ in the main text has much higher values for the benign lesion and is one of the most important features in our study.

**Figure C2.** Regression plot of Skewness against size and histology when f map parameters are not normalized. There is no relationship between skewness, size, and histology when the paremeters are not normalized.
Download figure:
Standard image High-resolution image

Appendix D:: ADC-based analysis with a gradient boosted classifier

Here, we show the results from a conventional DWI-based approach that uses histogram features of ADC with a GBC. As shown in appendix figure D1 below, the combination of the most significant histogram features of ADC (mean, variance, and kurtosis) provided a mean accuracy of 0.711 ± 0.147 (mean ± std deviation) for differentiation between benign and malignant breast lesions. The AUC (figure D2), on the other hand, was found to be 0.750 which is similar to the previously reported results in Hu et al (Hu et al 2019) that produced an AUC of 0.680 with a random forest classifier and 0.740 with a deep learning classifier. These results demonstrate that the ADC-based histogram features yielded lower mean accuracy (0.711) than the one based on IVIM and CTRW parameters (0.833) with also a higher variance (0.147 versus 0.033).

**Figure D2.** ROC curve of histogram features of ADC using GBC.
Download figure:
Standard image High-resolution image

Characterization of breast lesions using multi-parametric diffusion MRI and machine learning

Article metrics

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Abstract

Abbreviations

Introduction