Characterization of breast lesions using multi-parametric diffusion MRI and machine learning

Objective. To investigate quantitative imaging markers based on parameters from two diffusion-weighted imaging (DWI) models, continuous-time random-walk (CTRW) and intravoxel incoherent motion (IVIM) models, for characterizing malignant and benign breast lesions by using a machine learning algorithm. Approach. With IRB approval, 40 women with histologically confirmed breast lesions (16 benign, 24 malignant) underwent DWI with 11 b-values (50 to 3000 s/mm2) at 3T. Three CTRW parameters, D m , α, and β and three IVIM parameters D diff, D perf, and f were estimated from the lesions. A histogram was generated and histogram features of skewness, variance, mean, median, interquartile range; and the value of the 10%, 25% and 75% quantiles were extracted for each parameter from the regions-of-interest. Iterative feature selection was performed using the Boruta algorithm that uses the Benjamin Hochberg False Discover Rate to first determine significant features and then to apply the Bonferroni correction to further control for false positives across multiple comparisons during the iterative procedure. Predictive performance of the significant features was evaluated using Support Vector Machine, Random Forest, Naïve Bayes, Gradient Boosted Classifier (GB), Decision Trees, AdaBoost and Gaussian Process machine learning classifiers. Main Results. The 75% quantile, and median of D m ; 75% quantile of f; mean, median, and skewness of β; kurtosis of D perf; and 75% quantile of D diff were the most significant features. The GB differentiated malignant and benign lesions with an accuracy of 0.833, an area-under-the-curve of 0.942, and an F1 score of 0.87 providing the best statistical performance (p-value < 0.05) compared to the other classifiers. Significance. Our study has demonstrated that GB with a set of histogram features from the CTRW and IVIM model parameters can effectively differentiate malignant and benign breast lesions.


Introduction
Breast cancer is the most frequent cancer among women and leading cause of female cancer death worldwide (Bray et al 2018). An accurate characterization of breast lesions is important for efficient risk assessment and optimized treatment. Magnetic resonance imaging (MRI) has been used to evaluate suspicious breast lesions and demonstrated superior sensitivity to ultrasound and mammography (Morris 2010). Dynamic contrastenhanced MRI (DCE-MRI) is a prevalent MRI technique for breast cancer diagnosis (Sardanelli et al 2010). Although DCE-MRI offers a high sensitivity of 94%-100% (Orel and Schnall 2001), the reported specificity varies widely from 37% to 97% (Orel and Schnall 2001, Guo et al 2002, Woodhams et al 2005, Rubesova et al 2006, Lo et al 2009, and the use of exogenous contrast in DCE-MRI is potentially problematic in patients with compromised kidney functions, cardiac conditions, respiratory disorders, or pregnancy (Beckett et al 2015).
With its ability to probe underlying tissue microstructures, diffusion-weighted magnetic resonance imaging (DWI) has been increasingly used for characterizing breast lesions (Partridge and McDonald 2013, Partridge et al 2017, Dietzel et al 2020. Numerous studies have reported a lower apparent diffusion coefficient (ADC) in malignant breast lesions due to increased cellularity compared to benign lesions (Partridge and McDonald 2013). For differentiation between malignant and benign breast lesions, ADC has been shown to achieve a sensitivity of 85%-95% (Guo et al 2002, Woodhams et al 2005, Rubesova et al 2006, Lo et al 2009) and a specificity ranging from 50% to 90% (Guo et al 2002, Woodhams et al 2005, Rubesova et al 2006, Lo et al 2009. To improve diagnostic performance, DWI with ADC has been used conjointly with DCE-MRI (Partridge et al 2009). While the combination of the two MRI techniques has improved breast lesion characterization, the mono-exponential model that the ADC is derived from assumes Gaussian diffusion where the diffusion medium is typically homogeneous. Given the heterogeneous nature of biological tissues, particularly the tumors that exhibit a high degree of structural heterogeneity and complexity, it is well known that water diffusion in tissues does not follow a Gaussian distribution (Le Bihan et al 1988). Therefore, ADC has not been fully established as an imaging marker for breast lesion characterization (Pinker et al 2018).
ADC has been conventionally associated with tissue cellularity (Chen et al 2013), however, breast tissue has many other properties, such as vascularity and heterogeneity, that are relevant to neoplastic changes. Recent studies have indicated that these properties can be probed by utilizing a specific portion of the b-value 'spectrum' in DWI (Tang and Zhou 2019). The lower b-value range (∼0-200; and 800 s/ mm 2 ) can reveal tissue cellularity and micro-vascularity by using an intravoxel incoherent motion (IVIM) model with three parameters: diffusion coefficient (D diff ), pseudo-diffusion coefficient (D perf ), and perfusion fraction ( f ) (Le Bihan et al 1988). On the other hand, high b-values (e.g. 1000-4000 s/mm 2 ) can help probe tissue microstructures using a wealth of advanced diffusion models (Tang and Zhou 2019). One such model is the continuous-time random-walk (CTRW) model (Karaman et al 2016a, Zhong et al 2019, which has been shown to be sensitive to intravoxel diffusion heterogeneities in time and space (Magin et al 2008, Zhou et al 2010, Ingo et al 2014, as reflected by two parameters α and β, respectively. These parameters, together with those from a predecessor fractional order calculus (FROC) model (Magin et al 2008, Zhou et al 2010, have been successfully applied to tissue characterizations in adult (Sui et al 2016) pediatric brain tumors (Karaman et al 2016a, 2016b), gastric cancer , gastrointestinal stromal tumor (Tang et al 2018), hepatocellular carcinoma (Chen et al 2021), and bladder cancer (Feng et al 2022). For breast cancer characterization, however, only a few studies with high-b-value DWI have (Feng et al 2022) been reported and the results are inconclusive (Dietrich et al 2010, Bedair et al 2017, Bickelhaupt et al 2017. Moreover, the use of a full b-value spectrum to simultaneously probe multiple tissue properties such as cellularity, vascularity, and heterogeneity, has not been reported.
While the IVIM and CTRW model parameters can offer valuable insights into a lesion, the relative importance of these parameters for delineating malignant and benign lesions remains unknown. In this study, we use machine learning to investigate the relative importance of these parameters. Specifically, we generate histogram features obtained from the parameters of a low-b-value IVIM model and a high-b-value CTRW model, then investigate the performance of the histogram features for differentiating malignant and benign breast lesions using machine learning algorithms.

Patient characteristics
The institutional review board approved this retrospective study, and written informed consent was obtained from all participating patients. Thirty-one patients (age range: 27-86) with a total of 40 pathologically verified breast lesions were enrolled. The inclusion criteria were as follows: (1) aged 18 years or older, (2) no previous neoadjuvant chemotherapy or radiotherapy before biopsy or surgery, and (3) presence of histopathologically verified breast lesions with biopsy or surgery within one month after breast lesions detected on MRI, and (4) ACR BI-RADS 4 or 5 (suspiciously malignant or highly suggestive of malignancy) and ACR BI-RADS 2 or 3 (benign or probably benign) lesions detected on breast MRI. The exclusion criteria were: (1) excessive artifacts in the MR images, (2) non-mass-like enhancement, (3) benign cysts, (4) ACR BI-RADS 0 (i.e. further evaluation) or (5) ACR BI-RADS 6 lesions (i.e. biopsy-confirmed malignancy) as the radiologists were blinded to the histologic evaluations.

Diffusion-weighted image analysis
The multi-b-value diffusion-weighted (DW) images were first analyzed with the CTRW model whose diffusionattenuated MR signal takes the following form: where D m (in mm 2 /s 1 ) is an anomalous diffusion coefficient, the parameters α and β (dimensionless) are temporal and spatial fractional orders that are related to temporal and spatial diffusion heterogeneities, respectively, and E α is a Mittag-Leffler function (Karaman et al 2016a).
The DW images in the low b-value range (0-800 s/mm 2 ) were also analyzed by using the IVIM model in which the diffusion-attenuated MR signal is given as (Le Bihan et al 1988): where f is the perfusion fraction, D diff is diffusion coefficient, and D perf is the pseudo-diffusion coefficient, both in units of mm 2 /s 1 . The CTRW model parameters, D m , α, and β, and the IVIM model parameters, f, D diff and D perf , were estimated by fitting equations (1) and (2) respectively using a nonlinear least-squares estimation with an iterative Levenberg-Marquardt method in Matlab (MathWorks, Inc., Natick, MA). After noise filtering and Rician noise correction (Zhou et al 2010), the nonlinear fitting proceeded with two major steps for the CTRW model: (a) estimating D m by a mono-exponential model using the diffusion images at low b-values (b 1100 s/mm 2 ) and (b) simultaneously estimating α and β from the diffusion images in the full b-value spectrum (b-values = 0-3000 s/mm 2 ). For the IVIM model fitting, a 'segmented' approach was employed using the diffusion images at bvalues up to 800 s/mm 2 in three steps (Luciani et al 2008).

Machine learning analysis
The following subsections discuss how IVIM and CTRW model parameters are utilized in a machine learning framework for extracting imaging characteristics that can differentiate malignant and benign breast lesions. A visualization of our machine learning framework is shown in figure 1.

Pre-processing
The ROIs in tumor-containing slices were manually delineated by a radiologist, and then applied to the CTRW parameter maps, D m , α, and β; and the IVIM parameter maps, f, D diff , and D perf , resulting in six parameter maps for each patient, as shown in figure 1. This is followed by a max-min normalization of each distinct parameter map across samples into a range of [0, 1] (appendix figures A1-A6). This normalization procedure allows us to compare all the parameter maps and is a standard procedure for classification algorithms (Breiman 2001). For example, the f parameter maps were normalized into a range of [0, 1] with respect to the maximum and minimum of all the f parameter maps in our study. Finally, the ROIs were randomly cropped to increase the dataset size as a general technique to prevent overfitting (Takahashi et al 2018). While other augmentation techniques exist, many of these methods would neither improve the performance due to the type of features to be extracted from the diffusion parameters (i.e. random rotations) nor be applicable to diffusion MRI parameters (adding color/noise). Ultimately, the dataset consisted of the original 40 full tumor ROIs and an additional 80 cropped subsets of the ROIs, resulting in a total of 120 samples (appendix A for more details).

Feature extraction
A histogram of the voxel values within the ROI was generated for each diffusion parameter, D m , α, β, D diff, D perf, and f. For each of the six parameters, nine histogram features were evaluated including kurtosis, skewness, variance, mean, median, interquartile range, 10% quantile, 25% quantile, and 75% quantile values resulting in a total of 54 quantitative features for use in the machine learning analysis.

Feature selection
To ensure the robustness of the machine learning classifiers, the Boruta algorithm, an iterative feature selection method, was used to select a subset of relevant histogram features by removing the most irrelevant and redundant features from the data (Kursa and Rudnicki 2010). To control for false positives, we used a modified two-stage multiple testing methodology process. The Benjamin Hochberg FDR test was first used to select the most significant features within an iteration, followed by the Bonferroni correction to account for multiple comparisons across multiple iterations of using the same set of features. At every iteration, feature importance Figure 1. Workflow of image analysis from left to right with corresponding parameter maps and histograms. The data analysis began with lesion contouring by lesion ROI selection and DWI parameter extraction within the selected ROIs. Preprocessing of the dataset was then performed by normalization and centering the histogram, followed by ROI augmentation. We extracted features consisting of the mean, kurtosis, median, 10% quantile, 25% quantile, inter quartile range, and 75% quantile values of the histogram. We then used the Boruta algorithm to determine the most significant features and used these features in various machine learning classifiers to discriminate malignant and benign lesions. was determined for all features with respective to the classification performance such that the most significant features are ranked as one.

Machine learning algorithms
The ability of the top features to predict a lesion as benign or malignant was compared using Kernel Methods, Ensemble Methods, and Naïve Bayes. Kernel methods included kernelized support vector machine (SVM) and Gaussian Processes (GP). Ensemble methods included techniques that make predictions based on either subsampling (i.e. training individual parallel classifiers using bootstrapped samples with a random subset of feature variables) such as bagging with Decision Trees (DT) and Random Forest (RF); or boosting (i.e. training sequential individual classifiers on weighted samples such that the next classifier concentrates more on the misclassified samples from the previous classifier) such as Gradient Boosted (GB) or adaBoost (AB).

Statistical analysis
We created a stringent statistical analysis to increase the validity of our results. First, we created an 80%-20% split-ratio for the training and test data sets. Second, we ensured that the test set only included samples that were mutually exclusive from the training set such that the classifier only tests on unseen samples. Such a methodology provided assurance that a model which performed well on the training set would also perform well on the test set. Third, we employed scikit-learn (Pedregosa et al 2011) to create a model selection scheme of the classifiers using a 'stratified-nested' cross-validation that reduces overfitting (Krstajic et al 2014). This also allowed us to optimize the classification and histogram parameters such as histogram bin width. Specifically, we performed this optimization by maximizing the area under the receiver operating characteristic curve (AUC) using grid search with a 'stratified-nested' cross-validation of 10 repeated 5-folds. Lastly, to account for a slightly imbalanced dataset, we under-sampled the majority class (malignant) during each grid search so that every fold contained an equal representation of each class. The optimized parameters for each classifier are given in our code (https://github.com/rmehta1987/MriDiffusion).
The classification algorithms in the testing stage were compared using AUC, F1-Score, and accuracy metrics with 10 000 bootstrapped samples to define 95% confidence intervals. We used a Mann-Whitney U-test to perform a pair-wise comparison on the performance metrics (specificity, sensitivity, or accuracy) of the classifiers with a statistical significance set at p-value < 0.05. Similarly, a comparison of the most significant quantitative features after feature selection within malignant and benign samples was performed by generating a p-value using a Mann-Whitney U-test with a statistical significance set at p-value < 0.05.

Results
Diffusion parameter maps Figure 2 shows the maps of CTRW model parameters D m , α, and β (a)-(c) and IVIM model parameters D diff , D perf , and f (d)-(f) from two representative patients with malignant (left column) and benign lesions (right column). Qualitatively, the malignant lesion exhibited lower values in D m , α, β, and D diff and higher values in f in some portions of the ROI than the benign lesion. The malignant lesion also showed a larger variance of the parameters within the ROI in comparison to the benign lesion.

Comparison of features
The top 18 features among the 54 features are shown in figure 3(a) with 95% confidence intervals. Of these 18 features, the top 8 features that statistically significantly improve classification performance are: median of β (β median ), skewness of β (β skewness ), mean of β (β mean ), perf kurtosis 75% quantile of D m (D m 75% ), and median of D m (D m median ). The boxplots of these top features for both the malignant and benign groups are given in figure 3(b) and table 1, respectively. These features were found to be significantly different between the malignant and benign lesions with a p-value < 0.05 (Supplementary table 1). We used b mean rather than b kurtosis as the classifier performance metrics (AUC, F1score, accuracy) were consistently better with b mean than with b , kurtosis even though their importance rankings are similar.

Comparison of classifier performance
In the cross-validation stage, the comparison between the ROC curves ( figure 4) showed that among the eight machine learning classifiers (linear SVM, RBF SVM, GP, DT, RF, AdaBoost, NB, and GB), GB and RF classifiers performed the best with respect to AUC of ROC curves (p-value < 0.05). The swarm plots given in   figures 5(a)-(c) summarize the performance metrics of all the classifiers, which again show that the GB and RF classifiers outperformed the other classifiers in all performance metrics (p-value < 0.05). Similar results were observed in the performance metrics of the classifiers obtained during the testing stage indicating that the machine learning algorithms can discriminate between unseen samples. The boxplots of AUC, accuracy, and F1 score of all classifiers are given in figures 6(a)-(c), respectively and summarized with the mean and standard error in table 2. Again, the GB classifier provided the best performance with the highest mean AUC of 0.942 (p-value < 0.05) with a 95% confidence interval of [0.904, 0.981]. The GB and RF classifiers both performed the best in terms of accuracy and F1 score (p-value < 0.05) where the GB classifier had a mean accuracy of 0.833 and a mean F1 score of 0.872; the RF classifier had an accuracy of 0.833 and a mean F1 score of 0.871. While the RF and GB classifiers had similar performance, we found that the GB classifier consistently had the highest AUC over a larger number of iterations performed on the test set and as such offers the best discrimination between benign and malignant lesions.

Discussion
We have shown that a combination of quantitative features from the parameters of the CTRW and IVIM diffusion models can characterize malignant and benign breast lesions with a machine learning-based approach. This approach has allowed us to probe multiple tissue properties such as vascularity, cellularity, and heterogeneity using a set of DW images acquired with a full b-value spectrum.
The shortcomings of using only median or mean parameter values in lesion characterization have led to recent diffusion MRI studies that extract more quantitative features obtained from the ADC or IVIM model parameters. These studies have used machine learning ( An initial histogram analysis of the quantitative features is a common tool within the studies to collate additional information by parsing the statistical distribution of the DWI parameters that is generated by the histogram. Our study extends this data-driven approach to the characterization of breast tissue by using a set of quantitative features of diffusion parameters that are related to various tissue properties such as vascularity, cellularity, and/or heterogeneity. The relationship between the conventional ADC parameter and the underlying tissue microstructures has been extensively studied by others (Chen et al 2013, Surov et al 2017. Many of these studies have observed a correlation between ADC and tissue cellularity based on histopathologically estimated cell count (Chen et al 2013). Sharing the same units as the conventional ADC, CTRW and IVIM models' diffusion coefficient parameters, D m and D diff , respectively, have a close resemblance to ADC as they also reflect tissue diffusivity, and hence, are related to tissue cellularity. Utilizing the low b-values to extract the microcapillary perfusion component from the DW signal, IVIM model has been shown to probe tissue vascularity through its perfusion fraction parameter, f, based on its correlation to contrast enhancement obtained with DCE-MRI (Sigmund et al 2011) or histopathological vascular density , Togao et al 2018. The relatively new CTRW model, on the other hand, incorporates intravoxel diffusion heterogeneity into the classical Fick's equation by using two fractional order parameters, α and β, with respect to time and space, respectively. While originating from the same underlying tissue property of structural heterogeneity, CTRW model's α and β reflect two independent and complementary aspects of diffusion heterogeneities: temporal and spatial heterogeneity. Water molecules can take a variable time to make a move, or can produce a variable displacement in each move. The β parameter which accounts for the non-Gaussian distribution of diffusion displacement (Zhou et al 2010, Magin et al 2008, 2011 has been related to the diffusion jump length (Ingo et al 2014) that is associated with intravoxel spatial heterogeneity. The α parameter, on the other hand, reflects the temporal heterogeneity of the diffusion process, that is, the likelihood of the water molecule to be 'trapped' or 'released' while it diffuses through the complex tissue structure and environment. With its generalized representation that can characterize both temporal and spatial diffusion behavior, the CTRW model has the potential to be used for studying timedependency of water diffusion behavior, which has gained increasing attention in the diffusion MRI research community in recent years (Pyatigorskaya et al 2014, Novikov et al 2019, Wu and Zhang 2019, Aggarwal et al 2020. While the CTRW model's time-dependency has been demonstrated in Sephadex gels (Dan et al 2021), in this study we have not investigated its time-dependency on breast tissue as it is challenging to acquire clinical data over a broad range of diffusion times.
The most important quantitative feature for differentiating malignant and benign lesions in our study was the 75% quantile value of the CTRW parameter, D m , which was found to be significantly lower in malignant lesions than the benign ones (p-value = 5.945e-10). This result is consistent with previous studies (Partridge et al 2017) which showed that malignant breast lesions exhibit lower diffusion coefficient (i.e. ADC) values. Our study differs from these previous studies with the use of higher b-values and evaluation of multiple quantitative features obtained from the diffusion coefficient for discriminating malignant and benign lesions. It is worth noting that the 75% quantile of D m , which may have a higher sensitivity to detect focal regions of higher cellularity, outperformed the mean D m which was not among the top performing features. We also performed a conventional DWI-based analysis that uses histogram features of ADC with a GBC classifier. The results given in appendix D demonstrated that the performance of the ADC-based histogram features was not only lower than the one based on IVIM and CTRW parameters, but the high variance suggested that the ADC-based classifier was highly sensitive to parameter perturbations.
The mean, median, and skewness of β have been found sensitive to changes in malignant breast lesions which concur with our results where the mean (p-value = 0.0099), median (p-value = 0.0003), and skewness of β (p-value = 1.98e-6) were significantly different in malignant and benign lesions. β has also been reported to be related to intravoxel tissue heterogeneity in a growing number of studies using CTRW and its predecessor FROC model (Magin et al 2008, Zhou et  While the features extracted from the spatial diffusion heterogeneity parameter, β, has been found to show differences between the benign and malignant lesions, our study did not find that the temporal diffusion heterogeneity parameter, a,contributed significantly to characterizing breast lesions. The exact underpinning biophysical reason remains to be investigated, but was likely related to inadequate signal-to-noise ratios at high b-values and a fixed diffusion time was used in the experiments. This suggests that the FROC model, which features on β parameter, may be adequate for breast tissue characterization under the experimental conditions of the current study. Our analysis of the IVIM model ( figure 3) showed that kurtosis of D perf ( ) D , perf kurtosis the 75% quantile of f 75% were among the top performing features to discriminate malignant and benign breast lesions. The mean values of the features, f 75% (p-value = 5.95e-10), and D diff 75% (pvalue = 5.95e-10), were significantly higher in benign lesions (p-value = 5.95e-10 for both). While the mean D perf kurtosis was similar in both lesion types, the variance of D perf kurtosis was significantly higher in benign lesions (pvalue = 0.007). Other studies have also reported the use of features from the f D and diff parameters to discriminate breast lesions (Song et al 2014, Vidić et al 2018. These studies showed quantitative features generated from f parameter maps were higher in malignant cases. Our results were inverse. We attribute this to the max-min normalization process in our preprocessing stage. Specifically, the size of the malignant lesions was found to be significantly larger than benign lesions (p-value = 2.114e-08), and f within the lesion ROIs had higher positive skewness in the malignant lesions than in the benign lesions. As such, while malignant lesions exhibited higher f values, most of the lesion had a very high concentration of low f values. Therefore, the maxmin normalization, which was required for pre-processing the parameter maps, further increased the influences of the lower values in the f parameter maps in the malignant cases. We verified this with a simple linear regression analysis (appendix B for more details) and showed that f maps had significantly higher (positive) skewness (p-value < 0.05) when the lesion is malignant (appendix figure C1). Before normalization, the values in the f parameter maps in our study were consistent with established literature (appendix figures A6 and C2) (Song et al 2014). The underlying biological implications from the IVIM parameter maps indicate that specific areas of a lesion can be malignant, and as a result have high vascularity, while other regions within a lesion may not be malignant and as a result have low vascularity. The different regions of a lesion having different properties recapitulates cancer causes lesion heterogeneity.
A comprehensive analysis of multi-parametric information provided by the IVIM and CTRW models offers a unique perspective to the characterization of underlying breast tissue. This is consistent with previous studies which showed improved performance with a combination of quantitative features from DWI model parameters median kurtosis to classify breast lesions and achieved a cross-validation accuracy of 0.96. The RBF SVM classifier using our quantitative features from the IVIM and CTRW models performed poorly with a mean accuracy of 0.633 on the test set. However, the cross-validation (CV) accuracy on the training using our model was higher with a mean of 0.734. Furthermore, the study by Vidic et al (2018) did not investigate performance metrics on an independent test set; and their dataset was not available for a direct comparison. The difference between our results and Vidic et al (2018) may be caused by batch effects due to acquisition and fitting strategies.
Our results showed that the GB and RF classifiers are the most robust in classification performance as they provide the highest metrics for F1-score and accuracy. We find that the GB classifier is the most ideal as it has the highest AUC (although not significantly different than RF classifier) due to its consistency over a larger number of bootstrapped iterations. Moreover, neither of the classifiers underfit or overfit as the error in the independent test set is similar to the error in the cross-validation set. This can be clinically advantageous as each patient's parameter maps are unique and distinct with regards to the breast lesion. In addition, as shown in appendix A, the dataset which consists of only full tumor ROIs (appendix A, figures A1-A4, Column 3) and the one that consists of a subset of the cropped tumor ROIs (appendix A, figures A1-A4, Columns 4-6) yielded different statistics of the mean, variance, and median. This indicates that the additional cropped samples have enough differentiation from the full tumor ROI for the classifier to avoid fitting on only a specific set of features. In summary, the classifier exhibited a high degree of generalizability; features that represent malignant tumors correspond correctly regardless of whether they are extracted from a cropped or full tumor ROI.
Our study has limitations. First, the retrospective analysis of imaging results was performed at a single institution with a small sample size. We addressed this issue in this study by augmenting our dataset through random crops of the tumor ROIs which reduces the chance of overfitting, and improves classifier performance at the cost of the assumption that the randomly cropped tumor ROIs have the same pathology as the original tumor ROI (Takahashi et al 2018). Since the classifier produced a high and consistent prediction rate with low variance on both full and cropped tumor ROIs during testing, the proposed technique has potential to scale to larger datasets for discriminating benign and malignant full tumors. The results of this study can be further validated by conducting a future study with a larger sample size. Second, although our quantitative features were reproducible, it remains unclear whether the GB or RF classifier can account for variations across scanners and sites. We can address this problem by correcting for batch effects, however, this is currently infeasible as it would require repeated samples of the same subject. Third, while the gold standard employed in the study was determined by biopsy taken from a specific area of the breast lesion, the discrimination of the benign and malignant lesions based on the DWI model parameters has been performed on the entire lesion. Ideally, we would have multiple biopsies of the same tumor, however this would also be infeasible due to the invasiveness required. Finally, this study considered only two of the advanced DWI models whose parameters can be related to specific tissue properties such as cellularity, vascularity, and heterogeneity. As no direct correlation between these parameters and the tissue properties based on histology was investigated in this study, our results do not directly demonstrate any specific microstructural distinctions between lesions. Nevertheless, the purpose of this study is to offer a multi-parametric approach that utilizes a machine learning-based automated analysis technique. This approach allows for the utilization of the most significant parameter-based features related to the tissue pathology, rather than providing a full biophysical explanation.

Conclusion
In conclusion, we have shown that multiple statistical features of IVIM and CTRW model parameters can be used conjointly to characterize breast lesions. Analysis of the multiple features can show that the GB classifier performs the best to discriminate between malignant and benign breast lesions. This approach allowed us to simultaneously probe tissue properties from multiple facets and may be extended to characterization of other tissue pathologies using IVIM and CTRW model parameters. Figure A2. b parameter map statistics between benign (green) and malignant (orange) groups. From top to bottom: the y-axis indicates the value of the following statistics, mean, min, max, and size of ROI. From left to right: type of dataset from original, to normalized, to augmented original normalized, augmented Crop Set 1, Augmented Crop Set 2, and Augmented Crop Set 3.
3. Cropping the original ROIs into the three random partitions called Augmented Crop set 1, Augmented Crop set 2, and Augmented Crop set 3.
The boxplots of the CTRW model parameters, D m , α, and β and IVIM model parameters D diff , D perf , and f, after each augmentation step is given in appendix figures A1-A6, respectively.  We used a Mann-Whitney U-test for the comparison of the following top features as discussed in the main text: median of β (β median ), skewness of β (β skewness ), mean of β (β mean ), 75% quantile of f ( f 75% ) , 75% quantile of where y i is the skewness of the i-the sample, g is the intercept denoting the overall skewness, f 1 is the linear change of skewness across samples due to size, f 2 is the linear change of skewness across samples due to histology, x i is an indicator variable for the tumor histology (malignant is 1, benign is 0), s i is the size of the i-th sample, and  i is the normally distributed residual for the i-th sample with zero mean and unit variance (i.e. ( )  N 0, 1 ) . Log-transformed skewness was used as the outcome to ensure normally distributed residuals and  For the normalized parameter maps (appendix figure C1) the coefficient 0.098, 1.357 2 and p-value = 0.036 indicating that skewness increases on a logarithm scale by 0.6346 among malignant lesions. The coefficient for size, f , 1 has no significant impact on the skewness p-value = 0.184. In the unnormalized case (appendix figure C2) the histology has no significant impact on the skewness at a p-value =0.763. This demonstrates the reason why 75% quantile off ( ) f 75% in the main text has much higher values for the benign lesion and is one of the most important features in our study.

Appendix D: ADC-based analysis with a gradient boosted classifier
Here, we show the results from a conventional DWI-based approach that uses histogram features of ADC with a GBC. As shown in appendix figure D1 below, the combination of the most significant histogram features of ADC (mean, variance, and kurtosis) provided a mean accuracy of 0.711 ± 0.147 (mean ± std deviation) for differentiation between benign and malignant breast lesions. The AUC (figure D2), on the other hand, was Figure C2. Regression plot of Skewness against size and histology when f map parameters are not normalized. There is no relationship between skewness, size, and histology when the paremeters are not normalized. Figure D1. Performance metrics of the top performing histogram features (mean, variance, and kurtosis) of ADC with a GBC.