Topical Review

Machine learning for multi-parametric breast MRI: radiomics-based approaches for lesion classification

Published 20 July 2022 © 2022 Institute of Physics and Engineering in Medicine
Citation Luisa Altabella et al 2022 Phys. Med. Biol. 67 15TR01 DOI 10.1088/1361-6560/ac7d8f

0031-9155/67/15/15TR01

Abstract

In the artificial intelligence era, machine learning (ML) techniques have gained more and more importance in the advanced analysis of medical images in several fields of modern medicine. Radiomics extracts a huge number of medical imaging features revealing key components of tumor phenotype that can be linked to genomic pathways. The multi-dimensional nature of radiomics requires highly accurate and reliable machine-learning methods to create predictive models for classification or therapy response assessment.

Multi-parametric breast magnetic resonance imaging (MRI) is routinely used for dense breast imaging as well as for screening in high-risk patients and has shown its potential to improve clinical diagnosis of breast cancer. For this reason, the application of ML techniques to breast MRI, in particular to multi-parametric imaging, is rapidly expanding and enhancing both diagnostic and prognostic power. In this review we will focus on the recent literature related to the use of ML in multi-parametric breast MRI for tumor classification and differentiation of molecular subtypes. Indeed, at present, different models and approaches have been employed for this task, requiring a detailed description of the advantages and drawbacks of each technique and a general overview of their performances.

1. Introduction—radiomics and machine learning (ML) in medical imaging

Recently, the volume of data produced by each individual has increased at incredible rates. In 2018, 33 zettabytes (ZB) of data were produced, an average of 6.6 TB for each of the 5 billion 'active' individuals (Reinsel et al 2018). By the end of 2025, the total amount of generated data is expected to reach about 175 ZB, with an astonishing average close to 30 TB per active person (active people being estimated at about 75% of the total world population). Part of this massive amount of data comes from healthcare, considering that nowadays many diagnostic and therapeutic techniques are based on digital imaging (for instance, a complete PET/MR scan can amount to gigabytes of data) and that all the clinical and radiological information is stored and available for further analysis or inspection (Alexander et al 2020). This rapid increase of data is consistent with the observed dramatic rise in the use of diagnostic imaging (Serrano Cardona and Muñoz Mata 2013) and with the new, cutting-edge high- and super-resolution modalities available through dedicated acquisition or reconstruction techniques, which tend to increase the data throughput of ordinary CT and MR exams (Van Reeth et al 2012, Isaac and Kulkarni 2015).

Even though the rapid increase of collected and available data seems positive and promising, the work pressure on the limited number of healthcare personnel has experienced a parallel escalation. The shortage of personnel dedicated to image inspection and analysis (mainly radiologists and physicists) is further accentuated by the global tendency towards cost reduction (Dash et al 2019). In this context, artificial intelligence (AI) is a promising candidate to pre-screen the huge amount of imaging data and to support physicians in their clinical practice. One of the major branches of AI is ML, where systems are trained to learn automatically from data without explicit instructions, mimicking human behavior and sometimes outperforming humans, either in terms of execution time (Kattan et al 1992) or performance, as testified by the famous chess match in which the world champion Garry Kasparov was defeated by the IBM supercomputer Deep Blue. In healthcare, for instance, a specific ML algorithm might be able to filter out the negative patients in screening mammography (Lei et al 2019) or to distinguish CT scans of COVID-19 positive patients from CT scans of other interstitial pneumonias, which look very similar to radiologists at visual inspection (Cardobi et al 2021).

Since ML-based analyses are quantitative and exclusively based on objective data processed through specific and well-documented algorithms, one of their main advantages is user-independence and reproducibility. Indeed, intra- and inter-observer variability has been shown to be a severe bottleneck in both contouring (Vinod et al 2016, Patrick et al 2021) and prescriptions (Audibert et al 2017, Jung et al 2019), whereas AI-based algorithms have recently been proposed to overcome both issues (Sultana et al 2020, Wong et al 2020). For these two reasons (i.e. the ability to analyze huge amounts of data in a limited time and the objective, quantitative and reproducible approach of ML), the attention of the scientific community and of healthcare personnel to AI is growing every day, pointing to a future where ML will be a milestone in the hospital workflow.

Even though an exhaustive overview of the ML techniques used in medical imaging lies outside the scope of the present manuscript, a brief technical introduction will accompany the reader across the specific glossary of the topic and the selected inclusion/exclusion criteria for the present review.

Depending on the kind of task, ML can be defined as supervised or unsupervised. In supervised learning, prior knowledge is used to define a reference on which the model is to be trained. This ground truth is used to teach the machine the relationship between the desired output (outcome) and the input data. Classical examples of supervised learning are regressions, where a function is fitted to reproduce the outcome variable based on the information contained in other variables. Once the function is fitted, the model is well-defined and can be used to predict the outcome in new cases. Supervised learning is the core of all the predictive or classification models in medical imaging. Unsupervised learning, on the other hand, aims to infer the intrinsic structure of data without prior knowledge. It is usually adopted to group data or to perform cluster analysis, often exploiting the internal correlation of the dataset.
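As a minimal illustration of the supervised workflow described above (fit on a ground truth, then predict new cases), the following sketch fits a straight line by least squares. The toy data and the names `x_train`, `y_train` and `predict` are hypothetical and are not taken from any of the reviewed studies.

```python
import numpy as np

# Toy training data (hypothetical): one predictor and a known outcome (ground truth).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = 2.0 * x_train + 1.0  # outcome generated from a known linear rule

# Supervised step: fit the coefficients of y = a*x + b by least squares.
design = np.column_stack([x_train, np.ones_like(x_train)])
(a, b), *_ = np.linalg.lstsq(design, y_train, rcond=None)

def predict(x_new):
    """Apply the fitted model to new, unseen cases."""
    return a * np.asarray(x_new, dtype=float) + b
```

Once `a` and `b` are fitted, `predict` plays the role of the well-defined model that is applied to new cases.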

At present, two main routes are exploited when supervised ML is applied to medical image analysis: neural networks, often referred to as deep ML or deep learning for brevity, and radiomic features extraction and the consequent analysis, usually labelled as radiomics.

Deep learning can exploit different kinds of neural networks and has an innate flexibility due to the high number of parameters and hyperparameters of the model (e.g. number of layers or epochs, activation function, batch size, etc). These systems are able to extract information directly from images, which are decomposed as they pass through the different layers. Nevertheless, these systems are often seen as black boxes, and the extraction of interpretable information from these models, even though possible (for instance, through activation maps), is sometimes complex.

On the other hand, radiomic analysis is based on the extraction of properties (features) from the region of interest (ROI), which are analyzed in a subsequent step. Since these features are well defined and standardized in the literature, the radiomic feature extraction process should be independent of the software used, as long as it respects guidelines such as IBSI (Zwanenburg et al 2020). From here on, the presented material will focus only on the radiomic process, the deep learning approach falling beyond the scope of this review.

Similarly to all the models in supervised ML, radiomics models applied to medical imaging require an input and an output, the latter being the outcome we are interested in.

The inputs are usually composed of clinical and imaging information. Within medical images, either 2D (as in planar radiography) or 3D (tomography), a ROI is delineated and the radiomic features are extracted from this region. The ROI contouring, sometimes called segmentation, can be accomplished through manual, semi-automatic, or automatic procedures. When the contouring is performed on whole organs, such as lungs or brain, fully automatic procedures are usually adopted and accepted (Cardobi et al 2021). When the lesion pertains to a limited region of a fairly homogeneous organ, semi-automatic segmentation can be exploited (Montemezzi et al 2021). If the ROI has to include different organs or is not circumscribed to a well-defined anatomical region, an expert radiologist is required to draw the ROI manually (Simoni et al 2020).

Before feature extraction, the original image can be transformed by using different filters (to mention just a few: gradient filter, Laplacian of Gaussian (LoG), wavelet, etc), increasing the number of extracted features up to thousands or millions. In addition to the filter parameters, other degrees of freedom are provided by the gray-level discretization (a parameter that must be fixed for textural feature extraction), the whole image normalization, the ROI resampling and the resampling algorithm. All these parameters must be fixed and clearly stated to ensure the reproducibility of results. The clinical information and the extracted radiomic features form the data to mine and the starting point to build the model. These input variables are usually referred to as predictor variables, predictors, or independent variables, even though this last definition might be misleading since these variables are often not independent from each other.
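To make the gray-level discretization step more concrete, the sketch below implements a fixed-bin-number scheme, one common convention for mapping ROI intensities to a small set of integer gray levels before textural features are computed. The function name and the default of 32 bins are illustrative assumptions; the reviewed studies may have used other schemes (for instance a fixed bin width) or other values.

```python
import numpy as np

def discretize_fixed_bin_number(roi_values, n_bins=32):
    """Map ROI intensities to integer gray levels 1..n_bins (fixed bin number)."""
    v = np.asarray(roi_values, dtype=float)
    v_min, v_max = v.min(), v.max()
    if v_max == v_min:
        # A perfectly uniform ROI collapses to a single gray level.
        return np.ones(v.shape, dtype=int)
    levels = np.floor(n_bins * (v - v_min) / (v_max - v_min)).astype(int) + 1
    # The maximum intensity would land in bin n_bins + 1, so clip it to the top bin.
    return np.minimum(levels, n_bins)
```

Because the resulting texture features depend on `n_bins`, this is exactly the kind of parameter that should be reported alongside the results.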

The output of the model, on the other hand, is usually identified as the desired clinical outcome and labelled as response variable, or sometimes dependent variable, even though the latter nomenclature might wrongly suggest that the dependent variable is the consequence of the independent variables, whereas the model can only prove a correlation and not a cause-effect relationship. The path connecting the input to the output is usually a composition of many different techniques for discarding uninformative predictors (referred to as feature selection, such as LASSO, correlated variables removal, principal component analysis, etc) and for building a predictive model (logistic regression, support vector machines, etc).
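Among the feature selection techniques listed above, the removal of correlated variables is the simplest to sketch. The following example keeps a feature only if its absolute Pearson correlation with every feature already kept is below a threshold. The greedy ordering, the function name and the threshold of 0.9 are assumptions made for illustration, not a procedure taken from the reviewed papers.

```python
import numpy as np

def drop_correlated(features, threshold=0.9):
    """Return the column indices kept after greedy removal of correlated features.

    `features` is a (n_cases, n_features) array; a column is kept only if its
    absolute Pearson correlation with all previously kept columns is < threshold.
    """
    corr = np.abs(np.corrcoef(features, rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept
```

The surviving columns would then be passed to the model-building step (for example a logistic regression or a support vector machine).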

When the outcome is a class, the model is usually referred to as a 'classifier', in the sense that it assigns each case to one of the possible groups (in a probabilistic or non-probabilistic way). If the number of classes is two and the response is dichotomous, such as in the case of a benign/malignant lesion or responder/non-responder, it is called a binary classifier. When more than two classes are involved, as for molecular subtypes of breast cancer or for a more detailed response to therapy (minor response, major response, complete response), the classifier is called multivariate or multinomial. In these cases, many variables are used as input of the model (e.g. radiomic features and clinical information together) and the output of the model will be the probability of being in each class or, when non-probabilistic models are employed, the class to which the case is most likely to belong.
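The difference between a probabilistic output (one probability per class) and a hard class assignment can be sketched as follows. The softmax mapping from raw scores to probabilities is one common choice and is assumed here for illustration; `CLASSES` and `assign_class` are hypothetical names.

```python
import numpy as np

# Hypothetical class list for a multinomial classifier of molecular subtypes.
CLASSES = ["luminal A", "luminal B", "HER2-enriched", "triple-negative"]

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to one."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def assign_class(scores):
    """Return the most probable class together with the full probability vector."""
    probs = softmax(scores)
    return CLASSES[int(np.argmax(probs))], probs
```

A probabilistic classifier would report `probs`, whereas a non-probabilistic one would report only the label.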

Too often, the adjectives multivariable (i.e. a model with many predictors, as opposed to univariate, where one single variable is used as input) and multivariate (i.e. with many outcome classes) tend to be used interchangeably in the literature, but their meanings are conceptually very different (Hidalgo and Goodman 2013).

When the outcome is a real (continuous) variable, the model is called a regressor. The most used are Cox regressions, for survival analysis, and generalized linear model (GLM) regressions, employed in many different tasks. Logistic and probabilistic regressors lie in the latter group, but since they are often used to obtain probabilistic responses linked to a binary outcome, these models are usually identified as probabilistic binary classifiers rather than regressors.
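For reference, the standard form of the logistic model makes it clear why a regressor ends up being used as a probabilistic binary classifier. The symbols below ($x_i$ for the predictors, $\beta_i$ for the fitted coefficients, $y$ for the binary outcome) and the 0.5 decision threshold are introduced here for illustration only.

```latex
% Logistic regression: a linear predictor mapped to a probability
p(y = 1 \mid x_1, \dots, x_n)
  = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)\right]},
\qquad
\hat{y} =
\begin{cases}
  1 & \text{if } p \ge 0.5,\\
  0 & \text{otherwise.}
\end{cases}
```

The continuous quantity being regressed is the log-odds $\ln[p/(1-p)]$, which is linear in the predictors; thresholding $p$ turns the model into a binary classifier.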

The reproducibility of a result strongly depends on the level of detail regarding the radiomic extraction and data analysis reported in the paper. In the present review more weight has been given to the studies that detail the procedure thoroughly, clearly explain the choice of hyperparameters, use an internal validation in the model training (e.g. k-fold cross validation) and test the results on a separate test set.
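The internal validation mentioned above can be sketched as a k-fold split of the training cases: each case is used for validation exactly once and for training in the remaining folds. The function name, the default `k=5` and the fixed random seed are illustrative assumptions; the independent test set is assumed to have been set aside beforehand and is never passed to this function.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index arrays for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

Leave-one-out cross validation (LOOCV) corresponds to the special case `k = n_samples`.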

This review will then focus on the most recent ML studies exploiting breast MR-based radiomics to assist radiologists in improving the accuracy of the classification of suspicious breast lesions. In the first section, an overview of breast magnetic resonance imaging (MRI) is provided to help the reader focus on the main advantages and drawbacks of this imaging modality. Then, a critical review of the recent literature on ML algorithms applied to radiomic analysis in breast MRI is provided. This section is organized as follows: the first part is focused on benign versus malignant classification. Then the capability of ML algorithms to discriminate molecular subtypes is investigated. Finally, three other aspects that can influence the classification performance are analyzed: the imaging modality from which the radiomic features are extracted, the influence of magnetic field strength and the implemented ML algorithms. In the final part the results, limitations and drawbacks are discussed.

2. Breast MRI

Breast cancer is the most frequent malignancy in women worldwide and is curable in about 70%–80% of patients with early-stage and non-metastatic disease (Harbeck et al 2019).

It is characterized by a huge heterogeneity at molecular level. In the last 15 years, diagnostic techniques and, even more important, treatments have evolved to take this heterogeneity into account, focusing more on biologically-targeted therapies and on reducing adverse effects of cancer therapy (Harbeck et al 2019).

In this context imaging plays a key role both in screening, to discover early stages of disease for which effective treatments are possible, and in treatment response assessment. Moreover, functional imaging, aimed not only at visualizing the tumor mass but also at capturing different functional characteristics linked to tumor biology, has an important role in decoding tumor phenotypes.

Breast MRI represents a key technique for breast imaging and an investigation modality complementary to mammography and ultrasound (US); it is currently used in breast cancer screening of high-risk women as well as in staging and evaluation of tumor response (Mann et al 2019). Compared to mammography and US, MRI is a functional technique that makes it possible to investigate different aspects of normal and pathological tissues, from vascular organization to cellularity (Mann et al 2019).

A breast MRI protocol typically includes the dynamic contrast enhanced (DCE) technique and post-contrast T1 weighted images as the basis for lesion characterization and classification. In recent years a multi-parametric approach has been preferred, including T2 weighted imaging and diffusion weighted imaging (DWI) (Mann et al 2008, 2019).

T1 weighted images can be acquired with or without fat suppression. In order to depict all enhancing cancers 5 mm or larger in size, slice thickness should be no more than 2.5 mm with an in-plane pixel size of 1 × 1 mm or smaller (Mann et al 2008, Sardanelli et al 2010).

DCE-MRI is a functional technique that allows the study of the kinetics of lesion enhancement after the injection of contrast agent: malignant lesions usually show early enhancement with rapid washout, whereas benign lesions typically show a slow increase followed by persistent enhancement. In order to sample the kinetic curve, five or six T1-weighted images are usually acquired: one before contrast agent administration for baseline assessment, and then four or five acquisitions after contrast agent injection (Petralia et al 2011, Newell et al 2018). Fat appears bright in DCE-MRI due to its relatively short T1 relaxation time. For this reason, it could interfere with the evaluation of tissue signal changes or obscure abnormal areas of contrast enhancement. Therefore, it is very important to suppress the fat tissue signal to improve the detection of enhancing lesions (Lin et al 2015).
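The qualitative reading of the kinetic curve described above can be sketched with two simple descriptors: the initial enhancement relative to baseline and the signal change in the delayed phase. The 10% tolerance, the intermediate 'plateau' label and the function names are assumptions made for illustration; they are not thresholds reported by the studies reviewed here.

```python
import numpy as np

def initial_enhancement(signal, early_idx=1):
    """Relative signal increase from the pre-contrast image to an early post-contrast one."""
    s = np.asarray(signal, dtype=float)
    return (s[early_idx] - s[0]) / s[0]

def curve_type(signal, early_idx=1, wash_tol=0.10):
    """Label the delayed phase of a DCE time-intensity curve.

    signal[0] is the pre-contrast value; signal[1:] are the post-contrast values.
    wash_tol is an assumed tolerance on the relative early-to-late change.
    """
    s = np.asarray(signal, dtype=float)
    early, late = s[early_idx], s[-1]
    delayed_change = (late - early) / early
    if delayed_change > wash_tol:
        return "persistent"
    if delayed_change < -wash_tol:
        return "washout"
    return "plateau"
```

For example, a curve sampled as `[100, 220, 200, 180, 160, 150]` rises quickly and then decreases, the pattern more typical of malignant lesions, while `[100, 130, 150, 170, 185, 200]` keeps increasing, the pattern more typical of benign ones.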

DCE MRI images can be analyzed qualitatively (curve shape) or quantitatively using a pharmacokinetic model (pk-DCE). Pharmacokinetic models allow the quantification of parameters such as ktrans, kep, vp and ve that reflect blood perfusion and vascular permeability. The Tofts model is the most widely used and it assumes that the diffusion of contrast agent between the vascular and the extravascular extracellular space is passive (Schabel et al 2010).
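For reference, the Tofts model in its commonly quoted form relates the tissue concentration of contrast agent $C_t(t)$ to the plasma concentration $C_p(t)$ (the arterial input function). The concentration symbols are introduced here for illustration; the extended version adds the vascular term weighted by $v_p$.

```latex
% Standard Tofts model
C_t(t) = K^{\mathrm{trans}} \int_0^{t} C_p(\tau)\, e^{-k_{ep}\,(t-\tau)}\, \mathrm{d}\tau,
\qquad k_{ep} = \frac{K^{\mathrm{trans}}}{v_e}

% Extended Tofts model (adds the plasma volume fraction v_p)
C_t(t) = v_p\, C_p(t) + K^{\mathrm{trans}} \int_0^{t} C_p(\tau)\, e^{-k_{ep}\,(t-\tau)}\, \mathrm{d}\tau
```

Fitting these expressions voxel by voxel yields the ktrans, kep, ve and vp maps from which radiomic features can then be extracted.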

T2-weighted imaging is included in the standard MRI protocol (Westra et al 2014). It enables the visualization of cysts, which appear bright on T2 images because of their liquid content. Fat suppression is required because fat also appears bright in T2w MRI images owing to its long T2 relaxation time. The fat signal, if not suppressed, could mask benign lesions (Lin et al 2015).

DWI quantifies the random movement of water molecules, which is connected with tissue microstructure and cell density (Baliyan et al 2016). Tumors present a restriction of water molecule diffusion due to an increased cell density, causing a hyperintense signal on DWI images. From the DWI acquisition it is possible to obtain quantitative maps of the apparent diffusion coefficient (ADC). Mean ADCs in malignant lesions are generally lower (range 0.8–1.3 × 10⁻³ mm² s⁻¹) compared with those in benign lesions (range 1.2–2.0 × 10⁻³ mm² s⁻¹) (Mann et al 2019).
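As a sketch of how an ADC map can be derived from DWI, the example below assumes the simplest case: a mono-exponential signal decay and only two b-values. The function name and the default b-values (0 and 800 s mm⁻²) are illustrative assumptions and not the acquisition parameters of any specific study.

```python
import numpy as np

def adc_map(s_low, s_high, b_low=0.0, b_high=800.0):
    """Mono-exponential ADC (in mm^2/s) from two DWI images.

    s_low and s_high are the signals acquired at b_low and b_high (in s/mm^2),
    assuming S(b) = S(0) * exp(-b * ADC).
    """
    s_low = np.asarray(s_low, dtype=float)
    s_high = np.asarray(s_high, dtype=float)
    return np.log(s_low / s_high) / (b_high - b_low)
```

With the ranges quoted above, a voxel with an ADC around 1.0 × 10⁻³ mm² s⁻¹ would fall in the interval typical of malignant lesions, although the two ranges overlap.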

Lesion characterization for all imaging modalities is based on the American College of Radiology Breast Imaging Reporting and Data System (ACR BI-RADS) (Morris et al 2013). For MRI, BI-RADS includes both morphological evaluation (round shape and smooth margin for benign lesions; irregular shape and margins for malignant lesions) and functional evaluation (DCE enhancement curve). For lesions with a BI-RADS category of 4 or higher, histological verification is required for final diagnosis.

Currently, biopsy remains the gold standard for pathological confirmation of tumors, but the availability of informative and accurate imaging could help avoid unnecessary interventions. Indeed, breast MRI has shown a diagnostic sensitivity of 94%–99% (Reinsel et al 2018, Alexander et al 2020), but the reported specificity is still moderate (Houssami et al 2008, Mann et al 2008, Montemezzi et al 2021).

Furthermore, biopsy samples just a small area of the tumor volume. A non-invasive imaging analysis, instead, offers a method to assess the whole tumor volume and to account for the heterogeneity of the disease. In this sense, there is a strong demand for improvement in the diagnostic capability of breast imaging, and radiomics, thanks to its capability to unravel underlying characteristics connected with textural features, could fill this gap.

3. ML for lesion classification/diagnosis

The main features of most of the reviewed works are summarized in table 1, including number of patients, classification outcome, ML algorithms and best AUCs. The criteria for selecting papers were the inclusion of 'radiomic' or 'radiomics', 'breast', 'MRI' or 'magnetic resonance imaging' and 'ML' in the PubMed search keys, with a temporal cutoff from 2016 in order to select the most recent works.

Table 1. Review of machine-learning studies predicting breast lesion classification and/or molecular subtypes.

| Paper | Magnetic field | MR techniques | Patients (n) | Classification question | ML algorithm | Validation | Train/Test | Best AUC |
|---|---|---|---|---|---|---|---|---|
| Bickelhaupt 2017 (Bickelhaupt et al 2017) | 1.5 T | DWI | 50 | Ben versus mal | LASSO | Bootstrap | Yes | 0.842 |
| D'Amico 2020 (D'Amico et al 2020) | 1.5 T | DCE | 45 | Ben versus mal (breast foci) | TWIST algorithm | / | Yes | 94% accuracy |
| Demircioglu 2020 (Demircioglu et al 2020) | 1.5 T | DCE, T2 | 98 | Phenotype (BI-RADS 5/6) | Naive Bayes, RF, LR | Five-fold CV | Yes | 0.97 HER2+ versus triple-neg; 0.86 LumB versus triple-neg; 0.81 Ki67 expression; 0.62 HER2+ expression |
| Fan 2019 (Fan et al 2019) | 3 T | pk-DCE including parenchyma | 211 | Molecular subtype | RF | LOOCV | Yes | 0.897 |
| Fan 2020 (Fan et al 2020) | 3 T | DCE, DWI | 144 | Histological grade, Ki67 expression | Multi task learning model | LOOCV | Yes | 0.811 for histological grade; 0.816 for Ki67 |
| Gibbs 2019 (Gibbs et al 2019) | 3 T | DCE | 149 | Ben versus mal (breast foci) | SVM | / | Yes | 0.78 |
| Hao 2020 (Hao et al 2020) | 3 T | T2, post contrast T1 | 178 | Ben versus mal (BI-RADS 4) | SVM | / | Yes | 0.77 T1, T2 |
| Hu 2020 (Hu et al 2020a) | 1.5 T or 3 T | DCE, T2, DWI | 852 | Ben versus mal | SVM | CV | Yes | 0.84 DCE; 0.83 T2; 0.69 DWI; 0.87 DCE, T2 and DWI |
| Ji 2019 (Ji et al 2019) | 3 T | DCE | 1979 | Ben versus mal | SVM | / | Yes | 0.89 |
| Li 2016 (Li et al 2016) | 1.5 T | DCE | 92 | Receptor status | LDA | LOOCV | Yes | 0.89 ER+ versus ER-; 0.69 PR+ versus PR-; 0.65 HER2+ versus HER2- |
| Ma 2018 (Ma et al 2018) | 1.5 T | DCE | 377 | Ki67 expr. | KNN, NB, SVM | Ten-fold CV | Not reported | 0.733 NB |
| Ma 2021 (Ma et al 2021) | 1.5 T | DCE | 81 | Triple neg versus non triple neg | SVM, LDA, MP, RF, LR, LASSO, AB, DT, GP, NB | Five-fold CV | Yes | 0.867 |
| Militello 2021 (Militello et al 2021) | 1.5 T | DCE | 111 | Ben versus mal | SVM | Held-out CV | Yes | 0.725 |
| Naranjo 2021 (Naranjo et al 2021) | 3 T | DCE, DWI | 93 | Ben versus mal (BI-RADS 4) | SVM | Five-fold CV | Yes | 0.79 ADC; 0.83 DCE; 0.85 ADC + DCE |
| Parekh 2017 (Parekh and Jacobs 2017) | 3 T | pk-DCE, DWI | 124 | Ben versus mal | SVM | LOOCV | Yes | 0.91 |
| Saha 2018 (Saha et al 2018) | 1.5 T or 3 T | DCE | 922 | Molecular subtypes | RF | CV | Yes | 0.697 lumA versus others |
| Song 2020 (Song et al 2020) | 3 T | DCE | 132 | HER2 status | LRA, SVM, QDA | LOOCV | Yes | 0.884 QDA on subtracted images; 0.89 SVM on subtracted images |
| Sun 2021 (Sun et al 2021) | 1.5 T | DWI, DKI, IVIM | 271 | Ben versus mal | RF, PCA, L1 reg, SVM | Ten-fold CV | Yes | 0.85 RF with biexponential IVIM |
| Tao 2021 (Tao et al 2021) | 3 T | pk-DCE, DWI, T1, T2 | 232 | Ben versus mal | LR, SVM, SVC, DT, RF, AB, NB, Gaussian NB, kNN, LDA, SGD, MP | Five-fold CV | Yes | 0.86 ktrans; 0.9 ktrans, T1, T2, ADC |
| Whitney 2019 (Whitney et al 2019) | 1.5 T or 3 T | DCE | 508 | Ben versus Luminal A | LDA | Ten-fold CV | Yes | 0.848 |
| Zhang 2020 (Zhang et al 2021) | 3 T | DCE | 299 | Ben versus mal | DT, SVM | Ten-fold CV | Yes | 0.92 SVM |

CV = cross validation; RF = Random Forest; LR = Logistic Regression; LOOCV = Leave-one-out cross-validation; SVM = Support Vector Machine; LDA = Linear discriminant analysis; MP = Multilayer Perceptron; RF = Recursive feature; LASSO = least absolute shrinkage and selection operator; AB = Adaboost; DT = Decision Tree; GP = Gaussian process; NB = Naive Bayes; QDA = Quadratic discriminant analysis; SVC = Support Vector Machine Classification; kNN = k nearest neighbors; PCA = Principal Component Analysis.

Then we have selected only works dealing with the classification of breast lesions (no prediction of therapy response, lymph nodes status or survival). Furthermore, we have excluded all reviews and other studies on deep learning.

3.1. Benign versus malignant classification

As previously underlined, the classification of lesions as benign or malignant represents the major question in breast imaging screening. As reported in table 1, several works focused on this classification.

The most representative works in terms of sample size are Ji et al (2019), Hu et al (2020a), Sun et al (2021), Tao et al (2021) and Zhang et al (2021); all these authors obtained AUCs of 0.85 or higher (maximum AUC 0.92, Zhang et al 2021).
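Since the AUC is the figure of merit used to compare the studies in this section, it may help to recall how it can be computed. The sketch below uses the rank-based (Mann-Whitney) interpretation: the AUC is the fraction of malignant/benign pairs in which the malignant case receives the higher score, with ties counted as one half. The function name and the toy labels are hypothetical.

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for the positive class (e.g. malignant), 0 for the negative class.
    scores: classifier outputs, higher meaning more likely positive.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.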

More precisely, Ji et al (2019) found an AUC of 0.89 in 1974 patients acquired at 3 T, extracting radiomic features from DCE images. Comparable results in terms of AUC were found by Hu et al (2020a) on 852 patients. The authors found the highest AUC (0.87) with a multiparametric approach.

AUCs of 0.9 or higher were found by Zhang et al (2021) on 299 patients (AUC = 0.92) using DCE, and by Tao et al (2021) and Parekh and Jacobs (2017) on 232 (AUC = 0.9) and 124 (AUC = 0.91) patients, respectively. The latter two studies used pk-DCE to increase classification performance.

Lower AUCs were found by Hu et al (2020a) on 852 patients using mpMRI from different scanners (1.5 and 3 T) and different protocols. Militello et al (2021) also found a lower AUC than other studies (0.72). This may be due to the smaller sample size as well as to the use of a single-phase DCE image for radiomic feature extraction, as the authors themselves pointed out; in terms of AUC, their result is comparable to that obtained by Zhang et al (2021) when considering only T2w images.

One sub-group of works has explored the potential use of radiomics for the classification between malignant and benign lesions in breast foci. Small enhancing lesions, with a maximum diameter of 5 mm or less, are defined by ACR BI-RADS as foci (Morris et al 2013). They are associated with benign lesions (fibroadenoma, cyst and fibrocystic changes, lymph node), but they can also represent the early onset of a malignant lesion (Clauser et al 2016).

Lo Gullo et al (2020) focused on sub-centimetre lesions in 96 BRCA-positive patients, extracting radiomic features from all DCE phases and from subtracted images and also including clinical parameters. The best performance was achieved by including features from all the post-contrast dynamic phases together with clinical factors (accuracy 81.5%), improving lesion characterization compared with radiologists' BI-RADS classification. D'Amico et al (2020) also focused on breast foci, investigating the capability of radiomics to discriminate malignant from benign enhancing foci on breast MRI. 45 patients underwent MRI examinations that included four dynamic scans after contrast agent injection. Enhancing lesions were contoured and radiomic features were extracted separately for each dynamic scan. They obtained an accuracy of 94%, higher than Lo Gullo et al (2020) despite a smaller (45 versus 96) and less homogeneous dataset (Lo Gullo et al included only patients reporting BRCA mutations); this value should therefore be taken with great care.

Gibbs et al (2019) evaluated the utility of radiomics analysis for breast cancer diagnosis in small breast lesions using radiomics DCE-based parameter maps, obtained using a semi-quantitative approach, and achieved an AUC of 0.78.

3.2. Differentiation of molecular subtypes

Breast cancer has been classified into four molecular subtypes (Perou et al 2000): luminal A, luminal B, HER2-enriched and triple-negative. Each of these subtypes is associated with different risk factors, treatment response, risk of disease progression, and sites of metastasis. Luminal tumors are positive for estrogen (ER) and progesterone (PR) receptors and are commonly treated with hormonal interventions, whereas HER2-enriched tumors show amplification and overexpression of the ERBB2 oncogene and require anti-HER2 therapies. Basal-like tumors in general lack hormone receptors and HER2 and are called triple-negative breast cancer (TNBC) (Harbeck et al 2019). Ki67 expression is widely used to determine proliferation and predicts chemosensitivity. However, Ki67 is relevant only for ER-positive, HER2-negative breast cancers (Harbeck et al 2019).

Several works investigated the possibility of differentiating molecular subtypes using breast MRI (Szabó et al 2003, Martincich et al 2012, Guvenc et al 2016, Suo et al 2017, Montemezzi et al 2018). Indeed, distinct breast tumor subtypes are endowed with peculiar histopathological features that can be related to imaging features. More precisely, high contrast agent uptake and rapid washout detected with DCE-MRI correlate with higher histologic grade, positive Ki67, and negative ER status (Szabó et al 2003) and reflect elevated neoangiogenesis and vascular organization. Lower ADC values could reflect higher tumor cellularity and hence more aggressive subtypes (Guvenc et al 2016). For these reasons a multiparametric MRI approach makes it possible to investigate different aspects of tumor features, and a multivariate analysis of different imaging parameters results in higher diagnostic power compared to the use of single MRI features (Montemezzi et al 2018).

In this context radiomics, in combination with ML analysis, has the potential to decode more features connected to histological characteristics, improving the ability of MRI to unveil molecular subtypes and/or receptor status. The possibility of non-invasively exploiting the molecular profile has clinical benefits because it could help the prediction of breast cancer subtypes that would allow specific/individualized therapies.

Several works focused on the classification of malignant lesions according to different molecular subtypes. In general this type of analysis is more complex, as already described in the introduction section, because it requires a multivariate analysis instead of a dichotomous classification (malignant versus benign). For this reason some authors focused on the discrimination of one subtype with respect to all other categories (Saha et al 2018).
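The 'one subtype versus all others' strategy mentioned above amounts to recoding the multi-class outcome as a binary one before training a binary classifier. A minimal sketch of this recoding is given below; the function name and the example labels are hypothetical.

```python
import numpy as np

def one_vs_rest_labels(subtypes, target):
    """Recode multi-class subtype labels as 1 (target subtype) versus 0 (all others)."""
    return (np.asarray(subtypes) == target).astype(int)
```

For instance, `one_vs_rest_labels(["luminal A", "luminal B", "luminal A", "triple-negative"], "luminal A")` returns the binary outcome `[1, 0, 1, 0]`, on which a standard binary classifier and its AUC can be evaluated.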

Li et al (2016) investigated the possibility of discriminating receptor status from the radiomic signature of DCE images. They considered the public TCGA/TCIA database, which included DCE images of 92 patients from 4 different centers. The implemented ML analysis was able to distinguish ER+ versus ER- with an AUC of 0.89. The other receptor statuses were discriminated with lower AUCs (0.69 for PR+/−, 0.65 for HER2+/− status).

Saha et al (2018) included DCE-MRI from 922 patients with invasive breast cancer. Radiomic features were extracted and analyzed to predict the following molecular, genomic and proliferation characteristics: tumor surrogate molecular subtype, estrogen receptor, progesterone receptor and human epidermal growth factor status, as well as a tumor proliferation marker (Ki67). This study demonstrated that there were associations between characteristics of tumors and fibroglandular tissue in dynamic contrast-enhanced MRI and tumor molecular composition. The association found using multivariate models, however, is moderate, with the highest AUC of 0.697 for distinguishing luminal A from other subtypes. The authors conclude that there are moderate associations of imaging features with luminal A subtype, TNBC, ER, and PR status. This shows a potential for extending the use of imaging in oncology. However, this needs to be done with caution and possibly in conjunction with other variables. These results (Saha et al 2018) are in agreement with Demircioglu et al (2020), who also focused on the prediction of molecular subtypes in a smaller sample (98 patients compared to the 922 patients of Saha et al (2018)). Demircioglu et al (2020) obtained moderate/low AUCs for one molecular subtype against all others: 0.75 (HER2-enriched), 0.73 (triple-negative), 0.65 (luminal A) and 0.69 (luminal B). The best classification performances were achieved for the differentiation of HER2-enriched from triple-negative with an AUC of 0.97, followed by luminal B from triple-negative (AUC 0.86) and luminal A and B from HER2-enriched (AUC 0.79 and 0.78 respectively). Ki67 expression was predicted with an AUC of 0.81.

Song et al (2020) evaluated whether radiomic features are associated with HER2 2+ amplification status in breast cancer using three different ML methods based on texture features derived from pre-contrast, post-contrast, and subtraction images of DCE-MRI. 92 HER2-positive patients and 40 HER2-negative patients were considered, and the ML algorithms included logistic regression analysis (LRA), quadratic discriminant analysis (QDA) and support vector machine (SVM). SVM and QDA applied to DCE subtracted images showed the best AUCs (0.89 and 0.884, respectively), demonstrating that subtracted images contained more useful information than the pre- and post-contrast images, and that the choice of ML method has an impact on model performance for determining HER2 status.

Fan et al focused on the prediction of molecular subtypes in Fan et al (2019) and on the differentiation of histological grade and Ki67 expression in Fan et al (2020). In the first study they considered pk-DCE, finding a predictive value for molecular subtypes if parenchyma is included (AUC = 0.897). The inclusion of DCE-MRI features in the bulk parenchyma was found to be associated with breast cancer subtypes (Wu et al 2017).

In the latter study DCE and ADC were considered. As already reported, the AUC was 0.811 for histological grade and 0.816 for Ki67 expression.

The prediction of Ki67 expression is consistent with Demircioglu et al (2020), although the latter did not include DWI. Indeed, Mori et al (2015) found that higher ADC values are associated with lower Ki67 scores, but DWI is not typically performed in routine breast MRI (Saha et al 2018).

A recent study (Ma et al 2021) investigated the possibility of discriminating triple-negative from non-triple-negative tumors using features extracted from DCE-MRI. The authors concluded that it is possible to classify them with an AUC of 0.876. In spite of the good AUC, in this work luminal A, luminal B, and HER2-enriched patients were mixed together in the non-triple-negative group, and their heterogeneity probably affects classification performance.

3.3. MR-imaging modality

DCE represents the most informative sequence for breast MR imaging (Mann et al 2008). For this reason, most radiomics studies have considered features extracted from DCE MRI images. Few studies have processed the quantitative maps obtained applying the pharmacokinetic model comparing the classification performance of ktrans, kep, ve, and vp maps (Parekh and Jacobs 2017, Fan et al 2019, Tao et al 2021).

The majority of authors did not consider the pharmacokinetic model but extracted features directly from one or more single phases, from the whole DCE dynamics or from subtracted images defined as the difference between the pre and the first or last post contrast images or a combination of them (Saha et al 2018, Whitney et al 2019, Demircioglu et al 2020, Fan et al 2020, Lo Gullo et al 2020, Song et al 2020).
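The subtracted images described above can be sketched in a few lines. This is only a minimal illustration, assuming the DCE series is stored as a NumPy array with the time point on the first axis and the pre-contrast phase at index 0; the function name `subtraction_images` and the array layout are assumptions made for this sketch and are not taken from the reviewed studies.

```python
import numpy as np

def subtraction_images(dce, post_indices):
    """Return post-contrast minus pre-contrast images.

    dce: array of shape (n_phases, ...) with the pre-contrast phase at index 0
         (hypothetical layout chosen for this sketch).
    post_indices: indices of the post-contrast phases to subtract from.
    """
    # Convert to float first: subtracting unsigned integers could wrap around.
    dce = np.asarray(dce, dtype=np.float64)
    pre = dce[0]
    return {i: dce[i] - pre for i in post_indices}

# Toy 2x2 "series" with one pre-contrast and two post-contrast phases.
series = np.array([
    [[100, 100], [100, 100]],   # pre-contrast
    [[150, 100], [180, 110]],   # first post-contrast
    [[140, 100], [220, 105]],   # last post-contrast
], dtype=np.uint16)

subs = subtraction_images(series, post_indices=[1, 2])
```

Radiomic features would then be extracted from `subs[1]` (early enhancement) or `subs[2]` (late enhancement) within the lesion mask, instead of from a single phase.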

All studies have concluded that radiomic analysis applied to subtracted images resulted in better performance than analysis of single phases, with an AUC up to 0.89 (Song et al 2020); a combination of multiple phases has also shown a higher AUC than a single-phase approach (Lo Gullo et al 2020).

In Fan et al (2020) the radiomic features were derived from the pre-contrast images and from two different subtractions, obtained between the second or the fifth post-contrast image and the pre-contrast image. 144 patients were included in the study, which also considered DWI, to differentiate histological grade and predict Ki67 expression. A multitask learning method was implemented to predict multiple clinical indicators (histological grade and Ki67 expression). Compared to single-task feature selection, in a multitask learning method a subset of features common to all tasks is selected. In this case, models including only one image type showed similar results for pre-contrast and subtracted images, while with a multitask model using features from multiple images the AUC increased, in particular when ADC was also considered. The highest AUCs were found for pre-contrast images combined with ADC maps (0.811 for histological grade and 0.816 for Ki67 expression). The lower performance of subtracted images compared to other studies can be explained by the different classification task involved (differentiation of histological grade and Ki67 expression).

Few works (Fan et al 2019, Tao et al 2021) focused on pk-DCE MRI, extracting radiomics features directly from quantitative DCE maps (ktrans, kep, vp, ve).

In Tao et al (2021) 199 patients were investigated with multi-parametric MRI that included functional techniques. Images were processed to obtain ktrans, kep, ve and vp maps from DCE acquisitions by applying the pharmacokinetic Tofts model, and ADC maps from DWI acquisitions were also considered. Radiomic features were extracted from eight image sets (ktrans, kep, ve, vp, non-enhanced T1WI, enhanced T1WI, T2WI and ADC maps) and analyzed with multiple ML models, from single-parameter to multi-parameter. The authors concluded that the ktrans ML model had the best discriminative performance (malignant versus benign) among the single-parameter models (AUC 0.86), but the best AUC (0.9) was achieved with the four-parameter model (ktrans, ADC, T1w, T2w).
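For reference, the relationship between the four pharmacokinetic parameters can be written in the commonly used extended form of the Tofts model. This form is assumed here because vp maps are reported; the exact variant and the arterial input function adopted in each reviewed study are not specified in this review. C_t(t) is the tissue contrast agent concentration and C_p(t) the plasma concentration.

```latex
C_t(t) = v_p\, C_p(t) + K^{\mathrm{trans}} \int_0^t C_p(\tau)\, e^{-k_{ep}\,(t-\tau)}\, \mathrm{d}\tau ,
\qquad
k_{ep} = \frac{K^{\mathrm{trans}}}{v_e}
```

Setting v_p = 0 gives the standard Tofts model. Fitting this expression voxel by voxel to the measured DCE curve yields the ktrans, kep, ve and vp maps from which radiomic features are then extracted.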

Fan et al (2019) investigated the use of pk-DCE MRI to discriminate molecular subtypes in 211 patients with breast cancer, including the parenchyma to account for peritumoral heterogeneity. The tumor and the surrounding parenchyma were decomposed into three compartments, representing plasma input, fast-flow kinetics, and slow-flow kinetics. The fast-flow kinetic component of the tumoral region exhibited a higher AUC (0.832) in the differentiation of molecular subtypes than whole-tumor analysis (AUC = 0.719). Including the parenchyma, the AUC increased up to 0.897, demonstrating that both the decomposition into regions with different kinetic behavior and the inclusion of the surrounding peritumoral region help the differentiation.

One study (Hao et al 2020) did not consider DCE, but only the post-contrast T1 image and the T2 image, to discriminate malignant from benign lesions in 178 patients with BI-RADS 4 lesions. Using a combination of features extracted from both images, the AUC of the model (obtained using SVM) was 0.77, improving on the diagnostic capability of the single images (0.71 for post-contrast T1 and 0.69 for T2) but remaining lower than that of other models obtained using DCE images.

Only one study focused on MR diffusion (Sun et al 2021), considering different models to fit multi-b-value data: radiomic features were computed from mono-exponential (ME), biexponential (BE), stretched-exponential (SE), and diffusion kurtosis imaging (DKI) models. The authors found that the biexponential intravoxel incoherent motion (IVIM) model performed best, with an AUC of 0.85 in the discrimination between malignant and benign lesions. This result is in line with Bickelhaupt et al (2017), who used only DWI images and obtained an accuracy of 84.2%.
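The four signal models named above are usually written as follows. These are the standard textbook forms, given here as a reminder; the exact parameterization and fitting procedure used by Sun et al (2021) are not described in this review, so the symbols (f, D*, DDC, α, K) are assumptions of notation. S(b) is the signal at diffusion weighting b and S_0 the signal at b = 0.

```latex
\begin{aligned}
\text{ME:}  \quad & S(b) = S_0\, e^{-b\,\mathrm{ADC}} \\
\text{BE (IVIM):} \quad & S(b) = S_0 \left[ f\, e^{-b\,D^{*}} + (1-f)\, e^{-b\,D} \right] \\
\text{SE:}  \quad & S(b) = S_0\, e^{-(b\,\mathrm{DDC})^{\alpha}} \\
\text{DKI:} \quad & S(b) = S_0\, e^{-b\,D + \frac{1}{6}\, b^{2} D^{2} K}
\end{aligned}
```

Each model yields its own parametric maps (for example ADC; f, D and D*; DDC and α; D and K), and radiomic features can be computed from any of them.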

As previously underlined in Fan et al (2020) and Tao et al (2021), adding to the model the features extracted from ADC maps, generally increases classification performance.

Parekh et al (2017) evaluated the classification performance of radiomic feature maps derived from radiomics analysis of ADC maps and DCE-MRI with pharmacokinetic modeling in 124 patients. They demonstrated differences in radiomic feature map curves for benign and malignant lesions, with an increased entropy in malignant tumors. Their model, which included quantitative MRI metrics of ADC and perfusion, achieved an AUC of 0.91 with a sensitivity of 93% and specificity of 85%.
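Since most results in this review are reported as AUC, sensitivity and specificity, a small worked example may help to fix the definitions. The sketch below uses made-up labels and scores (not data from any reviewed study) and computes the AUC as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one; the function names are chosen for this illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative toy data: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # threshold chosen arbitrarily

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Note that sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes performance over all thresholds, which is why it is the figure most often compared across studies.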

In Naranjo et al (2021) the diagnostic capabilities of ML approach applied to radiomic features extracted from DCE and DWI were compared. 93 patients from two different centers were considered. The classification performances of ADC alone are lower compared to DCE (AUC 0.79 and 0.83 respectively). When a multi parametric model is considered, including both ADC and DCE, AUC increases to 0.85.

Also Hu et al (2020b) underlined the importance of mp-MRI for breast lesions classification. In their work a multiparametric approach was considered, comparing the classification performances of DCE, T2 images, DWI separately and together (using a multi parametric model) in 852 patients. They found, in agreement with other studies, that DCE alone has the best performance (AUC = 0.84) and DWI the worst (AUC = 0.67). Considering radiomic features from all MRI modalities the AUC increases to 0.87.

3.4. Differences across magnetic fields

As reported in table 1, the reviewed studies considered images acquired at different magnetic field strengths (1.5 T, 3 T or both). Studies that included patients scanned at either 1.5 T or 3 T (Saha et al 2018, Whitney et al 2019, Lo Gullo et al 2020, Hu et al 2020b) did not include the contribution of the magnetic field in the multivariate model, nor did they analyze the impact of field strength on the explained variance. It is well known that differences in field strength affect image SNR as well as other characteristics related to DCE kinetic curves (Jansen et al 2009). In a recent work, Whitney et al (2021a) investigated feature robustness and classification performance across field strengths in 612 patients (373 scanned at 1.5 T and 239 at 3 T). Only some of the extracted features (related to shape/irregularity and kinetic curve enhancement) proved robust and unaffected by the magnetic field strength. Given the large variability of imaging protocols and field strengths, it is important, when developing AI tools, to understand which characteristics of radiomic features are affected.

3.5. ML algorithms

As reported in table 1, several ML algorithms were implemented including SVM, Random Forest (RF), LRA, Decision Tree (DT), Naïve Bayes.

The choice of ML algorithm has a large impact on classification performance, as demonstrated by Tao et al (2021), who compared the diagnostic power of 12 different ML algorithms (Logistic Regression, SVM, Linear SVC, Decision Tree, Random Forest, AdaBoost, Bernoulli NB, Gaussian NB, K Nearest Neighbors, Linear Discriminant Analysis, SGD and Multilayer Perceptron).

In order to choose the best ML algorithm, the performances of all classifiers were compared in single-parameter models (considering ktrans, kep, ve, vp, non-enhanced T1WI, enhanced T1WI, T2WI, and ADC maps as single parameters). They found a high variability in AUC, ranging from 0.7 to 0.9 across classifiers. The models with Logistic Regression, SVM and Multilayer Perceptron had relatively high AUC values (>0.80), and these three classifiers were considered appropriate. Multi-parameter models were then built using these classifiers.
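A comparison of this kind can be sketched with scikit-learn. The example below is a minimal illustration, not the pipeline of Tao et al (2021): it uses a synthetic feature matrix as a stand-in for radiomic features, a subset of five of the listed classifiers with default hyperparameters, and 5-fold stratified cross-validation with AUC as the score. All of these choices are assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a radiomic feature matrix (not real patient data).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "K Nearest Neighbors": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_auc = {}
for name, clf in classifiers.items():
    # Scaling is fitted inside each fold, so no information leaks from the test fold.
    model = make_pipeline(StandardScaler(), clf)
    fold_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc[name] = fold_auc.mean()

# Print the classifiers from highest to lowest mean cross-validated AUC.
for name, value in sorted(mean_auc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean AUC = {value:.3f}")
```

Using the same folds for every classifier keeps the comparison fair; the ranking obtained on synthetic data says nothing about which classifier is best on real breast MRI features.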

In general, the most widely used algorithm is SVM (Ma et al 2018, Ji et al 2019, Hao et al 2020, Lo Gullo et al 2020, Song et al 2020, Zhou et al 2020, Hu et al 2020b, Naranjo et al 2021, Zhang et al 2021). SVM is a supervised learning model that analyzes labeled data points and recognizes patterns, which are then used for classification.

Song et al (2020) compared the performance of LRA, SVM and QDA, finding that SVM performed better than the others. It must be said that the differences in AUC among the three classifiers are small (from 0.831 to 0.89). Zhang et al (2021) reported a better discrimination capability of SVM with respect to DT; a similar result was found by Ma et al (2021), who included 10 different classifiers.

An innovative algorithm for both train/test splitting and feature selection, the training with input selection and testing (TWIST) algorithm, which is based on an evolutionary strategy, was used in D'Amico et al (2020). A k-nearest neighbor (kNN) classifier based on 35 selected features was identified as the best-performing ML approach.

Most of the reviewed studies included a validation method in their analysis. Validation is necessary to obtain robust and reliable results and to avoid overfitting. An interesting result was shown in Lo Gullo et al (2020), where the accuracy with and without cross-validation is compared. Accuracy without validation is higher (90%), but the model overfits the data. With 5-fold cross-validation the accuracy drops to 75%.
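The gap between accuracy with and without validation can be reproduced on purely random data. The sketch below is an illustration of the general effect, not of the experiment in Lo Gullo et al (2020): it assumes 100 cases with 50 noise features and labels unrelated to them, and uses a 1-nearest-neighbor classifier written in NumPy. Evaluated on its own training data the classifier is always right, while 5-fold cross-validation reveals chance-level performance.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """Predict the label of the nearest training sample (Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 50))      # 100 "patients", 50 pure-noise features
y = rng.integers(0, 2, size=n)    # labels unrelated to the features

# No validation: the model is evaluated on the data it was trained on.
acc_resub = (one_nn_predict(X, y, X) == y).mean()

# 5-fold cross-validation: each case is predicted by a model that never saw it.
folds = np.array_split(rng.permutation(n), 5)
correct = 0
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    pred = one_nn_predict(X[train_idx], y[train_idx], X[test_idx])
    correct += (pred == y[test_idx]).sum()
acc_cv = correct / n

print(f"accuracy without validation: {acc_resub:.2f}")
print(f"5-fold cross-validated accuracy: {acc_cv:.2f}")
```

The accuracy without validation is exactly 1 because every case is its own nearest neighbor; only the cross-validated figure reflects what the model would do on new patients.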

4. Discussion

It is difficult to compare results from different studies because of differences in methodology (in the features extracted or the classifier/validation method used) as well as in sample size and inclusion criteria. Despite these limitations, it is possible to draw some conclusions based on the literature reviewed.

All studies using radiomics to discriminate benign from malignant lesions in a large dataset (from about 200 to 1900 patients) have shown AUCs higher than 0.89. This means that it is possible to obtain reliable models for lesion classification from breast MR imaging, especially when the radiomic features are extracted from DCE images.

Some studies (D'Amico et al 2020, Lo Gullo et al 2020, Song et al 2020) focused on small lesions/breast foci. Results are conflicting and difficult to compare because of differences in both the populations studied and the methodologies. Despite that, it is possible to conclude that radiomic analysis of breast MR images helps to distinguish malignant from benign small lesions.

Considering molecular classification, as we can observe in table 1, AUCs are generally lower compared to benign versus malignant classification.

Several factors contribute to this. First of all, discriminating among four different subtypes is a more complex question than the binary one (benign/malignant). Moreover, dividing a dataset into four groups reduces the amount of data per class, and the data tend to be unbalanced because the luminal A category (a less aggressive subtype) is typically more populous than the others. For this reason some authors compared one subtype against all the rest, or focused directly on the differentiation of ER, PR, Ki67 or ERBB2 expression, which are strongly connected with molecular subtyping.
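The one-against-all-the-rest strategy and the resulting class imbalance can be illustrated with a toy cohort. The subtype counts below are hypothetical, chosen only to mimic a luminal A majority; they are not taken from any reviewed study, and the helper name `one_vs_rest_labels` is an assumption of this sketch.

```python
from collections import Counter

def one_vs_rest_labels(subtypes, positive):
    """Binarize a multi-class label list: 1 for the chosen subtype, 0 for all others."""
    return [1 if s == positive else 0 for s in subtypes]

# Hypothetical cohort of 100 patients with an unbalanced subtype distribution.
subtypes = (["luminal A"] * 60 + ["luminal B"] * 20
            + ["HER2-enriched"] * 10 + ["triple-negative"] * 10)

counts = Counter(subtypes)
prevalence = {}
for name in counts:
    labels = one_vs_rest_labels(subtypes, name)
    prevalence[name] = sum(labels) / len(labels)   # fraction of positive cases

for name, frac in prevalence.items():
    print(f"{name} versus rest: {counts[name]} positives ({frac:.0%} of the cohort)")
```

In this toy cohort, each of the two rarest subtypes gives a binary problem with only 10 positive cases, which shows how quickly the effective sample size shrinks when a dataset is split by subtype.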

Despite these limitations, the studies that focused on molecular subtypes/receptor status provided encouraging results on classification capability of radiomic MR-features.

All the studies underlined the importance of DCE as a primary diagnostic tool for breast lesion classification, reflecting the importance of this technique in the non-radiomic context as well. Indeed, all the studies but one (Hao et al 2020) included DCE as a primary investigated technique and, when compared to other MR techniques (Hu et al 2020b, Naranjo et al 2021, Tao et al 2021), DCE showed better performance.

In general it has been shown that it is better to focus on the extraction of radiomic features on subtracted images or include directly all dynamics (Lo Gullo et al 2020, Song et al 2020).

No studies compared the performance of models based on subtracted/dynamic series with that of models based on quantitative pk-DCE maps (ktrans etc), thus it is not possible to draw any conclusion about the superiority of one approach over the other.

The inclusion of parenchyma generally improves the diagnostic performance, but only one work investigated this aspect (Fan et al 2019). In 2013, BI-RADS added background parenchymal enhancement (BPE) to the breast MRI lexicon to acknowledge that normal parenchymal enhancement may vary substantially in pattern and degree across MR imaging examinations (Wu et al 2017). For this reason, it could be helpful to investigate more deeply how the parenchymal region differs between benign and malignant lesions.

ADC alone is incapable of obtaining adequate classification results, but when it is combined with DCE in a multivariable model it typically increases AUCs (Hu et al 2020b, Naranjo et al 2021, Tao et al 2021). This result reflects the superior classification performance of the multiparametric approach in the standard diagnostic framework (Montemezzi et al 2018).

In general, additional work is needed before these methods can be used in clinical practice in particular for the noninvasive assessment of breast cancer histopathological characterization.

5. Limitations

The uses of AI and ML in medical imaging are rapidly evolving, and their application in breast MRI is beneficial for lesion classification to reduce false-positive diagnoses and, consequently, the number of biopsies.

Despite the improvements in the classification task reviewed in this work, as well as in detection and assessment of therapy response, the clinical use of ML still suffers from several limitations that require further investigation.

The small sample size represents a strong limiting factor for the analyzed studies. When dealing with small samples, often below 100 cases, the power of the model is strongly capped regardless of the cross-validation adopted (Button et al 2013). This problem is further aggravated by the multitude of radiomic features, which makes this approach prone to overfitting in small datasets and makes a train/test splitting procedure mandatory. All results are compared in terms of AUC; however, a high AUC does not necessarily reflect the credibility of the model. For this reason, we reported in table 1 both the validation method and whether a train/test approach was used. Indeed, validation techniques are used to select the best hyperparameters on the training set while keeping the probability of overfitting low, and the test set ensures a reliable performance assessment on an independent dataset.
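The division of roles between validation and test set can be sketched with scikit-learn. The example is a minimal illustration on synthetic data: the 70/30 split, the SVM classifier and the small hyperparameter grid are all assumptions made for the sketch, not settings taken from the reviewed studies.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a radiomic feature matrix (not real patient data).
X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           random_state=0)

# The test set is put aside before any tuning and used only once, at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}

# 5-fold cross-validation on the training set selects the hyperparameters.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Independent assessment on data never used for fitting or tuning.
test_auc = roc_auc_score(y_test, search.decision_function(X_test))

print("selected hyperparameters:", search.best_params_)
print(f"cross-validated AUC on the training set: {search.best_score_:.3f}")
print(f"AUC on the held-out test set: {test_auc:.3f}")
```

Reporting both figures, as done for the validation and train/test columns of table 1, lets the reader judge how much of the cross-validated performance survives on unseen data.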

Most of the reviewed studies have reported the validation method (mostly leave-one-out CV and k-fold CV), and all of them have divided the dataset into training and test sets.

In order to generate convincing results regarding the potential clinical value of radiomics as a diagnostic tool, it is also essential to consider larger patient cohorts, which can be made available through multicenter studies. Furthermore, the inclusion of different centers/scanners in the same model strengthens the generalizability of the study, allowing its application in different scenarios. Nevertheless, the majority of radiomics studies were retrospective, based on single-center data and small cohorts of patients, and most radiomic models were not externally validated. Indeed, multicenter radiomic studies are subject to several confounding factors, including variability in scanner models, acquisition protocols and reconstruction settings. It is well known that radiomic features are sensitive to such variations, which hinders pooling data for statistical analysis and ML in order to build robust models. Hence, there is a strong need for feature harmonization, to allow consistent findings in multicenter radiomics studies (Da-Ano et al 2020). Recently, Whitney et al (2021b) have proposed a batch harmonization approach for robust application of AI across different breast MR databases. Batch harmonization of radiomic features extracted from DCE images from two different databases was applied to an ML classification workflow. Harmonization consists of data integration with the aim of reducing the unwanted variation associated with batch effects. Training and independent test sets from the two databases, as well as their combination, were used before and after harmonization to investigate the generalizability of classification performance.
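The idea of removing batch effects can be illustrated with a deliberately simplified location-scale adjustment. This is not the method of Whitney et al (2021b) nor a full ComBat implementation: the sketch merely shifts and rescales each feature within each batch (for example a scanner or a database) to the pooled mean and standard deviation, and the function name `harmonize_location_scale` and the simulated data are assumptions of the example.

```python
import numpy as np

def harmonize_location_scale(features, batch):
    """Align every batch to the pooled mean and standard deviation, feature by feature.

    Simplified sketch: assumes each batch has more than one case and
    non-constant features (otherwise the batch standard deviation is zero).
    """
    features = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    pooled_mean = features.mean(axis=0)
    pooled_std = features.std(axis=0)
    out = np.empty_like(features)
    for b in np.unique(batch):
        rows = batch == b
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0)
        out[rows] = (features[rows] - mu) / sd * pooled_std + pooled_mean
    return out

# Simulated features from two "databases": B is shifted and rescaled relative to A.
rng = np.random.default_rng(0)
site_a = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
site_b = rng.normal(loc=2.0, scale=3.0, size=(50, 3))
X = np.vstack([site_a, site_b])
batch = np.array(["A"] * 50 + ["B"] * 50)

X_h = harmonize_location_scale(X, batch)
```

After the adjustment, both batches share the same per-feature mean and standard deviation. A caveat of such a naive approach is that it also removes genuine differences between centers, for instance a different proportion of malignant lesions, which is one reason why dedicated harmonization methods model biological covariates explicitly.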

Lack of standardisation and of reported details in the analysis workflow is another of the most relevant issues for the generalisation and clinical use of ML methodologies. In general, methods and results in an ML paper should be clearly reported. This includes the key objects under study, details on dataset composition and data splits, and experimental details on algorithms and models (hyper-parameter configurations, average runtime, computing infrastructure, inter alia), which foster openness and reproducibility of ML research (Pineau et al 2021). For instance, as previously underlined in the introduction, the gray-level discretization and the resampled voxel size are important for the reproducibility of extracted features, since differences in these extraction parameters have a large impact on the values assumed by the radiomic features, not only rescaling them but also introducing a strong volume-confounding effect (Shafiq-ul-Hassan et al 2017) (i.e. a volume-dependent offset is added to the feature). Despite their importance, these parameters were not always available in the studies analyzed in the present review.
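To show why the discretization settings must be reported, the sketch below applies the two schemes commonly described in the radiomics literature, a fixed number of bins and a fixed bin width, to the same toy intensities. The function names and the toy values are assumptions of this example, and the formulas follow the usual definitions rather than the implementation of any specific reviewed study.

```python
import numpy as np

def discretize_fixed_bin_number(values, n_bins):
    """Map intensities to bins 1..n_bins spanning the ROI's own min-max range.

    Assumes the ROI is not constant (max > min).
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    bins = np.floor(n_bins * (values - lo) / (hi - lo)).astype(int) + 1
    return np.minimum(bins, n_bins)   # the maximum intensity falls in the last bin

def discretize_fixed_bin_width(values, bin_width, lowest=None):
    """Map intensities to bins of constant width starting at `lowest` (default: ROI minimum)."""
    values = np.asarray(values, dtype=float)
    lo = values.min() if lowest is None else lowest
    return np.floor((values - lo) / bin_width).astype(int) + 1

# Toy ROI intensities (arbitrary units).
roi = np.array([0.0, 10.0, 25.0, 50.0, 99.0, 100.0])

fbn = discretize_fixed_bin_number(roi, n_bins=4)       # -> [1 1 2 3 4 4]
fbw = discretize_fixed_bin_width(roi, bin_width=25.0)  # -> [1 1 2 3 4 5]
```

Even on six values the two schemes disagree on the number of gray levels (4 versus 5), and texture features computed from the resulting matrices would differ accordingly, which is why the scheme and its parameter need to be stated in the methods.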

Recently, automatic information extraction methods from scientific publications have been developed and employed in either research or industrial applications (Ramponi et al 2020a, 2020c), including the extraction of experimental details (D'Souza et al 2021) and key aspects of biomedical relevance (Ramponi et al 2020b). These new approaches employed to mine the enormous amount of published data, although very promising for the meta-analysis of the burgeoning number of publications in the field of radiomics, require a well-documented material and methods section, even better if integrated with one or more reproducibility checklists (Kenall et al 2015). Specifically, the IBSI checklist (Zwanenburg et al 2020) for radiomics studies should always be compiled and provided during the submission process, together with other specific checklists such as the TRIPOD (Collins et al 2015), for individual prognostic and diagnostic predictive multivariable models. Among the analyzed papers, only one cited the IBSI initiative, but none of them explicitly supplied the aforementioned checklists.
