Radiomics software comparison using digital phantom and patient data: IBSI-compliance does not guarantee concordance of feature values

Introduction. Radiomics is a promising imaging-based tool which could enhance clinical observation and identify representative features. To avoid divergent implementations, the Image Biomarker Standardisation Initiative (IBSI) imposed conditions for harmonisation. This study evaluates IBSI-compliant radiomics applications against a known benchmark and clinical datasets to assess their agreement. Materials and methods. The three radiomics platforms compared were RadiomiX Research Toolbox, LIFEx v7.0.0, and syngo.via Frontier Radiomics v1.2.5 (based on PyRadiomics v2.1). Basic assessment included comparing feature names and their formulas. The IBSI digital phantom was used for evaluation against reference values. For agreement evaluation (including the same software in different versions), two clinical datasets were used: 27 contrast-enhanced computed tomography (CECT) scans of colorectal liver metastases and 39 magnetic resonance imaging (MRI) examinations of breast cancer, including intravoxel incoherent motion (IVIM) and dynamic contrast-enhanced (DCE) MRI. The intraclass correlation coefficient (ICC, lower 95% confidence interval) was used, with 0.9 as the threshold for excellent agreement. Results. The three radiomics applications share 41 features (3 shape, 8 intensity, 30 texture) out of the 172, 84 and 110 features implemented in RadiomiX, LIFEx and syngo.via, respectively, as well as wavelet filtering. The naming convention is, however, different between them. Syngo.via had excellent agreement with the IBSI benchmark, while LIFEx and RadiomiX showed slightly worse agreement. Excellent reproducibility was achieved for shape features only, while intensity and texture features varied considerably with the imaging type. For intensity features, excellent agreement ranged from 46% for the DCE maps to 100% for CECT; this dropped to 44% and 73%, respectively, for texture features. Wavelet features produced the greatest variation between applications, with excellent agreement for only 3% to 11% of features. Conclusion. Even with IBSI-compliance, the reproducibility of features between radiomics applications is not guaranteed. To evaluate variation, quality assurance of radiomics applications should be performed and repeated when updating to a new version or adding a new modality.


Introduction
Medical images have primarily been used as diagnostic tools interpreted qualitatively by trained observers, while in recent years, especially in oncology, more quantitative evaluation has been performed, primarily to further tailor personalised healthcare. In this context, conventional metrics from functional imaging were first explored, such as the maximum standardised uptake value (SUVmax) in 18F-FDG Positron Emission Tomography (PET) and Ktrans in perfusion Magnetic Resonance Imaging (MRI), which have proven to be robust imaging biomarkers (O'Connor et al 2017). Over the last decade, radiomics has emerged as a promising new tool for quantitative image analysis in addition to the more conventional biomarkers. Radiomics features extracted from the region of interest (ROI) are meant to non-invasively provide important phenotypic information about the entire tumour region and the surrounding tissues (Rogers et al 2020), expressing different properties of the ROI at different levels of complexity. Radiomics features can be categorised into four groups (Rizzo et al 2018): (I) shape features describe the morphological characteristics of the ROI; (II) intensity features use first-order statistics calculated from the intensity histogram, without concern for spatial relationships; (III) texture features describe heterogeneity by calculating the spatial arrangement of the intensity values; (IV) features extracted after applying filters (e.g., wavelet transform) to the original image aim to identify specific image characteristics such as edges and repetitive patterns. These features, potentially combined with complementary clinical data, can be processed with advanced statistical methods and machine learning to create models that predict clinical diagnosis, patient prognosis or therapy response to support clinical decision making (Gillies et al 2016, Dercle et al 2021).
Despite the increasing interest in radiomics, a vast majority of current studies lack robustness and external validation (Sanduleanu et al 2018, Park et al 2020), mainly due to a high false-positive discovery rate (Chalkidou et al 2015), improper study design and sensitivity to unwanted sources of variation not related to the disease phenotype (Yip and Aerts 2016). A feature useful on one dataset may lose its value on another because of different imaging systems and protocols (Jha et al 2021). This absence of reproducibility can influence the generalisability of the final model across multiple datasets from different institutions, limiting its applicability to a very narrow field of use.
The problem of reproducibility also manifests itself in the computation and implementation phase of the features. Many research groups do not report the software used or use unknown in-house developed tools, despite the many software solutions that are freely available and well documented (Song et al 2020). Different implementations often compute various features belonging to different categories with dissimilar terminology, parameter settings, and image processing, resulting in different feature values (Liang et al 2019, Fornacon-Wood et al 2020, McNitt-Gray et al 2020). Therefore, the Image Biomarker Standardisation Initiative (IBSI) (Zwanenburg et al 2020), an international collaboration of researchers from various institutions, was founded to standardise radiomics feature computation by providing feature nomenclature and definitions as well as reporting guidelines and recommendations for image processing prior to feature extraction.
The purpose of this study was to investigate the reproducibility of radiomics features between radiomics applications that participated in the IBSI standardisation. Three IBSI-compliant applications were compared in an in-depth evaluation, using a digital phantom and two clinical datasets of various anatomical regions and imaging modalities.

Study design
The comparison of the three radiomics applications was divided into four phases. In the first phase, we checked the correspondence of the feature names between the IBSI standard and the applications based on their equations. In the second phase, features from the digital phantom were computed on each platform and compared with the reference values provided by IBSI. As applications undergo frequent updates, we tested different versions of the same application for their consistency in the values of the features in the third phase. Lastly, in the fourth phase, the three applications were compared based on their shared features to assess the reproducibility of those features. Clinical data was used in both the third and fourth phases.

Imaging data sets and segmentation
This study used three different image data sets: a digital phantom and two clinical data sets of two different pathologies, comprising 27 computed tomography (CT) and 39 magnetic resonance imaging (MRI) scans.
IBSI provides a digital phantom with one predefined ROI specifically designed to benchmark feature implementations (Zwanenburg et al 2016), together with 169 benchmarked feature values (Zwanenburg et al 2020). The 3-dimensional image consists of 80 voxels of 2 × 2 × 2 mm³ with intensities ranging from 1 to 9 and is associated with a mask of 74 voxels (figure 1). Both image and mask, provided in NIfTI format, were converted to DICOM format (RT-STRUCT object for the mask) using 3D Slicer v4.11 (Fedorov et al 2012) (www.slicer.org).
The clinical data sets originated from two retrospective studies in Jules Bordet Institute, Brussels, Belgium. These studies were approved by the Jules Bordet Institute Ethics Committee (CE 2953 and CE 3117).
The first clinical data set included 27 patients (13 women and 14 men; mean age of 61 years) with histologically confirmed colorectal liver metastases who underwent a contrast-enhanced (CE) CT of the abdomen acquired at the portal phase before their curative-intent surgery. In this data set, a single radiologist observer segmented 36 metastases in a semi-automatic approach using syngo.via Frontier Radiomics v1.2.5 (Siemens Healthcare, Erlangen, Germany). The procedure consisted of adjusting a volume box containing the whole tumour, refining the automated threshold and manually correcting the contours. The median tumour volume was 3.9 ml (range: 0.2-340.4 ml). An example is shown in figure 2(A).
The second clinical data set was composed of 39 patients (39 women, mean age of 52 years). All patients had pathologically confirmed untreated invasive ductal or lobular breast cancer with an indication for neoadjuvant chemotherapy and surgical resection and had undergone a multiparametric MRI prior to treatment. The MRI acquisition protocol included diffusion-weighted imaging (DWI) and dynamic contrast-enhanced (DCE) sequences, where T1 relaxation times were determined using a variable flip angle gradient echo acquisition. Two maps (D, f) were computed from the DWI series using in-house developed software implementing the intravoxel incoherent motion (IVIM) model. Three DCE-MRI maps (Ktrans, Kep, IAUG) were extracted from the dynamic image series with TumourMetrics (Chao et al 2017). An MRI specialist semi-automatically segmented the DCE subtraction series with the same approach as described above, and another radiologist observer validated each contour. For one patient, the segmentation of the tumour was composed of two non-associated clusters. The IVIM maps were rigidly registered to the DCE maps. The median tumour volume was 7.7 ml (range: 0.4-127.0 ml). An example of a Ktrans map and its segmentation is shown in figure 2. Additional details about image acquisition and reconstruction for the two clinical data sets are described in supplementary data (available online at stacks.iop.org/BPEX/8/065008/mmedia) tables 1 and 2. The clinical data structures were stored in syngo.via Frontier Radiomics. They were first exported from syngo.via in STL format, then converted into NIfTI and DICOM format (RT-STRUCT object) using 3D Slicer.

Radiomics applications
The three commercially available radiomics applications were RadiomiX Research Toolbox, LIFEx v7.0.0 and syngo.via Frontier Radiomics v1.2.5 (based on PyRadiomics v2.1). All three applications, or their original code, participated in the IBSI standardisation, where their respective features were compared with reference values of the digital phantom and clinical CT images, and were validated by achieving good reproducibility for multimodal images. Except for image discretisation, no image processing was performed in the radiomics applications (neither spatial resampling nor normalisation). Two approaches to discretisation can be used: a fixed bin number (FBN), which introduces a normalising effect and is recommended for arbitrary intensity units (e.g. raw MRI), and a fixed bin size (FBS), where the relationship between image intensity and physiological meaning is kept. As syngo.via offers only the FBS option, this approach was used. A bin size of 1 was applied for the digital phantom. For the clinical datasets, the bin size was selected to produce a bin count between 30 and 128 for the majority of the tumours, as this range is frequently used in radiomics studies (Tixier et al). Calculation settings for each application were left at their default values. In that way, they were matched as closely as possible according to the information provided in the documentation. Table 2 shows the settings available in each application and the values chosen for this study. Investigated features were categorised as shape, intensity, texture and filter-based. The shape category includes all the morphological features. The intensity category combines local intensity features, intensity-based statistical features, and intensity histogram features, referred to as discretised intensity features in this study.
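The two discretisation approaches can be sketched as follows, following the IBSI definitions; the function names and this minimal NumPy implementation are our own illustration, not code taken from any of the three applications:

```python
import numpy as np

def discretise_fbs(intensities, bin_size, min_intensity=None):
    """Fixed bin size (FBS): constant bin width, preserving the link
    between bin number and absolute intensity (IBSI: floor((x - min)/w) + 1)."""
    lo = np.min(intensities) if min_intensity is None else min_intensity
    return np.floor((np.asarray(intensities) - lo) / bin_size).astype(int) + 1

def discretise_fbn(intensities, n_bins):
    """Fixed bin number (FBN): the intensity range is split into n_bins
    equal-width bins, normalising away the absolute intensity scale."""
    x = np.asarray(intensities, dtype=float)
    lo, hi = x.min(), x.max()
    binned = np.floor(n_bins * (x - lo) / (hi - lo)).astype(int) + 1
    return np.clip(binned, 1, n_bins)  # map the maximum into the last bin
```

With a bin size of 1 and integer intensities starting at 1, as in the digital phantom, FBS leaves the grey levels essentially unchanged, which is why that setting was used for the phantom.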
The texture category includes grey level co-occurrence matrix (GLCM), grey level run length matrix (GLRLM), grey level size zone matrix (GLSZM), grey level distance zone matrix (GLDZM), neighbourhood grey tone difference matrix (NGTDM), and neighbouring grey level dependence matrix (NGLDM). The filter used in this work is the wavelet filter. Since IBSI has not yet standardised the filter-based category, it was not analysed in phase 2. The feature names used in this study correspond to the IBSI nomenclature.

Statistical analysis
After matching feature names based on the provided formulas, we calculated, in phase 2 (comparison to IBSI reference values), the absolute relative difference (RD) between each application and the IBSI benchmark values for shared features. RD is the absolute value of the difference between the application value and the reference value, divided by the reference value and expressed as a percentage. The features were classified to indicate an excellent (RD ≤ 1%), good (1% < RD ≤ 5%), moderate (5% < RD ≤ 10%) or poor (RD > 10%) match to the IBSI reference.
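As a minimal sketch (our own helpers, not code from the study), the RD computation and its classification read:

```python
def relative_difference(value, reference):
    """Absolute relative difference to the benchmark, as a percentage."""
    return abs(value - reference) / abs(reference) * 100.0

def classify_rd(rd):
    """Classification thresholds used in phase 2 of this study."""
    if rd <= 1.0:
        return "excellent"
    if rd <= 5.0:
        return "good"
    if rd <= 10.0:
        return "moderate"
    return "poor"
```

For example, an application value of 104 against a reference of 100 gives RD = 4% and a "good" match.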
Agreement on common radiomics features across software versions (phase 3) and across the different applications (phase 4) was analysed using the two-way mixed effect, absolute agreement, single rater intraclass correlation coefficient (ICC) and its 95% confidence interval. The ICC quantifies the absolute agreement between data sets by comparing the variability in feature values across software packages to the variability in values across patients. Using the lower boundary of the 95% confidence interval (ICC LCI), feature reliability was stratified as poor (values less than 0.5), moderate (values between 0.5 and 0.75), good (values between 0.75 and 0.9) or excellent (values greater than 0.9) (Koo and Li 2016). The ICC was computed for each feature using the pingouin library v0.3.11 (Vallat 2018) (pingouin-stats.org), an open-source statistical package written in Python.
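The point estimate of this ICC can be sketched directly from the two-way ANOVA mean squares. This is the classic ICC(2,1) estimator, which is numerically identical to the two-way mixed, absolute-agreement, single-rater form; the study itself used the pingouin implementation, and the confidence interval is omitted from this sketch:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way model, absolute agreement, single rater.
    ratings: (n_subjects, k_raters) array, e.g. one feature's values
    across patients (rows) and radiomics applications (columns)."""
    r = np.asarray(ratings, dtype=float)
    n, k = r.shape
    grand = r.mean()
    row_means = r.mean(axis=1)   # per-patient means
    col_means = r.mean(axis=0)   # per-application means
    # Two-way ANOVA mean squares
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    sse = np.sum((r - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because the estimator measures absolute agreement, a constant offset between two applications (e.g. one application always reporting a value one unit higher) lowers the ICC even when the rankings across patients are identical.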

Phase 1: matching terminology and formulas
The naming conventions of matrices and features were not entirely consistent with IBSI nomenclature or between the three applications. For example, in the case of matrix names, NGLDM is tagged as GLDM in syngo.via, and the LIFEx NGLDM corresponds, in fact, to NGTDM. As another example, for features calculated with the same equation, the GLSZM GrayLevelNonUniformity in syngo.via is referred to as GLNU in LIFEx and IN in RadiomiX. Supplementary Material contains a complete list of feature names from each application and their correspondence to each other and to the IBSI nomenclature.
The percentages of features shared by RadiomiX, LIFEx and syngo.via relative to the total number of reference values in IBSI are 92%, 29% and 62%, respectively. Forty-one common features were identified between the three applications, consisting of 3 shape features, 8 intensity features, and 30 texture features (see table 3 and Supplementary Material for more details). Among those common features, Kurtosis is not a feature standardised by IBSI, which standardised Excess kurtosis instead (obtained by subtracting 3 from the kurtosis to centre it on zero for normal distributions). For the filter-based features, the wavelet filter is available in all three applications, though not in the older LIFEx v6.65, where it was not yet implemented. The wavelet filter adds 304 features to the phase 4 analysis (intensity and texture features for the 8 different decompositions of the original image).
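The kurtosis offset mentioned above is a constant shift of 3; a minimal NumPy illustration (using plain population estimators, without the sample-size corrections some applications may apply):

```python
import numpy as np

def plain_kurtosis(x):
    """Pearson kurtosis: fourth standardised moment (equals 3 for a
    normal distribution)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2

def excess_kurtosis(x):
    """IBSI 'Excess kurtosis': Pearson kurtosis minus 3 (equals 0 for a
    normal distribution)."""
    return plain_kurtosis(x) - 3.0
```

An application reporting Kurtosis where another reports Excess kurtosis thus differs by exactly 3 for every ROI, which feature matching must account for before any agreement analysis.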
Differences in the implementation of some features between the three applications were recorded based on their documentation. Firstly, shape features with length units are calculated in centimetres in LIFEx and RadiomiX, contrary to millimetres in IBSI and syngo.via. Consequently, these shape features were converted to millimetres as a post-processing step before comparison. Secondly, the applications do not handle segmentations containing more than one cluster in the same way: LIFEx takes into account only the largest cluster within the ROI, while syngo.via and RadiomiX treat the multiple clusters as one volume, syngo.via for every shape feature and RadiomiX for only 6 of its 26 implemented shape features. Therefore, the only patient with two clusters after segmentation was discarded from the analysis. The IBSI standard defines the volume of an ROI in two ways, a voxel-based approach and a mesh-based approach. The latter is used to compute shape features with volume in their formula. Syngo.via computes both options for volume calculation and uses the mesh-based volume for subsequent shape features, as per the IBSI definition. RadiomiX and LIFEx only implement the voxel-based volume, which can impact the subsequent features (e.g. SurfaceArea) compared to the IBSI reference or syngo.via. Finally, the definition of the intensity_MedianAbsoluteDeviation feature did not match between IBSI and RadiomiX. This feature was therefore discarded in the remainder of the analysis.
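For reference, the voxel-based volume (the only option in RadiomiX and LIFEx) is simply the voxel count times the volume of one voxel; with the IBSI phantom numbers (74 mask voxels of 2 × 2 × 2 mm³) this gives 0.592 ml. The mesh-based volume used by syngo.via requires a surface mesh (e.g. from marching cubes) and is typically slightly smaller. The helper below is our own sketch, not code from any of the applications:

```python
import numpy as np

def voxel_volume_ml(mask, spacing_mm):
    """Voxel-counting ROI volume: number of segmented voxels times the
    volume of one voxel. spacing_mm is the (z, y, x) voxel spacing in mm."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return int(np.count_nonzero(mask)) * voxel_mm3 / 1000.0  # 1 ml = 1000 mm^3

# Illustration with the IBSI digital phantom dimensions: an 80-voxel grid
# with 74 voxels in the mask, at 2 x 2 x 2 mm^3 spacing
phantom_mask = np.zeros((4, 4, 5), dtype=bool)
phantom_mask.flat[:74] = True
```

Since several shape features (e.g. Sphericity, SurfaceVolumeRatio) take volume as an input, the voxel- versus mesh-based choice propagates into every dependent feature.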

Phase 2: comparison to IBSI benchmark values with phantom data
Syngo.via provided feature values with excellent agreement (RD<1%) with the IBSI reference values across all 104 features.
With LIFEx, good to excellent agreement was achieved for 33/48 features (3 shape, 10 intensity and 20 texture). The local intensity feature was not computed on the digital phantom due to a bug in the application. The intensity features with moderate and poor agreement were from the discretised intensity group, including the discretised_Entropy feature (RD>10%).
Using RadiomiX, 10 features had a moderate to poor agreement with the IBSI reference values. These included 8 shape features (due to differences in volume definition) and 2 intensity features (intensity_LocalIntensityPeak and intensity_GlobalIntensityPeak). The other 144 features had a good to excellent agreement. Figure 3 summarises the performance of each application compared to IBSI for each feature category. Supplementary Data contains the list of features and their RD values for the three applications.

Phase 3: application version comparison with clinical data
In LIFEx v6.65, the wavelet filter is not implemented; the filter-based features were thus not compared. LIFEx had excellent reliability for all features between v7.0 and v6.65, except for discretised_Maximum, which had moderate reliability (0.5 ≤ ICC LCI < 0.75) for IVIMD and Kep images and good reliability (0.75 ≤ ICC LCI < 0.9) for CECT and IAUG images. The discretised_Maximum in LIFEx v6.65 was systematically one unit lower than in v7.0.
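A systematic one-unit offset is the signature of a bin-labelling convention difference. Purely as a hypothetical illustration (the actual cause inside LIFEx is not documented), zero-based versus one-based bin labels shift the discretised maximum by exactly one:

```python
import numpy as np

x = np.array([3.2, 7.9, 12.0])  # example intensities
w = 4.0                         # FBS bin width

# IBSI convention: the first bin is labelled 1
bins_one_based = np.floor((x - x.min()) / w).astype(int) + 1
# Alternative convention: the first bin is labelled 0
bins_zero_based = np.floor((x - x.min()) / w).astype(int)

# The discretised maximum then differs by exactly one unit between
# the two conventions, for any input image.
```

Such a constant offset leaves rank-based statistics untouched but directly changes discretised_Maximum and any feature derived from the grey level labels.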
For the comparison between the PyRadiomics versions and all its other dependencies, there was excellent agreement for most features (98/107) for CECT, IVIMf, IVIMD and IAUG images. For Ktrans and Kep images, features with poor agreement were registered in the intensity (2/18 for Ktrans and 16/18 for Kep), texture (36/75 for Kep) and wavelet (61/744 for Ktrans and 382/744 for Kep) categories. Maximum and Minimum are among those poorly reliable intensity features, with a substantial difference between versions: for Kep images, the mean Minimum and Maximum feature values are 18.9 and 11680 for v2.1 compared to 8.2 and 3927 for v3.0.1, respectively. These changes in Minimum and Maximum values affect dependent features, such as Kurtosis, Skewness, and some texture and wavelet features.
For the intensity features, agreement errors appear in the DWI and DCE maps, as shown in figure 4. The Minimum feature was in the excellent and good category for CECT and IVIMf, respectively, and in the poor category for IAUG, Ktrans and Kep. Figure 6 shows bar graphs of the number of texture features in excellent, good, moderate or poor agreement for each image type.
The largest discrepancy across all image types was observed for the wavelet features, where most features showed poor agreement (percentage of wavelet features with poor agreement across all software platforms: 69%  for Kep, 77% for Ktrans, 78% for CECT, 84% for IAUG, 93% for IVIMf and 94% for IVIMD).
Overall, the level of reliability for each feature was inconsistent across the clinical data sets. The list of features with their ICC and LCI values for the three applications is available in supplementary data.

Discussion
This study aimed to investigate the reproducibility of radiomics features between three fully automated IBSI-compliant radiomics applications by comparing their feature nomenclature, formulas and output values. In prior studies (Foy et al 2018, Defeudis et al 2020, McNitt-Gray et al 2020, Doshi et al 2021), the lack of reproducibility of the same features was reported for applications that were not all compliant with the IBSI standard, with one study (Fornacon-Wood et al 2020) reporting better reliability if only IBSI-compliant applications were used in the analysis. To our knowledge, the study reported here is the first to investigate feature reproducibility across exclusively IBSI-compliant applications on multiple imaging modalities (CECT and MRI, including maps from IVIM and DCE). We showed that, for the same image and segmentation, feature values with corresponding mathematical definitions and settings were not always identical between applications. These discrepancies can affect the performance of radiomics models and limit their generalisability (Bogowicz et al 2017, Liang et al 2019, Korte et al 2021). Shape features were the most reliable features on clinical data, in contrast to the results of another study comparing IBSI-compliant software (Fornacon-Wood et al 2020), which reported an uncertainty for the Sphericity feature that was not present in our study. Wavelet features were the least reproducible, in line with the results of the study that also analysed this filter (Bogowicz et al 2017). These results were expected, as IBSI has not yet tackled the standardisation of filter-based features. The reproducibility of intensity and texture features was heavily dependent on the imaging modality used. The results on CECT images were better than in the studies that compared applications that were not IBSI-compliant (Bogowicz et al 2017, Foy et al 2018), but worse results were observed with perfusion MRI-based maps.
This was linked to the substantial difference in the Minimum and Maximum values between RadiomiX, LIFEx and syngo.via for those images (figure 5), which has a knock-on effect on dependent features in the intensity category, as well as on the discretisation process, which results in most texture features not matching either. Those large discrepancies in extreme intensity values can be linked to two different sources. The first source is the orientation resampling executed automatically in LIFEx, where images acquired with an orientation matrix different from the identity matrix (which is often the case for MRI) are reoriented into the DICOM reference coordinate system. The interpolation linked to this resampling modifies the extreme values, which is especially visible for the Minimum feature in Kep images, where negative intensities appeared. This resampling can be disabled in the application's settings, resolving one issue. We have not found a definitive explanation for the significant increase of the Maximum feature from LIFEx and PyRadiomics v3.0.1 to RadiomiX and syngo.via. A possible explanation is that the conversion between the input images and the ones used for feature computation differs inside each application, using unknown and different DICOM tags. As this behaviour appeared for the MRI maps, it may be due to some mishandling of DICOM tags during the pre-processing steps of the creation of those maps.
The matching of features and settings between applications was not straightforward. The naming convention should be further harmonised to better correspond to the unique identifier defined in the IBSI guidelines, as shown in phase 1. Therefore, this study has provided a detailed overview of matching nomenclature between the radiomics applications and the IBSI standard. In addition, it is challenging to match the feature extraction settings between RadiomiX, LIFEx and syngo.via. Different settings are available between applications, and sufficient documentation is not always provided to match the settings across platforms. For example, several options are available for the wavelet filter in LIFEx and syngo.via but not in RadiomiX, where the implementation is not well documented and unclear.
The utilisation of the IBSI digital phantom enabled the detection of differences and errors in feature implementation. The volume definition differences resulted in inferior reproducibility of shape features for RadiomiX and LIFEx in phase 2. A discretisation error in LIFEx v7.0 was discovered through the poor reproducibility of the discretised intensity group and some texture features, with deviations from the IBSI reference values that were not present in the earlier version (v6.65). Those variations in shape, discretised intensity and texture features were not visible in the clinical data, where the intensity range and volume are larger than in the phantom, which could explain the inferior results obtained with the digital phantom. This part of the analysis shows that it is important to test each application and each version, even if the application is considered IBSI-compliant and has therefore already been tested with the digital phantom. Results may vary from one version to another, and errors may appear in more recent versions if this phantom check is not systematically repeated during updates. The use of this phantom with known ground-truth values also helps to confirm the feature matching and the default parameters used in the applications.
Performing resampling and normalisation is recommended as part of a workflow in a radiomics clinical environment. Indeed, these pre-processing steps can improve feature reproducibility (Shafiq-ul-Hassan et al 2017, Scalco et al 2020, Traverso et al 2020a) and radiomics performance. Harmonisation, especially for raw MRI, reduces variability in intensity distributions between patients, imaging protocols and scanners and therefore produces more robust radiomics studies (Isaksson et al 2020, Hu et al 2021, Saint Martin et al 2021). Different harmonisation solutions are emerging, such as ComBat, which shows promising results for all types of imaging biomarkers (Orlhac et al 2022), as well as normalisation using deep learning (Mali et al 2021). Despite the importance of normalisation in radiomics studies, we decided to minimise the pre-processing steps, as we focused on how the features were mathematically implemented in each application, not on the pre-processing around it. Indeed, IBSI provides recommendations for pre-processing but not a clear workflow. Therefore, the definition of user-dependent workflows (e.g. how to perform image resampling: up-sampling, down-sampling, interpolation, etc.), and their different implementations, or their unavailability, among all the applications (table 1) would introduce even more sources of variation. Image pre-processing would reduce the observed differences in feature values between radiomics applications, when there should be no difference at all for the same image and segmentation.
Another important aspect of the radiomics workflow is the strong correlation between feature values and the volume and voxel size. It has been demonstrated that features, and hence radiomics signatures, can be a surrogate for volume (Welch et al 2019). Analyses to identify features that have unanticipated cross-correlations with known prognostic features have to be executed to avoid false conclusions, as proposed in published selection procedures (Traverso et al 2020b, Pfaehler et al 2021). An alternative to eliminate the dependence on voxel size and VOI size is to normalise radiomics features by including the number of voxels in their mathematical definitions, as proposed by Shafiq-ul-Hassan et al (2017, 2018). This could be a next step for IBSI to tackle: improving the formulas to avoid volume-confounding effects.
There are some limitations to this study. The impact of resampling and normalisation was not tested, as discussed before. The lack of pre-processing might have led to greater inter-patient variability, which might bias the clinical data results due to the metric used (ICC). As the images used are not raw images, as can be the case for some MRI sequences, and they come from the same scanner with the same imaging protocols, the variations between patients should remain limited. This study also used in-house developed programs to create the maps and perform the registration between them. As much as this should be avoided to allow easier reproducibility and validation of radiomics studies, the use of internal scripts to compensate for the lack of comprehensive applications remains a reality. Despite those limitations, the results of this study and the subsequent message remain: for equivalent inputs (image, segmentation and settings), IBSI-compliant radiomics applications currently do not provide the same output (feature values).

Conclusion
IBSI-compliant radiomics applications were explored in detail using different datasets (digital and clinical). We showed that the reproducibility of features between those radiomics platforms is not guaranteed, and the translation between applications of radiomics models directly using these features should be performed with care. Further harmonisation is required to better adhere to the IBSI guidelines regarding terminology, settings and calculation. Based on the results, we recommend performing a commissioning of the radiomics application before starting a radiomics study. This commissioning should be accomplished in two steps, first with the IBSI phantom and then with clinical data, as both allow the discovery of different sources of errors. This commissioning of radiomics software should be repeated when updating to a new version or when using a new imaging modality.