Digital phantoms for characterizing inconsistencies among radiomics extraction toolboxes

Yushi Chang; Kyle Lafata; Chunhao Wang; Xiaoyu Duan; Ruiqi Geng; Zhenyu Yang; Fang-Fang Yin

doi:10.1088/2057-1976/ab779c

1. Introduction

Medical imaging research labs are rapidly creating quantitative image analysis techniques that can potentially achieve expert human performance or aid clinical decision making. In particular, radiomics is emerging as one of the leading techniques for decision support, which is a high-dimensional extraction of quantitative features from medical images (i.e., CT, MRI, PET, etc) (Lambin et al 2012). These quantitative imaging features (i.e., radiomic features) can be implemented to develop predictive models for various clinical endpoints including overall survival (Parmar et al 2015a, Parmar et al 2015b), tumor staging (Liang et al 2016), tumor histology classification (Wu et al 2016, Zhu et al 2018, Lafata et al 2018), cancer recurrence (Saha et al 2018, Lafata et al 2019a), functional information (Lafata et al 2019b) etc Inherently based on standard-of-care imaging, radiomics may eventually be fully developed into a non-invasive, low-cost, and personalized decision support tool within clinical environments.

While radiomics is a promising technique, clinical translation is limited due to considerable variation in research pipelines. Typical radiomics pipelines consist of several essential steps, including image acquisition, image segmentation, feature extraction, and feature analysis and/or model building. Often, detailed descriptions of these key steps are missing (e.g., image acquisition parameters (Gilles, 2016), modeling parameters, radiomics extraction software etc), leading to a lack of standardization (Traverso et al 2018). This is a primary reason for the non-reproducibility of radiomics research. Feature extraction consistency is a particularly challenging issue. For example, radiomics findings have been published using various open source software packages, such as CERR (Computational Environment for Radiological Research) (Apte et al 2018), IBEX (imaging biomarker explorer)(Zhang et al 2015), CGITA (Chang-Gung Image Texture Analysis) (Fang et al 2014), and Pyradiomics (Van Griethuysen et al 2017). Furthermore, other groups have opted to develop their own in-house radiomics extraction toolboxes (Parmar et al 2015a, Lafata et al 2018, Larue et al 2017, Leger et al 2017). There is currently a lack of understanding regarding the radiomics extraction consistency among these different toolboxes, and possible variation may be in part responsible for discrepancies in research findings. For example, Foy et al (Foy et al 2018a) demonstrated that features extracted from an in-house toolbox and IBEX resulted in significant differences in classifying patients with and without radiation pneumonitis.

In general, the radiomics community has certainly been aware of the importance of benchmarking feature extraction techniques. Lambin et al (Lambin et al 2017) provided reference feature values extracted from pre-processed CT data of four patients with lung cancer. Foy et al. (Foy et al 2018b) compared features extracted from 2 in-house platforms and 2 open source radiomics toolboxes based on mammography scans and head and neck cancer CT images, and they found feature value discrepancies for texture features by nonparametric Friedman tests. Both studies used patient data. While using patient data provides realistic scenarios for radiomics studies, it is not intuitive for seeking out the source of feature extraction variation (such as texture matrix computation, scalar feature definitions, normalization techniques, etc). Given this consideration, digital phantoms are more suitable for benchmarking the feature extraction techniques. For example, Zwanenburg et al. initialized the Imaging Biomarker Standardization Initiative (IBSI) (Zwanenburg, 2017) to standardize feature extraction across institutions. They built a 5 × 4 × 4 voxel digital phantom with several discrete grey levels, from which standard feature values were computed and shared to 19 institutions in 8 countries. The institutions then reported the feature value similarities and differences.

Due to the current lack of standardization regarding feature extraction, different toolboxes may produce different—yet reasonable—feature values. More specifically, different feature values could be caused by different mathematical definitions, pre-processing implementations, scalar feature computation methods, etc, but these approaches may all be reasonable. Therefore, comparison of absolute feature values may not be fully informative. Instead, we propose a new method based on a series of digital phantoms to quantify relative inconsistencies and characterize sources of variation among several radiomics toolboxes. In particular, in this study we develop a series of digital bar phantoms and a series of heteromorphic sphere phantoms, from which we characterize feature extraction inconsistency among three radiomics toolboxes: IBEX, CERR and an in-house radiomics toolbox.

2. Materials and methods

The overall workflow of this study as shown in figure 1 includes: phantom design (bar phantoms and heteromorphic sphere phantoms), feature extraction using three independent calculation toolboxes, and evaluation on feature extraction consistency.

2.1. Phantom design

We designed two types of phantoms: heteromorphic sphere phantoms and bar phantoms. There were several considerations during the phantom design: (1) the feature values of the phantom should be easy to be manual verification or intuitively perceived for the convenient final validation; (2) no pre-processing (e.g. segmentation and grey level discretization) is necessary; (3) the phantom should be comprised of a series of phantoms with regular intensity, texture, and shape variations, for the convenience of characterizing the sources of variation.

2.1.1. Heteromorphic sphere phantoms

We constructed a series of heteromorphic sphere phantoms for characterizing shape features in a stochastic and iterative mechanism. The rational for this approach was to generate local topological changes on the surface of the phantoms (e.g., as shown in figure 2(a)) and study their effect on shape feature calculation. The starting point was a perfect sphere with radius of 80 voxels, which was iteratively perturbed to simulate changes in size and structure relative to baseline. In each iteration, a sphere with ten-voxel radius was stochastically attached to the surface of the previous iteration of the phantom. The voxel size of 1 mm × 1 mm × 1 mm was used in this study. In addition to the starting point, we acquired the phantoms every 500 iterations and up to 5500 iterations.

2.1.2. Bar phantoms

As shown in figure 2(b), a series of bar phantoms were developed for characterizing intensity and texture features. The bar phantoms consisted of 66 × 66 × 12 voxels, with the voxel size of 1 mm × 1 mm × 2 mm. The outermost one-voxel shell was of value 0. The region of interest (ROI) was specified as the non-zero regions in the phantoms. The shell was designed to mimic the real environment in common radiomics studies. The phantoms were evenly distributed into 2, 4, 8, or 64 bars, and the grey levels of the n bars were evenly distributed between 1 and 64. The construction parameters of each phantom are shown in appendix A, Table A1.

2.2. Feature extraction

We calculated and compared 61 typical features (as shown in appendix B, Figure B1) from 3 radiomics toolboxes: (1) CERR (2) IBEX (3) MATLAB-based in-house platform (MathWorks, Natick, MA, USA). These features were with common naming conventions or different naming contentions but common feature definitions among the toolboxes. They included 22 texture features from the grey-level co-occurrence matrix (GLCOM) (Haralick and Shanmugam, 1973), 11 texture features from the grey-level run-length matrix (GLRLM) (Galloway 1975), 13 texture features from the grey-level size-zone matrix (GLSZM) (Thibault et al 2014), 5 features from the neighboring grey-level difference matrix (NGLDM) (Amadasun and King 1989), 6 shape features, and 4 intensity histogram-based features. All texture features were calculated in 3D mechanism: (1) GLCOM and GLRLM features were calculated along 13 searching directions (as shown in appendix C); and (2) GLSZM and NGLDM features were calculated with 3D-neighbor volume of 26 voxels. Note that four calculation methods were used for shape features as CERR provides two algorithms, Delaunay algorithm and marching cubes algorithm. We compared both algorithms and nominated them as CERR_D and CERR_M, respectively.

2.3. Evaluation on feature extraction consistency

Both concordance correlation coefficient (CCC) and Pearson correlation coefficient (PCC) were used for pair-wise comparison among the three toolboxes. CCC evaluates data reproducibility and inter-rater reliability (Lawrence 1989). High CCC indicates high degree to which the data pair lie along the line of equality (i.e. the 45° line in linear coordinate), and CCC that is not equal to 1 indicates inconsistencies between two groups of data. CCC was scored into 4 categories: (1) perfect agreement: CCC = 1; (2) strong agreement: CCC > = 0.8; (3) moderate agreement: 0.5< = CCC < 0.8; (4) weak agreement: CCC < 0.5. This scoring breakdown mechanism was adopted based on previous studies (Lafata et al 2018, 2019a, Foy et al 2018b, Parmar et al 2014). PCC measures a linear relationship which could departure from the 45° line (Lawrence 1989). The rationale of using PCC is as follows: When the results from different toolboxes do not agree exactly based on CCC, we would want to know by how much the inconsistencies would likely to affect downstream modeling results. If the features produced from two toolboxes are different values but linearly correlated, they may potentially generate similar results with respect to downstream models, relative to non-linear feature inconsistences.

3. Results

3.1. Shape features

The variations of shape features calculated from the heteromorphic sphere phantom with iteration number are shown in figure 3. The volume values calculated from the four calculation algorithms were comparable, while the surface area values increased at two different rates. Since all other shape features are correlated with volume and surface area, they also varied at two rates with iteration number.

The CCC and PCC for each shape feature are provided in appendix D, Table D1 and D2. The CCC confirmed the trends shown in figure 3: the four methods produced similar volume (CCC > 0.98); surface area was highly reproducible (CCC > 0.99) between CERR_D/in-house pair or CERR_M/IBEX pair yet weakly reproducible (CCC < 0.5) between the two groups. Only 2 features, the compactness 2 between CERR_M /IBEX and volume between CERR_M/CERR_D, demonstrated perfect agreement. Meanwhile, all four methods produced PCC > 0.8 for all shape features except compactness 1. For this feature, in-house/IBEX and CERR_M /IBEX resulted in PCCs of 0.76 and 0.77, respectively. These results demonstrate that inconsistencies existed in shape feature calculations among the four methods, but the features values were usually highly linear-correlated.

3.2. Intensity and texture features

The CCC and PCC of intensity histogram-based features and texture features are provided in appendix E, Table E1. The statistical results of CCC evaluation are shown in table 1. It should be noted that IBEX calculates GLRLM features only along 0° and 90° directions and does not provide calculation of GLSZM features, therefore the CCC and PCC of GLRLM and GLSZM features related to IBEX are not applicable in this comparison.

Table 1. Percentage of score categories in CCC pair-comparison.

	Number of Features	Perfect Agreement (CCC = 1)	Strong Agreement (CCC > =0.8)	Moderate Agreement (0.5 < CCC < 0.8)	Weak Agreement (CCC < 0.5)
CCC (IH/CERR)	55	52.7%	70.9%	3.6%	25.5%
CCC (IH /IBEX)	31	45.2%	61.3%	0%	38.7%
CCC (CERR/IBEX)	31	54.8%	61.3%	0%	38.7%

Note: 'IH' represents for 'in-house' radiomics toolbox.

As seen in table 1, the percentage of perfect agreement was only 53%, 45%, and 55% for in-house/CERR, in-house/IBEX, and CERR/IBEX pairs, respectively. The average CCC reached 0.981, 0.991, and 0.998 for features with strong agreements for in-house /CERR, in-house/IBEX, and CERR/IBEX pairs. Most features demonstrated either strong agreement or weak agreement between two radiomics toolboxes. Among the features with moderate or weak CCC, only a few demonstrated a PCC lower than 0.95, which are shown in table 2. A negative PCC value indicates an inverse relationship between two features from different toolboxes. These features which were with moderate or weak agreement on CCC but high PCC demonstrated strong linear relationship, even though the feature values were not highly reproducible between two toolboxes.

Table 2. Features with Pearson correlation lower than 0.95.

Toolboxes	Feature Name	PCC
IH versus CERR	Variation Of Intensity	−0.735
IH versus IBEX	Differential Entropy	−0.387
	Info Measure Correlation1	0.452
	Variation Of Intensity	−0.735
	Coarseness	−0.381
	Complexity	−0.114
CERR versus IBEX	Differential Entropy	−0.386
	Info Measure Correlation1	0.479
	Busyness	−0.387
	Coarseness	−0.382
	Complexity	−0.113
	Contrast	−0.386

4. Discussion

In this study, digital phantoms were developed and implemented to investigate potential inconsistencies among three different radiomics feature extraction toolboxes: CERR, IBEX, and an in-house toolbox. While all three toolboxes incorporate similar stages of computation (i.e., DICOM reading, image pre-processing, contour delineation, gray-scale re-binning, etc), the fashion in which they are implemented are often quite different. Our results demonstrate that variations exist among radiomic features derived from different toolboxes.

As mentioned above, the IBSI (Zwanenburg 2017, Zwanenburg et al 2016) greatly contributed to benchmarking radiomic extraction processes across multiple institutions. Taking a different approach to the sequential phantoms developed in our study, IBSI utilized a single cubic-like digital phantom with several discrete gray levels to achieve its goal of benchmarking. Compared to a single phantom implementation, sequential phantoms have several unique advantages: (1) they are convenient for perceptive checking, which is especially meaningful for first-order features and shape features (relative to a perfect sphere). (2) They make the task of figuring out the sources of variations relatively easy, which would be helpful for the developers of radiomics toolboxes to debug their programs. (3) They provide more information during referring other radiomics studies. Researchers could know which features could be with extraction discrepancies by the CCC evaluation, and by how much the inconsistencies are non-linear by the PCC evaluation. Beyond the merits of serial phantoms, the bar phantoms and heteromorphic sphere phantoms were designed in such shapes and textures on purpose. First, the phantoms were designed to enable easy, intuitive check on the features, considering the physical meanings and mathematical definitions of radiomics features. In particular, most shape features describe the similarity of between the ROI and a perfect sphere, hence we designed a sphere-based phantom with morphological changes. Many texture features are relevant to the homogeneity or heterogeneity of the ROI, and the computation of texture features are related to the rotational variance. Therefore, bar phantoms with a regular homogeneity change and texture variations along different axes of rotation were designed. Second, the phantoms were designed in a way that the differences will be apparent and meaningful trends will be shown. For example, the complex surface of the heteromorphic phantoms results in apparent differences caused by the different algorithms used for calculating surface area and volume. Another example is that any differences in texture feature matrix computation (e.g., GLCOM and GLRLM) will also be apparent with the bar phantoms. Third, they help in understanding mathematical interpretations of features, which is especially helpful for characterizing new feature engineering techniques. For example, our results (as shown in figure 3) demonstrated that the sphericity of the heteromorphic sphere phantoms should be maximized at a perfect sphere and should decrease as the morphology changes until saturating.

We provided two evaluation metrics- CCC and PCC. The importance of CCC and PCC depends on the specific task. For the purpose of QA, CCC is the most important evaluation criteria, since it evaluates data reproducibility. Noted that in general there is no absolute threshold indicating an acceptable degree of discrepancy, even though we scored our CCC results into different categories. PCC is informative during referring others' research findings, since researchers would want to know how much inconsistencies could exist if they used different toolboxes. If the features produced from two toolboxes do not match based on CCC but linearly correlated, they may potentially generate similar results with respect to downstream models, relative to non-linear feature inconsistences.

The use of sequential phantoms, specifically separate phantoms for different types of features, can therefore potentially provide useful information regarding feature extraction inconsistencies across toolboxes. In our study, the disagreement between feature values was driven by many different factors. First, disagreements were observed as a result of differences in mathematical definitions. For example, feature S1 (compactness1) is defined via equation (1) (Aerts et al 2014) in in-house toolbox and IBEX, yet via equation (2) (Zwanenburg et al 2016) in CERR toolbox. Feature I3 (kurtosis) by in-house toolbox and by CERR is in relationship (3). This is the reason why the I3 is in weak agreement in CCC (IH/CERR) but high Pearson correlation (PCC = 1). None of the mathematical definitions were incorrect, however, we can see the necessity of standardizing the mathematical definitions for the generalization of radiomics research findings.

$\begin{eqnarray}&&\begin{array}{l}Compactness1\_inhouse\\ \,=\,Compactness1\_IBEX=\displaystyle \frac{V}{\sqrt{\pi }{A}^{\displaystyle \frac{2}{3}}}\end{array}\end{eqnarray} \tag{ 1 }$

$\begin{eqnarray}&&Compactness1\_CERR=\displaystyle \frac{V}{\sqrt{\pi }{A}^{\displaystyle \frac{3}{2}}}\end{eqnarray} \tag{ 2 }$

$\begin{eqnarray}&&Kurtosis\_CERR=Kurtosis\_inhouse-3\end{eqnarray} \tag{ 3 }$

Second, disagreements were observed due to different pre-processing steps inherent to different toolboxes. As shown in appendix F, Table F1–F6 features I1 (energy) from IBEX demonstrated weak agreement with the other two toolboxes. This is due to the specific gray-level re-binning method implemented in IBEX. Specifically, IBEX applies a re-binning algorithm of dividing the CT number 0-4096 into 25 bins for intensity features, except for entropy whose binning range and bin width are adjustable. In general, however, this type of disagreement could be avoided for clinical applications of radiomics research findings.

Lastly, disagreements were observed as a result of differences in calculation methods. For example, when calculating texture features from joint-probability matrices, the in-house toolbox applied a z-direction normalization factor when accounting for rotational invariance of anisotropic voxels, while CERR did not apply this correction. Another example is the calculation of surface area. As shown in figure 3, the surface area calculated from Delaunay algorithm is lower than that from marching cubes algorithm, and the difference increases with iterations. This difference could potentially be caused by the inherent characteristics of the algorithms, and they are sensitivity in deriving final values. For example, the Delaunay triangulation algorithm tends to incorrectly calculate the fine interfaces of the small attaching spheres, especially at high iteration. Therefore, while these algorithms resulted in similar volume measurements, other shape features that are formulated based on surface area and volume demonstrated larger differences at high iteration.

In addition to these findings, our results also demonstrated the significance of calculating texture features in a fully 3D fashion. In the radiomics toolboxes, feature values are calculated as the average of feature values along 13 unique axes of rotation. Figure 4 illustrates the value of feature 'GLRLM Run Length Percentage' along the 13 unique axes of rotation, where we can see the directional dependence of the feature. IBEX calculates GLRLM features only along 0° and 90° directions, which may cause inconsistencies due to not fully approximating a rotationally invariant system.

**Figure 4.** GLRLM Run Length Percentage feature value along 13 directions.
Download figure:
Standard image High-resolution image

We have demonstrated that there are many sources of variation in feature extraction. Although discrepancies exist, it is important to note that the different implementations are not necessarily incorrect, as long as the software is consistent when comparing results. Researchers should be aware of the differences and consider it when building radiomic models. Additionally, how these variations would affect the clinical decision support is very much a task-based problem and worth further exploration. As concluded in previous studies (Lafata et al 2018, Foy et al 2018a), variations in feature values have been shown to affect downstream modeling results. In particular, Foy et al (Foy et al 2018a) demonstrated that features from different radiomics toolboxes affected the model performance on predicting radiation pneumonitis. Therefore, for future clinical radiomic application, the standardization or QA of radiomics toolboxes will be necessary for the reproducibility and generalization of radiomics research findings. Based on our findings, we suggest a working group for the standardized of radiomic features and the development of a protocol for the QA of radiomic applications. Digital phantoms, similar to the ones used in this study, can potentially be used as a tool for this purpose. Once researchers agree on a universal feature extraction ontology to address the mentioned disagreements, feature values extracted from the digital phantoms may serve as a warranted reference calculation.

We acknowledge that the current research only studies the variability associated with a single component of a typical radiomic pipeline. Thus, no clinical outcome comparison was involved in this study. However, variability of radiomic features with respect to image acquisition parameters, imaging modalities, preprocessing steps, or modeling algorithms could also affect final clinical decision making. Eventually the community will need to understand how different sources of variations interact to understand their aggregate behavior.

5. Conclusions

We developed a series of heteromorphic sphere phantoms and a series of bar phantoms for checking the feature extraction consistency among three radiomics toolboxes: CERR, IBEX, and an in-house toolbox. We observed inconsistent feature values and discrepancies in mathematical definition, pre-processing implementation, and calculation methods among the three toolboxes, thus demonstrating the significance of radiomics software QA. The digital phantoms can be applied as useful tools for the QA.

Appendix A

Table A1. Bar phantom construction parameters.

Number of bars	Grey levels	Bar width (voxel)
2	1, 64	32
4	1, 22, 43, 64	16
8	1, 10, 19, 28, 37, 46, 55, 64	8
64	1, 2, 3, ..., 64	1

Appendix B

**Figure B1.** Full list of features compared in this study. Colors represent feature classes.
Download figure:
Standard image High-resolution image

Appendix C

13 searching directions

Appendix D

Table D1. Pair-wise CCC on shape features.

Feature	CCC IH/CERR_D	CCC IH/CERR_M	CCC IH /IBEX	CCC IBEX/CERR_D	CCC IBEX/CERR_M	CCC CERR_D/CERR_M
S1	0.978	0.275	0.352	0.381	0.092	0.265
S2	0.987	0.570	0.566	0.564	1.000	0.567
S3	0.975	0.200	0.193	0.183	0.998	0.189
S4	0.954	0.046	0.042	0.035	0.977	0.039
S5	0.992	0.194	0.183	0.167	0.995	0.177
S6	0.997	0.997	0.986	0.996	0.996	1.000

Table D2. Pair-wise PCC on shape features.

Feature	PCC IH/CERR_D	PCCIH/CERR_M	PCC IH /IBEX	PCC IBEX/CERR_D	PCC IBEX/CERR_M	PCC CERR_D/CERR_M
S1	0.979	0.995	0.763	0.827	0.772	0.991
S2	0.987	0.991	0.991	0.999	1.000	0.999
S3	0.975	0.994	0.993	0.978	1.000	0.980
S4	0.957	0.916	0.900	0.819	0.999	0.840
S5	0.998	0.997	0.996	0.994	1.000	0.995
S6	1.000	1.000	1.000	1.000	1.000	1.000

Appendix E

Table E1. Pair-wise comparison on intensity and texture features.

Feature	CCC IH/CERR	CCC IH /IBEX	CCC CERR/IBEX	PCC IH/CERR	PCC IH /IBEX	PCC CERR/IBEX
I1	1.000	0.000	0.000	1.000	1.000	1.000
I2	1.000	1.000	1.000	1.000	1.000	1.000
I3	0.023	1.000	0.023	1.000	1.000	1.000
I4	1.000	1.000	1.000	1.000	1.000	1.000
CO1	1.000	1.000	1.000	1.000	1.000	1.000
CO2	1.000	1.000	1.000	1.000	1.000	1.000
CO3	0.000	0.000	1.000	1.000	1.000	1.000
CO4	1.000	1.000	1.000	1.000	1.000	1.000
CO5	1.000	1.000	1.000	1.000	1.000	1.000
CO6	0.996	0.996	1.000	0.998	0.998	1.000
CO7	1.000	−0.188	−0.183	1.000	−0.387	−0.386
CO8	0.000	0.000	0.000	1.000	1.000	1.000
CO9	1.000	1.000	1.000	1.000	1.000	1.000
CO10	1.000	0.984	0.983	1.000	0.999	0.999
CO11	1.000	1.000	1.000	1.000	1.000	1.000
CO12	1.000	1.000	1.000	1.000	1.000	1.000
CO13	0.995	0.197	0.201	0.999	0.452	0.479
CO14	1.000	1.000	1.000	1.000	1.000	1.000
CO15	0.942	0.942	1.000	0.977	0.977	1.000
CO16	0.942	0.942	1.000	0.977	0.977	1.000
CO17	1.000	1.000	1.000	1.000	1.000	1.000
CO18	1.000	1.000	1.000	1.000	1.000	1.000
CO19	1.000	1.000	1.000	1.000	1.000	1.000
CO20	1.000	0.975	0.975	1.000	0.997	0.997
CO21	1.000	0.143	0.143	1.000	0.992	0.992
CO22	0.116	0.323	0.128	1.000	1.000	1.000
RL1	0.000	N.A.	N.A.	1.000	N.A.	N.A.
RL2	0.307	N.A.	N.A.	0.999	N.A.	N.A.
RL3	0.044	N.A.	N.A.	1.000	N.A.	N.A.
RL4	0.359	N.A.	N.A.	0.999	N.A.	N.A.
RL5	0.942	N.A.	N.A.	1.000	N.A.	N.A.
RL6	1.000	N.A.	N.A.	1.000	N.A.	N.A.
RL7	1.000	N.A.	N.A.	1.000	N.A.	N.A.
RL8	0.848	N.A.	N.A.	0.970	N.A.	N.A.
RL9	0.998	N.A.	N.A.	1.000	N.A.	N.A.
RL10	0.866	N.A.	N.A.	1.000	N.A.	N.A.
RL11	0.602	N.A.	N.A.	1.000	N.A.	N.A.
SZ1	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ2	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ3	0.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ4	0.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ5	0.750	N.A.	N.A.	1.000	N.A.	N.A.
SZ6	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ7	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ8	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ9	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ10	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ11	1.000	N.A.	N.A.	1.000	N.A.	N.A.
SZ12	0.000	N.A.	N.A.	−0.735	N.A.	N.A.
SZ13	1.000	N.A.	N.A.	1.000	N.A.	N.A.
ND1	0.000	0.000	0.000	1.000	1.000	−0.387
ND2	0.820	0.000	0.000	1.000	−0.381	−0.382
ND3	0.915	−0.108	−0.105	1.000	−0.114	−0.113
ND4	0.000	0.000	−0.084	1.000	1.000	−0.386
ND5	0.058	0.000	0.000	1.000	1.000	1.000

Note: 'IH' represents for 'in-house' radiomics toolbox. IBEX calculates GLRLM features only along 0° and 90° directions and does not provide calculation of GLSZM features, therefore the CCC of GLRLM and GLSZM features related to IBEX is not applicable in this comparison.

Appendix F

Table F1. Shape feature values.

	In-house	0.037	0.024	0.021	0.020	0.020	0.021	0.021	0.020	0.020	0.020	0.020	0.020
	CERR_D	0.030	0.018	0.017	0.017	0.017	0.017	0.017	0.017	0.017	0.017	0.017	0.016
Compactness1	CERR_M	0.027	0.007	0.005	0.004	0.004	0.004	0.003	0.003	0.003	0.003	0.003	0.003
	IBEX	13.940	9.514	9.334	9.415	9.564	9.864	10.094	10.368	10.613	10.798	10.985	11.199
	In-house	1.512	0.648	0.511	0.445	0.469	0.480	0.477	0.447	0.439	0.432	0.439	0.442
Compactness2	CERR_D	1.010	0.344	0.333	0.326	0.322	0.309	0.309	0.314	0.320	0.318	0.317	0.290
	CERR_M	0.800	0.059	0.031	0.021	0.016	0.014	0.013	0.011	0.010	0.009	0.009	0.008
	IBEX	0.952	0.069	0.035	0.023	0.017	0.015	0.013	0.011	0.010	0.009	0.008	0.008
	In-house	1.148	0.866	0.799	0.764	0.777	0.783	0.782	0.765	0.760	0.756	0.760	0.762
Sphericity	CERR_D	1.003	0.701	0.693	0.688	0.686	0.676	0.676	0.680	0.684	0.683	0.682	0.662
	CERR_M	0.928	0.389	0.315	0.278	0.254	0.243	0.232	0.226	0.217	0.211	0.205	0.200
	IBEX	0.984	0.411	0.328	0.285	0.258	0.244	0.232	0.225	0.216	0.209	0.201	0.196
	In-house	0.871	1.155	1.251	1.310	1.287	1.277	1.279	1.308	1.315	1.323	1.316	1.313
Sphere Dispropor--tion	CERR_D	0.997	1.427	1.443	1.454	1.458	1.478	1.479	1.472	1.463	1.465	1.467	1.510
	CERR_M	1.077	2.568	3.170	3.600	3.933	4.115	4.304	4.431	4.606	4.750	4.883	5.009
	IBEX	1.017	2.434	3.051	3.513	3.880	4.091	4.307	4.451	4.621	4.791	4.963	5.098
	In-house	1197	1526	1824	1995	2155	2322	2521	2688	2853	2992	3124	1197
	CERR_D	1484	1764	2014	2237	2459	2640	2787	2934	3091	3257	3507	1484
Surface Area	CERR_M	2670	3874	4989	6031	6844	7680	8392	9240	10024	10843	11632	2670
	IBEX	2511	3686	4800	5848	6674	7528	8248	9077	9888	10757	11551	2511
	In-house	3134	4007	4888	5736	6514	7268	7959	8688	9414	10192	10915	3134
	CERR_D	3152	4017	4851	5647	6378	7088	7750	8449	9118	9839	10522	3152
Volume	CERR_M	3152	4017	4851	5647	6378	7088	7750	8449	9118	9839	10522	3152
	IBEX	3115	3947	4749	5502	6198	6872	7501	8186	8817	9488	10143	3115

Note: CERRD represents for using Delaunay algorithm in CERR for calculating shape features, and CERRM represent for using marching cubes algorithm.

Table F2. First order feature values.

	In-House				CERR				IBEX
Feature	2	4	8	64	2	4	8	64	2	4	8	64
Energy	8.39E + 07	6.58E + 07	6.07E + 07	5.72E + 07	8.39E + 07	6.58E + 07	6.07E + 07	5.72E + 07	1.07E + 06	1.07E + 06	1.07E + 06	1.07E + 06
Entropy	1.00	2.00	3.00	6.00	1.000	2.000	3.000	5.969	1.000	2.000	3.000	6.000
Kurtosis	1.00	1.64	1.76	1.80	−2.000	−1.360	−1.238	−1.201	1.000	1.640	1.762	1.799
Skewness	0.00	0.00	0.00	0.00	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

Table F3. GLCOM feature values.

	In-House				CERR				IBEX
Feature	2	4	8	64	2	4	8	64	2	4	8	64
Auto Correlation	2026.265	1595.147	1471.971	1389.853	2026.692	1595.385	1472.154	1390.000	2026.728	1595.405	1472.169	1390.012
Cluster Prominence	1.56E + 07	7.78E + 06	4.93E + 06	3.21E + 06	1.56E + 07	7.78E + 06	4.93E + 06	3.21E + 06	1.56E + 07	7.78E + 06	4.93E + 06	3.21E + 06
Cluster Shade	0.000	0.000	−0.001	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Cluster Tendency	3924.529	2170.412	1669.235	1335.118	3925.385	2171.077	1669.846	1335.692	3925.457	2171.133	1669.898	1335.741
Contrast	44.471	14.824	6.353	0.706	43.615	14.538	6.231	0.692	43.543	14.514	6.220	0.691
Correlation	0.978	0.986	0.992	0.999	0.978	0.987	0.993	0.999	0.978	0.987	0.993	0.999
Difference Entropy	0.083	0.195	0.355	0.000	0.081	0.191	0.348	0.000	0.087	0.209	0.391	0.892
Dissimilarity	0.706	0.706	0.706	0.706	0.692	0.692	0.692	0.692	0.691	0.691	0.691	0.691
Energy	0.489	0.234	0.107	0.010	0.489	0.234	0.107	0.010	0.489	0.234	0.107	0.006
Entropy	1.083	2.215	3.419	6.690	1.081	2.210	3.411	6.677	1.086	2.225	3.446	7.510
Homogeneity1	0.989	0.968	0.929	0.647	0.989	0.969	0.931	0.654	0.989	0.969	0.931	0.654
Homogeneity2	0.989	0.966	0.923	0.647	0.989	0.967	0.924	0.654	0.989	0.967	0.924	0.654
InfoMeasure Correlation1	−0.917	−0.893	−0.860	−0.885	−0.919	−0.895	−0.863	−0.886	−0.914	−0.887	−0.851	−0.748
InfoMeasure Correlation2	0.916	0.985	0.997	1.000	0.916	0.985	0.997	1.000	0.916	0.986	0.997	1.000
Inverse Difference Moment Normalized	0.994	0.997	0.998	1.000	0.995	0.997	0.999	1.000	0.995	0.997	0.999	1.000
Inverse Difference Normalized	0.994	0.992	0.990	0.989	0.995	0.992	0.991	0.989	0.995	0.992	0.991	0.989
Inverse Variance	0.000	0.000	0.001	0.706	0.000	0.000	0.001	0.692	0.000	0.000	0.001	0.691
Maximum Probability	0.494	0.242	0.115	0.010	0.495	0.242	0.115	0.010	0.495	0.242	0.115	0.007
Sum Average	65.000	65.000	65.000	65.000	65.000	65.000	65.000	65.000	65.000	65.000	65.000	65.000
Sum Entropy	1.072	2.181	3.340	5.984	1.070	2.178	3.334	5.984	1.076	2.195	3.376	6.876
Sum Variance	3924.529	2170.412	1669.235	1335.118	3925.385	2171.077	1669.846	1335.692	8011.710	6115.584	5467.415	4714.126
Variance	1983.530	1537.590	1410.180	1325.240	992.250	546.404	419.019	334.096	3925.457	2171.133	1669.898	1335.741

Table F4. GLRLM feature values.

	In-House				CERR
Feature	2	4	8	64	2	4	8	64
SRE	0.050	0.068	0.091	0.712	0.046	0.046	0.046	0.046
LRE	775.710	588.919	527.330	498.253	532.084	406.098	361.074	336.254
GLNU	0.033	0.021	0.014	0.009	0.500	0.250	0.125	0.016
RLNU	0.026	0.029	0.048	0.565	0.461	0.437	0.478	0.935
RP	0.085	0.106	0.148	0.727	0.099	0.119	0.159	0.719
LGLRE	0.500	0.251	0.127	0.025	0.500	0.251	0.127	0.025
HGLRE	2048.500	1607.500	1481.500	1397.500	2048.500	1607.500	1481.500	1397.500
SRLGLE	0.025	0.017	0.012	0.018	0.030	0.020	0.014	0.018
SRHGLE	102.949	108.569	134.187	995.507	123.624	130.967	161.102	979.291
LRLGLE	387.950	147.650	66.959	12.685	266.107	101.814	45.848	8.561
LRHGLE	1.59E + 06	9.47E + 05	7.81E + 05	6.96E + 05	1.09E + 06	6.53E + 05	5.35E + 05	4.70E + 05

Note: IBEX calculated GLRLM only along 0 and 90 degrees.

Table F5. GLSZM feature values.

	In-House				CERR
Feature	2	4	8	64	2	4	8	64
SZE	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
LZE	4.19E + 08	1.05E + 08	2.62E + 07	4.10E + 05	4.19E + 08	1.05E + 08	2.62E + 07	4.10E + 05
GLNU	0.000	0.000	0.000	0.000	0.500	0.250	0.125	0.016
SZNU	0.000	0.000	0.000	0.001	1.000	1.000	1.000	1.000
SP	0.000	0.000	0.000	0.001	0.000	0.000	0.000	0.002
LGLSE	0.500	0.251	0.127	0.025	0.500	0.251	0.127	0.025
HGLSE	2048.500	1607.500	1481.500	1397.500	2048.500	1607.500	1481.500	1397.500
SSLGLE	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
SSHGLE	0.000	0.000	0.000	0.003	0.000	0.000	0.000	0.003
LSLGLE	2.10E + 08	2.63E + 07	3.33E + 06	1.04E + 04	2.10E + 08	2.63E + 07	3.33E + 06	1.04E + 04
LSHGLE	8.59E + 11	1.69E + 11	3.88E + 10	5.72E + 08	8.59E + 11	1.69E + 11	3.88E + 10	5.72E + 08
VoI	0.001	0.002	0.003	0.008	992.250	551.250	425.250	341.250
VoA	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000

Note: IBEX does not include the computation of GLSZM features.

Table F6. NGTDM feature values.

	In-House				CERR				IBEX
Feature	2	4	8	64	2	4	8	64	2	4	8	64
Busyness	307.333	91.831	25.479	0.195	222.561	66.768	18.547	0.008	291.892	29.717	5.120	0.000
Coarseness	0.000	0.000	0.000	0.004	0.000	0.000	0.000	0.094	0.000	0.000	0.000	1.00E + 10
Complexity	0.000	0.000	0.000	0.003	43.132	67.094	123.234	33.593	80.453	127.495	236.227	0.000
Contrast	1172.604	108.140	17.856	0.088	679.330	62.901	10.398	0.003	633.565	83.898	15.654	0.000
Texture Strength	0.205	0.229	0.353	5.128	0.283	0.315	0.485	127.988	0.216	0.235	0.358	7.33E + 14

Digital phantoms for characterizing inconsistencies among radiomics extraction toolboxes

Article metrics

Submit

Permissions

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction