Radiomic analysis for early differentiation of lung cancer recurrence from fibrosis in patients treated with lung stereotactic ablative radiotherapy

Objective. The development of radiation-induced fibrosis after stereotactic ablative radiotherapy (SABR) can obscure follow-up images and delay detection of a local recurrence in early-stage lung cancer patients. The objective of this study was to develop a radiomics model for computer-assisted detection of local recurrence and fibrosis for an earlier timepoint (<1 year) after the SABR treatment. Approach. This retrospective clinical study included CT images (n = 107) of 66 patients treated with SABR. A z-score normalization technique was used for radiomic feature standardization across scanner protocols. The training set for the radiomics model consisted of CT images (66 patients; 22 recurrences and 44 fibrosis) obtained at 24 months (median) follow-up. The test set included CT-images of 41 patients acquired at 5–12 months follow-up. Combinations of four widely used machine learning techniques (support vector machines, gradient boosting, random forests (RF), and logistic regression) and feature selection methods (Relief feature scoring, maximum relevance minimum redundancy, mutual information maximization, forward feature selection, and LASSO) were investigated. Pyradiomics was used to extract 106 radiomic features from the CT-images for feature selection and classification. Main results. An RF + LASSO model scored the highest in terms of AUC (0.87) and obtained a sensitivity of 75% and a specificity of 88% in identifying a local recurrence in the test set. In the training set, 86% accuracy was achieved using five-fold cross-validation. Delong’s test indicated that AUC achieved by the RF+LASSO is significantly better than 11 other machine learning models presented here. The top three radiomic features: interquartile range (first order), Cluster Prominence (GLCM), and Autocorrelation (GLCM), were revealed as differentiating a recurrence from fibrosis with this model. Significance. The radiomics model selected, out of multiple machine learning and feature selection algorithms, was able to differentiate a recurrence from fibrosis in earlier follow-up CT-images with a high specificity rate and satisfactory sensitivity performance.


Introduction
Stereotactic ablative radiotherapy (SABR) is a standard treatment for patients with early stage non-small-cell lung cancer (NSCLC) who are medically inoperable or refuse surgery (Moghanaki and Karas 2013, Senan et al 2013, Van Schil and Van Meerbeeck 2013, Nagata et al 2015, Zappa and Mousa 2016, Li et al 2019. The reported local control rates for SABR have been comparable to surgery with 96% at 1 year follow-up and 92% at 2 years follow-up (Chen et al 2019). Unfortunately, the radiation that is delivered to the tumour and surrounding tissues can lead to the development of a radiation-induced lung injury, such as radiation pneumonitis (acute toxicity classified within <6 months of the treatment) or fibrosis (classified thereafter), which can obscure underlying progression of local disease (Takeda et al 2008, Huang and Palma 2015, Mattonen et al 2016a. Fibrosis can mimic a local recurrence in terms of radiologic morphology and time course evolution (Matsuo et al 2007, Dahele et al 2011. This poses a difficult clinical problem: accurately detecting a local recurrence and distinguishing it from radiation-induced fibrosis. Inaccurate classification of a local recurrence as fibrosis (false negatives) can delay the treatment of a recurrent disease and classification of fibrosis as recurrence (false positives) can lead to unnecessary medical interventions (Matsuo et al 2007, Taira et al 2014. Response evaluation criteria in solid tumours (RECIST) has been the standard method of evaluating treatment response in oncology based on CT imaging (Huang and Palma 2015). Since RECIST relies on the diameter of the region of interest alone to classify recurrence, it is not a reliable technique for recurrence detection when there is an emerging mass of fibrosis (Huang and Palma 2015). Furthermore, RECIST measurements are performed in the plane of image acquisition, tumours have irregular shapes, and the measurements are subject to inter and intra-observer variability (Erasmus et al 2003). Other radiographic highrisk features (HRFs) predictive of recurrence have also been reported (Huang et al 2012). These HRFs are also subject to inter-observer variability, and may only become apparent more than a year after treatment (Huang et al 2013). FDG-PET/CT scanning may be more accurate than CT imaging for recurrence detection, but currently it is not routinely available in many healthcare institutions (Huang and Palma 2015). Furthermore, FDG/PET imaging is more reliably used to confirm a suspicious recurrence in periods within 12 months of SABR treatment (SUV max > 5) (Lee et al 2022). Biopsy is the gold standard for confirming a tumour recurrence, however it poses a risk of serious complications during the procedure (Wiener et al 2013). Therefore, the detection of a lung tumour recurrence would benefit from the development of early, accurate, and low-risk detection tools.
Radiomics is a quantitative image feature analysis tool in which a large number of features are extracted from medical images and then used to assess the underlying pathophysiological information (Rizzo et al 2018).
Mattonen et al showed that the radiomic features, extracted from CT images of 45 NSCLC SABR patients within 6 months of treatment, could be useful for the detection of a local recurrence (Mattonen et al 2016b). In this study we demonstrate that the radiomic features extracted from the CT-images, at the time a recurrence is detected by radiology or FDG/PET, can be used to differentiate a local recurrence from radiation-induced fibrosis at an earlier timepoint. The choice of machine learning methods and feature selection techniques are critical for the overall accuracy of the model. Consequently, we investigated a combination of four widely used machine learning and five feature selection techniques and selected the classification model that achieved the highest AUC score in the test set (Parmar et al 2015, Demircioğlu 2022. In conclusion, we demonstrate the clinical utility of the model with the highest AUC score in differentiating a local recurrence from fibrosis.

Material and methods
Patient baseline characteristics Our lung SABR protocol included CT assessment at 3, 6, 12, 18 and 24 months, then yearly until 7 years posttreatment. In this retrospective study, images acquired between 1st January 2011 and 1st February 2019 were included, in accordance with research ethics board approval. The final patient dataset (n = 66) consisted of patients with adenocarcinoma, squamous cell carcinoma or unspecified NSCLC. Metastatic cases were removed from the dataset due to the possibility that the underlying tumour physiology may be different to that of patients with primary NSCLC. The clinical database is comprised of 550 patients. The criterion for patient selection was that the patients should have either a local recurrence or fibrosis identified in their imaging reports. All patients with a local recurrence (n = 22) were selected, as well as 44 other patients (fibrosis) for radiomic analysis. Table 1 summarizes the baseline patient characteristics of the patients included in the study. Of the 22 patients with local recurrences confirmed by a radiation oncologist with 9 years experience, 17 were identified with fluorodeoxyglucose (FDG) -positron emission tomography (PET) or FDG/PET images (average SUV max = 10.8, median = 10), 1 with biopsy, and 4 were identified with only follow-up CT imaging. These 4 had multiple CT-scans during the follow-up surveillance (>1.5 year of follow-up) and had new recurrent nodules documented in the medical imaging reports. Patients with fibrosis were selected if the radiologists report contained keywords such as 'no recurrence' or 'fibrosis' and were subsequently evaluated by the radiation oncologist. The routine follow-up CT images used were from 5 to 9 months, 9-12 months, and 24 months (median) follow-up.

Training and test set
The training and test set were split based on the follow-up timepoint. The CT-images acquired at the 24 month (median) timepoint, when a local recurrence was initially confirmed, were used as the training set (n = 66 patients). The final test set (n = 41) was a combination of the two earlier follow-up timepoints, i.e. 5-9 months (n = 19) and 9-12 months (n = 22). Table 2 summarizes the patient separation of the training and test set evaluated in the current study. Note that the patients in the two earlier follow-up timepoints were a subset of the patients in the training set (due to lack of availability of new recurrence and imaging data). The 5-12 months follow-up CT images of the 41 patients were acquired with different CT scanners and imaging parameters (appendix table D shows the scanner parameters used in this study). Using different follow-up CT images, with various CT-imaging parameters, ensures that there was no information leakage between the training and test data.

Segmentation and feature extraction
Most of the CT-images in the study were originally scanned with a 2.5 mm slice thickness and approximately a 1 mm axial resolution. In order to maintain as much original information as possible, images were interpolated using a Sitkspline function to a common voxel size of (1 mm´1 mm´2.5 mm) using the pyradiomics package (Van Griethuysen et al 2017). These preprocessed images (n = 108) were segmented using the GrowCut segmentation algorithm in 3D Slicer software (Fedorov et al 2012). A margin was created and two pixels were labeled by the user, a pixel inside the region of interest and a background pixel, enabling the algorithm to delineate the segmentation based on the binary labels (appendix A, figure A1 and A2) (Vezhnevets and Konouchine 2005). The segmentation was reviewed by an experienced radiation oncologist and adjusted prior to feature extraction, which was completed with an open-source python-based package, pyradiomics (Van Griethuysen et al 2017). The pyradiomics settings are shown in appendix B. There were 106 features extracted from the seven radiomic feature classes (appendix C), (1) first order, (2) shape, (3) gray level dependence matrix (GLDM), (4) gray level co-occurrence matrix (GLCM), (5) gray level run length matrix (GLRLM), (6) gray level size zone matrix (GLSZM) and (7) Neighbouring gray tone difference matrix (NGTDM) (Van Griethuysen et al 2017).

Feature standardization and selection
The images had been acquired with three different scanner vendors (General Electric, Siemens, Toshiba), which necessitated feature standardization. Accordingly, three groups were created in the training set and z-score normalization of the radiomic features was performed using the mean and standard deviation of the features for each group (appendix D). (Da-Ano et al 2020). The test set was individually normalized using the mean and standard deviation of the corresponding scanner in the training set. This z-score normalization was completed before the feature selection.
To approach the imbalance class (recurrence versus fibrosis) in our training set, we used a synthetic minority over-sampling technique (SMOTE) to synthesize new recurrence data to train the model (Chawla et al 2002). The final training set contained approximately 1:1 ratio of recurrence and fibrosis.
The feature selection was performed in a two-stage process. Firstly, five feature selection techniques were chosen for evaluation based on their prevalence in radiomics literature as published by Parmar et al (2015). Furthermore, the implementation of these feature selection techniques is published in R and python libraries, which are publicly available for reproducibility purposes (R Core Team R 2019). Three of the feature selection methods were filter-based: relief feature scoring (RELIEF version 2. . The five feature selection algorithms were initially used to reduce the features list from 106, based on their respective search strategies and individual scoring system. This was performed to remove the redundant features and increase computational efficiency for the second stage of feature selection (Yang et al 2018). For relief and MIM, we selected the radiomic features above a threshold score of 50% of the maximum relief score (Cunningham et al 2021). With MRMR, a search strategy of reducing 50% of the total number of features was used. LASSO (by tuning parameter lambda) and forward feature selection algorithms (using cross validation, appendix E, figure E1) were implemented such that an optimal number of features is automatically selected. In the second stage of the feature selection, a recursive feature elimination with cross-validation (RFECV) and hyperparameter tuning, based on the respective classification model, was used to select an optimal subset of features from the first stage of the feature selection. The trend in classification accuracy for each classification model is presented in appendix F (figure F1-F4). The best subset of features was then used to tune the hyperparameters and finally train the classification model (appendix H, figure H1-H4).

Classification using five widely known classifiers
The Python-based library (sci-kit learn) was used to build and evaluate the radiomics model (Pedregosa et al 2011). Four widely used classifiers: support vector machines with linear kernel (SVM), gradient boosting (XGBoost), random forests (RF), and logistic regression (LR) were chosen based on the previous study by Demircioğlu (Mattonen et al 2016b, Demircioğlu 2022. After the feature selection process, GridSearchCv (with five-fold cross validation) was used to tune the hyperparameters of each of the classification models (LaValle et al 2004). Appendix G shows the grid points used for each classifier and the tuned hyperparameters used for each classification model. From each classification model, the area under the receiver operating characteristic curve (AUC) and the Mathews correlation coefficients were reported. The highest AUC in the final test set was considered to be the best performing model. Delong's test was implemented to compare the AUC performances of the 20 machine learning models presented in the study and determine if the classification model with the highest score was statistically significant or not. R library 'pROC' was used for this part of the statistical analysis (Robin et al 2011).  Figure 1 illustrates the heat map of the accuracy values obtained using the combination of five radiomic feature selection techniques and four classification models evaluated on the training set (from images at 24 months), using a five-fold cross validation approach. The highest average training accuracies were obtained by the classification models trained with random forest classifiers. For the feature selection, LASSO performed the highest in training accuracy (figure 1). The RF+MIM model achieved the highest overall training accuracy with 91%, followed by LR+LASSO with 88%, SVM+LASSO with 87%, RF+LASSO and RF + forward feature selection both with 86% accuracy, based on a five-fold cross validation approach. The sensitivity and specificity of the five-fold cross validation using RF + LASSO in the training set was 75% and 91% respectively. However, in the final test set (n = 41), the highest AUC was obtained using the RF + LASSO technique, with an AUC of 0.87 (figure 2). The model with the highest AUC in the test set (RF + LASSO) used a maximum random forest depth parameter of 13. The top three features based on the feature importance plot were Interquartile range (first order feature), Cluster prominence (GLCM), and Autocorrelation (GLCM). The top three radiomic features selected by the RF + LASSO technique are described in table 3. Figure 3 presents the heatmap of the second evaluation criteria, the Mathews correlation coefficient, which was implemented in the study. The model with the highest AUC also corresponded to the highest Mathews correlation coefficient value (0.63). The feature importance plot of each classification model is presented in appendix appendix H. Appendix I summarizes all the features selected by the best performing model for each type of classifier.
A sensitivity of 75% and a specificity of 88% were obtained in the final test set (n = 41) with the RF+LASSO classification model. Based on the sensitivity results in detecting a local recurrence, the RF+LASSO and LR +Relief scored the highest with 75%, however, the LR+Relief model performed inferior in terms of specificity (88% with RF+LASSO and 78% with LR+Relief). The area under the receiver operating curve (ROC) of the best performing models from each classification model are presented in figure 4. Table 4 illustrates the performance of each of the classifiers with the respective feature selection methods, with the p-value from DeLong's test. The AUC values achieved by the 11 machine learning models that were significantly different from the RF+LASSO model are shown in bold. However, the AUC values obtained by the 8 remaining models were not statistically different based on the Delong analysis.   Calculates the difference of 75th and 25th percentile of the image array   Table 5 summarizes the top 3 consistently ranked radiomic features across the 8 classification models, which were not statistically different from the RF+LASSO model. One radiomic feature (Cluster Prominence) was common in both tables 5 and 3 (top features selected by the RF+LASSO model).

Discussion
The detection of a local recurrence, and differentiation from radiation-induced fibrosis, is an important clinical challenge. Presuming that a recurrence is simply fibrosis may miss the opportunity to perform a salvage treatment and can potentially impact patient survival, whereas identifying fibrosis as a recurrence could result in unnecessary medical interventions.
Our study investigated several common machine learning and feature selection algorithms in combination with each other to select the best performing radiomics model that can be used for this application. We chose the model that scored the highest AUC in the test set for discussion, with the ultimate goal of developing a radiomics model that will assist the physician at an earlier timepoint post-SABR treatment, despite the fact that some of the other models performed better in the five-fold cross validation result on the training set. Nevertheless, we conclude that our model performed reasonably well at identifying a local recurrence and fibrosis in both the early and late follow-up CT images, which may assist clinical decision making. The sensitivity and specificity analysis demonstrated that the RF+LASSO radiomics model could be a supplementary tool for physicians to discriminate a recurrence from fibrosis at an earlier timepoint (<1 year). Nevertheless, four other machine learning models had superior performance based solely on specificity, but only one other model performed equally well in terms of sensitivity, in comparison to RF+LASSO.
The radiomics assessment in our study indicated that a local recurrence could be detected at 5-12 months (test set) follow-up with a satisfactory sensitivity of 75% and specificity of 88%. At 24 months follow-up (training set), an accuracy of 86% was achieved using five-fold cross validation approach, with the model scoring the highest AUC in the test set. Mattonen et al reported that within 6 months of SABR treatment, physicians classified the majority as fibrosis, with a sensitivity to detect a local recurrence of only 1%. In this study, we obtained a sensitivity of 75% within 5-12 months follow-up period, based on our best performing radiomics model. In the Mattonen study, five radiomic features were selected and a false positive rate of 24% and a false negative rate of 23% were reported with an area under the receiver operating characteristics curve (AUC) of 0.85 (Mattonen et al 2016b). It should be noted that they used a folded cross-validation approach to evaluate the five radiomic features (Mattonen et al 2016b) In our study we evaluated the radiomics models on a separate test set, which may be more indicative of the true performance of the features selected, compared to the folded-cross validation approach.
In the context of clinical utility, we compared the RF+LASSO model performance with CT-based imaging high-risk features, which have been reported to detect a local recurrence with excellent sensitivity and specificity at later time points. One of the most accurate predictors of recurrence was enlarging opacity after 12 months, achieving a sensitivity of 100% and specificity of 83% at 22 months follow-up (Huang et al 2013). Akinci et al reported that using three or more high-risk CT features could achieve sensitivity and specificity beyond 90% (Akinci D'Antonoli et al 2020). Nonetheless, these high-risk features inherently fail to detect a recurrence within a year of the SABR treatment (Huang et al 2013). Furthermore, another study has reported that up to half of their patients with no recurrence have developed high-risk CT features, indicated its limitations and the need for further refinement of the predictors (Akinci D'Antonoli et al 2020). For comparison, the sensitivity and specificity values from our current study in the training set may be compared to the results obtained with highrisk CT features at 22 months. The results indicate a lower sensitivity value (75% with RF+LASSO, 100% with HRF), but higher specificity (91% with RF+LASSO, HRF with 83% specificity) based on five-fold cross validation approach (Huang et al 2013). The test set results obtained in the current study for <1 year follow-up cannot be compared with the high-risk CT features due to a lack of data using the CT imaging predictors. However, our model was able to accurately identify radiation-induced fibrosis in early follow-up timepoints with high specificity (88%) in the test set.
The top feature selected through the entire feature selection process for the RF+LASSO method is interquartile range (first order feature), which calculates the difference between 75th and 25th percentile in the image array. Based on the final trained model, this first order feature has the highest importance in classifying a recurrence or fibrosis. The second most important feature was a texture feature known as Cluster prominence, which measures the skewness of the GLCM. The third most important feature is Autocorrelation (GLCM), which measures the fineness and coarseness of the image. It should be noted that other algorithms obtained AUC values that were not significantly different from RF+LASSO (see table 5). The top 3 consistently ranked radiomic features across the 8 machine learning algorithms were dependence non-uniformity normalized (GLDM), cluster prominence (GLCM), and dependence entropy (GLDM).  Table 5. Presents the radiomic feature selected by RF+MRMR.

Radiomic features Formula Description
Dependence non-uniformity normalized The novelty of our approach is the use of the late follow-up CT image data to build a radiomics model, and then to investigate if the model could be utilized to a detect recurrence at earlier timepoints. The advantage of using late follow-up CT images is that we have the FDG/PET confirmation of a local recurrence disease, whereas our PET findings from the patient reports in the earlier follow-up are generally indeterminate. Therefore, the current approach suffers an unavoidable limitation of not being able to use earlier follow-up CT images to build the model since a local recurrence is not detectable with conventional assessment within the 5-12 month period. Thus, in this work we used later follow-up CT-images such that the radiomic features correspond to a true local recurrence. Another inherent limitation of the current approach is that we have assumed that the radiomic features that distinguish a recurrence from fibrosis (in the later follow-up images) can be used at an earlier timepoint. However, due to a local recurrence being unconfirmed in the earlier stages of the follow-up, the current approach offers more benefit than using earlier follow-up CT images in the training of the radiomics model.
Another limitation of the current study is that we assumed the presence of a local recurrence developing in the test set of patients at the earlier timepoint when evaluating the classification model, but a recurrence was only confirmed at the 24 month follow-up time point. The radiological findings in those earlier timepoints are uncertain, so this could impact the reported sensitivity in the study test set. Our aim is to have the radiomics model assist the physicians by flagging patients with a possible recurrence or identifying patients with fibrosis. Clinical decisions would still be made in the context of other radiological findings, FDG/PET or interventional biopsy procedures. Another limitation of this study is that DeLong's statistical analysis was unable to confirm that the highest AUC achieved with RF+LASSO is statistically significant compared to 8 other classification models. For simplicity, we used the model with highest AUC and sensitivity performance to assess the clinical utility in comparison to current clinical tools.
One other limitation of the study is that the patients in the test set were a subset of the training set, though their follow-up CT images were different in terms of timepoint and imaging protocol in some cases. However, there could be similar radiomic feature characteristics between the early and late follow-up CT images of those patients. Finally, there are improvements that could be made to the current study. The lack of available recurrence data was a challenge, especially with confirmed biopsy. In order to refine machine learning techniques, prior to clinical implementation, collaborations forming a larger patient dataset would be required.

Conclusion
This study has shown that radiomic features can be used to accurately identify a radiation-induced fibrosis at an earlier time point after SABR treatment, with satisfactory performance in detecting a local recurrence. The analysis included a separate test set, resulting in a high specificity (88%) and a satisfactory sensitivity (75%) with an RF+LASSO model. To conclude, non-invasive tools such as radiomic analysis could be used to differentiate a recurrence from fibrosis. If introduced into a clinical workflow, this could aid physicians in planning salvage therapies or additional treatment for lung cancer patients who have underwent SABR treatment.

Funding statement
This work was supported by the BC Cancer Foundation and Moss Rock Park Foundation.
Appendix A Figure A1. GrowCut segmentation algorithm available in the 3D Slicer software. The GrowCut algorithm requires two input, approximate region of interest (green) and boundary (yellow). The final segmentation is produced using a cellular automaton algorithm. (a) represents a tumour recurrence and (b) represents a fibrosis. Figure A2. Illustrates an example of segmentation produced by the semi-automated segmentation algorithm. Relief Autocorrelation, large dependence low gray level emphasis, elongation, mesh volume, small dependence emphasis, surface volume ratio, cluster prominence XGBoost Forward feature selection Dependence entropy, maximum 2D diameter row, cluster tendency, dependence non-uniformity normalized, high gray level emphasis, surface area, least axis length, dependence non uniformity, small dependence low gray level emphasis, mesh volume, small dependence high gray level emphasis, minor axis length, correlation, contrast, Small dependence emphasis, gray level non uniformity, low gray level emphasis Support vector machine Forward feature selection Dependence non-uniformity normalized, dependence entropy, small dependence low gray level emphasis, contrast, gray level non uniformity, mesh volume, least axis length, minor axis length, correlation, high gray level emphasis, small dependence emphasis, Low gray level emphasis, maximum 2D diameter row, small dependence high gray level emphasis, surface area, dependence non uniformity, cluster tendency