Comparison of semi-automatic and manual segmentation methods for tumor delineation on head and neck squamous cell carcinoma (HNSCC) positron emission tomography (PET) images

Objective. Accurate and reproducible tumor delineation on positron emission tomography (PET) images is required to validate predictive and prognostic models based on PET radiomic features. Manual segmentation of tumors is time-consuming, whereas semi-automatic methods are easily implementable and inexpensive. This study assessed the reliability of semi-automatic segmentation methods relative to manual segmentation for tumor delineation in head and neck squamous cell carcinoma (HNSCC) PET images. Approach. We employed manual and six semi-automatic segmentation methods (just enough interaction (JEI), watershed, grow from seeds (GfS), flood filling (FF), 30% SUVmax and 40% SUVmax threshold) using 3D Slicer software to extract 128 radiomic features from FDG-PET images of 100 HNSCC patients independently by three operators. We assessed the distributional properties of all features and considered 92 log-transformed features for subsequent analysis. For each paired comparison of a feature, we fitted a separate linear mixed effect model using the method (two levels; manual versus one semi-automatic method) as a fixed effect and the subject and the operator as the random effects. We estimated several statistics, namely the intraclass correlation coefficient agreement (aICC), limits of agreement (LoA), total deviation index (TDI), coverage probability (CP) and coefficient of individual agreement (CIA), to evaluate the agreement between the manual and semi-automatic methods. Main results. Accounting for all statistics across 92 features, the JEI method consistently demonstrated acceptable agreement with the manual method, with median values of aICC = 0.86, TDI = 0.94, CP = 0.66, and CIA = 0.91. Significance. This study demonstrated that the JEI method is a reliable semi-automatic method for tumor delineation on HNSCC PET images.


Introduction
Head and neck cancer (HNC) is a common type of cancer arising at multiple sites within the head and neck region, and about 90% of HNCs are squamous cell carcinomas (Sekine et al 2017, Gormley et al 2022). Head and neck squamous cell carcinoma (HNSCC), the seventh most common cancer globally, accounts for 890 000 new cases and 450 000 deaths annually (Barsouk et al 2023). It is often curable if diagnosed at an early stage, with a 5 year overall survival rate above 90%, compared with below 50% for late-stage diagnosis (Mavuduru et al 2020, FH et al 2021).
Medical imaging plays a vital role in the diagnosis and treatment of HNC (Kim et al 2021). Positron emission tomography (PET) is a non-invasive imaging technique that can distinguish between malignant and benign tissues. Unlike magnetic resonance imaging (MRI) and computed tomography (CT), which are anatomic imaging tools, PET is a functional imaging modality. PET imaging involves the administration of a radiopharmaceutical (18F-fluorodeoxyglucose (FDG) is commonly used) labelled with a positron-emitting radioisotope. The radioisotope decays by emitting a positron that annihilates with an electron, generating a pair of photons travelling in nearly opposite directions. The emitted photons are detected by the detectors of the PET scanner. Multiple coincidence events at various angles are detected and reconstructed to generate the PET images (Shukla and Kumar 2006, Lameka et al 2016).
PET imaging is an essential technique in oncology for initial staging, localisation of unknown primary tumors, identifying regional and distant metastasis and recurrence, treatment planning and assessment of therapeutic response that affects the management and prognosis of the disease (Dewalle-Vignion et al 2011, Junn et al 2021, Strohl et al 2021).
Image segmentation, however, is one of the challenging tasks in medical imaging, since accurate tumor segmentation is vital for voxel classification, feature extraction for prediction modelling, diagnostic and prognostic assessment and treatment planning (Foster et al 2014, Koyuncu et al 2018, Cardenas et al 2019, Bennai et al 2020). Metabolic parameters derived from PET images are predictive of treatment response, which can improve cancer treatment decision making. A reliable segmentation method is required to obtain accurate and reproducible PET parameters. Including PET in radiotherapy treatment planning could provide additional information to target oncological lesions effectively (Comelli et al 2018).
Manual segmentation, considered the gold standard (or 'ground truth') (Dewalle-Vignion et al 2011, Tamal 2020), is often impractical in a healthcare setting because it is time-consuming (about 2.7-3 h on average), labour-intensive and subject to inter- and intra-operator variability (Dewalle-Vignion et al 2011, Berthon et al 2017, Kosmin et al 2019). Automatic segmentation methods are preferred for tumor segmentation as they reduce tumor delineation time and inter-operator variability, enhancing reproducibility and treatment planning efficiency (Lee 2010, Pfaehler et al 2021, Oreiller et al 2022).
Segmenting a tumor on PET images is a challenging task owing to the low resolution and high level of noise. Manual contouring is typically adopted in clinical practice (Comelli et al 2018). However, manual segmentation results in variability of contours, since PET images look blurred and the human eye cannot easily distinguish the tumor boundaries. Moreover, changing the display settings (windowing level and width) affects the perceived volume. Manual contouring therefore requires expertise and strict guidelines for the display device, as failure to follow them results in large operator dependence. The best approach to reducing the tumor delineation variability is to rely on semi-automatic segmentation methods that ensure consistency and reproducibility of the segmented tumor (Lee 2010). When the FDG radiotracer is used, the lesion must first be highlighted by the operator, so segmentation methods in FDG-PET are exclusively semi-automatic (Comelli and Stefano 2020).
The segmentation method utilised affects radiomic feature stability and applications such as radiotherapy planning (Tunali et al 2019, Pfaehler et al 2021). A PET study in nasopharyngeal carcinoma revealed that segmentation methods have a smaller impact on radiomic features than image discretisation methods (Lu et al 2016). Studies on HNC PET and magnetic resonance (MR) features reported more stable results with semi-automated methods than with manual segmentation (Pfaehler et al 2021).
According to the American Association of Physicists in Medicine (AAPM), most published segmentation methods have inadequate or inconsistent validation (Hatt et al 2017). Additionally, the impact of segmentation varies strongly with the anatomical site of the tumor and the clinical situation, owing to differences in signal-to-noise background, anatomical boundaries, non-pathological uptake surrounding the tumor and motion artifacts. This emphasises the need for specific investigation tailored to the individual clinical scenario (Belli et al 2018).
Despite the development of several semi-automatic and automatic segmentation methods for HNC medical images in recent decades (Guo et al 2019), there remains no agreement on the best segmentation method. Therefore, this study aims to assess, using robust statistical criteria, the agreement between radiomic features generated by manual segmentation and those generated by semi-automatic segmentation methods, for tumors delineated on HNSCC PET images by three independent operators.

Data
The dataset used in this study was from The Cancer Imaging Archive (TCIA) (https://cancerimagingarchive.net/) (Vallières et al 2017b, Grossberg et al 2020), comprising images from multiple centers recorded between 2003 and 2014 with a four-year follow-up time. Further details about data sources, instruments used for data collection, and image dimension can be found in Grossberg et al (2018) and in the supplementary information of Vallières et al (2017a). PET images are available for 100 patients (63 survived beyond four years), which represents a larger sample size compared to a similar study (Lu et al 2016).

Region of interest (ROI) segmentation
Semi-automatic tumor segmentation and radiomic feature extraction of the PET images were performed using the free and open-source software 3D Slicer (version 4.13.0, available at https://slicer.org/). The semi-automatic segmentation methods implemented were just enough interaction (JEI), watershed, grow from seeds (GfS), flood filling (FF), and the 30% SUVmax and 40% SUVmax thresholds. Figure 1 depicts the differences in segmentation across the six methods for a sample included in the study. Three radiologists (JW, SNMM, MM) independently outlined primary tumors for all patients and generated features by the manual and semi-automatic segmentation methods.

Just enough interaction (JEI)
The graph-based segmentation method JEI provides cues to guide the segmentation algorithm and offers a high degree of automation, requiring minimal user interaction. This algorithm is reported to perform without the need for correction for the majority of head and neck lesions. In the JEI segmentation approach, a graph structure is generated around the user-specified approximate lesion center point and a suitable cost function is derived based on local image statistics. The user input (mouse clicks needed to correct the segmentation boundary) is minimal during the refinement of a segmentation. Further details about the segmentation can be found in Beichel et al (2016).

Watershed
The watershed transform is a region-based segmentation method introduced by Beucher and Lantuéjoul, derived from a geophysical model of rain falling on a terrain (Beucher and Lantuejoul 1979, Preim and Botha 2014). For the marker-based watershed, the user specifies markers (seeds) to include or exclude points, thus creating an additional watershed at the maximum level between them (Preim and Botha 2014, Beare and Lehmann 2022). The watershed effect in 3D Slicer uses the 'MorphologicalWatershedFromMarkersImageFilter' class in the Insight Toolkit (ITK) (Anon n.d.). Major advances in the implementation of watershed were made by Meyer and Beucher. Meyer defined the watershed transform in terms of topographic distances, which redefined its behaviour in the presence of plateaus and led to the development of an algorithm known as 'hill climbing', a graph flooding approach. This is derived from the Dijkstra-Moore minimal path algorithm of graph theory. Further details of the algorithm can be found in Beare and Lehmann (2022).

Grow from seeds (GfS)
The grow-from-seeds method implemented in 3D Slicer is an improved version of the GrowCut algorithm and is based on the adaptive Dijkstra algorithm. The GrowCut algorithm is reformulated as a clustering problem based on finding the shortest path, which is solved with the Dijkstra algorithm. Editing is not allowed while the Dijkstra algorithm is running. An alternative is the adaptive Dijkstra algorithm, which can incorporate user inputs and updates only those local regions affected by the new input. Further details about the algorithm are available in Zhu et al (2014). The user specifies image pixels (seed pixels) that belong to the objects to be segmented from each other. The segmentation is then achieved automatically by assigning labels to all other pixels (Vezhnevets 2005).
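The shortest-path formulation described above can be illustrated with a minimal sketch. It is not the 3D Slicer implementation and it omits the adaptive update step; the function name `grow_from_seeds`, the 2D list-of-lists image, the 4-connected neighbourhood and the use of the absolute intensity difference as edge cost are all assumptions made for illustration.

```python
import heapq


def grow_from_seeds(image, seeds):
    """Label every pixel with the seed label reachable at the lowest path cost.

    image: 2D list of intensities (hypothetical toy input).
    seeds: dict mapping (row, col) -> integer label.
    Edge cost between neighbours is the absolute intensity difference
    (an illustrative choice, not taken from the paper).
    """
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    # Multi-source Dijkstra: all seeds start with zero cost.
    heap = [(0.0, r, c, label) for (r, c), label in seeds.items()]
    heapq.heapify(heap)
    while heap:
        cost, r, c, label = heapq.heappop(heap)
        if labels[r][c] is not None:
            continue  # already reached by a cheaper path
        labels[r][c] = label
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] is None:
                step = abs(image[nr][nc] - image[r][c])
                heapq.heappush(heap, (cost + step, nr, nc, label))
    return labels


toy_image = [[10, 10, 1, 1],
             [10, 9, 2, 1]]
toy_labels = grow_from_seeds(toy_image, {(0, 0): 1, (0, 3): 2})
```

In this toy example the two bright left-hand columns receive label 1 and the two dark right-hand columns receive label 2, because crossing the intensity edge is more expensive than any path that stays within one region.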

Fixed threshold
The fixed threshold-based method is simple and easy to implement. A fixed percentage ($T_{fixed}$) of the $SUV_{maxT}$ of the tumor (in some pre-defined ROI) determines the $SUV_{fixed}$ value. Voxels having SUV greater than or equal to $SUV_{fixed}$ are included in the segmented ROI (Tamal 2020).

$$SUV_{fixed} = T_{fixed} \times SUV_{maxT}$$

In the clinical setting, a threshold of 40%-43% of SUVmax is commonly chosen, but it may not work in all cases because it depends on the scanner type, reconstruction, low resolution and image noise of the PET images, as well as on variability in pathologies, the size and shape of the tumor and fuzzy object boundaries (Erdi et al 1997, Guan et al 2006, Foster et al 2014). In this study, we explored the 30% SUVmax and 40% SUVmax thresholds for tumor delineation.
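As a minimal sketch of the thresholding rule above, the following function keeps the voxels whose SUV is at least a fixed fraction of the maximum SUV in a pre-defined ROI. The function name `fixed_threshold_mask` and the flat list of voxel SUVs are hypothetical simplifications; a real implementation would operate on a 3D image volume.

```python
def fixed_threshold_mask(roi_suv, fraction):
    """Return a boolean mask of voxels with SUV >= fraction * SUVmax.

    roi_suv: flat list of SUVs inside a pre-defined ROI (toy input).
    fraction: threshold as a fraction of SUVmax, e.g. 0.30 or 0.40.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    suv_fixed = fraction * max(roi_suv)  # SUV_fixed = T_fixed x SUV_maxT
    return [suv >= suv_fixed for suv in roi_suv]


toy_roi = [1.0, 4.5, 10.0, 3.9]
mask_40 = fixed_threshold_mask(toy_roi, 0.40)  # cutoff is 4.0
mask_30 = fixed_threshold_mask(toy_roi, 0.30)  # cutoff is 3.0
```

With these toy values the 30% threshold admits one more voxel than the 40% threshold, which illustrates why a lower threshold yields a larger delineated volume.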

Flood filling (FF)
The flood-filling algorithm identifies the connected region of a multi-dimensional array whose voxel intensities are similar to that of a node selected by the user. In this method, nodes are added around the tumor region using a mouse cursor and, when the flood-fill tool is activated, the ROI is segmented based on similar voxel intensity (Haniff et al 2021, Ramli et al 2022).
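The idea can be sketched as a breadth-first search from a seed node. This is an illustration only, not the 3D Slicer tool: the function name `flood_fill`, the 2D toy image, the 4-connected neighbourhood and the `tolerance` parameter (maximum absolute intensity difference from the seed) are assumptions.

```python
from collections import deque


def flood_fill(image, seed, tolerance):
    """Return the set of connected pixels whose intensity is within
    `tolerance` of the seed intensity (4-connectivity, toy 2D image)."""
    rows, cols = len(image), len(image[0])
    reference = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue  # outside the image
            if (nr, nc) in region:
                continue  # already included
            if abs(image[nr][nc] - reference) <= tolerance:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


toy_image = [[1, 1, 9],
             [1, 8, 9],
             [1, 1, 9]]
toy_region = flood_fill(toy_image, seed=(0, 2), tolerance=1.5)
```

A larger tolerance lets the region leak into the background, which is consistent with flood filling being harder to control when background uptake is high.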

Texture analysis
Tumor heterogeneity was quantified using the segment statistics and radiomics modules available in 3D Slicer. A total of 128 features were extracted across eight feature domains: quantitative (21 features), shape (14 features), first order (18 features), gray level co-occurrence matrix (GLCM; 24 features), gray level dependence matrix (GLDM; 14 features), gray level run length matrix (GLRLM; 16 features), gray level size zone matrix (GLSZM; 16 features) and neighbouring gray tone difference matrix (NGTDM; 5 features). All PET images were resampled into 4 × 4 × 4 mm³ voxels, and images were discretised using a fixed bin width of 0.2. Images were normalised before feature extraction to account for the multicenter nature of the data.
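Fixed bin width discretisation can be sketched as follows. The formula used here, bin = floor(x / W) - floor(min(x) / W) + 1, is the convention documented for PyRadiomics; that the extraction in this study followed exactly this convention is an assumption, and the function name `discretise_fixed_bin_width` is hypothetical.

```python
from math import floor


def discretise_fixed_bin_width(values, bin_width=0.2):
    """Map intensities to 1-based bin indices using a fixed bin width.

    Bin edges are multiples of bin_width, so the same intensity falls in
    the same absolute bin regardless of the ROI; indices are then shifted
    so that the lowest occupied bin is 1.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    lowest_bin = floor(min(values) / bin_width)
    return [floor(v / bin_width) - lowest_bin + 1 for v in values]


toy_intensities = [0.05, 0.25, 0.9, 1.3]
toy_bins = discretise_fixed_bin_width(toy_intensities, bin_width=0.2)
```

Unlike a fixed bin count, a fixed bin width keeps the intensity resolution constant across patients, so the number of bins varies with the intensity range of each ROI.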

Statistical analysis
The data comprised measurements of 128 features based on the manual and the six semi-automatic methods (JEI, 30% and 40% SUVmax, watershed, GfS and FF) from 100 patients. We assessed whether each feature was approximately normally distributed. One feature (ROI Peak) did not generate adequate data and thirty-five features did not meet the distributional assumptions even after appropriate transformation (residual distributions departing from the assumptions); hence, we removed these features from the final analysis. We considered 92 log-transformed features for all subsequent analyses. Further details about all radiomic features and excluded features are given in the online resource.
We used five statistics to compare the agreement between manual and semi-automatic methods for each feature. To estimate these statistics for a feature, first, we created a subset of the data for the given feature comprising the manual and one (of the six) semi-automatic segmentation methods. Next, we fitted a linear mixed effect model to this subset, incorporating the method as a fixed effect (two levels: manual and the selected semi-automatic method) and the subject and the operator as the random effects. We repeated the model fit for each of the other semi-automatic methods. Hence, we fitted six models for a given feature and 552 models (6 × 92) for all 92 features.
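The linear mixed effect model of a single feature is mathematically represented as

$$y_{ijk} = \mu + \beta_k + \alpha_i + \tau_j + \varepsilon_{ijk} \tag{1}$$

where:
$y_{ijk}$ = the feature value on subject $i$ by operator $j$ with method $k$; $i = 1, \ldots, 100$; $j = 1, 2, 3$; $k = 1$ (manual), $2$ (a semi-automatic method),
$\mu$ = the overall mean,
$\beta_k$ = the fixed effect of the method,
$\alpha_i$ = the random effect of subject $i$; $\alpha_i \sim N(0, \sigma_\alpha^2)$, where $\sigma_\alpha^2$ is the between-subject variability,
$\tau_j$ = the random effect of operator $j$; $\tau_j \sim N(0, \sigma_\tau^2)$, where $\sigma_\tau^2$ is the between-operator variability,
$\varepsilon_{ijk}$ = the residual error; $\varepsilon_{ijk} \sim N(0, \sigma_\varepsilon^2)$, where $\sigma_\varepsilon^2$ is the residual (within-subject) variability.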
We used the outputs from the fitted linear mixed model to estimate four different statistics to capture the agreement between the manual and a semi-automatic segmentation method: intraclass correlation coefficient (agreement), coverage probability, total deviation index and coefficient of individual agreement. The following section presents a summary of each statistic. We reported the summary statistics and relevant estimates of 92 features. The detailed explanation and applications of these statistics for a single feature in more complex experimental designs are available in the literature (Parker et al 2020).

Intraclass correlation coefficient (ICC)
ICC for repeated measures is a standardised coefficient representing the proportion of between-subject variability to the total variability of a feature. It takes values between 0 and 1. ICC values above 0.8 and 0.9 indicate good and excellent reliability, respectively, and values below 0.5 indicate poor reliability. The larger the between-subject variability relative to the within-subject variability, the better the agreement between methods (Koo and Li 2016, Liljequist et al 2019). An estimate of ICC agreement (aICC) of 1 implies that two methods are in complete agreement.
In the aICC, $\phi_\beta^2$ denotes the variance due to the fixed factor (method), which accounts for systematic differences between the two methods. If $\phi_\beta^2$ is not included in the denominator, the estimate represents the consistency between methods.
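As a sketch consistent with the definitions above, and assuming that the total variability is the sum of all variance components of the linear mixed effect model (an assumed form, not necessarily the authors' exact expression), the agreement and consistency versions of the ICC can be written as:

```latex
\mathrm{aICC} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\tau^2 + \phi_\beta^2 + \sigma_\varepsilon^2},
\qquad
\mathrm{ICC}_{\text{consistency}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\tau^2 + \sigma_\varepsilon^2}
```

Under this form, a systematic offset between methods increases $\phi_\beta^2$ and lowers the aICC while leaving the consistency version unchanged.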

Coverage probability (CP)
CP is the probability that the between-method differences lie within some tolerance interval. A larger probability indicates closer agreement.
Here, $k_1$ denotes the manual method and $k_2$ a semi-automatic method; $\delta$ is the range of clinically acceptable differences (CAD) between two methods, as described later; and $\Phi(\cdot)$ is the standard normal cumulative distribution function.
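The definition can be illustrated with a short sketch. It assumes that the between-method difference is normally distributed with mean `mu_d` and standard deviation `sigma_d`; these parameter names and the function name are hypothetical, and the expression is the generic normal-probability form rather than the authors' exact estimator.

```python
from statistics import NormalDist


def coverage_probability(mu_d, sigma_d, delta):
    """P(|D| <= delta) for a difference D ~ N(mu_d, sigma_d**2).

    mu_d: mean between-method difference (assumed input).
    sigma_d: standard deviation of the difference (assumed input).
    delta: half-width of the clinically acceptable difference (CAD).
    """
    if sigma_d <= 0:
        raise ValueError("sigma_d must be positive")
    if delta < 0:
        raise ValueError("delta must be non-negative")
    diff = NormalDist(mu_d, sigma_d)
    return diff.cdf(delta) - diff.cdf(-delta)


cp_unbiased = coverage_probability(mu_d=0.0, sigma_d=1.0, delta=1.96)
cp_biased = coverage_probability(mu_d=1.0, sigma_d=1.0, delta=1.96)
```

For the same spread, a systematic bias between methods shifts the distribution of differences and reduces the coverage probability.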

Total deviation index (TDI)
For a given containment probability p, TDI provides the boundary within which the differences will be contained p × 100% of the time. This method is useful in cases where the CAD (δ) is difficult to determine. The narrower the calculated boundary, the better the agreement, i.e. the more readily the methods can be used interchangeably.
Here, $p$ is the prespecified proportion of between-method differences ($p = 0.95$) and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function.
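A minimal sketch of the calculation is given below. It uses the widely used normal approximation in which the boundary is the (1 + p)/2 standard normal quantile multiplied by the root mean squared deviation; treating the mean squared deviation as `mu_d**2 + sigma_d**2` is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def total_deviation_index(mu_d, sigma_d, p=0.95):
    """Approximate boundary containing a proportion p of absolute differences.

    mu_d: mean between-method difference (assumed input).
    sigma_d: standard deviation of the difference (assumed input).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    mean_squared_deviation = mu_d ** 2 + sigma_d ** 2
    quantile = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return quantile * sqrt(mean_squared_deviation)


tdi_unbiased = total_deviation_index(mu_d=0.0, sigma_d=1.0)
tdi_biased = total_deviation_index(mu_d=0.5, sigma_d=1.0)
```

With no bias and unit standard deviation the boundary is close to 1.96; any bias or additional variability widens it.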

Coefficient of individual agreement (CIA)
CIA is a scaled coefficient which compares the disagreement between methods to the disagreement within methods, within subjects. The value of CIA ranges from 0 to 1, with 1 indicating that using different methods makes no difference to the variability of repeated measurements taken under the same conditions within the same subject. CIA is calculated using the residual variability $\sigma_\varepsilon^2$ estimated from the fitted linear mixed effect model.

Limits of agreement (LoA)
LoA quantifies the dispersion among paired differences in feature values measured in the same patient using two different methods. The paired differences in measurements between two methods were fitted with a random effects model incorporating subject and operator as random effects:

$$D_{ij} = \mu^* + \alpha_i^* + \tau_j^* + \varepsilon_{ij}^* \tag{2}$$

where:
$D_{ij}$ = the difference in a feature value between two methods (manual and a semi-automatic method) on subject $i$ by operator $j$; $i = 1, \ldots, 100$; $j = 1, 2, 3$,
$\mu^*$ = the overall mean of the paired difference between two methods,
$\alpha_i^*$ and $\tau_j^*$ = the random effects of subject $i$ and operator $j$,
$\varepsilon_{ij}^*$ = the residual error, whose variance $\sigma_\varepsilon^{*2}$ is the residual (within-subject) variability.
LoA was estimated as follows:

$$\mathrm{LoA} = \mu_D \pm 1.96\,\sigma_D$$

Here, $\mu_D$ is the mean of the paired differences (corresponding to the intercept of the fitted random effects model), i.e. $\mu_D = \mu^*$, and $\sigma_D$ is the standard deviation of the paired differences, calculated from the estimated variance components of the fitted random effects model.
LoA is judged by comparing it with a CAD (δ). When the limits of agreement are narrow and lie within the CAD, the methods are in agreement and can be used interchangeably. CAD is generally obtained from experts, but identifying suitable CADs may not be practical for radiomic features. We considered that the CAD of a feature should, on average, be within a ±10% range of the manual method. Therefore, if the differences in measurements between the manual and semi-automatic methods are within the CAD, the difference is regarded as clinically unimportant (and hence clinically acceptable), and a narrower LoA within the CAD suggests that the two methods show the desired level of agreement. The Bland-Altman plot presents the differences versus the average of measurements of the two methods with the LoA overlain.
All statistical analyses were performed in the R software environment (version 4.2.1). The R package lme4 (Bates et al 2015) was used to fit models 1 and 2 under the linear mixed model framework. The variance components were estimated using the restricted maximum likelihood (REML) method. For eight selected features, 95% confidence intervals were obtained using bootstrap resampling with 1000 replications.
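The LoA calculation and its comparison with the CAD can be sketched as follows. The sketch assumes that the variance of the paired differences is the sum of the subject, operator and residual variance components of the random effects model, which is one common mixed-effects formulation and not necessarily the authors' exact expression; the function and parameter names are hypothetical, and the variance components would in practice come from the fitted model.

```python
from math import sqrt


def limits_of_agreement(mean_diff, var_subject, var_operator, var_residual, z=1.96):
    """Return (lower, upper) limits of agreement.

    mean_diff: intercept of the random effects model for the differences.
    var_*: estimated variance components (assumed to be summed to give
    the total variance of the paired differences).
    """
    sd_diff = sqrt(var_subject + var_operator + var_residual)
    return mean_diff - z * sd_diff, mean_diff + z * sd_diff


def within_cad(loa, delta):
    """True if both limits of agreement lie inside [-delta, +delta]."""
    lower, upper = loa
    return -delta <= lower and upper <= delta


toy_loa = limits_of_agreement(0.1, var_subject=0.04, var_operator=0.01, var_residual=0.04)
acceptable_wide = within_cad(toy_loa, delta=1.0)
acceptable_narrow = within_cad(toy_loa, delta=0.5)
```

In this toy example the limits are roughly -0.49 and 0.69, so the two methods would be judged interchangeable for a CAD of ±1.0 but not for a CAD of ±0.5.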

Results
The study included 100 patients, comprising 83 males and 17 females with an average age of 60.92 years (SD = 11.92). The most common site of the primary tumor was the oropharynx (61%), followed by the nasopharynx (15%) and larynx (12%). More than half of the patients (58%) had stage IVA cancer. Further details of summary statistics of patients and features are given in the online resource supplementary tables (1: Patient characteristics; 2: Feature details; 3: List of excluded features; 4: Summary statistics) and supplementary figures (1-4: Feature characteristics).
To evaluate the agreement of various semi-automatic methods with the manual method, we analysed five different statistics: aICC (intraclass correlation coefficient agreement), TDI, CP, CIA and LoA. Figure 2 shows the heatmap of the estimates of aICC for the best agreeing features in eight feature domains, and table 1 presents the estimates along with the 95% confidence intervals of the selected eight features for four statistics. The online resource presents heatmaps of estimates of CP (supplementary figure 5), TDI (supplementary figure 6) and CIA (supplementary figure 7) for selected features, and estimates of statistics for all 92 features in the supplementary tables (5: aICC; 6: CP; 7: TDI; 8: CIA).
The JEI method had the closest agreement (aICC) with the manual method. Most features (77; 83.70%) from the JEI method demonstrated the highest agreement with the manual method; 57 features in JEI and 41 features in FF presented estimates of aICC ≥ 0.80. The heatmap (figure 2) shows that the 30% SUVmax, 40% SUVmax, watershed and GfS methods had the lowest agreement with the manual method.
Table 1 shows that the features with the best aICC estimates for manual versus JEI demonstrated consistent and favourable estimates of CP, TDI and CIA. The JEI method had the largest number of features with the highest estimates of CP with the manual method (81; 88.04%). For example, the estimate of CP for I_TotalEnergy (table 1) within CAD (±10%) was 0.90 for manual versus JEI. In other words, the probability that the differences in measurements between manual and JEI would lie within the CAD was 0.90, suggesting good agreement between the two methods.
The JEI method also had the largest number of features with the best estimates of TDI with the manual method (81; 88.04%). For example, the 95% TDI of I_TotalEnergy was estimated as ±1.03 for manual versus JEI, suggesting that 95% of the time the differences between the manual and JEI methods would be within ±1.03.
The number of features with CIA ≥ 0.80 with the manual method was highest for the JEI method (83 features). For example, the CIA estimate of I_TotalEnergy was 0.98, suggesting that using the JEI method made no difference to the variability of repeated measurements taken under the same conditions within the same subject.
Figure 3 shows that the JEI method produced reasonable estimates of aICC, CP, TDI and CIA across all feature domains. The FF method followed the JEI method in terms of agreement with the manual method. The 30% SUVmax, 40% SUVmax, watershed and GfS methods had relatively lower estimates of agreement with the manual method across all statistics.
The online resource includes the range of estimates of aICC, TDI, CP and CIA for all features (supplementary figure 8). All median values showed favourable estimates for JEI with the manual method in all statistics, reflected by desirably high between-subject variability and low between-operator and between-method variability (supplementary tables 9, 10 and 11).
Figure 4 presents the Bland-Altman plot of one feature (I_TotalEnergy), showing the differences versus the average of measurements of the manual method versus the other six methods. The LoA for I_TotalEnergy was narrowest for manual versus the JEI method. For JEI, the differences in measurements for most observations were within the LoA boundary and did not show any trends, suggesting strong agreement between JEI and the manual method. The FF and manual methods displayed reasonable agreement. The comparison of the manual method with 30% SUVmax, 40% SUVmax, GfS and watershed showed wider LoA. Similar patterns were observed in the Bland-Altman plots of other features (online resource supplementary figures 9-15).

Discussion
We investigated 92 radiomic features extracted from tumors segmented manually and semi-automatically by three independent operators on HNSCC PET images, and compared them using five different agreement statistics. Based on all statistics, we observed that the JEI method demonstrated consistently higher agreement with the manual method, followed by the flood filling method. The 30% SUVmax, 40% SUVmax, watershed and GfS methods had a lower agreement with the manual method.
The reasonably good agreement of JEI with the manual method could be due to its high degree of automation, intuitiveness and time efficiency. The JEI method of segmentation was reported as significantly faster than the manual method in HNSCC (Beichel et al 2016) and cervical tumors (Altazi et al 2017). We observed that the JEI method was the quickest and provided an accurate segmentation, particularly for large tumors, at the first mouse click in the majority of the cases, as reported elsewhere (Beichel et al 2016). In a small sample of patients (data not shown), the JEI method took less than 30 s (21-26 s) and the manual method took more than a minute (1 min 09 s to 1 min 23 s) for segmentation. However, the quality of the pre-segmentation step might affect the final segmentation in identifying the desired borders by the JEI method (Yin et al 2010). Additionally, in some cases more than ten mouse clicks are needed for tumor refinement, which could be addressed by adding suitable refinement tools. Users must also be properly trained in using the JEI tool to achieve good efficiency (Beichel et al 2016). We also observed that the JEI method exhibited poor performance, even with the use of refinement tools and multiple mouse clicks, when a large tumor had an elongated course or when unaffected tissue intervened in parts of a large tumor (supplementary figure 16).
The estimates of agreement between the FF and manual segmentation methods were the second highest after the JEI method. The FF method could be useful in cases with weak colour variation and homogeneous backgrounds. We observed that in cases of high background signal where the tumor was large and the margin was indistinct, the FF method was less accurate and required multiple mouse clicks. The flood-filling method was reported as a better alternative to the manual method in other studies (Podgornova and Sadykov 2019, Haniff et al 2021).
Our results suggest that the watershed, GfS, 30% SUVmax and 40% SUVmax methods had less favourable estimates of agreement with the manual method for most statistics. The less favourable estimates of watershed and GfS may arise from the dependence of these methods on initial seed definition (number and size of seeds), image noise and blurred or indistinct boundaries (Li et al 2008, Xu et al 2011, Gardin 2020). A small sample of test data (data not shown) suggested that increasing the number of seed points in both watershed and GfS could improve the accuracy of both methods. We observed that 30% SUVmax performed better than 40% SUVmax, especially in the case of large tumors. The lower agreement of the 40% SUVmax threshold with the manual method could be attributed to tumor characteristics (Parmar et al 2014a). To implement an effective thresholding-based method, it is suggested that the contrast between the tumor and the background is either less than 30% or more than 90% (Drever et al 2007). The watershed method is reported to have good accuracy in other cancers (Cui et al 2009, Gómez et al 2010).
Our results indicate that the operator-level variability was low between JEI and the manual method (supplementary table 10). The JEI method has been reported to show minimal intra- and inter-operator variation in HNSCC PET studies (Beichel et al 2016). When inter-operator variability was assessed using simple correlation for the selected eight features in all patients, the estimates of the correlation coefficient ranged between 0.79 and 0.99 for the manual method and between 0.77 and 0.99 for the JEI method (data not shown). Assessing the variability across different operators is essential, as the segmentation accuracy depends on the experience of the operator (Belli et al 2018).
We employed five statistics to compare the agreement between methods. Among these, ICC is frequently used and is particularly important where an appropriate CAD is difficult to define. ICC is the preferred statistical method to assess the agreement and reliability of radiomic features by various segmentation methods (Lu et al 2016, Altazi et al 2017). In MR and CT studies, ICC was used to assess the feature stability of various segmentation methods (Parmar et al 2014b, Haniff et al 2021). Other statistics, like LoA, CP and TDI, are generally easy to compute and interpret, but the LoA and CP statistics are most suitable when the CAD on the original measurement scale is available. However, CIA is much less dependent on between-subject and between-operator variability than ICC and is used in cases where defining an appropriate CAD is difficult (Parker et al 2020). It is suggested that ICC, together with estimates of other statistics, provides a critical appraisal of agreement results, helps establish a better understanding of feature-level agreement (and disagreement) and contributes to the selection of a suitable semi-automatic method (Parker et al 2020).
We applied the logarithmic transformation to all 92 features before estimating the different statistics. It is therefore suggested that the development of predictive and prognostic models should also use the transformed data for those features. This would allow the estimates of agreement statistics to hold true on the transformed scale. The transformed data might also contribute to the development of stable and computationally feasible models. The use of transformed PET radiomic features in the model development stage was reported in other studies (Eertink et al 2022, Ferrández et al 2022).

Limitations and recommendations
Our study has several limitations. Although the current study (n = 100) is relatively large compared to earlier reports on the agreement between methods, the estimates of the different agreement statistics could be obtained with better precision in a large-scale study with more than three operators. Our sample of patients represented HNSCC with various sizes, shapes, sites and stages. A study with a larger sample size could account for feature-level stability due to these factors (Guo et al 2019). Additionally, our results might not be generalisable to cancers originating from different organs. The model did not include additional complexity, for example interaction terms between fixed effects, since a complex model with a limited sample size might have convergence issues. Additionally, the robustness of JEI as a replacement for manual segmentation depends on the importance of the specific features used in the prediction and prognostic models. Although we expect similar performance metrics for models developed using features from the manual and JEI methods due to their high agreement, the reliability of these metrics depends on the final model fit on external validation data. We also excluded machine learning and deep learning-based tumor segmentation algorithms, although recent studies reported the potential of such algorithms in medical image segmentation tasks (Bennai et al 2020, Cardenas et al 2021).

Conclusion
Based on different statistics of agreement, we conclude that the JEI method has reasonably good estimates of agreement with the manual method for a range of radiomic features. Additionally, the JEI method is the easiest to implement to segment HNSCC PET images. Hence, adopting the JEI method could improve the reliability and robustness of the features and reduce inter-operator variability and segmentation time compared to using the manual method.

Figure 4. Bland-Altman plot showing paired differences against the mean of the pairs of measurements between manual and six semi-automatic segmentation methods for a feature (I_TotalEnergy). The solid blue line shows the mean bias and the dashed lines show the limits of agreement. JEI: Just enough interaction; Thresh30: 30% SUVmax threshold; Thresh40: 40% SUVmax threshold; GfS: Grow from seeds; FF: Flood filling.