
Predicting lymph node metastasis in patients with oropharyngeal cancer by using a convolutional neural network with associated epistemic and aleatoric uncertainty


Published 11 November 2020 © 2020 Institute of Physics and Engineering in Medicine
Citation: Michael Dohopolski et al 2020 Phys. Med. Biol. 65 225002. DOI: 10.1088/1361-6560/abb71c

Abstract

There can be significant uncertainty when identifying cervical lymph node (LN) metastases in patients with oropharyngeal squamous cell carcinoma (OPSCC) despite the use of modern imaging modalities such as positron emission tomography (PET) and computed tomography (CT) scans. Grossly involved LNs are readily identifiable during routine imaging, but smaller and less PET-avid LNs are harder to classify. We trained a convolutional neural network (CNN) to detect malignant LNs in patients with OPSCC and used quantitative measures of uncertainty to identify the most reliable predictions. Our dataset consisted of images of 791 LNs from 129 patients with OPSCC who had preoperative PET/CT imaging and detailed pathological reports after neck dissections. These LNs were segmented on PET/CT imaging and then labeled according to the pathology reports. An AlexNet-like CNN was trained to classify LNs as malignant or benign. We estimated epistemic and aleatoric uncertainty by using dropout variational inference and test-time augmentation, respectively. CNN performance was stratified according to the median epistemic and aleatoric uncertainty values calculated using the validation cohort. Our model achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.99 on the testing dataset. Sensitivity and specificity were 0.94 and 0.90, respectively. Epistemic and aleatoric uncertainty values were statistically larger for false negative and false positive predictions than for true negative and true positive predictions (p < 0.001). Model sensitivity and specificity were 1.0 and 0.98, respectively, for cases with epistemic uncertainty lower than the median value of the incorrect predictions in the validation dataset. For cases with higher epistemic uncertainty, sensitivity and specificity were 0.67 and 0.41, respectively. 
Model sensitivity and specificity were 1.0 and 0.98, respectively, for cases with aleatoric uncertainty lower than the median value of the incorrect predictions in the validation dataset. For cases with higher aleatoric uncertainty, sensitivity and specificity were 0.67 and 0.37, respectively. We used a CNN to predict the malignant status of LNs in patients with OPSCC with high accuracy, and we showed that uncertainty can be used to quantify a prediction's reliability. Assigning measures of uncertainty to predictions could improve the accuracy of LN classification by efficiently identifying instances where expert evaluation is needed to corroborate a model's prediction.


1. Introduction

Physicians use computed tomography (CT) and positron emission tomography (PET) scans to help identify cervical lymph nodes (LNs) with metastases in patients with head and neck (HN) cancer, but even so, this process can involve significant uncertainty. LNs with metastatic disease are easier to identify on imaging when they are large or highly fluorodeoxyglucose (FDG)-avid, and harder when they are small or less FDG-avid. A meta-analysis evaluating radiologists' ability to identify metastatic LNs reported average false negative rates of 15%–21% (and as high as 50% in select studies) despite the use of modern imaging modalities (Kyzas et al 2008). Classifying a benign LN as malignant could lead to over-treatment, and classifying a malignant LN as benign could negatively impact oncologic outcomes. Machine learning is well poised to aid physicians in this task.

Convolutional neural networks (CNNs) have been used successfully to aid imaging-based diagnosis and classification in patients with HN cancer. Kann et al used a CNN to identify the presence of extracapsular extension on CT imaging (Kann et al 2018). Chen et al used a hybrid model combining radiomic features and a CNN to evaluate LNs on PET/CT scans and determine whether they were benign, malignant, or suspicious for malignant involvement (Chen et al 2019). Both groups reported excellent performance, obtaining areas under the receiver operating characteristic (ROC) curve (AUCs) greater than 0.90 with these deep learning–based models.

Yet, to bring these models into clinical practice more quickly, physicians need the means to assess the reliability of a prediction, as there is unaddressed uncertainty. In the publications mentioned above, the datasets had fewer malignant than benign examples, which could hinder the predictive robustness of these models and contribute to uncertainty, in this case epistemic uncertainty (also known as model uncertainty), which stems from limited data. Aleatoric uncertainty, which is associated with inherent randomness, should also be measured to assess a model's robustness (Kendall and Gal 2017, Papernot and McDaniel 2018). Accurately approximating prediction uncertainties, whether through Bayesian methods or pseudo-Bayesian approximations such as dropout variational inference and test-time augmentation (TTA), can potentially quell physicians' fears regarding prediction reliability (Friedman 2009, Kendall and Gal 2017, Novak et al 2018). Additionally, reports have shown that measuring and incorporating uncertainty as a form of prediction validation could minimize false predictions, which could theoretically improve clinical outcomes (Leibig et al 2017).

In this work, we trained an AlexNet-like CNN to classify LNs in patients with oropharyngeal squamous cell carcinoma (OPSCC) by using LNs segmented on preoperative PET/CT scans that had been labeled according to pathological data from neck dissections (Krizhevsky et al 2017). Furthermore, we calculated the epistemic and aleatoric uncertainty via dropout variational inference and TTA, respectively, and then used these uncertainties to assess the predictions' reliability and thus make them more clinically useful (Krizhevsky et al 2017, Ayhan and Berens 2018).

2. Materials and methods

2.1. Dataset and pre-processing steps

2.1.1. Dataset

This retrospective study included 129 patients with OPSCC who had preoperative PET and CT imaging and subsequently underwent neck dissections. We used the commercially available software package Velocity (Varian Medical Systems) to perform rigid registrations between the PET and CT images. All images were resampled to the same final voxel resolution, 0.5 mm × 0.5 mm × 3 mm, for both the CT and PET images. The LNs were contoured on the CT images under the guidance of PET. There were a total of 791 LNs, 79% of which were classified as benign and 21% as malignant, according to the corresponding pathology reports. LNs were included if they were explicitly labeled on the pathology report according to the clinically defined five cervical LN levels and could be localized on CT and PET imaging. Each node was randomly assigned to the training, validation, or test cohort. Table 1 provides a brief summary of the volumes and standardized uptake values (SUVs) of the LNs in each cohort.

Table 1. Summary of volumes and SUVs of LNs in the training, validation, and testing datasets. Q {25%, 75%} denotes the 25th and 75th percentile (quartile) values.

                                Benign                       Malignant
                                Median   Q {25%, 75%}        Median   Q {25%, 75%}
Training (n = 479)
  Volume (cm3)                  0.13     0.06, 0.29          7.27     1.64, 13.7
  SUV (max)                     0.79     0.56, 1.17          5.04     2.52, 7.57
Validation (n = 125)
  Volume (cm3)                  0.19     0.12, 0.37          6.08     1.92, 12.9
  SUV (max)                     0.74     0.54, 1.06          5.56     3.08, 8.32
Testing (n = 187)
  Volume (cm3)                  0.19     0.10, 0.38          11.34    2.88, 17.0
  SUV (max)                     0.96     0.67, 1.32          6.46     3.87, 9.90

2.1.2. Data preprocessing

Based on the nodal contours, we extracted patches of 72 × 72 × 48 voxels, which included the respective nodes and their surrounding voxels, as inputs for the proposed CNN. The values in the CT images ranged from −1000 to 3095, so we normalized the CT images by adding 1000 to the original values and then dividing the sums by 4095. The SUV values in the original PET scans were normalized by dividing by the maximum SUV in the training cohort. We normalized the CT and PET images to help the model converge faster.
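The normalization steps above can be sketched as follows (the function names are ours, for illustration):

```python
import numpy as np

def normalize_ct(ct_patch):
    # CT values span [-1000, 3095]; shift by 1000 and divide by 4095
    # to map them into [0, 1].
    return (np.asarray(ct_patch, dtype=np.float64) + 1000.0) / 4095.0

def normalize_pet(pet_patch, train_max_suv):
    # Scale SUVs by the maximum SUV observed in the training cohort.
    return np.asarray(pet_patch, dtype=np.float64) / train_max_suv
```

Note that the PET scaling factor comes from the training cohort only, so the validation and test data are normalized without information leakage.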

We randomly and equally separated the data into five groups according to patients' medical record numbers (i.e. all LNs associated with a single patient were in the same group). We selected two groups, one as the validation cohort and the other as the test cohort. We trained the model on the remaining PET/CT LN combinations and their augmented counterparts, which we obtained through 3D image augmentation: we randomly rotated the original CT/PET LN combinations about the x, y, and z axes through [−30°, 30°]. The training cohort included 479 original PET/CT LN combinations and 1822 augmented combinations derived from them. The validation cohort included 125 original PET/CT LN combinations, and the testing cohort included 187. A sample slice (72 × 72 pixels) from a PET/CT LN combination is shown in figure 1.
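A minimal sketch of the patient-level split described above (the function names and the round-robin assignment are our assumptions; the paper specifies only that all LNs from one patient share a group):

```python
import random
from collections import defaultdict

def patient_level_groups(node_ids, patient_of, n_groups=5, seed=0):
    # Group LNs by patient so that every LN from one patient lands in the
    # same group, then deal the shuffled patients round-robin into n_groups.
    by_patient = defaultdict(list)
    for node in node_ids:
        by_patient[patient_of[node]].append(node)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    groups = [[] for _ in range(n_groups)]
    for i, patient in enumerate(patients):
        groups[i % n_groups].extend(by_patient[patient])
    return groups
```

Splitting at the patient level rather than the node level prevents LNs from the same scan appearing in both the training and test cohorts.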

Figure 1. A single slice (72 × 72 pixels) of a PET/CT patch extracted from an LN within the training cohort. The left image is from the CT scan, and the right image is from the PET scan. The LN is roughly centered in both images.

2.2. Model

Figure 2 shows the AlexNet-like CNN architecture used in this work. We used Keras, a neural network API, to construct the model, with TensorFlow as the backend engine. Each convolutional block consisted of a convolutional layer followed by a ReLU activation layer and a dropout layer. A max pooling layer (size 2 × 2 × 2, stride 2 × 2 × 2) followed each convolutional block. A global max was taken after the final convolutional block and then processed through a dense fully connected layer. Binary cross-entropy was used as the loss function.
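A hedged sketch of such an architecture in Keras is shown below; the filter counts, kernel sizes, number of blocks, and dropout rate are illustrative assumptions, as the text describes only the block structure, not the tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(dropout_rate=0.3):
    # Two-channel 3D input patch: CT and PET, each 72 x 72 x 48 voxels.
    inputs = tf.keras.Input(shape=(72, 72, 48, 2))
    x = inputs
    for filters in (32, 64, 128):  # illustrative filter counts
        # Convolutional block: conv -> ReLU -> dropout, then 2x2x2 max pooling.
        x = layers.Conv3D(filters, 3, padding='same', activation='relu')(x)
        x = layers.Dropout(dropout_rate)(x)
        x = layers.MaxPooling3D(pool_size=2, strides=2)(x)
    # Global max after the final block, then a dense output layer.
    x = layers.GlobalMaxPooling3D()(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```

Keeping the Dropout layers active at test time (e.g. by calling the model with `training=True`) is what enables the dropout variational inference described in section 2.5.1.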

Figure 2. Illustration of the AlexNet-like CNN architecture. A blue square represents one convolutional layer followed by a ReLU activation layer and a dropout layer. A stack of blue squares represents a convolutional block.

2.3. Hyperparameter optimization (Hyperopt/Tree of Parzen Estimator algorithm)

We used the Python library Hyperopt to search for hyperparameters (Bergstra et al 2011). Hyperopt requires three inputs for optimization: an objective function, a parameter space, and an optimization algorithm. The objective function for our model was a modification of the Youden index (3 × sensitivity + specificity − 1). Sensitivity and specificity were calculated on the validation data after the model was trained. We decided empirically to prioritize sensitivity (i.e. weighting it by 3) to minimize false negative predictions. We selected the Tree of Parzen Estimator algorithm as the optimization algorithm.
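The modified Youden index can be sketched as follows (function names are ours; since Hyperopt minimizes its objective, the negated index would be returned to the optimizer in practice):

```python
def modified_youden(tp, fn, tn, fp):
    # Modified Youden index: sensitivity is weighted 3x to penalize
    # false negative predictions more heavily than false positives.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 3.0 * sensitivity + specificity - 1.0

def hyperopt_loss(tp, fn, tn, fp):
    # Hyperopt minimizes, so return the negated index as the loss.
    return -modified_youden(tp, fn, tn, fp)
```

For example, a validation run with 9 true positives, 1 false negative, 8 true negatives, and 2 false positives scores 3 × 0.9 + 0.8 − 1 = 2.5.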

2.4. Model performance evaluation

We measured the model's performance by sensitivity, specificity, and the AUC obtained from a ROC curve. We calculated positive predictive value (PPV) and negative predictive value (NPV) as measures of clinical utility.

2.5. Prediction uncertainty quantification

We discuss the methods for calculating epistemic and aleatoric uncertainty in detail below. In brief, we calculated epistemic uncertainty via dropout variational inference (Kendall and Gal 2017), and we measured aleatoric uncertainty via TTA (Ayhan and Berens 2018). We performed these calculations on the validation and test datasets. From the validation uncertainty values, we computed both the median and mean uncertainties of the entire cohort and the median and mean uncertainties associated with the incorrect predictions (false negatives and false positives). We used the median uncertainty values from the validation cohort as cutoff values to determine whether a prediction was 'certain' or 'uncertain,' because, in clinical settings, we cannot always prove definitively whether an LN harbors disease; a predetermined value is therefore needed to indicate when more focused physician scrutiny is necessary. We measured model performance separately for the 'certain' and 'uncertain' cohorts. We ultimately used the median values, as they were less likely than the means to be influenced by outliers. We specifically evaluated the median uncertainty associated with the incorrect predictions, as this value would likely be larger than that obtained from the entire validation cohort. Classifying more predictions as certain by using a larger cutoff reduces the number of predictions that need later expert review while still preserving the model's performance, and would thus make the model more clinically useful.
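The cutoff selection and certain/uncertain stratification described above can be sketched as follows (function names are ours):

```python
import statistics

def certainty_cutoff(val_uncertainties, val_correct):
    # Cutoff = median uncertainty of the incorrect validation predictions.
    errors = [u for u, ok in zip(val_uncertainties, val_correct) if not ok]
    return statistics.median(errors)

def split_by_certainty(test_uncertainties, cutoff):
    # Partition test predictions into 'certain' (below the cutoff)
    # and 'uncertain' (at or above it); returns index lists.
    certain = [i for i, u in enumerate(test_uncertainties) if u < cutoff]
    uncertain = [i for i, u in enumerate(test_uncertainties) if u >= cutoff]
    return certain, uncertain
```

Sensitivity and specificity are then computed separately on each partition, as reported in section 3.2.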

2.5.1. Epistemic uncertainty estimation via dropout variational inference

Uncertainty estimates for large CNNs can be computationally expensive. Some practitioners perform n-fold cross validation and take the variance of the predictions, which requires training n models. Other uncertainty estimation methods aim to calculate the posterior distribution p(W | X, Y) from the marginal probability p(Y | X), a prior probability p(W), and a likelihood p(Y | X, W), where W denotes the model weights, X the data, and Y the outcomes; this is known as Bayesian inference:

$$p(W \mid X, Y) = \frac{p(Y \mid X, W)\,p(W)}{p(Y \mid X)} \qquad (1)$$

Unfortunately, the marginal probability is difficult to compute analytically. Gal and Ghahramani argued that dropout variational inference could approximate the posterior distribution (Kendall and Gal 2017). The premise involves compiling a model that includes a dropout layer after each convolutional layer and keeping these dropout layers active during testing. In practice, this yields different class prediction probabilities on repeated forward passes. These probabilities can then be used to calculate epistemic uncertainty, also known as model uncertainty or uncertainty in the model parameters. Specifically, we estimated $p(y = c \mid x, X, Y)$ by Monte Carlo integration and then calculated the entropy of the probability vector, $H(p)$, via equations (2) and (3) below:

$$p_c = p(y = c \mid x, X, Y) \approx \frac{1}{T}\sum_{t=1}^{T} \mathrm{softmax}\!\left(f^{\widehat{W}_t}(x)\right)_c \qquad (2)$$

$$H(p) = -\sum_{c=1}^{C} p_c \log p_c \qquad (3)$$

where T is the number of predictions per input, $\widehat{W}_t$ are the sampled (dropout-masked) model weights, C is the number of classes (benign + malignant = 2), and ${p_c}$ is the probability of class c for input x. ${p_c}$ was obtained through 300 repeated predictions on a single CT/PET LN combination in the test cohort with dropout enabled at test time. We performed a similar process with the validation cohort.
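Equations (2) and (3) can be sketched as follows, assuming a `stochastic_predict` callable (our name) that runs the network with dropout active at test time and returns a vector of class probabilities:

```python
import numpy as np

def epistemic_uncertainty(stochastic_predict, x, T=300):
    # Repeat the stochastic forward pass T times; each call samples a
    # different dropout mask, yielding different class probabilities.
    probs = np.stack([stochastic_predict(x) for _ in range(T)])  # shape (T, C)
    p_mean = probs.mean(axis=0)          # Monte Carlo estimate, equation (2)
    # Predictive entropy of the averaged probabilities, equation (3);
    # the small constant guards against log(0).
    entropy = -np.sum(p_mean * np.log(p_mean + 1e-12))
    return p_mean, entropy
```

A maximally uncertain binary prediction (mean probabilities of 0.5 and 0.5) yields an entropy of ln 2 ≈ 0.693, while a confident prediction approaches 0.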

2.5.2. Aleatoric uncertainty estimation via TTA

Aleatoric uncertainty describes the noise inherent in an input image, which cannot be addressed by adding more data, as epistemic uncertainty can. Ayhan and Berens proposed using TTA to estimate aleatoric uncertainty (Ayhan and Berens 2018).

For each original test PET/CT LN combination, we applied a unique combination of shifts, rotations, translations, blurring, flipping, and intensity alterations to obtain augmented images (50 alterations per CT/PET LN combination). Predictions were made on these augmented images. We then used the entropy formula (equation (3)) to measure the uncertainty from the mean class probabilities obtained over the augmented images. We performed a similar process with the validation cohort.
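A minimal sketch of TTA-based aleatoric uncertainty, assuming `predict` and `augment` callables (our names; `augment` applies one random combination of the transformations listed above):

```python
import numpy as np

def aleatoric_uncertainty(predict, augment, x, n_aug=50):
    # Predict on n_aug randomly augmented copies of the input and take
    # the entropy (equation (3)) of the mean class probabilities.
    probs = np.stack([predict(augment(x)) for _ in range(n_aug)])
    p_mean = probs.mean(axis=0)
    return -np.sum(p_mean * np.log(p_mean + 1e-12))
```

Unlike the epistemic estimate, the model weights are fixed here; the spread of predictions comes entirely from perturbing the input, which probes the noise inherent in the image.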

3. Results

3.1. Model performance

The AUC for the test data was 0.99 (figure 3). Sensitivity and specificity were 0.94 and 0.90, respectively. NPV and PPV were 0.99 and 0.64, respectively. The model made 169 correct, 16 false positive, and two false negative predictions. When we stratified our model's performance by LN volume (cutoff 0.525 cm3) and SUV (cutoff 2.5), the model's performance was excellent even for small and less FDG-avid LNs that radiologists might consider ambiguous (figure 4). AUC, sensitivity, and specificity were 0.98, 0.67, and 0.98, respectively, for LNs with volumes < 0.525 cm3. For LNs with volumes ≥ 0.525 cm3, AUC, sensitivity, and specificity were 0.96, 0.96, and 0.54, respectively. For LNs with SUV < 2.5, AUC, sensitivity, and specificity were 0.97, 0.85, and 0.90, respectively. For LNs with SUV ≥ 2.5, AUC, sensitivity, and specificity were 1.0, 1.0, and 0.75, respectively. Note that our model achieved a slightly higher AUC for LNs with volumes < 0.525 cm3 than for LNs with volumes ≥ 0.525 cm3. One possible reason is that our model emphasizes sensitivity, so the majority of incorrect predictions were false positives (16), in contrast to only two false negatives. As such, for LNs with larger volumes, the false positive predictions greatly decreased our model's specificity (0.54), which also led to a lower AUC.

Figure 3. Our model's performance on the test data.

Figure 4. Our model's performance on the test data, stratified by LN volume, using a cutoff of 0.525 cm3 (a sphere of approximately 1 cm diameter; A and B), and by LN SUV, using a cutoff of 2.5 as measured on PET imaging (C and D).

3.2. Prediction uncertainty

3.2.1. Epistemic uncertainty

Mean epistemic uncertainty associated with correct and incorrect predictions was 0.139 and 0.608, respectively. Median epistemic uncertainty associated with correct and incorrect predictions was 0.037 and 0.667, respectively. We used an unpaired t-test to compare the epistemic uncertainty associated with the correct predictions to the uncertainty associated with the incorrect predictions (n = 18). The two groups were statistically different (p-value = 7.403e-13).

In the validation data, the median epistemic uncertainty associated with the incorrect predictions was 0.556, and the median epistemic uncertainty associated with the entire validation cohort was 0.180. Model performance on the test data stratified by the median epistemic uncertainty obtained from the incorrect predictions and from the total predictions on the validation dataset is illustrated in figures 5 and 6, respectively. The model's performance was visibly better when the epistemic uncertainty was below the median values. The difference in model performance was more evident when we used the median uncertainty associated with the incorrect predictions obtained from the validation dataset. Among the test data stratified by the median epistemic uncertainty value obtained from the incorrect predictions in the validation dataset (figure 5), the model sensitivity and specificity were 1.0 and 0.98, respectively, for the cohort below the median uncertainty value, and 0.67 and 0.41, respectively, for the cohort above the median uncertainty value. Among the test data stratified by the median epistemic uncertainty value obtained from the entire validation dataset (figure 6), the model sensitivity and specificity were 1.0 and 0.99, respectively, for the cohort below the median uncertainty value, and 0.80 and 0.68, respectively, for the cohort above the median uncertainty value.

Figure 5. Model performance measured on the test data stratified by the median epistemic uncertainty obtained from the incorrect predictions within the validation cohort.

Figure 6. Model performance measured on the test data stratified by the median epistemic uncertainty obtained from the entire validation cohort.

3.2.2. Aleatoric uncertainty

Mean aleatoric uncertainty associated with the correct and incorrect predictions was 0.139 and 0.628, respectively. Median aleatoric uncertainty associated with the correct and incorrect predictions was 0.038 and 0.683, respectively. We used an unpaired t-test to compare the aleatoric uncertainty associated with the correct predictions to the uncertainty associated with the incorrect predictions (n = 18). The two groups were statistically different (p-value = 2.632e-15).

In the validation data, the median aleatoric uncertainty associated with the incorrect predictions was 0.582, and the median aleatoric uncertainty obtained from the entire validation cohort was 0.178. Model performance stratified by the median aleatoric uncertainty obtained from the incorrect predictions and from the total predictions is illustrated in figures 7 and 8, respectively. The model's performance on the test cases was visibly better when the aleatoric uncertainty was below the respective median values. As with the epistemic uncertainty analysis, the difference in model performance was more evident when we used the median uncertainty associated with the incorrect predictions in the validation data as a cutoff value. Among the test data stratified by the median aleatoric uncertainty value obtained from the incorrect predictions in the validation dataset (figure 7), the model sensitivity and specificity were 1.0 and 0.98, respectively, for the cohort below the median uncertainty value, and 0.67 and 0.32, respectively, for the cohort above the median uncertainty value. Among the test data stratified by the median aleatoric uncertainty value obtained from the entire validation dataset (figure 8), the model sensitivity and specificity were 1.0 and 1.0, respectively, for the cohort below the median uncertainty value, and 0.80 and 0.68, respectively, for the cohort above the median uncertainty value.

Figure 7. Model performance measured on the test data stratified by the median aleatoric uncertainty obtained from the incorrect predictions within the validation cohort.

Figure 8. Model performance measured on the test data stratified by the median aleatoric uncertainty obtained from the entire validation cohort.

4. Discussion

We constructed a deep learning model that utilizes an AlexNet-like CNN to predict the malignancy status of LNs in patients with OPSCC. This model performed favorably, as illustrated by an excellent AUC. Moreover, the model's sensitivity and specificity were comparable to, if not better than, human performance seen in prospective studies (Kyzas et al 2008). We used pathologically correlated CT and PET LN data in this patient subset to predict LN malignancy status. Model performance was excellent even when LNs were of sizes and SUVs that might be ambiguous to radiologists (figure 4). Moreover, we used epistemic and aleatoric uncertainty to quantitatively assess the predictions' reliability.

Radiologists use several CT-based criteria to assess whether an LN harbors metastatic disease, including size, location, lymphatic drainage patterns, margins, and morphology (necrotic versus cystic versus calcified). For example, a radiologist might interpret an LN as containing disease if it had a diameter larger than 1–1.5 cm, had a necrotic or cystic appearance, or was located in a specific lymphatic drainage site. PET imaging can provide corroborating information such as FDG avidity (Hoang et al 2013). A meta-analysis comparing PET imaging to other modalities such as CT found that adding PET increased radiologists' sensitivity and specificity from 79% to 85% and from 80% to 86%, respectively (Kyzas et al 2008). We obtained similar sensitivities and specificities (94% and 90%, respectively) with our model. Our model's performance, however, was based solely on evaluating LNs' PET/CT imaging characteristics and did not include data such as location, the number of additional positive LNs, or common sites of lymphatic drainage. Moreover, these meta-analysis metrics were evaluated on a per-patient basis, not per LN as in our study. Per-patient evaluation allows a more holistic assessment that accounts for the details mentioned above, not just the node in isolation. For example, the prognosis of HPV-associated SCC HN cancers differs from that of HPV-negative HN cancers, so this information could guide the model and aid its performance (Riaz et al 2014). In future studies, we hope to bolster the model's performance by incorporating, through model ensembling, additional models trained on data such as the location of the primary disease, the presence and location of other malignant or benign LNs, and p16 status.

CNNs have been used for a variety of medically oriented tasks. Kann et al used a CNN to predict extracapsular extension from CT-segmented LNs. Their model's performance on LNs with diameters greater than 1 cm was excellent (AUC 0.91, sensitivity 0.88, specificity 0.85, PPV 0.66, and NPV 0.95). Studies reporting on radiologists' ability to distinguish LNs have noted sensitivities of 57%–74% and specificities of 76%–98% (Prabhu et al 2014, Carlton et al 2017). Although Kann et al and others have demonstrated that their models perform their specific predictive task at or above the current gold standard, questions remain regarding these models' generalizability, interpretability, and reliability. Including uncertainty measures would not address all of these concerns, but it could increase the clinical utility of these models. Kwon et al visualized epistemic and aleatoric uncertainties to better understand the predictive limitations of their models, which were trained to segment areas associated with strokes. Interestingly, they also validated that the epistemic uncertainty they calculated reflected the theoretical concept it was estimating (Kwon et al 2018). Rączkowski et al used variational dropout to estimate aleatoric and epistemic uncertainty (via the Entropy H and BALD methods) to identify mislabeled sections on histopathological slides with high accuracy. Moreover, they used their uncertainty calculations from the Entropy H method to direct model learning and reduce the data needed to train their model appropriately (Rączkowski et al 2019). Similarly, we quantified the reliability of our predictions by measuring epistemic and aleatoric uncertainty. If our model were used in a clinical setting, predictions with markedly higher epistemic or aleatoric uncertainty could undergo multidisciplinary evaluation. Leibig et al adopted a referral algorithm that 'sent' images with larger uncertainty for further diagnostic testing, which significantly improved the overall accuracy within the retained images (Leibig et al 2017). We plan to use a similar referral plan in a future clinical validation.

Our study has several limitations. Correlating the pathology report with the PET and CT images involved a degree of subjective judgment. For example, two of eight LNs in cervical level II could have had metastatic disease, and we had to use additional data within the pathology report to identify these two LNs on the CT and PET scans. We included only data with a high degree of certainty; for example, we included patients whose pathology reports explicitly detailed the location, number, and appearance of malignant and benign LNs. Any uncertain LNs were reviewed by a senior radiation oncologist and were excluded if any doubt remained. This process could have biased the data and led the model to discriminate according to known correlates of malignant LNs, such as large size, necrotic appearance, or high PET avidity. Another limitation is that we explored only variations of an AlexNet-like CNN. It is possible that other architectures could further increase sensitivity and specificity. Other authors have seen success with recurrent neural networks using various input data, such as ultrasound (Azizi et al 2018). Capsule networks could be a promising architecture: because they include both convolutional and capsule layers, such models can handle transformations better and thus preserve spatial orientation. Additionally, these networks may permit users to interpret the rationale of their decision-making process (Shahroudnejad et al 2019).

In conclusion, we constructed a CNN trained on PET and CT images correlated with pathological reports of LNs, and this CNN predicted LN malignancy status as well as, if not better than, the human expert performance reported in the literature. We also evaluated prediction reliability by using epistemic and aleatoric uncertainty. In the future, we will conduct a study in which physicians re-evaluate predictions with large uncertainty to improve accuracy. After a multidisciplinary evaluation including radiologist review, this model could aid physicians in difficult cases where they must choose between watchful waiting and possible intervention with radiation therapy (Vargo et al 2016, Ansinelli et al 2018). Further investigation, however, is necessary to assess our model's generalizability.

Acknowledgments

We thank Dr. Jonathan Feinberg for editing the manuscript.

Conflicts of interest

The authors have no conflicts of interest to declare.
