3D Brain Tumour Segmentation Using UNet with Quantitative Analysis of the Tumour Features

Brain tumours are abnormal cells that grow within the brain and may cause severe harm to an individual. The diagnosis of tumours often requires a manual review of Magnetic Resonance (MR) images by a skilled radiologist, which is a laborious task. This study uses a deep learning-based approach to segment brain tumours and quantitatively evaluates the model by comparing the statistical distribution of five textural and four non-textural feature differences between the ground truth and the prediction. Textural features were extracted from a Grey Level Co-occurrence Matrix (GLCM) and averaged over four directions. A UNet model was constructed with MONAI using the BraTS 2020 dataset. Due to hardware constraints, the model was trained for only 100 epochs but still achieved a mean Dice similarity coefficient (DSC) of 0.86. The quantitative analysis showed that the ground truth and the prediction had low differences in most features, except for the percentage variation in signal intensity, energy, correlation, and Hausdorff distance, which could be attributed to false positive voxels that are far from the ground truth. The textural properties of homogeneity, entropy, and correlation had an average absolute percentage difference of less than 5% from the ground truth.

1. Introduction

A central nervous system tumour is a disorder in which abnormal cells grow in the brain and spinal cord tissues without a definite cause [1]. Brain tumours account for 2.71% of all cancer deaths worldwide [2]. In the diagnosis of brain tumours, a radiologist typically reviews and segments the malignancy manually, a time-consuming process. In recent years, machine learning-based approaches, specifically those involving convolutional neural networks (CNNs), have become popular as they tackle the complex and irregular structure of brain tumours [3] and aim to make therapy safer and more successful [4].
Numerous deep learning-based brain tumour segmentation studies have been conducted, often achieving a mean Dice similarity coefficient (DSC) index [5] of 0.8 and above. A common architecture is the 3D-UNet, which takes 3D input and performs 3D operations following those of a traditional UNet [6]. Ali et al. utilised this architecture in an ensemble with a 3D-CNN to produce a mean DSC index of 0.83 [7]. Cherguif et al. reported a maximum DSC index of 0.81 for their UNet-based model [8], and Mortazavi-Zadeh et al. achieved a DSC index of 0.89 [9]. Naser and Deen utilised a modified UNet model and reported a mean DSC index of 0.84 [10]. Though most papers reported a good DSC index, the quality of the tumour segmentation is often not quantitatively assessed beyond it. Radiomic features extracted from a tumour segmentation could be used for early diagnosis or for research into a patient's medical condition in relation to the tumour's features. The DSC index only quantifies the similarity between two sets of data; textural and non-textural features could be additional factors in evaluating the segmentation performance of a model.
The objective of this study was to design a deep CNN (DCNN) model to perform brain tumour segmentation. The performance of the DCNN segmentation was evaluated based on the DSC index and Hausdorff distance [11]. Next, the segmented tumours were used for quantitative analysis of the tumour's features, which were compared with the features extracted using the ground truth. The statistical distribution of these differences was used to visualise the model's overall performance.

MRI Datasets
A total of 369 MRI datasets were sourced from the 2020 BraTS Challenge [12][13][14][15][16]. Each dataset contains five Neuroimaging Informatics Technology Initiative (NIfTI) files consisting of the ground truth of the tumour and four MRI sequences: T1-weighted, T1-weighted contrast enhanced (T1CE), T2-weighted, and T2-weighted fluid attenuated inversion recovery (FLAIR). The ground truth is labelled with a decimal number to represent different segments of the tumour, as shown in Table 1. The 369 datasets were split into training and testing datasets with a ratio of 80%:20%, i.e. 295 training datasets and 74 testing datasets. To ensure that the ratio of high-grade glioma (HGG) to low-grade glioma (LGG) cases is the same within the split datasets, a stratification algorithm was used. Both training and testing datasets have an LGG to HGG ratio of approximately 20%:80%.
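The stratified split described above can be sketched as follows. This is a minimal NumPy illustration, not the study's exact stratification algorithm; the function name and arguments are hypothetical.

```python
import numpy as np

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split case indices so that each class (e.g. HGG vs LGG) keeps
    roughly the same proportion in both the train and test sets."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)  # per-class test quota
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)
```

Splitting per class before pooling is what guarantees the ~20%:80% LGG:HGG ratio is preserved on both sides of the 80%:20% split.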

Hardware and software
This study used PyTorch as its backend, with helper functions derived from MONAI, a PyTorch-based framework for machine learning workflows in medical imaging. The model's training extent and architecture were limited by the available hardware resources. This study was conducted on a consumer-grade CPU with 6 cores and an NVIDIA GTX 1660 GPU with 6 GB of memory and 1408 CUDA cores. The available system RAM was 16 GB.

Data pre-processing
The training and testing datasets underwent similar pre-processing. Both were converted from NIfTI into 3D PyTorch tensors, and their dimensions were reordered to ensure that the channel dimension occurs first. The 3D tensor corresponding to the ground truth segmentation was one-hot encoded to classify the different tumour segments, as shown in Table 2. Subsequently, the tensors were converted to MetaTensors before aligning all the tensors to the same spatial axes of RAS, which means that the x, y, and z axes are aligned to the right, anterior, and superior regions of the brain respectively. The MRI sequence tensors were then normalised in a channel-wise manner, where normalisation occurs within each scan. Zero values are ignored in the normalisation as the MRI sequences contain background voxels. The normalisation was the final pre-processing step shared between the training and testing datasets. For the training dataset, the data were further augmented by reducing the input dimensionality of each MRI sequence from 240 × 240 × 155 to 128 × 128 × 128. Two random spatial crops were taken for each case and were randomly flipped on all axes with a probability of p = 0.5. This resulted in different spatial orientations for the input scans. Within each random sample, the orientation and spatial crop region for all MRI scans and the segmented ground truth remained consistent.
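The normalisation and augmentation steps above can be sketched in NumPy. This is an illustrative sketch rather than the MONAI pipeline actually used (the MONAI equivalents would be transforms such as NormalizeIntensityd with nonzero=True and channel_wise=True, RandSpatialCropd, and RandFlipd); the function names here are hypothetical.

```python
import numpy as np

def normalize_nonzero_channelwise(scan):
    """Zero-mean, unit-std normalisation per channel (MRI sequence),
    ignoring background (zero) voxels."""
    out = scan.astype(float).copy()
    for c in range(out.shape[0]):
        fg = out[c] != 0                      # foreground voxels only
        mu, sd = out[c][fg].mean(), out[c][fg].std()
        out[c][fg] = (out[c][fg] - mu) / sd
    return out

def random_crop_flip(scan, label, roi=(128, 128, 128), rng=None):
    """One random spatial crop plus random flips on all axes (p = 0.5),
    applying the same crop and flips to the scan and its ground truth."""
    rng = rng or np.random.default_rng()
    starts = [rng.integers(0, s - r + 1) for s, r in zip(scan.shape[1:], roi)]
    sl = tuple(slice(st, st + r) for st, r in zip(starts, roi))
    scan, label = scan[(slice(None),) + sl], label[(slice(None),) + sl]
    for ax in range(1, 4):                    # spatial axes only
        if rng.random() < 0.5:
            scan, label = np.flip(scan, ax), np.flip(label, ax)
    return scan, label
```

Applying the identical crop window and flips to both the scans and the label is what keeps each random sample spatially consistent, as described above.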

DCNN model architecture
A UNet was constructed in two variants using the MONAI UNet class. The model was initialised with three spatial dimensions, four input channels, and three output channels. The convolutional strides were set to halve the input dimensionality with each stride. Due to hardware limitations, the size of the input to the model, the number of feature maps extracted per encoding block, and the number of residual connections within the model were investigated to find the best-performing model for evaluation. These configurations are given in Table 3. Figure 1 shows the basic structure of a MONAI UNet when the number of residual connections is set to zero. The input is sampled based on the feature map configuration set during the initialisation of the model. The number of feature maps also sets the depth of the UNet; this number is indicated with an overline in Figure 1. Tweaking the residual parameter introduces residual pathways at each layer that take an identity input from the first convolution of the input tensor, as seen in Figure 2. Residual pathways are added for both the encoding and decoding layers. The loss function used an equally weighted combination of Dice [5] and cross-entropy loss [17]; the cross-entropy term was used to decrease the training time. A Novograd optimiser [18] was instantiated with MONAI to optimise the loss function. The learning rate starts at a low initial value of 1 × 10⁻¹³ and follows a linear warmup schedule over 10 epochs to reach the peak learning rate of 1 × 10⁻³. Afterward, it undergoes cosine annealing for 90 epochs. The maximum number of epochs was limited to 100 due to hardware limitations.
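The warmup-plus-cosine learning rate schedule can be expressed as a simple per-epoch function. This is a sketch under the stated hyperparameters, not the scheduler implementation actually used; the function name is illustrative.

```python
import math

def lr_at_epoch(epoch, warmup_epochs=10, total_epochs=100,
                init_lr=1e-13, peak_lr=1e-3):
    """Linear warmup from init_lr to peak_lr over warmup_epochs,
    then cosine annealing towards zero for the remaining epochs."""
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return init_lr + frac * (peak_lr - init_lr)
    # Cosine annealing over the remaining (total - warmup) epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))
```

The near-zero starting rate keeps the Novograd updates small while the gradient statistics stabilise, and the cosine tail decays the rate smoothly to zero by epoch 100.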

Training and evaluation
During training, PyTorch's automatic mixed-precision autocasting was utilised during the computation of the loss to cast the tensors to half precision. This was done to reduce memory usage and increase the speed of computation. The optimiser was stepped through after scaling the gradients with PyTorch's gradient scaler, to prevent gradients from underflowing to zero due to the reduced precision of float16 values.
During evaluation, the DSC index [5] and Hausdorff distance [11] between the ground truth and prediction were calculated. To handle the relatively large input size (240 × 240 × 155), a sliding window inference approach was employed to partition the input into smaller dimensions that align with the training data. The original full-size tensor was retained in RAM due to inadequate GPU memory; to expedite the inference process, smaller segments of the tensor were copied onto the GPU. The overlap ratio between these sections was 50%, and each window was accorded equal weighting. After processing each sliding window, the respective predictions were consolidated to create an output tensor that matches the dimensions of the original input. The model, optimiser, and learning rate scheduler weights were saved during training if the average DSC index reported was the highest for that epoch. A TensorBoard logger was used to visualise this metric in real time and to manually terminate training if the DSC index increased slowly or stagnated. The best average DSC index during evaluation was observed from the UNet configuration with two residual units and channels set to (16, 32, 64, 128). The UNet model was initialised with this configuration for the quantitative analysis of the model.
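The stitching behaviour of sliding window inference can be illustrated with a one-dimensional analogue of MONAI's sliding_window_inference. This is a hypothetical sketch for intuition only: overlapping windows are predicted independently, then overlaps are averaged with equal weighting.

```python
import numpy as np

def sliding_window_predict(signal, window, predict, overlap=0.5):
    """Run `predict` on overlapping 1-D windows and average overlapping
    regions with equal weighting. Assumes window <= len(signal)."""
    step = max(1, int(window * (1 - overlap)))
    out = np.zeros_like(signal, dtype=float)
    counts = np.zeros_like(signal, dtype=float)
    starts = list(range(0, len(signal) - window + 1, step))
    if starts[-1] + window < len(signal):
        starts.append(len(signal) - window)  # cover the tail
    for s in starts:
        out[s:s + window] += predict(signal[s:s + window])
        counts[s:s + window] += 1.0          # equal weighting per window
    return out / counts
```

In the study's 3D setting the same idea applies per axis; only the active window needs to sit in GPU memory, which is why the full tensor could stay in RAM.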

Quantitative analysis
The 74 testing datasets were used to investigate any variations of the textural and non-textural features between the ground truth and the predicted segmentation. This study analysed the textural features of energy, entropy, correlation, homogeneity, and contrast, as given in Equations (1) to (5):

\text{Energy} = \sum_{i=1}^{N}\sum_{j=1}^{N} P(i,j)^2 \quad (1)

\text{Entropy} = -\sum_{i=1}^{N}\sum_{j=1}^{N} P(i,j)\,\log_2 P(i,j) \quad (2)

\text{Correlation} = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{(i-\mu)(j-\mu)\,P(i,j)}{\sigma^2} \quad (3)

\text{Homogeneity} = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{P(i,j)}{1+|i-j|} \quad (4)

\text{Contrast} = \sum_{i=1}^{N}\sum_{j=1}^{N} (i-j)^2\,P(i,j) \quad (5)

where P is the Grey Level Co-occurrence Matrix (GLCM) [19] confined within the tumour's segmented boundary for each MRI scan type, N is the total number of grey levels in the image, i and j are the row and column indices respectively, and μ and σ² are the GLCM mean and variance respectively. Energy quantifies repeating patterns; homogeneity evaluates the overall variance in voxel intensities, with higher homogeneity indicating a more uniform image. Contrast compares the brightness of the brightest and darkest voxels; entropy demonstrates how randomly scattered the grey levels are in a tumour. Correlation quantifies linear gradients across the MR images.
A custom function was programmed in this study to compute the GLCM from a 3D NumPy array, given a 3D direction vector. Only four direction vectors were considered for the GLCM calculation, namely the x, y, z, and diagonal directions in Cartesian coordinates. The absolute percentage difference of each feature in the predicted segmentation, as compared to the ground truth, was computed for each MRI sequence and for each direction vector of the GLCM.
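A 3D GLCM and the features above can be computed along these lines. This is a hedged sketch, not the study's custom function; in particular it assumes an already quantised integer volume and omits masking to the tumour boundary.

```python
import numpy as np

def glcm_3d(volume, direction, levels):
    """Grey Level Co-occurrence Matrix of an integer-valued 3D array
    for one offset vector (dx, dy, dz), normalised to probabilities."""
    def sl(n, d):
        # source/destination slices pairing each voxel with its neighbour
        return (slice(0, n - d) if d >= 0 else slice(-d, n),
                slice(d, n) if d >= 0 else slice(0, n + d))
    (sx, tx), (sy, ty), (sz, tz) = (sl(n, d) for n, d in zip(volume.shape, direction))
    src = volume[sx, sy, sz].ravel()
    dst = volume[tx, ty, tz].ravel()
    P = np.zeros((levels, levels))
    np.add.at(P, (src, dst), 1.0)   # count co-occurring grey-level pairs
    return P / P.sum()

def haralick(P):
    """Energy, entropy, correlation, homogeneity, contrast from a GLCM."""
    i, j = np.indices(P.shape)
    mu = (i * P).sum()
    var = (((i - mu) ** 2) * P).sum()
    nz = P[P > 0]                    # avoid log(0) in the entropy term
    return {
        "energy": (P ** 2).sum(),
        "entropy": -(nz * np.log2(nz)).sum(),
        "correlation": (((i - mu) * (j - mu) * P).sum() / var) if var > 0 else 1.0,
        "homogeneity": (P / (1.0 + np.abs(i - j))).sum(),
        "contrast": ((i - j) ** 2 * P).sum(),
    }
```

Averaging each feature over the four offsets (1,0,0), (0,1,0), (0,0,1), and (1,1,1) mirrors the study's direction-averaging step.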
This study also analysed the non-textural features of tumour volume, centre of mass, signal intensity, and the Euclidean distance of the tumour. The Euclidean distance was calculated between the centre point of each predicted tumour class and that of the ground truth. The difference in tumour volume was calculated in a similar manner. The difference in average signal intensity of the segmented region was expressed as a percentage difference from the ground truth; this difference was calculated for each MRI sequence, as their intensity values vary.
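The three non-textural comparisons can be sketched for a single binary mask pair. The function name and signature are illustrative, and both masks are assumed non-empty.

```python
import numpy as np

def nontextural_diffs(pred_mask, gt_mask, image, voxel_volume=1.0):
    """Volume difference, centre-of-mass offset (Euclidean distance),
    and percentage difference in mean signal intensity between a
    predicted and a ground-truth binary mask over one MRI sequence."""
    vol_diff = (pred_mask.sum() - gt_mask.sum()) * voxel_volume
    # Centre of mass = mean voxel coordinate of the segmented region
    com_pred = np.array(np.nonzero(pred_mask)).mean(axis=1)
    com_gt = np.array(np.nonzero(gt_mask)).mean(axis=1)
    centre_offset = np.linalg.norm(com_pred - com_gt)
    mean_pred = image[pred_mask > 0].mean()
    mean_gt = image[gt_mask > 0].mean()
    intensity_pct = 100.0 * abs(mean_pred - mean_gt) / mean_gt
    return vol_diff, centre_offset, intensity_pct
```

Because the centre of mass is a mean over all segmented voxels, a handful of false positives shifts it far more in a small segment than in a large one, which is the effect discussed in the Results.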

Results
Figure 3 shows the ground truth and predicted segmentation masks for the three target segments for the case with a DSC index of 0.95. The segmented regions were overlaid on the T2 image. Table 4 shows the mean and the standard deviation of the DSC index and Hausdorff distance across the 74 evaluation cases. The metrics are shown individually for the 59 HGG cases and 15 LGG cases. The ET Hausdorff distance for LGG was computed from only 10 cases, because MONAI gives an infinite output if either the ground truth or the prediction is null during the evaluation of this metric.
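The infinite-output behaviour follows directly from the definition of the Hausdorff distance on point sets, as this simplified brute-force sketch shows (illustrative only; MONAI's implementation differs).

```python
import numpy as np

def hausdorff(a_pts, b_pts):
    """Symmetric Hausdorff distance between two point sets (rows are
    coordinates). Returns infinity if either set is empty, mirroring
    the behaviour described above. O(n*m) brute force, for small sets."""
    if len(a_pts) == 0 or len(b_pts) == 0:
        return float("inf")
    # Pairwise distances via broadcasting
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

A null segmentation mask yields an empty coordinate set, so the "distance to the nearest point" is undefined and reported as infinite, which is why those LGG cases had to be excluded from the mean.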

Table 4: The mean and standard deviation of the DSC index and Hausdorff distance calculated from the 74 testing datasets.

HGG (59 cases): DSC index 0.86 ± 0.09, Hausdorff distance 28.58 ± 15.50 mm
LGG (15 cases): DSC index 0.70 ± 0.11, Hausdorff distance 47.92 ± 31.48 mm

Figure 4 shows that the median DSC index was 0.87 and the median Hausdorff distance was 30 mm. The model performed better in segmenting the WT, which could be attributed to its size; however, it performed more poorly in segmenting the TC. Referring to the Hausdorff distance distribution in Figure 4(b), all segments showed a wide distribution, especially the WT segment. Figure 5(a) shows that the centre point offset is more widely distributed for smaller segments such as TC and ET than for WT. This could be because false positives have more impact on smaller segments: since the centre point is the computed average position of all voxels, with fewer voxels, false positives shift the centre point more. However, the opposite trend is observed in Figure 5(b), where the difference in volume is more widely distributed in WT than in the rest of the segments. Most outliers are noted in the positive region of TC, indicating that false positives could have been adding volume to the TC segment. It should be noted that these are statistical outliers. The majority of the percentage differences in average signal intensities among the MRI sequences did not exceed 5% for non-outliers, as seen in Figure 5(c). The outliers may be attributed to false positives increasing the value during the average signal computation; the impact of even a few false positives is high since MRI NIfTI files have large integer ranges.
Based on the distribution in Figure 6, the textural properties of homogeneity, entropy, and correlation had an average absolute percentage difference of less than 5%. However, energy and contrast were observed to have a higher percentage difference. Energy quantifies repeating patterns in an image, while contrast compares the brightness of the brightest and darkest voxels; these features may be more sensitive to misclassification than the other three. Hence, while achieving comparable effectiveness to the previous studies cited in [5]-[10], this study demonstrates that it is crucial to consider the influence of misclassification on the corresponding textural features when selecting them as inputs for subsequent machine learning pipelines.
The predicted tumour and its extracted features can be used in tandem with other medical applications such as glioma grading, patient care prioritization, and estimation of life expectancy.

Conclusion
This study demonstrated that a UNet DCNN model achieved competitive performance for brain tumour segmentation despite hardware limitations and only 100 epochs of training. The quantitative analysis showed that the ground truth and the prediction had low differences in most features, except for the percentage variation in signal intensity, energy, correlation, and Hausdorff distance, which could be due to false positive voxels that are far from the ground truth. Future studies may apply post-processing techniques to reduce the false positive voxels. A more advanced hardware system is highly recommended to train a model for more than 100 epochs to achieve higher performance.

Figure 1: Basic architecture of a MONAI UNet with no residual units.

Figure 2: Change in UNet convolution layers with two residual units per layer.

Figure 3: (a) Ground truth and (b) predicted masks for each segment.

Figures 4(a) and 4(b) show the box plots of the statistical distribution of the DSC index and Hausdorff distance respectively for the different segments of the tumours. Figure 5 shows the box plots of the statistical distribution of three non-textural features, which are the (a) centre point offset, (b) difference in volume, and (c) difference in signal intensity between the ground truth and DCNN-predicted segmentation. Figure 6 shows the box plot of the statistical distribution of the average percentage difference in textural features between the ground truth and predicted segmentation.

Figure 4: Statistical distribution of the (a) DSC index and (b) Hausdorff distance.

Figure 5: Statistical distribution of the (a) centre point offset, (b) difference in volume, and (c) difference in signal intensity between the ground truth and predicted segmentation.
This study has designed a simple implementation of a UNet DCNN model for tumour segmentation with a relatively low training time of 15 hours, yet it achieved a competitively good DSC index. Referring to Table 4, it achieved an average DSC index of 0.86 ± 0.09 for HGG cases and 0.70 ± 0.11 for LGG cases. The Hausdorff distance was 28.58 ± 15.50 mm for HGG cases and 47.92 ± 31.48 mm for LGG cases.

Table 1: Labels in the ground truth of the BraTS 2020 dataset.

Table 2: One-hot encoding.

Table 3: Values that were tested for the MONAI UNet.
