Hippocampus substructure segmentation using morphological vision transformer learning

Abstract The hippocampus plays a crucial role in memory and cognition. Because of the associated toxicity from whole brain radiotherapy, more advanced treatment planning techniques prioritize hippocampal avoidance, which depends on an accurate segmentation of the small and complexly shaped hippocampus. To achieve accurate segmentation of the anterior and posterior regions of the hippocampus from T1-weighted (T1w) MR images, we developed a novel model, Hippo-Net, which uses a cascaded strategy. The proposed model consists of two major parts: (1) a localization model is used to detect the volume-of-interest (VOI) of the hippocampus, and (2) an end-to-end morphological vision transformer network (Franchi et al 2020 Pattern Recognit. 102 107246, Ranem et al 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW) pp 3710–3719) is used to perform substructure segmentation within the hippocampus VOI. The substructures include the anterior and posterior regions of the hippocampus, which are defined as the hippocampus proper and parts of the subiculum. The vision transformer incorporates the dominant features extracted from MR images, which are further improved by learning-based morphological operators. The integration of these morphological operators into the vision transformer increases the accuracy and the ability to separate the hippocampus into its two distinct substructures. A total of 260 T1w MRI datasets from the Medical Segmentation Decathlon dataset were used in this study. We conducted a five-fold cross-validation on the first 200 T1w MR images and then performed a hold-out test on the remaining 60 T1w MR images with the model trained on the first 200 images. In five-fold cross-validation, the Dice similarity coefficients were 0.900 ± 0.029 and 0.886 ± 0.031 for the hippocampus proper and parts of the subiculum, respectively.
The mean surface distances (MSDs) were 0.426 ± 0.115 mm and 0.401 ± 0.100 mm for the hippocampus proper and parts of the subiculum, respectively. The proposed method showed great promise in automatically delineating hippocampus substructures on T1w MR images. It may streamline the current clinical workflow and reduce physicians' effort.


Introduction
The hippocampus is a pair of medial, subcortical brain structures located in proximity to the temporal horn of the lateral ventricles, and it is an active research area due to its implication in memory and neuropsychiatric disorders (Jafari-Khouzani et al 2010). In radiation therapy, hippocampal avoidance whole brain radiation using volumetric modulated arc therapy (VMAT) plus the medication memantine has been shown to preserve cognitive function without compromising progression-free survival or overall survival when compared to classic whole brain radiation therapy plus memantine (Gondi et al 2014, Brown et al 2020). In Alzheimer's disease (AD), the progression of AD occurs from the trans-entorhinal cortex to the hippocampus, and finally to the neocortex (Braak and Braak 1991). These progression steps depend on the severity of the neurofibrillary tangles found in neuropathological studies. However, similar patterns can also be observed in the progression of brain atrophy found in MRI studies. Hippocampal atrophy measured from MRIs can be used as an early sign of AD progression (Jack et al 2004). Additionally, evidence of hippocampal atrophy as measured from MRIs can occur before the onset of clinical symptoms (Soldan et al 2015). Therefore, accurate segmentation of the hippocampus from MRIs is a meaningful task in medical image analysis across multiple disciplines (Siadat et al 2007).
To determine whether the hippocampus is atrophic, clinicians often need to segment the bilateral hippocampus on MRI scans and analyze its shape and volume (van de Pol et al 2006, McHugh et al 2007). This task is difficult, however, due to several factors. Firstly, the hippocampus has low contrast with the surrounding tissues on MRI scans (Salat et al 2011), since it is a gray matter structure. Secondly, the hippocampus has an irregular shape, leading to a blurred boundary in cross-sectional slices (Canada et al 2020). The substructures are shown in figure 1 of the supplemental material. Thirdly, the hippocampus is a small structure with limited volume as compared to other structures that are routinely delineated as organs-at-risk (OARs) in radiation therapy (Frodl et al 2006). Finally, there are large variations in the size and shape of the hippocampus across patients (Brown 2017). Therefore, accurate and automatic segmentation of the hippocampus is a challenging task. Manual segmentation, meanwhile, is a tedious and error-prone process, which limits its application in large-scale studies and clinical practice. Thus, many efforts have been devoted to developing computer-aided diagnostic systems for automated segmentation of the hippocampus.
The existing automatic hippocampal segmentation methods can be categorized into two main types: atlas-based methods and machine learning-based methods. Atlas-based methods can be further divided, based on the number of atlases used in the segmentation process, into single-atlas-based, average-shape atlas-based, and multi-atlas-based approaches. For instance, Haller et al first proposed to use the single-atlas-based approach for hippocampal segmentation (Haller et al 1996, Haller et al 1997). However, single-atlas-based approaches are limited by inter-patient variations. Average shape-based mapping approaches were proposed to overcome such limitations, but their segmentation results depend on the alignment quality between the target and the average maps. Thus, a priori knowledge of medical mapping was incorporated into the multi-atlas-based segmentation approach. For example, Wang et al proposed a robust discriminative multi-atlas label fusion approach to segment the hippocampus by building a conditional random field (CRF) model that combines distance metric learning and graph cuts (Wang et al 2013). Wang's approach is a patch-embedding multi-atlas label fusion method that utilizes only the relationship between the target block and the atlas block, and ignores the possibility that unrelated atlas blocks may dominate the voting process. Existing atlas-based methods consider neither the anatomical differences in the hippocampus among patients nor the correlation between atlases.
Machine learning-based methods can be further classified into traditional machine learning-based approaches and deep learning-based approaches. Traditional machine learning-based approaches mainly include the support vector machine (SVM), Markov random field (MRF), and principal component analysis (PCA), among others (Lei et al 2019, Lin et al 2021). For instance, Hao et al proposed a local label learning strategy to estimate segmentation labels of target images by using an SVM with image intensity and texture features (Hao et al 2014). However, these traditional machine learning approaches rely heavily on the quality of handcrafted features, and further suffer from slow segmentation, susceptibility to noise interference, insufficient generalization performance, and the need for manual annotation of the training data (Lei et al 2021).
Because convolutional neural network (CNN) models can automatically extract pixel-level feature information from images, they have been widely used in multiple medical image analysis tasks (Fu et al 2021). For example, CNN-based models can be used to segment the hippocampus from MRIs (Nobakht et al 2021). Qiu et al proposed a multitask 3D U-Net framework for hippocampus segmentation that minimizes the difference between the target binary mask and the model prediction while optimizing an auxiliary edge-prediction task (Qiu et al 2021). Cao et al developed a two-stage method for 3D hippocampus segmentation that localizes multi-size candidate regions and then fuses them (Cao et al 2018). These methods show promising results, demonstrating the potential of CNN-based models to improve the efficiency and accuracy of hippocampus segmentation.
However, most existing deep learning-based methods ignore the spatial information of the hippocampus relative to the entirety of the human brain (Ranem et al 2022). As a result, they cannot effectively fuse the shape features and the semantic features, which leads to lower segmentation accuracy. Hippocampal tracing began anteriorly, where the head is visible as an enclosed gray matter structure inferior to the amygdala, and continued posteriorly using the surrounding white matter or CSF as boundaries. The subiculum (posterior parts of the hippocampus) was included in the hippocampus. Delineation stopped when the wall of the ventricle was visibly contiguous with the fimbria. The subiculum occupies a portion of the para-hippocampal gyrus in the mesial temporal lobe and is a component of the medial temporal memory system. Therefore, in this work, we aim to develop a novel deep network framework that segments the hippocampus by introducing a spatial attention mechanism to capture the spatial location of the hippocampus relative to the cropped hippocampus region. We also designed a cross-layer dual-encoding, shared-decoding network to extract the semantic characteristics of the hippocampus. By combining the spatial location information and the semantic characteristics of the hippocampus, we enhanced the segmentation accuracy. In this study, we trained a novel morphological visual transformer learning-based hippocampus substructure segmentation model for accurate segmentation of the anterior and posterior regions of the hippocampus from T1-weighted (T1w) MR images.

Overview
Figure 1 outlines the schematic flow chart of the proposed hippocampus multi-substructure segmentation process. The proposed network follows the same path for both training and inference. A collection of hippocampus images and multi-substructure contours was used for model training. The proposed model, named the morphological visual transformer-based network, takes the hippocampus image as input and generates the auto-contours of two substructures: the hippocampus proper and parts of the subiculum. The manual contours of these two substructures were used as ground truth to supervise the proposed network.
The proposed model consists of two deep learning-based subnetworks, i.e. a localization model and a segmentation model. The localization model is a hippocampus detection network that detects the volume-of-interest (VOI) covering both the hippocampus proper and parts of the subiculum (Carlesimo et al 2015) from the T1w MR image. The MR image is then cropped to the VOI before transfer to the segmentation subnetwork to ease the computational task. The segmentation model is implemented via an end-to-end morphological vision transformer network, which performs substructure segmentation within the hippocampus VOI. The vision transformer incorporates the dominant features extracted from MR images. The integration of the morphological operators into the vision transformer increases its ability to separate the hippocampus into two substructures.
During inference, the trained localization model takes a hippocampus T1w MR image as input and detects the VOI of the hippocampus as the first step. Then, the cropped image within the VOI is sent to the segmentation model, i.e. the morphological visual transformer, to segment the substructures. Finally, based on the coordinates derived by the localization model, the segmented contour is converted back to the original coordinate system to obtain the final segmentation.
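The crop-and-restore bookkeeping around the localization model can be illustrated with a short sketch. The function names and the fixed-size zero-padding behavior are our own illustration, not the paper's implementation:

```python
import numpy as np

def crop_to_voi(image, center, size):
    """Crop `image` around a VOI `center` to a fixed `size`, zero-padding at borders."""
    out = np.zeros(size, dtype=image.dtype)
    src, dst = [], []
    for c, s, dim in zip(center, size, image.shape):
        lo = c - s // 2
        src_lo, src_hi = max(lo, 0), min(lo + s, dim)
        dst_lo = src_lo - lo
        src.append(slice(src_lo, src_hi))
        dst.append(slice(dst_lo, dst_lo + (src_hi - src_lo)))
    out[tuple(dst)] = image[tuple(src)]
    return out

def restore_to_original(seg_voi, center, size, full_shape):
    """Place a VOI-space segmentation back at its original coordinates."""
    out = np.zeros(full_shape, dtype=seg_voi.dtype)
    src, dst = [], []
    for c, s, dim in zip(center, size, full_shape):
        lo = c - s // 2
        dst_lo, dst_hi = max(lo, 0), min(lo + s, dim)
        src_lo = dst_lo - lo
        dst.append(slice(dst_lo, dst_hi))
        src.append(slice(src_lo, src_lo + (dst_hi - dst_lo)))
    out[tuple(dst)] = seg_voi[tuple(src)]
    return out
```

In this sketch, restoring a cropped segmentation is the exact inverse of the crop wherever the VOI lies inside the image bounds.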

Localization model
The aim of the localization model is to crop the image to a VOI that covers only the hippocampus, easing the computational task of substructure segmentation. To preserve the spatial information of the substructures, the coordinates of the detected VOI are recorded during testing. The ground-truth localization of the hippocampus is used to supervise the localization model; to derive it, the manual contour is needed. The localization model design is inspired by a recently developed focal modulation network used in object detection (Yang et al 2022). The localization model includes a hierarchical contextualization module, which extracts features at different hierarchical levels; a modulator, which combines the features from different levels; and a neural network layer for position estimation. The details of the localization model are explained as follows.
Given an input MRI I_Img ∈ ℝ^(w×h×d) and a first convolution layer that initializes the feature map F_0, a multi-scale hierarchical feature map set is collected iteratively via

F_k = GeLU(Conv(F_{k−1})),    (1)

where F_{k−1} denotes the feature map from the previous iteration and F_k is derived by applying a convolution followed by the Gaussian error linear unit (GeLU) activation function (Hendrycks and Gimpel 2016). After several iterations of equation (1), the collected multi-hierarchical feature maps are matched to the same size via interpolation and summed,

F_m = Σ_k Interp(F_k).    (2)

Then, using a neural network layer, we derive the estimate of C, labeled Ĉ = NN(F_m). In this work, the loss weight λ is set to 0.5 based on the five-fold cross-validation (shown in supplemental table 3).
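As a toy 2D illustration of the iterative multi-scale hierarchy in equation (1) and the interpolate-and-sum aggregation: here a parameter-free average pooling stands in for the learned strided convolution, and nearest-neighbor interpolation stands in for the network's resampling (both simplifying assumptions):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (Hendrycks and Gimpel 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def downsample2(x):
    # 2x2 average pooling as a stand-in for a learned stride-2 convolution
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def upsample_to(x, shape):
    # nearest-neighbor interpolation back to the input resolution
    ry, rx = shape[0] // x.shape[0], shape[1] // x.shape[1]
    return np.repeat(np.repeat(x, ry, axis=0), rx, axis=1)

def hierarchical_features(img, levels=3):
    """F_k = GeLU(conv(F_{k-1})); match all F_k to one size and sum (F_m)."""
    feats, f = [], img
    for _ in range(levels):
        f = gelu(downsample2(f))
        feats.append(f)
    return sum(upsample_to(f, img.shape) for f in feats)
```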

Morphological visual transformer
For the next step, the MRI I_Img is cropped to a VOI box whose center is defined as Ĉ. This process removes regions unrelated to hippocampus segmentation and thus improves the efficiency of the model. To ensure the cropped image is uniformly sized for the following subnetwork, zero-padding is used. The processed image is then input into the morphological visual transformer (MVT). The MVT is built in an end-to-end fashion, meaning that the input and output share the same size. After several convolutional layers with a stride of 2, the MVT uses two auto-learned morphological operators, dilation and erosion, to process the hidden feature maps (Franchi et al 2020). More details of the morphological operators are given in the supplemental material (S.1).
Compared to a convolutional kernel with a stride of 2 or a max-pooling layer, which can be regarded as a dilation with a flat square structuring element followed by pooling, the learned morphological operator can be tuned to aggregate the most important information. This further reduces redundant and meaningless information passed to the next operator, the visual transformer, and therefore improves its performance. The outputs of the two morphological operators are concatenated and fed into a projection convolutional layer and a linear projection operator to fit the input of the visual transformer. A widely used vision transformer is employed (Ranem et al 2022). Afterwards, several deconvolutional layers are applied until the output of the MVT model is equal in size to the input.
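The grayscale dilation and erosion underlying the learned morphological layers can be sketched as follows. This is a plain numpy 2D version with an explicit structuring element `w`; in the network, `w` is a trainable weight applied to 3D feature maps, so treat this only as an illustration of the operators themselves:

```python
import numpy as np

def dilation2d(x, w):
    """Grayscale dilation: each output voxel is the max over a window of x + w."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p, constant_values=-np.inf)
    out = np.full(x.shape, -np.inf)
    for di in range(k):
        for dj in range(k):
            out = np.maximum(out, xp[di:di + x.shape[0], dj:dj + x.shape[1]] + w[di, dj])
    return out

def erosion2d(x, w):
    """Grayscale erosion: each output voxel is the min over a window of x - w."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p, constant_values=np.inf)
    out = np.full(x.shape, np.inf)
    for di in range(k):
        for dj in range(k):
            out = np.minimum(out, xp[di:di + x.shape[0], dj:dj + x.shape[1]] - w[di, dj])
    return out
```

With a flat (all-zero) structuring element, dilation reduces to a max filter, matching the max-pooling analogy above; a learned non-flat `w` instead weights which neighbors dominate the aggregation.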
After the MVT step, the segmentation can be transformed back to the original coordinate system of I_Img, since the location information has been obtained from the localization model.
To supervise the MVT, a combination of two loss functions is used: a generalized cross entropy loss L_GCE and a generalized Dice loss L_GD.

The L_GCE evaluates the difference between the predicted label and the ground truth label at each voxel:

L_GCE = −Σ_i l_i log(l̂_i),    (4)

where l_i denotes the ground truth label at voxel i and l̂_i denotes the predicted label at voxel i.

The L_GD addresses the voxel-quantity imbalance between the segmented voxels (often a small portion of the whole image) and the background (a large portion):

L_GD = 1 − 2 (Σ_c w_c Σ_i l_{c,i} l̂_{c,i}) / (Σ_c w_c Σ_i (l_{c,i} + l̂_{c,i}) + ε),    (5)

where w_c is the class weight and ε is a small value. The weighted sum of these two loss terms is then used to train the MVT model.

Dataset
In total, 260 T1w MR images from the Medical Segmentation Decathlon were used in this study (Antonelli et al 2021). The Medical Segmentation Decathlon is a dataset consisting of T1-weighted magnetization-prepared rapid gradient echo (MPRAGE) MRIs of both healthy adults (90 subjects) and adults with a non-affective psychotic disorder. Structural images were acquired with a 1.5 T 3D T1-weighted MPRAGE sequence (TI/TR/TE, 860/8.0/3.7 ms; 170 sagittal slices; voxel size, 1.0 mm³). All images were collected on a Philips Achieva scanner (Philips Healthcare, Inc., Best, The Netherlands). Manual tracing of the head, body, and tail of the hippocampus was completed following a previously published protocol (Woolard and Heckers 2012).
For the purposes of this dataset, the term hippocampus includes the hippocampus proper (CA1-4 and dentate gyrus) and parts of the subiculum, which together are more often termed the hippocampal formation (Amaral and Witter 1989).The last slice of the head of the hippocampus was defined as the coronal slice containing the uncal apex.The resulting 195 labeled images are referred to as hippocampus atlases.Note that the term hippocampus posterior refers to the union of the body and the tail.
The corresponding target regions of interest (ROIs) were the anterior and posterior regions of the hippocampus, defined as the hippocampus proper and parts of the subiculum. This dataset was selected due to the precision needed to segment such a small object in the presence of a complex surrounding environment.
We conducted a five-fold cross-validation study on the first 200 T1w MR images. Then, a hold-out test was performed on the remaining 60 images using a model trained on the first 200 images. The segmentation was evaluated with multiple quantitative metrics, including the Dice similarity coefficient (DSC), 95th-percentile Hausdorff distance (HD95), mean surface distance (MSD), volume difference (VD) and center-of-mass distance (COMD). A Bland-Altman analysis and a volumetric Pearson correlation analysis were also performed.
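Of these metrics, the overlap and volume summaries (DSC, VD, COMD) reduce to a few lines of numpy, sketched below; HD95 and MSD additionally require surface extraction and distance transforms and are omitted here:

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def volume_difference(a, b, voxel_volume=1.0):
    """Absolute volume difference in physical units."""
    return abs(int(a.sum()) - int(b.sum())) * voxel_volume

def comd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Center-of-mass distance between two binary masks in physical units."""
    ca = np.argwhere(a).mean(axis=0) * spacing
    cb = np.argwhere(b).mean(axis=0) * spacing
    return float(np.linalg.norm(ca - cb))
```

For example, two identical masks give DSC = 1 and COMD = 0, while a mask shifted by one voxel along an axis with 1 mm spacing gives COMD = 1 mm.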

Implementation and evaluation
The investigated deep learning networks were designed using Python 3.6 and TensorFlow and implemented on a GeForce RTX 2080 GPU with 12 GB of memory. Our work is inspired by two recent deep learning networks, i.e. the deep morphological network (https://github.com/ranjanZ/2D-Morphological-Network) and the vision transformer (https://github.com/emla2805/vision-transformer). Optimization was performed using the Adam gradient optimizer with a learning rate of 2 × 10−4. With a batch size of 20 during training, GPU memory utilization was 96%. Once the network was trained, hippocampus segmentation takes only ~1.5 min. To demonstrate the utility of the morphological operators, an ablation study was conducted: we tested the performance of the proposed method with and without the morphological operators. To further demonstrate the significance of the proposed work, we compared the proposed method with two other popular segmentation models, the cascaded U-Net (CasU) (Liu et al 2019) and the vision transformer network (VIT) (Ranem et al 2022). The cascaded U-Net is composed of two U-Nets: the first separates the whole structure from the background and sends the extracted features into the second, which further segments the substructures. The major difference between the proposed method and the cascaded U-Net is that our localization model is used only to localize the whole structure; the cropped image is then sent to the MVT for segmentation. Unlike the U-Net, the MVT integrates auto-learned morphological layers and a vision transformer into an end-to-end network to increase segmentation precision. The major difference between the proposed method and the VIT is that the proposed method is a cascaded network, including a localization model and a segmentation model, whereas the VIT is a one-path network. Compared to our segmentation model, the VIT includes more encoding and decoding layers but does not use morphological layers. Comparisons were performed using the same training and testing datasets and computational environment.

Comparing with state-of-the-art
The visual comparison between the proposed method and the comparison methods is shown in figure 2. As can be seen from the first row, the proposed method shows good agreement with the ground truth, whereas the comparison methods do not. In the second row, misclassification of the posterior part is observed for the cascaded U-Net. To better demonstrate the segmentation accuracy, we performed absolute subtraction of the segmentation results of the proposed method and the comparison methods from the manual contour. As can be seen from the 2nd to 4th columns, the proposed method shows better agreement with the manual contour.
The linear correlation coefficients, calculated between the target volumes of the ground truth and the segmentation, are shown in figure 3. The linear correlation coefficient obtained using the proposed method was 0.999 and 0.993 on five-fold cross-validation and the hold-out test, respectively. These values indicate a good agreement between the ground truth and the proposed results, as compared to 0.989/0.983 and 0.991/0.979 obtained by the cascaded U-Net and VIT, respectively, on five-fold cross-validation/hold-out test. On the hold-out test, the VIT consistently underestimated the region, which became more pronounced for larger structures.
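The volumetric Pearson correlation and Bland-Altman statistics used here can be reproduced with a few lines of numpy. This is a generic sketch; the 1.96·SD limits of agreement are the conventional choice and are not stated in the text:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson linear correlation coefficient between paired volume measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

def bland_altman(x, y):
    """Bland-Altman bias (mean difference) and 95% limits of agreement."""
    d = np.asarray(x, float) - np.asarray(y, float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```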
The quantitative metrics of the proposed method and the alternative methods from the 200-case cross-validation and the 60-case hold-out test are listed in tables 1 and 2, and tables 3 and 4, respectively. For the cross-validation experiment, the proposed model significantly outperformed the cascaded U-Net and VIT in all metrics. In five-fold cross-validation, the DSC, HD95 (mm), MSD (mm) and COMD (mm) were 0.900 ± 0.029 and 0.886 ± 0.031, 1.156 ± 0.277 and 1.133 ± 0.264, 0.426 ± 0.115 and 0.401 ± 0.100, and 0.491 ± 0.300 and 0.738 ± 0.452 for the hippocampus proper and parts of the subiculum, respectively.
In the hold-out test, the proposed model was significantly superior to the alternative approaches, as shown in tables 3 and 4 in comparison with the cascaded U-Net and VIT. In the hold-out test, the DSC, HD95 (mm), MSD (mm) and COMD (mm) were 0.881 ± 0.033 and 0.863 ± 0.034, 1.328 ± 0.404 and 1.272 ± 0.388, 0.494 ± 0.113 and 0.466 ± 0.112, and 0.608 ± 0.313 and 0.834 ± 0.478 for the hippocampus proper and parts of the subiculum, respectively. Compared to the five-fold cross-validation, the hold-out test performed slightly worse, with slightly higher standard deviations, which may be caused by the training data's distribution not covering the full range of cases in the hold-out test.

Discussion and conclusions
A novel hippocampus segmentation method is proposed by introducing a localization mechanism to aid segmentation and by designing a morphological visual transformer network for substructure segmentation. The localization model detects the VOI of the hippocampus. The end-to-end morphological vision transformer network then performs substructure segmentation within the hippocampus VOI. The substructures include the anterior and posterior regions of the hippocampus, which are defined as the hippocampus proper and parts of the subiculum. The vision transformer incorporates the dominant features extracted from MR images and is improved by learning-based morphological operators. The morphological operators integrated into the vision transformer enhance the ability to separate the hippocampus into two substructures. While our study offers valuable insights, it is important to acknowledge some inherent limitations. Firstly, due to limited computational resources, our method focused on domain incremental learning with a cropped region for analysis. We plan to test the performance of our method in a class incremental setup. Secondly, the visual transformer contains several orders of magnitude more parameters than traditional CNNs due to the self-attention process. Compared to the state-of-the-art CasU, which in our experiments has around 48 million trainable parameters, our proposed network needs 8.6 million trainable parameters for the localization model and 113.5 million for the morphological visual transformer, 122.1 million in total. However, compared with the VIT, which needs around 200 million trainable parameters, our model size is relatively small because the morphological visual transformer is smaller than the relatively bulky VIT U-Net structure. Since we introduced the localization model to reduce the complexity of segmentation, the subsequent segmentation model, the morphological visual transformer, does not need to be extremely deep. To prevent the potential overfitting issue that may occur with transformer-based networks, we increased the dataset size via various augmentations, such as rotation, scaling, and rigid registration. In total, the training dataset in our work was increased by a factor of 7680.
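The rigid augmentations can be sketched with numpy alone, as below; only 90° rotations, flips, and integer shifts are shown, since arbitrary-angle rotation and scaling require an interpolating resampler (e.g. scipy.ndimage), and the specific parameter combinations that yield the 7680× factor are not detailed in the text:

```python
import numpy as np

def augment(volume, k_rot=0, flip_axis=None, shift=(0, 0, 0)):
    """One rigid augmentation of a 3D volume: axial 90-degree rotation,
    optional flip, and integer translation (illustrative subset only)."""
    v = np.rot90(volume, k=k_rot, axes=(0, 1))   # rotation in the axial plane
    if flip_axis is not None:
        v = np.flip(v, axis=flip_axis)           # mirror along one axis
    v = np.roll(v, shift, axis=(0, 1, 2))        # integer rigid translation
    return v
```

Enumerating combinations of such transform parameters multiplies the effective training-set size without acquiring new scans.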
It is essential to investigate an effective optimization method to reduce the amount of GPU memory allocated as well as to simplify the overall VIT U-Net architecture. Another potential challenge of this work is that the proposed method may be affected by the quality of the MR images or by non-uniformity in their distribution. In this work, T1w MRI was used for hippocampus substructure segmentation. For future work, we plan to integrate deep learning-based MRI inhomogeneity correction (Dai et al 2020b) and MRI multimodality synthesis (Dai et al 2020a) into the framework to enhance the input image quality.
Our MVT is a supervised method, which means it still requires accurate manual contours as training labels. Currently, there are semi-supervised learning methods that can learn features from unlabeled data. In a future study, we will extend the proposed method with an ensemble approach to improve its generalization performance by integrating supervised and semi-supervised learning from the limited labeled data and large-scale unlabeled MRI data.
In this study, the dataset, which comes from an international challenge, includes both healthy adults and adults with a non-affective psychotic disorder. Such inclusion increases the diversity of the training dataset, which may enable the trained model to generalize to a larger cohort of people. However, based on the data description from the source, further detailed information about physiology and medical history is unavailable. In the future, it is desirable to conduct studies with a case-by-case analysis relating the segmentation result to each patient's specific condition.
Although in this study we investigated the proposed method using datasets of healthy adults and adults with a psychotic disorder, the proposed method can also be used for brain cancer patients who are planned for radiation therapy, where auto-segmentation of hippocampus substructures has significant clinical relevance. For example, in hippocampal avoidance whole brain radiation therapy (HA-WBRT) (Popp et al 2020), current intensity modulated radiation treatment (IMRT) and arc-based VMAT techniques can reduce dose to the hippocampus without sacrificing target coverage and homogeneity (Yuen et al 2020). Further improvements in patient outcomes may be possible by considering substructures separately for optimal dose sparing; however, accurate segmentation is critical. With more accurate contouring of hippocampus substructures, it is possible to apply different dose constraints to these substructures in HA-WBRT (Popp et al 2022), allowing for better sparing of the critical part of the hippocampus. Note that to achieve the best performance, the training dataset should be collected from the corresponding patient cohort.

Figure 1 .
Figure 1.The workflow of the proposed morphological visual transformer learning-based hippocampus substructure segmentation.

For a set of MR images I_Img ∈ ℝ^(w×h×d), w and h denote the width and height of I_Img and d represents its depth. The corresponding physician-delineated hippocampus is denoted I_Seg, where I^p_seg denotes the hippocampus proper and I^s_seg denotes the parts of the subiculum. Based on I_Seg, the bounding box that just covers the hippocampus can be derived; this bounding box is defined as the ground-truth volume-of-interest (VOI). The coordinate of the VOI is represented by C = (x_c, y_c, z_c, w_c, h_c, d_c), where x_c, y_c and z_c denote the center of the hippocampus VOI, and w_c, h_c and d_c denote the width, height and depth of the VOI along the three dimensions.
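The ground-truth VOI defined above can be derived directly from the physician-delineated mask; a minimal sketch follows, where the `margin` padding is an assumed safety border, not from the text:

```python
import numpy as np

def voi_from_mask(mask, margin=2):
    """Derive the ground-truth VOI (center and size) from a binary hippocampus mask."""
    idx = np.nonzero(mask)                                  # voxel coordinates of the contour
    lo = np.array([a.min() for a in idx]) - margin          # lower corner with padding
    hi = np.array([a.max() for a in idx]) + margin + 1      # upper corner (exclusive)
    lo = np.maximum(lo, 0)
    hi = np.minimum(hi, mask.shape)
    center = (lo + hi) // 2                                 # (x_c, y_c, z_c)
    size = hi - lo                                          # (w_c, h_c, d_c)
    return tuple(center), tuple(size)
```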
To achieve this aim, we set the loss function shown in equation (3) during the training of the localization module.

Figure 2 .
Figure 2. A representative case of the proposed method and state-of-the-art methods. The 1st column shows MR images overlaid with the ground truth contour (yellow polygon denotes the hippocampus proper; red polygon denotes the parts of the subiculum). The 2nd to 4th columns show the results of the proposed method, cascaded U-Net and VIT, respectively, displayed on the MR images.

Figure 3 .
Figure 3. Bland-Altman analysis (semi-log scale) of the segmented volumes: ground truth against the proposed method and the comparison methods. Each dot indicates a data point from the dataset for that model. Row (a) shows the results of five-fold cross-validation. Row (b) shows the results of the hold-out test. The first column shows the segmentation of the first substructure; the second column shows the segmentation of the second substructure.

Table 1 .
Numerical results (hippocampus proper) on 5-fold cross-validation of proposed method, cascaded U-Net and VIT, respectively.

Table 2 .
Numerical results (parts of the subiculum) on 5-fold cross-validation of proposed method, cascaded U-Net and VIT, respectively.

Table 3 .
Numerical results (hippocampus proper) on hold-out test of proposed method, cascaded U-Net and VIT, respectively.

Table 4 .
Numerical results (parts of the subiculum) on hold-out test of proposed method, cascaded U-Net and VIT, respectively.