Cardiac MRI segmentation using shifted-window multilayer perceptron mixer networks

Objectives. In this work, we proposed a deep-learning segmentation algorithm for cardiac magnetic resonance imaging to aid in contouring of the left ventricle, right ventricle, and myocardium (Myo). Approach. We proposed a shifted-window multilayer perceptron (Swin-MLP) mixer network built upon a 3D U-shaped symmetric encoder-decoder structure. We evaluated the proposed network using public data from 100 individuals. Network performance was quantitatively evaluated using 3D volume similarity between the ground truth contours and the predictions, measured by the Dice score coefficient, sensitivity, and precision, as well as 2D surface similarity, measured by Hausdorff distance (HD), mean surface distance (MSD), and residual mean square distance (RMSD). We benchmarked the performance against two other current leading-edge networks, Dynamic UNet and Swin-UNetr, on the same public dataset. Results. The proposed network achieved the following volume similarity metrics when averaged over three cardiac segments: Dice = 0.952 ± 0.017, precision = 0.948 ± 0.016, sensitivity = 0.956 ± 0.022. The average surface similarities were HD = 1.521 ± 0.121 mm, MSD = 0.266 ± 0.075 mm, and RMSD = 0.668 ± 0.288 mm. The network shows a statistically significant improvement over the Dynamic UNet and Swin-UNetr algorithms for most volumetric and surface metrics (p-value < 0.05). Overall, the proposed Swin-MLP mixer network demonstrates performance better than or comparable to that of competing methods. Significance. The proposed Swin-MLP mixer network demonstrates more accurate segmentation performance than current leading-edge methods. This robust method demonstrates the potential to streamline clinical workflows for multiple applications.


Introduction
Cardiovascular diseases (CVDs), such as heart disease and stroke, are the leading causes of death worldwide. In most CVDs, cardiac function is affected, and its assessment therefore plays an important role in the diagnosis and monitoring of CVDs (Chen et al 2020, Lei et al 2021, Yu et al 2023). In the past few decades, there have been significant advances in methods for evaluating cardiac function, such as ultrasound (Bahtiyar and Copel 2015), magnetic resonance imaging (MRI) (Hu et al 2023), and computed tomography (CT) (Zheng et al 2008, Jun Guo et al 2020). Cardiac MRI is a radiation-free, non-invasive, multiplane, and multimodal imaging technique that can accurately quantify ventricular anatomy and function at rest and under stress (Chen et al 2020). Cardiac MRI renders rich and dynamic multi-frame image datasets and can be performed in various orientations, most notably short- and long-axis views. Short-axis cardiac MRI images are considered the gold standard for quantitative cardiac analysis. Quantitative measurements such as myocardial mass, wall thickness, ejection fraction, and ventricular volume and mass can be obtained from segmented cardiac MRI images (Nappi et al 2019, Dewey et al 2020, Kramer et al 2020, Yu et al 2023, Zhou et al 2024).
Cardiac image segmentation is a fundamental intermediate task in quantitative measurements. Accurate segmentation of cardiac substructures, such as the left ventricle (LV), right ventricle (RV), and endocardial and epicardial borders, while critical, is a time-consuming and tedious task for physicians to perform manually and is subject to variation among individuals. The cardiac substructures of greatest interest include the myocardium (Myo), LV, RV, left atrium (LA), right atrium (RA), and coronary arteries. The Myo, LV, and RV can be segmented directly on short-axis cardiac MRI scans, whereas the LA and RA can be segmented on standard 2D long-axis MRI images (Kramer et al 2020, Zhang et al 2021). Moreover, the number of publications on MRI segmentation is significantly higher than the number on CT and ultrasound segmentation, which may be attributable to the greater number of publicly available MRI datasets (Chen et al 2020). There is an unmet need for accurate and efficient segmentation in cardiac MRI to aid in diagnosing and monitoring CVDs, given its pivotal role in cardiac function assessment. Despite advancements in cardiac imaging, manual segmentation of cardiac substructures is labor-intensive and prone to variation, underscoring the necessity for automated methods to support clinical decision-making in cardiovascular care.
Different deep learning models have been proposed for automatic image segmentation, the majority of which have focused on the chambers of the heart, including the LV, RV, and LA. A residual convolutional neural network was proposed to make better use of the spatial aspects of cardiac MR data and improve cardiac segmentation accuracy (Liu et al 2020, Xu et al 2023). A fully convolutional neural network, a variant of the convolutional neural network (CNN), performs end-to-end pixel-wise segmentation and aims to achieve further improvements in segmentation performance by optimizing the network structure to enhance the feature learning capacity and by investigating different loss functions. A dilated residual network was proposed to capture features at full resolution in the bottleneck of UNet for segmenting the LV, RV, and Myo, which significantly increases spatial and temporal information and maintains the localization accuracy (Harms et al 2021, Zhang et al 2021, Ahmad et al 2022). The multilayer perceptron (MLP)-Mixer uses MLPs to replace the convolution operation of the CNN and the self-attention mechanism of the Transformer (Tolstikhin et al 2021, Cheng and Wang 2023). MLP-Mixer builds contextual and inter-channel correlations between tokens through cross-position and per-position operations. All of these methods still have limitations, especially at the base and the apex of the heart and in modeling long-distance relationships. Models based on U-shaped symmetric CNNs, shifted-window (Swin) transformers, and MLP-Mixers have demonstrated state-of-the-art accuracy and performance on different 3D segmentation tasks (Ronneberger et al 2015, Liu et al n.d., Pan et al 2023a, 2023b). We are motivated by the success of Swin transformers and MLP-Mixer networks in segmentation tasks. In this study, we propose a Swin-MLP Mixer network, a token-based U-shaped MLP-Mixer network (Swin-MLP) adapted for cardiac MRI segmentation. Swin-MLP Mixer layers replace convolutional layers in both the encoder and decoder to better extract global features without losing local features, ultimately improving segmentation performance. We evaluated Swin-MLP on the public automated cardiac diagnosis challenge (ACDC) dataset, aiming to segment the LV, RV, and Myo. Volume-based (Dice, precision, sensitivity) and surface-based (Hausdorff distance (HD), mean surface distance (MSD), residual mean square distance (RMSD)) metrics were used to calculate the similarities between the predicted and ground truth contours and to assess the performance of Swin-MLP. We compared the performance of the proposed Swin-MLP mixer network against both a CNN-based network, Dynamic UNet (Ronneberger et al 2015), and a vision transformer-based network, Swin-UNetr (Hatamizadeh et al 2022). The main contributions of this work are summarized as follows:
(1) Introducing a novel approach that adapts the MLP-Mixer architecture for cardiac MRI segmentation.
(2) Enhancing the MLP-Mixer layers to extract global characteristics more effectively, thereby improving segmentation performance.
(3) Integrating a shifted-window-partition layer designed to concurrently capture local interactions, thereby strengthening the feature-learning capability of the MLP-Mixer layers. This strategy enriches the feature representation and improves learning performance.
(4) Evaluating the performance of our proposed framework and demonstrating that Swin-MLP achieves superior segmentation performance compared to state-of-the-art machine learning techniques for cardiac MRI images.

Material and methods
The proposed workflow comprises several steps, from raw cardiac MR images to the final segmentation results. The two primary steps are (1) preliminary data preprocessing (including cropping, normalizing, and data augmentation) and (2) the segmentation network. We designed a network that automatically segments the LV, RV, and Myo from raw input images composed of 3D volumes of cardiac MRI generated by stacking 2D short-axis frames. The final 3D label map is provided as a segmentation map containing the background (black), LV, RV, and Myo.

Dataset
We used the publicly available ACDC dataset (Bernard et al 2018). Briefly, the acquisitions were obtained using two MRI scanners of different magnetic field strengths (1.5 T (Siemens Area, Siemens Medical Solutions, Germany) and 3.0 T (Siemens Trio Tim, Siemens Medical Solutions, Germany)). Cardiac MR images were acquired in breath hold with retrospective or prospective gating in short-axis orientation. The spatial resolution ranges from 1.37 to 1.68 mm² pixel⁻¹. Depending on the patient, 28-40 images were acquired, covering the cardiac cycle completely or partially. This dataset includes patients with four well-defined pathologies: (1) myocardial infarction (MINF), (2) dilated cardiomyopathy (DCM), (3) hypertrophic cardiomyopathy (HCM), and (4) abnormal RV (ARV). The dataset comprises scans from 150 patients divided into five subgroups of 30 patients each: one healthy/normal (NOR) group and four pathological groups. In this study, 20 subjects from each group, with manual LV, RV, and Myo segmentations, were used, for a total of 100 subjects over five groups. We trained and tested our segmentation network on these 100 subjects from the ACDC training dataset, using an 80%-20% split between the training and test sets. Each subject has two frames, resulting in 160 images for training and 40 images for testing in our study.
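The image counts above follow from simple arithmetic, which the sketch below reproduces. It assumes a subject-level split made separately within each group (the text states only the 80%-20% ratio, so the per-group stratification is an assumption), and the subject IDs, function name, and seed are hypothetical.

```python
import random

GROUPS = ["NOR", "MINF", "DCM", "HCM", "ARV"]
FRAMES_PER_SUBJECT = 2  # two annotated frames per subject, as stated in the text


def split_subjects(subjects_by_group, train_fraction=0.8, seed=0):
    """Split subject IDs into train/test lists, one group at a time.

    Splitting by subject (not by frame) keeps both frames of a subject
    on the same side of the split.
    """
    rng = random.Random(seed)
    train, test = [], []
    for group in sorted(subjects_by_group):
        ids = list(subjects_by_group[group])
        rng.shuffle(ids)
        cut = round(train_fraction * len(ids))
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


# Hypothetical subject IDs: 20 subjects in each of the five groups.
subjects = {g: [f"{g}_{i:02d}" for i in range(20)] for g in GROUPS}
train_ids, test_ids = split_subjects(subjects)

n_train_images = len(train_ids) * FRAMES_PER_SUBJECT
n_test_images = len(test_ids) * FRAMES_PER_SUBJECT
```

With 16 training and 4 test subjects per group, this gives 80 and 20 subjects, and therefore 160 and 40 images.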

Data pre-processing
MRI scans and their corresponding manual ground truth contours were resampled to a voxel size of 1 × 1 × 1 mm³ and zero-padded to dimensions of 256 × 256 × 12 voxels during each training iteration.
From each MRI scan, four patches of size 256 × 256 × 4 were randomly selected, and segmentation was predicted using a sliding-window approach in which the window size matched the patch size, with patch-sized overlap and Gaussian weighting applied to the window edges. Data augmentation techniques, including rotation within a ±30-degree range, rescaling by −0.2× to 0.2× of the original size, and Mixup augmentation with a parameter of 0.2, were employed to enhance generalization. Voxel intensities of all scans were independently normalized to the interval [−1, 1] for both training and inference. The output generated by the network during inference was interpolated to match the dimensions of the original input before undergoing resampling (Psaroudakis and Kollias 2022, Pan et al 2022c).
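Two of the preprocessing steps above, intensity normalization to [−1, 1] and Mixup, can be sketched in a few lines. This is a minimal illustration on flat Python lists, not the authors' pipeline: min-max scaling is one common way to reach [−1, 1], and reading the Mixup parameter 0.2 as the alpha of a Beta(alpha, alpha) distribution is an assumption. The function names are hypothetical.

```python
import random


def normalize_to_unit_range(values):
    """Linearly map a flat list of voxel intensities to [-1, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant image: map everything to the centre of the range
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def mixup(image_a, label_a, image_b, label_b, alpha=0.2, rng=random):
    """Blend two samples with a mixing weight drawn from Beta(alpha, alpha).

    Images and (one-hot or soft) labels are flat lists of equal length;
    the same weight is applied to both so they stay consistent.
    """
    lam = rng.betavariate(alpha, alpha)
    image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label, lam


scan = normalize_to_unit_range([0, 50, 100])
mixed_image, mixed_label, lam = mixup(
    [1.0, 1.0], [1.0, 0.0], [-1.0, -1.0], [0.0, 1.0], rng=random.Random(0)
)
```

With a small alpha, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals.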

Swin-MLP segmentation network
We proposed a Swin-MLP mixer network built on a 3D U-shaped symmetric encoder-decoder network, as shown in figure 1. The encoder was a down-sampling path consisting of one early convolutional block and four subsequent down-sampling Swin MLP-Mixer blocks, which extract semantic information from the input scans. The early convolutional block efficiently encoded pixel-level spatial information. To improve segmentation accuracy, we used MLP-Mixer blocks to extract long-range dependencies across the input features. The up-sampling decoder mirrored the encoder and reconstructed a segmentation map from the learned features. A softmax function transformed the decoder's output into a segmentation map for three classes. Instance normalization and a Gaussian error linear unit (GELU) activation function were applied following each convolutional layer and MLP-Mixer layer. In practice, the channel dimensions were set to 32, 64, 128, 256, and 512 for the early convolutional layer and the four subsequent Swin-MLP Mixer blocks in the encoder, respectively. The decoder's channels were set to 256, 128, 64, and 32 for the up-sampling Swin-MLP Mixer blocks, and the final convolutional layer's channel count was set to the number of segmentation classes (N = 3). Dice and cross-entropy losses were used to optimize the network.
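A combined Dice and cross-entropy objective can be sketched as follows. The text does not give the exact form, so the equal weighting of the two terms, the soft Dice averaged over classes, and the smoothing constant `eps` are all assumptions, and `dice_ce_loss` is a hypothetical name. Plain Python lists are used for clarity; a real training loop would use tensors.

```python
import math


def dice_ce_loss(probs, target, num_classes, eps=1e-6):
    """Sum of a soft Dice loss and a cross-entropy loss (equal weights assumed).

    probs  : list of per-voxel softmax outputs, each a list of num_classes floats
    target : list of per-voxel ground-truth class indices
    """
    n = len(target)
    # Cross-entropy: mean negative log-probability assigned to the true class.
    ce = -sum(math.log(max(p[t], eps)) for p, t in zip(probs, target)) / n

    # Soft Dice per class, then averaged over classes.
    dice_terms = []
    for c in range(num_classes):
        intersection = sum(p[c] for p, t in zip(probs, target) if t == c)
        denominator = sum(p[c] for p in probs) + sum(1 for t in target if t == c)
        dice_terms.append((2.0 * intersection + eps) / (denominator + eps))
    dice_loss = 1.0 - sum(dice_terms) / num_classes

    return dice_loss + ce


target = [0, 1, 2, 0]
perfect = [[1.0 if c == t else 0.0 for c in range(3)] for t in target]
uniform = [[1.0 / 3.0] * 3 for _ in target]

loss_perfect = dice_ce_loss(perfect, target, num_classes=3)
loss_uniform = dice_ce_loss(uniform, target, num_classes=3)
```

The Dice term counters class imbalance by scoring overlap per class, while the cross-entropy term provides a smooth per-voxel gradient, which is a common reason for combining the two.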

Convolution layer
Input features were defined as a 4D map $X_{in} \in \mathbb{R}^{H \times W \times L \times D}$, where $H$, $W$, $L$ represent the input dimensions and $D$ represents the number of feature-map channels. In the first layer, a convolutional block was applied to down-sample the input feature into $X_{conv}$. The early convolutional block with a 1 × 1 × 1 voxel size better encoded pixel-level spatial information. The output $X_{conv}$ was then fed into the Swin-MLP block.

Swin-MLP mixer block
In our Swin-MLP block, $X_{conv}$ was passed into a window-partition layer, which divided it into multiple local windows, and then into two MLP-Mixer layers, which calculated the global information inside each local window. Finally, a window-recovering layer generated a feature map with the same size as $X_{conv}$.

Window partition layer
We first partitioned the input $X_{conv}$ into non-overlapping windows using the window-partition layer.

MLP-mixer layer
Two MLP-Mixer layers calculated the global information inside each local window. The global information from all the windows was then collected to form a feature map $X_{window}$ with the size of the original $X_{conv}$. $X_{window}$ was then passed into a Swin-partition layer, two MLP-Mixer layers, and the same window-recovering layer to obtain the final output $X_{Swin}$. In this way, we encouraged the network to learn not only information confined to individual windows but also the interactions between voxels across windows. The output attentions were fused with the input feature to recover pixel-level details.
MLP-Mixer layers were implemented to calculate the global interactions within and between tokens. Figure 3 illustrates the MLP-Mixer layer. Each MLP-Mixer layer was composed of (1) a token-mixing MLP that calculated the information shared across all features (the voxels in columns) and (2) a channel-mixing MLP that calculated the information shared across all channels (the voxels in rows) (Pan et al 2022a, 2022b). In equations (1) and (2), which define these two MLPs, $W_1$, $W_2$, $W_3$, $W_4$ are the weights of four linear layers; $B_1$, $B_2$, $B_3$, $B_4$ are the corresponding biases; $\sigma$ is a GELU activation function; $\lambda$ is a layer normalization; and $T(U)$ is the transpose of $U$ from equation (1). In the encoder, the dimensions of the token-mixing linear layers ($W_1$, $W_2$) were empirically set to 256, 512, 1024, and 2048 for the first to the fourth MLP-Mixer blocks, respectively, and the dimensions of the channel-mixing linear layers ($W_3$, $W_4$) were set to 64, 128, 256, and 512. In the decoder, the dimensions of the token-mixing linear layers were 1024, 512, 256, and 128, and the dimensions of the channel-mixing layers were 256, 128, 64, and 32. Thus, $X_{Swin}$ could be recovered to the size of $X_{flattern}$.
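One plausible form of the two mixing equations, consistent with the symbols defined above, is sketched below. It assumes that the layer follows the standard MLP-Mixer formulation (Tolstikhin et al 2021) with residual connections, which matches the summation arrows in figure 3. The input symbol $X$ (one flattened window) and the output symbol $Y$ are labels introduced here for illustration, and the exact placement of the biases and of the transpose in the original equations may differ.

```latex
\begin{align}
U &= X + W_2\,\sigma\!\left(W_1\,\lambda(X) + B_1\right) + B_2, \tag{1}\\
Y &= T(U) + W_4\,\sigma\!\left(W_3\,\lambda\!\left(T(U)\right) + B_3\right) + B_4. \tag{2}
\end{align}
```

Under this reading, equation (1) mixes information across tokens for each channel, and equation (2) applies the same pattern to the transposed result so that the mixing runs across channels for each token.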

Window recovering
Because each $X_{Swin}$ had the same size as the input $X_{flattern}$, we gathered the $X_{Swin}$ from all the windows, with their corresponding 3D position coordinates recorded as $[i, j, k]$, to re-form a feature map with the same size as $X_{conv}$ and with abundant global information.
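The window-partition, shifted-window, and window-recovering operations can be sketched on a toy volume as follows. This is a simplified illustration using nested Python lists with one scalar per voxel; the real layers operate on multi-channel feature tensors. The function names are hypothetical, and the cyclic roll is an assumed way of implementing the shift with cyclic boundary concatenation.

```python
from itertools import product


def cyclic_shift(vol, s):
    """Roll a nested-list volume by s voxels along every axis, wrapping at the borders."""
    H, W, L = len(vol), len(vol[0]), len(vol[0][0])
    return [[[vol[(i + s) % H][(j + s) % W][(k + s) % L]
              for k in range(L)] for j in range(W)] for i in range(H)]


def window_partition(vol, n):
    """Cut the volume into non-overlapping n x n x n windows.

    Returns a dict mapping each window's corner coordinate [i, j, k]
    to its flattened list of n**3 tokens.
    """
    H, W, L = len(vol), len(vol[0]), len(vol[0][0])
    windows = {}
    for i, j, k in product(range(0, H, n), range(0, W, n), range(0, L, n)):
        windows[(i, j, k)] = [vol[i + a][j + b][k + c]
                              for a, b, c in product(range(n), repeat=3)]
    return windows


def window_recover(windows, n, shape):
    """Scatter flattened windows back to their recorded corner coordinates."""
    H, W, L = shape
    vol = [[[None] * L for _ in range(W)] for _ in range(H)]
    for (i, j, k), tokens in windows.items():
        for (a, b, c), value in zip(product(range(n), repeat=3), tokens):
            vol[i + a][j + b][k + c] = value
    return vol


# Toy 4 x 4 x 4 volume whose voxel values encode their own position.
H = W = L = 4
N = 2
volume = [[[i * 16 + j * 4 + k for k in range(L)] for j in range(W)] for i in range(H)]

regular = window_partition(volume, N)                          # first partition
shifted = window_partition(cyclic_shift(volume, N // 2), N)    # shifted by N/2
restored = cyclic_shift(window_recover(shifted, N, (H, W, L)), -(N // 2))
```

In this toy case each shifted window contains voxels that belonged to different regular windows, which is what lets the second pair of MLP-Mixer layers exchange information across the first partition's boundaries.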

Implementation details and evaluation metrics
Our method was implemented in Python 3.8.11 and PyTorch 2.0.1. A single NVIDIA RTX A6000 GPU was used for training and inference in all experiments. All networks were trained with an AdamW optimizer (Loshchilov and Hutter 2019) with a learning rate of 1 × 10⁻⁴; the momentum and weight decay were set to 0.9 and 1 × 10⁻⁴, respectively.
We evaluated our proposed network performance by calculating the volume similarity between the ground truth contours and the network predictions using the Dice score coefficient (DSC), sensitivity, and precision, as well as surface similarity using Hausdorff distance (HD), MSD and residual mean square distance (RMSD).The performance of the proposed network was compared with two state-of-the-art networks, Dynamic UNet (Ronneberger et al 2015) and Swin-UNetr (Hatamizadeh et al 2022).
DSC was calculated using equation (3). This metric uses the overlap between the predicted result and the ground truth as a composite measure of segmentation performance and is used for the heart in a wide range of contexts. DSC ranges between 0 and 1, with a higher value representing a better segmentation result (Cheng and Wang 2023),

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}. \tag{3}$$

Sensitivity was calculated using equation (4) as a measure of correctly identifying pixels that were in the region of interest in a segmentation experiment. In other words, it is the number of true positive results divided by the number of all samples that should have been identified as positive (Cheng and Wang 2023),

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}. \tag{4}$$

Precision is the number of true positive results divided by the number of all positive results and was calculated using equation (5) (Cheng and Wang 2023),

$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{5}$$

In the above equations, TP is true positive, denoting that the sample is deemed positive and is, in fact, positive. TN is true negative, denoting that the sample has been judged to be negative and is, in fact, negative. FP is false positive, denoting that the sample is thought to be positive but is actually negative. FN is false negative, denoting that the sample is thought to be negative but is actually positive.
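The three volume-based metrics can be computed from the four confusion counts, as in the minimal sketch below. It uses the standard definitions of Dice, sensitivity, and precision for a single structure on flat binary masks; the function names and the toy masks are illustrative only.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn


def dice(tp, fp, fn):
    return 2.0 * tp / (2.0 * tp + fp + fn)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


# Toy masks for one structure (1 = inside the structure, 0 = outside).
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(pred, truth)
dsc = dice(tp, fp, fn)
sens = sensitivity(tp, fn)
prec = precision(tp, fp)
```

None of the three metrics uses TN, so they are insensitive to how much background surrounds the heart. All three are undefined when their denominator is zero, for example when a structure is absent from both masks, and a practical implementation needs a convention for that case.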
Surface distance metrics estimate the error between the outer surfaces $S$ and $S'$ of the segmentations $X$ and $X'$. The HD measures the local maximum distance between the segmented surface and the corresponding reference surface, comparing the symmetrical distance between the predicted and actual contours. Because it is evaluated at the spatial resolution of the cine-MRI, it is sensitive to edge errors in the segmentation result. A smaller HD represents a better segmentation result. The HD was calculated using equation (6). MSD denotes how much, on average, the surface varies between the segmentation and the ground truth, in mm, and was calculated using equation (7). RMSD is defined for 2D surface similarities and was calculated using equation (8). We performed a two-sample t-test and calculated the p-value to statistically compare our proposed method with Dynamic UNet and Swin-UNetr.
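The three surface metrics can be sketched from nearest-neighbour distances between two sets of surface points. The definitions used here are the common symmetric ones and are assumptions about the exact form of equations (6) to (8): HD as the maximum nearest-neighbour distance in either direction, MSD as the mean over both directions, and RMSD as the square root of the mean squared distance. The function names and toy contours are illustrative.

```python
import math


def _nearest(point, surface):
    """Distance from one point to the closest point of a surface."""
    return min(math.dist(point, q) for q in surface)


def surface_distances(surf_a, surf_b):
    """Nearest-neighbour distances in both directions (A to B, then B to A)."""
    a_to_b = [_nearest(p, surf_b) for p in surf_a]
    b_to_a = [_nearest(q, surf_a) for q in surf_b]
    return a_to_b + b_to_a


def hausdorff_distance(surf_a, surf_b):
    return max(surface_distances(surf_a, surf_b))


def mean_surface_distance(surf_a, surf_b):
    d = surface_distances(surf_a, surf_b)
    return sum(d) / len(d)


def rms_surface_distance(surf_a, surf_b):
    d = surface_distances(surf_a, surf_b)
    return math.sqrt(sum(x * x for x in d) / len(d))


# Toy 2D contours in mm: they agree except for one displaced point.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]

hd = hausdorff_distance(predicted, reference)
msd = mean_surface_distance(predicted, reference)
rmsd = rms_surface_distance(predicted, reference)
```

The toy case shows why the metrics complement each other: a single displaced point sets HD to the full 1 mm error, while MSD and RMSD dilute it over all surface points, with RMSD weighting the outlier more heavily than MSD.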

Results
The results of the proposed Swin-MLP Mixer network and its comparison to Dynamic UNet and Swin-UNetr for a randomly chosen representative sample are shown in figures 4 and 5. For visual comparison, we measured the absolute difference between each predicted segmentation and the ground truth, as shown in figure 4. The first row shows the segmentation masks from each method for the representative sample, and the second row shows their differences from the ground truth. As can be seen from the second row of figure 4, Swin-MLP has the lowest number of pixels that differ from the ground truth. The number of pixels differing from the ground truth was 403, 418, and 850 for Swin-MLP, Dynamic UNet, and Swin-UNetr, respectively. Moreover, the LV, RV, and Myo segmentations are fully connected for Swin-MLP and Dynamic UNet, whereas Swin-UNetr segmented the RV poorly and left holes in the RV segmentation.
Figure 5 shows the segmentation of the LV, RV, and Myo on three different frames at the base, middle, and apex of the heart from the representative sample. Swin-UNetr has holes in the RV segmentation on the base and apex frames. Moreover, on the apex frame, Dynamic UNet has a smaller RV segmentation compared to the ground truth. Swin-MLP segmented all three structures (LV, RV, and Myo) on the base, middle, and apex frames consistently with the ground truth segmentations.
Table 1 presents the quantitative analysis of volume and surface similarity measurements for the proposed Swin-MLP network, averaged over all subjects, and compares the results of Swin-MLP with those of Dynamic UNet and Swin-UNetr. We calculated the segmentation performance of our method using six evaluation metrics: three volume-based metrics (Dice score, precision, and sensitivity) and three surface-based metrics (HD, MSD, and RMSD). For better comparison and visualization, the quantitative results are plotted in figure 6. As can be seen in figure 6, Swin-MLP exhibits the narrowest box plots compared to Dynamic UNet and Swin-UNetr, indicating lower variability, and hence more consistent accuracy, in the segmentation of the LV, RV, and Myo across different images. For the volume-based similarity metrics, Swin-MLP achieved a significantly improved DSC for all segments compared to both Dynamic UNet and Swin-UNetr (p-value < 0.05). For LV segmentation, Swin-MLP performed better than the other two methods, showing higher DSC, precision, and sensitivity values. For the surface-based metrics, Swin-MLP achieved a lower HD on all segments. For Myo segmentation, Swin-MLP achieved lower HD, MSD, and RMSD than the other two methods, and the differences were statistically significant (p-value < 0.01). Overall, Swin-MLP demonstrated a higher DSC (volume-based metric) and a lower HD (surface-based metric) for LV, RV, and Myo segmentations than the competing CNN- and transformer-based networks when applied to a public cardiac MRI dataset.
Table 2 summarizes the Dice scores of our model and other models. The Dice score, used here as the measure of accuracy, is improved for the LV and RV segments using our proposed network and is comparable for the Myo segment to the results of the MLP-VNet mixer by Pan et al. Chen et al provided a summary of the accuracy of state-of-the-art segmentation methods verified on the ACDC challenge dataset. In their summary, the method presented by Isensee et al (2018) shows the greatest accuracy for LV, RV, and Myo segmentations (Chen et al 2020). The Dice score is the most widely used metric in validating medical volume segmentation, with an improved Dice score correlating with a more efficient second-check process compared to manual contouring. The Swin transformer introduced a hierarchical design in which the shifted-window approach produces a hierarchical feature representation. Shifted-window self-attention is the key element that makes the Swin Transformer effective and efficient. Our method requires no manual input parameters, correction, or intervention. We also evaluated our Swin-MLP mixer network for segmentation of the LV, RV, and Myo for each disease group separately. Table 3 shows the quantitative evaluation for each disease group, which is also illustrated in figure 7 for better comparison. As can be seen from figure 7, Swin-MLP mostly performed better on the whole dataset, with higher DSC, precision, and sensitivity values and lower HD, MSD, and RMSD values, than on each disease group separately. For RV segmentation, the precision on the whole dataset is lower than that of the DCM group, and the MSD is higher than that of the DCM and normal groups. This may be due to the smaller training size when evaluating each disease group separately than when evaluating the whole dataset.

Discussion
There is a need for cardiac functional assessment and quantitative measurements in order to perform CVD diagnosis (Bernard et al 2018, Chen et al 2020). Accurate quantitative analysis of cardiac disease requires measurements such as myocardial mass and wall thickness, LV and RV volume and mass, and ejection fraction. Cardiac image segmentation is a fundamental task in these quantitative measurements of functional parameters, assisting radiologists and physicians in the diagnosis of cardiac diseases (Wang et al 2019). Accurate segmentation of the cardiac structures, including the LV, RV, and endocardial and epicardial borders, helps to accurately measure these functional parameters of the heart for disease diagnosis and treatment (Bernard et al 2018, Li et al 2023). Recent advancements in automatic segmentation have emerged alongside the development of artificial intelligence and machine learning algorithms. Deep learning algorithms have undergone significant evolution in the context of medical image segmentation, drawing considerable attention to the segmentation of cardiac structures.
In this paper, we present a segmentation network, Swin-MLP, specifically designed for automated segmentation of cardiac substructures on MRI scans. The unique strength of Swin-MLP is its hybrid architecture, which combines the merits of both convolutional and MLP-Mixer layers. This integration enables the model to capture information simultaneously at both the local and global levels, thereby enhancing the accuracy of the segmentation tasks. The model employs a window-partition layer that divides the input image into non-overlapping windows, facilitating the calculation of localized global information within each of these windows. Subsequently, these partitioned windows are passed as feature maps to the MLP-Mixer blocks, where intricate patterns and relationships within the data are learned. Furthermore, to boost the model's capacity for learning and generalization, we incorporated a shifted-window-partition layer. This specialized layer performs calculations across windows, thereby capturing interactions on a more global scale. This inclusion not only enriched the feature representation but also notably enhanced the learning performance of our network. Incorporating the Swin partition technique in the second MLP layer makes it possible to calculate information across the non-shifted windows obtained in the first window-partition layer. This improvement in the network's learning ability is the advantage of the shifted-window partition.
Khened et al proposed a 2D FCN dense U-Net incorporating the parallel structures of the inception module (Khened et al 2019). Their network required the fewest trainable parameters, an approximately 10-fold reduction compared to standard U-Net-based architectures. They calculated Dice scores for the end-diastolic (ED) and end-systolic (ES) phases separately, reporting 0.941 for LV, 0.907 for RV, and 0.894 for Myo segmentations. Isensee et al proposed an ensemble of 2D and 3D U-Nets (Isensee et al 2018), which uses both a 2D and a 3D model and leverages their respective advantages through ensembling. Their approach was robust against frame misalignments, different MRI protocols, and various pathologies. They achieved Dice scores of 0.950 and 0.923 for the LV and RV cavities, respectively, and 0.911 for the LV Myo on the ACDC test set. Ahmad et al proposed a dilated residual network (Ahmad et al 2022) to capture features at full resolution in the bottleneck of UNet, which significantly increases spatial and temporal information and maintains the localization accuracy. They calculated Dice scores for the ED and ES phases separately, with overall average values of 0.949, 0.924, and 0.907 for LV, RV, and Myo, respectively. Our proposed Swin-MLP mixer achieved 0.960, 0.929, and 0.968 for LV, RV, and Myo, respectively, which is superior to these state-of-the-art machine learning techniques for segmentation of cardiac MRI images.
One of the primary limitations to deploying the MLP-Mixer model in clinical settings is its computational complexity, which scales linearly with the size of the input image. This means that as the dimensions of medical images such as CT or MRI scans increase, the computational resources required for processing also escalate proportionally. Consequently, hardware with a larger memory capacity is needed to accommodate such computational demands. This not only increases the necessary investment in hardware infrastructure but also imposes considerable latency during the training phase, which could be detrimental in time-sensitive clinical scenarios. Another concern is the susceptibility of the MLP-Mixer model to overfitting, which occurs when the model captures the noise in the training data and consequently performs poorly on unseen or test data. The architecture of MLP-Mixer, which has the capability to capture long-distance features within an image, exacerbates this issue. The problem is further aggravated when dealing with limited medical datasets, which are common in specialized healthcare applications. This can result in decreased model generalization and diminished reliability in clinical implementations of MLP-Mixer.
This research indicates the promising potential of Swin-MLP as a robust segmentation tool, and we believe that its further refinement and extension will greatly facilitate its clinical applications. In future work, we plan to expand the application of our proposed Swin-MLP network by using diverse training and testing datasets to include more normal and abnormal cases. Furthermore, by analyzing the generated segmentation maps, we intend to extract specific clinically relevant parameters, such as LV ejection fraction and myocardial wall thickness. These extracted features will serve as crucial indicators for diagnostic analysis in the context of cardiac diseases. Subsequently, we aim to employ these parameters to train an ensemble system tailored for the precise classification and diagnosis of cardiac conditions, such as myocardial infarction and heart failure.
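As an illustration of how one such parameter could be derived from segmentation maps, the sketch below computes the LV ejection fraction from the LV volumes at end-diastole and end-systole using the standard definition EF = (EDV − ESV) / EDV × 100. It is not part of the presented method: the label index, the function names, and the toy label maps are hypothetical, and the 1 mm³ voxel volume follows the resampling described in the pre-processing.

```python
LV_LABEL = 3  # hypothetical label index for the LV cavity


def lv_volume_ml(label_map, voxel_volume_mm3=1.0, lv_label=LV_LABEL):
    """LV cavity volume in millilitres from a flat list of voxel labels."""
    n_voxels = sum(1 for v in label_map if v == lv_label)
    return n_voxels * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Toy label maps: 120 000 LV voxels at end-diastole, 48 000 at end-systole.
ed_map = [LV_LABEL] * 120_000 + [0] * 1_000
es_map = [LV_LABEL] * 48_000 + [0] * 73_000

edv = lv_volume_ml(ed_map)
esv = lv_volume_ml(es_map)
ef = ejection_fraction(edv, esv)
```

Because the volumes are obtained by counting voxels, segmentation errors at the base and apex translate directly into errors in the derived ejection fraction.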

Conclusions
This work presents a new segmentation network, Swin-MLP, for automatic segmentation of the LV, RV, and Myo in cardiac MRI scans. The proposed Swin-MLP captures both local and global-level information for accurate segmentation. Our proposed network performs better in detecting LV, RV, and Myo structures compared to state-of-the-art CNN- and Transformer-based segmentation networks. Our research highlights Swin-MLP's potential as a robust segmentation tool for clinical applications, with plans to refine it further and expand its use with diverse datasets. Future work aims to extract clinically relevant parameters from segmentation maps to develop an ensemble system for precise classification and diagnosis of cardiac conditions.

Figure 1. Swin-MLP network architecture: (a) an input scan is fed into an encoder to learn features, which are forwarded to the decoder with an architecture mirroring the encoder. (b) The Swin-MLP block consists of a resampled convolutional block, window-partition layer, and two sequential MLP-Mixer layers followed by window-recovering layers.

Figure 2. Window partitioning schematic from conventional to shifted window. The left window partitioning is used in the first layer, and the right one is the shifted window used in the second layer. The thin lines represent patches, and the thick lines represent window partitions.
More specifically, $X_{conv}$ was divided into windows with the size of $N \times N \times N$, where $N$ was empirically set to 4, 8, 4, 4 for the first to the fourth MLP-Mixer blocks, respectively. The 3D position coordinates of each window, in practice the coordinates of its top-left corner, were recorded as $[i, j, k]$ for the window-recovering layer. Each local window was flattened into a 2D feature map, $X_{flattern}$, with the size of $\frac{H}{2N} \times \frac{W}{2N} \times \frac{L}{2N} \times 2D$, and fed into the following MLP-Mixers to calculate the global information. The shifted-window-partition layer followed the same operations but shifted the windows by $\frac{N}{2} \times \frac{N}{2} \times \frac{N}{2}$ voxels with cyclic boundary concatenation (e.g. the left-most $\frac{N}{2} \times \frac{N}{2} \times \frac{N}{2}$ window was concatenated to the right-most window to form a complete $N \times N \times N$ window).
Figure 2(a) shows the window partition used in the first MLP layer, and figure 2(b) shows the Swin partition used in the second MLP layer. With the Swin partition technique, we could enhance the network's learning ability by calculating the information across the non-shifted windows obtained in the first window-partition layer.

Figure 3. MLP-Mixer layers: Each MLP-Mixer layer was composed of (1) a token-mixing MLP calculating the information shared across all features (the voxels in columns) and (2) a channel-mixing MLP calculating the information shared across all channels (the voxels in rows). The blue arrows represent the summation.

Figure 4. Visual comparisons of different methods on a sample frame from the ACDC dataset. The top row shows the segmentation results, and the bottom row shows the differential masks, i.e. the difference between the segmentation results and the ground truth segmentations. LV: light blue, RV: yellow, and Myo: blue.

Figure 5. Cardiac segmentation results on three frames (base, middle, apex) from the ground truth and three different segmentation methods. The first column is the original cardiac frame with true segmentations, and the other three columns are results from different segmentation methods. LV: light blue, RV: yellow, and Myo: blue. Solid white arrows point to differences between the ground truth and each segmentation. The dotted white arrow points to holes in the RV segment from the Swin-UNetr method.

Figure 6. Evaluation metrics for the comparison between the proposed Swin-MLP method, Dynamic UNet, and Swin-UNetr. For the top row metrics (Dice, precision, and sensitivity), higher values represent a better method, and for metrics on the bottom row (HD, MSD, and RMSD), lower values represent a better method.

Figure 7. Evaluation metrics for comparison of the proposed method (Swin-MLP) on different disease groups. For the top row metrics (Dice, precision, and sensitivity), higher values (↑) represent better results, and for metrics on the bottom row (HD, MSD, and RMSD), lower values (↓) represent better results.

Table 1. Quantitative analysis of segmentation results for Swin-MLP vs. Dynamic UNet and Swin-UNetr. The best network(s) are presented in bold for each structure. The p-value is given by the comparison between the proposed Swin-MLP and the two competing networks.

Table 2. Results on the ACDC dataset for comparison of our model and popular models in terms of Dice coefficient for LV, Myo, and RV segmentations.

Table 3. Quantitative analysis of the segmentation results for Swin-MLP on the normal group and four different disease groups. MINF: myocardial infarction, DCM: dilated cardiomyopathy, HCM: hypertrophic cardiomyopathy, ARV: abnormal RV, NOR: normal.