AI approach to biventricular function assessment in cine-MRI: an ultra-small training dataset and multivendor study

Objective. Training an accurate and generalizable model on an ultra-small data set of multi-orientation cardiac cine magnetic resonance imaging (MRI) images is a major challenge. We aimed to develop a 3D deep learning method based on an ultra-small training data set of multi-orientation cine MRI images and to assess its performance in automated biventricular structure segmentation and function assessment in a multivendor setting. Approach. We trained and tested our deep learning network using cardiac data sets of only 150 cases (90 cases for training and 60 cases for testing). These data were obtained from three different MRI vendors, and each subject included two phases of the cardiac cycle and three cine sequences. A 3D deep learning algorithm combining Transformers and U-Net was trained. The performance of the segmentation was evaluated using the Dice metric and Hausdorff distance (HD). On this basis, the manual and automatic results for the cardiac function parameters were compared across vendors using Pearson correlation, the intraclass correlation coefficient (ICC) and Bland–Altman analysis. Main results. The average Dice was 0.92, 0.92, and 0.94 and the HD95 was 2.50, 1.36, and 1.37 for the three sequences. The automatic and manual results of the seven parameters were strongly correlated, with r² values ranging from 0.824 to 0.983. The ICC (0.908–0.989, P < 0.001) showed that the results were highly consistent. Bland–Altman analysis with 95% limits of agreement showed no significant differences except for RVESV (P = 0.005) and LVM (P < 0.001). Significance. The model achieved high segmentation accuracy and excellent correlation and consistency in function assessment. It provides a fast and effective method for studying cardiac MRI and heart disease.


Introduction
Accurate assessment of biventricular anatomy, volumes, and function is highly clinically relevant for diagnosing, treating, and prognosticating patients with cardiovascular disease (Krittayaphong et al 2009, Marwick 2018). Cine magnetic resonance imaging (MRI) has become a gold standard non-invasive imaging tool in cardiovascular medicine, especially for visualizing and quantifying cardiovascular function and myocardial tissue characterization (Kramer et al 2020, Schulz-Menger et al 2020). Accurate heart structure segmentation is the most crucial prerequisite for volume calculation and function analysis, but manual segmentation and careful review of many individual images are time-consuming.
Deep learning-based automated or semiautomated methods for quantifying and analyzing cine MRI have developed rapidly in recent years. Several deep learning methods have been applied to biventricular structure segmentation and function assessment to improve accuracy and speed (Bai et al 2018, Campello et al 2021, Wang et al 2022). However, most deep learning methods are based on available public data sets or single short-axis (SAX) cardiac magnetic resonance (CMR) images, while only a few studies have carried out automatic segmentation and cardiac function evaluation on multi-sequence images (Budai et al 2020, Koehler et al 2021). Relying on a public data set or a single sequence means that the model is not adequately adapted to additional vendors, sequences, or full functional analysis. Moreover, relatively large data sets require considerable effort and time for image annotation, training, and model adjustment.
In this work, we develop a new deep learning model to address the abovementioned problems. To shorten training time and improve generalization, this paper adopts an ultra-small cine MRI data set with multiple vendors and multiple cine MRI sequences and carefully tunes the parameters to achieve the best performance. To verify the clinical applicability of this model, we evaluated the agreement and similarity of fully automatic and manual results on an independent test set composed of multiple vendors and multiple sequences.

Datasets
The local institutional review board approved this retrospective study, and informed consent was not required. The information on all images was anonymized before use. For training and testing our algorithm, we collected 90 subjects as a development data set from three major MR vendors at one medical center; each case includes three cine MRI sequences: a SAX, a long-axis two-chamber (LAX-2CV), and a long-axis four-chamber (LAX-4CV). The data set includes 52 males and 38 females with an average age of 32.3 years (range 13 to 68 years). 80% of the images were used for training and 20% for validation. Additionally, 60 subjects from these three vendors were acquired as an independent testing set (average 35.4 years, 36.7% female). The clinical information associated with the data set is shown in table 1, and the scan sequences are shown in table 2.

MR acquisition
The cine MR study used non-gated steady-state free precession (SSFP) sequences in the coronal, axial, and sagittal planes, which can be acquired in less than 2 min. Cardiac synchronization is achieved by using the

Manual annotation
All the cine MR images were analyzed by two radiologists with more than 10 years of experience in cardiac MRI.
To ensure consistency and reduce variability in manual annotation, all annotations were marked by one expert and reviewed by another. In each patient's images, the left ventricle (LV), left ventricular wall (LVW) and right ventricle (RV) needed to be annotated, and all annotations followed the same protocol: papillary muscles and trabeculations were included in the LV and RV blood pools. Manual annotation was performed using a self-customized labeling tool (Uscube MedLabel, Uscube Science and Technology Co. Ltd, Beijing, China). In a complete annotation, contours in the end-diastolic and end-systolic phases were manually annotated. An automatic threshold can extract most structures, but for structures with inaccurate threshold segmentation, we used a manual anchor method to draw the boundary of the tissue structure. The image annotations on SAX, LAX-2CV, and LAX-4CV are shown in figure 1.

Deep learning algorithm
In light of the potential drawbacks of simple skip connections, such as the incorporation of shallow features with limited semantic information that may adversely affect the final performance (Cao et al 2021, Chen et al 2021, Dosovitskiy et al 2021), we propose a novel UT-Net framework for 3D medical image segmentation. This framework is designed with a multi-scale fused module and comprises a CNN-encoder and a Transformer-encoder, as illustrated in figure 2. The Transformer-encoder can be regarded as an advanced skip-connection operation that compensates for the spatial information loss occurring during pooling operations. As a result, UT-Net serves as a powerful encoder that effectively integrates encoder features and minimizes the semantic gap, thereby enhancing the overall performance in medical image segmentation tasks. The overall structure of the proposed UT-Net is shown in figure 2(a). Given an input 3D medical image $X \in \mathbb{R}^{C \times H \times W \times D}$ with height H, width W, depth D (number of slices), and channels C, we first utilize a 3D CNN to capture local feature maps $x_i$, $i = 1, 2, 3, 4, 5$. These CNN feature maps are then flattened into input sequences for the Transformer-encoder to extract global context. Similar to UNETR (Hatamizadeh et al 2022), we generate a 1D sequence $x \in \mathbb{R}^{n \times d_h}$ with length n given by $C \times P[0] \times P[1] \times P[2]$, where P[0], P[1], P[2] are the numbers of patches.
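To make this flattening step concrete, the minimal sketch below (not the authors' code; the channel widths, token dimension, spatial sizes, and class names are assumptions made for illustration) projects each encoder stage to a common token width and flattens it into a sequence; the per-stage sequences correspond to the Q_i introduced below, and their concatenation to Q_S:

```python
import torch
import torch.nn as nn

class StageFlatten(nn.Module):
    """Project one CNN stage to a common token width d and flatten it into a
    token sequence of shape (batch, N_i, d). Illustrative sketch only."""
    def __init__(self, in_channels: int, d: int = 256):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, d, kernel_size=1)  # 1x1x1 channel projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_i, H_i, W_i, D_i) feature map from encoder stage i
        x = self.proj(x)                     # (B, d, H_i, W_i, D_i)
        return x.flatten(2).transpose(1, 2)  # (B, N_i, d), N_i = H_i * W_i * D_i

# Five stages with assumed channel widths and spatial sizes; the flattened
# per-stage sequences (Q_i below) are concatenated into the shared key/value (Q_S below).
channels = [32, 64, 128, 256, 512]   # assumed encoder widths
sizes = [32, 16, 8, 4, 2]            # assumed per-stage spatial resolution
flatteners = [StageFlatten(c) for c in channels]
feats = [torch.randn(1, c, s, s, s) for c, s in zip(channels, sizes)]
queries = [f(x) for f, x in zip(flatteners, feats)]  # each (1, N_i, 256)
q_s = torch.cat(queries, dim=1)                      # concatenated key/value sequence
print([q.shape for q in queries], q_s.shape)
```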
The Transformer-encoder contains L layers with a common structure, each consisting of a multi-scale connection (MSC) block, a multi-head cross-attention (MHCA) block that effectively integrates multi-scale features and adaptively focuses on relevant information, and a feed-forward network (FFN), as illustrated in figure 2(b). Given the feature maps generated by the CNN-encoder, we first reshape them into sequences of $d \times N$ feature maps $Q_i$, $i = 1, 2, 3, 4, 5$. We then concatenate the sequences of the five CNN-encoder stages into $Q_S$, which serves as the key and value. The motivation for cross-attention lies in the ability to capture richer contextual information and inter-stage dependencies between different channel feature levels. The proposed MSC module therefore has six inputs: the five sequences $Q_i$ as queries and the concatenated sequence $Q_S$ as key and value. The query-key-value (QKV) attention is then formulated as

$$\mathrm{CA}(Q_i, Q_S) = \mathrm{softmax}\left(\frac{(Q_i W_Q)(Q_S W_K)^{T}}{\sqrt{d}}\right) Q_S W_V,$$

where W denotes a weight matrix and d denotes the length of the sequences. The output of the layers can be calculated by

$$\hat{z}_l = \mathrm{MHCA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \mathrm{FFN}(\mathrm{LN}(\hat{z}_l)) + \hat{z}_l,$$

where LN(·) denotes layer normalization and $z_l$ is the output of the l-th layer. Finally, upsampling and convolutional operations are stacked to produce the segmentation result.
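The sketch below illustrates one such layer under assumptions (the head count, token width, and class names are invented for the example, and PyTorch's built-in multi-head attention stands in for the paper's MHCA implementation): queries come from one stage sequence Q_i, while the concatenated sequence Q_S supplies the keys and values, followed by an FFN with layer normalization and residual connections.

```python
import torch
import torch.nn as nn

class CrossAttentionLayer(nn.Module):
    """One Transformer-encoder layer in the spirit of figure 2(b):
    multi-head cross-attention (queries from one stage, keys/values from the
    concatenated multi-scale sequence) followed by a feed-forward network,
    each with layer normalization and a residual connection. Sketch only."""
    def __init__(self, d: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(d)
        self.norm_kv = nn.LayerNorm(d)
        self.mhca = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, q_i: torch.Tensor, q_s: torch.Tensor) -> torch.Tensor:
        # q_i: (B, N_i, d) queries from stage i; q_s: (B, N_s, d) concatenated key/value
        attn_out, _ = self.mhca(self.norm_q(q_i), self.norm_kv(q_s), self.norm_kv(q_s))
        z = q_i + attn_out                   # residual connection around cross-attention
        z = z + self.ffn(self.norm_ffn(z))   # residual connection around the FFN
        return z                             # refined stage-i sequence, same shape as q_i

# Hypothetical usage with shapes matching the flattening sketch above.
layer = CrossAttentionLayer(d=256)
q_i = torch.randn(1, 512, 256)    # tokens of one encoder stage
q_s = torch.randn(1, 4096, 256)   # concatenation of all five stages
out = layer(q_i, q_s)
print(out.shape)                  # torch.Size([1, 512, 256])
```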

Training and testing
We used 5-fold cross-validation in the following experiments. The experiments were performed in the PyTorch framework. Adaptive moment estimation (Adam) was used for training with a learning rate of 0.01, a momentum of 0.99, a weight decay of 3e−5, a batch size of 2, and 300 epochs. To train our model, we adopt a combined loss $L_{total}$ which consists of the dice loss $L_{Dice}$ and the cross-entropy loss $L_{CE}$ as follows:

$$L_{total} = L_{Dice} + \lambda L_{CE},$$

where λ is set to 1 in our experiments, meaning that the dice loss and the cross-entropy loss are weighted equally. We adopt the dice coefficient (Dice) and the 95% Hausdorff distance (HD95) as the evaluation metrics. In the training stage, as shown in figure 3, the network is easy to optimize with fast convergence, demonstrating that the model is powerful at capturing variable features.
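As a minimal sketch of such a combined objective (assuming λ = 1, a soft multi-class Dice term, and a standard cross-entropy term; an illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Combined loss L_total = L_Dice + lambda * L_CE for multi-class
    3D segmentation. Illustrative sketch; lambda defaults to 1."""
    def __init__(self, num_classes: int, lam: float = 1.0, eps: float = 1e-5):
        super().__init__()
        self.num_classes = num_classes
        self.lam = lam
        self.eps = eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W, D) raw network outputs; target: (B, H, W, D) class indices
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 4, 1, 2, 3).float()
        dims = (0, 2, 3, 4)
        intersection = (probs * one_hot).sum(dims)
        cardinality = probs.sum(dims) + one_hot.sum(dims)
        dice = (2.0 * intersection + self.eps) / (cardinality + self.eps)
        dice_loss = 1.0 - dice.mean()
        return dice_loss + self.lam * ce

# Hypothetical usage: 4 classes (background, LV, LVW, RV).
criterion = DiceCELoss(num_classes=4)
logits = torch.randn(2, 4, 16, 64, 64, requires_grad=True)
target = torch.randint(0, 4, (2, 16, 64, 64))
loss = criterion(logits, target)
loss.backward()
```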
After the deep learning model was trained, we tested it on the independent testing data set: the 60 cases of three cine MR sequences from multiple vendors (table 2). To quantitatively evaluate the automatic segmentation results, we calculated the Dice and HD between all voxels of the automatic segmentation and the manual annotation, with the manual annotation used as the reference standard.
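For reference, the two metrics can be computed per structure roughly as follows (a sketch using NumPy/SciPy; the voxel spacing, mask shapes, and function names are illustrative and not the evaluation code used in this study):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    inter = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def hd95(pred: np.ndarray, ref: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th-percentile symmetric Hausdorff distance between mask surfaces (mm)."""
    def surface(mask):
        return np.logical_xor(mask, binary_erosion(mask))
    def surface_distances(a, b):
        # distance from every surface voxel of a to the nearest surface voxel of b
        dt = distance_transform_edt(~surface(b), sampling=spacing)
        return dt[surface(a)]
    d_ab = surface_distances(pred.astype(bool), ref.astype(bool))
    d_ba = surface_distances(ref.astype(bool), pred.astype(bool))
    return np.percentile(np.concatenate([d_ab, d_ba]), 95)

# Hypothetical usage on one structure (e.g. the LV label) of one case.
pred_lv = np.zeros((10, 64, 64), dtype=bool); pred_lv[3:7, 20:40, 20:40] = True
ref_lv  = np.zeros((10, 64, 64), dtype=bool); ref_lv[3:7, 21:41, 21:41] = True
print(dice_coefficient(pred_lv, ref_lv), hd95(pred_lv, ref_lv, spacing=(8.0, 1.4, 1.4)))
```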

Image and statistical analysis
Automatic cardiac image analysis was evaluated by calculating LVEF, RVEF, and LVM, and the results were compared with the reference values from the manual annotation. The ejection fraction was calculated as

$$\mathrm{EF} = \frac{V_{ED} - V_{ES}}{V_{ED}} \times 100\%,$$

where $V_{ED}$ represents the volume of the end-diastolic ventricle and $V_{ES}$ represents the volume of the end-systolic ventricle. The ED and ES phase volumes of the ventricle were calculated by multiplying the total number of ventricle voxels by the volume of a single voxel on the SAX sequence. LVM was calculated as the total epicardial volume (sum of epicardial cross-sectional areas multiplied by the sum of the slice thickness and interslice gap) minus the total endocardial volume (sum of endocardial cross-sectional areas multiplied by the sum of the slice thickness and interslice gap), multiplied by the specific myocardium density of 1.05 g ml⁻¹. Evaluation was performed in a Python environment, and statistical analysis was performed using SPSS software (V26.0; SPSS, Chicago, IL, USA).
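The sketch below illustrates this calculation from voxel-wise masks (assumed voxel spacing, mask shapes, and function names; for LVM it uses the equivalent voxel-count form of the area-times-thickness computation described above, and it is not the study's analysis code):

```python
import numpy as np

MYOCARDIAL_DENSITY = 1.05  # g/ml, as stated above

def ventricular_volume_ml(mask: np.ndarray, voxel_volume_mm3: float) -> float:
    """Volume of a binary ventricle mask: voxel count times single-voxel volume."""
    return mask.sum() * voxel_volume_mm3 / 1000.0  # mm^3 -> ml

def ejection_fraction(v_ed_ml: float, v_es_ml: float) -> float:
    """EF = (V_ED - V_ES) / V_ED * 100%."""
    return (v_ed_ml - v_es_ml) / v_ed_ml * 100.0

def lv_mass_g(epi_mask_ed: np.ndarray, endo_mask_ed: np.ndarray,
              voxel_volume_mm3: float) -> float:
    """LVM: (epicardial volume - endocardial volume) at ED times 1.05 g/ml."""
    epi_ml = ventricular_volume_ml(epi_mask_ed, voxel_volume_mm3)
    endo_ml = ventricular_volume_ml(endo_mask_ed, voxel_volume_mm3)
    return (epi_ml - endo_ml) * MYOCARDIAL_DENSITY

# Hypothetical usage with an assumed SAX voxel size of 1.4 x 1.4 x 8.0 mm.
voxel_vol = 1.4 * 1.4 * 8.0
lv_ed = np.zeros((10, 64, 64), bool); lv_ed[2:8, 20:44, 20:44] = True
lv_es = np.zeros((10, 64, 64), bool); lv_es[3:7, 24:40, 24:40] = True
v_ed, v_es = ventricular_volume_ml(lv_ed, voxel_vol), ventricular_volume_ml(lv_es, voxel_vol)
print(f"LVEDV={v_ed:.1f} ml, LVESV={v_es:.1f} ml, LVEF={ejection_fraction(v_ed, v_es):.1f}%")
```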

Segmentation accuracy
On the three cine MR sequences of the independent testing data set, we calculated the Dice and HD95 values, where Dice is the most important and intuitive measure of segmentation accuracy and HD95 is better suited to evaluating the segmentation boundary. Table 4 lists the detailed results between the reference annotation and the automatic segmentation to benchmark the trained model. Automatic segmentation of these cardiac structures achieved an average Dice ≥ 0.92 on the three sequences; the Dice was lowest on the LVW (0.89) of the LAX-2CV sequence and highest on the LV (0.96) of the LAX-4CV sequence. For the three sequences, the average HD95 value of each was relatively small (≤2.5), and the maximum value was 3.33 on the RV of the SAX sequence. On the whole, the segmentation results were excellent.

Evaluation of function parameters
For clinical parameters, we calculated LVEDV, LVESV, RVEDV, RVESV, LVEF, RVEF, and LVM from the automatic segmentation results and evaluated them against the manually derived values in the testing data set (n = 60).
a. LVEDV: the r² value was 0.978, the ICC was 0.989 and the consistency significance F test gave F = 180.806, P < 0.001. The results showed a high correlation and agreement. The Bland-Altman method was used, and the 95% limits of agreement between the automatic and manual results were −14.74 to 13.37. There was no significant difference between manual and automatic measurements (Z = −0.788, P = 0.431).
b. LVESV: the r² value was 0.977, the ICC was 0.988 and the consistency significance F test gave F = 169.824, P < 0.001. The results showed a high correlation and agreement. The Bland-Altman method was used, and …

Discussion
In this work, we trained on an ultra-small data set and achieved accurate automatic segmentation of the LV, RV, and LVW from multiple cine MRI sequences and multiple vendors. The small data set did not include cases of abnormal cardiac structural development, which is a limitation of this study. However, we believe that cases of congenital cardiac dysplasia are a minority in the general population and should be treated as exceptional cases that will not significantly affect most applications.
In this paper, we developed a new deep learning network called UT-Net, which combines the advantages of Transformers and U-Net. Table 4 shows that this network performs well when trained on an ultra-small training data set and produces excellent results after training on heterogeneous data from different vendors and sequences, with good generalization. Table 6 shows that our results are excellent compared with previous deep learning methods on the SAX sequence. It is crucial for medical research to achieve strong model learning from small data while retaining good generalization. This network will significantly reduce the labeling time of new studies, reduce the impact of heterogeneous data on the deep learning algorithm, and expand its application scope.
In terms of the functional parameters, the automatic segmentation results were highly correlated with the manual annotation results, and the seven parameters showed good agreement in the overall trend according to the ICC results. This is highly consistent with previous studies, which show that the accuracy of automated calculation is comparable with manual analysis (Bhuva et al 2019, Luo et al 2019). However, there was a significant difference in agreement between the automatic and manual results for RVESV and LVM in the Bland-Altman plots. The automatic results were higher than the manual results for LVM, whereas the RVESV results showed the opposite. This analysis revealed apparent differences in a few individual cases, which was verified by visual inspection of the automatic and manual segmentation results. For example, a few images do not accurately identify the basal SAX slice in manual and automated LVW segmentation (figure 5). Because only the LVW pixels in the ED phase are used to calculate LVM, the results produced by this misrecognition differ considerably, which leads to the significant difference in LVM. Even so, this noticeable difference was caused by only a few images and does not affect the general trend of agreement on the testing data set, as the ICC results illustrate.
A limitation of this work is that all training and testing were performed using retrospective data, and the amount of data was relatively small. Its prospects for clinical application still need further evaluation, especially on data sets covering particular heart diseases and cardiovascular abnormalities.
In conclusion, an advanced 3D deep learning algorithm combining Transformers and U-Net was trained on an ultra-small training data set and produced an excellent model. After assessing automated biventricular segmentation and function assessment across multiple vendors, we found that the network can accurately segment the ventricles and generate results that are highly correlated and consistent with those of experienced observers, without any manual intervention. This network provides a fast and effective method for studying heart disease with small data sets in the future.

Figure 1. The manual annotation in the ED and ES phases on the SAX, LAX-2CV and LAX-4CV sequences. On the SAX sequence, we only show the annotations at the base, middle and apex slices. Red marks the LV, green marks the RV, and blue marks the LVW. The LV and RV blood pools include papillary muscles and trabeculations.

Figure 2. Diagram showing the complete workflow and algorithm structure of the LV, RV, and LVW automatic segmentation network: (a) overview of the proposed UT-Net; (b) details of a single proposed Transformer layer.

Figure 3. Diagram showing the training process on the three sequences, mainly the relationship between loss and epoch. The loss decreases as the epochs increase, and the training and validation curves converge and fit well.

Figure 4. Correlation plots (left) and Bland-Altman plots (right) of the seven function parameters generated from the automatic and manual segmentation results.

Figure 5. Two cases with significantly different LVM between the automatic and manual results in the ED phase. Red represents the automatically segmented LVW, and purple is the manual annotation. The yellow box shows that automatic recognition identifies more LVW structure at the basal slice, resulting in higher LVM results than the manual analysis.

Table 1. Clinical and diagnostic information related to the data set.

Table 2. Specifications of the development and test data sets.

Table 3. MRI equipment and scanning protocol parameters. Vendor 1: Philips 3.0 T Achieva; Vendor 2: Siemens 3.0 T Magnetom Verio; Vendor 3: GE 3.0 T Discovery MR750w.

Table 4. The Dice and HD of the biventricular structures from the three MR sequences with multiple vendors.

Table 6. Dice comparison with previous methods on the SAX sequence of the automatic diagnosis challenge.