MUE-CoT: multi-scale uncertainty entropy-aware co-training framework for left atrial segmentation

Objective. Accurate left atrial segmentation is the basis of the recognition and clinical analysis of atrial fibrillation. Supervised learning has achieved competitive segmentation results, but the high annotation cost often limits its performance. Semi-supervised learning learns from limited labeled data together with a large amount of unlabeled data and shows good potential for solving practical medical problems. Approach. In this study, we propose a multi-scale uncertainty entropy-aware co-training framework (MUE-CoT) that achieves efficient left atrial segmentation from a small amount of labeled data. Based on a pyramid feature network, learning is implemented from unlabeled data by minimizing the pyramid prediction difference. In addition, novel loss constraints are proposed for co-training. The diversity loss is defined as a soft constraint to accelerate convergence, and a novel multi-scale uncertainty entropy calculation method together with a consistency regularization term is proposed to measure the consistency between prediction results. Because the quality of pseudo-labels cannot be guaranteed in the early training period, a confidence-dependent empirical Gaussian function is proposed to weight the pseudo-supervised loss. Main results. Experimental results on a publicly available dataset and an in-house clinical dataset showed that our method outperformed existing semi-supervised methods. For the two datasets with a labeled ratio of 5%, the Dice similarity coefficient scores were 84.94% ± 4.31 and 81.24% ± 2.4, the HD95 values were 4.63 mm ± 2.13 and 3.94 mm ± 2.72, and the Jaccard similarity coefficient scores were 74.00% ± 6.20 and 68.49% ± 3.39, respectively. Significance. The proposed model effectively addresses the challenges of limited data samples and the high cost of manual annotation in the medical field, leading to enhanced segmentation accuracy.


Introduction
Atrial fibrillation (AF) is a common arrhythmia. Mild AF presents with symptoms such as palpitations and tachycardia, whereas severe AF may lead to cardiac thromboembolism and then ischemic stroke. According to data from the Global Burden of Disease Study, the number of AF cases worldwide in 2018 reached 33.5 million (Turakhia et al 2018). According to the white paper of the AF-SCREEN International Collaboration published in 2019, 8 million AF cases were reported in China, and AF patients had a five times higher risk of stroke than normal people. AF is strongly associated with the left atrium (LA) (Ausma et al 1997) and the left atrial appendage (Manning et al 1994, Blackshear and Odell 1996). The LA is usually enlarged in AF patients and is a main source of intracardiac thromboembolism. In recent years, the morbidity and mortality of AF patients have increased, posing a serious challenge to the healthcare system (Hindricks et al 2021).
Two medical imaging tools are commonly used: CT scanning and magnetic resonance imaging (MRI). A contrast injection through a peripheral vein is required before CT scanning of the coronary arteries. The contrast agent fills the vessels and spreads into the LA, the left ventricle, and the left atrial appendage within a short period, making CT images clearer and thus increasing diagnostic efficiency. However, contrast agents are not acceptable for some patients, who may experience allergic symptoms such as nausea, vomiting, and itchy skin. MRI performs well in the diagnosis of cardiomyopathy: it offers good tissue contrast and clear evaluation of cardiac tumors, fatty infiltration, tissue degeneration, cysts, and effusions. A lack of understanding of the human atrial structure leads to poor treatment efficacy in AF cases. Therefore, segmentation of the LA is important in the diagnosis and treatment of AF, because segmentation results can help physicians with further evaluation. The LA has a thin myocardial wall (2-3 mm) (Ho et al 2012), which makes segmentation challenging. In addition, the LA is surrounded by the other cardiac chambers, the descending aorta, and the aortic sinus (figure 1), which have signal intensity similar to that of the LA and easily mislead segmentation algorithms.
Deep learning algorithms are widely used for semantic segmentation of medical images (Gu et al 2020, Yin et al 2022), including the AF problem. Classical supervised semantic segmentation models include FCN (Long et al 2015), SegNet (Badrinarayanan et al 2017), U-Net (Ronneberger et al 2015), UNet++ (Zhou et al 2018), and VNet (Milletari et al 2016). These algorithms have been embedded in computer-aided detection (CAD) and computer-aided diagnosis (CADx) systems to reduce the workload of experts and improve diagnostic accuracy. Uslu et al (2022) proposed LA-Net, which extracts the most useful edge information in MRI for segmentation by using a cross-attention module and an enhanced decoder module, and achieved better left atrial segmentation and edge masking. Wong et al (2022) proposed a novel U-Net network (GCW-UNet) with Gaussian blur and channel weighting to segment left atrial MRI images. However, due to the high cost of manual annotation (Fotinos-Hoyer et al 2010, Tobon-Gomez et al 2015), supervised methods are not applicable to many real problems and cannot fully exploit the potential of deep learning algorithms. Some scholars have therefore shifted their attention to semi-supervised and unsupervised learning to solve problems with a small amount of labeled data (Hua et al 2022, Lee et al 2022, Zhao et al 2022). In unsupervised learning, segmentation results obtained by optimizing parameters on unreliable predictions are not credible, and a false segmentation result may mislead the diagnosis and cause serious consequences. In summary, semi-supervised learning is the better option.
With the application of co-training (Blum and Mitchell 1998) in the field of semi-supervised learning, some problems have been reported. Some scholars directly fused co-training with semi-supervised learning for medical image segmentation. However, they ignored the multi-scale characteristics of medical image segmentation models and focused only on the consistency of the final segmentation results, so high- and low-frequency features interfered with each other, leading to loss of image detail or model collapse. Ultimately, the consistency constraints among multiple models were weakened. In addition, prediction errors were inevitable in the independent training process of each model, and the pseudo-labels generated from them affected training results. Therefore, the utilization of pseudo-labels remains to be improved.
To solve the above problems, we proposed a multi-scale uncertainty entropy-aware co-training framework and realized efficient left atrial segmentation with a small amount of labeled data (figure 2). In this study, we used the 2018 Atrial Segmentation Challenge dataset and the pulmonary vein CT dataset from the Second Hospital of Shanxi Medical University as the experimental subjects. The left atrial MRI images showed considerable differences among patients (figure 1), so we first processed all MRIs with an adaptive window. The differences among CT scans were small, so a conventional window was adopted. A small amount of labeled data easily leads to model overfitting, so we performed data augmentation on the labeled dataset. Our approach involves two independent sub-models (VNet) with the same network structure. However, the two sub-models are trained separately with different data so as to learn different features and generate different decision boundaries. Subsequently, the same unlabeled data are fed to the two independent sub-models, and their prediction results are used as pseudo-labels. Co-training is then achieved by calculating the diversity loss and the consistency multi-scale uncertainty entropy loss of the framework. To avoid the negative influence of unreliable pseudo-labels on convergence, an empirical Gaussian function is proposed to adjust the weights of pseudo-labels, balancing their quality and quantity. The contributions are summarized as follows: In this paper, we proposed an efficient left atrial image segmentation method (MUE-CoT) that works even with a small amount of labeled data. Our method is an end-to-end semi-supervised method.
We proposed a novel consistency multi-scale uncertainty entropy loss as the consistency constraint between the two sub-models and a combined soft constraint as the diversity loss. This constraint mechanism improves the performance of the framework without increasing the computational cost. In addition, we proposed an empirical Gaussian function that adaptively adjusts the pseudo-supervised loss weights to mitigate the limitations of pseudo-labels. In this way, the utilization of high-quality pseudo-labels is improved.
We evaluated our method on the 2018 Atrial Segmentation Challenge dataset and the pulmonary vein CT dataset from the Second Hospital of Shanxi Medical University, and selected several recent advanced models for multiple comparative trials and multi-group ablation experiments. Our method achieved better left atrial segmentation even with a small amount of labeled data.

Related work
Relevant studies are summarized here. First, semi-supervised image segmentation algorithms in the medical imaging field are reviewed, with an emphasis on co-training. Then, V-Net is introduced. Finally, uncertainty estimation methods are presented.

Semi-supervised image segmentation of medical images
Co-training (Blum and Mitchell 1998) is a common semi-supervised method for multiple views and has obtained good results in the field of image segmentation. The main idea of co-training is to train multiple classifiers separately on multiple datasets so that each view can realize correct classification. With the extensive application of deep learning, classifiers have gradually been replaced by deep learning classification models. For example, Peng et al (2020) proposed a new co-training approach in which models were trained with a subset of annotated data and unannotated images were used to exchange information; learning was implemented on adversarial samples so that cross-model diversity was enhanced. A self-integration co-training framework with U-Net as the benchmark model was proposed for COVID-19 (Li et al 2021); consistency regularization was applied to the two synergistic models according to the self-integration strategy, mitigating the adverse effects of noisy pseudo-labels. Wang et al (2021a) designed a generalized Jensen-Shannon divergence as an end-to-end differentiable loss and used an entropy-based uncertainty regularizer to improve on existing co-training methods. Building on supervised and semi-supervised methods, Zheng et al (2022) proposed a Monte Carlo sampling estimation method for calculating loss weights, which were used to accelerate the convergence of the network.
V-Net was proposed by Milletari et al (2016) based on U-Net, a classical network in the field of image segmentation. Similar to U-Net, V-Net uses skip connections to deliver detailed information. V-Net has two advantages. First, it replaces pooling operations with convolution operations, which reduces memory consumption during training. Second, it uses residual blocks and skip connections, so that residual features are learned at each stage of the convolution process. V-Net is a segmentation tool for 3D images. In this study, we adapted the V-Net structure to our co-training framework.
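To make the two design points concrete, a minimal PyTorch sketch of one V-Net-style stage is given below; it illustrates the residual skip within a stage and the strided convolution that replaces pooling, rather than reproducing the authors' exact V-Net configuration, so the channel counts, kernel size, and activation choice are assumptions.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """V-Net-style stage: stacked convolutions whose output is added to the
    stage input (residual learning), followed by downsampling with a strided
    convolution instead of pooling."""
    def __init__(self, channels, n_convs=2):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv3d(channels, channels, 5, padding=2),
                          nn.PReLU(channels))
            for _ in range(n_convs)
        ])
        # downsampling by strided convolution doubles the channels, halves the size
        self.down = nn.Conv3d(channels, channels * 2, kernel_size=2, stride=2)

    def forward(self, x):
        x = self.convs(x) + x        # residual skip within the stage
        return self.down(x)
```

Because the residual branch only has to model the difference from the stage input, stacking several such stages keeps gradients stable at depth, which is why V-Net converges with fewer training tricks than a plain encoder.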
In deep learning optimization, uncertainty estimation is significant for preventing random factors from misleading the optimization direction. Uncertainty estimation methods can be broadly classified into two main categories: Bayesian methods (Xia et al 2020) and non-Bayesian methods (Mehrtash et al 2020, Zheng et al 2022). In Bayesian methods (Stern et al 2004), the parameters of the neural network are first assigned a prior probability distribution, Bayesian inference is then performed with the sample information, and the parameters are finally determined from the obtained posterior probability distribution. Bayesian inference is computationally difficult in practical applications, so scholars have searched for alternatives among non-Bayesian methods (Zheng et al 2022). Mehrtash et al (2020) argued that in addition to a pixel-level confidence measure, a confidence measure was required for capturing segment-level model uncertainty, so they proposed an uncertainty measure for segment-level models. Uncertainty estimation is also an active topic in the semantic segmentation of medical images. Zheng et al (2022) used Monte Carlo sampling to obtain an uncertainty map, which was then used as the loss weight in training. In this study, we proposed a novel uncertainty estimation method, which performed better in our framework.

Datasets and data pre-processing
The 2018 Atrial Segmentation Challenge dataset and the pulmonary vein CT dataset from the Second Hospital of Shanxi Medical University are first introduced. Then, the flow of our data pre-processing is described.

2018 atrial segmentation challenge dataset
The 2018 Atrial Segmentation Challenge dataset was provided by the University of Utah (NIH/NIGMS Center for Integrative Biomedical Computing (CIBC)) and multiple research institutes (Xiong et al 2021). The goal of the challenge was to develop intelligent algorithms for fully automated segmentation of the LA, enabling accurate reconstruction and visualization of the atrial structure. The dataset contains 3D gadolinium-enhanced magnetic resonance imaging (GE-MRI) scans and corresponding ground-truth labels for a total of 154 patients with AF. The volume size of the data is 576 × 576 × 88 and the voxel spacing is 0.625 × 0.625 × 0.625 mm. In this paper, this dataset is referred to as dataset 1.

Pulmonary vein computed tomography angiography dataset
The pulmonary vein computed tomography angiography dataset was provided by the Second Hospital of Shanxi Medical University. All personal information has been de-identified to protect patient privacy. The dataset contains pulmonary vein CT scans of 150 patients, including scans with layer thicknesses of 5 mm and 0.625 mm. We selected the 5 mm CT scans without contrast injection for each patient as the experimental data and invited radiologists from the Second Hospital of Shanxi Medical University to annotate the LA in each CT scan. The volume size of the data is 512 × 512 × (400-600) and the voxel spacing is 0.933 × 0.933 × 0.625 mm. In this paper, this dataset is referred to as dataset 2.

Data pre-processing
In the pre-processing step, the two datasets were pre-processed separately. We divided pre-processing into three sub-steps: CT window adjustment, MRI adaptive windowing, and random data augmentation.
The window technique in CT is a display technique in which tissues of different densities (e.g. −1000 HU for air and 0 HU for water) are measured in Hounsfield units (HU). CT values range from −10 240 to 18 000 HU (da Cruz et al 2022). To facilitate processing, the CT values within the window range need to be scaled to the range 0-255 according to the window width/window level. Previous studies on LA segmentation did not report a recommended CT window range. In this study, multiple sets of commonly used window ranges for left atrial CT (figure 3) were tested, and the window range was finally set to [−160, 240]. The CT windowing results are shown in figure 4(a).
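As a concrete illustration, the clip-and-rescale windowing described above can be sketched with NumPy; `apply_ct_window` is a hypothetical helper, and the defaults WL = 40, WW = 400 correspond to the paper's [−160, 240] range.

```python
import numpy as np

def apply_ct_window(volume_hu, wl=40.0, ww=400.0):
    """Clip a CT volume (in HU) to [WL - WW/2, WL + WW/2], then scale
    linearly to the display range 0-255."""
    lo, hi = wl - ww / 2.0, wl + ww / 2.0
    v = np.clip(volume_hu, lo, hi)
    return ((v - lo) / (hi - lo) * 255.0).astype(np.uint8)
```

Any HU value below the window floor (air, lung) maps to 0 and any value above the ceiling (bone, contrast) maps to 255, concentrating the full display range on soft tissue around the atrium.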
Unlike CT, left atrial MRI intensities varied considerably across patients (figure 1), so we applied an adaptive window to each patient's MRI. MRI values were distributed in the range [0, 255]. The volume size of an MRI slice was assumed to be W × H, and the center point d of the MRI was located at (W/2, H/2). According to our analysis, the position of the LA in the MRI was close to the center, so one-half of the maximum value in the peripheral region s of size σ_c × σ_w × σ_h centered at d was chosen as the window level (WL) (equation (1)), and the window width (WW) was set to 50.

WL = (1/2) · max(MRI[s])    (1)

where max(·) represents the maximum value of the computational matrix; MRI[:, :, :] is the matrix obtained from the slicing operation on the MRI value matrix; and σ_c, σ_w, and σ_h are the three dimensions of the peripheral region s. In this study, we set σ = 10. The MRI adaptive window results are shown in figure 4(b). In this study, to simulate cases with missing labeled data, we set several labeled-data ratios: 5%, 10%, and 20%.
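A minimal NumPy sketch of the adaptive window-level rule above: WL is taken as half the maximum intensity inside a σ-sized cube around the volume centre, then the same clip-and-rescale windowing is applied. The exact slicing convention for the region s is an assumption.

```python
import numpy as np

def adaptive_mri_window(mri, sigma=10, ww=50):
    """Per-patient MRI window: WL = max(centre region)/2, WW fixed at 50,
    then rescale [WL - WW/2, WL + WW/2] -> [0, 255]."""
    centre = [dim // 2 for dim in mri.shape]
    region = mri[tuple(slice(max(c - sigma // 2, 0), c + sigma // 2)
                       for c in centre)]
    wl = region.max() / 2.0
    lo, hi = wl - ww / 2.0, wl + ww / 2.0
    v = np.clip(mri, lo, hi)
    return (v - lo) / (hi - lo) * 255.0
```

Because WL is derived from each patient's own central intensities, the bright atrial region lands in roughly the same display range across patients even when their raw intensity distributions differ.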
In the field of image segmentation, data augmentation is a common image operation (Chen et al 2022, Zhang et al 2022a) and has been used as a regularization tool to prevent neural networks from overfitting (Kukačka et al 2017). Traditional augmentation methods include rotation, translation (Engstrom et al 2017), adding noise to images (Jin et al 2015), and zooming in and out. Inspired by a previous report (da Cruz et al 2022), we performed real-time data augmentation directly on the training dataset; that is, augmentation is applied on the fly when the original dataset is read. Compared with traditional offline augmentation, this approach reduces storage consumption and reading time. In this study, we used four random-factor-driven augmentation operations: random horizontal or vertical mirror flip, random white noise addition, Gaussian blur with a random Gaussian kernel order between 2 and 6, and image scaling between 50% and 120%.
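The on-the-fly augmentation pipeline can be sketched as follows; the flip, noise, and scale operations follow the text, while the per-operation probability of 0.5, the noise standard deviation, and the nearest-neighbour rescale are illustrative choices, and the Gaussian blur is omitted here since it needs a convolution kernel.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_flip(img):
    # random horizontal or vertical mirror flip
    return np.flip(img, axis=int(rng.integers(0, 2)))

def random_noise(img, sigma=2.0):
    # additive white (Gaussian) noise
    return img + rng.normal(0.0, sigma, img.shape)

def random_scale(img, lo=0.5, hi=1.2):
    # nearest-neighbour rescale by a factor drawn from [lo, hi] (50%-120%)
    f = rng.uniform(lo, hi)
    h, w = img.shape
    rows = np.clip((np.arange(int(h * f)) / f).astype(int), 0, h - 1)
    cols = np.clip((np.arange(int(w * f)) / f).astype(int), 0, w - 1)
    return img[np.ix_(rows, cols)]

def augment(img):
    # applied in real time, each transform fires independently as the
    # sample is read from disk (no augmented copies are stored)
    for op in (random_flip, random_noise, random_scale):
        if rng.random() < 0.5:
            img = op(img)
    return img
```

Because every read draws fresh random factors, the two sub-models see different augmented views of the same labeled slice, which also helps keep their decision boundaries diverse.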

Methods
After data pre-processing, the uncertainty entropy-aware co-training framework is constructed, as shown in figure 5. The co-training framework is a dual-view arrangement in which two sub-models work collaboratively. The two sub-models are independent of each other, so they can learn and make decisions from different perspectives. To reduce false positive results, we constrained the co-training framework with two losses: the global-edge diversity loss and the consistent multi-scale uncertainty entropy loss. In this section, we first reformulate the task, then introduce the global-edge loss and the diversity loss, with emphasis on the consistent multi-scale uncertainty entropy loss, and finally define the overall optimization objective of the co-training framework.
The framework includes two independent sub-models based on VNet, and the models are trained under three loss functions: the supervised loss L_sup, the pseudo-supervised loss L_u, and the consistent multi-scale uncertainty entropy loss L_con. Both L_sup and L_u are computed with the proposed global-edge loss. L_sup measures the difference between the prediction result and the ground-truth segmentation mask, whereas L_u measures the difference between the prediction result and the pseudo-label generated by the other model. The overall loss L_total of the whole framework is expressed as a weighted sum of the three losses.

Problem formulation
We formalized the left atrial image segmentation problem on the given datasets: the labeled dataset D_l = {(x_i^l, y_i^l)} and the unlabeled dataset D_u = {x_k^u}, where x_i^l is a left atrial image, y_i^l is its ground-truth segmentation mask, and i and k ∈ [1, n] are image indexes. Different perspectives serve as complementary knowledge to enrich feature learning in the whole framework. The learning procedure is as follows. Two identically structured models f(·) are trained on D_l to obtain the model parameters θ_1 and θ_2, respectively. The final goal is to make the predictions of the parameterized models f(·; θ_1) and f(·; θ_2) on D_u as close as possible to the corresponding ground-truth segmentation masks.
Lemma 1. (Smooth Transition (Yu and Chen 2020)) A smooth switching function, smooth-step, is proposed for the smooth switching of communication topologies in group control. Smooth-step is a sigmoid-like function with gradual growth and decay properties, which make the switching process smoother. In addition, the smooth-step function eliminates the gradient jump at the switching point: near the switching point its derivative tends to zero, reducing abrupt changes and discontinuities of the gradient. With the left edge of the function set to 0 and the right edge set to 1, smooth-step is a polynomial S(x) whose order is determined by a positive integer Q (for Q = 1, the common form is S(x) = 3x² − 2x³). To facilitate calculation, we rescaled smooth-step so that it applies to an arbitrary smoothing range. The adjusted smooth-step is S′(t) = S(κ(t)), where t_1 and t_2 delimit the smoothed range and κ(t) is calculated as

κ(t) = min(max((t − t_1)/(t_2 − t_1), 0), 1)

where max(·,·) selects the maximum of its arguments and min(·,·) selects the minimum.
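Lemma 1 can be sketched in a few lines of Python, assuming the common Q = 1 polynomial 3x² − 2x³ and the paper's epoch range [5, 15] for the edge coefficient schedule.

```python
def smooth_step(x, q=1):
    """Polynomial smooth-step on [0, 1]; q=1 gives 3x^2 - 2x^3,
    any other q falls back to the q=2 variant 6x^5 - 15x^4 + 10x^3."""
    x = min(max(x, 0.0), 1.0)
    if q == 1:
        return 3 * x ** 2 - 2 * x ** 3
    return 6 * x ** 5 - 15 * x ** 4 + 10 * x ** 3

def edge_coeff(t, t_start=5, t_end=15):
    """Edge reference coefficient beta(t): kappa clamps the epoch into the
    transition range, then smooth-step maps it to [0, 1] with zero slope
    at both edges (no gradient jump at the switching points)."""
    k = min(max((t - t_start) / (t_end - t_start), 0.0), 1.0)
    return smooth_step(k)
```

The zero derivative at both edges is what distinguishes this from a linear ramp: the edge loss is switched on without a sudden kink in the total loss surface.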

Global-edge loss
In this study, we proposed a global-edge loss L_c, which consists of two parts: a global loss L_Global and an edge loss L_Edge (equations (5) and (6)). To avoid imposing strict hard constraints on the optimization problem, the prediction x is encouraged, in a soft way, to be as close as possible to the reference x̂:

L_Global = (1/(r·c)) · ‖x − x̂‖_F    (5)

L_Edge = β · (1/(r·c)) · ‖Edge(x) − Edge(x̂)‖_{2,1}    (6)

where r and c respectively denote the total numbers of rows and columns of pixels in an image, namely its height and width; x and x̂ are respectively the comparison image and the reference image; ‖·‖_F is the Frobenius norm; ‖·‖_{2,1} is the L21 norm; β is the edge reference coefficient; and Edge(·) is an image boundary function. Medical image boundaries are weak edges, so we chose the Canny operator, which is more sensitive to weak edges, as the computational strategy of Edge(·). Edge(·) operates as follows. First, the Canny operator is applied to the input image for edge detection; it extracts the edge information in the image and generates a binarized edge image. The gradient of the binarized edge image may exhibit discontinuous jumps at the edges, which can make gradient calculation unstable. Therefore, mean filtering is adopted to smooth the binarized edge image and reduce the gradient discontinuity.
Edge(·) is calculated as

Edge(x) = f_mean(Canny(x))

where f_mean(·) is the mean filter function and Canny(·) is the edge detection function. It is worth noting that when the number of training iterations is small, L_Edge may hinder the convergence of the model because the boundaries of the prediction results are not yet credible. Therefore, the edge reference coefficient β is introduced in L_Edge, and a larger β is required as the iterations increase. To avoid tedious parameter selection, we used the smooth transition function S′(t) (Lemma 1) of the iteration number t to relate β to the number of training iterations. S′(t) provides a gradual mechanism that progressively increases the edge reference coefficient, so that the model can better exploit boundary features during training. S′(t) is a nonlinear smooth transition function; compared with a linear transition, it alleviates the problem of gradient discontinuity, so that the model learns boundary features more consistently and generates continuous prediction results. Different switching effects and smoother switching processes can be realized by adjusting the parameters of S′(t). The edge reference coefficient follows β(t) = S′(t), where t_start represents the epoch at the beginning of the smooth transition and t_end the epoch at its end. Based on empirical judgment with a single model f(·; θ_t), the epoch range for the smooth transition of β is set to t ∈ [5, 15]. Thus, the global-edge loss L_c is defined as

L_c = L_Global + L_Edge

If all models learn the same features and adopt the same parameters, the combination of their outputs is no better than the prediction of a single model. To ensure diversity among multiple models and the correctness of each model's predictions, we proposed a supervised loss and a pseudo-supervised loss, respectively. The supervised loss encourages the prediction for x_i^l to approach its corresponding y_i^l for a single model and is defined as L_sup. The supervised loss for the tth model is expressed as

L_sup^t = L_c(f(x_i^l; θ_t), y_i^l)    (9)

where f(x_i^l; θ_t) is the prediction of the tth model. The pseudo-supervised loss encourages mutual adversarial learning between f(·; θ_1) and f(·; θ_2) and further strengthens the parameters of each model. The pseudo-supervised loss is defined as L_u.
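A rough NumPy sketch of the global-edge loss structure follows; `edge_map` substitutes a simple gradient-magnitude edge detector for the Canny operator (which would need OpenCV), and the β schedule from Lemma 1 is left out, so treat this as an illustration of the loss composition only.

```python
import numpy as np

def edge_map(img):
    """Boundary function stand-in: binarized gradient-magnitude edges
    smoothed by a 3x3 mean filter (the paper uses Canny + mean filtering)."""
    gy, gx = np.gradient(img.astype(float))
    e = (np.hypot(gx, gy) > 0).astype(float)       # binarized edge image
    pad = np.pad(e, 1, mode="edge")
    # 3x3 mean filter to soften the hard 0/1 edge transitions
    return sum(pad[i:i + e.shape[0], j:j + e.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def global_edge_loss(pred, ref, beta=1.0):
    """L_c = L_Global + L_Edge: a pixel-normalized Frobenius-norm global
    term plus a beta-weighted L2,1-norm term on the edge maps."""
    r, c = pred.shape
    l_global = np.linalg.norm(pred - ref, ord="fro") / (r * c)
    diff = edge_map(pred) - edge_map(ref)
    l21 = np.sum(np.sqrt(np.sum(diff ** 2, axis=0)))  # L2 per column, summed
    return l_global + beta * l21 / (r * c)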
When x_i^u is fed to both models, the prediction f(x_i^u; θ_2) of the second model is used as a pseudo-label to further supervise the first model, and the same pseudo-supervised strategy is used for the second model. The pseudo-supervised loss for the tth model is expressed as

L_u^t = L_c(f(x_i^u; θ_t), ŷ_i^u)    (10)

where f(x_i^u; θ_t) is the prediction of the tth model and ŷ_i^u is the pseudo-label generated by the other model.

Lemma 2. (Shannon Entropy (Shannon 2001)) It is assumed that the sample space X contains i basic events and is denoted as X = {x_1, x_2, ..., x_i}, where p_i denotes the occurrence probability of the basic event x_i. The corresponding Shannon entropy H(·) of the sample space X is calculated as

H(X) = −K Σ_i p_i log_b p_i

where K is a positive constant, a unit of measure, and b is the base of the log function. In this paper, the log with e as the base is chosen so as to reduce the complexity of the differential calculation, so H(·) is defined with natural logarithms.

Theorem 1. (Uncertainty entropy) Two sets of real numbers are given as D_1 = {z_1, z_2, ..., z_h} and D_2 = {z′_1, z′_2, ..., z′_h}, and an event set is given as D_3 = {z̃_1, z̃_2, ..., z̃_h}. D_3 is the set of uncertainty events for D_1 and D_2, and each element of D_3 is determined by the uncertainty of the corresponding elements in D_1 and D_2. It is assumed that when the values of both z_h and z′_h are 0 or 1, z̃_h is certain; otherwise z̃_h is uncertain:

z̃_h = certain, if z_h ∈ {0, 1} and z′_h ∈ {0, 1}; uncertain, otherwise

D_3 is a binary set, so it is assumed that the probability of a certain event in D_3 is p_c and the probability of an uncertain event is p_un = 1 − p_c. We encouraged the events in D_3 to become deterministic; in other words, we reduced the overall complexity of D_3. Therefore, the uncertainty entropy based on Shannon entropy (Lemma 2) is calculated as

H̃(p_c, p_un) = −(p_c ln p_c + p_un ln p_un)

where H̃(p_c, p_un) indicates the uncertainty entropy of the event set D_3, or equivalently the uncertainty entropy of the two sets of real numbers D_1 and D_2.

Existing methods focused on the consistency of final results, but the consistency of multi-scale features in the pyramid structure was often neglected. Therefore, we proposed a novel consistency multi-scale uncertainty entropy calculation method as a consistency constraint in co-training (figure 8). The calculation method is composed of two parts: the consistency multi-scale uncertainty estimation and the consistency regularization term. To simplify the computation, MCMC sampling is used for feature selection in the uncertainty calculation. Compared with the above methods, our method can improve the consistency constraint performance through uncertainty estimation with multi-scale features at a low computational cost.
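Theorem 1 can be sketched in plain Python; treating a paired value as certain only when both entries are exactly 0 or 1 follows the definition above, while `eps` (to guard log 0) is an implementation detail not in the paper.

```python
import math

def uncertainty_entropy(d1, d2, eps=1e-12):
    """Uncertainty entropy of two equal-length prediction vectors D1, D2.
    An element pair is 'certain' when both values are exactly 0 or 1;
    the result is the Shannon entropy of the certain/uncertain split."""
    assert len(d1) == len(d2)
    certain = sum(1 for z, zp in zip(d1, d2) if z in (0, 1) and zp in (0, 1))
    p_c = certain / len(d1)
    p_un = 1.0 - p_c
    return -(p_c * math.log(p_c + eps) + p_un * math.log(p_un + eps))
```

Note that the entropy is near zero both when almost all pairs are certain and when almost all are uncertain, which is exactly the ambiguity the corrected regularization term later resolves in favour of the certain case.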
First, for the two sub-models, five nodes in their decoders are selected to extract pyramidal multi-scale features, which are upsampled and subjected to MCMC-Gibbs sampling and then used for the consistency multi-scale uncertainty estimation. Second, the prediction results φ_S^1 and φ_S^2 of the two sub-models are used to calculate the consistency regularization term. Finally, the two terms are summed to obtain the consistency multi-scale uncertainty entropy loss.

Consistency multi-scale uncertainty estimation
For pseudo-labels, we performed uncertainty estimation by encouraging multi-scale similarity comparisons. We proposed a pyramid consistency loss to minimize the difference between the prediction results of the two synergistic models at different scales. Given an unlabeled input x^u, φ_s^1 denotes the prediction result of the first network at scale s. It is worth noting that a smaller s implies a higher resolution, and S indicates the total number of scales. In our framework, the two networks f(·; θ_1) and f(·; θ_2) use the 2D-VNet structure, so S is set to 5. For convenience of presentation, we denote the prediction result at scale s of the tth model as φ_s^t, where t denotes the model number. For different values of s, the resolution of φ_s^t differs. Therefore, each φ_s^t is upsampled by the corresponding multiple so that the sampled result has the same resolution as the input image; the result is denoted φ′_s^t. To simplify the calculation, all the pixels of φ′_s^t are sampled with MCMC-Gibbs before the uncertainty entropy is calculated in the next step, and the positions of the sampled pixels are the same across scales and models. The pixels sampled at each scale s are ordered as a one-dimensional vector, where ξ represents the number of pixel points sampled by MCMC-Gibbs. In this study, to balance FLOPs against model accuracy, we set ξ = 0.5 × pixel count.
The consistency uncertainty estimate for scale s is the uncertainty entropy of the two sampled prediction vectors:

u_s = H̃(φ′_s^1, φ′_s^2)

Then the consistency multi-scale uncertainty loss L_unc is expressed as the sum over all scales:

L_unc = Σ_{s=1}^{S} u_s    (16)
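A NumPy sketch of the multi-scale loss, assuming the per-scale predictions have already been upsampled to a common resolution; uniform random sub-sampling of half the pixels stands in for the MCMC-Gibbs selection described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiscale_uncertainty_loss(preds_a, preds_b, sample_frac=0.5, eps=1e-12):
    """Sum of per-scale uncertainty entropies. preds_a/preds_b are lists of
    same-shape probability maps from the two sub-models, one per pyramid
    scale, already upsampled to the input resolution."""
    total = 0.0
    for pa, pb in zip(preds_a, preds_b):
        flat_a, flat_b = pa.ravel(), pb.ravel()
        # sample xi = sample_frac * pixel count positions (same for both models)
        idx = rng.choice(flat_a.size, int(sample_frac * flat_a.size),
                         replace=False)
        a, b = flat_a[idx], flat_b[idx]
        certain = ((a == 0) | (a == 1)) & ((b == 0) | (b == 1))
        p_c = certain.mean()
        p_un = 1.0 - p_c
        total += -(p_c * np.log(p_c + eps) + p_un * np.log(p_un + eps))
    return total
```

Sub-sampling halves the entropy computation per scale while keeping the estimate unbiased, which is the FLOPs/accuracy trade-off the ξ = 0.5 setting aims at.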

Consistency regularization term
Under the constraint of consistency multi-scale uncertainty estimation, f(·; θ_1) and f(·; θ_2) can learn unsupervised knowledge from unlabeled data. However, in the co-training framework, the pseudo-label is generated independently by the other model without additional guidance. These pseudo-labels may be noisy and thus affect the performance of the models. To reduce the adverse effects of noisy pseudo-labels, a consistency uncertainty regularization term based on the consistency multi-scale uncertainty estimation is used to constrain the learning direction of each model. Given an unlabeled input x^u and based on the above constraints, we encouraged the two networks to make deterministic judgments about the prediction results and applied the definition of uncertainty entropy (Theorem 1). For the two prediction results φ_S^1 and φ_S^2, the uncertainty entropy H̃(φ_S^1, φ_S^2) is first calculated as the body of the regularization term. We expect the uncertainty entropy of φ_S^1 and φ_S^2 to be small, which leaves two possibilities for the corresponding event set: either most of the events are certain, or most of the events are uncertain. Therefore, to further encourage the two networks to make deterministic judgments about the predicted results, inspired by previous reports (Zheng and Yang 2021, Luo et al 2022), we corrected the regularization term with a factor m, where p_c represents the occurrence probability of a definite event, namely the ratio of the pixels identified as foreground or background in the prediction results to all pixels. In this study, we encourage both networks to make deterministic judgments about the predicted outcome; thus, setting the probability of a deterministic event to a higher p_c leads to a lower m. Unlike many empirical threshold-setting methods (Pérez-Benito et al 2020, Li et al 2021), this strategy adjusts the threshold without additional manual work or experience. Finally, the consistency uncertainty regularization term is defined as

R = m · H̃(φ_S^1, φ_S^2) + ζ    (18)

where H̃(·,·) represents the calculation of uncertainty entropy and ζ represents the correction-value parameter, set to 0.01 in this paper. The value of the consistency uncertainty regularization term for different p_c and p_un in Theorem 1 is shown in figure 9.

Consistency multi-scale uncertainty entropy loss
Based on the consistency multi-scale uncertainty estimate (equation (16)) and the consistency uncertainty regularization term (equation (18)), the consistency multi-scale uncertainty entropy loss is defined as their combination. The meanings of the above variables are given in sections 4.4.1 and 4.4.2.

Overall optimization objective
The overall training objective of our framework is the weighted sum of the supervised loss L_sup, the pseudo-supervised loss L_u, and the consistency multi-scale uncertainty entropy loss L_con. Based on a previous report (Chen et al 2023), λ1 and λ2 are determined by an empirical Gaussian function with the confidence level as the independent variable. The ablation experiments on the empirical Gaussian function are described in section 5.6.3. It is assumed that at the tth training iteration, λ1 and λ2 follow a dynamic Gaussian function with mean μt and variance σt². The confidence vector of all pseudo-label groups in a batch is formed from the confidence p_b of the bth group of pseudo-labels, where H(·,·) represents the calculation of uncertainty entropy (Thm. 1), and S1_j and S2_j are the prediction results of x_b^u by the two VNet models, namely, a set of pseudo-labels. According to equation (23), the values of λ1 and λ2 lie in the range [0, λ_max], and the maximum value λ_max is set to 1 in this paper. To achieve better generalization, the empirical mean μ̂t and empirical variance σ̂t² are estimated from the confidence and historical prediction results. The historical predictions are then aggregated by momentum m so as to obtain a more stable empirical mean μt and empirical variance σt², where the empirical mean is initialized to 0.5 and the empirical variance to 1.
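Under these definitions, the confidence-dependent weighting and the momentum aggregation of μt and σt² can be sketched as follows. The Gaussian form λ_b = λ_max · exp(−(p_b − μt)² / (2σt²)) is our reading of equation (23) and should be treated as an assumption; the class name and method names are illustrative:

```python
import numpy as np

class GaussianPseudoWeight:
    """Confidence-dependent empirical Gaussian weighting (sketch).

    mu and var are momentum-aggregated estimates of the batch confidence
    mean and variance, initialized to 0.5 and 1 as stated in the paper;
    lam_max = 1 and momentum = 0.99 follow the paper's settings.
    """
    def __init__(self, lam_max=1.0, momentum=0.99):
        self.lam_max = lam_max
        self.m = momentum
        self.mu = 0.5
        self.var = 1.0

    def update(self, conf):
        """Aggregate the empirical batch statistics with momentum m."""
        conf = np.asarray(conf, dtype=float)
        self.mu = self.m * self.mu + (1 - self.m) * conf.mean()
        self.var = self.m * self.var + (1 - self.m) * conf.var()

    def weight(self, conf):
        """Assumed Gaussian weight: lam_max * exp(-(p - mu)^2 / (2 var))."""
        conf = np.asarray(conf, dtype=float)
        return self.lam_max * np.exp(-(conf - self.mu) ** 2 / (2 * self.var))
```

Weights stay in (0, λ_max], peaking when a group's confidence matches the running mean, so early low-quality pseudo-labels never dominate training.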

Evaluation metrics
In previous studies, various indicators were used to measure the performance of medical image segmentation models, including the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JA) (Bray et al 2018), 95% Hausdorff distance (HD95) (Huttenlocher et al 1993), sensitivity (SE), specificity (SP), and accuracy (ACC). For a predicted result S and ground-truth partition G:
DSC = 2|S ∩ G| / (|S| + |G|) indicates the overlap between S and G;
JA = |S ∩ G| / |S ∪ G| indicates the ensemble similarity between S and G;
HD95 is the 95th percentile of the maximum surface distance between S and G;
SE = TP / (TP + FN) indicates the ratio of correctly predicted positives to all actually positive items;
SP = TN / (TN + FP) indicates the ratio of correctly predicted negatives to all actually negative items;
ACC = (TP + TN) / (TP + TN + FP + FN) indicates the ratio of correct predictions to all items.
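The overlap-based metrics above reduce to simple confusion-matrix arithmetic; a minimal sketch follows (HD95 is omitted because it requires a surface-distance computation, e.g. via SciPy's distance transforms; the function name is ours):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute DSC, JA, SE, SP, and ACC from two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)       # true positives
    tn = np.sum(~pred & ~gt)     # true negatives
    fp = np.sum(pred & ~gt)      # false positives
    fn = np.sum(~pred & gt)      # false negatives
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "JA":  tp / (tp + fp + fn),
        "SE":  tp / (tp + fn),
        "SP":  tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }
```

For example, a prediction with one extra foreground pixel over a one-pixel ground truth gives DSC = 2/3 and JA = 1/2.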

Experimental setup
We selected two datasets for model validation, left atrial MRI and CT scans, which contained 60 MRI and 40 CT scans, respectively. These data were acquired from different devices, so we pre-processed all raw CT/MRI data with the uniform window levels and window widths recommended by clinical experts to reduce data discrepancies, following the method described in section 3.3. We compared our method with several supervised and semi-supervised baseline methods. In the supervised methods, all training data are considered labeled. In the semi-supervised methods, a subset of the training data (i.e. 20%, 10%, or 5%) is used as the labeled dataset. In addition, we used the U-Net model trained with the supervised method as the upper bound model and the 2D-VNet model trained with 20%, 10%, and 5% of the training data as the baseline model. We implemented several state-of-the-art semi-supervised medical image segmentation methods and compared them with our method. Finally, to test the influence of certain model settings on the predicted results, we designed multiple groups of ablation experiments.

Details of image segmentation
The original VNet has a three-dimensional structure, so we modified it into a sub-model meeting the requirement of 2D input. First, we adjusted the input and output tensor dimensions to 2D, i.e. (batch size, channels, height, width). Then, the 3D convolution kernels in the original VNet were replaced with 2D convolution kernels of size 3 × 3, and downsampling was performed with 2D max pooling with a stride of 2. Next, the 3D transposed convolutional layers and skip connections in the original VNet were replaced with their 2D counterparts. Finally, in the decoder, the upsampling result of the previous layer was concatenated with the encoder feature map of the corresponding layer.
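The conversion described above amounts to standard 2D encoder/decoder stages with skip connections. A minimal PyTorch sketch of one such pair follows; the block names, normalization choice, and channel widths are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One 2D encoder stage: 3x3 conv, then 2x2 max pooling (stride 2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        f = self.conv(x)            # feature map kept for the skip connection
        return f, self.pool(f)      # (skip features, downsampled features)

class UpBlock(nn.Module):
    """One 2D decoder stage: transposed conv, concat with skip, 3x3 conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                                  # restore resolution
        return self.conv(torch.cat([x, skip], dim=1))   # fuse with encoder map
```

Chaining such blocks reproduces the 2D VNet topology: spatial resolution halves on the way down and is restored on the way up, with encoder features spliced back in at each level.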
Notably, in this study, we used longitudinal slices within the cardiac scope of CT and MRI for two reasons. First, features similar to those of the LA may exist in slices of other cardiac structures (e.g. the left ventricle and right atrium); with longitudinal slices within the heart, these similar features could be better distinguished and the accuracy of left atrial segmentation improved. Second, taking the in-house clinical dataset as an example, the scan range of pulmonary vein CT usually includes the whole lung, and other structures are weakly associated with left atrial features; selecting only longitudinal slices within the heart therefore reduces redundant information and improves computational efficiency. After random pre-processing of the left atrial CT and MRI data (CT scan windowing, MRI adaptive windowing, and data augmentation), the pre-processed data were input into the model as 2D slices. The training algorithm for multi-scale uncertainty entropy-aware co-training is summarized in Algorithm 1 (appendix).
For training, both sub-models used the same hyperparameter settings, which notably did not affect the independence and diversity between the models. In each model, we optimized the overall loss with the SGD optimizer. The batch size was set to 4 and the image size was initialized to 1 × 256 × 256. The number of training epochs was set to 40. We set the initial learning rate to 5 × 10⁻³, decayed by 2 × 10⁻⁵ after each epoch. The momentum was 0.9.
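A sketch of the quoted optimizer settings; the additive schedule is one plausible reading of "decayed by 2 × 10⁻⁵ after each epoch", and the one-layer stand-in model is ours, not the paper's 2D VNet:

```python
import torch

# Hyperparameters quoted in the text.
lr0, decay, epochs = 5e-3, 2e-5, 40

model = torch.nn.Conv2d(1, 2, 3, padding=1)   # stand-in for the 2D VNet
opt = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9)

for epoch in range(epochs):
    # ... one pass over the labeled and unlabeled batches would go here ...
    # Additive decay applied after each epoch, clamped at zero.
    for g in opt.param_groups:
        g["lr"] = max(lr0 - decay * (epoch + 1), 0.0)
```

After 40 epochs this leaves the learning rate at 5 × 10⁻³ − 40 · 2 × 10⁻⁵ = 4.2 × 10⁻³, i.e. the decay is gentle relative to the initial rate.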
The experimental environment was as follows: PyTorch 1.9.0+cu111 deep learning framework (Paszke et al 2019), CPU (12th-generation Intel i9-12900KF), GPU (NVIDIA GeForce RTX 3090 with 24 GB video memory), and 64 GB system memory. For all the above methods and our method (MUE-CoT), we used the same optimization parameters, learning rate decay, and data pre-processing pipeline. In addition, some of the above models used U-Net as the base network, so we set the feature scale of these models to 2; in other words, the number of filters was set to [32, 64, 128, 256, 512]. In the subsequent experiments, we evaluated the prediction results with the six metrics introduced in section 5.1, with DSC and HD95 as the main evaluation metrics. To avoid incidental factors, we calculated the mean and standard deviation of three runs with different random seeds. Notably, all experimental results in this study were re-implemented by us.

Results on dataset 1
We first evaluated our framework on dataset 1 (table 1). U-Net trained with 100% of the labeled training dataset was used as the upper bound method. When 5%, 10%, and 20% labeled data were used, all semi-supervised learning methods performed better than the baseline method, indicating the significant advantage of semi-supervised learning when labeled data are scarce. Among the other semi-supervised learning methods, DAN and GVS showed better performance at smaller labeled ratios, suggesting that GAN-based methods have some advantages with less labeled data. However, our method achieved better DSC and HD95 on the tested dataset. In particular, our method outperformed the co-training method and the consistency method: its average DSC was 8.07%-11.84% higher than that of the co-training method and 3.88%-6.2% higher than that of the consistency method, and its average HD95 was 5.12-6.46 mm lower than that of the co-training method and 1.55-5.53 mm lower than that of the consistency method. Compared with the uncertainty-aware method, our method also showed improved performance (average DSC improved by 2.35%-5.54% and average HD95 improved by 1.1-3.32 mm). As the labeled ratio increased, average DSC increased sharply and average HD95 decreased. At labeled ratios of 20% and 10%, our method was competitive with the upper bound method: its average DSC was 0.67% and 1.14% higher, and its average HD95 was 0.97 and 1.32 mm lower, than that of the upper bound method. We then performed paired t-tests on the DSC of the test samples (table 1). All P-values were less than 0.05, indicating a statistically significant advantage of our framework over the other semi-supervised methods.
In addition, some examples from the test dataset are shown in figure 10. Compared with other methods, our method gave a profile closer to the ground truth.

Results on dataset 2
To validate the segmentation performance of our model on CT images, we evaluated our framework on dataset 2 (table 2). The LA has CT values similar to those of neighboring structures, which increases the difficulty of segmenting the LA on CT and decreased the segmentation performance of all models on dataset 2. In general, our framework handled this CT-value similarity well. In the semi-supervised segmentation task, less labeled data implies a harder segmentation problem, so we analyzed the results with a labeled ratio of 5% to further demonstrate the superiority of our framework. Among the other semi-supervised methods, the most competitive were GVS, UA-MT, and ConfKD: their average DSC was 14.92%-17.54% higher and their average HD95 was 48.45-51.01 mm lower than that of the baseline method. However, our method achieved the best segmentation results, with an average DSC 0.39%-3.01% higher and an average HD95 1.62-4.18 mm lower than those of GVS, UA-MT, and ConfKD. When the labeled ratio was 20%, our framework was competitive with the upper bound method: its average DSC was 0.49% higher and its average HD95 was 3.71 mm lower than that of the baseline method. With the DSC of the test samples, paired t-tests were performed (table 2). All P-values were less than 0.05, indicating a statistically significant advantage of our framework over the other semi-supervised methods in CT segmentation.
Figure 11 shows some results on the test dataset. Compared with other methods, our method gave a profile closer to the ground truth. For some images that were difficult to segment, the supervised model could not even segment the target region, whereas our method provided satisfactory segmentation results.

Analysis of method stability
In this section, we analyzed the predictive stability of MUE-CoT under different ratios of labeled data. The experiments were performed on dataset 1 and the quantitative results were obtained based on table 1. First, we visualized the distributions of DSC and HD95 for MUE-CoT under different ratios of labeled data (figure 12). As the ratio increased, DSC values increased sharply and HD95 values decreased. The distributions of DSC and HD95 of our model were concentrated under all three ratios of labeled data, demonstrating that the segmentation results of MUE-CoT were stable across ratios.
Next, we visualized the DSC and HD95 errors across epochs for MUE-CoT with different ratios of labeled data. Figure 13 gives the DSC and HD95 obtained on the validation dataset with the upper bound model (dashed line) and the model trained from scratch (solid line) under different ratios of labeled data (5%, 10%, and 20%). MUE-CoT converged rapidly at the early stage of training and became stable by epoch 20, indicating that DSC and HD95 of MUE-CoT were stable during the training process. Moreover, at the 20% ratio, the results of MUE-CoT were close to those of the fully supervised upper bound model, indicating that our framework is competitive.

Ablation analysis
In this section, we demonstrate the effectiveness of the critical structures and hyperparameters in the framework by designing multiple sets of ablation experiments.

Effect of global-edge loss
To investigate the utilization of the global-edge loss (i.e. L_c), we explored different kinds of loss functions on a training dataset with 5% labeled data (table 3). We chose the cross-entropy loss L_BCE, the Dice similarity coefficient loss L_DSC, and our L_Global and L_Edge as diversity losses for the experiments and analysis. The experimental results showed that using L_Global and L_Edge as diversity losses was more competitive. Compared with L_BCE, L_DSC improved the segmentation results, but they were still poor. When L_Global and L_Edge were used as the diversity loss, the resulting DSC and HD95 showed a significant advantage.

Effect of key structures
In this study, co-training and the consistency multi-scale uncertainty entropy loss were two key structures. To verify their effect, we designed two sets of ablation experiments on a training dataset with 5% labeled data.
In the first set of experiments, we used two 2D-VNets for independent training and finally merged their training results.

Effect of empirical Gaussian function
In this section, we analyzed the pseudo-label weight coefficients λ1 and λ2 with various functions. The experiments were conducted on a training dataset with 5% labeled data, with the other parameters of MUE-CoT unchanged. The experimental results are shown in table 5. The linear, square root, and square functions were less effective for the pseudo-label weight coefficients, whereas the Gaussian function provided better results. In the visualization of the candidate functions (figure 14), the empirical Gaussian function is the more reasonable choice for the pseudo-label weight coefficients because it assigns different weight coefficients according to the confidence level of the pseudo-labels and has stronger generalization ability.

Effect of momentum selection
With dataset 1 as the experimental dataset, we set the momentum to 0.8, 0.9, and 0.99 for comparison and analyzed the segmentation accuracy (table 6) and the convergence trend of the loss under different momentum settings (figure 15). The results showed that different momentum values had a small effect on the final segmentation accuracy but a large effect on the convergence speed: a smaller momentum corresponded to faster convergence and lower accuracy, whereas a larger momentum gave slower convergence and higher accuracy. In our work, segmentation accuracy took priority over training speed, so we set the momentum to 0.99 in the other experiments.

Analysis of complexity
In this section, we analyzed the complexity of MUE-CoT and summarized the FLOPs and parameters of all the methods in section 5.4. We selected dataset 1 as the experimental dataset and set the labeled ratio of the data to 5% and the image size to 1 × 256 × 256 (table 7). With two trainable models, our framework requires more parameters and FLOPs than frameworks with a single model, but these requirements are reasonable considering the segmentation accuracy. Moreover, our method achieved state-of-the-art performance compared to methods with the same (DCT) or more parameters (GVS and ConfKD), proving the effectiveness of our framework.
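Parameter counts like those summarized in table 7 can be read directly off a model; FLOPs additionally require a profiler (e.g. fvcore or thop). A minimal sketch with an illustrative two-layer stand-in model (not the paper's architecture):

```python
import torch

# Illustrative stand-in: two 3x3 conv layers, 1 -> 32 -> 64 channels.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, 3, padding=1),    # 32*(1*3*3) + 32  = 320 params
    torch.nn.Conv2d(32, 64, 3, padding=1))   # 64*(32*3*3) + 64 = 18496 params

# Total trainable parameters.
n_params = sum(p.numel() for p in model.parameters())
```

For a co-training framework like MUE-CoT, the reported figure would be the sum over both sub-models, which is why the two-model design roughly doubles the parameter count of a single-model baseline.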

Discussion and conclusion
Accurate segmentation of the LA plays a crucial role in studying human atrial structure and modeling AF pathology. In this study, we proposed a multi-scale uncertainty entropy-aware co-training framework for left atrial segmentation and achieved higher performance even with a small ratio of labeled data. We demonstrated its effectiveness with a public benchmark dataset and a realistic clinical dataset. To simulate the case of a small ratio of labeled data, we set three ratios of labeled data in the training dataset (20%, 10%, and 5%). In this study, we used a co-training structure, which promoted adversarial training among models. The co-training structure could produce diverse prediction results and decision boundaries, thus providing richer segmentation knowledge. Importantly, we proposed a diversity loss and a consistency multi-scale uncertainty entropy loss to constrain the co-training framework. We defined the diversity loss as a soft constraint in order to accelerate model convergence. In addition, we proposed a novel multi-scale uncertainty entropy calculation method and a consistency regularization term to measure the consistency among various results. To solve the problem of the unstable quality of pseudo-labels in the pre-training period, we proposed a confidence-dependent empirical Gaussian function to weight the pseudo-supervised loss.
In our experiments, we compared the proposed model with several semi-supervised methods used in the medical field and determined a baseline model and an upper bound model to further measure the performance of the various models (tables 1 and 2). The experimental results showed that our method significantly improved the segmentation performance compared to the baseline model and achieved better performance than several other semi-supervised methods (GVS (Zhang et al 2022b), FixMatch (Sohn et al 2020), URPC (Luo et al 2022), and ConfKD (Liu et al 2023)). The stability of the model is also important, so we verified it in two aspects: the distributions of the metrics of MUE-CoT under different ratios of labeled data (figure 12) and the metric errors across epochs during training (figure 13). In addition, we designed multiple sets of ablation experiments to verify the structure and parameter selection in the framework, including the effects of the global-edge loss (table 3), of co-training and the consistency multi-scale uncertainty entropy loss (table 4), of the empirical Gaussian function (table 5), and of the momentum selection (table 6). These ablation experiments further demonstrated the effectiveness of MUE-CoT. Subsequently, we analyzed the model in terms of the number of parameters and FLOPs. MUE-CoT uses two models for collaborative training, so it shows no advantage over single-model methods in this respect; however, we believe the improvement in segmentation accuracy is worth the modest cost in parameters and FLOPs (table 7). Semi-supervised learning is important in the medical field because annotated images are complex, expensive, and scarce. In the future, we will combine semi-supervised learning with domain adaptation to further investigate diversity strategies for cases with few annotated samples.
Four common semi-supervised image segmentation methods for medical images are self-training methods (Zheng et al 2020, Wang et al 2021b, Hao et al 2022), generative adversarial network-based methods (Han et al 2020, Chen et al 2021, Xun et al 2022), regularization methods (Wang et al 2022), and co-training methods (Yang et al 2017, Peng et al 2020, Xia et al 2020, Li et al 2021, Wang et al 2021a, Zheng et al 2022). Compared with other semi-supervised learning methods, self-training methods do not require any assumption. Wang et al (2021b) proposed a few-shot learning framework based on the combination of semi-supervised learning and self-training and found that the performance of the model mainly depended on the selection and evolution of high-quality pseudo-labels in cascaded learning. A self-training teacher-student model with a self-attentive U-Net and an automatic label grader was proposed (Hao et al 2022). Zheng et al (2020) used representative annotated slices to train the base model and carried out self-training with the pseudo-labels automatically generated by the model. Generative adversarial network-based methods apply generative adversarial networks (GAN) (Goodfellow et al 2020) to medical image segmentation. Han et al (2020) proposed a GAN-based semi-supervised segmentation network (BUS-GAN) to solve the segmentation problem of breast ultrasound (BUS) images. Chen et al (2021) proposed an adaptive

Figure 2. General flow chart of the proposed method in this study.

Figure 1. MRI and CT slices of the left atrium in four patients (the first column is an MRI slice and the second column is a CT slice).

Figure 6. Smoothing function β(t) at t_start = 10.

4.3. Diversity loss in co-training
Like standard multi-view learning methods (Peng et al 2020), multiple models were trained in a collaborative manner. After training, their outputs were combined to derive prediction results for new images. We encouraged adversarial learning between the prediction results of the different models in co-training (figure 7(a)) and present a brief demonstration of the selection strategy for the final results (figures 7(b) and (c)). In figure 7(b), the red dashed line is the decision boundary of Model 1, and the red triangles within it represent the confidence votes of Model 1. Similarly, the blue dashed line and the blue triangles within it represent the decision boundary and confidence votes of Model 2, respectively. When the same area contains confidence votes from both models, this area is selected and marked with a purple triangle. Finally, the final decision boundary (figure 7(c)) is determined based on the areas marked with purple triangles in figure 7(b).
where ŷ_t,i is the prediction result of the tth model for x_i^l; y_i^l represents the ground-truth segmentation mask corresponding to x_i^l; and L_c denotes the global-edge loss.
where ŷ_t,i is the prediction result of the tth model for x_i^u; ỹ_i represents the pseudo-label corresponding to x_i^u; and L_c denotes the global-edge loss.

Figure 7. Illustration of the result selection strategy based on the consistency constraint for co-training. The dashed lines represent the decision boundaries of the models; the black dots and arrows indicate adversarial examples; the triangles of different colors represent the corresponding confidence votes of the models.
Consistency multi-scale uncertainty entropy-aware module
The pyramid network structures common in medical image segmentation often stitch together features of different spatial resolutions after upsampling operations (Yu et al 2019, Wang et al 2020), but the stitched features still suffer from loss of image detail or model collapse due to interference between high- and low-frequency features. Previous studies on consistency uncertainty loss (Xia et al 2020, Zheng et al 2022)
λ1 and λ2 are the pseudo-label weight coefficients. Considering the reliability of pseudo-labeled data in the pre-training period, we weighted the pseudo-supervised losses so as to avoid training dominated by pseudo-supervised learning and thus negative influences on the training results. Confidence threshold weighting (confidence thresholding) is a dominant way to exploit pseudo-labels (Sohn et al 2020, Zhang et al 2021), and a high confidence threshold ensures the quality of pseudo-labels. However, this measure has drawbacks. For example, FixMatch (Sohn et al 2020) filters pseudo-labels at a high threshold during training and eventually discards about 71% of uncertain but correct pseudo-labels. Therefore, the choice of the pseudo-labels' confidence threshold determines the trade-off between the number and quality of pseudo-labels and directly affects label utilization.

Figure 9. Trend diagram of the consistency uncertainty regularization term.

Figure 10. Examples of segmentation results for dataset 1 under different labeled ratios.

Figure 11. Examples of segmentation results for dataset 2 under different labeled ratios.

Figure 12. Statistics of DSC and HD95 for dataset 1 under different ratios of labeled data.

Figure 13. Error graphs of DSC and HD95 for different stages with different ratios of labeled data: (a) error graph for 5% labeled data, (b) error graph for 10% labeled data, and (c) error graph for 20% labeled data.

Figure 14. Examples of multiple variable functions.

Table 1. Performance comparison of different semi-supervised methods on dataset 1 under different labeled ratios.

Table 2. Performance comparison of different semi-supervised methods on dataset 2 under different labeled ratios.

Table 3. Results of ablation experiments for global-edge loss.

Table 4. Results of ablation experiments for co-training and consistency multi-scale uncertainty entropy loss.

Table 5. Results of ablation experiments for multiple variable functions.

Table 6. Results of ablation experiments with different momentum values.