Semi-supervised segmentation of abdominal organs and liver tumor: uncertainty rectified curriculum labeling meets X-fuse

Pengju Lyu; Wenjian Liu; Tingyi Lin; Jie Zhang; Yao Liu; Cheng Wang; Jianjun Zhu

doi:10.1088/2632-2153/ad4c38

1. Introduction

Hepatic malignancies constitute an immense global disease burden as both primary liver cancers such as hepatocellular carcinoma, alongside hematogenous metastatic lesions infiltrating the liver parenchyma, exact a toll on both morbidity and mortality [1]. The complexity inherent in liver anatomy, coupled with the diverse manifestations of liver tumors, mandates the employment of sophisticated and precisely targeted intervention strategies. Image-guided percutaneous puncture interventions have gained increasing prominence in the diagnostic and therapeutic realms of liver tumor management, affording minimally invasive access to the targeted anatomical site [2]. The paramount considerations for ensuring the safety and efficacy of such interventions lie in the accurate segmentation and differentiation of patient anatomy, which constitutes fundamental prerequisites for procedural planning and intra-operative guidance. This imperative underscores the explicit delineation of three cardinal anatomical domains [3]: firstly, the avoidance organs (AVO), characterized by the skeletal structures, mandate meticulous circumvention by instrument or needle trajectories en route to the designated targets; secondly, the liver parenchyma itself emerges as the principal organs at risk (OAR) during the phases of tumor access and treatment delivery, underscoring the essentiality of precise delineation to prevent unintended harm; and lastly, the gross target volume (GTV)-liver tumors prescribes target boundaries and volume, thereby serving as the crucial locus for diagnostic sampling or therapeutic ablation. Given the implications of these delineations, the formulation of an advanced and efficient segmentation framework emerges as a research imperative.

In recent years, the intersection of medical imaging and deep learning (DL) has ushered in a new era in medical image analysis. DL models, especially convolution neural networks (CNNs or convnets) and transformer-based U-shape models [4, 5] have demonstrated remarkable capabilities in extracting multi-organ and tumor features, which paves the way for automated targets segmentation and therefore alleviates the workload on clinicians and radiologists. As the pursuit of enhancing models for more effective feature extraction has been a focal point of medical image analysis, the predominant focus has remained on features derived solely from the spatial domain despite the evolution from localized perspectives utilizing CNN to a more comprehensive global view employing transformers. It is conceivable that the exploration of features originating in alternative domains may yield additional valuable information. Contemporary research in the realm of natural image recognition and language processing has witnessed progress in the manipulation of features within the frequency domain [6–8]. Thus, we aim to develop more comprehensive models extracting information from diverse domains by incorporating a frequency encoder within the context of 3D medical image analysis.

Although DL methods have significantly elevated segmentation accuracy, their efficacy still relies heavily on extensive labeled datasets. However, acquiring such datasets, especially in the medical domain, is labor-intensive and often cost-prohibitive. Meanwhile, most studies on liver tumors are limited to small public cohort sizes with performances below the clinically accepted levels. Semi-supervised learning (SSL) and data augmentation have emerged as transformative paradigms for data-efficient learning. SSL addresses the scarcity of labeled data by leveraging the potential of both annotated and unlabeled medical images [9]. Prevailing semi-supervised techniques include pseudo-labeling, where model predictions on ambiguous inputs generate tentative labels for recursive learning [10–12]; consistency regularization, which enforces output invariance to input or model perturbations as implicit constraints [13, 14]; and combinations of the two that uncover additional performance gains. For instance, MC-Net [15] utilizes cross pseudo supervision (CPS) [16] to regularize consistency between two predictions from two different decoders, achieving 90.34% Dice Similarity Coefficient (DSC) in left atrial segmentation. On the other hand, data augmentation artificially expands the size of labeled datasets through transformations such as random crops, flips, rotations and elastic transformation, generating synthetic yet realistic training examples from existing ones [17]. Popular augmentation strategies leverage generative adversarial networks (GANs) to produce additional labeled-like examples through learned feature distributions [18]. Such augmented data significantly bolsters model robustness and generalization abilities. We strive to harness the capabilities of both methodologies, enabling them to handle a wide array of variations in anatomy and pathology.

In this study, motivated by the merits of cascaded architectures for anatomical parsing [19–21], we embrace a triphasic segmentation paradigm customized for hierarchical delineation of AVO (bones), OAR (liver), and GTV (liver tumors). Our phase-specific models, following the seminal works [5, 22, 23] on multi-organ and lesion segmentation, are all rooted in the MetaFormer [24] architecture while varying in token mixer design. To accommodate the multifarious tumor morphologies and textural heterogeneity of hepatic parenchyma, we propose a novel dual-stream segmentation network termed X-Fuse, where individual encoder and decoder streams are interconnected through feature fusion across a shared latent representation. This model extends the conventional spatial feature extractor (scale-aware modulation (SAM) and multi-head self-attention (MSA) modules as spatial token mixers) by integrating a frequency encoder (global filter (GF) modules as spectral token mixers). This composite encoding aims to glean comprehensive multi-domain representations to advance feature learning. Further, to harness unlabeled data for improved model generalization, similar to MC-Net [15], we employ CPS [16] with mutual consistency regularization upon the X-Fuse topology for SSL. To determine trustworthy label assignments, pseudo-labeling relies crucially on thresholding, whereby only predictions whose confidence exceeds a specified level are selected as pseudo-labeled samples. We adopt FlexMatch [25] for curriculum pseudo labeling (CPL) that stimulates representation diversity. Inspired by [26], which uses prediction variance regularization for domain adaptation, we propose the Jensen–Shannon (JS) divergence between the X-Fuse decoding streams as an epistemic uncertainty estimate for CPS rectification.
Diverging from the prevailing trend of employing progressively sophisticated methodologies such as GAN [18], and diffusion model [27], each of which demands substantial training datasets, our approach amalgamates two training-free data augmentation strategies for expanding tumor data representations.

The main contributions are summarized as follows:

We present a streamlined triphasic cascade segmentation framework for efficient delineation of bones, liver, and liver tumors. A key innovation is the introduction of the X-Fuse model, which integrates spatial and frequency feature extraction, enhancing the precision of target segmentation.
Within the X-Fuse framework, to optimize SSL, we propose the JS divergence as co-training uncertainty for pseudo-supervision rectification with the integration of a CPL strategy dynamically selecting pseudo labels.
We formulate a simple yet effective data augmentation technique—TumorAug that involves a straightforward copy-paste and expertise-knowledge synthesis to account for tumor phenotypic variability.
Comprehensive benchmarking evaluations on an in-house dataset demonstrate superior performance of our proposed methods under both full supervision and semi-supervision scenarios, validating the efficacy of our unified frameworks for efficient deployment for liver tumor segmentation.

2. Related works

2.1. Abdominal organs and liver tumor segmentation

Recent advancements in abdominal organs and liver tumor segmentation have been catalyzed by the advent of challenging benchmark datasets, e.g. AMOS [28], FLARE [29] and LiTS [30], coupled with cutting-edge DL approaches [31]. Architectures exemplified by the CNN-based U-Net [32] have become ubiquitous, as their contracting and expanding paths allow multi-scale capture of both local details and global context critical for medical image segmentation. Volumetric extensions like 3D U-Net [33, 34] are particularly apt for learning intricate anatomical relationships and spatial interactions within volumetric imaging data such as computed tomography (CT) and magnetic resonance imaging (MRI). Moreover, attention mechanisms have gained traction [35]. MA-Net [36] and RAU-Net [37] incorporate attention modules within the bottleneck and skip connection, respectively, aiming to capture inter-dependencies between channel and spatial dimensions, thereby enhancing the delineation of ambiguous boundaries in liver tumor segmentation. In particular, the self-attention mechanism adapted from vision transformer (ViT) architectures [38], by enabling interactions between all spatial locations in the input image, can focus adaptively on organs and lesion regions by extracting informative features from across the entire 3D volume to capture irregular boundaries and complex morphology not confined to local regions [39–41]. Though scarce in number, recent pioneering works have kindled explorations into SSL for liver tumor segmentation, recognizing the potential of harnessing both scarce annotated and copious unannotated data. For instance, Xia et al [42] introduce a novel uncertainty-aware multi-view co-training framework that utilizes multi-viewpoint consistency and asymmetrical 3D kernels to enhance feature diversity, together with an uncertainty-weighted label fusion method that employs Bayesian DL to assess prediction reliability.
Our work seeks to advance along similar trajectories as an augmented semi-supervised approach for this undertaking.

2.2. Spectral neural networks

The utilization of frequency analysis has been a longstanding practice in traditional digital image processing methodologies [43]. Recently, there has been a discernible integration of frequency-based operations, exemplified by the Fourier transform, into the domain of deep neural networks (DNNs) [7, 44]. This integration serves varied objectives across two key dimensions: 1) expediting the training process and facilitating the optimization of DNNs [45, 46]; 2) acquiring informative representations of global receptive fields [8, 47]; prior works such as [44, 48] also uncover that disparate priorities are assigned to distinct frequency components during the training process, resulting in varied contributions to the robustness of features, all of which serve as inspiration for our pursuit of enhancing the generalization capacity of DNNs through the direct modulation of frequency components.

2.3. SSL in medical image analysis

In the realm of medical imaging, the availability of unlabeled data often surpasses that of labeled counterparts. This asymmetry presents a significant challenge, prompting the exploration of semi-supervised paradigms to harness the potential of unlabeled data [49, 50]. The prevailing methodologies emerge in two categories: pseudo-labeling-based [10, 51, 52] and consistency regularization-based [13, 14, 53]. Pseudo-labeling has evolved into a classical algorithm in SSL for its simplicity and effectiveness, where it generates pseudo labels for unannotated images and pairs them with annotated image for model training, while consistency regularization is based on the concept that predictions from models should be invariant to data perturbations and geometric transformations. The amalgamation of these two categories into a hybrid framework has gained increasing popularity [54, 55]. Luo et al [56] employ a dual-task network that simultaneously predicts a pixel-wise segmentation map and a geometry-aware level set representation which is transformed into an approximate segmentation map through a differentiable task transform layer, between which the dual-task consistency regularization is ensured via CPS [16]. Li et al [57] propose a self-ensembled co-training framework for automatic COVID lesion segmentation, using collaborative models that teach via reciprocal pseudo-labeling of unlabeled data and self-ensembling for consistency regularization to mitigate noisy labels. Akin to these methods, our proposed methodology entails synthesizing dynamic thresholding and JS divergence as a decoder variance proxy to selectively modulate the loss contribution from pseudo-labeled data.

3. Method

In this section, we delineate our proposed methodology for achieving enhanced segmentation performance on the target anatomical structures and lesions. In section 3.1, we first introduce the cascade workflow with an emphasis on the structural composition of X-Fuse. Then we describe the details of the subsequent application of dynamic thresholding and variance regulation for pseudo-supervision rectification for SSL incorporating additional unlabeled data in section 3.2. Finally, a customized tumor augmentation strategy is presented in section 3.3 to improve model generalization capacity by providing enhanced sample variability during the training process.

3.1. Model architecture

To facilitate accurate liver tumor intervention through efficient segmentation of target anatomical structures, we devise a sophisticated segmentation framework structured as a multi-phase cascade network with progressively refined focus on sub-regions of interest (ROI) and target structures in each phase, as seen in figure 1. The initial phase of our approach employs a lightweight model known as FasterNet [58], specifically engineered to segment the skin and locate the overall body region from the background. Subsequently in Phase_2, critical skeletal structures including ribs, vertebrae, and the hepatic region are delimited through our proposed X-Fuse network. Constructed upon liver localization, the final phase harnesses X-Fuse to procure fine-grained segmentation of hepatic lesions. Importantly, the output from each phase serves as the ROI for the subsequent phase, facilitating a coherent and systematic progression through the segmentation process.

Figure 1. An overview of the three-phase cascade network. The initial phase aims for rapid localization of the body region through coarse binary segmentation of the skin. Subsequently, the second phase concentrates on the segmentation of both bone and liver structures. In the final phase, the network delineates liver tumors within the cropped liver region.

In this study, we propose a novel model denoted as X-Fuse for target organs and lesion segmentation. X-Fuse is the amalgamation of two distinct models whose foundational structure is aligned with the overarching design principles of the U-Net [32] framework, characterized by a four-stage hierarchical encoder incorporating stacked MetaFormer modules, complemented by skip connections that relay encoded representations directly to corresponding decoder blocks. The integration of these two individual models occurs at a bottleneck block, culminating in a robust fused representation of bifurcated feature extraction (spatial and spectral branch). The overall model's topology assumes an 'X' form, hence the name X-Fuse as depicted in figure 2.

Figure 2. Illustration of the architectural framework employed for different phases, grounded in the MetaFormer paradigm. In Phase_1, FasterNet is implemented utilizing partial convolution (PConv) for expeditious processing. Transitioning to X-Fuse for Phase_2 and 3, the model integrates scale-aware modulation (SAM) and multi-head self-attention (MSA) as token mixer blocks within the spatial encoder branch, complemented by global filter (GF) in the spectral encoder branch.

3.1.1. MetaFormer

The MetaFormer architecture [24], known for its versatility and efficacy, has found application across a spectrum of current models [5, 59]. It encompasses two distinct residual sub-blocks. Specifically, the initial sub-block primarily incorporates a token mixer module, exemplified by self-attention [60], multi-layer perceptron (MLP) [61], and convolutional [62] layers, and the second sub-block features a channel MLP with two linear layers, activated by GELU functions, together facilitating both inter- and intra-token communication. The configuration of token mixers varies across phases. To obtain coarse body location in the initial phase, we opt for FasterNet employing partial convolution [58], which enhances computational efficiency by applying filters only to the first quarter of input channels, leaving the remaining ones unaltered. During Phases_2 and 3, the entirety of the X-Fuse structure is employed with two parallel branches concurrently engaging in feature extraction within distinct domains. The spatial encoder branch comprises scale-aware and self-attentive modulation for multiscale feature aggregation. This hybrid design is derived from our previous research, which showed exceptional performance in the MICCAI FLARE23 challenge [41], while the spectral branch modulates the spectral feature in the frequency domain through the utilization of a learnable GF. Each token mixer is expounded upon in the following sections, illustrated with an input $X\in \mathbb{R} ^{C^{^{\prime}}\times H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}}$ where C' denotes the number of channels, and $H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}$ denotes the feature resolution.

3.1.2. Patch embedding

The stem block, namely the patch embedding layer, partitions the input image into a series of discrete non-overlapping patches and projects linearly into high-dimensional token vectors by applying a strided convolution operation employing a kernel size of 3, stride of 2, and padding of 1. It extracts elementary visual features encapsulated within each patch region while simultaneously reducing the spatial dimensions to mitigate computational requirements when fed into MetaFormer-based models.
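As a rough illustration of the spatial reduction performed by the stem, the sketch below evaluates the standard output-size formula for a strided convolution with the stated kernel size, stride, and padding. The function names and the input patch size of 96×96×64 are hypothetical and chosen only for the example.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length along one axis of a strided convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_shape(shape, kernel=3, stride=2, padding=1):
    """Apply the per-axis formula to every spatial axis of a volume."""
    return tuple(conv_output_size(s, kernel, stride, padding) for s in shape)


# Hypothetical input patch of 96 x 96 x 64 voxels (not a value from the paper).
print(stem_output_shape((96, 96, 64)))  # (48, 48, 32)
```

With these settings each spatial axis is roughly halved, which is where the reduction in computational cost for the subsequent MetaFormer stages comes from.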

3.1.3. Spatial encoding branch

We leverage scale-aware [63] and self-attentive [38] modulation for spatial token mixing. Self-attentive modulation, commonly known as MSA, assumes a significant role in capturing contextual interdependencies among the discrete embedding tokens within a sequence (flattened input X') as expressed by equation (1). An MSA sublayer consists of n self-attention heads in parallel. Each head independently maps each patch token into query Q, key K, and value V vectors through a linear projection W_QKV, and calculates attention scores $W_\mathrm{attn}$ through a dot-product similarity between the queries and keys scaled by $1/\sqrt{D_h}$, signifying the importance of each patch token in relation to others. These attention weights are then applied to the corresponding V vectors, producing a weighted sum that effectively modulates the influence of each token. This dynamic modulation allows the model to adaptively prioritize relevant spatial information. Outputs from each head are concatenated and combined through a linear transformation $W_\mathrm{msa}$, providing a comprehensive representation that integrates various perspectives:

$\begin{align} & X^{^{\prime}} = \mathrm{flatten}\left( X\xrightarrow{\mathrm{permute}}X^{\tau}\right),\qquad\qquad\qquad\qquad\qquad\ \ X^{\,{\tau}}\in \mathbb{R}^{H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}\times C^{^{\prime}}}\nonumber\\ & \left[Q,K,V\right] = X^{^{\prime}}W_{QKV},\qquad\qquad\qquad\qquad\qquad\qquad\quad\ \ \ X^{^{\prime}}\in \mathbb{R} ^{N\times C^{^{\prime}}},\,\, N = H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}\nonumber\\ & W_\mathrm{attn} = \mathrm{softmax}\left(QK^T/\sqrt{D_h}\right),\qquad\qquad\qquad\qquad\quad\,\,\, W_{QKV}\in \mathbb{R} ^{C^{^{\prime}}\times 3D_h}, \ D_h = D/n\nonumber\\ &\mathrm{SA}\left(X^{^{\prime}}\right) = W_\mathrm{attn}V, \qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad\!\! W_\mathrm{attn}\in \mathbb{R} ^{N\times N}\nonumber\\ & \mathrm{MSA}\left(X^{^{\prime}}\right) = \left[\mathrm{SA}_1\left( X^{^{\prime}}\right); \mathrm{SA}_2\left(X^{^{\prime}}\right);\ldots ;\mathrm{SA}_n\left( X^{^{\prime}}\right)\right] W_\mathrm{msa}. \;\quad\! W_\mathrm{msa}\in \mathbb{R}^{n\cdot D_h\times C^{^{\prime}}}\nonumber\end{align} \tag{ 1 }$
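The steps of equation (1) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are random, biases and normalization layers are omitted, and the function names `softmax` and `msa` as well as the toy tensor sizes are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def msa(x, w_qkv, w_msa):
    """x: (C, H, W, D); w_qkv: (n, C, 3*Dh), one projection per head; w_msa: (n*Dh, C)."""
    c = x.shape[0]
    tokens = np.moveaxis(x, 0, -1).reshape(-1, c)      # permute + flatten: X' in R^{N x C}
    heads = []
    for w in w_qkv:
        q, k, v = np.split(tokens @ w, 3, axis=-1)     # each of shape (N, Dh)
        d_h = q.shape[-1]
        attn = softmax(q @ k.T / np.sqrt(d_h))         # W_attn in R^{N x N}
        heads.append(attn @ v)                         # SA(X')
    out = np.concatenate(heads, axis=-1) @ w_msa       # [SA_1; ...; SA_n] W_msa
    return out, attn                                   # attn is returned for the last head only


rng = np.random.default_rng(0)
C, H, W, D, n, Dh = 8, 4, 4, 2, 2, 4                   # toy sizes, assumptions for the demo
x = rng.normal(size=(C, H, W, D))
out, attn = msa(x, rng.normal(size=(n, C, 3 * Dh)), rng.normal(size=(n * Dh, C)))
print(out.shape, attn.shape)  # (32, 8) (32, 32)
```

The N×N attention matrix makes the quadratic cost in the number of tokens explicit, which is consistent with reserving MSA for the lower-resolution stages.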

The SAM instead modulates V with the succession of the multi-head mixed convolution (MHMC) and the scale-aware aggregation (SAA) module as defined in equation (2), designed to facilitate the incorporation of multi-scale contexts and adaptive modulation of tokens. The MHMC uniformly divides the input channels into multiple heads $j\in \{1,2,\ldots ,n\}$ (in our implementation n = 3) with $i\in \{1,2,\ldots,C^{^{\prime}}/n\}$ channels in each head. Each single-channel feature map $X_{j}^{i}$ undergoes a depth-wise separable convolution $\mathrm{DWConv}_{k_j\times k_j\times k_j}$ with a head-specific kernel size $k_j\in \{3,5,7\}$, enabling the discernment of a diverse spectrum of granularity features in an adaptive manner, which are aggregated in the following SAA module. Specifically, the SAA forms $C^{^{\prime}}/n$ discrete shuffled groups, with each group selectively sampling a single channel from each of the partitioned heads produced by the MHMC. This shuffling integrates the features across granularities. Subsequently, the aggregated groups are processed by point-wise convolution $\mathrm{Conv}_{1\times 1\times 1}$ in inverted bottleneck form (channel expansion = 2) before (plus InstanceNorm (IN) and ReLU activation) and after channel-wise concatenation, eventually serving as the weight modulator of the value V by Hadamard product $\odot$, in contrast to the matrix product in self-attention. In order to mitigate the computational load and model the progression from local to global dependency integration, we implement SAM modules in the initial two stages and MSA in the final two stages:

$\begin{align} {\begin{array}{*{20}{l}} \kern5pt H_{j}^{i} = \mathrm{DWConv}_{k_j\times k_j\times k_j}\left(X_{j}^{i}\right),\qquad\qquad\qquad\qquad\,\,\,\,\,\,\,\,\,\,\,\, X_{j}^{i},H_{j}^{i}\in \mathbb{R}^{1\times H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}} \rightarrow \mathrm{MHMC}\\ \begin{array}{l} G_i = \mathrm{ReLU}\left(\mathrm{IN}\left( \mathrm{Conv}_{1\times 1\times 1}\left(\left[H_{1}^{i}, H_{2}^{i},\ldots, H_{n}^{i}\right]\right)\right)\right),\quad G_i\in \mathbb{R}^{2n\times H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}}\\ W_\mathrm{sam} = \mathrm{Conv}_{1\times 1\times 1}\left(\left[G_1, G_2,\ldots ,G_{C^{^{\prime}}/n}\right]\right),\qquad\quad\,\,\,\,\,\,\, W_\mathrm{sam}\in \mathbb{R}^{C^{^{\prime}}\times H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}} \end{array}\bigg\} \rightarrow \mathrm{SAA}\\ \kern5pt \mathrm{SAM}\left(X\right) = W_\mathrm{sam} \odot V. \qquad\qquad\qquad\qquad\quad\quad\quad\,\,\,\,\,\,\,\, V\in \mathbb{R}^{C^{^{\prime}}\times H^{^{\prime}}\times W^{^{\prime}}\times D^{^{\prime}}}\end{array}}.\end{align} \tag{ 2 }$
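The regrouping step of the SAA, in which each group collects one channel from every MHMC head, can be illustrated with a reshape and an axis swap. The sketch assumes that the MHMC output stores its channels head by head; this memory layout and the function name `saa_shuffle` are assumptions made for the example and are not stated in the text.

```python
import numpy as np


def saa_shuffle(h, n_heads):
    """Regroup MHMC output h of shape (C, H, W, D) into (C/n, n, H, W, D).

    Assumes channels are ordered head by head, so group i gathers the
    i-th channel of every head: [H_1^i, H_2^i, ..., H_n^i].
    """
    c = h.shape[0]
    per_head = c // n_heads
    by_head = h.reshape(n_heads, per_head, *h.shape[1:])  # indexed as [j, i, ...]
    return np.swapaxes(by_head, 0, 1)                      # indexed as [i, j, ...]


# Toy example with C' = 6 channels and n = 3 heads; each channel is filled with its own index.
h = np.arange(6).reshape(6, 1, 1, 1) * np.ones((6, 2, 2, 2))
groups = saa_shuffle(h, n_heads=3)
print(groups.shape)  # (2, 3, 2, 2, 2)
```

Each of the C'/n groups would then pass through the point-wise convolution of equation (2), so every group mixes features produced by all three kernel sizes.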

3.1.4. Spectral encoding branch

We apply deep frequency filtering [8] to achieve explicit spectral feature modulation. Our approach is underpinned by the application of the discrete Fourier transform (DFT) and its inverse counterpart IDFT, serving as a conduit facilitating the transition between the spatial domain and the frequency domain representation of digital images. For 3D volume feature, its 3D-DFT $X_F\left( x,y,z \right)$ and 3D-IDFT $X\left( h,w,d \right)$ can be defined as:

$\begin{equation} \begin{aligned} & X_F\left( x,y,z \right) = \sum_{h = 0}^{H-1}{\sum_{w = 0}^{W-1}{\sum_{d = 0}^{D-1}{X\left(h,w,d\right)e^{-j2\pi \left( x\dfrac{h}{H}+y\dfrac{w}{W}+z\dfrac{d}{D} \right)}}}},\\ & X\left( h,w,d \right) = \dfrac{1}{HWD}\sum_{x = 0}^{H-1}\sum_{y = 0}^{W-1}\sum_{z = 0}^{D-1}X_F\left(x,y,z\right)e^{j2\pi \left(x\dfrac{h}{H}+y\dfrac{w}{W}+z\dfrac{d}{D}\right)}. \end{aligned}\end{equation} \tag{ 3 }$

In practice, we apply the fast Fourier transform (FFT) [64] algorithm for efficient DFT computation. By virtue of the conjugate symmetry of the DFT of real-valued inputs, $X_F\in \mathbb{C} ^{C^{^{\prime}}\times H^{^{\prime}}\times W^{^{\prime}}\times ( \dfrac{D^{^{\prime}}}{2}+1 )}$ needs to retain only half of the spectrum along the last spatial dimension while preserving the entirety of the information. Specifically, the input features X are first transformed via the FFT from the spatial domain into the frequency domain, in which each component of the resulting Fourier spectrum has an intrinsically global view, thus providing inherent advantages for modeling global context and long-range interactions. The frequency representation allows direct manipulation and filtering of the spectral characteristics of the features. Using a learnable gating mechanism, the GFs W_gf modulate the frequencies by element-wise multiplication $\odot$ to adaptively adjust the feature responses across both local and global scales. Emphasizing low frequencies enhances large-scale patterns, whereas emphasizing high frequencies sharpens local details. The resultant filtered frequency representations are subsequently transformed back into the spatial domain through the inverse fast Fourier transform (IFFT), which de-mixes the updated global frequency representations to recover the local token features. The GF layer is formulated in equation (4). According to the convolution theorem [64], GF is equivalent to a depthwise global circular convolution, yet it enables explicit tuning of spectral properties within the feature map while circumventing the computational inefficiency of conventional large convolutional kernels.

$\begin{align} \mathrm{GF}\left(X\right) = \mathrm{IFFT}\left(W_{gf}\odot \mathrm{FFT}\left( X \right)\right). \qquad\quad\ W_{gf}\in\mathbb{C}^{C^{^{\prime}}\times H^{^{\prime}}\times W^{^{\prime}}\times\left(\dfrac{D^{^{\prime}}}{2}+1\right)}.\end{align} \tag{ 4 }$
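A minimal NumPy sketch of equation (4) is given below. It uses the real-input FFT so that the last axis of the spectrum has D'/2 + 1 entries, matching the stated shape of W_gf. The function name `global_filter`, the toy tensor sizes, and the all-ones filter are assumptions for illustration; in the model the filter would be a learnable parameter.

```python
import numpy as np


def global_filter(x, w_gf):
    """GF(X) = IFFT(W_gf * FFT(X)) for real x of shape (C, H, W, D).

    w_gf is complex with shape (C, H, W, D // 2 + 1).
    """
    spatial = x.shape[1:]
    x_f = np.fft.rfftn(x, axes=(1, 2, 3))                  # half spectrum on the last axis
    return np.fft.irfftn(w_gf * x_f, s=spatial, axes=(1, 2, 3))


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 6, 8))                          # toy feature map (assumption)
identity = np.ones((2, 4, 6, 8 // 2 + 1), dtype=complex)   # all-pass filter
print(np.fft.rfftn(x, axes=(1, 2, 3)).shape)               # (2, 4, 6, 5)
print(np.allclose(global_filter(x, identity), x))          # True
```

Because multiplication in the frequency domain corresponds to circular convolution in the spatial domain, setting `w_gf` to the spectrum of a shifted unit impulse reproduces a circular shift of the input, which is a quick way to check the depthwise global circular convolution interpretation.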

3.1.5. Decoder and skip connection

The encoded feature representations extracted from the two-branch encoders, fused at the bottleneck by simple channel-wise concatenation, are input into a residual block on skip connections comprising two sequential 3×3×3 convolutional layers with instance normalization and ReLU activation before concatenation with the progressively upsampled decoder features. Such integrated representations are then passed through another residual combo block of the same structure. For explicit architectural perturbation, both trilinear upsampling and transpose convolutions are employed for decoding in two branches which yield final ensembled inference output, and also ensure consistency regularization during SSL.

3.2. SSL with CPS

Leveraging the abundance of unlabeled data, we employ a semi-supervised methodology rooted in CPS to attain resilient segmentation of organs and lesions within CT scans. Given a set of N labeled CT scans $D^l = \{(x_i^l, y_i^l)\}_{i = 1}^N$ and a more extensive collection of M unlabeled CT scans $D^u = \{(x_i^u)\}_{i = 1}^M$ , where $x_i \in \mathbb{R}^{H\times W \times D}$ represents the volume of the input data and $y_i^l \in {\{0, 1\}}^{H \times W \times D}$ denotes the one-hot label for each category, we train our segmentation model on the union of the two subsets $D = \,\,D^l\cup D^u$ . Our primary optimization objectives involve minimizing both the supervised loss $L_\mathrm{seg}$ between prediction and ground truth on D^l and the mutual consistency loss $L_\mathrm{cps}$ on D^l and D^u . The total loss $L_\mathrm{total}$ can be calculated as follows:

$\begin{equation} L_\mathrm{total} = L_\mathrm{seg} + \alpha L_\mathrm{cps}.\end{equation} \tag{ 5 }$

Here, α represents the adaptive weighting coefficient that governs the balance between $L_\mathrm{seg}$ and $L_\mathrm{cps}$ , the value of which increases with the progression of training and is determined as $1-e^{-t}$ in our experimental setup, where t corresponds to the current iteration. The individual loss components refer to equations (6) and (7).

$\begin{align} L_\mathrm{seg} & = \sum\limits_{\left(x_i^l, y_i^l\right) \in D^l}L_\mathrm{seg}^A\left(f_A\left(x_i^l\right), y_i^l\right) + L_\mathrm{seg}^B\left(f_B\left(x_i^l\right), y_i^l\right),\nonumber\\ L_\mathrm{seg}^{j} & = \sum\limits_{\left(x_i^l, y_i^l\right) \in D^l}L_\mathrm{Dice} + L_\mathrm{CE} \nonumber\\ & = \sum\limits_{\left(x_i^l, y_i^l\right) \in D^l}\Bigg[\left(1-\dfrac{2\sum f_j\left(x_i^l\right) \times y_i^l}{\sum f_j\left(x_i^l\right) + \sum y_i^l}\right) + \left(-y_i^l\mathrm{log}f_j\left(x_i^l\right)\right)\Bigg],\end{align} \tag{ 6 }$

$\begin{align} L_\mathrm{cps} & = \sum\limits_{\left(x_i\right) \in D}L_\mathrm{cps}^A\left(f_A\left(x_i\right); f_B\left(x_i\right)\right) + L_\mathrm{cps}^B\left(f_B\left(x_i\right); f_A\left(x_i\right)\right)\nonumber\\ & = -\sum\limits_{\left(x_i\right) \in D} PL_A \mathrm{log} \left(f_B\left(x_i\right)\right) + PL_B\mathrm{log} \left(f_A\left(x_i\right)\right),\end{align} \tag{ 7 }$

where $f_A(\cdot )$ and $f_B(\cdot )$ are the softmax-normalized outputs of two respective decoders j, representing segmentation confidence map, and PL_A and PL_B are their converted hard one-hot pseudo labels. In accordance with [16], $L_\mathrm{seg}$ consists of Dice and cross entropy (CE) loss [65], while the CE loss is specifically employed for the computation of the $L_\mathrm{cps}$ .
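Equations (5) to (7) can be sketched for a single volume as follows. This is an illustrative NumPy version under stated assumptions: the cross entropy is averaged over voxels rather than summed, a small constant `EPS` is added for numerical stability, and all function names are hypothetical. Inputs `p_a` and `p_b` are softmax outputs of shape (classes, H, W, D), and `y` is a one-hot label of the same shape.

```python
import numpy as np

EPS = 1e-8  # numerical-stability constant (assumption, not from the paper)


def dice_loss(p, y):
    """1 - 2 * sum(p * y) / (sum(p) + sum(y))."""
    return 1.0 - 2.0 * np.sum(p * y) / (np.sum(p) + np.sum(y) + EPS)


def ce_loss(p, y):
    """Cross entropy with the class axis first, averaged over voxels."""
    return float(np.mean(-np.sum(y * np.log(p + EPS), axis=0)))


def one_hot_argmax(p):
    """Convert a confidence map into a hard one-hot pseudo label."""
    c = p.shape[0]
    return np.moveaxis(np.eye(c)[p.argmax(axis=0)], -1, 0)


def seg_loss(p_a, p_b, y):
    """Supervised loss of both decoders: Dice + CE against the ground truth."""
    return sum(dice_loss(p, y) + ce_loss(p, y) for p in (p_a, p_b))


def cps_loss(p_a, p_b):
    """Cross pseudo supervision: each decoder is supervised by the other's hard label."""
    pl_a, pl_b = one_hot_argmax(p_a), one_hot_argmax(p_b)
    return ce_loss(p_b, pl_a) + ce_loss(p_a, pl_b)


def total_loss(p_a, p_b, y, t):
    """L_total = L_seg + alpha * L_cps with alpha = 1 - exp(-t)."""
    alpha = 1.0 - np.exp(-t)
    return seg_loss(p_a, p_b, y) + alpha * cps_loss(p_a, p_b)
```

For unlabeled volumes only the `cps_loss` term applies, since no ground truth `y` is available for the supervised part.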

3.2.1. Pseudo label selection with adaptive threshold

We employ an adaptable threshold on the confidence map $f_j(x_i)$ for reliable pseudo-label selection such that only predictions surpassing the threshold contribute to the pseudo-supervision loss:

$\begin{align} L_\mathrm{cpl}^{j} = \unicode{x1D7D9} \left(\mathrm{max}\left(f\left(x_i\right)\right) > \gamma\right) L_\mathrm{cps}^{j},\end{align} \tag{ 8 }$

where $\unicode{x1D7D9}( \cdot \gt \gamma)$ is the indicator function for confidence-based thresholding and $\gamma \in (0, 1)$ is the threshold.

We follow the principle [25] that lower initial thresholds gather more pseudo labels to accelerate early learning, whereas, as model uncertainty on unlabeled data decreases during training, thresholds become more stringent to filter out noisy assignments:

$\begin{equation} \gamma_t = \begin{cases} \dfrac{1}{C}, &t = 0, \\ \left(1-\lambda\right)\gamma_{t-1} + \lambda \dfrac{1}{B}\sum_{b = 1}^{B}\max\left(f\left(x_i\right)\right), &\mathrm{else}.\\ \end{cases}\end{equation} \tag{ 9 }$

We set the initial value of γ as $\frac{1}{C}$ , where C denotes the number of classes and B is the batch size. Rather than re-evaluating all unlabeled data per iteration to determine precise confidence, the exponential moving average enables efficient aggregation of confidence scores across iterations to estimate global trends by assigning decaying weights $1-\lambda$ to prior observations, with $\lambda = \dfrac{t}{t_\mathrm{max}}$ and $t_\mathrm{max}$ the total number of iterations.
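The threshold schedule of equation (9) can be sketched in a few lines. The function name `update_threshold` is hypothetical, and the sketch assumes that one scalar maximum confidence per sample in the batch is already available; how that scalar is reduced over voxels is not specified here.

```python
def update_threshold(gamma_prev, batch_max_conf, t, t_max, num_classes):
    """Adaptive threshold: gamma_0 = 1/C, then an EMA with lambda = t / t_max."""
    if t == 0:
        return 1.0 / num_classes
    lam = t / t_max
    batch_mean = sum(batch_max_conf) / len(batch_max_conf)
    return (1.0 - lam) * gamma_prev + lam * batch_mean


# Hypothetical confidences for a batch of two samples over four iterations.
gamma = None
for t, confs in enumerate([[0.5, 0.4], [0.8, 0.6], [0.9, 0.8], [0.95, 0.9]]):
    gamma = update_threshold(gamma, confs, t, t_max=4, num_classes=3)
    print(t, round(gamma, 4))
```

Since λ grows linearly with the iteration count, the threshold tracks the current batch confidence more closely as training proceeds, and at t = t_max it equals the batch mean.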

3.2.2. Uncertainty guided mutual supervision

Motivated by prior work [26] that utilizes Kullback–Leibler variance as a measure of prediction uncertainty between teacher-student models, this work incorporates the JS divergence into the CPS architecture to quantify and penalize inconsistent predictions from the perturbed X-Fuse decoders. The JS divergence provides a mathematically principled, information-theoretic approach to quantifying the dissimilarity between two probability distributions.

$\begin{align} D_{JS}\left(f_A\left(x_i\right);f_B\left(x_i\right)\right) = \sum_{\left(x_i \in D\right)}f_A\left(x_i\right)\log\dfrac{f_A\left(x_i\right)}{M\left(x_i\right)} + f_B\left(x_i\right)\log\dfrac{f_B\left(x_i\right)}{M\left(x_i\right)},\end{align} \tag{ 10 }$

where $M(x_i) = \dfrac{1}{2}(f_A(x_i) + f_B(x_i))$ denotes the average of the two predicted distributions.
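As a hedged illustration, the per-voxel summand of equation (10) can be computed as follows. The function name `js_divergence` is an assumption of this sketch. The code follows equation (10) as written; the textbook Jensen–Shannon divergence additionally places a factor of 1/2 on each of the two terms, which bounds it by $\log 2$.

```python
import math


def js_divergence(p_a, p_b):
    """Summand of equation (10) for one voxel.

    p_a, p_b: class-probability vectors predicted by decoders A and B.
    Returns KL(p_a || M) + KL(p_b || M) with M = (p_a + p_b) / 2.
    """
    total = 0.0
    for a, b in zip(p_a, p_b):
        m = 0.5 * (a + b)
        # Terms with zero probability contribute nothing (0 * log 0 := 0).
        if a > 0.0:
            total += a * math.log(a / m)
        if b > 0.0:
            total += b * math.log(b / m)
    return total


agree = js_divergence([0.7, 0.3], [0.7, 0.3])      # identical predictions
disagree = js_divergence([1.0, 0.0], [0.0, 1.0])   # fully contradictory predictions
```

Identical predictions give zero, while fully contradictory one-hot predictions give the maximum of this form, $2\log 2$.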

Specifically, the consistency loss is activated to promote prediction agreement where the JS divergence is low, indicating a high degree of confidence in the consistency between the perturbed decoders. Conversely, where the JS divergence is high, reflecting significant disagreement between the decoder predictions, the consistency loss is largely suppressed. Hence, we employ the JS divergence as a guiding metric for pseudo supervision to rectify the consistency loss, which is expressed as follows:

$\begin{align} L_\mathrm{Rcps}^j\left(f_A\left(x_i\right); f_B\left(x_i\right)\right) = \sum_{\left(x_i\right) \in D}\left[\left(1-D_{JS}\left(f_A\left(x_i\right); f_B\left(x_i\right)\right)\right) \times L_\mathrm{cpl}^j + \frac{1}{-\log \left( D_{JS}\left( f_A\left( x_i \right) ;f_B\left( x_i \right) \right) \right) +\varepsilon}\right].\end{align} \tag{ 11 }$

Additionally, the last term in equation (11) is incorporated into the loss to prevent scenarios where the divergence remains persistently high; ε is a small constant that prevents the denominator from being zero.

Overall, as depicted in figure 3, our framework integrates a dynamic confidence threshold for pseudo-label selection with additional uncertainty rectification, which yields more consistent model outputs. Together, both elements attenuate the impact of noisy labels, thereby enhancing the learning proficiency of our SSL method. The details of the training procedure are given in algorithm 1.
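The behaviour of the two terms in equation (11) can be seen in a small per-voxel sketch. The function name `rectified_cps`, the default `eps`, and in particular the clamping of the divergence to $[d_\mathrm{min}, 1]$ are assumptions of this illustration and are not stated in the text; the clamp merely keeps the logarithm defined and the weight $1 - D_{JS}$ non-negative.

```python
import math


def rectified_cps(l_cpl, d_js, eps=1e-6, d_min=1e-8):
    """Per-voxel summand of equation (11).

    l_cpl: thresholded cross pseudo supervision loss of the voxel, equation (8).
    d_js:  divergence between the two decoder predictions, equation (10).
    The clamp to [d_min, 1] is an assumption of this sketch.
    """
    d = min(max(d_js, d_min), 1.0)
    weight = 1.0 - d                           # down-weights uncertain voxels
    penalty = 1.0 / (-math.log(d) + eps)       # grows as the decoders disagree
    return weight * l_cpl + penalty


consistent = rectified_cps(l_cpl=0.5, d_js=0.01)   # pseudo-supervision dominates
divergent = rectified_cps(l_cpl=0.5, d_js=0.9)     # penalty term dominates
```

For a low divergence the summand stays close to the pseudo-supervision loss, whereas for a high divergence the pseudo-supervision is almost switched off and the second term dominates.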

**Figure 3.** Illustration of our framework for semi-supervised medical image segmentation. Outputs generated from X-Fuse are subjected to cross pseudo supervision rectified by uncertainty estimated by the Jensen–Shannon (JS) divergence, scheduled with a curriculum pseudo labeling (CPL) strategy.

Algorithm 1. Training procedure of the proposed method, e.g. for liver tumor segmentation.
Input: $\mathcal{D}_l = \{(x_i^{l}, y_i^{l})\}_{i = 1}^N$ , $\mathcal{D}_u = \{x_i^{u}\}_{i = 1}^M$
Output: model parameters θ
for $iter$ in $[1,\, total\_iters]$ do
    Sample batches $B_l \subset \mathcal{D}_l$ and $B_u \subset \mathcal{D}_u$ , with $B = B_l \cup B_u$ ;
    Compute the prediction outputs of the dual decoders: ${p}_A = f_A (x_i; \theta)$ and ${p}_B = f_B (x_i; \theta)$ ;
    Compute the supervised loss according to equation (6): $L_\mathrm{seg} = \dfrac{1}{4|B_l|} \sum\limits_{(x_i^{l}, y_i^{l}) \in B_l}[L_\mathrm{seg}^A(f_A(x_i^{l}; \theta), y_i^{l}) + L_\mathrm{seg}^B(f_B(x_{i}^l; \theta), y_{i}^l)]$ ;
    Update $\gamma_t$ on B according to equation (9): $\gamma_t = \min((1-\lambda)\gamma_{t-1}+\lambda\dfrac{1}{|B|}\sum\limits_{b = 1}^{B}\max({p}_A),\ (1-\lambda)\gamma_{t-1}+\lambda\dfrac{1}{|B|}\sum\limits_{b = 1}^{B}\max({p}_B))$ ;
    Generate $PL_A$ and $PL_B$ according to the dynamic threshold $\gamma_t$ : $PL_A = \unicode{x1D7D9}({p}_A \gt \gamma_t)$ , $PL_B = \unicode{x1D7D9}({p}_B \gt \gamma_t)$ ;
    Compute the JS divergence $D_{JS}({p}_A \,\|\, {p}_B)$ according to equation (10);
    Use the JS divergence to rectify the consistency loss according to equation (11):
        $L_\mathrm{Rcps}^A = \dfrac{1}{2|B|} \sum\limits_{x_i \in B}[(1-D_{JS}) \times L_\mathrm{cps}^{A}(PL_A, {p}_B) + \frac{1}{-\log \left( D_{JS} \right) +\varepsilon}]$ ;
        $L_\mathrm{Rcps}^B = \dfrac{1}{2|B|} \sum\limits_{x_i \in B}[(1-D_{JS}) \times L_\mathrm{cps}^{B}(PL_B, {p}_A) + \frac{1}{-\log \left( D_{JS} \right) +\varepsilon}]$ ;
    Compute the total loss: $L_\mathrm{total} = L_\mathrm{seg} +\alpha(L_\mathrm{Rcps}^A + L_\mathrm{Rcps}^B)$ , $\alpha = 1-e^{-iter}$ ;
    Update the model parameters θ ;
end
return θ

3.3. Tumor augmentation

To enhance the generalization of liver tumor segmentation across varied anomaly presentations, in Phase_3 we introduce TumorAug explicitly for the augmentation of tumor instances. As shown in figure 4, it incorporates two distinct approaches: TumorCP [66] and TumorSyn [67]. TumorCP is directed toward tumor-labeled images. The process involves the isolation of tumor instances, followed by the arbitrary selection of 1–3 tumor cases to which a sequence of spatial transformations (including scale adjustments, rotation, mirroring, and elastic deformation) and contrast alterations is applied. The resulting tumor instances are then pasted at a randomly determined position within the liver region delineated by X-Fuse in Phase_2. TumorCP is executed solely at the intra-patient level to keep context information consistent. Conversely, TumorSyn generates synthetic tumors on label-free images. This process entails a two-step morphological operation for shape and texture generation, both informed exclusively by clinical expertise, to model authentic tumors effectively. 1) Shape generation. Conforming to statistical distributions [67], small tumors manifest spherical configurations, while larger tumors tend to exhibit elliptical shapes. Consequently, tumor-like shapes are emulated with an ellipsoid defined as $ellip\left( a,b,c \right)$ , where a, b and c represent the lengths of the semi-axes and all follow a uniform distribution $U\left( 0.75r,1.25r \right)$ with $r\in \left\{ 4,8,16,32 \right\}$ corresponding to four defined sizes: tiny, small, medium, and large. Elastic deformations are subsequently employed to increase shape diversity. 2) Texture generation. The Hounsfield unit values of liver tumor textures follow Gaussian distributions $N\left( \mu ,\sigma _t \right) , \mu \sim U\left( 30,\mu _t-10 \right)$ , where µ_t and σ_t represent the mean and standard deviation of the hepatic parenchyma. Cubic interpolation is subsequently applied to smooth the texture. Additionally, Gaussian blurring is applied to the regenerated texture in both approaches, refining the tumor's edge.
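The sampling steps of TumorSyn described above can be sketched as follows. This is an illustrative reading of the text only: the helper names, the voxel-grid construction, and the liver statistics (110 HU mean, 20 HU standard deviation) are hypothetical, and the elastic deformation, cubic interpolation and Gaussian blurring steps are omitted.

```python
import random

# Radius r for the four tumor sizes named in the text.
RADII = {"tiny": 4, "small": 8, "medium": 16, "large": 32}


def sample_semi_axes(size, rng):
    """Draw the semi-axes a, b, c independently from U(0.75 r, 1.25 r)."""
    r = RADII[size]
    return tuple(rng.uniform(0.75 * r, 1.25 * r) for _ in range(3))


def ellipsoid_mask(a, b, c):
    """Binary ellipsoid on the smallest centred integer grid that contains it."""
    ra, rb, rc = int(a) + 1, int(b) + 1, int(c) + 1
    return [[[(x / a) ** 2 + (y / b) ** 2 + (z / c) ** 2 <= 1.0
              for z in range(-rc, rc + 1)]
             for y in range(-rb, rb + 1)]
            for x in range(-ra, ra + 1)]


def sample_texture_mean(liver_mean, rng):
    """Draw the tumor mean intensity mu from U(30, mu_t - 10)."""
    return rng.uniform(30.0, liver_mean - 10.0)


rng = random.Random(0)
a, b, c = sample_semi_axes("tiny", rng)
mask = ellipsoid_mask(a, b, c)
mu = sample_texture_mean(liver_mean=110.0, rng=rng)   # hypothetical liver mean in HU
voxel_hu = rng.gauss(mu, 20.0)                        # one draw from N(mu, sigma_t)
```

Because the upper bound of the mean is $\mu_t - 10$, every sampled tumor mean lies below the mean intensity of the surrounding parenchyma.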

**Figure 4.** Tumor augmentation (TumorAug) incorporates two approaches: TumorCP isolates and transforms existent tumors from tumor-labeled images before pasting them back. TumorSyn statistically generates synthetic tumor shapes and textures for label-free images. Original tumor contours are delineated in red while augmented ones are marked in blue.

4. Experiments

4.1. Dataset and preprocess

The dataset employed in this study integrates a proprietary dataset and the public LiTS dataset [30]. The internally curated dataset comprises 700 abdominal CT volumes, all manifesting hepatic malignancies of variable shape, size, and contrast enhancement, acquired from Zhuhai People's Hospital. Among those, 200 exemplars were manually annotated by two experienced radiologists using ITK-SNAP [68] for liver, bones, and body region, and a subset of 50 scans, which were also labeled for tumor presence, was reserved for testing. The dataset exhibits a mean resolution of 1.23 mm × 1.23 mm × 1.10 mm, with an average spatial dimension of 512 × 512 × 131. The LiTS dataset is used exclusively for training liver tumor segmentation in Phase_3. It comprises 131 contrast-enhanced portal phase abdominal CT scans, with a voxel spatial resolution of ([0.55–1] × [0.55–1] × [0.45–6.0]) $\mathrm{mm}^3$ and detailed annotations for both the liver and liver tumor regions.

The preprocessing pipeline, following the work in the MICCAI FLARE23 challenge [41], sequentially applies percentile-based intensity rescaling (5th and 95th percentiles), respacing to uniform voxel dimensions (1.5 mm, 1.5 mm, 2 mm), Z-normalization, and data augmentation via random cropping, flipping, and affine transformations. For the preliminary skin segmentation, the CT volumes were resized to 128 × 128 × 128, while a patch-based training approach with 96 × 96 × 96 crops was found to be optimal for organ and tumor delineation in the last two phases.
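The intensity part of this pipeline can be illustrated with a short sketch. It assumes that the percentile-based rescaling means clipping intensities to the 5th and 95th percentiles before Z-normalization, which the text does not spell out; respacing and augmentation are left out, and the helper names and toy Hounsfield values are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated q-th percentile (0 to 100) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def clip_and_znorm(voxels, low_q=5.0, high_q=95.0):
    """Clip intensities to the [low_q, high_q] percentile range, then Z-normalize."""
    ordered = sorted(voxels)
    lo, hi = percentile(ordered, low_q), percentile(ordered, high_q)
    clipped = [min(max(v, lo), hi) for v in voxels]
    mean = sum(clipped) / len(clipped)
    var = sum((v - mean) ** 2 for v in clipped) / len(clipped)
    std = var ** 0.5 or 1.0                    # guard against constant input
    return [(v - mean) / std for v in clipped]


# Toy intensities in HU, including air and metal-like outliers.
hu = [-1000.0, -200.0, 0.0, 40.0, 60.0, 80.0, 120.0, 300.0, 1500.0, 3000.0]
normalised = clip_and_znorm(hu)
```

Clipping first keeps extreme values such as air or metal from dominating the mean and standard deviation used for normalization.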

4.2. Implementation details

Our method was implemented with the PyTorch and MONAI frameworks. Model training and inference were deployed on an NVIDIA A800 GPU. The CacheDataset module in MONAI enabled data preloading for expedited iteration. Optimization was performed with the Adam algorithm and a weight decay of $1 \times 10^{-5}$ . The initial learning rate was $3\times 10^{-4}$ with a cosine annealing schedule. Models were trained for up to 50 000 iterations with a batch size of 8.
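For reference, the learning rate schedule can be written in closed form. The sketch below assumes a single cosine cycle over all 50 000 iterations, a minimum learning rate of zero and no warm-up, none of which is stated in the text; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step, total_steps=50_000, base_lr=3e-4, min_lr=0.0):
    """Cosine-annealed learning rate at a given training iteration."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_start = cosine_lr(0)          # equals the initial learning rate
lr_middle = cosine_lr(25_000)    # half of the initial learning rate
lr_end = cosine_lr(50_000)       # decays to min_lr
```

Under the same assumptions this matches the closed form used by PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max` set to the total number of iterations.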

We employed three metrics to evaluate segmentation performance [65]. The DSC measures the spatial overlap between the prediction and the ground truth (GT), ranging from 0 to 1, with 1 indicating complete overlap. The average surface distance (ASD) computes the average shortest distance between the surfaces of the prediction and the GT; lower values signify better adherence of the predicted surface to the GT. The 95% Hausdorff distance (HD95) calculates the maximum surface distance between the prediction and the GT after excluding the most extreme 5% of outliers, reflecting the degree of maximal deviation.
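A toy computation of the three metrics is sketched below on sets of voxel coordinates. Several choices are assumptions of this sketch rather than details from the text: every foreground voxel is treated as a surface point, distances are in voxel units, ASD is averaged symmetrically over both directions, and HD95 is taken as the larger of the two directed 95th-percentile distances (conventions for HD95 differ between toolkits).

```python
import math


def dice(pred, gt):
    """Dice similarity coefficient of two sets of foreground voxel coordinates."""
    if not pred and not gt:
        return 1.0
    return 2.0 * len(pred & gt) / (len(pred) + len(gt))


def directed_distances(src, dst):
    """Distance from every point in src to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def percentile(values, q):
    """Linearly interpolated q-th percentile (0 to 100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def asd(surf_a, surf_b):
    """Symmetric average surface distance."""
    d = directed_distances(surf_a, surf_b) + directed_distances(surf_b, surf_a)
    return sum(d) / len(d)


def hd95(surf_a, surf_b):
    """Larger of the two directed 95th-percentile surface distances."""
    return max(percentile(directed_distances(surf_a, surf_b), 95.0),
               percentile(directed_distances(surf_b, surf_a), 95.0))


pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
gt = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
scores = (dice(pred, gt), asd(pred, gt), hd95(pred, gt))
```

Real evaluations would extract the actual surface voxels and scale distances by the voxel spacing; the toy masks here only show how the three quantities relate.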

4.3. Results analysis

To thoroughly assess the capabilities of our proposed method, we conduct extensive comparative analyses against prior state-of-the-art models under both supervised-only (SL) and SSL settings.

4.3.1. Comparison to state-of-the-art supervised methods

To validate the efficacy of our proposed X-Fuse model for the segmentation of anatomically pertinent organs (i.e. bones, liver) as well as pathological regions (i.e. liver tumors), we evaluate our approach under SL conditions with 100% labeled data against prevailing CNN-based models such as V-Net [34] and H-DenseUNet $^{\diamond}$ [4], as well as the highly configurable and modular nnUNet $^{\diamond}$ [69]; ViT-based models such as UNetFormer $^{\diamond}$ [59] and SwinUNETR [5]; and contemporary hybrid models such as CoTr [70] and APAUNet $^{\diamond}$ [71] that couple the advantages of both paradigms, as summarized in table 1. $^{\diamond}$ denotes methods that were applied to liver tumor delineation in their original implementation. Our proposed X-Fuse framework outperforms the previous leading models across all quantitative segmentation metrics. Specifically, for liver segmentation, X-Fuse achieves a 2.41% improvement in DSC, a 2.04 voxel reduction in HD95, and a 0.98 voxel reduction in ASD compared to H-DenseUNet. Moreover, X-Fuse demonstrates favorable outcomes in bone segmentation, showing a marginal advantage over alternative methods, among which CNNs mostly outperform their transformer counterparts. X-Fuse also displays superior performance in tumor segmentation, with DSC increases of 4.17% and 2.91% over nnUNet and CoTr, respectively. These results underscore the efficacy and competitive edge of X-Fuse in various medical segmentation tasks.

Table 1. Evaluation of X-Fuse against state-of-the-art fully-supervised models in terms of DSC(%) $\uparrow$ , HD95(voxel) $\downarrow$ and ASD(voxel) $\downarrow$ . Best results are marked in bold.

Method	Bones DSC	Bones HD95	Bones ASD	Liver DSC	Liver HD95	Liver ASD	Liver tumors DSC	Liver tumors HD95	Liver tumors ASD
V-Net [34]	90.52 $^{\ast}$	20.80 $^{\dagger}$	7.44 $^{\dagger}$	90.13 $^{\dagger}$	5.51 $^{\ast}$	2.41 $^{\ast}$	73.12 $^{\dagger}$	24.78 $^{\dagger}$	10.04 $^{\dagger}$
H-DenseUNet [4]	91.50 $^{\ast}$	18.78 $^{\dagger}$	5.39 $^{\ast}$	90.34 $^{\dagger}$	5.34 $^{\dagger}$	2.11 $^{\ast}$	75.46 $^{\dagger}$	21.95 $^{\dagger}$	23.89 $^{\dagger}$
nnUNet [72]	92.27	12.56	4.45	90.01 $^{\dagger}$	7.87 $^{\dagger}$	1.78	74.59 $^{\dagger}$	22.85 $^{\dagger}$	15.38 $^{\dagger}$
UNetFormer [59]	90.44 $^{\dagger}$	22.32 $^{\dagger}$	7.18 $^{\dagger}$	91.17 $^{\ast}$	10.65 $^{\dagger}$	2.68 $^{\ast}$	71.46 $^{\dagger}$	30.17 $^{\dagger}$	12.09 $^{\dagger}$
SwinUNETR [5]	90.56 $^{\dagger}$	18.82 $^{\dagger}$	6.95 $^{\dagger}$	91.14 $^{\ast}$	9.45 $^{\dagger}$	2.96 $^{\ast}$	70.93 $^{\dagger}$	23.56 $^{\dagger}$	10.27 $^{\dagger}$
CoTr [70]	91.36 $^{\ast}$	25.95 $^{\dagger}$	7.89 $^{\dagger}$	91.13 $^{\ast}$	5.48 $^{\dagger}$	2.89	75.85 $^{\dagger}$	15.45 $^{\dagger}$	6.76 $^{\dagger}$
APAUNet [71]	90.53 $^{\dagger}$	24.53 $^{\dagger}$	8.76 $^{\dagger}$	90.46 $^{\dagger}$	9.76 $^{\dagger}$	3.45 $^{\dagger}$	72.45 $^{\dagger}$	35.67 $^{\dagger}$	18.34 $^{\dagger}$
X-Fuse	93.17	12.38	4.01	92.75	3.30	1.13	78.76	10.23	3.23

$\ast$ and $\dagger$ denote statistical significance at $p \unicode{x2A7D} 0.05$ and $p \unicode{x2A7D} 0.01$ , respectively, based on a pairwise t-test compared with our method.

4.3.2. Analysis on model components

To assess the contribution of individual components of X-Fuse, comprehensive ablation studies are conducted in table 2, focusing on the significance of the encoder branches and bottleneck fusion. The findings reveal that the spatial encoder alone surpasses CNN-transformer hybrid counterparts such as APAUNet by 1.14% and 3.9% DSC for bones and tumors, respectively, substantiating its efficacious multi-scale context aggregation. Meanwhile, the spectral encoder alone rivals, on all metrics for bone and liver, reputed global models such as SwinUNETR in capturing long-range inter-dependencies, although it falls behind its spatial counterpart. Additionally, the bottleneck integration layer fusing the latent embeddings from both pathways enables X-Fuse to achieve considerable improvements, e.g. 2.12% on tumor, over the individual modules and to outperform the ensemble of two separate networks containing either type of encoder. This underscores the merits of a harmonious dual-encoder design, harnessing both spatial and spectral cues in a unified architecture.

Table 2. Ablation studies quantifying the contribution of individual components including spatial encoder, spectral encoder, and bottleneck fusion on the test set.

Setting	Spatial encoder	Spectral encoder	Bottleneck fusion	Bones DSC(%) $\uparrow$	Liver DSC(%) $\uparrow$	Tumors DSC(%) $\uparrow$
SL with 100% labeled images	✓			91.67 $^{\ast}$	91.38 $^{\ast}$	76.35 $^{\dagger}$
		✓		90.67 $^{\dagger}$	91.18 $^{\ast}$	73.29 $^{\dagger}$
	✓	✓		91.79 $^{\ast}$	91.64 $^{\ast}$	76.64 $^{\dagger}$
	✓	✓	✓	93.17	92.75	78.76

$\ast$ and $\dagger$ denote statistical significance at $p \unicode{x2A7D} 0.05$ and $p \unicode{x2A7D} 0.01$ , respectively, based on a pairwise t-test compared with our method.

4.3.3. Comparison to pervasive semi-supervised methods

The compared semi-supervised approaches in table 3 (all models are built upon V-Net for a fair comparison) encompass the seminal mean teacher self-ensembling paradigm (MT) [13] alongside its uncertainty-guided derivative UA-MT [73], uncertainty-attenuated pyramid consistency regularization via URPC [74], shape-aware adversarial learning through SASSNet [75], multi-task consistency enforcement by DTC [56], and cross-pseudo supervision based mutual consistency learning using a dual-decoder MC-Net [54]. In comparison to the supervised V-Net conditioned solely upon annotated samples in table 1, all semi-supervised methods harnessing unlabeled exemplars confer notable performance enhancements, substantiating the benefit of exploiting unlabeled data. The UA-MT method eclipses the vanilla MT approach by guiding the optimization of the student model, underscoring the instrumental role of uncertainty regularization. Conversely, the DTC model fails to exhibit advantages over the MT model, revealing the limited positive impact of transformation consistency on our task. MC-Net, distinguished by its superiority over URPC and UA-MT across most metrics, validates the robustness of mutual consistency. By assimilating implicit geometric clues, SASSNet attains top performance on the HD95 and ASD metrics for bones and liver. Our framework surpasses prevailing semi-supervised techniques in terms of DSC, demonstrating a 1.27% improvement margin on tumor over the next top-performing method SASSNet. By virtue of uncertainty modulation, our approach further refines the rudimentary CPS in MC-Net to 94.35%, 94.94% and 84.89% for bones, liver and liver tumor on DSC, complemented by a dynamic thresholding strategy.

Table 3. Quantitative comparisons of semi-supervised segmentation models on the test set in terms of DSC(%) $\uparrow$ , HD95(voxel) $\downarrow$ and ASD(voxel) $\downarrow$ . Best results are marked in bold.

Method	Bones DSC	Bones HD95	Bones ASD	Liver DSC	Liver HD95	Liver ASD	Liver tumors DSC	Liver tumors HD95	Liver tumors ASD
MT [13]	92.58 $^{\dagger}$	13.64 $^{\dagger}$	4.23	92.76 $^{\dagger}$	4.67	2.29	80.76 $^{\dagger}$	13.73 $^{\dagger}$	7.35 $^{\dagger}$
UA-MT [73]	93.02 $^{\ast}$	17.94 $^{\dagger}$	4.56	93.35 $^{\ast}$	8.04 $^{\dagger}$	4.35 $^{\dagger}$	81.15 $^{\dagger}$	9.58 $^{\dagger}$	6.94 $^{\dagger}$
DTC [56]	92.23 $^{\dagger}$	15.66 $^{\dagger}$	5.82 $^{\ast}$	92.67 $^{\dagger}$	4.90 $^{\ast}$	3.76 $^{\ast}$	80.73 $^{\dagger}$	15.73 $^{\dagger}$	6.45 $^{\dagger}$
URPC [76]	92.45 $^{\dagger}$	13.96 $^{\dagger}$	4.29	93.43 $^{\ast}$	7.19 $^{\dagger}$	3.98 $^{\ast}$	81.12 $^{\dagger}$	13.32 $^{\dagger}$	5.87 $^{\dagger}$
MC-Net [15]	93.24 $^{\ast}$	12.18 $^{\ast}$	4.81	94.13	6.73 $^{\dagger}$	3.68 $^{\ast}$	81.82 $^{\dagger}$	9.28 $^{\dagger}$	3.34
SASSNet [75]	93.32 $^{\ast}$	8.45 $^{\dagger}$	3.15 $^{\ast}$	94.10	3.15	2.28	82.24 $^{\ast}$	9.24 $^{\dagger}$	3.34
Ours	94.35	11.33	4.32	94.94	3.82	2.36	83.89	5.65	3.25

$\ast$ and $\dagger$ denote statistical significance at $p \unicode{x2A7D} 0.05$ and $p \unicode{x2A7D} 0.01$ respectively, based on a pairwise t-test compared with our method.

4.3.4. Performance under diverse labeling budgets

Figure 5 investigates the impact of our proposed semi-supervised approach under varying labeling budgets, ranging from a modest 10% to a more extensive 80% of the labeled data. Our method consistently outperforms prior methods across all proportions of labeled data. Concerning DSC on tumor, our method with only 10% of the labeled data achieves performance comparable to the SL baseline (V-Net) trained with 60% of the labeled data, deviating by a mere 0.29%. This outcome underscores the capacity of our approach to substantially reduce labeled data requirements. Additionally, with only 20% of the labeled CT scans, our method comes within around 0.8% of the plain MT architecture trained with three times as much labeled data across the three segmentation objectives, demonstrating its significant advantages under limited labeled data regimes.

**Figure 5.** Comparative analysis of various methods across different percentages of labeled CT scans.

4.3.5. Analysis on loss regulatory components

Our SSL framework incorporates three integral constituents, namely the foundational co-training mechanism (baseline) grounded in CPS, uncertainty rectification through the JS divergence, and dynamic thresholding for pseudo labeling. To discern the efficacy of each module on X-Fuse, we conducted a systematic ablation study, incrementally introducing these components onto the baseline. The experiments were run with 50% and 100% of the labeled images, and the resulting performance across configurations is reported in table 4. Evidently, the modules act in clear synergy, unlocking progressive gains to deliver state-of-the-art semi-supervised segmentation. Concretely, with half of the labeled data, the dynamic thresholding imparted by the CPL strategy improves the baseline performance by 2.03% and 1.89% on bones and tumors, respectively. Uncertainty rectification bestows a further increment of 0.67% upon the baseline performance of liver delineation. The best results are attained by combining all components, with increments of more than 1% over each individual component, propelling our framework to 93.12% on bones and 92.56% on liver, matching the performance of the baseline reliant solely on labeled samples in table 1. When leveraging all labeled data, in the absence of any regulatory mechanism for pseudo-labeling, the unadorned mutual supervision procedure attains a commendable 83.57% on the DSC of tumor segmentation, a 4.71% increase relative to the supervised-only baseline, demonstrating the substantive efficacy of CPS. The implementation of both regularization techniques yields the best results on bones (95.42%), liver (96.26%), and tumors (85.20%).

Table 4. Ablation studies quantifying the individual impact of key loss regulatory components, including CPS, uncertainty rectification, and dynamic thresholding.

Setting	CPS	Uncertainty rectification	Dynamic threshold	Bones DSC(%) $\uparrow$	Liver DSC(%) $\uparrow$	Tumors DSC(%) $\uparrow$
SSL with 50% labeled images	✓			89.31 $^{\dagger}$	90.76 $^{\dagger}$	76.56 $^{\dagger}$
	✓	✓		91.29 $^{\ast}$	91.45 $^{\ast}$	77.79 $^{\dagger}$
	✓		✓	91.34 $^{\ast}$	90.83 $^{\ast}$	78.45 $^{\ast}$
	✓	✓	✓	93.12	92.56	80.19
SSL with 100% labeled images	✓			93.82 $^{\ast}$	94.76 $^{\ast}$	83.57 $^{\ast}$
	✓	✓		94.79	95.45 $^{\ast}$	84.59
	✓		✓	94.34	95.83	84.35
	✓	✓	✓	95.42	96.26	85.20

$\ast$ and $\dagger$ denote statistical significance at $p \unicode{x2A7D} 0.05$ and $p \unicode{x2A7D} 0.01$ , respectively, based on a pairwise t-test compared with our method.

4.3.6. Tumor augmentation impact

We examine the components of our TumorAug technique, i.e. TumorSyn and TumorCP, under both the SL and SSL frameworks in figure 6. In the SL case, TumorSyn yields a perceptible improvement of 3.56% in DSC while TumorCP yields an impressive 6.13% gain, indicating the potential of our tumor augmentation approach to further refine outcomes by capturing diverse tumor features. Although model capacity appears to approach saturation, as observed in table 4, our SSL method better exploits TumorSyn and TumorCP by providing more diverse and confident pseudo labels, enabling both to accrue additional gains atop the augmentation-free baseline. Specifically, TumorSyn and TumorCP raise DSC by 1.5% and 3.1%, respectively. Notably, the combined use of TumorSyn and TumorCP under the two scenarios improves on the purely supervised variant without augmentation by an appreciable 8.35% and 10.49%. Overall, TumorAug promotes diverse representations and enhances the generalization capacity of the model across heterogeneous lesion manifestations.

**Figure 6.** Comparative evaluation of the standalone and synergistic impact of TumorSyn and TumorCP augmentation techniques under SL and SSL frameworks for liver tumor segmentation.

Figure 7 shows some qualitative results of different models under SL and SSL segmentation contexts. Particularly, with respect to the segmentation of bone and liver, it is noteworthy to observe that all models under consideration demonstrate a comparable level of performance except that SwinUNETR and V-Net demonstrate over-segmentation and under-segmentation respectively in liver segmentation tasks while X-Fuse distinguishes itself by achieving heightened sensitivity in both learning contexts. Regarding liver tumor segmentation, our model exhibits a commendable capability for delineating fine-grained morphological details and otherwise minute pathological structures, which can be attributed to the innate multi-scale representational ability. Furthermore, the integration of tumor augmentation techniques serves to amplify model competencies for fine-grained characterization of manifold morphological variabilities of liver tumors.

**Figure 7.** Qualitative visualization of liver, bone, and tumor segmentation performance by different methods in both supervised learning (purple) and semi-supervised learning (blue) contexts. Segmentation predictions (red) and ground truth labels (green) are presented through 3D rendered mask overlays for liver and bone cases (a) and 2D contours for tumor instances (b).

5. Discussion

In this research endeavor, we present a cascaded three-phase framework devised for liver tumor intervention, with the focal points being the delineation of AVO and OAR encompassing critical structures like bones and the liver and the precise identification of GTV, specifically targeting liver tumors. In the initial phase, rapid localization of the body is achieved through the utilization of FasterNet. Subsequently, in Phases 2 and 3, the focus shifts towards the precise delineation of relevant organs. To fulfill this objective, we introduce X-Fuse, which exhibits versatility and applicability to both fully supervised and SSL settings. The architectural configuration of X-Fuse is characterized by the integration of two discrete encoders, each incorporating spatial and spectral token mixers within a MetaFormer structure. In the realm of full supervision, our spatial encoder exhibits superior performance over models founded on convnets, transformers, and their hybrid counterparts. Simultaneously, the spectral encoder performs favorably at a comparable level. The effectiveness of our model is further amplified through the fusion of latent representations, where spatial and frequency features mutually complement each other, endowing our model with enhanced capacities and enriched information. The resulting p-values consistently fall below the significance threshold of 0.05, confirming the significant improvements achieved by our proposed method. By harnessing additional unlabeled data, and the application of uncertainty rectification, as determined by the JS variance of dual decoder outputs, coupled with the use of CPL, our methodology capitalizes on the potential of CPS beyond prevailing teacher-student and multi-task consistency approaches. Ablation studies further elucidate that each individual module makes a nearly equal contribution to the overall efficacy of the method. 
Through performing simple tumor augmentation techniques TumorCP and TumorSyn, we demonstrate that these tumor-centric augmentations markedly refine segmentation performance to 89.53% by increasing variation in malignant tissue patterns. The resulting model can better generalize across heterogeneous images and identify intricate morphological tumor characteristics.

Despite our method yielding promising liver tumor segmentation performance in contrast-enhanced CTs, it still faces limitations regarding its generalizability to non-contrast-enhanced CT scans. As illustrated in figure 8, our method, along with all prior methods, fails under supervised learning conditions where models are trained exclusively with contrast-enhanced CT images. The incorporation of unlabeled non-contrast scans and the TumorAug technique significantly bolsters domain adaptation, as exemplified in Case 1. However, our approach encounters difficulties when images are corrupted by noise stemming from iodized oil embolization or the administration of chemotherapeutic agents, as evidenced in Case 2.

**Figure 8.** Qualitative visualization of tumor segmentation performance in non-contrast CTs by different methods in both supervised learning (purple) and semi-supervised learning (blue) contexts. Segmentation predictions (red) and ground truth labels (green) are presented via 2D contours for tumor instances.

According to the comparisons in figure 9, while our methodology essentially entails an ensemble of two networks, the complexity of our model is marginally higher than the baseline V-Net. Importantly, it significantly falls below that of the semi-supervised two-decoder MC-Net, under equivalent hyperparameter settings, such as the number of layers and channels per layer. Empirically, given a representative test case exemplified by a CT volume of dimensions 512 × 512 × 300, a common image size encountered in clinical practice, the total inference time per scan achieved by our proposed architecture is 30 seconds. This reflects the efficiency gains of our streamlined framework, which maximizes accuracy while minimizing computational expenses through carefully optimized components tailored for the automated analysis of medical images.

**Figure 9.** Illustration of model complexity.

Despite promising performance for liver tumor segmentation in CT imaging, our work has several limitations that provide avenues for future improvements:

Limited encoder interaction: the fusion strategy in X-Fuse simply concatenates spatial and spectral encoder outputs. More complex interactions between the parallel pathways, such as recursive feedback or dual attention flows, could further enrich the integrated representation. Architectures that enable tighter coupling and bidirectional propagation between the encoders merit additional exploration.
Single organ coverage: we solely focused on segmenting the liver as the primary organ at risk. Safe interventional planning, however, necessitates incorporating additional abdominal organs like stomach, intestines and spleen that could undergo collateral damage. Expanding the anatomical breadth to multiple organs would provide more comprehensive risk assessments during therapy.

6. Conclusion

In this work, we have presented a multiphase cascaded DL workflow specifically optimized to empower targeted image-guided intervention for liver tumors. Leveraging a MetaFormer architecture, we develop two models for automated anatomical segmentation: FasterNet and X-Fuse. Through coordinated assimilation of dual-domain spatial and spectral feature streams enriching latent encodings, X-Fuse sets new standards in precision contouring of AVO (bones), OAR (liver), and GTV (liver tumor). Semi-supervised CPL along with JS divergence rectification applied on X-Fuse further minimizes error propagation, conferring resilience against limited annotations, especially for diverse liver lesions. Our simple tumor data augmentation techniques substantially enhance the representation of lesion heterogeneity. Through these advances, X-Fuse strikes a balance between segmentation accuracy, efficiency and generalizability, which is critical for integration into safety-critical liver tumor targeting workflows. Our segmentation pipeline marks a step towards clinically viable automated assistance for precision image-guided liver tumor procedures. Looking ahead, broadening anatomical perspectives through multi-organ and multi-modal contouring remains vital to enable responsible translation towards intelligent intra-operative guidance systems for liver tumor intervention.

Acknowledgments

The study was supported by the National Natural Science Foundation of China (81827805, 82130060, 61821002, 92148205), the National Key Research and Development Program (2018YFA0704100, 2018YFA0704104), the China Postdoctoral Science Foundation (2021M700772), the Zhuhai Industry-University-Research Collaboration Program (ZH22017002210011PWC), and the Jiangsu Province Science and Technology Support Project (BE2023769). The funding sources had no role in the writing of the report or in the decision to submit the paper for publication.

Data availability statement

The data cannot be made publicly available upon publication because they contain sensitive personal information. The data that support the findings of this study are available upon reasonable request from the authors.

Ethical statement

This study, which used human CT data, was approved by the Ethics Review Committee of Zhuhai People's Hospital. All data were de-identified through anonymization protocols.
