Semi-supervised segmentation of abdominal organs and liver tumor: uncertainty rectified curriculum labeling meets X-fuse

Precise segmentation of liver tumors and associated organs holds immense value for surgical and radiological intervention, enabling anatomical localization for pre-operative planning and intra-operative guidance. Modern deep learning models for medical image segmentation have evolved from convolutional neural networks to transformer architectures, significantly boosting global context understanding. However, accurate delineation, especially of hepatic lesions, remains an enduring challenge because models focus predominantly on spatial feature extraction, which fails to adequately characterize complex medical anatomies. Moreover, the relative paucity of expertly annotated medical imaging data restricts model exposure to diverse pathological presentations. In this paper, we present a three-phase cascaded segmentation framework featuring an X-Fuse model that synergistically integrates complementary spatial- and frequency-domain information in dual encoders to enrich latent feature representation. To enhance model generalizability, building upon the X-Fuse topology and taking advantage of additional unlabeled pathological data, our proposed integration of curriculum pseudo-labeling with Jensen–Shannon variance-based uncertainty rectification promotes optimized pseudo supervision in the context of semi-supervised learning. We further introduce a tumor-focused augmentation technique, comprising training-free copy-paste and knowledge-based synthesis, that is effective despite its simplicity and substantially improves model adaptability to diverse lesion morphologies.
Extensive experiments and modular evaluations on a holdout test set demonstrate that our methods significantly outperform existing state-of-the-art segmentation models in both supervised and semi-supervised settings, as measured by the Dice similarity coefficient, achieving superior delineation of bones (95.42%), liver (96.26%), and liver tumors (89.53%), with a 16.41% increase compared with V-Net in the supervised-only, augmentation-free scenario. Our method marks a significant step toward the realization of more reliable and robust AI-assisted diagnostic tools for liver tumor intervention. We have made the code publicly available [https://github.com/lyupengju/X-Fuse].


Introduction
Hepatic malignancies constitute an immense global disease burden, as both primary liver cancers such as hepatocellular carcinoma and hematogenous metastatic lesions infiltrating the liver parenchyma exact a toll on morbidity and mortality [1]. The complexity inherent in liver anatomy, coupled with the diverse manifestations of liver tumors, mandates the employment of sophisticated and precisely targeted intervention strategies. Image-guided percutaneous puncture interventions have gained increasing [...] domain adaptation, we propose the Jensen-Shannon (JS) divergence between the X-Fuse decoding streams as an epistemic uncertainty estimate for CPS rectification. Diverging from the prevailing trend of employing progressively sophisticated methodologies such as GANs [18] and diffusion models [27], each of which demands substantial training datasets, our approach amalgamates two training-free data augmentation strategies for expanding tumor data representations.
The main contributions are summarized as follows:
• We present a streamlined triphasic cascade segmentation framework for efficient delineation of bones, liver, and liver tumors. A key innovation is the introduction of the X-Fuse model, which integrates spatial and frequency feature extraction, enhancing the precision of target segmentation.
• Within the X-Fuse framework, to optimize SSL, we propose the JS divergence as a co-training uncertainty measure for pseudo-supervision rectification, integrated with a CPL strategy that dynamically selects pseudo labels.
• We formulate a simple yet effective data augmentation technique, TumorAug, that combines straightforward copy-paste with expert-knowledge-based synthesis to account for tumor phenotypic variability.
• Comprehensive benchmarking evaluations on an in-house dataset demonstrate the superior performance of our proposed methods under both full supervision and semi-supervision scenarios, validating the efficacy of our unified framework for efficient deployment in liver tumor segmentation.

Related works

Abdominal organs and liver tumor segmentation
Recent advancements in abdominal organs and liver tumor segmentation have been catalyzed by the advent of challenging benchmark datasets, e.g. AMOS [28], FLARE [29] and LiTS [30], coupled with cutting-edge DL approaches [31]. Architectures exemplified by the CNN-based U-Net [32] have become ubiquitous, as their contracting and expanding paths allow multi-scale capture of both local details and global context critical for medical image segmentation. Volumetric extensions like 3D U-Net [33,34] are particularly apt for learning intricate anatomical relationships and spatial interactions within volumetric imaging data such as computed tomography (CT) and magnetic resonance imaging (MRI). Moreover, attention mechanisms have gained traction [35]. MA-Net [36] and RAU-Net [37] incorporate attention modules within the bottleneck and skip connections, respectively, aiming to capture inter-dependencies between channel and spatial dimensions, thereby enhancing the delineation of ambiguous boundaries in liver tumor segmentation. In particular, the self-attention mechanism adapted from vision transformer (ViT) architectures [38], by enabling interactions between all spatial locations in the input image, can focus adaptively on organs and lesion regions by extracting informative features from across the entire 3D volume to capture irregular boundaries and complex morphology not confined to local regions [39][40][41]. Though scarce in number, recent pioneering works have kindled explorations into SSL for liver tumor segmentation, recognizing the potential of harnessing both scarce annotated and copious unannotated data. For instance, Xia et al [42] introduce an uncertainty-aware multi-view co-training framework that utilizes multi-viewpoint consistency and asymmetrical 3D kernels to enhance feature diversity, together with an uncertainty-weighted label fusion method that employs Bayesian DL to assess prediction reliability. Our work seeks to advance along similar trajectories as an augmented semi-supervised approach for this undertaking.

Spectral neural networks
The utilization of frequency analysis has been a longstanding practice in traditional digital image processing methodologies [43]. Recently, there has been a discernible integration of frequency-based operations, exemplified by the Fourier transform, into the domain of deep neural networks (DNNs) [7,44]. This integration serves two key objectives: 1) expediting the training process and facilitating the optimization of DNNs [45,46]; 2) acquiring informative representations with global receptive fields [8,47]. Prior works such as [44,48] also uncover that disparate priorities are assigned to distinct frequency components during the training process, resulting in varied contributions to the robustness of features. All of these serve as inspiration for our pursuit of enhancing the generalization capacity of DNNs through the direct modulation of frequency components.

SSL in medical image analysis
In the realm of medical imaging, the availability of unlabeled data often surpasses that of labeled counterparts. This asymmetry presents a significant challenge, prompting the exploration of semi-supervised paradigms to harness the potential of unlabeled data [49,50]. The prevailing methodologies fall into two categories: pseudo-labeling-based [10, 51, 52] and consistency regularization-based [13,14,53].
Pseudo-labeling has evolved into a classical algorithm in SSL for its simplicity and effectiveness: it generates pseudo labels for unannotated images and pairs them with annotated images for model training. Consistency regularization is based on the concept that predictions from models should be invariant to data perturbations and geometric transformations. The amalgamation of these two categories into a hybrid framework has gained increasing popularity [54,55]. Luo et al [56] employ a dual-task network that simultaneously predicts a pixel-wise segmentation map and a geometry-aware level set representation, which is transformed into an approximate segmentation map through a differentiable task transform layer; the dual-task consistency regularization between the two is ensured via CPS [16]. Li et al [57] propose a self-ensembled co-training framework for automatic COVID lesion segmentation, using collaborative models that teach each other via reciprocal pseudo-labeling of unlabeled data and self-ensembling for consistency regularization to mitigate noisy labels. Akin to these methods, our proposed methodology combines dynamic thresholding with the JS divergence as a decoder variance proxy to selectively modulate the loss contribution from pseudo-labeled data.

Method
In this section, we delineate our proposed methodology for achieving enhanced segmentation performance on the target anatomical structures and lesions. In section 3.1, we first introduce the cascade workflow with an emphasis on the structural composition of X-Fuse. In section 3.2, we describe the subsequent application of dynamic thresholding and variance regulation for pseudo-supervision rectification in SSL incorporating additional unlabeled data. Finally, a customized tumor augmentation strategy is presented in section 3.3 to improve model generalization capacity by providing enhanced sample variability during the training process.

Model architecture
To facilitate accurate liver tumor intervention through efficient segmentation of target anatomical structures, we devise a segmentation framework structured as a multi-phase cascade network with progressively refined focus on sub-regions of interest (ROI) and target structures in each phase, as seen in figure 1. The initial phase of our approach employs a lightweight model known as FasterNet [58] to segment the skin and locate the overall body region from the background. Subsequently, in Phase_2, critical skeletal structures including ribs and vertebrae, as well as the hepatic region, are delimited through our proposed X-Fuse network. Building upon liver localization, the final phase harnesses X-Fuse to procure fine-grained segmentation of hepatic lesions. Importantly, the output from each phase serves as the ROI for the subsequent phase, facilitating a coherent and systematic progression through the segmentation process.
In this study, we propose a novel model denoted as X-Fuse for target organ and lesion segmentation. X-Fuse is the amalgamation of two distinct models whose foundational structure is aligned with the overarching design principles of the U-Net [32] framework, characterized by a four-stage hierarchical encoder incorporating stacked MetaFormer modules, complemented by skip connections that relay encoded representations directly to the corresponding decoder blocks. The integration of these two individual models occurs at a bottleneck block, culminating in a robust fused representation of bifurcated feature extraction (spatial and spectral branches). The overall topology of the model assumes an 'X' form, hence the name X-Fuse, as depicted in figure 2.
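The hand-off between phases, in which the output of one phase becomes the ROI of the next, can be sketched as a bounding-box crop around the previous phase's mask. This is a minimal illustration under assumptions, not the released implementation: the function names `bounding_box` and `crop_to_roi` and the `margin` parameter are hypothetical.

```python
import numpy as np

def bounding_box(mask, margin=0):
    """Return slices covering the non-zero region of `mask`, padded by `margin` voxels."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty; no ROI to crop")
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))

def crop_to_roi(volume, mask, margin=0):
    """Crop `volume` to the ROI given by the previous phase's segmentation mask."""
    roi = bounding_box(mask, margin)
    return volume[roi], roi

# Toy example: a 6x6x6 volume with a small foreground block standing in for the liver mask.
volume = np.arange(6 * 6 * 6, dtype=float).reshape(6, 6, 6)
mask = np.zeros((6, 6, 6), dtype=bool)
mask[2:4, 1:5, 3:4] = True
cropped, roi = crop_to_roi(volume, mask)
padded, _ = crop_to_roi(volume, mask, margin=1)
```

Returning the slices alongside the crop makes it straightforward to paste a later phase's prediction back into the full-resolution volume.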

MetaFormer
The MetaFormer architecture [24], known for its versatility and efficacy, has found application across a spectrum of current models [5,59]. It encompasses two distinct residual sub-blocks. The first sub-block primarily incorporates a token mixer module, exemplified by self-attention [60], multi-layer perceptron (MLP) [61], and convolutional [62] layers, while the second sub-block features a channel MLP with two linear layers activated by GELU functions; together they facilitate both inter- and intra-token communication. The configuration of token mixers varies across phases. To obtain the coarse body location in the initial phase, we opt for FasterNet employing partial convolution [58], which enhances computational efficiency by applying filters only to the first quarter of the input channels and leaving the remaining ones unaltered. During Phase_2 and Phase_3, the entire X-Fuse structure is employed, with two parallel branches concurrently engaging in feature extraction within distinct domains. The spatial encoder branch comprises scale-aware and self-attentive modulation for multi-scale feature aggregation; this hybrid design is derived from our previous research, which showed exceptional performance in the MICCAI FLARE23 challenge [41]. The spectral branch modulates the spectral features in the frequency domain through a learnable GF. Each token mixer is expounded upon in the following sections, illustrated with an input $X \in \mathbb{R}^{C' \times H' \times W' \times D'}$, where $C'$ represents the number of channels and $H' \times W' \times D'$ denotes the feature resolution.

Patch embedding
The stem block, namely the patch embedding layer, partitions the input image into a series of discrete patches and projects them linearly into high-dimensional token vectors by applying a strided convolution with a kernel size of 3, a stride of 2, and a padding of 1. It extracts elementary visual features encapsulated within each patch region while simultaneously reducing the spatial dimensions to mitigate computational requirements when fed into MetaFormer-based models.
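To make the reduction in spatial dimensions concrete, the output size of a strided convolution along one axis follows the standard formula floor((n + 2p - k) / s) + 1. The short sketch below evaluates it for the stated kernel, stride, and padding; the helper name `conv_output_size` is hypothetical, and the input sizes are the crop and resize sizes quoted in the preprocessing section.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 96-voxel training crop is halved along each axis by the stem.
out_96 = conv_output_size(96)
# A 128-voxel resized volume (skin segmentation phase) is halved as well.
out_128 = conv_output_size(128)
```

With these settings each axis is halved, so a 96 × 96 × 96 crop yields 48 × 48 × 48 tokens, an eightfold reduction in token count.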

Spatial encoding branch
We leverage scale-aware [63] and self-attentive [38] modulation for spatial token mixing. Self-attentive modulation, commonly known as multi-head self-attention (MSA), assumes a significant role in capturing contextual interdependencies among the discrete embedding tokens within a sequence (the flattened input $X'$), as expressed by equation (1).

An MSA sublayer consists of $n$ self-attention heads in parallel. Each head independently maps each patch token into query $Q$, key $K$, and value $V$ vectors through a linear projection $W_{QKV}$, and calculates attention scores $W_{attn}$ through a scaled dot-product similarity between the queries and keys with scale factor $\sqrt{D_h}$, signifying the importance of each patch token in relation to the others. These attention weights are then applied to the corresponding $V$ vectors, producing a weighted sum that effectively modulates the influence of each token. This dynamic modulation allows the model to adaptively prioritize relevant spatial information. Outputs from all heads are combined through a linear transformation $W_{msa}$, providing a comprehensive representation that integrates various perspectives.
The scale-aware modulation (SAM) instead modulates $V$ with the succession of the multi-head mixed convolution (MHMC) and the scale-aware aggregation (SAA) module, as defined in equation (2), designed to facilitate the incorporation of multi-scale contexts and adaptive modulation of tokens. The MHMC uniformly divides the input channels into multiple heads $j \in \{1, 2, \ldots, n\}$ (in our implementation $n = 3$) with $i \in \{1, 2, \ldots, C'/n\}$ channels in each head. Each single-channel feature map $X_i^j$ undergoes a depth-wise separable convolution $\mathrm{DWConv}_{k_j \times k_j \times k_j}$ with a distinct kernel size $k_j \in \{3, 5, 7\}$, enabling the discernment of a diverse spectrum of granularity features in an adaptive manner; these features are aggregated in the following SAA module. Specifically, the SAA forms $C'/n$ discrete shuffled groups, with each group selectively sampling a single channel from each of the partitioned heads produced by the MHMC. This shuffling integrates the features across granularities. Subsequently, the aggregated groups are processed by point-wise convolutions $\mathrm{Conv}_{1 \times 1 \times 1}$ in inverted bottleneck form (channel expansion = 2), applied before (together with InstanceNorm (IN) and ReLU activation) and after channel-wise concatenation, eventually serving as the weight modulator of the value $V$ through the Hadamard product $\odot$, in contrast to the matrix product in self-attention. In order to mitigate the computational load and model the progression from local to global dependency integration, we implement SAM modules in the initial two stages and MSA in the final two stages.
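The channel bookkeeping of the MHMC split and the SAA shuffle can be illustrated with plain array reshaping. The sketch below covers only the regrouping step and omits the depth-wise and point-wise convolutions; the function names `split_heads` and `shuffle_groups` are hypothetical and not taken from the released code.

```python
import numpy as np

def split_heads(x, n_heads=3):
    """Split channels (axis 0) uniformly into heads: (C, ...) -> (n, C/n, ...)."""
    c = x.shape[0]
    if c % n_heads != 0:
        raise ValueError("channel count must be divisible by the number of heads")
    return x.reshape(n_heads, c // n_heads, *x.shape[1:])

def shuffle_groups(heads):
    """Regroup (n, C/n, ...) into (C/n, n, ...): group i holds channel i of every head."""
    return np.swapaxes(heads, 0, 1)

# Six channels labelled 0..5 on a 1x1x1 feature map, split into n = 3 heads.
x = np.arange(6).reshape(6, 1, 1, 1)
heads = split_heads(x, n_heads=3)   # head 0: channels 0,1; head 1: 2,3; head 2: 4,5
groups = shuffle_groups(heads)      # group 0: channels 0,2,4; group 1: 1,3,5
```

In the full module each head would first be filtered with its own kernel size, so that every shuffled group mixes one channel from each granularity.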

Spectral encoding branch
We apply deep frequency filtering [8] to achieve explicit spectral feature modulation. Our approach is underpinned by the discrete Fourier transform (DFT) and its inverse counterpart (IDFT), which serve as a conduit between the spatial-domain and frequency-domain representations of digital images. For a 3D volume feature, its 3D-DFT $X_F(x, y, z)$ and 3D-IDFT $X(h, w, d)$ can be defined as:
$$X_F(x, y, z) = \sum_{h=0}^{H'-1} \sum_{w=0}^{W'-1} \sum_{d=0}^{D'-1} X(h, w, d)\, e^{-i 2\pi \left( \frac{xh}{H'} + \frac{yw}{W'} + \frac{zd}{D'} \right)},$$
$$X(h, w, d) = \frac{1}{H' W' D'} \sum_{x=0}^{H'-1} \sum_{y=0}^{W'-1} \sum_{z=0}^{D'-1} X_F(x, y, z)\, e^{i 2\pi \left( \frac{xh}{H'} + \frac{yw}{W'} + \frac{zd}{D'} \right)}. \quad (3)$$
By virtue of the conjugate symmetry of the DFT of a real-valued input, only half of the spectrum needs to be retained while preserving the entirety of the information. In practice, we apply the fast Fourier transform (FFT) [64] algorithm for efficient DFT computation. Specifically, the input features $X$ are first transformed via the FFT from the spatial domain into the frequency domain, in which each component of the resulting Fourier spectrum has an intrinsic global view, thus providing inherent advantages for modeling global context and long-range interactions. The frequency representation allows direct manipulation and filtering of the spectral characteristics of the features. Using a learnable gating mechanism, the GFs $W_{gf}$ modulate the frequencies by element-wise multiplication $\odot$ to adaptively adjust the feature responses across both local and global scales. Emphasizing low frequencies enhances large-scale patterns, while emphasizing high frequencies sharpens local details. The resultant filtered frequency representations are subsequently transformed back into the spatial domain through the inverse fast Fourier transform (IFFT), which de-mixes the updated global frequency representations to recover the local token features. The GF layer can be formulated as equation (4):
$$\hat{X} = \mathrm{IFFT}\left( W_{gf} \odot \mathrm{FFT}(X) \right). \quad (4)$$
According to the convolution theorem [64], a GF is equivalent to a depthwise global circular convolution. It therefore enables explicit tuning of spectral properties within the feature map while circumventing the computational inefficiency of conventional large convolutional kernels.
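The filtering pipeline described above (FFT, element-wise gating, IFFT) can be sketched with NumPy's real-input FFT routines, which also show the half-spectrum storage that conjugate symmetry permits. This is an illustrative sketch only: the function name `global_filter` is hypothetical, the filter here is a fixed array rather than a learned parameter, and the actual X-Fuse layer may differ in normalization and layout.

```python
import numpy as np

def global_filter(x, w_gf):
    """Apply a frequency-domain filter to a real feature map.

    x:    real array of shape (C, H, W, D)
    w_gf: complex array of shape (C, H, W, D // 2 + 1), one filter per channel
    """
    spatial_axes = (1, 2, 3)
    x_f = np.fft.rfftn(x, axes=spatial_axes)        # half spectrum along the last axis
    x_f = x_f * w_gf                                 # element-wise (Hadamard) modulation
    return np.fft.irfftn(x_f, s=x.shape[1:], axes=spatial_axes)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4, 6))

# An all-ones filter leaves the features unchanged.
identity = np.ones((2, 4, 4, 6 // 2 + 1), dtype=complex)
out_identity = global_filter(x, identity)

# Keeping only the zero-frequency component returns the per-channel mean everywhere.
dc_only = np.zeros_like(identity)
dc_only[:, 0, 0, 0] = 1.0
out_dc = global_filter(x, dc_only)
```

The two toy filters illustrate the global reach of each frequency component: a single retained coefficient already affects every voxel of the output.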

Decoder and skip connection
The encoded feature representations extracted from the two-branch encoders are fused at the bottleneck by simple channel-wise concatenation. On the skip connections, they are input into a residual block comprising two sequential 3×3×3 convolutional layers with instance normalization and ReLU activation before concatenation with the progressively upsampled decoder features. The integrated representations are then passed through another residual block of the same structure. For explicit architectural perturbation, trilinear upsampling and transpose convolutions are employed for decoding in the two branches, respectively; the two branches yield the final ensembled inference output and also provide consistency regularization during SSL.

SSL with CPS
Leveraging the abundance of unlabeled data, we employ a semi-supervised methodology rooted in CPS to attain resilient segmentation of organs and lesions within CT scans. We are given a set of $N$ labeled CT scans $D_l$ and a set of unlabeled CT scans $D_u$, where $x_i \in \mathbb{R}^{H \times W \times D}$ represents an input volume and $y_i^l \in \{0, 1\}^{H \times W \times D}$ denotes the one-hot label for each category. We train our segmentation model on the union of the two subsets, $D = D_l \cup D_u$, and our primary optimization objectives involve minimizing both the supervised loss $L_{seg}$ between the prediction and ground truth on $D_l$ and the mutual consistency loss $L_{cps}$ on $D_l$ and $D_u$. The total loss $L_{total}$ combines the two terms, where $\alpha$ represents the adaptive weighting coefficient that governs the balance between $L_{seg}$ and $L_{cps}$; its value increases with the progression of training and is determined as $1 - e^{-t}$ in our experimental setup, where $t$ corresponds to the current iteration. The individual loss components are given in equations (6) and (7).
Here, $f_A(\cdot)$ and $f_B(\cdot)$ are the softmax-normalized outputs of the two respective decoders $j \in \{A, B\}$, representing segmentation confidence maps, and $PL_A$ and $PL_B$ are the corresponding hard one-hot pseudo labels. In accordance with [16], $L_{seg}$ consists of Dice and cross-entropy (CE) losses [65], while only the CE loss is employed for the computation of $L_{cps}$.
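The cross pseudo supervision term can be sketched as follows: each decoder's softmax output is supervised by the hard pseudo label derived from the other decoder. The sketch is written under stated assumptions rather than as the authors' exact formulation: the additive combination `l_seg + alpha * l_cps` is assumed from the description of $\alpha$ as a balancing weight, and all function names are hypothetical.

```python
import math
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(probs, hard_labels, eps=1e-8):
    """Mean CE between class probabilities (C, ...) and integer labels (...)."""
    picked = np.take_along_axis(probs, hard_labels[None, ...], axis=0)[0]
    return float(-np.log(picked + eps).mean())

def cps_loss(logits_a, logits_b):
    """Each decoder is trained on the hard pseudo label of the other decoder."""
    f_a, f_b = softmax(logits_a), softmax(logits_b)
    pl_a, pl_b = f_a.argmax(axis=0), f_b.argmax(axis=0)
    return cross_entropy(f_a, pl_b) + cross_entropy(f_b, pl_a)

def ramp_up_weight(t):
    """alpha = 1 - exp(-t), as stated in the text."""
    return 1.0 - math.exp(-t)

def total_loss(l_seg, l_cps, t):
    """Assumed additive combination of the supervised and consistency terms."""
    return l_seg + ramp_up_weight(t) * l_cps

# Two classes, two voxels: decoder A is confident in class 0.
logits_a = np.array([[5.0, 5.0], [-5.0, -5.0]])
agree = cps_loss(logits_a, logits_a)       # decoders agree
disagree = cps_loss(logits_a, -logits_a)   # decoder B predicts class 1
```

The toy values show why unfiltered pseudo supervision can be harmful: when the decoders disagree confidently, the loss is large and pulls each decoder toward a label that may be wrong, which motivates the thresholding and uncertainty rectification introduced next.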

Pseudo label selection with adaptive threshold
We employ an adaptive threshold on the confidence map $f_j(x_i)$ for reliable pseudo-label selection, such that only predictions surpassing the threshold contribute to the pseudo-supervision loss. Here $\mathbb{1}(\cdot > \gamma)$ is the indicator function for confidence-based thresholding and $\gamma \in (0, 1)$ is the threshold. We follow the principle [25] that lower initial thresholds gather more pseudo labels to accelerate early learning; as model uncertainty on unlabeled data decreases during training, the threshold becomes more stringent to filter out noisy assignments. We set the initial value of $\gamma$ to $1/C$, where $C$ denotes the number of classes and $B$ is the batch size. Rather than re-evaluating all unlabeled data at every iteration to determine precise confidence, an exponential moving average enables efficient aggregation of confidence scores across iterations to estimate global trends, assigning exponentially decaying weights $1 - \lambda$ to prior observations with $\lambda = t / t_{max}$, where $t_{max}$ is the total number of iterations.
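A possible reading of the threshold update is sketched below. The exact update rule is an assumption: the sketch weights the previous threshold by $1 - \lambda$ and the mean maximum confidence of the current batch by $\lambda$, following the wording of the text, and the function names `update_threshold` and `select_mask` are hypothetical.

```python
import numpy as np

def update_threshold(gamma_prev, probs, t, t_max):
    """Moving-average update of the confidence threshold (assumed form).

    probs: softmax outputs of shape (B, C, ...) on the unlabeled batch.
    """
    lam = t / t_max
    batch_conf = float(probs.max(axis=1).mean())   # mean top-class confidence
    return (1.0 - lam) * gamma_prev + lam * batch_conf

def select_mask(probs, gamma):
    """Indicator 1(max-confidence > gamma): True where a pseudo label is kept."""
    return probs.max(axis=1) > gamma

num_classes = 4
gamma0 = 1.0 / num_classes                       # initial threshold 1/C
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.4, 0.3, 0.2, 0.1]])         # batch of two predictions
gamma1 = update_threshold(gamma0, probs, t=1, t_max=2)
keep = select_mask(probs, 0.5)
```

Starting from $1/C$ admits almost every prediction early on, and the threshold then tracks the model's growing confidence.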

Uncertainty guided mutual supervision
Motivated by prior work [26] that utilizes the Kullback-Leibler (KL) variance as a measure of prediction uncertainty between teacher-student models, this work incorporates the JS divergence into the CPS architecture to quantify and penalize inconsistent predictions from the perturbed decoders of X-Fuse. The JS divergence provides a mathematically principled, information-theoretic approach to quantifying the dissimilarity between two probability distributions:
$$JS(f_A \,\|\, f_B) = \frac{1}{2} KL(f_A \,\|\, M) + \frac{1}{2} KL(f_B \,\|\, M),$$
where $M = \frac{1}{2}(f_A + f_B)$ denotes the average distribution of the two predictions. Specifically, the consistency loss is activated to promote prediction agreement in cases where the JS variance is low, indicating a high degree of confidence in the consistency between the perturbed decoders. Conversely, when the JS variance is high, reflecting significant divergence between the decoder predictions, the consistency loss is largely omitted. Hence, we employ the JS divergence as a guiding metric for pseudo supervision, further facilitating the rectification of the consistency loss, as expressed in equation (11). Additionally, the last term in equation (11) is incorporated into the loss to prevent scenarios where the variance remains persistently high; $\varepsilon$ is a small value that prevents the denominator from being zero.
Overall, as depicted in figure 3, our framework integrates a dynamic confidence threshold for pseudo-label selection with additional uncertainty rectification, which yields model outputs of heightened consistency. Both elements collectively attenuate the impact of noisy labels, thereby enhancing the learning proficiency of our SSL method. The comprehensive details of the training procedure are expounded upon in algorithm 1.
Algorithm 1. Training procedure (per iteration):
1) Sample batches $B_l$ and $B_u$.
2) Compute the prediction outputs of the dual decoders.
3) Compute the supervised loss according to equation (6).
4) Update $\gamma_t$ in $B$ according to equation (9).
5) Generate $PL_A$ and $PL_B$ according to the dynamic threshold $\gamma_t$.
6) Use the JS divergence to rectify the consistency loss according to equation (11).
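The voxel-wise JS divergence between two decoder outputs can be computed directly from its definition as the average of two KL terms against the mean distribution. The sketch below is illustrative; the function names and the `eps` smoothing constant are assumptions, and it does not reproduce the rectified loss of equation (11).

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    """Voxel-wise KL(p || q) summed over the class axis (axis 0)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=0)

def js_div(p, q, eps=1e-8):
    """Voxel-wise Jensen-Shannon divergence between two class distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl_div(p, m, eps) + 0.5 * kl_div(q, m, eps)

# Two classes, three voxels: agreement, mild disagreement, total disagreement.
f_a = np.array([[0.9, 0.6, 1.0],
                [0.1, 0.4, 0.0]])
f_b = np.array([[0.9, 0.4, 0.0],
                [0.1, 0.6, 1.0]])
js = js_div(f_a, f_b)
```

Unlike the KL divergence, the JS divergence is symmetric in the two decoders and bounded above by ln 2, which makes it a convenient uncertainty proxy when neither decoder plays the role of a teacher.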

Tumor augmentation
To enhance the generalization of liver tumor segmentation across varied anomaly presentations, in Phase_3, we introduce TumorAug explicitly for the augmentation of tumor instances, which incorporates two distinct approaches as shown in figure 4: TumorCP [66] and TumorSyn [67]. TumorCP is directed toward tumor-labeled images. The process involves the isolation of tumor instances, followed by the arbitrary selection of 1-3 tumor cases to which a sequence of spatial transformations (including scale adjustments, rotation, mirroring, and elastic deformation) and contrast alterations is applied. The resulting copied or generated tumor instances are then pasted at a randomly determined position within the liver region delineated by X-Fuse in Phase_2. TumorCP is executed solely at the intra-patient level to maintain the consistency of context information. Conversely, TumorSyn is tasked with generating synthetic tumors on label-free images. This process entails a two-step morphological operation for shape and texture generation, both informed exclusively by clinical expertise, to effectively model authentic tumors. 1) Shape generation. Conforming to statistical distributions [67], small tumors manifest spherical configurations, while larger tumors tend to exhibit elliptical shapes. Consequently, tumor-like shapes are emulated with an ellipsoid defined as $\mathrm{ellip}(a, b, c)$, where $a$, $b$ and $c$ represent the lengths of the semi-axes and all follow a uniform distribution $U(0.75r, 1.25r)$ with $r \in \{4, 8, 16, 32\}$ corresponding to four defined sizes: tiny, small, medium, and large. Elastic deformations are subsequently employed to augment the diversity of shapes. 2) Texture generation. The Hounsfield unit values associated with liver tumor textures adhere to Gaussian distributions denoted as $N(\mu, \sigma_t)$, $\mu \sim U(30, \mu_t - 10)$, where $\mu_t$ and $\sigma_t$ represent the mean and standard deviation of the hepatic parenchyma. Cubic interpolation is subsequently applied to smooth the texture. Additionally, Gaussian blurring is applied to the regenerated texture in both approaches, contributing to the refinement of the tumor's edge.
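The shape and texture sampling of TumorSyn can be sketched as below. This is a simplified illustration under assumptions: it omits elastic deformation, cubic interpolation, and Gaussian blurring, the parenchyma statistics in the example are placeholder values, and all function names are hypothetical.

```python
import numpy as np

def sample_semi_axes(r, rng):
    """Draw semi-axis lengths a, b, c from U(0.75r, 1.25r)."""
    return rng.uniform(0.75 * r, 1.25 * r, size=3)

def ellipsoid_mask(shape, center, semi_axes):
    """Binary mask of an axis-aligned ellipsoid inside a volume of `shape`."""
    grids = np.ogrid[tuple(slice(0, s) for s in shape)]
    dist = sum(((g - c) / a) ** 2 for g, c, a in zip(grids, center, semi_axes))
    return dist <= 1.0

def sample_texture(mask, mu_t, sigma_t, rng):
    """Gaussian texture N(mu, sigma_t) inside the mask, with mu ~ U(30, mu_t - 10)."""
    mu = rng.uniform(30.0, mu_t - 10.0)
    texture = rng.normal(mu, sigma_t, size=mask.shape) * mask
    return texture, mu

rng = np.random.default_rng(0)
axes = sample_semi_axes(8, rng)                        # "small" tumor, r = 8
mask = ellipsoid_mask((32, 32, 32), (16, 16, 16), axes)
# Placeholder liver statistics (mean 100 HU, std 10 HU) chosen only for illustration.
texture, mu = sample_texture(mask, mu_t=100.0, sigma_t=10.0, rng=rng)
```

Sampling the tumor mean at least 10 HU below the parenchyma mean keeps the synthetic lesion darker than the surrounding liver in the generated texture.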

Dataset and preprocess
The dataset we employ in this study is an integration of a proprietary dataset and the public LiTS dataset [30]. The internally curated dataset comprises 700 abdominal CT volumes, all manifesting hepatic malignancies of variable shape, size, and contrast enhancement, acquired from Zhuhai People's Hospital. Among those, 200 exemplars were manually annotated by two experienced radiologists using ITK-SNAP [68] for liver, bones, and body region, and a subset of 50 scans, which were also labeled for tumor presence, was reserved for testing purposes. The dataset exhibits a mean resolution of 1.23 mm × 1.23 mm × 1.10 mm, with an average spatial dimension of 512 × 512 × 131. The LiTS dataset is utilized exclusively for the training of liver tumor segmentation in Phase_3. It comprises 131 contrast-enhanced portal phase abdominal CT scans, featuring a voxel spatial resolution of ([0.55-1] × [0.55-1] × [0.45-6.0]) mm³, with detailed annotations for both the liver and liver tumor regions.
The preprocessing pipeline, following the work in the MICCAI FLARE23 challenge [41], sequentially applies percentile-based intensity rescaling (5th and 95th), respacing to uniform voxel dimensions (1.5 mm, 1.5 mm, 2 mm), Z-normalization, and data augmentation via random cropping, flipping, and affine transformations. For the preliminary skin segmentation, the CT volumes were resized to 128 × 128 × 128, while a patch-based training approach with 96 × 96 × 96 crops was found to be optimal for organ and tumor delineation in the last two phases.
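The intensity steps of this pipeline can be sketched in a few lines. The sketch assumes that percentile-based rescaling means clipping to the 5th and 95th percentiles before Z-normalization, which the text does not state explicitly; the function names are hypothetical, and resampling itself is not performed, only the resulting grid size is computed.

```python
import numpy as np

def rescale_and_normalize(volume, lower_pct=5.0, upper_pct=95.0, eps=1e-8):
    """Clip intensities to the given percentiles, then apply Z-normalization."""
    lo, hi = np.percentile(volume, [lower_pct, upper_pct])
    clipped = np.clip(volume, lo, hi)
    return (clipped - clipped.mean()) / (clipped.std() + eps)

def resampled_shape(shape, spacing, target_spacing=(1.5, 1.5, 2.0)):
    """Grid size after respacing a volume to the target voxel dimensions."""
    return tuple(int(round(n * s / t)) for n, s, t in zip(shape, spacing, target_spacing))

normalized = rescale_and_normalize(np.arange(100.0))
# Average in-house scan: 512 x 512 x 131 voxels at 1.23 x 1.23 x 1.10 mm.
new_shape = resampled_shape((512, 512, 131), (1.23, 1.23, 1.10))
```

For the average in-house scan, respacing to (1.5, 1.5, 2) mm shrinks the grid to roughly 420 × 420 × 72 voxels, which is then covered by 96 × 96 × 96 training crops.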

Implementation details
Our method was implemented in the PyTorch and MONAI frameworks. Model training and inference were deployed on an NVIDIA A800 GPU. The CacheDataset module in MONAI enabled data preloading for expedited iteration. Optimization was performed with the Adam algorithm and a weight decay of $1 \times 10^{-5}$. The initial learning rate was $3 \times 10^{-4}$ with a cosine annealing schedule. Models were trained for up to 50 000 iterations with a batch size of 8.
We employed three metrics to evaluate segmentation performance [65]. The DSC measures the spatial overlap between the prediction and the ground truth (GT), ranging from 0 to 1, with 1 indicating complete overlap. The average surface distance (ASD) computes the average shortest distance between the surfaces of the prediction and the GT; lower values signify better adherence of the predicted surface to the GT. The 95% Hausdorff distance (HD95) calculates the maximum surface distance between the prediction and the GT after excluding the most extreme 5% of outliers, reflecting the degree of maximal deviation.
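As a concrete reference for the overlap metric, the DSC of two binary masks is twice the size of their intersection divided by the sum of their sizes. The sketch below is a minimal version; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not one stated in the text.

```python
import numpy as np

def dice_coefficient(pred, gt):
    """DSC = 2 |P ∩ G| / (|P| + |G|) for binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks overlap perfectly
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

pred = np.array([1, 1, 0, 0])
gt = np.array([1, 0, 0, 0])
partial = dice_coefficient(pred, gt)    # one shared voxel, sizes 2 and 1
perfect = dice_coefficient(gt, gt)
```

Because the denominator counts both masks, the DSC penalizes over-segmentation and under-segmentation alike, which is why it is reported alongside the surface-based ASD and HD95.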

Results analysis
To thoroughly assess the capabilities of our proposed method, we conduct extensive comparative analyses against prior state-of-the-art models under both supervised-only (SL) and SSL settings.

Comparison to state-of-the-art supervised methods
To validate the efficacy of our proposed X-Fuse model for the segmentation of anatomically pertinent organs (i.e. bones, liver) as well as pathological regions (i.e. liver tumors), we evaluate our approach under SL conditions with 100% labeled data against prevailing CNN-based models such as V-Net [34] and H-DenseUNet ⋄ [4], besides the highly configurable and modular architecture nnUNet ⋄ [69]; ViT-based models such as UNetFormer ⋄ [59] and SwinUNETR [5]; and contemporary hybrid models like CoTr [70] and APAUNet ⋄ [71] that synergistically couple the advantages of both paradigms, as summarized in table 1. ⋄ denotes that the corresponding method was applied to liver tumor delineation in its original implementation. Our proposed X-Fuse framework surpasses previous leading models by substantial margins across all key quantitative segmentation metrics. Specifically, for liver segmentation, X-Fuse exhibits substantial improvements, with a 2.41% enhancement in DSC, a notable 2.04 voxel reduction in HD95, and a 0.98 voxel decrement in ASD when compared to H-DenseUNet. Moreover, X-Fuse demonstrates favorable outcomes in bone segmentation, showing a marginal advantage over alternative methodologies, among which CNNs largely outperform their transformer counterparts. X-Fuse also displays superior performance in tumor segmentation, registering DSC increases of 4.17% and 2.91% over nnUNet and CoTr, respectively. These results underscore the efficacy and competitive edge of X-Fuse in various medical segmentation tasks.

Analysis on model components
To assess the contribution of the individual components of X-Fuse, comprehensive ablation studies are conducted in table 2, focusing on the significance of the encoder branches and bottleneck fusion. The findings reveal that using the spatial encoder alone surpasses CNN-transformer hybrid counterparts such as APAUNet by 1.14% and 3.9% DSC for bones and tumors, respectively, substantiating its efficacious encoding of multi-scale context. Meanwhile, the spectral encoder alone rivals reputed global models like SwinUNETR on all metrics for bone and liver in capturing long-range inter-dependencies, although it falls behind its spatial counterpart. Additionally, the bottleneck integration layer fusing the latent embeddings from both pathways enables X-Fuse to confer considerable improvements over the individual modules, e.g. 2.12% on tumor, and to outperform the ensemble of two separate networks each containing one type of encoder. This underscores the merits of a harmonious dual-encoder design that harnesses both spatial and spectral cues in a unified architecture.

Comparison to pervasive semi-supervised methods
The compared semi-supervised approaches in table 3 (all models are built upon V-Net for a fair comparison) encompass the seminal mean teacher self-ensembling paradigm (MT) [13] alongside its uncertainty-guided derivative UA-MT [73], uncertainty-attenuated pyramid consistency regularization via URPC [74], shape-aware adversarial learning through SASSNet [75], multi-task consistency enforcement by DTC [56], and cross-pseudo-supervision-based mutual consistency learning using the dual-decoder MC-Net [54]. In comparison to the supervised V-Net trained solely on annotated samples in table 1, all semi-supervised methodologies harnessing unlabeled exemplars confer notable performance enhancements, thereby substantiating the benefit of exploiting unlabeled data. The UA-MT method eclipses the vanilla MT approach by using uncertainty to guide student model optimization, underscoring the instrumental role of uncertainty regularization. Conversely, the DTC model fails to exhibit advantages over the MT model, revealing the limited positive impact of transformation consistency on our task. MC-Net outperforms URPC and UA-MT across metrics, validating the robustness of mutual consistency. By assimilating implicit geometric clues, SASSNet attains top performance on the HD95 and NSD metrics on bones and liver. Our framework surpasses prevailing semi-supervised techniques in terms of DSC, demonstrating a 1.27% improvement margin on tumor over the next top-performing method, SASSNet. By virtue of uncertainty modulation, synergistically complemented by a dynamic thresholding strategy, our approach further refines the rudimentary CPS in MC-Net to 94.35%, 94.94% and 84.89% DSC for bones, liver and liver tumor, respectively.

Performance under diverse labeling budgets
Figure 5 investigates the impact of our proposed semi-supervised approach under varying labeling budgets, ranging from a modest 10% to a more extensive 80% of the labeled data. Our method performs consistently better than prior art across all proportions of labeled data. Concerning DSC on tumor, our method with only 10% of the labeled data deviates by a mere 0.29% from the SL baseline (V-Net) trained with 60%. This outcome underscores the capacity of our approach to substantially reduce labeled data requirements. Additionally, with a sparse 20% of labeled CT scans, our method comes within around 0.8% of the plain MT architecture trained with three times as much labeled data across the three segmentation objectives, demonstrating its significant advantages under limited labeled data regimes.

Analysis on loss regulatory components
Our SSL framework incorporates three integral constituents: the foundational co-training mechanism (baseline) grounded in CPS, uncertainty rectification through JS variance, and dynamic thresholding for pseudo labeling. To discern the efficacy of each module on X-Fuse, we conducted a systematic ablation study, incrementally introducing these components onto the baseline. The experiments were run with 50% and 100% of the labeled images, and the resulting performance metrics across configurations are cataloged in table 4. Evidently, the modules act in clear synergy, unlocking progressive gains to deliver state-of-the-art semi-supervised segmentation. Concretely, with half of the labeled data, the dynamic thresholding imparted by the CPL strategy improves the baseline by an additional 2.03% and 1.89% on tumors and bones. Uncertainty rectification adds a further 0.67% to the baseline performance on liver delineation. The best results are attained by combining all components, with gains of more than 1% over each individual one, lifting our framework to 93.12% on bones and 92.56% on liver, matching the performance of the baseline that relies solely on labeled samples in table 1. When leveraging all labeled data without any regulatory mechanism for pseudo-labeling, the plain mutual supervision procedure attains a commendable 83.57% DSC on tumor segmentation, a 4.71% increase relative to the supervised-only baseline, demonstrating the substantive efficacy of CPS. Applying both regularization techniques yields the finest results on bones (95.42%), liver (96.26%), and tumors (85.20%).
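To make the interplay of the three components concrete, the sketch below shows one plausible way to combine them for a single supervision direction (decoder B pseudo-labels supervising decoder A). It is not the authors' implementation: the exponential weighting by JS divergence, the linear threshold schedule, the normalization by the weight sum, and all function names are assumptions chosen for illustration, and plain Python lists stand in for per-voxel softmax outputs.

```python
import math

EPS = 1e-8  # numerical guard for log(0)


def kl_divergence(p, q):
    """KL(p || q) between two discrete class distributions."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence, used here as a per-voxel disagreement score."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def dynamic_threshold(step, total_steps, start=0.5, end=0.9):
    """Hypothetical linear schedule for the pseudo-label confidence threshold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def rectified_cps_loss(probs_a, probs_b, threshold):
    """Cross-entropy of decoder A against decoder B pseudo labels.

    Voxels whose pseudo-label confidence is below `threshold` are skipped,
    and the remaining ones are down-weighted by exp(-JS) (assumed weighting).
    """
    total, weight_sum = 0.0, 0.0
    for p_a, p_b in zip(probs_a, probs_b):
        confidence = max(p_b)
        if confidence < threshold:
            continue  # dynamic thresholding: ignore unconfident pseudo labels
        pseudo_label = p_b.index(confidence)
        weight = math.exp(-js_divergence(p_a, p_b))  # uncertainty rectification
        total += weight * -math.log(p_a[pseudo_label] + EPS)
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0


# Two voxels, two classes; values are made up for illustration.
decoder_a = [[0.9, 0.1], [0.6, 0.4]]
decoder_b = [[0.8, 0.2], [0.55, 0.45]]
tau = dynamic_threshold(step=50, total_steps=100)
loss_a = rectified_cps_loss(decoder_a, decoder_b, tau)
```

In a symmetric CPS setup the same function would be applied a second time with the decoder roles swapped, and the two terms added to the supervised loss on labeled scans.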

Tumor augmentation impact
We examine the components of our TumorAug technique, i.e. TumorSyn and TumorCP, within both the SL and SSL frameworks in figure 6. In the SL case, TumorSyn yields a perceptible gain of 3.56% in DSC while TumorCP brings an impressive 6.13% surge, indicating the potential of our tumor augmentation approach to further refine outcomes by capturing diverse tumor features. Although model capacity appears to approach saturation, as observed in table 4, our SSL method better exploits TumorSyn and TumorCP by providing more diverse and confident pseudo labels, enabling both to accrue additional gains atop the augmentation-free baseline. Specifically, TumorSyn and TumorCP raise DSC by 1.5% and 3.1% respectively. Notably, the combined use of TumorSyn and TumorCP lifts performance over the raw purely supervised variant by an appreciable 8.35% and 10.49% under the two scenarios. Overall, TumorAug fosters richer representations and enhances the generalization capacity of the model across heterogeneous lesion manifestations.
Figure 7 shows qualitative results of different models under the SL and SSL segmentation contexts. For bone and liver segmentation, all models demonstrate a comparable level of performance, except that SwinUNETR and V-Net exhibit over-segmentation and under-segmentation of the liver respectively, while X-Fuse achieves heightened sensitivity in both learning contexts. Regarding liver tumor segmentation, our model exhibits a commendable capability for delineating fine-grained morphological details and minute pathological structures, which can be attributed to its innate multi-scale representational ability. Furthermore, the integration of tumor augmentation techniques strengthens the fine-grained characterization of the manifold morphological variability of liver tumors.
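As an illustration of the training-free copy-paste idea behind TumorCP, the sketch below extracts tumor voxels from one labeled 2D slice, applies a simple transform (a horizontal flip), and pastes them into a target slice. The label ids, the helper names and the rule that pasted voxels may only overwrite liver tissue are assumptions made for this example; the transforms actually used by TumorCP are not specified in this section.

```python
LIVER, TUMOR = 1, 2  # hypothetical label ids


def extract_tumor(image, label):
    """Return tumor voxels as (row, col, intensity) relative to their bounding box."""
    coords = [(r, c) for r, row in enumerate(label)
              for c, v in enumerate(row) if v == TUMOR]
    if not coords:
        return []
    r0 = min(r for r, _ in coords)
    c0 = min(c for _, c in coords)
    return [(r - r0, c - c0, image[r][c]) for r, c in coords]


def flip_horizontal(patch):
    """Mirror the patch left to right inside its own bounding box."""
    if not patch:
        return []
    width = max(c for _, c, _ in patch)
    return [(r, width - c, v) for r, c, v in patch]


def paste_tumor(image, label, patch, top, left):
    """Paste the patch into copies of image/label, overwriting liver voxels only."""
    new_image = [row[:] for row in image]
    new_label = [row[:] for row in label]
    for dr, dc, value in patch:
        r, c = top + dr, left + dc
        inside = 0 <= r < len(label) and 0 <= c < len(label[0])
        if inside and label[r][c] == LIVER:  # assumed anatomical constraint
            new_image[r][c] = value
            new_label[r][c] = TUMOR
    return new_image, new_label


# Source slice with a three-voxel tumor, target slice with liver only.
src_image = [[10, 10, 10, 10], [10, 40, 45, 10], [10, 50, 10, 10], [10, 10, 10, 10]]
src_label = [[0, 1, 1, 0], [1, 2, 2, 1], [1, 2, 1, 1], [0, 1, 1, 0]]
tgt_image = [[10] * 4 for _ in range(4)]
tgt_label = [[0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 1, 1, 0]]

patch = flip_horizontal(extract_tumor(src_image, src_label))
aug_image, aug_label = paste_tumor(tgt_image, tgt_label, patch, top=2, left=2)
```

Working on copies keeps the original scan untouched, so the same source tumor can be pasted at several locations to generate multiple augmented samples.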

Discussion
In this work, we present a cascaded three-phase framework devised for liver tumor intervention, focusing on the delineation of AVO and OAR, encompassing critical structures such as bones and the liver, and the precise identification of GTV, specifically liver tumors. In the initial phase, rapid localization of the body is achieved with FasterNet. Subsequently, in Phases 2 and 3, the focus shifts towards the precise delineation of the relevant organs. To fulfill this objective, we introduce X-Fuse, which is applicable to both fully supervised and SSL settings. The architecture of X-Fuse integrates two discrete encoders, incorporating spatial and spectral token mixers respectively within a MetaFormer structure. Under full supervision, our spatial encoder outperforms models founded on convnets, transformers, and their hybrid counterparts, while the spectral encoder performs at a comparable level. The effectiveness of our model is further amplified through the fusion of latent representations, where spatial and frequency features complement each other and enrich the encoded information. The resulting p-values consistently fall below the significance threshold of 0.05, confirming the significant improvements achieved by our proposed method. By harnessing additional unlabeled data and applying uncertainty rectification, as determined by the JS variance of the dual decoder outputs, together with CPL, our methodology exploits the potential of CPS beyond prevailing teacher-student and multi-task consistency approaches. Ablation studies further show that each individual module makes a nearly equal contribution to the overall efficacy of the method. With the simple tumor augmentation techniques TumorCP and TumorSyn, we demonstrate that tumor-centric augmentation markedly refines segmentation performance to 89.53% by increasing variation in malignant tissue patterns. The resulting model generalizes better across heterogeneous images and identifies intricate morphological tumor characteristics.
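The hand-off between phases in such a cascade typically amounts to cropping the volume to the bounding box of the previous phase's mask, for example restricting tumor segmentation to the liver region. The following is a minimal sketch of that step on nested lists; the function names and the optional safety margin are assumptions for illustration rather than details of the released pipeline.

```python
def bounding_box(mask, margin=0):
    """Half-open (start, stop) bounds per axis around nonzero voxels, or None."""
    coords = [(z, y, x)
              for z, plane in enumerate(mask)
              for y, row in enumerate(plane)
              for x, v in enumerate(row) if v]
    if not coords:
        return None
    shape = (len(mask), len(mask[0]), len(mask[0][0]))
    box = []
    for axis in range(3):
        lo = max(min(c[axis] for c in coords) - margin, 0)
        hi = min(max(c[axis] for c in coords) + 1 + margin, shape[axis])
        box.append((lo, hi))
    return tuple(box)


def crop(volume, box):
    """Crop a z/y/x nested-list volume to the given bounding box."""
    (z0, z1), (y0, y1), (x0, x1) = box
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


# Toy 2 x 4 x 4 volume whose intensities encode their own coordinates.
volume = [[[z * 100 + y * 10 + x for x in range(4)] for y in range(4)]
          for z in range(2)]
# Hypothetical phase-2 liver mask covering the central 2 x 2 block of each slice.
liver_mask = [[[1 if 1 <= y <= 2 and 1 <= x <= 2 else 0 for x in range(4)]
               for y in range(4)] for z in range(2)]

box = bounding_box(liver_mask)
liver_roi = crop(volume, box)  # input region for the tumor phase
```

Keeping the box alongside the crop allows the tumor prediction to be written back into the full-resolution volume at the same offsets.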
Although our method yields promising liver tumor segmentation performance in contrast-enhanced CTs, its generalizability to non-contrast-enhanced CT scans remains limited. As illustrated in figure 8, our method, along with all prior art, fails under supervised learning conditions where models are trained exclusively with contrast-enhanced CT images. Incorporating unlabeled non-contrast scans and the TumorAug technique significantly bolsters domain adaptation, as exemplified in Case 1. However, our approach encounters difficulties when images are corrupted by noise stemming from iodized oil embolization or the administration of chemotherapeutic agents, as evidenced in Case 2.
According to the comparisons in figure 9, although our methodology essentially entails an ensemble of two networks, the complexity of our model is only marginally higher than that of the baseline V-Net. Importantly, it falls significantly below that of the semi-supervised two-decoder MC-Net under equivalent hyperparameter settings, such as the number of layers and channels per layer. Empirically, for a representative test case, a CT volume of dimensions 512 × 512 × 300, an image size commonly encountered in clinical practice, the total inference time per scan of our proposed architecture is 30 seconds. This reflects the efficiency of our streamlined framework, which maximizes accuracy while minimizing computational cost through carefully optimized components tailored for the automated analysis of medical images.
Despite promising performance for liver tumor segmentation in CT imaging, our work has several limitations that provide avenues for future improvements:
• Limited encoder interaction: the fusion strategy in X-Fuse simply concatenates spatial and spectral encoder outputs. More complex interactions between the parallel pathways, such as recursive feedback or dual attention flows, could further enrich the integrated representation. Architectures that enable tighter coupling and bidirectional propagation between the encoders merit additional exploration.
• Single organ coverage: we solely focused on segmenting the liver as the primary organ at risk. Safe interventional planning, however, necessitates incorporating additional abdominal organs like stomach, intestines and spleen that could undergo collateral damage. Expanding the anatomical breadth to multiple organs would provide more comprehensive risk assessments during therapy.

Conclusion
In this work, we have presented a multiphase cascaded DL workflow specifically optimized to empower targeted image-guided intervention for liver tumors. Leveraging a MetaFormer architecture, we develop two models for automated anatomical segmentation: FasterNet and X-Fuse. Through coordinated assimilation of dual-domain spatial and spectral feature streams that enrich latent encodings, X-Fuse sets new standards in precision contouring of AVO (bones), OAR (liver), and GTV (liver tumor). Semi-supervised CPL along with JS variance rectification applied on X-Fuse further minimizes error propagation, conferring resilience against limited annotations, especially for diverse liver lesions. Our simple tumor data augmentation techniques substantially enhance the representation of lesion heterogeneity. Through these advances, X-Fuse strikes a balance between segmentation accuracy, efficiency and generalizability, which is critical for integration into safety-critical liver tumor targeting workflows. Our segmentation pipeline marks a major step towards clinically viable automated assistance for precision image-guided liver tumor procedures.
Looking ahead, broadening anatomical coverage through multi-organ and multi-modal contouring remains vital to enable responsible translation towards intelligent intra-operative guidance systems for liver tumor intervention.

P Lyu et al
Figure 1. An overview of the three-phase cascade network. The initial phase aims for rapid localization of the body region through coarse binary segmentation of the skin. Subsequently, the second phase concentrates on the segmentation of both bone and liver structures. In the final phase, the network delineates liver tumors within the cropped liver region.

Figure 2. Illustration of the architectural framework employed for different phases, grounded in the MetaFormer paradigm. In Phase_1, FasterNet is implemented utilizing partial convolution (PConv) for expeditious processing. Transitioning to X-Fuse for Phase_2 and 3, the model integrates scale-aware modulation (SAM) and multi-head self-attention (MSA) as token mixer blocks within the spatial encoder branch, complemented by global filter (GF) in the spectral encoder branch.

Figure 3. Illustration of our framework for semi-supervised medical image segmentation. Outputs generated from X-Fuse are subjected to cross pseudo supervision rectified by uncertainty estimated with Jensen-Shannon (JS) divergence and scheduled with the curriculum pseudo labeling (CPL) strategy.

Figure 4. Tumor augmentation (TumorAug) incorporates two approaches: TumorCP isolates and transforms existing tumors from tumor-labeled images before pasting them back, while TumorSyn statistically generates synthetic tumor shapes and textures for label-free images. Original tumor contours are delineated in red while augmented ones are marked in blue.

Figure 5. Comparative analysis of various methods across different percentages of labeled CT scans.

Figure 6. Comparative evaluation of the standalone and synergistic impact of TumorSyn and TumorCP augmentation techniques under SL and SSL frameworks for liver tumor segmentation.

Figure 7. Qualitative visualization of liver, bone, and tumor segmentation performance by different methods in both supervised learning (purple) and semi-supervised learning (blue) contexts. Segmentation predictions (red) and ground truth labels (green) are presented through 3D rendered mask overlays for liver and bone cases (a) and 2D contours for tumor instances (b).

Figure 8. Qualitative visualization of tumor segmentation performance in non-contrast CTs by different methods in both supervised learning (purple) and semi-supervised learning (blue) contexts. Segmentation predictions (red) and ground truth labels (green) are presented via 2D contours for tumor instances.

Table 1. Evaluation of X-Fuse against state-of-the-art fully-supervised models in terms of DSC(%)↑, HD95(voxel)↓ and ASD(voxel)↓. Best results are marked in bold. * and † denote statistical significance at p ⩽ 0.05 and p ⩽ 0.01, respectively, based on a pairwise t-test compared with our method.

Table 2. Ablation studies quantifying the contribution of individual components including spatial encoder, spectral encoder, and bottleneck fusion on the test set. * and † denote statistical significance at p ⩽ 0.05 and p ⩽ 0.01, respectively, based on a pairwise t-test compared with our method.

Table 3. Quantitative comparisons of semi-supervised segmentation models on the test set in terms of DSC(%)↑, HD95(voxel)↓ and ASD(voxel)↓. Best results are marked in bold.
* and † denote statistical significance at p ⩽ 0.05 and p ⩽ 0.01 respectively, based on a pairwise t-test compared with our method.

Table 4. Ablation studies quantifying the individual impact of key loss regulatory components, including CPS, uncertainty rectification, and dynamic thresholding. * and † denote statistical significance at p ⩽ 0.05 and p ⩽ 0.01, respectively, based on a pairwise t-test compared with our method.