IWNeXt: an image-wavelet domain ConvNeXt-based network for self-supervised multi-contrast MRI reconstruction

Objective. Multi-contrast magnetic resonance imaging (MC MRI) can obtain more comprehensive anatomical information of the same scanning object but requires a longer acquisition time than single-contrast MRI. To accelerate MC MRI, recent studies collect only partial k-space data of one modality (the target contrast) and reconstruct the remaining non-sampled measurements using a deep learning-based model with the assistance of another, fully sampled modality (the reference contrast). However, most MC MRI reconstruction methods perform image domain reconstruction with conventional CNN-based structures under full supervision, which ignores the prior information from reference contrast images in other sparse domains and requires fully sampled target contrast data. In addition, because of their limited receptive field, it is difficult for conventional CNN-based networks to build high-quality non-local dependencies. Approach. In this paper, we propose an Image-Wavelet domain ConvNeXt-based network (IWNeXt) for self-supervised MC MRI reconstruction. Firstly, INeXt and WNeXt, both based on ConvNeXt, reconstruct undersampled target contrast data in the image domain and refine the initial reconstructed result in the wavelet domain, respectively. To generate more tissue details in the refinement stage, reference contrast wavelet sub-bands are used as additional supplementary information for wavelet domain reconstruction. Then we design a novel attention ConvNeXt block for feature extraction, which can capture the non-local information of the MC image. Finally, a cross-domain consistency loss is designed for self-supervised learning. In particular, the frequency domain consistency loss deduces the non-sampled data, while the image and wavelet domain consistency losses retain more high-frequency information in the final reconstruction. Main results. Numerous experiments are conducted on the HCP dataset and the M4Raw dataset with different sampling trajectories.
Compared with DuDoRNet, our model improves the peak signal-to-noise ratio by up to 1.651 dB. Significance. IWNeXt is a potential cross-domain method that can enhance the accuracy of MC MRI reconstruction and reduce reliance on fully sampled target contrast images.


Introduction
Magnetic resonance imaging (MRI) is a high-resolution imaging technique widely used for the multidirectional visualization of human soft tissue in a non-invasive manner. As the pulse sequence changes in the scanning protocol, different contrast MR images can be acquired, which reflect the tissue features under different intensities. For example, T1-Weighted (T1W) images have distinct anatomical appearances, T2-Weighted (T2W) images are useful in observing water and fat content, and fluid-attenuated inversion recovery (FLAIR) images can clearly display edema and inflammation (Menze et al 2015). Compared with single contrast, multi-contrast (MC) MRI provides richer anatomical information for clinical diagnosis. However, MC MRI takes a long time because each contrast image typically requires an additional scan (Sun et al 2020). The lengthy acquisition may increase patient body movement, resulting in more motion artifacts. Consequently, there is great interest in accelerating MC MRI without losing image content.
The common method for accelerating MRI is to reduce the amount of data acquired in the frequency domain (k-space), but this undersampling operation causes severe artifacts and detail blurring in the spatial domain. To reconstruct undersampled MR images, several algorithms based on compressed sensing (CS) theory (Lustig et al 2007, Bilgic et al 2011, Peng et al 2015, Lai et al 2016, Zhu et al 2019, Wang et al 2021) have been proposed by constructing sparse regularization terms. However, CS-based methods hardly recover clear texture structures when the acceleration factor is high, and tedious iterative steps make the optimization process time-consuming. Recently, benefiting from the rapid development of computer vision, deep learning (DL) based methods have shown superior performance over CS-based algorithms. These methods can be classified into two categories: data-driven methods (Wang et al 2016, Shaul et al 2020, Zhou et al 2023b) and model-driven methods (Schlemper et al 2018, Ramzi et al 2022, Zhang et al 2023). In data-driven methods, the network learns an end-to-end mapping between the undersampled image and the fully sampled image. Meanwhile, various advanced structures have been applied to improve the accuracy of this nonlinear mapping, such as the generative adversarial network (GAN) (Quan et al 2018, Li et al 2021, Liu et al 2022a), the Transformer (Hong et al 2023, Wu et al 2023, Lyu et al 2023a), and so on. Inspired by conventional optimization algorithms, model-driven methods solve the inverse problem in an iterative manner, which performs more interpretable reconstruction than data-driven methods. However, most DL-based methods are limited to single-contrast reconstruction, ignoring the complementary information between different contrast images.
DL-based methods for MC MRI acceleration can use contrast images acquired with shorter scanning times (e.g. T1W images) as fully sampled reference images to assist in reconstructing undersampled target contrast images that require longer scanning times (e.g. T2W and FLAIR images) (Zhou and Zhou 2020, Liu et al 2021a, Xuan et al 2022). Although these methods can recover high-fidelity target contrast images under high acceleration factors, some limitations remain: (1) in most cases, the feature fusion and reconstruction are performed in the image domain, which does not fully utilize the complementary information in other sparse domains. Some cross-domain methods (Zhou and Zhou 2020, Lyu et al 2022) reconstruct undersampled target contrast images in the frequency domain and image domain. However, the convolution operation is relatively unsuitable for frequency domain data recovery (Tong et al 2022). (2) The reconstruction backbone for MC MRI usually adopts traditional CNN-based architectures, such as U-Net and the residual stack network, which struggle to establish non-local dependency. Vision Transformers (Dosovitskiy et al 2020, Touvron et al 2021, Liu et al 2021b) can effectively capture the non-local dependency of the MC MR image (Lyu et al 2022, Zhou et al 2023b), but the implementation of Transformers demands a large number of training parameters. On the other hand, the large patch sizes hardly satisfy the translational equivariance property required for biomedical image processing (Ahmed et al 2023). (3) Most existing DL-based MC models rely on fully sampled target contrast images for supervised training. Since the acquisition of complete target contrast data takes a long time, the final imaging result may contain more motion artifacts and system noise, making it impossible to meet the requirements of full supervision.
To address the above issues, we propose an Image-Wavelet domain ConvNeXt-based network (IWNeXt) for self-supervised MC MRI reconstruction. The proposed method can not only fully utilize the sparse prior of MC images but also capture high-quality non-local dependencies when fully sampled target contrast images are unavailable. Firstly, a parallel cascade network is designed to reconstruct two re-undersampled target contrast data and fuse fully sampled reference contrast data in the image domain and wavelet domain alternately. Compared with k-space data, the wavelet transform is able to provide rich high- and low-frequency component information at the image level (Guo et al 2017, Tong et al 2022). Secondly, unlike modeling non-local dependencies with Transformers, two lightweight ConvNeXt-based subnetworks are constructed according to the characteristics of the different domains. Specifically, INeXt with a stacked structure initially reconstructs undersampled target contrast data in the image domain, and WNeXt with a multi-level residual structure refines the initial reconstructed result in the wavelet domain. Moreover, inspired by the hierarchical structure of Liu et al (2022b), a modified ConvNeXt block is designed as the backbone of the subnetworks, in which a parameter-free attention mechanism (Yang et al 2021) is introduced to better fuse different contrast images. Finally, a novel cross-domain consistency loss guides the self-supervised learning, including a frequency domain consistency loss, an image domain consistency loss, and a wavelet domain consistency loss. By maximizing the similarity between reconstructed k-space data and raw undersampled target k-space data, the frequency domain consistency loss promotes IWNeXt to infer the non-sampled measurements. Meanwhile, the image domain and wavelet domain consistency losses enforce the apparent similarity of the two outputs. Such dual-domain loss functions can efficiently incorporate the supplementary information from
reference contrast images for MC MRI reconstruction.
Our main contributions can be summarized as follows: (1) An image-wavelet domain self-supervised network, IWNeXt, is proposed for MC MRI reconstruction. By fusing the spatial and wavelet information of the reference contrast image at different reconstruction stages, IWNeXt can recover more detailed image contents.
(2) The INeXt and the WNeXt based on ConvNeXt are designed for the image domain and the wavelet domain reconstruction and fusion, in which a novel Attention ConvNeXt Block (ACB) is constructed to enhance the non-local feature extraction ability of the network.
(3) A cross-domain consistency loss is proposed for self-supervised optimization. It can fully use the prior knowledge of the reference contrast image in the image domain and the wavelet domain to improve MC reconstruction accuracy.

However, the aforementioned methods focus on image domain reconstruction and do not consider the gain effect from other sparse domains. Moreover, fully supervised training limits the clinical implementation of these methods. In this paper, we develop a cross-domain network for MC MRI reconstruction and optimize its parameters using only undersampled target contrast images.

Cross-domain MRI reconstruction
Cross-domain methods that combine the image domain with other transform domains have been developed for single-contrast MRI reconstruction. Eo et al (2018) proposed a deep CNN structure called KIKI-net, in which undersampled images were reconstructed alternately in the image domain and k-space. Shaul et al (2020) designed a cross-domain GAN composed of two U-shaped generators to estimate the missing k-space lines and refine spatial details. Wang et al (2020b) designed a three-stage cascade network that integrates feature representations from the image, frequency, and wavelet domains. Ran et al (2021) divided a parallel network into two branches for dual-domain reconstruction and fused the outputs of the different branches during the iterative process. Tong et al (2022) proposed HIWDNet, which used a hybrid structure network to handle image-wavelet domain reconstructed results. Hong et al (2023) introduced the UFormer into cross-domain dynamic MRI reconstruction and employed the undersampling mask network to capture the temporal correlation of input images. There are also some cross-domain MC MRI reconstruction methods. For example, Zhou and Zhou (2020) proposed a recurrent network named DuDoRNet. It utilized the T1W prior in the image and frequency domains to reconstruct T2W or FLAIR images. Similarly, Lyu et al (2022) applied the Swin Transformer as the reconstruction unit in a dual-domain recurrent network, which improved the reconstruction quality of target contrast images.
However, DL-based reconstruction models are better at processing image domain data than frequency domain data. It is worth noting that the wavelet transform has a powerful ability to depict different spatial-frequency information and is widely used in single-contrast reconstruction tasks (Ramanarayanan et al 2020, Wang et al 2021, Aghabiglou and Eksioglu 2022). Therefore, in addition to image domain reconstruction, we add a wavelet domain refinement stage during MC MRI reconstruction, which can generate more high-frequency details in the final results.
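To make the image-wavelet relationship concrete, a one-level 2D DWT/IDWT pair can be sketched in NumPy. The paper does not state which wavelet basis it uses, so the Haar wavelet below is an illustrative assumption; the four sub-bands (LL, LH, HL, HH) correspond to the low- and high-frequency components mentioned above.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT: returns (LL, LH, HL, HH) sub-bands.

    Minimal sketch of the wavelet decomposition used for the
    wavelet-domain refinement stage; Haar is an illustrative choice.
    """
    a = (img[0::2, :] + img[1::2, :]) / np.sqrt(2)  # row low-pass
    d = (img[0::2, :] - img[1::2, :]) / np.sqrt(2)  # row high-pass
    LL = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)
    LH = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)
    HL = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2)
    HH = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2)
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    a = np.zeros((LL.shape[0], 2 * LL.shape[1]))
    d = np.zeros_like(a)
    a[:, 0::2], a[:, 1::2] = (LL + LH) / np.sqrt(2), (LL - LH) / np.sqrt(2)
    d[:, 0::2], d[:, 1::2] = (HL + HH) / np.sqrt(2), (HL - HH) / np.sqrt(2)
    img = np.zeros((2 * a.shape[0], a.shape[1]))
    img[0::2, :], img[1::2, :] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
    return img
```

Because the transform is orthogonal and invertible, refining the sub-bands and applying the IDWT loses no information relative to working directly on the image.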

Self-supervised MRI reconstruction
To eliminate the need for fully sampled data as the reference label in model training, self-supervised MRI reconstruction methods have recently been proposed. Wang et al (2020a) combined the traditional regularization term and the data consistency loss to perform self-supervised reconstruction. Yaman et al (2020) proposed a self-supervised physics-guided reconstruction method named SSDU. In SSDU, the raw undersampled k-space data was divided into two augmented measurements: one was reconstructed by the unrolled network, and the other was used to optimize the mapping parameters. Following the core idea of SSDU and its variants (Yaman et al 2021, Yaman et al 2022), some self-supervised reconstruction methods have been explored by applying similar training strategies. For example, Gan et al (2021) developed a self-supervised RED-based network that estimates different undersampled measurements simultaneously. Hu et al (2021) designed a self-supervised ISTA-Net with a parallel architecture to iteratively recover two re-undersampled MR images. Zhou et al (2022) constructed a triple-branch variational network for self-supervised dual-domain MRI reconstruction, in which the raw undersampled images are also used as reconstruction targets. Unlike the training strategy of SSDU, Korkmaz et al (2022) added a pre-training image synthesis task to mine deep prior information by adversarial learning, and then a Transformer-based generator was further trained for self-supervised reconstruction.
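The SSDU-style partition of the acquired k-space locations into two disjoint sets can be sketched as follows. The split ratio `rho` and the uniform-random selection are illustrative assumptions, not the exact strategy of any cited method:

```python
import numpy as np

def split_mask(mask, rho=0.5, seed=0):
    """Split an undersampling mask into two disjoint sub-masks.

    A fraction `rho` of the acquired k-space locations goes to the
    loss mask, the remainder to the training mask. Sketch of the
    SSDU-style re-undersampling strategy; names are illustrative.
    """
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(mask)                 # acquired locations
    loss_idx = rng.choice(idx, size=int(rho * idx.size), replace=False)
    loss_mask = np.zeros_like(mask)
    loss_mask.flat[loss_idx] = 1
    train_mask = mask - loss_mask              # disjoint remainder
    return train_mask, loss_mask
```

The network then reconstructs from the training subset while the loss subset supplies the supervision signal, so no fully sampled label is ever needed.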
However, these methods only reconstruct single-contrast MR images without taking advantage of the complementary relationship between different modalities. Zhou (2023a) proposed a self-supervised Transformer-based model called DSFormer to reconstruct undersampled target contrast data in the spatial domain, which contains a large number of network parameters and has limited detail recovery capabilities. In contrast, we design a lightweight image-wavelet domain network based on ConvNeXt and incorporate a novel cross-domain loss function to achieve high-fidelity self-supervised reconstruction.

Problem formulation
The purpose of MRI acceleration methods is to reconstruct a fully sampled target contrast image x_tar from undersampled k-space measurements y. This inverse problem can be expressed as:

$$ y = M \odot F(x_{tar}) + \varepsilon, \qquad (1) $$

where M is the undersampling mask, F represents the fast Fourier transform (FFT), and ε is the white noise. Since directly solving the ill-posed problem equation (1) will cause artifacts, CS-based methods incorporate prior knowledge of x_tar into the single-contrast reconstruction and formulate an optimization problem:

$$ \hat{x}_{tar} = \arg\min_{x} \|M \odot F(x) - y\|_2^2 + \lambda R(x), \qquad (2) $$

where the first term performs the data fidelity operation to ensure consistency between the reconstructed MR image and the undersampled k-space data, R represents the regularization term on x_tar, and λ denotes a balancing coefficient that trades off the proportion of the two terms.
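The forward model of equation (1) and the corresponding zero-filling reconstruction can be sketched in NumPy (noise omitted and a single-coil acquisition assumed):

```python
import numpy as np

def undersample(x, mask):
    """Forward model y = M * F(x): mask the 2D FFT of the image.

    Minimal sketch of equation (1) with the noise term omitted;
    `mask` is a binary k-space sampling pattern of the same shape as x.
    """
    return mask * np.fft.fft2(x)

def zero_filled(y):
    """Zero-filling reconstruction: inverse FFT of the masked k-space."""
    return np.fft.ifft2(y)
```

With a fully sampled mask the zero-filled image recovers x exactly; with a partial mask it exhibits the aliasing artifacts that the reconstruction network must remove.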
To further speed up MR imaging, DL-based methods build a trainable regularization term using a CNN. Thus, equation (2) can be rewritten as follows:

$$ \hat{x}_{tar} = \arg\min_{x} \|M \odot F(x) - y\|_2^2 + \lambda \|x - f_{\theta}(x_u)\|_2^2, \qquad (3) $$

where x_u = F^{-1}(y) represents the zero-filling image of y, and f_θ denotes the CNN model with optimizable parameters θ. In MC MRI reconstruction methods, the reference contrast image x_ref is introduced into the CNN-based regularization term as fully sampled prior information. Equation (3) can then be formulated as:

$$ \hat{x}_{tar} = \arg\min_{x} \|M \odot F(x) - y\|_2^2 + \lambda \|x - f_{\theta}(x_u, x_{ref})\|_2^2, \qquad (4) $$

where x_ref is acquired under a different scanning protocol and paired with x_tar.

Data consistency layer
In an undersampled MR image, the acquired k-space measurements are exactly known and can be used to improve the reconstruction fidelity. Therefore, following the idea of Schlemper et al (2018), the DC layer is introduced into the proposed method.
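A minimal sketch of such a data-consistency operation, assuming single-coil Cartesian data and following the closed-form rule of Schlemper et al (2018):

```python
import numpy as np

def data_consistency(k_rec, y, mask, lam=None):
    """Closed-form DC layer, a minimal sketch.

    At sampled locations the reconstructed k-space is replaced by
    (k_rec + lam * y) / (1 + lam); lam=None corresponds to noiseless
    data, i.e. hard replacement with the acquired measurements y.
    Non-sampled locations keep the network prediction k_rec.
    """
    if lam is None:
        return np.where(mask > 0, y, k_rec)
    return np.where(mask > 0, (k_rec + lam * y) / (1 + lam), k_rec)
```

In the noiseless limit this simply enforces that the network output agrees with the measured k-space wherever data was acquired.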
According to Schlemper et al (2018), the closed-form solution of equation (4) at each k-space location i can be written as follows:

$$ \hat{k}(i) = \begin{cases} k_{rec}(i), & i \notin \Omega \\ \dfrac{k_{rec}(i) + \lambda\, y(i)}{1 + \lambda}, & i \in \Omega \end{cases} $$

where k_rec = F(x_rec) is the k-space of the network reconstruction, Ω is the set of sampled locations, and λ → ∞ corresponds to direct replacement with the acquired measurements in the noiseless case.

INeXt
Within INeXt, the IIFF layer and the IR layer are simple convolutions with a 3 × 3 kernel size. Considering the many artifacts and blurred details in undersampled target contrast images, the DIFFM is designed as a stacked structure composed of multiple ACBs to reduce the influence of incorrect features and fully fuse the spatial information between different contrasts.
In the kth INeXt, the IIFF layer first fuses the concatenated MC MR images to obtain the shallow feature F_0^k:

$$ F_0^k = H_{IIFF}^k(\{x_{tar}^k, x_{ref}\}), $$

where H_IIFF^k represents the IIFF layer, and {·} denotes concatenation along the channel dimension.
F_0^k is then fed into the DIFFM to fully fuse the complementary information from x_ref. For each ACB, the output F_i^k can be written as:

$$ F_i^k = H_{ACB,i}^k(F_{i-1}^k), \quad i = 1, 2, \ldots, C_I, $$

where H_{ACB,i}^k is the ith ACB and C_I represents the total number of ACBs in the DIFFM. A 3 × 3 convolution is employed in the last layer to generate the final feature map F_D^k.
The entire DIFFM can be defined as:

$$ F_D^k = H_{conv}^k(H_{ACB,C_I}^k(\cdots H_{ACB,1}^k(F_0^k)\cdots)), $$

where H_conv^k denotes the last convolution operation in the DIFFM; all of the above parameters belong to θ.

WNeXt
In the kth WNeXt, the shallow wavelet feature F_{W,0}^k is preliminarily extracted through the IWFF layer:

$$ F_{W,0}^k = H_{IWFF}^k(\{W_{tar}^k, W_{ref}\}), $$

where H_IWFF^k represents the IWFF layer. Different from the DIFFM in INeXt, the MWFAM uses a global fusion layer (GF layer) to aggregate the local sparse representations generated by the ACBs at different extraction stages:

$$ F_{GF}^k = H_{GF}^k(\{F_{W,1}^k, F_{W,2}^k, \ldots, F_{W,C_W}^k\}), $$

where C_W represents the total number of ACBs in the MWFAM and H_GF^k denotes the GF layer. The global feature fusion is implemented with a 1 × 1 convolutional layer and a 3 × 3 convolutional layer.
After complex deep features are extracted in the wavelet domain, the first fused feature of the IWFF layer F_{W,0}^k and the output feature of the GF layer F_{GF}^k are merged by a global residual connection. Then the WR layer reconstructs the target sub-bands W_{sub}^k:

$$ W_{sub}^k = H_{WR}^k(F_{W,0}^k + F_{GF}^k), $$

where H_WR^k represents the WR layer. Finally, the 2D inverse discrete wavelet transform (IDWT) combines the sub-bands W_{sub}^k linearly to produce the spatial result:

$$ x_W^k = \mathrm{IDWT}(W_{sub}^k), $$

where x_W^k denotes the final reconstructed target contrast image of WNeXt. Like INeXt, all of the parameters mentioned above belong to θ.
Attention ConvNeXt block
Figure 3(a) shows the detailed structure of the original ConvNeXt block. Compared with Vision Transformers, it retains the inductive bias property of convolution in a hierarchical architecture. Meanwhile, the 7 × 7 depth-wise convolutional layer provides non-local feature extraction capabilities similar to the multi-head self-attention mechanism. For MC MRI reconstruction, a larger receptive field can not only effectively remove the global artifacts caused by the undersampling operation but also guide the network to better fuse shallow features with different contrast information. However, the depth-wise convolution and the 1 × 1 patch-wise convolution mix MC information only in the channel or spatial dimension. Therefore, SimAM (Yang et al 2021), an attention mechanism based on neuroscience theory, is introduced to achieve cross-dimension attention weight allocation without additional parameters, which enhances the interaction between different contrast features. In addition, inspired by the classic Transformer block design (Vaswani et al 2017), a feedforward layer is added after the residual output of the ConvNeXt block to further refine the captured MC image representations. The redesigned ACB is shown in figure 3(b), and its workflow can be divided into two phases.
In the first phase, a depth-wise convolution and a layer normalization operation extract the per-channel information of the input feature map F_{i-1}:

$$ F_{dw} = \mathrm{LN}(H_{dw}(F_{i-1})). $$

To simulate the inverted bottleneck structure in the Transformer, a patch-wise convolutional layer activated by GELU expands the dimension of the non-local enhanced feature F_{dw} four times, and then SimAM calculates 3D attention weights of the activation maps to amplify important signals. The feature dimension is shrunk back to the initial shape by another patch-wise convolutional operation. Finally, a residual connection fuses the output of the inverted bottleneck with F_{i-1} to generate the feature map F_{mid}. The whole first phase can be formulated as follows:

$$ F_{mid} = F_{i-1} + H_{pw2}(\mathrm{SimAM}(\mathrm{GELU}(H_{pw1}(F_{dw})))), $$

where H_dw denotes the 7 × 7 depth-wise convolutional layer, LN(·) represents the layer normalization operation, H_pw1 and H_pw2 denote the first and second patch-wise convolutional layers in the first phase, and i = 1, 2, …, C_{I(W)}, where C_{I(W)} represents the total number of ACBs in the DIFFM or MWFAM.

In the second phase, to mine deeper representations, F_{mid} is passed into the feedforward layer composed of a layer normalization and an inverted bottleneck structure without SimAM. The refined feature and F_{mid} are combined to form the final result F_i. Thus, the second phase can be defined as follows:

$$ F_i = F_{mid} + H_{pw4}(\mathrm{GELU}(H_{pw3}(\mathrm{LN}(F_{mid})))), $$

where H_pw3 and H_pw4 denote the patch-wise convolutions in the second phase.
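The parameter-free SimAM reweighting used inside the ACB can be sketched in NumPy. The per-neuron energy formulation follows Yang et al (2021); the channel-first (C, H, W) layout is an assumption of this sketch:

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention (Yang et al 2021), NumPy sketch.

    x: feature map of shape (C, H, W). For each channel, the inverse
    of the minimal neuron energy is (t - mu)^2 / (4 * (var + lam)) + 0.5,
    and the input is reweighted by its sigmoid, so more distinctive
    neurons (far from the channel mean) receive larger weights.
    """
    n = x.shape[1] * x.shape[2] - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n
    e_inv = d / (4 * (var + lam)) + 0.5
    return x / (1 + np.exp(-e_inv))  # sigmoid(e_inv) * x
```

Because the weights are computed directly from the feature statistics, no trainable parameters are added, which keeps the ACB lightweight.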

Cross-domain consistency loss
Since different paired contrast images share the same anatomy, MC MRI reconstruction can infer a clear target contrast image with the help of the supplementary information in x_ref. It is crucial to drive the network to fully utilize the reference contrast image, especially for self-supervised reconstruction where the complete target contrast image is not acquired. For this reason, as shown in figure 1(a), we propose a cross-domain consistency loss containing three components: the frequency domain consistency loss L_FDC, the image domain consistency loss L_IDC, and the wavelet domain consistency loss L_WDC. Mathematically, the cross-domain consistency loss is defined as follows:

$$ L_{CD} = L_{FDC} + L_{IDC} + L_{WDC}, $$

where L_CD represents the cross-domain consistency loss. The reconstruction results of the two branches, x_rec1 and x_rec2, should be consistent with the initial undersampled data y_u^tar in the frequency domain. By applying the mean-squared-error (MSE) loss L_MSE to minimize the differences between y_u^tar and x_rec1(2) in the sampled region M, the network is expected to gain the ability to predict non-sampled k-space data. The calculation procedure of L_FDC can be written as follows:

$$ L_{FDC} = \frac{1}{N}\left\|M \odot F(x_{rec1}) - y_u^{tar}\right\|_2^2 + \frac{1}{N}\left\|M \odot F(x_{rec2}) - y_u^{tar}\right\|_2^2, $$

where N represents the pixel size of the input image. The complementary effect from x_ref suffers from gain deviation when the two branches reconstruct the disjoint measurements y_u^{tar1} and y_u^{tar2} simultaneously, which may result in learning an inaccurate mapping relationship. To reduce self-supervised reconstruction errors and preserve more recovered high-frequency details, the appearance consistency between x_rec1 and x_rec2 is constrained in the image domain and the wavelet domain by using L_MSE. Thus, L_IDC and L_WDC can be expressed as follows:

$$ L_{IDC} = \frac{1}{N}\|x_{rec1} - x_{rec2}\|_2^2, \qquad L_{WDC} = \frac{1}{N}\|W_{rec1} - W_{rec2}\|_2^2, $$

where W_{rec1(2)} is the wavelet sub-band form of x_{rec1(2)}.
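Assuming unit weights between the three terms (any balancing coefficients the full method may use are not reproduced here), the cross-domain consistency loss can be sketched as follows; `dwt` stands in for any image-to-sub-bands transform:

```python
import numpy as np

def mse(a, b):
    """Mean-squared error, valid for real or complex arrays."""
    return np.mean(np.abs(a - b) ** 2)

def cross_domain_loss(x1, x2, y_u, mask, dwt):
    """Sketch of the cross-domain consistency loss (unit weights assumed).

    x1, x2 : image-domain reconstructions of the two branches
    y_u    : raw undersampled target k-space
    mask   : binary sampling pattern M
    dwt    : function mapping an image to a tuple of wavelet sub-bands
    """
    # frequency domain consistency in the sampled region
    l_fdc = mse(mask * np.fft.fft2(x1), y_u) + mse(mask * np.fft.fft2(x2), y_u)
    # appearance consistency in image and wavelet domains
    l_idc = mse(x1, x2)
    l_wdc = sum(mse(a, b) for a, b in zip(dwt(x1), dwt(x2)))
    return l_fdc + l_idc + l_wdc
```

If both branches agree with each other and with the acquired measurements, the loss vanishes, which is exactly the self-supervised fixed point the training seeks.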

Implementation details
The proposed method is implemented in Python under the PyTorch framework and trained on an NVIDIA GeForce 3080Ti GPU with 12 GB memory. The Adam algorithm calculates the reverse gradient for model optimization with an initial learning rate of η = 0.0002. During the training stage, the batch size is set to 2, and the total number of epochs T_C is set to 200. When the validation set metrics no longer change significantly, the learning rate is decreased by 30% until the early stopping mechanism is triggered. The total iteration number N_C of the reconstruction network is set to 3 by default, which corresponds to 3 INeXt, 3 WNeXt, and 6 DC layers. In experiments, we use fully sampled T1W images as reference data to assist in reconstructing undersampled T2W and FLAIR images.

Performance evaluation
The peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) are used as evaluation indicators to measure the final reconstruction quality.The higher values of the PSNR and the SSIM represent better network performance.
PSNR emphasizes detail differences at the pixel level and is defined as follows:

$$ \mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{Max}(I_{GT})^2}{\mathrm{MSE}(I_{GT}, I_R)}\right), $$

where I_GT and I_R are the ground truth and the reconstructed image, MSE follows the calculation of L_MSE in equation (19), and Max denotes the maximum pixel search operation.
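A minimal NumPy version of the PSNR computation above, with Max taken as the peak value of the ground truth:

```python
import numpy as np

def psnr(gt, rec):
    """PSNR in dB between a ground-truth and a reconstructed image."""
    err = np.mean((gt - rec) ** 2)       # pixel-wise mean-squared error
    return 10 * np.log10(gt.max() ** 2 / err)
```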
The SSIM focuses on the perceptual quality of the reconstruction and is computed as follows:

$$ \mathrm{SSIM} = \frac{(2\mu_{GT}\mu_{R} + c_1)(2\sigma_{GT,R} + c_2)}{(\mu_{GT}^2 + \mu_{R}^2 + c_1)(\sigma_{GT}^2 + \sigma_{R}^2 + c_2)}, $$

where μ_GT and μ_R denote the means of the two images, σ_GT^2 and σ_R^2 their variances, σ_{GT,R} their covariance, and c_1, c_2 small stabilizing constants.

Table 1 presents quantitative comparison results on the HCP dataset. In the experiments, single-channel real-valued T2W MR images are undersampled by 1D random masks. Compared to DuDoRNet, IWNeXt improves SSIM by 0.204% and PSNR by 0.833 dB under 3× acceleration, and SSIM by 0.522% and PSNR by 1.651 dB under 5× acceleration. Obviously, the dual-domain reconstruction methods DuDoRNet and IWNeXt are superior to the image domain reconstruction methods Dense U-Net, MC-SSDU, and DSFormer. Since the feature correlation between wavelet sub-bands is stronger than that of k-space data, IWNeXt outperforms DuDoRNet at different acceleration rates. Furthermore, the cross-domain consistency loss function guides the network to better accomplish MC reconstruction compared with other self-supervised models. The number of training parameters for the different methods is also shown in table 1. It can be seen that IWNeXt obtains advanced performance by applying the lightweight ConvNeXt without sacrificing more computing resources.
The qualitative comparison results on the HCP dataset are depicted in figure 5. The red box indicates the enlarged area in the reconstructed image. We find that SSDU and DSFormer have better artifact removal ability than Dense U-Net at some acceleration rates but are inferior to DuDoRNet in detail recovery. These self-supervised methods hardly generate refined texture tissue. On the contrary, regardless of the acceleration rate, our model achieves better reconstruction performance when the fully sampled label is unavailable.
To further evaluate the reconstruction performance of IWNeXt, comparison experiments with 1D equispaced masks are conducted on the M4Raw dataset. The quantitative results of two-channel complex-valued FLAIR images are shown in table 2. It can be seen that IWNeXt outperforms the other baseline methods across all evaluation indicators. Specifically, the proposed method achieves an SSIM of 0.860 and a PSNR of 32.660 dB for 5× acceleration. For qualitative comparisons on the M4Raw dataset, the reconstructed results of the different methods are visualized in figure 6. When dealing with aliasing artifacts induced by 1D Cartesian sampling trajectories, IWNeXt shows competitive reconstruction performance. Compared with the other methods, the proposed model produces more refined details in the final reconstructed results.

Ablation study
In this section, we conduct ablation experiments on the HCP dataset to validate the effectiveness of the proposed method, covering: (1) multi-contrast reconstruction using both the image domain and wavelet domain information of the reference contrast image; (2) cross-domain reconstruction; (3) the attention ConvNeXt block; (4) the cross-domain consistency loss; and (5) the number of subnetworks.

Multi-contrast reconstruction
To verify the efficacy of reference contrast images for auxiliary reconstruction, we use single-contrast IWNeXt and MC IWNeXt to reconstruct undersampled target contrast images. It should be noted that the overall structure remains the same except for the absence of x_ref in single-contrast reconstruction. In MC reconstruction, different forms of the reference contrast image are utilized to perform information fusion, including image domain fusion (I-fusion), wavelet domain fusion (W-fusion), and image-wavelet domain fusion (IW-fusion). The fusion methods are distinguished by the presence or absence of additional channel inputs to INeXt and WNeXt in IWNeXt.
The quantitative evaluation of SSIM and PSNR is shown in table 3. We can see that the reconstruction quality is significantly improved by fusing reference contrast images in any single domain. This indicates that the complementary information from different contrast images is efficient for the reconstruction of undersampled target contrast images. When using IW-fusion, the network achieves the best reconstruction performance. Figure 7 presents qualitative visual results at an acceleration rate of 5. Compared with single-contrast reconstruction, MC reconstruction can recover more detailed structures with the assistance of reference contrast images.
Besides, we also study the effect of the proposed IW-fusion strategy when the reference contrast data is degraded. Two degradation conditions are used in the experiments. The first, 'Motion', simulates patient movement during MC scanning. Specifically, we utilize random rotation ([−0.01π, 0.01π]), translation ([−12.8, 12.8]), and bicubic interpolation with 9 × 9 control points (each randomly displacing pixels within a range of [−5.12, 5.12]) to generate a deformation field and apply it to the fully sampled reference contrast images. The second is 'Undersampling'. In MC reconstruction, the paired MR images employ the same undersampling masks at the start point. We compare the above two types of degraded IW-fusion methods with single-contrast reconstruction and the IW-fusion method without degradation ('Undegraded'). As shown in figure 8, the MC reconstruction methods still outperform the single-contrast reconstruction methods when high-quality reference contrast data is unavailable. Even though the degradation processes reduce the shareable features between different contrast images, the proposed IW-fusion strategy drives the network to obtain higher self-supervised reconstruction accuracy.

Cross-domain reconstruction
To explore the impact of single-domain and dual-domain methods on MC MRI reconstruction, we compare the following three variants of IWNeXt: the image-domain-only network (INeXt with DC layers), the wavelet-domain-only network (WNeXt with DC layers), and WINeXt, which reverses the domain order by performing wavelet domain reconstruction before image domain reconstruction. The results in table 4 show that the image domain network is better than the wavelet domain network. Although the wavelet sub-bands of the reference contrast image provide reliable additional information for undersampled target contrast image reconstruction, it is difficult for the model to infer detailed contents due to the heavy loss of high-frequency features in the target wavelet sub-bands. In contrast, CNN structures are more suitable for the rough restoration of image domain data. Moreover, cross-domain reconstruction is superior to single-domain reconstruction. As shown in figure 9, under 5× acceleration, the cross-domain reconstruction methods generate more detailed textures and have fewer reconstruction errors, which demonstrates that dual-domain MC reconstruction methods are effective. On the other hand, IWNeXt yields more competitive reconstruction performance compared with WINeXt, which further indicates that the image domain network has a better ability for initial image recovery. Depending on the additional pixel information made available by INeXt, WNeXt can reconstruct more refined tissue structures.

The attention ConvNeXt block
We study the gain effect of the proposed ACB structure on MC MRI reconstruction by removing SimAM and the feedforward layer respectively. It can be seen from table 5 that the reconstruction performance of IWNeXt degrades to some extent without SimAM or the feedforward layer, and is worst when neither is present. By utilizing the channel-spatial hybrid attention mechanism and the deep fusion operation simultaneously, the network with the ACB has better non-local feature extraction ability, which illustrates that SimAM and the feedforward layer are beneficial for MC MRI reconstruction.
Figure 10 shows the loss curves of the different comparison methods on the validation set at an acceleration rate of 5. We can see that both SimAM and the feedforward layer reduce validation errors and speed up the loss descent. In particular, the network with the ACB reconstructs high-quality target contrast images more accurately.

The cross-domain consistency loss
To explore the superiority of the cross-domain consistency loss, we evaluate the reconstruction performance of the network using the frequency domain consistency loss L_FDC alone, L_FDC combined with a single additional domain loss, and the complete cross-domain consistency loss.

It can be seen that the reconstruction results using the dual-domain consistency loss achieve improvements in SSIM and PSNR compared with the results using L_FDC alone. Because the gain deviation caused by disjoint sampling modes decreases as the appearance similarity between the reconstruction results of the different branches increases, the network can make full use of the supplementary information of the reference contrast image through the dual-domain self-supervised loss. When the complete cross-domain consistency loss is used, the network learns a more accurate mapping relationship, and the reconstruction performance improves accordingly.

The number of subnetworks
In this experiment, we investigate the impact of the number of INeXt-DC-WNeXt-DC subnetworks on the reconstruction performance. The experimental results are reported in figure 12, where the acceleration rate is set to 5 and the number of subnetworks varies from 1 to 5. IWNeXt reaches its performance bottleneck when the number of subnetworks is 3, and there are no significant changes in reconstruction performance beyond that point. Setting the number of subnetworks to 3 is therefore efficient for network optimization.

Discussion
Recent studies have shown that DL-based MC reconstruction methods can reconstruct high-fidelity target contrast images from fewer k-space acquisitions by leveraging the complementary information in reference contrast images. Nevertheless, most existing models employ traditional CNN structures to reconstruct undersampled target data in the image domain, which not only leaves the knowledge in other sparse domains unused but also hardly learns explicit non-local semantic information of MC images. On the other hand, a fully sampled target contrast image is still regarded as a necessary condition for network training in most cases, even though fully sampled MC sequences take a long time to acquire and are prone to artifacts in the final imaging results. In this work, to overcome the above challenges, we design a ConvNeXt-based network that performs MC MRI reconstruction in both the image domain and the wavelet domain, with self-supervised training guided by a novel cross-domain consistency loss. Both quantitative and qualitative experiments demonstrate that IWNeXt offers superior reconstruction performance against previous conventional, supervised, and self-supervised methods. Compared with DuDoRNet (Zhou and Zhou 2020), the additional wavelet domain reconstruction stage produces image details closer to the ground truth than an additional k-space reconstruction stage under higher acceleration rates. Moreover, IWNeXt builds high-quality non-local dependencies. Since different contrast images of the same anatomical part share common features, the MC reconstruction method is more conducive to global artifact removal and texture structure recovery than a single-contrast network. When the image and wavelet domain features of the reference contrast image are fused, IWNeXt obtains optimal reconstruction performance. Regardless of the reconstruction order of the two domains, cross-domain methods exhibit better performance than single-domain methods in high-frequency content generation. Because WNeXt separates the image into a series of wavelet sub-bands representing complex frequency information via the 2D discrete wavelet transform, the feature loss caused by the undersampling operation is amplified there. Conversely, the spatial form of the undersampled image carries relatively concise feature information and is easier for a CNN to reconstruct. Thus, INeXt is more suitable than WNeXt for the initial reconstruction.
The reconstruction performance of IWNeXt is also affected by the design of ACB. The introduction of SimAM allows the original ConvNeXt block to integrate non-locally enhanced feature maps in both the spatial and channel dimensions. In addition, a refined MC image representation can be captured by the feedforward layer. After adding SimAM and the feedforward layer to the ConvNeXt block, IWNeXt achieves its best reconstruction performance, which shows that the ACB structure improves the non-local feature extraction ability of the network.
Although the frequency domain consistency loss L_FDC alone can drive the parallel framework to reconstruct the target contrast image, the resulting performance is not optimal because the complementary information in the reference contrast image is not fully used. When the image domain consistency loss L_IDC and the wavelet domain consistency loss L_WDC are additionally used to constrain the appearance consistency between the reconstructed images of the two branches, the self-supervised reconstruction achieves better SSIM and PSNR.
Some limitations of the proposed method remain. (1) The comparison experiments show that IWNeXt achieves promising reconstruction performance with fewer network parameters; however, its reconstruction time is relatively long due to the cascade structure. Future work will therefore investigate end-to-end self-supervised models based on ConvNeXt. (2) The fusion strategy for different contrast images is also critical for MC reconstruction. Several studies (Liu et al 2021c, Lyu et al 2022) have explored how to efficiently incorporate supplementary information from reference contrast images via fully supervised fusion modules. We will design a multi-contrast fusion strategy in a self-supervised manner to better assist the reconstruction of undersampled target contrast data.

Conclusion
In this paper, an image-wavelet domain parallel framework, IWNeXt, is proposed for high-quality self-supervised MC MRI reconstruction. In each branch, INeXt and WNeXt with ACB are designed to reconstruct the two re-undersampled target contrast images and fuse the fully sampled reference contrast image in the image domain and the wavelet domain respectively. Meanwhile, self-supervised learning is performed with a novel cross-domain consistency loss, which simultaneously constrains the consistency between the reconstructed results of the two re-undersampled inputs in the frequency, image, and wavelet domains. Experimental results show that IWNeXt achieves competitive reconstruction performance compared with supervised methods and the state-of-the-art self-supervised method.
Network architecture
As depicted in figure 1, IWNeXt is constructed as a parallel framework whose two branches share the same structure and network parameters. Each branch performs N_C rounds of dual-domain reconstruction by cascading INeXt-DC-WNeXt-DC subnetworks. In the training stage, as shown in figure 1(a), the raw undersampled k-space data y_tar^u is split into two disjoint parts, y_tar^u1 and y_tar^u2. To alleviate the pressure of self-supervised learning, the re-undersampling operation on y_tar^u keeps most low-frequency measurements and evenly divides the high-frequency measurements. The two branches then iteratively reconstruct the re-undersampled images x_tar^u1 and x_tar^u2 corresponding to y_tar^u1 and y_tar^u2. Meanwhile, the reference contrast image x_ref is injected into INeXt and WNeXt as auxiliary input. To guarantee data fidelity at each reconstruction stage, y_tar^u1 and y_tar^u2 are combined with the output of each regularization term via the data consistency layer (DC layer). When the iterations stop, the final reconstructed results x_rec1 and x_rec2 are used to calculate the frequency domain consistency loss L_FDC, the image domain consistency loss L_IDC, and the wavelet domain consistency loss L_WDC. The detailed training procedure of our model is described in Algorithm 1, where f denotes the upper or lower branch of the parallel network and Q contains the learnable parameters of all components. The testing stage is illustrated in figure 1(b): the trained network directly takes y_tar^u as the reconstruction input and x_ref as the auxiliary signal to predict the final image x_rec. In the DC layer, given the output of the CNN-based regularization unit, let W denote the set of sampled positions in y_tar^u and j the index of a k-space location: at every sampled location j in W, the network prediction is replaced by the acquired measurement of y_tar^u, and the result is transformed back to the image domain by the inverse FFT. The structure of the DC layer is shown in figure 2(c).
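The data-fidelity step described above can be sketched as a hard data-consistency layer. Whether IWNeXt uses strict replacement or a weighted combination at the sampled locations is an assumption here; the sketch shows the strict-replacement variant.

```python
import numpy as np

def data_consistency(x_cnn, y_u, mask):
    """Hard data-consistency (DC) layer sketch.

    x_cnn : complex image output by the CNN regularization unit
    y_u   : acquired undersampled k-space measurements
    mask  : 1 at sampled k-space locations (the set W), 0 elsewhere
    At every sampled location the network prediction is replaced by the
    measurement; unsampled locations keep the CNN estimate.
    """
    k_cnn = np.fft.fft2(x_cnn)              # predicted k-space
    k_dc = np.where(mask > 0, y_u, k_cnn)   # enforce measurements on W
    return np.fft.ifft2(k_dc)               # back to the image domain

rng = np.random.default_rng(3)
x_true = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
mask = np.zeros((16, 16)); mask[::3, :] = 1.0
y_u = mask * np.fft.fft2(x_true)
x_out = data_consistency(rng.standard_normal((16, 16)).astype(complex), y_u, mask)
```

After the layer, the k-space of the output agrees exactly with the acquired measurements on the sampled set, which is what makes the cascaded outputs consistent with y_tar^u1 and y_tar^u2 at every stage.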

Algorithm 1. IWNeXt.
Input: raw undersampled measurements y_tar^u; fully sampled reference contrast images x_ref.
Output: the trained parameter set Q.
Initialization: total epochs T_C; total iterations N_C; learning rate h.
for t = 0 to T_C do ... (the remaining steps of the listing were not recoverable from the extracted text).

ConvNeXt for image domain reconstruction
The architecture of INeXt is shown in figure 2(a). It consists of three parts: the initial image feature fusion layer (IIFF layer), the deep image feature fusion module (DIFFM), and the image reconstruction layer (IR layer).

Finally, the extracted deep features are fed into the IR layer through a global residual connection to reconstruct the spatial structure of the target contrast images:

Figure 1. The overall architecture of IWNeXt. (a) The self-supervised training framework. It is a parallel cascade network that reconstructs two re-undersampled images x_tar^u1 and x_tar^u2 from the same raw undersampled k-space data y_tar^u. INeXt and WNeXt receive the fully sampled reference contrast image x_ref, and each DC layer receives the re-undersampled k-space data y_tar^u1 or y_tar^u2. The outputs are then optimized by the cross-domain losses L_FDC, L_IDC, and L_WDC. (b) The testing framework. y_tar^u is directly fed into either trained branch to produce the final reconstructed result x_rec.
In the text, q represents the IR layer, and x_I^(up/down,k) denotes the final output of INeXt.

ConvNeXt for wavelet domain reconstruction
Although INeXt can reconstruct relatively complete image content, it is still insufficient for sophisticated detail restoration. Therefore, after image domain reconstruction, WNeXt is designed for wavelet domain refinement. It is composed of the initial wavelet feature fusion layer (IWFF layer), the multi-level wavelet feature aggregation module (MWFAM), and the wavelet reconstruction layer (WR layer). It not only exploits the strong correlation among different sub-bands but also recovers more tissue texture by fully integrating the wavelet domain information of the reference image. In the proposed wavelet domain network, the IWFF layer consists of two 3 × 3 convolutions, and the WR layer is a single 3 × 3 convolution. Figure 2(b) shows the structure of WNeXt. Firstly, in the kth WNeXt, the 2D discrete wavelet transform (DWT) with Haar kernels transforms the initial reconstructed target image x_I^(up/down,k) and x_ref into their wavelet forms. Specifically, through the spatial-frequency component calculation, an original image of size C × H × W is represented as four sub-band images of size C × H/2 × W/2: low-frequency (LL), vertical (HL), horizontal (LH), and diagonal (HH). Let sub_I^(up/down,k) and sub_ref denote the wavelet sub-band sets of x_I^(up/down,k) and x_ref respectively. Then sub_I^(up/down,k) and sub_ref are concatenated along the channel dimension, and the MC wavelet feature F_W^(up/down,k,0) is obtained.
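The Haar decomposition WNeXt relies on can be sketched directly in NumPy for a single-channel image. The averaging normalization below is one common convention and is an assumption; it may differ from the implementation used in the paper.

```python
import numpy as np

def dwt2_haar(x):
    """One level of the 2D Haar DWT: image (H, W) -> four sub-bands
    LL, HL, LH, HH, each of shape (H/2, W/2)."""
    lo = (x[0::2, :] + x[1::2, :]) / 2.0    # row low-pass
    hi = (x[0::2, :] - x[1::2, :]) / 2.0    # row high-pass
    ll = (lo[:, 0::2] + lo[:, 1::2]) / 2.0  # low-frequency
    hl = (lo[:, 0::2] - lo[:, 1::2]) / 2.0  # vertical detail
    lh = (hi[:, 0::2] + hi[:, 1::2]) / 2.0  # horizontal detail
    hh = (hi[:, 0::2] - hi[:, 1::2]) / 2.0  # diagonal detail
    return ll, hl, lh, hh

def idwt2_haar(ll, hl, lh, hh):
    """Inverse Haar transform: perfect reconstruction from the sub-bands."""
    h2, w2 = ll.shape
    lo = np.empty((h2, 2 * w2)); hi = np.empty((h2, 2 * w2))
    lo[:, 0::2], lo[:, 1::2] = ll + hl, ll - hl
    hi[:, 0::2], hi[:, 1::2] = lh + hh, lh - hh
    x = np.empty((2 * h2, 2 * w2))
    x[0::2, :], x[1::2, :] = lo + hi, lo - hi
    return x

img = np.random.default_rng(2).standard_normal((8, 8))
ll, hl, lh, hh = dwt2_haar(img)
```

The transform is invertible, so refining the sub-bands and applying the inverse DWT (as the WR stage does) loses no information; the three detail bands carry exactly the high-frequency content that the refinement stage targets.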

Figure 2. The architectures of key components in IWNeXt. (a) The architecture of INeXt. It is composed of the initial image feature fusion layer (IIFF layer), the deep image feature fusion module (DIFFM), and the image reconstruction layer (IR layer). (b) The architecture of WNeXt. It is composed of the initial wavelet feature fusion layer (IWFF layer), the multi-level wavelet feature aggregation module (MWFAM), the global fusion layer (GF layer), and the wavelet reconstruction layer (WR layer). (c) The structure of the data consistency layer (DC layer).

Figure 3. (a) The detailed structure of the original ConvNeXt block. (b) The detailed structure of the attention ConvNeXt block (ACB). In the ACB structure, SimAM is introduced into the inverted bottleneck, and a feedforward layer is added at the end of the original ConvNeXt block.
μ(I_GT) and μ(I_R) are the pixel means of I_GT and I_R, and σ(I_GT) and σ(I_R) are their variances.
Comparisons with different methods
To evaluate the performance of our model, we compare IWNeXt with the zero-filling (ZF) reconstruction method; conventional CS-based algorithms, including total variation (TV) (Lustig et al 2007) and MC joint Bayesian reconstruction (MCJBR) (Bilgic et al 2011); fully supervised MC MRI reconstruction methods, including Dense U-Net (Xiang et al 2018) and DuDoRNet (Zhou and Zhou 2020); and self-supervised MC MRI reconstruction methods, including MC-SSDU and DSFormer (Zhou et al 2023b). It should be noted that the source code of SSDU (Yaman et al 2020) only considers single-contrast reconstruction. We therefore expand the number of input channels in SSDU and inject the fully sampled reference contrast image to implement MC reconstruction.
(1) The image domain network C-INeXt, in which WNeXt-DC is removed from IWNeXt. (2) The wavelet domain network C-WNeXt, in which INeXt-DC is removed from IWNeXt. (3) The wavelet-image domain network WINeXt, in which INeXt-DC-WNeXt-DC is reordered as WNeXt-DC-INeXt-DC. All variants keep the same experimental settings as IWNeXt.

Figure 5. Qualitative visualization of T2W brain reconstructions on the HCP dataset with 1D random masks at 3× and 5× acceleration.

Figure 7. Representative reconstructed samples on the HCP dataset under the 1D random mask with the 5× acceleration rate. The enlarged result in the red box and local error maps are visualized on the right side of each sample.

Figure 8. Quantitative comparison of the single-contrast reconstruction method and different IW-fusion methods on the HCP dataset under 1D random masks with different acceleration rates.

Figure 9. Representative reconstructed samples on the HCP dataset under the 1D random mask with the 5× acceleration rate. The enlarged result in the red box and local error maps are visualized on the right side of each sample.

Figure 10. Validation loss of different methods on the HCP dataset under the 1D random mask when the acceleration rate is 5.

Figure 11. Boxplots of reconstruction results using different losses on the HCP dataset under 1D random masks with different acceleration rates.

Figure 12. The reconstruction performance with different numbers of subnetworks in IWNeXt on the HCP dataset under the 1D random mask when the acceleration rate is 5.
MC MRI reconstruction
Recently, since DL-based models have achieved promising results in single-contrast MRI reconstruction, several efforts have been made toward MC MRI reconstruction. Xiang et al (2018) developed a Dense U-Net to accomplish accelerated T2W image reconstruction with the assistance of T1W images. Sun et al (2019) designed a deep information sharing network named DISN, which used a series of convolution layers to explore valuable common representations of MC MR images. Do et al (2020) constructed a multi-scale X-shaped network and fused different contrast information in the encoding phase. Dar et al (2020) combined adversarial learning and a perceptual loss to generate more realistic anatomical structures in reconstructed images. Liu et al (2021a) proposed a deep cascade network for undersampled T2W image reconstruction, in which each regularization unit received additional supplementary information from fully sampled T1W images. Similarly, on the basis of the unfolded structure, Liu et al (2021b) employed a deep dilated convolution block in the regularization unit to extract contextual information efficiently. Guo et al (2023) presented a joint reconstruction network based on the pFISTA algorithm, which iteratively solved the inverse problem using sparsity features shared between different contrast images.
The HCP dataset (Van Essen et al 2013) collected healthy brain structure data from 1113 subjects on 3T MR scanners and released four modalities for public research, including real-valued single-coil T1W and T2W MR images, resting-state and task fMRI, and diffusion MR images. In this paper, we randomly select 500 paired magnitude-only T1W and T2W volumes to evaluate the proposed method, of which 300 are used for training, 100 for validation, and 100 for testing. Moreover, to speed up processing, we resize all slices in each volume to 256 × 256 along the coronal direction without losing the original anatomical structure.
The M4Raw dataset (Lyu et al 2023a, 2023b) offers multi-coil k-space data acquired with MC MR protocols and can be used for low-field imaging research. The data were scanned on a 0.3T system with a four-channel head coil, and brain images of three contrasts (T1W, T2W, FLAIR) were obtained from 183 volunteers. In the experiments, we chose multi-coil T1W and FLAIR data paired in the transverse direction as network inputs. Meanwhile, the ESPIRiT algorithm (Uecker et al 2014) estimates sensitivity maps for all volumes. Each multi-channel k-space slice of size 4 × 256 × 256 is then transferred into a complex-valued image of size 2 × 256 × 256 by the coil-combination operation. The allocation of the M4Raw dataset is as follows: 256 volumes for training, 80 volumes for validation, and 80 volumes for testing.

Undersampling masks
To simulate undersampled MR images, 1D Cartesian sampling masks with 3× and 5× acceleration rates are adopted in the experiments. These undersampling operations retain 24 autocalibration signal lines in the central area of k-space. It is worth noting that, for model evaluation, different 1D Cartesian sampling trajectories are applied to the two datasets: following Van Essen et al (2012), Vu et al (2015) and Lyu et al (2023a), the HCP and M4Raw datasets are undersampled by random and equispaced sampling trajectories respectively. The different undersampling masks are visualized in figure 4.
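A sketch of how such a random 1D Cartesian mask with 24 autocalibration lines could be generated; the helper `cartesian_mask` and its sampled-line budget rule are illustrative assumptions, not the paper's exact sampling code.

```python
import numpy as np

def cartesian_mask(h, w, accel, acs_lines=24, rng=None):
    """Random 1D Cartesian undersampling mask of shape (h, w).

    Keep `acs_lines` fully sampled autocalibration lines at the centre
    of k-space, then draw random phase-encoding lines until roughly
    h / accel lines are sampled in total."""
    if rng is None:
        rng = np.random.default_rng()
    line = np.zeros(h, dtype=bool)
    c = h // 2
    line[c - acs_lines // 2: c + acs_lines // 2] = True   # ACS region
    n_target = max(h // accel, acs_lines)                 # sampled-line budget
    extra = rng.choice(np.flatnonzero(~line),
                       size=n_target - line.sum(), replace=False)
    line[extra] = True
    # broadcast along the frequency-encoding direction
    return np.repeat(line[:, None], w, axis=1)

mask = cartesian_mask(256, 256, accel=5, rng=np.random.default_rng(0))
```

An equispaced variant (as used for M4Raw) would replace the random draw with a fixed stride over the non-ACS lines; the ACS handling is the same.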

Table 1. The quantitative comparison results of different methods on the HCP dataset using 1D random masks with 3× and 5× acceleration rates. Bold and underline indicate the best and second-best results respectively.

Table 2. The quantitative comparison results of different methods on the M4Raw dataset using 1D equispaced masks with 3× and 5× acceleration rates. Bold and underline indicate the best and second-best results respectively.

Table 3. The quantitative results of the single-contrast reconstruction method and MC reconstruction methods on the HCP dataset under 1D random masks with different acceleration rates.

Table 4. The quantitative results of IWNeXt and its variants on the HCP dataset under 1D random masks with different acceleration rates.

Table 5. Quantitative results of the ablation experiments for ACB on the HCP dataset under 1D random masks with different acceleration rates.