Joint reconstruction and segmentation in undersampled 3D knee MRI combining shape knowledge and deep learning

Objective. Task-adapted image reconstruction methods using end-to-end trainable neural networks (NNs) have been proposed to optimize reconstruction for subsequent processing tasks, such as segmentation. However, their training typically requires considerable hardware resources, and thus only relatively simple building blocks, e.g. U-Nets, are used, which, albeit powerful, do not integrate model-specific knowledge. Approach. In this work, we extend an end-to-end trainable task-adapted image reconstruction method to a clinically realistic reconstruction and segmentation problem of bone and cartilage in 3D knee MRI by incorporating statistical shape models (SSMs). The SSMs model the prior information and help to regularize the segmentation maps as a final post-processing step. We compare the proposed method to a simultaneous multitask learning approach for image reconstruction and segmentation (MTL) and to a complex SSM-informed segmentation pipeline (SIS). Main results. Our experiments show that the combination of joint end-to-end training and SSMs to further regularize the segmentation maps obtained by MTL substantially improves the results, especially in terms of mean and maximal surface errors. In particular, we achieve the segmentation quality of SIS and, at the same time, a substantial model reduction that yields a five-fold reduction in model parameters and a computational speedup of an order of magnitude. Significance. Remarkably, even for undersampling factors of up to R = 8, the obtained segmentation maps are of comparable quality to those obtained by SIS from ground-truth images.


Introduction and motivation
Knee osteoarthritis (OA) is a widely spread chronic and degenerative joint disease (Lawrence et al 2008) that can be assessed from quantitative image-based biomarkers such as the fraction of the apparent bone volume and the total bone tissue volume (Eckstein et al 2006), which can be obtained from magnetic resonance imaging (MRI).
A comparison of such biomarkers may give further insight into the prevention and treatment of OA. OA is the most common joint disorder, with over 250 million people affected worldwide (Vos et al 2012); 80% of cases involve the tibiofemoral joint of the knee, with a high socio-economic burden.
Commonly used x-ray imaging provides fast and relatively low-cost acquisition, but only yields 2D projections with marginal soft tissue contrast, leading to limited inter-rater concordance. Recent studies have shown that MRI-derived 3D bone shape is a much better biophysical parameter for the assessment of OA (Hanik et al 2020, Ambellan et al 2021b, Ambellan et al 2021c). A major challenge for its clinical application is the relatively long acquisition time of 3D MRI. Longer acquisition times can limit the achievable spatial resolution due to possible patient motion during the scan, increase patient discomfort and ultimately come with higher associated healthcare costs. Undersampling the images in the measurement space, the so-called k-space, can accelerate the data acquisition, but the reconstruction then typically requires advanced regularization methods to obtain images of diagnostic quality. In addition, precise segmentations of the anatomical features of interest are required both for the determination of the aforementioned biomarkers and for computer-based surgical planning of interventions. Furthermore, any impairment in image quality can carry over to downstream tasks such as tissue segmentation and, eventually, affect clinical decision-making.
In recent years, deep learning (DL) methods based on unrolled optimization (Monga et al 2021) have attracted great attention in the research field of image reconstruction (Adler and Öktem 2018, Aggarwal et al 2018, Hammernik et al 2018, Schlemper et al 2018a, Sriram et al 2020, Kofler et al 2021). Using task-adapted unrolled neural networks (NNs), the obtained regularization methods can be adapted not only to the data but also to the specific reconstruction method as well as to a subsequent task such as image classification or image segmentation (Sun et al 2019, Calivá et al 2020, Adler et al 2021, Sui et al 2021).
In this work, we propose a combination of a task-adapted NN-based method and statistical shape models (SSMs) for joint reconstruction and segmentation of undersampled 3D knee MR images. We extend an approach similar to Sui et al (2021) to a high-dimensional 3D multi-coil MRI reconstruction and segmentation problem and further integrate SSMs to increase the quality of the obtained segmentation maps.

Methods
The overall method consists of two stages. First, an end-to-end trained reconstruction and segmentation network is used to reconstruct the images from the undersampled k-space data and to deliver an initial guess of the segmentation. Second, a statistical shape model that encodes prior information about the shape of the anatomy is employed to improve the initial segmentation.

Problem formulation and image reconstruction
In MRI, the data-acquisition process yields the Fourier transform of an image whose contrast is determined by the parameters of the MR sequence and the MR-related parameters of the image, such as relaxation times. Let x ∈ C^N with N = N_x · N_y · N_z denote the vector representation of such an unknown complex-valued 3D MR image and let A_I denote a 3D multi-coil MRI operator which maps the image to its corresponding undersampled k-space data. The considered forward problem is given by

y_I = A_I x + e,   (1)

where y_I denotes the measured k-space data in the presence of complex-valued Gaussian noise e. The operator A_I has the form

A_I = (I_{N_c} ⊗ F_I) C,   (2)

where C ∈ C^{N_c·N × N} contains the stacked coil-sensitivity maps and F_I ≔ S_I F denotes the composition of a 3D FFT operator F and a binary sub-sampling operator S_I that samples the k-space coefficients indexed by I ⊂ J = {1, ..., N}, with J denoting the entire set of k-space coefficients. The operation ⊗ denotes the Kronecker product and I_{N_c} an N_c-dimensional identity operator. Because problem (1) is ill-posed, image reconstruction approaches require the use of advanced regularization techniques. In the following, we describe the regularization scheme considered here in more detail.
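As a concrete illustration of the forward model, the action of A_I (coil-weighting, 3D FFT and masking) and its adjoint can be sketched as follows. This is a minimal NumPy sketch with hypothetical helper names, not the implementation used in the paper:

```python
import numpy as np

def forward_op(x, smaps, mask):
    """Multi-coil undersampled MRI forward operator A_I.

    x     : (Nx, Ny, Nz) complex image
    smaps : (Nc, Nx, Ny, Nz) coil-sensitivity maps (the operator C)
    mask  : (Nx, Ny, Nz) binary sub-sampling mask (the operator S_I)
    """
    coil_imgs = smaps * x  # C x: weight the image with each coil profile
    ksp = np.fft.fftn(coil_imgs, axes=(-3, -2, -1), norm="ortho")  # 3D FFT per coil
    return mask * ksp      # keep only the sampled k-space coefficients

def adjoint_op(y, smaps, mask):
    """Adjoint operator A_I^H: masked inverse FFT followed by coil combination."""
    imgs = np.fft.ifftn(mask * y, axes=(-3, -2, -1), norm="ortho")
    return np.sum(np.conj(smaps) * imgs, axis=0)
```

With the orthonormal FFT convention used above, the pair satisfies the adjoint identity ⟨A_I x, y⟩ = ⟨x, A_I^H y⟩ exactly, which is a useful sanity check for any implementation of (2).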
A CNN u_Θ with trainable parameters Θ is applied to an initial reconstruction of the k-space data to obtain a CNN-based prior image x_CNN, which is used to regularize the reconstruction problem

x_Rec = argmin_x ||A_I x − y_I||_2^2 + λ ||x − x_CNN||_2^2,   (3)

which, considering x_CNN to be fixed, has a unique solution that can be obtained by solving the linear system

(A_I^H A_I + λ Id) x = A_I^H y_I + λ x_CNN.   (4)

Approximately solving (4) can, for example, be achieved with a conjugate gradient (CG) method with M iterations. By Rec_Θ,λ we denote the entire reconstruction network, which estimates a CNN-regularized and data-consistent solution x_Rec from the multi-coil k-space data y_I. The reconstruction x_Rec serves as input for the subsequent multi-class segmentation network u_Ψ^Seg with parameters Ψ.
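The data-consistency step, i.e. approximately solving the linear system (4) with a fixed number of CG iterations, can be sketched as follows. This is a minimal NumPy sketch; `normal_op` is a hypothetical callable applying the normal operator A_I^H A_I:

```python
import numpy as np

def cg_data_consistency(ah_y, normal_op, x_cnn, lam, n_iter=12):
    """Approximately solve (A^H A + lam*Id) x = A^H y + lam * x_CNN with CG.

    ah_y      : adjoint-reconstructed data A^H y
    normal_op : callable applying the normal operator A^H A to an image
    x_cnn     : CNN-prior image of the same shape
    lam       : regularization weight lambda > 0
    """
    b = ah_y + lam * x_cnn
    x = np.zeros_like(b)
    r = b.copy()                       # residual for the zero initialization
    p = r.copy()
    rs = np.vdot(r, r).real
    for _ in range(n_iter):
        hp = normal_op(p) + lam * p    # apply H = A^H A + lam*Id
        alpha = rs / np.vdot(p, hp).real
        x = x + alpha * p
        r = r - alpha * hp
        rs_new = np.vdot(r, r).real
        if rs_new < 1e-14:             # converged: avoid division by zero
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Since H = A_I^H A_I + λ·Id is Hermitian positive definite for λ > 0, CG is guaranteed to converge; the early-stopping check keeps the fixed-iteration variant stable when the residual vanishes before M iterations.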

Network architectures
The basic component of both the reconstruction and the segmentation network is the U-Net. As in Kofler et al (2018), we hyper-parameterize a U-Net by the number of encoding stages, the number of convolutional layers per stage and the number of filters initially applied to the input images, which we denote by E, C and K, respectively. Based on preliminary experiments, we identified a model configuration of E3C2K16 to be powerful enough for the task of image reconstruction, while for the segmentation network, similar as in Ambellan et al (2019b), a network with more trainable parameters, corresponding to E4C2K32, was required to yield accurate segmentation maps. In addition, the two networks differ in the shape of their kernels, which are 3 × 3 × 3 and 5 × 5 × 3 for the reconstruction and the segmentation network, respectively. The number of CG iterations for solving problem (3) in Rec_Θ,λ is set to M = 12. The choice of the reconstruction and segmentation networks is based on a trade-off between the available computational hardware, i.e. a GPU with 48 GB of memory, and the expressiveness of the respective networks in terms of the number of trainable parameters. In general, other studies have observed that increasing the number of trainable parameters is beneficial for the task of image reconstruction with unrolled methods (Kofler et al 2022). However, employing deeper networks increases the memory footprint of the end-to-end method. Thus, because the reconstruction module is model-based (due to the use of a data-consistency module), sacrificing trainable parameters in the reconstruction network seems preferable to reducing the number of trainable parameters of the segmentation network, which performs the more difficult task of semantic segmentation.
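The E-C-K hyper-parameterization can be made concrete with a small configuration sketch. The helper below is hypothetical, and the doubling of filters per encoding stage is a common U-Net convention assumed here rather than stated in the text:

```python
from dataclasses import dataclass

@dataclass
class UNetConfig:
    stages: int           # E: number of encoding stages
    convs_per_stage: int  # C: convolutional layers per stage
    base_filters: int     # K: number of filters in the first stage
    kernel: tuple         # kernel shape of the 3D convolutions

    def filters_per_stage(self):
        # assumed doubling schedule: K, 2K, 4K, ... per encoding stage
        return [self.base_filters * 2 ** s for s in range(self.stages)]

# the two configurations reported in the paper
recon_cfg = UNetConfig(stages=3, convs_per_stage=2, base_filters=16, kernel=(3, 3, 3))  # E3C2K16
seg_cfg = UNetConfig(stages=4, convs_per_stage=2, base_filters=32, kernel=(5, 5, 3))    # E4C2K32
```

Under the assumed doubling schedule, the reconstruction network would use 16, 32 and 64 filters over its three stages, and the segmentation network 32, 64, 128 and 256 over its four stages, reflecting the deliberately larger capacity of the segmentation module.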
Note that the considered reconstruction method corresponds to a special case of the well-known MoDL approach (Aggarwal et al 2018), where the number of alternations is fixed to one here due to the relatively high dimensionality of the considered problem, i.e. the presence of a 3D acquisition operator with multiple receiver coils. This choice is a necessary trade-off between the computational requirements of training the entire model-based network in an end-to-end fashion and the use of a more sophisticated reconstruction network as in Sui et al (2021).

Statistical shape models (SSMs)
As CNNs lack explicit shape knowledge, we opt for SSM-based post-processing to circumvent anatomical implausibilities in the segmentation masks. SSMs are geometric models that describe a collection of semantically similar shapes in a compact way. SSMs jointly represent the average shape of many three-dimensional objects as well as their variation thereof (Ambellan et al 2019a). Given a shape population, SSMs offer a powerful mechanism to efficiently capture the range of anatomical variation. To this end, main trends (a.k.a. modes) of variation around a population-average shape are learned such that the spanned shape subspace closely fits the training instances (Kainmüller 2014). More precisely, there exists no closer k-dimensional approximation of an input shape within the span of its collection than its unique representation as a deviation from the mean by a linear combination of modes. During post-processing, we project the outcome of u_Ψ^Seg onto SSMs of the femur and tibia separately, thereby removing any unseen variation and returning the best approximation of the unseen shape within our training set. However, especially for femur and tibia, we can assume that there is significantly more variation among individuals in the joint region than at the bone shafts, i.e. we expect our shape approximation to fit the observed data better in the shaft than in the joint region. Therefore, when integrating the SSM prediction, we put more trust in the CNN in regions with high-frequency variations (specifically the cartilage interface rims), whereas we rely more on the SSM in regions where we expect rather low-frequency variations in shape. We employ the common point distribution model that treats shapes as points in the high-dimensional space of (stacked) vertex coordinates and captures their distribution via principal component analysis (PCA). Further details and implementations can be found in Kainmüller (2014), Ambellan et al (2019b) and Ambellan et al (2021a), respectively.
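A minimal sketch of such a point-distribution model, assuming the training shapes are already in vertex correspondence and rigidly aligned (hypothetical helper names; PCA is computed via an SVD of the centered data matrix):

```python
import numpy as np

def fit_ssm(shapes, k):
    """Fit a point-distribution SSM via PCA.

    shapes : (n_samples, 3*n_vertices) stacked vertex coordinates, assumed
             to be in correspondence and rigidly aligned.
    Returns the mean shape and the first k modes of variation.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # right singular vectors of the centered data matrix are the PCA modes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]  # modes: (k, 3*n_vertices), orthonormal rows

def project(shape, mean, modes):
    """Best k-dimensional approximation of a new shape within the model span."""
    coeffs = modes @ (shape - mean)   # coordinates in the mode basis
    return mean + modes.T @ coeffs    # reconstruction from mean + modes
```

Projecting a predicted surface through `project` removes any shape variation outside the learned subspace, which is exactly the regularizing effect described above: implausible deviations are replaced by the closest linear combination of the learned modes.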

Dataset and experimental set-up
To evaluate the proposed method, we used the open-access OAI-ZIB dataset introduced in Ambellan et al (2019b). It consists of 507 3D knee MR images from the OA initiative (http://nda.nih.gov/oai) of shape N_x × N_y × N_z = 384 × 384 × 160, together with the corresponding segmentation maps of the femoral bone and cartilage as well as the tibial bone and cartilage. The segmentations provided by the OAI-ZIB dataset were obtained as manual segmentations performed by experienced users of the Zuse Institute Berlin (ZIB), starting from automated segmentations obtained by the method in Seim et al (2010). The employed data covers the full spectrum of OA grades, with a more pronounced focus on severe cases. The images from the subjects (age 61.87 ± 9.33 years, 262 male and 245 female), who were with, or at risk for, symptomatic femoral-tibial OA, were acquired on a Siemens 3T Trio scanner using a double echo steady state (DESS) sequence and have a resolution of 0.36 × 0.36 × 0.7 mm^3. For further details, we refer to the supplementary material of Ambellan et al (2019b).
Note that the SKM-TEA dataset (Desai et al 2021) would also pose an interesting basis for evaluation, as it contains raw k-space data. However, we opted for the OAI-ZIB dataset from Ambellan et al (2019b), as it additionally contains segmentations of the femur and tibia bones and is thus not limited to articular soft tissues. These bones in fact pose distinct challenges (i) due to significantly lower signal strength and contrast to certain neighboring tissues like tendons, as well as (ii) due to their physical size extending outside the field-of-view, thus covering image regions with higher vulnerability to artefacts like signal decay and nonlinear distortions. Another interesting dataset is the recently published K2S challenge dataset (Tolpadi et al 2023), which, however, has two major limitations: first, ground-truth segmentation was carried out with CNNs rather than manually, and a human-reader quality rating served as an additional exclusion criterion. In other words, experts identified cases that are convenient for CNN-based segmentation. Second, the subject characteristics suggest a bias of the dataset towards non-arthritic individuals featuring physiological bone and cartilage configurations. In contrast, all segmentation masks of OAI-ZIB are manually segmented, and two thirds of them feature severe knee osteoarthritis, yielding various unique disease patterns. In addition, by using the dataset from Ambellan et al (2019b), we are able to directly compare the obtained segmentation masks to the ones obtained from the ground-truth images in Ambellan et al (2019b) and therefore assess how much the data acquisition could in principle be accelerated without sacrificing segmentation accuracy.
The phase information of the images was simulated similarly as in Schlemper et al (2018b). From these images, k-space data was retrospectively generated using three different undersampling factors R ∈ {8, 12, 16}. Undersampling masks along k_y and k_z were chosen according to a Poisson disk sampling pattern (Bridson 2007), which was implemented using SigPy (Ong and Lustig 2019). Coil-sensitivity maps for N_c = 18 coils were simulated by employing 3D Gaussian profiles using the function mrisensesim in Muckley et al (2020), version 1.0.0. Figure 1 shows a schematic representation of the retrospective k-space data simulation used for this study. Our dataset consisted of triples (y_I, x_f, s_t) of undersampled k-space data y_I, ground-truth images x_f and target segmentation masks s_t. As in Ambellan et al (2019b), the data was split into 227/26/254 images for training/validation/testing.
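For illustration, a simplified variable-density random mask along k_y-k_z can be generated as follows. Note that this is only an illustrative stand-in for the Poisson disk pattern of Bridson (2007) that the study actually uses via SigPy; it only mimics the key properties of a densely sampled center, a sparse periphery and an overall density of roughly 1/R:

```python
import numpy as np

def variable_density_mask(ny, nz, accel, calib=8, seed=0):
    """Simplified variable-density random undersampling mask in the ky-kz plane.

    Hypothetical helper: a stand-in for a Poisson-disc pattern, not SigPy's
    implementation. `accel` is the target acceleration factor R.
    """
    rng = np.random.default_rng(seed)
    ky, kz = np.meshgrid(np.linspace(-1, 1, ny), np.linspace(-1, 1, nz), indexing="ij")
    r = np.sqrt(ky ** 2 + kz ** 2)
    prob = (1 - r).clip(0.05, 1) ** 2        # higher probability near the k-space center
    prob *= (ny * nz / accel) / prob.sum()   # normalize to the target sampling budget
    mask = rng.random((ny, nz)) < prob
    # fully sample a small calibration region at the center
    cy, cz = ny // 2, nz // 2
    mask[cy - calib // 2: cy + calib // 2, cz - calib // 2: cz + calib // 2] = True
    return mask
```

Unlike a true Poisson-disc pattern, this sketch does not enforce a minimum distance between samples, so sampled points can cluster; it is meant only to convey the variable-density idea behind the masks used in the study.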
The weighting parameter was set to α = 1/3, 2/3, 1, as also used in Sui et al (2021). Thereby, for α = 1, the regularization is learned to be optimal with respect to the subsequent segmentation task given by u_Ψ^Seg, while for 0 < α < 1, the reconstruction error with respect to the L2-norm is also minimized. Pre-training took about two days for each sub-network, while the end-to-end fine-tuning took a further two days for each α. The GPU-memory allocation amounted to approximately 18 GB and 27 GB for pre-training Rec_Θ,λ and u_Ψ^Seg, respectively, and to 47 GB for fine-tuning the entire network. All experiments were run on an NVIDIA RTX A6000 GPU with 48 GB of memory. As no validation set is needed to train SSMs, both shape models were trained on the whole training set.
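The training objective, a convex combination of the MSE and a DSC-based loss weighted by α, can be sketched as follows. The soft-Dice form below is an assumed standard variant; the paper's exact loss function (9) is not reproduced here:

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """1 - soft Dice coefficient, averaged over classes.

    probs, target : (n_classes, n_voxels) predicted probabilities / one-hot labels
    """
    inter = (probs * target).sum(axis=1)
    denom = probs.sum(axis=1) + target.sum(axis=1)
    return float(np.mean(1 - 2 * inter / (denom + eps)))

def joint_loss(x_rec, x_gt, seg_probs, seg_gt, alpha):
    """Convex combination of reconstruction MSE and DSC-based segmentation loss.

    alpha = 1 trains purely for the segmentation task; 0 < alpha < 1 also
    penalizes the L2 reconstruction error.
    """
    mse = float(np.mean(np.abs(x_rec - x_gt) ** 2))
    return (1 - alpha) * mse + alpha * soft_dice_loss(seg_probs, seg_gt)
```

This makes the role of α explicit: at α = 1 the reconstruction term vanishes entirely, so the reconstruction network is free to produce images that are optimal for segmentation rather than for pixel-wise fidelity, which is consistent with the behavior reported in the results.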

Experiments and evaluation
In the following, we evaluate the NN-based method in terms of reconstruction quality as well as the accuracy of the estimated segmentation masks. More precisely, we investigate the impact of the individual components of the proposed pipeline as well as of the training scheme, i.e.
(1) The importance of end-to-end training over model-agnostic decoupled pre-training.
(2) The importance of employing the statistical shape models.
For the obtained reconstructions, we report the peak signal-to-noise ratio (PSNR), the normalized root mean squared error (NRMSE) and the structural similarity index measure (SSIM) (Wang et al 2004). For the segmentation maps, we provide surface distance-based metrics in addition to the DSC, as these allow a more insightful differentiation of tissue shape. More precisely, the metrics used to assess the quality of the obtained segmentations are given by

DSC(A, B) = 2 |A ∩ B| / (|A| + |B|),
ASD(A, B) = 1/(N_∂A + N_∂B) · ( Σ_{a ∈ ∂A} min_{b ∈ ∂B} d(a, b) + Σ_{b ∈ ∂B} min_{a ∈ ∂A} d(b, a) ),
MSD(A, B) = max( max_{a ∈ ∂A} min_{b ∈ ∂B} d(a, b), max_{b ∈ ∂B} min_{a ∈ ∂A} d(b, a) ),

where A denotes the set of ground-truth voxels, B denotes the resulting segmentation mask, ∂A and ∂B represent the boundaries of A and B, i.e. the sets of voxels with at least one neighbor not being part of the respective segmentation mask, and d denotes the Euclidean distance. The number of voxels on the boundaries ∂A and ∂B is written as N_∂A and N_∂B, respectively. The segmentation accuracy is evaluated for the femoral/tibial bone (FB/TB) and the femoral/tibial cartilage (FC/TC) using the DSC, the average surface distance (ASD) and the maximum surface distance (MSD). All these measures are symmetric and allow assessing the global, volumetric (DSC) as well as the local, boundary-related (ASD, MSD) segmentation quality.
As methods of comparison, we use the multitask learning (MTL) approach of Sui et al (2021) and SIS (Ambellan et al 2019b). Since our work extends MTL by incorporating SSMs, we abbreviate it by MTL+SSM. Further, similar to Calivá et al (2020), we evaluate the impact of joint end-to-end training (E2E) over decoupled training (DT). Note that while MTL and our approach have the same overall number of 17,821,797 trainable parameters for the segmentation network, the SIS pipeline has 81,203,846 trainable parameters.
Figure caption (pipeline overview): A CNN is applied to obtain the CNN-prior x_CNN. The image x_CNN is used to regularize the reconstruction problem (3), which is then approximately solved using a conjugate gradient module. From this data-consistent solution, a 3D segmentation network u_Ψ^Seg estimates a segmentation mask. The reconstruction and segmentation networks can be trained jointly to reconstruct the images from the undersampled k-space data and subsequently segment the reconstructions, while the SSM-based post-processing (blue) is a subsequent step that improves the obtained segmentations.
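The three segmentation metrics can be implemented directly from their definitions. The following brute-force NumPy sketch (hypothetical helpers, suitable only for small masks; the boundary extraction assumes the objects lie away from the array border) illustrates them:

```python
import numpy as np

def boundary(mask):
    """Voxel coordinates of `mask` with at least one 6-neighbor outside the mask.

    Uses np.roll, so objects touching the array border would wrap around;
    fine for the small, centered test objects used here.
    """
    m = mask.astype(bool)
    interior = m.copy()
    for ax in range(m.ndim):
        interior &= np.roll(m, 1, axis=ax) & np.roll(m, -1, axis=ax)
    return np.argwhere(m & ~interior)

def dsc(a, b):
    """Dice similarity coefficient: 2|A∩B| / (|A|+|B|)."""
    a, b = a.astype(bool), b.astype(bool)
    return 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))

def surface_distances(a, b):
    """Nearest-neighbor distances between the two boundaries (brute force)."""
    pa, pb = boundary(a), boundary(b)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return d.min(axis=1), d.min(axis=0)  # ∂A→∂B and ∂B→∂A distances

def asd(a, b):
    """Symmetric average surface distance."""
    da, db = surface_distances(a, b)
    return (da.sum() + db.sum()) / (len(da) + len(db))

def msd(a, b):
    """Symmetric maximum surface distance."""
    da, db = surface_distances(a, b)
    return max(da.max(), db.max())
```

For anything beyond toy examples, the pairwise-distance matrix becomes prohibitively large; production implementations typically use a distance transform of the boundary instead.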

Reconstruction results
Table 1 summarizes the results of the obtained image reconstructions for the baseline MTL method with DT and E2E training for α ∈ {1/3, 2/3, 1} and R ∈ {8, 12, 16}. These reconstructions also serve as input for the subsequent SSM and the entire SIS pipeline. Comparing E2E to DT, we see that the quality of the obtained reconstructions tends to decrease with increasing α: for E2E with α = 1/3 and α = 2/3 it is comparable to DT, while for α = 1 it is slightly lower.

Segmentation results
Table 2 summarizes the quality of the obtained segmentation maps for MTL, our proposed approach MTL+SSM and MTL+SIS. Obtaining the segmentations with the different methods takes approximately 0.1 s, 30 s and 10 min, respectively. Figure 2 shows an example of images and corresponding segmentation masks obtained with the NN-based method, once for DT and once for E2E with α = k/3 for k = 1, 2, 3. We again clearly see how the joint end-to-end training for α = 1 improves the segmentation masks over the ones obtained by DT, although the obtained reconstructions exhibit a slightly lower PSNR and SSIM and a higher NRMSE. In this particular example, the misclassification in the region of the tibia, which is visible for DT (in fact for all R), could be corrected by E2E for all α, which, however, introduced some artefacts towards the border of the image.
By evaluating table 2 and comparing the segmentations obtained from MTL for DT and E2E with α = 1, we see that the segmentation quality tends to slightly increase in terms of the DSC (except for TC), while the quality in terms of ASD and MSD quite consistently increases as well.
Our proposed combination MTL+SSM significantly improved the obtained segmentation maps and shows a consistent improvement of E2E over DT with respect to all measures. An example showcasing typical differences between the results of MTL, MTL+SIS and the proposed MTL+SSM is given in figure 3, clearly showing that shape knowledge can be used to enhance the outcome of MTL. Further, by comparing the segmentation maps provided by MTL+SSM for E2E to the ones obtained by MTL+SIS, we see that the results are very similar, meaning that MTL+SSM achieves nearly the same performance as SIS, which takes 20 times longer to obtain the segmentations. Finally, we point out that the segmentation metrics obtained by SIS for an undersampling factor of R = 8 are close to the ones reported in table 5 of Ambellan et al (2019b), i.e. on the ground-truth images. Remarkably, this implies that with the proposed MTL+SSM it is possible to accelerate the scanning process by a factor of R = 8 and, at the same time, to accelerate the segmentation process compared to SIS, without sacrificing segmentation accuracy.
Figure 2 caption: E2E partially improves the obtained segmentation compared to DT in terms of partially wrongly segmented regions in the tibia (yellow arrows). However, residual artefacts at the lower tibia shaft, which we attribute to the necessity to train on patches, were either not entirely removed or further amplified (orange arrows). The proposed MTL+SSM, applied as a second step, can easily account for this issue and successfully removes the remaining segmentation artefacts.
Figure 3 caption: Outcome of u_Ψ^Seg, of the proposed MTL+SSM approach (middle) and of the SIS pipeline applied to the images obtained by MTL (right) for an undersampling factor of R = 8, showcasing typical artefacts appearing mainly at the lower tibia shaft, including cavities in the tibia, 'melting' of the tibia and its cartilage, as well as parts of the tibia being misclassified as femoral bone. Incorporated shape knowledge, as in MTL+SSM and SIS, can efficiently handle this problem.

Discussion
In the following, we discuss the results obtained with the proposed MTL+SSM method as well as similarities and differences to some other recently published works.

Image reconstruction
From table 1, we can assess the effect of the end-to-end training on the obtained intermediate reconstructions, which serve as input for the subsequent segmentation pipelines. Table 1 indicates that the joint training of the reconstruction and the segmentation networks has a slight negative influence on the obtained reconstructions in terms of PSNR, SSIM and NRMSE. The extent of this influence clearly depends on the hyper-parameter α which, as can be seen from the loss function in (9), balances the importance of the segmentation mask against the quality of the reconstruction. However, we see that the influence of α is perhaps less pronounced than expected, since the obtained reconstructions only slightly differ in terms of the reported measures as well as visually, see figure 2. This result can most probably be explained by the fact that the reconstructions are constrained by the chosen regularization model, i.e. they are solutions of the variational problem (3). Additionally, an interplay of the regularization parameter λ and the hyper-parameter α could be responsible for the perhaps unexpectedly small difference between the reconstructions obtained with E2E for the different choices of α. On the other hand, from the point-wise error images in figure 2, we clearly see that the fully task-adapted joint training, i.e. E2E with α = 1, yields somewhat noisier reconstructions with potentially sharper edges, which might thus be better suited for the subsequent segmentation task. This suggests that the obtained reconstructions differ, at least to some extent, in terms of local image features.

Image segmentation
From figure 2, it is visible how the joint end-to-end training improves the obtainable segmentation in terms of underestimation of the tibia (R = 8) as well as misclassification of the tibia (R = 12 and R = 16). However, residual artefacts in the region of the lower tibia shaft seem to remain despite the joint end-to-end training. These residual artefacts can possibly be attributed to the necessity to train on patches, which is a compromise necessary to be able to use highly expressive networks for the reconstruction and segmentation. However, from figure 3, we see that these residual artefacts can easily be corrected by employing the SSMs as a final regularization step. By carefully evaluating table 2, we can make the following observations. First, by comparing MTL for DT versus E2E with α = 1, we see that employing end-to-end training quite consistently improves the obtainable segmentation maps with respect to all measures except for TC (which is also consistent with the visual results shown in figure 2). This demonstrates the superiority of the segmentation maps obtained by end-to-end training over those obtained by decoupled training. Second, by comparing MTL with E2E, α = 1 to the proposed combination MTL+SSM with E2E, α = 1, we see that the employed SSM can compensate for the residual artefacts in the lower shaft of the tibia (66% DSC versus 85.5% DSC) and additionally significantly improves the distance-based metrics ASD and MSD. Finally, the comparison of the proposed MTL+SSM with E2E to MTL+SIS with E2E, both for α = 1, reveals that both methods perform comparably well in terms of all metrics. However, MTL+SSM yields the segmentation maps in only approximately 30 s, compared to the 10 min required by the computationally more demanding SIS pipeline, which alternates between the application of 2D and 3D CNNs and the fitting of statistical shape models. Thus, the proposed method yields a speedup of approximately a factor of 20 in terms of image processing time. Figure 3 shows an example that visually confirms the observations made from table 2.

Related works
The approach presented in Calivá et al (2020) shares our application example but assumes a single-coil data acquisition. It consists of a single 3D U-Net that processes a zero-filled reconstruction with a shared encoder path and two distinct decoder branches, which address the reconstruction and the segmentation, respectively. However, the employed reconstruction network (merely a 3D U-Net) does not employ any method for ensuring data consistency (DC) of the solution, which is typically used in today's state-of-the-art methods for image reconstruction, see e.g. Ongie et al (2020).
The work in Sun et al (2019) uses a joint data-consistent reconstruction and segmentation approach for 2D brain MRI, where the physics-informed (single-coil) reconstruction method of Schlemper et al (2018a) is followed by a 2D U-Net for segmentation. They report improved results in both the segmentation and the reconstruction task when compared to decoupled training. Similarly, the authors in Karkalousos et al (2023) propose a multitask learning network for joint image reconstruction and segmentation in 2D brain MRI, also involving the use of multiple receiver coils. Thereby, the application of a data-consistent reconstruction network based on cascades of independently recurrent inference machines (Karkalousos et al 2022) and an Attention U-Net (Oktay et al 2018) are alternated several times to yield a final reconstruction and segmentation. However, end-to-end training of reconstruction networks similar to Schlemper et al (2018a), Aggarwal et al (2018), Hammernik et al (2018), Karkalousos et al (2022) and Geng et al (2023), which are based on algorithm unrolling (Monga et al 2021) and often alternate several times between CNN- and DC-blocks, becomes computationally prohibitive for large reconstruction problems, e.g. the 3D multi-coil acquisitions considered here. As an alternative, one can consider methods in which only one single NN-based image prior is obtained and used for regularization, see e.g. Hyun et al (2018) and Kofler et al (2021). This choice of architecture for the reconstruction module was for example used in the work of Sui et al (2021), followed by a U-Net for image segmentation of liver and renal lesions in 2D MRI.
The work in Acar et al (2022) compares several reconstruction and segmentation networks for joint training.They introduce a training stabilization technique by successively increasing undersampling across training epochs.In their experiments, the global MRI reconstruction suffers from joint training but the quality in the area of diagnostic interest improves.However, none of the discussed methods incorporates shape knowledge into their frameworks.
Finally, the winners of the K2S challenge (Tolpadi et al 2023) used a similar approach to Sui et al (2021), but with no DC-module and with two nnU-Nets (Isensee et al 2021) instead of U-Nets. Interestingly, the challenge organizers reported that the winning approach yields the best metrics with respect to the segmentation task, while the intermediate reconstructions, which, by not employing any DC-module, are not enforced to be data-consistent, were outperformed by several other teams. This suggests that aiming for accurate segmentations with the method in Isensee et al (2021) seems to be somewhat at odds with obtaining data-consistent reconstructions that can additionally be analyzed by radiologists. Further, we note that all mentioned methods employ a relatively simple U-Net for the task of segmentation, while nowadays more sophisticated and complex models exist. For example, the method in Ambellan et al (2019b) alternatingly applies 2D/3D U-Nets and fittings of SSMs, which act as a regularization method and allow for an effective region-of-interest restriction. These SSMs solely rely on the shape of the anatomy under study, not on the underlying imaging modality.

Limitations
Clearly, the main limitation of the proposed method is the hardware required for end-to-end training. Despite the use of a relatively powerful GPU with 48 GB of memory, training could only be carried out on patches of the entire images, which we believe to be the reason for the residual artefacts visible in figures 2 and 3 for MTL. However, we point out that there might be several possibilities to address this issue, both from an implementation point of view, e.g. applying activation checkpointing, and by different choices in the architecture design, e.g. 2.5D U-Nets (Zimmermann et al 2023) instead of 3D U-Nets for the reconstruction network. Additionally, we point out that, due to the lack of raw k-space measurements in the OAI-ZIB data, the k-space data for this study had to be retrospectively simulated, which limits the potential impact of this study from a practical and clinical perspective.

Outlook and future work
Although the presented approach yields results in terms of segmentation accuracy that are close to the ones obtainable by SIS from the ground-truth images, we can identify several research directions that might be worth pursuing. Based on the visible success of the end-to-end training strategy, it might be desirable to include the regularization by the employed SSMs as an additional block that can be backpropagated through and thus allows for end-to-end training of the reconstruction network, the segmentation network as well as the regularization by the SSMs. To achieve this, which necessarily comes at the cost of additional hardware requirements, it might be necessary to further investigate which compromises can or have to be made in the choice of the reconstruction and segmentation networks. Addressing these questions would reduce the hardware requirements needed to train such methods and ultimately increase their applicability to different problems as well as to different datasets.
Further, we note that our baseline segmentation network consists of a relatively simple 3D U-Net. Other recent and more advanced methods, such as the anomaly-aware 3D U-Net presented in Woo et al (2024), could also be applied to further extend the method.
Another aspect that could be addressed is the employed sampling pattern. While in this work we employed a sampling pattern based on Poisson disk sampling (Bridson 2007), other patterns, e.g. pseudo-Gaussian sampling (Pandit et al 2016), could also be employed. Nevertheless, note that while an adaptation to different sampling patterns on a Cartesian grid can easily be accomplished, an extension to non-Cartesian sampling patterns such as radial (Lauterbur 1973) or spiral (Meyer et al 1992) would need to appropriately address the resulting challenges concerning the more complex image reconstruction block. Last, we note that although the approach was presented for the reconstruction and segmentation of MR images focusing on femoral as well as tibial bone and cartilage, the method's structure is general. Thus, we expect it to be applicable to the reconstruction and segmentation of other organs and structures as well.
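For illustration, a retrospective Cartesian undersampling mask with variable sampling density can be generated along the following lines. This is a simplified pseudo-Gaussian stand-in for the Poisson disk masks actually used; the Gaussian width of 0.3 and the calibration-region size are arbitrary choices for the sketch:

```python
import numpy as np

def pseudo_gaussian_mask(ny, nz, accel=8, calib=16, seed=0):
    """2D phase-encode mask: Gaussian-weighted random sampling with a
    fully sampled central calibration region (a simple stand-in for
    Poisson-disk undersampling on a Cartesian grid)."""
    rng = np.random.default_rng(seed)
    y, z = np.meshgrid(np.linspace(-1, 1, ny), np.linspace(-1, 1, nz),
                       indexing="ij")
    prob = np.exp(-(y**2 + z**2) / (2 * 0.3**2))   # density falls off radially
    prob *= (ny * nz / accel) / prob.sum()          # hit the target rate on average
    mask = rng.random((ny, nz)) < np.clip(prob, 0, 1)
    cy, cz = ny // 2, nz // 2
    mask[cy - calib // 2:cy + calib // 2, cz - calib // 2:cz + calib // 2] = True
    return mask
```

The mask is applied pointwise to the fully sampled k-space to simulate acceleration, with the densely sampled centre preserving the low frequencies that carry most of the image contrast.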

Conclusion
We proposed a combination of a task-adapted NN-based approach and statistical shape models for a realistic 3D multi-coil knee MRI reconstruction and segmentation problem from undersampled k-space data and compared it to MTL (Sui et al 2021) as well as to a shape-informed segmentation (SIS) pipeline (Ambellan et al 2019b).
Our experiments suggest that it is still possible to improve over end-to-end trainable task-adapted NNs such as MTL by incorporating SSMs. Further, the segmentation masks of the proposed MTL+SSM are on par with those of the segmentation pipeline (SIS) but are obtained in only 30 s compared to 10 min. Last, we report that for an undersampling factor of R = 8, MTL+SIS as well as the proposed MTL+SSM are able to segment the undersampled images nearly as accurately as SIS segments the ground-truth images (see Ambellan et al (2019b), table 5).
2.5. Network training
End-to-end training of the entire network u^Seg_Ψ ∘ u^Rec_{Θ,λ} was carried out by minimizing a convex combination of the mean squared error (MSE) and a DICE similarity coefficient (DSC)-based loss function. As in other works (… et al 2019, Tolpadi et al 2023), u^Rec_{Θ,λ} and u^Seg_Ψ were separately pre-trained for the corresponding tasks, i.e. by minimizing the MSE- and the DSC-based loss functions, respectively, training for 36 epochs with the ADAM optimizer (Kingma and Ba 2014) with learning rates of 10^−4 and 10^−5, respectively. For each α, the entire network u^Seg_Ψ ∘ u^Rec_{Θ,λ} was fine-tuned for a further 16 epochs with the respective initial learning rates decreased by a factor of five. Due to GPU-memory constraints, training had to be performed on image patches of shape N
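The convex combination of the two loss terms can be sketched as follows; `soft_dice` is a hypothetical differentiable stand-in for the DSC-based loss term actually used, and in practice both terms would of course be evaluated per mini-batch inside the training framework:

```python
import numpy as np

def soft_dice(pred, target, eps=1e-6):
    """Differentiable Dice similarity coefficient for soft masks in [0, 1]."""
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def joint_loss(recon, image, seg_prob, seg_mask, alpha=0.5):
    """Convex combination of reconstruction MSE and a DSC-based
    segmentation loss for joint end-to-end training."""
    mse = np.mean((recon - image) ** 2)
    dsc_loss = 1.0 - soft_dice(seg_prob, seg_mask)
    return alpha * mse + (1 - alpha) * dsc_loss
```

Setting α = 1 recovers a purely reconstruction-driven objective, while α = 0 trains the composed network for segmentation accuracy alone.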

Figure 1. A schematic illustration of the retrospective k-space data generation process, the end-to-end trainable reconstruction and segmentation network u^Seg_Ψ ∘ u^Rec_{Θ,λ} (red) and the SSM-based post-processing (blue). The undersampled k-space data is retrospectively simulated according to (1). The input of the network is the zero-filled reconstruction. After a few warm-up iterations for approximately solving the normal equations, the 3D U-Net u^Rec_Θ is applied to obtain the CNN-prior x_CNN. The image x_CNN is used to regularize the reconstruction problem (3), which is then approximately solved using a conjugate gradient module. From this data-consistent solution, a further 3D segmentation network u^Seg_Ψ is applied to estimate a segmentation mask. The reconstruction and segmentation network can be trained to jointly reconstruct the images from the undersampled k-space data and subsequently segment the reconstructions, while the SSM-based post-processing is a subsequent step that improves the obtained segmentations.

Figure 3. Exemplary comparison between segmentation outcomes of the jointly trained 3D U-Net u^Seg_Ψ of MTL (left) and u^Seg_Ψ followed by SSM-based post-processing, from the undersampled image, i.e.
Let u^Rec_Θ denote a CNN with trainable parameters Θ which estimates a CNN-based image prior x_CNN. Similarly as in other works (Hyun et al 2018, Sui et al 2021, Kofler et al 2021), we formulate the reconstruction problem as (3). The 3D U-Net (Ronneberger et al 2015) has been extensively applied to image reconstruction (Jin et al 2017, Hyun et al 2018, Kofler et al 2018, Hauptmann et al 2019, Sriram et al 2020, Kofler et al 2021) as well as to image segmentation problems (Ambellan et al 2019b). As in
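The data-consistency step described above, i.e. approximately solving the CNN-regularized normal equations (A^H A + λI)x = A^H y + λ x_CNN with a conjugate gradient method, can be sketched as follows for a single-coil Cartesian toy case; the multi-coil operator of the actual method additionally involves coil sensitivity maps, which are omitted here for brevity:

```python
import numpy as np

def cg_data_consistency(y, mask, x_cnn, lam=1.0, n_iter=10):
    """Approximately solve (A^H A + lam I) x = A^H y + lam * x_cnn with
    conjugate gradients, where A = M F is a masked Fourier transform
    (single coil). `x_cnn` plays the role of the CNN-prior image."""
    A = lambda x: mask * np.fft.fftn(x, norm="ortho")
    AH = lambda k: np.fft.ifftn(mask * k, norm="ortho")
    normal = lambda x: AH(A(x)) + lam * x      # (A^H A + lam I)
    b = AH(y) + lam * x_cnn                    # right-hand side
    x = np.zeros_like(b)
    r = b - normal(x)
    p = r.copy()
    rs = np.vdot(r, r)
    for _ in range(n_iter):
        if np.sqrt(abs(rs)) < 1e-12:           # converged
            break
        Ap = normal(p)
        a = rs / np.vdot(p, Ap)
        x = x + a * p
        r = r - a * Ap
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For this Cartesian single-coil case the normal operator has only two distinct eigenvalues (λ on unsampled and 1 + λ on sampled frequencies), so CG converges in very few iterations, which is consistent with the small, fixed number of CG steps used inside the unrolled network.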

Table 1. Reconstruction results for undersampling factors R = 8, 12, 16 obtained with the investigated baseline MTL by decoupled training (DT) as well as joint end-to-end training (E2E) with the loss function in (9) for

Table 2. Segmentation accuracy of MTL (Sui et al 2021), our MTL+SSM and MTL+SIS ((Sui et al 2021) + (Ambellan et al 2019b)) for both DT and E2E training strategies, where α = 1 for undersampling factor R = 8 for E2E. The first row shows results of the segmentation network u^Seg_Ψ of MTL trained along with the reconstruction NN. The middle part shows results with u^Seg_Ψ as above followed by SSM-based post-processing as in SIS, correcting errors as shown in figure 3. The lower part shows results for the SIS segmentation pipeline, incorporating statistical shape knowledge, applied to the reconstructed images.