TIST-Net: style transfer in dynamic contrast enhanced MRI using spatial and temporal information

Objective. Training deep learning models for image registration or segmentation of dynamic contrast enhanced (DCE) MRI data is challenging. This is mainly due to the wide variations in contrast enhancement within and between patients. To train a model effectively, a large dataset is needed, but acquiring it is expensive and time consuming. Instead, style transfer can be used to generate new images from existing images. In this study, our objective is to develop a style transfer method that incorporates spatio-temporal information to either add or remove contrast enhancement from an existing image. Approach. We propose a temporal image-to-image style transfer network (TIST-Net), consisting of an auto-encoder combined with convolutional long short-term memory networks. This enables disentanglement of the content and style latent spaces of the time series data, using spatio-temporal information to learn and predict key structures. To generate new images, we use deformable and adaptive convolutions, which allow fine-grained control over the combination of the content and style latent spaces. We evaluate our method using popular metrics and a previously proposed contrast weighted structural similarity index measure. We also perform a clinical evaluation, where experts are asked to rank images generated by multiple methods. Main Results. Our model achieves state-of-the-art performance on three datasets (kidney, prostate and uterus), achieving an SSIM of 0.91 ± 0.03, 0.73 ± 0.04 and 0.88 ± 0.04, respectively, when performing style transfer between a non-enhanced image and a contrast-enhanced image. Similarly, SSIM results for style transfer from a contrast-enhanced image to a non-enhanced image were 0.89 ± 0.03, 0.82 ± 0.03 and 0.87 ± 0.03. In the clinical evaluation, our method was ranked consistently higher than other approaches. Significance. TIST-Net can be used to generate new DCE-MRI data from existing images. In future, this may improve models for tasks such as image registration or segmentation by allowing small training datasets to be expanded.


Introduction
Deep learning models are increasingly popular due to their impressive performance across multiple image processing tasks such as image segmentation, object detection and classification. There have been continuous improvements made due to advancements in architecture, computational performance and access to larger datasets. Having access to large datasets is beneficial as it allows training with images that are similar to images seen at inference, as well as helping to avoid overfitting. However, medical imaging datasets are often small with little variability compared to natural image datasets such as ImageNet (Deng et al 2009). This can make it difficult to train robust models. For medical image processing tasks, deep learning consistently achieves state-of-the-art performance, usually using augmentation to provide the models with additional varying data.
Datasets containing dynamic contrast enhanced (DCE) MRI are typically small with a limited number of annotations outlining organs of interest. DCE-MRI is a type of quantitative imaging used to monitor microvascular perfusion (Ingrisch and Sourbron 2013). Multiple T1-weighted images are acquired rapidly over a few minutes along with a contrast agent injection, which causes a rapid increase in intensity. DCE-MRI data contains motion and contrast enhancement which is tissue dependent.
Quantitative analysis of DCE-MRI is often performed, which usually requires the manual selection of voxels in each frame that represent the tissue of interest. However, the selected voxels in a single frame may not represent the target tissue over the temporal dimension due to motion between frames. Obtaining manually annotated images is time consuming and difficult, hence fully annotated DCE-MRI datasets are scarce. This reduces the diversity of available data to train segmentation and registration models.
When training a model using DCE-MRI data, it was found that strategically using data from the whole sequence with varying contrast enhancements led to better performing models (Tattersall et al 2023a). However, the datasets are rarely annotated and lack diversity. Augmentation can be used to increase the size and diversity of the training dataset, which can play a key role in training a robust model (Shorten and Khoshgoftaar 2019). By artificially expanding the dataset through augmentation, models are more capable of handling real-world scenarios where data can differ from the training set. Augmentation acts as a form of regularisation, reducing overfitting and enhancing the model's ability to generalise to unseen examples (Shorten and Khoshgoftaar 2019). However, common augmentation techniques may not be enough to provide a model trained using DCE-MRI data with a sufficiently diverse set of examples during training. This is due to the lack of annotations for non contrast-enhanced (CE) images, and the widely varying levels of contrast enhancement in the images. Instead, an approach which can generate new images by adjusting the contrast enhancement effect would enable a diverse dataset to be created. The generated image would remain structurally identical whilst the contrast enhancement in the image would be changed, so that any available ground truth can still be utilised. Such a method would allow robust models to be trained using smaller datasets enlarged with synthetic images.
Style transfer is a technique that aims to combine a content and a style image, resulting in an image which has a combination of characteristics from both images. The content image provides the underlying structure and arrangement of objects, whilst the style image contributes the artistic patterns, colours, and textures. Within medical imaging, style transfer has been used to generate new CT images from MRI or MRI from CT (Yang et al 2019, Reaungamornrat et al 2022). It has also been used to improve image quality by denoising low dose CT images by translating them into a high dose counterpart (Wolterink et al 2017). Within the scope of DCE-MRI, the following constraints need to be met. (1) The structure of the content image needs to be preserved. This allows for any existing annotations to be used. (2) The change in intensities needs to be localised and tissue dependent. (3) The characteristics of the image need to be preserved, such as the noise expected in MRI.
In this work, we proposed a model which used auto-encoders with convolutional long short-term memory (LSTM) networks (Chao et al 2018) to learn spatio-temporal information. Adaptive convolutions (AdaConv) were used to combine content and style latent spaces alongside deformable convolutions to allow the model to adapt the receptive field to account for local geometric variations. This gave the model the ability to decide which areas of the image should have specific style attributes.
Quantitatively evaluating style transfer can be difficult with unregistered data. Current metrics such as peak signal to noise ratio (PSNR), structural similarity index measure (SSIM) (Wang et al 2004) and multiscale (MS)-SSIM (Wang et al 2003) primarily focus on the pixel-level similarity between a pair of images. However, in DCE-MRI data, there is a large variety of contrast enhancement which is tissue dependent. This makes evaluating synthetic DCE-MRI a difficult task. A contrast-weighted (CW)-SSIM (Tattersall et al 2023b) was previously proposed which separated the measurement of overall content and localised style. In this work, we also explored and validated this metric to measure its effectiveness for evaluating generated images.
We compared our model with other style transfer models using standard metrics (PSNR, SSIM and MS-SSIM) and CW-SSIM. We also completed a rigorous clinical evaluation of our generated images to correlate the quantitative analysis with experts' opinions.

Related work
Using neural networks for style transfer was first proposed by Gatys et al (2016) and has been successfully built upon. Previously, it was a slow iterative process with low quality style transfer, but the use of feed-forward networks (Ulyanov et al 2016) to speed up the process, together with perceptual losses (Johnson et al 2016) that give more visually pleasing results, has led to impressive models that generate realistic images (Karras et al 2021).

Approaches to style transfer
There have been multiple style transfer approaches to generate new images, such as learning mappings between images or content/style disentanglement. Pix2Pix (Isola et al 2017) and CycleGAN (Zhu et al 2017) are methods that learn mappings between images. Pix2Pix used a conditional GAN which consists of a generator that maps an input image to a desired output and a discriminator to classify between real and fake images. Pix2Pix has been used to generate new medical images such as CT pelvis images from MRI (Maspero et al 2018). Although high quality images can be generated using Pix2Pix, it requires paired and registered images to train effectively. To circumvent this, Zhu et al (2017) proposed CycleGAN, which can be trained using unpaired data. In this approach, generators learn mappings between two domains along with discriminators to classify between real and fake images. A cycle consistency loss is also used to ensure that translated images can be reversed. A disadvantage of both of these approaches is that they can struggle to generate diverse outputs when new style images are introduced at inference time, as the model learns to map between domains directly. Galli et al (2023) proposed a method using a CycleGAN architecture which translated the appearance of DCE-MRI breast data between two datasets in an attempt to increase the size of available datasets to improve lesion classification.
Alternatively, content/style disentanglement methods such as MUNIT (Huang et al 2018) have been proposed. These methods aim to predict two latent spaces which describe the content and style of the image. This type of method can allow for better control of the translation process and the preservation of the content image whilst transferring style. Content/style disentanglement methods usually use an auto-encoder based architecture. Encoders typically encode content and style latent spaces whilst the decoders combine the latent spaces to generate an image. Discriminators can also be used for adversarial training. Lee et al (2020) proposed DRIT++, which is an extension of MUNIT. DRIT++ aimed to enhance diversity by introducing disentanglement at the domain level, and improving attribute manipulation. A key issue with these approaches is the large computational requirement. Content/style disentanglement methods have been used with DCE-MRI data, such as the method proposed by Cai et al (2023), who used content and style decoders along with a mapping network to increase style diversity.
There has also been some work using style transfer with videos. Early work processed frames independently; however, it was found that this created videos that flickered and produced false discontinuities (Chen et al 2017). To alleviate this, Chen et al (2017) proposed a method to ensure temporal consistency between image frames. Their method uses three components: a style sub-network, a flow sub-network and a mask sub-network. The style sub-network is an auto-encoder which performs the style transfer. The flow sub-network estimates the correspondence between consecutive image frames and warps the features. Finally, the mask sub-network regresses a mask to features in adjacent time frames. This ensures that the features of objects in each image that are similar can be reused. Using this approach allows a smooth style transfer between image frames, but it is prone to errors when there is large motion between frames, and it propagates errors over time, leading to inconsistent style transfer and blurriness.

Injecting style
Style transfer methods aim to transform images from one domain to another whilst preserving content. However, they often fail to generate diverse and realistic images. To address this, methods which inject style into the content image have been proposed as an effective solution, as they help preserve the structure of the generated image whilst creating an image that looks similar to the style image.
A popular method for injecting style is adaptive instance normalization (AdaIN), proposed by Huang and Belongie (2017). As shown in figure 1(a), AdaIN normalises the activations in a neural network based on the statistics (mean and standard deviation) of the style image. This allows the network to transfer the style of the style image to the content image by matching their statistical properties, but this only uses global style information. In the case of contrast transfer for DCE-MRI data, CE varies between tissues; however, AdaIN adds CE to the whole image, making it unsuitable for this task as the contrast enhancement is localised to specific tissues.
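To make the global nature of AdaIN concrete, the following is a minimal sketch in plain Python rather than the implementation of Huang and Belongie (2017). It assumes feature maps are given as lists of channels, each channel a flat list of activations; the function names `channel_stats` and `adain` are hypothetical.

```python
import math


def channel_stats(values, eps=1e-5):
    """Return the mean and (eps-stabilised) standard deviation of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(content, style, eps=1e-5):
    """Match each content channel to the mean/std of the same style channel.

    content, style: lists of channels, each channel a flat list of floats.
    Every position in a channel receives the same scale and shift, which is
    why the transferred style is global rather than tissue specific.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan, eps)
        s_mean, s_std = channel_stats(s_chan, eps)
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out
```

Because a single scale and shift is applied per channel, a bright style image brightens every location of the content features equally, which is the behaviour described above as unsuitable for localised contrast enhancement.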
Spatially-adaptive normalization for generative networks (SPADE), proposed by Park et al (2019), is a method that builds on AdaIN and introduces spatially-adaptive normalisation by incorporating semantic information for more controlled and semantically consistent style transfer (figure 1(b)). Usually, a semantic segmentation map is used to predict the normalisation parameters for each pixel location. This improves the control over the style generation process.
Another approach to injecting style into the generator network is through an adaptive convolution (AdaConv) (Chandran et al 2021). In AdaConv, the convolution filters are learned dynamically from the style latent code to create a set of convolution filters which are then applied to the content image (figure 1(c)). While this comes at a cost of increased computation, AdaConv captures global and local information to predict parameters to combine the style and content latent spaces.

Data
Three DCE-MRI datasets were used. The first contains 2D kidney DCE-MRI (Lietzmann et al 2012) from 13 patients with 375 images acquired continuously at a temporal resolution of 1.6 s, a spatial resolution of 384 × 348 and pixel sizes 1.08 × 1.08 mm. The second contains 3D prostate DCE-MRI (Lemaître et al 2015) from 20 patients with 40 volumes acquired continuously at a temporal resolution of 6 s, a spatial resolution of 256 × 192 × 16 and voxel sizes 1.12 × 1.12 × 3.5 mm. The third dataset contains 3D uterus DCE-MRI (Reavey et al 2021) from 36 patients with 150 volumes acquired continuously at a temporal resolution of 2.45 s, a spatial resolution of 192 × 192 × 30 and voxel sizes 2.08 × 2.08 × 4 mm. Each dataset was split at a patient level with an 80:20 ratio, and training was done using five-fold cross validation.
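A patient-level split with five folds can be sketched as follows. This is one plausible reading of the protocol, not the authors' code: it assumes that each of the five folds is held out in turn, which gives roughly an 80:20 ratio per fold. The function name `patient_level_folds` and the seed are hypothetical.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Return a list of (train_patients, held_out_patients) pairs.

    Splitting is done on patient identifiers, so all frames or volumes
    from one patient end up on the same side of every split.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        held_out = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        splits.append((train, held_out))
    return splits
```

For the 20 prostate patients this yields 16 training and 4 held-out patients per fold; for patient counts that are not divisible by five, such as 13 or 36, the fold sizes differ by at most one.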

Style transfer-global architecture
As in Tattersall et al (2023b), we used a structure composed of encoders E, decoders D and discriminators Dis. A sequence of images was passed into encoders to predict content and style latent spaces. Successful disentanglement led to content latent spaces containing information representing the structures of the image and style latent spaces containing information representing the modality (MRI) and any contrast enhancements. The content latent spaces were then passed into a bi-directional (Graves and Schmidhuber 2005) convolutional LSTM (figure 2), which allowed for the modelling of temporal information from a sequence of images. The convolutional LSTM exploited spatial invariance by extracting relevant features from the images regardless of their spatial position. By using a bi-directional LSTM, the model was able to learn information from past and future events.
To combine content and style, Tattersall et al (2023b) used AdaConv to convolve a kernel (adaptive convolution), pointwise kernel (adaptive pointwise convolution) and bias predicted from a style latent space over a content latent space. This allowed for local, spatial information to be used when combining content and style latent spaces. However, we found that the generated images exhibited smoothing effects, especially in areas of contrast enhancement (ACE). To improve this, we proposed temporal image-to-image style transfer (TIST-Net), which used deformable convolutions proposed by Dai et al (2017) to offset the adaptive convolution. The offset is predicted by passing a feature map, in our case the content latent space, through a convolutional layer. This offset contained the information used to decide how much a kernel should be deformed. This allows the receptive field of a convolutional kernel to adapt and account for local geometric variations in the input data. Figure 3 shows the role of deformable convolutions with AdaConv. By offsetting the kernel, the model decided which areas of the content image should gain particular style attributes, which is necessary when generating images with contrast enhancement at various time points.
We show the global architecture for our method in figure 4. Discriminators were used to predict if the generated image was real or fake. We also highlight the losses used to ensure disentanglement.
The L1 loss was used as the cycle consistency loss, calculated between the original and reconstructed images as well as between each of the initially predicted and reconstructed latent spaces. This ensured that important information such as structure was not lost during the reconstruction and translation process. The mean square error (MSE) loss was used as an adversarial loss to predict if an image was real or generated. This helped the model to generate images that were realistic. We also used two perceptual losses proposed by Johnson et al (2016). The content, style and generated images were passed through a pretrained model P to predict feature maps. These feature maps were used as inputs to each of the losses. The first was a feature reconstruction loss which calculated the squared, normalised Euclidean distance (equation (1)) between the feature maps of the content image c and the generated image g, where f is the dimension of the feature map. This loss encouraged the style transfer model to generate images with similar structural features. The second was a style reconstruction loss which penalised the differences in style between the feature maps of the style image s and the generated image. The squared Frobenius norm of the difference between the Gram matrices of the input feature maps was used (equation (2)). Minimising this loss encouraged the model to generate images which had similar style patterns and texture to the style image. We applied the same weighting as shown in MUNIT (Huang et al 2018) to each of the losses used during training. The cycle consistency loss between the original and reconstructed image had a weighting of 10 whilst the remaining losses had an equal weighting of 1. Some methods of style transfer, such as the one proposed by Gatys et al (2016), aimed to balance a style and a content loss to generate an image. In our work, we do not need to balance the losses; we only need to minimise each of them.
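For reference, the two perceptual losses and the weighted objective described above can be written as follows. This is a reconstruction from the description in the text and from the standard form in Johnson et al (2016), not a copy of the original equations; the Gram matrix operator $G$, its normalisation by $f$, and the loss symbols $\mathcal{L}_{\cdot}$ are notational assumptions.

```latex
% Feature reconstruction loss between content image c and generated image g
\mathcal{L}_{\mathrm{feat}}(g, c) = \frac{1}{f}\,\bigl\lVert P(g) - P(c) \bigr\rVert_2^2

% Style reconstruction loss between style image s and generated image g,
% with G(.) the Gram matrix of a feature map reshaped to channels x positions
G(F)_{ij} = \frac{1}{f} \sum_{k} F_{ik} F_{jk}, \qquad
\mathcal{L}_{\mathrm{style}}(g, s) = \bigl\lVert G(P(g)) - G(P(s)) \bigr\rVert_F^2

% Weighted sum used during training (weights as stated in the text)
\mathcal{L} = 10\,\mathcal{L}_{\mathrm{cyc}}^{\mathrm{image}}
            + \mathcal{L}_{\mathrm{cyc}}^{\mathrm{latent}}
            + \mathcal{L}_{\mathrm{adv}}
            + \mathcal{L}_{\mathrm{feat}}
            + \mathcal{L}_{\mathrm{style}}
```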

Implementation
We used PyTorch to implement our approach and trained it with the Adam optimiser (Kingma and Ba 2015), with a learning rate of 0.001 and a batch size of 8 for 2D data and 5 for 3D data. Early stopping was used with a patience of 20. Our experiments were run on an RTX Titan GPU. To generate new images, we swapped the predicted style latent spaces from input images with and without contrast enhancement.
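Early stopping with a patience of 20 can be sketched as below. The monitored quantity is not stated in the text, so this sketch assumes a validation loss checked once per epoch; the class name `EarlyStopping` is hypothetical and framework independent.

```python
class EarlyStopping:
    """Stop training once the monitored loss has not improved for `patience` checks."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss      # new best: reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement this time
        return self.bad_checks >= self.patience
```

In a training loop, `step` would be called after each validation pass and the loop exited the first time it returns `True`.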

Contrast weighted (CW)-SSIM
Evaluating generated DCE-MRI for content and style resemblance can be difficult as there can be varying levels of contrast enhancement in an image depending on the time of acquisition after contrast agent injection. Metrics such as PSNR and SSIM primarily focus on the pixel-level differences of the whole image. Here we evaluate a previously proposed metric, CW-SSIM, constructed by applying a contrast-based weighting to SSIM. To enable weighting by contrast enhancement, we highlighted regions where there was a change in intensity caused by contrast enhancement. Using this information, we were able to weight areas of the image depending on the distance from the contrast enhancement. We began by taking a DCE-MRI series of length n, subtracting the first image (I_0) from each of the images (I_t) and taking an average (equation (3)) to find the ACE. We then applied a threshold T (equation (4)) to determine which voxels were CE. For our datasets, an empirical threshold of 20 was found to be a good compromise between highlighting the areas of expected contrast enhancement and emphasising noise. Next, we calculated two distance maps (dist_map): one for content and one for style evaluation. For the content distance map, we calculated the shortest Euclidean distance from each voxel to a CE voxel. The distances were normalised between 0.1 and 1 so that each voxel in the image could contribute to the CW-SSIM. To evaluate style, we inverted the distance map so that a voxel had a higher weighting when it was closer to a CE voxel. Figure 5 shows an example of the generated distance maps. In the CW-SSIM, x is the generated image and y is either the content or style image.
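The construction of the two weight maps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the published implementation: the average is taken over all n frames, the inversion is done as `low + high - weight`, the distance search is brute force (suitable only for small images), and the final weighted mean over a local SSIM map is a guess at how the weighting is applied. All function names are hypothetical.

```python
import math


def ace_map(series):
    """Average of (I_t - I_0) over a series of 2D images (lists of rows)."""
    first, n = series[0], len(series)
    rows, cols = len(first), len(first[0])
    return [[sum(img[r][c] - first[r][c] for img in series) / n
             for c in range(cols)] for r in range(rows)]


def ce_mask(ace, threshold=20):
    """Mark voxels whose average enhancement exceeds the threshold T."""
    return [[v > threshold for v in row] for row in ace]


def weight_maps(mask, low=0.1, high=1.0):
    """Return (content_weights, style_weights), both scaled to [low, high]."""
    ce = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not ce:
        raise ValueError("no contrast-enhanced voxels above the threshold")
    # Shortest Euclidean distance from every voxel to any CE voxel.
    dist = [[min(math.hypot(r - cr, c - cc) for cr, cc in ce)
             for c in range(len(mask[0]))] for r in range(len(mask))]
    d_max = max(max(row) for row in dist)
    scale = (high - low) / d_max if d_max > 0 else 0.0
    content = [[low + d * scale for d in row] for row in dist]
    # Assumed inversion: voxels close to CE get the highest style weight.
    style = [[low + high - w for w in row] for row in content]
    return content, style


def weighted_mean(ssim_map, weights):
    """Assumed aggregation: weight a local SSIM map and normalise."""
    num = sum(s * w for s_row, w_row in zip(ssim_map, weights)
              for s, w in zip(s_row, w_row))
    den = sum(w for w_row in weights for w in w_row)
    return num / den
```

With this weighting, structural errors far from the enhancing tissue lower the content score most, while intensity errors inside or near the enhancing tissue lower the style score most.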
To evaluate the metric, we performed a series of tests on 100 non-contrast enhanced and 100 CE kidney images to ensure that the metric evaluates the correct areas of the image. To evaluate the style CW-SSIM, we took an image and modified the intensity values in the regions where we expect to see real contrast enhancement by adding or removing intensity values using annotations of the kidney. Similarly, for the content CW-SSIM, we took an image and applied warps with increasing amplitude. To warp the image, we used sine waves to offset pixels in the image. For each image, we calculated the CW-SSIM and compared it to the SSIM and MS-SSIM.
For the experiments involving the change in intensity, we expected to see a high content CW-SSIM whilst the style CW-SSIM decreased. Similarly, for the experiments with warped images, we expected to see the content CW-SSIM decrease as the warp increased whilst the style CW-SSIM remained high. We expected the SSIM and MS-SSIM to decrease for each experiment.
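A sine-wave warp of the kind used in the content test can be sketched as below. The exact warp is not specified in the text, so this sketch assumes a horizontal shift per row that follows a sine wave, nearest-neighbour sampling and clamping at the image border; `sine_warp` and its `period` parameter are hypothetical.

```python
import math


def sine_warp(image, amplitude, period=16.0):
    """Shift each row horizontally by amplitude * sin(2*pi*row / period).

    image: 2D list of intensities. Samples that would fall outside the
    image are clamped to the nearest border pixel.
    """
    rows, cols = len(image), len(image[0])
    warped = []
    for r in range(rows):
        shift = int(round(amplitude * math.sin(2.0 * math.pi * r / period)))
        warped.append([image[r][min(max(c + shift, 0), cols - 1)]
                       for c in range(cols)])
    return warped
```

Increasing `amplitude` produces progressively stronger structural distortion while leaving the set of intensity values in each row largely unchanged, which is the property the content test relies on.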

Clinical evaluation
To evaluate the style transferred images, we conducted a clinical evaluation. To do this, we created a website so that experts could log in and visually assess the images. A user was presented with one question at a time. For each question, an original image was shown and identified as such to the user. We also showed images generated by our method, by the method proposed in Tattersall et al (2023b), and by MUNIT, CycleGAN and StyleGAN3. We also showed the style image used in the style transfer.
The users were not told which approach each image came from, or which image was the original style image. All of the images were shuffled for each question so that the order of the images was different each time. The users were asked to rank each image according to the following questions:
1. Compare the structures in each of the images to the original image. Rank in order (1 being best) each image with the closest structures to the original image, regardless of contrast enhancement.
2. Each of these images have been generated to add (or remove) contrast enhancement to the original image. Rank in order (1 being best).
3. Rank each image in order (1 being best) of general image quality (free from artefacts, realistic noise characteristics etc.).
In total, 16 experts with an average of 11 years of MRI experience completed our questionnaire. The participants had varying expertise and included MR physicists, radiologists, clinical scientists, a medical physicist, a neurologist and a researcher in medical image analysis. In total, 25 images from each dataset (kidney, prostate and uterus) were evaluated, each by at least three different observers. If an observer could not decide between images, they were allowed to give them the same score. On average, each user took 26 minutes to complete the study. In addition to evaluating our images, we also used this study to evaluate CW-SSIM, by studying how the experts' qualitative evaluation correlates with the quantitative results of the style transferred images.
To test for significant differences between the results of our proposed method and the other style transfer methods, we computed the Kruskal-Wallis test (Kruskal and Wallis 1952). This is a suitable choice as the data does not follow a normal distribution. Additionally, the Kruskal-Wallis test works well with multiple groups that contain ranking data.
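For illustration, the Kruskal-Wallis H statistic can be computed from groups of ranks as sketched below. This is a standard-library sketch of the textbook formula with the usual tie correction, not the analysis code used in the study; the function name `kruskal_wallis_h` is hypothetical. In practice a library routine such as `scipy.stats.kruskal`, which also returns a p-value, would normally be used.

```python
def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for the given groups."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}      # value -> average rank among the pooled observations
    tie_term = 0      # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction
```

H is compared against a chi-square distribution with one fewer degree of freedom than the number of groups; with tied ranks, as allowed in the questionnaire, the tie correction matters.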

Style transfer-qualitative results
Figures 6, 7 and 8 qualitatively highlight the good results of our method on 2D and 3D datasets. We show an example of adding or removing contrast enhancement from an image along with the corresponding metrics: SSIM, content CW-SSIM and style CW-SSIM. We compare TIST-Net to the method proposed in Tattersall et al (2023b) and other popular style transfer methods, namely MUNIT, CycleGAN and StyleGAN3. We did not compare against DRIT++ due to the large computational requirements. In figure 6, we can see the results using kidney DCE-MRI. When we compare TIST-Net (figure 6(g)) to StyleGAN3 (figure 6(e)), our results are sharper and contain structures that better resemble the content image. MUNIT (figure 6(c)) and CycleGAN (figure 6(d)) both struggle to transfer the style of the style image whereas our method can. A similar trend is shown with the uterus data. For the prostate dataset, each method struggled to add contrast enhancement to the image. Figures 9, 10 and 11 show results for performing style transfer between two different patients for each dataset. The results for methods such as MUNIT and CycleGAN show artefacts from the style image in the generated image. TIST-Net generates images that have the structure of the content image whilst having the style of the style image.

Style transfer-quantitative results
Tables 1, 2 and 3 show our quantitative results: PSNR between the style image (the image we want to transfer style from) and the generated image, SSIM and MS-SSIM between the content image (the image we want to take structure from) and the generated image and, finally, the contrast-weighted SSIMs (content and style CW-SSIM). We compare our proposed method with the method proposed in Tattersall et al (2023b) and other popular style transfer methods, namely MUNIT, CycleGAN and StyleGAN3. For each style transfer direction and metric, our method consistently outperforms the other approaches.

CW-SSIM evaluation results
Figures 12 and 13 show some examples of the modifications made to the images with the respective scores. Tables 4 and 5 show the quantitative results from each of the tests. When we evaluate the style CW-SSIM, we can see that it decreases for both directions (when increasing or decreasing intensity values), whilst the content CW-SSIM remains high. The SSIM and MS-SSIM both decrease as the style changes. For the content evaluation tests, as the intensity of the warp increases, the score decreases. The SSIM and the MS-SSIM both decrease as the intensity of the warps increases. The style CW-SSIM decreases slightly as the ACE are warped, but the score remains high. These results show that the proposed method can separate the evaluation of content and style from the generated images.

Clinical evaluation results
Tables 6, 7 and 8 show the results from our clinical user study. Each table shows the mean and standard deviation (std) of the ranks given by the participants for each question, style transfer direction and approach. Our method outperforms the other methods and was even ranked higher than the real image shown to the user for having the closest structure. We performed the Kruskal-Wallis test with a significance level of 0.05 and found that there was a significant difference between the scores given to our method and the other methods. Additionally, we found that there was no significant difference when we stratified the observers by experience.

Discussion
In this work, we proposed TIST-Net, a method for style transfer that utilised temporal information by using convolutional LSTMs. This enabled the model to learn changes in structure caused by contrast enhancement across the temporal dimension. We improved upon classical style transfer by using AdaConv combined with deformable convolutions. The deformable convolutions predict an offset to adapt the AdaConv kernel to change the receptive field of the model. The receptive field of the model can then account for local geometric variations. This enabled better fine-grained control over the transfer of style. TIST-Net can be used as a method of augmentation for DCE-MRI data to generate new images and fully utilise any available annotations in a dataset.
Our qualitative evaluation showed that TIST-Net led to sharper images, better content preservation, better localised CE and a more realistic MRI appearance compared to state-of-the-art style transfer methods. Our model achieved good results in both style transfer directions (adding or removing CE) for both the quantitative and qualitative results. When performing style transfer between different patients, TIST-Net was able to generate images which had the structure of the content image whilst having the style characteristics of the style image. Additionally, we outperformed the other algorithms for each metric, for each style transfer direction. Our results showed that when the contrast enhancement has defined edges in the image, such as those in the kidney, it is an easier task to perform style transfer in both directions. In comparison, when there was no clear boundary to the contrast enhancement (such as in the prostate data), the task was much harder. Typically, when moving from 2D to 3D models, there can be a decrease in performance due to the increased number of parameters to learn whilst having a small amount of data. However, we found that TIST-Net performed well on 2D and 3D datasets, as evidenced by our results on the uterus dataset.
CW-SSIM enabled the evaluation of the quality of the content and style of a generated image when there was contrast enhancement. When we compared it with the qualitative results from our clinical evaluation, the quality of the images was in line with each of the CW-SSIM results. By using two distance maps to weight the SSIM, we were able to evaluate the similarity in overall structure as well as the localised style between the generated image and the content and style images. An advantage of using this metric over the SSIM or MS-SSIM is that we were able to measure the content and style of a generated image separately. This allowed us to ensure that the key structures remained whilst the intensity values of the image matched those of the style image. This is important with DCE-MRI as the contrast enhancement is tissue specific rather than a global change in the image. When we compared the style CW-SSIM of the kidney and uterus images generated by our method with their corresponding style images, we could see that the high scores reflected the image quality. In comparison, the style CW-SSIM scores of the prostate data were much lower, which again reflects the quality of the generated image. By separating the content and style weighting, we lessen the possibility of regions of the image with little or no contrast enhancement inflating the SSIM score.
A limitation of our method was that it struggled to generate realistic images when the edges between ACE were ambiguous. This was seen in the prostate data, where there was no boundary to the contrast enhancement, in contrast to the kidney or uterus data. It is worth noting that this difficulty was shared by all the methods we compared against. We hypothesise that this lack of contrast between neighbouring tissues might lead to a sub-optimal disentanglement. Finally, similar to MUNIT, our method has large computational and memory requirements. In this work, we focused on using LSTMs to encode content latent spaces and not style latent spaces. During our preliminary work we found that the model was unable to learn from the style latent spaces effectively, which was detrimental to our model and led to worse results. This preliminary exploration is not shown as it is outside the scope of this paper.
The results from our clinical evaluation confirmed that TIST-Net generates images that are realistic, as TIST-Net was often given a higher rank than the real image. These results show that our images have similar structures to the content image and similar styles to the style image, as well as having characteristics that are expected in MRI. The results also match the order given by the content CW-SSIM, further supporting the suitability of this metric for evaluating synthetic images.
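The idea of weighting the SSIM with a distance map, as discussed above for CW-SSIM, can be illustrated with a minimal sketch. This is not the CW-SSIM definition used in this work: the non-overlapping patches, the function names (`patch_ssim`, `weighted_ssim`) and the way the weight map is averaged per patch are all assumptions made for illustration. Only the SSIM formula itself, with its usual constants, is standard.

```python
def patch_ssim(a, b, data_range=1.0):
    """Standard SSIM formula evaluated on two flat patches of equal length."""
    c1 = (0.01 * data_range) ** 2  # usual stabilising constants
    c2 = (0.03 * data_range) ** 2
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def weighted_ssim(img, ref, weights, patch=4, data_range=1.0):
    """Weighted mean of patch-wise SSIM (illustrative assumption).

    img, ref and weights are 2D lists of equal shape. `weights` plays the
    role of a distance map: patches with a larger mean weight contribute
    more to the final score, patches with zero weight are ignored.
    """
    h, w = len(img), len(img[0])
    total, weight_sum = 0.0, 0.0
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            idx = [(i, j) for i in range(r, r + patch)
                   for j in range(c, c + patch)]
            a = [img[i][j] for i, j in idx]
            b = [ref[i][j] for i, j in idx]
            wp = sum(weights[i][j] for i, j in idx) / len(idx)
            total += wp * patch_ssim(a, b, data_range)
            weight_sum += wp
    return total / weight_sum if weight_sum > 0 else 0.0
```

Calling `weighted_ssim` once with a content-oriented weight map and once with a style-oriented weight map would give two separate scores. In this sketch, regions with zero weight (for example, regions with no contrast enhancement in a style map) cannot inflate the result.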

Conclusion
We proposed TIST-Net, a style transfer approach which uses temporal information to predict disentangled representations of content and style. To learn spatio-temporal information, convolutional LSTMs were used, which allowed better content latent spaces to be predicted from structural information through time. We also used deformable convolutions to offset AdaConv when combining the content and style latent spaces. We observed an increase in performance for both adding and removing contrast enhancement compared to state-of-the-art style transfer methods.
The qualitative and quantitative analyses showed that our method outperformed state-of-the-art style transfer techniques. The results from our clinical evaluation confirmed that our method generates images that are realistic. We also evaluated CW-SSIM to validate its viability as a metric. The results from our clinical evaluation further demonstrated that it can be used to evaluate the generated images. Using TIST-Net, we can generate new images and sequences with varying levels of contrast enhancement. This augmentation approach enables the use of a small number of annotations, making it easier to train robust models for tasks such as image registration or segmentation.

Figure 1. Methods to combine content (C) and style (S): (a) AdaIN, (b) SPADE and (c) AdaConv. MLP, Conv and IN denote multilayer perceptron, convolutional and instance normalisation layers, respectively. μ represents the mean, σ represents the standard deviation, γ is a learnable scaling parameter and β is a learnable bias parameter.
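As a point of reference for panel (a), the widely used formulation of AdaIN normalises each content channel by its own mean and standard deviation, then rescales it with the statistics of the matching style channel. The sketch below follows that standard formulation and is an assumption here: the variant in the figure may instead predict the scale and bias with an MLP. The helper names (`channel_stats`, `adain`) and the flat-list representation of a channel are hypothetical.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Mean and (eps-stabilised) standard deviation of one flat channel."""
    n = len(channel)
    mu = sum(channel) / n
    var = sum((v - mu) ** 2 for v in channel) / n
    return mu, math.sqrt(var + eps)


def adain(content, style, eps=1e-5):
    """Standard AdaIN: sigma(S) * (C - mu(C)) / sigma(C) + mu(S), per channel.

    content and style are lists of channels; each channel is a flat list
    of activations (illustrative layout, not the layout used in the paper).
    """
    out = []
    for c_ch, s_ch in zip(content, style):
        mu_c, sigma_c = channel_stats(c_ch, eps)
        mu_s, sigma_s = channel_stats(s_ch, eps)
        out.append([sigma_s * (v - mu_c) / sigma_c + mu_s for v in c_ch])
    return out
```

After this operation each output channel keeps the spatial pattern of the content channel but takes on approximately the mean and standard deviation of the style channel.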

Figure 2. Encoding of images, which were passed into encoders E_i to encode content C_i and style S_i, where i is between 0 and 4. The content spaces were passed into a convolutional LSTM as a sequence. We then output a new predicted content latent space C'.

Figure 3. A style latent space is passed through three convolutional layers. An adaptive convolution, an adaptive pointwise convolution and an adaptive pointwise convolution bias are predicted from each of the layers. An offset is also predicted from a content latent space (of size channel (C), height (H), width (W)). The offset is then used to deform the adaptive convolution. These predictions are used to combine the content and style latent spaces. Note that this figure shows an example for 2D style transfer. For 3D, an additional dimension is used for depth.

Figure 4. Our model takes an input I of 5 images/volumes to content encoders E_c,0...4 and style encoders E_s,0...4. Decoders D_CE and D_NCE construct images from latent spaces z_i predicted by the content and style encoders. CE represents the decoder which generated contrast enhanced images whilst NCE represents the decoder which generated non-contrast enhanced images. We also show the losses used: cycle consistency, perceptual and adversarial. For clarity of the figure, we have only shown one example for each of the losses.

Figure 5. An example of two distance maps (content and style) calculated from one kidney DCE-MRI sequence.

Figure 6. Example results from different style transfer approaches ((c) MUNIT, (d) CycleGAN, (e) StyleGAN, (f) method in Tattersall et al (2023b), (g) TIST-Net) with the kidney data. The first row ((a) and (b)) shows the input images. The second row shows the results when (a) is used as the content image and (b) as the style. The third row shows results when (b) is used as the content image and (a) as the style. We also show scores given by the SSIM, content (C) CW-SSIM and style (S) CW-SSIM.

Figure 7. Example results from different style transfer approaches with the prostate data (a 2D slice is shown from the 3D volume). The first row ((a) and (b)) shows the input images. The second row shows the results when (a) is used as the content image and (b) as the style. The third row shows results when (b) is used as the content image and (a) as the style. We also show scores given by the SSIM, content (C) CW-SSIM and style (S) CW-SSIM.

Figure 8. Example results from different style transfer approaches with the uterus data (a 2D slice is shown from the 3D volume). The first row ((a) and (b)) shows the input images. The second row shows the results when (a) is used as the content image and (b) as the style. The third row shows results when (b) is used as the content image and (a) as the style. We also show scores given by the SSIM, content (C) CW-SSIM and style (S) CW-SSIM.

Figure 9. Example results of performing style transfer between two different patients for each method with the kidney data. (a) was used as the content image and (b) was used as the style image.

Figure 10. Example results of performing style transfer between two different patients for each method with the prostate data. (a) was used as the content image and (b) was used as the style image.

Figure 11. Example results of performing style transfer between two different patients for each method with the uterus data. (a) was used as the content image and (b) was used as the style image.

Figure 12 .
Figure 12.An experiment to evaluate the content CW-SSIM.Here we have an original image and images with an increasingly large amount of warps added to them.The CW-SSIM score is also shown.

Figure 13. An experiment to evaluate the content CW-SSIM. Here we have an original image and images with intensity either being added or removed. The CW-SSIM score is also shown.

Table 1. Quantitative results for each approach when adding or removing CE in the kidney.

Table 2. Quantitative results for each approach when adding or removing CE in the prostate.

Table 3. Quantitative results for each approach when adding or removing CE in the uterus.

Table 4. Mean results and standard deviation of the content (C) CW-SSIM evaluation.

Table 5. Mean results and standard deviation of the style (S) CW-SSIM evaluation.

Table 6. Mean rank and standard deviation of the kidney data for each question in the user study. Method 4 is the method proposed by Tattersall et al (2023b).

Table 7. Mean rank and standard deviation of the prostate data for each question in the user study. Method 4 is the method proposed by Tattersall et al (2023b).

Table 8. Mean rank and standard deviation of the uterus data for each question in the user study.