Synthetic CT imaging for PET monitoring in proton therapy: a simulation study

Objective. This study addresses a fundamental limitation of in-beam positron emission tomography (IB-PET) in proton therapy: the lack of direct anatomical representation in the images it produces. We aim to overcome this shortcoming by pioneering the application of deep learning techniques to create synthetic control CT images (sCT) by combining IB-PET and planning CT scan data. Approach. We conducted simulations involving six patients who underwent irradiation with proton beams. Leveraging the architecture of a vision transformer (ViT) neural network, we developed a model to generate sCT images of these patients from the planning CT scans and the inter-fractional simulated PET activity maps acquired during irradiation. To evaluate the model's performance, the sCT images produced by the ViT model were compared with the authentic control CT images, which served as the benchmark. Main results. The structural similarity index, averaged across all patients, was 0.91, while the mean absolute error was 22 Hounsfield units (HU). The root mean squared error and peak signal-to-noise ratio were 56 HU and 30 dB, respectively. The Dice similarity coefficient was 0.98. These values are comparable to or exceed those found in the literature. More than 70% of the synthetic morphological changes were found to be geometrically compatible with those reported in the real control CT scan. Significance. Our study presents an innovative approach to recovering the hidden anatomical information in IB-PET data in proton therapy. Our ViT-based model successfully generates sCT images from inter-fractional PET data and planning CT scans. Its performance is on par with that of existing models relying on input from cone beam CT or magnetic resonance imaging, both of which contain more anatomical information than activity maps.


Introduction
Proton therapy is an established treatment modality for various tumours and anatomical sites, offering superior dose conformity and a reduced dose to surrounding healthy tissue compared with photon beam therapy (Loeffler and Durante 2013).
The application of proton therapy, particularly in head-and-neck (H&N) cancer, is challenged by patient-specific inter-fractional anatomical variations. In clinical practice, 35 treatment fractions are generally prescribed for H&N proton-treated patients (Tubin et al 2023). During the treatment course, the air content within the sinus cavities may vary, which can lead to substantial dose deviations from the intended treatment plan (Kraan et al 2013, Mackay 2018, Shusharina et al 2019, Albertini et al 2020, Morgan and Sher 2020, Sharma et al 2021). To address this, control computed tomography (CT) scans are typically prescribed after 20 fractions to establish the treatment progress (Kraan et al 2013, Placidi et al 2017, Mackay 2018, Cubillos-Mesías et al 2019, Sonke et al 2019, Albertini et al 2020). However, these scans carry inherent radiation risks and require careful consideration by the radiologist when deciding whether their prescription is justified (Tsalafoutas and Koukourakis 2010, Bhandari et al 2014).
Image guidance in proton therapy can improve treatment accuracy by allowing the identification of patient-related anatomical deviations from the initial treatment plan. Mid-treatment cone beam CT (CBCT) can be used for this purpose. Moreover, deep-learning-based approaches allow the translation of CBCT images into synthetic control CT (sCT) images for treatment verification. However, this imaging modality delivers an additional dose, is time-consuming, and adds to the patient burden because of repositioning (Li 2013, Mathieu et al 2017). On the other hand, positron emission tomography (PET) is one of the most promising in vivo non-invasive monitoring techniques in proton therapy (Alaei and Spezi 2015). By detecting radioactive isotopes generated within the patient during irradiation, PET provides insights into potential dose deviations (Enghardt et al 2004, Nishio et al 2010, Sportelli et al 2013, Dendooven et al 2015, Parodi and Polf 2018). PET scanners offer a valuable means of accurately and efficiently monitoring proton therapy and enhancing treatment outcomes. Moreover, among the PET monitoring modalities, in-beam PET (IB-PET) allows the detection of activity from 15O and other short-lived isotopes, avoids the biological washout of the signal in the human body, and eliminates the need for patient repositioning. Such a technique could prevent the late detection of range deviations. However, PET does not provide direct insight into the subject's anatomy like traditional CT scans, nor into anatomical structures like CBCT scans.
At the Italian National Center for Oncological Hadrontherapy (CNAO) (Rossi 2011), more than 1500 patients have received proton therapy treatments. A small cohort of them (clinical trial registered under ID: NCT03662373) has been monitored with the INSIDE IB-PET scanner (Bisogni et al 2016, Bisogni 2019). Recent studies based both on MC simulation and on acquired INSIDE IB-PET data have shown the possibility of detecting morphological changes using range variation or 3D methods (Fiorina et al 2021, Kraan et al 2022, Moglioni et al 2022, Moglioni et al 2023). To the best of our knowledge, these are the most recent papers attempting to correctly locate and identify morphological changes in patients starting from PET data. However, a significant challenge remains: IB-PET data lack direct anatomical representations of morphological changes, preventing an understanding of when to correctly prescribe the control CT acquisition.
To address this gap, the SYNCT project has been initiated. This project aims to enhance image interpretability by generating sCT via neural network (NN) algorithms. NNs have exhibited substantial success in various imaging tasks and have recently gained traction in multi-modal imaging for diagnostics (Yoo 2015, Spadea et al 2021, Man and Chahl 2022, Peng et al 2023). In proton therapy, NNs offer a unique opportunity to integrate multiple data sources, including the planning CT, to reconstruct control CT images. By comparing IB-PET images acquired at different treatment fractions, morphological changes can be discerned, enhancing treatment accuracy.
This paper presents the results of a NN architecture based on vision transformers (ViT; Wu et al 2020), a cutting-edge class of artificial intelligence models for imaging tasks. Our approach uses the planning CT (planCT) and the simulated activity maps (AMs) from two different treatment fractions. One AM is acquired at the treatment's onset (N_f = N_0), and the other is obtained at a subsequent fraction x of treatment (N_f = N_x). The AMs record the spatial coordinates of the annihilations occurring within six minutes from the start of the treatment, i.e. the number of decays accumulated within a 6-minute interval. These AMs are not affected by the instrumental and reconstruction limitations of IB-PET data, which will be addressed in subsequent work. Thus, AMs are the best candidates for testing whether sCT can be obtained from the activity of β+ emitters. The planCT, AM_N0 and AM_Nx are used in this way to produce a synthetic representation of the control CT that would be acquired during fraction x of the treatment; we then compare the sCT with the ground truth across multiple figures of merit. We perform this comparison using Monte Carlo simulations on real CT scans and treatment plans of six proton-treated patients at CNAO.

Patient data
We have considered six head-and-neck patients treated with proton therapy at CNAO. The relevant characteristics of the treatment plans are summarized in table 1. For each patient, the table reports the total prescribed dose in Gy, the number of fractions delivered, from which the dose per treatment fraction can be derived (approximately 2 Gy per day), and the total number of beam fields included in the treatment plan. A beam field corresponds to a specific patient (couch) position angle in the IEC (International Electrotechnical Commission) reference system during the treatment. Additionally, the fifth and sixth columns report whether the patient showed morphological changes during the treatment and whether the plan was replanned (Y stands for YES, N for NO).
For each patient, the planning CT scan, with 2 × 0.97 × 0.97 mm³ resolution, the RT Structure Set, the RT Plan (the geometric and dosimetric data specifying the treatment) and the control CT scan were available. Morphological changes were not observed during the treatment course in all patients; when present, based on an inspection of the control CT scan by the radiologist, they were most commonly observed in the sinus cavity. Such changes are generally characterized by a filling or emptying of the cavity with liquid material or air, respectively (see figure 1). Specifically for Patient 1, the dataset also incorporated the set of artificially modified CTs described in the work of Kraan et al (2022). For this patient, in addition to the planCT (corresponding to 0 ml of cavity volume emptying) and the control CT (corresponding to 13.1 ml of cavity volume emptying), five additional CTs were included in the dataset, with cavity emptying volumes of 1.9, 3.8, 5.7, 7.3, and 10.2 ml, respectively. To obtain the AMs for a given proton irradiation field, we utilized a FLUKA-based framework already developed for the INSIDE IB-PET scanner (Fiorina et al 2018, Pennazio et al 2018, Kraan et al 2022). The treatment plan for each patient was simulated, including the last part of the CNAO beam line with the exit window and nozzle (Molinelli et al 2013, Magro et al 2015, Mirandola et al 2015), the patient geometry (converted from the CT scan to a voxel FLUKA geometry), and the radiation transport and interactions of protons in the patient. These data were then used to reconstruct the AMs in a field of view of 140 × 70 × 165 with a resolution of 1.6 mm × 1.6 mm × 1.6 mm (Fiorina et al 2021, Kraan et al 2022, Moglioni et al 2022), as displayed in figure 2.
The above-mentioned procedure to obtain AMs was repeated for each patient, taking into account each irradiation field (beam direction) included in the treatment plan separately. The 3D AM was then co-registered to the corresponding CT according to the information reported in the patient's RT Plan for each irradiation field. Then, we resampled these images to the same dimensions as the CT image (i.e. 251 × 512 × 512).
Subsequently, the process was repeated using the control CT (previously co-registered with the planCT using imregister). In this way, based on the number of fields reported in table 1, i.e. three separate irradiation fields for five patients and only two for Patient 4, 17 [(5 × 3) + 2] AM_N0s were obtained, each corresponding to one irradiation field.

Training and test dataset
Our approach consists in merging three information sources:
• planCT: the planning CT, acquired a few weeks before the start of the treatment.
• AM_N0: the activity map obtained by simulating the treatment on the planCT, i.e. the AM corresponding to the start of the treatment (treatment fraction number equal to 0).
• AM_Nx: the simulated AM of a subsequent treatment fraction x. Depending on the patient, the fraction x considered in this work corresponds approximately to the 20th fraction of treatment, i.e. the fraction corresponding to the control CT. For Patient 1, multiple fractions x were considered, each corresponding to the day on which the patient would have exhibited an emptying of the sinus cavity volume equal to 1.9, 3.8, 5.7, 7.3, 10.3 and 13.1 ml, respectively.
The full dataset consists of 6 planCT scans, 17 AM_N0, 32 AM_Nx and 11 control CTs, each with dimensions 251 × 512 × 512. The dataset was divided into training and testing sets, with one patient reserved for testing while the remaining five were utilised for training the NN, following the leave-one-out cross-validation strategy. The model was trained to reproduce control CTs using the data from the five selected patients, and the network's performance was then tested on the left-out patient. This procedure was repeated iteratively, with each patient taking a turn as the test case while the other five constituted the training set. When the model was evaluated on Patient 1, all of their data, including the artificially generated data, were left out of the training dataset. The evaluation on this patient was performed on all of their data, including the artificially generated CTs and their corresponding AMs.
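The leave-one-out procedure described above can be sketched as follows (a minimal illustration; the patient identifiers are placeholders, and the special handling of Patient 1's artificial data is noted in a comment rather than implemented):

```python
def leave_one_out_folds(patients):
    """Yield one (train, test) split per patient.

    When "Patient 1" is the test patient, its artificially modified CTs
    would also be removed from the training pool (not modelled here).
    """
    for test_patient in patients:
        train = [p for p in patients if p != test_patient]
        yield train, test_patient


patients = [f"Patient {i}" for i in range(1, 7)]
folds = list(leave_one_out_folds(patients))  # 6 folds, 5 training patients each
```

Each fold trains one model instance from scratch, so the reported per-patient metrics are always computed on data the corresponding model never saw.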
We also performed a study to test the robustness of our synthetic CT reconstruction method on Patient 2, by trying to reconstruct the synthetic CT from AMs acquired in gradually shorter timeframes, starting from the standard 360 s, to approximately 100 s.

Network architecture
Our NN is based on a pre-trained model, specifically the google/vit-base-patch16-224-in21k model. This model is a vision transformer (ViT) pre-trained on the ImageNet-21k dataset (Deng et al 2009), with 14 million images and 14 000 classes, and it was introduced in Wu et al (2020). A ViT is a type of NN specifically designed to process images. It is based on the transformer architecture originally developed for natural language processing tasks. In a traditional convolutional neural network (CNN), the most typical architecture used for image processing, the network scans the image with small windows (kernels) to identify local patterns. These patterns are then combined to understand larger areas of the image. A vision transformer works differently. Instead of scanning the image with small windows, it breaks it down into a series of patches, each treated like a word in a sentence. The transformer then analyzes the relationships between these patches, similar to how it would analyze the relationships between words in a sentence. This approach allows the vision transformer to understand the global context of an image rather than just local patterns. It is a relatively new approach to image processing, but it has shown promising results (Shamshad et al 2023).
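As a rough illustration of the patch mechanism, the sketch below splits a 224 × 224 slice into the 196 non-overlapping 16 × 16 patches used by the vit-base-patch16-224 family and projects each flattened patch to an embedding vector; the random projection stands in for the learned embedding layer, and only the dimensions are taken from the model name:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split a (H, W) image into non-overlapping flattened patches."""
    h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (image
            .reshape(h // patch_size, patch_size, w // patch_size, patch_size)
            .swapaxes(1, 2)                      # group rows/cols of patches
            .reshape(-1, patch_size * patch_size))

rng = np.random.default_rng(0)
slice_2d = rng.standard_normal((224, 224))
tokens = patchify(slice_2d)                      # (196, 256): 14 x 14 patches
projection = rng.standard_normal((256, 768))     # stand-in for the learned embedding
embedded = tokens @ projection                   # (196, 768): one token per patch
```

The transformer's attention layers then operate on these 196 tokens, which is what lets the model relate patches at opposite ends of the image in a single step.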
In our architecture, reported in figure 3, we fine-tuned two instances of this model, one processing the CT scans and the other processing the two AMs. For each image, we use the last attention layer of the ViT model as its vector representation (the 'embedded' patches in figure 3), and we concatenate these three representations into a single vector. This vector is then processed by a series of linear layers (the multi-layer perceptron, MLP block in figure 3) that generate the final image, i.e. the sCT. The activation function used for these layers is the LeakyReLU, with a negative slope coefficient of 0.001. To reduce the chance of overfitting, drop-out layers are added between the fully connected layers, with an aggressive drop-out rate of 0.25. The network has about 300 million weights (170 million from the two ViT heads and 130 million from the fully connected layers) and processes images slice by slice. We chose this architecture for its ability, thanks to its patch-based approach, to capture the modifications across the two AM images and use them to modify the CT image. In fact, the modifications between the two AM images usually translate into more or less local changes in the CT (i.e. the change is likely in the same patch). As can be seen from figure 2, for example, most of the changes across the two AM images are reflected by anatomical CT changes in a very close region. Our architecture also retains a certain degree of space variance thanks to the positional encodings, implemented as learnable parameters for each patch, as already done in other ViT architectures (Naseer et al 2021, Rojas-Gomez et al 2023). Finally, the fully connected heads consisting of dense layers are essentially a decoder architecture, which further introduces space variance and allows the reconstruction of the control CT from a global perspective rather than by simply assembling the individual patches.

Data augmentation
Given the very small size of the dataset, we have used several data augmentation techniques to extract as much information as possible from the limited number of images available. In particular, we have implemented the following procedure:
• We have used all the 2D slices in each view (axial, coronal, sagittal) of each 3D image;
• Considering that the pre-trained ViT network accepts images of size 224 × 224, while the input images were of size 251 × 512 × 512, we extracted 100 randomly centered patches of size 100 × 100, in each view, from each image and then re-sampled each of these patches to a size of 224 × 224. This ensured the creation of multiple patches that are similar yet distinct from one another. In contrast, generating 224 × 224 patches with random centers would yield patches that are too similar to each other;
• We have added transformed images to the dataset using Gaussian filters with 2 mm and 4 mm σ kernels and random rotations at angles from 0 to 360 degrees. The Gaussian filters make the network resilient to different spatial resolutions. We analyze the impact of these filters on the images in the discussion section.
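A minimal sketch of the patch-extraction step described above (nearest-neighbour resampling is used here for brevity; the Gaussian smoothing and rotations would typically be applied with a library such as scipy.ndimage, and are not shown):

```python
import numpy as np

def random_patches(image, n_patches=100, patch_size=100, out_size=224, seed=0):
    """Extract randomly centered patch_size x patch_size patches from a 2D
    slice and resample each to out_size x out_size (nearest neighbour)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Nearest-neighbour index grid mapping out_size pixels onto patch_size pixels.
    idx = (np.arange(out_size) * patch_size / out_size).astype(int)
    patches = []
    for _ in range(n_patches):
        top = rng.integers(0, h - patch_size + 1)
        left = rng.integers(0, w - patch_size + 1)
        patch = image[top:top + patch_size, left:left + patch_size]
        patches.append(patch[np.ix_(idx, idx)])
    return np.stack(patches)

ct_slice = np.random.default_rng(1).standard_normal((512, 512))
augmented = random_patches(ct_slice)   # (100, 224, 224), ready for the ViT input
```

Because the 100 × 100 crops are upsampled rather than cropped at the target size, neighbouring patches overlap only partially, which is what keeps them "similar yet distinct".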

Network training
The network training process requires careful consideration due to the combination of pre-trained and untrained layers, coupled with the repetitive nature of the dataset. Although the untrained component comprises merely three layers, the presence of fully connected layers raises the weight count in these layers to approximately 100 million. Consequently, this leads to long training times and introduces a risk of overfitting despite the use of conventional mitigation strategies such as dropout layers.
In particular, we need to minimise excessive modifications to the pre-trained model while accommodating the protracted training required by the untrained segment, i.e. we need to achieve a fine-tuning balance. Most of the training samples do not contain information useful for identifying anatomical changes, because such changes are quite small and not present in all the slices. Furthermore, such changes were not present in all the patients. Moreover, the portion of the volume where non-trivial AM information is present is smaller than the CT image. Therefore, we adopted the following strategies to improve the quality of the training:
• To handle the combination of pre-trained and untrained parts, we used two different learning rates for the two segments of the network: 10^−8 for the pre-trained part and 10^−6 for the untrained part.
• To minimize the changes to the pre-trained model, while allowing substantial training time for the untrained one, we used two different learning rate decay schedules. The pre-trained part used an exponential decay schedule which, together with its low starting learning rate, limited the fine-tuning of the pre-trained part of the network to only the first 5-10 epochs. We instead used a linear rate decay for the untrained part.
• To control overfitting, we implemented a cross-validation loop within the training loop, keeping one patient from the training set as a validation patient. This was done to stop the model training once the loss on the validation patient was no longer improving (the so-called early-stopping strategy). This happened after approximately five thousand epochs.
• To focus the attention of the model on the most important task (i.e. identifying morphological changes), while ignoring examples that do not carry any information, we implemented a difficulty-mining strategy in which the model was first trained on all samples and subsequently focused only on those that carry information about anatomical changes. Most of the slices, in fact, contain very little information about morphological changes and are either empty patches or patches that do not contain AM data. It is very easy for the model to learn to reconstruct these slices of the control CT, because it is sufficient to copy the planCT. Without a focus on examples with anatomical changes, the model would likely learn to always output the planCT unchanged. Necessary conditions for a sample to contain morphological changes are that it contains AM data and that >30% of its CT pixels have Hounsfield unit (HU) values above that of air (−1024 HU). We ran the cross-validation loop described above twice: once on all training samples, and once on only the samples with the aforementioned characteristics.
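The sample-selection rule for the second, focused training pass can be sketched as a simple predicate (a minimal illustration; the slice shapes and HU values below are arbitrary):

```python
import numpy as np

AIR_HU = -1024  # HU value of air

def carries_change_information(ct_slice, am_slice, min_fraction=0.30):
    """Keep a sample for the focused pass only if it contains AM data
    and more than min_fraction of its CT pixels are above the HU of air."""
    has_am_data = bool(np.any(am_slice > 0))
    non_air_fraction = float(np.mean(ct_slice > AIR_HU))
    return has_am_data and non_air_fraction > min_fraction

tissue = np.full((224, 224), 40.0)                     # soft-tissue-like slice
air_only = np.full((224, 224), float(AIR_HU))          # empty slice
am = np.zeros((224, 224)); am[100:120, 100:120] = 5.0  # localized activity
```

Slices failing this predicate are trivial for the network (copying the planCT suffices), so excluding them steers the gradient toward the regions where changes can actually occur.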
The total number of parameters used in this model is in the order of 300 M, with the two ViT heads having about 100 M parameters each and the fully connected layers another 100 M. The large size of the model and the high number of epochs required us to split and train the model on 4 Tesla A100 GPUs for about 700 h. This is in addition to the time required for architecture search and optimization, which exceeded 2000 h of computing time.
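The two learning-rate schedules described above can be illustrated as follows (only the starting learning rates, 10^−8 and 10^−6, and the schedule types come from the text; the decay constants are illustrative assumptions):

```python
def exponential_decay(lr0, epoch, gamma=0.5):
    """Pre-trained part: lr0 * gamma**epoch becomes negligible after ~10 epochs."""
    return lr0 * gamma ** epoch

def linear_decay(lr0, epoch, total_epochs=5000):
    """Untrained part: linear ramp from lr0 down to 0 over the whole run."""
    return lr0 * (1.0 - epoch / total_epochs)

lr_vit = [exponential_decay(1e-8, e) for e in range(11)]
lr_mlp = [linear_decay(1e-6, e) for e in (0, 2500, 5000)]
```

With these choices the pre-trained ViT weights are effectively frozen after the first epochs, while the untrained MLP head keeps learning for the full run, matching the fine-tuning balance described above.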

Image quality assessment
In table 2 we summarise the similarity metrics used for assessing the quality of the sCT with respect to the ground truth, i.e. the control CT (referred to in this table as cCT) (Spadea et al 2021).
The mean absolute error (MAE) is the voxel-wise average absolute difference between the two images. The root mean squared error (RMSE) is the square root of the voxel-wise mean squared difference. The lower these values, the higher the voxel intensity similarity. The peak signal-to-noise ratio (PSNR) is the ratio, expressed in dB, between the maximum possible pixel value in the ground truth image, MAX_cCT, and the RMSE defined above. The structural similarity index measure (SSIM) measures the structural similarity between two images based on luminance, contrast, and structure. The Dice similarity coefficient (DSC) uses a threshold t to binarize the image according to whether pixel values are above or below t. If we call Bin(·, t) the function that binarizes the image, then the DSC measures the relative overlap between the two binarized images, i.e. Bin(cCT, t) and Bin(sCT, t). For both the DSC and the SSIM, a value of 1 indicates a perfect match between the two images. We selected a threshold value t of 0 HU, corresponding to water-equivalent tissue. This choice is grounded in our objective of emphasizing morphological changes, which entail the substitution of liquid material with air (in the case of cavity emptying) or vice versa (in the case of cavity filling). By generating a binary mask that assigns a value of 0 to all HU values below 0 HU (effectively eliminating fat and air tissues), we focus the DSC evaluation on the air cavities.
We chose these figures of merit because, on the one hand, we evaluate the similarity of the sCT with the ground truth on a voxel-wise basis using metrics such as MAE, SSIM, RMSE, and PSNR, which are crucial for a detailed comparison. On the other hand, we assess the geometric accuracy by examining the volume and morphology of delineated structures or their corresponding binary masks, using the DSC as the key metric.
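A minimal numpy implementation of these metrics, following the definitions above (SSIM is omitted here, as it is usually taken from a library such as scikit-image):

```python
import numpy as np

def mae(cct, sct):
    """Mean absolute error, in HU."""
    return np.mean(np.abs(cct - sct))

def rmse(cct, sct):
    """Root mean squared error, in HU."""
    return np.sqrt(np.mean((cct - sct) ** 2))

def psnr(cct, sct):
    """Peak signal-to-noise ratio in dB: 20 * log10(MAX_cCT / RMSE)."""
    return 20 * np.log10(cct.max() / rmse(cct, sct))

def dsc(cct, sct, t=0):
    """Dice similarity coefficient of the images binarized at threshold t (HU)."""
    bin_cct, bin_sct = cct > t, sct > t
    return 2 * np.logical_and(bin_cct, bin_sct).sum() / (bin_cct.sum() + bin_sct.sum())

cct = np.array([[0.0, 100.0], [200.0, 300.0]])
sct = cct + 10          # toy pair: uniform 10 HU offset
```

For this toy pair, mae and rmse are both 10 HU, and the DSC at t = 0 HU is 6/7 because one voxel changes side of the threshold.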
The obtained values are then compared with those available in the literature. We refer in particular to other works on AI-driven sCT generation in proton or carbon ion therapy starting from other imaging modalities, such as CBCT, since to our knowledge this is the first attempt at using in-beam PET data for this task (Peng et al 2023). In table 3, we summarise the results from the literature used in the comparison. We selected these works because they consider NN architectures similar to the one included in our study, or because they focus on the same patient pathologies (brain and H&N).
The previous metrics were assessed on the test set by calculating them in different anatomical regions. In particular, we considered:
1. Skin region: a binary mask which excludes all the pixels outside the head-and-neck contour.
2. AMV region: the activity map volume, i.e. the volume covered by the AMs. It was obtained by summing AM_N0 and AM_Nx and then creating a binary mask from the voxels with values >0.
3. CTV region: the clinical target volume, extracted from the RT Plan.

Geometric comparison
Since the main aim of our model is to produce a sCT that adequately reports information on possible morphological changes, rather than faithfully reproducing the entire control CT, we propose a geometric comparison in the area where the morphological changes occurred or are expected to occur (i.e. the sinus cavity). We prefer not to use the beam depth difference method (Maspero et al 2020) for the geometric comparison: this approach is better suited to convex contours, whereas the morphological changes are characterized by irregular shapes, making the method less suitable.
We set an optimal segmentation threshold of −450 HU (Deeb et al 2011) in the planCT and control CT (suitably co-registered) to obtain two binary masks. The morphological change volume (MCV) region was then obtained by computing the voxel-wise absolute difference between these binary masks, as reported in figure 4. The same procedure was then executed with the planCT and the sCT, resulting in the synthetic morphological change volume (sMCV) region.
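The MCV construction can be sketched as follows (a toy 2D example; the real computation runs on the co-registered 3D volumes):

```python
import numpy as np

SEGMENTATION_THRESHOLD = -450  # HU (Deeb et al 2011)

def change_volume(ct_a, ct_b, t=SEGMENTATION_THRESHOLD):
    """Voxel-wise absolute difference of the two CTs binarized at t:
    True where the air/tissue classification differs between them."""
    mask_a = (ct_a > t).astype(np.int8)
    mask_b = (ct_b > t).astype(np.int8)
    return np.abs(mask_a - mask_b).astype(bool)

plan_ct = np.full((6, 6), 40.0)       # tissue-like slice
control_ct = plan_ct.copy()
control_ct[2:4, 2:4] = -1000.0        # a 2 x 2 region emptied to air
mcv = change_volume(plan_ct, control_ct)
```

Applying the same function to the planCT and the sCT yields the sMCV.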
The geometric comparison we developed to compare the MCV and sMCV areas can be summarized as follows: 1. We extracted the images containing the MCV and sMCV contours in the axial slice corresponding to the axial index of the isocenter (centre of the CTV).
2. We then resampled these images to a dimension of 2048 × 2048 with 0.24 mm × 0.24 mm pixel resolution.
3. Subsequently, we divided each image into a grid of 450 × 450 squares, with each square in the grid maintaining the same resolution as the original image (0.24 mm × 0.24 mm).
4. We assigned a value of 0 to squares where the contour is absent, a value of 1 to squares entirely inside the contour, and a value of 0.5 to squares through which a portion of the contour passes. These values correspond to the presence or absence of the contour area within each square, as illustrated in figure 5. This process provides a pixel-based measure of the areas enclosed by the MCV and sMCV contours at the isocenter.
5. We associated an uncertainty of half a pixel area with every pixel on the border (because that is the granularity at which we measure the area overlap between the pixels and the contour). We further assume that pixels not on the border have no uncertainty. Therefore, the relative error on the area value can be calculated as

relative error (%) = 100 × (0.03 mm² × number of pixels on the border) / (total area in mm²).    (1)

6. Finally, we formulated an assessment metric based on the pixel-wise overlap of the MCV and sMCV contour areas. We refer to this metric as the passing rate.
Based on the passing rate values, we can effectively measure the pixel-level inconsistencies in the regions linked with aeration changes, and use them to assess the accuracy of our network.
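The two quantities above can be sketched as follows (the half-pixel area is kept exact, 0.24²/2 ≈ 0.03 mm²; the agreement-counting definition of the passing rate is our reading of the pixel-wise overlap described above):

```python
import numpy as np

PIXEL_AREA = 0.24 * 0.24          # mm^2 after resampling
HALF_PIXEL_AREA = PIXEL_AREA / 2  # ~0.03 mm^2 uncertainty per border pixel

def relative_area_error(n_border_pixels, total_area_mm2):
    """Equation (1): percentage error from the half-pixel uncertainty
    assigned to each pixel on the contour border."""
    return 100 * HALF_PIXEL_AREA * n_border_pixels / total_area_mm2

def passing_rate(grid_mcv, grid_smcv):
    """Percentage of grid squares (valued 0, 0.5 or 1) on which the
    MCV and sMCV area maps agree."""
    return 100 * np.mean(grid_mcv == grid_smcv)

mcv_grid = np.array([0.0, 0.5, 1.0, 1.0])
smcv_grid = np.array([0.0, 0.5, 1.0, 0.0])   # one disagreeing square
```

Note that, for a fixed contour shape, the border pixel count grows with the perimeter while the area grows faster, so equation (1) naturally gives smaller relative errors for larger contours.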

Image quality assessment
Table 4 reports the results obtained by comparing the sCT produced by our model with the corresponding ground truth, i.e. the control CT, in different anatomical regions (Skin, CTV and AMV) using the above-mentioned metrics. Each patient's results are obtained using the leave-one-out procedure described in section 2.2. In the Skin region, the results are comparable with those of Thummerer et al (2020). The results reported in table 4 for the smaller regions, i.e. the AMV and CTV, are still comparable in terms of MAE, PSNR and DSC with those presented in the literature, with the exception of Patient 1. This is because certain slices within the CTV and AMV are small, so the similarity metrics are computed on only a handful of pixel values and can be more affected by statistical noise (see section 3.2).
In figure 6, we report the trend of the RMSE measured in the CTV region for Patient 2, with progressively shorter AM acquisition times, from 6 min to 20 s.We can see that the quality of the sCT is preserved when up to 2 min of data are used for generating the AMs, and then starts to rapidly degrade.
In figure 7 we report the histograms of MAE, PSNR, DSC and SSIM calculated separately for each slice for each view in the CTV region.
Most of the samples were in the coronal view. This is because of the shape of the CTV: the coronal slices of the CTV were generally smaller but more numerous. Moreover, outliers are present in all the histograms. These outliers correspond to slices in which the CTV area comprises only a handful of pixels, where even small differences between the control CT and sCT can move the metrics substantially. In the case of the MAE, 75% of the counts over all slices of all patients lie below 79 HU, as in Kurz et al (2019), Maspero et al (2020) and Thummerer et al (2020). Similarly, for the PSNR, most of the counts occupy the range from 15 up to 30 dB, comparable with Chen et al (2022), Zhang et al (2022) and Pan et al (2023). In the DSC case, all the histograms are characterized by narrow peaks. A mode of 0.98 was observed for the coronal slices, higher than in Thummerer et al (2020) and equal to Maspero et al (2020). The sagittal and axial slices instead present a peak at 0.96, higher than in Thummerer et al (2020). The SSIM histograms present a narrow peak at 0.88 for the coronal slices and at 0.83 for the remaining views; these values are higher than the mean values reported in the literature. This suggests that having fewer active pixels in a slice can result in a lower SSIM value for that slice, leading to 0.86 on average for the patients in table 4.
We report the mean instead of the median in table 4 to make the metrics sensitive to our significant number of outliers. These results were obtained from the real AM and planning CT images, without any data augmentation applied. We have also evaluated the effect of the data augmentation filters, particularly the Gaussian smoothing, by examining the sCT scans derived from smoothed AM and planning CT images and comparing them with the smoothed real control CT. All figures of merit calculated on such CT scans lie within ±10% of the values reported in table 4, which is comparable with or lower than the intrinsic variability of the image reconstruction process (see figure 7). Finally, in figure 8, we report the sCTs produced by our model for Patient 2, who shows the best results in terms of the metrics adopted in the CTV. The morphological change is correctly reproduced, and no significant errors in the shape of the variation are present. The geometric comparison described in the following section focuses on the areas of morphological change in all the patients.

Geometric comparison
In figure 9, we present three different views for three patients, with a zoomed view of the CTV area subject to morphological changes. Specifically, we report Patient 5 in the upper row (DSC 0.96, SSIM 0.84), Patient 3 in the middle row (DSC 0.96, SSIM 0.87), and Patient 1 in the lower row (DSC 0.93, SSIM 0.80).
We see that the portion of the control CT shown in these figures is not perfectly reproduced by the sCT. There are multiple causes for this. Firstly, this region is where most of the morphological change occurred, and it is therefore the most difficult to reconstruct properly. Secondly, some of the changes are very small and may not be captured due to the lower resolution of the AMs and of the reconstructed sCT. These facts, together with the higher statistical noise present in small ROIs, explain the lower SSIM values reported in table 4 for the smaller ROIs.
Finally, figure 10 reports on the x-axis the expected area of the MCV contours in mm² and on the y-axis the passing rate (i.e. the percentage of pixels that correctly match between the MCV and sMCV), calculated from the sum of the absolute area difference between each MCV and sMCV contour for each patient. The relative error on the area enclosed by the contours ranges from 0.9% up to 10.9%, with a mean of 5.7% and a standard deviation of 2.6%.
Fewer than 30% of the contours show a passing rate lower than 70%, and these low passing rates correspond only to very small areas (<5 cm²). Most of the contours lie in the 85%–100% passing-rate range. Furthermore, the passing-rate trend shows that as the area increases, more contours become discernible, resulting in a higher level of agreement between the sCT and the control CT.
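As a concrete illustration, the per-contour passing rate and the relative area error can be sketched from binary contour masks as follows. This is a minimal sketch under our own assumptions: the function names, the pixel-wise matching criterion and the pixel size are illustrative, not the exact implementation used in this work.

```python
import numpy as np

def passing_rate(mcv_mask, smcv_mask):
    """Percentage of pixels inside the expected (control CT) contour
    that are also covered by the synthetic (sCT) contour."""
    expected = np.asarray(mcv_mask, dtype=bool)
    predicted = np.asarray(smcv_mask, dtype=bool)
    n_expected = expected.sum()
    if n_expected == 0:
        return float("nan")
    matched = np.logical_and(expected, predicted).sum()
    return 100.0 * matched / n_expected

def relative_area_error(mcv_mask, smcv_mask, pixel_area_mm2=1.6 * 1.6):
    """Relative error (%) on the area enclosed by the two contours,
    assuming the AM in-plane pixel size of 1.6 x 1.6 mm^2."""
    a_expected = np.asarray(mcv_mask, dtype=bool).sum() * pixel_area_mm2
    a_synthetic = np.asarray(smcv_mask, dtype=bool).sum() * pixel_area_mm2
    return 100.0 * abs(a_synthetic - a_expected) / a_expected

# Toy example: a 6x6 expected contour and a synthetic one missing one row.
mcv = np.zeros((10, 10), dtype=bool); mcv[2:8, 2:8] = True
smcv = np.zeros((10, 10), dtype=bool); smcv[3:8, 2:8] = True
print(round(passing_rate(mcv, smcv), 1))         # 83.3
print(round(relative_area_error(mcv, smcv), 1))  # 16.7
```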

Discussion
In this work, we used a combination of planCT scans and non-invasively simulated activity maps of β+ emitters (AMs) from six proton-treated patients to produce sCT images with a NN based on ViTs. The ViT was chosen for its ability to learn image features from image patches. This resulted in synthetic images that were visually hard to distinguish from the ground truth (figure 8). The prediction accuracy was assessed using the metrics defined in table 2 and reported in table 4 for different anatomical regions, and through a geometrical comparison based on contour surface differences. In the Skin region, we obtained, averaged over the six patients, an MAE of 22 HU, an RMSE of 56 HU, a PSNR of 30 dB, an SSIM of 0.91 and a DSC of 0.98, which are comparable to or slightly better than those found in the literature. This is particularly relevant considering that our synthetic images, unlike those of other works, were generated starting from induced activity maps. Based on these results, we can say that our model accurately reproduces sCT images from activity data: figure 8 shows an exemplary case of successful reconstruction of the control CT. Figure 6 shows that image quality degrades as fewer and fewer events are used. This can become a concern with low-sensitivity PET systems and might imply that a dedicated, high-sensitivity PET system is required to make this method feasible. However, the numbers shown in figure 6 are a worst-case simulation of this effect, as the network that generated them was not re-trained on the lower-quality images. This study has shown that annihilation emissions contain enough information to reconstruct morphological changes. Future studies will aim to estimate the physical characteristics (geometry, sensitivity, resolution, acquisition time, etc) required for a real PET scanner to achieve the minimum image quality needed to produce images of diagnostic utility.
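For reference, the voxel-wise figures of merit of table 2 can be computed in a few lines. The sketch below is an illustration under our own assumptions (in particular the HU dynamic range used in the PSNR and the boolean ROI-mask convention), not the exact implementation used in this work.

```python
import numpy as np

def similarity_metrics(cct, sct, roi, hu_range=4000.0):
    """MAE and RMSE (HU) and PSNR (dB) between the control CT (cct)
    and the synthetic CT (sct), restricted to a boolean ROI mask.
    hu_range is an assumed CT dynamic range for the PSNR."""
    diff = (np.asarray(sct, dtype=float) - np.asarray(cct, dtype=float))[roi]
    mae = np.mean(np.abs(diff))
    rmse = np.sqrt(np.mean(diff ** 2))
    psnr = 20.0 * np.log10(hu_range / rmse)
    return mae, rmse, psnr

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks
    (e.g. body contours thresholded from the two CTs)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Toy example: a constant 10 HU offset over the whole ROI.
cct = np.zeros((4, 4))
sct = np.full((4, 4), 10.0)
roi = np.ones((4, 4), dtype=bool)
mae, rmse, psnr = similarity_metrics(cct, sct, roi)
print(mae, rmse, round(psnr, 1))  # 10.0 10.0 52.0
```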
We obtained worse results in the smaller regions than in the entire Skin region, because comparing a limited number of pixels leads to larger statistical errors. The results are nonetheless promising, given the small size of the anatomical changes (of the order of a few ml), the lower amount of anatomical information contained in the AMs, and the inferior spatial resolution of the AMs compared to other state-of-the-art techniques, such as CBCT and MRI. Moreover, in such regions (like the CTV), contrary to the brain region or the whole skull, several small structures, shapes and details assume a global relevance. For these reasons, these are the hardest regions to reconstruct. We notice that our model reproduces such regions with the same accuracy regardless of whether morphological changes have occurred. In other words, the main difficulty lies in reconstructing the region itself, not in identifying potential anatomical changes.
In figure 9(a), third column, we observed a slight blurring in the anatomical regions reproduced by our model. For this reason, we proposed the geometric comparison of figure 10, which allows us to determine whether our model can adequately reproduce the morphological changes. The geometrical comparison was also applied to patients not subject to morphological changes, by taking into account the contours corresponding to the whole sinus cavity area. More than 70% of the compared contour areas show a percentage of agreement above 70%. Moreover, we found that as the expected control CT area increases, more contours become discernible, resulting in a higher level of agreement between the sCT and the control CT.
Several aspects of this work deserve further research. First, the numbers of patients and of AM fractions included were relatively small compared to other works on sCT generation through AI models (see table 3). We know that the limited amount of data might affect the generalizability of the model. In fact, we are limited not only by the number of patients in the dataset, but also by the fact that we consider only one anatomical district (H&N). This restriction poses challenges in applying the model to diverse patient cases with varying pathologies or distinct anatomical regions, thereby also heightening the risk of overfitting. Focusing solely on the H&N district, and consistently encountering morphological changes within the same region (the sinus cavity), could thus constrain the model's applicability to other scenarios. Moreover, some patients in the training dataset do not exhibit morphological changes in the control CT. Furthermore, the areas corresponding to the sinus cavity (regardless of the presence of morphological changes) are the most difficult ones to reconstruct, for the reasons highlighted in section 3.2. This leads to sCT images in which the edges of the sinus cavity are sometimes imperfectly reproduced, owing to the inferior spatial resolution of the AMs compared to the CT resolution (1.6 × 1.6 × 1.6 mm³ instead of 0.97 × 0.97 × 2 mm³). The size of the internal representations of the inputs in the pre-trained ViT model also affects the quality of the sCT images, limiting their spatial resolution; for this reason, the images may appear grainy. A larger model could overcome this issue. Nonetheless, such differences, while degrading the image quality metrics, are not relevant from a dosimetric point of view, considering that the dose distribution used to evaluate a treatment plan comes with a voxel resolution of 2 × 2 × 2 mm³.

Conclusion
This study demonstrated that, using a combination of planning CT scans and simulated activity maps of β+ emitters (AMs) from six proton-treated patients, high-quality sCT images can be generated with a NN based on ViTs. The sCTs generated by our ViT model show good agreement with the expected control CT, both in terms of similarity metrics and through a geometric comparison based on contour surface differences between the sinus cavity in the sCT and the expected sinus cavity in the control CT. Despite the very good results obtained by the ViT model, the architecture used still has some limitations, especially in its ability to cope with non-local information diffusion from the AMs to the reconstructed CT. Improvements to the architecture could therefore aim at increasing its ability to diffuse local AM information to the entire CT, possibly taking inspiration from variational autoencoders or other similar space-variant models. However, given the good agreement that we already observe between the control CTs and the sCTs, potential improvements to the architecture will become visible only on a much larger dataset containing a multitude of anatomical districts.
This work paves the way to representing IB-PET data in a much more intuitive and understandable form, i.e. as an sCT similar to those already used in clinical practice. Even though we tested this method on IB-PET, the ideas and models developed are also applicable to other PET-based beam monitoring modalities, such as off-line or in-room PET. It also demonstrates that the information contained in the induced activity is sufficient to identify morphological changes: the results achieved with the ViT are in fact comparable with those obtained when AI is applied to state-of-the-art CBCT and MRI images. Future work will be needed to address the shortcomings of using simulated activity maps, by taking into account the effects of PET instrumentation and its limitations, such as the limited resolution and the incomplete angular coverage.

Figure 1. Example of morphological changes (within the red box) in corresponding planCT and control CT slices, coronal view, for two patients: (a), (b) Patient 2 and (c), (d) Patient 4.

Figure 2. (a) and (c) Example of an AM slice overlaid on the corresponding slice of the planCT. (b) and (d) The same slice, but with the control AMs overlaid on the control CT, for the two patients of figure 1. The white arrows indicate the beam direction.

Figure 3. Exemplifying representation of our NN architecture and data training. Each slice of the planCT and AMs is divided into patches. Subsequently, each patch goes through a separate ViT head (AM N0 and AM Nx share the same ViT weights). The output of a single ViT head is a patch embedding, which represents the patch as a vector. The three patch embeddings are then merged by a concatenation block and fed to a series of linear layers (MLP block), producing the synthetic patch-CT.

Figure 4. MCV region. (a) Slice of the planCT, (b) the same slice of the control CT and (c) the same slice as (a), but with the MCV binary mask overlaid (in green).

Figure 5. The assignment of a value of 0.5 × pixel area to a square in the grid.

Figure 6. Trend of the RMSE in the CTV region for Patient 2, with varying AM simulation time.

Figure 7. Histograms of MAE (top left), PSNR (top right), DSC (bottom left) and SSIM (bottom right) of the sCT against the control CT in the CTV region, calculated separately for each slice of each view (coronal: blue, sagittal: orange, axial: green).

Figure 8. Patient 2: (a), (d) and (g) show an example slice in the axial, sagittal and coronal views of the planCT, respectively. (b), (e) and (h) present the same slice from the control CT (i.e. the ground truth). (c), (f) and (i) display the corresponding slices predicted by our model, referred to as sCT. The red squares highlight the morphological changes.

Figure 9. (a) From left to right: planCT, control CT and sCT details in the CTV area for patient 5, coronal view. (b) The same images, but for the sagittal view of patient 3. (c) Axial view of the planCT, control CT and sCT of patient 1.

Figure 10. Scatter plot of the passing rate versus the expected area for all contours considered, for all patients. The x-axis denotes the expected area, derived from the MCV area values.

Table 1 .
Proton-treated patients considered in this work. AM Nx s, corresponding to the six control CTs. For Patient 1, a total of 5 × 3 = 15 additional AM Nx were obtained, thanks to the inclusion of the five additional artificially modified CTs in the dataset.

Table 2 .
Similarity metrics used for sCT quality assessment (from Spadea et al 2021). cCT i and sCT i are the values of the same ith pixel in the cCT and sCT, respectively.

Table 3 .
Results from other NN models regarding sCT generation in particle therapy. The '-' marks stand for 'not reported'.

Table 4 .
Results obtained for each figure of merit in different anatomical regions for each patient.