A patient-specific deep learning framework for 3D motion estimation and volumetric imaging during lung cancer radiotherapy

Objective. Respiration introduces a constant source of irregular motion that poses a significant challenge for the precise irradiation of thoracic and abdominal cancers. Current real-time motion management strategies require dedicated systems that are not available in most radiotherapy centers. We sought to develop a system that estimates and visualises the impact of respiratory motion in 3D given the 2D images acquired on a standard linear accelerator. Approach. In this paper we introduce Voxelmap, a patient-specific deep learning framework that achieves 3D motion estimation and volumetric imaging using the data and resources available in standard clinical settings. Here we perform a simulation study of this framework using imaging data from two lung cancer patients. Main results. Using 2D images as input and 3D–3D Elastix registrations as ground-truth, Voxelmap was able to continuously predict 3D tumor motion with mean errors of 0.1 ± 0.5, −0.6 ± 0.8, and 0.0 ± 0.2 mm along the left–right, superior–inferior, and anterior–posterior axes respectively. Voxelmap also predicted 3D thoracoabdominal motion with mean errors of −0.1 ± 0.3, −0.1 ± 0.6, and −0.2 ± 0.2 mm respectively. Moreover, volumetric imaging was achieved with mean absolute error 0.0003, root-mean-squared error 0.0007, structural similarity 1.0 and peak signal-to-noise ratio 65.8. Significance. The results of this study demonstrate the possibility of achieving 3D motion estimation and volumetric imaging during lung cancer treatments on a standard linear accelerator.


Introduction
Approximately half of all cancer patients can benefit from radiotherapy (Tyldesley et al 2011, Barton et al 2006, 2014), but the commonest causes of cancer-related death involve tumors in the thorax and abdomen (Ferlay et al 2020) that are constantly moving due to respiration (Keall et al 2006). In the presence of this motion, radiation beams must be enlarged to encompass the tumor position over all points in the respiratory cycle, but this comes at the cost of healthy tissue damage. Real-time motion management can be used to maintain minimal target volumes in the presence of respiration, but these technologies require dedicated systems that are not available in most radiotherapy centers (Kamino et al 2006, Kupelian et al 2007, Kilby et al 2010, Shah et al 2011, Lagendijk et al 2014, Mutic and Dempsey 2014, O'Shea et al 2016). A recent survey of 200 centers across 41 countries found that 71% wished to extend real-time motion management to additional treatment sites, but were hindered by human and financial resources as well as machine capacity (Anastasi et al 2020). These results indicate a strong demand for inexpensive motion management strategies that can be seamlessly integrated into existing clinical workflows.
Precisely irradiating lung tumors is a particularly challenging task in cancer radiotherapy since targets and organs-at-risk can move a centimetre or more due to respiration. These patient-specific respiratory motion patterns must be modelled both accurately and efficiently to enable real-time motion management. In 2007, Zhang et al demonstrated that principal component analysis (PCA) could be used to compactly parameterize respiratory-induced 3D internal organ motion (Zhang et al 2007). Li et al then extended this idea to real-time volumetric imaging and 3D tumor localisation (Li et al 2010a, 2010b), and we later coupled this PCA-based approach with diaphragm tracking to adapt to different breathing patterns encountered across separate imaging sessions (Shieh et al 2019a). However, it was found that this approach yielded physiologically unrealistic 3D images and large localization errors (see supplementary material). PCA also only captures linear relationships between variables and therefore does not reflect the highly dynamic, nonlinear nature of respiratory-induced motion. Moreover, this method required iterative optimization with longer convergence times for more challenging images, posing a significant obstacle for real-time implementation. In this paper we introduce Voxelmap, a patient-specific deep learning framework that leverages the nonlinear mathematics of manifolds and neural networks to achieve 3D motion estimation and volumetric imaging in a single shot. Further, we perform a simulation study of this framework in the context of lung cancer radiotherapy, where accurate 3D motion estimation could enable a wide array of interventions including respiratory and exception gating, multi-leaf collimator (MLC) tracking and couch shifts.
A key motivator for the use of Voxelmap in image-guided radiotherapy is to understand how insights from the labor-intensive tasks of segmentation and treatment planning should be updated during treatment. With this in mind, Voxelmap trains in a patient-specific manner using only data acquired on the planning day (figure 1(a)). The trained network is then deployed during treatment both to warp segments of the tumor and organs-at-risk in real-time for online adaptation and to enable volumetric imaging (figure 1(b)). The core idea behind our proposed approach is that 3D motion estimation can be made computationally tractable by considering that 2D views provide hints about 3D motion. For instance, inhalation involves simultaneous inferior and anterior motion of the diaphragm such that an inferior shift in the coronal plane corresponds to an anterior shift in the sagittal plane. That is, certain 'allowed' patterns of 3D motion emerge given the corresponding 2D views (figure 1(c)). These patterns can be learned via manifold approximation with neural networks. To validate this method, a deep neural network was trained and tested on imaging data acquired on two separate days. During each forward-pass, the neural network first concatenates acquired and reference 2D images. This 2D image pair is then fed into an encoding arm, which can be thought of as generating a latent low-dimensional representation of the key features in 2D image space. This low-dimensional feature map is then reshaped to a 3D tensor for processing by a decoding arm to produce a 3D DVF. Here we include the additional constraint that the underlying manifolds must be differentiable and therefore that the desired transformations be diffeomorphic (it should be noted that the diffeomorphic constraint was included optionally to demonstrate the possibility of such mappings, but there is nothing in the mathematics presented above that requires such a constraint). Inspired by Dalca et al (2019), this constraint is imposed implicitly by using scaling and squaring layers to efficiently integrate the output of the decoding arm. The resulting 3D DVF is passed through spatial transformation layers along with a reference 3D image to produce a predicted 3D image (figure 2).

Materials and methods
Problem formulation
Inspired by image reconstruction by domain-transform manifold learning (Zhu et al 2018), we formulate the problem of mapping from 2D images to 3D motion as one of manifold approximation (figure 1(c)). The standard clinical workflow for thoracoabdominal targets begins by acquiring a 4D-CT (Vedam et al 2002, Ford et al 2003, Low et al 2003), which is a set of 10 CT images $(y_1, \ldots, y_{10})$ where each image represents the 3D internal anatomy of a patient over the course of respiration. For example, $y_1$ corresponds to the 3D image at peak-inhalation, $y_2$ corresponds to the 3D image at 20% exhalation, et cetera. The peak-exhalation image is typically selected as the reference image $y_{ref}$ for segmentation and treatment planning as it usually contains the fewest motion artefacts. Image registration to $y_{ref}$ yields a set of 3D deformation vector fields (DVFs) $(z_1, \ldots, z_{10})$, which represent how every voxel in $y_{ref}$ moves to its corresponding position in each image in the set $(y_1, \ldots, y_{10})$.
Additionally, since the angles at which the patient will be imaged and irradiated on the treatment day are often known beforehand, we can forward-project each 3D image $y_i$ at $n$ angles to obtain a set of 2D images $(x_i^{a_1}, \ldots, x_i^{a_n})$. The task is then to map from the 2D image pair $(x_i^a, x_{ref}^a)$ at angle $a$ to the 3D DVF $z_i$. Importantly, there are certain key anatomical features, such as the diaphragm and intercostal muscles, that are visible in each 2D image pair and that drive motion in each 3D DVF. Additionally, since every 2D image and 3D DVF differ only by respiratory motion for a given patient, this set of key features can be embedded in an abstract topological space that varies smoothly. In other words, there exists a function $s_x$ that maps $(x_i^a, x_{ref}^a)$ onto the manifold over 2D image pairs $X$ and a function $s_z$ that maps $z_i$ onto the manifold over 3D DVFs $Z$. Intuitively, these manifolds can be considered as atlases of respiratory motion. We evaluate the performance of this approach using imaging data from two lung cancer patients.
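In this notation, and following figure 1(c), the candidate 3D DVF produced for an image pair acquired at angle $a$ can be written as the composition

```latex
\hat{z}_i = s_z^{-1}\big( f\big( s_x( x_i^{a},\, x_{\mathrm{ref}}^{a} ) \big) \big),
```

where $f: X \to Z$ is the map between manifolds that must be learned. Loosely speaking, the encoding arm of the network approximates $s_x$ while the decoding arm approximates $s_z^{-1} \circ f$; this decomposition is our gloss on figure 1(c), not an additional assumption of the method.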

Source
Imaging data were acquired using scans from the sparse-view reconstruction (SPARE) challenge (Shieh et al 2019b). Briefly, 1 min CBCT scans were simulated using 4D-CT volumes for patients with locally advanced non-small-cell lung cancer receiving 3D conformal radiotherapy. These CBCT scans were simulated by first acquiring multiple 4D-CT volumes for each patient on separate days, followed by forward-projection to generate digitally-reconstructed radiographs. 4D-CT volumes used to simulate planning were acquired on a different day to those used to simulate treatment. Planning day data were used to train Voxelmap, while treatment day data were used for testing. For the planning day data, each 4D-CT supplied 10 3D images which were projected to produce 2D images at 680 angles for one 360° revolution. This yielded 6800 volume-projection pairs with a 90:10 training:validation split, i.e. 6120 images for training and 680 images for validation. For the treatment day data, respiratory motion was simulated by converting real-time position management (Varian Medical Systems, Palo Alto, US) traces into respiratory phases and acquiring projections for the corresponding 4D-CT volumes at 680 angles for one 360° revolution. This yielded 680 volume-projection pairs for testing.

Figure 1. (a) Prior to treatment, 3D images are used to produce 2D images and 3D DVFs for network training by forward-projection and deformable image registration respectively. (b) During treatment, a trained neural network takes an acquired 2D image from the imaging system as input in addition to a reference 2D image and reference 3D image. The network then predicts a 3D deformation vector field (DVF) that can be fed back into the imaging system for a range of interventions. Additionally, this vector field can be used to warp the reference 3D image to produce a predicted 3D image for volumetric imaging. (c) Certain 'allowed' patterns of 3D motion emerge given the corresponding 2D views. 3D images $y_i$ and $y_{ref}$ are related by the 3D DVF $z_i$, while 2D images $x_i^a$ and $x_{ref}^a$ are obtained by applying the projection operator $P_a$ to $y_i$ and $y_{ref}$ respectively. Mapping from 2D images to 3D DVFs can be considered as a problem of manifold approximation. More specifically, $s_x$ maps $(x_i^a, x_{ref}^a)$ to the manifold over 2D image pairs $X$, $f$ maps from $X$ to the manifold over 3D DVFs $Z$, and $s_z^{-1}$ maps from $Z$ to the candidate 3D vector field $z_i$.
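To make the data generation and split concrete, the following is a minimal sketch (array names and the random split are our assumptions; the counts come from the text above):

```python
# Sketch: enumerating planning-day (projection, DVF) training pairs.
# 10 4D-CT phases x 680 gantry angles = 6800 examples, split 90:10.
import numpy as np

n_phases, n_angles = 10, 680
angles = np.linspace(0.0, 360.0, n_angles, endpoint=False)

# Each example pairs a 2D projection x_i^a (depends on phase i and angle a)
# with the 3D DVF z_i (depends on phase i only).
examples = [(i, a) for i in range(n_phases) for a in angles]

rng = np.random.default_rng(seed=0)               # split details are assumptions
perm = rng.permutation(len(examples))
n_val = len(examples) // 10                       # 680 validation examples
val_idx, train_idx = perm[:n_val], perm[n_val:]   # 6120 training examples
```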
All projections were simulated for a 120 kVp beam with a pulse length of 20 ms passing through a half-fan bowtie filter. Scatter and noise were generated at 40 mA tube current via Monte Carlo methods to simulate imaging conditions commonly encountered in radiotherapy. Of the nine patients originally included in the SPARE study, only two had Monte Carlo simulations with equivalent scatter and noise profiles generated over two separate imaging days and a clearly visible diaphragm. We validate our proposed method against the imaging data for both of these patients in this study. (We also demonstrate the use of Voxelmap for images without scatter in the supplementary material.)

Pre-processing
Every 2D projected image was initially generated with pixel sizes of 0.776 mm × 0.776 mm and dimensions 512 × 384. However, they were downsampled (without the use of anti-aliasing filters, padding or interpolation) to 128 × 128 due to memory constraints. Similarly, every 3D volumetric image was initially generated with voxel size 1 mm × 1 mm × 1 mm and dimensions 450 × 220 × 450 but was resized to 512 × 256 × 512 by cubic interpolation and downsampled to 128 × 128 × 128, yielding voxels of size 3.5 mm × 1.7 mm × 3.5 mm. Hence, the results reported here represent a lower limit in terms of performance, as the network can be scaled to accommodate higher resolution images as computational technology advances. Similarly, uneven downsampling led to a slightly stretched or squashed appearance across different imaging dimensions, but this corresponds to a rescaling of the associated DVFs for the registration task and does not reflect a fundamental limitation of the method. Pixel intensities for all images were normalised between 0 and 1.
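A minimal sketch of these resizing and normalisation steps (in PyTorch; trilinear and nearest-neighbour interpolation stand in for the MATLAB cubic interpolation and plain downsampling, and the function names are ours):

```python
# Sketch: pre-processing as described above (shapes taken from the text).
import torch
import torch.nn.functional as F

def preprocess_projection(proj):
    """Downsample a (512, 384) projection to 128 x 128 and normalise to [0, 1]."""
    x = torch.as_tensor(proj, dtype=torch.float32)[None, None]   # (1, 1, H, W)
    x = F.interpolate(x, size=(128, 128), mode="nearest")        # no anti-aliasing
    return ((x - x.min()) / (x.max() - x.min() + 1e-8))[0, 0]

def preprocess_volume(vol):
    """Resize a (450, 220, 450) volume to 512 x 256 x 512, then to 128^3."""
    y = torch.as_tensor(vol, dtype=torch.float32)[None, None]    # (1, 1, D, H, W)
    y = F.interpolate(y, size=(512, 256, 512), mode="trilinear",
                      align_corners=False)
    y = F.interpolate(y, size=(128, 128, 128), mode="nearest")   # 3.5/1.7/3.5 mm voxels
    return ((y - y.min()) / (y.max() - y.min() + 1e-8))[0, 0]
```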
Lung and planning target volume (PTV) masks were delineated by a clinician for each patient on the peak-exhalation 4D-CT. Lung masks were used to generate thoracoabdominal masks by generating a convex hull to include both lungs, expanding the resulting hull by binary dilation to include the ribs, and extending inferiorly to the bottom of the image. Deformable image registration between the peak-exhalation phase of the 4D-CT and every other phase of the 4D-CT for the simulated planning and treatment days was performed using the Elastix toolkit (Klein et al 2010) to produce 3D DVFs for training and testing. In particular, multi-resolution registration was performed using B-spline interpolation of order 3, optimizing for advanced Mattes mutual information with adaptive stochastic gradient descent and producing the resultant DVF via a B-spline transform. All DVFs produced by Elastix were examined to ensure that there were no physiologically unrealistic registration results. All pre-processing steps were performed in MATLAB (The MathWorks, Inc., Natick, MA).

Figure 2. (a) The input layer concatenates 2D image pairs; (b) the encoding arm consists of n repeating residual blocks (where 2D image size = 2^n × 2^n) followed by reshaping to a 3D tensor; (c) the decoding arm consists of n repeating residual blocks followed by two additional convolutions at the desired 3D deformation vector field (DVF) size (3 × 2^n × 2^n × 2^n); (d) optional integration layers can be included to encourage diffeomorphic transformations; (e) predicted 3D images are produced by passing output 3D DVFs and reference 3D images through spatial transformation layers.
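The thoracoabdominal mask construction could look roughly as follows (a Python sketch using scipy/skimage as stands-ins for the MATLAB implementation; the axis conventions and dilation radius are assumptions):

```python
# Sketch: thoracoabdominal mask from a clinician-drawn lung mask.
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.morphology import convex_hull_image

def thoracoabdominal_mask(lung_mask, dilation_iters=3):
    """lung_mask: bool array (LR, SI, AP); SI axis and iteration count assumed."""
    hull = np.zeros_like(lung_mask)
    for k in range(lung_mask.shape[0]):          # slice-wise 2D convex hulls
        if lung_mask[k].any():
            hull[k] = convex_hull_image(lung_mask[k])
    hull = binary_dilation(hull, iterations=dilation_iters)   # include the ribs
    # Extend inferiorly to the bottom of the image by repeating the lowest
    # occupied SI slice (the direction of 'inferior' is an assumption here).
    lowest = np.argwhere(hull)[:, 1].max()
    hull[:, lowest:, :] = hull[:, lowest, :][:, None, :]
    return hull
```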

Neural networks
The Voxelmap network consists of five components: (1) an input layer, (2) an encoding arm, (3) a decoding arm, (4) integration layers (included here, but optional), and (5) a spatial transformation module (figure 2). To describe the rationale behind each of these components: the input layer concatenates 2D image pairs; the encoding arm serves to produce a latent, low-dimensional representation of the key features in 2D image pairs; conversely, the decoding arm serves to map from this latent, low-dimensional representation to the 3D DVF between 3D image pairs; the optional integration layers encourage diffeomorphic transformations by computing the integral of the output from the final layer of the decoding arm; lastly, the spatial transformation module serves to deform the reference 3D image using the output 3D DVF of the neural network, enabling volumetric imaging.
In this study, 2D image pairs are concatenated to produce an input tensor of dimensions 2 × 128 × 128, where the first number indicates the number of channels while the second and third indicate image size. This is fed into the encoding arm of the neural network, which consists of n = 7 (since image size = 128 × 128 = 2^7 × 2^7) repeating residual blocks, where each input tensor is convolved with a kernel of size 4 × 4 with stride 2 and padding 1, followed by a kernel of size 3 × 3 with stride 1 and padding 1 and batch normalization. This repeating pattern of convolutions yields output images at half the dimension of the original inputs, and the number of output channels is chosen as double that of the original input.
Hence, the first residual block produces an output tensor of dimension 4 × 64 × 64, the next residual block produces an output tensor of dimension 8 × 32 × 32, and so on until the final block of the encoding arm with output dimensions 256 × 1 × 1. Importantly, the number of residual blocks is determined intrinsically by the size of the input images such that larger images with more detailed DVFs are processed by larger networks. The output tensor of the encoding arm is taken to represent a low-dimensional latent representation of 2D image space, which is reshaped to dimensions 256 × 1 × 1 × 1 for processing by the decoding arm.
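As an illustration, one plausible PyTorch realisation of the encoding arm (the exact placement of the residual skip connection is not specified above, so this arrangement is an assumption):

```python
# Sketch: encoding arm, 2 x 128 x 128 input -> 256 x 1 x 1 latent code.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Halve spatial size and double channels: (C, H, W) -> (2C, H/2, W/2)."""
    def __init__(self, in_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(2 * in_ch, 2 * in_ch, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(2 * in_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        h = self.act(self.down(x))                    # 4x4, stride 2: halve H and W
        return self.act(self.bn(self.conv(h)) + h)    # 3x3, stride 1, residual skip

encoder = nn.Sequential(*[EncoderBlock(2 ** (i + 1)) for i in range(7)])  # n = 7
x = torch.randn(2, 2, 128, 128)                  # image pairs, batch size 2 as in training
latent = encoder(x)                              # (2, 256, 1, 1)
latent_3d = latent.view(-1, 256, 1, 1, 1)        # reshape for the 3D decoding arm
```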
In a reciprocal manner to that of the encoding arm, the decoding arm also consists of n = 7 repeating residual blocks. However, each block performs transpose convolution with a kernel of size 4 × 4 × 4 with stride 2 and padding 1, followed by a kernel of size 3 × 3 × 3 with stride 1 and padding 1 and batch normalization. Two additional convolutions are then performed at the desired output image size of 128 × 128 × 128 with kernels of size 3 × 3 × 3 with stride 1 and padding 1, and the number of channels is chosen to produce a final output tensor of dimensions 3 × 128 × 128 × 128. Every convolution, transpose convolution and fully connected layer uses ReLU activation, except the penultimate and final layers, which use tanh and linear activation respectively. The output of the decoding arm is then fed through scaling and squaring layers (Dalca et al 2019) (step size = 10), which efficiently integrate the corresponding tensor. This process yields a 3D DVF, which is then fed into a spatial transform module along with the reference 3D image to produce a predicted 3D image.
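To illustrate the last two components, here is a sketch of scaling-and-squaring integration and a grid_sample-based spatial transform, following the approach of Dalca et al (2019); the tensor layout, voxel-displacement convention, and helper names are our assumptions:

```python
# Sketch: integration (scaling and squaring) and spatial transformation layers.
import torch
import torch.nn.functional as F

def identity_grid(shape):
    """Voxel-coordinate grid of a (D, H, W) volume, shape (3, D, H, W)."""
    coords = torch.meshgrid(*[torch.arange(s, dtype=torch.float32) for s in shape],
                            indexing="ij")
    return torch.stack(coords, dim=0)

def warp(vol, dvf):
    """Warp vol (B, C, D, H, W) by dvf (B, 3, D, H, W) given in voxel units."""
    D, H, W = vol.shape[2:]
    pos = identity_grid((D, H, W))[None].to(vol) + dvf       # sampling positions
    # grid_sample wants coordinates normalised to [-1, 1], ordered (x, y, z)
    grid = torch.stack([2.0 * pos[:, i] / (s - 1) - 1.0
                        for i, s in enumerate((D, H, W))], dim=-1)
    return F.grid_sample(vol, grid.flip(-1), align_corners=True)

def scaling_and_squaring(velocity, steps=10):
    """Integrate a stationary velocity field: scale down, then square repeatedly."""
    flow = velocity / (2 ** steps)                           # small initial displacement
    for _ in range(steps):                                   # flow <- flow o flow
        flow = flow + warp(flow, flow)
    return flow

# dvf = scaling_and_squaring(decoder_output)   # decoder_output: (B, 3, 128, 128, 128)
# y_pred = warp(y_ref, dvf)                    # warp reference 3D image -> prediction
```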

Training
Training loss was defined as $\mathrm{MSE}(z_{true}, z_{pred})$, where MSE is mean-squared error, $z_{true}$ is the true 3D DVF and $z_{pred}$ is the predicted 3D DVF. To encourage the network to focus on learning respiratory motion, this loss was computed within a thoracoabdominal mask. The neural network was trained using the Adam learning algorithm (Kingma and Ba 2014) with learning rate 1 × 10^−5 and batch size 2 for 30 epochs, requiring approximately 20 h on an NVIDIA RTX 2080 Super Max-Q 8 GB GPU. The learning rate was determined through grid-search beginning at 1 and reducing the rate by a factor of ten until convergence was reached; a batch size of 2 was the largest that could fit on the GPU; and 30 epochs were required for convergence. All neural network experiments were performed in the PyTorch machine learning framework (Paszke et al 2019).
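A minimal sketch of this training procedure (in PyTorch; `model`, `loader` and the `mask` tensor are assumed to exist, and the masked-MSE normalisation is our choice):

```python
# Sketch: masked-MSE training with Adam (lr 1e-5, batch size 2, 30 epochs).
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

for epoch in range(30):
    for x_pair, z_true in loader:          # x_pair: (2, 2, 128, 128); z: (2, 3, 128^3)
        z_pred = model(x_pair)
        diff = (z_pred - z_true) * mask    # thoracoabdominal mask, broadcastable 0/1
        loss = diff.pow(2).sum() / (3 * mask.sum() * z_pred.shape[0])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```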

Evaluation metrics
Once trained, the network was tested using simulated treatment images. Importantly, these images and the corresponding DVFs were unseen during training. To evaluate DVF accuracy, mean errors (i.e. the average voxel-wise difference) between the ground-truth and predicted 3D DVFs were measured along the left-right, superior-inferior, and anterior-posterior axes within thoracoabdominal and PTV masks. To evaluate tumor shape similarity, Dice coefficients between the ground-truth and predicted PTV were recorded. To evaluate image accuracy, mean absolute error, root-mean-squared error, structural similarity and peak signal-to-noise ratio in pixel intensities between the ground-truth and predicted 3D volumetric images were recorded. Lastly, to evaluate network efficiency, training time, mean inference time and the number of trainable parameters were recorded.
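As a sketch, two of these metrics as we implement them (tensor shapes and mask handling are assumptions):

```python
# Sketch: mean DVF error per axis within a mask, and the Dice coefficient.
import torch

def mean_error_per_axis(z_true, z_pred, mask):
    """Average voxel-wise difference (mm) along LR, SI, AP within a bool mask.
    z_*: (B, 3, D, H, W); mask: (B, D, H, W) boolean."""
    return [(z_pred[:, i] - z_true[:, i])[mask].mean().item() for i in range(3)]

def dice(mask_a, mask_b):
    """Dice coefficient between two masks (e.g. ground-truth vs predicted PTV)."""
    a, b = mask_a.bool(), mask_b.bool()
    return 2.0 * (a & b).sum().item() / (a.sum().item() + b.sum().item())
```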

Results
Loss curves indicate that the models fit the seen training data well and also generalized to unseen validation data (figure 3). As shown in figure 4, there was significant overlap between the predicted and ground-truth tumor positions. Patient 1 had mean tumor position errors of 0.5 ± 0.2, −0.9 ± 0.7, and 0.1 ± 0.2 mm along the left-right (LR), superior-inferior (SI) and anterior-posterior (AP) axes respectively (table 1). Similarly, Patient 2 had errors of −0.3 ± 0.3, −0.3 ± 0.8, and −0.0 ± 0.3 mm respectively. Voxelmap likewise exhibited submillimetre errors in thoracoabdominal motion for both patients. Predicted tumor shapes were also very close to the ground-truth, with Dice coefficients of 0.89 and 0.92 for Patients 1 and 2 respectively (table 2). Comparing the predicted and ground-truth DVFs of Patient 1 at multiple angles, minor differences were observed except at 270°, where the diaphragm and much of the thoracoabdominal anatomy were precluded from view (figure 5). Conversely, for Patient 2, the field-of-view was such that the diaphragm was visible throughout the scan, and differences between the predicted and ground-truth DVFs did not appear to depend on imaging angle (figure 6). However, the DVFs produced by Voxelmap for Patient 2 appear to have more detail than those in the ground-truth. We suggest that this occurs because DVF smoothness is regularized in the conventional registration method (Klein et al 2010) used to compute the ground-truth, but such regularization was not imposed in the Voxelmap learning process. As a result, when the DVFs produced by Voxelmap were applied to reference images to generate 3D images, much of the fine structure was captured and there were only minor differences to the ground-truth. On the other hand, slight deviations from the ground-truth images were observed for Patient 1 near the diaphragm. Nonetheless, the predicted images exhibited high fidelity overall to the ground-truth images for both patients (table 3). Training time, mean inference time and the number of trainable parameters were 20 h, 50 milliseconds and 1 × 10^7 respectively.

Discussion
Estimating 3D motion from 2D images is a challenging task that has broad implications for image-guided interventions. The results of this paper suggest that this task can be made computationally tractable with an appropriate deep learning framework. In particular, we demonstrate the use of Voxelmap to achieve 3D motion estimation and volumetric imaging for the treatment site that experiences arguably the most motion of any site in radiotherapy: the lung. Moreover, motion estimation is further complicated in the context of image-guided radiotherapy by the fact that the gantry continuously rotates around the patient during image acquisition. Despite these challenges, Voxelmap was able to continuously track tumor motion, specifically, and thoracoabdominal motion, more generally, to submillimetre accuracy with minimal differences across imaging angles.
To gain insight into the semantic representations learned by the Voxelmap network, activations from the first residual block were compared before and after training (figure 7). These activation maps indicate that the network employs broad activation over the entire image prior to training but focuses on specific anatomical regions once trained. In particular, the trained activation maps appear to favour high-contrast structures that are biomechanically important in thoracoabdominal motion, such as the ribs and intercostal muscles. Additionally, strong activations were observed in the lower-most pixels of the image over which the diaphragm deforms during respiration. Focussing on this portion of 2D image space produces a robust representation of diaphragm motion, as the diaphragm edge is not always present in the field-of-view, changes appearance for different imaging angles, and changes position and shape at various stages of the respiratory cycle. This sparse representation of respiratory motion reflects the notion that 3D DVFs can be learned by mapping from a manifold over 2D image pairs. Indeed, these activation maps indicate that the trained network extracts key features in 2D image space to estimate 3D respiratory motion. It should be noted that the Voxelmap network does not explicitly construct models of breathing motion. Instead, these patient-specific manifolds can be thought of as placing implicit constraints on the solution space for 3D motion estimation, thereby enabling efficient computation.
Figure 6. Respiratory motion estimation and volumetric imaging for Patient 2: (a) acquired 2D images at 0°, 90°, 180° and 270° to showcase the performance of Voxelmap at different angles; (b) predicted, ground-truth and difference 3D DVFs, where the magnitude of 3D motion is shown as heatmaps in the coronal axis; (c) predicted, ground-truth and difference 3D images in the coronal axis.

Overfitting is a perennial challenge in machine learning that occurs when a large number of parameters are optimized to fit seen data but do not generalize well to unseen data. Voxelmap addresses this challenge by training neural networks in a patient-specific manner. The core idea behind our proposed framework is that the problem of mapping from 2D images to 3D motion can be solved by learning manifold representations that reflect the specific biomechanics of the patient-of-interest. This approach lies in stark contrast to traditional machine learning, in which large and varied data are used across a multitude of different patients. Indeed, the manifolds learned by Voxelmap provide implicit constraints on the registration task that reflect the specific anatomy of each patient and therefore should not be used across different patients. In other words, our method leverages the accuracy and specificity of optimizing over a large number of parameters for a particular patient while avoiding the issue of generalization by never using the same parameters across different patients. In image-guided radiotherapy, training data can be produced abundantly for this purpose by forward-projecting 3D images acquired during pre-treatment scans. Here a 4D-CT was acquired for each patient, yielding 10 3D images that were then each projected at 680 different angles, yielding almost 7000 training examples. However, patient-specific training required approximately 20 h, which would be infeasible in most clinical settings where several patients are treated every day. To meet these time demands, future work will explore the idea of continually training Voxelmap on a large and growing dataset of multiple patients but fine-tuning weights for each patient prior to treatment. One limitation of the present work is that Voxelmap was only validated against data from two lung cancer patients. However, the data from the sparse-view reconstruction (SPARE) challenge (Shieh et al 2019b) are unique in providing 2D-3D image pairs that have been modified via Monte Carlo simulation to produce scatter and noise profiles commonly encountered in radiotherapy. Further, only two (of the original nine) patients had equivalent scatter and noise profiles generated over two separate imaging days with the diaphragm visible in the field-of-view. (We include results for experiments performed using images without scatter in the supplementary material.) Therefore, while the present work provides proof-of-principle against unique anatomies with distinct physiological motion patterns in realistic conditions, future work will be focussed on demonstrating the robustness of our proposed framework across different patients and disease sites. Moreover, our proposed method would potentially require an extended field-of-view in some patients to ensure that the diaphragm remains visible during imaging. The results of this study represent an ideal scenario in the sense that deformations were induced by applying 3D DVFs generated via image registration between the phases of a 4D-CT. In other words, intra-phase motion was not captured by the testing data.
However, as the planning and treatment day data were derived from 4D-CT scans acquired on different days, the testing data did incorporate changes in anatomy and respiratory motion as compared with the training data. Moreover, had the DVFs been augmented to capture intra-phase motion, there is no way to guarantee that the resulting deformations are physiologically realistic. A further limitation is that, in contrast with typical treatment scenarios, the desired reference images are known on the day of treatment. However, appropriate reference images can be produced using images available in existing clinical workflows. In particular, a pre-treatment scan of the patient is routinely acquired to determine whether any significant anatomical differences have occurred in the intervening time from the planning day. This scan can be used to deform a reference 3D image from the planning day to account for these anatomical variations and supply a source image on the day of treatment. Another limitation is that DVFs computed using conventional registration methods were used as ground-truth for respiratory motion. There is no way of knowing how every thoracoabdominal structure translates, rotates and deforms precisely given the corresponding image data. However, these conventional registration algorithms are routinely used to account for anatomical differences in the course of radiotherapy treatment. Voxelmap leverages implicit constraints on patient-specific respiratory motion to make this information available and actionable during treatment.
With developments in portal imaging, it may be objected that Voxelmap suffers from the downside of additional radiation dose due to kV imaging. Indeed, as noted in the AAPM Task Group 75 report, additional radiation dose is a major drawback of continuous kV imaging during image-guided radiotherapy (Murphy et al 2007). The associated dose ranges from 1 to 3 mGy per image (Bertholet et al 2019); hence in this study, where 680 projections were acquired during simulated treatments, the kV dose would be on the order of 0.7-2 Gy. Notably, Kilburn et al (2016) demonstrated a median difference of 4 Gy in radiation dose between patients with locally advanced non-small cell lung cancer treated with and without daily kV imaging. Additionally, the 2 year survival increased from 64% to 80%. These differences in delivered dose and clinical outcomes stem from the reductions in PTV size afforded by image-guided radiotherapy. We argue that these are benefits garnered in kV imaging but not MV imaging, where poor soft tissue contrast and a relatively narrow aperture hinder the ability to accurately track anatomical motion. In the present study we focus on demonstrating geometric accuracy for two reasons. First, geometric accuracy is a precursor to dosimetric accuracy (Stroom and Heijmen 2002, van Herk 2004, Herschtal et al 2013). Perhaps the strongest evidence in favour of this relationship comes from the outcomes of the recent MIRAGE trial (Kishan et al 2023), where 2 mm margins enabled by real-time tracking showed 24% acute grade 2+ genitourinary (GU) toxicity while 4 mm margins without real-time tracking resulted in 43% acute grade 2+ GU toxicity. Although this trial was for prostate cancer, and no comparative trial has been performed for lung cancer, we would anticipate a clinical benefit for lung cancer given that tumor motion is both larger and faster than that of the prostate. Second, dosimetric accuracy depends on various factors including segmentation accuracy, delivery technique and motion management strategies. Here we introduce a system for motion estimation that enables a broad range of interventions, each of which will have its own effects on dose and clinical outcomes.
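Spelling out the imaging-dose estimate quoted above:

```latex
680 \ \text{projections} \times (1\text{--}3)\ \mathrm{mGy\ per\ image} \approx 0.7\text{--}2\ \mathrm{Gy}.
```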
Deep neural networks have previously been used to map from 2D x-ray projections to 3D computed tomography images without predicting 3D motion (Shen et al 2019, Lei et al 2020). Of course, once the desired 3D images are produced they can subsequently be used to estimate motion via image registration, but motion is constrained in ways that images are not. A priori, the space of vector fields governing anatomical motion is infinite-dimensional; hence regularization is required to ensure computability and to preserve various topological properties. That is, one's choice of regularization method reflects various intuitions regarding the desired transformations. One way of regularizing these fields, introduced by Christensen et al (1996) and elaborated by Beg et al (2005), is to treat them as geodesic flows on smooth surfaces. In this regime, the distance of flows from the identity mapping induces a metric on the group of diffeomorphisms. These mathematical properties reflect deep insights about motion fields that do not govern image space. Here we leverage this property by encouraging diffeomorphic mappings via scaling and squaring layers. By mapping to this constrained solution space, Voxelmap was able to continuously estimate respiratory-induced motion for the patients considered in this study. This flexibility is essential in the context of interventional and diagnostic procedures where images are acquired at many different angles. In contrast, previous volumetric imaging methods required the training of a new network on a different dataset for each imaging angle. Additionally, in spite of its flexibility, Voxelmap employs a much smaller network than that of Shen et al (2019), for instance. Indeed, for the same image size, our proposed framework required the storage of 50-fold fewer trainable parameters (1 × 10^7 versus 5 × 10^8), thereby drastically decreasing memory requirements. Our lightweight network also performed inference in only 50 milliseconds. Accounting for the latencies of MLC tracking (∼110 milliseconds) and imaging at 10 Hz (∼100 milliseconds) produces an overall latency of 260 milliseconds if Voxelmap is used for target localization. This adaptation time positions our technology well under the 500 millisecond threshold for real-time tracking suggested by AAPM Task Group 264: the safe clinical implementation of MLC tracking in radiotherapy (Keall et al 2021). Moreover, existing volumetric imaging techniques were validated on 'clean' digitally-reconstructed radiographs, while the networks in this paper were trained and tested on scatter- and noise-corrupted images that reflect imaging conditions typically encountered in clinical scenarios.
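The latency budget quoted above is simply the sum of its parts:

```latex
\underbrace{110\ \mathrm{ms}}_{\text{MLC tracking}}
+ \underbrace{100\ \mathrm{ms}}_{\text{imaging at 10 Hz}}
+ \underbrace{50\ \mathrm{ms}}_{\text{inference}}
= 260\ \mathrm{ms} < 500\ \mathrm{ms}\ \text{(TG-264 threshold)}.
```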

Conclusion
In closing, this paper introduces a patient-specific deep learning framework for 3D motion estimation and volumetric imaging during lung cancer radiotherapy. A majority of radiotherapy centers around the world wish to implement motion management during treatment but are hindered by finances, human resources and machine capacity. Voxelmap addresses this lacuna by using neural networks to learn patient-specific breathing patterns from the images available in existing clinical workflows. In this simulation study, we found that 3D thoracoabdominal motion could be estimated from 2D images to sub-millimetre accuracy with millisecond computation times. While the present work demonstrated the use of Voxelmap in x-ray guided radiotherapy, this framework could be leveraged for other imaging modalities and applications. For instance, our proposed method could also be used in MRI-guided radiotherapy, which leverages the enhanced soft-tissue contrast of magnetic resonance imaging for tumor targeting. Here we demonstrate the possibility of respiratory motion estimation for lung cancer patients, but this motion affects additional sites in the thorax and abdomen, such as the liver and pancreas, which could also benefit from this technology. Moreover, the volumetric imaging capabilities of Voxelmap could enable clinical teams to visualize the changing 3D internal anatomy of their patients during interventional and diagnostic procedures for the first time.