Extracting lung contour deformation features with deep learning for internal target motion tracking: a preliminary study

Objective. To propose lung contour deformation features (LCDFs) as a surrogate for estimating thoracic internal target motion, and to report their performance by correlating them with the changing body using a cascade ensemble model (CEM). LCDFs, correlated to the respiration driver, are employed without patient-specific motion data sampling and additional training before treatment. Approach. LCDFs are extracted by matching lung contours via an encoder–decoder deep learning model. The CEM estimates LCDFs from the currently captured body, and then uses the estimated LCDFs to track the internal target motion. The accuracy of the proposed LCDFs and CEM was evaluated using 48 targets' motion data and compared with other published methods. Main results. LCDFs estimated the internal targets with a localization error of 2.6 ± 1.0 mm (average ± standard deviation). The CEM reached a localization error of 4.7 ± 0.9 mm and a real-time performance of 256.9 ± 6.0 ms. With no internal anatomy knowledge, they achieved small accuracy differences (of 0.34∼1.10 mm for LCDFs and of 0.43∼1.75 mm for CEM at the 95% confidence level) relative to a patient-specific lung biomechanical model and deformable image registration models. Significance. The results demonstrate the effectiveness of LCDFs and the CEM in tracking target motion. LCDFs and the CEM are non-invasive, and require no patient-specific training before treatment. They show potential for broad applications.


Introduction
Radiotherapy (RT) aims to deliver high-dose radiation to a tumor while sparing normal tissues. Accurate tumor localization is therefore essential. Involuntary tumor motion caused by respiration, such as that of lesions in the lung and liver, poses challenges to precise delivery (Barnes et al 2001, Depuydt et al 2011, Fast et al 2014, Romaguera et al 2020). Estimating the real-time internal target position benefits RT, since it can guide radiation to follow the tumor by moving the treatment couch or the accelerator in real-time tumor tracking radiotherapy (RTTR) (Buzurovic et al 2011), or it can guide beam on/off in respiratory-gated radiotherapy (RGR) (Underberg et al 2006).
Diverse surrogates are often used to track the internal target during RTTR and RGR, given that a tumor is hard to measure directly. The most straightforward surrogates are fiducial markers (FMs), which are implanted near lesions. By locating FMs through an orthogonal x-ray imager (Hayashi et al 2021), an electromagnetic tracker (Cook and Jasahui 2019, Sarkar et al 2020) or a pre-trained model whose inputs are external markers (Diamant et al 2020), the internal target position can be estimated. This kind of invasive surrogate may raise complications, and its accuracy is affected by FM migration (Kord et al 2021). The diaphragm is an anatomic surrogate, being the predominant driver of respiratory motion.
This work proposes a non-invasive and robust surrogate to estimate the internal target trajectory. The proposed surrogate is the lung contour deformation features (LCDFs) derived by a deep learning method. In clinical practice, the surrogate is proposed to be respectively correlated with the tumor motion and the body change using three separate sub-models. The three sub-models are then connected in a cascade to form a 'body change→body deformation features (BDFs)→LCDFs→tumor position' estimation model. Specifically, this paper contributes an internal surrogate (LCDFs) and a cascade ensemble model (CEM). The CEM consists of three separate machine learning sub-models (as shown in figure 1), connected in a cascade architecture. By inputting the currently captured body and matching it with a reference body, the first sub-model outputs BDFs. The BDFs are then transferred to the second sub-model to estimate LCDFs. Finally, the third sub-model gives an estimated tumor position by correlating with the LCDFs.
The rest of this paper is organized as follows. Section 2 introduces the surrogate derivation approach and the CEM. To evaluate the tumor location accuracy, the LCDFs and CEM were tested on a public database and compared with other approaches; these experiments are detailed in section 2. Section 3 describes the results. In section 4, we further discuss the performances. Section 5 concludes the study.

Extraction of the proposed surrogate
We matched the reference and the moving unilateral lung contours using an encoder–decoder model (as shown in figure 3(a)). After optimization stops, the LCDFs are derived from the last convolution layer in the encoding path.
The model's inputs are the binary images of a unilateral lung: lung voxels are set to 1, and all other voxels to 0. The model was trained using an unsupervised learning method. By applying the deformation vector field (DVF) to the reference lung image, we obtain an estimated lung image. The model is optimized by maximizing the similarity between the estimated lung image and the moving one.
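This unsupervised objective can be sketched in a few lines of numpy. The sketch is illustrative only: `warp_nearest` and `dir_loss` are hypothetical helpers, nearest-neighbour resampling stands in for the model's differentiable warping, the similarity term is the mean square error, and the smoothness term is a finite-difference penalty on the DVF.

```python
import numpy as np

def warp_nearest(image, dvf):
    """Warp a 3D image with a dense DVF (nearest-neighbour sampling).
    dvf[..., k] holds the displacement along axis k, in voxels."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in image.shape], indexing="ij")
    coords = np.stack([zz, yy, xx], axis=-1) + dvf
    coords = np.rint(coords).astype(int)
    for k, s in enumerate(image.shape):              # clamp to the volume
        coords[..., k] = np.clip(coords[..., k], 0, s - 1)
    return image[coords[..., 0], coords[..., 1], coords[..., 2]]

def dir_loss(moving, reference, dvf, alpha=0.01):
    """Similarity (MSE) plus alpha times a finite-difference DVF smoothness term."""
    estimated = warp_nearest(reference, dvf)
    similarity = np.mean((estimated - moving) ** 2)
    smoothness = sum(np.mean(np.diff(dvf, axis=k) ** 2) for k in range(3))
    return similarity + alpha * smoothness
```

With a zero DVF and identical images the loss is zero; a mismatch between the estimated and moving images raises the similarity term, while a non-constant DVF raises the smoothness term.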

Using the cascade ensemble model to estimate the internal target by correlating with the body
The CEM (as shown in figure 4) comprises three separate sub-models, namely the BDFs-Net, the BDFs-LCDFs-Net and the T-Net. They are described in the following subsections.

Architecture of BDFs-Net
The BDFs-Net, as shown in figure 5, is the encoder of a DIR model for the body (DIR-Body). DIR-Body shares the same architecture as DIR-Lung (figure 3). By inputting the binary images of the body, DIR-Body outputs a DVF. According to the DVF, the reference body image is transformed into an estimated one. By minimizing the difference between the estimated body and the moving one, DIR-Body is optimized, and the BDFs are then extracted from the last convolution layer of the encoder.

Architecture of BDFs-LCDFs-Net
The architecture of BDFs-LCDFs-Net is displayed in figure 6. The pre-processing aims to spatially align the BDFs with the LCDFs, because the LCDFs resolution is twice the BDFs resolution (details in section 2.4.1). The convolution layers adopt the dynamic region-aware convolution in three dimensions (DRConv3d), modified from Chen et al's work (Chen et al 2021). DRConv3d applies different convolution kernels to different image patches according to the patches' features. We employ it based on the hypothesis that the mappings between BDFs and LCDFs differ at different anatomical locations. Transposed convolutions are used to resize the feature maps to the size of the LCDFs.
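The region-aware idea can be illustrated with a minimal 2D numpy sketch. Note the simplifications: the real DRConv3d learns the region assignment from the features, whereas here `region_map` is given, and `region_aware_conv2d` is a hypothetical name, not the paper's implementation.

```python
import numpy as np

def region_aware_conv2d(x, region_map, kernels):
    """Convolve each output pixel with the 3x3 kernel of its region.
    x: (H, W); region_map: (H, W) ints in [0, R); kernels: (R, 3, 3)."""
    H, W = x.shape
    xp = np.pad(x, 1)                       # zero padding, 'same' output size
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3]
            out[i, j] = np.sum(patch * kernels[region_map[i, j]])
    return out
```

A standard convolution is the special case of a single region; with several regions, anatomically distinct locations can be filtered differently even when their local inputs look similar.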

Architecture of T-Net
T-Net is designed to estimate the target translation (T) from the reference location to the current one based on the LCDFs. We assume that the tumor motion has a strong correlation with its surrounding lung contour deformation features. Therefore, as shown in figure 7, the inputted LCDFs cubes (with a size of 8 × 16 × 16) are interpolated into the size of the reference tumor's binary image (i.e., 128 × 256 × 256) and then to a resolution of 2 × 1 × 1 mm. In each post-interpolation LCDFs, a 10 × 10 × 10 cube (centered at the reference tumor center) is extracted. These cubes are transferred to a neural network-based model to calculate T.
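The interpolation-and-crop step might be sketched as follows, assuming nearest-neighbour upsampling by integer factors and a tumor away from the volume border; `extract_cube` and its default cube size are illustrative, not the paper's implementation.

```python
import numpy as np

def extract_cube(lcdfs, tumor_mask, cube=(10, 10, 10)):
    """Upsample LCDFs to the image grid (nearest-neighbour), then crop a
    cube centred on the tumor centroid."""
    factors = [t // s for s, t in zip(lcdfs.shape, tumor_mask.shape)]
    up = lcdfs
    for ax, f in enumerate(factors):
        up = np.repeat(up, f, axis=ax)       # nearest-neighbour upsampling
    centroid = np.rint(np.argwhere(tumor_mask).mean(axis=0)).astype(int)
    slices = tuple(slice(c - s // 2, c - s // 2 + s)
                   for c, s in zip(centroid, cube))
    return up[slices]
```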

Data acquisition and pre-processing
Two databases were used in this experiment. One was a private database including 58 ten-phase four-dimensional computed tomography (4D-CT) image sets. The other was the public DIR-Lab database (Castillo et al 2009) including six-phase 4D-CT sets of five patients (Nos. 1, 2, 3, 4 and 5).
The private database was collected from 58 patients with a tumor in the right lung who received stereotactic body radiation therapy (SBRT) in our department. All patients underwent CT scans on a Brilliance CT Big Bore system (Philips Healthcare, Best, the Netherlands). Each 4D-CT set contained 10 sets of three-dimensional (3D) CT, with different 3D CTs corresponding to different respiratory phases. All CT images were reconstructed into a matrix of 512 × 512 with a slice thickness of 2∼5 mm and a pixel spacing of 0.97∼1.18 mm. On each CT scan, the right lung and the body contours were delineated using a commercially available automatic segmentation tool (Shenzhen Yino Intelligent Technology Development Co., Ltd, Shenzhen, China). The tumor, regarded as the internal target, was identified by experienced oncologists. According to these contours, the volumetric binary images of lung, body and tumor were generated and resampled to a slice thickness of 5 mm and a pixel spacing of 1 mm. The binary images of the right lung and tumor were centrally cropped to matrices of 128 × 256 × 256. The body binary images were down-sampled to 128 × 256 × 256 with a sampling rate of 1 × 1/2 × 1/2 due to the graphics processing unit (GPU) memory limitation. The tumor motion ranges were 0∼26.71 mm.
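The central crop and the stride-based down-sampling described above can be sketched in numpy (a minimal illustration; the function names are hypothetical, and the 1 × 1/2 × 1/2 sampling rate is realised as integer strides):

```python
import numpy as np

def central_crop(volume, target_shape):
    """Symmetric central crop of a 3D volume to target_shape."""
    slices = tuple(slice((s - t) // 2, (s - t) // 2 + t)
                   for s, t in zip(volume.shape, target_shape))
    return volume[slices]

def downsample(volume, rate=(1, 2, 2)):
    """Stride-based down-sampling; rate=(1, 2, 2) keeps every voxel
    along z and every second voxel along y and x."""
    return volume[::rate[0], ::rate[1], ::rate[2]]
```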
In the DIR-Lab database, five 4D-CT sets had a size of 256 × 256 in the transverse plane, a slice thickness of 2.5 mm and a pixel spacing of 0.97∼1.16 mm. In each 4D-CT set, the locations of 75 landmarks at six intermediate states (i.e., phases Nos. 0, 10, 20, 30, 40 and 50) in a breathing cycle were provided. Among them, the landmarks with a motion range of >10 mm (i.e., 106 landmarks) were regarded as the internal targets in this experiment. The pre-processing of the DIR-Lab database was the same as that of the private database. The volumetric binary images of the left lung and the corresponding body were flipped along the x axis to be in line with the DIR-Lung in figure 3.
Note that the threshold of 10 mm is determined according to the recommendations for considering active motion-management techniques (Keall et al 2006, Matsuo et al 2013, Gelover et al 2019).

Data splitting
The collected data were split into various categories to train the different models. DIR-Lung (figure 3): the training data comprised 522 (58 × 9) pairs of volumetric binary images of a unilateral lung from the private database. The reference lung was the image at the 0th phase; the moving lungs were the images at the 10th∼90th phases.
The data splitting strategy for DIR-Body (figure 5) was the same as for DIR-Lung, but its training data were the volumetric binary images of the body.
There were 552 training data for BDFs-LCDFs-Net (figure 6) from the 58 patients in the private database and 3 patients (Nos. 2, 3 and 5) in the DIR-Lab database. Specifically, these data were randomly split into 394 for training and 158 for validation. The model input (BDFs) was generated as shown in figure 5. For T-Net (figure 7), there were 530 pairs of samples (target motion and its corresponding LCDFs cubes) from the DIR-Lab database in total. Among them, 240 samples (from 48 targets) composed the test set; the remaining 290 samples composed the training set.

Implementation details
The model was implemented using PyTorch and was trained on two NVIDIA TITAN RTX GPUs. We first trained the DIR-Lung to extract LCDFs, and the DIR-Body to extract BDFs and to generate the BDFs-Net. The T-Net and the BDFs-LCDFs-Net were trained afterward. Finally, we connected these sub-models to form the CEM. All models were updated using Adam (Kingma and Ba 2014). Their training settings were as follows: the DIR-Lung and the DIR-Body were both trained with a learning rate of 10⁻⁵, 300 epochs and a batch size of 1. The optimal models were the ones saved at the last epoch.
The BDFs-LCDFs-Net was trained using a learning rate of 10⁻³, 300 epochs and a batch size of 1. The optimal model was the one with the minimum loss on the validation set.
The T-Net was trained using a learning rate of 10⁻⁵, a batch size of 1 and 300 epochs. The optimal model was the one reaching the minimum average root mean square error (RMSE) on the test set. When optimizing T-Net, its training set was augmented by randomly flipping samples along the left-right and anterior-posterior axes or by randomly rotating samples by 90°, 180° and 270° counterclockwise.
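This augmentation can be sketched with numpy flips and right-angle rotations. The sketch is simplified: `augment` is a hypothetical helper, the rotation may also be the identity, and in practice the target translation vector must be transformed consistently with the cube, which is omitted here.

```python
import numpy as np

def augment(sample, rng):
    """Random flips along the left-right / anterior-posterior axes and a
    random right-angle rotation of an LCDFs cube laid out as (z, y, x)."""
    out = sample
    if rng.random() < 0.5:
        out = np.flip(out, axis=2)   # left-right flip
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)   # anterior-posterior flip
    k = rng.integers(0, 4)           # 0, 90, 180 or 270 degrees
    return np.rot90(out, k=k, axes=(1, 2)).copy()
```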
The loss functions used to train DIR-Lung and DIR-Body are:

$\mathcal{L}_{\mathrm{lung}} = D(m_{\mathrm{lung}}, \hat{m}_{\mathrm{lung}}) + \alpha R(\phi_{\mathrm{lung}})$  (1)

$\mathcal{L}_{\mathrm{body}} = D(m_{\mathrm{body}}, \hat{m}_{\mathrm{body}}) + \alpha R(\phi_{\mathrm{body}})$  (2)

where $m$ is the true volumetric binary image with a size of $N_x$, $N_y$ and $N_z$ along the x, y and z axes, and $\hat{m}$ is the estimated one. The subscripts 'lung' and 'body' refer to the DIR-Lung and the DIR-Body, respectively. $D$ represents the mean square error, and $R$ assesses the smoothness of the DVF $\phi$. $\alpha$ is an adjustment factor to balance $D$ and $R$; $\alpha = 0.01$ as suggested in Balakrishnan et al's work (Balakrishnan et al 2019).
The loss function to update the T-Net parameters is the Euclidean distance between the target's true centroid location $t$ and its estimation $\hat{t}$:

$\mathcal{L}_{T} = \lVert t - \hat{t} \rVert_2 = \sqrt{(t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_z - \hat{t}_z)^2}$  (3)

where $t = (t_x, t_y, t_z)$, with $t_x$, $t_y$ and $t_z$ its coordinates along the three orthogonal axes; correspondingly, $\hat{t} = (\hat{t}_x, \hat{t}_y, \hat{t}_z)$.
The BDFs-LCDFs-Net adopted as its loss the RMSE, within a reference unilateral lung mask, between the true LCDFs and their estimated ones.

Validation experiment for the proposed surrogate
To validate the proposed surrogate (i.e., the LCDFs), we inputted the testing unilateral lung binary images and their reference ones into the LCDFs encoder (figure 3(b)) to generate LCDFs. Then the LCDFs and the reference tumor binary image were transferred to the T-Net to output an estimated tumor location. To evaluate the target location accuracy, the estimates were compared with the ground truth using the 3D RMSE:

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert t_i - \hat{t}_i \rVert_2^2}$  (4)

where $N$ is the number of sampling points in a breathing cycle, and $t$ and $\hat{t}$ are the same as in equation (3).
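The 3D RMSE of equation (4) is a few lines of numpy (an illustrative helper, not the paper's code):

```python
import numpy as np

def rmse_3d(true_locations, estimated_locations):
    """3D RMSE over the N sampled points of a breathing cycle."""
    t = np.asarray(true_locations, dtype=float)        # shape (N, 3)
    t_hat = np.asarray(estimated_locations, dtype=float)
    return float(np.sqrt(np.mean(np.sum((t - t_hat) ** 2, axis=1))))
```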

Validation experiment for the proposed cascade ensemble model
In the test set, the reference target and body binary images, together with the body binary images at the testing respiratory phases, were inputted into the CEM to estimate the target locations. The target location accuracy was assessed using the 3D RMSE defined in equation (4). The real-time performance was evaluated using the cost time (t) per estimation.

Comparison with other methods
We compared the proposed LCDFs and CEM with other methods that had been applied to the DIR-Lab database, including:
• pBioMec: a patient-specific lung biomechanical model;
• the DIR methods: published deformable image registration models.
To compare with pBioMec, the T-Net was re-trained 9 on 1125 samples from patients Nos. 2, 3 and 5. Then we tested LCDFs and CEM 10 on patients Nos. 1 and 4, since pBioMec was tested on these two patients.
To compare with the DIR methods, we re-trained 9 the T-Net, and tested LCDFs and CEM 10 on patients Nos. 1∼5 using a leave-one-out scheme.

Ablation study
LCDFs are the key to tracking target motion for diverse patients without sampling motion data or training prior to treatment. For an ablation study, we constructed a model (abbr. ab-model) directly relating the BDFs to the target motion. The detailed configuration can be found in figure A1 in appendix B. It was trained using the same data and the same strategy, and was tested on the same database as the proposed one.

Validation results for the proposed surrogate and the cascade ensemble model
The results are listed in table 1. LCDFs estimated the internal targets with an RMSE of 2.6 ± 1.0 mm. The CEM reached an RMSE of 4.7 ± 0.9 mm and a t of 256.9 ± 6.0 ms. Eighty-five percent (41/48) of targets achieved RMSEs of ≤3.5 mm when using LCDFs, while seventy-nine percent (38/48) of targets achieved RMSEs of ≤5.5 mm when using the CEM.
In table 1, seven cases (denoted by a) had RMSEs of >3.5 mm for LCDFs; we attributed these non-ideal performances to the LCDFs. Four cases (denoted by b) had RMSEs of >5.5 mm for the CEM, but RMSEs of 2.2∼3.1 mm for LCDFs; we inferred that these relatively larger RMSEs came from the BDFs-LCDFs-Net. Further investigation is presented in the Discussion section.

Comparison results with other methods
The comparison results are listed in tables 2 and 3. In table 2, pBioMec performs comparably to LCDFs, and slightly better than the CEM 11. The landmark error difference between pBioMec and the CEM ranges from 0.51 to 1.75 mm at the 95% confidence level 12.
7 The 75-landmark DIR-Lab dataset consists of 75 annotated landmarks for patients Nos. 1∼5 at 6 intermediate states in a half breathing cycle.
8 The 300-landmark DIR-Lab dataset consists of 300 annotated landmarks for patients Nos. 1∼5 at two extreme respiratory states.
9 The test data encompassed motion ranges of 0∼10 mm and >10 mm, while the T-Net (trained as in section 2.4.3) was only trained on targets with a motion of >10 mm. For a fair comparison, the T-Net needs re-training.
10 The CEM = BDFs-Net + BDFs-LCDFs-Net (the two sub-models trained in section 2.4.3) + the re-trained T-Net.
11 There is no significant difference between LCDFs and pBioMec (paired t-test P-value of 0.51). The CEM differs from pBioMec (P-value of 0.006).
12 The 95% confidence interval of the difference between the CEM and pBioMec is [0.51, 1.75] mm. It is calculated as the range $[\mu - 1.96\sigma/\sqrt{n},\ \mu + 1.96\sigma/\sqrt{n}]$, in which $\mu$ and $\sigma$ are the average and standard deviation of the normal distribution fitted to error_CEM − error_pBioMec, the error difference obtained by subtracting error_pBioMec from error_CEM. The subscripts 'CEM' and 'pBioMec' indicate that the landmark errors relate to the CEM and pBioMec, respectively. $n$ is the number of samples.
Notes for table 1: a These 7 RMSEs are not as good as the others; they are inferred to be caused by LCDFs failures (further investigation is given in the Discussion section). b These 4 RMSEs are not as good as the others; they are inferred to be caused by BDFs-LCDFs-Net failures (further investigation is given in the Discussion section). c RMSEs listed in the 'LCDFs' column refer to the T-Net results; RMSEs listed in the 'CEM' column refer to the cascade model results. d Abbreviations: RMSE, root-mean-square error; L/R, the target is in the left (L) or the right (R) lung; r, motion range; SD, standard deviation; LCDFs, lung contour deformation features; t, the cost time per estimated location using the CEM.
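The interval in footnote 12 can be computed directly. This is a sketch: the paper fits a normal distribution to the paired differences, which here is approximated by the sample mean and the sample standard deviation (the ddof=1 choice is an assumption).

```python
import numpy as np

def ci95_mean_difference(err_a, err_b):
    """95% CI of the mean paired difference err_a - err_b, via the
    normal approximation mu +/- 1.96 * sigma / sqrt(n)."""
    d = np.asarray(err_a, dtype=float) - np.asarray(err_b, dtype=float)
    mu, sigma, n = d.mean(), d.std(ddof=1), d.size
    half = 1.96 * sigma / np.sqrt(n)
    return mu - half, mu + half
```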
In table 3, compared with the DIR methods, LCDFs and the CEM showed differences of 0.18∼1.10 mm and 0.43∼1.60 mm at the 95% confidence level, respectively, although our surrogate and model were provided with less knowledge. The estimation of the landmark displacement using LCDFs is based only on the lung contour, and the CEM was only inputted with the body contour, whereas the DIR methods exploited the 3D CT images at the testing respiratory states, which encompass the displacements of all voxels.

Ablation study
The 3D target location errors of the ab-model, LCDFs and the CEM are plotted in figure 8. LCDFs reached the lowest error over all testing targets, and the CEM performed better than the ab-model in general.

Hypothesis
The surrogate (i.e., the LCDFs), stemming from the respiration driver, has the potential to describe various breathing-induced tumor motion states. To match this surrogate hypothesis, we used the binary masks in DIR-Lung, since the respiration driver refers to the lung contour deformation in our work. Body and lung are two sub-systems of the human respiratory dynamic system. Their deformations are correlated, and hence the LCDFs can be estimated from the body variation via the BDFs-Net and the BDFs-LCDFs-Net.

Limitations and future work
In figure 4, the moving body, one of the CEM inputs, is a binary volumetric image. In clinical practice, it is hard to capture such a whole chest surface, since a person's back is blocked from the optical imaging equipment by the treatment table. As an alternative (figure 9), we can use a binocular camera to capture the anterior chest surface, which moves with breathing, and then combine it with the posterior chest surface from the planning CT or pre-treatment 4D CT to form a whole body. The volumetric binary images of the reference body and tumor can be obtained from the planning CT or 4D CT.
In our future work, we will conduct a study on the technique of capturing the real-time moving body via a binocular camera and its effect on the model accuracy.To meet the recommended RTTR system latency of 500 ms (Keall et al 2006), we will adopt the related techniques used in the commercially available surface-guided radiotherapy to reach a cost time of 41.7 ms (Al-Hallaq et al 2022) when capturing the real-time moving body.
In this work, we focused on feasible features that can describe various respiration-induced tumor motions, and on their preliminary exploration for internal target motion tracking.
In the future, an expanded validation and a generalizability test on larger, more diverse, multi-institution patient data are necessary. A robustness analysis of the CEM (a cascaded structure) is essential to examine its brittleness. To evaluate the clinical effectiveness brought by this work, we will also construct a treatment duration model.

Detailed investigation on the large location errors
To further investigate the cases with relatively large RMSEs (denoted by a, b in table 1), we plot their location errors versus the intermediate respiratory states in figure 10. In this figure, the performance deteriorates at several intermediate states, rather than across the whole breathing cycle.
We infer that the potential reasons are: (a) the uncertainty of automatic segmentation of the bronchus. LCDFs are generated by DIR-Lung, and such a convolutional neural network (CNN) is sensitive to the texture of

Example of centroid tracking and exploration on shape estimation
Our work aims to estimate the motion of one point (i.e., the tumor centroid). Figure 11 shows such an estimation result (blue dot) by the CEM for a case with a tumor volume of 5.91 cm³ and a 3D tumor motion of 10 mm. The 3D location error is 4.76 mm.
To explore our model for tumor shape estimation, we modified it as follows: we derived all tumor contour points from the tumor's binary mask at the 0th phase (T00). Each contour point was treated as a centroid and inputted into the CEM to generate its estimated point at the 60th phase (T60). All estimated points composed the estimated tumor edge at T60, shown as the blue line in figure 11. The 3D dice similarity coefficient (DSC) between the true and estimated tumor shapes is 65%, which may be caused by the lack of lung internal anatomy in the LCDFs.
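For reference, the volumetric DSC used above can be computed as follows (an illustrative helper, not the paper's code):

```python
import numpy as np

def dice_3d(mask_a, mask_b):
    """3D Dice similarity coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(a, b).sum() / denom)
```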
In the future, to expand this work to estimate internal deformation, the features/representations describing its variations will be generated by matching anatomies rather than just binary masks.

Conclusion
The proposed internal surrogate (LCDFs) and cascade ensemble model are effective in tracking a moving target, and require no patient-specific motion data sampling or training before RT. They show promise for clinical practice.
As an anatomic surrogate, the diaphragm can be employed in a breathing phase judgement (Edmunds et al 2019) to estimate tumor location, or can be directly correlated with the target trajectory using correspondence models (Dick et al 2018a, 2018b, Mueller et al 2022). However, the anatomic surrogate must be tracked via continuous kilovoltage (kV) image acquisition (Hindley et al 2019), and hence it exposes patients to additional irradiation. External surrogates, such as the respiratory air flow (Ladjal et al 2021) and thorax movement (Özbek et al 2020, Ladjal et al 2021), have also been reported. They are correlated with tumor motion using a biomechanical model (Jafari et al 2021, Ladjal et al 2021), a statistical model (Li et al 2017, Rostampour et al 2018, Terunuma et al 2018, Huang et al 2022) or a hybrid model combining biomechanics and deformable image registration (Al-Mayah et al 2011, Zhang et al 2019). These approaches need additional modeling work before treatment, and some require extensive computation (Ladjal et al 2021), presenting challenges to real-time performance.

(a) We propose an internal surrogate (i.e., the deformation features derived by registering moving and reference lung contours using a deep learning model) to correlate with the internal target. The proposed surrogate (LCDFs) is non-invasive and applicable to various patients without sampling patient-specific motion data or additional training before RT.
(b) To apply the above surrogate in RTTR practice, we propose a cascade ensemble model (CEM) to track tumor motion in real time by capturing the body change.

Figure 2 shows a schematic representation of our work, which has different configurations in the training and testing phases. The training framework is composed of the following blocks: (a) a deformable image registration (DIR) model for encoding the LCDFs, (b) a DIR model for encoding the BDFs, (c) a BDFs-LCDFs-Net model to estimate the LCDFs with respect to the BDFs, and (d) a T-Net model for tumor motion estimation. During testing, the LCDFs decoder and the BDFs decoder are removed. We tested the LCDFs by connecting the LCDFs encoder and the T-Net, and tested the CEM by connecting the BDFs-Net, the BDFs-LCDFs-Net and the T-Net. In the following subsections, we detail the DIR model for encoding LCDFs, the DIR model for encoding BDFs, the BDFs-LCDFs-Net and the T-Net. The codes are available at GitHub (Zhang 2023).

Figure 2 .
Figure 2. Schematic representation of the proposed surrogate (LCDFs) and model (CEM). Left: during training, there are four sub-models. (1) LCDFs encoder: part of a deformable image registration (DIR) model. Its input is a pair of binary volumes of a unilateral lung from V_i and V_ref. (2) BDFs-Net: part of a DIR model. Its input is a pair of binary volumes of the body from V_i and V_ref. (3) BDFs-LCDFs-Net: it estimates the LCDFs by inputting the BDFs. (4) T-Net: it outputs T. Its inputs are the LCDFs and the binary volume of the target from V_ref. Right: there are 2 tests. (1) Test on LCDFs: the input is a pair of binary volumes of a unilateral lung and the binary volume of the target from V_ref. (2) Test on CEM: the input is a pair of binary volumes of the body and the binary volume of the target from V_ref. Abbreviations: DIR-Lung, DIR model for unilateral lung contour; DIR-Body, DIR model for body.

Figure 3 .
Figure 3. (a) Architecture of DIR-Lung and (b) illustration of encoding LCDFs. Legend: Conv3D(kernel size, stride, padding) and LeakyReLU(negative slope). The number on the top of each rectangle is the number of filters. Abbreviations: DIR-Lung, deformable image registration model for unilateral lung contour; DVF, deformation vector field; Conv3D, convolution in three dimensions; CT, computed tomography. The left lung can be flipped along the left-right axis to be in line with figure 3(b).

Figure 4 .
Figure 4. Illustration of the cascade ensemble model (CEM) to estimate the internal target by correlating with the body. Abbreviations: BDFs, body deformation features; LCDFs, lung contour deformation features; T, translation vector of the current target center relative to its reference one.

Figure 5 .
Figure 5. (a) Architecture of DIR-Body and (b) illustration of encoding BDFs via the BDFs-Net. Abbreviation: DIR-Body, deformable image registration model for body. Other notation and abbreviations are the same as those in figure 3.

Figure 6 .
Figure 6. Architecture of BDFs-LCDFs-Net. Abbreviations: DRConv3d, dynamic region-aware convolution in three dimensions; TransConv3d, transposed convolution in three dimensions; BDFs, body deformation features; LCDFs, lung contour deformation features. Legend: SpatialAttention(kernel size), LeakyReLU(negative slope), DRConv3d((kernel size), region num), TransConv3d((kernel size), stride, padding). The number on the top of each rectangle is the number of filters. In pre-processing, the size in the parentheses is the output size along the z, y and x axes, respectively.

Figure 7 .
Figure 7. Architecture of T-Net. The T outputted by T-Net is the translation of the current target centroid position relative to its reference one. Abbreviations: LCDFs, lung contour deformation features; Conv3D, convolution in three dimensions. Legend: Conv3D(kernel size, stride, padding), LeakyReLU(negative slope) and linear(number of outputted features). The number at the bottom of each rectangle is the number of filters.
input images. Therefore, the uncertain segmentation of the bronchus may have an impact on the generated LCDFs; (b) the target motion caused by the heart beat; (c) the heterogeneous mappings from BDFs to LCDFs. The DRConv3d in BDFs-LCDFs-Net employs different BDFs patches to represent such heterogeneity. However, diverse anatomical positions may correspond to a similar BDFs patch but distinct mappings, and hence DRConv3d cannot approximate these complex mappings. In such a scenario, BDFs-LCDFs-Net estimates wrong LCDFs; (d) the irrelevant in vitro air is regarded as a part of the BDFs in BDFs-LCDFs-Net. A conventional CNN cannot differentiate the in vitro air from the body, so the values in those air voxels, acting like noise, are involved in the BDFs-LCDFs-Net; (e) the lack of lung internal anatomy in DIR-Lung may lead to a non-smooth or locally wrong DVF, and hence results in discontinuity and incompleteness of the LCDFs distribution. In our future work, we will (a) use a graph neural network (GNN) to extract BDFs to exclude the influence of in vitro air, (b) construct different BDFs-LCDFs mapping sub-models according to their categories, and (c) involve the lung's internal structure in the registration task to improve the continuity and completeness of the encoded LCDFs.

Figure 9 .
Figure 9. Workflow of capturing the real-time body binary mask.

Figure 10 .
Figure 10. Error versus time for (a) targets with RMSEs of >3.5 mm when using LCDFs to track the target and (b) targets with RMSEs of >5.5 mm when using the CEM to locate the target. The black dotted lines denote the preferable 3.5 mm and 5.5 mm thresholds. Abbreviations: LCDFs, lung contour deformation features; CEM, the proposed cascade ensemble model; Px_Ty, the yth target of the xth patient; T10∼T50, 5 intermediate states in a half breathing cycle.

Figure A1 .
Figure A1. Structure of the model for the ablation study (abbr. ab-model). Abbreviations are the same as those in figure 4.

Table 1 .
RMSE of locating target using the internal surrogate (LCDFs) and the cascade ensemble model (CEM) a,b,c,d .

Table 2 .
Landmark error (mm) comparison over 75 landmarks at 5 intermediate respiratory states a,b. b RMSEs listed in the 'LCDFs' row refer to the T-Net results; RMSEs listed in the 'CEM' row refer to the cascade model results.

Table 3 .
Landmark error (mm) comparison over 300 landmarks for each patient at two extreme respiratory states. Results are expressed as mean ± standard deviation a,b.
a Abbreviations: LCDFs, lung contour deformation features; CEM, the proposed cascade ensemble model; DIR, deformable image registration; -, no result was reported; 95%CI (diff. versus LCDFs), the 95% confidence interval of the average difference between a specific DIR method and LCDFs; 95%CI (diff. versus CEM), the 95% confidence interval of the average difference between a specific DIR method and CEM; —, not applicable. b RMSEs listed in the 'LCDFs' column refer to the T-Net results; RMSEs listed in the 'CEM' column refer to the cascade model results. c The 95% CI (diff. versus LCDFs) or 95% CI (diff. versus CEM) is calculated as the range $[\mu - 1.96\sigma/\sqrt{n},\ \mu + 1.96\sigma/\sqrt{n}]$, in which $\mu$ and $\sigma$ are the average and standard deviation of the normal distribution fitted to error_LCDFs − error_DIR or error_CEM − error_DIR, the error difference obtained by subtracting the error of a certain DIR method from that of LCDFs or CEM, respectively. $n$ is the number of samples.

Table C1 .
Quantity of the training and test sets a. In machine learning, the 'training' samples are used to update the model weights; the 'validation' samples are used to assess the model performance and fine-tune it during the training phase.
a Abbreviations: , not applicable.Other abbreviations are same as those in figure B1. b