Motion estimation and correction in SPECT, PET and CT

Patient motion impacts single photon emission computed tomography (SPECT), positron emission tomography (PET) and x-ray computed tomography (CT) by giving rise to projection data inconsistencies that can manifest as reconstruction artifacts, thereby degrading image quality and compromising accurate image interpretation and quantification. Methods to estimate and correct for patient motion in SPECT, PET and CT have attracted considerable research effort over several decades. The aims of this effort have been two-fold: to estimate relevant motion fields characterizing the various forms of voluntary and involuntary motion; and to apply these motion fields within a modified reconstruction framework to obtain motion-corrected images. The aims of this review are to outline the motion problem in medical imaging and to critically review published methods for estimating and correcting for the relevant motion fields in clinical and preclinical SPECT, PET and CT. Despite many similarities in how motion is handled between these modalities, utility and applications vary based on differences in temporal and spatial resolution. Technical feasibility has been demonstrated in each modality for both rigid and non-rigid motion but clinical feasibility remains an important target. There is considerable scope for further developments in motion estimation and correction, and particularly in data-driven methods that will aid clinical utility. State-of-the-art deep learning methods may have a unique role to play in this context.


Aims and scope
The aims of this review are: i. To outline the motion problem in medical imaging, including the sources of motion and the type and dependencies of motion-induced artifacts (section 2).
ii. To review published methods for estimating motion fields (section 3) and correcting for these motion fields (section 4) in clinical and preclinical SPECT, PET and CT.
iii. To establish some general conclusions regarding the suitability of particular motion estimation and correction methods for the variety of applications and to suggest some opportunities for future development.
We focus on diagnostic applications of SPECT, PET and CT but also mention some examples of motion estimation and correction in surgical and therapeutic applications. Although we do not explicitly address motion estimation and correction for standalone magnetic resonance imaging (MRI), methods developed for hybrid PET/CT and PET/MR are covered and some modality-naive approaches appearing in the MRI literature are touched on. It is also worth noting that many of the concepts described for SPECT, PET and CT are directly applicable to MRI.
Diagnostic CT involves very fast gantry rotation (typically well under 1 s per revolution), so the likelihood of significant motion during a single frame is reduced. In contrast, cone-beam CT (CBCT), which is common in treatment planning and diagnosis in dentistry, orthopaedics, ENT and radiation therapy, involves slower gantry rotation (20-60 s/revolution) and, therefore, like SPECT and PET, is more susceptible to motion artifacts due to poorer temporal resolution. Although the temporal resolution of PET is poorer than for CT, the full-ring fixed-detector configuration of PET means that data from all projection angles are acquired simultaneously, thus providing an instantaneous 'snap-shot' of motion with complete angular sampling. This is approximated in the sequential CT acquisition by virtue of the very fast gantry rotation. However, for SPECT, the slow and sequential acquisition means that there is no motion 'snap-shot', and one cannot even assume that individual projections are free of motion (Kyme et al 2003). In CT, artifacts are usually worse for motion of objects with high or low relative attenuation. In SPECT and PET, the choice of radiotracer can influence the nature of motion artifacts due to varying biodistributions and kinetics. The number of detector heads has also been shown to change the nature of artifacts in SPECT, e.g. Cullom et al (1995), Matsumoto et al (2001). In summary, although the literature indicates broad agreement on the dependencies of motion artifacts, differences exist regarding the precise nature of these dependencies.
In general, the nature, size and extent of artifacts is difficult to predict because of the complex interplay of variables.

Implications of motion
Although the literature does not offer a straightforward answer regarding when motion artifacts will occur or how they will manifest, it does offer substantial evidence for the clinical significance of such artifacts. Here we list several representative examples.
Head motion impacts visualisation and quantification in all modalities, distorting anatomy, corrupting uptake patterns and masking changes in radioligand binding (Dinelle et al 2006, Montgomery et al 2006). Head motion is particularly common among paediatric patients, who often require sedation to prevent motion (Kaste 2004, Wachtel et al 2009). Of the 70 million CT scans performed annually in the United States, about 10% are performed in children (Brenner 2010), and in developing countries about 75% of paediatric CT scans are of the head (Vassileva 2012). The impact of head motion may be exacerbated in CT due to the higher spatial resolution compared to SPECT and PET (Kochunov et al 2006). For example, the value of CT brain perfusion imaging in stroke patients depends on accurate haemodynamic modeling, which can be compromised by the moderate to severe head motion reported in 25% of patients (Fahmi et al 2013).
In cardiac studies, approximately 40% of myocardial perfusion SPECT scans may be affected by motion (O'Connor et al 1998, Currie and Wheat 2004), with one-third of these leading to incorrect diagnoses (Botvinick et al 1993, Prigent et al 1993, Wheat and Currie 2004b). Up to 66% of dynamic 82Rb cardiac PET studies may be affected by significant motion, leading to myocardial blood flow estimates that are in error by up to 500% (Hunter et al 2016, Armstrong et al 2019). Misclassification of coronary lesions has been reported in 10% of cases (Lassen et al 2019).
In tumour assessment studies, patient movement can change PET quantification by up to 35% for a 5 mm tumour and 10% for a 10 mm tumour (McCall et al 2010). It is also well known that the apparent enlargement and shape changes of lesions caused by respiratory motion blur can lead to overestimated dose margins in CT-based radiation therapy planning (Nehmeh et al 2002, Geramifar et al 2013).
When CT-based attenuation correction is performed in SPECT and PET, a potential shift or mismatch between the CT scan and the emission scan is a common problem which induces artifacts that impact quantification. The mismatch arises in part because the scans cannot be performed simultaneously, but also due to the different timescales of imaging; the faster CT scan typically produces a 'snap-shot' of motion whereas SPECT and PET scans contain the effects of motion averaged over a much longer period. This problem occurs in both head and thoracic/abdominal SPECT/CT and PET/CT, and regardless of whether or not breath-hold techniques are used for the CT acquisition (Osman et al 2003, Geramifar et al 2013).
In preclinical imaging, the inevitability of motion in awake, unrestrained animals necessitates anaesthesia. However, the potential for anaesthesia to impact the biochemical (e.g. receptor binding) and physiological (e.g. neuro-haemodynamic coupling) processes being studied in the brain using SPECT and PET limits the translational value of the animal model (Momosaki et al 2004, Martin et al 2006, Cherry 2011).

Motion estimation in SPECT, PET and CT
A precondition for all motion correction methods is the estimation of a relevant motion field describing the physical displacement of points comprising the object of interest. In the simplest case, each point undergoes an identical displacement, thereby characterising a rigid translation of the object. In general, each point may undergo a unique displacement resulting in more complex motions such as rigid-body transformations involving rotation of the object, or non-rigid deformation (shape change) of the object. In the literature, motion fields characterising non-rigid motion of an object are sometimes referred to as deformation vector fields (DVFs) or simply as motion vector fields. In this review, we will use the term motion field as a general way of referring to any type of motion, but will preface the term with 'rigid' or 'non-rigid' to distinguish the two cases where this is relevant. Rigid motion fields can be represented compactly via a transformation matrix that operates on all points in the object, thus mapping it to a new location while preserving the relative locations of all points. Non-rigid motion fields, on the other hand, do not in general preserve the relative locations of component points in the object.
A motion field characterises an object's conformation or 'pose' at an instant in time. However, by sampling the motion field repeatedly, one can 'track' the motion of the object over time (e.g. throughout an imaging session). This gives rise to terms such as '4D motion field', which can be thought of as a changing 3D motion field or a time sequence of poses. Estimating the motion field is often an independent step, however it can also be conflated with the correction step in joint estimation/correction methods.
Motion fields can be broadly classified according to whether the motion is rigid or non-rigid. Rigid motion typically pertains to head, brain and dental studies and is characterized by translations and rotations. Non-rigid motion fields mostly pertain to thoracic and abdominal studies in which periodic cardiac and respiratory motion and its effects on the surrounding tissues are most relevant. Non-rigid motion encompasses affine transformations and higher-order DVFs.
Applications in which non-rigid motion fields are particularly relevant include respiratory-gated SPECT and PET, cardiac CT, coronary CT angiography (CCTA), and orthopaedic imaging. Respiratory motion fields are required in SPECT and PET to generate motion-corrected images of the thorax and abdomen and to make accurate quantitative measurements in liver and lung lesions that move periodically with the diaphragm. Cardiac CT is used to derive measures of mechanical function and performance of the heart and therefore relies on accurate estimation of the organ's non-rigid deformation. In CCTA, the non-rigid trajectories of coronary arteries, stents and bypass grafts are tracked during sinus rhythm, requiring high spatial and temporal resolution to resolve the small diameter and fast movement of these structures across temporal frames. Similar approaches are applied to lesions in the liver and lung which undergo periodic displacement due to cardio-pulmonary motion. In orthopaedic imaging, joints and the surrounding tissue may exhibit non-rigid motion under load.
In the remainder of section 3, we begin by defining several important specifications related to motion estimation and then survey various strategies for estimating rigid and non-rigid motion fields in SPECT, PET and CT.

Degrees-of-freedom (DoF)
The number of DoF relates to the complexity of the motion to be modelled. The simplest case is one-dimensional (1D) translational motion, as might be used to model periodic shifts of a lesion due to diaphragmatic motion. In tomography, in-plane motion has 3 DoF, comprising translation and rotation within the plane perpendicular to the scanner bore axis (the trans-axial plane). Complete rigid-body motion is described by 6 DoF (3 rotations and 3 translations). A general affine transformation adds scaling and shear along each orthogonal axis to the rigid-body parameters, giving a total of 12 DoF. Highly non-rigid and deformable motion models may have tens to hundreds of DoF. The number of DoF will usually determine appropriate representations of the motion (e.g. Euler angles, matrices, quaternions, splines, tensor fields), however this is beyond the scope of this review.
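As an illustration, 6 DoF rigid-body motion can be represented compactly as a 4×4 homogeneous transformation matrix. The sketch below (plain NumPy; the z-y-x Euler convention is just one of several common choices) builds such a matrix and confirms the defining property of rigid motion: inter-point distances are preserved.

```python
import numpy as np

def rigid_transform(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 homogeneous matrix from 6 DoF: three Euler angles (rad)
    and three translations, using a z-y-x rotation order (one common choice)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = (tx, ty, tz)
    return T

# Apply the same transform to two points (homogeneous coordinates):
T = rigid_transform(0.1, -0.05, 0.2, 5.0, -3.0, 1.0)
p = np.array([0.0, 0.0, 0.0, 1.0])
q = np.array([10.0, 0.0, 0.0, 1.0])
d_before = np.linalg.norm(p[:3] - q[:3])
d_after = np.linalg.norm((T @ p)[:3] - (T @ q)[:3])   # unchanged: motion is rigid
```

An affine transformation would be obtained by further multiplying the rotation block by scale and shear matrices (6 extra parameters), and non-rigid motion fields cannot be written as a single such matrix at all.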

Accuracy and precision
Accuracy is a notoriously vague and varied concept in the literature when applied to motion estimation in medical imaging. This makes comparison of the reported accuracy of different motion estimation methods difficult. We define accuracy simply as the extent to which motion estimates are in agreement with a ground truth or reference value. Precision relates to the variance of a motion measurement and depends on the particular noise (or jitter) sources associated with the measurement and how these sources combine. In practice, precision can be measured as the variance of the noise of the motion estimates.
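These definitions can be made concrete with a small numerical sketch; the ground-truth value, bias and jitter magnitudes below are invented purely for illustration. Accuracy is the systematic deviation of the estimates from the reference; precision is their spread.

```python
import numpy as np

rng = np.random.default_rng(0)
true_shift = 5.0                 # hypothetical ground-truth translation (mm)

# Simulated tracker output: a 0.2 mm systematic bias plus 0.1 mm random jitter
estimates = true_shift + 0.2 + rng.normal(0.0, 0.1, size=1000)

bias = np.mean(estimates) - true_shift    # agreement with ground truth (accuracy)
jitter = np.std(estimates, ddof=1)        # spread of the estimates (precision)
```

A method can thus be precise but inaccurate (small jitter, large bias) or accurate but imprecise, which is why the two quantities should be reported separately.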

Sampling rate
Sampling rate refers to how frequently one obtains the raw data required to update the motion field. An example is the frame rate of an optical system used for motion tracking (section 3.2.3). Sampling rate differs from processing rate, which refers to how frequently the motion field is updated. Processing rate is a critical consideration for real-time motion estimation/correction but is less critical when offline processing is possible. Both sampling rate and processing rate requirements will depend on the rate of motion and the temporal and spatial resolution capability of the imaging modality.

Latency
Latency refers to the delay associated with new motion estimates being available for downstream processing. The rate of availability of newly processed motion estimates is always less than or equal to the sampling rate. 'System' latency is the inverse of the processing rate and is equal to the sum of the measurement latency (the inverse of the sampling rate) and latencies due to data transmission and processing. System latency is relevant when considering the practicality of online motion estimation and correction.
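The relationship between these quantities can be illustrated with a hypothetical latency budget; all numbers below are assumptions for illustration, not measurements from a specific system.

```python
# Hypothetical latency budget for an optical tracking pipeline.
sampling_rate_hz = 60.0                         # camera frame rate
measurement_latency_s = 1.0 / sampling_rate_hz  # time to acquire one sample
transmission_latency_s = 0.002                  # assumed data-transfer delay
processing_latency_s = 0.010                    # assumed pose-computation time

# System latency is the sum of measurement, transmission and processing
# latencies, and the processing rate is its inverse.
system_latency_s = (measurement_latency_s
                    + transmission_latency_s
                    + processing_latency_s)
processing_rate_hz = 1.0 / system_latency_s     # always <= sampling rate
```

With these illustrative values the system could deliver pose updates at roughly 35 Hz despite a 60 Hz camera, which is the kind of gap that matters for online correction but not for offline processing.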
Impact
Impact relates to how a subject is affected by a particular motion estimation method. The impact of external motion tracking methods varies depending upon the physical principles and apparatus involved. In principle, data-driven motion estimation methods have negligible impact on the subject except if additional scans are required that would add extra time or dose.

Constraints
Constraints can be physical or computational and include assumptions or requirements related to the object, lighting, environment ('scene'), working distance, line-of-sight or materials necessary for effective operation of a motion estimation method. Constraints also include assumptions, requirements or limitations related to the scanner (e.g. spatial and temporal resolution) and, in the case of data-driven motion estimation methods, the acquired data (e.g. noise).

Hardware/data requirements
Hardware and data requirements relate to the complexity and relative cost of resources required for motion estimation, and the amount of data generated. These factors have implications for the general practicality of a method.

Scalability
Scalability refers to how easily a motion estimation method can be adapted across object sizes and working volumes. This is important, for example, when considering the utility of motion estimation methods for small animals.

External motion tracking
An external motion tracking technology is a stand-alone apparatus and associated algorithms enabling the direct or indirect measurement of an object's pose over time. Below we survey external tracking technologies that have been used to estimate motion fields in SPECT, PET and CT.

Mechanical
Mechanical tracking systems involve an articulated mechanical arm and estimate the rigid-body pose of the terminus (end-effector) based on the individual joint angles. The range of detectable motion is limited by the kinematics of the joints and any constraints imposed by the workspace. Accuracy is highly dependent on the mechanical stability of the arm and the robustness of the attachment between the mechanical arm and body, an aspect which is challenging in humans and likely to be even more challenging in small animals. Other challenges include the risk of collision between the arm and gantry inside space-constrained, especially narrow bore, scanners (Zhou et al 2013), and the time-varying photon attenuation caused by the articulating arm, which is non-trivial to correct for (Angelis et al 2014). We know of only one implementation of this method in tomographic imaging, to track head motion in SPECT. The paucity of examples in the literature probably reflects the impracticality of the approach for motion-corrected imaging applications.

Magnetic
Magnetic tracking systems derive the position and orientation of a sensing coil within a magnetic field based on voltages induced in the coil (Remmell 2006). Using three orthogonal coils allows six DoF pose measurements. Commercial systems are available, mostly using a 'field transmitter' positioned at a fixed distance from the FoV to generate the magnetic field. Although magnetic tracking systems have been used extensively in eye tracking, surgical navigation and motion-adaptive radiotherapy (Birkfellner et al 1998b, Balter et al 2005), there has been relatively little use in diagnostic imaging. This includes magnetic systems to track head movement in PET (Daube-Witherspoon et al 1990, Green et al 1994, Mawlawi et al 1999) and body movement in cardiac SPECT (Sun et al 2001).
The main advantage of magnetic tracking systems compared to other external tracking systems is non-line-of-sight operation. In principle, this makes the approach immune to occlusion and allows tracking of implanted coils. Another advantage is scalability: orthogonal arrays of receiver coils for six DoF measurements can be <1 mm in width. The main drawback of magnetic systems is poor accuracy. The rapid fall-off in dipole field strength with distance results in a measurement error, Δr, that scales as (Nixon et al 1998):

Δr ∝ d_tr^4,

where d_tr is the transmitter-receiver distance. Metallic objects can also distort the transmitted field and corrupt measurements due to eddy currents induced in the object. The measurement error associated with this effect scales as (Nixon et al 1998):

Δr ∝ d_tr^4 / (d_to^3 · d_ro^3),

where d_to and d_ro are the distances from the metallic object to the transmitter and receiver, respectively. Thus, the impact of the metallic object rapidly worsens pose measurements as these distances are reduced. Metallic objects which are ferromagnetic produce additional distortion of the magnetic field due to their high magnetic permeability. Because of these dependencies, it is not uncommon for magnetic systems to exhibit errors of 5-10 mm or greater (up to 100 mm) (Birkfellner et al 1998a, Nixon et al 1998, Hummel et al 2002). Field distortion-related errors due to metallic objects can be reduced to some extent using calibration procedures, however this is only effective if the environment remains static during tracking. Magnetic tracking systems can provide sub-millimetre and sub-degree accuracy for stationary sensing coils within about 30 cm of the transmitter (Hummel et al 2002, Schicho et al 2005), however accuracy of a few mm or degrees is more realistic for less controlled environmental factors, larger working distances and dynamic measurements (Frantz et al 2003).
In general, the close proximity of metallic gantry elements and the presence of EM fields associated with high voltage components tend to limit the practicality of magnetic tracking systems in SPECT, PET and CT, especially when high accuracy is a requirement.

Stereo-vision
Stereo-vision tracking systems usually operate in the visible or near-IR wavelength range and rely on the detection of sparse or dense object features in two or more camera views. Features may be active (powered) markers such as light emitting diodes (LEDs) (Barnes et al 2008) or passive areas that reflect light, including native features (Kyme et al 2014, 2018) or artificial features such as reflective tape, spheres and patterns. If the two cameras are optically and spatially calibrated, triangulation can be used to estimate the 3D location of features, matched across views, in a real-world metric frame of reference (figure 2) (Hartley and Zisserman 2004). The changing rigid-body pose of an object can then be estimated given at least three such landmarks tracked over time (Horn 1987).
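The triangulation step can be sketched with the standard linear (DLT) method: each image observation contributes two linear constraints on the homogeneous world point, and the SVD gives the least-squares solution. The camera matrices and world point below are toy values chosen only to demonstrate the geometry.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover the 3D point X from its projections
    x1, x2 (pixel coordinates) in two calibrated cameras with 3x4 matrices P1, P2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                    # null vector of A (homogeneous solution)
    return X[:3] / X[3]           # de-homogenise

# Two toy pinhole cameras separated by a 100 mm baseline along x
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-100.0], [0.0], [0.0]])])

X_true = np.array([20.0, -10.0, 500.0])          # scene point 500 mm away
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
X_est = triangulate(P1, P2, x1, x2)
```

In practice the back-projected rays do not intersect exactly due to noise and calibration error, which is why the least-squares (SVD) formulation, rather than an exact intersection, is used.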
Stereo-vision systems have been by far the most commonly used external tracking approach for head and body motion estimation in SPECT, PET and CT. It is also the main tracking approach to have been applied preclinically, including in monkeys; reported systems include those using circular discs (Hu et al 2004) or no markers (Ma 2009), and a 4-camera optical system with no markers (Kyme et al 2014).
Figure 2. The principle of recovering 3D structure using binocular stereo. If the image of a scene (world) point, X, can be identified in two different camera views (I1 and I2), and the cameras are spatially calibrated, the 3D location of X in a global metric frame can be estimated by back-projecting through the image points, x1 and x2, from each view to determine the point of intersection. This process of back-projecting rays to reconstruct 3D points is termed 'triangulation' and is shown here for the ideal case (no noise) where the lines intersect exactly. Note that the distance O1 to O2 is referred to as the stereo baseline.
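Given at least three landmarks triangulated at two time points, the rigid-body pose change can be recovered in closed form. The sketch below uses the SVD (Kabsch) formulation, one standard solution of the absolute orientation problem solved by Horn (1987); the landmark coordinates and motion are invented for illustration.

```python
import numpy as np

def absolute_orientation(A, B):
    """Least-squares rigid transform (R, t) mapping point set A onto B,
    solved in closed form via SVD (the Kabsch/Horn solution)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                  # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t

# Four tracked landmarks before and after a known (simulated) head movement
A = np.array([[0.0, 0.0, 0.0], [50.0, 0.0, 0.0], [0.0, 30.0, 0.0], [0.0, 0.0, 20.0]])
theta = np.deg2rad(10.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([2.0, -1.0, 3.0])
B = A @ R_true.T + t_true

R_est, t_est = absolute_orientation(A, B)
```

With noisy landmark positions the same code returns the least-squares pose, and using more than the minimum three landmarks improves robustness.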
In addition to tracking sparse landmarks on an object, stereo-vision systems are also suitable for generating dense depth maps of surfaces which can in turn be registered across successive frames to estimate motion. This includes a range of consumer-grade depth cameras (known as RGB-D cameras), such as the Intel SR300 and D31X families, which have been applied in medical imaging (Baur et al 2013, Lindsay et al 2015). The second-generation Kinect system (Microsoft Corp., USA) is another consumer-grade depth camera which has been used to track human head and torso motion in PET (Noonan et al 2015, Hess et al 2016) and respiratory motion in CT (Silverstein and Snyder 2018). The Kinect functions as a time-of-flight camera, providing a pixel-by-pixel depth map of a scene based on the flight time of modulated light emitted by the sensor. This device was adapted for close-range tracking within a clinical PET scanner to provide a potential accuracy of 1 mm and 1 deg for head motion (Noonan et al 2015).
Stereo-vision systems have several desirable features for motion tracking applications. Firstly, narrow baseline systems (<1 m) amenable to SPECT, PET and CT gantries can provide positional accuracy of a few tens of microns at working distances of up to several metres (Barnes et al 2008, Schmidt et al 2009). Secondly, since error scales with working distance, implementing close-range systems is a straightforward way to achieve a performance gain. And, thirdly, the maximum sampling rate is equivalent to the maximum frame rate which, for state-of-the-art CCD and CMOS cameras, readily exceeds 60 Hz at mega-pixel resolution with a global shutter.
The main limitation of stereo-vision systems is line-of-sight operation, leading to tracking drop-out when certain features or regions are occluded (Zhou et al 2013, Zhang et al 2018). The problem is exacerbated by narrow scanner bores (characteristic of dedicated small animal scanners and MRI), large stereo baselines, and non-planar objects. Use of additional cameras, predictive filtering frameworks (Straw et al 2011) and marker-free tracking in which features are not restricted to a particular attachment, e.g. Kyme et al (2014), all aid in mitigating the line-of-sight problem. When cameras are located out-of-bore, mirrors can be used to reduce line-of-sight limitations (Andrews-Shigaki et al 2011); however, in-bore cameras or the use of fibre-optic cables as vision extenders (Slipsager et al 2019) are preferable to improve accuracy and reduce the need for out-of-bore setups with high mechanical stability for tracking objects at long range.
Many stereo-vision systems rely on the attachment of specific markers to the subject. Here, robust attachment that prevents decoupling of marker/patient motion is vital for high motion tracking accuracy. For the head, attachment strategies have included dental molds (Westermann and Hauser 2000) and adhesive bandages. In practice, achieving robust attachment without surgery is very difficult and currently no widely accepted method of non-invasive marker attachment exists. In head tracking, there is some evidence that the impact of skin motion on tracking accuracy may be mitigated in part by using a marker with a large area of attachment to the forehead (Spangler-Bickell et al 2019). The challenge of marker attachment has motivated the development of stereo-vision systems that rely on native object features rather than physically attached markers. In these systems, features include manually assigned locations such as the eye corners and base of the nose.

Mono-vision
Monocular systems involve a single camera combined with similar computer vision principles and algorithms used for stereo-vision systems. They are well suited to tracking object motion in 2D. The real-time position management system (Varian Medical Systems, Palo Alto, CA), originally developed for monitoring patient motion during radiotherapy, involves video-based position tracking of passive targets. It has been applied in PET to track vertical motion of chest markers to estimate the misalignment between PET and CT scans and to estimate lung lesion motion (Liu et al 2011). A similar position tracking system was used for respiratory gating in PET (Nehmeh et al 2011). It is also possible to estimate more complex motion (rigid-body and non-rigid deformation) using monocular systems by fitting the 2D image frames to an object or motion model. In computer vision this is referred to as the structure-from-motion (SfM) and non-rigid structure-from-motion (NRSfM) problem. (For an excellent review of SfM and NRSfM, see Ozyesil et al (2017).) Examples include a cone-shaped marker with four well-defined holes to track 5 DoF motion (Muraishi et al 2004), and a self-encoded marker for rigid-body head tracking (Forman et al 2011, Spangler-Bickell et al 2019). The latter was reported for MRI but is not modality-specific. A key advantage of monocular approaches compared to multi-view stereo approaches is the reduced line-of-sight constraint. In general, however, pose estimates derived from monocular methods typically have poorer overall accuracy compared to stereo. Moreover, the more complex the motion, the more under-constrained the problem is mathematically, leading to increased noise, instability and drift. Pose accuracy from monocular methods is typically a few millimetres or more in the depth direction, and achieving sub-millimetre and sub-degree accuracy in other out-of-plane DoF is challenging.
However, factors such as increasing the object size, increasing the camera resolution, and reducing the working distance are simple ways to improve accuracy. For example, rotational accuracy of <0.5° was achieved by Forman et al (2011), attributable to the large-area marker with large binary features and the short working distance of 80 mm.
A unique monocular approach with promising accuracy and stability is moiré phase tracking (MPT), a passive marker-based method originally developed for biomechanical studies but which has since been adapted to medical imaging (Weinhandl et al 2010). MPT uses a single camera to track a marker consisting of two different gratings on either side of a thin transparent substrate (figure 3). Motion of the marker generates moiré fringes, the phase of which is orientation-dependent and can be used to quantify out-of-plane rotations. The remaining four DoF (3 translations and in-plane rotation) are determined using conventional photogrammetric methods. The method has been implemented with 15 mm × 15 mm lithographically printed markers at 25 fps (Maclaren et al 2012) to provide static tracking accuracy of 0.7 μm (±1.0 μm) for in-plane translation, 0.07° (±0.1°) for rotations over an angular range of >50°, and 100 μm (±12 μm) for depth. These data represent some of the best figures reported for tracking system performance in medical imaging. Although the method has only been reported for MRI, it is clearly adaptable to other modalities.

Structured light
Structured light systems are conceptually similar to binocular stereo systems, having one camera replaced by a laser or projector to distribute a light pattern onto the object. This approach leverages the fact that projected features are often more easily and reliably detected than native object features. If the projector and camera are spatially calibrated, the depth of recognizable points projected onto the object surface can be determined based on the intersection of the points and the camera line-of-sight, resulting in a dense depth map of the object surface in the form of a point cloud (figure 4). Motion is estimated from frame to frame by registering successive point clouds using a method such as iterative closest points (Bellekens et al 2014).
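Frame-to-frame registration of successive point clouds can be sketched with a minimal iterative-closest-points loop: alternate between nearest-neighbour matching and a closed-form rigid alignment. The toy data and small inter-frame motion below are invented for illustration; practical implementations use k-d trees, subsampling and outlier rejection rather than this brute-force matching.

```python
import numpy as np

def icp_rigid(src, dst, iters=20):
    """Minimal iterative-closest-points sketch: alternately match each source
    point to its nearest destination point, then solve the rigid alignment
    in closed form (Kabsch/SVD)."""
    cur = src.copy()
    R_tot, t_tot = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # Nearest-neighbour correspondences (brute force, for clarity only)
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=2)
        matched = dst[np.argmin(d, axis=1)]
        # Closed-form rigid update aligning current points to their matches
        ca, cb = cur.mean(axis=0), matched.mean(axis=0)
        H = (cur - ca).T @ (matched - cb)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = cb - R @ ca
        cur = cur @ R.T + t
        R_tot, t_tot = R @ R_tot, R @ t_tot + t   # accumulate total transform
    return R_tot, t_tot

# Synthetic 'depth map' point cloud and a slightly moved copy of it
rng = np.random.default_rng(1)
src = rng.uniform(-50.0, 50.0, size=(200, 3))      # points in a 100 mm cube
theta = np.deg2rad(2.0)                            # small inter-frame rotation
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, 1.0, -0.3])                # small inter-frame shift (mm)
dst = src @ R_true.T + t_true

R_est, t_est = icp_rigid(src, dst)
```

ICP of this kind converges reliably only when the inter-frame motion is small relative to the point spacing, which is one reason high sampling rates matter for surface-based tracking.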
Basic laser profiling, in which a single light ray or plane is swept across the object surface and imaged using a fixed camera, is the simplest form of structured light system. This approach was implemented to track mouse head motion in SPECT using a rapidly rotating mirror to sweep the laser line repeatedly through the working volume (Kerekes et al 2003). The main challenge of laser profiling is achieving a suitably high sampling rate to scan fast moving objects and/or a large FoV.
Obtaining a dense depth map from a single image requires projecting multiple light planes simultaneously. This can lead to ambiguity in identifying specific light planes if they are not visually unique (encoded) (Salvi et al 2010). Encoding the light planes can be avoided provided there are certain constraints on the camera model or object shape and motion. Examples of un-coded methods include the use of Fourier profilometry to analyse respiratory-induced breast motion, and a head tracking system using illumination designed to reduce the impact of projected light on the subject which, by excluding textured areas such as eyebrows, achieved an accuracy of 0.25 mm and 0.1° for typical head motion (Olesen et al 2013). An MRI-compatible version, in which the in-bore optics are located at the end of long optical fibres originating from out-of-bore hardware (projector, camera and electronics), has also been reported for use in combined PET/MR (Slipsager et al 2019).
For rapidly moving objects or objects undergoing large pose changes, pattern encoding strategies are usually necessary to compute unambiguous depth maps from a single image. Spatially encoded light patterns exist which enable single-shot depth maps at high frame rate with sub-mm depth range accuracy at working distances <1 m (Forster 2006). Structured light has been used to track head motion during PET imaging of awake rats, however the performance was limited by occlusions, non-rigid deformation of the head and prohibitive computation time for a sub-millimetre resolution mesh. In our experience, structured light also performs rather poorly on rodents, presumably because the high local contrast of the fur distorts the light pattern (Kyme 2012).
Overall, structured light methods have several desirable properties for tracking in medical imaging: (i) relatively inexpensive hardware; (ii) potential depth accuracy of 0.1-0.2 mm at working distances <1 m; (iii) high sampling rates; and (iv) the ability to estimate high DoF motion fields by fitting surface mesh models to rich depth map data (Wilm et al 2011). The limitations of using structured light include the requirement that surfaces clearly reflect the structured pattern and have large topological changes to ensure robust surface-surface registration. The computational demands of a structured light approach are also greater than for sparse point cloud tracking, however as is the case for commercially available depth cameras, customized and optimized hardware makes real-time tracking rates feasible (Slipsager et al 2019).
3.2.6. Inertial sensors
Inertial tracking systems typically combine object-mounted linear accelerometers and angular-rate gyroscopes to determine the position and orientation of an object in a navigation reference frame (Woodman 2007). Position is derived by double integration of the outputs of orthogonally-arranged accelerometers and orientation is obtained, with respect to the reference frame, by integrating the outputs of orthogonally-arranged gyroscopes. This design is the basis of a 6 DoF inertial measurement unit (IMU). With the advent of chip-sized, low-cost accelerometers and gyroscopes based on micro-electro-mechanical-systems technology (Shaeffer 2013), IMU use has expanded from more traditional aerospace applications to human and animal motion tracking, e.g. Roetenberg et al (2007), Ribeiro et al (2009), Cuesta-Vargas et al (2010). However, there has been very limited use of inertial sensors for motion tracking in nuclear medicine and CT. The few reports include 3- and 6-axis mechanical gating of cardiac and respiratory motion for SPECT and PET (Jafari Tadi et al 2014, 2017) and monitoring of knee motion in standing patients imaged using C-arm CT. Inertial systems have important benefits for tracking: they function at long range since both accelerometers and gyroscopes are self-contained; there is no line-of-sight limitation as with optical systems; and they allow high sampling rates with low latency. The main challenge with these systems is compensating for drift in the gyroscope and accelerometer measurements caused by bias and bias instability, noise and calibration residuals. Even minute drifts in gyroscope output can lead to large errors in orientation within a short time, and positional error from the double integration of accelerometer measurements increases quadratically with time for a given fixed bias of the device (Woodman 2007).
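The quadratic growth of positional error can be illustrated with a simple numerical sketch (ours, not drawn from any cited tracking system): a stationary accelerometer with a constant 1 mg bias is double-integrated, and the apparent displacement reaches roughly half a metre after only 10 s.

```python
import numpy as np

# Illustration: dead-reckoning position error from a constant accelerometer
# bias. A stationary sensor with bias b reports acceleration b instead of 0;
# double integration yields an error of approximately 0.5 * b * t^2.
def integrate_position(accel, dt):
    """Double-integrate acceleration samples (simple Euler scheme)."""
    vel = np.cumsum(accel) * dt          # first integration: velocity
    pos = np.cumsum(vel) * dt            # second integration: position
    return pos

dt = 0.01                                # 100 Hz sampling
t = np.arange(0, 10, dt)                 # 10 s of data
bias = 0.001 * 9.81                      # 1 mg bias, typical consumer grade
pos_err = integrate_position(np.full(t.size, bias), dt)

# Error after 10 s is already ~0.5 * bias * 10^2, i.e. about 0.49 m
print(f"position error after 10 s: {pos_err[-1]:.3f} m")
```

This is why even a small, fixed calibration residual makes unaided inertial dead reckoning unusable over the duration of a typical scan without periodic re-referencing.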
One way to correct for drift and thus improve the accuracy and stability of an IMU is to use additional sensors to periodically re-reference the gyroscope/accelerometer measurements. Even without such re-referencing sensors, however, additional hardware is required to power IMUs and wirelessly transmit their signals. Thus, for medical imaging the benefits of IMU-based motion tracking are probably outweighed by the poor accuracy (several mm/degrees) of current consumer-grade miniature devices and the impracticality of attaching fully self-contained inertial systems to the body.

Other external motion tracking approaches
Several modality-naive optical methods which do not fall neatly into the above classification (sections 3.2.3-3.2.5) have been reported. Commercially available bend-sensitive fibre-optic tape was used to measure head position and orientation in MRI brain scans. Head rotation about two orthogonal axes was estimated using an optical lever in which a single laser beam was deflected onto a distant wall from a mirror rigidly attached to a dental bite (Ruttimann et al 1995). Non-optical approaches include ultrasound-based motion tracking in PET (Schwaab et al 2015) and radar-based estimation of respiratory motion in 4D CT (Pfanner et al 2013).
Many other external motion tracking methods exist based either on optical principles (e.g. depth-fromfocus/defocus (Schechner and Kiryati 2000)) and non-optical principles (e.g. time-of-flight positioning using ultra-wideband or microwave technology (Zhang et al 2006)). However, to our knowledge, these methods have not been applied for motion estimation in SPECT, PET or CT.

Dual-modality approaches
The pairing of PET and SPECT with CT into dedicated dual-modality scanners has been standard for well over a decade (Buck et al 2008, Townsend 2008) and, more recently, pairing with MRI, either in a fully integrated system or via a PET/SPECT insert, has rapidly expanded. The CT or MRI component of dual-modality PET and SPECT systems presents many options for estimating the motion fields needed for motion correction. In the majority of reports these are the respiratory and/or cardiac motion fields in thoracic studies.
Using CT, respiratory and cardiac motion fields can be derived from specific 'snap-shots', such as end-expiration and end-inspiration, or from 4D gated CT data by registering consecutive frames (Qiao et al 2006). In either case, since the CT and SPECT/PET acquisitions are not simultaneous, the derived motion field will, in general, differ from the motion field during the SPECT/PET acquisition, and thus adaptation of the motion field to the specific study will be necessary. This can be performed based on simultaneously acquired respiratory belt or ECG data (Bettinardi et al 2013). Many variations of the CT-derived motion field paradigm exist and the reader is referred to some good reviews (Pepin et al 2014, Guerra et al 2017).
Similarly to CT, MRI snapshots of an organ's periodic motion can be used to generate interpolated cardiac and respiratory motion fields. Several authors have derived PET respiratory motion fields from simultaneously measured MR data using non-rigid image registration, e.g. Tsoumpas et al (2010). Alternatively, a respiratory motion model based upon a short-duration simultaneous PET/MR acquisition can be used to establish the correspondence between 3D motion fields and a surrogate signal in order to predict motion during a PET scan from the surrogate signal alone. One such surrogate signal used successfully to reconstruct motion-compensated PET images is a respiratory signal extracted using PCA from the raw PET data (Manber et al 2016). Myocardial wall motion has been derived using tagged-MR (Petibon et al 2013) and combined with respiratory motion derived from dedicated MRI sequences to generate a complete cardio-respiratory motion field. The reader is referred to several surveys for a discussion of these and other approaches (Catana 2015, Fürst et al 2015, Gillman et al 2017).
Advantages of pairing SPECT and PET with CT versus MRI for motion estimation include the mature nature of SPECT/CT and PET/CT technology, cheaper cost, and the prevalence of CT in diagnostic work-ups. The chief drawback is the added dose from CT which, for 4D acquisitions, is significant. On the other hand, the advantages of pairing SPECT and PET with MRI are the potential for truly simultaneous acquisition, superior soft tissue contrast, and the wealth of options for motion estimation due to the flexibility of MRI sequences and the mature history of MRI-based motion estimation research (Ozturk et al).

Fiducial-based approaches
Fiducial markers are features that share the same intrinsic contrast mechanism as the imaging modality and which are either inserted into the subject or attached to the surface. They include radioactive point or line sources for PET and SPECT (Miranda et al 2017) and radio-opaque markers such as steel beads for CT (Schäfer et al 2004). Fiducials are easily isolated in the raw or reconstructed data and may be exploited for motion estimation. A single fiducial enables the estimation of 1D motion, and three fiducials are sufficient to estimate full rigid-body motion (3 rotations, 3 translations) using the same principles as stereo-vision (section 3.2.3), an approach demonstrated in PET (Miranda et al 2017, 2019a, 2019b). The latter involved localising the fiducial centroids in consecutive 33 ms frames of the raw PET list mode data, equivalent to a tracking rate of 30 Hz. Fiducials have also been used to estimate 4D respiratory motion fields (Schäfer et al 2004, Li et al 2006), tumour motion (Becker et al 2010), and joint motion in orthopaedics (Choi et al 2014).
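Given three or more matched fiducial centroids in a reference pose and a moved pose, the 6 DoF rigid transform can be recovered in closed form, for example with the SVD-based Kabsch method. The sketch below is illustrative only (the cited works do not necessarily use this exact algorithm) and uses synthetic fiducial positions:

```python
import numpy as np

def rigid_from_fiducials(ref, moved):
    """Least-squares rigid transform (R, t) mapping ref onto moved.
    ref, moved: (N, 3) matched fiducial centroids, N >= 3 (Kabsch/SVD)."""
    c_ref, c_mov = ref.mean(axis=0), moved.mean(axis=0)
    H = (ref - c_ref).T @ (moved - c_mov)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_mov - R @ c_ref
    return R, t

# Synthetic check: rotate three fiducials by 10 deg about z and translate
theta = np.deg2rad(10)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([5.0, -2.0, 1.0])
ref = np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 20.0, 10.0]])
R, t = rigid_from_fiducials(ref, ref @ R_true.T + t_true)
print(np.allclose(R, R_true), np.allclose(t, t_true))   # True True
```

Three non-collinear markers are the minimum for a unique solution; in practice more markers, or repeated centroid estimates, reduce sensitivity to localisation noise.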
The main benefits of using fiducials for motion estimation are the ease of identification of the fiducial signal in the raw or reconstructed data and the fact that they share the same coordinate frame as the raw data, thus circumventing the need for a cross-calibration. Drawbacks of fiducial-based approaches include the additional time required to prepare fiducial markers, the risk of fiducial motion becoming decoupled from subject motion if attachment of the fiducial markers is not robust, and the limitations in accurately localising fiducial markers imposed by the scanner (e.g. intrinsic spatial and temporal resolution, sensitivity and noise). Additional dose and scatter from radioactive fiducials in PET and SPECT is usually only minor and thus not a notable drawback.

Fully data-driven motion estimation
Data-driven motion estimation is where motion fields are derived from the acquired imaging data alone without the need for any external hardware (e.g. motion tracking systems) (Kesner et al 2014). However, in the following survey of data-driven methods, we relax this definition slightly to include gating-based approaches that rely on respiratory and/or ECG monitoring devices (Bettinardi et al 2013).

SPECT and PET
Registration of the reconstructed image frames from dynamic acquisitions, gated acquisitions or acquisitions framed based on a motion threshold is a very common and useful data-driven method to estimate relatively discrete (inter-frame) motion (Picard and Thompson 1997, Mawlawi et al 2001, Naum et al 2005, Costes et al 2009, Su 2011, Woo et al 2011). Here, motion estimation is implicit in the image registration process. Rigid registration of frames has been applied successfully in organs such as the brain, heart and liver in both humans and animals and supports sub-millimetre and sub-degree accuracy. The feasibility of this approach in brain PET using frames as short as 1 s was recently demonstrated (Spangler-Bickell et al 2021). Non-rigid registration has also been applied to obtain cardiac and respiratory motion fields using a variety of motion models, including affine models.

A variety of centre-of-mass (CoM) based techniques have been reported for data-driven motion estimation in SPECT and PET. Early methods in SPECT used CoM tracking in sinogram space to estimate simple translational shifting of the heart caused by respiratory motion (Geckle et al 1988, Bruyant et al 2002). Data-driven respiratory gating of axial motion based on the CoM of back-projected events in 500 ms frames was used for myocardial perfusion SPECT (Ko et al 2015). In PET, the CoM of raw PET list mode data within successive temporal bins can be used for data-driven gating or to estimate low-dimensional motion of lesions in cardiothoracic studies (Klein et al 2001b, Bundschuh et al 2007). CoM methods can be applied in conjunction with an appropriate threshold for auto-framing of PET list mode data, with subsequent frame-to-frame registration to estimate rigid or non-rigid motion fields. An example of this approach is the estimation of bulk body motion based on the CoM of PET count rates in 200 ms time bins (Lassen et al 2019).
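As a minimal illustration of the CoM idea (a sketch with synthetic data, not any of the cited implementations), the following bins list-mode event positions into 200 ms frames and computes a per-frame CoM trajectory; a step in the trajectory flags bulk motion and could drive threshold-based auto-framing:

```python
import numpy as np

def com_per_bin(times, positions, bin_width=0.2):
    """Centre of mass of event positions in consecutive time bins.
    times: (N,) event timestamps in seconds; positions: (N, 3) coordinates.
    Returns an (n_bins, 3) CoM trajectory (NaN for empty bins)."""
    bins = (times // bin_width).astype(int)
    n = bins.max() + 1
    com = np.full((n, 3), np.nan)
    for b in range(n):
        sel = bins == b
        if sel.any():
            com[b] = positions[sel].mean(axis=0)
    return com

# Synthetic list mode: a source that shifts 5 mm axially after 1 s
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 2, 20000))
pos = rng.normal(0, 2, (t.size, 3))
pos[t > 1.0, 2] += 5.0                       # bulk axial shift at t = 1 s
com = com_per_bin(t, pos)
print(com[0, 2], com[-1, 2])   # near 0 before the shift, near 5 after
```

Thresholding the frame-to-frame change in the CoM trajectory then defines frame boundaries, after which frame-to-frame registration can recover the full motion.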
A variation of the approach exploits the additional information available in ToF PET data by computing the centre of distribution based on the central ToF bin for every PET event within short time intervals to estimate 3D respiratory motion fields for internal organs (Ren et al 2017), bulk body motion (Lu et al 2019) and brain motion (Lu et al 2020). Although not strictly data-driven, measurements from a respiratory belt were correlated with organ CoM to derive a high temporal resolution internal-to-external motion model for respiratory motion in PET (Liu et al 2011). This approach was subsequently extended to non-rigid motion fields (Cha et al 2018).
Other sinogram-based approaches for data-driven estimation of simple 1D and 2D motion in SPECT and PET involve comparing adjacent projections based on peak fitting of cross-correlation profiles (Eisner et al 1987), optical flow (Noumeir et al 1996), and phase-only matched filtering (Chen et al 1993). A more sophisticated approach fits dynamic PET sinogram data to temporal basis functions to estimate respiratory and cardiac motion fields (Ahmed et al 2015). In SPECT, rigid-body motion of the brain and heart has been estimated by minimising the error between measured projections and projections generated from the motion-transformed reconstruction. Overall, the approach is well suited to motion that approximates step-wise movement between projections but less well suited to continuous motion. To our knowledge, a similar concept has not been applied in PET. Rigid-body motion may also be computed using the CoM and inertia of at least three simultaneously acquired projection angles via a principal-axes method (Feng and King 2013). In theory, this approach is applicable to 3-headed SPECT systems and PET.

Neural networks are yet another possible approach for data-driven motion estimation in SPECT and PET. Early work included traditional neural networks applied to cardiac motion estimation (Beach et al 2007); however, the state-of-the-art is to use many-layer (i.e. 'deep') CNNs. To date, we are only aware of supervised CNNs being applied for non-rigid registration of 4D gated frames for respiratory motion correction in PET (Clough et al 2018, Li et al 2020). With insufficient training data it appears that CNN-based methods struggle to out-perform more traditional approaches based on manifold learning (Clough et al 2018). Developing methods to address this particular challenge is, therefore, an important area of research.
CNN and other deep learning-based methods are still very new, and there is much to learn about their feasibility, usefulness and limitations for motion-related tasks, especially motion estimation, with noisy SPECT and PET data.

CT
For rigid motion typical of head and dental CT studies, data-driven motion estimation methods can be either projection-based or image-based. Authors have tended to prefer working directly with the projection data, both for efficiency (avoiding the need to reconstruct) and because motion-related artifacts are generally more localized in the projection data than in the reconstructed image (where smearing of motion-induced artifacts can make motion estimation more difficult) (Mooser et al 2013).
In projection-based approaches, translational and rotational motion vectors have been estimated from projection moments (Pauchard et al 2011) and cross-correlation of adjacent projections (Wang and Vannier 1995, Eldib et al 2018) or opposing projections (Gu et al 2017). Projections can also be represented as a linear sum of nearest-neighbours generated by forward projecting the motion-corrupted reconstruction under many rigid-body transformations. The resulting weights of the linear sum are then used to iteratively update a rigid-body estimate in either fan-beam or cone-beam geometry.
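The cross-correlation variant can be illustrated in 1D: the shift between two adjacent (or opposing) projections is taken from the peak of their circular cross-correlation, computed here via the FFT. This is a generic sketch with synthetic Gaussian profiles, not a reproduction of the cited implementations:

```python
import numpy as np

def shift_between(p1, p2):
    """Signed shift d (in detector bins) such that p2 ~ p1 shifted by d,
    found from the peak of the circular cross-correlation (via FFT)."""
    xc = np.fft.ifft(np.conj(np.fft.fft(p1)) * np.fft.fft(p2)).real
    k = int(np.argmax(xc))
    return k if k <= p1.size // 2 else k - p1.size   # wrap to signed shift

# Two 'adjacent projections': a Gaussian profile and a shifted copy
s = np.arange(256)
p1 = np.exp(-0.5 * ((s - 120) / 8.0) ** 2)
p2 = np.roll(p1, 7)                                  # detector shift of 7 bins
print(shift_between(p1, p2), shift_between(p2, p1))  # 7 -7
```

Sub-bin accuracy is usually obtained by fitting a parabola (or similar model) to the samples around the correlation peak rather than taking the integer argmax.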
Projection-based methods may rely additionally on mathematically formulated data consistency conditions describing redundancies in the sinogram domain. The best known of these are the Helgason-Ludwig consistency conditions (HLCC), originally described for parallel-beam geometry. Extensions of the HLCC have been derived for fan-beam geometry and used to estimate translational and rigid in-plane motion (Yu et al 2006, Leng et al 2007, Yu and Wang 2007, Clackdoyle and Desbat 2015). The HLCC were further extended to 3D cone-beam geometry but the formulation was not applicable to circular orbits (Clackdoyle and Desbat 2013). Akin to the HLCC, the Fourier consistency conditions are defined in the Fourier domain of the sinogram and have been applied to fan-beam geometry to estimate detector shifts (Berger et al 2014) and to cone-beam geometry to iteratively correct for 3D translational motion (Berger et al 2017).
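The lowest-order HLCC can be demonstrated numerically for parallel-beam geometry: the first moment of each projection must trace a pure sinusoid a·cos θ + b·sin θ determined by the object's centre of mass. The sketch below (our illustration, using point masses rather than a full sinogram) shows that a mid-scan shift of the object violates the order-1 condition, which is precisely the inconsistency these methods exploit:

```python
import numpy as np

# Parallel-beam projection of point masses m_i at (x_i, y_i): each point
# contributes its mass at detector coordinate s = x*cos(theta) + y*sin(theta).
m = np.array([1.0, 2.0, 0.5])
xy = np.array([[3.0, -1.0], [-2.0, 4.0], [0.5, 0.5]])
thetas = np.linspace(0, np.pi, 90, endpoint=False)

def first_moment(points, th):
    """First moment in s of the projection at angle th."""
    s = points[:, 0] * np.cos(th) + points[:, 1] * np.sin(th)
    return (m * s).sum()

# Order-1 HLCC: first moment equals a*cos(theta) + b*sin(theta),
# with a = sum(m*x) and b = sum(m*y).
a, b = (m * xy[:, 0]).sum(), (m * xy[:, 1]).sum()
pred = a * np.cos(thetas) + b * np.sin(thetas)

m1_static = np.array([first_moment(xy, th) for th in thetas])
print(np.allclose(m1_static, pred))          # consistent data

# A 2-unit x-shift of the object halfway through the scan breaks consistency;
# the residual flags (and can be used to estimate) the motion.
m1_moved = np.array([first_moment(xy + [2.0, 0.0], th) if th > np.pi / 2
                     else first_moment(xy, th) for th in thetas])
print(np.abs(m1_moved - pred).max() > 1.0)   # inconsistency detected
```

Higher-order conditions constrain higher moments in the same spirit, giving additional equations from which in-plane motion parameters can be solved.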
Image-based approaches for rigid motion estimation typically involve rigid registration of multiple CT volumes. This is used when multiple fast 3D frames are acquired, as in CT brain perfusion studies for stroke.
Each frame can be registered to one of the individual frames or to another reference image such as a non-contrast CT (Fahmi et al 2014). Such methods do not, however, compensate for motion occurring during each individual frame.
For non-rigid motion, data-driven motion estimation in CT is typically performed using one of three approaches: image-based registration of gated data, 3D-2D registration, or iterative minimisation of an image-based motion metric. By far the most common of these is the gating approach to compute periodic motion fields (Guerrero et al 2004, Ehrhardt et al 2007, Yang et al 2008, Schirra et al 2009, Tang et al 2012). In cardiac CT, for example, the heart is commonly scanned during a breath-hold (to avoid respiratory motion) at increments of 10% of the cardiac cycle, providing a 4D dataset. Similarly, a 4D dataset is obtained in thoracic and abdominal CT by scanning at increments of 10%-20% of the respiratory cycle (Sonke et al 2005). In each case, gating can be performed either retrospectively or prospectively, usually based on a surrogate such as the ECG or the signal from a respiratory belt. Estimation of global motion fields from respiratory-gated CT is usually performed by non-rigid registration of the temporal CT frames. In cardiac CT, where the goal is to obtain parameters of cardiac function, relevant structures are first segmented from the individual gates (e.g. using active shape models) and then tracked across gates (e.g. using optical flow). A similar approach is used to track and segment lung nodules and other structures that move due to cardio-pulmonary motion (Cha et al 2018).
Variations of this general approach to estimate motion fields from 4D gated data have also been reported based on partial-angle reconstructions (Kim et al 2015a, 2018) and short scan images (Rohkohl et al 2013).
Bias in motion field estimation can arise in several ways using gating-based approaches. Ignoring the temporal consistency of the data is one source of bias that is typically handled by performing isotropic smoothing to regularise 4D motion fields across 4D data sets (Montagnat and Delingette 2005, Metz et al 2011). For non-smooth motion (e.g. sliding motion at the surface of the chest wall and abdominal wall), direction-dependent regularisation is preferable (Fu et al 2018). Bias is further reduced if all gated frames are registered to an unbiased group-wise mean instead of one specific frame (e.g. end-expiration or end-inspiration) (Metz et al 2011). In principle, double-gating provides 5D (3D + respiratory gate + cardiac gate) data sets from which cardio-pulmonary motion fields may be estimated via gate-gate registration. This is robust for normal CT, which is fast enough (∼0.3 s/revolution) for individual gates to have sufficient angular sampling and be free of artifacts. However, for CBCT, the individual gates may contain only 2% of the total data (e.g. if 20% respiratory and 10% cardiac temporal framing is used) due to the slower gantry rotation (∼10-40 s), and the resulting streak artifacts from undersampled projection angles will strongly bias the estimation of motion fields based on gate-gate registration. One approach to mitigate this problem is to register the original gated frames (with motion + sampling artifacts) to a simulated motion-free gated data set containing only sampling artifacts (Brehm et al 2015).
3D-2D registration approaches estimate and compensate for motion by registering the measured projection images to those generated from a known reference volume or initial reconstruction that is iteratively updated (Zeng et al 2005, Hansis et al 2008, Berger et al 2016, Ouadah et al 2016). This has been applied in all forms of CT, including registration of a reference gate and individual respiratory-gated CBCT frames to determine non-rigid motion fields (Dang et al 2015); registration of a motion-free reconstruction of the tibia and femur with 2D projection images of the bones under load (Berger et al 2016); and registration of a 3D arterial tree model and 2D projection images to estimate motion in CBCT angiography (Blondel et al 2004, Klugmann et al 2018).

The third main approach for data-driven estimation of non-rigid motion characterises the motion via a motion artifact metric defined in image space. Motion correction is then performed by iterative minimisation of this metric. These methods usually impose various assumptions about the object and seek to optimize quantities such as entropy or positivity measures for the 3D reconstruction (Kyriakou et al 2008, Katsevich et al 2011, Rohkohl et al 2013). Although conceptually appealing, achieving robust estimation (and correction) using these methods has proven difficult in practice.
Recently, CNNs have been applied directly or indirectly to data-driven motion estimation in CT. This includes CNN-derived 3D keypoints used to drive a downstream non-rigid registration of respiratory-gated CT frames, and direct CNN-based estimation of motion parameters (Lossau et al 2019b). The latter exhibited poor rotational accuracy of several tens of degrees, which may be explained in part by the small training set. It is likely that for robust performance in ambitious CNN-based tasks like direct motion estimation, large training sets will be essential. Unfortunately, large training sets with ground truth are not readily available for different CT applications. This presents an important area for future development so that the full potential of CNNs can be exploited. In summary, the feasibility and rationale of specific implementations of CNN-based approaches for data-driven motion estimation are yet to be clearly demonstrated, but the time is ripe for a thorough development of such methods.

Summary and outlook
Clearly many different techniques exist for motion estimation in SPECT, PET and CT. This diversity reflects both the diversity of applications and the absence of a 'silver bullet' solution. In table 2 we attempt a comparison of the different approaches surveyed in this review. Where possible an absolute comparison is shown (e.g. sampling rate) but in all cases a relative comparison is expressed using colour-coding: green for the best performance, orange for intermediate performance, and red for the worst performance. Combinations of colours indicate where variation in performance is expected.
Stereo-vision has clearly been the most widely used external tracking method for motion estimation in motion-corrected imaging, but mostly limited to research studies. Stereo implementations have benefited from the extensive development of methods and algorithms from fields outside of medical imaging, including computer vision, photogrammetry and robotics. However, the prevalence of this approach mainly reflects good performance and flexibility across the range of relevant specifications (section 3.1 and table 2). More recently there has been a notable trend towards marker-free stereo tracking based on multi-view camera systems, structured light systems and depth cameras.

With the exception of rigid-body motion (e.g. of the head, jaw and limbs) where, in general, external tracking may provide a robust direct surrogate for internal motion, external motion estimation methods all require an implicit or explicit model mapping external-to-internal motion. This is particularly challenging in the abdomen and thorax. In these cases, data-driven methods have obvious appeal. For this reason, and also because of their practical convenience (i.e. no other hardware or equipment required, with the exception perhaps of a gating device such as ECG or respiratory belt), data-driven motion estimation methods have consistently featured in the literature. Data-driven methods do, however, vary in performance due to both theoretical factors and data limitations (e.g. noise). Nevertheless, we expect to see a continued effort towards improving the robustness of data-driven methods for all forms of motion in different applications of SPECT, PET and CT.
The rise of deep learning techniques, in particular CNN-based models applied to imaging data sets, also presents a new opportunity for data-driven motion estimation in SPECT, PET and CT. To date only a handful of studies exist which involve CNNs being applied to motion-related tasks in SPECT, PET and CT. An obvious opportunity is to apply CNN-based registration methods to gated SPECT, PET and CT data to estimate non-rigid cardio-pulmonary motion fields (Clough et al 2018). However, numerous other approaches involving CNNs are possible. CNNs have shown excellent promise in providing image-based correction of other sources of artifacts in PET, such as attenuation and scatter (Yang et al 2018, Shiri et al 2019), thus it seems natural to explore their potential for motion estimation and correction. Many open questions exist, including: the extent to which the manifestation of motion can be learned, especially as the number of DoF increases; whether motion can indeed be learned directly or whether CNNs better serve as one step in a larger pipeline; which input is optimal: raw data, projection (sinogram) data or image data; and how such models can be extensively trained given the paucity of labelled data and in the absence of ground truth. We see these and other questions as very productive avenues of future research.
Having considered approaches for motion estimation, we now turn to how the derived motion fields are used to correct for motion in SPECT, PET and CT.

Introduction
Conventional image reconstruction methods for SPECT, PET and CT rely on the assumption that the subject remains stationary during data acquisition. If this assumption is violated the reconstructed images suffer from motion artifacts that typically manifest as distortions and blurring and which lead to a loss of quantitative accuracy. This section reviews methods of motion correction, which are techniques to reduce or eliminate such artifacts. Nearly all of these methods are closely associated with methods of motion estimation since they require knowledge of the subject's motion during the scan to apply a correction. A possible exception to this would be the future development of neural network-based approaches trained to remove motion artifacts from reconstructed images without a priori knowledge of the motion.
The motivation for motion correction largely depends on the imaging application: clinically it is used primarily to obtain images free of motion artifacts for improved diagnosis and patient management; in research it allows observations to be based upon more accurate image data; in preclinical imaging, correction for respiratory and cardiac motion may be beneficial for anaesthetised animals, but correction for head motion is usually unnecessary as it can be well controlled using a head holder. Head motion correction is, however, necessary when imaging awake and unrestrained animals.

Table 2. Comparison of motion estimation technologies for SPECT, PET and CT. Technologies are compared across several specifications using a relative colour scale: green (best performance), orange (intermediate performance) and red (worst performance). A combination of colours is used where variation in performance is expected. In some cases the technologies are compared in an absolute sense, indicated by a specific value (a).

Phys. Med. Biol. 66 (2021) 18TR02
In classifying motion correction approaches we distinguish between methods addressing rigid motion and non-rigid motion. Rigid motion, such as motion of the head, is relatively easy to measure accurately using the motion of surface points as a surrogate (see section 3). However, it is much more challenging to obtain accurate estimates of non-rigid internal motion in the abdomen and thorax from external observations. Therefore, correction for non-rigid motion is generally less robust unless accurate knowledge of the internal motion is available-e.g. from gated frames or from the MR data in simultaneous PET/MR imaging.

Correction for rigid motion
Although correction methods for rigid motion can in principle be applied to any body part that moves rigidly, they have been applied primarily to the brain and heart.

SPECT
Early works on rigid motion correction in brain SPECT had the disadvantage that they required projection data at certain angles affected by motion to be discarded, or were unable to correct for motion in all six DoF (Eisner et al 1987, Li et al 1995a, 1995b, Pellot-Barakat et al 1998). Correcting for motion in all DoF can, however, be achieved by reconstructing from a 'virtual' detector trajectory that is obtained by perturbing the actual trajectory by the inverse of the head motion at each projection angle (Fulton et al 1994, 1999) (figure 5). While able to correct for view-to-view motion, this method cannot address motion during the acquisition of individual projection views unless the projection data are acquired in list mode (McNamara et al 2008). Reconstructing from a modified (virtual) detector trajectory restores projection consistency but cannot compensate for situations in which the patient's head motion, in combination with the detector motion, results in incomplete data (Orlov 1975, Tuy 1983). The feasibility of obtaining motion information from the projection images themselves, termed data-driven motion estimation, rather than using a motion tracking system, and reconstructing with the virtual projection method, has also been investigated (Hutton et al 2000, Kyme et al 2003, Feng et al 2006). Whereas these works treated motion estimation and motion correction as separate problems to be performed in succession, the motion estimation and correction steps can be combined into a single optimization problem, in which the 6 DoF motion and the motion corrected image are successively updated in a 2-step iterative fashion (Schumacher et al 2009).
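The virtual-trajectory idea can be sketched with homogeneous 4x4 transforms: each view's detector pose is composed with the inverse of the measured head transform for that view, so the reconstruction effectively sees a stationary head. The composition order below assumes head poses are expressed in scanner coordinates; conventions differ between implementations, so this is an illustrative sketch rather than a specific published formulation:

```python
import numpy as np

def rot_z(deg):
    """4x4 homogeneous rotation about z (the gantry rotation axis)."""
    c, s = np.cos(np.deg2rad(deg)), np.sin(np.deg2rad(deg))
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def translate(v):
    """4x4 homogeneous translation by 3-vector v."""
    T = np.eye(4)
    T[:3, 3] = v
    return T

# Actual detector pose D_i at each of 120 views, and the measured head
# transform M_i at that view. Reconstructing with the 'virtual' pose
# V_i = inv(M_i) @ D_i is equivalent to imaging a stationary head.
D = [rot_z(ang) for ang in range(0, 360, 3)]
M = [translate([0.0, 5.0 * (i >= 60), 0.0]) for i in range(120)]  # 5 mm step
V = [np.linalg.inv(Mi) @ Di for Di, Mi in zip(D, M)]

# Views before the step are unchanged; later views are shifted to compensate.
print(np.allclose(V[0], D[0]), np.allclose(V[70], translate([0, -5, 0]) @ D[70]))
```

The step motion here mirrors the view-to-view case the method handles well; continuous motion within a single view requires list-mode data, as noted above.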
Rigid motion correction methods have also been developed for SPECT cardiac imaging. The heart tends to shift in the superior-inferior direction as respiration slows following the exercise phase of a stress/rest imaging protocol. Correction may be applied by shifting the projection data to compensate, with motion modelled as a rigid translation, e.g. Mester et al (1991), Britten et al (1998), Lee and Barber (1998), Mitra et al (2012). The accuracy of this approach is limited by the fact that cardiac motion is non-rigid (Rahmim et al 2007). The heart rotates and changes shape during contraction, as well as being affected by respiratory motion, for which no rigid motion correction method can fully compensate (Pretorius et al 2016). Several automated or semi-automated correction algorithms for upward creep have been reported, e.g. Matsumoto et al (2001), Uchiyama et al (2005), Mitra et al (2011), Kangasmaa and Sohlberg (2014). Motion correction may also be combined with strategies for mitigating cardiac motion, including the use of pharmacological vasodilation for exercise (Anagnostopoulos et al 1995) or patient support devices (Cooper and McCandless 1995).

Figure 5. Step motions applied to the phantom during acquisition (top), and comparison of horizontal profiles through the middle of the image with and without motion correction versus motion-free reference (bottom) (Fulton et al 1999). Reproduced with permission from Fulton 2000.
In contrast to conventional Anger camera-based SPECT systems, the introduction of dedicated cardiac SPECT cameras characterised by stationary solid state cadmium-zinc-telluride detector modules allows a different approach to the correction of upward creep in myocardial imaging (Redgate et al 2016, Wu and Liu 2019). On these systems pinhole images are acquired by all detectors simultaneously, so that shifts in the x, y and z directions can be estimated from changes in position of the images on the detectors and knowledge of the detector inclinations. Alternatively, motion can be estimated by segmenting the emission data into a series of short frames, reconstructing the frames, and estimating the motion from the reconstructed frames (Van Dijk et al 2016). With either approach corrections can be applied for translations along all 3 coordinate axes.
Correction of intra-frame motion requires a SPECT scanner with list mode capability, enabling a time-specific correction for rigid motion to be applied to each event prior to reconstruction (Bruyant et al 2002, Ma et al 2005).
In preclinical brain SPECT, motion correction has been successfully applied to fully conscious unrestrained mice inside a 'burrow' within the scanner (Weisenberger et al 2005, Baba et al 2013), using list mode reconstruction and data from a synchronised optical motion tracking system.

PET
Much more has been published on rigid motion correction methods for PET than SPECT because of the more widespread use of PET in neuroimaging and neuroscience. The multiple acquisition frame (MAF) motion correction method is applicable to PET scans acquired as a series of dynamic (temporal) frames (Picard and Thompson 1997). The separately reconstructed frames are registered in 3D to one of the frames chosen as a reference. The six DoF transformations to register each frame with the reference frame can be obtained using either an automated 3D image registration algorithm, or from an optical motion tracking system calibrated to the PET coordinate frame (section 3.2.3). When image registration is used to obtain the transformations, their accuracy depends on the individual images having sufficient counts and being free of artifacts caused by intra-frame motion. For accurate attenuation and scatter correction it is necessary to transform the attenuation data to match the pose of the emission data prior to the reconstruction of each frame. To counteract intra-frame motion a new frame may be commenced whenever motion reaches a preset threshold. However, there is a practical limit here as correction for continuous head motion would entail the reconstruction and registration of an unwieldy number of low-count frames. The efficacy of the MAF method has been demonstrated in PET research applications, for example parametric imaging (Tellmann et al 2004, Herzog et al 2005b). With cooperative healthy subjects intra-frame motion is often negligible. However, a more sophisticated approach is needed when intra-frame motion is sufficient to cause artifacts. Intra-frame motion can be corrected by acquiring the data in list mode and adjusting the position and orientation of the lines-of-response (LoR) of individual coincidence events in response to changing head pose (Daube-Witherspoon et al 1990, Menke et al 1996).
Motion relative to an arbitrarily chosen reference pose at any time during the scan is expressed as a six DoF spatial transformation. The inverse of this transformation is applied to the 3D coordinates of the detectors in coincidence to obtain a motion-corrected LoR, which is recorded in a 3D sinogram. Reconstruction of this motion-corrected sinogram provides the motion-corrected image. The transformations can be performed either post-acquisition (e.g. Woo et al 2003, Bühler et al 2004, Watabe et al 2004), or on-the-fly during acquisition by performing the geometrical transformations in hardware (Jones 2002).
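The geometric core of this event-wise correction can be sketched in a few lines. The snippet below is illustrative only (all names are hypothetical; detector geometry, sinogram binning and normalisation are omitted): it applies the inverse of the current six DoF pose to both endpoints of a detected LoR.

```python
import numpy as np

def correct_lor(p1, p2, R, t):
    """Map a detected LoR (endpoints p1, p2, in mm) back to the reference pose.

    (R, t) is the head pose at detection time relative to the reference;
    applying its inverse to the detector coordinates yields the
    motion-corrected LoR.
    """
    R_inv = R.T                  # inverse of a rotation matrix is its transpose
    t_inv = -R_inv @ t
    return R_inv @ p1 + t_inv, R_inv @ p2 + t_inv

# Example: the head translated 10 mm along +x, so the corrected LoR
# endpoints shift by -10 mm relative to the detected coordinates
q1, q2 = correct_lor(np.array([300.0, 0.0, 0.0]),
                     np.array([-300.0, 0.0, 0.0]),
                     np.eye(3), np.array([10.0, 0.0, 0.0]))
```

For rotations the same inverse transform applies; in practice the corrected endpoints must then be mapped to the nearest sinogram bin (with the count-loss caveats discussed below).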
In contrast to the MAF method where the same motion-corrective transformation is applied to all events in a frame, event-by-event methods apply the transformation to the much smaller number of events in a list mode time bin (typically 1 ms in duration). In the presence of rapid motion this enhances the ability of the motion correction algorithm to apply the true transformation to each event. The full potential of this 1 kHz update rate has not yet been realised due to the relatively low sampling frequency of optical motion tracking systems (30-120 Hz).
While potentially more accurate than the MAF method, the sinogram-based LoR rebinning method has the disadvantage that some events cannot be recorded in the motion-corrected sinogram after spatial transformation because the obliqueness of the transformed LoR exceeds the maximum allowed ring difference or because the LoR moves entirely out of the transaxial FoV (Akhtar et al 2013). The extent of the count loss depends on the magnitude, direction and duration of pose changes relative to the reference pose during the scan. Thus, it can be mitigated in part by a shrewd choice of reference pose, but even then the count losses and their impact on quantitative accuracy can be substantial. In awake animal studies the losses can exceed 80% of all coincidence events.
An alternative to histogram-based LoR rebinning that avoids count losses is to pass the transformed coordinates of the detectors involved in each coincidence event directly to a list mode expectation maximisation (LM-EM) reconstruction algorithm (Qi and Huesman 2002, Reader et al 2002, Carson et al 2004, Johnson et al 2004, Rahmim et al 2004). This approach removes the constraint that transformed LoRs must intersect with a detector pair and allows all transformed events to contribute to the motion-corrected image. There is also an improvement in accuracy since the transformed LoR coordinates may be used directly in reconstruction instead of assuming, as is done in histogram-based LoR rebinning, that each photon of the transformed coincidence pair interacted with the centre of the nearest physical detector. In all of these LoR transformation methods, normalisation requires special attention since the normalisation coefficient of the LoR to which the event is assigned after transformation is likely to differ from that of the LoR on which it was detected. If the transformed events are histogrammed into a sinogram, LoRs associated with detector pairs with different normalisation coefficients will be recorded in the same sinogram bin, causing errors when conventional normalisation is applied during reconstruction (e.g. Bühler et al 2004). It has been shown that these methods can produce artifacts unless the different numbers of LoRs contributing to sinogram bins as a result of data compression strategies (axial and angular mashing) are properly accounted for (Zhou et al 2009). As an alternative to pre-normalisation, post-normalisation methods can be applied if a motion-averaged normalisation array is constructed (Thielemans et al 2004, 2008). However, this can be very demanding computationally, especially for scanners with large numbers of LoRs and when the motion is frequent, and it requires the emission data to be precorrected for attenuation.
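A minimal sketch of one such list-mode EM image update follows. Each detected event contributes a system-matrix row computed from its motion-corrected LoR coordinates; attenuation, normalisation and randoms/scatter terms are omitted and the names are hypothetical.

```python
import numpy as np

def lm_em_update(image, event_rows, sens):
    """One list-mode EM image update.

    image      : current estimate, shape (J,)
    event_rows : one system-matrix row per detected event, shape (N, J),
                 computed from the motion-corrected LoR coordinates
    sens       : sensitivity image (column sums of the full system matrix)
    """
    back = np.zeros_like(image)
    for a_i in event_rows:
        fwd = a_i @ image            # expected counts along this event's LoR
        if fwd > 0:
            back += a_i / fwd        # backproject the event with weight 1/fwd
    return image * back / sens
```

Because the update works on one event at a time, each row can be built from the exact transformed coordinates rather than a discretised sinogram bin.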
A faster alternative is to calculate, using a GPU, a time-weighted motion-compensated sensitivity matrix in image space. This is applicable to both histogram-based and LM-EM LoR transformation methods and is done by considering the trajectory of each object voxel as it moves within the FoV, averaging the values of all the conventional sensitivity matrix voxels visited along the way, weighted by the time spent in each location (Rahmim et al 2004, Bashar et al 2013). Some typical approaches to scatter correction in conjunction with motion-corrected reconstruction are described in Carson et al (2004), Thielemans (2005) and Rahmim et al (2008).
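Assuming the static sensitivity image has already been resampled at each pose the object visits, the time-weighted average reduces to a dwell-time-weighted sum, as in this sketch (hypothetical names; the pose resampling itself is omitted):

```python
import numpy as np

def time_weighted_sensitivity(sens_at_pose, dwell_times):
    """Time-weighted average of the static sensitivity along voxel trajectories.

    sens_at_pose : (K, J) array -- the static sensitivity image resampled at
                   each of the K poses visited by the object
    dwell_times  : (K,) array -- time spent in each pose
    """
    w = np.asarray(dwell_times, dtype=float)
    w /= w.sum()                          # normalise dwell times to weights
    return w @ np.asarray(sens_at_pose, dtype=float)
```

The result replaces the conventional sensitivity image in the EM denominator, so voxels that spent time in poorly sampled regions are weighted accordingly.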
Deconvolution approaches have also been explored as a means of correcting for head motion in brain PET. If the motion is known (e.g. from optical motion tracking), an iterative ML-EM deconvolution scheme can be used to estimate the true activity distribution (image) from the motion-corrupted image. Here, deconvolution finds the estimate of the true image which, when the known motion is applied to it, best matches the motion-blurred image.
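A toy version of this scheme, assuming the motion operator and its adjoint are known exactly (here a cyclic two-pose blur in 1D; all names are hypothetical), might look like:

```python
import numpy as np

def motion_deconvolve(blurred, apply_motion, adjoint_motion, n_iter=20):
    """ML-EM (Richardson-Lucy style) deconvolution with a known motion operator.

    blurred       : motion-corrupted image (1D array here for simplicity)
    apply_motion  : linear operator modelling the known motion blur
    adjoint_motion: its adjoint (transpose)
    """
    estimate = np.full_like(blurred, blurred.mean())
    norm = adjoint_motion(np.ones_like(blurred))     # EM normalisation term
    for _ in range(n_iter):
        fwd = apply_motion(estimate)
        ratio = np.where(fwd > 0, blurred / np.where(fwd > 0, fwd, 1.0), 0.0)
        estimate = estimate * adjoint_motion(ratio) / norm
    return estimate

# Toy motion model: the object spends 75% of the scan at the reference pose
# and 25% shifted by one pixel (cyclic shift for simplicity)
blur = lambda f: 0.75 * f + 0.25 * np.roll(f, 1)
blur_T = lambda f: 0.75 * f + 0.25 * np.roll(f, -1)
restored = motion_deconvolve(blur(np.array([0.0, 0.0, 4.0, 0.0])),
                             blur, blur_T, n_iter=200)
```

The multiplicative update preserves total counts and non-negativity, which is why the EM form is preferred over naive inverse filtering for noisy emission data.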
Variations of the above motion correction methods have been adapted successfully to brain PET imaging in animals, including rats. Here the primary objective of motion correction is to enable the brain of an animal to be imaged while awake, avoiding the need for anaesthesia, which is known to affect measured parameters of neurological function such as blood flow and receptor binding. In some implementations, the animal's brain is imaged while the animal moves freely within an enclosure in the PET FoV (Angelis et al 2019, Kyme et al 2019).
The introduction of integrated PET/MR systems capable of simultaneous PET and MR imaging has provided new ways to correct PET studies based on motion derived from MRI (section 3.3). In one of the first demonstrations of this approach, 3D translational and rotational head motion estimates were obtained in human volunteer studies from echo planar imaging and cloverleaf navigator sequences every 3 s and 20 ms, respectively, then applied offline for motion correction of the PET data with LoR rebinning (Catana et al 2011). Correction accuracy was limited by signal contamination from non-rigid motion of the neck. Good results have also been obtained using wireless coils (Huang et al 2014b) (figure 6), an optical camera attached to the head coil (Spangler-Bickell et al 2019), and 3D image registration (Reilhac et al 2018) to provide 6 DoF motion information for list mode reconstruction in simultaneous PET/MR. The application of 3D image registration to PET frames as short as 1 s has been shown to provide 6 DoF motion estimates of sufficient accuracy for motion correction of brain PET scans with list mode reconstruction in PET/CT and PET/MR (Spangler-Bickell et al 2021).
Methods for direct parametric reconstruction of dynamic brain PET data, with motion correction, have also been reported. Rather than reconstructing a series of dynamic frames, these methods incorporate the kinetic model into the reconstruction algorithm to directly fit the model to each voxel time-activity curve. To correct for motion, the motion may be estimated by an optical motion tracking system and undone using LoR transformation (Germino et al 2017) or, alternatively, estimated along with kinetic parameters from the raw PET data in a joint estimation problem where the kinetic and motion parameters of the likelihood function are alternately updated (Jiao et al 2017).

CT
The majority of publications on motion correction methods for CT imaging address non-rigid motion in cardiac and other forms of thoracic CT imaging (see section 4.3.3). Relatively little has been published on the less challenging, but nevertheless non-trivial, problem of correcting for the motion of bodies that move rigidly, such as the head.
Early rigid motion correction methods for CT were limited to in-plane motion. Correction for in-plane translation of the object during helical CT acquisition was achieved by applying compensatory shifts to the projection data prior to interpolation into parallel projection views and reconstruction (Wang and Vannier 1995), with other similar approaches following (Schäfer et al 2004, Yu et al 2006, Zafar et al 2007, Pauchard and Boyd 2008). In-plane (i.e. transaxial) rotation can be corrected similarly since it causes an apparent shift of the object in the tangential direction that can be undone by shifting the projection data in the opposite direction (Fahrig and Holdsworth 2000, Yu and Wang 2007, Zafar 2011). Rotations about axes other than the CT scanner rotation axis have effects on the projection data that cannot be compensated merely by shifting or rotating detected rays in-plane.
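For parallel-beam geometry, the compensatory shift at view angle theta is simply the projection of the in-plane translation onto the detector axis. A minimal sketch (hypothetical names; fan-beam weighting and interpolation details omitted):

```python
import numpy as np

def tangential_shift(dx, dy, theta):
    """Apparent detector-axis displacement of an object translated by (dx, dy),
    at parallel-beam view angle theta (radians)."""
    return dx * np.cos(theta) + dy * np.sin(theta)

def shift_projection(proj, du, pixel_size):
    """Resample a 1D projection view to undo an apparent shift of du (mm)."""
    shift_pix = du / pixel_size
    x = np.arange(proj.size, dtype=float)
    # sample the measured projection at the displaced positions
    return np.interp(x + shift_pix, x, proj, left=0.0, right=0.0)

# A point object displaced by one detector pixel is moved back to its
# reference position in the corrected view
corrected = shift_projection(np.array([0.0, 0.0, 1.0, 0.0, 0.0]), 1.0, 1.0)
```

Applying `tangential_shift` per view and resampling each projection accordingly is the essence of these early in-plane corrections; it fails, as noted above, for out-of-plane rotations.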
A more general approach, analogous to the 'virtual' detector trajectory method for SPECT described in section 4.2.1, is to reconstruct the motion-corrected image from a virtual source/detector trajectory obtained by perturbing the true trajectory to account for object motion at each view. This method can be applied to any 6 DoF motion. The potential of this approach in CT appears to have been first explored by Bodensteiner et al (2007), who used it in an iterative procedure to identify and compensate for small perturbations of gantry motion and object motion in mobile C-arm CT. It was subsequently applied to compensate for 'nodding' motion of a head phantom in slowly-rotating CBCT using a FBP reconstruction algorithm (Jacobson and Stayman 2008). A related technique, termed '3D re-registration motion correction', was applied to simulations of 3 DoF motion (two rotations and one translation) using a modified FDK algorithm (Feldkamp et al 1984, Wells et al 2011). Kim and co-workers applied the virtual trajectory approach to helical CT using motion information from an optical motion tracking system to modify the effective trajectory of the source and detector during reconstruction (Kim et al 2013, 2015b). In contrast to its application in SPECT, where correction for intra-projection motion was problematic, intra-projection motion is negligible in helical CT since projections are acquired within very short time intervals (typically <1 ms).
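The virtual trajectory itself is straightforward to construct once per-view object poses are known: each source and detector position is mapped by the inverse of the object pose at that view, as in this sketch (hypothetical names; detector orientation handling omitted):

```python
import numpy as np

def virtual_trajectory(src, det, poses):
    """Perturb the true source/detector trajectory by the inverse object motion.

    src, det : (V, 3) arrays of source and detector-centre positions per view
    poses    : per-view object poses (R, t) relative to the reference pose
    Returns virtual positions from which the object appears static.
    """
    v_src, v_det = [], []
    for s, d, (R, t) in zip(src, det, poses):
        R_inv, t_inv = R.T, -R.T @ t       # inverse rigid transform
        v_src.append(R_inv @ s + t_inv)
        v_det.append(R_inv @ d + t_inv)
    return np.array(v_src), np.array(v_det)

# One view with the object translated 5 mm along +x: the virtual source
# and detector shift by -5 mm, so reconstruction sees a static object
src = np.array([[100.0, 0.0, 0.0]])
det = np.array([[-100.0, 0.0, 0.0]])
v_src, v_det = virtual_trajectory(src, det,
                                  [(np.eye(3), np.array([5.0, 0.0, 0.0]))])
```

Reconstruction then proceeds on the perturbed geometry, which is why the approach works for any 6 DoF motion but may yield an irregular trajectory with data-sufficiency problems (discussed below).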
The feasibility of compensating for rigid head motion in helical CT scans using a data-driven approach has also been demonstrated (figure 7) (Sun et al 2015, 2016). This method involves the same motion-corrected reconstruction approach as Kim et al (2015b), but view-to-view rigid head motion is estimated by attempting to identify, at each projection angle, the best 2D-3D registration of the current image estimate with the measured projection data. Each time the motion estimate is updated it is used to update the motion-corrected image, so that estimates of the motion and the motion-corrected image are successively refined in an iterative process. Similar methods have been reported by other groups: Bruder et al (2016) explored one data-based (L2-norm) and two image-based (image entropy and total variation) cost functions to estimate motion from projections, rather than the linearization approximation of Sun et al (2015, 2016); and a locally linear embedding scheme was shown to improve motion estimation and motion correction accuracy on simulated clinical CT data and real micro-CT data.
The use of iterative reconstruction algorithms in conjunction with a virtual source/detector trajectory to obtain motion-corrected images can result in reconstruction times orders of magnitude longer than conventional analytical reconstruction algorithms used in clinical CT. Recently, the analytical reconstruction algorithms FBP and FDK have been successfully applied to a virtual source/detector trajectory to provide motion-corrected helical CT images in much shorter times (Bruder et al 2016, Jang et al 2018, Nuyts and Fulton 2020). These methods have the potential to be easily integrated into clinical imaging protocols as they can be applied retrospectively to raw CT datasets with no a priori knowledge of the motion.
As mentioned in the context of SPECT motion correction (section 4.2.1), the combination of detector and object motion in CT imaging may result in a virtual trajectory that provides insufficient data for exact reconstruction. The potential for this problem to arise in clinical CT head imaging has been investigated in simulated CT scans with a range of realistic head motion patterns obtained from volunteers using optical motion tracking (Kim et al 2016). Residual data-insufficiency artifacts in the motion-corrected images were only observed when the head motion was 'severe', i.e. when subjects moved their head rapidly in multiple directions. To identify regions of an image in which motion artifact-free reconstruction is not assured, one may use a local measure quantifying the degree to which Tuy's completeness condition is violated in each voxel (Tuy 1983, Sun et al 2014). A summary of rigid motion correction methods is provided in table 3.

Non-rigid motion
Whereas rigid objects preserve their shape as they move, non-rigid motion of an object involves relative motion between the particles comprising it, and thus a change of shape or 'deformation'. The magnitude and nature of the deformation varies at different points within the object. A brief review of approaches to compensate for non-rigid motion in SPECT, PET and CT follows. It is worth noting that, as for rigid motion, most of these approaches rely on accurate information about the motion of the object during the scan (see section 3.2).

SPECT
In thoracic and abdominal SPECT imaging, respiratory motion often results in serious artifacts due to the motion of the diaphragm, liver, lungs, thoracic cage and abdominal organs. One of the first attempts to correct for respiratory motion by modelling it as deformable motion was to approximate in-plane thoracic motion as a combination of time-varying magnification and displacement in 1D or 2D (Lu and Mackie 2002). Motion information was extracted from the SPECT sinogram by examining the sinusoidal traces of prominent features. The projection data were then rescaled and interpolated to compensate for motion effects prior to reconstruction. In practice, the traces of distinct native features were difficult to identify. This led to attempts to elucidate internal respiratory motion by tracking reflective spheres attached to the chest and using a neural network to decompose the motion data into rigid and non-rigid components (the latter representing respiratory motion) (Beach et al 2007, Mitra et al 2007). A common approach to compensating for respiratory motion in SPECT, PET and CT is the use of respiratory gating. Synchronising the acquisition of a fixed number of sequential image 'frames' with the respiratory cycle results in a set of N images, each depicting the body at a different phase of the cycle, over a time interval N times shorter than the full cycle. This limits the amount of motion affecting each image at the expense of increasing the noise in each image. The trade-off is more significant in SPECT and PET, where projection images are inherently noisy. In myocardial SPECT imaging respiratory artifacts are significantly reduced provided that the amplitude of respiratory-induced heart motion during a gating time period is <1 cm (Segars and Tsui 2002). Kovalski et al combined respiratory gating with list-mode acquisition, detecting the shifts between bins and rebinning the data into projections to compensate for these shifts (Kovalski et al 2007).
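Phase-based gating of list-mode data can be sketched as follows, assuming respiratory cycle start times are available from an external signal (hypothetical names; amplitude-based gating and irregular-cycle rejection are omitted):

```python
import numpy as np

def assign_phase_gates(event_times, trigger_times, n_gates=8):
    """Assign list-mode events to respiratory phase gates.

    trigger_times mark the start of each respiratory cycle (e.g. from a
    bellows or optical signal); an event's phase is its fractional position
    within the cycle containing it.
    """
    t = np.asarray(event_times, dtype=float)
    trig = np.asarray(trigger_times, dtype=float)
    idx = np.searchsorted(trig, t, side='right') - 1
    idx = np.clip(idx, 0, len(trig) - 2)          # out-of-range -> edge cycles
    phase = (t - trig[idx]) / (trig[idx + 1] - trig[idx])
    return np.clip((phase * n_gates).astype(int), 0, n_gates - 1)
```

Events sharing a gate index are then reconstructed together, giving the set of N (here `n_gates`) phase images described above.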
Motion correction of respiratory-gated SPECT images of the lung has also been reported. End-exhalation and end-inspiration respiratory-gated 99mTc-MAA lung perfusion images may be non-rigidly registered and added together to obtain a less noisy image (Ue et al 2006, 2007). Cardiac-gated SPECT acquisition, synchronized with the cardiac cycle via an ECG signal, produces a set of images of the myocardium at different phases of the cardiac cycle, each with reduced motion artifacts (but increased noise) compared to the ungated image. The gated images allow one to model and correct for the physical deformation of the heart during the cardiac cycle in order to create a motion-corrected summary image with an SNR similar to that of the ungated image. Various approaches have been developed to identify cardiac motion, including tracking anatomical points within the heart through the series of gated images to create a vector field of displacements (Laading et al 1999), and incorporating temporal regularization into the reconstruction process using a temporal prior (Gravier 2004, Gravier et al 2007). Optical flow methods have also been used extensively to estimate frame-to-frame heart deformations in cardiac-gated imaging. A content-adaptive mesh model tomographic reconstruction framework based on a deformable non-uniform sampling grid (Brankov et al 2004) has shown good performance for motion-corrected cardiac-gated myocardial SPECT imaging (Marin et al 2010). For a good summary of these approaches see Gilland et al (2008).
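The brightness-constancy constraint underlying optical flow, Ix·u + It = 0, can be illustrated in one dimension (a toy sketch only; practical estimators add spatial regularisation and multi-resolution schemes):

```python
import numpy as np

def optical_flow_1d(frame0, frame1):
    """Per-pixel displacement from the brightness-constancy constraint.

    Solves Ix * u + It = 0 for u at each pixel; valid only for small,
    smooth displacements and non-zero intensity gradients.
    """
    Ix = np.gradient(frame0.astype(float))     # spatial gradient
    It = frame1.astype(float) - frame0.astype(float)  # temporal difference
    safe = np.abs(Ix) > 1e-6
    u = np.zeros_like(Ix)
    u[safe] = -It[safe] / Ix[safe]
    return u

# A linear ramp shifted right by 0.5 pixels yields u = 0.5 everywhere
frame0 = np.arange(5, dtype=float)
u = optical_flow_1d(frame0, frame0 - 0.5)
```

In 2D/3D gated imaging the same constraint is underdetermined per voxel (the aperture problem), which is why regularised formulations such as Horn-Schunck are used in practice.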
Further improvement in SPECT myocardial perfusion image quality has been reported with simultaneous respiratory and cardiac gating (Bitarafan et al 2008, Kovalski et al 2009, Chan et al 2014, Qi et al 2017).

PET
In PET lung imaging, the ability of respiratory gating to reduce respiratory motion artifacts has been well demonstrated (Nehmeh et al 2002, Rahmim et al 2007, Guerra et al 2012). Deformable registration methods that combine the gated images into a single composite image have been developed using, for example, a 12-parameter affine motion model (Klein et al 2001a), optical flow (Dawood et al 2005, Huang et al 2014a) and polynomial warping (Woo et al 2005). Jacobson et al proposed a motion correction method for ungated data that incorporates parameters describing the deformable motion into a statistical projection model. Joint maximum likelihood estimation of both the deformation parameters and the image parameters allowed more accurate motion estimation than frame-based reconstruction followed by image registration; however, it did not improve the accuracy of lesion uptake measurements (Jacobson and Fessler 2004). Similar approaches have been reported by Blume et al (2010) and Kalantari et al (2016). These methods model the time-varying relationship between the motion and projection data with respect to a single motion-free image. By optimizing an objective function, the image and motion may be jointly estimated in PET/CT (Bousse et al 2016), and in PET/MRI with motion-adjusted attenuation information derived from the MRI scan (Bousse et al 2017). Several CT- and MRI-based methods can be used to estimate the required non-rigid motion fields (see section 3.3).
Motion correction methods using frame-based image reconstruction followed by deformable image registration of respiratory-gated PET data suffer from errors due to the difficulty of accurately estimating non-rigid motion from frames containing high levels of noise. There is good evidence that improved registration accuracy, and better computational efficiency, can be achieved using deep neural networks. For example, unsupervised non-rigid image registration was incorporated into PET image reconstruction (Li et al 2020). Unsupervised approaches have excellent practical utility since model training does not require ground truth.
Respiratory motion can also reduce the accuracy of operator-drawn ROIs and the resulting time-activity curves, for example in dynamic 13N-ammonia PET myocardial blood flow studies. Accuracy can be improved by aligning the measured data to template images of blood pool and myocardium typifying the tracer distribution at different stages of uptake (Turkington et al 1997). This method provides accurate 3D image registration, even in early frames with low counts. In PET/CT, mismatch between the PET and CT images (due to the PET image being acquired relatively slowly over multiple respiratory cycles compared to the 'snapshot' CT) can lead to significant error in estimating regional cardiac uptake, as had previously been shown at the borders between lung and soft tissue in oncologic PET/CT (Le Meunier et al 2006). Respiratory gating mitigates this effect (Livieratos et al 2006, Wells et al 2010, Ren et al 2017). In PET, as in other imaging modalities, contractile motion of the heart itself can be mitigated using cardiac gating (Rahmim et al 2007). There have been several efforts to recover a single composite motion-corrected cardiac image from all acquired events, including using deformable image registration (Klein et al 1997, Klein 1999) or the optical flow constraint (Gilland et al 2008) to align the cardiac gated images.
Cardiac motion effects can also be mitigated using motion data derived from simultaneous PET/MR. Once non-rigid myocardial wall motion fields are obtained (using, for example, tagged MRI), this information may be incorporated, together with a position-dependent point spread function, into the reconstruction system matrix to obtain an image with correction for both motion blurring and the partial volume effect (Petibon et al 2013). Both cardiac and respiratory motion estimates can be combined into a single non-rigid motion vector field for incorporation into a motion-corrected reconstruction with time-dependent MR-derived attenuation correction (Ouyang et al 2014). Variations of this approach are described in Robson et al (2018) and Kolbitsch et al (2019).
Several groups have investigated dual-gating approaches, in which respiratory and cardiac gating are performed simultaneously, and estimates of both non-rigid respiratory motion and non-rigid cardiac motion are used to apply motion correction (Lamare et al 2014, Feng et al 2016, Klen et al 2016). Motion correction may be applied after (AR), during (DR), or before (BR) reconstruction (Feng et al 2016). In AR methods, motion correction is applied to gated images via non-rigid registration, whereas DR methods incorporate the motion information in the reconstruction algorithm, and BR methods apply the correction in the projection domain. A mass-preserving image registration algorithm was applied to dual-gated cardiac PET data to estimate and apply AR correction for both respiratory and cardiac motion (Gigengack et al 2012). However, changes in the attenuation distribution due to motion were not considered in this work, limiting quantitative accuracy. An inter-comparison of AR, DR and BR respiratory motion and cardiac motion methods, using Monte-Carlo simulated dual-gated data and attenuation maps transformed according to the estimated respiratory motion, was conducted by Feng et al (2016). Optimal motion correction accuracy over a range of noise levels was obtained with DR motion correction. In a comparison of four different AR methods with simulated data, separate estimation of respiratory and cardiac motion, with modeling of respiratory motion before cardiac motion estimation, provided the most accurate estimation of respiratory and cardiac motion.
Deconvolution-based motion correction methods rely on the estimation of a suitable de-blurring motion kernel (El Naqa et al 2006, Thomas et al 2019) which, in dual-modality systems, can be estimated from CT or MRI-derived motion fields. It is, however, also feasible to derive the motion PSF directly from ungated SPECT and PET reconstructions corrupted by respiratory motion (Xu et al 2011), albeit with limited practicality for small lesions.
For quantitatively accurate correction of both rigid and non-rigid motion in PET, it is important to correct not only for motion/deformation of the radioactivity distribution within the patient but also accompanying changes in the attenuation distribution (Pevsner et al 2005, Khurshid et al 2006, McQuaid and Hutton 2008, Bai and Brady 2011). In PET/CT, gating of both the CT and PET scan can provide well-matched images, but this entails a high radiation dose from the gated CT scan (Ponisch et al 2008). An alternative approach is to construct registered CT attenuation images for each PET frame by applying spatial transformations to a single CT scan. In an example of this approach, Radon consistency conditions were used to transform a single CT image into alignment with the respiratory-gated PET frames prior to attenuation correction, then the inverse transformations were applied to align the gated PET images into a single image (Alessio et al 2007).

CT
Non-rigid motion of internal organs, as a result of respiration, cardiac contraction and gastrointestinal motion, also causes image artifacts in abdominal and thoracic CT imaging. This affects the use of slow-rotating CBCT for tumour delineation as part of radiotherapy treatment planning and the use of modern multi-slice helical CT scanners in diagnostic procedures (Moorrees and Bezak 2012).
When imaging is fast, as with helical CT, respiratory motion may sometimes be avoided by breath-hold imaging provided the patient is able to comply. However, this is rarely the case in CBCT, where the acquisition of a full set of projection data typically takes minutes. Some of the earliest work in this area modelled cardiac and respiratory motion as entirely in-plane, evaluating the motion at every pixel to derive a motion map. Motion artifacts were shown to be reduced in clinical scans by pixel-specific backprojection, i.e. by performing the backprojection in a frame of reference that moves with the object. The main shortcoming of this method was its inability to correct for out-of-plane motion.
In slow-rotating CBCT, a commonly used approach is respiratory gating, e.g. Sonke et al (2005), Hinkle et al (2012), Moorrees and Bezak (2012). This reduces the severity of motion artifacts since the amount of motion affecting each gated image is less than the motion during the entire respiratory cycle. However, it has the drawback of requiring the acquisition of a larger number of total projections, increasing the patient radiation dose. To counteract this, a limited subset of projections is typically acquired for the reconstruction of each phase, at the expense of image quality. Alternatively, the motion can be taken into account during image reconstruction (Rit et al 2009a, 2009b). In such approaches, 3D deformable motion fields due to respiration are obtained by analysis of a 4D treatment planning CT performed separately from the 3D CBCT scan and assumed to share the same respiratory motion characteristics. The motion-corrected image is then obtained using a reconstruction algorithm incorporating this motion. A similar approach has been used in C-arm imaging (Schäfer et al 2012).
An iterative image-based correction for respiratory motion artifacts in CBCT has also been reported (Schretter et al 2009). Motion artifacts are first extracted in projection space (as differences between the acquired projections and corresponding forward projections of the image reconstructed in the previous iteration step), then reconstructed in image space and subtracted from the original reconstructed image.
Wang et al showed that joint estimation of 3D non-rigid motion fields and the motion-compensated image in 4D CBCT could improve estimates of the reconstructed image and motion trajectory as compared to conventional sequential 4D-CBCT reconstruction and motion estimation in image-guided radiation therapy (Wang and Gu 2013). Following a subsequent finding that motion estimation was less accurate in the lung due to the relative absence of high-contrast structures, the same group used a CNN to fine-tune DVFs derived from joint estimation (Huang et al 2020).
CT volume imaging of the heart became feasible in the early 2000s with the introduction of 16-slice CT scanners with shorter rotation times of about 0.4 s, and the use of ECG-gated acquisition (Kachelriess and Kalender 1998, Kachelriess et al 2000). ECG-gated projection data can be reconstructed into separate frames, from which motion fields are calculated, e.g. using non-rigid registration, to characterize the motion at multiple points in the cardiac cycle. A final motion-compensated reconstruction takes the motion into account. This can significantly improve the diagnostic value of x-ray CT angiography by correcting for the motion of the coronary arteries (Isola et al 2012). The general approach can be applied iteratively with alternating updates of the motion field and motion-corrected image (Tang et al 2012). To avoid the high radiation dose of gated CT, motion correction can also be based on motion fields estimated by deformable registration of partial angle reconstructions (Kim et al 2015a). A point matching approach has also been used to improve the visualisation of coronary arteries without ECG gating (Bhagalia et al 2012).
There is growing interest in AI-based methods to replace or supplement more traditional algorithms for non-rigid motion estimation and correction, e.g. Huang et al (2020). A generative adversarial network (GAN) using the Wasserstein distance and mean squared error loss (m-WGAN) was trained to suppress motion artifacts in dental CT images (Jiang et al 2019) (figure 8). A deep learning approach has also been applied to the detection of motion artifacts in noninvasive CT coronary angiography (Elss et al 2018). Here, to avoid the unnecessary computation of applying a motion correction algorithm to datasets without motion artifacts, a CNN was trained to classify 2D coronary cross-sectional images as either motion-free or motion-perturbed. Training data were generated by applying artificial motion vector fields to nine high-quality ECG-triggered clinical cases. However, as recognised by the authors, it remains unclear how this method would perform on real clinical data acquired with a variety of scanner types and imaging protocols. Thus this work is still at quite a preliminary stage. Table 4 provides a summary of non-rigid motion correction methods.

Summary and outlook
The development of effective and practical methods to correct for rigid and non-rigid motion in medical and preclinical imaging with SPECT, PET, CT and the hybrid modalities PET/CT and PET/MR is an area of active research.
For the estimation and correction of rigid head motion, demonstrably effective and technically feasible methods now exist for all of these modalities. In PET, both human and preclinical, combining optical motion tracking, LoR transformation and list mode reconstruction provides an effective solution which is in routine use at several leading research centres. In PET/MR, the potential to motion-correct PET data using head motion data estimated from simultaneously-acquired MR sequences, instead of optical tracking, has also been demonstrated. The feasibility of correcting for head motion in SPECT has also been clearly demonstrated in human and preclinical imaging, using optically-derived motion estimates within the reconstruction. In CT, recent developments have demonstrated the feasibility of jointly estimating head motion and the motion-corrected image with a data-driven approach.
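The LoR transformation mentioned above for PET can be sketched as follows: for each event, the two detection coordinates are moved with the inverse of the measured 6 DoF head transform before being fed to list-mode reconstruction. The rotation convention and all coordinates below are assumptions for illustration only:

```python
import numpy as np

def rigid_matrix(rx, ry, rz, t):
    # 6 DoF pose as a rotation (about x, then y, then z; an assumed
    # convention) plus a translation
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx, np.asarray(t, float)

def correct_lor(p1, p2, R, t):
    # the head moved as x -> R x + t, so undo that motion by applying
    # the inverse transform to both endpoints of the line of response
    return R.T @ (p1 - t), R.T @ (p2 - t)

# a point source at x0 moves with the head; the measured LoR passes
# through R x0 + t, and after correction it passes through x0 again
R, t = rigid_matrix(0.1, -0.05, 0.2, [5.0, -3.0, 1.0])
x0 = np.array([10.0, 20.0, 30.0])
moved = R @ x0 + t
d = np.array([1.0, 0.0, 0.0])                # LoR direction (illustrative)
p1, p2 = moved - 300 * d, moved + 300 * d    # detector endpoints
q1, q2 = correct_lor(p1, p2, R, t)
```

Because only the event coordinates are modified, the correction can be applied event-by-event at the motion sampling rate without re-binning the data.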
Despite technical feasibility having been established in the research environment, this has not been successfully translated into clinical feasibility: few rigid motion correction solutions are available commercially. A likely explanation for this is the issue of practicality (Kyme et al 2018). If head motion is continuous, high-frequency motion sampling is required. Until recently, this was only achievable in PET and SPECT with external tracking approaches such as optical tracking. However, optical motion tracking has its own limitations, including the fact that it usually requires the attachment of markers and the setting up and calibration of a motion tracking system, all of which increases the complexity of the scanning procedure. Recent work showing that 6 DoF motion estimation at 1 Hz is feasible in PET using a data-driven approach is therefore a promising development (Spangler-Bickell et al 2021). In CT, data-driven motion estimation approaches can provide useful motion estimates at still higher frequencies due to the much lower noise in the rapidly-sampled projection data. Effective data-driven correction for head motion has been demonstrated using 6 DoF motion estimated jointly with the motion-corrected image from raw CT projection data in multi-slice helical CT. The practicality of this approach was limited until recently by the extensive computation required to perform motion-corrected ML-EM reconstruction, which made the method too slow for clinical use. However, recent efforts to use analytical reconstruction algorithms such as FBP and FDK are bringing reconstruction times closer to clinically acceptable timeframes (Bruder et al 2016, Jang et al 2018, Nuyts and Fulton 2020). With further acceleration, motion correction could become a clinically feasible option in head CT in the future.

Table 4. Sample non-rigid motion correction methods.

Modality (motion)       Method                      Example
SPECT (respiratory)     Sinogram correction         (Lu and Mackie 2002)
                        Respiratory gating          (Segars and Tsui 2002)
                        Deformable registration     (Ue et al 2006)
SPECT (                 Deep learning               (Huang et al 2020)

Methods to estimate the rigid motion of the head with optical tracking systems are subject to several sources of error. The accuracy depends on the intrinsic accuracy of the system and on the accuracy of the rigid transformation relating the coordinate systems of the tracker and scanner, usually determined via a calibration procedure, e.g. Fulton et al (2002). Tracking accuracy may also be limited by the rigidity of marker attachment to the head, the frequency with which motion updates are obtained relative to the velocity of motion, and, in data-driven motion estimation, the noise level in the projection data from which the motion is estimated. Because of these potential sources of error, methods to assess the accuracy of motion correction are of interest, e.g. Keller et al (2012), Schleyer et al (2015). Robotic testing platforms, which allow phantoms to be manipulated along highly reproducible and accurate motion trajectories, play an important part in the performance evaluation of motion tracking and correction methods (Kyme et al 2014, 2020). Interested readers are referred to several studies comparing the accuracy of different motion correction methods in PET head imaging, e.g. Montgomery et al (2006), Rahmim et al (2008), Jin et al (2009, 2014). Residual motion blur in motion-corrected images due to these sources of motion tracking error can be reduced using deconvolution methods.
Miranda et al report improved spatial resolution in PET imaging of unrestrained rats using a deconvolution algorithm with a spatially variant kernel that is dependent on the observed head motion (Miranda et al 2014). Angelis et al directly measured the motion-dependent kernel during the PET acquisition by attaching a small point source to the rat's head (Angelis et al 2018).
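As a sketch of the deconvolution idea, using a spatially invariant kernel for simplicity (whereas the cited methods use motion-dependent and spatially variant kernels), a 1D Richardson-Lucy deconvolution of residual motion blur might look like:

```python
import numpy as np

def richardson_lucy(blurred, psf, iters=200):
    # standard Richardson-Lucy multiplicative updates; in the cited
    # work the PSF would be derived from the tracked head motion
    psf = psf / psf.sum()
    psf_flip = psf[::-1]
    est = np.full_like(blurred, blurred.mean())
    for _ in range(iters):
        conv = np.convolve(est, psf, mode="same")
        ratio = blurred / np.maximum(conv, 1e-12)
        est = est * np.convolve(ratio, psf_flip, mode="same")
    return est

# blur a point source with a residual-motion kernel, then deconvolve
truth = np.zeros(64)
truth[32] = 1.0
psf = np.array([0.25, 0.5, 0.25])   # illustrative residual-motion blur
blurred = np.convolve(truth, psf, mode="same")
recovered = richardson_lucy(blurred, psf)
```

Richardson-Lucy is well suited to emission data because it preserves non-negativity and (approximately) total counts, though it amplifies noise if over-iterated.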
In general, it appears that additional complexity and long reconstruction times have been the major barriers to commercially available rigid motion correction methods for all the imaging modalities considered here and thus to their more widespread clinical use. Productive research areas in the future could include the development of more automated, yet reliable and accurate, optical motion tracking methods, including extensions of recent efforts to develop markerless optical motion tracking solutions, and more computationally-efficient motion-corrected reconstruction methods.
Correcting for non-rigid motion (e.g. respiratory or cardiac motion) is a more challenging problem than rigid motion correction, mainly because of the difficulty of accurately estimating the motion and deformation of internal structures. Nevertheless, it has attracted considerable interest from researchers due to its major potential clinical impact. Non-rigid respiratory and cardiac motion artifacts can be reduced using gating techniques which produce separate images at predefined intervals during the respiratory or cardiac cycle. Each gated image is affected to a much lesser extent by motion than a composite, ungated image, but more affected by noise. Gating does not require motion to be estimated directly, but 3D non-rigid motion fields can be estimated from the gated images, e.g. using non-rigid registration, and input to a reconstruction algorithm to produce a single reconstructed image based on all of the data. The degree of success depends largely on the spatial and temporal accuracy with which the deformation can be estimated from the noisy gated images. Despite these limitations, simultaneous cardiac and respiratory gating in PET, for example, has been shown to enable estimation of both motion types and their incorporation into the reconstruction process, producing a motion-corrected image with demonstrated clinical benefits.
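The noise-versus-motion trade-off described above can be made concrete with a toy 1D example, in which a known per-gate shift stands in for the estimated non-rigid motion field (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
truth = np.zeros(128)
truth[60:64] = 10.0   # a small hot lesion

# eight respiratory gates: the lesion shifts with breathing phase, and
# each gate holds only 1/8 of the counts, hence higher noise per gate
phases = [0, 2, 4, 6, 8, 6, 4, 2]
gates = [rng.poisson(np.roll(truth, s) / 8.0).astype(float) for s in phases]

ungated = np.sum(gates, axis=0)   # all counts, but motion-blurred
# motion-corrected: warp each gate back to the reference phase (here,
# undo the known shift) before summing, keeping all of the counts AND
# the sharpness of a single gate
corrected = np.sum([np.roll(g, -s) for g, s in zip(gates, phases)], axis=0)
```

In practice the warp is a dense deformation estimated by non-rigid registration of the noisy gated images, which is precisely where the accuracy of the method is won or lost.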
It is interesting to consider that current endeavours to improve PET coincidence timing resolution for time-of-flight may eventually enable the annihilation site to be localized with sub-pixel accuracy. In this case it may be possible to apply event-by-event corrections for rigid and non-rigid motion with much greater precision than at present (Meikle et al 2021).
In emission tomography, whether the aim is to correct for rigid motion or non-rigid motion, it is vital to account for motion of both the emission and attenuation distributions. Accounting for motion of the emission distribution during reconstruction while treating the attenuation distribution as static will result in artifacts and quantitative errors.
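A single-ray numerical example illustrates the size of error a static attenuation map can cause; the attenuation coefficient is the standard value for water at 511 keV, while the activity and depths are illustrative:

```python
import numpy as np

mu = 0.096   # approx. linear attenuation of water at 511 keV, per cm

def measured(activity, depth_cm):
    # counts along one ray from a source at the given tissue depth
    return activity * np.exp(-mu * depth_cm)

a = 1000.0
# the head moves mid-scan, changing the source depth from 5 cm to 8 cm
half1, half2 = measured(a, 5.0), measured(a, 8.0)

# correct: the attenuation follows the motion (depth 5 cm, then 8 cm)
moving_atn = 0.5 * (half1 / np.exp(-mu * 5.0) + half2 / np.exp(-mu * 8.0))

# wrong: emission motion corrected but attenuation map kept static at
# the initial position (depth 5 cm assumed for both halves)
static_atn = 0.5 * (half1 + half2) / np.exp(-mu * 5.0)
```

Here the static-map estimate underestimates the true activity by roughly 12%, a purely quantitative error that would not necessarily be visible as an obvious artifact.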
Although little has been published so far on state-of-the-art machine learning methods in this field, it is clear that such methods could aid, or eventually replace, existing motion estimation and motion correction methods. The attraction of machine learning methods, and specifically deep neural networks, in the context of motion correction is threefold: (i) the potential for a network to comprehensively model the underlying physics, compared to the limited physics captured within explicit analytical models; (ii) the potential to substitute specific components of a traditional motion correction pipeline with a neural network-based module that provides better performance; and (iii) the potential for neural networks to enable highly accurate, practical and fully data-driven correction (ideally, exclusively in the image domain), a long-term but elusive goal for practitioners in SPECT, PET and CT. There is much scope for further work in this area.

Conclusion
In this review we have described the clinical importance of subject motion in SPECT, PET and CT and critically surveyed methods to estimate and correct for this motion. Despite many similarities in how motion is handled in these modalities, utility and applications vary based on differences in temporal and spatial resolution. Technical feasibility has been demonstrated in each modality for both rigid and non-rigid motion, but clinical feasibility remains an elusive target. There is considerable scope for further developments in motion estimation and correction. Deep neural network-based methods may have a unique role to play in this context.