Analysis and evaluation of planned and delivered dose distributions: practical concerns with γ- and χ-evaluations

One component of clinical treatment validation, for example in the commissioning of new radiotherapy techniques or in patient-specific quality assurance, is the evaluation and verification of planned and delivered dose distributions. Gamma and related tests (such as the chi evaluation) have become standard clinical tools for such work. Both functions provide quantitative comparisons between dose distributions, combining dose-difference and distance-to-agreement criteria. However, there are some practical considerations in their use that can compromise the integrity of the tests, and these are occasionally overlooked, especially when the tests are adopted too readily from commercial software. In this paper we review the evaluation tools and describe some practical concerns. The intent is to provide users with guidance so that these evaluations will provide valid, rapid analysis and visualization of the agreement between planned and delivered dose distributions.


Introduction
Improved treatment planning systems, dose delivery equipment (typically, linear accelerators) and therapeutic techniques have advanced modern radiation therapy by enabling clinicians to achieve more conformal high-dose delivery to target volumes while sparing adjacent normal tissues. However, as discussed throughout these proceedings, these improvements have also significantly increased the requirement for dose delivery validation, especially in the commissioning of treatment units and new treatment techniques and, in some settings, in the patient-specific validation of dose delivery.
The measurement of the integrity of dose delivery during the commissioning of a new unit, or of a new treatment technique, may involve the measurement of dose distributions in phantoms for test cases planned under well-defined conditions [1,2]. Such measurements can confirm correct performance and establish benchmark data for future quality assurance of the particular treatment protocol. The measurements often involve film, or two- or three-dimensional (2D and 3D) diode or ion chamber array measurements, on regular and anthropomorphic phantoms. These devices may also be used to test the delivered distributions for specific patients by exposing the devices in phantom to the same multileaf collimator leaf sequences, trajectories and MUs as planned for the treatment [3]. Patient treatment delivery validation may also be performed using dose reconstruction from exit beam measurements on Electronic Portal Imaging Devices (EPIDs) [4]. Finally, whole treatment protocols may be regularly monitored in end-to-end test procedures using 2D and 3D dose delivery validation [5,6]. Gel and radiochromic 3D dosimetry has played a role in all these settings in select clinics.

Gamma dose evaluation
In the γ evaluation, a comparison is performed between two dose maps: one distribution is the 'reference' distribution (typically from the treatment planning system, but see the discussion below) and the other is the 'evaluated' distribution, usually from a two- or three-dimensional dose measuring system. The reference distribution is treated as the true distribution, while the evaluated distribution is analyzed for its agreement with the reference as follows: every point in the reference image is assigned a γ value, a measure of agreement at that location. Each point in the reference distribution can be coupled with any point in the evaluated image, and for each such pair there exists a Γ, defined by the vector difference between the points in the combined dose-distance space.
Tolerance criteria Δd_M and ΔD_M (e.g., 3 mm and 3%) are used to normalize Γ along the distance and dose dimensions, respectively. The gamma index value, γ, is the smallest Γ that can be found considering the entire evaluated distribution. In a 2D evaluation space, Γ² = 1 describes a circle whose extent is defined by the tolerance criteria (see figure 1a), whereas in a 3D space the criteria define an ellipsoid. When γ ≤ 1, the distributions agree within the stipulated tolerances. Conversely, when γ > 1, no point in the evaluated dose map can be found within the circle/ellipsoid and the dose distributions disagree at that location. As with the composite test [14], the γ function identifies whether the evaluated distribution passes (γ ≤ 1) or fails (γ > 1) the comparison. Also, if the evaluated distribution fails, the value of γ indicates roughly the degree of failure relative to the dose and distance criteria set in the evaluation [7]. For example, γ = 1.5 fails by 50%, which corresponds to exceeding the criteria by 1.5% or 1.5 mm for 3%/3mm dose/DTA criteria, or by 1% or 1 mm for 2%/2mm criteria. If the vector nature of the gamma is evaluated [7,15,19], the test can indicate whether the failure is primarily due to dose or to distance-to-agreement failure.
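For reference, the quantities described above can be written out explicitly (following the standard definition introduced by Low et al., with r_r and r_e denoting points in the reference and evaluated distributions and D_r, D_e the corresponding doses):

```latex
\Gamma(\vec{r}_r,\vec{r}_e) =
  \sqrt{\frac{\lvert \vec{r}_e-\vec{r}_r\rvert^{2}}{\Delta d_M^{2}}
      + \frac{\bigl[D_e(\vec{r}_e)-D_r(\vec{r}_r)\bigr]^{2}}{\Delta D_M^{2}}},
\qquad
\gamma(\vec{r}_r) = \min_{\vec{r}_e}\,\Gamma(\vec{r}_r,\vec{r}_e)
```

so that γ ≤ 1 at a reference point exactly when some evaluated point lies inside the acceptance circle/ellipsoid centred on it.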

Chi dose evaluation
Since it was introduced, the γ-tool has been refined, modified and evaluated by several authors [12][13][14][15][16][17][18][19][20]. One suggested alternative to the γ function is the chi (χ) evaluation proposed by Bakai et al. [14], which also provides a metric combining both dose and DTA criteria to evaluate the level of agreement between evaluated and reference dose distributions. However, in the χ approach the comparison between the reference and evaluated distributions is carried out differently than in the γ test. Instead of searching the test space for the evaluation point closest to a given reference point, the χ test compares the reference dose to the test dose at the same point in space, but scales the dose limit criteria according to the local dose gradient of the reference distribution.
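Written out, this gradient-scaled comparison takes a form equivalent to that given by Bakai et al. [14] (notation as for γ, with ∇D_r the local gradient of the reference dose):

```latex
\chi(\vec{r}) =
  \frac{D_e(\vec{r}) - D_r(\vec{r})}
       {\sqrt{\Delta D_M^{2} + \Delta d_M^{2}\,\lvert\nabla D_r(\vec{r})\rvert^{2}}}
```

The criterion |χ| ≤ 1 then plays the role of γ ≤ 1; unlike γ, χ retains the sign of the dose difference.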
The difference between γ and χ evaluations is illustrated in figure 1. Discrete points on the reference curve are systematically compared (see figure 1a) to discrete points on the evaluation curve.
In the figure, test point A on the evaluation curve lies within the acceptance region for reference point A.
Test point B does not lie within the acceptance region for reference point B, but it does lie within the acceptance region for reference point C. This illustrates that the γ evaluation must search multiple test points to find the minimum Γ; when calculating γ over a volume, this search can become computationally expensive. The χ evaluation avoids this search and, because it retains the sign of the dose difference, can indicate whether a failure is due to over- or under-dosing (e.g., under-dosing in the region of an organ at risk). Another advantage of the χ evaluation is that one can assign different dose and distance limits in different regions of the dose distributions being compared, and can also allow the dose criteria in the positive and negative directions to differ. This would enable one to apply tighter dose and/or distance criteria to the region around an organ of particular concern and looser criteria elsewhere.
The χ evaluation does have some limitations. Firstly, it requires that the points on the evaluated distribution are spatially matched to the reference points. This means that the data for the evaluated dose distribution may need to be interpolated, with the potential for some distortion. Secondly, the extended dose limits (i.e., the acceptance regime outlined by the red curves in figure 1b) can be distorted in regions where the dose gradient changes rapidly (where the second derivative is large). The result is that χ values near the shoulder portion of an evaluated curve may not be calculated accurately. In spite of these limitations, the χ test is a strong alternative to the traditional γ evaluation and it has recently been gaining broader clinical use.
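As a concrete 1D sketch of the gradient-scaled comparison, assuming both profiles have already been matched onto a common grid (the function name and the use of a numerical gradient are our illustration, not prescribed by [14]):

```python
import numpy as np

def chi_1d(d_ref, d_eval, dx, dose_tol, dist_tol):
    """Signed chi comparison of two 1D dose profiles on a common grid.

    d_ref, d_eval : reference and evaluated doses (same units).
    dx            : grid spacing.
    dose_tol, dist_tol : the Delta-D_M and Delta-d_M criteria,
                         in dose and distance units respectively.
    """
    grad = np.gradient(d_ref, dx)            # local reference dose gradient
    # dose tolerance is widened where the reference gradient is steep
    scale = np.sqrt(dose_tol**2 + (dist_tol * grad)**2)
    return (d_eval - d_ref) / scale          # sign shows over-/under-dose
```

On a flat reference profile the gradient term vanishes and χ reduces to the simple dose difference in units of the dose criterion.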

Practical Considerations I: discretization and spatial resolution
While Low et al. presented a powerful theoretical concept for the evaluations based on continuous functions [9], real clinical comparisons are typically made between discretized representations of dose distributions, often with the reference and evaluated dose data sampled at different spatial resolutions. The importance of resolution was first analysed by Depuydt et al. [12] in their clinical assessment of the γ evaluation. They were particularly concerned with overestimations of γ values caused by large grid spacing in the discrete dose distributions, particularly in regions of high dose gradient. To avoid such overestimates, they effectively reduced the continuous value of γ to a binary test equivalent to the composite evaluation [8]. Several other studies have been directed, in part, towards resolving this issue [14,17,18]. Low and Dempsey [13] noted that by re-sampling the distributions to 1 × 1 mm² grids, the error in γ associated with pixelization artifacts was reduced to less than 0.2, even in regions of high dose gradient. Others have minimized discretization artifacts by interpolating additional data points in the evaluated grid [17].
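The re-sampling step itself is straightforward; in 1D, linear interpolation onto a finer grid might look like the following sketch (names are ours):

```python
import numpy as np

def resample_1d(dose, dx_old, dx_new):
    """Linearly interpolate a 1D dose profile onto a finer grid."""
    x_old = np.arange(len(dose)) * dx_old
    # small epsilon so the final original sample point is retained
    x_new = np.arange(0.0, x_old[-1] + 1e-9, dx_new)
    return np.interp(x_new, x_old, dose)
```

In 2D or 3D the same idea applies dimension by dimension (e.g., with scipy.ndimage.zoom), at the memory and time cost discussed below.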
The effect of resolution on the resulting γ distributions is illustrated in figure 3, showing how the γ comparison maps and the observed pass rate (a common parameter used to roughly summarize the complex γ dose distribution comparison by reporting the percentage of points in the dose distribution that pass the specified γ criteria Δd_M and ΔD_M) change with spatial sampling (the bottom rows corresponding to different resolutions). When the film data (fixed at a resolution of 0.24 mm) are taken as the reference doses, and the Eclipse plan is set as the evaluated distribution, increasing the resolution of the evaluated distribution (from 2.5 to 0.24 mm) changes the pass rate from 80.9% to 91.3%. This result can be explained by the behaviour of the γ search. When the evaluated distribution has a coarse pixel size compared to the reference distribution, many reference pixels fall a significant distance from the nearest evaluated pixel. Thus, the γ value for many reference pixels reflects significant spatial misalignment purely as an artefact of the coarse evaluated resolution, which appears as the grid lines in the low-dose region of figure 3(c). When the resolution of the evaluated distribution is increased to match that of the reference distribution, this spatial artefact is eliminated since each reference point has a directly corresponding pixel in the evaluated distribution. Increasing the evaluated resolution also provides each reference point with a greater range of dose values for comparison, further increasing the likelihood of finding a set of pixels that pass the gamma test. Note, however, that the change of resolution in the Eclipse plan does not significantly change the pass rate if the Eclipse plan is the reference distribution. This is discussed further below.
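The pass rate quoted above is simply the percentage of γ values at or below the criterion; a minimal helper (our own naming) makes the definition explicit:

```python
import numpy as np

def pass_rate(gamma_map, threshold=1.0):
    """Percentage of points in a gamma map with gamma <= threshold."""
    g = np.asarray(gamma_map, dtype=float)
    return 100.0 * np.count_nonzero(g <= threshold) / g.size
```

A single-number summary of this kind necessarily discards the spatial information in the full γ map, which is why the comparison maps themselves are shown in figure 3.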
The requirement for interpolation adds a significant computational burden to γ evaluations [17], and the computation time grows as the third power of the grid resolution [18]. The computational speed of the algorithm can be enhanced by pre-calculating interpolation factors [17]. Another approach to reducing the evaluation time is to restrict the search around each reference point [15,17,18,19]. For example, one can limit searches a priori so that points that "do not have a chance" of yielding the smallest value of Γ are eliminated: as soon as Δd/Δd_M becomes larger than the smallest Γ found so far, the search is terminated. Others [18] have developed geometric approaches to determining the smallest Γ.
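The search-restriction idea can be sketched in 1D as follows (a brute-force illustration assuming both profiles share one grid of spacing dx; not any publication's reference implementation). The search moves outward from each reference point and stops as soon as the purely spatial term alone exceeds the best Γ found so far:

```python
import numpy as np

def gamma_1d(d_ref, d_eval, dx, dose_tol, dist_tol):
    """1D gamma evaluation with distance-based early termination.

    d_ref, d_eval : dose samples on a common grid of spacing dx.
    dose_tol, dist_tol : Delta-D_M and Delta-d_M criteria, in dose
                         units and distance units respectively.
    """
    n = len(d_eval)
    gamma = np.empty(len(d_ref))
    for i, dr in enumerate(d_ref):
        best = np.inf  # smallest squared Gamma found so far
        for off in range(n):
            dist2 = (off * dx / dist_tol) ** 2
            if dist2 >= best:
                break  # no farther point can yield a smaller Gamma
            for j in {i - off, i + off}:
                if 0 <= j < n:
                    dose2 = ((d_eval[j] - dr) / dose_tol) ** 2
                    best = min(best, dist2 + dose2)
        gamma[i] = np.sqrt(best)
    return gamma
```

For identical distributions the result is zero everywhere; a pure dose offset of half the dose criterion on a flat profile yields γ = 0.5 at every point.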

Practical Considerations III: the role of reference and evaluation distributions
In the previous discussion of figure 3, it was noted that the results of the comparison of two dose distributions are affected by the designation of the distributions as the reference or the evaluated data set.
Neither the γ evaluation nor the DTA map is symmetric with respect to the distributions being compared.
Before this is discussed in detail, it is important to note a change of notation introduced by Low and Dempsey [13] to recognize that in clinical practice both distributions may be calculated (e.g., output from two different planning systems), or both may come from separate measurements (e.g., on alternative dosimetric systems being evaluated for performance). For example, it may be necessary to perform comparisons between Monte Carlo computations and calculations made by commercial planning software, or between film measurements. To avoid confusion, the terms 'reference' and 'evaluated' distributions were therefore adopted to replace 'measured' and 'calculated' distributions, respectively. This notation has become widely adopted [15,17,18]. The convention differs from that used historically, for example, during the development of the DTA and dose difference tools in the evaluation of electron beam dose calculation software [10]. In current usage it is usually understood that the dose comparison measures the extent to which an evaluated distribution agrees with a reference distribution, which is treated as the true distribution (though, strictly speaking, obtaining a true distribution may not be possible).
That the designation of which distribution is set as the reference can affect the results of the γ comparison was illustrated previously in figure 3, where the role of the Eclipse dose distribution is switched in the bottom columns. This is also true for the χ comparisons (see figure 4). As noted above, part of the change in the γ comparisons can be related to differences in resolution. Noise in the data sets also contributes to the differences [7,13].
In the examples shown in figures 3 and 4, the film dose distribution data are noisier and the Eclipse data smoother. In figure 3 the side-by-side comparison maps (a)-(b) and (c)-(d) show substantial increases in pass rate when the noisy film data form the evaluated dose distribution. This result stems from the search algorithm of the γ function, which compares points in the reference distribution to all of the nearby points in the evaluated distribution. When the evaluated distribution contains noise, a greater range of dose values is available for comparison with a given reference point, so the likelihood of finding an evaluated point that satisfies the dose and distance criteria increases. When the roles of the two distributions are reversed, the opposite is true, as extreme dose values from the film distribution are compared to a narrow range of nearby dose values in the smooth Eclipse distribution. Note that the agreement breaks down as the DTA criterion becomes smaller than the inherent resolution of the reference dose distribution.
The discrepancy in pass rate and appearance between the two χ distributions (in figure 4) highlights the effect of noise in the reference distribution. The noisy film measurement, when assigned to the role