Analysis of dose comparison techniques for patient‐specific quality assurance in radiation therapy

Abstract
Purpose: Gamma evaluation is the most commonly used technique for comparison of dose distributions for patient‐specific pretreatment quality assurance in radiation therapy. Alternative dose comparison techniques have been developed but not widely implemented. This study aimed to compare and evaluate the performance of several previously published alternatives to the gamma evaluation technique, by systematically evaluating a large number of patient‐specific quality assurance results.
Methods: The agreement indices (or pass rates) for global and local gamma evaluation, maximum allowed dose difference (MADD) and divide and conquer (D&C) techniques were calculated using a selection of acceptance criteria for 429 patient‐specific pretreatment quality assurance measurements. Regression analysis was used to quantify the similarity of behavior of each technique, to determine whether possible variations in sensitivity might be present.
Results: The results demonstrated that the behavior of D&C gamma analysis and MADD box analysis differs from that of the other dose comparison techniques, whereas MADD gamma analysis exhibits similar performance to the standard global gamma analysis. Local gamma analysis had the least variation in behavior with criteria selection. Agreement indices calculated for 2%/2 mm and 2%/3 mm, and for 3%/2 mm and 3%/3 mm, were correlated for most comparison techniques.
Conclusion: Radiation oncology treatment centers looking to compare different dose comparison techniques, criteria or lower dose thresholds may apply the results of this study to estimate the expected change in calculated agreement indices and possible variation in sensitivity to delivery dose errors.

(TPS) dose calculation limitations (accuracy of beam modeling and the algorithm itself) are two of the multiple factors that can introduce disagreement between planned and delivered dose, which impact the accuracy of treatment delivery.
The most common form of PSQA involves the comparison of TPS dose calculations with 2D or 3D dose measurements. [2][3][4][5] The gamma evaluation method (also known as gamma analysis, or gamma index analysis), developed by Low et al., 6,7 is widely used to compare such measurements. 5 This technique quantitatively compares an evaluated (usually measured) dose distribution with a reference (usually calculated) dose distribution by calculating a gamma value for each point, defined as the minimum Euclidean distance between the two distributions in the combined dose-spatial domain, with each axis scaled by its acceptance criterion. The agreement between evaluated and reference dose distributions is calculated using two acceptance criteria: dose difference, ΔD, in %; and distance-to-agreement, DTA, in mm.
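As a rough illustration of the calculation described above, the following Python sketch evaluates a brute-force 1D gamma index and the resulting pass rate. It is not the in-house implementation used in this study. The function name `gamma_1d`, its parameters, the choice to report gamma at each reference point while searching the evaluated distribution, and the `local` switch (which takes ΔD as a percentage of the dose at each point rather than of the maximum dose) are all assumptions made for the example.

```python
import math


def gamma_1d(ref_pos, ref_dose, eval_pos, eval_dose,
             dd_pct, dta_mm, local=False, ldt_pct=10.0):
    """Brute-force 1D gamma sketch (hypothetical helper, not the study code).

    Positions are in mm and doses in arbitrary but consistent units.
    Returns (list of gamma values, pass rate in %).
    ldt_pct must be > 0 when local=True so the local tolerance is never zero.
    """
    d_max = max(ref_dose)
    gammas = []
    for r_r, d_r in zip(ref_pos, ref_dose):
        # Skip points below the lower dose threshold (percentage of maximum).
        if d_r < ldt_pct / 100.0 * d_max:
            continue
        # Dose tolerance: percentage of the point dose (local) or maximum (global).
        norm = d_r if local else d_max
        dd = dd_pct / 100.0 * norm
        # Minimum scaled Euclidean distance in the dose-spatial domain.
        g = min(
            math.sqrt(((r_e - r_r) / dta_mm) ** 2 + ((d_e - d_r) / dd) ** 2)
            for r_e, d_e in zip(eval_pos, eval_dose)
        )
        gammas.append(g)
    pass_rate = 100.0 * sum(g <= 1.0 for g in gammas) / len(gammas)
    return gammas, pass_rate


# Illustrative data only: a 1 mm grid with a uniform 2.5% dose scaling error.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
planned = [10.0, 50.0, 100.0, 50.0, 10.0]
measured = [d * 1.025 for d in planned]

_, global_rate = gamma_1d(positions, planned, positions, measured, 2.0, 2.0)
_, local_rate = gamma_1d(positions, planned, positions, measured, 2.0, 2.0,
                         local=True)
```

In this toy case the local tolerance is never larger than the global one, so the local pass rate cannot exceed the global pass rate for the same criteria and threshold.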
Gamma analysis assigns a gamma index value to each individual point; points with gamma index values ≤1 pass, and all other points fail. The percentage of passing points in the gamma distribution is referred to as the gamma pass rate (or %GP), which users can apply to establish action levels. Surveys of PSQA practices have reported that ΔD and DTA criteria of 3%/3 mm are the most frequently used. [2][3][4] Global gamma evaluation normalizes the percent differences for every point to a single global value, usually the maximum planned dose, whilst local gamma evaluation normalizes the percent differences for every point to the expected dose at each point. Thus, the %GP calculated by global gamma will always be higher than or equal to that calculated by local gamma, where the same criteria and lower dose threshold (LDT) are used. 8
The use of global gamma evaluation for PSQA using ΔD and DTA criteria of 3%/3 mm has been questioned due to reported poor sensitivity and specificity to delivery errors [9][10][11][12][13][14][15] and clinically relevant patient dose errors, 11,16,17 and a lack of clinical intuitiveness. 17,18 Some authors have proposed DVH-based QA metrics as a response to these criticisms. 17 A number of studies have proposed or evaluated alternative dose comparison techniques, or assessed variation in the behavior of local and global gamma evaluation with different criteria and LDTs. 19,20
Jiang et al. 18 proposed the maximum allowed dose difference (MADD) technique, in which a distance-to-agreement criterion is converted to a dose difference by multiplying it by the dose gradient at the point of interest. This DTA-equivalent dose criterion is combined with ΔD to determine MADD (by summation for the box calculation, MADD_b, or by summation in quadrature for the gamma calculation, MADD_γ).
The difference between dose distributions at the point of interest can then be normalized by the local MADD (as a "normalized dose difference"), providing an index in which values ≤1 indicate agreement.
The stated advantage of this method over the gamma method is that it is accurate, simple, clinically intuitive, and insensitive to dose grid resolution.
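The MADD construction described above can be sketched in a few lines of Python. This is a minimal 1D illustration that follows only the description given here, not the full algorithm of Jiang et al. The function name `madd_ndd`, the finite-difference gradient estimate, the assumption that both distributions share one grid, and the use of an absolute (already normalized) ΔD are all choices made for the example.

```python
import math


def madd_ndd(ref_dose, eval_dose, spacing_mm, dd, dta_mm, mode="gamma"):
    """Normalized dose difference using MADD (hypothetical 1D sketch).

    ref_dose, eval_dose: doses on the same regular grid (at least 2 points).
    dd: dose-difference criterion in the same absolute units as the doses.
    mode: "box" sums the two tolerances, "gamma" adds them in quadrature.
    """
    n = len(ref_dose)
    ndd = []
    for i in range(n):
        # Central difference in the interior, one-sided at the two edges.
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        grad = abs(ref_dose[hi] - ref_dose[lo]) / ((hi - lo) * spacing_mm)
        # Convert the DTA criterion to a dose tolerance via the local gradient.
        dta_equiv = dta_mm * grad
        if mode == "box":
            madd = dd + dta_equiv
        else:
            madd = math.hypot(dd, dta_equiv)
        ndd.append(abs(eval_dose[i] - ref_dose[i]) / madd)
    return ndd


# Illustrative data only: a steep linear gradient with a constant offset.
ramp = [0.0, 10.0, 20.0, 30.0]
shifted = [d + 31.0 for d in ramp]
ndd_box = madd_ndd(ramp, shifted, 1.0, 3.0, 3.0, mode="box")
ndd_gamma = madd_ndd(ramp, shifted, 1.0, 3.0, 3.0, mode="gamma")
```

Because a plain sum is never smaller than a sum in quadrature, the box tolerance is always at least as permissive as the gamma tolerance; in this toy example every point passes the box test and fails the gamma test.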
Stojadinovic et al. 14 proposed the divide and conquer (D&C) gamma method, in which the determination of agreement between dose distributions depends on the dose region: a high dose (HD) region within the 90% isodose, a high gradient (HG) region between the 90% and 50% isodoses, a medium dose (MD) region between the 50% and 20% isodoses, and a low dose (LD) region between the 20% and 10% isodoses. Significant differences in behavior were reported when D&C results were compared to conventional gamma evaluation for a dataset containing 50 PSQA measurements. 14 This method has challenged the adequacy of conventional IMRT QA programs, and the authors concluded that a better paradigm would be needed to standardize IMRT QA practices. The advantage of the D&C method over the gamma method is that, by analyzing four distinct regions separately, the agreement between calculations and measurements can be characterized more reasonably, without combining regions of high and low dose gradients.
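The region definitions above lend themselves to a simple lookup. The Python sketch below assigns each point of a dose distribution to one of the four D&C regions. The function names, the normalization to the maximum dose of the supplied distribution, and the treatment of boundary values (each lower bound is inclusive) are assumptions for illustration, since the text does not specify them.

```python
def dc_region(dose, max_dose):
    """Return the D&C region label for one dose value, or None below 10%.

    Boundary handling (inclusive lower bounds) is an assumption.
    """
    frac = dose / max_dose
    if frac >= 0.90:
        return "HD"  # high dose: within the 90% isodose
    if frac >= 0.50:
        return "HG"  # high gradient: between the 90% and 50% isodoses
    if frac >= 0.20:
        return "MD"  # medium dose: between the 50% and 20% isodoses
    if frac >= 0.10:
        return "LD"  # low dose: between the 20% and 10% isodoses
    return None      # below 10%: excluded from the analysis


def split_by_region(doses):
    """Group point indices by D&C region, normalizing to the maximum dose."""
    max_dose = max(doses)
    regions = {"HD": [], "HG": [], "MD": [], "LD": []}
    for i, d in enumerate(doses):
        label = dc_region(d, max_dose)
        if label is not None:
            regions[label].append(i)
    return regions


# Illustrative data only.
example_regions = split_by_region([100.0, 95.0, 70.0, 35.0, 15.0, 5.0])
```

An agreement index could then be computed separately for the points in each region, using the dose criterion assigned to that region.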
Some studies have analyzed the effect of induced errors on PSQA results obtained using the gamma method. 21,22 Other studies have characterized the effect of ΔD and DTA criteria selection on gamma agreement indices. Crowe et al. 23 reported that global gamma agreement indices calculated using 2%/3 mm, 3%/2 mm, and 3%/3 mm were correlated with each other, suggesting that these criteria would produce similar PSQA results (or similar sensitivity and specificity) if action thresholds were adjusted accordingly. Recommendations were provided for radiation oncology treatment centers intending to transition to tighter global gamma evaluation ΔD and DTA criteria. 23 However, although optimal gamma parameters and the performance of alternative metrics have been tested in the literature, 19,20,24 few studies 8  plan is defined simply by the numerical pass rate and our institutional acceptance criteria: most "failed" plans in this study were considered clinically acceptable when reviewed by a Radiation Oncologist.

| MATERIALS AND METHODS
The global gamma analysis, local gamma analysis, D&C and MADD methods were implemented in MATLAB version R2015b (MathWorks, Massachusetts, USA), following the original descriptions by Low et al., 6,7 Stojadinovic et al. 14 and Jiang et al., 18 respectively. The implementation of the gamma analysis and MADD methods was validated using data described by Low and Dempsey. 7,18 The in-house software was designed to iterate through routinely prepared PSQA directories containing TPS calculated dose distributions, converted to the ".snc" format using Sun Nuclear SNC Patient Software version 6.2.2 (Sun Nuclear Corporation, Melbourne, USA), and ArcCheck measured dose distributions, in ".txt" format. The lower resolution (1 cm) ArcCheck measured dose distribution was compared with the higher resolution (1 mm) TPS calculated dose distribution without interpolation. The VMAT arcs were measured in absolute dose, whereas the HT beams were measured in relative dose. Normalization was performed at dose maximum.
An overview of the QA measurements selected for this study is shown in Table 1. This cohort included both measurements that passed departmental PSQA and those that failed. The passed measurements are those that produced global %GPs ≥95% at 2%/2 mm and an LDT of 5% with Measurement Uncertainty Corrections turned on in the SNC Patient Software. The Measurement Uncertainty Correction is a default option in the SNC Patient Software intended to compensate for presumed sources of measurement uncertainty that potentially decrease the calculated pass rate. It typically adds about 1%-2% to the user-defined percentage dose difference tolerance. By applying the Measurement Uncertainty Correction, the user essentially loosens the gamma comparison criteria from 2%/2 mm to 3%/2 mm, which is the recommendation from TG-218. The failed measurements are those that produced global %GPs <95% under the same conditions. Two hundred and sixty-two VMAT beams were analyzed, including 230 passed and 32 failed beams. The work was repeated for 167 HT plans, including 142 passed and 25 failed plans (see appendix).
These criteria were selected based on local and survey-reported practices. 2,4,23 D&C agreement indices were calculated using the same criteria pairs as were used for the gamma and MADD evaluations, with the ΔD criteria replaced as summarized in Table 2. The dose criterion for each dose region was selected to be approximately equal in terms of "absolute" dose. That is, the criteria were chosen such that the local dose difference in each region corresponded to approximately the same global dose difference: a 7% dose difference in the HG region (centered around the 70% isodose) is about equal to a 5% difference in maximum dose (0.7 × 0.07 = 0.049). This is why, for example, 5%, 7%, 10%, and 15% were used for HD, HG, MD, and LD in one case.
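The arithmetic behind this choice can be written as a one-line conversion. The Python sketch below reproduces only the HG example given in the text. The helper name `global_equivalent` is hypothetical, and no representative isodose levels are assumed for the other three regions because the text does not state them.

```python
def global_equivalent(local_pct, isodose_fraction):
    """Convert a local dose criterion (%) at a representative isodose level
    into the equivalent percentage of the maximum dose (hypothetical helper)."""
    return local_pct * isodose_fraction


# HG example from the text: 7% at the 70% isodose is about 4.9% of maximum dose.
hg_global = global_equivalent(7.0, 0.70)
```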
The relationships between agreement indices calculated using varying LDT for local and global gamma analysis, calculated using varying criteria for each dose comparison technique, and calculated using varying dose comparison techniques for each set of criteria were quantified using ordinary least squares regression. Correlation (or similarity in behavior, in terms of identifying results as demonstrating high or low agreement) was assessed using coefficients of determination R² (representing the variation in the dependent variable that can be explained by variation in the independent variable) and P-values.
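The regression step can be illustrated with a short, dependency-free Python sketch. It fits an ordinary least squares line between two sets of agreement indices and reports R², and it also shows the standard Šidák adjustment of a significance level for multiple comparisons. The function names, the sample pass rates, and the number of comparisons are hypothetical; the P-value calculation itself is omitted because it requires a t-distribution that the Python standard library does not provide.

```python
def ols_r2(x, y):
    """Ordinary least squares fit of y on x.

    Returns (slope, intercept, R squared). Requires variation in both x and y.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R squared is the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def sidak_alpha(alpha, m):
    """Per-comparison significance level after Sidak correction for m tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


# Hypothetical pass rates (%) for the same measurements under two criteria.
rates_a = [90.0, 92.0, 95.0, 98.0, 100.0]
rates_b = [80.0, 84.0, 90.0, 96.0, 100.0]
slope, intercept, r2 = ols_r2(rates_a, rates_b)

# Hypothetical family of 10 comparisons at an overall alpha of 0.05.
alpha_corrected = sidak_alpha(0.05, 10)
```

A relationship would then be called significant only if its P-value did not exceed the corrected level, which for ten comparisons is roughly a tenth of the nominal 0.05.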

| RESULTS
This section includes only the results from the VMAT plans. The HT results can be found in the appendix. Table 3 shows the mean agreement indices that resulted from evaluating 262 VMAT PSQA measurements using the various comparison methods investigated in this study. Table 4 summarizes the correlation between 5% and 10% LDT-calculated %GPs for VMAT plans (P ≤ α in all cases). Calculated R² values and Šidák-corrected significance results for relationships between dose comparison techniques for 2%/2 mm are presented in
• Applying stricter gamma criteria results in a higher standard deviation of the data.
• The global gamma evaluation technique behaves similarly across the various gamma criteria, regardless of LDT.
• The correlation between 5% and 10% LDT in local gamma calcu-
• The D&C technique or the local gamma evaluation may exhibit increased sensitivity to dose errors 11,12 and thus may be preferable for identifying undesirable plans. According to Table 3, a 1%-10% decrease in pass rate may be expected for the criteria of 3%/3 mm based on the data presented in this work.

| CONCLUSIONS
Table A1 shows the mean agreement indices that resulted from evaluating 167 HT PSQA measurements using the various comparison methods investigated in this study. There is more spread in the data for HT plans than for VMAT plans for all techniques and criteria. For tighter criteria, there is more spread in the data for both treatment modalities. that large percentage dose differences were more prevalent in low dose regions. The behavior variation between the HT and VMAT cohorts possibly suggests a difference in low dose calculation or delivery accuracy between the two treatment modalities. (Tables A3 and A4) Correlation was poor for most relationships in VMAT. Better overall correlation was observed for HT plans, with MADD_b vs. MADD_γ being the highest (Table A5; Fig A1).