Evaluation of the accuracy of deformable image registration on MRI with a physical phantom

Abstract

Background and purpose:
Magnetic resonance imaging (MRI) has gained popularity in radiation therapy simulation because it provides superior soft tissue contrast, which facilitates more accurate target delineation compared with computed tomography (CT) and does not expose the patient to ionizing radiation. However, image registration errors in commercial software have not been widely reported. Here we evaluated the accuracy of deformable image registration (DIR) by using a physical phantom for MRI.

Methods and materials:
We used the “Wuphantom” for end-to-end testing of DIR accuracy for MRI. This acrylic phantom is filled with water and includes several fillable inserts to simulate various tissue shapes and properties. Deformations and changes in anatomic locations are simulated by changing the rotations of the phantom and inserts. We used Varian Velocity DIR software (v4.0) and CT (head and neck protocol) and MR (T1- and T2-weighted head protocol) images to test DIR accuracy between image modalities (MRI vs CT) and within the same image modality (MRI vs MRI) in 11 rotation deformation scenarios. Large inserts filled with Mobil DTE oil were used to simulate fatty tissue, and small inserts filled with agarose gel were used to simulate tissues slightly denser than water (e.g., prostate). Contours of all inserts were generated before DIR to provide a baseline for contour size and shape. DIR was done with the MR Correctable Deformable DIR method, and all deformed contours were compared with the original contours. The Dice similarity coefficient (DSC) and mean distance to agreement (MDA) were used to quantitatively validate DIR accuracy. We also used large and small regions of interest (ROIs) during between-modality DIR tests to simulate validation of DIR accuracy for organs at risk (OARs) and propagation of individual clinical target volume (CTV) contours.

Results:
No significant differences in DIR accuracy were found for T1:T1 and T2:T2 comparisons (P > 0.05). DIR was less accurate for between-modality comparisons than for same-modality comparisons, and was less accurate for T1 vs CT than for T2 vs CT (P < 0.001). For between-modality comparisons, use of a small ROI improved DIR accuracy for both T1 and T2 images.

Conclusion:
The simple design of the Wuphantom allows seamless testing of DIR; here we validated the accuracy of MRI DIR in end-to-end testing. T2 images had superior DIR accuracy compared with T1 images. Use of small ROIs improves DIR accuracy for target contour propagation.

before and during treatment is used to create a foundation for delineating targets 1 to judge the need for adaptive radiation therapy.
However, DIR errors in commercial software used with MRI have not been widely reported.
Use of image registration to improve the assessment of disease response is an area of active research. 2 Notably, the magnitude of changes in tumor location and shape in actual patients over the course of treatment can be substantial or minimal, 3,4 with the accuracy of DIR varying accordingly. Moreover, registration errors made at an individual fraction of treatment affect the delivery of that fraction only, whereas systematic errors (including operator error) affect the delivery of all treatment fractions. 5 Tyran et al., 6 in evaluating the reliability of an MR-guided online adaptive radiation therapy decision-making process, concluded that daily review was not reliable for determining the need for adaptive radiation therapy and argued that an online predicted plan, based on deformed and manually adjusted contours, should be generated for every fraction. Therefore, adequate DIR accuracy for propagating contours is essential to ensure the proper use of online adaptive radiation therapy.
Ger et al. evaluated a DIR system with synthetic images derived from patient longitudinal deformations and a porcine phantom with implanted markers. 7 Tait et al. investigated the use of DIR in gynecologic brachytherapy to combine MRI guidance and CT-based planning for optimizing placement of brachytherapy sources. In that study, DIR provided MRI guidance for CT-based planning, which facilitated improved target volume delineation and dose escalation while minimizing toxicity to surrounding organs at risk. 8 The stability of DIR for clinical purposes is affected by several factors, including image registration algorithms, input image quality, and regularization methods. 5 The quality of image registration can be affected by other factors as well, including user experience and method. 9 An MRI-compatible phantom is needed that is sophisticated enough for "bench-  11 Niebuhr et al. developed the ADAMpelvis phantom, which is anthropomorphic, deformable, and multimodal. 12 However, to date no standard physical phantom has been developed that allows end-to-end testing of the accuracy of DIR in MRI-guided simulation and treatment delivery.
Regarding digital phantoms, the American Association of Physicists in Medicine (AAPM) task group report 132 5
The aim of this study was to use a previously reported physical phantom to evaluate the accuracy of DIR in T1-weighted (T1) and T2-weighted (T2) MRI and CT, with comparisons made within the same imaging modality (e.g., T1 vs T1 and T2 vs T2) and between different modalities (inter-imaging modalities; e.g., T1 vs CT and T2 vs CT). We also assessed the accuracy of DIR between modalities by varying the sizes of regions of interest (ROIs) in the phantom volume. Our goal in this study was to demonstrate a method of benchmarking a DIR system to ensure the accuracy of contour propagation for adaptive radiation therapy.

2.A | Phantom design
The physical Wuphantom (US patent pending) was used previously to test the accuracy of DIR for CT and for CBCT. 13,14 This acrylic phantom includes a variety of inserts that simulate different tissue shapes and properties (Fig. 1). For MRI testing, the base of the Wuphantom was filled with water to make it visible on MRI. Deformations and changes in tumor locations are simulated by changing the rotations of both the phantom and its inserts. Three large cavity inserts were created in different shapes (circle, oval, and irregular) to simulate contours deformed from the original (baseline) shape (the circle). Both large and small circular inserts can be rotated to different degrees to mimic location changes in the X and Y directions for translation testing. For DIR testing, the inserts were rotated to simulate contour changes in shape and location compared with the circle, which is usually used as the reference. Each insert contained 27.4 mL of 5% (w/v) agarose gel, to simulate tissues that are slightly denser than water. The image contrast was varied by using different agarose gel concentrations (0%, 0.5%, 1.0%, 2.0%, and 4.0%). The density of the 4.0% agarose gel is similar to that of prostate tissue (derived density of 1.036 g/mL), and its visualization characteristics on MRI are similar to those of prostate tissue. 15 Although many other materials can be used to simulate human tissues under MRI, 16,17 we focused here on Mobil DTE oil and agarose gel to facilitate the study of low-contrast subjects.
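The way an insert rotation translates into an in-plane location change can be sketched with a plain 2D rotation. This is only an illustration under stated assumptions: the target is taken to sit at a hypothetical offset from the insert's rotation axis, and the function names, the offset, and the angles are not taken from the phantom's specification.

```python
import math

def rotate_about_axis(point, angle_deg, center=(0.0, 0.0)):
    """Rotate an (x, y) point counterclockwise about `center` by `angle_deg`."""
    t = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(t) - dy * math.sin(t),
            center[1] + dx * math.sin(t) + dy * math.cos(t))

def displacement(point, angle_deg, center=(0.0, 0.0)):
    """X and Y shift of `point` produced by rotating the insert."""
    x, y = rotate_about_axis(point, angle_deg, center)
    return (x - point[0], y - point[1])

# Hypothetical target 10 mm off the rotation axis; angles are illustrative.
for angle in (0, 15, 30, 90):
    dx, dy = displacement((10.0, 0.0), angle)
    print(f"{angle:3d} deg -> dx = {dx:7.3f} mm, dy = {dy:7.3f} mm")
```

For a target at radius r from the axis, the magnitude of the shift is 2r sin(θ/2), so small rotations produce small, nearly tangential translations.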

2.B | MR and CT image acquisition
MRI scans were obtained with a 1.5 T MRI Siemens MAGNETOM Aera scanner (Siemens, Inc., USA) with 8 × 2-element flat head coils and a flat insert table. We selected 130 slices with an axial field of view of 25.6 cm and a superior-inferior slice direction (slice thickness of 2 mm) to cover the spatial region encompassing the entire phantom volume. CT images were also acquired with a Siemens Definition Edge CT scanner. Image acquisition parameters for CT and MRI are given in Table 1. All of the MR and CT images were transferred to a Velocity Workstation version 4.0 (Varian Medical Systems, Palo Alto, CA, USA).
To acquire reference images for both CT and MRI, the Wuphantom was placed on the base with 0° tilt and rotation, and the alignment marks (insert rotation, phantom tilting, and rotation) were set at 0°. The circular insert in the left large cavity was filled with DTE oil, and the smaller circular insert in the right cavity was filled with agarose gel. Images for DIR accuracy tests were acquired by replacing the circular insert with the oval or irregularly shaped inserts, and the inserts were rotated to different degrees to simulate location changes. We used 11 combined shape-deformation scenarios to simulate both object deformation and location changes (Table 2). Sample images showed that the contrast inserts simulating the tumor and surrounding tissue in the Wuphantom were distinguishable on all CT and T1 and T2 MR images (Fig. 2).

2.C | Image registration
Image registration was done with Varian Velocity DIR software (version 4.0). Images were first registered using manual alignment by shifting and rotating the secondary image. Next, an ROI was drawn to encompass the whole phantom. Within this ROI, images were aligned first using Velocity rigid registration, which uses mutual

2.D | Contour propagation
Before DIR, contours were delineated for the large and small inserts [Fig. 3(a), (c), (e)] with a predefined threshold for both CT and MRI.
This provides a "ground truth" for contours of various shapes for quantitative validation. After DIR, all of the contours were propa-

2.E | DIR accuracy
Quantitative comparisons of the contours can be done with several metrics. Two commonly used approaches are the Dice similarity coefficient (DSC) 18 and mean distance to agreement (MDA), 19
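As a minimal illustration of these two metrics (not the Velocity implementation), the sketch below computes DSC = 2|A ∩ B| / (|A| + |B|) and a symmetric mean boundary-to-boundary distance for two binary contours represented as sets of pixel coordinates. The helper names, the 4-connected boundary definition, the symmetric averaging, the `spacing` parameter, and the box-shaped test contours are all assumptions made for illustration; the text does not specify how the software computes MDA.

```python
import math

def dice(a, b):
    """Dice similarity coefficient of two pixel sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def boundary(mask):
    """Pixels of `mask` that have at least one 4-connected neighbor outside it."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(x, y) for (x, y) in mask
            if any((x + dx, y + dy) not in mask for dx, dy in steps)}

def mean_distance_to_agreement(a, b, spacing=1.0):
    """Symmetric mean nearest-boundary distance (assumed MDA definition).

    `spacing` is the isotropic pixel size; brute force, so small masks only.
    """
    ba, bb = boundary(a), boundary(b)
    if not ba or not bb:
        raise ValueError("both contours must be non-empty")

    def one_way(src, dst):
        return [min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in dst)
                for p in src]

    d = one_way(ba, bb) + one_way(bb, ba)
    return spacing * sum(d) / len(d)

def box(x0, y0, width, height):
    """Filled rectangle of pixels, used here as a stand-in contour."""
    return {(x, y) for x in range(x0, x0 + width)
                   for y in range(y0, y0 + height)}

# Hypothetical contours: a reference box and a copy shifted by 2 pixels in X.
reference = box(0, 0, 10, 10)
propagated = box(2, 0, 10, 10)
print("DSC:", dice(reference, propagated))
print("MDA:", mean_distance_to_agreement(reference, propagated, spacing=1.0))
```

Identical contours give DSC = 1 and MDA = 0; DSC falls and MDA rises as the propagated contour drifts from the reference.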

2.F | Statistical analysis
All DSC and MDA data were compared between the MRI and CT scans in paired sample analyses; Wilcoxon matched-pair nonparametric tests 20 were used to evaluate differences between MRI and CT registration. A probability value of P ≤ 0.05 was considered statistically significant. All statistical analyses were calculated using R statistical software (R Foundation for Statistical Computing, Vienna, Austria).
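The paired comparison described above can be illustrated with a small standard-library sketch of an exact two-sided Wilcoxon signed-rank test. This is not the R code used in the study: the function name, the handling of zeros and ties, and the sample values (hypothetical DSC × 100 figures for 11 scenarios) are assumptions for illustration only.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped and tied absolute differences
    share their average rank. Full enumeration: practical for small n only.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    w = min(w_plus, total - w_plus)
    # Under the null hypothesis each rank is positive or negative with
    # probability 1/2, so count the sign patterns at least as extreme as W.
    hits = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            hits += 1
    return w, hits / 2 ** n

# Hypothetical paired DSC values (x100), one pair per deformation scenario.
t2_vs_ct = [88, 90, 91, 87, 92, 89, 93, 90, 86, 91, 94]
t1_vs_ct = [80, 85, 83, 79, 88, 84, 85, 86, 81, 82, 87]
w, p = wilcoxon_signed_rank(t2_vs_ct, t1_vs_ct)
print(f"W = {w}, p = {p:.5f}, significant at 0.05: {p <= 0.05}")
```

With 11 paired scenarios the enumeration covers 2^11 = 2048 sign patterns, so an exact P value is cheap to obtain without a normal approximation.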

3 | RESULTS
Registration of images obtained with the same modality (CT vs CT, T1 vs T1, or T2 vs T2) showed no differences in DIR accuracy for the T1:T1 and T2:T2 comparisons (P > 0.05). For both the T1:T1 and T2:T2 comparisons, mean (±SD) DSC values for fatty tissue (oil) were 0.88 ± 0.08, and those for prostate (agarose gel) were 0.92 ± 0.05 (Table 3). Comparisons in DSC values for MRI vs CT DIR are also shown in Table 3 and illustrated graphically in Fig. 4. MDA values differed slightly in the T1:T1 and T2:T2 comparisons for both fatty tissue and prostate (Fig. 5). In other words, DIR accuracy was lower for between-modality comparisons (T1 or T2 vs CT) than for same-modality comparisons (T1 vs T1 or T2 vs T2), and the accuracy was also lower for T1 sequences than for T2 sequences (P < 0.001) for both fatty and prostate tissues.
We also compared the effects of ROI size (large vs small) on between-modality DIR (Fig. 6). The volume of the large ROI, which encompassed the entire phantom, was 20 cm × 20 cm × 14 cm; the

4 | DISCUSSION
We report here the use of the Wuphantom to quantitatively evaluate the accuracy of DIR for CT and two MRI sequences. The tests included both within-modality comparisons (T1 vs T1 and T2 vs T2) and between-modality comparisons (T1 vs CT and T2 vs CT). DIR was less accurate for between-modality than for same-modality comparisons. All of the results (except for T1 vs CT) were within the AAPM's recommended thresholds (DSC > 0.8 and MDA < 3 mm).
DIR accuracy was better for T2 images than for T1 images on between-modality comparisons. We also found that using a small ROI improves the accuracy of DIR for target contour propagation.
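A benchmarking workflow of this kind can end with a simple tolerance check against the thresholds quoted above (DSC > 0.8 and MDA < 3 mm). The sketch below is illustrative only: the function names and the sample values are hypothetical and are not the results of this study.

```python
DSC_MIN = 0.8      # threshold quoted in the text: DSC > 0.8
MDA_MAX_MM = 3.0   # threshold quoted in the text: MDA < 3 mm

def within_tolerance(dsc, mda_mm):
    """True when a contour pair meets both thresholds."""
    return dsc > DSC_MIN and mda_mm < MDA_MAX_MM

def flag_failures(results):
    """Return the names of comparisons that miss either threshold.

    `results` maps a comparison name to a (DSC, MDA in mm) pair.
    """
    return [name for name, (dsc, mda_mm) in results.items()
            if not within_tolerance(dsc, mda_mm)]

# Hypothetical values for illustration; not data from this study.
example = {
    "T2 vs T2": (0.92, 0.9),
    "T2 vs CT": (0.85, 1.8),
    "T1 vs CT": (0.76, 3.4),
}
print(flag_failures(example))
```

Keeping the thresholds as named constants makes it straightforward to tighten them for a site-specific commissioning protocol.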
The DIR process has uncertainties regardless of the algorithm chosen. For areas with very low tissue contrast, registration can be prone to inaccuracies. 21
Our study did have some limitations. We used a standard head MRI T1 and T2 scanning protocol for the reference and secondary images. We also did not fully evaluate the registration results when the scanning protocol changed (e.g., proton density, diffusion-weighted images, or slice thickness changes). Moreover, a physical phantom usually has rather simple geometry and correspondingly simple deformations. Even though the DIR in the current study was found to be quite accurate for the evaluated scenarios, we acknowledge that uncertainty still exists in the DIR process for patient-specific images. Indeed, the phantom is more useful for detecting systematic error of a DIR system than for evaluating the accuracy of a clinical case.

5 | CONCLUSIONS
We quantitatively evaluated the accuracy of DIR for MRI and CT. For between-modality comparisons (T1 vs CT or T2 vs CT), T2 imaging performance was better than T1 imaging performance. Use of a smaller ROI was found to improve the accuracy of DIR for target contour propagation. The AAPM recommends that a physical phantom be used for end-to-end testing to account for variations in the imaging chain; we believe that our work with the Wuphantom is an important contribution to such testing.

ACKNOWLEDGMENT
We thank Ms. Ann Sutton from the Department of Scientific Publications at MD Anderson for her editorial assistance. We also thank Christine Wogan, MS, ELS, of the Division of Radiation Oncology at MD Anderson Cancer Center, for editorial assistance.

CONFLICTS OF INTEREST
A patent related to the Wuphantom has been filed.