Practical quantification of image registration accuracy following the AAPM TG‐132 report framework

Abstract The AAPM TG 132 Report enumerates important steps for validation of the medical image registration process. While the Report outlines the general goals and criteria for the tests, specific implementation may be obscure to the wider clinical audience. We endeavored to provide a detailed step‐by‐step description of the quantitative tests’ execution, applied as an example to a commercial software package (Mirada Medical, Oxford, UK), while striving for simplicity and utilization of readily available software. We demonstrated how the rigid registration data could be easily extracted from the DICOM registration object and used, following some simple matrix math, to quantify accuracy of rigid translations and rotations. The options for validating deformable image registration (DIR) were enumerated, and it was shown that the most practically viable ones are comparison of propagated internal landmark points on the published datasets, or of segmented contours that can be generated locally. The multimodal rigid registration in our example did not always result in the desired registration error below ½ voxel size, but was considered acceptable with the maximum errors under 1.3 mm and 1°. The DIR target registration errors in the thorax based on internal landmarks were far in excess of the Report recommendations of 2 mm average and 5 mm maximum. On the other hand, evaluation of the DIR major organs’ contours propagation demonstrated good agreement for lung and abdomen (Dice Similarity Coefficients, DSC, averaged over all cases and structures of 0.92 ± 0.05 and 0.91 ± 0.06, respectively), and fair agreement for Head and Neck (average DSC = 0.73 ± 0.14). The average for head and neck is reduced by small volume structures such as pharyngeal constrictor muscles. Even these relatively simple tests show that commercial registration algorithms cannot be automatically assumed sufficiently accurate for all applications. 
Formalized task‐specific accuracy quantification should be expected from the vendors.


1 | INTRODUCTION
Image registration is currently widely used in radiation oncology clinical practice. However, it is a complex subject, and image registration software, like treatment planning and other radiotherapy software, has to undergo acceptance testing and validation to assess its performance and limitations prior to clinical use. The AAPM TG 132 Report on "Use of image registration and fusion algorithms and techniques in radiotherapy" 1 (the Report) enumerates important steps for validation and verification of the image registration process. Furthermore, the supplemental materials in the Report contain a series of publicly available image datasets designed to help in quantifying image registration accuracy. While the Report outlines the general goals and criteria for the tests, specific implementation may be obscure to the wider clinical audience. Certain tests are not accompanied by readily available software to implement them. In this paper, we endeavored to provide a detailed step-by-step description of the execution of the quantitative tests (Section 4.C of the Report), striving for simplicity and utilization of software that is either in the public domain, ubiquitous in general (e.g., Microsoft Excel), or ubiquitous in radiotherapy (e.g., a treatment planning system). We illustrate our approach by applying the tests suggested in the Report to a commercial image registration software package that may have been less explored in the radiotherapy literature in comparison with others.
2 | METHODS

2.A | Image registration software
As an example of an image registration software package, we used Mirada RTx v. 1.6 (Mirada Medical, Oxford, UK), which is currently in clinical service at our institution. It has a rigid registration algorithm and two choices for deformable image registration (DIR). The rigid registration is based on the Mutual Information 2-5 approach and has a number of spatial resolution settings. The finest grid was always used. The DIR portion includes two algorithms. One ("CT Deformable") is used for CT to CT registration when the datasets are similar, and is a derivative of the Lucas-Kanade optical flow algorithm. 6 For CT datasets with dissimilar intensities and cross-modality registration, the "Multimodality Deformable" option is used, which optimizes a Mutual Information-based similarity function. 3,7,8 The software is capable of exporting Digital Imaging and Communications in Medicine (DICOM) spatial registration objects for both rigid and deformable registrations. The deformation vector field (DVF) is downsampled spatially compared to the imaging datasets themselves, by a factor of 2 in each dimension for "CT Deformable" and a factor of 4 for "Multimodality Deformable".
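The effect of the DVF downsampling on the exported grid spacing can be illustrated with a few lines of Python. This is only a sketch: the function name and the example voxel size are hypothetical, and only the downsampling factors are taken from the text above.

```python
# Sketch: expected DVF grid spacing given the image voxel size and the
# downsampling factors quoted in the text (2 and 4 per dimension).
# The voxel size below is a made-up example, not a value from this work.

DOWNSAMPLING = {
    "CT Deformable": 2,
    "Multimodality Deformable": 4,
}

def dvf_grid_spacing(voxel_mm, factor):
    """Multiply each voxel dimension (mm) by the downsampling factor."""
    return tuple(v * factor for v in voxel_mm)

voxel = (1.0, 1.0, 2.5)  # hypothetical image voxel size in mm (x, y, z)
for name, factor in DOWNSAMPLING.items():
    print(name, dvf_grid_spacing(voxel, factor))
```

A coarser DVF grid limits the spatial detail that any analysis of the exported field can resolve, which is worth keeping in mind when comparing the two algorithms.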

2.B.1 | DICOM transformation objects
Before describing the methods of quantifying registration errors, it is instructive to reiterate some pertinent details of the DICOM standard. 9 The DICOM spatial frame of reference convention differs from the one typically employed in modern treatment planning systems and linear accelerators (e.g., IEC1217). It is a right-handed patient-based coordinate system. The relationship between the DICOM and IEC1217 systems for a patient in a standard (head first supine, or HFS) position is depicted in Fig. 1.
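As a minimal illustration of handling such data, the sketch below unpacks the 16 values of a DICOM Frame of Reference Transformation Matrix, attribute (3006,00C6), which the standard stores row by row, and re-expresses the translation in IEC axes. The function names and the numerical values are hypothetical. The axis mapping used for the HFS patient (x_IEC = x_DICOM, y_IEC = z_DICOM, z_IEC = -y_DICOM) is the commonly quoted one and is stated here as an assumption, to be checked against the figure and the local system.

```python
# Sketch: unpack a DICOM Frame of Reference Transformation Matrix (3006,00C6).
# The 16 values are stored row by row; the numbers below are made up.

def to_4x4(values):
    """Reshape the 16 row-major values into a 4x4 nested list."""
    if len(values) != 16:
        raise ValueError("expected 16 matrix elements")
    return [list(values[4 * r: 4 * r + 4]) for r in range(4)]

def split_rigid(matrix):
    """Return (3x3 rotation block, translation column) of a 4x4 matrix."""
    rotation = [row[:3] for row in matrix[:3]]
    translation = [row[3] for row in matrix[:3]]
    return rotation, translation

def dicom_to_iec_hfs(vec):
    """Map a DICOM patient-frame vector to IEC axes for an HFS patient.

    Assumed mapping: x_IEC = x_DICOM, y_IEC = z_DICOM, z_IEC = -y_DICOM.
    """
    x, y, z = vec
    return [x, z, -y]

example = [1.0, 0.0, 0.0, 5.0,
           0.0, 1.0, 0.0, -10.0,
           0.0, 0.0, 1.0, 15.0,
           0.0, 0.0, 0.0, 1.0]

rotation, translation = split_rigid(to_4x4(example))
print(translation)                    # translation in DICOM axes (mm)
print(dicom_to_iec_hfs(translation))  # the same shift expressed in IEC axes
```

In practice the 16 values would be read from the exported registration object with a DICOM toolkit rather than typed in by hand.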

2.B.3 | Rigid translations and rotations
Rigid registration involving translations and rotations is slightly more complicated. The tests are enumerated in Table 2, along with the known translations. Note that the known T values in Table 2 differ not only in sign but also in magnitude from the nominal X,Y,Z shifts specified in the Report. The reason for that is that in the transformation calculations, rotations are applied first, followed by translations.
To determine the known T values, we first independently construct a direct transformation matrix M from the moving to stationary datasets, corresponding to the known rotations and shifts in Table 2.
While the order of the rotations is not explicit in the Report, it was determined by trial and error to be around the Z axis first, followed by Y, and finally X. In matrix notation, with the matrices acting on column vectors, this implies:
$$R = R_X R_Y R_Z$$
For the transformation in Cases 10-14, the individual rotational components are the single-axis rotation matrices evaluated at the known rotation angles. The translation vector T from the last column of M (transcribed to Table 2) can now be compared to the registration software-generated one to obtain the registration errors along the cardinal axes.
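The matrix construction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure used here: the right-handed sign convention of the single-axis rotations, the angles, the shifts, and all function names are hypothetical. The sketch shows why the last column of the matrix depends on whether the shift is composed before or after the rotation, which is the reason the known T values need not equal the nominal shifts.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rot(axis, deg):
    """Homogeneous single-axis rotation (right-handed convention assumed)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    if axis == "x":
        return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]
    if axis == "y":
        return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]
    if axis == "z":
        return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    raise ValueError(axis)

def shift(tx, ty, tz):
    """Homogeneous translation matrix."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def rotation_zyx(rx, ry, rz):
    """Z rotation acts on the point first, then Y, then X: R = Rx * Ry * Rz."""
    return matmul(rot("x", rx), matmul(rot("y", ry), rot("z", rz)))

def last_column(m):
    """Translation vector T read from the last column of a 4x4 matrix."""
    return [m[i][3] for i in range(3)]

def axis_errors(measured_t, known_t):
    """Per-axis registration error: software-reported T minus known T."""
    return [m - k for m, k in zip(measured_t, known_t)]

nominal = (5.0, 10.0, 15.0)       # hypothetical nominal shifts (mm)
r = rotation_zyx(3.0, 2.0, 1.0)   # hypothetical rotation angles (degrees)

rotate_then_shift = matmul(shift(*nominal), r)  # last column = nominal shift
shift_then_rotate = matmul(r, shift(*nominal))  # last column = R applied to shift

print(last_column(rotate_then_shift))
print(last_column(shift_then_rotate))
```

Once the known T has been obtained this way, `axis_errors` gives the differences along the cardinal axes that are compared against the tolerance.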

2.B.4 | Deformable registration
The quantitative deformable registration tests are enumerated in Table 3. The Report provides two dataset pairs for evaluation of DIR, using, in theory, two different methods of providing the ground truth transformation. The first is Basic Deformation Dataset 1. It is constructed from Basic Anatomical Dataset 1 by adding noise, translations, rotations, and deformation in the central region. It is stated in the Report that "evaluating the accuracy of the deformation phantom should be performed using the DICOM deformation vector field (DVF) files". 1 Unfortunately, this recommendation was not followed, and the ground truth DVF file provided in the Report's supplemental materials is in a proprietary binary format, making it unusable without the corresponding commercial software package.
Constructing the ground truth DVFs is a nontrivial endeavor. 10,11 As a result, an alternative practical approach to Case 15 had to be developed. As an easy first step, the center of each of the three visible fiducials was identified on the target and deformed images, and the differences recorded as target registration errors (TRE). For a more comprehensive analysis, we segmented the datasets and compared the structures resulting from deforming the moving dataset to those manually drawn on the target (noisy) dataset. The analysis was done with the StructSure tool (Standard Imaging Inc., Middleton, WI, USA) based on the work by Nelms et al. 12 However, of the menu of metrics available in the software, we chose only the one that could be, albeit with some effort, extracted manually from any radiotherapy planning/registration system. The pertinent values are the volumes of the deformed and target structures and of their overlap. From that, the Dice similarity coefficient (DSC) 13 can be calculated as
$$\mathrm{DSC} = \frac{2\,(V_A \cap V_B)}{V_A + V_B}$$
where $V_A$ and $V_B$ are the volumes of the deformed and target structures and $V_A \cap V_B$ is their overlapping volume. On the other hand, determination of the mean distance between contour surfaces, which is another structure-based metric recommended in the Report, is too time consuming for manual calculations and would require a specialized software tool. Fortunately, a formal statistical analysis in a recent publication 11 suggests that DSC and structure volume are a strong predictor of the distance to conformity between contours, and the latter may be omitted as redundant.
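The DSC calculation from the three volumes described above is simple enough to do in a spreadsheet; an equivalent sketch in Python is given below. The function name and the example volumes are hypothetical.

```python
# Sketch: Dice similarity coefficient from three volumes that can be read
# out of a planning or registration system. Example volumes are made up.

def dice(volume_a, volume_b, overlap):
    """DSC = 2 * overlap / (volume_a + volume_b), all in the same units."""
    if volume_a < 0 or volume_b < 0 or overlap < 0:
        raise ValueError("volumes must be non-negative")
    if overlap > min(volume_a, volume_b):
        raise ValueError("overlap cannot exceed either structure volume")
    total = volume_a + volume_b
    if total == 0:
        raise ValueError("both structures are empty")
    return 2.0 * overlap / total

# Hypothetical deformed and target structure volumes (cc) and their overlap.
print(round(dice(50.0, 54.0, 47.0), 3))
```

The input checks are there because a manually transcribed overlap volume larger than either structure is a common sign of a transcription error.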
The second DIR case provided in the Report, Clinical 4DCT Dataset (Case 16 in Table 3), is intended to be used with a TRE-type quantification scheme. It has 300 virtual fiducials semiautomatically placed at bifurcation points identified on both end-inhalation and end-exhalation respiratory phases. 14 In addition, datasets from Cases 16-18 were segmented by a local expert (JC) on both respiratory phases and the deformed contours from the moving dataset compared to those drawn on the target, as described before. This allows for useful cross-checking of the results between two independent approaches to geometrical registration error determination. This method of producing the contour pairs is not as refined as the ones described by Loi et al. 11 but has the advantages of not requiring specialized software and perhaps being somewhat more realistic.
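A TRE-type analysis of paired landmark coordinates can be sketched as follows. The function names and the two example points are hypothetical. The default limits of 2 mm for the mean and 5 mm for the maximum are the Report recommendations as quoted in the abstract of this paper.

```python
import math

def tre_stats(target_points, mapped_points):
    """Summarize target registration errors for paired landmarks (mm).

    target_points: landmark positions identified on the target dataset.
    mapped_points: the corresponding landmarks after registration.
    """
    if not target_points or len(target_points) != len(mapped_points):
        raise ValueError("need two non-empty point lists of equal length")
    pairs = list(zip(target_points, mapped_points))
    distances = [math.dist(p, q) for p, q in pairs]
    n = len(distances)
    # Mean signed difference along each axis, useful for spotting a bias.
    mean_components = [sum(q[i] - p[i] for p, q in pairs) / n
                       for i in range(3)]
    return {
        "mean": sum(distances) / n,
        "max": max(distances),
        "mean_components": mean_components,
    }

def meets_criteria(stats, mean_limit=2.0, max_limit=5.0):
    """Check the mean and maximum TRE against the given limits (mm)."""
    return stats["mean"] <= mean_limit and stats["max"] <= max_limit

# Two made-up landmark pairs, coordinates in mm.
target = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
mapped = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0)]
stats = tre_stats(target, mapped)
print(stats, meets_criteria(stats))
```

For a real dataset the two lists would hold the full set of virtual fiducial coordinates, with the mapped positions obtained by propagating the landmarks through the registration.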
The Report recommends 10 clinical cases to be examined, without specifying a method of obtaining the ground truth. We felt that the seven thoracic cases described above were sufficient for that anatomical region. Therefore, we added three randomly selected abdominal (two extreme respiratory phases) and three head and neck (treatment planning vs. diagnostic) CT dataset pairs as examples. Contour comparison was again selected as a practical method of quantifying the TRE.
The normal structures were segmented on each dataset by an expert, and the contour comparison routine described above was applied.

x- and y-direction errors exceeded 1 mm, while for both MRI to CT registrations only the x error was above 1 mm.

3.B | Registration errors: rigid translations and rotations (Cases 10-14)
The dataset voxel dimensions for rigid translation/rotation cases followed the same pattern as for Cases 6-9, with the MRI transverse pixel size

LATIFI ET AL.

The results of the contour similarity analysis for Case 15 are detailed in Table 4.
All contours, except for the "seminal vesicles", show DSC well above the level considered acceptable in the Report (0.8-0.9). 1 The "seminal vesicles" are small, low-contrast structures and their low DSCs are mostly due to the inability of the observer to properly identify them on the noisy target image. With the very high Dice coefficients and maximum differences between the contours being of the order of one voxel size, this test was considered successful.

3.C.2 | Clinical Thoracic deformable registration (Cases 16-22)
For the cases in this section, the Optical Flow "CT Deformable" Mirada algorithm was used. It is the primary algorithm intended for CT to CT registration and also provides better spatial resolution of the DICOM-exported DVF. It is apparent from

3.C.3 | Clinical Abdominal cases (Cases 23-25)
As in the thoracic cases above, the two datasets in each abdominal pair differ in that they correspond to the two extreme respiratory phases.
The DSC results for the major abdominal contours are presented in

3.C.4 | Clinical Head and Neck cases (Cases 26-28)
The main challenge in aligning the diagnostic and treatment planning HN image sets is the flexion of the neck, which requires substantial deformation. Additionally, the diagnostic datasets include contrast media, particularly evident in major blood vessels. However, the vessel and major muscle alignment was visually checked and deemed very close. The results of the DSC between the drawn and warped contours in both directions are presented in Table 8.

3.D | Consistency with respect to registration direction
The robustness of deformation with respect to direction depends on the criteria and follows the quality of the corresponding registration metrics. For Thoracic case 16, for example, the misalignment of the virtual fiducials is rather large (Table 5). Similarly, the mean (Δx, Δy, Δz) are unstable with direction of registration and change from (−1.6, −2.3, 18.6 mm) for the 0% to 50% deformation to (−0.02, 1.2, −5.1 mm) for the opposite one. On the other hand, the DSCs between the thoracic and abdominal contours in Tables 6 and 7 are rather high and do not change meaningfully with direction. The HN DSCs show more random variation, as the contour overlap is generally lower (Table 8).

4 | DISCUSSION
In stark contrast, for example, with the dose calculation algorithms, 17 the guidance literature on validation of image registration software, particularly DIR, is still in its infancy. The issue is rather complex, as the apparent registration success or failure depends on multiple variables, such as the algorithm, site, metrics, and clinical goals. The Report provides a reasonable suite of virtual phantoms and criteria for rigid registration validation. In this paper, we elaborated on their detailed application to a particular commercial software package.
Even in these simplest cases, the strict criterion of ½ voxel size registration accuracy is not met in every case, although the overall error
TABLE 4 Comparisons between the pertinent contours deformed from the moving dataset and those drawn on the target.
Analyzing the TRE for a large number (hundreds) of virtual fiducials is a step down from the DVF analysis for every voxel, but it is still capable of producing a fairly detailed picture of the registration accuracy within an organ (typically the lungs 14-16).
One digital dataset pair with the corresponding sets of fiducials from Ref. [14] is provided in a supplement to the Report. We performed well in these tests for major thoracic and abdominal organs segmented on two respiratory phases, and fairly for the HN cases with differences in neck flexure. This underscores that the requirements for faithful contour propagation are not synonymous with, and may in fact be disparate from, the requirements for volumetric spatial accuracy of image registration. 24 Hybrid DIR models are being proposed to address this issue. 27 In our case, the DIR algorithm appears to be adequate for contour propagation but is questionable at best for applications requiring the fidelity of the volumetric DVF, such as deformable dose accumulation. 24,27-29 Finally, it is fair to say that DIR evaluations with physical phantoms are not practically feasible in the majority of radiotherapy clinics.

5 | CONCLUSIONS
Given the wide availability of commercial image registration software, the AAPM TG-132 Report 1 is a useful, albeit far from complete, step toward providing a medical physicist with the knowledge, tools, and criteria for validating those algorithms in the clinic. We demonstrated how a number of suggested quantitative tests can be performed using only publicly available tools. However, for deformable registration, the Report on the practical level provides more questions than answers. There is a great need for a universally available, comprehensive library of digital datasets with the ground truth deformation data. A good example of a related recent project relying on public domain software and providing downloadable datasets would be the work by Nyholm et al. 30 Furthermore, it may not be realistic to expect a clinical physicist to perform validation of a DIR package for a full variety of clinical sites and use scenarios. A more practical approach may be for the software vendors to provide a comprehensive, objective set of characterization and validation data for their algorithms, from which at least an initial approximation of fitness for a particular task could be inferred.

CONFLICT OF INTEREST
The authors have no relevant conflict of interest to report.