Evaluation of whole‐body MR to CT deformable image registration

Multimodality image registration plays a crucial role in various clinical and research applications. The aim of this study is to present an optimized MR to CT whole‐body deformable image registration algorithm and its validation using clinical studies. A 3D intermodality registration technique based on B‐spline transformation was performed using optimized parameters of the elastix package based on the Insight Toolkit (ITK) framework. Twenty‐eight (17 male and 11 female) clinical studies were used in this work. The registration was evaluated using anatomical landmarks and segmented organs. In addition to 16 anatomical landmarks, three key organs (brain, lungs, and kidneys) and the entire body volume were segmented for evaluation. Several parameters — such as the Euclidean distance between anatomical landmarks, target overlap, Dice and Jaccard coefficients, false positives and false negatives, volume similarity, distance error, and Hausdorff distance — were calculated to quantify the quality of the registration algorithm. Dice coefficients for the majority of patients (>75%) were in the 0.8–1 range for the whole body, brain, and lungs, which satisfies the criteria to achieve excellent alignment. On the other hand, for kidneys, Dice coefficients for volumes of 25% of the patients meet excellent volume agreement requirement, while the majority of patients satisfy good agreement criteria (>0.6). For all patients, the distance error was in 0–10 mm range for all segmented organs. In summary, we optimized and evaluated the accuracy of an MR to CT deformable registration algorithm. The registered images constitute a useful 3D whole‐body MR‐CT atlas suitable for the development and evaluation of novel MR‐guided attenuation correction procedures on hybrid PET‐MR systems. PACS number: 07.05.Pj

multimodality image registration algorithms. The manifold applications of image registration is beyond the scope of present study; however, pointing to a number of noticeable applications in medicine is helpful. Image registration plays an undeniable role in image comparison procedures such as treatment verification carried out by comparison of pre-and postintervention images, (1,2) subtraction of ictal and inter-ictal SPECT images, (3,4) bone growth monitoring on X-ray or ultrasound time series, (5) tumor growth monitoring on series of MR scans, (6) monitoring response to treatment in PET, (7) and temporal differencing for motion detection. (8) Other wide scopes for image registration are statistical atlas generation (9) and atlas-based segmentation, (10) and attenuation correction on hybrid PET/MR systems. (11) This later reference reports on an MR-guided attenuation correction method relying on an atlas consisting of registered MR and CT images of several subjects.
In clinical oncology, the need for integration of anatomical and functional information for diagnosis, staging, and therapy planning has pushed medical imaging technology into the area of multimodality and multiparametric imaging. (12)(13)(14) Multimodality imaging systems combine the advantages of both anatomical imaging modalities (e.g., CT, MRI, ultrasound, portal imaging) and functional imaging modalities (e.g., SPECT, PET, fMRI, MRS). With the advent of multimodality imaging, image registration became an essential component of both clinical and research applications and is increasingly attracting the interest of the medical imaging community. (15) Recently developed hybrid PET/MRI systems provide the combination of anatomicalfunctional and even functional-functional information. (13,(16)(17)(18) Although hybrid PET/MRI provides highly accurate hardware based coregistered images, the lack of suitable and efficient motion correction techniques and the need for MR-guided attenuation correction make softwarebased image registration an essential component of this technology.
The procedure used for atlas generation consists of two major parts: image registration and conversion. A large number of image registration algorithms were proposed varying in dimensionality, nature of transform, modalities involved, and optimization procedure. Owing to the various features associated with the procedure, different classifications were presented. (19) Regardless of the type of registration, the evaluation and validation of image registration algorithms play a pivotal role in the whole process. Image registration depends on the geometry of both the target image (the fixed image on which the second image is registered) and the source image (the moving image that is registered on the target image). Therefore, introducing a thorough concept that guarantees the accuracy of the registration process in whole-body imaging proved to be a difficult task. However, some general guidelines were proposed for validation of image registration, particularly for the thoracic imaging. (20,21) Validation of image registration algorithms usually follows a sequence of measurements using computer-generated anatomical models, (22) physical phantoms (23,24) or images of patients or volunteers. (24,25) Although validation based on phantom studies holds a high precision owing to the accurately known transformation parameters of warped phantoms as well as the inverse transform which is used as ground truth for evaluation, it is not inclusive since realistic patient geometry is hardly considered. The issue becomes more complex in the case of intermodality registration. On the other hand, validation based on clinical images is prohibitive in terms of time and availability of expert evaluators. In other words, either for localization of anatomic landmarks or segmentation of organs, expert knowledge (as reference) is necessary, which deters investigators from following this approach.
In this study, we describe a deformable intermodality (MR to CT) image registration approach with the aim to exploit it for atlas generation in the context of MR-guided attenuation correction. The atlas is made up of aligned whole-body CT and MR images of 28 patients. The performance of the 3D intermodality registration technique is evaluated using clinical images through manually defined anatomical landmarks and segmented organs.

A. Image registration
Image registration is the process of finding a spatial transformation from one image or domain to another according to a given set of constraints. This constraint could be a special kind of similarity measure. Consequently, the task is to optimize the transformation parameters based on the candidate similarity measure. Our registration process includes two main steps. The first step allows the achievement of preliminary alignment through a rigid transform to roughly align the target and source images. The rigid transform with six parameters consists of a sequence of translation and rotation transforms. The second step consists of a nonrigid registration procedure by means of the B-spline deformable transform. This registration strategy is based on methods described by Rueckert et al. (26) The B-spline transformation is a free-form deformation described by a cubic B-spline function defined on a uniformly spaced control point grid (independent of the image data) covering the image being registered. B-spline registration requires initial control points which were selected uniformly to produce a 3D grid. During the registration process, transform parameters are calculated for this grid; then the whole 3D image is transformed in each loop. As such, the registration process is fully automated and was carried out without user intervention for all studies.
The registration of whole body images was carried out using elastix, (27) a command line driven program based on the Insight Toolkit (ITK) registration framework (open source; National Library of Medicine). (28) The registration parameters were selected based on our own assessment of the various possible options and recommendations provided by the developers of elastix based on their experience and feedback from a large community of users. We used parameters corresponding to lung registration study, since the lungs were the most problematic regions. (29) The first tests demonstrated that this is a good candidate for our task, but some changes were needed to achieve optimal performance following in-depth evaluation of the impact of the various parameters involved. The three most important steps of the registration procedure: 1. The rigid registration was performed using a multiresolution registration approach at five different levels, from a level 16 times lower in resolution to full original resolution. 2. The elastic registration was carried out using three resolutions, from eight times lower to twice the original resolution. 3. The total number of iterations for each of the two registrations was about 8000.
We set higher pyramid levels for rigid registration since large and rough transform steps are required. Besides, when we increase pyramid levels and/or iterations in the B-spline registration, the resulting image tends to deviate from the target owing to the fact that higher pyramid levels and/or number of iterations lead to overdeformation of the resulting image. The normalized mutual information (NMI) metric was considered as similarity measure, but other measures available in elastix could have been used. However, better performance was reported when using the normalized correlation coefficient (NCC) for intramodality registration and the NMI for intermodality registration metrics. In agreement with observations made by other users, a slightly better performance was achieved using the NMI for intermodality registration because this metric is able to manage contrast scale differences introduced by different modalities. We therefore used NMI metric with 32 bins, 4096 samples for elastic, and 5000 samples for rigid registration.
A steepest descent optimizer was chosen to optimize the similarity measure. This optimizer uses the gradient vector to move through the cost function until an optimum is found. To determine the best step size and learning rate which is not trapped inside local minima, the whole process was repeated several times. For each resolution, a specific step size was assigned to the optimizer. For the gradient descent, the step is given by: Step a where k is the iteration number. The parameters a, A, k, and α are summarized in Table 1. This formula was used by Klein et al. (30) and was judged to be appropriate for our registration procedure. For obvious reasons the step size is large in the beginning and gradually becomes smaller at higher iterations. The image transformation consists of two steps: (i) transformation in physical space, and (ii) resampling (from physical space to voxel space). The B-spline transform is equivalent to the generation of deformation fields where a deformation vector is assigned to every point in physical space. The deformation vectors are computed using 3rd order B-spline interpolation from the deformation values of points located on a uniform grid. The uniform grid spacing was eight voxels which accounts for around 30,000 control points. It has been shown that a smaller number of randomly selected control points speeds up the process without affecting negatively the optimization process. (30) Therefore, 5,000 points were randomly selected in each iteration. Conversely, the resampling step requires interpolation to calculate the intensity values of the transformed image. In order to speed up the resampling process, for each iteration, a 1st order B-spline method was used to interpolate the moving image, whereas a 3rd order B-spline method was used for the final iteration.
The registration process was fully automatic, taking less than 30 min on Intel Core 2 Quad processors (Intel Corporation, Santa Clara, CA). Once the process is finished, the software writes an ASCII file containing the optimized parameters for each transformation, namely six parameters for the rigid registration and tens of thousands (variable depending on the dataset) for the elastic registration. This ASCII file is used by other utilities (Transformix) to deform the source image to produce the final registered image.

B. data acquisition
Whole body X-ray CT and MR images were acquired for 28 (17 male and 11 female) patients. The patients were randomly selected from the clinical database (age between 18 and 81 years). The weight and height ranges differ between male (height: 164 to 179 cm; weight: 59-95 kg) and female (height: 148 to 168 cm; weight: 38-88 kg) patients. It should be emphasized that in most of the cases, patients' disease state did not affect the registration process, except a few cases which were excluded when the presence of large metallic implants deteriorated the overall image quality.
CT images were acquired for attenuation correction of PET/CT studies on the Biograph HireZ scanner (Siemens Healthcare, Erlangen, Germany). CT scanning was performed using 120 kVp tube voltage and CARE Dose (Siemens) tube current modulation which sets the tube current Higher resolution with finer sampling would have been desirable for optimal image registration but, since this study was conducted retrospectively, we had only access to low resolution whole-body MR images (with rough voxel resolution to increase SNR) used for attenuation correction and anatomical localization. Smaller voxel sizes will provide noisier images of even poorer quality, which might worsen the registration process.

C. Evaluation
The most relevant issue in correlative imaging is the ability to locate the same anatomical features in coregistered images. In other words, a perfect registration algorithm completely aligns all corresponding points and objects of the source and target images (in our case organs or landmarks of human anatomy). Hence, the validation step requires accurate segmentation of specific organs, as well as localization of anatomic landmarks. One of the prevalent reference or ground truth for segmentation of organs and localization of landmarks is expert knowledge. Despite its inherent limitations and reproducibility, expert knowledge is the still the most widely used reference for evaluation of image registration techniques. Consequently, we can measure three different kinds of metrics for registration evaluation: point, surface, and volume. For each patient, 16 anatomical landmark spatial coordinates were defined on both CT and MR images by a radiologist (Table 2). (32,33) The Euclidean distance between these points in the two datasets was calculated as the registration error. In order to evaluate volume and surface discrepancies, the whole-body, along with three key organs (brain, lungs, and kidneys) was segmented. The segmentation algorithm consists of a combination of manual and active contour models (34) (snakes) using an interactive program which facilitates the task of manual segmentation and initial contour definition. The active contour model is used to accelerate the segmentation process but is prone to leak in some regions, wherein manual intervention is required. In addition, kidneys were not easily distinguished on the low-resolution MRI and in some cases on CT where manual segmentation was carried out. However in three cases, the kidneys were left out, since they were not distinguishable on MRI, even by experienced radiologists. Various strategies were proposed to measure the discrepancy between volumes and surfaces of corresponding segmented volumes of source and target images. These strategies can be classified into three different categories as discussed by Klein et al.: (25) volume overlap, volume overlay errors, and surface distance measures.

C.1 Volume overlap
These are the most often used methods to evaluate image registration techniques. Basically these methods are based on measurements of union, intersection, and subtraction of segmented volumes. They roughly quantify the fraction of source (S) and target (T) volume labels that agree or disagree. A good description of these methods, including multiple labels, fractional labels, and weighted labels, is given in Crum et al. (35) The more accurate the registration, the closer these quantities approach the unity. For this reason, these methods belong to the same group.
Target overlap can be calculated over a set of r regions (regions that were labeled through segmentation): (2) Mean overlap (Dice coefficient) is a figure of merit that quantifies the intersection between source and target labels divided by the mean volume of the two regions: The Dice coefficient (36) is a special case of Kappa statistical coefficient. (37) A possible interpretation of the results of this quantification is given below: (38) • Poor agreement → less than 0.
The Jaccard coefficient (39) (also known as Tannimoto coefficient (40) ) is treated separately from the Dice coefficient; however, one should note that both are related by the following formula: (5)

C.2 Volume overlay error
Following the previously described volume overlay methods that quantify the agreement between source and target regions, we can also compute the disagreement by making the hypothesis that the target volume is the truth we must achieve and the source volume is a tentative to achieve this truth. The false-negative (FN) value can then be defined as the proportion of target volume that is incorrectly not covered by the source volume: (6) Inversely, the false-positive (FP) error can be defined as the proportion of source volume incorrectly not covering the target volume: Volume similarity is a figure of merit that measures the similarity between target and source volumes regardless of the fact that the volumes are superimposed. Therefore, it is not technically a quantification of registration accuracy. However, it provides information about which registration method gives a closest source volume to the target volume. (8)

C.3 Surface distance measures
The above-mentioned figures of merit do not explicitly assess boundary discrepancies between source and target regions. The distance error (DE) is a well-defined quantity to measure the boundary mismatch between the source and the target. (25) This distance is equal to the minimum distance from each boundary point of the source region (S Bp ) to the entire set of points of the target region (T B ), averaged across the N boundary points: The most frequently used discrepancy measure is the Hausdorff distance (HD), which measures the maximum distance one would need to move the boundaries of the source region to completely cover the target region. (35) Formally, HD S→T from the source region (S) to the target region (T) is a maximum function defined as: (10)

C.4 Implementation
A set of classes were written in C++ object-oriented language to implement the above-described image registration validation metrics, taking advantage of ITK classes for image I/O operations and well-defined image data structure. Figures of merit for volume discrepancy measurement were calculated by counting the voxels of union, intersection, and subtraction of each region. The calculation of surface distance required a faster algorithm owing to the high complexity of finding the minimum Euclidian distance of adjacent voxels for all points on the surface. As a result, the distance map for each region was generated and used to find the surface discrepancies. Figure 1 shows two representative clinical studies before and after registration. The misalignment between MRI and CT images at the level of the arms and head is commonplace in whole body imaging. It is visually obvious that the images were matched satisfactorily after registration. The results of evaluations for the considered patient population in terms of point-to-point evaluation of the deformable registration procedure are shown in Fig. 2 using box and whisker plot. The box shows the median (red horizontal line) and the lower (Q1) and upper (Q3) quartiles (defined as the 25th and 75th percentiles). The red plus sign in the plots indicates the outliers. Outliers in our plot include dubious results which are beyond 1.5 times the interquartile range (Q3-Q1). A possible origin of dubious results is the failure of the image registration process in cases of poor MR image quality so that the organs are not distinguishable enough. In these cases, the cost function fails to correctly guide the optimizer. The whiskers show the maximum and minimum of population after elimination of outliers. The box and whisker plots not only indicate the quantitative characteristics of the samples under study, but also hold useful features for comparison of different groups of samples. The notches in each box depict the confidence interval for the median at 95% of confidence degree. Two analogous medians with confidence intervals that overlap are not significantly different. It can be seen that for the majority of data points (above 75%), the error is under 20 mm. The detailed statistical analysis results are reported in Table 3. Figure 3 shows box and whisker plot for volume overlap of various organs. As mentioned earlier, the comparison was performed for the whole body and for three specific organs (brain, lungs, and kidneys). The plot indicates some outliers considered as dubious results. The false positive, false negative, and volume similarity plots for the 28 patients show an excellent alignment (error under 0.1) for whole body registration (Fig. 4). False negative is the fraction of the target image volume that is not intersected by the source image, which is under 0.35 for the majority of patients. On the other hand, the false positive is a fraction of source image that is not intersected by target image. The majority of patients have false-positive value under 0.3 for all evaluated organs.

III. RESuLTS
As mentioned earlier, the Hausdorff distance is the maximum distance one would need to move the boundaries of the source region to completely cover the target region which is obtained by finding the maximum value of the distance map. Figure 5 shows the box and whisker plot of maximum and minimum distances for 28 patients. Detailed statistical parameters for Figs. 5 to 7 are summarized in Table 3.
For all measured organs, the Hausdorff distance is above 10 mm. In some cases computed Hausdorff distance for whole body exceeds 50 mm. This relatively high error originates from the interscan movements of patient arms. Due to the higher degree of freedom, arms are potential source of mismatch. In addition to arms, slight movements of the head bring about misregistration errors, as well. In comparison with arms, head misregistration happens seldom in patients. Therefore the computed Hausdorff distance shows smaller misregistration errors for the brain than for the whole body. Figure 6 illustrates the arms and head misregistration as the worst case.  Apart from the local misregistration (in the arms), the distance error, which is the mean value of the distance map, indicates a fair agreement (under 10 mm) for the majority of patients (Fig. 5). Diaphragmatic function is the principal source of misregistration in the lungs and even kidneys. A close look at the 3D distance map of lungs reveals that the lungs' base and the kidneys' apex are the most susceptible to misregistration error ( Fig. 7 (a)-(b). Table 3. Statistical characteristics of results presented in Figs. 5 to 7 for grouped analysis of 28 clinical studies reporting volume and surface discrepancies of four anatomical regions. Δ is as defined in Table 2.

IV. dISCuSSIon
In this work, we performed and evaluated a 3D whole-body CT to MRI nonrigid image registration algorithm aiming at elaborating a framework for implementation and validation of MRI-guided attenuation correction strategies on hybrid PET/MRI systems. The validation part relied on the use of retrospectively acquired clinical data and various metrics including qualitative visual assessment performed by an experienced physician and objective computeraided measurements. A limited number of studies focused on the use of 3D registration for MR-guided attenuation correction. (41)(42)(43) These contributions focused mostly on brain imaging applications where intramodality registration, even of a patient study to an atlas, is less challenging because of the rigidity of the head compared to whole-body imaging. In addition, these contributions failed to describe in sufficient detail the registration procedures used to enable their reproduction and reuse and, more importantly, the registration accuracy was not assessed  or quantified. One of these studies used an aligned MR/CT atlas to build a pseudo-CT image using MR-to-MR registration (intramodality registration). (41) More recently, the same group reported on the comparison of segmentation and atlas registration and pattern recognition-based methods using whole-body studies, and concluded that the latter provides better overall PET quantification accuracy than the MR image segmentation approach. (11) The focus of this paper was on attenuation correction using the same registration approach. The visual assessment of registered images (Fig. 1) indicates excellent visual alignment of images after registration in most of the cases. It is obvious that for the volume overlap metric (Fig. 3), the sensitivity of the measurement falls down as the volume increases. In other words, a tiny misalignment in small volumes will cause a greater effect on the global results. One can observe that we obtain higher correspondence for the whole body than kidneys. There is one case (patient 08) that falls under 0.3. This drastic falloff originates from the failure in detecting the kidneys owing to the lower image quality. As mentioned before, a DC (mean overlap) higher than 0.8 implies an excellent agreement of volumes. Thus, the majority of patients (above 75%) meet the criterion for whole body, brain, and lungs. On the other hand, for kidneys, the majority of patients satisfy the good agreement of volumes criteria (0.6 ≤ DC < 0.8). The smaller volume of the kidneys and the respiratory motion are the main causes of this reduction. Equation (5) explicitly demonstrates that the JC (union overlap) depends on the DC value. Therefore the discussion about DC backs up the JC, as well.
Comparing the false negative and false positive for the brain (Fig. 4) indicates a slightly higher false-positive error. Owing to the fact that the anterior half of the brain is smaller than the posterior half, any anterior-posterior shift will bring about different false-negative and false-positive error. There is a fair agreement between the manual evaluation based on expert knowledge (Fig. 2) and objective evaluation based on computational figures of merit (Fig. 5) for lungs and kidneys.
The previously introduced quantities (Eq. (2-10)) are widely used in the evaluation of medical image registration and segmentation algorithms. For evaluation of image segmentation, manually segmented and automatically segmented objects are usually compared on the perfectly aligned images. (44)(45)(46) Some studies reported Dice coefficients close to 0.7 as excellent results. (46) Likewise, many papers used evaluation metrics to compare different types of registration algorithms. (25,47) In our work, we merely validated our intermodality deformable registration process to build a whole-body CT-MR atlas and define a framework for validation of MR-guided attenuation correction in hybrid PET/MRI using CT-based attenuation correction as reference.
We strived to cover the extensive range of quantities available in this study. Nevertheless, since the whole registration process is sensitive to geometry and differing from one object to the other, proposing a comprehensive approach for evaluation of deformable image registration is not an easy task, particularly for whole-body imaging. Algorithms developed to address the registration problem for a specific body region (e.g., the lungs) and validated using special test cases for that region might not perform equally well in other regions of the body or in whole-body studies. Therefore, we do not claim that our evaluation is complete and perfect, as it includes many limitations. The limited number of anatomic landmarks can be considered as the main limitation of our work. We aimed at selecting a largest number of data points per site for reliable evaluation. However, using a large number of data points per anatomical site is difficult to achieve when a human observer is involved. Another limitation is that our evaluation relied heavily on the segmentation of the whole body and three specific organs. The errors and uncertainties associated with segmentation may be considered as prevailing origin of the discrepancies. Segmentation errors also need to be estimated using a reproducibility experiment involving multiple observers over multiple time points, which was not possible in our case owing to the lack of qualified human observers willing to perform this tedious task. Nevertheless, it is commonplace to use human expert knowledge as ground truth for evaluation of image segmentation techniques. Indeed, most of the studies cited did not perform multiple observer studies owing to the above-mentioned difficulties.
In addition to the above, when probing outliers from box and whisker plots, we encountered cases where the organ being considered is indistinguishable, which consequently causes registration and segmentation errors. The segmentation errors have several origins, including low image contrast in low-resolution MR images used in this protocol, metal artifacts in images of some patients bearing metallic implants (patient #21), and respiratory motion artifacts.
There are limited data in the literature reporting on performance assessment of whole-body multimodality image registration since most of the studies focused on specific regions of the body, mainly the brain where registration with high accuracy could be achieved. Our experience in cardiac perfusion PET/CT studies led to the conclusion that deformable registration outperforms both manual and rigid-body registration. (48) Klein et al. (27) reported a mean distance between corresponding points of 1 mm for brain imaging using elastix. Our experience with elastix for deformable atlas registration in small animal PET/CT imaging revealed a mean registration mismatch distance of about 6 mm with good quantification accuracy in most of the regions, especially the brain, but not in the bladder. (49) In this work, a median distance error of 3.7 mm was obtained for brain segmentation (Table 3), which is reasonable for a whole-body registration. The median of our registration errors for anatomic landmarks varies between 7 mm and 14 mm; however, the misregistrations were local. A lower overall error can be expected in our case. It should be noted that because of the retrospective nature of the study, physical spacing along Z direction for the datasets used in this work is much larger than that of axial (instead of equal physical spacing), which might affect the registration accuracy. Uniform sampling in physical space is more effective than uniform sampling in voxel space (in case voxels are not isomeric). However, even in this case we did not consider all control points. Eight-voxel spacing will result in over 30,000 control points, which will considerably slow down the whole process. Our registration procedure randomly selects 6,000 control points for each iteration.
Segmentation of all organs in both CT and MR images is tricky, and on that account we considered only three organs located in different regions of the body to obtain a global estimation of the performance of the image registration algorithm. Various patient motions, such as physiological respiratory motion and arms motion, are other sources of error that can compromise the registration process.
The registered image pairs constitute a useful 3D whole-body MR-CT atlas, suitable for the development and evaluation of novel MR-guided attenuation correction procedures for hybrid PET/MRI systems. Our aim is to assess the potential of various knowledge base learning strategies to convert MR to pseudo-CT images, enabling the generation of attenuation maps with continuous variation in attenuation correction factors through classification of bone and consideration of lung inhomogeneities. This work is under progress and will be reported in future publications.

V. ConCLuSIonS
We have developed and evaluated an MR to CT deformable registration algorithm using 28 clinical whole-body scans. The registration could be performed from CT to MR space either by applying inverse transform parameters or performing another registration. Our intention in this work was to assess the performance of the registration algorithm while keeping original CT with defined landmarks as reference. The registration of large volumes requires more efforts in terms of computing time and computational complexity. Although some of the considered evaluation metrics (DC and JC) were not independent, we deliberately included all potential metrics for evaluation. Our 3D registration process suffers from mismatch in the arms in some cases. However, the applied method is able to accurately register the arms and other parts of the body in most of the cases at the expense of computation time. Besides, for the targeted objective (evaluation and atlas generation for MR-guided attenuation correction), the lack of correspondence at the level of the arms will not be a major obstacle.