Evaluation of automatic contour propagation in T2‐weighted 4DMRI for normal‐tissue motion assessment using internal organ‐at‐risk volume (IRV)

Abstract

Purpose: The purpose of this study was to evaluate the quality of automatically propagated contours of organs at risk (OARs) based on respiratory-correlated navigator-triggered four-dimensional magnetic resonance imaging (RC-4DMRI) for calculation of internal organ-at-risk volume (IRV) to account for intra-fractional OAR motion.

Methods and Materials: T2-weighted RC-4DMRI images of 10 volunteers were acquired and reconstructed using an internal navigator-echo surrogate and concurrent external bellows under an IRB-approved protocol. Four major OARs (lungs, heart, liver, and stomach) were delineated in the 10-phase 4DMRI. Two manual-contour sets were delineated by two clinical personnel and two automatic-contour sets were propagated using free-form deformable image registration. The OAR volume variation within the 10-phase cycle was assessed and the IRV was calculated as the union of all OAR contours. The OAR contour similarity between the navigator-triggered and bellows-rebinned 4DMRI was compared. A total of 2400 contours were compared to the most probable ground truth with a 95% confidence level (S95) in similarity, sensitivity, and specificity using the simultaneous truth and performance level estimation (STAPLE) algorithm.

Results: Visual inspection of automatically propagated contours finds that approximately 5–10% require manual correction. The similarity, sensitivity, and specificity between manual and automatic contours are indistinguishable (P > 0.05). The Jaccard similarity indices are 0.92 ± 0.02 (lungs), 0.89 ± 0.03 (heart), 0.92 ± 0.02 (liver), and 0.83 ± 0.04 (stomach). Volume variations within the breathing cycle are small for the heart (2.6 ± 1.5%), liver (1.2 ± 0.6%), and stomach (2.6 ± 0.8%), whereas the IRV is much larger than the OAR volume, by 20.3 ± 8.6% (heart), 24.0 ± 8.6% (liver), and 47.6 ± 20.2% (stomach). The Jaccard index is higher in navigator-triggered than bellows-rebinned 4DMRI by 4% (P < 0.05), due to the higher image quality of navigator-based 4DMRI.

Conclusion: Automatic and manual OAR contours from navigator-triggered 4DMRI are not statistically distinguishable. The navigator-triggered 4DMRI image provides higher contour quality than bellows-rebinned 4DMRI. The IRVs are 20–50% larger than OAR volumes and should be considered in dose estimation.

The ITV has also been obtained by generating a maximum intensity projection (MIP) image. [6][7][8] The full prescription dose is planned to cover the planning tumor volume (ITV+margin), which may include nearby OARs and cause normal-tissue toxicity. 4 A planning organ-at-risk volume (PRV) 9-11 is used to account for inter-fractional variation in OAR position, but not for intra-fractional motion. Therefore, OAR motion has not been properly accounted for, leading to uncertainties in OAR dose estimation, treatment toxicity, 12,13 and the dose-toxicity relationship. 4,14 In contrast, incorporating OAR motion in treatment planning may provide improved OAR dose estimation and therefore an optimized dose prescription for an SBRT treatment. More accurate clinical data may lead to a better understanding of SBRT dose-limiting toxicity [15][16][17][18] and normal tissue complication probabilities (NTCP), 13 and thus improve the therapeutic ratio.
Respiratory-correlated four-dimensional magnetic resonance imaging (4DMRI) offers higher soft-tissue contrast [19][20][21][22][23] than 4DCT and, by using an internal navigator echo rather than the external surrogate used by 4DCT, fewer binning artifacts. 23 In addition, time-resolved 4DMRI over multiple breathing cycles has also been reported, producing clinical motion data with more than 10 phases. [24][25][26] Because 4DMRI is an emerging 4D imaging modality with great potential benefits in radiotherapy applications, it is of paramount importance to perform preclinical evaluation, especially with more imaging data, prior to clinical use in radiotherapy planning. Automatic contour propagation using deformable image registration (DIR) has been reported for 4DCT-based respiratory motion assessment, [27][28][29] CT-based longitudinal adaptive evaluation, [30][31][32] and cone-beam CT for setup with liver matching. 33 Other automatic image segmentation approaches have been reported, 34,35 including model-based methods. Recently, automatic OAR contouring on 2D cine or 3D MR images was reported to facilitate MR-based planning or MR-guided radiotherapy. [36][37][38][39] With both high soft-tissue contrast and low binning artifacts, T2W navigator-triggered 4DMRI provides more anatomic landmarks than 4DCT, facilitating DIR-based automatic contour propagation for OAR motion assessment in treatment planning and delivery.
To evaluate the accuracy of an automatic segmentation method in human images, a physician's manual contours are used as the clinical ground truth. However, intra- and inter-observer variation is common, leading to multiple ground truths. To minimize the uncertainty in the scientific ground truth, the simultaneous truth and performance level estimation (STAPLE) algorithm was developed 40 and applied to evaluate tumor delineation in radiotherapy planning of lung, liver, and pancreatic cancer using CT or 4DCT. [41][42][43][44] Based on the statistics of a group of clinical ground truths from multiple physicians, the most probable ground truth can be computed and used as the scientific ground truth.
In this study, automatic OAR contour propagation in T2W 4DMRI was evaluated for OAR delineation and generation of the internal organ-at-risk volume (IRV) to account for intra-fractional OAR motion, as the PRV only accounts for inter-fractional OAR motion. Ten volunteers were scanned with navigator-triggered T2W 4DMRI (10 bins) under an IRB-approved protocol, and four major organs (lungs, heart, liver, and stomach) were segmented manually and automatically by a radiation oncologist and a trained medical student. We hypothesize that the DIR-propagated contours from the 4DMRI images have similar quality to the manual contours. The quality of the automatically propagated contours was evaluated against the most probable ground truth with a 95% confidence level (S95) computed by the STAPLE algorithm. 40 The OAR contour quality was also compared between navigator-triggered and bellows-rebinned 4DMRI. Finally, the volume increase from OAR to IRV was quantified using the IRV/V_OAR ratio.

in coronal directions using an internal navigator as the respiratory surrogate under an IRB-approved protocol. The bellows waveform was collected simultaneously for retrospective reconstruction of bellows-rebinned 4DMRI. The navigator echo window (3 × 3 × 6 cm³) was placed on the right diaphragm dome and amplitude binning was used for 4D image reconstruction. 23 Ten respiratory bins were used in all 4DMRI reconstructions. The pulse sequence used in 4DMRI scanning was turbo spin echo with TE/TR of 80/6000 ms, a flip angle of 90°, and a pixel bandwidth of 470 Hz. To avoid signal saturation, 4-8 packs of acquisition (segmented acquisition bands) were defined to ensure that two consecutive 2D slice images were acquired from different packs. The 4D scanning program used the breathing waveform from the first 10 s as a training dataset for amplitude-based binning during the rest of the scan, which continued until the bin-slice array (table) was filled.
The images have a pixel size of 2 × 2 mm² and a slice spacing of 5 mm. The 4DMRI acquisition lasted 6-15 minutes with a large field of view covering the lungs, stomach, and liver. The navigator-triggered 4DMRI images were rebinned using the concurrent bellows waveforms (bellows-rebinned) 23 for contour delineation and comparison.
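The amplitude-based binning described above can be illustrated with a minimal sketch. The scanner's actual binning logic is not given in the text, so everything here is an assumption made for illustration: equal-width amplitude bins spanning the range of the training waveform, no distinction between inhalation and exhalation, and the hypothetical helper names `make_bin_edges`, `amplitude_bin`, and `fill_bin_slice_table`.

```python
def make_bin_edges(training_amplitudes, n_bins=10):
    """Equal-width bin edges spanning the training waveform (assumed scheme)."""
    lo, hi = min(training_amplitudes), max(training_amplitudes)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def amplitude_bin(amplitude, edges):
    """Map one navigator amplitude to a bin index, clamping out-of-range values."""
    n_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    if hi == lo:
        return 0
    idx = int((amplitude - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def fill_bin_slice_table(amplitudes, edges, n_slices):
    """Simulate filling the bin-slice table from a stream of navigator readings.

    Each reading triggers the next missing slice of its bin; readings that
    fall in an already complete bin are skipped. Scanning stops once every
    bin holds n_slices slices.
    """
    n_bins = len(edges) - 1
    filled = [0] * n_bins          # slices acquired so far in each bin
    schedule = []                  # (bin index, slice index) in acquisition order
    for a in amplitudes:
        b = amplitude_bin(a, edges)
        if filled[b] < n_slices:
            schedule.append((b, filled[b]))
            filled[b] += 1
        if all(count == n_slices for count in filled):
            break
    return schedule, filled
```

In this simplified model the scan time depends on how often the breathing amplitude visits the least populated bin, which is one plausible reason for a variable acquisition time such as the 6-15 minutes reported here.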

2.B | Manual and automatic delineation of the normal structures
Based on the T2W navigator-triggered 4DMRI images, a radiation oncologist and a medical student contoured five OARs, including the right and left lungs, heart, liver, and stomach, in each respiratory phase using an in-house treatment planning system (Metropolis). A written segmentation guideline was provided, including the window/level settings (0-1200 for T2W 4DMRI) and the anatomic landmark defining the superior end of the heart (the level at which the two ascending arteries split in axial view). The intra-observer variability was examined based on volume variation among the ten breathing phases in volume-preserving organs, such as the heart, liver, and stomach. The OAR contours based on navigator-triggered 4DMRI were compared with those based on bellows-rebinned 4DMRI and the difference was assessed. Rigid alignment between the two sets of 4DMRI images was performed prior to the contour delineation and comparison.
A fast free-form multi-resolution DIR method 45,46 was employed to propagate the contour from the full-exhalation phase to the other respiratory phases. An intensity-based metric was used as the registration criterion to minimize an energy function that accounts for voxel intensity similarity and smoothness between two images. The first term of the energy function describes the voxel intensity similarity (I_A and I_B) between images A and B at point x, while the second term is a smoothing term that regularizes the gradient changes in the displacement vector field u. The parameter λ = 0.1 sets the weighting factor between the two terms. The displacement vector field is found by solving the Euler-Lagrange equation through an iterative approach. The same window/level settings, chosen for optimal OAR visualization, were used for both the moving and fixed MRI images. The region of interest was drawn to cover the anatomy of interest, excluding most of the surrounding air voxels outside the body, and the displacement vector field generated from DIR was applied for contour propagation. Two sets of DIR-propagated and two sets of manual contours in all respiratory states were generated, compared, and analyzed using the method described below. The accuracy of the free-form DIR was previously found to be approximately 3.5 mm using 4DCT of a deformable phantom. 46 It is expected that the DIR accuracy for 4DMRI, with better soft-tissue contrast and fewer binning artifacts, should be no worse than for 4DCT.
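The displayed registration equations are not reproduced in this copy. As an illustration only, a form consistent with the description above can be written under two assumptions that the text does not confirm: a sum-of-squared-differences similarity term and a first-order (gradient) smoothness term, as is common in variational free-form registration. Here Ω denotes the region of interest and u_i the components of the displacement field.

```latex
% Assumed energy function (illustrative, not necessarily the authors' exact form)
E(\mathbf{u}) = \int_{\Omega} \Big[ \big( I_B(\mathbf{x} + \mathbf{u}(\mathbf{x})) - I_A(\mathbf{x}) \big)^2
  + \lambda \sum_{i=1}^{3} \lVert \nabla u_i(\mathbf{x}) \rVert^2 \Big] \, d\mathbf{x},
  \qquad \lambda = 0.1

% Corresponding Euler-Lagrange equations, one per displacement component
\lambda \nabla^2 u_i(\mathbf{x}) =
  \big( I_B(\mathbf{x} + \mathbf{u}) - I_A(\mathbf{x}) \big)
  \frac{\partial I_B(\mathbf{x} + \mathbf{u})}{\partial x_i},
  \qquad i = 1, 2, 3
```

Under these assumptions, a larger λ favors a smoother displacement field at the cost of intensity matching, and the coupled equations are typically solved by iterative relaxation on a multi-resolution image pyramid.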

2.C | Assessment of DIR-propagated OAR contours using the STAPLE algorithm
A statistical analysis of multiple sets of contours was conducted using the STAPLE algorithm 40 to provide similarity (Jaccard index), sensitivity (true positive), and specificity (true negative) 44,47 (or SSS), which were expressed in terms of D and G, the volumes enclosed inside the individual and group consensus contours, and D̄ and Ḡ, the spaces outside the volumes D and G, respectively. The STAPLE algorithm calculated the most probable ground truth with a 95% confidence level (S95) using a maximum likelihood algorithm based on the input contours, which are assumed to be close to the ground truth. Four contour sets (two manual and two automatic) were used and evaluated against the S95 contour. The Student's t-test was performed after the STAPLE analysis to obtain the P-value of the SSS results between the manual and automatic contours and between the contours from navigator-triggered and bellows-rebinned 4DMRI images.
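The displayed SSS equation is not reproduced in this copy. The standard set-based definitions that match the description are sketched below, where D is the individual contour volume, G the consensus (S95) volume, a bar denotes the complement within the image, and |·| is the voxel count; this is the conventional form and is assumed rather than copied from the original.

```latex
\begin{aligned}
\text{Similarity (Jaccard)} &= \frac{|D \cap G|}{|D \cup G|}, \\
\text{Sensitivity} &= \frac{|D \cap G|}{|G|}, \\
\text{Specificity} &= \frac{|\bar{D} \cap \bar{G}|}{|\bar{G}|}.
\end{aligned}
```

Note that specificity depends on how much background is included in the evaluated region, so it is only comparable between contours scored over the same region of interest.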
The STAPLE algorithm was implemented as a Python script. Based on the contour inputs (two manual contours or two DIR-propagated contours, checked visually and corrected as needed), the program first generated a probability map, then calculated the most probable ground truth (S95) at the 95% confidence level as the reference ground truth to evaluate the similarity, sensitivity, and specificity of the individual input contours using Eq. 3.
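The workflow described above (probability map, thresholded consensus, per-contour scores) can be sketched in a few lines. This is not the authors' script: it is a minimal pure-Python version of the binary STAPLE expectation-maximization scheme operating on flattened 0/1 masks, and it assumes that S95 corresponds to thresholding the probability map at 0.95, that the prior is the mean foreground fraction, and that the masks contain both foreground and background voxels. The function names are hypothetical.

```python
def staple(decisions, max_iter=100, tol=1e-8, eps=1e-6):
    """Binary STAPLE by expectation-maximization.

    decisions: list of raters, each a flat list of 0/1 labels per voxel.
    Returns (w, p, q): per-voxel foreground probability, and per-rater
    sensitivity and specificity estimates.
    """
    n_raters = len(decisions)
    n_voxels = len(decisions[0])
    prior = sum(sum(d) for d in decisions) / (n_raters * n_voxels)
    p = [0.99] * n_raters   # initial sensitivities
    q = [0.99] * n_raters   # initial specificities
    w = [0.0] * n_voxels
    for _ in range(max_iter):
        # E-step: posterior probability that each voxel is foreground
        for i in range(n_voxels):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i]:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            w[i] = a / (a + b)
        # M-step: re-estimate each rater's performance parameters
        sum_w = sum(w)
        sum_not_w = n_voxels - sum_w
        new_p, new_q = [], []
        for j in range(n_raters):
            tp = sum(w[i] for i in range(n_voxels) if decisions[j][i])
            tn = sum(1.0 - w[i] for i in range(n_voxels) if not decisions[j][i])
            new_p.append(min(max(tp / sum_w, eps), 1.0 - eps))
            new_q.append(min(max(tn / sum_not_w, eps), 1.0 - eps))
        change = max(abs(x - y) for x, y in zip(p + q, new_p + new_q))
        p, q = new_p, new_q
        if change < tol:
            break
    return w, p, q


def similarity_sensitivity_specificity(d, g):
    """Jaccard index, sensitivity and specificity of mask d against reference g."""
    tp = sum(1 for a, b in zip(d, g) if a and b)
    fp = sum(1 for a, b in zip(d, g) if a and not b)
    fn = sum(1 for a, b in zip(d, g) if not a and b)
    tn = sum(1 for a, b in zip(d, g) if not a and not b)
    return tp / (tp + fp + fn), tp / (tp + fn), tn / (tn + fp)
```

A typical use would be `w, p, q = staple(masks)` followed by `s95 = [1 if x >= 0.95 else 0 for x in w]`, after which each input mask is scored against `s95`. For full 3D volumes an array library would be the practical choice; plain lists are used here only to keep the sketch dependency-free.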

2.D | OAR volume variation and internal organ-at-risk volume
The volume-conserving organs, such as the heart, liver, and stomach, do not change their volume with respiratory motion, although they may deform, because non-lung tissues are not compressible under the respiratory pressure difference (3-6 mmHg). 48

The intra-observer variability was assessed using volume-conserving organs, including the heart, liver, and stomach. Fig. 1 shows an example from one volunteer of the intra-observer variability in three manually contoured organ volumes. The small intra-observer variability (±3%) is attributed to the high soft-tissue contrast of T2W 4DMRI images.
The results for all 10 volunteers are tabulated in Table 1. The changes in the center of mass of the liver and stomach are smaller than the diaphragm motions on the same side. For the liver, the manual contours are close to the ground truth, as shown in Fig. 2.

3.B | Comparison of manual and automatic OAR contours
The automatically propagated contours across the 4DMRI phases are based on DIR, whose uncertainty should be no worse than the approximately 3.5 mm validated in 4DCT, owing to the high MR soft-tissue contrast and minimal binning artifacts of the navigator-triggered 4DMRI, as shown in Fig. 2A. As a result, automatic contours are very similar to the manual contours from the same observer (Fig. 2B). Notably, approximately 5-10% of the propagated contours are outliers with obvious flaws, often occurring at the superior and inferior OAR edges, so visual checking and manual correction are necessary. Fig. 2B shows the S95 contour in a liver case, generated from two manual and two automatic high-quality contours using the STAPLE algorithm.

FIG. 1. A typical example (volunteer 5) of manual vs. automatic contours and intra- and inter-observer variability in three volume-conserving organs based on T2W RC-4DMRI. The auto-contour variation is smaller than the inter-observer variation (U1 = user 1 and U2 = user 2).

Table 2 tabulates the similarity, sensitivity, and specificity between manual and automatic contours. The similarity between manual and automatic contours from the same observer is generally greater than that between observers, as shown in Tables 1 and 2, suggesting high-quality automatic OAR contours. The sensitivity and specificity of the manual and automatic contours are similarly high in comparison with the S95 contour.

3.C | Organ motion in 4DMRI and internal organ-at-risk volume
The OAR motion can be represented by the center-of-mass trajectory of the organ within the breathing cycle. The superior-to-inferior motions of the liver and stomach are listed in Table 1, together with those of the left and right diaphragm domes. The IRV, which accounts for OAR motion, is shown in Figure 3. For all 10 subjects, the IRV exceeds the OAR volume by 20.3 ± 8.6%, 24.0 ± 8.6%, and 47.6 ± 20.2% for the heart, liver, and stomach, respectively. At the superior and inferior borders of an OAR, the organ tissue voxels occupy the space only periodically, so the fraction of the breathing cycle during which the organ is present falls from 100% to 0% across that region, which appears blurred in the averaged MR images from 4DMRI.
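The quantities in this subsection can be sketched as follows. The IRV is taken as the union of the per-phase OAR contours, as stated in the Methods; the choice of the mean per-phase volume as the reference OAR volume is an assumption, since the text does not say which phase volume was used. Masks are represented as sets of (z, y, x) voxel indices and the function names are hypothetical.

```python
def irv_from_phases(phase_masks):
    """IRV as the union of the OAR masks from all respiratory phases."""
    irv = set()
    for mask in phase_masks:
        irv |= mask
    return irv


def volume_cm3(mask, voxel_mm=(2.0, 2.0, 5.0)):
    """Mask volume in cm^3 for the 2 x 2 x 5 mm^3 voxels reported above."""
    voxel_mm3 = voxel_mm[0] * voxel_mm[1] * voxel_mm[2]
    return len(mask) * voxel_mm3 / 1000.0


def irv_increase_percent(phase_masks):
    """Percent by which the IRV exceeds the mean per-phase OAR volume (assumed reference)."""
    irv = irv_from_phases(phase_masks)
    mean_oar = sum(len(m) for m in phase_masks) / len(phase_masks)
    return 100.0 * (len(irv) / mean_oar - 1.0)


def center_of_mass(mask):
    """Center of mass of a binary mask in voxel index units (z, y, x)."""
    n = len(mask)
    return tuple(sum(v[k] for v in mask) / n for k in range(3))
```

For example, an organ spanning 10 slices that shifts by 2 slices between two phases gives a 12-slice IRV, a 20% increase, which illustrates why smaller organs with comparable motion show larger relative increases.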

3.D | Contour differences in 4DMRI reconstructed using internal and external surrogates
The manual contours based on navigator- and bellows-binned 4DMRI are compared in the heart and liver for clinically relevant differences. Figure 4 shows that both the manual and automatic contours based on bellows-rebinned 4DMRI are consistently inferior to those based on navigator-triggered 4DMRI, reflecting the difference in image quality.
The P-values (P < 0.05) are shown in Fig. 4.

TABLE 1 Evaluation of intra-observer variability in manual contours of three volume-conserving organs (heart, liver, and stomach) based on relative volume change within a breathing cycle. U1 = User 1 and U2 = User 2.

The T2W 4DMRI provides high soft-tissue contrast with many fine anatomic landmarks in the image, improving both manual and automatic OAR contour quality. Moreover, using the internal navigator for 4DMRI image reconstruction, binning artifacts are almost negligible. 23 Therefore, the accuracy and reliability of the DIR-based propagated contours among 4DMRI images are expected to be higher (Fig. 2A) than in 4DCT images with low soft-tissue contrast and frequent binning artifacts. 28,49,50 This explains the low intra-observer (Fig. 1 and Table 1) and inter-observer (Fig. 2B and Table 2) variability. In addition, the contour guideline is important, especially for the superior boundary of the heart, defined as the level at which the two ascending arteries split in axial view. The uncertainty in the heart contour also results from whether to include the fat layer inferior to the heart, which is visible in T2W 4DMRI but unclear in 4DCT. The stomach contours have the highest uncertainty, likely due to interference from foreign objects and trapped air in the hollow structure. Studies of OAR segmentation were mostly based on CT or 4DCT. 51,52

4.B | Consistency and reliability of automatic OAR volumes using DIR mapping
The uncertainty of the free-form DIR algorithm was reported to be approximately 3.5 mm in 4DCT in the thorax and upper abdomen. 46 Given more visible landmarks and fewer artifacts in T2W 4DMRI images, 23 it is expected that the DIR uncertainty should be reduced. However, the image resolution of 4DMRI (2 × 2 × 5 mm³) is inferior to that of 4DCT (1 × 1 × 3 mm³); therefore, the DIR accuracy may be affected, as may the automatically propagated contours. On the other hand, the binning artifacts in 4DMRI are much fewer than in 4DCT (see more discussion below). Overall, the quality of DIR alignment in 4DMRI should be similar to that in 4DCT, 46 if not better, as shown in Fig. 2A. In addition, the major bronchial tree structure is visible in the thorax, facilitating lung DIR alignment. The liver contour has few artifacts because the liver is near the navigator (Fig. 2A). Mild binning artifacts are observed in the heart and stomach due to cardiac and digestive motions, 23 which are not synchronized with respiration.
The observed large IRV increase for the stomach (48 ± 20%) compared with that of the liver (24 ± 9%) may result from the high heterogeneity inside the stomach, its less clear organ boundary, and possible internal gas compression and movement, and therefore higher uncertainties in organ delineation.

FIG. 2. (a) Deformable image registration between full exhalation (blue) and full inhalation (red) of T2W navigator-triggered 4DMRI (volunteer #7). Soft-tissue alignment is shown before and after DIR within a region of interest (orange box); aligned voxels appear white, except for the flowing-blood voxels inside the major vessels around the heart. (b) Comparison of four sets of manual and automatic contours (user 1: orange/green, user 2: pink/brown) in the upper panel and with the most probable ground truth (S95, green area), generated by the STAPLE algorithm, in the lower panel.

TABLE 2 Comparison of manual and automatic contours with the most probable ground truth (S95) contour based on their similarity (Jaccard index), sensitivity, and specificity averaged over ten respiratory phases and two observers for four OARs. The manual and automatic contours are statistically indistinguishable. # Volunteers 1 and 9 have an incomplete field of view in the inferior and superior directions, respectively. * P < 0.05 indicates that the automatic contour is slightly better than the manual contour.

Volunteer subjects serve as surrogates for patients. In the presence of a tumor or metastasis, the OAR appearance may vary because of the tumor and possible edema. These extra objects could, in fact, be used as additional anatomic landmarks to facilitate the DIR and automatic contour propagation. Therefore, we expect that the results from healthy subjects can be applied to patients.

4.C | Contour variation resulting from 4DMRI binning artifacts
OAR contour similarity is evaluated between two 4DMRI images that were reconstructed from the same image dataset using two concurrent respiratory surrogates (internal navigator and external bellows). Binning artifacts are fewer in navigator-triggered than in bellows-rebinned 4DMRI images, 23 thereby enhancing the contour quality (P < 0.05, Fig. 4). This result suggests that the contour quality in navigator-triggered 4DMRI is better than that in 4DCT, where an external surrogate is almost always used. In fact, binning artifacts are commonly present in 4DCT 49,50 and can cause up to 90-110% gross tumor volume changes in lung cancer. 53,54 The high 4DMRI image quality and high soft-tissue contrast improve the DIR and therefore contribute to the high OAR contour quality.
The quality of the automatically propagated contours in this MRI study is at the high end, with only 5-10% requiring manual correction in comparison with 20-45% in 4DCT, 55 which uses an external surrogate for reconstruction. Although the percentage is drastically reduced, further investigation should be conducted to facilitate clinical applications in assessing OAR motion for IRV determination.
As the contour outliers often occur in slices with drastic shape changes (large motions), such as the inferior lungs, they are likely caused by contour interpolation from stretched voxels to a slice. In addition, a different DIR algorithm may also be evaluated.

4.D | The OAR voxel probability within the IRV
FIG. 3. Visual illustration of the internal organ-at-risk volume (IRV, color shaded) and individual organ-at-risk (OAR) contour volumes for volunteer #7 in full exhalation and inhalation phases: (a) heart, (b) liver, and (c) stomach. On average, the volume increase from the OAR volume to the IRV is 20-50%, depending on the OAR motion, volume, and contour accuracy.

The current ITV planning approach does not account for OAR motion, and therefore the planned dose to the OAR may not be accurate. Clinically, only the inter-fractional variability in the OAR position relative to the tumor is considered using the PRV. In SBRT, OAR toxicity is often the limiting factor, preventing the prescription of a potent ablative dose with accelerated fractionation. 4,5 Therefore, accurate estimation of OAR dose is essential to optimize the SBRT dose prescription and to establish the OAR dose-toxicity relationship for an improved therapeutic ratio. 13 The introduction of the IRV into SBRT planning is one way to account for OAR motion and to assess the OAR dose more accurately. Using the voxel probability of the OAR being in or near the radiation field is another method, because the OAR dose depends on how long the OAR stays in or near the planning tumor volume (ITV+margin). Further investigation is needed to reduce the outliers in automatic organ delineation, to provide more accurate dose evaluation to the OARs, and to account for breathing irregularities that affect the ITV and IRV delineation using the time-resolved 4DMRI technique over multiple breathing cycles. 26
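The voxel-probability idea mentioned above can be illustrated with a short sketch. It assumes that each respiratory bin represents an equal share of the breathing cycle, which is a simplification (amplitude bins are generally not equally populated in time); `occupancy_probability` and `voxels_with_probability_at_least` are hypothetical helper names, and masks are sets of voxel index tuples.

```python
from collections import Counter


def occupancy_probability(phase_masks):
    """Fraction of respiratory phases in which each IRV voxel lies inside the OAR.

    Assumes equal time weighting of the phases (a simplification).
    """
    counts = Counter()
    for mask in phase_masks:
        counts.update(mask)   # each voxel in a phase mask adds one count
    n_phases = len(phase_masks)
    return {voxel: c / n_phases for voxel, c in counts.items()}


def voxels_with_probability_at_least(probability, threshold):
    """Sub-volume of the IRV occupied by the OAR at least `threshold` of the time."""
    return {voxel for voxel, prob in probability.items() if prob >= threshold}
```

With a threshold of 1.0 this returns the core volume occupied in every phase, while any threshold above 0 and up to the smallest nonzero probability returns the full IRV, so the map interpolates between the two extremes and could be used to weight the planned dose by time spent at each location.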

| CONCLUSION
Using T2W navigator-triggered RC-4DMRI and a free-form DIR algorithm, the automatic contours of four common OARs are evaluated with the STAPLE analysis and found to be indistinguishable from the manual contours for the lungs, heart, liver, and stomach in 10 subjects.

FIG. 4. Variation in heart and liver contour similarity based on the navigator-triggered and bellows-rebinned 4DMRI images. The similarity difference between navigator-based and bellows-based 4DMRI is statistically significant for the manual and automatic contours (heart: P = 0.003 and 0.042 and liver: P = 0.001 and 0.001, respectively).