Automatic detection of contouring errors using convolutional neural networks

Purpose To develop a head and neck normal structure autocontouring tool that can automatically detect errors in autocontours from a clinically validated autocontouring tool. Methods An autocontouring tool based on convolutional neural networks (CNN) was developed for 16 normal structures of the head and neck and tested to identify contour errors from a clinically validated multiatlas-based autocontouring system (MACS). The computed tomography (CT) scans and clinical contours from 3495 patients were semiautomatically curated and used to train and validate the CNN-based autocontouring tool. The final accuracy of the tool was evaluated by calculating the Sørensen–Dice similarity coefficients (DSC) and Hausdorff distances between the automatically generated contours and physician-drawn contours on 174 internal and 24 external CT scans. Lastly, the CNN-based tool was evaluated on 60 patients' CT scans to investigate the possibility of detecting contouring failures. The contouring failures on these patients were classified as either minor or major errors. The criteria to detect contouring errors were determined by analyzing the DSC between the CNN- and MACS-based contours under two independent scenarios: (a) contours with minor errors are clinically acceptable and (b) contours with minor errors are clinically unacceptable. Results The average DSC and Hausdorff distance of our CNN-based tool were 98.4%/1.23 cm for the brain, 89.1%/0.42 cm for the eyes, 86.8%/1.28 cm for the mandible, 86.4%/0.88 cm for the brainstem, 83.4%/0.71 cm for the spinal cord, 82.7%/1.37 cm for the parotids, 80.7%/1.08 cm for the esophagus, 71.7%/0.39 cm for the lenses, 68.6%/0.72 cm for the optic nerves, 66.4%/0.46 cm for the cochleas, and 40.7%/0.96 cm for the optic chiasm.
With the error detection tool, the proportion of clinically unacceptable MACS contours that were correctly detected was 0.99/0.80 on average (excluding the optic chiasm) when contours with minor errors were considered clinically acceptable/unacceptable, respectively. The proportion of clinically acceptable MACS contours that were correctly detected was 0.81/0.60 on average (excluding the optic chiasm) under the same two scenarios, respectively. Conclusion Our CNN-based autocontouring tool performed well on both the publicly available and the internal datasets. Furthermore, our results show that CNN-based algorithms are able to identify ill-defined contours from a clinically validated and clinically used multiatlas-based autocontouring tool. Therefore, our CNN-based tool can effectively perform automatic verification of MACS contours.


INTRODUCTION
Manual contouring is a time-consuming process 1,2 and is prone to inter- and even intra-user variability. 3-8 An autocontouring system can save the experienced user's time and reduce both inter- and intra-user variability. However, an experienced user must review every contour generated by an autocontouring system before it can be used clinically. A previous study showed that automated contouring of head and neck structures can save 180 min per patient but still requires 66 min to edit the automatically generated contours. 9 Although editing takes significantly less time than manual contouring, it still requires the user's judgment, which can be biased and time-intensive, and errors can still be missed. An automated contour review process that automatically flags suspicious cases could potentially be more objective and provide additional time savings. Furthermore, an automatic review process would be an integral part of an automated radiation treatment planning system; we are currently developing such a system, 10 which asks users to review all automatically generated contours every time a new treatment plan is created.
In this study, an autocontouring tool based on convolutional neural networks (CNN) was developed and tested. We then assessed the error detection ability of our tool when applied to computed tomography (CT) scans with normal structures contoured using an in-house atlas-based contouring tool. This multiatlas-based autocontouring system (MACS) is the primary contouring tool for a fully automated radiation treatment plan generator, 10 and it has been used successfully for clinical 11 and research purposes for several years. 12-15 Specifically, it is used to contour the normal structures for nearly all head and neck patients who receive radiotherapy at our institution, so the development of a CNN-based autocontouring tool promises to augment and provide quality assurance for MACS.
Machine learning-based contouring error detection algorithms have shown promising results at various radiation treatment sites. McIntosh et al., 16 used image features with a conditional random forests algorithm to detect contouring errors in thoracic structures. Hui et al., 17 applied principal component analysis and Procrustes analysis to contour shapes to detect contouring errors in the male pelvis. Chen et al., 18 identified contouring errors in the head and neck region using geometric attribute distribution models.
Furthermore, McCarroll et al., 19 developed a bagged tree classification model using contour features to predict errors in MACS contours and achieved 0.63 accuracy. However, many erroneous MACS contours still have reasonably good shapes and relative positions even though their absolute positions are off by a few millimeters to centimeters, and such errors are difficult to detect with machine learning-based algorithms that use the shapes and/or features of the contours. On the other hand, Beasley et al., 20 used the volumetric overlap and distance between expert contours and automatically generated contours to detect errors in the automatically generated contours, and achieved an AUC of 0.85-0.90 in detecting errors in parotid contours. As this approach is effective at detecting even small offsets in contours, we implemented a similar approach to detect errors in MACS contours, replacing the expert with the CNN-based autocontouring tool.
The CNN algorithm was chosen for the autocontouring tool because other studies 21-26 have shown that CNN-based models outperform most other machine learning-based and model-based algorithms in contouring head and neck structures. Zhu et al., 22 showed that their CNN-based autocontouring algorithm for head and neck normal structures could achieve performance equivalent to the best MICCAI 2015 challenge results obtained with atlas- and model-based algorithms. 27 Google DeepMind 23 demonstrated that their CNN-based autocontouring algorithm was able to achieve near expert dosimetrist level accuracy, and they also provided the ground truth contours for their test dataset to enable other autocontouring systems to benchmark their performance. However, most autocontouring tools have been developed for internal use or commercial purposes and are thus not publicly available. Therefore, we developed our own CNN-based autocontouring tool and report its performance against the benchmark data from Google DeepMind.

MATERIALS AND METHODS
The CNN-based autocontouring tool was developed to generate contours for 16 normal structures of the head and neck.

2.A.1. Training and validation data

The CT scans and the corresponding clinical contouring data of 3495 patients who received external photon beam radiation treatment from September 2004 to June 2018 at the University of Texas MD Anderson Cancer Center were used as the training and validation data. Of these patients, 1169 had head and neck cancer, 1319 had brain cancer, and 1007 had thoracic or esophageal cancer. Contours for each structure were acquired independently to maximize the amount of data; thus, the number of available structures in a single patient's data varied from 1 to 16. The total number of CT scans used for training and validation for each structure is given in Table I. Of these scans, 80% were used for training and 20% for cross-validation. The data were collected solely on the basis of their structure labels, but a manual review was conducted when two or more labels indicated the same structure in the same CT scan.
The data curation was performed semiautomatically as described in Fig. 1. First, the acquired clinical contours and the corresponding scans were given to a CNN-based segmentation model for training. The model was trained until it could roughly segment the structures but was still underfitted. Then, contours were automatically generated on the training data with this trained model, and the Sørensen-Dice similarity coefficients (DSCs) 14 between the original training contours and the predicted contours were calculated. When a calculated DSC was less than a certain value (0.6 for structures larger than an eye, 0.4 for smaller structures), the original contour was manually reviewed, and incorrect clinical contours were removed from the training dataset. Once the entire set of training data was reviewed, we trained the model with the "refined" dataset again. This process was repeated two to three times until all significantly incorrect contours had been eliminated from the training dataset.
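The review-and-retrain loop above can be sketched as follows; the binary-mask representation and the `flag_for_review` helper are illustrative assumptions (the clinical pipeline operates on contour data rather than raw masks), but the DSC thresholds are those stated in the text.

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Sørensen-Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def flag_for_review(clinical: np.ndarray, predicted: np.ndarray,
                    larger_than_eye: bool) -> bool:
    """Flag a clinical contour for manual review when the underfitted
    model's prediction disagrees with it beyond the DSC threshold
    (0.6 for structures larger than an eye, 0.4 for smaller ones)."""
    threshold = 0.6 if larger_than_eye else 0.4
    return dice(clinical, predicted) < threshold
```

Contours that are flagged would be inspected manually and, if incorrect, dropped before the next training pass.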
Next, the training data were flipped and rotated for data augmentation. First, the structures were doubled by horizontal flipping. The paired structures had to be relabeled owing to their change in orientation from flipping (e.g., a right eye becomes a left eye after flipping). Then, the data were tripled by rotation around the longitudinal axis at two random angles between −30° and 30°.
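A minimal sketch of this augmentation, assuming volumes are (slice, row, column) arrays and a hypothetical side-suffix labeling convention for the paired structures:

```python
import numpy as np
from scipy.ndimage import rotate

# Hypothetical label convention: paired structures carry a side suffix.
PAIR_SWAP = {"Eye_R": "Eye_L", "Eye_L": "Eye_R",
             "Parotid_R": "Parotid_L", "Parotid_L": "Parotid_R"}

def augment_flip(volume: np.ndarray, label: str):
    """Horizontal flip; paired structures are relabeled because
    flipping changes their laterality (right eye -> left eye)."""
    flipped = volume[:, :, ::-1]  # flip along the left-right axis
    return flipped, PAIR_SWAP.get(label, label)

def augment_rotate(volume: np.ndarray, rng: np.random.Generator):
    """Rotate about the longitudinal (z) axis at a random angle
    drawn from [-30, 30] degrees; axes (1, 2) span the transverse plane."""
    angle = rng.uniform(-30.0, 30.0)
    return rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
```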

2.A.2. Training the CNN-based segmentation model
The proposed model uses a combination of classification and segmentation CNN models. The Inception-ResNet-v2 28 image classification model was trained to detect the existence of the structures in each CT slice. The classification results were used to determine the range of CT slices containing each structure, as shown in Fig. 2. The CT slices within this range, as shown in Fig. 2(b), were then given as input to the segmentation models in the inference phase. A binary classifier was used to determine the presence or absence of each structure in each image slice, except for the brainstem and spinal cord; for these, a three-class classifier was used to select among the brainstem, the spinal cord, and the absence of both structures, because the two structures are physically connected in the axial direction. The ground truth for training and validation was created by labeling slices containing a clinical contour as present. We used two CNN-based models for segmentation. The small structures (eyes, lenses, optic chiasm, optic nerves, and cochleas) were segmented using the V-Net 29 three-dimensional (3D) CNN-based model. Additional batch renormalization layers 30 were applied after every convolutional and up-convolutional layer of the original V-Net model. The other structures were segmented using FCN-8s, 31 a two-dimensional (2D) CNN-based architecture. Batch renormalization layers were also added after every layer of the FCN-8s model.
An input size of 20 slices × 512 × 512 was used for the segmentation of the eyes and the associated structures (lenses, optic nerves, and optic chiasm). According to previous studies, the median (± standard deviation) diameter of a human eye is 24.9 ± 2.2 mm, 32 and the mean heights of the optic nerves and optic chiasm are 3.0 and 3.5 mm, respectively. 33 Thus, we can assume that the eyes and all associated structures are located within ±10 slices (±25 mm) of the longitudinal center of the eyes. An input size of 20 slices × 512 × 512 was also chosen for cochlea segmentation for consistency, even though the cochleas were mostly covered within five CT slices. The rest of the structures were segmented with FCN-8s. The size of the input images was 512 × 512 × 1 channel. Only the CT slices that were classified as containing the structure of interest were transferred to the FCN-8s model for segmentation. All of the classification and segmentation architectures were trained independently for each structure.
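The hand-off from the slice classifier to the segmentation models can be sketched as below; the exact windowing logic is an illustrative assumption, not the implementation used in the study.

```python
import numpy as np

def slice_range(presence_probs, threshold: float = 0.5):
    """Indices of the first and last slices classified as containing
    the structure, or None if the structure is absent everywhere."""
    idx = np.flatnonzero(np.asarray(presence_probs) >= threshold)
    if idx.size == 0:
        return None
    return int(idx[0]), int(idx[-1])

def crop_for_vnet(ct_volume: np.ndarray, presence_probs, depth: int = 20):
    """Build the fixed 20-slice x 512 x 512 input for the 3D V-Net by
    centering a `depth`-slice window on the detected slice range."""
    rng = slice_range(presence_probs)
    if rng is None:
        return None
    center = (rng[0] + rng[1]) // 2
    start = max(0, min(center - depth // 2, ct_volume.shape[0] - depth))
    return ct_volume[start:start + depth]
```

For the 2D FCN-8s path, only the slices inside `slice_range` would be forwarded individually.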

2.A.3. Training parameters
The pixel sizes of the CT scans in the transverse plane varied from 0.53 to 1.37 mm, and the slice thicknesses varied from 1.0 to 3.75 mm. All data were resampled to a common voxel size of 0.9766 mm × 0.9766 mm × 2.5 mm. CT numbers lower than −1000 HU and higher than 3000 HU were clipped. The CT numbers, now ranging from −1000 to 3000 HU, were then rescaled to the 0-255 pixel intensity range.
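This preprocessing can be sketched as below; the interpolation order in `resample_iso` is an assumption, while the clipping and rescaling follow the text exactly.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_iso(volume: np.ndarray, spacing_zyx,
                 target_zyx=(2.5, 0.9766, 0.9766)) -> np.ndarray:
    """Resample a (z, y, x) volume to the common 2.5 x 0.9766 x 0.9766 mm
    voxel size (trilinear interpolation is an assumption)."""
    factors = [s / t for s, t in zip(spacing_zyx, target_zyx)]
    return zoom(volume, factors, order=1)

def rescale_hu(ct_hu: np.ndarray) -> np.ndarray:
    """Clip CT numbers to [-1000, 3000] HU, then map that range
    linearly onto the 0-255 pixel intensity range."""
    clipped = np.clip(ct_hu, -1000.0, 3000.0)
    return (clipped + 1000.0) * (255.0 / 4000.0)
```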
An NVIDIA DGX Station with four V100 GPUs was used to train our models. The loss function for the segmentation models was based on DSC, as DSC was our metric for determining the accuracy of a segmentation model. A weighted cross-entropy was used as the loss function for the classification model to compensate for the data imbalance between the number of slices with and without the organ of interest; the weight was set to the ratio of the number of absent slices to the number of present slices. The Adam optimizer 34 was used as the optimization algorithm, with its parameters beta1, beta2, and epsilon set to 0.9, 0.999, and 10^−8, respectively.
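The two loss functions can be written down as follows; these NumPy forms are illustrative sketches of the losses described above, not the framework implementation used for training.

```python
import numpy as np

def soft_dice_loss(pred: np.ndarray, target: np.ndarray,
                   eps: float = 1e-6) -> float:
    """1 - soft DSC between predicted probabilities and a binary mask;
    minimizing this loss maximizes the DSC accuracy metric."""
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def weighted_cross_entropy(pred: np.ndarray, target: np.ndarray,
                           n_absent: int, n_present: int,
                           eps: float = 1e-12) -> float:
    """Binary cross-entropy with the 'present' class weighted by the
    absent/present slice ratio to offset class imbalance."""
    w = n_absent / n_present
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(w * target * np.log(p)
                          + (1.0 - target) * np.log(1.0 - p)))
```

Either loss would then be minimized with Adam configured as stated above (beta1 = 0.9, beta2 = 0.999, epsilon = 10^−8).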

2.B.1. Model accuracy
Twelve CT scans comprising the baseline contouring atlas of the head and neck MACS and 162 CT scans from head and neck cancer patients who received proton radiation treatment at the University of Texas MD Anderson Cancer Center were used as the test data. As with the training and validation data, the number of available structures in a single patient's data varied from 1 to 16; the total number of CT scans used for each structure is given in Table II. The accuracy of the model was measured by the DSC and the Hausdorff distance 14 between the model-generated contours and the manual contours.
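The two accuracy metrics can be sketched as follows; the point-set form of the Hausdorff distance assumes contour surfaces sampled as (N × 3) coordinate arrays in millimeters, which is an illustrative representation.

```python
import numpy as np

def dsc(a: np.ndarray, b: np.ndarray) -> float:
    """Sørensen-Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

def hausdorff(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two point sets: the largest
    distance from any point in one set to its nearest neighbor in the
    other set."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```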
In addition, 24 CT scans from The Cancer Imaging Archive (TCIA) with 14 normal structure contours (all structures except the esophagus and optic chiasm) were used as an external test dataset. The physician-drawn contours for the TCIA dataset were provided by Google DeepMind. 23 DeepMind also published the performance of its autocontouring model, which achieved near expert dosimetrist level performance on the TCIA dataset. We applied our CNN-based model to the same TCIA data, calculated the DSC between our contours and the physician-drawn contours from DeepMind, and compared the calculated DSC with DeepMind's published DSC. We used a two-tailed Student's t-test to detect any statistically significant difference between the two models, with significance defined as P < 0.05.

2.B.2. Automatic verification of automatically generated contours
We trained and tested our CNN-based autocontouring system as an automatic verification tool with 48 CT scans with MACS contours and 12 CT scans with the baseline contouring atlas of the head and neck MACS, all of which were independent of the training dataset for the CNN-based autocontouring system. The contours were scored by an experienced head and neck radiation oncologist on a scale of 1 to 3, where 1 was a clinically acceptable contour requiring no editing (no error), 2 was a contour requiring minor editing (minor error), and 3 was a clinically unacceptable contour requiring major editing (major error); 35 the number of MACS contours for each score is given in Table III. We then generated our CNN-based model contours on these patients and calculated the DSC between the MACS contours and the CNN-based model contours. Of these 60 patients, 40 were used for receiver operating characteristic (ROC) analysis based on their DSCs and physician scores. We created two ROC curves per organ for two scenarios: (a) considering minor contouring errors to be clinically acceptable, so that only major contouring errors would be detected, and (b) considering minor contouring errors to be clinically unacceptable, so that both minor and major contouring errors would be detected. The DSC threshold, that is, the minimum DSC needed to pass the automatic verification tool, was derived so as to pass no major errors in scenario (a) and to pass fewer than 30% of the minor errors in scenario (b). Furthermore, the areas under the ROC curves (AUCs) were calculated to quantify the relationship between DSC and physician scores.
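The threshold derivation for the two scenarios can be sketched as follows; the exact rule for scenario (b)'s "fewer than 30% of minor errors" cutoff is an assumption here (a percentile of the minor-error DSC distribution), and a contour is taken to pass verification when its DSC exceeds the threshold.

```python
import numpy as np

def threshold_scenario_a(dscs, scores):
    """Scenario (a): minor errors acceptable. Choose the smallest DSC
    threshold that fails every major-error (score 3) contour."""
    majors = [d for d, s in zip(dscs, scores) if s == 3]
    return max(majors) if majors else 0.0

def threshold_scenario_b(dscs, scores):
    """Scenario (b): minor errors unacceptable. Choose a threshold such
    that fewer than 30% of minor-error (score 2) contours pass
    (illustrative percentile rule)."""
    minors = sorted(d for d, s in zip(dscs, scores) if s == 2)
    if not minors:
        return 0.0
    return minors[int(np.ceil(0.7 * len(minors))) - 1]

def auc(dscs, acceptable):
    """ROC AUC for separating acceptable from unacceptable contours by
    thresholding DSC (Mann-Whitney formulation; ties count half)."""
    d = np.asarray(dscs, float)
    acc = np.asarray(acceptable, bool)
    pos, neg = d[acc], d[~acc]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```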
To test the DSC thresholds, we calculated the sensitivity and specificity on the remaining 20 patients' data. We defined clinically acceptable contours to be positive and clinically unacceptable contours to be negative. Therefore, the sensitivity measures the proportion of clinically acceptable contours that are correctly detected as such, and the specificity measures the proportion of clinically unacceptable contours that are correctly detected as such.
As the DSC thresholds were derived independently for each scenario, the sensitivity and specificity were calculated independently for each scenario with corresponding thresholds as well.
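With these definitions, the sensitivity and specificity computation for a given DSC threshold is straightforward; as a sketch, a contour is taken to pass the verification tool when its DSC exceeds the threshold.

```python
def sensitivity_specificity(dscs, acceptable, threshold):
    """Acceptable contours are positives; a contour passes verification
    when its DSC exceeds the threshold."""
    tp = sum(1 for d, a in zip(dscs, acceptable) if a and d > threshold)
    fn = sum(1 for d, a in zip(dscs, acceptable) if a and d <= threshold)
    tn = sum(1 for d, a in zip(dscs, acceptable) if not a and d <= threshold)
    fp = sum(1 for d, a in zip(dscs, acceptable) if not a and d > threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```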

3.A. Model accuracy
The average DSCs and Hausdorff distances between our model contours and the clinical contours are given in Table IV. For the large structures contoured with the FCN-8s architecture, the DSCs were higher than 80.7%, and for the small structures contoured with the V-Net model, the DSCs were higher than 65.2%, except for the optic chiasm.
The DSCs between our model contours and the DeepMind physician-drawn contours and between the DeepMind model contours and the DeepMind physician-drawn contours are given in Table V. The differences between the models for both lenses, both parotids, the left optic nerve, the left eye, and the spinal cord were not statistically significant, and our model performed better than DeepMind's in contouring the brainstem. Our model performed worse than DeepMind's for the brain, both cochleas, the mandible, the right optic nerve, and the right eye, but the differences were smaller than 3.5% except for both cochleas. The differences in the standard deviations of the DSCs of the two models were small (<2%) except for the brainstem and lenses, for which DeepMind's outcomes were more sparsely distributed than our model's. The Hausdorff distances between our model contours and the DeepMind physician-drawn contours are given in Table VI; the average Hausdorff distances were <1.78 cm for all structures except the brain.

3.B. Automatic verification of automatically generated contours
The ROC curves based on DSC and the physician's scoring for the 40 patients are shown in Fig. 3. The average and minimum AUCs were 0.98 and 0.95 (excluding the optic chiasm) if minor errors were considered clinically acceptable, that is, for the scenario in which we wish to detect only situations where major edits are needed. The average and minimum AUCs were 0.85 and 0.66 if minor errors were considered clinically unacceptable. The sensitivity and specificity based on the given DSC thresholds for the 20 test patients are given in Table VII. If minor errors were considered clinically acceptable, the average sensitivity and specificity were 0.81 and 0.99, respectively, excluding the optic chiasm. If minor errors were considered clinically unacceptable, the average sensitivity and specificity were 0.61 and 0.80, respectively, excluding the optic chiasm.
MACS mandible contours with errors are shown in Fig. 4. Figure 4(a) shows a major error in the mandible that was detected by our system (DSC = 74.0). Figure 4(b) shows a minor error case detected by our system (DSC = 83.5), and Fig. 4(c) shows a minor error case undetected by our system (DSC = 86.7). As shown in Figs. 4(b) and 4(c), most minor error cases require the modification of only very small volumes, which have little impact on DSC, so the DSC distributions of contours with no errors and those with minor errors are difficult to distinguish.

DISCUSSION
We have demonstrated that a CNN-based architecture can accurately contour normal structures in the head and neck region. CNN architectures generally require a fixed-size input, whereas the scan range and the number of slices vary significantly according to clinical protocols and patient height. To address this problem, we used the unique approach of adding a CNN-based classification architecture, whereas the DeepMind model was built to take a partial 3D CT scan of 21 slices as input to contour a single CT slice. Our approach allows us to flexibly choose any 2D or 3D CNN-based segmentation architecture, so we could choose segmentation architectures based on performance and/or GPU memory availability. As we trained each organ contouring algorithm independently, we can later retrain any poorly performing architecture independently with an advanced architecture or newly collected data. Furthermore, we can treat a classification-segmentation combination for a single organ as a module and apply it to contour that organ at other sites, such as the esophagus for thoracic patients. A disadvantage of this approach is that multiple models are used to generate contour predictions, requiring additional time to predict all contours; however, our approach still manages to contour 16 structures on a CT scan of 160 slices in 2 min using a single GPU.

The accuracy of MACS contours was strongly associated with the DSC between the CNN-based model contours and the MACS contours. This association indicates that our CNN-based model can effectively identify the major errors in MACS contours. To date, most automated contouring error detection techniques have been developed with machine learning algorithms using the features or shapes of contoured structures 16,17,36 or the relative positions 18 of the contours.
Chen et al., 18 showed that their geometric attribute-based contouring error detection algorithms for the brain, brainstem, parotid, optic nerve, and optic chiasm contours can achieve an average sensitivity of 0.786-0.831 and an average specificity of 0.878-0.951. These results show that our system detects MACS contouring errors with accuracy similar to that of previously developed machine learning-based algorithms. Furthermore, because most of the significant errors in MACS contours were specifically caused by irregular patient positions or abnormally large tumors, erroneous MACS contours in these cases still have reasonably good shapes and relative positions, as shown in Fig. 4. Therefore, our CNN-based contouring verification system has a strength over other machine learning-based error detection algorithms in detecting significant errors in such cases.

4.A. Model accuracy
The number of training, validation, and test datasets we used to train and evaluate the model is the largest among deep learning-based head and neck autocontouring studies to date. 26 Although the accuracy on the internal dataset does not appear superior to that of other CNN-based models, the end-to-end comparison shows that the accuracy of our model is almost equivalent to that of the DeepMind model, which has near expert dosimetrist level accuracy, except for the cochleas. Additionally, our model achieved similar or lower standard deviations than the DeepMind model and had no completely failed cases (DSC = 0) on the TCIA data. The consistency and robustness of a model are important characteristics for a quality assurance tool, and our model is strong in these respects.
The accuracy of the cochlea contours from our CNN-based model was significantly inferior to that of the DeepMind model. However, the volume of the cochlea was about three to four times larger in our model than in the DeepMind model, as shown in Figs. 5(a) and 5(b). The volume difference was due to differences in how clinicians contour the cochleas. At MD Anderson, the semicircular canals are included as part of the cochlea to reduce the risk of hearing loss from radiation treatment, 37 while DeepMind's contours do not include the semicircular canals. Therefore, most of the DeepMind physician's cochlea contours are completely covered by the cochlea contours from our model, and the true-positive ratio, that is, the volume ratio of the DeepMind cochlea covered by the cochlea from our model, was 97.0% ± 5.0% (SD). Furthermore, the DSCs of the left and right cochleas were 65.2% and 67.6%, respectively, against our physician-drawn contours (Table IV), but were 39.5% and 42.2%, respectively, against the DeepMind physician-drawn contours (Table V). The accuracy of our model was still lower by about 10%, but considering the small volume of the cochlea, the difference would probably not significantly affect the final dose distribution of a radiation treatment plan, especially if a small planning margin is created around the structure. The definitions of other structures also differed somewhat between the two groups. The brain was defined to exclude the brainstem in the DeepMind contours, whereas the brainstem was part of the brain in MD Anderson's contours, as shown in Figs. 5(c) and 5(d). These differences in contouring style underestimate the DSC of our CNN-based model by about 2% (Table V) compared with the DSC of the brain against our physician's contours (Table IV). This indicates that the actual difference in brain contouring accuracy between the two models would be less than 1%.

4.B. Automatic verification of automatically generated contours
The specificity showed that major contouring errors can be confidently identified by measuring the DSC between our CNN-based contours and the target contours for most of the head and neck normal structures. Furthermore, the AUCs and the average sensitivity showed that the overall accuracy of this tool in identifying major contouring errors is sufficient for clinical implementation. Major errors in the optic chiasm, however, were difficult to identify: for the optic chiasm, our CNN-based model was neither consistent nor robust [mean DSC, 40.7% ± 13.9% (SD)]. The poor performance in contouring the optic chiasm was due to its very low contrast in CT images; even experts struggle to precisely draw the optic chiasm on CT, so MRI is recommended for contouring it. 38 This difficulty can be seen in how the chiasm is drawn in clinical practice, with much variation in size and shape.
Although the AUCs and the sensitivity and specificity showed some potential to identify minor contouring errors for some structures, the relationship was not as strong as it was for major contouring errors. Because we defined a minor contouring error to be a contour located in the right position but requiring a small shape modification, DSC, which measures the geometric overlap between two contours, is not a sensitive metric for distinguishing well-defined contours from contours requiring minor edits. Additionally, as minor errors were defined as small variations in contour shape, implementing both our system and feature-based 16 or shape-based contour verification tools 17,36 would improve the overall contouring verification accuracy.
One limitation of the automatic error verification study is that each contour was scored by only one radiation oncologist, so the study does not account for the impact of interobserver variability. Because every radiation oncologist draws contours slightly differently, a contour scored as "no error" could be scored as "minor error" by another radiation oncologist, and vice versa. Similarly, some of the cases for which we failed to predict the scores might have been predicted correctly had another radiation oncologist scored them, so a further study taking interobserver variability into account would improve the overall robustness of our tool.

CONCLUSION
We have demonstrated that a CNN-based autocontouring tool with near expert dosimetrist level accuracy for most head and neck normal structures can be developed using semiautomatically curated patient data. Furthermore, our model enables the detection of most major errors in head and neck normal structure contours created by a clinically validated multiatlas-based autocontouring tool.