Automatic contouring system for cervical cancer using convolutional neural networks

Purpose To develop a tool for the automatic contouring of clinical treatment volumes (CTVs) and normal tissues for radiotherapy treatment planning in cervical cancer patients. Methods An auto‐contouring tool based on convolutional neural networks (CNN) was developed to delineate three cervical CTVs and 11 normal structures (seven OARs, four bony structures) in cervical cancer treatment for use with the Radiation Planning Assistant, a web‐based automatic plan generation system. A total of 2254 retrospective clinical computed tomography (CT) scans from a single cancer center and 210 CT scans from a segmentation challenge were used to train and validate the CNN‐based auto‐contouring tool. The accuracy of the tool was evaluated by calculating the Sørensen‐dice similarity coefficient (DSC) and mean surface and Hausdorff distances between the automatically generated contours and physician‐drawn contours on 140 internal CT scans. A radiation oncologist scored the automatically generated contours on 30 external CT scans from three South African hospitals. Results The average DSC, mean surface distance, and Hausdorff distance of our CNN‐based tool were 0.86/0.19 cm/2.02 cm for the primary CTV, 0.81/0.21 cm/2.09 cm for the nodal CTV, 0.76/0.27 cm/2.00 cm for the PAN CTV, 0.89/0.11 cm/1.07 cm for the bladder, 0.81/0.18 cm/1.66 cm for the rectum, 0.90/0.06 cm/0.65 cm for the spinal cord, 0.94/0.06 cm/0.60 cm for the left femur, 0.93/0.07 cm/0.66 cm for the right femur, 0.94/0.08 cm/0.76 cm for the left kidney, 0.95/0.07 cm/0.84 cm for the right kidney, 0.93/0.05 cm/1.06 cm for the pelvic bone, 0.91/0.07 cm/1.25 cm for the sacrum, 0.91/0.07 cm/0.53 cm for the L4 vertebral body, and 0.90/0.08 cm/0.68 cm for the L5 vertebral bodies. On average, 80% of the CTVs, 97% of the organ at risk, and 98% of the bony structure contours in the external test dataset were clinically acceptable based on physician review. Conclusions Our CNN‐based auto‐contouring tool performed well on both internal and external datasets and had a high rate of clinical acceptability.


INTRODUCTION
Manual contouring of tumors and normal structures is a very labor-intensive and time-consuming part of the radiation treatment planning process. 1,2 "Wrong or inaccurate" contours drawn by physicians and dosimetrists constitute the highest and seventh-highest risk factors for failure of photon/ electron external beam radiation treatment, respectively. 3 Most of these errors could be avoided if an accurate and reliable auto-contouring tool were available. In the past, various algorithms have been evaluated for the development of autocontouring tools, with mixed success. [4][5][6] With the advent of deep learning, more specifically, convolutional neural networks (CNNs), this movement has been accelerated as CNNs outperformed most of the other algorithms in various segmentation tasks. 7 As a result, CNN-based auto-contouring systems for computed tomography (CT) images have been developed for various body sites, such as the head and neck, [8][9][10][11][12] thoracic region, [13][14][15][16] abdomen, [17][18][19] and pelvis. [20][21][22][23][24][25][26][27][28][29][30][31][32] Although these approaches have generally been very successful, they are not yet accessible to cancer treatment centers where they would be most usefulthose with limited resources that see a large number of cervical cancer patients, such as in South Africa and other low-and middle-income countries (LMICs). In fact, cervical cancer is the second most common cancer in women in Africa, 33,34 and the most cost-effective treatment that increases the survival rate of cervical cancer patients in LMICs is radiation treatment. 35 To fill this gap, the Radiation Planning Assistant (RPA; rpa.mdanderson.org), 36 a web-based, fully automated radiotherapy contouring and planning generation system, is being developed to address the shortage of treatment planning staff and subsequently increase the survival rate for cancer patients in LMICs.
Although the potential of deep learning-based auto-contouring systems for pelvic structures has been explored in several previous studies, most of them were focused on prostate cancer, [21][22][23][24]31,32 and only a few papers have published results for the female pelvis. 25,26 In this study, we developed an auto-contouring system that can contour the clinical treatment volumes (CTVs) and normal structures that are necessary for various cervical cancer radiation treatment planning techniques. The auto-contouring system in this work will be implemented with RPA to automatically generate high-quality radiation treatment for cervical cancer patients in LMICs.
To the best of our knowledge, this study is the first CTbased auto-contouring study that includes pelvic lymph node CTV (nodal CTV) and para-aortic lymph node CTV (PAN CTV) for cervical cancer radiotherapy using deep learning.
This addition is important, as these are the primary targets for radiotherapy treatments of cervical cancer and will facilitate fully automated treatment planning for cervical cancer.

MATERIALS AND METHODS
Our CNN-based auto-contouring tool was developed to generate contours for three CTVs and 11 normal structures in the female pelvis: primary CTV, nodal CTV, PAN CTV, bladder, rectum, spinal cord, left and right femurs, left and right kidneys, sacrum, pelvic bone, L4 vertebral body, and L5 vertebral body. These structures were categorized into three groups: bony structures, organs at risk (OARs), and CTVs. These are the structures required to automate 4-field box, 3D conformal, IMRT, and VMAT plans for cervical cancer. [37][38][39] First, the Inception-ResNet-V2 40 classification architecture was trained to identify the extent of the structure in the cranial-caudal direction, as shown in Figs. 1(a) and 1(b). This approach was taken to address the GPU memory limitation issue as well as to improve the accuracy of the automatically generated contours by allowing the subsequent segmentation model to process a restricted field of view. 9 Second, the segmentation models were applied to the CT slices that were classified to contain the organ of interest, as shown in Fig. 1(c). Both the classification and the segmentation models were trained independently for each structure.

2.A. Training parameters
For the training and validation data, 2254 female pelvic CT scans from cancer patients who received radiation treatment from September 2004 to June 2018 at The University of Texas MD Anderson Cancer Center were used. Furthermore, 210 CT scans with kidney contours from the 2019 Kidney Tumor Segmentation Challenge (KiTS19) were added to our training data. The CT scans had pixel sizes in the transverse plane that ranged from 0.754 to 1.367 mm and slice thicknesses from 2.0 to 3.0 mm, except for 8 CT scans (3 were 5 mm, 3 were 4 mm, 1 was 1.5 mm, and 1 was 1.0 mm thick). All data were resampled to have the same voxel size of 1.17 mm 9 1.17 mm 9 2.5 mm. The CT numbers lower than À1000 HU and higher than 3000 HU were clipped and then linearly shifted to have a 0 to 4000 pixel intensity range.
An NVIDIA DGX Station with four V100 GPUs (16 GB RAM) was used to train our models. The loss function for the segmentation models was the Sørensen-Dice similarity coefficient (DSC), 41,42 as this was our metric to determine the accuracy of the segmentation model. A weighted cross-Medical Physics, 47 (11), November 2020 entropy was used as a loss function for the classification model to compensate for the data imbalance between the number of slices with and without the organ of interest. The weight was determined to be the ratio of the number of absences to the number of presences. The Adam optimizer 43 was used as an optimization algorithm. The Adam optimizer's parameters, beta1, beta2, and epsilon, were set to 0.9, 0.999, and 10 À8 , respectively.
To select the two-dimensional (2D) and three-dimensional (3D) CNN segmentation architectures, we did a preliminary study on the spinal cord for 2D and the left kidney for 3D. The vanilla DeepLabv3+ 44 and the FCN-8s 45 with additional batch normalization layers at the end of every convolutional layer were trained to segment the spinal cord in 2D. The mean AE standard deviation DSC were 0.87 AE 0.03 and 0.90 AE 0.02, for the vanilla DeepLabv3+ and the modified FCN8-s, respectively, so the modified FCN-8s was chosen for our model. Similarly, the 3D U-Net 46 and the 3D V-Net 21 segmentation architectures were trained to segment the left kidney on CT images resized to have a 256 9 256 9 60 dimension. We added batch normalization layers at the end of every convolutional layer for both architectures. The mean AE standard deviation DSC were 0.93 AE 0.04 and 0.93 AE 0.04, for the U-Net and the V-Net, respectively. As there was no significant difference between the two architectures, we chose the V-Net, which has residual connections in each stage.

2.B. Bony structures
The contours of the four bony structures (pelvic bone, sacrum, L4 vertebral body, and L5 vertebral body) were generated on 370 CT scans to train and validate the auto-contouring model. The pelvic bone was defined to be the traditional pelvic bone without the sacrum, as the sacrum was contoured as a separate structure. All the bony structure contours were automatically generated with a multi-atlas-based auto-contouring system (MACS) 4,5,47 first, and the automatically generated contours were manually reviewed and revised if necessary. V-Net, 21 a CNN-based 3D segmentation architecture, was used to segment the four bony structures. The input image for the segmentation architecture was resized to N slice 9 256 9 256. A single segmentation model was used to contour the adjacent L4 and L5 vertebral bodies simultaneously. For data augmentation purposes, horizontal flip and rotation with random angles between À30°and 30°along the axial axis were applied for these structures.

2.C.1. Primary clinical treatment volume
The primary CTV for cervical cancer patients is defined to include the uterus and the cervix. To train the model, 406 contours were either curated from clinical contours or manually generated from scratch by 4 physicians at MD Anderson Cancer Center.
V-Net was used to segment the primary CTV. Although the classification model restricted the field of view of the input images, the GPU memory was not sufficient to train the full-resolution CT images. To overcome this problem, we resized the input image to 256 9 256 pixels in the transverse plane, segmented the primary CTV, and estimated the center of mass of the primary CTV. Then, we cropped the box that fully enclosed the primary CTV and centered it on the center of mass of the prediction on the original CT scan with a 512 9 512 pixel image size. Finally, we applied the V-Net segmentation model to the cropped 3D image, as shown in Fig. 2. This way, the final contour is predicted on the limited CT field of view with the original spatial resolution. This approach was inspired by the method proposed by Feng et al. 13 and applied to the rest of the CTVs and OARs that were segmented with the 3D segmentation model.
Although the cropped images were supposed to be centered at the center of mass of the organ in the prediction, the center was randomly chosen while training the model in each epoch for the data augmentation purpose. Furthermore, the random rotation between À30°and 30°along the axial axis Medical Physics, 47 (11), November 2020 and the horizontal flip were also used for data augmentation. The same data augmentation techniques were applied to train the segmentation models for other CTVs and OARs.

2.C.2. Pelvic lymph node clinical treatment volume
The nodal CTV covers the common iliac, external iliac, internal iliac, obturator, and presacral nodal regions as described in the GEC-ESTRO II guideline 48 for intermediaterisk nodal CTV. To provide data for the training process, 250 nodal CTV contours were contoured by the same four physicians who contoured the primary CTV and later peer-reviewed to ensure high accuracy and consistency. As the lymph nodes and vessels are small and have CT numbers similar to those of muscles, a 3D segmentation model can sometimes miss a small part of the lymph nodes. To prevent this, FCN-8s, 45 a 2D segmentation architecture, was also trained to auto-contour the nodal CTVs. The CT slices that were predicted to contain the nodal CTV contours by the 3D segmentation model were given to the 2D segmentation model for slice by slice prediction. In prediction, the sum of the nodal CTV contours from the 2D and 3D models was used as a final contour.
The superior border of the intermediate nodal CTV was determined at one slice below the bifurcation of the common iliac artery. To locate the superior border more accurately, a segmentation model for the aorta near the bifurcation region was trained with 296 CT scans. The segmentation model was applied to a cropped region around the automatically generated L4 vertebral body contour to limit the field of view.

2.C.3. Para-aortic lymph node (PAN) clinical treatment volume
The PAN CTV covers the para-aortic lymph nodes from the level of the renal veins to the aorta above the aortic bifurcation (i.e., one slice above the superior slice of the nodal CTV). In order to gather data sufficient for the PAN CTV segmentation model, we used 146 clinical contours, and all the contours were manually curated and revised if necessary. FCN-8s was used to auto-contour the PAN CTVs.

2.D. Organs at risk
OARs for cervical cancer radiation treatment include the bladder, rectum, spinal cord, left and right femurs, and left and right kidneys. The training and validation data for the OARs were acquired from clinical contours of the 2254 CT scans. Contours for each structure were considered to maximize the amount of data, and thus, the number of available structures in a single patient's data varied from 1 to 7. The total number of CT scans used for training and validation for each structure is shown in Table I. Of these scans, 80% were used for training, and 20% were used for validation. Since the classification and the segmentation models were trained independently for each structure to avoid the class imbalance problem, 49 the imbalance in the number of training data for each structure did not influence the model accuracy. As the contours were collected solely on the basis of their labels, review of these contours was required to confirm their accuracy. Owing to the substantial number of contours, we proposed a semi-automatic data curation method instead of manual review, as described in Fig. 3. First, "unreviewed"  contours and the corresponding scans were divided in half. Two CNN-based segmentation models, one for each half, were trained, and the contours were predicted on the other half of the dataset. If the DSC between the clinical contours and the predicted contours was lower than an arbitrarily determined threshold value (DSC = 0.7 for the rectum, 0.8 for the remaining OARs), the original contour was manually reviewed, and any incorrect clinical contours were removed from the training dataset. Once the entire set of training data was reviewed, we repeated the process with the "refined" dataset from the beginning three times. The left and right kidney contours from the KiTS19 dataset 50 were added to the training dataset. Abnormal kidneys with large tumors were excluded from the dataset, so 172 contours and 186 contours, respectively, for left and right kidneys were added from the total of 210 CT scans.
All the OARs, except for the spinal cord, used 3D V-Net segmentation models and followed the steps described in Fig. 2. For the spinal cord segmentation, a 2D FCN-8s model was used to generate the contour on each slice. The overall flowchart of the developed auto-contouring system is demonstrated in Fig. 4.

2.E. Test dataset
For quantitative analysis of the auto-contouring system, CT scans and corresponding clinical contouring data from 140 female pelvic cancer patients who received radiation treatment at MD Anderson were used as the test dataset. All of the test CT scans were independent from the training and validation CT scans.
The contours of the CTVs were manually generated by physicians, and the contours of the bony structures and OARs were manually generated by medical physics researchers and reviewed by physicians. Some of the CT scans did not show all of the OARs, owing to the limited cranial-caudal extent. As the superior border of the PAN CTV can be slightly different, depending on the location of pathological nodes and physician judgment, we modified the superior borders of the automatically generated PAN CTV on the basis of the manually generated ground truth contour. We did the same for the inferior borders of the rectum and the spinal cord for similar reasons. The accuracy of the model was measured by the DSC, mean surface distance (MSD), and Hausdorff distance (HD) 47 between the automatically generated contours and the ground truth contours.
For qualitative analysis, contours were automatically generated using the auto-contouring system (Fig. 4) for CT scans from 30 cervical cancer patients from three South African hospitals. This dataset was completely independent from the training dataset and the potential population target for the RPA system. The automatically generated contours were evaluated by an experienced radiation oncologist at MD Anderson and scored as needing no edits, minor edits, or major edits. For the contours scored as needing minor edits, revisions were preferred but not mandatory for the contours to be considered clinically acceptable.

3.A. Model accuracy
The DSC, MSD, and HD between the automatically generated contours and the internal test dataset for 140 CT scans are given in Table II. Owing to the limited cranial-caudal extent, only 132, 129, and 127 contours were evaluated for the PAN, L4/L5 vertebral bodies, and kidneys, respectively; two patients did not have nodal CTV and one patient did not have spinal cord contours. All the CTVs had mean DSC > 0.76, mean MSD < 0.27 cm, and mean HD < 2.09 cm. All the normal structures had mean DSC > 0.81, mean MSD < 0.18 cm, and mean HD < 1.66 cm. All the bony structures had mean DSC > 0.90, mean MSD < 0.08 cm, and mean HD < 1.25 cm. The overall boxplots of DSC for each structure are given in Fig. 5. Although most of the automatically generated contours had DSC distribution within a certain range, low DSC outliers existed in the box plots, and some of these contours are shown in Fig. 6. The failures in generating accurate contours often occurred when the CTVs and OARs were located near high-density material in the bowel, as shown in Fig. 6(a). Contouring of the bladder occasionally failed when the border between the bladder and the uterus was vague, as shown in Fig. 6(b). Contouring of L4 and L5 vertebral bodies sometimes failed when the segmentation model predicted L3 to be L4 and L4 to be L5, as shown in Fig. 6(c). The automatically generated PAN CTV contours had low DSC values when the interface between the nodal CTV and the PAN CTV was incorrectly determined, as shown in Fig. 6(d).

3.B. Physician review
Physician scoring of the automatically generated contours on the 30 external CT scans is shown in Table III. Owing to the limited cranial-caudal extent, 28 contours were evaluated for the left and right kidneys. For the primary, nodal, and PAN CTVs, 83%, 70%, and 87% of the contours were clinically acceptable, respectively. For the bladder, rectum, and right kidney, 90%, 93%, and 96% were clinically acceptable, respectively, and the other OARs were 100% clinically acceptable. For the bony structures, 93% and 97% of the L4 and L5 vertebral bodies were clinically acceptable, respectively, and the pelvic bone and sacrum were 100% clinically  acceptable. Some of the minor edits and major edits are demonstrated in Fig. 7.

DISCUSSION
We have developed a CNN-based auto-contouring tool for three CTVs and 11 normal structures in cervical cancer CTs that can be used for fully automated radiation treatment planning. The number of training, validation, and test CT scans we used to train and evaluate this model is the largest to date among deep learning-based female pelvis auto-contouring studies. 25,26 We successfully acquired this high volume of data by using a semi-automatic data curation method. Also, to the best of our knowledge, we are the first to auto-contour nodal and PAN CTVs in the female pelvic region using deep learning. We have demonstrated that our CNN-based autocontouring system can accurately generate clinically acceptable contours for both CTVs and normal structures in multiple patient cohorts.

4.A. Quantitative results
For the bony structures, 3.5% (5/140 from the quantitative analysis) of the L4 and L5 vertebral body contours were not clinically acceptable (i.e. outliers in the boxplot in Fig. 5). Similarly, 6.7% (2/30 from the qualitative analysis) of the L4 and L5 vertebral body contours were not clinically acceptable (Table III). Therefore, the overall failure rate for the bony structures was about 4%. This is a noticeable improvement compared to a previous study where the failure rate for the automatically generated contours in a multi-atlas-based autocontouring system was about 10%. 39 The performances of deep learning-based auto-contouring systems for OARs in the female pelvis from other published literature are presented in Table IV. As there is only one published paper on a deep learning-based auto-contouring system for cervical cancer, we have also included the state-ofthe-art auto-contouring models for rectal and prostate cancers. Overall, the performance of our system is equivalent to or better than the auto-contouring system for cervical cancer developed by Liu et al. 26 for most of the structures.
Our quantitative test CT scans were randomly chosen from CTs of any female patient with an intact uterus, so the shape and volume of the bladder in the CT scans varied significantly. When we retrospectively tested our bladder segmentation model on 510 prostate patients with full bladders, the mean DSC was much improved at 0.95 AE 0.04. Compared with the state-of-the-art rectal and prostate models, our model performed at least as well in all structures except for the rectum. However, the direct comparison of auto-contouring models for different sites is not straightforward because the homogeneity of the structures in the test CT scans substantially changes the DSC, as shown in the accuracy of our two bladder models.

4.B. Failure cases from physician's review
The overall clinical acceptance rates were higher than 70% for the CTVs and 90% for the OARs and bony structures. When high-density materials were located in the bowel, the auto-contouring system had a higher chance of creating inaccurate contours of the CTVs or OARs near the region, as shown in Figs. 5 and 6. These high-density materials were fecal matter resulting from a high-carb diet with minimal protein, fat, and fibers, which likely causes compacted slowmoving feces. This diet is more common in South Africa, the patient population for the external test dataset, than in the U.S, the patient population for the training and internal test datasets. As we acquire more CT data from such patients through the RPA system, we will be able to upgrade the autocontouring system to achieve more robust results in these patients.
For the nodal CTVs, 9/30 were scored as needing major edits; one was due to high-density fecal matter in the bowel and three were due to failure to detect the superior border. The three cases did not have clear borders for vessels, as shown in Fig. 8(b), and therefore, the bifurcation segmentation model did not perform appropriately. All three patients seemed to be underweight, based on their CT scans, so we believe that the poor contrast resolution was due to incorrect use of image acquisition parameters or the lack of fat in between the vessels. We need to further investigate our autocontouring system in underweight patients and may need to adjust the CT acquisition parameters for these patients in the future. We trained our model using a consistent, well-curated dataset from a single hospital and the publicly available kidney contours. Final physician review used images from three other hospitals, with different patient populations from the training dataset. Thus, the review results gave us some confidence in the ability of our model to successfully contour patients from a different patient population, as well as various CT scanners and imaging protocols. In this study, we did not examine the impact of inter-user variations on the physician assessment of these contours. Based on our experience with other sites, 51 it is likely that an increased fraction of patients  will be considered 'minor edits' instead of 'no edits' as we deploy the auto-contouring system to more hospitals. We will further assess and quantify the inter-user variability as we begin to deploy this system clinically.

CONCLUSION
We have demonstrated through both quantitative and qualitative studies that a CNN-based, auto-contouring tool can achieve clinically acceptable contours for most of the CTVs and normal structures in cervical cancer patients. We will implement our auto-contouring system to the Radiation Planning Assistant, accelerating the radiation treatment planning process in hospitals in low-and middle-income countries.

ACKNOWLEDGMENTS
This work was supported by the National Institutes of Health/National Cancer Institute grants UH2-CA202665, UH3-CA202665, and P30CA016672 (Clinical Trials Support Resource). The authors also thank Dawn Chalaire from Scientific Publications at The University of Texas MD Anderson Cancer Center for editing this work and Raphael Douglas, Stephen Grant, Alexander Bagley, and Pezzi Todd for data curation.

CONFLICTS OF INTEREST
This work was partially funded by the National Cancer Institute and Varian Medical Systems.

a)
Author to whom correspondence should be addressed. Electronic mail: drhee1@mdanderson.org.