Evaluation of deep learning‐based auto‐segmentation algorithms for delineating clinical target volume and organs at risk involving data for 125 cervical cancer patients

Abstract Objective To compare the accuracy of a deep learning-based auto-segmentation model with that of manual contouring by a medical resident, where both tried to mimic the delineation "habits" of the same senior clinical physician. Methods This study included 125 cervical cancer patients whose clinical target volumes (CTVs) and organs at risk (OARs) were delineated by the same senior physician. Of these 125 cases, 100 were used for model training and the remaining 25 for model testing. In addition, a medical resident instructed by the senior physician for approximately 8 months delineated the CTVs and OARs for the testing cases. The Dice similarity coefficient (DSC) and the Hausdorff distance (HD) were used to evaluate the delineation accuracy for the CTV, bladder, rectum, small intestine, femoral-head-left, and femoral-head-right. Results The DSC values of the auto-segmentation model and of manual contouring by the resident were, respectively, 0.86 and 0.83 for the CTV (P < 0.05), 0.91 and 0.91 for the bladder (P > 0.05), 0.88 and 0.84 for the femoral-head-right (P < 0.05), 0.88 and 0.84 for the femoral-head-left (P < 0.05), 0.86 and 0.81 for the small intestine (P < 0.05), and 0.81 and 0.84 for the rectum (P > 0.05). The HD (mm) values were, respectively, 14.84 and 18.37 for the CTV (P < 0.05), 7.82 and 7.63 for the bladder (P > 0.05), 6.18 and 6.75 for the femoral-head-right (P > 0.05), 6.17 and 6.31 for the femoral-head-left (P > 0.05), 22.21 and 26.70 for the small intestine (P > 0.05), and 7.04 and 6.13 for the rectum (P > 0.05). The auto-segmentation model took approximately 2 min to delineate the CTV and OARs, while the resident took approximately 90 min to complete the same task. Conclusion In this study, the auto-segmentation model was as accurate as the medical resident but with much better efficiency.
Furthermore, the auto‐segmentation approach offers additional perceivable advantages of being consistent and ever improving when compared with manual approaches.


| INTRODUCTION
Cervical cancer is one of the most common malignant tumors of the female reproductive system, and its incidence and mortality rates rank fourth among all cancers in women. 1 Radiation treatment (RT) is an effective method for cervical cancer treatment, 2 and the mainstream technology today is intensity-modulated radiation therapy (IMRT). In radiotherapy planning, precise delineation of the clinical target volume (CTV) and organs at risk (OARs) is essential for delivering the necessary radiation dose to the target area while sparing adjacent normal tissues from complications. Manual delineation of the OARs, however, is time-consuming and labor-intensive in RT planning workflows. Studies have shown that as much as 120 min can be required for a clinician to manually delineate the OARs of a cervical cancer patient. 3 Inter-observer variability (IOV) has been found among radiation oncologists who perform manual contouring, and even the same physician can produce different contours at different times owing to fatigue and other factors. [4][5][6][7][8] This variability can introduce errors larger than the planning and setup errors. [9][10][11][12] Automatic segmentation of the CTV and OARs can alleviate physicians' burden and reduce variability. To that end, atlas-based approaches have been reported. [13][14][15] However, atlas-based auto-segmentation methods require users to establish their own templates, and subsequent applications can suffer from the large number of patient cases required in the template and from the limited accuracy of the manual contours used to build it. Moreover, the image processing of atlas-based auto-segmentation requires a long time. These issues limit further development of this technology.
In recent years, convolutional neural networks (CNNs) have proven to be an effective tool for auto-segmentation of the CTV and OARs of the head and neck, [16][17][18][19] thoracic cavity, 20-23 abdomen, [24][25][26] and pelvis. [27][28][29][30] Studies have shown that for auto-segmentation of OARs in head and neck cancers and chest cancers, the accuracy of deep learning-based auto-segmentation 19,21,26,31 is significantly higher than that of the atlas-based method. One study 32 used a modified U-Net model for auto-segmentation of OARs in cervical cancer, and evaluation by radiation oncologists showed that the contours predicted by the model were highly consistent with their own. Wong et al. 36 verified that the accuracy of deep learning-based auto-segmentation is comparable to expert inter-observer variability for RT structures and suggested that the use of deep learning-based models in clinical practice would likely realize significant benefits in RT planning workflow and resources.
However, most previous studies 19,24,25,30 have focused on the accuracy of auto-segmentation while ignoring the evaluation of learning ability in clinical practice. This study aims to compare the learning abilities of an auto-segmentation model and a medical resident, both of whom learned from the same senior radiation oncologist. Higher accuracy represents higher learning ability, and smaller variance corresponds to better stability. We first collected cervical cancer cases delineated by the same senior radiation oncologist. Next, the testing cases were delineated separately by a medical resident, who had been instructed by the senior physician for approximately 8 months, and by the auto-segmentation model trained on the training set. The auto-segmentation model was then compared against the medical resident on the 25 cases in the testing set.

| MATERIALS AND METHODS

2.A | Datasets
We retrospectively collected 125 cases of cervical cancer patients who received IMRT between January 2019 and May 2020 at the First Affiliated Hospital of Anhui Medical University in China. These female patients were between 22 and 86 yr of age, with an average age of 53.8 yr.
The CT scans covered the region from the lower lumbar spine to the ischial tuberosity, including the entire pelvic cavity. The CT slice thickness was 5 mm. The CT image datasets were transferred to the Eclipse 13.6 treatment planning system (TPS).
The manual delineation of the cervical cancer CTV was conducted in accordance with the guidelines of the Radiation Therapy Oncology Group (RTOG). 37 The senior radiation oncologist delineated the CTV and OARs for all cases according to these guidelines.

2.B | Deep learning-based auto-segmentation
In this study, we investigated the use of a 3D CNN for delineating the CTVs and OARs of cervical cancer. As shown in Fig. 1, the network consists of an encoder, which extracts features from the data, and a decoder, which performs the pixel-wise classification. The encoder consists of five successive residual blocks. Each block contains three convolution layers with a 3 × 3 × 3 kernel, and there is a spatial dropout layer between the first two convolution layers to prevent the network from overfitting. Spatial down-sampling is performed by a convolution layer with a 3 × 3 × 3 kernel and a 2 × 2 × 2 stride. The decoder consists of four successive segmentation blocks. Each block contains two convolution layers with kernels of 1 × 1 × 1 and 3 × 3 × 3, respectively. Spatial up-sampling is performed by a deconvolution layer with a 3 × 3 × 3 kernel and a 2 × 2 × 2 stride.
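For intuition about the resolution bookkeeping implied by this design: each stride-2 convolution halves the spatial dimensions, and each stride-2 deconvolution doubles them, so four down-samplings and four up-samplings return the input resolution. A minimal sketch, assuming a hypothetical input patch size (the paper does not state one):

```python
def downsample(shape):
    # A padded 3x3x3 convolution with stride 2 halves each spatial dimension
    return tuple((s + 1) // 2 for s in shape)

def upsample(shape):
    # A stride-2 deconvolution doubles each spatial dimension
    return tuple(s * 2 for s in shape)

patch = (160, 160, 48)  # hypothetical input patch size, for illustration only
enc_shapes = [patch]
for _ in range(4):      # four down-samplings between the five encoder blocks
    enc_shapes.append(downsample(enc_shapes[-1]))
print(enc_shapes[-1])   # → (10, 10, 3), the bottleneck resolution

dec_shape = enc_shapes[-1]
for _ in range(4):      # four up-samplings across the four decoder blocks
    dec_shape = upsample(dec_shape)
print(dec_shape)        # → (160, 160, 48), back at the input resolution
```

This arithmetic is why the encoder has one more block than the decoder has up-sampling steps: the first encoder block operates at full resolution before any stride-2 layer.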
Here, each convolution layer is followed by an instance normalization layer and a leaky rectified linear unit. Four dashed arrows in Fig. 1 indicate the skip connections that pass encoder features to the corresponding decoder blocks. The network 39 was integrated into DeepViewer (commercial auto-segmentation software based on deep learning). 40,41 This study included 125 cervical cancer cases, 100 of which were randomly selected and divided into training and validation sets at a ratio of 4:1, while the remaining 25 cases were used to test the model. The weighted DSC was selected as the loss function, and Adam was selected as the optimizer. During training, data augmentation and deep supervision were used to avoid overfitting. The entire training process used the Python deep learning library Keras 42 with TensorFlow 43 as the backend, and an Nvidia GeForce RTX 2080Ti GPU with 11 GB of memory was used to train the model.

2.C | Experiment
To study the difference in learning ability between the auto-segmentation model and the medical resident, the experiment was designed as follows. First, this study included 125 cervical cancer cases whose CTV and OAR contours were manually delineated by the same senior physician with 20 yr of clinical experience according to the above principles; these contours were regarded as the true contours (TCs) in this study.
Second, a medical resident, who was a student of the senior physician and had spent 8 months of training on how to delineate the CTV and OARs, was invited to participate in this experiment.

2.D | Evaluation metrics
The DSC and HD were used to evaluate the accuracy of the auto-segmentation model and that of the resident. The DSC is defined as DSC(A, B) = 2|A ∩ B| / (|A| + |B|), where A is the DCs (contours delineated by the deep learning model) or RCs (contours delineated by the resident), and B is the TCs in our study. The numerator is twice the intersection of A and B, and the denominator is the sum of the volumes of A and B. A larger DSC corresponds to a higher degree of coincidence between the DCs or RCs and the TCs.
The DSC ranges from 0 to 1, with the latter value indicating perfect performance.
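As an illustration, the DSC between two binary masks can be computed as follows (a minimal NumPy sketch; the toy masks are invented for the example):

```python
import numpy as np

def dice_similarity_coefficient(a, b):
    """DSC between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    a = a.astype(bool)
    b = b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy 2D example: two 16-voxel squares overlapping in 4 voxels
a = np.zeros((10, 10)); a[2:6, 2:6] = 1
b = np.zeros((10, 10)); b[4:8, 4:8] = 1
print(dice_similarity_coefficient(a, b))  # → 2*4/32 = 0.25
print(dice_similarity_coefficient(a, a))  # → 1.0
```

In practice the same computation is applied slice-stacked 3D masks exported from the TPS, one structure at a time.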
The HD is defined as HD(A, B) = max{h(A, B), h(B, A)}, where h(A, B) is the greatest of all the distances from a point in A to the closest point in B. A smaller HD usually represents better segmentation accuracy.
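The symmetric HD over binary masks can be sketched in NumPy as below. This is a brute-force illustration in voxel units; reported clinical values in mm would scale by the CT voxel spacing, and production code would typically use an optimized routine such as SciPy's `directed_hausdorff`:

```python
import numpy as np

def directed_hausdorff(pts_a, pts_b):
    """h(A, B): max over points in A of the distance to the closest point in B."""
    # Pairwise Euclidean distances between the two point sets via broadcasting
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff_distance(mask_a, mask_b):
    """Symmetric HD = max{h(A, B), h(B, A)}, in voxel units."""
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    return max(directed_hausdorff(pts_a, pts_b),
               directed_hausdorff(pts_b, pts_a))

# Same toy masks as in the DSC example
a = np.zeros((10, 10)); a[2:6, 2:6] = 1
b = np.zeros((10, 10)); b[4:8, 4:8] = 1
print(hausdorff_distance(a, b))  # → 2*sqrt(2) ≈ 2.83
```

Unlike the DSC, which measures volumetric overlap, the HD is sensitive to the single worst-matching point, which is why a structure can have a high DSC yet a large HD when one contour has an isolated outlying region.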

| RESULTS
The DSC values of deep learning-based auto-segmentation (the DSC of DCs-TCs) and of manual contouring by the resident (the DSC of RCs-TCs) are summarized in Table 1 and displayed in Fig. 3. The combined results of the DSC and HD show that the deep learning-based auto-segmentation model delineated the CTV more accurately than the medical resident, with significant differences in both the DSC and HD (P < 0.05). As shown in Fig. 5 (panels a1-a3), the deep learning-based auto-segmentation was more similar to the contours delineated by the senior physician. As shown in Fig. 5 (panels b1-b3, d1-d3, and e1-e3), the auto-segmentation model and the medical resident performed comparably for the bladder, femoral-head-left, and femoral-head-right, with no significant differences.
Regarding delineation of the small intestine and rectum (Fig. 5), the auto-segmentation model was slightly better than the resident for the small intestine, and the medical resident was slightly better than the auto-segmentation model for the rectum.

| DISCUSSION

In this study, the deep learning-based auto-segmentation model was found to be as accurate as the resident and showed better stability, indicating that the deep learning-based auto-segmentation model reached or even exceeded the level of the resident. In terms of time requirements, the auto-segmentation model clearly outperformed the resident (approximately 2 vs. 90 min for one patient's CTV and OARs). In many clinical situations, the CTV and OARs are first delineated by a resident, and a senior physician then modifies the contours based on the resident's results. According to the results of our experiment, the auto-segmentation model can replace part of the work of residents: senior physicians can modify the auto-segmentation results directly and obtain clinically acceptable contours.

| CONCLUSION
In this study, we compared and analyzed differences in learning ability between a deep learning-based auto-segmentation model and a medical resident, both of whom learned to delineate the CTV and OARs of cervical cancer from the same senior physician. This study demonstrates that the deep learning-based auto-segmentation model was as accurate as the medical resident but with much better efficiency. Furthermore, the auto-segmentation approach offers additional perceivable advantages of being consistent and ever improving when compared with manual approaches. When carefully validated and implemented clinically, such a deep learning-based method has the potential to improve the RT workflow.

ACKNOWLEDGMENTS
This work was jointly supported by the Natural Science Foundation of Anhui Province (1908085MA27) and the Anhui Key Research and Development Plan (1804a09020039). This retrospective study was approved by the IRB with a waiver of informed consent.

AUTHOR CONTRIBUTION STATEMENT
Zhi Wang, Xi Pei, and X. George Xu contributed to conception and design. Yin Lv, Weijiong Shi, and Fan Wang contributed the datasets. Zhi Wang, Yankui Chang, and Zhao Peng contributed to the auto-segmentation model. Zhi Wang, Yankui Chang, Zhao Peng, Xi Pei, and X. George Xu contributed to the writing of the paper.

CONFLICT OF INTEREST
The authors declare no conflict of interest.