Evaluate the performance of four artificial intelligence‐aided diagnostic systems in identifying and measuring four types of pulmonary nodules

Abstract Purpose This study aims to evaluate the performance of four artificial intelligence‐aided diagnostic systems in identifying and measuring four types of pulmonary nodules. Methods Four types of nodules were implanted in a commercial lung phantom. The phantom was scanned with multislice spiral computed tomography, after which four systems (A, B, C, D) were used to identify the nodules and measure their volumes. Results The relative volume error (RVE) of system A was the lowest for all nodules, except for small ground glass nodules (SGGNs). System C had the smallest RVE for SGGNs, −0.13 (−0.56, 0.00). In the Bland–Altman test, only systems A and C passed the consistency test, P = 0.40. In terms of precision, the miss rate (MR) of system C was 0.00% for small solid nodules (SSNs), ground glass nodules (GGNs), and solid nodules (SNs) but 4.17% for SGGNs. The corresponding MRs of system D for SGGNs, SSNs, and GGNs were 71.30%, 25.93%, and 47.22%, respectively, the highest among all the systems. Receiver operating characteristic curve analysis indicated that system A had the best performance in recognizing SSNs and GGNs, with areas under the curve (AUCs) of 0.91 and 0.68, respectively. System C had the best performance for SGGNs (AUC = 0.91). Conclusion Among the four types of nodules, SGGNs are the most difficult to recognize, indicating the need to improve the accuracy and precision of artificial intelligence systems. System A most accurately measured nodule volume. System C was most precise in recognizing all four types of nodules, especially SGGNs.

and density of pulmonary nodules, aiding in systematic and rational clinical decision-making and treatment. 3,4 However, the accuracy and precision of nodule volume measurement by AIADS is affected by several factors, including acquisition and reconstruction parameters, pulmonary nodule characteristics, and system technology. 5,6 This is a relatively young area of research requiring quantification of the impact of these factors on pulmonary nodule volume measurement. Limited research has focused on the influence of acquisition and reconstruction parameters on volume measurement. [7][8][9][10] A few studies have compared the accuracy of two detection systems for pulmonary nodule volume measurement. 11 However, there are few reports comparing different AIADS to assess the influence of pulmonary nodule characteristics on the accuracy and precision of nodule measurement and detection.
The malignant potential of a pulmonary nodule varies depending on its density and size. Nodule diameter is strongly correlated with malignancy. Less than 1% of nodules with a diameter <5 mm are malignant compared with 6%-28% of those measuring 5-10 mm and 64%-82% of nodules >20 mm in diameter. [12][13][14][15][16] A ground glass nodule (GGN) is reportedly more likely to be malignant than a solid nodule (SN). 17 Before the development of MSCT, it was difficult to qualitatively assess small nodules and GGNs because of their small size, low density, and lack of specificity on imaging. 18,19 MSCT has substantially increased the detection rate of SNs and GGNs by manual identification. 20 However, because MSCT generates numerous slices even when only one organ is examined, clinicians face a huge workload in thoroughly examining each study. AI software automatically extracts data on pulmonary nodules that indicate their morphologic features. It detects the shape, edge, density, and size of nodules to improve the diagnostic efficiency and accuracy of medical images. Therefore, the application of AI software in medical imaging can not only reduce pressure on physicians but also, more importantly, aid in faster diagnosis and treatment for patients. Many studies have shown that AI systems are more efficient than traditional diagnostic methods in identifying and diagnosing pulmonary nodules. 3,4 This study analyzed the performance of different AIADS software to determine factors influencing the accuracy of the identification of various pulmonary nodules.
In our study, models of solitary pulmonary nodules were implanted into a commercial lung phantom. The scanned and reconstructed MSCT images were then analyzed by four different AIADSs, and their performance was compared. The purpose was to provide information that might aid in the technical improvement and clinical application of AIADSs.

2.D | Image analysis
Two senior radiologists specializing in chest imaging used four AIADSs from different companies for image recognition and automatic detection of pulmonary nodules. The diameter of the implanted nodules was used to determine the true volume. The two radiologists recorded the volume reported by each of the four systems for each pulmonary nodule in each group of images.
After detection, they compared their records for consistency. If there were discrepancies in the data, the nodules were redetected until the results were consistent.
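
The conversion from nominal diameter to reference volume can be sketched as follows. This is a minimal illustration that assumes each implanted nodule is an ideal sphere, so that V = πd³/6; the text does not state the nodule shape, and the function name and the 5 mm example diameter are hypothetical.

```python
import math


def sphere_volume_mm3(diameter_mm: float) -> float:
    """Return the volume (mm^3) of an ideal sphere with the given diameter (mm).

    Assumption: the nodule is perfectly spherical; this is not stated in the text.
    """
    if diameter_mm <= 0:
        raise ValueError("diameter must be positive")
    return math.pi * diameter_mm ** 3 / 6.0


if __name__ == "__main__":
    # Hypothetical example diameter, not a value taken from the study.
    print(f"{sphere_volume_mm3(5.0):.2f} mm^3")
```

Under this assumption, a small error in the nominal diameter is amplified roughly threefold in the reference volume, because volume scales with the cube of the diameter.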

2.E | Outcome measures
The performance of each system for each type of nodule was compared in terms of the relative volume error (RVE) and miss rate (MR). RVE was defined as the ratio of the difference between the measured value and the reference value to the reference value: RVE = (measured volume − reference volume) / reference volume. MR was defined as the ratio of undetected nodules to total nodules: MR = number of undetected nodules / total number of nodules. Accuracy, defined as how closely the measured nodule volume matched the reference volume, was evaluated with the RVE, followed by a consistency test. Precision, defined as correctly identifying a nodule, was evaluated with the MR and receiver operating characteristic (ROC) curves.
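
The two outcome measures defined above can be sketched in a few lines. This is only an illustration of the verbal definitions, not the authors' analysis code; the function names, the use of `None` to mark an undetected nodule, and the example values are all assumptions. The RVE is kept signed, so a negative value indicates underestimation of the reference volume.

```python
from typing import Optional, Sequence


def relative_volume_error(measured_mm3: float, reference_mm3: float) -> float:
    """RVE = (measured - reference) / reference; negative means underestimation."""
    if reference_mm3 <= 0:
        raise ValueError("reference volume must be positive")
    return (measured_mm3 - reference_mm3) / reference_mm3


def miss_rate(measurements: Sequence[Optional[float]]) -> float:
    """MR = undetected nodules / total nodules.

    Assumption: an undetected nodule is recorded as None.
    """
    if not measurements:
        raise ValueError("at least one nodule is required")
    missed = sum(1 for value in measurements if value is None)
    return missed / len(measurements)


if __name__ == "__main__":
    # Hypothetical readings for four nodules with a 100 mm^3 reference volume.
    readings = [87.0, None, 104.0, 95.0]
    detected = [v for v in readings if v is not None]
    print([round(relative_volume_error(v, 100.0), 2) for v in detected])
    print(f"MR = {miss_rate(readings):.2%}")
```

Missed nodules are excluded from the RVE calculation here, since no volume is available for them; how the study handled this case is not stated.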

2.F | Statistical analysis
There were no significant differences among systems B, C, and D. For GGNs, there were significant differences when comparing two systems (Tables 3 and 4).
In summary, compared with other systems, system A was best at classifying SSNs and GGNs. System C was best for the SGGN, and system D was best at classifying the SN but performed poorly with the other three types of nodules.

Even experienced chest radiologists find it challenging to classify pulmonary nodules, with poor consistency in observer results. If management guidelines for pulmonary nodules are based only on size and classification, then inconsistencies in classification will lead to inconsistencies in management. 29,30 To address this issue, the diagnosis of nodule type must be more objective. CT manufacturers and software developers must improve their algorithms and technologies to achieve this goal.
In our study, we analyzed the abilities of the AIADSs to recognize four types of nodules. The MR of system C was significantly lower than that of the other systems. System C was more sensitive in recognizing small nodules and low-density nodules. All four systems did well at recognizing SNs, likely because they are larger and have higher density. By contrast, SGGNs are more difficult to identify.
Because of their low density, small diameter, and unclear boundaries, SGGNs may not stand out from the lung background as clearly as SNs, which makes them difficult for the software to recognize. These results are consistent with those of Reeves et al. 31 We found significant differences between the AIADSs we analyzed. Although one or another system had the best performance for a particular purpose, such as accuracy in measuring nodule volume (e.g., system A) or precision in identifying the type of nodule (e.g., system C for SGGNs), no one system consistently outperformed the others in all aspects of pulmonary nodule assessment. The shortcomings we identified, particularly for systems B and D, might prompt the developers to improve the algorithms to achieve better performance. Many studies have focused on various methodologies to distinguish among the types of pulmonary nodules, such as the support vector machine, 32 neural networks, 33 decision trees, 34 or other classifiers, 35 but the results have been unsatisfactory. A study based on a deep residual neural network yielded good results, indicating that combining deep residual learning, curriculum learning, and transfer learning can improve the accuracy of nodule classification. 36

| CONCLUSION
Among the four types of nodules, SGGNs are the most difficult to recognize, indicating the need to improve the accuracy and precision of artificial intelligence systems. System A most accurately measured nodule volume. System C was most precise in recognizing all four types of nodules, especially SGGNs. The superior performance of the software is related to its stronger computing power and more mature algorithms. This study may serve as a reference for the quantitative selection of software for clinical use.

ACKNOWLEDGMENTS
We thank the study participants for their time and collaboration.

CONFLICT OF INTEREST
The authors declare they have no competing interests.

AUTHOR CONTRIBUTION
Ming-yue Wu and Yong Li analyzed the data and wrote the manuscript; Bin-jie Fu and Guo-shu Wang collected data and participated in manuscript revision; Zhi-gang Chu provided funding support; as corresponding author, Dan Deng was mainly responsible for revising the manuscript. All authors have read and approved the final manuscript.