Reliability of CT‐based texture features: Phantom study

Abstract
Objective To determine the intra‐, inter‐ and test‐retest variability of CT‐based texture analysis (CTTA) metrics.
Materials and methods In this study, we conducted a series of CT imaging experiments using a texture phantom to evaluate the performance of a CTTA panel on routine abdominal imaging protocols. The phantom comprises three different regions with various textures found in tumors. The phantom was scanned on two CT scanners: the Philips Brilliance 64 CT and the Toshiba Aquilion Prime 160 CT. The intra‐scanner variability of the CTTA metrics was evaluated across imaging parameters such as slice thickness, field of view, post‐reconstruction filtering, tube voltage, and tube current. For each scanner and scanning parameter combination, we evaluated the performance of eight different types of texture quantification techniques on a predetermined region of interest (ROI) within the phantom image using 235 different texture metrics. We conducted the repeatability (test‐retest) and robustness (intra‐scanner) tests on both scanners; the reproducibility test compared the inter‐scanner differences in repeatability and robustness to identify reliable CTTA metrics. Reliable metrics are those that are repeatable, reproducible, and robust.
Results As expected, the robustness, repeatability and reproducibility of CTTA metrics are variably sensitive to various scanner and scanning parameters. Entropy of Fast Fourier Transform‐based texture metrics was overall most reliable across the two scanners and scanning conditions. Post‐processing techniques that reduce image noise while preserving the underlying edges associated with true anatomy or pathology bring about significant differences in radiomic reliability compared to when they were not used.
Conclusion Following large‐scale validation, identification of reliable CTTA metrics can aid in conducting large‐scale multicenter CTTA analysis using sample sets acquired using different imaging protocols, scanners etc.


1 | INTRODUCTION
With the technological advancements in medical imaging, radiomics, defined as the high-throughput extraction of quantitative features from routine medical images to create a mineable database of imaging metrics, has emerged as a promising tool for decision support. 1,2 Radiomic metrics that assess tumor shape and nonuniform grayscale appearance (texture), which are difficult to assess visually, have been reported to provide information regarding tumor diagnosis, prognosis, and treatment response. [1][2][3][4][5][6][7] Despite various benefits within the clinical workflow, such as objective whole-tumor assessment at no additional imaging cost and longitudinal disease monitoring, limited standardization of the method reduces its reliability, particularly in multicenter studies. [8][9][10][11][12][13][14][15][16] Typical radiomic workflows comprise four stages: image acquisition, region of interest (ROI) segmentation, feature extraction, and statistical analysis. 6 Each of these four stages can be implemented using a variety of approaches and techniques. Currently, there is no consensus regarding a standardized implementation of the radiomics workflow, which hampers efforts to reproduce results.
Previous studies assessing the reproducibility of radiomic metrics conclude that radiomic metrics are not equally sensitive or insensitive to changes in scanning protocols or CT scanners. Therefore, the type of radiomic metric should be chosen carefully for the clinical application, particularly in multicenter studies, to reduce the chances of false discovery. 16 A recent systematic review by Traverso et al. 16 assessing the repeatability and reproducibility of radiomics found that most current studies have a high risk of type I error, thereby increasing the chances of false discovery. 17 In addition, the use of correlated metrics within the radiomics panel increases the chances of false associations with high significance. 18 One of the solutions suggested by Traverso et al. 16 to reduce the risk of false-positive associations in radiomic studies is to identify reproducible and repeatable radiomic metrics and use them to train predictive models of tumor behavior.
To address this concern, we conducted a series of CT imaging experiments using a texture phantom on multiple scanners and scanning protocols to assess the reliability of CTTA metrics. We define CTTA reliability as a measure of the intra-scanner variability, inter-scanner variability, and test-retest performance of the CTTA metrics. To assess the intra-scanner variability or the "robustness" of the CTTA metrics, images of the texture phantom are obtained using a variety of imaging conditions (scanning parameters) on a given scanner and assessed using a CTTA panel. CTTA metrics that show a strong unchanging signal across the various imaging conditions are identified as "robust" CTTA metrics. To assess the test-retest variability or the "repeatability" of the CTTA metrics, images of the texture phantom are obtained using a variety of imaging conditions on a given scanner, 15 min apart, and the difference in the performance of the CTTA panel across the two time points is calculated. CTTA metrics that show small to no changes in their values across the various imaging conditions are identified as "repeatable" CTTA metrics. To assess the inter-scanner variability or the "reproducibility" of the CTTA metrics, CTTA metrics with consistent robustness and repeatability performance across scanners were shortlisted.
Through this study, we determine which type of CTTA metric is most reliable within the limitations of the study and we provide heatmaps of metrics such as robustness, repeatability, and reproducibility showing comparative performance of the various CTTA metrics.

2.A | CTTA Phantom
Most commercially available CT phantoms are designed to be homogeneous throughout their volume; however, real human anatomy has variable internal densities, which create texture. In our study, to evaluate the reliability of the CTTA metrics, we developed a texture phantom. The phantom comprises three texture patterns within a homogeneous background, representative of textures seen in medical images (Fig. 1). The patterns were 5 cm × 5 cm in a 15 cm short cylinder. The phantom patterns were made from acrylonitrile butadiene styrene (ABS) plastic using 3D printing technologies and cast into tissue-density urethane.
The patterns Bk, 1, 2 and 3 represent texture varying from the smoothest, that is, the background, to 10% fill, 20% fill, and 40% fill. Our intention was to create a generic phantom that could be imaged using diverse imaging protocols and scanners to identify reliable CTTA metrics. It was not our intention to create a tumor-specific phantom as much as it was to create a phantom that covers a wide enough span of tumor textures seen in oncological CT images.
The goal of our project was to be able to reproducibly create and manipulate textures. For this, we used a 3D printer to create the texture patterns. While we tried a variety of approaches, we are currently focusing on creating reproducible geometric patterns, which could be varied in different ways to better understand how changes in patterns drive texture and its analysis. The materials selected for our tests were within the tissue density and texture range and provided targeted contrast within our texture patterns. These target Hounsfield number ranges were based on an evaluation of patient images.

2.B | CTTA phantom imaging
The texture phantom was scanned using a Philips Brilliance 64 CT scanner and a Toshiba Aquilion Prime 160 CT scanner. The phantom was fixed on the CT patient table for the duration of this study. The scan position for each acquired volume was rigidly fixed to produce identically positioned slices, obviating any need for volume registration. For the robustness assessment, 21 different settings were tested on the Philips scanner and 16 different settings were tested on the Toshiba scanner (Table 1).

2.C | Region of interest segmentation
The ROI delineation was performed using a manual segmentation technique. Three spherical ROIs were segmented in 3D using image-rendering software (Synapse 3D, Fujifilm, Stamford, CT). Some images of the phantom contained air bubbles created during the construction process; care was taken to exclude these regions when the analysis was performed.
Custom MATLAB (Mathworks, Natick, MA, USA) code was used to extract voxel data corresponding to the ROI. Two-dimensional CTTA was conducted on the orientation that provided the largest diameter in the axial, coronal, or sagittal dimension. Three-dimensional CTTA was conducted on the whole ROI volume. We used a 20-bin gray-level quantization. The slice thickness varied between 2 and 3 mm.
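The gray-level quantization step can be illustrated with a minimal Python sketch. The text states only that 20 bins were used; the equal-width binning over the ROI's own intensity range, and the function name `quantize_gray_levels`, are assumptions made here for illustration and are not taken from the study's MATLAB code.

```python
import numpy as np


def quantize_gray_levels(roi, n_bins=20):
    """Map ROI intensities (e.g. HU) to integer gray levels 0..n_bins-1.

    Assumption: equal-width bins spanning the ROI minimum to maximum.
    """
    roi = np.asarray(roi, dtype=float)
    lo, hi = roi.min(), roi.max()
    if hi == lo:
        # A perfectly uniform ROI collapses to a single gray level.
        return np.zeros(roi.shape, dtype=int)
    scaled = (roi - lo) / (hi - lo)              # values in [0, 1]
    levels = np.floor(scaled * n_bins).astype(int)
    return np.clip(levels, 0, n_bins - 1)        # the maximum falls in the top bin
```

Other discretization schemes (for example, a fixed bin width in Hounsfield units) are equally plausible and would yield different texture values.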

2.D | Image data
From the segmented ROI within the texture phantom, highlighted in

2.E | CTTA metrics
Texture analysis involves the study of the variation of pixel image intensity. We evaluated eight different types of texture quantification techniques on each ROI image with 235 different texture metrics. These techniques have been described in the literature [19][20][21][22][23] (Supplementary S1: Details of CTTA metrics).

2.E.1 | Histogram analysis
We implemented histogram analysis, which focused on the first-order statistical analysis of texture, 19 that is, the technique assessed image intensities (the gray-level distribution of an image) with no regard for the spatial location of the intensities (13 features).

2.E.2 | Two-dimensional and three-dimensional gray-level co-occurrence method (GLCM) and gray-level difference method (GLDM) analysis
We performed second-order statistical analysis of texture, which included 2D- and 3D-GLCM and GLDM analysis. These analyses took into account both the pixel intensities and their inter-relationships, thereby providing spatial information about the intensities (second-order texture analysis) in various forms. For workflow implementation, the number of gray levels was reduced to 12-bit, which was determined to be sufficiently accurate for the study of texture. Sixty different metrics were calculated in the 2D analysis. In the 3D analysis, 20 additional directions in the z-plane were added (80 × 2 = 160 features).
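To make the idea of a co-occurrence matrix concrete, the following Python sketch builds a GLCM for a single pixel offset and derives three commonly reported features. It is an illustration only: the single offset, the symmetric normalization, and the helper names `glcm` and `glcm_features` are assumptions, and feature definitions (homogeneity in particular) vary between implementations.

```python
import numpy as np


def glcm(levels, offset=(0, 1), n_levels=None, symmetric=True):
    """Normalized gray-level co-occurrence matrix for one (row, col) offset."""
    levels = np.asarray(levels).astype(int)
    if n_levels is None:
        n_levels = int(levels.max()) + 1
    dr, dc = offset
    rows, cols = levels.shape
    # Index ranges of reference pixels whose neighbor at (dr, dc) is in bounds.
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    ref = levels[r0:r1, c0:c1]
    nbr = levels[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    counts = np.zeros((n_levels, n_levels), dtype=float)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)  # accumulate pair counts
    if symmetric:
        counts = counts + counts.T                     # count each pair both ways
    return counts / counts.sum()


def glcm_features(p):
    """Contrast, angular second moment (energy), and homogeneity of a GLCM."""
    i, j = np.indices(p.shape)
    return {
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "energy": float(np.sum(p ** 2)),
        "homogeneity": float(np.sum(p / (1.0 + np.abs(i - j)))),
    }
```

For example, `glcm_features(glcm([[0, 0, 1], [0, 0, 1]]))` describes horizontal neighbor pairs in a two-level image; repeating the call with other offsets gives the directional variants.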

2.E.3 | Two-dimensional Fourier analysis
A 512-point fast Fourier transform (FFT) was applied to all images. MATLAB (MathWorks, Natick, MA) was used to apply the transformation. 21 Applying the FFT algorithm, we extracted the individual frequencies of the original image, their amplitude (how much of a given frequency is present in the image), and their phase (where in the image a given frequency is present). FFT metrics were assessed between 10% and 90% of the maximum frequency to avoid high- and low-frequency noise, which is typical for medical images. The frequency boundary was set based on maximization of the signal-to-noise ratio (18 features).
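A band-limited FFT entropy of the kind described above might be sketched in Python as follows. Several details are assumptions rather than the study's method: the radial definition of the 10% to 90% band, the histogram-based Shannon entropy with 64 bins, and the names `fft_band_entropy` and `_shannon_entropy` are all hypothetical choices made for illustration.

```python
import numpy as np


def _shannon_entropy(values, n_bins):
    """Shannon entropy (bits) of a histogram of the given values."""
    hist, _ = np.histogram(values, bins=n_bins)
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))


def fft_band_entropy(image, n_fft=512, band=(0.10, 0.90), n_bins=64):
    """Entropy of FFT magnitude and phase within a radial frequency band."""
    image = np.asarray(image, dtype=float)
    # n_fft-point 2D FFT (zero-padded or cropped), zero frequency at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image, s=(n_fft, n_fft)))
    f = np.fft.fftshift(np.fft.fftfreq(n_fft))        # cycles per pixel
    fy, fx = np.meshgrid(f, f, indexing="ij")
    radius = np.hypot(fx, fy)
    f_max = np.abs(f).max()
    # Keep only coefficients between 10% and 90% of the maximum frequency.
    mask = (radius >= band[0] * f_max) & (radius <= band[1] * f_max)
    return {
        "magnitude_entropy": _shannon_entropy(np.abs(spectrum[mask]), n_bins),
        "phase_entropy": _shannon_entropy(np.angle(spectrum[mask]), n_bins),
    }
```

With 64 histogram bins, both entropies are bounded between 0 and 6 bits, which gives a quick sanity check on the output.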

2.E.4 | Gray-level run-length method (GLRLM) analysis
We performed additional second-order statistical analysis of texture, which included 2D- and 3D-GLRLM. 22 This analysis took into account the spatial relationship between pixels/voxels by evaluating how often voxels of a given value occur next to each other in a given direction. The 2D analysis comprised 33 metrics and the 3D analysis included 11 more metrics (44 features).

FIG. 2. Reliability assessment of the texture metrics of the USC Radiomics panel using two different CT scanners.

2.F | Statistical analysis
For the robustness test, the percent absolute difference between each of the radiomic metrics in the baseline scan setting and the new settings (x-axis variables) was plotted for each scanner [Figs. 2(a) and 2(c)]. The variables were changed one at a time with respect to the baseline scan settings. The percent absolute differences in the radiomic metrics in the repeatability and robustness studies are presented as heatmaps ranging from 0% (blue) to 20% (red) variation.
In the repeatability heatmap (Fig. 2), a solid horizontal blue band represents good repeatability across these acquisition parameters (here, <5% absolute difference between various settings). In the robustness heatmap (Fig. 2), a solid horizontal blue band represents good robustness across these acquisition parameters (here, <5% difference between various settings).
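The heatmap values and the <5% criterion can be expressed as a short Python sketch. Normalizing by the baseline value and requiring every setting to fall below the threshold are assumptions about how the percent absolute difference was applied; the function names `percent_abs_difference` and `stable_metrics` are hypothetical.

```python
import numpy as np


def percent_abs_difference(baseline, values):
    """Percent absolute difference of each metric from its baseline value.

    baseline: shape (n_metrics,), metric values at the baseline setting.
    values:   shape (n_metrics, n_settings), values at the altered settings.
    Note: a baseline of exactly zero yields inf or nan for that metric.
    """
    baseline = np.asarray(baseline, dtype=float)[:, None]
    values = np.asarray(values, dtype=float)
    return 100.0 * np.abs(values - baseline) / np.abs(baseline)


def stable_metrics(pad, threshold=5.0):
    """True for metrics that stay below the threshold at every setting."""
    return np.all(np.asarray(pad) < threshold, axis=1)
```

Each row of the returned array corresponds to one horizontal band of the heatmap; a row flagged by `stable_metrics` would appear as a solid blue band.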

3 | RESULTS
Our results indicate that the reliability of radiomics metrics is dependent on the scanner and scanning settings.

4 | DISCUSSION
While CTTA-based tumor-modeling assessment is increasingly being reported, 1-7 a consensus on radiomics reliability has not emerged, leading to an increased risk of false discovery. Such a scenario can impede the clinical translation of radiomics. The primary objective of our study was to identify CTTA metrics that are reliable using a CT-based texture phantom. We specifically assess the intra-, inter-, and test-retest variability of the CTTA metrics.
From a material standpoint, our texture phantom was made from ABS plastic using 3D printing technologies and cast into tissue-density urethane. Our intention was to create a generic phantom that could be imaged using diverse imaging protocols and scanners to identify reliable CTTA metrics. We focused on using materials and designs inspired by a wide span of tumor textures seen in oncological CT images. Accordingly, the materials selected for our tests were within the tissue density and texture range that provided targeted contrast within our texture patterns. This is an improvement over texture analysis reliability studies conducted using a water phantom. 27 Last, but not least, from a design standpoint, our texture phantom is designed as a short cylinder, similar to the standard ACR/AAPM phantom, 28 which is a comparatively more CT-friendly design than a cuboid in terms of reducing imaging artefacts.
From an analysis standpoint, using the concordance correlation coefficient (CCC) as a metric for stability, a few studies assess the such as tumors. 10 Studies by Hassan et al. showed that normalizing voxel size and gray-level discretization greatly reduced the dependence of CTTA metrics on these quantities. 24 While novel CTTA metrics such as first-order wavelet features were assessed in this reliability study, the limited suitability of the CCR phantom to truly assess the performance of metrics of local texture variations limits the scope of these findings. Recent studies by Lu et al., using a Gammex CT ACR 464 phantom 29 and scanning its four water-equivalent inserts using routine abdomen protocols on a GE Discovery scanner, reported that highly reliable radiomic metrics were attained from images reconstructed at high tube current and thick slice thickness. Also, based on a ranking of the reliability of commonly used CTTA metrics, first-order texture metrics such as mean, standard deviation, skewness, and kurtosis were more reliable than second-order texture metrics such as GLCM-energy, correlation, contrast, and homogeneity. While encouraging, the study was tested on only one scanner and using two materials for each pattern.
In addition to designing a more sophisticated phantom that mimic printing a single material, which is immersed in a casting material. While the boundaries serve as multi-material regions, the assessment will be limited and unrealistic owing to simplification. Our next generation/version of the texture phantom will include this.
Identification of reliable CTTA metrics is an important step toward the clinical translation of radiomics. Our approach to identifying such metrics differs from studies in the literature 13,31 in its use of a new texture phantom and a rigid ROI registration, which help eliminate reported challenges such as patient-based tumor variability and segmentation bias/errors. In addition, we assessed all the CT acquisition parameters together, which to the best of our knowledge has not been performed before. Finally, we assessed the reliability of CTTA metrics by using comparable acquisition protocols, over a wide range of values, on two CT scanners. Similar studies 9,10 have addressed such issues, but their use of automatic acquisition protocols for different CT scanners leads to reproducibility concerns, particularly since automatic acquisition protocols optimize imaging parameters such as tube current, slice thickness, etc., which makes it impossible to study the effects of various imaging parameters simultaneously.
Different CT scanners have been shown to report varying Hounsfield units. 10 In addition to scanner-specific calibration, these differences may be responsible for the changes in radiomics reproducibility across the two scanners. Our results support this and identify reliable CTTA metrics for use across different scanners. In addition to testing the reliability of CTTA metrics with respect to image acquisition parameters, we also tested their reliability with respect to changes brought about by noise reduction techniques such as I-dose. As expected, owing to its nonlinear effect, the I-dose level affects radiomic reliability significantly. CTTA reliability decreases as the I-dose level increases.
Published data assessing radiomics reproducibility and repeatability have reported that high reliability is associated with the entropy measure of first-order statistics (e.g., histogram analysis). 32 Although we did not observe the histogram-based entropy measure to be reliable in our experiments, we did observe entropy of FFT magnitude and FFT phase to be reliable. 16  specific to a given human tissue, but this was done to improve the comprehensive assessment of a variety of human tissue textures rather than that of a specific one.
The two commonly used statistical indices to assess reliability are the intraclass correlation coefficient (ICC) and the CCC. 33 When assessing reproducibility alone, without repeating multiple times for a given scanner or modality, the ICC2 (two-way random) and ICC3 (two-way mixed) are identical to CCC. However, with repeated measures, which is equivalent to assessing reproducibility and repeatability at once, only ICC3 (two-way mixed) is identical to CCC. CCC and ICC2/ICC3 include two components of reliability: (a) a small difference between measurements and (b) correlated results between measurements. An excellent CCC or ICC reflects both a small difference and a high correlation; however, when the CCC or ICC value is moderate, it is hard to pinpoint whether the problem stems from a large difference or a poor correlation. In this preliminary study, we only have three inserts for the phantom; thus, we are only interested in observing the signal change (difference) when altering the scanner settings or between scanners. The absolute percent difference is an intuitive measure for this purpose. If evidence of reliability in signal difference is established, we will proceed to further study with more heterogeneous inserts (e.g., 9) and investigate both difference and correlation.
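The two components of the CCC can be seen in a small Python sketch of Lin's concordance correlation coefficient, which divides twice the covariance by the sum of the two variances and the squared difference of the means. This is offered as an illustration of the index discussed above, not as part of the study's analysis, which relied on the absolute percent difference; the function name `concordance_correlation` is hypothetical.

```python
import numpy as np


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between paired measurements.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (ddof=0) moments.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    # The mean-shift term penalizes systematic differences even when the
    # two series are perfectly correlated.
    return float(2.0 * cov / (x.var() + y.var() + (mx - my) ** 2))
```

For instance, measurements `[1, 2, 3]` compared with `[2, 3, 4]` are perfectly correlated, yet the constant offset lowers the CCC to 4/7, which shows why a moderate value alone cannot distinguish a large difference from a poor correlation.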
Various studies have shown the valuable role of CTTA metrics in tumor characterization, prognosis, and survival information, albeit using small sample sizes. 6,21,[33][34][35] While The Cancer Imaging Archive (TCIA) database can aid large-scale validation of the CTTA panel, additional problems such as noisy or missing data can reduce its impact. Machine-learning methods have been used to mitigate these limitations; 36 however, the choice of machine-learning algorithms and associated steps affects the final performance, and thus far a consensus has not been reached. Future work within our group will evaluate the clinical applications of our results using data-driven radiomics 37 frameworks in combination with TCIA data.
In conclusion, our study has demonstrated the intra-, inter-, and test-retest variation in CTTA metrics calculated on CT images of a texture phantom imaged using two different CT scanners. We identify reliable CTTA metrics, that is, those metrics with <5% change in their values when assessed for robustness, reproducibility, and repeatability. We strongly recommend that groups working on future radiomic studies account for the performance variations demonstrated here and/or use the reliable CTTA metrics, that is, entropy of FFT-based magnitude and phase, within their radiomics texture panel.

ACKNOWLEDGMENTS
The authors acknowledge Enrique Godinez and Paul Casares for their great help in scanning the phantom on the clinical CT scanners.

CONFLICT OF INTEREST
No conflicts of interest.