Exploratory analysis using machine learning to predict for chest wall pain in patients with stage I non‐small‐cell lung cancer treated with stereotactic body radiation therapy

Abstract Background and purpose Chest wall toxicity is observed after stereotactic body radiation therapy (SBRT) for peripherally located lung tumors. We utilize machine learning algorithms to identify toxicity predictors to develop dose–volume constraints. Materials and methods Twenty‐five patient, tumor, and dosimetric features were recorded for 197 consecutive patients with Stage I NSCLC treated with SBRT, 11 of whom (5.6%) developed CTCAEv4 grade ≥2 chest wall pain. Decision tree modeling was used to determine chest wall syndrome (CWS) thresholds for individual features. Significant features were determined using independent multivariate methods. These methods incorporate out‐of‐bag estimation using Random forests (RF) and bootstrapping (100 iterations) using decision trees. Results Univariate analysis identified rib dose to 1 cc < 4000 cGy (P = 0.01), chest wall dose to 30 cc < 1900 cGy (P = 0.035), rib Dmax < 5100 cGy (P = 0.05) and lung dose to 1000 cc < 70 cGy (P = 0.039) to be statistically significant thresholds for avoiding CWS. Subsequent multivariate analysis confirmed the importance of rib dose to 1 cc, chest wall dose to 30 cc, and rib Dmax. Using learning‐curve experiments, the dataset proved to be self‐consistent and provides a realistic model for CWS analysis. Conclusions Using machine learning algorithms in this first of its kind study, we identify robust features and cutoffs predictive for the rare clinical event of CWS. Additional data in planned subsequent multicenter studies will help increase the accuracy of multivariate analysis.


| INTRODUCTION
Stereotactic body radiation therapy (SBRT), or stereotactic ablative radiotherapy (SABR), is an increasingly used radiation modality for the treatment of primary early-stage 1 and metastatic lung tumors. 2 SBRT has been shown to provide effective local control with acceptable toxicity. 3 It is the preferred treatment modality for medically inoperable stage I non-small-cell lung cancer (NSCLC) patients, and there is emerging evidence and investigation regarding its role for selected operable NSCLC patients, [4][5][6] as well as for stage I small-cell lung cancer patients. 7,8 The chest wall has been identified as an organ at risk for SBRT, with chest wall toxicities of any grade ranging from 2% to 45% following SBRT. [9][10][11][12] Radiation-related chest wall toxicity can result from radiation-induced rib fracture or chest wall syndrome (CWS). In the absence of rib fracture, CWS is caused by radiation-induced neuropathy of the intercostal nerves or nerve branches, chest wall edema, chest wall fibrosis, or hairline rib fractures not clearly visible on imaging. [12][13][14] There is currently a paucity of data on standard dose-volume constraints for the chest wall, with no clear consensus on how to balance target coverage versus chest wall/rib sparing or how factors like fractionation impact CWS. A commonly used constraint is chest wall dose to 30 cc < 30 Gy, 15 yet there is no prospectively validated data to support this threshold. There have been efforts in recent years to identify the risk factors for rib fractures and CWS and to refine the clinical and dosimetric predictors of chest wall toxicity using dose-response models. [15][16][17][18] One challenge in evaluating predictive factors for CWS is the low and varying range of events observed. 14,17,19 Machine learning has previously been used in radiation oncology for a variety of problems, from quality assurance to outcome prediction. 
[20][21][22][23][24][25][26] In circumstances where the event being analyzed is relatively uncommon, machine learning algorithms are advantageous because they can amplify the signal carried by the few events that occur. This is achieved by developing models that learn from a given dataset and make predictions about it. Examples include hierarchical clustering models, which can iterate quickly through different features and cutoffs to identify potentially predictive factors based on how effectively events are separated from nonevents. 26 Using these computational algorithms to mine raw data can filter out noise and identify the pertinent factors even when the number of events is smaller than the number of features. The current study, the first of its kind, utilizes such algorithms to identify specific dosimetric thresholds predictive for CWS in 197 consecutive patients with Stage I NSCLC treated with SBRT.

2.A | Patient inclusion
This study was approved by our institutional review board. We identified a cohort of 197 consecutive patients diagnosed with Stage I NSCLC and treated with SBRT from June 24, 2009, to July 31, 2013, a window chosen to allow for adequate toxicity follow-up. All patients were treated to a biologically effective dose (BED) of ≥ 100 Gy in one of four fractionation schemes: 20 Gy × 3 fractions, 12.5 Gy × 4 fractions, 10 Gy × 5 fractions, or 7.5 Gy × 8 fractions. All patients were planned with a constraint goal to keep 30 cc of the chest wall to <30.0 Gy. Twenty-five parameters (termed features in the machine learning analysis), including patient and tumor characteristics and dosimetric features, were recorded for each patient; each was either suspected of correlating with CWS or previously reported 10,12,13,15,17,[27][28][29][30][31] to be associated with it. Toxicities were assessed using CTCAEv4 criteria for chest wall pain, where Grade 1 represents mild pain, Grade 2 represents moderate pain limiting instrumental activities of daily living (ADL), and Grade 3 represents severe pain limiting self-care ADL.
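As a quick check of the BED ≥ 100 Gy criterion, the four fractionation schemes can be evaluated with the standard linear-quadratic expression BED = n·d·(1 + d/(α/β)). The sketch below assumes α/β = 10 Gy, a value commonly used for tumor; the text does not state which α/β was used, and the function name `bed` is illustrative.

```python
def bed(dose_per_fraction_gy, n_fractions, alpha_beta_gy=10.0):
    """Linear-quadratic BED: n * d * (1 + d / (alpha/beta)).

    alpha_beta_gy = 10 Gy is an assumption, not a value taken from the text.
    """
    total_dose = dose_per_fraction_gy * n_fractions
    return total_dose * (1.0 + dose_per_fraction_gy / alpha_beta_gy)


# (dose per fraction in Gy, number of fractions) for the four schemes
schemes = [(20.0, 3), (12.5, 4), (10.0, 5), (7.5, 8)]
bed_values = {scheme: bed(*scheme) for scheme in schemes}

for (dose, fractions), value in bed_values.items():
    print(f"{dose} Gy x {fractions}: BED = {value:.1f} Gy")
```

Under that assumption the four schemes give 180, 112.5, 100, and 105 Gy, so each meets the ≥ 100 Gy criterion.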

2.B | Feature definition
Twenty-five features were analyzed in this study. They were classi-

2.C | Univariate analysis
Univariate CWS thresholds for each feature collected were generated to split the patient population into high- and low-risk subpopulations. These thresholds were determined using decision stumps (simple univariate thresholds) implemented in Matlab R2015a (MathWorks Inc., Natick, MA, USA). In all cases, the deviance was used to measure how far the decision tree is from the target output. It is a smoother version of the classification error and provides a measurement of the quality of the description provided. 32 Each threshold was characterized by the probability of sorting patients with and without CWS into the appropriate subpopulations. In addition, a generalization score was determined for each threshold, defined as the ratio of true positives for out-of-sample data to true positives for in-sample data. A cutoff of >0.75 was used for the generalization score, meaning a similar split of the data would result at least 75% of the time.
The generalization score is used to characterize out-of-sample performance of the univariate dosimetric thresholds, and it quantifies how well these thresholds should perform for data that the algorithm has not encountered. 26 This analysis was performed under the conditional assumption that the true distribution of patients satisfying the threshold is represented by the patients not developing CWS.
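The idea of a deviance-driven decision stump can be sketched in a few lines of Python. This is not the Matlab implementation used in the study: the binomial form of the deviance, the function names, and the toy data (arbitrary dose units, made-up outcomes) are all assumptions for illustration.

```python
import math


def node_deviance(labels):
    """Binomial deviance of a node: -2 * sum_k n_k * log(n_k / n)."""
    n = len(labels)
    events = sum(labels)
    dev = 0.0
    for count in (events, n - events):
        if count > 0:
            dev -= 2.0 * count * math.log(count / n)
    return dev


def fit_stump(values, labels, min_leaf=1):
    """Return (threshold, deviance) for the best split `value < threshold`.

    threshold is None when no split improves on the unsplit node.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    best = (None, node_deviance(ys))
    for i in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[i] == xs[i - 1]:
            continue  # cannot place a cut between tied values
        dev = node_deviance(ys[:i]) + node_deviance(ys[i:])
        if dev < best[1]:
            best = ((xs[i - 1] + xs[i]) / 2.0, dev)
    return best


# Hypothetical toy data: a dose metric in arbitrary units and a 0/1 outcome.
doses = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
outcome = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

threshold, split_deviance = fit_stump(doses, outcome)
print("threshold:", threshold, "deviance after split:", round(split_deviance, 3))
```

On this toy data the cut falls midway between 50 and 60, the point where the summed deviance of the two subpopulations is smallest.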

2.D | Multivariate analysis
Two different algorithms were considered: decision trees, for interpretability, and Random forests, for accuracy. [33][34][35] Decision trees partition the data into a number of disjoint subpopulations and make a constant prediction within each subpopulation. Random forests predict outcomes by averaging the output of hundreds of decision trees. 36 For specifics about these algorithms, readers are referred to "The Elements of Statistical Learning," a comprehensive book about machine learning. 36 In this work, the complexity of the models for all algorithms was controlled by choosing hyperparameters (global constants that control the complexity of the algorithms, such as the number of times data are allowed to be partitioned in a decision tree) that minimized the leave-one-out cross-validated deviance. Leave-one-out cross-validation refers to a method in which one observation is left out of the dataset, the model is trained on the remaining observations, and the held-out observation, which the algorithm has not seen, is then predicted. Specifically, the complexity of the decision tree was optimized through the minimum number of observations per node (Min Number per Node). This hyperparameter controls the number of observations a terminal node must have before a split is attempted. Smaller node sizes result in complex trees that explain the training dataset well but may give suboptimal results on the testing dataset. As our goal was to identify the thresholds that best predict CWS in future patients, we tested various training sets (10 training sets in a 10-fold cross-validation experiment) in order to select hyperparameter values that minimized the testing error.
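The hyperparameter selection described above can be sketched as follows. This is a simplified illustration, not the study's Matlab code: it uses a single-split tree on one feature, a binomial deviance with clipped probabilities, and made-up data; all names and the candidate node sizes are assumptions.

```python
import math

EPS = 1e-6  # clip probabilities so that log() stays finite


def deviance(prob, label):
    """Binomial deviance contribution of one predicted event probability."""
    p = min(max(prob, EPS), 1.0 - EPS)
    return -2.0 * math.log(p if label == 1 else 1.0 - p)


def fit_stump(xs, ys, min_leaf):
    """One-split tree whose leaves each hold at least `min_leaf` observations.

    Returns (threshold, p_left, p_right); threshold is None if no split helps.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    base = sum(ys) / n
    best, best_dev = (None, base, base), sum(deviance(base, y) for y in ys)
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        p_l, p_r = sum(left) / len(left), sum(right) / len(right)
        dev = (sum(deviance(p_l, y) for y in left)
               + sum(deviance(p_r, y) for y in right))
        if dev < best_dev:
            best_dev = dev
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, p_l, p_r)
    return best


def predict(model, x):
    threshold, p_left, p_right = model
    if threshold is None:
        return p_left
    return p_left if x < threshold else p_right


def loo_deviance(xs, ys, min_leaf):
    """Leave one observation out, train on the rest, score the held-out one."""
    total = 0.0
    for i in range(len(xs)):
        model = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], min_leaf)
        total += deviance(predict(model, xs[i]), ys[i])
    return total


# Hypothetical toy data (arbitrary units, made-up outcomes).
xs = [5, 12, 18, 22, 30, 35, 41, 48, 55, 60, 68, 75, 80, 90]
ys = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]

candidates = [2, 4, 6]  # candidate minimum node sizes
scores = {m: loo_deviance(xs, ys, m) for m in candidates}
chosen = min(scores, key=scores.get)
print(scores, "-> chosen min node size:", chosen)
```

The candidate with the lowest held-out deviance is kept; with real data the same loop would run over the full tree-building routine rather than a single split.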
Two additional analyses were performed to control for overfitting. First, the Min Number per Node was changed from 50 to 80 in steps of 5. Second, for each hyperparameter, a random subsampling of the patient population was performed where a predefined number of patients ranging from 158 to 197 patients would be randomly selected from the data set. One hundred iterations were performed, and an aggregate decision tree was developed. All features that were selected at least 10% of the time were compared to the maximally selected feature.
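One way to read the subsampling procedure is as a selection-frequency count: on each iteration a random subset of patients is drawn, the feature giving the best first split is recorded, and features selected at least 10% as often as the most-selected feature are kept. The sketch below implements that reading on synthetic data; the interpretation of the 10% rule, the feature names, and the small cohort and node size are assumptions, not the study's settings (which used 158 to 197 patients and node sizes of 50 to 80).

```python
import math
import random


def node_deviance(labels):
    """Binomial deviance of a node: -2 * sum_k n_k * log(n_k / n)."""
    n = len(labels)
    events = sum(labels)
    dev = 0.0
    for count in (events, n - events):
        if count > 0:
            dev -= 2.0 * count * math.log(count / n)
    return dev


def best_split_deviance(xs, ys, min_leaf):
    """Lowest summed child deviance over all admissible single splits."""
    pairs = sorted(zip(xs, ys))
    labels = [y for _, y in pairs]
    n = len(labels)
    best = node_deviance(labels)
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        best = min(best, node_deviance(labels[:i]) + node_deviance(labels[i:]))
    return best


def selection_counts(features, labels, n_iter, subsample_size, min_leaf, rng):
    """Count how often each feature provides the best first split."""
    counts = {name: 0 for name in features}
    n = len(labels)
    for _ in range(n_iter):
        idx = rng.sample(range(n), subsample_size)  # without replacement
        ys = [labels[i] for i in idx]
        devs = {name: best_split_deviance([col[i] for i in idx], ys, min_leaf)
                for name, col in features.items()}
        counts[min(devs, key=devs.get)] += 1
    return counts


rng = random.Random(0)
n_patients = 40  # synthetic cohort, far smaller than the study's 197
labels = [1 if i >= 30 else 0 for i in range(n_patients)]
features = {
    "informative": [float(i) for i in range(n_patients)],  # separates events
    "noise": [rng.random() for _ in range(n_patients)],    # unrelated to outcome
}

counts = selection_counts(features, labels, n_iter=100,
                          subsample_size=32, min_leaf=2, rng=rng)
retained = [f for f, c in counts.items() if c >= 0.1 * max(counts.values())]
print(counts, "retained:", retained)
```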
The complexity of Random forests was controlled by selecting the maximum number of splits allowed per individual tree and the number of variables randomly subsampled. Unless otherwise specified, 500 individual decision trees were combined when Random forest was used. The default hyperparameter, the square root of the number of features, was used for the number of feature subsamples in Random forests. In all cases, artificial equal prior probabilities were used to avoid the inherent bias in the algorithms due to the skewed dataset: the initial weights of the observations are increased for the minority event (e.g., development of CWS) and decreased for the majority event (e.g., absence of CWS) such that the summed weights of the two classes are equal. One hundred iterations were performed, and we identified features with an out-of-bag importance of at least 10% of that of the most important feature.
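The equal-prior weighting can be illustrated with the cohort's class counts (11 events among 197 patients). The sketch below is one plausible way to build such weights; normalizing each class to a total weight of 0.5 and the function name are assumptions.

```python
def equal_prior_weights(labels):
    """Per-observation weights giving both classes the same total weight."""
    n_events = sum(labels)
    n_nonevents = len(labels) - n_events
    if n_events == 0 or n_nonevents == 0:
        raise ValueError("both classes must be present")
    w_event = 0.5 / n_events        # minority class: larger individual weight
    w_nonevent = 0.5 / n_nonevents  # majority class: smaller individual weight
    return [w_event if y == 1 else w_nonevent for y in labels]


# Class counts from the text: 11 patients with CWS, 186 without.
labels = [1] * 11 + [0] * 186
weights = equal_prior_weights(labels)

event_total = sum(w for w, y in zip(weights, labels) if y == 1)
nonevent_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(f"event weight total: {event_total:.3f}, "
      f"nonevent weight total: {nonevent_total:.3f}")
```

Each patient with CWS then carries 186/11 (about 17) times the weight of a patient without it, so the rare class is not swamped during tree construction.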
The out-of-bag importance, as defined by Breiman, is an unbiased estimator of the predictive value of a feature, which uses randomly generated training sets by sampling with replacement. 33
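The sampling step behind out-of-bag estimation can be sketched as below: each tree is trained on a bootstrap sample drawn with replacement, and the patients never drawn form that tree's out-of-bag set, which serves as held-out data. The code shows only this sampling mechanism, not the importance calculation itself; the function name and the seed are illustrative.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)
n_patients = 197  # cohort size from the text
n_trees = 500     # number of trees from the text

oob_fractions = []
for _ in range(n_trees):
    in_bag, out_of_bag = bootstrap_split(n_patients, rng)
    oob_fractions.append(len(out_of_bag) / n_patients)

mean_oob_fraction = sum(oob_fractions) / len(oob_fractions)
expected = (1 - 1 / n_patients) ** n_patients  # probability a patient is never drawn
print(f"mean out-of-bag fraction: {mean_oob_fraction:.3f} (expected {expected:.3f})")
```

Roughly a third of the patients are out-of-bag for any given tree, which is what lets each tree be evaluated on data it did not see without setting aside a separate test set.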

3.B | Decision tree modeling
The decision tree analysis revealed that when evaluating different node sizes ranging up to 100 patients per terminal node, values of this hyperparameter ranging from 50 to 80 patients per terminal node produce similar results in the testing dataset, with local minima observed at 50 patients per node and 80 patients per node (Fig. 1).
When evaluating terminal node sizes above 80 patients per node, the decision trees become overly simplified, resulting in the training and testing dataset errors being similar.
Decision trees with node sizes of 50 and 80 patients per terminal node are shown in Fig. 2. With Min Number per Node = 50, the tree demonstrates that a smaller PTV volume is associated with a higher incidence of CWS [Fig. 2(a)]. If, instead, Min Number per Node = 80, only the first split is obtained [Fig. 2(b)].

3.C | Feature robustness and data consistency
When introducing variation components of differing nodal size and random subsampling of the population to test for feature robustness, only rib dose to 1 cc and chest wall dose to 30 cc were selected as features that influence development of CWS (Fig. 3). Random forest analysis performed as part of a second and separate analysis of robustness also identified rib dose to 1 cc and chest wall dose to 30 cc as predictors of CWS. Rib Dmax was additionally identified as a potential predictor for CWS (Fig. 4), whereas PTV volume was excluded.
Using learning-curve experiments with different hyperparameters, we found that as the number of patients in the training set increased, the training error increased and the testing error decreased. The learning-curve experiments indicated that our patient dataset is likely to provide a true representation of the wider population with regard to developing CWS (Fig. 5). This data consistency check supports accepting the previously identified CWS predictors of rib dose to 1 cc < 4000 cGy, chest wall dose to 30 cc < 1900 cGy, and rib Dmax < 5100 cGy (all P < 0.05).

| DISCUSSION
Regarding the chest wall specifically, constraints of V30 Gy < 70 cc, 12 V30 Gy < 35 cc, 13 and D30 cc < 30 Gy have been recommended. 15 In this study, we analyzed 25 patient features, both dosimetric and nondosimetric. Similar to the published literature, we found that rib dose 29 and chest wall dose 13 are important dosimetric features.
Decision trees were considered the baseline algorithm because they produce models that are clinically interpretable and can be validated against prior clinical knowledge. Random forest was used to evaluate feature importance and to generate and explore additional hypotheses. By separating the data into training (data used to create the models) and testing (data used to evaluate the performance of the model and not seen during training) sets, the error expected for the algorithm could be properly estimated. Training and testing errors refer to errors calculated on these datasets. In addition, by using interpretable algorithms like decision trees (those that produce models that clinical practitioners can understand) together with black box algorithms like Random forests (those that produce models that cannot be easily understood but are potentially more accurate, here by combining the input of hundreds of trees into one prediction), different hypotheses and important features can be selected automatically. 33,35,36,40 If the data are self-consistent, then the training error increases along with the number of patients used to build the model. Conversely, if true knowledge is acquired from the data, then the testing error will decrease with the number of patients used for training.
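A learning-curve experiment of the kind described above can be sketched as follows: a model is fit on progressively larger training subsets, and the error is recorded on both the training subset and a fixed held-out test set. The sketch uses a single-threshold classifier, misclassification error instead of deviance, and synthetic data; all of these, along with the names and the seed, are simplifying assumptions.

```python
import random


def error_rate(threshold, xs, ys):
    """Fraction of cases where the rule `x >= threshold` disagrees with the label."""
    return sum((x >= threshold) != bool(y) for x, y in zip(xs, ys)) / len(xs)


def fit_threshold(xs, ys):
    """Pick the cut that minimizes training misclassification."""
    candidates = sorted(set(xs)) + [float("inf")]  # inf means "predict no events"
    return min(candidates, key=lambda t: error_rate(t, xs, ys))


def learning_curve(train, test, sizes):
    """Return (size, training error, testing error) for each training size."""
    test_x, test_y = zip(*test)
    curve = []
    for m in sizes:
        xs, ys = zip(*train[:m])
        t = fit_threshold(xs, ys)
        curve.append((m, error_rate(t, xs, ys), error_rate(t, test_x, test_y)))
    return curve


rng = random.Random(1)


def make_patient():
    """Synthetic patient: event risk jumps above an arbitrary cut at 60."""
    x = rng.uniform(0, 100)
    p_event = 0.8 if x >= 60 else 0.1
    return x, int(rng.random() < p_event)


data = [make_patient() for _ in range(300)]
train, test = data[:200], data[200:]

curve = learning_curve(train, test, sizes=[10, 25, 50, 100, 200])
for m, train_err, test_err in curve:
    print(f"n_train={m:3d}  training error={train_err:.3f}  testing error={test_err:.3f}")
```

With self-consistent data the two error columns would be expected to approach each other as the training size grows, although any single random draw can deviate from that trend.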
Our final model, which combines the results of the baseline analysis using decision trees and is supplemented by the results of Random forests, specifically identifies cutoffs of rib dose to 1 cc < 4000 cGy, chest wall dose to 30 cc < 1900 cGy, and rib Dmax < 5100 cGy as important prognosticators.
A potential shortcoming of this study, and of other similar studies, is that CWS grading is inherently subjective. This is a potential bias intrinsic to analyses of CWS, and our study is likewise unable to compensate for it. Another limitation of our study is that an exhaustive analysis of all possible variables and thresholds remains prohibitive, even with machine learning.
With respect to chest wall constraints, we evaluated the commonly employed constraint of chest wall dose to 30 cc. A weakness of this approach is that it remains unclear whether this choice represents the ideal volumetric constraint. Future investigations assessing continuous volumetric modeling of the chest wall constraint in addition to continuous dose modeling are warranted. This approach could likewise be applied to other relevant thoracic structures, such as the rib, akin to a prior effort by Petterson et al., 29 and to lung dose. Expanded datasets in future analyses will add to the robustness of our findings, and future work will focus on external validation in a multicenter analysis.

| CONCLUSIONS
The strength of this study, the first of its type, is in the use of machine learning heuristic clustering analysis to identify factors in a continuous fashion that would predict both for and against CWS by incorporating patient- and tumor-related variables and dosimetric factors. From our analysis, we conclude that in patients treated with SBRT using common and standard fractionation schemes (12.5 Gy × 4 fractions, 10 Gy × 5 fractions), providers should attempt to keep the rib dose to 1 cc < 4000 cGy, chest wall dose to 30 cc < 1900 cGy, and rib Dmax < 5100 cGy in order to mitigate CWS. These novel and clinically meaningful metrics provide a guide for treatment planning of SBRT and contribute to the knowledge base for patient counseling and informed consent.

CONFLICT OF INTEREST
We do not have any conflicts or potential conflicts of interest to disclose at this time. We do not use any copyrighted information or patient photos. The data presented in this manuscript was acquired in accordance with the policies of the institutional review board at our institution.