Deep learning‐based survival analysis for brain metastasis patients with the national cancer database

Abstract
Purpose: Prognostic indices such as the Brain Metastasis Graded Prognostic Assessment have been used in clinical settings to aid physicians and patients in determining an appropriate treatment regimen. These indices are derived from traditional survival analysis techniques such as Cox proportional hazards (CPH) and recursive partitioning analysis (RPA). Previous studies have shown that by evaluating CPH risk with a nonlinear deep neural network, DeepSurv, patient survival can be modeled more accurately. In this work, we apply DeepSurv to a test case: breast cancer patients with brain metastases who have received stereotactic radiosurgery.
Methods: Survival times, censorship status, and 27 covariates, including age, staging information, and hormone receptor status, were provided for 1673 patients by the NCDB. Monte Carlo cross-validation with 50 samples of 1400 patients was used to train and validate the DeepSurv, CPH, and RPA models independently. DeepSurv was implemented with L2 regularization, batch normalization, dropout, Nesterov momentum, and learning rate decay. RPA was implemented as a random survival forest (RSF). Concordance indices of test sets of 140 patients were used for each sample to assess the generalizable predictive capacity of each model.
Results: Following hyperparameter tuning, DeepSurv was trained in 32 min per sample on a 1.33 GHz quad-core CPU. Test set concordance indices of 0.7488 ± 0.0049, 0.6251 ± 0.0047, and 0.7368 ± 0.0047 were found for DeepSurv, CPH, and RSF, respectively. A Tukey HSD test demonstrates a statistically significant difference between the mean concordance indices of the three models.
Conclusion: Our results suggest that deep learning-based survival prediction can outperform traditional models, specifically in a case where an accurate prognosis is highly clinically relevant.
We recommend that where appropriate data are available, deep learning‐based prognostic indicators should be used to supplement classical statistics.


1.A | Clinical motivation
The median survival time for patients with brain metastases is on the order of months; however, some groups of patients can significantly outlive the median. Physicians have many treatment options to choose from, where the potential for disease-free recovery is strongly connected to the treatment intensity. The Brain Metastasis Graded Prognostic Assessment (GPA) is one clinical tool that allows physicians to predict the longevity of patients with brain metastases and thus select an appropriate treatment based on expected patient lifetime.
For example, patients expected to live longer than 6 months are more likely to benefit from the short-term memory protection offered by focused stereotactic radiosurgery. On the other hand, patients with a more limited life expectancy may be just as well served by simpler whole-brain radiotherapy, as they may not live long enough to experience its longer-term cognitive effects.
The brain metastasis GPA uses multivariate Cox regression (MCR) and recursive partitioning analysis (RPA) to determine factors that significantly contribute to survival predictions. In the specific case of breast cancer, the factors determined to be most significant include Karnofsky performance status, number of metastases, and hormone receptor characterization. The weights obtained from MCR are used to compute an index on a scale from 0.0 to 4.0 that maximizes the separation between the survival curves of the groups. New patients are placed in a group according to a few features and given a highly nonspecific survival estimate, as in Fig. 1. 1,2 In this study, we focus on predicting survival probabilities for patients with brain metastases and a breast primary site using a deep neural network. We expect that a representation learning approach to prognostic assessment will produce more accurate survival estimates than MCR and RPA for our dataset. Similar work in machine learning for patient prognosis has been done by Alcorn et al., 3 whose work focuses on the application of random survival forests to the problem of prognosis for patients with bone metastases.

1.B | Cox proportional hazards
Proportional hazards models are regarded as the gold standard for survival analysis. 4 Cox models aim to describe a patient-specific hazard function (event rate), given a quantitative description of the patient's attributes (covariates, features). 5 According to the proportional hazards assumption, the event rate for a patient having covariates x at time t is modeled with the hypothesis function
\[
h(t \mid x; \Theta) = \lambda_0(t)\, e^{\Theta^\top x},
\]
where \(\lambda_0(t)\) is a baseline hazard common to all patients and \(\Theta\) are the model weights.
Regression with survival data is limited by censoring, or "loss to follow-up." There is no meaningful way to ascribe an event time to patients who discontinue communication with record keepers.
Therefore, the parameters of the Cox model must be learned with a nonparametric objective function. The parameters \(\Theta^*\) that best predict the order of survival times for N patients having covariates \(\{x^{(1)}, \ldots, x^{(N)}\}\) and survival times \(\{t^{(1)}, \ldots, t^{(N)}\}\) are obtained by maximizing the Cox partial likelihood
\[
L(\Theta) = \prod_{i:\,\delta_i = 1} \frac{h\big(t^{(i)} \mid x^{(i)}; \Theta\big)}{\sum_{j:\, t^{(j)} \ge t^{(i)}} h\big(t^{(i)} \mid x^{(j)}; \Theta\big)},
\]
where \(\delta_i = 1\) indicates that patient i was not lost to follow-up.
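In practice, the negative log of this partial likelihood is minimized. A minimal numpy sketch of that objective (an illustration, not the paper's implementation) is shown below; the risk scores stand in for the log-hazard contributions \(\Theta^\top x\), and ties in survival time receive no special handling:

```python
import numpy as np

def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood.

    risk  : log-risk scores (theta^T x in the linear Cox model)
    time  : observed survival or censoring times
    event : delta_i -- 1 if the event was observed, 0 if censored
    """
    order = np.argsort(-time)                # sort patients by descending time
    risk, event = risk[order], event[order]
    # log of the denominator: log-sum-exp over the risk set {j : t_j >= t_i},
    # computed stably with a running logaddexp
    log_risk_set = np.logaddexp.accumulate(risk)
    # each uncensored patient contributes log h_i - log(sum over its risk set)
    return -np.sum((risk - log_risk_set)[event == 1])
```

Minimizing this quantity over \(\Theta\) recovers the maximum partial likelihood estimate \(\Theta^*\).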

1.C | DeepSurv
Deep learning has been shown to be an effective tool for modeling nonlinear functions. There have been many breakthroughs in image classification, natural language processing, and other fields due to new methods and the increased availability of deep learning platforms. 6 In 2016, Katzman et al. introduced DeepSurv, which replaces the linear Cox risk function with the output of a deep neural network and has been shown to model patient survival more accurately than the standard linear model. The concordance index on held-out test sets (compared across models with a Tukey HSD test) was used to assess model performance. 9 The three models considered were DeepSurv, CPH, and a random survival forest (RSF).
The RSF serves as a benchmark nondeep state-of-the-art survival analysis method, based on recursive partitioning analysis. For more on RSFs, see Refs. 10 and 11.
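The essential change DeepSurv makes is to replace the linear predictor \(\Theta^\top x\) in the hazard with the output of a feed-forward network. A toy numpy sketch follows; the 27 inputs match the covariate count in our dataset, but the two-layer architecture, hidden size, and random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer network: 27 covariates in, one log-risk score out.
W1 = rng.normal(size=(27, 16)) * 0.1   # input -> hidden weights
b1 = np.zeros(16)                      # hidden biases
w2 = rng.normal(size=16) * 0.1         # hidden -> output weights

def deep_risk(x):
    """Nonlinear risk score h_theta(x) replacing theta^T x in the hazard."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer
    return hidden @ w2                       # one scalar log-risk per patient
```

Training such a network means minimizing the negative log partial likelihood with `deep_risk(x)` in place of the linear scores, which is what allows nonlinear interactions between covariates to influence the predicted hazard.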

2.B | Survival model evaluation
The generalization error for a model with parameters Θ can be described by its concordance index C on a test data set. 12 The model parameters were fit by gradient ascent on the log partial likelihood, \(\Theta \leftarrow \Theta + \alpha\, \nabla_\Theta \log L(\Theta)\), where α is the learning rate. Updates were halted when the validation concordance appeared to converge to a maximum (60 iterations). A Wald test with significance level α = 0.01 was used to identify the parameters that are likely to be truly nonzero in the Cox framework. 14 Features and significance levels are displayed in Table 1.
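The concordance index is the fraction of comparable patient pairs whose predicted risks are ordered consistently with their observed survival times. A simple O(N²) sketch (for illustration; tied predictions get half credit, and no further tie corrections are applied):

```python
def concordance_index(pred_risk, time, event):
    """C-index: fraction of comparable pairs ordered correctly.

    A pair (i, j) is comparable when the patient with the shorter
    observed time actually experienced the event (event == 1);
    higher predicted risk should accompany shorter survival.
    """
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:   # comparable pair
                den += 1
                if pred_risk[i] > pred_risk[j]:
                    num += 1.0                        # correctly ordered
                elif pred_risk[i] == pred_risk[j]:
                    num += 0.5                        # tied prediction
    return num / den if den else float("nan")
```

A C of 1.0 indicates perfect ranking, while 0.5 corresponds to random ordering.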
The deep neural network DeepSurv was implemented in Python with L2 regularization, batch normalization, dropout, Nesterov momentum, and learning rate decay. A six-dimensional box in hyperparameter space was uniformly sampled 100 times and DeepSurv's performance was evaluated with a validation dataset. 15 The hyperparameters that yielded the highest validation accuracy (Fig. 3) were chosen for deployment. DeepSurv was then trained for 7000 epochs per sample at about 32 min per sample on a 1.33 GHz quad-core CPU.
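The random hyperparameter search described above can be sketched as follows. The six dimension names and ranges here are assumptions for illustration only (they mirror the tuned quantities named in the text but are not the paper's actual search box), the learning rate is drawn log-uniformly rather than uniformly, and `evaluate` is a stand-in for training DeepSurv and scoring the validation set:

```python
import math
import random

# Hypothetical six-dimensional search box (illustrative ranges).
SEARCH_BOX = {
    "learning_rate": (1e-5, 1e-2),   # sampled on a log scale below
    "lr_decay":      (1e-4, 1e-2),
    "l2_penalty":    (1e-3, 10.0),
    "dropout":       (0.0, 0.5),
    "momentum":      (0.80, 0.99),
    "hidden_units":  (8, 128),       # integer-valued dimension
}

def sample_hyperparameters(rng=random):
    """Draw one point from the search box."""
    hp = {}
    for name, (lo, hi) in SEARCH_BOX.items():
        if name == "learning_rate":
            # log-uniform sampling for a scale parameter
            hp[name] = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
        elif name == "hidden_units":
            hp[name] = rng.randint(lo, hi)
        else:
            hp[name] = rng.uniform(lo, hi)
    return hp

def random_search(evaluate, n_samples=100, rng=random):
    """Keep the sample with the best validation score (e.g., c-index)."""
    best_hp, best_score = None, float("-inf")
    for _ in range(n_samples):
        hp = sample_hyperparameters(rng)
        score = evaluate(hp)   # stand-in for train + validate DeepSurv
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score
```

Random search of this kind is a common alternative to grid search when each evaluation (here, a full DeepSurv training run) is expensive.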

3 | RESULTS
The three models were independently trained and validated. A Tukey Honestly Significant Difference test was used to evaluate the difference of mean concordances for each model. 16 To better understand the biases of the working models, we consider the distribution of errors \(\hat{T} - T\) for each model (Fig. 6).

4 | DISCUSSION
Advanced survival analysis techniques suffer from inaccessibility.
The brain metastasis GPA is widely popular because it is online and easy to use. Any future deep learning-based approaches to patient prognosis should be accessible to physicians in the form of a webpage or easy-to-use software.
One weakness of the deep risk framework is the lack of a time-dependent hazard estimate; DeepSurv acts as an extension of the classic CPH model. Luck et al. have recently shown that by directly modeling the survival function, as opposed to the risk, they can obtain concordance indices on par with those generated by DeepSurv. 19 A final implementation of this work for clinical use might benefit from an effort to include time dependence.
Currently, there does not appear to be any significant benefit to using DeepSurv over the random survival forest. However, deep learning is a very rapidly growing field. DeepSurv, despite utilizing several state-of-the-art training techniques (dropout, batch normalization, L2 regularization), is architecturally quite simple. It does not take advantage of expected patterns in survival data in the way that convolutional networks handle images and recurrent networks handle language.
Our group expects that deep learning-based methods will continue to improve in the near future.

CONFLICT OF INTEREST
The authors declare no conflict of interest.