A practical method to quantify knowledge‐based DVH prediction accuracy and uncertainty with reference cohorts

Abstract The adoption of knowledge‐based dose‐volume histogram (DVH) prediction models for assessing organ‐at‐risk (OAR) sparing in radiotherapy necessitates quantification of prediction accuracy and uncertainty. Moreover, DVH prediction error bands should be readily interpretable as confidence intervals in which to find a percentage of clinically acceptable DVHs. In the event such DVH error bands are not available, we present an independent error quantification methodology using a local reference cohort of high‐quality treatment plans, and apply it to two DVH prediction models, ORBIT‐RT and RapidPlan, trained on the same set of 90 volumetric modulated arc therapy (VMAT) plans. Organ‐at‐risk DVH predictions from each model were then generated for a separate set of 45 prostate VMAT plans. Dose‐volume histogram predictions were then compared to their analogous clinical DVHs to define prediction errors $V_{\mathrm{clin},i} - V_{\mathrm{pred},i}$ (ith plan), from which prediction bias $\mu$, prediction error variation $\sigma$, and root‐mean‐square error $\mathrm{RMSE}_{\mathrm{pred}} \equiv \sqrt{\frac{1}{N}\sum_i \left(V_{\mathrm{clin},i} - V_{\mathrm{pred},i}\right)^2} \cong \sqrt{\sigma^2 + \mu^2}$ could be calculated for the cohort. The empirical $\mathrm{RMSE}_{\mathrm{pred}}$ was then contrasted to the model‐provided DVH error estimates. For all prostate OARs, above 50% Rx dose, ORBIT‐RT μ and σ were comparable to or less than those of RapidPlan. Above 80% Rx dose, μ < 1% and σ < 3‐4% for both models. As a result, above 50% Rx dose, ORBIT‐RT $\mathrm{RMSE}_{\mathrm{pred}}$ was below that of RapidPlan, indicating slightly improved accuracy in this cohort. Because μ ≈ 0, $\mathrm{RMSE}_{\mathrm{pred}}$ is readily interpretable as a canonical standard deviation σ, whose error band is expected to correctly predict 68% of normally distributed clinical DVHs. By contrast, RapidPlan's provided error band, although described in literature as a standard deviation range, was slightly less predictive than $\mathrm{RMSE}_{\mathrm{pred}}$ (55–70% success), while the provided ORBIT‐RT error band was confirmed to resemble an interquartile range (40–65% success) as described.
Clinicians can apply this methodology using their own institutions’ reference cohorts to (a) independently assess a knowledge‐based model's predictive accuracy of local treatment plans, and (b) interpret from any error band whether further OAR dose sparing is likely attainable.


| MATERIALS AND METHODS
A set of 135 known high-quality [3] volumetric modulated arc therapy (VMAT) prostate treatment plans from our institution was available as the reference planning cohort. Ninety of these plans (training set) were used to train both the ORBIT-RT and RapidPlan models. Both models were then used to generate OAR DVH predictions for the remaining 45 treatment plans (validation set, $N = 45$).
The 45 predicted DVHs $V_{\mathrm{pred},i}(D)$ of the validation set were then compared to their analogous, clinically accepted DVHs $V_{\mathrm{clin},i}(D)$ to define prediction errors $V_{\mathrm{clin},i} - V_{\mathrm{pred},i}$. As the DVHs are necessarily OAR volume normalized, all DVH error metrics are consequently also expressed as OAR volume percentages and as functions of dose.
The dosewise mean error $\mu \equiv \langle V_{\mathrm{clin},i} - V_{\mathrm{pred},i} \rangle$ serves as a metric for prediction bias, while the standard deviation $\sigma$ of $V_{\mathrm{clin},i} - V_{\mathrm{pred},i}$ indicates prediction error variation. Summation in quadrature of bias $\mu$ and error uncertainty $\sigma$ yields the root-mean-square error of the predictions, $\mathrm{RMSE}_{\mathrm{pred}}$:

$$\mathrm{RMSE}_{\mathrm{pred}} \equiv \sqrt{\frac{1}{N}\sum_i \left(V_{\mathrm{clin},i} - V_{\mathrm{pred},i}\right)^2} \cong \sqrt{\sigma^2 + \mu^2} \qquad (1)$$

In other words, the accuracy of the ORBIT-RT and RapidPlan models was empirically sampled from the same set of independent DVH prediction "trials," and the resulting differences between the predictions and the reference clinical values were quantified in aggregate by $\mathrm{RMSE}_{\mathrm{pred}}$. The values $\mu$, $\sigma$, and $\mathrm{RMSE}_{\mathrm{pred}}(\mu, \sigma)$ serve as independent metrics for prediction accuracy because they do not depend on the particular prediction model used.
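The dosewise metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `dvh_error_metrics`, the array layout (one row per plan, one column per dose bin), and the synthetic data are all assumptions made for the example. With the population standard deviation (`ddof=0`), the quadrature relation in Eq. (1) holds exactly; with a sample standard deviation it holds only approximately, which is one reading of the "≅" sign.

```python
import numpy as np

def dvh_error_metrics(v_clin, v_pred):
    """Dosewise bias, spread, and RMSE of DVH prediction errors.

    v_clin, v_pred: arrays of shape (n_plans, n_dose_bins), in % OAR volume.
    (Hypothetical layout chosen for this sketch.)
    """
    err = np.asarray(v_clin, dtype=float) - np.asarray(v_pred, dtype=float)
    mu = err.mean(axis=0)                    # prediction bias at each dose
    sigma = err.std(axis=0, ddof=0)          # error variation at each dose
    rmse = np.sqrt((err ** 2).mean(axis=0))  # left-hand side of Eq. (1)
    return mu, sigma, rmse

# Synthetic stand-in for a 45-plan validation set (illustrative numbers only).
rng = np.random.default_rng(0)
n_plans, n_bins = 45, 101
v_pred = np.tile(np.linspace(100.0, 0.0, n_bins), (n_plans, 1))
v_clin = v_pred + rng.normal(0.5, 3.0, size=(n_plans, n_bins))

mu, sigma, rmse = dvh_error_metrics(v_clin, v_pred)
# Quadrature check: RMSE^2 equals sigma^2 + mu^2 when ddof=0.
quadrature_ok = np.allclose(rmse ** 2, sigma ** 2 + mu ** 2)
```

Each returned array has one value per dose bin, so the three metrics can be plotted against dose in the same way as the DVHs themselves.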
Furthermore, when $\mu \approx 0$ in Eq. (1), $\mathrm{RMSE}_{\mathrm{pred}}$ is readily interpretable as a canonical standard deviation, wherein we would expect to find 68% of an ideal, normal distribution of the reference cohort's 45 clinical DVHs, by the central limit theorem. $\mathrm{RMSE}_{\mathrm{pred}}$ is a statistical outcome of the entire reference cohort, and is thus not patient specific.
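As a worked check of the 68% figure, consider a prediction error $e$ that is assumed to be normally distributed with zero mean and standard deviation $\sigma$ (an idealization, not a property verified here). The fraction of errors falling inside a symmetric band of half-width $k\sigma$ is then given by the error function, and the same formula gives the half-width of a band that covers 50% of cases, as an interquartile range does:

```latex
\begin{align}
P\left(|e| \le k\sigma\right) &= \operatorname{erf}\!\left(\frac{k}{\sqrt{2}}\right), \\
P\left(|e| \le \sigma\right) &= \operatorname{erf}\!\left(\frac{1}{\sqrt{2}}\right) \approx 0.683, \\
P\left(|e| \le 0.674\,\sigma\right) &\approx 0.50 .
\end{align}
```

Under this assumption, a band that captures roughly half of the clinical DVHs behaves like an interquartile range, while a band that captures roughly 68% behaves like one standard deviation.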
Both ORBIT-RT and RapidPlan also provide their own patient-specific DVH error estimates (Fig. 1

| RESULTS
Following the prescribed methodology, we first examine the models' prediction bias $\mu$ and error uncertainty $\sigma$ (Fig. 2). Then we see how $\mu$ and $\sigma$ contribute to $\mathrm{RMSE}_{\mathrm{pred}}$ (Fig. 3), our independent metric for model accuracy. Finally, we compare the provided error bands of ORBIT-RT and RapidPlan to our $\mathrm{RMSE}_{\mathrm{pred}}$ by quantifying their prediction success rates (Fig. 4).

There is greater distinction in $\sigma$ between the ORBIT-RT and RapidPlan models in Fig. 2. Above 50% Rx dose, $\sigma$ for ORBIT-RT was comparable to or less than that of RapidPlan. This was most clinically significant at 100% Rx dose, where RapidPlan predictions for the validation set varied as much as 1-2% of OAR volume for the bladder, rectum, and penile bulb. One exception to this trend was the low-dose rectum, known to have large error [8], in which $\sigma$ for ORBIT-RT was greater than that of RapidPlan below 50% Rx dose.
Using Eq. (1) and our subsequent observations of $\mu$ and $\sigma$, we now examine the empirical prediction error estimate $\mathrm{RMSE}_{\mathrm{pred}}$ for both models. Figure 3 summarizes $\mathrm{RMSE}_{\mathrm{pred}}$ calculations for all OARs over the examined dose interval. Above 50% Rx dose, in the clinically relevant prostate dose interval, ORBIT-RT predictions exhibited comparable or slightly lower $\mathrm{RMSE}_{\mathrm{pred}}$ than RapidPlan, indicating slightly improved accuracy. Below 50% Rx dose, the relative accuracy between the two models was more variable.
When $\mu \approx 0$, as verified in Fig. 2 […]

The model-propagated error bands of both models were found to capture clinical DVHs less frequently than our empirical error band $V_{\mathrm{pred}} \pm \mathrm{RMSE}_{\mathrm{pred}}$ (Fig. 4), which was shown to perform in line with a canonical standard deviation. For ORBIT-RT, this was hypothesized; at 40-65% predictive success, ORBIT-RT's error band more closely resembles an IQR, as described in Ref. [11]. The predictive success of RapidPlan's error band, described in Ref. […]

Fig. 4. Clinical prostate OAR DVHs successfully predicted by their analogous predictions' error bands were tallied as a percentage of the total validation set. An 11-point boxcar smoothing routine has been applied to the data. As expected, by Eq. (1), both ORBIT-RT and RapidPlan empirical $\mathrm{RMSE}_{\mathrm{pred}}$ bands (a, c) successfully predict clinical DVHs at a frequency typical of $\sigma$ or greater. Meanwhile, the ORBIT-RT model-propagated error band more closely resembles an IQR (b), and the RapidPlan model-propagated error band is slightly less predictive than $\sigma$ (d).
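The success-rate tally and smoothing described in the Fig. 4 caption could be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' routine: the function names `band_success_rate` and `boxcar_smooth` are hypothetical, the band is taken to be symmetric about the prediction, and the edge handling of the boxcar filter (repeating the end values) is a choice made here because the source does not specify it.

```python
import numpy as np

def band_success_rate(v_clin, v_pred, half_width):
    """Percentage of plans whose clinical DVH lies inside v_pred +/- half_width.

    v_clin, v_pred: shape (n_plans, n_dose_bins); half_width: scalar, per-dose
    array of shape (n_dose_bins,), or per-plan array of the full shape.
    """
    err = np.abs(np.asarray(v_clin, dtype=float) - np.asarray(v_pred, dtype=float))
    inside = err <= half_width           # True where the band captures the DVH
    return 100.0 * inside.mean(axis=0)   # tally over plans at each dose

def boxcar_smooth(y, width=11):
    """Moving average of odd width; end values are repeated to pad the edges."""
    pad = width // 2
    padded = np.pad(np.asarray(y, dtype=float), pad, mode="edge")
    kernel = np.ones(width) / width
    return np.convolve(padded, kernel, mode="valid")

# Synthetic validation set with zero-bias normal errors (illustrative only).
rng = np.random.default_rng(1)
n_plans, n_bins = 45, 101
v_pred = np.tile(np.linspace(100.0, 0.0, n_bins), (n_plans, 1))
v_clin = v_pred + rng.normal(0.0, 3.0, size=(n_plans, n_bins))

rmse = np.sqrt(((v_clin - v_pred) ** 2).mean(axis=0))  # empirical band
success = band_success_rate(v_clin, v_pred, rmse)
smoothed = boxcar_smooth(success, width=11)
```

Passing a model-provided, patient-specific band as `half_width` instead of the cohort-level `rmse` would give the corresponding tally for that band, which is the comparison the figure makes.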