Concurrent Monte Carlo transport and fluence optimization with fluence adjusting scalable transport Monte Carlo

Purpose: The future of radiation therapy will require advanced inverse planning solutions to support single-arc, multiple-arc, and "4π" delivery modes, which present unique challenges in finding an optimal treatment plan over a vast search space, while still preserving dosimetric accuracy. The successful clinical implementation of such methods would benefit from Monte Carlo (MC) based dose calculation methods, which can offer improvements in dosimetric accuracy when compared to deterministic methods. The standard method for MC based treatment planning optimization leverages the accuracy of the MC dose calculation and efficiency of well-developed optimization methods, by precalculating the fluence to dose relationship within a patient with MC methods and subsequently optimizing the fluence weights. However, the sequential nature of this implementation is computationally time consuming and memory intensive. Methods to reduce the overhead of the MC precalculation have been explored in the past, demonstrating promising reductions of computational time overhead, but with limited impact on the memory overhead due to the sequential nature of the dose calculation and fluence optimization. The authors propose an entirely new form of "concurrent" Monte Carlo treatment plan optimization: a platform which optimizes the fluence during the dose calculation, reduces wasted computation time spent on beamlets that weakly contribute to the final dose distribution, and requires only a low memory footprint to function. In this initial investigation, the authors explore the key theoretical and practical considerations of optimizing fluence in such a manner. Methods: The authors present a novel derivation and implementation of a gradient descent algorithm that allows for optimization during MC particle transport, based on highly stochastic information generated through particle transport of very few histories.
A gradient rescaling and renormalization algorithm, and the concept of momentum from stochastic gradient descent were used to address obstacles unique to performing gradient descent fluence optimization during MC particle transport. The authors have applied their method to two simple geometrical phantoms, and one clinical patient geometry to examine the capability of this platform to generate conformal plans as well as assess its computational scaling and efficiency, respectively. Results: The authors obtain a reduction of at least 50% in total histories transported in their investigation compared to a theoretical unweighted beamlet calculation and subsequent fluence optimization method, and observe a roughly fixed optimization time overhead consisting of ∼10% of the total computation time in all cases. Finally, the authors demonstrate a negligible increase in memory overhead of ∼7–8 MB to allow for optimization of a clinical patient geometry surrounded by 36 beams using their platform. Conclusions: This study demonstrates a fluence optimization approach, which could significantly improve the development of next generation radiation therapy solutions while incurring minimal additional computational overhead.


INTRODUCTION
Future radiation treatment technology will require advanced inverse planning solutions to account for an increasing number of physical and biological parameters used to meet clinical objectives. The introduction of single-arc or multiple-arc delivery modes in the clinic along with the proposal of "4π" therapy presents unique inverse planning challenges involving large, highly complex optimization search spaces. The inclusion of biological information in planning decisions has further exacerbated the situation. The impact of this added complexity is the challenge of efficiently finding an optimal treatment plan over a vast search space while preserving dosimetric accuracy and ensuring that the plan can be delivered in a timely manner.
The successful implementation of such technologies would greatly benefit from Monte Carlo (MC) based dose calculation methods, which can offer improvements in dosimetric accuracy of up to 20% compared to deterministic methods. [1][2][3] The standard method for Monte Carlo based treatment planning optimization occurs in two discrete steps, where first the patient-specific dose coefficient matrix (DCM) is generated, either in the form of beamlets 4 or inverse dose kernels, 5 followed by a subsequent optimization step. While this sequential MC treatment plan optimization (MCTPO) paradigm independently leverages the accuracy of the MC dose calculation and the efficiency of well-developed optimization methods, the intermediary DCM central to this method incurs the penalty of being computationally time consuming to generate, and memory intensive to both store and utilize during optimization. The calculation of the DCM is a computationally intensive process because the dosimetric uncertainty of the matrix elements must be reduced to nominally <2% for the optimized plan to converge to the true plan. 6 A variety of methods have been explored to reduce the Monte Carlo dose calculation time, such as the utilization of GPU-based techniques to exploit the inherent parallelizability of MC particle transport, demonstrating the capability of greatly reducing the calculation time required for transporting a given number of histories. 4,7,8 To our knowledge, there has also been one recent work in the literature by Li et al. 9 successfully demonstrating the potential for further reductions in computation time using intelligent biasing of beamlet calculations. Li et al. achieve this by performing multiple sequential MCTPO loops, using progressively less noisy beamlets.
In each loop, noisy beamlets are generated using one to two orders of magnitude fewer histories (10^3–10^4) than would be used to generate a ground-truth beamlet, and fed into a traditional optimization algorithm for fluence optimization. The optimized fluence weights are used to bias the noisy beamlet dose calculation in the next sequential MCTPO loop, to preferentially transport more histories in higher-weight beamlet pixels (bixels). The beamlet data are accumulated with prior loops, improving overall beamlet statistics, and the fluence is reoptimized. The sequential MCTPO loops continue until dosimetric changes between sequential MCTPO iterations fall below a set threshold. Their work demonstrated that, compared to a plan generated using one unbiased beamlet calculation and subsequent optimization step, upwards of a 77% reduction in computation time may be achieved by biasing the beamlet histories throughout repeated MCTPO iterations.
However, efforts focusing on the reduction of dose calculation time address a portion of the computational burden that is linear at worst. The memory footprint of the resulting DCM still scales as the product of the number of bixels available in the treatment and target voxels in the optimization objective, and has grown increasingly large and difficult to optimize over for novel treatment delivery modes. As a consequence, the resolution of patient geometries is often down-sampled, 10 fewer low-resolution optimization structures are used 11 with fewer constraints applied to those structures, 12 and the Monte Carlo calculated dose kernels are truncated 9,11,13,14 in attempts to reduce the computational burden during the subsequent optimization step. Ironically, the dosimetric accuracy afforded by Monte Carlo dose calculation is compromised at every step leading up to the optimization. As the complexity of the treatment plan grows, reducing the fidelity of the optimization space will produce diminishing returns, and traditional optimization methods will become a limiting bottleneck in the overall treatment planning process.
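The m × n scaling of the DCM footprint described above can be made concrete with a back-of-the-envelope estimate. The plan sizes below are hypothetical round numbers chosen only to illustrate the scaling, not figures from this work:

```python
# Back-of-the-envelope DCM memory estimate (hypothetical plan sizes,
# chosen only to illustrate the m x n scaling described in the text).
def dcm_memory_mb(n_voxels, n_bixels, bytes_per_element=4):
    """Memory to store a dense dose coefficient matrix of m x n floats."""
    return n_voxels * n_bixels * bytes_per_element / 1e6

# A modest plan: ~3.15e5 optimization voxels and 5000 bixels.
dense = dcm_memory_mb(3.15e5, 5000)   # dense m x n matrix: thousands of MB
vector = dcm_memory_mb(3.15e5, 1)     # a single m x 1 dose vector: ~1 MB
print(f"dense DCM: {dense:.0f} MB, single dose vector: {vector:.2f} MB")
```

Even before any down-sampling, a dense matrix at this size occupies gigabytes, while a single dose vector of the same voxel resolution occupies about a megabyte, which motivates the concurrent approach below.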
To directly address both the memory constraint faced by current treatment planning optimization techniques and to reduce wasted computation time on beamlets that contribute minimally to the final plan, we propose an entirely new form of "concurrent" Monte Carlo treatment plan optimization (CMCTPO) platform, which optimizes the fluence during MC transport itself. The optimization occurs concurrently with the dose calculation, circumventing the necessity of the DCM and reducing the computational and memory overhead to linear scaling while retaining full dosimetric fidelity.
Specifically, we directly scale the MC transport of particle histories to increments in fluence and dose in order to update dose-based objectives during particle transport. Unlike all other MCTPO methods, MC transport itself is used as a surrogate for the dose coefficient matrix that is traditionally used to relate fluence to dose. During particle transport, incremental changes in dose due to the transport of very few histories (10–10^3 per beamlet) are used to calculate a gradient, which is used to bias the number of histories transported in subsequent iterations. By utilizing MC transport to obviate the DCM, a drastic reduction in memory overhead can be realized during fluence optimization. We address two unique challenges that stem from performing optimization during MC transport in such a manner, namely, (1) a significant uncertainty in the gradient calculated at every optimization iteration due to the relatively few particles transported and (2) a potential increase in dosimetric uncertainty when decrementing fluence weights. A stochastic gradient descent (SGD) method is employed to allow for optimization of the dose objective using the gradient generated from very few histories per beamlet per iteration. Additionally, a novel gradient rescaling and global renormalization method has been developed to decrement fluence weights and calculated dose between iterations, a process necessary when gradient elements are evaluated to be negative. We apply our method to convex dose prescription objectives in a proof-of-principle demonstration of the algorithm.
By optimizing the fluence during the dose calculation, there is also the added benefit of reducing wasted computation time spent on beamlets that weakly contribute to the final dose distribution, demonstrated by Li et al. This platform permits the level of accuracy and fidelity needed to perform voxel-level optimization and could significantly improve next generation radiation therapy solutions.

METHODS

2.A. Optimization expression
Without loss of generality, it is possible to describe the goal of optimization in treatment planning as finding the set of bixel weights b ∈ ℜ^n which minimizes some cost function of the dose d = Db, with D the dose coefficient matrix,

b* = arg min_b F(Db), (1)

where in practice, the global cost function F(d) can be a combination of different cost functions for different biological structures, such as the quadratic dose constraint for the target volume, min/max dose for certain critical structures, or monotonically increasing costs for normal tissue (NT), and the like. To minimize these objectives, numerical gradient descent methods are often employed; the bixel weights are iteratively modified by an update rule such as

b^(k+1) = b^(k) − η ∇_b F(Db^(k)), (2)

where (k) denotes the iteration number, η a step-size, and ∇_b F(Db^(k)) is the gradient of the objective function with respect to beamlet weights. In Secs. 2.B–2.I, describing our proof-of-principle CMCTPO platform, fluence adjusting scalable transport Monte Carlo (FASTMC), we discuss the considerations necessary for integrating gradient descent into MC transport.
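The generic gradient descent update rule above can be sketched in a few lines. This is a minimal illustration for a quadratic target objective F(d) = ||d − d_rx||^2; the matrix D, prescription d_rx, and all sizes are toy stand-ins, not clinical data:

```python
import numpy as np

# Minimal sketch of the generic fluence update b <- b - eta * grad F(Db)
# for a quadratic objective F(d) = ||d - d_rx||^2. All data are toys.
rng = np.random.default_rng(0)
D = rng.random((50, 8))       # dose coefficient matrix (voxels x bixels)
d_rx = np.full(50, 1.0)       # prescribed dose per voxel
b = np.zeros(8)               # bixel weights, initialized to zero
eta = 0.002                   # fixed step-size (small enough for stability)

for k in range(500):
    d = D @ b                            # current dose distribution
    grad = 2.0 * D.T @ (d - d_rx)        # exact gradient of the quadratic cost
    b = np.clip(b - eta * grad, 0.0, None)  # keep fluence non-negative

cost = float(np.sum((D @ b - d_rx) ** 2))
```

In the sequential MCTPO paradigm, the full matrix D must be precomputed and held in memory to evaluate this gradient; the following subsections describe how FASTMC avoids that.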

2.B. FASTMC algorithm: Concurrent Monte Carlo transport and fluence optimization
In FASTMC, a standard beamlet calculation consisting of generally N_j ∼ 10^6 particles per beamlet 9,15 is divided into "sub-beamlet" calculations of approximately ∆N_j^(k) ∼ 10–10^3 histories per iteration, with the bixel weight incremented as

b_j^(k) = b_j^(k−1) + ∆b_j^(k). (3)

This process is integrated with the Monte Carlo dose calculation itself, such that in FASTMC, the increment of beamlet weight is directly tied to MC transport of histories by a fixed weight γ,

∆N_j^(k) = γ ∆b_j^(k). (4)

The MC transport of ∆N_j^(k) histories results in an increment to the corresponding dose profile via

d^(k) = d^(k−1) + D_{*,j}^(k) ∆b_j^(k), (5)

with D_{*,j}^(k) being the sub-beamlet for bixel j calculated from MC transport stepping from b^(k−1). The gradient in Eq. (2) is calculated using a backward finite difference, by calculating the cost secant between the current (iteration k) and previous (iteration k−1) doses, for each bixel j,

∂F/∂b_j ≈ [F(d^(k)) − F(d^(k−1))] / ∆b_j^(k). (6)

The ∆b^(k) step may then be set according to

∆b^(k+1) = −η ∇_b F(d^(k)). (7)

For this initial investigation, a fixed step-size η was used, and the optimization was terminated once the beamlet increment ∆b = η∇F, and correspondingly the magnitude of the gradient itself, fell below a preset threshold. Figure 1 illustrates the general process by which this optimization is integrated with the dose calculation. For this proof-of-principle investigation, all parameters were empirically determined for the specific cases presented.
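The coupling between transported histories, fluence increments, and the cost secant can be sketched as a toy loop. Here "transport" is mimicked by adding counting-statistics noise to a known per-history dose column; the geometry sizes, γ, η, and the noise model are all illustrative assumptions, not the platform's actual transport:

```python
import numpy as np

# Toy sketch of the FASTMC coupling: each bixel j transports dN_j = gamma*db_j
# histories, the dose increment is scored, and the next step is sized from the
# cost secant (F(d_k) - F(d_{k-1})) / db_j.  MC transport is mimicked by noise
# on a known per-history dose column; all names and sizes are illustrative.
rng = np.random.default_rng(1)
n_vox, n_bix = 30, 6
D_true = rng.random((n_vox, n_bix)) / n_vox   # dose per unit fluence (toy)
d_rx = np.full(n_vox, 1.0)                    # prescription
cost = lambda d: float(np.sum((d - d_rx) ** 2))

gamma, eta = 1000.0, 0.5       # histories per unit fluence; step-size
b = np.zeros(n_bix)            # bixel weights
d = np.zeros(n_vox)            # accumulated dose
grad = -np.ones(n_bix)         # uniform initial "downhill" guess

for k in range(200):
    db = np.clip(-eta * grad, 0.0, None)       # positive increments only here
    dN = np.rint(gamma * db).astype(int)       # histories to transport per bixel
    new_grad = np.zeros(n_bix)
    for j in range(n_bix):
        if dN[j] == 0:
            continue                           # terminated bixel: no transport
        # "transport" dN[j] histories: noisy estimate of the dose column
        col = D_true[:, j] * (1 + rng.normal(0, 1 / np.sqrt(dN[j]), n_vox))
        f_prev = cost(d)
        d = d + col * db[j]                    # incremental dose update
        new_grad[j] = (cost(d) - f_prev) / db[j]   # backward-difference secant
        b[j] += db[j]
    grad = new_grad

final_cost = cost(d)
```

Note that no m × n matrix is ever stored: only the current dose vector, one transient dose column, and per-bixel scalars persist between iterations.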
The backward difference method used in Eq. (6) has the benefits of having a very low memory footprint (due to being calculable with just one D_{*,j}^(k) column) and supporting a variety and combination of potentially analytically nondifferentiable dose objectives. However, the drawback is that it introduces an ∼O(∆b) truncation error into the estimate of the gradient 16 that can potentially slow the rate of convergence during the gradient descent. Near the minimum, this truncation error also results in an additional uncertainty as to the position of the global minimum. While an analysis of the clinical impact such a truncation error would have on clinical plans was beyond the scope of our proof-of-principle investigation, an alternative objective function formulation suitable for CMCTPO, which allows for an exact calculation of the derivative, is presented in the discussion.
Finally, we wish to note that the transient sub-beamlet D_{*,j}^(k) has the same dimensionality and corresponding memory footprint as the dose vector d and exists only to calculate the jth direction of the gradient. During each iteration k, D_{*,j}^(k) is generated, used to calculate the partial cost derivative, and overwritten by the subsequent sub-beamlet as the bixels increment from j = 1,...,n. This effectively reduces the memory overhead of the DCM D, which scaled as the product of the number of dose voxels and beamlet pixels (m × n) in sequential MC optimization, to that of just the number of dose voxels (m × 1) in FASTMC.

Fig. 1. Proposed FASTMC method, which distributes the optimization throughout the dose calculation, allowing updates to the optimization gradient during incremental dose calculation updates. After determining the initial gradient for the first step, each bixel transports histories based on its gradient value multiplied by a user-defined constant. Subsequent gradient calculations are generated using data from the previous step.

2.C. Stochastic gradient descent and momentum
The gradient generated in FASTMC is highly stochastic due to the relatively few particles used at every iteration to generate each sub-beamlet. Following such a noisy gradient would result in inefficient MC transport in bixels that do not decrease the global cost. However, the gradient also has the property that it is an unbiased estimate of the true gradient. This property is similar to the gradients encountered in stochastic gradient descent (Appendix A). We appropriate the concept of momentum from SGD (Appendix B) as a key method in facilitating convergence, whereby the update rule from Eq. (2) becomes

∆b^(k+1) = β ∆b^(k) − η ∇_b F(d^(k)). (9)

Substituting Eq. (6) into Eq. (9), we obtain

∆b_j^(k+1) = β ∆b_j^(k) − η [F(d^(k)) − F(d^(k−1))] / ∆b_j^(k). (10)

We see in Eq. (9) that with the addition of the β < 1 momentum term in the update rule, which serves to add a fraction of the previous iteration's gradient to the current iteration, the most recent calculation of the gradient effectively becomes a geometrically decreasing sum of all of the previous gradients. The summation has the effect of preserving the persistent terms in the gradient while averaging out the stochastic terms, and the geometric falloff allows the gradient to be more heavily influenced by subsequent iterations over prior iterations. A final consideration is the generation of the gradient for the initial step, which cannot be calculated from any prior steps. This may be generated in a variety of manners, ranging from a uniform initial intensity across all bixels to a patient-specific map obtained via deterministic preprocessing. For this investigation, we initialized the gradient to be uniform and constant for all bixels that were ray-traced to the target volume, allowing the subsequent iterations to direct the gradient to the optimal direction.
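The noise-averaging property of the momentum term can be verified numerically. In the sketch below, a fixed "true" gradient is corrupted by heavy zero-mean noise; the momentum-accumulated step nonetheless settles around the scaled true gradient, −η/(1 − β) · ∇F. All values are illustrative:

```python
import numpy as np

# Momentum update db_k+1 = beta*db_k - eta*grad_k applied to a noisy but
# unbiased gradient.  Unrolling the recursion shows the step is a
# geometrically weighted sum of past gradients, so zero-mean noise averages
# out while the persistent component survives.  Values are illustrative.
rng = np.random.default_rng(2)
true_grad = np.array([1.0, -0.5, 0.2])
beta, eta = 0.9, 0.1

step = np.zeros(3)
steps = []
for k in range(2000):
    noisy_grad = true_grad + rng.normal(0.0, 2.0, 3)  # heavy zero-mean noise
    step = beta * step - eta * noisy_grad
    if k >= 500:                 # discard transient, keep steady-state steps
        steps.append(step.copy())

mean_step = np.mean(steps, axis=0)
# steady state: E[step] = -eta/(1-beta) * true_grad
expected = -eta / (1 - beta) * true_grad
```

Despite per-sample noise twice the size of the largest gradient component, the averaged step recovers the persistent direction, which is the behavior FASTMC relies on when each sub-beamlet contributes only a handful of histories.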

2.D. Gradient rescaling and renormalization
Directly coupling particle transport to increments in bixel weight means that care must be taken in how to implement the evaluation of negative terms in the gradient. A consequence of obviating the DCM is the lack of sufficient beamlet information to reweight individual bixel intensities relative to other bixels in lieu of additional particle transport. It is also undesirable to decrement bixel intensity using negatively weighted histories (which would serve the purpose of lowering bixel fluence and removing dose from voxels), as this would increase the relative uncertainty of the dose distribution. A final calculated plan consisting of a fraction f ∈ [0,1] of negative histories will have a relative dosimetric uncertainty σ_{d_{f>0}}/d_{f>0} that is larger than the uncertainty σ_{d_{f=0}}/d_{f=0} for a plan calculated using the same number of all-positive histories, by a factor derived in Appendix C. To ensure increasing dosimetric fidelity with iteration number, we employ a gradient rescaling and renormalization procedure, whereby in the presence of negative gradient terms, all elements of the gradient are increased by the absolute value of the most negative term, and the global fluence and dose distributions are rescaled by a constant factor λ < 1 that minimizes the new cost,

λ = arg min_{λ ∈ (0,1)} F(λ d^(k)). (11)

The mechanism of the gradient rescaling and renormalization is best illustrated in Fig. 2. At an arbitrary iteration in the optimization process [Fig. 2(a)], the gradient has been calculated to determine that certain beamlet indices must increment negatively. A gradient rescaling is performed by adding the minimum gradient value to all other gradient elements [Fig. 2(b)].

Fig. 2. Illustration of the gradient rescaling combined with renormalization. At an arbitrary step during the optimization process, negative increments must be made in certain bixel indices (a). For clarity, the beamlet intensities are illustrated to be uniform at this step. The magnitude of the most negative bixel step is added to the entire step (b), and the current beamlet weights are renormalized (c), a process which "steps backwards" in bixel intensity without increasing dosimetric relative error. The step is then taken from the renormalized coordinates, resulting in the appropriate relative bixel intensities (d).
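The rescaling and renormalization procedure described above can be sketched as two small operations: shift the proposed step so every increment is non-negative, then search for the scalar λ that minimizes the cost of the rescaled dose. The quadratic cost, step values, and dose values below are illustrative:

```python
import numpy as np

# Sketch of gradient rescaling + renormalization.  When some step elements
# are negative (bixels that should lose weight), shift the whole step up by
# the magnitude of the most negative element so all increments are >= 0,
# then rescale the accumulated dose/fluence by the lambda in (0,1) that
# minimizes the cost.  Cost and values are illustrative toys.
d_rx = 1.0
def cost(d):
    return float(np.sum((d - d_rx) ** 2))

step = np.array([0.4, -0.3, 0.1, -0.1])   # proposed db, some negative
shift = -step.min()                        # |most negative| element
step_pos = step + shift                    # all elements now >= 0
assert (step_pos >= 0).all()

# renormalize the running dose by a scalar lambda in (0, 1)
d = np.array([1.3, 0.9, 1.2, 1.1])         # current (over-covered) dose
lams = np.linspace(0.01, 1.0, 1000)
lam_best = lams[int(np.argmin([cost(lam * d) for lam in lams]))]
```

Because the renormalization multiplies dose and fluence by the same positive constant, it "steps backwards" in intensity without transporting negatively weighted histories, and hence without degrading the relative statistical uncertainty. For a quadratic cost the optimal λ also has the closed form (d · d_rx)/(d · d), which the grid search above approximates.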

2.E. FASTMC platform implementation
To evaluate this algorithm, the Monte Carlo transport toolkit Geant4 (Ref. 17) was used due to its flexible object-oriented framework, which allowed for adaptation of an in-house Geant4-based forward transport platform into a CMCTPO platform. The in-house forward transport user code has been benchmarked to show good agreement with EGSnrc (Ref. 18) and clinical TPS for patient geometries, 19 and served as a gold-standard benchmark for a GPU-based Monte Carlo dose calculation engine. 7 An "optimization manager" class was implemented, repurposing the forward transport user code to allow on-the-fly fluence optimization without user intervention. The optimization manager class oversees the initialization of particle transport, dose and fluence scoring, the calculation of the global cost, and updating the optimization trajectory. Upon initialization, the optimization manager class loads the planning structures, allocates memory for scoring dose and fluence, and executes a ray-tracing function which maps all bixels in the search space to their corresponding voxels. During each optimization iteration, sub-beamlet particle transport is initiated from each bixel. Following each transport step, the optimization manager records the updated dose and calculates the change in cost from each sub-beamlet. The change in plan cost divided by the number of histories transported, the aforementioned cost secant gradient element, is used by the optimization algorithm to bias the subsequent optimization iteration in the direction of lowest cost.
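The responsibilities assigned to the optimization manager can be summarized in a structural skeleton. The actual platform is a Geant4/C++ user code; the Python sketch below only mirrors the described control flow, and every method name is illustrative rather than taken from the source:

```python
# Structural sketch (Python, though the real platform is Geant4/C++) of the
# responsibilities the text assigns to the "optimization manager" class.
# All method names are illustrative, not from the source code.
class OptimizationManager:
    def __init__(self, structures, bixels):
        self.structures = structures            # planning-structure objectives
        self.dose = {}                          # scored dose (stub container)
        self.fluence = {b: 0.0 for b in bixels}
        self.visible = self.ray_trace(bixels)   # bixels mapped to target voxels

    def ray_trace(self, bixels):
        # map each bixel to the voxels it can reach (stubbed: keep them all)
        return list(bixels)

    def iterate(self, transport, gradient_step):
        """One iteration: transport sub-beamlets, score dose, record the
        change in cost per sub-beamlet, and bias the next iteration."""
        secants = {}
        for b in self.visible:
            cost_before = self.global_cost()
            transport(b, self.dose)             # sub-beamlet MC transport (stub)
            secants[b] = self.global_cost() - cost_before
        gradient_step(secants, self.fluence)    # update fluence weights

    def global_cost(self):
        return sum(obj(self.dose) for obj in self.structures)
```

In the real platform the transport callback is Geant4 particle transport and the cost secant is additionally normalized by the number of histories transported; the skeleton only shows where each responsibility sits in the iteration loop.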

2.F. Example cases
The FASTMC platform was used to optimize the dose distribution for two simple geometrical cases and one simplified patient CT case: (1) a uniform cubic target in a cubic water phantom, (2) a concave "horseshoe" target with an adjacent cylindrical organ at risk (OAR) in a cubic water phantom, and (3) an anonymized prostate case with a simple combined OAR of rectum and bladder, and NT consisting of the entire geometry. The simplicity and symmetry of the cubic case is used to illustrate key characteristics of the SGD algorithm, such as the impact of momentum, rescaling, and renormalization on the gradient descent. The horseshoe geometry is used to investigate the capability of the algorithm to generate highly conformal dose distributions, as well as examine the scaling characteristics of the algorithm with increasing number of beamlets. Finally, a prostate case is presented to illustrate the memory overhead and scaling of the FASTMC platform when considering clinical geometries and beam configurations. The cubic and horseshoe geometries consisted of a sensitive water target situated in a 25 × 25 × 25 cm phantom of water, with a voxel resolution of 1 × 1 × 1 cm. The cubic target was surrounded by 8 fields that were equally spaced around the geometry. The horseshoe target geometry was simulated for 18, 36, and 72 beams. The prostate geometry consisted of a 130 × 128 × 54 voxel patient CT geometry surrounded by 12, 18, and 36 beams. The dose objectives (for target, OAR, or NT) were assigned to the full patient geometry in the prostate case for a total of ∼3.15 × 10^5 optimization voxels. Details about each simulation geometry are provided in Table I.
Table I. Details about simulation geometries used in this work.

Simple quadratic and cubic cost functions were used for the PTV and OAR/NT (if present), respectively. For simplicity, all target volumes were prescribed to a nominal dose value of 100 Gy, and OAR and NT constraints (if present) were set to a nominal dose value of 80 Gy. For this investigation, the relative weights of the optimization problem were set to 100 for the PTV, 1 for the OAR, and 0.01 for the NT, to give dose conformity to the PTV a much higher priority than OAR sparing or NT dose, while allowing the impact of large search spaces to be evaluated. Because the number of histories transported in each sub-beamlet calculation scales with the magnitude of the gradient as ∆N^(t) ∼ γ∆b^(t) = γη∇F, near the minimum where the gradient becomes shallow, ∆N^(t) becomes small. To ensure that sufficient histories are transported in such conditions to reach the minimum, an empirically determined step-size of γη ∼ 5000 was chosen for the cubic and horseshoe geometries, and γη ∼ 50 000 for the prostate geometry. However, this step-size results in large initial steps that may point in suboptimal directions due to their magnitude and the stochasticity of the gradient. Therefore, in addition to the nominal step-size η, an empirically determined maximum limit of 20 × n histories per iteration was imposed on γη∇F for the cubic and horseshoe geometries, and 200 × n histories for the prostate case. This translates to an average of 20–200 histories per bixel being used to generate each sub-beamlet during each optimization iteration, with the actual number in each bixel depending on the optimization gradient. The simulation terminates when the gradient becomes sufficiently small such that ∆N^(t) = 0. The transport and optimization parameters used in this investigation are summarized in Table II.
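The weighted objective and the per-iteration history budget can be sketched as follows. The exact functional forms used by the platform may differ; this follows the stated quadratic/cubic split with the stated 100 : 1 : 0.01 weights, penalizes only OAR overdose above its constraint, and reads the history limit as a cap on the total per-iteration budget (one plausible reading):

```python
import numpy as np

# Sketch of the weighted objective: quadratic PTV penalty and one-sided cubic
# OAR/NT penalties, with relative weights 100 : 1 : 0.01.  Functional details
# are assumptions consistent with the text, not the platform's exact code.
w_ptv, w_oar, w_nt = 100.0, 1.0, 0.01
d_ptv_rx, d_oar_max = 100.0, 80.0   # nominal prescription / constraint (Gy)

def cost(d_ptv, d_oar, d_nt):
    ptv = w_ptv * np.sum((d_ptv - d_ptv_rx) ** 2)
    oar = w_oar * np.sum(np.clip(d_oar - d_oar_max, 0.0, None) ** 3)
    nt = w_nt * np.sum(np.clip(d_nt, 0.0, None) ** 3)   # monotonic NT penalty
    return float(ptv + oar + nt)

def histories_this_iteration(grad, gamma_eta, per_bixel_cap):
    """dN ~ gamma*eta*|grad| per bixel, capped so the total per-iteration
    budget stays at per_bixel_cap * n histories (assumed cap semantics)."""
    n = grad.size
    dN = np.rint(gamma_eta * np.abs(grad)).astype(int)
    budget = per_bixel_cap * n
    if dN.sum() > budget:                 # rescale to respect the cap
        dN = np.rint(dN * budget / dN.sum()).astype(int)
    return dN
```

With this structure, a plan at exactly the prescription and below all constraints has zero cost, and early iterations with large gradients are prevented from spending their entire history budget in one suboptimal direction.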

2.G. Time and memory benchmarking
Two key metrics were used for examining the scalability of FASTMC: the reduction in histories afforded by CMCTPO, and the computational cost of integrating optimization into the MC dose calculation. The reduction in histories transported in FASTMC relative to the sequential MCTPO paradigm was estimated by using the number of histories transported in the highest intensity FASTMC bixel as the number of histories that would have been transported through all bixels for generating unweighted beamlets. This value was multiplied by the number of bixels available in the solution space to obtain an estimate of the difference between FASTMC and the histories necessary to generate unweighted beamlets.
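This estimate reduces to a one-line calculation. Plugging in the cubic-case values quoted later in the text (1.5 × 10^6 total histories, ∼1500 histories in the busiest bixel, 2432 bixels) reproduces a reduction of roughly 59%, consistent with the reported 60.3% given that the quoted inputs are rounded:

```python
# Sketch of the history-savings estimate: the busiest FASTMC bixel's history
# count stands in for the per-bixel budget an unweighted beamlet calculation
# would need, multiplied by the number of bixels in the solution space.
def history_reduction(total_fastmc_histories, max_bixel_histories, n_bixels):
    unweighted_estimate = max_bixel_histories * n_bixels
    return 1.0 - total_fastmc_histories / unweighted_estimate

# Cubic-case values quoted in the text (rounded):
r = history_reduction(1.5e6, 1500, 2432)   # roughly 0.59
```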
The impact of integrating optimization and MC dose calculation in the CMCTPO method was compared with traditional sequential MCTPO methods using microsecond-resolution timers, which were integrated into the FASTMC platform. The timers recorded the total time spent transporting histories for dose calculation in comparison to the time spent evaluating the gradient and performing renormalization for optimization. Finally, the working memory of the FASTMC platform was logged during each iteration to evaluate the additional memory requirements of a CMCTPO platform. Since the sub-beamlet and gradient calculation steps are independent of the hardware platform (e.g., CPU/GPU/APU) and are inherently parallelizable, the scaling metrics are presented as relative percentages to allow ease in estimating scalability on other platforms.

2.H. Convergence error evaluation
To evaluate the impact of the gradient truncation error and stochasticity of the gradient on the quality of the final plan, a traditional sequential MCTPO investigation was also performed. In this second investigation, a DCM D with <1% dosimetric uncertainty was generated for the cubic target geometry using MC methods, with 10 6 histories per beamlet. The qualitative impact of the truncation error and stochasticity on both the convergence rate and the cost minimum achieved during iterative gradient descent was evaluated by calculating ∆b (k) via • an analytical form of the gradient from Shepard et al., 21 which served as the ground-truth calculation of the gradient and was used for evaluating convergence errors introduced by the backwards difference method and stochasticity; • a backward difference method, used to evaluate the impact of truncation error alone on convergence rate and convergence error in this scenario; • backward difference with momentum, to evaluate the impact of momentum in a noise-free gradient descent; • an unbiased, stochastic estimate of the gradient using the backward difference update rule, where stochastic uncertainty is introduced when incrementing the dose; this formulation was used to evaluate the impact of stochasticity on convergence rate and absolute convergence; • and the noisy-gradient backward difference with momentum, where we evaluate the impact of the momentum term when it is applied to a noisy gradient.
To simulate the stochasticity in the sub-beamlet dose d^(k) that would be present in Monte Carlo calculations, element-by-element Gaussian noise σ_{D*,j}^(k−1) was added to the ground-truth D used for dose calculation at every iteration, where the full-width half-maximum of the Gaussian noise for column j at iteration k was set proportional to the inverse square root of the step-size at that iteration, FWHM_j^(k) ∝ (∆b_j^(k))^(−1/2). The optimization parameters were again empirically determined to best illustrate the relative impact of the truncation error, stochasticity, and momentum, respectively.
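The inverse-square-root scaling mimics MC counting statistics: a smaller step transports fewer histories, so its dose column is noisier. A minimal sketch, with an assumed proportionality constant and toy column size:

```python
import numpy as np

# Sketch of the noise injection used in the convergence study: element-wise
# Gaussian noise on a ground-truth beamlet column, with width scaled as the
# inverse square root of that iteration's step (mimicking MC counting
# statistics).  The constant c and the column size are illustrative.
rng = np.random.default_rng(3)
D_col = rng.random(100)            # ground-truth beamlet column (toy)
c = 0.05                           # noise scale (assumed)

def noisy_column(D_col, db, rng):
    sigma = c / np.sqrt(db)        # width ~ 1 / sqrt(step-size)
    return D_col + rng.normal(0.0, sigma, D_col.size)

big_step = noisy_column(D_col, 1.0, rng)     # many histories: low noise
small_step = noisy_column(D_col, 0.01, rng)  # few histories: ~10x noisier
```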

RESULTS

3.A. Cubic target case
The results for the cubic target case are presented in Fig. 3. Due to the small step-size limit used in the investigation, fewer iterations are plotted than actually occurred during the optimization, for clarity. The global cost [Fig. 3(a)] can be seen to decrease as histories are transported, with the corresponding dose-volume histogram (DVH) [Fig. 3(b)] approaching the prescribed objectives. The simplicity and symmetry of the cubic target case allows the exploration of the optimization space to be illustrated in a simple contour map (Fig. 4). Two representative bixels (b 1 and b 2, the bixels that have the two highest intensities at the end of the optimization) are plotted along with the global cost for all of the optimization iterations. It can be seen that the optimization initially heads toward the direction of the minimum [Fig. 4(a)] and overshoots the minimum along a path tangential to the true minimum. A gradient rescaling and renormalization take place [Fig. 4(b)], decrementing the beamlet intensity according to the gradient, without an increase in dosimetric uncertainty. The solution rapidly converges to the minimum as the gradient becomes shallower [Fig. 4(c)]. The momentum term not only preserves the directionality of previous gradients but also their relative magnitudes. As the number of histories being transported depends on both the step-size as well as the magnitude of the momentum, the optimization continues to step beyond the local minimum and then renormalize, until the gradient decays to a small enough value that no more histories are transported, and the simulation terminates.

3.B. Horseshoe target with OAR case
Results for the horseshoe simulation can be seen in Fig. 5, where with increasing number of beams, the dose distribution [Figs. 5(a)-5(c)] becomes more uniform around the PTV, and the peripheral dose falls off more rapidly. The dose-volume histograms [Figs. 5(d)-5(f)] show the PTV dose becoming steeper as the number of beam angles increases. The 100 : 1 difference in weight between the PTV and OAR results in moderate sparing, with priority given to PTV coverage. A weight of zero for the OAR, however, would result in a semicircular dose distribution around the target, rather than the concave distribution.

3.C. Prostate with combined rectum + bladder OAR
Results for the prostate simulation can be seen in Fig. 6, where with increasing number of beams, the dose distribution [Figs. 6(a)-6(c)] within the target volume becomes more uniform around the PTV. The dose-volume histograms [Figs. 6(d)-6(f)] show little change to the PTV as the number of beam angles increases, which is likely due to the simple, relatively convex shape of the target volume, which was easily covered using 12 beams. The 1000 : 1 difference in weight between the PTV and OAR again results in priority given to PTV coverage.

3.D. Time and memory overhead metrics
For the cubic PTV simulation, a total of 1.5 × 10^6 histories were transported during the FASTMC optimization process. The highest intensity bixel transported approximately N = 1500 histories in total throughout the simulation, which when multiplied by the 2432 bixels with line-of-sight to the PTV gives a nominal 3.7 × 10^6 histories necessary to generate unweighted beamlets. For the cubic case, this results in a 60.3% reduction in the number of histories transported compared to what would be necessary using current methods to generate plans with similar nominal dosimetric uncertainty. Normalized to the transport time, the optimization overhead for the cubic case is reported in Table III. For the horseshoe target, a total of 2.9 × 10^6, 5.4 × 10^6, and 11.3 × 10^6 histories were transported for the 18, 36, and 72 beam simulations, respectively, with an estimated savings in transported histories of 74.2%, 78.4%, and 79.2% compared to sequential MCTPO methods. The computational overhead for optimization relative to the transport was 13.9%, 11.4%, and 11.8%, respectively. The sub-beamlet array memory overhead was 125 KB for all cases, while the gradient vector occupied 129.6 KB; these metrics are summarized in Table III.
Forward transport of the optimized plans was not implemented for this investigation. Since histories are not reweighted relative to each other, simulating the fluence map with an identical random number seed for any given simulation will result in the exact same dose distribution as the CMCTPO optimized solution.

3.E. Convergence error results
The results in Fig. 7(a) suggest that the effect of the O(∆b) truncation error in the gradient due to the backward difference approximation (dashed line) is negligible relative to the exact expression for the gradient (solid line). Introducing the momentum term in the noise-free backward difference case (solid line with square markers) produces jumps in cost, as the momentum causes the optimization to "overshoot" when large changes in the gradient occur, but does not change the final result. Figure 7(b) clearly illustrates the impact of stochasticity on the backward difference method (dotted line), which slows the convergence of the gradient descent while raising the minimum cost that may be achieved by the optimization. The addition of momentum (dotted line with square markers) in this case dramatically improves the convergence rate.

Fig. 7. Gradient descent implemented without noise (a) demonstrates the different convergence rates and minima achievable by evaluating the gradient exactly (solid line) and via backward differences (dashed line). The introduction of momentum (solid line, square markers) does not improve convergence in the absence of stochasticity. The introduction of stochasticity (b) to the backward difference gradient descent (dotted line) slows convergence and raises the minimum cost achievable with respect to the noise-free backward difference method (dashed line). In this case, the momentum term drastically improves the convergence rate (dotted line, square markers).

DISCUSSION AND CONCLUSIONS
This paper presents the proof-of-principle implementation of a truly concurrent Monte Carlo treatment plan optimization platform that performs gradient descent based fluence optimization during the MC transport, without the need to retain the costly DCM, by directly scaling increments of fluence and dose with particle transport. A novel gradient rescaling and renormalization method was developed to enable following negative gradient directions without increasing dosimetric uncertainty. Additionally, a momentum term was employed to allow efficient gradient descent under highly stochastic conditions, reducing the number of iterations necessary to achieve a solution.
Our results demonstrate this algorithm's capability to generate highly conformal dose distributions when utilizing concave dose objectives, with a dramatically reduced memory overhead of tens of MB, while retaining full dosimetric fidelity. Additionally, integrating fluence optimization into the MC dose calculation affords the further benefit of reduced computation time, by efficiently biasing particle transport toward beamlets that have higher intensities in the optimal result. We obtained a 50%-80% reduction in total histories transported in our investigation, with the additional benefit of a roughly fixed optimization time overhead of ∼8%-11% of the total computation time in all cases.
In contrast to FASTMC, biased beamlet calculation methods such as that of Li et al. operate through repeated iterations of full sequential MCTPO, which necessitates a memory-intensive DCM. As a consequence, dosimetric fidelity remains a concern, as dose kernel truncation, geometry resampling, and additional hardware-specific compromises (due to GPU memory constraints) had to be invoked. For the down-sampled fixed-angle IMRT case, the work of Li et al. demonstrated a promising ∼70%-80% reduction in overall treatment planning time, with optimization comprising ∼11% of their total overhead. However, work by their associates 20 examining rotational treatment configurations suggests that dosimetrically uncompromised optimization over the much larger search spaces found in clinical arc-type therapies will become a far greater computational burden than the mapping of those search spaces.
The methods used in this initial investigation demonstrated the potential for such a CMCTPO method to greatly reduce the memory overhead involved in fluence optimization, while improving dosimetric accuracy over current MCTPO methods and offering the additional benefit of reducing the number of histories that must be transported during the MC dose calculation. However, additional theoretical and practical considerations must be addressed before this method can be applied clinically. For example, the backward difference estimate of the gradient utilized in this investigation served to clearly demonstrate our proof-of-principle platform in conjunction with the pedagogical quadratic cost function for the target volume. The need to approximate the gradient, and the corresponding first-order truncation error associated with such an approximation, may be avoided entirely, however, by using alternative formulations of the target dose objective that offer an analytical expression for the gradient and do not necessitate a DCM. 21 For general cases, a more robust method for automatically determining the step-size and step-size limit would also be necessary, including considerations of the geometry resolution to estimate the number of interactions generated per sub-beamlet iteration and the corresponding stochasticity.
An additional consideration is a subtle consequence of the renormalization method when the optimization calls for bixels of zero intensity. In the current formulation, this would normalize all bixel intensities and deposited dose to zero, effectively resetting the optimization. Heuristic methods could be implemented to address this, such as ranking the bixels by their initial gradients after the first step and optimizing only the subset of bixels with the highest cost decrease per history. As the optimization continues and the gradient becomes shallower, additional bixels could be included in the optimization.
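The ranking heuristic described above could be sketched as follows. This is purely a hypothetical illustration of the idea, under our own assumed names and inputs; it is not part of the FASTMC implementation, and a real version would use the actual per-bixel gradients and transported-history counts.

```python
# Hypothetical sketch of the bixel-subset heuristic: rank bixels by their
# initial cost decrease per transported history and optimize only the top
# fraction, admitting more bixels later as the gradient flattens.
# All names and example values are illustrative assumptions.

def active_bixels(initial_gradients, histories_per_bixel, keep_fraction):
    """Return indices of the bixels with the steepest cost decrease
    per transported history (most negative gradient, fewest histories)."""
    benefit = [-g / n for g, n in zip(initial_gradients, histories_per_bixel)]
    n_keep = max(1, int(keep_fraction * len(benefit)))
    order = sorted(range(len(benefit)), key=benefit.__getitem__, reverse=True)
    return order[:n_keep]

grads = [-3.0, 1.0, -0.5, -2.0]   # stand-in initial cost gradients per bixel
hist = [10, 10, 1, 100]           # stand-in histories transported per bixel
subset = active_bixels(grads, hist, keep_fraction=0.5)
print(subset)  # bixel 2 has the best cost decrease per history, then bixel 0
```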
Of course, the future development of CMCTPO algorithms depends largely on the intended use case for this method. For example, if a platform such as FASTMC were used as a final dose calculation tool where aperture segmentation had already been performed, zero bixel weights would be a nonissue. Instead, the platform would offer the benefit of recalculating the planned dose with MC accuracy while performing a fine-tuning of the final dose distribution. The CMCTPO platform could also be used as a preprocessor for existing optimization algorithms, offering a memory-efficient method by which to determine optimal beam configurations in multiple-arc or 4π type deliveries.
In general, the implementation of a decaying step-size, a variable momentum factor, and additional variance reduction methods 22 borrowed from SGD theory may all improve the convergence rate of the SGD algorithm. Additionally, while FASTMC demonstrates the potential for tremendous reductions in computational time and memory overhead with unprecedented scalability, the SGD algorithm's capability to generate acceptable plans using clinically relevant objective functions 23 must be investigated and implemented in order to fully realize its benefits in a clinical setting. Traditional SGD has been proven to converge in mildly convex 24 and nondifferentiable 25 optimization problems, and extensive effort has been put forth toward improving the convergence rate in various scenarios. 22,26,27 Nevertheless, FASTMC offers a novel platform for exploring extremely large search spaces with a dramatically reduced memory footprint.
In conclusion, we have presented an entirely novel concurrent Monte Carlo treatment plan optimization platform for external beam photon radiotherapy. Our platform resolves many of the computational burdens of conventional sequential Monte Carlo treatment planning approaches, while providing the accuracy and fidelity needed to perform optimization at the voxel level. We therefore believe that this approach could significantly advance the development of next generation radiation therapy solutions while incurring minimal additional computational overhead.

ACKNOWLEDGMENTS
The authors would like to thank Professor Steven Wright for his feedback on a variety of mathematical and notational matters. The authors would also like to thank Yinghang Yang for his assistance with a variety of programming and syntax issues. Youming Yang was supported by the UW Radiological Training Grant No. T31CA0090206. Corey Zankowski and Michelle Svatos were supported by Varian.

CONFLICT OF INTEREST DISCLOSURE
The authors have no COI to report.

Notation: γ, the scaling factor relating the number of histories to the beamlet weight, γ b_j = N_j; ∆(x), an incremental change of a variable x, associated with a sub-beamlet increment; x^(k), the kth iteration of any variable x; {x_j}_{j=1}^{n}, the set of variables x_j for j = 1, ..., n.

APPENDIX A: STOCHASTICITY IN FASTMC
The gradient generated in fluence adjusting scalable transport Monte Carlo (FASTMC) has the potential to be highly stochastic due to the few particles used. It also has the property that, if the gradient were evaluated q times from the same coordinate b^(t) with different random number seeds and averaged, the central limit theorem states that for q → ∞ we recover our approximation of the gradient without statistical uncertainty, where again our estimate of the gradient remains approximate due to the truncation error. This ability to reduce the stochasticity of an unbiased estimate of the gradient via repeated recalculation subject to a random variable is identical to that of the gradients found in stochastic gradient descent (SGD) optimization problems.
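The averaging property stated above can be demonstrated numerically: an unbiased but noisy gradient estimate, re-evaluated q times and averaged, has its statistical spread shrink as 1/√q. The Gaussian noise model and constants below are illustrative assumptions for this demonstration; FASTMC's noise arises from particle transport, not from a Gaussian draw.

```python
# Numerical illustration of the central-limit averaging property: the
# spread of a q-fold averaged noisy gradient estimate scales as 1/sqrt(q).
# The Gaussian noise model is an assumption made for illustration only.
import random
import statistics

random.seed(1)
TRUE_GRAD = 2.0  # stand-in for the (truncation-limited) gradient value

def noisy_gradient(sigma=1.0):
    """One stochastic, unbiased evaluation of the gradient."""
    return random.gauss(TRUE_GRAD, sigma)

def averaged_gradient(q):
    """Average of q independent re-evaluations from the same coordinate."""
    return sum(noisy_gradient() for _ in range(q)) / q

# Spread of the averaged estimator over many trials, for two values of q.
spread = {q: statistics.stdev(averaged_gradient(q) for _ in range(2000))
          for q in (1, 100)}
ratio = spread[1] / spread[100]
print(ratio)  # ≈ sqrt(100) = 10
```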
Stochastic gradient descent is a popular method in large-scale machine learning 26 due to its ability to solve optimization problems involving large search spaces and deep networks. 27,28 In such applications, it becomes computationally expensive to calculate the full gradient over the large number of variables in the search space, and SGD is utilized instead; its main appeal is that its cost per iteration effectively scales independently of n. Because calculating the full gradient at every iteration is often computationally prohibitive in SGD problems, variance reduction techniques have been developed to allow optimization utilizing a "noisy" estimate of the gradient. Similarly, calculating an accurate gradient at every iteration for "concurrent" Monte Carlo treatment plan optimization (CMCTPO) is an inefficient procedure requiring the transport of a relatively large number of histories. Utilizing methods developed for SGD, it is then possible to perform gradient descent in our CMCTPO framework using a highly stochastic gradient calculated from the transport of very few histories.
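The per-iteration scaling argument above can be made concrete with a minimal SGD sketch. The least-squares cost, data, and step size below are our own illustrative assumptions, not the CMCTPO cost function; the point is only that each iteration touches a single training example j_k, so its cost is independent of n.

```python
# Minimal stochastic gradient descent for a least-squares cost
# F(b) = (1/n) * sum_j 0.5 * (a_j * b - y_j)**2.
# Each iteration evaluates the gradient of a single randomly drawn F_j,
# so the per-iteration cost does not depend on n. Illustrative assumptions.
import random

random.seed(0)
n = 10_000
a = [random.uniform(0.5, 1.5) for _ in range(n)]
y = [3.0 * a_j + random.gauss(0.0, 0.1) for a_j in a]  # true weight ≈ 3

b, eta = 0.0, 0.05
for k in range(20_000):
    j = random.randrange(n)             # draw j_k uniformly from {1,...,n}
    grad_j = (a[j] * b - y[j]) * a[j]   # gradient of F_j only: O(1) work
    b -= eta * grad_j                   # SGD update rule
print(b)
```

Note that the loop body never sums over all n examples; that is the property the text identifies as SGD's main appeal for large search spaces.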
As an aside, it is important to note that the nature of the stochasticity is fundamentally different between the two optimization problems. The uncertainty in the FASTMC gradient is due to the dosimetric uncertainty in D^(k−1)_{*,j} for every element of the full gradient, whereas the uncertainty in standard SGD methods is due to a subsampling of the full gradient, with no uncertainty necessarily assumed in any given gradient element. Therefore, it is possible that not all techniques employed in SGD may be applicable to CMCTPO.

APPENDIX B: MOMENTUM IN STOCHASTIC GRADIENT DESCENT
Expressing the SGD formulation of Johnson and Zhang 22 in our notation, a generalized gradient descent update rule may be written

$$ b^{(k+1)} = b^{(k)} - \eta_k \nabla F\!\left(b^{(k)}\right), \qquad F(b) = \frac{1}{n}\sum_{j=1}^{n} F_j(b), $$

where b is the weight, (k) is the iteration number, η_k is the learning rate (or step-size), and F_j is one training example out of the training set. Under SGD, the n dimensional gradient becomes

$$ \nabla F\!\left(b^{(k)}\right) \rightarrow \nabla F_{j_k}\!\left(b^{(k)}\right), $$

where at each iteration, k = {1, ...}, j_k is drawn randomly from {1, ..., n}. The equation may be more formally written as

$$ b^{(k+1)} = b^{(k)} - \eta_k\, g_k\!\left(b^{(k)}, \xi_k\right), $$

where ξ_k is a random variable and g_k(b^{(k)}, ξ_k) is the "unbiased estimator" of the full gradient for that iteration, with the property that the expectation value E[g_k(b^{(k)}, ξ_k) | b^{(k)}] = ∇F(b^{(k)}), or similarly, that the expectation E[g_k(b^{(k)}, ξ_k) − ∇F(b^{(k)}) | b^{(k)}] = 0. In plain language, in the limit of small step-size and a large number of iterations, the stochastic gradient is indistinguishable from the true gradient. To address the uncertainty associated with the estimator of the gradient, a momentum term is often introduced, 27,29,30 which modifies the update rule to become

$$ b^{(k+1)} = b^{(k)} - \eta_k\, g_k\!\left(b^{(k)}, \xi_k\right) + \beta_k\!\left(b^{(k)} - b^{(k-1)}\right), $$

where, along with the learning rate η_k, there is the addition of a momentum term β_k that is proportional to the difference between the current iteration weights b^{(k)} and the previous iteration's weights b^{(k−1)}. For a constant step-size η_k = η and momentum term β_k = β, the update rule with momentum may be simplified to

$$ b^{(k+1)} = b^{(k)} - \eta \sum_{\kappa=1}^{k} \beta^{\,k-\kappa}\, g_\kappa\!\left(b^{(\kappa)}, \xi_\kappa\right). $$

For momenta β < 1, the current update is comprised of geometrically increasing contributions of the gradients from the first iteration κ = 1 to the current iteration κ = k. The statistical uncertainty in the dose is nominally σ_{d_j}/d_j ∼ 1%-2% for a standard beamlet calculation of N_j ∼ 10^6 histories per bixel. For a sub-beamlet calculation utilizing ∆N^{(k)}_j ∼ 10-10^3 histories per iteration per bixel, the voxel dose uncertainty, which scales as σ_d/d ∼ √(1/N), can be in the range of σ_{∆d_j}/∆d_j ∼ 10%-1000%.
A full gradient descent based on the cost secant in such a manner would fluctuate wildly in bixel space, and following such gradients would not minimize the global cost efficiently.
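The stabilizing effect of the momentum term under gradient noise far larger than the signal can be demonstrated with a toy model. The quadratic cost, noise level, and step-size below are our own illustrative assumptions, not the FASTMC cost function; the update rule is the momentum form described in this appendix.

```python
# Toy demonstration of the momentum update rule
#   b_(k+1) = b_(k) - eta * g_k + beta * (b_(k) - b_(k-1))
# under heavy gradient noise, as in the 10%-1000% per-increment
# uncertainties quoted above. Cost and constants are illustrative.
import random

random.seed(2)
B_STAR = 50.0  # location of the cost minimum (assumed)

def noisy_grad(b, sigma=20.0):
    """Gradient of 0.5*(b - B_STAR)**2 plus large stochastic noise."""
    return (b - B_STAR) + random.gauss(0.0, sigma)

def descend(beta, eta=0.001, iters=500):
    """Gradient descent with momentum factor beta, starting from b = 0."""
    b_prev = b = 0.0
    for _ in range(iters):
        b_next = b - eta * noisy_grad(b) + beta * (b - b_prev)
        b_prev, b = b, b_next
    return b

plain = descend(beta=0.0)     # small steps: barely moves toward the minimum
momentum = descend(beta=0.9)  # momentum accumulates the gradient signal
print(abs(plain - B_STAR), abs(momentum - B_STAR))
```

With β = 0.9 the geometric accumulation of past gradients gives an effective step of roughly η/(1 − β), so the momentum run reaches the neighborhood of the minimum in the same number of iterations for which the plain run has covered only part of the distance.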

APPENDIX C: DOSE UNCERTAINTY, GRADIENT RESCALING, AND RENORMALIZATION
It has been shown in the literature that for Monte Carlo (MC) simulations, the dose uncertainty σ_d for a voxel with dose d generated from N histories can be represented as 31

$$ \sigma_d = C\sqrt{\bar{d}\, d} = C\,\frac{d}{\sqrt{N}}, $$

where d̄ = d/N is the dose-per-history, and C is a constant dependent on simulation-specific parameters such as the voxel shape and average particle energy. We define ∆d^(κ), where κ = 1, 2, ..., k, to be the change in dose due to a sub-beamlet increment calculated using n_κ histories, with ∆d^(κ) centered around a nominal value ∆d_0 with an associated uncertainty of

$$ \sigma_{\Delta d,\kappa} = C\sqrt{\bar{d}\,\Delta d_0} = C\,\frac{\Delta d_0}{\sqrt{n_\kappa}}. $$

We define d^(k) to be the dose in a voxel after k iterations that include both positive and negative fluence increment contributions generated using the same number of histories n,

$$ d^{(k)} = \Delta d^{(1)} - \Delta d^{(2)} + \Delta d^{(3)} - \Delta d^{(4)} + \cdots + \Delta d^{(k)}. $$
For identical step-sizes where each increment is independent of the others, the above equation simplifies to

$$ d^{(k)} = \sum_{\kappa=1}^{k} P(\kappa)\,\Delta d^{(\kappa)}, $$

where we define P(κ) ≡ ±1 to be a parity function, equal to positive or negative unity (with [P(κ)]² = 1), depending on the iteration number. Because uncertainty adds in quadrature and is independent of the parity function, we may rewrite the above as

$$ \sigma_{d^{(k)}}^{2} = \sum_{\kappa=1}^{k} \left[P(\kappa)\right]^{2} \sigma_{\Delta d,\kappa}^{2} = \sum_{\kappa=1}^{k} \sigma_{\Delta d,\kappa}^{2}. $$

For identical step-sizes, n_κ = n and we may write σ_{∆d,κ} = σ_{∆d}, which gives

$$ \sigma_{d^{(k)}} = \sqrt{k}\,\sigma_{\Delta d}. $$

Recognizing that ∆d_0 is a constant, we may rewrite Eq. (C5) as

$$ d^{(k)} = \Delta d_0 \sum_{\kappa=1}^{k} P(\kappa). $$

We may define f to be the fraction of all P(κ) terms that are negative, and write

$$ \sum_{\kappa=1}^{k} P(\kappa) = k\,(1 - 2f), $$

where we note that f is doubled because each negative P(κ) term also cancels out a positive term; for example, if half of all the terms were negative (f = 0.5), then the summation would be zero. We may rewrite the final dose as

$$ d^{(k)} = k\,\Delta d_0\,(1 - 2f). $$

Finally, evaluating the relative error, we obtain

$$ \frac{\sigma_{d^{(k)}}}{d^{(k)}} = \frac{\sqrt{k}\,\sigma_{\Delta d}}{k\,\Delta d_0\,(1 - 2f)} = \frac{1}{1 - 2f}\,\frac{\sigma_{\Delta d}}{\sqrt{k}\,\Delta d_0} = \frac{C}{(1 - 2f)\sqrt{k\,n}}, $$

which is the scaling we would expect for the relative uncertainty of all-positive increments, boosted by a factor of 1/(1 − 2f). This may be illustrated by considering one dose voxel d for iterations κ = 1, 2, 3, where two positive increments and one negative increment in dose are made with the same number of histories N. For each increment, we assume the dose changes by ∆d, which is centered around some nominal value ∆d_0 with a spread σ_{∆d} ∝ 1/√N, such that ∆d = ∆d_0 ± σ_{∆d}.
Examining the dose and relative error of the voxel as we take two positive increments and one negative, we have

$$ d^{(1)} = \Delta d = \Delta d_0 \pm \sigma_{\Delta d}, $$

$$ d^{(2)} = d^{(1)} + \Delta d = 2\Delta d_0 \pm \sqrt{2}\,\sigma_{\Delta d}, $$

$$ d^{(3)} = d^{(2)} - \Delta d = \Delta d_0 \pm \sqrt{3}\,\sigma_{\Delta d}. $$

Thus, we see that the relative error σ_{d+} for taking three forward steps is

$$ \frac{\sigma_{d+}}{d^{(3)}_{+}} = \frac{\sqrt{3}\,\sigma_{\Delta d}}{3\,\Delta d_0}, $$

whereas the relative error σ_{d±} for taking two steps forward and one back is proportional to

$$ \frac{\sigma_{d\pm}}{d^{(3)}_{\pm}} = \frac{\sqrt{3}\,\sigma_{\Delta d}}{\Delta d_0}, $$

where the threefold increase in uncertainty from taking a step backwards follows the 1/(1 − 2f) scaling, with f = 1/3.
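The 1/(1 − 2f) error boost derived in this appendix can be verified numerically with a toy Monte Carlo model: sum k increments of mean ∆d_0 and spread σ_{∆d}, a fraction f of them entering with negative sign, and compare the relative error of the sum against the all-positive case. The Gaussian increment model and constants below are illustrative assumptions, not transport code.

```python
# Numerical check of the 1/(1 - 2f) relative-error boost: with f = 1/3 of
# increments negative, the relative error of the summed dose should be
# about 3x the all-positive case. Toy model with assumed constants.
import random
import statistics

random.seed(3)
delta_d0, sigma, k, f = 1.0, 0.1, 300, 1.0 / 3.0

def final_dose(frac_negative):
    """Sum k Gaussian increments, a fraction of them with negative parity."""
    n_neg = int(round(frac_negative * k))
    signs = [-1] * n_neg + [1] * (k - n_neg)
    return sum(s * random.gauss(delta_d0, sigma) for s in signs)

def relative_error(frac_negative, trials=3000):
    """Empirical sigma_d / d of the final dose over many repeated sums."""
    doses = [final_dose(frac_negative) for _ in range(trials)]
    return statistics.stdev(doses) / statistics.mean(doses)

boost = relative_error(f) / relative_error(0.0)
print(boost)  # expected ~ 1 / (1 - 2f) = 3
```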