DVH Analytics: A DVH database for clinicians and researchers

Abstract In this study, we build a vendor‐agnostic software application capable of importing and analyzing non‐image‐based DICOM files for various radiation treatment modalities (i.e., DICOM RT Dose, RT Structure, and RT Plan files). Dose‐volume histogram (DVH) and planning data are imported into a SQL database, and methods are provided to manage, edit, view, and download data. Furthermore, the software provides various analytical tools for plan evaluations, plan comparisons, benchmarking, and plan outcome predictions. DVH Analytics is developed using Python, including libraries such as pydicom, dicompyler, psycopg2, SciPy, Statsmodels, and Bokeh for parsing DICOM files, computing DVHs, communicating with a PostgreSQL database, performing statistical analyses, and creating a web‐based user interface. This software is open‐source and compatible with Windows, Mac OS, and Linux. For proof‐of‐concept, a database with over 3,000 DVHs from a single physician's head & neck practice was built. From these data, differences in means, correlations, and temporal trends in dose to multiple organs‐at‐risk (OARs) were observed. Furthermore, an example of the predictive regression tool is reported, where a model was constructed to predict maximum dose to brainstem based on minimum distance from planning target volume (PTV) and treatment beam source‐to‐skin distance (SSD). With DVH Analytics, we have developed a free, open‐source software program to parse, organize, and analyze non‐image‐based DICOM data for use in a radiation oncology setting. Furthermore, this software can be used to generate statistical models for the purposes of quality control or outcome predictions and correlations.

the use of historical data can better determine if a plan is atypical.
Furthermore, it is conceivable that radiation toxicities may have stronger correlations to other DVH points or perhaps the combination of additional DVH points. In either case, a large database of treatment planning data enables more complex statistical analyses of larger datasets, performed more quickly and accurately than manually transcribing statistics of one plan at a time into a spreadsheet of predetermined thresholds.
To the best of our knowledge, the open-source treatment planning and evaluation tools currently available have been developed without an explicit database. The Matlab-based software known as CERR (Computational Environment for Radiotherapy Research) provides an open-source platform that is effective for prototype treatment planning and evaluation, especially for research purposes.
Furthermore, CERR is capable of working with patient files from DICOM or AAPM/RTOG archives, which makes data transfer between multiple platforms straightforward. 3 For analysis across multiple patients, the Matlab-based software DREES (dose-response explorer system) is an open-source extension of CERR which provides a data-driven analysis of treatment outcomes; DREES provides analytical tools such as fitting tumor control probability (TCP) and normal tissue complication probability (NTCP) curves, modeling and visualizing dose-volume and plan metrics, and estimating uncertainty in planning parameters. 4 The input data for DREES assume a Matlab-based data structure comprising DVH and outcome data, which can be a limitation when working directly with DICOM files. Another option, RadOnc, is an R package designed for radiation oncology and provides an extensive library of analytical tools. 5 However, similar to CERR and DREES, RadOnc does not provide a way to store, query, explore, or analyze a large, scalable database.
Currently, storage of patient data with open-source platforms developed on Matlab and R is file-based. For such individual file-based systems, making a query on a single parameter over multiple patients can involve opening, reading, and closing a large number of files, which adds excessive memory use and computational overhead. Because of this, scalability for large datasets is a concern for currently existing open-source platforms in radiation oncology. Ideally, a user should be able to access only the data of interest so as not to unnecessarily burden computational resources.
Moreover, current open-source platforms only allow data access to one user at a time. This may limit efficient use of clinical resources. To address these limitations, a SQL (Structured Query Language) database can be employed for data storage. With regard to size restrictions, the entire database is only restricted by the size of the available disk space on the computer or server hosting the database. More importantly, the SQL database also allows simultaneous multiuser interaction, which may improve efficiency of clinical resource usage. In essence, the SQL database is a fast and lightweight data storage system that can access only the queried data, without accessing large individual patient files in their entirety. Because of these benefits, the standard for scalable databases is SQL. While there are SQL-based software programs intended to create a database of DICOM data, such as DICOM Data Warehouse, those currently available are not specific to radiation oncology and do not provide statistical analyses, to the best of our knowledge. 6 Therefore, the aim of the proposed software is to provide an open-source platform with a long-term, scalable database and the statistical tools needed to explore and analyze a large set of data in a radiation oncology setting.
In the following, we develop a database and analytics platform that directly reads DICOM files and stores the parsed data in a SQL database. This tool provides visualizations and evaluation metrics for individual and multiple patients, with the aim of significantly reducing importing times for plans, while providing an overall clinical perspective of a large dataset. In Section 2, we discuss the data assumptions, database design, back-end computations of additional anatomical metrics, as well as front-end computations of plan evaluation metrics. Subsequently, in Section 3, we present the results as a demonstration of the viability of the tool using an example dataset and not as actual clinical findings.

2 | MATERIALS AND METHODS
A database of DVH and treatment planning data can provide insightful information at the start and after the completion of a course of radiation treatment. Figure 1 illustrates a proposed workflow including such a database into a radiation oncology clinic. Combined with clinical outcome data, a DVH database can significantly reduce data collection time and potential transcription errors when correlating dosimetric data to patient outcomes. Alternatively, the combination of DVH and treatment planning data can be used to detect anomalous data during the final chart check prior to the first day of treatment, which may have otherwise gone unnoticed. Lastly, historical DVH data can be used for comparison to potentially indicate how a draft plan can be improved at the beginning of the treatment
FIG. 1. We propose this workflow as an example incorporating a DVH Database into a radiation oncology clinic. The bottom row represents a typical workflow without the use of a DVH database, while the top row indicates which stages a DVH database may provide benefit.
However, DVH Analytics only needs operating system access to a file directory containing the files to be imported; it is up to the user to decide how to get the DICOM data into the directory.

2.A | Assumptions
The initial intent of this database was to capture treatment planning DVHs in a clean and easily searchable manner. It is expected that there is only one complete, composite dataset per course of treatment, including all boosts. That said, each prescription will be included in the prescription, plan, and beam data, but the DVH data will represent only the final composite treatment plan for each structure.
Furthermore, the software was designed to automate data collection by extracting all required data directly from DICOM files. In this study, we validated the import process on six different treatment planning systems: Philips Pinnacle 3 8.0 m through 9.10, Elekta Monaco 5.0, Brainlab iPlan 4.5, Raysearch Raystation 5.0 through 6.1, Oncentra Brachytherapy 4.5, and Varian Eclipse 10. Currently, the only planning system among this group that does not explicitly include prescription information in the DICOM files is Pinnacle 3 . Therefore, a Pinnacle 3 Script was generated to record prescription information into the names of points of interest (POIs) within the DICOM RT Structure file.
DVH Analytics provides a method to generate a region-of-interest (ROI) map for each physician. Although ROI categorization happens at the time of import, an admin user can initiate reprocessing without the need to reimport DICOM files. With only these three exceptions (i.e., prescription, treatment site, and ROI categorization), all other data are extracted from DICOM files without additional user input. However, it is still possible for some data to be incorrect or missing (e.g., physician, patient date of birth, simulation study date). Therefore, DVH Analytics provides an administrator view to query and edit any parameter after import. For users familiar with PostgreSQL, the database can be managed from the command line or any PostgreSQL-compatible database management software.

2.B | SQL database design
The primary data for DVH Analytics is organized into four tables: DVHs, Plans, Prescriptions, and Beams. These tables are linked by patient medical record number (MRN) and study instance unique identification (UID). The contents of these tables are listed in TPS, prior to export, beginning with "tx:," DVH Analytics will set the treatment site to the text that follows. For example, a POI name of "tx: Brain" will prompt DVH Analytics to set the treatment site to "Brain." This method is similar to that used to extract prescription information for plans exported from Pinnacle 3 .
The Prescriptions (Rxs) table contains the fraction group data.
Aside from the plan name, which is the same for each prescription, all data refer to the particular fraction group (e.g., initial, boost1, boost2, etc.). The Beams table primarily contains data specific to each beam, including beam energy min/max (proton beams have a range of energies), beam type (e.g., static, dynamic), scan spot count for proton plans, and gantry/collimator/couch information (i.e., start, stop, min, max, and range). Currently, all data in this table are applicable for linac or proton-based treatments. Importing data into DVH Analytics from brachytherapy plans will not result in Beams table data; however, all other SQL tables will be populated.
Finally, a catalog of imported DICOM files is maintained using a fifth SQL table. This table includes MRN, study instance UID, the postimport directory which contains the DICOM files, and the file names of the RT Plan, RT Structure, and RT Dose files used for import. If multiple instances of a DICOM file type are found, only the file with the latest timestamp will be used for import; however, all files with the same study instance UID will be collected into the user-specified import directory and further organized by the MRN.
This feature allows DVH Analytics to easily reimport data directly from DICOM files; it also allows for future development of DVH Analytics that may rely on imaging data.
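The linked-table design described above can be illustrated with a minimal sketch. The source uses PostgreSQL through psycopg2; the example below substitutes Python's built-in sqlite3 module so that it runs without a server, and every table name, column name, and data value is a hypothetical placeholder rather than the actual DVH Analytics schema. It shows how a query joined on MRN and study instance UID retrieves only the rows of interest.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database (assumption: sqlite3 is
# used here only to keep the sketch self-contained).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plans (mrn TEXT, study_instance_uid TEXT, tx_site TEXT);
    CREATE TABLE dvhs  (mrn TEXT, study_instance_uid TEXT,
                        roi_name TEXT, max_dose REAL);
    """
)

# Hypothetical placeholder rows, not data from the study.
conn.executemany(
    "INSERT INTO plans VALUES (?, ?, ?)",
    [("001", "1.2.3.1", "Head & Neck"),
     ("002", "1.2.3.2", "Head & Neck"),
     ("003", "1.2.3.3", "Lung")],
)
conn.executemany(
    "INSERT INTO dvhs VALUES (?, ?, ?, ?)",
    [("001", "1.2.3.1", "brainstem", 45.2),
     ("001", "1.2.3.1", "larynx", 62.0),
     ("002", "1.2.3.2", "brainstem", 51.0),
     ("003", "1.2.3.3", "heart", 30.5)],
)

# Tables are linked by MRN and study instance UID, so one query returns
# the brainstem maximum doses for head & neck plans without opening any
# per-patient files.
rows = conn.execute(
    """
    SELECT d.mrn, d.max_dose
    FROM dvhs AS d
    JOIN plans AS p
      ON d.mrn = p.mrn AND d.study_instance_uid = p.study_instance_uid
    WHERE p.tx_site = ? AND d.roi_name = ?
    ORDER BY d.mrn
    """,
    ("Head & Neck", "brainstem"),
).fetchall()
```

With a PostgreSQL back end, the same SQL would be issued through a psycopg2 cursor, with `%s` placeholders in place of `?`.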

2.C | Back-end computations
For the most part, there are explicit DICOM tags for the data contained in the DVH Analytics database. As DVH data often are not explicitly stored in DICOM files, DVH Analytics uses the dicompyler Python code to compute the DVH data. 8 In addition, DVH Analytics stores information for the purposes of ROI name management and computes the union of all PTVs for calculations determining minimum ROI-to-PTV distances and PTV overlap. Anatomical factors, such as distance between structures, can have a significant impact on treatment planning goals. In fact, factors such as distance between PTV and a surrounding OAR, as well as the volume overlap between PTV and OAR, have been identified as significant predictors of DVH goals. 16 In addition, radiobiological calculations are performed based on equivalent uniform dose (EUD) as described by Niemierko. 17
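As an illustration of an EUD calculation, the sketch below assumes the generalized power-law form commonly attributed to Niemierko, EUD = (sum_i v_i * D_i^a)^(1/a), where v_i is the fractional volume receiving dose D_i and a is a tissue-specific parameter. The text above does not state which formulation or parameter values DVH Analytics uses, so the function name and inputs here are assumptions.

```python
def generalized_eud(doses, volumes, a):
    """Generalized EUD from a differential DVH.

    doses   -- dose of each bin (Gy)
    volumes -- volume in each bin (any consistent unit)
    a       -- tissue-specific parameter (must be nonzero)
    """
    if a == 0:
        raise ValueError("the parameter a must be nonzero")
    total_volume = sum(volumes)
    if total_volume <= 0:
        raise ValueError("total volume must be positive")
    # Volume-weighted power mean of the bin doses.
    weighted = sum((v / total_volume) * d ** a for d, v in zip(doses, volumes))
    return weighted ** (1.0 / a)
```

Two sanity checks follow from the formula: a uniform dose returns that dose for any a, and a = 1 reduces to the mean dose.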

2.C.1 | ROI name management
One of the more difficult challenges of maintaining a meaningful DVH database is overcoming variations in ROI names. For example, a particular physician or planner may choose to name the left eye any of the following: L Eye, Orbit left, eye l, etc. The only certain way to catch all ROIs intended to be the left eye is to always name the ROI the same exact way. In practice, this does not happen when the treatment planning software allows the user to type in any ROI name. As a way to mitigate this, DVH Analytics provides a method to map any number of possible ROI name variations. Over time and with user input, this system becomes more robust, reducing the likelihood of missing a ROI categorization. From the Admin view, a user can view a list of all uncategorized ROIs. From this list, the user may tag the ROI as "ignore" so that it is removed from the list or add the ROI to the ROI map.
This mapping system uses two separate ROI name categories: institutional ROI and physician ROI. Institutional ROIs are names used to define a sample of DVHs across the entire database, whereas a physician ROI is curated for a particular physician's practice, which will either map to an institutional ROI or be left as uncategorized. This allows physicians the flexibility to create their own naming system as well as the ability to track more anatomically specific ROIs for their specialty. DVH Analytics provides a view of any selected ROI name as shown in Fig. 2, which illustrates another example of potential variations for the left cochlea.
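The two-level mapping described above might be sketched as follows. The dictionary layout, the normalization rule, the physician label "DR_A", and the ROI names are all hypothetical; they are not the file format or matching logic used by DVH Analytics.

```python
def normalize(name):
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").split())


# Hypothetical map: physician -> physician ROI -> institutional ROI
# (or None if left uncategorized) and the known name variations.
ROI_MAP = {
    "DR_A": {
        "eye_l": {
            "institutional": "eye_l",
            "variations": ["L Eye", "Orbit left", "eye l"],
        },
        "cochlea_l_prv": {
            "institutional": None,
            "variations": ["L cochlea PRV"],
        },
    },
}


def build_lookup(roi_map):
    """Flatten the map into {(physician, normalized name): (physician ROI, institutional ROI)}."""
    lookup = {}
    for physician, rois in roi_map.items():
        for physician_roi, entry in rois.items():
            institutional = entry["institutional"] or "uncategorized"
            for name in [physician_roi] + entry["variations"]:
                lookup[(physician, normalize(name))] = (physician_roi, institutional)
    return lookup


def categorize(lookup, physician, roi_name):
    """Return (physician ROI, institutional ROI), or uncategorized if unknown."""
    key = (physician, normalize(roi_name))
    return lookup.get(key, ("uncategorized", "uncategorized"))


LOOKUP = build_lookup(ROI_MAP)
```

Names that fall through to "uncategorized" correspond to the list an admin user would review and either ignore or add to the map.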
For tumor/target volumes (e.g., gross tumor volume (GTV), clinical target volume (CTV), and PTV), DVH Analytics records the DICOM information containing the structure type (e.g., PTV, Organ at Risk, External, etc.). It is recommended that these tags be appropriately defined prior to DICOM export. For plans with multiple PTVs, DVH Analytics will assume a naming scheme of PTV1, PTV2, PTV3, etc., of Pinnacle 3 at least 9.10 and earlier). However, DVH Analytics will tag any structure that begins with "ITV" as such in the database, regardless of the associated structure type in the DICOM file.

2.C.2 | Geometric computations
DVH Analytics provides a method to calculate the geometrical union of ROIs; this method is specifically applied to generate a combined PTV for the purposes of computing ROI distances to the combined PTV as well as PTV overlap. For convenience and computational efficiency, we employ the Python packages Shapely and SciPy. 9,10 Shapely provides a convenient way to perform geometric operations between two-dimensional polygons, specifically the calculation of intersections, differences, and unions. After converting the DICOM coordinates of a ROI into ordered sets of points, polygons representing the ROI are generated with Shapely. Per DICOM convention, if multiple polygons exist in a single slice (2D image), a subsequent polygon that exists inside the cumulatively generated polygon represents a subtraction of area from the cumulatively generated polygon (e.g., a ring structure). Likewise, a subsequent polygon outside the cumulatively generated polygon represents an island structure (e.g., delineating both left and right lungs in a single ROI).
With this understanding, the authors generated a combined polygon (i.e., a MultiPolygon class in the Shapely code) for the PTV, and separately, the OAR, accounting for any holes or islands.
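The hole and island convention described above can be sketched with Shapely as follows. This is one possible reading of the procedure, not the DVH Analytics source code; the function name and the example coordinates are assumptions.

```python
from shapely.geometry import Polygon


def slice_geometry(contours):
    """Combine the closed contours of one slice into a single geometry.

    contours -- list of contours, each a list of (x, y) points.
    A contour inside the geometry built so far is subtracted (a hole);
    any other contour is added (an island).
    """
    combined = None
    for points in contours:
        polygon = Polygon(points)
        if combined is None:
            combined = polygon
        elif combined.contains(polygon):
            combined = combined.difference(polygon)  # ring structure
        else:
            combined = combined.union(polygon)  # separate island
    return combined


# Hypothetical slice: a 10 x 10 square with a 2 x 2 hole, plus a 5 x 5 island.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(2, 2), (4, 2), (4, 4), (2, 4)]
island = [(20, 0), (25, 0), (25, 5), (20, 5)]
ring_and_island = slice_geometry([outer, hole, island])
```

The result is a MultiPolygon whose area is the outer square minus the hole, plus the island.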

PTV overlap
After the generation of the combined PTV, the intersection of the resulting MultiPolygon with the ROI is calculated for each slice.
The resulting areas of these intersections are multiplied by their respective slice thickness. Then, these volumes are summed to calculate the PTV overlap volume. The slice thicknesses are obtained from the z-coordinates of the slice of interest and an adjacent slice.
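The slice-by-slice overlap calculation above might look like the following sketch. It assumes that each structure is available as a dictionary from slice z-coordinate (mm) to a Shapely geometry, that the PTV and ROI share identical z-coordinates, and that the slice thickness is taken from the nearest neighboring ROI slice; these are simplifications for illustration, and the function name is hypothetical.

```python
from shapely.geometry import Polygon


def overlap_volume(ptv_slices, roi_slices):
    """Return the PTV/ROI overlap volume in mm^3.

    ptv_slices, roi_slices -- dicts mapping slice z (mm) to a Shapely geometry.
    """
    z_values = sorted(roi_slices)
    if len(z_values) < 2:
        raise ValueError("at least two ROI slices are needed for a thickness")
    volume = 0.0
    for i, z in enumerate(z_values):
        if z not in ptv_slices:
            continue  # no PTV contour on this slice
        # Thickness from the z-coordinates of this slice and an adjacent one.
        neighbor = z_values[i + 1] if i + 1 < len(z_values) else z_values[i - 1]
        thickness = abs(neighbor - z)
        area = ptv_slices[z].intersection(roi_slices[z]).area
        volume += area * thickness
    return volume


# Hypothetical structures on 3-mm slices: the squares overlap by 5 x 5 mm.
ptv_square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
roi_square = Polygon([(5, 5), (15, 5), (15, 15), (5, 15)])
ptv = {0.0: ptv_square, 3.0: ptv_square, 6.0: ptv_square}
roi = {0.0: roi_square, 3.0: roi_square}
overlap_mm3 = overlap_volume(ptv, roi)
```

Here two slices each contribute 25 mm² × 3 mm, giving 150 mm³ (0.15 cm³).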

Minimum ROI to PTV distances
A brute-force method of calculating all distances between points defining the PTV surface to all points defining the ROI surface is employed to compute the minimum ROI-to-PTV distance. The minimum distance between the PTV and the ROI is the minimum of all the distances computed. DVH Analytics also records the mean, median, and maximum of these distances to provide additional spatial context. This brute-force method can be very computationally expensive, particularly with straightforward methods of lists and for-loops in Python. To overcome this limitation, we employ the SciPy library, which includes a "spatial" module with a function to do these distance computations. 10 We observed more than an order of magni-
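A minimal sketch of the brute-force distance computation is given below, using `scipy.spatial.distance.cdist`, which returns the matrix of pairwise Euclidean distances between two point sets. The text does not name the specific SciPy function used, so the choice of `cdist` is an assumption. The sketch also assumes the statistics are taken over the full pairwise matrix, which is one literal reading of the description.

```python
import numpy as np
from scipy.spatial.distance import cdist


def roi_to_ptv_distances(roi_points, ptv_points):
    """Summarize all pairwise distances between ROI and PTV surface points.

    roi_points, ptv_points -- arrays of shape (n, 3) and (m, 3), in mm.
    """
    distances = cdist(roi_points, ptv_points)  # shape (n, m), Euclidean
    return {
        "min": float(distances.min()),
        "mean": float(distances.mean()),
        "median": float(np.median(distances)),
        "max": float(distances.max()),
    }


# Hypothetical surface points (mm).
roi_points = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 10.0]])
ptv_points = np.array([[3.0, 4.0, 0.0], [30.0, 40.0, 0.0]])
stats = roi_to_ptv_distances(roi_points, ptv_points)
```

Because `cdist` evaluates every pair in compiled code, it avoids the nested Python loops mentioned above, at the cost of holding an n × m matrix in memory.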

2.D | Main application view
The main application view is split into eight tabs: Query, DVHs, Rad Bio, ROI Viewer, Planning Data, Time-Series, Correlation, and Regression. The contexts of these tabs are described in the following subsections.

2.D.1 | Query
The initial view of DVH Analytics is a module for the user to design their query. As opposed to requiring the user to learn command-line syntax of PostgreSQL, DVH Analytics provides a series of dropdown menus grouped by "Selection Filters" and "Range Filters," which prompt the user for discrete and non-discrete data constraints, respectively. 12 These categories are listed in Table 2. Once a Selection Filter category from , the query will assume an "or" operator.
In addition, the user may define up to eight "Endpoints" to be tabulated. Each endpoint added allows the user to define a dosimetric (e.g., D2cc, D95%) or volumetric (e.g., V20%, V50Gy) point for all DVHs in the query. These values can be reported in absolute units of cm³ or Gy, as well as in a relative scale (relative to volume or prescription dose), as shown in Fig. 3(a).
Once the user has defined the desired sample based on any number of Range or Selection Filters, clicking the update button will retrieve all data stored in the database that fits the query. All information presented in the remaining tabs is based on this retrieval.
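The dosimetric and volumetric endpoints described above can be read from a cumulative DVH by interpolation. The sketch below assumes linear interpolation between bins, ascending doses, and non-increasing cumulative volumes; the function names and the example DVH are hypothetical, and the interpolation scheme used by DVH Analytics is not stated in the text.

```python
def volume_at_dose(doses, cum_volumes, dose):
    """V_dose: cumulative volume receiving at least `dose`."""
    if dose <= doses[0]:
        return cum_volumes[0]
    if dose >= doses[-1]:
        return cum_volumes[-1]
    for i in range(1, len(doses)):
        if dose <= doses[i]:
            frac = (dose - doses[i - 1]) / (doses[i] - doses[i - 1])
            return cum_volumes[i - 1] + frac * (cum_volumes[i] - cum_volumes[i - 1])


def dose_at_volume(doses, cum_volumes, volume):
    """D_volume: highest dose that still covers at least `volume`."""
    if volume > cum_volumes[0]:
        return doses[0]
    for i in range(1, len(doses)):
        if cum_volumes[i] < volume:
            # Here cum_volumes[i - 1] >= volume > cum_volumes[i].
            span = cum_volumes[i - 1] - cum_volumes[i]
            frac = (cum_volumes[i - 1] - volume) / span
            return doses[i - 1] + frac * (doses[i] - doses[i - 1])
    return doses[-1]


# Hypothetical cumulative DVH: dose in Gy, volume in percent.
doses = [0.0, 10.0, 20.0, 30.0, 40.0]
cum_volumes = [100.0, 100.0, 80.0, 20.0, 0.0]
```

For this example DVH, V20Gy is 80%, V25Gy is 50%, D50% is 25 Gy, and D95% is 12.5 Gy. Reporting on a relative scale would divide by the ROI volume or the prescription dose.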

2.D.2 | DVHs and planning data
DVH Analytics provides an interactive plot containing up to two separate interquartile ranges (IQRs) of user-defined DVH samples as well as the option to plot a single DVH from DICOM files located within the user-defined "review" directory, as shown in Fig. 3(b). The reviewed DVH is not included in the sample statistics calculations.
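An interquartile range for a sample of DVHs can be computed bin by bin, as in the sketch below. It assumes all DVHs in the sample have been resampled onto the same dose bins and uses NumPy's default linear percentile interpolation; the function name and the sample values are hypothetical.

```python
import numpy as np


def dvh_quartiles(dvhs):
    """Return (q1, median, q3) per dose bin.

    dvhs -- 2D array, one row per DVH and one column per dose bin.
    """
    dvhs = np.asarray(dvhs, dtype=float)
    q1, median, q3 = np.percentile(dvhs, [25, 50, 75], axis=0)
    return q1, median, q3


# Hypothetical sample of five cumulative DVHs (relative volume, %) on three bins.
sample = [
    [100, 80, 0],
    [100, 60, 0],
    [100, 40, 0],
    [100, 20, 0],
    [100, 0, 0],
]
q1, median, q3 = dvh_quartiles(sample)
```

The band between `q1` and `q3` is the region that would be shaded in an IQR plot, and a reviewed DVH can be overlaid without being added to `sample`.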

2.D.3 | ROI viewer
DVH Analytics provides a visualization of ROIs from a specified study instance UID (as filtered from the query and, subsequently, MRN and study date); this is illustrated in Fig. 3(c). This module processes the DICOM coordinates of the specified ROIs into polygons, which may be viewed two dimensionally, one slice at a time.

2.D.4 | Time-series plots
A time-series plot demonstrating trends across simulation dates is provided in the Time-Series tab, as illustrated in Fig. 3

2.E | Importing and processing times

3.A | Data validation
DVH calculations were validated against DVH data extracted from Pinnacle 3 using a script that stores the data into an ASCII file. ROI volumes and selected DVH points were manually recorded as displayed in Pinnacle 3. All DVH Analytics data were extracted from the CSV file generated when clicking the "download" button in the DVH Analytics application. These data were collected into a spreadsheet for plotting and tabulation to ensure independent validation of DVH Analytics against another DVH calculation method. DVHs for ITV, PTV, spine, left lung, heart, spine, and ribs were plotted, as shown in The calculations for ROI volume, PTV overlap, and minimum distance from ROI to PTV were validated by comparing these results to those computed in Pinnacle 3 with ROIs from a randomly selected plan (nonanatomical ROIs were omitted for brevity, e.g., ROIs for optimization). These data are reported in Tables A1-A3, located in the Appendix.
Volume calculations are performed with the code from dicompyler. 8 The data in Table A1 show a maximum absolute difference of 2.76 cm³. All absolute differences greater than 0.5 cm³ correspond to relative differences of typically 1-3%, at most 6.2%. Because dicompyler's volume calculation is dependent on the dose grid resolution (at the time of this study), these differences can be reduced further by calculating the dose grid with a finer resolution prior to DICOM export from the treatment planning system. The values reported in Table A1  does have functionality to calculate distances between two user-specified points. When possible, the distances reported in Table A3 are the minimum distance measured in any of the orthogonal planes (axial, sagittal, or coronal) of several measurements. In the few cases when the listed ROI could not be viewed in an axial, sagittal, or coronal plane with the PTV simultaneously, points were placed in the nearest corners of the ROI and PTV; the 3D distance between these two points was reported. Notably, all absolute differences greater than 2 mm correspond to relatively small ROIs comprising a small number of slices. Considering the plan selected for this analysis is based on a CT study with 3-mm thick slices, an absolute difference of 3 mm perpendicular to axial planes is not unexpected. The largest absolute difference was 3.4 mm.

3.B | Data exploration
While the authors' initial intent for the design of DVH Analytics was to develop a queryable database of DVHs, the collection of a large subset of the DICOM data other than DVHs originally meant for query definitions has added significant value. Simply plotting any of these data over time can reveal potentially invalid data, quality control metrics, or temporal variations (e.g., across a physician, the institution, treatment site, etc.).

3.B.1 | Seeking "Bad" data
The benefit of time-series plots is not exclusively individual patient QA or easy data aggregation for clinical outcome studies; time-series plots also provide a valuable method for seeking incorrect or incorrectly categorized data. For example, when plotting the H&N larynx volume data, it was observed that one larynx volume was more than double the average of the sample. Upon inspection, the outlier was actually due to the incorrect categorization of the ROI. Ostensibly, the name of the ROI was a misspelling of larynx. In fact, the ROI was poorly labeled; it was an expansion from the anatomically delineated ROI and used for planning purposes. This is clear evidence that automated categorization of ROIs is not without caveats and should serve as a warning to users. The authors recommend plotting and examining various variables in this fashion after importing new data as a time-series plot can easily demonstrate gross outliers.

3.B.2 | Quality control metrics with context
Observing temporal changes in data can provide valuable insight to build physician-specific profiles of typical dose constraints, indicate plans with atypical parameters, and even help correlate patient toxicities to dosimetric data. For example, Fig. 5(a) illustrates a change in beam monitor units (MU) of 75 plans spanning 11 years. All of these plans are H&N plans from a single physician in a single institution. All plans were generated using Pinnacle 3 and planned with step-and-shoot IMRT or VMAT delivery techniques. Figure 5

3.C | Data analysis
Pearson-R correlations between nine variables with a dataset of 88 patients are presented in Fig. 6(a) for brainstem and larynx data.  Table 3.
Because the brainstem is a serial structure, the correlation values observed between PTV distance and ROI dose may be of clinical significance in evaluating plan quality and/or OAR constraints. In this H&N dataset, the larynx mostly overlaps with the PTV, and hence, little correlation is observed. However, the brainstem never overlaps with the PTV, resulting in a strong negative correlation.
Multivariable linear regressions were performed, with the same dataset used to generate the correlation matrix in Fig. 6(a), using Statsmodels. 11 These results are tabulated in Table 3. The models indicate a significant correlation between the minimum and average PTV-to-brainstem distances and the maximum brainstem dose, which is consistent with the previously discussed correlation matrix. Interestingly, Model 1 summarized in Table 3 indicates a strong correlation with the maximum source-to-skin distance (SSD) of the treatment beams. However, it is important to appreciate that there is likely a more fundamental independent variable at play than SSD (e.g., laterality of PTV or beam isocenter, patient weight or size, etc.).
Model 2 also is significant, but reports a reduction in correlation.
The same two models applied to larynx data reported poor correlation (i.e., R² = 0.316 and R² = 0.038, respectively), demonstrating the need for context as these regression models for brainstem data are clearly not suitable for the larynx data.
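The correlation and multivariable regression steps can be illustrated with a short sketch. The study used Statsmodels; the example below substitutes NumPy's least-squares solver to keep dependencies minimal, and the data are synthetic values generated from an arbitrary linear rule, not results from the study. The function and variable names are hypothetical.

```python
import numpy as np


def fit_linear_model(predictors, response):
    """Ordinary least squares with an intercept.

    Returns (coefficients, r_squared), where coefficients[0] is the intercept.
    """
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(response, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefficients
    centered = y - y.mean()
    r_squared = 1.0 - (residuals @ residuals) / (centered @ centered)
    return coefficients, r_squared


# Synthetic example: maximum dose constructed to depend on the minimum
# PTV distance (cm) and the maximum SSD (cm) by an arbitrary rule.
min_ptv_distance = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
max_ssd = np.array([85.0, 90.0, 88.0, 92.0, 87.0, 91.0])
max_dose = 55.0 - 6.0 * min_ptv_distance + 0.1 * max_ssd

# Pearson correlation between distance and dose.
pearson_r = np.corrcoef(min_ptv_distance, max_dose)[0, 1]

# Two-variable linear model.
coefficients, r_squared = fit_linear_model(
    np.column_stack([min_ptv_distance, max_ssd]), max_dose
)
```

Because the synthetic response is exactly linear, the fit recovers the generating coefficients and R² is 1; real data would also call for the p-values and residual diagnostics that Statsmodels reports.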

4 | DISCUSSION
As with any database, maintaining its integrity is critical for useful implementation. There are a number of points-of-failure that DVH Analytics is not equipped to handle in an automated fashion. The methods below are suggestions from the authors to help mitigate these issues.

4.A | Planning vs anatomical structures
Arguably, the biggest flaw in a DVH database is that much of its data are based on subjective delineation of anatomical structures.
Furthermore, although with the best of intentions, it is not uncom-

4.B | Database gatekeeper
Although DVH Analytics provides automation methods for parsing data into a database, there is still some manual effort required by the user to maintain the integrity of the data quality. For instance,

4.D | Future research
Considering that radiation treatment is just one piece of cancer care for many patients, our next step is to seamlessly connect pertinent patient data from other treatment modalities. We are particularly interested in combining the DICOM data extraction and statistical tools currently available in DVH Analytics with information such as cancer staging, chemotherapy agents/prescriptions, surgical status, and clinical outcome data (including radiation induced toxicities).