Comparative analysis of respiratory motion tracking using Microsoft Kinect v2 sensor

Abstract Purpose To present and evaluate a straightforward implementation of a marker‐less, respiratory motion‐tracking process utilizing Kinect v2 camera as a gating tool during 4DCT or during radiotherapy treatments. Methods Utilizing the depth sensor on the Kinect as well as author written C# code, respiratory motion of a subject was tracked by recording depth values obtained at user selected points on the subject, with each point representing one pixel on the depth image. As a patient breathes, specific anatomical points on the chest/abdomen will move slightly within the depth image across pixels. By tracking how depth values change for a specific pixel, instead of how the anatomical point moves throughout the image, a respiratory trace can be obtained based on changing depth values of the selected pixel. Tracking these values was implemented via marker‐less setup. Varian's RPM system and the Anzai belt system were used in tandem with the Kinect to compare respiratory traces obtained by each using two different subjects. Results Analysis of the depth information from the Kinect for purposes of phase‐ and amplitude‐based binning correlated well with the RPM and Anzai systems. Interquartile Range (IQR) values were obtained comparing times correlated with specific amplitude and phase percentages against each product. The IQR time spans indicated the Kinect would measure specific percentage values within 0.077 s for Subject 1 and 0.164 s for Subject 2 when compared to values obtained with RPM or Anzai. For 4DCT scans, these times correlate to less than 1 mm of couch movement and would create an offset of 1/2 an acquired slice. Conclusion By tracking depth values of user selected pixels within the depth image, rather than tracking specific anatomical locations, respiratory motion can be tracked and visualized utilizing the Kinect with results comparable to that of the Varian RPM and Anzai belt.


| INTRODUCTION
As radiotherapy treatments become increasingly precise, identifying and visualizing tumor movement during treatment becomes exceedingly important. Tumors located within the thorax and abdomen are significantly affected by motion induced with a patient's natural respiratory cycle. Accounting for this additional internal motion becomes paramount. One specific way to acquire and process this information is through the use of a 4DCT, by which the respiratory motion of the patient is tracked using a gating device. 1,2 The respiratory motion trace is processed in tandem with the CT acquisition and CT slices are binned to specific portions of the respiratory cycle. 3 This process then allows internal motion visualization of the tumor by use of external motion tracking. 4 Devices used to acquire the respiratory motion trace typically require some manner of physical device attached to the patient by way of a marker placed on the patient's surface or apparatus worn by the patient. However, these processes may require repositioning and multiple attempts to get an accurate respiratory motion trace due to irregular breathing and can restrict the respiratory motion tracking to one specific area on the patient, typically the lower abdomen. In this manuscript, the Microsoft Kinect v2 sensor was adapted to trace and record a patient's breathing cycle by way of a markerless process, doing away with any requirement for external hardware to be attached to the patient.
Developed and released by Microsoft in 2014, the Kinect v2 was created for the purposes of anatomical motion tracking by combining a high resolution color camera and a time-of-flight IR projector/sensor. Additionally, Microsoft released a software development kit (SDK) which is available free of charge. This kit contains sample programs which can facilitate access to various functions of the Kinect to software developers. 5,6 Allowing this open-sourced platform has enabled developers to create a vast number of applications within the medical community ranging from tracking and management of inter-and intra-fraction patient motion to gesture recognition within surgery suites for a hands-free computer interface. 7,8 The combination of a color camera with an IR projector/sensor to obtain depth information has allowed the Kinect to become a versatile and useful tool within the medical community.
Previous research into respiratory motion tracking using the Kinect utilized either the Kinect v1 or required a translational marker to be placed on the patient's surface or embedded within clothing worn by the patient, 9-11 similar to other respiratory tracking systems currently available for purchase. The latest version of the Kinect contains higher resolution sensors than the previous model, which helps remove the requirement for a translational marker to track respiratory motion. The removal of this requirement allows for a simpler process to be employed with less trial-and-error to obtain a useful respiratory trace.
In this manuscript, the Microsoft Kinect v2 sensor was adapted to trace and record a patient's breathing cycle by way of a marker-less process. The cost of utilizing a marker-less approach is the inability to guarantee tracking of a specific point on the patient's surface. This is due to the fact that the tracking process is done with respect to pixels in an image frame as opposed to fixed anatomical locations. Motion of the patient's surface during breathing will, in general, cause slightly different anatomical points within some connected surface area to pass through the tracked pixels within the image. This inherent difference between marker-based and marker-less tracking could theoretically lead to differences in recorded breathing traces between the methodologies. As a result, our evaluation of the Kinect v2 sensor as a motion-tracking device also includes, by necessity, an overarching evaluation of a general marker-less approach whereby the motion tracking is in some sense decoupled from the motion of singular points on the patient's surface.

| MATERIALS AND METHODS
In this study, a Kinect respiratory tracking process was created and compared against both the Varian RPM Respiratory Gating system (RPM) and the Anzai Gating system (Anzai). For comparison and accuracy measurements, RPM and Anzai were both applied to a subject at the same time, with the Kinect mounted above the subject.
RPM traces the movement of a proprietary marker placed on the subject's abdomen through the use of infrared sensors at a rate of 30 fps. 12 Anzai utilizes a belt strapped around the subject's abdomen which contains a pressure sensor to track the respiratory motion at a rate of 40 fps. 13,14 The Kinect returns depth values, in mm, for every pixel within the depth frame at a rate of 30 fps. 15 All three products acquired data simultaneously, with the RPM marker placed directly on top of the Anzai belt, and data were exported from all three for analysis.
Currently available gating procedures employ either a phase based or amplitude based binning process when incorporating respiratory motion. 16,17 As such, the traces recorded for all three products in this manuscript were analyzed with each process in mind.
With a phase based binning process, the period of one cycle is obtained and divided up into ten phase portions with bins of equal width. With an amplitude based binning process, the bins are divided up into percentages of the maximum and minimum amplitude throughout one cycle, typically calculated as 100%, 80%, 60%, 40%, 20%, and 0%. These percentages correspond to specific physical states of the breathing cycle (mid-inhalation, maximum exhalation, etc.). Given irregularities that can occur in a patient's breathing pattern which may cause shifts in the phase but not amplitude, many binning procedures are moving away from a phase based process in favor of an amplitude based process. 18 However, in this manuscript, both binning procedures are used to test the validity of data being recorded by the Kinect.
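The two binning schemes described above can be illustrated with a short sketch. It is written in Python for readability and is not the authors' C# code; the function names `phase_bin_times` and `amplitude_levels` are hypothetical, and it assumes the start and end times of a cycle and its minimum and maximum signal values are already known.

```python
def phase_bin_times(t_start, t_end, n_bins=10):
    """Return the start time of each equal-width phase bin within one cycle."""
    period = t_end - t_start
    return [t_start + period * k / n_bins for k in range(n_bins)]


def amplitude_levels(cycle_min, cycle_max, percents=(100, 80, 60, 40, 20, 0)):
    """Map each amplitude percentage to the signal value it corresponds to.

    0% is the cycle minimum and 100% is the cycle maximum.
    """
    span = cycle_max - cycle_min
    return {p: cycle_min + span * p / 100 for p in percents}
```

For a 4 s cycle, the phase bins start every 0.4 s regardless of the signal shape, whereas the amplitude levels depend only on the recorded minimum and maximum of that cycle.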
Calculation and identification of the local maximum and minimum for each breathing cycle (100% amplitude, and 0% amplitude, respectively) was implemented through a simple local comparison algorithm. To mitigate possible misidentification of per-cycle maxima and minima due to temporally small, noisy perturbations, each individual data point of the trace was compared to the 10 data points acquired before and after, allowing for 20 comparisons in total. If the data point in question was greater than or equal to the 20 points surrounding it in time, it was considered 100% amplitude for that breathing cycle. If the data point was less than or equal to the 20 points surrounding it in time, it was considered 0% amplitude. Similar to analyses performed in the clinic when acquiring respiratory traces, multiple values of 100% or 0% amplitude may be identified by the system for the same breathing cycle. As such, manual adjustment was required to remove duplicate local maxima or minima.
In order to obtain data for the respiratory trace, the Kinect v2's depth camera was utilized. The depth camera has a resolution of 512 × 424 and has the ability to detect distances ranging from 0.5 m to 4.5 m. 19 The sensor returns depth data for each pixel within the 512 × 424 frame in 1 mm increments. Rather than track movement associated with a specific location on the body and monitor depth changes as it moves across the frame, as would be done with a physical marker, the system is designed to track specific pixels from the depth image and record the depth values returned over time. Although different from the typical respiratory tracking processes, which track a specific location on the body, this manuscript investigates if both processes can produce the same respiratory trace with congruent results.
To begin the data collection process, the user manually selects 5-12 points anywhere on the patient for respiratory motion tracking.
Data collection duration is also selected by the user and the process can be stopped manually if needed. Each point has depth data continuously recorded during the acquisition process, with visual displays of each trace, and the program can choose the most accurate representation of the respiratory motion by calculating the largest difference between the maximum and minimum distances recorded for the points created. Additionally, as all traces during acquisition are saved, the user has the ability to view and select traces from points other than those chosen by the program in order to represent respiratory motion if so desired.
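The automatic choice of trace described above amounts to picking the point with the largest depth excursion. A possible Python sketch follows; the function name `select_best_trace` and the dictionary layout are hypothetical and do not reflect the authors' C# implementation.

```python
def select_best_trace(traces):
    """Return the label of the point whose depth values span the widest range.

    `traces` maps a point label to the list of depth values (mm)
    recorded for that point during the acquisition.
    """
    def depth_range(values):
        return max(values) - min(values)

    return max(traces, key=lambda label: depth_range(traces[label]))
```

Since every trace is kept in `traces`, a user could still inspect and select a different point afterwards, as the text describes.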
To ensure that the data collection process and GUI were as user friendly as possible, the body tracking capabilities of the Kinect were implemented. The Kinect software has the ability to detect when a human body has entered the frame of the camera and can differentiate between pixels associated with a body vs pixels belonging to the background. Once the body is recognized by the Kinect, the background can be removed from the image displayed allowing for an easy visualization of the patient. The advantage to utilizing this process is that the image displayed is aligned, pixel for pixel, exactly to the depth images generated. This allows selection of specific points on the patient to exact depth data generated by the depth sensor.
In order to reduce noise as much as possible from the depth values obtained for each pixel selected, a median filtration algorithm was implemented for data obtained within a specific frame. Yang et al. measured typical noise from the depth sensor to be less than 2 mm when the object was within a 1-2 m range from the Kinect. 20 Additionally, random fluctuations can act to produce a depth value of 0 or a value much greater than an expected depth. The median filtration algorithm implemented reduces this noise by creating a 7 × 7 grid of pixels around the pixel selected. Depth values from all 49 pixels are analyzed and the median of those pixels is used as the corrected value for the center pixel. This process enables noise filtration of the depth data without being affected by any outliers within the 7 × 7 grid.
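The median filtration step can be sketched as follows. This is an illustrative Python version, not the authors' C# code: the function name `median_depth` is hypothetical, the depth frame is assumed to be a row-major 2D list of millimetre values, and clipping the grid at the frame border is an assumption the text does not address.

```python
from statistics import median


def median_depth(frame, row, col, half=3):
    """Return the median depth (mm) of the 7 x 7 grid centred on (row, col).

    `half=3` gives three pixels on each side of the centre, i.e. 49
    pixels in total. Assumption: the grid is clipped at the frame edges.
    """
    height, width = len(frame), len(frame[0])
    values = [
        frame[r][c]
        for r in range(max(0, row - half), min(height, row + half + 1))
        for c in range(max(0, col - half), min(width, col + half + 1))
    ]
    return median(values)
```

A spurious reading of 0 or an implausibly large depth at one or two pixels leaves the median unchanged, which is why the median is preferred here over a mean.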
During each tracking session, the Kinect was mounted directly over the subject pointing down at an angle of roughly 45 degrees and was set at a height of roughly 0.75 m. Data acquisitions were performed on both a male and female subject for approximately 120 s and each was asked to breathe in a manner typical for the individual with no breath holds. Lachat et al. noted that the accuracy and constancy of depth values obtained from the Kinect requires a brief warm up period of approximately 30 min. 21 As such, the Kinect was allowed ample warm up time during setup and before data was acquired.
and 3(c) for Subjects 1 and 2, respectively. Here, the plot contains data comparing two products with each point on the plot having the X and Y coordinates calculated by the following:
$$X = \frac{t_A + t_B}{2}, \qquad Y = t_A - t_B$$

| RESULTS
The X coordinate of a point, $(t_A + t_B)/2$, represents the average time measurement for a specific amplitude percentage between two products ($t_A$ for product A, and $t_B$ for product B). The Y value, $t_A - t_B$, represents the difference between the time measurements from the two products being compared. In essence, the difference between two time measurements for a specific amplitude percentage (Y value) is plotted against the average of those same two measurements (X value). 22,23 The data analyzed here with the Bland-Altman approach only represents the data obtained from the amplitude binning process. This was done simply for clarity, as analysis for the phase based binning process would yield similar results.
With the Bland-Altman plots created in Fig. 4, the agreement between two products producing similar measurements lies with the percentage of values that fall within the span of the mean ± 1.96 × SD. Typically, two products can be shown to produce similar measurements if roughly 95% of the data within the plot falls inside this range (Table 1).
Lastly, the difference between the times obtained for each product within the amplitude and phase based binning process was calculated and the average difference across products for each percentage was calculated. Figure 5 displays the Interquartile Range (IQR) for the amplitude time differences by way of a Box and Whiskers plot for both subjects. Figure 6 displays the IQR for the phase time differences across each of the calculated bins utilizing similar Box and Whiskers plots for both subjects.
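The Bland-Altman quantities described above can be computed with a few lines of Python. This is a sketch under stated assumptions rather than the analysis code used for the figures: the function name `bland_altman` and the returned keys are hypothetical, and the sample standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import mean, stdev


def bland_altman(times_a, times_b):
    """Compute Bland-Altman coordinates and limits of agreement.

    `times_a` and `times_b` are paired time measurements (s) from two
    products for the same amplitude percentages.
    """
    means = [(a + b) / 2 for a, b in zip(times_a, times_b)]  # X values
    diffs = [a - b for a, b in zip(times_a, times_b)]        # Y values
    bias = mean(diffs)
    sd = stdev(diffs)  # assumption: sample standard deviation
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "limits": (lower, upper),
        "fraction_inside": inside,
    }
```

A `fraction_inside` value near 0.95 or above corresponds to the agreement criterion quoted in the text.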
The IQR becomes an important quantifier when analyzing the differences between traces as it indicates a range of time that specific percentages of amplitude and phase differ between products. A summary of IQR values can be found in Tables 2 and 3, with Table 3 containing the average mean time difference and standard deviation for each comparison.
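For reference, the IQR of a set of time differences can be obtained as sketched below. The function name `interquartile_range` is hypothetical, and the "inclusive" quartile convention of Python's `statistics.quantiles` is an assumption, since the text does not state which quartile definition was used; other conventions give slightly different values on small samples.

```python
from statistics import quantiles


def interquartile_range(time_differences):
    """Return Q3 - Q1 for a list of time differences (s)."""
    # Assumption: quartiles are interpolated with the "inclusive" method.
    q1, _, q3 = quantiles(time_differences, n=4, method="inclusive")
    return q3 - q1
```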
When analyzing traces with the amplitude based binning process for each breathing cycle, the IQR for the time differences between products was low overall, typically lower than 0.   Table 4 summarizes what minimal impact these IQR values would have during a 4DCT acquisition process.

| DISCUSSION
Although this analysis has shown the Kinect can produce similar traces to those of Anzai and RPM, the current iteration of respiratory tracking with the Kinect is not without its limitations. One issue encountered was in regard to the body tracking capabilities of the Kinect software. As the system was originally designed as a body-tracking device for gaming, the optimal position for recognition and
TABLE 3 Average time difference throughout trace between each product with 2 subjects. Values were averaged over all ten amplitude and ten phase bins per cycle created in the above analysis.
A second issue is with regard to gross patient motion during the tracking process. Without constant supervision of the image on the screen, the patient could move significantly and interrupt the respiratory tracking process. This can be overcome by implementing thresholds of maximum amplitude traced. For example, should a patient's typical breathing pattern involve a trough to peak amplitude value of ~20 mm, setting a threshold of ±10 mm would then alert the user that gross motion has occurred. Secondary to this process, the depth frame can be utilized to track gross motion across the entire frame.
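One way the amplitude threshold could work is sketched below in Python. The text does not specify how the ±10 mm tolerance is applied, so this sketch assumes it extends beyond the trough and peak of a baseline breathing pattern; the function name `gross_motion_indices` and its parameters are hypothetical.

```python
def gross_motion_indices(trace, baseline_min, baseline_max, tolerance_mm=10):
    """Return the sample indices at which gross motion would be flagged.

    `baseline_min` and `baseline_max` are the trough and peak depth
    values (mm) of the patient's typical breathing pattern.
    Assumption: a sample is flagged when it falls more than
    `tolerance_mm` outside that baseline range.
    """
    lower = baseline_min - tolerance_mm
    upper = baseline_max + tolerance_mm
    return [i for i, depth in enumerate(trace) if depth < lower or depth > upper]
```

With a baseline of 750 mm to 770 mm (a 20 mm trough to peak amplitude), any sample below 740 mm or above 780 mm would raise an alert.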

By saving an initial state of the patient and continually comparing it to the current state, the depth values within the frame can be compared and analyzed to detect where in the frame motion has occurred. This is a process currently being investigated by this institution and can easily be implemented at the same time as respiratory tracking to ensure that the user would be alerted if gross motion were to occur.
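The frame comparison idea can be illustrated with a simple per-pixel difference. Because the text describes this process as still under investigation, the sketch below is purely illustrative: the function name `moved_pixels`, the 10 mm threshold, and the 2D list representation of a depth frame are all assumptions.

```python
def moved_pixels(reference, current, threshold_mm=10):
    """Return (row, col) positions whose depth changed by more than the threshold.

    `reference` is the saved initial depth frame and `current` is the
    latest frame; both are row-major 2D lists of depth values (mm).
    """
    return [
        (r, c)
        for r, (ref_row, cur_row) in enumerate(zip(reference, current))
        for c, (ref_depth, cur_depth) in enumerate(zip(ref_row, cur_row))
        if abs(cur_depth - ref_depth) > threshold_mm
    ]
```

The returned positions indicate where in the frame motion occurred, so a cluster of flagged pixels away from the chest and abdomen would suggest gross motion rather than breathing.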

| CONCLUSION
Recording respiratory motion with the Kinect v2 by way of recording depth values for specific pixels on the depth image, rather than anatomical locations, has been shown to produce results comparable to those of the Varian RPM system and Anzai belt and is easily implemented. The ability to select multiple points on a patient to be used for respiratory tracking through the GUI allows for a unique and user-friendly setup. Without the need for physical hardware attached to the patient for tracking, points can be selected anywhere on the patient, including the area of the tumor, without interfering with a CT scan or radiation therapy.

CONFLICT OF INTEREST
The authors have no relevant conflicts of interest to disclose.