Contextual-loss-based artifact removal method for CBCT images

Abstract

Purpose: Cone beam computed tomography (CBCT) offers advantages such as a high ray utilization rate, identical spatial resolution within and between slices, and high precision, and it is one of the most actively studied topics in international computed tomography (CT) research. However, its application is hindered by scatter artifacts. This paper proposes a novel scatter artifact removal algorithm based on a convolutional neural network (CNN), with the contextual loss employed as the loss function.

Methods: In the proposed method, contextual loss is added to a simple CNN to correct CBCT artifacts in the pelvic region. The algorithm learns a mapping from CBCT images to planning CT images. A total of 627 CBCT-CT pairs from 11 patients were used to train the network, and the proposed algorithm was evaluated in terms of the mean absolute error (MAE), average peak signal-to-noise ratio (PSNR), and other metrics. The proposed method was compared with other methods to illustrate its effectiveness.

Results: The proposed method removes artifacts (including streaking, shadowing, and cupping) in CBCT images while preserving key details such as the internal contours and texture of the pelvic region. Analysis of the average CT number, average MAE, and average PSNR indicated that the proposed method improves image quality. Test results on chest data indicated that the method can also be applied to other anatomies.

Conclusions: Although the CBCT-CT image pairs are not perfectly matched at the pixel level, the proposed method effectively corrects artifacts in CBCT slices and improves image quality. The average CT number of the regions of interest (including bone and skin) also exhibited a significant improvement. Furthermore, the proposed method can enhance the performance of downstream applications such as dose estimation and segmentation.


| INTRODUCTION
Although cone beam CT has great potential in clinical applications, scattered radiation degrades image quality, producing many artifacts in the images. [1][2][3] Artifacts (including streaking, shadowing, ringing, and cupping artifacts) are generally defined as the difference between the reconstructed value of the CT image and the true distribution of the attenuation coefficient of the object. In the published literature, the main correction methods can be divided into two types based on their processing approach. 4 The first is hardware-based: it prevents, to the greatest extent possible, scattered rays generated during the attenuation process from reaching the detector. Common methods include the air-gap method, collimator method, filter method, antiscattering grating method, and modulator-based method. 5,6 However, the additional hardware introduces operational difficulties into the CBCT system and increases the cost of the entire process. 7 The second type of correction method is digital image processing, which mainly estimates and removes the scattering distribution. 15 Satisfactory results were also obtained using the level set 16 and moving block 17 methods. More recently, attention has turned to convolution-based methods. For example, Zhao et al.
introduced free parameters into the convolution kernel to identify the optimal parameters, so that the model of the scattering potential and the convolution kernel could best fit the approximate estimate of the scattering profile of previously known image objects. 18 Baer et al. incorporated a physical scatter correction method into a convolution-based scatter correction algorithm. 19 Deep learning has become a popular method in the field of computer vision owing to its ability to learn complex models end-to-end. Inspired by the method proposed by Mechrez et al., 26 we added contextual loss to a simple five-layer CNN to correct CBCT artifacts in the pelvic region. The loss function consists of two parts, L_CX^t and L_CX^s: L_CX^t measures the loss between the generated image and the label image, while L_CX^s measures the loss between the generated image and the input image. Contextual loss plays a key role in optimizing the performance of the CNN. We chose to conduct this feasibility study on pelvic CBCT images, providing training data and ground-truth data to the network for supervised learning.
The remainder of this paper is organized as follows. In Section II, we describe the method used. The experimental results are presented in Section III. Finally, the discussion and conclusions are reported in Section IV.

| MATERIALS AND METHODS
The experimental method in this paper can be briefly summarized as shown in Fig. 1. Next, we will introduce each part of Fig. 1 in detail.

2.A | Registration
According to Fig. 1, registration is first performed after obtaining the original data. Image registration aligns an analysis image with a reference image using a geometric transformation that relates the two images. Medical image registration methods can be broadly categorized into rigid and nonrigid registration. Nonrigid registration is widely applied to handle large motion and interfraction variability in the chest and abdomen. In this study, we used a nonrigid grid registration method.
Organ movement can be pronounced in IGRT; for example, pelvic anatomy, which includes the prostate and rectum, can change during treatment. 27,28 To prevent large differences between the CBCT-CT training pairs from affecting the experimental results, we followed the method detailed in 29 to perform deformable image registration (DIR) on the pelvic region and subsequently generated the CBCT-CT training pairs required for the experiment. During DIR, the CBCT image served as the fixed image and the CT image as the moving image. Deformable registration reduces the structural differences between the CBCT-CT training pairs.
It should be noted that the number of slices and the slice thickness differ between the original CBCT and CT data. Although the algorithm proposed in this paper can be applied to misaligned data, it is also crucial for medical images to retain quantitative image values. Therefore, registration is necessary to match each patient's data before the data are input to the network. In this step, we mainly used DIR to establish slice correspondence and adjust the parameters; no large deformation was involved.
However, mismatches still exist between the CBCT-CT training pairs following registration.

2.B | Data set
Because the training of convolutional networks is inseparable from the data set, data-set generation determines whether the trained model can adequately represent the sample space. We used patient pelvis data for training in the proposed clinical application. Specifically, patients were required to undergo a CT scan of the pelvic region before the start of IGRT. During subsequent radiotherapy, the patients underwent CBCT scans so that the pelvic area could be monitored in real time. Therefore, our data were CBCT and CT images obtained from an IGRT system.
The original data came from 11 patients. The CT images are 512 × 512, while the CBCT data comprise six groups of 384 × 384 images and five groups of 512 × 512 images. Considering the small sample size, we used data augmentation techniques such as image rotation, obtaining 2179 CBCT slices and 2036 CT slices. For CBCT, the slice thickness is usually 3.0 mm and the pixel size is 0.8789 × 0.8789 mm². For CT, the slice thickness is 2.5 mm or 3.0 mm and the pixel size is 0.9766 × 0.9766 mm².
The above original images were preprocessed by DIR to generate 627 CBCT-CT pairs, which was the data set used in our experiment. It should be pointed out that the data set is a 2D data set.
During the training process, we randomly selected 499 pairs of CBCT and CT images as the training set, 64 pairs as the validation set and 64 pairs as the test set. The size of each image was 512 × 512. Among them, the CBCT images were used as the input images, whereas the CT images were used as the label images.
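The random partition described above can be sketched as follows. This is an illustrative NumPy sketch only — the seed and the permutation-based implementation are assumptions, not the authors' code.

```python
import numpy as np

def split_pairs(n_pairs, n_train=499, n_val=64, n_test=64, seed=0):
    """Randomly partition CBCT-CT pair indices into train/val/test sets."""
    assert n_train + n_val + n_test == n_pairs
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pairs)  # shuffle all pair indices once
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# 627 registered pairs -> 499 training, 64 validation, 64 test
train_idx, val_idx, test_idx = split_pairs(627)
```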
However, a complete correspondence between the registered CBCT-CT pairs was still not achieved, and the slightly misaligned input-label image pairs render pixel-wise loss functions unsuitable for training.

2.C | Contextual loss
The contextual loss function is well suited to slightly misaligned data. Its main idea is to treat an image as a collection of features and to determine the similarity between images by measuring the similarity between their features. This loss function tolerates local deformation of the image to a certain extent, so the requirement for pixel-level alignment is relaxed. In addition, the loss function constrains local features, which enables it to operate on regions with similar semantics: it first finds similar features within these semantically similar regions and matches them, then integrates the context of the entire image, representing the similarity between images by the similarities between the matched features. This process can be divided into the following steps:

2.C.1 | Feature extraction
As shown in Fig. 1, the input image (CBCT image) was sent to the CNN network to obtain a preliminary generated image. Next, the input image, generated image, and label image (three-dimensional) were sent to the VGG19 network (proposed by Oxford's Visual Geometry Group) for feature extraction.
In this study, we used a VGG19 network pretrained on ImageNet 30 as the extractor. The pretrained VGG network takes three-channel images as input, while CT images are grayscale; we therefore duplicated each CT image into three channels before feeding it into the network. The VGG19 network contains 16 convolutional layers followed by 3 fully connected layers. The convolutional layers whose features were used to calculate the loss function are described later.
The VGG19 network structure used in this study is shown in Fig. 2.
Let the source image s and target image t be the two images to be compared, and let s_i and t_j be the features obtained after s and t, respectively, are passed through VGG19.
Then, we can represent each image as a set of features, namely S = {s_i} and T = {t_j}. Furthermore, we assume |S| = |T| = N; when |S| ≠ |T|, N features are sampled from the larger set. N represents the number of high-dimensional points (features).
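The set-of-features representation can be sketched as follows: a feature map of shape (H, W, C) becomes N = H·W feature vectors of dimension C, and the larger of two sets is randomly subsampled so that |S| = |T|. This is a NumPy sketch; the uniform random sampling strategy is an assumption.

```python
import numpy as np

def to_feature_set(fmap):
    """Flatten an (H, W, C) feature map into N = H*W
    C-dimensional feature vectors."""
    h, w, c = fmap.shape
    return fmap.reshape(h * w, c)

def match_sizes(S, T, seed=0):
    """If |S| != |T|, subsample the larger set so both contain
    the same number N of features."""
    rng = np.random.default_rng(seed)
    n = min(len(S), len(T))
    if len(S) > n:
        S = S[rng.choice(len(S), n, replace=False)]
    if len(T) > n:
        T = T[rng.choice(len(T), n, replace=False)]
    return S, T
```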

2.C.2 | Similarity between features
Next, we define the similarity between features from a mathematical perspective. Contextual loss is based on the cosine distance. Let d_ij denote the cosine distance between features s_i and t_j:

d_ij = 1 − (s_i · t_j) / (‖s_i‖ ‖t_j‖).

When d_ij ≪ d_ik, ∀k ≠ j, we consider features s_i and t_j to have similar contexts. To simplify the calculation, the cosine distance is normalized:

d̃_ij = d_ij / (min_k d_ik + ε),

where we fixed ε = 1e−5. Using an exponential operation, we transformed the distance into a similarity:

w_ij = exp((1 − d̃_ij) / h),

where h > 0 is a bandwidth parameter; we fixed h = 0.5.

Finally, we used a scale-invariant version of the normalized similarity to define the contextual similarity between the features:

CX_ij = w_ij / Σ_k w_ik.
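These feature-level definitions can be sketched in NumPy as follows, assuming S and T are already flattened feature sets of shape (N, C) and (M, C). The plain, uncentered cosine distance is an assumption based on the text; this is a sketch, not the authors' implementation.

```python
import numpy as np

def cx_matrix(S, T, eps=1e-5, h=0.5):
    """Feature-level contextual similarity CX_ij between feature
    sets S (N, C) and T (M, C)."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    d = 1.0 - Sn @ Tn.T                                # cosine distance d_ij
    d_norm = d / (d.min(axis=1, keepdims=True) + eps)  # normalized distance
    w = np.exp((1.0 - d_norm) / h)                     # similarity w_ij
    return w / w.sum(axis=1, keepdims=True)            # scale-invariant CX_ij
```

Each row of the result is normalized, so CX_ij can be read as how strongly feature s_i "chooses" feature t_j among all candidates.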

2.C.3 | Similarity between images
We find the feature s_i that is most similar to each feature t_j to form a match between features, as shown by the arrows in Fig. 3; the contextual loss can be regarded as the weighted sum of the arrows. Because the similarity is defined through ratios of distances, it is robust: if s_i is not similar to t_j, CX_ij will be low regardless of the absolute distance between s_i and t_j, whereas if s_i and t_j are similar, CX_ij will be high even if they are not in corresponding positions.
We consider a pair of images to be similar when most features of one image can find similar features in another image.
We can mathematically define the contextual similarity between the images as

CX(S, T) = (1/N) Σ_j max_i CX_ij,

where CX_ij represents the similarity of features s_i and t_j. When an image is compared with itself, the feature similarity is CX_ii = 1, which implies CX(S, S) = 1. In contrast, when the feature set of one image differs completely from that of the other, CX_ij = 1/N ∀i, j, so CX(S, T) ≈ 1/N → 0.

2.C.4 | CX loss function
In summary, the loss function can be expressed as

L_CX(s, t, l) = −log(CX(φ^l(s), φ^l(t))),

where φ represents the VGG19 network, and φ^l(s), φ^l(t) represent the feature maps of images s and t extracted from layer l of network φ, respectively.
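Putting the pieces together, the image-level similarity and the resulting loss can be written as a self-contained NumPy sketch. The uncentered cosine distance is again an assumption; this is not the authors' code, and feature extraction with VGG19 is assumed to have happened elsewhere.

```python
import numpy as np

def cx_loss(S, T, eps=1e-5, h=0.5):
    """L_CX = -log CX(S, T), where CX(S, T) = (1/N) sum_j max_i CX_ij
    and CX_ij is the feature-level contextual similarity."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    d = 1.0 - Sn @ Tn.T                                # cosine distance
    d_norm = d / (d.min(axis=1, keepdims=True) + eps)  # normalized distance
    w = np.exp((1.0 - d_norm) / h)                     # similarity
    CX_ij = w / w.sum(axis=1, keepdims=True)           # scale-invariant form
    cx = CX_ij.max(axis=0).mean()                      # CX(S, T)
    return -np.log(cx)
```

Comparing a feature set with itself gives CX(S, S) ≈ 1 and hence a loss near zero, matching the self-similarity property stated above.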

2.D | The proposed loss function
We trained a network G to map a given source image s to an output image G(s). For G, we used a five-layer CNN with adaptive dimensions: when the input image width ≥ 128, the dimension was set to dim = 512; otherwise, dim = 1024. In this experiment, the input image size was 512 × 512, so the initial width was 512; the width was then halved repeatedly until width = 4, at which point the input image is loaded. The loss L_CX(G(s), t, l) measures the similarity between the generated and target images, whereas L_CX(G(s), s, l) measures the similarity between the generated and source images.
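The adaptive-width rule above is described only loosely; one reading can be sketched as follows. The halving loop and the dim threshold come from the description, while everything else (in particular how the widths map to layers) is an assumption.

```python
def width_schedule(input_width):
    """Sketch of the adaptive sizing described above: the feature
    dimension depends on the input width, and the spatial width is
    halved repeatedly until it reaches 4."""
    dim = 512 if input_width >= 128 else 1024
    widths = []
    w = input_width
    while w >= 4:
        widths.append(w)
        w //= 2
    return widths, dim

# For the 512 x 512 inputs used in the experiment:
widths, dim = width_schedule(512)
```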
The loss function used in the experiment can be defined as

L(s, t) = L_CX(G(s), t, l_t) + λ L_CX(G(s), s, l_s),

where l_s = conv4_2 yields the content feature and l_t yields the style feature. In the experiment, we randomly sampled layer conv2_2 down to 80 × 80 features to obtain better results while reducing the required computational memory. We found that the number of randomly sampled features can be critical to the experiment; a specific analysis is provided later.
Here, λ is a constant that controls the ratio of the two loss terms; we set λ = 5 in the experiments. It is noteworthy that the parameters above were obtained through multiple experiments and were found suitable for the experiments discussed in this study.

2.E | Network training
The purpose of training the network was to obtain a mapping from CBCT images to planning CT images that improves image quality, as illustrated in Fig. 4. Depending on the results, the network parameters, feature sample size, and the ratio of the two loss terms L_CX^t and L_CX^s were adjusted accordingly, and training was repeated until the artifacts were effectively corrected.
We used the TensorFlow library in a Python environment on a GeForce GTX 1080 Ti GPU. The Adam optimizer and the ReLU nonlinear activation function were used in the experiment. Following common practice in the deep learning community, 31 each convolutional layer employed a small 3 × 3 kernel. We set the learning rate to 1e−4, the number of epochs to 300, and the input and output image sizes to 512 × 512.
The step size was set to 2 to achieve accurate convergence.

Fig. 3. Feature matching.

2.F | Evaluation
The main difference between the method proposed in this paper and the method in 21 is that we use contextual loss, whereas the method in 21 uses MSE loss. Perceptual loss can also be applied to networks with mismatched data: Kupyn et al. 32 presented the DeblurGAN network, based on a conditional GAN and perceptual loss, to restore images. We compared the proposed method with both of these methods. In the Results section, we use pelvic data for statistical and visual analysis.
We also used other loss functions for comparison, such as L2 loss and perceptual loss. The formula is expressed as follows.
L_p(x, y, l_p) = ‖φ^{l_p}(x) − φ^{l_p}(y)‖_1   (9)

where φ represents the VGG19 network, and φ^{l_p}(x), φ^{l_p}(y) represent the feature maps of images x and y extracted from layer l_p of network φ, respectively.
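The perceptual loss of Eq. (9) reduces to an L1 distance between feature maps once the features have been extracted. A minimal NumPy sketch, assuming the VGG19 feature extraction happens elsewhere:

```python
import numpy as np

def perceptual_l1(feat_x, feat_y):
    """Eq. (9): L1 norm of the difference between two feature maps
    taken from the same VGG19 layer l_p."""
    return np.abs(feat_x - feat_y).sum()
```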
In this study, we calculated the mean absolute error (MAE), peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and average CT number to quantify the results.
MAE measures the difference between the evaluation image and the CT image:

MAE = (1 / (m × n)) Σ_{i,j} |y(i, j) − ŷ(i, j)|,

where m × n is the total number of pixels, y(i, j) is the value of the CT image at pixel (i, j), and ŷ(i, j) is the value of the evaluation image at pixel (i, j).
The input of PSNR is (t, G(s)), where t and G(s) are the target and predicted images, respectively:

PSNR = 10 · log10((2^n − 1)² / MSE),

where n is the number of bits per sample. For natural images, n = 8 and the maximum pixel value is 255. Because the pixel range of a medical image is larger, the corresponding n must be adjusted for the calculation.
Structural similarity (SSIM) is an index that measures the similarity of two images:

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)),

where μ_x and μ_y are the means of x and y, σ_x² and σ_y² are their variances, and σ_xy is their covariance. c_1 = (k_1 L)² and c_2 = (k_2 L)² are constants used to maintain stability, L is the dynamic range of the pixel values, k_1 = 0.01, and k_2 = 0.03.
The standard deviation represents the dispersion of pixel gray values about the mean; the larger the standard deviation, the more dispersed the gray-level distribution and, in this context, the better the image quality:

std = sqrt((1 / (m × n)) Σ_{i,j} (y(i, j) − u)²),

where m × n is the total number of pixels, y(i, j) is the value of the evaluation image at pixel (i, j), and u is the mean.
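The four scalar metrics above can be sketched in NumPy as follows. Note that `ssim_global` is a single-window simplification of SSIM computed over the whole image; standard windowed SSIM implementations yield different values.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error over all m*n pixels."""
    return np.abs(y - y_hat).mean()

def psnr(t, g, n_bits=8):
    """PSNR with adjustable bit depth: n_bits = 8 for natural images
    (max value 255); medical images need a larger n_bits."""
    mse = np.mean((t - g) ** 2)
    max_val = 2 ** n_bits - 1
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, L=255, k1=0.01, k2=0.03):
    """Single-window (global) SSIM, per the formula above."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def std_dev(y):
    """Standard deviation of pixel values about the mean."""
    return np.sqrt(np.mean((y - y.mean()) ** 2))
```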
The average CT number was obtained using the analysis/measurement function of the ImageJ software: after the region of interest is selected with the cursor, the system reports the CT number corresponding to that region.

| RESULTS
In this study, we not only compared the artifact removal performance with that of other methods but also describe the process of finding the best-performing network and parameters.

3.A | Experimental results
CBCT slices may be heavily contaminated with streak artifacts during the scanning process, which means that some detailed information may be destroyed. The proposed method effectively suppressed scattering artifacts in the CBCT slices, as indicated by the results shown in Fig. 5. For a clear comparison, the last column contains the corresponding CT images with few artifacts (RCT). Comparing Fig. 5a and Fig. 5b, it can be seen that the proposed method can correct artifacts in CBCT slices (including streaking, shadowing, ringing, cupping artifacts, etc.), which significantly improves the image quality.
The last row of Fig. 5b and Fig. 5c merits particular attention. To objectively illustrate the effectiveness of our method in removing artifacts, a quantitative analysis (including CT number, MAE, PSNR, SSIM, and std) of the pelvic region is presented in Table 1.
Each of these values is the mean computed over the test data set. We calculated the CT numbers (in HU) of areas such as the bone marrow and skin; the corresponding results are reported in Table 1, Table 2, and Table 3. Breathing and other movements are more significant in the chest region than in the pelvic region. To verify that our method is applicable to other anatomies, thoracic data were processed in the same way and CT-CBCT pairs were generated; we randomly selected 64 pairs as the validation set and 64 pairs as the test set. Fig. 8 shows that good results were achieved even with the chest data; here, we show only three slices.

3.B | Optimization
We conducted a series of experiments to find the best parameters for our method; these are summarized in Table 4, and the corresponding results are shown in Fig. 9.
Step 1: The results show that the image quality improved to some extent, but obvious streaking artifacts were still present, as shown in Fig. 9b.
Step 2: The uneven grayscale (cupping artifact) in the image was significantly reduced; however, the streak artifacts were not removed (Fig. 9c).
Step 3: These changes did not produce the desired results. Fig. 9d shows the results obtained when the L2 constraint was introduced in equation (7).
Step 4: Fig. 9e shows that the streak artifacts were well corrected and, compared with the method used in 21, image details were well preserved with no blurring.
Step 5: Comparing Fig. 9e and Fig. 9f, it can be observed that the five-layer network achieved better artifact correction in the same training time.

Table 4. Settings of the improvement steps (number of layers, dimension, feature sample size). Step 1: 17, 64, 65 × 65; Step 2: 2, adaptive, 65 × 65; Step 3: 2, adaptive, 65 × 65; Step 5: 5, adaptive, 80 × 80.

The quantitative analysis of the above improvement steps is shown in Table 5 and Table 6. Table 5 shows that, after continuous testing, the artifacts were not only visually suppressed but the MAE also indicates a significant improvement in image quality. Furthermore, the loss function is semantic: it evaluates image similarity based on feature similarity rather than distance. Therefore, the loss function is robust to slight data movements and addresses data-mismatch problems more effectively.
Furthermore, in this paper, we used the CNN as a feedforward network for scatter artifact correction. This method effectively suppresses the scattering artifacts produced by actual CBCT systems.
Moreover, the proposed method can be generalized and applied to different anatomies. In this study, we used pelvic and thoracic data for testing. Good artifact removal can be achieved provided that CBCT images of the anatomy under investigation are collected as training data. We also found that the method in 21 blurred the detailed texture of the image when we repeated the experiment.
The method in 32 was designed for deblurring 2D photographs. CT is a tomographic technique, whereas traditional photography is a single 2D projection, and motion affects the two types of images differently. Consequently, we found that this method did not produce satisfactory results on CBCT slices and caused the loss of some structures. The method in 25 performs well in removing artifacts, but its correction of the air cavity needs improvement; in contrast, Fig. 5b shows that the proposed method achieves a relatively good correction of the air cavity.
The proposed method preserved the details of the evaluated site, such as textures in the inner contours of the pelvis and chest regions. Moreover, no blurring was introduced into the CBCT slices during artifact removal. The results of the pelvic and thoracic data showed that the proposed method may be useful for removing artifacts in CBCT slices, with a significant improvement in the CT value of the regions of interest. Therefore, the incorporation of our method can effectively reduce the artifacts of CBCT in IGRT and improve the accuracy of dose calculation.
Our proposed method can be further improved using more complex generation networks. Future studies will focus on further investigating and improving our experimental results.

ACKNOWLEDGMENTS
The authors are grateful to the anonymous reviewers for their constructive comments and evaluations, which significantly improved the presentation of this study, and acknowledge the help and support provided.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

RESEARCH INVOLVING HUMAN PARTICIPANTS AND/OR ANIMALS
This article does not contain any studies with human participants or animals performed by any of the authors.

INFORMED CONSENT
Informed consent was obtained from all individual participants of the study.