Liver tumor segmentation based on 3D convolutional neural network with dual scale

Abstract
Purpose The liver is one of the organs with a high incidence of tumors in the human body, and malignant liver tumors seriously threaten human life and health. Liver tumor segmentation from computed tomography (CT) images is difficult because (a) the contrast between liver tumors and healthy tissue in CT images is low and the boundary is blurred, and (b) liver tumors vary widely in size, shape, and location.
Methods To address these problems, this paper focused on liver and liver tumor segmentation based on convolutional neural networks (CNN), and designed a three‐dimensional dual path multiscale convolutional neural network (TDP‐CNN). To balance segmentation performance against the requirement of computational resources, two paths were used in the network, and the feature maps from both paths were fused at the end of the paths. To refine the segmentation results, we used conditional random fields (CRF) to eliminate falsely segmented points and improve the accuracy.
Results In the experiment, we used the public Liver Tumor Segmentation (LiTS) dataset to analyze the segmentation results qualitatively and quantitatively. Ground truth segmentation of liver and liver tumor was manually labeled by an experienced radiologist. The quantitative metrics were Dice, Hausdorff distance, and average distance. For liver tumor segmentation, Dice was 0.689, Hausdorff distance was 7.69, and average distance was 1.07; for liver segmentation, Dice was 0.965, Hausdorff distance was 29.162, and average distance was 0.197. Compared with the other liver and liver tumor segmentation algorithms in the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2017 competition, our method ranked first for liver segmentation and second for liver tumor segmentation.
Conclusions The experimental results showed that the proposed algorithm had good performance in both liver and liver tumor segmentation.


1 | INTRODUCTION
The liver is one of the largest and most important organs in the human body, and it is also one of the organs with a high incidence of malignant tumors. Liver cancer is a major threat to human health, and its incidence is increasing every year worldwide. 1 Accurate measurement of liver tumor size from abdominal computed tomography (CT) images, together with segmentation and localization of the tumor areas, helps clinicians evaluate liver tumors accurately. Currently, liver tumor segmentation is performed manually by radiologists on hundreds of CT images, slice by slice, which is tedious and time-consuming, and the results depend on the clinical knowledge and experience of the radiologists. Therefore, an automatic liver and liver tumor segmentation algorithm is essential for computer-aided diagnosis. The main difficulties of automatic liver and liver tumor segmentation are: (a) the liver is very close to adjacent organs, and their CT values are similar; (b) the CT contrast between the liver tumor and healthy tissue is low, and the boundaries around the liver tumor are blurred; (c) the shape, size, and location of liver tumors are complex and variable.
The segmentation algorithms for liver and liver tumors can be divided into four main categories: region growing, 2,3 graph cut, 4-6 level set, 7,8 and deep learning. 9-15 The segmentation algorithm in this paper was based on deep learning, so we mainly reviewed several classic deep learning algorithms for liver and liver tumor segmentation. Ben-Cohen et al. 15 used the VGG16 architecture as a fully convolutional network (FCN) for liver segmentation and liver lesion detection. They discarded the final classifier layer of VGG16 and converted all fully connected layers into convolutional layers. A two-channel convolution was added to predict the probability of lesion or liver at each output location, and the output was then up-sampled to the original resolution by a deconvolution layer for end-to-end learning. Sun et al. 14 designed a multichannel FCN to segment liver tumors from contrast-enhanced CT images. Because each phase of the enhanced CT data provided unique information about the pathological features, the method trained a network for each phase and then fused their high-level features. Qi et al. 10 proposed a three-dimensional (3D) deeply supervised network based on FCN. The network had a fully convolutional architecture and learned and predicted end to end. Its most important innovation was the deep supervision of hidden layers, which accelerated optimization convergence and improved prediction accuracy. Finally, based on the high-quality score map generated by the 3D deeply supervised network, the contours were refined with a fully connected conditional random field to obtain fine segmentation results.
Based on FCN, Ronneberger et al. 12 proposed the U-Net model. The network consisted of two main components, the contracting path and the expanding path. The contracting path captured the context information in the image, and the expanding path precisely localized the part of the image to be segmented. Ronneberger et al. also proposed a data augmentation method for training on small datasets, which are common in medicine, and showed that U-Net was very helpful for deep learning on medical images with few samples. Therefore, the U-Net structure has been widely used in medical image segmentation research. Han et al. 13 combined U-Net's long-range concatenation connections with ResNet's short-range residual connections. The model had 32 layers; its input was a stack of adjacent axial CT slices, and its output was a two-dimensional (2D) segmentation map corresponding to the center slice of the input. Patrick et al. 11 proposed a segmentation algorithm using two cascaded U-Net networks on CT slices. The first network segmented only the liver, and the liver mask it generated was used as the input for training the second U-Net, which segmented only the tumor. Finally, a conditional random field was applied to the full volume to capture the relationship between slices.
In the MICCAI 2017 liver and liver tumor segmentation competition, 20 liver segmentation algorithms and 24 liver tumor segmentation algorithms were proposed, and their rankings are shown in Tables 6 and 7. Almost all the top-ranked algorithms used U-Net or VGG-Net, and the best one obtained a Dice of 0.961 for liver segmentation and 0.686 for liver tumor segmentation. However, none of these algorithms used 3D CT images directly throughout the whole neural network, because training a complicated convolutional neural network on 3D image data was time-consuming and required substantial computational resources.
To solve the above-mentioned problems, this paper proposed TDP-CNN, which fused local features with global contextual information from the background, directly processed 3D medical image data to obtain 3D spatial information, and greatly sped up the training procedure. Furthermore, a conditional random field was combined with our algorithm to refine the segmentation results from TDP-CNN.

2.A | Algorithm overall flow
The algorithm proposed in this paper was mainly based on a 3D convolutional neural network with dual scale from two paths. As shown in Fig. 1, the overall scheme of the algorithm was as follows:
1. The original CT images were filtered and normalized.
2. The 3D CT images were divided into several sub-image blocks, which were used as the input of TDP-CNN. The architecture of TDP-CNN is shown in Fig. 2. There were two paths in the TDP-CNN, each composed of eight blocks, and all the blocks had the same architecture: one convolutional layer, one batch normalization layer, and one activation layer. The feature maps of the two paths were fused, input into the fully connected layer, and then classified in the softmax layer.
3. The trained TDP-CNN was used to segment the liver and liver tumor and to generate probability maps of the segmentation results.
4. Finally, the probability maps were post-processed by a fully connected conditional random field algorithm to obtain the final segmentation results of liver and liver tumors.

2.B | Data preprocessing
Before liver tumor segmentation, we used Gaussian smoothing to filter the CT images and remove the noise caused by the equipment and environment. The Gaussian kernel is given in Eq. (1), where k denotes the filtering kernel width and σ denotes the standard deviation; in this paper, σ = 1. The filtered CT images were then normalized: each pixel value was standardized with the mean and standard deviation of the whole image, so that the pixel values of every CT image had zero mean and unit variance. Besides, due to the computation limits of our workstation, the CT images were subsampled from 512 × 512 to 256 × 256 to reduce the amount of computation. Finally, data augmentation was used to deal with the small dataset size: we geometrically rotated, flipped, cropped, and scaled the original CT images to obtain more variants of liver and liver tumor and enlarge our training dataset.
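The preprocessing chain described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the kernel radius of 2, the edge padding, and the order of normalization before subsampling are assumptions made for illustration; only σ = 1, the whole-image standardization, and the 512 × 512 to 256 × 256 subsampling come from the text.

```python
import numpy as np

def gaussian_kernel_1d(sigma=1.0, radius=2):
    # Sampled 1D Gaussian, normalized to sum to 1 (radius is an assumed value).
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_slice(img, sigma=1.0, radius=2):
    # Separable Gaussian smoothing of one 2D slice with edge padding.
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(np.asarray(img, dtype=np.float64), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def zscore(img):
    # Standardize with the mean and standard deviation of the whole image.
    return (img - img.mean()) / (img.std() + 1e-8)

def subsample2(img):
    # Keep every second pixel along both axes (512 x 512 -> 256 x 256).
    return img[::2, ::2]

def preprocess_slice(img):
    return subsample2(zscore(smooth_slice(img)))
```

Calling `preprocess_slice` on a 512 × 512 slice returns a 256 × 256 array. Because the sketch subsamples after standardizing, the subsampled slice is only approximately zero-mean.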

2.C | TDP-CNN architecture
FIG. 1. Flow chart of our method.
As is well known, the computational resources of a computer system, such as CPU power, GPU power, memory size, and data transfer speed, are limited. 3D medical image data take much more memory than 2D image data, and the operations of a 3D convolutional network, such as convolution, pooling, and activation, take much more computation time than their 2D counterparts. Therefore, a complicated CNN such as U-Net or VGG-Net has a heavy computational burden and may need to be trained for weeks or even months if the 3D medical image data are loaded into it directly. A CNN with a simple architecture, however, cannot segment the liver and liver tumors well. To balance segmentation performance against the requirements of computational resources, three improvements were made: (a) We did not load the whole 3D medical image data of one subject into our CNN at one time; instead, we divided the whole 3D medical image data into small segments, and only several segments were input into our CNN each time. (b) To capture the 3D spatial features, we used multiscale small segments. Here, "multiscale" refers to two segments with the same center, where one segment has a bigger image size and higher image resolution, and the other has a smaller image size and lower image resolution. (c) To compute the multiscale segments together in our CNN, we designed a dual path neural network architecture. Here, "dual path" refers to a local path and a global path, shown in Fig. 3. In the local path, segments with smaller image size but higher image resolution were loaded, processed, and trained to capture the image features of local details, such as contour and texture. Similarly, in the global path, segments with bigger size but lower image resolution were loaded, processed, and trained to capture global image features, such as background and contextual information. The feature maps from the two paths were then fused at the end of the local and global paths and transferred to the fully connected layer and a softmax layer.
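The extraction of two concentric segments for the local and global paths can be sketched as follows. This is a hedged illustration only: the function names, the segment edge lengths (25 and 55 voxels), the subsampling factor of 3, and the zero padding at the volume border are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def crop_centered(volume, center, size):
    # Cubic crop with odd edge length `size` centered at `center`;
    # voxels outside the volume are zero-padded.
    half = size // 2
    padded = np.pad(volume, half, mode="constant")
    z, y, x = (c + half for c in center)  # shift indices into the padded frame
    return padded[z - half:z + half + 1,
                  y - half:y + half + 1,
                  x - half:x + half + 1]

def dual_scale_segments(volume, center, local_size=25, global_size=55, factor=3):
    # Local segment: small extent at full resolution.
    local = crop_centered(volume, center, local_size)
    # Global segment: larger extent around the same center, subsampled so it
    # covers more context at lower resolution. (global_size // 2) must be a
    # multiple of `factor` so the center voxel stays on the sampling grid.
    wide = crop_centered(volume, center, global_size)
    return local, wide[::factor, ::factor, ::factor]
```

With these assumed sizes the local segment is 25 × 25 × 25 voxels and the global segment is 19 × 19 × 19 voxels covering a 55 × 55 × 55 region, and both share the same center voxel.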
According to the network architecture of our method, several important parameters had to be determined first, such as the kernel size of the 3D convolutional operations and the 3D size of the small segments. To determine the optimal combination of these parameters, we needed to repeat the experiments of selection many times.
In this selection process, we only needed to know which combination was the best, and we assumed that the best combination of parameters remained the best in both single-path and two-path convolutional neural networks. Therefore, a simple one-path convolutional neural network was sufficient for the parameter selection experiments, and the simple network structure saved a lot of time and resources. We took AlexNet as the reference for this simple one-path convolutional neural network, which contained five convolutional layers, three down-sampling layers, and three fully connected layers. We did not use pooling layers, because the pooling operation loses the exact location of the voxels, which may harm the accuracy of the segmentation results.
For a 3D convolutional neural network, 3D convolutional operations cost much more computational resources than 2D convolutional operations. Therefore, only kernel sizes of 3 × 3 × 3 and 5 × 5 × 5 were considered, and a 5 × 5 × 5 kernel has about 4.6 times as many parameters as a 3 × 3 × 3 kernel (125 vs 27 weights).
To build a deeper convolutional neural network, in this paper, we chose 3 × 3 × 3 as the kernel size of our convolutional operations.
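The parameter ratio between the two kernel sizes follows directly from the weight counts. The helper below is a hypothetical illustration (its name and signature are not from the paper).

```python
def conv3d_params(kernel, in_channels, out_channels, bias=True):
    # Number of trainable parameters of one 3D convolutional layer
    # with a cubic kernel of edge length `kernel`.
    weights = kernel ** 3 * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# Weights of a single-channel kernel: 5^3 = 125 versus 3^3 = 27.
ratio = conv3d_params(5, 1, 1, bias=False) / conv3d_params(3, 1, 1, bias=False)

# Two stacked 3 x 3 x 3 layers span a 5 x 5 x 5 neighborhood (stride 1)
# with 2 * 27 = 54 weights per channel pair.
stacked = 2 * conv3d_params(3, 1, 1, bias=False)
```

Here `ratio` is 125 / 27, which is about 4.63, and the stacked pair needs 54 weights where a single 5 × 5 × 5 kernel needs 125. This is one reason small kernels allow a deeper network for the same budget.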
FIG. 2. Schematic of the three-dimensional dual path convolutional neural network (TDP-CNN) model. There were two paths in the model, one local and one global, with identical architectures. Each path had eight blocks, and all the blocks had the same architecture, composed of a 3 × 3 × 3 3D convolution layer, a batch normalization layer, and a PReLU layer. Residual connections were employed between block 2 and block 4, between block 4 and block 6, and between block 6 and block 8. At the end of the two paths, the feature maps were fused and input into the fully connected layers and softmax layers to get the final classification results.
The size of the 3D image segment was another important parameter, and it can be obtained from the size of the receptive field:

$$R_l = R_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i$$

where $R_l$, $k_l$, and $s_i$ were 3D vectors over {x, y, z}, $R_l$ represented the size of the receptive field in the l-th layer, $k_l$ represented the size of the convolution kernel, $s_i$ indicated the stride size of the i-th layer, and the size of the receptive field in the first layer was 1.
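The receptive field can be computed layer by layer with the standard recurrence $R_l = R_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i$, which is assumed here to be the relation intended by the definitions above. The sketch treats one spatial axis; the function name is hypothetical, and the stride of 1 in the example is an assumption, since the text only states a kernel size of 3 and eight layers.

```python
def receptive_field(kernels, strides, r0=1):
    # Returns the receptive field size after each layer along one axis.
    r = r0
    jump = 1  # product of the strides of all earlier layers
    sizes = []
    for k, s in zip(kernels, strides):
        r += (k - 1) * jump
        sizes.append(r)
        jump *= s
    return sizes

# Eight layers with kernel size 3 and (assumed) stride 1.
sizes = receptive_field([3] * 8, [1] * 8)
```

Under these assumptions each layer adds 2 voxels per axis, so the receptive field after the eighth layer is 17 voxels per axis.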
In our simple one-path CNN, which was used to test the combination of parameters, $l = \{1, 2, \ldots, 7, 8\}$, $i = \{1, 2, \ldots, l-1\}$, $R_0 = 1$, and $k_l = 3$.
After determining the size of the 3D kernel and the size of the 3D image segments, we needed to define the architecture of the 3D CNN, shown in Fig. 2. There were two paths in the TDP-CNN, a local path and a global path, and the two paths had the same architecture. There were eight blocks in each path, and every block was composed of one 3D convolutional layer with kernel size 3 × 3 × 3, one batch normalization layer, and one PReLU layer. Batch normalization is a technique for improving the performance and stability of 2D CNNs; it can also be used in 3D CNNs, where it normalizes the input of a layer by re-centering and re-scaling the activations and mitigates the problem of internal covariate shift.
In TDP-CNN, we used residual connections between block 2 and block 4, between block 4 and block 6, and between block 6 and block 8. Taking the residual connection between blocks 2 and 4 as an example, the outputs of block 2 were transferred directly to the end of block 4, and the outputs of blocks 2 and 4 were added together. Residual connections give later layers direct access to the feature maps of earlier layers, which improves gradient propagation and results in faster convergence during training and better network performance.
At the end of both paths, the outputs from the local path and the global path were added together to obtain feature maps containing both the local details and the global spatial information of the images. These feature maps were then transferred to the 3D fully connected layers. Let the number of feature maps be Q and the size of each 3D feature map be M × N × P.
In a regular CNN, the operation of the fully connected layers consisted of two steps. First, the Q feature maps were convolved with a kernel whose size was also M × N × P, which reduced each feature map from a 3D matrix to a single element; this greatly reduced the number of parameters but lost the spatial information. Second, the Q elements were connected to every neuron in the fully connected layer. In our TDP-CNN, the first step was different: the Q feature maps were convolved with a kernel of size 1 × 1 × 1, which kept the size of the feature maps unchanged. Our model therefore had more parameters in this layer, but it maintained the 3D spatial information, which was very important for 3D liver and liver tumor segmentation. Finally, the outputs of the fully connected layers were input to the softmax classifier to obtain the probability maps of the liver and liver tumor segmentation results. This completes the description of the TDP-CNN architecture.
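The effect of a 1 × 1 × 1 convolution followed by a voxel-wise softmax can be illustrated with a small NumPy sketch. The function names, the random weights, and the sizes (Q = 4 feature maps of 5 × 5 × 5 voxels) are made up for the example; only the three output classes come from the text.

```python
import numpy as np

def conv1x1x1(features, weights, bias):
    # features: (Q, M, N, P); weights: (C, Q); bias: (C,).
    # Each output voxel is a linear combination of the Q input channels
    # at the same location, so the spatial size M x N x P is preserved.
    return np.einsum("cq,qmnp->cmnp", weights, features) + bias[:, None, None, None]

def softmax(logits, axis=0):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q, M, N, P, C = 4, 5, 5, 5, 3  # C = 3: liver, liver tumor, background
fused = rng.normal(size=(Q, M, N, P))
probs = softmax(conv1x1x1(fused, rng.normal(size=(C, Q)), np.zeros(C)))
labels = probs.argmax(axis=0)  # one class index per voxel
```

The probability map `probs` has shape (3, 5, 5, 5), one class distribution per voxel, which is the kind of output that a CRF post-processing step can consume.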
More specifically, the neural network parameters of each layer are shown in Table 2, and other parameters in the training of TDP-CNN are shown in Table 3.

2.D | Data post-processing
To remove mis-segmented points, this paper used fully connected CRFs (FC-CRF). 16,17 Consider a CT image with N pixels, with CT values $I = \{I_1, I_2, \ldots, I_N\}$ and a category label set $L = \{l_1, l_2, \ldots, l_k\}$, where k = 3 in this paper because there were three categories (liver, liver tumor, and background). The labels assigned to the pixels were $X = \{X_1, \ldots, X_N\}$. Then

$$P(X \mid I) = \frac{1}{Z(I)} \exp\left(-E(X \mid I)\right)$$

where $Z(I)$ is the normalization constant. Therefore, solving the conditional random field (CRF) was the process of minimizing the Gibbs energy function. $E(X \mid I)$ was defined as shown in Eq. (7), where i and j took values from 1 to N:

$$E(X \mid I) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j) \qquad (7)$$

$\psi_u(x_i)$ was the unary potential function, which was calculated by the classifier independently for each pixel and indicated the energy of assigning the label $x_i$ to pixel i. In our method, the unary potential was calculated from the probability map of the liver and liver tumor generated by the TDP-CNN model. The pairwise potential function $\psi_p(x_i, x_j)$ indicated the energy of simultaneously assigning the labels $x_i$ and $x_j$ to pixels i and j, and its expression was:

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m} \omega^{(m)} k^{(m)}(f_i, f_j)$$

where $k^{(m)}$ was a Gaussian kernel:

$$k^{(m)}(f_i, f_j) = \exp\left(-\frac{1}{2} (f_i - f_j)^{T} \Lambda^{(m)} (f_i - f_j)\right)$$

$f_i$ and $f_j$ were the feature vectors of pixels i and j in the feature space, $\omega^{(m)}$ were linear combination weights, and $\mu$ was a label compatibility function that satisfied the Potts model. Each kernel function $k^{(m)}$ had a symmetric, positive definite precision matrix $\Lambda^{(m)}$. For multiclass image segmentation, the potential function was defined by the color $I_i$, $I_j$ and the position $p_i$, $p_j$ of pixels i and j:

$$k(f_i, f_j) = \omega^{(1)} \exp\left(-\frac{\|p_i - p_j\|^2}{2\theta_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\theta_\beta^2}\right) + \omega^{(2)} \exp\left(-\frac{\|p_i - p_j\|^2}{2\theta_\gamma^2}\right)$$

where $\theta_\alpha$, $\theta_\beta$, and $\theta_\gamma$ control the widths of the Gaussian kernels.

Dice
Dice is a commonly used indicator for evaluating the results of medical image segmentation. Dice is 100% when the prediction result is completely consistent with the ground truth.

Sensitivity
Sensitivity, also known as true-positive rate or recall rate, is used to measure the ability of the algorithm to identify positive data.

Specificity
Specificity is used to reflect the ability of the algorithm to identify negative data. Besides, the experiments in this paper also use two distances between pixels to evaluate the segmentation results.
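The three overlap metrics can be computed from the voxel-wise confusion counts. The sketch below uses the standard definitions Dice = 2TP / (2TP + FP + FN), sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP); these are assumed to match the definitions intended in this section. The function names and the toy masks are illustrative.

```python
import numpy as np

def confusion_counts(pred, truth):
    # pred and truth are binary masks of the same shape.
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn

def dice(pred, truth):
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2.0 * tp / denom  # two empty masks agree fully

def sensitivity(pred, truth):
    tp, _, fn, _ = confusion_counts(pred, truth)
    return tp / (tp + fn)  # undefined when truth has no positive voxel

def specificity(pred, truth):
    _, fp, _, tn = confusion_counts(pred, truth)
    return tn / (tn + fp)  # undefined when truth has no negative voxel

pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 0, 1]
scores = (dice(pred_mask, true_mask),
          sensitivity(pred_mask, true_mask),
          specificity(pred_mask, true_mask))
```

For the toy masks, TP = 2, FP = 1, FN = 0, and TN = 2, so Dice is 0.8, sensitivity is 1.0, and specificity is 2/3.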

Hausdorff distance
Let $s_g$ denote the set of voxels in the ground truth image and $s_p$ the set of voxels in the predicted image. The Hausdorff distance is given by:

$$HD(s_g, s_p) = \max\left(h(s_g, s_p), h(s_p, s_g)\right)$$

where $h(s_g, s_p)$ is called the one-way Hausdorff distance and is given by:

$$h(s_g, s_p) = \max_{a \in s_g} \min_{b \in s_p} \|a - b\|$$

where ||·|| denotes the Euclidean distance. The Hausdorff distance is sensitive to outliers and measures the largest distance between the ground truth image and the predicted image.
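A brute-force computation of the Hausdorff distance between two small voxel sets can be sketched as follows, using the usual max-min definition of the one-way distance. The function names and the toy coordinates are illustrative, and the dense pairwise distance matrix is only practical for small sets.

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distance between every point of a and every point of b.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def directed_hausdorff(a, b):
    # Largest distance from a point of a to its nearest point in b.
    return float(pairwise_dist(a, b).min(axis=1).max())

def hausdorff(a, b):
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

s_g = [(0, 0, 0), (1, 0, 0)]
s_p = [(0, 0, 0), (4, 0, 0)]
hd = hausdorff(s_g, s_p)
```

In this example the one-way distances are 1 (from `s_g` to `s_p`) and 3 (from `s_p` to `s_g`), so `hd` is 3. The single outlying voxel at (4, 0, 0) determines the result, which illustrates the sensitivity to outliers.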

Average distance
The average distance, also known as the average symmetric surface distance (ASSD), is given by:

$$ASSD(s_g, s_p) = \frac{N_1 \, d(s_g, s_p) + N_2 \, d(s_p, s_g)}{N_1 + N_2}$$

where $N_1$ and $N_2$ represent the number of voxels in $s_g$ and $s_p$, respectively, and $d(s_g, s_p)$ denotes the average shortest distance from the voxels of $s_g$ to $s_p$, which can be calculated by:

$$d(s_g, s_p) = \frac{1}{N_1} \sum_{a \in s_g} \min_{b \in s_p} \|a - b\|$$

ASSD represents the overall difference between two sets. For a completely correct segmentation result, the ASSD value is 0, which means that the predicted image completely coincides with the ground truth image.
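The average distance can be sketched in the same brute-force style, assuming the common voxel-count-weighted form in which the two one-way average distances are weighted by $N_1$ and $N_2$. The function names and the toy voxel sets are illustrative.

```python
import numpy as np

def mean_min_dist(a, b):
    # Average, over the points of a, of the distance to the nearest point of b.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

def assd(a, b):
    # Symmetric average distance, weighted by the number of points in each set.
    n1, n2 = len(a), len(b)
    return (n1 * mean_min_dist(a, b) + n2 * mean_min_dist(b, a)) / (n1 + n2)

truth_voxels = [(0, 0, 0), (1, 0, 0)]
pred_voxels = [(0, 0, 0), (4, 0, 0)]
avg_dist = assd(truth_voxels, pred_voxels)
```

For these sets the one-way averages are 0.5 and 1.5, giving an ASSD of 1.0. Compared with the Hausdorff distance, a single outlying voxel has a much smaller effect on this value.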

3.C | TDP-CNN parameter settings
In Table 2, conv represents a convolutional layer, kernel represents the size of the convolution kernels in each layer, and FMs represents the number of feature maps in each layer.
In this experiment, the local path and global path of TDP-CNN had the same network architecture, as shown in Table 2. In the training of TDP-CNN, the initializing setting and parameters are shown in Table 3.

3.D | TDP-CNN training and testing results
During the training of TDP-CNN, the ground truth of both liver and liver tumors was loaded into the network, so TDP-CNN can segment liver and liver tumors simultaneously. The final classifier layer of the network divided the 3D CT images into three categories: liver, liver tumor, and background. In the experiment, we used the cross-
However, we found that the Hausdorff distances for liver and liver tumor segmentation were large, which meant that there were some falsely segmented voxels. Therefore, we used FC-CRF to refine the segmentation results and improve the segmentation accuracy. Table 5 shows
Fig. 10, and the image data were the same as those in Fig. 7. It can be seen that applying FC-CRF effectively removed false segmentation of liver and liver tumor. Finally, the average time taken by our method for each 3D CT image was 13.3 min.
It is essential to compare the performance of the dual path multiscale CNN with that of a single-path CNN, to clearly show the advantages of our method. For this comparison, we built a single-path CNN whose architecture was the same as one of the two paths of TDP-CNN, as shown in Table 2, and we named this single-path CNN "Local." The number of parameters of TDP-CNN is twice that of Local; therefore, to make the comparison fair, we built another single-path CNN, which also had eight layers and the same kernel size as shown in Table 2, but twice as many feature maps in each layer as Local, and we named this single-path CNN "Local+." We trained all three CNNs for 700 epochs, and the training loss is shown in Fig. 11
Our method can accurately segment large liver tumors, but performed comparatively worse for small liver tumors, defined here as tumors with a 3D size of less than 500 voxels.
The diameter of a small liver tumor is only a few voxels, while the size of the whole liver image is 512 × 512; therefore, segmenting such small structures against such a large background is a very difficult task. Image noise and artifacts make the task harder still. This problem was also recog-  corresponding methods were developed, but still cannot get high Dice value. Therefore, more studies are needed in the future to improve small liver tumor segmentation.

CONFLICT OF INTERESTS
The authors declared that they have no conflict of interest to this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.