Evaluation of Multi-Slice Inputs to Convolutional Neural Networks for Medical Image Segmentation

When using Convolutional Neural Networks (CNNs) for segmentation of organs and lesions in medical images, the conventional approach is to work with inputs and outputs either as single slice (2D) or whole volumes (3D). One common alternative, in this study denoted as pseudo-3D, is to use a stack of adjacent slices as input and produce a prediction for at least the central slice. This approach gives the network the possibility to capture 3D spatial information, with only a minor additional computational cost. In this study, we systematically evaluate the segmentation performance and computational costs of this pseudo-3D approach as a function of the number of input slices, and compare the results to conventional end-to-end 2D and 3D CNNs. The standard pseudo-3D method regards the neighboring slices as multiple input image channels. We additionally evaluate a simple approach where the input stack is a volumetric input that is repeatably convolved in 3D to obtain a 2D feature map. This 2D map is in turn fed into a standard 2D network. We conducted experiments using two different CNN backbone architectures and on five diverse data sets covering different anatomical regions, imaging modalities, and segmentation tasks. We found that while both pseudo-3D methods can process a large number of slices at once and still be computationally much more efficient than fully 3D CNNs, a significant improvement over a regular 2D CNN was only observed for one of the five data sets. An analysis of the structural properties of the segmentation masks revealed no relations to the segmentation performance with respect to the number of input slices. The conclusion is therefore that in the general case, multi-slice inputs appear to not significantly improve segmentation results over using 2D or 3D CNNs.


I. INTRODUCTION
Segmentation of organs and pathologies are common activities for radiologists and routine work for radiation oncologists.Nowadays manual annotation of such regions of interest is aided by various software toolkits for image enhancement, automated contouring, and structure analysis in all fields on image-guided radiotherapy [1,2,3].Over the recent years, deep learning (DL) has emerged as a very powerful concept in the field of medical image analysis.The ability to train complex neural networks by example to independently perform a vast spectrum of annotation tasks has proven itself a promising method to produce segmentations of organs and lesions with expert-level accuracy [4,5].
For both organ segmentation and lesion segmentation, the most common DL model is the Convolutional Neural Network (CNN).Whereas the classic approach of segmenting 3D medical volumes by CNNs consists of training on and predicting the individual 2D slices independently, the interest has shifted in recent years towards full 3D convolutions in vo1umetric neural networks [6,5,7,8,9].Volumetric convolution kernels have the advantage of taking inter-slice context into account, thus preserving more of the spatial information than what is possible when using 2D convolutions within slices.However, volumetric operations require a much larger amount of computational resources.For medical image applications, the lack of sufficient Graphical Processing Unit (GPU) memory to fit entire volumes at once requires in almost all cases a patchbased approach, reduced input sizes, and/or small batch sizes and therefore longer training times.

A. Related Work
In terms of fully connected, end-to-end 3D networks, studies often attempt to compensate for the small patch size that can maximally fit into the GPU memory at once by creating more efficient architectures or utilizing post-processing methods.The original U-Net by Ronneberger et al. [10], an architecture which was, at that time, and still is, a popular and powerful network for semantic medical image segmentation, was first reintroduced as a 3D variant by C ¸ic ¸ek et al. [8].The 3D U-Net was used by Vu et al. [11,12] in a cascaded approach where a first coarse prediction was used to generate a candidate region in which a second, finer-grained prediction was performed; this proved to be an effective way of reducing the amount of input data for the final prediction.V-Net by Milletari et al. [7] extended the network of [8] by adding residual connections to the 3D U-Net.
Li et al. [13] reduced the computational cost required for a fully connected 3D CNN by replacing the deconvolution steps in the upsampling phase with dilated convolution to preserve the spatial resolution of the feature maps.VoxResNet [14] is a very deep residual network that was trained on small 3D patches.The resulting output probability map was combined with the original multimodal volumes into a second VoxRes-Net to obtain a more accurate output.A related approach from Yu et al. [15] extended this architecture by implementing long residual connections between residual blocks, in addition to the short connections within the residual blocks.The same group proposed another densely connected architecture called DenseVoxNet [16], where each layer had access to the feature maps of all its preceding layers, decreasing the number of parameters and possibly avoiding to learn redundant feature maps.
Lu et al. [17] used a graph cut model to refine the output of their coarse 3D CNN.A 3D network composed of two separate convolutional pathways, at low and high resolution, was introduced by Kamnitsas et al. [18].For improvement, the resulting segmentation was, in turn, post-processed by a Conditional Random Field.A variant of this multi-scale feature extraction during convolution was used by Lian et al. [19], who used this procedure in the encoding phase of their U-Net-like 3D CNN.Ren et al. [20] exploited the small size of regions of interest in the head and neck area (i.e. the optic nerves and chiasm) to build an interleaved combination of smallinput, shallow CNNs trained at different scales and in different regions.Feng et al. [21] used a two-step procedure: a first 3D U-Net was used to localize thoracic organs in a substantially downsampled volume, and crop to a bounding box around each organ.Then, individual 3D U-Nets were trained to segment each organ inside its subvolume at the original resolution.Another example of 3D convolutions applied only on a small region of interest is from the work of Anirudh et al. [22], who randomly sampled subvolumes in lung images for which the centroid pixel intensity was above a certain intensity threshold, to classify the subvolume as containing a lung nodule or not.
While these studies have shown that 3D CNNs are worth the effort, alternative approaches have been investigated to involve volumetric context to improve segmentation while avoiding 3D convolutions altogether.One of the more common methods, usually called 2.5D, is to use CNNs that combine tri-planar 2D CNNs from intersecting orthogonal patches [23,24,25,26,27,28,29,30].This can be a computationally efficient way of incorporating more 3D spatial information, and these studies all present promising results.However, this method is limited in the volumetric information it can encompass at once.We, therefore, investigate a method that uses a volumetric input but is still largely 2D based with a minimal amount of 3D operations.Instead of a method that takes a single 2D slice as input, and outputs the 2D segmentation of that slice, one can also incorporate neighboring slices to provide a 3D context to enhance segmentation performance.A common approach to this is to include neighboring slices to a central slice as multiple input image channels.Novikov et al. [31] included the preceding and succeeding axial slice for vertebrae and liver segmentation.Such a three-slice input was also used by Kitrungrotsakul et al. [32] for the detection of mitotic cells in 4D data (spatial + temporal).This was a cascaded approach where a first detection step with a three-slice input produced results for these three slices.In the second step, they reduced the number of false positives where for each slice the time-frame before and after was included.In a deep CNN for liver segmentation, Han [33] used five neighboring slices.Ghavami et al. [34] compared incorporating three, five, and seven slices for prostate segmentation from ultrasound images.While their method produced promising segmentation results, no significant difference was found between these three input sizes.In a recent paper, Ganaye et al. [35] employed a seven-slice input producing an output for the three central slices, which the authors refer to as 2.5D.This model was used to evaluate a loss function that penalized anatomically unrealistic transitions between adjacent slices.The authors did not report a significant improvement between the baseline 2D and 2.5D models, but the 2.5D model did outperform in terms of Hausdorff Distance when the non-adjacency loss was employed.

B. Contributions
In this paper, we systematically investigate using multiple adjacent slices as input to predict for the central slice in that subset, and we investigate this on the segmentation task in medical images.We will henceforth refer to any method based on this principle as pseudo-3D.We compare the segmentation performance of a range of input multi-slice sizes (d ∈ {3, 5, . . ., 13}) to conventional end-to-end 2D and fully 3D input-output CNNs.We employ the common approach from the literature where each neighboring slice is put as a separate channel in the input, and we will refer to this method as the channel-based method.Further, we introduce a second pseudo-3D method, that appears to have not been proposed in the literature before.This pseudo-3D method consists of two main components: a transition block that transforms a dslice input into a single-slice (i.e.2D) feature map by using 3D convolutions, and this feature map is then followed by a standard 2D convolutional network, such as the U-Net [10] or the SegNet [36], that produces the final segmentation labels.This method shall be referred to as the proposed method.
The main contributions of our work are: 1) We systematically compare the segmentation performance of 2D, pseudo-3D (with varying input size, d), and 3D approaches.2) We introduce a novel pseudo-3D method, using a transition block that transforms a multi-slice subvolume into a 2D feature map that can be processed by a 2D network.This method is compared to the channel-based pseudo-3D method.3) We compare the computational efficiency of fully 2D and 3D CNNs to the pseudo-3D methods in terms of graphical memory use, number of model parameters, floating point operations (FLOPs), training time, and prediction time.4) We conduct all experiments on a diverse range of data sets, covering a broad range of data set sizes, imaging modalities, segmentation tasks, and body regions.

II. PROPOSED METHOD
The underlying concept of the pseudo-3D methods is similar to that of standard slice-by-slice predictions using 2D CNNs, but the input is now a subvolume with an odd number of slices, d, extracted from the whole volume with a total of D slices.The output of the model is compared to the ground truth of the central slice.If d = 1, the method is equivalent to a 2D CNN.A fully 3D CNN would be d = D for both input and output, where all operations in the network are in 3D and the output volume is compared to the ground truth of the whole volume.See Figure 1 for an illustration of the proposed method.In this study, the number of slices in the input subvolume ranged from d = 3 to d = 13.In order to isolate the contribution of using multi-slice inputs, this work did not include multi-slice Fig. 1: A comparison of 2D, pseudo-3D and 3D approaches.With a 2D network, the volume is segmented with a single slice input and output.Pseudo-3D uses multiple adjacent slices as input to produce an output of the central slice from the input.3D approaches take in the whole volume at once, and return a prediction for the whole volume as well.In the figure, the W , H, and D are the original width, height, and depth of the input volume, respectively.
outputs-where the multiple outputs for each slice are usually aggregated using e.g.means or medians.
Let the input volume be of width W , height H, depth d, and have C channels.A common way of utilizing depth information to train with regards to the central slice is as follows: group the channel and depth dimension together as one, and consider the input to be of shape W × H × (d • C), i.e. with d • C channels.By incorporating the slices in the channel dimension, the multi-slice input can be processed by a regular 2D network.As was mentioned in Section I-B, this method is denoted here as the channel-based method.
The channel-based method is compared to a novel pseudo-3D approach denoted the proposed method.Consider the input to be of shape W ×H ×d×C.This is fed through a transition block with L = d 2 layers (where • is the floor function).In each layer, a 3D convolution with a kernel of size 3 × 3 × 3 is applied to the volume within the image, after it has been padded in the W and H dimensions, but not in the d dimension.Thus, after each layer in the transition block, the depth of the image is reduced by 2 slices, while the width and height stay the same size.After the final convolution, the depth dimension is removed.Hence, the shapes change as In both the proposed method and the channel-based method, the output layer of the network is the segmentation mask, with an output shape of W × H × 1.Hence, it produces a single segmentation slice, corresponding to the central slice of the input subvolume.See Figure 1 for an illustration of this.
The network architectures that were evaluated in this work was the U-Net [10] and the SegNet [36], two popular variants of encoder-decoder architectures that have been successful in semantic medical image segmentation.An illustration of both pseudo-3D methods, with U-Net as the main network architecture, is given in Figure 2. Another illustration of the networks with SegNet backbone can be seen in Figure 11 in the Supplementary Material.
We evaluate the two pseudo-3D methods for d ∈ {3, 5, 7, 9, 11, 13} and compare them to the corresponding conventional end-to-end 2D and 3D networks, all with the U-Net or SegNet architectures.This yields a total of 14 different experiments for each data set (six input sizes for the two pseudo-3D methods, plus 2D and 3D methods, all with two network architectures).Apart from the segmentation performance, the computational cost is also evaluated across experiments in terms of the number of network parameters, the maximum required amount of GPU memory, the number of FLOPs, the training time per epoch, and the prediction time per sample.

III. EXPERIMENTS
We here present the data sets the experiments were conducted on, as well as the encompassing information and parameters used in the experiments.

A. Materials
To test the generalizing capabilities of the methods, we ran experiments on five different data sets, covering a variety of modalities, data set sizes, segmentation tasks, and body areas.Three of the data sets are publicly available, as they were part of segmentation challenges.On top of those, we further used two in-house data sets collected at the University Hospital of Umeå, Umeå, Sweden.
1) Umeå Pelvic Region Organs: An in-house data set containing computed tomography (CT) images of the pelvis region from 1 244 patients that underwent radiotherapy for prostate cancer at the University Hospital of Umeå, Umeå, Sweden.We denote this data set Umeå Pelvic Region Organs (U-PRO).The delineated structures include the prostate (in most cases annotated as the clinical or gross target volume) and some organs at risk, among them the bladder and rectum.The individual structure masks were merged into a single multilabel truth image, with pixel value 1 for the prostate, 2 for the bladder, and 3 for the rectum (see Figure 3).Patients without the complete set of structures were excluded, resulting in a final data set containing 1 148 patients.
2) Umeå Head and Neck Database: An in-house data set containing CT images of the head and neck region of 110 patients.This data set comprises the patients from the University Hospital of Umeå, Umeå, Sweden, that participated in the ARTSCAN study [37].We denote this data set Umeå Head and Neck Database (U-HAND).For each CT image, manual annotations of the target volumes and various organs at risk were provided.The organ structures that were included with  this data were the bilateral submandibular glands, bilateral parotid glands, larynx, and medulla oblongata (see Figure 4).After removal of faulty CT volumes where the slice spacing changed within a volume and excluding patients in which not all of the six aforementioned structures were present, the final data set contained 73 patients.were available, acquired with different protocols and various scanners at 3 T.
Manual segmentations were carried out by one to four raters and approved by neuroradiologists.The necrotic and non-enhancing tumor core, peritumoral edema, and contrastenhancing tumor were assigned labels 1, 2, and 4 respectively (see Figure 5).The images were co-registered to the same anatomical template, interpolated to a uniform voxel size and skull-stripped.

4) Kidney Tumor Segmentation Challenge 2019:
The data set for the Kidney Tumor Segmentation Challenge 2019 (KiTS19) challenge [40], part of the MICCAI 2019 conference, contains preoperative CT data from 210 randomly selected kidney cancer patients that underwent radical nephrectomy at the University of Minnesota Medical Center between 2010 and 2018.Medical students annotated under supervision the contours of the whole kidney including any tumors and cysts (label 1), and contours of only the tumor component excluding all kidney tissue (label 2) (see Figure 6).Afterward, voxels with a radiodensity of less than −30 HU were excluded from the kidney contours, as they were most likely perinephric fat.Fig. 6: Kidney Tumor Segmentation Challenge 2019 data set.From top to bottom: images and ground truth images of the kidney (red) and kidney tumor (green).

5) Internet Brain Segmentation Repository:
The Internet Brain Segmentation Repository (IBSR18) data set [41] is a publicly available data set with 18 T1w MRI volumes, and is commonly used as a standard data set for tissue quantification and segmentation evaluation.Whole-brain segmentations of cerebrospinal fluid (CSF), gray matter, and white matter were included with their respective labels 1, 2, and 3 (see Figure 7).

B. Preprocessing
Due to the diverse range of data sets, it must be ensured that the training data is as similar as possible across experiments in order to achieve a fair comparison.
1) Magnetic Resonance Image Preprocessing: The BraTS19 and IBSR18 data sets were N4ITK bias field corrected [42] and normalized to zero-mean and unit variance.The BraTS19 volumes were cropped around the center to a resolution of 160 × 192 × 128, to increase processing speed.This last step was skipped for the IBSR18 data set because of the much smaller amount of data samples.2) Computed Tomography Image Preprocessing: In the U-PRO, U-HAND, and KiTS19 data sets, all images had an x×y resolution (i.e.sagittal-coronal ) of 512 × 512 and a varying slice count.The voxel size also varied between patients, so a preprocessing pipeline (see Figure 8) was set up to transform these two data sets to a uniform resolution and voxel size.
Fig. 8: Preprocessing pipeline as applied on the U-PRO data set.Given are the resolutions and in parentheses the voxel dimensions in mm.W , H, and D each denote that the volume shape is varied in width, height, and/or depth, respectively.α, β, and γ each denote that the voxel spacing is varied in width, height, and/or depth, respectively.First, the data were resampled to an equal voxel size within the same set.The volumes were then zero-padded to the size of the single largest volume from that set after resampling.In order to increase processing speed and lower the memory consumption, the U-PRO and KiTS19 volumes were thereafter downsampled to 256 × 256 × 128, and the U-HAND volumes were downsampled to 256 × 256 × 64.An example of this method pipeline is shown in Figure 8.
As a final step, the images were normalized by clipping each case to the range [−1000, 2000], subtracting 500 and dividing by 1500.

C. Training Details
Our method was implemented in Keras 2.2.41 using Ten-sorFlow 1.12.02 as the backend.The experiments were trained on either a desktop computer with an NVIDIA RTX 2080 Ti GPU, or the NVIDIA Tesla V100 GPUs from the High Performance Computer Center North (HPC2N) at Umeå University, Sweden.Depending on the model, the convergence speed, and the data set size, a single experiment took from minutes to multiple days to complete.
1) Experiment Setup: For the 3D experiments, the BraTS19 data set was the only data where the whole volumes could be fed into the network at once because of constraints in GPU memory.For the other data sets, we resorted to a patch-based approach where the input size would be 256 × 256 × 32, the largest size possible for our available hardware.
In all experiments, we employed the Adam optimizer [43] with an initial learning rate of 1 • 10 −4 .If the validation loss did not improve after a certain number of epochs, we used a patience callback that dropped the learning rate by a factor of 0.2 and an early stopping callback that terminated the experiment.Because of the differences in data set sizes, these callbacks had to be determined from initial exploratory experiments for each separate data set to ensure experiments did not run for too long or too short.The patience callbacks were set to five epochs for the BraTS19, KiTS19, and U-PRO experiments, six epochs for the U-HAND data set, and ten epochs for the IBSR18 data set.The early stopping callbacks were set to 11 epochs for U-PRO data, 12 epochs for BraTS19 and KiTS19 data, 14 for U-HAND data, and 25 epochs for IBSR18.The maximum number of epochs an experiment could run for, regardless of any changes in the validation loss, was set to 100 for the U-HAND and U-PRO data and 200 for the other data sets.Batch normalization and a L 2 norm regularization, with parameter 1 • 10 −5 , were applied to all convolutional layers, both in the transition block and in the main network.As loss function, we employed a combination of the Dice similarity coefficient (DSC) and categorical cross-entropy (CE).The DSC is typically defined as with U the output segmentation and V its ground truth.However, a differentiable version of Equation 1, the so-called soft DSC, was used.The soft DSC is defined as where for each label i, the u is the SOFTMAX output of the network and v is a one-hot encoding of the ground truth segmentation map.The is a small constant added to avoid division by zero.
The DSC is a good objective for segmentation, as it directly represents the degree of overlap between structures.However, for unbalanced data sets with small structures and where the vast majority of pixels are background, it may converge to poorly generalizing local minima, since misclassifying only a few pixels can lead to large deviations in DSC.A common way [44,45] to resolve this is to combine the DSC loss with the CE loss, defined as and we did this as well.Hence, the final loss function was 2) Data Augmentation: In order to artificially increase the data set size and to diversify the data, we employ various common methods for on-the-fly data augmentation: flipping along the horizontal axis, rotation within a range of −1 to 1 degrees, shear images within the range of −0.05 to 0.05, zoom with a factor between 0.9 and 1.1, and adding small elastic deformations as described in [46].The data augmentation implementation we used was based on [47].The images in the KiTS19 data are asymmetric along the x-axis because of the liver; therefore, vertical flipping was not applied on that data set as it would result in anatomically unrealistic images (see Table I).
3) Evaluation: For evaluation of the segmentation performance, we employed the conventional DSC as defined in Equation 1.In order to ensure a fair comparison and to investigate the variability of the results within experiments, we used five-fold cross-validation in each experiment (except for the U-PRO).Due to its much larger size, the experiments on the U-PRO data set were run only once.
To compare the computational cost of our proposed models to the corresponding 2D and 3D CNN models, we extracted the number of trainable parameters, the maximum amount of GPU memory used, the number of FLOPs, training time per epoch, and prediction time per sample.

IV. RESULTS
The segmentation performances in terms of DSC of all models are illustrated in Figure 9.For each data set, the mean DSC scores are plotted (with point-wise standard deviation bars) as a function of the input size, and are given for the 2D, pseudo-3D with d = 3 to d = 13, and 3D models, and for the U-Net and SegNet backbones.These results are also tabled in Table III, along with summaries of the experiment setups per data set.
Randomly selected example segmentations are illustrated in Figure 10.For each data set, a prediction from the 2D models, pseudo-3D models with the proposed method, and 3D models are given, along with their respective ground truths.The BraTS19, KiTS19 and U-PRO segmentations are cropped for ease of viewing.We chose to omit examples for the channel-based pseudo-3D models because of their high level of similarity to the proposed method.Segmentations with the channel-based method, along with additional exemplary segmentations, can be found in Figure 14-16 in Section F of the Supplementary Material.
The computational costs of the models used for BraTS19 experiments are presented in Table II.The number of model parameters, graphical memory use, and FLOPs are only dependent on the model type, and therefore the corresponding columns in Table II are equal for all other data sets.The same variables are shown for the other data sets in Table IX-XII in Section B of the Supplementary Material, where the only differences are in the training and inference times due to the different numbers of samples; these two parameters scale with the data set size.

V. DISCUSSION
This study evaluated the inclusion of neighboring spatial context as an input of CNNs for medical image segmentation.Such pseudo-3D methods with a multi-slice input and singleslice output are commonly implemented by regarding the adjacent slices as additional channels of the central slice.Apart from this approach, we also proposed an alternative pseudo-3D method, based upon multiple preliminary 3D convolution steps before processing by a 2D CNN.Across    data sets and using U-Net and SegNet CNN backbones, we compared both these pseudo-3D methods, for an input size d = 3 up to d = 13, to end-to-end 2D and 3D CNNs with respectively single slice and whole volume inputs and outputs.Additionally, we evaluated a number of computational parameters to get a sense of each model's hardware requirement and load.

A. Computational costs
As seen in Table II, the computational costs are in line with what would be expected.The transition block adds a relatively small amount of extra parameters on top of the main 2D network, and the required amount of GPU memory and FLOPs scale accordingly with d.Since the input is still the same size as for the channel-based method, the training times per epoch are largely similar.One advantage of the fully 3D CNNs demonstrated in these results, is that prediction time is significantly faster because samples can be processed all at once instead of slice by slice.
The high computational cost of end-to-end 3D convolution is also demonstrated in Table II.The memory footprint is almost 35 times larger than the 2D U-Net; over 16 GB is required to train on the complete volumes, which is at or above the limit of most modern commodity GPUs.Both pseudo-3D methods use less than 5 % of the GPU memory consumed by the end-to-end 3D network, even at d = 13.It can thus be concluded that both pseudo-3D methods are computationally very efficient ways of including more inter-slice information, with the proposed method being slightly more expensive in terms of the GPU memory consumption compared to the channel-based method.

B. Quantitative analysis
As can be seen in Figure 9, overall, all experiments managed to produce acceptable segmentation results, even for data sets with complex structures such as the BraTS19 images, or with organs that can be hard to visually distinguish, such as in the U-HAND set.One obvious similarity between these data sets is that using a U-Net backbone outperforms the SegNet in nearly every case.Regarding the behavior as a function of input size d, the results in Figure 9 are inconclusive for almost all data sets.
In the plots from the BraTS19, KiTS19, IBSR18 and U-HAND data in Figure 9, there does not seem to be an additional benefit by adding more slices as input over an endto-end 2D approach.There seem to be some exceptions, like the surge at d = 13 in the U-HAND results, but in these four data sets the variance is either too high or the rate of increase is too low to draw any strong conclusions.For these cases, it would be doubtful if the accessory downsides, e.g.increased training time, are worth the at most marginal improvements in segmentation performance.Likewise, there seems to be no significant difference between our proposed method and the channel-based method in these four data sets.
The only data set in this study where DSC does seem to significantly improve with d is in the U-PRO.As more slices are being included in the input volume, the segmentation performance approaches that of a fully 3D network, and the proposed method outperforms the channel-based method by an increasing margin.While the overall improvement when going from 2D to pseudo-3D with d = 13 is arguably low, we can regard the U-PRO case as a demonstration of the possibility that pseudo-3D models can improve the segmentation performance over 2D methods.
Fully 3D CNNs seems to produce equal or worse results than their 2D and pseudo-3D counterparts in most cases.Again, the only exception seems to be in the U-PRO results, and in this case only when the U-Net is used as backbone network (see the respective plot in Figure 9).This could be explained by the much higher number of parameters of 3D CNNs, which makes them prone to overfitting.The high number of data samples in the U-PRO set, combined with the skip-connections that differentiates the U-Net from the SegNet, might have been enough to overcome the problem in this specific case.
There does not seem to be a straight-forward explanation as to why the U-PRO data set is an exception compared to the other data sets.In an attempt to connect DSC behaviour with d to differences in data set properties a feature-based regression analysis was performed.We computed features of the structures (ground truth masks) that describe each mask's structural properties: structure depth (i.e. the average number of consecutive slices a structure is present in), structure size relative to the total volume, and average structural interslice spatial displacement.The extracted feature values for all data sets and their respective structures can be found in the Supplementary Material Table IV-VIII.We found no significant agreements between models that could connect one of these data set features to DSC with respect to d.For more details about the feature extraction and regression analysis, see Sections A.1 and A.2 of the Supplementary Material.
Another distinction between the U-PRO set and the others included in this study is its much larger number of samples.As mentioned above, this could have been a contributing factor to the higher performance of the U-PRO 3D U-Net compared to other data sets' 3D CNN results.This feature was also hypothesized to influence the relation between d and DSC, and therefore the following analysis was performed: the same experiments were performed but now training on distinct subsets of 200 samples from the U-PRO data set.The average scores obtained from the five distinct subsets can be found in Figure 12 in the Supplementary Material, where we see a similar behavior as in Figure 9. Hence, we rule out the data set size as the main cause as well.
We, therefore, conclude that pseudo-3D methods have the potential to increase segmentation performance, but in the general case will not yield better results compared to conventional, end-to-end 2D and 3D CNNs.
Further analysis to explain the behaviour of the U-PRO data set might be performed.The regression-based feature analysis was somewhat rudimentary, and could likely be extended with e.g. more sophisticated models and more data.
Another possible follow-up study might be to investigate whether it is the multi-slice output (e.g.producing segmentations for all input slices) in pseudo-3D methods that improve the results in other studies.While this was out of the scope of this work, aggregating multiple outputs may be the main reason why pseudo-3D methods sometimes improve the segmentation performance.Based on our conclusions that using multi-slice inputs does not seem to improve the results on their own, the added benefit might only come into play from aggregation of multiple outputs.In this case, using something like Bayesian dropout could prove just as beneficial.

C. Qualitative analysis
It is important to emphasize that the images in Figure 10 are randomly selected single slices from thousands of samples and are therefore presented purely for illustrative purposes, and might not always be a representation of the overall segmentation performance of a particular data set.However, some remarks can be made that can be related to the quantitative results in Figure 9.The relatively large variance in segmentation performance between experiments of the BraTS19 data are demonstrated in Figure 10; as seen, the predictions can differ quite drastically within the same model and with varying d.This reflects the DSCs of the BraTS19 set presented in Figure 9.
It also appears that the U-Net is better at capturing fine structural details, while the SegNet segmentations seems to be coarser and simpler.This becomes particularly noticeable in data sets with complex structures, such as the gray matterwhite matter border in the IBSR18 images (Figure 10).This in turn results in an overall large difference in mean DSC between U-Net and SegNet.When the ground truth structures are more coarsely shaped, such as in the U-HAND set, the SegNet can keep up much better with the U-Net performance.

D. Effect of the Loss Function
In an earlier stage of this project, we employed a different experimental setup with a pure DSC loss function.However, these initial experiments proved this loss not to be sufficient for all data sets.Particularly the KiTS19 and U-HAND data sets yielded unacceptably unstable results which, even with exactly equal hyperparameters, could either result in fairly accurate segmentations or complete failure.Investigation of the DSCs of individual structures demonstrated that in these failed experiments, multiple structures did not improve beyond a DSC on the order of 0.1.After adapting the loss function to include also the CE term (see Equation 4), the results improved substantially for all data sets.Performance details for each run using the pure DSC and final loss function can be seen in Figure 13 and Table XIII in Section E of the Supplementary Material.

VI. CONCLUSION
This study systematically evaluated pseudo-3D CNNs, where a stack of adjacent slices is used as input for a prediction on the central slice.The hypothesis underlying this approach is that the added neighboring spatial information would improve segmentation performance, with only a small amount of added computational cost compared to an end-to-end 2D CNN.However, whether or not this is actually a sensible approach had not previously been evaluated in the literature.
Aside from the conventional method, where the multiple slices are input as multiple channels, we introduced here a novel pseudo-3D method where a subvolume is repeatably convolved in 3D to obtain a final 2D feature map.This 2D feature map is then in turn fed into a standard 2D network.
We investigated the segmentation performance in terms of the DSC and the computational cost for a large range of input sizes, for the U-Net and SegNet backbone architectures, and for five diverse data sets covering different anatomical regions, imaging modalities, and segmentation tasks.While pseudo-3D networks can have a large input image size and still be computationally less costly than fully 3D CNNs by a large factor, a significant improvement from using multiple input slices was only observed for one of the data sets.
We also observed no significant improvement of 3D network performance over 2D networks, regardless of data set size.
Because of ambiguity in the underlying cause of the behavior on the U-PRO data set compared to the results on the other data sets, we conclude that in the general case pseudo-3D approaches appear to not significantly improve segmentation results over 2D methods.

SUPPLEMENTARY MATERIAL
A. Structure Analysis 1) Feature Extraction: We selected three data set features that describe each data set's structural properties: structure depth, structure size relative to the total volume, and average structural inter-slice spatial displacement (see Table IV-VIII).These aforementioned structural properties are computed as follows.
The structure depth of class c = 1, . . ., C is computed as where p = 1, . . ., P denotes the patient p, r pc represents an unconnected region of class c in patient p, the φ rpc denotes the number of consecutive slices (in the axial dimension) of region r pc .Here P and R pc are the number of patients and unconnected regions of class c in patient p, respectively.The structure size relative to the total volume of class c is defined as where υ cp denotes the total number of voxels labeled as class c in patient p.The H, W , and D are the height, width, and depth of the input volume, respectively.
To compute the structure spatial displacement, we first compute the center of mass, ψ cps , of class c of patient p at slice s (in the axial dimension) as where r ijcps is the value of a voxel at coordinates (i, j) in class c in patient p and slice s, and where υ cps denotes the total number of voxels labeled as class c in patient p in slice s.
With these, the structure spatial displacement of class c is computed as where ψ cp(s−1) , ψ cps 2 denotes the Euclidian distance between two coordinate points, ψ cp(s−1) and ψ cps .
2) Regression Analysis: To find explanations to why the U-PRO data set is an exception compared to the other data sets, we attempted to connect DSC behaviour with d, that is the number of slices extracted from the whole volume with a total of D slices as the subvolume input, to differences in data set properties including structure depth, structure size relative to the total volume, and average structural inter-slice spatial displacement, and the number of training samples.
We aggregated all these structure properties to generate minimum, mean and maximum of Φ, Υ and Ψ over all classes in each data set, such that for each data set we obtained a fixed set of nine features.Overall, we formed a regression task to compute regression models for all models (U-Net and SegNet), architectures (2D, 3D, proposed, and channel-based), ds (number of slices extracted from the whole volume), and data sets including BraTS19, KiTS19, IBSR18, U-HAND and U-PRO with the following input features: − Φ min , minimum of structure depth over classes, − Φ mean , average of structure depth over classes, − Φ max , maximum of structure depth over classes, − Υ min , minimum of structure size relative to the total volume over classes, − Υ mean , mean of structure size relative to the total volume over classes, − Υ max , maximum of structure size relative to the total volume over classes, − Ψ min , minimum of spatial displacement over classes, − Ψ mean , average of spatial displacement over classes, − Ψ max , maximum of spatial displacement over classes.We used the Bootstrap (with 1 000 rounds) to compute the mean regression coefficient vectors and the corresponding confidence intervals the some regularised linear regression models.In the regression analysis, we used Ridge regression, Lasso, Elastic Net, and Bayesian ARD regression.The analysis was performed using scikit-learn 0.22 3 .However, the regression analysis was non-conclusive, revealing no relation between the structure features and the number of input slices, d.
3) Extracted Features: See Table IV-VIII for more details.

B. Supplementary Computational Results
To compare the computational cost of our proposed models to the corresponding 2D and 3D CNN models, we extracted the number of trainable parameters, the maximum amount of GPU memory used, the number of FLOPs, training time per epoch, and prediction time per sample.
The computational costs of the models used for BraTS19 experiments are presented in Table 2 of the main paper.The number of model parameters, graphical memory use, and FLOPs are only dependent on the model type, and therefore are equal for all other data sets.The same variables are shown here for the other data sets in Table IX-XII, where the only differences are in the training and inference times due to the different numbers of samples; these two parameters scale with the data set size.

C. SegNet Pseudo-3D architecture
Figure 2 in the main paper shows both pseudo-3D methods with a U-Net backbone.Here, Figure 11 shows the same methods but with a SegNet backbone.

D. U-PRO Subset Experiments Results
A distinction between the U-PRO set and the others included in this study is its much larger number of samples.This feature was hypothesized to influence the relation between d and DSC, and therefore the following analysis was performed: the same experiments were performed but now training on distinct subsets of 200 samples from the U-PRO data set.The average scores obtained from the five distinct subsets can be found here    in Figure 12, where we see a similar behavior as in Figure 9 in the main paper.Hence, we rule out the data set size as the main cause of the U-PRO performance behaviour.

E. Supplementary Quantitative Results
In an earlier stage of this project, we employed a different experimental setup with a pure DSC loss function.However, these initial experiments proved this loss not to be sufficient for all data sets.Particularly the KiTS19 and U-HAND data sets yielded unacceptably unstable results which, even with exactly equal hyperparameters, could either result in fairly accurate segmentations or complete failure.Investigation of the DSCs of individual structures demonstrated that in these failed experiments, multiple structures did not improve beyond a DSC on the order of 0.1.After adapting the loss function to include also the CE term, the results improved substantially for all data sets.Performance details for each run using the pure DSC and final loss function can be seen in Figure 13 and Table III.

F. Supplementary Qualitative Results
Example segmentations are illustrated in Figure 14

Fig. 2 :
Fig. 2: The proposed methods illustrated with the U-Net backbone.The output is the prediction for the central slice of the input.The numbers in the transition block indicate the depth and in the backbone the number of filters.Top: The Proposed Method where the transition block uses 3D convolutions and 2D padding to iteratively reduce the input from depth d to 1, while the width and height remain.Bottom: The Channel-Based method, where neighboring slices are input as separate channels, and the input can be fed into a 2D CNN right away.

Fig. 3 :
Fig. 3: Umeå Pelvic Region Organs data set.From top to bottom: images and ground truth images of the prostate (red), bladder (green) and rectum (blue).

Fig. 4 :
Fig. 4: Umeå Head and Neck Database.From top to bottom: images and ground truth images at different slices of the left and right submandibular glands (red and green), left and right parotid glands (dark blue and yellow), larynx (light blue), and medulla oblongata (pink).

Fig. 5 :
Fig. 5: Manual expert annotation of two patients with HGG from the Brain Tumors in Multimodal Magnetic Resonance Imaging Challenge 2019 data set.Shown are image patches with the tumor structures that are annotated in the different modalities.The image patches show (from left to right): (1) the whole tumor visible in FLAIR, (2-3) the enhancing and tumor structures visible in T1w and T1c, respectively, and (4) the final labels visible in T2w.The segmentations are combined to generate the final labels of the tumor structures: the necrotic and non-enhancing tumor core (NCR/NET-label 1, red), the peritumoral edema (ED-label 2, green) and the GD-enhancing tumor (ET-label 4, yellow).

Fig. 7 :
Fig. 7: Internet Brain Segmentation Repository data set.Axial slices of three patients with the ground truth of the cerebrospinal fluid (red), white matter (green) and gray matter (blue).
The rectified linear unit (RELU) function was used as the intermediate activate function.The activation function of the final layer was the SOFTMAX function.Each data set was split into 80 % training and 20 % test set, and with the training set, in turn, being split into 80 % for training and 20 % for validation.

Fig. 11 :
Fig. 11: Proposed methods with a SegNet backbone.The output is the prediction for the central slice of the input.The numbers in the transition block indicate the depth and in the backbone the number of filters.Top: a transition block uses 3D convolution and 2D padding to iteratively reduce the input from depth d to 1, while the width and height do not change.Bottom: the neighboring slices are regarded as multiple channels, and the input can be fed into the 2D CNN right away.

Fig. 12 :
Fig. 12: Mean and standard error of test set on the U-PRO data set.Each run was trained on 200 patients.

Fig. 13 :Fig. 14 :
Fig.13: Mean and standard deviation of five runs on BraTS19, KiTS19, IBSR18, U-HAND and U-PRO data sets using the soft DSC loss.

TABLE I :
Data sets and augmentation techniques in this study.W , H, and D each denote that the volume shape is varied in width, height, and/or depth, respectively.α, β, and γ each denote that the voxel spacing is varied in width, height, and/or depth, respectively.

TABLE II :
Architecture comparison.Experiments on U-Net architecture and multimodal BraTS19 data set.Patch shape was set at 160 × 192 × d where d is the number of slices.Here, t and p denote the training time per epoch and prediction time per sample, respectively.

TABLE IV :
Mean and standard deviation of of each class's depth, Φ c , count over the total number of pixels, Υ c , spatial displacement, Ψ c , in the axial direction over all volumes in the BraTS19 data set.

TABLE V :
Mean and standard deviation of of each class's depth, Φ c , count over the total number of pixels, Υ c , spatial displacement, Ψ c , in the axial direction over all volumes in the KiTS19 data set.

TABLE VI :
Mean and standard deviation of of each class's depth, Φ c , count over the total number of pixels, Υ c , spatial displacement, Ψ c , in the axial direction over all volumes in the IBSR18 data set.

TABLE VII :
Mean and standard deviation of of each class's depth, Φ c , count over the total number of pixels, Υ c , spatial displacement, Ψ c , in the axial direction over all volumes in the U-HAND data set.

TABLE VIII :
Mean and standard deviation of of each class's depth, Φ c , count over the total number of pixels, Υ c , spatial displacement, Ψ c , in the axial direction over all volumes in the U-PRO data set.

TABLE IX :
-16.It is important to emphasize that the images are randomly selected single slices from thousands of samples and are Architecture comparison.Experiment on U-Net architecture on KiTS19 data set.Patch shape was set at 256×256×d where d is the number of slices.Here, t and p denote the training time per epoch and prediction time per sample, respectively.This experiment was performed on an NVIDIA GeForce GTX 1080 Ti.

TABLE X :
Architecture comparison.Experiment on U-Net architecture on IBSR data set.Patch shape was set at 256 × 128 × d where d is the number of slices.Here, t and p denote the training time per epoch and prediction time per sample, respectively.This experiment was performed on an NVIDIA GeForce GTX 1080 Ti.

TABLE XI :
Architecture comparison.Experiment on U-Net architecture on U-HAND data set.Patch shape was set at 256 × 256 × d where d is the number of slices.Here, t and p denote the training time per epoch and prediction time per sample, respectively.This experiment was performed on an NVIDIA GeForce GTX 1080 Ti.

TABLE XII :
Architecture comparison.Experiment on U-Net architecture on U-PRO data set.Patch shape was set at 256×256×d where d is the number of slices.Here, t and p denote the training time per epoch and prediction time per sample, respectively.This experiment was performed on an NVIDIA GeForce GTX 1080 Ti.