Breast MRI segmentation for density estimation: Do different methods give the same results and how much do differences matter?

Purpose To compare two methods of automatic breast segmentation with each other and with manual segmentation in a large subject cohort. To discuss the factors involved in selecting the most appropriate algorithm for automatic segmentation and, in particular, to investigate the appropriateness of overlap measures (e.g., Dice and Jaccard coefficients) as the primary determinant in algorithm selection. Methods Two methods of breast segmentation were applied to the task of calculating MRI breast density in 200 subjects drawn from the Avon Longitudinal Study of Parents and Children, a large cohort study with an MRI component. A semiautomated, bias‐corrected, fuzzy C‐means (BC‐FCM) method was combined with morphological operations to segment the overall breast volume from in‐phase Dixon images. The method makes use of novel, problem‐specific insights. The resulting segmentation mask was then applied to the corresponding Dixon water and fat images, which were combined to give Dixon MRI density values. Contemporaneously acquired T1‐ and T2‐weighted image datasets were analyzed using a novel and fully automated algorithm involving image filtering, landmark identification, and explicit location of the pectoral muscle boundary. Within the region found, fat‐water discrimination was performed using an Expectation Maximization–Markov Random Field technique, yielding a second independent estimate of MRI density. Results Images are presented for two individual women, demonstrating how the difficulty of the problem is highly subject‐specific. Dice and Jaccard coefficients comparing the semiautomated BC‐FCM method, operating on Dixon source data, with expert manual segmentation are presented. The corresponding results for the method based on T1‐ and T2‐weighted data are slightly lower in the individual cases shown, but scatter plots and interclass correlations for the cohort as a whole show that both methods do an excellent job in segmenting and classifying breast tissue. Conclusions Epidemiological results demonstrate that both methods of automated segmentation are suitable for the chosen application and that it is important to consider a range of factors when choosing a segmentation algorithm, rather than focus narrowly on a single metric such as the Dice coefficient.


INTRODUCTION
Mammographic density, a quantitative measure of radiodense fibroglandular tissue in the breast, is one of the strongest predictors of breast cancer risk. Women with more than 75% density have a fourfold or higher risk of breast cancer compared to those with less than 5%. 1 More intensive screening for women with high mammographic density has been proposed 2 but remains controversial. 3 However, in clinical practice, mammographic density, as assessed on x-ray mammograms, is generally reported using only qualitative, radiologist-assessed categories, and agreement between radiologists tends to be only moderate. 4 Quantitative analysis is hampered by the fact that breast density is an inherently 3-D material property and therefore not well suited to measurement using 2-D x-ray projections. Although subsequent risk assessment and epidemiological analysis rarely use full 3-D information (normally preferring a single number, i.e., the volume-averaged mean breast density), accurate derivation of such a statistic from the 2-D x-ray data is problematic and subject to error. Automated tools, such as Volpara (Vol-paraSolutions, Wellington, NZ) 5 and QUANTRA (Hologic Inc., USA), are gaining traction in the mammography community, suggesting that mean breast density can be calculated without inter-reader bias. However, such readings may be affected by errors in estimating breast thickness 6 and the relation between the values of breast density reported and those obtained by other techniques remains to be elucidated. 7 Increasingly, magnetic resonance imaging (MRI) mammography is being used in clinical and research settings to assess breast structure, because of its 3-D capabilities, its nonionizing nature and the strong soft tissue contrast between fibroglandular (parenchymal) and fatty tissue. In an MRI context, breast density refers to the percentage of breast tissue volume that is deemed to be "parenchymal" and this is generally assumed to be the same as volume fraction of tissue whose MR signal arises from free water molecules (i.e., the "water fraction" or "percentage water"), as opposed to fat. Clearly, this is not an exact equivalent of the mammographic x-ray density. Nevertheless, Thompson et al. 8

demonstrate a clear correlation between the two.
At present, manual evaluation of MRI 3-D breast density is an arduous, observer-dependent, and time-consuming process. Therefore, full or partial automation of the 3-D analysis of the breast is required. To achieve the desired segmentations of breast parenchymal volume and breast fat volume, two separate image processing tasks are required. First, the breast as a whole needs to be distinguished from the background and chest wall; and, second, the parenchymal tissue within the breast needs to be distinguished from fat.
Several different MRI pulse sequences have previously been used to assess breast density, but no definitive consensus has been reached about which is optimal. Few studies have compared different sequences within the same subject population. Furthermore, while there is a large body of prior literature (see Table I) describing different ways to achieve the two segmentation tasks described above, no studies, to date, have compared different automated methods with each other and with manual segmentation, for a sizeable subject population.
It is clear that many methods can produce "good" segmentation results. This study poses the following question: Do the minor differences we see between segmentations when we apply different algorithms on the same data actually matter for the uses to which the segmentations are ultimately put?
This study compares two very different methods of breastoutline segmentation: (a) an established 37 bias-corrected fuzzy C-means (BC-FCM) clustering technique based on a cost-function; and (b) a new heuristic approach based on thresholding, landmark identification, and direct analysis of image features. The results of this part of the study will be measures of overall breast volume from each method and volume similarity measures (Dice and Jaccard coefficients).
With the breast outline obtained, the second part of the study compares two methods of fat-water discrimination, again based on different principles. (a) The Dixon approach 38 uses scans acquired with an MRI technique that returns separate "fat" and "water" images. In principle, these allow us to obtain a fat and water fraction for every voxel, accounting for partial volume effects. However, Dixon sequences are not currently part of the routine acquisition protocol for clinical MRI examinations. 39 (b) Our second method uses an analysis of the intensity histograms of the two different tissue classes in fat-suppressed T 1 -weighted (T1w) and T 2 -weighted (T2w) images. Such images are routinely acquired in diagnostic scanning and this method thus has the potential advantage of wider applicability if the two methods are shown to be concordant. Note that there is no means of obtaining ground truth data and, given that we are dealing with a healthy subject cohort, no possibility of obtaining x-ray data for comparison.
Nomenclature for the various segmentations is summarized in Fig. 1.
A comprehensive epidemiological analysis of the relationship between breast composition and seven other physical, historical and lifestyle variables has been carried out for this cohort. While the full report is beyond the scope of this study, we summarize the results and use them to discuss quantitatively the impact of differences between the various assessment methods on conducting reliable clinico-epidemiological studies.

2.A.1. Study population
This work forms part of an investigation into breast composition at young ages, nested within the Avon Longitudinal Study of Parents and Children (ALSPAC). ALSPAC originally recruited 14,541 pregnant women resident in Avon, UK with expected dates of delivery from 1 April 1991 to 31 December 1992, as described by Boyd et al. in a cohort profile paper. 40 For this substudy, Caucasian nulliparous women were invited to attend an MRI examination at the University of Bristol Clinical Research and Imaging Medical Physics, 44 (9), September 2017 TABLE I. Summary of journal papers describing methods to segment pectoral muscle and internal fibro-glandular tissue from MR images. N OB refers to the number of observers who provided the gold standard manual segmentation. N D indicates the number of MR data sets the method was validated with and N S the number of MRI scanners. N/A = not applicable; N/S = not specified.  FIG. 1. Flow diagram of the overall data processing chain and nomenclature for the various segmentation methods. Some of these have the potential to operate on different source data and we can also combine the methods in different ways to achieve an overall result. We thus assign each step three codes: segmentation purpose (V = breast volume, FW = fat-water); degree of automation (m = manual, s = semi-automatic, a = fully automatic); and source data (D = Dixon; T1 = T 1 -weighted, T2 = T 2 -weighted, T12 = uses both T 1 -and T 2 -weighted data). Thus, a breast-volume measurement using semiautomatic segmentation on original Dixon data would be represented as VsD. Fat-water segmentations require both source data and a previously generated volume mask, so are represented by the combination of two codes. For instance, fat-water statistics calculated semiautomatically from Dixon source data and using a mask generated automatically from T1w and T2w data would be described by VaT12-FWsD. We note one additional case, in which the volume mask VaT12 is re-sampled to give a result in the same coordinate space as the Dixon images and we assign this the label VaT12D.
Medical Physics, 44 (9), September 2017 data that are available through a fully searchable data dictionary. 41

2.A.2. MR imaging
Participants underwent a breast MRI scan using a 3T Siemens Skyra MR system with a breast coil that surrounds both breasts of a prone patient. Three sets of bilateral images were acquired: • multislice, sagittal Dixon 38 images (in-phase, out-ofphase, water and fat), acquired using a turbo spin-echo sequence with nominal in-plane resolution of (0.742 9 0.742) mm 2 , nominal slice thickness 7 mm and interslice spacing 7.7 mm; • T1-weighted 3D images, acquired using a VIBE sequence with fat saturation and a nominal resolution of (0.759 9 0.759 9 0.900) mm 3 , as routinely used in clinical dynamic contrast-enhanced MRI protocols for the breast; • multislice, axial, T2-weighted images, acquired using a turbo spin-echo sequence, with nominal in-plane resolution of (0.848 9 0.848) mm 2 , and both slice thickness and spacing between slices 4 mm.

2.A.3. Manual reference segmentation
To assess breast volume, a manual segmentation protocol (as described in the Supplementary Information) was developed and used by three readers (RD, MB, and ISS) independently to outline the breast from surrounding tissues in the Dixon images, using ITK-SNAP (version 3.0.0). All subjects had a manual segmentation of all breast slices performed by at least one reader. The datasets of 16 representative subjects were manually segmented twice by all three readers to assess between-and within-observer variation. In cases where more than one manual segmentation is performed, the VmD and VmD-FWsD results quoted below represent the median values taken for the multiple manual readings.

2.A.4. Training and validation data sets
A training set of 100 randomly selected subjects was used to make initial comparisons across MR images and segmentation methods, and for the manual readings, between-and within-observer variation. The training data were used to assess the common reasons for segmentation failure and to improve the algorithms. At the end of the testing phase, the algorithm code was "frozen" and final comparisons of the segmentation methods were completed on a second set of images from a further 100 participants. Except where stated otherwise, all the summary statistical results presented here come from this second, "validation" cohort. For further details concerning statistical methods, please see the Supplementary Information.

2.B.1. Semiautomated, bias-corrected fuzzy C-means (BC-FCM)
A fuzzy C-means (FCM) algorithm was applied to the Dixon in-phase images. It has the advantage that it can be modified to carry out a simultaneous intensity inhomogeneity compensation, or bias-correction (BC), and this is potentially less expensive computationally than a prefiltering operation. 42 The algorithms in this section were implemented using IDL (Harris Geospatial Systems, Melbourne, FL, USA) and run on a standard desktop computer.
The BC-FCM variant we implemented is described in. 37 Formally, the algorithm does not require a training dataset and so is an unsupervised clustering algorithm. However, in practice, some experience with the types of data involved can improve the results dramatically. Except for the local smoothness criterion (introduced by cost function c in ref. [37] see this publication for all other related notation), BC-FCM per se does not use any spatial information. Nevertheless, a "good" segmentation involves a number of problem-specific insights and the basic BC-FCM method above was enhanced by additional heuristic algorithms in the spatial domain, based on the results obtained with the training data.
Initial parameters and iteration threshold: After some experimentation, b(r) was set to 0.1 for all spatial locations and e to 0.01. The two initial class centroids c f were calculated by taking the mean of the slice being processed and adding a lower and an upper offset. These two offsets are adjustable parameters under user control. For many subjects (see the Results section for an example), a single set of defaults performed extremely well. However, for a small subset of "difficult" cases (second example in Results), user interaction was needed to try various combinations. As implemented here, on a standard desktop computer, running nonoptimized software, it took around 2 min to run the segmentation algorithm on each 3-D dataset. Thus, this "trial and error" step was the most frustrating feature of the BC-FCM method in practice. Numerous coding and hardware improvements (e.g., parallelization) could be made to the prototype to improve the user experience, potentially allowing these adjustable parameters to be altered by simple slider controls with immediate feedback.
We observed an improvement in performance by allowing the algorithm to perform separate BC-FCM classifications for segmenting the posterior of the breast from the chest wall and segmenting the anterior portion from air, then merging the two volumes. Furthermore, it was noted that the optimal offsets providing the initial class centroids were often different for these two segmentation problems. Thus, each dataset is split into two portions in an anterior-posterior (AP) direction and the BC-FCM algorithm applied twice per image slice. Given that the size of breasts varies, the position of the AP-split is also different for different datasets and this is Medical Physics, 44 (9), September 2017 handled automatically by having two passes through the entire algorithm with an automated choice of the AP-split position made after Pass 1.
Morphological operations: The breast outlining task requires a definite boundary to be drawn. Thus, it is not necessary to use the full membership function output of the BC-FCM routine, and we arrange for the clustering to produce a binary image. This may include some misclassified regions outside the breast and some "holes" inside the breast. To remove the unwanted regions, 2D hole-filling followed by a 4-neighborhood connectivity search and object labeling is performed. The largest nonbackground object in each slice is identified as the breast region and other smaller objects are removed from the binary image. This exercise is repeated for all slices and these are then merged to form an approximate breast volume.
Within this approximate breast volume, there may be some nonbreast tissue segmented for cases in which fatty breast tissue is connected to the chest and liver; and there may also be some unsegmented breast tissue left for cases in which dense breast tissue is connected to the chest wall muscles. To reduce these over-and undersegmentations, 3D morphological image opening is performed, followed by closing using two cylindrical structuring elements having the same radius of 3 voxels but different heights of 3 voxels and 25 voxels in the axial direction. These parameters were found by experimentation during our previous study. 37 Lateral cutoffs: The preceding steps in the process do an excellent job in segmenting the anterior and posterior margins of the breast. However, there is no consensus in the literature as to "where the breast stops" in the right-left and superior-inferior directions. The extent of the breast is not directly delineated by any change in MRI contrast and the required boundary may, indeed, be specific to the application of the imaging (e.g., when comparing the MRI segmentation with the breast region compressed within the paddles of a mammography system, the axilla region may be excluded entirely). Thus, based on the consensus protocol (Appendix S1) reached by the three experienced readers, a heuristic algorithm was developed, as described below. This additional truncation is derived entirely from geometric considerations and boundaries are drawn without regard to image intensity, which is in many cases the same on either side of the boundary.
Each breast is processed in turn. The stack of sagittal images segmented using BC-FCM forms a pseudo 3-D dataset. From this dataset, the transverse plane containing the largest breast area is passed to a simple algorithm that extracts the air-breast interface as a 1-D "breast profile". (This geometry is illustrated as Figure S2 of the Supplementary Information.) The profile is used to determine the position of the breast midpoint in a left-right direction. Working outwards from this midpoint, we find the first position at which the absolute value of the gradient (approximated by the finite difference between adjacent voxels) of the breast profile rises above a threshold value, determined by experimentation. This indicates a change in angle of the skin surface from flat regions between and outside the breasts, to the side contour of the breast. A mask is applied to exclude all sagittal slices in the original dataset on either side of these changes in angle. (Typically, the "raw" output of the BC-FCM algorithm would include these.) Finally, a similar profile is generated for the superior-inferior direction and the upper and lower bounds of the breast are determined in each sagittal plane of the original data.

2.B.2. Fully automated, using T1w and T2w images
Preprocessing processing (bias-field correction): A slowly varying bias-field, caused by inhomogeneities in the magnetic field during the MR acquisition, is a common artifact of MR images. To correct this for the T1w and T2w images, we apply the "N4ITK" nonparametric nonuniform intensity normalization method. 43 This is a refinement of the popular N3 algorithm which adopts a fast, robust B-spline fitting algorithm and a hierarchical, multiscale, optimization scheme [Figs. 2(a) and 2(b)].
Breast mask segmentation: This novel, heuristic method, implemented using the Insight Toolkit, 44 computes a whole breast mask using both the T1w and T2w images. In developing this automated approach, emphasis has been placed on limiting the number of empirically derived parameters and relying instead on detecting statistical or functional extrema. In this way, we aim to make the method as widely applicable to variations in subjects and images as possible. The method comprises a number of distinct processing steps as follows.
1. The T2w image is resampled to match the resolution of the T1w image. 2. A grey-scale closing operation along each of the orthogonal axes, x, y and z, is performed on the T2w image, to eliminate voids from the subsequent foreground segmentation. In this operation, each voxel's intensity, I T2w , at index (i, j, k) is replaced by I cT2w (i, j, k) according to: where N i , N j , N k are the number of voxels along each axis.
Medical Physics, 44 (9), September 2017 3. The T1w image is rescaled to match the intensity range of the closed T2w image and the maximum of these two images, I MaxT1wT2w , computed. 4. The foreground (i.e., the subject) is segmented from the background by thresholding, I MaxT1wT2w . The threshold, t bg , is computed via: according to the following functional criteria: • The background is assumed dark therefore the threshold should be close to zero: • The frequency of voxel intensities in the background is higher than the foreground, i.e., the background intensities form a distinctive peak in the image histogram, P(I), which is captured by a sharp rise in the cumulative intensity distribution function: • The background has a lower intensity variance than the foreground: F var ðIÞ ¼ P I j¼0 PðjÞðj À lÞ 2 P maxðIÞ k¼0 PðkÞðk À lÞ 2 The resulting foreground mask image is denoted I fgsee Fig. 2(d).
5. Landmark identification. The most anterior voxels in the foreground mask, I fg , on the left and right sides of the volume, are identified and assumed to be approximately coincident with the nipple locations. If multiple voxels are found, then the center of mass of the cluster is computed. The midsternum is computed as the most anterior voxel of the foreground mask, equidistant from the nipple landmarks in the coronal plane. 6. Pectoral muscle boundary extraction. Various methods have been presented in the literature to segment breast MRI volumes and the pectoral muscle (Table I). These include semiautomated methods requiring user interaction, 31,33,36 2D midslice template registration, 36 statistical shape models, 25 and atlas-based methods. 16,[18][19][20]24,25 A number of methods have been developed to segment explicitly the pectoral muscle. These include a B-spline fit to the intensity gradient of the pectoral boundary, 33 anisotropic diffusion and Canny edge detection, 17 and Hessian matrix planar shape filtering. 15,46 Atlasbased methods have been shown to perform well but are computationally intensive 47 and require significant initial investment of time to develop a library of atlases.
We have developed a method to detect explicitly the anterior pectoral muscle boundary in individual MR volumes. Our approach has similarities to the Hessian processing of Wang et al., 15,46 in that it employs Gaussian derivatives to detect regions in the image with a planar profile. However, rather than computing a ratio of the eigenvalues of the Hessian matrix and thresholding the result, we obtain a direct classification of linear structures, immediately posterior to the sternum, using Oriented Basic Image Features (OBIFs, Fig. 3).
The concept of Basic Image Features (BIFs) was developed by Griffin. 48 The technique classifies pixels in a 2D image into one of seven classes according to the local zero-, first-, or second-order structure. This structure is computed using a bank of six derivative of Gaussian filters (L 00 , L 10 , L 01 , L 20 , L 11 and L 02 ) which calculate the nth (where n = 0,1,2) order derivatives of the image in x and y (S 00 , S 10 , S 01 , S 20 , S 11 and S 02 ). By combining the outputs of these filters, any given pixel can be clas- given  Fig. 3(c)]. 7. Finally, we generate a 2D coronal mask, I CNL , to crop nonbreast tissue from the whole breast mask. I CNL is computed from a coronal skin elevation map, I skin2D, which contains the distance of each anterior skin voxel in the foreground mask, I fg , from the most posterior boundary of the MR volume. The coronal profile of each breast is obtained by thresholding I skin2D at where h ms is the anterior elevation of the midsternum landmark, and h Ln and h Rn are the left and right nipple anterior elevations, respectively. The roughly circular profile obtained for each breast is then dilated by 10 mm and the mask squared off, to create a superior-lateral corner and hence extend the breast volume into the axilla [ Fig. 4(c)]

2.C.1. Semiautomated calculation of percentage breast density, based on dixon images
In principle, the output from a Dixon pulse sequence is a set of images reflecting water content I w (r), which we identify with the parenchymal component of the breast, and an equivalent set I f (r) reflecting fat content. Ideally, these images would be quantitative and allow the direct calculation of the water and fat fractions / w (r) and / f (r) via the equation 49 In practice, there are a number of complicating factors: • Parenchymal tissue and fat have different relaxation properties and, since the acquisitions are not generally designed to be proton density weighted, this means that the relative intensities of equal fractions of fat and water are different.
• The B 1 field of the probe is not uniform across the whole breast and this leads to a spatially dependent efficacy of the fat-water separation.
• In practice, the fat tissue does not have a single proton resonance.
• Different manufacturers have different proprietary image reconstruction methods and these may influence the quantitative results.
Our solution to (at least) the first of these problems is to proceed as follows: (a) Identify a small region in the water image that is expected to be entirely composed of parenchymal tissue. The region should be in a part of the image that is free from intensity artifacts caused by proximity to the RF coil (i.e., the data should come from a homogenous region of B 1 ). (b) In the fat image, identify similarly a second region entirely composed of fat. (c) Calculate the ratio of the average voxel values in each of the two regions: where N w and N f are the numbers of voxels in the selected regions-of-interest ROI w and ROI f , respectively. (d) Replace the value I f in Eq. (10) with rI f . This procedure potentially improves the accuracy of the water-fraction calculation but at the cost of introducing an interactive step into the density estimation process. We have not tested in a systematic fashion the influence that the size and shape of the region-of-interest selection have on the process, in part because we have no ground truth values. A further issue with this technique is that in the limiting cases of extremely dense or extremely fatty tissues, it may not be possible to find appropriately "pure" regions of both types.

2.C.2. Fully automated, using T1w and T2w Images
Fuzzy c-means (FCM) clustering has been evaluated by a number of studies to classify the internal structure of the breast into fat and fibroglandular tissue classes 16,18,29,31,[33][34][35]50 Table I). Song et al. 50 adopt a Gaussian kernel FCM, while Sathya 34 use a quadratic kernel FCM to train a support vector machine (SVM). In, 29 Wang et al. use a multiparametric hierarchical SVM classification approach to segment the internal breast and found this to be superior to both a conventional SVM 28 and FCM segmentation. T1W, T2W, proton density, and three-point Dixon (water and fat) images were all incorporated. Klifa et al. 31 compared the resulting volumetric MRI density measurement of their method with mammography but found only modest correlation (R 2 = 0.67). In, 20 a probabilistic atlas approach was proposed. This requires a sizeable number of prelabeled atlases to be created, considerable computation to register them and assumes correspondence between fibroglandular structures across the population. To address the latter, a Markov random field (MRF) was introduced to spatially regularize the classification of each voxel according to that of its neighbors. Similarly, Wu et al. 16 use the registered atlas as a pixel-wise fibroglandular likelihood prior for a multivariate Gaussian mixture model and demonstrate superior performance when compared to FCM using a manual thresholding approach as the gold standard. In a later publication, 19 the same authors investigate a continuous max-flow (CMF) algorithm to generate a voxel-wise likelihood map using the same atlas initialization. They demonstrate that this approach performs better with the atlas initialization than without, but that FCM is superior to the CMF approach without the atlas.
Mixture models have also been proposed by Yang et al. 32 who implement a method using a Kalman filter-based linear mixing. They demonstrate it out-performs a c-means method but evaluation using real MR data was limited.
Our segmentation of the T1 and T2 MRI data into fat and glandular tissue is a modification of that proposed by Van Leemput et al. 51 in which an intensity model and spatial regularization scheme are optimized using a maximum likelihood formulation of the expectation-maximization (EM) algorithm. The EM algorithm iteratively updates the Gaussian probability distributions used to estimate the intensity histograms of each tissue class (fat and nonfat) via a maximum likelihood formulation. In order to improve classification of voxels in which the partial volume of fat and glandular tissues is a significant factor, a Markov random field (MRF) regularization scheme is employed to ensure spatial consistency. The MRF modifies the probability of a particular voxel being assigned to either the fat or glandular classes (or a proportion of either) according to the current classification of neighboring voxels. In this way, isolated regions of glandular tissue in very fatty regions, for instance, are penalized in favor of a more realistic and anatomically correct arrangement of the classes.

2.D. Epidemiology
Appropriate linear and logistic regression models were used to examine associations of average total breast, fat and water volumes, and percent water, as measured using different MR images and segmentation methods, with selected established and potential mammographic density correlates.
Breast measures were log-transformed and the exponentiated estimated regression parameters represent the relative change (RC) in breast measure with a unit increase, or category change, in the exposure of interest (with 95% confidence intervals (95% CI) calculated by exponentiating the original 95% CIs). Age at menarche (months), height (cm), and BMI (height (cm)/ weight (kg) 2 ) at MR were treated as continuous variables and centered at the mean. Current hormone contraceptive use, cigarette smoking, and alcohol drinking were treated as binary (yes/no) variables. Mothers mammographic density (%) was averaged between both breasts, and maternal age (months) at mammography and clinically measured or self-reported maternal BMI (median 3 yr (interquartile range (IQR) = 1.5 yr) prior to mammography)) were used as continuous measures and centered at the mean. Variables were included as potential determinants of breast measures, or as confounding factors, where appropriate.
Data analysis was conducted with STATA statistical software, Version 14. Figure 5 shows an example of the two methods applied to a dataset containing medium-sized breasts, with a moderate parenchymal content. There is a border of fat around the parenchyma, which, at the posterior of the breast, leads to excellent contrast at the boundary with the chest wall, making segmentation a relatively straightforward task. Results are shown for two separate manual segmentations by the same experienced observer; for the BC-FCM method from ref. [37]; the BC-FCM method with additional heuristics and default parameters, as described above; and the new method based on T1 and T2 images (VaT12). It will be seen that the segmentation performance is excellent, with only minor difference between the methods. Note how implementation of guidelines developed during the manual segmentation process supplements the BC-FCM approach in order to cut off the segmentation in both the left-right and superior-inferior directions, where there are no corresponding intensity boundaries seen in the image data themselves. Table II shows the Dice and Jaccard coefficients for the four sets of segmentations illustrated in Fig. 5, confirming the excellent performance of all the algorithms.

3.A. Breast outline segmentation
By contrast, Fig. 6 illustrates a case where all assessment methods have far more difficulty in providing a correct segmentation. Smaller breasts tend to be more problematic to segment, as a higher fraction of the segmentation involves partial-volume effects. Highly parenchymal breasts have very low (sometimes no) contrast between the parenchyma and pectoral muscles of the chest wall, and the intensity-based BC-FCM algorithm has particular difficulties in this regard. Many slices require a high degree of anatomical knowledge to perform the segmentation. Consider the two versions of the BC-FCM results presented. With the default parameters in the upper of the two rows, Medical Physics, 44 (9), September 2017 oversegmentation occurs in slice 11 and part of the chest wall is included in the parenchymal breast region. By contrast, with the "best" set of parameters (as found by repeating the algorithm and manually adjusting them), the lower row shows that the problem in slice 11 is corrected, with good matching of the pectoral muscle contour, but only at the cost of introducing an undersegmentation in slice 8, and, worse, losing the segmented breast region entirely in slice 6. In practice, where such problems occurred, it was necessary to edit the final segmentations manually. (Note on terminology: As shown in Fig. 6, the "BC-FCM/heuristics (VaD)" method cannot reliably be run for the whole cohort using only default parameters and so we must describe the technique as semi-rather than fully automated. Even for cases where no manual editing or parameter adjustment need to be performed, human inspection is still TABLE II. Dice and Jaccard coefficients for the "easy" segmentation problem of Fig. 5. Note that the BC-FCM/heuristics (VaD) represents the fully automated version, running with default parameters. required to confirm this. All subsequent cohort statistics will therefore use the nomenclature VsD to reflect this.) We have run a similar analysis on all 16 cases for which we have duplicate manual segmentations by all three observers. The detailed results are shown in the Supplementary Information.
A second method of examining the relation between the volume segmentation results is to plot the total breast volume obtained by one method against that of another. In the scatter plots of Figs. 7(a)-7(c), the x-and y-coordinates of each point represent the mean, for a single subject, of the left and right breast volumes evaluated, respectively, by the two methods under consideration. Figure 7(a) compares VsD, the semiautomated BC-FCM method using Dixon image input, with the "gold-standard" median manual segmentation, VmD, measured on the same Dixon dataset. Figure 7(b) gives results for the VaT12 method, which operates on the T1w and T2w datasets and evaluates the breast volume in the coordinate space of the T1w dataset. Finally, Fig. 7(c) looks at the effect of resampling the map generated by the algorithm in (b) with the spatial resolution and frame of reference of the Dixon data, which we term VaT12D. In each case, the line of identity is shown and Table IV reports the corresponding interclass correlations (ICC), representing the proportion of variance across participants shared between different ascertainment methods. Figures 8 and 9 present the results of the fat and water segmentation in the same format as for the total breast volume. In this case, however, a further option is available. Although the breast outline segmentation VaT12 requires both the T1w and T2w data, once this mask is available, it is possible to obtain two separate fat-water segmentations one using just the T1w and one using just the T2w data. These are denoted VaT12-FWaT1 and VaT12-FWaT2, respectively.

3.B. Fat-water segmentation
The interclass correlation (ICC) for total water volume, representing the proportion of variance across participants shared between the different ascertainment methods, are given in Table V.

3.C. Epidemiological results
A diagrammatic summary of the results of the epidemiological analysis is presented in Fig. 10 and further details of the work are reported as supplementary information.
Associations with both breast volume and breast water fraction were found for current body mass index (BMI). For a 1 kg m À2 increase in BMI, a relative change in breast volume of 1.13 [1.10, 1.16]  A weak association between current height and breast volume was also observed. For a 1 cm increase in height, the analysis methods gave the following relative increases in breast volume: VmD No associations were found with any of age of menarche, use of oral contraception, smoking, alcohol intake or maternal mammographic density.
From the similarity of all these statistics, we conclude that the exact details of the segmentation methods are not significant at the level of this cohort analysis.

DISCUSSION
Our results show that, as in many segmentation problems, the degree of success of the automated algorithms varies significantly between subjects. Figure 5 and Table II demonstrate excellent performance by all of the algorithms, whereas the degree of correspondence with the expert manual segmentation is considerably poorer in Fig. 6 and Table III. However, it should be noted that even the expert human observer is less able to provide a good repeat segmentation.
The ICCs for total breast volume in Table IV demonstrate good agreement between all methods, but interestingly, slightly closer agreement between VaT12 and the two Dixon-based methods (VmD or VsD) than between VaT12D and the Dixon methods. As described above, VaT12D is created by simply resampling VaT12 in the Dixon coordinate space, which has a coarser slice thickness, using appropriate blurring and nearest neighbor interpolation. Although movement between the Dixon and T1w or T2w scans could explain this disparity, registering the volumes did not improve the results. The resampling process appears to amplify the difference between VaT12 and VmD or VsD, but we have not analyzed this further, given that it is a relatively small effect. It would, of course, be interesting to compare the output of the VaT1T2 method directly with manual segmentation of the high-resolution T1w dataset in its native reference frame, without the need to down-sample. However, the workload involved in creating high-resolution manual segmentations is prohibitive. In the Supplementary Information, we report anecdotal results for five such cases with full high-resolution manual segmentations.
Also of note from comparison of the scatter-plots of Fig. 7 is that each of methods VsD, VaT12D, and VaT12 increasingly overestimates the breast volume in comparison to VmD as the mean left and right breast size increases. This is most apparent for VaT12. The trend to larger error is, of course logicalsimilar percentage errors between the methods will result in greater absolute differences the larger the breastbut it is not currently clear why all methods are biased to overestimate the volume in this region. The method VaT12D also underestimates the breast volume for smaller breasts compared with the manual segmentation VmD and the reason for this, too, is unclear.
The biggest discrepancy between analysis methods, as shown by the scatter plots, is in the assessment of mean  Table V). By contrast, the correlation between the Dixon-based VmD-FWsD and VaT12-FWaT1 is weaker, and the VaT12-FWaT2 result additionally shows a bias (Fig. 8). However, it is important to note that the assumption that water fractions based on the Dixon method can be regarded as a gold standard for true parenchymal fraction is much less compelling than the previous assumption that VmD is the gold-standard volume. We justify our choice of VmD-FWsD as the method of comparison on the basis that it is consistent with previous work in the field 49 (and indeed an improvement), but Ledger et al. 52 have demonstrated that there is a significant degree of variability between different Dixon-based methods, depending on the exact design of the pulse sequence. It is unsurprising that a segmentation based on a completely different MRI contrast mechanism should be less highly correlated. What is nevertheless highly encouraging is that the correlation remains as strong as it isthe worst value reported in Table V is 0.920 and this suggests that the use of MRI as a modality will prove to be a robust choice for breast analysis.
A salutary lesson from the scatter graphs is the constant need for vigilance and appropriate quality control when processing large cohorts of data. During the review of this paper, a referee noticed an outlier, which turned out to be the result of an easily corrected error that caused the mask for the entire right breast to be missing. Such "edge" cases, occurring very infrequently, remain a significant challenge in the adoption of automated pipelines. Any requirement for manual inspection of each dataset to check the output negates to some extent the advantages of fully automated segmentation processes, and an appropriate balance needs to be determined for each application.
Another feature highlighted by all of these results is the problem inherent in the use of quantitative metrics such as Dice and correlation coefficients, which (despite their apparent calculation "accuracy") are a very blunt tool for analysing a complex situation. Are all of the voxels that fail to overlap equally important? Is much of the difference between the observer and the automated methods in fact caused by the choice of how much of the axilla is included and is this region of any significance biologically?
A first reading of the coefficients presented here suggests that the VsD breast outline segmentation, followed by the FWsD tissue segmentation method is the best-performing of the computer-aided tools presented here. But is it the most suitable? Ultimately, the choice of segmentation method needs to weigh up the following points: • To what extent does the application demand a segmentation that is as good as that of an expert radiologist? Two extremes here might be the planning of radiotherapy treatment for an individual patient, where high

VaT12-FWaT1
VaT12-FWaT2 • To what extent is the ground truth knowable? For a given set of intra-and interobserver performance metrics evaluated on a test cohort, what performance thresholds should be regarded as "acceptable" for automated segmentations?
• How widely available are the required source data? As previously noted, the Dixon protocol is not routinely included in clinical examinations, thus limiting the applicability of breast density measurements based on the VsD-FWsD method.
• How robust is the method?
• To what extent are speed, convenience and consistency of method to be preferred over accuracy?
In our case, consideration of all of the above led to the use of the VaT12 method, rather than VsD, for segmentation of the remaining 300 cases in the cohort (results not presented). This choice was made largely on the basis of improved automation and on the epidemiological evidence from the 200-strong training and test datasets, as described in Section 3.C, where key epidemiological parameters were found to be identical, within confidence limits, for both methods.

CONCLUSION
We have presented what we believe to be the first detailed comparison on a large, population-based cohort of two methods of breast-outline segmentation based on completely different approaches. These have been coupled with two methods of fat-water discrimination based on fundamentally different MR contrast mechanisms. All combinations of the methods studied are in very strong agreement, as seen both visually and via interclass correlation coefficients, and are suitable for large-scale epidemiological analysis. We have discussed the assumptions behind the methods and posed a number of general questions that we believe need to be answered each time a decision is made on whether and how to perform automated segmentation. Figure S2. Concepts involved in the heuristic algorithms of the BC-FCM refinement algorithm. Figure S3. Distribution of breast volumes and percentage water as measured by the different segmentation and fatwater estimation methods. Nomenclature for method names is as described in the main text. Figure S4. Results of Bland-Altman analysis of (A) breast volume measurements and (B) percentage water measurements obtained using different segmentation methods.
Nomenclature of method names is as described in the main text. Table S1. Dice and Jaccard coefficients obtained by comparing manual and automatically segmented masks. Table S2. Dice and Jaccard coefficients obtained by comparing manual and automatically segmented masks for five representative cases in which the high-resolution T1-w datasets were fully manually segmented. Appendix S3. MRI manual masking protocol.