Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation

Catalina Tobon-Gomez; Federico M Sukno; Constantine Butakoff; Marina Huguet; Alejandro F Frangi

doi:10.1088/0031-9155/57/13/4155

1. Introduction

Imaging data have become essential in clinical routine. Analysing it often requires extensive manual postprocessing. Most of the collected information is disregarded during actual diagnosis/treatment. This fact has motivated the development of computer-based techniques to translate the data into clinically relevant and quantitative information.

In the cardiac field, model-based approaches have found wide acceptance given the 3D (or 3D+t) nature of most data (Heimann and Meinzer 2009). Active shape models (ASMs) are one of these approaches (Cootes et al 1995). It is based on a generic template of the target organ, which encodes knowledge of shape variability across a population. The template is constituted by a distribution of landmarks, known as a point distribution model (PDM). To construct a PDM, several training steps must be fulfilled. These steps were initially achieved manually for 2D datasets. However, a manual scheme for 3D datasets is nearly unfeasible. This complication has inspired the development of automatic techniques for PDM construction. Some authors have proposed techniques for auto-landmarking surface (Subsol et al 1998, Brett and Taylor 2000, Davies et al 2002) or volumetric (Frangi et al 2002, Kaus et al 2003) representations of the target organ. Other authors have shown techniques that work directly from raw images (Rueckert et al 2003, Ordas et al 2007).

Fitting a PDM to unseen data requires knowledge about the appearance of the image being processed. ASMs have a second component, known as grey-level or intensity model, which takes care of this. However, the appearance of the images changes from modality to modality. Thus, to be able to handle multi-modal datasets, the intensity model needs retraining. That requires the collection of manual ground-truth contours. To bypass this step, several authors have proposed to reuse a previously built PDM and generate the appearance information some other way (Peters et al 2010, van Assen et al 2008, Tobon-Gomez et al 2008). Peters et al (2010) use a few manual delineations per modality, which are propagated and refined with the simulated search method. van Assen et al (2008) use a fuzzy inference system based on relative intensity differences. Our approach is to learn the intensity models from simulated datasets. We have previously explored this concept on gated single photon emission computed tomography (gSPECT) studies (Tobon-Gomez et al 2008). This work included 208 simulated datasets and 20 clinical datasets. Results showed that gSPECT studies could be successfully segmented by models trained from simulated datasets with subvoxel accuracy. Volume measurement accuracy ranged from 90.0% to 94.5% for simulated datasets and from 87.0% to 89.5% for real datasets.

However, applying this method to a modality with higher spatial resolution, such as MRI, raises a new question regarding its achievable accuracy. The aim of this study is to test two hypotheses in the context of cardiac MRI segmentation with a 3D-ASM approach. Hypothesis 1: an intensity model trained with simulated images can obtain segmentation accuracy comparable to inter-observer variability. Hypothesis 2: including a measure of reliability during the matching process can increase robustness. To test hypothesis 1, we train the intensity models with simulated images generated as described in Tobon-Gomez et al (2011) (further details in section 5.1). We evaluate the segmentation accuracy obtained with these models in comparison to inter-observer variability. We also compare their accuracy with one of the models trained on real images. Unlike gSPECT studies, in MRI studies both ventricular cavities are visible. Therefore, we use a bi-ventricular model for our segmentation approach (further details in section 5.2). To test hypothesis 2, we employ the generic reliability measure for statistical shape models proposed by Sukno and Frangi (2008). Only landmarks labelled as reliable are used to deform the PDM during matching (further details in section 3). In section 8.3, we discuss our results considering other model-based cardiac segmentation studies (Peters et al 2010, van Assen et al 2008, Zhang et al 2010, Zhuang et al 2010, Lötjönen et al 2008).

2. Active shape models

The essential components of an ASM are a shape model, an intensity model and a matching algorithm (full description in Cootes et al (1995)). The shape model is a template of the organ of interest represented as a distribution of landmarks or PDM. The PDM encodes statistical information of variability in the training set. For a 3D space, a linear PDM trained from n shapes, $\lbrace \mathbf {x}_{i}; i=1,\ldots ,n\rbrace$ , of m landmarks each, {l_j = (l_xj, l_yj, l_zj); j = 1, ..., m}, is a linear model defined as

$\begin{equation} {\mathbf {x}} = {\mathbf {\bar{x}}} + {\mathbf {\Phi }} {\mathbf {b}}, \end{equation} \tag{ 1 }$

where x is a concatenated vector of all landmark coordinates in the form (l_x1, l_y1, l_z1, l_x2, l_y2, l_z2, ..., l_xm, l_ym, l_zm). Then, ${\mathbf {\bar{x}}}$ is the average of all aligned shapes, b is the shape parameter vector of the model and Φ is a matrix formed by the principal components of the covariance matrix:

$\begin{equation} {\mathbf {S}} = \frac{1}{n-1} \sum ^n_{i=1} ({\mathbf {x}}_i-{\mathbf {\bar{x}}})({\mathbf {x}}_i-{\mathbf {\bar{x}}})^T. \end{equation} \tag{ 2 }$

The intensity model, as mentioned in the introduction, learns the appearance around the boundaries of the target organ. Typically, this means sampling a 1D profile of grey-level values (or gradients) for each landmark along the direction perpendicular to the boundary. From the values sampled along each profile, the mean vector and covariance matrix are estimated (see figure 1(a)). The matching algorithm uses the intensity model to match unseen image data. Our approach is based on the sparse fitting method SPASM (van Assen et al 2006). It is sparse in the sense that image data are obtained only from the acquired image planes. This characteristic is quite relevant for MRI datasets given their anisotropic nature. The algorithm first finds the intersections of the mesh with the image planes. These intersections make a stack of 2D contours. Each landmark in the contour stack will be displaced to a new location. The new location, or candidate position, is selected by locally searching a profile which better resembles the one stored during training. In our case, the best candidate position is selected by computing the minimal Mahalanobis distance between the sampled profiles and the mean profiles of the intensity model. Each candidate point forces a deformation on the mesh. The deformation propagates to the rest of the nodes with a weighting function normalized by a Gaussian kernel. The mesh is deformed for several iterations until a best fit is found.

**Figure 1.** Explanatory diagram of intensity model construction (a). Profiles are sampled perpendicularly to the contour of the object. At the location of each landmark, profiles are collected for every training set. The intensity model will include a mean profile (to account for average image intensity values) and a covariance matrix (to account for image variability). Example of the distribution of Mahalanobis distances of intensity profiles for a given landmark (b) when using 3D ASM (bottom), and the resulting mutual information between *r^j_i* and $\hat{r}_i^{j}$ (equation (4)) when varying the threshold that divides reliable and unreliable estimates (top). Dashed line indicates the threshold maximizing the mutual information.
Download figure:
Standard image

**Figure 1.** Explanatory diagram of intensity model construction (a). Profiles are sampled perpendicularly to the contour of the object. At the location of each landmark, profiles are collected for every training set. The intensity model will include a mean profile (to account for average image intensity values) and a covariance matrix (to account for image variability). Example of the distribution of Mahalanobis distances of intensity profiles for a given landmark (b) when using 3D ASM (bottom), and the resulting mutual information between *r^j_i* and $\hat{r}_i^{j}$ (equation (4)) when varying the threshold that divides reliable and unreliable estimates (top). Dashed line indicates the threshold maximizing the mutual information.
Download figure:
Standard image

3. Reliability estimation

In this work, we use the generic approach proposed by Sukno and Frangi (2008). This method uses a probabilistic framework to identify whether the position selected by the image model for a given landmark is trustworthy (reliable). The landmark is regarded as reliable when its distance with respect to the ground-truth position is within the intrinsic precision of the algorithm. This is estimated by using a reliability threshold on the Mahalanobis distance (from the intensity model). We compute a reliability threshold for each landmark j of each shape i, as follows:

We run the matching algorithm on the training set for several iterations. At each iteration, we store the minimal Mahalanobis distance ζ^j_i and its corresponding Euclidean distance from the ground truth ^j_i.
We compute the intrinsic precision of the algorithm ^j_th as the average segmentation error on the whole shape at the final iteration.
Using function
$\begin{equation} f_u(\mathbf {a,b}) = \left\lbrace \begin{array}{@{}ll@{}}1 & \mbox{if} \;\mathbf { a}< \mathbf {b},\\ 0 & \mbox{otherwise},\end{array} \right. \end{equation} \tag{ 3 }$
we define two variables r^j_i = f_u(^j_i, _th^j) and $\hat{r}_i^{j} = f_u\big(\zeta _i^{j}, \zeta _{{\rm th}}^{j}\big)$ .
We vary the value ζ^j_th until we maximize the mutual information between r^j_i and $\hat{r}_i^{j}$ across the training set:
$\begin{equation} \zeta _{{\rm th}}^{j} = \arg \max _{\zeta }{\,\,I\big ( f_u\big(\epsilon _i^{j}, \epsilon _{{\rm th}}^{j}\big); f_u(\zeta _i^{j},\zeta ) \big ),} \end{equation} \tag{ 4 }$
where $I(\mathbf {c,d})$ is the mutual information between c and d (see figure 1(b)).
The optimal value ζ^j_th = ζ that divides reliable and unreliable estimates is the reliability threshold. These thresholds are stored for later use during matching.

To include reliability information during matching, each landmark is displaced to a new candidate position only if the Mahalanobis distance is below its corresponding reliability threshold ζ^j_th. Therefore, only the landmarks marked as reliable contribute to the deformation of the shape model. The steps of the algorithm, including reliability information, are described in algorithm 1.

Algorithm 1 Matching with Reliability Information
Require: imageStack:=stack of DICOM images to be segmented
Require: initialShape:=initially positioned mean shape
Require: meanProfiles:=trained intensity profiles per landmark
Require: ζ^j_th:=reliability thresholds per landmark j
1 repeat
2 If iteration:=1 then
3 currentShape←initialShape
4 else
5 currentShape←bestFit
6 end if
7 function Intersect(imageStack,currentShape)
8 for all image planes do
9 2D contour←intersection with currentShape
10 end for
11 contourStack←stack 2D contours
12 for all points in contourStack do
13 interPoints←find closest landmarks in currentShape
14 end for
15 end function
16 function findCandidates(interPoints,meanProfiles)
17 for all interPoints do
18 sampledProfiles←sample perpendicular profiles
19 for all possible profile positions in search range do
20 Mahalanobis(meanProfiles,sampledProfiles)
21 minMahalanobis←minimal Mahalanobis distance
22 end for
23 function testRelibiality(minMahalanobis,ζ^j_th)
24 if minMahalanobis < ζ^j_th then
25 canditatePoints←store new candidate positions
26 else
27 canditatePoints←current position j
28 end if
29 end function
30 end for
31 end function
32 function NewValidInstance(canditatePoints)
33 for all candidatePoints do
34 Forces←calculate propagation to neighbouring nodes
35 end for
36 defShape←apply Forces to currentShape
37 bestFit←best parameters from equation (1) to fit defShape
38 end function
39 until iterations completed

4. Image datasets

4.1. Population group 1 (V1)

This population included 400 virtual subjects generated with XCAT (Segars et al 2008). The XCAT generates a voxelized version of each virtual subject containing several thoracic structures: heart, great vessels, lungs, ribs, liver, spleen, stomach, intestines and body background. Each volume was represented by a stack of 221 axial slices of 512 × 512 pixels with a 0.78 mm³ isotropic voxel size. To obtain short-axis stacks, we reformatted the axial stacks with the true rotation angles used during phantom generation. Each short-axis stack was used as input to the simulator (further details in section 5.1). This group was used for training and testing.

4.2. Clinical population group 1 (C1)

This group included 40 patients, with 12 healthy subjects, 18 patients with hypertrophy and 10 patients with infarction. Demographics are summarized in table 1. The studies were acquired using a GE Signa CVi-HDx, 1.5T scanner (General Electric, Milwaukee, USA). Short-axis images were scanned with a steady state free precession (SSFP) sequence. The slice thickness was 8 mm with an in-plane pixel resolution of 1.56 mm × 1.56 mm. Both left ventricle (LV) and right ventricle (RV) were manually segmented at end diastole (ED) and end systole (ES) by an experienced researcher. A subset of 20 studies was segmented by a radiologist to evaluate inter-observer variability. This group was used for training and testing.

Table 1. Patient demographics and cardiac function.

	C1	C2
Age (years)	56 ± 16	n.a.
Sex	26 M/14 F	n.a.
LV function
EDV (ml)	120 ± 29	192 ± 92
ESV (ml)	48 ± 21	115 ± 90
EF (%)	60 ± 11	47 ± 20
MASS (gr)	130 ± 32	105 ± 43
RV function
EDV (ml)	125 ± 34	n.a.
ESV (ml)	71 ± 22	n.a.
EF (%)	43 ± 10	n.a.

n.a. = not applicable.

4.3. Clinical population group 2 (C2)

This group included 45 patients, with 9 healthy subjects, 12 patients with ischaemic heart failure, 12 patients with non-ischaemic heart failure and 12 patients with hypertrophy. Demographics are summarized in table 1. The database was collected at Sunnybrook Health Sciences Centre as part of the cardiac MRI LV segmentation challenge (Radau et al 2009). The studies were acquired using a GE Signa, 1.5T scanner (General Electric, Milwaukee, USA). Short-axis images were scanned with a SSFP sequence. The slice thickness was between 8 and 10 mm with an in-plane pixel resolution of 1.36 mm × 1.36 mm. Ground-truth contours, provided in the segmentation challenge package, were drawn by an experienced cardiologist. The contours included endo- and epicardium at ED and endocardium at ES. All the contours were confirmed by another cardiologist. This second group was included for two reasons: (1) to test the real intensity models on images with different intensity distributions; (2) to evaluate our segmentation approach with an external ground truth, as provided for the challenge.

5. Methods

5.1. MRI simulation

The MRI simulator used in this work models the physical phenomena by solving the Bloch equations (Kwan et al 1999). It uses a labelled volume as input, where each label corresponds to a tissue with certain magnetic resonance properties.

The virtual population in group V1 was used as input to the simulator. Each virtual patient was modified to increase the realism of the simulated images. These modifications were evaluated in Tobon-Gomez et al (2011). Briefly: (1) we included papillary muscles in the LV and trabecular structures in both ventricles. Papillary muscle generation was based on clinical measurements and trabeculae generation was based on a mathematical model (figure 2(a)); (2) we incremented the intensity variability of each body tissue in the simulated image by increasing the number of labels of the virtual patient. Magnetic resonance properties, gathered from the literature, were assigned to each sublabel according to its tissue class (figure 2(b)); (3) we modelled partial volume effect by using a higher resolution virtual patient as input to the simulator. Each pixel was computed from four input pixels and each slice was computed from ten input slices. Therefore, each voxel in the output image included the contribution of 40 voxels from the input virtual patient (figure 2(c)). An example of the obtained simulations can be found in figure 3.

**Figure 2.** Modifications made to the XCAT phantom to increase realism of simulated images: we included trabeculae based on a mathematical model and papillary muscles based on clinical measurements (a); we increased the number of labels and assigned magnetic resonance properties (PD, T1, T2, T2*) to each sublabel according to its tissue class (b); we modelled partial volume effect by using a higher resolution virtual patient as input to the simulator (c).
Download figure:
Standard image

**Figure 3.** Images from virtual group V1: two virtual patients at ED (a) and ES (b). Edges obtained automatically by 3D ASM with simulated intensity models are displayed in colour. White = ground-truth surface. Blue = matched surface with no reliability. Red = matched surface with reliability.
Download figure:
Standard image

5.2. 3D-ASM training

Our shape model is a surface mesh representing the LV and RV. The model is constituted by three regions: LV endocardium (LV-ENDO), LV epicardium (LV-EPI) and RV endocardium (RV-ENDO). Right ventricular segmentation is especially difficult mainly due to its thin walls and abundant trabecular structures. By modelling LV/RV simultaneously, we aim to obtain higher overall accuracy. Moreover, independent segmentation of cardiac chambers could cause either divergence or overlapping. Given that our bi-ventricular configuration is a subpart of a whole heart model trained from 100 high-resolution CT datasets (Ordas et al 2007), we trust that the interactions among cavities are handled correctly by the trained PDM.

For intensity model training, we require all training meshes to have identical topology. To accomplish this on the simulated datasets, we extracted the LV and RV as binary masks from the 3D labelled phantoms in group V1 (section 4.1). We generated isosurfaces from the binary images. Our shape model was aligned to these surfaces using finite iterative closest point registration (Zhang 1994). The aligned shape models were then deformed to match the binary images using our 3D-ASM algorithm⁴. Shape model parameters were set to 99.9% of the total variance and ±3σ along each principal component. These constraints allowed our model to capture fine details of the XCAT phantom geometry.

For the real datasets in group C1 (section 4.2), we enforced topological consistency by obtaining 3D manual segmentations. We used an in-house software (GIMIAS v1.1.0) to visualize the datasets, initialize and manually deform the surface mesh (Larrabide et al 2009). For initialization, the observer selected four anatomical landmarks: one in the aortic valve, one in the mitral valve, one in the tricuspid valve and one in the LV endocardial apex. These initialization points were registered to the corresponding anatomical landmarks in the surface mesh with a similarity transformation. The meshes were manually deformed by the observer to obtain the manual ground-truth shapes. Note that the shape model constraints were not used to prevent statistical bias.

We trained one intensity model for each landmark of the PDM. The intensity models were independently trained for each cardiac phase. This means that for the simulated datasets we obtained 30 intensity models and for the real datasets we obtained 2 intensity models (ED and ES). The same applies for the reliability thresholds: one threshold per landmark and per phase (section 3).

5.3. 3D-ASM segmentation

To run the segmentation on groups V1 and C1, we used the initialization shapes described above (section 5.2). For group C2, initialization shapes were obtained in the same manner as for group C1.

To process the datasets, we used a coarse-to-fine search strategy (Cootes et al 1994). It consists on building a multi-resolution pyramid from the original image and processing each level independently. In the coarse levels, we increased the constraints of the model to avoid local minima. In the fine level, we relaxed the constraints to allow for finer details (Ma et al 2010). Table 2 summarizes the parameters of our 3D-ASM segmentation approach.

Table 2. Parameters used for 3D-ASM segmentation.

Shape model	Number of nodes
LV endocardium	880
LV epicardium	1797
RV endocardium	3478
Intensity model	Profile
Length	17
Sampling interval	1.56 mm
Matching	Coarse search	Fine search
Resolution number	3–2	1
Profile search range	21	21
Allowed mode variation	±1σ	±3σ
Shape variability	85%	98%
Gaussian kernel width	2 mm	2 mm
Iterations	100	100

6. Experimental setup

6.1. Segmentation accuracy

Simulated versus real intensity models. The intensity models were trained both on simulated and real datasets. The simulated intensity models were used to segment the datasets in all test groups (V1, C1 and C2). The real intensity models were used to segment datasets in groups C1 and C2. Datasets in V1 were segmented by means of twofold cross-validation, where two subgroups of n = 200 were randomly selected. Due to smaller sample size, C1 datasets were segmented by means of leave-one-out cross-validation. To test performance, we calculated point-to-surface (P2S) errors with respect to the ground-truth meshes/contours described previously (sections 5.2 and 4.3). Statistical significance of the medians was assessed with a Mann–Whitney test (Gibbons 1985).
Effect of reliability. We performed the experiment described above with and without reliability information (section 3). We calculated P2S errors and evaluated statistical significance of the medians with a Mann–Whitney test.
Clinical population subgroups. To analyse the influence of pathology on segmentation errors, we separated our clinical population in three main groups: (1) normal, (2) hypertrophic and (3) infarcted and/or with heart failure. We computed the segmentation errors within each of these groups. Statistical significance of the medians was assessed with a Mann–Whitney test.
Initialization sensitivity. To test the sensitivity of the segmentation to initialization, we executed the segmentation 17 times: 1 with favourable and 16 with unfavourable initialization. The 16 unfavourable initial positions result from a uniform 4 × 4 grid. The central position of the grid matched the centroid of the favourable initialization shape (section 5.2). To stay within the bounds of the short-axis stack, no translation was performed along the long-axis direction. The remaining initializations were all possible combinations of 6 and 12 mm displacements from the centroid on the short-axis plane. These translation factors were chosen so that all mesh points remained within the search range of the matching algorithm. We calculated P2S errors with unfavourable initializations. Statistical significance of the medians was assessed with a Mann–Whitney test.

6.2. Volume measurements

We calculated ground-truth volumes from the ground-truth meshes in groups V1 and C1. For group C2, we computed volumes from the ground-truth contours by summing the volume of each slice (area inside contour × slice thickness). We calculated the measured volumes from the matched meshes in all groups. Relevant parameters were computed: end diastolic volume (EDV), end systolic volume (ESV), ejection fraction (EF) of both ventricles and myocardial mass (MASS) of the LV. Measurement agreement with respect to ground-truth volumes was evaluated by means of Bland–Altman plots. This analysis was also performed for the clinical population subgroups described above (section 6.1).

7. Results

7.1. Segmentation accuracy

Visual results of the segmentation for V1, C1 and C2 are displayed in figures 3–5, respectively. Figures 6, 7 and 9 present box plots of P2S errors for all data groups with favourable and unfavourable initializations. The plots are further analysed below.

V1. Analysing first the results on the virtual datasets (figure 6), we can observe the effect of reliability inclusion. With favourable initialization, reliability did not aid nor penalize segmentation accuracy. When testing the matching algorithm with unfavourable initialization, reliability aided the segmentation of LV-EPI at ES. In this case, errors with reliability were significantly lower than those with no reliability.
C1. On the first group of clinical datasets (figure 7), we applied real and simulated intensity models. Using the real models, overall errors were lower than inter-observer variability (IV). Regarding reliability, we observed two main trends. On the one hand, with favourable initialization reliability did not improve nor penalize segmentation accuracy. On the other hand, with unfavourable initialization reliability reduced outliers at ES.Using the simulated models, LV errors were lower than IV (12%), while RV errors were larger than IV (36%). Regarding reliability inclusion, we can make the following observations. With favourable initialization, segmentation accuracy of LV was penalized at ED, while at ES it reduced outliers without significant penalization. With unfavourable initialization, reliability reduced outliers at ES for all regions. This effect was also observed with the real intensity models.Analysing each clinical group, we see that datasets from hypertrophic patients were associated with larger segmentation errors (figure 8). For these datasets, the reliability effect on LV-ENDO and LV-EPI accuracy at ES was more notable.
C2. On the second group of clinical datasets (figure 9), we also applied real and simulated intensity models. Using the real models, reliability reduced outliers at ES both with favourable and unfavourable initializations. Using the simulated models, reliability reduced LV-ENDO segmentation errors at ES (50%). However, the segmentation error obtained by the simulated models was above the one obtained by the real models (60%). Since the simulation settings were set to match the clinical images of C1, the difference in appearance of datasets in C2 implies an added challenge for the intensity models. In this case, the real models cope better with different image appearance (i.e. different tissue contrast and signal-to-noise ratio) than the simulated models.

**Figure 4.** Images from clinical group C1: a normal subject (a and b), a hypertrophic patient (c and d) and an infarcted patient (e and f). Edges obtained automatically by 3D ASM with simulated (top) and real (bottom) intensity models at ED (a, c, e) and ES (b, d, f). Note the influence of reliability information for the simulated models (top), especially for the RV. White = ground-truth surface. Blue = matched surface with no reliability. Red = matched surface with reliability.
Download figure:
Standard image

**Figure 5.** Images from clinical group C2: a normal subject (a and b), a hypertrophic patient (c and d) and a heart failure patient (e and f). Edges obtained automatically by 3D ASM with simulated (top) and real (bottom) intensity models at ED (a, c, e) and ES (b, d, f). Note the influence of reliability information for the simulated models (top), especially for the RV. White = ground-truth surface. Blue = matched surface with no reliability. Red = matched surface with reliability.
Download figure:
Standard image

**Figure 6.** Box plots of P2S segmentation errors for V1 with favourable (left) and unfavourable (right) initializations. Errors were computed at ED (dark) and ES (light) for NRL results (blue) and RL results (red). Maximum whisker corresponds to approximately 99.3% coverage if the data were normally distributed. Pair of NRL versus RL errors that yielded statistically significant differences (p < 0.05) is marked with an asterisk (*). NRL = no reliability; RL = with reliability.
Download figure:
Standard image

**Figure 7.** Box plots of P2S segmentation errors for C1 with favourable (left) and unfavourable (right) initializations. Errors were computed at ED (dark) and ES (light) for NRL results (blue) and RL results (red). Grey plots represent inter-observer variability. Maximum whisker corresponds to approximately 99.3% coverage if the data were normally distributed. Pair of NRL versus RL errors that yielded statistically significant differences (p < 0.05) is marked with an asterisk (*). NRL = no reliability; RL = with reliability.
Download figure:
Standard image

**Figure 8.** Box plots of segmentation errors for all datasets in C1 (top) and only hypertrophic cases (bottom). Errors were computed at ED (dark) and ES (light) for NRL results (blue) and RL results (red). Black plots represent inter-observer variability. Maximum whisker corresponds to approximately 99.3% coverage if the data were normally distributed. Pair of NRL versus RL errors that yielded statistically significant differences (p < 0.05) is marked with an asterisk (*). NRL = no reliability; RL = with reliability.
Download figure:
Standard image

**Figure 9.** Box plots of P2S segmentation errors for C2 with favourable (left) and unfavourable (right) initializations. Errors were computed at ED (dark) and ES (light) for NRL results (blue) and RL results (red). Maximum whisker corresponds to approximately 99.3% coverage if the data were normally distributed. Pair of NRL versus RL errors that yielded statistically significant differences (p < 0.05) is marked with an asterisk (*). NRL = no reliability; RL = with reliability.
Download figure:
Standard image

To have a better understanding of the distribution of outliers, figure 10 displays average P2S errors for group C1 colour mapped on the mean shape model. In the inter-observer maps, LV outliers are located at the base and slightly at the apex. For RV, outliers are located on the outflow tracks (figure 10(a)). Both for real and simulated intensity models, LV outliers are located at the base and at the apex as well (figure 10(b)). For simulated intensity models, some RV outliers are located on the outflow tracks, while most of them are located along the free wall (figure 10(d)). For the real intensity models, we observe that inclusion of reliability does not penalize accuracy (figure 10(c)). For the simulated intensity models, we observe a reduction of errors at the RV and the epicardial wall of the LV (figure 10(e)).

7.2. Volume measurements

From all the calculated volume measurements, figures 11 and 12 display the most relevant ones to evaluate our hypotheses. Following hints from the results on segmentation accuracy, we could expect to find differences in ESV and on LV MASS. Smaller errors in ESV will be reflected on a better computation of EF. We observed this effect especially on hypertrophic patients of C1 (figure 11) and heart failure patients of C2 (figure 12). Reliability inclusion reduced bias and confidence intervals of EF and LV MASS values. For group C2, we observed a generalized overestimation of volume most likely due to our 3D approach (versus the 2D approach of the ground truth). This effect was also observed in other 3D approaches during the segmentation challenge (Wijnhout et al 2009, Radau et al 2009). This overestimation was compensated in the calculation of EF.

**Figure 11.** Bland–Altman plots for volume measurements from hypertrophic datasets in C1. Inclusion of reliability reduced bias and confidence intervals of EF (left) and LV MASS (right) values, computed from segmentation results with simulated models. gt = ground truth; m = automatic measurement; g = grams.
Download figure:
Standard image

**Figure 12.** Bland–Altman plots for volume measurements from heart failure datasets in C2. Inclusion of reliability reduced bias and confidence intervals of EF values, computed from segmentation results with both simulated (left) and real (right) models. gt = ground truth; m = automatic measurement.
Download figure:
Standard image

8. Discussion

8.1. Evidence for hypothesis 1—an intensity model trained with simulated images can obtain segmentation accuracy comparable to inter-observer variability

Our first hypothesis mainly holds true for the LV. As for the RV, the segmentation accuracy with simulated intensity models was comparable to IV only at ES. Taking a closer look at the simulated images, we can observe that the RV model seems oversimplified (see figure 13). On the one hand, it still lacks RV papillary muscles making the cavity more even than in real images. On the other hand, the anterolateral wall changes diameter rather rapidly generating extremely blurred edges. This effect is less notable at ES, where the wall becomes thicker, and therefore partial volume effect is reduced. It is also less notable along the posterior wall due to its smoother change of diameter. This drawback can be improved either by: (1) modifying the geometry of the phantom to provide a smoother transition towards the apex and/or increasing the wall thickness of the RV model; (2) simulating thinner MRI slices to reduce partial volume effect. We chose not to explore the second option in this work to maintain consistency of MRI acquisition parameters between our real and simulated datasets.

**Figure 13.** Differences (arrows) between RV in real (a and b) and simulated (c and d) images. Anterolateral wall changes diameter rapidly generating extremely blurred edges (c). This effect is less notable at ES, where the wall becomes thicker, and therefore partial volume effect is reduced (d).
Download figure:
Standard image

With respect to the LV model, it would be interesting to include geometrical alterations due to pathology, since the appearance of the profiles is evidently affected by them. For instance, a chronic infarction makes the walls thinner, while a hypertrophy makes the walls thicker. Besides the alteration of the profiles, the pathology can also modify the overall shape of the ventricles and the relationship between them. Therefore, a specific PDM per pathology can also greatly improve segmentation accuracy. This will require, as well, an automatic technique to select the proper PDM when processing a new case.

As for the location of outliers, both the base and the apex are known to be challenging areas. First, the correct definition of the basal plane is a known complication of cardiac imaging postprocessing for most modalities. Second, the boundaries at the apex are especially blurred due to high density of trabeculae. Both these difficulties can be greatly improved by including long-axis images in the matching process. In this type of images, the base and apex are clearly visible.

8.2. Evidence for hypothesis 2—including a measure of reliability during the matching process can increase robustness

Our second hypothesis holds true only at ES, where the inclusion of reliability reduced outliers, especially for LV epicardium segmentation. In cases of hypertrophy, the inclusion of reliability information proved to aid left ventricular segmentation at ES. Reliability also reduced outliers in ESVs and LV mass computation.

We can conclude that in the presence of clear edges, the segmentation without reliability draws the model to a correct position. When the edges are less clear, for instance around the right ventricular edge or in severe hypertrophy, the reliability information helps the model stay in a more correct position.

8.3. Comparison to previous relevant studies

Table 3 summarizes the segmentation errors of other model-based cardiac MRI segmentation approaches. Given that comparisons across studies are difficult since datasets are different (i.e. type of patients, number of cardiac phases, image quality and image resolution), our purpose is to identify consistent trends. In general, endocardial errors tend to be the smallest, followed by the epicardial ones, and RV errors tend to be the largest. We also observed that trend in our work. For most studies, at least half of their database is comprised by normal subjects. Hence, our database was challenging for two reasons: (1) only 30% of datasets corresponded to normal subjects, and (2) there was a high percentage of hypertrophic patients (45%). We found that this pathology poses a special challenge to our model-based segmentation approach (see figure 8). In this sense, the performance of model-based approaches on hypertrophic patients seems understudied in the literature.

Table 3. Segmentation errors of other model-based cardiac MRI segmentation studies (in works with multiple modalities, only MRI data are reported).

							LV				RV
							Endo		Epi		Endo
Author	n	Population	Resolution (mm³)	Stack	Regions	Phase	mean mm	SD mm	mean mm	SD mm	mean mm	SD mm
van Assen et al (2008)	15	Nor	1.50 × 1.50 × 6.0	SA	LV	ED	1.72	n.a.	1.55	n.a.	n.a.	n.a.
Lötjönen et al (2008)	40	Nor/Isch/Dil	1.50 × 1.50 × 7.1	SA/LA	LV/RV	ED/ES	1.54	0.39	1.47	0.34	2.37	0.65
Koikkalainen et al (2008)	25	Nor	1.20 × 1.20 × 6.5	SA/LA	WH	ED	1.46	0.30	1.87	0.63	2.26	0.46
Zhang et al (2010)	50	Nor/TOF	1.70 × 1.70 × 7.0	SA/LA	LV/RV	4D	1.69	0.38	1.89	0.49	2.53	0.56^a
Peters et al (2010)	42	Mixed	0.60 × 0.60 × 0.8	3D	WH	ED	0.69	1.13	0.83	1.17	0.74	0.96
Zhuang et al (2010)	37	Nor/Mixed	1.00 × 1.00 × 1.0	3D	WH	ED	1.47	0.32	2.32	0.82	2.13	0.70
Our approach no RL	40	Nor/Hyp/Isch	1.56 × 1.56 × 8.0	SA	LV/RV	ED/ES	1.78	0.53	1.86	0.53	3.17	1.57
Our approach RL	40	Nor/Hyp/Isch	1.56 × 1.56 × 8.0	SA	LV/RV	ED/ES	1.85	0.56	1.89	0.54	2.98	1.48

LV = Left ventricle; RV = Right ventricle; SD = Standard deviation; Nor = Normal; Path = Pathologic; Isch = Ischemic; Dil = Dilated; TOF = Tetralogy of Fallot; Hyp = Hypertrophy; SA = Short axis; LA = Long axis; WH = Whole heart; RL = Reliability; n.a. = not applicable. ^a = epicardium.

9. Conclusion and future work

In this work, we tested the ability of intensity models trained on simulated MRI datasets to segment the ventricular cavities from real datasets. A motivation to train intensity models from simulated datasets is to be able to handle multi-modal datasets more efficiently. Another motivation is to avoid the need for manual outlining, because it is very labourious and very expert dependent. The accuracy obtained by the simulated intensity models was compared to the one obtained by real intensity models and to inter-observer variability.

We can conclude the following: if the proper ground-truth meshes (with point correspondence) are available, this dataset is ideal for intensity model training since they perform with good accuracy and robustness. If, on the other hand, no such dataset is available, training the intensity models on simulated data can achieve accuracy below inter-observer variability (mainly for the LV). The right ventricular model of the simulated images needs further improvement to generate proper profiles around the myocardial edges.

We also tested how the inclusion of reliability information during the matching process can increase robustness. Reliability information was also obtained from the simulated datasets. It proved to increase robustness of the segmentation process without significantly penalizing accuracy. Reliability inclusion made segmentation results less sensible to unfavourable initializations.

Future work should target the incorporation of long-axis information during training and matching for a better segmentation of base and apex. Another factor that needs to be tackled is the correction for z-shift due to respiration. While currently it is handled by the statistical constraints of the PDM, correcting this artefact as a preprocessing can aid the segmentation. Finally, building a specific PDM per pathology, alongside with an automatic technique to select the proper PDM, should benefit the segmentation in cases of severe hypertrophy and infarction.

Acknowledgments

This work has been funded by the Spanish Ministry of Innovation and Science under the CENIT programme (cvREMOD project), the CICYT programme (TIN2009-14536-C02-01), the Institució Catalana de Recerca i Estudis Avançats (ICREA) and the European Commission Seventh Framework Programme under the euHeart project (FP7-ICT-2007-2-224495).

Automatic training and reliability estimation for 3D ASM applied to cardiac MRI segmentation

Article metrics

Permissions

Author e-mails

Author affiliations

Dates

Abstract

1. Introduction

2. Active shape models

3. Reliability estimation