Assessment of bootstrap resampling performance for PET data

Bootstrap resampling has been successfully used for estimation of statistical uncertainty of parameters such as tissue metabolism, blood flow or displacement fields for image registration. The performance of bootstrap resampling as applied to PET list-mode data of the human brain and dedicated phantoms is assessed in a novel and systematic way such that: (1) the assessment is carried out in two resampling stages: the ‘real world’ stage where multiple reference datasets of varying statistical level are generated and the ‘bootstrap world’ stage where corresponding bootstrap replicates are generated from the reference datasets. (2) All resampled datasets were reconstructed yielding images from which multiple voxel and regions of interest (ROI) values were extracted to form corresponding distributions between the two stages. (3) The difference between the distributions from both stages was quantified using the Jensen–Shannon divergence and the first four moments. It was found that the bootstrap distributions are consistently different to the real world distributions across the statistical levels. The difference was explained by a shift in the mean (up to 33% for voxels and 14% for ROIs) being proportional to the inverse square root of the statistical level (number of counts). Other moments were well replicated by the bootstrap although for very low statistical levels the estimation of the variance was poor. Therefore, the bootstrap method should be used with care when estimating systematic errors (bias) and variance when very low statistical levels are present such as in early time frames of dynamic acquisitions, when the underlying population may not be sufficiently represented.

Keywords: resampling, bootstrap, positron emission tomography, image analysis, statistical uncertainty

Introduction
Medical imaging such as positron emission tomography (PET), while being used for estimation of biological parameters of interest (e.g. tissue metabolism for normal or diseased physiology), is always subject to statistical errors at more than one level. For example, at one level, when a set of such parameters is sought to examine or diagnose pathology in a given diseased population, such diagnosis or examination will be subject to sampling errors due to the usually limited-size subsets (samples) of the population. At another level, sampling errors are present in the form of measurement errors due to random errors (e.g. variance due to the limited number of annihilation photon pairs recorded by each pair of detectors in PET) and/or systematic errors (bias) in the image acquisition and reconstruction. Although sampling errors cannot generally be determined, they can be estimated, for example using the bootstrap method, provided that any given population sample (or the number of events recorded during the PET data acquisition) is large enough and representative of the studied population of detection events resulting from the underlying radiotracer distribution.
Such estimation of sampling errors is important for identifying optimal methods for extracting a meaningful signal (information) in the presence of noise caused by random sampling, among other limiting effects. The bootstrap method (Efron and Tibshirani 1994) provides a means of estimating the accuracy (bias) and precision (standard error) of any statistic (an estimate of a population parameter) through resampling with replacement from the observed dataset. For example, the bootstrap has been applied to a pool of recruited participants in order to assess the robustness of covariance patterns extracted from a dataset of a cross-sectional study on diseased and healthy participants, as well as to estimate the prediction accuracy of the model used in such analysis (Habeck et al 2008, Markiewicz et al 2009); note that in these cases the considered stochastic process is not the measurement but the sampled participants. Further, the bootstrap has been used in kinetic modelling applied to dynamic PET datasets in order to estimate the errors of the estimated kinetic parameters, with the resampling performed on the pool of residuals estimated by the prediction of the kinetic model (Turkheimer et al 1998, Kukreja and Gunn 2004), under the assumption that the residuals are independent, identically distributed and have zero mean.
Bootstrap resampling has also been successfully applied to single measurements from nuclear medicine imaging techniques such as PET or SPECT (single-photon emission computed tomography). In such cases the resampling is performed either on the histogrammed PET data formed into sinograms or on the raw and uncorrected PET list-mode data, in order to assess the statistical properties of PET images. An example of the former is Buvat (2002), where a set of empirical sub-sinograms is generated in a gated fashion from the original data in order to estimate the unknown distribution of the sinograms (note that the sinogram bins might not be Poisson distributed due to preprocessing steps). Any bootstrap sinogram is then generated by resampling with replacement every sinogram row from the set of corresponding empirical sub-sinogram rows. Examples of the latter resampling of raw and unhistogrammed PET data have also been reported (Haynor and Woods 1989, Dahlbom 2002, Groiselle and Glick 2005, Verhaeghe et al 2010). Such bootstrap resampling was evaluated when applied to estimating PET image noise, yielding similar performance to that of independent realisations of phantom scans (Dahlbom 2002) or Monte Carlo simulations (Groiselle and Glick 2005). Verhaeghe et al (2010), on the other hand, used bootstrap resampling of the list-mode data in a procedure developed for task-based selection of image reconstruction methodology, where the mean square error estimated with the bootstrap was minimised.
In the resampling of PET data two methods of the bootstrap are used: (i) the non-parametric bootstrap which does not require any knowledge of the underlying distribution from which the PET data are randomly drawn (e.g. Buvat (2002)) and (ii) a parametric bootstrap resampling which assumes that the observed data is taken from a defined distribution (e.g. the Poisson distribution for the raw measured PET data (Haynor and Woods 1989) or the multivariate Gaussian distribution in the image space of reconstructed PET scans (Maitra 1998)). The parameters for the distributions used in the resampling are obtained from the original dataset.
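As a minimal sketch of the parametric variant, a bootstrap replicate of a histogrammed dataset can be drawn by treating each sinogram bin as an independent Poisson variable whose mean is the observed count; the sinogram below is a synthetic toy array, not real PET data.

```python
import numpy as np

def parametric_bootstrap_sinogram(sinogram, rng=None):
    """Parametric bootstrap sketch: fit a Poisson model to the observed
    data by taking each bin's observed count as the Poisson mean, then
    draw one replicate sinogram from that fitted model."""
    rng = np.random.default_rng(rng)
    return rng.poisson(sinogram)

# toy usage with a synthetic 2D "sinogram" of counts
sino = np.random.default_rng(0).poisson(50.0, size=(64, 96))
replicate = parametric_bootstrap_sinogram(sino, rng=1)
```

The non-parametric variant, by contrast, would resample the recorded events themselves rather than draw from a fitted model.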
Different bootstrap resampling approaches (non-parametric and parametric bootstrap methods applied to list-mode and sinogram data) were compared by Lartizien et al (2010) suggesting that non-parametric and parametric methods produce similar results whether the resampling is done in list-mode or sinogram data. However, it has been shown that resampling based on larger samples (more than one list-mode and sinogram dataset available for resampling) produces more accurate estimates particularly of the mean but also of the variance.
Since computational and storage costs have fallen significantly, the method of bootstrap resampling is increasingly gaining momentum, providing greater insight into the statistical analysis of medical images. Therefore, in this work the performance of the more popular non-parametric bootstrap as applied to raw and uncorrected PET list-mode data is assessed in a novel way, using a systematic and rigorous comparison of distributions generated through resampling at different statistical levels. In particular, the Jensen–Shannon divergence between two corresponding distributions is used to assess how the bootstrapped distributions diverge from reference distributions, here generated and treated as a gold standard. The reference distributions came from long PET acquisitions of one slice of a novel phantom, ensuring very high statistics for the slice voxels whilst accurately simulating the spatial distribution of either FDG or raclopride in the brain. A real raclopride PET dataset was also used to confirm the findings from the assessment of the bootstrap resampling using the phantom data. The non-parametric bootstrap was chosen as it does not require any knowledge of, or assumption about, the distribution which gave the observed data, and it is therefore less restricted across different applications (Buvat 2002).
The proposed assessment of the bootstrap method can be carried out in more than one way and on many levels but this work focuses on the bootstrap resampling as applied to the data from which medical images or parameters are formed. It is possible though to apply the results of this study to most bootstrap resampling schemes beyond medical imaging.

Theory and methods
In order to assess the bootstrap resampling, a set of reference realisations (treated as the gold standard) was generated to form sampling distributions with which corresponding bootstrap distributions can be compared. The reference distributions of this study have to be such as to allow a wide range of statistical levels to be investigated. This could potentially be achieved in three ways: (i) through multiple scan realisations of the same object with realistic distributions of the same radioactivity fixed at exactly the same position for all scans; (ii) accurate Monte Carlo simulations generating such independent realisations; or (iii) using one very long list-mode acquisition of a realistic activity distribution with high radioactivity concentration to ensure a very high statistical level and good representation of the underlying distribution. Based on this list-mode acquisition, which takes the role of the underlying distribution, multiple reference realisations (corresponding to real-world multiple measurements) can be generated by sampling with replacement, with the statistical level varied by changing the size of the sampled realisations. This is similar to bootstrap resampling, with the difference that the size of the newly generated samples changes according to the desired statistical level, as indicated in Haynor and Woods (1989) and Groiselle and Glick (2005).
Points (i) and (ii) for the assessment of the bootstrap in this work are impractical as generating such data through physical scans is close to impossible whereas using Monte-Carlo simulations would take an even longer time than the already long acquisition times required for multiple realisations (Buvat 2002). Therefore in this work strategy (iii) has been adopted.

Long acquisition scans
As a basis for the generation of the reference distributions, two long PET scans of a simulated brain slice were acquired on the high-resolution research tomograph (HRRT, Siemens; de Jong et al 2007) using a specially designed phantom. This phantom enables printed planar sources of any radioactivity distribution to be imaged in transaxial or axial planes. Two digital phantoms based on the realistic digital MRI brain phantom (Aubert-Broche et al 2006), publicly available at www.bic.mni.mcgill.ca/brainweb/, are considered here; that is, two discrete models are used, based on subjects #18 and #20 from the public database. The structural discrete models of grey matter (GM), white matter (WM), the meninges and striatal regions (putamen, caudate) are used to define two digital PET phantoms with values corresponding to two distinct spatial distributions of two common radiotracers: [18F]FDG (subject #18) and [11C]raclopride (subject #20). The two planar distributions (see figure 1), after a long acquisition of 6 h (with a total radioactivity of 21.6 MBq in the plane at the beginning of the scan) and 5 h (with a total radioactivity of 7.9 MBq), respectively, were reconstructed with the ordinary Poisson ordered subsets expectation maximisation algorithm (OP-OSEM; Comtat et al 2004) with inclusion of a model of the point spread function (PSF) of the scanner (Reader et al 2003).
One of the reasons why such planar phantoms are used is a greater radioactivity concentration in one image plane (and thus good statistical representation within that plane) as opposed to volumetric phantoms for which such high radioactivity concentration in each plane could cause a reduction in count-rate performance of the detectors or even damage them. Further, with the digital planar phantom it is possible to investigate the impact that different radioactivity distributions may have on the accuracy of bootstrap resampling.
The findings based on the two planar phantoms are then compared with a real human raclopride scan (figure 1(c)). It is worth noting the superior statistical quality of the reconstructed images from the two phantoms compared to the real human scan shown in the figure.

The bootstrap
The bootstrap is a tool used in statistical inference to ascertain the precision and/or accuracy of a statistic of interest θ by generating a set of K statistically equivalent datasets based on one or more original datasets. θ can be a voxel value, the mean voxel value of any region of interest (ROI) or any statistic based on the reconstructed image of the measured PET data x, i.e.

θ = s(x),    (1)

where x = {x_1, x_2, ..., x_n} and n is the data size (the number of recorded events in the list-mode data). One bootstrap sample x* of size n is generated by resampling with replacement from the original data x, which is regarded as representing the underlying distribution. It is then possible to estimate parameters of the sampling distribution, such as the standard error or bias of θ, from the K bootstrap replicates θ* = s(x*).
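In code, this resampling scheme can be sketched as follows; the statistic s here is a plain function (the sample mean in the toy usage), standing in for the full histogramming, reconstruction and ROI-extraction pipeline denoted s(x) in the text.

```python
import numpy as np

def bootstrap_statistic(x, s, K=100, rng=None):
    """Non-parametric bootstrap: resample the n recorded events with
    replacement K times and evaluate the statistic s on each replicate,
    then estimate the standard error and bias of s(x)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    theta = s(x)                                 # statistic on original data
    theta_star = np.empty(K)
    for k in range(K):
        x_star = x[rng.integers(0, n, size=n)]   # one bootstrap sample x*
        theta_star[k] = s(x_star)
    std_err = theta_star.std(ddof=1)             # bootstrap standard error
    bias = theta_star.mean() - theta             # bootstrap bias estimate
    return theta_star, std_err, bias

# toy usage: Poisson "events" stand in for list-mode data, s is the mean
events = np.random.default_rng(0).poisson(5.0, size=10_000).astype(float)
theta_star, se, bias = bootstrap_statistic(events, np.mean, K=200, rng=1)
```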

Two stage resampling
One of the key aspects of this work is the two-stage non-parametric resampling used to assess the accuracy of the bootstrap resampling. The first resampling stage (with replacement) plays the role of the real world, where multiple realisation scans of the same object would be performed. The second stage is the bootstrap, which takes one of the realisations from the first stage to generate a bootstrap distribution equivalent to that of the first stage.

[Figure 1 caption (partial): ...and from a real raclopride scan of the human brain (c). The list-mode data used for reconstructing the images were used in the resampling procedure, which generated multiple image replicates. A number of regions of interest (ROIs) and voxels at the centres of the ROIs in different brain regions were analysed. The ROIs and voxels were taken from grey matter (g), white matter (w), striatum (s) and the meninges (m).]

Stage I. World of multiple realisations.
The resampling in this stage is very similar to the bootstrap (see section 2.3.2), with the only difference that the number of events in the resampled dataset is lower than in the original long scan (apart from one case used for comparison, where the resampled datasets in this stage are of the same size as the original dataset). The scheme of the resampling is presented in figure 2, in which the probability model P, representing the population with its underlying probability distribution of the infinite number of events required to describe the spatial distribution of the radiotracer, gives the observed PET data of the long acquisitions. A crucial step in both resampling stages is the estimation of P by P̂, the empirical distribution function constructed from the measured list-mode data by putting probability 1/n on each recorded event (true, scattered, random and delayed coincidences recorded in the list-mode data), where n is the number of all recorded events in the list-mode data. Hence, a new sample x* can be generated from the estimated probability model P̂ with replacement in exactly the same way as the original data were generated from P. In this way a number of new list-mode datasets can be generated with different total numbers of events m (see figure 2), corresponding to six statistical levels S (sizes of the generated datasets) expressed as percentages of the original long list-mode dataset: S = {0.01%, 0.1%, 1%, 10%, 20%, 100%}. For the statistical level of 100% the total number of detected true coincidence events was 8.8 × 10^9 for the FDG phantom and 3.1 × 10^9 for the raclopride phantom. For the chosen image slice of the measured clinical raclopride scan the estimated number of true coincidence events was 28 × 10^6.
The resampling of the events as recorded in the observed list-mode data is done in a non-parametric way within short time periods (of approximately 1 s) of the list-mode data in order to maintain the dynamic changes of the underlying distribution of the radiotracer. Such resampling of the statistically very rich list-mode dataset can be viewed as repeated scans of the same object, and here it is repeated 2K = 2 × 100 times (the factor of 2 is included because the distributions formed from 100 samples are then compared to each other in order to expose any differences between the reference distributions themselves). It is also worth noting that any division of the long acquisition data into smaller independent datasets would be suboptimal for this assessment, as such breaking up of the data would destroy the original probability model P in a systematic way.

[Figure 2 caption: Resampling with replacement is carried out after estimating the probability model P by the empirical distribution function P̂ based on the observed data. The observed data of size n are thus resampled with replacement to generate new datasets of size m. If m = n then the resampling is equivalent to the bootstrap. For each new dataset the same calculation s(x*) can be performed as for the original dataset, s(x).]
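The time-windowed stage I resampling described above can be sketched as below; the ~1 s windows and the synthetic timestamped events are illustrative assumptions, not the actual HRRT list-mode format.

```python
import numpy as np

def stage1_realisation(timestamps, events, frac, window=1.0, rng=None):
    """Stage I sketch: draw one reduced-size reference realisation
    (a fraction frac of all events) from a long list-mode acquisition by
    resampling with replacement inside short time windows (~1 s), so the
    dynamic changes of the tracer distribution are preserved."""
    rng = np.random.default_rng(rng)
    edges = np.arange(timestamps.min(), timestamps.max() + window, window)
    pieces = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.flatnonzero((timestamps >= lo) & (timestamps < hi))
        m = int(round(frac * idx.size))          # reduced size m per window
        if m:
            pieces.append(events[rng.choice(idx, size=m, replace=True)])
    return np.concatenate(pieces) if pieces else events[:0]

# toy usage: 10 s of synthetic timestamped "events"
rng = np.random.default_rng(0)
ts = rng.uniform(0.0, 10.0, size=10_000)
ev = rng.poisson(5.0, size=10_000).astype(float)
sub = stage1_realisation(ts, ev, frac=0.1, rng=1)
```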

Stage II. World of bootstrap.
For each of the six statistical levels, a single realisation from stage I is chosen for further use in the second stage (II) of resampling to generate K bootstrap datasets which are statistically equivalent to the one chosen from stage I. Importantly, in this stage of bootstrap resampling the probability model (figure 2) is estimated based on the data of the chosen realisation from the first resampling stage, and, crucially, not on the empirical distribution P̂ which gave the stage I realisations and which is known from the rich measured dataset. Therefore, in the bootstrap resampling (stage II) the underlying distribution is estimated by P̂* through assigning a probability of 1/n to each event of the chosen stage I list-mode dataset x*_k. A number K of bootstrap replicates is then generated through random sampling with replacement from P̂*, with exactly the same number of events n as in x*_k (m = n in figure 2). The bootstrap resampling is carried out for all six statistical levels S = {0.01%, 0.1%, 1%, 10%, 20%, 100%} of stage I. This stage of bootstrap resampling is repeated five times by choosing five independent realisations of stage I in order to facilitate a more robust comparison of the distributions of the two stages.
Thus, two statistically equivalent sets of list-mode data are generated in the two stages differing in that the reference datasets (stage I) are resampled from a high statistics dataset whereas the bootstrap datasets (stage II) are resampled from the reference datasets of lower statistics.
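Putting the two stages together, a minimal sketch of the comparison loop (with the sample mean again standing in for the full image-formation operator s):

```python
import numpy as np

def two_stage_distributions(x_long, s, frac, K=100, rng=None):
    """Sketch of the two-stage scheme. Stage I: draw K reference datasets
    of reduced size m = frac*n with replacement from the long acquisition
    x_long. Stage II: pick one stage I realisation x_k and bootstrap it
    K times (same size m), giving the distributions R and B."""
    rng = np.random.default_rng(rng)
    n = len(x_long)
    m = int(frac * n)
    # stage I: reference realisations resampled from the long scan
    R = np.array([s(x_long[rng.integers(0, n, size=m)]) for _ in range(K)])
    # stage II: bootstrap replicates of a single stage I realisation
    x_k = x_long[rng.integers(0, n, size=m)]
    B = np.array([s(x_k[rng.integers(0, m, size=m)]) for _ in range(K)])
    return R, B

# toy usage: the sample mean stands in for the image-formation operator s
x_long = np.random.default_rng(2).poisson(5.0, size=100_000).astype(float)
R, B = two_stage_distributions(x_long, np.mean, frac=0.01, K=50, rng=3)
```

Note how B is anchored to the single realisation x_k rather than to x_long, which is precisely the source of the systematic shift discussed in the results.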

Processing and analysis of the generated datasets
The pathway of data analysis for the two resampling stages is presented graphically in figure 3. Each dataset of stages I and II is histogrammed into sinograms and corrected in the same way. Since the phantom generates negligible scatter (below 2% of all events), scatter correction is not applied, apart from the dataset of the human brain scan with raclopride. The random coincidence events are estimated using measured delayed coincidences (which for the bootstrap replications are also resampled). The list-mode datasets (recordings of individual PET events) are histogrammed and then reconstructed with 10 iterations and 16 subsets of OP-OSEM with the modelled PSF of the HRRT scanner (Hong et al 2007, Comtat et al 2008). In this assessment single voxels and circular ROIs at nine different locations were considered (the locations are different for the FDG and raclopride scans), as shown in figure 1. For the raclopride images (phantom and human datasets) three ROIs are considered in the striatal regions, three in the grey matter (GM) and three in the white matter (WM). For the FDG image four GM ROIs, four WM ROIs and one ROI in the meninges (close to the skull) are considered. Nine voxels are considered for each image, at the centre of each of the nine ROIs. The three steps of histogramming, reconstruction with corrections, and analysis of the ROIs and voxels are included in the operator s in equation (1).

[Figure 3 caption: The statistic θ here is the voxel value or the mean voxel value of an ROI, equation (1). Through resampling in the two stages two corresponding distributions (R and B) are formed, which are then compared using the Jensen–Shannon divergence (section 2.4.1).]

Jensen–Shannon divergence.
The Jensen–Shannon (J–S) divergence D_JS was used in order to quantify how different or 'distant' from each other the equivalent sampling distributions from stage I (R, based on the full dataset) and stage II (B, based on bootstrap resampling) are. The J–S divergence is derived from the Kullback–Leibler (K–L) divergence D_KL and is defined by

D_JS(R||B) = (1/2) D_KL(R||M) + (1/2) D_KL(B||M),    M = (R + B)/2,

where, for histogrammed distributions, D_KL(R||M) = Σ_i R(i) log2(R(i)/M(i)). The reason for using the J–S divergence is that it is symmetric, as opposed to the K–L divergence, i.e. D_JS(R||B) = D_JS(B||R), whereas in general D_KL(R||B) ≠ D_KL(B||R). The properties of the J–S divergence (with base-2 logarithms) are such that when the two distributions R and B are totally different (e.g. there is no overlap) D_JS is 1 (100%), and when the distributions are exactly the same D_JS is 0 (0%).
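A compact numerical sketch of the divergence defined above, applied to two histogrammed distributions:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms, using base-2
    logarithms so the result lies in [0, 1] (0: identical, 1: disjoint)."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    m = 0.5 * (p + q)                       # mixture distribution M

    def kl(a, b):                           # D_KL(a || b), with 0*log 0 := 0
        nz = a > 0
        return float(np.sum(a[nz] * np.log2(a[nz] / b[nz])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# identical histograms give 0; non-overlapping histograms give 1
d_same = js_divergence([10, 20, 30], [10, 20, 30])
d_disjoint = js_divergence([1, 0], [0, 1])
```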

Two methods of histogramming.
In order to apply the J–S divergence to the sampling distributions R and B of the two stages, the distributions of the ROI and voxel values have to be histogrammed (note that this is a different histogramming procedure to that of histogramming PET coincidence events into sinograms). There are two possible ways of achieving this: (i) constant binning, where the bin size for all statistical levels (data sizes) is kept constant, and (ii) adaptive binning, where the bin size varies according to the size of the datasets at each statistical level, so that the effects of the data size are normalised out in the distribution comparison. In particular, the bin size b_size in the adaptive histogramming is

b_size = r_max / (b_n S),

where r_max is the maximum range of θ-values in the sampling distributions over the different statistical levels, S is the statistical level expressed as a fraction of the original data size (S = {0.01%, 0.1%, 1%, 10%, 20%, 100%}) and b_n is the number of bins. This way the product b_size·S is constant for all statistical levels. The difference between the two histogramming methods is depicted in figure 4.
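A sketch of the two binning rules; the adaptive form below simply encodes the stated requirement that the product b_size·S stay constant across statistical levels (the function names and example numbers are illustrative assumptions):

```python
import numpy as np

def constant_bin_size(r_max, b_n):
    """Constant binning: the same bin width at every statistical level."""
    return r_max / b_n

def adaptive_bin_size(r_max, b_n, S):
    """Adaptive binning (assumed form): scale the bin width by 1/S so
    that the product b_size * S stays constant across statistical
    levels, as stated in the text."""
    return r_max / (b_n * S)

# the six statistical levels S expressed as fractions of the full data
levels = np.array([1e-4, 1e-3, 1e-2, 0.1, 0.2, 1.0])
products = np.array([adaptive_bin_size(100.0, 65, S) * S for S in levels])
```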

Analysis of moments.
For greater insight into the potential differences between the distributions of stages I and II, further measures of the distributions are used. These are the mean (the first raw moment), the coefficient of variation (CoV, related to the variance, i.e. the second central moment), skewness (the third standardised moment) and kurtosis (the fourth standardised moment, measuring the peakedness of the distributions). The moments were investigated in two ways, i.e. with and without normalisation by the statistical level of the considered datasets. The two-sample t-test was applied to the estimated moments to test whether the bootstrap-reference and reference-reference absolute differences of the moments are significantly different. If the differences were statistically insignificant, it would indicate that the bootstrap replicates the parameters well, matching the performance of the reference distributions.
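The four distribution measures and the two-sample t-test on their absolute differences can be sketched as follows; the six difference values in the usage example are hypothetical placeholders, not results from this work.

```python
import numpy as np
from scipy import stats

def distribution_moments(theta):
    """Distribution measures used in the comparison: mean (first raw
    moment), coefficient of variation (from the second central moment),
    skewness (third standardised moment) and kurtosis (fourth
    standardised moment; non-Fisher, so a Gaussian gives 3)."""
    theta = np.asarray(theta, float)
    return {
        "mean": theta.mean(),
        "cov": theta.std(ddof=1) / theta.mean(),
        "skewness": stats.skew(theta),
        "kurtosis": stats.kurtosis(theta, fisher=False),
    }

# hypothetical absolute differences of a moment across six statistical
# levels: bootstrap-reference versus reference-reference
d_boot = np.array([0.31, 0.12, 0.05, 0.02, 0.015, 0.01])
d_ref = np.array([0.10, 0.04, 0.015, 0.006, 0.005, 0.003])
t_stat, p_val = stats.ttest_ind(d_boot, d_ref)
```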

Optimal number of bootstrap samples and bins.
For optimal analysis using the J-S divergence, the impact of the number of bootstrap samples, number of bins used in the histogramming, the statistical level of the datasets and the choice of ROIs/voxels on the variability of the J-S divergence is assessed. First, a linear model was fitted to the responses of the J-S divergence explained by the above predictors: the number of bootstrap samples, histogram bins and statistical level of the datasets were treated as continuous variables whereas the ROIs/voxels were treated as a categorical variable. Analysis of variance based on the linear regression was performed to gain a rough estimate of the factor mostly influencing the J-S divergence. Next, the optimal number of bootstrap samples and bins is chosen based on investigation of the specific relationships between the J-S divergence and the number of bootstrap samples as well as histogram bins.
For the investigation of the impact of the number of bootstrap datasets (samples) K on the J–S divergence, K was varied linearly from 10 to 100 samples in steps of 10. Similarly, the number of histogram bins b_n was varied from 2 to 513 in exponential steps (b_n = 2^n + 1, n = 0, 1, ..., 9). The statistical level of the datasets was S = {0.01%, 0.1%, 1%, 10%, 20%, 100%} and the choice of ROIs was treated as a categorical variable with nine categories corresponding to the nine ROI positions (figure 1).

[Figure 4 caption: Note that by adapting the bin size for the high-statistics case the J–S divergence D_JS is comparable to the divergence for the low-statistics case. Note also that the distributions on the left and right are the same, the only difference being that the top distributions have a different representation in the histogrammed form.]

Results and discussion
The resampling of the two stages (see section 2.3) generates two distributions for each chosen statistical level (data size) and each ROI and voxel (figure 1). The resampling in the first stage generates the reference distributions, which here play the role of multiple realisations acquired by scanning the same object many times. Therefore, they are considered to be a very good approximation of the gold standard. The bootstrap resampling of the second stage takes one of the reference realisations and resamples it with replacement to generate statistically equivalent datasets whose distribution is expected to estimate the reference sampling distribution. Figure 5 shows the two equivalent distributions plotted next to each other (the reference distributions are represented by black boxplots, the bootstrap distributions by grey boxplots) for the nine ROIs/voxels and two data sizes (0.01% and 1%) of the raclopride planar phantom (figure 1(a)). It is readily noticed that the statistical level has a very significant influence on all the distributions. Further, as expected, the voxel distributions have greater dispersion due to the considerably smaller sample size compared to the ROIs. Another observation from figure 5 is that, although the boxplot distributions have similar dispersions, their medians are visibly different: the lower the statistical level, the greater the difference. Also, in the extreme case of voxels and a data size of 0.01% the dispersions can differ as well. Figure 6 plots the same distributions but for the raclopride human dataset and for data sizes of 1% and 100%. The statistical richness of the phantom dataset as opposed to the human dataset can be noticed in terms of the dispersions.
The differences between the reference and bootstrap distributions noticed in figures 5 and 6 can be measured using the J–S divergence by assigning a value to the 'distance' between any two equivalent distributions. Before such a measure can be made, the sampling distributions of the statistic (voxel and mean ROI values) have to be histogrammed onto a common set of bins. Two binning methods were proposed (see section 2.4.2): (i) constant, for which the bin size remains the same for all statistical levels, and (ii) adaptive, whose bin size adapts to the statistical level of a given dataset. Figure 7 plots the J–S divergence for the distributions generated for the raclopride phantom and for the two methods of binning the distributions. For each binning method (top and bottom graphs) there are four J–S divergence curves plotted against the six statistical levels (data sizes expressed in percentages of the original dataset). Two curves are plotted for the J–S divergence between two equivalent reference distributions of size K = 100 (therefore 2K reference realisations were generated), for ROIs and voxels. These curves (denoted 'Reference' in the figure) represent the reference point to which the other two curves (denoted 'Bootstrap' in the figure) should come close. These two 'Bootstrap' curves represent the divergence between the equivalent bootstrap and reference distributions. In order to make the 'Bootstrap' curves smoother, the bootstrap resampling in stage II was repeated five times and the J–S curves for each resampling procedure were found and then averaged over the five repetitions. Similarly, figures 8 and 9 present the same plots of the J–S divergence but applied to the other two datasets: the FDG planar phantom (figure 8) and the raclopride human scan (figure 9).
It becomes clear that there is a systematic error present in the bootstrap resampling, which appears constant when the effect of the statistical level of the datasets is taken out through the automatic normalisation present in the adaptive histogramming. Further, when constant binning is used, and thus the effects of the statistical level are present in the analysis, the J–S divergence, as expected, decreases and all the curves come closer to each other, which is especially apparent for the raclopride phantom dataset (figure 7). Note that the pattern of difference between the bootstrap and reference curves is present for all three original datasets. The reason that the difference is still observable even at the statistical level of 100% is that the bootstrap realisations in stage II are based on one noisy realisation of stage I (this is true for all statistical levels), and hence the bootstrap realisations are more representative of that particular realisation of stage I than of the remaining stage I realisations. Furthermore, since the stage I realisations diverge from the original data by an amount related to the number of counts resampled, the J–S divergence remains unchanged when the bin size is changed according to the statistical level S.
To further assess why such differences in the J–S divergence exist for the bootstrap distributions, the differences in the moments (first to fourth) of the reference and their corresponding bootstrap distributions were analysed. It was found that, in terms of the third and fourth moments (the skewness and kurtosis, respectively), there was no statistically significant difference between the equivalent distributions of stages I and II for the chosen binning resolution used for generation of the distributions. However, when the first raw moment (the mean) was considered for the striatal regions of the brain, an interesting effect was found: the absolute difference of the mean is similar in shape to that found in the analysis of the J–S divergence. That is, the two curves of the difference of the mean (unnormalised to the statistical level) for the bootstrap-reference and reference-reference distributions behave similarly to the two curves of the J–S divergence for the bootstrap-reference and reference-reference distributions generated with constant binning. Figure 10 shows the difference between the means of the sampling distributions generated in the two resampling stages for ROIs (left) and voxels (right) for the raclopride phantom dataset, similarly to the presentation of the J–S divergence (e.g. figure 7).
The percentage change of the mean difference between the bootstrap and reference distributions for the striatal regions (an important measure in raclopride studies) was observed to be up to 14% for the ROIs and 33% for the voxels at the lowest statistical level (0.01%, which, given the count level of the phantom measurement, corresponds to the real count levels in some time frames of dynamic acquisitions).
For further analysis, the difference of the mean for the six statistical levels was normalised by the statistical level, expressed as a percentage of the original dataset (top of figure 10). Although the richness of the dataset at the higher statistical levels makes the distributions very close to each other in absolute terms (bottom plots), in relative terms they are different, which is confirmed by the t-test indicating that the curves are statistically different with p ≪ 0.01. Therefore, it is more informative to look at the difference normalised by the statistical richness (top graphs of figure 10), for which the difference remains constant, similarly to the J-S divergence with adaptive binning. The normalised mean difference drops slightly for very small data sizes, possibly due to very poor statistical representation and the non-negativity constraint in the OSEM image reconstruction.
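The inverse-square-root dependence of the absolute mean shift on the number of counts can be illustrated with a toy simulation. A hypothetical Gaussian "population" stands in for the phantom data here; all numbers are illustrative assumptions, not the study's values:

```python
import random
import statistics

random.seed(3)

def mean_shift(n, trials=200):
    """Average |realisation mean - population mean| over realisations of n samples
    drawn from a hypothetical Gaussian population (mean 100, sigma 20)."""
    return statistics.fmean(
        abs(statistics.fmean(random.gauss(100, 20) for _ in range(n)) - 100)
        for _ in range(trials))

# The shift shrinks roughly as 1/sqrt(n): quadrupling n halves the shift.
for n in (100, 400, 1600):
    print(n, mean_shift(n))
```

Normalising the shift by the square root of the statistical level therefore yields an approximately constant relative error, mirroring the constant curves seen with adaptive binning.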
Since bootstrap resampling is based on one limited dataset drawn from a given population (here, one realisation from stage I), the bootstrap-estimated mean converges to the mean of that one realisation. Hence, the estimated mean is more representative of that realisation and, depending on how statistically rich the realisation is, may be significantly distant from the mean of the population from which the noisy realisation was drawn.
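This behaviour is easy to reproduce with a toy example (hypothetical Gaussian data rather than PET counts): the bootstrap distribution of the mean centres on the single realisation's mean, not on the population mean:

```python
import random
import statistics

random.seed(0)
population_mean = 100.0  # hypothetical ground truth
# One noisy "stage I" realisation of 200 samples drawn from the population.
realisation = [random.gauss(population_mean, 15.0) for _ in range(200)]

def bootstrap_means(data, k=1000):
    """Means of k bootstrap replicates drawn with replacement from one dataset."""
    n = len(data)
    return [statistics.fmean(random.choices(data, k=n)) for _ in range(k)]

boot = bootstrap_means(realisation)
# The bootstrap mean tracks the realisation's mean, which itself deviates
# from the population mean by an amount set by the realisation's size.
print(statistics.fmean(boot), statistics.fmean(realisation), population_mean)
```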
Analogously to the above analysis of the mean, the coefficient of variation (CoV) for the bootstrap and reference distributions is shown in figure 11. Both the normalised and unnormalised absolute differences of the CoV are presented for the chosen statistics of ROI and voxel values. The two-sample t-test was applied to the six points of the two CoV curves to investigate whether the points representing (1) the absolute difference between bootstrap and reference distributions are significantly different from (2) the points representing the difference between two independent reference distributions for the six statistical levels. This makes it possible to infer when the differences between (1) and (2) are not merely due to limited data. The p-values of the test are the same for the normalised and unnormalised data and are shown at the top of the plots for each of the six points. Small p-values (e.g. below 0.05) indicate that the points are likely to be significantly different. Note that the raw and normalised CoV differences presented in figure 11 correspond to the J-S divergence for constant and adaptive binning, respectively, as shown in figures 8-9.
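As a minimal illustration of the CoV comparison (again with a hypothetical Gaussian population standing in for the phantom data; all parameters are assumptions), a bootstrap-estimated CoV of a mean value can be checked against a reference CoV obtained from many independent realisations:

```python
import random
import statistics

def cov(xs):
    """Coefficient of variation: sample standard deviation over mean."""
    return statistics.stdev(xs) / statistics.fmean(xs)

random.seed(1)
n = 500  # samples per realisation, standing in for the "statistical level"
# Reference ("stage I" analogue): many independent realisations from the population.
ref_means = [statistics.fmean(random.gauss(100, 20) for _ in range(n))
             for _ in range(300)]
# Bootstrap ("stage II" analogue): replicates from a single realisation.
one = [random.gauss(100, 20) for _ in range(n)]
boot_means = [statistics.fmean(random.choices(one, k=n)) for _ in range(300)]
# At a reasonable statistical level the two CoV estimates agree closely.
print(cov(ref_means), cov(boot_means))
```

At much smaller `n` the single realisation represents the population poorly and the agreement degrades, consistent with the significant differences found at the 0.01% data size.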
The presented test data show that the absolute CoV differences of the bootstrap-reference distributions can be considered significantly different from those of the reference-reference distributions for statistically poor acquisitions (data size of 0.01%) for both voxels and ROIs. Since voxels are considerably smaller than ROIs, voxels are more likely to exhibit significant differences (see the right plots for the data size of 10%). The cases of significant differences show that the bootstrap estimation of the CoV is likely to fail, as the absolute difference between bootstrap and reference distributions cannot match the absolute CoV differences of the gold standard. This is especially evident for the data size of 0.01%, which for this raclopride phantom measurement corresponds to around 0.3 × 10⁶ counts. Importantly, this is of the same order of magnitude as the number of true coincidence counts emitted from the image slice in early time frames of the human raclopride scan. From the foregoing analysis of the J-S divergence and of the differences of the mean and coefficient of variation, it is clear that the bootstrap cannot fully replicate the sampling distribution of a statistic of interest for PET data across the different datasets. The main reason appears to be a systematic shift of the mean of the bootstrap distributions relative to the reference distributions. This effect has also been observed by other researchers (Lartizien et al 2010).
The reason for this effect is that only one noisy realisation (measurement or simulation) is used in bootstrap resampling, so the estimated mean is representative of that single noisy realisation (Dahlbom 2002) and not of the underlying population. Furthermore, such single realisations are rarely representative of the population but nevertheless play the role of the underlying population distribution (because this is all that is available for analysis). Therefore, the estimated mean based on the reference distributions, which are sampled from a much richer dataset, will be different and more accurate. Groiselle et al (2003) found that only for very low statistical levels of their dataset were the means of the given statistic significantly different from the means found using Monte Carlo simulations. This agrees with the results shown in figure 10 (bottom) for the unnormalised mean difference, where for higher statistical levels the performance of the bootstrap is very close to that of the reference distributions in absolute terms. Similarly, Lartizien et al (2010) reported that the mean estimated by bootstrap resampling of a low-statistics dataset does not match the mean estimated from multiple realisations of a Monte Carlo simulation. However, the bootstrap-estimated mean matched the Monte Carlo estimated mean closely if the bootstrap resampling was based on more than one dataset, thus increasing the statistical level of the resampled data. Nevertheless, as shown in this work, the difference in the mean can be observed in relative terms for all statistical levels (data sizes) when the statistical level is normalised out. Therefore, it is important to note that if, in a given study, higher-statistics measurements are used to observe effects of a smaller size, the difference in the mean would become evident again, as shown using the J-S divergence with adaptive binning and the normalised difference of the mean.
Comparatively, when estimating the variance the bootstrap performs much better, achieving good accuracy; this has also been reported when estimating the noise properties of image reconstruction (Buvat 2002, Dahlbom 2002, Groiselle and Glick 2005, Lartizien et al 2010). It is interesting to note that the shift of the bootstrap distributions in terms of the mean is relative to the statistical richness of the dataset; that is, the distributions become narrower at higher statistical levels and so come closer to each other in absolute terms, but relative to the statistical level the error remains approximately constant. It is also worth noting that even when resampling datasets at the 100% statistical level the bootstrap still fails to estimate the mean relative to the statistical level (as seen in the J-S divergence with adaptive binning and in the analysis of the mean and CoV normalised to the data size), although in absolute terms the error decreases, as seen with the J-S divergence with constant binning and in the analysis of the mean and CoV without such normalisation.

Optimisation of the assessment of the bootstrap
Through a simple linear regression and the analysis of variance (ANOVA, table 1) it was found that, for adaptive binning, the number of bins b_n has a considerably greater impact on the J-S divergence than the number of bootstrap datasets K, whereas for constant binning the statistical level plays a greater role, as expected. All results were highly significant (p ≪ 0.01). The reason that the number of bins b_n (and hence their width) has such a significant impact on the divergence is that two different distributions can be made very similar by binning them with a small b_n (wide bins), for which the J-S divergence of the two distributions will tend to be small. However, if b_n is chosen large (narrow bins), the histogram representations of the two distributions will be dissimilar, capturing more differences, and hence the J-S divergence will be comparatively larger. The impact of b_n on the J-S divergence is even greater for adaptive binning as, in addition to the above effect, the bin size varies according to the statistical level. The statistical level S has little impact on the divergence for adaptive binning because the bin size is changed according to S, normalising it out in the analysis of variance. On the other hand, S has a considerable impact for constant binning, since there is no such normalisation and for low statistical levels the distributions tend to be more dispersed than for higher statistical levels (see figures 4 and 5). It is worth noting that although the above relationships may not be linear in the chosen predictors, the ANOVA for the linear regression provides a useful overview of them. The relationship between the J-S divergence and the number of bins used in the adaptive binning is shown in figure 12. The dispersion of D_JS due to other factors, such as the statistical level and ROI position, is shown using boxplots.
The relationship is presented for two cases: (i) the J-S divergence between two independent reference distributions generated in stage I (black boxplots) (this is why 2K random samples were generated in stage I); and (ii) the J-S divergence between the reference (stage I) and bootstrap (stage II) distributions (grey boxplots). As expected, the figure confirms that the fewer bins are used, the more similar the distributions become (low values of the J-S divergence). To better differentiate the reference distributions from the bootstrap distributions, the number of bins is best chosen between 10 and 30; in this work, 20 bins were used for adaptive binning. The number can be estimated similarly for constant binning, but since the bin size is kept constant, a greater number (50) was chosen in order to capture the variability across all statistical levels.
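The effect of b_n can be reproduced with a toy sketch: because coarse bins merge away the differences between histograms, the J-S divergence between two genuinely different sample sets grows with the number of bins. The Gaussian samples and parameters below are illustrative assumptions, not the study's data:

```python
import math
import random

def jsd_binned(xs, ys, n_bins):
    """J-S divergence (base 2) of two sample sets histogrammed over their joint range."""
    lo, hi = min(xs + ys), max(xs + ys) + 1e-9
    width = (hi - lo) / n_bins
    def hist(s):
        counts = [0] * n_bins
        for v in s:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        return [c / len(s) for c in counts]
    p, q = hist(xs), hist(ys)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

random.seed(4)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [random.gauss(0.5, 1.0) for _ in range(2000)]  # shifted distribution
for n_bins in (2, 10, 30, 100):
    print(n_bins, jsd_binned(xs, ys, n_bins))  # divergence grows with n_bins
```

This is a general property of f-divergences: coarsening (merging bins) can only reduce the divergence, which is why very wide bins make even different distributions look alike.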
For a valid assessment of the bootstrap resampling, the number of bootstrap samples has to be large enough for accurate estimation of the properties of the sampling distribution. The impact of the number of bootstrap replicates (samples) on the J-S divergence is plotted in figure 13. As can be seen, there is considerably less change in the J-S divergence once the number of bootstrap samples approaches 100. The divergence might drop further with more samples, but only at a much larger computational cost. In this study the number of bootstrap replicates was chosen as K = 100 for the whole assessment.
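The stabilisation with K can be illustrated with a simple sketch (hypothetical Gaussian data, not the study's list-mode data): a bootstrap estimate of a sampling-distribution property, here the standard error of the mean, fluctuates for small K and settles as K approaches 100-200:

```python
import random
import statistics

random.seed(2)
data = [random.gauss(50, 10) for _ in range(1000)]  # one hypothetical realisation

def boot_std_of_mean(data, k):
    """Bootstrap estimate of the standard error of the mean using k replicates."""
    n = len(data)
    means = [statistics.fmean(random.choices(data, k=n)) for _ in range(k)]
    return statistics.stdev(means)

# The estimate converges towards sigma/sqrt(n) ~ 10/sqrt(1000) ~ 0.32 as k grows.
for k in (10, 25, 50, 100, 200):
    print(k, boot_std_of_mean(data, k))
```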

Conclusion
The method of bootstrap resampling is gaining momentum owing to today's lower computational and storage costs, as well as to its significant value for medical image analysis (in research and in clinical practice) in estimating the errors, robustness, noise properties, etc, of reconstructed images and of other estimates obtained from the observed data.
A key aspect of this work is the generation of the reference distributions through resampling with replacement of the statistically very rich datasets obtained by scanning specifically designed phantoms. Each rich dataset was then resampled into a set of datasets of different statistical levels. The long acquisitions were appropriately treated as the population, that is, the underlying true distribution from which all other samples can be drawn.
If bootstrapping is to be used for characterising the distribution of a given statistic derived from PET data (or, equally, from any other nuclear medicine imaging modality), errors occur, primarily in terms of the mean but also in terms of the absolute distribution as measured by the J-S divergence. The extent to which the error of the mean will affect further statistical inference depends on the effect size being investigated in a given PET or SPECT study: if the observed effect size is significantly larger than the error (which depends on the statistical richness of the dataset), then the bootstrap resampling will produce acceptably accurate estimates.
In this assessment, reference distributions were used as a gold standard against which the bootstrap distributions were compared. Based on this comparison, it was found that the bootstrap should be used with caution due to its limitations, especially when estimating the first moment (the mean). For higher-order moments, e.g. the variance, the bootstrap method can be accurate if the datasets are reasonably statistically rich. Hence, in the early time frames of dynamic acquisitions, the bootstrap can fail to estimate the variance accurately due to the poor statistical level in such frames. Therefore, when estimating some aspects of sampling distributions, care has to be taken in order to avoid erroneous findings supported by an inappropriate use of the bootstrap.