Observer-study-based approaches to quantitatively evaluate the realism of synthetic medical images

Abstract Objective. Synthetic images generated by simulation studies have a well-recognized role in developing and evaluating imaging systems and methods. However, for clinically relevant development and evaluation, the synthetic images must be clinically realistic and, ideally, have the same distribution as that of clinical images. Thus, mechanisms that can quantitatively evaluate this clinical realism and, ideally, the similarity in distributions of the real and synthetic images, are much needed. Approach. We investigated two observer-study-based approaches to quantitatively evaluate the clinical realism of synthetic images. In the first approach, we presented a theoretical formalism for the use of an ideal-observer study to quantitatively evaluate the similarity in distributions between the real and synthetic images. This theoretical formalism provides a direct relationship between the area under the receiver operating characteristic curve, AUC, for an ideal observer and the distributions of real and synthetic images. The second approach is based on the use of expert-human-observer studies to quantitatively evaluate the realism of synthetic images. In this approach, we developed a web-based software to conduct two-alternative forced-choice (2-AFC) experiments with expert human observers. The usability of this software was evaluated by conducting a system usability scale (SUS) survey with seven expert human readers and five observer-study designers. Further, we demonstrated the application of this software to evaluate a stochastic and physics-based image-synthesis technique for oncologic positron emission tomography (PET). In this evaluation, the 2-AFC study with our software was performed by six expert human readers, who were highly experienced in reading PET scans, with years of expertise ranging from 7 to 40 years (median: 12 years, average: 20.4 years). Main results. 
In the ideal-observer-study-based approach, we theoretically demonstrated that the AUC for an ideal observer can be expressed, to an excellent approximation, by the Bhattacharyya distance between the distributions of the real and synthetic images. This relationship shows that a decrease in the ideal-observer AUC indicates a decrease in the distance between the two image distributions. Moreover, a lower bound of ideal-observer AUC = 0.5 implies that the distributions of synthetic and real images exactly match. For the expert-human-observer-study-based approach, our software for performing the 2-AFC experiments is available at https://apps.mir.wustl.edu/twoafc. Results from the SUS survey demonstrate that the web application is very user friendly and accessible. As a secondary finding, evaluation of a stochastic and physics-based PET image-synthesis technique using our software showed that expert human readers had limited ability to distinguish the real images from the synthetic images. Significance. This work addresses the important need for mechanisms to quantitatively evaluate the clinical realism of synthetic images. The mathematical treatment in this paper shows that quantifying the similarity in the distribution of real and synthetic images is theoretically possible by using an ideal-observer-study-based approach. Our developed software provides a platform for designing and performing 2-AFC experiments with human observers in a highly accessible, efficient, and secure manner. Additionally, our results on the evaluation of the stochastic and physics-based image-synthesis technique motivate the application of this technique to develop and evaluate a wide array of PET imaging methods.


1. Introduction
In medical imaging, the use of simulation studies to develop and objectively evaluate new and improved imaging methods has been well recognized (Frangi et al 2018, Abadi et al 2020, 2022, Yousefirizi et al 2021). Simulation studies offer the advantages of evaluating the performance of a method against known ground truth, accurately modeling patient anatomy and physiology as well as imaging-system characteristics, incorporating population variability, and generating multiple scan realizations of the same patient to evaluate reproducibility. Even more importantly, this is all done in silico, which is inexpensive and enables optimizing the method before conducting clinical studies. Given these advantages, simulation studies have been used to evaluate a wide range of imaging methods for system instrumentation (Surti et al 2006), image reconstruction (Song et al 2011), image enhancement (Yu et al 2020), and image segmentation (Liu et al 2022). Further, the advantages of simulation studies have led to the emergence of virtual clinical trial-based frameworks to evaluate imaging methods (Maidment 2014, Badano et al 2018, Abadi et al 2020, Badano 2021, Li et al 2022). Simulation studies have also shown promise in developing artificial intelligence (AI)-based algorithms for medical imaging. More specifically, a key challenge in developing AI-based algorithms is the requirement of large amounts of training data with known ground truth. Such data can be difficult, expensive, and time-consuming to obtain, thus creating a barrier to developing learning-based algorithms. Studies have shown that synthetic images generated from simulations can help alleviate this requirement by providing training data for purposes such as pre-training the network (Chartsias et al 2017a, Creswell et al 2018, Gong et al 2018, Guan and Loew 2019, Leung et al 2020).
For the simulation-based development and evaluation studies to yield clinically relevant inferences, it is important that images generated by the synthesis techniques are clinically realistic (Song et al 2011, 2021). Ensuring this clinical realism requires that patient anatomy and physiology, population variability, and imaging-system physics are all modeled accurately. There has been much work on evaluating the accuracy of modeling the imaging physics (Gonias et al 2007, Poon et al 2015, Hernandez-Giron et al 2019). However, fewer studies have focused on developing approaches to ensure that the population variability is modeled accurately (Badano et al 2018, Zhou et al 2019a, Houbrechts et al 2021). Note that to ensure clinical realism, it is not sufficient to assess only whether the real and synthetic images match for one patient realization. Instead, for clinically relevant studies, the ideal goal is that the distributions of real and synthetic images should match. This provides confidence that the findings of objective evaluation studies with synthetic images, including virtual clinical trials, are clinically relevant. Further, the clinical realism of synthetic images has been observed to be necessary when using these images for pre-training AI-based algorithms (Leung et al 2020). Thus, there is an important need for mechanisms that can quantitatively evaluate the clinical realism of synthetic images and, ideally, the similarity in distributions of real and synthetic images. To address this need, we present two observer-study-based approaches in this manuscript, one based on the ideal observer and the other based on the human observer.
To quantify the distance between distributions of real and synthetic images, metrics such as the Fréchet inception distance (FID) (Heusel et al 2017) have been proposed. The FID measures the difference between the statistics of features extracted from real and synthetic images using a pre-trained Inception network. However, this network is typically pre-trained on ImageNet, which comprises only natural images. Thus, it is unclear whether the network can effectively generalize to evaluate the realism of synthetic medical images. Another set of metrics attempts to evaluate the difference between distributions of real and synthetic images based on the performance of an image classifier (Shmelkov et al 2018). These approaches, while promising, rely on the choice of the classifier. More importantly, it is theoretically unclear whether this performance relates to the similarity in distributions between the real and synthetic images.
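For reference, once each image set is summarized by the mean and covariance of its extracted features, the FID is the Fréchet distance between the two Gaussian fits, $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2})$. The sketch below implements only this final distance computation; the feature-extraction step with the Inception network is omitted, and the small dimensions are illustrative rather than taken from any particular study.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical feature statistics give a distance of zero; shifting one mean
# by a vector v adds ||v||^2 to the distance.
mu, sigma = np.zeros(4), np.eye(4)
assert abs(frechet_distance(mu, sigma, mu, sigma)) < 1e-8
assert abs(frechet_distance(mu, sigma, mu + 1.0, sigma) - 4.0) < 1e-6
```

In practice, the fidelity of this summary depends entirely on how well the chosen feature extractor captures clinically relevant image statistics, which is precisely the concern raised above for medical images.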
More recently, observer-study-based approaches have been considered to evaluate the clinical realism of synthetic images (Burgess 2011, Chen et al 2016, Elangovan et al 2017, Ma et al 2017, Sturgeon et al 2017). In these approaches, a two-alternative forced-choice (2-AFC) experiment is typically performed. In this 2-AFC experiment, an observer is presented pairs of real and synthetic images. For each image pair, the observer is asked to identify the real image. It is well accepted that the probability of correctly identifying the real image is equivalent to the area under the receiver operating characteristic curve, AUC, for that observer (Barrett and Myers 2013). Thus, if an observer correctly identifies the real images for only 50% of the cases, this yields an AUC of 0.5. Consequently, this implies that the observer is unable to differentiate the real images from the synthetic images. However, this does not necessarily indicate that the distribution of synthetic images matches that of real images. To illustrate this point, we consider a numerical observer. This observer, in the 2-AFC experiment, calculates a test statistic for each image and identifies the image that yields the higher value of the test statistic as real. However, the test statistic is just a single scalar derived from the entire image. Thus, while an AUC of 0.5 may indicate that the distributions of the test statistic for the real and synthetic images match, this does not necessarily indicate that the distributions of the real and synthetic images also match. Further, when the AUC value is greater than 0.5, it is unclear how the AUC value relates to the distance between the distributions of real and synthetic images. A mathematical analysis for answering these questions is much needed.
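The stated equivalence between 2-AFC percent correct and the AUC can be illustrated numerically. In the sketch below (our illustration, not an experiment from this work), the test statistics of the "real" and "synthetic" members of each pair are drawn from two unit-variance Gaussian distributions separated by $d$, for which the AUC has the closed form $\Phi(d/\sqrt{2})$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d = 1.0        # separation between the two test-statistic distributions
n = 200_000    # number of simulated 2-AFC image pairs

t_real = rng.normal(d, 1.0, n)    # observer's test statistic on the real image
t_syn = rng.normal(0.0, 1.0, n)   # observer's test statistic on the synthetic image

# Fraction of pairs in which the real image receives the higher test statistic.
pc = np.mean(t_real > t_syn)

# Closed-form AUC for two unit-variance Gaussians separated by d.
auc = norm.cdf(d / np.sqrt(2.0))

assert abs(pc - auc) < 0.01  # 2-AFC percent correct matches the AUC
```

Note that the agreement holds at the level of the test statistic only: as argued above, matching test-statistic distributions does not by itself imply matching image distributions.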
The first goal of this work is to theoretically demonstrate that an ideal observer provides a mechanism to quantify the similarity in distributions between the real and synthetic images. This ideal observer, also referred to as the likelihood-ratio test, uses all the statistical information available in the data to maximize task performance. Further, this observer is numerical and, thus, paves the way for a mathematical analysis. In this context, Barrett et al (1998) published a seminal paper with the goal of bridging the gap between the use of the signal-to-noise ratio and the use of the AUC as a figure of merit for signal-detection tasks. An important contribution of that paper was to derive the AUC for an ideal observer explicitly in terms of the distributions of signal-present and signal-absent images. By following a similar mathematical treatment to that of Barrett et al, but in the context of evaluating the clinical realism of synthetic images, we show that an ideal-observer-study-based approach can be used to quantitatively assess the similarity in distributions of the real and synthetic images (section 2). Specifically, we show that the ideal-observer AUC is related, to an excellent approximation, to the Bhattacharyya distance (Bhattacharyya 1943) between the distributions of the real and synthetic images.
The second goal of this work is to develop an openly available web-based platform to evaluate the clinical realism of synthetic images using human-observer studies. In this context, the vast majority of observer-study-based approaches to evaluate the clinical realism of synthetic images have relied on the use of human observers (Burgess 2011, Chen et al 2016, Elangovan et al 2017, Ma et al 2017, Sturgeon et al 2017). Among the different human observers, physicians have multiple years of experience reading medical images and are very familiar with the intricate details of these images. Thus, these physicians, whom we refer to as expert human observers, are best placed to identify even minute differences between the real and synthetic images. To conduct observer studies with expert human readers, various software tools have been developed. However, these tools often require manual installation on local workstations with compatible operating systems (Håkansson et al 2010, Zhang et al 2016, Genske and Jahnke 2022). The variety of existing operating systems, and the fact that users must obtain administrative privileges to install software on workstations owned by their institutions, limit the accessibility of these tools. Consequently, these factors make it challenging and cumbersome to conduct human-observer studies. Thus, an accessible and easy-to-use tool that can facilitate the conduct of expert-human-observer studies for evaluating the realism of synthetic images is much needed. Our developed web-based platform (section 3) is a step toward addressing this need.
2. Ideal-observer-study-based approach to quantitatively evaluate the similarity in the distributions of real and synthetic images

2.1. Problem formulation

Consider a set of clinical images that are acquired from a population of patients scanned by a medical-imaging system. Denote the image of each patient by an $M$-dimensional vector $\hat{f}_r$, which, we assume, lies within the Hilbert space of Euclidean vectors, denoted by $\mathbb{E}^M$. Additionally, consider an image-synthesis method that generates images of a simulated population of patients in silico. Each synthetic medical image, denoted by an $M$-dimensional vector $\hat{f}_s$, is also assumed to lie within $\mathbb{E}^M$.
To evaluate the clinical realism of these synthetic images, we consider a 2-AFC experiment performed by a numerical observer. In this experiment, the observer is presented with pairs of real and synthetic images, $\hat{f}_r$ and $\hat{f}_s$, sampled from the distributions $q_2(\hat{f})$ and $q_1(\hat{f})$, respectively. For each image, the observer computes a scalar test statistic $\theta(\hat{f})$ and identifies the image yielding the higher value as real. The probability of a correct decision is then

$$P_{\mathrm{corr}} = \int d\hat{f}_r \int d\hat{f}_s\, q_2(\hat{f}_r)\, q_1(\hat{f}_s)\, \mathrm{step}[\theta(\hat{f}_r) - \theta(\hat{f}_s)], \qquad (1)$$

where $\mathrm{step}(\cdot)$ denotes the Heaviside unit step function. As shown in Barrett and Myers (2013) in the context of signal-detection tasks, and rephrased in this scenario of using the 2-AFC experiment to evaluate the clinical realism of synthetic images (appendix A), the right-hand side of the above expression is equivalent to the expression for the AUC for an observer in terms of integrals over $\hat{f}_r$ and $\hat{f}_s$. Thus, from equation (1), the accuracy of an observer in identifying the real images in a 2-AFC experiment is equivalent to the AUC for that observer. We note that the expression for the AUC in equation (1) depends on the test statistic and, thus, does not specify a direct relationship between the AUC value and the distance between the distributions of the real and synthetic images. To gain insight into this relationship, we consider the use of an ideal observer, which uses all the statistical information available in the data to evaluate the realism of synthetic images. This ideal observer sets an upper bound on the performance of any observer and provides the best ability to assess whether any differences exist between the distributions of the real and synthetic images.
An ideal observer is defined as a decision strategy that calculates the likelihood ratio of $q_2(\hat{f})$ and $q_1(\hat{f})$ and compares the ratio to a threshold. In other words, the ideal observer calculates the test statistic $\Lambda$, given by

$$\Lambda(\hat{f}) = \frac{q_2(\hat{f})}{q_1(\hat{f})}. \qquad (2)$$

Our goal is to relate the AUC for this ideal observer to the distance between the distributions $q_1(\hat{f})$ and $q_2(\hat{f})$. Toward this goal, a central component of our derivation is the use of a likelihood-generating function (Barrett et al 1998). We first provide the background for the likelihood-generating function in section 2.2. We show that the characteristic functions, which are used to obtain the ideal-observer AUC, can be derived solely from the likelihood-generating function. Then, in section 2.3, we show that the ideal-observer AUC can be expressed, to an excellent approximation, by the likelihood-generating function evaluated at the origin. More importantly, this value at the origin relates directly to the Bhattacharyya distance between the distributions of the real and the synthetic images. Thus, by using the likelihood-generating function, we are able to establish a direct relationship between the ideal-observer AUC and the similarity in distributions of the real and the synthetic images.

2.2. Background for the likelihood-generating function
The likelihood-generating function is central to our derivation, as all moments of both $\Lambda$ and its logarithm,

$$\lambda = \ln \Lambda, \qquad (3)$$

under hypotheses $H_1$ and $H_2$ can be derived from it. This function was originally introduced by Barrett et al (1998), and we follow a similar approach to define the function. Denote the expectation of a random variable $t$ under hypothesis $H_j$ by $\langle t \rangle_j$. We can show that the moments of $\Lambda$ under $H_2$ are related to those under $H_1$ by

$$\langle \Lambda^n \rangle_2 = \int d\hat{f}\, q_2(\hat{f})\, \Lambda^n(\hat{f}) = \int d\hat{f}\, q_1(\hat{f})\, \Lambda^{n+1}(\hat{f}) = \langle \Lambda^{n+1} \rangle_1. \qquad (4)$$

The moment-generating function for the random variable $\lambda$ under hypothesis $H_j$, denoted by $M_j(\beta)$, is defined by

$$M_j(\beta) = \langle \exp(\beta\lambda) \rangle_j = \langle \Lambda^\beta \rangle_j. \qquad (5)$$

Thus, from equation (4), the relationship between the moment-generating functions under the two hypotheses is given by

$$M_2(\beta) = \langle \Lambda^\beta \rangle_2 = \langle \Lambda^{\beta+1} \rangle_1 = M_1(\beta + 1). \qquad (6)$$

Additionally, the characteristic function for the random variable $\lambda$ under hypothesis $H_j$, denoted by $\psi_j(\xi)$, is defined by

$$\psi_j(\xi) = \langle \exp(-2\pi i \xi \lambda) \rangle_j. \qquad (7)$$

From equations (5) and (7), we readily see that the moment-generating functions and characteristic functions are related to each other by

$$\psi_j(\xi) = M_j(-2\pi i \xi). \qquad (8)$$

Then, using equations (6) and (8) yields the relationship between the characteristic functions for $\lambda$ under hypotheses $H_1$ (class of synthetic images) and $H_2$ (class of real images):

$$\psi_2(\xi) = M_1(1 - 2\pi i \xi) = \psi_1\!\left(\xi + \frac{i}{2\pi}\right). \qquad (9)$$

This equation is important since it can further be used to derive the relationship between the probability distributions of $\lambda$ under the two hypotheses. Denote the probability distribution of $\lambda$ under hypothesis $H_j$ by $p_j(\lambda)$. Applying the inverse Fourier transform to both sides of equation (9) yields (appendix B)

$$p_2(\lambda) = \exp(\lambda)\, p_1(\lambda). \qquad (10)$$

In equation (10), both $p_1(\lambda)$ and $p_2(\lambda)$ can be derived from a single non-negative function $f(\lambda)$, as follows:

$$p_1(\lambda) = \exp(-\lambda/2)\, f(\lambda), \qquad (11a)$$

$$p_2(\lambda) = \exp(\lambda/2)\, f(\lambda). \qquad (11b)$$

Defining this function $f(\lambda)$ helps us derive the expressions for the moment-generating functions and characteristic functions. Denote the two-sided Laplace transform of $f(\lambda)$ by

$$F(\beta) = \int_{-\infty}^{\infty} d\lambda\, \exp(\beta\lambda)\, f(\lambda). \qquad (12)$$

Then, from equations (5) and (11a), we obtain

$$M_1(\beta) = F\!\left(\beta - \tfrac{1}{2}\right), \qquad (13a)$$

and, from equation (6),

$$M_2(\beta) = F\!\left(\beta + \tfrac{1}{2}\right). \qquad (13b)$$

Similarly, $\psi_1(\xi)$ and $\psi_2(\xi)$ in equation (9) can be expressed in terms of the Fourier transform of $f(\lambda)$, denoted by $\mathcal{F}(\xi)$:

$$\psi_1(\xi) = \mathcal{F}\!\left(\xi - \frac{i}{4\pi}\right), \qquad (14a)$$

$$\psi_2(\xi) = \mathcal{F}\!\left(\xi + \frac{i}{4\pi}\right). \qquad (14b)$$

The term $p_j(\lambda)$ denotes a probability distribution and should integrate to unity. Thus, from equations (13) and (14), $F(\pm 1/2)$, and equivalently $\mathcal{F}(\pm i/4\pi)$, should equal unity.
To enforce these constraints, the likelihood-generating function $G(\beta)$ and another function $T(\xi)$ are defined such that

$$F(\beta) = \exp\!\left[\left(\beta^2 - \tfrac{1}{4}\right) G(\beta)\right], \qquad (15a)$$

$$\mathcal{F}(\xi) = \exp\!\left[-\left(4\pi^2\xi^2 + \tfrac{1}{4}\right) T(\xi)\right]. \qquad (15b)$$

We can then express $M_1(\beta)$ and $\psi_1(\xi)$ as

$$M_1(\beta) = \exp\!\left[\beta(\beta - 1)\, G\!\left(\beta - \tfrac{1}{2}\right)\right], \qquad (16a)$$

$$\psi_1(\xi) = \exp\!\left[-\left(4\pi^2\xi^2 - 2\pi i \xi\right) T\!\left(\xi - \frac{i}{4\pi}\right)\right]. \qquad (16b)$$

Additionally, from equation (8), $T(\xi)$ can be expressed in terms of $G(\beta)$:

$$T(\xi) = G(-2\pi i \xi). \qquad (17)$$

Thus, we see that the characteristic functions can be expressed using only the likelihood-generating function.
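The relations above can be spot-checked by Monte Carlo simulation. The toy example below (our illustration, not part of the derivation) takes one-dimensional "images" with $q_1 = N(0, 1)$ and $q_2 = N(d, 1)$, for which $\lambda(f) = df - d^2/2$, and verifies the moment-generating-function relationship of equation (6), $M_2(\beta) = M_1(\beta + 1)$, against the closed form available in this Gaussian case.

```python
import numpy as np

# Toy 1D example: q1 = N(0, 1) (synthetic), q2 = N(d, 1) (real).
# The log-likelihood ratio is lambda(f) = ln[q2(f)/q1(f)] = d*f - d^2/2.
rng = np.random.default_rng(1)
d = 1.0
n = 500_000

def log_lr(f):
    return d * f - 0.5 * d**2

f1 = rng.normal(0.0, 1.0, n)  # samples under H1 (synthetic)
f2 = rng.normal(d, 1.0, n)    # samples under H2 (real)

beta = 0.3
m2 = np.mean(np.exp(beta * log_lr(f2)))                  # <Lambda^beta>_2
m1_shifted = np.mean(np.exp((beta + 1.0) * log_lr(f1)))  # <Lambda^(beta+1)>_1

# Closed form for this Gaussian example: M1(beta) = exp[d^2 beta (beta - 1) / 2],
# so M2(beta) = M1(beta + 1) = exp[d^2 beta (beta + 1) / 2].
m_exact = np.exp(0.5 * d**2 * beta * (beta + 1.0))

assert abs(m2 - m_exact) / m_exact < 0.01
assert abs(m1_shifted - m_exact) / m_exact < 0.02
```

In this Gaussian case $\lambda$ is itself Gaussian and $G(\beta)$ reduces to the constant $d^2/2$, which makes the closed form above easy to obtain.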
2.3. Deriving the relationship between the ideal-observer AUC and the similarity in distributions of the real and the synthetic images

Having obtained the characteristic functions using the likelihood-generating function, we can now derive the expression for the ideal-observer AUC. For this purpose, we note from equation (1) that, by expressing the step function in terms of its Fourier transform,

$$\mathrm{step}(x) = \frac{1}{2} + \frac{1}{2\pi i}\, \mathcal{P}\!\int_{-\infty}^{\infty} \frac{d\xi}{\xi}\, \exp(2\pi i \xi x), \qquad (18a)$$

we can calculate the AUC as

$$\mathrm{AUC} = \frac{1}{2} + \frac{1}{2\pi i}\, \mathcal{P}\!\int_{-\infty}^{\infty} \frac{d\xi}{\xi} \left\{ \int d\hat{f}_r\, q_2(\hat{f}_r)\, \exp[2\pi i \xi\, \theta(\hat{f}_r)] \right\} \left\{ \int d\hat{f}_s\, q_1(\hat{f}_s)\, \exp[-2\pi i \xi\, \theta(\hat{f}_s)] \right\}, \qquad (18b)$$

where $\mathcal{P}$ denotes the Cauchy principal value for evaluating the improper integral. Note that in equation (18b), the expression within each curly bracket is the same as calculating the expectation of a term of the form $\exp[\pm 2\pi i \xi\, \theta(\hat{f})]$. Using the fact that this expectation can be calculated from the probability density on either $\hat{f}$ or $\theta(\hat{f})$, and setting $\theta(\hat{f}) = \lambda(\hat{f})$ for the ideal observer, we can further write equation (18b) in terms of the characteristic functions (equation (7)) as

$$\mathrm{AUC} = \frac{1}{2} + \frac{1}{2\pi i}\, \mathcal{P}\!\int_{-\infty}^{\infty} \frac{d\xi}{\xi}\, \psi_2(-\xi)\, \psi_1(\xi). \qquad (19)$$

By replacing the expression for $\psi_2(\xi)$ from equation (9) and using the Hermiticity property of the Fourier transform, we obtain

$$\mathrm{AUC} = \frac{1}{2} + \frac{1}{2\pi i}\, \mathcal{P}\!\int_{-\infty}^{\infty} \frac{d\xi}{\xi}\, \exp\!\left\{ -\left(4\pi^2\xi^2 - 2\pi i \xi\right) \left[ T\!\left(\xi - \frac{i}{4\pi}\right) + T\!\left(-\xi + \frac{i}{4\pi}\right) \right] \right\} \qquad (20a)$$

$$= \frac{1}{2} + \frac{1}{2\pi i}\, \mathcal{P}\!\int_{-\infty}^{\infty} \frac{d\xi}{\xi}\, \exp\!\left\{ -\left(4\pi^2\xi^2 - 2\pi i \xi\right) \left[ G\!\left(-2\pi i \xi - \tfrac{1}{2}\right) + G\!\left(2\pi i \xi + \tfrac{1}{2}\right) \right] \right\}, \qquad (20b)$$

where, in the second step, we have used the expression for $\psi_1(\xi)$ from equation (16b) and then the relationship between $T(\xi)$ and $G(\beta)$ from equation (17). To simplify this further, we can approximate $G(\beta)$ via the Maclaurin series expansion:

$$G(\beta) = \sum_{n=0}^{\infty} \frac{G^{(n)}(0)}{n!}\, \beta^n. \qquad (21)$$

Substituting this in equation (20b) and assuming that the contribution of higher-order (n > 1) terms is negligible (the n = 1 terms cancel exactly in the symmetric sum) yields

$$\mathrm{AUC} \approx \frac{1}{2} + \frac{1}{2\pi i}\, \mathcal{P}\!\int_{-\infty}^{\infty} \frac{d\xi}{\xi}\, \exp\!\left[-2 G(0) \left(4\pi^2\xi^2 - 2\pi i \xi\right)\right] = \frac{1}{2} + \frac{1}{2}\, \mathrm{erf}\!\left[\sqrt{\frac{G(0)}{2}}\right]. \qquad (22)$$

Next, using equations (15a), (12), and (11a), we obtain

$$F(0) = \exp\!\left[-\frac{G(0)}{4}\right] = \int_{-\infty}^{\infty} d\lambda\, f(\lambda) = \int_{-\infty}^{\infty} d\lambda\, \exp(\lambda/2)\, p_1(\lambda) = \left\langle \Lambda^{1/2} \right\rangle_1 = \int d\hat{f}\, \sqrt{q_1(\hat{f})\, q_2(\hat{f})}. \qquad (23)$$

Recognizing that the Bhattacharyya distance between $q_1(\hat{f})$ and $q_2(\hat{f})$ is defined as (Bhattacharyya 1943)

$$D_B(q_1, q_2) = -\ln \int d\hat{f}\, \sqrt{q_1(\hat{f})\, q_2(\hat{f})}, \qquad (24)$$

equation (23) gives $G(0) = 4 D_B(q_1, q_2)$. Substituting this into equation (22) yields

$$\mathrm{AUC} \approx \frac{1}{2} + \frac{1}{2}\, \mathrm{erf}\!\left[\sqrt{2\, D_B(q_1, q_2)}\right]. \qquad (25)$$

Note that equation (25) is obtained without making any assumption about the probability law of either the images $\hat{f}$ or the likelihood ratio $\Lambda$. From equation (25), it is easy to see that the value of the ideal-observer AUC decreases as the Bhattacharyya distance between $q_1(\hat{f})$ and $q_2(\hat{f})$ decreases, reaching the lower bound of AUC = 0.5 when the two distributions exactly match. Thus, an ideal-observer-study-based approach provides a mechanism to quantitatively evaluate the similarity in distributions of the real and the synthetic images.
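As a sanity check on equation (25), consider one-dimensional Gaussian distributions $N(0, \sigma^2)$ and $N(d, \sigma^2)$. Here $D_B = d^2/(8\sigma^2)$, the log-likelihood ratio is Gaussian, $G(\beta)$ is constant, and the higher-order terms neglected in the Maclaurin expansion vanish, so equation (25) should agree exactly with the known ideal-observer AUC of $\Phi(d/(\sigma\sqrt{2}))$. A minimal numerical check (our illustration):

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def bhattacharyya_gauss(mu1, mu2, sigma):
    # Bhattacharyya distance between N(mu1, sigma^2) and N(mu2, sigma^2)
    return (mu1 - mu2) ** 2 / (8.0 * sigma**2)

def auc_from_db(db):
    # Equation (25): AUC = 1/2 + 1/2 * erf(sqrt(2 * D_B))
    return 0.5 + 0.5 * erf(np.sqrt(2.0 * db))

# Exact ideal-observer AUC for equal-variance Gaussians separated by d.
d, sigma = 1.2, 1.0
db = bhattacharyya_gauss(0.0, d, sigma)
auc_exact = norm.cdf(d / (sigma * np.sqrt(2.0)))

assert abs(auc_from_db(db) - auc_exact) < 1e-10
assert auc_from_db(0.0) == 0.5  # matching distributions give AUC = 0.5
```

For non-Gaussian distributions the neglected terms are generally nonzero, which is why equation (25) is stated as an excellent approximation rather than an identity.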
2.4. Illustrating the relationship between the ideal-observer AUC and the Bhattacharyya distance for a two-pixel image setup

To illustrate the relationship in equation (25), consider that $\hat{f}$ denotes images consisting of only two pixels. For the sake of simplicity, assume that $q_1(\hat{f})$ and $q_2(\hat{f})$ are described by 2D Gaussian distributions that have the same covariance matrix $K$ but different means, i.e.

$$q_j(\hat{f}) = \frac{1}{2\pi\sqrt{\det K}} \exp\!\left[-\frac{1}{2}\left(\hat{f} - \mu_j\right)^T K^{-1} \left(\hat{f} - \mu_j\right)\right], \quad j = 1, 2.$$

In this case, the Bhattacharyya distance has the closed form $D_B(q_1, q_2) = \frac{1}{8}(\mu_2 - \mu_1)^T K^{-1} (\mu_2 - \mu_1)$. We readily see that the Bhattacharyya distance between $q_1(\hat{f})$ and $q_2(\hat{f})$ decreases as the difference between $\mu_1$ and $\mu_2$ decreases. Using equation (25), we can obtain the AUC at different values of $D_B(q_1, q_2)$.

3. A web-based expert-human-observer-study-based approach to quantitatively evaluate the clinical realism of synthetic images

As introduced in section 1, human-observer studies have been widely used to evaluate the clinical realism of synthetic images. Among the different human observers, expert human readers, such as physicians who are highly experienced in reading medical images, can identify minute differences between the real and synthetic images. A 2-AFC experiment provides a mechanism to quantify the performance of the expert human observers on this task. If an expert human observer correctly identifies the real images for only around 50% of the cases in the 2-AFC experiment, then, as mentioned in section 2.1 with the proof provided in appendix A, this would indicate an AUC of ∼0.5 on the task of detecting the real image. This would imply that the expert human observer was unable to distinguish between the real and synthetic images, thus suggesting that the synthetic images are clinically realistic as evaluated by that observer. While several tools have been developed for conducting human-observer studies (Håkansson et al 2010, Zhang et al 2016), users often need to manually install the tools on local workstations with compatible operating systems and/or have programming knowledge. These requirements can reduce the accessibility of the tools and, consequently, serve as a hurdle in designing and conducting the observer studies. To address these issues, we develop an openly available software for conducting 2-AFC experiments with expert human observers to quantitatively evaluate the clinical realism of synthetic images.
This software is designed to be accessible and secure, and to provide mechanisms both for investigators to design new 2-AFC experiments and for expert human observers to perform them. To achieve these goals, we design this software to be web-based with a dual-user 'Investigator-Reader' interface. The 'Investigator interface' allows an investigator to design a 2-AFC experiment and upload the real and the synthetic images. The 'Reader interface' allows the expert human observers recruited by this investigator to perform the 2-AFC experiment. The programming environment for building the software is detailed in appendix C. In the following, we focus on describing the main functionalities of this software and the procedures for the investigator and reader to design and perform the 2-AFC experiment.

3.1. Developed software

3.1.1. Investigator interface
The layout for the investigator interface is shown in figure 2. As a first step, the investigator is required to provide a project title and a corresponding four-digit passcode, which the investigator should then share with the readers. This ensures that only readers authorized by this investigator can access the images, thus ensuring the security of the images. To improve the accessibility for readers, the investigator is asked to provide instructions for the readers to perform the 2-AFC experiment on the uploaded images. These instructions will be displayed on the screen once a reader begins the experiment. Our software allows the investigator to upload an arbitrary number of image pairs. The investigator is also provided an option to shuffle the order of image pairs. Finally, the investigator is asked to provide an email address, to which the results of the observer study from each reader would be sent. Note that if an investigator receives results with a percent accuracy much lower than 50%, this is likely an indication that the observer is not trained and, thus, the results should be treated with caution.

3.1.2. Reader interface
The reader is required to provide the project title and the corresponding passcode to access the images uploaded by a specific investigator. If these entries are provided correctly, the reader will be directed to the webpage, as shown in figure 3, to perform the 2-AFC experiment. In this experiment, a synthetic image sampled from $q_1(\hat{f})$ and a real image sampled from $q_2(\hat{f})$ are presented side-by-side (section 2.1). For each image pair, the reader is asked to identify the image that they perceive as real. While making the decision, the reader can adjust the contrast and invert the intensities of the images. The goal of providing these functionalities is to increase the clinical relevance and rigor of the observer study. The reader is also asked to provide a confidence level for the decision. The interpretations of the confidence levels are provided to the reader (figure 3). These interpretations are similar to those used in previous studies to conduct human-observer studies (Chen et al 2016, Ma et al 2017). The confidence levels could be a useful tool for improving the design of the synthesis technique after the observer-study evaluation. For example, if an expert reader correctly distinguishes the real image from the synthetic image with a high confidence level, this could indicate that the synthetic image is highly unrealistic. Investigators could then incorporate such feedback while improving the design of their synthetic-image-generation approaches. Additionally, the reader is provided with an option to leave additional comments.

3.2. Evaluating usability of the developed software
To evaluate the usability of our software, we conducted a system usability scale (SUS) survey (Brooke 1996). This survey is widely used to test the usability of newly developed software and websites. The SUS evaluates software on three main aspects, namely, effectiveness, efficiency, and satisfaction. These aspects assess whether users achieve their goals successfully, the effort and/or resources spent to achieve those goals, and whether the user experience is satisfactory, respectively.
The SUS survey was designed by adapting from Brooke (1996) and consisted of a 10-item questionnaire about the software with five response options for respondents: strongly disagree, disagree, neutral, agree, and strongly agree (table 1). For the odd-numbered items, a score of 0 was assigned to 'strongly disagree' and a score of 4 was assigned to 'strongly agree'. For the even-numbered items, a score of 4 was assigned to 'strongly disagree' and a score of 0 was assigned to 'strongly agree'. The scores were then added, and the summed score was multiplied by 2.5 such that the eventual score fell between 0 and 100.
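The scoring rule above can be expressed compactly. The sketch below (our illustration; responses coded 1 for 'strongly disagree' through 5 for 'strongly agree', items ordered 1-10) computes the 0-100 SUS score:

```python
def sus_score(responses):
    """Compute the SUS score from ten responses on a 1-5 Likert scale
    (1 = strongly disagree, ..., 5 = strongly agree), items ordered 1-10.
    Odd-numbered items score (response - 1); even-numbered items score
    (5 - response). The summed item scores are scaled by 2.5 to 0-100."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# "Strongly agree" on the positively worded (odd) items and "strongly
# disagree" on the negatively worded (even) items gives the maximum score.
assert sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]) == 100.0
```

A uniformly neutral response sheet scores 50, which makes the midpoint of the scale easy to interpret.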
We first conducted the survey with five board-certified nuclear medicine physicians with years of expertise ranging from 7 to 40 years (median: 12 years, average: 20.4 years), one nuclear medicine physicist, and one nuclear medicine resident. These users are considered as the expert human observers who would use our software to evaluate the clinical realism of synthetic images. Additionally, we conducted the survey with five users who were asked to evaluate the software as investigators designing an observer study. Conducting the survey with all these users provides evidence for the utility of the software in practical settings.
Table 1 (excerpt). Items from the SUS questionnaire:
4. I think that I would need the support of a technical person to be able to use this software.
5. I found the various functionalities of this software were well integrated.
6. I thought there was too much inconsistency in this software.
7. I would imagine that most people would learn to use this software very quickly.
8. I found the software very cumbersome to use.
9. I felt very confident using the software.
10. I needed to learn a lot of things before I could get going with this software.

3.3. Evaluating the clinical realism of a positron emission tomography (PET) image-synthesis technique using the developed software

To demonstrate the application of our software to quantitatively evaluate the clinical realism of image-synthesis techniques, we used the software to evaluate a recently developed technique for oncologic PET. This technique is a stochastic and physics-based method that generates 2D 18F-fluorodeoxyglucose (FDG)-PET images of patients with lung cancer (Liu et al 2021a). By following the simulation procedure detailed in Liu et al (2021a), we generated 50 synthetic PET images for our 2-AFC study. The source code for this technique is openly available at https://github.com/ziping-liu/A-stochastic-and-physics-based-method-to-generate-oncological-PET-images.git. Our evaluation study was retrospective, involved clinical images, and was IRB-approved and HIPAA-compliant, with informed consent being waived. The 2-AFC study using our developed software was conducted by six expert readers, including five board-certified PET physicians (BAS, FD, JCM, TJF, and MI) and one PET physicist (RL). The readers were highly experienced in reading PET scans, with years of expertise ranging from 7 to 40 years (median: 16 years, average: 20.3 years). During the study, each of the 50 synthetic images was paired with an existing clinical PET image to be displayed to the readers simultaneously with our software (section 3.1.2; figure 3).
The readers were then asked to identify the real image, provide a confidence level for the decision, and optionally leave a comment. We then computed the percentage of times that each reader correctly identified the real PET image.

4. Results

4.1. Evaluating usability of the developed software for conducting 2-AFC experiments with expert human observers
In this section, we report the outcome of the SUS survey conducted to evaluate the usability of the developed web application (section 3.2). Figure 4 presents the distribution of responses from (A) seven expert human readers and (B) five observer-study designers to each item in the questionnaire described in table 1. Figure 5 shows the total score computed for each user based on the rule defined in section 3.2. For the group of expert human readers, a mean score of 84 with a standard deviation of 8 was observed. Similarly, a mean score of 87 with a standard deviation of 5 was obtained for the group of investigators. Based on Lewis and Sauro (2018), these results indicate that our software is very highly usable.

4.2. Evaluating the clinical realism of the PET image-synthesis technique using the developed software

Table 2 shows the percent accuracy and median confidence level for each expert human observer participating in the 2-AFC study to evaluate the clinical realism of the stochastic and physics-based image-synthesis technique using our developed software, as described in section 3.3. We observe that all the readers identified the real PET image correctly only ∼50% of the time. Additionally, for half of the readers, the median value of the confidence levels was 3. Figure 6 shows the number of correct (upper row) and incorrect (lower row) decisions made by (a) the five PET physicians, (b) the PET physicist, and (c) all the readers, respectively, at each confidence level. When combining all the readers, only 164/300 (55%) decisions were made correctly. Among these correct decisions, only 71 (43%) were made with confidence levels ≥ 4. Additionally, 34/136 (25%) incorrect decisions were made with high confidence levels (≥ 4).
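As an illustrative supplement (not an analysis reported in this study), one can check whether the pooled accuracy of 164/300 is statistically distinguishable from the 50% chance level, for example with an exact binomial test:

```python
from scipy.stats import binomtest

n_correct, n_trials = 164, 300  # pooled decisions across all six readers
accuracy = n_correct / n_trials

# Exact two-sided binomial test against the 50% chance level.
result = binomtest(n_correct, n_trials, p=0.5, alternative='two-sided')

assert abs(accuracy - 0.5467) < 1e-3
assert result.pvalue > 0.05  # pooled accuracy is consistent with chance at the 5% level
```

Such a check complements the descriptive summary above by quantifying how readily the observed accuracy could arise from readers who are guessing.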

Discussion
To ensure that simulation-based development and evaluation of medical imaging methods are clinically relevant, images generated by the synthesis technique must be clinically realistic and, ideally, have the same distribution as that of real images. The first contribution of this work is to theoretically demonstrate that an ideal-observer-study-based approach provides a mechanism to quantitatively evaluate the similarity in distributions between the real and synthetic images. Further, we show that the AUC for an ideal observer can be expressed, to an excellent approximation, in terms of the Bhattacharyya distance between the distributions of real and synthetic images. Thus, a decrease in the ideal-observer AUC indicates a decrease in the distance between the two distributions. Moreover, the AUC attains its lower bound of 0.5 exactly when the distribution of the synthetic images matches that of the real images. Thus, by quantifying the similarity in distributions between the real and synthetic images, this ideal-observer-study-based approach provides a theoretical foundation for the quantitative evaluation of the clinical realism of synthetic images.
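The qualitative behavior described above can be illustrated with a scalar toy model (our own illustration, not the paper's formalism): for two 1-D Gaussians, the ideal observer's likelihood ratio is monotone in the observed value, so its AUC can be estimated by rank comparison, while the Bhattacharyya distance has a closed form. As the "synthetic" distribution approaches the "real" one, both the distance and the AUC shrink toward their lower bounds of 0 and 0.5.

```python
import math
import random

def bhattacharyya_gauss(mu1, s1, mu2, s2):
    """Closed-form Bhattacharyya distance between two 1-D Gaussians."""
    return (0.25 * (mu1 - mu2) ** 2 / (s1 ** 2 + s2 ** 2)
            + 0.5 * math.log((s1 ** 2 + s2 ** 2) / (2.0 * s1 * s2)))

def empirical_auc(t_pos, t_neg):
    """Wilcoxon-Mann-Whitney estimate of P(t_pos > t_neg)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in t_pos for n in t_neg)
    return wins / (len(t_pos) * len(t_neg))

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(800)]
for mu_syn in (1.0, 0.5, 0.0):  # synthetic distribution approaching the real one
    syn = [rng.gauss(mu_syn, 1.0) for _ in range(800)]
    B = bhattacharyya_gauss(0.0, 1.0, mu_syn, 1.0)
    print(f"B = {B:.3f}, ideal-observer AUC ~ {empirical_auc(syn, real):.3f}")
```

When the two distributions coincide (the last loop iteration), B is exactly 0 and the estimated AUC fluctuates around 0.5, consistent with the theoretical lower bound.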
The second contribution of this manuscript is the development of a web-based platform that facilitates the use of human-observer-study-based approaches to quantitatively evaluate the clinical realism of synthetic images. Our software is openly available, does not require installation on a local workstation, is platform-independent, eliminates the need for an on-site study, and allows simultaneous access by multiple users. The goal of incorporating all these features is to strengthen the usability of this software. Additionally, our software provides features that allow varying the contrast and intensity of images. This leads to a user interface that is similar to those present in clinical tools, thus further strengthening the rigor and clinical relevance of the 2-AFC experiments. Our results from the SUS survey shown in section 4.1 demonstrate that the software is highly user-friendly and accessible. Further, our software provides multiple features to align with the General Data Protection Regulation policies. Specifically, the software provides mechanisms to secure stored data, allow users to delete uploaded data, and protect data from unauthorized access. All these features are important for evaluation studies that include patient data. Our developed software can be used to evaluate a large class of image-synthesis techniques, including physics-based methods (Duchateau et al 2017, Ma et al 2017, Leung et al 2020, Hamdi et al 2021), generative adversarial network-based methods (Costa et al 2017, Nie et al 2017, Wang et al 2021), and other AI-based methods (Chartsias et al 2017b, Xiang et al 2018, Bahrami et al 2020, Dutta et al 2022). Further, while the key purpose of our software is evaluating the realism of synthetic images, the software can also be used to conduct 2-AFC experiments for image-quality assessment. For this secondary purpose, tools have been developed previously (Vuong et al 2018, Genske and Jahnke 2022).
Similar to those tools, our software can be used to evaluate newly developed image-reconstruction and image-processing methods on signal-detection tasks.
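The contrast and intensity controls mentioned above typically correspond to a window/level mapping of the raw intensities to display values. The sketch below is our own illustration of that kind of mapping; the function name, parameters, and 8-bit output are assumptions, not the actual software's API.

```python
# Toy window/level mapping of the sort a 2-AFC viewer's contrast/intensity
# controls apply before display (hypothetical API, not the paper's software).
def window_level(pixels, level, width):
    """Map intensities in [level - width/2, level + width/2] to 0..255."""
    lo, hi = level - width / 2.0, level + width / 2.0
    return [round(255.0 * (min(max(p, lo), hi) - lo) / (hi - lo)) for p in pixels]

row = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # toy PET intensities along one row
print(window_level(row, level=5.0, width=4.0))
```

Narrowing `width` increases the displayed contrast within the window, while shifting `level` changes the overall displayed intensity, mirroring the window/level behavior of clinical viewers.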
Another application of the proposed realism-evaluation strategies is in assessing the realism of synthetic images that are generated for virtual clinical trials. For this application, it is important to account for the clinical task of interest and not just assess whether the images look realistic to a human observer (Badano 2017). In that context, our ideal-observer-study-based approach provides a mechanism to quantify the difference in distributions of real and synthetic images. Further, performance on clinical tasks of interest typically depends on the distribution of the images. Future research may show that a measure of the difference between the distributions of real and synthetic images enables objective comparison of clinical-task performance obtained with those images. In that case, our theoretical formalism could provide a mechanism to account for the clinical task of interest when evaluating the realism of synthetic images.
As a secondary finding of this work, our evaluation of a stochastic and physics-based image-synthesis technique (section 3.3) using the expert-human-observer study with the developed software indicates that the expert readers had limited ability to distinguish the real images from the synthetic images. As shown in table 2, all the expert readers, including even the most experienced PET physician with 40 years of experience reading PET scans, correctly identified the real images in only ∼50% of the cases. Additionally, we observe from figure 6 that among the 164 (out of 300) correct decisions, only 43% were made with high confidence levels, suggesting that the readers were not confident even when they correctly identified the real image. Moreover, the readers were falsely confident in 25% of the incorrect decisions. These results motivate the use of this image-synthesis technique to generate images for the development and evaluation of a wide range of PET imaging methods. In fact, the technique has already been used to objectively evaluate a recently developed PET segmentation method (Liu et al 2021b).
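The claim that pooled reader performance was close to chance can be checked with an exact two-sided binomial test on the 164 correct decisions out of 300 trials. This check is our own addition, not an analysis reported in the paper; the function below uses the common convention of summing the probabilities of all outcomes no more likely than the observed one.

```python
import math

# Exact two-sided binomial test (minimal-likelihood convention, the same one
# used by scipy.stats.binomtest) implemented with the standard library only.
def binom_two_sided_p(k, n, p=0.5):
    pmf = [math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # sum outcomes whose probability does not exceed the observed one
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-12)))

# Pooled 2-AFC result from the study: 164 correct out of 300 trials
print(f"p = {binom_two_sided_p(164, 300):.3f}")
```

Under this convention the pooled result is not significantly different from chance at the 5% level, consistent with the interpretation that the readers could not reliably tell the real images from the synthetic ones.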
There are some limitations in this work. First, our ideal-observer-study-based approach to evaluate the clinical realism of synthetic images was presented in theory and not yet applied to a clinical scenario. As shown in section 2, developing the ideal observer requires knowledge of the probability distributions of the real and synthetic images. However, in clinical studies, these distributions are high-dimensional and do not have a known analytical form. To address these issues, AI-based methods are showing promise in approximating the ideal-observer test statistics for signal-detection tasks (Kupinski et al 2001, Zhou et al 2019b). Our theoretical formalism motivates extending these methods to the task of clinical-realism evaluation. Second, our theoretical formalism was presented specifically for an ideal observer, and thus we reiterate that it should not be used to directly interpret results obtained with expert human observers. However, in that context, we do point out that several studies (He et al 2004, Li et al 2016) have shown correlations between the performance of human observers and channelized Hotelling observers (CHOs). The CHOs utilize templates that are derived from the first- and second-order statistics of the channel vectors extracted from the images. Thus, in special cases where the channel vectors are sufficient statistics for describing the distributions of real and synthetic images, our ideal-observer analysis may be used to quantify the similarity in distributions of real and synthetic images. Examining this connection is an important future research direction. A third limitation is that our web application is currently designed to evaluate the realism of synthetic images on a per-slice basis and not for the entire 3D volume. Additionally, in the designed application, the slices are displayed only in a single orientation. Expanding the web application to display images in 3D and in multiple orientations is an important area of future development.
Finally, our web application currently supports only 2-AFC experiments. Considering that different variants of the 2-AFC experiment have been used in human-observer studies (Zhang et al 2016, Ikejimba et al 2019), expanding our software to allow conducting those experiments is another important area of future development.
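The CHO construction mentioned in the limitations, in which the template is built from the first- and second-order statistics of the channel vectors, can be sketched on synthetic data. The toy channel matrix, class distributions, and sample sizes below are our own assumptions for illustration; they are not from the paper or from the cited CHO studies.

```python
import numpy as np

# Minimal channelized Hotelling observer on toy Gaussian "images".
# Channel outputs v = U^T g; the Hotelling template uses the mean difference
# (first-order) and pooled covariance (second-order) of the channel vectors.
rng = np.random.default_rng(0)
n_pix, n_ch = 64, 4
U = rng.standard_normal((n_pix, n_ch))        # stand-in channel matrix
s = rng.standard_normal(n_pix)                # mean difference between classes

g_real = rng.standard_normal((500, n_pix)) + s   # H2: "real" images
g_syn = rng.standard_normal((500, n_pix))        # H1: "synthetic" images

v_real, v_syn = g_real @ U, g_syn @ U            # channel vectors
S = 0.5 * (np.cov(v_real, rowvar=False) + np.cov(v_syn, rowvar=False))
w = np.linalg.solve(S, v_real.mean(axis=0) - v_syn.mean(axis=0))  # template

t_real, t_syn = v_real @ w, v_syn @ w            # CHO test statistics
auc = float(np.mean(t_real[:, None] > t_syn[None, :]))
print(f"CHO AUC on toy data: {auc:.2f}")
```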

Conclusion
In this work, we investigated two observer-study-based approaches to quantitatively evaluate the clinical realism of synthetic images. We theoretically demonstrated that an ideal-observer-study-based approach provides a mechanism to quantify the similarity in distributions of real and synthetic images. Further, we showed that the ideal-observer AUC can be expressed, to an excellent approximation, in terms of the Bhattacharyya distance between the distributions of real and synthetic images. Additionally, we developed software that provides a web-based platform to facilitate the conduct of expert-human-observer studies for quantitative evaluation of the realism of synthetic images. This software is available at https://apps.mir.wustl.edu/twoafc. The software provides multiple functionalities that increase the rigor and clinical relevance of 2-AFC experiments. Our results from the SUS survey demonstrate that this software enables designing and performing 2-AFC experiments with expert human observers in a highly accessible and user-friendly manner. Finally, as a secondary finding of this work, evaluation of a stochastic and physics-based PET image-synthesis technique showed that the expert human observers were generally unable to distinguish the real images from the synthetic images. This finding motivates the application of this technique to the development and evaluation of PET imaging methods.
Appendix

In this appendix, we prove that when an observer performs a 2-AFC experiment, the expression for the probability of a correct decision (equation (1)) is equal to the AUC for that observer. Our proof is similar to that provided in Barrett et al (1998), but for a different context. In that paper, the derivation was presented in the context of performing a 2-AFC study to evaluate observer performance on a signal-detection task. Here, we paraphrase the derivation for the application of evaluating the clinical realism of synthetic images.
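The equivalence proved in this appendix can also be checked numerically (a toy illustration of ours, with Gaussian test statistics chosen arbitrarily): the probability of a correct 2-AFC decision, with ties decided by a fair coin and hence counted as 1/2, equals the trapezoidal area under the (FPF, TPF) curve traced out by sweeping the threshold x.

```python
import random

def roc_auc(t_h2, t_h1):
    """Trapezoidal area under the (FPF, TPF) curve over all thresholds."""
    xs = sorted(set(t_h2) | set(t_h1), reverse=True)
    pts = [(0.0, 0.0)]
    for x in xs:
        tpf = sum(t >= x for t in t_h2) / len(t_h2)
        fpf = sum(t >= x for t in t_h1) / len(t_h1)
        pts.append((fpf, tpf))
    pts.append((1.0, 1.0))
    return sum((f2 - f1) * (t1 + t2) / 2.0
               for (f1, t1), (f2, t2) in zip(pts, pts[1:]))

def two_afc_pc(t_h2, t_h1):
    """Probability of a correct decision over all real/synthetic pairings."""
    wins = sum(1.0 if r > s else 0.5 if r == s else 0.0
               for r in t_h2 for s in t_h1)
    return wins / (len(t_h2) * len(t_h1))

rng = random.Random(1)
t_real = [rng.gauss(0.8, 1.0) for _ in range(200)]   # t under H2 (real)
t_syn = [rng.gauss(0.0, 1.0) for _ in range(200)]    # t under H1 (synthetic)
print(f"2-AFC Pc = {two_afc_pc(t_real, t_syn):.4f}, "
      f"AUC = {roc_auc(t_real, t_syn):.4f}")
```

For continuous test statistics the two numbers agree to floating-point precision, which is exactly the statement proved formally below.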
Proof. Consider an observer performing the task of identifying an image as synthetic (H1) or real (H2). For a given image, the observer calculates a test statistic, denoted by a random variable t, and then compares the value of t to a threshold, denoted by x. If t ⩾ x, the observer will identify the image as real, i.e. assign the image to H2.
Otherwise, the image is considered synthetic and assigned to H1. The performance of this observer can be fully specified by two quantities. The first quantity, referred to as the true-positive fraction (TPF), measures the fraction of times that the observer identifies the image as real when the image is indeed real. The second quantity, referred to as the false-positive fraction (FPF), measures the fraction of times that the observer identifies the image as real when the image is in fact synthetic. Denote the probability of an event by Pr(·) and the probability distribution of a random variable by pr(·). Given the threshold x, the TPF and FPF can be calculated as follows: