Artificial intelligence model GPT-4 narrowly fails simulated radiological protection exam

This study assesses the efficacy of the Generative Pre-Trained Transformer (GPT) models published by OpenAI in the specialised domains of radiological protection and health physics. Utilising a set of 1064 surrogate questions designed to mimic a health physics certification exam, we evaluated the models' ability to respond accurately to questions across five knowledge domains. Neither model met the 67% passing threshold: GPT-3.5 achieved a weighted average of 45.3% and GPT-4 attained 61.7%. Despite its substantially larger parameter count and multimodal capabilities, GPT-4, while outperforming GPT-3.5 in every category, still fell short of a passing score. The methodology used a simple, standardised prompting strategy without prompt engineering or in-context learning, techniques known to potentially enhance performance. The analysis also showed that GPT-3.5 adhered to the requested answer format more consistently than GPT-4, despite GPT-4's higher overall accuracy. The findings suggest that while GPT-3.5 and GPT-4 show promise in handling domain-specific content, their application in the field of radiological protection should be approached with caution, with an emphasis on human oversight and verification.


Introduction
Recent breakthroughs in large language model (LLM) technology have brought about a transformative impact in the realm of artificial intelligence (AI). One prominent illustration of this is the Generative Pre-Trained Transformer (GPT), released by OpenAI in 2018 [1]. GPT-4 has demonstrated a remarkable ability to answer questions in specialised domains such as medicine, law, and business [2][3][4], areas that have historically been the exclusive purview of professionals. Particularly noteworthy is its performance on assessments such as the Korean general surgery board exam, the United States Medical Licensing Examination, and the Wharton MBA final exam, each achieved without fine-tuning of the pretrained model [5][6][7].
The third iteration of the Generative Pre-Trained Transformer (GPT-3.5) comprises 175 billion parameters, while GPT-4 is reported to contain 1.76 trillion parameters [8, 9]. Parameters are the learned components of an LLM that shape its proficiency in tasks such as text generation.
The widespread adoption of ChatGPT led to server capacity challenges, prompting the introduction of a paid version known as ChatGPT Plus, which has become the primary means of accessing the underlying GPT-4 technology.
In contrast to its predecessor, GPT-4 is a multimodal model, capable of processing not only text but also images and diagrams as input. These advancements are exemplified by GPT-4's performance on various exams, including the Law School Admission Test, the SAT, the Uniform Bar Exam, and the Graduate Record Examinations, where it achieved higher scores than GPT-3.5 [10][11][12].
The health physics multiple-choice certification exam is a multidisciplinary challenge that draws on scientific principles from a broad range of subdomains, substantial mathematical theory, practical applications, and ethical considerations. It is renowned for its difficulty, with on average only 18% of candidates surpassing the passing threshold of 67% [13]. ChatGPT, while proficient at providing information and explanations, may fall short of the critical thinking, problem-solving, and contextual comprehension such an exam demands. This study aims to assess and quantify the performance of GPT-3.5 and GPT-4 in accurately responding to questions across five knowledge domains, shedding light on their strengths and weaknesses within the specialised domains of radiological protection and health physics.

Methods
In this study, a collection of 1064 simulated radiological protection and health physics questions was used to evaluate the performance of two advanced language models, GPT-3.5 and GPT-4. A standardised format was employed to present each question to the models, and the responses were systematically recorded and compared to the correct answers. Particular technical settings were adjusted to optimise the precision of the models' outputs. The questions were then categorised into specific health physics topics, the performance within each category was calculated, and a final weighted test score was derived.

Health physics certification examinations
The American Board of Health Physics (ABHP) was established with the aim of improving the standards and practice of health physics. The ABHP sets professional and ethical standards, evaluates qualifications through examinations, issues certificates to qualified individuals, and maintains a registry of those certified [14]. The ABHP certifies health physics professionals, in part, by assessing the competency and qualifications of candidates for certification. The Part I exam is administered as a multiple-choice examination. The ABHP maintains a set of typical exam questions in section 8 of the exam preparation guide hosted on its website [15].

Test structure and grading criteria
The exam consists of 150 questions and is structured into five domains based on a role delineation study conducted in the mid-1980s and reaffirmed through subsequent surveys [15]. Each domain is assigned a specific percentage weight. 'Measurements and Instrumentation' has a weight of 25% and covers topics such as instrument selection, data interpretation, and quality control. 'Standards and Requirements' accounts for 20% and includes standards and guidelines from organisations such as the ICRP and NCRP and regulatory agencies such as the NRC and DOE. 'Hazards Analysis and Controls' also has a weight of 20% and focuses on identifying and controlling radiological hazards. 'Operations and Procedures' makes up another 20% and deals with the incorporation of radiation protection into operational programs. 'Fundamentals and Education', the last domain, has a weight of 15% and focuses on the training received and provided by health physicists. Sample questions found on the ABHP website [15] illustrate the questions found on the Certified Health Physicist (CHP) exam.

Database development and validation
A database of simulated radiological protection and health physics questions was developed for this study. A candidate who had recently completed Part I of the CHP exam manually reviewed each question and its corresponding answers for accuracy and validity. The ABHP does not publicly release past Part I exams, so this database of simulated questions was used as a surrogate for the actual exam. While the questions submitted to the LLMs were similar in scope and difficulty to typical certification exam questions, no questions requiring accompanying graphics (plots, diagrams, etc.) were included in this study.

AI prompting methodology
A standardised prompting strategy was developed for interacting with the language models. Each question, along with its multiple-choice answers, was presented to the model followed by the instruction, 'Give the number of the best answer. Start your response with "The answer is:"'. The goal of this approach was to have the LLM respond with just the multiple-choice answer (1-5) rather than a lengthy (and costly) explanation.
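As an illustration of this format, the sketch below shows how a question and its choices could be assembled into a single prompt string; the helper name build_prompt and the sample question are hypothetical and are not drawn from the study database.

```python
# Minimal sketch of the standardised prompt format described above.
# The helper name and the sample question are illustrative only.
def build_prompt(question: str, choices: list[str]) -> str:
    """Combine a question and its numbered choices with the fixed instruction."""
    options = "\n".join(f"{i}. {choice}" for i, choice in enumerate(choices, start=1))
    instruction = ('Give the number of the best answer. '
                   'Start your response with "The answer is:"')
    return f"{question}\n{options}\n{instruction}"

print(build_prompt(
    "Which SI unit is used to express equivalent dose?",
    ["Gray", "Sievert", "Becquerel", "Coulomb per kilogram", "Electronvolt"],
))
```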

Model interrogation
In this study, OpenAI's LLMs, GPT-3.5 and GPT-4, were accessed via API calls programmed in Python 3 to work through the dataset of specialised questions. For deterministic output, several parameters were fixed: 'temperature' was set to 0 to remove randomness from the answers, and 'max_tokens' was limited to 20 to manage API-related costs. Two additional parameters were specified. 'top_p' was left at 1, so nucleus sampling did not further truncate the distribution of candidate tokens; determinism was instead governed by the temperature setting. 'frequency_penalty' was set to 0, ensuring that the frequency of previously generated tokens did not influence the model's choices. All responses from the models were captured and categorised. If an answer was not provided within the initial 20-token limit, the question was reissued with a 600-token ceiling to obtain a more complete response with better odds of yielding an answer that could be evaluated for correctness.
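A minimal sketch of such an interrogation loop, assuming the openai Python SDK's chat-completions interface, is shown below; the function name and the retry check are illustrative simplifications rather than the study's exact code.

```python
# Sketch of the model interrogation described above, assuming the openai
# Python SDK. Parameter values mirror those reported in the text.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def ask(prompt: str, model: str = "gpt-4", max_tokens: int = 20) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic output
        max_tokens=max_tokens,  # 20 tokens by default to limit cost
        top_p=1,
        frequency_penalty=0,
    )
    return response.choices[0].message.content.strip()

prompt = "A question and its numbered choices, formatted as described above."
reply = ask(prompt)
if "The answer is:" not in reply:
    # if no usable answer appeared within 20 tokens, reissue with a 600-token ceiling
    reply = ask(prompt, max_tokens=600)
```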

Answer evaluation
Each model-generated response was evaluated against the database of answers. A response was deemed correct if the LLM output exactly matched either the number or the text of the correct multiple-choice answer. Approximate answers, such as those which required rounding, were not considered correct. If the response did not follow the requested format, it was marked as incorrectly formatted but could still be considered correct.
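The matching rule described above can be summarised in the short sketch below; the function names and the preamble-stripping step are assumptions made for illustration.

```python
# Sketch of the answer-evaluation rule described above; names are illustrative.
def is_correct(model_output: str, correct_number: int, correct_text: str) -> bool:
    """Correct if the reply exactly names the right option by number or by text."""
    reply = model_output.replace("The answer is:", "").strip().rstrip(".")
    return reply == str(correct_number) or reply == correct_text

def follows_format(model_output: str) -> bool:
    """Correctly formatted if the reply begins with the requested phrase."""
    return model_output.strip().startswith("The answer is:")
```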

Categorisation and scoring
Questions were categorised into the following unique domains: 'Fundamentals and Education', 'Hazard Analysis and Controls', 'Measurements and Instrumentation', 'Operations and Procedures', and 'Standards and Requirements'. The fraction of correct and correctly formatted answers within each category was calculated using Microsoft Excel 365. The final test performance score was calculated as the weighted average of the category-specific scores, using the category weights defined by the ABHP.
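Although the scores were computed in Excel, the weighted average reduces to the simple calculation sketched below, using the ABHP category weights given in the test-structure section; the per-category scores shown are placeholders.

```python
# Weighted-average scoring sketch using the ABHP domain weights listed in the
# test-structure section; the per-domain scores below are placeholders.
weights = {
    "Measurements and Instrumentation": 0.25,
    "Standards and Requirements": 0.20,
    "Hazards Analysis and Controls": 0.20,
    "Operations and Procedures": 0.20,
    "Fundamentals and Education": 0.15,
}

category_scores = {domain: 0.0 for domain in weights}  # fraction correct per domain

weighted_score = sum(weights[d] * category_scores[d] for d in weights)
print(f"Weighted test score: {weighted_score:.1%}")
```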

Results
As indicated by the weighted average, both GPT-3.5 and GPT-4 fell short of the 67% passing threshold for the exam. GPT-3.5 scored a weighted average of 45.3%, while GPT-4 narrowly missed the threshold with a score of 61.7%. Across the five knowledge domains, 'Fundamentals and Education', 'Hazard Analysis and Controls', 'Measurements and Instrumentation', 'Operations and Procedures', and 'Standards and Requirements', GPT-4 demonstrated superior performance compared to GPT-3.5, and the two models' relative performance was consistent across all categories, as illustrated in figure 1. The 'Measurements and Instrumentation' domain displayed the lowest accuracy, whereas 'Fundamentals and Education' and 'Operations and Procedures' achieved the highest and second-highest accuracy, respectively.
Despite GPT-3.5's lower overall question accuracy, it excelled in the formatting analysis, formatting answers correctly 6.3 percentage points more often than GPT-4, with average scores of 99.1% and 92.8%, respectively. The GPT-4 model tended to disregard the requested single-digit answer format and launch into a truncated, paragraph-style answer. Notably, the 'Measurements and Instrumentation' dataset displayed the lowest percentage of correctly formatted answers for both models, in addition to having the lowest answer accuracy. No discernible patterns in formatting performance were observed for the remaining knowledge domains.

Discussion
The weighted average showed that neither GPT-3.5 nor GPT-4 met the 67% passing threshold for the exam. The input consisted of the exact questions from the dataset of simulated questions together with a simple, standardised prompting methodology. It is well established that GPT's responses can vary depending on the prompting. To enhance performance, various strategies could be implemented. These include crafting explicit instructions, such as instructing the model to adopt a persona (e.g. that of a professional health physicist), breaking intricate tasks down into simpler subtasks, or in-context learning (i.e. providing sample problems or specific reference materials). Each of these strategies could increase question accuracy relative to our baseline methodology.
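As a hypothetical illustration of the first strategy, a persona could be assigned through a system message placed ahead of each question; the wording below is an assumption for illustration, was not part of the study's methodology, and presumes the openai Python SDK's chat interface.

```python
# Hypothetical persona prompt, sketched with the openai SDK's chat format.
# The system message and question text are illustrative, not from the study.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system",
     "content": "You are a certified health physicist answering ABHP-style exam questions."},
    {"role": "user",
     "content": "A question formatted as described in the Methods section goes here."},
]

response = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
print(response.choices[0].message.content)
```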

Exploring the potential of in-context learning
Both ChatGPT-3.5 and GPT-4 rely exclusively on training from publicly available data sources [16]. Furthermore, the precise sources used to generate responses and the criteria for selecting those sources, such as their date and credibility, remain undisclosed. Importantly, scientific progress thrives on fresh insights and the synthesis of complex ideas, qualities that are crucial in the context of the CHP exam. In our study, we employed 'zero-shot prompting', which involves presenting language models with questions they have not been specifically prepared for, akin to testing a student on material they have not studied [17]. For example, the models were asked to answer complex questions related to radiological protection, such as calculating radiation doses or selecting appropriate safety measures, without prior exposure to this specific type of problem.
Alternatively, in-context learning presents a valuable strategy for harnessing language models like GPT to tackle new tasks with minimal examples [18]. The model is exposed to sample inputs and their corresponding outputs that demonstrate specific tasks, enabling it to generate responses to new inputs based on these examples. Given the multitude of potential sources the model can reference and the documented risk of generating inaccurate information, adopting in-context learning has the potential to significantly enhance output accuracy.
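A minimal sketch of what such an in-context (few-shot) prompt might look like is shown below; the worked example question and answer are hypothetical and were not used in this study.

```python
# Hypothetical few-shot prompt: one worked example question and its answer
# precede the new question so the model can imitate the required pattern.
new_question = (
    "A new exam question and its numbered choices go here.\n"
    'Give the number of the best answer. Start your response with "The answer is:"'
)

messages = [
    {"role": "user", "content": (
        "What is the SI unit of activity?\n"
        "1. Becquerel\n2. Gray\n3. Sievert\n4. Curie\n"
        'Give the number of the best answer. Start your response with "The answer is:"'
    )},
    {"role": "assistant", "content": "The answer is: 1"},
    {"role": "user", "content": new_question},
]
```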

Discrepancy in performance
Despite both models falling short of the passing threshold, a notable performance gap was observed between GPT-3.5 and GPT-4. GPT-4 exhibited superior performance across all knowledge domains and even surpassed the passing threshold in the 'Operations and Procedures' sub-section. This discrepancy may be due to the advancements incorporated into GPT-4, which include a substantially larger training dataset than GPT-3.5 and a roughly ten-fold increase in the number of neural network parameters [8, 9]. Furthermore, given GPT-4's image-processing capabilities, it can be inferred that if the exam questions were to include figures and tables, as many current certification exams do, the performance gap between the two models would likely be more pronounced.

Contradictory results from formatting analysis
In comparing GPT-3.5 and GPT-4, an interesting paradox arises: GPT-4 achieves greater overall accuracy on the full dataset, yet GPT-3.5 surpasses GPT-4 in correctly formatting answers within the allocated token limit. GPT-4 often failed to give a simple multiple-choice answer and instead attempted to launch into an explanation. This paradox can be attributed to several factors related to the capabilities and differences between the two models. GPT-3.5 may have been trained more heavily on question-and-answer formats for such tasks, whereas the training data for GPT-4, being a more recent model, may not have been formatted as extensively for them.

Consistency by knowledge domain
The results displayed remarkable consistency in performance by knowledge domain between the two models. However, it is worth noting that the 'Measurements and Instrumentation' domain exhibited the lowest accuracy for both models. This knowledge domain encompasses a wide range of topics, including various measurement types, analytical techniques, measurement methodologies, result interpretation, quality control, calibration, and instrumentation testing. Questions within this domain often involve the practical application of instrumentation, which can pose a particular challenge for GPT models given their lack of real-world expertise and their lack of access to specialised databases or direct training materials.
In contrast, the 'Fundamentals and Education' and 'Operations and Procedures' domains achieved the most robust performance. These domains revolve around the integration of radiation protection principles into operational procedures, such as emergency response protocols, record-keeping practices, and adherence to standard operating procedures, and they encompass fundamental skills required in the field. The information within these domains requires less complex processing, with answers that are more readily accessible. The transition from GPT-3.5 to GPT-4, accompanied by an expansion in model parameters, further improves the accessibility of answers within these domains, rendering them especially conducive to enhanced performance.

Cost and pricing
It is relevant to consider the financial cost of using GPT-3.5 and GPT-4 to answer radiological protection multiple-choice questions. The cost-effectiveness of these models is an important factor, especially when scaled to larger datasets or more frequent use. Our analysis employed API calls at a rate of $0.002 USD per 1000 tokens for GPT-3.5 and $0.03 USD per 1000 tokens for GPT-4, with a total API cost of $5.21 for the entire study. At the time of writing, GPT-4 is accessible through the ChatGPT interface only with a ChatGPT Plus subscription of $20 per month.
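For scale, a back-of-the-envelope estimate at the quoted per-1000-token rates is sketched below; the assumed token count per question is an illustrative figure, not a measured value from the study.

```python
# Illustrative cost estimate at the per-1000-token rates quoted above.
# The tokens-per-question figure is an assumption for illustration only.
GPT35_RATE = 0.002 / 1000   # USD per token
GPT4_RATE = 0.03 / 1000     # USD per token (quoted rate)

tokens_per_question = 150   # assumed prompt + completion tokens per question
n_questions = 1064

gpt35_cost = tokens_per_question * n_questions * GPT35_RATE
gpt4_cost = tokens_per_question * n_questions * GPT4_RATE
print(f"GPT-3.5: ${gpt35_cost:.2f}, GPT-4: ${gpt4_cost:.2f}")
```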

Limitations
The primary limitation of our study stems from the use of a simulated dataset comprising questions that, by definition, differ from those found in an actual certification exam. While the questions in our dataset assessed the same concepts, they were exclusively text-based, in contrast to current certification exams, which often include questions featuring images and graphics. Had image-based questions been included, the disparity in accuracy between GPT-3.5 and GPT-4 would likely have been more pronounced, given that GPT-3.5 lacks image recognition capabilities whereas GPT-4 possesses this functionality.
The lack of prompt engineering also limits the assessment of the models' capabilities. Since several strategies are known to improve the accuracy of GPT's responses, our methodology likely did not exhibit the models' full potential as it might have had prompt engineering been implemented and optimised.
Although GPT-3.5 is freely available through the ChatGPT website, access to GPT-4 is not free, and access to the API version must also be requested. Moreover, the results of the study are qualified by the exclusive use of ChatGPT as the AI program under examination. The dynamic landscape of evolving AI programs implies that studies assessing different programs may yield diverse outcomes. Additionally, it is important to acknowledge that ChatGPT is subject to ongoing updates and improvements, and the specific version employed in our study may not align with the most current iteration available at the time of this article's publication.
In this study, we employed zero-shot prompting as opposed to 'multi-shot' (few-shot) prompting, which supplies the model with several example questions and answers. Multi-shot learning allows the model to better understand the desired outcome. Unlike the multi-shot approach, zero-shot prompting can be limiting because it does not leverage the models' capacity to learn from examples.

Conclusions
In conclusion, this evaluation of GPT-3.5 and GPT-4's performance reveals a multifaceted landscape. Both models fell short of the passing threshold, highlighting the ongoing challenges in deploying these AI systems for specific tasks such as accurately answering radiological protection and health physics questions. However, promising avenues for improvement are evident.
One such avenue lies in the implementation of tailored strategies to enhance performance, encompassing the use of explicit instructions, in-context learning, and breaking down complex tasks. These strategies, if effectively applied, have the potential to elevate question accuracy, addressing the models' limitations.
Moreover, the exploration of in-context learning presents an intriguing prospect. GPT models, while drawing from publicly available data sources, can benefit from in-context learning to adapt to new tasks more efficiently, potentially mitigating the limitations associated with their knowledge sources.
The discrepancy in performance between GPT-3.5 and GPT-4 underscores the advantages of technological advancements. GPT-4's superior performance, bolstered by increased parameters and enhanced processing capabilities, demonstrates its potential for tackling intricate scientific concepts and domain-specific challenges. This distinction becomes particularly significant when considering the inclusion of visual content in questions.
Paradoxically, the improved intelligence of the GPT-4 model does not correspond proportionately to its ability or willingness to follow the directions given in the input prompt. While GPT-4 achieves greater overall accuracy, GPT-3.5 excels in correctly formatting answers. This paradox highlights the need for an iterative approach to prompting GPT-4 to maximise the value of its output.
The consistency in performance across knowledge domains provides valuable insights, revealing the particular challenges faced by GPT models in handling topics related to measurements, analytical techniques, and instrumentation. These topics demand real-world expertise and access to specialised databases, factors that GPT models inherently lack. In contrast, the 'Fundamentals and Education' and 'Operations and Procedures' domains capitalise on the strengths of these models, allowing for enhanced accessibility and performance improvement with the transition from GPT-3.5 to GPT-4.
In essence, this study not only sheds light on the performance of GPT-3.5 and GPT-4 but also highlights the potential for further advancements in AI language models through the strategic use of fine-tuning and in-context learning, and by addressing the unique challenges posed by specific knowledge domains.

Figure 1. Comparative accuracy analysis of GPT-3.5 and GPT-4 across distinct knowledge domains. The accuracy percentages of the two language models were determined in five knowledge domains: 'Fundamentals and Education', 'Hazard Analysis and Controls', 'Measurements and Instrumentation', 'Operations and Procedures', and 'Standards and Requirements', revealing the models' nuanced strengths and weaknesses between domains. Bars are coloured by model, GPT-3.5 (blue) and GPT-4 (orange), across the x-axis.