The impact of AI in physics education: a comprehensive review from GCSE to university levels

With the rapid evolution of artificial intelligence (AI), its potential implications for higher education have become a focal point of interest. This study delves into the capabilities of AI in physics education and offers actionable AI policy recommendations. Using openAI’s flagship gpt-3.5-turbo large language model (LLM), we assessed its ability to answer 1337 physics exam questions spanning general certificate of secondary education (GCSE), A-Level, and introductory university curricula. We employed various AI prompting techniques: Zero Shot, in context learning, and confirmatory checking, which merges chain of thought reasoning with reflection. The proficiency of gpt-3.5-turbo varied across academic levels: it scored an average of 83.4% on GCSE, 63.8% on A-Level, and 37.4% on university-level questions, with an overall average of 59.9% using the most effective prompting technique. In a separate test, the LLM’s accuracy on 5000 mathematical operations was found to be 45.2%. When evaluated as a marking tool, the LLM’s concordance with human markers averaged at 50.8%, with notable inaccuracies in marking straightforward questions, like multiple-choice. Given these results, our recommendations underscore caution: while current LLMs can consistently perform well on physics questions at earlier educational stages, their efficacy diminishes with advanced content and complex calculations. LLM outputs often showcase novel methods not in the syllabus, excessive verbosity, and miscalculations in basic arithmetic. This suggests that at university, there’s no substantial threat from LLMs for non-invigilated physics questions. However, given the LLMs’ considerable proficiency in writing physics essays and coding abilities, non-invigilated examinations of these skills in physics are highly vulnerable to automated completion by LLMs. This vulnerability also extends to pysics questions pitched at lower academic levels. 
It is thus recommended that educators be transparent about LLM capabilities with their students, while emphasizing caution against overreliance on their output due to its tendency to sound plausible but be incorrect.


Background
Since OpenAI introduced GPT, there has been a burgeoning interest in the Higher Education (HE) sector regarding the potential impact of Artificial Intelligence (AI) on learning and teaching [1,2,3].The transformative potential of AI, particularly Large Language Models (LLMs) -neural networks trained on vast amounts of text -has captivated educators.Reinforcing its significance in the educational realm, OpenAI even released a "tips for educators" blog post ‡.Chatbots like ChatGPT, built on the transformer architecture [4], use a decoder-only design to predict subsequent words, equipping them to handle intricate queries.Following the prominence of ChatGPT, tech behemoths such as Meta, Google, and Baidu have launched their own AI-driven chatbots: LLama, Bard, and Ernie.While these models excel in various tasks, preliminary research indicates that they may not consistently meet the rigorous academic standards of university settings [5], with GPT-4, the latest iteration of the GPT series, outperforming its counterparts.
Research within Physics specifically has begun to assess the capabilities and implications of AI, largely focusing on ChatGPT.For instance, at the secondary school level, a pilot study led by Bitzenbauer engaged students in asking ChatGPT physics questions as a learning exercise and elicited their feedback on the generated responses [6].Moreover, Yeadon and Halliday, when examining a set of Physics exams administered at Durham University, found that GPT-4 typically achieved scores around the 50% mark [7].Interestingly, the markers frequently noted the plausible sounding nature of the responses from GPT-4, despite them not always being correct.This phenomenon was also highlighted in a study by Dahlkemper et al. [8].They observed that responses from ChatGPT to challenging Physics questions -ones that were more difficult than the students tested had previously encountered -were rated comparably to researcher-written responses.This was the case even though, for simpler questions, incorrect ChatGPT responses could be identified more easily by the students.Similarly, Gregorcic and Pendrill [9] found that a conversation with ChatGPT could yield intricate dialogue but incorrect physics concepts.This issue of complex yet plausible-sounding language masking incorrect content is a hallmark of ChatGPT completions.Focusing on essay-based Physics assignments, researchers discovered that ChatGPT's performance was generally on par with the average student's score on short-form Physics essay assignments [10].
A growing body of research suggests the importance of equipping students with skills and experience to interact effectively with AI [6,11,12].While this idea seems practical, it's crucial to acknowledge the continuous evolution of AI and computational technologies.As these systems become more user-friendly, the requirement for extensive technical knowledge decreases.This trend is evident in the rise of intuitive development ‡ Available at https://openai.com/blog/teaching-with-ai.
environments like Replit and design tools like Figma, both of which have simplified many complexities in software creation.A similar trend towards readability is seen in programming languages, with Python being a prime example.Supporting this trend, ChatGPT has shown the capability to convert natural language into functional source code that can solve Leetcode problems [13].Here, natural language can be seen as an even higher-level representation of source code, which itself is a higher-level representation of machine code.This suggests a future where specialized knowledge in areas like prompt engineering could become less important, replaced by more intuitive and direct interaction with AI systems.
Indeed, the interaction techniques used in this research might soon be outdated due to the rapid development in AI technology.Thus, understanding the effects of different interaction methods on AI performance, along with grasping AI's current capabilities, is vital for the Physics education community.This study aims to delve into these topics to provide educators with a better understanding of how to adapt to the AI evolution and to suggest practical ways to adjust to this rapid change.

Interaction with AI models
There's a growing recognition that the conventional back-and-forth messaging inherent in chat-style interactions may not be entirely representative of the full potential of Large Language Models (LLMs).The interaction quality and the outcome largely depend on not just the proper formulation of the prompt, but also on the application of various frameworks like Reflection and Chain of Thought reasoning.Furthermore, integrating external tools such as Wolfram Alpha can significantly enhance the performance of the LLMs.
Prompting techniques, including Zero Shot and Few Shot, are among the primary factors influencing the output quality.The Zero Shot approach entails asking a question directly and expecting an answer, without providing any prior context or examples.On the other hand, Few Shot involves presenting examples to the model before posing the question [14], thereby giving the model a context to generate a more informed response.This study uses OpenAI's ChatGPT thus in order to implement Few Shot prompting In Context Learning (ICL) is used whereby the examples are presented within the prompt sent to the LLM.
The Confirmatory Check technique is an implementation that combines elements of the Chain of Thought [15] and Reflection [16] methods.It encourages a LLM to reconsider its previous outputs, removing excess content if appropriate.This method prompts the model to evaluate its initial response, thereby mitigating the problem where the LLM becomes 'stuck' with a mistake in its produced answer.Additionally, LLM outputs can sometimes be long, rambling, and inconsistent with the complexity of the question.The Confirmatory Check technique provides an opportunity for the LLM to avoid these issues.This approach offers valuable insights in educational contexts, where it mimics a more conversational interaction between students and the LLM.This back-and-forth dialogue can lead to more refined and accurate answers.
The idea of equipping LLMs with external tools to handle challenging tasks has also gained traction recently.A notable instance is the integration of Wolfram Alpha with ChatGPT, allowing the LLM to leverage Wolfram Alpha's capabilities to tackle complex mathematical tasks that are typically difficult for LLMs [17].
In exploring these different techniques, our aim is not only to provide a broader understanding of how these models can be utilized but also to evaluate their efficacy within the context of Physics education.This serves the larger goal of this study -to benchmark these cutting-edge LLMs thoroughly and provide educators with a clearer picture of AI's strengths and weaknesses.By doing so, we hope to offer a comprehensive resource to understand AI's current capabilities and thereby inform educators about effective ways to integrate AI into their teaching practices.

Question sources
To ensure a comprehensive examination of the AI's capabilities across various difficulty levels, we sourced questions that spanned from GCSE to A-Level, as well as introductory university courses.These questions were obtained from a wide array of educational boards and institutions, culminating in a diverse and robust dataset.To transfer questions from their original sources into a digital, machine-readable format, we utilized a combination of regular expression matching and manual transcription.Special emphasis was placed on maintaining the accuracy of the transcription process, preserving the original complexity and structure of each question.However, due to the requirement of sending API requests in Latin-1 encoding (ISO/IEC 8859-1), mathematical notations such as the square root or integral symbols were unavailable.We adapted to this constraint by using natural language short-hands, such as 'sqrt(x)' or 'integrate(x)', which proved to be an effective solution.Further, when questions incorporated tables or figures, we adopted specific strategies.Tables were reformatted to resemble nested Python lists.As for figures, we provided detailed descriptions.However, this approach for figures was seldom practical.The questions were organized into three distinct categories: Numerical, where calculations such as "find the acceleration" were required; Multiple Choice, involving selection from a list of options; and Written Descriptive Answers, where textual responses were needed.The sources of the questions are detailed in Table 1.
Building on the extensive research focusing on university-level physics exam questions [7], the current study narrows its scope to introductory level questions.The textbooks from which these questions were sourced are shown in Table 1.To ensure fair evaluation, the scoring procedures for these questions were standardized across the different educational levels.For GCSE and A-Level questions, we adhered to the respective mark schemes provided.On the other hand, university-level questions, being derived from textbooks, lacked a standard mark scheme.To address this, a customized scoring rubric was developed.Specifically, questions from University Physics with Modern Physics were found to be more elaborate and were consequently marked on a 2-point scale.A score of 2/2 was awarded for completely accurate answers, 1/2 for answers that were near correct with correct application of physics principles, and 0/2 for all other responses.For questions sourced from the other university textbooks, a simpler 1-point scale was utilized, wherein each question was marked as either correct or incorrect.This approach aimed to strike a balance between accommodating the inherent complexity of questions from different sources and maintaining an equitable evaluation framework.

Generating the AI answers
We utilized the OpenAI API, specifically the GPT-3.5-turbolanguage model, to generate AI responses from an array of message objects [18].By altering the format of the message object array, we implemented various interaction techniques: Zero Shot, In Context Learning and Confirmatory Check.Each message object has a role of either system, user, or assistant.The system message objects guide the behavior of the LLM.The OpenAI default system message is 'You are a knowledgeable assistant.'[19], which was retained for the Zero-shot prompting interactions.It was followed by a system message reading 'Please answer the following question.' to ensure the question was answered, and then a user message containing the actual question content.
For the In Context Learning prompt implementation, the system messages were modified to include a series of example question-answer pairs before the target question, as shown in Figure 1.These examples served to establish the context for the expected responses.Studies have shown that beyond five examples, the benefits of additional examples become negligible [20].We found the LLM would often provide lengthy responses, so the examples were deliberately concise.Although a word-based example was initially included, it was determined to be unnecessary since the model is trained principally on long text passages.
The Confirmatory Check technique was implemented by sending the In Context Learning message object to the API with the In Context Learning response appended as an assistant message.It was followed by a user message reading, "Please check the previous answer to ensure you're happy with it.If you feel that you can express it more succinctly, then please do so.For reference, this was the original question: <question inserted>."This approach allowed the LLM an opportunity to refine its In Context Learning answer.
While the OpenAI API does not directly provide a confidence score or probability with each response, the 'temperature' parameter was set at 0 to eliminate randomness in the generated responses [21].The 'max tokens' parameter was set at 2000, suitable for extensive answers.After processing each question, the result was saved with the new answers in an Excel workbook to prevent data loss in case of program termination.The grading and interpretation of the AI's responses are discussed in the subsequent sections.

Automated grading
To assess the LLM's capability in evaluating its own responses, the answer from each question -spanning different prompting styles -was submitted to the API.This submission included the solution, marking guidance, available marks, and the original question §.Tasking the LLM with marking its answers emulates a human marker's role.Subsequent comparisons between LLM-assigned scores and human evaluations provided insights into the LLM's efficacy.Given the potential for the LLM to assign improbable scores, like values below zero or exceeding available marks, checks were put in place.If an invalid score was provided thrice consecutively, it was recorded as '-1' signifying a marking failure.Due to the comprehensive marking guidance availability, only GCSE and A-Level sources were utilized.

Mathematical capabilities
To assess the mathematical abilities of the LLM, two datasets comprising a total of 5,000 numbers were generated.The first dataset contained 2,500 integer pairs stratified based on the number of digits, ranging from 1-digit to 5-digit numbers.Within each stratum of 500 pairs, a random arithmetic operation (addition, subtraction, multiplication, or division) was assigned, ensuring an approximate distribution of 125 operations per digitlength category.The second dataset focused on single operand operations applied to 2,500 integers.Operations included squaring, square-rooting, calculating the natural logarithm, sin or cos.Each digit-length category in this dataset had approximately § A overview of the prompt instructing the AI to mark its own work can be found in the Appendix Figure A1.
[{" role ": " user " , " content ": """ A 30 W light bulb uses 600 J of electrical energy in time t to produce 450 J of light energy .What is the efficiency eta of the light bulb ?""" } , {" role ": " assistant " , " content ": """ The efficiency is the ratio of useful energy output to the total energy input expressed as a percentage so in this case : eta = (450 J / 600 J ) * 100% eta = 75 % """ } , {" role ": " user " , " content ": """ Interference fringes , produced by monochromatic light are viewed on a screen placed a distance L from a double slit system with slit separation S .The distance between the centres of two adjacent fringes ( the fringe separation ) is W .If both S and L are doubled , what will be the new fringe separation ?A ) 2 W B ) W /2 C ) W D ) 4 W """ } , {" role ": " assistant " , " content ": """ C """ } , {" role ": " user " , " content ": """ A car accelerates from 12 m s to 21 m s in 6.0 s .How far did it travel in this time ?Assume constant acceleration .""" } , {" role ": " assistant " , " content ": """ First , we ' ll find the acceleration using the equation : a = ( v -u ) / t = ( 21  100 of each operation.The primary objective was to gauge the LLM's computational accuracy, especially in relation to numerical complexity.Responses were assessed based on perfect match accuracy, deviation within 5% of the correct answer, and deviation within 10% of the correct answer.For the purposes of determining a 'perfect match', responses were considered correct if they were accurate up to 5 decimal places.Detailed insights derived from these analyses are delineated in Section 3.4.To determine if these observed differences were statistically significant, an Analysis of Variance (ANOVA) test was conducted, with the results summarized in Table 2. 
ANOVA is particularly apt for this analysis as it allows for a comparison of means across more than two groups.The null hypothesis for the ANOVA test states that there is no significant difference between the group means.The alternative hypothesis posits that at least one group mean is different.For the GCSE, A Level, and Introductory University levels, the p-values were 0.5429, 0.1310, and 0.8828, respectively, indicating that we fail to reject the null hypothesis for all three academic levels.This suggests that the choice of prompting technique does not play a pivotal role in the AI's performance.

Overview
For a more nuanced analysis, each question was categorized as either Multiple Choice, Numerical, or Word-based.However, at the Introductory University level, the dataset is overwhelmingly composed of numerical questions (> 99%).This dominance renders a detailed, segregated analysis by question type challenging for this academic level.Nevertheless, the ANOVA test results for the GCSE and A Level, as showcased in Table 2, indicate a statistically significant difference in the performance of the three prompting techniques across the various question types.Yet the differences are not consistent between academic levels with the LLM performing best on numerical questions at GCSE but best on word based at A-Level.Further word based questions were the worse performing type for the LLM at GCSE.The nature of the question can notably affect the LLM's accuracy.For example, in multiple choice questions, the LLM frequently settled on an answer that wasn't among the provided options.In these scenarios, it either refrained from answering altogether or selected the option that was closest to its often incorrect answer.Beyond these question types, it was observed that questions with tables scored similarly to those without, indicating that tables do not hinder LLM performance.

Example question answer
Looking at specific examples offers a clear perspective on the influence of prompt engineering.As depicted in Figure 4, the nuances of different prompting styles can lead to varied responses.Given the Physics question 'Write a decay equation in terms of a quark model for beta-minus decay' the Zero-Shot prompt failed to appreciate the question asked about the quark model instead detailing β − in a nucleus.The In Context Learning prompting got the question completely correct but Confirmatory Check approach lost marks due to it stating an electron neutrino rather than an electron antineutrino in the answer.Interesting this may have been because a actual ν character was returned instead of the words 'anti-v' but ν isn't available in the Latin-1 character set.The Zero-shot approach, while thorough, often yielded verbose answers, averaging 427 characters in length.In contrast, the In-context Learning method trimmed responses to an average of 405 characters.The Confirmatory Check approach stood out as the most concise, with answers averaging just 228 characters.Additionally, while some mathematical content in the responses mirrored conventional formats, there were instances where the representations, though appearing correct, were mathematically inaccurate.

AI marking
For this evaluation, only instances where both human and the LLM successfully assigned a grade were included.Out of 3486 AI-generated answers to 1162 questions∥, the LLM only successfully graded 2209 instances, achieving a 63.4% rate of successful evaluations.All scores were normalized to facilitate a fair comparison across questions with different  maximum marks.Human and LLM evaluation showed a concordance in scores for Zeroshot, In Context Learning, and Confirmatory Checking with rates of 49.82%, 51.96%, and 50.54%, respectively.This means that for approximately half of the questions, the LLM gave the same score as the human markers.Among these 2209 graded instances, human markers assigned an average normalized score of 0.515, with a standard deviation of 0.448.The LLM's average normalized score was a lot higher at 0.952 but had a lower standard deviation of 0.167.The observed correlations in Table 3 show that human markers often grade In Context Learning and Confirmatory Checks in a correlated manner, evidenced by a strong internal correlation of 0.913.In contrast, AI markers displayed a slightly weaker internal correlation of 0.662 between these same methods.Comparing human and AI grading reveals a moderate level of agreement, particularly for In Context Learning (ICL) and Confirmatory Checking (CC) with correlation values of 0.241 and 0.257, respectively.Zero-shot prompting shows a weaker correlation of 0.189.Understanding these correlation values alongside the concordance rates suggests that the agreement is higher for straightforward questions with single correct answers.Meanwhile, more complex questions are likely sources of disagreement.These discrepancies may arise from the LLM's different interpretation of the marking guidance or its emphasis on different parts of the response.The LLM-assigned scores also have a lower standard deviation, indicating a more consistent but potentially less nuanced grading approach.

Mathematical capabilities
For the two-integer operations, the LLM achieved an exact accuracy rate of 52.3% across all questions.When a tolerance of ±5% of the exact answer was considered, the accuracy rate climbed to 75.8%.The analysis uncovered significant variances in performance depending on the arithmetic operation involved.While the model demonstrated high accuracy in addition and subtraction across all levels of numerical complexity, its performance in multiplication and division was less reliable, especially with higher-digit numbers.Figure 5 illustrates these findings, highlighting how the LLM's accuracy is influenced by both the operation type and the numerical complexity involved.These results suggest caution when employing the LLM for tasks requiring precise numerical calculations, as its performance can be operation and complexity dependent.
When evaluating operations involving single operands, the LLM achieved an exact accuracy rate of 38.1% on complex mathematical functions.Introducing a margin of error displayed some improvement: the accuracy ascended to 63.2% with a 5% tolerance.Delving deeper into the individual functions, it was evident that the LLM struggled with trigonometric functions when handling more than a single digit.Contrastingly, operations like the natural logarithm and squaring maintained commendable performance -even with larger numbers, they mostly stayed within a 5% tolerance.Figure 6 visually underscores these insights, showing a general trend where performance diminishes with an increase in the number of digits.This observed trend mirrors the pattern from basic arithmetic operations, reinforcing the notion that the LLM's capability diminishes with heightened numerical complexity.The Impact of AI in Physics Education

Overview
Artificial Intelligence, especially in the realm of Large Language Models (LLMs), continues to draw attention in academic circles.Within this landscape, this study set out to evaluate the proficiency of AI in Physics Education.The results presented in this study and elsewhere allow us to make general conclusions about LLM use within Physics Education and to provide recommendations for educators.
For the characteristics of LLM output, one notable aspect is that without a specific syllabus to adhere to, LLMs often introduced innovative methods, leading to novel approaches in answering.While this can be a fresh perspective, it does not always align with the traditional academic evaluations.Contrary to prior work emphasizing the importance of good prompting [22,23], our investigation revealed statistically insignificant difference between different interaction techniques.We found that AI struggles with harder Physics, as shown in Figure 3.As the academic level increased, the amount of correct responses decreased.Previous research has highlighted how AI can often struggle with more complex Physics; beyond introductory textbooks, Yeadon et al. [7] demonstrated how GPT-3.5 typically fails to pass most Physics exams at Durham University.However the latest foundation model GPT-4 consistently outperforms GPT-3.5 and often scores nearly 50% on exams, this is shown in Figure 7.Given these results, and as highlighted at the end of [5], the current potential threat of AI in non-invigilated online exams at university level seems to be relatively contained.In fact, it would be prudent to warn students that AI performance at GCSE and A Level may not transfer to university assessments.This leads to the conclusion that whilst non-invigilated GCSE and A-Level assessments should be wary of how good the latest foundational AI models are, at University level the threat is not as dire.The score of the best AI systems seems to, on average, peak at around 50% for Physics questions meaning currently only the weaker students would benefit.
As a part of a Physics degree, often there are written elements and computational work.Here the threat to assessment fidelity is more pronounced.There are LLMs specifically trained on coding examples which can excel at complex coding tasks found in a computer science focused degree where the complexity would typically be beyond that found in a Physics degree [24,25].Further, research looking at Physics essays specifically found AI excels here [10].It is important that educators are aware of the capabilities in these areas and it is recommended that for coding and essay work, if the assessment is non-invigilated educators should enter their assignments into GPT-4 and see the capabilities themselves.The wide availability and capability of modern LLMs ¶ The modules acronyms are for: ACMP : Advanced Condensed Matter Physics, FoP1 : Foundations of Physics 1, FoP2A : Foundations of Physics 2A, FoP3A : Foundations of Physics 3A, MMP : Mathematical Methods in Physics, MAOP3 : Modern Atomic and Optical Physics 3, P&C : Planets and Cosmology, TA : Theoretical Astrophysics, TP2: Theoretical Physics 2, TP3: Theoretical Physics 2. may be irreconcilable with with take home short essays or typical Physics coding tasks.
LLMs often produce verbose outputs, the AI's proclivity to produce extensive responses, often not proportional to the question's complexity, is not only a hallmark but seems to be an integral part of quality answers.Whilst not statistically significant, there was a decline in performance with the Confirmatory Checking raises concerns about the AI's current capacity for iterative, conversational interaction, resonating with the observations by [9].Interestingly, looking at the linguistics of the output much prior research has highlight how AI generated content is both difficult to detect [26,27] and potentially bias against non-native English speakers [28].Curiously there are simple techniques to get the AI to reveal itself such as asking 'Do you agree with this statement?'will often get the LLM to state 'As an AI assistant I do not have personal opinions, emotions, or preferences.'.Similarly the use of zero-width spaces or hidden prompt injection attacks [29] within questions can also foil LLM effectiveness.
The present work also highlighted how LLMs can struggle with mathematical computations as the lengths of digits involved increases.Of the 5000 mathematical questions asked only 45.2% were answered correctly.The difference here however is that modern computers already have sophisicated mathematical capabilities meaning it would be inapt to use a LLM to work out the cosine of a number when calculators are available.The AI's grading capability further supports this viewpoint, when marking (1) Novel Approaches: LLMs, unbound by a specific syllabus, frequently adopt innovative methods when addressing questions.Such distinct approaches can serve as an indication that LLMs might have been employed in crafting the answer.
(2) Mathematical Hurdles: Pure LLMs, lacking computational tools, face challenges when handling mathematical operations, particularly with numbers exceeding three digits.
(3) Prompting Limitations: Contrary to initial beliefs about best-practice prompt engineering, this study revealed its apparent limited efficacy in the realm of Physics questions.While ICL enhancements did yield improved results, the advantage over Zero-shot approaches was marginal.
(4) Verbose Outputs: The AI consistently produces verbose answers, often misaligned with the question's complexity.Notably, when provided an opportunity to refine its outputs, the LLM frequently produced content of diminished quality.
(5) Graphical Challenges: In this study, an effective method for LLMs to address graphical questions was not identified.With minimal reformatting, LLMs handled questions involving tabled data comparably to other queries.Moreover, for 'sketch' / 'diagram' tasks, LLMs frequently used rows of symbols, offering reasonable attempts.multiple choice questions the AI often struggled to do this simple task correctly, a case of over engineering / using the AI for the wrong task.In fact when extending the marking to all questions a congruence rate of only 50.8% with human evaluations was found, indicating clear limitations in certain areas.On a positive note, during our interactions, the AI maintained a respectful tone without displaying any abusive or exclusionary language, reflecting advancements in ethical AI design.While premium versions of some technologies might be inaccessible to some due to cost, educators should ensure that no student is mandated to use paid resources.To summarize, while AI has made significant strides, limitations persist in its application to Physics.The key conclusions from our study are outlined in Figure 8.

Recommendations
The swift progress in AI technology raises numerous ethical dilemmas, especially regarding its potential misuse in academia, its inherent biases, and its overarching societal repercussions. Echoing the concerns raised by [30], the incorporation of AI into the educational realm warrants a balanced mix of skepticism and meticulous scrutiny.
As AI models continuously advance, a shared responsibility falls upon educators, developers, and policymakers to maintain vigilance, ensuring that AI tools are harnessed ethically and judiciously. In light of the current state of affairs, specific recommendations are posited, as depicted in Figure 9.
(1) Transparency About Capabilities: With over 100 million users, ChatGPT's influence is undeniable, and it's frequently highlighted in the news. Educators should openly discuss its strengths and weaknesses with students, especially its propensity to produce plausible yet occasionally incorrect or incomplete answers.
(2) Caution Students Against Overreliance: While AI may prove valuable at GCSE levels, its effectiveness can diminish in university settings, as illustrated in Figure 2. Students should be reminded that relying heavily on AI can deprive them of genuine learning experiences.
(3) Avoid Teaching AI Interaction Techniques: The study found no significant variance in performance across different prompt engineering methods for Physics questions. This was surprising, as effective prompting techniques are the subject of much research and have been reported to improve performance [31,32,15]. From a Physics teaching perspective, however, there is not enough clear benefit in improving Physics question-answering abilities to justify teaching these techniques. Further, given the rapid advancement of AI, previously effective techniques can soon become outdated.
(4) Change Some Assessment Methods: Non-invigilated coding and short-form essays are very vulnerable to automated completion by LLMs [10]. Further, as AI-written text is difficult to discern [26,27] and detectors are potentially biased against non-native English speakers [28], advertised AI detectors should not be trusted.
(5) Anticipation of Evolving Capabilities: Educators should stay updated with the latest AI advancements. As Yeadon and Halliday's study [7] illustrates, there's a discernible improvement from GPT-3.5 to GPT-4. However, it remains uncertain whether future models will improve further still or approach an asymptote.
(6) Ethical Considerations in AI Use: In our interactions, the AI maintained a respectful tone without exclusionary language, highlighting advancements in ethical AI design. However, educators should ensure equitable access by not mandating the use of premium, potentially inaccessible technologies for students.

Figure 1. Message array used to implement Few-shot prompting via In Context Learning, illustrating how context is provided to guide the Language Model's responses.

Figure 2. Comparative analysis of overall scores achieved by different AI prompting techniques (Zero Shot, Few Shot, Confirmatory Check) across three academic levels: GCSE, A Level, and Introductory University.

Figure 2 illustrates the overall scores achieved by different AI prompting techniques across three academic levels: GCSE, A Level, and Introductory University. The three techniques represented are Zero Shot (blue), In Context Learning (red), and Confirmatory Check (green). The performance of the three prompting techniques remains relatively consistent across the three academic levels, while overall performance decreases as the academic level increases. Although there are slight variations in the percentage of correct answers, none of the techniques consistently outperforms the others across all levels. To determine whether these observed differences were statistically significant, an Analysis of Variance (ANOVA) test was conducted, with the results summarized in Table 2. ANOVA is particularly apt for this analysis as it allows a comparison of means across more than two groups. The null hypothesis for the ANOVA test states that there is no significant difference between the group means; the alternative hypothesis posits that at least one group mean differs from the others.
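For readers wishing to reproduce such a test, a one-way ANOVA can be computed directly from the per-group scores. The sketch below uses illustrative placeholder scores, not the study's data:

```python
from statistics import mean

# One-way ANOVA by hand for three prompting techniques.
# The per-level scores below are illustrative placeholders, not the study's data.
groups = {
    "Zero Shot":           [82.0, 61.0, 36.0],
    "In Context Learning": [84.0, 64.0, 38.0],
    "Confirmatory Check":  [83.0, 64.0, 37.0],
}

grand = mean(x for g in groups.values() for x in g)   # grand mean
k = len(groups)                                       # number of groups
n = sum(len(g) for g in groups.values())              # total observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

# F statistic: ratio of between-group to within-group mean squares
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f_stat:.4f}")  # a tiny F: the group means are nearly identical
```

Comparing the resulting F statistic against the critical value (or its p-value against a significance level such as 0.05) determines whether the null hypothesis can be rejected.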

Figure 3. A detailed breakdown of the AI's performance in terms of percentage correct for various question types (Multiple Choice, Numerical, Word-based) at different academic levels, highlighting areas of success and potential improvement.

Figure 4. Comparison of responses for the given question based on different prompting styles, in response to the question 'Write a decay equation in terms of a quark model for beta-minus decay'.
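For reference, a standard quark-level answer to this question, included here as textbook physics rather than as the figure's content, is:

```latex
% Beta-minus decay at the quark level: a down quark becomes an up quark,
% emitting an electron and an electron antineutrino.
\mathrm{d} \;\rightarrow\; \mathrm{u} + e^{-} + \bar{\nu}_{e}
% Equivalently, at the nucleon level:
\mathrm{n}\,(\mathrm{udd}) \;\rightarrow\; \mathrm{p}\,(\mathrm{uud}) + e^{-} + \bar{\nu}_{e}
```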

Figure 5. Accuracy of the LLM in performing basic arithmetic operations across varying numerical complexity. The bars represent exact accuracy, while the transparent overlay indicates accuracy within a 5% margin. The black dot markers denote the overall average accuracy for each digit length, across all operations.

Figure 6. Accuracy of the LLM in performing common mathematical functions across varying numerical complexity. The bars represent exact accuracy, while the transparent overlay indicates accuracy within a 5% margin. The black dot markers denote the overall average accuracy for each digit length, across all functions.
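The two bar heights in Figures 5 and 6 correspond to two simple metrics: exact-match accuracy, and accuracy within a 5% relative margin of the true value. A minimal sketch, using made-up (true value, LLM answer) pairs rather than the study's data:

```python
# Sketch of the two accuracy measures used in Figures 5 and 6.
# The (true value, LLM answer) pairs are illustrative placeholders.

pairs = [(144.0, 144.0), (1024.0, 1020.0), (7290.0, 7100.0), (56.0, 65.0)]

# Exact accuracy: the LLM's number equals the true value
exact = sum(truth == answer for truth, answer in pairs) / len(pairs)

# Relaxed accuracy: the LLM's number lies within 5% of the true value
within_5pct = sum(abs(answer - truth) <= 0.05 * abs(truth)
                  for truth, answer in pairs) / len(pairs)

print(f"Exact: {exact:.0%}, within 5%: {within_5pct:.0%}")
# Exact: 25%, within 5%: 75%
```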

Figure 7. Performance of GPT-4 and GPT-3.5 on different physics exams, as presented by Yeadon et al. [7]. The black crosses indicate the average student mark from 2018-2021 on the modules for the exam, and the dashed black line shows the 40% score required to pass the exam. Critically, these exams were marked by the same academics who mark student exams. Acronym definitions are provided below. Adapted with permission from [7].

Figure 8. Key conclusions derived from this study's assessment of LLM responses to Physics questions.

Figure 9. Recommendations for educators in addressing AI.

Table 1. Question sources used for the evaluation.

Table 2. ANOVA results for different prompting techniques and question types.

Table 3. Correlation matrix for grades assigned by Humans and the LLM. Human-ZS: Zero-shot prompted answers; Human-ICL: In Context Learning prompted answers; Human-CC: Confirmatory Check prompted answers; LLM-ZS, LLM-ICL, and LLM-CC are analogous for the LLM.
4.3. Concluding Thoughts

AI is set to change how we approach education. Drawing from the findings of this study and the broader literature, it's clear that within the realm of Physics education, AI presents a spectrum of threats and opportunities that vary based on context. Assessments at earlier educational stages, such as GCSE and A-Level, are notably susceptible when they are open-book. In contrast, when addressing advanced topics, especially at the university level and in textbook work, AI does not consistently provide correct answers, regardless of the prompting style. Moreover, students producing a high volume of quality work should not be unwelcome. The primary concern should be the active and meaningful involvement of students in creating such work. The path ahead remains uncertain; forthcoming Foundation models might bring about marginal enhancements or represent substantial breakthroughs in capabilities. With sustained research, assessment, and collaboration, the academic community has the opportunity to channel the potential of AI, ensuring it enhances, rather than diminishes, Physics education.

'Square brackets contain necessary information. Based on the question, solution, and any guidance, assess the answer's correctness. Return only a number indicating the marks. Always return a number from 0 to 9. Responses are tested using the Python .isdigit() method. Any non-numeric answer will be sent back for reevaluation.'

Figure A1. Condensed system prompt for AI self-marking. The AI was programmed to return a numerical score based on the question's solution and guidance. The full prompt, with multiple detailed examples, is abbreviated here for brevity. The AI accurately marked questions 58.8% of the time.
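The reevaluation behaviour described in Figure A1 amounts to a simple validation loop around the model call. A minimal sketch, in which `ask_llm` is a hypothetical stand-in for the real chat-completion request:

```python
# Sketch of the numeric-mark validation described in Figure A1.
# ask_llm is a hypothetical stand-in for the real chat-completion call.

def ask_llm(prompt: str, attempt: int) -> str:
    # Placeholder: pretend the model fails once, then returns a bare digit.
    return "I would award 7 marks." if attempt == 0 else "7"

def get_mark(prompt: str, max_retries: int = 3) -> int:
    """Keep re-asking until the reply is a single number from 0 to 9."""
    for attempt in range(max_retries):
        reply = ask_llm(prompt, attempt).strip()
        if reply.isdigit() and 0 <= int(reply) <= 9:
            return int(reply)
        # Non-numeric answer: send back for reevaluation on the next loop.
    raise ValueError("No valid numeric mark returned")

print(get_mark("[question] [solution] [answer]"))  # 7
```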