ChatGPT as a tool for honing teachers’ Socratic dialogue skills

In this proof-of-concept paper, we propose a specific kind of pedagogical use of ChatGPT—to help teachers practice their Socratic dialogue skills. We follow up on the previously published paper ‘ChatGPT and the frustrated Socrates’ by re-examining ChatGPT’s ability to engage in Socratic dialogue in the role of a physics student. While in late 2022 its ability to engage in such dialogue was poor, we see significant advancements in the chatbot’s ability to respond to leading questions asked by a human teacher. We suggest that ChatGPT now has the potential to be used in teacher training to help pre- or in-service physics teachers hone their Socratic dialogue skills. In the paper and its supplemental material, we provide illustrative examples of Socratic dialogues with ChatGPT and present a report on a pilot activity involving pre-service physics and mathematics teachers conversing with it in a Socratic fashion.


Introduction
Only a year after the release of ChatGPT, the use of Artificial Intelligence-based chatbots in education has gained steam. This upward trend is expected to continue due to the constantly growing body of research on the potential of this technology to improve teaching and learning [1][2][3]. Despite the risks related to the use of AI-based chatbots [4,5], they hold the potential to be useful for both students and teachers. Students could benefit from having a personal, always-available, and patient tutor [6,7], a peer for doing homework [8], or an "object to think with" [9]. On the other hand, teachers could take advantage of a time-saving assistant for tasks such as grading [10] or lesson planning [11], as well as a source of material for classroom or homework activities [12]. However, for these applications to achieve optimal effectiveness, it is crucial that the chatbot performs well in the specific domain [13]. This does not seem to be the case in physics, where even the state-of-the-art chatbot ChatGPT-4 [14] still has some way to go before its performance can be considered expert-like. Surprisingly, this is true not only for advanced physics topics but even for introductory conceptual physics, where it can still clumsily fail [15]. Therefore, the effectiveness of the technology is contingent upon the users' expertise in the subject matter, their ability to craft useful prompts, and their ability to critically evaluate the outputs.

In this paper, we explore a potential application of ChatGPT for training physics teachers in a specific pedagogic skill. To become experts at facilitating learning, teachers need the ability to engage in meaningful dialogue with students. Above all, they need to practice "active listening": recognising the strengths and weaknesses of students' reasoning and responding in appropriate ways [16]. One technique by which a teacher can help students is not directly validating or invalidating their answers, but instead asking strategically placed questions, which can lead students to improve their answers. We refer to this approach as "Socratic dialogue". Here we discuss the feasibility of using several versions of ChatGPT to play the role of a tutee, or a model of a student, with whom teachers can engage in Socratic dialogue. This may be potentially useful for pre-service teachers, who often do not have easy access to students with whom they could practice their teaching skills. Moreover, because of their lack of experience, it is easier for pre-service teachers to get overwhelmed by the complexity of real-world teaching situations. The controlled, private and safe environment offered by a chatbot could potentially serve as a good starting point.

Background
In late 2022, just a few weeks after the release of ChatGPT, one of the authors, together with a colleague, reported in this journal on their initial experimentation with this chatbot, asking it a conceptual introductory physics question: "A teddy bear is thrown into the air. What is its acceleration in the highest point?" [17] (p. 2).
They noticed that the response was linguistically advanced, but contained serious physics errors and inconsistencies. Moreover, upon attempting to have the chatbot reconcile the inconsistencies through Socratic dialogue, they were met with confident-sounding but incorrect responses, a feature well-recognized in the model that was in use at the time [18]. They also found it very difficult to have the chatbot notice its own errors and develop its responses based on a critical reassessment of its own answers, even when they were clearly incorrect and inconsistent with each other. The first conclusion at the time was that the fairly common generation of physics errors and inconsistencies made ChatGPT unsuitable for the role of a physics tutor. Furthermore, its confidence and stubbornness severely limited the possibility of having a Socratic dialogue of the kind that happens between a teacher and a student. Hence the title of the paper, "ChatGPT and the frustrated Socrates".

Since then, ChatGPT has evolved in terms of both its performance in physics and its "character" during dialogue. Both the freely available version (ChatGPT-3.5) and the subscription-based one (ChatGPT-4) have improved. For example, ChatGPT-3.5 no longer makes the same mistake in answering the above-cited question, and has become "humbler" and more responsive to the user's follow-up questions. For an illustration of a Socratic-style dialogue with ChatGPT-3.5 conducted in February 2024, see supplemental material A1. ChatGPT-4, in turn, outperforms ChatGPT-3.5 in terms of its physics performance [19,20] and, as we show in this paper, in its responsiveness to critical feedback and guiding questions. We illustrate this in section 3, through a case study of a Socratic interaction with ChatGPT-4. For another example of a Socratic dialogue with ChatGPT-4, see supplemental material A2.

The finding that both versions have become better partners in Socratic dialogue allows us to revisit the idea of using ChatGPT as a training tool for pre-service teachers. In section 4, we report on a practical implementation of a Socratic dialogue-based activity with ChatGPT-3.5 in the context of pre-service physics teacher training, carried out in April 2023. We also highlight the immediate benefits and challenges of using the chatbot in this way. Examples of pre-service teachers' conversations with ChatGPT can be found in supplemental material B. In section 5, we address the potential of the two versions of ChatGPT to serve as a model of a student for the purposes of teacher training, and reflect on the experienced and potential benefits of such activities. Finally, in section 6 we summarise the key findings and provide suggestions for future work on the topic.

Socratic dialogue with ChatGPT-4: an illustrative example
To be able to engage in Socratic dialogue, we first looked for tasks that the chatbot would not solve correctly on the first attempt. For example, we found that the question used in the "Frustrated Socrates" paper was too easy for ChatGPT-4: we asked it 10 times in separate chat windows and always received a correct answer. We thus used a slightly more challenging but topically related question, which turned out to be difficult enough for the chatbot to answer incorrectly: Nina releases a bouncy ball, which falls to the ground and bounces back to the same height it was released from. Describe the graph of velocity vs. time for the ball.
Figure 1 shows our own sketch of the v-t graph for the described scenario. We have chosen the positive direction to be upwards. The velocity starts at zero, as the ball is released from rest. The ball first accelerates downwards with constant acceleration (directed downward and thus negative). This is represented by the straight line with constant negative slope, starting at t=0 and reaching its minimum (maximum in speed) just before the collision with the ground (t=tc). We assume that the collision happens instantaneously and is perfectly elastic. At the moment of collision (t=tc), the velocity therefore changes direction, keeping the same magnitude it had just before the collision: the graph "jumps" from the most negative velocity to the most positive velocity (of the same magnitude). Just after the collision, the ball moves upward (in the positive direction) with positive velocity, which decreases linearly at the same rate as before the bounce (with the same negative constant acceleration). Our observation ends when the ball's velocity becomes zero, indicating it has momentarily stopped at the highest point (at the height of its initial release). After this, the straight downward-sloping line of the graph would continue until the next bounce.

We now turn to ChatGPT-4's initial response to the above-stated task. The excerpt below shows our prompt, followed by the chatbot's response. We bolded the incorrect statements.

User
Nina releases a bouncy ball, which falls to the ground and bounces back to the same height it was released from. Describe the graph of velocity vs. time for the ball.

ChatGPT
The graph of velocity vs. time for a bouncy ball that falls to the ground and bounces back to the same height would have the following characteristics:
1. **Initial Descent Phase**: At the start, when Nina releases the ball, it starts from rest, so its initial velocity is zero. As it falls, its velocity increases in the negative direction (assuming down is negative) due to gravity. This section of the graph would show a straight line with a negative slope, representing constant acceleration (due to gravity).
2. **Just Before Impact**: Right before the ball hits the ground, its velocity is at its maximum (in the negative direction), as it has been accelerating downwards.
3. **Impact with the Ground**: The impact with the ground is a very brief moment where the velocity changes direction. The graph here would show **a sharp spike**, as the velocity changes from its maximum negative value to a positive value. This is an instantaneous change, representing the elastic collision with the ground.
4. **Ascent Phase**: After the bounce, the ball moves upwards. Its velocity is now positive (upwards direction) and decreases as it moves up due to the opposing force of gravity. **This would be represented by a straight line with a positive slope**, again indicating constant deceleration (due to gravity).
5. **At the Maximum Height**: When the ball reaches the same height it was released from, its velocity becomes zero momentarily before it starts to fall again. This would be the peak point on the graph where the line crosses the time axis.
6. **Repeated Motion**: If the ball continues to bounce without losing energy, this pattern (descent, impact, ascent, peak) would repeat, showing similar shaped lines for each bounce. However, in a real-world scenario, each successive bounce would be lower due to energy losses, and this would be reflected in the graph as smaller peaks and less negative troughs in the velocity.
Overall, the graph would consist of straight lines with alternating negative and positive slopes, interspersed with sharp spikes at the moments of impact with the ground.
We can notice that ChatGPT-4 makes appropriate assumptions and mostly describes the graph correctly. However, it makes a mistake when describing the slope of the graph after the bounce, claiming it changes its sign and becomes positive. Moreover, the meaning of "sharp spikes" is unclear. The first relevant finding of this case study is that we were able to find an introductory conceptual physics question that ChatGPT-4 did not answer correctly. This further supports previous findings suggesting that its performance on even basic physics tasks significantly limits the usefulness of the chatbot as a physics tutor [15]. We now present and examine the Socratic dialogue we had with the chatbot, in which we attempted to make it improve its initial response.

User
In the ascent phase, you say that the slope is positive. If that is the case, then it will never become zero. How does the zero velocity at maximum height occur then?

ChatGPT
You're correct to point out the inconsistency in my previous explanation. During the ascent phase, when the ball is moving upwards against gravity after bouncing, the slope of the velocity vs. time graph should actually be negative, not positive. This represents the ball's deceleration as it moves upward.
Here's the corrected description of the ascent phase:
- **Ascent Phase**: After the bounce, as the ball moves upwards, its initial velocity is positive (upward direction). However, due to gravity acting downwards, the ball decelerates. This deceleration is represented by a negative slope on the velocity vs. time graph. The ball's velocity decreases linearly until it reaches the maximum height, where the velocity becomes zero momentarily. This peak point, where the ball's velocity is zero, is represented by the line crossing the time axis.

So, during the ascent, the velocity decreases from a positive value (right after the bounce) to zero (at the maximum height), due to the negative acceleration (i.e., deceleration) caused by gravity. The graph will show a straight line with a negative slope during this phase.
In contrast to its 2022 version, ChatGPT-4 immediately recognises that its answer contains inconsistencies. It then proceeds to address them in a meaningful way, and its new response is correct. However, we decided to probe it further, to see if it could produce a more detailed graph description. This can be seen as a simulation of a more in-depth investigation of a student's understanding. We asked it the following follow-up question:

User
How will the slopes of the initial descent phase and the ascent phase compare?

ChatGPT
The slopes of the initial descent phase and the ascent phase in a velocity vs. time graph for a bouncing ball will have the same magnitude but opposite signs. This is because the acceleration due to gravity is constant in both phases, but its effect on the velocity is in opposite directions.
1. **Initial Descent Phase**: In this phase, the ball accelerates downwards due to gravity. If we consider downward direction as negative, then the acceleration due to gravity is also negative. Since the ball starts from rest and accelerates downwards, the velocity becomes increasingly negative over time. The graph here shows a straight line with a negative slope. The magnitude of this slope represents the acceleration due to gravity.
2. **Ascent Phase**: After the ball bounces, it moves upwards. While the ball is moving upwards, gravity is still acting downwards, effectively decelerating the ball. This deceleration is numerically equal to the acceleration during descent. In the ascent phase, the ball starts with a positive velocity (upwards) and this velocity decreases over time due to the negative acceleration (deceleration). The graph here shows a straight line with a positive slope. The magnitude of this slope is the same as that of the initial descent phase, but the sign is opposite.
In summary, the magnitude of the slopes during both the descent and ascent phases is the same because the acceleration due to gravity is constant. However, the slope is negative during descent (increasing negative velocity) and positive during ascent (decreasing positive velocity).
Here, we can once again see the same mistake of claiming that the slope of the graph is positive after the bounce. Thus, it was not enough for the chatbot to admit and correct this mistake once, as it re-emerged after further probing questions. Similar behaviour is also not uncommon with human students, who often need to engage with difficult ideas repeatedly to master them; a single "Aha!" moment is an insufficient indicator of lasting learning having taken place. In this sense, the chatbot is a useful simulator of a student for the purposes of teacher training. We once again asked a leading question to encourage it to reconcile its previous responses with the latest one:

User
How does this fit with your previous description of the ascent phase that I asked you about in the first follow-up question?

ChatGPT
You are right to seek clarification. My previous response about the ascent phase described the slope as negative, which indeed fits with the reality of the situation. Let's reconcile this with the comparison of the slopes in the initial descent and ascent phases:
1. **Initial Descent Phase**: During the descent, the ball accelerates downward due to gravity. If we take downward as the negative direction, this means the velocity is increasing in the negative direction. The slope of the velocity vs. time graph in this phase is negative, indicating a negative acceleration (gravity pulling the ball downwards).
2. **Ascent Phase**: After the ball bounces and moves upward, it starts with a high positive velocity. As it ascends, gravity slows it down, meaning the velocity decreases. This deceleration is again due to gravity, and the magnitude of this acceleration (in absolute terms) is the same as during the descent. However, because the velocity is decreasing, the slope of the velocity vs. time graph in this phase is also negative.
In both phases, the ball is under the influence of the same gravitational acceleration. The key difference is the direction of the velocity: negative (downward) during descent and positive (upward) during ascent. The slope of the graph is negative in both cases because gravity is continuously acting in the downward direction, causing an increase in downward velocity during descent and a decrease in upward velocity during ascent.
Therefore, the slopes during the descent and ascent phases are both negative, reflecting the constant negative acceleration due to gravity. The difference is in the sign of the velocity: negative during descent and positive during ascent until it reaches zero at the peak of the bounce.
Once again, ChatGPT-4 responds in a way that affirms the value of our question, recognises the inconsistency, and proceeds to reconcile it. The response is once again correct. Compared to its 2022 version, ChatGPT-4 displays better knowledge of physics, as well as a "humbler" and more responsive attitude in the conversation. This significantly transformed the quality of the interaction that we could have with it. The chatbot's responses were less defensive and self-assured, and of higher quality in terms of physics content, making the dialogue significantly more productive.
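For readers who wish to check the physics at stake in this exchange, the sign of the v-t slope before and after the bounce can be verified with a short numerical sketch. The snippet below is our own illustration, not part of the dialogue; the value g = 9.8 m/s², the drop height of 1 m, and the instantaneous, perfectly elastic bounce are assumptions matching the scenario of figure 1. It confirms that the slope equals -g in both phases, while the velocity itself jumps from negative to positive at t = tc.

```python
import math

G = 9.8    # gravitational acceleration (m/s^2); positive direction is upward
H = 1.0    # assumed release height (m)
T_C = math.sqrt(2 * H / G)   # time of the collision with the ground

def velocity(t):
    """Velocity of the ball at time t (positive direction upward).

    Before the bounce the ball falls freely from rest; at t = T_C an
    instantaneous, perfectly elastic bounce flips the sign of the velocity.
    """
    if t < T_C:
        return -G * t                  # descent: v starts at 0, grows more negative
    return G * T_C - G * (t - T_C)     # ascent: v jumps to +G*T_C, then decreases

# Slope (dv/dt) estimated by finite differences on each side of the bounce:
dt = 1e-4
slope_descent = (velocity(T_C - dt) - velocity(T_C - 2 * dt)) / dt
slope_ascent = (velocity(T_C + 2 * dt) - velocity(T_C + dt)) / dt

print(round(slope_descent, 2), round(slope_ascent, 2))  # both slopes equal -G
print(velocity(T_C - dt) < 0 < velocity(T_C + dt))      # velocity jumps sign at T_C
```

The output illustrates the point the chatbot initially got wrong: the slope of the v-t graph is -9.8 m/s² in both phases; it is the velocity, not the slope, that changes sign at the bounce.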

ChatGPT in pre-service teacher training: a pilot study
In April 2023, we performed a pilot study, in which we asked 17 pre-service teachers in physics and mathematics from one Swedish and one Slovenian university to engage with ChatGPT as part of their training coursework. Since GPT-4 had been released just a few weeks before the activity and was not freely available, all participants opted to use ChatGPT-3.5. We asked them to (a) find at least one physics question that the chatbot would get right and at least one that it would get wrong, (b) use guiding questions to make it improve the unsatisfactory responses, and (c) reflect on how the dialogue went. While this pilot study is part of a larger research project on which we plan to report in a separate paper, it is worth highlighting some initial findings relevant to the practical use of the activity in teacher training.

In summary, the pre-service teachers reported 27 dialogues with ChatGPT; 21 dialogues were initiated by ChatGPT's wrong answer, and in 6 dialogues the participants asked follow-up questions to elaborate on a correct initial answer, even though that was not required of them by the task. The length of the dialogues spanned from 2 up to 21 prompts, with an average of 8 prompts for the dialogues with a wrong initial answer and 3 prompts for the elaborative conversations. Out of the 21 dialogues with a wrong initial answer, 16 were considered successful by the participants, while in the remaining 5 dialogues the participants gave up before ChatGPT gave an answer they would deem satisfactory. In 3 of the latter cases, they did not insist on the correctness of the numerical value. In their reflections, all participants were positive about the activity. However, 5 of them also expressed feelings of frustration related to their interaction with ChatGPT.

The study yielded several benefits, enhancing the educational experience for pre-service teachers. The most immediate outcome was the high level of engagement and lively discussion the activity fostered among the participants. It proved beneficial for collaborative learning, as students were eager to share their experiences, strategies, and insights gained from interacting with ChatGPT. A noteworthy aspect of the study was the natural progression of focus, from interacting with ChatGPT to engaging in deeper discussions about students' behaviours and characteristics in the learning process. Initially, the teachers concentrated on navigating the chatbot's responses, but soon this task evolved into a broader, reflective discourse. As they began to analyse the outputs from ChatGPT, the conversation spontaneously shifted towards a comparative analysis of the chatbot's answers and those typically given by students.

Throughout their interaction with ChatGPT, the participants gained a nuanced understanding of the chatbot's strengths and weaknesses in an educational context. They experienced first-hand how it managed various types of questions, ranging from simple calculations to more complex conceptual physics problems, recognising in which aspects the chatbot generally performs better. Such awareness is increasingly important for teachers as AI becomes more present in educational contexts. A critical learning outcome of this activity was training the ability to identify both productive and unproductive aspects of ChatGPT's responses. The task demanded high levels of concentration and critical thinking, as the participants had to discern subtle inaccuracies or mistakes in otherwise well-written and professional-sounding responses.

As expected, the exercise also involved experiencing some frustration, which, to a certain extent, could be considered a potential benefit. Engaging with ChatGPT-3.5, especially when trying to correct its mistakes or probe deeper into its reasoning, sometimes mirrored the challenges teachers face with real students. This element of the activity served as a realistic preparation for the sometimes frustrating yet rewarding journey of teaching, and most of the participants pointed it out. It underscored the importance of patience and adaptability, valuable qualities for any educator.

Transitioning from these beneficial aspects, the study also presented specific challenges that provided additional learning opportunities. One of the primary challenges experienced by participants was finding questions that ChatGPT-3.5 would answer incorrectly. While the chatbot demonstrated proficiency in responding to factual information questions, it often faltered with conceptual questions or mathematical tasks. This tendency revealed to the teachers an interesting aspect of ChatGPT's capabilities: its relative strength in handling straightforward, fact-based questions compared to conceptual or reasoning-based ones. This observation has important implications for the use of such AI tools in educational settings, indicating a need for careful question selection that challenges the AI's capabilities and provides meaningful learning opportunities for students. Additionally, the frustration that dialogue with the chatbot often led to also emerged as a challenge. While a degree of frustration is inherent and even beneficial in the learning process, there were instances where the interaction with ChatGPT became overly exhausting, causing some participants to cut it short. This was particularly evident when the chatbot repeatedly failed to grasp the essence of more complex or abstract questions. We provide some examples of dialogues the pre-service teachers had with ChatGPT, along with the accompanying reflections, in supplemental material B.

Discussion and future work
The ability to admit mistakes and revisit its initial responses makes the currently operating versions of ChatGPT suitable as a model of a student who is prepared to learn from their own mistakes. This also makes it interesting for potential use in teacher training. While both the freely accessible and the paid version of ChatGPT are an improvement over its initial 2022 version, the paid version is easier to guide and has better physics knowledge. In effect, the two versions can be used to simulate students with somewhat different characteristics. We could picture ChatGPT-4 modelling an easily guidable student with relatively solid physics knowledge, while ChatGPT-3.5 models a more stubborn, self-confident and tougher-to-guide student with more severe flaws in their physics knowledge. When using the two versions of the chatbot in pre-service teacher training, starting with ChatGPT-4 will likely give the teachers a less frustrating and softer experience, while ChatGPT-3.5 can provide opportunities for more challenging dialogues.

However, it is important to keep in mind that a chatbot can, at most, simulate a simplified model of a student, and it is not a replacement for genuine interactions with human students. Of course, training Socratic skills with ChatGPT alone does not make one a good teacher, much like training a specific manoeuvre in a flight simulator does not in itself make one a good pilot. We see the potential of ChatGPT as a training device that could help teachers practice specific skills and teaching techniques in a comfortable and safe environment. This kind of use is made even more relevant in light of recent research showing that, with appropriate prompting, ChatGPT can be made to answer physics questions in line with typical student difficulties [22]. This even suggests some potential for using AI-based chatbots to explicitly enact different student personas and further diversify the content of the teacher-chatbot Socratic dialogue. While we have not systematically explored this possibility, it appears to be a promising avenue for future exploration.

Moreover, the cases presented in this paper are limited to interactions in natural language. While this has been and remains one of the main ways of interacting with chatbots, it is worth noting that since autumn 2023, ChatGPT-4 has also been able to process and generate images, as well as run code and return graphs as output [23]. Exploring how these advanced features can be integrated into Socratic dialogue and teacher training represents another possible direction for future educational development and research.

Conclusion
ChatGPT has improved its physics performance and interaction abilities in the last year. While it was challenging to have a productive Socratic dialogue with it when it was released in late 2022, the currently available versions of the chatbot offer better possibilities for their use in education. When engaged in Socratic dialogue with a user, ChatGPT has become better at critically reassessing its own output, recognising inconsistencies in previous responses, and addressing them. This allows a more productive dialogic interaction, in which the user can lead the chatbot to improve its responses with strategically placed questions. ChatGPT thus has the potential to serve as a training tool for teaching physics through Socratic dialogue. This is supported both by our direct experiences interacting with it and by the practical insights gained from our pilot study, which demonstrates a possible way of meaningfully integrating the chatbot into teacher training. We hope our considerations and reported experiences can spark interest among teachers and teacher educators to test and further explore different ways of using AI-based chatbots for physics teaching and learning, and to report their findings and innovations to the rest of the physics education community.

Fig. 1: A sketch of the velocity vs. time graph of a ball falling from rest and bouncing from the floor back to the same height from which it was released. The dashed vertical line at tc marks the moment of collision with the ground and highlights the discontinuity of the function.