Comparing AI and student responses on variations of questions through the lens of sensemaking and mechanistic reasoning

Physics education research (PER) has a rich tradition of designing learning environments that promote valued epistemic practices such as sensemaking and mechanistic reasoning. Recent technological advancements, particularly artificial intelligence, have gained significant traction in the PER community because of AI's human-like, sophisticated responses to physics tasks. In this study, we contribute to these ongoing efforts by comparing AI (ChatGPT) and student responses to a physics task through the cognitive frameworks of sensemaking and mechanistic reasoning. Findings highlight that, by virtue of its training data set, ChatGPT's responses provide evidence of mechanistic reasoning and mimic the vocabulary of experts. On the other hand, half of the students' responses evidenced sensemaking and reflected an effective amalgamation of diagram-based and mathematical reasoning, showcasing a comprehensive problem-solving approach. Thus, while the AI responses elegantly reflected how physics is talked about, a portion of the students' responses reflected how physics is practiced. In the second part of the study, we presented ChatGPT with variations of the task, including an open-ended version and one with significant scaffolding. We observed significant differences in the conclusions reached and in the use of representations between students and AI and across the task formats.


Introduction
This paper describes a small project in which we compared AI responses to student responses on a physics question and two variations of that question. In her plenary lecture, Lana Ivanjek provided a nice overview of some of the other projects on artificial intelligence, particularly in relation to physics [1]. Most of the projects completed so far have investigated whether ChatGPT, primarily ChatGPT 3.5, can answer certain physics questions. There are papers discussing whether ChatGPT can pass the Force Concept Inventory, pass an introductory physics course, get students into graduate school, and so forth [2,3]. A very recent paper describes an investigation of a large number of courses, 32 different university courses; this particular study was done at New York University in Abu Dhabi [4]. The researchers had students complete certain quiz questions and submitted the same questions to ChatGPT. They then mixed the responses together and had them evaluated by graduate student graders. In nine of the 32 classes, ChatGPT did as well as or better than the students. (None of these were physics courses.) The popular press (e.g. Scientific American) concluded from this study that ChatGPT is rather good and is likely to do well in college courses. However, it does not do so well on conceptual questions. Many of the studies just look at the answers that ChatGPT gives to quiz or similar questions.

Background
Our study began at the conclusion of a dissertation project. The second author of this paper was studying sensemaking and mechanistic reasoning among real students in an algebra-based physics course. The problems that he asked students to solve were designed to get them involved in scientific practice, rather than just getting an answer. Once he finished defending the dissertation, we thought it would be interesting to see what happened if we submitted one of these questions to ChatGPT. We discovered that ChatGPT had some difficulties, so we investigated whether it could do better on variations of the problem.
For sensemaking we base our work on the definition by Odden and Russ [5]: "a dynamic process of building or revising an explanation in order to 'figure something out' - to ascertain the mechanism underlying a phenomenon in order to resolve a gap or inconsistency in one's understanding." To determine whether the students and ChatGPT were involved in sensemaking, we looked for the five components listed in Table 1 [6].

Table 1: Sensemaking elements
Noticing of inconsistency (gaps) in understanding.
Blending everyday and formal knowledge.
Generating and connecting ideas.
Seeking coherence between the generated ideas.
Unpacking the mechanism of the phenomenon.

Mechanistic reasoning is even more complex. A description from Krist, Schwarz and Reiser [7] states that it "entails generating explanations by moving from the observable features of the phenomenon to the underlying entities or processes." Again, the process is about moving from what one can observe, or in the case of a problem what is given, to a broader view. The components listed in Table 2 could be used as a rubric to see whether the answers from the students and the machine showed mechanistic reasoning.

The problem for students and ChatGPT
The problem that we originally presented to the students and ChatGPT is shown in Figure 1.
The ride described there frequently appears in carnivals, festivals, and so forth, at least in the US. It is just a big cylinder. People get in, they stand up against the wall of the cylinder, and the cylinder and the floor start spinning. When they are spinning fast enough, the floor drops out, but the people are stuck to the wall, so they do not drop down. This problem is slightly different from the standard end-of-chapter textbook problem. The usual problem, which we will see later, is to find the optimum angular velocity of the cylinder so that people will be stuck to the wall and will not drop down. In this one, we give the students some parameters, including the angular velocity, and ask them whether people will stick to the wall. The answer, by the way, is that they will not; with the parameters we have given, they will fall down with the floor. For the dissertation research, eight students from an algebra-based physics class were interviewed. So, we submitted the problem to ChatGPT eight times. The students were thinking aloud as they worked through the problem. The interviewer did not try to influence them in any way as to what their answer was. For the analysis he had transcripts of the thinking aloud plus the diagrams, equations, and text that the students had written. From the video he could also analyse any gestures that the students made. For details on how the responses were put into the different categories of sensemaking and mechanistic reasoning, see reference [6].
You are asked to design a Gravitron, an amusement park ride where the rider enters a hollow cylinder, radius of 4.6 m; the rider leans against the wall and the room spins until it reaches angular velocity, at which point the floor lowers. The coefficient of static friction is 0.2. You need this ride to sustain mass between 25-160 kg to be able to ride safely and not slide off the wall. If the minimum ω is 3 rad/s, will anyone slide down and off the wall at these masses? Explain your reasoning using diagrams, equations and words.

Figure 1: The problem presented to students and ChatGPT. (Referred to as Open Ended below.)
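For reference, a minimal worked solution of the friction condition (our own summary for the reader; it is not drawn from any student or ChatGPT response). The wall's normal force supplies the centripetal force, $N = m\omega^2 r$, and a rider stays up only if static friction can support the rider's weight:
\[
\mu_s N \ge m g
\quad\Rightarrow\quad
\mu_s m \omega^2 r \ge m g
\quad\Rightarrow\quad
\omega \ge \sqrt{\frac{g}{\mu_s r}} .
\]
The mass cancels, so the outcome is independent of the rider's mass. With $\mu_s = 0.2$ and $r = 4.6$ m, $\omega_{\min} = \sqrt{9.8/(0.2 \times 4.6)} \approx 3.26$ rad/s, which is larger than the given 3 rad/s, so every rider slides down with the floor.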

Initial results
With ChatGPT, as stated above, the problem was submitted eight times. Obviously, we could not analyse its thinking aloud or gestures, but we could analyse the text and any diagrams that ChatGPT generated. The results are shown in Table 3, where "AI" means ChatGPT 3.5.

Table 3: Incidences of sensemaking elements for AI and students

                                             AI   Students
Noticing gaps in understanding.               -      4
Blending everyday and formal knowledge.       8      6
Generating and connecting ideas.              8      5
Seeking coherence between ideas.              8      4
Unpacking the mechanism of the phenomenon.    8      4

Table 4: Incidences of mechanistic reasoning elements for AI and students

                                             AI   Students
Generating and connecting ideas.              8      5
Linking spatial and temporal relations.       8      6
Use of diagrams.                              3      7

For the first entry in Table 3, ChatGPT is just like a teenager; it knows everything, so it does not see any gaps in its knowledge. Otherwise, it matches all of the components of our rubric. While fewer of the students' responses match the rubric, most of them do meet the criteria set by each of the components. However, ChatGPT fails on the use of diagrams. Diagrams appear in only three of the eight solutions, and they are rather poor diagrams. Two of the three diagrams are shown in Figure 2.
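As a purely illustrative aside, the sketch below shows how incidence counts like those in Tables 3 and 4 can be tallied once each response has been hand-coded. The element names follow the tables, but the coding itself, as in [6], is done by a human reader; this snippet is our own illustration, not part of the study.

from collections import Counter

SENSEMAKING_ELEMENTS = [
    "noticing gaps in understanding",
    "blending everyday and formal knowledge",
    "generating and connecting ideas",
    "seeking coherence between ideas",
    "unpacking the mechanism of the phenomenon",
]

# Each entry is the set of rubric elements a human coder identified in one response.
coded_responses = [
    {"blending everyday and formal knowledge", "generating and connecting ideas"},
    {"generating and connecting ideas", "seeking coherence between ideas",
     "unpacking the mechanism of the phenomenon"},
    # ... one set per response ...
]

# Tally how many responses exhibit each element (one column of Table 3).
counts = Counter(element for coded in coded_responses for element in coded)
for element in SENSEMAKING_ELEMENTS:
    print(f"{element}: {counts[element]}")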

Variations on the problem
We said we were not interested in the right answer, but we cannot ignore it entirely. ChatGPT got it wrong, eight times out of eight. Half of the students got it right. That is when we thought we would try a different view: give the AI a slightly different problem and see if it could do any better. Figure 3 shows the first variation; the important difference is that we asked the machine to find the angular velocity. The second variation (Figure 4) is a scaffolded version of the problem with three steps. First, we ask, "What assumptions did you make?" Then we tell it to create a free-body diagram, and then we ask the question. So, it is scaffolded.

Figure 3: The first variation on the problem. In this variation the angular velocity is not given; it is close to a typical end-of-chapter textbook problem on this topic. (Referred to as Modified Gravitron below.)

You are asked to design a Gravitron for the county fair, an amusement park ride where the rider enters a hollow cylinder, radius of 4.6 m; the rider leans against the wall, and the room spins until it reaches angular velocity, at which point the floor lowers. The coefficient of static friction is 0.2. You need this ride to sustain mass between 25--160 kg to be able to ride safely and not slide off the wall. What should be the minimum angular velocity of the ride to keep the riders from slipping down? Explain your reasoning using diagrams, equations and words.

Figure 4: The second variation of the problem. In this version three questions provide scaffolding to help the AI reach the conclusion. (Referred to as Scaffolded below.)

You are asked to design a Gravitron for the county fair, an amusement park ride where the rider enters a hollow cylinder, radius of 4.6 m; the rider leans against the wall, and the room spins until it reaches a specified angular velocity ω, at which point the floor lowers. The coefficient of static friction is 0.2. You need this ride to sustain mass between 25--160 kg (i.e., they should be able to ride safely and not slide off the wall).
A.) What assumptions do you need to make to be able to solve this?
B.) Create a free body diagram for the rider when the room is spinning. Note all applicable forces and label them.
C.) If the floor drops out when ω is 3 rad/s, will anyone slide off the wall in the given mass range? Explain your reasoning.
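For reference, the expected answers to both variations follow from the same friction condition worked out above (again our own calculation): the Modified Gravitron's answer is $\omega_{\min} = \sqrt{g/(\mu_s r)} \approx 3.26$ rad/s, and in the Scaffolded version the riders again slide down at $\omega = 3$ rad/s, independent of mass, since $3 < 3.26$.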
The results are not significantly different from the previous ones. Table 5 combines all of the components for sensemaking and mechanistic reasoning, except for diagrams. On the reasoning elements ChatGPT did very well. We did not give the variation similar to the standard textbook problem to students, so for that version there are only ChatGPT results. Only in the scaffolded version, which explicitly says to draw a diagram, did the artificial intelligence draw diagrams like the ones we have already seen. What about actually answering the problem? Table 6 shows that the AI did better on the scaffolded problem: three of the solutions were correct. That is still not as good as the students. And in one attempt ChatGPT basically gave up, just as some of our students did.

Conclusions
We can state several preliminary results. When we compare how well students and AI reach a correct answer, half of the students answered correctly while the AI was essentially never correct. Yet, at the same time, the AI does demonstrate all of the components of sensemaking and mechanistic reasoning. If the AI's output were a response to a quiz question being graded, it would receive considerable partial credit but not full credit, because the answer is not correct. This reminds us of a student who does not know how to do a problem but writes down everything they know in hopes that something resonates with the grader. For two aspects of solving a problem, assumptions and diagrams, the AI mostly displayed them only when explicitly asked to. When the AI is asked to explicitly state its assumptions, it writes great sentences, just what we would want our students to write. Students integrate diagrams with mathematics and with conceptual ideas quite well. With three exceptions, the AI only drew a diagram when it was required to. Of course, the students were near the end of their first semester of physics, in which it had been strongly emphasized that they should always draw a diagram when trying to solve a problem. ChatGPT has not been through a physics course.
These results reflect the overall conclusion. When the students show their work, it is acceptable but sometimes not well organized. Most of the students' responses show that they are on a path to mastering the content but struggle occasionally. In a way they are practising physics, as we would like them to do. On the other hand, the responses from the artificial intelligence are much more sophisticated in their use of physics terms. The artificial intelligence talks elegantly about this problem, not the way students would talk. AI responses reflect the way physics is talked about instead of the way it is done.

Future Work
AI will form a major part of the learning process, so we need to integrate it into physics teaching and learning. One possibility is to use AI to generate answers to questions that the students have already done, and then have the students analyse those answers using a rubric (a sketch of how such responses might be collected appears at the end of this section). We could then conduct research to see what the students learn from this process. The research questions could include:
• Do the students get a better idea of the reasoning processes that are used in problem solving?
• Do they build a more sophisticated physics vocabulary, such as the AI already has?
• Perhaps most importantly, do they build an awareness that they cannot just put all their homework into ChatGPT and expect to get it right?
This work is a small beginning in trying to learn how AI can help us teach and help students learn. Several other similar works related to physics and a variety of other subjects have been completed recently, and further applications of AI in teaching and learning are underway in a variety of disciplines [8]. When we start synthesizing these studies and continue with more in-depth work, we will be able to include AI in our toolbox for the teaching and learning of physics.
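As one concrete possibility, here is a minimal sketch of how repeated AI responses to a problem could be collected for such a rubric exercise. It assumes the openai Python package and an API key in the environment; the study itself used the ChatGPT 3.5 web interface, and the model name, file names, and function below are our own illustrative choices.

from openai import OpenAI

# Full text of the problem (Figure 1) would go here.
PROBLEM = "You are asked to design a Gravitron ..."

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def collect_responses(prompt, n=8, model="gpt-3.5-turbo"):
    """Submit the same prompt n times, each in a fresh conversation."""
    responses = []
    for _ in range(n):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

if __name__ == "__main__":
    for i, text in enumerate(collect_responses(PROBLEM), start=1):
        # One file per response, ready to hand to students with the rubric.
        with open(f"ai_response_{i}.txt", "w", encoding="utf-8") as f:
            f.write(text)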

Acknowledgement
This work was supported in part by the U.S. National Science Foundation.

Figure 2: Two of the three diagrams drawn by ChatGPT. (Colors have been reversed and the images cropped to make them more readable.)



Table 5: Elements of both reasoning processes on all three versions of the problem.

Table 6: Conclusions reached by AI and students for each version of the problem. The modified version does not ask for a conclusion.