Attentive Simple Recurrent Unit Knowledge Tracing Based on Learning Ability

Knowledge tracing can analyze the current knowledge level of students from the data of their previous learning activities. However, existing models usually consider only the features of exercises and ignore individual differences between students, which makes it difficult to accurately predict students' mastery. In this paper, we propose an attentive simple recurrent unit knowledge tracing model based on learning ability (SRU-MAKT). Experimental results show that our model outperforms existing models, improving AUC by an average of 1.6%. We also conduct visualization experiments, which show that SRU-MAKT is interpretable.


Introduction
Knowledge tracing (KT) technology uses sequences of historical student behaviours to assess students' knowledge states and predict their learning performance. It is a hot topic in the field of computer-assisted education.
Early KT models were mainly based on Bayesian theory. Although Bayesian knowledge tracing (BKT) [1] and its variants are highly interpretable, their predictive effectiveness depends mainly on the plausibility of the constructed probability graph. In recent years, more and more KT models have adopted neural network structures [2]. Compared with BKT, deep knowledge tracing (DKT) [3] trains on and predicts from students' historical behavioural data, which greatly improves prediction accuracy but still suffers from uninterpretable parameters. Self-attentive knowledge tracing (SAKT) [4] does not outperform DKT, even though the attention module in SAKT is more flexible than the recurrent neural network (RNN) in DKT. Most mainstream KT models are therefore still based on recurrent neural networks.
The current KT field therefore lacks research on how attention mechanisms and recurrent neural networks act together within a model. In addition, mainstream KT models treat all students as having the same learning ability and cannot trace how a student's ability develops through learning transfer. As a result, most KT models cannot handle the transfer of learning across knowledge concepts.
To address these issues, we propose a new KT method that combines multi-scaled attentive knowledge tracing (MAKT) [5] with simple recurrent units (SRU) [6] and adds students' ability features. The recurrent network iterates continuously over the existing knowledge state while processing attention, reflecting the dynamics of the learning process. In addition, learning ability features, characterized by a student's mastery of the different knowledge concepts, are introduced into the recurrent network layer; this simulates the learning transfer expressed during the learning process and improves the interpretability and practicality of the model. Our contributions in this paper are summarized as follows:
- We combine the multi-scaled attention mechanism with simple recurrent units to propose attentive simple recurrent unit knowledge tracing (SRU-MAKT);
- We address the problem of learning transfer across knowledge concepts in knowledge tracing models through learning ability representations;
- Experiments on four real datasets show that SRU-MAKT improves performance by 1.6% on average compared to the best baseline model.

Proposed Method
The structure of the SRU-MAKT model is shown in Figure 1. The left part of Figure 1 comprises the exercise and interaction embedding modules. Attention values and learning ability features are input to the recurrent unit and iteratively updated.
Figure 1. The overview of the SRU-MAKT model.

Problem setup
The model predicts the probability that a student answers exercise $e_{t+1}$ correctly, based on the student's historical interaction records $I = \{i_1, i_2, \ldots, i_t\}$. Each element of the sequence is $i_t = (e_t, r_t)$, where $e_t$ represents the exercise information and $r_t \in \{0, 1\}$ indicates whether the response to the exercise is correct. The task can therefore be cast as a time-series modeling problem, handled by a sequential model with input $x = \{x_1, x_2, \ldots, x_t\}$.
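For illustration, a common convention in KT implementations (not specified in this paper) is to fold each interaction $i_t = (e_t, r_t)$ into a single index so that it can address one row of an interaction embedding table. A minimal sketch in Python, where `num_exercises` is an assumed dataset constant:

```python
def encode_interaction(e_t: int, r_t: int, num_exercises: int) -> int:
    """Fold an (exercise, response) pair into one interaction id.

    With exercise ids in [0, num_exercises) and r_t in {0, 1}, the id
    e_t + r_t * num_exercises distinguishes correct from incorrect attempts.
    """
    assert 0 <= e_t < num_exercises and r_t in (0, 1)
    return e_t + r_t * num_exercises

# e.g. exercise 7 answered correctly, out of 100 exercises -> id 107
interaction_id = encode_interaction(7, 1, 100)
```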

Embedding and attention layer

Embedding layer.
The feature embedding layer consists of an exercise embedding matrix and an interaction embedding matrix, which embed the input sequences $e = \{e_1, e_2, \ldots, e_{t+1}\}$ and $I = \{i_1, i_2, \ldots, i_t\}$, respectively.
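A hedged sketch of the two embedding tables in PyTorch; the class and attribute names, the vocabulary sizes, and the embedding dimension `d` are our assumptions, since the original matrix dimensions did not survive extraction:

```python
import torch
import torch.nn as nn

class KTEmbedding(nn.Module):
    """Exercise and interaction embedding tables (dimensions assumed)."""

    def __init__(self, num_exercises: int, d: int = 512):
        super().__init__()
        self.exercise_emb = nn.Embedding(num_exercises, d)         # embeds e_1..e_{t+1}
        self.interaction_emb = nn.Embedding(2 * num_exercises, d)  # embeds i_1..i_t

    def forward(self, exercise_ids: torch.Tensor, interaction_ids: torch.Tensor):
        return self.exercise_emb(exercise_ids), self.interaction_emb(interaction_ids)
```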

Multi-scaled attention layer.
The multi-head attention layer uses linear projection matrices $W_h^Q$, $W_h^K$, and $W_h^V$ to project the input vectors into $h$ different subspaces. The student's response to the current exercise is determined by the history of interactions together with the content of the exercise. As in Equations (1), (2), and (3), the $Q_h$ matrix is defined as the features of the exercises containing each knowledge concept, whereas the $K_h$ and $V_h$ matrices are defined as the historical interaction features of the student's response results:

$Q_h = \hat{E} W_h^Q$ (1)
$K_h = \hat{I} W_h^K$ (2)
$V_h = \hat{I} W_h^V$ (3)

where $\hat{E}$ and $\hat{I}$ denote the embedded exercise and interaction sequences. A scaled dot-product softmax calculates the attention weights between the query vector, which contains the relative position encoding, and the key vector, as in Equation (6). The multi-scaled attention mechanism is used to capture the difference between students' long- and short-term memory.
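The following sketch shows one attention head of the kind described above, with queries from the exercise embeddings and keys/values from the interaction embeddings. The relative position encoding and the multi-scale windowing of MAKT are omitted, and the causal mask is our assumption about how future interactions are hidden:

```python
import math
import torch
import torch.nn.functional as F

def attention_head(ex_emb, in_emb, w_q, w_k, w_v):
    """One head: Q from exercises, K and V from past interactions.

    ex_emb: (T, d) exercise embeddings; in_emb: (T, d) interaction embeddings;
    w_q, w_k, w_v: (d, d_k) projection matrices for this head.
    """
    Q, K, V = ex_emb @ w_q, in_emb @ w_k, in_emb @ w_v
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    # mask out future interactions so position t only attends to steps <= t
    future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ V
```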

Students' ability representation
To examine the effect of cross-concept transfer on students' learning ability during long-term learning, the model characterizes students' learning ability features. Learning ability features represent a student's mastery of each knowledge concept, and this information influences the student's responses when encountering new exercises or new knowledge concepts. We therefore take inspiration from dynamic student classification-based knowledge tracing (DKT-DSC) [7] to construct students' learning ability profiles. The student's previous interaction performance is used to estimate the ability profile at the current moment of the learning process, updated at each training batch. Assuming there are $k$ knowledge concepts, the model defines each student's learning ability features as a continuous, interpretable vector, as shown in Equation (7).
$a_t^s = (a_t^1, a_t^2, \ldots, a_t^k)$ (7)

$a_t^k = \frac{1}{N_k} \sum_{j=1}^{t} r_{s,j} \cdot \mathbb{1}(k \in e_j)$ (8)

where element $0 \le a_t^k \le 1$ denotes the proportion of exercises containing the $k$th knowledge concept that student $s$ has answered correctly by moment $t$. A larger value of $a_t^k$ indicates a higher degree of mastery of knowledge concept $k$, and vice versa; $N_k$ denotes the total number of exercises containing knowledge concept $k$; and $r_{s,t} \in \{0, 1\}$ is the response of student $s$ to the exercise at moment $t$.
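A minimal sketch of this running statistic, assuming each exercise is tagged with the knowledge concepts it contains; the function and variable names are ours:

```python
import numpy as np

def ability_vector(history, num_concepts: int) -> np.ndarray:
    """Fraction of exercises containing each concept answered correctly.

    history: iterable of (concept_ids, r) pairs up to moment t, r in {0, 1}.
    Unseen concepts default to 0, matching the initial-state assumption.
    """
    correct = np.zeros(num_concepts)
    attempted = np.zeros(num_concepts)
    for concept_ids, r in history:
        for k in concept_ids:
            attempted[k] += 1
            correct[k] += r
    return np.divide(correct, attempted,
                     out=np.zeros(num_concepts), where=attempted > 0)

# e.g. two exercises on concept 0 (one correct), one on concept 2 (correct)
a_t = ability_vector([([0], 1), ([0], 0), ([2], 1)], num_concepts=4)  # [0.5, 0, 1, 0]
```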

Attentive simple recurrent unit
The key to combining the attention mechanism with recurrent neural networks is to incorporate more expressive nonlinear operations into the recurrent network [8]. Therefore, the attention value $U$ replaces the simple linear transformation of the input information in the original recurrent network to enhance modeling capability. The input $U$ of the recurrent neural network is shown in Equation (9):

$U = \mathrm{LN}(Q + \alpha A) W_o$ (9)

where $\alpha$ denotes the learning ability coefficient of the student; $W_o$ is a parameter matrix; $A$ is the output of the multi-scaled attention layer; and the $Q$ matrix represents the exercise feature information containing each knowledge concept.
Q + A is a residual connection that improves the gradient propagation and stabilizes the training.When  = 0, the student is in the initial state and the input information is the original linear transformation.As the student's knowledge grows, their learning ability increases, and the attention mechanism can learn the long-term dependence of the model.Post-layer normalization [9] is added after the attention operation and before multiplication with the W o matrix.According to the observation of Liu [10], better results can be obtained using post-layer normalization.
The SRU consists of a lightweight recurrent component that computes the hidden state $c_t$ sequentially by reading the input vector $U_t$ at each step $t$. The computation is similar to that of a gated RNN. Once the internal state $c_t$ is generated, a highway network introduces skip connections, and the final hidden state output $h_t$ is computed directly from the current input information, which keeps gradient propagation well conditioned. Combining the attention mechanism with the simple recurrent network therefore provides more efficient parallelism than other recurrent neural networks, as shown in the following equations.

$f_t = \sigma(W_f U_t + v_f \odot c_{t-1} + b_f)$ (10)
$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot (W U_t)$ (11)
$r_t = \sigma(W_r U_t + v_r \odot c_{t-1} + b_r)$ (12)
$h_t = r_t \odot c_t + (1 - r_t) \odot U_t$ (13)

where $\odot$ is element-wise multiplication, and $v_f$, $v_r$, $b_f$, and $b_r$ are the parameter vectors to be learned during training.
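A step-by-step sketch of the recurrence in Equations (10)-(13); in practice SRU precomputes the matrix products $W U_t$, $W_f U_t$, and $W_r U_t$ for all steps at once, which is where its parallelism comes from. Tensor shapes are our assumptions:

```python
import torch

def sru_forward(U, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """U: (T, d) attentive inputs; all W*: (d, d); v_*, b_*: (d,)."""
    c = torch.zeros(U.size(1))
    hidden = []
    for x in U:                                      # step t
        f = torch.sigmoid(W_f @ x + v_f * c + b_f)   # forget gate, Eq. (10)
        r = torch.sigmoid(W_r @ x + v_r * c + b_r)   # reset gate,  Eq. (12)
        c = f * c + (1 - f) * (W @ x)                # cell state,  Eq. (11)
        h = r * c + (1 - r) * x                      # highway out, Eq. (13)
        hidden.append(h)
    return torch.stack(hidden)
```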

Students' response prediction
The task of KT is to predict the student's response $r_{t+1}$ at time $t+1$, when the student encounters exercise $e_{t+1}$, based on the student's historical interaction sequence. The hidden state $h_t$ obtained above already contains all of the information before $t$. Therefore, the new hidden state $h_{t+1}$ is determined by the current exercise information $e_{t+1}$ together with the previous hidden state $h_t$, and the student's performance on exercise $e_{t+1}$ at $t+1$ is obtained from $h_{t+1}$.
Here $p_{t+1}$ represents the probability that the student answers the question correctly at $t+1$; $w_1$ and $w_2$ are weight matrices learned during training; and $b$ denotes the bias vector. The parameters of the model are then trained with the cross-entropy loss function, as shown in Equation (17):

$\mathcal{L} = -\sum_i \left( r_i \log p_i + (1 - r_i) \log (1 - p_i) \right)$ (17)

where $p_i$ and $r_i$ are the predicted probability and the true label, respectively.
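A hedged sketch of the readout and the loss of Equation (17); how $h_{t+1}$ and $e_{t+1}$ are combined (concatenation here) and the ReLU nonlinearity are our assumptions, since the prediction equations did not survive extraction:

```python
import torch
import torch.nn.functional as F

def predict(h_next, e_next, w1, w2, b):
    """p_{t+1} from the new hidden state and the exercise embedding.

    h_next, e_next: (d,); w1: (m, 2d); w2: (m,); b: (m,).
    """
    z = torch.relu(w1 @ torch.cat([h_next, e_next]) + b)
    return torch.sigmoid(torch.dot(w2, z))

def kt_loss(p: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy of Equation (17) over all predictions."""
    return F.binary_cross_entropy(p, r)
```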

Datasets
We use four datasets [11] to evaluate all models: ASSIST-2009, ASSIST-2015, and ASSIST-Chall from the ASSISTments online tutoring platform, and STATICS-2011, drawn from student and course data of Carnegie Mellon University's Fall 2011 semester. Table 1 lists the number of students, exercise labels, and interactions in the four datasets. Over the last decade, ASSIST-2009 has been the benchmark for KT research methods and is the most commonly used dataset. Since the old version contained duplicate records, a new version, "skill-builder", was released, which fixed the data modeling problem and removed the duplicate records. Compared with ASSIST-2009, ASSIST-2015 contains neither metadata nor knowledge concepts, and its average number of responses per question is much higher. ASSIST-Chall is the richest of the four datasets.

Baseline and evaluation metric
To verify the validity of our model, we compare it with five baseline KT methods for student performance prediction: DKT, DKT-DSC, SAKT, context-aware attentive knowledge tracing (AKT) [12], and MAKT. Our model is an improvement on MAKT. Table 2 summarizes the features of each model in the experiment. AUC and ACC are used as the metrics to evaluate all KT methods.

Experimental environment
This experiment performs standard K-fold cross-validation with K = 5. For each fold, 20% of the sequence data is used as the test set, 20% as the validation set, and 60% as the training set.
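A sketch of this split using scikit-learn; `sequences` is a placeholder for the list of per-student interaction sequences, and splitting the non-test 80% as 75/25 yields the 60/20/20 proportions above:

```python
from sklearn.model_selection import KFold

sequences = list(range(100))  # placeholder: ids of 100 student sequences

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_val_idx, test_idx) in enumerate(kf.split(sequences)):
    cut = int(len(train_val_idx) * 0.75)  # 75% of the remaining 80% = 60% overall
    train_idx, val_idx = train_val_idx[:cut], train_val_idx[cut:]
    # train on train_idx, tune on val_idx, report on test_idx
```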
Our experiments use the Adam optimizer with a learning rate of 1e-5, a batch size of 24, a dropout rate of 0.05, 100 training epochs, 8 attention heads, and an output dimension of 512 for the fully connected layer. For the other baseline models, the hyperparameters follow the optimal settings reported in their respective papers.

Student performance prediction
Table 3 shows the experimental results of SRU-MAKT and the other KT baseline methods on the four datasets. SRU-MAKT significantly outperforms all baseline models on all four datasets. With the learning ability feature, SRU-MAKT performs better than the standard DKT model. Unlike the other attention-based models, MAKT and SRU-MAKT use the multi-scaled attention mechanism to capture sequence features. Compared with AKT, which relies exclusively on attention mechanisms, SRU-MAKT improves AUC by an average of 7.2% on the four datasets. Building on multi-scaled attention, SRU-MAKT combines the advantages of DKT-DSC by adding learning ability features and simple recurrent units. As the improved version of MAKT, SRU-MAKT improves AUC by 1.6% on average over the four datasets compared to MAKT.

Student's ability visualization
Figure 2 shows the interpretability provided by the learning ability features in SRU-MAKT on the ASSIST-2017 dataset. We randomly select student A at time $t$ and intercept his performance on the 16 exercises encountered after $t$. We then visualize the student's mastery of the 32 knowledge concepts learned so far; the darker the color, the higher the mastery of the concept. We also present the student's learning ability as a radar chart.
As shown in Figure 2, student A has a good mastery of many concepts but is weak in a few, such as $k_{10}$ and $k_{13}$. Presumably, these concepts are relatively new to student A, who is not yet fully proficient in them. For example, at $t+4$, student A encountered the poorly practiced concept $k_3$ and answered two of three consecutive questions correctly, while accuracy on the new concepts $k_{33}$ and $k_{34}$ reached 80%. This suggests that student A is a well-rounded student who can quickly acquire unlearned knowledge in future learning.
These results show that the learning ability features in SRU-MAKT can be used to estimate current learning ability from students' past records and applied to future learning processes. This information can provide feedback that helps teachers tailor their teaching to students' needs.

Conclusions
In this paper, we propose an attentive simple recurrent unit KT model based on students' learning ability features, combining simple recurrent units and a multi-scaled attention mechanism into a novel network structure. The structure introduces information representing students' learning ability features, enabling the model to capture students' mastery of different knowledge concepts. It addresses the problem of learning transfer across knowledge concepts in KT models and improves the generalization ability and learning effectiveness of the network. In addition, the AUC of the proposed SRU-MAKT model improves by 1.6% on average over the four datasets compared with the MAKT model.

Table 1. Datasets.

Table 2. Model feature comparison.

Table 3. Student performance prediction comparison.