Prediction of students’ academic performance using ANN with mini-batch gradient descent and Levenberg-Marquardt optimization algorithms

Online learning indirectly increases stress, reducing social interaction among students and leading to physical and mental fatigue, which in turn reduces students' academic performance. Early prediction of academic performance is therefore required to identify at-risk students with declining performance. In this paper, we use artificial neural networks (ANNs) to predict this performance. ANNs with two optimization algorithms, mini-batch gradient descent and Levenberg-Marquardt, are implemented on students' learning activity data in course X, recorded in the LMS of Universitas Indonesia. The data cover 232 students and two periods: the first month and the first two months of study. Before the ANNs are trained, the data are normalized and balanced with ADASYN. The results of the two optimization algorithms, over 10 trials each, are compared on average accuracy, sensitivity, and specificity, and the better period for correctly predicting unsuccessful students is determined. The results show that both algorithms give better predictions over two months than over one. The ANN with mini-batch gradient descent achieves an average sensitivity of 78%, compared with 75% for the ANN with Levenberg-Marquardt. Therefore, an ANN with mini-batch gradient descent as its optimization algorithm is more suitable for predicting students who are at risk of failing.


Introduction
The World Health Organization (WHO) declared Coronavirus Disease 2019 (COVID-19) a global pandemic on March 11, 2020. Since the first case, positive COVID-19 cases have risen around the world, including in Indonesia. The Indonesian government shifted learning activity from face-to-face to online learning to control the rising number of cases. Online learning indirectly increases stress, thereby reducing social interaction among students and leading to physical and mental fatigue, which in turn reduces students' academic performance [1]. Therefore, early prediction of students' academic performance is required to identify at-risk students with declining performance [2].
An artificial neural network (ANN) is formed from a collection of artificial neurons, analogous to biological neurons in the human brain, that process information and improve performance by learning from data [3]. An ANN can identify complex, nonlinear, and unknown correlations between input and output data [4]. ANNs have proven successful in classification, forecasting, and prediction in various areas [5], such as the stock price market [6]. Furthermore, ANNs can also be used for academic performance prediction [3] [7] [8].
In general, an ANN consists of layers of neurons, where neurons in adjacent layers are connected by connections whose weights are adjusted during the training process [3]. The weights are updated with an optimization algorithm to minimize the error. There are various types of optimization algorithm; in this paper, two are used: mini-batch gradient descent and Levenberg-Marquardt. The two algorithms differ in their weight-update processes: mini-batch gradient descent uses gradients [9], while Levenberg-Marquardt uses the Jacobian matrix to update the weights [10].
This paper aims to identify students at risk of academic performance decline during online learning by classifying whether students passed or failed a course, using students' learning activity data collected in a Learning Management System (LMS). For early prediction, ANNs with both optimization algorithms are applied to data collected before the online study period ends: the first month and the first two months of study. The prediction in this paper is also expected to serve as an evaluation tool for upcoming online learning using the LMS.

Data
The data cover a total of 232 students of the Department of Mathematics, Universitas Indonesia, who enrolled in course X for four months. For early prediction, the data used cover the first month and the first two months of study, before the midterm exam. In this paper, students' learning activity data in course X, recorded in the LMS of Universitas Indonesia, are used. The chosen features are those that affect students' academic performance according to related research [2]. These features are divided into five groups: class access (Course viewed), quiz attempts (Quiz attempt started and Quiz attempt submitted), task uploads (A submission has been submitted), study material access (File viewed and URL viewed), and discussion forum activity (Discussion viewed and Some content has been posted).
Furthermore, the end-of-term grade of each student in course X is taken as an indicator of academic performance and converted into a categorical form: "Failed" if the score is less than 55 and "Passed" if the score is equal to or greater than 55. In this paper, the category "Failed" is denoted as 1, while "Passed" is denoted as 0. After conversion, 219 students passed course X while 13 students failed.
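As an illustration, the conversion above can be sketched as follows; the grade values are hypothetical:

```python
# Hedged sketch: converting end-of-term scores to binary labels.
# The threshold of 55 comes from the text; the grades are made up.
def to_label(score, threshold=55):
    """Return 1 ("Failed") if score < threshold, else 0 ("Passed")."""
    return 1 if score < threshold else 0

grades = [72.5, 40.0, 55.0, 88.0, 54.9]
labels = [to_label(g) for g in grades]
print(labels)  # [0, 1, 0, 0, 1]
```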

Method
An artificial neural network (ANN) is formed from a collection of artificial neurons, analogous to biological neurons in the human brain, that process information and improve performance by learning from data [3]. An ANN can recognize complex, nonlinear, and unknown correlations between input and output data [4]. ANNs come in a variety of architectures, notably the single-layer perceptron and the multilayer perceptron. The multilayer perceptron (MLP) is the architecture most commonly used for information processing. An MLP consists of an input layer that receives the input data, one or more hidden layers that process the data, and an output layer whose neurons produce the learning outcomes. Neurons in adjacent layers are connected by connections whose weights are adjusted during training. An ANN involves training and testing processes, and there are two main phases in the training process: the forward phase and the backward phase.
In the forward phase, the input vector is propagated from the input layer to the output layer to obtain the expected output. For each layer $l = 2, 3, \ldots, L$, the forward phase is computed for the $p$th example using the following equations:

$n_{j,p}^{(l)} = \sum_{i=1}^{S^{(l-1)}} w_{i,j}^{(l)} a_{i,p}^{(l-1)} + b_j^{(l)}$ (1)

$a_{j,p}^{(l)} = f^{(l)}\big(n_{j,p}^{(l)}\big)$ (2)

where $n_{j,p}^{(l)}$ is the argument of the activation function $f^{(l)}$, $w_{i,j}^{(l)}$ are the weights from the neurons of the $(l-1)$th layer to the $j$th neuron of the $l$th layer, $b_j^{(l)}$ denotes the bias of the $j$th neuron of the $l$th layer, and $a_{j,p}^{(l)}$ is the output of the $j$th neuron of the $l$th layer for the $p$th example. In the first layer, the $p$th input example $x_p$ can be written as $a_p^{(1)}$.
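The forward phase of equations (1) and (2) can be sketched as follows; the layer sizes, sigmoid activation, and random weights are illustrative assumptions, not the hyperparameters used in this paper:

```python
import numpy as np

# Illustrative forward pass for an MLP, following equations (1)-(2):
# n = W a + b per layer, then a = f(n). Shapes and values are assumptions.
def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def forward(x, weights, biases):
    """Propagate input x through each layer's weights and biases."""
    a = x
    for W, b in zip(weights, biases):
        n = W @ a + b          # equation (1): weighted sum plus bias
        a = sigmoid(n)         # equation (2): activation output
    return a                   # expected output of the final layer

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(1, 4))]  # 8-4-1 net
biases = [np.zeros(4), np.zeros(1)]
y_hat = forward(rng.normal(size=8), weights, biases)
print(y_hat.shape)  # (1,)
```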
After the forward phase is done for all input examples, the expected output $\hat{y}_p = a_{1,p}^{(L)}$ is obtained, and the output error can be calculated using the loss function. The output error is used in the backward phase to update the weights repeatedly until the stopping condition is met. Weight updating is performed with an optimization algorithm. In this paper, two optimization algorithms are used to update the weights: mini-batch gradient descent and Levenberg-Marquardt.

Mini-batch Gradient Descent Algorithm
The gradient descent algorithm is an optimization algorithm commonly used to minimize the loss value of a model [9]. It updates the weights using the gradient of the error with respect to the weights. There are several variants of gradient descent; one of them is mini-batch gradient descent, which updates the weights using small subsets (batches) of the training data. Mini-batch gradient descent reduces the variance of the weight updates, which can lead to more stable convergence, and performs frequent updates that can result in faster learning [11]. In this paper, the batch size used is 32 ($b = 32$), so each batch contains 32 examples.
An update for every batch of training examples is calculated using the following equation:

$w^{(l)\prime} = w^{(l)} - \eta \, \nabla_{w^{(l)}} E(y, \hat{y})$ (3)

where $w^{(l)}$ and $w^{(l)\prime}$ are the weights before and after updating, respectively, $\eta$ is the learning rate, and $\nabla_{w^{(l)}} E(y, \hat{y})$ is the gradient of the error with respect to the weights. Here, the loss function $E(y, \hat{y})$ is defined as

$E(y, \hat{y}) = \frac{1}{2} \sum_{p=1}^{b} (y_p - \hat{y}_p)^2$ (4)

where $y_p$ is the $p$th target output and $\hat{y}_p$ is the $p$th expected output. An epoch in an ANN consists of several iterations used to update the weights; in this algorithm, an epoch consists of as many iterations as there are batches. The mini-batch gradient descent algorithm for one epoch is as follows [12]:

1. Initialize all weights with small random values.

2. Input a batch of training examples.
3. For each training example in the batch, perform the following steps:
   a. For each layer $l = 2, \ldots, L$, compute the forward phase using equations (1) and (2). In the $L$th layer, we obtain the expected output $\hat{y}_p = a_{1,p}^{(L)}$.
   b. Calculate the error of the expected output using the loss function in equation (4).
4. For each layer $l = L, L-1, \ldots, 2$, update the weights according to equation (3).
5. Repeat steps 2 to 4 until all batches have been used to update the weights.

The algorithm is repeated until the desired error or a certain number of epochs is reached.
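The steps above can be sketched in simplified form for a single sigmoid neuron (a full MLP adds a per-layer backward pass); the toy data, learning rate, and epoch count are assumptions for illustration only:

```python
import numpy as np

# Minimal sketch of mini-batch gradient descent (steps 1-5 above) on a
# single sigmoid neuron with the half sum-of-squares loss. The data,
# learning rate, and epoch count are illustrative assumptions.
def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(42)
X = rng.normal(size=(160, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)  # toy target

w, b = rng.normal(size=3) * 0.01, 0.0     # step 1: small random weights
eta, batch_size = 0.5, 32

for epoch in range(150):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):   # step 2: next batch
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        y_hat = sigmoid(Xb @ w + b)              # step 3a: forward phase
        err = y_hat - yb                         # step 3b: output error
        grad_w = Xb.T @ (err * y_hat * (1 - y_hat)) / len(idx)
        grad_b = np.mean(err * y_hat * (1 - y_hat))
        w -= eta * grad_w                        # step 4: weight update
        b -= eta * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```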

Levenberg-Marquardt Algorithm
Levenberg-Marquardt is an optimization method developed by Kenneth Levenberg and Donald Marquardt and applied in the neural network field [10]. The Levenberg-Marquardt method combines the gradient descent and Gauss-Newton approaches. By combining those methods, Levenberg-Marquardt becomes more effective, with faster convergence and better stability than other methods [13]. The following equation is used to update the weights with the Levenberg-Marquardt rule:

$w' = w - \big(J^T J + \mu I\big)^{-1} J^T e$ (5)

where $w$ and $w'$ are the weights before and after updating, respectively, $J$ is the Jacobian matrix, $I$ is an identity matrix, and $\mu$ is a damping factor. The damping factor is updated in each iteration by multiplying or dividing it by a power of 10 or some other factor [10]. The vector $e$ is an error vector consisting of the error of each training example. With one output neuron, the Jacobian matrix for $P$ examples in equation (5) can be written as follows:

$J = \begin{bmatrix} \dfrac{\partial e_1}{\partial w_{1,1}^{(2)}} & \cdots & \dfrac{\partial e_1}{\partial w_{i,j}^{(l)}} & \cdots \\ \vdots & & \vdots & \\ \dfrac{\partial e_P}{\partial w_{1,1}^{(2)}} & \cdots & \dfrac{\partial e_P}{\partial w_{i,j}^{(l)}} & \cdots \end{bmatrix}$ (6)

where $e_p$ is the error of the $p$th example with $p = 1, 2, \ldots, P$, and $w_{i,j}^{(l)}$ is the weight from the $i$th neuron of the $(l-1)$th layer to the $j$th neuron of the $l$th layer, with $i = 1, 2, \ldots, S^{(l-1)}$, $j = 1, 2, \ldots, S^{(l)}$, and $l = 2, 3, \ldots, L$. Each element of equation (6) is calculated using the following equation [10]:

$\dfrac{\partial e_p}{\partial w_{i,j}^{(l)}} = a_{i,p}^{(l-1)} \, \delta_{j,p}^{(l)}$ (7)

where $a_{i,p}^{(l-1)}$ is the input from the $i$th neuron for the $p$th example and $\delta_{j,p}^{(l)}$ is the value of the $j$th neuron of the $l$th layer for the $p$th example. The element $\delta_{j,p}^{(l)}$ can be calculated in the backward phase with the following equations:

$\delta_{i,p}^{(l-1)} = s_{i,p}^{(l-1)} \sum_{j=1}^{S^{(l)}} w_{i,j}^{(l)} \, \delta_{j,p}^{(l)}$ (8)

$\delta_{j,p}^{(l)} = s_{j,p}^{(l)}$ (9)

Equation (8) is used when layer $l$ is backpropagated to layer $l-1$, while equation (9) is used to backpropagate within the same layer $l$. The slope $s_{j,p}^{(l)}$ in equation (9) can be obtained in the forward phase with the following equation:

$s_{j,p}^{(l)} = \dfrac{\partial a_{j,p}^{(l)}}{\partial n_{j,p}^{(l)}} = f^{(l)\prime}\big(n_{j,p}^{(l)}\big)$ (10)

where $n_{j,p}^{(l)}$ and $a_{j,p}^{(l)}$ can be obtained from equations (1) and (2), respectively.
The loss function used to calculate the error for the Levenberg-Marquardt algorithm is the mean squared error (MSE), defined as

$E = \dfrac{1}{P} \sum_{p=1}^{P} (y_p - \hat{y}_p)^2$ (11)

where $y_p$ is the $p$th target output, $\hat{y}_p$ is the $p$th expected output, and $P$ is the number of training examples. In this algorithm, an epoch consists of one iteration. The Levenberg-Marquardt algorithm is as follows [10]:

1. For the $k$th iteration:
   a. Input the training data and the initial weights $w(k)$.
   b. For each layer $l = 2, \ldots, L$, compute the forward phase using equations (1) and (2), and calculate the slope using equation (10). In the $L$th layer, we obtain the expected output $\hat{y}_p = a_{1,p}^{(L)}$.
   c. Calculate the error in the $k$th iteration, denoted $E(k)$, using equation (11).
   d. Update the weights $w(k)$ using equation (5) to obtain the weights $w(k+1)$.
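A minimal sketch of the Levenberg-Marquardt update in equation (5), applied to a toy curve-fitting problem rather than an ANN so that the Jacobian can be written explicitly; the model, data, and damping schedule are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of the Levenberg-Marquardt step, equation (5):
# w' = w - (J^T J + mu I)^{-1} J^T e, with the damping factor mu
# multiplied or divided by 10 as the error rises or falls. We fit the
# toy model y = w0 * exp(w1 * x), whose Jacobian is easy to derive.
def residuals(w, x, y):
    return y - w[0] * np.exp(w[1] * x)

def jacobian(w, x):
    # Columns: de/dw0 and de/dw1 for each example, with e = y - model.
    return np.column_stack([-np.exp(w[1] * x),
                            -w[0] * x * np.exp(w[1] * x)])

x = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * x)          # noiseless data from w = (2.0, 1.5)
w, mu = np.array([1.0, 1.0]), 0.01

for _ in range(50):
    e = residuals(w, x, y)
    J = jacobian(w, x)
    step = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ e)
    w_new = w - step                               # equation (5)
    if np.sum(residuals(w_new, x, y) ** 2) < np.sum(e ** 2):
        w, mu = w_new, mu / 10     # error decreased: accept, relax damping
    else:
        mu *= 10                   # error increased: reject, raise damping

print(np.round(w, 3))  # converges toward [2.0, 1.5] on this toy problem
```

With a large damping factor the step behaves like gradient descent; as the damping factor shrinks the step approaches the Gauss-Newton step, which is the combination the text describes.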

Proposed Methods
Before the ANNs are implemented, the data must be pre-processed. In this paper, two pre-processing procedures are conducted: data normalization and adaptive synthetic sampling. Before normalization, the data are divided into training and testing data in a 70:30 proportion: 162 examples are used as training data, while 70 are used as testing data.
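The 70:30 split can be sketched as follows; the placeholder feature matrix and label vector are assumptions standing in for the actual LMS data:

```python
import numpy as np

# Sketch of the 70:30 split described above (232 -> 162 train / 70 test).
# The feature matrix and labels are random placeholders for the LMS data;
# the count of 8 activity features is an assumption for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(232, 8))
y = rng.integers(0, 2, size=232)

idx = rng.permutation(232)                 # shuffle before splitting
train_idx, test_idx = idx[:162], idx[162:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(len(X_train), len(X_test))  # 162 70
```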
The first pre-processing procedure is data normalization, which transforms the range of values of each feature, for both training and testing data, so that all features share the same range; the resulting model will thus be more accurate [14]. The min-max normalization technique is used in this paper's pre-processing, rescaling each input feature to the range 0 to 1 with the following equation:

$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$ (12)

where $x$ and $x'$ are the value of a feature before and after normalization, respectively, and $\max(x)$ and $\min(x)$ are the maximum and minimum feature values [15].

In classification, classes with less data, known as minority classes, cause imbalanced data. Classification may then be biased toward the majority class, or data belonging to the minority class may be misclassified as the majority class [16]. One approach to handling imbalanced data is Adaptive Synthetic Sampling (ADASYN). ADASYN balances the data by generating synthetic minority-class examples based on a density distribution that reflects each minority example's level of difficulty in learning: minority examples that are harder to learn are used to generate more synthetic data than easier ones. This basis for selecting the data used to form synthetic examples reduces the bias produced when the model classifies the data [17].
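A minimal sketch of min-max normalization; fitting the minimum and maximum on the training data and reusing them on the test data is a common convention assumed here, not a detail stated above:

```python
import numpy as np

# Min-max normalization: x' = (x - min) / (max - min), per feature.
# The statistics are taken from the training data and reused on the
# test data (an assumption of good practice, not stated in the text).
def fit_min_max(X_train):
    return X_train.min(axis=0), X_train.max(axis=0)

def transform(X, x_min, x_max):
    return (X - x_min) / (x_max - x_min)

X_train = np.array([[1.0, 10.0],
                    [3.0, 30.0],
                    [2.0, 20.0]])
x_min, x_max = fit_min_max(X_train)
out = transform(X_train, x_min, x_max)
print(out)  # rows become [0, 0], [1, 1], [0.5, 0.5]
```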
The ADASYN algorithm for making the minority and majority classes the same size is as follows [17]:

1. Calculate the degree of class imbalance

$d = \dfrac{m_s}{m_l}$ (13)

where $m_s$ and $m_l$ are the numbers of minority and majority examples, respectively. If $d < 1$, ADASYN begins generating synthetic data.
2. Calculate the number of synthetic examples to generate:

$G = (m_l - m_s) \times \beta$ (14)

where $\beta \in [0, 1]$ specifies the desired balance level ($\beta = 1$ yields fully balanced classes).
3. For each example $x_i$ in the minority class, find its $k$ nearest neighbors and calculate the ratio

$r_i = \dfrac{\Delta_i}{k}$ (15)

where $\Delta_i$ is the number of majority examples among the $k$ nearest neighbors of $x_i$.
4. Normalize the ratios: $\hat{r}_i = r_i / \sum_i r_i$.
5. Calculate the number of synthetic examples to generate for each $x_i$: $g_i = \hat{r}_i \times G$.
6. Form the synthetic data using the following equation:

$s_i = x_i + (x_{zi} - x_i) \times \lambda$ (16)

where $s_i$ denotes a synthetic example generated from $x_i$, $\lambda$ is a randomized constant in the range $[0, 1]$, and $x_{zi}$ is a minority example chosen randomly from the $k$ nearest neighbors of $x_i$.
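The ADASYN steps above can be sketched as follows (in practice a library implementation such as imbalanced-learn's ADASYN class would be used); the toy classes are assumptions, and drawing the interpolation partner from all minority examples rather than only the k-NN minority set is a simplification of step 6:

```python
import numpy as np

def adasyn(X_min, X_maj, k=3, seed=0):
    """Sketch of ADASYN steps 1-6 above, numpy only.
    Simplification (an assumption): the interpolation partner is drawn
    from all minority examples, not only the k-NN minority subset."""
    rng = np.random.default_rng(seed)
    G = len(X_maj) - len(X_min)               # step 2 with beta = 1
    X_all = np.vstack([X_min, X_maj])         # minority examples first
    r = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]           # k nearest neighbours (skip self)
        r[i] = np.mean(nn >= len(X_min))      # step 3: majority share
    g = np.rint(r / r.sum() * G).astype(int)  # steps 4-5: per-example counts
    synth = []
    for i, x in enumerate(X_min):
        for _ in range(g[i]):
            z = X_min[rng.integers(len(X_min))]      # random minority partner
            synth.append(x + (z - x) * rng.random()) # step 6: interpolate
    return np.array(synth)

rng = np.random.default_rng(1)
X_maj = rng.normal(size=(20, 2))   # toy majority class
X_min = rng.normal(size=(4, 2))    # toy minority class
synth = adasyn(X_min, X_maj)
print(synth.shape)
```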
After pre-processing the data, two ANN models, one with mini-batch gradient descent and one with Levenberg-Marquardt, are implemented on the data for the one-month and two-month periods. In this paper, there are 10 experiments in total for each model. For each experiment, the performance of each ANN model over each period is evaluated using accuracy, sensitivity, and specificity, calculated as follows:

accuracy $= \dfrac{TP + TN}{TP + TN + FP + FN}$

sensitivity $= \dfrac{TP}{TP + FN}$

specificity $= \dfrac{TN}{TN + FP}$

where TP, TN, FP, and FN are as shown in table 1. Accuracy measures a model's performance in classifying all data correctly. Sensitivity measures performance in classifying the positive class correctly, while specificity measures performance in classifying the negative class correctly. Consistent with the labeling above, failing students are assigned to the positive class and passing students to the negative class. Based on the evaluation results, the period over which each model performs best is chosen as the best period for prediction.
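The three metrics can be computed from the confusion matrix as sketched below; the true and predicted labels are made-up values, with 1 denoting "Failed" as in the Data section:

```python
# Sketch of the evaluation metrics, using 1 = "Failed" as the positive
# class (the labeling from the Data section). The labels are made up.
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # share of failing students caught
    specificity = tn / (tn + fp)   # share of passing students recognized
    return accuracy, sensitivity, specificity

y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
acc, sens, spec = metrics(y_true, y_pred)
print(acc, sens, spec)  # 0.75, 0.666..., 0.8
```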

Experimental Results
The ANN models were implemented on data divided into two periods: the first month and the first two months of study.
Before implementation, each model undergoes hyperparameter tuning, which is necessary to find the best combination of hyperparameters and to prevent over- and under-fitting. The hyperparameters of an ANN include the number of hidden layers, the number of neurons per hidden layer, the activation function, the number of epochs, etc. Here, 2 hidden layers with the same number of neurons and 150 epochs are used for both ANNs: one with mini-batch gradient descent and one with Levenberg-Marquardt.
The details of the hyperparameters used in the ANN with the mini-batch gradient descent and Levenberg-Marquardt algorithms to predict students' academic performance in the first and second periods can be seen in table 2. The implementation comprises 10 experiments on the data in each period for each model. Based on these experiments, the performance of the ANN with each optimization method on each period, specifically average accuracy, sensitivity, and specificity, can be determined. Figure 1 shows the results of students' academic performance prediction using the ANN with both optimization algorithms. As figure 1 shows, the performance of the ANN model with both optimization algorithms generally improves on each metric from the first period to the second. This indicates that the additional student learning activity data recorded in EMAS improves the performance of the model. In other words, the ANN with both algorithms performs better in the second period of prediction, i.e., using data from the first two months of study. This is supported by related research in which predicting the percentage of students passing using data collected from 50% of the study time yielded superior performance compared to using data from 25% of the study time [18].
In the first period, the ANN-based prediction of academic performance using mini-batch gradient descent has an average accuracy, sensitivity, and specificity of 79%, 53%, and 81%, respectively. In the second period, the averages are 86%, 78%, and 87%, respectively, all of which are improvements on the first-period percentages. The same improvement from the first to the second period also occurs when the ANN is implemented with the Levenberg-Marquardt algorithm: in figure 1(b), the model's average accuracy, sensitivity, and specificity in the second period are 87%, 75%, and 86%, respectively, all superior to the first-period percentages. From these results, the ANN with either algorithm shows better performance in the second period. Further analysis of the second-period results is shown in table 3 below. Based on table 3, the average accuracy of the ANN model with mini-batch gradient descent is 86%; it therefore yields a correct prediction on 86% of the data. For comparison, the ANN model with the Levenberg-Marquardt algorithm has an average accuracy of 87%, which makes it slightly more accurate overall than the ANN model with mini-batch gradient descent.
Thereafter, the performance of the ANN models with the two optimization algorithms is compared using average sensitivity and specificity. Based on table 3, the average sensitivity of the ANN model with mini-batch gradient descent is 78%, meaning it correctly predicts 78% of failing students as failing. Meanwhile, the ANN model with Levenberg-Marquardt has an average sensitivity of 75%, which is slightly inferior to that of the model with mini-batch gradient descent.
In addition to sensitivity, the ANN model with mini-batch gradient descent has an average specificity of 87%, indicating that it correctly predicts 87% of passing students as passing. Meanwhile, the ANN model with Levenberg-Marquardt has an average specificity of 86%, which is slightly inferior to that obtained with mini-batch gradient descent.
In table 3, the average results of students' academic performance prediction using the ANN with both optimization algorithms are more than sufficient. However, for predicting students' potential to fail, sensitivity is the main consideration when evaluating the two optimization methods and choosing the best model. From the results, mini-batch gradient descent yields a slightly better sensitivity than Levenberg-Marquardt. Thus, mini-batch gradient descent is the superior optimization algorithm for an ANN predicting students' academic performance.

Conclusion
Early prediction of students' academic performance is required to identify at-risk students with declining performance. To this end, ANNs with mini-batch gradient descent and Levenberg-Marquardt were used to predict student performance. The implementation results show that both algorithms give better predictions over two months than over one month; therefore, predictions of academic performance should be made using data from two months of study. To predict students' potential to fail, sensitivity should be the main consideration in determining the best optimization algorithm. The results show that the sensitivity of the ANN with mini-batch gradient descent is slightly higher than that of the ANN with Levenberg-Marquardt. Therefore, an ANN using mini-batch gradient descent as its optimization algorithm is the superior option for making such predictions.