Analysis K-Means Clustering to Predicting Student Graduation

Predicting students' graduation outcomes is an important concern for higher education institutions because it supports planning for the strategic programs that can improve students' academic performance. Data mining techniques can cluster student academic performance data to predict student graduation. The aim of this study is to analyze the performance of a data mining technique, the K-Means clustering algorithm, in predicting students' graduation. Data pre-processing consists of data cleaning and data reduction using Principal Component Analysis (PCA) to determine the variables that affect graduation time. The algorithm processes a dataset of student academic performance covering 241 students with 16 variables. Based on clustering with K-Means, the highest accuracy rate is 78.42%, in the 3-cluster model, and the lowest accuracy rate is 16.60%, in the 4-cluster model. Based on the loading factor values, the influential variable in predicting student graduation is the total GPA of the 1st to 6th semesters.


Introduction
Education is the most important component of human life. It can take the form of theory, practice, or even moral instruction. Government Regulation of RI No. 12/2012 concerning Higher Education, Chapter 1, Article 1, paragraph 1, explains that "Education is a mindful planned effort to actualize the learning atmosphere and learning process to get up their religious-spiritual potential, self-control, personality strength, intelligence and character, and skills" [1], [2]. Predicting students' graduation outcomes is an important concern for higher education institutions because it supports planning for the strategic programs that can improve student academic performance. It can also affect an institution's reputation, since it reflects the quality of its graduates [3]-[6]. Most previous studies use data mining or Multi-Attribute Decision Making (MADM) techniques to predict students' completion. These techniques include the C4.5 Decision Tree [7], Naïve Bayes [6], MADM [8]-[10], and the Support Vector Machine [3], [4].
In previous research, student learning outcomes were predicted using the C4.5 and Naïve Bayes methods. That study compared the performance of the two methods in classifying students' graduation times into 3 classes. The classification model formed shows that these two methods have an accuracy rate above while the Naïve Bayes classifier's precision rate is 60% and that of the C4.5 decision tree is 58.82% [6]. The present study addresses the problem of classifying student graduation times using data mining techniques. It is also necessary to analyze the characteristics of the data to see how the student graduation data are distributed into clusters. The number of clusters formed can then be used for predicting student graduation. Data clustering aims to group similar data records together. Clustering is often confused with classification, but the two have different goals. In simple terms, the intra-cluster distances need to be minimized for better clustering results, as in the K-means algorithm [11]. K-Means is a non-hierarchical clustering method that partitions a data set into a number of clusters, so that data with the same characteristics are collected into the same cluster and data with different characteristics are collected into other clusters. K-Means is a notable cluster analysis algorithm in data mining. Experiments with K-means clustering have shown that the cluster result varies with the initial cluster center points [12], [13]. The advantages of the K-means algorithm are quick convergence to a distortion minimum and apprehending how many clusters are in the dataset [14], [15].

Methodology
The steps of this study are based on the Cross-Industry Standard Process for Data Mining (CRISP-DM). The steps are dataset collection, data pre-processing, modeling, and evaluation of the clustering, as presented in Figure 1.
Figure 1. Steps of clustering to predict students' graduation
In Figure 1, the data pre-processing step consists of data cleaning followed by attribute reduction using the Principal Component Analysis (PCA) method. This step produces the dataset on which clustering analysis is performed using the K-means method. The evaluation stage assesses the performance of the K-means method in clustering student graduation data. Predicting student graduation through K-Means clustering uses student academic performance data, whose attributes are presented in Table 1.

Data preprocessing
Data cleaning handles records with missing values: any record that has a missing attribute value is removed from the data set. Data reduction is carried out using Principal Component Analysis (PCA), which is used for attribute reduction, that is, to eliminate attributes that are irrelevant in predicting students' graduation. After the relevant and non-correlated attributes have been selected without affecting the information in the initial data set, the predictor is developed by clustering the data set with the K-means method [16], [17].
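The two pre-processing operations described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the procedure used in the study: missing values are assumed to be encoded as `None`, the attributes are standardized before PCA, only the first principal component is approximated (by power iteration), and a loading is taken to be the eigenvector entry scaled by the square root of the eigenvalue. The function names and the toy records are hypothetical.

```python
import math

def drop_incomplete(rows):
    """Data cleaning: keep only records in which no attribute is missing (None)."""
    return [row for row in rows if all(v is not None for v in row)]

def standardize(rows):
    """Center every column on 0 and scale it to unit (population) standard deviation.
    Assumes no column is constant, otherwise the division by the std fails."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]

def first_component_loadings(z_rows, iterations=200):
    """Approximate the loadings of the first principal component.
    Uses power iteration on the correlation matrix of standardized data.
    The overall sign of a principal component is arbitrary."""
    n, p = len(z_rows), len(z_rows[0])
    corr = [[sum(r[a] * r[b] for r in z_rows) / n for b in range(p)] for a in range(p)]
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; try another start vector")
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * corr[a][b] * v[b] for a in range(p) for b in range(p))
    return [x * math.sqrt(eigenvalue) for x in v]

# Toy records (hypothetical values): [GPA semester 1, GPA semester 2, credits]
records = [[3.1, 3.3, 20], [2.4, None, 18], [2.8, 2.9, 19], [3.6, 3.5, 22]]
cleaned = drop_incomplete(records)
print(len(cleaned), first_component_loadings(standardize(cleaned)))
```

A full PCA would extract all components, for example with an eigendecomposition routine from a numerical library. The power iteration is used here only to keep the sketch free of dependencies.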

Modeling
The K-means clustering method is used to predict student graduation in this study. The algorithm of the K-means clustering method is shown in Figure 2.
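As a rough illustration of the method, the standard K-means iteration (assign each record to its nearest center, then recompute each center as the mean of its members) can be sketched as follows. The sketch follows the textbook form of the algorithm and may differ in detail from the flowchart in Figure 2. The function names, the `max_iter` limit, and the handling of empty clusters are assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two records with the same number of attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, initial_centroids, max_iter=100):
    """Run K-means from the given initial centers.
    Returns (assignment, centroids); assignment[i] is the cluster index of points[i]."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster of its closest center.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the procedure has converged
        assignment = new_assignment
        # Update step: each center moves to the mean of the records assigned to it.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy usage with two attributes and two initial centers taken from the data.
data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels, centers = kmeans(data, [data[0], data[2]])
print(labels, centers)
```

Because the result depends on the initial centers, the function takes them as an explicit argument instead of drawing them at random.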

Evaluation
The evaluation step measures the performance of K-Means in clustering the students' graduation data using the confusion matrix [18]-[20].
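For reference, the usual rates derived from a confusion matrix can be computed as in the sketch below. It assumes the two-class case with counts of true positives, false positives, false negatives, and true negatives. How the clusters in this study are mapped onto graduation classes is not described here, so the function name and the example counts are purely illustrative.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard rates from a 2x2 confusion matrix.
    Assumes every denominator below is non-zero."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,            # share of all records classified correctly
        "error": 1 - accuracy,           # share of all records classified wrongly
        "recall": tp / (tp + fn),        # correctly found positives among actual positives
        "specificity": tn / (tn + fp),   # correctly found negatives among actual negatives
        "precision": tp / (tp + fp),     # actual positives among predicted positives
    }

# Hypothetical counts, not taken from the study.
print(binary_metrics(tp=40, fp=10, fn=5, tn=45))
```

With more than two classes, the same rates are typically computed per class by treating that class as positive and all other classes as negative.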

Result and Discussion
In the data pre-processing step, data cleaning and data reduction with the PCA method were carried out to obtain the relevant attributes for clustering the dataset with the K-means algorithm. The loading factor value of each variable from PCA is presented in Table 2.
Based on the PCA results in Table 2, 5 variables are listed as having a negative loading factor value: birthplace = -0.055, ages = 0.205, school status = -0.023, type of school = -0.013, and class of students = -0.533. They were removed because they did not have a significant influence on the determination of graduation time. The relevant attributes for predicting students' graduation are:
- Gender
- Hometown
- GPA in 1st semester
- GPA in 2nd semester
- GPA in 3rd semester
- GPA in 4th semester
- GPA in 5th semester
- GPA in 6th semester
- GPA 1st to 6th semester
- Credits in 2nd semester
- Credits in 3rd semester
- Credits in 4th semester
- Activeness in student organizations
- Parent's earnings
- Parent's education level
The dataset used for the clustering process contains 241 students. The first task in the K-means method is to determine the number of clusters to be examined. In this study, 3 experimental models were carried out, namely experiments with 2 clusters, 3 clusters, and 4 clusters. These three experiments were then evaluated to see which cluster model is best for predicting student graduation. In K-means clustering, data are selected randomly as the initial cluster centers, according to the number of clusters determined. In the 3-cluster model, the data are selected through the average, minimum, and maximum values of the GPA attribute. The resulting cluster centers are the 13th, 91st, and 202nd records in the dataset, as shown in Table 3.
Table 3. Initial cluster centre of K-means algorithm
Student  Gender  HT  IP1  IP2  IP3  IP4  IP5  IP6  IPK  SKS2  SKS3  SKS4  AO  PI  PE
13  1  1  1  1  1  2
Based on the initial cluster centers in Table 3, the next step determines the distance from each data record to each cluster center using the Euclidean distance in equation (1):
$$d(x_i, c_j) = \sqrt{\sum_{k=1}^{p} \left(x_{ik} - c_{jk}\right)^2} \qquad (1)$$
where $x_i$ is the $i$-th data record, $c_j$ is the $j$-th cluster center, and $p$ is the number of attributes. Based on equation (1), the distances from data-1 to centroid-1, centroid-2, and centroid-3 are computed, and each data record is grouped with its closest centroid. A comparison of the confusion matrices for the 3 cluster models is presented in Table 4. Based on the confusion matrices of K-Means clustering in Table 4, the accuracy, error, recall, specificity, and precision rates of the 2-cluster, 3-cluster, and 4-cluster experiments are presented in Figure 3. In the 2-cluster model, the accuracy rate is 61.41%, the error rate is 38.59%, and the precision rate is 69.77%. In the 3-cluster model, the accuracy rate is 78.42%, the error rate is 21.58%, and the precision rate is 68.10%. In contrast, in the 4-cluster model, the accuracy rate is 16.60%, the error rate is 83.40%, and the precision rate is 32.52%. The highest accuracy rate is obtained by the K-Means clustering model with 3 clusters when predicting student graduation from the student academic dataset in Table 1. This shows that the most appropriate clustering for predicting student graduation is the 3-cluster model. The results of the study are illustrated in the graph in Figure 4.
Figure 4. Effect of independent variables on the dependent variable chart.
In Figure 4, gender influences students' graduation time: male students have a shorter study period than female students. The 6th-semester GPA, the total credits in the 4th semester, and the total Grade Point Average also affect graduation time. Students with a higher 6th-semester GPA and a higher total GPA from the 1st to the 6th semester are more likely to graduate on time than students for whom both values are lower. Students with the most credits in the 4th semester are more likely to graduate on time than students with fewer credits. A higher parental education level does not result in a shorter study period. Organizational activity also influences graduation: students who are active in organizational membership have a longer study period than students who are not.
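The accuracy, error, and precision rates reported in this section for the three cluster models can be cross-checked with a short script. The sketch below only re-uses the percentages stated in the text. The dictionary layout and the function names are hypothetical, and the implied number of correctly grouped students assumes that accuracy was computed over all 241 students.

```python
# Rates in percent, as reported in the text for each cluster model.
reported = {
    2: {"accuracy": 61.41, "error": 38.59, "precision": 69.77},
    3: {"accuracy": 78.42, "error": 21.58, "precision": 68.10},
    4: {"accuracy": 16.60, "error": 83.40, "precision": 32.52},
}

def best_model(results):
    """Number of clusters with the highest reported accuracy."""
    return max(results, key=lambda k: results[k]["accuracy"])

def error_matches_accuracy(results, tol=0.005):
    """Check that each error rate equals 100% minus the accuracy rate."""
    return all(abs(100 - r["accuracy"] - r["error"]) <= tol for r in results.values())

def implied_correct(accuracy_percent, n_students=241):
    """Number of correctly grouped students implied by an accuracy rate,
    assuming accuracy was computed over all n_students records."""
    return round(n_students * accuracy_percent / 100)

print(best_model(reported), error_matches_accuracy(reported))
print({k: implied_correct(r["accuracy"]) for k, r in reported.items()})
```

The selection here is by accuracy only, which is the criterion the text uses when it names the 3-cluster model as the most appropriate one.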

Conclusion
The results of this study show that the attributes birthplace, ages, school status, type of school, and class of students all have negative loading factor values in the PCA method, which means that they all have a small correlation with the prediction of student graduation. The variables that affect student graduation time are gender (male students graduate faster than female students), the 6th-semester GPA and the GPA of the 1st to 6th semesters (students with higher grades graduate faster than others), the credits in the 4th semester (students with more credits graduate faster than others), and activeness in organizations (students who are active in an organization graduate later than students who are not). Of the 3 cluster models, the 3-cluster model gives the best clustering, with an accuracy rate of 78.42%, an error rate of 21.58%, and a precision rate of 68.10%. As future work, it is necessary to carry out cluster analysis using other methods that may result in better cluster performance.