
Application of a combination between Principal Component Analysis and Logistic Regression Based on Support Vector Machine on Educational Data Mining with Overlapping Data Problem


Published under licence by IOP Publishing Ltd
Citation: Siti Mutrofin et al 2020 IOP Conf. Ser.: Mater. Sci. Eng. 874 012018, DOI 10.1088/1757-899X/874/1/012018


Abstract

In 2019, the government of the Republic of Indonesia introduced a zoning-based policy for New Student Admissions (PPDB) from elementary school (SD) through high school (SMA), chiefly for public schools. The policy is set out in Permendikbud No. 51/2018. It aims to ensure equal access to education and to keep prospective students from focusing only on favorite schools. However, the policy raises new problems. One is that a prospective student with a medium National Examination (UN) score, or who lives a medium distance from the destination school, has only a small chance of being accepted there. The situation is worse when prospective students do not know the lowest score and the farthest distance the destination school will accept: they then choose schools by guesswork rather than on valid data, so their chances of acceptance are very small. This research focuses on Educational Data Mining for the PPDB of public high schools (SMA) in Jombang in the 2019/2020 academic year. It aims to help prospective students predict their destination school from their own grades and home distance using data mining classification techniques. Another problem emerged during the study: overlapping data, in which the same attribute value occurs in more than one class. For example, a prospective student of SMA Negeri 2 Jombang (SMAN 2 Jombang) has a score of 80 in Bahasa Indonesia, the same as a student of SMA Negeri 3 Jombang (SMAN 3 Jombang). The overlap affects not just a single record but almost all of the data. The data set comprised 600 records: 308 from PPDB 2019 of SMAN 2 Jombang and the remaining 292 from SMAN 3 Jombang. The attributes were the home distance from the destination school, the overall UN score, and the UN scores in Mathematics, Natural Sciences, Bahasa Indonesia, and English.
The algorithm combined Principal Component Analysis (PCA) with a Logistic Regression (LR)-based Support Vector Machine (SVM) using an ANOVA kernel. Validation used 10-fold cross-validation, and performance was evaluated by accuracy, precision, and recall. The method achieved an accuracy of 94.33%, a precision of 96.28%, and a recall of 92.53%, better than the results obtained without PCA (70.83% accuracy, 69.62% precision, and 76.62% recall). PCA lets the data be viewed from another angle, one that separates one class from the others. Although 100% of the data overlapped, no two records were exactly the same across all attributes.
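
The evaluation protocol described in the abstract (PCA, then a kernel SVM, scored by 10-fold cross-validation on accuracy, precision, and recall) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. Several things here are assumptions: the data are synthetic (`make_synthetic_ppdb` is a hypothetical helper that only mimics the 308/292 class split and the six attributes), a plain `SVC` stands in for the LR-based SVM variant, the ANOVA kernel is written in the common form k(x, y) = (Σ_k exp(-γ (x_k - y_k)²))^d, and the standardization step, the number of retained components, γ, and d are arbitrary choices. The scores it prints say nothing about the figures reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def anova_kernel(X, Y, gamma=1.0, degree=2):
    """ANOVA kernel Gram matrix (assumed form; gamma and degree are illustrative)."""
    diff = X[:, None, :] - Y[None, :, :]  # shape (n_x, n_y, n_features)
    return np.exp(-gamma * diff ** 2).sum(axis=2) ** degree


def make_synthetic_ppdb(n_a=308, n_b=292, seed=0):
    """Hypothetical stand-in data: two heavily overlapping classes, six attributes."""
    rng = np.random.default_rng(seed)
    n = n_a + n_b
    y = np.r_[np.zeros(n_a, dtype=int), np.ones(n_b, dtype=int)]
    # Four subject scores (Mathematics, Natural Sciences, Bahasa Indonesia, English)
    subjects = rng.normal(loc=75 + 3 * y[:, None], scale=8, size=(n, 4)).clip(0, 100)
    total = subjects.sum(axis=1)  # assumption: overall score is the subject sum
    distance = rng.gamma(shape=2.0, scale=1.5 + 0.5 * y)  # home distance, arbitrary units
    X = np.column_stack([distance, total, subjects])
    return X, y


X, y = make_synthetic_ppdb()

# Standardize, project onto principal components, then classify with the kernel SVM.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),  # keep 95% of variance (assumption)
    SVC(kernel=anova_kernel),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall"])

for metric in ("accuracy", "precision", "recall"):
    print(f"{metric}: {scores['test_' + metric].mean():.4f}")
```

To compare against a baseline without PCA, as the abstract does, the same `cross_validate` call can be repeated on a pipeline that omits the `PCA` step.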


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
