Research on the Application of Naive Bayes and Support Vector Machine Algorithms to Exercise Classification

In this paper, exercises from college electronics-major courses are classified, and the classification accuracy of algorithms based on Naive Bayes and Support Vector Machine (SVM) is compared. The data set consists of exercises related to the content of the course Data Structure. According to the knowledge content of the course and the characteristics of the exercises, judgment (true/false) and multiple-choice questions are used as experimental data, and the exercises are divided into seven categories by chapter content. First, the Jieba word-segmentation package is applied to segment the exercise text, with a custom dictionary imported so that the proper nouns of the knowledge content are recognized. Because the feature words of short texts are quite sparse, the TF-IDF weighting method is applied directly for text feature representation instead of performing feature selection. Finally, algorithms based on Naive Bayes and SVM are employed to train on and classify the exercise text. The experimental results show that the SVM-based algorithm has a lower error rate and higher classification accuracy than Naive Bayes.


Introduction
Exercises are an important part of teaching resources: they not only allow students to consolidate the knowledge points they have mastered, but also enable teachers to discover the knowledge points students have not yet mastered. The classification of exercises is therefore particularly important. The number of words in an exercise text ranges from a dozen to a few dozen, so it belongs to short text or even very short text, and exercise classification is thus an application scenario of short-text classification. Traditional text-classification methods include KNN [1], Logistic Regression [2], Naive Bayes [3] and SVM [4], as well as Latent Dirichlet Allocation (LDA) for short-text classification [5]; the most widely used algorithms are Naive Bayes and SVM, sometimes combined with LDA topic models [6]. At present the most popular approach to short-text classification is deep learning, which achieves the best classification performance [7], but because the set of exercises is small and deep-learning models are weakly interpretable, this paper adopts traditional text-classification methods. The algorithms used for exercise text classification in this paper are Naive Bayes and SVM, because they are easy to understand and computationally inexpensive.
The main application scenarios of Naive Bayes are text classification and fraud detection. Xiao et al. presented a Naive Bayes method for classifying patent texts whose classification performance was better than that of SVM, improving the accuracy of patent text classification. Another study improved a classifier based on the SVM-KNN algorithm by feeding back and refining the probability of classification predictions [10]. Zun et al. applied Naive Bayes and SVM algorithms to automatically classify IT research papers, showing that the SVM classifier is more accurate than Naive Bayes [11].
In this paper, the exercise text is pre-processed using Jieba word segmentation, a vector space model with TF-IDF weights is constructed, and the exercise text is classified with algorithms based on Naive Bayes and SVM. The classification accuracy of the two algorithms is evaluated using error rate, precision, recall and F1-score.

Word segmentation
This paper adopts Jieba's precise segmentation mode, which attempts to segment sentences as accurately as possible and is therefore suitable for analysing exercise text. Most of the knowledge points in the Data Structure textbook are proper nouns that Jieba's stock dictionary cannot recognise as words, so a custom dictionary must be imported to improve the recognition rate of these new words.
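Jieba's custom dictionary is a plain-text file with one entry per line in the form "word frequency part-of-speech" (frequency and part-of-speech tag are optional). The Data Structure terms below are illustrative; in the real pipeline the file would cover the proper nouns of the whole textbook:

```text
二叉排序树 10 n
哈希表 10 n
邻接矩阵 10 n
简单选择排序 10 n
```

The file is loaded once before segmentation with `jieba.load_userdict("userdict.txt")`, after which these terms are no longer split into fragments.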

Removal of stop words
Auxiliary words, function words, adverbs, punctuation, letters, illegal characters and the like are removed from the exercises. The stop-word list of Harbin Institute of Technology, the stop-word library of the Machine Intelligence Laboratory of Sichuan University and Baidu's stop-word list were merged and reprocessed, and words such as "with", "total" and "how much", which contribute little to knowledge-point keywords, were added. The result is a comprehensive stop-word list for exercise classification in Data Structure, containing 2,503 stop words in total.
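Stop-word removal then reduces to a set-membership filter over the segmented tokens. The sketch below is a minimal illustration; the inlined stop words and the example exercise are hypothetical stand-ins for the merged 2,503-word list and the jieba output:

```python
def load_stop_words(path):
    """Read one stop word per line into a set for O(1) lookup."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that carry knowledge-point meaning."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical example (stop words inlined instead of read from a file):
stop_words = {"的", "与", "共", "多少", "，", "？"}
tokens = ["二叉树", "的", "先序", "遍历", "与", "中序", "遍历"]
print(remove_stop_words(tokens, stop_words))
# ['二叉树', '先序', '遍历', '中序', '遍历']
```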

Merging synonyms
The same knowledge point has different expressions in different exercises, so synonyms must be merged. For example, "Simple selection sort" and "Direct selection sort" represent the same knowledge point and are unified into "Simple selection sort"; similarly, "Hash Table", "HASH Table" and "Hash List" are all merged into "Hash Table".
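Synonym merging can be implemented as a lookup table mapping each variant to a canonical knowledge-point name. The mapping below is a small assumed sample built from the examples in the text, not the paper's full table:

```python
# Hypothetical variant -> canonical-name table.
SYNONYMS = {
    "直接选择排序": "简单选择排序",   # "Direct selection sort" -> "Simple selection sort"
    "HASH表": "哈希表",              # "HASH Table" -> "Hash Table"
    "散列表": "哈希表",              # "Hash List"  -> "Hash Table"
}

def merge_synonyms(tokens):
    """Replace each token with its canonical knowledge-point name."""
    return [SYNONYMS.get(t, t) for t in tokens]

print(merge_synonyms(["散列表", "查找"]))
# ['哈希表', '查找']
```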

Feature representation of text
The purpose of text representation is to convert the pre-processed text into a form a computer can understand; it is the step that most strongly determines the quality of text classification.
Traditionally, the bag-of-words model and the vector space model (VSM) are the most commonly used text representations. The bag-of-words model treats the sentences of a text as a collection of words, and this "bag of words" represents the text. Built on the bag-of-words model, VSM has become the most commonly used method for text representation [12]. VSM was proposed by Salton [13]. It regards a document as an n-dimensional vector in vector space:

d_j = (w_{1,j}, w_{2,j}, ..., w_{n,j})  (1)

where t_i denotes the i-th feature word and w_{i,j} the weight of that feature word in document d_j, generally calculated as its TF-IDF value [14]. The TF-IDF weight is computed as

w_{i,j} = tf_{i,j} × idf_i = (n_{i,j} / Σ_k n_{k,j}) × log(|D| / |{d : t_i ∈ d}|)  (2)

where n_{i,j} is the number of occurrences of t_i in document d_j, the denominator Σ_k n_{k,j} is the total number of words in d_j, |D| is the number of documents in the corpus, and |{d : t_i ∈ d}| is the number of documents containing t_i.
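The weighting in formula (2) can be sketched directly. The toy corpus below is a hypothetical stand-in for the segmented exercise texts; terms that occur in every document get weight zero, and rare terms get the largest weights:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns a {term: weight} dict per document,
    with w_ij = (n_ij / sum_k n_kj) * log(|D| / df_i) as in formula (2)."""
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())    # total words in this document
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

corpus = [["栈", "入栈", "出栈"], ["队列", "入队"], ["栈", "顺序", "存储"]]
weights = tf_idf(corpus)
# "栈" occurs in 2 of 3 documents, so its idf is log(3/2); terms unique
# to a single document get the larger idf log(3).
```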

Classification algorithm
After constructing the TF-IDF-weighted vector space, machine learning algorithms can be used to classify the exercise text.

Naive Bayes
Naive Bayes classification is a supervised learning algorithm based on Bayes' rule. Its basic principle is that, when the prior probability of each class and the conditional probabilities of each attribute value given the class are known, the conditional probability that a sample belongs to each category can be calculated, and the category with the greatest conditional probability is selected. It completes text categorization by assigning probabilistic labels to the input documents. Bayes' formula is

P(c_i | d) = P(d | c_i) P(c_i) / P(d)

which gives the probability that the input text d belongs to category c_i, where d denotes the text and c_i a classification category. Under the naive Bayes assumption that the feature words t_k of a text are conditionally independent given the class, the category with the maximum probability for input text d is taken as the category of d:

Y = argmax_{c_i} P(c_i) ∏_k P(t_k | c_i)  (6)
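Formula (6) can be sketched as a tiny multinomial Naive Bayes with Laplace smoothing, computed in log space to avoid floating-point underflow. The training exercises and chapter labels below are hypothetical:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Minimal multinomial NB: Y = argmax_c P(c) * prod_k P(t_k | c)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            vocab.update(doc)
        self.v = len(vocab)             # vocabulary size for Laplace smoothing
        return self

    def predict(self, doc):
        def log_post(c):
            total = sum(self.counts[c].values())
            lp = math.log(self.priors[c])
            for t in doc:
                # Laplace-smoothed P(t | c)
                lp += math.log((self.counts[c][t] + 1) / (total + self.v))
            return lp
        return max(self.classes, key=log_post)

clf = TinyNaiveBayes().fit(
    [["stack", "push", "pop"], ["queue", "enqueue", "dequeue"]],
    ["linear-stack", "linear-queue"])
print(clf.predict(["push", "pop"]))  # linear-stack
```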

Support Vector Machine
Support Vector Machine (SVM) was first proposed by Vapnik et al. [15]. Linear classifiers use hyperplane boundaries, while non-linear classifiers use hypersurfaces. The hyperplane can be represented by the classification function f(x) = w·x + b: when f(x) = 0, x is a point on the hyperplane; points with f(x) > 0 correspond to data points of the class y = 1, and points with f(x) < 0 to the class y = −1. The principle of SVM is to map points from a low-dimensional space into a high-dimensional space in which they become linearly separable, and then to find a linear separating boundary there; the partition is linear in the high-dimensional space but non-linear in the original data space. SVM has many unique advantages in solving small-sample, non-linear and high-dimensional pattern recognition problems, and can be extended to other machine learning problems such as function fitting.
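The decision rule f(x) = w·x + b described above can be sketched in a few lines. The weight vector and bias here are hypothetical constants, not learned parameters; only the sign test is illustrated:

```python
def linear_decision(w, b, x):
    """Return +1 / -1 by the sign of f(x) = w·x + b, or 0 when x lies
    exactly on the hyperplane."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (f > 0) - (f < 0)

print(linear_decision([1.0, -1.0], 0.0, [2.0, 1.0]))  # 1  (f > 0 side)
print(linear_decision([1.0, -1.0], 0.0, [1.0, 1.0]))  # 0  (on the hyperplane)
```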

Evaluation of classification results
In order to evaluate the experimental results and compare the classification accuracy of Naive Bayes and Support Vector Machine, this paper selects four indicators: error rate, precision, recall and F1-score. Table 1 shows the parameters of the evaluation indices. In the corresponding formulas, n is the number of exercises the classifier misclassifies and N the actual total number of exercises, so the error rate is n/N. A denotes the number of exercises that actually belong to a class and are assigned to it by the classifier, B the number that do not belong to the class but are assigned to it, and C the number that belong to the class but are not assigned to it; thus precision P = A/(A+B), recall R = A/(A+C), and F1 = 2PR/(P+R).
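The four indicators follow directly from the counts n, N, A, B and C; the sketch below uses hypothetical numbers, not the paper's results:

```python
def evaluate(n, N, A, B, C):
    """Compute the four indicators from the evaluation counts."""
    error_rate = n / N
    precision = A / (A + B)                            # P = A / (A + B)
    recall = A / (A + C)                               # R = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall) # F1 = 2PR / (P + R)
    return error_rate, precision, recall, f1

# Hypothetical counts for one class of exercises:
print(evaluate(n=10, N=100, A=40, B=5, C=10))
```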

Experimental data
There are 1,034 exercises in this experiment: 512 multiple-choice questions and 522 judgment questions. According to the knowledge structure of Data Structure, the exercises are divided into seven categories; Table 2 shows the seven categories and their descriptions. The experiments were implemented in Python using the PyCharm editor, with the multinomial Naive Bayes and linear SVM algorithms of the Scikit-Learn library used to classify the short texts. The experimental data were divided into a training set and a test set, with two experiments using different splits. Table 3 and Figure 1 show the exercise-classification results of the two experiments for Naive Bayes and SVM. Comparing the first and second experiments, increasing the number of exercises in the training set decreases the classifier's error rate on the test exercises and increases precision, recall and F1. The results of each experiment show that the SVM algorithm has a lower error rate in predicting the exercises than the Naive Bayes algorithm: in the second experiment the error rate of Naive Bayes is 10.79%, while that of SVM is reduced to 6.16%. At the same time, the precision, recall and F1 of the SVM algorithm are higher than those of Naive Bayes. Therefore, the SVM algorithm is more accurate than Naive Bayes in exercise classification.
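A minimal sketch of this pipeline, assuming scikit-learn is available; the toy exercises, labels and space-joined tokens are hypothetical stand-ins for the jieba-segmented Data Structure exercises and seven chapter categories:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Pre-segmented exercises joined by spaces (jieba output in the real pipeline).
train_texts = ["stack push pop lifo",
               "queue enqueue dequeue fifo",
               "binary tree preorder inorder traversal",
               "hash table collision linear probing"]
train_labels = ["linear", "linear", "tree", "search"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_texts)

nb = MultinomialNB().fit(X, train_labels)    # multinomial Naive Bayes
svm = LinearSVC().fit(X, train_labels)       # linear SVM

X_test = vec.transform(["preorder traversal of a binary tree"])
print(nb.predict(X_test)[0], svm.predict(X_test)[0])
```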

Conclusion and Future work
Exercises are an important learning resource in education and teaching, testing how well students have mastered knowledge. Classifying exercises by chapter is therefore particularly important: it helps students consolidate what they have learned at each stage and find and fill their gaps, and it also lays the groundwork for recommending exercises in the future based on each student's performance. This paper classifies the relevant exercises in the course Data Structure: stop words are removed and synonyms merged, a TF-IDF word vector space is constructed, and machine learning algorithms based on Naive Bayes and Support Vector Machine are applied to classify the exercise text. The results of the two experiments show that the training corpus determines the accuracy of exercise text classification to a certain extent: the larger the training set and the better the pre-processing of the exercise text, the higher the classification accuracy of both algorithms. The results of each experiment also show that SVM outperforms Naive Bayes on error rate, precision, recall and F1 alike. Therefore, the classification performance of the SVM algorithm is better than that of Naive Bayes for exercise text classification.
In future work, more data sets and experiments will be added, and short-text classification algorithms such as KNN, Logistic Regression and LDA topic models, together with their improvements, will be compared. At the same time, to address the sparse-feature problem of short texts, the next step is to expand the feature words using an LDA topic model and combine this expansion with the SVM and Naive Bayes algorithms, so as to improve the accuracy of exercise classification.