Comparison of NB and NB-PSO to determine the level of vehicle sales

Many types and brands of vehicles are now available in Indonesia, especially motor vehicles. In this research, motor vehicles are cars, trucks, and buses (motorcycles are excluded). This research classified motor vehicle brands into two classes, Well Selling (Laris) and Not Selling (Tidak Laris), so that consumers and producers can find out which motor vehicle brands sell well based on their category and output. The study used 3908 records, split into 3126 training records and 782 testing records. The data was obtained from the GAIKINDO (Indonesian Automotive Industries Association) site. The data had 19 attributes, but to simplify the research only 8 were used (including 1 Class attribute to facilitate the search for the best-selling motor vehicles). This research compared the accuracy of the Naive Bayes method and NB-PSO (Naive Bayes-Particle Swarm Optimization) on this dataset. NB-PSO was run with an inertia weight of 1 and a population size of 5. Classification with the Naive Bayes method produced an accuracy of 92.11%, a precision of 86.57%, and a recall of 97.12%. NB-PSO produced an accuracy of 92.44%, a precision of 87.07%, and a recall of 97.18%, so the PSO method improved the classification accuracy of NB by 0.33 percentage points.


Introduction
Motor vehicles are among the vehicles needed in daily life. Over time, many motor vehicle brands have appeared, so consumers need to decide quickly and precisely which motor vehicles are well selling (laris) and which are not selling (tidak laris); these are the two labels classified in this research. Motor vehicles come in many types and variations based on their specifications. The data attributes used are 8 car specifications from that data, namely Category, Brand, Type, Model, CC, Transmission, Original Country, Output, and Class. These specifications are taken from data obtained from the GAIKINDO (Gabungan Industri Kendaraan Bermotor Indonesia) site. The data covers January 2015 to March 2019, with a total of 3908 records. The motor vehicle brands are classified into the classes well selling and not selling so that consumers, producers, and researchers can find out which motor vehicle brands are best-selling. A distinct classification is therefore needed so that customers know the best-selling motor vehicle brands in the most in-demand categories. It also encourages motor vehicle producers to produce the kinds of motor vehicles that consumers demand most.
Several algorithms can be used for data classification problems, including K-NN (K-Nearest Neighbor), C4.5, and SVM (Support Vector Machine), but the algorithm most frequently used for such problems is Naive Bayes. Naive Bayes has several advantages in this case [2]. This algorithm was also used by Deden Rustiana and Nina Rahayu to analyze automotive market sentiment on Twitter, where the implementation of Naive Bayes resulted in a value of 93%, positive sentiment of 90%, negative sentiment of 90%, and negative sentiment of 100% [3]. In 2017, Astrid used the algorithm in activities at the University of Semarang's ICT Faculty [4]. The Naive Bayes algorithm has also been applied in a Web- and Mobile-based DSS (Decision Support System) that helps Muslims choose a hijab model correctly and accurately [5]. Ghulam Asrofi then conducted research on sentiment analysis of the East Java 2018 governor candidates using the Naive Bayes method, which classified the sentiment (positive, neutral, or negative) of Indonesian tweets about the East Java Governor Candidates 2018 [6].
At first, this research proposed only to test the Naive Bayes method on this dataset [7]. However, researchers continue to improve Naive Bayes, whose weakness lies in selecting attributes, so this research adds a comparison of the accuracy of the Naive Bayes method and NB-PSO (Naive Bayes-Particle Swarm Optimization) on this dataset. This builds on previous research by Marlina et al, who compared the performance of the Naive Bayes and C4.5 classification algorithms to find the more accurate method [8], and on Jie Lin and Jiankun Yu, who succeeded in using NB-PSO to make classification more accurate and more effective [9]. Widiastuti et al also showed that the PSO method can improve the accuracy of Naive Bayes by 10.72% [10]. PSO is one of the optimization methods that can be used to improve Naive Bayes performance. This optimization method is adopted from the behavior of a flock of birds. According to Carlisle and Dozier, one factor that influences the improvement in Naive Bayes performance is the population size [11], so in this research the parameters are set to an inertia weight of 1 and a population size of 5.
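As a rough illustration of the swarm update that PSO performs, the following Python sketch minimizes a toy objective with an inertia weight of 1 and a population size of 5, the two parameter values stated in this research. The objective function `sphere`, the acceleration coefficients `c1` and `c2`, the velocity clamp `v_max`, the iteration count, and the seed are all assumptions made for illustration; the source does not state them, and in NB-PSO the objective would instead be derived from the Naive Bayes classification accuracy.

```python
import random


def sphere(position):
    """Toy objective (assumption): sum of squares, minimized at the origin."""
    return sum(x * x for x in position)


def pso(objective, dim=2, population_size=5, inertia=1.0,
        c1=1.0, c2=1.0, v_max=1.0, iterations=50, seed=0):
    """Minimal particle swarm optimization sketch.

    population_size=5 and inertia=1.0 follow the text; c1, c2, v_max,
    iterations and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    positions = [[rng.uniform(-5, 5) for _ in range(dim)]
                 for _ in range(population_size)]
    velocities = [[0.0] * dim for _ in range(population_size)]
    personal_best = [p[:] for p in positions]
    personal_score = [objective(p) for p in positions]
    best_index = min(range(population_size), key=personal_score.__getitem__)
    global_best = personal_best[best_index][:]
    global_score = personal_score[best_index]
    initial_score = global_score

    for _ in range(iterations):
        for i in range(population_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (inertia * velocities[i][d]
                     + c1 * r1 * (personal_best[i][d] - positions[i][d])
                     + c2 * r2 * (global_best[d] - positions[i][d]))
                # Clamp the velocity so inertia = 1 cannot grow it unbounded.
                velocities[i][d] = max(-v_max, min(v_max, v))
                positions[i][d] += velocities[i][d]
            score = objective(positions[i])
            if score < personal_score[i]:
                personal_best[i] = positions[i][:]
                personal_score[i] = score
                if score < global_score:
                    global_best = positions[i][:]
                    global_score = score
    return global_best, global_score, initial_score


best_position, best_score, initial_score = pso(sphere)
```

Each particle is pulled toward its own best position and toward the best position found by the whole swarm, which mirrors the flocking behavior the method is adopted from. The returned best score can never be worse than the best score of the initial population, because the global best is only replaced by strictly better positions.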

Research Method
As shown in Figure 1, the first step is collecting the dataset of 3908 records, followed by preprocessing, which consists of cleaning, replacing, reduction, and transformation. Preprocessing results in a new dataset. This research compared the Naive Bayes algorithm, Naive Bayes with PSO optimization (cross-validation applied once), and Naive Bayes with PSO optimization (cross-validation applied twice).
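To make the baseline classifier concrete, the sketch below shows a minimal Naive Bayes for categorical attributes in plain Python. It is an illustration under stated assumptions, not the implementation used in this research: the toy rows use only two of the attributes (Category and Transmission) with invented values, the function names `train_naive_bayes` and `predict` are hypothetical, and add-one (Laplace) smoothing is an assumption the source does not mention.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, label) -> value counts
    distinct_values = defaultdict(set)    # attribute index -> values seen
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(index, label)][value] += 1
            distinct_values[index].add(value)
    return class_counts, value_counts, distinct_values


def predict(model, row):
    """Return the label with the highest posterior (log) probability."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label, count in class_counts.items():
        log_prob = math.log(count / total)  # class prior
        for index, value in enumerate(row):
            # Add-one smoothing (assumption) avoids zero probabilities.
            numerator = value_counts[(index, label)][value] + 1
            denominator = count + len(distinct_values[index])
            log_prob += math.log(numerator / denominator)
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


# Hypothetical toy records: (Category, Transmission) with invented values.
rows = [("Sedan", "AT"), ("Sedan", "MT"), ("Truck", "MT"),
        ("Truck", "MT"), ("Sedan", "AT"), ("Truck", "AT")]
labels = ["Laris", "Laris", "Tidak Laris",
          "Tidak Laris", "Laris", "Tidak Laris"]

model = train_naive_bayes(rows, labels)
prediction_a = predict(model, ("Sedan", "AT"))
prediction_b = predict(model, ("Truck", "MT"))
```

The classifier treats every attribute as independent given the class, which is why each attribute contributes one separate term to the log probability.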
Cross-validation is performed with k = 10 folds, so in each round the dataset is divided 10:90, with 10% used as testing data and 90% used as training data. Validation is conducted to test the algorithm used. K-fold cross-validation is a method for estimating how successful an algorithm model is by repeatedly re-testing it on randomly selected portions of the input data.
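The 10-fold split described here can be sketched in plain Python as follows. This is an illustration only: the function name `k_fold_indices` and the fixed seed are assumptions, and the research itself may have used a tool's built-in cross-validation operator.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k keeps fold sizes within one record of each other.
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [idx for j, fold in enumerate(folds)
                 if j != test_fold for idx in fold]
        yield train, test


# 3908 is the record count reported for the GAIKINDO dataset.
splits = list(k_fold_indices(3908, k=10))
```

Every record is used for testing exactly once across the 10 rounds, and in each round roughly 10% of the records are held out while the remaining 90% are used for training.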

Figure 1. Steps of Research Method
The next step is evaluating the performance of each algorithm using the confusion matrix technique and the ROC curve. The confusion matrix is used to measure the accuracy, precision, and recall of the evaluated algorithm model. Accuracy is the percentage of predicted values that match the actual values, and precision is the proportion of records predicted as a class that actually belong to that class. Recall is the proportion of records that actually belong to a class that the algorithm successfully identifies. The confusion matrix table is shown below [2].
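As a small illustration of how the three measures follow from the four confusion matrix counts, consider the Python sketch below. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the results of this research.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for the class treated as positive.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts chosen only to illustrate the formulas.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

With these invented counts, accuracy is 85 correct predictions out of 100, precision is 40 out of the 50 records predicted positive, and recall is 40 out of the 45 records that are actually positive.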

Result and Analysis
The research is divided into 3 models, which are: a. Model using the Naive Bayes algorithm. The following is the first model. The first model produced an accuracy of 92.11%, a precision of 86.57%, and a recall of 97.12% [1], while the second testing model was carried out by Jaenal et al [7]. However, the dataset used is different; Jaenal et al did not use the GAIKINDO dataset. The comparison of the three models is shown in the following histogram.

Figure 5. Comparison of Three Models
Based on the test results, comparing the first model with the second model shows a decrease of 0.35% in accuracy, 0.3% in recall, and 0.43% in precision, while the AUC value remains stable.
Comparing the first testing model with the third testing model shows an increase of 0.33% in accuracy, 0.06% in recall, 0.5% in precision, and 0.002 in the AUC value. It can be concluded that the third testing model has a higher accuracy than the first and the second testing models. These results are in accordance with the previous study by Jaenal et al, which found that PSO can increase the accuracy of NB by an average of 0.33% across the 7 datasets tested [12].
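The reported gains can be checked against the Naive Bayes and NB-PSO figures given in the abstract with a short calculation. The dictionary names below are illustrative; the percentages are the ones reported in this paper, and the AUC values are omitted because only their difference is stated.

```python
# Percentages reported for Naive Bayes and NB-PSO.
nb = {"accuracy": 92.11, "precision": 86.57, "recall": 97.12}
nb_pso = {"accuracy": 92.44, "precision": 87.07, "recall": 97.18}

# Differences in percentage points, rounded to two decimals to
# suppress floating-point noise.
gains = {name: round(nb_pso[name] - nb[name], 2) for name in nb}
```

The differences come out to 0.33 for accuracy, 0.5 for precision, and 0.06 for recall, which matches the increases stated in the comparison of the first and third models.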
The following table shows the classification results for the vehicle sales level using the GAIKINDO dataset; 16 representative records out of the total 3908 are shown.