Text Mining and Support Vector Machine for Sentiment Analysis of Tourist Reviews in Bangkalan Regency

TripAdvisor is a travel site that offers reviews of hotels, flights, restaurants and tourist attractions. Reviews from tourists are indispensable for developing tourism, but the sheer number of comments makes it difficult for owners to analyze the important aspects of the reviews, so reviews that could help develop a tourist spot are overlooked or left unread. This research aims to help owners of tourist sites in Bangkalan Regency classify negative and positive opinions and identify the opinion targets using sentiment analysis techniques. The initial stage of the sentiment analysis process in this research is web scraping of the TripAdvisor site, which collects user review data. The review data obtained are then classified using the Support Vector Machine (SVM) method. The classification results are further processed using text mining to find the opinion targets that are considered important in the reviews. The SVM classification achieved an accuracy of 70.22% for Indonesian-language reviews. A post-publication change was made to this article on 20 Apr 2020 as the previously published article was a duplicate.


Introduction
TripAdvisor is a travel site that offers reviews of hotels, flights, restaurants and tourist attractions. Reviews from tourists are indispensable for developing tourism, but the sheer number of comments makes it difficult for owners to analyze the important aspects of the reviews, so reviews that could help develop a tourist spot are overlooked or left unread. This research aims to help owners of tourist sites in Bangkalan Regency classify negative and positive opinions and identify the opinion targets using sentiment analysis techniques. A sentiment analysis system generally consists of two parts: an aspect and opinion word extraction model and an aspect-level sentiment classification model. The former aims to identify the aspect and opinion words mentioned in reviews [1] [2], while the latter aims to infer the sentiment polarity of specific aspects [3] [4].

Related Work
This section reviews articles that mainly study sentiment analysis. In 2018, Nanda Cahyo analyzed the sentiment of user reviews of a mobile banking application using Fuzzy K-Nearest Neighbor. In that research, TF-IDF and cosine similarity were used to calculate the distance between data points, giving a highest F-measure of 0.9604 and a lowest of 0.8349 [5]. In 2017, Wilianto analyzed tourism sentiment in West Java using the Naïve Bayes classifier and obtained a Vmap value of 0.0084 for positive sentiment and 0.0064 for negative sentiment [6]. In 2018, Fanissa conducted research using the Naïve Bayes method to analyse tourism sentiment in

Methods
In this study, we use the Support Vector Machine (SVM) as the method for classifying positive and negative sentiment. The methods used in this research are described below, drawing on several references from the literature.

Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a relatively new technique for prediction, in both classification and regression. It is a set of supervised learning methods that analyze data and recognize patterns, and it is used for classification and regression analysis. The original SVM algorithm was created by Vladimir Vapnik, and the current standard variant (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik [9].
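
For reference, the soft-margin formulation mentioned above is usually written as the optimization problem below. This is the standard textbook form rather than a formula taken from this study; the symbols (training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, slack variables $\xi_i$, and penalty parameter $C$) are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{aligned}
```

With a linear kernel, a new review with feature vector $x$ is then assigned the class $\operatorname{sign}(w^{\top} x + b)$.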

Text Mining
Text mining is the application of data mining concepts and techniques to find patterns in text, that is, the process of analyzing text to obtain useful information for a particular purpose. Because text data are unstructured, text mining requires some initial stages whose purpose is to give the text more structure [10]. In this research, text mining in the sentiment analysis process consists of several phases, namely:

Tokenizing
The tokenizing stage cuts the input string into the individual words that compose it.

Filtering
The filtering stage takes the important words from the tokenizing result.

Stemming
The stemming stage finds the root word of each word in the filtering result.

Tagging
The tagging stage finds the original/root form of each word in the stemming result.

Analyzing
The analyzing stage determines how closely the words are related across documents.
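
The first three phases above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the stop-word list, the suffix list, and the function names are all hypothetical, and the suffix-stripping stemmer is a toy stand-in for a proper Indonesian stemming algorithm. The tagging and analyzing phases are not covered.

```python
import re

# Hypothetical Indonesian stop-word list; a real system would use a fuller one.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "untuk", "sangat"}

# Hypothetical suffixes for a toy stemmer (not a real Indonesian stemming algorithm).
SUFFIXES = ("nya", "kan", "an")


def tokenize(text):
    """Tokenizing: lowercase the text and cut it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_tokens(tokens):
    """Filtering: drop stop words and keep the remaining (important) words."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Stemming (toy): strip one known suffix if at least 4 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenizing, filtering and stemming in sequence."""
    return [stem(t) for t in filter_tokens(tokenize(text))]


print(preprocess("Pantainya sangat indah dan bersih"))
```

For the sample sentence, the stop words "sangat" and "dan" are removed and "pantainya" is reduced to "pantai", leaving three tokens for the later stages.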

Result and Discussion
The data used in this study are reviews of tourist sites in Bangkalan on the TripAdvisor site, collected in October 2019. Web scraping was performed using the Data Miner tool. The Bangkalan tourism data are displayed in Table 1. Many of the tourist sites have no reviews, which means that they are not well known or that no one has written a review of them on TripAdvisor.

Data labeling
The next process is opinion class labelling, which assigns a positive or negative label to each review. In this study, we used a lexicon to calculate the sentiment score. A review is labelled positive if its score is greater than or equal to 0; otherwise, if its score is less than 0, it is labelled negative. The result of the lexicon-based labelling of the reviews is shown in Fig. 1.
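
The labelling rule can be sketched as follows. The lexicon entries, their weights, and the function names are hypothetical and chosen only for illustration; the actual lexicon used in this study is not reproduced here. The sketch assumes that the score of a review is the sum of the weights of its words and that a score of exactly 0 is labelled positive.

```python
# Hypothetical mini-lexicon mapping a word to a polarity weight (assumption).
LEXICON = {
    "indah": 1,    # beautiful
    "bersih": 1,   # clean
    "bagus": 1,    # good
    "kotor": -1,   # dirty
    "mahal": -1,   # expensive
    "buruk": -1,   # bad
}


def sentiment_score(tokens):
    """Sum the lexicon weights of the tokens; unknown words contribute 0."""
    return sum(LEXICON.get(token, 0) for token in tokens)


def label(tokens):
    """Label a review positive if its score is >= 0, otherwise negative."""
    return "positive" if sentiment_score(tokens) >= 0 else "negative"


print(label(["pantai", "indah", "bersih"]))
print(label(["toilet", "kotor"]))
```

Note that under this rule a review containing no lexicon words scores 0 and is therefore labelled positive.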

Figure 1. Data Labeling Results
Once we obtained the opinion class labels, we classified the data by sentiment. The percentage of each class is shown in Fig. 3. Based on Figs. 2 and 3, of a total of 1394 reviews, 1235 are positive and 159 are negative.

Wordcloud sentiment
A word cloud is useful for finding the words that appear most often in the reviews of a sentiment class. In this study, we generated the word cloud for the positive class, as shown in Fig. 4.

Figure 4. Positive Wordcloud
The next step after the labelling process is the classification stage using the SVM method. Before classification, however, we must calculate the word weights using TF-IDF. TF is the frequency of a term in a review, while IDF reflects how rare the term is across all reviews. In this study, a linear kernel is used for classification. The dataset is divided into 80% training data and 20% testing data, so the training data comprise 1115 reviews and the testing data 279 reviews.
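
The weighting and the split sizes can be illustrated with a short pure-Python sketch. The exact TF-IDF variant used in this study is not stated, so the sketch assumes one common form: TF as the relative frequency of a term in a review, and IDF as the natural logarithm of the number of reviews divided by the number of reviews containing the term. The function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of tokens).

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def split_sizes(n_reviews, train_fraction=0.8):
    """Return (training size, testing size), rounding the training size down."""
    n_train = int(n_reviews * train_fraction)
    return n_train, n_reviews - n_train


docs = [["pantai", "indah"], ["pantai", "kotor"]]
print(tf_idf(docs))
print(split_sizes(1394))
```

In this small example the term "pantai" occurs in every document, so its weight is zero, while "indah" and "kotor" receive positive weights. Applying the split to the 1394 reviews reproduces the 1115 training and 279 testing reviews reported above.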

Accuracy testing
We used a confusion matrix to evaluate our proposed model. A confusion matrix is a method for measuring the accuracy of a data mining or decision support system. In measuring performance with a confusion matrix, four terms represent the results of the classification process: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). The True Negative (TN) value is the amount of negative data detected correctly, whereas False Positive (FP) is negative data detected as positive. Meanwhile, True Positive (TP) is positive data detected correctly, and False Negative (FN) is positive data detected as negative. Based on Table 3, accuracy, precision, and recall can be calculated from the values of the confusion matrix as follows:
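
The standard definitions of these three measures in terms of TP, TN, FP and FN are given below. They are the usual textbook formulas and are assumed here to be the ones intended; the values of Table 3 are not reproduced.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\[4pt]
\text{Precision} &= \frac{TP}{TP + FP} \\[4pt]
\text{Recall}    &= \frac{TP}{TP + FN}
\end{aligned}
```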

Conclusion
Based on the research that has been done, it can be concluded that TripAdvisor users gave mostly positive reviews of tourism in Bangkalan Regency. The percentage of positive sentiment is 89%, while that of negative sentiment is 11%. The reviews labelled positive and negative were then classified using the linear Support Vector Machine (SVM) method, which achieved an accuracy of 70.22% for Indonesian-language reviews.