Group chat analysis of hoax detection during the covid-19 pandemic using the k nearest neighbors algorithm and massive text processing

Group chat is one of the most widely used channels for sharing short pieces of information. Besides making messages easy to send, sharing a short message in a group chat is considered more effective than sending it separately to many users. This ease of sending is often exploited to spread fake or untrue news (hoaxes), especially during the Covid-19 pandemic, when information can be shared by anyone without checking for a valid source. Disseminating Covid-19 information without a clear source is dangerous because it can mislead users and put them at risk. Fake-message detectors have not been widely implemented in instant messaging applications, so there is a need for a detector that analyzes activity in a group chat and determines whether a message contains fake news. If a group chat contains a lot of fake news, the group is not a good one to follow. The K-Nearest Neighbors algorithm is considered quite effective for classifying objects, and its result can indicate whether a message is fake news, misinformation, or true news. Messages are processed with the massive text processing method because text characteristics differ between users, so the text must be processed thoroughly before classification. As a result, a group chat can be analyzed based on active time, user messages, user activity, and messages sent between users.


Introduction
The spread of false information strongly affects people's social lives because it can change their mindsets, opinions, and decisions. Information on the internet began to influence the wider community several years ago [1].
A hoax can be defined simply as false or untrue information. The term has become more prominent as the use of technology, especially the internet, has widened [15]. The ease with which users share information is one reason hoaxes spread quickly. The main purpose of spreading hoax information is to steer public opinion and manipulate situations through threats and various kinds of fabricated information [4].
False information spreads mostly through social media. Easy access to social media and the absence of supervision allow users to disseminate any information without citing a source. One medium through which false information often spreads is group chat. In a group chat, users can freely send text, news, links, videos, images, and documents without including a source. In addition, message forwarding and broadcast features allow users to share messages faster and send them to many other users at once [5].
K Nearest Neighbors (KNN) is an algorithm widely used in data classification, data mining, and machine learning because it is simple to implement and its results are easy to measure. In simple terms, KNN determines its result by measuring how close the input parameter values are to the records in a prepared dataset and deciding from the k closest records, for a chosen value of k [14] [6].
KNN has been widely used to classify data. One application is detecting spam e-mail, where different methods yield different levels of accuracy, such as the extended KNN [3], Spearman correlation as the distance measure [11], and combinations with other algorithms such as the Recurrent Neural Network (RNN) [13].
Apart from detecting spam e-mail, the KNN algorithm has been used together with the Support Vector Machine to detect comments on social media [2]. In hoax detection research, KNN was used together with other methods such as TF / TDM, which resulted in accuracy between 74.1% and 83.6% [14].
Besides KNN, many other algorithms have been developed for hoax detection and could help optimize the final results, such as Naïve Bayes optimization [9], N-gram analysis [1], and Stochastic Gradient Descent [8].

Research procedure
The research procedure used in this study is shown in Figure 1. Two main sets of data are processed: a comparison dataset of hoax information collected during the pandemic and last updated in October 2020, and message data from the group covering February to October 2020.

Data collection
Data are collected by exporting the message data provided by the group chat application service provider. The data are stored in text format and include the time, the sender's contact, the message, and other information such as media delivery, members being added to the group, and members leaving the group. The comparison dataset of hoax information is taken from a previously collected public dataset.
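As an illustration of how such an export could be read, the following is a minimal sketch. The line format shown in the comments is a hypothetical one chosen for the example, since the export format is not specified here, and the function name `parse_line` and the field names are likewise assumptions.

```python
import re
from datetime import datetime

# Hypothetical export format (an assumption, not taken from the study):
#   "12/03/2020, 08:15 - Alice: message text"   -> user message
#   "12/03/2020, 08:20 - Bob left"              -> additional information
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - (.*)$")


def parse_line(line):
    """Return a dict for one exported line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    date_part, time_part, rest = match.groups()
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%d/%m/%Y %H:%M")
    sender, sep, message = rest.partition(": ")
    if not sep:
        # No "sender: " prefix, so treat the line as additional information.
        return {"timestamp": timestamp, "sender": None,
                "message": rest, "kind": "info"}
    return {"timestamp": timestamp, "sender": sender,
            "message": message, "kind": "message"}
```

Lines that do not start with a timestamp return None in this sketch; a fuller parser would probably append them to the previous message as continuation lines.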

Pre-Processing data
Cleaning data.
Data cleaning is the process of deleting data that are not needed in the research. Because this study focuses on text processing, media data are excluded from the subsequent steps. Some explanatory text contained in the data is also removed so that the analysis can focus on the hoax messages and the data used.

Data normalization.
Data normalization is done to make the data consistent so that they can be processed properly with fewer errors. The details of the processed data are shown in Table 1. Non-standard words cannot be processed properly because their meaning is biased and ambiguous. Apart from eliminating non-standard words, the use of standard words is analyzed so that the data can be processed optimally [7].
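A minimal sketch of one possible normalization step is given below. It assumes that non-standard words are mapped to a standard form through a small lookup table; the table `STANDARD_FORMS` and its entries are hypothetical, and the exact rules applied in this study may differ (for example, by dropping such words instead of replacing them).

```python
import re

# Hypothetical mapping from non-standard spellings to standard words.
STANDARD_FORMS = {"u": "you", "pls": "please", "cov19": "covid"}


def normalize(message):
    """Lowercase, strip punctuation, and replace non-standard words."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return " ".join(STANDARD_FORMS.get(token, token) for token in tokens)
```

For instance, under these assumptions `normalize("Pls share, U must know about COV19!")` yields `"please share you must know about covid"`.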

Processing data
Massive text processing.
Massive text processing separates each message into the parts that are needed and the parts that are not. The required parts are stored in a database for further analysis. The flow of massive text processing is shown in Figure 2. Message content is grouped into text, media, links, and emoji, and each group is stored in the database. The main objectives of massive text processing are fast data processing and data grouping that makes analysis easy.
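The grouping into text, media, links, and emoji could be sketched as follows. This is an illustration under stated assumptions, not the pipeline of Figure 2: the attachment placeholder `MEDIA_MARKER` is hypothetical, the emoji pattern covers only a rough subset of Unicode ranges, and the database step is left out.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Rough emoji ranges; an approximation, not a complete Unicode emoji list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Hypothetical placeholder that an export might use for attachments.
MEDIA_MARKER = "<Media omitted>"


def split_message(message):
    """Split one message into text, media count, links, and emoji."""
    parts = {"text": "", "media": 0, "links": [], "emoji": []}
    if MEDIA_MARKER in message:
        parts["media"] = 1
        message = message.replace(MEDIA_MARKER, "")
    parts["links"] = URL_RE.findall(message)
    message = URL_RE.sub("", message)
    parts["emoji"] = EMOJI_RE.findall(message)
    message = EMOJI_RE.sub("", message)
    # Collapse the whitespace left behind by the removed parts.
    parts["text"] = " ".join(message.split())
    return parts
```

Each returned group could then be written to its own table or column, so that later analysis steps read only the part they need.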

Data classification using K Nearest Neighbors (k-NN).
Data classification is done using the KNN algorithm. The algorithm works by calculating a value from the given parameters: each message is scored by its level of similarity to the prepared dataset. A message that is similar to an entry in the dataset is taken to have more or less the same meaning as that entry.
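The similarity-based classification can be illustrated with a small sketch. The features and the similarity measure are assumptions made for the example (word counts compared with cosine similarity), and the function names and labels are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a message into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(message, dataset, k=1):
    """Classify a message from its k most similar dataset entries.

    dataset is a list of (text, label) pairs. Returns the majority label
    among the k nearest entries and the similarity of the closest one.
    """
    if not dataset:
        raise ValueError("dataset must not be empty")
    query = vectorize(message)
    scored = sorted(
        ((cosine_similarity(query, vectorize(text)), label)
         for text, label in dataset),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbors = scored[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors[0][0]
```

With k = 1, the label is simply that of the single most similar dataset entry, and the returned similarity indicates how closely the message matches it.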

Data analysis
Word cloud.
A word cloud depicts how often a word appears in a text [7]. In this study, the word cloud is used to see in detail which topics are discussed most frequently in the group.
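The word frequencies that a word cloud visualizes can be computed with a short sketch like the one below. The stop-word list is a hypothetical placeholder, and the rendering of the cloud itself is not shown.

```python
from collections import Counter

# Hypothetical stop-word list; a real one would match the chat language.
STOP_WORDS = {"the", "a", "is", "in", "and", "of", "to"}


def word_frequencies(messages, top_n=10):
    """Count words across messages and return the top_n most frequent."""
    counts = Counter()
    for message in messages:
        for word in message.lower().split():
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)
```

The resulting (word, count) pairs can be passed to any word cloud renderer, where a larger count corresponds to a larger word.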

Group activity analysis.
Based on massive text processing, group data can be analyzed to find out how active the group is, when its peak times are, who sends the most messages, and on which dates the group was most active in replying to messages.
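A minimal sketch of such an activity summary is shown below, assuming each message has already been reduced to a (timestamp, sender) pair. The function name and the sample records are illustrative only and are not data from this study.

```python
from collections import Counter
from datetime import datetime


def activity_summary(records):
    """Summarize activity from (timestamp, sender) pairs of user messages."""
    records = list(records)
    if not records:
        raise ValueError("no messages to analyze")
    by_hour = Counter(ts.hour for ts, _ in records)
    by_date = Counter(ts.date() for ts, _ in records)
    by_sender = Counter(sender for _, sender in records)
    return {
        "peak_hour": by_hour.most_common(1)[0][0],
        "busiest_date": by_date.most_common(1)[0][0],
        "top_sender": by_sender.most_common(1)[0][0],
    }


# Illustrative records only; these are not data from the study.
records = [
    (datetime(2020, 3, 12, 8, 15), "Alice"),
    (datetime(2020, 3, 12, 8, 40), "Bob"),
    (datetime(2020, 3, 12, 14, 5), "Alice"),
    (datetime(2020, 3, 13, 8, 1), "Alice"),
]
summary = activity_summary(records)
```

Counting by hour, date, and sender separately keeps each question (peak time, active dates, most active member) independent of the others.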

Result and Discussion
Data processing begins with collecting data from a group chat. The data are then cleaned and normalized so that unnecessary data are not processed in this study. Chat data are grouped by date, time, sender contact, message, and additional information. Additional information is the information contained in a message about events such as a user being added, a message being deleted, media being uploaded, and a member leaving the group.
The chat data are then processed to find out which topics are often discussed and when the group is active. The activity of the group is shown in Figure 3. In addition to group activity, message types can be analyzed to show which types are shared most often in the group, who sends messages most often, and what messages they send.
The most frequently discussed topics are mapped with a word cloud, as shown in Figure 4. A total of 4519 chat records were analyzed to detect hoaxes using KNN with K = 1, as shown in Figure 5 and Table 2. In the results, the higher the value, the more likely a message sent within the group is to be detected as a hoax message; conversely, the smaller the value, the less likely it is to be detected as a hoax message.

Conclusion
For hoax detection in a group chat using massive text processing and KNN, with raw data from the group chat and public datasets, it can be concluded that the group was most active between February 2020 and March 2020 and remained active until June 2020. During that period, many COVID cases were being reported in Indonesia. The most active times were around 8.00 AM to 9.00 AM and in the afternoon around 2.00 PM. The most discussed topics were UNNES and COVID. The word cloud shows that, in addition to these two topics, there were several discussions such as the corona case, the corona case update, and the Indonesian corona. As for hoax detection, each message has different data characteristics. With an average value of 0.536837, it can be concluded that the group chat is still within normal limits, meaning that it does not contain much false news. The minimum value of 0.142910 and the maximum value of 0.910547 show that some members do occasionally share fake news whose source is unclear, but many also provide reliable data.