Study of Chinese Spam Filtering Based on Improved Naive Bayesian Classification Algorithm

Spam is a growing threat to mobile communications. This paper discusses several mitigation techniques, including whitelists and blacklists, challenge-response, and content-based filtering. However, none of them is perfect, so it makes sense to use a classification algorithm with higher accuracy. The Bayesian classification method shows high accuracy in spam processing and has therefore attracted extensive attention. In this paper, a Bayesian classification method based on an annealing evolution algorithm is introduced into Chinese spam filtering to improve classification accuracy. Our simulation results show that the algorithm performs better in spam filtering.


Introduction
According to Spam and Phishing in 2019 [1], released by Kaspersky on Securelist on April 8, 2020, spam accounted for 56.51% of total mail in 2019, 4.03% higher than in 2018. The largest source of spam was China, accounting for 21.26%. Effectively filtering Chinese spam is therefore an important research topic.
E-mail has become an indispensable tool in people's daily life and is used for many kinds of information exchange. The rapid development of e-mail and related services has brought the number of e-mail users to a very high level, and with it a proliferation of spam on the network. Spam mainly comes from anonymous forwarding servers, one-time accounts and zombie hosts. To promote their products or websites, some senders distribute e-mail indiscriminately, so that spam far exceeds the volume of normal e-mail. It occupies a large amount of users' mailbox space and network bandwidth, interferes with normal use, infringes on the legitimate rights and interests of users, and brings a heavy economic burden and social pressure to mail service providers [2]. This is not only a technical issue, nor just a policy and legal issue, but a global and comprehensive one. Therefore, solving the spam problem quickly and effectively has great practical significance [3].
Therefore, many experts propose analyzing the content of e-mail to identify spam. This approach combines spam filtering with text classification and information filtering, and introduces the methods commonly used in those fields into spam filtering.
In this paper, naive Bayesian classification based on a simulated annealing genetic algorithm is introduced into spam filtering. Compared with the naive Bayesian method, the new method improves the accuracy of spam filtering, reduces the misjudgment rate for non-spam, achieves satisfactory results, and improves the quality of e-mail filtering at the algorithmic level.

Overview of naive Bayesian classification
Naive Bayes relies on two basic assumptions to simplify the calculation: (1) all features in the data set are conditionally independent given the class label; (2) the prediction process is not affected by other latent variables. According to the first assumption, the following joint probability can be obtained for a mail with features $w_1, \dots, w_n$:
$$P(w_1, \dots, w_n \mid C) = \prod_{i=1}^{n} P(w_i \mid C) \quad (1)$$
The message category in spam determination is $C \in \{0, 1\}$, where 1 indicates spam and 0 indicates non-spam. The task of the mail classifier is to calculate the probability that the mail to be classified is spam; if it exceeds a certain threshold, the mail is considered spam. On the basis of Bayesian theory, for any mail sample $x_t$:
$$P(C_j \mid x_t) = \frac{P(C_j)\, P(x_t \mid C_j)}{P(x_t)} \quad (2)$$
In formula (2), $P(C_j \mid x_t)$ represents the probability that mail $x_t$ belongs to class $C_j$, $P(C_j)$ is the class prior probability, and $P(x_t \mid C_j)$ is the class-conditional probability of $x_t$ given $C_j$.
It is generally believed that users would rather accept multiple spam messages than miss a non-spam message. This cost relationship is usually described by a cost factor $\lambda$. It is therefore reasonable to give $\lambda$ a large value, for example $\lambda = 9999$, meaning that the loss from misreporting one non-spam message as spam is equivalent to the loss from missing 9999 spam messages. Formula (5) can be expressed as: if $\lambda = 9999$, then when $P(C = 1 \mid x_t)$ ...
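As an illustration of the classifier described above, the following is a minimal sketch rather than the implementation used in this paper. It assumes that each mail has already been segmented into a list of word tokens, that word likelihoods are estimated with Laplace smoothing, and that the cost factor enters through the commonly used rule "classify as spam when $P(C=1 \mid x_t) / P(C=0 \mid x_t) > \lambda$", which is equivalent to $P(C=1 \mid x_t) > \lambda / (1 + \lambda)$. The function names `train_nb`, `posterior_spam` and `is_spam` are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Estimate class priors and Laplace-smoothed word log-likelihoods.

    docs   : list of token lists (word segmentation is assumed done elsewhere)
    labels : list of 0 (non-spam) / 1 (spam); both classes must be present
    """
    vocab = {w for d in docs for w in d}
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for d, c in zip(docs, labels):
        counts[c].update(d)
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for c in (0, 1):
        total = sum(counts[c].values())
        model["prior"][c] = n_docs[c] / len(docs)
        # Laplace smoothing (an assumption of this sketch, not from the paper)
        model["loglik"][c] = {
            w: math.log((counts[c][w] + 1) / (total + len(vocab))) for w in vocab
        }
    return model


def posterior_spam(model, doc):
    """P(C=1 | x) from Bayes' rule under the conditional independence assumption."""
    log_joint = {}
    for c in (0, 1):
        s = math.log(model["prior"][c])
        for w in doc:
            if w in model["vocab"]:  # ignore words never seen in training
                s += model["loglik"][c][w]
        log_joint[c] = s
    # Normalize in log space to avoid underflow on long mails
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / z


def is_spam(model, doc, lam=9999.0):
    """Assumed rule: spam iff P(1|x)/P(0|x) > lam, i.e. P(1|x) > lam/(1+lam)."""
    return posterior_spam(model, doc) > lam / (1.0 + lam)
```

Under this assumed rule, $\lambda = 9999$ corresponds to a posterior threshold of 0.9999, so a mail is only rejected when the classifier is almost certain that it is spam, while $\lambda = 1$ reduces to the usual 0.5 threshold.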

Overview of Annealing Evolution Algorithm
Basic thought of Genetic Algorithm.
Genetic algorithm theory was first proposed by the American scientist Holland, who developed it from Mendelian genetics and Darwin's theory of evolution and captured the basic characteristics of how genetic information is transmitted. Darwin held that species survive in the process of evolution by improving their adaptability to the environment: although a species retains the basic characteristics of the previous generation, variation arises during evolution, and the variants better adapted to the environment survive, realizing the survival of the fittest. Mendelian genetics shows that the genetic instruction code of a cell is contained in the chromosomes; each gene occupies a fixed position and is associated with a particular trait, so each gene also affects how well offspring adapt to the environment. Genes suited to survival can be selected through gene hybridization and gene mutation, and unsuitable individuals are eliminated.

Basic thought of Simulated Annealing algorithm.
In 1982, Kirkpatrick et al. introduced the idea of annealing into the field of combinatorial optimization and proposed the Simulated Annealing algorithm, an effective approximate algorithm for solving large-scale combinatorial optimization problems, especially NP-complete ones. Compared with earlier approximation algorithms, it is simple to describe, flexible to use, widely applicable, efficient, and less constrained by initial conditions, and it has high practical value [4].
The simulated annealing algorithm is a stochastic optimization technique used to solve continuous, ordered discrete and multimodal optimization problems. It simulates the cooling process of a classical particle system in thermodynamics to find the extreme value of a programming problem. It regards the objective function of the problem as the energy function of the object, each solution of the problem as a state of the annealing process, and the optimal solution as the lowest energy state reached after annealing. It imitates the solid annealing process and divides the algorithm into three phases: the heating process, in which the initial high temperature is set; the isothermal process, which is the search at a given temperature; and the cooling process, in which the temperature falls from the initial temperature to the final temperature. Through this simulation, the optimal or near-optimal solution of the problem is finally obtained [5].
The simulated annealing algorithm is a general optimization algorithm. It has been widely used in various fields and has become an optimization method with good development prospects. The specific algorithm steps are as follows:
Step 1: Given the initial temperature $t = t_0$, randomly generate the initial state $S = S_0$ and let $k = 0$.
Step 2: Generate a new state $S_1$ by random disturbance, and calculate the energy difference $\Delta E = E(S_1) - E(S_0)$.
Step 3: If $\Delta E \le 0$, or otherwise with probability $\exp(-\Delta E / t)$ (the Metropolis criterion), accept state $S_1$ as the current state.
Step 4: Lower the temperature and set $k = k + 1$.
Step 5: If the termination condition is not reached, return to Step 2; otherwise, the algorithm ends and the result is output.
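The steps of simulated annealing can be sketched in a few lines of Python. This is an illustrative sketch only: the acceptance test is the standard Metropolis criterion, and the geometric cooling schedule (`t *= alpha`), the number of trials per temperature, and all parameter names and default values are assumptions of the example, not settings taken from this paper.

```python
import math
import random


def simulated_annealing(energy, neighbor, s0, t0=1.0, t_end=1e-3,
                        alpha=0.95, steps_per_temp=50, seed=0):
    """Minimize `energy` starting from state `s0`.

    energy   : function mapping a state to a real number (lower is better)
    neighbor : function (state, rng) -> randomly disturbed new state
    The cooling schedule t <- alpha * t is an assumption of this sketch.
    """
    rng = random.Random(seed)
    s, e = s0, energy(s0)            # Step 1: initial state at temperature t0
    best_s, best_e = s, e
    t = t0
    while t > t_end:                 # Step 5: stop once the final temperature is reached
        for _ in range(steps_per_temp):   # isothermal search at temperature t
            s1 = neighbor(s, rng)    # Step 2: random disturbance
            e1 = energy(s1)
            delta = e1 - e
            # Step 3: Metropolis criterion
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                s, e = s1, e1
                if e < best_e:
                    best_s, best_e = s, e
        t *= alpha                   # Step 4: lower the temperature
    return best_s, best_e
```

For example, calling `simulated_annealing(lambda x: (x - 3) ** 2, lambda x, rng: x + rng.uniform(-0.5, 0.5), 0.0)` searches for the minimum of a simple quadratic. The sketch returns the best state seen rather than the final state, which is a common practical choice.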

Annealing Evolution Algorithm.
The simulated annealing algorithm (SA) and the genetic algorithm (GA) are both derived from laws of nature and are two branches of nature-inspired computation. It is meaningful to study how to combine them organically and improve their efficiency, which is also in line with the unity of opposites found in nature. The annealing evolution algorithm combines SA and GA so that their advantages complement each other: it exploits the local search ability of SA and the global search ability of GA, and overcomes the poor global search ability and low efficiency of SA as well as the poor local search ability and premature convergence of GA.
The genetic algorithm can improve the simulated annealing algorithm as follows: the capacity m of the KNN subset is adjusted dynamically, where $\bar{f}$ is the average fitness value and ∂ is a coefficient. In the early phase of the algorithm, there is a large difference between the maximum fitness value and the average fitness value, so the capacity of the KNN subset is large. In the middle phase, the gap between the maximum and average fitness values begins to narrow. In the later phase, there is hardly any difference between the maximum and average fitness values, and the capacity of the KNN subset becomes smaller. To improve the operating efficiency of the algorithm, it should search in a more precise range, so a dynamic adjustment parameter needs to be added to the algorithm.

Bayesian classification method based on simulated annealing genetic algorithm
Fitness Function.
Most genetic algorithms use only the objective function as the fitness function in evolutionary search. The evaluation of the fitness function is the basis of the selection operation, and the design of the fitness function directly affects the performance of the genetic algorithm. In specific applications, the design of the fitness function should reflect the requirements of the problem being solved, and for different problems the fitness function takes many forms. Depending on how the objective function is treated, fitness functions fall into two categories: mapping the objective function into a fitness function, and fitness function calibration.
In this paper, the fitness function is defined in terms of the following quantities: R is the classification accuracy of NBC on the verification set, D is the difference of NBC on the verification set, and λ is the coefficient that determines the influence of the degree of difference.

Crossover and mutation operations.
Step 4: Adjust λ and repeat Step 3.
The genetic algorithm uses the selection operation to realize survival of the fittest among the individuals in the population: individuals with high fitness are more likely to be passed on to the next generation than individuals with low fitness. The selection operation selects some individuals from the parent population and passes them on to the next generation population.
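The selection operation can be illustrated with fitness-proportional (roulette-wheel) selection. The paper does not state which selection scheme it uses, so roulette-wheel selection is an assumption of this sketch, as is the hypothetical function name `roulette_select`.

```python
import random


def roulette_select(population, fitness, k, rng=None):
    """Draw k individuals with probability proportional to their fitness.

    population : list of individuals from the parent generation
    fitness    : function mapping an individual to a non-negative number
    Roulette-wheel selection is an assumed scheme, not taken from the paper.
    """
    rng = rng or random.Random()
    weights = [fitness(ind) for ind in population]
    if min(weights) < 0:
        raise ValueError("fitness values must be non-negative")
    if sum(weights) == 0:
        # All individuals are equally unfit: fall back to uniform sampling
        return [rng.choice(population) for _ in range(k)]
    # Sampling is with replacement, so fit individuals may appear several times
    return rng.choices(population, weights=weights, k=k)
```

Because sampling is with replacement, an individual with high fitness can contribute several copies to the next generation, while an individual with zero fitness is never selected unless every individual has zero fitness.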

Implementation and Experimental Analysis
TREC06 is a public spam corpus provided by the Text REtrieval Conference (TREC). It is divided into an English data set (trec06p) and a Chinese data set (trec06c). The mail it contains comes from real mail and retains the original format and content. The training set, validation set and test set used in this experiment are all drawn from the TREC06 data set. A total of 3000 emails are randomly selected for spam filtering: 1500 for the training set, 750 for the validation set and 750 for the test set (including 500 spam and 250 non-spam). At the same time, to make the results more general, the experiment also uses the Chinese corpus trec06c as an experimental sample, taking its first 10000 emails as the data source, including 6631 spam and 3369 non-spam messages.
A confusion matrix is used to evaluate the classification of spam and non-spam. Four outcomes are possible: true positive (spam classified as spam), false negative (spam classified as non-spam), false positive (non-spam classified as spam) and true negative (non-spam classified as non-spam). In our case, precision measures how exact spam identification is, and recall measures how complete it is. It can be seen from Table 2 that the improved Bayesian classification method proposed in this paper is superior to the naive Bayesian algorithm in both precision and recall in spam filtering. The comparison of the experimental results shows that the method in this paper does improve the overall performance of spam filtering.
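The evaluation measures can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions of precision, recall and accuracy with spam (label 1) as the positive class; the function names are hypothetical and the code is not the evaluation script used in the experiment.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with spam (label 1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred):
    """Standard definitions; a ratio with a zero denominator is reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of flagged mail that is spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of spam that is caught
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy
```

A false positive here is a non-spam message flagged as spam, which is the costly error discussed in the cost-factor argument, so precision is the measure most sensitive to it.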

Conclusion
In this paper, naive Bayesian classification based on a simulated annealing genetic algorithm is introduced into spam filtering. Compared with the naive Bayesian method, the new method improves the accuracy of spam filtering, reduces the misjudgment rate for non-spam, achieves satisfactory results, and improves the quality of e-mail filtering at the algorithmic level. This paper has achieved some results in spam classification, but many deficiencies remain that need further study. The main one is that, because the naive Bayesian algorithm introduces the hypothesis