Exploring the role and emphasis of K-Means, Decision Tree and Distance-Based algorithms in data exception detection

K-Means, Decision Tree and Distance-Based algorithms are three important approaches to classifying data. Because they differ in method and emphasis, they are applied in different scenarios. The K-Means algorithm divides two- or three-dimensional data into several clusters to facilitate subsequent processing and analysis. The Decision Tree makes decisions along a "tree" structure; it is an important classification and regression method in data mining, and a predictive model expressed in the form of a tree (binary or multiway). The Distance-Based algorithm is a common anomaly detection method applicable to many data domains; it defines outliers in terms of nearest-neighbor distances. The purpose of this paper is to explore the core of these three anomaly detection algorithms through an example of data anomaly detection. The commonalities and differences of all three algorithms are therefore discussed and illustrated through a case of data exception handling.


Introduction
Data exception detection, as the name implies, identifies data that differ from normal data and deviate significantly from expected behavior. Generally speaking, the goal is to find objects that differ from most other objects, namely outliers. It is usually assumed that the data follow a "normal" model, and exceptions are deviations from that model; the precise definition of an exception depends on the application. In this essay, the author introduces three algorithms for detecting data exceptions: K-Means, Decision Tree and the Distance-Based algorithm.

These three algorithms serve diverse purposes across industries. The K-Means technique clusters data into distinct subsets. Clustering can be employed on its own to determine the data's internal distribution structure, or as a preprocessing step for classification [1][2][3]. K-Means is therefore often used to solve resource allocation problems, where researchers obtain efficient solutions after processing data with it; applications include road network partitioning [1], image segmentation and data prediction. The Decision Tree is also a widely used classification algorithm. Decision trees are easy to build without domain knowledge or parameter tuning. In practice, the decision tree is well suited to exploratory knowledge discovery, such as decision-making analysis of disastrous weather with the aid of a decision tree model [4], analysis of big data, and predicting and judging the value of medical methods and drugs. The Distance-Based algorithm focuses on handling data exceptions: it is mainly used to detect and process unexpected records in a dataset, and in different application fields these inconsistent data patterns are usually called exceptions. In fact, the Distance-Based algorithm is not the only way to detect exceptions; K-Means and the Decision Tree can do so as well, although those two are less commonly applied to this task. This paper therefore discusses and analyzes how K-Means and the Decision Tree detect data exceptions through an example, which reveals the similarities and differences of the K-Means, Decision Tree and Distance-Based algorithms for data exception handling.

The workflow is as follows. The author first performs preliminary data cleaning on the dataset, mainly handling missing values and repeated observations, which yields a dataset free of missing or duplicate records. The data are then visualized to obtain a two-dimensional distribution image, the best number of clusters is selected by inspecting the data's characteristics, and the data are processed with the K-Means algorithm: the center point of each cluster is computed, a threshold is set, and the points in each cluster whose distance from the center exceeds the threshold are regarded as abnormal. The results are then visualized. Similarly, after cleaning the data, the author applies the Decision Tree algorithm, obtains the resulting tree, classifies the data according to it and views the results. Finally, anomaly detection is performed on the cleaned dataset according to the formula of the Distance-Based algorithm, producing a visual result chart. The paper closes by comparing the similarities and differences of the three algorithms on this exception-handling task and giving analysis and suggested application scenarios.

Preliminaries
In this section, some preparations and basic concepts are briefly introduced, including data cleaning and the definitions of the K-Means, Decision Tree and Distance-Based algorithms.

Preparations
The dataset records the comprehensive exposure rate of an online advertisement, with statistics on the 'CPM' and 'CPC' indicators in chronological order. CPM (cost per mille) is the cost per thousand impressions, i.e. the cost for an advertisement to be displayed one thousand times on a webpage. CPC (cost per click) is the cost per click. Internet advertising operators charge advertising providers according to these data. However, the data at some time points may be falsified, and some points may be false traffic generated by operators. The main task is to detect these abnormal data points. The author first screens out and deletes the missing values of the dataset to reduce bias in the subsequent anomaly detection results.

K-Means Algorithm
Two definitions need to be introduced here. The first is the Manhattan distance between two points, also called city-block distance [1,3]. Since the coordinates are two-dimensional, the distance between two points $a = (x_1, y_1)$ and $b = (x_2, y_2)$ can be defined as:

$d(a, b) = |x_1 - x_2| + |y_1 - y_2|$  (1)

The second definition states the K-Means algorithm more clearly. The author defines some general notation: let the set of all points be $U = \{U_1, \ldots, U_n\}$ [4] and the set of centers be $C = \{C_1, \ldots, C_k\}$. The author defines the distance between the points and the centers as the total distance from each point to its nearest center:

$d(U, C) = \sum_{i=1}^{n} \min_{1 \le j \le k} d(U_i, C_j)$  (2)
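As a minimal illustration of these two definitions, the following Python sketch (with hypothetical sample coordinates) computes the Manhattan distance and the clustering cost of definition (2):

```python
# Manhattan (city-block) distance between two 2-D points a and b.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Clustering cost d(U, C) from definition (2): each point contributes
# its distance to the nearest center.
def clustering_cost(points, centers):
    return sum(min(manhattan(p, c) for c in centers) for p in points)

points = [(1, 2), (2, 1), (8, 9)]   # hypothetical sample points U
centers = [(1, 1), (9, 9)]          # hypothetical centers C
print(manhattan(points[0], points[1]))   # 2
print(clustering_cost(points, centers))  # 1 + 1 + 1 = 3
```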

Decision Tree Algorithm
Another technique, the Decision Tree and its derivatives, divides the input space into regions with independent parameters. The decision tree classification approach uses case-based inductive learning to create a tree-type classification model from unordered training samples. Each internal node stores the feature used to judge the category, and each leaf node indicates a final category. Figure 1 shows a simple decision tree with a classification rule along the route from root to leaf [5]. When testing a fresh sample, the operator starts at the root node, tests at each branch node, recursively enters the subtree along the relevant branch, and tests again until a leaf node is reached. The leaf node gives the test sample's predicted category [6].
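To make this root-to-leaf traversal concrete, here is a minimal sketch in Python; the node structure, thresholds and class labels are hypothetical illustrations, not the model trained later in the paper:

```python
class Node:
    """A branch node tests one feature against a threshold;
    a leaf node carries the predicted category."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.label = label  # set only on leaf nodes

def predict(node, sample):
    # Recursively enter the subtree chosen by the feature test
    # until a leaf node is reached.
    if node.label is not None:
        return node.label
    branch = node.left if sample[node.feature] <= node.threshold else node.right
    return predict(branch, sample)

# Hypothetical two-level tree: split on feature 0, then on feature 1.
tree = Node(feature=0, threshold=0.5,
            left=Node(label="normal"),
            right=Node(feature=1, threshold=2.0,
                       left=Node(label="normal"),
                       right=Node(label="exception")))

print(predict(tree, {0: 0.9, 1: 3.1}))  # exception
```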

Definition of impurity function.
The most important concept of the decision tree is the impurity function (I). When a node needs to be split, the task is in fact to find a suitable value of a suitable feature to serve as the splitting threshold. The question then arises: how is the appropriate value of the appropriate feature found? The main criterion is the change in impurity (ΔI). First, readers need to know the definition of the impurity function. An impurity function is not one specific function, but a general term for any function that satisfies a series of constraints: for a node with class proportions $p_1, \ldots, p_k$, the impurity $I(p_1, \ldots, p_k)$ is typically required to be maximal when all classes are equally likely, zero when the node is pure (some $p_i = 1$), and symmetric in its arguments.

Definition of information entropy.
First, introduce the concept of information entropy. Regard the sample extraction process as a random experiment A with k possible outputs $A_1, A_2, \ldots, A_k$, corresponding to the k classes. The information entropy of A is then defined as

$H(A) = -\sum_{i=1}^{k} p_i \log_2 p_i$

where $p_i$ is the probability of outcome $A_i$. Information entropy satisfies the constraints of an impurity function, so the impurity can be defined as $I = H(A)$.
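A small worked example of this entropy definition, with hypothetical class counts:

```python
import math

def entropy(counts):
    """Information entropy H(A) = -sum(p_i * log2(p_i)) over class proportions."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A pure node has zero entropy; an even 50/50 split is maximally impure (1 bit).
print(entropy([10, 0]))  # 0.0
print(entropy([5, 5]))   # 1.0
print(entropy([8, 2]))   # ~0.722
```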

Distance-Based Algorithm
For a given object p in dataset D, let $o_k$ denote the k-th nearest point to p. The k-distance of p is defined as the distance from p to this point, $k\text{-distance}(p) = d(p, o_k)$. Intuitively, all points in dataset D are sorted by their distance to the object p [8]; the distance from p to its k-th nearest neighbor $o_k$ is the k-distance.

From the k-distance, the author expands to a set of points: the set of all points whose distance to object p is less than or equal to the k-distance, which is called the k-neighborhood $N_k(p)$ [9] (Figure 2).

Definition of reachability distance and local reachable density.
The reachability distance of point A with respect to point B is defined as

$\text{reach-dist}_k(A, B) = \max\{k\text{-distance}(B),\ d(A, B)\}$

that is, the larger of B's k-distance and the direct distance between A and B. The local reachable density of point A is the reciprocal of the average reachability distance from the points in its k-neighborhood to A:

$\text{lrd}_k(A) = \left( \sum_{B \in N_k(A)} \text{reach-dist}_k(A, B) \,/\, |N_k(A)| \right)^{-1}$

Points that lie far from their neighbors have small local reachable densities. Note that $\max\{k\text{-distance}(B), d(A, B)\}$ is the reachability distance from each point B in $N_k(A)$ to A.

Definition of local outlier factor
LOF(A) is the ratio of the average local reachable density of the points in the k-distance neighborhood of point A to point A's own local reachable density lrd [10]:

$\text{LOF}_k(A) = \dfrac{\sum_{B \in N_k(A)} \text{lrd}_k(B) \,/\, |N_k(A)|}{\text{lrd}_k(A)}$

A data point's degree of anomaly is thus measured by its density relative to nearby data points, not by its absolute local density. Point A is more likely to be an inlier if its LOF(A) is small. The closer LOF(A) is to 1, the more similar the local reachable density of point A is to that of its k-nearest neighbors, and the less likely it is to be an outlier; the greater LOF(A) is, the more isolated point A is from other points and the more likely it is to be an abnormal value.
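Following the definitions of k-distance, reachability distance, local reachable density and LOF above, a minimal NumPy sketch (simplified to exactly k neighbors, ignoring distance ties) might look like this:

```python
import numpy as np

def lof_scores(X, k):
    """LOF for each row of X, following the definitions above:
    k-distance, reachability distance, local reachable density, LOF."""
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.argsort(D, axis=1)[:, :k]      # indices of the k nearest neighbors
    k_dist = D[np.arange(n), knn[:, -1]]    # k-distance of each point

    def lrd(a):
        # reach-dist_k(a, B) = max(k-distance(B), d(a, B)) for B in N_k(a);
        # lrd is the reciprocal of the mean reachability distance.
        reach = np.maximum(k_dist[knn[a]], D[a, knn[a]])
        return k / reach.sum()

    lrds = np.array([lrd(a) for a in range(n)])
    # LOF(a) = average lrd of a's neighbors divided by lrd(a).
    return np.array([lrds[knn[a]].mean() / lrds[a] for a in range(n)])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [8, 8]], dtype=float)
print(lof_scores(X, k=2))  # the isolated point (8, 8) scores far above 1
```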

Experiment Introduction
The dataset in this experiment is the online-advertisement dataset described in the Preliminaries: it records the comprehensive exposure rate of an online advertisement, with the 'CPM' and 'CPC' indicators recorded in chronological order. Data at some time points may be falsified, and some points may be false traffic generated by operators; the main task is to detect these abnormal data points.

Data pre-processing
The experiment first starts with data pre-processing. To improve the precision and accuracy of the experiment and the reliability of the data, the author first finds and deletes the missing values of the data; the figure below shows the data after cleaning. Pandas is used to search for missing values, screen out all records containing them and delete them, after which no empty values remain in the data.
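A minimal sketch of this cleaning step, assuming the dataset lives in a hypothetical file ads.csv with 'CPM' and 'CPC' columns:

```python
import pandas as pd

# Load the advertisement dataset (hypothetical file name and columns).
df = pd.read_csv("ads.csv")

print(df.isnull().sum())      # inspect missing values per column
df = df.dropna()              # delete rows containing missing values
df = df.drop_duplicates()     # remove repeated observations
df = df.reset_index(drop=True)
```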

K-Means Exception Detection
First, the pre-processed data are visualized (Figure 4 shows the visualization) and the number of clusters is judged from the characteristics of the data (this judgment is not very accurate); alternatively, K points (not necessarily sample points) are selected as initial cluster centers. The center points are chosen at random to form the initial clusters (if a randomly selected point has not been chosen before, a new cluster is created). Then each point is assigned to a cluster: the distance from each point to the K centers is calculated, each point is labeled with the number of its nearest cluster, and the old cluster memberships are cleared (Figure 5 shows an example with 5 center points). After all points have been assigned, the center position of each cluster is recomputed by summing the corresponding coordinates of all its points, and the distance between the old and new centers is calculated. By iterating this comparison, recomputing each sample's distance to the new cluster centers to determine the partition and the new within-cluster means, the ideal clustering is finally reached.
Finally, the distance between each point and its cluster center is calculated and a threshold is set; the threshold can be chosen according to the characteristics of the data. All points whose distance from their cluster center exceeds the threshold are regarded as exceptions.
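One way to realize this procedure is with scikit-learn's KMeans, continuing from the cleaning sketch above; note that scikit-learn clusters with Euclidean rather than Manhattan distance, and the choice of k and the threshold rule here are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

X = df[["CPM", "CPC"]].to_numpy()   # pre-processed data from the cleaning step

k = 5                               # number of clusters, judged from the plot
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Distance from each point to the center of its own cluster.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Threshold chosen from the data, e.g. mean plus two standard deviations.
threshold = dist.mean() + 2 * dist.std()
exceptions = X[dist > threshold]
print(f"{len(exceptions)} points flagged as exceptions")
```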

Decision Tree Exception Handling
First, the pre-processed data are used to build the training set and generate the feature matrix. The empirical entropy is calculated, then the conditional entropy of each selected feature; for continuous features the minimum conditional entropy is found, which yields the maximum information gain. The samples are sorted in ascending order of feature value together with their classification labels, and the threshold is initialized. The information gain is calculated based on the selected features. When the second termination condition of the decision tree is met, the class with the largest number of instances is returned. Then the feature with the maximum information gain (or information gain ratio) is selected, and the data are subdivided into subsets based on the selected feature [8]. Finally, the decision tree is visualized; Figure 6 shows the tree generated by the algorithm. The data exceptions can be sorted out from the classification generated by the Decision Tree algorithm.
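A hedged sketch of this step with scikit-learn: DecisionTreeClassifier with the entropy criterion approximates the information-gain procedure described above (scikit-learn builds binary CART-style trees), and the label vector y marking known exceptions is an assumption, not part of the original dataset description:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# y is an assumed label vector: 1 for known exceptions, 0 for normal samples.
clf = DecisionTreeClassifier(criterion="entropy",  # split by information gain
                             max_depth=3,
                             random_state=0)
clf.fit(X, y)

# Visualize the first layers of the tree, as in Figure 6.
plot_tree(clf, feature_names=["CPM", "CPC"], filled=True)
plt.show()

# Classify the data; points predicted as class 1 are treated as exceptions.
pred = clf.predict(X)
```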

Distance-based Exception Detection
The Distance-Based algorithm is much more straightforward than the other two. First, an object p is selected. Based on the definition of the k-neighborhood, the operator calculates the distance between each point and the object p. Points whose distance to p exceeds the average distance (or a threshold set in another way) are regarded as exceptions.
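A minimal sketch of the k-nearest-neighbor variant summarized in the Conclusion, computing each point's average distance to its k nearest neighbors and thresholding it (k and the threshold rule are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point
dists, _ = nn.kneighbors(X)                      # is its own nearest neighbor
avg_knn_dist = dists[:, 1:].mean(axis=1)         # drop the zero self-distance

# Points whose average k-NN distance exceeds the threshold are exceptions.
threshold = avg_knn_dist.mean() + 2 * avg_knn_dist.std()
exceptions = X[avg_knn_dist > threshold]
```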

Conclusion
All three algorithms introduced above can be used for data exception detection and handling.

K-Means Exception Detection
The cluster-based anomaly detection method K-Means assumes: 1) normal data instances belong to a data cluster, while anomalous instances do not; 2) normal data lie close to their nearest cluster centroid, whereas abnormal data lie far away; 3) normal data belong to large, dense clusters, while abnormal data belong to small or sparse clusters. Abnormal data are thus data that belong to a small cluster, belong to no cluster, or lie far from their cluster center.

Decision Tree Exception Detection
Decision trees are widely used in machine learning. The final conclusion of the decision process corresponds to the desired decision result; each decision problem tests an attribute, and the result of each test leads to a further decision problem whose scope is limited by the previous decision result. The core of the method is this sequence of "decisions" or "judgments", and humans naturally process choice problems using the tree structure of a decision tree. The data processor can observe that the computational complexity of the Decision Tree algorithm is not high, its output results are relatively straightforward and interpretable, it is insensitive to missing intermediate values, it can handle irrelevant feature data, and it can obtain a good model on medium-sized datasets with minimal effort [7]. On the other hand, overfitting may occur when the Decision Tree algorithm is used for data exception detection.

Distance-based Exception Detection
After calculating the average distance between each sample point and its nearest K samples, the value is compared with a threshold; a point exceeding the threshold is regarded as aberrant. The advantage is that no assumption about the distribution of the data is necessary. The disadvantage is that only global outliers can be found; local outliers cannot.

Figure 1.
Figure 1. A simple case of a Decision Tree. Decision tree classification algorithms are straightforward: decision tree classification can be used whenever the training sample set has feature vectors and categories, only the depth of the tree affects the predictive classification algorithm's complexity, and the method is linear, data-efficient and suitable for real-time classification. Most classification and regression trees are binary and divide and rule from the top down, in a recursive process from root to leaf that searches for "partition" attributes at intermediate nodes [6,7]: 1) Build the root node, place all training data there, and choose one ideal feature; the child nodes receive subsets of the training dataset. 2) Internal node features split all subsets recursively. 3) Build leaf nodes, assigning subsets to leaf nodes once they can be classified mostly accurately. 4) A decision tree is formed once each subset has a leaf node, i.e. a clear class. This is divide-and-conquer.

Figure 2.
Figure 2. Illustration of the distance-based algorithm.

Figure 4.
Figure 4. Example of visualizing the data.

Figure 5.
Figure 5. Example of selecting the clusters' centers.

Figure 6.
Figure 6. First two layers of the tree structure generated by the Decision Tree algorithm.