Hybrid Data Mining Method of Telecom Customer Based on Improved Kmeans and XGBoost

To mine telecom customers with special behaviors from the vast voice communication records of a telecom company, this paper proposes a novel Hybrid Data Mining method of Telecom Customers (HDMTC), which integrates Kmeans and XGBoost into one framework. First, an AHP model is used to model the features of customers with special behaviors. Then, following a semi-supervised methodology, the Kmeans approach is improved with a small number of tagged samples that determine the initial cluster centroids and with a weighted Euclidean distance. The improved Kmeans is used to tag the decision attribute of the data and thereby construct the training dataset. After that, an XGBoost model is trained on this training dataset. Finally, the effectiveness of HDMTC is validated on real records from a telecom company.


Introduction
Nowadays, the rapid development of communication technology has substantially changed the way people live and work, but phone spam and telecom fraud increasingly intrude into people's lives. Telecom companies therefore critically need to discover customers with special behaviors in massive telecom communication records via data mining and to handle them with specific solutions.
Wang used a fuzzy decision tree to recognize customers' communication behaviors from telecom voice communication records [1]. Liu analyzed the relations between customers to mine strongly associated customers via a semantic association algorithm based on fuzzy sets; without an analysis of customers' behaviors, however, the recall rate of this method was not good enough [2]. Song built an XGBoost model to classify abnormal customers based on correlation coefficients and temporal features at diverse scales [3]. Zhou and Zou used improved Kmeans algorithms, which are unsupervised learning methods, to build customer data mining models. However, without the guidance of tagged data, the precision and recall of these unsupervised methods are not good enough [4][5]. Henriques combined Kmeans and XGBoost to detect abnormal events in massive log records [6]. Du integrated Kmeans and SVM to improve accuracy [7].
In this paper, a Hybrid Data Mining method of Telecom Customers (HDMTC) with special behaviors is proposed. Because it relies on only a small amount of tagged data, this method not only decreases the cost of tagging but also improves data mining performance. The procedure of the algorithm is shown in Figure 1.

Preprocessing
The raw data were selected from the voice communication records of a China telecom company. Each record has 18 attributes. The raw data were preprocessed in three stages: filling missing values, deleting inconsistent or abnormal data, and extracting the attributes needed to model the features of special behaviors.
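The three preprocessing stages can be illustrated with a minimal Python sketch. The attribute names in `FEATURES`, the median fill strategy, and the rule that a negative value marks an abnormal record are all assumptions made for illustration; the source does not list the 18 attributes or the exact cleaning rules.

```python
from statistics import median

# Hypothetical attribute names; the real records have 18 attributes not listed in the text.
FEATURES = ["call_count", "avg_duration", "callee_count"]


def preprocess(records):
    """Fill missing values, drop abnormal records, keep only the selected attributes."""
    # Per-attribute medians over the values that are present (assumed fill strategy).
    medians = {
        name: median(r[name] for r in records if r.get(name) is not None)
        for name in FEATURES
    }
    cleaned = []
    for record in records:
        row = dict(record)
        # Stage 1: fill missing values with the attribute median.
        for name in FEATURES:
            if row.get(name) is None:
                row[name] = medians[name]
        # Stage 2: delete abnormal records (assumed rule: any negative value).
        if any(row[name] < 0 for name in FEATURES):
            continue
        # Stage 3: extract only the attributes used to model special behaviors.
        cleaned.append({name: row[name] for name in FEATURES})
    return cleaned
```

In this sketch the medians are computed before abnormal records are removed, which mirrors the stage order given in the text; a stricter pipeline might filter first.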

Modeling the Features of Customers' Special Behaviors
For the special behaviors of telecom customers, strongly correlated features were selected to construct a feature-based dataset. In this paper, the special behaviors, as identified by experts of the telecom company, are phone scams, advertisement calls, and business calls. Furthermore, using the quantitative and qualitative analysis of the Analytic Hierarchy Process (AHP), the decision maker balances the importance of the specified objectives and sets the weights of the objective features.
The stages of the AHP model are described as follows. Stage 1: Build the AHP model. In terms of decision attributes, all customers with special behavior t are placed on the top level. Customers with phone scam behavior f, advertisement behavior a, and business behavior b are placed on the middle level. According to the decision rule, the relevant conditional attributes of voice communication are placed on the bottom level. The AHP model is shown in Figure 2.
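In AHP, feature weights are typically derived from a pairwise judgement matrix built over the hierarchy. The text here does not show that computation, so the following is only a minimal Python sketch of one common way to do it, assuming the geometric-mean approximation of the principal eigenvector and Saaty's random index table for the consistency ratio. The function name `ahp_weights` and any matrix passed to it are hypothetical.

```python
import math

# Saaty's random consistency index for matrix orders 1 to 9.
RANDOM_INDEX = [0.0, 0.0, 0.58, 0.90, 1.12, 1.24, 1.32, 1.41, 1.45]


def ahp_weights(judgement):
    """Return (weights, consistency_ratio) for a square pairwise judgement matrix."""
    n = len(judgement)
    # Geometric mean of each row, normalised so the weights sum to 1.
    geo_means = [math.prod(row) ** (1.0 / n) for row in judgement]
    total = sum(geo_means)
    weights = [g / total for g in geo_means]
    # Estimate the largest eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(a * w for a, w in zip(row, weights)) for row in judgement]
    lambda_max = sum(x / w for x, w in zip(aw, weights)) / n
    # Consistency index and consistency ratio.
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n - 1]
    cr = ci / ri if ri > 0 else 0.0
    return weights, cr
```

A consistency ratio below 0.1 is conventionally taken to mean the judgement matrix is acceptably consistent; otherwise the pairwise judgements are usually revised.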

Selection of Initial Cluster Centroid
The selection of the initial cluster centroids significantly affects the final result of the Kmeans algorithm. In this paper, tagged samples guide the selection of the initial cluster centroids, in contrast to the traditional random selection. Based on the experience of telecom experts, a small number of samples are selected from the dataset and tagged to form a training subset, such that each cluster contains at least one tagged sample. Finally, the initial cluster centroids are calculated from these tagged samples.
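The seeding step can be sketched in Python as follows. The text does not state how the centroids are calculated from the tagged samples, so this sketch assumes each initial centroid is the mean of the tagged samples in its cluster; the function name `initial_centroids` and the `(feature_vector, label)` input layout are hypothetical.

```python
def initial_centroids(tagged):
    """Compute one initial centroid per cluster from expert-tagged samples.

    tagged: list of (feature_vector, cluster_label) pairs.
    Returns a dict mapping each cluster label to the mean of its tagged samples.
    """
    sums = {}
    counts = {}
    for vector, label in tagged:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        # Accumulate the per-feature sum for this cluster.
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    # Assumed rule: centroid = arithmetic mean of the tagged samples.
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }
```

Because every cluster has at least one tagged sample, every cluster receives a centroid, and the cluster labels produced by Kmeans keep their expert-assigned meaning.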

Weighted Euclidean Distance
The Euclidean distance used in classical Kmeans does not take the diversity and correlation of samples into consideration and uses the same weights to measure similarity for all samples [8][9]. Therefore, in this paper, the Euclidean distance is adjusted according to the optimized judgement matrix and weights, so that it reflects the importance of the specified features.
The performance of XGBoost, SVM, the naïve Bayesian model with improved Kmeans, and our XGBoost with improved Kmeans is compared in Table 2. For customers with special behaviors, the F1 score of our XGBoost with improved Kmeans, 95.5%, is higher than that of the other models. For customers with business behaviors, its F1 score of 98.6% is also higher than that of the other models. Furthermore, the precision of prediction for customers with business behaviors is higher than the precision of prediction for customers with all special behaviors. Thus, our XGBoost model with improved Kmeans performs best in the analysis of customers with business behavior.
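To make the weighted Euclidean distance used by the improved Kmeans concrete, the following is a minimal Python sketch. The exact formula is not given in the text, so the sketch assumes the common form d(x, c) = sqrt(sum_j w_j (x_j - c_j)^2), with the weights w_j taken from the AHP step; the function names `weighted_distance` and `assign` are hypothetical.

```python
import math


def weighted_distance(x, c, weights):
    """Weighted Euclidean distance between sample x and centroid c (assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, c)))


def assign(sample, centroids, weights):
    """Return the label of the centroid nearest to sample under the weighted distance.

    centroids: dict mapping cluster label to centroid vector.
    """
    return min(
        centroids,
        key=lambda label: weighted_distance(sample, centroids[label], weights),
    )
```

With equal weights this reduces to the classical Euclidean distance up to a constant factor, so the cluster assignments match classical Kmeans; unequal weights let the more important features dominate the assignment.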

Conclusion
Based on the specified features of telecom customers' voice communication, customers with special behaviors, including advertisement, phone scam, and business calls, can be identified to support the telecom company's specific solutions. In this paper, a Hybrid Data Mining method of Telecom Customers (HDMTC), which combines improved Kmeans and XGBoost, was proposed to mine customers with special behaviors. A semi-automatic tagging approach based on improved Kmeans was used to help build the training dataset. An XGBoost model was then trained on this dataset to classify the type of customers. Finally, the effectiveness of the HDMTC method was validated experimentally on actual voice communication records of a telecom company.