Alleviating NB conditional independence using Multi-Stage Variable Selection (MSVS): Banking customer dataset application

Customer research is an important means of understanding customer behavior patterns within business enterprises and of predicting how customer satisfaction is achieved. Customer analysis reveals underlying information about how customers interact with an enterprise, and the resulting decisions help to build better marketing strategies that extend customer lifetime and increase business profit. To perform effective customer analysis, this research applies Naive Bayes (NB), a machine learning (ML) algorithm. The efficiency of NB comes from its conditional independence assumption, and violating this assumption results in poor prediction. In most real-world customer datasets, however, the NB assumption is violated because of correlated, irrelevant, and noisy variables. To improve NB prediction on such customer datasets, this research proposes Multi-Stage Variable Selection (MSVS), which selects the relevant variables from the customer dataset and thereby helps to predict customer patterns more accurately. The proposed approach selects the relevant variable subset in two stages. The variable subset obtained from MSVS is then evaluated with the NB algorithm, and the results are compared with those of the wrapper and filter approaches. The results show that the proposed MSVS approach selects a better variable subset and improves NB prediction in customer analysis compared with the wrapper and filter approaches. The proposed approach is also efficient in time and less computationally expensive compared with the wrapper and filter approaches.


Introduction
Customer behavior analysis is important for understanding the patterns of customers within business enterprises, and it helps to predict how satisfied customers are with the business services and products [1]. Since an enterprise's business depends on its customers, understanding customer patterns is essential for improving the business and enhancing customer satisfaction. A good analysis of customer behavior patterns helps to identify low-potential customers within the enterprise and, based upon the analysis, to develop better marketing strategies for improving the engagement of less active customers [17]. To carry out an efficient analysis of customer patterns, this research applies NB, an ML algorithm. NB is a simple probabilistic classifier that performs customer analysis efficiently [3]. The efficiency of the NB algorithm comes from its conditional independence assumption: the predictor variables in the dataset should be conditionally independent given the class, and all predictor variables must be treated as equally important [2]. This assumption allows the NB classifier to perform better analysis compared with other algorithms. However, the NB assumption does not hold for most customer datasets [4]. Owing to advances in the internet, technology, and CRM, enormous amounts of customer interaction data are collected, and the stored data may contain correlated, irrelevant, noisy, and missing variables. This happens because the same customer data are collected from different departments of the CRM and stored in a centralized database. Customer analysis cannot be performed directly on such datasets; if it is, poor performance is observed. This also increases the complexity of the NB model and results in lower prediction accuracy on these customer datasets [19].
To alleviate violations of the NB assumption, this research proposes a multi-stage variable selection (MSVS) approach. The MSVS approach is based upon the variable selection mechanism: it selects the relevant variable subset and eliminates the correlated variables, which helps to improve NB prediction and decrease the complexity of the NB model [5]. The main objective of variable selection is to obtain a relevant variable subset by eliminating correlated and irrelevant variables from the whole dataset using an evaluation measure and a search approach. The advantages of variable selection include a) reducing overfitting, b) minimizing the data dimension, c) removing irrelevant, correlated, and noisy variables, d) choosing the best variable subset, which helps to improve prediction accuracy, and e) improving NB prediction efficiency. Variable selection techniques can broadly be categorized into three types: filter, wrapper, and embedded approaches [6].
The filter approach uses an evaluation method to assess the worthiness of each variable with respect to the class label and ranks the variables according to their scores. The relevance of the variables is obtained from intrinsic properties of the data, and the variables with high ranks or scores are considered for evaluation with the NB classifier. The main advantages of the filter approach are that it is fast in execution, is highly suitable for high-dimensional datasets, and does not interact with the ML algorithm when choosing a variable subset. Its main disadvantage is that dependencies between variables are ignored, which leads to poor performance of the NB classifier [7]. The wrapper approach, in contrast, uses a search strategy and an ML algorithm to choose the best variable subset. Its main advantage is that the variable subset obtained is closer to optimal than the one obtained with the filter approach, because a search strategy and the ML algorithm itself are used to find it. Its main disadvantage is that it is computationally expensive, because of the search involved in finding the best variable subset, and that it has a higher risk of overfitting than the filter approach [8]. To address these issues with the filter and wrapper approaches, the multi-stage variable selection (MSVS) approach is proposed to select an efficient variable subset that improves NB prediction while remaining feasible in time and computation. The MSVS approach consists of two stages of variable selection: the first uses a filter approach (Symmetrical Uncertainty) and the second uses a wrapper approach (Sequential Forward Selection) to choose the best relevant variable subset, which is then evaluated using the NB classifier. The performance of the MSVS approach is validated on the banking dataset, and the experimental results are compared with those of the filter and wrapper approaches.
The experimental results reveal that the proposed MSVS approach works better than the filter and wrapper approaches in selecting the relevant variable subset and improving NB prediction.

Proposed System
In this research, a multi-stage variable selection (MSVS) approach is proposed to select the best relevant variable subset, so that NB prediction improves more than with the filter and wrapper approaches while requiring less time and computation. The MSVS approach consists of two stages of variable selection. First, a Symmetrical Uncertainty filter approach ranks the variables according to their relevance to the output class, and different variable subsets are selected based upon different threshold values. The selected variable subset is then evaluated using a sequential forward selection wrapper approach to select the best relevant variable subset. The objective of the MSVS approach is to obtain an efficient variable subset in less time and with less computation. The idea of the MSVS approach is to overcome the drawbacks associated with the wrapper and filter approaches in variable selection. The overall structure of the proposed MSVS approach is described in Figure 1.

Variable selection using Symmetrical Uncertainty (first stage):
In the first stage, the customer dataset is evaluated using the Symmetrical Uncertainty filter approach to examine the worthiness of each variable with respect to the class label. Based upon this evaluation, the variables are ranked accordingly.

Algorithm 1: Variable selection in the first stage
1: Choose the customer dataset and divide it into a learning set and a testing set.
2: Apply the Symmetrical Uncertainty filter approach as a ranker.
3: Compute the worthiness of each variable with respect to the class label.
4: Rank the variables from the highest score to the lowest score.
5: Apply different threshold values to choose a suitable variable subset.
6: Compute the performance of NB using the selected variable subset.
7: Feed the variable subset obtained as input to the second stage (the wrapper approach).
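Steps 4 and 5 of Algorithm 1 can be illustrated with a minimal Python sketch. The function names, the variable names, the scores, and the threshold values below are all hypothetical and chosen only for illustration; they are not values reported in this research.

```python
def rank_variables(scores):
    """Return variable names sorted from the highest to the lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def select_by_threshold(scores, threshold):
    """Keep the ranked variables whose score is at least `threshold`."""
    return [name for name in rank_variables(scores) if scores[name] >= threshold]


# Hypothetical relevance scores (illustration only, not results from the paper).
example_scores = {"var_a": 0.12, "var_b": 0.08, "var_c": 0.03, "var_d": 0.004}

# Each threshold yields one candidate subset; NB would then be evaluated on
# every candidate before the chosen subset is passed to the second stage.
subsets = {t: select_by_threshold(example_scores, t) for t in (0.01, 0.05, 0.10)}
```

A lower threshold keeps more variables for the wrapper stage and therefore increases its search cost, while a higher threshold risks discarding weakly relevant variables.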

2.1.1. Symmetrical Uncertainty:
Symmetrical Uncertainty (SU) is one of the most commonly used feature selection approaches. It examines the worthiness of each feature with respect to the class label and ranks the features according to their correlation with the class label. The quality of a feature is computed using entropy, a measure from information theory. The entropy of a feature $F$ is calculated using [14]

$$H(F) = -\sum_{i} P(f_i) \log_2 P(f_i) \qquad (1)$$

and the entropy of $F$ after observing $B$ is computed using

$$H(F \mid B) = -\sum_{j} P(b_j) \sum_{i} P(f_i \mid b_j) \log_2 P(f_i \mid b_j) \qquad (2)$$

Here $P(f_i)$ represents the prior probabilities of $F$ and $P(f_i \mid b_j)$ represents the posterior probabilities of $F$ given $B$. The information gain of the feature $F$ with target $B$ is measured using

$$IG(F \mid B) = H(F) - H(F \mid B) \qquad (3)$$

However, IG is biased toward features that take a large number of values. SU is therefore represented as [9]

$$SU(F, B) = 2 \cdot \frac{IG(F \mid B)}{H(F) + H(B)} \qquad (4)$$

Based upon the Symmetrical Uncertainty measure, the variables are ranked and arranged from the highest rank to the lowest rank. Since the filter approach applied here only ranks the variables, a threshold value is needed to choose a suitable variable subset. In this research, different thresholds are applied to check how the results vary with the threshold value, and the performance of NB is computed using each variable subset obtained. The variable subset obtained is then given as input to the second stage, where it is evaluated using a wrapper approach to find an optimal variable subset.
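The Symmetrical Uncertainty measure described above can be sketched in Python for discrete variables as follows. This is a minimal illustration under the assumption that both the feature and the target are categorical (continuous variables would need to be discretized first); the function names are illustrative and are not taken from this research.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(F), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(feature, target):
    """H(F | B): entropy of `feature` inside each target group, weighted by P(b)."""
    n = len(feature)
    groups = {}
    for f, b in zip(feature, target):
        groups.setdefault(b, []).append(f)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetrical_uncertainty(feature, target):
    """SU(F, B) = 2 * IG(F | B) / (H(F) + H(B)), a value between 0 and 1."""
    h_f, h_b = entropy(feature), entropy(target)
    if h_f + h_b == 0:
        return 0.0  # both variables are constant, so nothing can be shared
    info_gain = h_f - conditional_entropy(feature, target)
    return 2.0 * info_gain / (h_f + h_b)
```

A feature that determines the target completely obtains a score of 1, and a feature that is statistically independent of the target obtains a score of 0, which is what makes the score usable as a ranking criterion.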

Variable selection using Sequential Forward Selection (second stage):
The variable subset obtained from the first stage is passed as input to the second stage, which finds the optimal variable subset using the SFS wrapper approach.

2.2.1. Sequential Forward Selection:
SFS is a wrapper variable selection approach. The selection starts with an empty set, and the best variable subset in the dataset is finally selected based upon the search strategy and the induction algorithm [15]. At each step, the candidate variables are evaluated together with the variables already selected, and the best one is added. The SFS approach is suitable for datasets with a small number of variables, and a variable that has been added cannot be removed from the set again. In the SFS approach, NB is applied as the induction algorithm to choose the best variable subset [10]. The variable subset selected by the MSVS approach is efficient compared with the filter and wrapper approaches, because the most irrelevant variables are eliminated in the filter stage and only the reduced variable subset is passed into the wrapper stage. In this way, the optimal variable subset is selected by the wrapper approach with less computational time. This allows the proposed MSVS approach to select a better variable subset than the wrapper and filter approaches.
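The greedy search behind SFS can be sketched in Python as follows. The sketch assumes a caller-supplied `evaluate` function that returns a score to maximize, for example the validation accuracy of NB trained on the given subset. The stopping rule (stop when no addition improves the score) is an assumption of this sketch, and `toy_score` is a hypothetical stand-in for the NB evaluation.

```python
def sequential_forward_selection(candidates, evaluate):
    """Greedy forward search over `candidates` using `evaluate(subset)`.

    Starts from the empty set and repeatedly adds the single variable that
    gives the highest score. Added variables are never removed.
    """
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-variable extension of the current subset.
        scored = [(evaluate(selected + [v]), v) for v in remaining]
        score, variable = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # assumed stopping rule: no extension improves the score
        selected.append(variable)
        remaining.remove(variable)
        best_score = score
    return selected, best_score


def toy_score(subset):
    """Hypothetical score: per-variable gain minus a small size penalty."""
    gains = {"a": 0.6, "b": 0.2, "c": 0.0}
    return sum(gains[v] for v in subset) - 0.05 * max(0, len(subset) - 1)


# Selects "a" first, then "b"; adding "c" would lower the score, so it stops.
chosen, chosen_score = sequential_forward_selection(["a", "b", "c"], toy_score)
```

With n candidate variables the search evaluates on the order of n squared subsets in the worst case, which is why reducing the candidates in the filter stage shortens the wrapper stage.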

Naive Bayes
Naive Bayes is a simple probabilistic classifier based upon Bayes' theorem. The algorithm is widely applied because of (1) its simple computation, (2) its time efficiency, (3) its good performance on large datasets, and its conditional independence assumption [11]. Consider the dataset $D = \{x_1, \ldots, x_n \mid C_k\}$, where $\{x_1, \ldots, x_n\}$ are the input predictors and $C_k$ represents the output label, with $k = 1, \ldots, K$. Consider the data $X = \{x_1, \ldots, x_n\}$ with $n$ variables; NB predicts the class $C_k$ for $X$ by

$$P(C_k \mid X) = \frac{P(C_k)\, P(x_1, \ldots, x_n \mid C_k)}{P(x_1, \ldots, x_n)} \qquad (5)$$

By using the chain rule, the term $P(x_1, \ldots, x_n \mid C_k)$ in the numerator can be represented as

$$P(x_1, \ldots, x_n \mid C_k) = P(x_1 \mid x_2, \ldots, x_n, C_k)\, P(x_2 \mid x_3, \ldots, x_n, C_k) \cdots P(x_n \mid C_k) \qquad (6)$$

With respect to the NB conditional independence assumption, the above equation can be written as

$$P(x_1, \ldots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k) \qquad (7)$$

Using the above equation, the NB classification procedure is carried out. The efficiency of NB depends upon the customer dataset applied. NB makes two important assumptions about the dataset: the first is conditional independence, meaning that the variables must be independent of each other given the class, and the second is that all variables in the dataset should be treated as equal [18]. Violation of the NB assumption in the customer dataset leads to poor prediction [13]. Since the NB assumption is violated in most real-world datasets, this research proposes the MSVS approach to better satisfy the assumption and improve NB prediction. The objective of the approach is to select the best variable subset in less time and with less computation.
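The NB classification procedure described above can be sketched in Python for categorical variables as follows. The sketch works with log probabilities and omits the denominator $P(x_1, \ldots, x_n)$, since it is the same for every class. Laplace smoothing with the constant `alpha` is an assumption of this sketch and is not stated in this research; the toy rows and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Count the statistics needed for P(C_k) and P(x_i | C_k)."""
    class_counts = Counter(labels)
    # value_counts[(i, c)][v] = number of class-c rows whose i-th variable is v
    value_counts = defaultdict(Counter)
    domains = [set() for _ in rows[0]]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains, alpha


def predict_nb(model, row):
    """Return the class maximizing log P(C_k) + sum_i log P(x_i | C_k)."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    best_class, best_log_prob = None, float("-inf")
    for c, n_c in class_counts.items():
        log_prob = math.log(n_c / total)
        for i, v in enumerate(row):
            count = value_counts[(i, c)][v]
            # Laplace-smoothed estimate of P(x_i = v | C_k = c)
            log_prob += math.log((count + alpha) / (n_c + alpha * len(domains[i])))
        if log_prob > best_log_prob:
            best_class, best_log_prob = c, log_prob
    return best_class


# Hypothetical toy data: two categorical variables and a binary class.
toy_rows = [("yes", "high"), ("yes", "low"), ("no", "low"), ("no", "low")]
toy_labels = ["sub", "sub", "not", "not"]
toy_model = train_nb(toy_rows, toy_labels)
```

Because each variable contributes one independent factor, two strongly correlated variables contribute nearly the same evidence twice, which is the effect that variable selection is intended to reduce.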

Experimental Results And Discussion
The proposed MSVS approach is evaluated on the bank dataset, which consists of 45211 instances with 17 variables (one target class with two outcomes) [12]. Details of the customer dataset are given in Table 1. The objective of the bank dataset is to find the customers who are likely to use the bank service, so that, based upon the analysis, marketing strategies can be developed to improve customer lifetime and customer satisfaction. The results obtained from the proposed MSVS approach are compared with those of the wrapper and filter approaches using metrics such as accuracy, specificity, precision, and FNR [16].
1. This research aims to improve Naive Bayes prediction in customer analysis when the customer datasets contain correlated and irrelevant variables.
2. In the filter approach, the variable subset is obtained efficiently in terms of time, but the subset generated is not satisfactory. The approach evaluates only the correlation of each input variable with the class label; the dependencies among the input predictors are not examined, which makes it possible to select correlated variables. In addition, a threshold value is needed to choose the best variables.
3. In the wrapper approach, the variable subset obtained is better than that of the filter approach. In terms of time complexity, however, the wrapper approach is infeasible, which is its main drawback.
4. Considering these problems with the wrapper and filter approaches, this research proposes the MSVS approach.
5. The variable subset selected by the MSVS approach is efficient compared with the filter and wrapper approaches, because the most irrelevant variables are eliminated in the filter stage and only the reduced variable subset is passed into the wrapper stage. In this way, the optimal variable subset is selected by the wrapper approach with less computational time.
6. This allows the proposed MSVS approach to select a better variable subset than the wrapper and filter approaches.
7. The experimental results clearly show that the proposed MSVS-NB works better and improves classification accuracy compared with filter-NB and wrapper-NB.
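The comparison metrics used in this research (accuracy, specificity, precision, and FNR) can be computed from the binary confusion matrix as sketched below in Python. The function name, the choice of positive label, and the convention of returning 0.0 for an empty denominator are assumptions of this sketch.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, specificity, precision and FNR for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # true positive
            else:
                fp += 1  # false positive
        else:
            if truth == positive:
                fn += 1  # false negative
            else:
                tn += 1  # true negative

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),
        "fnr": ratio(fn, fn + tp),          # false negative rate
    }
```

On a dataset with two unevenly represented outcomes, accuracy alone can hide poor detection of the minority class, which is why specificity, precision, and FNR are reported alongside it.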

Conclusion
In this research work, the Multi-Stage Variable Selection (MSVS) approach is proposed and evaluated to perform a better analysis of customer behavior patterns. The proposed approach overcomes the drawbacks associated with the wrapper and filter approaches, selects the best variable subset, and improves NB prediction more than both approaches. The results obtained from the proposed MSVS approach are compared with those of the wrapper and filter approaches using metrics such as accuracy, specificity, precision, and FNR. The experimental results reveal that the proposed MSVS approach achieves the highest accuracy of 89.9759%, whereas the wrapper approach achieves an accuracy of 89.8719% and the filter approach an accuracy of 89.1951%. The proposed MSVS approach is therefore superior to both approaches in choosing the best variable subset and improving the NB classifier.