Analysis on Network Traffic Features for Designing Machine Learning based IDS

An intrusion detection system (IDS) is one of the most important technologies for securing network systems. It dynamically monitors network traffic for malicious activities that aim to violate the confidentiality, integrity, authenticity, and availability of the network. Several Machine Learning (ML) techniques are currently used to design and implement IDS because they can capture the complex nature of cyberattacks. However, network traffic data usually contains unimportant features that can degrade the efficacy of an ML-based IDS. This research analyses the critical features in network traffic for designing and implementing an effective ML-based IDS. The selected features are applied to different ML methods to test their effectiveness. The research is conducted on the CICIDS2017 dataset generated by the Canadian Institute for Cybersecurity, using 30 percent of the full dataset and 100 percent of the Wednesday set. For the 30 percent sample of the full set, the best result is obtained with 30 selected features and the Bagging ensemble classifier, which gives an accuracy of 99.9 percent with a false-positive rate (FPR) as low as 0.03 percent. For the Wednesday set, the best result is obtained with the Random Forest classifier, which achieves an accuracy of 99.9 percent and an FPR of 0.02 percent.


Introduction
Nowadays, IDS is used to recognize malicious network traffic. Various techniques, such as pattern matching, statistics-based methods, and machine learning, are used to design IDS. Many studies on machine learning based IDS have reported high success rates. Some techniques, such as fast kNN (FKNN) [1], were used to reduce computational time, while others use more complex techniques such as Artificial Neural Networks (ANN) to achieve an exceptional detection rate [2]. Other research constructed IDS using ensemble learning together with oversampling techniques [3]. ML-based IDS is not limited to supervised learning: some research investigates unsupervised learning using a protocol-based IDS [4], and a semi-supervised hybrid of k-means clustering and ensemble learning has also been proposed [5]. A key challenge for ML-based IDS is that the data usually has high dimensionality and contains noise. Feature selection reduces the dimensionality of the dataset, which reduces computational time and can improve the performance of ML-based IDS. There are many feature selection techniques, including Information Gain, Recursive Feature Elimination, and correlation-based methods. Ustebay et al. [6] used Recursive Feature Elimination with Random Forest estimators to eliminate unnecessary features from the dataset. It achieved 95 percent accuracy on the  [7] applied Fisher Score to select the best 30 features to earn an accuracy of 99 percent. The purpose of this research is to analyze the significant features of network traffic that can be used to design and implement machine learning based IDS. The main contribution of this paper is an evaluation of the features in network traffic that are useful in classifying network attacks.
Moreover, it tests the effectiveness of different ML methods on the selected features using accuracy, recall, false-positive rate, and F1-Score. The research is conducted using the CICIDS2017 dataset. This paper is organized as follows. Related works are elaborated in Section 2. The methodology is presented in Section 3. Section 4 presents the results. The discussion and conclusion are presented in Section 5 and Section 6, respectively.

CICIDS2017 Dataset
The CICIDS2017 dataset is an intrusion detection dataset provided by the Canadian Institute for Cybersecurity. It is a relatively recent dataset and covers more up-to-date network attacks than older datasets. It is distributed as eight CSV files with a total of approximately three million instances. The CSV files contain the network traffic captured on each day; Monday is the only day that has no malicious data. The dataset has 15 labels, 14 of which are attacks and one of which is benign. In total, 288,602 records contain missing values. The dataset is considered highly imbalanced: the most frequent label is benign with 2,359,087 records, whereas the least frequent is Heartbleed with 11 records. In-depth information about the dataset can be found in [8].

Related Works
Zhou et al. [9] used three datasets, namely NSL-KDD, AWID, and the CICIDS2017 Wednesday set, for binary and multiclass classification. Correlation-based feature selection combined with the Bat algorithm (CFS-BA) was used as the feature selection technique, and the number of features selected was 10, 8, and 13 for NSL-KDD, AWID, and CICIDS2017, respectively. The reduced datasets were then classified by a Voting ensemble with C4.5, Random Forest (RF), and Forest with Penalizing Attributes (ForestPA) as estimators. For the NSL-KDD dataset, the accuracy was 99.8 percent with a false acceptance rate (FAR) of 0.08 percent. For the AWID dataset, the accuracy was 99.5 percent with a FAR of 0.15 percent. Lastly, for the CICIDS2017 dataset, the accuracy was 99.9 percent with a FAR of 0.12 percent. Kurniabudi et al. [10] proposed a feature analysis of the CICIDS2017 dataset with Information Gain for anomaly detection. The research used 20 percent of the CICIDS2017 dataset, and the aim was to classify the traffic into seven classes, namely Normal, Bot, Brute Force, DoS/DDoS, Infiltration, Port Scan, and Web Attack. Feature subsets of 4, 15, 22, 35, 52, 57, and 77 features were used, with the features ranked by the weights obtained from the Information Gain technique. The classifiers used were Random Forest (RF), Bayes Network (BN), Random Tree (RT), Naïve Bayes (NB), and J48 (C4.5). The best result was achieved by RF with 22 features, with an accuracy of 99.86 percent and a FAR of 0.3 percent. Similar accuracy and FAR were achieved by J48 with 52 features: an accuracy of 99.87 percent with a FAR of 0.2 percent. However, it should be noted that this method takes a longer execution time.
Abdulhammed et al. [11] proposed a feature selection technique with Principal Component Analysis (PCA) and Auto Encoder (AE) to classify the CICIDS2017 dataset in binary and multiclass settings. The CICIDS2017 dataset was preprocessed and sampled using Uniform Distribution Based Balancing (UDBB). The UDBB approach uses uniform distribution balancing to learn and resample new instances. The dataset was then pre-processed with PCA and AE. The experimental results showed that the optimal number of features is 10 for PCA and 59 for AE. The classifiers used were RF, Bayesian Network (BN), Linear Discriminant Analysis (LDA), and

Methodology
The process for evaluating the importance of features in the CICIDS2017 dataset is explained in this section. The overall process is presented in Section 3.1. The re-labelling and sampling processes are elaborated in Section 3.2. The methods and parameters used in feature selection and classification are explained in Section 3.3 and Section 3.4, respectively. Figure 1 illustrates the overall process of the system. The system consists of three parts, namely Preprocessing, Feature Selection, and Classification. Preprocessing is the procedure for preparing the CICIDS2017 dataset. Then, the feature selection methods are applied to extract the dominant features used to distinguish malicious packets from benign packets. To evaluate the effectiveness of the selected features, various ML algorithms are applied to classify attack classes.

Preprocessing
The research analyzes both the Wednesday set and the Full set since related works have been conducted on both sets. CICIDS2017 contains network traffic records from Monday to Friday. In total, 288,602 records were dropped because they contain missing values. Additionally, the nominal features, namely Flow ID, Source IP, Destination IP, Protocol, and Timestamp, were excluded, following the suggestion of Ring et al. [12]. The class labels in CICIDS2017 are strings that represent the names of network attack methods. The relabeling process changed all labels into integer categories, where 0 represents the benign label and labels 1 to 14 represent the attack labels. The preprocessed data was split into four datasets: a feature selection dataset, a cross-validation dataset, a training dataset, and a testing dataset. Additionally, standardization was applied to both the CICIDS2017 Wednesday set and the Full set before feature selection. The purpose of standardization is to rescale each feature to a mean of zero and a standard deviation of one. The CICIDS2017 Full set has approximately three million records after the cleansing process. Therefore, random sampling is required to reduce its size. Cochran's sample size formula [13] is applied to calculate the required number of samples. In total, 30 percent of the records in the Full set are randomly chosen to obtain a confidence level of 99 percent and a margin of error of one percent.
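
The relabeling, standardization, and sample-size steps described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names, the "BENIGN" label string, and the worst-case proportion p = 0.5 in Cochran's formula are assumptions made for the example, and the optional finite-population correction is the standard textbook form rather than something stated in this paper.

```python
import math
from statistics import NormalDist, mean, pstdev


def encode_labels(labels, benign="BENIGN"):
    """Map string labels to integers: 0 for benign, 1..k for attack classes.

    The benign label name is an assumption; attack names are numbered in
    alphabetical order, which need not match the numbering used in the paper.
    """
    attacks = sorted(set(labels) - {benign})
    mapping = {benign: 0}
    mapping.update({name: i for i, name in enumerate(attacks, start=1)})
    return [mapping[label] for label in labels], mapping


def standardize(values):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def cochran_sample_size(confidence, margin, p=0.5, population=None):
    """Cochran's minimum sample size n0 = z^2 * p * (1 - p) / e^2.

    p = 0.5 is the conservative (largest n0) choice and is an assumption.
    If a population size is given, the finite-population correction
    n = n0 / (1 + (n0 - 1) / N) is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n0 = z * z * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


codes, mapping = encode_labels(["BENIGN", "DDoS", "BENIGN", "Bot"])
minimum_records = cochran_sample_size(confidence=0.99, margin=0.01)
```

Under these assumptions, Cochran's formula gives a minimum of roughly 16,600 records for a 99 percent confidence level and a one percent margin of error, so a 30 percent sample of about three million records lies well above that minimum.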

Feature Selection Process
The feature selection process is illustrated in Figure 2. First, Recursive Feature Elimination (RFE) with a Random Forest estimator was applied to rank the features. Second, the top 30 features were selected, because many studies use 10 to 35 features to classify malicious network traffic. Third, three-fold cross-validation was applied to find the optimum number of features.
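
The three steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the pipeline used in this research: the Random Forest importance is replaced by a toy class-separation score for binary labels, the classifier used inside cross-validation is a nearest-centroid stand-in, and all function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def separation_importance(X, y, features):
    """Toy importance (stand-in for Random Forest importances, binary labels).

    Scores each feature by the gap between its class means, measured in
    units of the feature's overall standard deviation.
    """
    scores = {}
    for j in features:
        column = [row[j] for row in X]
        pos = [v for v, label in zip(column, y) if label == 1]
        neg = [v for v, label in zip(column, y) if label == 0]
        spread = pstdev(column) or 1.0  # guard against constant columns
        scores[j] = abs(mean(pos) - mean(neg)) / spread
    return scores


def rfe_ranking(X, y, importance_fn=separation_importance):
    """Recursive feature elimination: re-score, drop the weakest, repeat.

    Returns feature indices ordered from most to least important.
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        scores = importance_fn(X, y, remaining)  # a real model is refit here
        weakest = min(remaining, key=scores.get)
        remaining.remove(weakest)
        eliminated.append(weakest)
    return eliminated[::-1]


def centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Accuracy of a nearest-centroid classifier (stand-in estimator)."""
    centroids = {}
    for label in set(y_train):
        rows = [row for row, l in zip(X_train, y_train) if l == label]
        centroids[label] = [mean(row[j] for row in rows) for j in features]

    def predict(row):
        return min(
            centroids,
            key=lambda c: sum(
                (row[j] - m) ** 2 for j, m in zip(features, centroids[c])
            ),
        )

    hits = sum(predict(row) == label for row, label in zip(X_test, y_test))
    return hits / len(y_test)


def best_feature_count(X, y, ranking, folds=3):
    """Pick the number of top-ranked features by k-fold cross-validation."""
    n = len(X)
    mean_accuracy = {}
    for k in range(1, len(ranking) + 1):
        fold_scores = []
        for f in range(folds):
            test_idx = set(range(f, n, folds))  # every folds-th row, offset f
            train = [i for i in range(n) if i not in test_idx]
            test = sorted(test_idx)
            fold_scores.append(
                centroid_accuracy(
                    [X[i] for i in train], [y[i] for i in train],
                    [X[i] for i in test], [y[i] for i in test],
                    ranking[:k],
                )
            )
        mean_accuracy[k] = mean(fold_scores)
    # On ties, prefer the smaller feature set.
    best_k = max(mean_accuracy, key=lambda k: (mean_accuracy[k], -k))
    return best_k, mean_accuracy


# Toy data (made up): feature 0 separates the classes strongly,
# feature 2 weakly, and feature 1 is mostly noise.
y = [i % 2 for i in range(12)]
X = [
    [10 * (i % 2) + 0.1 * (i % 3), (7 * i) % 5, (i % 2) + 0.8 * ((i // 2) % 3 - 1)]
    for i in range(12)
]
ranking = rfe_ranking(X, y)
best_k, cv_scores = best_feature_count(X, y, ranking)
```

On this toy data the loop eliminates the noise feature first. Breaking ties in favor of the smaller feature set is a design choice of the sketch; it mirrors the goal of keeping the feature set small when accuracy does not change.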

Classification
The CICIDS2017 Full set and Wednesday set were split as illustrated in Figure 3. All basic classifiers were applied to both the CICIDS2017 Wednesday set and the Full set. The three best-performing basic classifiers were then selected for use in Stacking and Voting.
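
As an illustration of how the three best basic classifiers could feed a Voting ensemble, the following Python sketch selects the top three models by validation accuracy and combines their predictions by majority (hard) voting. The text does not state whether hard or soft voting was used, so hard voting is an assumption here; the classifier names and accuracy values are placeholders invented for the example, not results from this research, and Stacking is not shown.

```python
from collections import Counter


def top_k_classifiers(validation_accuracy, k=3):
    """Return the names of the k classifiers with the highest accuracy."""
    return sorted(validation_accuracy, key=validation_accuracy.get, reverse=True)[:k]


def hard_vote(predictions_by_model):
    """Majority vote over per-model prediction lists (best model first).

    Ties are resolved in favor of the earliest model in the list.
    """
    voted = []
    for sample_votes in zip(*predictions_by_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(next(v for v in sample_votes if counts[v] == top))
    return voted


# Placeholder validation accuracies (hypothetical values, for illustration only).
validation_accuracy = {"tree": 0.99, "knn": 0.98, "logreg": 0.95, "bayes": 0.80}
selected = top_k_classifiers(validation_accuracy)

# Hypothetical class predictions of the three selected models on three samples.
ensemble = hard_vote([[0, 1, 2], [0, 1, 3], [1, 1, 4]])
```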

Results
The results of the feature selection method and the classification methods are reported in this section. Section 4.1 presents the result of feature selection. Section 4.2 presents the results for the CICIDS2017 Full set. Section 4.3 presents the results for the CICIDS2017 Wednesday set. Table 1 shows all 30 features selected by RFE. The features in bold represent the top 18 features selected after the cross-validation process. Tables 2 and 3 show the classification results for the CICIDS2017 Full set. The values in bold indicate the best results for basic and ensemble classifiers. The performance of the machine learning algorithms is evaluated using accuracy, precision, recall, false-positive rate, and F1-Score.
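
For reference, the five evaluation metrics can be computed from predicted and true labels as in the Python sketch below. It assumes a benign-versus-attack view in which label 0 is benign and any other label counts as an attack; how the metrics were aggregated over the 15 classes in this research is not specified in this section, so the aggregation, the function name, and the example labels are assumptions.

```python
def detection_metrics(y_true, y_pred, benign=0):
    """Accuracy, precision, recall, FPR, and F1 for benign vs. attack.

    Any non-benign label is treated as "attack", so an attack predicted as
    a different attack class still counts as a detected attack here.
    """
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        is_attack = actual != benign
        flagged = predicted != benign
        if is_attack and flagged:
            tp += 1
        elif is_attack and not flagged:
            fn += 1
        elif not is_attack and flagged:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # benign flows flagged
        "f1": f1,
    }


# Made-up labels: four benign flows and four attack flows from two classes.
metrics = detection_metrics(
    y_true=[0, 0, 0, 0, 1, 2, 2, 1],
    y_pred=[0, 0, 0, 1, 1, 2, 0, 2],
)
```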

Discussion
The results in Section 4 show that the Decision Tree outperforms the other basic classification methods on both the Full set and the Wednesday set. Moreover, ensemble classifiers can significantly reduce the FPR. The results also show that using 18 or 30 features makes no significant difference in accuracy. It should also be pointed out that the features selected by RFE are consistent with many of the features proposed by Kurniabudi et al. [10]. Decision Tree, Bagging, and Random Forest classifiers can classify attacks in the multi-class setting with an accuracy of more than 99 percent. These three classifiers are shown to be superior to the other classifiers because all three are based on tree-type classifiers, whose advantages are their ability to handle outliers and unbalanced datasets. Most misclassifications in multi-class classification come from web attacks. Most web attacks, such as SQL injection, cannot be classified from network traffic features alone; the content of the packets may also be required to classify them. It is also worth noting that, according to [8], the features that are important in classifying the Heartbleed attack are features 7, 28, and 29 from Table 1. All three features are included in the RFE results, and feature 7 also appears in the 18-feature set. Therefore, all classification models can correctly classify the Heartbleed attack. As mentioned in Section 2, Abdulhammed et al. [11] use UDBB to synthesize data, where the new instances are randomly drawn from a uniform distribution. However, malicious traffic instances may not be uniformly distributed. Thus, using a uniform distribution as an assumption could be inaccurate.