Frequent Attack Sequences-Based Network Log Mining

As an important part of modern IT infrastructure, networks provide great convenience for people to exchange information and share resources. However, networks still face many threats, such as viruses, hacker attacks, data theft and tampering. Network logs contain a lot of valuable information about all behaviours that occur in the network, and how to analyse these logs to enhance network security has consequently become the focus of many researchers. In this paper, we first design three similarity functions between two network attack records to create the network attack sequences, and then present a PrefixSpan-based frequent attack sequence mining algorithm to identify all frequent attack sequences in the network log. The experimental results show that the PrefixSpan-based frequent attack sequence mining algorithm has a shorter execution time and uses less running space than the Apriori algorithm. The PrefixSpan-based frequent attack sequence mining algorithm provides a network log analysis method for intrusion detection.


Introduction
With the continuous improvement of network technology, human life is becoming more and more dependent on networks. Networks provide great convenience for people to exchange information and share resources. At the same time, the architecture and scale of networks are becoming more and more complicated with the wide application of new technologies such as the Internet of Things and cloud computing, and protecting network security is becoming harder and harder. How to use novel technologies to strengthen network security has therefore become very important. While a network is running, massive network logs are created to record all behaviors that occur in it. These network logs contain a lot of useful information that helps network security experts find network threats and strengthen network security. Therefore, analyzing network logs is very valuable for implementing active security protection of the network.
For massive network logs, this paper presents a network log mining algorithm based on frequent attack sequences to identify the network attack types. The main contributions of this paper are as follows. First, a series of relativity functions is designed to calculate the correlation degree between two network attack records by jointly considering three attributes of a network attack record, i.e. the occurrence time of attack, the IP address of attack and the port number of attack. These relativity functions are used to extract all network attack sequences from the network log. Second, a PrefixSpan-based sequence mining algorithm is implemented to find all frequent attack sequences in the network log, and multiple experiments are conducted to demonstrate its effectiveness in comparison with the Apriori-based frequent attack sequence mining algorithm.
The rest of this paper is organized as follows. Related works and some basic concepts are introduced in Section 2 and Section 3, respectively. The frequent attack sequence-based network log mining algorithm is described in Section 4, and the experimental setup and result evaluation are explained in Section 5. Finally, the summary and future work are discussed in Section 6.

Related Works
With the rapid development of data mining technology, more and more researchers have come to recognize the important role of network log analysis [1][2][3][4]. Zhang [5] proposed an attack pattern mining method using association analysis in data mining, which first divides the alert subsequences by a sliding time window and then finds hidden attack patterns in the network log by analyzing the alert subsequences. Hellerstein et al. [6] proposed an algorithm for mining unexpected, periodic and interdependent events from historical security logs. Treinen and Thurimella [7] designed a framework for the application of association rule mining in large intrusion detection infrastructures. Lee et al. [8] constructed an intrusion detection framework based on data mining technology and extracted feature patterns from large-scale audit logs, so as to discover users' abnormal activities. Nithya et al. [9] proposed a new network log pre-processing technology that introduces the concepts of local noise and global noise in the data cleaning stage and cleans the network log by eliminating noisy data. Sahu et al. [10] analyzed users' past browsing behavior and proposed an improved network log mining method to predict users' browsing patterns online. Rodrigue [11] put forward a method that combines time series analysis with unsupervised learning to analyze abnormal traffic logs in the network, so as to provide practical support for future network security decision-making. Bass [12] researched the fusion of intrusion detection data and multisensor data in an intrusion detection system.
The existing methods still lack accuracy in their mining results, which makes the results difficult to analyze. In addition, some methods consider only the time factor during mining, which tends to split a single attack scenario across several mined patterns and reduces the practicability of the method.

Basic Concepts
In data mining, sequence data is a concept similar to, but somewhat different from, itemset data. A comparison between itemset data and sequence data is shown in Table 1.
Table 1. A comparison between itemset data and sequence data.
An itemset consists of multiple items that have no chronological order, while a sequence is constituted by several itemsets and has chronological order. For example, the first sequence <A 1
A subsequence is similar to a subset in mathematics. If all itemsets of a sequence A are included, in order, in the itemsets of another sequence B, then A is a subsequence of B. The formal definition of subsequence is as follows: given two sequences $A=\{a_1,a_2,\ldots,a_n\}$ and $B=\{b_1,b_2,\ldots,b_m\}$ ($n\le m$), if there exists an index sequence $1\le j_1<j_2<\cdots<j_n\le m$ that satisfies $a_1\subseteq b_{j_1}$, $a_2\subseteq b_{j_2}$, …, $a_n\subseteq b_{j_n}$, then A is a subsequence of B. If $a_1=b_1$, $a_2=b_2$, …, $a_{n-1}=b_{n-1}$ and $a_n\subseteq b_n$, then A is a prefix of B.
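The two definitions can be checked mechanically. The following is a minimal sketch, assuming that a sequence is represented as a list of `frozenset` itemsets and that each itemset of A must be matched to a distinct, later itemset of B; the function names `is_subsequence` and `is_prefix` are illustrative and do not come from the paper.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[str]


def is_subsequence(a: Sequence[Itemset], b: Sequence[Itemset]) -> bool:
    """True if the itemsets of `a` are contained, in order, in distinct itemsets of `b`."""
    j = 0
    for a_k in a:
        # Greedily advance to the earliest itemset of b that contains a_k.
        while j < len(b) and not a_k <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next itemset of a must match strictly later in b
    return True


def is_prefix(a: Sequence[Itemset], b: Sequence[Itemset]) -> bool:
    """True if a_1..a_{n-1} equal b_1..b_{n-1} and a_n is contained in b_n."""
    n = len(a)
    if n == 0 or n > len(b):
        return False
    return all(a[k] == b[k] for k in range(n - 1)) and a[n - 1] <= b[n - 1]
```

The greedy scan is sufficient here because matching an itemset of A as early as possible never rules out a later match.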

Frequent Attack Sequences-Based Network Log Mining
Frequent attack sequences-based network log mining includes three steps. First, the network attack sequences are extracted from the network log. Second, the frequent attack sequences are mined with association rule mining algorithms. Finally, the network attack types are clustered with clustering algorithms. The procedure of frequent attack sequence-based network log mining is shown in Fig. 1.

Relativity Degree Computing
The network logs can be parsed with word segmentation algorithms and stored in a relational database. Assume that $A=\{a_1,a_2,\ldots,a_n\}$ is a network attack record extracted from the network log, and $a_i$ ($1\le i\le n$) is an attribute representing a kind of network attack information, such as the occurrence time of attack, protocol type of attack, source IP of attack, source port number of attack, destination IP of attack, destination port number of attack, duration time of attack and so forth. To find all network attack sequences in the network log, we design a similarity calculation function between two network attack records that jointly takes into account the occurrence time of attack, the IP address of attack and the port number of attack.

The similarity calculation function of occurrence time of attack
When the time interval between two network attack records is larger than a given sliding time window, we consider that the two network attack records do not belong to the same network attack sequence. Otherwise, we consider that the two network attack records belong to the same network attack sequence and use the Gaussian distribution to compute the time similarity of the two network attack records. The similarity calculation function of occurrence time of attack is as follows:
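A minimal sketch of a time similarity of this kind is given below. It assumes a Gaussian kernel $\exp(-\Delta t^2/(2\sigma^2))$ that is cut off at the sliding window; the exact functional form, the parameter `sigma` and the function name `time_similarity` are assumptions made for illustration and are not taken from the paper.

```python
import math


def time_similarity(t_i: float, t_j: float, window: float, sigma: float) -> float:
    """Gaussian-shaped similarity of two occurrence times (same unit as `window`).

    ASSUMPTION: the kernel exp(-gap^2 / (2 * sigma^2)) and the parameter `sigma`
    are illustrative choices, not the paper's definition.
    """
    gap = abs(t_i - t_j)
    if gap > window:
        # Outside the sliding window: the records are treated as unrelated.
        return 0.0
    return math.exp(-(gap ** 2) / (2.0 * sigma ** 2))
```

With this form, two records with identical timestamps get similarity 1, and the value decays smoothly as the gap approaches the window.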

The similarity calculation function of IP address of attack
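We use $L(ip_i, ip_j)$ to describe whether two IP addresses are the same, and the value of $L(ip_i, ip_j)$ is as follows:
$$L(ip_i, ip_j) = \begin{cases} 1, & ip_i \text{ is the same as } ip_j \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
The similarity calculation function of IP address of attack is as follows: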

The similarity calculation function of port number of attack
Similar to the IP address of attack, we use $L(port_i, port_j)$ to describe whether two port numbers are the same, and the value of $L(port_i, port_j)$ is as follows:
$$L(port_i, port_j) = \begin{cases} 1, & port_i \text{ is the same as } port_j \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
The similarity calculation function of port number of attack is as follows:
$$S_{port}(A_i, A_j) = \frac{L(A_i.srcPort, A_j.srcPort) + L(A_i.desPort, A_j.desPort) + L(A_i.srcPort, A_j.desPort) + L(A_i.desPort, A_j.srcPort)}{4} \tag{5}$$
Taking the above formulas into account, the similarity calculation function of two network attack records combines the time, IP and port similarities with the weights $w_k$, where $k$ denotes time, ip or port, respectively, and $w_{time} + w_{ip} + w_{port} = 1$.
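The port similarity of Eq. (5) and the weighted combination can be sketched as follows. Two parts are assumptions rather than the paper's stated definitions: `ip_similarity` simply mirrors Eq. (5) for IP addresses, and `record_similarity` takes the combination to be a plain weighted sum over time, ip and port. The class `AttackRecord` and all function names are hypothetical; `src_port` and `des_port` stand for srcPort and desPort.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AttackRecord:
    # Hypothetical record layout holding only the fields used below.
    src_ip: str
    src_port: int
    des_ip: str
    des_port: int


def same(x, y) -> int:
    """Indicator L(x, y): 1 if the two values are identical, otherwise 0."""
    return 1 if x == y else 0


def port_similarity(a: AttackRecord, b: AttackRecord) -> float:
    """Eq. (5): average of the four source/destination port comparisons."""
    return (same(a.src_port, b.src_port) + same(a.des_port, b.des_port)
            + same(a.src_port, b.des_port) + same(a.des_port, b.src_port)) / 4


def ip_similarity(a: AttackRecord, b: AttackRecord) -> float:
    """ASSUMPTION: same structure as Eq. (5), applied to IP addresses."""
    return (same(a.src_ip, b.src_ip) + same(a.des_ip, b.des_ip)
            + same(a.src_ip, b.des_ip) + same(a.des_ip, b.src_ip)) / 4


def record_similarity(scores: Mapping[str, float],
                      weights: Mapping[str, float]) -> float:
    """ASSUMPTION: weighted sum of the time, ip and port similarities."""
    keys = ("time", "ip", "port")
    if abs(sum(weights[k] for k in keys) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[k] * scores[k] for k in keys)
```

Note that, under Eq. (5), two records with identical source and destination ports score 0.5 rather than 1 unless the source port also equals the destination port, because the two cross comparisons then contribute 0.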

Mining Frequent Attack Sequences from Network Attack Sequences
To identify the network attack type, we first need to mine all frequent attack sequences from the network attack sequences. Many algorithms can be used to mine frequent attack sequences, e.g. Apriori, Apriori-all, Apriori-some, FP-Growth, DSP and PrefixSpan. In consideration of mining time and space, we choose PrefixSpan to find all frequent attack sequences. The procedure of mining the frequent attack sequences with the PrefixSpan algorithm is as follows:
Input: the network attack sequence set $S_A$, the minimum support threshold Min_Supt.
Output: the frequent attack sequences $FS_A$ in $S_A$.
Procedure:
Step 1: Scan each network attack sequence in $S_A$ to find all network attack sequence prefixes of length 1 and create the corresponding projected network attack sequence set 1-$S_A$.
Step 2: Count the support (i.e. frequency) of each network attack sequence prefix. Delete from $S_A$ all network attack sequences whose prefixes have a support less than Min_Supt, and obtain the frequent 1-network attack sequence set 1-$S_A$.
Step 3: For each network attack sequence prefix whose length is $i$ and whose support is not less than Min_Supt, execute the following recursive mining operation:
a) Find the corresponding projected network attack sequence set $i$-$S_A$. If $i$-$S_A$ is empty, end the recursive operation and return 0;
b) Count the support of each network attack sequence in $i$-$S_A$. If the support of every network attack sequence is less than Min_Supt, end the recursive operation and return 0;
c) Join each network attack sequence with the current prefix to obtain a series of new prefixes;
d) Set $i=i+1$ and recursively execute Step 3 with each new joined prefix as the prefix.
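The recursion in Steps 1 to 3 can be sketched compactly. The sketch below is a simplified PrefixSpan, not the authors' implementation: it assumes every itemset holds a single attack label, so a sequence is a plain list of labels, and it takes `min_support` as an absolute count of sequences. The function name `prefixspan` and the labels in the usage example are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def prefixspan(sequences: Sequence[Sequence[str]],
               min_support: int) -> Dict[Pattern, int]:
    """Return every frequent sequential pattern with its support count."""
    frequent: Dict[Pattern, int] = {}

    def mine(prefix: Pattern, projected: List[Pattern]) -> None:
        # Count each item at most once per projected sequence.
        counts: Counter = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, support in counts.items():
            if support < min_support:
                continue  # infrequent extension: prune this branch
            new_prefix = prefix + (item,)
            frequent[new_prefix] = support
            # Project: keep what follows the first occurrence of `item`.
            new_projected = []
            for suffix in projected:
                if item in suffix:
                    rest = suffix[suffix.index(item) + 1:]
                    if rest:
                        new_projected.append(rest)
            if new_projected:
                mine(new_prefix, new_projected)

    mine((), [tuple(s) for s in sequences])
    return frequent


# Hypothetical attack sequences, used only to illustrate the call.
logs = [["scan", "login", "exploit"], ["scan", "exploit"],
        ["login", "scan", "exploit"], ["scan", "login"]]
patterns = prefixspan(logs, min_support=3)
```

Each recursive call only scans the projected suffixes of the current prefix, which is why the search space shrinks as prefixes grow, in contrast to the candidate generation of Apriori-style methods.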

Filtering All Frequent Attack Sequences with Two Rules
After mining all frequent attack sequences, we verify these frequent attack sequences with two rules. The procedure is as follows:
Rule 1: For each frequent attack sequence, all of its non-empty subsequences are also frequent attack sequences.
Rule 2: For each non-empty subsequence sub-$FS_A$ of each frequent attack sequence $i$-$FS_A$ ($1\le i\le n$), the corresponding confidence must be at least Min_Conf. Here, Min_Conf is the minimum confidence.
All frequent attack sequences can be mined from the network attack log based on the above algorithm.
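The two filtering rules can be sketched as a single check over the mined patterns. The sketch assumes that patterns are tuples of attack labels mapped to support counts, and, since the confidence expression is not reproduced above, it assumes the confidence of a subsequence to be support(pattern) / support(subsequence). Both this formula and the function name `passes_rules` are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple

Pattern = Tuple[str, ...]


def passes_rules(pattern: Pattern, support: Dict[Pattern, int],
                 min_conf: float) -> bool:
    """Check Rule 1 and an assumed form of Rule 2 for one mined pattern."""
    if pattern not in support:
        return False
    # Enumerate every non-empty proper subsequence (order preserved).
    for r in range(1, len(pattern)):
        for idx in combinations(range(len(pattern)), r):
            sub = tuple(pattern[i] for i in idx)
            if sub not in support:
                return False  # Rule 1: every subsequence must be frequent
            # ASSUMPTION: confidence = support(pattern) / support(sub).
            if support[pattern] / support[sub] < min_conf:
                return False  # Rule 2: confidence below Min_Conf
    return True
```

Enumerating all subsequences is exponential in the pattern length, so this form is only practical for the short patterns typical of attack sequences.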

Experimental Results and Analysis
Two metrics are often used to evaluate the mining result: precision and recall. Suppose TP is the number of network attack sequences that are truly frequent and are mined by the above algorithm, FP is the number of network attack sequences that are not truly frequent but are mined by the above algorithm, FN is the number of network attack sequences that are truly frequent but are not mined by the above algorithm, and TN is the number of network attack sequences that are not truly frequent and are not mined by the above algorithm. Precision and recall are computed as follows:
$$\mathrm{precision} = \frac{TP}{TP+FP} \tag{7}$$
$$\mathrm{recall} = \frac{TP}{TP+FN} \tag{8}$$
To validate the effect of the network log mining algorithm based on frequent attack sequences, a series of experiments was carried out. In the experiments, we used about four years of network logs, covering 1358 users and 29765 sessions, from a Lenovo ThinkServer TS250 system with an Intel Xeon CPU E3-1225 V5 @ 3.3 GHz, and eliminated the sessions whose occurrence frequency is less than 0.001. Furthermore, the support and confidence thresholds of the two algorithms are set to 0.5 and 0.6, respectively, and the threshold of the sliding time window is set to 10 ms.
We adopt precision and recall to verify the usefulness of the frequent attack sequence-based network log mining algorithm and compare it with the Apriori algorithm. The precision and recall of the two algorithms are shown in Fig. 2 and Fig. 3, respectively. According to the experimental results, the precision of the PrefixSpan-based network log mining algorithm is clearly higher than that of the Apriori-based network log mining algorithm, and its recall is slightly higher. These experimental results illustrate the favorable effect of frequent attack sequence-based network log mining with the PrefixSpan algorithm.
Figure 3. The recall of two algorithms.
In addition, the PrefixSpan-based network log mining algorithm is suitable for both sparse and dense datasets, and especially for sparse datasets, while the Apriori-based network log mining algorithm is only suitable for sparse datasets. The PrefixSpan-based network log mining algorithm also requires less mining time and less mining space than the Apriori-based network log mining algorithm.
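Equations (7) and (8) translate directly into set operations when the mined sequences and the truly frequent sequences are both available as sets. The following is a minimal sketch; the function name `precision_recall` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Hashable, Set, Tuple


def precision_recall(mined: Set[Hashable],
                     truly_frequent: Set[Hashable]) -> Tuple[float, float]:
    """Compute precision (Eq. 7) and recall (Eq. 8) from two sets of sequences."""
    tp = len(mined & truly_frequent)   # frequent and mined
    fp = len(mined - truly_frequent)   # mined but not frequent
    fn = len(truly_frequent - mined)   # frequent but not mined
    # ASSUMPTION: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

TN does not appear in either formula, so it does not need to be counted for these two metrics.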

Conclusions and Future Work
Network log mining is important for network security protection. Aiming at massive and complex network logs, this paper presented a PrefixSpan-based network log mining algorithm that identifies the frequent attack sequences and offers a means of network log analysis for intrusion detection and security prediction. The experimental results showed that the PrefixSpan-based network log mining algorithm achieves higher precision and recall than the Apriori-based network log mining algorithm. In future work, we will explore how to implement the proposed algorithm in a network security forewarning system.