Summary of Association Rules

In recent years, scholars at home and abroad have carried out extensive research on association rules. To provide a thorough understanding of association rule mining technology, its research status, and its development trends, this paper first introduces the relevant definitions and classification methods of association rules. Second, general methods of association rule mining are summarized from both serial and parallel perspectives, and several typical association rule mining models are reviewed and analysed. Finally, quality improvements to association rule mining and its applications in various fields are discussed.


Introduction
With the rapid development of society and the continuous progress of science and technology, information circulates ever faster, and massive amounts of data are produced in all walks of life. How to extract potentially valuable information from large volumes of data has become a concern across industries. Association rule mining is one of the important research methods in the field of data mining and is widely applied in finance, medicine, the Internet and other fields. Association rule mining was originally proposed for the market basket analysis problem: its aim is to find relationships between different commodities in a transaction database and thereby derive general rules about customer purchasing patterns. These rules can guide merchants in arranging purchasing, inventory and shelf design. Agrawal et al. [1] proposed the earliest Apriori algorithm based on frequent itemsets. Since then, researchers at home and abroad have studied association rule mining in depth; related work includes optimization of the Apriori algorithm, parallel association rule mining, quantitative association rule mining and association rule mining theory.

Definition of Association Rules
We first introduce the traditional association rule formalism. An association rule is an implication of the form X => Y, where X and Y are respectively called the antecedent (left-hand side, LHS) and the consequent (right-hand side, RHS) of the rule. An association rule X => Y has both a support and a confidence.
Definition: let I = {I1, I2, I3, ..., Im} be a set of items. Given a transaction database D, each transaction t is a non-empty subset of I, and each transaction corresponds to a unique identifier TID (Transaction ID). The support of an association rule X => Y in D is the percentage of transactions in D that contain both X and Y, i.e. the probability P(X ∪ Y). The confidence is the percentage of transactions containing X that also contain Y, i.e. the conditional probability P(Y | X). If a rule satisfies both the minimum support threshold and the minimum confidence threshold, it is considered interesting. These thresholds are set manually according to the mining goal.
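The definitions above can be illustrated with a minimal sketch; the transaction database and item names below are hypothetical and chosen only for illustration:

```python
# Hypothetical transaction database D: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(X ∪ Y) / support(X)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"diaper", "beer"}, transactions))       # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))  # ≈ 0.75
```

Here the rule {diaper} => {beer} would be reported as interesting whenever the minimum support threshold is at most 0.6 and the minimum confidence threshold is at most 0.75.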

Mining Process
The association rule mining process consists of two stages. In the first stage, all frequent itemsets must be found in the data set, i.e. the itemsets whose support meets the minimum support threshold are computed. In the second stage, association rules are generated from these frequent itemsets, i.e. the rules whose confidence meets the minimum confidence threshold are computed.
The first stage of association rule mining must identify all frequent (large) itemsets in the original data set. "Frequent" means that the frequency with which an itemset occurs must reach a certain level relative to all records. This frequency is called the support of the itemset. Taking a 2-itemset {A, B} as an example, the support of {A, B} can be obtained from formula (1); if this support is greater than or equal to the preset minimum support threshold, {A, B} is called a frequent itemset. A k-itemset satisfying the minimum support is called a frequent k-itemset, usually denoted Large k or Frequent k. The algorithm generates Large k+1 from the Large k itemsets, until no longer frequent itemsets can be found.
The second stage of association rule mining is to generate the association rules. Rules are generated from the frequent k-itemsets found in the previous step, subject to the minimum confidence threshold: if the confidence of a candidate rule meets the minimum confidence, the rule is accepted as an association rule. For example, the confidence of the rule A => B generated from the frequent 2-itemset {A, B} can be obtained from formula (2); if this confidence is greater than or equal to the minimum confidence, A => B is called an association rule.
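A minimal sketch of this second stage: each frequent itemset is split into an antecedent and a consequent, and candidate rules are filtered by confidence. The frequent itemsets and support values below are hypothetical stand-ins for the output of the first stage:

```python
from itertools import combinations

# Hypothetical output of stage one: frequent itemsets with their supports.
# By the Apriori property, every subset of a frequent itemset is also listed.
freq = {
    frozenset({"A"}): 0.6,
    frozenset({"B"}): 0.7,
    frozenset({"A", "B"}): 0.4,
}
min_conf = 0.5

rules = []
for itemset, supp in freq.items():
    if len(itemset) < 2:
        continue                        # rules need a non-empty LHS and RHS
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = supp / freq[lhs]     # confidence = supp(X ∪ Y) / supp(X)
            if conf >= min_conf:
                rules.append((set(lhs), set(rhs), conf))

for lhs, rhs, conf in rules:
    print(f"{lhs} => {rhs}  (confidence {conf:.2f})")
```

With these numbers, both A => B (confidence 0.4/0.6 ≈ 0.67) and B => A (confidence 0.4/0.7 ≈ 0.57) clear the 0.5 threshold.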
Association rule mining is generally applicable when the attributes in the records take discrete values. If the original attributes in the database are continuous, appropriate data discretization should be performed before association rule mining (in effect, mapping each value to a value range). Data discretization is an important step preceding data mining, and whether the discretization is reasonable will directly affect the results of association rule mining.
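A minimal sketch of such a discretization step, using simple range-based binning; the attribute values, bin edges and labels below are hypothetical:

```python
# Hypothetical continuous attribute: customer ages.
ages = [23, 35, 47, 52, 61, 19, 44, 70]
bins = [(0, 30), (30, 50), (50, 120)]   # assumed value ranges
labels = ["young", "middle", "senior"]

def discretize(value, bins, labels):
    """Map a continuous value to the label of the range containing it."""
    for (lo, hi), label in zip(bins, labels):
        if lo <= value < hi:
            return label
    raise ValueError(f"{value} falls outside all bins")

discrete_ages = [discretize(a, bins, labels) for a in ages]
print(discrete_ages)
# ['young', 'middle', 'middle', 'senior', 'senior', 'young', 'middle', 'senior']
```

After this step, the categorical labels can be treated like ordinary items in the mining process; how the bin boundaries are chosen is exactly the "reasonableness" question the paragraph above raises.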

Classification of Association Rules
Classification based on the type of values handled in a rule or pattern. The variables handled by association rules can be classified into Boolean and numeric types. Boolean association rules deal with discrete, categorical values and show the relationships among such variables. Numeric association rules deal with numeric fields, either by dynamically segmenting them or by operating directly on the original data; they can be combined with multidimensional or multi-level association rules, and may of course also contain categorical variables. For example, gender = "male" => occupation = "lawyer" is a Boolean association rule, while gender = "female" => avg(income) = 2300 involves income, a numeric attribute, and is therefore a numeric association rule.

Mining based on infrequent patterns and negatively correlated patterns.
In practice we usually focus on mining frequent itemsets, but sometimes it is interesting to find rare or negative patterns. For example, diamond watch sales are infrequent in jewellery sales data, yet transactions involving diamond watches can be interesting. Similarly, in supermarket data, if customers frequently buy classic cola and frequently buy diet cola but rarely buy the two together, then buying classic cola and diet cola together is considered a negatively correlated pattern.
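One standard way to quantify such negative correlation (not named in the text above, so this is an illustrative choice) is the lift measure, which compares the observed co-occurrence with what independence would predict; the supermarket counts below are hypothetical:

```python
# Hypothetical supermarket data: 1000 transactions.
n = 1000
n_classic = 600   # transactions containing classic cola
n_diet = 450      # transactions containing diet cola
n_both = 50       # transactions containing both

support_classic = n_classic / n   # 0.6
support_diet = n_diet / n         # 0.45
support_both = n_both / n         # 0.05

# lift < 1: the items co-occur less often than if they were independent,
# i.e. a negatively correlated pattern; lift > 1 would indicate positive correlation.
lift = support_both / (support_classic * support_diet)
print(f"lift = {lift:.3f}")
```

Here lift ≈ 0.185, far below 1, matching the intuition that the two colas substitute for each other rather than being bought together.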

Mining based on multi-layer and multi-dimensional spaces.
Data in association rules can be divided into single-dimensional versus multi-dimensional data, and single-layer versus multi-layer data. Single-dimensional association rules involve only one dimension of the data, such as the items purchased by users. Multi-dimensional association rules handle data spanning multiple dimensions. In other words, a single-dimensional association rule deals with relationships within a single attribute, whereas multi-dimensional association rules deal with relationships between attributes. In multi-level association rules, the values of data items often relate to concepts at higher layers of a concept hierarchy; therefore, mining multi-level association rules can yield deeper knowledge than single-layer mining. According to the granularity level of the items in a rule, multi-layer association rules can be divided into same-layer and inter-layer association rules. There are two strategies for setting support in multi-layer association rule mining: a uniform minimum support for all levels, or different minimum supports for different levels. The former generates rules relatively easily but ignores the precision of each level, which easily causes information loss and information redundancy; the latter increases the flexibility of mining.

Mining association rules based on constraints.
Thousands of rules can be discovered from a given data set during data mining, most of which are neither relevant nor interesting to the user. Users often have good judgment about which directions of mining might yield interesting rules. Therefore, a good heuristic is to let users express their intuitions or expectations as constraints on the search space. This strategy is called constraint-based association rule mining.

Classical Association Rule Mining Algorithm
The classical Apriori association rule mining algorithm is based on two core properties: every subset of a frequent itemset is frequent; every superset of an infrequent itemset is infrequent. Its main idea is:
(1) L1 = {large 1-itemsets};
(2) for (k = 2; Lk-1 ≠ Ø; k++) do begin
(3)   Ck = apriori-gen(Lk-1);
(4)   for all transactions t ∈ D do begin
(5)     Ct = subset(Ck, t);
(6)     for all candidates c ∈ Ct do c.count++;
A single pass over the data set yields the frequent 1-itemsets; iterating level by level, candidate k-itemsets are generated from the frequent (k-1)-itemsets discovered in the previous iteration, and the algorithm terminates when no new frequent itemsets appear. The Apriori algorithm generates candidate itemsets based on the Apriori property, which greatly reduces the number of candidates and shows good performance. But as the database grows, two fatal performance bottlenecks emerge: a large number of candidate sets are generated, and scanning the database multiple times imposes a considerable I/O load. In view of these disadvantages, subsequent researchers proposed improved algorithms. For example, the serial DHP algorithm [2] mines frequent itemsets with hashing techniques: it uses a hash function to distribute candidate itemsets into different hash buckets for counting, and the itemsets in each bucket whose counts are greater than or equal to the minimum support count are retained as frequent itemsets. This method filters the candidate sets generated by joining k-itemsets through a hash table, effectively reducing the data set. However, the DHP algorithm trades storage space for performance by generating and storing hash tables; when the database contains many large itemsets, the efficiency of generating the hash tables drops because of the large amount of computation involved.
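The pseudocode above can be fleshed out into a minimal, runnable sketch; the toy transactions and threshold are hypothetical, and `apriori_gen` is implemented here only as the join step plus the Apriori-property pruning check:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    n_required = min_support * len(transactions)
    # L1: frequent 1-itemsets from a single scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= n_required}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # apriori_gen: join pairs of (k-1)-itemsets, then prune any candidate
        # having an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(Lk), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in Lk for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count candidate occurrences with one pass over the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= n_required}
        frequent.update(Lk)
        k += 1
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "beer"},
]
print(apriori(transactions, min_support=0.5))
```

Note that this sketch exhibits exactly the two bottlenecks discussed above: the candidate set can grow large at each level, and the database is rescanned once per level k.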
The main idea of the partition-based algorithm [3] is to divide the database into several mutually disjoint partitions, each of which fits entirely in memory; mine the frequent itemsets of each partition; take the union of all local frequent itemsets as the candidate set for the whole transaction database; and then compute supports by scanning the database to obtain the global frequent itemsets. The algorithm only needs to scan the database twice during its execution, which overcomes the high I/O cost bottleneck of the Apriori algorithm.
To avoid repeatedly scanning a large database, Toivonen [4] proposed a sampling-based association rule algorithm: first extract a sample D' from database D and mine strong association rules from it, then verify these rules against the rest of the database D − D'. Sampling can realize frequent itemset mining quickly, but because the sample is drawn randomly, the method is prone to data skew, increasing the error between the mining results on the sample and those on the full database D. In view of the data skew problem, Lin et al. [5] proposed an anti-skew algorithm. Han et al. [6] proposed FP-growth, a frequent itemset mining method that discovers frequent itemsets without generating candidate sets. It compresses the data into an FP-tree data structure and, during frequent itemset generation, uses a bottom-up partitioning strategy to build suffix itemsets and conditional FP-trees, then explores the frequent itemsets ending in a particular item. Experiments show that when the generated FP-tree is small enough or the path overlap is large enough, the FP-growth algorithm runs several orders of magnitude faster than the Apriori algorithm.

Parallel and distributed algorithms.
In practical applications, the amount of data that association rule mining must handle grows exponentially, which focuses attention on mining efficiency and I/O load; even an optimized serial algorithm on a single processor cannot meet mining needs, whereas parallel computing on a multiprocessor system can improve mining efficiency. Agrawal et al. [7] proposed three parallel algorithms: CD, DD and CaD. The CD algorithm stores all generated candidate sets on every processor; each processor uses the Apriori algorithm to compute the support counts of the candidates on its local database, then the processors exchange their local support counts so that every processor obtains the global support counts and finds the global frequent itemsets. Each processor synchronizes at the end of each cycle. The CD algorithm achieves parallelization by dividing the database, but it does not parallelize candidate generation; when the candidate sets are large, memory is insufficient. The DD algorithm divides the candidate sets among different processors, which overcomes the low memory utilization of the CD algorithm, but it has a large communication volume and must run on processors with high communication speed. The CaD algorithm integrates the CD and DD algorithms and redistributes the database while allocating the frequent 1-itemsets, enabling each processor to independently complete candidate generation. The PDM algorithm [2] is a parallel algorithm improved by Park et al. on the basis of the DHP algorithm. It is similar to the CD algorithm, generating candidate sets and determining frequent itemsets in parallel; when generating the frequent 2-itemsets, it exchanges only the hash-table itemsets that satisfy the minimum support count.
The PDM algorithm not only inherits the advantages of DHP in reducing the number of candidate sets and the size of the transaction database, but also parallelizes the construction of the hash tables, and its robustness to input is quite good. Based on the idea of DIC, Cheung et al. [8] proposed the APM parallel algorithm. APM uses a global pruning technique to reduce the candidate 2-itemsets, which is effective when the data are homogeneously distributed, whereas the DIC algorithm requires all the data in the database. For this reason, the APM algorithm clusters the database into homogeneous sub-databases according to the number of processors, and each processor then runs the DIC algorithm on its sub-database until no new candidate set is generated. To address the low efficiency of the DD algorithm, the IDD and HD algorithms were introduced [9]. In the IDD algorithm, local databases communicate with the other processors in a ring-shaped broadcast mode. Each processor has a sending buffer (sbuf) and a receiving buffer (rbuf) for asynchronous sending and receiving between adjacent nodes. Compared with the DD algorithm, the IDD algorithm performs only one point-to-point communication between adjacent processors, reducing the number of communications and eliminating network contention. To reduce redundancy in candidate generation, the IDD algorithm filters candidates by examining the prefixes of itemsets. The HD algorithm integrates the CD and IDD algorithms: the processors are divided into groups of equal size, with the CD algorithm used between groups and the IDD algorithm within groups. The HD algorithm has advantages in handling large-scale databases and in load balancing.

Mining Association Rules Based on Data Flow
A new data mining model based on data streams is widely used in financial management, stock trend analysis, wireless sensor networks and other fields. It differs from the traditional static database model: a data stream is generated online, and it is continuous and unbounded. Therefore, association rule mining over data streams should not adopt the previous approach of scanning the data multiple times, but should perform a single scan as the data are updated. In order to adapt to the characteristics of data streams and solve the problem of insufficient storage space, window queries over data streams are generally carried out by means of sliding window techniques. The FP-stream algorithm proposed by Giannella et al. [10] is a classic algorithm for mining frequent itemsets from data streams: it embeds a tilted-time window table into an in-memory frequent pattern tree (pattern-tree) holding frequent and sub-frequent itemset information, mines time-sensitive frequent patterns with approximate support, and realizes incremental maintenance of patterns at multiple time granularities. Building on the idea that frequent closed itemsets provide a minimal representation without losing support information, algorithms for mining frequent closed itemsets improve mining efficiency in both time and space complexity. Chi et al. [11] proposed the Moment algorithm, which mines frequent itemsets from data streams with a sliding window technique and uses two update operations, addition and deletion, to improve the in-memory update speed. The DS-CFI algorithm in the literature [12] first divides the sliding window into multiple sub-windows, computes each window's frequent itemsets in a DSCFI-tree, and incrementally updates the DSCFI-tree to dynamically mine the frequent closed itemsets in the sliding window.
The literature [13] studies FIMoTS, a time-sensitive frequent itemset mining algorithm for data streams: the algorithm classifies itemsets dynamically using type-transition boundaries and, as the sliding window changes, processes only the itemsets that cross those boundaries, improving computational efficiency. The literature [14] proposed the FIUT-Stream algorithm, which takes a bit table as its basic structure, compresses the data into the table, updates the list nodes through the sliding window, and uses the FIUT algorithm to mine frequent itemsets. Manku et al. [15] proposed the Sticky Sampling and Lossy Counting data stream mining algorithms based on the idea of data segmentation. The literature [16] studies a top-k frequent itemset query algorithm for uncertain data streams; the incremental update algorithm reuses existing results and uses a Poisson distribution to construct a lower bound on the probability that an element becomes a frequent item, filtering the data and greatly reducing query time.
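A minimal sketch of the single-pass, sliding-window counting that underlies these stream algorithms; the stream items and window size are hypothetical, and real algorithms such as Lossy Counting add approximation guarantees on top of this basic idea:

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Maintain exact item counts over the last `window` stream elements."""
    def __init__(self, window):
        self.window = window
        self.buffer = deque()     # the elements currently inside the window
        self.counts = Counter()   # item -> count within the window

    def add(self, item):
        """Process one stream element: admit it and expire the oldest if needed."""
        self.buffer.append(item)
        self.counts[item] += 1
        if len(self.buffer) > self.window:
            old = self.buffer.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def frequent(self, min_count):
        """Items meeting the minimum support count within the current window."""
        return {i: c for i, c in self.counts.items() if c >= min_count}

stream = ["a", "b", "a", "c", "a", "b", "b", "b"]
sw = SlidingWindowCounter(window=4)
for item in stream:
    sw.add(item)           # each element is seen exactly once
print(sw.frequent(min_count=2))
```

Because each element is touched once on arrival and once on expiry, memory stays bounded by the window size regardless of how long the (unbounded) stream runs, which is exactly the constraint the paragraph above describes.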

Association Rule Mining Based on Graph
Graph mining refers to applying association analysis to graph-based data, discovering sets of common substructures in a collection of graphs, i.e. frequent subgraph mining. According to the search strategy, frequent subgraph mining algorithms are divided into breadth-first search (BFS) and depth-first search (DFS) algorithms. Breadth-first algorithms include AGM [17] and FSG [18]. The AGM algorithm, based on Apriori, takes frequent vertices as the initial set and recursively adds nodes to mine all frequent subgraphs. FSG improves on the AGM algorithm: instead of generating candidate subgraph patterns by adding vertices, it optimizes the pruning strategy for candidate subgraphs and speeds up the computation of support. In view of the disadvantage that Apriori-style algorithms produce a large number of candidate subgraphs, FP-growth-based algorithms such as gSpan, FFSM and CloseGraph have been presented. Yan et al. [19] first proposed the gSpan algorithm based on depth-first search. The algorithm labels each graph with its minimum DFS code and performs rightmost extension to avoid generating duplicate graphs, reducing redundancy. In view of existing frequent subgraph mining algorithms outputting too many subgraphs, the literature [20] puts forward a new jump measure between graph patterns, the Δ-jump model, which converts mining frequent subgraph patterns in a database into mining frequent jump patterns, and designs GraphJP, an efficient mining algorithm based on external extension and pruning techniques; validation on a large amount of data shows that GraphJP can efficiently mine frequent jump patterns.
The literature [21] studies a frequent subgraph mining method for uncertain graphs: for a subgraph S, the expected support over the probability distribution of certain databases implied by the uncertain database D is evaluated against a threshold as its importance measure, while search space pruning techniques are applied. This model reduces the number of subgraph isomorphism tests and improves search efficiency. The literature [22] introduces EDFS, an uncertain frequent subgraph mining algorithm based on a partition-based search strategy mixing depth-first and breadth-first search: it first performs depth-first search and then applies breadth-first search to the results.

Sequential Association Rule Mining
Agrawal and Srikant first proposed the concept of sequential pattern mining, which is the process of mining frequent subsequences satisfying a minimum support from a sequence database. Sequential pattern mining differs from association rule mining, which mainly studies relationships between itemsets. Agrawal et al. put forward three candidate generate-and-test mining algorithms based on the Apriori property, AprioriAll, AprioriSome and DynamicSome [23], and later the classic Apriori-based algorithm GSP [24]; all of the above are horizontal-format algorithms. Zaki [25] proposed SPADE, a sequential pattern algorithm based on the vertical format: it converts the sequence database into a vertical-format database recording the positions of each item, then mines frequent sequential patterns through dynamic joins; the algorithm scans the database three times, reducing I/O overhead. The literature [26] proposed PrefixSpan, a pattern-growth algorithm based on projection: it recursively produces frequent sequential patterns from projections of frequent sequence prefixes, avoiding the overhead of candidate set generation. Lin et al. [27] proposed MEMISP, a memory-indexing method for sequential pattern mining: the algorithm applies indexing to frequent sequences, scans the database once, and produces neither candidate sequences nor projected databases, with memory utilization clearly better than PrefixSpan. Maximal sequential pattern mining and closed sequential pattern mining can effectively compress the result set of traditional sequential pattern mining. Combining the advantages of these two techniques, Tong Yongxin et al. [28] proposed CFSP, a compressed sequential pattern mining algorithm that represents all sequential pattern information with a small number of representative frequent sequences. The algorithm adopts a two-step method:
first, most of the representative sequential patterns are mined according to a dominant-sequential-pattern detection mechanism; second, the remaining representative sequences are mined with a small amount of additional time.
In addition, there are other extended sequential pattern compression algorithms for different requirements. The literature [29] first introduced ideas from logic into frequent sequential pattern mining, reducing the dependence of the mining results on the support threshold: the algorithm filters the results by logical equivalence rules and deletes those that do not conform to the rules of logic, greatly improving the intelligibility and practicability of the rule set. The literature [30] designed a One-Off sequential pattern mining method with wildcards: when computing support, patterns must satisfy the One-Off condition that no two occurrences of a pattern share the same position in the character sequence. By introducing wildcards into sequential patterns, more meaningful patterns can be found in the biomedical and Web log fields.

Development Trend of Association Rules
Associative classification is another important classification technique alongside machine learning methods such as Bayesian classification (Bayes' theorem) and support vector machines (SVM), and is widely used in text categorization, Web document classification and medical image classification. Classical associative classification algorithms include CBA, CMAR and CPAR. Applying associative classification to text classification, and addressing the problem that traditional associative classification algorithms depend on support and confidence thresholds tuned on the test set and thus classify poorly, the literature [42] put forward an associative classification algorithm that intelligently optimizes support and confidence based on simulated annealing. The literature [43] proposed a high-precision method for classifying related text based on category similarity aggregation. Medical images play an important auxiliary role in medical diagnosis. Medical image data mining differs from general data mining in that it requires stability, high efficiency and reliability. The literature [44] put forward CARMI, an associative classification algorithm suitable for medical image data mining: by introducing a dual-support mechanism it reduces the classification interference caused by high-support features in image data, realizing efficient classification of image data. Network intrusion detection technology, as a powerful assistant to firewall technology, effectively compensates for firewall policies when dealing with hacker attacks and information theft, and has become a reliable monitoring system safeguarding network security. A traditional intrusion detection system (IDS) uses association rules to generate a standard intrusion detection rule library, and produces an alarm data stream by judging how well the network data match the rule library.
The literature [45] proposes CVNIDM, a context-validated network intrusion detection model that takes the external conditions of an attack into consideration in order to distinguish IDS alert information, thereby improving the accuracy of IDS alerts. In the literature [46], an association rule algorithm computing a comprehensive correlation confidence (h-confidence) of adjacent alarms is proposed to effectively discover alarm association rules with low support and high confidence. Privacy-preserving data mining refers to perturbing the data and restricting queries during the mining process to protect the original data; its basic strategy is to avoid leaking business-sensitive data and personal privacy. The literature [47] proposed EMASK, a privacy-preserving association rule algorithm based on data perturbation and distribution reconstruction. The BEMASK algorithm [48] improved EMASK based on the idea of granular computing, transforming relational data tables into machine-oriented relational models by means of granulation: the computation of frequent itemsets is converted into computing intersections of elementary granules, and the data are represented vertically, which reduces the number of I/O operations compared with EMASK. The literature [49] put forward UFI-DM, an association rule algorithm for uncertain data satisfying a uniform distribution, which effectively solves the inefficiency or infeasibility of traditional data mining techniques when applied to privacy protection with uncertain factors. Association rules are also used for protein structure prediction: according to the characteristics of protein sequences, these sequence data can be quantified, and association rule algorithms can then be applied to protein sequence data sets to find relationships.
Based on the knowledge discovery theory of inner cognitive mechanism (KDTICM) and the knowledge discovery in databases (KDD) model, the literature [50] put forward SAC, a protein secondary structure prediction method based on associative classification; the algorithm successfully predicted protein sequences with 85% accuracy. With the growing number of satellites for remote sensing, navigation and positioning, geophysics and so on, space and earth big data bring new opportunities for the study of earth science. Traditional data analysis methods are mainly based on statistical analysis and nonlinear fitting. Mining models based on association rules can reveal the relationships among earth data such as ocean, land and atmosphere, thus promoting the development of global change and disaster science. The literature [51] used multiple constraints to mine time-series association rules, and through analysis obtained relationships, highly consistent with the actual situation, between climate indices and abnormal precipitation events in land areas.

Conclusion
After a long period of research and development, association rule mining has become increasingly mature in the design and optimization of frequent pattern mining algorithms and is widely used in the Internet, finance, bioinformatics and other fields. But challenging directions for future research in this field remain: designing more efficient mining algorithms; realizing interaction between the user and the mining system and developing easy-to-understand visual interfaces; improving extended mining algorithms for special fields, such as periodic pattern mining; and extending the application domains of association rules.