Association Rule Mining (II)
Instructor: Qiang Yang
Thanks: J. Han and J. Pei
Bottleneck of Frequent-pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many scanning passes and generates a huge number of candidates
• To find the frequent itemset $i_1 i_2 \dots i_{100}$:
  • # of scans: 100
  • # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
• Bottleneck: candidate generation and test
• Can we avoid candidate generation?
Frequent-pattern mining methods
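The candidate count on this slide can be checked numerically. The snippet below is only an arithmetic check; the variable names are illustrative and not part of the lecture.

```python
from math import comb

n = 100  # length of the long frequent itemset used on the slide

# Every non-empty subset of the 100 items is a potential candidate,
# so the total is C(100,1) + C(100,2) + ... + C(100,100).
num_candidates = sum(comb(n, k) for k in range(1, n + 1))

# The sum of all binomial coefficients except C(100,0) is 2^100 - 1.
assert num_candidates == 2**n - 1

# Scientific notation, to compare with the 1.27 * 10^30 figure on the slide.
print(f"{num_candidates:.2e}")
```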
FP-growth: Frequent-pattern Mining Without Candidate Generation
• Heuristic: let P be a frequent itemset, S be the set of transactions containing P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset
• No candidate generation!
• A compact data structure, the FP-tree, stores the information needed for frequent-pattern mining
• A recursive mining algorithm finds the complete set of frequent patterns
Example: Min Support = 3
Scan the database
• First scan: collect the list of frequent items, sorted by descending support (item:support)
  • <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
• The root of the tree is created and labeled with “{}”
• Second scan: insert each transaction into the tree, with its frequent items ordered according to frequency
  • Scanning the first transaction leads to the first branch of the tree: <(f:1),(c:1),(a:1),(m:1),(p:1)>
Scanning TID=100
Transaction database: TID 100: f,a,c,d,g,i,m,p
Header table (item: count): f:1, c:1, a:1, m:1, p:1
Tree: {} → f:1 → c:1 → a:1 → m:1 → p:1
Scanning TID=200
• Frequent single items: F1 = <f,c,a,b,m,p>
• TID=200: intersecting its items with F1 gives the possible frequent items f,c,a,b,m
• Along the first branch <f,c,a,m,p>, the transaction shares the prefix <f,c,a>
• Generate two new nodes below a: <b>, and <m> as its child
Scanning TID=200
Transaction database: TID 200: f,c,a,b,m
Header table (item: count): f:1, c:1, a:1, b:1, m:2, p:1
Tree:
{}
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
The final FP-tree
Min support = 3
Transaction database:
TID   Items
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Frequent 1-items in frequency descending order: f,c,a,b,m,p
Header table (item: count): f:1, c:2, a:1, b:3, m:2, p:2
Tree:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
FP-Tree Construction
• Scans the database only twice
• Subsequent mining: based on the FP-tree
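The two-scan construction can be sketched as follows. This is a minimal illustration rather than the lecture's own code: the `Node` class, `build_fp_tree`, and the `tie_break` parameter are names introduced here. Because f and c (and a, b, m, p) have equal support, the slide's ordering f,c,a,b,m,p is passed in explicitly as an assumed tie-breaking rule.

```python
from collections import Counter

class Node:
    """One FP-tree node (hypothetical helper class for this sketch)."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup, tie_break=""):
    # Scan 1: count the support of every item and keep the frequent ones.
    support = Counter(i for t in transactions for i in set(t))
    tie_rank = {item: r for r, item in enumerate(tie_break)}
    order = sorted(
        (i for i in support if support[i] >= min_sup),
        key=lambda i: (-support[i], tie_rank.get(i, len(tie_rank)), i),
    )
    rank = {item: r for r, item in enumerate(order)}

    # Scan 2: insert each transaction, frequent items only, in global order.
    root = Node(None, None)
    header = {item: [] for item in order}  # item -> list of its tree nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# The five transactions of the running example (one character per item).
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = build_fp_tree(db, min_sup=3, tie_break="fcabmp")
nodes_per_item = {item: len(nodes) for item, nodes in header.items()}
```

With this tie-breaking rule, `nodes_per_item` matches the counts in the slide's header table, which appear to be the number of tree nodes per item.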
How to Mine an FP-tree?
• Step 1: form the conditional pattern base
• Step 2: construct the conditional FP-tree
• Step 3: recursively mine the conditional FP-trees
Conditional Pattern Base
• Let {I} be a frequent item
• Its conditional pattern base is a sub-database which
  • consists of the set of prefix paths in the FP-tree
  • with item {I} as a co-occurring suffix pattern
• Example:
  • {m} is a frequent item
  • {m}’s conditional pattern base:
    • <f,c,a>: support = 2
    • <f,c,a,b>: support = 1
• Mine recursively on such databases
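The conditional pattern base of an item can be sketched without building the tree. This is a shortcut, not the slide's tree traversal: the prefix path of an item is the part of each ordered transaction that precedes it, so aggregating those prefixes gives the same base. The function name and the explicit `order` argument are assumptions of this sketch.

```python
from collections import Counter

def conditional_pattern_base(transactions, order, item):
    """Aggregate the ordered prefixes that precede `item` in each transaction."""
    rank = {x: r for r, x in enumerate(order)}
    base = Counter()
    for t in transactions:
        items = set(t)
        if item not in items:
            continue
        # Frequent items that come before `item` in the global order.
        prefix = tuple(sorted(
            (x for x in items if x in rank and rank[x] < rank[item]),
            key=rank.get,
        ))
        if prefix:
            base[prefix] += 1
    return dict(base)

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
order = "fcabmp"  # frequent items in the order used on the slides

base_m = conditional_pattern_base(db, order, "m")
base_p = conditional_pattern_base(db, order, "p")
```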
Conditional Pattern Tree
• Let {I} be a suffix item and {DB|I} be its conditional pattern base
• The frequent-pattern tree Tree_I built from {DB|I} is known as the conditional pattern tree
• Example:
  • {m} is a frequent item
  • {m}’s conditional pattern base:
    • <f,c,a>: support = 2
    • <f,c,a,b>: support = 1
  • {m}’s conditional pattern tree: within the base, f, c and a each have support 3 and b has support 1, so with min support 3 the tree is the single path {} → f:3 → c:3 → a:3
Composition of patterns α and β
• Let α be a frequent item in DB, B be α’s conditional pattern base, and β be an itemset in B. Then α + β is frequent in DB if and only if β is frequent in B.
• Example:
  • Starting with α = {p}
  • {p}’s conditional pattern base (from the tree) is B = { (f,c,a,m): 2, (c,b): 1 }
  • Let β be {c}, which occurs in both prefix paths (2 + 1 = 3)
  • Then α + β = {p,c}, with support = 3
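The {p,c} example can be checked directly against the running example's transactions. The helper `support` below is a name introduced for this check only.

```python
# The five transactions of the running example (one character per item).
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))

# {p}'s conditional pattern base, as listed on the slide.
base_p = {("f", "c", "a", "m"): 2, ("c", "b"): 1}

# Support of {c} inside the base: add the counts of the paths containing c.
support_c_in_base = sum(n for path, n in base_p.items() if "c" in path)

# Support of {p, c} in the full database.
support_pc_in_db = support("pc", db)
```

Both numbers come out as 3, which is the min support, so {p,c} is frequent in DB exactly because {c} is frequent in the base.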
• Let P be a single-path FP-tree
• Let {I1, I2, …, Ik} be an itemset in the tree
• Let Ij have the lowest support
• Then support({I1, I2, …, Ik}) = support(Ij)
• Example: single-path tree (figure: the FP-tree of the running example)
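For a single-path tree, this rule means every combination of nodes can be emitted directly, with the smallest node count as its support. A minimal sketch follows. It assumes the single path f:3, c:3, a:3, which results from counting the items in {m}'s conditional pattern base under min support 3; the function name is illustrative.

```python
from itertools import combinations

def single_path_patterns(path, suffix=()):
    """Enumerate every combination of nodes on a single path.

    `path` is a list of (item, count) pairs from the root downwards;
    the support of a combination is the minimum count among its nodes.
    """
    patterns = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            items = tuple(item for item, _ in combo) + tuple(suffix)
            patterns[items] = min(count for _, count in combo)
    return patterns

# Assumed input: m's conditional tree is the single path f:3 -> c:3 -> a:3.
patterns_m = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], suffix=("m",))
```

This yields the seven patterns fm, cm, am, fcm, fam, cam and fcam, each with support 3.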
FP_growth Algorithm (Fig 6.10)
• Recursive algorithm
• Input: a transaction database, min_supp
• Output: the complete set of frequent patterns
• 1. FP-tree construction
• 2. Mining the FP-tree by calling FP_growth(FP_tree, null)
• Key idea: consider single-path FP-trees and multi-path FP-trees separately
• Continue to split until a single-path FP-tree is obtained
FP_growth(Tree, α)
• If Tree contains a single path P, then
  • For each combination (denoted as β) of the nodes in the path P
    • Generate pattern β + α with support = minimum support of the nodes in β
• Else for each a_i in the header of Tree, do {
  • Generate pattern β = a_i + α with support = a_i.support;
  • Construct
    • (1) β’s conditional pattern base and
    • (2) β’s conditional FP-tree Tree_β
  • If Tree_β is not empty, then
    • Call FP_growth(Tree_β, β);
• }
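The recursion can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the textbook implementation. It omits the single-path shortcut, because the general branch also handles single-path trees correctly, only less efficiently. Ties in support are broken alphabetically, and all class and function names are introduced here.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(weighted, min_sup):
    """Build an FP-tree from (items, count) pairs; return its header table."""
    support = defaultdict(int)
    for items, n in weighted:
        for i in set(items):
            support[i] += n
    keep = {i for i, s in support.items() if s >= min_sup}
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted:
        node = root
        # Insert the frequent items in descending-support order.
        for i in sorted(set(items) & keep, key=lambda i: (-support[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += n
    return header

def fp_growth(header, suffix, min_sup, out):
    for item, nodes in header.items():
        pattern = frozenset(suffix | {item})       # beta = a_i + alpha
        out[pattern] = sum(n.count for n in nodes)  # support of beta
        # beta's conditional pattern base: prefix paths of every a_i node.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        cond_header = build_tree(base, min_sup)     # beta's conditional FP-tree
        if cond_header:                             # recurse if it is not empty
            fp_growth(cond_header, pattern, min_sup, out)
    return out

def frequent_patterns(transactions, min_sup):
    header = build_tree([(t, 1) for t in transactions], min_sup)
    return fp_growth(header, frozenset(), min_sup, {})

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
result = frequent_patterns(db, min_sup=3)
```

On the running example, `result` maps each frequent itemset (as a `frozenset`) to its support, for instance `frozenset("fcam")` to 3.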
FP-Growth vs. Apriori: Scalability With the Support Threshold
Data set: T25I20D10K
FP-Growth vs. Tree-Projection: Scalability With the Support Threshold
Data set: T25I20D100K
Why Is FP-Growth the Winner?
• Divide-and-conquer:
  • decompose both the mining task and the DB according to the frequent patterns obtained so far
  • leads to focused search of smaller databases
• Other factors
  • no candidate generation, no candidate test
  • compressed database: FP-tree structure
  • no repeated scan of the entire database
  • basic operations are counting and FP-tree building, not pattern search and matching
Implications of the Methodology: Papers by Han, et al.
• Mining closed frequent itemsets and max-patterns
  • CLOSET (DMKD’00)
• Mining sequential patterns
  • FreeSpan (KDD’00), PrefixSpan (ICDE’01)
• Constraint-based mining of frequent patterns
  • Convertible constraints (KDD’00, ICDE’01)
• Computing iceberg data cubes with complex measures
  • H-tree and H-cubing algorithm (SIGMOD’01)
Visualization of Association Rules: Pane Graph
Visualization of Association Rules: Rule Graph
Mining Various Kinds of Rules or Regularities
• Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity
• Classification, clustering, iceberg cubes, etc.
Multiple-level Association Rules
• Items often form a hierarchy
• Flexible support settings: items at the lower level are expected to have lower support
• The transaction database can be encoded based on dimensions and levels
• Explore shared multi-level mining
Example hierarchy: Level 1: Milk [support = 10%]; Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]
• Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
• Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
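The effect of the two support settings on the milk example can be sketched as follows. The numbers are taken from the slide; the dictionary layout and function name are assumptions of this sketch.

```python
# item -> (hierarchy level, support), using the figures from the slide
items = {
    "Milk": (1, 0.10),
    "2% Milk": (2, 0.06),
    "Skim Milk": (2, 0.04),
}

uniform_support = {1: 0.05, 2: 0.05}  # same threshold at every level
reduced_support = {1: 0.05, 2: 0.03}  # lower threshold at the lower level

def frequent_items(min_sup_by_level):
    """Items whose support reaches the threshold of their own level."""
    return [
        name for name, (level, support) in items.items()
        if support >= min_sup_by_level[level]
    ]

frequent_uniform = frequent_items(uniform_support)
frequent_reduced = frequent_items(reduced_support)
```

Under uniform support, Skim Milk (4%) is missed; under reduced support, all three items are frequent.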
Quantitative Association Rules
• Numeric attributes are dynamically discretized
  • such that the confidence or compactness of the rules mined is maximized
• 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
• Cluster “adjacent” association rules to form general rules using a 2-D grid
• Example: age(X, “34-35”) ∧ income(X, “30K - 50K”) ⇒ buys(X, “high resolution TV”)
Redundant Rules [SA95]
• Which rule is redundant?
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • “skim milk” ⇒ wheat bread [support = 2%, confidence = 72%]
• The first rule is more general than the second rule.
• A rule is redundant if its support is close to the “expected” value, based on a general rule, and its confidence is close to that of the general rule.
INCREMENTAL MINING [CHNW96]
• Rules in DB were found, and a set of new tuples db is added to DB
• Task: find the new rules in DB + db
• Usually, DB is much larger than db
• Properties of itemsets:
  • frequent in DB + db if frequent in both DB and db
  • infrequent in DB + db if infrequent in both DB and db
  • frequent only in DB: merge with the counts in db; no DB scan is needed!
  • frequent only in db: scan DB once to update their itemset counts
• The same principle is applicable to distributed/parallel mining
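The case analysis above can be sketched as an update step. This is an illustration of the listed properties only, not the algorithm of [CHNW96]. The function name, the data layout and all numbers are hypothetical, and it assumes the counts of the itemsets frequent in DB were stored by the earlier mining run.

```python
def incremental_update(frequent_DB, size_DB, counts_db, size_db, min_sup):
    """Return (itemsets confirmed frequent in DB + db, itemsets needing a DB scan).

    frequent_DB: itemset -> count, for itemsets already known frequent in DB.
    counts_db:   itemset -> count in the new tuples db.
    min_sup:     relative minimum support (fraction of transactions).
    """
    threshold = min_sup * (size_DB + size_db)

    # Frequent in DB: merge the stored count with the count in db, no DB scan.
    confirmed = {}
    for itemset, count in frequent_DB.items():
        total = count + counts_db.get(itemset, 0)
        if total >= threshold:
            confirmed[itemset] = total

    # Frequent only in db: the DB count is unknown, so one DB scan is needed.
    need_DB_scan = [
        itemset for itemset, count in counts_db.items()
        if itemset not in frequent_DB and count >= min_sup * size_db
    ]
    return confirmed, need_DB_scan

# Hypothetical numbers: |DB| = 1000, |db| = 100, min_sup = 10%.
confirmed, need_DB_scan = incremental_update(
    frequent_DB={("a",): 150, ("b",): 101},
    size_DB=1000,
    counts_db={("a",): 12, ("b",): 2, ("c",): 30},
    size_db=100,
    min_sup=0.10,
)
```

Here a stays frequent (162 of 1100), b drops out (103 of 1100), and c is frequent only in db, so its count in DB has to be collected with one scan.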
CORRELATION RULES
• Association does not measure correlation [BMS97, AY98].
• Among 5000 students
  • 3000 play basketball, 3750 eat cereal, 2000 do both
  • play basketball ⇒ eat cereal [40%, 66.7%]
• Conclusion: “basketball and cereal are correlated” is misleading
  • because the overall percentage of students eating cereal is 75%, higher than 66.7%.
• Confidence does not always give the correct picture!
Correlation Rules
• P(A∧B) = P(A)·P(B) if A and B are independent events
• Correlation measure: corr(A,B) = P(A∧B) / (P(A)·P(B))
  • If the value is less than 1, A and B are negatively correlated; if it is greater than 1, A and B are positively correlated
• P(B|A)/P(B) is known as the lift of the rule A ⇒ B
  • If it is less than one, then B and A are negatively correlated
• Basketball ⇒ Cereal: 2000/(3000·3750/5000) = (2000·5000)/(3000·3750) ≈ 0.89 < 1
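The basketball and cereal numbers can be worked through in a few lines. The variable names are illustrative; the figures are those of the student example.

```python
n_students = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

# Rule: play basketball => eat cereal
support = n_both / n_students        # P(A and B) = 0.40
confidence = n_both / n_basketball   # P(B | A), about 0.667

# Lift compares the rule's confidence with the base rate of the consequent.
p_cereal = n_cereal / n_students     # P(B) = 0.75
lift = confidence / p_cereal         # P(B | A) / P(B)

print(f"support={support:.2f} confidence={confidence:.3f} lift={lift:.3f}")
```

The lift is 8/9, below 1, so playing basketball and eating cereal are negatively correlated even though the rule's confidence looks high.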
Chi-square Correlation [BMS97]
• The cutoff value at the 95% significance level is 3.84, which is greater than the computed chi-square value of 0.9
• Thus, we do not reject the independence assumption
Constraint-based Data Mining
• Finding all the patterns in a database autonomously? Unrealistic!
  • The patterns could be too many but not focused!
• Data mining should be an interactive process
  • The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
  • User flexibility: the user provides constraints on what is to be mined
  • System optimization: the system exploits such constraints for efficient mining
Constraints in Data Mining
• Knowledge type constraint:
  • classification, association, etc.
• Data constraint, using SQL-like queries
  • find product pairs sold together in stores in Vancouver in Dec. ’00
• Dimension/level constraint
  • in relevance to region, price, brand, customer category
• Rule (or pattern) constraint
  • small sales (price < $10) trigger big sales (sum > $200)
• Interestingness constraint
  • strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Constrained Mining vs. Constraint-Based Search
• Constrained mining vs. constraint-based search/reasoning
  • Both aim at reducing the search space
  • Constrained mining finds all patterns satisfying the constraints; constraint-based search in AI finds some (or one) answer
  • Constraint-pushing vs. heuristic search
  • How to integrate them is an interesting research problem
• Constrained mining vs. query processing in DBMS
  • Database query processing requires finding all answers
  • Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing
Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
• Given a frequent pattern mining query with a set of constraints C, the algorithm should be
  • sound: it only finds frequent sets that satisfy the given constraints C
  • complete: all frequent sets satisfying the given constraints C are found
• A naïve solution
  • First find all frequent sets, and then test them for constraint satisfaction
• More efficient approaches:
  • Analyze the properties of constraints comprehensively
  • Push them as deeply as possible inside the frequent pattern computation
Anti-Monotonicity in Constraint-Based Mining
• Anti-monotonicity
  • When an itemset S satisfies the constraint, so does any of its subsets (equivalently, when S violates it, so does every superset)
  • sum(S.Price) ≤ v is anti-monotone (for non-negative prices)
  • sum(S.Price) ≥ v is not anti-monotone
• Example. C: range(S.profit) ≤ 15 is anti-monotone
  • Itemset ab violates C
  • So does every superset of ab
Table: TDB (min_sup = 2)
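An anti-monotone constraint lets a level-wise search stop extending an itemset as soon as it violates the constraint. The sketch below shows only this pruning idea: support counting is left out, the prices and threshold are hypothetical, and the function names are introduced here.

```python
from itertools import combinations

price = {"a": 4, "b": 7, "c": 2, "d": 9}  # hypothetical non-negative prices
v = 10                                     # hypothetical threshold

def satisfies(itemset):
    """Anti-monotone constraint: sum(S.Price) <= v."""
    return sum(price[i] for i in itemset) <= v

def mine_with_pruning(items):
    """Level-wise enumeration that only extends itemsets satisfying the constraint."""
    level = [frozenset([i]) for i in items if satisfies([i])]
    result = set()
    while level:
        result.update(level)
        extended = {s | {i} for s in level for i in items if i not in s}
        # Violating itemsets are dropped here, so no superset of them is built.
        level = [s for s in extended if satisfies(s)]
    return result

def mine_brute_force(items):
    """Reference: test every non-empty subset."""
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if satisfies(c)
    }

pruned = mine_with_pruning(sorted(price))
brute = mine_brute_force(sorted(price))
```

Both routes return the same six itemsets, but the pruned search never generates any superset of a violating set such as {a, b}.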
Which Constraints Are Anti-Monotone?
Monotonicity in Constraint-Based Mining
• Monotonicity
  • When an itemset S satisfies the constraint, so does any of its supersets
  • sum(S.Price) ≥ v is monotone (for non-negative prices)
  • min(S.Price) ≤ v is monotone
• Example. C: range(S.profit) ≥ 15
  • Itemset ab satisfies C
  • So does every superset of ab
Table: TDB (min_sup = 2)
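The monotonicity of these constraints can be spot-checked exhaustively on a small item set. This is a check on one hypothetical price table, not a proof; the prices, thresholds and function names are assumptions.

```python
from itertools import combinations

price = {"a": 4, "b": 7, "c": 2, "d": 9}  # hypothetical non-negative prices

def nonempty_subsets(items):
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    ]

def looks_monotone(constraint):
    """True if, on this data, every superset of a satisfying set also satisfies."""
    sets = nonempty_subsets(sorted(price))
    return all(
        constraint(t)
        for s in sets if constraint(s)
        for t in sets if s < t  # t is a proper superset of s
    )

def sum_at_least_10(s):
    return sum(price[i] for i in s) >= 10

def min_at_most_3(s):
    return min(price[i] for i in s) <= 3

def sum_at_most_10(s):
    return sum(price[i] for i in s) <= 10

checks = {
    "sum >= 10": looks_monotone(sum_at_least_10),
    "min <= 3": looks_monotone(min_at_most_3),
    "sum <= 10": looks_monotone(sum_at_most_10),
}
```

The first two constraints pass the check. The third fails, because {a} satisfies it while {a, b} does not. In a miner, a monotone constraint that is satisfied once does not need to be re-tested on supersets.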
Which Constraints Are Monotone?
Succinct, Convertible, and Inconvertible Constraints in the Book
• We will not consider these in this course.
Associative Classification
• Mine possible association rules of the form itemset ⇒ class
  • Itemset: a set of attribute-value pairs
  • Class: class label
• Build classifier
  • Organize rules according to decreasing precedence based on confidence and support
• B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD’98
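The rule-ordering idea can be sketched as follows. This is a simplified illustration, not the full method of Liu, Hsu and Ma. The rules, their confidence and support values, the default class and all names are hypothetical.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    itemset: frozenset   # set of (attribute, value) pairs
    label: str           # predicted class
    confidence: float
    support: float

# Hypothetical mined rules of the form itemset => class.
rules = [
    Rule(frozenset({("age", "<=30")}), "yes", 0.80, 0.20),
    Rule(frozenset({("income", "low")}), "no", 0.80, 0.10),
    Rule(frozenset({("age", "<=30"), ("student", "no")}), "no", 0.90, 0.05),
]

def build_classifier(rules):
    """Order rules by decreasing confidence, then by decreasing support."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))

def classify(ordered_rules, instance, default):
    """Return the class of the first rule whose itemset is contained in the instance."""
    for rule in ordered_rules:
        if rule.itemset <= instance:
            return rule.label
    return default

ordered = build_classifier(rules)
young_student = frozenset({("age", "<=30"), ("student", "yes"), ("income", "low")})
young_non_student = frozenset({("age", "<=30"), ("student", "no")})
prediction_1 = classify(ordered, young_student, default="no")
prediction_2 = classify(ordered, young_non_student, default="no")
```

For the first instance, the 0.90-confidence rule does not apply, and the age rule wins over the income rule on support. For the second instance, the more specific 0.90-confidence rule fires first.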
Classification by Aggregating Emerging Patterns
• Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others.
  • Age<=30 is frequent in class “buys_computer=yes” and infrequent in class “buys_computer=no”
  • Rule: age<=30 ⇒ buys computer
• G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99