Mining Frequent Itemsets with Constraints

Mining Frequent Itemsets with Constraints
Takeaki Uno, National Institute of Informatics, JAPAN
Nov/2005 FJWCP

Knowledge Discovery from Databases
・Finding interesting patterns from large-scale databases
・Applications in data engineering, bioinformatics, chemistry, management science, linguistics, etc.
[Figure: databases → extract → patterns, illustrated with tree-structured records (family, person, name, age, phone) and chemical structures (atoms C, H, O, N)]

Frequent Pattern Approach
・It is difficult to define “what is interesting” in mathematical terms
・A popular approach is to
 - enumerate candidates of knowledge,
 - filter them by some constraints to remove unnecessary patterns,
 - have a “look” at them (evaluate)
・Patterns frequently appearing in the database are good candidates for the task
[Figure: database → candidates → filtering; example: beer and nappy]

Frequent Pattern Approach
・Patterns with high frequencies are something “obvious” → we have to search among low-frequency patterns
・But there is a huge number of patterns with low frequencies
・Directly finding patterns satisfying the given constraints is important
In this talk, we focus on transaction databases, and show algorithms for efficiently finding frequent patterns that satisfy given constraints

Transaction Database
・Transaction database T: a database composed of transactions defined on an itemset E, i.e., t ⊆ E for every t ∈ T
 - basket data
 - links of web pages
 - words in documents
・A subset of E is called a pattern or itemset
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
・In practice, the size of T can be over a million, so operations on T take a long time

Occurrences of Pattern
・For a pattern P,
 - occurrence of P: a transaction in T including P
 - denotation of P: the set of occurrences of P
 - frequency of P: the size of the denotation of P
・P is frequent ⇔ the frequency of P is no less than θ
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Example: denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }
Example: patterns included in at least 3 transactions: {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
・Frequent pattern mining problem: given θ and T, find all frequent patterns
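The definitions on this slide can be sketched directly in Python. This is only an illustration on the six-transaction example database; the function names `denotation`, `frequency` and `is_frequent` are chosen here and do not come from the talk.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[int]

# Example database T from the slide (six transactions over items 1..9).
T: List[Transaction] = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def denotation(pattern: Iterable[int], database: List[Transaction]) -> List[Transaction]:
    """Return the occurrences of `pattern`: all transactions that include it."""
    p = frozenset(pattern)
    return [t for t in database if p <= t]


def frequency(pattern: Iterable[int], database: List[Transaction]) -> int:
    """Frequency of a pattern = size of its denotation."""
    return len(denotation(pattern, database))


def is_frequent(pattern: Iterable[int], database: List[Transaction], theta: int) -> bool:
    """A pattern is frequent if its frequency is no less than theta."""
    return frequency(pattern, database) >= theta


if __name__ == "__main__":
    print(denotation({1, 2}, T))      # the two transactions containing both 1 and 2
    print(frequency({2, 7, 9}, T))    # 3
    print(is_frequent({1, 2}, T, 3))  # False: only 2 occurrences
```

Scanning the whole database for every pattern is the simplest possible choice; it is what makes frequency computation the expensive step once T is large.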

Backtracking Algorithm
・“Frequent” is a monotone property (any subset of a frequent pattern is also frequent), so a backtracking algorithm works, starting from the empty set

  BackTrack(P)
    Output P
    For each item i > max. item of P
      If P ∪ {i} is frequent then call BackTrack(P ∪ {i})

・In practice, very fast
・Frequency computing is the heaviest part
[Figure: itemset lattice over items 1 to 4, from φ up to {1,2,3,4}]
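A minimal Python sketch of the BackTrack procedure, run on the six-transaction example database used elsewhere in the talk. The names `backtrack` and `frequent_patterns` are illustrative, and frequency is computed by a plain scan rather than by the optimized techniques the talk refers to.

```python
from typing import FrozenSet, List

# Example database from the talk (six transactions over items 1..9).
T: List[FrozenSet[int]] = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def frequent_patterns(database: List[FrozenSet[int]], theta: int) -> List[FrozenSet[int]]:
    """Enumerate all patterns whose frequency is at least theta by backtracking."""
    items = sorted(set().union(*database))
    found: List[FrozenSet[int]] = []

    def backtrack(pattern: FrozenSet[int]) -> None:
        found.append(pattern)                 # "Output P"
        tail = max(pattern, default=None)     # max. item of P (None for the empty set)
        for i in items:
            if tail is not None and i <= tail:
                continue                      # only add items larger than the tail
            extended = pattern | {i}
            # Frequency check by scanning the database: the heaviest part.
            if sum(1 for t in database if extended <= t) >= theta:
                backtrack(extended)

    if len(database) >= theta:                # the empty pattern occurs in every transaction
        backtrack(frozenset())
    return found


if __name__ == "__main__":
    for p in frequent_patterns(T, 3):
        print(sorted(p))
```

Adding only items larger than the current maximum means each pattern is reached along exactly one path, so no pattern is output twice and no memory of earlier outputs is needed.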

Evaluate Computation Time
・Enumeration takes a long time if there are many outputs, so we evaluate its efficiency by “computation time for each output” (throughput)
・Recent good implementations of frequent pattern mining take constant time for each output, if the number of outputs is large
・But dealing with constraint checking is not so trivial
We show some algorithms for
 - maximality in equivalence class,
 - constraints on items,
 - constraints on additional items (rules)

Decrease #Solutions: Closed Pattern
・Closed pattern: the maximal one among patterns with the same occurrences
・Closed patterns do not lose the information of occurrences
・Usually, #closed patterns < #frequent patterns
Enumerate all frequent closed patterns instead of frequent patterns
Our algorithm: LCM
[Figure: itemset lattice]
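To make the definition concrete, the sketch below computes the closure of a pattern as the intersection of its occurrences and checks closedness by comparing a pattern with its closure. It enumerates all subsets by brute force, which is only reasonable for this tiny example and is not how LCM works; the helper names are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, List

# Example database from the talk (six transactions over items 1..9).
T: List[FrozenSet[int]] = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def closure(pattern: FrozenSet[int], database: List[FrozenSet[int]]) -> FrozenSet[int]:
    """Maximal pattern with the same occurrences: the intersection of all occurrences."""
    occ = [t for t in database if pattern <= t]
    if not occ:
        raise ValueError("closure is only computed here for patterns with an occurrence")
    return frozenset.intersection(*occ)


def is_closed(pattern: FrozenSet[int], database: List[FrozenSet[int]]) -> bool:
    return closure(pattern, database) == pattern


def frequent_and_closed(database: List[FrozenSet[int]], theta: int):
    """Brute force over all subsets of items (fine for 9 items, hopeless in general)."""
    items = sorted(set().union(*database))
    frequent = []
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            p = frozenset(combo)
            if sum(1 for t in database if p <= t) >= theta:
                frequent.append(p)
    closed = [p for p in frequent if is_closed(p, database)]
    return frequent, closed


if __name__ == "__main__":
    frequent, closed = frequent_and_closed(T, 3)
    print(len(frequent), "frequent patterns,", len(closed), "of them closed")
    print(sorted(closure(frozenset({1}), T)))  # [1, 7, 9]
```

With θ = 3 on this database there are 12 frequent patterns (counting the empty set) but only 5 closed ones, which illustrates the reduction in the number of solutions.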

Enumeration by ppc Extension
・Any closed pattern is generated from another closed pattern by
 1. adding an item i, and
 2. taking the maximal pattern with the same occurrences
 (but it is not uniquely generated)
・The generation is a ppc extension (prefix preserving closure extension) if the two patterns have the same prefix (items < i)
・Any closed pattern is generated uniquely by a ppc extension of another closed pattern

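Example
・usual generation: acyclic graph
・ppc extension: tree
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Closed patterns: φ, {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}
[Figure: generation relation among the closed patterns, with the ppc extensions forming a tree rooted at φ]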
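The ppc extension can be sketched as a short recursive enumeration. This is a simplified illustration of the idea, not the LCM implementation: it recomputes occurrences by scanning the whole database, without the database reduction and other speed-ups discussed in the talk, and the function names are chosen here.

```python
from typing import FrozenSet, List, Optional

# Example database from the talk (six transactions over items 1..9).
T: List[FrozenSet[int]] = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def closed_patterns_by_ppc(database: List[FrozenSet[int]], theta: int) -> List[FrozenSet[int]]:
    """Enumerate frequent closed patterns, each exactly once, via ppc extensions."""
    items = sorted(set().union(*database))
    found: List[FrozenSet[int]] = []

    def expand(pattern: FrozenSet[int], core: Optional[int]) -> None:
        found.append(pattern)
        for i in items:
            if i in pattern or (core is not None and i <= core):
                continue                       # only items beyond the last added item
            occ = [t for t in database if (pattern | {i}) <= t]
            if len(occ) < theta:
                continue                       # extension is not frequent
            candidate = frozenset.intersection(*occ)   # step 2: take the closure
            # Prefix preserved: the closure added no item smaller than i.
            if all(j in pattern for j in candidate if j < i):
                expand(candidate, i)

    if len(database) >= theta:
        root = frozenset.intersection(*database)  # closure of the empty pattern
        expand(root, None)
    return found


if __name__ == "__main__":
    for p in closed_patterns_by_ppc(T, 1):
        print(sorted(p))
```

With θ = 1 this visits the ten closed patterns listed on the slide, each once, without storing previously found patterns for duplicate checks: candidates whose closure changes the prefix are simply discarded, because they are reached from a different parent.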

Time Complexity
・One ppc extension needs O(||T||) time (||T|| is the sum of the sizes of the transactions, i.e., ||T|| = ∑ |t| over all t ∈ T)
・There are at most |I| candidates for ppc extensions (I: pattern)
→ Closed patterns can be enumerated in O(|I| ||T||) time for each (without extra memory for previously found patterns)
・In practice, computation time can be smaller, by
 - recursively reducing the database,
 - generating candidates at once by sweeping the reduced database
→ O(1) time for each, if #outputs is large enough relative to the input size

Experimental Result
・Best implementation award at the FIMI ’04 implementation competition
・Usually much faster than other algorithms (except for dense databases)

Constraints on Weight, Size, etc.
・It is not difficult to add constraints w.r.t. weights
 - lower and/or upper bounds on
 - size, sum, max., min., ave., variance of
 - item or transaction weights
・If the constraints are anti-monotone, the time is still linear in #solutions
・Even if a constraint is monotone, the time is usually linear if #solutions is large
[Figure: itemset lattice]
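As an illustration of the anti-monotone case, the sketch below adds a constraint check to plain backtracking. The constraint (an upper bound on the sum of non-negative item weights) and the weights themselves (item i has weight i) are assumptions made for this example, not values from the talk.

```python
from typing import Callable, FrozenSet, List

# Example database from the talk (six transactions over items 1..9).
T: List[FrozenSet[int]] = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]

# Hypothetical item weights for illustration: item i has weight i.
WEIGHT = {i: float(i) for i in range(1, 10)}


def weight_sum_at_most(bound: float) -> Callable[[FrozenSet[int]], bool]:
    """Anti-monotone when weights are non-negative: a violator's supersets also violate."""
    return lambda pattern: sum(WEIGHT[i] for i in pattern) <= bound


def constrained_patterns(database, theta: int, satisfies) -> List[FrozenSet[int]]:
    """Backtracking that prunes as soon as the anti-monotone constraint fails."""
    items = sorted(set().union(*database))
    found: List[FrozenSet[int]] = []

    def expand(pattern: FrozenSet[int]) -> None:
        found.append(pattern)
        tail = max(pattern, default=None)
        for i in items:
            if tail is not None and i <= tail:
                continue
            extended = pattern | {i}
            if not satisfies(extended):
                continue   # safe to prune: no superset can satisfy the constraint
            if sum(1 for t in database if extended <= t) >= theta:
                expand(extended)

    if len(database) >= theta and satisfies(frozenset()):
        expand(frozenset())
    return found


if __name__ == "__main__":
    for p in constrained_patterns(T, 3, weight_sum_at_most(10)):
        print(sorted(p))
```

Because both frequency and the weight bound are preserved under taking subsets, every reported pattern is reached through reported patterns only, which is why the running time stays proportional to the number of solutions. A constraint without this property cannot be used for pruning in this way.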

Non-monotone Constraints
・Hardness depends on the properties of the constraints
 - Find patterns with constraints, then check frequency & closedness
 - Find closed patterns, then check the constraints
・Especially, if constraints are given on the items (e.g., include A or B if it includes C), the time for checking is much shorter than frequency computing → slight increase of computation time
 - Logical constraints
 - Highly dependent patterns (frequency >> Π frequency of its items)
[Figure: itemset lattice with regions labeled SLOW, FAST, FAST, FAST]
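The second strategy (find closed patterns, then check the constraints) can be sketched for a logical constraint on items. The closed patterns are those of the six-transaction example database; the concrete rule "if the pattern includes 9, it must include 1 or 2" is made up here to mirror the slide's "include A or B if it includes C".

```python
from typing import FrozenSet, Iterable, List

# Closed patterns of the example database (threshold 1), as listed on the Example slide.
CLOSED: List[FrozenSet[int]] = [
    frozenset(), frozenset({2}), frozenset({7, 9}), frozenset({2, 5}),
    frozenset({1, 7, 9}), frozenset({2, 7, 9}), frozenset({1, 2, 7, 9}),
    frozenset({2, 3, 4, 5}), frozenset({1, 2, 7, 8, 9}), frozenset({1, 2, 5, 6, 7, 9}),
]


def implies_one_of(pattern: FrozenSet[int], premise: int, alternatives: Iterable[int]) -> bool:
    """Logical item constraint: if `premise` is in the pattern, some alternative must be too."""
    return premise not in pattern or any(a in pattern for a in alternatives)


def filter_patterns(patterns: List[FrozenSet[int]], premise: int, alternatives) -> List[FrozenSet[int]]:
    """Post-filter: the check touches only the pattern itself, never the database."""
    alts = list(alternatives)
    return [p for p in patterns if implies_one_of(p, premise, alts)]


if __name__ == "__main__":
    # Hypothetical constraint: a pattern containing 9 must also contain 1 or 2.
    for p in filter_patterns(CLOSED, premise=9, alternatives=[1, 2]):
        print(sorted(p))
```

The check costs time proportional to the pattern size only, which is small next to a frequency computation over the database; this is the sense in which such constraints add only a slight overhead.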

Association Rule Mining
・An association rule is a rule of the form (a,b,c) → d
・If the transactions including d (or not including d) make up a high ratio of the transactions including (a,b,c), the rule (a,b,c) → d (or (a,b,c) → ¬d) is reliable and characterizes the database
Finding good rules is an important problem
・(a,b,c) has to be frequent, so that the rule is common in the database
・However, evaluating the ratio for each pair of closed pattern and item in a simple way takes a very long time

Occurrence Deliver
・Compute the denotations of P ∪ {i} for all i’s at once
・Check the frequency for all items to be added in time linear in the database size
・Frequency of item = reliability of rule → computed in short time
T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }, P = {1,7}
[Figure: each occurrence of P delivered to the buckets of the items it contains]
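A minimal sketch of occurrence deliver for the slide's example P = {1,7}: one sweep over the occurrences of P fills a bucket per item, so the denotation of P ∪ {i} is available for every i at once. The function names are illustrative, and the rule reliability is computed as the ratio described on the association rule slide.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# Example database from the talk (six transactions over items 1..9).
T: List[FrozenSet[int]] = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def occurrence_deliver(
    pattern: FrozenSet[int], database: List[FrozenSet[int]]
) -> Tuple[List[FrozenSet[int]], Dict[int, List[FrozenSet[int]]]]:
    """Return occ(P) and, for every item i not in P, the denotation of P ∪ {i}."""
    occurrences = [t for t in database if pattern <= t]
    buckets: Dict[int, List[FrozenSet[int]]] = defaultdict(list)
    for t in occurrences:            # one sweep over the occurrences of P
        for i in t:
            if i not in pattern:
                buckets[i].append(t)  # deliver t to the bucket of item i
    return occurrences, dict(buckets)


def rule_reliability(pattern: FrozenSet[int], database: List[FrozenSet[int]]) -> Dict[int, float]:
    """Ratio of occurrences of P that also contain i, for each rule P → i."""
    occurrences, buckets = occurrence_deliver(pattern, database)
    if not occurrences:
        return {}
    return {i: len(b) / len(occurrences) for i, b in buckets.items()}


if __name__ == "__main__":
    for item, ratio in sorted(rule_reliability(frozenset({1, 7}), T).items()):
        print(f"{{1,7}} -> {item}: {ratio:.2f}")
```

The sweep touches each item of each occurrence once, so its cost is linear in the total size of the occurrences of P, instead of one database scan per candidate item.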

Conclusion
・We have seen algorithms for enumerating frequent patterns with constraints
・Closed patterns: decreasing #solutions without losing information
 - algorithm LCM
・Closed patterns with monotone/anti-monotone/general constraints
・Rule mining with closed patterns
・Closed patterns for other kinds of patterns: in the algorithmic sense this can be done; the question is how to implement it in a simple and easy way
