Efficiently Mining Long Patterns from Databases

Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Abstract • Max-Miner : scale roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern • previous algorithms : scale exponentially with longest pattern length

Introduction (Cont’d) • pattern mining algorithms have been developed to operate on databases where the longest patterns are relatively short • Interesting data-sets with long patterns • sales transactions detailing the purchases made by regular customers over a large time window • biological data from the fields of DNA and protein analysis

Introduction (Cont’d) • Apriori-like algorithms are inadequate on data-sets with long patterns • a bottom-up search • enumerates every single frequent itemset • exponential complexity • in order to produce a frequent itemset of length l, it must produce all 2l of its subsets since they too must be frequent • restricts Apriori-like algorithms to discovering only short patterns

Introduction • Max-Miner algorithm • for efficiently extracting only the maximal frequent itemsets • roughly linear in the number of maximal frequent itemsets • “look ahead” , not bottom-up search • can prune all its subsets from consideration, by identifying a long frequent itemset early on

Max-Miner (Cont’d) • Rymon’s generic set-enumeration tree search frame work • ex. Figure 1. • breadth-first search • in order to limit the number of passes • pruning strategies • subset infrequency pruning (as does Apriori) • superset frequency pruning

{ } 1 2 3 4 1,2 1,3 1,4 2,3 2,4 3,4 1,2,3 1,3,4 2,3,4 1,2,3,4 Figure 1. Complete set-enumeration tree over four items

Max-Miner (Cont’d) • candidate group g, • head, h(g) • represents the itemsets enumerated by the node • tail, t(g) • an ordered set • contains all items not in h(g) that can potentially appear in any sub-node • ex. the node enumerating itemset {1} •  h(g) = {1}, t(g) = {2, 3, 4}

Max-Miner • counting the support of a candidate group g, • computing the support of itemsets h(g), h(g)  t(g) and h(g)  {i} for all i  t(g) • superset-frequency pruning • halting sub-node expansion at any candidate group g for which h(g)  t(g) is frequent • subset-infrequency pruning • removing any such tail item from candidate group before expanding its sub-nodes

MAX-MINER (Data-set T) ;; Returns the set of maximal frequent itemsets present in T Set of Candidate Groups C  { } Set of Itemsets F  {GEN-INITIAL-GROUP(T, C)} while C is non-empty do scan T to count the support of all candidate groups in C for each g  C such that h(g)  t(g) is frequent do F  F  {h(g)  t(g)} Set of Candidate Groups Cnew  { } for each g  C such that h(g)  t(g) is infrequent do F  F  {GEN-SUB-NODES(g, Cnew)} C  Cnew remove from F any itemset with a proper superset in F remove from C any group g such that h(g)  t(g) has a superset in F return F Figure 2. Max-Miner at its top level

GEN-INITIAL-GROUPS (Data-set T, Set of Candidate Groups C) ;; C is passed by reference and returns the candidate groups ;; The return value of the function is a frequent 1-itemset scan T to obtain F1, the set of frequent 1-itemsets impose an ordering on the items in F1 for each item i in F1 other than the greatest item do let g be a new candidate with h(g) = {i} and t(g) = {j | j follows i in the ordering} C  C  {g} return the itemset in F1 containing the greatest item Figure 3. Generating the initial candidate groups

GEN-SUB-NODES (Candidate Group g, Set of Cand. Groups C) ;; C is passed by reference and returns the sub-nodes of g ;; The return value of the function is a frequent itemset remove any item i from t(g) if h(g)  {i} is infrequent reorder the items in t(g) for each i  t(g) other than greatest do let g’ be a new candidate with h(g’) = h(g)  {i} and t(g’) = { j | j  t(g) and j follows i in t(g)} C  C  {g’} return h(g)  {m} where m is the greatest item in t(g), or h(g) if t(g) is empty Figure 4. Generating sub-nodes

Item Ordering Policies • to increase the effectiveness of superset-frequency pruning • to position the most frequent items last • ordering : Gen-Initial-Group • in increasing order of sup({i}) • reordering : Gen-Sub-Nodes • in increasing order of sup(h(g)  {i}) • consider only the subset of transactions relevant to the given node

Efficiently Mining Long Patterns from Databases

Efficiently Mining Long Patterns from Databases

Presentation Transcript

Technologies for Mining Frequent Patterns in Large Databases

Profit Mining: From Patterns to Action

Mining Sequential Patterns

Mining Sequential Patterns

Mining Sequential Patterns

Mining Logs for Long-Term Patterns

Constraint Mining of Frequent Patterns in Long Sequences

Mining Graph Patterns Efficiently via Randomized Summaries

PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Mining Frequent Patterns

Mining Sequential Patterns

Mining Patterns from Protein Structures

Data Mining: Concepts and Techniques Mining sequence patterns in transactional databases

Mining Sequence Patterns in Transactional Databases

Mining Multidimensional Databases

IncSpan: Incremental Mining of Sequential Patterns in Large Databases

Mining Probabilistically Frequent Sequential Patterns in Uncertain Databases

PrefixSpan﹕ Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Mining Graph Patterns Efficiently via Randomized Summaries

Mining Multidimensional Databases

Mining Sequential Patterns

Mining Sequential Patterns