Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration

Takeaki Uno (1), Hiroki Arimura (2)
(1) National Institute of Informatics, JAPAN (The Graduate University for Advanced Studies)
(2) Hokkaido University, JAPAN
May 25, 2008, PAKDD 2008

Frequent Pattern Mining
• The problem of finding all patterns that appear frequently in a given database
  database: transaction database (itemsets), tree, graph, vector
  patterns: itemset, tree, path/cycle, graph, geometric graph, …
[Figure: "Extract frequently appearing patterns", shown for two example databases.
  experiments: ・experiment 1 ●, experiment 3 ▲ ・experiment 2 ●, experiment 4 ● ・experiment 2 ●, experiment 3 ▲, experiment 4 ● ・experiment 2 ▲, experiment 3 ▲ …
  genome: sequence data ATGCGCCGTA TAGCGGGTGG TTCGCGTTAG GGATATAAAT GCGCCAAATA ATAATGTATTA TTGAAGGGCG ACAGTCTCTCA ATAAGCGGCT; patterns ・ATGCAT ・CCCGGGTAA ・GGCGTTA ・ATAAGGG …]

Research on Pattern Mining
• There are many studies and applications on itemsets, sequences, trees, graphs, and geometric graphs
• Thanks to efficient algorithms, we can say that any simple structure can be enumerated in practically short time
• One of the next problems is "how to handle noise, error, and ambiguity"
  the usual "inclusion" is too strict
  we want to find patterns "mostly" included in many records
We consider ambiguous appearance of patterns

Related Works on Ambiguity
• Detecting "ambiguous XXXX" is popular
  dense substructures: clustering, community discovery, …
  homology search on genome sequences
• Heuristic search is popular because modeling and computation are difficult
  Advantage: usually works efficiently
  Problem: not easy to understand "what is found"
  Problem: additional conditions cost much more (for each solution)
• Here we look at the problem from an "algorithmic point of view" (efficient models arising from efficient computation)

Itemset Mining
• In this talk, we focus on itemset mining
  transaction database D: each record, called a transaction, is a subset of the itemset E, that is, ∀T ∈ D, T ⊆ E
  Occ(P): the set of transactions including P
  frq(P) = |Occ(P)|: the number of transactions including P
  P is a frequent itemset ⇔ frq(P) ≥ σ (σ is the minimum support)
• The problem is to enumerate all frequent itemsets in D
We introduce ambiguous inclusion for frequent itemset mining
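The definitions above can be sketched in a few lines of Python. This is only an illustration: the toy database `DB`, its transaction ids, and the function names `occ`, `frq`, and `is_frequent` are assumptions made for the example, not part of the talk.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy database: transaction id -> set of items (subset of E).
DB: Dict[str, FrozenSet[int]] = {
    "t1": frozenset({1, 3, 4}),
    "t2": frozenset({2, 4, 5}),
    "t3": frozenset({1, 2}),
}


def occ(pattern: Set[int], db: Dict[str, FrozenSet[int]]) -> Set[str]:
    """Occ(P): ids of the transactions that include every item of P."""
    return {tid for tid, t in db.items() if pattern <= t}


def frq(pattern: Set[int], db: Dict[str, FrozenSet[int]]) -> int:
    """frq(P) = |Occ(P)|."""
    return len(occ(pattern, db))


def is_frequent(pattern: Set[int], db: Dict[str, FrozenSet[int]], sigma: int) -> bool:
    """P is a frequent itemset iff frq(P) >= sigma."""
    return frq(pattern, db) >= sigma
```

With this exact inclusion, item 4 appears in two transactions, while the pair {1,2} appears in only one.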

Related works
• fault-tolerant pattern, degenerate pattern, soft occurrence, etc.
  mainly two approaches
(1) generalize inclusion:
  (1-a) a transaction "includes" the itemset if the ratio of included items is ≥ θ
    loses monotonicity; no subset may be frequent in the worst case
    several heuristic-search-based algorithms
  (1-b) a transaction "includes" the itemset if at most k items are not included
    satisfies monotonicity; so many small itemsets are frequent
    maximal enumeration, or complete enumeration with small k
[Figure: example with θ = 66%: 1,2 / 2,3 / 1,3]

Related works 2
(2) find pairs of an itemset and a transaction set such that few of them do not satisfy inclusion
  equivalent to finding a dense submatrix, or dense bicluster; so many equivalent patterns will be found
  mainly, heuristic search for finding one such dense substructure
• ambiguity on the transaction set
  an itemset can have many partners
[Figure: item-transaction matrix with axes "items" and "transactions"]
We introduce a new model for (2) to avoid redundancy, and propose an efficient depth-first-search-type algorithm

Average Inclusion
• inclusion ratio of t for P ⇔ |t ∩ P| / |P|
• average inclusion ratio of transaction set T for P ⇔ the average of the inclusion ratio over all transactions in T
  ∑_{t ∈ T} |t ∩ P| / ( |P| × |T| )
  equivalent to the density of a submatrix/subgraph of the transaction-item inclusion matrix/graph
• For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ the maximum size of a transaction set whose average inclusion ratio is ≥ θ
Example (transactions 1,3,4 / 2,4,5 / 1,2): itemset 2,3: 50%; itemset 4,5: 50%; itemset 1,2: 66%
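As a quick check of the example on this slide, the two ratios can be computed directly. The sketch below uses exact fractions; the names `TRANSACTIONS`, `inclusion_ratio`, and `average_inclusion_ratio` are chosen for illustration only.

```python
from fractions import Fraction

# The three transactions from the slide's example.
TRANSACTIONS = [{1, 3, 4}, {2, 4, 5}, {1, 2}]


def inclusion_ratio(t, pattern):
    """|t ∩ P| / |P| for one transaction t."""
    return Fraction(len(t & pattern), len(pattern))


def average_inclusion_ratio(transactions, pattern):
    """Sum over t in T of |t ∩ P|, divided by |P| * |T|."""
    total = sum(len(t & pattern) for t in transactions)
    return Fraction(total, len(pattern) * len(transactions))


for p in ({2, 3}, {4, 5}, {1, 2}):
    print(sorted(p), average_inclusion_ratio(TRANSACTIONS, p))
```

The three itemsets give 1/2, 1/2, and 2/3, which match the 50%, 50%, and 66% shown on the slide.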

Problem Definition
• For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ the maximum size of a transaction set whose average inclusion ratio is ≥ θ
• Ambiguous frequent itemset: an itemset P s.t. cov(P) ≥ σ (σ: minimum support)
• Ambiguous frequent itemsets are not monotone!
  Example (θ = 66%, transactions 1,3,4 / 2,4,5 / 1,2): cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3
Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support σ
The goal is to develop an efficient algorithm for this problem
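A minimal sketch of cov(P), assuming that "66%" on the slide stands for θ = 2/3. Sorting the transactions by |t ∩ P| in decreasing order makes the prefix averages non-increasing, so the longest prefix that still meets θ has the maximum size. The function name `cov` follows the slide; everything else is illustrative.

```python
from fractions import Fraction

TRANSACTIONS = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
THETA = Fraction(2, 3)  # assumption: the slide's "66%" read as 2/3


def cov(pattern, transactions, theta):
    """Maximum size of a transaction set with average inclusion ratio >= theta."""
    if not pattern:
        # Convention assumed here: the empty itemset is included in everything.
        return len(transactions)
    hits = sorted((len(t & pattern) for t in transactions), reverse=True)
    best, total = 0, 0
    for k, h in enumerate(hits, start=1):
        total += h
        if Fraction(total, len(pattern) * k) >= theta:
            best = k
        else:
            break  # prefix averages only decrease from here on
    return best


for p in ({3}, {2}, {1, 3}, {1, 2}):
    print(sorted(p), cov(p, TRANSACTIONS, THETA))
```

The output reproduces the four values on the slide, including the non-monotone pair: cov({3}) = 1 but cov({1,3}) = 2.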

Hardness for Branch-and-Bound
• A straightforward approach to this problem is branch-and-bound
• In each iteration, divide the problem into two non-empty problems by the inclusion of an item
Checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1)
[Figure: reduction instance with labels i1, v1 and repeated records "i1, i2"]

Is This Really Hard?
• We proved NP-hardness for "very dense graphs"
  unclear for graphs of middle density
  polynomial time enumeration is not impossible: polynomial time in (input size) + (output size)
[Figure: θ axis from θ = 1 to θ = 0, with regions labeled "hard", "easy", "?????", "easy"]

Efficient Algorithm: Idea of Reverse Search
• We don't use branch and bound, but use reverse search
• Define an acyclic parent-child relation on all objects to be found
  Depth-first search on the rooted tree induced by the relation
  Recursively find children to search; thus an algorithm for finding all children is sufficient
[Figure: tree over the "objects" to be enumerated]

Neighboring Relation
• AmbiOcc(P) of an ambiguous frequent itemset P ⇔ the lexicographically minimum one among the transaction sets whose average inclusion ratio is ≥ θ and whose size = cov(P)
• e*(P): the item e in P s.t. the number of transactions in AmbiOcc(P) including e is the minimum (ties are broken by taking the minimum index)
• The parent Prt(P) of P: P \ e*(P)
Example (θ = 66%, σ = 4), database A: 1,3,4,7  B: 2,4,5  C: 1,2,7  D: 1,4,5,7  E: 2,3,6  F: 3,4,6
  P = {1,4,5}: transactions in decreasing order of inclusion ratio: D, A, B, C, F, E; AmbiOcc({1,4,5}) = {D,A,B,C}
  e*(P) = 5, so Prt({1,4,5}) = {1,4}; AmbiOcc({1,4}) = {D,A,B,C,F}
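The example on this slide can be reproduced with a short sketch. It assumes θ = 2/3 for "66%" and builds AmbiOcc with the greedy order described on the later "Computing AmbiOcc and e*" slide (decreasing inclusion ratio, ties broken by index). The function names `ambi_occ`, `e_star`, and `parent` are illustrative.

```python
from fractions import Fraction

# The six-transaction database from the slide.
DB = {
    "A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
    "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6},
}
THETA = Fraction(2, 3)  # assumption: "66%" read as 2/3


def ambi_occ(pattern, db, theta):
    """Greedy AmbiOcc(P): take transactions by decreasing |t ∩ P|, ties by id."""
    if not pattern:
        return sorted(db)
    order = sorted(db, key=lambda tid: (-len(db[tid] & pattern), tid))
    chosen, total = [], 0
    for tid in order:
        total += len(db[tid] & pattern)
        if Fraction(total, len(pattern) * (len(chosen) + 1)) < theta:
            break
        chosen.append(tid)
    return chosen


def e_star(pattern, db, theta):
    """Item of P contained in the fewest transactions of AmbiOcc(P)."""
    occ = ambi_occ(pattern, db, theta)
    return min(pattern, key=lambda e: (sum(e in db[tid] for tid in occ), e))


def parent(pattern, db, theta):
    """Prt(P) = P minus e*(P)."""
    return set(pattern) - {e_star(pattern, db, theta)}


print(ambi_occ({1, 4, 5}, DB, THETA))  # greedy order of AmbiOcc({1,4,5})
print(e_star({1, 4, 5}, DB, THETA), parent({1, 4, 5}, DB, THETA))
```

For P = {1,4,5} the sketch selects D, A, B, C, finds e*(P) = 5, and returns the parent {1,4}, as in the slide's example.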

Properties of Parent
• The parent Prt(P) of P: P \ e*(P)
  uniquely defined
• The average inclusion ratio of AmbiOcc(P) does not decrease when P is replaced by Prt(P)
  Prt(P) is an ambiguous frequent itemset
• |Prt(P)| < |P| (the parent is always smaller)
  the relation is acyclic, and induces a tree (rooted at φ)
Example (θ = 66%, σ = 4), database A: 1,3,4,7  B: 2,4,5  C: 1,2,7  D: 1,4,5,7  E: 2,3,6  F: 3,4,6
  P = {1,4,5}: transactions in decreasing order of inclusion ratio: D, A, B, C, F, E; AmbiOcc({1,4,5}) = {D,A,B,C}
  e*(P) = 5, so Prt({1,4,5}) = {1,4}; AmbiOcc({1,4}) = {D,A,B,C,F}
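The second property follows from a short averaging argument. In the derivation below, T = AmbiOcc(P), and the count c_e (the number of transactions of T that contain item e) is notation introduced here for illustration; the case considered is |P| ≥ 2.

```latex
\[
\sum_{t \in T} |t \cap P| \;=\; \sum_{e \in P} c_e ,
\qquad
c_{e^*(P)} \;=\; \min_{e \in P} c_e \;\le\; \frac{1}{|P|} \sum_{e \in P} c_e .
\]
\[
\frac{\sum_{e \in P} c_e \;-\; c_{e^*(P)}}{(|P| - 1)\,|T|}
\;\ge\;
\frac{\left(1 - \frac{1}{|P|}\right) \sum_{e \in P} c_e}{(|P| - 1)\,|T|}
\;=\;
\frac{\sum_{e \in P} c_e}{|P|\,|T|}
\;\ge\; \theta .
\]
```

The left-hand side is the average inclusion ratio of T for Prt(P), so T is a transaction set of size cov(P) that meets θ for Prt(P). Hence cov(Prt(P)) ≥ cov(P) ≥ σ.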

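Enumeration Tree
• The relation is acyclic, and induces a tree (rooted at φ)
• We call this tree the enumeration tree
[Figure: enumeration tree for θ = 66%, σ = 4 on the database A: 1,3,4,7  B: 2,4,5  C: 1,2,7  D: 1,4,5,7  E: 2,3,6  F: 3,4,6. Nodes shown, by itemset size:
  size 0: φ
  size 1: 1; 2; 3; 4; 7
  size 2: 1,4; 3,4; 4,5; 4,7; 1,7
  size 3: 1,4,5; 1,3,4; 1,4,7; 3,4,7; 4,5,7; 1,2,7; 1,3,7; 1,5,7
  size 4: 1,3,4,7; 1,4,5,7]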

Listing Children
• To perform a depth-first search on the enumeration tree, what we have to do is "find all children of a given itemset"
• P = Prt(P') is obtained by removing an item from P'
  a child P' of P is obtained by adding an item to P
  to find all children, we examine all possible items
[Figure: enumeration tree over the itemsets, rooted at φ]

Check Candidates
• An item addition does not always yield a child
  the results are just "candidates"
• If the parent of a candidate P' = P∪e is P (that is, e*(P') = e), then P' is a child of P
  check by computing e*(P∪e) for each candidate P∪e
Theorem: Enumeration is done in O(||D||n) time for each ambiguous frequent itemset
[Figure: enumeration tree over the itemsets, rooted at φ]

Algorithm Description
Algorithm AFIM ( P: pattern, D: database )
  output P
  compute cov(P∪e) for all items e not in P
  for each e s.t. cov(P∪e) ≥ σ do
    compute AmbiOcc(P∪e)
    compute e*(P∪e)
    if e*(P∪e) = e then call AFIM ( P∪e, D )
  done
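The pseudocode above can be turned into a small runnable sketch. It is a direct, unoptimized reading of AFIM rather than the authors' C implementation: cov and AmbiOcc are recomputed from scratch for every candidate, θ = 2/3 is assumed for "66%", and the helper names (`greedy_prefix`, `e_star`, `afim`, `mine`) are illustrative.

```python
from fractions import Fraction

DB = {
    "A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
    "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6},
}


def greedy_prefix(pattern, db, theta):
    """AmbiOcc(P) by the greedy order; its length is cov(P)."""
    if not pattern:
        return sorted(db)  # assumed convention: phi is included in everything
    order = sorted(db, key=lambda tid: (-len(db[tid] & pattern), tid))
    chosen, total = [], 0
    for tid in order:
        total += len(db[tid] & pattern)
        if Fraction(total, len(pattern) * (len(chosen) + 1)) < theta:
            break
        chosen.append(tid)
    return chosen


def e_star(pattern, occ, db):
    """Item of P in the fewest transactions of occ; ties go to the smaller item."""
    return min(pattern, key=lambda e: (sum(e in db[tid] for tid in occ), e))


def afim(pattern, db, theta, sigma, items, out):
    """Reverse search: output P, then recurse into every child of P."""
    out.append(pattern)
    for e in sorted(items - pattern):
        cand = pattern | {e}
        occ = greedy_prefix(cand, db, theta)
        if len(occ) < sigma:
            continue  # cov(P ∪ e) < sigma: not an ambiguous frequent itemset
        if e_star(cand, occ, db) == e:  # Prt(P ∪ e) = P, so it is a child
            afim(cand, db, theta, sigma, items, out)
    return out


def mine(db, theta, sigma):
    """Enumerate all ambiguous frequent itemsets, starting from the root phi."""
    if len(db) < sigma:
        return []
    items = frozenset().union(*db.values())
    return afim(frozenset(), db, theta, sigma, items, [])


result = mine(DB, Fraction(2, 3), 4)
print(len(result), sorted(sorted(p) for p in result if len(p) <= 2))
```

Because each itemset has exactly one parent, every ambiguous frequent itemset is output once, without storing previously found itemsets for duplicate checks.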

Efficient Computation of cov's
• For efficient computation, we classify transactions by inclusion ratio
• When we compute cov(P∪e), we compute the intersection of each group and Occ(e)
  the inclusion ratio increases for transactions included in Occ(e)
  by moving such transactions, the classification for P∪e is obtained
• This task is done efficiently for all items by Delivery, which takes O(||G||) time, where ||G|| is the sum of transaction sizes in group G
  computation of cov(P∪e) can be done in linear time
[Figure: transaction groups labeled "0 miss", "1 miss", "2 miss", "3 miss", "4 miss", "5 miss"]
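The grouping-and-moving idea can be sketched as follows. Transactions are grouped by the number of items of P they contain; one scan of a group (the Delivery step) buckets its transactions by item, and the bucket for e tells which transactions move up one group for P∪e. This is an illustrative sketch with assumed names (`deliver`, `hit_histograms`, `cov_from_histogram`). For simplicity it loops over all items for every group, so it does not achieve the O(||G||) bound stated on the slide.

```python
from collections import defaultdict
from fractions import Fraction

DB = {
    "A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
    "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6},
}
THETA = Fraction(2, 3)  # assumption: "66%" read as 2/3


def deliver(group, db):
    """One scan of a group: buckets[e] = transactions of the group containing e."""
    buckets = defaultdict(list)
    for tid in group:
        for e in db[tid]:
            buckets[e].append(tid)
    return buckets


def hit_histograms(pattern, db):
    """hist[e][h] = number of transactions containing exactly h items of P ∪ e."""
    groups = defaultdict(list)
    for tid, t in db.items():
        groups[len(t & pattern)].append(tid)
    items = set().union(*db.values()) - pattern
    hist = {e: defaultdict(int) for e in items}
    for h, group in groups.items():
        buckets = deliver(group, db)
        for e in items:
            moved = len(buckets.get(e, ()))  # these gain one more included item
            hist[e][h + 1] += moved
            hist[e][h] += len(group) - moved
    return hist


def cov_from_histogram(hist, size, theta):
    """cov of an itemset of the given size, read off its hit-count histogram."""
    best = total = k = 0
    for h in sorted(hist, reverse=True):
        for _ in range(hist[h]):
            k += 1
            total += h
            if Fraction(total, size * k) < theta:
                return best
            best = k
    return best


hist = hit_histograms({1, 4}, DB)
print({e: cov_from_histogram(hist[e], 3, THETA) for e in sorted(hist)})
```

For P = {1,4}, the histogram for item 5 has one transaction with 3 hits, two with 2, two with 1, and one with 0, which gives cov({1,4,5}) = 4.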

Computing AmbiOcc and e*
• Computation of AmbiOcc(P∪e) needs a greedy choice of transactions, in decreasing order of (inclusion ratio, index)
• Computation of e*(P∪e) needs the intersection of AmbiOcc(P∪e) and Occ(i) for each i ∈ P
  done by Delivery
  needs O(||D||) time in the worst case
• However, when cov(P) is small, not many transactions need to be scanned, so we expect the average computation time to be short

Bottom-wideness
• The depth-first search generates several recursive calls in each iteration
  the recursion tree grows exponentially going down
  computation time is dominated by the lowest levels
• Computation time per iteration decreases going down
Near the bottom levels, the number of transactions scanned may be close to σ, so an iteration may take O(σt) time, where t is the average size of transactions
[Figure: recursion tree, labeled "long time" at the top levels and "short time" at the bottom levels]

Computational Experiments
CPU: Pentium M 1.1GHz, memory: 256MB
OS: Windows XP + Cygwin
Code: C, Compiler: gcc 2.3
• Test instances are taken from benchmark datasets for frequent itemset mining

BMS-WebView 2
• Real-world web access data (sparse; transaction size = 4.5)

Mushroom
• Real-world machine learning data on mushrooms (density = 1/3)

Possibility for Further Improvements •Ratio of unnecessary operations, non-maximal patterns

Conclusion
• Introduced a new model for frequent itemset mining with an ambiguous inclusion relation, which avoids redundancy
• Showed a hardness result for branch-and-bound
• Showed efficiency on practical (sparse) datasets
Future Works:
• Reduce the time complexity and close the gap between theory and practice
• Efficient models and computation for maximal ones
• Application of the technique to other problems (ambiguous pattern mining for graph, tree, vector data, etc.)
