1 / 33

A Top-Down Row Enumeration Approach of Frequent Patterns from Very High Dimensional Data

A Top-Down Row Enumeration Approach of Frequent Patterns from Very High Dimensional Data. Jiaofen Xu. The Presentaion-Based Paper. Top-Down Mining of Frequent Patterns from Very High Dimensional Data. Hongyan Liu, Jiawei Han, Dong Xin and Zheng Shao. Outline. . Introduction

dagan
Download Presentation

A Top-Down Row Enumeration Approach of Frequent Patterns from Very High Dimensional Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Top-Down Row Enumeration Approach of Frequent Patterns from Very High Dimensional Data Jiaofen Xu

  2. The Presentaion-Based Paper • Top-Down Mining of Frequent Patterns from Very High Dimensional Data. • Hongyan Liu, Jiawei Han, Dong Xin and Zheng Shao

  3. Outline  • Introduction • Preliminaries • Algorithm • Experimental Study • Conclusion

  4. What is high dimensional data? • The dimension of the data being in the hundreds or thousands. e.g. in text/web mining and bioinformatics. • A specific kind of high dimensional data set, which contain as and a large number of tuples. many as tens of thousands of columns but only a hundred or a thousand rows, such as microarray data. • Different from transactional data set, which usually have a small number of columns and a large number of rows.

  5. Frequent Pattern Mining Frequent Close Pattern Mining For frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern.

  6. Column enumeration & row enumeration An example table T Simple~~ a1, a2, b1, b2, c1, c2, d1, d2, d3 1, 2, 3, 4, 5 a1b1, a1b2, a1c1, a1c2, a1d1, a1d2, a1d3 , …… 12, 13, 14, 15, 23, 24, 25, …… a1b1c1, a1b2c1, a1c1d1, a1c2d1, a1c1d2, a1c2d2, a1c1d3, a1c2d3, …… 123, 124, 125, 134, 135, 145, 234, 235, 245, ……

  7. Why are the current column enumeration-based frequent pattern mining methods not suitable? Motivations Would row enumeration- based method generates less? The other reason is that with just a small number Of rows (samples), column-enumeration methods cannot get sufficient support to generate frequent pattern. Column enumeration- Based algorithms take column(item) combination Space as search space. For 55555 markers, the number of possible frequent patterns is 2 55555 . Notice that, the kind of high dimensionality Datasets we deal with typically contains as many as Tens of thousands of columns, but only a hundred or a thousand rows

  8. State of the art • Bottom-up row enumeration-based method F. Pan, G. cong, A.K.H. Tung, J. Yang, and M.J. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In Proc. 2003 ACM SIGKDD Int. conf. However, the bottom-up search strategy checks row combinations from the smallest to the largest, it cannot make full use of the minimum support threshold to prune search space. • Top-down row enumeration-based method

  9. Contributions of the paper This is critical for mining high dimensional data, because the dataset is usually big, and without pruning the huge search space, one has to generate a very large set of candidate itemsets for checking. In CARPENTER, the closeness-checking method is that before outputting each itemset found currently, we must check if it is already found before. If not, output it. Otherwise, discard it. • A top-down search method is proposed to take advantage of the pruning power of minimum support threshold, which can cut down the search space dramatically. • A new method, called closeness-checking, is developed to check efficiently and effectively whether a pattern is closed. It does not need to scan the mining data set, nor the result set, and is easy to integrate with the top-down search process.

  10. Outline  • Introduction • Preliminaries • Algorithm • Experimental Study • Conclusion

  11. Table and transposed table minsup =2 Transposed table TT Table TT is already pruned by minsup. For clarity, we call each row of TT a tuple. Original table T

  12. Closed itemset and closed rowset • Definition 1 (Closure): Given an itemset I and a rowset S, define • Based on these definitions, we define C(I) as the closure of an itemset I, and C(S) as the closure of a rowset S as follows: • Definition 2 (Closed itemset and closed rowset): An itemset I is called a closed itemset iff I=C(I). Likewise, a rowset S is called a closed rowset iff S= C(S). • Definition 3 (Frequent itemset and large rowset): Given minsup, an itemset I is called frequent if |r(I)| ≥ minsup, where |r(I)| is called the support of itemset I, and a roset S is called large if |S| ≥ minsup, where |S| is called the size of rowset S. • Further, an itemset I is called frequent closed itemset if it is both closed and frequent. Likewise, a rowset S is called large closed rowset if it is both closed and large.

  13. Example • For an itemset {b1, c2}, r({b1, c2})= {2, 4}, and i({2, 4})={b1,c2, d2}, so C({b1, c2}= {b1, c2, d2}. Therefore, {b1, c2} is not a closed itemset. If minsup=2, it is a frequent itemset. • For an rowset {1, 2}, i({1, 2})={a1, b1} and r({a1, b1})={1, 2, 3}, then C({1,2})={1, 2, 3}. So rowset {1, 2} is not a closed rowset, but apparently {1, 2, 3} is.

  14. Mining Task • Originally, we want to find all of the frequent closed itemsets which satisfy the minimum support threshold minsup form the original table T. • After transposing T to transposed table TT, the mining task becomes finding all of the large closed rowsets which satisfy minimum size threshold minsup from table TT.

  15. Top-down Search Strategy minsup =3 Given user specified minsup, we can stop further search of the tow-down row enumeration tree at level (n-minsup) for mining frequent itemsets.

  16. X-excluded transposed table Tables corresponding to a parent node and a child node are called parent table and child table respectively. • Each node of the tree in Figure 3.2 corresponds to a sub-table. For example, the root represents the whole table TT, and then it can be divided into 5 sub-tables: table without rid 5, table with 5 but without 4, table with 45 but without 3, table with 345 but without 2, and table with 2345 but without 1. • Definition (x-excluded transposed table) : Given a rowset x={ri1, ri2, …, rik} with an order such that ri1> ri2>…> rik, an minsup and its parent table TT|p, an x-excluded transposed table TT|x is a table in which each tuple contains rids less than any of rids in x, and at the same time contains all of the rids greater than any of rids in x. Rowset x is called an excluded rowset.

  17. Example minsup =2 In TT|4, each tuple must contain rid 5 as it is greater than 4, and in the meantime must contain at least one rid less than 4 as minsup is 2. As a result, in Table TT, only those tuples containing rid 5 can be a candidate tuple of TT|4. In TT|54, x={5, 4}, each tuple only contains rids which are less than 4, and contains at least two such rids as minsup is 2.

  18. Excluded row enumeration tree • Extract form TT or its direct parent table TT|p each tuple containing all rids greater than rik. • For each tuple obtained in the first step, keep only rids less than rik. • Get rid of tuples containing less than (minsup-j) number of rids, where j is the number of rids greater than rik in S.

  19. Closeness-checking • Lemma 2: Given a rowset S in transposed table TT, for every tuple ij containing S, which means ij ∈i(S), if S≠∩r({ij}), where ij ∈ i(S), then S is not closed. • Lemma 1 and Lemma 2 are the basis of our closeness-checking method. • Lemma 1: In transposed table TT, a rowset S is closed iff it can be represented by an intersection of a set of tuples, that is:

  20. Skip-rowset • The so-called skip-rowset is a set of rids which keeps track of the rids that are excluded from the same tuple of all of its parent tables. • When two tuples in an x-excluded transposed table have the same rowset, they will be merged to one tuple, and the intersection of corresponding two skip-rowsets will become the current skip-rowset.

  21. Example of skip-rowset and merge of x-excluded transposed table • When we got TT|54 from its parent TT|5, we excluded rid 4 from tuple b1 and d2 respectively. • The first 2 tuples have the same rowset {1, 2, 3}. The skip-rowset of this rowset becomes empty because the intersection of an empty set and any other set is still empty. • If the intersection result is empty, it means that currently this rowset is the result of intersection of two tuples. Therefore, it must be a closed rowset.

  22. Outline  • Introduction • Preliminaries • Algorithm • Experimental Study • Conclusion

  23. TD-Close

  24. minsup =2 Table TT|543

  25. Outline  • Introduction • Preliminaries • Algorithm • Experimental Study • Conclusion

  26. Experimental Study FPclose is a column enumeration-based algorithm, which won the FIMI’ 03 best implementation award. • Compare the algorithm with Carpenter and FPclose. • Using D#T#C# to represent specific dataset, where D# stands for dimension, the number of attributes of each data set, T# for number of tuples, and C# for cardinality, the number of values per dimension (or attribute). • In these experiments, D# ranges from 4000 to 10000, T# varies from 100, 150 to 200, and C# varies from 8, 10 to 12.

  27. Outline  • Introduction • Preliminaries • Algorithm • Experimental Study • Conclusion

  28. Conclusion • Existing algorithms, such as Carpenter, adopt a bottom-up fashion to search the row enumeration space, which makes the pruning power of minimum support threshold (minsup) very weak, and therefore results in long mining process even for high minsup, and much memory cost. • To solve this problem, a top-down style row enumeration method and an effective closeness-checking method are proposed in this paper. Several pruning strategies are developed to speed up searching. • Both analysis and experimental study show that this method is effective and useful.

  29. Future work • Integrating top-down row enumeration method and column row enumeration method for frequent pattern mining from both long and deep large datasets. • Mining classification rules based on association rules using top-down searching strategy.

  30. Questions ?

  31. Thank you~

More Related