Extending Association Analysis. Michael Steinbach Ph.D. Defense. Outline. Introduction Extending association analysis to non-binary data and non-traditional patterns Generalizing the notion of support Generalizing the notion of confidence Creating new types of association patterns
Extending Association Analysis Michael Steinbach Ph.D. Defense
Outline • Introduction • Extending association analysis to non-binary data and non-traditional patterns • Generalizing the notion of support • Generalizing the notion of confidence • Creating new types of association patterns • Analyzing the structure of association patterns • Conclusions and future work
Traditional Association Analysis • Association analysis: Analyzes relationships among items (attributes) in binary transaction data • Example data: market basket data • Data can be represented as a binary matrix • Applications in business and science • Two types of patterns • Itemsets: Collection of items • Example: {Milk, Diaper} • Association Rules: X → Y, where X and Y are itemsets. • Example: Milk → Diaper [Figures: Set-Based Representation of Data; Binary Matrix Representation of Data]
Traditional Association Analysis … • Association measures evaluate the strength of an association pattern • Support and confidence are the most commonly used • The support, σ(X), of an itemset X is the number of transactions that contain all the items of the itemset • Frequent itemsets have support > specified threshold • Different types of itemset patterns are distinguished by a measure and a threshold • The confidence of an association rule is given by conf(X → Y) = σ(X ∪ Y) / σ(X) • Estimate of the conditional probability of Y given X
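The two definitions above can be sketched in a few lines of Python. This is only an illustration: the transactions below are hypothetical market-basket data, and the function names `support` and `confidence` are chosen here for readability.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y, transactions):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    support_x = support(x, transactions)
    if support_x == 0:
        raise ValueError("the antecedent X never occurs")
    return support(set(x) | set(y), transactions) / support_x


print(support({"Milk", "Diaper"}, transactions))       # 3
print(confidence({"Milk"}, {"Diaper"}, transactions))  # 0.75
```

Here Milk occurs in four transactions and Milk together with Diaper in three, so the rule Milk → Diaper has confidence 3/4.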
Traditional Association Analysis … • Process of finding interesting patterns: • Find frequent itemsets using a support threshold • Find association rules for frequent itemsets • Sort association rules according to confidence • Support filtering is necessary • To eliminate spurious patterns • For efficiency, we need the anti-monotone property: X ⊆ Y implies σ(Y) ≤ σ(X) • Confidence is used because of its interpretation as conditional probability • Given d items, there are 2^d possible candidate itemsets
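A minimal sketch of why the anti-monotone property matters for efficiency: a level-wise (Apriori-style) search only extends itemsets whose subsets are all frequent, so most of the 2^d candidates are never counted. The data and the name `frequent_itemsets` are hypothetical, and the sketch assumes the threshold test is "support >= min_support".

```python
from itertools import combinations

# Hypothetical market-basket data.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: support} for frequent itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        # Count the current candidates and keep those meeting the threshold.
        survivors = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)
        # Join survivors into candidates one item larger, then prune every
        # candidate that has an infrequent subset (anti-monotone property).
        k = len(next(iter(survivors))) + 1 if survivors else 0
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))]
    return frequent


result = frequent_itemsets(transactions, min_support=3)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 3, the candidate {Beer, Bread, Diaper} is never counted because its subset {Beer, Bread} is already infrequent.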
Extending Association Analysis • Why extend association analysis? • To address limitations of existing schemes for association analysis • To create new kinds of useful patterns • To better understand the structure of the association patterns in a data set
Limitations of Association Analysis • Traditional association analysis does not apply to • Non-binary data • Must transform data into binary transaction data to apply traditional association analysis techniques • Order and magnitude information can be lost • Can often “make it work” by coding combinations of values, but this adds complexity and explodes the number of items • Limited solutions exist • Min-Apriori (Han, Karypis, Kumar 1997) • Non-traditional association patterns • Error Tolerant Itemsets (ETIs) (Yang, Fayyad, and Bradley 2001) • General Boolean formulas (Bollman-Sdorra, et al. 01, Srikant et al. 97) [Figure: Document Data]
Limitations of Association Analysis … • Support and confidence are not appropriate for all applications • Example involving coffee and tea: • Every customer in a grocery store purchases coffee • Only 1/4 of the customers purchase tea • conf(tea → coffee) = 1 • But this is misleading because any item implies coffee • This problem is common when the frequency of items has a skewed support distribution • This cross-support problem can be addressed by using other measures, such as h-confidence (hyperclique pattern)
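The coffee and tea example can be worked through numerically. The sketch below assumes a hypothetical store with 100 customers and uses the standard definition of h-confidence for an itemset, the minimum confidence over all rules that have a single item of the itemset on the left-hand side.

```python
from fractions import Fraction

n_customers = 100             # hypothetical store size
sigma_coffee = n_customers    # every customer buys coffee
sigma_tea = n_customers // 4  # a quarter of the customers buy tea
sigma_both = sigma_tea        # every tea buyer also buys coffee

conf_tea_to_coffee = Fraction(sigma_both, sigma_tea)
conf_coffee_to_tea = Fraction(sigma_both, sigma_coffee)

# h-confidence of {tea, coffee}: the weakest of the single-item rules.
h_confidence = min(conf_tea_to_coffee, conf_coffee_to_tea)

print(conf_tea_to_coffee)  # 1
print(conf_coffee_to_tea)  # 1/4
print(h_confidence)        # 1/4
```

The rule tea → coffee looks perfect, but the h-confidence of 1/4 exposes the mismatch in item frequencies.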
Limitations of Association Analysis … • Lack of knowledge of structure of association patterns • Support threshold is critical • If too high, no patterns • If too low, too many patterns • At some support threshold, algorithms to find association patterns “hit the wall” • Particular difficulty in finding patterns with low support • LPMiner (Seno, Karypis 2001) [Figure source: Summary of Results, Frequent Itemset Mining Implementations 2003]
Overview and Contributions • Presentation and contributions fall into three categories • A mathematical framework to extend association analysis to non-binary data and non-traditional patterns • Generalizing the notion of support • Extend the hyperclique pattern (Xiong, et al. 2003) to continuous data • Generalizing the notion of confidence • Define notion of confidence for Error-Tolerant Itemsets
Overview and Contributions • A framework for creating new types of association measures (and their accompanying itemset patterns) • Can use any pairwise association or proximity measure as the basis for defining a measure of itemset strength • Examples: cosine, confidence, correlation • All measures have the anti-monotone property • Analyzing the structure of association patterns • Introduce the notion of support envelopes • Can visualize the structure of association patterns
Publications Related to Thesis Steinbach, M., Tan, P., Xiong, H., and Kumar, V., Generalizing the Notion of Support. KDD '04, pp. 689-694, Seattle, WA, August 22 - 25, 2004. Steinbach, M. and Kumar, V., Generalizing the Notion of Confidence. ICDM’05, to appear, Houston, TX, November 27 - 30, 2005. Steinbach, M., Tan, P., and Kumar, V., Support Envelopes: A Technique for Exploring the Structure of Association Patterns. KDD '04, pp. 689-694, Seattle, WA, August 22 - 25, 2004.
Additional Publications Books: P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Addison-Wesley, May, 2005. Book Chapters: V. Kumar, P.-N. Tan, and M. Steinbach, Data Mining, in Handbook of Data Structures and Applications, CRC Press, 2004. M. Steinbach, L. Ertoz, and V. Kumar, Challenges of Clustering High Dimensional Data. in New Vistas in Statistical Physics - Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag, 2004. L. Ertoz, M. Steinbach, and Vipin Kumar, Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach, in Clustering and Information Retrieval, 2003, Kluwer Academic Publishers. P. Zhang, M. Steinbach, V. Kumar, S. Shekhar, P.-N. Tan, S. Klooster, and C. Potter, Discovery of Patterns of Earth Science Data Using Data Mining, in Next Generation of Data Mining Applications, IEEE Press, 2005.
Additional Publications … Journal Articles: H. Xiong, G. Pandey, M. Steinbach, and V. Kumar, Enhancing Data Analysis with Noise Removal, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2006, accepted for publication as a regular paper. C. Potter, P.-N. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese, Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets, Global Change Biology, 2003. C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, R. Nemani, and R. Myneni, Global Teleconnections of Ocean Climate to Terrestrial Carbon Flux, J. of Geophysical Research, Vol. 108, No. D17, 4556, 2003. C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, and C. Carvalho, Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes, Global Change Biology, 2003. C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, R. Myneni, V. Genovese, Variability in Terrestrial Carbon Sinks Over Two Decades: Part 1-North America, Earth Interactions, 2003. Conferences: H. Xiong, M. Steinbach, and V. Kumar, Privacy Leakage in Multi-relational Databases via Pattern based Semi-supervised Learning, in Proc. of the ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, 2005. H. Xiong, M. Steinbach, P.-N. Tan, and V. Kumar, HICAP: Hierarchical Clustering with Pattern Preservation, in Proc. 2004 SIAM International Conf. on Data Mining (SDM 2004), pp. 279 - 290, Florida, 2004. M. Steinbach, P.-N. Tan, V. Kumar, S. Klooster, C. Potter: Discovery of climate indices using clustering. KDD 2003: 446-455. L. Ertöz, M. Steinbach, and V. Kumar: Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. SDM 2003.
Additional Publications … Workshops: M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Temporal Data Mining for the Discovery and Analysis of Ocean Climate Indices, KDD Workshop on Temporal Data Mining, 2002. M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Data Mining for the Discovery of Ocean Climate Indices, The Fifth Workshop on Scientific Data Mining, 2nd SIAM International Conference on Data Mining, 2002. V. Kumar, M. Steinbach, P.-N. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System, Joint Statistical Meeting, 2001. M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, S. Klooster, A. Torregrosa, Clustering Earth Science Data: Goals, Issues and Results, KDD Workshop on Mining Scientific Datasets, 2001. P.-N. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, A. Torregrosa, Finding Spatio-Temporal Patterns in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001. M. Steinbach, G. Karypis, and V. Kumar, Efficient Algorithms for Creating Product Catalogs, Web Mining Workshop, 1st SIAM International Conference on Data Mining, Chicago, IL, 2001. L. Ertoz, M. Steinbach, and V. Kumar, Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach, Text Mine'01, Workshop on Text Mining, 1st SIAM International Conference on Data Mining, Chicago, IL, April, 2001. M. Steinbach, G. Karypis, and V. Kumar, A Comparison of Document Clustering Techniques, TextMining Workshop, KDD 2000, Boston, MA, August, 2000.
Outline • Introduction • Extending association analysis to non-binary data and non-traditional patterns • Generalizing the notion of support • Generalizing the notion of confidence • Creating new types of association patterns • Analyzing the structure of association patterns • Conclusions and future work
Generalizing Support: Problem Statement • Challenge: Create a framework for generalizing support that • Handles non-binary data (ordinal, continuous) • Handles new types of patterns • Allows people to more easily express, explore, and communicate new types of association patterns • Motivating examples for continuous data [Figures: Document Data; Microarray data (www.biology.ucsc.edu/mcd/research.html)]
Proposed Approach • Proposed Approach: Support (σ) can be viewed as being composed of two steps (functions): • Evaluate the strength of a pattern in each object (transaction) • Summarize all these evaluations with a single number • Evaluation vector is given by v = eval(X) • Summarization (norm) function measures strength of the pattern in all transactions • Example • eval = ∧ (logical and) • X = { Milk, Diapers } • norm = sum • σ(X) = (norm ∘ eval)(X) = norm(eval(X)) = norm(v)
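The composition of an evaluation function and a summarization function can be sketched directly. The binary matrix below is hypothetical, and the helper names (`evaluate`, `generalized_support`, `logical_and`) are illustrative rather than taken from the thesis.

```python
def evaluate(data, columns, eval_fn):
    """Apply eval_fn to the selected attributes of every transaction."""
    return [eval_fn([row[c] for c in columns]) for row in data]


def generalized_support(data, columns, eval_fn, norm_fn):
    """Support as a composition: norm(eval(X))."""
    return norm_fn(evaluate(data, columns, eval_fn))


def logical_and(values):
    """1 if every item is present in the transaction, else 0."""
    return int(all(values))


# Hypothetical binary matrix; columns: 0 = Milk, 1 = Diapers, 2 = Beer.
data = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
]

milk_diapers = [0, 1]
v = evaluate(data, milk_diapers, logical_and)
print(v)                                                         # [1, 0, 1, 0]
print(generalized_support(data, milk_diapers, logical_and, sum))  # 2
```

Swapping `logical_and` or `sum` for another function changes the kind of support without changing the surrounding code, which is the point of the framework.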
Evaluation and Summarization Functions • Evaluation functions • Boolean functions constructed from and (∧), or (∨), and not (¬) • min, max, range • product • Special purpose: Error-Tolerant Itemsets • Summarization functions • Vector norms • L1, L2, and L2 squared • Sums • Average • Weighted average • Weighted vector norms
Usefulness of Support Framework • Traditional support results from a number of choices • eval = { ∧, min, product } • norm = { L1, L2 squared, sum } • Any of these nine combinations gives the traditional support for binary data • But for continuous data, these support measures are different • Can extend a recently developed association pattern, the hyperclique pattern (Xiong, et al. 2003), to continuous data • eval = min • norm = L2 squared • Has led to the creation of a new kind of pattern defined by range support • eval = range • norm = L2 squared
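A quick check of the claim about the nine combinations, as a sketch on hypothetical data: on a binary matrix every (eval, norm) pair yields the same count, while on continuous values the min and product evaluations already disagree. The dictionary keys and helper names are illustrative.

```python
from math import prod

evals = {
    "and": lambda values: float(all(values)),
    "min": min,
    "product": prod,
}
norms = {
    "L1": lambda v: sum(abs(x) for x in v),
    "L2 squared": lambda v: sum(x * x for x in v),
    "sum": sum,
}


def generalized_support(data, columns, eval_fn, norm_fn):
    """Support as norm(eval(X)) over all transactions."""
    return norm_fn([eval_fn([row[c] for c in columns]) for row in data])


# Hypothetical binary data: all nine combinations agree.
binary = [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 0]]
binary_values = {
    (e, n): generalized_support(binary, [0, 1], evals[e], norms[n])
    for e in evals for n in norms
}
print(set(binary_values.values()))

# Hypothetical continuous data: the choices now give different numbers.
continuous = [[0.5, 0.2], [1.0, 0.5], [0.0, 0.3]]
continuous_values = {
    (e, n): generalized_support(continuous, [0, 1], evals[e], norms[n])
    for e in ("min", "product") for n in norms
}
for key, value in continuous_values.items():
    print(key, round(value, 4))
```

The logical-and evaluation is left out of the continuous case because it is only defined here for presence or absence.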
Outline • Introduction • Extending association analysis to non-binary data and non-traditional patterns • Generalizing the notion of support • Generalizing the notion of confidence • Creating new types of association patterns • Analyzing the structure of association patterns • Conclusions and future work
Generalizing Confidence: Problem Statement • Challenge: Create a framework for generalizing confidence that • Handles non-binary data (ordinal, continuous) • Handles new types of patterns • Allows people to more easily express, explore, and communicate new types of association patterns
Example: Error-Tolerant Itemsets • A (strong) error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction. Example: see the data in the table • Let ε = 5/8. In other words, each transaction only needs to have 3/8 (37.5%) of the items. • X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4. [Table: example data, with the resulting standard confidence]
A Framework for Generalizing Confidence • Proposed Approach: Confidence can be viewed as being composed of two steps (functions): • Evaluate the strength of a pattern in each object (transaction) for the two sets of attributes (items), X and Y (X ∩ Y = ∅) • Evaluation functions can be the same as previously mentioned, e.g., min, max, range, Boolean functions, etc. • Measure the strength of the relationship between the resulting pair of pattern evaluation vectors, v_X and v_Y • Confidence functions can be a measure of prediction or proximity. • Measure the extent to which the strength of one association pattern can be used to predict another, such as confidence, or • Capture the proximity (similarity or dissimilarity) between the two association patterns. • Euclidean distance, correlation, cosine, Bregman divergence
Confidence for Boolean Support Functions • A Boolean support function • Has an evaluation function that returns a binary evaluation vector indicating the presence or absence of a pattern in each transaction. • Uses the sum, L1, or L2 squared summarization function • Goal is to define confidence for Boolean support functions so that conf(X → Y) can be interpreted as an estimate of the conditional probability of Y given X. • Key observation is that you have to work with the evaluation vectors and the basic definition of conditional probability • Thus, conf(X → Y) = prob(v_Y | v_X) = prob(v_X ∧ v_Y) / prob(v_X) • Another way to express this is as conf(X → Y) = traditional confidence(v_X, v_Y)
Example: Error-Tolerant Itemsets … • Returning to the ETI example, we get the following: X = {i1, i2, i3, i4}, Y = {i5, i6, i7, i8}, with evaluation vectors v_X and v_Y • conf(X → Y) = prob(v_Y | v_X) = prob(v_X ∧ v_Y) / prob(v_X) = support(v_X ∧ v_Y) / support(v_X) = 0 / 4 = 0
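The ETI calculation can be reproduced on a small data set. The slide's table is not available here, so the matrix below is hypothetical: it was chosen so that X and Y each have support 4 and never co-occur, matching the numbers quoted above. The function name `eti_eval` is illustrative.

```python
from fractions import Fraction


def eti_eval(data, columns, eps):
    """1 where a transaction holds at least a (1 - eps) fraction of the items."""
    needed = (1 - eps) * len(columns)
    return [int(sum(row[c] for c in columns) >= needed) for row in data]


# Hypothetical data: four transactions hold i1..i4, four hold i5..i8.
data = [[1, 1, 1, 1, 0, 0, 0, 0]] * 4 + [[0, 0, 0, 0, 1, 1, 1, 1]] * 4
X = [0, 1, 2, 3]
Y = [4, 5, 6, 7]
eps = Fraction(5, 8)

v_x = eti_eval(data, X, eps)
v_y = eti_eval(data, Y, eps)
support_x = sum(v_x)

# Naive use of the traditional formula, treating X union Y as one ETI.
naive_confidence = Fraction(sum(eti_eval(data, X + Y, eps)), support_x)

# Generalized confidence, computed on the two evaluation vectors.
generalized_confidence = Fraction(
    sum(a & b for a, b in zip(v_x, v_y)), support_x
)

print(support_x, sum(v_y))     # 4 4
print(naive_confidence)        # 2
print(generalized_confidence)  # 0
```

On this data every transaction holds 4 of the 8 items of X ∪ Y, so the naive ratio exceeds 1 and cannot be read as a probability, whereas the evaluation-vector version gives 0 / 4 = 0.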
Confidence for Continuous Data • One approach is to define a confidence measure for continuous data that agrees with traditional confidence for binary data. • Normalize attributes to have an L1 norm of 1 • eval function is min • norm function is L1 • Confidence is defined as • Another approach is to drop the requirement of being consistent with the case of binary data (Min-Apriori (Han, Karypis, Kumar 1997)) • Normalize attributes to have an L1 norm of 1 • eval function is min • norm function is L1 • Traditional definition of confidence: conf(X → Y) = σ(X ∪ Y) / σ(X)
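A sketch of the Min-Apriori style calculation (the second approach above): normalize each attribute to an L1 norm of 1, evaluate with min, summarize with a sum, and apply the traditional ratio. The document-term counts and function names are hypothetical; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def l1_normalize_columns(data):
    """Scale every attribute (column) so that its L1 norm is 1."""
    n_cols = len(data[0])
    totals = [sum(abs(row[c]) for row in data) for c in range(n_cols)]
    return [
        [Fraction(row[c], totals[c]) if totals[c] else Fraction(0)
         for c in range(n_cols)]
        for row in data
    ]


def min_support(data, columns):
    """eval = min over the selected attributes, norm = L1 (a plain sum here)."""
    return sum(min(row[c] for c in columns) for row in data)


def min_apriori_confidence(data, x, y):
    """Traditional ratio sigma(X union Y) / sigma(X) on the normalized data."""
    normalized = l1_normalize_columns(data)
    return min_support(normalized, x + y) / min_support(normalized, x)


# Hypothetical document-term counts: rows are documents, columns are terms.
counts = [[2, 1], [0, 3], [2, 0]]
continuous_conf = min_apriori_confidence(counts, [0], [1])

# Traditional confidence on the binarized version of the same data.
binary = [[int(value > 0) for value in row] for row in counts]
binary_conf = Fraction(min_support(binary, [0, 1]), min_support(binary, [0]))

print(continuous_conf)  # 1/4
print(binary_conf)      # 1/2
```

The two values differ on this toy data, which illustrates that the Min-Apriori variant need not agree with traditional confidence computed on a binarized copy.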
Example: Min Apriori • This approach is inconsistent with traditional confidence [Tables: Original Data, Normalized Data, Evaluation Vectors, with the resulting standard confidence and Min-Apriori confidence]
Outline • Introduction • Extending association analysis to non-binary data and non-traditional patterns • Generalizing the notion of support • Generalizing the notion of confidence • Creating new types of association patterns • Analyzing the structure of association patterns • Conclusions and future work
New Association Patterns: Motivation • There are many pairwise measures of association or proximity among items (attributes) • Each measure has specific properties and applications • E.g., cosine measure is good for sparse data, while correlation is more appropriate for dense data • Interestingness measures (Tan and Kumar 02)
Proposed Approach • Proposed Approach: Using pairwise measures of association or proximity • Find values for all pairs of attributes (or sets of attributes) • Apply the min function to obtain a single value • Example: If X = {i1, i2, i3} and our pairwise measure is cosine, then we can define ψ, a measure of itemset strength, as ψ(X) = min( cosine(i1, i2), cosine(i1, i3), cosine(i2, i3) ) • A set of attributes, X, is a clique association pattern with respect to a threshold θ and a pairwise association measure φ if φ(i, j) ≥ θ for all i, j ∈ X (φ can be cosine, corr, conf, …)
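A minimal sketch of the item-item case with cosine as the pairwise measure. The attribute vectors are hypothetical, and `clique_strength` is an illustrative name for the "minimum over all pairs" measure of itemset strength.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two attribute vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def clique_strength(vectors, measure=cosine):
    """Minimum of the pairwise measure over all pairs of attributes."""
    return min(measure(a, b) for a, b in combinations(vectors, 2))


# Hypothetical attribute (column) vectors over four transactions.
i1 = [1, 1, 0, 0]
i2 = [1, 1, 1, 0]
i3 = [1, 0, 1, 1]

strength_pair = clique_strength([i1, i2])
strength_triple = clique_strength([i1, i2, i3])
print(round(strength_pair, 4), round(strength_triple, 4))
```

Adding an attribute can only add more pairs to the minimum, so the strength never increases as the itemset grows, which is the anti-monotone property.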
Proposed Approach … Actually three approaches (φ is a pairwise measure) • Subset-Subset • min{ φ(X, Y), for all itemsets X and Y } • All-confidence (φ = confidence) is an example (Omiecinski 2003) • All-subsets patterns: all-subsets cosine, all-subsets correlation, all-subsets confidence • Item-Subset • min{ φ(X, Y), for all itemsets X and Y, where X is a single item } • H-confidence (φ = confidence) is an example (Xiong, 2003) • Hyperclique patterns: h-cosine, h-correlation, h-confidence • Item-Item • min{ φ(X, Y), for all itemsets X and Y, where X and Y are single items } • Clique patterns: cosine clique, correlation clique, confidence clique
Proposed Approach … • When one or both of the itemsets are not single items (attributes), it is not possible to directly apply most pairwise measures • Confidence is an exception • Can use the approach proposed for generalizing confidence • Compute the evaluation vector of the itemset • Then apply the pairwise measure to the two vectors: the evaluation vector and the original attribute vector
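The last two steps can be sketched for a single item-subset pair: build the evaluation vector of the itemset, then apply the pairwise measure to it and the item's own vector. The sketch assumes min as the evaluation function and cosine as the pairwise measure; the data and the name `item_subset_measure` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def evaluation_vector(data, columns):
    """Strength of the itemset in each transaction, using eval = min."""
    return [min(row[c] for c in columns) for row in data]


def item_subset_measure(data, item, subset, measure=cosine):
    """Pairwise measure between one attribute and an itemset's evaluation vector."""
    item_vector = [row[item] for row in data]
    return measure(item_vector, evaluation_vector(data, subset))


# Hypothetical binary data: rows are transactions, columns are items 0..2.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

value = item_subset_measure(data, item=0, subset=[1, 2])
print(round(value, 4))
```

A full item-subset pattern measure would take the minimum of such values over the relevant pairs; only one pair is computed here.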
An Experiment • We compared the performance of h-confidence, cosine clique, and confidence clique patterns. • The h-confidence hyperclique pattern is important because the hyperclique pattern has many applications • Clustering, classification, data cleaning • Typically applied to objects instead of items • Purity of patterns is excellent • Often the h-confidence patterns don’t cover many objects • Better coverage may mean better application performance • Cosine and confidence cliques are related to h-confidence
Experimental Results • We used several document data sets with class labels for the documents • Patterns were found on documents and goodness was measured by the entropy of the patterns • Three quantities are reported • Number of patterns • Average entropy of the patterns • Coverage of documents • Also evaluated the cosine cliques for original data
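The reported quantities can be sketched as follows, assuming a pattern is a set of document indices, its entropy is the base-2 entropy of the class labels of those documents, and coverage is the fraction of documents that appear in at least one pattern. Whether the experiments normalized the entropy is not stated, so this is only one plausible reading; the labels and patterns are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def average_entropy(patterns, labels):
    """Mean entropy of the class labels covered by each pattern."""
    values = [entropy([labels[d] for d in pattern]) for pattern in patterns]
    return sum(values) / len(values)


def coverage(patterns, n_documents):
    """Fraction of documents that belong to at least one pattern."""
    return len(set().union(*patterns)) / n_documents


# Hypothetical class labels for six documents and three patterns over them.
labels = ["sports", "sports", "politics", "politics", "sports", "tech"]
patterns = [{0, 1, 4}, {2, 3}, {1, 2}]

print(len(patterns))                      # number of patterns
print(average_entropy(patterns, labels))  # average entropy
print(coverage(patterns, len(labels)))    # coverage
```

In this toy case two patterns are pure (entropy 0) and one mixes two classes evenly (entropy 1), and document 5 is not covered by any pattern.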
Experimental Results – LA1 and FBIS [Figures: number of patterns and average entropy versus number of attributes in the pattern, for la1 (level=50) and fbis (level=70); curves for h-confidence, cosine (orig data), cosine (binary data), and confidence]
Experimental Results – CranMed and tr45 [Figures: number of patterns and average entropy versus number of attributes in the pattern, for cranmed (level=30) and tr45 (level=50); curves for h-confidence, cosine (binary data), and confidence, plus cosine (orig data) for cranmed only]
Experimental Results – Percent Coverage [Figures: percent coverage versus number of attributes in the pattern, for la1 (level=50), fbis (level=70), cranmed (level=30), and tr45 (level=50); curves for h-confidence, cosine (binary data), and confidence, plus cosine (orig data) for all data sets except tr45]
Outline • Introduction • Extending association analysis to non-binary data and non-traditional patterns • Generalizing the notion of support • Generalizing the notion of confidence • Creating new types of association patterns • Analyzing the structure of association patterns • Conclusions and future work
Describing Association Patterns: Support Envelopes • The support envelope for a binary transaction data set and a pair of positive integers (m, n) • Is a subset of all items and transactions • The support envelope contains all association patterns involving m or more transactions and n or more items. • m is support • n is the length of the itemset • Itemsets and variants (frequent, maximal, closed) • Error Tolerant Itemsets (ETIs)
Simple Example • Idea: instead of finding all association patterns containing at least m transactions and n items, find the items and transactions containing all such patterns. • For an example using the data set below, find the set of items and transactions that contain all patterns with at least 3 transactions and at least 3 items.
Support Envelope Algorithm (SEA)
The algorithm to find a support envelope is simple.
input: A data matrix and a pair of positive integers (m, n)
repeat
    Eliminate all rows whose sum is less than n
    Eliminate all columns whose sum is less than m
until there is no change
return the set of remaining rows and columns
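A direct Python transcription of the pseudocode, as a sketch: rows are transactions, columns are items, and the row and column sums are taken over the rows and columns that still remain. The matrix and the function name `support_envelope` are hypothetical.

```python
def support_envelope(matrix, m, n):
    """Return the (rows, columns) that survive repeated elimination.

    Rows with fewer than n remaining items and columns with fewer than m
    remaining transactions are removed until nothing changes.
    """
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0]))) if matrix else set()
    changed = True
    while changed:
        changed = False
        sparse_rows = {r for r in rows
                       if sum(matrix[r][c] for c in cols) < n}
        if sparse_rows:
            rows -= sparse_rows
            changed = True
        sparse_cols = {c for c in cols
                       if sum(matrix[r][c] for r in rows) < m}
        if sparse_cols:
            cols -= sparse_cols
            changed = True
    return sorted(rows), sorted(cols)


# Hypothetical binary data: 5 transactions (rows) by 4 items (columns).
matrix = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

print(support_envelope(matrix, m=3, n=3))  # ([0, 1, 2], [0, 1, 2])
```

Each pass removes at least one row or column or stops, so the loop runs at most (number of rows + number of columns) times.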
Support Envelopes Form a Lattice
Each box represents a support envelope. Format is the following: (m, n), Transactions, Items
(5, 2)   {1-12}             {A-E}
(6, 1)   {1-12}             {A,C,D,E}
(2, 3)   {1,3,7,8,10,11}    {A,B,C,D,E}
(7, 1)   {1-12}             {C,D,E}
(6, 2)   {1,3,4,7-12}       {A,C,D,E}
(1, 4)   {1,3,7,8,10}       {A,B,C,D,E}
(10, 1)  {1-3,5-11}         {E}
(4, 3)   {1,3,7,8,10}       {A,C,D,E}
(5, 3)   {1,3,7,8,10}       {C,D,E}
(4, 4)   {1,7,8,10}         {A,C,D,E}
The entire lattice of envelopes is called the support lattice. Envelopes drawn with a dotted border are on the lattice boundary, which we call the support boundary. At most min(M, N) such envelopes.
Visualizing Support Envelopes for Mushroom • One of the support envelopes (576, 23) is denser than its surrounding neighbors.
An Interesting Dense Envelope for Mushroom • One of the columns was the column 48, ‘gill-color:buff’ • There are exactly 1728 instances of item 48, every one of which occurs with 13 other items (one of which is ‘poisonous’). • The co-occurrence of 14 items is larger than is typical for this data set. Support Envelope (576,23)
Outline • Introduction • Generalizing Support • Generalizing Confidence • Generalizing Association Patterns • Support Envelopes • Conclusions and Future Work
Conclusions and Future Work: Generalizing Support • We described a framework for generalizing support that is based on the simple, but useful observation that support can be viewed as the composition of two functions: • A function that evaluates the strength or presence of a pattern in each object, and • A function that summarizes these evaluations with a single number. • Future work • Efficient implementations • Exploring applications of the continuous hyperclique and range patterns • New types of support for non-binary data and nontraditional association patterns
Conclusions and Future Work: Generalizing Confidence • We described a framework for generalizing confidence that is based on the simple, but useful observation that confidence can be defined in terms of two functions: • A function that evaluates the strength or presence of a pattern in each object, and • A function that summarizes the relationship between the two evaluation vectors with a single number. • Future work • Exploring applications of the different measures of confidence • Creating new types of confidence based on interestingness and proximity measures
Conclusions and Future Work: New Patterns • We described a framework for creating a wide variety of new association measures from any pairwise association or proximity measure • These measures are guaranteed to have the anti-monotone property • Specific instances of these measures, the cosine and confidence cliques, were proposed and found to be strictly superior to the hyperclique pattern • Future work • Research is needed to determine which measures (out of the large number possible) are useful for association analysis and what additional properties they might have • A more detailed study using more and different types of data sets is needed for cosine and confidence clique patterns • More efficient algorithms are needed
Conclusions and Future Work: Support Envelopes • Support envelopes are a new tool for exploring association structure. • Support envelopes form a lattice - at most M * N envelopes • Envelopes on the boundary are especially interesting. • Bound the maximum sizes of association patterns • At most min( M, N ) boundary envelopes • Can visualize association structure by plotting support envelopes • Efficient algorithms • Future work • Parallel/distributed implementations of the support envelope code • Investigation of the basic approach and its variations for binary data • Application of support envelopes to other kinds of data or patterns • Support envelopes for a cube • Continuous data