Lecture 6 • Data mining: Definition and motivation • Association Rule mining • Apriori algorithm • Additional association mining algorithms • FP-Tree and FP-growth algorithms • Cubegrades • Clustering • CPAR algorithm
Data Mining
• Conceptually, data mining is the process of semi-automatically analyzing large databases to find useful patterns
• Differs from machine learning in that it deals with large volumes of data stored primarily on disk
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars"
• Other types of knowledge are represented by equations, or by prediction functions
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns
Applications of Data Mining • Prediction based on past history • Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history • Predict if a customer is likely to switch brand loyalty • Predict if a customer is likely to respond to a mailing campaign • Predict if a pattern of phone calling card usage is likely to be fraudulent • Some examples of prediction mechanisms: • Classification • Given a training set consisting of items belonging to known classes, and a new item whose class is unknown, predict which class it belongs to • Regression formulae • Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value
Applications of Data Mining (Cont.) • Descriptive Patterns • Associations • Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too. • Other similar applications: camera accessories, clothes, etc. • Associations may also be used as a first step in detecting causation • E.g. association between exposure to chemical X and cancer, or new medicine and cardiac problems • Clusters • E.g. typhoid cases were clustered in an area surrounding a contaminated well • Detection of clusters remains important in detecting epidemics
What Is Association Mining? • Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories • Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93] • Motivation: finding regularities in data • Basket data analysis, cross-marketing, catalog design, sales campaign analysis. What products were often purchased together? Beer and diapers?! • A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts. • Web log (click stream) analysis, DNA sequence analysis, etc. • What are the subsequent purchases after buying a PC? • What genes are sensitive to this new drug?
Basic Concepts: Association Rules
• Itemset X = {x1, …, xk}
• Find all the rules X => Y with minimum confidence and support
• Support, s: count of transactions that contain both X and Y
• Confidence, c: conditional probability that a transaction containing X also contains Y
• Example rules: A => C (3, 100%), C => A (3, 75%)
(The slide also showed a Venn diagram of customers who buy milk, customers who buy bread, and customers who buy both.)
Mining Association Rules: • Goal: Compute rules with high support/confidence • How to compute? • Support: Find sets of items that occur frequently • Confidence: Find frequency of subsets of supported itemsets • Two phase generation: • Compute frequent itemsets • From the frequent itemsets, compute the rules • If we have all frequently occurring sets of items (frequent itemsets), we can compute support and confidence!
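To make support and confidence concrete, here is a minimal Python sketch over a toy transaction list; the transactions and the rule below are illustrative, not the lecture's example database.

from itertools import combinations

transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_stats(lhs, rhs, transactions):
    """Return (support count, confidence) of the rule lhs => rhs."""
    both = support_count(lhs | rhs, transactions)
    lhs_only = support_count(lhs, transactions)
    confidence = both / lhs_only if lhs_only else 0.0
    return both, confidence

print(rule_stats({"A"}, {"C"}, transactions))  # (2, 1.0) for this toy data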
Apriori: A Candidate Generation-and-test Approach • Apriori pruning principle:If there is any itemset which is infrequent, its superset should not be generated/tested! Why? • Method: • generate length (k+1) candidate itemsets from length k frequent itemsets, and • test the candidates against DB • Performance studies show its efficiency and scalability
The Apriori Algorithm—An Example
(The slide walked through an example database TDB with three scans: the 1st scan produces candidate set C1 and frequent set L1, the 2nd scan C2 and L2, and the 3rd scan C3 and L3. With minimum support 50% and confidence 100%, the resulting rules were A => C, B => E, BC => E, CE => B, and BE => C.)
The Apriori Algorithm
• Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return the union of all Lk (the set of frequent itemsets);
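A compact, self-contained Python sketch of this loop (illustrative only: candidate generation and pruning are folded into one step, and min_support is an absolute count).

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate by scanning the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates from the frequent k-itemsets, then keep
        # only candidates all of whose k-subsets are frequent (Apriori pruning)
        level_sets = list(level)
        candidates = set()
        for i in range(len(level_sets)):
            for j in range(i + 1, len(level_sets)):
                union = level_sets[i] | level_sets[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in level for s in combinations(union, k)
                ):
                    candidates.add(union)
        current = candidates
        k += 1
    return frequent

# Example: min_support = 2 on four toy transactions
print(apriori([{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}], 2))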
Important Details of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • How to count supports of candidates? • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}
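The candidate-generation example above can be reproduced with a short sketch; the assumed representation is itemsets as sorted tuples of single characters.

from itertools import combinations

def apriori_gen(Lk):
    k = len(next(iter(Lk)))
    Lk = sorted(Lk)
    candidates = set()
    # Step 1: self-join -- merge itemsets sharing the first k-1 items
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            if Lk[i][:-1] == Lk[j][:-1]:
                candidates.add(tuple(sorted(set(Lk[i]) | set(Lk[j]))))
    # Step 2: prune -- drop candidates having any k-subset not in Lk
    return {c for c in candidates if all(s in set(Lk) for s in combinations(c, k))}

L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
print(apriori_gen(L3))  # {('a','b','c','d')} -- acde is pruned since ade is not in L3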
How to Count Supports of Candidates? • Why is counting supports of candidates a problem? • There can be a huge number of candidates • A transaction may contain many candidates • Method: • Candidate itemsets are stored in a hash-tree • A leaf node of the hash-tree contains a list of itemsets and counts • An interior node contains a hash table • Subset function: finds all the candidates contained in a transaction
Example: Counting Supports of Candidates
(The slide showed a hash-tree built over the candidate 3-itemsets {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {3 4 5}, {3 5 6}, {3 5 7}, {3 6 7}, {3 6 8}, {5 6 7}, {6 8 9}, hashed on branches 1,4,7 / 2,5,8 / 3,6,9. The subset function walks the tree for the transaction 1 2 3 5 6 by recursively splitting it: 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, and so on.)
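A simplified stand-in for the subset function: it enumerates the transaction's 3-subsets and looks them up in a hashed candidate set rather than walking a hash-tree, but it computes the same answer for this example.

from itertools import combinations

candidates = {frozenset(c) for c in [(1,4,5), (1,2,4), (4,5,7), (1,2,5),
                                     (4,5,8), (1,5,9), (1,3,6), (2,3,4),
                                     (3,4,5), (3,5,6), (3,5,7), (3,6,7),
                                     (3,6,8), (5,6,7), (6,8,9)]}

def candidates_in(transaction, candidates, k=3):
    """Return all candidate k-itemsets contained in the transaction."""
    return [set(s) for s in combinations(sorted(transaction), k)
            if frozenset(s) in candidates]

print(candidates_in({1, 2, 3, 5, 6}, candidates))
# -> [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]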
Challenges of Frequent Pattern Mining • Drawbacks of Apriori • Multiple scans of the transaction database • Huge number of candidates generated and tested • Tedious workload of support counting for candidates • Ideas for improvement • Reduce the number of transaction database scans • Shrink the number of candidates • Facilitate support counting of candidates
DIC: Reduce the Number of Scans • Once both A and D are determined frequent, the counting of AD begins • Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins • Allows counting candidate sets at multiple levels in one database scan • (The slide showed the itemset lattice over {A, B, C, D}, from 1-itemsets up to ABCD, and contrasted how Apriori and DIC advance through it as transactions are scanned.) • S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
Eclat/MaxEclat and VIPER: Exploring Vertical Data Format • Use tid-list, the list of transaction-ids containing an itemset • Merging of tid-lists • Itemset A: t1, t2, t3, sup(A)=3 • Itemset B: t2, t3, t4, sup(B)=3 • Itemset AB: t2, t3, sup(AB)=2 • Major operation: intersection of tid-lists • M. Zaki et al. New algorithms for fast discovery of association rules. In KDD’97 • P. Shenoy et al. Turbo-charging vertical mining of large databases. In SIGMOD’00
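The tid-list merge in this example is just a set intersection; a minimal sketch using the same numbers:

tid_lists = {
    "A": {1, 2, 3},   # itemset A occurs in t1, t2, t3 -> sup(A) = 3
    "B": {2, 3, 4},   # itemset B occurs in t2, t3, t4 -> sup(B) = 3
}

def tidlist(itemset, tid_lists):
    """Intersect the tid-lists of all items in `itemset`."""
    sets = [tid_lists[i] for i in itemset]
    out = sets[0]
    for s in sets[1:]:
        out = out & s
    return out

ab = tidlist(["A", "B"], tid_lists)
print(ab, len(ab))  # {2, 3} 2 -> sup(AB) = 2, as in the example above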
Mining Frequent Patterns Without Candidate Generation • Grow long patterns from short ones using local frequent items • "abc" is a frequent pattern • Get all transactions having "abc": DB|abc • "d" is a local frequent item in DB|abc => abcd is a frequent pattern
Construct FP-tree from a Transaction Database (min_support = 3)
TID | Items bought | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
• Scan the DB once, find frequent 1-itemsets (single-item patterns)
• Sort frequent items in frequency-descending order to get the f-list: f-c-a-b-m-p
• Scan the DB again, construct the FP-tree
(The slide showed the resulting FP-tree: from the root {}, the path f:4-c:3-a:3-m:2-p:2, with b:1-m:1 branching under a:3, b:1 under f:4, and a separate branch c:1-b:1-p:1; plus a header table with item frequencies f:4, c:4, a:3, b:3, m:3, p:3 and node-links into the tree.)
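A compact Python sketch of the two-scan construction above; class and field names are illustrative, not the lecture's code, and node-links are kept as simple per-item lists.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fptree(transactions, min_support):
    # Scan 1: count items and build the f-list (frequency-descending order)
    counts = Counter(item for t in transactions for item in t)
    flist = [i for i, n in counts.most_common() if n >= min_support]
    order = {item: rank for rank, item in enumerate(flist)}
    root = FPNode(None, None)
    header = {item: [] for item in flist}   # item -> list of nodes (node-links)
    # Scan 2: insert each transaction's ordered frequent items into the tree
    for t in transactions:
        items = sorted((i for i in t if i in order), key=order.get)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(db, min_support=3)
print(flist)   # e.g. ['f', 'c', 'a', 'm', 'p', 'b'] -- ties among count-3 items may differ from the slide's f-c-a-b-m-p
print([(n.item, n.count) for n in header["p"]])   # [('p', 2), ('p', 1)], the two p-nodes of the tree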
Properties of the FP-tree Structure • Completeness • Preserves complete information for frequent pattern mining • Never breaks a long pattern of any transaction • Compression • Reduces irrelevant information: infrequent items are gone • Items are organized in frequency-descending order: the more frequently occurring, the more likely to be shared • Never larger than the original database (not counting node-links and the count field)
Partition Patterns and Databases • Frequent patterns can be partitioned into subsets according to the f-list • F-list = f-c-a-b-m-p • Patterns containing p • Patterns having m but no p • … • Patterns having c but none of a, b, m, p • Pattern f • Completeness and non-redundancy
Find Patterns Having P From P-conditional Database
• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item p
• Accumulate all transformed prefix paths of item p to form p's conditional pattern base
Conditional pattern bases (item | conditional pattern base):
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
(The slide repeated the FP-tree and header table from the construction slide.)
From Conditional Pattern-bases to Conditional FP-trees
• For each pattern base:
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the pattern base
• Example: the m-conditional pattern base is fca:2, fcab:1; the resulting m-conditional FP-tree is the single path {}-f:3-c:3-a:3 (b is dropped because its count, 1, is below min_support)
(The slide showed the full FP-tree and header table alongside.)
Recursion: Mining Each Conditional FP-tree
• Starting from the m-conditional FP-tree ({}-f:3-c:3-a:3):
• Conditional pattern base of "am": (fc:3) -> am-conditional FP-tree: {}-f:3-c:3
• Conditional pattern base of "cm": (f:3) -> cm-conditional FP-tree: {}-f:3
• Conditional pattern base of "cam": (f:3) -> cam-conditional FP-tree: {}-f:3
Mining Frequent Patterns With FP-trees • Idea: frequent pattern growth • Recursively grow frequent patterns by pattern and database partition • Method • For each frequent item, construct its conditional pattern base, and then its conditional FP-tree • Repeat the process on each newly created conditional FP-tree • Until the resulting FP-tree is empty, or it contains only one path: a single path generates all the combinations of its sub-paths, each of which is a frequent pattern. Note that the m-conditional tree above had a single path (i.e., we didn't have to recurse there; the recursion was shown only for illustration)
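The divide-and-conquer idea can be sketched without an explicit tree by recursing directly over conditional pattern bases; this is a simplified stand-in for FP-growth, not the full FP-tree-based algorithm. The transactions below are the ordered frequent-item lists from the construction slide.

from collections import Counter

def pattern_growth(pattern_base, min_support, suffix=()):
    """pattern_base: list of (ordered item list, count); yields (pattern, support).
    Items in each list must already be ordered consistently (by the f-list)."""
    counts = Counter()
    for items, count in pattern_base:
        for item in set(items):
            counts[item] += count
    for item, support in counts.items():
        if support < min_support:
            continue
        new_pattern = (item,) + suffix
        yield new_pattern, support
        # Build item's conditional pattern base: the prefixes ending just before item
        conditional = []
        for items, count in pattern_base:
            if item in items:
                prefix = items[: items.index(item)]
                if prefix:
                    conditional.append((prefix, count))
        yield from pattern_growth(conditional, min_support, new_pattern)

db = [(list("fcamp"), 1), (list("fcabm"), 1), (list("fb"), 1),
      (list("cbp"), 1), (list("fcamp"), 1)]      # the ordered transactions above
for pattern, support in pattern_growth(db, min_support=3):
    print("".join(pattern), support)             # includes m, am, cam, fcam, ... each with support 3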
From Conditional Pattern-bases to Conditional FP-trees: Single Path
• m-conditional pattern base: fca:2, fcab:1
• m-conditional FP-tree: the single path {}-f:3-c:3-a:3
• All frequent patterns relating to m are generated from this single path: m, fm, cm, am, fcm, fam, cam, fcam
FP-Growth vs. Apriori: Scalability With the Support Threshold
(The slide showed a run-time comparison of FP-growth and Apriori on the synthetic data set T25I20D10K as the support threshold varies.)
Interestingness Measure: Correlations (Lift) • play basketball => eat cereal [40%, 66.7%] is misleading • The overall percentage of students eating cereal is 75%, which is higher than 66.7% • play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence • Measure of dependent/correlated events: lift
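For reference (the formula itself is not spelled out on this slide), lift relates a rule's confidence to the consequent's overall frequency: lift(A => B) = P(A and B) / (P(A) · P(B)) = conf(A => B) / P(B). For the example above, lift(basketball => cereal) = 0.667 / 0.75 ≈ 0.89 < 1 (negatively correlated), while lift(basketball => not cereal) = 0.333 / 0.25 ≈ 1.33 > 1.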
Max-patterns • A frequent pattern {a1, …, a100} has C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27×10^30 frequent sub-patterns! • Max-pattern: a frequent pattern without a proper frequent super-pattern • In the slide's example (min_sup = 2): BCDE and ACD are max-patterns; BCD is not a max-pattern
MaxMiner: Mining Max-patterns • 1st scan: find frequent items • A, B, C, D, E • 2nd scan: find support for • AB, AC, AD, AE, ABCDE • BC, BD, BE, BCDE • CD, CE, CDE, DE • Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan • R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98 • Idea: for any frequent itemset X, find all single items j not in X such that X ∪ {j} is frequent; let Y be the set of these items. If X ∪ Y is frequent, then every subset of X ∪ Y is also frequent, so X ∪ Y is a potential max-pattern
Cubegrades: Generalization of Association Rules • Instead of itemsets, the body and consequent of an association rule can be attribute-value pairs on the dimensions of a cube (e.g., area='urban', company='large' => compensation='A' (support: 80, conf: 80%)) • An association rule A=a1 -> B=b1 (s, c) can be seen as expressing how support (COUNT()) is affected as we specialize (drill down) from the source cell A=a1 to the target cell A=a1, B=b1 • Generalizations of this include: • Rather than just drill-downs, we can apply other operations and measure how they affect the cell (e.g., generalizations, mutations) • Allow comparing other measures between source and target cells (rather than just COUNT()), for example AVG(commuteTime) or MAX(overtime) • Cubegrades capture these generalizations (the name is derived from "cube gradients")
Examples of cubegrades • Area='urban', Age=[25-35] => Education='masters' [Avg(Salary)=75K, Delta-Avg(Salary)=140%] • This cubegrade means that Avg(Salary) increases by 40%, to 75K, for workers aged 25-35 in urban areas if they get a master's degree • Area='urban', Age=[25-35], Education='bachelors' => Education='masters' [Avg(Salary)=75K, Delta-Avg(Salary)=150%] • This cubegrade means that Avg(Salary) increases by 50%, to 75K, for workers aged 25-35 in urban areas who have bachelor's degrees, if they get a master's degree
Computing cubegrades • For computing cubegrades, it is no longer sufficient to find the cubes that meet the support threshold. Rather, we have additional constraints on measures such as MIN(), MAX(), SUM(), AVG() and COUNT() over multiple measure attributes • Use the GBP algorithm referenced in lecture 4
Clustering • Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster • Can be formalized using distance metrics in several ways • E.g. Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized • Centroid: point defined by taking average of coordinates in each dimension. • Another metric: minimize average distance between every pair of points in a cluster
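One standard way to operationalize the centroid criterion above is k-means; the following is a minimal sketch (the points and names are illustrative, not from the lecture): assign each point to its nearest centroid, recompute centroids as coordinate-wise means, and repeat.

import random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: new centroid = coordinate-wise mean of its cluster
        centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)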
Hierarchical Clustering • Example from biological classification (the word "classification" here does not mean a prediction mechanism): chordata splits into mammalia (leopards, humans) and reptilia (snakes, crocodiles) • Agglomerative clustering algorithms • Build small clusters, then cluster small clusters into bigger clusters, and so on • Divisive clustering algorithms • Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones
Collaborative Filtering • Goal: predict what movies/books/… a person may be interested in, on the basis of • Past preferences of the person • Other people with similar past preferences • The preferences of such people for a new movie/book/… • One approach based on repeated clustering • Cluster people on the basis of preferences for movies • Then cluster movies on the basis of being liked by the same clusters of people • Again cluster people based on their preferences for (the newly created clusters of) movies • Repeat above till equilibrium • Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
Associative Classification • CBA (Classification By Association: Liu, Hsu & Ma @ KDD'98), CMAR (Classification based on Multiple Association Rules: Li, Han, Pei @ ICDM'01) • Mine possible association rules of the form: cond-set (a set of attribute-value pairs) => class label • Build a classifier: select a subset of high-quality rules • Achieves higher accuracy than traditional classification (e.g., C4.5) • Disadvantages: • Generates a large number of association rules, which can incur a large overhead • The selected rules may lead to overfitting, both because of the large number of generated rules and because of the confidence-based evaluation measure
CPAR (Classification Based on Predictive Association Rules) • Reference: Xiaoxin Yin, Jiawei Han: CPAR: Classification based on Predictive Association Rules. SDM 2003 • This combines the advantages of both traditional classification and rule-based associative classification. • It uses a greedy approach to generate rules • Avoids overfitting by using expected accuracy. • For prediction, CPAR uses the class label determined by the k best rules satisfied by the example. • Rule generation is based on FOIL ( First Order Inductive Learner).
FOIL • Tuples are labeled '+' or '-' as the class value • A greedy algorithm that learns rules to distinguish positive examples from negative ones. It iteratively searches for the current best rule and removes all the positive examples covered by that rule. This is repeated until all the positive examples in the data set are covered • For multiple-class problems: for each class, its examples are used as positive examples and those of the other classes as negative ones. The rules for all classes are merged together to form the resulting rule set
FOIL Gain • |P| = number of positive examples satisfying the current rule r's body • |P*| = number of positive examples satisfying the new rule's body, obtained by adding literal p to r • |N| = number of negative examples satisfying the current rule r's body • |N*| = number of negative examples satisfying the new rule's body, obtained by adding literal p to r • Gain(p) = |P*| · ( log(|P*| / (|P*| + |N*|)) - log(|P| / (|P| + |N|)) )
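The gain formula translates directly into code; the example counts below are made up for illustration, and log base 2 is an assumed (common) choice since any fixed base gives the same ranking of literals.

import math

def foil_gain(p_old, n_old, p_new, n_new):
    """Gain of adding a literal: p_*/n_* are the positive/negative example counts
    satisfying the old rule body and the extended rule body, respectively."""
    return p_new * (math.log2(p_new / (p_new + n_new)) - math.log2(p_old / (p_old + n_old)))

print(foil_gain(p_old=10, n_old=10, p_new=8, n_new=2))  # 8 * (log2(0.8) - log2(0.5)) ≈ 5.42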
PRM (Predictive Rule Mining) • Based on FOIL • After an example is correctly covered by a rule, it is not removed. Rather, its weight is decreased by a decay factor • This weighted version of FOIL produces more rules, and each positive example is usually covered more than once
PRM Algorithm Data Structure • Maintains a PNArray. This structure stores the following information for a rule r: • P and N: the numbers of '+' and '-' examples satisfying r's body • P(p) and N(p): for each possible literal p, the numbers of '+' and '-' examples satisfying r''s body, where r' is constructed by adding p to r
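A minimal sketch of how a PNArray-like record could be represented; the field names and types are my own, not taken from the PRM/CPAR papers. Weighted (float) counts allow the PRM-style decay of already-covered positive examples.

from dataclasses import dataclass, field
from typing import Dict, Tuple

Literal = Tuple[str, str]        # e.g. ("area", "urban"), i.e. attribute = value

@dataclass
class PNArray:
    P: float = 0.0                                            # weighted '+' examples satisfying the rule body
    N: float = 0.0                                            # weighted '-' examples satisfying the rule body
    P_p: Dict[Literal, float] = field(default_factory=dict)   # P after adding literal p to the rule
    N_p: Dict[Literal, float] = field(default_factory=dict)   # N after adding literal p to the rule

pn = PNArray(P=12.0, N=5.0)
pn.P_p[("area", "urban")] = 9.0
pn.N_p[("area", "urban")] = 1.0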
CPAR Algorithm • Similar to the PRM algorithm, except that: • Instead of ignoring all literals except the best one, it keeps all close-to-the-best literals during the rule-building process • Thus, by possibly selecting more than one literal at a time, it can build several rules simultaneously
Laplace Accuracy • Laplace accuracy of a rule predicting class c = (nc + 1) / (ntot + k), where: • k is the number of classes • ntot is the number of examples satisfying the rule body • nc is the number of examples satisfying the rule body that belong to class c
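The estimate translates directly into code (the example numbers are illustrative):

def laplace_accuracy(n_c, n_tot, k):
    """Expected accuracy of a rule predicting class c.
    n_c: examples satisfying the rule body that belong to c
    n_tot: examples satisfying the rule body
    k: number of classes"""
    return (n_c + 1) / (n_tot + k)

print(laplace_accuracy(n_c=9, n_tot=10, k=2))  # 10/12 ≈ 0.83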
Classification Using CPAR • The following procedure is used to predict the class of a given example: • select all the rules whose bodies are satisfied by the example • from the rules selected in step (1), select the best k rules for each class • compare the average expected accuracy of the best k rules of each class and choose the class with the highest expected accuracy as the predicted class
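A sketch of this three-step prediction procedure, under an assumed rule representation (body as a set of attribute-value pairs with a precomputed Laplace accuracy); the example rules are illustrative.

def predict(example, rules, k=5):
    """rules: list of (body, class_label, laplace_accuracy); example: dict attribute -> value."""
    # Step 1: rules whose bodies are satisfied by the example
    satisfied = [r for r in rules if all(example.get(a) == v for a, v in r[0])]
    # Step 2: group accuracies by class
    best_by_class = {}
    for body, label, acc in satisfied:
        best_by_class.setdefault(label, []).append(acc)
    # Step 3: average the best k accuracies per class and pick the highest
    scores = {
        label: sum(sorted(accs, reverse=True)[:k]) / min(k, len(accs))
        for label, accs in best_by_class.items()
    }
    return max(scores, key=scores.get) if scores else None

rules = [({("area", "urban")}, "yes", 0.9), ({("age", "young")}, "no", 0.7)]
print(predict({"area": "urban", "age": "young"}, rules, k=3))  # "yes"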