Data Mining
Data Mining (DM)/ Knowledge Discovery in Databases (KDD) “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al, 1992]
Need for Data Mining • Increased ability to generate data • Remote sensors and satellites • Bar codes for commercial products • Computerization of businesses
Need for Data Mining • Increased ability to store data • Media: bigger magnetic disks, CD-ROMs • Better database management systems • Data warehousing technology
Need for Data Mining • Examples • Wal-Mart records 20,000,000 transactions/day • Healthcare transactions yield multi-GB databases • Mobil Oil exploration storing 100 terabytes • Human Genome Project, multi-GBs and increasing • Astronomical object catalogs, terabytes of images • NASA EOS, 1 terabyte/day
Something for Everyone • Bell Atlantic • MCI • Land’s End • Visa • Bank of New York • FedEx
Market Analysis and Management • Customer profiling • Data mining can tell you what types of customers buy what products (clustering or classification) or what products are often bought together (association rules). • Identifying customer requirements • Discover relationship between personal characteristics and probability of purchase • Discover correlations between purchases
Fraud Detection and Management • Applications: • Widely used in health care, retail, credit card services, telecommunications, etc. • Approach: • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. • Examples: • Auto Insurance • Money Laundering • Medical Insurance
[Diagram: data mining at the intersection of statistics, database systems, AI, and hardware]
Mining Association Rules • Association rule mining: • Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses. • Applications: • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc. • Examples: • Rule form: “Body → Head [support, confidence]” • Buys=Diapers → Buys=Beer [0.5%, 60%] • Major=CS ∧ Class=DataMining → Grade=A [1%, 75%]
Rule Measures: Support and Confidence • [Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both] • Find all rules X ∧ Y → Z with minimum confidence and support • support, s: probability that a transaction contains {X, Y, Z} • confidence, c: conditional probability that a transaction containing {X, Y} also contains Z • For minimum support 50%, minimum confidence 50%: • A → C (50%, 66.6%) • C → A (50%, 100%)
Association Rule • Given • Set of items I = {i1, i2, .., im} • Set of transactions D • Each transaction T in D is a set of items • An association rule is an implication X → Y, where X and Y are itemsets • The rule meets a minimum confidence c (c% of transactions in D that contain X also contain Y): c = #trans(X ∪ Y) / #trans(X) • A minimum support s is also met: s = #trans(X ∪ Y) / |D|
Mining Strong Association Rules in Transaction DBs • Measurement of rule strength in a transaction DB: A → B [support, confidence] • support = Prob(A ∪ B) = (# of trans containing all the items in A ∪ B) / (total # of trans) • confidence = Prob(B | A) = (# of trans that contain both A and B) / (# of trans containing A) • We are often interested only in strong associations, i.e. support ≥ min_sup and confidence ≥ min_conf • Examples: milk → bread [5%, 60%]; tire ∧ auto_accessories → auto_services [2%, 80%]
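Both measures reduce to counting transactions. Below is a minimal Python sketch of the two formulas above; the toy transactions, item names, and the printed rule are illustrative, not taken from the slides.

```python
# Minimal sketch: support and confidence for a candidate rule A -> B.
# The transaction data below is illustrative only.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "cereal"},
]

s = support(transactions, {"milk", "bread"})       # Prob(milk and bread) = 0.5
c = confidence(transactions, {"milk"}, {"bread"})  # Prob(bread | milk) = 2/3
print(f"milk -> bread  [support {s:.0%}, confidence {c:.0%}]")
```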
Methods for Mining Associations • Apriori • Partition Technique: • Sampling technique • Anti-Skew • Multi-level or generalized association • Constraint-based or query-based association
Apriori (Levelwise) • Scan database multiple times • For the i-th scan, find all large itemsets of size i with minimum support • Use the large itemsets from scan i as input to scan i+1 • Create candidates: itemsets of size i+1 all of whose subsets are large itemsets • Notation: large k-itemset, Lk; set of candidate large itemsets of size k, Ck • Note: If {A,B} is not a large itemset, then no superset of it can be either.
Mining Association Rules -- Example • Min. support 50%, min. confidence 50% • For rule A → C: • support = support({A, C}) = 50% • confidence = support({A, C}) / support({A}) = 66.6% • Apriori principle: Any subset of a frequent itemset must be frequent.
Minsup = 0.25, Minconf = 0.5
L1 = {(A, 3), (B, 2), (C, 2), (D, 1), (E, 1), (F, 1)}
C2 = {(A,B), (A,C), (A,D), (A,E), (A,F), (B,C), .., (E,F)}
L2 = {(A,B, 1), (A,C, 2), (A,D, 1), (B,C, 1), (B,E, 1), (B,F, 1), (E,F, 1)}
C3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,E), (B,C,F), (B,E,F)}
L3 = {(A,B,C, 1), (B,E,F, 1)}
C4 = {}, L4 = {}, end of program
Possible rules:
A=>B (c=.33, s=1), B=>A (c=.5, s=1), A=>C (c=.67, s=2), C=>A (c=1.0, s=2), A=>D (c=.33, s=1), D=>A (c=1.0, s=1), B=>C (c=.5, s=1), C=>B (c=.5, s=1), B=>E (c=.5, s=1), E=>B (c=1, s=1), B=>F (c=.5, s=1), F=>B (c=1, s=1)
A=>B&C (c=.33, s=1), B=>A&C (c=.5, s=1), C=>A&B (c=.5, s=1), A&B=>C (c=1, s=1), A&C=>B (c=.5, s=1), B&C=>A (c=1, s=1), B=>E&F (c=.5, s=1), E=>B&F (c=1, s=1), F=>B&E (c=1, s=1), B&E=>F (c=1, s=1), B&F=>E (c=1, s=1), E&F=>B (c=1, s=1)
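Below is a compact sketch of the level-wise loop. The four-transaction toy database is chosen to be consistent with the itemset counts in the example above; the candidate sets Ck may be slightly smaller than on the slide because this version also prunes during the join step (Apriori principle).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: large k-itemsets L_k feed candidate generation for C_{k+1}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Scan 1: large 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) / n >= min_support}
    large, k = {}, 1
    while Lk:
        large.update({s: count(s) / n for s in Lk})
        # Join L_k with itself to build (k+1)-candidates, then prune any candidate
        # that has an infrequent k-subset (the Apriori principle).
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        # Next scan: keep candidates that meet minimum support.
        Lk = {c for c in candidates if count(c) / n >= min_support}
        k += 1
    return large

# Toy database consistent with the counts in the example above (minsup = 0.25).
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent = apriori(db, 0.25)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"support={sup:.0%}")
```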
Partitioning • Requires only two passes through the external database • Divide database into n partitions, each of which fits in main memory • Scan 1: Process one partition in memory at a time, finding local large itemsets • Candidate large itemsets are the union of all local large itemsets (a superset of the actual large itemsets; may contain false positives) • Scan 2: Calculate support, determine actual large itemsets • If data is skewed, partitioning may not work well: the chance that a local large itemset is a global large itemset may be small.
Partitioning • Will any large itemsets be missed? • No: if l ∉ Li for every partition i, then t1(l)/t1 < MS & t2(l)/t2 < MS & … & tn(l)/tn < MS, thus t1(l) + t2(l) + … + tn(l) < MS * (t1 + t2 + … + tn), so l cannot be large in the whole database either.
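A rough sketch of the two-pass idea, reusing the `apriori` routine from the earlier sketch on each in-memory partition; the partition contents and threshold are illustrative.

```python
def partitioned_large_itemsets(partitions, min_support):
    """Two scans: mine each partition locally, then count the candidate union exactly."""
    # Scan 1: local large itemsets per in-memory partition (same relative min_support).
    candidates = set()
    for part in partitions:
        candidates |= set(apriori(part, min_support))  # union may contain false positives

    # Scan 2: one pass over the full database to get exact global supports.
    database = [frozenset(t) for part in partitions for t in part]
    n = len(database)
    result = {}
    for c in candidates:
        sup = sum(1 for t in database if c <= t) / n
        if sup >= min_support:            # drop the false positives
            result[c] = sup
    return result

partitions = [[{"A", "B", "C"}, {"A", "C"}],
              [{"A", "D"}, {"B", "E", "F"}]]
print(partitioned_large_itemsets(partitions, 0.5))
```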
Clementine (UK, bought by SPSS) • The Web Node shows the strength of associations in the data, i.e. how often field values coincide
Multi-Level Association • [Concept hierarchy: food at the root; categories bread and milk; subtypes 2%, white, wheat, skim; brands Fraser and Sunset at the leaves] • A descendant of an infrequent itemset cannot be frequent • A transaction database can be encoded by dimensions and levels
Encoding Hierarchical Information in Transaction Database • A taxonomy for the relevant data items • Conversion of bar_code into generalized_item_id • [Taxonomy: food at the root; categories such as milk and bread; subtypes such as 2% and chocolate; brands such as Old Mills, Wonder, Dairyland, and Foremost]
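One way to picture the encoding step is a lookup table from raw items (bar codes) to generalized item ids one or more levels up the taxonomy. The taxonomy entries and item names below are invented for illustration only.

```python
# Illustrative only: map raw items to generalized item ids using a small taxonomy.
taxonomy = {                      # child -> parent
    "Dairyland 2% milk": "2% milk",
    "Foremost 2% milk": "2% milk",
    "2% milk": "milk",
    "Wonder white bread": "white bread",
    "white bread": "bread",
    "milk": "food",
    "bread": "food",
}

def generalize(item, level_up=1):
    """Replace an item by its ancestor `level_up` levels higher, if one exists."""
    for _ in range(level_up):
        item = taxonomy.get(item, item)
    return item

transaction = ["Dairyland 2% milk", "Wonder white bread"]
print([generalize(i) for i in transaction])               # ['2% milk', 'white bread']
print([generalize(i, level_up=2) for i in transaction])   # ['milk', 'bread']
```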
Mining Surprising Temporal Patterns (Chakrabarti et al.) • [Cartoon, 1990: “Milk and cereal sell together!”] • Find prevalent rules that hold over large fractions of data • Useful for promotions and store arrangement • Intensively researched
Prevalent != Interesting • [Cartoon, 1995: “Milk and cereal sell together!” … 1998: “Milk and cereal sell together!” Zzzz...] • Analysts already know about prevalent rules • Interesting rules are those that deviate from prior expectation • Mining’s payoff is in finding surprising phenomena
Association Rules - Strengths & Weaknesses • Strengths • Understandable and easy to use • Useful • Weaknesses • Brute-force methods can be expensive (memory and time) • Apriori is O(CD), where C = sum of sizes of candidates (2^n possible, n = # of items), D = size of database • Association does not necessarily imply correlation • Validation? • Maintenance?
Clustering • Group similar items together • Example: sorting laundry • Similar items may have important attributes / functionality in common • Group customers together with similar interests and spending patterns • Form of unsupervised learning • Cluster objects into classes using rule: • Maximize intraclass similarity, minimize interclass similarity
Clustering Techniques • Partition • Enumerate all partitions • Score by some criteria • K means • Hierarchical • Model based • Hypothesize model for each cluster • Find model that best fits data • AutoClass, Cobweb
Clustering Goal • Suppose you transmit coordinates of points drawn randomly from this dataset • Only allowed 2 bits/point • What encoder/decoder will lose least information?
Idea One • Break into a grid • Decode each bit-pair as the middle of its grid cell • [Grid cells labeled 00, 01, 10, 11]
Idea Two • Break into a grid • Decode each bit-pair as the centroid of all data in that grid cell • [Grid cells labeled 00, 01, 10, 11]
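The two ideas are easy to compare numerically: decode each 2-bit code either to its cell centre or to the centroid of the points that landed in that cell, and measure mean squared reconstruction error. A small sketch on synthetic 2-D data; the data, grid, and cell labeling are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic blobs in the unit square (illustrative data).
points = np.vstack([rng.normal([0.25, 0.25], 0.05, (100, 2)),
                    rng.normal([0.75, 0.75], 0.05, (100, 2))])

# 2 bits/point = 4 codewords; use a 2x2 grid over [0,1]^2.
cell = (points[:, 0] > 0.5).astype(int) * 2 + (points[:, 1] > 0.5).astype(int)

# Idea one: decode each code as the middle of its grid cell.
grid_centres = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
err_grid = np.mean(np.sum((points - grid_centres[cell]) ** 2, axis=1))

# Idea two: decode each code as the centroid of the points in that cell.
centroids = np.array([points[cell == c].mean(axis=0) if np.any(cell == c)
                      else grid_centres[c] for c in range(4)])
err_centroid = np.mean(np.sum((points - centroids[cell]) ** 2, axis=1))

print(f"grid-centre decoding error: {err_grid:.4f}")
print(f"centroid decoding error:    {err_centroid:.4f}")
```

Because the centroid minimizes squared error within each cell, idea two never does worse; k-means takes this one step further by letting the cells themselves adapt to the data.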
K Means Clustering • Ask user how many clusters (e.g., k=5) • Randomly guess k cluster center locations • Each data point finds closest center • Each cluster finds new centroid of its points • Repeat until…
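A bare-bones version of the loop just described, assuming NumPy and Euclidean distance. The random initialization and the “centres stop moving” stopping rule are the simplest possible choices; the issues slide that follows lists exactly the questions this sketch glosses over.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Plain k-means: random initial centres, assign, re-centre, repeat."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]  # random guess
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Each data point finds its closest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Each cluster finds the new centroid of its points (keep old centre if empty).
        new_centres = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):   # repeat until centres stop moving
            break
        centres = new_centres
    return centres, labels

points = np.random.default_rng(1).normal(size=(200, 2))
centres, labels = k_means(points, k=5)
print(centres)
```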
K Means Issues • Computationally efficient • Initialization • Termination condition • Distance measure • What should k be?
Hierarchical Clustering • Each point is its own cluster • Find most similar pair of clusters • Merge it into a parent cluster • Repeat
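A naive agglomerative sketch of these four steps. Single-link distance (closest pair of points) is used here as the notion of “most similar”, which is one choice among several (complete-link, average-link, etc.); the data is synthetic.

```python
import numpy as np

def agglomerative(points, target_k=1):
    """Naive hierarchical clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]   # each point is its own cluster
    merges = []
    while len(clusters) > target_k:
        # Find the most similar pair of clusters (single link: closest pair of points).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the pair into a parent cluster and record the step (a dendrogram entry).
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

points = np.random.default_rng(2).normal(size=(10, 2))
clusters, merges = agglomerative(points, target_k=3)
print(clusters)
```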