Computational Intelligence in Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate
Data Mining: Association Rules
Some slides have been adapted from the slides at the Data Mining textbook’s website: http://www.cs.uiuc.edu/~hanj/bk3/
Reading • Chapter 6, Text 3 (skip sections 6.2.3 to 6.2.6)
What is Data Mining? • Discovering interesting patterns and knowledge from massive amounts of data • Interesting means: • Non-trivial • Previously unknown • Potentially useful • Alternative names • Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
The Need for Data Mining • There has been an explosive growth of data • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube, … • Medical: Electronic health records, biosensors, biomedical signals and images, … • “We are drowning in data, but starving for knowledge!” • Data mining is in great demand, and has wide applications
Knowledge Discovery (KDD) Process • Data mining plays an essential role in the knowledge discovery process • Typical KDD pipeline: Databases → Data Integration and Data Cleaning → Data Warehouse → Selection of Task-relevant Data → Data Mining → Interesting Patterns → Pattern Evaluation → Knowledge
Data Mining and Machine Learning • Closely related areas; their techniques overlap • Machine learning: Automatically learn from data to predict an output • Data mining: Automatically find patterns in data • Relation between the two: predicting an output requires discovering patterns between input and output, e.g. decision trees
Data Mining Tasks • Clustering: Grouping examples by their similarity • Classification: Predicting a label for an example from patterns learned on labeled data • Pattern mining: Finding frequently occurring patterns in data
What is Frequent Pattern Analysis? • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS’93] in the context of frequent itemsets and association rule mining • Motivation: Finding inherent regularities in data • What products were often purchased together?— Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? • What combinations of drugs lead to which adverse effects? • Applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Basic Concepts: Itemsets • Itemset: A set of one or more items • k-itemset: An itemset with k elements, X = {x1, …, xk} • (Absolute) support, or support count, of X: the number of transactions that contain the itemset X • (Relative) support, s, of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X’s support is no less than a minsup (minimum support) threshold
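For concreteness, a minimal Python sketch (not from the slides; names are illustrative) of computing the absolute and relative support of an itemset over a list of transactions, using the five-transaction example that appears later in this lecture:

```python
# Sketch: (absolute) support count and (relative) support of an itemset.
def support(itemset, transactions):
    """Return (support_count, relative_support) of `itemset`."""
    itemset = set(itemset)
    count = sum(1 for t in transactions if itemset <= set(t))
    return count, count / len(transactions)

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

print(support({"Beer", "Diaper"}, transactions))  # (3, 0.6): frequent if minsup <= 60%
```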
Itemsets • Important Observations: • If an itemset is frequent (w.r.t. a minimum support threshold) then all its subsets are also frequent • If {beer, diaper, nuts} occurs more than 100 times then {beer, diaper} also occurs more than 100 times • If an itemset is not frequent then none of its supersets can be frequent • If {beer, diaper} occurs fewer than 100 times then {beer, diaper, nuts} or {beer, diaper, nuts, milk} must also occur fewer than 100 times • These observations are used to mine patterns efficiently
Basic Concepts: Association Rules • Patterns or rules of the form X → Y, where X and Y are itemsets; the presence of X implies the presence of Y • Support of an association rule is the probability that a transaction contains X ∪ Y: s = P(X ∪ Y) • Confidence of an association rule is the conditional probability that a transaction containing X also contains Y: c = P(Y|X) = support(X ∪ Y)/support(X)
Basic Concepts: Association Rules • Example transaction database:
Tid 10: Beer, Nuts, Diaper
Tid 20: Beer, Coffee, Diaper
Tid 30: Beer, Diaper, Eggs
Tid 40: Nuts, Eggs, Milk
Tid 50: Nuts, Coffee, Diaper, Eggs, Milk
• Association rules (many more exist): • {Beer} → {Diaper} (support = 60%, confidence = 100%) • {Diaper} → {Beer} (support = 60%, confidence = 75%) • {Coffee, Diaper} → {Nuts, Milk} (support = 20%, confidence = 50%)
Finding Association Rules • Given a database (e.g. transaction records), find all rules of the form X → Y with a minimum support and a minimum confidence • For the transaction database above, with min support = 50% and min confidence = 80%: • {Beer} → {Diaper} (support = 60%, confidence = 100%) qualifies • {Diaper} → {Beer} (support = 60%, confidence = 75%) fails the confidence threshold • {Coffee, Diaper} → {Nuts, Milk} (support = 20%, confidence = 50%) fails both thresholds
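As a quick check of these numbers, a small Python sketch (illustrative only, reusing the `transactions` list defined above) that computes the support and confidence of a candidate rule X → Y:

```python
# Sketch: support and confidence of a rule X -> Y over the example transactions.
def rule_stats(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y.
    Assumes X occurs in at least one transaction."""
    X, Y = set(X), set(Y)
    n = len(transactions)
    n_x = sum(1 for t in transactions if X <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    return n_xy / n, n_xy / n_x   # support = P(X u Y), confidence = P(Y|X)

print(rule_stats({"Beer"}, {"Diaper"}, transactions))   # (0.6, 1.0)
print(rule_stats({"Diaper"}, {"Beer"}, transactions))   # (0.6, 0.75)
```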
Efficient and Scalable Mining Methods • Mining all patterns or association rules is computationally expensive: • If there are 100 items, there are 2^100 − 1 ≈ 1.27×10^30 possible itemsets! • A naïve method would run for a prohibitively long time • Three well-known methods: • Apriori (Agrawal & Srikant @VLDB’94) • Frequent pattern growth (FPgrowth, Han, Pei & Yin @SIGMOD’00) • Vertical data format approach (Charm, Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach • First efficiently finds all itemsets with minimum support, then discovers association rules from them with minimum confidence • Apriori pruning principle: If any itemset is infrequent (support less than the minimum support), its supersets should not be generated or tested! (Agrawal & Srikant @VLDB’94, Mannila et al. @KDD’94) • Method: • Initially, scan the database once to get the frequent 1-itemsets • Generate length (k+1) candidate itemsets from the length-k frequent itemsets • Test the candidates against the database for minimum support • Terminate when no frequent or candidate set can be generated
Apriori Algorithm: Generating Frequent Itemsets • Worked example (minsup = 2): 1st scan of the database counts candidate 1-itemsets C1 and keeps the frequent ones L1; candidate 2-itemsets C2 are generated from L1, counted in a 2nd scan, and the frequent ones kept as L2; candidate 3-itemsets C3 are generated from L2 and pruned, counted in a 3rd scan, and the frequent ones kept as L3
Apriori Algorithm: Efficient Steps • Lk to Ck+1: Generating candidate (k+1)-itemsets from the frequent k-itemsets is done efficiently using only the items within those k-itemsets • Ck+1 to Lk+1: If any k-itemset subset of a candidate (k+1)-itemset has less than minimum support (i.e., it is not present in Lk), prune that (k+1)-itemset; scan the database to obtain the support of the remaining (k+1)-itemsets • In the worked example: prune {A,B,C} because its subset {A,B} has less than minimum support (not present in L2); prune {A,C,E} because its subset {A,E} has less than minimum support (not present in L2)
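A compact Python sketch of this level-wise loop (illustrative only, not the original Agrawal & Srikant implementation): generate candidate (k+1)-itemsets from the current frequent k-itemsets, prune any candidate with an infrequent k-subset, then scan the transactions to count support.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Frequent-itemset mining sketch. Returns {frozenset: support_count}.
    minsup is an absolute support count."""
    transactions = [set(t) for t in transactions]
    # 1st scan: count candidate 1-itemsets (C1) and keep the frequent ones (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Lk -> Ck+1: candidates built from items of the frequent k-itemsets,
        # pruned if any k-subset is not itself frequent (Apriori property)
        items = sorted({i for itemset in frequent for i in itemset})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in frequent for s in combinations(c, k))]
        # Ck+1 -> Lk+1: scan the database to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

On the five-transaction example above, apriori(transactions, minsup=3) (i.e. 60% relative support) returns {Beer}, {Nuts}, {Diaper}, {Eggs}, and {Beer, Diaper}.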
Generating Association Rules from Frequent Itemsets • Once the frequent itemsets L are discovered, for each of them: • Consider all its non-empty proper subsets S • e.g. for {B,C,E}: {B,C}, {C,E}, {B,E}, {B}, {C}, {E} • For each S, generate a candidate rule S → L − S • e.g. {B,C} → {E} • Determine its confidence: support(L)/support(S) • e.g. support({B,C,E})/support({B,C}) = 2/2 = 100% • Output the rule if its confidence is at least the minimum confidence prescribed
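A minimal sketch of this rule-generation step, assuming the frequent-itemset counts come from something like the apriori sketch above (so every subset of a frequent itemset is guaranteed to be present in the dictionary):

```python
from itertools import combinations

def generate_rules(freq_itemsets, min_conf):
    """freq_itemsets: {frozenset: support_count}. Yields (S, L - S, confidence)."""
    for L, sup_L in freq_itemsets.items():
        if len(L) < 2:
            continue
        for r in range(1, len(L)):                 # all non-empty proper subsets S
            for S in map(frozenset, combinations(L, r)):
                conf = sup_L / freq_itemsets[S]    # confidence = support(L)/support(S)
                if conf >= min_conf:
                    yield S, L - S, conf
```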
Interestingness Measure: Correlations (Lift) • play basketball → eat cereal [support = 40%, confidence = 66.7%] is misleading • The overall % of students eating cereal is 75% > 66.7% • play basketball → not eat cereal [support = 20%, confidence = 33.3%] is more accurate, although with lower support and confidence • Measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) P(B)) = confidence(A → B) / P(B) • lift > 1 means positively correlated, lift < 1 means negatively correlated, lift = 1 means they are independent • lift: occurrence of one “lifts” the other
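A quick check of the lift values for this example, using lift(A, B) = confidence(A → B) / P(B) with the numbers from the slide:

```python
# Worked lift check for the basketball/cereal example (slide numbers).
conf_cereal     = 0.667   # confidence of  basketball -> cereal
conf_not_cereal = 0.333   # confidence of  basketball -> not cereal
p_cereal, p_not_cereal = 0.75, 0.25

print(conf_cereal / p_cereal)          # ~0.89 < 1: negatively correlated
print(conf_not_cereal / p_not_cereal)  # ~1.33 > 1: positively correlated
```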
Measures of Interestingness • “Buy walnuts → buy milk [support = 1%, confidence = 80%]” is misleading if 85% of customers buy milk • Support and confidence alone are not good indicators of correlation • Over 20 interestingness measures have been proposed (Tan, Kumar, Srivastava @KDD’02)