Association Rule Mining ARM http://www.cs.ndsu.nodak.edu/~rahal/765/lectures/
Lecture Outline • Data Mining and Knowledge Discovery • Market Basket Research Models • Association Rule Mining • Apriori • Rule Generation • Methods To Improve Apriori’s Efficiency • Vertical Data Representation
What is Data Mining? • Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns and knowledge. • Valid: The patterns hold in general. • "Fargo is in Minnesota!" is not a valid pattern (Fargo is in North Dakota). • Novel: We did not know the pattern beforehand. • (live in Fargo) ⇒ (live in ND) is valid but not novel; we already know it. • Useful: We can devise actions from the patterns (actionable). • Understandable: We can interpret and comprehend the patterns.
What Motivated Data Mining? • Data mining emerged as a step in the evolution of IT • 1 - Data Collection and Database Creation • Primitive file processing • 1960s and earlier • 2 - Database Management Systems • Hierarchical/network/relational database systems • ERDs • SQL • Recovery and concurrency control in DBMSs • OLTP • 1970s - early 1980s
3.1 - Advanced Database Systems • Object-oriented/object-relational databases • Application-oriented databases • Spatial, multimedia, scientific, etc. • Mid-1980s - present • 3.2 - Web-based Database Systems • XML-based database systems • Web analysis and mining • Semantic Web (the whole web as a single XML database) • Mid-1990s - present
3.3 - Data Warehousing and Data Mining • Multi-dimensional data warehouses and OLAP technology • Data mining and knowledge discovery • Tools to assist people in their decision-making processes • Late 1980s - present
Why Use Data Mining Today? • Market competition pressure! • "The secret of success is to know something that nobody else knows." Aristotle Onassis • Wal-Mart vs. K-Mart • Right products, right place, right time, and right quantities • Personalization, CRM • Security, homeland defense • Analysis of important application data • Bioinformatics • Stock market data
Human analysis skills are inadequate: • Volume and dimensionality of the data • High data growth rate • Storage • Computational power • Off-the-shelf software • Other factors
Where Could All Of This Data Be Coming From? • Supermarket scanners • Preferred customer cards • Sunmart’s MoreCards • Credit card transactions • Call center records • ATM machines • Demographic data • Sensor networks • Cameras • Web server logs • Customer web site trails • Biological data (e.g. MicroArray Experiments for expression levels) • Image data
Types Of Data/Information Repositories For Data Mining • By definition, data mining should be applicable to any kind of information repository • Flat files • Relational databases • Data warehouses • Transactional databases • Advanced database systems • Object-oriented • Object-relational
Application-oriented databases • Multimedia • Text • Image • Video • Audio • Heterogeneous databases • Appear as centralized • Independent components managing different parts of the data
How Could We Describe Data? • Numerical: Domain is ordered and can be represented on the continuous real line (e.g. age, income); also called continuous • Nominal or categorical: Domain is a finite set without any natural ordering (e.g. occupation, marital status, race) • Ordinal: Domain is finite and ordered (e.g. grade scale, months in a year)
The Knowledge Discovery Process • Broader than Data Mining • Steps: • Identify the problem • Data mining • Action • Evaluation and measurement • Deployment and integration into real-life processes and/or applications
The Data Mining Step in More Detail • Cleaning and integration of various data sources • Remove noise and outliers • Missing values (e.g. null values) • Noisy data (errors) • Inconsistent data (from integration, e.g. FirstName in one source and F_Name in another) • Selection and transformation of relevant data into appropriate forms • Focus on fields of interest (e.g. the effect of education on salary) • Create common units (e.g. height recorded in both cm and inches) • Generate new fields • Discovery of interesting patterns from the data • Pattern evaluation to identify the interesting patterns based on some predefined measures • Knowledge presentation to communicate the mined knowledge to the user, mostly through visualization techniques, to provide a better view
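To make the cleaning and transformation bullets concrete, here is a minimal pandas sketch; the table and its column names (F_Name, height_cm, height_in) are invented for illustration, and a real pipeline would need more deliberate policies for missing values.

    import pandas as pd

    # A toy source table with the problems named above: an inconsistently
    # named field, missing values, and heights in two different units.
    df = pd.DataFrame({
        "F_Name":    ["ann", "bob", None, "dan"],
        "height_cm": [170.0, None, 182.0, 165.0],
        "height_in": [None, 71.0, None, None],
    })

    # Integration: unify inconsistent field names (FirstName vs. F_Name).
    df = df.rename(columns={"F_Name": "FirstName"})

    # Cleaning: drop records missing a key field (one simple policy among many).
    df = df.dropna(subset=["FirstName"])

    # Transformation: convert to common units (inches -> cm) in a single field.
    df["height_cm"] = df["height_cm"].fillna(df["height_in"] * 2.54)
    df = df.drop(columns=["height_in"])
    print(df)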
This process can be repeated as needed • Data mining systems are expected to handle large amounts of data • Analysis of small datasets is sometimes called machine learning or SDA (statistical data analysis) • In other words, data mining must be scalable to large data sets • Scalability and efficiency
[Figure: the knowledge discovery pipeline. Original Data -> (cleaning and integration) -> Target Data -> (selection and transformation) -> Preprocessed Data -> (data mining / discovery) -> Patterns -> (pattern evaluation) -> Knowledge -> (knowledge presentation)]
Data Mining Tasks • Characterization • The process of summarizing the general characteristics and features of a specific class of data (usually referred to as the target class) • E.g. characterizing the items in a store whose sales have decreased by 50% over a certain period of time • There may be common characteristics shared by all those items that we would like to uncover • E.g. all produced by a no-longer trusted producer
Discrimination • Discrimination is very similar to characterization in that it reveals the characteristics of a target class in comparison to the characteristics of one or more other classes. • The target and contrasting classes are specified by the user, and their data is retrieved from the database before the discrimination process starts. • As an example, a user might want to discriminate between the characteristics of the items in a store whose • sales have increased by 10% over a certain period of time this year • sales have increased by 10% over the same period of time last year
Association Rule Mining • The process of discovering association rules among attribute values that exist in a given set of data. • E.g. market basket research (MBR), where users are usually interested in mining associations between items in a store by using daily transactions. • An example of a rule might be diapers ⇒ beer, meaning that customers buying diapers are very likely to buy beer. • This gives us a good reason to place diapers next to beer so as to increase sales • Sometimes people wonder about the strange placement of products in large stores • E.g. maternity products next to infant products
Classification • The process of using a set of training data with known class labels to come up with a model (or function) that predicts the unknown class label of new samples. • An example of classification can be found in the banking industry • Customer characteristics like age, annual income, marital status, etc. are used to predict the possibility of approving loan applications (the loan status is the class label). • In an initial step, a dataset containing a certain number of customers with known class labels is used to create a classifier (e.g. an ANN), which can then be used to predict the class label of a new application • Classification is very similar to regression, except that the latter applies to numerical data while the former applies to categorical as well as numerical data.
Clustering • This is the process of grouping data objects into clusters such that • intra-cluster similarity is maximized • inter-cluster similarity is minimized. • In other words, objects within the same cluster are very similar and objects in different clusters are not. • E.g. studying collective properties of people at different income levels • Cluster people based on income • Study common properties within clusters • E.g. lower income co-occurring with lower education
Outlier detection • Through clustering, we can find groups of objects that behave similarly • Sometimes, we are only interested in those objects that lie scattered around without following any pattern existing in the data. • Those objects are known as outliers, as they do not adhere to the patterns defined by the rest of the objects in the dataset. • Outlier detection is usually desired in applications where abnormal behavior is • of interest, such as intrusion detection in networks or terrorist detection at ports of entry • not of interest, such as when we remove noise from a dataset
Similarity searches • Given a database of objects and a "query" object, • find all similar objects (neighbours) • E.g. Google search • Given a query, which is a small document • Find all similar documents • Rank-order them
Final Notes on Data Mining • Forms the center of a set of research fields and applications dealing with data analysis: • databases, statistics, machine learning, artificial intelligence, information sciences/technology, and the like • At the same time, it introduces new challenges that make it a science of its own • E.g. scalability to large datasets
Not all types of patterns mined by data mining systems are interesting. • Subjective and objective interestingness measures.
Market Basket Research • We will mainly use the Market Basket Research (MBR) application in our ARM description • A large set of items, e.g. products sold in a supermarket. • A large set of transactions or baskets, each of which contains a small set of the items (called an itemset) bought by a customer during a single visit to a store.
The Set Model • Data is organized as a "TRANSACTION TABLE" with 2 attributes: TT(Tid, Itemset)

TID  Itemset
 1   a b c
 2   a b d
 3   a b e
 4   a c d
 5   a c e
 6   a d e
 7   b c d
 8   b c e
 9   b d e
10   c d e

• A transaction is a customer transaction at a cash register. • Each customer is given an identifier, Tid, for every transaction made • Itemset is the set of items in the customer's "basket". • Note that tuples in TT are not "flat" (each itemset is a "set") • i.e. not relational (why?) • A transformation can be made to equivalent but normalized models
The Normalized Set Model • Data is organized as a "NORMALIZED TRANSACTION TABLE" with 2 attributes: NTT(Tid, Iid) • An itemset is the group of items belonging to the same transaction • TT(Tid, Itemset) can be "transformed" to NTT(Tid, Iid) and vice versa • Could be stored in a relational database • Very deep: one tuple per item per transaction (a single basket of 10 to 30 items yields 10 to 30 tuples)
The Boolean Model: "Boolean Transaction Table" • BTT(Tid, Item-1, Item-2, ..., Item-n) • Tid is a transaction identifier • Each column is a particular item (1 column for each item) • a 1 if the item is in the basket • a 0 if the item is not in the basket • TT, NTT and BTT are equivalent (see the conversion sketch below) • This is the model most often chosen for ARM
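As a concrete illustration of the equivalence, here is a small Python sketch converting a TT-style table into the NTT and BTT layouts; the helper names (to_ntt, to_btt) are ours, not from any library.

    # TT(Tid, Itemset): two baskets in the style of the running example
    tt = {1: {"a", "b", "c"}, 2: {"a", "b", "d"}}

    def to_ntt(tt):
        """Flatten TT into NTT(Tid, Iid): one row per (transaction, item)."""
        return [(tid, item)
                for tid, itemset in tt.items()
                for item in sorted(itemset)]

    def to_btt(tt, items):
        """Widen TT into BTT(Tid, Item-1 .. Item-n): one 0/1 column per item."""
        return {tid: [1 if i in itemset else 0 for i in items]
                for tid, itemset in tt.items()}

    print(to_ntt(tt))                             # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ...]
    print(to_btt(tt, ["a", "b", "c", "d", "e"]))  # {1: [1, 1, 1, 0, 0], 2: [1, 1, 0, 1, 0]}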
Association Rule Mining • Association Rule Mining (ARM) finds interesting associations and/or correlation relationships among large sets of data items. • Association rules provide information in the form of "if-then" statements. • These rules are • computed from the data • unlike the if-then rules of logic, association rules are probabilistic in nature • their strength can be measured
An association rule defines a relationship of the form: • A ⇒ C (if A then C) • Read as "A implies C", where A and C are sets of items in a data set. • A is called the antecedent and C the consequent • Given a database DB, ARM finds all such association rules (ARs)
D = A data set comprising n records (transactions) and m Boolean-valued attributes (the BTT model) • I = The set of m attributes, {i1, i2, ..., im}, represented in D. • Itemset = Some subset of I. Each record in D is an itemset • For all rules A ⇒ C: A ⊆ I, C ⊆ I, and A ∩ C = ∅ (A and C are disjoint).
An Example DB

TID  Itemset
 1   a b c
 2   a b d
 3   a b e
 4   a c d
 5   a c e
 6   a d e
 7   b c d
 8   b c e
 9   b d e
10   c d e

• Items = 5 • I = {a, b, c, d, e} • Transactions = 10 • D = {{a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}}
Support of an Itemset • The support of an itemset IS is the number of transactions in D containing all items in IS (e.g. in the example DB, the support of IS = {a, b} is 3: transactions 1, 2 and 3) • Given a support threshold s, sets of items that appear in at least s transactions are called frequent itemsets • The process is called frequent itemset mining
Items={m=milk, c=cheese, p=pepsi, b=bread, j=juice}. • Support threshold = 3 transactions. T1 = {m, c, b} T2 = {m, p, j} T3 = {m, b} T4 = {c, j} T5 = {m, p, b} T6 = {m, c, b, j} T7 = {c, b, j} T8 = {b, c} • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
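At this toy scale, the frequent itemsets above can be verified by brute force; a minimal Python sketch (the baskets are copied from the slide, the code itself is ours):

    from itertools import combinations

    baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
               {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
    items = sorted(set().union(*baskets))
    min_support = 3

    # Count every 1- and 2-itemset against every basket; the point of
    # Apriori (below) is precisely that this does not scale to large m.
    for k in (1, 2):
        for cand in combinations(items, k):
            support = sum(1 for b in baskets if set(cand) <= b)
            if support >= min_support:
                print(set(cand), support)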
Support and Confidence of a Rule A ⇒ C • The support of an itemset IS is the number of transactions containing all items in IS • Itemsets are used to derive rules • The support of a rule R: A ⇒ C is the number of transactions in D containing all items in A ∪ C • Frequent rule • Significance of a rule • The confidence of a rule is Support(R) / Support(A) • Confident rule • Strength of a rule • Out of those transactions containing A, how many also contain C • Frequent + Confident ⇒ Strong
Example B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} • An association rule: {m, b} ⇒ c. • What is the confidence? • support(m, b, c) = 2 • support(m, b) = 4 • Confidence = 2/4 = 50% • And so what does that mean? • 50% of the baskets that contain {m, b} also contain c
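The same computation in a minimal Python sketch, straight from the definition conf(A ⇒ C) = support(A ∪ C) / support(A), using the baskets B1..B8 above:

    baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
               {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]

    def support(itemset):
        """Number of baskets containing every item in the itemset."""
        return sum(1 for b in baskets if itemset <= b)

    A, C = {"m", "b"}, {"c"}
    print(support(A | C))               # 2
    print(support(A))                   # 4
    print(support(A | C) / support(A))  # 0.5, i.e. 50% confidence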
More On The Problem Definition • ARM is a two-step process: • 1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a pre-determined minimum support threshold • 2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy the minimum support and minimum confidence thresholds • A typical question: "find all strong association rules with support ≥ s and confidence ≥ c" • Given a database D • Find all frequent itemsets (F) using s • Produce all strong association rules using c
Finding F is the most computationally expensive part; once we have the frequent itemsets, generating ARs is straightforward
The Anti-Monotonicity (Downward Closure) of Support • Naïve approach: generate all subsets of I and test each • The number of potential itemsets is 2^m • If m = 5, #potential itemsets = 32 • If m = 20, #potential itemsets = 1,048,576 • Imagine what supermarkets have: m = 10,000? • Conclusion? The naïve approach is infeasible • Breakthrough: if an itemset A has support at least s, then all its subsets also have support at least s (e.g. if {a, b} is frequent, then {a} and {b} must be too) • Alternatively: if an itemset A is not frequent, then none of its supersets can be frequent • Proposed by Agrawal (1993) of the IBM Almaden Research Center; it started ARM and the field of data mining
Apriori • Proposed by Agrawal • Apriori • Uses the downward closure of support to reduce the number of itemsets that need to be counted (called candidate frequent itemsets, C) • Works on a level-by-level basis (i.e. uses the frequent itemsets L from the previous level to generate the frequent itemsets at this level) • Ck and Lk • At every level k, generates Ck from Lk-1 and counts the candidates' frequency in the database
Two steps are performed to generate Ck • Join step: Ck is generated by joining Lk-1 with itself • Prune step: all itemsets in Ck whose (k-1)-subsets are not ALL frequent (i.e. present in Lk-1) are removed • How many subsets does an itemset of size k have? • 2^k • E.g. k = 3 gives 2^3 = 8 • How many subsets of size k-1 does an itemset of size k have? • k
The Apriori Algorithm • Pseudo-code:

    Ck: candidate frequent itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        remove from Ck+1 any itemset that has at least one infrequent k-subset;
        for each transaction t in the database do
            increment the counts of all candidates in Ck+1 that are contained in t;
            (i.e. count the frequency of each itemset in Ck+1)
        Lk+1 = candidates in Ck+1 with support ≥ min_support;
    end
    return ∪k Lk;
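A minimal runnable version of the pseudo-code in Python; the function names (apriori, apriori_gen) are ours, and the code favors clarity over efficiency, scanning the transaction list once per level exactly as the algorithm prescribes.

    from itertools import combinations

    def apriori_gen(prev_frequent, k):
        """Generate size-k candidates: join frequent (k-1)-itemsets that
        differ only in their last item, then prune any candidate with an
        infrequent (k-1)-subset."""
        prev = sorted(prev_frequent)               # itemsets as sorted tuples
        candidates = []
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:   # join step
                    cand = prev[i] + (prev[j][-1],)
                    if all(sub in prev_frequent    # prune step
                           for sub in combinations(cand, k - 1)):
                        candidates.append(cand)
        return candidates

    def apriori(transactions, min_support):
        """Return every frequent itemset (as a sorted tuple) with its support."""
        items = sorted(set().union(*transactions))
        frequent = {}                              # itemset -> support count
        level = []
        for item in items:                         # build L1
            count = sum(1 for t in transactions if item in t)
            if count >= min_support:
                level.append((item,))
                frequent[(item,)] = count
        k = 2
        while level:
            counts = {c: 0 for c in apriori_gen(set(level), k)}
            for t in transactions:                 # one scan per level
                for cand in counts:
                    if set(cand) <= t:
                        counts[cand] += 1
            level = [c for c, n in counts.items() if n >= min_support]
            frequent.update((c, counts[c]) for c in level)
            k += 1
        return frequent

On the 10-transaction example DB above with min_support = 2, this returns every 1-itemset (support 6) and every 2-itemset (support 3) but no 3-itemsets, since each triple appears exactly once.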
Example of Generating Candidates • Suppose the items in all itemsets are listed in some order • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3 * L3 • Combine any two itemsets in Lk if they differ only in the last item • abcd from abc and abd • acde from acd and ace • C4 = {abcd, acde} • Pruning: • abcd: its 3-subsets abc, abd, acd, bcd are all in L3, so it is kept • acde: its 3-subsets are acd, ace, ade, cde; ade and cde are not in L3, so acde is pruned • C4 = {abcd}
How To Generate Candidates? Lk → Ck+1 • Step 1: self-joining Lk

    insert into Ck+1
    select p.item1, p.item2, ..., p.itemk, q.itemk
    from Lk p, Lk q
    where p.item1 = q.item1 and ... and p.itemk-1 = q.itemk-1 and p.itemk < q.itemk

• Step 2: pruning

    forall itemsets c in Ck+1 do
        forall k-subsets s of c do
            if (s is not in Lk) then delete c from Ck+1
[Figure: a worked Apriori trace on an example database D with support threshold 2: scan D to count C1 and filter it to L1; join and prune to get C2; scan D and filter to L2; join and prune to get C3; scan D and filter to L3]
Generation of Association Rules • Given all frequent itemsets • Every frequent itemset I of size ≥ 2 is divided into a candidate head Y and a body X • such that X ∩ Y = ∅ • This process starts with Y = {}, resulting in the rule I ⇒ {} • which always holds with 100% confidence (why?) • After that, the algorithm iteratively generates candidate heads of size k + 1, starting with k = 0
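A minimal sketch of rule generation in Python, assuming the itemset-to-support dictionary returned by the apriori sketch above; for simplicity it enumerates every non-empty head directly rather than growing heads level by level as the slide describes, which is fine at small scale.

    from itertools import combinations

    def generate_rules(frequent, min_confidence):
        """Return strong rules (body, head, confidence) from frequent itemsets."""
        rules = []
        for itemset, support in frequent.items():
            if len(itemset) < 2:                    # need a non-empty body and head
                continue
            for head_size in range(1, len(itemset)):
                for head in combinations(itemset, head_size):
                    body = tuple(i for i in itemset if i not in head)
                    # Anti-monotonicity guarantees the body is also frequent,
                    # so its support is already in the dictionary.
                    confidence = support / frequent[body]
                    if confidence >= min_confidence:
                        rules.append((body, head, confidence))
        return rules

For example, generate_rules(apriori(transactions, 2), 0.5) lists every rule whose confidence is at least 50%.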
Is Apriori Fast Enough? Performance Bottlenecks • The core of the Apriori algorithm: • Uses frequent (k-1)-itemsets to generate candidate frequent k-itemsets • Uses database scans to collect the candidates' counts: 1 scan per level • The bottleneck of Apriori: candidate generation • Huge candidate sets: • 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets • To discover a frequent pattern of size 100, e.g. {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates • Multiple scans of the database: • Needs n scans, where n is the length of the longest pattern • One scan per level