Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Modified from slides by: Dan Li Presenter: Jimmy Jiang Discussion Lead: Leo Li
Origin of Problem • Basket data: opens the possibility of customized, data-driven strategies for retail companies • Simple example: beer and diapers • A famous case of market basket analysis • Customers who buy diapers also tend to buy beer
Usage of Data Mining (general) • Clustering, predictive modeling, dependency modeling, data summarization, change and deviation detection • Association rules fall under dependency modeling • Applications: market basket analysis, direct/interactive marketing, fraud detection, science, sports statistics, etc. Some sources from: http://redbook.cs.berkeley.edu/redbook3/lec29.html
Association Rule • Formal definition: an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ • Transaction T: a set of items such that T ⊆ I, where I is the total set of items • Confidence c: c% of transactions that contain X also contain Y • Support s: s% of all transactions contain X ∪ Y • Example: buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]
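A minimal sketch of these definitions over a toy transaction set (the item names and values are illustrative, not from the paper):

```python
# Toy illustration of support and confidence for a rule X => Y.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.5
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.666...
```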
Mining Association Rules • The problem of finding association rules falls within the purview of database mining. • E.g. 1: find the proportion (confidence) of transactions purchasing diapers that also purchase beer. • E.g. 2: find the proportion (support) of all transactions that purchase both diapers and beer. Finding association rules is valuable for • Cross-marketing • Catalog design • Add-on sales • Store layout, and so on
Problem Decomposition • Find all sets of items (itemsets) that have transaction support above minimum support: Apriori and AprioriTid • Use the large itemsets to generate the desired rules: not discussed
Discovering Large Itemsets Intuition: any subset of a large itemset must be large. Algorithms for discovering large itemsets make multiple passes over the data. • In the first pass, determine which individual items are large. • In each subsequent pass: • Previous large itemsets (seed sets) are used to generate candidate itemsets. • Count the actual support for the candidate itemsets. • Determine which candidates are really large, and pass them on as the seed for the next pass. • This process continues until no new large itemsets are found (a minimal sketch of the loop follows).
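A minimal Python sketch of this level-wise loop, assuming transactions is a list of item sets; apriori_gen is the join-and-prune candidate generator sketched after the next slide:

```python
from collections import defaultdict

def apriori(transactions, min_support):
    """Level-wise sketch of the paper's main loop (not the exact
    implementation). Itemsets are frozensets; min_support is a
    fraction in [0, 1]."""
    n = len(transactions)
    # Pass 1: count individual items to find the large 1-itemsets.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    large = {s for s, c in counts.items() if c / n >= min_support}
    all_large = set(large)
    # Subsequent passes: the previous large itemsets seed the candidates.
    while large:
        candidates = apriori_gen(large)   # join + prune, sketched below
        counts = defaultdict(int)
        for t in transactions:
            for cand in candidates:
                if cand <= t:             # subset test (hash tree in paper)
                    counts[cand] += 1
        large = {s for s, c in counts.items() if c / n >= min_support}
        all_large |= large
    return all_large
```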
Itemset Generation • The apriori-gen function has two steps: • Join • Prune (both sketched below) • Efficiently implementing the subset function: the hash tree
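A hedged sketch of apriori-gen's join and prune steps. For simplicity, the join here unions all pairs of (k-1)-itemsets that differ by one item rather than using the paper's sorted (k-2)-prefix match; once the prune step has run, the resulting candidate set is the same. The hash tree used to speed up the subset test is not shown:

```python
def apriori_gen(large_prev):
    """Candidate k-itemsets from the large (k-1)-itemsets."""
    prev = list(large_prev)
    if not prev:
        return set()
    k = len(prev[0]) + 1
    candidates = set()
    # Join step: combine pairs of (k-1)-itemsets into k-itemsets.
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step: drop any candidate with a (k-1)-subset that is not large.
    prev_set = set(prev)
    return {c for c in candidates
            if all(c - {item} in prev_set for item in c)}
```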
AprioriTid • Candidate generation is the same as in Apriori, but the database is used for counting support only in the first pass. • More memory is needed: a set kept in memory stores, for each transaction, the candidate itemsets it contains (sketched below).
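A rough sketch of the idea behind the transformed set C̄_k: each transaction ID maps to the candidates it contains, so later passes never rescan the raw database. The function name and the all-subsets membership test are illustrative; the paper tests membership via the two candidates that generated each k-itemset, which is equivalent for pruned candidates:

```python
def count_with_tid_lists(cbar_prev, candidates):
    """AprioriTid-style counting sketch. cbar_prev maps each transaction
    ID to the set of (k-1)-itemset candidates it contained; candidates
    is the set of candidate k-itemsets. Returns the support counts and
    the transformed set for the next pass."""
    counts = {c: 0 for c in candidates}
    cbar_next = {}
    for tid, prev_sets in cbar_prev.items():
        present = set()
        for cand in candidates:
            # cand is contained in the transaction iff its (k-1)-subsets
            # were present in the previous pass's entry.
            if all(cand - {item} in prev_sets for item in cand):
                counts[cand] += 1
                present.add(cand)
        if present:                   # transactions containing no candidate
            cbar_next[tid] = present  # drop out of all later passes
    return counts, cbar_next
```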
Performance • Average size of transactions: 5~20 • Average size of maximal potentially large itemsets: 2~6 • Dataset sizes: 2.4~8.4 MB (on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, running AIX 3.2; the data resided in the AIX file system on a 2 GB SCSI 3.5" drive with measured sequential throughput of about 2 MB/second)
Discussion: • In 1994, the authors noted that the datasets used in their scale-up experiments were not even 1 GB. • Nowadays this no longer seems large enough for scale-up (as in previous discussions, Hive and Dremel can handle TB- and even PB-scale data). • Do you think the Apriori algorithm is still suitable today? Are there limitations or issues exposed by the increased data size?
Improving Apriori's efficiency • AprioriHybrid: • Use Apriori in the initial passes; • Heuristically estimate the size of C̄_k; • Switch to AprioriTid when C̄_k is expected to fit in memory (a sketch of the switch test follows)
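A hedged sketch of the switch heuristic: the size of C̄_k is estimated from the candidate support counts plus one entry per transaction, and the switch happens once that estimate would fit in memory. The per-entry byte cost below is an illustrative assumption, not a figure from the paper:

```python
def should_switch_to_aprioritid(candidate_supports, num_transactions,
                                memory_bytes, entry_bytes=8):
    """Estimate the size of the transformed set C̄_k and compare it
    against available memory. entry_bytes is a made-up per-entry cost."""
    estimated_entries = sum(candidate_supports.values()) + num_transactions
    return estimated_entries * entry_bytes <= memory_bytes
```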
Later Work • Parallel versions • Quantitative association rules, e.g., “10% of married people between age 50 and 60 have at least 2 cars.” • Online association rules
Conclusion • Two new algorithms, Apriori and AprioriTid, are discussed. • These algorithms outperform AIS and SETM. • Apriori and AprioriTid can be combined into AprioriHybrid. • AprioriHybrid matches the better of Apriori and AprioriTid.
Discussion • This paper has spun off more similar algorithms in the database world than any other data mining algorithm. • Why do you think this is the case? Is it the algorithm? The problem? The approach? Or something else?