Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Modified from slides by: Dan Li Presenter: Jimmy Jiang Discussion Lead: Leo Li
Origin of Problem • Basket data: opens the possibility of customized, data-driven strategies for retail companies • Simple example: beer and diapers • A famous case of market basket analysis • Customers who buy diapers also tend to buy beer
Usage of Data Mining (general) • Clustering, predictive modeling, dependency modeling, data summarization, change and deviation detection • Association rules fall under dependency modeling • Applications: market basket analysis, direct/interactive marketing, fraud detection, science, sports statistics, etc. Some sources from: http://redbook.cs.berkeley.edu/redbook3/lec29.html
Association Rule • Formal definition: an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ • Transaction T: a set of items such that T ⊆ I, where I is the total set of items • Confidence c: c% of transactions that contain X also contain Y • Support s: s% of all transactions contain X ∪ Y • Example: buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]
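A minimal sketch of these definitions over a toy transaction set (the item names and values are illustrative, not from the paper):

```python
# Toy illustration of support and confidence for a rule X => Y.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.5
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.666...
```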
Mining Association Rules • The problem of finding association rules falls within the purview of database mining. • E.g. 1: find the proportion (confidence) of transactions purchasing diapers that also purchase beer. • E.g. 2: find the proportion (support) of all transactions that purchase both diapers and beer. Finding association rules is valuable for • Cross-marketing • Catalog design • Add-on sales • Store layout, and so on
Problem Decomposition • Find all sets of items (itemsets) that have transaction support above minimum support: Apriori and AprioriTid • Use the large itemsets to generate the desired rules: not discussed
Discovering Large Itemsets Intuition: any subset of a large itemset must be large. Algorithms for discovering large itemsets make multiple passes over the data. • In the first pass, determine which individual items are large. • In each subsequent pass: • Previous large itemsets (seed sets) are used to generate candidate itemsets. • Count the actual support for the candidate itemsets. • Determine which candidates are really large, and pass them on as the seed for the next pass. • This process continues until no new large itemsets are found (a minimal sketch of the loop follows).
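A minimal Python sketch of this level-wise loop, assuming transactions is a list of item sets; apriori_gen is the join-and-prune candidate generator sketched after the next slide:

```python
from collections import defaultdict

def apriori(transactions, min_support):
    """Level-wise sketch of the paper's main loop (not the exact
    implementation). Itemsets are frozensets; min_support is a
    fraction in [0, 1]."""
    n = len(transactions)
    # Pass 1: count individual items to find the large 1-itemsets.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    large = {s for s, c in counts.items() if c / n >= min_support}
    all_large = set(large)
    # Subsequent passes: the previous large itemsets seed the candidates.
    while large:
        candidates = apriori_gen(large)   # join + prune, sketched below
        counts = defaultdict(int)
        for t in transactions:
            for cand in candidates:
                if cand <= t:             # subset test (hash tree in paper)
                    counts[cand] += 1
        large = {s for s, c in counts.items() if c / n >= min_support}
        all_large |= large
    return all_large
```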
Itemset Generation • The apriori-gen function has two steps: • Join • Prune (both sketched below) • Efficiently implementing the subset function: the hash tree
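A hedged sketch of apriori-gen's join and prune steps. For simplicity, the join here unions all pairs of (k-1)-itemsets that differ by one item rather than using the paper's sorted (k-2)-prefix match; once the prune step has run, the resulting candidate set is the same. The hash tree used to speed up the subset test is not shown:

```python
def apriori_gen(large_prev):
    """Candidate k-itemsets from the large (k-1)-itemsets."""
    prev = list(large_prev)
    if not prev:
        return set()
    k = len(prev[0]) + 1
    candidates = set()
    # Join step: combine pairs of (k-1)-itemsets into k-itemsets.
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step: drop any candidate with a (k-1)-subset that is not large.
    prev_set = set(prev)
    return {c for c in candidates
            if all(c - {item} in prev_set for item in c)}
```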
AprioriTid • Candidate generation is the same as in Apriori, but the database is used for counting support only in the first pass. • More memory is needed: a set kept in memory stores, for each transaction, the candidate itemsets it contains (sketched below).
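A rough sketch of the idea behind the transformed set C̄_k: each transaction ID maps to the candidates it contains, so later passes never rescan the raw database. The function name and the all-subsets membership test are illustrative; the paper tests membership via the two candidates that generated each k-itemset, which is equivalent for pruned candidates:

```python
def count_with_tid_lists(cbar_prev, candidates):
    """AprioriTid-style counting sketch. cbar_prev maps each transaction
    ID to the set of (k-1)-itemset candidates it contained; candidates
    is the set of candidate k-itemsets. Returns the support counts and
    the transformed set for the next pass."""
    counts = {c: 0 for c in candidates}
    cbar_next = {}
    for tid, prev_sets in cbar_prev.items():
        present = set()
        for cand in candidates:
            # cand is contained in the transaction iff its (k-1)-subsets
            # were present in the previous pass's entry.
            if all(cand - {item} in prev_sets for item in cand):
                counts[cand] += 1
                present.add(cand)
        if present:                   # transactions containing no candidate
            cbar_next[tid] = present  # drop out of all later passes
    return counts, cbar_next
```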
Performance • Average size of transactions: 5~20 • Average size of maximal potentially large itemsets: 2~6 • Dataset sizes: 2.4~8.4 MB (on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, running AIX 3.2; the data resided in the AIX file system on a 2 GB SCSI 3.5" drive with measured sequential throughput of about 2 MB/second)
Discussion: • In 1994, the authors noted that the datasets used in their scale-up experiments were not even 1 GB. • Nowadays this no longer seems large enough for scale-up (as in previous discussions, Hive and Dremel can handle TB- and even PB-scale data). • Do you think the Apriori algorithm is still suitable today? Are there limitations or issues exposed by the increased data size?
Improving Apriori's efficiency • AprioriHybrid: • Use Apriori in the initial passes; • Heuristically estimate the size of C̄_k; • Switch to AprioriTid when C̄_k is expected to fit in memory (a sketch of the switch test follows)
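A hedged sketch of the switch heuristic: the size of C̄_k is estimated from the candidate support counts plus one entry per transaction, and the switch happens once that estimate would fit in memory. The per-entry byte cost below is an illustrative assumption, not a figure from the paper:

```python
def should_switch_to_aprioritid(candidate_supports, num_transactions,
                                memory_bytes, entry_bytes=8):
    """Estimate the size of the transformed set C̄_k and compare it
    against available memory. entry_bytes is a made-up per-entry cost."""
    estimated_entries = sum(candidate_supports.values()) + num_transactions
    return estimated_entries * entry_bytes <= memory_bytes
```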
Later Work • Parallel versions • Quantitative association rules, e.g., “10% of married people between age 50 and 60 have at least 2 cars.” • Online association rules
Conclusion • Two new algorithms, Apriori and AprioriTid, are discussed. • These algorithms outperform AIS and SETM. • Apriori and AprioriTid can be combined into AprioriHybrid. • AprioriHybrid matches the better of Apriori and AprioriTid.
Discussion • This paper has spun off more similar algorithms in the database world than any other data mining algorithm. • Why do you think this is the case? Is it the algorithm? The problem? The approach? Or something else?