
Fast Algorithms for Mining Association Rules



  1. Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Modified from slides by: Dan Li Presenter: Jimmy Jiang Discussion Lead: Leo Li

  2. Origin of Problem • Basket data opens new possibilities for customized, data-driven strategies for retail companies • Simple example: beer and diapers • Famous case for market basket analysis • If you buy diapers, you tend to buy beer

  3. Usage of Data Mining (general) • Clustering, predictive modeling, dependency modeling, data summarization, change and deviation detection • Association rules fall under dependency modeling • Applications: market basket analysis, direct/interactive marketing, fraud detection, science, sports stats, etc. Some sources from: http://redbook.cs.berkeley.edu/redbook3/lec29.html

  4. Association Rule • Formal Definition: an implication X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ • Transaction: a set of items T ⊆ I, where I is the total set of items • Confidence c: c% of transactions that contain X also contain Y • Support s: s% of all transactions contain X ∪ Y • Example • buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]
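The support and confidence definitions on this slide can be made concrete with a few lines of Python. This is an illustrative sketch on a made-up toy basket list (the item names are not from the paper):

```python
# Toy basket data; item names are illustrative only.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Fraction of ALL transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"diapers", "beer"}, transactions))        # 0.5 (2 of 4 baskets)
print(confidence({"diapers"}, {"beer"}, transactions))   # 2/3: two of the three diaper baskets
```

So the rule diapers ⇒ beer would be reported here with support 50% and confidence ≈67%.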

  5. Mining Association Rules • The problem of finding association rules falls within the purview of database mining. • Eg1: Find the proportion (confidence) of transactions purchasing diapers that also purchase beer. • Eg2: Find the proportion (support) of all transactions that purchase both diapers and beer. Finding association rules is valuable for • Cross-marketing • Catalog design • Add-on sales • Store layout and so on

  6. Problem Decomposition • Find all sets of items (itemsets) that have transaction support above minimum support: Apriori and AprioriTid • Use the large itemsets to generate the desired rules: not discussed

  7. Discovering Large Itemsets Intuition: any subset of a large itemset must be large. Algorithms for discovering large itemsets make multiple passes over the data. • In the first pass, determine which individual items are large. • In each subsequent pass: • Previous large itemsets (seed sets) are used to generate candidate itemsets. • Count actual support for the candidate itemsets. • Determine which candidates are actually large; these seed the next pass. • This process continues until no new large itemsets are found.
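The level-wise loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: candidate generation here is a naive join (the paper's apriori-gen also prunes), and `min_support` is an absolute count rather than a percentage.

```python
def apriori(transactions, min_support):
    """Level-wise search for large itemsets (simplified sketch).

    `transactions` is a list of frozensets; `min_support` is an absolute count.
    """
    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s for s, c in counts.items() if c >= min_support}
    all_large = set(large)

    k = 2
    while large:
        # Seed candidates from the previous pass's large itemsets (naive join).
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        counts = {c: 0 for c in candidates}
        for t in transactions:          # one full pass over the data
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c, n in counts.items() if n >= min_support}
        all_large |= large              # keep the large itemsets found this pass
        k += 1
    return all_large
```

The loop terminates exactly as the slide says: once a pass produces no new large itemsets, `large` is empty and the search stops.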

  8. Algorithm Apriori

  9. Itemset Generation • The apriori-gen function has two steps • Join • Prune • Efficiently Implementing subset functions • The hash tree
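The two apriori-gen steps named on the slide can be sketched directly. This is an illustrative Python version (the paper expresses the join as SQL over a sorted relation, and uses a hash tree for the subset test; neither is reproduced here):

```python
from itertools import combinations

def apriori_gen(large_prev, k):
    """Sketch of apriori-gen's join and prune steps.

    `large_prev` is the set of large (k-1)-itemsets, each a frozenset.
    """
    # Join step: merge two (k-1)-itemsets whose union has exactly k items,
    # i.e. they agree on k-2 items.
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step: by the "any subset of a large itemset must be large"
    # intuition, drop candidates with a (k-1)-subset that is not large.
    return {
        c for c in candidates
        if all(frozenset(s) in large_prev for s in combinations(c, k - 1))
    }
```

For example, from large 2-itemsets {1,2}, {1,3}, {2,3}, {1,4}, the join produces {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and the prune step keeps only {1,2,3}, since the others each contain a 2-subset that is not large.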

  10. Example for Apriori

  11. Example for Apriori (cont’)

  12. AprioriTid • Candidate generation is the same as Apriori, but the DB is used for counting support only in the first pass. • More memory needed: a set of the candidate itemsets contained in each transaction is kept in memory
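The counting trick can be sketched as follows. This is a simplified illustration, not the paper's exact data layout (the paper keeps a structure of candidate IDs per transaction and tests only the two generating (k-1)-itemsets of each candidate; here we test all (k-1)-subsets, which is equivalent after pruning):

```python
from itertools import combinations

def aprioritid_pass(tidlists_prev, candidates, k):
    """One AprioriTid counting pass (illustrative sketch).

    `tidlists_prev` maps each transaction ID to the set of (k-1)-candidates
    that transaction contained in the previous pass; the raw database is
    never re-read. Returns support counts and the structure for pass k+1.
    """
    counts = {c: 0 for c in candidates}
    tidlists = {}
    for tid, prev in tidlists_prev.items():
        # A transaction contains a k-candidate iff it contained all of the
        # candidate's (k-1)-subsets in the previous pass.
        contained = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        for c in contained:
            counts[c] += 1
        if contained:                # transactions with no candidates drop out
            tidlists[tid] = contained
    return counts, tidlists
```

This is why AprioriTid speeds up in later passes: the per-transaction candidate sets shrink, and transactions containing no candidates disappear from the structure entirely.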

  13. Performance • Average transaction size: 5~20 items • Average size of maximal potentially large itemsets: 2~6 • Dataset sizes: 2.4~8.4 MB (on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, and running AIX 3.2. The data resided in the AIX file system and was stored on a 2 GB SCSI 3.5” drive, with measured sequential throughput of about 2 MB/second.)

  14. Results

  15. Smaller number of candidates

  16. Discussion: • In 1994, the authors mention that the datasets used in their scale-up experiments were not even as large as 1 GB. • Nowadays, this no longer seems sufficient for scale-up (as in previous discussions, Hive and Dremel can handle TB- and even PB-scale data). • Do you think the Apriori algorithm is still suitable today? Are there any limitations or issues exposed by the increased data size?

  17. Improving Apriori in efficiency • AprioriHybrid: • Use Apriori in the initial passes; • (Heuristic) Estimate the size of C̄k, the per-transaction candidate structure; • Switch to AprioriTid when C̄k is expected to fit in memory
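The switch heuristic can be sketched as follows. This is an illustrative simplification: the estimate below (sum of candidate supports plus one entry per transaction) follows the paper's idea of sizing C̄k from the counts already gathered in the current pass, but the function name and memory model are hypothetical.

```python
def should_switch(candidate_supports, num_transactions, memory_capacity):
    """Decide whether to switch from Apriori to AprioriTid (sketch).

    `candidate_supports` maps each candidate in Ck to its support count;
    `memory_capacity` is how many entries fit in memory (illustrative unit).
    C̄k holds roughly one entry per (transaction, contained candidate) pair,
    plus per-transaction overhead.
    """
    estimated_size = sum(candidate_supports.values()) + num_transactions
    return estimated_size < memory_capacity
```

The switch is also only worthwhile near the end of the passes, since later C̄k structures shrink while Apriori keeps re-scanning the full database.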

  18. AprioriHybrid

  19. Later Work • Parallel versions • Quantitative association rules, e.g., "10% of married people between age 50 and 60 have at least 2 cars." • Online association rules

  20. Conclusion • Two new algorithms, Apriori and AprioriTid, are discussed. • These algorithms outperform AIS and SETM. • Apriori and AprioriTid can be combined into AprioriHybrid. • AprioriHybrid matches whichever of Apriori and AprioriTid wins.

  21. Discussion • This paper has spun off more similar algorithms in the database world than any other data mining algorithm. • Why do you think this is the case? Is it the algorithm? The problem? The approach? Or something else?
