Facilitating Interactive Mining of Global and Local Association Rules

Abhishek Mukherji*, Elke A. Rundensteiner, Matthew O. Ward. Department of Computer Science, Worcester Polytechnic Institute, MA, USA. *Samsung Research America, CA, USA.

Presentation Transcript


  1. Facilitating Interactive Mining of Global and Local Association Rules. Abhishek Mukherji*, Elke A. Rundensteiner, Matthew O. Ward. Department of Computer Science, Worcester Polytechnic Institute, MA, USA. *Samsung Research America, CA, USA. XmdvTool is an open source multivariate visual analytics tool developed at WPI with a series of NSF grants over the past 20 years (http://sourceforge.net/projects/xmdvtool/). This PhD research work was partly supported by NSF under grants IIS-0812027, CCF-0811510 and IIS-1117139.

  2. Era of Big Data ... and we are DRIVING! Velocity, Volume, Variety, Veracity. References: 1. Where’s the Data in the Big Data Wave? Gerhard Weikum, Res. Director at Max Planck Institute, http://wp.sigmod.org/?p=786. 2. Analytic DB Technology for the Data Enthusiast. Pat Hanrahan, Stanford & Tableau, SIGMOD‘12 Keynote Talk.

  3. XmdvTool’s Efforts Towards This Paradigm Shift. Visualize Static Data; I. Visualize Stream & Sensor Data: SNIFTool & FireStream, ViStream*. Visualize Data Records; II. Visualize Mined Results: PARAS/FIRE, COLARM. *Di Yang et al., Interactive visual exploration of neighbor-based patterns in data streams, ACM SIGMOD’10 Demo.

  4. Summary of Graduate Research Works. I. Stream & Sensor Data Processing: 1. SNIFTool/FireStream: Discover Patterns in Live Stream [CIKM ’08, ICDE Demo ’07]; 2. JAQPOT: High Velocity Streams MJoin Exec. [BNCOD ’11]. II. Interactive Mining: • PARAS/FIRE [VLDB’13, SIGMOD’13, CIKM’13] • COLARM [EDBT’14]. III. Scalable Nugget-guided Hypothesis Testing: 1. SPHINX: Evidence-Hypotheses Explor. [CIKM’13]; 2. Iterative Multi-Evidence-Hypotheses Model. Systems: CAPE*, XMDVTool^. *http://davis.wpi.edu/dsrg/PROJECTS/CAPE/index.html ^http://davis.wpi.edu/xmdv/index.html

  5. PARAS/FIRE: Interactive Visual Support for Parameter Space-Driven Mining of Global Rules [PVLDB 2013, SIGMOD 2013, CIKM 2013] Joint work with Xika Lin, Christopher Ryan Botaish, Jason Whitehouse, Elke A. Rundensteiner, Matthew O. Ward Department of Computer Science, Worcester Polytechnic Institute (WPI), MA, USA.

  6. Association Rule Mining (ARM) Basics. Which customers to target for multi-car discount promos? • <Age: 30..39> and <Married: Yes> → <NumCars: 2> • Support = 40%, Confidence = 100%. R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94. R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, SIGMOD’96.
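The support and confidence figures on this slide can be reproduced with a small calculation. The sketch below is illustrative only: the five-record table and the names `records`, `antecedent`, `consequent` and `support_confidence` are assumptions, chosen so that the rule comes out at 40% support and 100% confidence. They are not taken from the slides.

```python
# Hypothetical toy table (an assumption, not data from the presentation).
records = [
    {"Age": 23, "Married": "No", "NumCars": 1},
    {"Age": 25, "Married": "Yes", "NumCars": 1},
    {"Age": 29, "Married": "No", "NumCars": 0},
    {"Age": 34, "Married": "Yes", "NumCars": 2},
    {"Age": 38, "Married": "Yes", "NumCars": 2},
]


def antecedent(row):
    """<Age: 30..39> and <Married: Yes>"""
    return 30 <= row["Age"] <= 39 and row["Married"] == "Yes"


def consequent(row):
    """<NumCars: 2>"""
    return row["NumCars"] == 2


def support_confidence(rows, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs over rows."""
    n_lhs = sum(1 for r in rows if lhs(r))
    n_both = sum(1 for r in rows if lhs(r) and rhs(r))
    support = n_both / len(rows)                   # fraction of all rows matching both sides
    confidence = n_both / n_lhs if n_lhs else 0.0  # fraction of lhs rows that also match rhs
    return support, confidence


support, confidence = support_confidence(records, antecedent, consequent)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Support counts the rows that satisfy both sides of the rule against the whole table, while confidence counts them only against the rows that satisfy the antecedent.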

  7. Motivation for Interactive Mining. [Figure: Data Analyst, (minsupp, minconf), Data Miner, {ARs}.] Limitations: • Unacceptably long response time. • Trial-and-error iterations. • Forced to rerun for each subset. Research Goals: • Improve turnaround times of mining queries. • Provide parameter recommendations. • Preprocess data to enable fast interactive mining experience. C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01. C. Hidber, Online Association Rule Mining, SIGMOD’99. B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99. M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03. M. Kaya and R. Alhajj, Online mining of fuzzy multidimensional weighted association rules, Applied Intelligence’08.

  8. The State-of-the-art in Online Rule Mining. I. Frequent Itemset Generation (Offline). II. Rule Generation (Online). [Figure: itemset lattice over {}, X, Y, Z, XY, XZ, YZ and XYZ, annotated with support counts from 100 at {} down to 10 at XYZ.] Assumptions: Cost(Freq. Itemset Generation) >> Cost(Rule Generation), Count(Itemsets) << Count(Rules). ^Cost per GB of RAM: $1000 (in 2000) → $25 (in 2012). C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01. M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03. M. Kaya and R. Alhajj, Online mining of fuzzy multidimensional weighted association rules, Applied Intelligence’08. ^ http://www.jcmit.com/memoryprice.htm
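The two-phase split on this slide (itemset counts computed offline, rules derived online per query) can be sketched as follows. This is a minimal illustration, not the algorithm of any cited paper. The support counts in `COUNTS` are assumptions loosely based on the lattice figure, and in particular the pairing of counts to the 2-itemsets is a guess. The names `COUNTS`, `N` and `generate_rules` are hypothetical.

```python
from itertools import combinations

N = 100  # total number of transactions (the count at the empty itemset)

# Assumed output of the offline phase: support count per frequent itemset.
COUNTS = {
    frozenset("X"): 80, frozenset("Y"): 60, frozenset("Z"): 40,
    frozenset("XY"): 40, frozenset("XZ"): 20, frozenset("YZ"): 20,
    frozenset("XYZ"): 10,
}


def generate_rules(counts, n, minsupp, minconf):
    """Online phase: derive rules A -> B from precomputed itemset counts."""
    rules = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) < 2 or support < minsupp:
            continue
        # Every non-empty proper subset of the itemset is a candidate antecedent.
        for size in range(1, len(itemset)):
            for items in combinations(sorted(itemset), size):
                lhs = frozenset(items)
                confidence = count / counts[lhs]
                if confidence >= minconf:
                    rhs = itemset - lhs
                    rules.append(("".join(sorted(lhs)), "".join(sorted(rhs)),
                                  support, confidence))
    return rules


for lhs, rhs, supp, conf in generate_rules(COUNTS, N, minsupp=0.4, minconf=0.5):
    print(f"{lhs} -> {rhs}  support={supp:.2f} confidence={conf:.2f}")
```

Even in this toy lattice the 7 itemsets expand into 12 candidate rules, which is consistent with the slide's assumption that Count(Itemsets) << Count(Rules).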

  9. Adjacency Lattice and Redundancy. [Figure: itemset lattice over {}, X, Y, Z, XY, XZ, YZ and XYZ, annotated with support counts.] • Strict Redundancy [(AUC)XYZ ⊃ (AUC)XY] • Simple Redundancy [(AUC) = XYZ, (A)XYZ (A)XYZ] ∩ • Starting with maximal ancestors of XYZ, i.e., X, Y and Z. • If XYZ qualifies, then skip XY and XZ as antecedent (simple) or consequent (strict). C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.

  10. Research Challenges. PARAS: Preprocessing/Computational aspects: 1. Instead of the itemsets, can we pre-store association rules to altogether avoid the online rule generation step? 2. Instead of the itemset index, can we have a direct look-up using (minsupp, minconf)? 3. How to handle Redundancy Relationships in the context of the parameterized Index? FIRE: Visualization aspects: 1. How should we visually present these mining results to the users? 2. How can we leverage these results to support interactive rule exploration? 3. Can we utilize some data visualization techniques to help users better understand these mined results?

  11. Parameter Space Model (PARAS). [Figure: the X, Y, Z itemset lattice next to a parameter space with Support and Confidence axes (ticks at 0.2, 0.5, 0.8, 1). Rules are placed at their (support, confidence) locations, e.g. Y→X at (0.4, 0.67). Stable Regions S1 = S(0.4, 0.67) and S2 = S(0.4, 0.5) are marked.]
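The idea of looking rules up directly by (minsupp, minconf) can be illustrated with a toy index. This is a simplified sketch, not the PARAS index itself: it assumes that each precomputed rule sits at one (support, confidence) point and that a query returns every rule whose point is at or above both thresholds. The rule-to-point pairings in `RULES` and the names `build_index`, `stable_region` and `query` are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

# Assumed (support, confidence) locations of precomputed rules.
RULES = {
    "Y -> X": (0.4, 0.67),
    "X -> Y": (0.4, 0.5),
    "Z -> X": (0.2, 0.5),
    "Z -> Y": (0.2, 0.5),
    "XZ -> Y": (0.1, 0.5),
    "YZ -> X": (0.1, 0.5),
    "X -> YZ": (0.1, 0.125),
}


def build_index(rules):
    """Offline step: group rules by point and collect the distinct axis values."""
    by_point = defaultdict(list)
    for rule, point in rules.items():
        by_point[point].append(rule)
    supports = sorted({s for s, _ in by_point})
    confidences = sorted({c for _, c in by_point})
    return by_point, supports, confidences


def snap(values, threshold):
    """Smallest stored value >= threshold, or None if the threshold is too high."""
    i = bisect_left(values, threshold)
    return values[i] if i < len(values) else None


def stable_region(index, minsupp, minconf):
    """Key of the region containing the query; equal keys give equal answers."""
    _, supports, confidences = index
    return snap(supports, minsupp), snap(confidences, minconf)


def query(index, minsupp, minconf):
    """Online step: return the rules satisfying both thresholds."""
    by_point = index[0]
    s0, c0 = stable_region(index, minsupp, minconf)
    if s0 is None or c0 is None:
        return []
    return sorted(rule for (s, c), rules in by_point.items()
                  if s >= s0 and c >= c0 for rule in rules)


index = build_index(RULES)
print(stable_region(index, 0.3, 0.6), query(index, 0.3, 0.6))
```

In this simplification, any two queries that snap to the same key return the same rule set, which is the sense in which the output is stable inside a region. Changing a threshold only matters once it crosses a stored support or confidence value.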

  12. [Figure: parameter space index = Stable Regions {S} w/ Neighbors + Rules, with regions S1 and S2.] Further, re-examining the redundancy definitions, we observed certain properties that enabled us to optimize computation and storage of redundancy information with respect to the parameter space.* *Xika Lin was the major contributor to the redundancy results. PARAS: Parameter Space Framework for Online Association Mining, VLDB 2013.

  13. Framework for Interactive Rule Exploration (FIRE) Chess dataset Mushroom dataset

  14. All rules versus unique rules view

  15. Unique + non-redundant rules view

  16. Two-region Comparison

  17. Rule Glyph View. Lined Glyph*. Filled Glyph. MDS layout. Example item sets shown: {poisonous? = edible}, {gill-attachment = free, veil-type = partial, veil-color = white}. *M. O. Ward, A taxonomy of glyph placement strategies for multidimensional data visualization, Information Visualization 2002.

  18. PARAS: Experimental Evaluation. • Data sets • Synthetic*: IBM Quest Generator (T10I4D100k and T10I4D5000k). • Tx_Iy_Dz = x avg # of items per transaction, y × 1k total # of items, z transactions. • Real: Chess and Mushroom~, Webdocs`. * R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94. ~ A. Asuncion and D. Newman, UCI ML repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007. ` C. Lucchese, S. Orlando, R. Perego and F. Silvestri, Webdocs: a real-life huge transactional dataset. FIMI’04.

  19. PARAS: Experimental Evaluation. Tested Algorithms^. w/ redundancy resolution: Apriori_RR, Eclat_RR, FP-Growth_RR, AdjLattice_RR, PARAS_RR. w/o redundancy resolution: Apriori, Eclat, FP-Growth, PARAS. ^ C. Borgelt, Efficient apriori, eclat & fp-growth, http://www.borgelt.net.

  20. PARAS: Experimental Methodologies • Average Online Processing Times (w/ and w/o RR). • Varying minsupp, fixed minconf • Fixed minsupp, varying minconf • Offline Preprocessing Times (AdjLatticeRR vs. PARAS)

  21. 1. Average Online Processing Times (T5000k). [Plots: w/ RR and w/o RR.] For a large diversity of online queries, PARAS consistently outperforms the state-of-the-art competitors from the literature by 2 to 5 orders of magnitude over the tested datasets.

  22. 2. Pre-processing Times • Rule Generation • T5000k = 4 sec • Webdocs = 220 sec • Confirmed: • Cost(Freq. Itemset Generation) >> Cost(Rule Generation) PARAS requires ~10% extra offline preprocess time compared with AdjLatticeRR. C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01. B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.

  23. FIRE: User Study. • Questions • Stable Region Usage Tests • T1: What are the most prominent rules by support and confidence? • T2: Which setting (out of a choice of 4) returns a different set of rules? • T3: Find the common and unique rules for two distinct parameter settings. • Filter/Redundancy Test • T4: Find the most frequent characteristics of edible and poisonous mushrooms. • Skyline View Test • T5: Find the parameter settings that produce top-k rules in the dataset, where k = 20, 50, 100. • 22 subjects • Mushroom and chess datasets • Cached Rule Miner (CRM) versus FIRE • Randomization to eliminate pre-knowledge

  24. Mushroom Dataset: Tasks 1, 2 and 3. Overall, FIRE outperforms the competitor CRM approach: users achieve similar or better accuracy while spending significantly less time on the tasks.

  25. Tasks 4 and 5. Overall, FIRE outperforms the competitor CRM approach: users achieve similar or better accuracy while spending significantly less time on the tasks.

  26. Conclusion. We proposed a novel parameter space model, developed optimal algorithms and designed effective visualizations to facilitate interactive rule exploration by tackling challenges related to both computational and visualization aspects of online rule mining. Gains of several orders of magnitude when using PARAS for online processing outweigh the one-time minimal offline preprocessing time and storage requirements. Our user study establishes the usability and effectiveness of the proposed features and interactions of the FIRE system in facilitating interactive rule mining.

  27. Recent works at Samsung Research America. User Behavior Analysis via On-device Mobile Sensing. • Association rule mining over multi-modal mobile context data. • Unobtrusively learn sequential patterns of mobile users: “Typically, when I am home on Sunday nights, I call my parents”. MobileMiner: Mining Your Frequent Behavior Patterns On Your Phone, V. Srinivasan et al., ACM UbiComp 2014 (Best Paper Nominee), HotMobile 2013. Mobile Sequence Miner: Adding Intelligence to Your Mobile Device via On-Device Sequential Pattern Mining, A. Mukherji et al., ACM MCSS Workshop in UbiComp 2014.

  28. Thanks Contact me with questions: Abhishek Mukherji Samsung Research America a.mukherji@samsung.com
