Summarizing Itemset Patterns: A Profile-Based Approach. Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin. ACM KDD '05. Advisor: Jia-Ling Koh. Speaker: Yu-Jiun Liu
Summarizing Itemset Patterns: A Profile-Based Approach
Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin
ACM KDD '05
Advisor: Jia-Ling Koh
Speaker: Yu-Jiun Liu
2006/01/06
Introduction Ⅰ
• Closed frequent pattern: a frequent pattern with no super-pattern that has the same support.
• Maximal frequent pattern: a frequent pattern with no frequent super-pattern.
• Top-K patterns vs. K representatives
Introduction Ⅱ
• What is the format of these representatives?
• How to find these representatives?
• How to measure their quality?
Definition
• Bernoulli Distribution Vector
• Pattern Profile
Equations
• The relative frequency of item o_i in D'.
• Estimated Support
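The formulas on this slide did not survive extraction. A sketch of the two quantities, under the assumption that items are treated as independent within a profile M = <p, φ, ρ> built from the transaction set D' (t_ij, an assumed notation, is 1 if item o_i occurs in transaction t_j and 0 otherwise):

```latex
% Relative frequency of item o_i in D' (entry i of the Bernoulli distribution vector)
p(x_i = 1) = \frac{\sum_{t_j \in D'} t_{ij}}{|D'|}

% Estimated support of a pattern \alpha covered by the profile,
% assuming item independence within the profile
\hat{s}(\alpha) = \rho \times \prod_{o_i \in \alpha} p(x_i = 1)
```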
Pattern Profile Example
• Both of the above datasets can be summarized by <abcd>, but the quality is better for D1.
• p(a) = (50+1000)/(50+100+1000) = 0.91
• Mabc = <[0.91,0.96,1], abcd, 0.87>
• M = <[0.91,0.96,1,1], abcd, 1>
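The numbers in this example can be reproduced with a short sketch. The transaction counts below (acd × 50, bcd × 100, abcd × 1000) are an assumption inferred from the p(a) computation on the slide, and `build_profile` and `estimated_support` are hypothetical helper names, not functions from the paper.

```python
# Assumed dataset D1: each key is a transaction's itemset, each value its count.
# These counts are inferred from the slide's p(a) = (50+1000)/(50+100+1000).
transactions = {"acd": 50, "bcd": 100, "abcd": 1000}

def build_profile(transactions, db_size):
    """Return (p, master, rho): item frequencies, master pattern, relative support."""
    covered = sum(transactions.values())
    # Master pattern: union of all items in the covered transactions.
    master = sorted(set().union(*map(set, transactions)))
    # Relative frequency of each item among the covered transactions.
    p = {
        item: sum(n for t, n in transactions.items() if item in t) / covered
        for item in master
    }
    rho = covered / db_size
    return p, "".join(master), rho

def estimated_support(pattern, profile):
    """Estimate support as rho times the product of item frequencies (independence assumed)."""
    p, _, rho = profile
    est = rho
    for item in pattern:
        est *= p[item]
    return est

profile = build_profile(transactions, db_size=1150)
```

Under these assumed counts, the distribution vector rounds to [0.91, 0.96, 1, 1] and the estimated support of abcd rounds to 0.87, which agrees with the figures quoted on the slide.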
Pattern Summarization
• First, construct a special profile for each pattern that contains only that pattern itself.
• Use the Kullback-Leibler divergence to merge similar patterns.
• KL-divergence
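The KL-divergence formula itself is missing from the slide. As a minimal sketch, the standard KL divergence between two Bernoulli distribution vectors can be computed per item and summed. The function name `kl_bernoulli` and the clamping constant `eps` (used to avoid log(0)) are assumptions for illustration, not details taken from the talk.

```python
import math

def kl_bernoulli(p, q, eps=1e-9):
    """KL(p || q) for two Bernoulli distribution vectors of equal length.

    p[i] and q[i] are the probabilities that item i is present.
    Probabilities are clamped to [eps, 1 - eps] so the logarithms stay finite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        # Contribution of x_i = 1 plus contribution of x_i = 0.
        total += pi * math.log(pi / qi) + (1 - pi) * math.log((1 - pi) / (1 - qi))
    return total
```

Note that KL divergence is not symmetric, so `kl_bernoulli(p, q)` and `kl_bernoulli(q, p)` generally differ. Two profiles with identical distribution vectors have divergence 0, which is what makes them natural candidates for merging.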
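Optimization Heuristics
• Closed Itemsets vs. Frequent Itemsets
• Given patterns α and β, if α ⊆ β and their supports are equal, then Dα = Dβ, so the KL-divergence between their profiles is 0. Summarizing the closed itemsets is therefore sufficient.
• Approximate Profiles
• Replace the original profile update with two approximate update equations, one for Algorithm 1 and one for Algorithm 2.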
Quality Evaluation
• Definition (Restoration Error)
• T is a testing pattern set.
• T' is the collection of the itemsets generated by the master patterns in the profiles that are estimated to be frequent.
Quality Evaluation
• J tests "frequent patterns", some of which may be estimated as "infrequent".
• Jc tests "estimated frequent patterns", some of which are actually "infrequent".
• Therefore J and Jc are complementary to each other.
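The restoration error formula is not preserved in this transcript. A minimal sketch, assuming J is the average relative difference between the true and estimated support over the testing set T; the function name and the dictionary layout are hypothetical. Jc would be computed in the same spirit over T', though its exact normalisation is not recoverable from the slides.

```python
def restoration_error(true_support, estimated_support):
    """Average relative support error over a testing pattern set.

    Both arguments map a pattern (e.g. a string of items) to its support.
    Every pattern in true_support must have a positive true support and
    an entry in estimated_support.
    """
    errors = [
        abs(s - estimated_support[pattern]) / s
        for pattern, s in true_support.items()
    ]
    return sum(errors) / len(errors)
```

For example, with true supports {"ab": 0.5, "cd": 0.25} and estimates {"ab": 0.4, "cd": 0.25}, the first pattern contributes a relative error of 0.2 and the second 0, giving an average of about 0.1.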
Quality Evaluation
• Lemma
• For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.
Optimal Number of Profiles
• How to determine K?
• M = (p, ψ, ρ)
• Ex: require for any i such that
• p ~ q implies α ~ β and Dα ~ Dβ ~ Dα ∪ Dβ
• Check the derivative of the quality over K.
• If J increases suddenly from K* to K* - 1, K* is likely to be a good choice.
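One way to read the last bullet is as a search for the K where removing one more profile costs the most. The sketch below is a simplification that assumes "increases suddenly" can be approximated by the largest single-step increase J(K - 1) - J(K); `suggest_k` and its input layout are hypothetical, and in practice one would likely inspect the whole error curve rather than trust a single maximum.

```python
def suggest_k(errors_by_k):
    """Return the K whose predecessor K - 1 shows the largest jump in error.

    errors_by_k maps a number of profiles K to its restoration error J(K).
    Only K values whose K - 1 is also present are considered.
    Returns None if no such pair exists.
    """
    best_k, best_jump = None, float("-inf")
    for k in sorted(errors_by_k):
        if k - 1 in errors_by_k:
            # How much the error grows when going from K profiles down to K - 1.
            jump = errors_by_k[k - 1] - errors_by_k[k]
            if jump > best_jump:
                best_k, best_jump = k, jump
    return best_k
```

With illustrative (made-up) errors {2: 0.50, 3: 0.20, 4: 0.18, 5: 0.17}, the error rises sharply when going from 3 profiles to 2, so the sketch suggests K* = 3.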
Experiment
• Three real datasets and a series of synthetic datasets.
• Language: Visual C++
• CPU: Intel 3.2GHz
• Memory: 1GB
• OS: Windows XP
Mushroom ※688 closed patterns
BMS-Webview1 & Replace
BMS-Webview1:
※threshold = 0.1%
※4195 closed patterns
※many small frequent itemsets
Replace:
※threshold = 3%
※4315 closed patterns
※many small frequent itemsets
Synthetic Datasets
• Provided by IBM
• 7 datasets with 10,000 transactions each. Choose top-500.
• K = 50 and 100