Summarizing Itemset Patterns: A Profile-Based Approach

Summarizing Itemset Patterns: A Profile-Based Approach • Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin • ACM KDD '05 • Advisor: Jia-Ling Koh • Speaker: Yu-Jiun Liu • 2006/01/06

Introduction Ⅰ • Closed frequent pattern: a frequent pattern with no super-pattern that has the same support. • Maximal frequent pattern: a frequent pattern with no frequent super-pattern. • Top-K vs. K representatives

Introduction Ⅱ • The format of these representatives. • How to find these representatives? • The measure of their quality.

Definition • Bernoulli Distribution Vector • Pattern Profile

Equations • The relative frequency of item o_i in D'. • Estimated support
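The two quantities named on this slide can be written out as below. This is a sketch under assumptions: the slide's own equations are not shown here, so the notation (t_ij, ŝ, ρ = |D'|/|D|) is assumed, and the product form assumes that items are treated as independent within a profile.

```latex
% t_{ij} = 1 if transaction t_j contains item o_i, and 0 otherwise (assumed notation)
p(x_i = 1) = \frac{\sum_{t_j \in D'} t_{ij}}{|D'|}
\qquad
\hat{s}(\alpha) = \rho \cdot \prod_{o_i \in \alpha} p(x_i = 1),
\quad \rho = \frac{|D'|}{|D|}
```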

Pattern Profile Example • Both of the above datasets can be summarized by <abcd>, but the quality is better for D1. • p(a) = (50+1000)/(50+100+1000) = 0.91 • Mabc = <[0.91,0.96,1], abcd, 0.87> • M = <[0.91,0.96,1,1], abcd, 1>
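The arithmetic in this example can be replayed with a short Python sketch. The dataset below is an assumption: the slide's dataset figure is not available, so D1 is reconstructed only so that it reproduces p(a) = (50+1000)/(50+100+1000). The names `build_profile` and `estimated_support` are hypothetical helpers, and the estimate assumes item independence within a profile.

```python
def build_profile(patterns, dataset):
    """Return (p, master, rho) for a group of patterns over a list of transactions."""
    # Master pattern: union of the items of all patterns in the group.
    master = sorted(set().union(*map(set, patterns)))
    # D': transactions that contain at least one pattern of the group.
    covered = [t for t in dataset if any(set(a) <= set(t) for a in patterns)]
    # Relative frequency of each master item inside D'.
    p = [sum(item in t for t in covered) / len(covered) for item in master]
    rho = len(covered) / len(dataset)
    return p, "".join(master), rho


def estimated_support(profile, pattern):
    """Estimate support as rho times the product of the item probabilities."""
    p, master, rho = profile
    est = rho
    for item in pattern:
        est *= p[master.index(item)]
    return est


# Assumed dataset (not from the slides): 50 x acd, 100 x bcd, 1000 x abcd.
D1 = ["acd"] * 50 + ["bcd"] * 100 + ["abcd"] * 1000
profile = build_profile(["acd", "bcd", "abcd"], D1)
print([round(v, 2) for v in profile[0]], profile[1], profile[2])
print(round(estimated_support(profile, "abcd"), 2))
```

Under this assumed dataset the distribution vector rounds to [0.91, 0.96, 1.0, 1.0] with master pattern abcd and profile support 1.0.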

Pattern Summarization • First, construct a special profile for each pattern that contains only that pattern. • Then merge similar patterns, using the Kullback-Leibler (KL) divergence between profiles as the similarity measure. • KL-divergence
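As a minimal sketch, the KL divergence between two Bernoulli distribution vectors can be computed item by item and summed. The function name `kl_divergence` and the clamping constant `eps` are assumptions introduced here to avoid log(0) and division by zero; the slides do not specify how probabilities of exactly 0 or 1 are handled.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two Bernoulli vectors defined over the same items."""
    total = 0.0
    for pi, qi in zip(p, q):
        # Clamp into (0, 1) so that the logarithms stay finite (assumed smoothing).
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        total += pi * math.log(pi / qi) + (1 - pi) * math.log((1 - pi) / (1 - qi))
    return total


same = kl_divergence([0.9, 0.5], [0.9, 0.5])
forward = kl_divergence([0.9, 0.5], [0.5, 0.5])
backward = kl_divergence([0.5, 0.5], [0.9, 0.5])
print(same, forward, backward)
```

The divergence is zero for identical vectors and is not symmetric, which is why a clustering step has to fix a direction or sum both directions.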

Hierarchical Agglomerative Clustering
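A rough sketch of agglomerative clustering over pattern profiles follows. It is not the presented algorithm verbatim: the distance used here (the sum of KL in both directions), the clamping constant, and all function names are assumptions. Each cluster keeps its patterns and the ids of their supporting transactions, so a merged profile is re-learned from the union of transactions.

```python
import math
from itertools import combinations


def kl(p, q, eps=1e-6):
    """KL(p || q) between Bernoulli vectors, clamped away from 0 and 1."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
        total += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return total


def vector(tids, dataset, items):
    """Relative frequency of every item over the given transaction ids."""
    return [sum(i in dataset[t] for t in tids) / len(tids) for i in items]


def agglomerate(patterns, dataset, k):
    """Merge the two closest clusters until k remain (patterns must be frequent)."""
    items = sorted(set().union(*map(set, dataset)))
    # Start with one cluster per pattern: (patterns, supporting transaction ids).
    clusters = [({a}, {t for t, row in enumerate(dataset) if set(a) <= set(row)})
                for a in patterns]
    while len(clusters) > k:
        vecs = [vector(tids, dataset, items) for _, tids in clusters]
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: kl(vecs[ij[0]], vecs[ij[1]])
                   + kl(vecs[ij[1]], vecs[ij[0]]))
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters


# Toy data, made up for illustration.
data = ["abc"] * 5 + ["ab"] * 5 + ["xyz"] * 5 + ["xy"] * 5
result = agglomerate(["ab", "abc", "xy", "xyz"], data, k=2)
print(sorted(sorted(pats) for pats, _ in result))
```

Recomputing every pairwise distance in each round keeps the sketch short; a practical version would cache distances between clusters that did not change.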

K-means Clustering
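A k-means style variant can be sketched in the same spirit: assign every pattern to the nearest of K profile centers, then re-learn each center from the transactions of its members. The choice of seeds, the direction of the divergence (pattern to center), the stopping rule, and the function names are all assumptions made for this illustration.

```python
import math


def kl(p, q, eps=1e-6):
    """KL(p || q) between Bernoulli vectors, clamped away from 0 and 1."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
        total += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return total


def vector(tids, dataset, items):
    """Relative frequency of every item over the given transaction ids."""
    return [sum(i in dataset[t] for t in tids) / len(tids) for i in items]


def kmeans_profiles(patterns, dataset, seeds, rounds=10):
    """Group patterns around len(seeds) centers; seeds are patterns used as initial centers."""
    items = sorted(set().union(*map(set, dataset)))
    support = {a: {t for t, row in enumerate(dataset) if set(a) <= set(row)}
               for a in patterns}
    vec = {a: vector(support[a], dataset, items) for a in patterns}
    centers = [vec[s] for s in seeds]
    groups = None
    for _ in range(rounds):
        new_groups = [[] for _ in centers]
        for a in patterns:
            nearest = min(range(len(centers)), key=lambda c: kl(vec[a], centers[c]))
            new_groups[nearest].append(a)
        if new_groups == groups:  # assignments stopped changing
            break
        groups = new_groups
        # Re-learn each center from the union of its members' transactions.
        for c, members in enumerate(groups):
            if members:
                tids = set().union(*(support[a] for a in members))
                centers[c] = vector(tids, dataset, items)
    return groups


# Toy data, made up for illustration.
data = ["abc"] * 5 + ["ab"] * 5 + ["xyz"] * 5 + ["xy"] * 5
groups = kmeans_profiles(["ab", "abc", "xy", "xyz"], data, seeds=["ab", "xy"])
print(groups)
```

Fixed seeds are used here only to keep the example deterministic; random initialization would need several restarts to be reliable.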

Optimization Heuristics • Closed Itemsets vs. Frequent Itemsets • Given patterns α and β, if α ⊆ β and their supports are equal, then Dα = Dβ, so both patterns are backed by the same transactions and their distribution vectors coincide. • Approximate Profiles • Replace the original profile update with two approximate update equations, one for Algorithm 1 and one for Algorithm 2.

Quality Evaluation • Definition (Restoration Error) • T is a testing pattern set. • T’ is the collection of the itemsets generated by the master patterns in profiles and .

Quality Evaluation • J tests “frequent patterns”, some of which may be estimated as “infrequent”. • Jc tests “estimated frequent patterns”, some of which are actually “infrequent”. • Therefore J and Jc are complementary to each other.
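The two measures can be sketched as mean relative errors. This form is an assumption, since the defining equation is not shown on the slide: J is taken relative to the true support over the test set T, and Jc relative to the estimated support over T'. The function name and the support values are hypothetical.

```python
def restoration_error(true_support, est_support, patterns, relative_to="true"):
    """Mean of |s - s_hat| divided by s (relative_to='true') or by s_hat ('estimated')."""
    total = 0.0
    for a in patterns:
        s, s_hat = true_support[a], est_support[a]
        denom = s if relative_to == "true" else s_hat
        total += abs(s - s_hat) / denom
    return total / len(patterns)


# Made-up supports for illustration only.
true_s = {"ab": 0.5, "abc": 0.25}
est_s = {"ab": 0.4, "abc": 0.25}
J = restoration_error(true_s, est_s, ["ab", "abc"], relative_to="true")
Jc = restoration_error(true_s, est_s, ["ab", "abc"], relative_to="estimated")
print(round(J, 3), round(Jc, 3))
```

Dividing by the true support penalizes frequent patterns that are underestimated, while dividing by the estimated support penalizes patterns that are wrongly estimated as frequent.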

Quality Evaluation • Lemma • For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.

Optimal Number of Profiles • How to determine K? • M = (p, ψ, ρ) • Ex: require for any i such that • p~q, α~β → Dα~Dβ~Dα∪Dβ • Check the derivative of the quality over K. • If J increases suddenly when the number of profiles drops from K* to K* − 1, K* is likely to be a good choice.
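One simple way to act on this heuristic is to compare the restoration error at consecutive values of K and look for the largest increase when going from K to K − 1. The sketch below assumes that "suddenly" can be approximated by the largest jump; the function name `candidate_k` and the error values are made up for illustration.

```python
def candidate_k(errors):
    """errors maps K to J(K); return the K with the largest J(K - 1) - J(K)."""
    jumps = {k: errors[k - 1] - errors[k] for k in errors if k - 1 in errors}
    return max(jumps, key=jumps.get)


# Hypothetical restoration errors for K = 2..6 (not measured values).
errors = {2: 0.60, 3: 0.55, 4: 0.20, 5: 0.18, 6: 0.17}
best = candidate_k(errors)
print(best)
```

With these numbers the error rises sharply when K drops from 4 to 3, so K = 4 is the suggested candidate.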

Optimal Number of Profiles

Experiment • Three real datasets and a series of synthetic datasets. • Language: Visual C++ • CPU: Intel 3.2GHz • Memory: 1GB • OS: Windows XP

Mushroom ※688 closed patterns

BMS-Webview1 & Replace • BMS-Webview1 ※threshold = 0.1% ※4195 closed patterns ※many small frequent itemsets • Replace ※threshold = 3% ※4315 closed patterns ※many small frequent itemsets

Synthetic Datasets • Provided by IBM • 7 datasets, each has 10000 transactions. Choose top-500. • K = 50 and 100
