An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Fabio Vandin
DEI - Università di Padova
CS Dept. - Brown University
Joint work with: A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal
AlgoDEEP, 16/04/10
Data Mining
• Discovery of hidden patterns (e.g., correlations, association rules, clusters, anomalies) from large data sets
• When is a pattern significant?
• Open problem: development of rigorous (mathematical/statistical) approaches to assess significance and to discover significant patterns efficiently
Frequent Itemsets (1)
• Dataset D of transactions over a set of items I (D ⊆ 2^I)
• Support of an itemset X ∈ 2^I in D = number of transactions of D that contain X
• Example: support({Beer, Diaper}) = 3. Significant?
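To make the support definition concrete, here is a minimal Python sketch on a made-up toy dataset (the transactions below are my own illustration, not the slide's table):

```python
transactions = [
    {"Beer", "Diaper", "Bread"},
    {"Beer", "Diaper"},
    {"Beer", "Diaper", "Milk"},
    {"Bread", "Milk"},
]

def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)

print(support({"Beer", "Diaper"}, transactions))  # -> 3
```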
Frequent Itemsets (2)
Original formulation of the problem [Agrawal et al. 93]
• input: dataset D over I, support threshold s
• output: all itemsets of support ≥ s in D (frequent itemsets)
• Rationale: significance = high support (≥ s)
Drawbacks:
• Threshold s hard to fix: too low → possible output explosion and spurious discoveries (false positives); too high → loss of interesting itemsets (false negatives)
• No guarantee of significance of the output itemsets
Alternative formulations proposed to mitigate the above drawbacks: closed itemsets, maximal itemsets, top-K itemsets
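For reference, a brute-force sketch of this formulation (function name and interface are mine); practical miners such as Apriori or FP-growth prune the candidate space instead of enumerating every k-itemset:

```python
from itertools import combinations

def frequent_itemsets(dataset, s, k):
    """Return all k-itemsets with support >= s in `dataset` (a list of item sets)."""
    items = sorted(set().union(*dataset))
    result = {}
    for cand in combinations(items, k):
        supp = sum(1 for t in dataset if set(cand) <= t)  # support of the candidate
        if supp >= s:
            result[cand] = supp
    return result
```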
Significance
Focus on statistical significance: significance w.r.t. a random model
We address the following questions:
• What support level makes an itemset significantly frequent?
• How to narrow the search down to significant itemsets?
Goal: minimize false discoveries and improve the quality of subsequent analysis
Related Work
• Many works consider the significance of itemsets in isolation. E.g., [Silverstein, Brin, Motwani 98]:
• rigorous statistical framework (with flaws!)
• χ² test to assess the degree of dependence of the items in an itemset
• Global characteristics of the dataset taken into account in [Gionis, Mannila, et al. 06]:
• deviation from a random dataset w.r.t. the number of frequent itemsets
• no rigorous statistical grounding
Statistical Tests
Standard statistical test:
• null hypothesis H0 (≈ not significant), alternative hypothesis H1
• H0 is tested against H1 by observing a certain statistic s
• p-value = Prob(obs ≥ s | H0 is true)
• Significance level α = probability of rejecting H0 when it is true (false positive), also called probability of a Type I error
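A toy illustration of these definitions (my own coin-flip example, not from the talk): under H0 the coin is fair, the statistic is the number of heads observed, and H0 is rejected when the p-value drops below α.

```python
from scipy.stats import binom

alpha = 0.05                               # significance level (Type I error probability)
heads, flips = 62, 100                     # observed statistic
p_value = binom.sf(heads - 1, flips, 0.5)  # P(obs >= 62 | H0: fair coin), ~0.01
print(p_value, p_value <= alpha)           # p-value below alpha -> reject H0
```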
Random Model
• I = set of n items; D = input dataset of t transactions over I
• For i ∈ I: n(i) = support of {i} in D; f_i = n(i)/t = frequency of i in D
• D̂ = random dataset of t transactions over I: item i is included in transaction j with probability f_i, independently of all other events
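A minimal sketch of this generative model (the helper name and its `freqs` argument, a dict mapping each item to its empirical frequency f_i in D, are my own):

```python
import random

def random_dataset(freqs, t, seed=0):
    """Draw t transactions: item i is put in each transaction
    independently with probability freqs[i]."""
    rng = random.Random(seed)
    return [{i for i in freqs if rng.random() < freqs[i]} for _ in range(t)]
```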
Naïve Approach (1)
• For each itemset X = {i_1, i_2, ..., i_k} ⊆ I: f_X = f_{i_1} · f_{i_2} · ... · f_{i_k} = expected frequency of X in D̂
• null hypothesis H0(X): the support of X in D conforms with D̂ (i.e., it is as if drawn from Binomial(t, f_X))
• alternative hypothesis H1(X): the support of X in D does not conform with D̂
Naïve Approach (2)
• Statistic of interest: s_X = support of X in D
• Reject H0(X) if: p-value = Prob(B(t, f_X) ≥ s_X) ≤ α
• Significant itemsets = { X ⊆ I : H0(X) is rejected }
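A sketch of the naïve test for a single itemset X, assuming SciPy is available (names are mine; `item_freqs` holds the frequencies f_i of the items of X):

```python
from math import prod
from scipy.stats import binom

def naive_test(item_freqs, support_X, t, alpha=0.05):
    """Reject H0(X) when P(Binomial(t, f_X) >= support_X) <= alpha."""
    f_X = prod(item_freqs)                      # expected frequency of X in the random dataset
    p_value = binom.sf(support_X - 1, t, f_X)   # upper tail: P(B(t, f_X) >= support_X)
    return p_value, p_value <= alpha
```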
Naïve Approach (3): What's wrong?
• D with t = 1,000,000 transactions over n = 1000 items, each item with frequency 1/1000
• Pair {i, j} that occurs 7 times: is it statistically significant?
• In D̂ (random dataset): E[support({i, j})] = 1, and p-value = Prob({i, j} has support ≥ 7) ≃ 0.0001
• So {i, j} must be significant!
Naïve Approach (4)
• Expected number of pairs with support ≥ 7 in the random dataset is ≃ 50
• The existence of some pair {i, j} with support ≥ 7 is not such a rare event! Returning {i, j} as a significant itemset could be a false discovery
• However, 300 (disjoint) pairs with support ≥ 7 is an extremely rare event under the random model (prob ≤ 2^-300)
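These numbers can be checked directly (assuming SciPy; the slide's ≈ 50 is a rounding of the same tail estimate):

```python
from math import comb
from scipy.stats import binom

t, n = 1_000_000, 1000
f_pair = (1 / n) ** 2       # f_X for a pair {i, j} of items of frequency 1/1000
p = binom.sf(6, t, f_pair)  # P(support of {i, j} >= 7) in the random dataset, ~1e-4
print(p)
print(comb(n, 2) * p)       # expected pairs with support >= 7: a few dozen (slide quotes ~50)
```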
Multi-Hypothesis test (1)
• Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously m = (n choose k) null hypotheses: {H0(X)}_{|X|=k}
• How to combine the m tests while minimizing false positives?
Multi-Hypothesis test (2)
• V = number of false positives
• R = total number of rejected null hypotheses = number of itemsets flagged as significant
• False Discovery Rate (FDR) = E[V/R] (FDR = 0 when R = 0)
• GOAL: maximize R while ensuring FDR ≤ β
[Benjamini-Yekutieli '01]: reject the hypothesis with the i-th smallest p-value if that p-value is ≤ i·β/m. Drawbacks:
• m = (n choose k), which is huge
• does not yield a support threshold for mining
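For concreteness, a sketch of the step-up rule as stated on the slide (function name is mine; the full Benjamini-Yekutieli correction for arbitrary dependence further divides the threshold by ∑_{j=1}^{m} 1/j):

```python
def step_up_rejections(p_values, beta):
    """Reject the hypotheses with the i smallest p-values, where i is the
    largest rank whose sorted p-value is <= rank * beta / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # indices sorted by p-value
    i_star = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * beta / m:
            i_star = rank
    return order[:i_star]                                # indices of rejected hypotheses
```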
Our Approach
• Q(k, s) = observed number of k-itemsets of support ≥ s
• null hypothesis H0(s): the number of k-itemsets of support ≥ s in D conforms with D̂
• alternative hypothesis H1(s): the number of k-itemsets of support ≥ s in D does not conform with D̂
• Problem: how to compute the p-value of Q(k, s)?
Main Results (PODS 2009)
Result 1 (Poisson approximation)
• Q(k, s) = number of k-itemsets of support ≥ s in D̂
• Theorem: there exists s_min such that, for s ≥ s_min, Q(k, s) is well approximated by a Poisson distribution
Result 2
• Methodology to establish a support threshold for discovering significant itemsets with small FDR
Approximation Result (1)
• Based on the Chen-Stein method (1975)
• Q(k, s) = number of k-itemsets of support ≥ s in the random dataset D̂
• U ~ Poisson(λ), with λ = E[Q(k, s)]
• Theorem: for k = O(1), t = poly(n), for a large range of item distributions and supports s: distance(Q(k, s), U) = O(1/n)
Approximation Result (2)
• Corollary: there exists s_min such that Q(k, s) is well approximated by a Poisson distribution for all s ≥ s_min
• In practice: a Monte Carlo method determines s_min such that, with probability at least 1-δ, distance(Q(k, s), U) ≤ ε for all s ≥ s_min
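A rough, self-contained Monte Carlo sketch of the kind of estimate involved: it averages Q(k, s) over random datasets drawn from the model above to approximate λ(s) = E[Q(k, s)]. It uses brute-force enumeration, so it is only practical for small n, t, and k; the paper's actual procedure for fixing s_min with the (ε, δ) guarantee is more refined and not reproduced here.

```python
import random
from itertools import combinations

def mc_lambda(freqs, t, k, s, trials=50, seed=0):
    """Estimate lambda(s) = E[Q(k, s)]: the average number of k-itemsets with
    support >= s over `trials` random datasets (item i appears in each
    transaction independently with probability freqs[i])."""
    rng = random.Random(seed)
    items = list(freqs)
    total = 0
    for _ in range(trials):
        data = [{i for i in items if rng.random() < freqs[i]} for _ in range(t)]
        total += sum(
            1 for cand in combinations(items, k)
            if sum(1 for tr in data if set(cand) <= tr) >= s
        )
    return total / trials
```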
Support threshold for mining significant itemsets (1)
• Determine s_min and let h be such that s_min + 2h is the maximum support of an itemset
• Fix α_1, α_2, ..., α_h such that ∑ α_i ≤ α, and β_1, β_2, ..., β_h such that ∑ β_i ≤ β
• For i = 1 to h:
• s_i = s_min + 2i
• Q(k, s_i) = observed number of k-itemsets of support ≥ s_i
• H0(k, s_i): Q(k, s_i) conforms with Poisson(λ_i = E[Q(k, s_i)])
• reject H0(k, s_i) if: p-value of Q(k, s_i) < α_i and Q(k, s_i) ≥ λ_i / β_i
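A minimal sketch of this multi-hypothesis test (names are mine): given, for each candidate support s_i, the observed count Q(k, s_i), the Poisson mean λ_i, and the budgets α_i and β_i, it applies the slide's rejection rule and returns the smallest rejected s_i, i.e. the threshold s*.

```python
from scipy.stats import poisson

def support_threshold(s_values, q_obs, lambdas, alphas, betas):
    """Reject H0(k, s_i) when P(Poisson(lambda_i) >= Q(k, s_i)) < alpha_i and
    Q(k, s_i) >= lambda_i / beta_i; return the smallest rejected s_i (s*), or None."""
    rejected = []
    for s_i, q, lam, a, b in zip(s_values, q_obs, lambdas, alphas, betas):
        p_value = poisson.sf(q - 1, lam)   # upper tail: P(Poisson(lam) >= q)
        if p_value < a and q >= lam / b:
            rejected.append(s_i)
    return min(rejected) if rejected else None
```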
Support threshold for mining significant itemsets (2)
Theorem. Let s* be the minimum s such that H0(k, s) was rejected. Then:
• with significance level α, the number of k-itemsets of support ≥ s* is significant
• the k-itemsets with support ≥ s* are significant with FDR ≤ β
Experiments: benchmark datasets
Datasets from the FIMI repository: http://fimi.cs.helsinki.fi/data/
[Table on the slide listing, for each dataset: number of items, range of item frequencies, average transaction length]
Experiments: results (1)
• Test with α = 0.05, β = 0.05
• Q_{k,s*} = number of k-itemsets of support ≥ s* in D
• λ(s*) = expected number of k-itemsets with support ≥ s*
[Results table on the slide; highlighted finding: an itemset of size 154 with support ≥ 7]
Experiments: results (2)
• Comparison with the standard application of Benjamini-Yekutieli: FDR ≤ 0.05
• R = output of the standard approach; Q_{k,s*} = output of our approach; r = |Q_{k,s*}|/|R|
[Comparison table on the slide]
Conclusions
• Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset
• A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR
Future Work
• Deal with false negatives
• Software package
• Application of the method to other frequent pattern problems
Thank you! Questions?