Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Bronnimann (Polytechnic Univ., hbr@poly.edu), B. Chen (Exilixis, bchen@ece.nwu.edu), P. Haas (IBM Almaden, peterh@almaden.ibm.com), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University, {manoranj,yiqiao,peters}@ece.nwu.edu)
Motivation • Volume of data in warehouses and on the Internet is growing faster than Moore's Law • Scalability is a major concern: "classical" algorithms require one or more scans of the database • Need to adapt to streaming data: data elements arrive on-line, only a limited amount of memory is available, and only lossy compressed synopses (sketches) of the data can be kept • One solution: execute the algorithm on a sample
Motivation • Sampling methods • Advantage: can explicitly trade off accuracy and speed • Work best when tailored to the application • Our contributions • Sampling methods for count datasets: a base set of items, where each data element is a vector of item counts • Application: association rule mining
Outline • Motivation • FAST • Epsilon Approximation • Experimental Results • Data Stream Reduction • Conclusion
The Problem Generate a smaller subset S0 of a larger superset S such that the supports of the 1-itemsets in S0 are close to those in S. The problem is NP-complete, by reduction from the One-In-Three SAT problem. Notation: I1(T) = set of all 1-itemsets in transaction set T; L1(T) = set of frequent 1-itemsets in transaction set T; f(A;T) = support of itemset A in transaction set T
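The slides use Dist(S0, S) below without ever writing it out; based on the D2 label that appears in the experiments, a plausible reconstruction is the deviation in 1-itemset supports, in absolute or squared form (the exact choice is an assumption, not stated on the slides):

$$\mathrm{Dist}_1(S_0, S) = \sum_{A \in I_1(S)} \bigl| f(A; S_0) - f(A; S) \bigr|, \qquad \mathrm{Dist}_2(S_0, S) = \sum_{A \in I_1(S)} \bigl( f(A; S_0) - f(A; S) \bigr)^2 .$$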
FAST-trim • FAST-trim Outline Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows: 1. Obtain a large simple random sample S from the database D. 2. Compute f(A;S) for each 1-itemset A. 3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions. 4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim • The FAST-trim algorithm uses an input parameter k to explicitly trade off speed against accuracy. Trimming Phase:
while (|S0| > n) {
    divide S0 into disjoint groups of min(k, |S0|) transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 − {t*}, where Dist(S0 − {t*}, S) = min_{t ∈ G} Dist(S0 − {t}, S);
    }
}
Note: removal of the outlier t* causes the maximum decrease (or minimum increase) in Dist(S0, S).
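A minimal Python sketch of the trimming phase, assuming transactions are represented as sets of item IDs and using the squared-deviation Dist_2 reconstructed earlier; supports, dist2, and fast_trim are illustrative names, not the paper's code:

```python
from collections import Counter

def supports(transactions):
    """Support of each item: fraction of transactions containing it."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def dist2(sample, base_supports):
    """Squared deviation of the sample's supports from the base supports."""
    s = supports(sample)
    items = set(s) | set(base_supports)
    return sum((s.get(a, 0.0) - base_supports.get(a, 0.0)) ** 2 for a in items)

def fast_trim(S, n, k):
    """Trim a random sample S down to n transactions, examining k at a time."""
    base = supports(S)
    S0 = list(S)
    while len(S0) > n:
        group_size = min(k, len(S0))
        # Remove from the first group the outlier t* whose removal
        # minimizes Dist(S0 - {t*}, S); the paper sweeps all groups per pass.
        best_idx = min(range(group_size),
                       key=lambda i: dist2(S0[:i] + S0[i + 1:], base))
        del S0[best_idx]
    return S0
```

For example, fast_trim(transactions, n=1000, k=20) keeps the 1,000 transactions whose item supports best match those of the full sample.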
FAST-grow • The FAST-grow algorithm selects representative transactions from S and adds them to the sample S0, which is initially empty. Growing Phase:
while (|S0| < n) {
    divide S into disjoint groups of min(k, |S|) transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min_{t ∈ G} Dist(S0 ∪ {t}, S);
    }
}
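Under the same representation, the growing phase is the mirror image of trimming; again a sketch only, reusing the hypothetical supports and dist2 helpers from above:

```python
def fast_grow(S, n, k):
    """Grow an n-transaction sample by greedily adding representatives."""
    base = supports(S)
    S0, rest = [], list(S)
    while len(S0) < n and rest:
        group = rest[: min(k, len(rest))]
        # Add the representative t* with Dist(S0 + {t*}, S) minimal.
        best = min(group, key=lambda t: dist2(S0 + [t], base))
        rest.remove(best)
        S0.append(best)
    return S0
```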
Epsilon Approximation (EA) • Theory based on work in statistics on VC dimensions • Vapnik & Chervonenkis (1971) show that the frequencies of a whole collection of subsets can be estimated simultaneously from one sample whenever the VC dimension is finite • Applications in computational geometry and learning theory. Def: A sample S0 of S is an ε-approximation iff the discrepancy satisfies
$$\mathrm{Dist}(S_0, S) \;=\; \max_{A \in I_1(S)} \bigl| f(A; S_0) - f(A; S) \bigr| \;\le\; \epsilon .$$
Epsilon Approximation (EA) Halving Method • Deterministically halves the data to get the sample S0 • Apply halving repeatedly (S1 ⇒ S2 ⇒ … ⇒ St (= S0)) until the discrepancy budget ε is used up • Each halving step i introduces a discrepancy of $\epsilon_i = O\bigl(\sqrt{\ln(2m)/n_i}\bigr)$, where m = total number of items in the database and n_i = size of sub-sample S_i • Halving stops with the maximum t such that $\epsilon_1 + \epsilon_2 + \cdots + \epsilon_t \le \epsilon$
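Why repeated halving remains affordable: with $n_i = n/2^{\,i-1}$, the per-step discrepancies grow geometrically by a factor of $\sqrt{2}$, so the total is dominated by the final halving. The constants here are indicative only, inherited from the O(·) form of the bound above:

$$\sum_{i=1}^{t} \epsilon_i \;\asymp\; \sqrt{\frac{\ln(2m)}{n}} \sum_{i=1}^{t} \bigl(\sqrt{2}\bigr)^{\,i-1} \;\le\; \frac{\sqrt{2}}{\sqrt{2}-1}\,\epsilon_t \;\approx\; 3.41\,\epsilon_t .$$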
Epsilon Approximation (EA) How to compute a halving? The hyperbolic cosine method [Spencer] • Color each transaction red (in the sample) or blue (not in the sample) • Keep a penalty for each item that reflects the color imbalance • Penalty stays small if red/blue are approximately balanced • Penalty shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled) • Color the transactions sequentially, keeping the penalty low • Key property: the penalty does not increase on average => at each step, at least one of the two colors does not increase the global penalty
Epsilon Approximation (EA) Penalty Computation • Let Q_i = penalty for item A_i, initialized to Q_i = 2 • Suppose that we have colored the first j transactions; then
$$Q_i(j) \;=\; (1+\delta_i)^{\,r_i(j) - b_i(j)} \;+\; (1+\delta_i)^{\,b_i(j) - r_i(j)},$$
where r_i = r_i(j) = number of red transactions containing A_i, b_i = b_i(j) = number of blue transactions containing A_i, and δ_i = parameter that controls how fast the penalty changes as a function of |r_i − b_i|. (With no transactions colored, each term is 1, matching the initialization Q_i = 2.)
Epsilon Approximation (EA) How to color transaction j+1 • Compute the global penalty Q = Σ_i Q_i under both choices: Q^red = global penalty assuming transaction j+1 is red; Q^blue = global penalty assuming transaction j+1 is blue • Choose the color for which the global penalty is smaller • EA is inherently an on-line method
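A self-contained Python sketch of this sequential coloring, assuming transactions are sets of item IDs. It folds the (1+δ)-power penalty above into the equivalent cosh form with a single global delta (the slides allow a per-item δ_i), so the parameter handling is illustrative rather than the authors' implementation:

```python
import math

def ea_halve(transactions, delta=0.1):
    """One EA halving: color transactions red/blue, return the red half."""
    diff = {}   # per item A_i: r_i - b_i (red count minus blue count)
    red = []
    for t in transactions:
        # Items not in t contribute identically under either color, so it
        # suffices to compare the penalty over the items of t.
        # Q_i = 2*cosh(delta * (r_i - b_i)) is the cosh form of the penalty.
        pen_red = sum(2 * math.cosh(delta * (diff.get(a, 0) + 1)) for a in t)
        pen_blue = sum(2 * math.cosh(delta * (diff.get(a, 0) - 1)) for a in t)
        step = 1 if pen_red <= pen_blue else -1
        if step == 1:
            red.append(t)
        for a in t:
            diff[a] = diff.get(a, 0) + step
    return red
```

Applying ea_halve repeatedly, each time to the half it returned, realizes the S1 ⇒ S2 ⇒ … ⇒ St chain from the halving slide. Note the two color classes are only approximately equal in size: the guarantee is on per-item balance, not on the transaction count.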
Performance Evaluation • Synthetic data set from the IBM QUEST project [AS94] • 100,000 transactions • 1,000 items • number of maximal potentially large itemsets: 2,000 • average transaction length: 10 • average length of maximal large itemsets: 4 • maximum length of the maximal large itemsets: 6 • minimum support: 0.77% • Final sampling ratios (0.76%, 1.51%, 3.0%, …) dictated by EA halvings
Experimental Results • At an 87% reduction in sample size, accuracy is 99% for EA, 97% for FAST_trim_D2, and 94.6% for SRS (simple random sampling)
Experimental Results • FAST_grow_D2 is best at very small sampling ratios (< 2%) • EA is best overall in accuracy
Data Stream Reduction (DSR) • Goal: a representative sample of the data stream • Assign more weight to recent data while partially keeping track of old data • The stream is divided into buckets 1, …, m_S of N_S/2 transactions each, so the total number of transactions is m_S · N_S/2 • To generate an N_S-element sample, bucket k is halved (m_S − k) times: the most recent bucket m_S contributes N_S/2 transactions, bucket m_S − 1 contributes N_S/4, then N_S/8, and so on down to 1 for the oldest bucket
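As a consistency check on the bucket scheme (simple arithmetic, not spelled out on the slide): halving bucket k a total of (m_S − k) times leaves $N_S/2^{\,m_S-k+1}$ transactions, and these geometric contributions never exceed the target sample size:

$$\sum_{k=1}^{m_S} \frac{N_S}{2^{\,m_S - k + 1}} \;=\; N_S \left( \frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{2^{m_S}} \right) \;=\; N_S \bigl( 1 - 2^{-m_S} \bigr) \;<\; N_S .$$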
Data Stream Reduction • Practical implementation: to avoid frequent halving, we use a single buffer and compute a new representative sample, by applying EA, only when the buffer is full • Each time the buffer fills, the existing contents undergo one further halving, so older data has been halved 0, 1, 2, 3, … times as it ages
Data Stream Reduction • Problem: two users querying immediately before and after a halving operation see data that varies substantially • Continuous DSR: the buffer is divided into chunks of n_s transactions; as each batch of n_s new transactions arrives, the oldest chunk is halved first, so the sample size drifts gradually (N_S, N_S − n_s, N_S − 2n_s, …) instead of jumping at a single halving
Conclusion • Two-stage sampling approach based on trimming outliers or selecting representative transactions • Epsilon approximation: a deterministic method for repeatedly halving the data to obtain the final sample • Can be used in conjunction with other non-sampling count-based mining algorithms • EA-based data stream reduction • We are investigating how to evaluate the goodness of a representative subset, with frequency information to be used in the discrepancy function