This research presents efficient data reduction methods for on-line association rule mining, including sampling techniques and epsilon approximation. The algorithms aim to reduce the volume of data in warehouses and handle streaming data. Experimental results show their effectiveness.
Efficient Data-Reduction Methods for On-line Association Rule Mining • H. Bronnimann (Polytechnic Univ, hbr@poly.edu) • B. Chen (Exilixis, bchen@ece.nwu.edu) • P. Haas (IBM Almaden, peterh@almaden.ibm.com) • M. Dash, Y. Qiao, P. Scheuermann (Northwestern University, {manoranj,yiqiao,peters}@ece.nwu.edu)
Motivation • Volume of Data in Warehouses & Internet is growing faster than Moore’s Law • Scalability is a major concern • “Classical” algorithms require one or more scans of the database • Need to adapt to Streaming Data • One Solution: Execute algorithm on a sample • Data elements arrive on-line • Limited amount of memory • Lossy compressed synopses (sketch) of data
Motivation • Sampling Methods • Advantage: can explicitly trade off accuracy and speed • Work best when tailored to application • Our Contributions • Sampling methods for count datasets • Base set of items & each data element is a vector of item counts • Application: Association rule mining
Outline • Outline of the Presentation • Motivation • FAST • Epsilon Approximation • Experimental Results • Data Stream Reduction • Conclusion
The Problem • Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S • The problem is NP-complete (via the One-In-Three SAT problem) • Notation: I1(T) = set of all 1-itemsets in transaction set T • L1(T) = set of frequent 1-itemsets in transaction set T • f(A;T) = support of itemset A in transaction set T
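The notation above can be made concrete with a minimal Python sketch. It assumes transactions are stored as sets of items and that support means the fraction of transactions containing the item; the function names (`supports`, `frequent_items`, `max_support_gap`) and the toy data are illustrative, not from the slides.

```python
from collections import Counter

def supports(transactions):
    """f(A;T) for every 1-itemset A: fraction of transactions containing A."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}

def frequent_items(transactions, min_support):
    """L1(T): items whose support is at least min_support."""
    return {a for a, f in supports(transactions).items() if f >= min_support}

def max_support_gap(sample, full):
    """Largest |f(A;S0) - f(A;S)| over the items of S (one way to read 'close')."""
    f0, f = supports(sample), supports(full)
    return max(abs(f0.get(a, 0.0) - fa) for a, fa in f.items())

# Toy data (hypothetical): S is the full set, S0 a candidate subset.
S = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
S0 = [{"a", "b"}, {"a", "c"}]
print(supports(S))
print(frequent_items(S, 0.75))
print(max_support_gap(S0, S))
```

A small gap means the 1-itemset supports in S0 track those in S, which is the goal stated on this slide.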
FAST-trim • FAST-trim Outline • Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows: (1) Obtain a large simple random sample S from D. (2) Compute f(A;S) for each 1-itemset A. (3) Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions. (4) Run a standard association-rule algorithm against S0 (with minimum support p and confidence c) to obtain the final set of association rules.
FAST-trim • FAST-trim Algorithm • Uses input parameter k to explicitly trade off speed and accuracy
Trimming Phase (S0 is initialized to S)
while (|S0| > n) {
    divide S0 into disjoint groups of min(k,|S0|) transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 - {t*}, where Dist(S0 - {t*}, S) = min over t ∈ G of Dist(S0 - {t}, S)
    }
}
Note: Removal of outlier t* causes the maximum decrease or the minimum increase in Dist(S0,S)
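The trimming phase can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the slides do not define Dist, so the sketch assumes a sum of squared differences between item supports, and it assumes k >= 1. The names `item_counts`, `dist`, and `fast_trim` are hypothetical.

```python
from collections import Counter

def item_counts(transactions):
    """Number of transactions containing each item."""
    return Counter(item for t in transactions for item in set(t))

def dist(counts, size, target):
    """Assumed distance: sum over items of (f(A;S0) - f(A;S))^2."""
    if size == 0:
        return sum(f * f for f in target.values())
    return sum((counts.get(a, 0) / size - f) ** 2 for a, f in target.items())

def fast_trim(S, n, k):
    """Trim S down to n transactions, removing one outlier per group of k."""
    S0 = [frozenset(t) for t in S]
    target = {a: c / len(S0) for a, c in item_counts(S0).items()}
    counts = item_counts(S0)
    remaining = len(S0)
    while remaining > n:
        size = min(k, remaining)
        groups = [S0[i:i + size] for i in range(0, len(S0), size)]
        S0 = []
        for G in groups:
            if remaining > n:
                def cost(i):
                    # Distance to S if transaction G[i] were removed.
                    trial = counts.copy()
                    trial.subtract(G[i])
                    return dist(trial, remaining - 1, target)
                worst = min(range(len(G)), key=cost)  # index of outlier t*
                counts.subtract(G[worst])
                G = G[:worst] + G[worst + 1:]
                remaining -= 1
            S0.extend(G)
    return S0

# Toy usage (hypothetical data): trim six transactions down to three.
S = [{"a"}, {"a"}, {"b"}, {"b"}, {"a", "b"}, {"c"}]
print(fast_trim(S, n=3, k=2))
```

A larger k examines more candidates per removal, which costs more time per step but lets each removal be chosen from a wider group; this is the speed/accuracy trade-off the slide attributes to k.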
FAST-grow • FAST-grow Algorithm • Select representative transactions from S and add them to the sample S0, which is initially empty
Growing Phase
while (|S0| < n) {
    divide the remaining transactions of S into disjoint groups of at most k transactions each;
    for each group G {
        compute f(A;S0) for each item A;
        set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min over t ∈ G of Dist(S0 ∪ {t}, S)
    }
}
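A matching sketch for the growing phase, reading it as adding one representative transaction per group until S0 reaches n transactions. As with trimming, the squared-difference distance is an assumption (the slides do not define Dist), k >= 1 is assumed, and the names `dist` and `fast_grow` are hypothetical.

```python
from collections import Counter

def dist(counts, size, target):
    """Assumed distance: sum over items of (f(A;S0) - f(A;S))^2."""
    return sum((counts.get(a, 0) / size - f) ** 2 for a, f in target.items())

def fast_grow(S, n, k):
    """Grow a sample of n transactions, adding one representative per group of k."""
    pool = [frozenset(t) for t in S]          # transactions not yet selected
    total = Counter(item for t in pool for item in t)
    target = {a: c / len(pool) for a, c in total.items()}
    n = min(n, len(pool))
    S0, counts = [], Counter()
    while len(S0) < n:
        size = min(k, len(pool))
        groups = [pool[i:i + size] for i in range(0, len(pool), size)]
        pool = []
        for G in groups:
            if len(S0) < n:
                def cost(i):
                    # Distance to S if transaction G[i] were added.
                    trial = counts.copy()
                    trial.update(G[i])
                    return dist(trial, len(S0) + 1, target)
                best = min(range(len(G)), key=cost)  # index of representative t*
                counts.update(G[best])
                S0.append(G[best])
                G = G[:best] + G[best + 1:]
            pool.extend(G)
    return S0

# Toy usage (hypothetical data): grow a three-transaction sample.
S = [{"a"}, {"a"}, {"b"}, {"b"}, {"a", "b"}, {"c"}]
print(fast_grow(S, n=3, k=2))
```

Growing does work proportional to the final sample size rather than to the amount trimmed, which fits the later observation that FAST-grow suits very small sampling ratios.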
Epsilon Approximation (EA) • Theory based on work in statistics on VC dimension • (Vapnik & Chervonenkis ’71) shows: can simultaneously estimate the frequency of a collection of subsets when the VC dimension is finite • Applications to computational geometry and learning theory • Def: A sample S0 of S1 is an ε-approximation iff its discrepancy satisfies max over items A of |f(A;S0) - f(A;S1)| ≤ ε
Epsilon Approximation (EA) Halving Method • Deterministically halves the data to get sample S0 • Apply halving repeatedly (S1 => S2 => … => St(= S0)) until • Each halving step introduces a discrepancy where m = total no. of items in database, ni = size of sub-sample Si • Halving stops with the maximum t such that
Epsilon Approximation (EA) How to compute halving? Hyperbolic cosine method [Spencer] • Color each transaction red (in sample) or blue (not in sample) • Penalty for each item reflects its red/blue balance • Penalty small if red/blue approximately balanced • Penalty will shoot up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled) • Color transactions sequentially, keeping penalty low • Key property: no increase in penalty on average • => One of the two colors does not increase the penalty globally
Epsilon Approximation (EA) Penalty Computation • Let Qi = Penalty for item Ai • Init Qi = 2 • Suppose that we have colored the first j transactions, where ri = ri(j) = no. of red transactions containing Ai • bi = bi(j) = no. of blue transactions containing Ai • δi = parameter that influences how fast the penalty changes as a function of |ri - bi|
Epsilon Approximation (EA) How to color transaction j+1 • Compute the global penalty under each choice: (i) assuming transaction j+1 is red; (ii) assuming transaction j+1 is blue • Choose the color for which the global penalty is smaller • EA is inherently an on-line method
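The coloring rule can be sketched in Python under explicit assumptions, since the penalty formula did not survive in these slides. The sketch assumes the usual hyperbolic-cosine form Qi = (1+δ)^ri (1-δ)^bi + (1-δ)^ri (1+δ)^bi, which equals 2 when nothing is colored, uses a single δ for all items instead of a per-item δi, and adds a hypothetical dummy item (`BALANCE`) to every transaction so that the red/blue counts themselves stay balanced. The function name `halve` is illustrative.

```python
from collections import defaultdict

BALANCE = object()  # hypothetical dummy item contained in every transaction

def halve(transactions, delta=0.1):
    """Color transactions red (kept) or blue (dropped), one at a time.

    For each item, q1 = (1+d)^r (1-d)^b and q2 = (1-d)^r (1+d)^b, so the
    assumed penalty is Q = q1 + q2 and starts at 2.
    """
    q1 = defaultdict(lambda: 1.0)
    q2 = defaultdict(lambda: 1.0)
    red = []
    for t in transactions:
        items = list(set(t)) + [BALANCE]
        # Coloring red changes the total penalty by sum d*(q1 - q2) over the
        # items of t; coloring blue changes it by exactly the negative.
        # Hence one of the two colors never increases the penalty.
        red_change = sum(delta * (q1[a] - q2[a]) for a in items)
        if red_change <= 0:
            red.append(t)
            grow, shrink = q1, q2
        else:
            grow, shrink = q2, q1
        for a in items:
            grow[a] *= 1 + delta
            shrink[a] *= 1 - delta
    return red

# Toy usage (hypothetical data): one halving step over 60 transactions.
data = [{"a", "b"}, {"a"}, {"b"}] * 20
half = halve(data)
print(len(data), len(half))
```

Because each transaction is colored as it arrives using only the running penalties, the procedure needs a single pass, which is what makes it usable on-line. Repeated halving is then just `halve(halve(data))` and so on.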
Performance Evaluation • Synthetic data set • IBM QUEST project [AS94] • 100,000 transactions • 1,000 items • number of maximal potentially large itemsets = 2000 • average transaction length: 10 • average length of maximal large itemsets: 4 • minimum support: 0.77% • length of the maximal large itemsets: 6 • Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings
Experimental Results • 87% reduction in sample size for accuracy: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)
Experimental Results • FAST_grow_D2 is best for very small sampling ratios (< 2%) • EA is best overall in accuracy
Data Stream Reduction (DSR) • Representative sample of data stream • Assign more weight to recent data while partially keeping track of old data • Stream is split into mS buckets of NS/2 transactions each (total #transactions = mS·NS/2) • To generate an NS-element sample, halve bucket k (mS-k) times • [Figure: buckets mS, mS-1, mS-2, …, 1 start with NS/2 transactions each; after halving they hold NS/2, NS/4, NS/8, …, 1 transactions respectively]
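The bucket scheme can be illustrated with a short Python sketch. The halving step here is a stand-in (keep every second element) used only to show the bookkeeping; in DSR it would be an EA halving. The names `take_every_other` and `dsr_sample` are hypothetical.

```python
def take_every_other(bucket):
    """Stand-in for one EA halving step (assumption): keep every second element."""
    return bucket[::2]

def dsr_sample(stream, Ns, halve=take_every_other):
    """Build a sample of roughly Ns elements that favors recent data.

    The stream is cut into buckets of Ns/2 transactions; bucket 1 is the
    oldest and bucket ms the newest. Bucket k is halved (ms - k) times.
    """
    half = Ns // 2
    buckets = [stream[i:i + half] for i in range(0, len(stream), half)]
    ms = len(buckets)
    sample = []
    for k, bucket in enumerate(buckets, start=1):
        for _ in range(ms - k):
            bucket = halve(bucket)
        sample.extend(bucket)
    return sample

# Toy usage: 32 transactions (numbered 0..31), Ns = 16, so four buckets of 8.
print(dsr_sample(list(range(32)), Ns=16))
```

With four buckets the contributions are 1, 2, 4 and 8 elements from oldest to newest, mirroring the NS/2, NS/4, NS/8, … pattern in the figure.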
Data Stream Reduction • Practical Implementation • To avoid frequent halving, we use a single buffer and compute a new representative sample, by applying EA, only when the buffer is full • [Figure: buffer of size Ns, initially empty; regions of the buffer are labeled by the number of halvings applied to them (0, 1, 2, 3)]
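One plausible reading of the buffer scheme, sketched in Python: transactions accumulate in a buffer of capacity Ns, and when the buffer fills, the whole buffer is halved, so older data has been through more halvings than newer data. This interpretation, the class name `StreamReducer`, and the stand-in halving step are all assumptions, not taken from the slides.

```python
def take_every_other(buffer):
    """Stand-in for one EA halving step (assumption): keep every second element."""
    return buffer[::2]

class StreamReducer:
    """Keeps a bounded representative buffer of a transaction stream."""

    def __init__(self, capacity, halve=take_every_other):
        self.capacity = capacity
        self.halve = halve
        self.buffer = []

    def add(self, transaction):
        self.buffer.append(transaction)
        if len(self.buffer) >= self.capacity:
            # Buffer is full: halve it once. Elements that survived earlier
            # halvings are halved again, so old data thins out over time.
            self.buffer = self.halve(self.buffer)

# Toy usage: feed 20 numbered transactions through a buffer of capacity 8.
reducer = StreamReducer(capacity=8)
for t in range(20):
    reducer.add(t)
print(reducer.buffer)
```

Halving only on overflow means one halving per Ns/2 arrivals rather than one per arrival, which is the cost saving the slide describes.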
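Data Stream Reduction • Problem: Two users immediately before and after a halving operation see data that varies substantially • Continuous DSR: Buffer divided into chunks of ns transactions • When the next ns transactions arrive, the oldest chunk is halved first • [Figure: buffer of size Ns with chunk boundaries at ns, 2ns, 3ns, 4ns, 5ns, …, Ns-2ns, Ns-ns, Ns; new transactions enter as the newest chunk]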
Conclusion • Two-stage sampling approach based on trimming outliers or selecting representative transactions • Epsilon approximation: deterministic method for repeatedly halving data to obtain final sample • Can be used in conjunction with other non-sampling count-based mining algorithms • EA-based data stream reduction • We are investigating how to evaluate goodness of representative subset • Frequency information to be used for discrepancy function