
Efficient Data-Reduction Methods for On-line Association Rule Mining

Efficient Data-Reduction Methods for On-line Association Rule Mining. H. Bronnimann. B. Chen. P. Haas. M. Dash, Y. Qiao, P. Scheuermann. Exilixis. IBM Almaden. Northwestern University. Polytechnic Univ. hbr@poly.edu. bchen@ece.nwu.edu. peterh@almaden.ibm.com.


Presentation Transcript


  1. Efficient Data-Reduction Methods for On-line Association Rule Mining
     Authors: H. Bronnimann (hbr@poly.edu), B. Chen (bchen@ece.nwu.edu), P. Haas (peterh@almaden.ibm.com), M. Dash, Y. Qiao, P. Scheuermann ({manoranj,yiqiao,peters}@ece.nwu.edu)
     Affiliations: Exilixis, IBM Almaden, Northwestern University, Polytechnic Univ

  2. Motivation
     • Volume of data in warehouses and on the Internet is growing faster than Moore’s Law
     • Scalability is a major concern
     • “Classical” algorithms require one or more scans of the database
     • Need to adapt to streaming data
     • One solution: execute the algorithm on a sample
     • Data elements arrive on-line
     • Limited amount of memory
     • Lossy compressed synopses (sketches) of data

  3. Motivation
     • Sampling methods
       • Advantage: can explicitly trade off accuracy and speed
       • Work best when tailored to the application
     • Our contributions
       • Sampling methods for count datasets
       • Base set of items; each data element is a vector of item counts
       • Application: association rule mining

  4. Outline
     Outline of the Presentation
     • Motivation
     • FAST
     • Epsilon Approximation
     • Experimental Results
     • Data Stream Reduction
     • Conclusion

  5. The Problem
     Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S.
     NP-complete: One-In-Three SAT problem
     Notation:
     • I1(T) = set of all 1-itemsets in transaction set T
     • L1(T) = set of frequent 1-itemsets in transaction set T
     • f(A;T) = support of itemset A in transaction set T
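
The notation above can be made concrete with a small sketch. It assumes each transaction is a Python set of item names, a representation chosen here for illustration and not taken from the slides. The helper names are hypothetical; they compute f(A;T) for every 1-itemset and derive I1(T) and L1(T) from it.

```python
from collections import Counter

def item_supports(transactions):
    """Return {item: f(item; T)}: the fraction of transactions containing each item."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}

def all_items(transactions):
    """I1(T): every item that occurs in at least one transaction."""
    return set(item_supports(transactions))

def frequent_items(transactions, min_support):
    """L1(T): items whose support is at least min_support."""
    return {a for a, f in item_supports(transactions).items() if f >= min_support}

# Toy transaction set (made up for this example).
T = [{"bread", "milk"}, {"bread"}, {"milk", "eggs"}, {"bread", "eggs"}]
print(item_supports(T))
print(frequent_items(T, 0.6))
```

A reduced sample S0 is good, in the sense of this slide, when `item_supports(S0)` stays close to `item_supports(S)` for every item.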

  6. FAST-trim
     • FAST-trim Outline
     Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
     1. Obtain a large simple random sample S from D.
     2. Compute f(A;S) for each 1-itemset A.
     3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
     4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.

  7. FAST-trim
     • FAST-trim Algorithm
     Uses input parameter k to explicitly trade off speed and accuracy.
     Trimming Phase
         while (|S0| > n) {
             divide S0 into disjoint groups of min(k, |S0|) transactions each;
             for each group G {
                 compute f(A;S0) for each item A;
                 set S0 = S0 - {t*}, where Dist(S0 - {t*}, S) = min over t ∈ G of Dist(S0 - {t}, S)
             }
         }
     Note: Removal of outlier t* causes the maximum decrease or minimum increase in Dist(S0, S).
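
The trimming phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not define Dist, so the sketch assumes a sum of squared differences between item supports in S0 and in S, and it represents transactions as sets. The function name `fast_trim` is hypothetical.

```python
from collections import Counter

def fast_trim(sample, n, k):
    """Trim `sample` (a list of item sets) down to n transactions, using groups of k."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be at least 1")
    # Target supports f(A;S), computed once from the full sample S.
    target = {a: c / len(sample)
              for a, c in Counter(i for t in sample for i in t).items()}
    s0 = list(sample)
    counts = Counter(i for t in s0 for i in t)  # item counts in the current S0

    def dist_without(t):
        # ASSUMED distance: sum over items of (f(A; S0 - {t}) - f(A; S))^2.
        size = len(s0) - 1
        return sum(((counts[a] - (a in t)) / size - f) ** 2
                   for a, f in target.items())

    while len(s0) > n:
        size = min(k, len(s0))
        groups = [s0[i:i + size] for i in range(0, len(s0), size)]
        for group in groups:
            if len(s0) <= n:
                break
            # t*: the transaction in this group whose removal leaves S0 closest to S.
            t_star = min(group, key=dist_without)
            s0.remove(t_star)
            counts.subtract(t_star)
    return s0
```

A larger k examines more candidates per removal, which tends to improve the choice of t* at the cost of more distance evaluations; that is the speed/accuracy trade-off the slide attributes to k.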

  8. FAST-grow
     • FAST-grow Algorithm
     Select representative transactions from S and add them to the sample S0, which is initially empty.
     Growing Phase
         while (|S0| < n) {
             divide S - S0 into disjoint groups of min(k, |S - S0|) transactions each;
             for each group G {
                 compute f(A;S0) for each item A;
                 set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min over t ∈ G of Dist(S0 ∪ {t}, S)
             }
         }
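
A growing phase can be sketched the same way. The assumptions are labeled: Dist is taken to be a sum of squared support differences (the slides do not define it), candidate groups are drawn from the transactions of S not yet selected, and transactions are sets. The name `fast_grow` is hypothetical.

```python
from collections import Counter

def fast_grow(sample, n, k):
    """Grow a sample S0 of up to n transactions from `sample`, using groups of k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Target supports f(A;S), computed once from the full sample S.
    target = {a: c / len(sample)
              for a, c in Counter(i for t in sample for i in t).items()}
    remaining = list(sample)  # transactions of S not yet in S0
    s0 = []
    counts = Counter()        # item counts in the current S0

    def dist_with(t):
        # ASSUMED distance: sum over items of (f(A; S0 + {t}) - f(A; S))^2.
        size = len(s0) + 1
        return sum(((counts[a] + (a in t)) / size - f) ** 2
                   for a, f in target.items())

    while len(s0) < n and remaining:
        size = min(k, len(remaining))
        groups = [remaining[i:i + size] for i in range(0, len(remaining), size)]
        for group in groups:
            if len(s0) >= n:
                break
            # t*: the transaction in this group whose addition brings S0 closest to S.
            t_star = min(group, key=dist_with)
            s0.append(t_star)
            counts.update(t_star)
            remaining.remove(t_star)
    return s0
```

Growing does work proportional to the final sample size rather than to the amount trimmed away, so it is the natural choice when the target n is a small fraction of |S|.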

  9. Epsilon Approximation (EA)
     • Theory based on work in statistics on VC dimension (Vapnik & Chervonenkis ’71), which shows that one can simultaneously estimate the frequencies of a collection of subsets when the VC dimension is finite
     • Applications to computational geometry and learning theory
     Def: A sample S0 of S1 is an ε-approximation iff the discrepancy satisfies max over items A of |f(A;S0) - f(A;S1)| ≤ ε

  10. Epsilon Approximation (EA)
     Halving Method
     • Deterministically halves the data to get sample S0
     • Apply halving repeatedly (S1 => S2 => … => St (= S0)) until
     • Each halving step introduces a discrepancy
       where m = total no. of items in database, ni = size of sub-sample Si
     • Halving stops with the maximum t such that

  11. Epsilon Approximation (EA)
     How to compute halving? Hyperbolic cosine method [Spencer]
     • Color each transaction red (in sample) or blue (not in sample)
     • Penalty for each item reflects its red/blue balance
       • Penalty is small if red and blue are approximately balanced
       • Penalty shoots up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled)
     • Color transactions sequentially, keeping the penalty low
     • Key property: no increase in penalty on average
       => One of the two colors does not increase the penalty globally

  12. Epsilon Approximation (EA)
     Penalty Computation
     • Let Qi = penalty for item Ai
     • Init Qi = 2
     • Suppose that we have colored the first j transactions, where
       ri = ri(j) = no. of red transactions containing Ai
       bi = bi(j) = no. of blue transactions containing Ai
       δi = parameter that influences how fast the penalty changes as a function of |ri - bi|

  13. Epsilon Approximation (EA)
     How to color transaction j+1
     • Compute the global penalty under each choice:
       • Global penalty assuming transaction j+1 is red
       • Global penalty assuming transaction j+1 is blue
     • Choose the color for which the global penalty is smaller
     EA is inherently an on-line method
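
One halving pass of this coloring rule can be sketched as below. The penalty formula is not preserved in this transcript, so the sketch assumes a common hyperbolic-cosine-style form, Qi = (1+δ)^ri (1-δ)^bi + (1-δ)^ri (1+δ)^bi, which equals 2 before any transaction is colored and so agrees with "Init Qi = 2". A single δ shared by all items and the tie-breaking rule are further assumptions; `ea_halve` is a hypothetical name.

```python
def ea_halve(transactions, delta=0.1):
    """Color transactions red (kept) or blue (dropped) one at a time; return the red ones."""
    # ASSUMED penalty per item: Q = over + under, where
    #   over  = (1 + delta)^r * (1 - delta)^b   (large when red dominates)
    #   under = (1 - delta)^r * (1 + delta)^b   (large when blue dominates)
    over, under = {}, {}
    red, n_blue = [], 0
    for t in transactions:
        # Coloring t red changes the global penalty by delta * sum(over - under)
        # over the items of t; coloring it blue changes it by the negation.
        # Hence at least one of the two colors never increases the penalty.
        change_if_red = sum(delta * (over.get(a, 1.0) - under.get(a, 1.0)) for a in t)
        if change_if_red == 0:
            # Tie-break (an assumption of this sketch): keep the two colors balanced.
            make_red = len(red) <= n_blue
        else:
            make_red = change_if_red < 0
        up, down = (1 + delta, 1 - delta) if make_red else (1 - delta, 1 + delta)
        for a in t:
            over[a] = over.get(a, 1.0) * up
            under[a] = under.get(a, 1.0) * down
        if make_red:
            red.append(t)
        else:
            n_blue += 1
    return red
```

Each transaction is examined once and only the penalties of its own items are updated, which is why the method suits on-line use.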

  14. Performance Evaluation
     • Synthetic data set
     • IBM QUEST project [AS94]
     • 100,000 transactions
     • 1,000 items
     • Number of maximal potentially large itemsets = 2000
     • Average transaction length: 10
     • Average length of maximal large itemsets: 4
     • Minimum support: 0.77%
     • Length of the maximal large itemsets: 6
     • Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings

  15. Experimental Results • 87% reduction in sample size for accuracy: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)

  16. Experimental Results • FAST_grow_D2 is best for very small sampling ratio (< 2%) • EA best over-all in accuracy

  17. Data Stream Reduction
     Data Stream Reduction (DSR)
     • Representative sample of data stream
     • Assign more weight to recent data while partially keeping track of old data
     Each bucket holds NS/2 transactions (total #transactions = mS·NS/2). To generate an NS-element sample, halve bucket k (mS - k) times:

     Bucket #            mS     mS-1   mS-2   …   1
     Transactions        NS/2   NS/2   NS/2   …   NS/2
     After halving       NS/2   NS/4   NS/8   …   1
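
The bucket arithmetic on this slide can be checked with a short sketch. It assumes NS is a power of two and that each halving keeps exactly half of a bucket (integer division); the function name `dsr_bucket_sizes` is hypothetical.

```python
def dsr_bucket_sizes(ns, ms):
    """Elements kept per bucket when bucket k (1 = oldest, ms = newest) is halved ms - k times."""
    # Every bucket starts with ns // 2 transactions; each halving divides that by two.
    return {k: (ns // 2) >> (ms - k) for k in range(ms, 0, -1)}

sizes = dsr_bucket_sizes(16, 4)
print(sizes)                # newest bucket keeps the most, oldest the least
print(sum(sizes.values()))  # close to ns: the halved buckets together form the sample
```

Because the retained counts form a geometric series, the total stays just under NS no matter how many buckets have arrived, while recent buckets contribute the most elements.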

  18. Data Stream Reduction
     • Practical Implementation
     To avoid frequent halving, we use a single buffer and compute a new representative sample by applying EA only when the buffer is full.
     [Figure: buffer of capacity Ns, from empty to full, with regions labeled 0, 1, 2, and 3 halvings]

  19. Data Stream Reduction
     • Problem: two users querying immediately before and after a halving operation see data that varies substantially
     • Continuous DSR: buffer divided into chunks
       • New transactions arrive ns at a time
       • Oldest chunk is halved first
     [Figure: buffer of capacity Ns divided into chunks at ns, 2ns, 3ns, 4ns, 5ns, …, Ns-2ns, Ns-ns, Ns]

  20. Conclusion
     • Two-stage sampling approach based on trimming outliers or selecting representative transactions
     • Epsilon approximation: deterministic method for repeatedly halving data to obtain the final sample
     • Can be used in conjunction with other non-sampling count-based mining algorithms
     • EA-based data stream reduction
     • We are investigating how to evaluate the goodness of a representative subset
     • Frequency information to be used for the discrepancy function
