Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee David W. Cheung Ben Kao The University of Hong Kong Data Mining and Knowledge Discovery, 1998 Presenter: Tri Tran CS331 – Spring 2006
Outline • Introduction • Problem Descriptions and Solutions • Mining of Association Rules • Update of Association Rules • Scheduling Update of Association Rules • DELI Algorithm • Example of DELI Algorithm • Experimental Results • Conclusions
Introduction • Data mining is applicable in many areas, such as decision support, market strategy, and financial forecasting • Data mining enables us to extract useful information from huge databases • It enables marketers to develop and implement customized marketing programs and strategies • Mining of association rules is one of the most common data mining problems
Introduction (cont.) • The database keeps changing over time, hence the set of discovered association rules needs to be updated to reflect the changes => maintenance of discovered association rules is also an important problem • Existing solutions scan the database multiple times to discover the association rules exactly • Apriori algorithm: discovers the set of association rules • FUP2 algorithm: efficiently updates the discovered association rules when transactions are added to, deleted from, or modified in the database • The authors propose an algorithm, DELI, to determine when a rule update should be applied • The algorithm uses sampling techniques to estimate the maximum amount of change in the set of rules caused by the database updates
Problem 1: Mining of Association Rules • Given a database D of transactions and a set of possible items, find the large itemsets • Transaction: a non-empty set of items • Large itemset: an itemset whose transaction support is above a pre-specified support threshold, s% • Association rule: X => Y, where X and Y are itemsets • By examining the large itemsets, find the association rules whose confidence is above a confidence threshold, c%
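The definitions above can be illustrated with a minimal sketch over a hypothetical toy database of five transactions (items and thresholds are illustrative, not from the paper):

```python
# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def is_large(itemset, db, s):
    """Large itemset: transaction support at or above the threshold s."""
    return support_count(itemset, db) >= s * len(db)

def confidence(x, y, db):
    """Confidence of the rule X => Y: supp(X u Y) / supp(X)."""
    return support_count(x | y, db) / support_count(x, db)

print(support_count({"A", "B"}, transactions))  # 3
print(is_large({"A", "B"}, transactions, 0.5))  # True (3 >= 2.5)
print(confidence({"A"}, {"B"}, transactions))   # 0.75
```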
Solution: Apriori Algorithm • Finds the large itemsets iteratively • At iteration k: • Use the large (k-1)-itemsets, Lk-1, to find the candidate itemsets of size k, Ck • Check which candidates have support above the pre-specified threshold and add them to the large k-itemsets, Lk • At every iteration, it scans the database to count the transactions that contain each candidate itemset • A large amount of time is spent scanning the whole database
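The iteration above can be sketched as follows. This is a simplified illustration, not the paper's implementation; `apriori_gen` performs the standard join-and-prune candidate generation:

```python
from itertools import combinations

def apriori_gen(large_prev, k):
    """Join large (k-1)-itemsets; prune candidates with a non-large subset."""
    cands = set()
    for a in large_prev:
        for b in large_prev:
            u = a | b
            if len(u) == k and all(
                frozenset(s) in large_prev for s in combinations(u, k - 1)
            ):
                cands.add(frozenset(u))
    return cands

def apriori(db, s):
    """Return all large itemsets; note the full scan in every iteration."""
    minsup = s * len(db)
    items = {i for t in db for i in t}
    large = {frozenset([i]) for i in items
             if sum(1 for t in db if i in t) >= minsup}
    all_large, k = set(large), 2
    while large:
        cands = apriori_gen(large, k)
        # one scan of the whole database per iteration (the costly part)
        counts = {c: sum(1 for t in db if c <= t) for c in cands}
        large = {c for c, n in counts.items() if n >= minsup}
        all_large |= large
        k += 1
    return all_large

db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(sorted("".join(sorted(x)) for x in apriori(db, 0.5)))
# ['A', 'AB', 'AC', 'B', 'BC', 'C']
```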
Problem 2: Update of Association Rules • After some updates have been applied to a database, find the new large itemsets and their support counts in an efficient manner • All database updates are either insertions or deletions • Association rule maintenance problem: efficiently update the discovered association rules by reusing the old mining results
Update of Association Rules • Notation: • Δ-: set of deleted transactions • Δ+: set of added transactions • D: old database • D': updated database • D*: set of unchanged transactions • σX: support count of itemset X in D • σ'X: support count of itemset X in D' • δX-: support count of itemset X in Δ- • δX+: support count of itemset X in Δ+ • D' = (D - Δ-) ∪ Δ+ • σ'X = (σX - δX-) + δX+
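The two identities can be checked directly on a toy example (transaction ids and contents are hypothetical):

```python
# D' = (D - Δ-) u Δ+ and σ'X = (σX - δX-) + δX+ on a toy database.
# Transactions carry ids so a deletion removes a specific row.
D       = {1: {"A", "B"}, 2: {"A"}, 3: {"B", "C"}, 4: {"A", "B", "C"}}
removed = {1}                        # Δ-: ids of deleted transactions
added   = {5: {"A", "B"}}            # Δ+: newly inserted transactions

D_new = {i: t for i, t in D.items() if i not in removed}
D_new.update(added)                  # D' = (D - Δ-) u Δ+

def supp(itemset, db):
    return sum(1 for t in db.values() if itemset <= t)

X = {"A", "B"}
sigma       = supp(X, D)                            # σX  = 2
delta_minus = supp(X, {i: D[i] for i in removed})   # δX- = 1
delta_plus  = supp(X, added)                        # δX+ = 1

print(supp(X, D_new) == (sigma - delta_minus) + delta_plus)  # True
```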
FUP2 Algorithm • Addresses the maintenance problem • Apriori fails to reuse the old mining results • FUP2 reduces the amount of work that needs to be done • FUP2 works similarly to Apriori, generating large itemsets iteratively • For old large itemsets, it scans only the updated part of the database • For the rest, it scans the whole database
FUP2 Algorithm • Finds the large itemsets iteratively by reusing the results of the previous mining • At iteration k: • Use the new large (k-1)-itemsets L'k-1 (w.r.t. D') to find the candidate itemsets of size k, Ck • Find the support counts of the candidate itemsets in Ck • Divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck - Pk • For X in Pk, calculate σ'X = (σX - δX-) + δX+ • For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|) * s% • For the remaining candidates X in Qk, scan D* to find their counts and add δX+ to get σ'X
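One iteration of the scheme above can be sketched as below. All names are illustrative (not the paper's), and `count_in_unchanged` stands for the scan of D* that only surviving Qk candidates pay for:

```python
def fup2_iteration(cands, old_large, old_count, dplus, dminus,
                   count_in_unchanged, n_added, n_removed, n_new, s):
    """One FUP2 iteration at level k (sketch; helper names are mine).

    cands: candidate k-itemsets Ck; old_large: Lk from the old run;
    old_count[X] = σX; dplus[X] = δX+; dminus[X] = δX-;
    count_in_unchanged(X): support count of X in D* (the expensive scan).
    """
    new_large = {}
    P = cands & old_large            # Pk = Ck ∩ Lk: old counts known
    Q = cands - old_large            # Qk = Ck - Pk: old counts unknown
    for X in P:
        # σ'X = (σX - δX-) + δX+ : only Δ+ and Δ- are scanned
        c = old_count[X] - dminus[X] + dplus[X]
        if c >= n_new * s:
            new_large[X] = c
    for X in Q:
        # prune candidates that cannot have gained enough support
        if dplus[X] - dminus[X] < (n_added - n_removed) * s:
            continue
        # only surviving candidates pay for a scan of D*
        c = count_in_unchanged(X) + dplus[X]
        if c >= n_new * s:
            new_large[X] = c
    return new_large
```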
Problem 3: Find the Difference between the Old and New Association Rules • Before doing the update to find L', we want to know the difference between L and L' • Symmetric difference: measures how many large itemsets have been added or deleted after the database update • If too large => time to update the association rules • If too small => the old association rules are a good approximation for the updated database
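One way to turn the symmetric difference into a single "how much changed" number is to normalize it by the union of the two sets (a hypothetical normalization for illustration; the paper's exact measure may differ):

```python
def change_ratio(old_large, new_large):
    """|L Δ L'| / |L u L'|: fraction of itemsets added or deleted."""
    union = old_large | new_large
    if not union:
        return 0.0
    return len(old_large ^ new_large) / len(union)

L_old = {frozenset({"A"}), frozenset({"B"}), frozenset({"A", "B"})}
L_new = {frozenset({"A"}), frozenset({"B"}), frozenset({"B", "C"})}
print(change_ratio(L_old, L_new))  # 0.5  (2 changed out of 4 distinct)
```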
DELI Algorithm • Difference Estimation for Large Itemsets • Purpose: to estimate the difference between the association rules in a database before and after it is updated • Decides whether to update the association rules • Key idea: it approximates an upper bound on the size of the change in the association rules by examining samples of the database • Advantage: DELI saves machine resources and time
DELI Algorithm • Input: the old support counts, D, Δ+ and Δ- • Output: a Boolean value indicating whether a rule update is needed • Iterative algorithm: construct Ck from ~Lk-1, an approximation of L'k-1 • In each iteration, estimate the support counts of the itemsets in Ck using a sample S of m random transactions drawn from database D
DELI Algorithm – Step 1 • Obtain a random sample S of size m from database D • In each iteration: • generate a candidate set: Ck = I (the set of all 1-itemsets) if k = 1, Ck = apriori_gen(~Lk-1) if k > 1 • divide Ck into 2 partitions: Pk = Ck ∩ Lk and Qk = Ck - Pk
DELI Algorithm – Step 2 • Pk - the itemsets of size k that were large (> |D|*s%) in the old database and are potentially large in the new one • For each itemset X ∈ Pk: • σ'X = (σX - δX-) + δX+ (scan only Δ- and Δ+) • If σ'X >= |D'| * s%, then add X to Lk» (Lk» - itemsets large in both the old and new databases)
DELI Algorithm – Step 3 • Qk - the itemsets of size k that were not large (< |D|*s%) in the old database but are potentially large in the new one (> |D'|*s%) • For each itemset X ∈ Qk: • If (δX+ - δX-) <= (|Δ+| - |Δ-|) * s%, then delete X from Qk • This prunes away candidate itemsets whose support counts cannot be large (> |D'|*s%) in the new database • For each remaining itemset X ∈ Qk: • Find the support count of X in the sample S, TX (a binomially distributed random variable) • Estimate the support count of X in D, σX, and obtain an interval [aX, bX] with 100(1-α)% confidence • σ'X ∈ [aX + ΔX, bX + ΔX], where ΔX = δX+ - δX- • Reason: σ'X = σX + (δX+ - δX-)
DELI Algorithm – Step 3 (cont.) • For each itemset X ∈ Qk: • Compare the estimated interval σ'X ∈ [aX + ΔX, bX + ΔX] with |D'|*s% • Lk> - itemsets that were not large in D but are large in D' with 100(1-α)% confidence (the whole interval lies above |D'|*s%) • Lk≈ - itemsets that were not large in D and may be large in D' (the interval contains |D'|*s%)
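The interval construction and comparison in Step 3 can be sketched with a normal approximation to the binomial TX (z = 1.96 for ~95% confidence; function and argument names are mine, not the paper's):

```python
import math

def estimate_new_support(T_x, m, d_size, delta_plus, delta_minus, z=1.96):
    """Interval estimate for σ'X from a sample of m transactions of D.

    T_x: support count of X in the sample S (binomially distributed).
    Returns (lo, hi) such that σ'X lies in [lo, hi] with ~100(1-α)% confidence.
    """
    p_hat = T_x / m                           # estimated support fraction
    sigma_hat = p_hat * d_size                # point estimate of σX
    # normal approximation to the binomial for the interval half-width
    half = z * d_size * math.sqrt(p_hat * (1 - p_hat) / m)
    a_x, b_x = sigma_hat - half, sigma_hat + half
    d = delta_plus - delta_minus              # exact change, from Δ+ and Δ-
    return a_x + d, b_x + d                   # σ'X = σX + (δX+ - δX-)
```

An itemset is then declared large with that confidence if the lower endpoint is at least |D'| * s%, and left as uncertain if the interval straddles the threshold.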
DELI Algorithm – Step 4 • Obtain the estimated set of large itemsets of size k: ~Lk = Lk» ∪ Lk> ∪ Lk≈ • Lk» - large in D, large in D' (Step 2) • Lk> - not large in D, large in D' with a certain confidence (Step 3) • Lk≈ - not large in D, maybe large in D' (Step 3) • ~Lk is an approximation of the new Lk. Since ~Lk is an overestimate, the difference between ~Lk and the old Lk gives an upper bound on the change.
DELI Algorithm – Step 5 • Decide whether an association rule update is needed • IF the uncertainty (|Lk≈| / |~Lk|) is too large => DELI halts, update is needed • IF the symmetric difference of the large itemsets is too large => DELI halts, update is needed • IF ~Lk is empty => DELI halts, no update is necessary • IF ~Lk is non-empty => k = k + 1, go to Step 1
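The Step-5 decision for a single iteration can be sketched as follows. Threshold values and the normalization of the difference are illustrative, and the paper accumulates the difference over all iterations rather than per level:

```python
def deli_decision(L_approx, L_uncertain, L_old, max_uncertainty, max_diff):
    """Decide after iteration k: "update", "stop", or "continue".

    L_approx    = ~Lk = Lk» u Lk> u Lk≈ (overestimate of the new Lk)
    L_uncertain = Lk≈ (itemsets the sample could not resolve)
    L_old       = Lk from the old mining result
    """
    if not L_approx:
        return "stop"                  # ~Lk empty: no update necessary
    if len(L_uncertain) / len(L_approx) > max_uncertainty:
        return "update"                # sample too inconclusive
    diff = len(L_approx ^ L_old) / len(L_approx | L_old)
    if diff > max_diff:
        return "update"                # estimated change too large
    return "continue"                  # k = k + 1, go to Step 1
```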
DELI Algorithm – Example • |D| = 1,000,000, |Δ-| = 9000, |Δ+| = 10000, s% = 2%
DELI Algorithm – Example • k=1: • C1 = {A, B, C, D, E, F}, P1 = {A, B, C, D, E}, Q1 = {F} • P1: |D'|*s% = 20020 => L1» = {A, B, C, D, E} • Q1: (δF+ - δF-) = 17, (|Δ+| - |Δ-|)*s% = 20, 17 < 20 => drop F • ~L1 = L1» = {A, B, C, D, E} • Update? No. • k = 2, proceed to Step 1
DELI Algorithm – Example • k=2: • ~L1 = {A, B, C, D, E}, P2 = {AB, AC, AD, AE, BC, BD, CD}, Q2 = {BE, CE, DE} • P2: |D'|*s% = 20020 => L2» = {AB, AC, AD, BC, BD, CD} • Q2: drop CE, DE, because (δX+ - δX-) <= (|Δ+| - |Δ-|)*s% • For BE: assume the support count of BE in S is TX = 202 => estimated σX = 20200 • 95% confidence interval for σX: [20200 - 2757, 20200 + 2757] • Confidence interval for σ'X: [17677, 23191] • 17677 < |D'|*s% < 23191 => L2≈ = {BE}, L2> = Ø
DELI Algorithm – Example • k=2: • Step 4: ~L2 = {AB, AC, AD, BC, BD, CD, BE} • Step 5: Update? No (uncertainty = 1/7, difference = 2/15). k = 3, proceed to Step 1 • k=3: … • Step 4: ~L3 = {ABC, ACD, BCD} • Step 5: Update? No (uncertainty = 0, difference = 2/15) • k=4: C4 = Ø => STOP • Returns: False (no update of the association rules is needed)
Experimental Results • Synthetic databases – generate D, Δ+, Δ- • Use Apriori to find the large itemsets • FUP2 is invoked to find the large itemsets in the updated database – record the time • Run DELI – record the time • |D| = 100000, |Δ+| = |Δ-| = 5000, confidence = 95%, s% = 2%, m = 20000
Experimental Results 90% ----------level of confidence--------- 99%
Conclusions • Real-world databases are updated constantly, therefore the knowledge extracted from them changes too • The authors proposed the DELI algorithm to determine whether the change is significant, and hence when to update the extracted association rules • The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the new large itemsets
Final Exam Questions • Q1: Compare and contrast FUP2 and DELI • Both algorithms are used in association analysis • Goal: DELI decides when to update the association rules, while FUP2 provides an efficient way of updating them • Technique: DELI scans a small portion of the database (a sample) and approximates the large itemsets, whereas FUP2 scans the whole database and returns the large itemsets exactly • DELI saves machine resources and time
Final Exam Questions • Q2: What does DELI stand for? Difference Estimation for Large Itemsets • Q3: Difference between Apriori and FUP2: • Apriori scans the whole database to find the association rules and does not use the old mining results • For most itemsets, FUP2 scans only the updated part of the database, taking advantage of the old association analysis results