Model Maintenance in Dynamic Environments
Venkatesh Ganti
(Joint work with Raghu Ramakrishnan, Johannes Gehrke, Mong Li Lee)

Mining Environment
• Data repository for analysis
• Data mining models
  • Frequent itemsets
  • Decision trees
  • Clusters
  • …
• OLAP
  • Aggregate queries
• Repository updated regularly
• Query workloads change
[Figure labels: Data Mining, Data Warehouse, OLAP]
DEMON

Two Parts of this Talk
• Model maintenance: maintaining models under systematic data evolution [ICDE 00]
• Tuning samples: maintaining samples for approximate query answering with respect to changing query workloads [VLDB 00]

Systematic Block Evolution
• Data warehouses are updated with blocks of new data
• Block: a set of tuples appended simultaneously to the data warehouse
• Result: a sequence of database snapshots
[Figure: a block d is appended to the database D, giving D+d]

Model Maintenance: Objective
• Allow selection of interesting time-varying subsets to be modeled
• Low response time to get the updated model
• Interesting classes of models
  • Frequent itemsets (LITS)
  • Clusters
  • Decision trees (DT)

Subset Selection: Data Span
• Span of interest
  • Everything until now: unrestricted window
  • Recently collected: most recent window
• Unrestricted Window (UW)
  • Model the entire database
[Figure: with blocks D1 D2 D3 the model is M(D1+D2+D3); once D4 is appended it becomes M(D1+D2+D3+D4)]

Data Span (contd.)
• Most Recent Window (MRW) of size w
  • E.g., model data collected in the last 3 days
[Figure: sliding-window models for w = 3: M(D1+D2+D3), then M(D2+D3+D4), then M(D3+D4+D5)]

Block Selection Sequence
• Example: maintain models on data collected on alternate days within the last 4 weeks
• This requires fine-grained selection
• Block selection sequence (BSS)
  • A 0/1 sequence: one bit for each block in the data span
  • 1: the block is selected for modeling
  • 0: the block is not selected for modeling

BSS: UW
• A sequence of 0/1 bits, one for each block in the entire database
• E.g., select all blocks collected on alternate days
[Figure: BSS 1 0 1 0 1 over blocks D1 D2 D3 D4 D5]

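As a minimal sketch of how a BSS picks blocks under the unrestricted window, the snippet below applies the alternate-day bits from the slide to five blocks. The function name `select_blocks` and the use of strings as stand-ins for blocks are illustrative assumptions, not part of the talk.

```python
def select_blocks(blocks, bss):
    """Return the blocks whose BSS bit is 1 (one bit per block in the span)."""
    if len(blocks) != len(bss):
        raise ValueError("BSS needs exactly one bit per block")
    return [block for block, bit in zip(blocks, bss) if bit == 1]


# Strings stand in for the actual blocks of tuples (an assumption for brevity).
blocks = ["D1", "D2", "D3", "D4", "D5"]
selected = select_blocks(blocks, [1, 0, 1, 0, 1])
print(selected)  # ['D1', 'D3', 'D5']
```
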
BSS: MRW
• Two types of BSS w.r.t. MRW
  • Window-independent
  • Window-relative
• Window-independent example: model data collected on Mondays within the last 4 weeks
  • BSS: (1000000)*
[Figure: bits 1 0…0 1 0…0 1 ... over D1, D2-D7, D8, D9-D14, D15; the model is M(D1+D8) with blocks up to D14 and M(D1+D8+D15) once D15 arrives]

BSS: MRW (contd.)
• Window-relative BSS
  • Model all data collected on alternate days, counted from the start of a window of size 3
  • BSS: 101
• Here, each successive subset is disjoint from its predecessor
[Figure: BSS [1 0 1] applied to the windows D1 D2 D3, then D2 D3 D4, then D3 D4 D5]

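The difference between the two kinds of BSS for a most recent window can be sketched as follows. This is an illustration under assumptions: blocks are numbered from 1, one block arrives per day, and the function names are hypothetical. In the window-independent case a bit stays attached to its block; in the window-relative case the bits are attached to positions inside the window and slide with it.

```python
def most_recent_window(t, w):
    """Block numbers (1-based) in the most recent window after t blocks."""
    return list(range(max(1, t - w + 1), t + 1))


def window_independent(t, w, pattern):
    """Bits are tied to blocks: block i uses pattern[(i - 1) % len(pattern)]."""
    return [i for i in most_recent_window(t, w)
            if pattern[(i - 1) % len(pattern)] == 1]


def window_relative(t, w, bss):
    """Bits are tied to positions inside the window, so they slide with it."""
    return [i for i, bit in zip(most_recent_window(t, w), bss) if bit == 1]


# Mondays within the last 4 weeks (w = 28), after 15 daily blocks.
mondays = window_independent(15, 28, [1, 0, 0, 0, 0, 0, 0])
# BSS 101 over a window of size 3, at T3, T4 and T5.
alternate = [window_relative(t, 3, [1, 0, 1]) for t in (3, 4, 5)]
print(mondays)    # [1, 8, 15]
print(alternate)  # [[1, 3], [2, 4], [3, 5]]
```

The second result shows the disjointness noted on the slide: blocks {1, 3} at T3 share nothing with blocks {2, 4} at T4.
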
Model Maintenance: Enumeration
[Table: columns LITS, Clustering, DT; rows UW:BSS, MRW:BSS]
• MRW:BSS includes both window-independent and window-relative block selection sequences

Model Maintenance: Algorithms
[Table: columns LITS, Clustering, DT; rows UW:BSS, MRW:BSS; entry GEMM(A)]
• GEMM: GEneric Model Maintenance
• An algorithm for any class of models that has an incremental maintenance algorithm A under tuple insertions

Maintenance under Insertions
• Algorithm A
  • Input: old dataset D, old model M(D), a block of tuples d appended to D
  • Output: M(D+d) = A(D, d, M(D))
• Such algorithms exist for
  • Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
  • Clusters (BIRCH)
  • Decision trees (BOAT)
• Note: we do NOT require A to handle deletions!

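To make the contract M(D+d) = A(D, d, M(D)) concrete, here is a toy sketch in which the "model" is just a table of item counts. This stand-in model and the names `build_model` and `A` are assumptions for illustration; it is not one of the algorithms listed above (ECUT, BIRCH, BOAT). The point is only that the incremental result matches a rebuild from scratch and that no deletions are needed.

```python
from collections import Counter


def build_model(dataset):
    """Build the toy model from scratch: how many transactions contain each item."""
    model = Counter()
    for transaction in dataset:
        model.update(set(transaction))
    return model


def A(D, d, model):
    """Incremental update under insertions: returns M(D+d) from M(D) and d.

    D is part of the interface on the slide; this toy model does not
    need to rescan it.
    """
    updated = Counter(model)
    for transaction in d:
        updated.update(set(transaction))
    return updated


D = [{"a", "c"}, {"b", "c"}]   # old dataset (made-up transactions)
d = [{"b"}]                    # newly appended block
incremental = A(D, d, build_model(D))
from_scratch = build_model(D + d)
print(incremental == from_scratch)  # True
```
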
GEMM
• Input
  • Data span (and window size for MRW)
  • BSS
  • A model-update algorithm A under tuple insertions (deletions not required)
• Output
  • An efficient model maintenance algorithm

GEMM: MRW
• Assume the BSS is a sequence of 1's and w = 3
• We already know parts of future windows
[Figure: at T3 the window is D1 D2 D3 with model M(D1+D2+D3); at T4 it is D2 D3 D4 with M(D2+D3+D4); at T5 it is D3 D4 D5 with M(D3+D4+D5)]

GEMM: MRW (contd.)
• Idea: start building models for future windows
• E.g., at T3 we maintain models on
  • <D1+D2+D3> (model required for the window at T3)
  • <D2+D3> (partial model for the window at T4)
  • <D3> (partial model for the window at T5)
• Models at T4, after D4 is appended
  • M<D2+D3+D4> (for the window at T4; updated immediately)
  • M<D3+D4> (partial model for the window at T5; updated offline)
  • M<D4> (partial model for the window at T6; updated offline)

GEMM: Arbitrary BSS
• BSS 1 0 1 0 1 ... over blocks D1 D2 D3 D4 D5
• T3: model on <1.D1 + 0.D2 + 1.D3>
• T4: D4 is appended; model on <0.D2 + 1.D3 + 0.D4>
• T5: D5 is appended; model on <1.D3 + 0.D4 + 1.D5>
• Idea: we still know parts of future windows and the corresponding BSS for each of them
• E.g., at T3 we maintain models on
  • <1.D1+0.D2+1.D3> (model required at T3)
  • <0.D2+1.D3>
  • <1.D3> (identical to the previous model, because D2 is not selected)

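The GEMM idea from the preceding slides can be sketched as below: keep one model per window that has already started, fold each new block into every model whose BSS selects it, and drop the model whose window has expired. This is a simplified illustration, not the authors' implementation. The class name `Gemm`, the count-based stand-in for algorithm A, and the per-block selection predicate are assumptions, and all models are updated synchronously here, whereas the talk updates only the current model immediately and the rest offline.

```python
from collections import Counter


def update_counts(model, block):
    """Stand-in for algorithm A: fold a block of items into a count model."""
    updated = Counter(model)
    updated.update(block)
    return updated


class Gemm:
    """Keeps at most w models: one per window that has already started."""

    def __init__(self, w, update, empty_model, is_selected):
        self.w = w
        self.update = update            # incremental algorithm A (insertions only)
        self.empty_model = empty_model  # callable returning an empty model
        self.is_selected = is_selected  # block number (1-based) -> bool
        self.models = []                # models[0] serves the current window
        self.t = 0

    def append(self, block):
        self.t += 1
        # A new window starts at this block.
        self.models.append(self.empty_model())
        # The oldest window no longer covers the newest block; drop its model.
        if len(self.models) > self.w:
            self.models.pop(0)
        if self.is_selected(self.t):
            self.models = [self.update(m, block) for m in self.models]
        return self.models[0]


bits = [1, 0, 1, 0, 1]  # BSS from the slide, one bit per block
gemm = Gemm(3, update_counts, Counter, lambda t: bits[t - 1] == 1)
blocks = [["a"], ["b"], ["c"], ["d"], ["e"]]  # made-up contents of D1..D5
history = [dict(gemm.append(block)) for block in blocks]
print(history[2:])  # models at T3, T4, T5
```

With one distinct item per block, the models at T3, T4 and T5 contain exactly the items of the selected blocks: D1 and D3, then D3 alone, then D3 and D5.
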
GEMM: Resource Requirements
• Response time to new model
  • Updating one model with the new block
  • Other updates happen offline
  • Depends on the incremental algorithm
• Space requirements
  • At most w models
  • Space required for a model is orders of magnitude less than that for the data!

Frequent Itemset Models
• Set of customer transactions
• Frequent itemset: a set of items purchased together by “many” customers
• Example: with a minimum frequency threshold of 50%, {b}, {c}, and {a,c} are frequent itemsets

Incremental Algorithm [FAAM97, TBAR97]
• Input
  • Old dataset
  • Old set of frequent itemsets
  • New block D4
• Steps
  • Detect if new itemsets become frequent
  • Count frequencies of a small number of itemsets
  • Current algorithms scan (D1+D2+D3) completely
  • Update model…
[Figure: old blocks D1 D2 D3 and the new block D4]

ECUT: New Counting Algorithm
• Transformed data representation
• Within each block Di
  • For item x: the sorted list of identifiers of transactions containing x, TID-list(x)
• Example
  • TID-list(a) = {1}
  • TID-list(b) = {2,3}
  • TID-list(c) = {1,2}
  • Count({a,b}) = |TID-list(a) ∩ TID-list(b)|

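The counting step can be illustrated with the TID-lists from the slide. This is a sketch of the intersection idea only: it uses Python sets rather than a merge over sorted lists, the function names are hypothetical, and the second block in the example is made up to show how per-block counts add up.

```python
def count_in_block(tid_lists, itemset):
    """Count transactions in one block that contain every item of itemset."""
    if not itemset:
        raise ValueError("itemset must contain at least one item")
    # An item missing from the block has an empty TID-list.
    tid_sets = [set(tid_lists.get(item, ())) for item in itemset]
    return len(set.intersection(*tid_sets))


def count_itemset(blocks, itemset):
    """Sum the per-block counts over all selected blocks."""
    return sum(count_in_block(tid_lists, itemset) for tid_lists in blocks)


block1 = {"a": [1], "b": [2, 3], "c": [1, 2]}  # TID-lists from the slide
block2 = {"b": [4], "c": [4, 5]}               # made-up second block

print(count_in_block(block1, {"a", "b"}))           # 0
print(count_in_block(block1, {"b", "c"}))           # 1
print(count_itemset([block1, block2], {"b", "c"}))  # 2
```

Only the TID-lists of the items in the itemset are read, so the rest of each block is never touched.
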
Experimental Comparison

Comparing Count Times

Summary of the First Part
[Table: columns LITS, Clustering, DT; rows UW:BSS, MRW:BSS; entry GEMM]
• Maintenance algorithms under tuple insertions
  • Frequent itemsets: ECUT, ECUT+
  • Clusters: BIRCH
  • Decision trees: BOAT

Second Part of this Talk
• Model maintenance: maintaining models under systematic data evolution [ICDE 00]
• Maintaining samples with respect to changing query workloads [VLDB 00]

Random Samples for AQUA
• All tuples in R are assumed to be equally important while drawing S(R)
• In practice, queries exhibit locality
• Consequence: S(R) wastes precious space in the sample
[Figure: typical AQUA approach. An aggregate query Q gets its exact answer from R and its approximate answer from a uniform random sample S(R)]

Problem
• Given
  • Relation R
  • Workload W: Q1,…,Qn
• Goal: dynamically tune a “random sample of R” w.r.t. W
• Model to be maintained: a simple random sample

ICICLES
• R(Q): set of tuples in R required to answer Q
• Icicle SW(R): a random sample of R ∪ R(Q1) ∪ … ∪ R(Qn)
• Tuples required often are more likely to be in SW(R)
[Figure: a uniform random sample drawn over R together with R(Q1), …, R(Qn) gives the icicle SW(R)]

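A rough sketch of the sampling idea follows, under two assumptions that the slide does not spell out: the union is treated as a multiset, so a tuple needed by several queries appears several times, and the extended relation is materialized in memory. The function names are hypothetical, and this is not the incremental maintenance procedure of the VLDB paper.

```python
import random


def extended_relation(relation, query_results):
    """Multiset union of R with R(Q1), ..., R(Qn) (assumed interpretation)."""
    extended = list(relation)
    for needed in query_results:
        extended.extend(needed)
    return extended


def workload_sample(relation, query_results, k, seed=None):
    """Draw a uniform random sample of size k from the extended relation."""
    extended = extended_relation(relation, query_results)
    rng = random.Random(seed)
    return rng.sample(extended, min(k, len(extended)))


relation = ["r1", "r2", "r3", "r4", "r5", "r6"]  # made-up tuples
query_results = [["r2", "r3"], ["r2"]]           # made-up R(Q1) and R(Q2)
extended = extended_relation(relation, query_results)
sample = workload_sample(relation, query_results, k=4, seed=0)
print(extended.count("r2"), len(extended))  # 3 9
```

Here `r2` fills three of the nine slots in the extended relation, so each draw is three times as likely to pick it as a tuple that no query needed.
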
Mail Order Dataset

Conclusions and Future Work
[Figure labels: static dataset, dynamic dataset; workload indifferent, workload sensitive]

Questions?