Model Maintenance in Dynamic Environments

Venkatesh Ganti (joint work with Raghu Ramakrishnan, Johannes Gehrke, and Mong Li Lee)

Mining Environment
• Data repository for analysis
  • Data mining models
    • Frequent itemsets
    • Decision trees
    • Clusters
    • …
  • OLAP
    • Aggregate queries
• Repository updated regularly
• Query workloads change
[Figure: a data warehouse feeding data mining and OLAP]
DEMON

Two Parts of this Talk
• Model maintenance: maintaining models under systematic data evolution [ICDE 00]
• Tuning samples: maintaining samples for approximate query answering with respect to changing query workloads [VLDB 00]

Systematic Block Evolution
• Data warehouses are updated with blocks of new data
• Block: a set of tuples appended simultaneously to the data warehouse
• Result: a sequence of database snapshots
[Figure: a block d is appended to the database D, giving D+d]

Model Maintenance: Objective
• Allow selection of interesting time-varying subsets to be modeled
• Low response time to get the updated model
• Interesting classes of models
  • Frequent itemsets (LITS)
  • Clusters
  • Decision trees (DT)

Subset selection: Data Span
• Span of interest
  • Everything until now: unrestricted window
  • Recently collected: most recent window
• Unrestricted Window (UW)
  • Model the entire database
[Figure: the model M(D1+D2+D3) on blocks D1, D2, D3 becomes M(D1+D2+D3+D4) once D4 is appended]

Data Span (contd.)
• Most Recent Window (MRW) of size w
  • E.g., model data collected in the last 3 days
[Figure: sliding-window models. With w = 3, the model moves from M(D1+D2+D3) to M(D2+D3+D4) to M(D3+D4+D5) as D4 and D5 arrive]

Block Selection Sequence
• Example: maintain models on data collected on alternate days within the last 4 weeks
• Requires fine-grained selection
• Block selection sequence (BSS)
  • A 0/1 sequence: one bit for each block in the data span
  • 1: the block is selected for modeling
  • 0: the block is not selected for modeling

BSS: UW
• A sequence of 0/1 bits, one for each block in the entire database
• E.g., select all blocks collected on alternate days
[Figure: blocks D1 to D5 with BSS 1 0 1 0 1]

BSS: MRW
• Two types of BSS w.r.t. MRW
  • Window-independent
  • Window-relative
• Window-independent example: model data collected on Mondays within the last 4 weeks
  • BSS: (1000000)*
[Figure: with BSS 1 0…0 1 0…0 1 ..., the model is M(D1+D8) on blocks D1 to D14 and M(D1+D8+D15) once D15 is appended]

BSS: MRW (contd.)
• Window-relative BSS
  • Model all data collected on alternate days from the start in a window of size 3
  • BSS: 101
• Here, each successive subset is disjoint from its predecessor
[Figure: the BSS [1 0 1] is applied to the window D1 D2 D3, then D2 D3 D4, then D3 D4 D5]
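
The three kinds of block selection described on the BSS slides can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, blocks are represented by their labels, and the window size w is assumed to be at least 1.

```python
from itertools import cycle, islice


def select_uw(blocks, bits):
    """Unrestricted window: one bit per block in the whole database."""
    return [b for b, bit in zip(blocks, bits) if bit == 1]


def select_mrw_window_independent(blocks, w, pattern):
    """MRW with a window-independent BSS.

    Each bit is tied to a block's absolute position (the pattern repeats),
    so a block keeps its bit as the window slides over it.
    """
    n = len(blocks)
    bits = list(islice(cycle(pattern), n))
    start = max(0, n - w)
    return [blocks[i] for i in range(start, n) if bits[i] == 1]


def select_mrw_window_relative(blocks, w, bss):
    """MRW with a window-relative BSS.

    Bit i applies to the i-th block of the current window, so the same
    block can be selected in one window and skipped in the next.
    """
    window = blocks[-w:]
    return [b for b, bit in zip(window, bss) if bit == 1]
```

With blocks D1 to D15, the pattern 1000000, and a window of 28 blocks (an assumed encoding of "the last 4 weeks" with one block per day), the window-independent variant selects D1, D8, and D15. The window-relative variant with BSS 101 selects D1 and D3, then D2 and D4, then D3 and D5 as the window slides.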

Model Maintenance: Enumeration
[Table: model classes LITS, Clustering, DT (columns) against UW:BSS and MRW:BSS (rows)]
• MRW:BSS includes both window-independent and window-relative block selection sequences.

Model Maintenance: Algorithms
[Table: model classes LITS, Clustering, DT (columns) against UW:BSS and MRW:BSS (rows); entry: GEMM(A)]
• GEMM: GEneric Model Maintenance, an algorithm for any class of models that has an incremental maintenance algorithm A under tuple insertions

Maintenance under Insertions
• Algorithm A
  • Input: old dataset D, old model M(D), a block of tuples d appended to D
  • Output: M(D+d) = A(D, d, M(D))
• Such algorithms exist for
  • Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
  • Clusters (BIRCH)
  • Decision trees (BOAT)
• Note: We do NOT require A to handle deletions!

GEMM
• Input
  • Data span (and window size for MRW)
  • BSS
  • A model-update algorithm A under tuple insertions (deletions not required)
• Output
  • An efficient model maintenance algorithm

GEMM: MRW
• Assume BSS is a sequence of 1's and w = 3
• We already know parts of future windows
[Figure: at T3 the window is D1 D2 D3 with model M(D1+D2+D3); at T4 it is D2 D3 D4 with M(D2+D3+D4); at T5 it is D3 D4 D5 with M(D3+D4+D5)]

GEMM: MRW (contd.)
• Idea: start building models for future windows
• E.g., at T3 we maintain models on
  • <D1+D2+D3> (model required for window at T3)
  • <D2+D3> (partial model for window at T4)
  • <D3> (partial model for window at T5)
• Models at T3: M<D1+D2+D3>, M<D2+D3>, M<D3>
• Models at T4
  • M<D2+D3+D4> (for window at T4), updated immediately
  • M<D3+D4> (for window at T5), updated offline
  • M<D4> (for window at T6), updated offline

GEMM: Arbitrary BSS
• BSS: 1 0 1 0 1 ... over blocks D1, D2, D3, D4, D5, ...
• T3: model on <1.D1 + 0.D2 + 1.D3>
• T4: D4 is appended; model on <0.D2 + 1.D3 + 0.D4>
• T5: D5 is appended; model on <1.D3 + 0.D4 + 1.D5>
• Idea: we still know parts of future windows and the corresponding BSS for each of them
• E.g., at T3 we maintain models on
  • <1.D1+0.D2+1.D3> (model required at T3)
  • <0.D2+1.D3> and <1.D3>, which are identical
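
The GEMM idea for a most recent window can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the class name `GemmMRW`, the `selected(start, t)` callback, and the toy incremental algorithm `count_items` (whose "model" is just a count of item occurrences) are all hypothetical. The sketch also applies every update synchronously, whereas the slides update one model immediately and the rest offline.

```python
from collections import Counter


def count_items(old_blocks, new_block, old_model):
    """Toy stand-in for algorithm A: M(D+d) = A(D, d, M(D)).

    The model is a Counter of item occurrences. old_blocks mirrors the
    slide's signature, but this toy algorithm does not need it.
    """
    model = Counter(old_model)
    for transaction in new_block:
        model.update(transaction)
    return model


class GemmMRW:
    """Hypothetical GEMM-style maintainer for a most recent window of size w."""

    def __init__(self, w, selected, update, empty_model=Counter):
        self.w = w
        self.selected = selected      # selected(start, t): is block t chosen in the window starting at block start?
        self.update = update          # incremental algorithm A (insertions only)
        self.empty_model = empty_model
        self.t = 0                    # 1-based index of the latest block
        self.models = {}              # window start -> (blocks folded in, model)

    def append(self, block):
        """Fold a new block into every window that contains it."""
        self.t += 1
        t = self.t
        self.models[t] = ([], self.empty_model())  # future window starting at t
        self.models.pop(t - self.w, None)          # window that ended at t - 1
        for start in sorted(self.models):
            if self.selected(start, t):
                blocks, model = self.models[start]
                self.models[start] = (blocks + [block],
                                      self.update(blocks, block, model))
        return self.current_model()

    def current_model(self):
        """Model for the current window (call after at least one append)."""
        return self.models[max(1, self.t - self.w + 1)][1]
```

At most w models are alive at any time, and no deletion is ever applied to a model: an expired window's model is simply dropped. For a BSS of all 1's, `selected` can be `lambda start, t: True`; for the window-independent sequence 1 0 1 0 1 ... it can be `lambda start, t: t % 2 == 1`; a window-relative BSS would use `lambda start, t: bss[t - start] == 1`.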

GEMM: Resource Requirements
• Response time to new model
  • Updating one model with the new block
  • Other updates offline
  • Depends on the incremental algorithm
• Space requirements
  • At most w models
  • Space required for a model is orders of magnitude less than that for data!

Frequent Itemset Models
• Set of customer transactions
• Frequent itemset: a set of items purchased together by “many” customers
• Example: minimum frequency threshold = 50%; {b}, {c}, {a,c} are frequent itemsets
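
The definition above can be made concrete with a small brute-force sketch. The transaction table from the slide is not available here, so the data in the usage note is made up for illustration, and reading "many" as "at least the threshold fraction of transactions" is an assumption.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_freq):
    """Return {itemset: count} for itemsets whose frequency is >= min_freq.

    Brute force, level by level. The search stops at the first size with no
    frequent itemset, which is safe because every subset of a frequent
    itemset is itself frequent.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            needed = set(candidate)
            count = sum(1 for t in transactions if needed <= t)
            if count / n >= min_freq:
                result[frozenset(candidate)] = count
                found = True
        if not found:
            break
    return result
```

For the hypothetical transactions {a,c}, {b,c}, {b}, {a,c} and a 50% threshold, this returns {a}, {b}, {c}, and {a,c}; the itemset {b,c} occurs in only one of four transactions and is not frequent.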

Incremental Algorithm [FAAM97, TBAR97]
• Input
  • Old dataset
  • Old set of frequent itemsets
  • New block D4
• Steps
  • Detect if new itemsets become frequent
  • Count frequencies of a small number of itemsets
    • Current algorithms scan (D1+D2+D3) completely
  • Update model…
[Figure: blocks D1, D2, D3 and the new block D4]

ECUT: New Counting Algorithm
• Transformed data representation
• Within each block Di
  • item x: sorted list of transaction identifiers containing “x”, TID-list(x)
• Example
  • TID-list(a) = {1}
  • TID-list(b) = {2,3}
  • TID-list(c) = {1,2}
  • Count({a,b}) = |TID-list(a) ∩ TID-list(b)|
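
The counting step can be sketched as an intersection of sorted TID-lists, summed over blocks. This is a minimal illustration of the idea on the slide rather than the ECUT implementation itself; the function names are hypothetical, and the example transactions are the ones implied by the three TID-lists above (assuming a, b, and c are the only items).

```python
def build_tid_lists(block):
    """Map each item to the sorted list of TIDs of transactions containing it.

    block is a list of (tid, items) pairs.
    """
    tid_lists = {}
    for tid, items in sorted(block, key=lambda pair: pair[0]):
        for item in items:
            tid_lists.setdefault(item, []).append(tid)
    return tid_lists


def intersect_sorted(xs, ys):
    """Merge-style intersection of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def count_in_block(itemset, tid_lists):
    """Number of transactions in one block containing every item of itemset."""
    lists = [tid_lists.get(item, []) for item in itemset]
    if not lists:
        raise ValueError("itemset must be non-empty")
    common = lists[0]
    for other in lists[1:]:
        common = intersect_sorted(common, other)
    return len(common)


def count(itemset, tid_lists_per_block):
    """Total count of itemset: the per-block counts added up."""
    return sum(count_in_block(itemset, t) for t in tid_lists_per_block)
```

Only the TID-lists of the items in the itemset are read, so counting a small itemset touches a small part of each block instead of scanning every transaction.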

Experimental Comparison

Comparing Count Times

Summary of the first part
[Table: model classes LITS, Clustering, DT (columns) against UW:BSS and MRW:BSS (rows); entry: GEMM]
• Maintenance algorithms under tuple insertions
  • Frequent itemsets: ECUT, ECUT+
  • Clusters: BIRCH
  • Decision trees: BOAT

Second Part of this Talk
• Model maintenance: maintaining models under systematic data evolution [ICDE 00]
• Maintaining samples with respect to changing query workloads [VLDB 00]

Random Samples for AQUA
• All tuples in R are assumed to be equally important while drawing S(R)
• In practice, queries exhibit locality
• Consequence: S(R) wastes precious real estate
[Figure: typical AQUA approach. An aggregate query Q gets an exact answer from R, or an approximate answer from the uniform random sample S(R)]

Problem
• Given
  • Relation R
  • Workload W: Q1,…,Qn
• Goal: dynamically tune “random sample of R” w.r.t. W
• Model to be maintained: a simple random sample

ICICLES
• R(Q): set of tuples in R required to answer Q
• Icicle SW(R): a uniform random sample of R ∪ R(Q1) ∪ … ∪ R(Qn)
• Tuples required often are more likely to be in SW(R)
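
One way to read the sampling rule above is sketched below. It assumes the union keeps duplicates, so a tuple needed by many queries appears many times in the extended relation; that assumption is what makes frequently required tuples more likely to be sampled. The function name and parameters are hypothetical, and the slide does not say how the real algorithm maintains the sample incrementally or treats repeated tuples, so none of that is modeled here.

```python
import random


def workload_tuned_sample(relation, query_results, k, seed=None):
    """Uniform sample of size k from R extended with R(Q1), ..., R(Qn).

    relation: the tuples of R.
    query_results: one collection per query Q, holding the tuples R(Q).
    Duplicates are kept, so the returned sample may repeat a tuple.
    """
    rng = random.Random(seed)
    extended = list(relation)
    for needed in query_results:
        extended.extend(needed)
    return rng.sample(extended, min(k, len(extended)))
```

With an empty workload this reduces to a plain uniform random sample of R. As queries concentrate on a region of R, the sample shifts toward that region.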

Mail Order Dataset

Conclusions and Future Work
[Figure labels: static dataset, dynamic dataset; workload indifferent, workload sensitive]

Questions?
