Optimizing Multiple Continuous Queries

Optimizing Multiple Continuous Queries Dissertation Defense Chun Jin Thesis Committee Jaime Carbonell (Chair) Christopher Olston, on leave at Yahoo! Research Jamie Callan Phil Hayes, Vivisimo, Inc. October 31, 2006, Carnegie Mellon

Emerging Stream Applications • Intelligence monitoring • Fraud detection • Onset epidemic patterns • Network intrusion detection • GeoSpatial change detection • Transactions • Senor network readings • Network traffic data Chun Jin Carnegie Mellon

ARGUS: Toward Collaborative Intelligence Analysis Stream Matching Novelty Detection Data Streams Ad hoc Query Matching Continuous Queries Ad hoc exploring New Continuous Queries Fraud Alerts New Patterns Terrorism Alerts New Connections Analyst A Analyst B Chun Jin Carnegie Mellon

Challenges • Large-Scale (~103) continuous queries • On FAST (104-105tuples/day) continuous streams • With LARGE (~106tuples) historical DBs. … but computation-sharable and highly-selective queries • Support stream processing for a broad range of queries on existing DB applications. … but DBMS technologies. Chun Jin Carnegie Mellon

Problems • Efficiency and scalability • Continuous query evaluation • Multiple/Large-scale queries • Practicality • Utilize DBMS legacy systems to support stream processing on a broad range of queries. Chun Jin Carnegie Mellon

Approaches • Efficiency and scalability • Incremental query evaluation • Incremental multiple query optimization (IMQO) • Query optimization • Practicality • Built atop DBMSs • Use SQL as the query language • Shows up-to hundreds-fold improvement (Details coming up) Selection/join queries Chun Jin Carnegie Mellon

Challenges to Multiple Query Optimization (MQO) Incremental MQO (IMQO) MQO is NP-hard! [Sellis90] … Q1 Q2 QK time … t1 t2 tK 0 Chun Jin Carnegie Mellon

Performing IMQO SELECT … FROM … WHERE … Query Network R • Index R • Identify common computations between QN and R • Select optimal sharing paths • Expand R with new computations Chun Jin Carnegie Mellon

Related Work • Efficiency and Scalability: • Incremental evaluation: Stream operators • Join(Rete) [Forgy82] [Urhan et al,00] [Viglas et al,03] • Aggregate [Haas et al,99] • IMQO: Stream Processing Projects • NiagaraCQ, TelegraphCQ [Chen et al,00] [Chandrasekaran et al,03] • STREAM, Aurora, Gigascope [Motwani et al,03] [Abadi et al,03] [Cranor et al,03] • ARGUS[Jin et al,05][Jin et al,06] • Practicality • Comprehensive IMQO framework • Richer query syntax and semantics • Canonicalization • More flexible plan structures • More general sharing strategies Chun Jin Carnegie Mellon

Thesis Statement • The thesis demonstrates constructively that incremental multiple query optimization, incremental evaluation, and other query optimization techniques provide very significant performance improvements for large-scale continuous queries. • The methods can function atop existing DBMS systems for maximal modularity and direct practical utility. • The methods work well across diverse applications. Chun Jin Carnegie Mellon

ARGUS Stream Processing ARGUS Execution Engine ARGUS Query Network Generator Data Tables System Catalog IMQO Module Input Streams Plan Instantiator SingleQuery Optimizer Query Network Code Assembler Register & initialize query network Register queries Chun Jin Carnegie Mellon Analyst Result streams

Query Network Generator Parser ARGUS Query Network Generator Canonicalizer System Catalog Incremental Multi-Query Optimizer Index & Search Interface ARGUS Manager SQL Query Plan Instantiator Query Rewriter Single-Query Optimizer Code Assembler Initiation and execution code Chun Jin Carnegie Mellon

Query Example • Suppose for every big transaction of type code 1000 or 2000, the analyst wants to check if the money stayed in the bank or left within twenty days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within twenty days of this transaction using an intermediate bank. Chun Jin Carnegie Mellon

The Query in CNF SELECT * FROM Fed r1, Fed r2, Fed r3 WHERE (r1.type_code = 1000 OR r1.type_code = 2000) AND r1.amount > 1000000 AND (r2.type_code = 1000 OR r2.type_code = 2000) AND r2.amount > 500000 AND (r3.type_code = 1000 OR r3.type_code = 2000) AND r3.amount > 500000 AND r1.rbank_aba = r2.sbank_aba AND r1.benef_account = r2.orig_account AND r2.amount > r1.amount / 2 AND r1.tran_date <= r2.tran_date AND r2.tran_date <= r1.tran_date + 20 AND r2.rbank_aba = r3.sbank_aba AND r2.benef_account = r3.orig_account AND r2.amount = r3.amount AND r2.tran_date <= r3.tran_date AND r3.tran_date <= r2.tran_date + 20; S2 S1 S1 J1 J2 F S1 S2 J1 J2 Chun Jin Carnegie Mellon

Identify Sharable Computations • Literal predicates • Equivalency • Subsumption • OR predicates • Predicate sets • Topology • Sharing strategies • Self-join SELECT * FROM Fed r1, Fed r2, Fed r3 WHERE (r1.type_code = 1000 OR r1.type_code = 2000) AND r1.amount > 1000000 AND (r2.type_code = 1000 OR r2.type_code = 2000) AND r2.amount > 500000 AND (r3.type_code = 1000 OR r3.type_code = 2000) AND r3.amount > 500000 AND r1.rbank_aba = r2.sbank_aba AND r1.benef_account = r2.orig_account AND r2.amount * 2 > r1.amount AND r1.tran_date <= r2.tran_date AND r2.tran_date - 10 <= r1.tran_date AND r2.rbank_aba = r3.sbank_aba AND r2.benef_account = r3.orig_account AND r2.amount = r3.amount AND r2.tran_date <= r3.tran_date AND r3.tran_date - 10 <= r2.tran_date; ORp3 ORp4 ORp1 ORp2 ORp1 ORp2  r2.amount > r1.amount/2 J4 PJ1 J3  r3.tran_date <= r2.tran_date + 20 J4 F S1 S2 J1 J2 Chun Jin Carnegie Mellon

Computation Hierarchy S1 S2 sharable subsumption PS1 PS2 ORp2 ORp4 ORp1 subsumption p2 p4 p11 p12 subsumption Fed.type_code = 1000 OR Fed.type_code = 2000 Fed.amount > 500000 Fed.amount > 1000000 Chun Jin Carnegie Mellon

ER Model for Hierarchy name Node IsAChild type Associates psetid PredSet BelongsTo ORpid OR Pred BelongsTo pid text Literal Pred Chun Jin Carnegie Mellon

Problems in Index/Search • Rich syntax  Canonicalization • Subsumption • Literal predicate: subsumption + canonicalization  triple-string canonical form • ORPred/PredSet  algorithms • Self-join + canonicalization  Standard Table Alias (STA) assignment • Topology  multiple topology indexing (Details coming up) Chun Jin Carnegie Mellon

Canonicalization • Equivalency: r2.amount > r1.amount / 2 r2.amount *2 > r1.amount  r2.amount * 2 – r1.amount > 0 • Subsumption: r2.tran_date <= r1.tran_date + 20 r2.tran_date – r1.tran_date <= 20 r2.tran_date – 10 <= r1.tran_date r2.tran_date – r1.tran_date <= 10 • Triple-string canonical form: attribute-expression op constant Chun Jin Carnegie Mellon

Self-Join • Canonical forms refer to true table names. • Not good for self-join predicates: • r1.benef_account = r2.orig_accout  Fed. benef_account = Fed.orig_accout • Use Standard Table Alias (STA) •  T1. benef_account = T2.orig_accout • Enumerate STA assignments to find matches Chun Jin Carnegie Mellon

Self-Join in ORPred/PredSet Layers • OR Predicate: • (r1.c=1000 OR r1.a=r2.b) •  (Fed.c=1000 OR T1.a=T2.b) ? •  (T1.c=1000 OR T1.a=T2.b) ? • Add STA when indexing OR Predicates • Similar on Predicate Sets Chun Jin Carnegie Mellon

Subsumption at ORPred Layer Input: ORPred p P Output: All ORPreds r R, s.t. pr. Algorithm: For each ρ p, Find γr, such that ργ For each r found, Count # of γ that subsumes ρ, |I(r)| If |I(r)|=|p| pr Chun Jin Carnegie Mellon

Topological Connections S1 S3 S5 J1 J4 J7 B1 S2 S4 S6 Chun Jin Carnegie Mellon

System Catalog JoinTopologyIndex SelectionTopologyIndex PredicateSetIndex PredicateIndex Chun Jin Carnegie Mellon

Indexing & Searching Canonicalization Inference & Classification T2.amount * 2 – T1.amount > 0 r2.type_code = 1000 r3.type_code = 1000 r1.type_code = 1000 r1.amount > 1000000 r1.rbank_aba = r2.sbank_aba r1.benef_account = r2.orig_account r2.amount * 2 > r1.amount r1.tran_date <= r2.tran_date r2.tran_date – 10 <= r1.tran_date r2.rbank_aba = r3.sbank_aba r2.benef_account = r3.orig_account r2.amount = r3.amount r2.tran_date <= r3.tran_date r3.tran_date – 10 <= r2.tran_date r1.type_code = 1000 r1.amount > 1000000 r2.type_code = 1000 r2.amount > 500000 r3.type_code = 1000 r3.amount > 500000 r1.rbank_aba = r2.sbank_aba r1.benef_account = r2.orig_account r2.amount * 2 > r1.amount r1.tran_date <= r2.tran_date r2.tran_date – 10 <= r1.tran_date r2.rbank_aba = r3.sbank_aba r2.benef_account = r3.orig_account r2.amount = r3.amount r2.tran_date <= r3.tran_date r3.tran_date – 10 <= r2.tran_date T2.tran_date – T1.tran_date <= 10 Common Computation Searching Computation Indexing … … … Node PredSetID PredSetID PredID PredID CanonicalForm TopologyIndex PredicateSetIndex PredicateIndex System Catalog Chun Jin Carnegie Mellon

Sharing Strategies B1 ? B1 1 B2 J1 B2 J2 B2 ? B3 2 (c-1) Sharing-selection B3 (b-1) Joins in Q B1 J1 B2 (a) Query network R B1 J1 B1 B2 J3 J2 B2 J3 B3 J2 (c-2) Match-plan B3 Chun Jin Carnegie Mellon (b-2) Optimal plan for Q

Evaluation • Databases: • Synthesized FedWire money transfers (Fed 500000 records) • Anonymized Medical patient admission records (Med 835890 records) • Queries: • Seed queries • Generate sharable queries from seeds • A wide range of queries • Simulation: • Historical data (300000 on Fed, 600000 on Med) • Chunks of new data (4000 per chunk, etc.) Chun Jin Carnegie Mellon

Improvement Factors Incremental Evaluation 1-100x Join Order Optimization 1-10x Canonicalization 1-10x IMQO 1-50x DBMS 1x ARGUS 1-500x Conditional Materialization 1.2-1.8x Transitivity Inference 1-20x Chun Jin Carnegie Mellon

Fed IMQO & Canonicalization # of queries WQNS: weighted query network size HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Fed Sharing Strategies HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Summary of Contributions • Efficiency and scalability • Continuous queries Incremental query evaluation • Multiple/large-scale queries Incremental multiple query optimization (IMQO) • Query optimization • Practicality • Existing DB applications  Built atop DBMSs • A broad range of query syntax and semantics Support • Evaluation • Shows up-to hundreds-fold improvement • Works across various domains Chun Jin Carnegie Mellon

Future Work • Generalization of current work • Support multi-way joins • More sophisticated sharing strategies • Rerouting • Restructuring • Adaptive query processing • Adaptive re-optimization: rerouting and restructuring • Adaptive rescheduling • New infrastructure • Parallel/distributive processing • Automatic tuning: index selection • Support new data types • Text • Multimedia Chun Jin Carnegie Mellon

Acknowledgement • Advisor: Jaime Carbonell. • Committee: Chris Olston, Jamie Callan, and Phil Hayes • CMU and Dynamix ARGUS team: Jaime Carbonell, Phil Hayes, Santosh Ananthraman, Cenk Gazen, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew, and Aaron Goldstein. • CMU faculty and friends: many … Chun Jin Carnegie Mellon

Thank you! Questions and comments? Chun Jin Carnegie Mellon

Outline • Motivation • System and methods: • System architecture • Execution engine • Query network structures • IMQO framework • Query network generator • Query examples • Hierarchy/ER Model • Problems and solutions • System catalog • Sharing strategies • Evaluation • Conclusion and future work Chun Jin Carnegie Mellon

Adapted Rete Algorithm (Join) • Join on N and M • (N+ΔN) (M+ΔM) = N M + ΔN M + N ΔM + ΔN ΔM • When ΔN and ΔM are very small compared to N and M, time complexity of incremental join is O(N+M) Old Results New Incremental Results Chun Jin Carnegie Mellon

Incremental Evaluation N.rbank_aba = M.sbank_aba N.benef_account = M.orig_account M.amount > N.amount*0.5 N.tran_date <= M.tran_date M.tran_date >= N.tran_date+20 N J M N hist J ΔN new hist Compute ΔJ by ΔN M N ΔM ΔN ΔM new M ΔJ hist new ΔM Chun Jin Carnegie Mellon

Incremental Evaluation r1.rbank_aba = r2.sbank_aba r1.benef_account = r2.orig_account r2.amount > r1.amount*0.5 r1.tran_date <= r2.tran_date r2.tran_date >= r1.tran_date+20 type_code=1000 amount>500000 F F S1 S2 J1 J2 hist temp S2 hist S1 J1 temp hist Compute S1_temp by selecting from F_temp temp hist temp Compute J1_temp by joining S1_temp and S2_hist, joining S1_hist and S2_temp, and joining S1_temp and S2_temp Chun Jin Carnegie Mellon

Code Generation • Code template for each operator • Code block for each node • Sort the code blocks • Wrap up code blocks in Oracle stored procedures • Register and periodical execution Chun Jin Carnegie Mellon

Projection Management B1 S1 J1 B2 S2 Chun Jin Carnegie Mellon

Transitivity Inference Example • Given • r1.amount > 1000000 and • r2.amount > r1.amount * 0.5 and • r3.amount = r2.amount • We can infer highly-selective predicates: • r2.amount > 500000 • r3.amount > 500000 Chun Jin Carnegie Mellon

Query Optimizer • Similar to traditional enumeration-based query optimizer • Optimize • Join order • Conditional materialization DB History-based Cost Estimator Active List SQL Query Join Enumerator Join Graph Plan Update System Catalog History-based Query Optimizer StructureBuilder Chun Jin Carnegie Mellon

Conditional Materialization Unconditional Materialization r1 r2 Conditional Materialization: r1 Choose materialization or not based on cost estimates r2 Chun Jin Carnegie Mellon

Selection/Join Incremental Evaluation (Fed) 50 40 30 Execution Time(s) 20 10 0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Rete Data1 DBMS Data1 Rete Data2 DBMS Data2 HP PC, Single core Pentium(R) 4 CPU, 1.7GHz, 512M RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Fed Comparing All HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Med Comparing All HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Med IMQO & Canonicalization HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Med Sharing Strategies HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0 Chun Jin Carnegie Mellon

Optimizing Multiple Continuous Queries