ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams Chun Jin Language Technologies Institute School of Computer Science Carnegie Mellon University cjin@cs.cmu.edu
Stream Processing Model Data Streams Output Storage • Stream Processing becomes demanding and prevalent. Chun Jin Carnegie Mellon
Stream Databases • Stream Database Applications • Network Traffic Analysis and Router Configuration • Dynamic Internet Services • Sensor Data Analysis • Anomaly Detection • Stream Database Projects • STREAM, TelegraphCQ, Aurora • NiagaraCQ, OpenCQ, WebCQ • Gigascope, Tribeca • Tapestry, Alert, Tukwila, etc. • ARGUS
Stream Anomaly Monitoring Systems (SAMS) • SAMS monitors structured data streams for anomalies or potential hazards. • Query matches may be high-urgency alerts, so prompt detection is desirable. • A SAMS query is rarely satisfied (very high selectivity).
SAMS Dataflow • Data Streams (FedWire Money Transfers, Patient Records) → Stream Anomaly Monitoring System (Queries, Storage) → Alerts → Analyst
Challenges to SAMS • Persistent queries may number in the thousands or tens of thousands. • Daily stream volumes may exceed millions of records. • Prompt detection is desirable. • Very-high-selectivity query property.
Proposed ARGUS Approach • Basic Framework: • Incremental evaluation schemes (Adapted Rete algorithm) • Rete (Forgy 1982): Incremental Evaluation based on Materialized Intermediate Results. • Upon a traditional DBMS platform • Exploiting Very-High-Selectivity Query Property: • Transitivity Inference • Conditional Materialization • Optimizing Join Order • Computation Sharing • Related to Other Applications • Stream Databases • Modern DBMS Query Optimization
Query Example 4 • Suppose for every big transaction of type code 1000, the analyst wants to check if the money stayed in the bank or left within ten days. An additional sign of possible fraud is that transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within ten days of this transaction using an intermediate bank.
SQL Query for Example 4
FROM transaction r1, transaction r2, transaction r3
WHERE r2.type_code = 1000
  AND r3.type_code = 1000
  AND r1.type_code = 1000
  AND r1.amount > 1000000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > 0.5 * r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 10
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 10;
ARGUS System Architecture Data Tables Stream Anomaly Monitoring Intermediate Tables Data Streams Query Table Do_queries Analyst Rete Network Generator Query Scheduler Rete Networks Identified Threats
ReteGenerator Architecture System Catalog • Common Computation Identification • Predicate Indexing • Extended Predicate Set Operations • Choose what and how to share • Recording and Manipulating Network Topology • Estimating Sharing Costs ReteGenerator Optimizer Join Order Conditional Materialization Transitivity Inference Sharing Module SQL Queries
Adapted Rete Algorithm (Selection) • n and m are old data sets. • Δn and Δm are the new, much smaller incremental data sets. • Selection σ: σ(n + Δn) = σ(n) + σ(Δn)
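The selection identity above can be illustrated with a minimal Python sketch. It assumes records are plain dictionaries and the predicate is an ordinary function; the names `select`, `is_large_transfer`, `materialized`, and the sample records are illustrative only and are not taken from ARGUS, which runs on a DBMS.

```python
def select(predicate, records):
    """Apply a selection predicate to a list of records."""
    return [r for r in records if predicate(r)]


# Hypothetical predicate modeled on Example 4: large transfers of type 1000.
def is_large_transfer(r):
    return r["type_code"] == 1000 and r["amount"] > 1_000_000


# n: old data, Δn: newly arrived data (both made up for illustration).
old = [
    {"id": 1, "type_code": 1000, "amount": 2_000_000},
    {"id": 2, "type_code": 1000, "amount": 50_000},
]
delta = [
    {"id": 3, "type_code": 1000, "amount": 1_500_000},
    {"id": 4, "type_code": 2000, "amount": 9_000_000},
]

# σ(n) is computed once and kept as a materialized intermediate result.
materialized = select(is_large_transfer, old)

# When Δn arrives, only σ(Δn) is evaluated and appended to σ(n).
new_matches = select(is_large_transfer, delta)
incremental = materialized + new_matches

# Recomputing σ(n + Δn) from scratch gives the same result.
full = select(is_large_transfer, old + delta)
```

The point of the sketch is that the old data is never rescanned: the work per arrival is proportional to the size of Δn.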
Adapted Rete Algorithm (Join) • Join ⋈: (n + Δn) ⋈ (m + Δm) = n ⋈ m + Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm • Old results: n ⋈ m. New incremental results: Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm. • When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n + m).
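The join decomposition can be checked on a toy example. The following is a sketch only: it uses a nested-loop join over in-memory lists, and the helper names (`join`, `same_account`, `ids`) and the sample rows are assumptions made for illustration, not part of ARGUS.

```python
def join(left, right, on):
    """Nested-loop join: all (l, r) pairs that satisfy the join predicate."""
    return [(l, r) for l in left for r in right if on(l, r)]


# Hypothetical join predicate modeled on Example 4.
def same_account(l, r):
    return l["benef_account"] == r["orig_account"]


def ids(pairs):
    """Order-independent view of a join result, for comparison."""
    return sorted((l["id"], r["id"]) for l, r in pairs)


# Old data sets n and m, and small increments Δn and Δm (made-up rows).
n = [{"id": 1, "benef_account": "A"}, {"id": 2, "benef_account": "B"}]
m = [{"id": 10, "orig_account": "A"}]
dn = [{"id": 3, "benef_account": "C"}]
dm = [{"id": 11, "orig_account": "B"}, {"id": 12, "orig_account": "C"}]

# Old results: n ⋈ m, already materialized before the increments arrive.
old_results = join(n, m, same_account)

# New incremental results: Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm.
new_results = (
    join(dn, m, same_account)
    + join(n, dm, same_account)
    + join(dn, dm, same_account)
)

# Full recomputation over (n + Δn) ⋈ (m + Δm) for comparison.
full = join(n + dn, m + dm, same_account)
```

Each incremental term touches at most one of the large sets, so with the sizes of Δn and Δm treated as small constants the work is one pass over m plus one pass over n, which matches the O(n + m) bound stated on the slide.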
Incremental Evaluation in Rete: Example 4
• DataTable feeds three selection nodes: r1 (Type_code = 1000, Amount > 1000000), r2 (Type_code = 1000), r3 (Type_code = 1000).
• Join of r1 and r2: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount * 0.5, r1.tran_date <= r2.tran_date, r2.tran_date <= r1.tran_date + 10.
• Join with r3, producing r1, r2, r3: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date <= r3.tran_date, r3.tran_date <= r2.tran_date + 10.
Complex Queries • A persistent query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms. • Each SQL term is mapped to a sub-Rete network. • These sub-Rete networks are then connected to form the statement-level sub-networks. • The statement-level sub-networks are in turn connected, based on the view references, to form the final query-level Rete network.
Transitivity Inference • Exploits the transitivity properties of comparison operators • To derive hidden, highly selective selection predicates • Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results. Subsequent joins can then be performed very fast on the materialized intermediate results. • Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97
Transitivity Inference Example • Given • r1.amount > 1000000 and • r2.amount > r1.amount * 0.5 and • r3.amount = r2.amount • r1.amount > 1000000 is very highly selective on r1 • We can infer highly selective predicates: • r2.amount > 500000 • r3.amount > 500000
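The inference in this example can be reproduced with a small fixed-point propagation of lower bounds. This is a simplified sketch under stated assumptions, not the ARGUS inference module: it handles only strict lower bounds, predicates of the form `left > factor * right` with a positive factor, and equalities, and the function name `infer_lower_bounds` and the data layout are hypothetical.

```python
# Known strict lower bounds: column -> constant, meaning column > constant.
lower_bounds = {"r1.amount": 1_000_000}

# Join predicates of the form  left > factor * right  (factor must be > 0).
greater_than = [("r2.amount", 0.5, "r1.amount")]

# Equality join predicates  left = right.
equalities = [("r3.amount", "r2.amount")]


def infer_lower_bounds(lower_bounds, greater_than, equalities):
    """Propagate strict lower bounds until no new or tighter bound appears."""
    bounds = dict(lower_bounds)
    changed = True
    while changed:
        changed = False
        candidates = []
        # left > factor * right and right > c  imply  left > factor * c.
        for left, factor, right in greater_than:
            if right in bounds:
                candidates.append((left, factor * bounds[right]))
        # a = b: a bound on either side carries over to the other side.
        for a, b in equalities:
            if b in bounds:
                candidates.append((a, bounds[b]))
            if a in bounds:
                candidates.append((b, bounds[a]))
        # Keep a candidate only if it is new or tighter than what we have.
        for column, value in candidates:
            if column not in bounds or value > bounds[column]:
                bounds[column] = value
                changed = True
    return bounds


inferred = infer_lower_bounds(lower_bounds, greater_than, equalities)
```

On the three predicates from the slide, `inferred` holds a lower bound of 500000 for both `r2.amount` and `r3.amount`, which correspond to the derived selection predicates that can be pushed down to r2 and r3 before the joins.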
Conditional Materialization • Unconditional Materialization (r1, r2) • Conditional Materialization (r1, r2): choose whether or not to materialize based on cost estimates.
Preliminary Evaluation: Queries and Data • 7 queries on a synthesized FedWire money transfer database of 320006 records. • Two data conditions: • Data1: old data is the first 300000 records; new data is the remaining 20006 records (ALERT). • Data2: old data is the first 300000 records; new data is the next 20000 records (NOT alert).
Preliminary Results [Figure: execution time (s) for queries Q1 to Q7; series: Rete Data1, SQL Data1, Rete Data2, SQL Data2.] Rete with Transitivity Inference
Transitivity Inference [Figure: two panels, Q4 and Q2; execution time (s) on Data1 and Data2; series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]
Conditional Materialization [Figure: execution time (s) on Data1 and Data2; series: Conditional, Rete, SQL.] Q4 assumes Transitivity Inference not applicable
ARGUS Summary • Adapted Rete Algorithm upon a traditional DBMS platform • Exploit the very-high-selectivity query property for optimization: • Transitivity Inference • Conditional Materialization • Current and Future Work: • Optimizing Join Order • Computation Sharing
Thank you! Questions and Comments?