340 likes | 366 Views
This paper introduces content-based routing (CBR) as a method to optimize query plans based on different subsets of data, allowing for more efficient processing. The CBR algorithm is implemented in TelegraphCQ as an extension to Eddies.
E N D
Content-Based Routing:Different Plans for Different Data Pedro Bizarro Joint work withShivnath Babu, David DeWitt, Jennifer Widom September 1, 2005 VLDB 2005
Introduction Introduction • The opportunity to impove: • Optimizers pick a single plan for a query • However, different subsets of data may have very different statistical properties • May be more efficient to use different plans for different subsets of data
Overview of CBR • Eliminates single plan assumption • Identifies tuple classes • Uses multiple plans, each customized for a different tuple class • CBR applies to any streaming data: • E.gs.: stream systems, regular DBMS operators using iterators, and acquisitional systems. • Adaptive and low overhead algorithm • Implemented in TelegraphCQ as an extension to Eddies
Output tuples O2 O3 O1 Eddy Stream of Input tuples Overview of Eddies • Eddy routes tuples through a pool of operators • Routing decisions based on operator characteristics: selectivity, cost, queue size, etc. • Tuples not differentiated based on content • We call it SBR: Source-Based Routing
Content-Based Routing Example • Consider stream S processed by O1, O2, O3 Overall Operator Selectivities • Best routing order is O1, then O2, then O3
Content-Based Routing Example • Let A be an attribute with domain {a,b,c} Content-Specific Selectivities • Best routing order for A=a: O2, O1, O3 • Best routing order for A=b: O2, O1, O3 • Best routing order for A=c: O1, O3, O2
Classifier Attributes • Goal: identify tuple classes • Each with a different optimal operator ordering • CBR considers: • Tuple classes distinguished by content, i.e., attribute values • Classifier attribute (informal definition): • Attribute A is classifier attribute for operator O if the value of A is correlated with selectivity of O.
Best Classifier Attribute Example: • Attribute A with domain {a, b, c} • Attribute B with domain {x, y, z} • Which is the best to use for routing decisions? • Similar to AI problem: classifier attributes for decision trees • Two labels: pass operator, dropped by operator • AI solution: Use GainRatio to pick best classifier attribute
GainRatio to Measure Correlation • R: random sample of tuples processed by operator O GainRatio(R, A) = 0.87 GainRatio(R, B) = 0.002 Formulas from T. Mitchell, Machine Learning. McGraw-Hill, '97.
Classifier Attributes:Definition An attribute A is a classifier attribute for operator O, if for any large random sample R of tuples processed by O, GainRatio(R,A)>, for some threshold
Content-Learns Algorithm:Learning Routes Automatically • Content-Learns consists of two continuous, concurrent steps: • Optimization: For each OlO1, …,On find: • that Ol does not have a classifier attribute or • find the best classifier attribute, Cl, of Ol. • Routing: Route tuples according to the: • selectivities of Ol if Ol does not have a classifier attribute or • according to the content-specific selectivities of the pair <Ol, Cl> if Cl is the best classifier attribute of Ol
4 operators classifier attributes CA[]= 2 -1 1 1 operator 3 being profiled In[]= Out[]= f3 f2 f1 0 0 0 0 0 0 tuples in, tuples out 0 0 0 0 0 0 Content-Learns: Optimization Step • Find Cl by profiling Ol: • Route a fraction of input tuples to Ol • For each sampled tuple, map attribute values to d partitions • For each attribute, partition update pass/fail counters • When all sample tuples seen, compute Cl 2 sampled tuple 3 1 2 4 4 7 1 1 1 1 1 1 2 2 1 1 2 partitions 1 1 1 1 2 2 1 1 2 1 1 2 1 2 corresponding partitions 3 attributes
4 operators operator selectivities W[]= 25% 40% 50% 60% tuple 3 1 2 classifier attributes CA[]= 2 -1 2 1 50% - 20% 75% detailed selectivities S[]= 2 1 1 0% - 80% 55% corresponding partitions Content-Learns: Routing Step • SBR routes to Ol with probability inversely proportional to Ol’s selectivity, W[l] • CBR routes to operator with minimum: • If Ol does not have a classifier attribute, its =W[l] • If Ol has a classifier attribute, its =S[l,i], j=CA[l], i=fj(t.Cj) 40% 1 -1 2 50% 20% 2 partitions 1 55%
CBR Update Overheads • Once per tuple: • selectivities as fresh as possible • Once per sampled tuple: • correlations between operators and content • Once per sample (~2500 tuples) • Computing GainRatio and updating one entry in array CA operator selectivities W[]= 25% 40% 50% 60% classifier attributes CA[]= 2 -1 2 1 In[]= Out[]= partitions: 1,…,d 0 1 2 0 1 1 50% - 20% 75% tuples in, tuples out detailed selectivities S[]= 2 1 0 1 0 0 0% - 80% 55% operators: 1,...,n attributes: 1,…,k
Experimental Results:Datasets • Stream-star dataset • Synthetic dataset based on a star-schema: SELECT * FROM stream S, d1, d2, …, dN WHERE s.fkd1 = d1.pk // Operator O1 … AND s.fkdN = dN.pk; // Operator ON • Attribute S.attrC best classifier attribute for all operators • 8 other attributes in S not correlated with operator selectivities • 100K records in stream, 10K records in dimension tables • Lab dataset • Explained later
Experimental Results:System, Metrics, Defaults • TelegraphCQ version 0.2 • Tao Linux release 1, 512 MB RAM, Pentium 4 – 2.4 GHz • Metrics: % improvement running time and routing calls • Default values:
Experimental Results:Run-time Overheads Routing overhead time to perform routing decisions (SBR, CBR) Learning overhead: Time to update data structures (SBR, CBR) plus Time to compute gain ratio (CBR only). Overhead increase: 30%-45%
Experimental Results:Varying Skew One operator with selectivity A, all others with selectivity B Skew is A-B. A varied from 5% to 95% Overall selectivity: 5% 6 joins
Experimental Results:Random Selectivities Attribute attrC correlated with the selectivities of the operators Other attributes in stream tuples not correlated with selectivities Random selectivities in each operator
Experimental Results:Varying Aggregate Selectivity Aggregate selectivity in previous experiments was 5% or ~8% Here we vary aggregate selectivity between 5% to 35% Random selectivities within these bounds 6 joins
Experimental Results:Datasets • Lab dataset • Real data • 2.2 M readings from 54 sensors (Intel Research Berkeley Lab) • Single stream with attributes: • Light • Humidity • Temperature • Voltage • sensorID • Year • Month • Day • Hours • Minutes • Seconds
Experimental Results:Challenging Adaptivity Experiment (1) • Using Lab dataset • Example query:SELECT * FROM sensors WHERE light>500; • Observations: • Very high variation in selectivity • Best classifier attributes change with time • No classifier attribute found for over half the time
Experimental Results:Challenging Adaptivity Experiment (2) • Query: SELECT * FROM sensors WHERE light BETWEEN lowL AND highL AND temperature BETWEEN lowT AND highT AND humidity BETWEEN lowH AND highH AND voltage BETWEEN lowV AND highV; • lowX random number from lower 25% of domain • highX random number from upper 25% • Results for 50 different queries. • Average improvement of: • 8% in routing calls • 5% in execution time • 7% in time spent evaluating operators • 18% in routing calls until a tuple is dropped
Experimental Results:Varying Operator Cost Run random query from previous slide Run query for periods with correlations Varied operator cost by running CPU intensive computations 4 operators
Conclusions • CBR eliminates single plan assumption • Explores correlation between tuple content and operator selectivities • Adaptive learner of correlations with negligible overhead • Performance improvements over non-CBR routing • Selectivity changes much more than correlations
Acknowledgements • Sam Madden and Amol Deshpande for providing the Lab dataset. • Sailesh Krishnamurthy, AmolDeshpande, Joe Hellerstein, and the rest of the TelegraphCQ team for providing TelegraphCQ and answering all my questions.
Motivational Example (1):Intrusion Detection Query • “Track packets with destination address matching a prefix in table T, and containing the 100-byte and 256-byte sequences “0xa...8” and “0x7...b” respectively as subsequence” • SELECT * FROM packetsWHERE matches(destination, T)AND contains(data, “0xa...8”)AND contains(data, “0x7...b”); O1 O2 O3
Motivational Example (2):Intrusion Detection Query • Assume: • costs are: c3>c1>c2 • selectivities are: 3>1>2 • SBR routing converges to O2, O1, O3 O3 SBR O2 O1 almost all tuples follow this route Stream of tuples
Motivational Example (3):Intrusion Detection Query • Suppose an attack (O2 and O3) on a network whose prefix is not in T (O1) is underway: • σ2and σ3will be very high, σ1 will be very low • O1, O2, O3will be the most efficient ordering for “attack” tuples O3 O3 SBR CBR O2 O1 O2 O1 addr attack tuples follow this route almost all tuples follow this route non-attack tuples follow this route Stream of tuples Stream of tuples
Experimental Results:Varying Skew One operator with selectivity A, all others with selectivity B Skew is A-B. A varied from 5% to 95% Overall selectivity: 5% 2 joins 6 joins
Adaptivity in Stream Systems [Avnur+] Eddies: Continuously Adaptive Query Processing. SIGMOD'00. [Arasu+] STREAM: The Stanford Stream Data Manager. DEBul 26(1). [Babu+] Adaptive ordering of pipelined stream filters. SIGMOD'04. [Babu+] StreaMon: An Adaptive Engine for Stream Query Processing. SIGMOD'04. [Carney+] Monitoring streams – a new class of data management applications. VLDB'02. [Chandrasekaran+] TelegraphCQ: Continuous dataflow processing for an uncertain world. CIDR'03. [Chandrasekaran+] Psoup: a system for streaming queries over streaming data. VLDBj 12(2). [Deshpande] An initial study of overheads of eddies. SIGMODrec 33(1). [Deshpande+] Lifting the Burden of History from Adaptive Query Processing. VLDB'04. [Madden+] Continuously adaptive continuous queries over streams. SIGMOD'02. [Raman+] Using state modules for adaptive query processing. ICDE'03. [Tian+] Tuple Routing Strategies for Distributed Eddies. VLDB'03. [Viglas+] Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources. VLDB'03. AQP Surveys [Babu+] Adaptive Query Processing in the Looking Glass. CIDR'05. [Hellerstein+] Adaptive query processing: Technology in evolution. DEBul 23(2). Related Work Adaptivity in DBMS • [Babu+] Proactive Re-optimization. SIGMOD'05. • [Graefe+] Dynamic query evaluation plans. SIGMOD'89. • [Kabra+] Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. SIGMOD'98. • [Markl+] Robust Query Processing through Progressive Optimization. SIGMOD'04. • [Wong+] Decomposition - a strategy for query processing. TODS 1(3). Adaptivity in Distributed Systems • [Bouganim+] Dynamic query scheduling in data integration systems. ICDE'00. • [Ives+] An adaptive query execution system for data integration. SIGMOD'99. • [Ives+] Adaptive query processing for internet applications. DEBul 23(2). • [Ives+] Adapting to Source Properties in Processing Data Integration Queries. SIGMOD'04. • [Ives] Efficient Query Processing for Data Integration. PhD thesis. • [Ng+] Dynamic query re-optimization. ICSSDM'99. • [Urhan+] Dynamic pipeline scheduling for improving interactive performance of online queries. VLDB'01. • [Urhan+] Cost based query scrambling for initial delays. SIGMOD'98. • [Zhu+] Dynamic plan migration for continuous queries over data streams. SIGMOD'04.
Acquisitional Systems [Madden+] The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD'03. [Deshpande+] Model-Driven Data Acquisition in Sensor Networks. VLDB'04 [Deshpande+] Exploiting Correlated Attributes in Acquisitional Query Processing. ICDE'05. Multiple Plans in Same Query [Antoshenkov+] Query processing and optimization in oracle rdb. VLDBj 5(4). [Bizarro+] Content-Based Routing: Different Plans for Different Data. VLDB'05. [Polyzotis] Selectivity-Based Partitioning: A Divide-and-Union Paradigm For Effective Query Optimization. Unpublished. URDs and Modeling Uncertainty [Anderson+] Index key range estimator. U. S. Patent 4,774,657. [Babcock+] Towards a Robust Query Optimizer: A Principled and Practical Approach. SIGMOD'05. [Chu+] Least expected cost query optimization: An exercise in utility. SoPoDS'99. [Ramamurthy+] Buffer-pool Aware Query Optimization. CIDR'05. [Viglas] Novel Query Optimization and Evaluation Techniques, Ph.D. Thesis. Related Work (cont’d) Optimization, Cardinality Estimation, Correlations • [Selinger+] Access Path Selection in a Relational Database Management System. SIGMOD'79. • [Acharya+] Join Synopses for Approximate Query Answering. SIGMOD'99. • [Christodoulakis] Implications of certain assumptions in database performance evaluation. TODS 9(2). • [Graefe] Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2). • [Getoor+] Selectivity Estimation using Probabilistic Models. SIGMOD'01. • [Ioannidis+] On the Propagation of Errors in the Size of Join Results. SIGMOD'91. • [Ilyas+] CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD'04. • [Stillger+] LEO - DB2’s LEarning Optimizer. VLDB'01. AI and Learning from Streams • [Mitchell] Machine Learning. McGraw-Hill, '97. • [Guha+] Data-streams and histograms. ACM Symp. on Theory of Computing, '01. • [Domingos+] Mining high-speed data streams. SIGKDD'00. • [Gehrke+] On computing correlated aggregates over continual data streams. SIGMOD'01.