
Content-Based Routing: Different Plans for Different Data

This paper introduces content-based routing (CBR), a method that optimizes query processing by using different query plans for different subsets of the data when those subsets have different statistical properties. The CBR algorithm is implemented in TelegraphCQ as an extension to Eddies.


Presentation Transcript


  1. Content-Based Routing: Different Plans for Different Data • Pedro Bizarro • Joint work with Shivnath Babu, David DeWitt, and Jennifer Widom • September 1, 2005, VLDB 2005

  2. Introduction • The opportunity to improve: • Optimizers pick a single plan for a query • However, different subsets of the data may have very different statistical properties • It may be more efficient to use different plans for different subsets of the data

  3. Overview of CBR • Eliminates the single-plan assumption • Identifies tuple classes • Uses multiple plans, each customized for a different tuple class • CBR applies to any streaming data, e.g., stream systems, regular DBMS operators using iterators, and acquisitional systems • Adaptive, low-overhead algorithm • Implemented in TelegraphCQ as an extension to Eddies

  4. Overview of Eddies • [Figure: an Eddy routing a stream of input tuples through operators O1, O2, O3 and emitting output tuples] • An Eddy routes tuples through a pool of operators • Routing decisions are based on operator characteristics: selectivity, cost, queue size, etc. • Tuples are not differentiated based on content • We call this SBR: Source-Based Routing (a minimal routing sketch follows below)
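
To make source-based routing concrete, here is a minimal Python sketch (not TelegraphCQ code): operators are predicates over a tuple, the selectivity map is assumed to be maintained elsewhere, and the inverse-selectivity weighting is an illustrative choice.

    import random

    def sbr_route(tup, operators, selectivity):
        """Route one tuple through every operator, choosing the order
        from per-operator statistics only, never from tuple content."""
        pending = list(operators)
        while pending:
            # Prefer low-selectivity operators: they are more likely
            # to drop the tuple early and save later work.
            weights = [1.0 / max(selectivity[op], 1e-6) for op in pending]
            op = random.choices(pending, weights=weights)[0]
            if not op(tup):             # operator drops the tuple
                return None
            pending.remove(op)          # passed; do not revisit
        return tup                      # tuple survived all operators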

  5. Content-Based Routing Example • Consider stream S processed by operators O1, O2, O3 • [Table: overall operator selectivities] • Best routing order is O1, then O2, then O3

  6. Content-Based Routing Example • Let A be an attribute with domain {a, b, c} • [Table: content-specific selectivities per value of A] • Best routing order for A=a: O2, O1, O3 • Best routing order for A=b: O2, O1, O3 • Best routing order for A=c: O1, O3, O2 (see the worked example below)
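
The slide's selectivity table is not reproduced in this transcript, so the following Python sketch uses made-up content-specific selectivities chosen to yield the same per-value orderings; with unit-cost operators, the expected number of operator calls for an ordering is 1 + s_first + s_first * s_second.

    from itertools import permutations

    # Hypothetical content-specific selectivities for each value of A.
    sel = {
        "a": {"O1": 0.5, "O2": 0.1, "O3": 0.9},
        "b": {"O1": 0.6, "O2": 0.2, "O3": 0.8},
        "c": {"O1": 0.1, "O2": 0.9, "O3": 0.5},
    }

    for value, s in sel.items():
        def cost(order):
            calls, surviving = 0.0, 1.0
            for op in order:
                calls += surviving     # each surviving tuple pays one call
                surviving *= s[op]     # fraction that passes this operator
            return calls
        best = min(permutations(s), key=cost)
        print(f"A={value}: route", ", ".join(best))

Running this prints O2, O1, O3 for A=a and A=b, and O1, O3, O2 for A=c, matching the orderings above.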

  7. Classifier Attributes • Goal: identify tuple classes, each with a different optimal operator ordering • CBR considers tuple classes distinguished by content, i.e., by attribute values • Classifier attribute (informal definition): attribute A is a classifier attribute for operator O if the value of A is correlated with the selectivity of O

  8. Best Classifier Attribute • Example: attribute A with domain {a, b, c}; attribute B with domain {x, y, z} • Which is better to use for routing decisions? • Similar to an AI problem: choosing classifier attributes for decision trees • Two labels: passed operator, dropped by operator • AI solution: use GainRatio to pick the best classifier attribute

  9. GainRatio to Measure Correlation • R: random sample of tuples processed by operator O • GainRatio(R, A) = 0.87; GainRatio(R, B) = 0.002 • Formulas from T. Mitchell, Machine Learning, McGraw-Hill, 1997 (a sketch follows below)
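
As a rough sketch of the measure, following the standard decision-tree definitions in Mitchell's book (the helper names and the (value, passed) sample encoding are mine):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain_ratio(sample):
        """sample: list of (value_of_A, passed) pairs, where passed says
        whether the tuple survived operator O (labels: pass/drop)."""
        n = len(sample)
        labels = [passed for _, passed in sample]
        by_value = {}                   # partition the sample by A's value
        for value, passed in sample:
            by_value.setdefault(value, []).append(passed)
        # Gain(R, A) = Entropy(R) - sum_v |R_v|/|R| * Entropy(R_v)
        gain = entropy(labels) - sum(
            len(p) / n * entropy(p) for p in by_value.values())
        # SplitInformation(R, A) = -sum_v |R_v|/|R| * log2(|R_v|/|R|)
        split = -sum(
            len(p) / n * log2(len(p) / n) for p in by_value.values())
        return gain / split if split > 0 else 0.0

A high GainRatio for A means that knowing A's value says a lot about whether a tuple will pass O, which is exactly the correlation CBR wants to exploit.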

  10. Classifier Attributes: Definition • An attribute A is a classifier attribute for operator O if, for any large random sample R of tuples processed by O, GainRatio(R, A) > θ for some threshold θ

  11. Content-Learns Algorithm: Learning Routes Automatically • Content-Learns consists of two continuous, concurrent steps: • Optimization: for each Ol ∈ {O1, …, On}, either find that Ol does not have a classifier attribute, or find the best classifier attribute, Cl, of Ol • Routing: route tuples according to the selectivities of Ol if Ol does not have a classifier attribute, or according to the content-specific selectivities of the pair <Ol, Cl> if Cl is the best classifier attribute of Ol

  12. Content-Learns: Optimization Step • Find Cl by profiling Ol: • Route a fraction of input tuples to Ol • For each sampled tuple, map attribute values to d partitions • For each (attribute, partition) pair, update pass/fail counters • When all sample tuples have been seen, compute Cl (see the sketch below) • [Figure: 4 operators with classifier-attribute array CA[] = {2, -1, 1, 1}; operator 3 being profiled; per-partition In[]/Out[] counters (tuples in, tuples out) maintained for 3 attributes, with hash functions f1, f2, f3 mapping a sampled tuple's attribute values to their corresponding partitions]
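
A minimal sketch of this step, reusing gain_ratio from the sketch after slide 9. The hash-based partition function, the d=4 default, and the 0.1 threshold are illustrative assumptions; the slide's In[]/Out[] arrays are per-partition aggregates of the pairs built here.

    def best_classifier_attribute(sample, attributes, d=4, threshold=0.1):
        """sample: (tuple_dict, passed) observations for the profiled
        operator. Returns the best classifier attribute, or None
        (the slide's CA[l] = -1) if none clears the threshold."""
        best, best_score = None, threshold
        for a in attributes:
            # f_a: map each value of attribute a to one of d partitions.
            pairs = [(hash(t[a]) % d, passed) for t, passed in sample]
            score = gain_ratio(pairs)   # from the sketch after slide 9
            if score > best_score:
                best, best_score = a, score
        return best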

  13. Content-Learns: Routing Step • SBR routes to Ol with probability inversely proportional to Ol's selectivity, W[l] • CBR routes to the operator with minimum expected selectivity σl, where: • If Ol does not have a classifier attribute, σl = W[l] • If Ol has a classifier attribute, σl = S[l, i], where j = CA[l] and i = fj(t.Cj) (see the sketch below) • [Figure: example with 4 operators — operator selectivities W[] = {25%, 40%, 50%, 60%}, classifier attributes CA[] = {2, -1, 2, 1}, and per-partition detailed selectivities S[] (e.g., rows {50%, -, 20%, 75%} and {0%, -, 80%, 55%}) used to route a tuple]
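
A sketch of the routing decision itself, with W, CA, and S maintained as on the previous slides (the hash partitioner standing in for fj is an assumption):

    def partition(value, d=4):
        return hash(value) % d             # stands in for f_j

    def cbr_pick_operator(tup, pending, W, CA, S, d=4):
        """Pick the not-yet-applied operator with the smallest
        expected selectivity for this particular tuple."""
        def sigma(l):
            j = CA[l]
            if j == -1:                    # no classifier attribute
                return W[l]                # fall back to overall selectivity
            i = partition(tup[j], d)       # partition of t.Cj's value
            return S[l][i]                 # content-specific selectivity
        return min(pending, key=sigma)

When every CA[l] is -1, the decision degenerates to routing on the overall selectivities W[] alone.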

  14. CBR Update Overheads • Once per tuple: update selectivities (kept as fresh as possible) • Once per sampled tuple: update correlations between operators and content • Once per sample (~2500 tuples): compute GainRatio and update one entry in array CA • [Figure: the arrays updated at each frequency — W[] (operator selectivities, operators 1..n), In[]/Out[] (tuples in/out per partition 1..d and attribute 1..k), CA[] (classifier attributes), and S[] (detailed selectivities)]

  15. Experimental Results: Datasets • Stream-star dataset • Synthetic dataset based on a star schema:
  SELECT * FROM stream S, d1, d2, …, dN
  WHERE s.fkd1 = d1.pk                 -- Operator O1
  …
  AND s.fkdN = dN.pk;                  -- Operator ON
  • Attribute S.attrC is the best classifier attribute for all operators • 8 other attributes in S are not correlated with operator selectivities • 100K records in the stream, 10K records in each dimension table • Lab dataset: explained later
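
As a hedged illustration of how such a stream could be generated (the match probabilities and everything except attrC, fkd1..fkdN, and the cardinalities are invented for the sketch):

    import random

    def make_stream_tuple(n_dims=6, dim_size=10_000):
        t = {"attrC": random.choice("abc")}
        # attrC alone decides how likely each foreign key is to find a
        # matching pk, so every join's selectivity correlates with attrC.
        p_match = {"a": 0.9, "b": 0.5, "c": 0.1}[t["attrC"]]
        for i in range(1, n_dims + 1):
            if random.random() < p_match:
                t[f"fkd{i}"] = random.randrange(dim_size)   # finds a pk
            else:
                t[f"fkd{i}"] = dim_size + i                 # dangling fk
        for i in range(1, 9):              # 8 uncorrelated attributes
            t[f"attr{i}"] = random.randrange(100)
        return t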

  16. Experimental Results: System, Metrics, Defaults • TelegraphCQ version 0.2 • Tao Linux release 1, 512 MB RAM, Pentium 4 at 2.4 GHz • Metrics: % improvement in running time and in routing calls • [Table: default parameter values]

  17. Experimental Results: Run-time Overheads • Routing overhead: time to perform routing decisions (SBR, CBR) • Learning overhead: time to update data structures (SBR, CBR) plus time to compute GainRatio (CBR only) • Overhead increase: 30%-45%

  18. Experimental Results: Varying Skew • One operator with selectivity A, all others with selectivity B • Skew is A-B; A varied from 5% to 95% • Overall selectivity: 5% • [Chart: % improvement for 6 joins]

  19. Experimental Results: Random Selectivities • Attribute attrC is correlated with the selectivities of the operators • Other attributes in stream tuples are not correlated with the selectivities • Random selectivities in each operator

  20. Experimental Results: Varying Aggregate Selectivity • Aggregate selectivity in the previous experiments was 5% or ~8% • Here we vary aggregate selectivity between 5% and 35% • Random selectivities within these bounds • [Chart: results for 6 joins]

  21. Experimental Results: Datasets • Lab dataset • Real data: 2.2M readings from 54 sensors (Intel Research Berkeley Lab) • Single stream with attributes: light, humidity, temperature, voltage, sensorID, year, month, day, hours, minutes, seconds

  22. Experimental Results: Challenging Adaptivity Experiment (1) • Using the Lab dataset • Example query: SELECT * FROM sensors WHERE light > 500; • Observations: • Very high variation in selectivity • Best classifier attributes change over time • No classifier attribute found for over half the time

  23. Experimental Results: Challenging Adaptivity Experiment (2) • Query:
  SELECT * FROM sensors
  WHERE light BETWEEN lowL AND highL
  AND temperature BETWEEN lowT AND highT
  AND humidity BETWEEN lowH AND highH
  AND voltage BETWEEN lowV AND highV;
  • lowX: a random number from the lower 25% of the attribute's domain • highX: a random number from the upper 25% (see the generation sketch below) • Average improvements over 50 different queries: • 8% in routing calls • 5% in execution time • 7% in time spent evaluating operators • 18% in routing calls until a tuple is dropped
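
A small sketch of how the 50 random queries could be generated; the domain bounds below are invented, since the slide only states that each lowX comes from the lower 25% and each highX from the upper 25% of its domain.

    import random

    # Hypothetical attribute domains for the Lab dataset.
    domains = {"light": (0.0, 2000.0), "temperature": (0.0, 50.0),
               "humidity": (0.0, 100.0), "voltage": (0.0, 5.0)}

    def random_range_query():
        preds = []
        for attr, (lo, hi) in domains.items():
            span = hi - lo
            low = lo + random.uniform(0.0, 0.25) * span     # lower 25%
            high = hi - random.uniform(0.0, 0.25) * span    # upper 25%
            preds.append(f"{attr} BETWEEN {low:.2f} AND {high:.2f}")
        return "SELECT * FROM sensors WHERE " + " AND ".join(preds) + ";"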

  24. Experimental Results: Varying Operator Cost • Ran the random query from the previous slide • Ran the query for periods with correlations • Varied operator cost by running CPU-intensive computations • [Chart: results for 4 operators]

  25. Conclusions • CBR eliminates the single-plan assumption • Explores the correlation between tuple content and operator selectivities • Adaptive learner of correlations with negligible overhead • Performance improvements over non-CBR routing • Selectivity changes much more than correlations do

  26. Acknowledgements • Sam Madden and Amol Deshpande for providing the Lab dataset • Sailesh Krishnamurthy, Amol Deshpande, Joe Hellerstein, and the rest of the TelegraphCQ team for providing TelegraphCQ and answering all my questions

  27. Q & A?

  28. Extra slides

  29. Motivational Example (1): Intrusion Detection Query • "Track packets with a destination address matching a prefix in table T, and containing the 100-byte and 256-byte sequences "0xa...8" and "0x7...b", respectively, as subsequences" •
  SELECT * FROM packets
  WHERE matches(destination, T)        -- Operator O1
  AND contains(data, "0xa...8")        -- Operator O2
  AND contains(data, "0x7...b");       -- Operator O3

  30. Motivational Example (2): Intrusion Detection Query • Assume: • costs are c3 > c1 > c2 • selectivities are σ3 > σ1 > σ2 • SBR routing converges to O2, O1, O3 • [Figure: under SBR, almost all tuples in the stream follow the route O2, O1, O3]

  31. Motivational Example (3): Intrusion Detection Query • Suppose an attack (O2 and O3) on a network whose prefix is not in T (O1) is underway: • σ2 and σ3 will be very high, σ1 will be very low • O1, O2, O3 will be the most efficient ordering for "attack" tuples • [Figure: under SBR, almost all tuples follow one route; under CBR, routing on addr sends attack tuples to O1 first while non-attack tuples follow a different route]

  32. Experimental Results: Varying Skew • One operator with selectivity A, all others with selectivity B • Skew is A-B; A varied from 5% to 95% • Overall selectivity: 5% • [Charts: results for 2 joins and 6 joins]

  33. Related Work
  Adaptivity in Stream Systems:
  • [Avnur+] Eddies: Continuously Adaptive Query Processing. SIGMOD'00.
  • [Arasu+] STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1).
  • [Babu+] Adaptive Ordering of Pipelined Stream Filters. SIGMOD'04.
  • [Babu+] StreaMon: An Adaptive Engine for Stream Query Processing. SIGMOD'04.
  • [Carney+] Monitoring Streams: A New Class of Data Management Applications. VLDB'02.
  • [Chandrasekaran+] TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. CIDR'03.
  • [Chandrasekaran+] PSoup: A System for Streaming Queries over Streaming Data. VLDB Journal 12(2).
  • [Deshpande] An Initial Study of Overheads of Eddies. SIGMOD Record 33(1).
  • [Deshpande+] Lifting the Burden of History from Adaptive Query Processing. VLDB'04.
  • [Madden+] Continuously Adaptive Continuous Queries over Streams. SIGMOD'02.
  • [Raman+] Using State Modules for Adaptive Query Processing. ICDE'03.
  • [Tian+] Tuple Routing Strategies for Distributed Eddies. VLDB'03.
  • [Viglas+] Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources. VLDB'03.
  AQP Surveys:
  • [Babu+] Adaptive Query Processing in the Looking Glass. CIDR'05.
  • [Hellerstein+] Adaptive Query Processing: Technology in Evolution. IEEE Data Eng. Bull. 23(2).
  Adaptivity in DBMS:
  • [Babu+] Proactive Re-optimization. SIGMOD'05.
  • [Graefe+] Dynamic Query Evaluation Plans. SIGMOD'89.
  • [Kabra+] Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. SIGMOD'98.
  • [Markl+] Robust Query Processing through Progressive Optimization. SIGMOD'04.
  • [Wong+] Decomposition: A Strategy for Query Processing. TODS 1(3).
  Adaptivity in Distributed Systems:
  • [Bouganim+] Dynamic Query Scheduling in Data Integration Systems. ICDE'00.
  • [Ives+] An Adaptive Query Execution System for Data Integration. SIGMOD'99.
  • [Ives+] Adaptive Query Processing for Internet Applications. IEEE Data Eng. Bull. 23(2).
  • [Ives+] Adapting to Source Properties in Processing Data Integration Queries. SIGMOD'04.
  • [Ives] Efficient Query Processing for Data Integration. PhD thesis.
  • [Ng+] Dynamic Query Re-Optimization. ICSSDM'99.
  • [Urhan+] Dynamic Pipeline Scheduling for Improving Interactive Performance of Online Queries. VLDB'01.
  • [Urhan+] Cost-Based Query Scrambling for Initial Delays. SIGMOD'98.
  • [Zhu+] Dynamic Plan Migration for Continuous Queries over Data Streams. SIGMOD'04.

  34. Related Work (cont'd)
  Acquisitional Systems:
  • [Madden+] The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD'03.
  • [Deshpande+] Model-Driven Data Acquisition in Sensor Networks. VLDB'04.
  • [Deshpande+] Exploiting Correlated Attributes in Acquisitional Query Processing. ICDE'05.
  Multiple Plans in the Same Query:
  • [Antoshenkov+] Query Processing and Optimization in Oracle Rdb. VLDB Journal 5(4).
  • [Bizarro+] Content-Based Routing: Different Plans for Different Data. VLDB'05.
  • [Polyzotis] Selectivity-Based Partitioning: A Divide-and-Union Paradigm for Effective Query Optimization. Unpublished.
  URDs and Modeling Uncertainty:
  • [Anderson+] Index Key Range Estimator. U.S. Patent 4,774,657.
  • [Babcock+] Towards a Robust Query Optimizer: A Principled and Practical Approach. SIGMOD'05.
  • [Chu+] Least Expected Cost Query Optimization: An Exercise in Utility. PODS'99.
  • [Ramamurthy+] Buffer-Pool Aware Query Optimization. CIDR'05.
  • [Viglas] Novel Query Optimization and Evaluation Techniques. PhD thesis.
  Optimization, Cardinality Estimation, Correlations:
  • [Selinger+] Access Path Selection in a Relational Database Management System. SIGMOD'79.
  • [Acharya+] Join Synopses for Approximate Query Answering. SIGMOD'99.
  • [Christodoulakis] Implications of Certain Assumptions in Database Performance Evaluation. TODS 9(2).
  • [Graefe] Query Evaluation Techniques for Large Databases. ACM Computing Surveys 25(2).
  • [Getoor+] Selectivity Estimation using Probabilistic Models. SIGMOD'01.
  • [Ioannidis+] On the Propagation of Errors in the Size of Join Results. SIGMOD'91.
  • [Ilyas+] CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. SIGMOD'04.
  • [Stillger+] LEO - DB2's LEarning Optimizer. VLDB'01.
  AI and Learning from Streams:
  • [Mitchell] Machine Learning. McGraw-Hill, 1997.
  • [Guha+] Data-Streams and Histograms. ACM Symposium on Theory of Computing, 2001.
  • [Domingos+] Mining High-Speed Data Streams. SIGKDD'00.
  • [Gehrke+] On Computing Correlated Aggregates over Continual Data Streams. SIGMOD'01.
