Eddies: Continuously Adaptive Query Processing Based on a SIGMOD 2000 paper and talk by Avnur and Hellerstein.
State of the Art in Query Optimization • Given: • Database state and statistics known a priori • One (short) user query to process • Query may be run only once • Query Processing: • Decide a priori on a (static) query plan • Run query using this one plan • Also: • Possibly update statistics sometimes (in steady state)
Adaptive Systems: General Flavor Repeat: • Observe (model) environment • Use observation to choose behavior • Take action
Adaptivity in Current DBs • Limited & coarse-grained Repeat: • Observe (model) environment • runstats (once per week!!): model changes in data • Use observation to choose behavior • query optimization: fixes a single static query plan • Take action • query execution: blindly follow plan
Query Optimization • Adaptivity at a per-week frequency! • Not suited for volatile environments
A Networking Problem!? • Networks do dataflow! • Significant history of adaptive techniques • E.g. TCP congestion control • E.g. routing • But traditionally at a much lower level of function • Ship bitstreams • Minimal, fixed code • Lately, moving up the food chain? • app-level routing • active networks
Varying … • Computing resources • Data flows unpredictably from sources • Code performs unpredictably along flows • Continuous volatility due to many decentralized systems • Data characteristics • Distributions • Burstiness • User preferences • Which results should arrive fast • How much data is needed
Toward Continuous Adaptivity • Need much more frequent adaptivity • Goal: adapt per tuple of each relation?? • The traditional runstats-optimize-execute loop is far too coarse-grained • So, continuously perform all 3 functions, at runtime • Aim for adaptivity over best-case performance (as the latter never lasts for long)
Road Map • Adaptive Query Processing • Intra-join adaptivity • Synchronization Barriers • Moments of Symmetry • Eddies • Encapsulated, adaptive dataflow
Adaptable Operators and Plans • Moments of symmetry = query processing stages during which pipelined query operators or inputs can be easily reordered (with no or minimal state management) • Synchronization barriers = points that require inputs from different sources to be coordinated, possibly throttling processing to the rate of the slower input • We need “good” operators.
Adaptable Joins, Issue 1 • Synchronization Barrier: merge join • Right input frozen, waiting for left • Can’t adapt while waiting for barrier! • So, favor joins that have: • no barriers or seldom barriers • at worst, adaptable barriers (Figure: merge join over two sorted inputs, one far ahead of the other.)
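To make the barrier concrete, here is a minimal Python sketch of a merge join (illustrative only, not from the paper; names and the tuple encoding are assumptions, and duplicate keys are ignored for brevity). Whichever input holds the larger current key forces the other to sit frozen until it catches up:

```python
# Minimal merge-join sketch illustrating the synchronization barrier.
# Assumes both inputs arrive sorted on the join key and keys are distinct.

def merge_join(left, right):
    """Yield matching pairs from two sorted streams of (key, payload) tuples."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)    # right input is frozen here...
        elif l[0] > r[0]:
            r = next(right, None)   # ...or the left one is: the barrier
        else:
            yield (l, r)
            l, r = next(left, None), next(right, None)

R = [(2000, 'a'), (2002, 'b'), (2004, 'c')]
S = [(2000, 'x'), (2001, 'y'), (2004, 'z')]
print(list(merge_join(R, S)))  # [((2000,'a'),(2000,'x')), ((2004,'c'),(2004,'z'))]
```

While either branch of the `if` is advancing one cursor, the other input cannot be consumed or reordered — exactly the adaptivity blocker the slide describes.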
Adaptable Joins, Issue 2 • Would like to reorder in-flight (pipelined) joins • Base case: swap inputs to a join?? • Moment of symmetry: • inputs can be swapped with no/little state management • Aim for frequent moments of symmetry ⇒ more frequent adaptivity
Adaptable Joins, Issue 2 • Moments of Symmetry • Suppose you can adapt an in-flight query plan • How would you do it? • Base case: reorder inputs R and S of a single join • Nested loops join: cleaner if you wait till the end of the inner loop • Hybrid hash: reorder while “building”?
Moments of Symmetry, cont. • Moment of Symmetry: • Can swap join inputs w/o state modification • Nested Loops join: end of each inner loop • Hybrid Hash join: never • Sort-Merge join: essentially always • More frequent moments of symmetry ⇒ more frequent adaptivity
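A sketch of the nested-loops case (illustrative code; choose_r_as_outer is a hypothetical policy hook, not anything from the paper). The join walks the R × S cross-product matrix; at the end of each inner loop its only state is a pair of cursors, so the inner/outer roles can be re-decided at every symmetry point:

```python
# Nested-loops join that may swap inner and outer at each moment of
# symmetry, i.e. at the end of every inner loop. Its only state is the
# pair of cursors (r_done, s_done) over the cross-product matrix.

def adaptive_nested_loops(R, S, pred, choose_r_as_outer):
    r_done, s_done = 0, 0          # prefixes of R / S already fully joined
    while r_done < len(R) and s_done < len(S):
        if choose_r_as_outer():    # policy re-consulted at each symmetry point
            o = R[r_done]
            for s in S[s_done:]:   # one inner loop over the remaining S
                if pred(o, s):
                    yield (o, s)
            r_done += 1            # a full "row" of the matrix is covered
        else:
            o = S[s_done]
            for r in R[r_done:]:   # one inner loop over the remaining R
                if pred(r, o):
                    yield (r, o)
            s_done += 1            # a full "column" of the matrix is covered
```

Each step peels one full row or column off the not-yet-joined rectangle, so every pair is produced exactly once no matter how often the policy flips.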
Joins for Adaptivity • Pipelined hash join (hash ripple or Xjoin) • No synchronization barriers • Continuous symmetry • Good for equi-join • Simple (or block) ripple join • Synchronization barriers at “corners” • Moments of symmetry at “corners” • Good for non-equi-join • When symmetry: at corners, i.e., for each “new” tuple, once it has been processed using the given operator’s state
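For the equi-join case, a sketch of the doubly-pipelined idea behind pipelined hash joins like hash ripple and Xjoin (single-threaded and illustrative; the interleaved-stream encoding is an assumption). Each arriving tuple is built into its own side's hash table and immediately probed against the other, so there is no barrier and every tuple boundary is a moment of symmetry:

```python
from collections import defaultdict

# Symmetric (pipelined) hash join sketch: build and probe on every tuple.
def symmetric_hash_join(stream):
    """stream yields ('R', key, payload) or ('S', key, payload) in any order."""
    tables = {'R': defaultdict(list), 'S': defaultdict(list)}
    for side, key, payload in stream:
        other = 'S' if side == 'R' else 'R'
        tables[side][key].append(payload)      # build into own table
        for match in tables[other][key]:       # probe the other side's table
            yield (payload, match) if side == 'R' else (match, payload)

pairs = list(symmetric_hash_join([('R', 1, 'r1'), ('S', 1, 's1'), ('S', 1, 's2')]))
# pairs == [('r1', 's1'), ('r1', 's2')]
```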
Beyond Binary Joins • Think of swapping “inners” • Can be done at a global moment of symmetry • Intuition: like an n-ary join • Except that each pair can be joined by a different algorithm! • So… • Need to introduce n-ary joins to a query engine
Need well-behaved join algorithms • Pipelining • Avoid synch barriers • Frequent moments of symmetry
Continuous Adaptivity Goal: Eddies • Avoid need for traditional cost estimation • Avoid generation of a ‘good’ query plan
Continuous Adaptivity: Eddies • A pipelining n-ary tuple-routing iterator (just like join or sort) • works well with ops that have frequent moments of symmetry
Continuous Adaptivity: Eddies • Adjusts flow adaptively • Tuples flow in different orders • Visit each op once before output
Routing: Eddies • Naïve routing policy: • All ops fetch from eddy as fast as possible • Previously-seen tuples precede new tuples
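A sketch of the eddy's routing loop under this naïve policy (the operator objects and their apply() method, returning the processed tuple or None when it is dropped, are hypothetical; the real implementation tracks per-tuple ready/done bitmaps). A tuple is emitted only after every operator has seen it, and recycled tuples jump the queue ahead of new ones:

```python
from collections import deque
import random

def naive_eddy(source, operators, emit):
    # Each queue entry is (tuple, set of operators that have processed it).
    queue = deque((t, frozenset()) for t in source)
    while queue:
        t, done = queue.popleft()
        ready = [op for op in operators if op not in done]
        if not ready:
            emit(t)                               # seen by every operator
            continue
        op = random.choice(ready)                 # naive: any ready operator
        out = op.apply(t)
        if out is not None:                       # None means tuple was dropped
            queue.appendleft((out, done | {op}))  # in-flight tuples go first
```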
Schedule: Grab when Ready? • Two expensive selections s1 and s2 • Selectivity(s1) = Selectivity(s2) = 50% • Cost(s2) = 5. • Vary Cost(s1). • What do we expect? • Does it make a difference at all?
Cost Factor? • Two expensive selections, 50% selectivity • Cost(s2) = 5. Vary cost of s1. • Result: favors the faster operator
But is it Enough? • Given two expensive selections: • Cost same, say cost(s1)=cost(s2)=5 • Selectivity(s2) = 50%. • Vary selectivity of s1. • Does that make a difference?
Selectivity-based? • Two expensive selections, cost 5 • Selectivity(s2) = 50%. Vary selectivity of s1.
Schedule: Selectivity-based? • Conclusion: running the heavy tuple shedder (the low-selectivity operator) early is good.
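A back-of-the-envelope check of that conclusion, using the standard expected-cost formula for ordering two independent selections (the numbers are chosen to match the slides' setup, with a hypothetical 10% selectivity for s1):

```python
# Expected per-tuple cost of applying selection a before selection b:
#   cost(a) + selectivity(a) * cost(b)
def expected_cost(a, b):
    return a['cost'] + a['sel'] * b['cost']

s1 = {'cost': 5, 'sel': 0.1}   # heavy tuple shedder: drops 90% of tuples
s2 = {'cost': 5, 'sel': 0.5}

print(expected_cost(s1, s2))   # s1 first: 5 + 0.1 * 5 = 5.5
print(expected_cost(s2, s1))   # s2 first: 5 + 0.5 * 5 = 7.5
```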
How to choose? • If we knew all selectivities and all costs (and they were static), maybe we could pick the best overall “schedule” here. • Otherwise, we need a cheap means to observe their changes • And, we need a means to react in a simple manner based on those perceived changes
An Aside: How to choose? • A machine learning problem? • Each agent pays off differently • Explore or exploit? • Heuristics? • Sometimes want to randomly choose one • Usually want to go with the best • If probabilities are stationary, dampen exploration over time
Eddies with Lottery Scheduling • Operator gets 1 ticket when it takes a tuple • Favor operators that run fast (low cost) • Operator loses a ticket when it returns a tuple • Favor operators that drop tuples (low selectivity) • Winner? • Large number of tickets == measure of goodness • Lottery Scheduling: • When two operators vie for the same tuple, hold a lottery • Never let any operator go to zero tickets • Support occasional random “exploration”
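Putting the pieces together, a sketch of lottery-based routing (same hypothetical operator API as the naïve eddy sketch above; this illustrates the policy, it is not the paper's code). Tickets go up when an operator takes a tuple and down when it returns one, so fast, selective operators win more lotteries, and the floor of one ticket preserves occasional exploration:

```python
from collections import deque
import random

def run_lottery(tickets, ready):
    """Pick one ready operator, weighted by its ticket count."""
    return random.choices(ready, weights=[tickets[op] for op in ready], k=1)[0]

def lottery_eddy(source, operators, emit):
    tickets = {op: 1 for op in operators}         # never allowed to hit zero
    queue = deque((t, frozenset()) for t in source)
    while queue:
        t, done = queue.popleft()
        ready = [op for op in operators if op not in done]
        if not ready:
            emit(t)                               # tuple has seen every op
            continue
        op = run_lottery(tickets, ready)
        tickets[op] += 1                          # credit: op took a tuple
        out = op.apply(t)
        if out is not None:                       # op returned the tuple
            tickets[op] = max(1, tickets[op] - 1) # debit, floored at 1
            queue.appendleft((out, done | {op}))
```

Dropped tuples are never debited, so low-selectivity operators accumulate net tickets; fast operators simply take tuples more often. Both effects fold cost and selectivity into a single measure of goodness.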
Lottery-Based Eddy • Two expensive selections, cost 5 • Selectivity(s2) = 50%. Vary selectivity of s1.
Summary • Eddies: Continuously Adaptive Dataflow • Suited for volatile performance environments • Changes in operator/machine performance • Changes in selectivities (e.g., with sorted inputs) • Changes in data delivery • Currently adapts join order • Competitive methods to adapt access & join methods? • Requires well-behaved join algorithms • Pipelining • Avoid synch barriers • Frequent moments of symmetry • The end of the runstats/optimizer/executor boundary! • At best, System R is good for “hints” on initial ticket distribution