CONTROL Overview CONTROL group Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie, UC Berkeley Peter Haas, IBM Almaden
Context (wild assertions) • Value from information • The pressing problem in CS (?) (!!) • “Point” querying and data management is a solved problem • at least for traditional data (business data, documents) • “Big picture” analysis still hard
Data Analysis c. 1998 • Complex: people using many tools • SQL Aggregation (Decision Support Sys, OLAP) • AI-style WYGIWIGY systems (e.g. Data Mining, IR) • Both are Black Boxes • Users must iterate to get what they want • batch processing (big picture = big wait) • We are failing important users! • Decision support is for decision-makers! • Black box is the world’s worst UI
Black Box Begone! • Black boxes are bad • cannot be observed while running • cannot be controlled while running • These tools can be very slow • exacerbates previous problems • Thesis: • there will always be slow computer programs, usually data-intensive • fundamental issue: looking into the box...
Crystal Balls • Allow users to observe processing • as opposed to “lucite watches” • Allow users to predict future • Ideally, allow users to change future • online control of processing • The CONTROL Project: • online delivery, estimation, and control for data-intensive processes
Performance Regime for CONTROL • Online performance: • Maximize 1st derivative of the “mirth index” [Figure: plot against Time with a 100% mark, comparing CONTROL and Traditional]
Examples • Online Aggregation • Informix Dynamic Server • Enhanced by UCB students with Control algorithms • Lots of algorithmics, many fussy end-to-end system issues [Avnur, Hellerstein, Raman DMKD ’00] • IBM has ongoing project to do this in DB2 • IBM buys Informix (4/01) • Online Visualization • Visual enumeration & aggregation • Interactive data cleaning & analysis • Potter’s Wheel ABC • Online “enumeration” and discrepancy detection
Example: Online Aggregation SELECT AVG(gpa) FROM students GROUP BY college
Example: Online Data Visualization • In Tioga DataSplash
Decision-Support in DBMSs • Aggregation queries • compute a set of qualifying records • partition the set into groups • compute aggregation functions on the groups • e.g.: Select college, AVG(grade) From ENROLL Group By college;
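The three steps above can be sketched in a few lines of Python. This is only an illustration of batch-style aggregation, not code from any DBMS; the function name `batch_avg_by_group` and the sample `ENROLL` rows are made up for the example.

```python
from collections import defaultdict

def batch_avg_by_group(rows, predicate=lambda row: True):
    """Return {group: average}, but only after scanning every row (batch style)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for college, grade in rows:
        if not predicate((college, grade)):  # step 1: keep qualifying records
            continue
        sums[college] += grade               # step 2: partition into groups
        counts[college] += 1
    # step 3: apply the aggregation function to each group
    return {group: sums[group] / counts[group] for group in sums}

# Hypothetical rows standing in for the ENROLL table.
ENROLL = [("Engineering", 3.0), ("Letters", 3.5),
          ("Engineering", 4.0), ("Letters", 2.5)]
result = batch_avg_by_group(ENROLL)
```

Note that nothing is returned until the whole input has been consumed, which is the "big picture = big wait" behavior the talk argues against.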
Interactive Decision Support? • Precomputation • the typical “OLAP” approach (a.k.a. Data Cubes) • doesn’t scale, no ad hoc analysis • blindingly fast when it works • Sampling • makes real people nervous? • no ad hoc precision • sample in advance • can’t vary stats requirements • per-query granularity only
Online Aggregation • Think “progressive” sampling • a la images in a web browser • good estimates quickly, improve over time • Shift in performance goals • online mirth index • Shift in the science • UI emphasis drives system design • leads to different data delivery, result estimation • motivates online control
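A minimal sketch of the "good estimates quickly, improve over time" idea, assuming the input can be read in random order and using a large-sample (CLT) interval. The names `online_avg`, `report_every`, and the synthetic `students` data are assumptions for illustration; the sketch ignores finite-population corrections, so the interval does not shrink to zero when the scan finishes.

```python
import math
import random
from collections import defaultdict

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def online_avg(rows, report_every=100, seed=0):
    """Yield (tuples_seen, {group: (mean, half_width)}) as the scan progresses."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # random order stands in for random sampling
    n = defaultdict(int)
    total = defaultdict(float)
    total_sq = defaultdict(float)
    for seen, (group, value) in enumerate(rows, start=1):
        n[group] += 1
        total[group] += value
        total_sq[group] += value * value
        if seen % report_every == 0 or seen == len(rows):
            estimates = {}
            for g in n:
                mean = total[g] / n[g]
                if n[g] > 1:
                    var = max(0.0, (total_sq[g] - n[g] * mean * mean) / (n[g] - 1))
                    half = Z_95 * math.sqrt(var / n[g])
                else:
                    half = float("inf")  # one tuple gives no spread estimate
                estimates[g] = (mean, half)
            yield seen, estimates

# Synthetic data: two colleges, 500 students each.
rng = random.Random(1)
students = [(c, rng.uniform(2.0, 4.0)) for c in ("A", "B") for _ in range(500)]
snapshots = list(online_avg(students, report_every=100))
```

Each snapshot is something a UI could redraw; a user who is satisfied with an early interval can simply stop consuming the generator.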
Not everything can be CONTROLed • “needle in haystack” scenarios • the nemesis of any sampling approach • e.g. highly selective queries, MIN, MAX, MEDIAN • not useless, though • unlike presampling, users can get some info (e.g. max-so-far) • we advocate a mixed approach • explore the big picture with online processing • when you drill down to the needles, or want full precision, go batch-style • can do both in parallel
New Techniques • Online Reordering • gives control of group delivery rates • applicable outside the RDBMS setting • Ripple Join family of join algorithms • comes in naïve, block & hash • Statistical estimators & confidence intervals • for single-table & multi-table queries • for AVG, SUM, COUNT, STDEV • Leave it to Peter • Visual estimators & analysis
Online Reordering • users perceive data being processed over time • prioritize processing for “interesting” tuples • interest based on user-specified preferences • reorder dataflow so that interesting tuples go first • encapsulate reordering as pipelined dataflow operator [Figure: dataflow diagram with operators labeled R, S, T]
Context: an application of reordering • online aggregation • for SQL aggregate queries, give gradually improving estimates • with confidence intervals • allow users to speed up estimate refinement for groups of interest • prioritize for processing at a per-group granularity SELECT AVG(gpa) FROM students GROUP BY college
Framework for Online Reordering • want no delay in processing • in general, reordering can only be best-effort • typically process/consume slower than produce • exploit throughput difference to reorder • two aspects • mechanism for best-effort reordering • reordering policy [Figure: labels produce (network xfer., stream acddbadb...), reorder (user interest), process f(t) (stream abcdabc..), consume]
Juggle mechanism for reordering • two threads -- prefetch from input -- spool/enrich from auxiliary side disk • juggle data between buffer and side disk • keep buffer full of “interesting” items • getNext chooses best item currently on buffer • getNext, enrich/spool decisions -- based on reordering policy • side disk management • hash index, populated in a way that postpones random I/O [Figure: labels produce, prefetch, buffer, getNext, process/consume, with spool and enrich between buffer and side disk]
Reordering policies • GOAL: a “good” permutation of items $t_1 \ldots t_n$ to $t_{\pi_1} \ldots t_{\pi_n}$ • quality of feedback for a prefix $t_{\pi_1} t_{\pi_2} \ldots t_{\pi_k}$: $\mathrm{QOF}(UP(t_{\pi_1}), UP(t_{\pi_2}), \ldots, UP(t_{\pi_k}))$, UP = user preference • determined by application • goodness of reordering: $d\,\mathrm{QOF}/dt$ • implication for juggle mechanism • process gets item from buffer that increases QOF the most • juggle tries to maintain buffer with such items [Figure: QOF vs. time]
QOF in Online Aggregation • avg weighted confidence interval • preference acts as weight on confidence interval (Recall from the Central Limit Theorem that the sample mean’s confidence-interval half-width is proportional to $s/\sqrt{n}$. Conservative (Hoeffding) confidence intervals also have a $\sqrt{n}$ in the denominator. So…) $\mathrm{QOF} = -\sum_i UP_i/\sqrt{n_i}$, $n_i$ = number of tuples processed from group $i$ • process pulls items from group with $\max_i UP_i/(n_i\sqrt{n_i})$ • desired ratio of group $i$ tuples on buffer = $UP_i^{2/3} / \sum_j UP_j^{2/3}$ • juggle tries to maintain this by enrich/spool
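A rough sketch of this policy, assuming QOF is the negated sum of preference-weighted interval widths, i.e. each group contributes UP_i / sqrt(n_i). The marginal gain from one more tuple of group i is then proportional to UP_i / (n_i * sqrt(n_i)). The function names and dictionary-based bookkeeping are hypothetical, not the Informix implementation.

```python
import math

def next_group(preference, processed, available):
    """Pick the group whose next tuple raises QOF = -sum(UP_i / sqrt(n_i)) the most."""
    def gain(group):
        n = processed[group]
        if n == 0:
            return math.inf  # an unseen group has an unbounded interval
        return preference[group] / (n * math.sqrt(n))
    candidates = [g for g in available if available[g] > 0]
    return max(candidates, key=gain) if candidates else None

def buffer_shares(preference):
    """Target fraction of the buffer per group: UP_i^(2/3) / sum_j UP_j^(2/3)."""
    weights = {g: up ** (2.0 / 3.0) for g, up in preference.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical preferences: the user cares 8x more about group A than B.
preference = {"A": 8.0, "B": 1.0}
shares = buffer_shares(preference)
```

The 2/3 exponent follows from setting the marginal gains equal across groups: UP_i / n_i^(3/2) constant implies n_i proportional to UP_i^(2/3).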
Other QOF functions • rate of processing (for a group) $\propto$ preference • $\mathrm{QOF} = -\sum_i (n_i - n\,UP_i)^2$ (variance from ideal proportions), $n$ = total tuples processed • process pulls items from group with $\max_i (n\,UP_i - n_i)$ • desired ratio of group $i$ tuples in buffer = $UP_i$
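The rate-proportional policy can be sketched the same way, assuming preferences are normalized to sum to 1 and n is the total number of tuples processed so far. The helper names and the tie-breaking rule (first group in dictionary order) are assumptions made for the example.

```python
def next_group_by_rate(preference, processed, available):
    """Pick the group furthest below its ideal share n * UP_i."""
    total_pref = sum(preference.values())
    n = sum(processed.values())
    def deficit(group):
        ideal = n * preference[group] / total_pref  # normalized UP_i
        return ideal - processed[group]
    candidates = [g for g in available if available[g] > 0]
    return max(candidates, key=deficit) if candidates else None

def simulate(preference, steps):
    """Run the policy with an unlimited supply of every group."""
    processed = {g: 0 for g in preference}
    available = {g: 1 for g in preference}  # always at least one tuple on hand
    for _ in range(steps):
        processed[next_group_by_rate(preference, processed, available)] += 1
    return processed

# Hypothetical preferences 3:1; processing rates should settle at 3:1 as well.
counts = simulate({"A": 3.0, "B": 1.0}, steps=400)
```

With an unlimited supply the processed counts track the preference ratio; with a real buffer the policy is only best-effort, since a preferred group may not be available at the moment it is wanted.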
Results: Reordering in Online Aggregation • implemented in Informix UDO server • experiments with modified TPC-D queries • questions: • how much throughput difference is needed for reordering? • can reordering handle skewed data? • one stress test: skew, very small proc. cost • index-only join • 5 orderpriorities, zipf distribution SELECT AVG(o_totalprice), o_orderpriority FROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey) GROUP BY o_orderpriority [Figure: pipeline of index scan, juggle, process, consume]
Performance results [Figure: two plots against time, confidence interval and # tuples processed, for groups A, C, E] • 3 times faster for interesting groups • 2% completion time overhead
Ripple Joins (Haas & Hellerstein, SIGMOD 99) [Figure: R × S grids comparing Traditional and Ripple] • Good confidence intervals for joins of samples • Vs. samples of joins! • Requires “Cross-Product CLT” • Progressively Refining join: • ever-larger rectangles in R × S • we can update confidence intervals at “corners” • comes in loop, index and hash flavors • Benefits: • sample from both relations simultaneously • “animation rate”: • Goal for the next “corner”, determines an optimization problem based on observations so far • Old-fashioned systems are one extreme • adaptively tune “aspect ratio” for next “corner” • sample from higher-variance relation faster • intimate relationship between delivery and estimation
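A simplified sketch of the "ever-larger rectangles" idea: a square ripple join (aspect ratio fixed at 1:1) over two in-memory lists, producing a scaled running estimate of SUM at each corner. It assumes both inputs are already in random order, omits confidence intervals and the adaptive aspect ratio entirely, and uses made-up names (`ripple_join_sum`, `join_pred`, `value`).

```python
import random

def ripple_join_sum(R, S, join_pred, value):
    """Square ripple join: yield (k, estimated SUM over R join S) at each corner."""
    seen_r, seen_s = [], []
    running = 0.0
    for k in range(1, min(len(R), len(S)) + 1):
        r_new, s_new = R[k - 1], S[k - 1]
        # Join the new R tuple with every S tuple seen before this step.
        for s in seen_s:
            if join_pred(r_new, s):
                running += value(r_new, s)
        seen_r.append(r_new)
        # Join the new S tuple with every R tuple seen so far, including r_new.
        for r in seen_r:
            if join_pred(r, s_new):
                running += value(r, s_new)
        seen_s.append(s_new)
        # Scale the k-by-k rectangle up to the full |R|-by-|S| cross product.
        yield k, running * (len(R) * len(S)) / (k * k)

# Hypothetical inputs: (join key, payload) pairs, shuffled to mimic random order.
R = [(i % 5, float(i)) for i in range(20)]
S = [(j % 4, 1.0) for j in range(20)]
random.Random(0).shuffle(R)
random.Random(1).shuffle(S)
estimates = list(ripple_join_sum(R, S, lambda r, s: r[0] == s[0], lambda r, s: r[1]))
```

Each step covers every pair in the current k-by-k rectangle exactly once, so when the inputs have equal length the last estimate equals the exact answer.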
Aspect Ratios • Consider an extreme example: • In general, to get to the next corner: • Need a cost model parameterized by relation • Different for block and hash • “Benefit”: change in confidence interval • An online linear optimization problem • Arguments about estimates converging quickly, stabilizing…
Fussy Implementation Details • How to implement as an iterator? Issues: • Need cursors on all inputs (as usual) • Need to maintain aspect ratios • Need to maintain current “inner” & cursor • I.e. the relation currently being scanned • Need to know current sampling step • To know how far to scan current “inner” • Need to know “starter” for next step • Determines length of scan (see pic), end of sampling step • And pass that role along at EOF
Ripple Join Performance • Too lazy to fetch graphs, but… • Typical orders of magnitude benefit vs. batch…
CONTROL Lessons • Dream about UIs, work on systems • User needs drive systems design! • Systems and statistics intertwine • “what unlike things must meet and mate” • Art, Herman Melville • Sloppy, adaptive systems a promising direction
Questions • Where else do these lessons apply? • Outside of data analysis, manipulation • Systems people think a lot about interfaces (APIs)… • Encapsulation, narrow interfaces … • In the CONTROL regime, how do you design these APIs and build systems? • Ubiquitous computing: • Is it about portable computing and point access/delivery? • Or sensors/actuators, dataflow, big-picture queries?
More? • CONTROL: http://control.cs.berkeley.edu • Overview: IEEE Computer, 8/99 • Telegraph: http://db.cs.berkeley.edu/telegraph
Backup slides • The following slides may be used to answer questions...
Sampling • Much is known here • Olken’s thesis • DB Sampling literature • more recent work by Peter Haas • Progressive random sampling • can use a randomized access method (watch dups!) • can maintain file in random order • can verify statistically that values are independent of order as stored
Estimators & Confidence Intervals • Conservative Confidence Intervals • Extensions of Hoeffding’s inequality • Appropriate early on, give wide intervals • Large-Sample Confidence Intervals • Use Central Limit Theorem • Appropriate after “a while” (~dozens of tuples) • linear memory consumption • tight bounds • Deterministic Intervals • only useful in “the endgame”
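To make the contrast between conservative and large-sample intervals concrete, here is a small sketch that computes both half-widths for a sample mean. It uses the basic Hoeffding inequality for values bounded in [lo, hi], not the extensions mentioned above, and the function names are assumptions for the example.

```python
import math
from statistics import NormalDist

def hoeffding_half_width(n, lo, hi, confidence=0.95):
    """Conservative half-width from P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2 / (hi - lo)^2)."""
    delta = 1.0 - confidence
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def clt_half_width(n, sample_std, confidence=0.95):
    """Large-sample half-width z * s / sqrt(n) from the Central Limit Theorem."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * sample_std / math.sqrt(n)

# Hypothetical GPA-like data: values in [0, 4], sample standard deviation 0.5.
wide = hoeffding_half_width(100, 0.0, 4.0)
tight = clt_half_width(100, 0.5)
```

Both widths shrink as 1/sqrt(n), but the Hoeffding bound depends only on the value range and so stays valid from the first tuple, while the CLT interval needs enough tuples for the variance estimate and the normal approximation to be reasonable.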