Interactive Query Processing

Interactive Query Processing Vijayshankar Raman Computer Science Division University of California at Berkeley

Motivation: Natureof querying • querying = extracting information from data sets • different techniques in different settings • intrinsically slow • significant user-system interaction • info. seekers work iteratively, gradually refining requests based on feedback [O’day and Jeffries ‘93, Koenemann and Belkin ‘96] Interactive Query Processing

Problems with traditional solutions query process exact answer • mismatch between system functionality and iterative mode of HCI • “black box” functionality • batch processing • frustrating delays in iterative process Query: find average grade of students in each college Interactive Query Processing

Interactive processing • HCI requirements • users must get continual feedback on results of processing • allow users to control processing based on prior feedback query process exact answer Interactive Query Processing

Interactive processing results • HCI requirements • users must get continual feedback on results of processing • allow users to control processing based on prior feedback • performance goals • not to minimize time to give complete results • give continually improving partial results • adapt to dynamically specified performance goals result estimates Interactive Query Processing

Background: Traditional Database Query Processing  T T • data access: scan-based ( ) or index-based ( ) • selection () -- filter tuples based on condition • join ( ) -- apply filter on cross-product of the inputs • can use index if available (“index-join”) • else dynamically build hash-tables on inputs (“hash-join”) • query optimizer • chooses plan: operator implementations and ordering • cost model based on pre-computed summary statistics declarative query select R.a, S.b, T.c from R, S, T where <conditions> query executor query optimizer S query plan R R Interactive Query Processing

My Research • interactive query processing • interactive data cleaning (Potter’s Wheel) • assume: dataflow thru pipelined operators • support for dynamic user control in traditional query proc. architectures • adaptively giving partial results in response to user control • still more aggressive adaptation • state modules T S R S R T S R T T Interactive Query Processing

Talk Outline • motivation and context • support for dynamic user control in traditional query proc. architectures • architecture for adaptively generating partial results • policy for generating partial results • user interface for displaying partial results • impact on routing • wrapup Interactive Query Processing

Design goals in supporting user control • make minimal change to system architecture • must be independent of particular query processing algorithms • no delay in processing Interactive Query Processing

Online Reordering (Raman et al. ’99,’00) • users perceive data being processed over time • prioritize processing for “interesting” tuples • interest based on user-specified preferences • reorder dataflow so that interesting tuples go first • encapsulate reordering as pipelined dataflow operator T T S R S R Interactive Query Processing

Context: an application of reordering • online aggregation [Hellerstein/Haas/Wang ‘97, Haas/Hellerstein ‘99 ] • for SQL aggregate queries, give gradually improving estimates • with confidence intervals • allow users to speed up estimate refinement for groups of interest • prioritize for processing at a per-group granularity SELECT AVG(gpa) FROM students GROUP BY college Interactive Query Processing

Online Aggregation Screenshot SELECT AVG(gpa) FROM students GROUP BY college Interactive Query Processing

Framework for Online Reordering consume produce • want no delay in processing • in general, reordering can only be best-effort • typically process/consume slower than produce • exploit throughput difference to reorder • two aspects • mechanism for best-effort reordering • reordering policy network xfer. acddbadb... f(t) abcdabc.. process reorder user interest Interactive Query Processing

Juggle mechanism for reordering process/consume • two threads -- prefetch from input -- spool/enrich from auxiliary side disk • juggle data between buffer and side disk • keep buffer full of “interesting” items • getNext chooses best item currently on buffer • getNext, enrich/spool decisions -- based on reordering policy • side disk management • hash index, populated in a way that postpones random I/O getNext buffer prefetch enrich spool produce side disk Interactive Query Processing

Reordering policies • quality of feedback for a prefix t1t2…tk QOF(UP(t1), UP(t2), … UP(tk )), UP = user preference • determined by application • goodness of reordering: dQOF/dt • implication for juggle mechanism • process gets item from buffer that increases QOF the most • juggle tries to maintain buffer with such items “good” permutation of items t1…tn to t1…tn GOAL: QOF time Interactive Query Processing

QOF in Online Aggregation • avg weighted confidence interval • preference acts as weight on confidence interval • QOF= UPi /ni , ni= number of tuples processed from group i • process pulls items from group with maxUPi /nini • desired ratio of group i tuples on buffer =UPi2/3/ UPj2/3 • juggle tries to maintain this by enrich/spool Interactive Query Processing

Other QOF functions • rate of processing (for a group)  preference • QOF=  (ni - nUPi)2(variance from ideal proportions) • process pulls items from group with max(nUPi - ni ) • desired ratio of group i tuples in buffer = UPi Interactive Query Processing

Results: Reordering in Online Aggregation • implemented in Informix UDO server • experiments with modified TPC-D queries • questions: • how much throughput difference is needed for reordering • can we reorder handle skewed data • one stress test: skew, very small proc. cost • index-only join • 5 orderpriorities, zipf distribution consume SELECT AVG(o_totalprice), o_orderpriority FROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey) GROUP BY o_orderpriority juggle index scan process Interactive Query Processing

Performance results # tuples processed time • without reordering Interactive Query Processing

Performance results # tuples processed time • 3 times faster for interesting groups • overhead: 2% completion time, 1 extra disk Interactive Query Processing

Performance results confidence interval E C A time Interactive Query Processing

Overall Findings • higher processing costs • index/hash join, subquery, … • reordering easy • very low processing costs • juggle constrained by density of interesting tuples • outlier groups hard to speed up • better to use index stride [Hellerstein/Haas/Wang ‘97] • needs pre-computed index • reordering becomes easier over time • question to answer: • where to place juggle? Interactive Query Processing

Outline • motivation and context • support for dynamic user control in traditional query proc. architectures • architecture for adaptively generating partial results • policy for generating partial results • user interface for displaying partial results • impact on routing • wrapup Interactive Query Processing

Incremental results Result Space • traditional arch. also generate continual result tuples • arises from continual dataflow thru pipelining operators • much work on pipelining joins [Wilschut/Apers’91, Haas/Hellerstein’99, Ives et al.’99, Urhan/Franklin’00] • this is too rigid • especially in distributed envirorments Interactive Query Processing

Context: Query processing in Telegraph • Telegraph: adaptive dataflow system to query diverse, distributed sources • much data available as services over Internet • currently only accessible by browse/search/forms • want to combine this data through queries • examples: • campaign finance information (Election 2000)Federal Election Commision Yahoo Home PricesCensus APBnews Crime Ratings Maps IMDB • restaurant information SwitchBoard Fodors S.F.Chronicle MapQuest Health inspection reports Interactive Query Processing

Partial results • complete result tuples too rigid • source latencies high, diverse • dynamic source variations, delays • query does not capture user desires • non-expert and even expert users: query too broad Interactive Query Processing

Partial results • complete tuples too rigid • source latencies high, diverse • dynamic source variations, delays • query does not capture user desires • non-expert and even expert users: query too broad • want to process queries flexibly • give partial result tuples asap • adapt dynamically to user preferences and source variations Interactive Query Processing

Correctness of partial results • some columns essential for UI • e.g. group-by columns / sort-by columns • for maximum flexibility -- outer-join semantics • good idea for Web sources • or -- strict join semantics • no partial results without ensuring match exists • key constraints helpful • aggregates: • update early, and compensate later • statistical guarantees for aggregates difficult • fanouts unknown (can be 0!) • key constraints helpful Interactive Query Processing

Dynamic query plans Eddy T R S T R S S R T • Eddy [Avnur and Hellerstein 2000] • router for directing data thru modules • minimize completion time • adaptively choosing join order for arbitrary tuple • all partial tuples generable Interactive Query Processing

Partial Results in Dynamic Plans Eddy . . . modules • my focus: continual partial results • dynamically adapt dataflow to suit user preferences • Eddy must decide • what tuple to route next • where to route it • based on user preferences and module properties • need routing policy and a reordering mechanism • eddy memory buffer and module queues bounded . . . inputs R S P Interactive Query Processing

Prioritizing tuples copy . . . modules • enhance Eddy with Juggle • reorder exploiting slowness of modules • synergy -- juggle location problem solved! • juggle everywhere, to the extent of throughput difference R S P . . . inputs Interactive Query Processing

Routing and reordering policy • GOAL: at any time, route to max. dQOF/dt = benefit of sending tuple to module / cost • cost: estimate data rates to/from module • benefit: dependent on application and UI • how partial results impact the UI • user preferences GOAL: QOF time Interactive Query Processing

Outline • motivation and context • online reordering for user prioritization of partial results • architecture for adaptively generating partial results • policy for generating partial results • Telegraph UI: displaying results and inferring preferences • experimental results • wrapup Interactive Query Processing

Telegraph UI • screenshot Interactive Query Processing

Getting user preferences • infer from navigation • row/column scrolling, group drill down and rollup • prioritize visible rows/columns • “query evolution” • subset “one-size-fits-all” queries • future work: query expansion • explicit • up/down buttons on columns • at the cell level -- need for some expensive sources • map to QOF metric on partial results Interactive Query Processing

QOF: Benefit of a partial result • depends on user preferences • benefit of updating a cell in output • row priority x column weight x cell resolution • incremental cell resolution: how much does one extra update add to the cell’s value • scalars -- 1 • aggregations -- change in confidence interval • informing user about execution progress • convey cell resolution on UI • future work Interactive Query Processing

QOF: Benefit of Routing a Tuple T M t • route tuple according to expected benefit and cost • benefit of sending a tuple t to a module M and forming set T = cells c Tbenefit of updating c Interactive Query Processing

Throughput difference Bush Contributors Income (index) AFB Crime Ratings (index) 40000 Bush Contributors 30000 20000 Number of tuples read 10000 Income Crime Ratings 0 200 400 600 time (s) Interactive Query Processing

Benefit of giving partial results Bush Contributors Income (index) AFB Crime Ratings (index) 40000 Bush Contributors 30000 #partial results number of partial results #complete results 20000 10000 Income Crime Ratings 0 200 400 600 time (s) Interactive Query Processing

Effect of Delays Bush Contributors Income (index) AFB Crime Ratings (index) with delay in AFB Crime Ratings number of partial results 10000 1000 delay number of complete results 100 10 0 200 400 600 time(s) Interactive Query Processing

Prioritizing particular values scroll scroll • SELECT AVG(Income) FROM Bush Donors, Census GROUP BY BushDonors.State • scrolling: {AZ, AR, CA, CO, CT}, {LA, KY, MA, MD, MI}, {TX, UT, VA, VI, VT} 600 400 # Aggr. Updates 200 0 100 200 300 400 time (s) Interactive Query Processing

Distribution of contributions 5000 4000 3000 # Bush Contributors 2000 1000 State Interactive Query Processing

Outline • motivation and context • online reordering for user prioritization of partial results • architecture for adaptively generating partial results • policy for generating partial results • more aggressive adaptation: state modules • wrapup Interactive Query Processing

Granularity of Query Operators P hash jn ind.jn Eddy R S R P S hash jn • relational operators: logical abstractions • encapsulate multiple physical effects • inflexible in handling unexpected changes • cannot gracefully adapt • access method/data source selection • join algorithm selection • resource allocation: e.g. memory • delays [Query Scrambling, XJoin] • want to encapsulate at level of physical operators Interactive Query Processing

State Modules Elevator Pitch Eddy P R P P • isolate state in StateModules • work sharing • routing flexibility • query execution = routing • adapt access methods, join algorithms gracefully • directly measure & adapt resource consumption ?? hash jn ind.jn Eddy R R P Interactive Query Processing

Outline • motivation and context • online reordering for user prioritization of partial results • architecture for generating more aggressive partial results • policy for generating partial results • more aggressive adaptation: state modules • wrapup Interactive Query Processing

Related Work • information retrieval • ranked retrieval, relevance feedback • search strategies, Berry Picking • incremental query processing • pipelining hash joins (Haas/Hellerstein`99, Ives et al.`99, Urhan/Franklin`00) • top N/fast first queries (Carey/Kossman`97, Antoshenkov/Ziauddin`96) • adaptivity • parametric query plans (Graefe/Cole`94) • mid-query reoptimization (Urhan/Franklin/Amsaleg`98, Kabra/DeWitt`98, Ives et al.`99) • competition (Antoshenkov/Ziauddin`96) • miscellaneous • precomputed summaries (OLAP, materialized views, AQUA) • parachute queries (Bonnet/Tomasic‘98) • APPROXIMATE (Vrbsky/Liu`93) Interactive Query Processing

Summary • applications, query processors need tighter coupling • online reordering • effective way of supporting dynamic user control • partial results as desired • embed reordering within dynamically controlled dataflow • hard to map user-interaction needs intoconcrete algorithm performance goals • wanted: benchmarks based on user/application traces Interactive Query Processing

Interactive Query Processing

Interactive Query Processing

Presentation Transcript

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing