740 likes | 755 Views
This study delves into interactive query processing, focusing on supporting user control with continual feedback and adaptive partial results. Traditional query processing limitations are addressed, emphasizing user-system interaction and performance goals.
E N D
Interactive Query Processing Vijayshankar Raman Computer Science Division University of California at Berkeley
Motivation: Natureof querying • querying = extracting information from data sets • different techniques in different settings • intrinsically slow • significant user-system interaction • info. seekers work iteratively, gradually refining requests based on feedback [O’day and Jeffries ‘93, Koenemann and Belkin ‘96] Interactive Query Processing
Problems with traditional solutions query process exact answer • mismatch between system functionality and iterative mode of HCI • “black box” functionality • batch processing • frustrating delays in iterative process Query: find average grade of students in each college Interactive Query Processing
Interactive processing • HCI requirements • users must get continual feedback on results of processing • allow users to control processing based on prior feedback query process exact answer Interactive Query Processing
Interactive processing results • HCI requirements • users must get continual feedback on results of processing • allow users to control processing based on prior feedback • performance goals • not to minimize time to give complete results • give continually improving partial results • adapt to dynamically specified performance goals result estimates Interactive Query Processing
Background: Traditional Database Query Processing T T • data access: scan-based ( ) or index-based ( ) • selection () -- filter tuples based on condition • join ( ) -- apply filter on cross-product of the inputs • can use index if available (“index-join”) • else dynamically build hash-tables on inputs (“hash-join”) • query optimizer • chooses plan: operator implementations and ordering • cost model based on pre-computed summary statistics declarative query select R.a, S.b, T.c from R, S, T where <conditions> query executor query optimizer S query plan R R Interactive Query Processing
My Research • interactive query processing • interactive data cleaning (Potter’s Wheel) • assume: dataflow thru pipelined operators • support for dynamic user control in traditional query proc. architectures • adaptively giving partial results in response to user control • still more aggressive adaptation • state modules T S R S R T S R T T Interactive Query Processing
Talk Outline • motivation and context • support for dynamic user control in traditional query proc. architectures • architecture for adaptively generating partial results • policy for generating partial results • user interface for displaying partial results • impact on routing • wrapup Interactive Query Processing
Design goals in supporting user control • make minimal change to system architecture • must be independent of particular query processing algorithms • no delay in processing Interactive Query Processing
Online Reordering (Raman et al. ’99,’00) • users perceive data being processed over time • prioritize processing for “interesting” tuples • interest based on user-specified preferences • reorder dataflow so that interesting tuples go first • encapsulate reordering as pipelined dataflow operator T T S R S R Interactive Query Processing
Context: an application of reordering • online aggregation [Hellerstein/Haas/Wang ‘97, Haas/Hellerstein ‘99 ] • for SQL aggregate queries, give gradually improving estimates • with confidence intervals • allow users to speed up estimate refinement for groups of interest • prioritize for processing at a per-group granularity SELECT AVG(gpa) FROM students GROUP BY college Interactive Query Processing
Online Aggregation Screenshot SELECT AVG(gpa) FROM students GROUP BY college Interactive Query Processing
Framework for Online Reordering consume produce • want no delay in processing • in general, reordering can only be best-effort • typically process/consume slower than produce • exploit throughput difference to reorder • two aspects • mechanism for best-effort reordering • reordering policy network xfer. acddbadb... f(t) abcdabc.. process reorder user interest Interactive Query Processing
Juggle mechanism for reordering process/consume • two threads -- prefetch from input -- spool/enrich from auxiliary side disk • juggle data between buffer and side disk • keep buffer full of “interesting” items • getNext chooses best item currently on buffer • getNext, enrich/spool decisions -- based on reordering policy • side disk management • hash index, populated in a way that postpones random I/O getNext buffer prefetch enrich spool produce side disk Interactive Query Processing
Reordering policies • quality of feedback for a prefix t1t2…tk QOF(UP(t1), UP(t2), … UP(tk )), UP = user preference • determined by application • goodness of reordering: dQOF/dt • implication for juggle mechanism • process gets item from buffer that increases QOF the most • juggle tries to maintain buffer with such items “good” permutation of items t1…tn to t1…tn GOAL: QOF time Interactive Query Processing
QOF in Online Aggregation • avg weighted confidence interval • preference acts as weight on confidence interval • QOF= UPi /ni , ni= number of tuples processed from group i • process pulls items from group with maxUPi /nini • desired ratio of group i tuples on buffer =UPi2/3/ UPj2/3 • juggle tries to maintain this by enrich/spool Interactive Query Processing
Other QOF functions • rate of processing (for a group) preference • QOF= (ni - nUPi)2(variance from ideal proportions) • process pulls items from group with max(nUPi - ni ) • desired ratio of group i tuples in buffer = UPi Interactive Query Processing
Results: Reordering in Online Aggregation • implemented in Informix UDO server • experiments with modified TPC-D queries • questions: • how much throughput difference is needed for reordering • can we reorder handle skewed data • one stress test: skew, very small proc. cost • index-only join • 5 orderpriorities, zipf distribution consume SELECT AVG(o_totalprice), o_orderpriority FROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey) GROUP BY o_orderpriority juggle index scan process Interactive Query Processing
Performance results # tuples processed time • without reordering Interactive Query Processing
Performance results # tuples processed time • 3 times faster for interesting groups • overhead: 2% completion time, 1 extra disk Interactive Query Processing
Performance results confidence interval E C A time Interactive Query Processing
Overall Findings • higher processing costs • index/hash join, subquery, … • reordering easy • very low processing costs • juggle constrained by density of interesting tuples • outlier groups hard to speed up • better to use index stride [Hellerstein/Haas/Wang ‘97] • needs pre-computed index • reordering becomes easier over time • question to answer: • where to place juggle? Interactive Query Processing
Outline • motivation and context • support for dynamic user control in traditional query proc. architectures • architecture for adaptively generating partial results • policy for generating partial results • user interface for displaying partial results • impact on routing • wrapup Interactive Query Processing
Incremental results Result Space • traditional arch. also generate continual result tuples • arises from continual dataflow thru pipelining operators • much work on pipelining joins [Wilschut/Apers’91, Haas/Hellerstein’99, Ives et al.’99, Urhan/Franklin’00] • this is too rigid • especially in distributed envirorments Interactive Query Processing
Context: Query processing in Telegraph • Telegraph: adaptive dataflow system to query diverse, distributed sources • much data available as services over Internet • currently only accessible by browse/search/forms • want to combine this data through queries • examples: • campaign finance information (Election 2000)Federal Election Commision Yahoo Home PricesCensus APBnews Crime Ratings Maps IMDB • restaurant information SwitchBoard Fodors S.F.Chronicle MapQuest Health inspection reports Interactive Query Processing
Partial results • complete result tuples too rigid • source latencies high, diverse • dynamic source variations, delays • query does not capture user desires • non-expert and even expert users: query too broad Interactive Query Processing
Partial results • complete tuples too rigid • source latencies high, diverse • dynamic source variations, delays • query does not capture user desires • non-expert and even expert users: query too broad • want to process queries flexibly • give partial result tuples asap • adapt dynamically to user preferences and source variations Interactive Query Processing
Correctness of partial results • some columns essential for UI • e.g. group-by columns / sort-by columns • for maximum flexibility -- outer-join semantics • good idea for Web sources • or -- strict join semantics • no partial results without ensuring match exists • key constraints helpful • aggregates: • update early, and compensate later • statistical guarantees for aggregates difficult • fanouts unknown (can be 0!) • key constraints helpful Interactive Query Processing
Dynamic query plans Eddy T R S T R S S R T • Eddy [Avnur and Hellerstein 2000] • router for directing data thru modules • minimize completion time • adaptively choosing join order for arbitrary tuple • all partial tuples generable Interactive Query Processing
Partial Results in Dynamic Plans Eddy . . . modules • my focus: continual partial results • dynamically adapt dataflow to suit user preferences • Eddy must decide • what tuple to route next • where to route it • based on user preferences and module properties • need routing policy and a reordering mechanism • eddy memory buffer and module queues bounded . . . inputs R S P Interactive Query Processing
Prioritizing tuples copy . . . modules • enhance Eddy with Juggle • reorder exploiting slowness of modules • synergy -- juggle location problem solved! • juggle everywhere, to the extent of throughput difference R S P . . . inputs Interactive Query Processing
Routing and reordering policy • GOAL: at any time, route to max. dQOF/dt = benefit of sending tuple to module / cost • cost: estimate data rates to/from module • benefit: dependent on application and UI • how partial results impact the UI • user preferences GOAL: QOF time Interactive Query Processing
Outline • motivation and context • online reordering for user prioritization of partial results • architecture for adaptively generating partial results • policy for generating partial results • Telegraph UI: displaying results and inferring preferences • experimental results • wrapup Interactive Query Processing
Telegraph UI • screenshot Interactive Query Processing
Getting user preferences • infer from navigation • row/column scrolling, group drill down and rollup • prioritize visible rows/columns • “query evolution” • subset “one-size-fits-all” queries • future work: query expansion • explicit • up/down buttons on columns • at the cell level -- need for some expensive sources • map to QOF metric on partial results Interactive Query Processing
QOF: Benefit of a partial result • depends on user preferences • benefit of updating a cell in output • row priority x column weight x cell resolution • incremental cell resolution: how much does one extra update add to the cell’s value • scalars -- 1 • aggregations -- change in confidence interval • informing user about execution progress • convey cell resolution on UI • future work Interactive Query Processing
QOF: Benefit of Routing a Tuple T M t • route tuple according to expected benefit and cost • benefit of sending a tuple t to a module M and forming set T = cells c Tbenefit of updating c Interactive Query Processing
Throughput difference Bush Contributors Income (index) AFB Crime Ratings (index) 40000 Bush Contributors 30000 20000 Number of tuples read 10000 Income Crime Ratings 0 200 400 600 time (s) Interactive Query Processing
Benefit of giving partial results Bush Contributors Income (index) AFB Crime Ratings (index) 40000 Bush Contributors 30000 #partial results number of partial results #complete results 20000 10000 Income Crime Ratings 0 200 400 600 time (s) Interactive Query Processing
Effect of Delays Bush Contributors Income (index) AFB Crime Ratings (index) with delay in AFB Crime Ratings number of partial results 10000 1000 delay number of complete results 100 10 0 200 400 600 time(s) Interactive Query Processing
Prioritizing particular values scroll scroll • SELECT AVG(Income) FROM Bush Donors, Census GROUP BY BushDonors.State • scrolling: {AZ, AR, CA, CO, CT}, {LA, KY, MA, MD, MI}, {TX, UT, VA, VI, VT} 600 400 # Aggr. Updates 200 0 100 200 300 400 time (s) Interactive Query Processing
Prioritizing particular values scroll scroll • SELECT AVG(Income) FROM Bush Donors, Census GROUP BY BushDonors.State • scrolling: {AZ, AR, CA, CO, CT}, {LA, KY, MA, MD, MI}, {TX, UT, VA, VI, VT} 600 400 # Aggr. Updates 200 0 100 200 300 400 time (s) Interactive Query Processing
Prioritizing particular values scroll scroll • SELECT AVG(Income) FROM Bush Donors, Census GROUP BY BushDonors.State • scrolling: {AZ, AR, CA, CO, CT}, {LA, KY, MA, MD, MI}, {TX, UT, VA, VI, VT} 600 400 # Aggr. Updates 200 0 100 200 300 400 time (s) Interactive Query Processing
Distribution of contributions 5000 4000 3000 # Bush Contributors 2000 1000 State Interactive Query Processing
Outline • motivation and context • online reordering for user prioritization of partial results • architecture for adaptively generating partial results • policy for generating partial results • more aggressive adaptation: state modules • wrapup Interactive Query Processing
Granularity of Query Operators P hash jn ind.jn Eddy R S R P S hash jn • relational operators: logical abstractions • encapsulate multiple physical effects • inflexible in handling unexpected changes • cannot gracefully adapt • access method/data source selection • join algorithm selection • resource allocation: e.g. memory • delays [Query Scrambling, XJoin] • want to encapsulate at level of physical operators Interactive Query Processing
State Modules Elevator Pitch Eddy P R P P • isolate state in StateModules • work sharing • routing flexibility • query execution = routing • adapt access methods, join algorithms gracefully • directly measure & adapt resource consumption ?? hash jn ind.jn Eddy R R P Interactive Query Processing
Outline • motivation and context • online reordering for user prioritization of partial results • architecture for generating more aggressive partial results • policy for generating partial results • more aggressive adaptation: state modules • wrapup Interactive Query Processing
Related Work • information retrieval • ranked retrieval, relevance feedback • search strategies, Berry Picking • incremental query processing • pipelining hash joins (Haas/Hellerstein`99, Ives et al.`99, Urhan/Franklin`00) • top N/fast first queries (Carey/Kossman`97, Antoshenkov/Ziauddin`96) • adaptivity • parametric query plans (Graefe/Cole`94) • mid-query reoptimization (Urhan/Franklin/Amsaleg`98, Kabra/DeWitt`98, Ives et al.`99) • competition (Antoshenkov/Ziauddin`96) • miscellaneous • precomputed summaries (OLAP, materialized views, AQUA) • parachute queries (Bonnet/Tomasic‘98) • APPROXIMATE (Vrbsky/Liu`93) Interactive Query Processing
Summary • applications, query processors need tighter coupling • online reordering • effective way of supporting dynamic user control • partial results as desired • embed reordering within dynamically controlled dataflow • hard to map user-interaction needs intoconcrete algorithm performance goals • wanted: benchmarks based on user/application traces Interactive Query Processing