Query Processing and Networking Infrastructures Day 1 of 2 Joe Hellerstein UC Berkeley September 20, 2002
Two Goals • Day 1: Primer on query processing • Targeted to networking/OS folk • Bias: systems issues • Day 2: Seed some cross-fertilized research • Especially with networking • Thesis: dataflow convergence • query processing and routing • Clearly other resonances here • Dataflow HW architectures • Event-based systems designs • ML and Control Theory • Online Algorithms
(Sub)Space of Possible Topics: Distributed & Federated QP, Transactional Storage & Networking, Active DBs (Trigger Systems), Traditional Relational QP: Optimization & Execution, Data Model & Query Language Design, Parallel QP, NFNF Data Models (OO, XML, “Semistructured”), Adaptive QP, Online and Approximate QP, Indexing, Data Reduction, Data Streams & Continuous Queries, Visual Querying & Data Visualization, Compression, Media Queries, Feature Extraction & Similarity Search, Statistical Data Analysis (“Mining”), Boolean Text Search, Traditional Text Ranking, Hypertext Ranking
Likely Topics Here: Transactional Storage & Networking, Distributed & Federated QP, Active DBs (Trigger Systems), Traditional Relational QP: Optimization & Execution, Data Model & Query Language Design, Parallel QP, NFNF Data Models (OO, XML, “Semistructured”), Adaptive QP, Online and Approximate QP, Indexing, Data Reduction, Data Streams & Continuous Queries, Visual Querying & Data Visualization, Compression, Media Queries, Feature Extraction & Similarity Search, Statistical Data Analysis (“Mining”), Boolean Text Search, Traditional Text Ranking, Hypertext Ranking
Plus Some Speculative Ones: Distributed & Federated QP, Traditional Relational QP: Optimization & Execution, Content Routing, Indirection Architectures, Parallel QP, Adaptive QP, Online and Approximate QP, Indexing, Peer-to-Peer QP, Data Streams & Continuous Queries, Sensornet QP, Network Monitoring, Boolean Text Search, Traditional Text Ranking
Outline • Day 1: Query Processing Crash Course • Intro • Queries as indirection • How do relational databases run queries? • How do search engines run queries? • Scaling up: cluster parallelism and distribution • Day 2: Research Synergies w/Networking • Queries as indirection, revisited • Useful (?) analogies to networking research • Some of our recent research at the seams • Some of your research? • Directions and collective discussion
Roots: database and IR research • “Top-down” traditions (“applications”) • Usually begin with semantics and models. • Common Misconceptions • Query processing = Oracle or Google. • Need not be so heavyweight or monolithic! Many reusable lessons within • IR search and DB querying are fundamentally different • Very similar from a query processing perspective • Many similarities in other data models as well • Querying is a synchronous, interactive process. • Triggers, rules and "continuous queries" not so different from plain old queries.
So… we’ll go bottom-up • Focus on reusable building blocks • Attempt to be language- and model-agnostic • Illustrate with various querying scenarios
Confession: Two Biases • Relational query engines • Most mature and general query technology • Best documented in the literature • Conceptually general enough to “capture” almost all other models/schemes • Everybody does web searches • So it’s both an important app, and an inescapable usage bias we carry around • It will inform our discussion, but shouldn’t skew it • Lots of other query systems/languages you can keep in mind as we go • LDAP, DNS, XSL/XPath/XQuery, Datalog
What Are Queries For? I • Obvious answer: search and analysis over big data sets • Search: select data of interest • Boolean expressions over content • sometimes with an implicit ordering on results • Analysis: construct new information from base data. • compute functions over each datum • concatenate related records (join) • partition into groups, summarize (aggregates) • aside: “Mining” vs. “Querying”? As a rule of thumb, think of mining as WYGIWIGY. • Not the most general, powerful answer…
What Are Queries For? II • Queries bridge a (large!) level of indirection • Declarative programming: what you want, not how to get it • Easy(er) to express • Allows the “how” to change under the covers • A critical issue! • Not just for querying • Method invocation, data update, etc.
Motivation for this Indirection • Critical when rates of change differ across layers: • In particular, when d(app)/dt << d(environment)/dt • E.g. DB apps are used for years, decades (!!) • E.g. networked env: high rates of change (??) • DB lit calls this “data independence”
Data Independence: Background • Bad Old Days • Hierarchical and “Network” (yep!) data models • Nesting & pointers mean that apps explicitly traverse data, become brittle when data layouts change • Apps with persistent data have slow dapp/dt • And the database environments change faster! • Logical changes to representation (schema) • Physical changes in storage (indexes, layouts, HW) • DBs often shared by multiple apps! • In B.O.D., all apps had to be rewritten on change
It’s a SW Engineering Thing • Analogy: imagine if your C structs were to survive for decades • you’d keep them very simple • encapsulation to allow future mods • Similar Analogy to NWs • protocol simplicity is good • soft state is good (discourages hardcoded refs to transient resources) • But the fun systems part follows directly: • Achieve the goal w/respectable performance over a dynamic execution environment
Codd’s Data Independence • Ted Codd, IBM c. 1969 and forward • Turing award 1981 • Two layers of indirection: Applications -> (logical independence, spanned by views and query rewriting) -> Logical Representation (schema) -> (physical independence, spanned by query optimization and execution) -> Physical Representation (storage)
A More Architectural Picture • Declarative query over views -> Query Rewriter (bridges logical independence) -> declarative query over base tables -> Optimizer (bridges physical independence) -> query plan (procedural) -> Executor -> Iterator API -> Access Methods; the Query Processor comprises the Optimizer and Executor • N.B.: This classical QP architecture raises some problems. To be revisited!
Access Methods • Base data access layer • Model: Data stored in unordered collections • Relations, tables, one type per collection • Interface: iterators • Open(predicate) -> cursor • Usually simple predicates: attribute op constant • op usually arithmetic (<, >, =), though we’ll see extensions (e.g. multi-d ops) • Next(cursor) -> datum (of known type) • Close(cursor) • Insert(datum of correct type) • Delete(cursor)
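To make the interface concrete, here is a minimal sketch (Python, with hypothetical names) of a heap-file access method exposing the open/next/close cursor API with a simple attribute-op-constant predicate; it is illustrative only, not how any particular engine spells it.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

class HeapFileAM:
    """Toy access method: an unordered collection of records (dicts),
    scanned through a cursor that applies a simple predicate."""

    def __init__(self, records):
        self.records = list(records)

    def open(self, attr=None, op=None, const=None):
        """Open a cursor; predicate is 'attr op const', or empty for a full scan."""
        pred = (lambda r: True) if attr is None else (lambda r: OPS[op](r[attr], const))
        return {"pos": 0, "pred": pred}

    def next(self, cursor):
        """Return the next matching record, or None when exhausted."""
        while cursor["pos"] < len(self.records):
            r = self.records[cursor["pos"]]
            cursor["pos"] += 1
            if cursor["pred"](r):
                return r
        return None

    def close(self, cursor):
        cursor["pos"] = len(self.records)

# Usage: scan employees with salary > 50
emp = HeapFileAM([{"name": "a", "salary": 40}, {"name": "b", "salary": 60}])
c = emp.open("salary", ">", 50)
while (r := emp.next(c)) is not None:
    print(r)
emp.close(c)
```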
Typical Access Methods • “Heap” files • unordered array of records • usually sequential on disk • predicates just save cross-layer costs • Traditional Index AMs • B-trees • actually, “B+”-trees: all data at leaves • Can scan across leaves for range search • predicates (<,>,=, between) result in fewer I/Os • random I/Os (at least to find beginning of range) • Linear Hash index • Litwin ‘78. Supports equality predicates only. • This is it for IR and standard relational DBs • Though when IR folks say “indexing”, they sometimes mean all of query processing
Primary & Secondary Indexes [figures: an index directory of (key, ptr) pairs over data; in a primary index the leaves hold the data itself, in a secondary index the leaves hold (key, ptr) pairs pointing into the data file]
An Exotic Forest of Search Trees • Multi-dimensional indexes • For geodata, multimedia search, etc. • Dozens! E.g. R-tree family, disk-based Quad-Trees, kdB-trees • And of course “linearizations” with B-trees • Path indexes • For XML and OO path queries • E.g. Xfilter • Etc. • Lots of one-off indexes, often many per workload • No clear winners here • Extensible indexing scheme would be nice
Generalized Search Trees (GiST) [Hellerstein et al., VLDB 95] • What is a (tree-based) DB index? Typically: • A clustering of data into leaf blocks • Hierarchical summaries (subtree predicates -- SPs) for pointers in directory blocks
Generalized Search Trees (GiST) • Can realize that abstraction with simple interface: • User registers opaque SP objects with a few methods • Consistent(q, p): should query q traverse subtree? • Penalty(d, p): how bad is it to insert d below p • Union (p1, p2): form SP that includes p1, p2 • PickSplit({p1, …, pn}): partition SPs into 2 • Tree maintenance, concurrency, recovery all doable under the covers • Covers many popular multi-dimensional indexes • Most of which had no concurrency/recovery story • http://gist.cs.berkeley.edu
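As a rough illustration (not the actual libgist API), here is a sketch of the four extension methods for a 1-D interval SP class, the kind of opaque object a user would register with a GiST; the names and signatures are assumptions for exposition.

```python
class IntervalSP:
    """Subtree predicate for 1-D data: 'all keys in this subtree lie in [lo, hi]'."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q_lo, q_hi):
        """Should a query for keys in [q_lo, q_hi] traverse this subtree?"""
        return not (q_hi < self.lo or q_lo > self.hi)

    def penalty(self, key):
        """How bad is it to insert key below this SP? (here: required interval expansion)"""
        return max(0, self.lo - key) + max(0, key - self.hi)

    @staticmethod
    def union(p1, p2):
        """Form an SP that includes both p1 and p2."""
        return IntervalSP(min(p1.lo, p2.lo), max(p1.hi, p2.hi))

    @staticmethod
    def pick_split(sps):
        """Partition a set of SPs into two groups (naive: split around the median lo)."""
        sps = sorted(sps, key=lambda p: p.lo)
        mid = len(sps) // 2
        return sps[:mid], sps[mid:]
```

With these four methods the generic tree code can do search (guided by consistent), insertion (guided by penalty and pick_split), and SP maintenance (via union), which is exactly the division of labor the slide describes.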
Some Additional Indexing Tricks [O’Neil/Quass, SIGMOD 97] • Bitmap indexing • Many matches per value in (secondary) index? Rather than storing pointers to heap file in leaves, store a bitmap of matches in a (sorted) heap file. • Only works if file reorg is infrequent • Can make intersection, COUNT, etc. quicker during query processing • Can mix/match bitmaps and lists in a single index • Works with any (secondary) index with duplicate matches • “Vertical Partitioning” / “Columnar storage” • Again, for sorted, relatively static files • Bit-slice indexes
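A hedged sketch of the bitmap-index idea: for a secondary-index attribute with many matches per value, keep one bitmap per value over the (sorted, static) heap file, so intersections and COUNTs become cheap bitwise operations. The representation below is deliberately naive (Python integers as bitmaps); real systems compress these.

```python
from collections import defaultdict

def build_bitmaps(rows, attr):
    """One bitmap per distinct value of attr; bit i is set if row i matches."""
    bitmaps = defaultdict(int)
    for i, row in enumerate(rows):
        bitmaps[row[attr]] |= (1 << i)
    return bitmaps

rows = [{"state": "CA", "color": "red"},
        {"state": "CA", "color": "blue"},
        {"state": "NY", "color": "red"}]

by_state = build_bitmaps(rows, "state")
by_color = build_bitmaps(rows, "color")

# WHERE state = 'CA' AND color = 'red': intersect bitmaps, then COUNT the set bits
matches = by_state["CA"] & by_color["red"]
print(bin(matches), bin(matches).count("1"))  # 0b1, count = 1
```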
Dataflow Infrastructure • Dataflow abstraction is very simple • “box-and-arrow” diagrams • (typed) collections of objects flow along edges • Details can be tricky • “Push” or “Pull”? • More to it than that • How do control-flow and dataflow interact? • Where does the data live? • Don’t want to copy data • If passing pointers, where does the “real” data live?
Iterators • Most uniprocessor DB engines use iterators • Open() -> cursor • Next(cursor) -> typed record • Close(cursor) • Simple and elegant • Control-flow and dataflow coupled • Familiar single-threaded, procedure-call API • Data refs passed on stack, no buffering • Blocking-agnostic • Works w/ blocking ops -- e.g. Sort • Works w/ pipelined ops • Note: well-behaved iterators “come up for air” in inner loops • E.g. for interrupt handling
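To illustrate how control flow and dataflow are coupled, here is a minimal sketch (Python, hypothetical classes) of a pipelined filter iterator stacked on a scan: every Next() call on the parent drives a Next() on the child, so the whole plan runs single-threaded, one tuple at a time.

```python
class ScanIterator:
    """Leaf iterator over an in-memory list of tuples."""
    def __init__(self, tuples):
        self.tuples = tuples

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.tuples):
            return None          # end of stream
        t = self.tuples[self.pos]
        self.pos += 1
        return t

    def close(self):
        pass

class FilterIterator:
    """Pipelined selection: pulls from its child until a tuple satisfies pred."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred

    def open(self):
        self.child.open()

    def next(self):
        while (t := self.child.next()) is not None:
            if self.pred(t):
                return t
        return None

    def close(self):
        self.child.close()

# A two-operator plan: filter(scan(R)); the root's next() drives the whole pipeline.
plan = FilterIterator(ScanIterator([1, 5, 12, 7]), lambda x: x > 6)
plan.open()
while (t := plan.next()) is not None:
    print(t)
plan.close()
```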
Where is the In-Flight Data? • In standard DBMS, raw data lives in disk format, in shared Buffer Pool • Iterators pass references to BufPool • A tuple “slot” per iterator input • Never copy along edges of dataflow • Join results are arrays of refs to base tables • Operators may “pin” pages in BufPool • BufPool never replaces pinned pages • Ops should release pins ASAP (esp. across Next() calls!!) • Some operators copy data into their internal state • Can “spill” this state to private disk space
Weaknesses of Simple Iterators • Evolution of uniprocessor archs to parallel archs • esp. “shared-nothing” clusters • Opportunity for pipelined parallelism • Opportunity for partition parallelism • Take a single “box” in the dataflow, and split it across multiple machines • Problems with iterators in this environment • Spoils pipelined parallelism opportunity • Polling (Next()) across the network is inefficient • Nodes sit idle until polled, and during comm • A blocking producer blocks its consumer • But would like to keep iterator abstraction • Especially to save legacy query processor code • And simplify debugging (single-threaded, synchronous)
Exchange [Graefe, SIGMOD 90] • Encapsulate partition parallelism & asynchrony • Keep the iterator API between ops • Exchange operator partitions input data by content • E.g. join or sort keys • Note basic architectural idea! • Encapsulate dataflow tricks in operators, leaving infrastructure untouched • We’ll see this again next week, e.g. in Eddies
Exchange Internals • Really 2 operators, XIN and XOUT • XIN is “top” of a plan, and pulls, pushing results to XOUT queue • XOUT spins on its local queue • One thread becomes two • Producer graph & XIN • Consumer graph & XOUT • Routing table/fn in XIN supports partition parallelism • E.g. for || sort, join, etc. • Producer and consumer see iterator API • Queue + thread barrier turns NW-based “push” into iterator-style “pull”
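A hedged, in-memory sketch of the Exchange idea (threads and queues stand in for processes and the network; all class names are made up): XIN pulls from the producer subplan and pushes each tuple to a consumer queue chosen by a routing function, while XOUT presents the ordinary pull-based next() API over its local queue.

```python
import queue, threading

class XIn(threading.Thread):
    """Producer side: pulls from the producer iterator, routes by hash of a key."""
    def __init__(self, producer_iter, out_queues, route_key):
        super().__init__()
        self.producer, self.queues, self.key = producer_iter, out_queues, route_key

    def run(self):
        self.producer.open()
        while (t := self.producer.next()) is not None:
            # Content-based routing: choose a consumer partition by key hash.
            self.queues[hash(self.key(t)) % len(self.queues)].put(t)
        for q in self.queues:
            q.put(None)          # end-of-stream marker to each consumer
        self.producer.close()

class XOut:
    """Consumer side: exposes the familiar iterator API over a local queue."""
    def __init__(self, q):
        self.q = q

    def open(self):
        pass

    def next(self):
        return self.q.get()      # blocks until data (or the None marker) arrives

    def close(self):
        pass

# Minimal usage: one producer partitioned across two consumers.
class ListIter:
    def __init__(self, xs): self.xs = xs
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.xs):
            return None
        t = self.xs[self.i]; self.i += 1
        return t
    def close(self): pass

queues = [queue.Queue(), queue.Queue()]
XIn(ListIter(list(range(10))), queues, route_key=lambda t: t).start()
consumer = XOut(queues[0])
consumer.open()
while (t := consumer.next()) is not None:
    print("partition 0 got", t)
```

Plugging an XOut in as the child of a consumer subplan leaves the consumer's code untouched: it still just calls next(), which is the architectural point the slide makes.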
Exchange Benefits? • Remember Iterator limitations? • “Spoils pipelined parallelism opportunity” • solved by Exchange thread boundary • “Polling (Next()) across the network is inefficient” • Solved by XIN pushing to XOUT queue • “A blocking producer blocks its consumer” • Still a problem!
Exchange Limitations • Doesn’t allow consumer work to overlap w/blocking producers • E.g. streaming data sources, events • E.g. sort, some join algs • Entire consumer graph blocks if XOUT queue empty • Control flow coupled to dataflow, so XOUT won’t return without data • Queue is encapsulated from consumer • But … • Note that exchange model is fine for most traditional DB Query Processing • May need to be extended for new settings…
Fjords [Madden/Franklin, ICDE 01] • Thread of control per operator • Queues between each operator • Asynch or synch calls • Can do asynch poll-and-yield iteration in each operator (for both consumer and producer) • Or can do synchronous get_next iteration • Can get traditional behavior if you want: • Synch polls + queue of size 1 = iterators • Synch consumer, asynch producer = Exchange • Asynch calls solve the blocking problem of Exchange
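A rough sketch of the poll-and-yield style (Python generators stand in for the event-style scheduling; nothing here is the actual Fjords code): each operator polls its input queue without blocking, either processes one tuple or reports "no data yet", and returns control to a scheduler between steps.

```python
import queue

NO_DATA, EOS = object(), object()   # sentinels: "nothing yet" and "end of stream"

def filter_op(in_q, out_q, pred):
    """Fjord-style operator: poll the input queue, yield control after each step."""
    while True:
        try:
            t = in_q.get_nowait()       # asynchronous poll, never blocks
        except queue.Empty:
            yield NO_DATA               # come up for air; scheduler runs someone else
            continue
        if t is EOS:
            out_q.put(EOS)
            return
        if pred(t):
            out_q.put(t)
        yield None                      # yield control after processing one tuple

# A trivial scheduler driving one operator between two queues.
in_q, out_q = queue.Queue(), queue.Queue()
for x in [1, 9, 4, EOS]:
    in_q.put(x)
op = filter_op(in_q, out_q, lambda x: x > 3)
for _ in range(20):                     # in a real engine this loop runs indefinitely
    try:
        next(op)
    except StopIteration:
        break
while not out_q.empty():
    t = out_q.get()
    print(t if t is not EOS else "EOS")
```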
Fjords • Disadvantages: • Lots of “threads” • Best done in an event-programming style, not OS threads • Operators really have to “come up for air” (“yield”) • Need to write your own scheduler • Harder to debug • But: • Maximizes flexibility for operators at the endpoints • Still provides a fairly simple interface for operator-writers
Relational Algebra Semantics • Selection: σp(R) • Returns all rows in R that satisfy p • Projection: πC(R) • Returns all rows in R projected to columns in C • In strict relational model, remove duplicate rows • In SQL, preserve duplicates (multiset semantics) • Cartesian Product: R × S • Union: R ∪ S • Difference: R − S • Note: R, S must have matching schemata • Join: R ⋈p S = σp(R × S) • Missing: Grouping & Aggregation, Sorting
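To pin down the multiset semantics, here is a tiny sketch (Python lists of dicts, invented sample data) of selection, projection, and join written directly from the definitions above; real engines obviously never materialize a full Cartesian product like this.

```python
R = [{"id": 1, "dept": "A"}, {"id": 2, "dept": "B"}]
S = [{"dept": "A", "mgr": "joe"}, {"dept": "B", "mgr": "sue"}]

def select(pred, rel):                      # sigma_p(R)
    return [r for r in rel if pred(r)]

def project(cols, rel):                     # pi_C(R), SQL-style: duplicates preserved
    return [{c: r[c] for c in cols} for r in rel]

def join(pred, r_rel, s_rel):               # R join_p S = sigma_p(R x S)
    return [{**r, **s} for r in r_rel for s in s_rel if pred(r, s)]

print(select(lambda r: r["dept"] == "A", R))
print(project(["dept"], R))
print(join(lambda r, s: r["dept"] == s["dept"], R, S))
```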
Operator Overview: Basics • Selection • Typically “free”, so “pushed down” • Often omitted from diagrams • Projection • In SQL, typically “free”, so “pushed down” • No duplicate elimination • Always pass the minimal set of columns downstream • Typically omitted from diagrams • Cartesian Product • Unavoidable nested loop to generate output • Union: • Concat, or concat followed by dup. elim.
Operator Overview, Cont. • Unary operators: Grouping & Sorting • Grouping can be done with hash or sort schemes (as we’ll see) • Binary matching: Joins/Intersections • Alternative algorithms: • Nested loops • Loop with index lookup (Index N.L.) • Sort-merge • Hash Join • Don’t forget: have to write as iterators • Every time you get called with Next(), you adjust your state and produce an output record
Unary External Hashing [Bratbergsengen, VLDB 84] • E.g. GROUP BY, DISTINCT • Two hash functions, hc (coarse) and hf (fine) • Two phases: • Phase 1: for each tuple of input, hash via hc into a “spill” partition to be put on disk • B-1 blocks of memory used to hold output buffers for writing a block at a time per partition • [figure: input scanned from disk, hashed by hc into B-1 output partitions, each spilled to disk]
Unary External Hashing, cont. • Phase 2: for each partition, read off disk and hash into a main-memory hashtable via hf • For DISTINCT, when you find a value already in the hashtable, discard the copy • For GROUP BY, associate some agg state (e.g. running SUM) with each group in the hash table, and maintain it • [figure: each partition read back from disk and built into an in-memory hash table (k < B pages) via hf, results output]
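A hedged sketch of the two-phase scheme for GROUP BY SUM (Python, with in-memory lists standing in for spilled disk partitions; hc and hf are just Python's hash, salted differently):

```python
from collections import defaultdict

def hash_groupby_sum(tuples, key, val, num_partitions):
    """Phase 1: coarse-hash tuples into spill partitions (lists stand in for disk)."""
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[hash(("coarse", t[key])) % num_partitions].append(t)

    # Phase 2: for each partition, build an in-memory hash table (a Python dict
    # plays the fine-hash table), maintaining running SUM state per group.
    for part in partitions:
        groups = defaultdict(int)
        for t in part:
            groups[t[key]] += t[val]
        for k, s in groups.items():
            yield {"key": k, "sum": s}

rows = [{"dept": "A", "sal": 10}, {"dept": "B", "sal": 5}, {"dept": "A", "sal": 7}]
print(list(hash_groupby_sum(rows, "dept", "sal", num_partitions=2)))
```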
External Hashing: Analysis • To utilize memory well in Phase 2, would like each partition to be ~B blocks big • Hence works in two phases when B >= √|R| • Same req as external sorting! • Else can recursively partition the partitions in Phase 2 • Can be made to pipeline, to adapt nicely to small data sets, etc.
Hash Join (GRACE) [Fushimi, et al., VLDB 84] • Phase 1: partition each relation on the join key with hc, spilling to disk • Phase 2: • build each partition of smaller relation into a hashtable via hf • scan matching partition of bigger relation, and for each tuple probe the hashtable via hf for matches • Would like each partition of smaller relation to fit in memory • So works well if B >= √|smaller| • Size of bigger is irrelevant!! (Vs. sort-merge join) • Popular optimization: Hybrid hash join • Partition #0 doesn’t spill -- it builds and probes immediately • Partitions 1 through n use rest of memory for output buffers • [DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84]
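A minimal sketch of GRACE hash join under the same caveats as before (lists stand in for spilled partitions, Python dicts for the in-memory hash tables; no hybrid optimization, no recursive partitioning):

```python
from collections import defaultdict

def grace_hash_join(r_tuples, s_tuples, r_key, s_key, num_partitions):
    # Phase 1: partition both relations on the join key with the coarse hash.
    def partition(tuples, key):
        parts = [[] for _ in range(num_partitions)]
        for t in tuples:
            parts[hash(("coarse", t[key])) % num_partitions].append(t)
        return parts

    r_parts, s_parts = partition(r_tuples, r_key), partition(s_tuples, s_key)

    # Phase 2: per partition, build a hash table on the (hopefully smaller) R side
    # with the fine hash, then probe it with the matching S partition.
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)
        for r in r_part:
            table[r[r_key]].append(r)
        for s in s_part:
            for r in table.get(s[s_key], []):
                yield {**r, **s}

R = [{"dept": "A", "emp": 1}, {"dept": "B", "emp": 2}]
S = [{"dept": "A", "mgr": "joe"}]
print(list(grace_hash_join(R, S, "dept", "dept", num_partitions=2)))
```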
Hash-Join [figures: phase 1 reads R and S from disk and hashes each by hc into B-1 spill partitions; phase 2 builds a hash table (k < B-1 pages) on each partition Ri via hf, then probes it with the matching partition of S via an input buffer, writing join results to an output buffer]
Symmetric Hash Join [Mikkilineni & Su, TOSE 88] [Wilschut & Apers, PDIS 91] • Pipelining, in-core variant • Build and probe symmetrically • Correctness: Each output tuple generated when its last-arriving component appears • Can be extended to out-of-core case • Tukwila [Ives et al., SIGMOD ‘99] • XJoin: Spill and read partitions multiple times • Correctness guaranteed by timestamping tuples and partitions • [Urhan & Franklin, DEBull ‘00]
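A sketch of the in-core symmetric (pipelined) hash join, assuming tuples arrive interleaved from the two inputs (the interleaving and data are invented): each arriving tuple is inserted into its own side's hash table and immediately probes the other side's table, so a result streams out as soon as its last-arriving component appears.

```python
from collections import defaultdict

def symmetric_hash_join(stream, key):
    """stream yields ('R', tuple) or ('S', tuple); join on tuple[key]."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, t in stream:
        other = "S" if side == "R" else "R"
        tables[side][t[key]].append(t)           # build into own table
        for m in tables[other].get(t[key], []):  # probe the other table
            r, s = (t, m) if side == "R" else (m, t)
            yield {**r, **s}                     # emitted when last component arrives

interleaved = [("R", {"dept": "A", "emp": 1}),
               ("S", {"dept": "A", "mgr": "joe"}),
               ("R", {"dept": "A", "emp": 2})]
print(list(symmetric_hash_join(interleaved, "dept")))
```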
A Basic SQL primer SELECT [DISTINCT] <output expressions> FROM <tables> [WHERE <predicates>] [GROUP BY <gb-expression> [HAVING <h-predicates>]] [ORDER BY <expression>] • Join tables in FROM clause • applying predicates in WHERE clause • If GROUP BY, partition results by GROUP • And maintain aggregate output expressions per group • Delete groups that don’t satisfy HAVING clause • If ORDER BY, sort output accordingly
Examples • Single-table S-F-W • DISTINCT, ORDER BY • Multi-table S-F-W • And self-join • Scalar output expressions • Aggregate output expressions • With and without DISTINCT • Group By • Having • Nested queries • Uncorrelated and correlated
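As a concrete instance of the clause-by-clause reading above, here is a runnable sketch using Python's built-in sqlite3 with made-up emp/dept data: a multi-table select-from-where with GROUP BY and HAVING, sorted by ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept(dno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp(eno INTEGER PRIMARY KEY, name TEXT, sal INTEGER, dno INTEGER);
    INSERT INTO dept VALUES (1, 'Eng'), (2, 'Sales');
    INSERT INTO emp VALUES (1, 'a', 90, 1), (2, 'b', 60, 1), (3, 'c', 40, 2);
""")

# Join in FROM/WHERE, partition by GROUP BY, keep only groups passing HAVING,
# then sort the output with ORDER BY.
query = """
    SELECT d.dname, AVG(e.sal) AS avg_sal
    FROM emp e, dept d
    WHERE e.dno = d.dno
    GROUP BY d.dname
    HAVING AVG(e.sal) > 50
    ORDER BY avg_sal DESC
"""
for row in conn.execute(query):
    print(row)    # ('Eng', 75.0)
```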