990 likes | 1.15k Views
Lecture 7: Query Execution. Wednesday, November 9, 2011. Outline. Relational Algebra Read: Ch. 4.2; “Three query languages formalisms” (lecture 1) Overview of query evaluation Read: Ch. 12 Evaluating relational operators: Read: Ch. 14, Shapiro’s Grace Join paper. Reading for Next Lectures.
E N D
Lecture 7:Query Execution Wednesday, November 9, 2011 Dan Suciu -- CSEP544 Fall 2011
Outline • Relational Algebra • Read: Ch. 4.2; “Three query languages formalisms” (lecture 1) • Overview of query evaluation • Read: Ch. 12 • Evaluating relational operators: • Read: Ch. 14, Shapiro’s Grace Join paper Dan Suciu -- CSEP544 Fall 2011
Reading for Next Lectures • Short, optional reading for next lecture • Chaudhuri on query optimization • Lots of papers to read for the following two lectures (and more may be added); start early! Dan Suciu -- CSEP544 Fall 2011
1. Relational Algebra (Ch. 4.2) Dan Suciu -- CSEP544 Fall 2011
The WHAT and the HOW • In SQL we write WHAT we want to get form the data • The database system needs to figure out HOW to get the data we want • The passage from WHAT to HOW goes through the Relational Algebra Physical Data Independence
SQL = WHAT Product(pid, name, price)Purchase(pid, cid, store) Customer(cid, name, city) SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z WHERE x.pid = y.pid and y.cid = y.cid andx.price > 100 and z.city = ‘Seattle’ It’s clear WHAT we want, unclear HOW to get it
Relational Algebra = HOW Final answer Product(pid, name, price)Purchase(pid, cid, store) Customer(cid, name, city) δ T4(name,name) Π x.name,z.name T3(. . . ) T2( . . . .) σ price>100 and city=‘Seattle’ T1(pid,name,price,pid,cid,store) cid=cid Temporary tablesT1, T2, . . . pid=pid Customer Purchase Product
Relational Algebra = HOW The order is now clearly specified: Iterate over PRODUCT… …join with PURCHASE… …join with CUSTOMER… …select tuples with Price>100 and City=‘Seattle’… …eliminate duplicates… …and that’s the final answer ! Dan Suciu -- CSEP544 Fall 2011
Sets v.s. Bags • Sets: {a,b,c}, {a,d,e,f}, { }, . . . • Bags: {a, a, b, c}, {b, b, b, b, b}, . . . Relational Algebra has two semantics: • Set semantics (paper “Three languages…”) • Bag semantics Dan Suciu -- CSEP544 Fall 2011
Extended Algebra Operators • Union , intersection , difference - • Selection s • ProjectionΠ • Join ⨝ • Rename • Duplicate elimination d • Grouping and aggregation g • Sorting t Dan Suciu -- CSEP544 Fall 2011
Union and Difference R1 R2 R1 – R2 What do they mean over bags ? Dan Suciu -- CSEP544 Fall 2011
What about Intersection ? • Derived operator using minus • Derived using join (will explain later) R1 R2 = R1 – (R1 – R2) R1 R2 = R1 ⨝ R2 Dan Suciu -- CSEP544 Fall 2011
Selection • Returns all tuples which satisfy a condition • Examples • sSalary > 40000(Employee) • sname = “Smith”(Employee) • The condition c can be =, <, , >,, <> sc(R) Dan Suciu -- CSEP544 Fall 2011
Employee sSalary > 40000(Employee) Dan Suciu -- CSEP544 Fall 2011
Projection • Eliminates columns • Example: project social-security number and names: • PSSN, Name (Employee) • Answer(SSN, Name) P A1,…,An(R) Semantics differs over set or over bags Dan Suciu -- CSEP544 Fall 2011
Employee PName,Salary (Employee) Set semantics Bag semantics Which is more efficient?
Cartesian Product • Each tuple in R1 with each tuple in R2 • Very rare in practice; mainly used to express joins R1 R2 Dan Suciu -- CSEP544 Fall 2011
Employee Dependent Employee ✕ Dependent Dan Suciu -- CSEP544 Fall 2011
Renaming • Recall: needed in the named perspective • Changes the schema, not the instance • Example: • rN, S(Employee) Answer(N, S) rB1,…,Bn (R) Dan Suciu -- CSEP544 Fall 2011
Natural Join R1 ⨝R2 • Meaning: R1⨝R2 = PA(s(R1 × R2)) • Where: • The selection schecks equality of all common attributes • The projection eliminates the duplicate common attributes Dan Suciu -- CSEP544 Fall 2011
Natural Join R S R ⨝ S= PABC(sR.B=S.B(R × S)) Dan Suciu -- CSEP544 Fall 2011
Natural Join • Given schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⨝ S ? • Given R(A, B, C), S(D, E), what is R ⨝ S ? • Given R(A, B), S(A, B), what is R ⨝ S ? Dan Suciu -- CSEP544 Fall 2011
Theta Join • A join that involves a predicate • Here q can be any condition R1 ⨝q R2 = sq (R1 R2) Dan Suciu -- CSEP544 Fall 2011
Eq-join • A theta join where q is an equality • This is by far the most used variant of join in practice R1 ⨝A=B R2 = sA=B (R1 R2) Dan Suciu -- CSEP544 Fall 2011
So Which Join Is It ? • When we write R ⨝ S we usually mean an eq-join, but we often omit the equality predicate when it is clear from the context Dan Suciu -- CSEP544 Fall 2011
Semijoin • Where A1, …, An are the attributes in R R ⋉C S = PA1,…,An (R ⨝C S) • Formally, R ⋉C S means this: retain from R only thosetuples that have some matching tuple in S • Duplicates in R are preserved • Duplicates in S don’t matter
Semijoins in Distributed Databases Dependent Employee network Employee ⨝SSN=EmpSSN (sage>71 (Dependent)) • Assumptions • Very few dependents have age > 71. • “Stuff” is big. Task: compute the query with minimum amount of data transfer
Semijoins in Distributed Databases Dependent Employee network Employee ⨝SSN=EmpSSN (sage>71 (Dependent)) T = PEmpSSNs age>71 (Dependents) Dan Suciu -- CSEP544 Fall 2011
Semijoins in Distributed Databases Dependent Employee network Employee ⨝SSN=EmpSSN (sage>71 (Dependent)) T = PEmpSSNs age>71 (Dependents) R = Employee⨝SSN=EmpSSNT = Employee ⋉SSN=EmpSSN (sage>71 (Dependents))
Semijoins in Distributed Databases Dependent Employee network Employee ⨝SSN=EmpSSN (sage>71 (Dependent)) T = PEmpSSNs age>71 (Dependents) R = Employee ⋉SSN=EmpSSNT Answer = R ⨝SSN=EmpSSNs age>71 Dependents
Joins R US • The join operation in all its variants (eq-join, natural join, semi-join, outer-join) is at the heart of relational database systems • WHY ? Dan Suciu -- CSEP544 Fall 2011
Operators on Bags • Duplicate elimination d d(R) = SELECT DISTINCT * FROM R • Grouping g gA,sum(B) (R) = SELECTA,sum(B) FROM R GROUP BY A • Sorting t
y.seller-id=z.id y.pid=u.pid x.id=z.id Complex RA Expressions gu.name, count(*) Product(pid, name, price)Purchase(pid, id, seller-id) Person(id, name) Person x Purchase y Person z Product u SELECT u.name, count(*)FROM Person x, Purchase y, Person z, Product uWHERE z.name=‘fred’ and u.name=‘gizmo’ and y.seller-id = z.id and y.pid = u.pid and x.id=z.idGROUP BY u.name Pid Ppid sname=fred sname=gizmo
RA = Dataflow Program • Several operations, plus strictly specified order • In RDBMS the dataflow graph is always a tree • Novel languages (s.a. PIG), dataflow graph may be a DAG
Limitations of RA • Cannot compute “transitive closure” • Find all direct and indirect relatives of Fred • Cannot express in RA !!! Need to write Java program • Remember the Bacon number ? Needs TC too ! Dan Suciu -- CSEP544 Fall 2011
2. Overview of query evaluation Dan Suciu -- CSEP544 Fall 2011
Steps of the Query Processor SQL query Parse & Rewrite Query Logicalplan Select Logical Plan Queryoptimization Select Physical Plan Physicalplan Query Execution Disk
Example Database Schema Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) View: Suppliers in Seattle CREATE VIEW NearbySupp AS SELECT x.sno, x.sname FROM Supplier x WHERE x.scity='Seattle' AND x.sstate='WA' Dan Suciu -- CSEP544 Fall 2011
Example Query Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) Find the names of all suppliers in Seattle who supply part number 2 NearbySupp(sno, sname) SELECT y.sname FROM NearbySuppy WHERE y.sno IN ( SELECT z.sno FROM Supplies z WHERE z.pno = 2 ) Dan Suciu -- CSEP544 Fall 2011
Steps in Query Evaluation • Step 0: Admission control • User connects to the db with username, password • User sends query in text format • Step 1: Query parsing • Parses query into an internal format • Performs various checks using catalog • Correctness, authorization, integrity constraints • Step 2: Query rewrite • View rewriting, flattening, etc. Dan Suciu -- CSEP544 Fall 2011
Rewritten Version of Our Query Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) NearbySupp(sno, sname) SELECT y.sname FROM NearbySuppy WHERE y.sno IN ( SELECT z.sno FROM Supplies z WHERE z.pno = 2 ) Original query: SELECT x.sname FROM Supplier x, Supplies z WHERE x.scity='Seattle' ANDx.sstate='WA’ AND x.sno = z.sno AND z.pno = 2; Rewritten query:
Continue with Query Evaluation • Step 3: Query optimization • Find an efficient query plan for executing the query • A query plan is • Logical query plan: an extended relational algebra tree • Physical query plan: with additional annotations at each node • Access method to use for each relation • Implementation to use for each relational operator Dan Suciu -- CSEP544 Fall 2011
Logical Query Plan Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) Πsname NearbySupp(sno, sname) sscity=‘Seattle’ sstate=‘WA’ pno=2 sno = sno Suppliers Supplies Dan Suciu -- CSEP544 Fall 2011
Physical Query Plan • Logical query plan with extra annotations • Access path selection for each relation • Use a file scan or use an index • Implementation choice for each operator • Scheduling decisions for operators Dan Suciu -- CSEP544 Fall 2011
Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) Physical Query Plan (On the fly) sname (On the fly) sscity=‘Seattle’ sstate=‘WA’ pno=2 (Nested loop) sno = sno Suppliers Supplies (File scan) (File scan)
Final Step in Query Processing • Step 4: Query execution • How to synchronize operators? • How to pass data between operators? • Synchronization techniques: • Pipelined execution • Materialized relations for intermediate results • Passing data between operators: • Iterator interface • One thread per operator Dan Suciu -- CSEP544 Fall 2011
Iterator Interface • Each operator implements this interface • Interface has only three methods • open() • Initializes operator state • Sets parameters such as selection condition • get_next() • Operator invokes get_next() recursively on its inputs • Performs processing and produces an output tuple • close(): cleans-up state Dan Suciu -- CSEP544 Fall 2011
Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) Pipelined Execution (On the fly) sname (On the fly) sscity=‘Seattle’ sstate=‘WA’ pno=2 (Nested loop) sno = sno Suppliers Supplies (File scan) (File scan)
Pipelined Execution • Applies parent operator to tuples directly as they are produced by child operators • Benefits • No operator synchronization issues • Saves cost of writing intermediate data to disk • Saves cost of reading intermediate data from disk • Good resource utilizations on single processor • This approach is used whenever possible Dan Suciu -- CSEP544 Fall 2011
Supplier(sno, sname, scity, sstate) • Supply(sno, pno, price) Part(pno, pname, psize, pcolor) Intermediate TupleMaterialization sname (On the fly) (Sort-merge join) sno = sno (Scan: write to T1) (Scan: write to T2) sscity=‘Seattle’ sstate=‘WA’ pno=2 Suppliers Supplies (File scan) (File scan)