510 likes | 786 Views
Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams. Equivalences Among Relational Expressions; Aho, Sagiv, and Ullman, SIAM J. Computing, 1979 Four Views of Complex Objects: A Sophisticate's Introduction; Hull, LNCS 361 Nested Relations and Complex Objects, 1987
E N D
Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams Equivalences Among Relational Expressions; Aho, Sagiv, and Ullman, SIAM J. Computing, 1979 Four Views of Complex Objects: A Sophisticate's Introduction; Hull, LNCS 361 Nested Relations and Complex Objects, 1987 SRQL: Sorted Relational Query Language; Ramakrishnan et al., SSDBM 1998 Raghu Ramakrishnan
What’s there Beyond Relations As Sets? • Relations as multisets (real SQL) • Object id (Object DBMSs) • Nested structure (Object-Relational DBMSs) • Partial structure (e.g., XML) • Sequences (SQL:1999) • Temporal databases (versions, “valid” time) • Streams (data feeds, alert systems) • data/query duality
Relations-As-Multisets (Bags) • What if the number of copies of a tuple matters? • E.g., Find the average salary • In fact, a relation is a multiset in SQL, by default. • The results for conjunctive query containment over relations-as-sets no longer hold! • Containment: No longer in NP (Chaudhuri-Vardi); equivalence is still in NP! • Containment of unions of conjunctive queries: Undecidable (Ioannidis-Ramakrishnan) Lesson: Small changes in the data model can have a big impact!
Querying Sequences in SQL:1999 • Trend analysis is difficult to do in SQL-92: • Find the % change in monthly sales • Find the top 5 product by total sales • Find the trailing n-day moving average of sales • The first two queries can be expressed with difficulty, but the third cannot even be expressed in SQL-92 if n is a parameter of the query. • The WINDOW clause in SQL:1999 allows us to write such queries over a table viewed as a sequence (implicitly, based on user-specified sort keys)
Stocks(date, symbol, close) Trans(cust, symbol, date, shares) Motivation • Find the trailing weekly moving average for each stock. • For each date in ‘97, find the ten cheapest stocks. • Compute the % change of each stock during ‘97, and then find stocks in the top 5% (those that changed the most). • For each week, find the stock in which Joe had the most invested.
Sequences • Simple sequence: A relation (logically) sorted on a key; key attributes are called sequencing attributes. • Composite sequence:A relation that is first partitioned usinggrouping attributes, then sorted within each partition bysequencing attributes. ord(t): Ordinal # of tuple within its group
Sequence Algebra • Relational algebra, extended to work over multisets of tuples, is the foundation for SQL. • We extend the relational operators to work on sorted relations. • In addition, we define some new operators over sequences. (Only one of these new operators is necessary; the rest are for convenience.)
Extending Relational Algebra • The operators are: s, p, x, u, -, distinct scan use conditions over ord attr of input • To extend these ops to work on sorted relations, we must define the grouping and sequencing attributes of the result. • We do this simply: the grouping and sequencing attribute lists are empty for the output of the above ops. I.e., the result is unordered.
Extending the Join Operator • Since cross-product (x) does not propagate the ord values of the input table to its output, we must define join as a primitive op! • Join ( ): Can refer to ord values of inputs. • Left-outer join ( ): Left tuples with no matching right tuple appear in result with NULLs in right tuple attrs. • Result has empty grouping / sequencing lists.
Creating a Sequence • The sequencing operator y is our fundamental extension to rel algebra. Applied to a table R, with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.
ShiftAll Operator D • D(R,I) joins each tuple t in sequence R with all tuples t’ in the same group such that ord(t’) = ord(t)+I. • If such t’ does not exist, use NULLs. • Can be defined as left-outer join followed by sequencing; included for convenience. • Can be extended to use a fixed ordinal such as FIRST or LAST, and also to align each tuple t with more than one other tuple t’.
Shift Operator d • Modifies each tuple t of R by adding the value of the sequencing attributes of tuple t’ (whose ord value is shifted by I). • Differs from ShiftAll (followed by projection of unwanted columns) if R has duplicates in the sequencing columns. • Like ShiftAll, Shift can be defined in terms of left-outer join and sequencing.
SELECT S.g,S.t,S.v,SHIFTALL(S,-1).v,SHIFT(S,1).t FROM R GROUP BY g, SEQUENCE BY t AS S • SHIFTALL gives access to all attrs of the tuple(s) at the shifted ordinal position. • SHIFT gives the sequencing attr values of the tuple at the shifted ord position.
SELECT t, AVG(v) OVER 0 TO 1 FROM R SEQUENCE BY t POSITIONAL AGGREGATE VALUE-BASED AGGREGATE SELECT t, AVG(v) OVER VALUES 0 TO 1 FROM R SEQUENCE BY t
SELECT expr-list FROM table-list WHERE predicate GROUP BY expr-list SEQUENCE BY expr-list WITH predicate HAVING predicate ORDER BY expr-list SRQL Queries • FROM clause can contain expressions involving sequencing; first evaluate these and find cross-prod. • Then apply selections in WHERE clause, and sort as per GROUP BY and SEQUENCE BY clauses to produce a sequence. • For each tuple (and each agg expr in SELECT), identify window and apply agg fn. • Apply HAVING clause, finally ORDER BY.
SELECT V.name, Q.name FROM Volcano AS V, Quake SEQUENCE BY time AS Q WHERE Q.time <= V.time AND (SHIFT(Q,1).time > V.time OR SHIFT(Q,1).time IS NULL) AND Q.magnitude > 7 • For each volcano eruption where the most recent quake was > 7, find the name of this quake: • Using some syntactic sugar: SELECT V.name, Q.name FROM Volcano AS V, Quake SEQUENCE BY time AS Q WHERE Q.time PRECEDES= V.time AND AND Q.magnitude > 7
Aggregate Functions • As in SQL, we need to extend the algebra to allow the use of aggregate functions like MIN and SUM. • Must define “window” for each application of an agg fn. • In SQL, window is a partition created by GROUP BY; agg op applied to each partition, and yields one value per partition. • In SRQL, for each agg fn, can have a different window for each tuple.
Window Aggregate Op w • w (R, p, f, V): • R is a sequence, f is an agg fn, V is an attr of R, p is a selection predicate; result is a sequence with same grouping/seq lists as R • For each input tuple t, output contains < t, f(pv ( st.g=R.g and p(t) (R) ) ) > • I.e., for each tuple t of R, apply p(t) to find its “window” within the group of t, apply agg fn f to the multiset of V-values in window, and include with t in result.
SELECT t, AVG(v) OVER 0 TO 1 FROM R SEQUENCE BY t POSITIONAL AGGREGATE VALUE-BASED AGGREGATE SELECT t, AVG(v) OVER VALUES 0 TO 1 FROM R SEQUENCE BY t
SELECT day, AVG(profit) OVER 0 TO 1 FROM Sales SEQUENCE BY day WITH vol>100 • Find 2-day moving average of the profit made on sales volume > 100:
SELECT day, AVG(profit) OVER 0 TO 1 FROM Sales SEQUENCE BY day WHERE vol>100 • Considering only days with sales volume > 100, find 2-day moving average of the profit: • Placing selection in WHERE clause affects window for positional aggregation!
SELECT product, day, AVG(vol) OVER 0 TO 1 FROM Sales GROUP BY product SEQUENCE BY day • Find the 2-day moving average of volume sold for each product: • In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences. • Observe how this generalizes SQL’s GROUP BY: illustrates power of composite sequences and aggregation.
The WINDOW Clause SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg FROM Sales S, Times T, Locations L WHERE S.timeid=T.timeid AND S.locid=L.locid WINDOW W AS (PARTITION BY L.state ORDER BY T.month RANGE BETWEEN INTERVAL `1’ MONTH PRECEDING AND INTERVAL `1’ MONTH FOLLOWING) • Let the result of the FROM and WHERE clauses be “Temp”. • (Conceptually) Temp is partitioned according to the PARTITION BY clause. • Similar to GROUP BY, but the answer has one row for each row in a partition, not one row per partition! • Each partition is sorted according to the ORDER BY clause. • For each row in a partition, the WINDOW clause creates a “window” of nearby (preceding or succeeding) tuples. • Can be value-based, as in example, using RANGE • Can be based on number of rows to include in the window, using ROWS clause • The aggregate function is evaluated for each row in the partition using the corresponding window. • New aggregate functions that are useful with windowing include RANK (position of a row within its partition) and its variants DENSE_RANK, PERCENT_RANK, CUME_DIST.
Top N Queries • If you want to find the 10 (or so) cheapest cars, it would be nice if the DB could avoid computing the costs of all cars before sorting to determine the 10 cheapest. • Idea: Guess at a cost c such that the 10 cheapest all cost less than c, and that not too many more cost less. Then add the selection cost<c and evaluate the query. • If the guess is right, great, we avoid computation for cars that cost more than c. • If the guess is wrong, need to reset the selection and recompute the original query.
Top N Queries SELECT P.pid, P.pname, S.sales FROM Sales S, Products P WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3 ORDER BY S.sales DESC OPTIMIZE FOR 10 ROWS • OPTIMIZE FOR construct is not in SQL:1999! • Cut-off value c is chosen by optimizer. SELECT P.pid, P.pname, S.sales FROM Sales S, Products P WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3 AND S.sales > c ORDER BY S.sales DESC
Streams are Forever … • So no notion of query over “entire” stream • Queries must operate over parts of a stream • How do you identify “parts”? • Common answer: windows of various kinds • Continuous queries • Many queries of interest being continually evaluated over a given stream • What do you index? The collection of queries or the streaming data? • Duality of data/queries! • Rate-oriented optimizations • Can query evaluation/monitoring keep up with arrival rate? • If not, how do we achieve a good trade-off? • Stream statistics • Incremental computation of statistics • Can actually maintain stats over “entire” stream, but also over “groups” of tuples (window or value-based)
Complex Objects Arise as a generalization of earlier models; in particular, we study them as extensions of the relational model. Features: Richer structuring facilities. e.g. context-dependent information (john, {joanna, goanna}) Precursor to XML’s semi-structured model Data sharing (in models with object-identity, e.g., LDM) Inheritance Two streams of Complex Objects research: Emphasis on structure of objects—our focus! Emphasis on models for expressing rich semantics
Look Out For … • Treatment of identity • Value-based or object-based? • Use of sets • Restrictions make a big difference! • Also, Type vs. Class. Starting Point The relational model: • Only one basic domain of constants (D) • Only one type constructor: R:[A1: D, A2:D, …, An:D] • The only ‘objects’ are tuples.
Complex Objects Types are constructed as follows: • Basic: Basic types T1,T2, …, Tn are given. An object of type Ti is an element of (the given) domain Di. • Record:If T1,T2, …, Tk are types, then T= [L1: T1,L2:T2, …,Lk:Tk] is also a type. An object of type T is an element of DOM(T1) DOM(T2) ... DOM(Tk). (This set of possible values is DOM(T)) • Set: If T1 is a type, then T = L:{T1} is also a type. An object of type T is a finite subset of DOM(T1). (What is DOM(T)?) T T The Li’s are labels, or field names, drawn from a separate domain. No label is used twice in the same type. ⊗ ⊛ L1 LK L T1 TK T1
#5, 10k, [John, 30], [Joe, 1] [Jill, 30] Toy, #7, 20k, [Jack, 32], Gumbo, #7, 50k, [Jack, 32], Example Instance of a type: A set of objects of that type. Example: Basic types: int, char_str Person = [Name: char_str, Age: int] Emp = [EID: int, Sal: int, Who:person, Family:{Person}] Dept = [Dname:char_str, Demps:{Emp}] Instance of Dept:
274-7473390-2577 Syne, Nancy, 232-2323289=9887 Energy, John Wefix, John, ∅ URaft, Andy, 939-9983 Update Examples Corp ⊗ □ □ Phones ⊛ CName Owner □ Phone Insert [Lang, Sylvie, {989-2232}] at Corp Delete 390-2577 at Phones – ‘Energy’ tuple is modified Delete WeFix at CName – ‘WeFix’ tuple is deleted Replace Andy by Sabra at Owner – Modifies the tuple Delete Andy at OwnerInsert Sabra (into the person instance)Not ≡ to the ‘replace’ in 4.
NFNF Relations, PNF NFNF Relation Types T = [T1, T2, …, Tk], and (a) S= {S1} S1 = [ , …, ] (i.e., S1 is a record type) (b) S = [S1, S2, …, Sm], Si = {Si’} (i.e., a set type), or a basic type. for all types S used in defining Ti’s. Straightforward to obtain ‘equivalent’ NFNF type for any given type T. Partitioned , or nested, normal form A type T is in PNF if type S used to define T: If S = [S1, S2, …, Sm], and S1, S2, …, Sk , k m, are basic types, then S1, S2, …, Sk includes a key for S. (Must hold, recursively, at every level of nesting!)
Bach, Dieter, Joseph, Jinyi DeWitt, Jeff, Raghu Bach, Dieter, Joseph, Jinyi DeWitt, Jeff, Raghu Bach, Jeff Bach, DeWitt AT&T Sperry, Genentech Examples of NFNF UW-CS, “Essentially all NFNF relationship in practice are in PNF.” e.g. UW-CS, UW-CS, Theory, UW-CS, DB, UW-CS, Bach/Jeff is one (of several such) combination(s) that can teach a course involving both theory and database. But , …… UW-CS, Med_Info_Sys, Neither Sperry nor Genentech by itself can do the job, but together, they can. Med_Info_Sys,
A variable that is not ‘bound’ through or Relational Calculus (TRC) • Query language for FLAT tables. • Atomic Formula • R(S) – S is a tuple in R • d θ S.L or S.Lθd or S.LθS1.L1 • Formula • Any atomic formula • p, pq, pq, pq • S (P(S)) – S is a variable • S (P(S)) – S is a variable • Query • {u | p(u)}, where u is the only ‘free variable’ in p.
R(S) • S O.L • d θ O.L orO.L θ d or O.L θ O1.L1 • O.L = {u|(u, t1, …, tn)} The set of all u s.t. is true. u, t1, … tn are free variables of . If O is a variable, it is distinct. This is the biggie. S is a tuple in R S is in (the set) O.L Comparisons , θ is =, <, > etc. Query Language (Extended Relational Calculus) Formulas built using , , , , , from atoms of the form:
Anc = {p | t (almost(t) p t} Example: Transitive Closure OK(t) = s [s t ( p(Par(p) s.1 = p.1) q(Par(q) s.2 = q.2))] trans(t) = p q[p t q t OK(t) p.2 = q.1 ( r(r t r.1 = p.1 r.2 = q.2))] contains(t) = p [ Par(p) p t] almost(t) = OK(t) trans(t) contains(t) Intersection of models Note: The variable t takes on relations as values. This underscores the PNF restriction—the intermediate relations are not in PNF. (Note that t is really t.1—a tuple whose only field takes on relations as values.) This query can’t be expressed if all relations are to be in PNF.
a1 b1 c1 c2 b2 c3 a2 b3 c4 b1 c5 Miscellaneous Remarks 1. PNF condition: [A, {[B, {C}]}] a1 b1 c1 c2 b2 c3 a2 b3 c4 2. Transitive closure example variables taking tuple values vs. variables taking relation values(Also, example is due to Vardi; similarly, domain independence is due to Dipaola.) What if this is b3?
1 NF relational Calculus e.g. {t | R(t)} An expression {t | ψ(t)} is SAFE if: 1. Whenever ψ(t) is true, each component (field) of t is in DOM(ψ). 2. For each sub formula of ψ of the form u((u)), if any component of u is not in DOM(ψ), then (u) is true for all values of the other free variables in . 3. For each sub formula of ψ of the form u((u)), if (u) is true for any values of the other free variables in , then each component of u is in DOM(ψ). • What is DOM(ψ)? • What does “safety” give us?
NFNF Query Languages (Algebra) • We’ve already seen the calculus, which is as expressive as Datalog. • NFNF Algebra • The usual operators (Suitably generalized) • , ,, , , • And some fancy additions: (nest), (unnest), (powerset) • Main Result • Safe calculus Algebra
Operators - Nest • Let R = [R1, …, Rn] = [S1, …, Sk, T1, …, Tm] • Let I be an instance of R. • S=(S1…Sk) (I) : “Group S” • { u | r I ( (u.Ti = r.Ti, 1i m) (u[S] = {q[ S1, …, Sk] | q I (q.Ti = r.Ti, 1 i m)} ) ) }
Phones Emps BName Owner {P.No.} {[Ename, Ejob]} 288-2323 GmbH Nancy Karl Researcher 232-4232 232-2345 Lucy Designer E.Inc. John Jim Translator 877-2323 John Explorer W.Fix John … Gail Attorney Operators - Nest What is : υallmine= (Bname, Phones, Emps) (Business)?
Owner Allmine {(Bname, {P.No.}, {[Ename, Ejob]})} Nancy GmbH 288-2323 Karl Researcher 232-4232 John E.Inc. 232-2345 Lucy Designer 877-2323 Jim Translator John Explorer W.Fix … Gail Attorney Example What is : υallmine= (Bname, Phones, Emps) (Business)?
More on Nest • Can you say anything about the key fields of a nested relation? • Is υ commutative?Let R = [A, B, C], I be an instance of R. Is υD=(B) (υE=(C) (I)) = υE=(C) (υD=(B) (I)) ? Find a (counter) example. Both expressions have the same type: ⊛ [⊗[A, ⊛[⊗[B]], ⊛[⊗[C]]]]
Operators – Unnest • Let R = {[R1, …, S, …,Rn]} where S = {[S1, …, Sm]} • Let I be an instance of R. Then S(I) has type {[R1, …, S1, …, Sm, …,Rn]} S(I) = {t | r I ( (t.Ri = r.Ri, 1 i n) t [S1, …, Sm] r.S) }
(Phones){P.No.} Owner Phone (Phones){P.No.} Owner Owner Nancy 288-2323 Nancy 288-2323 Nancy 288-2323 232-4232 Nancy 232-4232 232-4232 John 232-2345 John 232-2345 John 232-2345 877-2323 John 877-2323 John 877-2323 John Example – Cont. Consider projected= owner, phones (business) Another instance of projected Projected phones(Projected)
Operators – Nest vs. Unnest • υ and are (almost) inverses. Let I be an instance of type R. • When is (Ri) (υ(Ri)I) = I? • What happens when: 1. The unnested field has different (set-)values for the same values in the other columns? 2. The unnested field has the empty set? • Can (2) happen following a nest on the same column? • When is υ(Ri) ((Ri)I) = I?