Database Systems 236363

Database Systems236363 Relational Calculus

Relational Algebra vs. Relational Calculus • Relational algebra queries are relatively easy to implement in a programming language, yet the translation of a query from a natural language (or even SQL) to relational algebra is not intuitive • In the Domain Relational Calculus (DRC) language and its variant Tuple Relational Calculus (TRC), the terms of the query are expressed as first-order logic expression, which is more similar to the way humans compose questions, in contrast to the “algorithmic” structure of relational algebra

Domain Relational Calculus (DRC)

DRC – Basic Concept • DRC enables writing logical expressions describing conditions on the records • Each such expression will have free variables (defined later); the truth/false value of the expression will be evaluated with respect to these variables • The expression (x1,...,xn) describes a condition whose free variables are x1,...,xn • is also called a formula • The free variables of a formula describe the attributes of the records • For any set of possible values 1,...,nfor the free variables x1,...,xn, the expression  returns a truth value • Either(1,..., n)is satisfied or (1,..., n) is not satisfied • R[A1,...,An] = {<x1,...,xn> : (x1,...,xn)} stands for the relation consisting of exactly all n-tuples satisfying (x1,...,xn)

Atomic Expressions • Belongs to formulas • If R[A1,...,An] is a relation in the database, then the formula R(x1,...,xn) is a formula with n free variables • For any sequence of values 1,...,n, the formula R(1,...,n) is satisfied iff the relation includes a record with attributes corresponding to these values • We can also write <1 ,..., n>R instead of R(1,...,n) • Comparisons • Between variables and between variables and constants using the usual comparison operators such as “z=2”, “x<y”, etc. • The free variables are the ones appearing in the formula

Composing Expressions – Boolean • Negation • If (x1,...,xn) is a valid expression, then (x1,...,xn)=(x1,...,xn) is also a valid expression with the same free variables • Meaning: (x1,...,xn) is satisfied iff(x1,...,xn) is not satisfied • Disjunction • If 1(x1,...,xn) and 2(x1,...,xn) are valid expressions, then (x1,...,xn)= 1(x1,...,xn)2(x1,...,xn) is a valid expression that is satisfied if either 1 or 2 is satisfied (or both) • It is possible that only some of the variables x1,...,xnin either 1 or 2 are free variables in this expression • It is possible to “declare” for a given expression the free variables that do not appear in it explicitly

Additional Boolean Operations • Conjunction • The expression 1(x1,...,xn)2(x1,...,xn)is satisfied for a given sequence of values x1,...,xniff it is satisfied for both 1(x1,...,xn) and 2(x1,...,xn) • It is equivalent to ((1(x1,...,xn))(2(x1,...,xn))) • The parenthesis describe the order in which operators are applied in the composed expression when it is not obvious • Implication • Written as 1(x1,...,xn)2(x1,...,xn) • This expression is equivalent to ((1(x1,...,xn))(2(x1,...,xn))) • Equivalence • Written as 1(x1,...,xn)2(x1,...,xn) • This expression is equivalent to (1(x1,...,xn)2(x1,...,xn))(2(x1,...,xn)1(x1,...,xn))

The Existential Quantifier • If (x1,...,xn) is a valid expression, then (x2,...,xn)=x1(x1,...,xn)is a valid expression whose free variables are x2,...,xn • The meaning of this expression is that (2,...,n) is satisfied iff there exists a value 1 for which (1,...,n) is satisfied • Example: • For attributes whose value ranges are the natural numbers, (x2,x3)=x1 ((x2 > x1)  (x1 > x3) is satisfied iffx2,x3 are a pair of decreasing non-consecutive numbers

The Universal Quantifier • Given a valid expression (x1,...,xn), the expression (x2,...,xn)=x1(x1,...,xn) has the following meaning: for any given sequence of values 2,...,n, (2,..., n) is satisfied iff for every possible value 1, (1,..., n) is satisfied • How do we define “every possible value”? • Naturally, if the set of possible values is infinite, verifying such an expressions becomes problematic… • For now, we assume that this is restricted to the range of possible values for the corresponding attribute • If the range is not clear, we need to state it explicitly • A similar problem also exists for  due to negation. In particular, x1(x1,...,xn)is equivalent to x1(x1,...,xn)

Quantified and Free Variables • Consider the formula (y,z)=(x1(x,y))(x2(x,z)) • The free variables here are y and Z • What about x? • x appears twice, once on each side of the  operator. Yet, there is no relation between these occurances • In fact, it is equivalent to the formula (y,z)=(x11(x1,y))(x22(x2,z)) • Denote R1[A,B] the relation defined by 1 and R2[A,C] the relation defined by 2 • The equivalent expression in RA is (BR1)(CR2) • To enforce the same value on both sides of the  operator, the formula should be written (y,z)= x(1(x,y)2(x,z)) • This is equivalent in RA to B,C (R1 ⋈ R2)

Implementing RA Expressions in DRC • Let (x1,...,xn)be a formula representing T[A1,...,An] • The relation A1,...,Am T[A1,...,An] (for m<n) is obtained by {<x1,...,xm>:xm+1,...,xn(x1,...,xm,xm+1,...,xn)} • To obtain T[A1,...,An] for a given expression, denote (x1,...,xn)the expression obtained from  by replacing each Ai with Xi (for each 1in) • We get the following first order logic expression T={<x1,...,xn> : (x1,...,xn)  (x1,...,xn)}

Cartesian Product in DRC • Let 1(x1,...,xn)be a formula representing T1[A1,...,An] and 2(x1,...,xm) be a formula representing T2[B1,...,Bm] • We can represent T1xT2 by the following expression T1T2={<x1,...,xn,xn+1,...,xn+m> : 1(x1,...,xn)2(xn+1,...,xn+m)} • Notice that we need to use different sets of variables for 1and 2

Subtract, Union, Intersection • Let 1(x1,...,xn)be a formula representing T1[A1,...,An] and 2(x1,...,xn) be a formula representing T2[A1,...,An] • We have: • T1 \ T2 = {<x1,...,xn> : 1(x1,...,xn)  2(x1,...,xn)} • T1T2 = {<x1,...,xn> : 1(x1,...,xn)  2(x1,...,xn)} • T1T2 = {<x1,...,xn> : 1(x1,...,xn)  2(x1,...,xn)}

Composed Expressions in DRC • Let 1(x1,...,xn,y1,...,ym)be a formula representing T1[A1,...,An,B1,...,Bm] and 2(y1,...,ym,z1,...,zk) a formula representing T2[B1,...,Bm,C1,...,Cc] • The natural join of T1 and T2 can be expressed as follows T1 ⋈T2= {<x1,...,xn,y1,...,ym,z1,...,zk> : 1(x1,...,xn,y1,...,ym) 2(y1,...,ym,z1,...,zk)} • The semi-join can be expressed as T1 ⋉T2 = {<x1,...,xn,y1,...,ym> : z1,...,zk (1(x1,...,xn,y1,...,ym) 2(y1,...,ym,z1,...,zk))}

Formal Equivalence Proof • In order to exemplify how equivalence proofs are structured, we will formally prove that if T1 corresponds to 1(x1,...,xn,y1,...,ym)and T2 corresponds to 2(y1,...,ym,z1,...,zk) then for the expression (x1,...,xn,y1,...,ym,z1,...,zk) = 1(x1,...,xn,y1,...,ym)2(y1,...,ym,z1,...,zk) the following holds: T1⋈T2={<x1,...,zk>:(x1,...,zk)} • To that end, we need to show that for every two relations T1 and T2 and every sequence of values t=(1,...,n,1,...,m,1,...,k): • If tT1⋈T2 then (t) is satisfied • If (t) is satisfied then tT1⋈T2

First Direction • If t=(1,...,n,1,...,m,1,...,k)T1⋈T2 then by the definition of join, (1,...,n,1,...,m)T1 meaning that 1(1,...,n,1,...,m) is satisfied and (1,...,m,1,...,k)T2 meaning that 2(1,...,m,1,...,k) is also satisfied • Hence, 1(1,...,n,1,...,m)2(1,...,m,1,...,k) = (1,...,n,1,...,m,1,...,k) is satisfied, as needed

Second Direction • If (t)=1(1,...,n,1,...,m)2(1,...,m,1,...,k) is satisfied, then by the definition of conjunction, 1(1,...,n,1,...,m) is satisfied, meaning that (1,...,n,1,...,m) T1 • Similarly, 2(1,...,m,1,...,k) is satisfied, meaning that (1,...,m,1,...,k)T2 • Hence, by the definition of natural join T1⋈T2, we have (1,...,n,1,...,m,1,...,k)T1⋈T2,as needed

Boolean Queries • If  is a DRC expression without any free variables (but uses the database relations), then the query “{<>:}” is still meaningful • The result will be a relation without any attributes • It is the empty relation if  is not satisfied over the database • Otherwise, it is the relation that includes a single empty record • We sometimes use the notation “” for such a query • For example, the following query checks whether the relations R and S satisfy R⊆S: {<>: x1,...,xn (R(x1,...,xn) → S(x1,...,xn))}

Problematic Expressions • Consider the DRC expression: {<x> : R(x)} • Can it be written in RA? • Can it be implemented in a table? • How about the expression: {<x,y> : x=y} ? • We should avoid extremely large (and in particular infinite) tables as well as queries (and sub-queries) that require addressing extremely large (and in particular infinite) domains • In other words, we would like to restrict ourselves to the content of the DB and avoid being dependent on the domains

Safe DRC • Safe-DRC is intended to ensure that all expressions can be implemented without having to scan entire domains • This is both in order to prepare the input and in order to evaluate expressions with quantifiers • A DRC expression is called safe if it satisfies some additional requirements that ensure that its result only depends on the relations in the database and not on the entire possible range of attribute values • Not every DRC expression has an equivalent Safe-DRC expression • A more formal definition is given in the tutorial

DRC – Summarizing Example • We are going to compare between a few RA and DRC queries A_Time D_Time S_Name T_Num Height S_Type Days Arrives Station Train Platform Km Gives Serves Line Service L_Type Direction T_Category Food L_Num Class

The Relevant Relations • Since in DRC the order of variables may impact the result, we will first define the order of attributes for each relation: • Station[S_Name, Height] Station_Type[S_Name,S_Type] • Line[L_Num,Direction,L_Type] Serves[S_Name,L_Num,Direction,Km] • Train[T_Num,Days] Service[T_Category,Class,Food] • Gives[T_Num,T_Category,Class] • Arrives[T_Num,S_Name,L_Num,Direction,A_Time,D_Time,Platform]

Sample Query in DRC • Which stations are served on Line 1-south? • As in RA, we can extract the data from “Serves” {<x> : y,z,w ( Serves(x,y,z,w)  y=1  z=“south” )} • This translates directly to RA as S_Name ((L_Num=1)(Direction=“south”) (Serves)) • Such an expression can be also written in a shortened manner as {<x> : w Serves(x,1,“south”,w)}

Another Example • Which lines have stations below sea level? • For every pair of variables y,z representing a line (L_Num,Direction) for which there are stations, we can identify the stations using the variable x for which there is some w such that Serves(x,y,z,w) is true • Hence, the expression {<y,z>:x,w(Serves(x,y,z,w)(x))} where(x)is satisfied iffx is a station below sea level • For this, we have (x) = u(Station(x,u)  (u < 0 )) • Resulting in the following query {<y,z>:x,w(Serves(x,y,z,w)  (u(Station(x,u)  (u<0)))}

The Other Direction • What does the following query means? {<x>:y1,y2(Station(y1,y2)  z1,z2,z3,z4,z5(Arrives(x,y1,z1,z2,z3,z4,z5)))} • The sub-expression z1,z2,z3,z4,z5(Arrives(x,y1,z1,z2,z3,z4,z5)) represents that train number x stops in station called y1 of a certain line number, direction, etc. • The sub-expression around the  sign (without the quantifiers) indicates that if (y1,y2) is a station then x is a train that stops there • Hence, the complete expression gives the train numbers stopping in all stations • Always? How can this be fixed? • Personal exercise: write an expression that will only take into account stations in which trains stop at

A More Complex Example • What are the lines that share a change station? • First, here is an expression for obtaining all change stations 1(x)=y(Station(x,y)  Station_Type(x,“change”)) • Next, we locate pairs of lines appearing with the same change station 2(u1,v1,u2,v2)= x,w1,w2(1(x)Serves(x,u1,v1,w1)Serves(x,u2,v2,w2)) • The complete query, including line types, is then {<u1,v1,t1,u2,v2,t2> : Line(u1,v1,t1)  Line(u2,v2,t2)  2(u1,v1,u2,v2)  (u1≠u2)}

Tuple Relational Calculus (TRC)

TRC • The main difference between TRC and DRC is that in TRC variables’ value are complete records rather than a single attribute • Hence, to define a query that returns a relation, we need an expression (t) that has a single free variable (although it may include sub-expressions that have quantified variables) • The general query structure is {t[A1,...,An]:(t)}, which returns a relation with attributes A1,...,An including all possible records having corresponding attributes for which  is satisfied

Atomic Formulas in TRC • Belongs to • For a relation R[A1,...,An] and variable t, the atomic formula “tR” is satisfied iff the value of t is a record inside R • Comparisons • We denote t[A] the value of attribute A in the variable t • For example, “t[A]=2” is satisfied if that value of A in t is 2 • “r[B]<s[C]”is satisfied if the values of the corresponding attributes of s and r satisfy this condition

Composed Formulas in TRC • Boolean Formulas • These are written in TRC exactly the same as in DRC and have the same meaning • For example, the formula 1(t1,...,tk)2(t1,...,tk)is satisfied iff both 1and 2 are satisfied with respect to t1,...,tk • Recall that here each ti is a complete record (tuple) • Quantifiers • Are also written as expected • For example, t1 (t1,...,tk) is satisfied iff there is a possible record (regardless of whether it is in the database or not) for which (t1,...,tk) is satisfied – here t2,...,tkare free variables • Yet, here the quantifier applies to complete records (lines) rather than single attributes • When it is not clear from the context, we write the attributes of the quantified variable explicitly, e.g., t1[A1,…,An](t1,...,tk)

Sample Queries in TRC • In the train operation example, what is the formula for “which stations are served by line 1-south”? • {t[S_Name] : s[S_Name,L_Num,Direction,Km] ( sServes  t[S_Name]=s[S_Name]   s[L_Num]=1  s[Direction]=“south” )} • What about “which lines serve stations below sea level”? • {t[L_Num,Direction] : r,s (rStation  sServes   s[L_Num]=t[L_Num]  s[Direction]=t[Direction]   r[S_Name]=s[S_Name]  r[Height]<0 )}

The Expressive Power of TRC and DRC • Every query that can be expressed in TRC can also be expressed in DRC and vice versa • In particular, TRC also has the problem that it can be used to define queries that cannot be implemented in a database • Similarly to DRC, TRC also has a safe version • Safe-TRC and Safe-DRC have the same expressive power, i.e., any formula written by Safe-TRC can be expressed in Safe-DRC and vice versa

The Expressive Power of Relational Calculus • The expressive power of relational calculus is at least as strong as RA since as we saw for DRC (and it is true also for TRC), any query in RA can be expressed in DRC • Yet, there are expressions in DRC (and TRC) that cannot be expressed in RA • Although DRC enables writing expressions that cannot be implemented in a database • Safe relational calculus (both Safe-DRC and Safe-TRC) as studied in the tutorials have the same expressive power as RA • Relational calculus (both safe and unsafe) cannot be used for expressing transitive closure • E.g., “What are all the stations that can be reached from station S in a finite number of train changes?”

Database Systems 236363