Database Systems 236363

Database Systems236363 Relational Algebra

Query Languages • A query • An expression that enables extracting data from a database • A query language • A language for expressing queries

Relational Algebra • Relational algebra is a language that enables expressing queries over relational databases • The syntax is similar to algebra, but the expressions operate on tables • The result of such an algebraic expression (query) is a table (relation) Unless stated otherwise, we assume that all relations are sets and all algebraic operators operate on sets

Algebraic Operators • There are 5 basic operators plus 2 technical ones • Unary operators: Projection, Selection • Binary operators: Cartesian product, Subtraction, Union • Renaming attributes: a technical operation to enable composing operators • Assignment: a technical operation to enable execution in stages • Complex operators that can be obtained by composing basic operators: • Intersection, Join (including its variants), Division, etc. • For performance reasons, some complex operators might have direct implementations

Projection • Projection eliminates some of the attributes from the records of a given relation • In terms of tables manipulation, this means removing columns that do not appear in the projection index and then eliminating multiple lines formed due to the column elimination T= πHeb,RomanT=

Selection • Let T=T(A1,…,Am) be a table with schema consisting of attributes A1,…,Am. For an expression θ, σθT consists of all records in T satisfying the expression θ • θ may consist of the following: • Comparisons (through the operators =,<,>,≥,≤,≠) between an attribute and a constant or between two attributes. When the attribute is a set, it is also possible to use set operators like  etc. • Boolean operators, e.g., (A1=“Cat”)  (A2≤8) T= σNo≤2T=

Union, Subtraction and Intersection • These operations are applied only to relations of the same schema and are identical to their counterparts from set theory • Union ST: obtains all records that appear in either S or T • Subtraction S\T: obtains all records that appear in S but not in T • Intersection: ST: obtains all records that appear in both S and T • Intersection is not a basic operator – it can be expressed as S\(S\T) • Examples: S= T= ST= S\T= ST=

Cartesian Product • A Cartesian product yields all combinations of records from the first relation with records from the second relation • In terms of tables, we take all concatenations of rows from the first table with rows from the second table • Whenever S and T have attributes with the same name, we distinguish between them either by adding the table name as a prefix of the attribute name, e.g., “S.Name” and “T.name”, or by adding a sequence number, e.g., “Name1” and “Name2” S= T= S x T= What happens if one of the relations is empty?

Renaming • This is not an algebraic operation, but rather a technical helper operation that is used to compose complex operations (examples appear shortly) • For a given table T=T(A1,…,Am) with a schema consisting of attributes A1,…,Am, the operator ρA1→B1,…,Am→Bm(T) returns an identical table in which the attribute names were changed to B1,…,Bm T= ρAnimal→חיה,Color →צבע(T)=

Basic Operators • The five operators - Projection, Selection, CartesianProduct, Subtraction, and Union – are basic, i.e., • None of these operators can be obtained from the other four • More complex operators (e.g., Join, Division, Intersection) can be obtained by composing some of the basic ones • How can we show that an operator is basic? • We need to find a property that is satisfied by this operator and cannot be obtained by any composition of the others

Proving that an Operator is Basic • Claim: Projection cannot be expressed by composing Select, Cartesian Product, Subtraction, and Union • Proof sketch: Let R be a relation with n attributes A1,…,An. Then πA1R results in a relation with fewer than n attributes. However, the result of applying any of the operators Select, Cartesian Product, Subtraction, and Union on R (and possibly other relations) yields at least n attributes. • This can be shown by induction on the number of operators in the expression • How can this proof be adjusted to show that Cartesian Product is a basic operator?

θ-Join • Given two relations S(A1,…,An) and T(B1,…,Bm) and an expression θ on the attributes A1,…,An,B1,…,Bm, denote by S⋈θT the result of the algebraic expression σθ(SxT) • Example S= T= S⋈B>CT=

Natural Join • A very common operation on databases • For the relations S(A1,…,An,B1,…,Bm) and T(B1,…,Bm,C1,…,Ck),denote by S⋈T the relation that includes all possible combinations of a record from S with a record from T whose common attributes are the same, in which only a single attribute (column) is kept for each pair of common attributes. • More precisely, S⋈T= A1,...,An,S.B1,...,S.Bm,C1,...,Ck(S⋈(S.B1=T.B1) ... (S.Bm=T.Bm)T) S= T= S⋈T=

Semi-Join • For the relations S(A1,…,An,B1,…,Bm) and T(B1,…,Bm,C1,…,Ck),denote by S⋉T the relation that includes all records in S for which there exists a record in T whose common attributes are the same • More precisely, S⋉T= A1,...,An,B1,...,Bm(S⋈T) • For performance reasons, this is usually implemented directly S= T= S⋉T=

Division • For the relations S(A1,…,An,B1,…,Bm) and T(B1,…,Bm) (i.e., the attributes of T are a subset of the attributes of S), denote by ST the relation R(A1,…,An) consisting of all records for which there exists a record in S corresponding to all records in T • More precisely, ST is the maximal relation R such that RxTS • I.e., ST=A1,...,AnS\A1,...,An(((A1,...,AnS)T)\S) ST= S= T=

Division - Example • We’d like to obtain from table A the supplier number of all suppliers that sell all parts in table B A= B= AB=

Relational Algebra – Summarizing Example • Recall the train operation example A_Time D_Time S_Name T_Num Height S_Type Days Arrives Station Train Platform Km Gives Serves Line Service Height Direction T_Category Food L_Num Class

Relational Algebra – Summarizing Example • What Tables do we Extract? • What columns should exist for the relationship set “Serves”? • The key S_Name (of the “Station” entity set) • The key’s attributes L_Num and Direction (of “Line”) • These triplet would serve as the key for “Serves” • In addition, a column for the relation attribute Km • What columns should exist for the relationship set “Arrives”? • The key T_Num of the entity set “Train” • The key’s attributes for the aggregated relationship set “Serves”, i.e., S_Name, L_num, and Direction • The three attributes of the relationship set “Arrives” itself • Platform, A_time, D_time

The Schemas • From the previous slide (underlined attribute names represent keys) • Serves(S_Name, L_Num, Direction, Km) • Arrives(T_Num, S_Name, L_Num, Direction, Platform, • D_Time, A_Time) • We will represent the multiple value attribute as a separate relation • Station(S_Name, Height) • Station_Type(S_Name, S_Type)

Sample Queries • Which stations are served by the line 1-South? • Here, all information is in the table/relation “Serves” S_Name((L_Num=1)(Direction=“south”)(Serves)) • Which lines serves stations below sea level? • Here, we need to join “Serves” and “Station” L_Num,Direction(Height<0(Station⋈Serves)) • We can also use semi-join L_Num,Direction(Serves⋉S_Name(Height<0(Station)))

Sample Queries • What stations are served by more than one line? • Here, we need to examine two lines from “Serves” at a time S_Name((S_Name=S)((L_NumL)(DirectionD))( S_Name→S, L_Num→L, Direction→D, Km→K (Serves)  Serves)) • Suppose we do not care about different directions of the same line? S_Name((S_Name=S)(L_NumL)(S_Name→S, L_Num→L, Direction→D, Km→K (Serves)  Serves)) • Which stations serve exactly one line (in any direction)? S_Name(Serves) \S_Name((S_Name=S)(L_NumL)( S_Name→S, L_Num→L, Direction→D, Km→K (Serves)  Serves))

Sample Queries • What is the name of the highest station? • It is in fact easier to find all other stations R = S_Name((Height<H)(Station   S_Name→N, Height→HStation)) • Now, we can complete the query S_Name(Station) \ R

Example of Using Division • Which trains (by number) serve all stations? T_Num,S_Name(Arrives)S_Name(Station) • What if there are stations that do not appear in any line? How can we avoid referring to them? T_Num,S_Name(Arrives)S_Name(Serves) • What if there are stations that serve a line but no train stops there? How can we ignore them? T_Num,S_Name(Arrives)S_Name(Arrives)

More Complex Examples • Which lines share “change” stations? • First, we will find all <Line,Serves> couples for which the station is a “change” station R = (Line ⋈ Serves)⋉S_Name(S_Type =“change” (Station_Type)) • Now, the Cartesian product RxR would include all combinations of <Line,Serves> couples for which the station is a “change” station. Suppose we distinguish between attributes of each using indices, e.g., S_Name1,S_Name2, etc. • We would like to select from the product only couples for which the station is the same S =  (S_Name1=S_Name2)  (L_Num1L_Num2) (RR) • Finally, we project to obtain only the attributes we are intereted in T = L_Type1,L_Num1,Direction1,L_Type2,L_Num2,Direction2(S)

Which Queries Cannot be Expressed in RA? • Aggregate functions: • How many lines travel “North”? • What is the average distance between stations in Line 1-South? • How many trains stop in each station on Monday? • These functions operate on an unknown number of parameters (counting, sum, average, …) from the set of values obtained from the relation • There are several extensions to RA that enable such queries as well as in SQL

Inexpressible Queries - Continued • Transitive Closure • The following query cannot be expressed in RA: • What are all the stations that can be reached from station S in a finite number of train changes? • Notice, for any given constant k, we can express the following query • What are all the stations that can be reached from station S in at most k train changes? • The key here is whether k is bounded and given or not • These type of queries cannot be expressed in standard ANSI SQL either without the help of the hosting language • In contrast, Datalog supports recursive queries, and in particular this type of queries

Attribute-less Relations • In order to facilitate handling yes/no questions, and to simplify some other queries, often RA is extended to allow relations with no attributes (tables with no rows) • What can such a relation contain? • It can be empty • It can include a single empty line • This represents the value “true” vs. “false” in an empty relation, or “exists” vs. “non-exist” in an empty relation

Generalizing Algebraic Operators • Projection • For a relation R, the empty set is considered a subset of the set of its attributes and the corresponding projection is denoted πλR • The result is the empty relation if R is an empty relation and the single empty record if R is not empty • Cartesian Product • When S has no attributes, RxS results in an empty relation if S is empty and in R if S includes the empty record • Division • When R and S have the same attributes, RS includes the empty record if SR and is empty otherwise

Queries Equivalence • Two expressions are equivalent if they return the same result when applied to any possible database content • Why is this important? • For performance reasons, we may be able to translate one query into an equivalent one that is easier/faster to evaluate • Example: • For the relation R(A,B,C), which of the following equivalent expressions can be evaluated more efficiently? A (R) A (A,B(R)) A (A,C(R))

Example • Given the relations R(A,B), S(A,B), we will show that ARAS is equivalent to A(RS) • We will show that ARAS  A(RS) • As well as that ARAS  A(RS)

Example - Proof (1/2) • We show that ARAS  A(RS) For any record t, if t(AR  AS) then by the definition of union, either tAR or tAS (or both) Hence, either there exists a uR such that u[A]=t[A] or there exists a uSsuch that u[A]=t[A] (or both) In other words, tA(RS)

Example - Proof (2/2) • We show that ARAS  A(RS) If there exists a tA(RS), then by definition there exists a uRSsuch that u[A]=t[A] Hence, either there is a uRsuch that u[A]=t[A] or there is a uSsuch that u[A]=t[A] (or both) In other words, t(ARAS)

On the Importance of Equivalence • A typical SQL query • One needs to ensure that the optimization does not alter the semantics SELECT DISTINCT A FROM R,S WHERE R.B=S.B AND C=5; Direct translation A ((R.B=S.B)(C=5)(R  S)) Optimization A (R⋉B ( C=5S)))

Outerjoin • Natural Join suffers from information loss – that is, the result does not include records that do not have a corresponding record in the second table • To that end, the missing attributes are padded with null • More precisely, for the relations S(A1,…,An,B1,…,Bm) and T(B1,…,Bm,C1,…,Ck),denote by S⋈outerTthe natural join of S and T such that we append null attributes to each record in S that does not have a corresponding record in T and vice versa • That is, S⋈outerT=(S⋈T)((S\(S⋉T))C1,...,CkTnull)(A1,...,AnSnull(T\(T⋉S))) where Tnull is a relation with the same schema as T that includes one record with null values and similarly Snull is a record with the same schema as S that includes one record with null values

Example S= T= What happens when one of the relations is empty? There are also left-outer join, in which only the records of the left relation are padded and right-outer join in which only the records of the right relation are padded S⋈outerT=

Null Values • Once we allow null values, the definition of the RA operators need to be generalized to encompass them • This definition is complicated and outside the scope of this course

Bag/Multi-set Semantics • Until now, we assumed that relations are sets • In practice, and in particular in SQL, we may wish to allow relations that include multiple entries of the same record • To that end, the operators should be generalized accordingly

RA Operators in Bag Semantics • Projection • we eliminate attributes without eliminating duplicates • Selection • we return the corresponding records without eliminating duplicates • Cartesian product RxS • if a record u appears n times in R and a record v appears m times in S, then their concatenation appears nm times in RxS • Union ST • if a record u appears n times in R and m times in S, it will appear n+mtimes in ST • SubtractionS\T • if a record u appears n times in R and m times in S, it will appear max{0,n-m} times in S\T • Intersection ST • if a record u appears n times in R and m times in S, it will appear min{n,m} times in ST • Operator δ • explicit duplicate elimination • Unless otherwise stated, we assume a set semantics in this course

Database Systems 236363