400 likes | 536 Views
Chapter 5. Notes. P. 189: Sets, Bags, and Lists . To understand the distinction between sets, bags, and lists, remember that a set has unordered elements, and only one occurrence of each element.
E N D
Chapter 5 Notes
P. 189: Sets, Bags, and Lists • To understand the distinction between sets, bags, and lists, remember that a set has unordered elements, and only one occurrence of each element. • A bag allows more than one occurrence of an element, but the elements and their occurrences are unordered. • A list allows more than one occurrence of an element, but the occurrences are ordered. • Thus, {1, 2, 1} and {2, 1, 1} are the same bag, but (1, 2, 1) and (2, 1, 1) are not the same list.
P. 206: Why bags? • Commercial DBMS’s implement relations that are bags, rather than sets ….. Some relational operations are considerably more efficient if we use the bag model • To take the union of two relations as bags, we simply copy one relation and add to the copy all the tuples of the other relation • There is no need to eliminate duplicate copies of a tuple that happens to be in both relations. • When we project relations as sets, we need to compare each projected type with all the other projected tuples, to make sure that each projection appears only once. • If we can accept a bag as a result, then we simply project each tuple and add it to the result; no comparison with other projected tuples is necessary.
Union, Intersection, and Difference of Bags • Suppose that R and S are bags, and that tuple t appears n times in R and m times in S. • In the bag union R S, tuple t appears n + m times • In the bag intersection R S, the tuple t appears min(n, m) times • In the bag difference R – S, tuple t appears max(0, n-m) times
Projection and Selection on Bags • Each tuple is processed independently during a projection • If the elimination of one or more attributes during the projection causes the same tuple to be created from several tuples, these duplicate tuples are not eliminated from the result of the a bag-projection. • To apply a selection to a bag, we apply the selection condition to each tuple independently. • Duplicate tuples are not eliminated in the result
Product and Joins of Bags • The rule for the Cartesian product of bags is the expected one. • Joining bags presents no surprises • Compare each tuple of one relation with each tuple of the other, • decide whether or not this pair of tuples joins successfully, • and if so we put the resulting tuple in the answer. • Duplicate tuples are not eliminated in the answer.
Section 5.2: The Extended Algebra δ = eliminate duplicates from bags. τ = sort tuples. γ = grouping and aggregation. Outerjoin: avoids “dangling tuples” = tuples that do not join with anything.
Duplicate Elimination • R1 := δ(R2). • R1 consists of one copy of each tuple that appears in R2 one or more times.
δ(R) = A B 1 2 3 4 Example: Duplicate Elimination R = ( A B ) 1 2 3 4 1 2
5.2.2: Aggregation Operators • Aggregation operators are not operators of relational algebra. • Rather, they apply to entire columns of a table and produce a single result. • The most important examples: SUM, AVG, COUNT, MIN, and MAX.
Example: Aggregation R = ( A B ) 1 3 3 4 3 2 SUM(A) = 7 COUNT(A) = 3 MAX(B) = 4 AVG(B) = 3
5.2.4: Grouping Operator • R1 := γL (R2). L is a list of elements that are either: • Individual (grouping ) attributes. • AGG(A ), where AGG is one of the aggregation operators and A is an attribute. • An arrow and a new attribute name renames the component.
Applying γL(R) • Group R according to all the grouping attributes on list L. • That is: form one group for each distinct list of values for those attributes in R. • Within each group, compute AGG(A ) for each aggregation on list L. • Result has one tuple for each group: • The grouping attributes and • Their group’s aggregations.
Then, average C within groups: A B X 1 2 4 4 5 6 First, group R by A and B : A B C 1 2 3 1 2 5 4 5 6 Grouping/Aggregation R = ( A B C ) 1 2 3 4 5 6 1 2 5 γA,B,AVG(C)->X (R) = ??
5.2.6: Sorting • R1 := τL (R2). • L is a list of some of the attributes of R2. • R1 is the list of tuples of R2 sorted first on the value of the first attribute on L, then on the second attribute of L, and so on. • Break ties arbitrarily. • τ is the only operator whose result is neither a set nor a bag.
Example: Sorting R = ( A B ) 1 2 3 4 5 2 τB (R) = [(5,2), (1,2), (3,4)]
5.2.7: Outerjoin • Suppose we join R⋈C S. • A tuple of R that has no tuple of S with which it joins is said to be dangling. • Similarly for a tuple of S. • Outerjoin preserves dangling tuples by padding them NULL.
R OUTERJOIN S = A B C 1 2 3 4 5 NULL NULL 6 7 Outerjoin R = ( A B ) S = ( B C ) 1 2 2 3 4 5 6 7 (1,2) joins with (2,3), but the other two tuples are dangling.
5.3 A Logic for Relations • The logical query language Datalog consists of if-then rules. • Each of these rules expresses the idea that from certain combinations of tuples in certain relations, • we may infer that some other tuple must be in some other relation, or in the answer to a query. • If-then logical rules have been used in many systems. • Nonrecursive rules are equivalent to the core relational algebra. • Recursive rules extend relational algebra and appear in SQL-99.
Integration example • Goal: integrated view of the menus at many bars Sells(bar, beer, price). • Joe has data JoeMenu(beer, price). • Approach 1: Describe Sells in terms of JoeMenu and other local data sources. Sells(’Joe’’s Bar’, b, p) <- JoeMenu(b, p)
Integration Example cont. • Approach 2: Describe how JoeMenu can be used as a view to help answer queries about Sells and other relations. JoeMenu(b, p) <- Sells(’Joe’’s Bar’, b, p)
A Logical rule • Our first example of a rule uses the relations Frequents(drinker, bar), Likes(drinker, beer), and Sells(bar, beer, price). • The rule is a query asking for “happy” drinkers --- those that frequent a bar that serves a beer that they like.
Head = consequent, a single subgoal Body = antecedent = AND of subgoals. Read this symbol “if” Anatomy of a Rule Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p)
Subgoals Are Atoms • An atom is a predicate, or relation name with variables or constants as arguments. • In essence, a predicate is the name of a function that returns a boolean value. • R(a, b, c) is true if (a, b, c) is a tuple of R. • The head is an atom; the body is the AND of one or more atoms. • Convention: Predicates begin with a capital, variables begin with lower-case.
The predicate = name of a relation Arguments are variables (or constants). Atom Sells(bar, beer, p)
Interpreting Rules • A variable appearing in the head is distinguished; otherwise it is nondistinguished. • Rule meaning: The head is true for given values of the distinguished variables if there exist values of the nondistinguished variables that make all subgoals of the body true.
Distinguished variable Nondistinguished variables Interpretation Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p) Interpretation: drinker d is happy if there exist a bar, a beer, and a price p such that d frequents the bar, likes the beer, and the bar sells the beer at price p.
Arithmetic Atoms or Subgoals • In addition to relations as predicates, a predicate for a subgoal of the body (an atom) can be an arithmetic comparison. • We write arithmetic atoms in the usual way, • e.g., x < y. • The previously defined atoms are called relational atoms
Arithmetic Atoms • A beer is “cheap” if there are at least two bars that sell it for under $2. Cheap(beer) <- Sells(bar1,beer,p1) AND Sells(bar2,beer,p2) AND p1 < 2.00 AND p2 < 2.00 AND bar1 <> bar2
Negated Subgoals • NOT in front of a subgoal negates its meaning. • Example: Think of Arc(a,b) as arcs in a graph. • S(x,y) says the graph is not transitive from x to y ; i.e., there is a path of length 2 from x to y, but no arc from x to y. S(x,y) <- Arc(x,z) AND Arc(z,y) AND NOT Arc(x,y)
Datalog Rules and Queries • Applying a rule • consider all combinations of values of the variables. • If all subgoals are true, then evaluate the head. • The resulting head is a tuple in the result. Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p) FOR (each d, bar, beer, p) IF (Frequents(d,bar), Likes(d,beer), and Sells(bar,beer,p) are all true) add Happy(d) to the result • Note: set semantics so add only once
Applying a Rule 2 • For each subgoal, consider all tuples that make the subgoal true. • If a selection of tuples define a single value for each variable, then add the head to the result. • Leads to finite search for P(x)<-Q(x), but P(x)<-Q(y) is problematic. • We want rule evaluations to be finite and lead to finite results. • “Unsafe” rules like P(x)<-Q(y) have infinite results, even if Q is finite.
Rule Evaluation 2 Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p) FOR (each f in Frequents, i in Likes, and s in Sells) IF (f[1]=i[1] and f[2]=s[1] and i[2]=s[2]) add Happy(f[1]) to the result
Safe Rules • A rule is safe if: • Each distinguished variable, • Each variable in an arithmetic subgoal, and • Each variable in a negated subgoal, also appears in a nonnegated, relational subgoal. • Safe rules prevent infinite results.
Unsafe Rules • Each of the following is unsafe and not allowed: • S(x) <- R(y) • S(x) <- R(y) AND NOT R(x) • S(x) <- R(y) AND x < y • In each case, an infinity of x ’s can satisfy the rule, even if R is a finite relation.
Safe Rules Advantage • We can use “approach 2” to evaluation, where we select tuples from only the nonnegated, relational subgoals. • The head, negated relational subgoals, and arithmetic subgoals thus have all their variables defined and can be evaluated.
5.4 Datalog Programs • Datalog program = collection of rules. • In a program, predicates can be either • EDB = Extensional Database = stored table. • IDB = Intensional Database = relation defined by rules. • Never both! No EDB in heads
Evaluating Datalog Programs • As long as there is no recursion, we can pick an order to evaluate the IDB predicates, so that all the predicates in the body of its rules have already been evaluated. • If an IDB predicate has more than one rule, each rule contributes tuples to its relation.
Example Datalog Program • Using EDB Sells(bar, beer, price) and Beers(name, manf), find the manufacturers of beers Joe doesn’t sell. JoeSells(b) <- Sells(’Joe’’s Bar’, b, p) Answer(m) <- Beers(b,m) AND NOT JoeSells(b) • Step 1: Examine all Sellstuples with first component ’Joe’’s Bar’. • Add the second component to JoeSells. • Step 2: Examine all Beerstuples (b,m). • If b is not in JoeSells, add m to Answer.
Relational Algebra and Datalog • Without recursion, Datalog can express all and only the queries of core relational algebra. • The same as SQL select-from-where, without aggregation and grouping. • But with recursion, Datalog can express more than these languages. • Yet still not Turing-complete.