870 likes | 879 Views
Explore the impact of logic in CS and its application in database management systems, focusing on syntax, semantics, and relational calculus.
E N D
Faculty of Computer Science Technion – Israel Institute of Technology Database Management Systems Course 236363 Lecture 5: Queries in Logic
Logic, Computers, Databases: Syntax • Logic has had an immense impact on CS • Computing has strongly driven one particular branch of logic: finite model theory • That is, (FO/SO) logic restricted to finite models • Very strong connections to complexity theory • The basis of branches in Artificial Intelligence • It is a natural tool to capture and attack fundamental problems in database management • Relations as first-class citizens • Inference for assuring data integrity • Inference for question answering (queries) • It has been used for developing and analyzing the relational model from the early days (Codd, 1972)
A bit about First Order Logic First Order Languages Alphabet: variables, constants, function symbols, predicate symbols, connectives, quantifiers, punctuation symbols term: variable, constant, or inductively f(t1,…,tn) where the ti are terms and f is a function symbol Well-formed-formula (wff): an atom p(t1,…,tn) where p is a predicate symbol or inductively, where F, g are wff’s: Given an alphabet, a first order language is the set of all wff’s.
A bit about First Order Logic: Semantics Scope:The scope of in F orinFis F. A variable is bound if it is the scope of a quantifier, and free otherwise. Interpretation: A non-empty set D, the domain Each constant is assigned an element of D Each n-ary function symbol f is assigned a function Dn D Each n-ary predicate symbol p is assigned a function Dn{True, False} (can encode as a relation p) Variable Assignment: A variable assignment V assigns to each variable X a value from D.
A bit about First Order Logic: Truth Value Truth value for formula F under interpretation I and variable assignment V: Atom p(t1,…,tn): p’(t’1,…,t’n) where p’ is the interpretation I of p and t’jis an element in D according to Vand the interpretation of the function symbols in I. For F G, FG, ¬F according to truth table. has the truth value True if there exists an element d in D such that if the value assigned to X in V is replaced by d, the truth value of F is True, otherwise it is False. has the truth value True if for all elements d in D, if the value assigned to X in V is replaced by d, the truth value of F is True, otherwise it is False. If a formula has no free variables (closed formula), we can simply talk of the truth value of the formula.
Relational Calculus (RC) • RC is, essentially, first-order logic (FO) over the schema relations • A query has the form “find all tuples (x1,...,xk) that satisfy an FO condition” • RC is a declarative query language • Meaning: a query is not defined by a sequence of operations, but rather by a condition that the result should satisfy
RC Query • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) { (x,u) | Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]] } z Which relatives does this query find? w u y x
RC Symbols • Constant values: a, b, c, ... • Values that may appear in table cells • Variables: x, y, z, ... • Range over the values that may appear in table cells • Relation symbols: R, S, T, Person, Parent, ... • Each with a specified arity • Will be fixed by the relational schema at hand • No attribute names, only attribute positions!
Atomic RC Formulas • Atomic formulas: • R(t1,...,tk) • Ris a k-aryrelation • Each tiis a variable or a constant • Semantically it states that (t1,...,tk) is a tuple in R • Example: Person(x, 'female', 'Canada') • x op u • xis a variable, u is a variable/constant, opis one of >, <, =, ≠ • Example: x=y, z>5
RC Formulas • Formula: • Atomic formula • If φand ψare formulas then these are formulas: φ ⋀ψφ⋁ψφ→ψ¬φ∃xφ∀xφ Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]
Free Variables • Informally, variables not bound to quantifiers • Formally: • A free variable of an atomic formula is a variable that occurs in the atomic formula • A free variable of φ ⋀ψ,φ ⋁ψ,φ⟶ψ is a free variable of either φor ψ • A free variable of ¬φis a free variable of φ • A free variable of ∃xφand∀xφis a free variabley of φsuch that y≠x • The notation φ(x1,...,xk) implies that x1,...,xkare the free variables of φ(in some order)
What Are the Free Variables? Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w[Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]] φ(x,u), CandianAunt(u,x), ... Person(u, 'female', v) ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w) ]
Relation Calculus Query • An RC query is an expression of the form { (x1,...,xk) | φ(x1,...,xk) } where φ(x1,...,xk) is an RC formula • An RC query is over a relational schema S if all the relation symbols belong to S (with matching arities)
RC Query • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) { (x,u) | Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]] } z w u y x
Studies CourseType Who took all core courses? Studiesπcoursetype=‘core’CourseType
RS in Primitive RA vs. RC R(X,Y) ÷ S(Y) • πXR \ πX((πXR ×S) \ R ) In RA: In RC: { (X) | ∃Z [R(X,Z)] ⋀ ∀Y[S(Y) ⟶ R(X,Y)]}
DRC vs. TRC • There are two common variants of RC: • DRC: Domain Relational Calculus (what we're doing) • TRC: Tuple Relational Calculus • DRC applies vanilla FO: variables span over attribute values, relations have arity but no attribute names • TRC is more database friendly: variables span over tuples, where a tuple has named attributes • There are easy conversions between the two formalisms; nothing deep
Our Example in TRC • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) {t| ∃a [a∈Person ⋀ a[gender]= 'female' ⋀ a[country]='Canada'] ⋀ ∃p,q[p∈Parent⋀ p[chlid]= t[nephew] ⋀ q∈Parent ⋀ q[child]=p[parent]⋀ ∃w[w∈Parent⋀ w[parent]=q[parent]⋀ w[child]≠q[child] ⋀ ((t[aunt]=w[child] ⋀ t[aunt]=a[id]) ⋁ ∃s [s∈Spouse ⋀ s[person1]=w[child] ⋀ s[person2]=t[aunt] ⋀ t[aunt]=a[id] ] )]]]} q w s p
What is the Meaning of the Following? { (x) | ¬Person(x, 'female', 'Canada') } { (x,y) |∃z [Spouse(x,z) ⋀ y=z]} { (x,y) |∃z [Spouse(x,z) ⋀ y≠z]} • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2)
Domain • Let S be a schema, let I be an instance over S, and let Q be an RC query over S • The active domain (of I and Q) is the set of all the values that occur in either I or Q • The query Q is evaluated over I with respect to a domain D that contains the active domain • This is the set over which quantifiers range • We denote by QD(I) the result of evaluating Q over I relative to the domain D
Domain Independence • Let S be a schema, and let Qbe an RC query over S • We say that Q is domain independent if for every instance I over S and every two domains D and E that contain the active domain, we have: QD(I)=QE(I)
Which One is Domain Independent? { (x) | ¬Person(x, 'female', 'Canada') } { (x,y) |∃z [Spouse(x,z) ⋀ y=z]} { (x,y) |∃z [Spouse(x,z) ⋀ y≠z]} { (x) |∃z,w Person(x,z,w) ⋀ ∃y [¬Likes(x,y)]} { (x) |∃z,wPerson(x,z,w) ⋀ ∀y [¬Likes(x,y)]} { (x) |∃z,w Person(x,z,w) ⋀ ∀y [¬Likes(x,y)] ⋀ ∃y [¬Likes(x,y)]} • Person(id, gender, country) • Likes(person1, person2) • Spouse(person1, person2)
Bad News... • We would like be able to tell whether a given RA query is domain independent, • ... and then reject “bad queries” • Alas, this problem is undecidable! • That is, there is no algorithm that takes as input an RC query and returns true iff the query is domain independent
Good News Domain-independent RC has an effective syntax; that is: • A syntactic restriction of RC in which every query is domain independent • Restricted queries are said to be safe • Safety can be tested automatically (and efficiently) • Most importantly,for every domain independent RC query there exists an equivalent safe RC query!
Safety • We do not formally define the safe syntax in this course • Details on the safe syntax can be found in the textbook Foundations of Databases by Abiteboul, Hull and Vianu • Example: • In ∃xφ, the variable x should be guardedby φ • Every variable is guarded by R(x1,...,xk) • In φ ⋀ (x=y), the variable x is guarded if and only if either x or y is guarded by φ • ... and so on
Example Revisited R(X,Y) ÷ S(Y) • πXR \ πX((πXR ×S) \ R ) In RA: In RC: { (X) | ∃Z [R(X,Z)] ⋀ ∀Y[S(Y) ⟶ R(X,Y)]}
Another Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) { (x) |∃z,wPerson(x,z,w) ⋀ ∀y [¬Spouse(x,y)]} πidPerson \ person1/idπperson1Spouse
Equivalence Between RA and D.I. RC Theorem: RA and domain-independent RC have the same expressive power. More formally, on every schema S: • For every RA expression E there is a domain-independent RC query Q such that Q≣E • For every domain-independent RC query Q there is an RA expression Esuch that Q≣E
About the Proof • The proof has two directions: • Translate a given RA query into an equivalent RC query • Translate a given RC query into an equivalent RA query • Part 1 is fairly easy: induction on the size of the RA expression • Part 2 is more involved
RA DRC Think about a table with k attribute names has having numbered columns, $1…$k
מה בעצם הבעיה עם שאילתות שאינןdomain independent • הרי ניתן פשוט לבצע את השאילתה על התחום האקטיבי. זה כמובן יוביל לתוצאה מוגדרת היטב וסופית. • גם בתסריט זה הבעייתיות של שאילתות כאלו, בין השאר, נובעת מ: • תשובות תלויות שאילתה. נניח שיש רלציהR עם טורA וNיה אחת R(7) . נניח שאילתה Qהמוגדרתפרמטרית כ- ¬ R(a). כאשרa =7 התשובה 7, כאשר a=5 התשובה 5 וכן הלאה. • תשובות משתנות עם תוספת "לא רלוונטית" למסד הנתונים. נניח רלציהR עם 2 טוריםA ראשון ו- Bשני, וNיה אחת (7,7)R. נניח שאילתה Q המוגדרת כ- ∃ Y ¬ R(X,Y). התשובה ריקה. אבל אם נוסיףNיה (7,6) תופיע התשובה 6 =X למרות שבטור Aהנוגע ל-X לא חל כל שינוי והשינוי (תוספת ערך 6) חל בטור "לא רלוונטי" לשאילתה. • בשתי התופעות שאילתות שאינן domain independent יוצרות תופעות לא אינטואיטיביות.
Datalog Path(x,y) Edge(x,y) Path(x,z) Edge(x,y),Path(y,z) InCycle(x) Path(x,x) • Database query language • “Clean” restriction of Prolog w/ DB access • Expressive & declarative: • Set-of-rules semantics • Independence of execution order • Invariance under logical equivalence • Mostly academic implementations; some commercial instantiations
Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) Local(x) Person(x,y,'IL') Relative(x,x) Person(x,y,z) Relative(x,y) Relative(x,z),Parent(z,y) Relative(x,y) Relative(x,z),Parent(y,z) Relative(x,y) Relative(x,z),Spouse(z,y) Invited(y) Relative('myself',y),Local(y)
EDBs and IDBs • Datalog rules operates over: • Extensional Database (EDB) predicates • These are the provided/stored database relations from the relational schema • Intentional Database (IDB) predicates • These are the relations derived from the stored relations through the rules
Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) Local(x) Person(x,y,'IL') Relative(x,x) Person(x,y,z) Relative(x,y) Relative(x,z),Parent(z,y) Relative(x,y) Relative(x,z),Parent(y,z) Relative(x,y) Relative(x,z),Spouse(z,y) Invited(y) Relative('myself',y),Local(y) EDB IDB
Datalog Program • An atomic formulahas the form R(t1,...,tk) where: • R is a k-ary relation symbol • Eachtiis either a constant or a variable • A Datalog rule has the form head body where head is an atomic formula and body is a sequence of atomic formulas • A Datalog program is a finite set of Datalog rules
Logical Interpretation of a Rule • Consider a Datalog rule of the form R(x) ψ1(x,y),...,ψm(x,y) • Here, x and yare disjoint sequences of variables, and each ψi(x,y) is an atomic formula with variables among x and y • Example: TwoPath(x1,x2) Edge(x1,y),Edge(y,x2) • The rule stands for the logical formula R(x) ∃y [ψ1(x,y) ⋀⋅⋅⋅⋀ ψm(x,y)] • Example: if there exists y where Edge(x1,y) and Edge(y,x2) hold, then TwoPath(x1,x2) should hold
Syntactic Constraints • We require the following from the rule R(x) ψ1(x,y),...,ψm(x,y) • Safety: every variable in x should occur in the body at least once • The predicate R must be an IDB predicate • (The body can include both EDBs and IDBs) • Example of forbidden rules: • R(x,z) S(x,y), R(y,x) • Edge(x,y)Edge(x,z),Edge(x,y) • Assuming Edge is EDB
Semantics of Datalog Programs • Let S be a schema, I an instance over S, and P be a Datalog program over S • That is, all EDB predicates belong to S • The result of evaluating Pover I is an instance J over the IDB schema of P • We give three definitions:
Chase (Operative) Definition A chase procedure is a program of the following form: • Chase(P,I) • J := ⦰ • while(true) { • if(I∪J satisfies all the rules of P) • returnJ • Find a rule head(x) body(x,y) and tuples a, b such that I∪J containsbody(a,b) but not head(a) • J := J∪{head(a)} • }
Nondeterminism • Note: the chase is underspecified • (i.e., not fully defined) • There can be many ways of choosing the next violation to handle • And each choice can lead to new violations, and so on • We can view the choice of a new violation as nondeterministic
Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) Relative(x,x) Person(x,y,z) Relative(x,y) Relative(x,z),Parent(z,y) Invited(y) Relative('1',y)