Understanding Queries in Logic in Database Management Systems

Faculty of Computer Science Technion – Israel Institute of Technology Database Management Systems Course 236363 Lecture 5: Queries in Logic

Logic, Computers, Databases: Syntax • Logic has had an immense impact on CS • Computing has strongly driven one particular branch of logic: finite model theory • That is, (FO/SO) logic restricted to finite models • Very strong connections to complexity theory • The basis of branches in Artificial Intelligence • It is a natural tool to capture and attack fundamental problems in database management • Relations as first-class citizens • Inference for assuring data integrity • Inference for question answering (queries) • It has been used for developing and analyzing the relational model from the early days (Codd, 1972)

A bit about First Order Logic First Order Languages Alphabet: variables, constants, function symbols, predicate symbols, connectives, quantifiers, punctuation symbols term: variable, constant, or inductively f(t1,…,tn) where the ti are terms and f is a function symbol Well-formed-formula (wff): an atom p(t1,…,tn) where p is a predicate symbol or inductively, where F, g are wff’s: Given an alphabet, a first order language is the set of all wff’s.

A bit about First Order Logic: Semantics Scope:The scope of in F orinFis F. A variable is bound if it is the scope of a quantifier, and free otherwise. Interpretation: A non-empty set D, the domain Each constant is assigned an element of D Each n-ary function symbol f is assigned a function Dn D Each n-ary predicate symbol p is assigned a function Dn{True, False} (can encode as a relation p) Variable Assignment: A variable assignment V assigns to each variable X a value from D.

A bit about First Order Logic: Truth Value Truth value for formula F under interpretation I and variable assignment V: Atom p(t1,…,tn): p’(t’1,…,t’n) where p’ is the interpretation I of p and t’jis an element in D according to Vand the interpretation of the function symbols in I. For F G, FG, ¬F according to truth table. has the truth value True if there exists an element d in D such that if the value assigned to X in V is replaced by d, the truth value of F is True, otherwise it is False. has the truth value True if for all elements d in D, if the value assigned to X in V is replaced by d, the truth value of F is True, otherwise it is False. If a formula has no free variables (closed formula), we can simply talk of the truth value of the formula.

Relational Calculus (RC) • RC is, essentially, first-order logic (FO) over the schema relations • A query has the form “find all tuples (x1,...,xk) that satisfy an FO condition” • RC is a declarative query language • Meaning: a query is not defined by a sequence of operations, but rather by a condition that the result should satisfy

RC Query • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) { (x,u) | Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]] } z Which relatives does this query find? w u y x

RC Symbols • Constant values: a, b, c, ... • Values that may appear in table cells • Variables: x, y, z, ... • Range over the values that may appear in table cells • Relation symbols: R, S, T, Person, Parent, ... • Each with a specified arity • Will be fixed by the relational schema at hand • No attribute names, only attribute positions!

Atomic RC Formulas • Atomic formulas: • R(t1,...,tk) • Ris a k-aryrelation • Each tiis a variable or a constant • Semantically it states that (t1,...,tk) is a tuple in R • Example: Person(x, 'female', 'Canada') • x op u • xis a variable, u is a variable/constant, opis one of >, <, =, ≠ • Example: x=y, z>5

RC Formulas • Formula: • Atomic formula • If φand ψare formulas then these are formulas: φ ⋀ψφ⋁ψφ→ψ¬φ∃xφ∀xφ Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]

Free Variables • Informally, variables not bound to quantifiers • Formally: • A free variable of an atomic formula is a variable that occurs in the atomic formula • A free variable of φ ⋀ψ,φ ⋁ψ,φ⟶ψ is a free variable of either φor ψ • A free variable of ¬φis a free variable of φ • A free variable of ∃xφand∀xφis a free variabley of φsuch that y≠x • The notation φ(x1,...,xk) implies that x1,...,xkare the free variables of φ(in some order)

What Are the Free Variables? Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w[Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]] φ(x,u), CandianAunt(u,x), ... Person(u, 'female', v) ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w) ]

Relation Calculus Query • An RC query is an expression of the form { (x1,...,xk) | φ(x1,...,xk) } where φ(x1,...,xk) is an RC formula • An RC query is over a relational schema S if all the relation symbols belong to S (with matching arities)

RC Query • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) { (x,u) | Person(u, 'female', 'Canada') ⋀ ∃y,z[Parent(y,x) ⋀ Parent(z,y) ⋀ ∃w [Parent(z,w) ⋀ y≠w ⋀ (u=w ⋁ Spouse(u,w)]] } z w u y x

Studies CourseType Who took all core courses? Studiesπcoursetype=‘core’CourseType

RS in Primitive RA vs. RC R(X,Y) ÷ S(Y) • πXR \ πX((πXR ×S) \ R ) In RA: In RC: { (X) | ∃Z [R(X,Z)] ⋀ ∀Y[S(Y) ⟶ R(X,Y)]}

DRC vs. TRC • There are two common variants of RC: • DRC: Domain Relational Calculus (what we're doing) • TRC: Tuple Relational Calculus • DRC applies vanilla FO: variables span over attribute values, relations have arity but no attribute names • TRC is more database friendly: variables span over tuples, where a tuple has named attributes • There are easy conversions between the two formalisms; nothing deep

Our Example in TRC • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) {t| ∃a [a∈Person ⋀ a[gender]= 'female' ⋀ a[country]='Canada'] ⋀ ∃p,q[p∈Parent⋀ p[chlid]= t[nephew] ⋀ q∈Parent ⋀ q[child]=p[parent]⋀ ∃w[w∈Parent⋀ w[parent]=q[parent]⋀ w[child]≠q[child] ⋀ ((t[aunt]=w[child] ⋀ t[aunt]=a[id]) ⋁ ∃s [s∈Spouse ⋀ s[person1]=w[child] ⋀ s[person2]=t[aunt] ⋀ t[aunt]=a[id] ] )]]]} q w s p

What is the Meaning of the Following? { (x) | ¬Person(x, 'female', 'Canada') } { (x,y) |∃z [Spouse(x,z) ⋀ y=z]} { (x,y) |∃z [Spouse(x,z) ⋀ y≠z]} • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2)

Domain • Let S be a schema, let I be an instance over S, and let Q be an RC query over S • The active domain (of I and Q) is the set of all the values that occur in either I or Q • The query Q is evaluated over I with respect to a domain D that contains the active domain • This is the set over which quantifiers range • We denote by QD(I) the result of evaluating Q over I relative to the domain D

Domain Independence • Let S be a schema, and let Qbe an RC query over S • We say that Q is domain independent if for every instance I over S and every two domains D and E that contain the active domain, we have: QD(I)=QE(I)

Which One is Domain Independent? { (x) | ¬Person(x, 'female', 'Canada') } { (x,y) |∃z [Spouse(x,z) ⋀ y=z]} { (x,y) |∃z [Spouse(x,z) ⋀ y≠z]} { (x) |∃z,w Person(x,z,w) ⋀ ∃y [¬Likes(x,y)]} { (x) |∃z,wPerson(x,z,w) ⋀ ∀y [¬Likes(x,y)]} { (x) |∃z,w Person(x,z,w) ⋀ ∀y [¬Likes(x,y)] ⋀ ∃y [¬Likes(x,y)]} • Person(id, gender, country) • Likes(person1, person2) • Spouse(person1, person2)

Bad News... • We would like be able to tell whether a given RA query is domain independent, • ... and then reject “bad queries” • Alas, this problem is undecidable! • That is, there is no algorithm that takes as input an RC query and returns true iff the query is domain independent

Good News Domain-independent RC has an effective syntax; that is: • A syntactic restriction of RC in which every query is domain independent • Restricted queries are said to be safe • Safety can be tested automatically (and efficiently) • Most importantly,for every domain independent RC query there exists an equivalent safe RC query!

Safety • We do not formally define the safe syntax in this course • Details on the safe syntax can be found in the textbook Foundations of Databases by Abiteboul, Hull and Vianu • Example: • In ∃xφ, the variable x should be guardedby φ • Every variable is guarded by R(x1,...,xk) • In φ ⋀ (x=y), the variable x is guarded if and only if either x or y is guarded by φ • ... and so on

Example Revisited R(X,Y) ÷ S(Y) • πXR \ πX((πXR ×S) \ R ) In RA: In RC: { (X) | ∃Z [R(X,Z)] ⋀ ∀Y[S(Y) ⟶ R(X,Y)]}

Another Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) { (x) |∃z,wPerson(x,z,w) ⋀ ∀y [¬Spouse(x,y)]} πidPerson \ person1/idπperson1Spouse

Equivalence Between RA and D.I. RC Theorem: RA and domain-independent RC have the same expressive power. More formally, on every schema S: • For every RA expression E there is a domain-independent RC query Q such that Q≣E • For every domain-independent RC query Q there is an RA expression Esuch that Q≣E

About the Proof • The proof has two directions: • Translate a given RA query into an equivalent RC query • Translate a given RC query into an equivalent RA query • Part 1 is fairly easy: induction on the size of the RA expression • Part 2 is more involved

RA  DRC Think about a table with k attribute names has having numbered columns, $1…$k

מה בעצם הבעיה עם שאילתות שאינןdomain independent • הרי ניתן פשוט לבצע את השאילתה על התחום האקטיבי. זה כמובן יוביל לתוצאה מוגדרת היטב וסופית. • גם בתסריט זה הבעייתיות של שאילתות כאלו, בין השאר, נובעת מ: • תשובות תלויות שאילתה. נניח שיש רלציהR עם טורA וNיה אחת R(7) . נניח שאילתה Qהמוגדרתפרמטרית כ- ¬ R(a). כאשרa =7 התשובה 7, כאשר a=5 התשובה 5 וכן הלאה. • תשובות משתנות עם תוספת "לא רלוונטית" למסד הנתונים. נניח רלציהR עם 2 טוריםA ראשון ו- Bשני, וNיה אחת (7,7)R. נניח שאילתה Q המוגדרת כ- ∃ Y ¬ R(X,Y). התשובה ריקה. אבל אם נוסיףNיה (7,6) תופיע התשובה 6 =X למרות שבטור Aהנוגע ל-X לא חל כל שינוי והשינוי (תוספת ערך 6) חל בטור "לא רלוונטי" לשאילתה. • בשתי התופעות שאילתות שאינן domain independent יוצרות תופעות לא אינטואיטיביות.

Datalog Path(x,y)  Edge(x,y) Path(x,z)  Edge(x,y),Path(y,z) InCycle(x)  Path(x,x) • Database query language • “Clean” restriction of Prolog w/ DB access • Expressive & declarative: • Set-of-rules semantics • Independence of execution order • Invariance under logical equivalence • Mostly academic implementations; some commercial instantiations

Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) Local(x)  Person(x,y,'IL') Relative(x,x)  Person(x,y,z) Relative(x,y)  Relative(x,z),Parent(z,y) Relative(x,y)  Relative(x,z),Parent(y,z) Relative(x,y)  Relative(x,z),Spouse(z,y) Invited(y)  Relative('myself',y),Local(y)

EDBs and IDBs • Datalog rules operates over: • Extensional Database (EDB) predicates • These are the provided/stored database relations from the relational schema • Intentional Database (IDB) predicates • These are the relations derived from the stored relations through the rules

Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) Local(x)  Person(x,y,'IL') Relative(x,x)  Person(x,y,z) Relative(x,y)  Relative(x,z),Parent(z,y) Relative(x,y)  Relative(x,z),Parent(y,z) Relative(x,y)  Relative(x,z),Spouse(z,y) Invited(y)  Relative('myself',y),Local(y) EDB IDB

Datalog Program • An atomic formulahas the form R(t1,...,tk) where: • R is a k-ary relation symbol • Eachtiis either a constant or a variable • A Datalog rule has the form head body where head is an atomic formula and body is a sequence of atomic formulas • A Datalog program is a finite set of Datalog rules

Logical Interpretation of a Rule • Consider a Datalog rule of the form R(x) ψ1(x,y),...,ψm(x,y) • Here, x and yare disjoint sequences of variables, and each ψi(x,y) is an atomic formula with variables among x and y • Example: TwoPath(x1,x2)  Edge(x1,y),Edge(y,x2) • The rule stands for the logical formula R(x) ∃y [ψ1(x,y) ⋀⋅⋅⋅⋀ ψm(x,y)] • Example: if there exists y where Edge(x1,y) and Edge(y,x2) hold, then TwoPath(x1,x2) should hold

Syntactic Constraints • We require the following from the rule R(x)  ψ1(x,y),...,ψm(x,y) • Safety: every variable in x should occur in the body at least once • The predicate R must be an IDB predicate • (The body can include both EDBs and IDBs) • Example of forbidden rules: • R(x,z)  S(x,y), R(y,x) • Edge(x,y)Edge(x,z),Edge(x,y) • Assuming Edge is EDB

Semantics of Datalog Programs • Let S be a schema, I an instance over S, and P be a Datalog program over S • That is, all EDB predicates belong to S • The result of evaluating Pover I is an instance J over the IDB schema of P • We give three definitions:

Chase (Operative) Definition A chase procedure is a program of the following form: • Chase(P,I) • J := ⦰ • while(true) { • if(I∪J satisfies all the rules of P) • returnJ • Find a rule head(x)  body(x,y) and tuples a, b such that I∪J containsbody(a,b) but not head(a) • J := J∪{head(a)} • }

Nondeterminism • Note: the chase is underspecified • (i.e., not fully defined) • There can be many ways of choosing the next violation to handle • And each choice can lead to new violations, and so on • We can view the choice of a new violation as nondeterministic

Example • Person(id, gender, country) • Parent(parent, child) • Spouse(person1, person2) Relative(x,x)  Person(x,y,z) Relative(x,y)  Relative(x,z),Parent(z,y) Invited(y)  Relative('1',y)

Understanding Queries in Logic in Database Management Systems

Understanding Queries in Logic in Database Management Systems

Presentation Transcript

Database Management Systems Course 236363

Database Systems 236363

Database Systems 236363

Database Systems 236363

Database Management Systems

Database Management Systems

Database Systems 236363

Database Management Systems

Database Management Systems

Database Management Systems

Database Management Systems

DATABASE Management systems

Database Systems 236363

Database Systems 236363

Database Management Systems

Database Management Systems

Database Management Systems

Database Systems 236363

Database Management Systems

Database Management Systems

Database Management Systems

Database Management Systems