Temple University – CIS Dept. CIS331– Principles of Database Systems

Temple University – CIS Dept.CIS331– Principles of Database Systems V. Megalooikonomou Database Design and Normalization (based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU)

Overview • Relational model • formal query languages • commercial query languages (SQL) • Integrity constraints • domain I.C., foreign keys • functional dependencies • Functional Dependencies • DB design and normalization

Overview - detailed • DB design and normalization • pitfalls of bad design • decomposition • normal forms

Goal • Design ‘good’ tables • sub-goal#1: define what ‘good’ means • sub-goal#2: fix ‘bad’ tables • in short: “we want tables where the attributes depend on the primary key, on the whole key, and nothing but the key” • Let’s see why, and how:

Pitfalls takes1 (ssn, c-id, grade, name, address) Ssn c-id Grade Name Address A 123 smith Main cs331

Pitfalls ‘Bad’ - why? because: ssn->address, name Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 123 cs211 A smith Main

Pitfalls • Redundancy • space • (inconsistencies) • insertion/deletion anomalies:

Ssn c-id Grade Name Address 123 cs331 A smith Main … … … … … 234 null null jones Forbes Pitfalls • insertion anomaly: • “jones” registers, but takes no class - no place to store his address!

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 123 cs211 A smith Main Pitfalls • deletion anomaly: • delete the last record of ‘smith’ (we lose his address!)

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 123 cs211 A smith Main Solution: decomposition • split offending table in two (or more), e.g.: ? ?

Overview - detailed • DB design and normalization • pitfalls of bad design • decomposition • lossless join • dependency preserving • normal forms

Decompositions • there are ‘bad’ decompositions • we want: • lossless and • dependency preserving

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 234 cs211 A jones Forbes Decompositions - lossy: R1(ssn, grade, name, address) R2(c-id,grade) c-id Grade cs331 A cs351 B cs211 A ssn->name, address ssn, c-id -> grade

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 234 cs211 A jones Forbes Decompositions - lossy: can not recover original table with a join! c-id Grade cs331 A cs351 B cs211 A ssn->name, address ssn, c-id -> grade

Decompositions – lossy: Another example • Decomposition of R = (A, B) into R1 = (A), R2 = (B) A B A B    1 2 1   1 2 A(r) B(r) r A B A (r) B (r)     1 2 1 2

Decompositions example of non-dependency preserving S# -> address S# -> status S# -> address, status address -> status

Decompositions is it lossless? S# -> address, status address -> status S# -> address S# -> status

Decompositions - lossless Definition: Consider schema R, with FD ‘F’. R1, R2 is a lossless join decomposition of R if we always have: An easier criterion?

Decomposition - lossless Theorem: lossless join decomposition if the joining attribute is a superkey in at least one of the new tables Formally:

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 234 cs211 A jones Forbes Decomposition - lossless example: Ssn c-id Grade R2 123 cs331 A R1 123 cs351 B 234 cs211 A ssn->name, address ssn, c-id -> grade ssn->name, address ssn, c-id -> grade

Overview - detailed • DB design and normalization • pitfalls of bad design • decomposition • lossless join decomp. • dependency preserving • normal forms

Decomposition - depend. pres. informally: we don’t want the original FDs to span two tables - counter-example: S# -> address S# -> status S# -> address, status address -> status

Decomposition - depend. pres. dependency preserving decomposition: S# -> address address -> status S# -> address, status address -> status (but: S#->status ?)

Decomposition - depend. pres. informally: we don’t want the original FDs to span two tables more specifically: … the FDs of the canonical cover Let Fibe the set of dependencies F+ that include only attributes in Ri. • Preferably the decomposition should be dependency preserving, that is, (F1 F2  …  Fn)+ = F+ • Otherwise, checking updates for violation of functional dependencies may require computing joins  expensive

Decomposition - depend. pres. why is dependency preservation good? S# -> address S# -> status S# -> address address -> status (address->status: ‘lost’)

Decomposition - depend. pres. A: eg., record that ‘Philly’ has status ‘A’ S# -> address S# -> address address -> status S# -> status (address->status: ‘lost’)

Decomposition - depend. pres. • To check if a dependency  is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done w.r.t. F) • result = while (changes to result) dofor eachRiin the decompositiont = (result  Ri)+  Riresult = result  t • If result contains all attributes in , then functional dependency    is preserved • We apply the test on all dependencies in F to check if a decomposition is dependency preserving • The test takes polynomial time • Computing F+and(F1 F2  …  Fn)+ needs exponential time

Decomposition - conclusions • decompositions should always be lossless • joining attribute -> superkey • whenever possible, we want them to be dependency preserving (occasionally, impossible - see ‘STJ’ example later…)

Normalization using FD • When decomposing a relation schema R with a set of functional dependencies F into R1, R2,…, Rn we want: • Lossless-join decomposition: otherwise … information loss • No redundancy: relations Ripreferably should be in either Boyce-Codd Normal Form or Third Normal Form • Dependency preservation: Let Fibe the set of dependencies in F+ that include only attributes in Ri. • Preferably the decomposition should be dependency preserving, i.e., (F1 F2  …  Fn)+ = F+ • Otherwise, checking updates for violation of functional dependencies may require computing joins  expensive

Normalization using FD - Example • R = (A, B, C)F = {A B, B C) • R1 = (A, B), R2 = (B, C) • Lossless-join decomposition: R1  R2 = {B} and B BC • Dependency preserving • R1 = (A, B), R2 = (A, C) • Lossless-join decomposition: R1  R2 = {A} and A  AB • Not dependency preserving (cannot check B C without computing R1 R2)

Overview - detailed • DB design and normalization • pitfalls of bad design • decomposition ( how to fix the problem) • normal forms ( how to detect the problem) • BCNF, • 3NF, • (1NF, 2NF)

Normal forms - BCNF We saw how to fix ‘bad’ schemas - but what is a ‘good’ schema? Answer: ‘good’, if it obeys a ‘normal form’, i.e., a set of rules Typically: Boyce-Codd Normal Form (BCNF)

Normal forms - BCNF Defn.: Rel. R is in BCNF w.r.t. F, if • informally: everything depends on the full key, and nothing but the key • semi-formally: every determinant (of the cover) is a candidate key

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 234 cs211 A jones Forbes Normal forms - BCNF Example and counter-example: ssn->name, address ssn->name, address ssn, c-id -> grade

Normal forms - BCNF Formally: for every FD a->b in F+ • a->b is trivial (a is a superset of b) or • a is a superkey • (or both)

Normal forms - BCNF Theorem: given a schema R and a set of FD ‘F’, we can always decompose it to schemas R1, … Rn, so that • R1, … Rn are in BCNF and • the decomposition is lossless (…but, some decomp. might lose dependencies)

BCNF Decomposition How? ….essentially, break off FDs of the cover eg. TAKES1(ssn, c-id, grade, name, address) ssn -> name, address ssn, c-id -> grade

grade name ssn address c-id Normal forms - BCNF eg. TAKES1(ssn, c-id, grade, name, address) ssn -> name, address ssn, c-id -> grade

Ssn c-id Grade Name Address 123 cs331 A smith Main 123 cs351 B smith Main 234 cs211 A jones Forbes Normal forms - BCNF Ssn c-id Grade 123 cs331 A 123 cs351 B 234 cs211 A ssn->name, address ssn, c-id -> grade ssn->name, address ssn, c-id -> grade

grade name ssn address c-id Normal forms - BCNF pictorially: we want a ‘star’ shape :notin BCNF

G F H B A D C E Normal forms - BCNF pictorially: we want a ‘star’ shape or

name name ssn st# ssn st# address address Normal forms - BCNF or a star-like: (e.g., 2 cand. keys): STUDENT(ssn, st#, name, address) =

G F H B D A D C E Normal forms - BCNF but not: or

BCNF Decomposition result := {R};done := false;compute F+;while (not done) do if (there is a schema Riin result that is not in BCNF)then beginlet   be a nontrivial functional dependency that holds on Risuch that  Riis not in F+, and   = ;result := (result – Ri)  (Ri – )  (,  );end else done := true; Note: each Riis in BCNF, and decomposition is lossless-join

T S J Normal forms - 3NF consider the ‘classic’ case: STJ( Student, Teacher, subJect) T-> J S,J -> T is it BCNF?

T S J Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T How to decompose it to BCNF?

Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T 1) R1(T,J) R2(S,J) (BCNF? - lossless? - dep. pres.? ) 2) R1(T,J) R2(S,T) (BCNF? - lossless? - dep. pres.? )

Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T 1) R1(T,J) R2(S,J) (BCNF? Y+Y - lossless? N - dep. pres.? N ) 2) R1(T,J) R2(S,T) (BCNF? Y+Y - lossless? Y - dep. pres.? N )

Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T in this case: impossible to have both • BCNF and • dependency preservation Welcome 3NF (…a weaker normal form)!

S T J Normal forms - 3NF STJ( Student, Teacher, subJect) T-> J S,J -> T informally, 3NF ‘forgives’ the red arrow in the can. cover

Temple University – CIS Dept. CIS331– Principles of Database Systems