Chapter 7: Relational Database Design

Chapter 7: Relational Database Design • Where are we? • We can • build an ER model for information in an enterprise • convert it to relation schemas (tables) • Are we done? (no) • Relational database design requires that we find a “good” collection of relation schemas • Design goals: • avoid redundant data • ensure that relationships among entities are represented • facilitate checking updates for violation of integrity constraints

Example • Lending_schema = (branch_name, branch_city, assets, customer_name, loan_number, amount) • Redundancy: data for branch_name, branch_city, assets are repeated for each loan that a branch makes • wasted space • complicated updating, with the possibility of inconsistent assets values • Nulls are needed to store information about a branch if no loans exist • Normalization: decomposing into more relations, e.g. (branch_name, branch_city, assets) and (branch_name, customer_name, loan_number, amount)

7.1.1 Design Alternative: Larger Schemas (not usually a good idea) • Suppose we combine borrower and loan to get bor_loan = (customer_id, loan_number, amount) • Result is possible repetition of information (L-100 in the example) • Normally database design goes the other way, from one table to two

More Important: Decomposing Schemas • Start with bor_loan = (customer_id, loan_number, amount) • If more than one customer has the same loan, the amount is repeated • Option: split (decompose) it into borrower = (customer_id, loan_number) and loan = (loan_number, amount) • How would we know to do this? • The value of amount depends on loan_number • called a functional dependency: loan_number → amount • This suggests decomposing bor_loan into borrower and loan • Nice property: borrower ⋈ loan = bor_loan

Not all Decompositions are Good • employee = (employee_id, employee_name, telephone_number, start_date) decomposes into • employee1 = (employee_id, employee_name) • employee2 = (employee_name, telephone_number, start_date) • We lose information: employee1 ⋈ employee2 ≠ employee whenever two employees share a name • We can’t reconstruct the original employee relation • This is a lossy decomposition
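The loss of information can be seen on a tiny instance. The sketch below uses made-up sample rows (two employees who happen to share the name "Kim"; the values are assumptions, not data from the slides), projects them onto employee1 and employee2, and joins the projections back together.

```python
# Hypothetical sample data: two distinct employees with the same name.
employee = {
    (1, "Kim", "555-0101", "2020-01-15"),
    (2, "Kim", "555-0199", "2021-06-01"),
}

# Projection onto employee1 = (employee_id, employee_name)
employee1 = {(eid, name) for (eid, name, tel, start) in employee}
# Projection onto employee2 = (employee_name, telephone_number, start_date)
employee2 = {(name, tel, start) for (eid, name, tel, start) in employee}

# Natural join of the two projections on the shared attribute employee_name
rejoined = {
    (eid, name1, tel, start)
    for (eid, name1) in employee1
    for (name2, tel, start) in employee2
    if name1 == name2
}

# Tuples that appear after the join but were never in the original relation
spurious = rejoined - employee
print(len(employee), len(rejoined), len(spurious))
```

The join returns every original tuple plus spurious ones that pair each employee_id with the other employee's phone number and start date, so the original relation can no longer be told apart from the extra tuples.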

Normal Forms • Requirements on relations that reflect good design • Most basic: first normal form • A domain is atomic if its elements are considered indivisible units • non-atomic: a set of names, composite attributes • Relational schema R is in first normal form if the domains of all attributes are atomic • Non-atomic values encourage redundant data and complicate storage • e.g. multivalued attribute children might be stored with each parent • example from text: employeeID concatenates dept# and unique ID • These are two different pieces of information that might need to be split apart • Assume all relations are in first normal form • e.g. person(name, birthday) should be defined as • person(name, b_month, b_year, b_day)

Goal: A Theory • Decide what good form means • If R is not in good form, decompose it into a set of relations {R1, R2, ..., Rn} • such that • each Ri is in good form • the decomposition is a lossless-join decomposition • preferably, the decomposition is dependency preserving • Theory is based on functional dependencies • generalization of the notion of a key • allow us to decompose a relation meaningfully

Functional Dependencies • Constraints on the set of legal relations • When the value for a set of attributes uniquely determines the value for another set of attributes • e.g. city and state imply the value for sales_tax in (city_name, state_name, sales_tax) • expressed as city_name, state_name → sales_tax • Does every relation have a functional dependency? • primary key → any set of attributes • trivial dependencies (satisfied by all instances of a relation): • customer_name, loan_number → customer_name • customer_name → customer_name • In general, α → β is trivial if β ⊆ α

Functional Dependencies (cont.) • Let R be a relation schema and α ⊆ R, β ⊆ R • The functional dependency α → β holds on R if and only if for any legal relation r(R), whenever any two tuples t1 and t2 of r agree on the attributes α, they also agree on the attributes β: • t1[α] = t2[α] ⇒ t1[β] = t2[β] • Example: Consider r(A, B) with the following instance of r: (A, B) = (1, 4), (1, 5), (3, 7) • On this instance, A → B does not hold, but B → A does • Does this make B → A a functional dependency? • branch_name → branch_city holds on slide 2. Will it always?
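The definition translates directly into a check on a single instance. This is a minimal sketch: the function name `satisfies` is made up for illustration, and the instance is assumed to consist of the tuples (1, 4), (1, 5), (3, 7), which is consistent with the claim that A → B fails while B → A holds.

```python
def satisfies(rows, lhs, rhs):
    """Return True if the instance `rows` (a list of dicts) satisfies lhs -> rhs.

    Two tuples that agree on every attribute in `lhs` must also agree on
    every attribute in `rhs`.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, val) != val:
            return False
    return True

# Assumed instance of r(A, B)
r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]

print(satisfies(r, ["A"], ["B"]))
print(satisfies(r, ["B"], ["A"]))
```

A check like this can only show that one instance satisfies a dependency. Whether the dependency holds on the schema is a design decision about all legal instances.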

Functional Dependencies (cont.) • Revisiting candidate keys and superkeys: • K is a superkey for relation schema R iff K → R • K is a candidate key for R iff K → R, and for no α ⊂ K, α → R • Some constraints cannot be expressed using superkeys. Loan_info_schema = (branch_name, loan_number, customer_name, amount) • might have functional dependency customer_name → branch_name • but customer_name is not a superkey • Functional dependencies can express constraints such as this • Updates to the db can be tested to see if they satisfy the given set of functional dependencies

Functional Dependencies (cont.) • A specific instance of a schema may satisfy a functional dependency, even if the functional dependency does not hold on all legal instances. • e.g., some, but not all, instances of loan_schema may satisfy loan_number → customer_name. • Database design goal: identify and enforce FDs that should hold

Boyce-Codd Normal Form • A relation schema R is in BCNF with respect to a set F of functional dependencies if for all functional dependencies in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds: • α → β is trivial (i.e., β ⊆ α) • α is a superkey for R • Example: R = (A, B, C), key = {A} • F = {A → B, B → C} • R is not in BCNF (why?) • Decompose R into R1 = (A, B), R2 = (B, C) • R1 and R2 are in BCNF (why?) • This decomposition is dependency-preserving

Decomposing a Schema into BCNF • Suppose we have a schema R and a non-trivial dependency α → β causes a violation of BCNF. Decompose R into: • (α ∪ β) • (R – (β – α)) • Back to the example • R = (A, B, C), key = {A} • F = {A → B, B → C} • B → C causes the violation, so α = B, β = C and R = (A, B, C) is replaced by • (α ∪ β) = (B, C) • (R – (β – α)) = (A, B)

BCNF and Dependency Preservation • Consistency constraints need to be checked on every database update • E.g. • attribute types • uniqueness of the primary key • foreign key/primary key consistency • others expressed in SQL: assertions, triggers, check constraints • functional dependencies (dependency preserving) • Functional dependencies that span more than one relation are costly to check • Example: R = (A, B, C), key = {A}, F = {A → B, B → C}: • R1(A, B), R2(B, C) – can check dependencies easily • R1(A, B), R2(A, C) – need to join R1 and R2 to check B → C

BCNF Example • R = (branch_name, branch_city, assets, customer_name, loan_number, amount) • F = {branch_name → (assets, branch_city), loan_number → (amount, branch_name)} • Key = {loan_number, customer_name} • R is not in BCNF; use branch_name → (assets, branch_city) to obtain • R1 = (branch_name, branch_city, assets) • R2 = (branch_name, customer_name, loan_number, amount) • R2 is not in BCNF; use loan_number → (amount, branch_name) to obtain • R3 = (branch_name, loan_number, amount) • R4 = (customer_name, loan_number) • Final decomposition: R1, R3, R4

Third Normal Form: Motivation • In some situations • BCNF is not dependency preserving, and • efficient checking for FD violation on updates is important • Example • R = (J, K, L), F = {JK → L, L → K}; candidate keys are JK and JL • R is not in BCNF (L → K holds, but L is not a superkey) • possible decomposition: R1 = (L, K), R2 = (L, J) • does not preserve JK → L • Solution: a weaker normal form, third normal form (3NF) • allows some redundancy (with resultant problems) • but functional dependencies can be checked on individual relations without computing a join • There is always a 3NF lossless-join, dependency-preserving decomposition

Third Normal Form • Relation schema R is in third normal form (3NF) if for all α → β in F+ at least one of the following holds: • α → β is trivial (i.e., β ⊆ α) • α is a superkey for R • each attribute A in β – α is contained in a candidate key for R (each attribute may be in a different candidate key) • If a relation is in BCNF it is in 3NF • one of the first two conditions above always holds in BCNF • Third condition is a minimal relaxation of BCNF to ensure dependency preservation (will see why later)

3NF Example • R = (J, K, L), F = {JK → L, L → K} • Candidate keys: JK, JL • R is in 3NF • JK → L: JK is a superkey • L → K: K is contained in a candidate key • Possible problems in this schema: • repetition of information (e.g., the relationship l1, k1) • need to use null values (e.g., to represent the relationship l2, k2 where there is no corresponding value for J) • Sample instance:
  J     L   K
  j1    l1  k1
  j2    l1  k1
  j3    l1  k1
  null  l2  k2

Design Goals • Goal for a relational database design is: • BCNF • lossless join • dependency preservation. • If we cannot achieve this, we accept one of • lack of dependency preservation • redundancy due to use of 3NF • Next step: functional-dependency theory • a formal theory that tells us which functional dependencies are implied logically by a given set of functional dependencies • algorithms to generate lossless decompositions into BCNF and 3NF • algorithms to test if a decomposition is dependency-preserving

Closure of a Set of FDs • Given a set F of FDs, other FDs are logically implied by F • example: if A → B and B → C, we can infer that A → C • The set of all functional dependencies logically implied by F is the closure of F, denoted F+. • We can find all of F+ by applying Armstrong’s Axioms: • if β ⊆ α, then α → β (reflexivity) • if α → β, then γα → γβ (augmentation) • if α → β and β → γ, then α → γ (transitivity) • These rules are • sound (generate only functional dependencies that actually hold) and • complete (generate all functional dependencies that hold)

Example • R = (A, B, C, G, H, I), F = {A → B, A → C, CG → H, CG → I, B → H} • Some members of F+ • A → H • by transitivity from A → B and B → H • AG → I • by augmenting A → C with G, to get AG → CG, and then transitivity with CG → I • CG → HI • by augmenting CG → I with CG to infer CG → CGI, • and augmenting CG → H with I to infer CGI → HI, • and then transitivity

Procedure for Computing F+ • To compute the closure of a set of functional dependencies F:
  F+ = F
  repeat
    for each functional dependency f in F+
      apply reflexivity and augmentation rules on f
      add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
      if f1 and f2 can be combined using transitivity
        then add the resulting functional dependency to F+
  until F+ does not change any further
• Alternative procedure for this task later

Example Again • R = (A, B, C, G, H, I), F = {A → B, A → C, CG → H, CG → I, B → H} • reflexivity and augmentation give • A → A, ..., AG → BG, AG → CG, CGI → HI, CG → CGI, ... • transitivity gives • A → H from A → B and B → H • CG → HI from CG → CGI and CGI → HI • ... (takes forever)

Functional Dependencies • Simplify manual computation of F+ by additional rules • If α → β and α → γ hold, then α → βγ holds (union) • If α → βγ holds, then α → β holds and α → γ holds (decomposition) • If α → β and γβ → δ hold, then αγ → δ holds (pseudotransitivity) • These rules can be inferred from Armstrong’s axioms • The closure of a set of attributes α under F (denoted by α+) is the set of attributes that are functionally determined by α under F • Algorithm to compute α+, the closure of α under F:
  result := α;
  while (changes to result) do
    for each β → γ in F do
      begin
        if β ⊆ result then result := result ∪ γ
      end
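The attribute-closure algorithm is short enough to run directly. The sketch below assumes an encoding chosen only for illustration: attributes are single letters, and each dependency is a pair of strings (left side, right side). The function name `closure` is likewise an assumption.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`.

    `fds` is a list of (lhs, rhs) pairs; each side is a string of
    single-letter attribute names (an encoding assumed for this sketch).
    """
    result = set(attrs)
    changed = True
    while changed:                      # "while (changes to result)"
        changed = False
        for lhs, rhs in fds:            # "for each beta -> gamma in F"
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)      # "result := result U gamma"
                changed = True
    return result

# F = {A -> B, A -> C, CG -> H, CG -> I, B -> H}
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(sorted(closure("AG", F)))
print(sorted(closure("A", F)))
```

Each pass over F either adds at least one attribute or terminates the loop, so the number of passes is bounded by the number of attributes.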

Example of Attribute Set Closure • R = (A, B, C, G, H, I) • F = {A → B, A → C, CG → H, CG → I, B → H} • (AG)+ • 1. result = AG • 2. result = ABCG (A → C and A → B) • 3. result = ABCGH (CG → H and CG ⊆ AGBC) • 4. result = ABCGHI (CG → I and CG ⊆ AGBCH) • Is AG a candidate key? • Is AG a superkey? • Does AG → R? == Is (AG)+ ⊇ R • Is any subset of AG a superkey? • Does A → R? == Is (A)+ ⊇ R • Does G → R? == Is (G)+ ⊇ R

Uses of Attribute Closure • Testing for superkey: to test if α is a superkey, compute α+ and check if α+ contains all attributes of R • Testing functional dependencies: to check if a functional dependency α → β holds (or, in other words, is in F+), just check if β ⊆ α+; that is, compute α+ by using attribute closure, and then check if it contains β • a simple, useful, and cheap test • Computing F+: for each γ ⊆ R, find the closure γ+, and for each S ⊆ γ+, output a functional dependency γ → S
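The superkey test and the dependency test can be sketched in a few lines, along with a brute-force search that answers the earlier question "Is AG a candidate key?". The helper names (`is_superkey`, `implies`, `candidate_keys`) and the single-letter attribute encoding are assumptions made for this illustration; the key search enumerates subsets and is only practical for small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def is_superkey(attrs, R, fds):
    """attrs is a superkey if its closure contains every attribute of R."""
    return set(R).issubset(closure(attrs, fds))

def implies(fds, lhs, rhs):
    """lhs -> rhs is in F+ exactly when rhs is contained in lhs+."""
    return set(rhs).issubset(closure(lhs, fds))

def candidate_keys(R, fds):
    """All minimal superkeys, found by trying subsets in order of size."""
    keys = []
    for size in range(1, len(R) + 1):
        for combo in combinations(sorted(R), size):
            attrs = set(combo)
            if any(k.issubset(attrs) for k in keys):
                continue                # contains a smaller key: not minimal
            if is_superkey(attrs, R, fds):
                keys.append(attrs)
    return keys

R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(is_superkey("AG", R, F), is_superkey("A", R, F), is_superkey("G", R, F))
print(implies(F, "CG", "HI"))
print(candidate_keys(R, F))
```

On this schema AG is a superkey while neither A nor G is, so AG is a candidate key.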

Finding F+ • R = (A, B, C, G, H, I), F = {A → B, A → C, CG → H, CG → I, B → H} • Find the closure of attribute combinations • A+ = ABCH • B+ = BH • C+ = C • G+ = G • H+ = H • I+ = I • (CG)+ = CGHI ... • Subset of F+ • A → ABCH • B → BH • CG → CGHI

Canonical Cover • Sets of functional dependencies may have redundant dependencies that can be inferred from the others • For example: A → C is redundant in {A → B, B → C, A → C} • Parts of a functional dependency may be redundant • E.g.: on RHS: {A → B, B → C, A → CD} can be simplified to {A → B, B → C, A → D} • E.g.: on LHS: {A → B, B → C, AC → D} can be simplified to {A → B, B → C, A → D} • Intuitively, a canonical cover of F is a minimal set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies • Why do we want this? To have a minimal set to check when performing an update.

Extraneous Attributes • Consider a set F of functional dependencies and the functional dependency α → β in F. • Attribute A is extraneous in α if A ∈ α and F logically implies (F – {α → β}) ∪ {(α – A) → β}. • Attribute A is extraneous in β if A ∈ β and the set of functional dependencies (F – {α → β}) ∪ {α → (β – A)} logically implies F. • Note: implication in the opposite direction is trivial in each of the cases above, since a “stronger” functional dependency always implies a weaker one • Example: F = {A → C, AB → C} • B is extraneous in AB → C because {A → C, AB → C} logically implies A → C (i.e. the result of dropping B from AB → C). • Example: F = {A → C, AB → CD} • C is extraneous in AB → CD since AB → C can be inferred even after deleting C

Testing if an Attribute is Extraneous • Consider a set F of functional dependencies and the functional dependency α → β in F. • To test if attribute A ∈ α is extraneous in α • compute (α – A)+ using the dependencies in F • check that (α – A)+ contains β; if it does, A is extraneous in α • To test if attribute A ∈ β is extraneous in β • compute α+ using only the dependencies in F’ = (F – {α → β}) ∪ {α → (β – A)}, • check that α+ contains A; if it does, A is extraneous in β

Canonical Cover • A canonical cover for F is a set of dependencies Fc such that • F logically implies all dependencies in Fc, and • Fc logically implies all dependencies in F, and • no functional dependency in Fc contains an extraneous attribute, and • each left side of a functional dependency in Fc is unique. • To compute a canonical cover for F:
  repeat
    Use the union rule to replace any dependencies in F of the form α1 → β1 and α1 → β2 with α1 → β1β2
    Find a functional dependency α → β with an extraneous attribute either in α or in β
    If an extraneous attribute is found, delete it from α → β
  until F does not change
• (Union rule may apply again after deleting extraneous attributes)

Computing a Canonical Cover • R = (A, B, C), F = {A → BC, B → C, A → B, AB → C} • Combine A → BC and A → B into A → BC • Set is now {A → BC, B → C, AB → C} • A is extraneous in AB → C • check if the result of deleting A from AB → C is implied by the other dependencies • yes: in fact, B → C is already present! • set is now {A → BC, B → C} • C is extraneous in A → BC • check if A → C is logically implied by A → B and the other dependencies • yes: using transitivity on A → B and B → C. • can use attribute closure of A in more complex cases • The canonical cover is: A → B, B → C
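The canonical-cover procedure can be replayed mechanically. This is a sketch under stated assumptions: attributes are single letters, dependencies are (left, right) pairs, and the helper names `merge`, `find_extraneous` and `canonical_cover` are invented for the illustration. It removes one extraneous attribute at a time using the two closure tests, so the order of removals may differ from the hand computation even though it reaches the same cover on this input.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def merge(fds):
    """Union rule: combine dependencies sharing a left side; drop empty right sides."""
    merged = {}
    for lhs, rhs in fds:
        merged.setdefault(frozenset(lhs), set()).update(rhs)
    return [(lhs, frozenset(rhs)) for lhs, rhs in merged.items() if rhs]

def find_extraneous(fds):
    """Return (index, side, attribute) for one extraneous attribute, or None."""
    for i, (lhs, rhs) in enumerate(fds):
        for a in sorted(lhs):
            # A is extraneous on the left if (lhs - A)+ under F contains rhs
            if len(lhs) > 1 and rhs.issubset(closure(lhs - {a}, fds)):
                return i, "lhs", a
        for a in sorted(rhs):
            # A is extraneous on the right if lhs+ under F' still contains A
            reduced = fds[:i] + [(lhs, rhs - {a})] + fds[i + 1:]
            if a in closure(lhs, reduced):
                return i, "rhs", a
    return None

def canonical_cover(fds):
    fds = merge(fds)
    while True:
        hit = find_extraneous(fds)
        if hit is None:
            return fds
        i, side, a = hit
        lhs, rhs = fds[i]
        fds[i] = (lhs - {a}, rhs) if side == "lhs" else (lhs, rhs - {a})
        fds = merge(fds)    # the union rule may apply again after a deletion

# F = {A -> BC, B -> C, A -> B, AB -> C}
cover = canonical_cover([("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")])
for lhs, rhs in cover:
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```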

Lossless-join Decomposition • For the decomposition of R into (R1, R2), we require that for all possible relations r on schema R • r = ΠR1(r) ⋈ ΠR2(r) (lossless join) • A decomposition of R into R1 and R2 is lossless join if and only if at least one of the following dependencies is in F+: • R1 ∩ R2 → R1 • R1 ∩ R2 → R2

Example • R = (A, B, C) • F = {A → B, B → C} • can be decomposed in two different ways • R1 = (A, B), R2 = (B, C) • lossless-join decomposition: R1 ∩ R2 = {B} and B → BC • dependency preserving • R1 = (A, B), R2 = (A, C) • lossless-join decomposition: R1 ∩ R2 = {A} and A → AB • not dependency preserving (cannot check B → C without computing R1 ⋈ R2)
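The lossless-join condition for a two-way split reduces to one attribute closure: compute (R1 ∩ R2)+ and see whether it covers R1 or R2. A minimal sketch, assuming single-letter attributes and a hypothetical helper name `is_lossless`:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def is_lossless(R1, R2, fds):
    """True if (R1 n R2) -> R1 or (R1 n R2) -> R2 is in F+."""
    common_closure = closure(set(R1) & set(R2), fds)
    return set(R1).issubset(common_closure) or set(R2).issubset(common_closure)

# F = {A -> B, B -> C}
F = [("A", "B"), ("B", "C")]

print(is_lossless("AB", "BC", F))   # common attribute B, and B -> BC
print(is_lossless("AB", "AC", F))   # common attribute A, and A -> AB
print(is_lossless("AC", "BC", F))   # common attribute C determines nothing else
```

The third call shows a split of the same schema that fails the test: C determines neither side, so joining the projections can produce spurious tuples.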

Dependency Preservation • Let Fi be the set of dependencies in F+ that include only attributes in Ri. • A decomposition is dependency preserving if • (F1 ∪ F2 ∪ … ∪ Fn)+ = F+ • Why is this nice? • each dependency can be checked using only one relation • Restricting F itself (rather than F+) is not sufficient: R = (A, B, C), F = {A → BC}, R1 = (A, B), R2 = (A, C); no dependency in F fits inside R1 or R2, so this would give (F1 ∪ F2)+ = (∅)+ • But F+ includes A → B and A → C, so (F1 ∪ F2)+ = ({A → B} ∪ {A → C})+ = F+

Dependency Preservation • Computing F+ is expensive. • Alternative: check each dependency in F • To check if α → β is preserved in the decomposition of R into R1, R2, …, Rn:
  result = α
  while (changes to result) do
    for each Ri in the decomposition
      t = (result ∩ Ri)+ ∩ Ri    // whatever in Ri is dependent on α
      result = result ∪ t
• If result contains all attributes in β, then the functional dependency α → β is preserved

Example • Algorithm (attribute closure is with respect to F):
  result = α
  while (changes to result) do
    for each Ri in the decomposition
      t = (result ∩ Ri)+ ∩ Ri
      result = result ∪ t
• If result contains all attributes in β, then the functional dependency α → β is preserved • R = (A, B, C) • F = {A → BC, B → C} • R is not in BCNF because of B → C • R1 = (A, B), R2 = (B, C) is in BCNF. • Check dependency preservation for A → BC: • result = A • Use R1: t = (result ∩ R1)+ ∩ R1 = (A ∩ (A, B))+ ∩ (A, B) = (A, B, C) ∩ (A, B) = (A, B), so result = (A, B) • Use R2: t = (result ∩ R2)+ ∩ R2 = ((A, B) ∩ (B, C))+ ∩ (B, C) = (B)+ ∩ (B, C) = (B, C) ∩ (B, C) = (B, C), so result = (A, B) ∪ (B, C) = (A, B, C), which contains BC
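The preservation test avoids computing F+ and can be run as written. The sketch below assumes single-letter attributes; the function name `preserves` is a label chosen for this illustration.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def preserves(lhs, rhs, decomposition, fds):
    """True if lhs -> rhs can be checked without joining the decomposed relations."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for Ri in decomposition:
            # whatever in Ri is determined by the part of result inside Ri
            t = closure(result & set(Ri), fds) & set(Ri)
            if not t.issubset(result):
                result |= t
                changed = True
    return set(rhs).issubset(result)

F = [("A", "BC"), ("B", "C")]
print(preserves("A", "BC", ["AB", "BC"], F))
print(preserves("B", "C", ["AB", "AC"], F))

# The 3NF motivation example: F = {JK -> L, L -> K}, split into (L, K) and (L, J)
G = [("JK", "L"), ("L", "K")]
print(preserves("JK", "L", ["LK", "LJ"], G))
```

A decomposition is dependency preserving when this test succeeds for every dependency in F.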

Testing for BCNF • To check if a non-trivial dependency α → β causes a violation of BCNF • compute α+ (the attribute closure of α), and • verify that it includes all attributes of R (α is a superkey of R). • Simplified test: check only the dependencies in the given set F for violation of BCNF, rather than checking all dependencies in F+. • Even simpler: use Fc • Problem: using only Fc is incorrect for testing relations in a decomposition of R • example: R = (A, B, C, D, E), with F = {A → B, BC → D} • decompose R into R1 = (A, B) and R2 = (A, C, D, E) • neither of the dependencies in F contains only attributes from (A, C, D, E), so we might think R2 satisfies BCNF • dependency AC → D in F+ shows R2 is not in BCNF

Testing A Decomposition for BCNF • To check if a relation Ri in a decomposition of R is in BCNF, • test Ri for BCNF with respect to the restriction of F+ to Ri (all FDs in F+ that contain only attributes from Ri) • or use the original set F with the following test: • for every set of attributes α ⊆ Ri, check that α+ (the attribute closure of α) either includes no attribute of Ri – α, or includes all attributes of Ri. • if the condition is violated by some α, • the dependency α → (α+ – α) ∩ Ri holds on Ri and • Ri violates BCNF. • use this dependency to decompose Ri • Apply to R = (A, B, C, D, E), with F = {A → B, BC → D} and • R1 = (A, B) and R2 = (A, C, D, E): a possible problem is AC in R2 • (AC)+ = (ABCD), thus AC → D is a dependency on R2. • Decompose R2 into R3 = (A, C, D) and R4 = (A, C, E)
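The subset test can be sketched as a brute-force search over the attribute sets of Ri. The name `bcnf_violation` and the single-letter encoding are assumptions for this illustration, and because every subset is enumerated the cost grows exponentially with the size of Ri.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def bcnf_violation(Ri, fds):
    """Return (alpha, beta) for a dependency violating BCNF on Ri, or None.

    The closure is taken with respect to the original F, as in the test above.
    """
    Ri = set(Ri)
    for size in range(1, len(Ri)):
        for combo in combinations(sorted(Ri), size):
            alpha = set(combo)
            closed = closure(alpha, fds)
            determined = (closed - alpha) & Ri   # (alpha+ - alpha) n Ri
            # violation: alpha determines something in Ri but not all of Ri
            if determined and not Ri.issubset(closed):
                return alpha, determined
    return None

# F = {A -> B, BC -> D}
F = [("A", "B"), ("BC", "D")]

print(bcnf_violation("AB", F))
print(bcnf_violation("ACDE", F))
```

For R2 = (A, C, D, E) the search reports the dependency AC → D, which is not in F but is in F+.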

BCNF Decomposition Algorithm •
  result := {R};
  done := false;
  compute F+;
  while (not done) do
    if (there is a schema Ri in result that is not in BCNF)
      then begin
        let α → β be a nontrivial functional dependency that holds on Ri
          such that α → Ri is not in F+, and α ∩ β = ∅;
        result := (result – Ri) ∪ (Ri – β) ∪ (α, β);
      end
    else done := true;
• Each Ri is now in BCNF and the decomposition is lossless-join
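A runnable sketch of the decomposition loop follows. Instead of computing F+, it assumes the attribute-closure test against the original F to find a violating dependency, and it picks the first violation in a fixed enumeration order. Names such as `bcnf_decompose` are illustrative, attributes are single letters, and a different choice of violating dependency can lead to a different (still valid) result.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def bcnf_violation(Ri, fds):
    """Return (alpha, beta) with alpha n beta empty that violates BCNF on Ri, or None."""
    for size in range(1, len(Ri)):
        for combo in combinations(sorted(Ri), size):
            alpha = set(combo)
            closed = closure(alpha, fds)
            beta = (closed - alpha) & Ri
            if beta and not Ri.issubset(closed):
                return alpha, beta
    return None

def bcnf_decompose(R, fds):
    todo, done = [frozenset(R)], []
    while todo:
        Ri = todo.pop()
        violation = bcnf_violation(Ri, fds)
        if violation is None:
            done.append(Ri)                       # Ri is in BCNF
        else:
            alpha, beta = violation
            todo.append(frozenset(alpha | beta))  # (alpha, beta)
            todo.append(frozenset(Ri - beta))     # Ri - beta
    return done

# R = (A, B, C), F = {A -> B, B -> C}
result = bcnf_decompose("ABC", [("A", "B"), ("B", "C")])
print([sorted(s) for s in result])
```

Both pieces produced by a split are strictly smaller than Ri, so the loop terminates.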

Example of BCNF Decomposition • Original relation R and set of functional dependencies F • R = (A, B, C, D, E, F) • F = {B → A, AC → D, BC → E, BC → AF} • Key = {BC} • Canonical cover Fc: {B → A, AC → D, BC → EF} • Use B → A to decompose R: • R1 = (A, B) • R2 = (B, C, D, E, F) • Use BC → EF to decompose R2 • R3 = (B, C, E, F) • R4 = (B, C, D) • Note: (BC)+ already contains every attribute of R2, so BC is a superkey of R2 and this second split is not strictly required for BCNF • Use attribute closure tests to see if done, e.g. (BC) on R4 • Final decomposition: R1, R3, R4

BCNF Summary • Algorithm says decompose using F+ • This takes forever • Easier: use Fc • But this may miss some FDs • So, use Fc but check attribute closure: • for every set of attributes α ⊆ Ri, check that α+ (the attribute closure of α) either includes no attribute of Ri – α, or includes all attributes of Ri.

3NF Decomposition Algorithm •
  Let Fc be a canonical cover for F;
  i := 0;
  for each functional dependency α → β in Fc do
    if none of the schemas Rj, 1 ≤ j ≤ i contains αβ
      then begin
        i := i + 1;
        Ri := αβ;
      end
  if none of the schemas Rj, 1 ≤ j ≤ i contains a candidate key for R
    then begin
      i := i + 1;
      Ri := any candidate key for R;
    end
  return (R1, R2, ..., Ri)
• Each relation schema Ri is in 3NF (Ri := αβ where α → β is in the canonical cover; the last Ri contains a candidate key and nothing else, so it can't be decomposed) • Decomposition is dependency preserving (each α → β in the canonical cover is in one of the Ri) • Decomposition is lossless-join (one of the Ri contains a candidate key)

3NF Decomposition • Original relation R and set of functional dependencies F • R = (A, B, C, D, E, F) • F = {B → A, AC → D, BC → E, BC → AF} • Key = {BC} • Canonical cover Fc: {B → A, AC → D, BC → EF} • Decomposition: R1 = (A, B), R2 = (A, C, D), R3 = (B, C, E, F) • R3 contains a candidate key so we are done • Compare with the BCNF decomposition R1 = (A, B), R3 = (B, C, E, F), R4 = (B, C, D), which is not dependency preserving
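The 3NF synthesis steps can be sketched as follows. The code assumes that its input is already a canonical cover (it does not compute one), that attributes are single letters, and that the function names are free choices for this illustration. It also applies the optional clean-up of dropping any schema contained in another schema.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs).issubset(result) and not set(rhs).issubset(result):
                result |= set(rhs)
                changed = True
    return result

def candidate_key(R, fds):
    """Return one candidate key: a smallest attribute set whose closure covers R."""
    for size in range(1, len(R) + 1):
        for combo in combinations(sorted(R), size):
            if set(R).issubset(closure(combo, fds)):
                return frozenset(combo)

def synthesize_3nf(R, cover):
    """3NF synthesis; `cover` is assumed to be a canonical cover already."""
    schemas = []
    for lhs, rhs in cover:
        schema = frozenset(lhs) | frozenset(rhs)
        if not any(schema.issubset(other) for other in schemas):
            schemas.append(schema)
    # a schema contains a candidate key exactly when it is a superkey of R
    if not any(set(R).issubset(closure(s, cover)) for s in schemas):
        schemas.append(candidate_key(R, cover))
    # optional extension: drop schemas that are proper subsets of other schemas
    return [s for s in schemas
            if not any(s != other and s.issubset(other) for other in schemas)]

# Fc = {B -> A, AC -> D, BC -> EF} over R = (A, B, C, D, E, F)
Fc = [("B", "A"), ("AC", "D"), ("BC", "EF")]
schemas = synthesize_3nf("ABCDEF", Fc)
print([sorted(s) for s in schemas])
```

Here (B, C, E, F) already contains the key BC, so no extra key schema is added.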

3NF Decomposition: Example • Relation schema: cust_banker_branch = (customer_id, employee_id, branch_name, type) • The functional dependencies for this relation schema are: • customer_id, employee_id → branch_name, type • employee_id → branch_name • customer_id, branch_name → employee_id • Compute a canonical cover: • branch_name is extraneous in the r.h.s. of the 1st dependency • no other attribute is extraneous, so we get Fc = • customer_id, employee_id → type • employee_id → branch_name • customer_id, branch_name → employee_id

3NF Decomposition Example (Cont.) • The for loop generates the following 3NF schemas: • (customer_id, employee_id, type) • (employee_id, branch_name) • (customer_id, branch_name, employee_id) • (customer_id, employee_id, type) contains a candidate key of the original schema, so no further relation schema needs to be added • The second schema is redundant • Minor extension of the 3NF decomposition algorithm: at the end of the for loop, detect and delete schemas which are subsets of other schemas • If the FDs were considered in the order 1st, 3rd, 2nd, (employee_id, branch_name) would not be included in the decomposition because it is a subset of (customer_id, branch_name, employee_id) • With this extension, the final result does not depend on the order in which FDs are considered

Comparison of BCNF and 3NF • It is always possible to decompose a relation into a set of relations that are in 3NF such that: • the decomposition is lossless • the dependencies are preserved • It is always possible to decompose a relation into a set of relations that are in BCNF such that: • the decomposition is lossless • it may not be possible to preserve dependencies

ER Model and Normalization • When an E-R diagram is carefully designed, identifying all entities correctly, the tables generated from the E-R diagram should not need further normalization • In a real (imperfect) design, there can be functional dependencies from non-key attributes of an entity • example: an employee entity with attributes department_number and department_address, and a functional dependency department_number → department_address • good design would have made department an entity

Denormalization for Performance • May want to use non-normalized schema for performance when queries span two relations (avoid a join) • Alternative 1: Use denormalized relation containing attributes that are usually combined • faster lookup • extra space and extra execution time for updates • extra coding work for programmer and possibility of error in extra code • Alternative 2: use a materialized view with the specific attributes • benefits and drawbacks same as above, except no extra coding work for programmer and avoids possible errors • view: a virtual table representing the result of a query. A query or update against the view is executed against the underlying base tables. • materialized view: stored as a concrete table that is infrequently updated from the original base tables
