330 likes | 433 Views
Dr Alwyn Barry Dr Joanna Bryson. CM20145 Further DB Design – Normalization. Last Time. Database design is an ongoing, iterative process. Requirements come from data, user demands, design issues. Change occurs: Corporations & technologies grow. Programmers & users learn. Views / security.
E N D
Dr Alwyn Barry Dr Joanna Bryson CM20145Further DB Design –Normalization
Last Time • Database design is an ongoing, iterative process. • Requirements come from data, user demands, design issues. • Change occurs: • Corporations & technologies grow. • Programmers & users learn. • Views / security. • Lossless-join decomposition Now: Science for improving design.
Design Process & Normalization • We assume a schema R is given. • R could have been generated when converting E-R diagram to a set of tables. • R could have been a single relation containing all attributes that are of interest (called universal relation). • Normalization breaks R into smaller relations. R could be the result of any ad hoc design of relations, which we then test & convert to normal form.
Overview • First Normal Form. • Functional Dependencies. • Second Normal Form. • Third Normal Form. • Boyce-Codd Normal Form. • Fourth Normal Form. • Fifth Normal Form. • Domain Key / Normal Form. • Design Process & Problems.
First Normal Form – 1NF • You aren’t supposed to have more than one value per attribute of a tuple. • All tuples have the same number of attributes. • Necessary for a relational database. BAD
Getting Caught Out With 1NF • A domain is atomic if its elements are considered to be indivisible units. • Examples of non-atomic domains: • Set-valued attributes, composite attributes. • Identifiers like CS101 that can be broken up into parts. • A relational schema R is in first normal form if the domains of all attributes of R are atomic. • Non-atomic values: • complicate storage, • encourage redundancy, • Depend on interpretation built into application programs.
Are You Atomic? • Atomicity is not an intrinsic property of the elements of the domain. • Atomicity is a property of how the elements of the domain are used. • E.g. strings containing a possible delimiter (here: a space) • cities = “Melbourne Sydney” (non-atomic: space separated list) • surname = “Fortescue Smythe” (atomic: compound surname) • E.g. strings encoding two separate fields • bucs_login = cssjjb • If the first two characters are extracted to find the department, the domain bucs_login is not atomic. • Leads to encoding of information in application program rather than in the database.
Second Normal Form (2NF) • Violated when a nonkey column is a fact about part of the primary key. • A column is not fully functionally dependent on the full primary key. • CUSTOMER-CREDIT in this case: From Watson BAD FIX ITEM *itemid … ORDER quantity … CUSTOMER *customerid customer-credit …
Def: Functional Dependency • Let R be a relation schema R and R • The functional dependency (FD) holds on R (“ is FD on ”) iff for any legal relations r(R): • whenever any two tuples t1and t2 of r agree on the attributes • they also agree on the attributes . • i.e. (t1) = (t2) (t1) = (t2) • Example: Consider r(A,B) with the following instance of r: • AB does NOT hold, but BA does hold
Functional Dependencies: Uses • Way to encode “business rules”. • Specify constraints on the set of legal relations. • We say that Fholds onR if all legal relations on R satisfy the set of FDs F. • Test relations to see if they are legal under a given set of FDs. • If a relation r is legal under a set F of FDs, we say that rsatisfies F.
Functional Dependencies • An FD is an assertion about a schema, not an instance. • If we only consider an instance or a few instances, we can’t tell if an FD holds. • Inspecting only a few bird species (e.g. crows, cardinals and canaries) we might conclude colour species. • However, this would be a bad FD as we would find out if we found some ravens. • Thus, identifying FDs is part of the data modelling process.
Trivial Functional Dependencies • An FD is trivial if it is satisfied by all instances of a relation • E.g. • customer-name, loan-number customer-name • customer-name customer-name • In general, is trivial if • Permitting such FDs makes certain definitions and algorithms easier to state.
Functional Dependency vs Key • FDs can express the same constraints we could express using keys: • Superkeys: • K is a superkey for relation schema R if and only if K R • Candidate keys: • K is a candidate key for R if and only if • K R, and • there is no K’ K such that K’ R • Of course, which candidate key becomes the primary key is arbitrary.
FDs <> Keys • FDs can represent more information than keys can on their own. • Consider the following Loan-info-schema: Loan-info-schema = (customer-name, loan-number,branch-name, amount). We expect these FDs to hold: loan-numberamount loan-number branch-name We could try to express this by making loan-number the key,however the following FD does not hold: loan-number customer-name • Incidentally, this isn’t a very good table! (¬2NF)
FD Closure • Given a set F of FDs, other FDs are logically implied. • E.g. If A B and B C, we can infer that A C • The set of all FDs implied by F is the closure of F, written F+ . • Find F+by applying Armstrong’s Axioms: • if , then (reflexivity) • if , then (augmentation) • if , and , then (transitivity) • Additional rules (derivable from Armstrong’s Axioms): • If and holds, then holds (union) • If holds, then holds and holds (decomposition) • If holds and holds, then holds (pseudotransitivity)
Bad Decomposition Example(From Last Time) • A Non Lossless-Join Decomposition R = (A, B) R1 = (A), R2 = (B) A B A B A B 1 2 1 2 1 2 1 1 2 B(r) A(r) r A (r) ⋈ B (r) • Thus, r is different to A (r) ⋈ B (r) • So A,B is not a lossless-join decomposition of R.
FDs & Lossless Decomposition • All attributes of an original schema (R) must appear in the decomposition (R1, R2): R = R1 R2 • Lossless-join decomposition.For all possible relations r on schema R r = R1 (r) ⋈ R2 (r) • A decomposition of R into R1 and R2 is lossless-join if and only if at least one of the following dependencies is in F+: • R1 R2R1 • R1 R2R2
Second Normal Form (2NF) • Violated when a nonkey column is a fact about part of the primary key. • A column is not fully functionally dependent on the full primary key. • CUSTOMER-CREDIT in this case: From Watson BAD FIX ITEM *itemid … ORDER quantity … CUSTOMER *customerid customer-credit …
Third Normal Form (3NF) • Violated when a nonkey column is a fact about another nonkey column. • A column is not fully functionally dependent on the primary key. • R is 3NF iff R is 2NF and has no transitive dependencies. • EXCHANGE RATE violates this. FIX BAD
Boyce-Codd (BCNF) • Arises when a table: • has multiple candidate keys, • the candidate keys are composite, • the candidate keys overlap. • R is BCNF iff every determinant is a cand. key. • E.g. Assume one consultant per problem per client, and one problem per consultant. • If client-problem is the primary key, how do you add a new consultant? • Like 3NF but now worry about all fields. BAD FIX
Design Goals & their discontents • Goals for a relational database design: • eliminate redundancies by decomposing relations, • must be able to recover original data using lossless joins, • prefer not to loose dependencies. • BCNF: • no redundancies, • no guarantee of dependency preservation. • 3NF: • dependency preservation, • but possible redundancies.
Fourth normal form (4NF) • A row should not contain two or more independent multivalued facts. • 4NF iff BCNF & no non-trivial multi-valued dependencies. • Multivalued dependency means the value of one attributed determines a set of values for another. BAD FIX
Fifth normal form (5NF) • 5NF iff a relation has no join dependency. • The schemas R1, R2,.., Rn have a join dependency over R if they define a lossless-join decomposition over R. • This is way too complicated, don’t worry about it.
Domain Key Normal Form • Every constraint on the relation must be a logical consequence of the domain constraints and the key constraints that apply to the relation. • Key: unique identifier. • Constraint: rule governing attribute values. • Domain: set of values of the same data type. • No known algorithm gives DK/NF.
E-R Model and Normalization • When an E-R diagram is carefully designed, identifying all entities correctly, the tables generated should not need further normalization. • However, in a real (imperfect) design there can be FDs from non-key attributes of an entity to other attributes of the entity. • The keys identified in E-R diagrams might not be minimal - FDs can help us to identify minimal keys. • FDs from non-key attributes of a relationship set are possible, but rare.
Denormalization & Performance • May want to use non-normalized schema for performance. • E.g. displaying customer-name along with account-number and balance requires join of account with depositor. • Alternative 1: Use denormalized relation containing attributes of account as well as depositor. • faster lookup. • extra space and extra execution time for updates. • extra coding work for programmer and possibility of error in extra code. • Alternative 2: use a materialized view defined as account ⋈ depositor • as above, except less extra coding, errors.
Limits of Normalization • Examples of bad database design, not caught by normalization. • Good: • earnings(company-id, year, amount) • Bad: • earnings-2000, earnings-2001, earnings-2002, etc., on (company-id, earnings) • all are BCNF, but querying across years difficult • needs a new table each year • company-year(company-id, earnings-2000,earnings-2001, earnings-2002) • in BCNF, but querying across years difficult • requires new attribute each year
Summary 1 – Rules to Watch • 1NF: attributes not atomic. • 2NF: non-key attribute FD on part of key. • 3NF: one non-key attribute FD on another. • Boyce-Codd NF: overlapping but otherwise independent candidate keys. • 4NF: multiple, independent multi-valued attributes. • 5NF: join dependency. • Domain Key / NF: all constraints either domain or key
Summary 2 – Concepts • Functional Dependencies: • Axioms & Closure. • Lossless-join decomposition. • Design Process. • Normalization Problems. Next: Interfaces and Architectures
Reading & Exercises • Reading • Connolly & Begg Chapter (13, 14) • Silberschatz Chapters 7. • Any other book, the design/normalization chapter. • Exercises: • Silberschatz • 7.1, 7.2, 7.16, 7.23, 7.24, 7.27-29
Next Week • Architectures and Implementations • Integrity and Security
Slides after and including this one you are not responsible for, but I am saving in case I decide to use them in the future.
Goal: Formalize “Good Design” • Process: • Decide whether a particular relation R is in “good” form. • In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that: • each relation is in good form, • the decomposition is a lossless-join decomposition. • Theory: • Constraints on the set of legal relations. • Require that the value for a certain set of attributes determines uniquely the value for another set of attributes – functional dependencies.