440 likes | 680 Views
Database Design Theory. Which tables to have in a database Normalization. Database Design Theory. Given some body of data to be represented in a database, as modelled in an E-R diagram, what is the most suitable logical structure for that data?
E N D
Database Design Theory Which tables to have in a database Normalization
Database Design Theory • Given some body of data to be represented in a database, as modelled in an E-R diagram, what is the most suitable logical structure for that data? • How do we decide on the appropriate tables and the attributes of the tables?
Requirements • Accommodate data integrity. • General integrity constraints • E.g., referential integrity • Domain specific integrity constraints • E.g., no user can borrow more than 4 books. • Robust in the sense that the design should be application independent. • We try to achieve this through the elimination of redundancy.
The Danger of Redundancy • Consider the example • For students, we want to know student ID, name and address. • For courses, we need to know course ID, title and lecturer. • For employees, we need to know the employee ID, name and department. • For each department, we need to know the department ID, the name and the location. • For each enrollment, we need to know the grade.
The Danger of Redundancy Continued • One solution store everything in one big table Appl(sid,name,addr, cid,title, eid, ename, deptID, dname, loc, grade) • Clearly, this leads to redundancy. • For example, we need to store the student’s address for every course they have been registered for.
The Danger of Redundancy. Conclusion • If everything in one table, then • Greater space requirements • Insertion anomalies • Cannot store information on student who has not passed a course yet. • Deletion anomalies • We may want to delete a course but some student may be registered only for that course. • Update anomalies • If a student changes their address, many tuples need to be updated. • Danger of inconsistency in database.
Good Database Design • The basic idea: • A “good” database is one in which each table consists of a primary key and a set of mutually independent attributes. • Strategy for achieving a good database design: • Identify undesirable dependencies in a table and decompose by projection.
Functional Dependencies (FD’s) • Attribute (set) Y is functionally dependent on attribute (set) X if, whenever two tuples have the same value for X, they also have the same value for Y. • Notation: X Y • X is called the determinant. • An FD A B is non-trivial if and only if B A and BA.
Functional Dependencies in Our Example. • If everything is in one table, then these FD’s exist: • sid name, addr • cid title • eid ename • deptid dname, loc • sid, cid grade • Note we also have • sid, cid, eid, deptid All other attributes
Keys Again • A set of attributes X in a relation R is a superkey if every attribute in R is functionally dependent on X. • A candidate key is a minimal superkey. • Alternate keys are candidate keys that have not been selected as primary keys. • A prime attribute is a member of a candidate key.
Armstrong’s Axioms • Let X,Y and Z be sets of attributes of a relation R • Reflexivity: (X Y) (X Y) • Augmentation (X Y) (XZ YZ) • Transitivity ((X Y) & (Y Z)) (X Z) • Axioms are sound and complete • Can derive all FDs that follow from a given set of FDs. • Derive only true FDs
Some Consequences of Armstrong’s Axioms • The following are implied by Armstrong’s axioms: • Decomposition (X YZ) (X Y) • Union ((X Y)&(X Z)) (X YZ) • Pseudo transitivity ((X Y)&(WY Z)) (WX Z)
Closure of a Set of Functional Dependencies • If F is a set of functional dependencies, the closure of F, F+, is the set of all functional dependencies logically implied by those in F. • Useful since it allows us to determine candidate keys (there must be functional dependency to all other attributes), but very expensive to compute.
Closure Under a Set of Functional Dependencies • Since F+ is too expensive to compute, we use closure of X under a set of functional dependencies, X+. • (X Y) in F if and only if Y X+. • Since X+ is relatively easy to compute, we can now verify whether X is a superkey.
Computing X+. • To compute X+ under a set of FDs F: INPUT: X, F OUTPUT: X+ S := X WHILE there is a (ZY) in F with Z S and Y S DO S := SY ENDWHILE X+ = S
Decomposition • Recall that having identified undesirable FDs, we now need to decompose. • Decomposition: • Let U be a relation scheme. A set of {R1,..,Rn} of relation schemes is a decomposition of U if R1 … Rn = U • Every attribute of U occurs in at least one Ri.
Desirable Properties of Decomposition • Decompositions should be • Lossless • Dependency preserving • No redundancy • Minimal number of tables • Sometimes, not all properties can be achieved simultaneously.
Lossless Decomposition • Let • {R1,..,Rn} a decomposition of U • u relation instance over U • Pi = Ri(u) for i from 1 to n • Then • {R1,..,Rn} is a lossless decomposition if u = P1 … P n • In other words, the original relation can be reconstructed.
Dependency Preserving Decompositions • In decomposing a table, ensure that any FDs are easily enforceable. • Example: • Relation U(A,B,C) • FDs: A B, A C, B C • If we decompose U into R(A,B) and S(B,C), then A B, B C can be easily enforced when changing R or S. • Because of transitivity, A C is automatically enforced.
Non-Dependency Preserving Decomposition • If we decompose U into R’(A,B) and S’(A,C), then enforcing A B and A C is easy. • However, B C becomes an interrelational constraint and can only be enforced through a join. • This decomposition is not dependency preserving.
Normalization • Normal forms, as defined in relational database theory, are guidelines for the design of the tables in the database. • Normalization reduces redundancy. • Important to remember why we want to avoid redundancy • Space requirements • Insertion, deletion and update anomalies.
The Normal Forms • First normal form • Second normal form • Third normal form • Boyce-Codd normal form • Fourth normal form • Fifth normal form • The normal forms are ordered in that everything in 2NF is also in 1NF. • We ignore 5NF, as violations hardly occur in practice.
First Normal Form • A relation is in 1NF iff the value of each attribute in a tuple is atomic. • A relation which is not in 1NF sid cid grade 123 CS51S A CS51T B 234 CS51S C CS52S B CS52T B
Getting Tables into 1NF • Normalizing a table which is not in 1NF is easy: Simply repeat the other fields. • Thus sid cid grade 123 CS51S A 123 CS51T B 234 CS51S C 234 CS52S B 234 CS52T B
Second and Third Normal Form • Second and third normal form concern relationship between non-key and prime attributes. • Recall that a prime attribute is a member of a candidate key. • Under 2NF and 3NF, a non-key attribute value must provide a fact about the key, the whole key and nothing but the key. • Every non-prime attribute must be fully functionally dependent on a candidate key.
Second Normal Form • 2NF is violated when a non-key attribute depends on a proper subset of a candidate key. • The following violates 2NF Result(cid, sid, name, grade) • as name is functionally dependent on sid alone.
Dangers of Violating 2NF • Note that name is repeated for every course that a student has a grade for. • Problems: • Danger of inconsistency if a student changes their name, e.g., by getting married. • If a student has not passed any courses yet, then the student’s name cannot be stored.
Getting Tables into 2NF • Decompose the table into Result(cid, sid, grade) Student(sid, name) • This decomposition leads to longer retrieval times for queries which involve joins. • Normalization is necessary to avoid anomalies which arise because of changes to attributes. • If little chance of changes, then sometimes do not normalize.
Third Normal Form • Third normal form is violated when a non-prime attribute depends on another non-prime attribute. • The following violates 3NF Empl(eid, dept, loc) • loc is a fact about dept. • Danger same as violation of 2NF.
Getting Tables into 3NF • Again, decompose Empl(eid, dept) Department(dept, loc) • We can always restore 3NF through a lossless and dependency preserving decomposition.
Boyce-Codd Normal Form (BCNF) • A relation scheme R is in BCNF if every determinant of a FD over R is a candidate key. • In other words, the determinant of every FD is a superkey. • Violation of BCNF • R(A,B,C,D,E,F) • { A BC, D AEF } • D+ = ABCDEF • D is a good primary key. • A+ = ABC
Another Violation of BCNF • Assume that we give each registration for a course a unique registration number Reg(rid, sid, cid, sname, grade) • FDs • rid sid, cid • sid, cid rid, grade • sid sname • rid+ = All attributes
Getting Tables into BCNF • Decompose according to the FD whose determinant is not a superkey. • In our example, sid sname • This gives • Reg(rid, sid, cid, grade) • Stud(sid, sname) • Not always possible to get tables into BCNF while preserving all functional dependencies.
Example where BCNF is not possible • Consider • R(A,B,C) • { AB C, C B} • Not in BCNF because C is not a superkey. • However, every decomposition of R fails to be dependency preserving as we have to split up the attributes in AB C • Have to settle for 3NF.
Multivalued Dependencies (MVDs) • In an FD, X Y, knowing the value of X means that you know the unique value for Y. • In an MVD, X Y, knowing the value of X means that you know the set of values from which Y can come.
Example of MVD • Assume we have two streams for some course, taught by different instructors, and that for each course, we use two textbooks. • Example: course instructor text CS51T Rao Date Mansingh Korth CS52S Rao Jackson Stitt Rich
Example of MVD Continued • Putting table in 1NF gives course instructor text CS51T Rao Date CS51T Rao Korth CS51T Mansingh Date CS51T Mansingh Korth CS52S Rao Jackson CS52S Rao Rich CS52S Stitt Jackson CS52S Stitt Rich • With primary key • course, instructor, text • Since no FD, in BCNF.
Redundancy because of MVDs • However, still redundancy in the table because • if <c,p,x> and <c,p’,x’> in table <c,p’,x> and <c,p,x’> in table too. • The table contains two multivalued dependencies: • course instructor • course text • Danger of insertion and update anomalies
Fourth Normal Form • Under 4NF, a relation should not contain two or more independent MVDs. • In other words, if there is a MVD, X Y, then X should be a superkey.
Getting Tables into 4NF • Again, get a table into 4NF through decomposition so that each MVD is captured in a separate table. • Example: • CP(course, instructor) • CT(course, text)
Normalization Reconsidered • Normalization helps avoid: • Insertion anomalies • Update anomalies • Deletion anomalies • Normalization increases retrieval time for some queries.