220 likes | 305 Views
Introduction to Normalization. CPSC 356 Database Ellen Walker Hiram College. Building a Schema. Start with a list of all attributes, considered as if you had a giant flat database (one relation) with all possible information in one place Divide the attributes into multiple relations
E N D
Introduction to Normalization CPSC 356 Database Ellen Walker Hiram College
Building a Schema • Start with a list of all attributes, considered as if you had a giant flat database (one relation) with all possible information in one place • Divide the attributes into multiple relations • Intuitively • According to formal rules (normalization) • This is a formalized alternative to the algorithms we learned before
What Makes A Good Schema? • Each relation should have clear semantics, i.e. can be easily described in a few words • Try to avoid redundancy (to minimize storage space, but also to avoid anomalies) • Avoid a design that encourages too many NULL values in a relation. NULL can be ambiguous: N/A vs. unknown vs. not-yet-entered, etc. • Don’t split related attributes so that the relationship between them is lost (e.g. make sure LastName and UserID are both in the same relation)
Tracking Real Estate Staff • Consider a single relation for real estate • It contains branch name, branch number, staff name, staff number, staff salary, etc. • One entry for each staff member of each branch • Branch information is repeated for different staff (REDUNDANCY!) • Staff information is repeated if they work in multiple branches (REDUNDANCY!) • This is an example of what NOT to do
Redundancy-caused Anomalies • Insertion Anomalies • A branch with no staff has many NULLs • Entering a new staff member has NULL branch info • But branch number and staff number are both part of primary key! (Why) • Deletion Anomalies • When the last staff member at a branch is deleted, the branch info is lost • Update Anomalies • If we make a change in branch info once, it must be changed in all copies (for all staff).
Solving Redundancy Problems • Decompose the relation into multiple relations • Use Foreign Keys so the complete relation can be reconstructed through a join • Branch: has branch number & branch info • Staff: has staff number, staff info & branch number as foreign key • Foreign Keys are exactly the attributes that are in the primary key of the other relation • Insertion, deletion & update anomalies are gone! • Consider: add branch with no staff, remove last staff member, update branch info
How to Decompose? • Decompositions are not (always) intuitively obvious • Codd discovered mathematical properties (called Normal Forms) that describe “goodness” of decomposition • First, Second, Third normal forms decrease redundancy without loss of information • BCNF, Fourth and Fifth normal forms potentially introduce information loss (we will see…) • To understand normal forms, start with functional dependencies
Functional Dependencies • If A and B are attributes, and every value of A is associated with exactly one value of B (so knowing A predicts B), then B is functionally dependent on A (We write this as: A->B) • Functional dependency is based on the semantics (meaning) of the attributes. • A->B and B->A are two different constraints • Email -> first name is a valid dependency • First name -> Email is not a valid dependency
Examples of Functional Dependencies • US Zip Code -> State • US Area Code -> State • Email -> Firstname, lastname • HotelNo, RoomNo -> Price • JobTitle, ServiceLength -> Salary
What are the dependencies? • item place customer-name • ring Kay jewelers prince charming • ring walmart miss piggy • Place -> item? • Item -> place? • oil walmart tin man
Finding Dependencies in Data • If a value of attribute A is associated with two or more values of B, then it is not true that A->B. • If a value of attribute A is associated with exactly one value of B, then it might be true that A->B. • Only when every possible value of attribute A is associated with exactly one value of B is it true that A->B.
Characteristics of Functional Dependencies for Normalization • For any given values of the attributes on the left, there is exactly one possible attribute on the right • No future data will ever invalidate the dependency • Dependency is nontrivial -- no attributes from the left are repeated on the right
Keys & Functional Dependency • Remember, a candidate key is a subset of attributes that is (guaranteed) unique for every tuple • Therefore, a valid candidate key determines all other attributes in the tuple • Therefore, there is a functional dependency from the candidate key to all other non-key attributes of the relation. • (Since the primary key is a candidate key, these arguments can also be made for primary keys)
Manipulating Functional Dependencies • Given a set of dependencies, derive more dependencies using inference rules • The closure X+ of a set of dependencies is the set of all possible dependencies that can be derived from it.
Armstrong’s Inference Rules for Manipulating Dependencies • if Y is a subset of X, then X -> Y Alternatively: X,Y -> X (Reflexive) • If X->Y then X,Z->Y,Z (Augmentation) • If X->Y and Y->Z then X->Z (Transitive)
Additional Inference Rules • A->A (Self-determination) • If A->B,C then A->B and A->C (Decomposition) • If A->B and A->C then A->B,C (Union) • If A->B and C->D then A,C -> B,D (Composition)
When are two sets of FDs equivalent? • When we can use inference rules to transform A to B , then A and B are equivalent • Problem: it might take a long time to find the right set of inference rules • What we need is a “standard form” of FD’s - then we can just compare
Finding the Closure • F is a set of functional dependencies (e.g. the obvious ones from primary keys) We want to find X+, which is the set of all attributes that are dependent on X (based on F). X+ = X repeat for each dependency Y->Z in F do if Y is a subset of X+ then X+ = X+ union Z until no more can be added to X+
Closure Example • F is the following set of dependencies: A->B,C C->D A,D -> F • What is A+ (all attributes that can be derived from A)? • Initialize A+ = A • Because A->B,C add B,C to A+ • Because C is in A+ and C->D, add D to A+ • Because A and D are in A+, add F to A+ • Therefore A+ is A,B,C,D,F
Equivalence Test • Are the following sets of FDs equivalent? • AB->C, D->E, AE->G, GD->H, ID->J • ABD->C, ABE->G, GD->EH, IE->J • Compute closures for each, if any two are different, they are not equivalent • You will need to consider every left side…
Finding a Key • Given a relation with attributes ABCDEFGHIJ and the following FDs, find a candidate key for the relation • AB->C, D->E, AE->G, GD->H, ID->J • A candidate key is a subset of attributes that has the entire set of attributes as its closure. • Let’s try ABD…
What is Normalization? • Formal technique for analyzing relations based on primary key (or candidate keys) and functional dependencies • Series of tests (normal forms), each of which is harder to “pass” • Normal forms 1NF, 2NF, 3NF, BCNF depend on functional dependencies • Higher forms (4NF, 5NF) based on other dependencies • To avoid update anomalies without loss, normalize to 3NF.