300 likes | 647 Views
Lecture 22. Functional Dependencies (FDs) and Normalization. Schedule Change. http://www.users.csbsju.edu/~irahal/. The Database Design Process. Conceptual design : Use a data model language to come up with an accurate, high-level description of the system requirements
E N D
Lecture 22 Functional Dependencies (FDs) and Normalization
Schedule Change • http://www.users.csbsju.edu/~irahal/
The Database Design Process • Conceptual design: • Use a data model language to come up with an accurate, high-level description of the system requirements • Words (unstructured) Diagrams • Logical design: • Map the resulting EERD into a set of relations • Diagram Relations • Physical design: • Use DDL on some DBMS to create tables corresponding to your relations
The Database Design Process • Limitations of E-R Designs • The EER model provides a set of guidelines • Does not result in a unique database schema • Does not provide a “formal” way of evaluating alternatives • Relies largely on the common sense of the designer • Here we try to answer • What are the criteria for "good" base relations? • Meaningful grouping of attributes • When designing a relation schema, how to decide which attributes to include? • So far, attributes are grouped to form the relation schema by using the common sense of the database designer
The Database Design Process • First discuss informal guidelines for good relational design • Then we discuss formal concepts of functional dependencies and normal forms - 1NF (First Normal Form) - 2NF (Second Normal Form) - 3NF (Third Normal Form) - BCNF (Boyce-Codd Normal Form) • Additional types of dependencies, further normal forms, relational design algorithms by synthesis are discussed in Chapter 11
Relation Schema Informal Measures • We have some informal measures: • Semantics of the attributes • Reducing the redundancy values in tuples • Disallowing the possibility of generating spurious tuples • Reducing null values • Not always independent of one another
Semantics of the Relation Attributes (1) • Any grouping of attributes to form a relation schema must portray a certain real-world meaning • Each tuple in a relation should represent one entity or relationship instance • Guideline 1: Design a relation schema so that it is easy to explain its meaning • Semantics of attributes should be easy to interpret • Attributes of different entities should not be mixed • Only foreign keys should be used to refer to other entities
Relations that violate Guideline 1 by intermixing attributes from different relations
Redundant Information (2) • One goal of schema design is to reduce redundancy • Information is stored redundantly wasting storage • Problems with update anomalies • Modification anomalies • Insertion anomalies • Deletion anomalies • Mixing attributes of multiple entities may cause the above problems
Update anomalies • Modification Anomalies • Update PNAME from ‘ProductY’ to ‘Customer-Accounting’ • Insert Anomalies • Insert a new employee not assigned to known project • Insert a new project with no working employees • Delete Project • Delete PNUMBER=2 • Delete the sole employee of a project
Modification Anomalies • Consider the relation: • EMP_PROJ ( Emp#, Proj#, No_hours, Ename, Pname, Plocation) • Modification Anomaly: • Changing the name of project number P2 from “Project Y” to “Customer-Accounting” • May cause this update to be made for all employees working on project P2 otherwise the DB will become inconsistent
Modification Anomalies • Consider the relation: • EMP_PROJ( Emp#, Proj#, No_hours, Ename, Pname, Plocation) • Insert Anomaly: Cannot insert a project unless an employee is assigned to • Inversely - Cannot insert an employee unless he/she is assigned to a project. • Delete Anomaly: When a project is deleted delete all the employees who work on the project • Alternately, if an employee is the sole employee on a project, deleting that employee would result in deleting the corresponding project
Guidelines to Redundant Information in Tuples and Update Anomalies • Guideline 2: Design a schema that does not suffer from the insertion, deletion and update anomalies • If there are any present, then note them so that applications can be made to take them into account • Might need to break the guidelines to improve performance for certain queries • Assume that we always access employee information only with department information • The design EMP_PROJ (Emp#, Proj#, No_hours, Ename, Pname, Plocation)might be could for such cases
SSN Name Address Hobby 1111 Joe 123 Main biking 1111 Joe 123 Main hiking ……………. Redundancy Example ER Model SSN Name Address Hobby 1111 Joe 123 Main {biking, hiking} Relational Model (SSN, Hobby, Name, Address)
Example • Redundancy leads to anomalies: • A change in Address must be made in several places • Suppose a person gives up all hobbies. Do we: • Set Hobby attribute to null? No, since Hobby is part of key • Delete the entire row? No, since we lose other information in the row • No hobby information? • Hobby value must be supplied for any inserted row since Hobby is part of key
Decomposition • Solution: use two relations to store Person information • Person1 (SSN, Name, Address) • Hobbies (SSN, Hobby) • People with/without hobbies can now be described • No update anomalies: • Name and address stored once • A hobby can be separately supplied or deleted
Spurious Tuples (3) • Bad designs for a relational database (or bad decompositions) may result in erroneous results for certain JOIN operations • Any decomposition MUST have the "lossless join" property • Nospurious tuplesshould be generated by doing a natural-join of any decomposed relations • Person1 (SSN, Name, Address) • Hobbies (SSN, Name) • Here, “loss” relates to loss of information
Spurious Tuples (3) • Suppose we replace • EMP_PROJ ( SSN, PNUMBER, ENAME, PNAME, PLOCATION, HOURS) by • EMP_PROJ1(SSN, PNUMBER, PNAME, PLOCATION, HOURS) ANDEMP_LOCS (ENAME, PLOCATION) • Guideline 3: The relations should be designed to satisfy the lossless join condition • Avoid relations that contain matching attributes that are not (foreign key, primary key) combinations
Null Values in Tuples (4) • Guideline 4: Relations should be designed such that their tuples will have as few NULL values as possible • Make sure only NULLs are exceptional cases • If many attributes do not apply to all tuples in the relation, we end up with many nulls • Waste space • Ambiguity in meaning • Attribute not applicable or invalid • Value known to exist, but unavailable • Attribute value unknown (may or may not exist) • Difficulty specifying JOIN operations (inner or outer joins) • Attributes that are NULL frequently could be placed in separate relations (with the primary key)
Functional Dependencies • Functional dependencies (FDs) are used to specify formal measures of the "goodness" of a relational database design • FDs are constraints that are derived from the meaning and interrelationships of the data attributes • An FD is a constraint between two sets of attributes X and Y • X Y holds if whenever two tuples have the same value for X, they must have the same value for Y • For any two tuples t1 and t2 in any relation instance r(R):If t1[X]=t2[X], then t1[Y]=t2[Y]
Functional Dependencies • X Y: A set of attributes X functionally determines a set of attributes Y (or Y is functionally determined by X) if the value of X determines a unique value for Y • X Y in R specifies a constraint on all relation instances r(R) • FDs are derived from the real-world constraints on the attributes • Property of the intentionof the database • An FD is a property of the attributes in the schema R • The constraint must hold on every relation instance r(R) • Can NEVER be deduced from an extension • E.g. if in some case, all people having the same first name are registered for the same course, can we deduce that name course?
Examples of FD Constraints • EMP_PROJ (SSN, PNUMBER, ENAME, PNAME, PLOCATION, HOURS) • social security number determines employee name • SSN ENAME • project number determines project name and location • PNUMBER {PNAME, PLOCATION} • employee SSN and project number determines the hours per week that the employee works on the project • {SSN, PNUMBER} HOURS
More on FD Constraints • Definition of a relation KEY (If K is a key of R) • Kfunctionally determines all attributes in R • If XY is true, does that make YX true? • Some FDs are always true regardless of the relation in which they occur • {State, Driver_License_Number} SSN • Zip {City, State}
Inference Rules for FDs • Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold • Armstrong's inference rules: • IR1. (Reflexive) If Y X, then X Y • (Generates trivial FDs) • E.g. SSN, ENAME ENAME • IR2. (Augmentation) If X Y, then XZ YZ (Note that XZ stands for X U Z) • E.g. SSN ENAME, then {SSN, PNUMBER}{ENAME, PNUMBER} • IR3. (Transitive) If X Y and Y Z, then X Z • E.g. SSN DOB and DOB horoscope sign then SSN horoscope sign
Inference Rules for FDs • IR1, IR2, IR3 form a sound and complete set of inference rules • Sound Any rule inferred using IR1, IR2 or IR3 a valid FD • Complete All possible FDs can be generated using them • Some additional inference rules that are useful: • IR4. (Decomposition) If X YZ, then X Y and X Z • SSN ENAME, DOB then SSN DOB & SSN ENAME • IR5. (Union) If X Y and X Z, then X YZ • SSN DOB & SSN ENAME then SSN ENAME, DOB • IR6. (Pseudo-transitivity) If X Y and WY Z, then WX Z • Can be deduced from IR1, IR2, and IR3 (completeness property) • OfficeLocation Department & Department, Ename Salary-level then OfficeLocation,Ename Salary-level
Proofs • IR1. (Reflexive) If Y X, then X Y • For any two tuples t1 and t2 with t1[X] = t2[X] then t1[Y] = t2[Y] because Y X • IR2. (Augmentation) If X Y, then XZ YZ • Proof by contradiction • If for two tuples t1 and t2 we have • (1) t1[X] = t2[X] • (2) t1[Y] = t2[Y] • (3) t1[XZ] = t2[XZ] • (4) t1[YZ] ≠ t2[YZ] • Can’t be true since from (1) and (3) we have (5) t1[Z] = t2[Z] and from (2) and (5) we have t1[YZ] = t2[YZ] which contradicts (4)
Proofs • IR3. (Transitive) If X Y and Y Z, then X Z • For any two tuples t1 and t2 with t1[X] = t2[X] then t1[Y] = t2[Y] which implies that t1[Z] = t2[Z] hence X Z holds • IR4. (Decomposition) If X YZ, then X Y and X Z • X YZ • YZ Y (Using IR1) • X Y (Using IR3) • Similarly for X Z
Proofs • IR5. (Union) If X Y and X Z, then X YZ • X Y • X Z • X XY (Using IR1) • XY YZ (Using IR2) • X YZ (Using IR3) • IR6. (Psuedotransitivity) If X Y and WY Z, then WX Z • X Y • WY Z • WX WY (Using IR2) • WX Z (Using IR3)