CS4433 Database Systems

CS4433Database Systems Relational Design

Why Do We Learn This? • We have designed ER diagram, and translated it into a relational database schema R (or a set of R1, R2, ...) • Now what? • We can do the following • specify all relevant constraints over R • implement R in SQL • start using it, making sure the constraints always remain valid • However, R may not be well-designed, thus causing us a lot of problems • Well designed: • Eliminate Data Redundancy: the same piece of data shall not be stored in more than one place. This is because duplicate data not only waste storage spaces but also easily lead to inconsistencies. • Ensure Data Integrity and Accuracy:

Is This a Good Design? • Potential problems • Redundancy • Update anomalies • Deletion anomalies

How do We Obtain a Good Design? • Start with the original database schema R • Transform it until we get a good design R* • Desirable properties for R* • must preserve the information of R • must have minimal amount of redundancy • must be dependency-preserving • if R is associated with a set of constraints C, then it should be easy to also check C over R* • (must also give good query performance)

OK, But …… • How do we recognize a good design R*? • How do we transform R into R*? • What we need is the “theory” of … Normal Forms

Normal Forms • DB gurus have developed many normal forms • Most important ones • Boyce-Codd (BCNF), 3rd (3NF), and4th(4NF) normal forms • If R* is in one of these forms, then R* is guaranteed to achieve certain good properties • e.g., if R* is in Boyce-Codd NF, it is guaranteed to not have certain types of redundancy • DB gurus have also developed algorithms to transform R into R* that is in some of these normal forms

Normal Forms • DB gurus have also discussed trade-offs among normal forms • Thus, all we have to do is • learn these forms • transform R into R* in one of these forms • carefully evaluate the trade-offs • Many of these normal forms are defined based on various constraints • functional dependencies and keys

Behind the Scene: Know Whom We Should Blame?

Our Plan • Motivation • Functional dependencies & keys • Reasoning with FDs and keys • Desirable properties of schema refinement • Various normal forms and the trade-offs • BCNF, 3rd normal form, 4th normal form, etc. • Putting all together: how to design DB schema

Better Designs Exist

Functional Dependencies • A form of constraint (hence, part of the schema) • a relationship between two attributes, typically between the PK and other non-key attributes within a table. • For any relation R, attribute Y is functionally dependent on attribute X (usually the PK), if for every valid instance of X, that value of X uniquely determines the value of Y. This relationship is indicated by the representation below : • X ———–> Y • The left side of the above FD diagram is called thedeterminant, and the right side is the dependent. • Finding them is part of the database design • Used heavily in schema refinement

Example EmpID Name, Phone, Position Position Phone but Phone Position

Composite Determinants • Composite determinant = a determinant of a functional dependency that consists of more than one attribute (StudentName, ClassName)  (Grade)

Functional Dependency Rules • If A  (B, C), then A  B and A C. • If (A,B)  C, then neither A nor B determines C by itself.

Functional Dependencies • An FD is a statement about all allowable relations • Must be identified based on semantics of application • Given some instance r1of R, we can check if r1 violates some FD f, but we cannot determine if f holds over R • Question: How is FD related to keys? • if “K  all attributes of R” then K is a superkeyfor R • does not require K to be minimal • FDs are a generalization of keys

Functional Dependencies Q: From this, can you conclude phone  SSN?

Relation Keys • After defining FDs, we can now define keys • Key of a relation R is a set of attributes that • functionally determines all attributes of R • NONE of its subsets determines all attributes of R • A composite key • a key that consists of two or more columns. • Superkey • a set of attributes that contains a key • We will need to know the keys of the relations in a DB schema, so that we can refine the schema

Candidate and Primary Keys • A candidate key is a key that determines all of the other columns in a relation. • A primary key is a candidate key selected as the primary means of identifying rows in a relation. • There is only one primary key per relation. • The primary key may be a composite key. • The ideal primary key is short, numeric, and never changes.

Foreign Keys • A foreign key is the primary key of one relation that is placed in another relation to form a link between the relations. • A foreign key can be a single column or a composite key. • The term refers to the fact that key values are foreign to the relation in which they appear as foreign key values.

The Referential Integrity Constraint • A referential integrity constraint is a statement that limits the values of the foreign key to those already existing as primary key values in the corresponding relation. SKU_DATA (SKU, SKU_Description, Department, Buyer) ORDER_ITEM (OrderNumber, SKU, Quantity, Price,ExtendedPrice) Where ORDER_ITEM.SKU must exist in SKU_DATA.SKU

Foreign Key with aReferential Integrity Constraint NOTE: The primary key of the relation is underlined and any foreign keys are in italics in the relations below: SKU_DATA (SKU, SKU_Description, Department, Buyer) ORDER_ITEM (OrderNumber, SKU, Quantity, Price,ExtendedPrice) Where ORDER_ITEM.SKU must exist in SKU_DATA.SKU

Finding the Keys of a Relation • Given a relation constructed from an E/R diagram, what is its key? • If the relation comes from an entity set, the key of the relation is the set of attributes which is the key of the entity set • If the relation comes from a many-many relationship, the key of the relation include the set of all attribute keys in the relations corresponding to the entity sets (and additional attributes if necessary) • Many-one relationship, weak-entity set, multi-way relationship ……

Functional Dependencies in the SKU_DATA Table SKU  (SKU_Description, Department, Buyer) SKU_Description  (SKU, Department, Buyer) Buyer  Department

Functional Dependencies in the ORDER_ITEM Table (OrderNumber, SKU)  (Quantity, Price, ExtendedPrice) (Quantity, Price)  (ExtendedPrice)

Example 3.1.1 • Consider a relation about people in the United States, including their name, Social Security number, street address, city, state, ZIP code, area code, and phone number (7 digits).What FD's would you expect to hold? What are the keys for the relation? • To answer this question, you need to know something about the way these numbers are assigned. For instance, can an area code straddle two states? Can a ZIP code straddle two area codes? Can two people have the same Social Security number? Can they have the same addressor phone number Consider a relation about people in the United States, including their name, Social Secu- rity number, street address, city, state, ZIP code, area code, and phone number (7 digits). Consider a relation about people in the United States, including their name, Social Secu- What FD's would you expect to hold? What are the keys for the relation? To answer this rity number, street address, city, state, ZIP code, area code, and phone number (7 digits). question, you need to know something about the way these numbers are assigned. For What FD's would you expect to hold? What are the keys for the relation? To answer this instance, can an area code straddle two states? Can a ZIP code straddle two area codes? question, you need to know something about the way these numbers are assigned. For Can two people have the same Social Security number? Can they have the same address instance, can an area code straddle two states? Can a ZIP code straddle two area codes? or phone number? Can two people have the same Social Security number? Can they have the same address or phone number

Solution • Social Security number  name • Area code  state • Street address, city, state zipcode • personal number → name #a person could have several homes, and several phones • phone number → zipcode #a phone is generally only signed to one person • street address, city → county, zipcode • city → county • Possible keys are any attributes on the left hand side of the arrows • {Social Security number, street address, city, state, area code, phone number} • Need street address, city, state to uniquely determine location. A person could have multiple addresses. The same is true for phones. These days, a person could have a landline and a cellular phone personal number → name #a person could have several homes, and several phones phone number → personal number #a phone is generally only signed to one person street address, city → county, postal number city → county Possible keys are any attributes on the left hand side of the arrows

Closure of FD sets • Given a relation schema R & a set S of FDs • is a new FD f logically implied by S? • Example • R = {A,B,C,G,H,I} • S = A B, A C, CG  H, CG  I, B  H • would A  H be logically implied? • yes (you can prove this, using the definition of FD) • Closure of S: S+= all FDs logically implied by S • How to compute S+? • we can use Armstrong's axioms

Armstrong's Axioms • Reflexivity rule • A1A2...An  a subset of A1A2...An • ABC  AB ABC  AC ABC  A … • Augmentation rule • A1A2...An  B1B2...Bm, then A1A2...An C1C2..Ck  B1B2...Bm C1C2...Ck • A  B AC  BC • Transitivity rule • A1A2...An  B1B2...Bm and B1B2...Bm C1C2...Ck, thenA1A2...An  C1C2...Ck • A  B and B  C then A  C

Inferring S+ using Armstrong's Axioms • S+ := S • Loop • foreach f in S, apply reflexivity and augmentation rules • add the new FDs to S+ • foreach pair of FDs in S, apply the transitivity rule • add the new FD to S+ • Until S+ does not change any further

Additional Rules • Union rule • X  Y and X  Z, then X  YZ • (X, Y, Z are sets of attributes) • Decomposition rule • X  YZ, then X  Y and X  Z • Pseudo-transitivity rule • X  Y and YZ  U, then XZ  U • These rules can be inferred from Armstrong's axioms • X  Y and YZ  U, then XZ  U • XZ  YZ(augmentation) • XZ  YZ  U • XZ  U (Transitivity)

Closure of a Set of Attributes • Given a set of attributes {A1, …, An} and a set of dependencies S • find all attributes B such that any relation which satisfies S also satisfies: A1, …, An  B • The closure of {A1, …, An}, denoted {A1, …, An}+ , is the set of all such attributes B • Usage • Test if X= {A1, …, An} is a superkey • compute X+, and check if X+ contains all attributes of R • Check if X  Y holds • by checking if Y is contained in X+

Algorithm B , B , … B Start with X={A1, …, An} Repeat until X doesn’t change do: if is in S, and are all in X, and C is not in X then add C to X C n B , B , … B 1 2 n 2 1

For Relation R(A,B,C,D,E,F) with FD’s A B  C • A D  E • B  D • A F  B • What is the closure of {A,B} and {A,F}: Closure of {A,B}: X = {A, B, C, D, E} Closure of {A, F}: X = {A, F, B, D, C, E}

Example • R = {A, B, C, D, E} • F = { B CD, D  E, B  A, E  C, AD B } • Is B  E in F+ ? B+ = B B+ = BCD B+ = BCDA B+ = BCDAE … Yes! and B is a key for R too! • Is D a key for R? D+ = D D+ = DE D+ = DEC … Nope! • Is AD a key for R? AD+ = AD • AD+ = ABD and B is a key, so Yes! • And… • A+ = A, D+ = DEC • … A,D not keys, so Yes! • Is ADE a candidate key for R? • … No! AD is a key, so ADE is a superkey, but not a candidate key • What is F+

Why do we study F.D.s and Keys Together? • Keys are (special) F.D.s • FSUID  Name, Age, Department • “Good” F.D.s (like keys) and “bad” F.D.s • Key (good) • FSUID  Name, Department, Chair • F.D. but not key (bad) – make redundancy • Department Chair

Normal Forms • Desirable Properties of Schema Refinement • minimize redundancy • avoid info loss • preserve dependency • ensure good query performance • Convert to better design • Explore problem • Decomposition • Convert to normal forms • Normal Forms • first normal form = all attributes are atomic • second normal form (2NF) = old and obsolete • Boyce Codd normal form (BCNF) • third normal form (3NF) • others……

Boyce-Codd Normal Form A simple condition for removing anomalies from relations: A relation R is in BCNF if and only if: • Whenever there is a nontrivial FD A1, A2, …, An B for R , it is the case that {A1, A2, …, An} is a super-key for R In English (though a bit vague): • Whenever a set of attributes of R is determining another attribute, it should determine all attributes of R

Example What are the dependencies? SSN  Name (non-key f.d.) What are the keys? *Phone Is it in BCNF? *No, SSN does not determine Phone.

Decompose It Into BCNF SSN  Name: SSN is key No F.D. at all ! Question: How can I find the phone information of Joe?

What about this? Name  Price, Category

BCNF Decomposition • Find a functional dependency that violates the BCNF condition • A1, A2, …, An B1, B2, …, Bn • Heuristics: choose B1, B2, … Bn“as large as possible” Decompose: Continue until there are no BCNF violations left. A’s Others B’s R1 R2

Example Decomposition SSN  Name, Age, EyeColor BCNF decomposition: Person (SSN, Name, Age, EyeColor) Phone-Info (SSN, PhoneNumber) What if we also had an attribute Draft-worthy, and the FD: Age Draft-worthy

BCNF Decomposition: The Algorithm Input: relation R, set S of FDs over R 1) Compute S+ 2) Compute keys for R (from ER or from S+) 3) Use S+ and keys to check if R is in BCNF, if not: a) pick a violating FD f: A  B b) expand B as much as possible, by computing A+ c) create R1 = A union B, R2 = A union (others in R) d) compute all FDs over R1, using R and S+, then compute keys for R1. Repeat similarly for R2 e) Repeat Step 3 for R1 and R2 4) Stop when all relations are BCNF, or are two-attributes

Properties of BCNF • BCNF removes certain types of redundancy • those caused by adding many-many or one-many relations • For examples of redundancy that it cannot remove, see "multivalued redundancy" • BCNF avoids information loss • the decomposition is dependency preserving • Questions • Is BCNF unique? • Does BCNF always exist? • No • Example: R(A,B,C) F = {AB C, C B}

BCNF Decomposition Not Unique • Suppose several dependencies violate BCNF. • Depending on which is used to guide the next decomposition, might get different collections of decomposed relations • Normalization theory does not help us decide among alternatives: designer must consider alternatives and choose based on semantics of the application

Lossless Decompositions A decomposition is lossless if we can recover: R(A,B,C) { R1(A,B) , R2(A,C) } R’(A,B,C) = R(A,B,C) Decompose Recover R’ is in general larger than R. Must ensure R’ = R

Lossy Decomposition: An Example FD’s: AB; C  B Decompose Recover

Lossless Decomposition: An Example FD’s: AB; C  B Decompose Recover But, now we can’t check A  B without doing a join!

Preserving Dependencies • What if, when a relation is decomposed, X of an XY ends up only in one of the new relations and Y ends up only in another? • Such a decomposition is not “dependency-preserving” • Goal: Always have FD-preserving decompositions • BCNF is not always dependency preserving • In fact, some times we cannot find a BCNF decomposition that is dependency preserving • Can handle this situation using 3NF

Example FD’s: Unit  Company; Company, Product Unit So, there is a BCNF violation, and we decompose Unit  Company No FD’s

CS4433 Database Systems

CS4433 Database Systems

Presentation Transcript

Database Systems

DATABASE SYSTEMS

Database Systems

Database Systems

Database Systems

Database Systems

Database Systems

Database Systems

Database Systems

Database Systems

Database Systems

Database Systems