380 likes | 390 Views
Learn about normalization, the problem of redundancy in database design, functional dependency, and decomposition. Discover the objectives of decomposition, normalization forms, and schema refinement. Explore vertical partitioning, denormalization, and horizontal decomposition. Follow the design steps for creating a better relational schema and understand the importance of normalization theory in achieving optimal relational design.
E N D
Outline • Database design: Normalization • Problem of redundancy • Why? • Functional dependency • How to solve? • Decomposition • Objective of the decomposition? • Normalized • Normal form • Schema refinement • Vertical partitioning and Antipartitioning • Denormalization • Horizontal decomposition
Design Steps • The design steps: • Real-World • ER model • Relational Schema 4. Better relational Schema 5. Relational DBMS • Guideline for Step (3) to step (4) • “design theory” for relations • Or “normalization theory”. • Why normalization theory • Automatic mappings from ER to relations may not produce the best relational design possible. • Database designers may go directly from (1) to (3), in which case, the relational design can be really bad.
Database Schema • A relation schema is a relation name and a set of attributes R(a int, b varchar[20]); • A relation instance for R is a set of records over the attributes in the schema for R.
The Evils of Redundancy • Redundancyis the root of several problems associated with relational schemas: • redundant storage • insert/delete/update anomalies • Consider relation obtained from Hourly_Emps: • Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked) • Notation: We will denote this relation schema by listing the attributes: SNLRWH • This is really the set of attributes Hourly_Emps {S,N,L,R,W,H}.
Example • Problems due to R W : • Update anomaly: • Can we change W in just the 1st tuple of SNLRWH? • Insertion anomaly: • Suppose W is not allowed null.What if we want to insert an employee and don’t know the hourly wage for his rating? • Deletion anomaly: • If we delete all employees with rating 5, we lose the information about the wage for rating 5!
Functional Dependencies (FDs) • A functional dependencyX Y holds over relation R if, for every allowable instance r of R: • i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.) • K is a key for relation R if: 1. K determines all attributes of R. 2. For noproper subset of K is (1) true. • If K satisfies only (1), then K is a superkey.
Example • Consider relation Hourly_Emps: • Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked) • FD is a key: • ssn is the key • S SNLRWH • FDs give more detail than the mere assertion of a key. • rating determines hrly_wages • R W
Decomposition of a Relation Scheme • A decompositionof R(A1 ... An) consists of replacing R by two or more relations such that: • Each new relation scheme contains a subset of the attributes of R • Implies that there exists no attributes that do not appear in R • Every attribute of R appears as an attribute of one of the new relations. • E.g., • SNLRWH can be decomposed into SNLRH and RW.
Problems with Decompositions • Three potential problems to consider: • Performance: Some queries become more expensive. • e.g., How much did sailor Joe earn? (salary = W*H) • Lossless Join Decompositions: Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation! • Fortunately, not in the SNLRWH example. • Dependency Preserving Decomposition: Checking some dependencies may require joining the instances of the decomposed relations. • Fortunately, not in the SNLRWH example. • Tradeoff: Must consider these issues vs. redundancy.
Lossless Join Decompositions • Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F: • X(r)Y(r)=r • It is always true that rX(r) Y(r) • In general, the other direction does not hold! If it does, the decomposition is lossless-join. • Definition extended to decomposition into 3 or more relations in a straightforward way.
Dependency Preserving Decomposition • Consider CSJDPQV, C is key, JP →C and SD → P. • BCNF decomposition: CSJDQV and SDP • Problem: Checking JP → C requires a join! • Dependency preserving decomposition (Intuitive): • If R is decomposed into X, Y and Z, and we enforce the FDs that hold on X, on Y and on Z, then all FDs that were given to hold on R must also hold. • No join is necessary to enforce original FD
Normalization • A relation is normalized if every interesting functional dependency X A involving attributes in R has the property that X is a key of R. • BCNF • Kent:“Each attribute must describe [an entity or relationship identified by] the key, the whole key, and nothing but the key.“ • No partial dependency • No transitive denepdency BCNF Partial dependency Transitive dependency
Example • Schema1 (not normalized): OnOrder1(supplier_id, part_id, quantity, supplier_address) • Schema 2 (Normalized): OnOrder2(supplier_id, part_id, quantity); Supplier(supplier_id, supplier_address);
Example #1 • Suppose that a bank associates each customer with his or her home branch. Each branch is in a specific legal jurisdiction(权限). • Is the relation R(customer, branch, jurisdiction) normalized?
Example #1 • What are the functional dependencies? • customer branch • branch jurisdiction • customer jurisdiction • Customer is the key, but a functional dependency exists where customer is not involved. • R is not normalized.
Example #2 • Suppose that a doctor can work in several hospitals and receives a salary from each one. • Is R(doctor, hospital, salary) normalized?
Example #2 • What are the functional dependencies? • doctor, hospital salary • The key is doctor, hospital • The relation is normalized.
Example #3 • Same relation R(doctor, hospital, salary) and we add the attribute primary_home_address. Each doctor has a primary home address and several doctors can have the same primary home address. Is R(doctor, hospital, salary, primary_home_address) normalized?
Example #3 • What are the functional dependencies? • doctor, hospital salary • doctor primary_home_address • doctor, hospital primary_home_address • The key is “doctor, hospital”, however doctor (a subset) determines one attribute, hence not normalized • A normalized decomposition would be: • R1(doctor, hospital, salary) • R2(doctor, primary_home_address)
Normal Forms • Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed! • If a relation is in a certain normal form(BCNF, 3NFetc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.
Normal forms Universe of relations 1 NF 2NF 3NF BCNF 4NF 5NF
Practical Schema Design • Identify entities in the application (e.g., doctors, hospitals, suppliers). • Each entity has attributes (an hospital has an address, a juridiction, …). • There are two constraints on attributes: • An attribute cannot have attribute of its own. • The entity associated with an attribute must functionally determine that attribute.
Practical Schema Design • Each entity becomes a relation • To those relations, add relations that reflect relationships between entities. • Worksin (doctor_ID, hospital_ID) • Identify the functional dependencies among all attributes and check that the schema is normalized: • If functional dependency AB C holds, then ABC should be part of the same relation.
Vertical Partitioning • Three attributes: account_ID, balance, address. • Functional dependencies: • account_ID balance • account_ID address • Two normalized schema design: • (account_ID, balance, address) or • (account_ID, balance) • (account_ID, address) • Which design is better?
Which design is better depends on the query pattern: The application that sends a monthly statement is the principal user of the address of the owner of an account The balance is updated or examined several times a day. The second schema might be better because the relation (account_ID, balance) can be made smaller: More account_ID, balance pairs fit in memory, thus increasing the hit ratio A scan performs better because there are fewer pages. Vertical Partitioning
Tuning Normalization • Rule of thumb: • A single normalized relation XYZ is better than two normalized relations XY and XZ if X, Y and Z are frequently accessed together • Why? No costly join operations. • The two-relation design is better iff: • Users access tend to partition between the two sets Y and Z most of the time • Attributes Y or Z have large values (one-third the page size or larger) • Too many accesses across pages
R (X,Y,Z) X is an integer YZ are large strings Scan Query Vertical partitioning exhibits poor performance when XYZ are accessed together. Vertical partitioning provides a speedup if only two of the attributes (XY) are accessed. Vertical Partitioningand Scan
R (X,Y,Z) X is an integer YZ are large strings A mix of point queries access either XYZ or XY. Vertical partitioning gives a performance advantage if the proportion of queries accessing only XY is greater than 20%. The join is not expensive compared to a simple look-up. Vertical Partitioningand Point Queries
Vertical Antipartitioning • Brokers’ bond-buying decisions are based on the price trends of those bonds. • The database holds the closing price for the last 3000 trading days, however the 10 most recent trading days are especially important. • (bond_id, issue_date, maturity, …)(bond_id, date, price) Vs. • (bond_id, issue_date, maturity, today_price, …10dayago_price)(bond_id, date, price)
Tuning Denormalization • Denormalizing means violating normalization for the sake of performance: • Denormalization speeds up performance when attributes from different normalized relations are often accessed together • Denormalization hurts performance for relations that are often updated.
Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); region( R_REGIONKEY, R_NAME, R_COMMENT ); nation( N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT,); supplier( S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT); 600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers Denormalizing -- data
Denormalizing -- transactions lineitemdenormalized ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT, L_REGIONNAME); • 600000 rows in lineitemdenormalized • Cold Buffer • Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Queries on Normalized vs. Denormalized Schemas Queries: select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, R_NAME from LINEITEM, REGION, SUPPLIER, NATION where L_SUPPKEY = S_SUPPKEY and S_NATIONKEY = N_NATIONKEY and N_REGIONKEY = R_REGIONKEY and R_NAME = 'EUROPE'; select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME from LINEITEMDENORMALIZED where L_REGIONNAME = 'EUROPE';
TPC-H schema Query: find all lineitems whose supplier is in Europe. With a normalized schema this query is a 4-way join. If we denormalize lineitem and introduce the name of the region for each lineitem we obtain a 30% throughput improvement. Denormalization
Horizontal Decompositions • Relation decomposition: Relation is replaced by a collection of relations that are projections. • Horizontal Decomposition:Replace relation by a collection of relations that are selections. • Each new relation has same schema as the original, but a subset of the rows. • Collectively, new relations contain all rows of the original. Typically, the new relations are disjoint.
Horizontal Decompositions (Contd.) • Suppose that contracts with value > 10000 are subject to different rules. This means that Contracts with val>10000will often be queried • One way to deal with this is to build a clustered B+ tree index on the val field of Contracts. • A second approach is to replace contracts by two new relations: LargeContracts and SmallContracts, with the same attributes • Performs like index on such queries, but no index overhead. Can build clustered indexes on other attributes, in addition!
Masking Conceptual Schema Changes • The replacement of Contracts by LargeContracts and SmallContracts can be masked by the view. • However, queries with the condition val>10000 must be asked wrt LargeContracts for efficient execution: so users concerned with performance have to be aware of the change CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val) AS SELECT * FROM LargeContracts UNION SELECT * FROM SmallContracts