670 likes | 684 Views
Learn about 2NF, 3NF, and BCNF in database normalization, including prime attributes, transitive dependencies, conversion steps, and the importance of avoiding undesirable dependencies for effective database design.
E N D
Fundamentals/ICY: Databases2010/11WEEK 10 John Barnden Professor of Artificial Intelligence School of Computer Science University of Birmingham, UK
But 2NF can still haveUndesirable Dependencies • A prime attribute is one that is within some candidate key • (not necessarily the primary key). • A transitive dependency is where thedeterminant Dis at least partially outside the PK and is not a superkey, • and the determined attribute X is non-prime (and therefore in particular is not inside the PK; the reason for this restriction is on a later slide). • E.g.: previous Figure for simple case of a simple (= one-attribute) determinant. • Above definition is partly based on Garcia-Molina, Ullman & Widom 2009 – see later ref. More general than the account in R&C and R,C&C.
Third Normal Form • A table is in third normal form (3NF)if: • It is in 2NF and • It contains no transitive dependencies
Conversion to 3NF • For each determinantD involved in a transitive dependency in the original table T, • use D as the PK for a new tableNT(D) • and move out the attributes X transitively determined by Dinto NT(D). • NB: the determinants themselves stay in T as well.
Third Normal Form (3NF) Conversion Results on previous example
The Boyce-Codd Normal Form (BCNF) • Determinants of partial and transitive functional dependencies are not superkeys. So the corresponding normalization gets rid of some non-superkey determinants of functional dependencies. • Normalization into BCNF gets rid of all such determinants. • A table is in BCNF if it’s in 1NF and every determinant of a functional dependency is a superkey • i.e., every attribute-set that determines anyother attribute determines all the attributes, so there’s no redundancy problem
A Table in 3NF but not in BCNF The dependency is NOT TRANSITIVE since B is prime
Decomposition to BCNF The middle diagram shows that changing the PK so as to include C instead of B changes the dependency into a partial one, which can then be removed in the usual way.
((Note: A Simple Form of BCNF)) • Any simple (= one-attribute) superkey is a candidate key. • So BCNF requires all simple determinants to be candidate keys. • Some books (incl. R&C and R,C&C) define BCNF to mean that “all [simple] determinants are candidate keys”. • This is a simpler, less general form of BCNF. • A table could be in simple-BCNF but not be in full BCNF. • My definition of (full) BCNF is from Garcia-Molina, Ullman & Widom, Database Systems: The Complete Book, 2nd. Ed., Pearson, 2009. • This book also gives a process for conversion to full BCNF.
BCNF versus 3NF • BCNF implies that there are no partial or transitive dependencies, so a table that is in BCNF is also in 3NF. • ((If a table is in 3NF but not BCNF then each of the non-superkey determinants D is partly outside the PK and determines only prime attributes. • If also the PK is the only candidate key, then: • the attributes determined by each D must all be in the PK; • but they cannot cover all of the PK (otherwise D would be a superkey). So the PK must be composite.))
Reason for Prime-X Exclusion in Transitive Dependencies • Earlier we said that in a transitive dependency the determined attribute X is non-prime (i.e. not within a candidate key). The reason is: • In removing a transitive dependency, we delete the dependent attribute X from the original table. If X were within the primary key (special case of candidate key), that key would therefore be disrupted, and this would affect other tables referencing the table. But non-primary candidate keys are also sometimes used for such referencing, and are then called secondary keys. So if X were in such a key, the conversion to 3NF would disrupt the referencing. • So, to keep things simple for the purposes of 3NF, prime Xs are banned across the board from the notion of transitive dependency.
((Inter-Table Reference Disruption contd.)) • NB: Conversion to 2NF can, and from 3NF to BCNF does, remove dependent prime attributes, so is potentially disruptive of inter-table reference. • However, it’s relatively unlikely to be a problem in conversion to 2NF, because, in partial dependencies, the dependent attributes are not normally prime at all, and are in any case necessarily outside the primary keys, which are the main tool for inter-table reference. • If a 3NF table is not in BCNF then the troublesome dependencies necessarily involve prime Xs, because if the X is non-prime then a dependency with a non-superkey determinant must either be partial or transitive.
((3NF and Reference Disruption contd.)) • Some textbooks (e.g., Connolly and Begg, Database Systems, Pearson, 2010) only require transitive dependencies to avoid non-primary-key attributes, rather than to avoid all prime attributes. In that case, conversion to 3NF can disrupt references using a secondary key. But at least the cases of 2NF and 3NF are now more similar to each other. • I haven’t seen a version of 2NF that is only concerned with non-prime Xs. But don’t be too surprised if you come across that!
Fourth Normal Form (4NF) • About a different sort of issue from 2NF/3NF/BCNF. • Those NFs are concerned with the redundancy from functional dependencies (FDs). • 4NF is concerned with redundancy resulting from multivalued dependencies (MVDs).
Fourth Normal Form (4NF), contd. • A multivalued dependency of some attribute X on an attribute-set D is like a functional dependency except that X sometimes has several values for a given value of D. • The crucial point is that once the D value is specified, the X values are independent of other attributes. • So, we can think of X as a multivalued attribute implemented by putting different values in different rows, where the setof X values is fully determined by just the value of D. • E.g.: imagine multivalued car-colour being determined by just the make and year of the car.
Notes re Multivalued Dependencies • Caution: some books take functional dependencies to be just a special case of multivalued dependencies. So all dependencies are technically “multiple”, but some actually involve multiplicity and some don’t. • The determinant D in a (truly) multivalued dependency cannot be a superkey, because if it were then there could only be one X value per D value. • The D/X association doesn’t violate 2NF, 3NF or BCNF because it’s not a functional dependency. • “Trivial” multivalued dependencies include those where D together with X forms a superkey (including the case where there are no other attributes). Trivial MVDs avoid the problem on the next slide.
Problems with a Single MVD • Suppose there is an attribute Z (different from X) that is not determined by Dtogether with X, such as SIZE. (Hence, also, Z is not determined by D by itself.) Then there are different represented objects (e.g. cars) with different values of Z but the same value of D and X, and each such object needs to have rows in the table to cover all the different values of X (e.g., red, blue and green) associated with that value of D. • So we get redundancy of representation of the D/X association (same problem as with e.g. partial and transitive dependencies, but now worse because of the multi-valuedness of X). • Notice that the above situation can only happen if the MVD is non-trivial. If the MVD were trivial you wouldn’t be able to have a Z as above.
Problems with a Single MD, contd. • Just the problem covered earlier in module concerning car-colour: if there is another attribute Y in the table and Y is determined by D , then: • either it has a value repeated in all the different rows holding the different X values for a single D value, so we get redundancy of the representation of the D/Y association • or if, say, NULLs are used instead of repeating the Y value, we get extra manipulation complexity in handling/maintaining Y.
Problems with a SingleMD, contd. • But note that problem 2 is prevented from arising if the table is in BCNF, because D has to be a non-superkey determinant (of Y), and this is disallowed by BCNF. • Similarly, get some such protection from problem 2 if the table is in 3NF or just 2NF. • But BCNF etc. don’t prevent either problem 1 or special problems arising from the interaction of different multivalued dependencies.
What about Multiple MDs • Example: an employee may be assigned to several work assignments and may, independently of that, help several different charitable organizations. • If we try to use one table, we have • a multivalued dependency of assignment on (say) employee-id • a multivalued dependency of charitable-org on employee-id
Three Ways of Trying to Encode the two multivalued dependencies (Figure no. shown is from R&C 6th ed. It is 5.10 in 7th ed, and Fig. 7.10 in R,C&C.)
Problems with Multiple MDs • Those methods cause wasted space, redundancy, and/or additional manipulation complexity (with distinct possibility of getting the manipulation wrong). • ((Because of NULL values it may be difficult to define a good or any PK. May need to replace NULLs by some other special value.))
Fourth Normal Form • [R,C&C and R&C:] A table is in 4NF if • It is in 3NF and • It does not have multiplemultivalued dependencies • [Garcia-Molina et al.:] A table is in 4NF if • It is in BCNF • It does not have anynon-trivialmultivalued dependencies
A Set of Tables in 4NF (Figure no. shown is from 6th ed. of textbook. It is 5.11 in 7th ed., and 7.11 in R,C&C)
Notes on the Resulting Tables • Tables ASSIGNMENT and SERVICE_V1 are bridging tables. • The PK of SERVICE_V1 consists of both its attributes. • The PK of ASSIGNMENT is meant to be ASSIGN_NUM. But note that the other 2 columns also form a candidate key. • Each of the tables in the diagram is in 4NF, under both definitions of 4NF. • Each table is in BCNF (and hence 3NF), and • The only tables containing MVDs are ASSIGNMENT and SERVICE_V1, and • In each of these tables, there is only one MVD, with determinant = EMP_NUM, and • Each of these MVDs is trivial: the attributes involved in it (the “D” together with the “X”) is a superkey.
That’s the End of the Optional 4NF Material Back to Obligatory Material
Normal Forms Overall • Let “<” mean “provides less protection than”. Then: • 1NF < 2NF < 3NF < BCNF ((and 3NF <4NF)) • ((Also BCNF < 4NF under the second definition of 4NF. • Exercise: is there a < relationship between the two forms of 4NF?)) • BCNF and 4NF guard against relatively unusual situations. For most business database design purposes, 3NF is highest we need to go. • On the other hand, merely requiring 2NF is now unusual. So 3NF is the normal target. • BCNF is more disruptive to achieve than 2NF and 3NF.
Normal Forms Overall, contd • BCNF is more disruptive to achieve than 2NF or 3NF: • BCNF may require the PK to be changed, but conversion to 2NF or 3NF never does so. • Conversion from 3NF to BCNF always removes prime attributes, including possibly some PK attributes, perhapsdisrupting inter-table reference. • Conversion to 2NF only sometimes removes prime attributes, and can only do so if they are non-PK, so it has less danger of disrupting inter-table reference. • Conversion from 2 NF to 3NF has no such danger.
Non-Normalization/Denormalization • Normalization leads to more tables. • Joining larger number of tables takes additional disk input/output (I/O) operations, additional manipulation complexity, and possibly substantial communication delays. • Conflicts among design principles, information requirements, and processing speed are often resolved through compromises that may include ending up with some non-normalized tables.
Relational algebra Defines theoretical way of manipulating tables using “relational operators” that mainly manipulate the relations in the tables. SELECT PROJECT JOIN (various sorts) INTERSECT Use of relational algebra operators on existing tables produces new tables UNION DIFFERENCE PRODUCT ((DIVIDE)) Relational Database Operators
Relational Operators (continued) • Select [better name would be Select-Rows] • Yields a table S whose rows are some (or all) of the rows in the single given table T, preserving duplicates. • S’s rows could be all of T’s, but more usually are those that satisfy some specified criterion. • Yields a “horizontal subset” of T (“collection of horizontal slices” would be a better term). • Does not itself reduce the set of columns.
Select (continued) • SQL: • SELECT * FROM … WHERE … • Note: it’s the WHERE part that is actually doing the selection according to a criterion. • Relational algebra notation in handout: • Result table is C(T) where T is the given table and C is the selection criterion. • More compact than SQL notation. Avoids notation private to particular versions of particular programming languages.
Relational Operators (continued) • Project [better name would be Select-Columns] • Yields a table S whose columns are a specified subset of the columns of the single given table T, and whose rows contain the corresponding values from all of T’s rows. • The operation may create duplication even when none present in original table. • Yields a “vertical subset” of T [better: “set of vertical slices”]
Project (continued) • SQL: • SELECT [DISTINCT]…FROM … • Note: it’s the … just after the SELECT that is actually doing the projection (“selection” of specified columns)!! • Relational algebra notation in handout: • Result table is X(T) where T is the given table and X is the list of selected attributes (columns). • But this always removes row duplications from the result, and so does not exactly correspond to the full DB notion of projection.
Relational Operators (continued) • Union and its All version • Intersect and its All version • Difference and its All version • The given tables must have compatible value domains.
Union and Intersection (continued) • SQL: • UNION, INTERSECT, EXCEPT (or MINUS) • UNION ALL, INTERSECT ALL, EXCEPT (or MINUS) ALL • Relational algebra notation in handout: • Result tables are T1T2, T1T2 and T1 T2 where T1 and T2 are the given tables. • Maths of relations: • Result relations are R1 R2, R1 R2 and R1 R2in the non-ALL cases. where R1and R2 are the relations in the given tables. • Problem: relations don’t account for duplicates of rows, so don’t handle the ALL versions.
Relational Operators (continued) • Join • Allows us to join related rows from two or more tables • It’s an important feature of the relational database idea • Joining has been implicitly important in some of the module handouts.
Relational Operators (continued) • Productor Cross Join • Yields a table containing all concatenations of whole rows from first given table with whole rows from second given table. • The tables can have completely different attribute characteristics. • But if attributes are shared, they’re repeated in the product.
Product If second table also had a PRICE attribute, then the product would have a Table1.PRICE attr. and a Table2.PRICE attr.
So, I want … • ….. to define the non-standard notion of “flattened Cartesian product” of two sets A and B, applicable when the members of A and B are already tuples. Notated by the symbol (underlined multiplication symbol). • A B = the set of tuples that are the concatenations of members of A and members of B. • E.g., if <a,b,c> is in A and <d,e,f> is in B then <a,b,c,d,e,f> is in A B.
Contd. • If A is the People relation and B is the Organizations relation, and • A has members of form E156, ‘Sam’, ‘Finks’, I678> and • B has members of form I459, ‘Dell’, ‘UK’> • THEN • A B has members of form • E156, ‘Sam’, ‘Finks’, I678>, I459, ‘Dell’, ‘UK’> > • BUT • A B has members of form • E156, ‘Sam’, ‘Finks’, I678, I459, ‘Dell’, ‘UK’>
Product or Cross Join (continued) • SQL: • SELECT * FROM …two [or more] tables … • NB: it’s the mere listing of the tables that does the Product, but it’s possible also to write: • SELECT * FROM T1 CROSS JOIN T2 CROSS JOIN ... • Relational algebra notation: • Result table is T1 T2 where T1 and T2 are the given tables. • Maths of relations: • Result relationis R1 R2 where R1and R2 are the relations in the given tables. • Problem: relations don’t account for duplicates of rows.
Product or Cross Join (continued) • CAUTION: • Note the use of the flattenedCartesian product R1 R2. The ordinary Cartesian product would be incorrect, even though in DB lingo the word Product might suggest Cartesian product and people do sometimes say Cartesian product… • … and even though in relational algebra the ordinary multiplication symbol is used, which again makes it look like a Cartesian product.
Natural Join • Links two tables by finding rows in each that match on the attributes the two tables have in common (if any), and then concatenatingthe matching rows together in a natural way but removing repetitions of the shared attributes. • The common attributes or columns are called the join attributes or columns): just the AGENT_CODE attribute in above example • Can be thought of as the result of a three-stage process: • the PRODUCT of the tables is created • a SELECT is performed on the resulting table to yield only the rows for which the join-attribute values (e.g. AGENT_CODE values) are equal • a PROJECT is now performed to yield a single copy of each join attribute, thereby eliminating duplicate columns
Natural Join, Step 1: PRODUCT Note the two AGENT_CODE columns