Database Normalization Revisited: An Information-Theoretic Approach
Leonid Libkin
Joint work with Marcelo Arenas and Solmaz Kolahi
Sources • M. Arenas, L. An information-theoretic approach to normal forms for relational and XML data, PODS’03; J.ACM, 2005. • S. Kolahi, L. Redundancy vs dependency-preservation in normalisation: an information-theoretic analysis of 3NF, PODS’06.
Outline • Part 1: Database normalization from the 1970s and 1980s • Part 2: Classical theory redone: a new justification for normal forms: • BCNF and relatives (academic; eliminate all redundancies) • 3NF (practical; may leave some redundancies) • Part 3: An XML application
If you haven’t taught “Intro to DB” lately… • Design: decide how to represent the information in a particular data model. • Even for simple application domains there are many ways of representing the data of interest. • We have to design the schema of the database: • Set of relations. • Set of attributes for each relation. • Set of data dependencies.
Normalization Theory Today • Normalization theory for relational databases was developed in the 70s and 80s. • Why do we need normalization theory today? • New data models have emerged, notably XML. • XML documents can contain redundant information. • Redundant information in XML documents: • Can be discovered if the user provides semantic information. • Can be eliminated.
Designing a Database: An Example • Attributes: number, title, section, room. • Data dependency: every course number is associated with only one title. • Relational schema: BAD alternative: a single table over all four attributes.
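The slide's table is not reproduced in this text; a hypothetical instance of the single-table design (titles and rooms invented for illustration) shows why it is bad: the title is repeated once per section.

number   title            section   room
------   ---------------  -------   -----
CSC258   Logic Design     1         LP266
CSC258   Logic Design     2         GB220
CSC434   Data Management  1         GB248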
Deletion Anomaly CSC434 is not given in this term. Additional effect: all the information about CSC434 was deleted.
Avoiding Update Anomalies The instance does not store redundant information. CSC434 is not given in this term: the title of CSC434 is not removed from the instance.
Normalization Theory • Main idea: a normal form defines a condition that a well-designed database should satisfy. • Normal form: a syntactic condition on the database schema. • Defined for a class of data dependencies. • Main problems: • How to test whether a database schema is in a particular normal form. • How to transform a database schema into an equivalent one satisfying a particular normal form.
BCNF: a Normal Form for FDs • Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}. • X → Y: two rows with the same X-values must have the same Y-values. • Number → Title in our example. • Key dependency: X → A1 … An. • X is a key: two distinct rows must have distinct X-values.
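As a concrete reading of the definition, here is a minimal sketch (not from the paper; rows and values are the hypothetical ones above) that checks whether an FD X → Y holds in an instance:

# Rows are dicts from attribute names to values; an FD X -> Y holds
# iff any two rows that agree on X also agree on Y.
def satisfies_fd(rows, X, Y):
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        if x in seen and seen[x] != y:
            return False
        seen[x] = y
    return True

courses = [
    {"number": "CSC258", "title": "Logic Design", "section": 1, "room": "LP266"},
    {"number": "CSC258", "title": "Logic Design", "section": 2, "room": "GB220"},
]
print(satisfies_fd(courses, ["number"], ["title"]))  # True
print(satisfies_fd(courses, ["number"], ["room"]))   # False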
BCNF: a Normal Form for FDs • Σ is a set of FDs over R(A1, …, An). • A relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key. • A relational schema is in BCNF if every relation schema in it is in BCNF.
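The BCNF condition can be tested directly once we can compute attribute closures. The sketch below is the standard textbook algorithm, not code from the paper; it uses the known fact that checking the FDs given in Σ suffices.

# Attribute closure of attrs under the given FDs (pairs of sets).
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, Y in fds:
            if X <= result and not Y <= result:
                result |= Y
                changed = True
    return result

def is_bcnf(all_attrs, fds):
    # Every nontrivial X -> Y in the given set needs X to be a key.
    for X, Y in fds:
        if Y <= X:                        # trivial FD
            continue
        if closure(X, fds) != all_attrs:  # X is not a key
            return False
    return True

schema = {"number", "title", "section", "room"}
fds = [({"number"}, {"title"})]
print(is_bcnf(schema, fds))  # False: number -> title is nontrivial, number is not a key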
BCNF Decomposition • Relation schema: R(X,Y,Z) with Σ. • Not in BCNF: Σ implies X → Y but not X → A for every A ∈ Z. • Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z). • Example: the course schema, decomposed on the next slide.
Lossless Decomposition • S = ∏number, title(R) and T = ∏number, section, room(R). • Lossless: S ⋈ T, the natural join of the two projections, recovers exactly the original instance R.
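A small sketch of the basic decomposition, using the same hypothetical rows as before; since number → title holds, joining the two projections recovers the original instance exactly:

def project(rows, attrs):
    # Project onto attrs, eliminating duplicate rows.
    seen = {tuple(sorted((a, r[a]) for a in attrs)) for r in rows}
    return [dict(t) for t in seen]

def natural_join(rs, ss):
    # Pair up rows that agree on all shared attributes.
    common = set(rs[0]) & set(ss[0])
    return [{**r, **s} for r in rs for s in ss
            if all(r[a] == s[a] for a in common)]

R = [
    {"number": "CSC258", "title": "Logic Design", "section": 1, "room": "LP266"},
    {"number": "CSC258", "title": "Logic Design", "section": 2, "room": "GB220"},
    {"number": "CSC434", "title": "Data Management", "section": 1, "room": "GB248"},
]
S = project(R, ["number", "title"])
T = project(R, ["number", "section", "room"])
canon = lambda rows: sorted(tuple(sorted(r.items())) for r in rows)
print(canon(natural_join(S, T)) == canon(R))  # True: the decomposition is lossless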
How to justify good designs? • What is a good database design? • Well-known solutions: BCNF, 4NF, 3NF… • But what is it that makes a database design good? • Elimination of update anomalies. • Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.
Problems with traditional approaches • Many papers tried to justify normal forms. • Problem: the justifications are tied very closely to the relational model, relying on well-defined notions of queries and updates. • These days we want to deal with other data models, so we need an approach that extends beyond relations, in particular to XML.
Justification of Normal Forms • It is problematic to evaluate XML normal forms this way: • No XML update language has been standardized. • No XML query language yet has the same “yardstick” status as relational algebra. • We do not even know if implication of XML FDs is decidable! • We need a different approach: • It must be based on some intrinsic characteristics of the data. • It must be applicable to new data models. • It must be independent of query/update/constraint issues. • Our approach is based on information theory.
Information Theory • Entropy measures the amount of information provided by a certain event. • Assume that an event can have n different outcomes with probabilities p1, …, pn. The entropy is H = − Σi pi log pi (logarithms base 2). • Entropy is maximal, log n, if each pi = 1/n.
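A minimal sketch of the entropy computation (base-2 logarithms, as used throughout the slides):

import math

def entropy(ps):
    # Shannon entropy; terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([1/4] * 4))             # 2.0 = log 4 (uniform: maximal)
print(entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.357, strictly less than log 4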
Entropy and Redundancies • Database schema: R(A,B,C), A → B. • Instance I (shown on the slide): one position is redundant, its value forced by A → B; another is not. • Pick a domain {1, …, 6} properly containing adom(I). • Non-redundant position: P(4) = 0 and P(a) = 1/5 for a ≠ 4; entropy: log 5 ≈ 2.322. • Redundant position: P(2) = 1 and P(a) = 0 for a ≠ 2; entropy: log 1 = 0.
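The two entropies on this slide can be checked directly:

import math

h = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)

# Non-redundant position: P(4) = 0, P(a) = 1/5 for the other five values.
print(h([0, 1/5, 1/5, 1/5, 1/5, 1/5]))  # 2.3219... = log 5
# Redundant position: P(2) = 1, P(a) = 0 otherwise.
print(h([1, 0, 0, 0, 0, 0]))            # 0.0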
Entropy and Normal Forms • Let Σ be a set of FDs over a schema S. • Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0). • A similar result holds for 4NF and MVDs. • This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...
Problems with the Measure • The measure cannot distinguish between different types of data dependencies. • It cannot distinguish between different instances of the same schema: for R(A,B,C), A → B, two different instances (shown on the slide) both get entropy = 0 at the redundant position.
A General Measure Instance I of schema R(A,B,C), A → B: • Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7. • Computation: for every X ⊆ Pos(I) − {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}. • In the example on the slide: P(2 | X) = 48 / (48 + 6 · 42) = 0.16, and for a ≠ 2, P(a | X) = 42 / (48 + 6 · 42) = 0.14. • Entropy ≈ 2.8057 (log 7 ≈ 2.8073).
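The arithmetic of this final step can be verified directly (the counts 48 and 42 are taken from the slide):

import math

p2 = 48 / (48 + 6 * 42)  # = 0.16
pa = 42 / (48 + 6 * 42)  # = 0.14
H = -(p2 * math.log2(p2) + 6 * pa * math.log2(pa))
print(round(H, 4))   # 2.8057
print(math.log2(7))  # 2.80735... (the maximal entropy for k = 7)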
A General Measure • Value: we consider the average over all sets X ⊆ Pos(I) − {p}. • Average: 2.4558 < log 7 (the maximal entropy). • It corresponds to a conditional entropy. • It depends on the value of k ...
A General Measure: Relative Information Content (RIC) • Previous value: RIC_I^k(Σ | p). • For each k, we consider the ratio RIC_I^k(Σ | p) / log k: how close the given position p is to having the maximum possible information content. • General measure (Arenas, L. 2003): RIC_I(Σ | p) = lim k→∞ RIC_I^k(Σ | p) / log k.
Basic Properties • The measure is well defined: for every set of first-order constraints Σ, every instance I of Σ, and every position p in I, RIC_I(Σ | p) exists. • Bounds: 0 ≤ RIC_I(Σ | p) ≤ 1. • Closer to 1 means less redundancy.
Basic Properties • The measure does not depend on a particular representation of the constraints. • It overcomes the limitations of the simple measure: for R(A,B,C), A → B, the two instances that the simple measure could not tell apart now receive different values, 0.875 and 0.781.
Well-Designed Databases • Definition: a database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), RIC_I(Σ | p) = 1. • In other words, every position in every instance carries the maximum possible amount of information.
Relational Databases (Arenas, L. ’03) Σ is a set of data dependencies over a schema S: • Σ = Ø: (S, Σ) is well-designed. • Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF. • Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF. • Σ is a set of FDs and JDs: if (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed; the converse is not true. • A syntactic characterization of being well-designed is given in [AL03].
Decidability Issues • If Σ is a set of first-order integrity constraints, then the problem of verifying whether a relational schema is well-designed is undecidable. • If Σ contains only universal constraints (FDs, MVDs, JDs, …), the problem becomes decidable. • The complexity is high (coNEXPTIME), by reduction to the complement of Bernays-Schönfinkel satisfiability.
3NF • BCNF is the most popular textbook normal form. • In practice, 3NF is much more common. • From Oracle’s “General Database Design FAQ”: after defining 1NF, 2NF, and 3NF, it says that there are other normal forms, but “their definitions are of academic concern only, and are rarely required for practical purposes”.