Relative Information Capacity of Simple Relational Database Schemata

Paper by: Richard Hull. Presented by: Jose Picado.

Outline • Problem: Data relativism and information capacity • Definition • Examples • Importance • Hierarchy of dominance measures • Basic results • Discussion

Data relativism • Represent the same data in different ways • Represent the same data under different schemas [Figure: Schema 1 and Schema 2] Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Relative information capacity • Expressiveness of a schema • Different schemas representing the same data may have different information capacity [Figure: Schema 1 and Schema 2] Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Relative information capacity • Schema 1: • Does not require that the spouse attribute of a man refers to a woman • Does not require that each spouse attribute in one direction has a corresponding spouse attribute in the other direction Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Relative information capacity • Schema 2: • Allows unmarried people to be represented in the database Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Relative information capacity • Possible solution: • Transform the existing schema into a new schema by structural manipulations • Question: is the transformation information-capacity preserving?

Importance • Schema evolution • None of the information stored in the initial database is lost

Importance • Data integration • All information in one of the component databases is reflected in the integrated database Example taken from: Kosky, Anthony. Transforming Databases with Recursive Data Structures, 1996.

Importance • Database normalization theory • User view construction • Schema simplification • Translation between data models

Hull’s paper • Introduces theoretical tools for studying measures of relative information capacity • Theoretical frameworks at the time were complex • There was no clear definition of the concept • Hull introduced precise ways of comparing schemata and their information capacity • Defines a hierarchy of measures for comparing the information capacity of schemata

Hull’s paper • Gives some basic results concerning these measures • Considers only non-keyed relations [Figure: relations and instances in the non-keyed and keyed cases]

Definitions • A schema P is a set of relations • Relations are composed of attributes, which may be of different basic types • Basic types are domain designators (each has a fixed domain of possible values) • I(P) is the set of instances of P, usually infinite [Figure: schema P and its instances I(P)]

Transformation • P and Q are relational schemata • A transformation from P to Q is a map σ: I(P) → I(Q) • Example, with Person and Birth in P and PersonInfo in Q: PersonInfo(x,y,z) :- Person(x,y), Birth(x,z).

Dominance • P and Q are relational schemata • σ: I(P) → I(Q) and τ: I(Q) → I(P) are transformations • Q dominates P via (σ, τ) if the composition of σ followed by τ is the identity on I(P)

Dominance • Take instances of P: I(P)

Dominance • Apply σ to I(P):
Male(x) :- Person(x,y,z), y="male".
Female(x) :- Person(x,y,z), y="female".
Marriage(x,y) :- Person(x,u,y), Person(y,v,x), u="male", v="female".

Dominance • Apply τ to σ(I(P)):
Person(x,"male",z) :- Male(x), Marriage(x,z).
Person(x,"female",z) :- Female(x), Marriage(z,x).

Dominance • Compare I(P) and τ(σ(I(P)))

Dominance • P and Q are relational schemata • Q dominates P via (σ, τ) if the composition of σ followed by τ is the identity on I(P) • Information structured according to P can be restructured to "fit" into Q, and restructured again to "fit" back into P, without loss • Q has at least as much capacity for storing information as P
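The round trip from the Person / Male / Female / Marriage example can be checked on small instances. The sketch below is a hedged illustration: the tuple encoding, the function names `sigma` and `tau`, and the sample data are assumptions, and the rule for women reads Marriage as (husband, wife), which is one interpretation of the slide.

```python
def sigma(persons):
    """P -> Q: split Person(name, sex, spouse) into Male, Female, Marriage."""
    male = {x for (x, sex, _) in persons if sex == "male"}
    female = {x for (x, sex, _) in persons if sex == "female"}
    # Marriage(x, y) needs a male record x -> y and a female record y -> x.
    marriage = {
        (x, y)
        for (x, u, y) in persons
        for (y2, v, x2) in persons
        if u == "male" and v == "female" and y2 == y and x2 == x
    }
    return male, female, marriage


def tau(q_instance):
    """Q -> P: rebuild Person tuples from Male, Female, Marriage."""
    male, female, marriage = q_instance
    husbands = {(x, "male", z) for (x, z) in marriage if x in male}
    wives = {(x, "female", z) for (z, x) in marriage if x in female}
    return husbands | wives


# Made-up instances of P (illustrative only).
married = {("bob", "male", "ann"), ("ann", "female", "bob")}
one_sided = married | {("carl", "male", "dana")}  # no matching record for dana

print(tau(sigma(married)) == married)      # round trip preserves this instance
print(tau(sigma(one_sided)) == one_sided)  # the one-sided spouse link is lost
```

On these two instances the composition is the identity only for the consistently married one, so this particular pair (σ, τ) would witness dominance only if I(P) is restricted to such instances.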

Equivalence • P and Q are equivalent (xxx) if they have equivalent information capacity under the dominance measure xxx (calc, gen, int, or abs) • P and Q are equivalent (xxx) if • Q dominates P (xxx) and • P dominates Q (xxx)

Information dominance measures (ordered from most restrictive to least restrictive) • Calculous dominance • Generic dominance • Internal dominance • Absolute dominance

Types of equivalence (ordered from most restrictive to least restrictive) • P and Q are equivalent (calc) • P and Q are equivalent (gen) • P and Q are equivalent (int) • P and Q are equivalent (abs)

Level 1: Calculous dominance • Only allow transformations that are relational calculus expressions • Relational calculus: • First-order logic (predicate calculus) • Predicates: atoms, combined using logical connectives and quantifiers (∧, ∨, ¬, ∃, ∀) • Each query Q(x1, …, xn) is a predicate P

Level 1: Calculous dominance • Only allow transformations that are relational calculus expressions • If σ and τ are relational calculus expressions, then Q dominates P calculously

Level 2: Generic dominance • Only allow transformations that treat domain elements as "essentially uninterpreted objects" • All elements are treated as equals, except for some set of constants • Genericity is a property of standard query languages such as SQL and Datalog

Level 2: Generic dominance • Only allow transformations that treat domain elements as "essentially uninterpreted objects" • If σ and τ treat all elements as equals, then Q dominates P generically
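One common way to read "essentially uninterpreted objects" is that a transformation should commute with renamings of domain elements that leave the designated constants fixed. The sketch below tests that property on a single instance and a single renaming. All names in it (`rename`, `swap_columns`, `keep_alice`, `commutes`) and the data are hypothetical, and passing one such check does not prove genericity.

```python
def rename(relation, mapping):
    """Apply a renaming of domain elements to every value in a relation."""
    return {tuple(mapping.get(v, v) for v in row) for row in relation}


def swap_columns(relation):
    """Treats all values alike: only rearranges columns."""
    return {(b, a) for (a, b) in relation}


def keep_alice(relation):
    """Singles out the value "alice", so it is not generic
    unless "alice" is one of the allowed constants."""
    return {row for row in relation if row[0] == "alice"}


def commutes(transform, relation, mapping):
    """Does renaming-then-transforming equal transforming-then-renaming?"""
    return transform(rename(relation, mapping)) == rename(transform(relation), mapping)


r = {("alice", "bob"), ("carol", "dave")}
perm = {"alice": "carol", "carol": "alice"}  # a permutation of the domain

print(commutes(swap_columns, r, perm))
print(commutes(keep_alice, r, perm))
```

The column swap commutes with the renaming, while the filter on "alice" does not, which is the kind of behaviour generic dominance rules out.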

Level 3: Internal dominance • Only allow transformations that do not invent any data • Inventing data: numerical computations or string manipulations, e.g. performance = goals/games

Level 3: Internal dominance • Only allow transformations that do not invent any data • If σ and τ do not invent data, then Q dominates P internally
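A rough way to picture "does not invent data" is that every value in the output already occurs in the input, or belongs to a fixed set of constants. The sketch below checks this on one instance. The helper names, the `constants` parameter, and the sample statistics are assumptions made for illustration, and a single passing instance does not establish that a transformation is internal.

```python
def values(relation):
    """All domain values that occur anywhere in a relation."""
    return {v for row in relation for v in row}


def invents_data(transform, relation, constants=frozenset()):
    """True if the output contains a value not present in the input or constants."""
    allowed = values(relation) | set(constants)
    return not values(transform(relation)) <= allowed


def project_player(stats):
    """Keeps only existing values."""
    return {(player,) for (player, goals, games) in stats}


def add_performance(stats):
    """Computes performance = goals/games, a value not stored in the input."""
    return {(player, goals / games) for (player, goals, games) in stats}


# Made-up instance: (player, goals, games).
stats = {("p1", 30, 40), ("p2", 5, 50)}

print(invents_data(project_player, stats))
print(invents_data(add_performance, stats))
```

The projection only reuses stored values, while the ratio introduces 0.75 and 0.1, which appear nowhere in the input.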

Level 4: Absolute dominance • Y: some finite set of values • I_Y(P): the instances of P that contain only values in Y • |I_Y(P)|: the number of instances of P that contain only values in Y • If |I_Y(P)| ≤ |I_Y(Q)| for every sufficiently large finite Y, then Q dominates P absolutely • Easy to compute: based on counting instances rather than on transformations
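For non-keyed relations the counting is simple enough to sketch. Assuming an instance of a relation is any set of tuples over the values of Y, a relation with n possible tuples has 2^n instances, and the counts of the relations in a schema multiply. The schema encoding and the names `count_instances`, `P`, and `Q` below are hypothetical.

```python
from math import prod


def count_instances(schema, domain_sizes):
    """Number of instances of a non-keyed schema restricted to a finite value set.

    schema:       dict mapping relation name -> tuple of basic type names
    domain_sizes: dict mapping basic type name -> number of values of that type in Y
    """
    total = 1
    for attribute_types in schema.values():
        # Number of distinct tuples the relation could contain.
        n_tuples = prod(domain_sizes[t] for t in attribute_types)
        # Any subset of those tuples is a legal instance of a non-keyed relation.
        total *= 2 ** n_tuples
    return total


# Two made-up schemata over a single basic type B.
P = {"R": ("B", "B")}            # one binary relation
Q = {"S": ("B",), "T": ("B",)}   # two unary relations

for k in (1, 2, 3, 4):
    sizes = {"B": k}
    print(k, count_instances(P, sizes), count_instances(Q, sizes))
```

With three or more values, P has strictly more instances than Q in this example, so under the counting criterion Q could not dominate P absolutely.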

Basic results • Q dominates P calculously ⇒ Q dominates P generically ⇒ Q dominates P internally ⇒ Q dominates P absolutely

Basic results • Sometimes absolute and internal dominance hold, but generic and calculous dominance do not • Q dominates P (abs, int): the counting condition holds, and the transformation (int) does not invent data • Q does not dominate P (gen, calc): there is no transformation (gen, calc) that takes instances of P to Q and then back to P [Figure: schemata P and Q]

Basic results • Absolute dominance is useful for verifying calculous dominance (or its absence) • Q dominates P calculously ⇒ Q dominates P absolutely • P does not dominate Q absolutely ⇒ P does not dominate Q calculously [Figure: schemata P and Q] * Under certain constraints

Basic results • Dominance is preserved by renamings of basic types (homomorphisms) • h(P): the schema obtained by applying the homomorphism h to P • If Q dominates P, then h(Q) dominates h(P), for any measure of dominance (calc, gen, int, abs)

Basic results • Calculous dominance does not accurately measure the presence of "semantic correspondence" [Figure: schemata P and Q with relations R1, R2, S1, S2, and T over the basic types NAME and NUMBER] • Q dominates P (calc), but there is no semantic mapping from P to Q

Basic results • For non-keyed relational schemata over a single basic type, all four types of dominance coincide • Theorem: Let P and Q be non-keyed relational schemata over a single basic type B. Then the following are equivalent: • Q dominates P (calc) • Q dominates P (gen) • Q dominates P (int) • Q dominates P (abs)

Basic results • Under any reasonable measure of relative information capacity, two non-keyed relational schemata are equivalent iff they are identical (up to re-ordering of attributes and relations) • In the relational model (non-keyed), there is essentially at most one way to represent a given data set

Discussion • Strong points: • ???
