Schema & Schema Integration

Schema & Schema Integration Carsten Karl Dennis Schade Thorsten Dollmann

Outline • XTRACT System for inferring DTDs from a set of XML documents • Incremental validation of XML Documents

Schema & XML Databases • Databases need a Schema • DTDs serve the role of the schema of the document • Efficient storage of XML data • Optimization of XML queries DTDs are not mandatory !!!!

XTRACT • Goal: Infer DTDs from a set of XML documents

Problem Simplification and Abstraction • Infer a DTD for each tag separately • Separate example sequences for each <e> • Infer a “good” DTD for each <e> • Resulting document DTD is a composition of all inferred “tag”-DTDs

Example book <book> <title> </title> <author> <name> </name> <age> </age> </author> <author> <name> </name> </author> <editor> <name> </name> </editor> </book> title author author editor name age name name

Candidate DTD Concise Precise (a|b)* (ab|abab|ababab) ab|ab(ab|abab) (ab)* What is a “good” DTD ? Given the example sequence set I={ ab, abab, ababab } Possible DTDs: Yes No No Yes No Yes Yes Somewhat

What is a “good” DTD ? (ctd.) • A good DTD D must satisfy two restrictions • R1: D should be concise • R2: D should be precise • Minimum Description Length quantifies and resolves the tradeoff between R1 and R2

The MDL Principle • MDL principle states: The best theory to infer from a given set of data is the one which minimizes the sum of • The length of the theory in bits • The length of the data, in bits, when encoded with the help of the theory

Input Sequences I = { ab,abab,ac, ad, bc, bd, bbd, bbbe } MDL Modul Factoring Generalization Overview of XTRACT System Sg = I  { (ab)*, (a|b)*, b*d, b*e } Sf = Sg { (a|b)(c|d), b*(d|e) } Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

MDL Subsystem • In order to use the MDL principle, we need to • Define theory description length • Define data description length • Solve the resulting minimization problem

MDL Coding scheme • Description Length of a DTD • Number of characters of the DTD • Cost of encoding the example sequences • encoding of b in terms of DTD a | b | c is 1, cost 1 (position of b in the DTD) • encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1 • encoding of b in terms of DTD b is , cost 0

MDL Subsystem Minimization Input Sequences Candidate DTDs ab (a|b)* abb abbb ab* abbbb abb ab

MDL Subsystem Minimization Input Sequences Candidate DTDs ab 3 = 1* + (1a + 1b) 6 (a|b)* 30 4 abb 5 6 abbb ab* 7 abbbb abb abbbbb

MDL Subsystem Minimization Input Sequences Candidate DTDs ab (a|b)* 30 1 abb 1 3 1 abbb ab* 8 1 abbbb 1 abb abbbbb

MDL Subsystem Minimization Input Sequences Candidate DTDs ab (a|b)* 30 abb 0 abbb ab* 8 abbbb 3 abb 3 abbbbb

MDL Subsystem Minimization Input Sequences Candidate DTDs ab (a|b)* 30 abb abbb ab* 8 abbbb abb 3 ab

Input Sequences I = { ab,abab,ac, ad, bc, bd, bbd, bbbe } MDL Modul Factoring Generalization Overview of XTRACT System Sg = I  { (ab)*, (a|b)*, b*d, b*e } Sf = Sg { (a|b)(c|d), b*(d|e) } Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

Generalization Subsystem • Goal: • Infer regular expressions from example sequences • Produce candidate DTDs such as a*bc,(abc)*, (a|b|c)*,((ab)*c)* • Generate more general DTDs • Two heuristics: • DiscoverSeqPattern(s,r): s=abbbbc => ab*c • DiscoverOrPattern(s,d): s=abacbc => (a|b|c)* • Candidate DTDs are generated by calling the above functions for appropriate values of r and d

a b a b a b a b c a b c a b a b c a b a b a b a b c a b c a b a b c a b ( a b ) * c a b c a b a b c a b ( a b ) * c a b c a b a b c ( a b ) * c a b c ( a b ) * c ( a b ) * c a b c ( a b ) * c ( ( a b ) * c ) * DiscoverSeqPattern Example The pattern must occur at least two times: r=2

DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a x c x a c

DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a x c x a c Step 1: Partition

DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a x c x a c Step 2: replace pattern a1…an by (a1|..|an)*

DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a ( x | c ) * a c Step 2: replace pattern a1…an by (a1|..|an)*

a ( ((de)*e)* | c ) * a c DiscoverOrPattern Example Given: • the example sequence s=axcxac • distance parameter d=2 a ( x | c ) * a c x is an auxiliary symbol introduced by DiscoverSeqPattern x = ((de)*e)*

ac | ad | bc | bd Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd }

Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } ac | ad | bc | bd =>

Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } ac | ad | bc | bd => a(c|d)

Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } ac | ad | bc | bd => a(c|d) |

Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } ac | ad | bc | bd => a(c|d) | b(c|d)

Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } ac | ad | bc | bd => a(c|d) | b(c|d) =>

Factoring Subsystem • Goal: Combine different candidates to derive more compact, factored DTDs • Example candidate set Sg = { ac, ad, bc, bd } ac | ad | bc | bd => a(c|d) | b(c|d) => (a|b)(c|d) • Reduces MDL description length of the candidate DTDs • Adoption of factoring algorithms for Boolean expressions • Use heuristic algorithm for selecting subsets of candidate DTDs that give a good factored form

Factoring Subsystem Heuristics • Choose subsets S of candidate DTDs from SG such that • DTDs in S have a common prefix p or suffix s • number of DTDs with this common prefix in SG is high

abc(d*|e*|f*|g*) Factoring Prefixes Candidate DTDs abcddd abcd* abceee abce* abcfff abcf* abcggg abcg* longer prefixes result in MDL cost reduction factored DTD covers all input sequences

Schema & Schema Integration