240 likes | 359 Views
A Normal Form for XML Documents. Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto. Motivating Example. courses. course. course. info. @cno. student. @cno. student. @sno. name. student. “cs100”. “cs225”. “123”. “Fox”. @sno. name. grade. @sno.
E N D
A Normal Form for XML Documents Marcelo ArenasLeonid Libkin Department of Computer Science University of Toronto
Motivating Example courses course course info @cno student @cno student @sno name student “cs100” . . . “cs225” “123” “Fox” @sno name grade @sno name grade “123” “Fox” “B+” “123” “Fox” “A+”
Our Goal DTD: Integrity Constraints: courses course* course @cno, student* student @sno, name, grade name S grade S ,info* two students with the same @sno value must have the same name. @snois the key ofinfo. info @sno, name
A “Non-relational” Example DBLP conf conf . . . title issue issue “ICDT” @year article article article @year “1999” “2001” author title @year author title @year title @year “Dong” “. . .” “1999” “Jarke” “. . .” “1999” “. . .” “2001”
Problems to Address • Functional dependencies for XML. • Normal form for XML documents (XNF). • Generalizes BCNF. • Algorithm for normalizing XML documents. • Implication problem for functional dependencies.
Framework: DTDs • We do not consider mixed content, IDs and IDREFs. • Paths(D): all paths in a DTD D courses.course courses.course.@cno courses.course.student.name courses.course.student.name.S • We distinguish three kinds of elements: attributes (@), strings (S) and element types.
Framework: XML Trees courses v0 course course v1 . . . @cno student student v2 v5 “cs100” @sno grade @sno grade name name v3 v4 v6 v7 “123” “456” S S S S “Fox” “B+” “Smith” “A-”
Towards FDs for XML • We know how to handle FDs in relational DBs. • We need a relational representation of XML. • We do it by considering tree tuples: atree tuple in a DTD Dis a mapping t : Paths(D) Vertices Strings {} consistent with D: could be extracted from an XML tree conforming to D.
Tree Tuples A tree tuple represents an XML tree: courses t(courses) = v0 t(courses.course) = v1 t(courses.course.@cno) = “cs100” t(courses.course.student) = v2 t(p) = , for the remaining paths v0 course v1 @cno student v2 “cs100”
XML Tree: set of Tree Tuples courses courses v0 v0 course course course course v1 v1 . . . . . . @cno @cno student student student student v2 v5 v5 v2 “cs100” “cs100” @sno @sno grade grade @sno @sno grade grade name name name name v3 v3 v4 v4 v6 v6 v7 v7 “123” “123” “456” “456” S S S S S S S S “Fox” “Fox” “B+” “B+” “Smith” “Smith” “A-” “A-”
Functional Dependencies • Expressions of the form: X Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D). • XML tree T can be tested for satisfaction of X Yif: X Y Paths(T) Paths(D) • T X Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X≠ implies u.Y = v.Y
FD: Examples • University DTD: courses course* course @cno, student* student @sno, name, grade • Two students with the same @sno value must have the same name: courses.course.student.@sno courses.course.student.name.S • Every student can have at most one grade in every course: { courses.course, courses.course.student.@sno } courses.course.student.grade.S
Checking FD Satisfaction { courses.course, courses.course.student.@sno } courses.course.student.grade.S courses courses v0 v0 course course course course v1 v1 v5 v5 @cno @cno student student @cno @cno student student “cs100” “cs100” v2 v2 “cs225” “cs225” v6 v6 @sno @sno grade grade @sno @sno grade grade name name name name “123” “123” v3 v3 v4 v4 “123” “123” v7 v7 v8 v8 S S S S S S S S “Fox” “Fox” “B+” “B+” “Fox” “Fox” “A+” “A+”
Checking FD Satisfaction { courses.course, courses.course.student.@sno } courses.course.student.grade.S courses courses v0 v0 course course v1 v1 @cno @cno student student student student “cs100” “cs100” v2 v2 v5 v5 @sno @sno grade grade @sno @sno grade grade name name name name “123” “123” v3 v3 v4 v4 “123” “123” v6 v6 v7 v7 S S S S S S S S “Fox” “Fox” “B+” “B+” “Fox” “Fox” “A+” “A+”
Implication Problem for FD • Given a DTD D and a set of functional dependencies {}: (D, ) if for any XML tree T conforming to D and satisfying , it is the case that T • (D, )+ = { | (D, ) } • Functional dependency is trivial if it is implied by the DTD alone: (D, )
XNF: XML Normal Form • XML specification: a DTD D and a set of functional dependencies . • A Relational DB is in BCNF if for every non-trivial functional dependency X Y in the specification, X is a key. • (D, ) is in XNF if: For each non-trivial FD X p.@l or X p.S in (D, )+, X p is in (D, )+.
Back to DBLP • DBLP is not in XNF: DBLP.conf.issue DBLP.conf.issue.article.@year (D,)+ DBLP.conf.issue DBLP.conf.issue.article (D,)+ • Proposed solution is in XNF.
Normalization Algorithm The algorithm applies two transformations until the schema is in XNF. • If there is an anomalous FD of the form: DBLP.conf.issue DBLP.conf.issue.article.@year then apply the “DBLP example rule”. • Otherwise: choose a minimal anomalous FD and apply the “University example rule”.
Normalizing XML Documents • TheoremThe decomposition algorithm terminates and outputs a specification in XNF. • It does not lose information: Unnormalized Normalized XML document XML Document Q1, Q2 are XQuery core queries. • It works even if we cannot compute (D,)+. If we know (D,)+, the output is better. Q1 Q2
Implication Problem (cont’d) • Typically, regular expressions used in DTDs are rather simple. • Trivial regular expression: s1, s2, …, sn • Each si is either one of ai, ai?, ai+or ai* • For i ≠ j, ai ≠ aj • Dis asimpleDTDif all its productions use “permutations” of trivial regular expressions. Example: (a | b)*is a permutation of a*, b*
Implication Problem (cont’d) • TheoremFor simple DTDs: • The implication problem for FDs is solvable in O(n2). • Testing if a specification is in XNF can be done in O(n3). • Other results: • There is a larger class of DTDs for which these problems are tractable. • There is an even larger class of DTDs for which they are coNP-complete.
Final Comments • The normalization algorithm can be improved in various ways. • Why XNF? • TheoremXNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents. • A complete classification of the complexity of the implication problem for FDs remains open. • Implementation.
Normalization Algorithm • We consider FDs of the form: {q, p1.@l1, …, pn.@ln } p where n 0 and q ends with an element type. • The algorithm applies two transformations until the schema is in XNF. • If there is an anomalous FD with n = 0: DBLP.conf.issue DBLP.conf.issue.article.@year then apply the “DBLP example rule”. • Otherwise: choose a minimal anomalous FD and apply the “University example rule”.