1 / 24

A Normal Form for XML Documents

A Normal Form for XML Documents. Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto. Motivating Example. courses. course. course. info. @cno. student. @cno. student. @sno. name. student. “cs100”. “cs225”. “123”. “Fox”. @sno. name. grade. @sno.

quant
Download Presentation

A Normal Form for XML Documents

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Normal Form for XML Documents Marcelo ArenasLeonid Libkin Department of Computer Science University of Toronto

  2. Motivating Example courses course course info @cno student @cno student @sno name student “cs100” . . . “cs225” “123” “Fox” @sno name grade @sno name grade “123” “Fox” “B+” “123” “Fox” “A+”

  3. Our Goal DTD: Integrity Constraints: courses  course* course  @cno, student* student  @sno, name, grade name S grade S ,info* two students with the same @sno value must have the same name. @snois the key ofinfo. info  @sno, name

  4. A “Non-relational” Example DBLP conf conf . . . title issue issue “ICDT” @year article article article @year “1999” “2001” author title @year author title @year title @year “Dong” “. . .” “1999” “Jarke” “. . .” “1999” “. . .” “2001”

  5. Problems to Address • Functional dependencies for XML. • Normal form for XML documents (XNF). • Generalizes BCNF. • Algorithm for normalizing XML documents. • Implication problem for functional dependencies.

  6. Framework: DTDs • We do not consider mixed content, IDs and IDREFs. • Paths(D): all paths in a DTD D courses.course courses.course.@cno courses.course.student.name courses.course.student.name.S • We distinguish three kinds of elements: attributes (@), strings (S) and element types.

  7. Framework: XML Trees courses v0 course course v1 . . . @cno student student v2 v5 “cs100” @sno grade @sno grade name name v3 v4 v6 v7 “123” “456” S S S S “Fox” “B+” “Smith” “A-”

  8. Towards FDs for XML • We know how to handle FDs in relational DBs. • We need a relational representation of XML. • We do it by considering tree tuples: atree tuple in a DTD Dis a mapping t : Paths(D)  Vertices  Strings  {} consistent with D: could be extracted from an XML tree conforming to D.

  9. Tree Tuples A tree tuple represents an XML tree: courses t(courses) = v0 t(courses.course) = v1 t(courses.course.@cno) = “cs100” t(courses.course.student) = v2 t(p) = , for the remaining paths v0 course v1 @cno student v2 “cs100”

  10. XML Tree: set of Tree Tuples courses courses v0 v0 course course course course v1 v1 . . . . . . @cno @cno student student student student v2 v5 v5 v2 “cs100” “cs100” @sno @sno grade grade @sno @sno grade grade name name name name v3 v3 v4 v4 v6 v6 v7 v7 “123” “123” “456” “456” S S S S S S S S “Fox” “Fox” “B+” “B+” “Smith” “Smith” “A-” “A-”

  11. Functional Dependencies • Expressions of the form: X  Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D). • XML tree T can be tested for satisfaction of X Yif: X Y Paths(T)  Paths(D) • T  X  Y if for every pair u, v of tree tuples in T: u.X = v.X and u.X≠  implies u.Y = v.Y

  12. FD: Examples • University DTD: courses  course* course  @cno, student* student  @sno, name, grade • Two students with the same @sno value must have the same name: courses.course.student.@sno  courses.course.student.name.S • Every student can have at most one grade in every course: { courses.course, courses.course.student.@sno }  courses.course.student.grade.S

  13. Checking FD Satisfaction { courses.course, courses.course.student.@sno }  courses.course.student.grade.S courses courses v0 v0 course course course course v1 v1 v5 v5 @cno @cno student student @cno @cno student student “cs100” “cs100” v2 v2 “cs225” “cs225” v6 v6 @sno @sno grade grade @sno @sno grade grade name name name name “123” “123” v3 v3 v4 v4 “123” “123” v7 v7 v8 v8 S S S S S S S S “Fox” “Fox” “B+” “B+” “Fox” “Fox” “A+” “A+”

  14. Checking FD Satisfaction { courses.course, courses.course.student.@sno }  courses.course.student.grade.S courses courses v0 v0 course course v1 v1 @cno @cno student student student student “cs100” “cs100” v2 v2 v5 v5 @sno @sno grade grade @sno @sno grade grade name name name name “123” “123” v3 v3 v4 v4 “123” “123” v6 v6 v7 v7 S S S S S S S S “Fox” “Fox” “B+” “B+” “Fox” “Fox” “A+” “A+”

  15. Implication Problem for FD • Given a DTD D and a set of functional dependencies  {}: (D, )   if for any XML tree T conforming to D and satisfying  , it is the case that T   • (D, )+ = {  | (D, )   } • Functional dependency  is trivial if it is implied by the DTD alone: (D, )  

  16. XNF: XML Normal Form • XML specification: a DTD D and a set of functional dependencies . • A Relational DB is in BCNF if for every non-trivial functional dependency X  Y in the specification, X is a key. • (D, ) is in XNF if: For each non-trivial FD X  p.@l or X  p.S in (D, )+, X  p is in (D, )+.

  17. Back to DBLP • DBLP is not in XNF: DBLP.conf.issue  DBLP.conf.issue.article.@year  (D,)+ DBLP.conf.issue  DBLP.conf.issue.article  (D,)+ • Proposed solution is in XNF.

  18. Normalization Algorithm The algorithm applies two transformations until the schema is in XNF. • If there is an anomalous FD of the form: DBLP.conf.issue  DBLP.conf.issue.article.@year then apply the “DBLP example rule”. • Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

  19. Normalizing XML Documents • TheoremThe decomposition algorithm terminates and outputs a specification in XNF. • It does not lose information: Unnormalized Normalized XML document XML Document Q1, Q2 are XQuery core queries. • It works even if we cannot compute (D,)+. If we know (D,)+, the output is better. Q1 Q2

  20. Implication Problem (cont’d) • Typically, regular expressions used in DTDs are rather simple. • Trivial regular expression: s1, s2, …, sn • Each si is either one of ai, ai?, ai+or ai* • For i ≠ j, ai ≠ aj • Dis asimpleDTDif all its productions use “permutations” of trivial regular expressions. Example: (a | b)*is a permutation of a*, b*

  21. Implication Problem (cont’d) • TheoremFor simple DTDs: • The implication problem for FDs is solvable in O(n2). • Testing if a specification is in XNF can be done in O(n3). • Other results: • There is a larger class of DTDs for which these problems are tractable. • There is an even larger class of DTDs for which they are coNP-complete.

  22. Final Comments • The normalization algorithm can be improved in various ways. • Why XNF? • TheoremXNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents. • A complete classification of the complexity of the implication problem for FDs remains open. • Implementation.

  23. Backup Slides

  24. Normalization Algorithm • We consider FDs of the form: {q, p1.@l1, …, pn.@ln }  p where n  0 and q ends with an element type. • The algorithm applies two transformations until the schema is in XNF. • If there is an anomalous FD with n = 0: DBLP.conf.issue  DBLP.conf.issue.article.@year then apply the “DBLP example rule”. • Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

More Related