1 / 22

Managing XML and Semistructured Data

Managing XML and Semistructured Data. Lecture 11: Schema - DTD’s. Prof. Dan Suciu. Spring 2001. In this lecture. Schema Extraction for SS data Schemas for XML DTDs XML Schema Resources Extracting Schema from Semistructured Data Nestorov, Abiteboul, Motwani. SIGMOD 98

gengelhardt
Download Presentation

Managing XML and Semistructured Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Managing XML and Semistructured Data Lecture 11: Schema - DTD’s Prof. Dan Suciu Spring 2001

  2. In this lecture • Schema Extraction for SS data • Schemas for XML • DTDs • XML Schema Resources • Extracting Schema from Semistructured Data Nestorov, Abiteboul, Motwani. SIGMOD 98 • Data on the WebAbiteboul, Buneman, Suciu : section 3.3

  3. Review of Schemas so far Upper bound schema S • Tell us what labels are allowed • Conformance test: D  S • In practice: need deterministic schemas Lower bound schema S • Tells us what labels are required • Conformance test: S  D • Alternative formulation: datalog programs, maximal fixpoint

  4. Schema Extraction(From Data) Problem statement • given data instance D • find the “most specific” schema S for D In practice: S too large, need to relax [Nestorov, Abiteboul, Motwani 1998]

  5. Schema Extraction: Sample Data Example database D = &r employee employee employee employee employee employee employee employee manages manages manages manages manages &p1 &p2 &p3 &p4 &p5 &p6 &p7 &p8 managedby managedby managedby managedby managedby worksfor worksfor worksfor worksfor worksfor company worksfor worksfor worksfor &c

  6. Lower Bound Schema Extraction [NAM’98] approach: • Start with the schema given by the data (S = D): • Each node = a predicate = a class • Compute maximal fixpoint (PTIME) • Declare two classes equal iff they are equal sets • E.g. p4={&p1,&p4,&p6}, p6={&p1,&p4,&p6}, hence p1=p4 • Equivalently, p=p’ iff p(&p’) and p’(&p) . . . . . . p4(x) :- link(x, manages, y), p5(y), link(x, worksfor, z), c(z) p5(x) :- link(x, managed-by, y), p4(y), link(x, worksfor, z), c(z) . . . . . .

  7. Lower Bound Schema Extraction Result S = Root &r employee company employee Bosses &p1,&p4,&p6 Regulars &p2,&p3,&p5,&p7,&p8 manages managedby worksfor Company &c worksfor

  8. Lower Bound Schema Extraction Equivalently: • Compute the maximal simulation D  D • Can do in time O(m2) • Two nodes p, p’ are equivalent iff x  x’ and x’  x • Schema consists of equivalence classes Remark: could use the bisimulation relation instead (perhaps is even better)

  9. Upper Bound Schema Extraction • The extracted lower bound schema S is also an upper bound schema ! • But: nondeterministic • Convert S  Sd • Alternatively, convert directly D  Dd = Sd • These are data guides [McHugh and Widom]

  10. Upper Bound Schema Extraction Result Sd = Root &r employee Employees &p1,&p1,&p3,P4 &p5,&p6,&p7,&p8 company managedby manages worksfor Bosses &p1,&p4,&p6 Regulars &p2,&p3,&p5,&p7,&p8 manages managedby worksfor Company &c worksfor

  11. XMLDocument Type Definitions • part of the original XML specification • an XML document may have a DTD • terminology for XML: • well-formed: if tags are correctly closed • valid: if it has a DTD and conforms to it • validation is useful in data exchange

  12. Very Simple DTD <!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)> ]>

  13. Very Simple DTD Example of valid XML document: <company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ... </company>

  14. Content Model • Element content: what we can put in an element (aka content model) • Content model: • Complex = a regular expression over other elements • Text-only = #PCDATA • Empty = EMPTY • Any = ANY • Mixed content = (#PCDATA | A | B | C)* • (i.e. very restrictied)

  15. Attributes in DTDs <!ELEMENT person (ssn, name, office, phone?)> <!ATTLIS personageCDATA #REQUIRED> <personage=“25”> <name> ....</name> ... </person>

  16. Attributes in DTDs <!ELEMENT person (ssn, name, office, phone?)> <!ATTLIS personageCDATA #REQUIRED idID #REQUIRED managerIDREF #REQUIRED managesIDREFS #REQUIRED > <personage=“25” id=“p29432” manager=“p48293” manages=“p34982 p423234”> <name> ....</name> ... </person>

  17. Attributes in DTDs Types: • CDATA = string • ID = key • IDREF = foreign key • IDREFS = foreign keys separated by space • (Monday | Wednesday | Friday) = enumeration • NMTOKEN = must be a valid XML name • NMTOKENS = multiple valid XML names • ENTITY = you don’t want to know this

  18. Attributes in DTDs Kind: • #REQUIRED • #IMPLIED = optional • value = default value • value #FIXED = the only value allowed

  19. Using DTDs • Must include in the XML document • Either include the entire DTD: • <!DOCTYPE rootElement [ ....... ]> • Or include a reference to it: • <!DOCTYPE rootElement SYSTEM “http://www.mydtd.org”> • Or mix the two... (e.g. to override the external definition)

  20. DTDs as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)> ]> <paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section> </paper>

  21. DTDs as Grammars • A DTD = a grammar • A valid XML document = a parse tree for that grammar

  22. DTDs as Schemas Not so well suited: • impose unwanted constraints on order<!ELEMENT person (name,phone)> • references cannot be constrained • can be too vague: <!ELEMENT person ((name|phone|email)*)> like an upper bound schema

More Related