Managing XML and Semistructured Data

Presentation Transcript


  1. Managing XML and Semistructured Data Part 2: Modelling XML Data

  2. In this section…
     • More XML syntax [XML glossary – by Sun] [XML Tutorials]
     • XML DTD and XML Schema
     • XML Query data model
     • Comparison of XML with semistructured data
     Papers:
     • XML, Java, and the future of the Web, by Jon Bosak, Sun Microsystems.
     • W3C XML Query Data Model, Mary Fernandez, Jonathan Robie.
     • Extracting Schema from Semistructured Data, Nestorov, Abiteboul, Motwani. SIGMOD 98.
     • Data on the Web, Abiteboul, Buneman, Suciu: Section 3.3

  3. More XML Syntax: Attributes
         <book price="55" currency="USD">
           <title> Foundations of Databases </title>
           <author> Abiteboul </author>
           …
           <year> 1995 </year>
         </book>
     Attributes are an alternative way to represent data (single-valued, unordered).
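
A minimal sketch of the attribute/element distinction, using Python's standard `xml.etree.ElementTree`. The document is the book element from the slide with the elided part left out; the variable names are illustrative only. Attributes come back as a name-to-value mapping (one value per name, no meaningful order), while child elements come back as an ordered list.

```python
import xml.etree.ElementTree as ET

# The book element from the slide, without the elided content.
BOOK_XML = """
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Attributes: a mapping from name to a single string value.
attributes = dict(book.attrib)

# Child elements: an ordered sequence; the same tag could appear many times.
children = [(child.tag, child.text.strip()) for child in book]

print(attributes)
print(children)
```

Whether a value such as the price belongs in an attribute or in a child element is a modelling choice; the parser treats the two quite differently, as the two result shapes suggest.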

  4. More XML: Oids and References
         <person id="o555"> <name> Jane </name> </person>
         <person id="o456">
           <name> Mary </name>
           <children idrefs="o123 o555"/>
         </person>
         <person id="o123" mother="o456"> <name> John </name> </person>
     • Oids and references in XML are just syntax (ID, IDREF).
     • The value of an IDREF attribute must match the value of some ID attribute in the document.
     • The value of an IDREFS attribute can contain several references to elements with an ID attribute, separated by whitespace.
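
The reference rule above can be sketched in Python. This is an illustration under assumptions, not a DTD-aware implementation: `ElementTree` does not know which attributes are declared as ID or IDREFS, so the code simply treats `id` and `idrefs` as such by name, and the `<people>` wrapper element is added here only to make the three persons one well-formed document.

```python
import xml.etree.ElementTree as ET

# The three persons from the slide; <people> is a hypothetical wrapper.
PEOPLE_XML = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idrefs="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

def resolve_children(xml_text):
    """Map each parent's name to the names its idrefs attribute points at."""
    root = ET.fromstring(xml_text)
    # Index every person by its id attribute (assumed to be of type ID).
    by_id = {person.get("id"): person for person in root.iter("person")}
    result = {}
    for person in root.iter("person"):
        refs = person.find("children")
        if refs is None:
            continue
        # An IDREFS value is a whitespace-separated list of ids.
        ids = refs.get("idrefs", "").split()
        dangling = [i for i in ids if i not in by_id]
        if dangling:
            raise ValueError("IDREFS without matching ID: %s" % dangling)
        result[person.findtext("name")] = [by_id[i].findtext("name") for i in ids]
    return result

print(resolve_children(PEOPLE_XML))
```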

  5. XML Semantics: a Tree!
         <data>
           <person id="o555">
             <name> Mary </name>
             <address>
               <street> Maple </street>
               <no> 345 </no>
               <city> Seattle </city>
             </address>
           </person>
           <person>
             <name> John </name>
             <address> Thailand </address>
             <phone> 23456 </phone>
           </person>
         </data>
     [Figure: the same document drawn as a tree with root data, built from element nodes (person, name, address, street, no, city, phone), one attribute node (id = o555) and text nodes (Mary, Maple, 345, Seattle, John, Thailand, 23456).]
     Order matters!

  6. More XML: CDATA Section • Syntax: <![CDATA[ .....any text here...]]> • Example: <![CDATA[ <slide>..A sample slide..</slide> ]]> which displays as: <slide>..A sample slide.. </slide>

  7. More XML: Entity References • Entity references replace characters that are illegal in XML text (escape characters) • Syntax: &entityname; (a form of macro) • Example (what happens if we simply use < ?): <element> this is less than &lt; </element> • Some entities: &lt; (<), &gt; (>), &amp; (&), &apos; ('), &quot; (")
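
A small Python sketch of the two escaping mechanisms from the last two slides, using only the standard library. It shows that a raw `<` in text makes the document ill-formed, that `xml.sax.saxutils.escape` produces the entity reference, and that a CDATA section lets markup-like text through unchanged. The `<note>` wrapper element and the helper name `is_well_formed` are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

raw = "this is less than <"

# Using the character directly breaks the document.
unescaped_ok = is_well_formed("<element>" + raw + "</element>")

# Escaping turns "<" into the entity reference "&lt;".
escaped = escape(raw)
element = ET.fromstring("<element>" + escaped + "</element>")
round_trip = element.text  # the parser expands the reference again

# A CDATA section carries markup-like text without any escaping.
note = ET.fromstring("<note><![CDATA[<slide>..A sample slide..</slide>]]></note>")
cdata_text = note.text

print(unescaped_ok, escaped, round_trip, cdata_text)
```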

  8. More XML: Processing Instructions
     • Syntax: <?target argument?>, where target names the target application and argument is the data for processing
     • Example 1:
         <product>
           <name> Alarm Clock </name>
           <?ringBell 20?>
           <price> 19.99 </price>
         </product>
     • Example 2:
         <?wilfred.lecture.Program QUERY="MSc,PhD,all"?>
         <slide type="all">
           <title>COMP630H</title>
         </slide>
     Note: <?xml version="1.0"?> is not a PI.

  9. More XML: Comments • Syntax <!-- .... Comment text... --> • Yes, they are part of the data model !!!
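
To illustrate that processing instructions and comments really are nodes in the data model, here is a hedged sketch with Python's `xml.dom.minidom`, which keeps both kinds of node in the tree. The product document follows the processing-instruction example; the comment `price in USD` is an invented addition so that a comment node shows up.

```python
from xml.dom import minidom

# Product example with one PI and one (invented) comment.
PRODUCT_XML = (
    '<?xml version="1.0"?>'
    "<product>"
    "<name> Alarm Clock </name>"
    "<?ringBell 20?>"
    "<!-- price in USD -->"
    "<price> 19.99 </price>"
    "</product>"
)

def node_kinds(xml_text):
    """List the children of the root element together with their node kind."""
    doc = minidom.parseString(xml_text)
    kinds = []
    for node in doc.documentElement.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            kinds.append(("pi", node.target, node.data))
        elif node.nodeType == node.COMMENT_NODE:
            kinds.append(("comment", node.data.strip()))
        elif node.nodeType == node.ELEMENT_NODE:
            kinds.append(("element", node.tagName))
    return kinds

print(node_kinds(PRODUCT_XML))
```

The XML declaration on the first line does not appear in the result: only the children of the root element are listed, and the declaration is not a processing instruction in any case.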

  10. XML Namespaces
      • http://www.w3.org/TR/REC-xml-names (1/99)
      • name ::= [prefix:]localpart
          <book xmlns:isbn="www.isbn-org.org/def">
            <title> … </title>
            <number> 15 </number>
            <isbn:number> …. </isbn:number>
          </book>

  11. XML Namespaces
      • syntactic: <number>, <isbn:number>
      • semantic: provide a URL for the schema
      • a namespace declaration applies within the content of the element that declares it
      • multiple namespace prefixes can be declared
          <tag xmlns:mystyle="http://…">
            …
            <mystyle:title> … </mystyle:title>
            <mystyle:number> … </mystyle:number>
          </tag>
      The mystyle:title and mystyle:number elements belong to this namespace.
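
A short sketch of how a namespace-aware parser sees the two `number` elements, assuming Python's `xml.etree.ElementTree`. The elided element contents are kept as literal `...` placeholders. `ElementTree` expands a prefixed name to `{namespace-URI}localpart`, so the prefix itself is only syntax.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """
<book xmlns:isbn="www.isbn-org.org/def">
  <title>...</title>
  <number>15</number>
  <isbn:number>...</isbn:number>
</book>
"""

root = ET.fromstring(BOOK_XML)

# Prefixed names are expanded to {URI}localpart; unprefixed names stay bare.
tags = [child.tag for child in root]

# Lookups by prefix need a prefix-to-URI mapping supplied by the caller.
namespaces = {"isbn": "www.isbn-org.org/def"}
plain_numbers = root.findall("number")
isbn_numbers = root.findall("isbn:number", namespaces)

print(tags)
```

The two elements share the local part `number` but are different names, which is the point of the syntactic bullet above.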

  12. XML Data Models Several competing models: • Document Object Model (DOM): • http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001) • class hierarchy (node, element, attribute,…) • objects have behavior • defines API to inspect/modify the document • XPath data model • XML Query data model • Infoset (a set of information items of an XML document) • PSV (post schema validation) • http://www.w3.org/TR/xml-infoset/

  13. XML Data v.s. E/R, ODL, Relational • Q: is XML better or worse ? • A: serves different purposes • E/R, ODL, Relational models: • For centralized processing, when we control the data • XML: • Data sharing between different systems • we do not have control over the entire data on the Web • Data centric Vs Document centric documents • Do NOT use XML to model your data ! Use E/R, ODL, or relational instead. Use XML to exchange data instead.

  14. XLink
      • Generalizes HTML’s href
      • Many types: simple, extended, locator, ...
      • We discuss only simple links. A simple link associates exactly two resources, one local and one remote, with an arc going from the former to the latter. Thus, a simple link is always an outbound link.
          <person xmlns:xlink="http://www.w3.org/1999/xlink"
                  xlink:type="simple"
                  xlink:href="http://a.b.c/myhomepage.html"
                  xlink:title="The Homepage"
                  xlink:show="replace"
                  xlink:actuate="onRequest">
            .....
          </person>
      Callouts on the slide mark which of these attributes are required and which are optional.

  15. XLink • show attribute (specifies the desired presentation) can be • "new" (new window) • "replace" (same window) • "embed" • "other" • actuate attribute (specifies the desired timing of traversal) can be • "onLoad" (immediate loading) • "onRequest" (post-loading, event triggered) • "other" • "none"
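
A hedged sketch of reading a simple link in Python. The allowed `show` and `actuate` values are exactly the ones listed on the slide above, not a complete rendering of the XLink specification; the function name `read_simple_link` and the returned dictionary layout are made up for the example. The namespace URI is the standard XLink one.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# Value sets as listed on the slide (assumption: no other values are checked).
SHOW_VALUES = {"new", "replace", "embed", "other"}
ACTUATE_VALUES = {"onLoad", "onRequest", "other", "none"}

PERSON_XML = """
<person xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple"
        xlink:href="http://a.b.c/myhomepage.html"
        xlink:title="The Homepage"
        xlink:show="replace"
        xlink:actuate="onRequest"/>
"""

def read_simple_link(element):
    """Extract the XLink attributes of a simple link, checking show/actuate."""
    def xlink(name):
        return element.get("{%s}%s" % (XLINK_NS, name))

    if xlink("type") != "simple":
        raise ValueError("not a simple link")
    show, actuate = xlink("show"), xlink("actuate")
    if show is not None and show not in SHOW_VALUES:
        raise ValueError("unexpected show value: %r" % show)
    if actuate is not None and actuate not in ACTUATE_VALUES:
        raise ValueError("unexpected actuate value: %r" % actuate)
    return {"href": xlink("href"), "title": xlink("title"),
            "show": show, "actuate": actuate}

link = read_simple_link(ET.fromstring(PERSON_XML))
print(link)
```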

  16. XLink • href attribute: • a URI or • an XPointer (next) • More about XLink can be found in: • [http://www.w3.org/TR/xlink/]

  17. XPointer • An extension of XPath (next week) • Usage: • href=“www.a.b.c/document.xml#xpointerExpr” • An XPointer expression points to: • A point • A range • Reference [http://www.w3.org/TR/2001/CR-xptr-20010911/]

  18. XPointer • Pointing to a point (= XML element or character) • Full form: e.g. #xpointer(id("3652")) • Bare name: e.g. #3652 • Child sequence: e.g. #xpointer( /1/3/2/5), #xpointer( /bib/book[3]) • Pointing to a range: e.g. #xpointer(id(3652 to 44)) • Most interesting examples use XPath

  19. XML vs. Semistructured Data
      • SSD integrates heterogeneous sources with non-rigid structure, e.g. biological data, Web data
          {lecture: {title: "XML", date: "1-Jan-2005", instructor: {name: "Wilfred", department: "CS"} } }
      • both are best described by a graph
      • both are schema-less and self-describing

  20. Similarities and Differences
      XML:  <person id="o123"> <name> Alan </name> <age> 42 </age> <email> ab@com </email> </person>
      SSD:  { person: &o123 { name: "Alan", age: 42, email: "ab@com" } }
      XML:  <person father="o123"> … </person>
      SSD:  { person: { father: &o123 … } }
      [Figure: the XML and SSD encodings drawn as graphs, each with a person node whose name, age and email edges lead to Alan, 42 and ab@com, and a father reference to that person.]
      Similar on trees, different on graphs.

  21. More Differences • XML is ordered, SSD is not • XML can mix text and elements: <talk> Teaching XML is horrible <speaker> Wilfred Ng </speaker> </talk> • XML has lots of other stuff: entities, processing instructions, comments • These differences make XML data management harder
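
The ordering difference can be made concrete with a small Python sketch. It takes the lecture object from the semistructured-data slide as a nested dictionary and serializes it to XML with `xml.etree.ElementTree`; the helper `to_element` is hypothetical. Two dictionaries with the same label-value pairs compare equal, yet they serialize to different XML documents because element order is significant.

```python
import xml.etree.ElementTree as ET

def to_element(label, value):
    """Turn a labelled SSD value (nested dicts with atomic leaves) into an element."""
    element = ET.Element(label)
    if isinstance(value, dict):
        for child_label, child_value in value.items():
            element.append(to_element(child_label, child_value))
    else:
        element.text = str(value)
    return element

def to_xml(ssd):
    """Serialize a single-rooted SSD object such as {"lecture": {...}}."""
    (root_label, root_value), = ssd.items()
    return ET.tostring(to_element(root_label, root_value), encoding="unicode")

lecture = {"lecture": {"title": "XML", "date": "1-Jan-2005",
                       "instructor": {"name": "Wilfred", "department": "CS"}}}
# Same label-value pairs, listed in a different order.
reordered = {"lecture": {"date": "1-Jan-2005", "title": "XML",
                         "instructor": {"department": "CS", "name": "Wilfred"}}}

print(to_xml(lecture))
print(lecture == reordered, to_xml(lecture) == to_xml(reordered))
```

The serialization order here is simply the dictionary's insertion order, which is an arbitrary choice forced by the move from an unordered model to an ordered one.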

  22. Document Type Definitions (DTD) • part of the original XML specification • an XML document may have a DTD • an XML document is well-formed if its tags are correctly closed and nested, and valid if it has a DTD and conforms to it • validation is useful in data exchange

  23. Very Simple DTD
          <!DOCTYPE company [
            <!ELEMENT company ((person|product)*)>
            <!ELEMENT person (ssn, name, office, phone?)>
            <!ELEMENT ssn (#PCDATA)>
            <!ELEMENT name (#PCDATA)>
            <!ELEMENT office (#PCDATA)>
            <!ELEMENT phone (#PCDATA)>
            <!ELEMENT product (pid, name, description?)>
            <!ELEMENT pid (#PCDATA)>
            <!ELEMENT description (#PCDATA)>
          ]>

  24. Very Simple DTD
      Example of a valid XML document:
          <company>
            <person>
              <ssn> 123456789 </ssn>
              <name> John </name>
              <office> B432 </office>
              <phone> 1234 </phone>
            </person>
            <person>
              <ssn> 987654321 </ssn>
              <name> Jim </name>
              <office> B123 </office>
            </person>
            <product> ... </product>
            ...
          </company>

  25. DTD: The Content Model <!ELEMENT tag (CONTENT)>, where CONTENT is the content model • Content model: • Complex = a regular expression over other elements • Text-only = #PCDATA • Empty = EMPTY • Any = ANY • Mixed content = (#PCDATA | A | B | C)*

  26. DTD: Regular Expressions
      • sequence
          DTD: <!ELEMENT name (firstName, lastName)>
          XML: <name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name>
      • optional
          DTD: <!ELEMENT name (firstName?, lastName)>
      • Kleene star
          DTD: <!ELEMENT person (name, phone*)>
          XML: <person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . . </person>
      • alternation
          DTD: <!ELEMENT person (name, (phone|email))>
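
Since a complex content model is a regular expression over element names, it can be checked with an ordinary regex engine. The sketch below is one possible encoding, not how a real validating parser works: each child tag is written as `tag;`, each name in the model becomes a group matching that token, commas (sequence) disappear, and `|`, `?`, `*`, `+` and parentheses keep their usual regex meaning. It covers element-only models; `#PCDATA`, `EMPTY` and `ANY` are not handled. The function names are invented for the example.

```python
import re

# An XML-name-like token inside a content model (simplified).
NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def content_model_to_regex(model):
    """Translate an element-only DTD content model into a compiled regex."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching the token "name;".
    with_names = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", compact)
    # Sequence is plain concatenation in a regex, so commas are dropped.
    return re.compile(with_names.replace(",", ""))

def children_match(model, child_tags):
    """Check a list of child tags against a content model."""
    encoded = "".join(tag + ";" for tag in child_tags)
    return content_model_to_regex(model).fullmatch(encoded) is not None

print(children_match("(name, phone*)", ["name", "phone", "phone"]))
print(children_match("(name, (phone|email))", ["name"]))
```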

  27. Attributes in DTDs
          <!ELEMENT person (ssn, name, office, phone?)>
          <!ATTLIST person age CDATA #REQUIRED>

          <person age="25">
            <name> ....</name>
            ...
          </person>

  28. Attributes in DTDs
          <!ELEMENT person (ssn, name, office, phone?)>
          <!ATTLIST person age     CDATA  #REQUIRED
                           id      ID     #REQUIRED
                           manager IDREF  #REQUIRED
                           manages IDREFS #REQUIRED>

          <person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
            <name> ....</name>
            ...
          </person>

  29. Attributes in DTDs Types: • CDATA = string • ID = key • IDREF = foreign key • IDREFS = foreign keys separated by spaces • (Monday | Wednesday | Friday) = enumeration • NMTOKEN = must be a valid XML name token • NMTOKENS = multiple valid XML name tokens • ENTITY = you don’t want to know this Kind: • #REQUIRED • #IMPLIED = optional • value = default value • #FIXED value = the only value allowed

  30. Using DTDs • Must include in the XML document • Either include the entire DTD: • <!DOCTYPE rootElement [ ....... ]> • Or include a reference to it: • <!DOCTYPE rootElement SYSTEM "http://www.mydtd.org"> • Or mix the two... (e.g. to override the external definition)

  31. DTDs as Grammars
          <!DOCTYPE paper [
            <!ELEMENT paper (section*)>
            <!ELEMENT section ((title,section*) | text)>
            <!ELEMENT title (#PCDATA)>
            <!ELEMENT text (#PCDATA)>
          ]>

          <paper>
            <section> <text> </text> </section>
            <section>
              <title> </title>
              <section> … </section>
              <section> … </section>
            </section>
          </paper>
      A DTD = a grammar. A valid XML document = a parse tree for that grammar.
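
Reading the DTD as a grammar suggests a recursive-descent style check, one function per nonterminal. The Python sketch below hand-codes the two element rules of the paper DTD; it is an illustration of the grammar view, not a general DTD validator, and it does not verify that `title` and `text` contain only character data. The sample documents fill the elided sections with an empty `<text/>`, which is an assumption.

```python
import xml.etree.ElementTree as ET

def is_valid_section(section):
    """section ::= (title, section*) | text"""
    tags = [child.tag for child in section]
    if tags == ["text"]:
        return True
    if tags[:1] == ["title"] and all(tag == "section" for tag in tags[1:]):
        return all(is_valid_section(child) for child in section[1:])
    return False

def is_valid_paper(xml_text):
    """paper ::= section*"""
    paper = ET.fromstring(xml_text)
    return paper.tag == "paper" and all(
        child.tag == "section" and is_valid_section(child) for child in paper
    )

VALID = ("<paper><section><text/></section>"
         "<section><title/><section><text/></section><section><text/></section></section>"
         "</paper>")
INVALID = "<paper><section><text/><title/></section></paper>"

print(is_valid_paper(VALID), is_valid_paper(INVALID))
```

Each successful call corresponds to one node of the parse tree, which is exactly the element tree of the document.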

  32. DTDs as Schemas Not so well suited: • impose unwanted constraints on order: <!ELEMENT person (name,phone)> • references cannot be constrained • can be too vague: <!ELEMENT person ((name|phone|email)*)>

  33. XML Schemas • generalizes DTDs • uses XML syntax • two documents: structure and datatypes • www.w3.org/TR/2001/REC-xmlschema-1-20010502 • www.w3.org/TR/2001/REC-xmlschema-2-20010502 • XML Schemas • Elements v. Types • Regular expressions • Expressive power • XML-Schema is very complex • often criticized • some alternative proposals

  34. XML Schemas
          <xs:element name="paper" type="papertype"/>
          <xs:complexType name="papertype">
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="author" minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="year"/>
              <xs:choice>
                <xs:element name="journal"/>
                <xs:element name="conference"/>
              </xs:choice>
            </xs:sequence>
          </xs:complexType>
      DTD: <!ELEMENT paper (title, author*, year, (journal|conference))>

  35. Elements vs. Types in XML Schema
      Type defined inside the element:
          <xs:element name="person">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="name" type="xs:string"/>
                <xs:element name="address" type="xs:string"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
      Named type referenced by the element:
          <xs:element name="person" type="ttt"/>
          <xs:complexType name="ttt">
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
      DTD: <!ELEMENT person (name,address)>

  36. Elements v.s. Types in XML Schema • Types: • Simple types (integers, strings, ...) • Complex types (regular expressions, like in DTDs) • Element-type-element alternation: • Root element has a complex type • That type is a regular expression of elements • Those elements have their complex types... • ... • On the leaf nodes we have simple types

  37. Simple Types • string • token • byte • unsignedByte • integer • positiveInteger • int (32-bit, a restriction of integer) • unsignedInt • long • short • ... • time • dateTime • duration • date • ID • IDREF • IDREFS

  38. Facets of Simple Types • Facets = additional properties restricting a simple type • 15 facets defined by XML Schema • Examples: length, minLength, maxLength, pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive, totalDigits, fractionDigits

  39. Facets of Simple Types • Can further restrict a simple type by changing some facets • Restriction = subset
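
"Restriction = subset" can be illustrated by modelling a simple type as a membership test on lexical values. The Python sketch below is only a model of the idea, not an XML Schema processor; the helper `restrict`, the type name `is_my_integer` and the bounds 10000 to 99999 are hypothetical choices for the example, and only three facets (minInclusive, maxInclusive, pattern) are modelled.

```python
import re

def is_integer(lexical):
    """Membership test for a base type: optionally signed decimal integers."""
    return re.fullmatch(r"[+-]?[0-9]+", lexical) is not None

def restrict(base_accepts, min_inclusive=None, max_inclusive=None, pattern=None):
    """Derive a new membership test from a base type by tightening facets."""
    def accepts(lexical):
        if not base_accepts(lexical):
            return False  # a restriction can never accept more than its base
        if pattern is not None and re.fullmatch(pattern, lexical) is None:
            return False
        value = int(lexical)
        if min_inclusive is not None and value < min_inclusive:
            return False
        if max_inclusive is not None and value > max_inclusive:
            return False
        return True
    return accepts

# Hypothetical restricted type: five-digit integers.
is_my_integer = restrict(is_integer, min_inclusive=10000, max_inclusive=99999)

samples = ["20003", "15037", "7", "100000", "abc", "-5"]
accepted_by_base = {s for s in samples if is_integer(s)}
accepted_by_restriction = {s for s in samples if is_my_integer(s)}

print(sorted(accepted_by_restriction), accepted_by_restriction <= accepted_by_base)
```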

  40. Not so Simple Types
      • List types:
          <xs:simpleType name="listOfMyIntType">
            <xs:list itemType="myInteger"/>
          </xs:simpleType>

          <listOfMyInt>20003 15037 95977 95945</listOfMyInt>
      • Union types
      • Restriction types

  41. Local and Global Types in XML Schema
      • Local type:
          <xs:element name="person">
            [define locally the person’s type]
          </xs:element>
      • Global type:
          <xs:element name="person" type="ttt"/>
          <xs:complexType name="ttt">
            [define here the type ttt]
          </xs:complexType>
      Global types can be reused in other elements.

  42. Local vs. Global Elements in XML Schema
      • Local element:
          <xs:complexType name="ttt">
            <xs:sequence>
              <xs:element name="address" type="..."/>
              ...
            </xs:sequence>
          </xs:complexType>
      • Global element:
          <xs:element name="address" type="ttt"/>
          <xs:complexType name="ttt">
            <xs:sequence>
              <xs:element ref="address"/>
              ...
            </xs:sequence>
          </xs:complexType>
      Global elements: like in DTDs.

  43. Regular Expressions in XML Schema Recall the element-type-element alternation: <xs:complexType name="...."> [regular expression on elements] </xs:complexType> Regular expressions: • <xs:sequence> A B C </...> = A B C • <xs:choice> A B C </...> = A | B | C • <xs:group> A B C </...> = (A B C) • <xs:... minOccurs="0" maxOccurs="unbounded"> ..</...> = (...)* • <xs:... minOccurs="0" maxOccurs="1"> ..</...> = (...)?

  44. Local Names in XML-Schema
          <xs:element name="person">
            <xs:complexType>
              . . . . .
              <xs:element name="name">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="firstname" type="xs:string"/>
                    <xs:element name="lastname" type="xs:string"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              . . . .
            </xs:complexType>
          </xs:element>

          <xs:element name="product">
            <xs:complexType>
              . . . . .
              <xs:element name="name" type="xs:string"/>
            </xs:complexType>
          </xs:element>
      name has different meanings in person and in product.

  45. Subtle Use of Local Names
          <xs:complexType name="oneB">
            <xs:choice>
              <xs:element name="B" type="xs:string"/>
              <xs:sequence>
                <xs:element name="A" type="onlyAs"/>
                <xs:element name="A" type="oneB"/>
              </xs:sequence>
              <xs:sequence>
                <xs:element name="A" type="oneB"/>
                <xs:element name="A" type="onlyAs"/>
              </xs:sequence>
            </xs:choice>
          </xs:complexType>

          <xs:element name="A" type="oneB"/>

          <xs:complexType name="onlyAs">
            <xs:choice>
              <xs:sequence>
                <xs:element name="A" type="onlyAs"/>
                <xs:element name="A" type="onlyAs"/>
              </xs:sequence>
              <xs:element name="A" type="xs:string"/>
            </xs:choice>
          </xs:complexType>
      Arbitrarily deep binary tree with A elements, and a single B element.
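
The tree language described by the two types can be written as a pair of mutually recursive checks, one per type. This Python sketch mirrors the structure of `oneB` and `onlyAs` as given on the slide; it is a reading aid for the definition, not an XML Schema validator, and it ignores the text content of the string-typed leaves. The function names are invented.

```python
import xml.etree.ElementTree as ET

def is_only_as(node):
    """Type onlyAs: either two A children of type onlyAs, or one string-typed A."""
    kids = list(node)
    tags = [kid.tag for kid in kids]
    if tags == ["A"]:
        return len(kids[0]) == 0  # a string-typed A has no element children
    return tags == ["A", "A"] and all(is_only_as(kid) for kid in kids)

def is_one_b(node):
    """Type oneB: a single B, or two A children of which exactly one is oneB."""
    kids = list(node)
    tags = [kid.tag for kid in kids]
    if tags == ["B"]:
        return len(kids[0]) == 0
    if tags == ["A", "A"]:
        left, right = kids
        return (is_only_as(left) and is_one_b(right)) or \
               (is_one_b(left) and is_only_as(right))
    return False

def is_valid_document(xml_text):
    """The root element is an A of type oneB."""
    root = ET.fromstring(xml_text)
    return root.tag == "A" and is_one_b(root)

ONE_B = "<A><A><A>x</A></A><A><B>y</B></A></A>"
TWO_BS = "<A><A><B>x</B></A><A><B>y</B></A></A>"

print(is_valid_document(ONE_B), is_valid_document(TWO_BS))
```

The same tag `A` is checked by two different functions depending on where it occurs, which is what local names with different types make possible and what a DTD cannot express.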

  46. Attributes in XML Schema
          <xs:element name="paper" type="papertype"/>
          <xs:complexType name="papertype">
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              . . . . . .
            </xs:sequence>
            <xs:attribute name="language" type="xs:NMTOKEN" fixed="English"/>
          </xs:complexType>
      Attributes are associated with the type, not with the element. They can be attached only to complex types; it takes more work to add attributes to simple types.

  47. “Mixed” Content, “Any” Type
          <xs:complexType mixed="true"> . . . .
      • Better than in DTDs: can still enforce the type, but now may have text between any elements
          <xs:element name="anything" type="xs:anyType"/> . . . .
      • Means anything is permitted there

  48. “All” Group
          <xs:complexType name="PurchaseOrderType">
            <xs:all>
              <xs:element name="shipTo" type="USAddress"/>
              <xs:element name="billTo" type="USAddress"/>
              <xs:element ref="comment" minOccurs="0"/>
              <xs:element name="items" type="Items"/>
            </xs:all>
            <xs:attribute name="orderDate" type="xs:date"/>
          </xs:complexType>
      • A restricted form of & in SGML
      • Restrictions:
      • Only at top level
      • Has only elements
      • Each element occurs at most once
      • E.g. “comment” occurs 0 or 1 times
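
The matching rule of an all group (any order, each element at most once, required ones present) is easy to state as code. The Python sketch below checks a list of child tags against the PurchaseOrderType group above; the function name and its `required`/`optional` parameters are made up for the example, and the types of the children are not checked.

```python
def matches_all_group(child_tags, required, optional=()):
    """Check child tags against an xs:all group: any order, each at most once."""
    allowed = set(required) | set(optional)
    seen = set()
    for tag in child_tags:
        if tag not in allowed or tag in seen:
            return False  # unknown element, or a second occurrence
        seen.add(tag)
    # Every element without minOccurs="0" must have appeared.
    return set(required) <= seen

REQUIRED = ("shipTo", "billTo", "items")
OPTIONAL = ("comment",)

print(matches_all_group(["items", "billTo", "shipTo"], REQUIRED, OPTIONAL))
print(matches_all_group(["shipTo", "billTo", "comment", "comment", "items"],
                        REQUIRED, OPTIONAL))
```

Expressing the same unordered group as a DTD regular expression would require listing every permutation of the children, which is why the order-free form is useful.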

  49. Derived Types by Extensions
          <complexType name="Address">
            <sequence>
              <element name="street" type="string"/>
              <element name="city" type="string"/>
            </sequence>
          </complexType>

          <complexType name="USAddress">
            <complexContent>
              <extension base="ipo:Address">
                <sequence>
                  <element name="state" type="ipo:USState"/>
                  <element name="zip" type="positiveInteger"/>
                </sequence>
              </extension>
            </complexContent>
          </complexType>
      Corresponds to inheritance.

  50. Derived Types by Restrictions
          <complexContent>
            <restriction base="ipo:Items">
              … [rewrite the entire content, with restrictions] ...
            </restriction>
          </complexContent>
      • The rewritten content may restrict cardinalities, e.g. (0, infinity) to (1, 1); may restrict choices; other restrictions…
      Corresponds to set inclusion.
