450 likes | 569 Views
Lecture 08: XML and Semistructured Data. Outline. XML (Section 17) XML syntax, semistructured data Document Type Definitions (DTDs) XPath. Additional Readings on XML. XML http://www.w3.org/XML/1999/XML-in-10-points www.zvon.org/xxl/XMLTutorial/General/book_en.html
E N D
Outline • XML (Section 17) • XML syntax, semistructured data • Document Type Definitions (DTDs) • XPath
Additional Readings on XML • XML • http://www.w3.org/XML/1999/XML-in-10-points • www.zvon.org/xxl/XMLTutorial/General/book_en.html • http://db.bell-labs.com/galax/ • http://www.w3.org/TR/REC-xml-names (1/99) • Xpath • http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html • Xquery • http://www.w3.org/TR/xmlquery-use-cases/ • http://www.xmlportfolio.com/xquery.html • Main source: www.w3.org (but hard to read)
XML • eXtensible Markup Language • XML 1.0 – a recommendation from W3C, 1998 • Roots: SGML (used in publishing). • After the roots: a format for sharing data
XML Data • Relational data does not have a syntax • I can’t “give” you my relational database • Need to import it from other syntax, like CSV (comma-separated-values) • XML = rich syntax for data • But XML is not relational: semistructured • Usage: • Map any data to XML • Store it in files, exchange on the Web, etc. • Even query it directly, using XPath, XQuery
XML Data Sharing and Exchange application application object-relational Integrate XML Data WEB (HTTP) Transform Warehouse application relational data legacy data Specific data management tasks
From HTML to XML HTML describes the layout
HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteoul, Buneman, Suciu <br> Morgan Kaufmann, 1999
XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the structure
XML Terminology • tags: book, title, author, … • start tag: <book>, end tag: </book> • elements: <book>…</book>,<author>…</author> • elements are nested • empty element: <red></red> abbrv. <red/> • well formed XML document • if it has matching tags • tags are properly nested • single root element • and more constraints, e.g. on names
More XML: Attributes <bookprice = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> attributes are alternative ways to represent data
More XML: IDs and References <personid=“o555”> <name> Jane </name> </person> <personid=“o456”> <name> Mary </name> <childrenidref=“o123 o555”/> </person> <personid=“o123” mother=“o456”><name>John</name> </person> Scope of IDs and references is the document
More XML: CDATA Section • Syntax: <![CDATA[ .....any text here...]]> • Example: <example> <![CDATA[ some text here </notAtag> <>]]></example>
More XML: Entity References • Syntax: &entityname; • Used like macros • Example: <element> this is less than < </element> some predefined entities complete list: http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html
More XML: Processing Instructions • Syntax: <?target argument?> • Example: • Processed by external applications, e.g. php(bad style) <product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price></product>
More XML: Comments • Syntax <!-- .... Comment text... --> • Yes, they are part of the data model !!!
Elementnode Attributenode Textnode XML Data: a Tree ! data • <data> • <person id=“o555”> • <name> Mary </name> • <address> • <street> Maple </street> • <no> 345 </no> • <city> Seattle </city> • </address> • </person> • <person> • <name> John </name> • <address> Thailand </address> • <phone> 23456 </phone> • </person> • </data> person person id address name address name phone o555 street no city Mary Thai John 23456 Maple 345 Seattle Order matters !!!
<persons> <row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row> </persons> From Relational Data to XML Data XML: persons persons row row row phone name phone name phone name “John” 3634 “Sue” 6343 “Dick” 6363
XML Data • XML is self-describing • Schema elements become part of the data • Relational schema: persons(name,phone) • In XML <persons>, <name>, <phone> are part of the data, and are repeated many times • Consequence: XML is much more flexible • XML = semistructured data
Semi-structured Data Explained • Missing attributes: • Could represent ina table with nulls <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person> no phone !
Semi-structured Data Explained • Repeated attributes • Impossible in tables: <person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person> two phones ! ???
Semistructured Data Explained • Attributes with different types in different objects • Nested collections (no 1NF) • Heterogeneous collections: • <db> contains both <book>s and <publisher>s <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person> structured name !
Document Type DefinitionsDTD • part of the original XML specification • an XML document may have a DTD • XML document: well-formed = if tags are correctly closed valid = if it has a DTD and conforms to it • validation is useful in data exchange
Very Simple DTD <!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)> ]>
Very Simple DTD Example of valid XML document: <company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ... </company>
DTD: The Content Model <!ELEMENT tag (CONTENT)> • Content model: • Complex = a regular expression over other elements • Text-only = #PCDATA • Empty = EMPTY • Any = ANY • Mixed content = (#PCDATA | A | B | C)* contentmodel
DTD: Regular Expressions DTD XML sequence <!ELEMENT name (firstName, lastName)) <name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name> optional <name> <lastName> . . . . . </lastName> </name> <!ELEMENT name (firstName?, lastName)) <person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . . </person> star (repeated occurrence) <!ELEMENT person (name, phone*)) alternation <person> <name> . . . . . </name> <email> . . . . . </email> </person> <!ELEMENT person (name, (phone|email)))
DTD: Attributes • Document Type Definition<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST personage CDATA #REQUIRED "18"birthdate CDATA #IMPLIEDnationality CDATA #FIXED "CH"gender (male|female) "female"> • Document <personage="24" nationality="CH" gender="male"> <ssn> … </ssn> …<phone> … </phone> </person> mandatory optional default enumeration
DTD: Entities • DTD:<!ENTITY address SYSTEM "address.xml"><!ENTITY name "<name>Tim Berners Lee</name>"> • Document:<celebrity>&name;&address;</celebrity> internal entity external entity
Inclusion of DTD in Documents External DTD Declaration <?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" SYSTEM "http://www.test.org/test.dtd"><test> "test" is a document element </test> Internal DTD Declaration <!DOCTYPE test [ <!ELEMENT test EMPTY> ]><test/> Mixed usage <!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [ <!ENTITY hello "hello world">]><test>&hello;</test>
XML Namespaces • Different DTDs can use the same names! • how to avoid conflicts when combining names from different DTDs? • XML namespace is a collection of names (markup vocabulary) • identified by a prefix (URL reference)
XML Namespaces • name ::= [prefix:]localname default name space <book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'> <title> … </title> <number> 15 </number> <isbn:number> …. </isbn:number> </book> names belong to default name space
XML Namespaces • syntactic: <number> , <isbn:number> • semantic: URL used as unique identifier • URL may not exist, has no function <tagxmlns:mystyle = “http://…”> … <mystyle:title> … </mystyle:title> <mystyle:number> … </tag> Belong to this namespace
Querying XML Data • XPath = simple navigation through the tree • XQuery = the SQL of XML • XSLT = recursive traversal • will not discuss • XQuery and XSLT build on XPath
Sample Data for Queries <bib><book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><bookprice=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book> </bib>
bib Data Model for XPath The root The root element book book publisher author . . . . Addison-Wesley Serge Abiteboul
XPath: Simple Expressions /bib/book/year Result: <year> 1995 </year> <year> 1998 </year> Result: empty (there were no papers) /bib/paper/year
XPath: Restricted Kleene Closure //author Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author> Result: <first-name> Rick </first-name> /bib//first-name
XPath: Text Nodes /bib/book/author/text() Result: Serge Abiteboul Jeffrey D. Ullman Rick Hull doesn’t appear because he has firstname, lastname Functions in XPath: • text() = matches the text value • node() = matches any node (= * or @* or text()) • name() = returns the name of the current tag
XPath: Wildcard Result: <first-name> Rick </first-name> <last-name> Hull </last-name> * Matches any element //author/*
XPath: Attribute Nodes /bib/book/@price Result: “55” @price means that price is has to be an attribute
XPath: Predicates /bib/book/author[firstname] Result: <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
XPath: More Predicates Result: <lastname> … </lastname> <lastname> … </lastname> /bib/book/author[firstname][address[.//zip][city]]/lastname
XPath: More Predicates /bib/book[@price < “60”] /bib/book[author/@age < “25”] /bib/book[author/text()]
XPath: Summary bib matches a bib element * matches any element / matches the root element /bib matches a bib element under root bib/paper matches a paper in bib bib//paper matches a paper in bib, at any depth //paper matches a paper at any depth paper|book matches a paper or a book @price matches a price attribute bib/book/@price matches price attribute in book, in bib bib/book[@price<“55”]/author/lastname matches…