Introduction to XML Yanlei Diao UMass Amherst April 19, 2007 Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.
Structure in Data Representation • Relational data is highly structured • structure is defined by the schema • good for system design • good for precise query semantics / answers • Structure can be limiting • data exchange is hard: requires integrating different schemas • authoring is constrained: schema-first • querying is constrained: must know the schema • changes to structure are not easy
Data Integration 1. Find all departments whose total employee salaries exceed 1% of the budget of the company. 2. Find names of employees with the top sales record last month. (Figure: sites in Australia, US, Asia, and Europe connected over the Internet.)
WWW Integration of Text and Structured Data Structured data - Databases Semistructured Data Unstructured Text - Documents
Need for A New Data Model Loose (and rich) structure • Integration of structured, but heterogeneous data sources • Evolving, unknown, or irregular structure • Textual data with tags and links • Combination of data models
XML: Universal Data Exchange Format • XML is the confluence of many factors: • Databases needed a more flexible interchange format. • Data needed to be generated and consumed by applications. • The Web needed a more declarative format for data. • Documents needed a mechanism for extended tags. • XML was originally proposed for online publishing and is becoming the wire format for data exchange. • W3C Recommendation: http://www.w3.org/TR/REC-xml/
From HTML to XML HTML describes the presentation.
HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999
XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content!
XML Syntax • Tags: book, title, author, … • start tag: <book> • end tag: </book> • Elements: <book>…</book>,<author>…</author> • elements are nested • empty element: <red></red>, abbrv. <red/> • An XML document: single root element An XML document is well formed if it has matching tags
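The well-formedness rules above (matching, properly nested tags and a single root element) can be checked with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the slides:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a single-rooted XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents for illustration.
good = "<book><title>Foundations of Databases</title><red/></book>"
bad_nesting = "<book><title>Foundations of Databases</book></title>"
two_roots = "<book/><book/>"

print(is_well_formed(good))         # properly nested, one root
print(is_well_formed(bad_nesting))  # end tags in the wrong order
print(is_well_formed(two_roots))    # more than one root element
```

The parser only checks syntax here; it says nothing about whether the elements make sense for a given application.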
XML Syntax <book price="55" currency="USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> Attributes are alternative ways to represent data.
XML Syntax <person id="o555"> <name> Jane </name> </person> <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person> <person id="o123" mother="o456"> <name> John </name> </person> Oids and references in XML are just syntax.
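Because oids and references are just attribute strings, it is up to the application to follow them. A minimal sketch of one way to do that, assuming the three person elements are wrapped in a hypothetical `<people>` root (the slide shows no root) and that `id`, `idref`, and `mother` are the reference-bearing attributes:

```python
import xml.etree.ElementTree as ET

# The <people> wrapper is an assumption; a document needs a single root.
doc = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(doc)

# Index every person by its id so references can be looked up.
by_id = {p.get("id"): p for p in root.iter("person")}

def name_of(oid: str) -> str:
    """Follow a reference and return the referenced person's name."""
    return by_id[oid].findtext("name").strip()

mary = by_id["o456"]
children = [name_of(ref) for ref in mary.find("children").get("idref").split()]
john_mother = name_of(by_id["o123"].get("mother"))

print(children)     # names of Mary's children, in reference order
print(john_mother)  # name of John's mother
```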
XML Semantics: a Tree! <data> <person id="o555"> <name> Mary </name> <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address> </person> <person> <name> John </name> <address> Thailand </address> <phone> 23456 </phone> </person> </data> (Figure: the document drawn as a tree of element nodes, attribute nodes, and text nodes, rooted at data with two person children.) IDREF will turn it into a graph. Order matters!
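The tree view of this document can be made concrete by walking a parsed document and listing its element, attribute, and text nodes in document order. This is a sketch only: the function `walk` is an illustrative helper, and it ignores text that follows a child element (mixed content), which this example does not contain.

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person id="o555">
    <name>Mary</name>
    <address><street>Maple</street><no>345</no><city>Seattle</city></address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>"""

def walk(elem, depth=0):
    """Yield (depth, kind, label) for each node, in document order."""
    yield depth, "element", elem.tag
    for key, value in elem.attrib.items():
        yield depth + 1, "attribute", f"{key}={value}"
    # Whitespace-only text is indentation, not data, so skip it.
    if elem.text and elem.text.strip():
        yield depth + 1, "text", elem.text.strip()
    for child in elem:
        yield from walk(child, depth + 1)

nodes = list(walk(ET.fromstring(doc)))
for depth, kind, label in nodes:
    print("  " * depth + f"{kind}: {label}")
```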
XML Data • XML is self-describing • Schema elements become part of the data • Relational schema: persons(name,phone) • In XML <persons>, <name>, <phone> are part of the data, and are repeated many times • Consequence: XML is much more flexible Some real data: http://www.cs.washington.edu/research/xmldatasets/
Relational Data as XML A relation person(name, phone) with rows (John, 3634), (Sue, 6343), (Dick, 6363). XML: <person> <row> <name>John</name> <phone>3634</phone> </row> <row> <name>Sue</name> <phone>6343</phone> </row> <row> <name>Dick</name> <phone>6363</phone> </row> </person> (Figure: the same data as a tree, a person root with three row children, each with name and phone leaves.)
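The mapping from a relation to XML is mechanical: one root element for the table, one `row` element per tuple, and one child element per column. A minimal sketch under that assumption; the function name `relation_to_xml` is illustrative:

```python
import xml.etree.ElementTree as ET

# The person relation from the slide, as (name, phone) tuples.
columns = ("name", "phone")
rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

def relation_to_xml(table, columns, rows):
    """Build <table><row><col>value</col>...</row>...</table>."""
    root = ET.Element(table)
    for row in rows:
        row_elem = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            ET.SubElement(row_elem, col).text = value
    return root

person = relation_to_xml("person", columns, rows)
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```

Note how the column names, which appear once in a relational schema, are repeated in every row of the XML.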
XML is Semi-structured Data • Missing attributes: • Could represent in a table with nulls <data> <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person> </data> ← no phone !
XML is Semi-structured Data • Repeated attributes • Impossible in tables: nested collections (non 1NF) <person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person> ← two phones ! ???
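Both cases, a missing phone and repeated phones, show up naturally when extracting rows from such a document. In this sketch the three persons from the two slides are combined into one hypothetical `<data>` document for illustration:

```python
import xml.etree.ElementTree as ET

# Assumption: the persons from both slides merged into a single document.
doc = """<data>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</data>"""

root = ET.fromstring(doc)

# Flat view: one phone per person. A missing phone becomes None,
# which corresponds to a NULL in a relational table.
first_phone = [(p.findtext("name"), p.findtext("phone")) for p in root]

# Nested view: a list of phones per person. A flat 1NF table cannot
# hold this directly; it would need a separate phone table.
all_phones = {p.findtext("name"): [ph.text for ph in p.findall("phone")]
              for p in root}

print(first_phone)
print(all_phones)
```

The flat view silently drops Mary's second phone, which is exactly the loss the slide warns about.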
XML is Semi-structured Data • Attributes with different types in different objects • Mixed content: • <db> contains both <book>s and <publisher>s <data> <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person> <person> <name> M. Carey</name> <phone>3456</phone> </person> </data> ← structured name ! ← unstructured name !
Data Typing in XML • Data typing in the relational model: schema • Data typing in XML • Much more complex • Typing restricts valid trees that can occur • theoretical foundation: tree languages • Practical methods: • DTD (Document Type Definition) • XML Schema
Document Type Definitions (DTD) • Part of the original XML specification • To be replaced by XML Schema • Much more complex • An XML document may have a DTD • XML document: • Well-formed = tags are correctly closed • Valid = it has a DTD and conforms to it • Validation is useful in data exchange
DTD Example <!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)> ]>
DTD Example Example of valid XML document: <company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ... </company>
DTD: The Content Model <!ELEMENT tag (CONTENT)> where CONTENT is the content model • Content model: • Complex = a regular expression over other elements • Text-only = #PCDATA • Empty = EMPTY • Any = ANY • Mixed content = (#PCDATA | A | B | C)*
DTD: Regular Expressions • sequence: DTD <!ELEMENT name (firstName, lastName)> XML <name> <firstName> ... </firstName> <lastName> ... </lastName> </name> • optional: DTD <!ELEMENT name (firstName?, lastName)> • Kleene star: DTD <!ELEMENT person (name, phone*)> XML <person> <name> ... </name> <phone> ... </phone> <phone> ... </phone> <phone> ... </phone> ... </person> • alternation: DTD <!ELEMENT person (name, (phone|email))>
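Since a complex content model is a regular expression over element names, checking one element against its model can be sketched with an ordinary regex engine. This is not a DTD validator: it is a minimal illustration that handles only element names with `,`, `|`, `?`, `*`, and `+` (no `#PCDATA`, `EMPTY`, or `ANY`), and the helper names and sample data are assumptions.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model: str) -> "re.Pattern[str]":
    """Translate a DTD content model over element names into a regex.

    Each element name N becomes the group (?:<N>); a comma (sequence)
    becomes plain concatenation; |, ?, *, + keep their regex meaning.
    """
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:<" + re.escape(m.group(0)) + ">)",
                     model)
    pattern = re.sub(r"[\s,]", "", pattern)
    return re.compile(pattern)

def children_match(elem, model: str) -> bool:
    """Check the sequence of child tags of `elem` against the model."""
    sequence = "".join(f"<{child.tag}>" for child in elem)
    return content_model_to_regex(model).fullmatch(sequence) is not None

# Hypothetical sample element with one name and two phones.
person = ET.fromstring(
    "<person><name>Ann</name><phone>1</phone><phone>2</phone></person>")

print(children_match(person, "(name, phone*)"))         # True
print(children_match(person, "(name, (phone|email))"))  # False
```

The second check fails because the alternation admits exactly one `phone` or `email` after `name`, while the element has two phones.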
Attributes in DTDs <!ELEMENT person (ssn, name, office, phone?)> <!ATTLIST person age CDATA #REQUIRED> <person age="25"> <name> ....</name> ... </person>
Attributes in DTDs <!ELEMENT person (ssn, name, office, phone?)> <!ATTLIST person age CDATA #REQUIRED id ID #REQUIRED manager IDREF #REQUIRED manages IDREFS #REQUIRED > <person age="25" id="p29432" manager="p48293" manages="p34982 p423234"> <name> ....</name> ... </person>
Attributes in DTDs Types: • CDATA = string • ID = key • IDREF = foreign key • IDREFS = foreign keys separated by space • (Monday | Wednesday | Friday) = enumeration
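The key and foreign-key reading of ID, IDREF, and IDREFS amounts to two checks: ID values are unique within the document, and every reference names an existing ID. A minimal sketch of those two checks only; the attribute-name tuples stand in for the ATTLIST declarations, and the sample `<company>` document (which omits the other required attributes and children) is a hypothetical illustration:

```python
import xml.etree.ElementTree as ET

# Assumed declarations, mirroring an ATTLIST with id ID,
# manager IDREF, and manages IDREFS.
ID_ATTRS = ("id",)
IDREF_ATTRS = ("manager",)
IDREFS_ATTRS = ("manages",)

def check_references(root):
    """Return a list of duplicate-ID and dangling-reference problems."""
    problems = []
    seen = set()
    # Pass 1: collect IDs and report duplicates.
    for elem in root.iter():
        for attr in ID_ATTRS:
            value = elem.get(attr)
            if value is None:
                continue
            if value in seen:
                problems.append(f"duplicate ID {value}")
            seen.add(value)
    # Pass 2: every IDREF / IDREFS token must name a collected ID.
    for elem in root.iter():
        refs = [elem.get(a) for a in IDREF_ATTRS if elem.get(a) is not None]
        for attr in IDREFS_ATTRS:
            refs.extend((elem.get(attr) or "").split())
        for ref in refs:
            if ref not in seen:
                problems.append(f"dangling reference {ref}")
    return problems

doc = """<company>
  <person id="p29432" manager="p48293" manages="p34982 p423234"/>
  <person id="p48293"/>
  <person id="p34982"/>
</company>"""

problems = check_references(ET.fromstring(doc))
print(problems)  # ['dangling reference p423234']
```

Two passes are needed because a reference may point forward to an ID that appears later in the document.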
Attributes in DTDs Kind: • #REQUIRED • #IMPLIED = optional • value = default value • #FIXED value = the only value allowed
Using DTDs • Must be included in the XML document • Either include the entire DTD: • <!DOCTYPE rootElement [ ....... ]> • Or include a reference to it: • <!DOCTYPE rootElement SYSTEM "http://www.mydtd.org"> • Or mix the two... (e.g. to override the external definition)
XML Schema • DTDs capture grammatical structure, but have some drawbacks: • Not themselves in XML, inconvenient to build tools • Don’t capture database datatypes’ domains • No way of defining OO-like inheritance… • XML Schema addresses shortcomings of DTDs • XML syntax • Subclassing • Domains and built-in datatypes • Min. and max. # of occurrences of elements • http://www.w3.org/XML/Schema
Basics of XML Schema • Need to use the XML Schema namespace (generally named xsd) • simpleTypes are a way of restricting domains on scalars • Can define a simpleType based on integer, with values within a particular range • complexTypes are a way of defining element structures • Basically equivalent to !ELEMENT, but more powerful • Specify sequence, choice between child elements • Specify minOccurs and maxOccurs (default 1) • Must associate an element/attribute with a simpleType, or an element with a complexType
Simple Schema Example <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <xsd:element name="mastersthesis" type="ThesisType"/> <xsd:complexType name="ThesisType"> <xsd:sequence> <xsd:element name="author" type="xsd:string"/> <xsd:element name="title" type="xsd:string"/> <xsd:element name="year" type="xsd:integer"/> <xsd:element name="school" type="xsd:string"/> <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/> </xsd:sequence> <xsd:attribute name="mdate" type="xsd:date"/> <xsd:attribute name="key" type="xsd:string"/> <xsd:attribute name="advisor" type="xsd:string"/> </xsd:complexType> </xsd:schema> (Attribute declarations must follow the sequence inside a complexType; CommitteeType is defined elsewhere.)
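One practical benefit of a schema being written in XML is that ordinary XML tools can read it. The sketch below does not validate anything; it only parses a copy of the thesis schema (with the attribute declarations placed after the sequence, as XML Schema requires) and lists what `ThesisType` declares. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced tag as {namespace-uri}localname.
XSD = "{http://www.w3.org/2001/XMLSchema}"

schema_text = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>"""

schema = ET.fromstring(schema_text)
thesis = schema.find(f"{XSD}complexType[@name='ThesisType']")

# (name, type, minOccurs, maxOccurs); both bounds default to 1.
children = [
    (e.get("name"), e.get("type"),
     int(e.get("minOccurs", "1")), e.get("maxOccurs", "1"))
    for e in thesis.findall(f"{XSD}sequence/{XSD}element")
]
attributes = [a.get("name") for a in thesis.findall(f"{XSD}attribute")]

for child in children:
    print(child)
print(attributes)
```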
How the Web was Yesterday • HTML documents • often generated by applications • consumed by humans only • easy access: across platforms, across organizations • No application interoperability: • HTML not understood by applications • Database technology: client-server
Application Interoperability Purchase order Internet Amazon Supplier3 Supplier2 Supplier1
New Universal Data Exchange Format: XML A recommendation from the W3C • XML = data • XML generated by applications • XML consumed by applications • Easy access: across platforms, organizations
XML • A W3C standard to complement HTML • Origins: structured text (SGML) • Large-scale electronic publishing • Data exchange on the web • Motivation: • HTML describes presentation • XML describes content • http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000)
Paradigm Shift on the Web • From documents (HTML) to data (XML) • From information retrieval to data management • For databases, also a paradigm shift: • from relational model to XML model • from data processing to data/query translation • from storage to transport
Database Issues • How are we going to model XML? (graphs). Compared to relational model, • XML is hierarchical • XML allows missing or additional attributes • XML allows multiple instances of an attribute (set-valued) • XML allows different types in different objects • XML integrates structure and text data … • How are we going to query XML? (XQuery) • How are we going to store XML (in a relational database? object-oriented? native?) • How are we going to process XML efficiently? (many interesting research questions!)
Designing an XML Schema/DTD • Not as formalized as relational data design • We can still use ER diagrams to break into entity, relationship sets • Note that often we already have our data in relations and need to design the XML schema to export them! • Generally orient the XML tree around the “central” objects • Big decision: element vs. attribute • Element if it has its own properties, or if you *might* have more than one of them • Attribute if it is a single property – or perhaps not!
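The "might have more than one" rule for choosing an element has a hard syntactic basis: an element can repeat under its parent, but an attribute name can appear only once per start tag. A small sketch with hypothetical person data; the helper `parses` is illustrative:

```python
import xml.etree.ElementTree as ET

def parses(text: str) -> bool:
    """Return True if `text` is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# As elements, a second phone is just another child.
as_elements = ET.fromstring(
    "<person><phone>2345</phone><phone>3456</phone></person>")
phones = [p.text for p in as_elements.findall("phone")]

# As an attribute, one phone is fine, but repeating the attribute
# makes the document ill-formed.
one_attr = parses('<person phone="2345"/>')
two_attrs = parses('<person phone="2345" phone="3456"/>')

print(phones, one_attr, two_attrs)
```

An attribute could still carry several values as a space-separated string, as IDREFS does, but then the application has to split it.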