Semistructured Data and XML
How the Web is Today • HTML documents • often generated by applications • consumed by humans only • easy access: across platforms, across organizations • only layout, no semantic information • No application interoperability: • HTML not understood by applications • screen scraping brittle • Database technology: client-server • still vendor specific
XML Data Exchange Format • A standard from the W3C (World Wide Web Consortium, http://www.w3.org). • The mission of the W3C: “... developing common protocols that promote its evolution and ensure its interoperability ...”. • Basic ideas • XML = data • XML generated by applications • XML consumed by applications • Easy access: across platforms, organizations.
Paradigm Shift on the Web • For web search engines: • From documents (HTML) to data (XML) • From document management to document understanding (e.g., question answering) • From information retrieval to data management • For database systems: • From relational (structured) model to semistructured data • From data processing to data/query translation • From storage to transport
The Semistructured Data Model • Object Exchange Model (OEM) • [Figure: OEM graph rooted at Bib (&o1), with paper and book child objects (&o12, &o24, &o29); arcs labeled author, title, year, publisher, http, page, references, firstname, lastname, first, last; the legend distinguishes complex objects from atomic objects; leaf values include “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122, 133.]
The Semistructured Data Model • Data is self-describing, i.e. the data description is integrated with the data itself rather than in a separate schema. • Database is a collection of nodes and arcs (directed graph). • Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings). • Interior nodes represent complex objects consisting of components (child nodes), connected by arcs to this node. • Arcs are directed and connect two nodes.
The Semistructured Data Model • Arc labels indicate the relationship between the two corresponding nodes. • The root node is the only interior node without in-arcs, representing the entire database. • All database objects are children of the root node. • Every node must be reachable from the root. • A general graph structure is possible, i.e. the graph need not be a tree structure.
Syntax for Semistructured Data Bib: &o1 { paper: &o12 { … }, book: &o24 { … }, paper: &o29 { author: &o52 “Abiteboul”, author: &o96 { firstname: &o243 “Victor”, lastname: &o206 “Vianu”}, title: &o93 “Regular path queries with constraints”, references: &o12, references: &o24, page: &o25 { first: &o64 122, last: &o92 133} } } Observe: Nested tuples, set-values, oids!
Syntax for Semistructured Data May omit oids: { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } }
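The oid-free syntax maps naturally onto nested dictionaries and lists. The following is a minimal sketch, assuming a hypothetical in-memory rendering in Python (the names bib and author_name are invented for illustration); it shows how a consumer has to cope with the irregular shape of the two author objects.

```python
# Hypothetical in-memory rendering of the oid-free syntax above.
# A label that occurs more than once (author) becomes a list of values.
bib = {
    "paper": [
        {
            "author": [
                "Abiteboul",                                   # atomic object
                {"firstname": "Victor", "lastname": "Vianu"},  # complex object
            ],
            "title": "Regular path queries …",
            "page": {"first": 122, "last": 133},
        }
    ]
}


def author_name(author):
    """Return a display name for either shape of author object."""
    if isinstance(author, str):
        return author
    return f"{author['firstname']} {author['lastname']}"


names = [author_name(a) for a in bib["paper"][0]["author"]]
print(names)  # ['Abiteboul', 'Victor Vianu']
```

Because no schema is declared separately, the structure has to be discovered from the data itself, which is why author_name inspects each value before using it.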
Vs. Relational Model • Missing attributes • Additional attributes • Multiple attribute values (set-valued attributes) • Objects as attribute values • No global schema • Only the first of these characteristics (missing attributes) is supported by the relational model; the others are not.
Vs. Relational Model • Semistructured data • Self-describing, • Irregular data, • No a-priori structure. • Relational DB • Separate schema, • Regular data, • A-priori structure.
Important XML Standards • XSL/XSLT: presentation and transformation standards • RDF: resource description framework (meta-info such as ratings, categorizations, etc.) • XPath/XPointer/XLink: standards for addressing and linking to documents and elements within them • Namespaces: for resolving name clashes • DOM: Document Object Model for manipulating XML documents • SAX: Simple API for XML parsing • XQuery: query language
XML • A W3C standard to complement HTML • Origins: Structured text SGML • Large-scale electronic publishing • Data exchange on the web • Motivation: • HTML describes presentation • XML describes content • http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)
From HTML to XML HTML describes the presentation
HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999 HTML describes the presentation
XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content
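Because the XML version names its content, an application can pull out fields without guessing at layout. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and only the one book shown on the slide (the elided remainder of the document is left out):

```python
import xml.etree.ElementTree as ET

# The single book from the slide; the title stays elided as in the source.
doc = """<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)
records = []
for book in root.findall("book"):
    records.append({
        "authors": [a.text.strip() for a in book.findall("author")],
        "publisher": book.findtext("publisher").strip(),
        "year": int(book.findtext("year")),  # int() tolerates the padding
    })

print(records[0]["authors"])  # ['Abiteboul', 'Hull', 'Vianu']
```

Doing the same against the HTML version would mean inferring that italic text is a title and that the text before a line break is an author list, which is the brittle screen scraping mentioned earlier.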
Why are we DB’ers interested? • It’s data. That’s us. • Database issues: • How are we going to model XML? (graphs). • How are we going to query XML? (XQuery) • How are we going to store XML (in a relational database? object-oriented? native?) • How are we going to process XML efficiently? (many interesting research questions!)
Elements • Tags book, title, author, … • start tag: <book>, end tag: </book> • defined by user / programmer (different from HTML!) • Elements <book>…</book>, <author>…</author> • An element consists of a matching start and end tag and the enclosed content. • Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.
Attributes • Attributes can be associated with any element. • Provide additional information about elements. • Attributes can have only one value. • Example <book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> • Attributes can also be used to connect elements.
Non-tree-like XML • So far: only tree-like XML documents, i.e. each element is nested within at most one other element. • Attributes can also be used to create non-tree XML documents. • Attributes with a domain of ID serve as primary keys of elements. • Attributes with a domain of IDREF serve as foreign keys referencing the ID of another element.
Non-tree-like XML • Example of a non-tree structure <persons> <person personid=“o555”> <name> Jane </name> </person> <person personid=“o456”> <name> Mary </name> <children refs=“o123 o555”> </children> </person> <person personid=“o123” mother=“o456”> <name>John</name> </person> </persons>
Namespaces • An XML document can involve tags that come from multiple sources. • One and the same tag can appear in more than one source. <table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table> <table> <name>African Coffee Table</name> <width>80</width> <length>120</length> </table>
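Following the references turns the element tree into a graph. A minimal sketch with Python's xml.etree.ElementTree, assuming the attribute on children is called refs and using an empty-element tag for it; since no DTD declares personid as an ID here, the lookup table is built by hand.

```python
import xml.etree.ElementTree as ET

doc = """<persons>
  <person personid="o555"><name> Jane </name></person>
  <person personid="o456"><name> Mary </name><children refs="o123 o555"/></person>
  <person personid="o123" mother="o456"><name>John</name></person>
</persons>"""

root = ET.fromstring(doc)

# Index every person by its key attribute (plays the role of the ID).
by_id = {p.get("personid"): p for p in root.findall("person")}


def name_of(oid):
    """Dereference an IDREF-style value and return the person's name."""
    return by_id[oid].findtext("name").strip()


mary = by_id["o456"]
child_ids = mary.find("children").get("refs").split()  # IDREFS: space-separated
children = [name_of(oid) for oid in child_ids]
mother_of_john = name_of(by_id["o123"].get("mother"))

print(children)        # ['John', 'Jane']
print(mother_of_john)  # Mary
```

John is reachable both as a child of the persons element and through Mary's reference, so the document as a whole is no longer a tree.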
Namespaces • Name conflicts can be resolved by prefixing tag names according to their source. <h:table> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table> <f:table> <f:name>African Coffee Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table> • When using prefixes in XML, a namespace for the prefix must be defined. • The namespace must be referenced (via a URI) in the start tag of an enclosing element.
Well-Formed XML • A well-formed XML document satisfies the following conditions: • Begins with a declaration that it is XML (the declaration is optional in XML 1.0, but recommended). • Has a single root element that encloses the whole document. • Consists of properly nested elements, i.e. start and end tag of an element are within the same enclosing element. • standalone=“yes” states that the document does not depend on external markup declarations such as an external DTD. • In this mode, you can invent your own tags, as in the semistructured data model.
Well-Formed XML <?xml version=“1.0” standalone=“yes” ?> <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> <book> <title> … </title> . . . </book> … </bibliography>
Well-Formed XML • HTML browsers will display documents with errors (like missing end tags). • The W3C XML specification states that a program should stop processing an XML document if it finds an error. • The main reason is that XML is consumed by programs rather than by humans (as HTML is). • W3C provides a validator that checks whether an XML document is well-formed.
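A namespace-aware parser keeps the two table elements apart once the prefixes are bound. A minimal sketch using Python's xml.etree.ElementTree; the prefixes h and f come from the example, while the two namespace URIs and the enclosing root element are placeholders invented for this sketch.

```python
import xml.etree.ElementTree as ET

# The xmlns:h / xmlns:f declarations on the enclosing element bind the prefixes.
doc = """<root xmlns:h="http://example.org/html"
      xmlns:f="http://example.org/furniture">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>"""

# Prefix-to-URI map used on the query side; it must agree on the URIs,
# not on the prefixes chosen in the document.
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

root = ET.fromstring(doc)
cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
furniture = root.find("f:table", ns)
name = furniture.findtext("f:name", namespaces=ns)

print(cells)          # ['Apples', 'Bananas']
print(name)           # African Coffee Table
print(furniture.tag)  # {http://example.org/furniture}table
```

Internally the parser replaces each prefix with its URI, so two elements clash only if both the URI and the local name are the same.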
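The stop-on-error behaviour is easy to observe with any conforming parser. A minimal sketch using Python's xml.etree.ElementTree, which raises ParseError on input that is not well-formed; the helper name is_well_formed and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it gives up."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<book><title>Foundations</title></book>"))  # True
print(is_well_formed("<book><title>Foundations</book>"))          # False
print(is_well_formed("<book/><book/>"))                           # False
```

The second input has a missing end tag, and the third has two root elements; an HTML browser would typically render something for both, whereas the XML parser rejects them. This checks well-formedness only and says nothing about validity against a DTD.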
Valid XML • The validator can also check whether an XML document is valid, i.e. conforms to a Document Type Definition (DTD). • A DTD specifies the allowable tags and how they can be nested. • XML with a DTD is no longer semistructured (self-describing). • However, a DTD is less rigid than the schema of a relational DB. E.g., a DTD allows missing and multiple attributes / elements.
Document Type Definitions • Document Type Definition (DTD): set of rules (grammar) specifying elements, attributes and all other aspects of XML documents. • For each element, specify name and content type. • Content type can, e.g., be • #PCDATA (character string), • other elements, • regular expression made of the above content types * = zero or more occurrences ? = zero or one occurrence + = one or more occurrences , = sequence of elements.
Document Type Definitions • Sort of like a schema but not really. • Inherited from SGML DTD standard • BNF grammar establishing constraints on element structure and content • Definitions of entities
Example DTD: Product Catalog <!DOCTYPE CATALOG [ <!ELEMENT CATALOG (PRODUCT+)> <!ELEMENT PRODUCT (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)> <!ATTLIST PRODUCT NAME CDATA #IMPLIED CATEGORY (HandTool|Table|Shop-Professional) "HandTool" PARTNUM CDATA #IMPLIED PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago" INVENTORY (InStock|Backordered|Discontinued) "InStock"> <!ELEMENT SPECIFICATIONS (#PCDATA)> <!ATTLIST SPECIFICATIONS WEIGHT CDATA #IMPLIED POWER CDATA #IMPLIED> <!ELEMENT OPTIONS (#PCDATA)> <!ATTLIST OPTIONS FINISH (Metal|Polished|Matte) "Matte" ADAPTER (Included|Optional|NotApplicable) "Included" CASE (HardShell|Soft|NotApplicable) "HardShell"> <!ELEMENT PRICE (#PCDATA)> <!ATTLIST PRICE MSRP CDATA #IMPLIED WHOLESALE CDATA #IMPLIED STREET CDATA #IMPLIED SHIPPING CDATA #IMPLIED> <!ELEMENT NOTES (#PCDATA)> ]>
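The content model of PRODUCT is literally a regular expression over child element names. The following sketch makes that concrete in Python; it is not a DTD validator (the standard library does not include one), and the helper names model_to_regex and children_match are invented. It handles only flat comma-separated sequences with the *, ? and + operators, and ignores attributes.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate a flat DTD content model such as (A+,B?) into a regex
    that matches a string of space-terminated child element names."""
    parts = []
    for item in model.strip("()").split(","):
        item = item.strip()
        if item[-1] in "*?+":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def children_match(element, pattern):
    """Check the sequence of child tags against the compiled content model."""
    sequence = "".join(child.tag + " " for child in element)
    return pattern.fullmatch(sequence) is not None


# Content model copied from the <!ELEMENT PRODUCT ...> declaration above.
PRODUCT = model_to_regex("(SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)")

ok = ET.fromstring("<PRODUCT><SPECIFICATIONS/><PRICE/><PRICE/></PRODUCT>")
wrong_order = ET.fromstring("<PRODUCT><PRICE/><SPECIFICATIONS/></PRODUCT>")
no_price = ET.fromstring("<PRODUCT><SPECIFICATIONS/><NOTES/></PRODUCT>")

print(children_match(ok, PRODUCT))           # True
print(children_match(wrong_order, PRODUCT))  # False
print(children_match(no_price, PRODUCT))     # False
```

OPTIONS and NOTES may be absent because of the ? operator, while at least one SPECIFICATIONS and one PRICE are required by +, and the comma fixes their order.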
Shortcomings of DTDs Useful for documents, but not so good for data: • Element name and type are associated globally • No support for structural re-use • Object-oriented-like structures aren’t supported • No support for data types • Can’t do data validation • Can have a single key item (ID), but: • No support for multi-attribute keys • No support for foreign keys (references to other keys) • No constraints on IDREFs (e.g., cannot require that an IDREF reference only a Section element)
XML Schema • The successor of DTDs to specify a schema for XML documents. • A W3C standard. • Includes and extends functionality of DTDs. • In particular, XML Schemas support data types. This makes it easier to validate the correctness of data and to work with data from a database. • XML Schemas are written in XML. You don't have to learn a new language and can use your XML parser to parse your Schema files.
Example XML Schema <schema version=“1.0” xmlns=“http://www.w3.org/1999/XMLSchema”> <element name=“author” type=“string” /> <element name=“date” type = “date” /> <element name=“abstract”> <type> … </type> </element> <element name=“paper”> <type> <attribute name=“keywords” type=“string”/> <element ref=“author” minOccurs=“0” maxOccurs=“*” /> <element ref=“date” /> <element ref=“abstract” minOccurs=“0” maxOccurs=“1” /> <element ref=“body” /> </type> </element> </schema>
Simple Elements • Simple elements contain only text. • They can have one of the built-in datatypes: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time. • Example <xs:element name="lastname" type="xs:string"/> <xs:element name="age" type="xs:integer"/> <xs:element name="dateborn" type="xs:date"/>
Simple Elements • Restrictions allow you to further constrain the content of simple elements. <xs:element name="age"> <xs:simpleType> <xs:restriction base="xs:integer"> <xs:minInclusive value="0"/> <xs:maxInclusive value="120"/> </xs:restriction> </xs:simpleType> </xs:element>
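To see what the restriction on age buys an application, here is a rough Python equivalent of the check a schema validator would perform. It is a sketch only: the function name valid_age is invented, and it approximates xs:integer by an optional sign followed by ASCII digits.

```python
import re

# Approximate lexical form of xs:integer: optional sign, then digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def valid_age(text):
    """Mimic xs:integer restricted by minInclusive=0 and maxInclusive=120."""
    text = text.strip()  # surrounding whitespace is not significant
    if _INTEGER.fullmatch(text) is None:
        return False
    return 0 <= int(text) <= 120


print(valid_age("42"))    # True
print(valid_age("120"))   # True  (the upper bound is inclusive)
print(valid_age("121"))   # False
print(valid_age("-1"))    # False
print(valid_age("4.2"))   # False (not an integer)
```

With a DTD, the content of age would be plain #PCDATA and every one of these strings would be accepted, so the check would have to live in application code like this.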
Attributes • Attributes can be specified using the attribute element: <xs:attribute name="xxx" type="yyy"/> • Attribute elements are nested within the type definition of the element with which they are associated. • By default, attributes are optional. • To make an attribute mandatory, use <xs:attribute name="lang" type="xs:string" use="required"/> • Attributes can have the same built-in datatypes as simple elements.
Complex Elements • Complex elements can contain other elements and can have attributes. • Nested elements declared in a sequence need to occur in the order specified. • The number of repetitions of an element is controlled by the attributes minOccurs and maxOccurs. The default is one repetition. • A complex element with an attribute: <xs:element name="product"> <xs:complexType> <xs:attribute name="prodid" type="xs:positiveInteger"/> </xs:complexType> </xs:element>
Complex Elements • A complex element containing a sequence of nested (simple) elements: <xs:element name="employee"> <xs:complexType> <xs:sequence> <xs:element name="firstname" type="xs:string"/> <xs:element name="lastname" type="xs:string"/> </xs:sequence> </xs:complexType> </xs:element>
Complex Elements • If you name the complex type, other elements can reference and reuse it: <xs:complexType name="persontype"> <xs:sequence> <xs:element name="firstname" type="xs:string"/> <xs:element name="lastname" type="xs:string"/> </xs:sequence> </xs:complexType> <xs:element name="person" type="persontype"/>
XML vs. Semistructured Data • Both described best by a graph. • Both are schema-less, self-describing (XML without DTD / XML Schema). • XML is ordered, semistructured data is not. • XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk> • XML has lots of other stuff: attributes, entities, processing instructions, comments.
Query Languages for XML • XPath is a simple query language based on describing similar paths in XML documents. • XQuery extends XPath in a style similar to SQL, introducing iterations, subqueries, etc. • XPath and XQuery expressions are applied to an XML document and return a sequence of qualifying items. • Items can be primitive values or nodes (elements, attributes, documents). • The items returned do not need to be of the same type.
XPath • A path expression returns the sequence of all qualifying items that are reachable from the input item following the specified path. • A path expression is a sequence consisting of tags or attributes and special characters such as slashes (“/”). • Absolute path expressions are applied to some XML document and return all elements that are reachable from the document’s root element following the specified path. • Relative path expressions are applied to an arbitrary node.
XPath <?xml version=“1.0” standalone=“yes” ?> <bibliography> <book bookID = “b100”> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> • Applied to the above document, the XPath expression /bibliography/book/author returns the sequence <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> . . .
Attributes • If we do not want to return the qualifying elements, but the value of one of their attributes, we end the path expression with @attribute. <?xml version=“1.0” standalone=“yes” ?> <bibliography> <book bookID = “b100”> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> • Applied to the above document, the XPath expression /bibliography/book/@bookID returns the sequence “b100” . . .
Wildcards • We can use wildcards instead of actual tags and attributes: * means any tag, and @* means any attribute. • Examples • /bibliography/*/author returns the sequence <author> Abiteboul </author> <author> Hull </author> . . . • /bibliography//author/@* returns the sequence “IBM” “a739” . . .
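The path expression can be tried out with Python's xml.etree.ElementTree, which implements only a limited subset of XPath. One assumption to note: ElementTree evaluates paths relative to an element, so the absolute path /bibliography/book/author is written here as ./book/author starting from the bibliography root element. The document is the single book shown above, without the elided part.

```python
import xml.etree.ElementTree as ET

doc = """<bibliography>
  <book bookID="b100">
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)  # root is the <bibliography> element

# Equivalent of /bibliography/book/author, relative to the root element.
authors = root.findall("./book/author")
names = [a.text.strip() for a in authors]

print(len(authors))  # 3
print(names)         # ['Abiteboul', 'Hull', 'Vianu']
```

The result is a list of element nodes in document order, matching the sequence of author elements given on the slide.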
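Attribute steps and wildcards can be approximated with Python's xml.etree.ElementTree, with two caveats: its XPath subset cannot return attribute values directly, so they are read from the matched elements, and paths are relative to the root element. The document that produces “IBM” and “a739” is not shown on the slide, so the author attributes below (affiliation, authorID) are hypothetical names chosen only to reproduce those two values.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the attribute names on <author> are invented.
doc = """<bibliography>
  <book bookID="b100">
    <title> Foundations… </title>
    <author affiliation="IBM" authorID="a739"> Abiteboul </author>
    <author> Hull </author>
  </book>
</bibliography>"""

root = ET.fromstring(doc)

# /bibliography/book/@bookID : match the elements, then read the attribute.
book_ids = [b.get("bookID") for b in root.findall("./book")]

# /bibliography/*/author : * stands for any tag at that step.
via_wildcard = [a.text.strip() for a in root.findall("./*/author")]

# /bibliography//author/@* : // reaches authors at any depth, @* is any attribute.
attr_values = [v for a in root.findall(".//author") for v in a.attrib.values()]

print(book_ids)             # ['b100']
print(via_wildcard)         # ['Abiteboul', 'Hull']
print(sorted(attr_values))  # ['IBM', 'a739']
```

A full XPath processor would evaluate the three expressions as written on the slide; the rewrite into element matching plus attribute lookup is specific to this library.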