Semistructured Data and XML

How the Web is Today • HTML documents • often generated by applications • consumed by humans only • easy access: across platforms, across organizations • only layout, no semantic information • No application interoperability: • HTML not understood by applications • screen scraping brittle • Database technology: client-server • still vendor specific

XML Data Exchange Format • A standard from the W3C (World Wide Web Consortium, http://www.w3.org). • The mission of the W3C: "... developing common protocols that promote its evolution and ensure its interoperability ...". • Basic ideas • XML = data • XML generated by applications • XML consumed by applications • Easy access: across platforms, organizations.

Paradigm Shift on the Web • For web search engines: • From documents (HTML) to data (XML) • From document management to document understanding (e.g., question answering) • From information retrieval to data management • For database systems: • From relational (structured) model to semistructured data • From data processing to data /query translation • From storage to transport

The Semistructured Data Model • Object Exchange Model (OEM) • [Figure: OEM graph of a bibliography database. The root object &o1 (Bib) has paper and book children (&o12, &o24, &o29), connected by labeled arcs such as author, title, year, publisher, page and references. Interior nodes are complex objects; leaf nodes are atomic objects, e.g. "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, 133.]

The Semistructured Data Model • Data is self-describing, i.e. the data description is integrated with the data itself rather than in a separate schema. • Database is a collection of nodes and arcs (directed graph). • Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings). • Interior nodes represent complex objects consisting of components (child nodes), connected by arcs to this node. • Arcs are directed and connect two nodes.

The Semistructured Data Model • Arc labels indicate the relationship between the two corresponding nodes. • The root node is the only interior node without in-arcs, representing the entire database. • All database objects are children of the root node. • Every node must be reachable from the root. • A general graph structure is possible, i.e. the graph need not be a tree structure.

Syntax for Semistructured Data
Bib: &o1 {
  paper: &o12 { … },
  book: &o24 { … },
  paper: &o29 {
    author: &o52 "Abiteboul",
    author: &o96 { firstname: &o243 "Victor", lastname: &o206 "Vianu" },
    title: &o93 "Regular path queries with constraints",
    references: &o12,
    references: &o24,
    pages: &o25 { first: &o64 122, last: &o92 133 }
  }
}
• Observe: nested tuples, set values, oids!

Syntax for Semistructured Data • Oids may be omitted:
{ paper: {
    author: "Abiteboul",
    author: { firstname: "Victor", lastname: "Vianu" },
    title: "Regular path queries …",
    page: { first: 122, last: 133 }
  }
}
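
The oid-free syntax above maps quite directly onto nested values in a programming language. The following is a minimal sketch in Python, assuming a complex object is modeled as a list of (label, child) pairs so that a label such as author can repeat. The helper name `follow` is made up for this illustration and is not part of OEM.

```python
# Sketch: OEM-style semistructured data as nested Python values.
# A complex object is a list of (label, child) pairs, so labels may repeat;
# an atomic object is a plain str or int.
bib = [
    ("paper", [
        ("author", "Abiteboul"),
        ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
        ("title", "Regular path queries …"),
        ("page", [("first", 122), ("last", 133)]),
    ]),
]


def follow(node, path):
    """Return every object reachable from `node` along the arc labels in `path`."""
    current = [node]
    for label in path:
        next_level = []
        for obj in current:
            if isinstance(obj, list):  # only complex objects have out-arcs
                next_level.extend(child for lbl, child in obj if lbl == label)
        current = next_level
    return current


authors = follow(bib, ["paper", "author"])       # one atomic, one complex object
pages = follow(bib, ["paper", "page", "first"])  # atomic objects only
```

Note that the two author objects have different structure (a string and a nested object), which is exactly the irregularity the model is meant to allow.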

Vs. Relational Model • Missing attributes • Additional attributes • Multiple attribute values (set-valued attributes) • Objects as attribute values • No global schema • Only the first characteristic (missing attributes) is supported by the relational model; the others are not.

Vs. Relational Model • Semistructured data • Self-describing, • Irregular data, • No a-priori structure. • Relational DB • Separate schema, • Regular data, • A-priori structure.

XML

Important XML Standards • XSL/XSLT: presentation and transformation standards • RDF: Resource Description Framework (meta-info such as ratings, categorizations, etc.) • XPath/XPointer/XLink: standards for addressing and linking to documents and elements within them • Namespaces: for resolving name clashes • DOM: Document Object Model for manipulating XML documents • SAX: Simple API for XML parsing • XQuery: query language

XML • A W3C standard to complement HTML • Origins: structured text (SGML) • Large-scale electronic publishing • Data exchange on the web • Motivation: • HTML describes presentation • XML describes content • http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)

From HTML to XML HTML describes the presentation

HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999 HTML describes the presentation

XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content

Why are we DB’ers interested? • It’s data. That’s us. • Database issues: • How are we going to model XML? (graphs). • How are we going to query XML? (XQuery) • How are we going to store XML (in a relational database? object-oriented? native?) • How are we going to process XML efficiently? (many interesting research questions!)

Elements • Tags: book, title, author, … • start tag: <book>, end tag: </book> • defined by user / programmer (different from HTML!) • Elements: <book>…</book>, <author>…</author> • An element consists of a matching start and end tag and the enclosed content. • Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.

Attributes • Attributes can be associated with any element. • Provide additional information about elements. • Attributes can have only one value. • Example: <book price="55" currency="USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> • Attributes can also be used to connect elements.

Non-tree-like XML • So far: only tree-like XML documents, i.e. each element is nested within at most one other element. • Attributes can also be used to create non-tree XML documents. • Attributes with a domain of ID serve as primary keys of elements. • Attributes with a domain of IDREF serve as foreign keys referencing the ID of another element.

Non-tree-like XML • Example of a non-tree structure:
<persons>
  <person personid="o555"> <name> Jane </name> </person>
  <person personid="o456"> <name> Mary </name> <children refs="o123 o555"></children> </person>
  <person personid="o123" mother="o456"> <name> John </name> </person>
</persons>
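
To see how such references turn the tree into a graph, the persons example can be traversed in code. This is only a sketch using Python's standard xml.etree.ElementTree. That parser does not read DTD attribute types, so treating personid as an ID and refs / mother as IDREF(S) is a convention assumed here, not something the parser enforces. The helper `name_of` is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """
<persons>
  <person personid="o555"><name>Jane</name></person>
  <person personid="o456">
    <name>Mary</name>
    <children refs="o123 o555"/>
  </person>
  <person personid="o123" mother="o456"><name>John</name></person>
</persons>
"""

root = ET.fromstring(doc)

# Index every person by its ID-like attribute (the "primary key").
by_id = {p.get("personid"): p for p in root.findall("person")}


def name_of(person):
    """Return the text of the person's <name> child."""
    return person.findtext("name").strip()


# Follow the IDREFS-like attribute on <children> (the "foreign keys").
mary = by_id["o456"]
child_names = [name_of(by_id[ref]) for ref in mary.find("children").get("refs").split()]

# Follow the single IDREF-like attribute from John back to his mother.
john = by_id["o123"]
mother_name = name_of(by_id[john.get("mother")])
```

Mary points to John and John points back to Mary, so the referenced structure contains a cycle even though the element nesting itself is a tree.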

Namespaces • An XML document can involve tags that come from multiple sources. • One and the same tag can appear in more than one source.
<table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table>
<table> <name>African Coffee Table</name> <width>80</width> <length>120</length> </table>

Namespaces • Name conflicts can be resolved by prefixing tag names according to their source.
<h:table> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table>
<f:table> <f:name>African Coffee Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table>
• When using prefixes in XML, a namespace for the prefix must be defined. • The namespace must be referenced (via a URI) in the start tag of an enclosing element.
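
The prefixed tables can be parsed once the prefixes are bound to URIs. The sketch below uses Python's standard xml.etree.ElementTree. The enclosing `<root>` element and both namespace URIs (http://example.org/html and http://example.org/furniture) are invented for this illustration; the slides do not give them.

```python
import xml.etree.ElementTree as ET

# The enclosing <root> element and the two namespace URIs are made up.
doc = """
<root xmlns:h="http://example.org/html" xmlns:f="http://example.org/furniture">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>
"""

root = ET.fromstring(doc)

# Map prefixes to URIs for use in path expressions.
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
furniture_name = root.findtext("f:table/f:name", namespaces=ns)

# Internally the parser stores the expanded name "{uri}localname".
html_table_tag = root.find("h:table", ns).tag
```

The prefix is only shorthand: two documents may use different prefixes for the same URI, and the parser treats the resulting tags as identical.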

Well-Formed XML • A well-formed XML document satisfies the following conditions: • Begins with a declaration that it is XML. • Has a single root element that encloses the whole document. • Consists of properly nested elements, i.e. the start and end tag of an element are within the same enclosing element. • standalone="yes" states that the document does not rely on an external DTD. • In this mode, you can invent your own tags, as in the semistructured data model.

Well-Formed XML
<?xml version="1.0" standalone="yes" ?>
<bibliography>
  <book> <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  <book> <title> … </title>
    . . .
  </book>
  …
</bibliography>

Well-Formed XML • HTML browsers will display documents with errors (like missing end tags). • The W3C XML specification states that a program should stop processing an XML document if it finds an error. • The main reason is that XML is consumed by programs rather than by humans (as HTML is). • W3C provides a validator that checks whether an XML document is well-formed.
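
The "stop on error" behavior is easy to observe with any conforming parser. A minimal sketch using Python's standard xml.etree.ElementTree follows. The function name `is_well_formed` and the sample strings are made up for this illustration; it checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<bibliography><book><title>Foundations…</title></book></bibliography>"
missing_end_tag = "<bibliography><book><title>Foundations…</title></bibliography>"
badly_nested = "<book><title>Foundations…</book></title>"
two_roots = "<book/><book/>"

results = {
    "good": is_well_formed(good),
    "missing_end_tag": is_well_formed(missing_end_tag),
    "badly_nested": is_well_formed(badly_nested),
    "two_roots": is_well_formed(two_roots),
}
```

Only the first string is accepted. The other three violate the nesting or single-root conditions, and the parser raises an error instead of guessing what was meant.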

Valid XML • The validator can also check whether an XML document is valid, i.e. conforms to a Document Type Definition (DTD). • A DTD specifies the allowable tags and how they can be nested. • XML with a DTD is no longer semistructured (self-describing). • However, a DTD is less rigid than the schema of a relational DB. E.g., a DTD allows missing and multiple attributes / elements.

DTD

Document Type Definitions • Document Type Definition (DTD): set of rules (grammar) specifying elements, attributes and all other aspects of XML documents. • For each element, specify name and content type. • Content type can, e.g., be • #PCDATA (character string), • other elements, • a regular expression made of the above content types: • * = zero or more occurrences • ? = zero or one occurrence • + = one or more occurrences • , = sequence of elements.

Document Type Definitions • Sort of like a schema but not really. • Inherited from the SGML DTD standard • BNF grammar establishing constraints on element structure and content • Definitions of entities

Example DTD: Product Catalog
<!DOCTYPE CATALOG [
<!ELEMENT CATALOG (PRODUCT+)>
<!ELEMENT PRODUCT (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)>
<!ATTLIST PRODUCT
  NAME CDATA #IMPLIED
  CATEGORY (HandTool|Table|Shop-Professional) "HandTool"
  PARTNUM CDATA #IMPLIED
  PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago"
  INVENTORY (InStock|Backordered|Discontinued) "InStock">
<!ELEMENT SPECIFICATIONS (#PCDATA)>
<!ATTLIST SPECIFICATIONS
  WEIGHT CDATA #IMPLIED
  POWER CDATA #IMPLIED>
<!ELEMENT OPTIONS (#PCDATA)>
<!ATTLIST OPTIONS
  FINISH (Metal|Polished|Matte) "Matte"
  ADAPTER (Included|Optional|NotApplicable) "Included"
  CASE (HardShell|Soft|NotApplicable) "HardShell">
<!ELEMENT PRICE (#PCDATA)>
<!ATTLIST PRICE
  MSRP CDATA #IMPLIED
  WHOLESALE CDATA #IMPLIED
  STREET CDATA #IMPLIED
  SHIPPING CDATA #IMPLIED>
<!ELEMENT NOTES (#PCDATA)>
]>
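
A DTD content model is a regular expression over child element names, and that can be made concrete. The sketch below hand-translates the PRODUCT content model into a Python regular expression and checks the order and multiplicity of child elements. It is an illustration of the idea, not a DTD validator: attribute declarations are ignored, Python's standard library does not validate against DTDs, and the sample element values ("2 kg", "10", "12") are invented.

```python
import re
import xml.etree.ElementTree as ET

# Content model of PRODUCT from the DTD: (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)
# Each child tag is written as "NAME " so the operators apply to whole names.
PRODUCT_MODEL = re.compile(r"(SPECIFICATIONS )+(OPTIONS )?(PRICE )+(NOTES )?")


def product_matches_model(product):
    """Check the order and multiplicity of a PRODUCT element's children."""
    child_sequence = "".join(child.tag + " " for child in product)
    return PRODUCT_MODEL.fullmatch(child_sequence) is not None


ok = ET.fromstring(
    "<PRODUCT><SPECIFICATIONS>2 kg</SPECIFICATIONS>"
    "<PRICE>10</PRICE><PRICE>12</PRICE></PRODUCT>"
)
wrong_order = ET.fromstring(
    "<PRODUCT><PRICE>10</PRICE><SPECIFICATIONS>2 kg</SPECIFICATIONS></PRODUCT>"
)
missing_price = ET.fromstring(
    "<PRODUCT><SPECIFICATIONS>2 kg</SPECIFICATIONS></PRODUCT>"
)

checks = [product_matches_model(p) for p in (ok, wrong_order, missing_price)]
```

The first product satisfies the model (OPTIONS and NOTES are optional, PRICE may repeat); the second has the children in the wrong order and the third lacks the required PRICE.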

Shortcomings of DTDs • Useful for documents, but not so good for data: • Element name and type are associated globally • No support for structural re-use • Object-oriented-like structures aren't supported • No support for data types • Can't do data validation • Can have a single key item (ID), but: • No support for multi-attribute keys • No support for foreign keys (references to other keys) • No constraints on IDREFs (e.g. cannot require that an IDREF reference only a Section element)

XML Schema

XML Schema • The successor of DTDs to specify a schema for XML documents. • A W3C standard. • Includes and extends functionality of DTDs. • In particular, XML Schemas support data types. This makes it easier to validate the correctness of data and to work with data from a database. • XML Schemas are written in XML. You don't have to learn a new language and can use your XML parser to parse your Schema files.

Example XML Schema
<schema version="1.0" xmlns="http://www.w3.org/1999/XMLSchema">
  <element name="author" type="string" />
  <element name="date" type="date" />
  <element name="abstract"> <type> … </type> </element>
  <element name="paper">
    <type>
      <attribute name="keywords" type="string" />
      <element ref="author" minOccurs="0" maxOccurs="*" />
      <element ref="date" />
      <element ref="abstract" minOccurs="0" maxOccurs="1" />
      <element ref="body" />
    </type>
  </element>
</schema>
• Note: this example uses an early working-draft syntax (1999 namespace, <type>, maxOccurs="*"). The final W3C Recommendation uses the namespace http://www.w3.org/2001/XMLSchema, <complexType>, and maxOccurs="unbounded".

Simple Elements • Simple elements contain only text. • They can have one of the built-in datatypes: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time. • Example: <xs:element name="lastname" type="xs:string"/> <xs:element name="age" type="xs:integer"/> <xs:element name="dateborn" type="xs:date"/>

Simple Elements • Restrictions allow you to further constrain the content of simple elements. <xs:element name="age"> <xs:simpleType> <xs:restriction base="xs:integer"> <xs:minInclusive value="0"/> <xs:maxInclusive value="120"/> </xs:restriction> </xs:simpleType> </xs:element>
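
What the age restriction means for instance documents can be spelled out by hand. The following Python sketch mimics the two facets (minInclusive and maxInclusive on an xs:integer base) for a single element. It is not a schema validator, since Python's standard library has none; the function name `valid_age` and the pattern used for integers are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Optional sign followed by ASCII digits, as a stand-in for the xs:integer lexical form.
INTEGER = re.compile(r"[+-]?[0-9]+")


def valid_age(element, min_inclusive=0, max_inclusive=120):
    """Mimic xs:integer restricted by minInclusive / maxInclusive."""
    text = (element.text or "").strip()
    if INTEGER.fullmatch(text) is None:
        return False  # not an integer at all
    return min_inclusive <= int(text) <= max_inclusive


samples = ["<age>42</age>", "<age>0</age>", "<age>120</age>",
           "<age>121</age>", "<age>-1</age>", "<age>abc</age>", "<age/>"]
outcomes = [valid_age(ET.fromstring(s)) for s in samples]
```

Both bounds are inclusive, so 0 and 120 pass while 121 and -1 do not; non-numeric or empty content fails the base type before the facets are even considered.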

Attributes • Attributes can be specified using the attribute element: <xs:attribute name="xxx" type="yyy"/> • Attribute elements are nested within the definition of the element with which they are associated. • By default, attributes are optional. • To make an attribute mandatory, use <xs:attribute name="lang" type="xs:string" use="required"/> • Attributes can have the same built-in datatypes as simple elements.

Complex Elements • Complex elements can contain other elements and can have attributes. • Nested elements need to occur in the order specified. • The number of repetitions of an element is controlled by the attributes minOccurs and maxOccurs. The default is one repetition. • A complex element with an attribute: <xs:element name="product"> <xs:complexType> <xs:attribute name="prodid" type="xs:positiveInteger"/> </xs:complexType> </xs:element>

Complex Elements • A complex element containing a sequence of nested (simple) elements: <xs:element name="employee"> <xs:complexType> <xs:sequence> <xs:element name="firstname" type="xs:string"/> <xs:element name="lastname" type="xs:string"/> </xs:sequence> </xs:complexType> </xs:element>

Complex Elements • If you name the complex type, other elements can reference and include it: <xs:complexType name="persontype"> <xs:sequence> <xs:element name="firstname" type="xs:string"/> <xs:element name="lastname" type="xs:string"/> </xs:sequence> </xs:complexType> <xs:element name="person" type="persontype"/>

XML vs. Semistructured Data • Both are described best by a graph. • Both are schema-less, self-describing (XML without DTD / XML Schema). • XML is ordered, semistructured data is not. • XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk> • XML has lots of other stuff: attributes, entities, processing instructions, comments.
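
Mixed content and ordering both show up in how a parser exposes the talk example. The sketch below uses Python's standard xml.etree.ElementTree, which stores text before the first child in `.text` and text following a child in that child's `.tail`. Whitespace around the text has been removed from the sample to keep the values easy to read.

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk>Making Java easier to type and easier to type"
    "<speaker>Phil Wadler</speaker></talk>"
)

# Text that precedes the first child element lives on the parent.
text_before_child = talk.text

# The child element and any text that follows it (none in this example).
speaker = talk.find("speaker").text
text_after_child = talk.find("speaker").tail

# Children are kept in document order, reflecting that XML is ordered.
child_tags = [child.tag for child in talk]

# All character data in document order, ignoring the markup.
all_text = "".join(talk.itertext())
```

A label-to-value model such as the semistructured syntax shown earlier has no natural place for the free text that sits next to the speaker element, which is one reason XML is not simply OEM with angle brackets.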

XML-Path = XPath

Query Languages for XML • XPath is a simple query language based on path descriptions that select matching nodes in XML documents. • XQuery extends XPath in a style similar to SQL, introducing iterations, subqueries, etc. • XPath and XQuery expressions are applied to an XML document and return a sequence of qualifying items. • Items can be primitive values or nodes (elements, attributes, documents). • The items returned do not need to be of the same type.

XPath • A path expression returns the sequence of all qualifying items that are reachable from the input item following the specified path. • A path expression is a sequence consisting of tags or attributes and special characters such as slashes ("/"). • Absolute path expressions are applied to some XML document and return all elements that are reachable from the document's root element following the specified path. • Relative path expressions are applied to an arbitrary node.

XPath
<?xml version="1.0" standalone="yes" ?>
<bibliography>
  <book bookID="b100">
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
• Applied to the above document, the XPath expression /bibliography/book/author returns the sequence <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> . . .
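
The path expression can be tried out with Python's standard xml.etree.ElementTree, which supports a limited subset of XPath. In this sketch the parsed root is already the bibliography element, so the absolute path /bibliography/book/author is written relative to it as ./book/author. Whitespace inside the elements has been trimmed, and only the one book shown on the slide is included.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book bookID="b100">
    <title>Foundations…</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""

root = ET.fromstring(doc)  # root is the <bibliography> element

# Equivalent of /bibliography/book/author, evaluated from the root element.
author_elements = root.findall("./book/author")
authors = [a.text for a in author_elements]

# A relative path applied to an arbitrary node, here the first book.
first_book = root.find("./book")
year = first_book.findtext("year")
```

The result is a sequence of element nodes in document order; extracting `.text` is a separate step, which mirrors the distinction between nodes and primitive values.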

Attributes • If we do not want to return the qualifying elements, but the value of one of their attributes, we end the path expression with @attribute.
<?xml version="1.0" standalone="yes" ?>
<bibliography>
  <book bookID="b100">
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
• Applied to the above document, the XPath expression /bibliography/book/@bookID returns the sequence "b100" . . .

Wildcards • We can use wildcards instead of actual tags and attributes: * means any tag, and @* means any attribute. • Examples: • /bibliography/*/author returns the sequence <author> Abiteboul </author> <author> Hull </author> . . . • /bibliography//author/@* returns the sequence "IBM" "a739" (for a document whose author elements carry attributes; // matches descendants at any depth).
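
Attribute selection and wildcards can be approximated with Python's standard xml.etree.ElementTree. Its XPath subset can select elements by path but cannot return attribute values directly, so in this sketch the attribute steps (@bookID, @*) are emulated by selecting the elements first and then reading their attributes. The sample document is the single-book bibliography from the earlier slides with whitespace trimmed.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<bibliography>'
    '<book bookID="b100"><title>Foundations…</title>'
    '<author>Abiteboul</author><author>Hull</author><author>Vianu</author></book>'
    '</bibliography>'
)

# /bibliography/book/@bookID : select the elements, then read the attribute.
book_ids = [book.get("bookID") for book in root.findall("./book")]

# /bibliography/*/author : "*" matches any child tag of the root.
authors_any_parent = [a.text for a in root.findall("./*/author")]

# /bibliography//author : "//" matches authors at any depth below the root.
authors_any_depth = [a.text for a in root.findall(".//author")]

# /bibliography/book/@* : every attribute value of the selected elements.
all_book_attrs = [
    value for book in root.findall("./book") for value in book.attrib.values()
]
```

In this small document the wildcard and descendant forms return the same authors; they differ once author elements appear under parents other than book or at other depths.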
