Introduction to Semistructured Data and XML

Chapter 27

How the Web is Today • HTML documents • often generated by applications • consumed by humans only • easy access: across platforms, across organizations • No application interoperability: • HTML not understood by applications • Database technology: client-server

New Universal Data Exchange Format: XML A recommendation from the W3C • XML = data • XML generated by applications • XML consumed by applications • Easy access: across platforms, organizations

Paradigm Shift on the Web • From documents (HTML) to data (XML) • From information retrieval to data management • For databases, also a paradigm shift: • from relational model to semistructured data • from data processing to data/query translation • from storage to transport

HTML • HTML is widely used for formatting and structuring Web documents. • Designed to describe how a Web browser should arrange text, images and push-buttons on a page. • Easy to learn, but does not convey the structure and meaning of the data in Web pages. • Fixed tag set.
  <HTML>
    <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
    <BODY>
      <H1>Introduction</H1>
      <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
    </BODY>
  </HTML>
[Figure callouts: opening tag (<HTML>), closing tag (</HTML>), text (PCDATA), “bachelor” tag with no closing tag (<IMG>), attribute name (SRC), attribute value ("dragon.jpeg")]

Semistructured Data • Information integration: an important new application that motivates what follows. • Semistructured data: a new data model designed to cope with the problems of information integration. • XML (Extensible Markup Language): a new Web standard that is essentially semistructured data. • XQuery: an emerging standard query language for XML data.

Information Integration Problem: related data exists in many places. The sources describe the same things, but differ in model, schema, and conventions (e.g., terminology). Example: In the real world, every bar has its own database. • Some may have relations like beer-price; others have a Microsoft Word file from which the menu is printed. • Some keep phones of manufacturers but not addresses. • Some distinguish beers and ales; others do not.

The Semistructured Data Model: Object Exchange Model (OEM) [Figure: an OEM graph rooted at Bib (&o1, a complex object). Edges labeled paper and book lead to &o12, &o24 and &o29; further edges are labeled references, author, title, year, page, http, publisher, firstname and lastname; the leaves are atomic objects such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122 and 133.]

Characteristics of Semistructured Data • Missing or additional attributes • Multiple attributes • Different types in different objects • Heterogeneous collections Self-describing, irregular data, no a priori structure

Comparison with Relational Data
  { row: { name: "John", phone: 3634 },
    row: { name: "Sue",  phone: 6343 },
    row: { name: "Dick", phone: 6363 } }
[Figure: the same data drawn as a tree: three row edges leave the root, and each row node has a name child and a phone child holding “John”/3634, “Sue”/6343 and “Dick”/6363.]

XML (Extensible Markup Language) • A W3C standard to complement HTML • Origins: structured text (SGML) • Large-scale electronic publishing • Data exchange on the Web • Motivation: • HTML describes presentation • XML describes content

From HTML to XML HTML describes the presentation

HTML <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

XML <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> XML describes the content

Why are we DB’ers interested? • It’s data. That’s us. • Database issues: • How are we going to model XML? (graphs). • How are we going to query XML? (XQuery) • How are we going to store XML (in a relational database? object-oriented? native?) • How are we going to process XML efficiently? (many interesting research questions!)

XML Terminology • Tags: book, title, author, … • start tag: <book>, end tag: </book> • Elements: <book>…</book>, <author>…</author> • elements can be nested • empty element: <red></red> (can be abbreviated to <red/>) • XML document: has a single root element • Well-formed XML document: has matching tags • Valid XML document: conforms to a schema

Well-Formed XML 1. Declaration = <? ... ?>. • Normal declaration is <?xml version="1.0" standalone="yes"?> • “Standalone” means that the document does not depend on an external DTD. 2. Root tag surrounds the entire balance of the document. • <FOO> is balanced by </FOO>, as in HTML. 3. Any balanced structure of tags is OK. • Unlike <P> in HTML, a tag that has no closing tag must be written as an empty element: <P/>.

XML: An Example <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <BOOKLIST> <BOOK genre="Science" format="Hardcover"> <AUTHOR> <FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME> </AUTHOR> <TITLE>The Character of Physical Law</TITLE> <PUBLISHED>1980</PUBLISHED> </BOOK> <BOOK genre="Fiction"> <AUTHOR> <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME> </AUTHOR> <TITLE>Waiting for the Mahatma</TITLE> <PUBLISHED>1981</PUBLISHED> </BOOK> <BOOK genre="Fiction"> <AUTHOR> <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME> </AUTHOR> <TITLE>The English Teacher</TITLE> <PUBLISHED>1980</PUBLISHED> </BOOK> </BOOKLIST>
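To make the example concrete, here is a minimal sketch of how an application might consume such a document. It assumes Python's standard xml.etree.ElementTree parser; the helper name summarize and the shortened two-book document (without the XML declaration) are illustrative choices, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the BOOKLIST example (two of the three books).
BOOKLIST_XML = """
<BOOKLIST>
  <BOOK genre="Science" format="Hardcover">
    <AUTHOR><FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME></AUTHOR>
    <TITLE>The Character of Physical Law</TITLE>
    <PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Waiting for the Mahatma</TITLE>
    <PUBLISHED>1981</PUBLISHED>
  </BOOK>
</BOOKLIST>
"""

def summarize(xml_text):
    """Turn each BOOK element into a flat dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("BOOK"):
        rows.append({
            "title": book.findtext("TITLE"),
            "author": book.findtext("AUTHOR/LASTNAME"),
            "genre": book.get("genre"),
            "format": book.get("format"),  # None when the attribute is missing
            "year": int(book.findtext("PUBLISHED")),  # element text is always a string
        })
    return rows

books = summarize(BOOKLIST_XML)
for row in books:
    print(row)
```

Note that the second book has no format attribute, which is exactly the kind of irregularity semistructured data permits; the code has to tolerate the missing value rather than rely on a fixed schema.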

XML – Elements <BOOK genre="Science" format="Hardcover">…</BOOK> [Figure callouts: open tag, element name, attribute, attribute value, data, closing tag] • XML is case sensitive, and whitespace is significant • Element opening and closing tag names must be identical • Opening tags: “<” + element name + “>” • Closing tags: “</” + element name + “>” • Empty elements have no data and no closing tag: • They begin with a “<” and end with a “/>”, e.g. <BOOK/>

XML – Attributes <BOOK genre="Science" format="Hardcover">…</BOOK> [Figure callouts: open tag, element name, attribute, attribute value, data, closing tag] • Attributes provide additional information for element tags. • There can be zero or more attributes in every element; each one has the form: attribute_name='attribute_value' • By convention there is no space between the name and the “=” (the XML grammar tolerates whitespace there) • Attribute values must be surrounded by " or ' characters • Multiple attributes are separated by white space (one or more spaces or tabs).

Elements The segment of an XML document between an opening and a corresponding closing tag is called an element.
  <person>
    <name> Malcolm Atchison </name>
    <tel> (215) 898 4321 </tel>
    <tel> (215) 898 4321 </tel>
    <email> mp@dcs.gla.ac.sc </email>
  </person>
[Figure callouts, as they appear on the slide: “element”; “element, a sub-element of”; “not an element”]

XML – Data and Comments <BOOK genre="Science" format="Hardcover">…</BOOK> [Figure callouts: open tag, element name, attribute, attribute value, data, closing tag] • XML data is any information between an opening and closing tag • XML data must not contain a literal “<” or “&”; write them as &lt; and &amp; (“>” is usually written as &gt; as well) • Comments: <!-- comment -->
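Markup characters inside data are normally replaced by entity references before the text is placed between tags. The sketch below illustrates this with the standard-library function xml.sax.saxutils.escape; the helper as_element_text and the sample string are assumptions made for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

def as_element_text(tag, value):
    """Serialize one element, escaping &, < and > in its data (hypothetical helper)."""
    return "<{0}>{1}</{0}>".format(tag, escape(value))

raw = "price < 10 & rising"
doc = as_element_text("note", raw)
print(doc)

# A parser turns the entity references back into the original characters.
parsed = ET.fromstring(doc)
print(parsed.text)
```

Building the string by hand is only for demonstration; serializers such as ElementTree's tostring perform the same escaping automatically.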

XML text XML has only one “basic” type: text. It is bounded by tags, e.g. <title> The Big Sleep </title> <year> 1935 </year> (1935 is still text). XML text is called PCDATA (for parsed character data). It is Unicode text, encoded as UTF-8 or UTF-16 by default.

XML – Nesting & Hierarchy • XML tags can be nested in a tree hierarchy • XML documents can have only one root tag • Between an opening and closing tag you can insert: 1. Data 2. More elements 3. A combination of data and elements
  <root>
    <tag1> Some Text <tag2>More</tag2> </tag1>
  </root>

Representing relational DBs: Two ways • projects(title, budget, managedBy) • employees(name, ssn, age)

Project and Employee relations in XML Projects and employees are intermixed
  <db>
    <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
    <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
    <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
    <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
    :
  </db>

Project and Employee relations in XML (cont’d) Employees follow projects
  <db>
    <projects>
      <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
      <project> <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
      :
    </projects>
    <employees>
      <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
      <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
      :
    </employees>
  </db>
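The grouped layout (all projects, then all employees) can be generated mechanically from relational rows. The following is a minimal sketch using Python's xml.etree.ElementTree; the tuples stand in for rows of the two relations, and the helper names add_rows and grouped_db are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rows of the two relations, taken from the slide's example values.
PROJECTS = [("Pattern recognition", 10000, "Joe"),
            ("Auto guided vehicle", 70000, "Sandra")]
EMPLOYEES = [("Joe", 344556, 34), ("Sandra", 2234, 35)]

def add_rows(parent, tag, columns, rows):
    """Append one <tag> element per row, with one child element per column."""
    for row in rows:
        elem = ET.SubElement(parent, tag)
        for column, value in zip(columns, row):
            # XML has only text, so every value is converted to a string.
            ET.SubElement(elem, column).text = str(value)

def grouped_db():
    """Build <db> with a <projects> group followed by an <employees> group."""
    db = ET.Element("db")
    add_rows(ET.SubElement(db, "projects"), "project",
             ("title", "budget", "managedBy"), PROJECTS)
    add_rows(ET.SubElement(db, "employees"), "employee",
             ("name", "ssn", "age"), EMPLOYEES)
    return db

db = grouped_db()
xml_text = ET.tostring(db, encoding="unicode")
print(xml_text)
```

The intermixed layout would use the same add_rows helper but attach both kinds of rows directly to db, which shows that the choice between the two representations is a matter of document design rather than of the underlying data.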

More XML: Oids and References
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name> <children idref="o123 o555"/> </person>
  <person id="o123" mother="o456"> <name> John </name> </person>
oids and references in XML are just syntax
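Because oids and references are just attribute values, following a reference is left to the application. The sketch below shows one way to do that with Python's xml.etree.ElementTree; the <people> wrapper element and the helper functions are assumptions added so that the fragment forms a single-rooted document.

```python
import xml.etree.ElementTree as ET

# The three person elements from the slide, wrapped in a hypothetical root.
PEOPLE_XML = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(PEOPLE_XML)
# Index every person by its oid so that references can be followed.
by_id = {person.get("id"): person for person in root.iter("person")}

def name_of(oid):
    return by_id[oid].findtext("name")

def children_of(oid):
    """Resolve the space-separated idref list of the <children> element."""
    refs = by_id[oid].find("children")
    if refs is None:
        return []
    return [name_of(ref) for ref in refs.get("idref").split()]

def mother_of(oid):
    """Resolve the single reference stored in the mother attribute."""
    ref = by_id[oid].get("mother")
    return name_of(ref) if ref else None

print(children_of("o456"))
print(mother_of("o123"))
```

The parser itself treats id, idref and mother as ordinary strings; only the dictionary lookup gives them the meaning of graph edges.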

XML Data Model (Graph)

Document Type Definitions (DTDs) • Sort of like a schema but not really. • Inherited from the SGML DTD standard • BNF grammar establishing constraints on element structure and content • Definitions of entities

DTD – An Example
  <?xml version='1.0'?>
  <!ELEMENT Basket (Cherry+, (Apple | Orange)*) >
  <!ELEMENT Cherry EMPTY>
  <!ATTLIST Cherry flavor CDATA #REQUIRED>
  <!ELEMENT Apple EMPTY>
  <!ATTLIST Apple color CDATA #REQUIRED>
  <!ELEMENT Orange EMPTY>
  <!ATTLIST Orange location CDATA 'Florida'>
--------------------------------------------------------------------------------
  <Basket> <Cherry flavor='good'/> <Apple color='red'/> <Apple color='green'/> </Basket>
  <Basket> <Apple/> <Cherry flavor='good'/> <Orange/> </Basket>

DTD - !ELEMENT <!ELEMENT Basket (Cherry+, (Apple | Orange)*) > [Figure callouts: Name = Basket; Children = (Cherry+, (Apple | Orange)*)] • !ELEMENT declares an element name and what child elements it should have • Content types: • Other elements • #PCDATA (parsed character data) • EMPTY (no content) • ANY (no checking inside this structure) • A regular expression

DTD - !ELEMENT (Contd.) • A regular expression has the following structure: • exp1, exp2, exp3, …, expk: A sequence of regular expressions, in this order • exp*: An optional expression with zero or more occurrences • exp+: A required expression with one or more occurrences • exp1 | exp2 | … | expk: A disjunction of expressions
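Since a content model is a regular expression over element names, the sequence of child tags can be checked with an ordinary regular-expression engine. The sketch below is one possible translation, not how a validating parser is actually implemented: model_to_regex and children_match are hypothetical helpers, and keywords such as #PCDATA, EMPTY and ANY are deliberately not handled.

```python
import re

def model_to_regex(model):
    """Translate a DTD content model over element names into a Python regex.

    Each element name becomes the group (?:Name;), so that *, + and ? apply
    to the whole name; the sequence operator "," is dropped afterwards,
    because juxtaposition already means sequence in a regex.
    """
    wrapped = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                     model)
    return re.compile(re.sub(r"[\s,]+", "", wrapped))

def children_match(model, child_tags):
    """Check whether a list of child tag names satisfies the content model."""
    sequence = "".join(tag + ";" for tag in child_tags)
    return model_to_regex(model).fullmatch(sequence) is not None

BASKET = "(Cherry+, (Apple | Orange)*)"
print(children_match(BASKET, ["Cherry", "Apple", "Apple"]))
print(children_match(BASKET, ["Apple", "Cherry", "Orange"]))
```

The ";" terminator keeps adjacent names from running together, so a name that is a prefix of another (for example tel and telex) cannot cause a false match.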

DTD - !ATTLIST <!ATTLIST Cherry flavor CDATA #REQUIRED> <!ATTLIST Orange location CDATA #REQUIRED color CDATA 'orange'> [Figure callouts: Element = Cherry; Attribute = flavor; Type = CDATA; Flag = #REQUIRED] • !ATTLIST defines a list of attributes for an element • Attributes can be of different types, can be required or not required, and they can have default values.

DTD – Well-Formed and Valid
  <?xml version='1.0'?>
  <!ELEMENT Basket (Cherry+)>
  <!ELEMENT Cherry EMPTY>
  <!ATTLIST Cherry flavor CDATA #REQUIRED>
--------------------------------------------------------------------------------
Not Well-Formed
  <basket> <Cherry flavor=good> </Basket>
Well-Formed but Invalid
  <Job> <Location>Home</Location> </Job>
Well-Formed and Valid
  <Basket> <Cherry flavor='good'/> </Basket>
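Well-formedness can be tested with any XML parser, without looking at the DTD. A minimal sketch, assuming Python's xml.etree.ElementTree (a non-validating parser, so it cannot tell valid from invalid documents); the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

SAMPLES = {
    "not well-formed": "<basket> <Cherry flavor=good> </Basket>",
    "well-formed but invalid": "<Job> <Location>Home</Location> </Job>",
    "well-formed and valid": "<Basket> <Cherry flavor='good'/> </Basket>",
}

for label, text in SAMPLES.items():
    print(label, "->", is_well_formed(text))
```

The second and third samples both pass this test; distinguishing them requires checking the document against the DTD, which needs a validating parser.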

Example: An Address Book
  <person>
    <name> MacNiel, John </name>
    <greet> Dr. John MacNiel </greet>
    <addr> 1234 Huron Street </addr>
    <addr> Rome, OH 98765 </addr>
    <tel> (321) 786 2543 </tel>
    <fax> (321) 786 2543 </fax>
    <tel> (321) 786 2543 </tel>
    <email> jm@abc.com </email>
  </person>
• name: exactly one name • greet: at most one greeting • addr: as many address lines as needed (in order) • tel, fax: mixed telephones and faxes • email: as many as needed

Specifying the structure • name: to specify a name element • greet?: to specify an optional (0 or 1) greet element • name, greet?: to specify a name followed by an optional greet

Specifying the structure (cont) • addr*: to specify 0 or more address lines • tel | fax: a tel or a fax element • (tel | fax)*: 0 or more repeats of tel or fax • email*: 0 or more email elements

A DTD for the address book
  <!DOCTYPE addressbook [
    <!ELEMENT addressbook (person*)>
    <!ELEMENT person (name, greet?, addr*, (fax | tel)*, email*)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT greet (#PCDATA)>
    <!ELEMENT addr (#PCDATA)>
    <!ELEMENT tel (#PCDATA)>
    <!ELEMENT fax (#PCDATA)>
    <!ELEMENT email (#PCDATA)>
  ]>

DTD for the example relational DB <!DOCTYPE db [ <!ELEMENT db (projects,employees)> <!ELEMENT projects (project*)> <!ELEMENT employees (employee*)> <!ELEMENT project (title, budget, managedBy)> <!ELEMENT employee (name, ssn, age)> ... ]>

Summary of XML regular expressions • Each element name is a tag. • Its components are the tags that appear nested within, in the order specified. • A The tag A occurs • e1,e2 The expression e1 followed by e2 • e* 0 or more occurrences of e • e? Optional -- 0 or 1 occurrences • e+ 1 or more occurrences • e1 | e2 either e1 or e2 • (e) grouping

XML Querying Path Expressions : • Bib.paper • Bib.book.publisher • Bib.paper.author.lastname Given an OEM instance, the value of a path expression p is a set of objects

Path Expressions Examples: DB = [Figure: the OEM graph rooted at Bib (&o1), with paper objects &o12 and &o29, book object &o24, and edges labeled references, author, title, year, page, http, publisher, firstname and lastname leading down to atomic values such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122 and 133.] • Bib.paper = {&o12, &o29} • Bib.book.publisher = {&o51} • Bib.paper.author.lastname = {&o71, &206}
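Evaluating a path expression amounts to following labeled edges one step at a time and collecting the objects reached. The sketch below uses a small hand-made graph that reproduces the three results above; it is not a transcription of the figure, and the author oids "a1" and "a2" as well as the function eval_path are hypothetical.

```python
# Entry points into the graph: a name and the oid it denotes.
ROOTS = {"Bib": "&o1"}

# Adjacency list: oid -> list of (edge label, target oid).
GRAPH = {
    "&o1": [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o24": [("publisher", "&o51")],
    "&o12": [("author", "a1")],   # "a1" and "a2" are made-up oids
    "&o29": [("author", "a2")],
    "a1": [("lastname", "&o71")],
    "a2": [("lastname", "&206")],
}

def eval_path(graph, roots, path):
    """Return the set of objects reached by a dotted path such as Bib.paper."""
    first, *labels = path.split(".")
    current = {roots[first]}
    for label in labels:
        # Follow every edge with the requested label from every current object.
        current = {target
                   for oid in current
                   for edge, target in graph.get(oid, [])
                   if edge == label}
    return current

print(eval_path(GRAPH, ROOTS, "Bib.paper"))
print(eval_path(GRAPH, ROOTS, "Bib.paper.author.lastname"))
```

A path that matches no edges simply yields the empty set, which is how the model copes with missing attributes.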

XQuery Emerging standard for querying XML documents. Basic form: FOR <variables ranging over sets of elements> WHERE <condition> RETURN <set of elements>; • Sets of elements described by paths, consisting of: • URL, if necessary. • Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = “start at any BAR node and go to a NAME child.” • Ending condition of the form [<condition about subelements, @attributes, and values>]

XQuery Overview: • FOR-LET-WHERE-ORDERBY-RETURN = FLWOR • Pipeline: FOR/LET clauses -> list of tuples -> WHERE clause -> list of tuples -> ORDERBY/RETURN clause -> instance of XQuery data model

XQuery • FOR $x IN expr: binds $x to each value in the list expr • LET $x = expr: binds $x to the entire list expr • Useful for common subexpressions and for aggregations

FOR vs. LET
  FOR $x IN document("bib.xml")/bib/book RETURN <result> $x </result>
Returns:
  <result> <book>...</book> </result>
  <result> <book>...</book> </result>
  <result> <book>...</book> </result>
  ...
  LET $x = document("bib.xml")/bib/book RETURN <result> $x </result>
Returns:
  <result> <book>...</book> <book>...</book> <book>...</book> ... </result>
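The difference between the two bindings can be imitated in a general-purpose language: FOR behaves like a loop that produces one result per item, while LET binds the whole list once. The sketch below is an analogy written with Python's xml.etree.ElementTree, not an XQuery implementation; the three-book document and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB_XML = ("<bib>"
           "<book><title>abc</title></book>"
           "<book><title>def</title></book>"
           "<book><title>ghi</title></book>"
           "</bib>")

def for_query(bib):
    """FOR: bind x to each book in turn, one <result> per binding."""
    results = []
    for x in bib.findall("book"):
        result = ET.Element("result")
        result.append(x)
        results.append(result)
    return results

def let_query(bib):
    """LET: bind x to the entire list of books, a single <result>."""
    x = bib.findall("book")
    result = ET.Element("result")
    result.extend(x)
    return [result]

bib = ET.fromstring(BIB_XML)
for_results = for_query(bib)
let_results = let_query(bib)
print(len(for_results), "results from FOR")
print(len(let_results), "result from LET")
```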

XQuery Find the titles of all books published after 1995:
  FOR $x IN document("bib.xml")/bib/book WHERE $x/year > 1995 RETURN $x/title
Result: <title> abc </title> <title> def </title> <title> ghi </title>
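For comparison, the same selection can be written as a filter over a parsed document. This is a rough equivalent using Python's xml.etree.ElementTree, under the assumption that every book has a year element; the sample data and the function titles_after are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB_XML = """
<bib>
  <book><title>abc</title><year>1999</year></book>
  <book><title>old</title><year>1990</year></book>
  <book><title>def</title><year>1996</year></book>
</bib>
"""

def titles_after(bib, year):
    """FOR each book, keep it WHERE year > the given year, RETURN its title."""
    return [book.findtext("title")
            for book in bib.findall("book")
            if int(book.findtext("year")) > year]  # year text must be converted

titles = titles_after(ET.fromstring(BIB_XML), 1995)
print(titles)
```

The explicit int conversion is a reminder that element content is text; the query language performs the corresponding comparison on the value of the year element.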

XQuery For each author of a book by Morgan Kaufmann, list all books s/he published:
  FOR $a IN distinct(document("bib.xml")/bib/book[publisher="Morgan Kaufmann"]/author)
  RETURN <result> $a, FOR $t IN /bib/book[author=$a]/title RETURN $t </result>
distinct = a function that eliminates duplicates

XQuery Result: <result> <author>Jones</author> <title> abc </title> <title> def </title> </result> <result> <author> Smith </author> <title> ghi </title> </result>
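The nested query above has two stages: collect the distinct authors who have a Morgan Kaufmann book, then list every title by each of them regardless of publisher. The sketch below spells out those stages with Python's xml.etree.ElementTree; it is an illustration on invented data chosen to reproduce the result shown, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB_XML = """
<bib>
  <book><publisher>Morgan Kaufmann</publisher><author>Jones</author><title>abc</title></book>
  <book><publisher>Addison Wesley</publisher><author>Jones</author><title>def</title></book>
  <book><publisher>Morgan Kaufmann</publisher><author>Smith</author><title>ghi</title></book>
  <book><publisher>Addison Wesley</publisher><author>Brown</author><title>jkl</title></book>
</bib>
"""

def books_by_publisher_authors(bib, publisher="Morgan Kaufmann"):
    """Return (author, [titles]) pairs for authors with a book at the publisher."""
    # Outer FOR with distinct(): authors of that publisher's books, no duplicates.
    authors = []
    for book in bib.findall("book"):
        if book.findtext("publisher") == publisher:
            for author in book.findall("author"):
                if author.text not in authors:
                    authors.append(author.text)
    # Inner FOR: all titles of books that list the author, at any publisher.
    results = []
    for name in authors:
        titles = [book.findtext("title")
                  for book in bib.findall("book")
                  if name in [a.text for a in book.findall("author")]]
        results.append((name, titles))
    return results

grouped = books_by_publisher_authors(ET.fromstring(BIB_XML))
print(grouped)
```

Jones's second book appears even though it was published elsewhere, because the inner loop does not repeat the publisher condition.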
