XML Data: From Research to Standards

Daniela Florescu (Propel)
Jérôme Siméon (Bell Laboratories)

Data and the Web: A bit of history
• Research:
> 1950s: Lisp [McCarthy]
> 1960s: Tree languages [Büchi]
> 1970s: Relational DBs [Codd]
> 1990: Graphlog [Univ. Toronto]
> 1994: O2 extensions [INRIA]
> 1995: Tsimmis & OEM [Stanford]
> 1995: UnQL [UPenn]
• Internet industry:
> 1957: Sputnik launches ARPA
> 1972: First demonstration of ARPANET
> 1989: Number of hosts breaks 100,000
> 1991: CERN releases the World Wide Web; HTML as the support for information
> 1997: 20 million hosts, 1 million Web sites
> 1998: W3C releases XML to represent information on the Web
Need to handle irregular Web data. Use graph data models.
XML provides a syntax for irregular textual Web information.

The secret of HTML success
• Everybody can write it:
> HTML is simple
> HTML is textual: it is human readable, you can use any editor, ...
• Everybody can read it:
> HTML is portable on any platform
> The browser is the universal application
• It connects pieces of information together:
> Through hypertext links

But new applications = new needs
• Infomediaries:
> Search engines
> Web portals
> Digital libraries
• Virtual enterprises
• Electronic services:
> On-line catalogs and procurement
> Comparison shoppers
> Market places
• Scientific applications
• Manufacturing engineering
• etc.
More than HTML: data on the Web
More than the browser: applications on the Web

The Secret of XML Popularity
<book>
  <title> Foundations… </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <publisher> Addison Wesley </publisher>
  <year> 1995 </year>
</book>
• It looks like HTML...
> Simple, familiar, easy to learn, human-readable
> Universal and portable
> Supported by the W3C: trusted and quickly adopted by the industry
• ...but it’s more than HTML!
> Flexible: you can represent any information
> Extensible: you can represent it the way you want!

XML Is Only the Beginning...
• How do you build applications?
> There is an urgent need for XML tools
• Designing XML tools is a data management problem:
> XML 1.0 to describe structured documents ~ Syntax for trees
> XML data models to describe the information content ~ Data model for trees
> XML schemas to describe the structure of information ~ Data definition language for trees
> XML languages to describe information processing ~ Data manipulation language for trees

About the Tutorial
• XML through database glasses
• Contains:
> Up-to-date information about standards
> Relationship with research
> Convergences and divergences
• Divided into 4 parts:
1. Introduction to XML 1.0
2. Data models
3. Schema languages
4. Query languages
Please, please, please, ask questions!

Part I: XML 1.0

About the W3C
• Membership organization
• Different types of groups inside the W3C:
> Working groups
> Interest groups
> Coordination groups
• Status of W3C documents:
> Note
> Working draft
> Last Call
> Candidate/proposed recommendation
> Recommendation ~ Standard

XML activities inside W3C
• Core XML
> eXtensible Markup Language (XML 1.0), namespaces, Infoset
• XML Linking
> XML Pointer Language (XPointer), XML Linking Language (XLink)
• XML Schema
• XML Query
> XML Data Model, Algebra and Query Language
• Document Object Model
• XSL
> XPath
> XSLT/XSL: Transformation and stylesheet language

XML 1.0: Well-formed documents
• An XML document is composed of:
> markup: elements, attributes
> text: #PCDATA, CDATA
• A well-formed document:
> satisfies the XML lexical conventions
> contains properly nested elements with a single root element
> can contain empty elements, mixed text and elements
<book year="1967">
  <title>The politics of experience</title>
  <author>R.D. Laing</author>
  <ref isbn="1341-1444-555"/>
  <section> The great and true Amphibian, whose nature is disposed to…..
    <title>Persons and experience</title>
    Even facts become...
  </section>
  …
</book>
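
The well-formedness rules above can be checked mechanically by any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the three sample strings are illustrative assumptions, not part of the tutorial.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested, single root, one empty element.
GOOD = ('<book year="1967"><title>The politics of experience</title>'
        '<ref isbn="1341-1444-555"/></book>')
# Closing tags are crossed, so elements are not properly nested.
BAD_NESTING = "<book><title>The politics of experience</book></title>"
# Two top-level elements, so there is no single root.
TWO_ROOTS = "<book/><book/>"

for sample in (GOOD, BAD_NESTING, TWO_ROOTS):
    print(is_well_formed(sample))
```

Note that this only tests well-formedness; it says nothing about validity against a DTD.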

XML 1.0: Valid documents
• A valid XML document conforms to a Document Type Definition (DTD):
> grammar for the document
> constraints on the structure of elements, attributes, entities, notations...
> a DTD is optional
• (We will see more about DTDs in the schema part of the tutorial)
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author*, publisher?, section+)>
  <!ATTLIST book year CDATA #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT section (#PCDATA | title | section)*>
]>
...

Some additional features
• General entities: &myentity;
> Declared as part of XML 1.0 or in a DTD
> Used to escape characters, as macros for pieces of documents: &amp; = &
> An XML document contains Unicode characters: &lt; = &#60; = <
• Parameter entities: %myentity;
> Declared in a DTD, used as macros for pieces of DTDs
<!ENTITY % macro "publisher (#PCDATA)">
…
<!ELEMENT %macro;>
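
As a small illustration of general entities, the sketch below lets Python's standard xml.etree.ElementTree parser expand predefined entities, character references, and one entity declared in an internal DTD subset. The element name t and the entity name pub are hypothetical; parameter entities are not shown.

```python
import xml.etree.ElementTree as ET

# Predefined entities and numeric character references are expanded by the parser.
escaped = ET.fromstring("<t>&amp; &lt; &#60; &#x3C;</t>").text

# A general entity declared in the internal DTD subset acts as a macro.
# "pub" is a made-up entity name used only for this example.
DOC = '<!DOCTYPE t [<!ENTITY pub "Addison Wesley">]><t>&pub;</t>'
expanded = ET.fromstring(DOC).text

print(escaped)
print(expanded)
```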

Even more additional features
• Namespaces: mynames:name
> a set of names identified by a URI
> tags and attribute names become qualified names (QName)
• Processing instructions
> to embed processing in a document (e.g. Java applet in HTML)
• Comments
<myns:section xmlns:myns="http://caravel.inria.fr/mySchema">
  <myns:title> Persons and experience</myns:title>
</myns:section>
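
To make the notion of a qualified name concrete, here is a short sketch that parses the namespace example with Python's standard xml.etree.ElementTree, which represents a QName as {URI}localname. The prefix m in the lookup table is an arbitrary choice for this example; only the URI matters.

```python
import xml.etree.ElementTree as ET

DOC = ('<myns:section xmlns:myns="http://caravel.inria.fr/mySchema">'
       '<myns:title> Persons and experience</myns:title>'
       '</myns:section>')

root = ET.fromstring(DOC)

# The prefix used for lookup need not match the one in the document.
NS = {"m": "http://caravel.inria.fr/mySchema"}
title = root.find("m:title", NS)

print(root.tag)            # qualified name of the root element
print(title.text.strip())  # text content of the qualified title element
```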

Part II: Data Model

Why a data model for XML ?
For old & well-known (but good!) reasons
• As a support for physical/logical independence
> XML can be stored in files, a native XML repository, a relational database
> XML can be virtual, as a view of a repository, integrated sources
> XML can be in memory, using data structures in C, C++, Java, etc.
> XML can be streamed between processes
• To describe information content of XML documents
> to agree and reason about information content, preservation
• To define semantics of operations:
> equality, etc.

But XML has specifics
• Serialization syntax
• Some information exists only after schema validation
> price is not a string but a decimal value
> refs is not a string but a list of references
• One more motivation for a data model:
> To isolate the user from syntactic details of XML
<book bookid="b1" price="10.50">
  <title>War &amp; Peace</title>
  <author>Tolstoi</author>
  <biblio refs="b1 b2 b3"/>
</book>
<xsd:attribute name="price" type="xsd:decimal"/>
<xsd:attribute name="bookid" type="xsd:ID"/>
<xsd:attribute name="refs" type="xsd:IDREFS"/>
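
The point that some information exists only after schema validation can be sketched as follows. A plain parser returns every attribute as a string; a table of converters then plays the role of the schema. The table ATTRIBUTE_TYPES and the helper typed_attributes are hypothetical stand-ins for a real schema processor, which would also check that the references resolve.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

DOC = ('<book bookid="b1" price="10.50">'
       '<title>War &amp; Peace</title>'
       '<author>Tolstoi</author>'
       '<biblio refs="b1 b2 b3"/>'
       '</book>')

# Hypothetical stand-in for the schema: attribute name -> converter.
ATTRIBUTE_TYPES = {
    "price": Decimal,     # xsd:decimal
    "bookid": str,        # xsd:ID, kept as a string here
    "refs": str.split,    # xsd:IDREFS, a whitespace-separated list
}

def typed_attributes(elem):
    """Interpret each attribute of `elem` according to ATTRIBUTE_TYPES."""
    return {name: ATTRIBUTE_TYPES.get(name, str)(value)
            for name, value in elem.attrib.items()}

book = ET.fromstring(DOC)
print(book.attrib["price"])                    # before: a plain string
print(typed_attributes(book)["price"])         # after: a decimal value
print(typed_attributes(book.find("biblio")))   # after: a list of references
```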

Existing data models
• Graph and tree models used in research
• Document Object Model (DOM)
> status: recommendation
> programmatic interface for XML (with an object-oriented flavor)
• XML Information Set (Infoset)
> describes the information content exported by XML processors
> can be generated after parsing or after validation
• XML languages’ data models:
> required for language semantics
> XPath: the recommendation has its own data model
> XML Query Data Model: working draft

Semistructured model
[Figure: OEM graph rooted at Bib (&b0), with objects &b1, &b2, &b3 and edge labels book, title, author, publisher, price, biblio, references, refs; leaf values include “Tolstoi”, “War & Peace”, 10.50]
• Graph based, unordered, edge-labeled (here OEM)
> But XML is ordered, tree based
> Node-labeled seems more natural (e.g., like in DOM)

Ordered model
[Figure: ordered tree rooted at b0: bib, with book nodes b1, b2, b3; each book has children such as title, author, price, biblio; refs point to &b1, &b2, &b3; leaf values include “Tolstoi”, “War & Peace”, 10.50]
• Node-labeled, ordered trees, with references (YAT)
> But what about attributes (unordered!), namespaces, processing instructions, etc. ?

XML Infoset
• Specifies a description of information in a well-formed XML document
• Abstract way to think about XML data
• Other processors (e.g. XML Schema) can contribute information
Here is an example in a made-up syntax:
b1 = Element [ local name = "book";
  children = [ Element [ local name = "title" ... ];
               Element [ local name = "author" ... ]; ... ]
  attributes = [ Attribute [ local name = "price";
                   children = [ Character [ code = '1' ];
                                Character [ code = '0' ];
                                Character [ code = '.' ];
                                Character [ code = '5' ];
                                Character [ code = '0' ] ];
                   attribute type = "xsd:decimal" ] ... ] ]

XML Query Data Model
• A node-labeled, tree model with references
> Very close to XPath data model
• Generated after validation
> provides also pointers to schema information
• Uses a functional notation
> no explicit data structure
• Defines a mapping from post-schema validated Infoset to XML Query Data Model
> preserves original infoset (e.g., characters)

XML Query Data Model
• Nodes
Node = DocNode | ElemNode | AttrNode | ValueNode | NSNode | PINode | CommentNode | InfoItemNode
• XML Schema primitive types
string, boolean, ID, IDREF, decimal, QName, ...
• Collections
> sequence: [T]
> bag: {T}
> union: T1 | T2
• References
ref(T)

Constructors & accessors
• Attribute constructor
attrNode : (QNameValue, ValueNode) -> AttrNode
ValueNode = StringValue | DecimalValue | ...
qnameValue : (uriReference | null, string) -> QNameValue
• Attribute accessors
name : AttrNode -> QNameValue
value : AttrNode -> ValueNode
type : AttrNode -> ElemNode
• Example: <book price="10.50"/>
A1 = attrNode(qnameValue(null, "price"), decimalValue(10.50))
name(A1) = qnameValue(null, "price")
value(A1) = decimalValue(10.50)
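
The functional notation above can be mimicked with immutable records plus free functions. This is only a sketch of the idea, not the working draft's definition: the Python class layout is an assumption, null is rendered as None, and the type accessor (which points to schema information) is left out.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Union

@dataclass(frozen=True)
class QNameValue:
    uri: Optional[str]   # namespace URI, or None for "null"
    local: str           # local part of the name

@dataclass(frozen=True)
class StringValue:
    content: str

@dataclass(frozen=True)
class DecimalValue:
    content: Decimal

ValueNode = Union[StringValue, DecimalValue]

@dataclass(frozen=True)
class AttrNode:
    qname: QNameValue
    content: ValueNode

# Constructors, named as on the slide.
def qnameValue(uri, local): return QNameValue(uri, local)
def decimalValue(lexical): return DecimalValue(Decimal(lexical))
def attrNode(qname, content): return AttrNode(qname, content)

# Accessors, named as on the slide.
def name(attr): return attr.qname
def value(attr): return attr.content

# <book price="10.50"/>
A1 = attrNode(qnameValue(None, "price"), decimalValue("10.50"))
print(name(A1))
print(value(A1))
```

Users only ever see constructors and accessors, which is what "no explicit data structure" amounts to: the records could be replaced by any other representation without changing the interface.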

XML Data Model: Conclusion
• Research focuses on simple formal models
• Many standards related to the need for a data model
• XML Query Data Model reconciles both worlds
> Complete with respect to XML
> Simple design with a clear connection to a formal model: ordered trees, node-labeled, with references
> Clear relationship to other W3C standards: mapping to XML Infoset; based on XPath + typed values and unordered collections
> Less clear relationship with DOM

Part III: Data Definition Language

Why a DDL for XML ?
For old & well-known (but good!) reasons
• As an ontology & modeling tool:
> to describe the structure of information: entities, relationships...
> to share common descriptions between actors/applications
> to guide query formulation and application development
• For error detection & safety:
> to verify that documents comply with what the application expects
> to make sure that the application accesses valid data
> to enforce safe operations (e.g., don’t do float arithmetic on trees!)
> to check that compositions of operations make sense
• For performance:
> to design storage (saving space, improving clustering, etc.)
> to process queries (algebraic laws, rewriting path expressions, etc.)

But XML deals with new needs
• XML data created from legacy repositories
> Need to capture schemas from heterogeneous sources
> Relational schemas: Simple but with integrity constraints
> Object-oriented schemas: Typed references, inheritance...
> Document grammars: Regular expressions, mixed text and structure
• XML used on the Web, for data exchange
> Need to remain flexible
> Web sources: From strict schemas to well-formed documents (smooooothly........)
> Many applications use the same information: We should be able to type the same document in multiple ways

Existing schema languages
• DTDs (W3C recommendation as part of XML 1.0)
> powerful for documents: regular expressions, mixes of text and structure
> limited for other applications: cannot capture relational or object schemas
• XML Schema (Candidate recommendation)
> Many new features: data types, forms of subtyping, etc.
> More powerful but quite complex
• Schemas for unordered semistructured models:
> Data guides, Graph schemas, using Datalog
> Used for optimization, schema inference from data
• Schemas for ordered tree models:
> Regular tree grammars, YAT, lotos, XDuce, Relax, TRex etc.
> Used for optimization, type checking and inference from queries

DDL Roadmap
3.1. Describing atomic values
> integer, string, float, date, images, etc.
3.2. Describing structures
> elements: tag-coupled approach vs. tag-decoupled approach
> attributes
3.3. More semantics
> identity, references, relationships intra or inter documents
> isa: notion of inheritance...
3.4. Simplifying schema reuse
> import/export abilities
> refinement of existing descriptions

Values in XML: easy ?
• DTD says it’s easy:
> Recipe: #PCDATA = string, CDATA = other strings, ...
> I.e.: Everything is a string
> Unfortunately: Strings are not a panacea...
• Database research says it’s easy:
> Recipe: Take a data model with atomic types. Each value is in a different type...
> I.e.: Don’t deal with syntax but data model
> Unfortunately: XML = file = syntax

Values in XML: many issues...
• Addressing numerous needs:
> float, string, int, date, URI, telephone number, gif, applet, etc.
• Living with XML 1.0 syntax
> The same lexical representation can correspond to several values
> The same value can have several lexical representations
> binary formats (images, etc.) must be serialized in a portable way
• Compatible with other standards
• Compatible with internationalization
> World Wide Web!
<book><title>Haystacks at Chailly </title><author>Monet</author><date>1865</date><price>1865</price></book>
<book><ref>Monet1865</ref><in_stock>true</in_stock></book>
<book><ref>Monet1865</ref><in_stock>1</in_stock></book>

XML Schema Part 2: Datatypes
• Defines 14 built-in types (basic types)
> general purpose types
> types for compatibility with DTDs
• Relies on other existing standards whenever possible
> IEEE 754-1985 for floats
> UCS [ISO 10646] & Unicode for internationalization
> ISO 8601 for dates
• Gives the ability to define new types (derived types)
• Single lexical representation for many values ?
> document is interpreted with respect to a given schema
> if no schema, the value is given the type string

Datatypes: base types
• Base types cover essential needs
> “classic” values: string, boolean, float, double, decimal
> temporal values: timeDuration, recurringDuration
> binary values: binary
> Web-related types: uriReference, QName
> DTD types: ID, IDREF, ENTITY, NOTATION
• One value for several syntaxes
> Each base type has a set of values (value space)
> Values may have several lexical representations (lexical space)
> Equality and order are defined in terms of the value space
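
The distinction between lexical space and value space can be sketched with two small parsers. The function names are hypothetical, and the boolean lexical forms true/false/1/0 are an assumption taken from the in_stock example and from later versions of the specification.

```python
from decimal import Decimal

def parse_boolean(lexical: str) -> bool:
    """Map a lexical representation to a value of the boolean value space."""
    lexical = lexical.strip()
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {lexical!r}")

def parse_decimal(lexical: str) -> Decimal:
    """Map a lexical representation to a value of the decimal value space."""
    return Decimal(lexical.strip())

# Several lexical representations, one value: equality is decided on values.
print(parse_boolean("true") == parse_boolean("1"))
print(parse_decimal("10.50") == parse_decimal("10.5"))

# One lexical representation, several values: "1865" is a string, a year,
# or a price depending on the type the schema assigns to it.
print("1865" == str(parse_decimal("1865")))
```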

Base types: examples

Datatypes: facets
• Each base type has facets (read: properties)
• Some facets are fundamental
> equality, order
> bounded, cardinality, numeric
• Some facets are constraining
> length, minLength, maxLength: for string, binary or lists
> maxInclusive, maxExclusive, minInclusive, minExclusive
> precision, scale: for decimal numbers
> encoding: hex or base64 for binary
> enumeration, pattern
> duration, period

Datatypes: derived types
• One can derive types by restriction of facets
• One can derive types by list
• XML Schema offers predefined derived types
> integer, nonPositiveInteger, int, date, year, century, timeInstant, language, etc.
> IDREFS, NMTOKENS, etc.
<simpleType name='integer' base='xsd:decimal'>
  <scale value='0'/>
</simpleType>
<simpleType name='int' base='xsd:integer'>
  <maxInclusive value='2147483647'/>
  <minInclusive value='-2147483648'/>
</simpleType>
<simpleType name='IDREFS' base='xsd:IDREF' derivedBy='xsd:list'/>

Now you can practice...
> Using a range facet
> Using an enumeration facet
> Using a pattern facet
> Using a list type
> etc.
<simpleType name='auctionprice' base='xsd:decimal'>
  <minInclusive value='10'/>
</simpleType>
<simpleType name='booktype' base='xsd:string'>
  <xsd:enumeration value="Book"/>
  <xsd:enumeration value="Collection"/>...
<xsd:simpleType name="isbn" base='xsd:string'>
  <xsd:pattern value="ISBN \d{10}"/>
</xsd:simpleType>
<xsd:simpleType name="auctions" base="xsd:auctionprice" derivedBy="xsd:list"/>
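
Each derived type above restricts a base type with one facet, so each can be sketched as a base-type parser followed by a check. The Python function names mirror the type names on the slide but are otherwise hypothetical; the enumeration uses only the two values listed, since the rest are elided in the original.

```python
import re
from decimal import Decimal

def auctionprice(lexical: str) -> Decimal:
    """decimal restricted by minInclusive = 10 (range facet)."""
    value = Decimal(lexical)
    if value < 10:
        raise ValueError(f"{lexical!r} is below minInclusive 10")
    return value

BOOKTYPE_VALUES = {"Book", "Collection"}  # further values elided on the slide

def booktype(lexical: str) -> str:
    """string restricted by an enumeration facet."""
    if lexical not in BOOKTYPE_VALUES:
        raise ValueError(f"{lexical!r} is not an enumerated value")
    return lexical

ISBN_PATTERN = re.compile(r"ISBN \d{10}")

def isbn(lexical: str) -> str:
    """string restricted by a pattern facet (the whole string must match)."""
    if not ISBN_PATTERN.fullmatch(lexical):
        raise ValueError(f"{lexical!r} does not match the pattern")
    return lexical

def auctions(lexical: str) -> list:
    """List type: whitespace-separated items, each an auctionprice."""
    return [auctionprice(item) for item in lexical.split()]

print(auctions("10 12.50 99"))
print(isbn("ISBN 0123456789"))
```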

Describing Values: Conclusion
• Not addressed in research
• XML Schema Part 2: Datatypes does a good job
> Quite complete
> Deals with complex requirements (e.g., internationalization)
• Defines values but not operations!
> Needed by XPath, XQuery…

Describing XML structures
• element names
> with the names themselves: book, title, etc.
> possibly with wildcards: ~ = any tag, !a = not a, etc.
• element children
> using regular expressions
• element attributes
> unordered attribute-value pairs
• Main question: types vs. element names
> does the element name determine the type ?
> tag-coupled types vs. tag-decoupled types

Coupled types
• Approach taken by DTDs
> two elements with the same name always have the same type
> children = regular expression over elements
• Properties
> easy to parse: no depth look-ahead
> no closure under union, no local names allowed
> cannot express relational, object-oriented schemas
<!ELEMENT book (title, author+, price, publisher, section, conclusion?)>
<!ELEMENT title (#PCDATA)>....
<!ELEMENT author (name, affiliation)>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>....
<!ELEMENT publisher (name, address)>...
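
Because a coupled type is a regular expression over element names, checking the children of one element needs no look-ahead into the subtrees. The sketch below hand-translates the book content model into a Python regular expression over space-terminated child names; the translation and the helper children_match are illustrative assumptions, not a DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# (title, author+, price, publisher, section, conclusion?)
# Each child name is followed by one space so that names cannot run together.
BOOK_MODEL = re.compile(r"title (author )+price publisher section (conclusion )?")

def children_match(elem, model) -> bool:
    """Check the sequence of child element names of `elem` against `model`."""
    names = "".join(child.tag + " " for child in elem)
    return model.fullmatch(names) is not None

ok = ET.fromstring(
    "<book><title/><author/><author/><price/><publisher/><section/></book>")
no_author = ET.fromstring(
    "<book><title/><price/><publisher/><section/></book>")

print(children_match(ok, BOOK_MODEL))
print(children_match(no_author, BOOK_MODEL))
```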

Decoupled types
• Approach taken by YAT, XDuce, lotos, etc.
> types are decoupled from element names
> children are defined by regular expressions over types
> different types can have the same tag
• Properties
> equivalent to regular tree grammars
> closure under intersection, complement, union...
> more precise type for documents and queries
> harder to parse (might require look-ahead and backtracking)
type Book = book [ Title, Author+, Price, Publisher, Section, Conclusion? ]
type Title = title [ String ]
type Author = author [ Name, Affiliation ]
type Name = name [ first [ String ], last [ String ] ]
...
type Publisher = publisher [ PName, Address ]
type PName = name [ String ]
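
To see why decoupled types may need look-ahead, here is a small sketch of a checker in which two types (Name and PName) share the tag name. It computes, bottom-up, the set of types each element can have. The table TYPES covers only part of the grammar above; the types First, Last, Affiliation and Address are assumptions added to make the example self-contained, and the brute-force search over type assignments is for illustration only.

```python
import itertools
import re
import xml.etree.ElementTree as ET

# type name -> (tag, regular expression over space-terminated child type names).
# A model of None stands for text-only content (String).
TYPES = {
    "Author":      ("author", r"Name Affiliation "),
    "Name":        ("name", r"First Last "),
    "First":       ("first", None),        # assumed helper type
    "Last":        ("last", None),         # assumed helper type
    "Affiliation": ("affiliation", None),  # assumed: not defined on the slide
    "Publisher":   ("publisher", r"PName Address "),
    "PName":       ("name", None),
    "Address":     ("address", None),      # assumed: not defined on the slide
}

def types_of(elem) -> set:
    """Return the set of type names that `elem` conforms to."""
    child_types = [types_of(child) for child in elem]
    result = set()
    for type_name, (tag, model) in TYPES.items():
        if tag != elem.tag:
            continue
        if model is None:
            if not child_types:  # text-only content has no child elements
                result.add(type_name)
            continue
        # Try every way of choosing one candidate type per child.
        for choice in itertools.product(*child_types):
            if re.fullmatch(model, "".join(t + " " for t in choice)):
                result.add(type_name)
                break
    return result

author = ET.fromstring(
    "<author><name><first>Serge</first><last>Abiteboul</last></name>"
    "<affiliation>INRIA</affiliation></author>")
publisher = ET.fromstring(
    "<publisher><name>Addison Wesley</name><address>Boston</address></publisher>")

print(types_of(author.find("name")))     # the structured name
print(types_of(publisher.find("name")))  # the plain-string name
```

The tag name alone does not determine the type; the checker has to inspect the content of each name element, which is the look-ahead mentioned above.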

Decoupled types cont’d
• They can easily describe mixed content
• They can easily describe all well-formed documents
• They support a notion of subtyping via inclusion
> all documents of type Body2 are also of type Body and UrTree
• But they can be ambiguous
> deciding between Body and Body2 can be expensive
type Section = section [ title [ String ], Body ]
type Body = content [ (b [ Body ] | footnote [ String ] | Section | String)* ]
type UrScalar = (String | Boolean | Float | Double ...)
type UrTree = UrScalar | ~[ UrTree* ]
type Body2 = content [ String, (b [ String ] | footnote [ String ] | String)*, Section* ]
Body2 <: Body <: UrTree
type Section2 = section [ title [ String ], Body2*, Body* ]

Decoupled types & full XML
• How do you describe attributes ?
type Book = book [ @isbn [ String ], Title, Author+, Price, Publisher, Section, Conclusion? ]
> but attributes are unordered, without duplicates
> they do not interact with the children of the element
> they cannot contain complex values
• How do you describe references ?
> Like in object schemas [Cluet et al 1998]:
type Book = book [ title [ String ], &Author+, &Publisher ]
type Author = author [ name [ first [ String ], last [ String ] ] ]
type Publisher = publisher [ name [ String ] ]
> but it’s even harder to parse because of cycles [Beeri, Milo 1999]
• How do you deal with XML specifics ?
> entities, processing instructions, namespaces, serialization, etc.

What about XML Schema ?
• Tries to get the expressive power of decoupled types + the ease of parsing of coupled types
• Advanced features: “subtyping”, constraints...
• Deals with all the specifics of XML
• XML Schema syntax is in XML
Results in a pretty complex specification
<xsd:element name="book">
  <xsd:complexType>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="author" maxOccurs="unbounded">
      <xsd:complexType>
        <element name="first" type="xsd:string"/>
        <element name="last" type="xsd:string"/>
      </xsd:complexType>
    </xsd:element>
    ………
  </xsd:complexType>
</xsd:element>

Element & attribute declarations
• Element decl. ~ associate element names to types
> have a name and their content is described by a type
• Attribute decl. ~ associate attribute names to types
> have a name and contain an atomic value
> can be required or optional
> can only appear inside elements (through complex types)
<xsd:element name="title" type="xsd:string"/>
~ title [ String ]
<xsd:element name="affiliation" type="publisher"/>
~ affiliation [ Publisher ]
<xs:attribute name="price"/>
~ @price [ String ]?
<xs:attribute name="auctionhistory" type="auctions" use="required"/>
~ @auctionhistory [ Auctions ]
type Auctions = Decimal*

Model groups
• Define content models (i.e., the type for the children of an element) ~ equivalent to regular expressions over elements
<xsd:sequence>
  <xsd:element name="title" type="Title"/>
  <xsd:element name="price" type="Price"/>
</xsd:sequence>
~ title[Title], price[Price]
<xsd:choice>
  <xsd:element name="publisher" type="Publisher"/>
  <xsd:element name="editor" type="Author"/>
</xsd:choice>
~ ( publisher[Publisher] | editor[Author] )
<xsd:sequence minOccurs="0" maxOccurs="unbounded">
  <xsd:element name="book" type="Book"/>
</xsd:sequence>
~ book[ Book ]*
<xsd:all>
  <xsd:element name="title" type="Title"/>
  <xsd:element name="price" type="Price"/>
</xsd:all>
~ (title[Title], price[Price]) | (price[Price], title[Title])
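
The correspondence between model groups and regular expressions can be sketched by building regular expressions over space-terminated element names. All function names below are hypothetical, and the all group is expanded into a choice over every ordering, exactly as in the last equivalence above.

```python
import itertools
import re

def element(name: str) -> str:
    """One occurrence of an element, as a pattern over 'name ' tokens."""
    return re.escape(name) + " "

def sequence(*parts: str) -> str:
    return "".join(parts)

def choice(*parts: str) -> str:
    return "(?:" + "|".join(parts) + ")"

def zero_or_more(part: str) -> str:
    """minOccurs=0, maxOccurs=unbounded."""
    return "(?:" + part + ")*"

def all_group(*parts: str) -> str:
    """xsd:all ~ a choice between all orderings of the parts."""
    return choice(*(sequence(*p) for p in itertools.permutations(parts)))

def accepts(model: str, names) -> bool:
    """Does the sequence of child element names match the content model?"""
    return re.fullmatch(model, "".join(n + " " for n in names)) is not None

title_and_price = all_group(element("title"), element("price"))
print(accepts(title_and_price, ["title", "price"]))
print(accepts(title_and_price, ["price", "title"]))
print(accepts(title_and_price, ["title", "title"]))
```

Expanding all into permutations grows factorially with the number of parts, which hints at why this group is treated specially rather than as plain regular-expression sugar.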

Complex type definitions
> they contain a content model and attribute declarations
> they can be empty
> they can be recursive
> they can be mixed (i.e., strings + sub-elements)
<xsd:complexType name="Book">
  <sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="author" maxOccurs="unbounded" type="AuthorName"/>
  </sequence>
  <xsd:attribute name="isbn" type="xsd:string"/>
</xsd:complexType>
~ type Book = @isbn [String], title [String], author[ Name ]+
<xsd:complexType name="RefBib" content="empty">
  <xsd:attribute name="refto" type="xsd:IDREF"/>
</xsd:complexType>
~ type RefBib = @refto [ &UrTree ]
<xsd:complexType name="Body" content="mixed">
  <xsd:element name="b" type="Body" minOccurs="0" maxOccurs="unbounded"/>
</xsd:complexType>
~ type Body = (b[Body]|String)*

Some feature interactions
• Local element restrictions
> local elements with the same name can have different types
> but they must have the same type among siblings
<xsd:element name="author"><xsd:complexType>
  <xsd:element name="name" type="AuthorName"/>
</xsd:complexType></xsd:element>
~ type Author = author [ name[ AuthorName ] ]
<xsd:element name="publisher"><xsd:complexType>
  <xsd:element name="name" type="xsd:string"/>...
</xsd:complexType></xsd:element>
~ type Publisher = publisher [ name [ String ] ]
<xsd:complexType name="Names">
  <xsd:element name="name" type="AuthorName"/>
  <xsd:element name="name" type="xsd:string" minOccurs="0"/>
</xsd:complexType>
~ type Names = name [ AuthorName ], name [ String ]?
• To be simple or not to be simple...
> requires a complexType defined by extension over decimals
<internationalPrice currency='EU'>423.46</internationalPrice>
