
XML Data: From Research to Standards

Daniela Florescu (Propel), Jérôme Siméon (Bell Laboratories)

  1. XML Data: From Research to Standards. Daniela Florescu (Propel), Jérôme Siméon (Bell Laboratories)

  2. Data and the Web: A bit of history • Research: > 1950s: Lisp [McCarthy] > 1960s: Tree languages [Büchi] > 1970s: Relational DBs [Codd] > 1990: Graphlog [Univ. Toronto] > 1994: O2 extensions [INRIA] > 1995: Tsimmis & OEM [Stanford] > 1995: UnQL [UPenn] • Internet industry: > 1957: Sputnik launches ARPA > 1972: First demonstration of ARPANET > 1989: Number of hosts breaks 100,000 > 1991: CERN releases the World Wide Web, with HTML as the support for information > 1997: 20 million hosts, 1 million Web sites > 1998: W3C releases XML to represent information on the Web • Need to handle irregular Web data: use graph data models. • XML provides a syntax for irregular textual Web information.

  3. The secret of HTML success • Everybody can write it: > HTML is simple > HTML is textual: it is human readable, you can use any editor, ... • Everybody can read it > HTML is portable on any platform > The browser is the universal application • It connects pieces of information together > Through hypertext links

  4. But new applications = new needs • Infomediaries: > Search engines > Web portals > Digital libraries > Virtual enterprises • Electronic services: > On-line catalogs and procurement > Comparison shoppers > Marketplaces • Scientific applications • Manufacturing engineering, etc. • More than HTML: data on the Web • More than the browser: applications on the Web

  5. The Secret of XML Popularity • It looks like HTML... > Simple, familiar, easy to learn, human-readable > Universal and portable > Supported by the W3C: trusted and quickly adopted by the industry • ...but it’s more than HTML! > Flexible: you can represent any information > Extensible: you can represent it the way you want! • Example: <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book>

  6. XML Is Only the Beginning... • How do you build applications ? > There is an urgent need for XML tools • Designing XML tools is a data management problem: > XML 1.0 to describe structured documents ~ Syntax for trees > XML data models to describe the information content ~ Data model for trees > XML schemas to describe the structure of information ~ Data definition language for trees > XML languages to describe information processing ~ Data manipulation language for trees

  7. About the Tutorial • XML through database glasses • Contains: > Up-to-date information about standards > Relationship with research > Convergences and divergences • Divided into 4 parts: 1. Introduction to XML 1.0 2. Data models 3. Schema languages 4. Query languages • Please, please, please, ask questions!

  8. Part I: XML 1.0

  9. About the W3C • Membership organization • Different types of groups inside the W3C: > Working groups > Interest groups > Coordination groups • Status of W3C documents: > Note > Working draft > Last Call > Candidate/proposed recommendation > Recommendation ~ Standard

  10. XML activities inside W3C • Core XML > eXtensible Markup Language (XML 1.0), namespaces, Infoset • XML Linking > XML Pointer Language (XPointer), XML Linking language • XML Schema • XML Query > XML Data Model, Algebra and Query Language • Document Object Model • XSL > XPath > XSLT/XSL: Transformation and stylesheet language

  11. XML 1.0: Well-formed documents • An XML document is composed of: > markup: elements, attributes > text: #PCDATA, CDATA • A well-formed document: > follows the XML lexical conventions > contains properly nested elements with a single root element > can contain empty elements, and mixed text and elements • Example: <book year="1967"> <title>The politics of experience</title> <author>R.D. Laing</author> <ref isbn="1341-1444-555"/> <section> The great and true Amphibian, whose nature is disposed to….. <title>Persons and experience</title> Even facts become... </section> … </book>
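
The well-formedness rules on this slide can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the tutorial.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested, single root, one empty element, mixed text and elements.
GOOD = (
    '<book year="1967"><title>The politics of experience</title>'
    '<ref isbn="1341-1444-555"/>'
    '<section>Some text <title>Persons and experience</title> more text</section>'
    '</book>'
)
# Closing tags in the wrong order: not properly nested.
BAD_NESTING = '<book><title>The politics of experience</book></title>'
# Two top-level elements: no single root.
TWO_ROOTS = '<book/><book/>'

for label, doc in [("good", GOOD), ("bad nesting", BAD_NESTING), ("two roots", TWO_ROOTS)]:
    print(label, is_well_formed(doc))
```

Note that this only checks well-formedness; it says nothing about validity against a DTD.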

  12. XML 1.0: Valid documents • A valid XML document conforms to a Document Type Definition (DTD): > grammar for the document > constraints on the structure of elements, attributes, entities, notations... > a DTD is optional • (We will see more about DTDs in the schema part of the tutorial) • Example: <?xml version="1.0"?> <!DOCTYPE book [ <!ELEMENT book (title, author*, publisher?, section+)> <!ATTLIST book year CDATA #IMPLIED> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)> <!ELEMENT section (#PCDATA | title | section)*> ]> ...

  13. Some additional features • General entities: &myentity; > Declared as part of XML 1.0 or in a DTD > Used to escape characters, or as macros for pieces of documents: &amp; stands for & > An XML document contains Unicode characters: &#60; and &lt; both stand for < • Parameter entities: %myentity; > Declared in a DTD, used as macros for pieces of DTDs: <!ENTITY % macro "publisher (#PCDATA)"> … <!ELEMENT %macro;>
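
To see that predefined entities and character references are only alternative spellings of the same characters, one can parse a small document and look at the resulting text. This is a sketch with Python's standard library; the element names `a` and `b` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# &amp; is a predefined general entity; &#60; is a character reference
# and &lt; a predefined entity, both for the same character.
doc = "<book><title>War &amp; Peace</title><a>&#60;</a><b>&lt;</b></book>"
root = ET.fromstring(doc)

title_text = root.find("title").text
a_text = root.find("a").text
b_text = root.find("b").text

# After parsing, the application sees characters, not entity syntax.
print(title_text)
print(a_text == b_text)
```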

  14. Even more additional features • Namespaces: mynames:name > a set of names identified by a URI > tags and attribute names become qualified names (QName) • Processing instructions > to embed processing in a document (e.g. Java applet in HTML) • Comments • Example: <myns:section xmlns:myns="http://caravel.inria.fr/mySchema"> <myns:title> Persons and experience</myns:title> </myns:section> <!-- This is a comment -->
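
A qualified name is a pair (namespace URI, local name); the prefix is only a local abbreviation. The sketch below, using Python's standard `xml.etree.ElementTree`, parses the slide's example twice with different prefixes and compares the resulting names. The helper `split_qname` and the second prefix `x` are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

URI = "http://caravel.inria.fr/mySchema"


def split_qname(tag: str):
    """Split ElementTree's '{uri}local' notation into (uri, local name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


doc_myns = (
    f'<myns:section xmlns:myns="{URI}">'
    "<myns:title>Persons and experience</myns:title>"
    "</myns:section>"
)
# Same document with a different prefix bound to the same URI.
doc_x = f'<x:section xmlns:x="{URI}"><x:title>Persons and experience</x:title></x:section>'

root_myns = ET.fromstring(doc_myns)
root_x = ET.fromstring(doc_x)

print(split_qname(root_myns.tag))
print(root_myns.tag == root_x.tag)  # the prefix does not matter, the URI does
```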

  15. Part II: Data Model

  16. Why a data model for XML? For old & well-known (but good!) reasons • As a support for physical/logical independence > XML can be stored in files, a native XML repository, a relational database > XML can be virtual, as a view of a repository or of integrated sources > XML can be in memory, using data structures in C, C++, Java, etc. > XML can be streamed between processes • To describe the information content of XML documents > to agree and reason about information content and its preservation • To define the semantics of operations: > equality, etc.

  17. But XML has specifics • Serialization syntax • Some information exists only after schema validation > price is not a string but a decimal value > refs is not a string but a list of references • One more motivation for a data model: to isolate the user from syntactic details of XML • Document: <book bookid="b1" price="10.50"> <title>War &amp; Peace</title> <author>Tolstoi</author> <biblio refs="b1 b2 b3"/> </book> • Schema: <xsd:attribute name="price" type="xsd:decimal"/> <xsd:attribute name="bookid" type="xsd:ID"/> <xsd:attribute name="refs" type="xsd:IDREFS"/>
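
The point that typed values exist only after validation can be illustrated by hand. In the sketch below a plain parser returns every attribute as a string, and a hypothetical table `ATTRIBUTE_TYPES` stands in for the schema's attribute declarations; a real schema processor would derive this information from the XML Schema itself.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

doc = (
    '<book bookid="b1" price="10.50">'
    "<title>War &amp; Peace</title><author>Tolstoi</author>"
    '<biblio refs="b1 b2 b3"/>'
    "</book>"
)

# Hypothetical stand-in for the schema: attribute name -> converter.
ATTRIBUTE_TYPES = {
    "price": Decimal,   # xsd:decimal
    "bookid": str,      # xsd:ID, kept as a string here
    "refs": str.split,  # xsd:IDREFS, a whitespace-separated list
}


def typed_attributes(elem):
    """Return the element's attributes converted according to ATTRIBUTE_TYPES."""
    return {
        name: ATTRIBUTE_TYPES.get(name, str)(value)
        for name, value in elem.attrib.items()
    }


book = ET.fromstring(doc)
print(book.attrib["price"])            # a string before "validation"
print(typed_attributes(book))          # typed values afterwards
print(typed_attributes(book.find("biblio")))
```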

  18. Existing data models • Graph and tree models used in research • Document Object Model (DOM) > status: recommendation > programmatic interface for XML (with an object-oriented flavor) • XML Information Set (Infoset) > describes the information content exported by XML processors > can be generated after parsing or after validation • XML languages’ data models: > required for language semantics > XPath: the recommendation has its own data model > XML Query Data Model: working draft

  19. Semistructured model [Figure: OEM graph for a bibliography. Root Bib (&b0) with book edges to objects &b1, &b2, &b3; edge labels include title, author, publisher, price, biblio, references and refs; leaf values include "Tolstoi", "War & Peace" and 10.50.] • Graph based, unordered, edge-labeled (here OEM) > But XML is ordered, tree based > Node-labeled seems more natural (e.g., like in DOM)

  20. Ordered model [Figure: ordered tree. Node b0: bib with book children b1, b2, b3; each book has children title, author, price, biblio; biblio holds refs &b1, &b2, &b3; leaf values include "Tolstoi", 10.50 and "War & Peace".] • Node-labeled, ordered trees, with references (YAT) > But what about attributes (unordered!), namespaces, processing instructions, etc.?

  21. XML Infoset • Specifies a description of the information in a well-formed XML document • Abstract way to think about XML data • Other processors (e.g. XML Schema) can contribute information • Here is an example in a made-up syntax: b1 = Element [ local name = "book"; children = [ Element [ local name = "title" ... ]; Element [ local name = "author" ... ]; ... ]; attributes = [ Attribute [ local name = "price"; children = [ Character [ code = '1' ]; Character [ code = '0' ]; Character [ code = '.' ]; Character [ code = '5' ]; Character [ code = '0' ] ]; attribute type = "xsd:decimal" ] ... ] ]

  22. XML Query Data Model • A node-labeled tree model with references > Very close to the XPath data model • Generated after validation > also provides pointers to schema information • Uses a functional notation > no explicit data structure • Defines a mapping from the post-schema-validation Infoset to the XML Query Data Model > preserves the original Infoset (e.g., characters)

  23. XML Query Data Model • Nodes: Node = DocNode | ElemNode | AttrNode | ValueNode | NSNode | PINode | CommentNode | InfoItemNode • XML Schema primitive types: string, boolean, ID, IDREF, decimal, QName, ... • Collections: sequence [T], bag {T}, union T1 | T2 • References: ref(T)

  24. Constructors & accessors • Attribute constructor: attrNode : (QNameValue, ValueNode) -> AttrNode, where ValueNode = StringValue | DecimalValue | ... and qnameValue : (uriReference | null, string) -> QNameValue • Attribute accessors: name : AttrNode -> QNameValue; value : AttrNode -> ValueNode; type : AttrNode -> ElemNode • Example: for <book price="10.50"/>, A1 = attrNode(qnameValue(null, "price"), decimalValue(10.50)), so that name(A1) = qnameValue(null, "price") and value(A1) = decimalValue(10.50)
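
The functional notation above can be mimicked in a few lines. The following is a rough sketch, not the W3C definition: the Python classes are an assumed encoding, `null` is rendered as `None`, the decimal is passed as a string to avoid binary floating point, and the `type` accessor is omitted because the slide does not show how schema information is attached.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Union


@dataclass(frozen=True)
class QNameValue:
    uri: Optional[str]  # None plays the role of null
    local: str


@dataclass(frozen=True)
class StringValue:
    value: str


@dataclass(frozen=True)
class DecimalValue:
    value: Decimal


ValueNode = Union[StringValue, DecimalValue]


@dataclass(frozen=True)
class AttrNode:
    qname: QNameValue
    content: ValueNode


# Constructors, named as on the slide.
def qnameValue(uri: Optional[str], local: str) -> QNameValue:
    return QNameValue(uri, local)


def decimalValue(lexical: str) -> DecimalValue:
    return DecimalValue(Decimal(lexical))


def attrNode(qname: QNameValue, content: ValueNode) -> AttrNode:
    return AttrNode(qname, content)


# Accessors, named as on the slide.
def name(attr: AttrNode) -> QNameValue:
    return attr.qname


def value(attr: AttrNode) -> ValueNode:
    return attr.content


# <book price="10.50"/>
A1 = attrNode(qnameValue(None, "price"), decimalValue("10.50"))
print(name(A1))
print(value(A1))
```

Because the nodes are only reachable through constructors and accessors, the concrete representation could be swapped without changing code written against these functions, which is the sense in which the model has "no explicit data structure".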

  25. XML Data Model: Conclusion • Research focuses on simple formal models • Many standards are related to the need for a data model • XML Query Data Model reconciles both worlds > Complete with respect to XML > Simple design with a clear connection to a formal model: ordered trees, node-labeled, with references > Clear relationship to other W3C standards: mapping to the XML Infoset; based on XPath plus typed values and unordered collections > Less clear relationship with DOM

  26. Part III: Data Definition Language

  27. Why a DDL for XML? For old & well-known (but good!) reasons • As an ontology & modeling tool: > to describe the structure of information: entities, relationships... > to share common descriptions between actors/applications > to guide query formulation and application development • For error detection & safety: > to verify that documents comply with what the application expects > to make sure that the application accesses valid data > to enforce safe operations (e.g., don’t do float arithmetic on trees!) > to check that compositions of operations make sense • For performance: > to design storage (saving space, improving clustering, etc.) > to process queries (algebraic laws, rewriting path expressions, etc.)

  28. But XML deals with new needs • XML data created from legacy repositories: need to capture schemas from heterogeneous sources > Relational schemas: simple but with integrity constraints > Object-oriented schemas: typed references, inheritance... > Document grammars: regular expressions, mixed text and structure • XML used on the Web, for data exchange: need to remain flexible > Web sources: from strict schemas to well-formed documents (smooooothly...) > Many applications use the same information: we should be able to type the same document in multiple ways

  29. Existing schema languages • DTDs (W3C recommendation as part of XML 1.0) > powerful for documents: regular expressions, mixes of text and structure > limited for other applications: cannot capture relational or object schemas • XML Schema (Candidate recommendation) > Many new features: data types, forms of subtyping, etc. > More powerful but quite complex • Schemas for unordered semistructured models: > Data guides, Graph schemas, using Datalog > Used for optimization, schema inference from data • Schemas for ordered tree models > Regular tree grammars, YAT, lotos, XDuce, Relax, TRex etc. > Used for optimization, type checking and inference from queries

  30. DDL Roadmap • 3.1. Describing atomic values > integer, string, float, date, images, etc. • 3.2. Describing structures > elements: tag-coupled approach vs. tag-decoupled approach > attributes • 3.3. More semantics > identity, references, relationships within or between documents > isa: notion of inheritance... • 3.4. Simplifying schema reuse > import/export abilities > refinement of existing descriptions

  31. Values in XML: easy ? • DTD says it’s easy: Recipe: #PCDATA = string CDATA = other strings, ... I.e.: Everything is a string Unfortunately: Strings are not a panacea... • Database research says it’s easy: Recipe: Take a data model with atomic types Each value is in a different type... I.e.: Don’t deal with syntax but data model Unfortunately: XML = file = syntax

  32. Values in XML: many issues... • Addressing numerous needs: > float, string, int, date, URI, telephone number, gif, applet, etc. • Living with XML 1.0 syntax > The same lexical representation can correspond to several values > The same value can have several lexical representations > binary formats (images, etc.) must be serialized in a portable way • Compatible with other standards • Compatible with internationalization > World Wide Web! <book><title>Haystacks at Chailly </title><author>Monet</author> <date>1865</date><price>1865</price></book> <book><ref>Monet1865</ref><in_stock>true</in_stock></book> <book><ref>Monet1865</ref><in_stock>1</in_stock></book>

  33. XML Schema Part 2: Datatypes • Defines 14 built-in types (basic types) > general purpose types > types for compatibility with DTDs • Relies on other existing standards whenever possible > IEEE 754-1985 for floats > UCS [ISO 10646] & Unicode for internationalization > ISO 8601 for dates • Gives the ability to define new types (derived types) • Single lexical representation for many values ? > document is interpreted with respect to a given schema > if no schema, the value is given the type string

  34. Datatypes: base types • Base types cover essential needs > “classic” values: string, boolean, float, double, decimal > temporal values: timeDuration, recurringDuration > binary values: binary > Web-related types: uriReference, QName > DTD types: ID, IDREF, ENTITY, NOTATION • One value for several syntaxes > Each base type has a set of values (value space) > Values may have several lexical representations (lexical space) > Equality and order are defined in terms of the value space
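
The distinction between lexical space and value space can be made concrete with two base types. This is a minimal sketch: `parse_boolean` and `parse_decimal` are illustrative helper names, and Python's `bool` and `Decimal` stand in for the value spaces.

```python
from decimal import Decimal


def parse_boolean(lexical: str) -> bool:
    """Map the lexical forms of an XML Schema boolean to its two values."""
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not in the lexical space of boolean: {lexical!r}")


def parse_decimal(lexical: str) -> Decimal:
    """Map a decimal literal to a value; trailing zeros do not change it."""
    return Decimal(lexical)


# Different lexical representations, same value: equality is decided
# in the value space, not on the strings.
print("true" == "1", parse_boolean("true") == parse_boolean("1"))
print("10.50" == "10.5", parse_decimal("10.50") == parse_decimal("10.5"))
# Order is also defined on values: as strings "9" sorts after "10".
print("9" < "10", parse_decimal("9") < parse_decimal("10"))
```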

  35. Base types: examples

  36. Datatypes: facets • Each base type has facets (read: properties) • Some facets are fundamental > equality, order > bounded, cardinality, numeric • Some facets are constraining > length, minLength, maxLength: for string, binary or lists > maxInclusive, maxExclusive, minInclusive, minExclusive > precision, scale: for decimal numbers > encoding: hex or base64 for binary > enumeration, pattern > duration, period

  37. Datatypes: derived types • One can derive types by restriction of facets • One can derive types by list • XML Schema offers predefined derived types > integer, nonPositiveInteger, int, date, year, century, timeInstant, language, etc. > IDREFS, NMTOKENS, etc. • Examples: <simpleType name='integer' base='xsd:decimal'> <scale value='0'/> </simpleType> <simpleType name='int' base='xsd:integer'> <maxInclusive value='2147483647'/> <minInclusive value='-2147483648'/> </simpleType> <simpleType name='IDREFS' base='xsd:IDREF' derivedBy='xsd:list'/>

  38. Now you can practice... > Using a range facet: <simpleType name='auctionprice' base='xsd:decimal'> <minInclusive value='10'/> </simpleType> > Using an enumeration facet: <simpleType name='booktype' base='xsd:string'> <xsd:enumeration value="Book"/> <xsd:enumeration value="Collection"/> ... > Using a pattern facet: <xsd:simpleType name="isbn" base='xsd:string'> <xsd:pattern value="ISBN \d{10}"/> </xsd:simpleType> > Using a list type: <xsd:simpleType name="auctions" base="xsd:auctionprice" derivedBy="xsd:list"/> > etc.
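
What these four derived types accept can be approximated with ordinary checks. The sketch below is an assumption-laden illustration rather than a schema processor: the function names mirror the type names on the slide, the enumeration only contains the two values the slide spells out (further values are elided there), and the pattern is matched against the whole string.

```python
import re
from decimal import Decimal

BOOK_TYPES = {"Book", "Collection"}       # only the values visible on the slide
ISBN_PATTERN = re.compile(r"ISBN \d{10}")  # pattern facet from the slide


def auctionprice(lexical: str) -> Decimal:
    """Range facet: a decimal with minInclusive = 10."""
    result = Decimal(lexical)
    if result < Decimal("10"):
        raise ValueError(f"{lexical} is below minInclusive 10")
    return result


def booktype(lexical: str) -> str:
    """Enumeration facet: the value must be one of the listed strings."""
    if lexical not in BOOK_TYPES:
        raise ValueError(f"{lexical!r} is not an enumerated value")
    return lexical


def isbn(lexical: str) -> str:
    """Pattern facet: the whole string must match the regular expression."""
    if ISBN_PATTERN.fullmatch(lexical) is None:
        raise ValueError(f"{lexical!r} does not match the pattern")
    return lexical


def auctions(lexical: str) -> list:
    """List type: whitespace-separated items, each an auctionprice."""
    return [auctionprice(item) for item in lexical.split()]


print(auctions("10 12.50 99"))
print(isbn("ISBN 0201537710"))
```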

  39. Describing Values: Conclusion • Not addressed in research • XML Schema Part 2: Datatypes does a good job > Quite complete > Deals with complex requirements (e.g., internationalization) • Defines values but not operations! > Needed by XPath, XQuery…

  40. Describing XML structures • element names > with the names themselves: book, title, etc. > possibly with wildcards: ~ = any tag, !a = not a, etc. • element children > using regular expressions • element attributes > unordered attribute-value pairs • Main question: types vs. element names > does the element name determine the type? > tag-coupled types vs. tag-decoupled types

  41. Coupled types • Approach taken by DTDs > two elements with the same name always have the same type > children = regular expression over elements • Properties > easy to parse: no depth look-ahead > no closure under union, no local names allowed > cannot express relational or object-oriented schemas • Example: <!ELEMENT book (title, author+, price, publisher, section, conclusion?)> <!ELEMENT title (#PCDATA)> .... <!ELEMENT author (name, affiliation)> <!ELEMENT name (first, last)> <!ELEMENT first (#PCDATA)> .... <!ELEMENT publisher (name, address)> ...

  42. Decoupled types • Approach taken by YAT, XDuce, lotos, etc. > types are decoupled from element names > children are defined by regular expressions over types > different types can have the same tag • Properties > equivalent to regular tree grammars > closure under intersection, complement, union... > more precise type for documents and queries > harder to parse (might require look-ahead and backtracking) type Book = book [ Title, Author+, Price, Publisher, Section, Conclusion? ] type Title = title [ String ] type Author = author [ Name, Affiliation ] type Name = name [ first [ String ], last [ String ] ] ... type Publisher = publisher [ PName, Address ] type PName = name [ String ]

  43. Decoupled types cont’d • They are simple to define > basic entities: datatypes, tags, type names > one construct: types
      schema ::= type type_name = type ...
      type   ::= String | Boolean | ...   (* datatypes *)
               | type_name                (* type name *)
               | tag [ type ]             (* element *)
               | ~ [ type ]               (* element with wildcard *)
               | type, type               (* sequence *)
               | type | type              (* union *)
               | type*                    (* Kleene star *)
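
One way to get a feel for this grammar is to encode types as data and write a small matcher for them. The following is a naive backtracking sketch, not the algorithm of YAT or XDuce: the tuple encoding, the helper names and the simplified schema (Affiliation and Address from the earlier example are left out) are all assumptions, and type names that refer to themselves without an enclosing element would make it loop.

```python
# Trees: a text node is a str, an element is a (tag, [children]) tuple.
# Types: tuples mirroring the grammar's seven alternatives.
STRING = ("string",)


def name(type_name): return ("name", type_name)
def elem(tag, content): return ("elem", tag, content)  # tag None = wildcard ~
def seq(left, right): return ("seq", left, right)
def alt(left, right): return ("alt", left, right)
def star(inner): return ("star", inner)


def ends(t, items, i, schema):
    """Yield every index j such that items[i:j] matches type t."""
    kind = t[0]
    if kind == "string":
        if i < len(items) and isinstance(items[i], str):
            yield i + 1
    elif kind == "name":
        yield from ends(schema[t[1]], items, i, schema)
    elif kind == "elem":
        _, tag, content = t
        if i < len(items) and isinstance(items[i], tuple):
            node_tag, children = items[i]
            # The element matches if its children match the content type entirely.
            if tag in (None, node_tag) and len(children) in ends(content, children, 0, schema):
                yield i + 1
    elif kind == "seq":
        for j in ends(t[1], items, i, schema):
            yield from ends(t[2], items, j, schema)
    elif kind == "alt":
        yield from ends(t[1], items, i, schema)
        yield from ends(t[2], items, i, schema)
    elif kind == "star":
        seen, frontier = {i}, [i]
        yield i  # zero repetitions
        while frontier:
            j = frontier.pop()
            for k in ends(t[1], items, j, schema):
                if k not in seen:
                    seen.add(k)
                    frontier.append(k)
                    yield k


def has_type(node, type_name, schema):
    return 1 in ends(schema[type_name], [node], 0, schema)


# Two different types share the tag "name": this is what decoupling allows.
SCHEMA = {
    "Name": elem("name", seq(elem("first", STRING), elem("last", STRING))),
    "PName": elem("name", STRING),
    "Author": elem("author", name("Name")),
    "Publisher": elem("publisher", name("PName")),
    "Authors": star(name("Author")),
}

author = ("author", [("name", [("first", ["Leo"]), ("last", ["Tolstoi"])])])
publisher = ("publisher", [("name", ["Addison Wesley"])])

print(has_type(author, "Author", SCHEMA))
print(has_type(publisher, "Publisher", SCHEMA))
print(has_type(author, "Publisher", SCHEMA))
```

The matcher has to look inside an element before it knows which type applies, which is the look-ahead and backtracking cost mentioned for decoupled types.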

  44. Decoupled types cont’d • They can easily describe mixed content • They can easily describe all well-formed documents • They support a notion of subtyping via inclusion > all documents of type Body2 are also of type Body and UrTree • But they can be ambiguous > deciding between Body and Body2 can be expensive • Example: type Section = section [ title [ String ], Body ] type Body = content [ (b [ Body ] | footnote [ String ] | Section | String)* ] type UrScalar = (String | Boolean | Float | Double ...) type UrTree = UrScalar | ~[ UrTree* ] type Body2 = content [ String, (b [ String ] | footnote [ String ] | String)*, Section* ] Body2 <: Body <: UrTree type Section2 = section [ title [ String ], Body2*, Body* ]

  45. Decoupled types & full XML • How do you describe attributes? > e.g., type Book = book [ @isbn [ String ], Title, Author+, Price, Publisher, Section, Conclusion? ] > but attributes are unordered, without duplicates > they do not interact with the children of the element > they cannot contain complex values • How do you describe references? > Like in object schemas [Cluet et al 1998]: type Author = author [ name [ first [ String ], last [ String ] ] ] type Publisher = publisher [ name [ String ] ] type Book = book [ title [ String ], &Author+, &Publisher ] > but it’s even harder to parse because of cycles [Beeri, Milo 1999] • How do you deal with XML specifics? > entities, processing instructions, namespaces, serialization, etc.

  46. What about XML Schema? • Tries to get the expressive power of decoupled types + the ease of parsing of coupled types • Advanced features: “subtyping”, constraints... • Deals with all the specifics of XML • XML Schema syntax is in XML • Results in a pretty complex specification • Example: <xsd:element name="book"> <xsd:complexType> <xsd:element name="title" type="xsd:string"/> <xsd:element name="author" maxOccurs="unbounded"> <xsd:complexType> <xsd:element name="first" type="xsd:string"/> <xsd:element name="last" type="xsd:string"/> </xsd:complexType> </xsd:element> ……… </xsd:complexType> </xsd:element>

  47. Element & attribute declarations • Element declarations ~ associate element names to types > have a name, and their content is described by a type • Attribute declarations ~ associate attribute names to types > have a name and contain an atomic value > can be required or optional > can only appear inside elements (through complex types) • Examples (XML Schema, then the corresponding type notation): <xsd:element name="title" type="xsd:string"/> ~ title [ String ]; <xsd:element name="affiliation" type="publisher"/> ~ affiliation [ Publisher ]; <xsd:attribute name="price"/> ~ @price [ String ]?; <xsd:attribute name="auctionhistory" type="auctions" use="required"/> ~ @auctionhistory [ Auctions ], with type Auctions = Decimal*

  48. Model groups • Define content models (i.e., the type of the children of an element) ~ equivalent to regular expressions over elements • Sequence: <xsd:sequence> <xsd:element name="title" type="Title"/> <xsd:element name="price" type="Price"/> </xsd:sequence> ~ title[Title], price[Price] • Choice: <xsd:choice> <xsd:element name="publisher" type="Publisher"/> <xsd:element name="editor" type="Author"/> </xsd:choice> ~ ( publisher[Publisher] | editor[Author] ) • Repetition: <xsd:sequence minOccurs="0" maxOccurs="unbounded"> <xsd:element name="book" type="Book"/> </xsd:sequence> ~ book[Book]* • All: <xsd:all> <xsd:element name="title" type="Title"/> <xsd:element name="price" type="Price"/> </xsd:all> ~ (title[Title], price[Price]) | (price[Price], title[Title])
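
Since model groups are equivalent to regular expressions over elements, they can be prototyped by compiling them to ordinary regular expressions over the sequence of child element names. This is only a sketch under assumptions: the helper names are invented, element types are ignored (only names are checked), and `all_of` expands into every permutation, which is fine for two or three elements but grows factorially.

```python
import re
from itertools import permutations


def element(name: str) -> str:
    """One child element, encoded as the token <name>."""
    return f"<{re.escape(name)}>"


def sequence(*parts: str) -> str:
    return "".join(parts)


def choice(*parts: str) -> str:
    return "(?:" + "|".join(parts) + ")"


def repeat(part: str, min_occurs: int = 0, max_occurs=None) -> str:
    """minOccurs/maxOccurs; max_occurs=None stands for 'unbounded'."""
    upper = "" if max_occurs is None else str(max_occurs)
    return f"(?:{part}){{{min_occurs},{upper}}}"


def all_of(*parts: str) -> str:
    """xsd:all: every part exactly once, in any order."""
    return choice(*(sequence(*order) for order in permutations(parts)))


def accepts(model: str, child_names) -> bool:
    """Check a list of child element names against a compiled content model."""
    encoded = "".join(f"<{name}>" for name in child_names)
    return re.fullmatch(model, encoded) is not None


title_then_price = sequence(element("title"), element("price"))
publisher_or_editor = choice(element("publisher"), element("editor"))
books = repeat(element("book"))
title_and_price = all_of(element("title"), element("price"))

print(accepts(title_then_price, ["title", "price"]))
print(accepts(title_then_price, ["price", "title"]))
print(accepts(title_and_price, ["price", "title"]))
```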

  49. Complex type definitions > they contain a content model and attribute declarations > they can be empty > they can be recursive > they can be mixed (i.e., strings + sub-elements) • Examples (XML Schema, then the corresponding type notation): <xsd:complexType name="Book"> <xsd:sequence> <xsd:element name="title" type="xsd:string"/> <xsd:element name="author" maxOccurs="unbounded" type="AuthorName"/> </xsd:sequence> <xsd:attribute name="isbn" type="xsd:string"/> </xsd:complexType> ~ type Book = @isbn [ String ], title [ String ], author [ Name ]+; <xsd:complexType name="RefBib" content="empty"> <xsd:attribute name="refto" type="xsd:IDREF"/> </xsd:complexType> ~ type RefBib = @refto [ &UrTree ]; <xsd:complexType name="Body" content="mixed"> <xsd:element name="b" type="Body" minOccurs="0" maxOccurs="unbounded"/> </xsd:complexType> ~ type Body = (b[Body] | String)*

  50. Some feature interactions • Local element restrictions > local elements with the same name can have different types: <xsd:element name="author"> <xsd:complexType> <xsd:element name="name" type="AuthorName"/> </xsd:complexType> </xsd:element> ~ type Author = author [ name [ AuthorName ] ], and <xsd:element name="publisher"> <xsd:complexType> <xsd:element name="name" type="xsd:string"/> ... </xsd:complexType> </xsd:element> ~ type Publisher = publisher [ name [ String ] ] > but they must have the same type among siblings, which rules out: <xsd:complexType name="Names"> <xsd:element name="name" type="AuthorName"/> <xsd:element name="name" type="xsd:string" minOccurs="0"/> </xsd:complexType> ~ type Names = name [ AuthorName ], name [ String ]? • To be simple or not to be simple... > <internationalPrice currency='EU'>423.46</internationalPrice> requires a complexType defined by extension over decimals
