Universal Database Systems Part 4: Databases and XML
Overview
• Introduction to XML
• DTDs and Schemas for XML Documents
• Languages for XML, in particular XSL
• Querying and Storing XML
• Summary and Outlook
Running Example: e-shopper's_heaven.com
UDBS Part 4 - Winter 2001/2
Some Motivation
• Data exchange between two partners
• Data integration from different sources
• Working with the Web
• E-Commerce Scenario
Data Exchange
[Figure: data moves from a Source to a Destination; labels in the figure: Protocol, Transformation, Formats, Data]
Data Integration
[Figure: Source 1, Source 2, and Source 3 feed a single Destination; labels in the figure: Protocol, Transformation, Formats, Data]
Working with the Web
[Figure: Server 1, Server 2, and Server 3; labels in the figure: HTTP, Search Engine, HTML]
E-Commerce Scenario
[Figure: Customer, Payment, Supplier, and EC Portal]
Problems
• Heterogeneous data formats
• Varying data quality (missing values, varying level of detail)
• Missing distinction between contents and formatting ("markup")
• Deriving individual data collections is difficult
The Web Today and Tomorrow
• HTML documents, all meant for human (not machine) consumption
• More and more documents are automatically generated by computers or applications
• Applications must be able to communicate directly
• Companies need interoperability at an increasing pace
• Data exchange must work across platform and company boundaries
Running Example: e-shopper's_heaven.com
• Internet-based store for books, movies, and music
• Merchandise comes in the mail
• Publishers, music producers, etc. are supposed to move their data directly into the heaven database
• Web presentations are generated from that database for a variety of target platforms
• Users can browse/search the database
State of the Art: HTML
• HyperText Markup Language (T. Berners-Lee 1990)
• Basis for most Web pages
• Document properties come from markups or tags, e.g.,
  • point size, character set
  • text structure (title, paragraphs, etc.)
  • hyperlinks
But ...
• HTML does not separate markup from contents
• Misuse of tags (e.g., <h1> for "bold face" instead of "header 1")
• Web users want voice, E-Commerce, WAP, EDI; HTML is difficult to extend
• No modularity
• Weak internationalization
Overview
• Introduction to XML
• DTDs and Schemas for XML Documents
• Languages for XML, in particular XSL
• Querying and Storing XML
• Summary and Outlook
In Detail
• Fundamental language elements, mostly by way of examples: tags, elements, attributes
• Well-formedness
• Tree structure vs. serialization
• IDs and referencing
Extensible Markup Language (XML)
• A W3C "Recommendation" (W3C: MIT, INRIA, Keio)
• Meta language for the creation and formatting of document markups and for the specification of (other) languages
What Does XML Do?
• Documents are hierarchically decomposed into parts ("elements")
• The parts are named
• Names and contents are (Unicode) text
• Rules can describe how parts fit together
Analogy to Relational Databases
Example
[Callouts in the figure: tag, element, text]
<book>
  <author> Serge Abiteboul </author>
  <author> Rick Hull </author>
  <author> Victor Vianu </author>
  <title> Foundations of Databases </title>
  <year> 1995 </year>
</book>
XML Elements
• Elements are corresponding pairs of begin and end tags, together with the text they enclose
• Tags determine structure, text determines contents
• Elements may be empty, e.g., <red></red>, abbreviated as <red/>
• Elements may be nested to form a tree, i.e., there is a single root element, and all tag pairs observe a strict nesting (no overlaps!)
XML Documents
• An XML document consists of elements and forms an unranked, ordered tree (a node may have any number of children)
• The ordering of elements in a document is significant
• Well-formed document: loosely speaking, "matching tags" (one closing tag per opening tag at the same level of nesting)
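The well-formedness rule can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the three sample strings are assumptions made for illustration and are not part of the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags with a single root element.
good = "<book><title>Foundations of Databases</title><year>1995</year></book>"
# Overlapping tags: </book> closes before </title>.
overlap = "<book><title>Foundations of Databases</book></title>"
# Two root elements: a document must have exactly one.
two_roots = "<book/><book/>"

results = {
    "good": is_well_formed(good),
    "overlap": is_well_formed(overlap),
    "two_roots": is_well_formed(two_roots),
}
print(results)
```

Note that this only checks well-formedness. Checking a document against a DTD (validity) is a separate step that this module does not perform.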
Non-equivalent Documents
<book>
  <title> Foundations… </title>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <publisher> Addison Wesley </publisher>
  <year> 1995 </year>
</book>

<book>
  <author> Abiteboul </author>
  <author> Hull </author>
  <author> Vianu </author>
  <title> Foundations… </title>
  <publisher> Addison Wesley </publisher>
  <year> 1995 </year>
</book>
The corresponding trees are different!
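That element order is part of the document can be observed directly in a parsed tree. This is a small sketch with Python's xml.etree.ElementTree; the two shortened documents and the helper child_tags are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Two documents with the same elements in a different order (shortened samples).
doc_a = "<book><title>Foundations</title><author>Abiteboul</author><year>1995</year></book>"
doc_b = "<book><author>Abiteboul</author><title>Foundations</title><year>1995</year></book>"


def child_tags(text):
    """Return the tags of the root's children in document order."""
    return [child.tag for child in ET.fromstring(text)]


tags_a = child_tags(doc_a)
tags_b = child_tags(doc_b)

same_order = tags_a == tags_b                # compares the ordered sequences
same_set = sorted(tags_a) == sorted(tags_b)  # ignores the order
print(tags_a, tags_b, same_order, same_set)
```

The two trees contain the same elements, yet they differ as ordered trees, which is exactly the point of the slide.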
e-shopper's Example (1)
<CATALOG>
  <BOOKCATALOG>
    <BOOK category="technical" language="en">
      <ISBN>0070310866</ISBN>
      <AUTHOR>
        <PERSON><FIRSTNAME>Abraham</FIRSTNAME><LASTNAME>Silberschatz</LASTNAME></PERSON>
        <PERSON><LASTNAME>Korth</LASTNAME><FIRSTNAME>Henry F.</FIRSTNAME></PERSON>
      </AUTHOR>
      <TITLE>Database System Concepts</TITLE>
      <PUBLISHER>McGraw Hill</PUBLISHER>
      <LOCATION/>
      <EDITION>3</EDITION>
      <YEAR>1998</YEAR>
    </BOOK>
e-shopper's Example (2)
    <BOOK category="fiction" language="de">
      <ISBN>342333052X</ISBN>
      <AUTHOR>
        <PERSON><LASTNAME>Singh</LASTNAME><FIRSTNAME>Simon</FIRSTNAME></PERSON>
      </AUTHOR>
      <TITLE>Fermat's Last Theorem</TITLE>
      <PUBLISHER>DTV</PUBLISHER>
      <LOCATION>Munich</LOCATION>
      <EDITION>1</EDITION>
      <YEAR>2000</YEAR>
    </BOOK>
  </BOOKCATALOG>
XML document: a tree of elements containing character data
Tree View
[Figure: tree rooted at bookcatalog with two book subtrees; node labels visible in the figure include ISBN, author, title, year, edition, publisher, and person]
Alternative View: Serialization
<BOOK category="technical" language="en"><ISBN>0070310866</ISBN><AUTHOR><PERSON><FIRSTNAME>Abraham</FIRSTNAME><LASTNAME>Silberschatz</LASTNAME></PERSON><PERSON><LASTNAME>Korth</LASTNAME><FIRSTNAME>Henry F.</FIRSTNAME></PERSON></AUTHOR><TITLE>Database System Concepts</TITLE><PUBLISHER>McGraw Hill</PUBLISHER><LOCATION/><EDITION>3</EDITION><YEAR>1998</YEAR></BOOK>
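The tree view and the serialized view describe the same document. As a rough sketch, the following Python code builds a small tree in memory, serializes it to a string, and parses that string back; it uses xml.etree.ElementTree, and the abbreviated BOOK element is an assumption based on the running example.

```python
import xml.etree.ElementTree as ET

# Build a small tree in memory (abbreviated version of the BOOK example).
book = ET.Element("BOOK", {"category": "technical", "language": "en"})
ET.SubElement(book, "ISBN").text = "0070310866"
ET.SubElement(book, "TITLE").text = "Database System Concepts"
ET.SubElement(book, "LOCATION")  # an empty element

# Serialization: the tree written out as one string of tags and text.
serialized = ET.tostring(book, encoding="unicode")
print(serialized)

# Parsing the string yields a tree with the same structure again.
reparsed = ET.fromstring(serialized)
round_trip_tags = [child.tag for child in reparsed]
```

The exact spelling of the output (for instance how the empty element is abbreviated) is up to the serializer; the tree it denotes stays the same.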
Attributes
• Elements can have attributes (with name and value)
• Attribute ordering is immaterial
• Attributes are data modeling alternatives, i.e., information can be represented as elements or via attributes:
  <address> <street>N. Olive St.</street> <city>Dallas</city> </address>
  vs.
  <address street="N. Olive St." city="Dallas"/>
• Attribute values must appear in single or double quotes (' or ") and are user-defined
• Each attribute occurs in a tag at most once
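The two modeling alternatives and the attribute rules can be compared in code. This is a minimal sketch with Python's xml.etree.ElementTree; the helper names and the duplicate-attribute sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

as_elements = ET.fromstring(
    "<address><street>N. Olive St.</street><city>Dallas</city></address>"
)
as_attributes = ET.fromstring('<address street="N. Olive St." city="Dallas"/>')
reordered = ET.fromstring('<address city="Dallas" street="N. Olive St."/>')


def address_from_elements(node):
    """Collect child elements as name/value pairs."""
    return {child.tag: child.text for child in node}


def address_from_attributes(node):
    """Collect attributes as name/value pairs."""
    return dict(node.attrib)


def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


same_information = address_from_elements(as_elements) == address_from_attributes(as_attributes)
order_irrelevant = as_attributes.attrib == reordered.attrib
duplicate_accepted = parses('<address city="Dallas" city="Austin"/>')
print(same_information, order_irrelevant, duplicate_accepted)
```

Both variants carry the same name/value pairs here. The element variant additionally preserves order and allows nesting, which attributes cannot express.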
e-shopper's Example (3)
  <MOVIECATALOG>
    <VIDEO language="de">
      <TITLE>The Sixth Sense</TITLE>
      <DIRECTOR>
        <PERSON><LASTNAME>Shyamalan</LASTNAME><FIRSTNAME>M. Night</FIRSTNAME></PERSON>
      </DIRECTOR>
      <CAST>
        <ACTOR>
          <PERSON><LASTNAME>Willis</LASTNAME><FIRSTNAME>Bruce</FIRSTNAME></PERSON>
        </ACTOR>
        <ACTOR>
          <PERSON><LASTNAME>Osment</LASTNAME><FIRSTNAME>Haley Joel</FIRSTNAME></PERSON>
        </ACTOR>
      </CAST>
      <RUNTIME>103</RUNTIME>
      <YEAR>2000</YEAR>
    </VIDEO>
e-shopper's Example (4)
    <DVD RegionCode="2">
      <TITLE>Matrix</TITLE>
      <DIRECTOR>
        <PERSON><LASTNAME>Wachowski</LASTNAME><FIRSTNAME>Andy</FIRSTNAME></PERSON>
        <PERSON><LASTNAME>Wachowski</LASTNAME><FIRSTNAME>Larry</FIRSTNAME></PERSON>
      </DIRECTOR>
      <CAST>
        <ACTOR>
          <PERSON><LASTNAME>Reeves</LASTNAME><FIRSTNAME>Keanu</FIRSTNAME></PERSON>
        </ACTOR>
        <ACTOR>
          <PERSON><LASTNAME>Fishburne</LASTNAME><FIRSTNAME>Laurence</FIRSTNAME></PERSON>
        </ACTOR>
        <ACTRESS>
          <PERSON><LASTNAME>Moss</LASTNAME><FIRSTNAME>Carrie-Anne</FIRSTNAME></PERSON>
        </ACTRESS>
      </CAST>
      <RUNTIME>131</RUNTIME>
      <YEAR>1999</YEAR>
      <SOUND>
        <LANGUAGE>de</LANGUAGE>
        <SOUNDMIX>AC3/Dolby Digital 5.1</SOUNDMIX>
      </SOUND>
      <ANNOTATION>includes: Making-of, Comments of Director, Actor Biographies</ANNOTATION>
    </DVD>
  </MOVIECATALOG>
Tree View (cont'd)
[Figure: tree rooted at moviecatalog with video and dvd subtrees; node labels visible in the figure include title, director, cast, actor, actress, runtime, year, and sound]
Overall Tree View
[Figure: tree rooted at catalog with the children bookcatalog, moviecatalog, and musiccatalog; further node labels visible in the figure include book, video, musicitem, ISBN, title, author, director, performer, cast, year, person, runtime, and track]
Attribute Types
• String type
  • CDATA: character data, any Unicode character
• Tokenized type
  • ID: unique element identifier
  • IDREF: the value of a unique ID attribute
  • IDREFS: multiple IDREFs of an element
  • ENTITY/ENTITIES: the name of an entity or a list of entity names
  • NMTOKEN/NMTOKENS: a (list of) name token(s)
• Enumerated type
  • NOTATION: the name of a notation that allows for a specific interpretation of the value
  • ENUMERATION: a list of possible values
IDs and IDREFs
• An ID (identifier) attribute with a unique value can be associated with an element
• This element can then be referenced from somewhere else using an IDREF attribute
• Note: Both IDs and references are just syntax in XML!
Example
<person id="o555">
  <name> Jane </name>
</person>
<person id="o456">
  <name> Mary </name>
  <children idrefs="o123 o555"/>
</person>
<person id="o123" idref="o456">
  <name> John </name>
</person>
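Because IDs and references are just syntax, an application has to resolve them itself (or rely on a DTD-aware tool). The following sketch does this by hand with Python's xml.etree.ElementTree; the wrapping people element and the variable names are assumptions added so that the three person elements form one document.

```python
import xml.etree.ElementTree as ET

# The <people> wrapper is hypothetical; a document needs a single root element.
doc = ET.fromstring("""
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idrefs="o123 o555"/></person>
  <person id="o123" idref="o456"><name>John</name></person>
</people>
""")

# Index every person by the value of its id attribute.
by_id = {person.get("id"): person for person in doc.iter("person")}

# Resolve Mary's whitespace-separated list of references.
mary = by_id["o456"]
child_refs = mary.find("children").get("idrefs").split()
children_of_mary = [by_id[ref].findtext("name") for ref in child_refs]

# Resolve John's single reference.
john = by_id["o123"]
referenced_by_john = by_id[john.get("idref")].findtext("name")

print(children_of_mary, referenced_by_john)
```

Nothing in the parser enforces that the referenced IDs exist or are unique; a lookup of a dangling reference would simply fail in the dictionary.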
How to Edit XML Documents
• A simple text editor is enough
• Better are specific editors with (at least) well-formedness tests
• Freeware and commercial tools include
  • GNU Emacs
  • Icon XML Spy
  • MS XML Notepad
  • ezDTD
  • XML Styler
  • Arbortext Epic
  • SoftQuad XMetaL
Other Stuff
• Comments
  <!-- this is a comment -->
• Optional Document Header
  <?xml version="1.0" encoding="UTF-8"?>
  <!-- edited by Gottfried Vossen -->
  <!DOCTYPE CATALOG SYSTEM "catalog.dtd">
• Namespaces: "identify your vocabulary"
We'll get to this shortly!
Namespaces
• Serve to avoid name clashes when documents are composed from parts that originate from different sources
• Map names to URIs (Uniform Resource Identifiers) which identify a particular namespace
• A combination of local name + namespace URI yields a unique name
Sample Name Clash
• Document 1: <book> Euro <price> 25.99 </price></book>
• Document 2: <book><price currency="Euro"> 25.99 </price></book>
• Document 3: <book><price currency="Euro" amount="25.99"/></book>
XML Namespaces
• name ::= [prefix:]localpart
• syntactically: <number>, <isbn:number>
• semantically: provide URL for schema
  <tag xmlns:mystyle="http://…">    [callout in the figure: defined here]
    …
    <mystyle:title> … </mystyle:title>
    <mystyle:number> …
  </tag>
Namespaces can be Mixed
<h:html xmlns:a="http://www.article.com/article"
        xmlns:h="http://www.w3.org/TR/REC-html40">
  <h:head><h:title>Articles about XML</h:title></h:head>
  <h:body>
    <a:article>
      <a:title h:style="font-family:arial;">Namespaces</a:title>
      <h:table>
        <h:tr align="left">
          <h:td><a:author>Fritz Schnapp</a:author></h:td>
          <h:td><a:journal>XML-News</a:journal></h:td>
          <h:td><a:pages>11</a:pages></h:td>
          . . . . . . .
      </h:table>
    </a:article>
  </h:body>
</h:html>
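A namespace-aware parser identifies each element by namespace URI plus local name, not by the prefix that happens to be written in the document. The sketch below shows this with Python's xml.etree.ElementTree on a shortened version of the mixed document; the prefixes art and web in the lookup table are deliberately different from the document's a and h and are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<h:html xmlns:a="http://www.article.com/article"
        xmlns:h="http://www.w3.org/TR/REC-html40">
  <h:head><h:title>Articles about XML</h:title></h:head>
  <h:body>
    <a:article><a:title>Namespaces</a:title></a:article>
  </h:body>
</h:html>
""")

# Prefixes are local shorthand; only the URIs have to match the document.
ns = {
    "art": "http://www.article.com/article",
    "web": "http://www.w3.org/TR/REC-html40",
}

page_title = doc.findtext("web:head/web:title", namespaces=ns)
article_title = doc.findtext(".//art:title", namespaces=ns)

# Internally the tag is stored as {namespace URI}localname.
expanded_root_tag = doc.tag
print(page_title, article_title, expanded_root_tag)
```

Both elements have the local name title, yet they do not clash because they belong to different namespaces.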
Linking
• Recall from HTML: URLs exclusively point to documents; links are always uni-directional; external link definitions are not allowed
• In XML: a document can be linked internally or externally
• Technical tools:
  • Attribute types such as ID and IDREF
  • XPointer, XLink, XPath
Linking and Addressing
• XPath (XML Path Language): language for addressing parts of an XML document via paths
• XPointer (XML Pointer Language): language using XPath for addressing into the internal structures of an XML document
• XLink (XML Linking Language): constructs for describing links between XML objects as well as resources
Location Paths in XPath
• General form: document(url)/step/step/.../step
• Location steps have the form axis::nodetest[filter]*
• Each step comprises an axis, a node test, and zero or more predicates
• Example: child::AUTHOR[position()<3]/attribute::id
  [Callouts in the figure: 1st step, 2nd step, Axes, Node test, Predicate]
Path Syntax
• Valid axis values include child, attribute, parent, and following-sibling
• Node test: can be a tag, an attribute name, or *; also allowed are functions like text() or comment()
• More details in the context of XSLT later on
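Location paths can be tried out with Python's xml.etree.ElementTree, which implements only a subset of XPath in abbreviated syntax: there are no explicit axis names, no attribute steps, and no position() comparisons. The sketch below therefore emulates the example child::AUTHOR[position()<3]/attribute::id with a list slice. The sample document, including the id attributes, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: three AUTHOR children carrying id attributes.
book = ET.fromstring("""
<BOOK category="technical" language="en">
  <AUTHOR id="a1">First</AUTHOR>
  <AUTHOR id="a2">Second</AUTHOR>
  <AUTHOR id="a3">Third</AUTHOR>
  <TITLE>Sample Title</TITLE>
</BOOK>
""")

# child::AUTHOR is written simply as "AUTHOR" in the abbreviated syntax.
authors = book.findall("AUTHOR")

# [position()<3] is not supported here, so keep the first two matches by slicing;
# attribute::id is emulated by reading the attribute from each selected element.
first_two_ids = [author.get("id") for author in authors[:2]]

# A plain numeric predicate is supported: AUTHOR[2] is the second AUTHOR child.
second_id = book.find("AUTHOR[2]").get("id")

# "*" as node test matches every child element regardless of its tag.
all_child_tags = [child.tag for child in book.findall("*")]

print(first_two_ids, second_id, all_child_tags)
```

A full XPath implementation would evaluate the original expression directly; the slice is only a stand-in for the predicate.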
Fragment Identifiers in XPointer
• XPointer uses XPath for defining fragment identifiers
• Such an identifier can be: the value of an ID-type attribute, a sequence of numbers, a sequence of XPointer expressions
• Example: http://www.myserver.net/#xpointer(//BOOK/AUTHOR[position()=1])
XLink
• For linking documents
• Links can be simple or extended
• XLink uses its own namespace
• Extended links connect more than 2 documents (and can determine via "arcs" how these links are traversed)
Example of a Simple Link
<AUTHOR xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple"
        xlink:href="http://cs-faculty.stanford.edu/~knuth/"
        xlink:role="don~knuth_homepage"
        xlink:show="embed"
        xlink:actuate="onLoad">
  Donald Knuth
</AUTHOR>
• xlink:show: What should happen with the document during loading?
• xlink:actuate: When should the action specified by "show" occur?
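XLink attributes are ordinary attributes in the XLink namespace, so any namespace-aware parser can read them; acting on them is left to the application. The following sketch extracts them with Python's xml.etree.ElementTree. The constant XLINK and the dictionary link are illustrative names, and the xlink:role attribute is omitted from this shortened sample.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

author = ET.fromstring(
    '<AUTHOR xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:type="simple" '
    'xlink:href="http://cs-faculty.stanford.edu/~knuth/" '
    'xlink:show="embed" '
    'xlink:actuate="onLoad">Donald Knuth</AUTHOR>'
)

# Namespaced attributes are looked up under {namespace URI}localname.
link = {
    name: author.get(f"{{{XLINK}}}{name}")
    for name in ("type", "href", "show", "actuate")
}

is_simple_link = link["type"] == "simple"
print(link, is_simple_link)
```

The parser does not follow the link or embed anything; it only exposes the values that a link-aware application would interpret.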
The XML Communication Problem
• How do I share structure and metadata with my community?
• How do I learn and use the element structure of a document?
• How to make all this automatable?
Overview
• Introduction to XML
• DTDs and Schemas for XML Documents
• Languages for XML, in particular XSL
• Querying and Storing XML
• Summary and Outlook
e-shopper's_heaven.com
[Figure: Publ. A and Publ. B connected to e-shopper's heaven]
• How can e-shopper's motivate a publisher to supply its information in a uniform format to the heaven database?
Document Type Definition (DTD)
• Syntax rules for one or more documents stating
  • what tags are allowed,
  • where these may occur,
  • how they fit together,
  • which attributes may be used.
• An XML document is valid if it is well-formed and follows the rules of "its" DTD