CS331: Advanced Database Systems: Semistructured Data Management Norman Paton University of Manchester npaton@manchester.ac.uk
Semistructured Data • Two views of data: • Databases: structured, modelled, queried, programmed. • Documents: partially structured, authored, read, navigated. • Semistructured data management is at the confluence of these two views. • XML is the principal data representation notation for semistructured data. • XML can be seen as: • An extensible markup language for documents. • A data model for hierarchical data. • A notation for communicating data with its structure. • See also: COMP30352 – IR, Hypermedia and the Web
XML Language Space • XML (Extensible Markup Language) is just that: a markup language with an extensible collection of tags. • XML is associated with many related standards within the W3C (World Wide Web Consortium): http://www.w3.org/. • XML related standards: • XPath: navigation. • XQuery: queries. • XSLT: transformations. • XML Schema: document description. • DOM: modelling documents as objects. • ... • ... and underpins: • Web Services. • The Semantic Web.
Markup • Markup is the inclusion of symbols with special meaning in a text document. • Languages with markup: • LaTeX. • HTML. • RTF. • XML.
An HTML fragment:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head> <title>The Silver Pigs</title> </head>
  <body bgcolor="#ffffff" text="#000000">
    <h1>Rome</h1>
A LaTeX fragment:
\documentclass{llncs}
\begin{document}
\title{An Experimental Performance Evaluation of Join Algorithms for Parallel Object Databases}
\author{Sandra Sampaio\inst{1} \and Jim Smith\inst{2}\and Norman W. Paton\inst{1}\and Paul Watson\inst{2}}
...
XML Markup • In XML, content is structured using tags. • Tags are delimited by the characters ‘<’ and ‘>’. • Tags often come in pairs around some content, as start and end tags. • In the example, <html> is a start tag and </html> is the matching end tag.
<html>
  <head> <title>The Silver Pigs</title> </head>
  <body bgcolor="#ffffff" text="#000000">
    <h1>Rome</h1>
    ...
  </body>
</html>
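The pairing of start and end tags can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the two sample strings are illustrative and not part of the original slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Every start tag has a matching end tag, and the pairs nest properly.
good = ("<html><head><title>The Silver Pigs</title></head>"
        "<body><h1>Rome</h1></body></html>")

# The <title> element is closed after its parent <head>, so the tags overlap.
bad = "<html><head><title>The Silver Pigs</head></title></html>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A parser that accepts a document has only established that it is well formed; it says nothing about whether the tags used are the ones an application expects.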
Elements • An element is a meaningful unit of content enclosed by tags. • An application may be able to interpret an element. • Elements may be ordered or nested. • Context matters, especially given nesting. <station> <name>Oxford Road</name> <city>Manchester</city> </station>
Element Hierarchies • An XML document essentially represents hierarchical data. • The elements in a well-formed XML document will match and respect the hierarchy.
<country>
  <name>United Kingdom</name>
  <people>
    <pop>60094648 </pop>
    <maleLE>75.74</maleLE>
    <femaleLE>80.7</femaleLE>
  </people>
</country>
(Figure: tree with root country; children name and people; people has children pop, maleLE and femaleLE.)
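The hierarchy implied by the nesting can be recovered by walking the parsed document. This is a small sketch using Python's standard xml.etree.ElementTree; the function name outline is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

COUNTRY_XML = """
<country>
  <name>United Kingdom</name>
  <people>
    <pop>60094648</pop>
    <maleLE>75.74</maleLE>
    <femaleLE>80.7</femaleLE>
  </people>
</country>
"""


def outline(element, depth=0):
    """Return element names in document order, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an Element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(COUNTRY_XML))
print("\n".join(tree_lines))
```

The indentation of each printed name mirrors the depth of the corresponding element in the tree.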
Attributes • Attributes provide auxiliary information about elements. • Attributes are embedded within start tags, and have the form name = "value".
<station updatedBy="Fred Bloggs" validUntil="22/06/2005">
  <name>Oxford Road</name>
  <city>Manchester</city>
</station>
<country source="http://www.cia.gov">
  <name>United Kingdom</name>
  <people>
    <pop>60094648 </pop>
    <maleLE>75.74</maleLE>
    <femaleLE>80.7</femaleLE>
  </people>
</country>
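Applications usually read attributes separately from element content. The sketch below assumes Python's standard xml.etree.ElementTree, which exposes attributes as a dictionary on each element; the variable names and the default value "unknown" are illustrative only.

```python
import xml.etree.ElementTree as ET

STATION_XML = """
<station updatedBy="Fred Bloggs" validUntil="22/06/2005">
  <name>Oxford Road</name>
  <city>Manchester</city>
</station>
"""

station = ET.fromstring(STATION_XML)

# Attributes live on the start tag and are exposed as a name -> value mapping.
attributes = dict(station.attrib)

# Sub-element content is reached by navigating to the child element.
name = station.findtext("name")

# get() returns a default when the attribute is absent from the start tag.
source = station.get("source", "unknown")

print(attributes, name, source)
```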
Models • An XML file can contain arbitrary tags, in any order or combination (while remaining well-formed). • Restrictions on the legal tags and values a document can contain may be specified using a DTD or an XML Schema. • A DTD (Document Type Definition) provides a concise syntax for modelling documents (but is on the way out). • An XML Schema definition is itself an XML document, which provides a wide range of modelling constructs for constraining other XML documents.
Trains in XML • Hierarchical models can capture most cycle-free data fairly naturally. • Hierarchical models, however, promote some concepts and demote others. • The relational model, by contrast, treats all concepts as (broadly) equal.
<train>
  <tno>3107101</tno>
  <source>Edinburgh</source>
  <destination>London</destination>
  <visit> <name>Edinburgh</name> <time>06:00</time> </visit>
  <visit> <name>York</name> <time>08:00</time> </visit>
  <visit> <name>London</name> <time>10:00</time> </visit>
</train>
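To make the contrast with the relational model concrete, the train document can be flattened into rows. This is only a sketch: the two-relation design (one row per train, one row per visit carrying the train number as a foreign key) is an assumption for illustration, not a design given in the slides.

```python
import xml.etree.ElementTree as ET

TRAIN_XML = """
<train>
  <tno>3107101</tno>
  <source>Edinburgh</source>
  <destination>London</destination>
  <visit><name>Edinburgh</name><time>06:00</time></visit>
  <visit><name>York</name><time>08:00</time></visit>
  <visit><name>London</name><time>10:00</time></visit>
</train>
"""

train = ET.fromstring(TRAIN_XML)
tno = train.findtext("tno")

# In XML, visits are nested inside (demoted below) their train.
# Relationally, train and visit are peers linked by the key tno.
train_row = (tno, train.findtext("source"), train.findtext("destination"))
visit_rows = [
    (tno, visit.findtext("name"), visit.findtext("time"))
    for visit in train.findall("visit")
]

print(train_row)
print(visit_rows)
```

In the hierarchical form, finding all trains that visit York means searching inside every train; in the flattened form, visits can be queried directly.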
Tools • XML is widely used, and many software systems can read/write XML formats. • Generic tools have also been developed for designing/editing XML. (Figure: XML Spy showing a valid XML file as text.)
XML Spy • XML Spy supports: • Editing. • Data modelling. • Validation. • Transformation. • … (Figure: XML Spy showing an XML schema document as a tree.)
Oxygen • Oxygen supports: • Editing. • Data modelling. • Validation. • Transformation. • Querying. • … (Figure: Oxygen showing an XML schema document as a tree and as text.)
XML Databases • Native XML databases: • Store XML in the database directly (“native”). • Make XML Schema the optional schema definition language. • Query the database using XML query languages (XPath/XQuery). • Program database data as XML data structures (XML:DB, DOM, ...). (Figure: an XPath query and result in eXist.)
XML Databases • Native XML databases: • Tamino - Software AG: • http://www.softwareag.com/tamino/. • eXist - Open Source: • http://exist.sourceforge.net/. • Standard APIs: • XML:DB Initiative: http://www.xmldb.org/; Both Tamino and eXist provide XML:DB APIs. • XQJ: XQuery API for Java; Java community standard.
XML and Relational Databases • Storage options: • Decomposed: store an XML document in relational tables, and reconstruct on retrieval. • Composed: store an XML document as an attribute of a relational table. • Retrieval options: • Represent relational tables as XML (e.g. Java WebRowSet). • Relational vendors: • Tend to support both composed and decomposed storage models. • Provide APIs that accommodate XML for data transport or display (e.g., in Web Services or for Web interface generation).
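The two storage options can be sketched side by side with an in-memory SQLite database. Everything here is an assumption made for illustration (the table names train_doc and visit, the seq column used to preserve document order, and the rebuild helper); real relational products provide their own XML column types and mapping tools.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = (
    "<train><tno>3107101</tno>"
    "<visit><name>Edinburgh</name><time>06:00</time></visit>"
    "<visit><name>York</name><time>08:00</time></visit>"
    "</train>"
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE train_doc (tno TEXT PRIMARY KEY, doc TEXT)")
db.execute("CREATE TABLE visit (tno TEXT, seq INTEGER, name TEXT, time TEXT)")

train = ET.fromstring(DOC)
tno = train.findtext("tno")

# Composed: the whole document is stored as a single attribute value.
db.execute("INSERT INTO train_doc VALUES (?, ?)", (tno, DOC))

# Decomposed: the document is shredded into rows; seq records document order.
for seq, v in enumerate(train.findall("visit")):
    db.execute("INSERT INTO visit VALUES (?, ?, ?, ?)",
               (tno, seq, v.findtext("name"), v.findtext("time")))


def rebuild(train_no):
    """Reconstruct the XML document from the decomposed rows."""
    root = ET.Element("train")
    ET.SubElement(root, "tno").text = train_no
    rows = db.execute(
        "SELECT name, time FROM visit WHERE tno = ? ORDER BY seq", (train_no,))
    for name, time in rows:
        visit = ET.SubElement(root, "visit")
        ET.SubElement(visit, "name").text = name
        ET.SubElement(visit, "time").text = time
    return ET.tostring(root, encoding="unicode")


composed = db.execute(
    "SELECT doc FROM train_doc WHERE tno = ?", (tno,)).fetchone()[0]
rebuilt = rebuild(tno)
print(composed == rebuilt)
```

Composed storage returns the document unchanged but the database cannot see inside it with plain SQL; decomposed storage makes the parts queryable at the cost of reconstruction on retrieval.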
Summary • XML is becoming increasingly ubiquitous for data representation for: • files, transport, storage, metadata. • Data management systems must support storage, querying and communication using XML. • Soon everything will be stored using XML? Don’t believe it!
Further Reading • S. Abiteboul, P. Buneman, D. Suciu, Data on the Web, Morgan Kaufmann, 1999. • N. Bradley, The XML Companion (3rd Edition), Addison-Wesley, 2002.
XML Schema • XML Schema is a W3C standard for modelling using XML. • An XML Schema definition is itself an XML document: there is an XML Schema for XML Schema! • XML Schema files have a .xsd suffix; XML data files have a .xml suffix. • An XML Schema can specify: • Which elements are mandatory/optional. • Which attributes are mandatory/optional. • Element/attribute types. • Cardinalities. • Relative ordering.
Role of XML Schema • Unlike in relational/object databases: • An XML database need not have a schema. • An XML schema may not be very prescriptive in terms of what can or cannot be stored.
Train Model (Figure: schema diagram of the train model, with callouts marking a sequence and a recurring sequence.)
Train ComplexType
<xs:complexType name="TrainType">
  <xs:sequence>
    <xs:element name="tno" type="xs:string"/>
    <xs:element name="source" type="xs:string"/>
    <xs:element name="destination" type="xs:string"/>
    <xs:sequence maxOccurs="unbounded">
      <xs:element name="visit">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="time" type="xs:time"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:sequence>
</xs:complexType>
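Because an XML Schema definition is itself an XML document, ordinary XML tools can read it. The sketch below parses the TrainType definition with Python's standard xml.etree.ElementTree and lists the declared elements. The enclosing xs:schema element is an assumption added so that the xs prefix is bound to the XML Schema namespace; the slide shows only the complex type.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="TrainType">
    <xs:sequence>
      <xs:element name="tno" type="xs:string"/>
      <xs:element name="source" type="xs:string"/>
      <xs:element name="destination" type="xs:string"/>
      <xs:sequence maxOccurs="unbounded">
        <xs:element name="visit">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="time" type="xs:time"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""

schema = ET.fromstring(SCHEMA)

# ElementTree writes namespaced names as {namespace-uri}local-name.
declared = [
    (decl.get("name"), decl.get("type"))
    for decl in schema.iter(f"{{{XS}}}element")
]

for name, type_name in declared:
    # visit has no type attribute because its type is defined inline.
    print(name, type_name or "(anonymous complex type)")
```

This only reads the schema as data; it does not validate documents against it, which needs a schema-aware processor.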
Elements • Elements are defined thus: • <element name = "the-name">. • Attributes associated with elements: • type: specifies the kind of content that an element with no attributes or sub-elements can have. Default imposes no constraints. • minOccurs, maxOccurs: the number of times an element can occur. A value of unbounded allows open-ended cardinality. Default is once and only once.
Built-in Types • There are many built-in types: • string. • integer, positiveInteger, negativeInteger. • short, long. • date, dateTime, time. • ID, IDREF. • anyURI
Complex Elements • Any element with sub-elements or attributes is declared to have a complex type. • Sequence: the sub-elements must appear in the given order. • Choice: a selection is made from the sub-elements. • Both sequence and choice can have minOccurs/maxOccurs.
<xs:element name="visit">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="..."/>
      <xs:element name="time" type="..."/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
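The effect of sequence ordering and minOccurs/maxOccurs can be illustrated with a hand-written check. This is a simplified sketch and not a schema validator: the function matches_sequence and its (name, minOccurs, maxOccurs) tuples are hypothetical, None stands for unbounded, and the greedy matching assumes that adjacent entries have different names.

```python
import xml.etree.ElementTree as ET


def matches_sequence(element, particles):
    """Check the children of `element` against an ordered list of
    (name, min_occurs, max_occurs) entries; max_occurs=None means unbounded."""
    tags = [child.tag for child in element]
    pos = 0
    for name, min_occurs, max_occurs in particles:
        count = 0
        # Consume as many consecutive children with this name as allowed.
        while (pos < len(tags) and tags[pos] == name
               and (max_occurs is None or count < max_occurs)):
            pos += 1
            count += 1
        if count < min_occurs:
            return False
    # Any child left over is not permitted by the sequence.
    return pos == len(tags)


VISIT = [("name", 1, 1), ("time", 1, 1)]  # default cardinality: exactly once
TRAIN = [("tno", 1, 1), ("source", 1, 1), ("destination", 1, 1),
         ("visit", 1, None)]              # one or more visits

in_order = ET.fromstring("<visit><name>York</name><time>08:00</time></visit>")
swapped = ET.fromstring("<visit><time>08:00</time><name>York</name></visit>")
no_visits = ET.fromstring(
    "<train><tno>1</tno><source>A</source><destination>B</destination></train>")

print(matches_sequence(in_order, VISIT))
print(matches_sequence(swapped, VISIT))
print(matches_sequence(no_visits, TRAIN))
```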
Attributes • Attributes can be defined within complex types. • Attributes are optional unless use="required". • The resulting data file can populate the attribute.
Data file:
<train ... xsi:type="TrainType" engine="125">
  <tno>3107101</tno>
  <source>Edinburgh</source>
  <destination>London</destination>
  <visit> ... </visit>
</train>
Schema:
<xs:complexType name="TrainType">
  <xs:sequence>
    <xs:element name="tno" type="xs:string"/>
    ...
  </xs:sequence>
  <xs:attribute name="engine" type="xs:string"/>
</xs:complexType>
Building on Existing Types • New types can be constructed from existing types by: • Extension: for complex types, this means that new types can be defined that add attributes or elements to the type on which they are based. • Restriction: for complex types, this means that new types can be defined with fewer attributes or elements, reduced cardinalities, etc.
Type Extensions Example
<xs:complexType name="StationType">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="type" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
<xs:complexType name="DistrictStationType">
  <xs:complexContent>
    <xs:extension base="StationType">
      <xs:sequence>
        <xs:element name="main" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
Type Extensions in Use
Schema:
<xs:element name="stations">
  <xs:complexType>
    <xs:choice maxOccurs="unbounded">
      <xs:element name="station" type="StationType"/>
      <xs:element name="districtStation" type="DistrictStationType"/>
    </xs:choice>
  </xs:complexType>
</xs:element>
Data file:
<stations>
  <station> <name>London</name> <type>main</type> </station>
  <districtStation> <name>York</name> <type>district</type> <main>London</main> </districtStation>
</stations>
Cross References • Hierarchical models do not naturally support shared components. • In documents, cross-references and hyperlinks are very common. • XML has several cross-referencing schemes (e.g., ID/IDREF, XPointer). • Within an XML document: • A value of type ID must be unique. • A value of type IDREF must match some ID within the document.
ID and IDREF for Trains (Figure: ID is used to identify a station; IDREF is used to reference a station.)
XML Schema for ID/IDREF
<xs:element name="station">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="city" type="xs:string"/>
      <xs:element name="name" type="xs:ID"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
...
<xs:element name="visit">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:IDREF"/>
      <xs:element name="time" type="xs:time"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
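The two ID/IDREF rules (IDs are unique; every IDREF matches some ID) can be spelled out as a check over a document. This is a sketch in plain Python rather than a schema validator: the railway wrapper element, the sample data and the check_references helper are assumptions for illustration. The station name is written OxfordRoad because xs:ID values may not contain spaces.

```python
import xml.etree.ElementTree as ET

DOC = """
<railway>
  <station><city>Manchester</city><name>OxfordRoad</name></station>
  <station><city>York</city><name>York</name></station>
  <train>
    <visit><name>York</name><time>08:00</time></visit>
    <visit><name>Leeds</name><time>08:30</time></visit>
  </train>
</railway>
"""


def check_references(root):
    """Return (duplicate IDs, IDREF values that match no ID)."""
    # station/name plays the role of the xs:ID element in the schema.
    ids = [e.text for e in root.findall("./station/name")]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    # visit/name plays the role of the xs:IDREF element.
    refs = {e.text for e in root.findall(".//visit/name")}
    dangling = sorted(refs - set(ids))
    return duplicates, dangling


duplicates, dangling = check_references(ET.fromstring(DOC))
print(duplicates, dangling)
```

Here the visit to Leeds is reported as dangling because no station declares that ID; a validating processor would reject such a document.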
Example: What is the schema?
<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item> …</item>
</shiporder>
Summary • XML data can be parsed, transmitted and queried in the absence of any formal description of its structure. • Many applications need to be able to make assumptions about the structure of documents they process. • XML Schema provides a wide range of modelling facilities for defining XML documents.
Further Reading • The W3Schools tutorial (not a W3C publication) is short but informative: • http://www.w3schools.com/schema/ • D. Fallside, XML Schema Part 0: Primer, 2001: • http://www.w3.org/TR/xmlschema-0/ • N. Bradley, The XML Companion (3rd Edition), Addison-Wesley, 2002.
XPath and XQuery • XPath is a W3C standard for addressing parts of an XML document. • XPath is widely used in XML languages and tools, including XQuery and XSLT. • XPath is not especially expressive. • XQuery is a W3C standard for accessing and restructuring XML documents. • XQuery is supported by several XML databases, but is less widely deployed than XPath. • XPath is used within XQuery.
Trains Model • The following diagram shows the XML schema of the data used in the example queries.
XPath • XPath uses path expressions to describe routes through documents. • These expressions address locations in a hierarchy in a way that is familiar from file systems (/books/chapters/chapter1). • XPath also includes a function library (shared with XQuery) for manipulating numerical, string, date, node and sequence values. • Standardisation: XPath 1.0 has been a W3C standard since 1999; XPath 2.0 became a W3C standard in January 2007.
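Path expressions can be tried out with Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath 1.0 (child steps, the descendant shorthand //, and simple predicates) and none of the function library. The sketch below uses the train document from the earlier slides; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

TRAIN_XML = """
<train>
  <tno>3107101</tno>
  <source>Edinburgh</source>
  <destination>London</destination>
  <visit><name>Edinburgh</name><time>06:00</time></visit>
  <visit><name>York</name><time>08:00</time></visit>
  <visit><name>London</name><time>10:00</time></visit>
</train>
"""

train = ET.fromstring(TRAIN_XML)

# Child steps: paths here are relative to the <train> element.
stops = [n.text for n in train.findall("./visit/name")]

# Descendant shorthand: every <time> element at any depth below <train>.
times = [t.text for t in train.findall(".//time")]

# Predicate: only the visit whose <name> child has the text York.
york_time = train.findtext("./visit[name='York']/time")

print(stops, times, york_time)
```

A full XPath processor would also accept absolute paths such as /train/visit/name and calls to the function library.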
XPath Terminology • Nodes include elements, attributes and the document node (the root of the tree, which is the parent of the outermost element); nodes can be addressed using XPath. • XPath includes constructs for exploring relationships between nodes, such as parent, child, ancestor and descendant.
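The node relationships named above can be explored on a small tree. This is a sketch using Python's standard xml.etree.ElementTree, which stores no parent pointers, so a child-to-parent map is built first; the names parent_of and ancestors are assumptions for illustration, and only element nodes are considered.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<train><visit><name>York</name><time>08:00</time></visit></train>")

# Build a child -> parent map by visiting every element once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the names of the ancestors of `node`, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node.tag)
    return chain


name = root.find("./visit/name")

parent = parent_of[name].tag                              # parent
children = [c.tag for c in root.find("visit")]            # children of <visit>
name_ancestors = ancestors(name)                          # ancestors of <name>
descendants = [d.tag for d in root.iter() if d is not root]  # below <train>

print(parent, children, name_ancestors, descendants)
```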