Database Systems I: The Semistructured Data Model
The Web Today • HTML documents • generated by humans or by applications, • consumed by humans only, • easy access across platforms and across organizations, • only layout, no semantic information. • Limited application interoperability • HTML is not understood by applications; at most, some heuristic rules can be applied. • Database technology • SQL is a standard, but implementations still have many vendor-specific aspects.
XML Data Exchange Format • A standard from the W3C (World Wide Web Consortium, http://www.w3.org). • The mission of the W3C: ". . . developing common protocols that promote its evolution and ensure its interoperability . . .". • Basic ideas • XML = data • XML generated by applications • XML consumed by applications • Easy access: across platforms, organizations.
Paradigm Shift on the Web • For web search engines: • From documents (HTML) to data (XML) • From document management to document understanding (e.g., question answering) • From information retrieval to data management • For database systems: • From relational (structured) model to semistructured data • From data processing to data/query translation • From storage to transport
The Semistructured Data Model • Developed by the DBS community to address the following emerging issues: • Data sets with non-rigid structure • Biological data: sequence data, 3D data, text data . . . and their relationships • Web data • Integration of heterogeneous sources: not only, but especially, for Web data and biological data.
The Semistructured Data Model • Data is self-describing, i.e. the data description is integrated with the data itself rather than in a separate schema. • Database is a collection of nodes and arcs (directed graph). • Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings). • Interior nodes represent complex objects consisting of components (child nodes), connected by arcs to this node. • Arcs are directed and connect two nodes.
The Semistructured Data Model • Arc labels indicate the relationship between the two corresponding nodes. • The root node is the only interior node without in-arcs, representing the entire database. • All database objects are children of the root node. • Every node must be reachable from the root. • A general graph structure is possible, i.e. the graph need not be a tree structure.
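The graph conditions above (a single root without in-arcs, every node reachable from it) can be sketched in a few lines of Python. This is only an illustration: the arc list, the helper names `roots`, `reachable` and `is_valid_database`, and the choice of oids are assumptions made for this sketch, loosely following the bibliography example.

```python
from collections import deque

# Arcs of a small semistructured database: (source oid, label, target oid).
# Oids and labels are illustrative assumptions, not a fixed format.
arcs = [
    ("&o1", "paper", "&o12"),
    ("&o1", "book", "&o24"),
    ("&o1", "paper", "&o29"),
    ("&o29", "references", "&o12"),  # second in-arc for &o12: a graph, not a tree
    ("&o29", "references", "&o24"),
    ("&o29", "title", "&o93"),
]

# Atomic values stored at leaf nodes.
atoms = {"&o93": "Regular path queries with constraints"}


def nodes(arcs):
    """All node ids that occur as source or target of some arc."""
    return {n for src, _, dst in arcs for n in (src, dst)}


def roots(arcs):
    """Nodes without in-arcs."""
    targets = {dst for _, _, dst in arcs}
    return nodes(arcs) - targets


def reachable(arcs, start):
    """Breadth-first search that follows arcs in their direction."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for src, _, dst in arcs:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def is_valid_database(arcs):
    """Exactly one root, and every node is reachable from it."""
    r = roots(arcs)
    return len(r) == 1 and reachable(arcs, next(iter(r))) == nodes(arcs)
```

Adding an arc between two nodes that are not connected to `&o1` introduces a second root, so `is_valid_database` rejects the result.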
Graphical Representation • [Figure: the bibliography database drawn as a directed graph. The root &o1 (Bib) has outgoing arcs labeled paper, book and paper to the complex objects &o12, &o24 and &o29; further arcs carry labels such as author, title, year, publisher, page, references, firstname and lastname. Leaf nodes hold atomic objects such as “Serge”, “Abiteboul”, “Victor”, “Vianu”, 1997, 122 and 133. Legend: complex object, atomic object.]
Textual Representation • Example: Bib: &o1 { paper: &o12 { … }, book: &o24 { … }, paper: &o29 { author: &o52 “Abiteboul”, author: &o96 { firstname: &o243 “Victor”, lastname: &o206 “Vianu” }, title: &o93 “Regular path queries with constraints”, references: &o12, references: &o24, page: &o25 { first: &o64 122, last: &o92 133 } } } • Notation: nested tuples, set values, object identifiers (oids)
Textual Representation • Simplified textual representation. • Can omit oids. { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } }
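One way to hold the simplified textual representation in a program is sketched below. A Python dict cannot repeat a key, so a complex object is modeled here as a list of (label, value) pairs, which allows two author entries. The representation and the helper `follow` are assumptions made for this illustration, not part of the model's definition.

```python
# A complex object is a list of (label, value) pairs; an atomic object is a
# plain Python value. Labels may repeat, as with the two "author" arcs.
paper = [
    ("author", "Abiteboul"),
    ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
    ("title", "Regular path queries …"),
    ("page", [("first", 122), ("last", 133)]),
]
db = [("paper", paper)]


def follow(obj, path):
    """Return all objects reached from obj by following the labels in path."""
    current = [obj]
    for label in path:
        reached = []
        for o in current:
            if isinstance(o, list):  # only complex objects have out-arcs
                reached.extend(value for lbl, value in o if lbl == label)
        current = reached
    return current
```

For example, `follow(db, ["paper", "author"])` returns both authors, one atomic and one complex, which illustrates the irregular structure the model permits.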
Comparison with Relational Model • Missing attributes • Additional attributes • Multiple attribute values (set-valued attributes) • Objects as attribute values • No global schema • Only the first of these characteristics is supported by the relational model; the others are not.
Comparison with Relational Model • Semistructured data • Self-describing, • Irregular data, • No a-priori structure. • Relational DB • Separate schema, • Regular data, • A-priori structure.
Comparison with Relational Model • Example: a relation with attributes name and phone and the rows (“John”, 3634), (“Sue”, 6343), (“Dick”, 6363), written as semistructured data: { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } }
XML • A W3C standard for an Extensible Markup Language. • Origins: structured text, SGML (Standard Generalized Markup Language). • Motivation • HTML describes presentation only, XML describes content and its meaning (semantics). • HTML is a fixed language; XML allows you to define your own markup languages.
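The mapping from a relation to semistructured data can be sketched as follows. The function name `relation_to_semistructured`, the (label, value) pair representation, and the decision to drop NULL values (Python `None`) instead of storing them are assumptions of this sketch.

```python
def relation_to_semistructured(columns, rows):
    """Turn a relation into a list of ("row", [(column, value), ...]) pairs.

    NULL values (None) are simply omitted: in the semistructured model a
    missing attribute needs no placeholder.
    """
    return [
        ("row", [(col, val) for col, val in zip(columns, row) if val is not None])
        for row in rows
    ]


columns = ["name", "phone"]
rows = [("John", 3634), ("Sue", 6343), ("Dick", 6363)]
db = relation_to_semistructured(columns, rows)
```

The reverse direction is not always possible without loss: set-valued or nested attributes have no direct counterpart in a flat relation.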
From HTML to XML HTML describes the presentation / layout
From HTML to XML HTML example <h1> Bibliography </h1> <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995 <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999
From HTML to XML • XML example <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> … </bibliography> • XML describes the content
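Because the XML version labels its content, an application can read it without layout heuristics. A minimal sketch using Python's standard `xml.etree.ElementTree` module follows; the document string is a shortened copy of the example with the full title filled in from the HTML slide, and the dictionary layout of `books` is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

doc = """
<bibliography>
  <book>
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc)

# Collect one record per <book>, using tag names instead of layout cues.
books = []
for book in root.findall("book"):
    books.append({
        "title": book.findtext("title").strip(),
        "authors": [a.text.strip() for a in book.findall("author")],
        "year": int(book.findtext("year")),
    })
```

Extracting the same fields from the HTML version would require guessing that italic text is a title and that the text before the line break lists the authors.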
Elements • Tags: book, title, author, … • start tag: <book>, end tag: </book> • defined by user / programmer (different from HTML!) • Elements: <book>…</book>, <author>…</author> • An element consists of a matching start and end tag and the enclosed content. • Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.
Attributes • Attributes can be associated with any element. • Provide additional information about elements. • Attributes can have only one value. • Example <book price="55" currency="USD"> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year> </book> • Attributes can also be used to connect elements.
Non-tree-like XML • So far: only tree-like XML documents, i.e. each element is nested within at most one other element. • Attributes can also be used to create non-tree XML documents. • Attributes with a domain of ID serve as primary keys of elements. • Attributes with a domain of IDREF serve as foreign keys referencing the ID of another element.
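A short sketch of how attributes appear to a program, again with `xml.etree.ElementTree`. The attribute name `isbn` is hypothetical and only used to show what happens when an attribute is absent; the last lines check that repeating an attribute on one element is rejected by the parser.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title>Foundations of Databases</title>"
    "<year>1995</year>"
    "</book>"
)

price = float(book.get("price"))   # attribute values are always strings
currency = book.get("currency")
missing = book.get("isbn")         # hypothetical attribute, absent here

# An attribute may occur only once per element; a duplicate is a parse error.
try:
    ET.fromstring('<book price="55" price="60"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
```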
Non-tree-like XML • Example of a non-tree structure <persons> <person personid="o555"> <name> Jane </name> </person> <person personid="o456"> <name> Mary </name> <children refs="o123 o555"></children> </person> <person personid="o123" mother="o456"> <name>John</name> </person> </persons>
Namespaces • An XML document can involve tags that come from multiple sources. • One and the same tag can appear in more than one source. <table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table> <table> <name>African Coffee Table</name> <width>80</width> <length>120</length> </table>
Namespaces • Name conflicts can be resolved by prefixing tag names according to their source. <h:table> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table> <f:table> <f:name>African Coffee Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table> • When using prefixes in XML, a namespace for the prefix must be defined. • The namespace must be referenced (via a URI) in the start tag of an enclosing element.
Namespaces <root> <h:table xmlns:h="http://www.w3.org/TR/html4/"> <h:tr> . . . </h:tr> </h:table> <f:table xmlns:f="http://www.w3schools.com/furniture"> . . . </f:table> </root> Or alternatively: <root xmlns:h="http://www.w3.org/TR/html4/" xmlns:f="http://www.w3schools.com/furniture"> <h:table> . . . </h:table> <f:table> . . . </f:table> </root>
Namespaces • A URI is a Uniform Resource Identifier, typically a URL. • The document referenced by the URI describes the meaning of the tags in the namespace. • This description is informal and is not used by the XML parser. • The description can even be empty.
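Following the references in such a document amounts to a lookup by id. The sketch below treats `personid` as the ID attribute and `refs` / `mother` as IDREFS / IDREF purely by convention, since `xml.etree.ElementTree` does not read a DTD; the helper names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

doc = """
<persons>
  <person personid="o555"><name>Jane</name></person>
  <person personid="o456"><name>Mary</name><children refs="o123 o555"/></person>
  <person personid="o123" mother="o456"><name>John</name></person>
</persons>
"""

root = ET.fromstring(doc)

# Index playing the role of the primary key: id -> element.
by_id = {p.get("personid"): p for p in root.findall("person")}


def name_of(oid):
    return by_id[oid].findtext("name").strip()


def children_of(oid):
    """Resolve the list of references stored in <children refs="...">."""
    node = by_id[oid].find("children")
    refs = node.get("refs", "").split() if node is not None else []
    return [name_of(r) for r in refs]


def mother_of(oid):
    """Resolve the single reference stored in the mother attribute."""
    ref = by_id[oid].get("mother")
    return name_of(ref) if ref is not None else None
```

John is reachable both as a direct child of `persons` and through Mary's `refs`, which is what makes the structure a graph and not a tree.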
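How a parser keeps the two `table` elements apart can be seen with `xml.etree.ElementTree`, which expands each prefixed tag to `{namespace URI}localname`. The short document and the prefix mapping `ns` below are assembled from the examples above for illustration only.

```python
import xml.etree.ElementTree as ET

doc = """
<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="http://www.w3schools.com/furniture">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table><f:name>African Coffee Table</f:name><f:width>80</f:width></f:table>
</root>
"""

root = ET.fromstring(doc)

# Prefix -> URI mapping used for the queries; the prefixes here are local to
# this code and need not match the ones used inside the document.
ns = {
    "h": "http://www.w3.org/TR/html4/",
    "f": "http://www.w3schools.com/furniture",
}

html_table = root.find("h:table", ns)
furniture = root.find("f:table", ns)
cells = [td.text for td in html_table.findall("h:tr/h:td", ns)]
furniture_name = furniture.findtext("f:name", namespaces=ns)
```

The parser never fetches the URI; it only uses the string to tell the names apart.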
Well-Formed XML • A well-formed XML document satisfies the following conditions: • Begins with a declaration that it is XML (recommended; XML 1.0 does not strictly require it). • Has a single root element that encloses the whole document. • Consists of properly nested elements, i.e. the start and end tag of an element are within the same enclosing element. • standalone="yes" states that the document does not depend on an external DTD. • In this mode, you can invent your own tags, as in the semistructured data model.
Well-Formed XML • Example: <?xml version="1.0" standalone="yes"?> <bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> <book> <title> … </title> . . . </book> … </bibliography>
Well-Formed XML • HTML browsers will display documents with errors (like missing end tags). • The W3C XML specification states that a program should stop processing an XML document if it finds an error. • The main reason is that XML is consumed by programs rather than by humans (as HTML is). • W3C provides a validator that checks whether an XML document is well-formed.
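The stop-on-error behaviour is easy to observe with a non-validating parser. The sketch below wraps `xml.etree.ElementTree.fromstring` in a helper, `check_well_formed`, whose name and return shape are assumptions of this sketch; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


good = "<book><title>Data on the Web</title></book>"
bad_nesting = "<book><title>Data on the Web</book></title>"  # tags overlap
missing_end = "<p><i>Data on the Web</i>"                    # <p> never closed
two_roots = "<book/><book/>"                                 # no single root
```

The parser gives up at the first violation and reports its position; it does not try to repair the input the way an HTML browser would.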
Valid XML • The validator can also check whether an XML document is valid, i.e. conforms to a Document Type Definition (DTD). • A DTD specifies the allowable tags and how they can be nested. • XML with a DTD is no longer semistructured (self-describing). • However, a DTD is less rigid than the schema of a relational DB. E.g., a DTD allows missing and multiple attributes / elements.
Document Type Definitions • Document Type Definition (DTD): a set of rules (a grammar) specifying elements, attributes and all other aspects of XML documents. • For each element, specify its name and content type. • The content type can be, e.g., • #PCDATA (character string), • other elements, • a regular expression over the above content types, where * = zero or more occurrences, ? = zero or one occurrence, + = one or more occurrences, and , = sequence of elements.
Document Type Definitions • Specification of an element type: "<!ELEMENT" <Name> <Content> ">" • Specification of attributes: "<!ATTLIST" <ElementName> <AttributeName> <Content> <Type> ">" • The attribute type is either #REQUIRED or #IMPLIED (optional).
Document Type Definitions • ID: domain with unique values within the given document. • IDREF: references one ID. • IDREFS: references a list of IDs. • Example <Book id="book1" pub="book5" . . .> . . . <Book id="book5" pub="book4" . . .>
Document Type Definitions • A document type contains all corresponding element types: "<!DOCTYPE" <Name> "[" <ElementTypes> "]>" • Use of a DTD by a document: • reference the DTD in the document's opening lines, • set standalone="no". • Example <?xml version="1.0" standalone="no"?> <!DOCTYPE Book SYSTEM "Book.dtd">
Example DTD: Product Catalog <!DOCTYPE CATALOG [ <!ELEMENT CATALOG (PRODUCT+)> <!ELEMENT PRODUCT (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)> <!ATTLIST PRODUCT NAME CDATA #IMPLIED CATEGORY (HandTool|Table|Shop-Professional) "HandTool" PARTNUM CDATA #IMPLIED PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago" INVENTORY (InStock|Backordered|Discontinued) "InStock"> <!ELEMENT SPECIFICATIONS (#PCDATA)> <!ATTLIST SPECIFICATIONS WEIGHT CDATA #IMPLIED POWER CDATA #IMPLIED> <!ELEMENT OPTIONS (#PCDATA)> <!ATTLIST OPTIONS FINISH (Metal|Polished|Matte) "Matte" ADAPTER (Included|Optional|NotApplicable) "Included" CASE (HardShell|Soft|NotApplicable) "HardShell"> <!ELEMENT PRICE (#PCDATA)> <!ATTLIST PRICE MSRP CDATA #IMPLIED WHOLESALE CDATA #IMPLIED STREET CDATA #IMPLIED SHIPPING CDATA #IMPLIED> <!ELEMENT NOTES (#PCDATA)> ]>
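Since a content model such as (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?) is a regular expression over child element names, checking it can be sketched by translating it into a Python regex. This is a hand-rolled illustration, not a DTD validator: Python's standard library does not validate against DTDs, the helper names are assumptions, and only element-only models built from names, `,`, `|`, `?`, `*`, `+` and parentheses are handled (no #PCDATA, no attributes).

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern for element names (an assumption, not the full XML rule).
NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_to_regex(model):
    """Translate an element-only DTD content model into a compiled regex.

    Each child is encoded as "NAME;", so the model is matched against the
    concatenation of the child tags, each followed by ';'.
    """
    compact = "".join(model.split())  # drop whitespace
    with_names = NAME.sub(
        lambda m: "(?:" + re.escape(m.group(0)) + ";)", compact
    )
    # ',' denotes sequence, which is plain concatenation in a regex;
    # '|', '?', '*', '+' and parentheses already mean the same thing.
    return re.compile(with_names.replace(",", ""))


def children_match(element, model):
    """True if the element's child tags, in order, satisfy the content model."""
    sequence = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(sequence) is not None


product_model = "(SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)"
ok = ET.fromstring(
    "<PRODUCT><SPECIFICATIONS/><PRICE/><PRICE/><NOTES/></PRODUCT>"
)
wrong_order = ET.fromstring("<PRODUCT><PRICE/><SPECIFICATIONS/></PRODUCT>")
no_price = ET.fromstring("<PRODUCT><SPECIFICATIONS/></PRODUCT>")
```

A validating parser performs this kind of check for every element and additionally checks attribute declarations such as the enumerated CATEGORY values.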
XML Schema • The successor of DTDs to specify a schema for XML documents. • A W3C standard. • Includes and extends functionality of DTDs. • In particular, XML Schemas support data types. This makes it easier to validate the correctness of data and to work with data from a database. • XML Schemas are written in XML. You don't have to learn a new language and can use your XML parser to parse your Schema files.
Simple Elements • Simple elements contain only text. • They can have one of the built-in datatypes: xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time. • Example <xs:element name="lastname" type="xs:string"/> <xs:element name="age" type="xs:integer"/> <xs:element name="dateborn" type="xs:date"/>
Simple Elements • Restrictions allow you to further constrain the content of simple elements. <xs:element name="age"> <xs:simpleType> <xs:restriction base="xs:integer"> <xs:minInclusive value="0"/> <xs:maxInclusive value="120"/> </xs:restriction> </xs:simpleType> </xs:element>
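What a typed, restricted simple element buys an application can be sketched as a small check in Python. This is an approximation for illustration only: the `PARSERS` table and the function `check_simple` are hypothetical, the Python parsers accept slightly different lexical forms than the XML Schema datatypes, and the standard library offers no XML Schema validator.

```python
from datetime import date
from decimal import Decimal

# Hypothetical mapping from a few built-in schema types to Python parsers.
PARSERS = {
    "xs:string": str,
    "xs:integer": int,
    "xs:decimal": Decimal,
    "xs:date": date.fromisoformat,
}


def check_simple(text, base, min_inclusive=None, max_inclusive=None):
    """Parse text as the given base type and apply min/max facets.

    Returns the parsed value, or raises ValueError if the text does not fit
    the type or violates a facet.
    """
    try:
        value = PARSERS[base](text.strip())
    except (ValueError, ArithmeticError) as err:
        raise ValueError(f"{text!r} is not a valid {base}") from err
    if min_inclusive is not None and value < min_inclusive:
        raise ValueError(f"{value} is below minInclusive {min_inclusive}")
    if max_inclusive is not None and value > max_inclusive:
        raise ValueError(f"{value} is above maxInclusive {max_inclusive}")
    return value


# Mirrors the age restriction: an integer between 0 and 120.
age = check_simple("42", "xs:integer", min_inclusive=0, max_inclusive=120)
```

With a DTD, the same element could only be declared as #PCDATA, so a value such as 121 or "abc" would pass validation.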
Attributes • Attributes can be specified using the attribute element: <xs:attribute name="xxx" type="yyy"/> • Attribute elements are nested within the definition of the element with which they are associated. • By default, attributes are optional. • To make an attribute mandatory, use <xs:attribute name="lang" type="xs:string" use="required"/> • Attributes can have the same built-in datatypes as simple elements.
Complex Elements • Complex elements can contain other elements and can have attributes. • Within an xs:sequence, nested elements need to occur in the order specified. • The number of repetitions of an element is controlled by the attributes minOccurs and maxOccurs. The default is one repetition. • A complex element with an attribute: <xs:element name="product"> <xs:complexType> <xs:attribute name="prodid" type="xs:positiveInteger"/> </xs:complexType> </xs:element>
Complex Elements • A complex element containing a sequence of nested (simple) elements: <xs:element name="employee"> <xs:complexType> <xs:sequence> <xs:element name="firstname" type="xs:string"/> <xs:element name="lastname" type="xs:string"/> </xs:sequence> </xs:complexType> </xs:element>
Complex Elements • If you name the complex type, other elements can reference and include it: <xs:complexType name="persontype"> <xs:sequence> <xs:element name="firstname" type="xs:string"/> <xs:element name="lastname" type="xs:string"/> </xs:sequence> </xs:complexType> <xs:element name="person" type="persontype"/>
XML Document With Schema • An XML document that uses a schema has to reference the schema in the schemaLocation attribute of its root element : <?xml version="1.0"?> <note xmlns="http://www.w3schools.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd"> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
Example XML Schema (early 1999 working-draft syntax; the final W3C Recommendation uses complexType and maxOccurs="unbounded") <schema version="1.0" xmlns="http://www.w3.org/1999/XMLSchema"> <element name="author" type="string" /> <element name="date" type="date" /> <element name="abstract"> <type> … </type> </element> <element name="paper"> <type> <attribute name="keywords" type="string"/> <element ref="author" minOccurs="0" maxOccurs="*" /> <element ref="date" /> <element ref="abstract" minOccurs="0" maxOccurs="1" /> <element ref="body" /> </type> </element> </schema>
XML vs. Semistructured Data • Both described best by a graph. • Both are schema-less, self-describing (XML without DTD / XML Schema). • XML is ordered, semistructured data is not. • XML can mix text and elements: <talk> Making Java easier to type and easier to type <speaker> Phil Wadler </speaker> </talk> • XML has lots of other stuff: attributes, entities, processing instructions, comments.
Summary • Due to their variable and complex structure, Web documents cannot naturally be modeled using the relational model. • The Semistructured Data Model is a self-describing data model providing sufficient flexibility for representing Web documents. • One of the weaknesses of the Web is that (HTML) documents cannot be processed automatically. • The purpose of XML is to provide a way of recording the semantics of Web documents and their components. To this end, XML allows you to define your own application-specific tags.
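Two of these differences, order and mixed content, show up directly in a parsed tree. In `xml.etree.ElementTree`, text before the first child is stored in the parent's `text`, and text following a child is stored in that child's `tail`. The second document, with elements `p`, `a` and `b`, is made up for this sketch.

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk>Making Java easier to type and easier to type "
    "<speaker>Phil Wadler</speaker></talk>"
)

leading_text = talk.text             # text before the first child element
speaker = talk.find("speaker").text
trailing = talk.find("speaker").tail  # nothing follows </speaker> here

# Children are kept in document order, and text may sit between them.
mixed = ET.fromstring("<p><a/>between<b/></p>")
order = [child.tag for child in mixed]
between = mixed[0].tail              # text that follows <a/>
```

In the semistructured model the out-arcs of a node form an unordered set, so neither the position of `between` nor the order of `a` and `b` would be part of the data.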
Summary • XML documents are lists of elements and attributes. Elements can be nested to form tree-like structures. • Non-hierarchical structures are also possible. • Document type definitions (DTDs) are similar to but less restrictive than DB schemas, specifying rules that corresponding XML documents have to satisfy. • XML schemas are a more recent and more DB-like extension of DTDs.