Models and languages for semistructured data: Bridging documents and databases
Lectures
1. Introduction to data models
2. Query languages for relational databases
3. Models and query languages for object databases
4. Embedded query languages
5. Models and query languages for semistructured data, XML
6. Semantic Web, introduction
7. Semantic Web, continued
Why do we like types? • Types facilitate understanding • Types enable compact representations • Types enable query optimisation • Types facilitate consistency enforcement
Background assumptions for typed data • Data stable over time • Organisational body to control data • Exercise: Give an example of a context where these assumptions do not hold
Semistructured data Semistructured data is schemaless and self-describing. The data and the description of the data are integrated.
An example
[Figure: tree whose root has edges name, tel, and email; name leads to a node with edges first (“John”) and last (“Smith”); tel leads to 112233; email leads to “john@123.edu”]
{name: {first: “John”, last: “Smith”}, tel: 112233, email: “john@123.edu”}

Another example
[Figure: two person edges lead to the objects &o1 and &o2; &o1 has name “Eva”, age 40, and a child edge to &o2; &o2 has name “Abel”, age 20]
{person: &o1{name: “Eva”, age: 40, child: &o2}, person: &o2{name: “Abel”, age: 20}}
An object identifier, such as &o1, placed before a structure binds the identifier to the identity of that structure. The object identifier can then be used to refer to the structure.

Terminology
The following is an ssd-expression: &o1{name: “Eva”, age: 40, child: &o2}
[Figure callouts: object identifier (&o1), label (for example name), value (for example “Eva”)]
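The terminology above can be made concrete with a minimal Python sketch. It assumes one possible encoding that the slides do not prescribe: an ssd-expression is a list of (label, value) pairs (a list rather than a dict, because a label such as person may repeat), and object identifiers are keys in a hypothetical `objects` table.

```python
# An ssd-expression as a list of (label, value) pairs.
# A list is used instead of a dict because labels may repeat.
eva = [("name", "Eva"), ("age", 40), ("child", "&o2")]
abel = [("name", "Abel"), ("age", 20)]

# Hypothetical table binding object identifiers to structures.
objects = {"&o1": eva, "&o2": abel}

root = [("person", "&o1"), ("person", "&o2")]

def deref(value):
    """Follow an object identifier to the structure it is bound to."""
    return objects.get(value, value) if isinstance(value, str) else value

def children(node, label):
    """Values reached from `node` along edges named `label`."""
    return [deref(v) for (l, v) in deref(node) if l == label]

# Names of all children of all persons.
child_names = [n for p in children(root, "person")
                 for c in children(p, "child")
                 for n in children(c, "name")]
print(child_names)  # ['Abel']
```

The reference &o2 is resolved only when the child edge is followed, which is how one structure can be shared by several parents.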
A database
[Figure: edge-labelled graph rooted at db, with an edge biblio leading to the publications. Node n1 (paper): author “Crick”, author “Wallace”, title “The spiral DNA”, date 1956. Node n2 (book): author “Darwin”, title “Origin”, date 1848. Node n3 (book): author “Marx”, title “Kapital”, date 1860. Further entries are elided (…….).]

Path expressions
A path expression is a sequence of labels: l1.l2…ln
A path expression evaluates to a set of nodes.
Path properties are specified by regular expressions on two levels: over the alphabet of labels, and over the alphabet of characters that make up labels.

A path expression
biblio.book.author
[Figure: the example database graph, shown with this path expression]

A path expression
biblio.(book | paper).author
[Figure: the example database graph, shown with this path expression]

Examples of path expressions
• biblio.book.author - authors of books
• biblio.paper.author - authors of papers
• biblio.(book | paper).author - authors of books or papers
• biblio._.author - authors of anything
• biblio._*.author - nodes at the ends of paths that start with biblio, end with author, and have an arbitrary sequence of labels in between
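The path expressions above can be evaluated with a short Python sketch. The encoding is an assumption, not part of the lecture: the database is a nested list of (label, value) edges, a path is a dot-separated string, and the helper names `edges`, `descendants`, `step`, and `evaluate` are hypothetical. The sketch handles trees only; a graph with cycles would need a visited set.

```python
# The example bibliography as nested lists of (label, value) edges.
db = [("biblio", [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
])]

def edges(node):
    """Outgoing (label, child) edges; atomic values have none."""
    return node if isinstance(node, list) else []

def descendants(node):
    """Yield node and every node below it (paths of zero or more edges)."""
    yield node
    for _, child in edges(node):
        yield from descendants(child)

def step(nodes, s):
    """Apply one path step: a label, (a|b) alternatives, _ or _*."""
    if s == "_*":
        return [d for n in nodes for d in descendants(n)]
    allowed = None if s == "_" else {a.strip() for a in s.strip("()").split("|")}
    return [c for n in nodes for (label, c) in edges(n)
            if allowed is None or label in allowed]

def evaluate(root, path):
    """Return the nodes reached from root by following the path expression."""
    nodes = [root]
    for s in path.split("."):
        nodes = step(nodes, s)
    return nodes

print(evaluate(db, "biblio.book.author"))          # ['Darwin', 'Marx']
print(evaluate(db, "biblio.(book|paper).author"))  # ['Crick', 'Wallace', 'Darwin', 'Marx']
print(evaluate(db, "biblio._*.author"))            # ['Crick', 'Wallace', 'Darwin', 'Marx']
```

Each step maps a list of nodes to the list of nodes one step further, so a path expression is evaluated as a left-to-right fold over its steps.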
Example of a label pattern • ((b | B)ook | (a | A)uthor) (s)? - matches book, Book, author, Author, books, Books, authors, Authors
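The label pattern above is an ordinary regular expression over characters, so it can be tried directly with Python's `re` module. This is a small sketch; the list of candidate labels is made up for illustration, and the pattern is required to match the whole label.

```python
import re

# The label pattern from the slide, written without spaces.
label_pattern = re.compile(r"((b|B)ook|(a|A)uthor)(s)?")

# Hypothetical candidate labels; the last two should be rejected.
labels = ["book", "Book", "author", "Author",
          "books", "Books", "authors", "Authors",
          "booklet", "AUTHOR"]

# fullmatch requires the whole label to match, not just a prefix.
matching = [l for l in labels if label_pattern.fullmatch(l)]
print(matching)
```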
An exercise
biblio._*.author.(“[s | S]ection”)
Which of the following paths match the path expression above?
1. biblio.author.Section
2. biblio.cat.rat.hat.author.section
3. biblio.author
4. biblio.cat.author.section.Section
A simple query select author: X from biblio.book.author X Result: {author: “Darwin”, author: “Marx”}
A query with a condition select row: X from biblio._ X where “Crick” in X.author Result: {row: {author: “Crick”, author: “Wallace”, date: 1956, title: “The spiral DNA”}, …}
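A select-from-where query of this kind reads naturally as a Python list comprehension. The sketch below assumes a hypothetical encoding of the example database as lists of (label, value) pairs; `biblio` stands for the node reached by the biblio edge, and `values` is a helper introduced only for illustration.

```python
# The node below the biblio edge, as a list of (label, value) edges.
biblio = [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("date", 1956), ("title", "The spiral DNA")]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
]

def values(node, label):
    """Values reached from node along edges named label."""
    return [v for (l, v) in node if l == label]

# select author: X from biblio.book.author X
simple = [("author", x)
          for b in values(biblio, "book")
          for x in values(b, "author")]

# select row: X from biblio._ X where "Crick" in X.author
conditional = [("row", x)
               for (_, x) in biblio
               if "Crick" in values(x, "author")]

print(simple)       # [('author', 'Darwin'), ('author', 'Marx')]
print(conditional)  # one row: the paper by Crick and Wallace
```

The from clause becomes the `for` parts, the where clause the `if` part, and the select clause the constructed pair.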
Two exercises
select row: {title: Y, date: Z} from biblio.paper X, X.title Y, X.date Z
select row: {author: Y, date: Z} from biblio.book X, X.author Y, X.date Z

A database
select row: {title: Y, date: Z} from biblio.paper X, X.title Y, X.date Z
[Figure: the example database graph, shown with this query]
Nested queries select row: (select author: Y from X.author Y) from biblio.book X
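The nested query can be sketched in Python as a comprehension inside a comprehension. As before, the list-of-pairs encoding and the `values` helper are assumptions made for illustration.

```python
# The node below the biblio edge, as a list of (label, value) edges.
biblio = [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
]

def values(node, label):
    """Values reached from node along edges named label."""
    return [v for (l, v) in node if l == label]

# select row: (select author: Y from X.author Y) from biblio.book X
nested = [("row", [("author", y) for y in values(x, "author")])
          for x in values(biblio, "book")]

print(nested)
# [('row', [('author', 'Darwin')]), ('row', [('author', 'Marx')])]
```

The inner query is evaluated once per binding of X, so the authors stay grouped by book instead of being flattened into one list.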
Three exercises • Which authors have written a book or a paper in 1992? • Which authors have written a book together with Jones? • Which authors have written both a book and a paper?
Expressing relations
[Tables: r1 with columns a, b, c and r2 with columns b, d, e; cell values as extracted: 1 2 3 1 1 3 3 2 2 3 4 2 4 3 1 2 3 1]
{ r1: { row: {a: 1, b: 2, c: 2}, row: {a: 1, b: 2, c: 2}, row: {a: 1, b: 2, c: 2} }, r2: { row: {b: 1, d: 2, e: 2}, row: {b: 1, d: 2, e: 2}, row: {b: 1, d: 2, e: 2} } }
Expressing relational joins select a: A, d: D from r1.row X, r2.row Y, X.a A, X.b B, Y.b B’, Y.d D where B = B’
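The join query above corresponds to a nested loop with an equality test on b. The following Python sketch uses made-up rows, not the rows of the slide's tables, and encodes each row as a dict since labels within a row are unique.

```python
# Hypothetical rows; each row is a dict because its labels are unique.
r1 = [{"a": 1, "b": 2, "c": 3},
      {"a": 3, "b": 4, "c": 5}]
r2 = [{"b": 2, "d": 7, "e": 8},
      {"b": 9, "d": 1, "e": 1}]

# select a: A, d: D
# from r1.row X, r2.row Y, X.a A, X.b B, Y.b B', Y.d D
# where B = B'
joined = [{"a": x["a"], "d": y["d"]}
          for x in r1
          for y in r2
          if x["b"] == y["b"]]

print(joined)  # [{'a': 1, 'd': 7}]
```

This is a nested-loop join; a relational engine could pick a better plan because it knows the schema, which is one of the benefits of types mentioned earlier.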
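Label variables
select L: X from biblio._*.L X where matches(“.*Shakespeare.*”, X)
[Figure callout: L is a label variable]
[Figure: database graph rooted at db with an edge biblio. Node n2 (book): author “Shakespeare”, title “Macbeth”, date 1622. Node n3 (book): author “Smith”, title “Best of Shakespeare”, date 1992. Further entries are elided (…….).]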
Label variables select L: X from biblio._*.L X where matches(“.*Shakespeare.*”, X) {author: “Shakespeare”, title: “Best of Shakespeare”}
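A label variable binds to the label of an edge rather than to a node, so the query has to enumerate edges. The Python sketch below assumes the list-of-pairs encoding used for illustration, a hypothetical helper `all_edges`, and that matches means a regular-expression match against the whole value.

```python
import re

# The node below the biblio edge in the Shakespeare example.
biblio = [
    ("book", [("author", "Shakespeare"), ("title", "Macbeth"), ("date", 1622)]),
    ("book", [("author", "Smith"), ("title", "Best of Shakespeare"), ("date", 1992)]),
]

def all_edges(node):
    """Yield every (label, value) edge reachable from node, at any depth."""
    if isinstance(node, list):
        for label, child in node:
            yield label, child
            yield from all_edges(child)

# select L: X from biblio._*.L X where matches(".*Shakespeare.*", X)
result = [(l, x) for (l, x) in all_edges(biblio)
          if isinstance(x, str) and re.fullmatch(r".*Shakespeare.*", x)]

print(result)
# [('author', 'Shakespeare'), ('title', 'Best of Shakespeare')]
```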
Turning labels into data
[Figure: the example database graph with node n1 (paper): author “Crick”, author “Wallace”, title “The spiral DNA”, date 1956, and node n2 (book): author “Darwin”, title “Origin”, date 1848]
select publ: {type: L, author: A} from biblio.L X, X.author A
Result: {publ: {type: “paper”, author: “Crick”}, publ: {type: “paper”, author: “Wallace”}, publ: {type: “book”, author: “Darwin”}}
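Because a label variable is bound to a string, it can be copied into the result as a value. A minimal Python sketch of the query above, again assuming the illustrative list-of-pairs encoding:

```python
# The node below the biblio edge, restricted to n1 and n2 as in the figure.
biblio = [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
]

# select publ: {type: L, author: A} from biblio.L X, X.author A
result = [("publ", [("type", label), ("author", a)])
          for (label, x) in biblio           # L binds to the edge label
          for (al, a) in x if al == "author"]

for row in result:
    print(row)
```

The edge labels paper and book, which were part of the structure, now appear as data under the label type.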
An exercise • List all publications in 1992, their types, and titles.
Basic XML syntax XML is a textual representation of data. An element is a piece of text bounded by tags: <name> John </name> [Figure callouts: start-tag <name>, content John, end-tag </name>; together they form an element] An empty element <name></name> can be abbreviated as <name/>
Basic XML syntax Elements may contain subelements <person> <name> John </name> <tel> 112233 </tel> <email> john@123.edu </email> </person>
XML attributes An attribute is defined by a name-value pair within a tag <price currency = “dollar”> 500 </price> <length unit = “cm”> 25 </length>
XML attributes and elements <product> <name> widget </name> <price> 10 </price> </product> <product price = “10”> <name> widget </name> </product> <product name = “widget” price = “10”/>
XML and ssd-expressions <person> <name> John </name> <tel> 112233 </tel> <email> john@123.edu </email> </person> {person: {name: “John”, tel: 112233, email: “john@123.edu”}}
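The correspondence between the XML element and the ssd-expression can be sketched with Python's standard `xml.etree.ElementTree` module. The function name `to_ssd` and the list-of-pairs encoding are assumptions for illustration; attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET

xml_text = """<person>
  <name> John </name>
  <tel> 112233 </tel>
  <email> john@123.edu </email>
</person>"""

def to_ssd(elem):
    """Map an element to a (label, value) pair; subelements become nested pairs."""
    kids = list(elem)
    if not kids:
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, [to_ssd(k) for k in kids])

ssd = to_ssd(ET.fromstring(xml_text))
print(ssd)
# ('person', [('name', 'John'), ('tel', '112233'), ('email', 'john@123.edu')])
```

Note that tel comes out as the string '112233' rather than a number: element content is plain text, so any typing has to be added by the application.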
XML references <person id = “p1”> <name> John </name> <tel> 112233 </tel> </person> <person id = “p2”> <name> Peter </name> <tel> 998877 </tel> <boss idref = “p1”/> </person> [Figure callouts: id = “p1” is an element identifier; idref = “p1” is a reference attribute]
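Following a reference amounts to a lookup by identifier. The Python sketch below uses the standard `xml.etree.ElementTree` module; the enclosing `<db>` element is an assumption, added because an XML document needs a single root, and the index `by_id` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<db>
  <person id="p1"> <name>John</name> <tel>112233</tel> </person>
  <person id="p2"> <name>Peter</name> <tel>998877</tel> <boss idref="p1"/> </person>
</db>""")

# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in doc.iter() if e.get("id") is not None}

# Follow Peter's boss reference to the element it points at.
peter = by_id["p2"]
boss = by_id[peter.find("boss").get("idref")]
boss_name = boss.findtext("name")
print(boss_name)  # John
```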
Document Type Definitions <!DOCTYPE db [ <!ELEMENT db (person*)> <!ELEMENT person (name, age, email)> <!ELEMENT name (#PCDATA)> <!ELEMENT age (#PCDATA)> <!ELEMENT email (#PCDATA)> ]>
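Each content model in a DTD is a regular expression over child element names, so conformance can be sketched with ordinary regexes. Python's standard `xml.etree.ElementTree` does not validate against DTDs, so the sketch below restates the DTD above by hand in a hypothetical `content_models` table; it checks element structure only, not attributes.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the DTD, rewritten as regexes over "tag " tokens.
# (#PCDATA) elements allow no child elements, hence the empty pattern.
content_models = {
    "db": r"(person )*",
    "person": r"name age email ",
    "name": r"",
    "age": r"",
    "email": r"",
}

def conforms(elem):
    """True if elem and all its descendants satisfy their content models."""
    model = content_models.get(elem.tag)
    if model is None:                      # undeclared element
        return False
    child_tags = "".join(child.tag + " " for child in elem)
    if not re.fullmatch(model, child_tags):
        return False
    return all(conforms(child) for child in elem)

good = ET.fromstring(
    "<db><person><name>Eva</name><age>40</age><email>e@x.org</email></person></db>")
bad = ET.fromstring(
    "<db><person><name>Eva</name><email>e@x.org</email><age>40</age></person></db>")

print(conforms(good), conforms(bad))  # True False
```

The second document fails only because email precedes age, which shows that the sequence operator in a content model fixes the order of subelements.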
An exercise on DTDs as schemas <db> <r1> <a> a1 </a> <b> b1 </b> </r1> <r1> <a> a2 </a> <b> b2 </b> </r1> <r2> <c> a1 </c> <d> b1 </d> </r2> <r2> <c> c2 </c> <d> d2 </d> </r2> <r3> <a> a1 </a> <c> b1 </c> </r3> </db> Write down a DTD for the data above!
Attributes in DTDs <product> <name language = “Swedish” department = “music”> trumpet </name> <price currency = “dollar”> 500 </price> <length unit = “cm”> 25 </length> </product> <!ATTLIST name language CDATA #REQUIRED department CDATA #IMPLIED> <!ATTLIST price currency CDATA #REQUIRED> <!ATTLIST length unit CDATA #REQUIRED>
Reference attributes in DTDs <!DOCTYPE people [ <!ELEMENT people (person*)> <!ELEMENT person (name)> <!ELEMENT name (#PCDATA)> <!ATTLIST person id ID #REQUIRED boss IDREF #REQUIRED friends IDREFS #IMPLIED> ]>
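The attribute declarations above can be checked by hand with a short Python sketch. It is not a DTD validator: the function `check_references` is hypothetical, the sample data (anna, bo, carl) is made up, and uniqueness of ID values is not checked.

```python
import xml.etree.ElementTree as ET

def check_references(people):
    """List violations of the id / boss / friends declarations on person."""
    persons = list(people.iter("person"))
    ids = {p.get("id") for p in persons if p.get("id") is not None}
    problems = []
    for p in persons:
        who = p.get("id", "?")
        if p.get("id") is None:                       # id is #REQUIRED
            problems.append("person without required id")
        boss = p.get("boss")
        if boss is None:                              # boss is #REQUIRED
            problems.append(f"{who}: required boss is missing")
        elif boss not in ids:                         # IDREF: exactly one id
            problems.append(f"{who}: boss {boss!r} is not the id of any element")
        for friend in p.get("friends", "").split():   # IDREFS: list of ids
            if friend not in ids:
                problems.append(f"{who}: friend {friend!r} is not the id of any element")
    return problems

doc = ET.fromstring("""<people>
  <person id="anna" boss="bo"> <name>Anna</name> </person>
  <person id="bo" boss="anna" friends="anna carl"> <name>Bo</name> </person>
</people>""")

problems = check_references(doc)
print(problems)
```

An IDREF value must be a single identifier, while an IDREFS value is a whitespace-separated list, which is why only friends is split.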
An exercise <people> <person id = “sven” boss = “olle”> <name> Sven Svensson </name> </person> <person id = “olle” friends = “nils eva”> <name> Olle Olsson </name> </person> <person id = “pelle” boss = “nils eva”> <name> Per Persson </name> </person> </people> Does this XML element conform to the previous DTD?
Limitations of DTDs as schemas • DTDs impose order • No base types • The types of IDREFs cannot be constrained
XSL - extensible stylesheet language <bib> <book> <title> t1 </title> <author> a1 </author> <author> a2 </author> </book> <paper> <title> t2 </title> <author> a3 </author> <author> a4 </author> </paper> <book> <title> t3 </title> <author> a5 </author> <author> a6 </author> </book> </bib>
Template rules and XSL patterns <xsl:template> <xsl:apply-templates/> </xsl:template> <xsl:template match = “bib/*/title”> <result> <xsl:value-of/> </result> </xsl:template> Result: <result> t1 </result> <result> t2 </result> <result> t3 </result> [Figure callouts: each xsl:template element is a template rule; the match value bib/*/title is an XSL pattern]
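The effect of the pattern bib/*/title can be imitated in Python with the path support in the standard `xml.etree.ElementTree` module. This sketch only selects the matching elements and wraps their text; it is not an XSL processor.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book> <title> t1 </title> <author> a1 </author> <author> a2 </author> </book>
  <paper> <title> t2 </title> <author> a3 </author> <author> a4 </author> </paper>
  <book> <title> t3 </title> <author> a5 </author> <author> a6 </author> </book>
</bib>""")

# bib is the root element, so bib/*/title becomes ./*/title relative to it:
# any child of bib, then a title child of that.
titles = [t.text.strip() for t in bib.findall("./*/title")]

# Wrap each selected title the way the second template rule does.
results = [f"<result> {t} </result>" for t in titles]
print(results)
```

The `*` step plays the same role as `_` in the path expressions used earlier for semistructured data.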
Two exercises
select row: {title: Y, date: Z} from biblio.paper X, X.title Y, X.date Z
Result: {row: {title: “The spiral DNA”, date: 1956}, …}
select row: {author: Y, date: Z} from biblio.book X, X.author Y, X.date Z
Which authors have written a book or a paper in 1992? select author: X from biblio.(book | paper) Y, Y.author X where Y.date = 1992
Which authors have written a book together with Jones? select author: X from biblio.book Y, Y.author X where “Jones” in Y.author
Which authors have written both a book and a paper?
select author: A from biblio.book B, biblio.paper P, B.author A where B.author = P.author
select author: A1 from biblio.book B, biblio.paper P, B.author A1, P.author A2 where A1 = A2
List all publications in 1992, their types, and titles. select publ: {type: L, title: T} from biblio.L X, X.title T where X.date = 1992
<!DOCTYPE db [ <!ELEMENT db (r1*, r2*, r3*)> <!ELEMENT r1 (a, b)> <!ELEMENT r2 (c, d)> <!ELEMENT r3 (a, c)> <!ELEMENT a (#PCDATA)> <!ELEMENT b (#PCDATA)> <!ELEMENT c (#PCDATA)> <!ELEMENT d (#PCDATA)> ]> <db> <r1> <a> a1 </a> <b> b1 </b> </r1> <r1> <a> a2 </a> <b> b2 </b> </r1> <r2> <c> a1 </c> <d> b1 </d> </r2> <r2> <c> c2 </c> <d> d2 </d> </r2> <r3> <a> a1 </a> <c> b1 </c> </r3> </db>