An overview of requirements and concepts for querying XML using locator semantics, including a running example and XQL overview.
Querying XML with Locator Semantics Peter Fankhauser joint work with: Matthias Friedrich, Gerald Huck, Ingo Macherius, Jonathan Robie GMD German National Research Center for Information Technology Institute for Integrated Publication- and Informationsystems GMD-IPSI http://xml.darmstadt.gmd.de/
Overview • Requirements for Querying XML • XQL Overview • Locators • Locator Algebra • IPSI XML-Brokering Framework
General Requirements for Querying XML (Excerpt from Dave Maier, W3C QL 98) • Require no schema • flexibly match irregular structure • preserve (irregular) structure • Query & Preserve Order and Association • sibling order • hierarchy • Precise Semantics • rewrite rules • compositional semantics • Closedness/Completeness • XML to XML • when is a QL for XML complete?
Running Example • Bookstore: • Non Uniform Hierarchy • sci-fi: 2 levels • mystery: 3 levels • Customers: Flat Table
<books_and_customers>
  <bookstore>
    <fiction>
      <sci-fi>
        <book>
          <isbn>0006482805</isbn>
          <title>Do androids dream of electric sheep</title>
          <author>Philip K. Dick</author>
        </book>
      </sci-fi>
      <fantasy>
        <mystery>
          <book>
            <isbn>0261102362</isbn>
            <title>The two towers</title>
            <author>JRR Tolkien</author>
          </book>
        </mystery>
      </fantasy>
    </fiction>
  </bookstore>
  <customers>
    <customer>
      <name>Jason Woolsey</name>
      <boughtbooks>
        <isbn>0261102362</isbn>
        <isbn>0593488321</isbn>
      </boughtbooks>
    </customer>
    <customer>
      <name>P.W. Ellis</name>
      <boughtbooks>
        <isbn>0006482805</isbn>
        <isbn>0261102362</isbn>
      </boughtbooks>
    </customer>
  </customers>
</books_and_customers>
Functional Requirements for Querying XML (Dave Maier, W3C QL 98) • Selection and Extraction: • all sci-fi books by P.K. Dick • Reduction: • drop all authors but 1st author • Combination: • combine all books with their customers via isbn • Restructuring: • return flat lists of title/author pairs • and vice versa • Multidocument Handling: • get reviews and books from different sites • follow (dereference) links in books to authors
XQL Overview (State W3C QL 98) • Basic Concept: Selection of Subtrees • Originated as QL for DOM • adopted for selectors in XSL-templates (now merged with XPointer to XPel to XPath to ????) • Defined along search contexts = an (ordered) set of document nodes • Path Expressions and Filters: • A query is essentially a navigation in element trees • Navigation and filters modify the search context • Query result is the last search context • Selection of nodes by: • Element and attribute name • Type (element, attribute, comment, etc.) • Content or value of nodes • Relationship between nodes: hierarchy, sequence, index • Combination by: union, intersection
XQL 98 Examples • Selection and Extraction: • all books by P.K. Dick: //book[author="P.K. Dick"] • Reduction: • drop all but 1st author: //*?/book?/(isbn | author[0] | title) • * matches all elements along paths to book • shallow return operator (?) retains nesting hierarchy • union preserves document order (title before author)
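The two XQL 98 queries can be approximated with ordinary tools. The following is a minimal sketch in Python using the standard library's ElementTree and its limited XPath subset; it is not an XQL engine. The helper name `reduce_book` and the second author are assumptions added only to exercise the reduction, and the author string follows the running example's data ("Philip K. Dick").

```python
import copy
import xml.etree.ElementTree as ET

# Running example, shortened. The second <author> is hypothetical and only
# added so that "drop all but 1st author" has something to drop.
BOOKS = """<bookstore><fiction>
  <sci-fi><book>
    <isbn>0006482805</isbn>
    <title>Do androids dream of electric sheep</title>
    <author>Philip K. Dick</author>
    <author>Hypothetical Second Author</author>
  </book></sci-fi>
  <fantasy><mystery><book>
    <isbn>0261102362</isbn>
    <title>The two towers</title>
    <author>JRR Tolkien</author>
  </book></mystery></fantasy>
</fiction></bookstore>"""

root = ET.fromstring(BOOKS)

# Selection and extraction, roughly //book[author="Philip K. Dick"].
dick_books = root.findall(".//book[author='Philip K. Dick']")
titles = [book.findtext("title") for book in dick_books]


def reduce_book(book):
    """Copy a book, keeping isbn, title and only the first author."""
    kept = ET.Element("book")
    author_seen = False
    for child in book:  # children come in document order
        if child.tag == "author":
            if author_seen:
                continue
            author_seen = True
        if child.tag in ("isbn", "title", "author"):
            kept.append(copy.deepcopy(child))
    return kept


reduced = reduce_book(dick_books[0])
print(titles)
print([child.tag for child in reduced])
```

Unlike the `?` operator in XQL, this sketch returns bare book elements; the enclosing hierarchy is not retained.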
XQL 98 lacked: • Selection Functionality • comparison operators for fulltext (in progress) • regular path expressions for hierarchy (only // for recursive descent and * for matching all nodes in a search context) • Restructuring • Suggestions: return operators (SAG), XSLT (W3C), Application Level (e.g. WebMethods) • Combination • joins; Suggestions: see below • Graphs • no navigation along ID/IDREF • no multi-documents (dereferencing URIs) • Suggestions: docref, ref, keyref, idref • Delegation • external functions • wrappers
Extended XQL Examples • Combination: • combine all books with customers via isbn: $root//*?/book?[$i:=isbn]/(* | $root//customer?[boughtbooks/isbn=$i]) • New concepts • combination with nodes outside of search context ($root//review) • correlation variables for expressing join predicate [$i:=isbn] • $root used for clarity... • Irregular structure of bookstore is preserved • Multidocuments/Delegation: • get multiple bookstores from a bookmark list (HTTP-GET): docref('http://www.bookstores')/docref(.//@href)//bookstore • the same with a form (HTTP-POST - simplified!): docref('http://www.bookstores/search.cfm','country','us')//bookstore • the same with a wrapper (application program delivering XML): wrapper("bookstore")//bookstore
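To make the intent of the combination query concrete, here is a minimal sketch of the same join written directly against the running example with Python's ElementTree. It does not evaluate extended XQL; the function name `attach_customers` and the choice to nest matches under a `<customers>` wrapper inside each book are assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET

DOC = """<books_and_customers>
  <bookstore><fiction>
    <sci-fi><book><isbn>0006482805</isbn>
      <title>Do androids dream of electric sheep</title>
      <author>Philip K. Dick</author></book></sci-fi>
    <fantasy><mystery><book><isbn>0261102362</isbn>
      <title>The two towers</title>
      <author>JRR Tolkien</author></book></mystery></fantasy>
  </fiction></bookstore>
  <customers>
    <customer><name>Jason Woolsey</name>
      <boughtbooks><isbn>0261102362</isbn><isbn>0593488321</isbn></boughtbooks>
    </customer>
    <customer><name>P.W. Ellis</name>
      <boughtbooks><isbn>0006482805</isbn><isbn>0261102362</isbn></boughtbooks>
    </customer>
  </customers>
</books_and_customers>"""


def attach_customers(root):
    """Nest under every book copies of the customers who bought it (join on isbn)."""
    customers = root.findall(".//customer")      # nodes outside the book context
    for book in list(root.iter("book")):         # irregular nesting is left untouched
        isbn = book.findtext("isbn")             # plays the role of $i := isbn
        buyers = [
            c for c in customers
            if isbn in [i.text for i in c.findall("boughtbooks/isbn")]
        ]
        if buyers:
            wrapper = ET.SubElement(book, "customers")
            wrapper.extend([copy.deepcopy(c) for c in buyers])
    return root


joined = attach_customers(ET.fromstring(DOC))
for book in joined.iter("book"):
    names = [n.text for n in book.findall("customers/customer/name")]
    print(book.findtext("title"), "->", names)
```

The books stay where they were in the sci-fi and fantasy/mystery branches, so the irregular structure of the bookstore is preserved.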
Towards a Datamodel for querying XML
• XML Serialization: Structured Text
<document>
  <person id="jonathanr">
    <firstname>Jonathan</firstname>
    <lastname>Robie</lastname>
  </person>
  <person id="joel">
    <firstname>Joe</firstname>
    <lastname>Lapp</lastname>
  </person>
  <!-- ... -->
</document>
• W3C-DOM: Element Tree [Figure: tree with two person nodes (firstname, lastname) and an article node (author, author, title, year); leaf values Jonathan, Robie, Joe, Lapp, XQL for Dummies, 1999]
• OEM: Graph
• Relational Tables (generic massive join option)
• Locators: Lists of Paths
document
document.person
document.person.@id
document.person.@id."joel"
document.person.firstname
document.person.firstname."Joe"
document.person.lastname."Lapp"
document.person
document.person.@id
...
Locators for Bookstore
bookstore#1
bookstore#1.fiction#2
bookstore#1.fiction#2.sci-fi#3
bookstore#1.fiction#2.sci-fi#3.book#4
bookstore#1.fiction#2.sci-fi#3.book#4.isbn#5
bookstore#1.fiction#2.sci-fi#3.book#4.title#6
bookstore#1.fiction#2.sci-fi#3.book#4.author#7
…
bookstore#1.fiction#2.fantasy#8
bookstore#1.fiction#2.fantasy#8.mystery#9
bookstore#1.fiction#2.fantasy#8.mystery#9.book#10
bookstore#1.fiction#2.fantasy#8.mystery#9.book#10.isbn#11
bookstore#1.fiction#2.fantasy#8.mystery#9.book#10.title#12
bookstore#1.fiction#2.fantasy#8.mystery#9.book#10.author#13
...
Locators <-> XML Serialization • Locators are lists of paths • XML-document -> Locators • each element-node gets id in document-order (depth first, left to right traversal) • each element-node is located by the entire path from root • attributes are attached to element-nodes • content is attached to leaf-nodes • Locators -> XML-document: • clean up: discard locators $prefix which are followed by at least one locator $prefix.$postfix • generate tree: (1) for all locators generate nested serialization, (2) fill up with content and attributes • Mappings should be total, 1:1
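The mapping from an XML document to locators, and the clean-up step of the reverse mapping, can be sketched as follows. This is a minimal Python illustration, assuming a locator is represented as a tuple of (tag, id) pairs; the names `to_locators`, `clean_up` and `fmt` are hypothetical, and attributes and content are left out.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """<bookstore><fiction>
  <sci-fi><book><isbn>0006482805</isbn>
    <title>Do androids dream of electric sheep</title>
    <author>Philip K. Dick</author></book></sci-fi>
  <fantasy><mystery><book><isbn>0261102362</isbn>
    <title>The two towers</title>
    <author>JRR Tolkien</author></book></mystery></fantasy>
</fiction></bookstore>"""


def to_locators(xml_text, first_id=1):
    """Return one locator per element node, in document order."""
    locators = []
    next_id = [first_id]

    def visit(elem, prefix):
        # Ids are handed out depth first, left to right.
        path = prefix + ((elem.tag, next_id[0]),)
        next_id[0] += 1
        locators.append(path)       # the entire path from the root
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return locators


def clean_up(locators):
    """Discard every locator that is a proper prefix of another locator."""
    return [
        u for u in locators
        if not any(len(v) > len(u) and v[:len(u)] == u for v in locators)
    ]


def fmt(locator):
    return ".".join(f"{tag}#{n}" for tag, n in locator)


locs = to_locators(BOOKSTORE)
for u in locs:
    print(fmt(u))
```

After `clean_up(locs)` only the locators ending in leaf nodes remain, which is enough to regenerate the nested serialization.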
Locator Sets vs. Relations • Commonalities • flat sets • identity defined by identity of components • concatenation to derive new locators/tuples • Differences • arity • locators: variable length • tuples: fixed • access to components: • locators: by navigation • tuples: by position/attribute • data: • locator components: document nodes • tuple components: values
Locator Algebra (1) • Preliminaries • L: domain of locator sets; x, y ∈ L • PL: domain of locators; u, v ∈ PL • tail(u): last component of u • prefix(u): u − tail(u) • Tree-Operators • navigation in document tree using DOM methods • root, parent, children: PL → L • applied to locator sets from L using d-join (see below) • Set-Operators • ∪, ∩, −: L × L → L, defined as usual • order preservation due to total ordering on document nodes
Locator Algebra (2) • Select • select[p]: L → L, where p: PL → Boolean • select[p](x) = {u | u ∈ x, p(tail(u))} • Example: select[nodename(.) = "book"](x) = select["book"](x) • Return • corresponds to project • duplicates tail of locator for preserving it in subsequent d-join (see below) • return: PL → PL • return(u) = concat(u, tail(u))
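A minimal Python sketch of select and return, assuming locators are tuples of (name, id) components and locator sets are ordered lists. The helper names (`nodename`, `ret`, `fmt`) are illustrative only; `ret` stands in for return, which is a reserved word in Python.

```python
def tail(u):
    """Last component of locator u."""
    return u[-1]


def prefix(u):
    """Locator u without its last component."""
    return u[:-1]


def select(p, x):
    """select[p](x) = {u | u in x, p(tail(u))}, keeping the order of x."""
    return [u for u in x if p(tail(u))]


def nodename(name):
    """Predicate on a component: does the node carry this element name?"""
    return lambda node: node[0] == name


def ret(u):
    """return(u) = concat(u, tail(u)): duplicate the tail of the locator."""
    return u + (tail(u),)


def fmt(u):
    return ".".join(f"{name}#{n}" for name, n in u)


x = [
    (("fiction", 2), ("sci-fi", 3)),
    (("fiction", 2), ("sci-fi", 3), ("book", 4)),
    (("fiction", 2), ("sci-fi", 3), ("book", 4), ("title", 6)),
]

books = select(nodename("book"), x)
kept = [ret(u) for u in books]
print([fmt(u) for u in books])
print([fmt(u) for u in kept])
```

Because a d-join continues from `prefix(u)`, taking the prefix of a returned locator gives back the original locator; that is how the duplicated tail preserves the node.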
Locator Algebra (3) • Dependent-Join: • d-join[f]: L → L, where f: PL → L • d-join[f](x) = ⋃_{u ∈ x} concat(prefix(u), f(tail(u))) • Example: return all titles of books in their book context • select["title"](d-join[children(.)](select["book"](d-join[return(children(.))](x)))) = /book?/title • Kleene Star: • fixpoint-operator for recursive descent queries • *[f]: L → L, where f: L → L • *[f](x) = f(x) ∪ *[f](f(x)) • Example: select all titles in their original context • select["title"](d-join[children(.)](*[d-join[return(children(.))](.)](x))) = //*?/title • maybe too general for physical algebra
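The following Python sketch shows one possible reading of d-join and the Kleene star, evaluated on the bookstore tree. It assumes that return(children(.)) yields each child with its tail duplicated, so that the next d-join, which continues from prefix(u), keeps the node. The `TREE` dictionary and all helper names are assumptions for illustration, not part of the algebra.

```python
# Element tree of the bookstore example: node -> children, node = (name, id).
TREE = {
    ("bookstore", 1): [("fiction", 2)],
    ("fiction", 2): [("sci-fi", 3), ("fantasy", 8)],
    ("sci-fi", 3): [("book", 4)],
    ("book", 4): [("isbn", 5), ("title", 6), ("author", 7)],
    ("fantasy", 8): [("mystery", 9)],
    ("mystery", 9): [("book", 10)],
    ("book", 10): [("isbn", 11), ("title", 12), ("author", 13)],
}


def tail(u):
    return u[-1]


def prefix(u):
    return u[:-1]


def ordered(locators):
    """Drop duplicates and sort by the node ids along each locator."""
    return sorted(set(locators), key=lambda u: [n for _, n in u])


def children(node):
    """children: one single-component locator per child."""
    return [(c,) for c in TREE.get(node, [])]


def return_children(node):
    """return(children(.)): each child with its tail duplicated."""
    return [v + (tail(v),) for v in children(node)]


def d_join(f, x):
    """d-join[f](x): union over u in x of concat(prefix(u), f(tail(u)))."""
    return ordered([prefix(u) + v for u in x for v in f(tail(u))])


def star(f, x):
    """*[f](x) = f(x) union *[f](f(x)); ends because the tree is finite."""
    result, frontier = [], f(x)
    while frontier:
        result = ordered(result + frontier)
        frontier = f(frontier)
    return result


def select(name, x):
    return [u for u in x if tail(u)[0] == name]


def fmt(u):
    return ".".join(f"{name}#{n}" for name, n in u)


x = [(("bookstore", 1),)]
descend = lambda s: d_join(return_children, s)

# select["title"](d-join[children(.)](*[d-join[return(children(.))](.)](x)))
titles = select("title", d_join(children, star(descend, x)))
for u in titles:
    print(fmt(u))
```

Under this reading the start context itself is not part of the result, so the title locators begin at fiction#2.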
Locator Algebra (4) • Varbind, Varget • to realize joins across contexts • varbind[i,f]: L → L, where i ∈ Name, f: PL → L • varbind[i,f](x): for all u ∈ x: vars(u) := vars(u) ∪ ⋃_{v ∈ f(tail(u))} <i,v> • varget[i]: PL → L • varget[i](u) = {v | <i,v> ∈ vars(u)}
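A minimal Python sketch of varbind and varget, assuming vars(u) is kept in a dictionary from locators to sets of (name, value) pairs that is passed around explicitly. The `TREE` fragment, the `isbn_children` helper and the `bindings` parameter are hypothetical.

```python
# Fragment of the bookstore tree: node -> children, node = (name, id).
TREE = {
    ("book", 4): [("isbn", 5), ("title", 6), ("author", 7)],
    ("book", 10): [("isbn", 11), ("title", 12), ("author", 13)],
}


def tail(u):
    return u[-1]


def isbn_children(node):
    """Stands in for select["isbn"](children(.)): single-component locators."""
    return [(c,) for c in TREE.get(node, []) if c[0] == "isbn"]


def varbind(i, f, x, bindings):
    """For all u in x: vars(u) := vars(u) union {<i, v> | v in f(tail(u))}."""
    for u in x:
        bindings.setdefault(u, set()).update((i, v) for v in f(tail(u)))
    return x


def varget(i, u, bindings):
    """varget[i](u) = {v | <i, v> in vars(u)}."""
    return {v for (name, v) in bindings.get(u, set()) if name == i}


books = [
    (("sci-fi", 3), ("book", 4)),
    (("mystery", 9), ("book", 10)),
]
bindings = {}
varbind("$i", isbn_children, books, bindings)

print(varget("$i", books[0], bindings))
print(varget("$i", books[1], bindings))
```

A join predicate such as boughtbooks/isbn=$i can then compare the values of the bound isbn nodes with those found in another context.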
Join Example (1)
bc#0
$A = *[d-join[return(children(.))](.)](x) = //*?
  bc#0.bookstore#1
  bc#0.bookstore#1.fiction#2
  bc#0.bookstore#1.fiction#2.sci-fi#3
  ...
$B = select["book"](d-join[return(children(.))]($A)) = //*?/book
  bc#0.bs#1.f#2.sf#3.b#4
  bc#0.bs#1.f#2.fa#8.mi#9.b#10
  ...
$C = d-join[return(children(.))]($B) = //*?/book?/*
  bc#0.bs#1.f#2.sf#3.b#4.isbn#5
  bc#0.bs#1.f#2.sf#3.b#4.title#6
  ...
$D = varbind[$i, select["isbn"](children(.))]($B) = //*?/book[$i:=isbn]?
  bc#0.bs#1.f#2.sf#3.b#4<$i,isbn#5>
  bc#0.bs#1.f#2.fa#8.mi#9.b#10<$i,isbn#11>
  ...
$E = select["customer"](d-join[children(.)](*[d-join[return(children(.))](.)](d-join[root(.)]($D)))) = //*?/customer
  customers#14.customer#15
  customers#14.customer#20
$F = d-join[select[select["isbn"](d-join[children(.)](select["boughtbooks"](d-join[children(.)](.)))) = varget[$i](.)]($E)]($D) = //*?/book[$i:=isbn]?/(//*?/customer[boughtbooks/isbn=$i])
  bc#0.bs#1.f#2.sf#3.b#4.cs#14.customer#20
  bc#0.bs#1.f#2.fa#8.mi#9.b#10.cs#14.customer#15
  bc#0.bs#1.f#2.fa#8.mi#9.b#10.cs#14.customer#20
Join Example (2)
<books_and_customers>
  <bookstore>
    <fiction>
      <sci-fi>
        <book>
          <isbn>0006482805</isbn>
          <title>Do androids dream of electric sheep</title>
          <author>Philip K. Dick</author>
          <customers>
            <customer>
              <name>P.W. Ellis</name>
              <boughtbooks>
                <isbn>0006482805</isbn>
                <isbn>0261102362</isbn>
              </boughtbooks>
            </customer>
          </customers>
        </book>
      </sci-fi>
      <fantasy>
        <mystery>
          <book>
            <isbn>0261102362</isbn>
            <title>The two towers</title>
            <author>JRR Tolkien</author>
            <customers>
              <customer>
                <name>Jason Woolsey</name>
                <boughtbooks>
                  <isbn>0261102362</isbn>
                  <isbn>0593488321</isbn>
                </boughtbooks>
              </customer>
              <customer>
                <name>P.W. Ellis</name>
                <boughtbooks>
                  <isbn>0006482805</isbn>
                  <isbn>0261102362</isbn>
                </boughtbooks>
              </customer>
            </customers>
          </book>
        </mystery>
      </fantasy>
    </fiction>
  </bookstore>
</books_and_customers>
Some Equivalence Transformations for L’Algebra • Commutativity: • union(A,B) = union(B,A) (within single document) • but d-join is not commutative • Associativity: • union, intersect, d-join • Idempotence: • union(A,A) = A • Distributivity: • //book/(title | author) = //book/title | //book/author • Neutral Elements: • union: {} • d-join: $root(?)
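Some of these equivalences can be checked mechanically on small locator sets. The sketch below is a hedged illustration in Python, assuming locator sets are lists kept in document order by sorting on node ids; `order`, `union` and `intersect` are hypothetical helpers and cover only the set operators, not d-join.

```python
def order(locators):
    """Normalize a locator set: drop duplicates, sort by node ids."""
    return sorted(set(locators), key=lambda u: [n for _, n in u])


def union(a, b):
    return order(a + b)


def intersect(a, b):
    return order([u for u in a if u in b])


# Two small locator sets from the bookstore example (single document).
A = [
    (("book", 4), ("title", 6)),
    (("book", 10), ("title", 12)),
]
B = [
    (("book", 4), ("author", 7)),
    (("book", 10), ("title", 12)),
]
C = [(("book", 4), ("isbn", 5))]

commutative = union(A, B) == union(B, A)
associative = union(union(A, B), C) == union(A, union(B, C))
idempotent = union(A, A) == order(A)
neutral = union(A, []) == order(A)

print(commutative, associative, idempotent, neutral)
```

Commutativity holds here only because the result is re-sorted into document order, which is why the slide restricts it to a single document.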
Open Issues • Combination with relational algebra • Graphs/Multidocuments • DAGs: Multiple paths from root-context to node (serialization?) • Role of URIs in locators? • Typing • Role of XSD (XML Schema Description) • Inference • Constructors • attribute to element and vice versa ... • Grouping, Skolems • Details • Investigate conformance of locator concept to W3C Infoset • Constraints on locators/mappings to guarantee wellformedness • Political • XQL-Implementations shipping: underlying semantics node-based, not locator-based
The IPSI XML Brokering Framework [Architecture diagram; recoverable labels] • Visualization: HTML, CSS; XSL Processor; URL + Queries • Server (HTTP, URL) and Program, exchanging XQL queries and XML • Query processor: XML Query Language (XQL) • Data model: Document Object Model (W3C-DOM): DOM, Persistent DOM, Warehouse • Sources: HTTP/HTML Robot, Generic Wrappers, JEDI Framework, Specific Wrappers
Wrappers • JEDI Framework for Wrappers • Pivot Object Model • Scripting language for control-flow • Access to dynamic sources (ODBC, CORBA) with iterators • Generic Wrappers • Generic Mapping of structured formats to XML • Examples: SGML, XML, HTML, MS-RTF • JEDI Parser • for irregularly formatted sources • context free, attributed grammars • fault-tolerant, efficient parser: unlimited lookahead, interpretation of ambiguous, incomplete grammars by specificity ordering • HTTP-Access • Access plans for delegation integrated with XQL Engine
Mediator: XQL Engine + Persistent DOM • XQL 98 Implementation • efficient recursive descent queries by signature-index • + Joins • + Multi Document Handling • extends XQL with external references (via http-get, http-post) • Multidocument DOM; for every node namespace and URI • + User defined functions • input: context (reference-node-set, reference-node-pointer), parameters: constants, XQL-expressions (lazy evaluation) • output: node-functions, collection-functions (set of nodes), comparison-operators; can attach base-URIs • variables
Application 1: An XML Broker for Golfers [Figure labels: XML Broker, Query, XSL]
<golfplatz id="platz0001">
  <adresse> [...] </adresse>
  <policy> ... </policy>
  <handicap>
    <wochentag>34</wochentag>
    <wochenende>34</wochenende>
  </handicap>
</golfplatz>
<www.wetter.de>
  <wetter>
    <plz>87724</plz>
    <datum>981001</datum>
    <temperatur>16</temperatur>
    <regen>90</regen>
    <wind>9</wind>
    <prognose>13</prognose>
  </wetter>
  <!-- ... -->
</www.wetter.de>
<www.reiseplanung.de>
  <route>
    <von>53757</von>
    <nach>93333</nach>
    <entfernung>481.9</entfernung>
    <fahrzeit>274</fahrzeit>
    <karte>5375793333.gif</karte>
  </route>
  <!-- ... -->
</www.reiseplanung.de>
<golfdemo>
  <golfplatz>
    <adresse> ... </adresse>
    <greenfee> ... </greenfee>
    ...
  </golfplatz>
  <wetter> ... </wetter>
  <route> ... </route>
</golfdemo>
Application 2: RELIMO Integrating Bioinformatics Data [Architecture diagram; recoverable labels] • XML Application (e.g. Office 2000) • XML Browser (e.g. Mozilla 5) • XSL Formatter (e.g. Lotus-XSL) • XML Broker • RELIBASE with XML RPC • PDB as local PDOM
Application Data • XML Broker for Golfers • Sources: www.golffuehrer.de (500 KB), www.wetter.de (200 KB), www.routen-information.de (200 KB) • Joins (via zip-code) ~ 2 to 3 secs • RELIMO (Germany) • Sources: Relibase (XML-RPC), PDB (5 GB -> 25 MB XML, 30 MB PDOM) • response time (100 MB) 50 to 30000 ms • MIROWEB (ESPRIT) • JEDI for importing several sources to Oracle 8 • Shakespeare • all plays • 10 MB (Tests with duplicated data up to 0.5 GB)
Some Links & Acks • XQL FAQ • http://metalab.unc.edu/xql/ • IPSI XML Research & Development • http://xml.darmstadt.gmd.de • XQL-Engine 1.0.1 download (non-commercial use) • JEDI download (non-commercial use) • XML Brokering Framework Licensing Info (Infonyte) • hemmje@globit.com • www.infonyte.com • Many thanks to • Karl Aberer, Harald Schöning, Guido Mörkotte