Management of XML and Semistructured Data

Lecture 7: XML-QL, Structural Recursion
Monday, April 23, 2001

XML-QL
• First declarative language for XML
• How to obtain a query language for XML fast?
  • Assume OEM as data model
  • Use features from UnQL and StruQL
    • Patterns
    • Templates
    • Skolem functions
  • Design XML-like syntax

Patterns in XML-QL
Find all authors who published in Morgan Kaufmann:
    WHERE <book language="french">
              <publisher> <name> Morgan Kaufmann </> </>
              <author> $A </>
          </book> in "www.a.b.c/bib.xml"
    CONSTRUCT <author> $A </>
Abbreviation: </> closes any tag.

Patterns in XML-QL
Find all languages in which Jones' coauthors have published:
    where <book language=$X>
              <author> $A </author>
          </book> in "www.a.b.c/bib.xml",
          <book>
              <author> $A </author>
              <author> Jones </author>
          </book> in "www.a.b.c/bib.xml"
    construct <result> $X </>
There is a join here: the variable $A occurs in both patterns.
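The join in the coauthor query can be mimicked outside XML-QL. The Python sketch below is only an illustration: the list `books` is a hypothetical stand-in for bib.xml, and the helper name `coauthor_languages` is not part of the lecture.

```python
# Hypothetical stand-in for www.a.b.c/bib.xml: one record per <book>.
books = [
    {"language": "English", "authors": ["Smith", "Jones"]},
    {"language": "Mandarin", "authors": ["Smith"]},
    {"language": "French", "authors": ["Doe"]},
]

def coauthor_languages(books, name="Jones"):
    """Languages $X of books by some $A who shares a book with `name`."""
    result = set()
    for b1 in books:                      # first pattern: binds $X and $A
        for a in b1["authors"]:
            for b2 in books:              # second pattern: $A together with Jones
                if a in b2["authors"] and name in b2["authors"]:
                    result.add(b1["language"])
    return result

print(sorted(coauthor_languages(books)))  # ['English', 'Mandarin']
```

Note that in this sketch Jones himself also counts as a binding for $A, since nothing in the two patterns forces the two author elements to differ.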

Constructors in XML-QL
Find all authors and the languages in which they published:
    where <book language = $L>
              <author> $A </>
          </> in "www.a.b.c/bib.xml"
    construct <result>
                  <author> $A </>
                  <lang> $L </>
              </>
Result is:
    <result> <author>Smith</author> <lang>English</lang> </result>
    <result> <author>Smith</author> <lang>Mandarin</lang> </result>
    <result> <author>Doe</author> <lang>English</lang> </result>
    . . . .

Nested Queries in XML-QL
Find all authors and the languages in which they published; group by authors:
    WHERE <book.author> $A </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result>
                  <author> $A </>
                  WHERE <book language = $L>
                            <author> $A </>
                        </> in "www.a.b.c/bib.xml"
                  CONSTRUCT <lang> $L </>
              </>
Note: book.author is a (regular) path expression.

Result is:
    <result>
        <author>Smith</author>
        <lang>English</lang> <lang>Mandarin</lang> <lang>…</lang> …
    </result>
    <result>
        <author>Doe</author>
        <lang>English</lang> …
    </result>
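The nested query behaves like a correlated subquery: the inner WHERE/CONSTRUCT is run once per binding of $A. A minimal Python sketch of that evaluation order, assuming a hypothetical in-memory list `books` in place of bib.xml (the name `languages_by_author` is illustrative only):

```python
# Hypothetical stand-in for bib.xml.
books = [
    {"language": "English", "authors": ["Smith", "Doe"]},
    {"language": "Mandarin", "authors": ["Smith"]},
]

def languages_by_author(books):
    """One (author, languages) pair per distinct author, like one <result> each."""
    authors = []
    for book in books:                    # outer WHERE: bind $A via book.author
        for a in book["authors"]:
            if a not in authors:
                authors.append(a)
    grouped = []
    for a in authors:                     # inner WHERE/CONSTRUCT, run per $A
        langs = [b["language"] for b in books if a in b["authors"]]
        grouped.append((a, langs))
    return grouped

print(languages_by_author(books))
# [('Smith', ['English', 'Mandarin']), ('Doe', ['English'])]
```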

Skolem Functions in XML-QL
Same query, with Skolem functions:
    WHERE <book language = $L>
              <author> $A </>
          </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result id=F($A)>
                  <author> $A </>
                  <lang> $L </>
              </>
Assumptions:
• the ID attribute is always id
• default Skolem function for author is G($A), for lang is H($A, $L) (why?)

Skolem Functions in XML-QL
Object fusion with Skolem functions and block structure
- Compile a complete list of authors, from two sources
    { WHERE <book>
                <author> $A </>
                <title> $T </>
            </> in "www.a.b.c/bib.xml"
      CONSTRUCT <person id=F($A)>
                    <name id=G($A)> $A </>
                    <booktitle> $T </>         /* implicit Skolem function H($A, $T) */
                </>
    }
    { WHERE <paper>
                <author> $A </>
                <title> $T </>
                <journal> $J </>
            </> in "www.d.e.f/papers.xml"
      CONSTRUCT <person id=F($A)>
                    <name id=G($A)> $A </>
                    <papertitle> $T </>        /* implicit Skolem function J($A, $T) */
                    <journaltitle> $J </>      /* implicit Skolem function K($A, $T) */
                </>
    }

Result: (some have only books, others only papers, others have both)
    <person>
        <name>Smith</name>
        <booktitle>Book1</booktitle>
        <booktitle>Book2</booktitle>
    </person>
    <person>
        <name>Jones</name>
        <booktitle>Book3</booktitle>
        <papertitle>paper1</papertitle>
        <journaltitle>journal1</journaltitle>
    </person>
    <person>
        <name>Mark</name>
        <papertitle>paper2</papertitle>
        <journaltitle>journal3</journaltitle>
    </person>
    …
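Object fusion can be pictured as a dictionary keyed by Skolem terms: constructing an element whose term already exists extends that element instead of creating a new one. The Python sketch below is an illustration under assumptions: sources are flattened to hypothetical tuples, and terms such as `("F", author)` stand for F($A).

```python
def fuse(books, papers):
    """Build <person> objects; equal Skolem terms denote the same object."""
    objects = {}                                   # Skolem term -> children
    for author, title in books:
        person = objects.setdefault(("F", author), {})
        person[("G", author)] = ("name", author)
        person[("H", author, title)] = ("booktitle", title)
    for author, title, journal in papers:
        person = objects.setdefault(("F", author), {})
        person[("G", author)] = ("name", author)   # same term: no second <name>
        person[("J", author, title)] = ("papertitle", title)
        person[("K", author, title)] = ("journaltitle", journal)
    return {term[1]: list(children.values()) for term, children in objects.items()}

# Hypothetical flattened contents of the two sources.
books = [("Smith", "Book1"), ("Smith", "Book2"), ("Jones", "Book3")]
papers = [("Jones", "paper1", "journal1"), ("Mark", "paper2", "journal3")]
for name, children in fuse(books, papers).items():
    print(name, children)
```

Jones appears in both sources, so his book and paper children end up under a single person, while Smith and Mark each come from one source only.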

Skolem Functions in XML-QL
"Wrong" query number 1:
    WHERE <book language = $L>
              <author> $A </>
          </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result id=F($A)>
                  <author id=G($A)> $A </>
                  <lang id=H($A)> $L </>
              </>
What is "wrong" here?

Skolem Functions in XML-QL
"Wrong" query number 2:
    WHERE <book language = $L>
              <author> $A </>
          </> in "www.a.b.c/bib.xml"
    CONSTRUCT <result id=F($A,$L)>
                  <author id=G($A)> $A </>
                  <lang id=H($A,$L)> $L </>
              </>
What is "wrong" here?

Skolem Functions in XML-QL
"Wrong" query number 3:
    { WHERE <book language = $L>
                <author> $A </>
            </> in "www.a.b.c/bib.xml"
      CONSTRUCT <author id=F($A)>
                    <lang id=H($A,$L)> $L </>
                </>
    }
    { WHERE <person>
                <city> $C </>
                <fluent-in> $X </>
            </> in "www.a.b.c/bib.xml"
      CONSTRUCT <location id=G($C)>
                    <lang id=H($C,$L)> $L </>
                </>
    }

Three Rules to Construct Only Trees
Rule 1: nested elements must have Skolem functions that are… [how ??]
    CONSTRUCT <tag1 id=F([args1])>
                  <tag2 id=G([args2])> … </>
              </>
Rule 2: an element that has an atomic content must have a Skolem function that is… [how ??]
    CONSTRUCT … <tag id=F([args])> $X </> …
Rule 3: if a Skolem function occurs in two different places then the following condition must hold… [which ??]
    { CONSTRUCT <tag1 id=G([args1])> <tag id=F([args])> … </> </> }
    { CONSTRUCT <tag1 id=H([args2])> <tag id=F([args])> … </> </> }

XML-QL vs. XQuery
• XQuery (= Quilt) vs. XML-QL
  + faithful XML data model
  + XPath sublanguage
  + aggregate functions (like in SQL)
  + some features from XQL
• Patterns
• Skolem functions

A Different Paradigm: Structural Recursion
Data as sets with a union operator:
    {a:3, a:{b:"one", c:5}, b:4} = {a:3} U {a:{b:"one",c:5}} U {b:4}

Structural Recursion
Example: retrieve all integers in the data
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = f($T)
    f({}) = {}
    f($V) = if isInt($V) then {result: $V} else {}
[Figure: input tree with edges a, b, c and leaves 3, "one", 5, 4; output with three result edges leading to 3, 5, 4]
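The four equations translate almost line by line into a recursive function. In the Python sketch below, the encoding is an assumption made for illustration: a set {l1:t1, ..., ln:tn} is a list of (label, subtree) pairs, union is list concatenation, and atomic values are plain Python values.

```python
def f(t):
    """Collect every integer leaf as a {result: v} edge."""
    if isinstance(t, list):                # t = {l1:t1} U {l2:t2} U ...
        out = []
        for _label, sub in t:              # f({L: T}) = f(T): the label is dropped
            out += f(sub)                  # f(T1 U T2) = f(T1) U f(T2)
        return out                         # an empty list covers f({}) = {}
    # Atomic value: keep it only if it is an integer.
    return [("result", t)] if isinstance(t, int) else []

data = [("a", 3), ("a", [("b", "one"), ("c", 5)]), ("b", 4)]
print(f(data))   # [('result', 3), ('result', 5), ('result', 4)]
```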

Structural Recursion
What does this do?
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = if $L=a then {b:f($T)} else {$L:f($T)}
    f({}) = {}
    f($V) = $V
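One way to check a guess is to run the function on a small input. The sketch below assumes an illustrative encoding (sets as lists of (label, subtree) pairs, atomic values as plain Python values) and reuses the sample data {a:3, a:{b:"one", c:5}, b:4}.

```python
def f(t):
    if isinstance(t, list):                # a set of labeled edges
        # f({L: T}) = if L = a then {b: f(T)} else {L: f(T)}
        return [("b" if label == "a" else label, f(sub)) for label, sub in t]
    return t                               # f($V) = $V

data = [("a", 3), ("a", [("b", "one"), ("c", 5)]), ("b", 4)]
print(f(data))
# [('b', 3), ('b', [('b', 'one'), ('c', 5)]), ('b', 4)]
```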

Structural Recursion
What does this do?
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = {$L:{$L:f($T)}}
    f({}) = {}
    f($V) = $V
Input = tree with n nodes
Output = ???

Structural Recursion
Example: increase all engine prices by 10%
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = if $L = engine then {$L: g($T)} else {$L: f($T)}
    f({}) = {}
    f($V) = $V
    g($T1 U $T2) = g($T1) U g($T2)
    g({$L: $T}) = if $L = price then {$L: 1.1*$T} else {$L: g($T)}
    g({}) = {}
    g($V) = $V
[Figure: input and output trees with engine and body subtrees over part and price edges; prices under engine increase by 10% (1000 to 1100, 100 to 110)]
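The two mutually recursive functions can be tried out directly. In this Python sketch the tree encoding (lists of (label, subtree) pairs) and the sample `car` data are assumptions made for illustration; `round` is there only to hide floating-point noise.

```python
def f(t):
    """Copy the tree; switch to g below every engine edge."""
    if isinstance(t, list):
        return [(l, g(s) if l == "engine" else f(s)) for l, s in t]
    return t                               # f($V) = $V

def g(t):
    """Inside an engine: scale every price, copy everything else."""
    if isinstance(t, list):
        # Assumes a price edge always leads to a number, as 1.1*$T requires.
        return [(l, round(1.1 * s, 2) if l == "price" else g(s)) for l, s in t]
    return t                               # g($V) = $V

# Hypothetical input: prices occur under both engine and body.
car = [
    ("engine", [("part", [("price", 1000)]), ("price", 100)]),
    ("body", [("part", [("price", 1000)]), ("price", 100)]),
]
print(f(car))
```

Only the prices below `engine` change; those below `body` are copied by f unchanged.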

Structural Recursion
Retrieve all subtrees reachable by (a.b)*.a
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = if $L = a then g($T) U $T else { }
    f({}) = { }
    f($V) = { }
    g($T1 U $T2) = g($T1) U g($T2)
    g({$L: $T}) = if $L = b then f($T) else { }
    g({}) = { }
    g($V) = { }
[Figure: a path labeled a, b, a]
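The functions f and g act like the two states of an automaton for (a.b)*.a. A Python sketch, assuming an illustrative encoding in which sets are lists of (label, subtree) pairs; as a further assumption, an atomic value reached by an a edge contributes no edges to the union.

```python
def f(t):
    """Edges of every subtree reached by a path matching (a.b)*.a."""
    out = []
    if isinstance(t, list):
        for label, sub in t:
            if label == "a":
                out += g(sub)              # keep going: a further b.a may follow
                if isinstance(sub, list):
                    out += sub             # ... U $T: the reached subtree itself
    return out                             # f({}) = f($V) = {}

def g(t):
    """State after an a edge: only a b edge leads back to f."""
    out = []
    if isinstance(t, list):
        for label, sub in t:
            if label == "b":
                out += f(sub)
    return out

data = [("a", [("b", [("a", [("c", [])])]), ("d", [])])]
print(f(data))
# [('c', []), ('b', [('a', [('c', [])])]), ('d', [])]
```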

Structural Recursion: General Form
    f1($T1 U $T2) = f1($T1) U f1($T2)
    f1({$L: $T}) = E1($L, f1($T),...,fk($T), $T)
    f1({}) = { }
    f1($V) = { }
    . . . .
    fk($T1 U $T2) = fk($T1) U fk($T2)
    fk({$L: $T}) = Ek($L, f1($T),...,fk($T), $T)
    fk({}) = { }
    fk($V) = { }
Each of E1, ..., Ek consists only of {_ : _}, U, if_then_else_

Evaluating Structural Recursion
Recursive evaluation:
• Compute the functions recursively, starting with f1 at the root
Termination is guaranteed.
How efficiently can we evaluate this?

Structural Recursion
Consider this:
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = {$L:f($T), $L:f($T)}
    f({}) = {}
    f($V) = $V

Naive Recursive Evaluation
Input tree = n nodes
Output tree = 2^(n+1) - 1 nodes
[Figure: input chain a, b, c, d; output tree in which every level doubles: two a edges, four b edges, eight c edges]

Efficient Recursive Evaluation
Recursive evaluation with function memoization. PTIME complexity.
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = {$L:f($T), $L:f($T)}
    f({}) = {}
    f($V) = $V
[Figure: input chain a, b, c, d; output in which memoized results are shared, so each label appears only twice]
Alternatively: apply the function in parallel to each input edge → Bulk Evaluation
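The gap between naive and memoized evaluation shows up when counting calls of the doubling function. The Python sketch below is an illustration under assumptions: trees are nested tuples of (label, subtree) pairs so that they can serve as cache keys, and `chain` and `count_calls` are hypothetical helper names.

```python
from functools import lru_cache

def chain(labels, leaf):
    """Build the chain tree {l1: {l2: ... leaf}} as nested tuples."""
    t = leaf
    for label in reversed(labels):
        t = ((label, t),)
    return t

def count_calls(tree, memoize):
    """Evaluate f({L: T}) = {L: f(T), L: f(T)} and count invocations of f."""
    calls = 0

    def body(t):
        nonlocal calls
        calls += 1
        if isinstance(t, tuple):           # a set of edges
            out = ()
            for label, sub in t:
                out += ((label, f(sub)), (label, f(sub)))
            return out
        return t                           # f($V) = $V

    # With memoization, the second f(sub) is a cache hit and shares the result.
    f = lru_cache(maxsize=None)(body) if memoize else body
    f(tree)
    return calls

tree = chain("abcdefghij", "k")            # a chain of 10 edges
print(count_calls(tree, memoize=False), count_calls(tree, memoize=True))  # 2047 11
```

The memoized result is a shared structure (a DAG) rather than a fully expanded tree, which is what keeps the cost polynomial.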

Bulk Evaluation
Sometimes f doesn't return anything → use ε edges
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = if $L=c then $T else f($T)
    f({}) = {}
    f($V) = $V
[Figure: example tree with edges a, b, c, d and the result drawn with ε edges]

Epsilon Edges
Meaning of ε edges:
[Figure: a graph with edges a, b, c, d and an ε edge, shown equal to the corresponding graph without the ε edge]

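Epsilon Edges
Note: union becomes easy to draw with ε edges:
[Figure: T1 U T2 drawn as a new root with one ε edge to T1 and one ε edge to T2]
Example:
[Figure: the union of a tree with edges a, b and a tree with edges c, d, e, drawn first with two ε edges and then as a single tree with edges a, b, c, d, e]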
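Read operationally, an ε edge from x to y means that x has all the edges that y has. The Python sketch below removes ε edges on that reading; the graph encoding (a dict from node to a list of (label, target) pairs, with `None` marking ε) and the node names are assumptions made for illustration.

```python
EPS = None   # marker for an ε edge (an encoding choice of this sketch)

def eliminate_epsilon(graph):
    """Give every node the labeled edges of all nodes it reaches via ε edges."""
    result = {}
    for node in graph:
        seen, stack, edges = set(), [node], []
        while stack:
            n = stack.pop()
            if n in seen:                  # guards against ε cycles
                continue
            seen.add(n)
            for label, target in graph[n]:
                if label is EPS:
                    stack.append(target)   # follow the ε edge
                elif (label, target) not in edges:
                    edges.append((label, target))
        result[node] = edges
    return result

# T1 U T2 drawn with ε edges: a fresh root pointing at both trees.
graph = {
    "root": [(EPS, "t1"), (EPS, "t2")],
    "t1": [("a", "x"), ("b", "y")],
    "t2": [("c", "z")],
    "x": [], "y": [], "z": [],
}
print(sorted(eliminate_epsilon(graph)["root"]))
# [('a', 'x'), ('b', 'y'), ('c', 'z')]
```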

Bulk Evaluation
    f1($T1 U $T2) = f1($T1) U f1($T2)
    f1({$L: $T}) = E1($L, f1($T),...,fk($T), $T)
    f1({}) = { }
    f1($V) = { }
    . . . .
    fk($T1 U $T2) = fk($T1) U fk($T2)
    fk({$L: $T}) = Ek($L, f1($T),...,fk($T), $T)
    fk({}) = { }
    fk($V) = { }
Idea: "apply" E1, ..., Ek independently on each edge, then connect with ε edges → PTIME

Bulk Evaluation
Recall (a.b)*.a:
    f($T1 U $T2) = f($T1) U f($T2)
    f({$L: $T}) = if $L = a then g($T) U $T else { }
    f({}) = { }
    f($V) = { }
    g($T1 U $T2) = g($T1) U g($T2)
    g({$L: $T}) = if $L = b then f($T) else { }
    g({}) = { }
    g($V) = { }
[Figure: input graph with edges labeled a, b, c, d and the output obtained by applying f and g to every edge and connecting the pieces]
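A possible reading of bulk evaluation for this pair of functions: make one output node per (function, input node) pair, let every input edge contribute ε edges independently, then follow the ε edges from f at the root. The Python sketch below is a simplification under assumptions: the graph is a dict from node to a list of (label, target) pairs, and instead of materializing the output graph it returns the input nodes whose subtrees belong to the answer.

```python
def bulk_a_b_star_a(graph, root):
    """Input nodes reachable from root by a path matching (a.b)*.a."""
    eps = {}                                   # ε edges between output nodes
    for x, edges in graph.items():             # each edge is handled on its own
        for label, y in edges:
            if label == "a":                   # f({a: T}) = g(T) U T
                eps.setdefault(("f", x), set()).update({("g", y), ("in", y)})
            elif label == "b":                 # g({b: T}) = f(T)
                eps.setdefault(("g", x), set()).add(("f", y))
    # Connect: follow ε edges from f(root); everything else is useless.
    seen, stack, reached = set(), [("f", root)], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if n[0] == "in":                       # a copy of an input subtree
            reached.add(n[1])
        stack.extend(eps.get(n, ()))
    return reached

# Hypothetical cyclic input: 1 -a-> 2, 2 -b-> 1, 2 -c-> 3.
cyclic = {1: [("a", 2)], 2: [("b", 1), ("c", 3)], 3: []}
print(bulk_a_b_star_a(cyclic, 1))              # {2}
```

Because each edge is processed once and the ε traversal marks visited nodes, the sketch terminates even on the cyclic input.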

Structural Recursion
• Can evaluate in two ways:
  • Recursively: memoize functions' results
  • Bulk: apply all functions on all edges, in parallel, connect, eliminate what is useless
• Complexity: PTIME
  • More precisely: NLOGSPACE
• Works on graphs with cycles too!

XSL
• two W3C drafts: XSLT and XPath
  • http://www.w3.org/TR/xpath, 11/99
  • http://www.w3.org/TR/WD-xslt, 11/99
• in commercial products (e.g. IE5.0)
• purpose: stylesheet specification language:
  • stylesheet: XML -> HTML
  • in general: XML -> XML

XSL Templates and Rules
• query = collection of template rules
• template rule = match pattern + template
Retrieve all book titles:
    <xsl:template> <xsl:apply-templates/> </xsl:template>
    <xsl:template match = "/bib/*/title">
        <result> <xsl:value-of/> </result>
    </xsl:template>

Flow Control in XSL
    <xsl:template> <xsl:apply-templates/> </xsl:template>
    <xsl:template match="a">
        <A> <xsl:apply-templates/> </A>
    </xsl:template>
    <xsl:template match="b">
        <xsl:apply-templates/>
    </xsl:template>
    <xsl:template match="c">
        <C> <xsl:value-of/> </C>
    </xsl:template>

Input:
    <a>
        <e>
            <c> 1 </c>
            <c> 2 </c>
            <a> <c> 3 </c> </a>
        </e>
        <c> 4 </c>
    </a>
Output:
    <A>
        <C> 1 </C>
        <C> 2 </C>
        <A> <C> 3 </C> </A>
        <C> 4 </C>
    </A>

XSL is Structural Recursion
Equivalent to:
    f(T1 U T2) = f(T1) U f(T2)
    f({L: T}) = if L = c then {C: T}
                else if L = b then {B: f(T)}
                else if L = a then {A: f(T)}
                else f(T)
    f({}) = {}
    f(V) = V
XSL query = single function
XSL query with modes = multiple functions
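The function f can be run on the sample document with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions, not an XSL processor: it treats `c` elements as holding only text, ignores text outside `c`, and follows the `b` rule as written in f.

```python
import xml.etree.ElementTree as ET

def f(node):
    """Return the list of output elements produced for one input element."""
    if node.tag == "c":                    # {C: T}: copy the content of c
        out = ET.Element("C")
        out.text = node.text               # assumes c contains only text
        return [out]
    # f(T1 U T2) = f(T1) U f(T2): process the children and concatenate.
    children = [r for child in node for r in f(child)]
    if node.tag in ("a", "b"):             # {A: f(T)} and {B: f(T)}
        out = ET.Element(node.tag.upper())
        out.extend(children)
        return [out]
    return children                        # any other label: just f(T)

doc = "<a><e><c>1</c><c>2</c><a><c>3</c></a></e><c>4</c></a>"
result = f(ET.fromstring(doc))[0]
print(ET.tostring(result, encoding="unicode"))
# <A><C>1</C><C>2</C><A><C>3</C></A><C>4</C></A>
```

The `e` element has no rule of its own, so it disappears from the output while its children are still processed.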

XSL and Structural Recursion
XSL:
• trees only
• may loop
Structural Recursion:
• arbitrary graphs
• always terminates
Add the following rule:
    <xsl:template match = "e">
        <xsl:apply-templates select="/"/>
    </xsl:template>
Result: stack overflow on IE 5.0
