280 likes | 394 Views
On the Path to Efficient XML Queries. Andrey Balmin, Kevin Beyer , Fatma Özcan IBM Almaden Research Center Matthias Nicola IBM Silicon Valley Lab. New languages = new abilities + new pitfalls. XQuery A new query language designed specifically for XML data SQL / XML
E N D
On the Path to Efficient XML Queries Andrey Balmin, Kevin Beyer, Fatma Özcan IBM Almaden Research Center Matthias Nicola IBM Silicon Valley Lab
New languages = new abilities + new pitfalls • XQuery • A new query language designed specifically for XML data • SQL / XML • Added XML as a data type, including XQuery sequences • Added XQuery as a sublanguage
Purpose • Teach users of the new XML query languages • Teach the teachers • Share our users’ experiences • Influence languages
Focus • Large databases with many moderate XML documents • Schema flexibility is required • Many schemas in one collection • No schema validation used • Schemas with xs:any • Documents like Atom Syndication and RSS that allow any extension • Therefore • Document filtering is primary concern • Limited type inference • Any data is possible
Index eligibility Index Eligibility: We say that an index I is eligible to answer predicate P of query Q, if for any collection of XML documents D, the following holds: Q(D) = Q( I( P,D )). Where I( P,D ) is the set of XML documents produced, by probing index I with predicate P. This is not as obvious as it is in relational databases.
XML indexes in DB2 • Index a linear XPath pattern over a column as a particular datatype CREATE INDEX index-name ON table(xml-column) USING 'pattern' AS type pattern ::= namespace-decls? (( / | // ) axis? ( name-test | kind-test ))+ axis ::= @ | child:: | attribute:: | self:: | descendant:: | descendant-or-self:: name-test ::= qname | * | ncname:* | *:ncname kind-test ::= node() | text() | comment() | processing-instruction() type ::= varchar | double | date | timestamp
Query pattern index pattern CREATE INDEX li_price ON orders(orddoc)USING XMLPATTERN '//lineitem/@price' AS double • Can use the index: more restrictive for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@price > 100 ]return $i • Cannot use the index: less restrictive for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@* > 100 ]return $i
Match index and query predicate data type CREATE INDEX li_price ON orders(orddoc)USING XMLPATTERN '//lineitem/@price' AS double • Can use the index: numeric predicate and index for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@price > 100 ]return $i • Cannot use the index: string predicate for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[lineitem/@price > "100" ]return $i
Data Types for Joins CREATE INDEX o_custid ON orders(orddoc) USING XMLPATTERN '//custid' AS double CREATE INDEX c_custid ON customer(cdoc) USING XMLPATTERN '/customer/id' AS double • Cannot use the indexes: unknown comparison type for $i in db2-fn:xmlcolumn("ORDERS.ORDDOC")/orderfor $j in db2-fn:xmlcolumn("CUSTOMER.CDOC")/customerwhere $i/custid = $j/idreturn $i • Can use the indexes: at least one cast required for $i in db2-fn:xmlcolumn("ORDERS.ORDDOC")/orderfor $j in db2-fn:xmlcolumn("CUSTOMER.CDOC")/customerwhere $i/custid/xs:double(.) = $j/id/xs:double(.)return $i
SQL/XML Query Functions • XMLQuery Scalar function that returns an (possibly empty) XQuery sequence for every row • XMLExists Predicate that returns true iff the XQuery sequence produced is not empty • XMLTable Produces a table with one row for each item in the row-producing XQuery sequence, and with one column per column-producing XQuery expression. The columns may be XQuery sequences or cast to simple SQL types.
Result XMLQuery does not filter rows (usually) • Cannot use the index: SELECT XMLQuery(‘ $order//lineitem[ @price > 100 ] ‘ passing orddoc as "order")FROM orders • Can use the index: VALUES (XMLQuery(’ db2-fn:xmlcolumn("ORDERS.ORDDOC") //lineitem[ @price > 100 ] ')) • Can use the index: db2-fn:xmlcolumn('ORDERS.ORDDOC') //lineitem[ @price > 100 ]
XMLExists filter rows (usually) • Can use the index SELECT ordid, orddocFROM ordersWHERE XMLExists(‘ $order//lineitem[ @price > 100 ]‘ passing orddoc as "order") • Cannot use the index: false exists SELECT ordid, orddocFROM ordersWHERE XMLExists(‘ $order//lineitem/@price > 100‘ passing orddoc as "order") Need XMLTest which uses XQuery’s Effective Boolean Value
XMLQuery + XMLExists vs. XMLTable • Can use the index SELECT ordid, XMLQuery(‘$order//lineitem[@price > 100] ’ passing orddoc as "order")FROM ordersWHERE XMLExists(‘$order//lineitem[@price > 100] ’ passing orddoc as "order") • XMLTable: More efficient and less redundant SELECT o.ordid, t.lineitemFROM orders o,XMLTable(‘$order//lineitem[@price > 100] ’ passing o.orddoc as "order“ COLUMNS "lineitem" XML BY REF PATH '.')as t(lineitem)
Predicates in XMLTable column expressions • Can use the index SELECT o.ordid, t.lineitemFROM orders o,XMLTable(‘$order//lineitem[@price > 100] ’ passing o.orddoc as "order“ COLUMNS "lineitem" XML BY REF PATH '.')as t(lineitem) • Cannot use the index SELECT o.ordid, t.lineitem, t.priceFROM orders o, XMLTable(‘ $order//lineitem ’ passing o.orddoc as "order" COLUMNS "lineitem" XML BY REF PATH '.', "price" DECIMAL(6,3) PATH '@price[. > 100]‘ )as t(lineitem, price)
Joining XML Values in SQL/XML • Can use index on product/id, but not p.id SELECT p.name, o.orddocFROM products p, orders oWHERE XMLExists(‘$order//lineitem/product[ id eq $pid ] ‘ passing o. orddoc as "order", p.id as "pid") • Can use index on p.id, but not product/id SELECT p.name, o.orddocFROM products p, orders oWHERE p.id = XMLCast( XMLQuery(‘ $order//lineitem/product/id ‘ passing o. orddoc as "order") as VARCHAR(13)) Need to unify XQuery and SQL data types
Joining XML Values in SQL/XML • Probably cannot use XML indexes: SQL types differ from XML SELECT c.name, o.orddocFROM orders o, customer cWHERE XMLCast( XMLQuery(‘ $order/order/custid ’ passing o.orddoc as "order") as DOUBLE) =XMLCast( XMLQuery(‘ $cust/customer/id ’ passing c.cdoc as "cust") as DOUBLE) • Can use XML indexes SELECT c.name, o.orddocFROM orders o, customer cWHERE XMLExists(‘ $order/order[ custid/xs:double(.) = $cust/customer/id/xs:double(.) ] ‘ passing o.orddoc as "order", c.cdoc as "cust")
XQuery Let Clauses • Can use the index for $doc in db2-fn:xmlcolumn('ORDERS.ORDDOC')for $item in $doc//lineitem[ @price > 100 ]return <result>{ $item }</result>for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderreturn $ord/lineitem[ @price > 100 ] • Cannot use the index for $doc in db2-fn:xmlcolumn('ORDERS.ORDDOC')let $item:= $doc//lineitem[ @price > 100 ]return <result>{ $item }</result>for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderreturn <result>{ $ord/lineitem[ @price > 100 ]}</result>
XQuery Let Clauses • Can use the index for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderwhere $ord/lineitem/@price > 100return <result>{ $ord/lineitem }</result> • Same as above for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/orderlet $price := $ord/lineitem/@pricewhere $price > 100return <result>{ $ord/lineitem }</result>
Context is everything • $i is bound to the document node for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') return $i/order/lineitem • $j is bound to <my_order> for $j in ( for $o in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order return <my_order>{ $o/* }</my_order> )return $j/my_order/lineitem
Remember the dot • Produces a type error: no document node at root let $order := <new_order>{ db2-fn:xmlcolumn('ORDERS.ORDDOC') /order[custid > 1001] } </new_order>return $order[ //customer/name ] • Absolute path expressions is a shorthand for fn:root(.) treat as document-node(). • Absolute path expressions are bad style
Construction and View Composition • Want to rewrite this… let $view := for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')/ order/lineitem return <ordered>{ $i/@quantity, $i/product/@price, <pid>{ $i/product/id/data(.) }</pid> }</ordered>for $j in $viewwhere $j/pid = '17‘return $j/@price
Construction and View Composition • … into this for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') /order/lineitemwhere $i/product/id/data(.) = '17‘return $i/product/@price • but…
Construction and View Composition • Data type changed to untypedAtomic • id is string: comparison is now an error • id is long: comparison is now as double instead of long • List types are concatenated • Error for duplicate @price attributes lost • New node identity lost • Parent axis is broken Any sequence should live in tree without change.Separate identity from construction?
Remember the namespaces • Index definition and query must match namespaces CREATE INDEX li_price ON orders(orddoc)USING XMLPATTERN '//lineitem/@price' AS double • Cannot use the index. Which is right? declare default element namespace "http://ournamespaces.com/order";for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC') //order[ lineitem/@price > 100 ]return $i
Elements and text nodes differ CREATE INDEX PRICE_TEXT ON orders.orddocUSING XMLPATTERN '//price' AS varchar • Can not use index for $ord in db2-fn:xmlcolumn(“ORDERS.ORDDOC”) /order[ lineitem/price/text() = “99.50” ]return $ord • Element might have more data than just text <price>99.50<unit>USD</unit></price>
Attributes are shy • No attributes //*//node() • Only attributes //@*//attribute::node() • Empty result due to “principle node kind” //@*/self:*
Between predicates are not obvious • Might not be between: multiple prices lineitem[ price > 100 and price < 200 ] • Between or error lineitem[ price gt 100 and price lt 200 ] • Always between lineitem/price/data()[ . > 100 and . < 200 ] • Between if not list type lineitem/price[ . > 100 and . < 200 ] • Between if not list type lineitem[ @price > 100 and @price < 200 ]
Conclusions • Easy to make mistakes without schema constraints • Many subtle differences in expressions • Improve construction composition • Unify SQL and XQuery type systems • Add XMLTest to SQL/XML