Optimization in XSLT and XQuery Michael Kay
Challenges • XSLT/XQuery are high-level declarative languages: performance depends on good optimization • Performance also depends on good programming! • How can users write good programs if they don’t know what the optimizer is doing?
What is optimization? • Widest sense: • Everything that’s done to make your query go fast • Narrower sense: • Expression rewriting: replacing the code that you write with equivalent, faster code that has the same effect
Main performance contributors • Efficient internal coding • Tree model for documents • Streamed execution (pipelining) + lazy evaluation • Rewrite optimizations • Including join optimization • Tail recursion • XSLT template rule matching
Databases vs. in-memory processors • Databases • 90% of optimization is about finding and using indexes • You can spend more time building the data to reduce query costs • Indexes are long-lived • Queries may be repeatable or one-off • In-memory processors • Loading the data is a significant part of the overall cost • Memory utilization needs to be minimized • Indexes, if used, are transient • Queries/Stylesheets may be repeatable or one-off
The Saxon TinyTree Model • Requirements: • Low memory footprint • Fast construction • Fast access paths • Support for document order • Non-requirement: • In-situ update
TinyTree example <root> <a>12</a> <b>Prague</b> </root> (assume whitespace is stripped)
TinyTree: key points • No “object-per-node” overhead • Names held as integer codes • Fast child navigation • Fast document order comparison • Extra information added dynamically if needed: • Preceding-sibling pointers • Base URI, line numbers, etc. • Indexes
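The array-based idea behind these points can be sketched as follows. This is a simplified illustration under stated assumptions, not Saxon's actual data layout: the class `FlatTree`, its arrays, and its methods are hypothetical, it stores element nodes only, and it has a fixed capacity. Nodes are numbered in document order, so a document-order comparison is an integer comparison, and names are stored once in a pool and referenced by integer code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an array-based tree; not Saxon's real TinyTree layout. */
final class FlatTree {
    private final int[] depth;     // nesting depth of each node
    private final int[] nameCode;  // element name as an integer code
    private final int[] next;      // next sibling, or -1 if none
    private int size = 0;
    private final List<String> names = new ArrayList<>(); // name pool: code -> name

    FlatTree(int capacity) {
        depth = new int[capacity];
        nameCode = new int[capacity];
        next = new int[capacity];
    }

    /** Appends a node in document order and returns its node number. */
    int addNode(int nodeDepth, String name) {
        int code = names.indexOf(name);
        if (code < 0) { names.add(name); code = names.size() - 1; }
        int node = size++;
        depth[node] = nodeDepth;
        nameCode[node] = code;
        next[node] = -1;
        // Link the closest earlier node at the same depth as a sibling,
        // stopping as soon as a shallower node (a different parent) is reached.
        for (int i = node - 1; i >= 0 && depth[i] >= nodeDepth; i--) {
            if (depth[i] == nodeDepth) { next[i] = node; break; }
        }
        return node;
    }

    /** The first child, if any, is the immediately following node one level deeper. */
    int firstChild(int node) {
        int c = node + 1;
        return (c < size && depth[c] == depth[node] + 1) ? c : -1;
    }

    int nextSibling(int node) { return next[node]; }

    String name(int node) { return names.get(nameCode[node]); }

    /** Document order is simply node-number order. */
    static boolean precedes(int a, int b) { return a < b; }
}
```

No object is allocated per node: each node costs three array slots, which is the kind of saving the "no object-per-node overhead" bullet refers to.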
Streaming (Pipelining) • Common practice in set-based languages • Functional programming languages • SQL • Each node in the expression tree can deliver its results incrementally to the parent node • Can be implemented as pull or push (Saxon uses both)
Example: filter expressions • Expression: $nodes[x=1] • Expression tree: filter( $nodes, =( x, 1 ) ) • class FilterExpressionIterator { /* returns the next matching item, or EOS at end of sequence */ public Item next() { while (true) { Item item = base.next(); if (item == EOS) return EOS; if (matches(item, predicate)) return item; } } }
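Example: Many-to-One Comparisons • Expression: x=1 • Expression tree: =( x, 1 ) • class ManyToOneComparisonEvaluator { /* true as soon as any item on the left equals the single right-hand value */ public boolean evaluate() { while (true) { Item item = lhs.next(); if (item == EOS) return false; if (item.equals(rhs)) return true; } } }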
Benefits of Streaming • Saves memory • No memory for intermediate results • Allocating and de-allocating memory takes time • Early exit, for example in • (a/b/c/d)[1] • book[author = 'Smith'] • exists(//@xml:space)
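The early-exit benefit can be made concrete with a small pull pipeline. This is a minimal sketch, not Saxon code: `EarlyExitDemo`, `CountingSource`, and `firstMatch` are hypothetical names, and the source simply generates integers on demand while counting how many it was asked for.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical pull pipeline illustrating early exit; names are illustrative only. */
final class EarlyExitDemo {
    /** Produces 1..limit on demand and counts how many items were pulled. */
    static final class CountingSource implements Iterator<Integer> {
        private final int limit;
        private int current = 0;
        int pulled = 0;

        CountingSource(int limit) { this.limit = limit; }

        @Override
        public boolean hasNext() { return current < limit; }

        @Override
        public Integer next() {
            if (current >= limit) throw new NoSuchElementException();
            pulled++;
            return ++current;
        }
    }

    /** Returns the first item satisfying the predicate, or null; analogous to (expr[pred])[1]. */
    static <T> T firstMatch(Iterator<T> base, Predicate<T> predicate) {
        while (base.hasNext()) {
            T item = base.next();
            if (predicate.test(item)) return item; // stop pulling as soon as a match is found
        }
        return null;
    }
}
```

Calling `firstMatch` on a `CountingSource` of a million items with a predicate that first holds at item 7 pulls only seven items; the remaining items are never generated and no intermediate list is built.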
Lazy Evaluation • Closely associated with streaming • Variables and function arguments are not evaluated until the value is needed • Benefits: • The value might never be needed • Only part of the value might be needed (early exit) • Memory is used for the minimum time
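A lazily evaluated variable can be sketched as a memoizing wrapper around a computation. This is an assumption-laden illustration rather than Saxon's mechanism: the class `Lazy` is hypothetical, it is not thread-safe, and it always materializes the whole value rather than only the part that is needed.

```java
import java.util.function.Supplier;

/** Hypothetical lazily evaluated, memoized value; not Saxon's implementation. */
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> computation; // released after first use so it can be garbage-collected
    private T value;
    private boolean evaluated = false;

    Lazy(Supplier<T> computation) { this.computation = computation; }

    /** Runs the computation on first call only; later calls return the cached value. */
    @Override
    public T get() {
        if (!evaluated) {
            value = computation.get();
            evaluated = true;
            computation = null;
        }
        return value;
    }

    boolean isEvaluated() { return evaluated; }
}
```

If `get()` is never called, the computation never runs, which corresponds to the "value might never be needed" case.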
Compile-time Expression Rewrites General approach: • Parse the source code into an expression tree • Resolve references (variables, functions) • Decorate the tree with attributes • Type of an expression • Dependencies of an expression • Other properties, e.g. whether a node-set is sorted • Scan the tree repeatedly to identify expressions that can be replaced by faster equivalents
Two kinds of rewrites • Rewrites that could have been done by the programmer • count(A) > 3 ► exists(A[4]) • Rewrites that use constructs not available to the programmer • A[position()=last()] ► A[isLast()]
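The first kind of rewrite can be sketched as a pattern match on an expression tree. The classes below (`CountRewrite`, `Path`, `Count`, `GreaterThanInt`, `ExistsAt`) are hypothetical and far simpler than a real XPath expression tree. The rewrite is valid because, for a non-negative integer n, count(A) > n holds exactly when A has an item at position n+1.

```java
/** Hypothetical mini expression tree showing the rewrite count(A) > n  =>  exists((A)[n+1]). */
final class CountRewrite {
    interface Expr { }

    /** Stands for an arbitrary sequence-valued expression A. */
    static final class Path implements Expr {
        final String text;
        Path(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    /** count(arg) */
    static final class Count implements Expr {
        final Expr arg;
        Count(Expr arg) { this.arg = arg; }
        @Override public String toString() { return "count(" + arg + ")"; }
    }

    /** lhs > n, where n is an integer literal */
    static final class GreaterThanInt implements Expr {
        final Expr lhs;
        final int n;
        GreaterThanInt(Expr lhs, int n) { this.lhs = lhs; this.n = n; }
        @Override public String toString() { return lhs + " > " + n; }
    }

    /** exists((base)[position]); the base is parenthesized so the predicate applies to the whole sequence. */
    static final class ExistsAt implements Expr {
        final Expr base;
        final int position;
        ExistsAt(Expr base, int position) { this.base = base; this.position = position; }
        @Override public String toString() { return "exists((" + base + ")[" + position + "])"; }
    }

    /** Applies the rewrite where the pattern matches; otherwise returns the expression unchanged. */
    static Expr rewrite(Expr e) {
        if (e instanceof GreaterThanInt) {
            GreaterThanInt gt = (GreaterThanInt) e;
            // Guard: for negative n the comparison is always true, so the rewrite would be wrong.
            if (gt.lhs instanceof Count && gt.n >= 0 && gt.n < Integer.MAX_VALUE) {
                return new ExistsAt(((Count) gt.lhs).arg, gt.n + 1);
            }
        }
        return e;
    }
}
```

The rewritten form pays off with streamed evaluation: the positional filter can stop after the fourth item instead of counting the whole sequence.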
Some important rewrites • Sort removal • Not sorting path expressions where the result is already sorted • Constant subexpressions • Evaluated at compile time where possible • Extracting subexpressions from loops • Distributing WHERE conditions • + many ad-hoc rewrites
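Compile-time evaluation of constant subexpressions can be illustrated on a tiny arithmetic expression tree. This is a sketch under simplifying assumptions: `ConstantFolding` and its node classes are hypothetical, only `+` and `*` on integers are handled, and Java `long` arithmetic stands in for XPath integer semantics.

```java
/** Hypothetical sketch of compile-time constant folding over a tiny expression tree. */
final class ConstantFolding {
    interface Node { }

    static final class Literal implements Node {
        final long value;
        Literal(long value) { this.value = value; }
        @Override public String toString() { return Long.toString(value); }
    }

    static final class VarRef implements Node {
        final String name;
        VarRef(String name) { this.name = name; }
        @Override public String toString() { return "$" + name; }
    }

    static final class Binary implements Node {
        final char op;     // '+' or '*'
        final Node left;
        final Node right;
        Binary(char op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " " + op + " " + right + ")"; }
    }

    /** Folds bottom-up: a Binary whose operands are both literals becomes a Literal. */
    static Node fold(Node n) {
        if (!(n instanceof Binary)) return n;
        Binary b = (Binary) n;
        Node l = fold(b.left);
        Node r = fold(b.right);
        if (l instanceof Literal && r instanceof Literal) {
            long x = ((Literal) l).value;
            long y = ((Literal) r).value;
            if (b.op == '+') return new Literal(x + y);
            if (b.op == '*') return new Literal(x * y);
        }
        // Operands that depend on a variable cannot be evaluated at compile time.
        return new Binary(b.op, l, r);
    }
}
```

Folding `$x + (2 * 30)` yields `$x + 60`: the constant part is computed once at compile time, while the part that depends on `$x` is left for run time.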
Some rewrites that Saxon doesn’t yet do • Inline expansion of variable references • Inline expansion of function calls • Detecting common subexpressions • Creating new global variables
Type Checking and its effect on performance • XQuery and XSLT 2.0 allow you to declare types of variables and functions • But it’s not mandatory • Main benefit is better error detection • Type information can also be used by the optimizer • With Saxon, this rarely makes a big difference
“Optimistic static type checking” • The static type of an expression S is compared with the required type R • Possible outcomes: • S is a subtype of R: no action needed • S overlaps with R: run-time type-checking code is generated • S and R are disjoint: static error reported • Special case: • integer* and string* overlap (both allow the empty sequence)
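The three outcomes can be sketched with a much-simplified model of sequence types. Everything here is an assumption made for illustration: `OptimisticTypeCheck`, `SeqType`, and the three-member `ItemKind` hierarchy are hypothetical, and a real type system has many more item types and occurrence indicators.

```java
/** Hypothetical, much-simplified model of sequence types illustrating the three outcomes. */
final class OptimisticTypeCheck {
    enum ItemKind { INTEGER, STRING, ANY_ATOMIC }
    enum Outcome { NO_ACTION, RUNTIME_CHECK, STATIC_ERROR }

    /** An item kind plus a cardinality; a single item is always allowed in this model. */
    static final class SeqType {
        final ItemKind kind;
        final boolean allowsEmpty;
        final boolean allowsMany;

        SeqType(ItemKind kind, boolean allowsEmpty, boolean allowsMany) {
            this.kind = kind;
            this.allowsEmpty = allowsEmpty;
            this.allowsMany = allowsMany;
        }

        static SeqType one(ItemKind k) { return new SeqType(k, false, false); }  // e.g. integer
        static SeqType star(ItemKind k) { return new SeqType(k, true, true); }   // e.g. integer*
    }

    static boolean itemSubtype(ItemKind a, ItemKind b) {
        return a == b || b == ItemKind.ANY_ATOMIC;
    }

    /** Compares the static type s with the required type r. */
    static Outcome check(SeqType s, SeqType r) {
        boolean subtype = itemSubtype(s.kind, r.kind)
                && (!s.allowsEmpty || r.allowsEmpty)
                && (!s.allowsMany || r.allowsMany);
        if (subtype) return Outcome.NO_ACTION;

        boolean itemsOverlap = itemSubtype(s.kind, r.kind) || itemSubtype(r.kind, s.kind);
        boolean shareEmpty = s.allowsEmpty && r.allowsEmpty; // the empty sequence belongs to both
        return (itemsOverlap || shareEmpty) ? Outcome.RUNTIME_CHECK : Outcome.STATIC_ERROR;
    }
}
```

In this model `integer*` against `string*` gives `RUNTIME_CHECK` rather than `STATIC_ERROR`, because the empty sequence is an instance of both types.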
Join Optimization • Less important in XQuery than in SQL • Except that some people write XQuery as if it were SQL • General strategy in Saxon-SA: • Distribute the join predicates (turn WHERE clauses into filter expressions) • Use indexed lookup for predicates where appropriate
Indexes in Saxon-SA • Explicit user-defined indexes • xsl:key • Implicit document-level indexes • //a/b/c[@id=$param] • Implicit sequence-level indexes • $abc[@id = $param] • Hash tables for many-to-many “=” • book[keyword = $keywords]
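The hash-table approach to a many-to-many "=" can be sketched as follows. This is an illustration, not Saxon-SA's implementation: `ManyToManyEquals` and `Book` are hypothetical, and plain string equality stands in for the type-dependent comparison rules of the XPath general comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of evaluating book[keyword = $keywords] with a hash table. */
final class ManyToManyEquals {
    static final class Book {
        final String title;
        final List<String> keywords;

        Book(String title, String... keywords) {
            this.title = title;
            this.keywords = Arrays.asList(keywords);
        }
    }

    /** Selects the books that have at least one keyword in the wanted list. */
    static List<Book> select(List<Book> books, List<String> wanted) {
        Set<String> table = new HashSet<>(wanted); // built once, before scanning the books
        List<Book> result = new ArrayList<>();
        for (Book book : books) {
            for (String k : book.keywords) {
                if (table.contains(k)) { // expected constant-time lookup
                    result.add(book);
                    break;               // one match is enough for this book
                }
            }
        }
        return result;
    }
}
```

Compared with testing every keyword of every book against every wanted value, the hash table replaces the innermost loop with a single lookup.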
Some tips for effective indexing • Declare your types • Avoid untypedAtomic • Use a schema • Use “eq” rather than “=”
Tail Recursion • See the printed paper
Conclusions • Optimization techniques are similar for XSLT and XQuery • But vary between database products and in-memory processors • Compile-time techniques • Type analysis • Expression rewriting • Run-time techniques • Streaming/pipelining • etc