Lazy Query Evaluation for Active XML

Abiteboul, Benjelloun, Cautis, Manolescu, Milo, Preda (INRIA Futurs)
Presented by: Grigoris Karvounarakis, Univ. of Pennsylvania
CIS 650, October 14, 2004

Active XML
[Figure: example Active XML document; callout: function nodes]

Tree Pattern Queries
[Figure: example tree pattern query; callouts: descendant edge, result nodes]

Tree Pattern Queries
• Similar to Pattern Trees from the TAX/TLC algebra, plus:
  • variable nodes, used to bind variables to subtrees (variable nodes with the same name must be mapped to elements with the same tag name)
  • result nodes
• Embedding (of a query q into a document d) = match
• Result of an embedding = bindings of the output variables on the witness tree
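A minimal sketch of the embedding test, to make the definition concrete. It assumes a simplified model that is not from the slides: documents and patterns carry tag names only (no values, variables, or result nodes), the pattern root is mapped to the document root, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


@dataclass
class Pattern:
    tag: str                      # element name, or "*" for any tag
    edge: str = "child"           # edge from the parent: "child" (/) or "descendant" (//)
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every proper descendant of a document node."""
    for child in node.children:
        yield child
        yield from descendants(child)


def embeds_at(pattern, node):
    """True if the pattern can be embedded with its root mapped to `node`."""
    if pattern.tag != "*" and pattern.tag != node.tag:
        return False
    for sub in pattern.children:
        candidates = node.children if sub.edge == "child" else descendants(node)
        if not any(embeds_at(sub, c) for c in candidates):
            return False
    return True


def has_embedding(query, doc_root):
    return embeds_at(query, doc_root)


# Toy document: the restaurant sits two levels below nearby.
doc = Node("nyHotels", [Node("hotel", [
    Node("name"), Node("rating"),
    Node("nearby", [Node("list", [Node("restaurant")])]),
])])

query = Pattern("nyHotels", children=[Pattern("hotel", children=[
    Pattern("rating"),
    Pattern("nearby", children=[Pattern("restaurant", edge="descendant")]),
])])

found = has_embedding(query, doc)
```

Here the descendant edge lets restaurant match below the intermediate list element; with a child edge the same query would have no embedding.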

No embedding … but if we evaluate (1) …
[Figure: example document and query; marker 1 labels a node in the figure]

Embedding Example
[Figure: embedding of the query into the document, with variables X and Y]

Relevant rewriting
• getNearbyRestos is a relevant function node
• In general, a function node is relevant if there exists some rewriting of the document in which some of the nodes it produces belong to a match
• Rewriting the document by invoking relevant function nodes produces relevant rewritings d →(v1) d1 →(v2) d2 … dn
• A document that contains no calls that are relevant to a query q is said to be complete for q

Problem definition
• Given an Active XML document d and a query q, find an efficient way to evaluate the query over the document
• Naïve approach: interleave query evaluation with function calls
• Better: try to compute (a superset of) the relevant function calls for q and execute q over the rewriting of d that results from executing these function calls
• Efficiency tradeoff:
  • time to compute an approximation of the set of relevant functions (larger for a more accurate approximation)
  • time to execute the function calls (smaller for a more accurate approximation)
  • time to execute the query over the resulting rewriting of the document (smaller document for a more accurate approximation)

Outline
• Definitions
• Finding relevant calls
• Sequencing relevant calls
• Improving accuracy
• Reducing detection time
• Conclusions - Discussion

Linear Path Queries
/*()
/nyHotels/*()
/nyHotels/hotel/*()
/nyHotels/hotel/name/*()
/nyHotels/hotel/rating/*()
/nyHotels/hotel/nearby/*()
/nyHotels/hotel/nearby//*()
/nyHotels/hotel/nearby//restaurant/*()
/nyHotels/hotel/nearby//restaurant/name/*()
/nyHotels/hotel/nearby//restaurant/address/*()
/nyHotels/hotel/nearby//restaurant/rating/*()
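The list of linear path queries can be generated mechanically from the query tree. The sketch below reproduces it under one reading of the slide, which is an assumption: emit /*() for the root position, then path/*() for every query node, plus path//*() wherever a node has a child reached through a descendant edge. The tuple encoding (tag, edge, children) and the function name are hypothetical.

```python
def linear_path_queries(query):
    """query: (tag, edge, children), where edge is '/' or '//' (edge from the parent)."""
    out = ["/*()"]  # a function call could stand in for the root itself

    def visit(node, prefix):
        tag, edge, children = node
        path = prefix + edge + tag
        out.append(path + "/*()")           # a call as a direct child of this node
        if any(child[1] == "//" for child in children):
            out.append(path + "//*()")      # a call anywhere below this node
        for child in children:
            visit(child, path)

    visit(query, "")
    return out


# Query tree inferred from the paths on the slide.
QUERY = ("nyHotels", "/", [
    ("hotel", "/", [
        ("name", "/", []),
        ("rating", "/", []),
        ("nearby", "/", [
            ("restaurant", "//", [
                ("name", "/", []),
                ("address", "/", []),
                ("rating", "/", []),
            ]),
        ]),
    ]),
])

lpqs = linear_path_queries(QUERY)
```

Each LPQ only looks at the path from the root, so value conditions such as the hotel rating play no part in it.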

Linear Path Queries
• Correct, but usually inaccurate
• They ignore filtering conditions on the path from the root or in other branches that could make some of the functions irrelevant (e.g., a getNearbyRestos() function node under a hotel cannot be relevant if the hotel rating is not “*****”)

Node Focused Queries
• Replace each node in the query tree with an OR node (adding a branch *() that matches any function, as in LPQs)
• Then, for every node v in the resulting query tree, create qv = q − {v and its subtree}, with output node fv pointing at the position of the *() OR-sibling of v
• Each such query tree involves the path from the root to the node (as in an LPQ) plus any parts of the tree that would have to be matched anyway for the whole query tree to match
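A rough sketch of the pruning step that produces qv. It is deliberately simplified, and the simplifications are assumptions: the OR nodes on the remaining query nodes are omitted, value conditions are left out, and a node is identified by its index path from the root. The encoding (tag, edge, children) and the name nfq are hypothetical.

```python
def nfq(node, target):
    """Return a copy of the query tree in which the subtree at index path
    `target` is replaced by a single output node '*()'."""
    tag, edge, children = node
    if not target:
        # The output node f_v takes the place of v and keeps v's incoming edge.
        return ("*()", edge, [])
    first, rest = target[0], target[1:]
    pruned = [nfq(child, rest) if i == first else child
              for i, child in enumerate(children)]
    return (tag, edge, pruned)


QUERY = ("nyHotels", "/", [
    ("hotel", "/", [
        ("name", "/", []),
        ("rating", "/", []),
        ("nearby", "/", [
            ("restaurant", "//", [
                ("name", "/", []),
                ("address", "/", []),
                ("rating", "/", []),
            ]),
        ]),
    ]),
])

nfq_hotel = nfq(QUERY, (0,))             # v = hotel
nfq_restaurant = nfq(QUERY, (0, 2, 0))   # v = restaurant
```

For v = hotel only nyHotels with a *() child remains; for v = restaurant the name and rating branches of hotel stay in the query, which is what makes an NFQ more selective than the corresponding LPQ.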

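NFQ Example
[Figure: the query tree with a *() OR-branch beside every node: nyHotels / hotel with children name (“Best Western”), rating (“*****”), and nearby // restaurant with children name (X), address (Y), rating (“*****”)]
[Figure: resulting NFQ: nyHotels with a single *() child]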

Another NFQ Example
[Figure: the same query tree with *() OR-branches]
[Figure: resulting NFQ: nyHotels / hotel with name (“Best Western”), rating (“*****”), and nearby with a *() output node; the restaurant subtree is removed]

Node Focused Queries
• Assuming that functions can return data of arbitrary type, the function nodes that are relevant for a query q are precisely the ones retrieved by the NFQs of q

Sequencing relevant calls
• Naïve NFQA algorithm; repeat:
  1. Evaluate all NFQs
  2. Pick one of the returned functions, say fv
  3. Evaluate the function and rewrite the document (d →(fv) d′)
  until all NFQs return empty results (i.e., there are no more relevant calls)
• After every iteration, although the NFQs remain the same, their results can change (since evaluating a function at step 3 can introduce new function nodes or make some results irrelevant)
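The naïve loop can be sketched as follows. Everything concrete here is an assumption made for illustration: the document is flattened to a list of path strings, function calls are paths ending in (), each NFQ is approximated by a regular expression over paths, and RESULTS stands in for hypothetical service answers.

```python
import re

# Hypothetical service results: call name -> paths returned, relative to the call's parent.
RESULTS = {
    "getHotels()": ["hotel/rating", "hotel/nearby/getNearbyRestos()"],
    "getNearbyRestos()": ["restaurant/name"],
}


def evaluate(pattern, doc):
    """Return the function-call paths in `doc` that the (regex) NFQ retrieves."""
    return [p for p in doc if p.endswith("()") and re.fullmatch(pattern, p)]


def invoke(doc, call):
    """Replace the call node by its result, giving the rewritten document."""
    parent, name = call.rsplit("/", 1)
    produced = [parent + "/" + r for r in RESULTS[name]]
    return [p for p in doc if p != call] + produced


def nfqa(doc, nfqs):
    calls_made = 0
    while True:
        relevant = [c for q in nfqs for c in evaluate(q, doc)]  # step 1
        if not relevant:                                        # no relevant calls left
            return doc, calls_made
        doc = invoke(doc, relevant[0])                          # steps 2 and 3
        calls_made += 1


doc = ["/nyHotels/getHotels()", "/nyHotels/museums/getMuseums()"]
nfqs = [r"/nyHotels/[^/]+\(\)", r"/nyHotels/hotel/nearby/[^/]+\(\)"]
final_doc, calls_made = nfqa(doc, nfqs)
```

In this toy run getHotels() introduces a new call that the second NFQ only retrieves on the next iteration, while getMuseums() is never invoked because no NFQ retrieves it.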

Improving NFQA
• “Predict” when NFQ results cannot possibly have changed and avoid reevaluating them
• Identify dependences between NFQs and the effect of executing the functions they return

Influence of NFQs
[Figure: NFQ1 (nyHotels with a *() child) and NFQ2 (nyHotels / hotel with name “Best Western”, rating “*****”, and nearby with a *() child)]
• NFQ1 can influence NFQ2, but not vice versa

Influence of NFQs
• NFQ1 may influence NFQ2 iff the output function node of NFQ1 is an ancestor (in the query tree) of the output node of NFQ2
• Two NFQs belong to the same layer if they may influence each other (directly or transitively)
• Inside every layer, we have to reevaluate every NFQ after every function call
• Multiple equivalent NFQs (i.e., in the same layer) can only exist under //, because, not knowing the output type, both nodes could appear as descendants of each other. E.g., for //a and //b: in /a/b, //a matches /a and //b matches /a/b, while in /b/a, //b matches /b and //a matches /b/a

Influence of NFQs
• L1 < L2 iff some NFQ in L1 may influence (directly or transitively) some NFQ in L2
• We have to process L1 before L2 (without having to process L1 again afterwards)
• When processing of L1 has finished, the OR-nodes corresponding to the returned functions are redundant, so the NFQs in L2 can be simplified by removing them
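One way to compute the layers and their processing order is to treat "may influence" as a directed graph: layers are groups of NFQs that reach each other, and layers are then sorted topologically. This is a sketch under that assumption; the NFQ names and the influence relation below are made up for illustration.

```python
from graphlib import TopologicalSorter


def layers(influences):
    """influences: dict mapping an NFQ to the set of NFQs it may directly influence.
    Returns the layers (frozensets) in an order in which they can be processed."""
    nodes = set(influences) | {t for targets in influences.values() for t in targets}

    def reach(src):
        # All NFQs reachable from src through one or more influence edges.
        seen, stack = set(), [src]
        while stack:
            for nxt in influences.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reachable = {n: reach(n) for n in nodes}
    # Same layer = mutual (direct or transitive) influence.
    layer_of = {
        n: frozenset(m for m in nodes
                     if m == n or (m in reachable[n] and n in reachable[m]))
        for n in nodes
    }
    # A layer must wait for every other layer that can influence it.
    deps = {layer: set() for layer in layer_of.values()}
    for n in nodes:
        for m in reachable[n]:
            if layer_of[m] != layer_of[n]:
                deps[layer_of[m]].add(layer_of[n])
    return list(TopologicalSorter(deps).static_order())


# Hypothetical influence relation: q_a and q_b sit under // and influence each other.
influences = {
    "q_root": {"q_a", "q_b"},
    "q_a": {"q_b"},
    "q_b": {"q_a", "q_leaf"},
}
order = layers(influences)
```

Within the layer {q_a, q_b} every NFQ is reevaluated after each call; once that layer is finished it never needs to be revisited while q_leaf is processed.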

Parallelizing calls
• Let qlin be the linear path from the root to the output node of NFQ q, not inclusive (note: qlin is a regular expression)
• Two NFQs q, q’ that belong to the same layer are independent iff the regular languages of qlin and q’lin have no words in common
• E.g., //a and //b are independent
• But //a//c and //b//c are not (e.g., both match /a/b/c)
• If all NFQs in a layer are independent, we can call all functions returned by the same NFQ in a step of NFQA in parallel
• Other sufficient conditions could exist, too …
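The independence test amounts to checking whether two path languages intersect. The sketch below does this with a search over pairs of positions in the two paths. It assumes paths built only from / and // steps with tag names or *, encoded as hypothetical (axis, tag) pairs; full regular expressions are not handled.

```python
def may_overlap(p, q):
    """p, q: lists of (axis, tag) steps; axis is '/' or '//', tag is a name or '*'.
    True if some root-to-node tag sequence is matched by both paths."""
    def compatible(s, t):
        return s == "*" or t == "*" or s == t

    seen, stack = set(), [(0, 0)]
    while stack:
        i, j = stack.pop()
        if (i, j) in seen:
            continue
        seen.add((i, j))
        if i == len(p) and j == len(q):
            return True                   # both paths consumed the same word
        if i == len(p) or j == len(q):
            continue                      # one path is finished, the other is not
        (axis_p, tag_p), (axis_q, tag_q) = p[i], q[j]
        if compatible(tag_p, tag_q):
            stack.append((i + 1, j + 1))  # the next element satisfies both steps
        if axis_q == "//":
            stack.append((i + 1, j))      # p's step is skipped over by q's //
        if axis_p == "//":
            stack.append((i, j + 1))      # q's step is skipped over by p's //
    return False


def independent(p, q):
    return not may_overlap(p, q)


a = [("//", "a")]
b = [("//", "b")]
ac = [("//", "a"), ("//", "c")]
bc = [("//", "b"), ("//", "c")]
```

With these inputs, independent(a, b) holds because any common word would have to end in both a and b, whereas ac and bc overlap on a word such as a, b, c.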

Using types
• Use the function return type to “predict” the shape of the data that a function call can return
• Similar to checking for the existence of a possible rewriting
• If this shape cannot match the corresponding part of the query pattern, the call can be discarded
• In some cases, one can go further and restrict not only the output type but also the specific names of functions that could match
• Refined NFQs:
  • Use the set of function names with an appropriate return type instead of *()
  • Use F-guides (later) to make them even more refined
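A very coarse sketch of the refinement step. It assumes, purely for illustration, that a return type can be summarised as the set of element names a call may produce; real return types are richer than this. The signatures below are hypothetical, although the function names are the ones used in the slides.

```python
# Hypothetical signatures: function name -> element names its result may contain.
SIGNATURES = {
    "getHotels": {"hotel"},
    "getRating": {"rating"},
    "getNearbyRestos": {"restaurant"},
    "getNearbyMuseums": {"museum"},
}


def refine(expected_tags, signatures):
    """Names of the functions whose declared result could match one of the
    element names the query expects at the position of the *() node."""
    return sorted(name for name, returns in signatures.items()
                  if returns & expected_tags)


under_nearby = refine({"restaurant"}, SIGNATURES)
under_hotel = refine({"name", "rating", "nearby"}, SIGNATURES)
```

The refined NFQ then tests for these names instead of *(), so a call such as getNearbyMuseums under nearby is never retrieved.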

Refined NFQ example
[Figure: NFQ nyHotels / hotel with name (“Best Western”), rating (“*****”), and nearby, with *() nodes]
[Figure: the same NFQ with *() nodes replaced by the function names getNearbyRestos and getRating]

Pushing queries
• Similar to pushing selections onto scans in relational queries, or pushing queries to data sources in mediator systems
• Reduces the amount of (useless) data transferred (assuming functions correspond to remote (web) services) by filtering out irrelevant matches and projecting only on output variable nodes

Lenient rewriting
• Trade accuracy for efficiency
• Use XPath or LPQs instead of NFQs (faster processing)
• Use a lenient form of type checking (ignoring order and cardinality of elements)

Function call guides
• Similar to dataguides, but for function calls
• One occurrence for each path that leads to some function node, plus pointers to the function nodes
• Paths that don’t lead to functions are left out
[Figure: F-guide with pointers to getHotels calls, pointers to getRating calls, and pointers to getNearbyRestos and getNearbyMuseums calls]

Function call guides
• Use F-guides for:
  • Generation of refined NFQs (use the return type within the appropriate part of the F-guide to get only the function names that can actually appear in the corresponding tree fragment)
  • Efficient approximation of relevant function nodes: evaluating queries (NFQs) on the F-guide is comparable to evaluating LPQs on the original document
  • Initial filtering: NFQs for nodes that have no children in the F-guide can be dropped
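A small sketch of how an F-guide could be built. It assumes a toy document encoding (label, children) in which function calls are labels ending in (), and it records call labels where a real F-guide would hold pointers to the call nodes. The function name build_fguide and the sample document are hypothetical.

```python
from collections import defaultdict


def build_fguide(doc):
    """Map each path that leads to at least one function call to the calls found there."""
    guide = defaultdict(list)

    def walk(node, path):
        label, children = node
        if label.endswith("()"):
            guide[path].append(label)   # stand-in for a pointer to the call node
            return                      # call parameters are not explored in this sketch
        for child in children:
            walk(child, path + "/" + label)

    walk(doc, "")
    return dict(guide)


doc = ("nyHotels", [
    ("getHotels()", []),
    ("hotel", [
        ("name", []),
        ("rating", [("getRating()", [])]),
        ("nearby", [("getNearbyRestos()", []), ("getNearbyMuseums()", [])]),
    ]),
])

fguide = build_fguide(doc)
```

The path /nyHotels/hotel/name does not appear in the result because no call sits below it, which is the property the initial filtering step relies on.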

Conclusions
• Active XML: interesting new area
  • Nothing fundamentally novel
  • Applies known tools (distributed processing, lazy evaluation) in a new context, giving new life to documents
• Greatest challenge: formulating the right research questions well
  • Answers to these well-formulated questions are fairly easy
• Contributions of this paper:
  • Formulates such an interesting question
  • Thorough understanding of different aspects of the problem (accuracy vs. performance and their effect on overall efficiency)

Questions?
