
Lazy Query Evaluation for Active XML

Lazy Query Evaluation for Active XML. Abiteboul, Benjelloun, Cautis, Manolescu, Milo, Preda (INRIA Futurs). Presented by Grigoris Karvounarakis, Univ. of Pennsylvania, CIS 650, October 14, 2004.


Presentation Transcript


  1. Lazy Query Evaluation for Active XML • Abiteboul, Benjelloun, Cautis, Manolescu, Milo, Preda (INRIA Futurs) • Presented by: Grigoris Karvounarakis, Univ. of Pennsylvania • CIS 650, October 14, 2004

  2. Active XML • Figure label: function nodes

  3. Tree Pattern Queries • Figure labels: descendant edge, result nodes

  4. Tree Pattern Queries • Similar to pattern trees from the TAX/TLC algebra, plus: • variable nodes, used to bind variables to sub-trees (variable nodes with the same name must be mapped to elements with the same tag name) • result nodes • Embedding (of a query q into a document d) = match • Result of an embedding = bindings of the output variables on the witness tree
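
The notion of an embedding can be made concrete with a small sketch. The classes `Doc` and `Pat` and the function `matches` are hypothetical names introduced only for illustration; text values are modeled as leaf nodes, and the rule that equally named variables must map to elements with the same tag is not checked here.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Doc:
    tag: str
    children: list = field(default_factory=list)


@dataclass
class Pat:
    tag: str                                    # element tag, or "*" for any tag
    edges: list = field(default_factory=list)   # (axis, Pat) pairs, axis is "/" or "//"
    var: str = None                             # optional variable bound to the matched sub-tree


def descendants(d):
    """Yield every proper descendant of document node d."""
    for c in d.children:
        yield c
        yield from descendants(c)


def matches(p, d):
    """Yield one bindings dict per embedding of pattern p that maps its root to d."""
    if p.tag != "*" and p.tag != d.tag:
        return
    own = {p.var: d} if p.var else {}
    per_edge = []
    for axis, child in p.edges:
        candidates = d.children if axis == "/" else descendants(d)
        found = [b for c in candidates for b in matches(child, c)]
        if not found:
            return  # one branch cannot be embedded, so p cannot either
        per_edge.append(found)
    # Combine one embedding per branch into a binding for the whole pattern.
    for combo in product(*per_edge):
        merged = dict(own)
        for b in combo:
            merged.update(b)
        yield merged


doc = Doc("nyHotels", [Doc("hotel", [
    Doc("name", [Doc("Best Western")]),
    Doc("rating", [Doc("*****")]),
])])
query = Pat("nyHotels", [("/", Pat("hotel", [
    ("/", Pat("name", [("/", Pat("Best Western"))])),
    ("/", Pat("rating", var="X")),
]))])
bindings = list(matches(query, doc))
```

Here `bindings` holds one dictionary per embedding, mapping each variable name to the document sub-tree it is bound to. A pattern that asks for a `restaurant` below `nyHotels` has no embedding in this document until some function call produces one.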

  5. No embedding …

  6. No embedding … 1 … but if we evaluate

  7. Embedding Example

  8. Embedding Example

  9. Embedding Example • Figure labels: X, Y

  10. Relevant rewriting • (getNearbyRestos), function node 1 in the example, is a relevant function node • In general, a function node is relevant if there exists some rewriting of the document in which some of the nodes it produces belong to a match • Rewriting the document by invoking relevant function nodes produces relevant rewritings d1 →(v1) d2 →(v2) … dn • A document that contains no calls relevant to a query q is said to be complete for q

  11. Problem definition • Given an Active XML document d and a query q, find an efficient way to evaluate the query over the document • Naïve approach: interleave query evaluation with function calls • Better: compute (a superset of) the relevant function calls for q, then execute q over the rewriting of d that results from executing these calls

  12. Problem definition (continued) • Efficiency tradeoff • Time to compute the approximation of the set of relevant functions: larger for a more accurate approximation • Time to execute the function calls: smaller for a more accurate approximation • Time to execute the query over the resulting rewriting of the document: smaller document for a more accurate approximation

  13. Outline • Definitions • Finding relevant calls • Sequencing relevant calls • Improving accuracy • Reducing detection time • Conclusions - Discussion

  14. Linear Path Queries • /*() • /nyHotels/*() • /nyHotels/hotel/*() • /nyHotels/hotel/name/*() • /nyHotels/hotel/rating/*() • /nyHotels/hotel/nearby/*() • /nyHotels/hotel/nearby//*() • /nyHotels/hotel/nearby//restaurant/*() • /nyHotels/hotel/nearby//restaurant/name/*() • /nyHotels/hotel/nearby//restaurant/address/*() • /nyHotels/hotel/nearby//restaurant/rating/*()

  15. Linear Path Queries • Correct, but usually inaccurate • They ignore filtering conditions on the path from the root or in other branches that could make some of the functions irrelevant (e.g. a getNearbyRestos() function node under a hotel cannot be relevant if the hotel's rating is not “*****”)
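
The list of linear path queries on slide 14 can be derived mechanically from the query tree. The sketch below is one way to do it, under assumptions: the function name `linear_path_queries` and the nested-tuple encoding of the query tree are made up for illustration, and value leaves such as “Best Western” are left out of the tree, since the slide lists no LPQ for them.

```python
def linear_path_queries(root):
    """Return LPQ strings for a query tree encoded as (tag, [(axis, subtree), ...])."""
    queries = ["/*()"]  # a function call at the top level

    def add(q):
        if q not in queries:  # avoid repeats when a node has several "//" edges
            queries.append(q)

    def visit(node, path):
        _tag, edges = node
        add(f"{path}/*()")               # calls that are direct children of this node
        for axis, child in edges:
            if axis == "//":
                add(f"{path}//*()")      # calls anywhere below, for a descendant edge
            visit(child, f"{path}{axis}{child[0]}")

    visit(root, f"/{root[0]}")
    return queries


restaurant = ("restaurant", [("/", ("name", [])), ("/", ("address", [])), ("/", ("rating", []))])
hotel = ("hotel", [("/", ("name", [])), ("/", ("rating", [])), ("/", ("nearby", [("//", restaurant)]))])
lpqs = linear_path_queries(("nyHotels", [("/", hotel)]))
```

For this query tree, `lpqs` lists the same eleven paths as slide 14, in the same order. None of them carries a condition on the hotel name or rating, which is exactly why the approach is correct but inaccurate.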

  16. Node Focused Queries • For each node in the query tree, replace it with an OR node that adds a *() branch matching any function (as in LPQs) • Then, for every node v in the resulting query tree, create qv = q – {v and its subtree}, with output node fv pointing at the position of the *() OR-sibling of v • Each such query tree involves the path from the root to the node (as in LPQs) plus any parts of the tree that would have to be matched anyway for the whole query tree to match

  17. NFQ Example • Figure: query tree with * branches; node labels: nyHotels, hotel, name, rating, nearby, restaurant, name, address, rating; values: “Best Western”, “*****”, X, Y, “*****”

  18. NFQ Example • Figure: same node labels as slide 17

  19. NFQ Example • Figure: nyHotels with a * branch

  20. NFQ Example • Figure: nyHotels with a * branch

  21. NFQ Example • Figure: nyHotels with a * branch

  22. Another NFQ Example • Figure: query tree with * branches; node labels: nyHotels, hotel, name, rating, nearby, restaurant, name, address, rating; values: “Best Western”, “*****”, X, Y, “*****”

  23. Another NFQ Example • Figure: query tree with * branches; node labels: nyHotels, hotel, name, rating, nearby; values: “Best Western”, “*****”

  24. Another NFQ Example • Figure: same node labels as slide 23

  25. Another NFQ Example • Figure: query tree with * branches; node labels: nyHotels, hotel, nearby, name, rating; values: “Best Western”, “*****”

  26. Node Focused Queries • Assuming that functions can return data of arbitrary type, the function nodes that are relevant for a query q are precisely the ones retrieved by the NFQs of q

  27. Outline • Definitions • Finding relevant calls • Sequencing relevant calls • Improving accuracy • Reducing detection time • Conclusions - Discussion

  28. Sequencing relevant calls • Naïve NFQA algorithm, repeat: • 1. Evaluate all NFQs • 2. Pick one of the returned functions, say fv • 3. Evaluate the function and rewrite the document (d →fv d′) • until all NFQs return empty results (i.e., there are no more relevant calls) • After every iteration, although the NFQs remain the same, their results can change (since evaluating functions in step 3 can introduce new function nodes or make some results irrelevant)
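
The naïve loop can be sketched as follows. This is only an illustration under stated assumptions: `nfqa`, `evaluate`, and `invoke` are hypothetical names, the "document" is a flat list of strings rather than a tree, and the table `results` of call results is invented toy data. The loop also assumes that calls eventually stop producing new calls.

```python
def nfqa(doc, nfqs, evaluate, invoke):
    """Naive NFQA: re-evaluate every NFQ after each single function call."""
    while True:
        # Step 1: evaluate all NFQs on the current document.
        relevant = [call for q in nfqs for call in evaluate(q, doc)]
        if not relevant:
            return doc  # no relevant calls are left
        # Steps 2 and 3: pick one returned call, invoke it, rewrite the document.
        doc = invoke(doc, relevant[0])


# Toy setting (all of it hypothetical): a call is any item ending in "()".
results = {"getHotels()": ["hotel", "getRating()"], "getRating()": ["rating"]}


def evaluate(q, doc):
    return [item for item in doc if q(item)]


def invoke(doc, call):
    # Replace the first occurrence of the call by its result.
    i = doc.index(call)
    return doc[:i] + results[call] + doc[i + 1:]


final = nfqa(["nyHotels", "getHotels()"], [lambda item: item.endswith("()")], evaluate, invoke)
```

In the toy run, invoking `getHotels()` brings a new call, `getRating()`, into the document, so the second evaluation returns a different result even though the NFQ itself has not changed.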

  29. Improving NFQA • “Predict” when NFQ results cannot possibly have changed and avoid re-evaluating them • Identify dependencies between NFQs and the effects of executing the functions they return

  30. Influence of NFQs • Figure: two NFQs labeled NFQ1 and NFQ2; node labels: nyHotels, hotel, nearby, name, rating, with * branches and values “Best Western”, “*****”; and nyHotels with a * branch • NFQ1 can influence NFQ2, but not vice versa

  31. Influence of NFQs • NFQ1 may influence NFQ2 iff the output function node of NFQ1 is an ancestor (in the query tree) of the output node of NFQ2 • Two NFQs belong to the same layer if they may influence each other (directly or transitively) • Inside every layer, we have to re-evaluate every NFQ after every function call • Multiple equivalent NFQs (i.e., in the same layer) can only exist under //, where, not knowing the output type, each node could appear as a descendant of the other, e.g. //a, //b: in /a/b, //a matches /a and //b matches /a/b, while in /b/a, //b matches /b and //a matches /b/a

  32. Influence of NFQs • L1 < L2 iff some NFQ in L1 may influence (directly or transitively) some NFQ in L2 • We have to process L1 before L2 (without having to process L1 again afterwards) • When processing of L1 has finished, the OR-nodes corresponding to returned functions are redundant, so NFQs in L2 can be simplified by removing them
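
If the may-influence relation is viewed as a directed graph over NFQs, layers are its strongly connected components and the order L1 < L2 is a topological order of those components. The sketch below computes both with Tarjan's algorithm; treating layers this way is an interpretation of the slides, and the function name `layers` and the NFQ names in the example are hypothetical.

```python
def layers(influences):
    """influences: dict mapping an NFQ name to the set of NFQ names it may influence.

    Returns the layers as a list of sets, ordered so that a layer comes
    before every layer it may influence.
    """
    nodes = set(influences) | {m for targets in influences.values() for m in targets}
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in influences.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a strongly connected component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in sorted(nodes):
        if v not in index:
            visit(v)
    return sccs[::-1]  # Tarjan emits components sinks first; reverse for processing order


order = layers({
    "q_root": {"q_a", "q_b"},
    "q_a": {"q_b"},
    "q_b": {"q_a", "q_c"},
})
```

In this example `q_a` and `q_b` influence each other and therefore share a layer, which must be processed after the layer of `q_root` and before the layer of `q_c`.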

  33. Parallelizing calls • Let qlin be the linear path from the root to the output node of NFQ q, not inclusive (note: qlin is a regular expression) • Two NFQs q, q′ that belong to the same layer are independent iff the regular languages of qlin and q′lin have no words in common • E.g. //a and //b are independent • But //a//c and //b//c are not (e.g. both match /a/b/c) • If all NFQs in a layer are independent, we can call all functions returned by the same NFQ in a step of NFQA in parallel • Other sufficient conditions could exist, too …
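
The independence test amounts to checking whether two regular languages intersect. A minimal sketch, assuming linear paths are given as lists of (axis, label) steps with axes "/" and "//" and an optional "*" label: each path is read as a small automaton, and the product of the two automata is searched for a common accepted word. The names `may_overlap` and `independent` are hypothetical.

```python
def may_overlap(p, q):
    """Return True iff some sequence of tags is matched by both linear paths."""
    # One extra symbol stands for every tag that neither path mentions.
    alphabet = {lab for _, lab in p + q if lab != "*"} | {"#other"}

    def step(path, i, sym):
        # Automaton states are positions 0..len(path); the last one accepts and has no moves.
        if i == len(path):
            return set()
        axis, lab = path[i]
        nxt = set()
        if axis == "//":
            nxt.add(i)  # a descendant axis may skip over any tag
        if lab == "*" or lab == sym:
            nxt.add(i + 1)
        return nxt

    seen, todo = {(0, 0)}, [(0, 0)]
    while todo:
        i, j = todo.pop()
        if (i, j) == (len(p), len(q)):
            return True  # both automata accept the same word
        for sym in alphabet:
            for a in step(p, i, sym):
                for b in step(q, j, sym):
                    if (a, b) not in seen:
                        seen.add((a, b))
                        todo.append((a, b))
    return False


def independent(p, q):
    return not may_overlap(p, q)


a, b = [("//", "a")], [("//", "b")]
ac, bc = [("//", "a"), ("//", "c")], [("//", "b"), ("//", "c")]
```

With these definitions, `independent(a, b)` holds while `independent(ac, bc)` does not, in line with the two examples on the slide.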

  34. Outline • Definitions • Finding relevant calls • Sequencing relevant calls • Improving accuracy • Reducing detection time • Conclusions - Discussion

  35. Using types • Use the function return type to “predict” the shape of the data that a function call can return • Similar to checking for the existence of a possible rewriting • If this shape cannot match the corresponding part of the query pattern, the function call can be discarded • In some cases, one can go further and restrict not only the output type but also the specific names of functions that could match • Refined NFQs • Use the set of function names with an appropriate return type instead of *() • Use F-guides (later) to make them even more refined

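  36. Refined NFQ example • Figure: query tree with * branches; node labels: nyHotels, hotel, nearby, name, rating; values: “Best Western”, “*****”

  37. Refined NFQ example • Figure: query tree; node labels: nyHotels, hotel, nearby, name, rating; values: “Best Western”, “*****”; function branches: getNearbyRestos, getRating, and remaining * branches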

  38. Pushing queries • Similar to pushing selections onto scans in relational queries or pushing queries to data sources in mediator systems • Reduces the amount of (useless) data transferred (assuming functions correspond to remote web services) by filtering out irrelevant matches and projecting only onto output variable nodes

  39. Outline • Definitions • Finding relevant calls • Sequencing relevant calls • Improving accuracy • Reducing detection time • Conclusions - Discussion

  40. Lenient rewriting • Trade accuracy for efficiency • Use XPath or LPQs instead of NFQs (faster processing) • Use a lenient form of type checking (ignoring order and cardinality of elements)

  41. Function call guides • Similar to dataguides, but for function calls • One occurrence of each path that leads to some function node, plus pointers to the function nodes

  42. Function call guides • Figure annotation: paths that don’t lead to functions are left out

  43. Function call guides • Figure annotations: pointers to getHotels calls; pointers to getRating calls; pointers to getNearbyRestos, getNearbyMuseums calls
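
A function call guide can be pictured as a trie over tag paths that keeps only the paths leading to function nodes. The sketch below builds one from a small document; the classes `Node` and `Guide`, the functions `build_fguide` and `merge`, and the example document are assumptions made for illustration, and the parameters of function calls are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    is_call: bool = False   # True for function nodes


@dataclass
class Guide:
    children: dict = field(default_factory=dict)   # tag -> Guide for the extended path
    calls: list = field(default_factory=list)      # pointers to function nodes at this path


def merge(into, other):
    """Fold guide `other` into guide `into`, so that each path occurs only once."""
    into.calls.extend(other.calls)
    for tag, sub in other.children.items():
        merge(into.children.setdefault(tag, Guide()), sub)


def build_fguide(root):
    """Return the guide for the path ending at root, or None if no call lies below it."""
    guide = Guide()
    for child in root.children:
        if child.is_call:
            guide.calls.append(child)
        else:
            sub = build_fguide(child)
            if sub is not None:  # paths that lead to no function node are left out
                merge(guide.children.setdefault(child.tag, Guide()), sub)
    return guide if guide.calls or guide.children else None


def call(name):
    return Node(name, is_call=True)


doc = Node("nyHotels", [
    call("getHotels"),
    Node("hotel", [
        Node("name", [Node("Best Western")]),
        Node("rating", [call("getRating")]),
        Node("nearby", [call("getNearbyRestos"), call("getNearbyMuseums")]),
    ]),
    Node("hotel", [Node("name"), Node("rating", [call("getRating")])]),
])
fguide = build_fguide(doc)
```

In the resulting guide the two `hotel` elements share a single `hotel` entry, the `name` path is absent because no call lies below it, and the `rating` entry points to both `getRating` calls.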

  44. Function call guides • Use F-guides for: • Generation of refined NFQs (use the return type within the appropriate part of the F-guide to get only function names that can indeed appear in the corresponding tree fragment) • Efficient approximation of relevant function nodes: evaluating queries (NFQs) on the F-guide corresponds to evaluating queries on the original document using LPQs • Initial filtering: NFQs for nodes that don’t have any children in the F-guide can be dropped

  45. Conclusions • Active XML: interesting new area • Nothing fundamentally novel • Applies known tools (distributed processing, lazy evaluation) in a new context, giving new life to documents • Greatest challenge: formulating the right research questions well • Answers to these well-formulated questions are fairly easy • Contributions of this paper: • Formulates such an interesting question • Thorough understanding of different aspects of the problem (accuracy vs. performance and their effect on overall efficiency)

  46. Questions?
