
Organizing and Searching Information with XML

Selectivity Estimation for XML Queries. Thomas Beer, Christian Linz, Mostafa Khabouze.


Presentation Transcript


  1. Organizing and Searching Information with XML: Selectivity Estimation for XML Queries. Thomas Beer, Christian Linz, Mostafa Khabouze

  2. Outline • Definition Selectivity Estimation • Motivation • Algorithms for Selectivity Estimation • Path Tree • Markov Tables • XPathLearner • XSketches • Summary

  3. Selectivity Definition • The selectivity σ(p) of a path expression p is defined as the number of paths in the XML data tree that match the tag sequence in p • Example: σ(A/B/D) = 2 [Figure: XML data tree with nodes A, B, C, E, D, D]
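The definition on slide 3 can be illustrated with a minimal sketch. The tree layout below is an assumption (the figure is not preserved; the layout is chosen so that σ(A/B/D) = 2 as on the slide), the names `Node` and `selectivity` are hypothetical, and the sketch assumes a path may start at any node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list["Node"] = field(default_factory=list)


def selectivity(root: Node, path: str) -> int:
    """Count downward paths whose tag sequence equals `path`, e.g. "A/B/D"."""
    tags = path.split("/")

    def match_from(node: Node, i: int) -> int:
        # Number of ways tags[i:] can be matched starting exactly at `node`.
        if node.tag != tags[i]:
            return 0
        if i == len(tags) - 1:
            return 1
        return sum(match_from(child, i + 1) for child in node.children)

    def walk(node: Node) -> int:
        # Assumption: every node is a possible starting point of the path.
        return match_from(node, 0) + sum(walk(child) for child in node.children)

    return walk(root)


# Assumed layout: A has children B and C; B has two D children; C has one E.
tree = Node("A", [Node("B", [Node("D"), Node("D")]), Node("C", [Node("E")])])
print(selectivity(tree, "A/B/D"))  # 2
```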

  4. Motivation • Estimating the size of query results and intermediate results is necessary for effective query optimization • Knowing the selectivities of sub-queries helps identify cheap query evaluation plans • Internet context: quick feedback about the expected result size before evaluating the full query result

  5. Example XQuery expression: FOR $f IN document("personnel.xml")//department/faculty WHERE count($f/TA) > 0 AND count($f/RA) > 0 RETURN $f • This expression matches all faculty members that have at least one TA and one RA • One join is computed for every edge • Assumptions: the number of nodes is known; join algorithm: nested loop [Figure: query tree with nodes Department, Faculty, TA, RA]

  6. Evaluating the join • Method 1: Join 1: (Faculty) – TA; Join 2: (Result Join 1) – RA; Join 3: (Result Join 2) – Dep. Number of operations: Join 1: 3 * 2 = 6; Join 2: 1 * 7 = 7; Join 3: 1 * 1 = 1; Total = 14 • Method 2: Join 1: (Faculty) – Dep.; Join 2: (Result Join 1) – RA; Join 3: (Result Join 2) – TA. Number of operations: Join 1: 3 * 1 = 3; Join 2: 3 * 7 = 21; Join 3: 3 * 2 = 6; Total = 30 [Figure: data tree with one Department, three Faculty, one Scientist, one Secretary, three Name, seven RA and two TA nodes]
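The operation counts on slide 6 follow from multiplying the two input sizes of each nested-loop join. A minimal sketch, assuming the node counts and intermediate result sizes read off the slide (3 Faculty, 2 TA, 7 RA, 1 Department); the function name `nested_loop_cost` is hypothetical:

```python
def nested_loop_cost(start_size, steps):
    """Sum the comparisons of a chain of nested-loop joins.

    steps: list of (other_input_size, result_size) pairs in join order.
    """
    total = 0
    current = start_size
    for other_size, result_size in steps:
        total += current * other_size  # a nested loop compares every pair
        current = result_size          # the result feeds the next join
    return total


# Method 1: Faculty-TA, then RA, then Department.
method_1 = nested_loop_cost(3, [(2, 1), (7, 1), (1, 1)])
# Method 2: Faculty-Department, then RA, then TA.
method_2 = nested_loop_cost(3, [(1, 3), (7, 3), (2, 1)])
print(method_1, method_2)  # 14 30
```

The difference comes entirely from the intermediate result sizes, which is why an optimizer needs selectivity estimates before it can pick the cheaper order.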

  7. Outline • Motivation • Definition Selectivity Estimation • Algorithms for Selectivity Estimation • Path Trees • Markov Tables • XPathLearner • XSketches • Summary

  8. Representing XML data structure Path Trees Markov Tables

  9. Path Trees • XML data: <A> <B></B> <B> <D></D> </B> <C> <D></D> <E></E> <E></E> <E></E> </C> </A> • Path tree (path: frequency): A: 1; A/B: 2; A/B/D: 1; A/C: 1; A/C/D: 1; A/C/E: 3 • Problem: the path tree may become larger than the available memory, so the tree has to be summarized
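As a rough illustration of slide 9, the sketch below derives path-tree frequencies from the slide's XML fragment using Python's standard `xml.etree.ElementTree`. Representing the path tree as a flat mapping from rooted tag paths to counts is a simplification chosen here, and `path_tree_counts` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_tree_counts(xml_text):
    """Map each distinct root-to-node tag path to its frequency."""
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1  # one more element reachable by this tag path
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts


doc = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"
for path, freq in sorted(path_tree_counts(doc).items()):
    print("/".join(path), freq)
```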

  10. Summarizing a Path Tree • 4 different algorithms: • Sibling-* • Level-* • Global-* • No-* • Basic operation: delete the nodes with the lowest frequencies and replace them with a "*" (star-node) to preserve some structural information

  11. Sibling-* • Operation breakdown: (1) Mark the nodes with the lowest frequencies for deletion (2) Check the siblings; marked siblings are coalesced into one *-node (3) Traverse the tree and compute the average frequency of each *-node [Figure: path tree with nodes A to K before and after summarization; *-nodes are annotated with a total frequency f and a node count n, e.g. f=23, n=2 and f=6, n=2]

  12. Level-* • As before, delete the nodes with the lowest frequency • One *-node for every level [Figure: path tree with nodes A to K before and after Level-* summarization]

  13. Global-* • Delete the nodes with the lowest frequency • One *-node for the complete tree [Figure: path tree with nodes A to K before and after Global-* summarization]

  14. No-* • Low-frequency nodes are deleted and not replaced • The tree may become a forest with many roots • No-* conservatively assumes that nodes that do not exist in the summarized path tree did not exist in the original path tree

  15. Selectivity Estimation • Find all matching tags • Estimated selectivity = total frequency of these nodes • Examples: σ(A/B/F) = 15 + 6 = 21; σ(A/B/Z) = 6; σ(A/C/Z/K) = 11 [Figure: summarized path tree with nodes A, B, C, F, K and *-nodes]
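The matching rule on slide 15 can be sketched as follows. The figure is not preserved, so the tree below is an assumed layout chosen only to reproduce the slide's three examples; `PNode` and `estimate` are hypothetical names, and the sketch assumes paths are matched from the root and that a *-node matches any tag.

```python
from dataclasses import dataclass, field


@dataclass
class PNode:
    tag: str   # a regular tag, or "*" for a star-node
    freq: int
    children: list["PNode"] = field(default_factory=list)


def estimate(root, path):
    """Sum the frequencies of all nodes the path leads to."""
    tags = path.split("/")
    frontier = [root] if root.tag in (tags[0], "*") else []
    for tag in tags[1:]:
        # Follow children with the same tag; a *-node can stand in for any tag.
        frontier = [c for n in frontier for c in n.children if c.tag in (tag, "*")]
    return sum(n.freq for n in frontier)


# Assumed layout: A -> B -> {F, *} and A -> C -> * -> K.
summary = PNode("A", 1, [
    PNode("B", 9, [PNode("F", 15), PNode("*", 6)]),
    PNode("C", 13, [PNode("*", 8, [PNode("K", 11)])]),
])
print(estimate(summary, "A/B/F"))    # 21
print(estimate(summary, "A/B/Z"))    # 6
print(estimate(summary, "A/C/Z/K"))  # 11
```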

  16. Outline • Motivation • Definition Selectivity Estimation • Algorithms for Selectivity Estimation • Path Trees • Markov Tables • XPathLearner • XSketches • Summary

  17. What are Markov Tables? • A table containing all distinct paths in the data of length up to m and their selectivities • m ≥ 2 • Order: m − 1 • Markov table = Markov histogram [Figure: example data tree with tags A, B, C, D and the corresponding table of path frequencies]
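For m = 2, such a table holds the frequency of every single tag and of every parent/child tag pair. A minimal sketch, assuming the XML fragment from slide 9 as input (the data behind slide 17's figure is not preserved); `markov_table` is a hypothetical name:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def markov_table(xml_text):
    """Order-1 table (m = 2): counts of single tags and parent/child tag pairs."""
    table = Counter()

    def visit(elem):
        table[(elem.tag,)] += 1                 # path of length 1
        for child in elem:
            table[(elem.tag, child.tag)] += 1   # path of length 2
            visit(child)

    visit(ET.fromstring(xml_text))
    return table


doc = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"
for path, freq in sorted(markov_table(doc).items()):
    print("/".join(path), freq)
```

Unlike the path tree, the table merges the two D nodes into a single entry D: 2, because it no longer records the full rooted path.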

  18. Selectivity Estimation • The table provides selectivity estimates for all paths of length up to m • Assumption: the occurrence of a particular tag in a path depends only on the m−1 tags occurring before it • Selectivity estimation for longer path expressions is done with the following formula

  19. Selectivity Estimation • Markov chain: t1, t2, t3, …, tn • P[tn]: probability of tag tn occurring in the XML data tree • N: total number of nodes in the XML data tree • P[ti|ti+1]: probability of tag ti occurring before tag ti+1 • E: expectation for the occurrence of tag tn • E1: expectation for the occurrence of tag ti before tag ti+1

  20. Selectivity Estimation • σ(p) = selectivity of path p • Example [Formula and worked example not preserved in the transcript]
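The formula itself did not survive in the transcript. A common order-1 (m = 2) Markov estimate that is consistent with the definitions on slide 19 is σ(t1/t2/…/tn) ≈ f(t1/t2) · Π f(ti/ti+1) / f(ti) for i = 2 … n−1, which equals N · P[tn] · Π P[ti|ti+1] when P[ti|ti+1] is taken as f(ti/ti+1) / f(ti+1). The sketch below assumes that form; `estimate_selectivity` is a hypothetical name and the table values are those of the slide 9 document.

```python
def estimate_selectivity(table, path):
    """Order-1 Markov estimate from single-tag and tag-pair counts."""
    tags = tuple(path.split("/"))
    if len(tags) <= 2:
        # Paths of length up to m = 2 are looked up directly.
        return float(table.get(tags, 0))
    est = float(table.get(tags[:2], 0))
    for i in range(1, len(tags) - 1):
        single = table.get((tags[i],), 0)
        if single == 0:
            return 0.0  # a missing entry is treated as selectivity zero
        # Fraction of tags[i] nodes that have a tags[i + 1] child.
        est *= table.get((tags[i], tags[i + 1]), 0) / single
    return est


# Counts for <A><B/><B><D/></B><C><D/><E/><E/><E/></C></A>.
table = {
    ("A",): 1, ("B",): 2, ("C",): 1, ("D",): 2, ("E",): 3,
    ("A", "B"): 2, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1, ("C", "E"): 3,
}
print(estimate_selectivity(table, "A/C/E"))  # 3.0
print(estimate_selectivity(table, "A/B/D"))  # 1.0
```

On this small document both estimates coincide with the true selectivities; in general the independence assumption makes them approximations.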

  21. Summarizing Markov Tables • The paths with the lowest selectivity are deleted and replaced • 3 algorithms: • Suffix-* • Global-* • No-*

  22. Suffix-* • *-path: represents all deleted paths of length 1 • */*-path: represents all deleted paths of length 2 • SD: set of deleted paths of length 2 • Deleting a path of length 1: add it to the *-path • Deleting a path of length 2: add it to SD and look for paths with the same start tag; these are combined into a suffix-* path • Example: SD = {(A/C), (G/H)}, deleting (A/B) → (A/*) • Before checking SD, check the Markov table

  23. Global-* • *-path: represents all deleted paths of length 1 • */*-path: represents all deleted paths of length 2 • Deleting a path of length 1: add it to the *-path • Deleting a path of length 2: immediately add it to the */*-path

  24. No-* • Does not use *-paths • Low-frequency paths are simply discarded • If any of the required paths is not found in the Markov table, its selectivity is conservatively assumed to be zero

  25. Which method should be used? • Path Trees vs. Markov Tables: data has common structure → Markov table; data has no common structure → path trees • "*" vs. "No-*": path exists in the XML data → *-algorithm; path does not exist → No-* algorithm

  26. Outline • Motivation • Definition Selectivity Estimation • Algorithms for Selectivity Estimation • Path Trees • Markov Tables • XPathLearner • XSketches • Summary

  27. Weaknesses of previous methods • Off-line, scan of the entire data set • Limited to simple path expressions • Oblivious to workload distribution • Updates too expensive

  28. XPathLearner is... • An on-line self-tuning Markov histogram for XML path selectivity estimation • on-line: collects statistics from query feedback • self-tuning: learns Markov model from feedback, adapts to changing XML data • workload-aware • supports simple, single-value and multi-value path expressions

  29. Workflow • The system uses feedback to update the statistics for the queried path • Updates are based on the observed estimation error [Figure: workflow diagram with training data (initial training), histogram learner, histogram and selectivity estimator; edge labels: updates, observed estimation error, feedback (real selectivity), estimated selectivity]

  30. Basics • Relies on path trees as intermediate representation • Uses a Markov histogram of order (m−1) to store the path tree and the statistics • Henceforth m = 2: the table stores tag-tag pairs, tag-value pairs and single tags

  31. Data values • Problem: the number of distinct data values is very large, so the table may become larger than the available memory • Solution • Only the k most frequent tag-value pairs are stored exactly • All other pairs are aggregated into buckets according to some feature • The feature should distribute the pairs as uniformly as possible

  32. Example, k=1 • Data value v1 begins with the letter 'a', v2 with the letter 'b' [Figure: path tree with tags A, B, C and data values V1, V2, V3 with their counts]
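The top-k-plus-buckets idea from slides 31 and 32 might look like the sketch below. The input counts are hypothetical, the feature is the first letter of the value as in the slide's example, and answering a bucket lookup with the bucket's average count is an assumption made here; `summarize_values` and `lookup` are hypothetical names.

```python
from collections import Counter, defaultdict


def first_letter(value):
    return value[0].lower()


def summarize_values(pair_counts, k, feature=first_letter):
    """Keep the k most frequent (tag, value) pairs; bucket the rest by feature."""
    ranked = Counter(pair_counts).most_common()
    exact = dict(ranked[:k])
    buckets = defaultdict(lambda: [0, 0])  # (tag, feature) -> [count sum, pairs]
    for (tag, value), count in ranked[k:]:
        bucket = buckets[(tag, feature(value))]
        bucket[0] += count
        bucket[1] += 1
    return exact, dict(buckets)


def lookup(exact, buckets, tag, value, feature=first_letter):
    """Exact count if stored, else the average count of the matching bucket."""
    if (tag, value) in exact:
        return float(exact[(tag, value)])
    total, n = buckets.get((tag, feature(value)), (0, 0))
    return total / n if n else 0.0


# Hypothetical tag-value counts, not taken from the slides.
pairs = {("C", "alpha"): 5, ("B", "apple"): 1, ("B", "avocado"): 3, ("B", "berry"): 2}
exact, buckets = summarize_values(pairs, k=1)
print(lookup(exact, buckets, "C", "alpha"))  # 5.0
print(lookup(exact, buckets, "B", "apple"))  # 2.0
```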

  33. Selectivity Estimation • P[tn]: probability of tag tn occurring in the XML data tree • N: total number of nodes in the XML data tree • P[ti|ti+1]: probability of tag ti occurring before tag ti+1 • E: expectation for the occurrence of tag tn • E1: expectation for the occurrence of tag ti before tag ti+1 (if n = 2, ti+1 = tn)

  34. Selectivity Estimation • Simple path p = //t1/t2/…/tn • Analogous for single-value path p = //t1/t2/…/tn-1=vn-1 • Slightly more complicated for multi-value path

  35. Example • Real selectivity = 3

  36. Updates • Changes in the data require the statistics to be updated • Done via a query feedback tuple (p, σ) • p denotes the path • σ denotes the accurate selectivity of p • The feedback is attributed to all paths in p according to some strategy

  37. Learning process • Given • Initially empty Markov histogram f • Query feedback (p, σ) • Estimated selectivity σ̂ • Learn any unknown length-2 path • Update selectivities for known paths • Two strategies • Heavy-Tail-Rule • Delta-Rule

  38. Algorithm, Part 1 • Learn new paths of length up to 2 • UPDATE(Histogram f, Feedback (p, σ), Estimate σ̂): if |p| ≤ 2 then: if not exists f(p) then add entry f(p) = σ, else update f(p) • Example: σ̂(AD) = 1 (AD not in f), σ(AD) = 2 • Table (Tag, Tag, Count): (A, B, 6), (A, C, 3), new entry (A, D, 2)

  39. Algorithm, Part 2 • Learn longer paths (decompose into paths of length 2) • else: for each (ti, ti+1) ∈ p: if not exists f(ti, ti+1) then add entry f(ti, ti+1) = 1; f(ti, ti+1) ← update; endfor • The update of f(ti, ti+1) depends on the update strategy

  40. Example • σ̂(ACD) = 1, σ(ACD) = 5 • Decompose into AC and CD • AC is present → update the frequency • CD is not present → add f(CD) = 1, then update f(CD): f(CD) = 4 • Table (Tag, Tag, Count): (A, B, 6), (A, C, 5), (C, D, 4)

  41. Algorithm, Part 3 • Learn the frequency of single tags • for each ti ∈ p, i ≠ 1: if not exists f(ti) then add entry f(ti); f(ti) ← max{f(ti), Σ f(·, ti)}; endfor • Example: σ̂(AD) = 1 (AD not in f), σ(AD) = 2 • Single-tag table (Tag, Count): (A, 1), (B, 6), (C, 3), new entry (D, 2) • Pair table (Tag, Tag, Count): (A, B, 6), (A, C, 3), (A, D, 2)
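The three parts of UPDATE can be put together in a rough sketch. Several details are assumptions because the pseudocode is garbled in the transcript: short paths are simply overwritten with the observed selectivity, the single-tag rule is read as "at least the sum of all pair counts ending in that tag", and `even_split` is a hypothetical placeholder strategy, not one of the two rules from the slides. Tag-value pairs are ignored.

```python
def even_split(f, pairs, sigma, sigma_hat):
    # Hypothetical placeholder strategy: spread the error evenly over all pairs.
    share = (sigma - sigma_hat) / len(pairs)
    for pair in pairs:
        f[pair] += share


def update(f, path, sigma, sigma_hat, strategy=even_split):
    """f maps tag tuples to counts; path is a list of tags; sigma is the feedback."""
    p = tuple(path)
    if len(p) <= 2:
        # Part 1: paths of length up to 2 are stored directly (assumed overwrite).
        f[p] = sigma
    else:
        # Part 2: decompose into length-2 paths, adding unknown ones with count 1.
        pairs = list(zip(p, p[1:]))
        for pair in pairs:
            f.setdefault(pair, 1)
        strategy(f, pairs, sigma, sigma_hat)
    # Part 3: a tag occurs at least as often as it appears as a child in a pair.
    for tag in p[1:]:
        as_child = sum(c for key, c in f.items() if len(key) == 2 and key[1] == tag)
        f[(tag,)] = max(f.get((tag,), 0), as_child)
    return f


f = {("A",): 1, ("B",): 6, ("C",): 3, ("A", "B"): 6, ("A", "C"): 3}
update(f, ["A", "D"], sigma=2, sigma_hat=1)
print(f[("A", "D")], f[("D",)])  # 2 2
```

The printed values match the new entries (A, D, 2) and (D, 2) from the slide examples; the numbers in the longer A/C/D example depend on the update strategy and its parameters, so this sketch does not try to reproduce them.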

  42. Update strategies: Heavy-Tail-Rule • Attribute more of the estimation error to the end of the path • where • wi: weighting factors (increasing with i, e.g. 2i) • learning rate • W: normalized weight [Update formula not preserved in the transcript]
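Since the update formula is missing from the transcript, the sketch below only illustrates the idea of weighting the end of the path more heavily. It assumes an additive update f(ti, ti+1) ← f(ti, ti+1) + γ · (σ − σ̂) · wi / W with exponentially growing weights wi = 2^i and W = Σ wi; the name `heavy_tail_update`, the symbol `gamma`, the weights and the clamping at zero are all assumptions.

```python
def heavy_tail_update(f, pairs, sigma, sigma_hat, gamma=0.5):
    """Distribute the estimation error over the pairs, favoring later pairs."""
    weights = [2 ** i for i in range(1, len(pairs) + 1)]  # assumed: w_i = 2^i
    total = sum(weights)                                  # assumed: W = sum of w_i
    for pair, w in zip(pairs, weights):
        step = gamma * (sigma - sigma_hat) * w / total
        # Clamp at zero so that a count never becomes negative (assumption).
        f[pair] = max(f.get(pair, 1) + step, 0.0)
    return f


f = {("A", "C"): 3.0, ("C", "D"): 1.0}
heavy_tail_update(f, [("A", "C"), ("C", "D")], sigma=5, sigma_hat=1, gamma=1.0)
print(f)
```

With two pairs the weights are 2 and 4, so the last pair absorbs two thirds of the error of 4 and the first pair one third.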

  43. Update strategies: Delta-Rule • Error-reduction learning technique • Minimizes an error function E • The update to the term f(ti, ti+1) is proportional to the negative gradient of E with respect to f(ti, ti+1) • The learning rate determines the length of a step
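The gradient step can be worked out under stated assumptions: a squared-error function, the order-1 estimate written as a product of pair counts divided by single-tag counts, single-tag counts held fixed during the step, and a learning rate denoted γ here. None of these specifics are confirmed by the transcript.

```latex
% Assumed error function and order-1 estimate
E = (\sigma - \hat{\sigma})^2,
\qquad
\hat{\sigma} = \frac{\prod_{i=1}^{n-1} f(t_i, t_{i+1})}{\prod_{i=2}^{n-1} f(t_i)}

% Each pair count appears exactly once in the numerator, so
\frac{\partial \hat{\sigma}}{\partial f(t_j, t_{j+1})}
  = \frac{\hat{\sigma}}{f(t_j, t_{j+1})}

% Chain rule
\frac{\partial E}{\partial f(t_j, t_{j+1})}
  = -2\,(\sigma - \hat{\sigma})\,\frac{\hat{\sigma}}{f(t_j, t_{j+1})}

% Step against the gradient with step length \gamma
f(t_j, t_{j+1}) \leftarrow f(t_j, t_{j+1})
  + 2\gamma\,(\sigma - \hat{\sigma})\,\frac{\hat{\sigma}}{f(t_j, t_{j+1})}
```

The sign behaves as expected: an underestimate (σ > σ̂) increases the pair counts, an overestimate decreases them.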

  44. Evaluation • Good • on-line, adapts to changing data • workload-aware • after learning phase comparable to off-line methods • update overhead nearly constant • Bad • still restricted to XML trees, no support for idrefs

  45. Outline • Motivation • Definition Selectivity Estimation • Algorithms for Selectivity Estimation • Path Trees and Markov Tables • XPathLearner • XSketches • Summary

  46. Preliminaries: XML Data Graph • Legend: A: Author, P: Paper, B: Book, PB: Publisher, T: Title, N: Name [Figure: XML data graph with nodes P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and values V4, V8, V10, V11, V12, V13, V14]

  47. Preliminaries: Path Expressions • XPath expressions: simple: A/P/T; complex: A[B]/P/T • Result is a set [Figure: XML data graph as on slide 46, with T11 and T12 highlighted]

  48. Preliminaries: Path Expressions • XPath expressions: simple: A/P/T; complex: A[B]/P/T • Result is a set [Figure: XML data graph as on slide 46, with T11 highlighted]

  49. Preliminaries: Path Expressions • XPath expressions: simple: A/P/T; complex: A[B]/P/T • Result is a set: {T1, T2} [Figure: XML data graph as on slide 46, with T11 and T12 highlighted]

  50. Preliminaries • Motivation Selectivity Estimation over XML Data Graphs • Outline • XSketch Synopsis • Estimation Framework • XSketch Refinement Operations • Experiment
