Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
Computer Sciences Department, University of Wisconsin - Madison
Motivation
• XML enables Internet scale applications that query data from many sources
  • Niagara, Xyleme, …
• Queries over XML data use path expressions
• Optimizing these queries requires estimating the selectivity of the path expressions
• Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions
What is XML?
<readings>
  <play>
    <title>Pygmalion</title>
    <author>Bernard Shaw</author>
  </play>
  <novel>
    <title>David Copperfield</title>
    <author>Charles Dickens</author>
  </novel>
</readings>
Querying XML
FOR $n_auth IN document("*")//novel/author
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth
• Optimizing this query requires estimating the selectivity of the path expressions
• This requires information about the structure of the XML data
Goal of this Work
• Build database statistics that capture the structure of XML data
• Ensure that the statistics fit in a small amount of memory
  • For efficient query optimization
  • Important for Internet scale applications
• Use the statistics to estimate the selectivity of simple XML path expressions of the form //t1/t2/…/tn
Outline of Presentation
• Introduction
• Path Trees
• Markov Tables
• Performance Evaluation
• Conclusions
Path Trees
XML document:
<A>
  <B> </B>
  <B>
    <D> </D>
  </B>
  <C>
    <D> </D>
    <E> </E>
    <E> </E>
    <E> </E>
  </C>
</A>
Path tree (node frequency in parentheses):
A (1)
  B (2)
    D (1)
  C (1)
    D (1)
    E (3)
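The path tree on this slide can be rebuilt mechanically from the document. The following is a minimal sketch, not the authors' implementation: the names `PathTreeNode`, `build_path_tree`, and `selectivity` are hypothetical, and the estimate for `//t1/…/tn` is assumed to be the sum of the frequencies of all tree nodes reached by matching the tag sequence starting at any node.

```python
import xml.etree.ElementTree as ET

# The example document from the slide.
DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"


class PathTreeNode:
    """One node per distinct root-to-element tag path (hypothetical class)."""

    def __init__(self, tag):
        self.tag = tag
        self.freq = 0        # number of elements reachable by this path
        self.children = {}   # child tag -> PathTreeNode


def build_path_tree(xml_text):
    """Build a path tree by walking the parsed document once."""
    root_elem = ET.fromstring(xml_text)
    root = PathTreeNode(root_elem.tag)

    def visit(elem, node):
        node.freq += 1
        for child in elem:
            if child.tag not in node.children:
                node.children[child.tag] = PathTreeNode(child.tag)
            visit(child, node.children[child.tag])

    visit(root_elem, root)
    return root


def _match(node, tags):
    """Frequency of the node reached by following `tags` from `node`, else 0."""
    if node.tag != tags[0]:
        return 0
    if len(tags) == 1:
        return node.freq
    child = node.children.get(tags[1])
    return _match(child, tags[1:]) if child is not None else 0


def selectivity(root, tags):
    """Selectivity of //t1/t2/.../tn: try the match at every tree node."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += _match(node, tags)
        stack.extend(node.children.values())
    return total


tree = build_path_tree(DOC)
print(selectivity(tree, ["D"]))       # //D
print(selectivity(tree, ["C", "E"]))  # //C/E
```

On the slide's document, `//D` matches the D node under B and the D node under C, so the two frequencies are added.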
Summarizing Path Trees
• Path trees contain all the information needed for selectivity estimation
• Problem: May not fit in available memory
  • Small available memory
  • Internet scale
• Remove low frequency nodes
• Removed nodes replaced with *-nodes
  • Tag name: * meaning "any tag"
  • Frequency: Average frequency of replaced nodes
• Sibling-*, Level-*, Global-*, No-*
Sibling-* Summarization
[Animated figure; the parent-child edges are not recoverable from the extracted text]
Original path tree, node frequencies by level:
  Level 1: A 1
  Level 2: B 13, C 9
  Level 3: D 7, E 5, F 15, G 10, H 6
  Level 4: I 2, J 4, K 11, K 12
Frames, in order:
  1. A (1), I (2), and J (4) are singled out; I and J are coalesced into * (f=6, n=2)
  2. E (5), H (6), and D (7) are singled out; D and E are coalesced into * (f=12, n=2)
  3. C (9) and G (10) are singled out; G and H are coalesced into * (f=16, n=2)
  4. The two K nodes are merged into K (f=23, n=2)
  5. Final tree: A 1, B 13, C 9, F 15, K (f=23, n=2), and three *-nodes with frequencies 6, 8, and 3
• *-nodes represent deleted sibling nodes
• Memory saved by coalescing nodes
• Try to retain as much information as possible about the deleted nodes
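The f and n labels in the sibling-* frames can be checked with a few lines of arithmetic. This is a minimal sketch of the bookkeeping only, not the full summarization algorithm: `StarNode` and `coalesce` are hypothetical names, and the sibling groupings (I with J, D with E, G with H) are read off the frames rather than from recoverable tree edges.

```python
from dataclasses import dataclass


@dataclass
class StarNode:
    """A coalesced node: f is the total frequency, n the number of nodes merged."""
    f: int
    n: int

    @property
    def average(self):
        # Frequency reported for the node: average of the replaced nodes.
        return self.f / self.n


def coalesce(freqs):
    """Merge a group of deleted sibling nodes into one coalesced node."""
    return StarNode(f=sum(freqs), n=len(freqs))


star_ij = coalesce([2, 4])    # I and J
star_de = coalesce([7, 5])    # D and E
star_gh = coalesce([10, 6])   # G and H
merged_k = coalesce([11, 12])  # the two K nodes, same bookkeeping

for name, node in [("I,J", star_ij), ("D,E", star_de), ("G,H", star_gh)]:
    print(name, node.f, node.n, node.average)
```

The three averages correspond to the *-node frequencies shown in the final frame, and the merged K node carries f=23, n=2.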
Level-* Summarization
[Animated figure; starts from the original path tree: A 1; B 13, C 9; D 7, E 5, F 15, G 10, H 6; I 2, J 4, K 11, K 12]
Frames, in order:
  1. A (1), D (7), E (5), H (6), I (2), and J (4) are singled out
  2. Final tree: A 1; B 13, C 9; *, F, G with frequencies 6, 15, 10; *, K, K with frequencies 3, 11, 12
• Less information about deleted nodes than sibling-*
• Deletes fewer nodes than sibling-*
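The two *-node frequencies in the final level-* frame are consistent with one *-node per tree level, holding the average frequency of the nodes deleted at that level. A minimal sketch under that reading; `level_star` is a hypothetical helper, and the level numbers (root = 1) are an assumption taken from the row layout of the figure.

```python
def level_star(deleted):
    """Map each level to the average frequency of the nodes deleted there.

    `deleted` is a list of (tag, level, frequency) tuples.
    """
    by_level = {}
    for _tag, level, freq in deleted:
        by_level.setdefault(level, []).append(freq)
    return {level: sum(freqs) / len(freqs) for level, freqs in by_level.items()}


# Nodes missing from the final frame, with assumed levels.
DELETED = [("D", 3, 7), ("E", 3, 5), ("H", 3, 6), ("I", 4, 2), ("J", 4, 4)]
print(level_star(DELETED))
```

Five nodes are replaced by two *-nodes here, whereas the sibling-* frames remove six nodes and introduce three *-nodes.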
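Global-* Summarization
[Animated figure; starts from the original path tree: A 1; B 13, C 9; D 7, E 5, F 15, G 10, H 6; I 2, J 4, K 11, K 12]
Frames, in order:
  1. A (1), D (7), E (5), H (6), I (2), and J (4) are singled out
  2. Final tree: B 13, C 9, D 7, F 15, G 10, H 6, K 11, K 12, and a single *-node with frequency 3; A, E, I, and J no longer appear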
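The final global-* frame is consistent with deleting the lowest-frequency nodes until the tree fits a budget and replacing them all with one *-node holding their average frequency. The sketch below assumes memory is measured as a node count and that the budget is 9 nodes; both are assumptions made for illustration, and `global_star` is a hypothetical name.

```python
def global_star(nodes, budget):
    """Delete lowest-frequency nodes until the node count fits `budget`.

    `nodes` is a list of (tag, frequency) pairs. One extra slot is charged
    for the single global *-node as soon as anything has been deleted.
    Returns (kept nodes, deleted nodes, *-node frequency or None).
    """
    kept = sorted(nodes, key=lambda node: node[1])
    deleted = []
    while kept and len(kept) + (1 if deleted else 0) > budget:
        deleted.append(kept.pop(0))
    star = sum(f for _, f in deleted) / len(deleted) if deleted else None
    return kept, deleted, star


# Node labels of the original path tree on the slides.
NODES = [("A", 1), ("B", 13), ("C", 9), ("D", 7), ("E", 5), ("F", 15),
         ("G", 10), ("H", 6), ("I", 2), ("J", 4), ("K", 11), ("K", 12)]

kept, deleted, star = global_star(NODES, budget=9)
print([tag for tag, _ in deleted], star)
```

With this budget the sketch removes A, I, J, and E and assigns the *-node a frequency of 3, which agrees with the final frame.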
No-* Summarization
[Animated figure; starts from the original path tree: A 1; B 13, C 9; D 7, E 5, F 15, G 10, H 6; I 2, J 4, K 11, K 12]
Frames, in order:
  1. A (1), D (7), E (5), H (6), I (2), and J (4) are singled out
  2. Final tree: B 13, C 9, D 7, E 5, F 15, G 10, H 6, K 11, K 12, with no *-node; A, I, and J no longer appear
• Memory savings similar to global-*
• Conservative assumption about deleted nodes
Outline
• Introduction
• Path Trees
• Markov Tables
• Performance Evaluation
• Conclusions
Markov Tables
• A table of all distinct paths of length up to m and their frequencies
• For paths of length greater than m, combine paths from the Markov table
• Example: $f(A/B/C/D) = f(A/B/C) \times \frac{f(B/C/D)}{f(B/C)}$
• Uses "short memory" or "Markov" property
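The example formula generalizes to any path longer than m by sliding a window of m tags along the path. A minimal sketch, assuming the table is a plain dictionary keyed by `/`-joined tag paths; `estimate` is a hypothetical name and the frequencies in `TABLE` are made up for illustration, not taken from the talk.

```python
def estimate(path, table, m):
    """Estimate f(path) from a Markov table holding paths of length <= m.

    For a path t1/.../tn with n > m, multiply f(t1/.../tm) by the ratios
    f(t_i/.../t_{i+m-1}) / f(t_i/.../t_{i+m-2}) for each later window.
    """
    tags = path.split("/")
    if len(tags) <= m:
        return float(table.get(path, 0))  # stored exactly
    result = float(table.get("/".join(tags[:m]), 0))
    for i in range(1, len(tags) - m + 1):
        window = "/".join(tags[i:i + m])        # m tags starting at position i
        prefix = "/".join(tags[i:i + m - 1])    # the same window minus its last tag
        denominator = table.get(prefix, 0)
        if denominator == 0:
            return 0.0
        result *= table.get(window, 0) / denominator
    return result


# Hypothetical frequencies, for illustration only.
TABLE = {"A/B/C": 10, "B/C/D": 6, "B/C": 20}
print(estimate("A/B/C/D", TABLE, m=3))  # f(A/B/C) * f(B/C/D) / f(B/C)
```

For m = 3 and the path A/B/C/D the loop runs once, which reproduces the slide's formula exactly.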
Markov Tables
[Figure: path tree with node frequencies by level: A 1; B 11, C 6, D 4; C 9, D 7; D 8]
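A Markov table can be derived from a path tree by adding each node's frequency to every path of up to m tags that ends at that node. The sketch below uses the node labels from the figure, but the parent-child edges in `TREE` are an assumed reading of the flattened figure (A over B, C, D; B over C, D; that C over D), and `markov_table` is a hypothetical name.

```python
from collections import Counter

# (tag, frequency, children); the edges are an assumption, see the lead-in.
TREE = ("A", 1, [
    ("B", 11, [
        ("C", 9, [("D", 8, [])]),
        ("D", 7, []),
    ]),
    ("C", 6, []),
    ("D", 4, []),
])


def markov_table(tree, m):
    """Frequencies of all distinct tag paths of length 1..m in a path tree."""
    table = Counter()

    def visit(node, ancestors):
        tag, freq, children = node
        path = ancestors + [tag]
        # Every suffix of the root-to-node path with at most m tags ends here.
        for k in range(1, min(m, len(path)) + 1):
            table["/".join(path[-k:])] += freq
        for child in children:
            visit(child, path)

    visit(tree, [])
    return table


table = markov_table(TREE, m=2)
for path in sorted(table):
    print(path, table[path])
```

Under the assumed edges, the length-2 paths A/D, B/D, and C/D come out with frequencies 4, 7, and 8, which agrees with the (AD,4), (BD,7), and (CD,8) entries in the suffix-* frames.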
Summarizing Markov Tables
• Exact selectivities for paths of length up to m
• Approximate selectivities for paths longer than m
• Problem: May not fit in available memory
• Remove low frequency paths
  • Discard removed paths of length > 2
  • Replace removed paths of length 1 or 2 with *-paths
• Suffix-*, Global-*, No-*
Suffix-* Summarization
[Animated figure; only the contents of SD, the set of deleted paths of length 2, survive in the extracted text]
SD over successive frames: { } → { (AD,4) } → { } → { (BD,7) } → { (BD,7), (CD,8) }