Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
Computer Sciences Department, University of Wisconsin - Madison

Motivation
• XML enables Internet scale applications that query data from many sources
  • Niagara, Xyleme, …
• Queries over XML data use path expressions
• Optimizing these queries requires estimating the selectivity of the path expressions
• Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions

What is XML?
<readings>
  <play>
    <title>Pygmalion</title>
    <author>Bernard Shaw</author>
  </play>
  <novel>
    <title>David Copperfield</title>
    <author>Charles Dickens</author>
  </novel>
</readings>

Querying XML
FOR $n_auth IN document("*")//novel/author
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth
• Optimizing this query requires estimating the selectivity of the path expressions
• This requires information about the structure of the XML data

Goal of this Work
• Build database statistics that capture the structure of XML data
• Ensure that the statistics fit in a small amount of memory
  • For efficient query optimization
  • Important for Internet scale applications
• Use the statistics to estimate the selectivity of simple XML path expressions of the form //t1/t2/…/tn

Outline of Presentation
• Introduction
• Path Trees
• Markov Tables
• Performance Evaluation
• Conclusions

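Path Trees
<A>
  <B> </B>
  <B> <D> </D> </B>
  <C> <D> </D> <E> </E> <E> </E> <E> </E> </C>
</A>
[Figure: path tree for this document, with one node per distinct root-to-element path, labeled with its frequency: A=1; B=2 and C=1 under A; D=1 under B; D=1 and E=3 under C.]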
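The path tree for the example document can be built in one pass over the XML. The following is a minimal sketch, not the authors' implementation: the names `PathTreeNode`, `build_path_tree`, and `selectivity` are hypothetical, and the selectivity of `//t1/…/tn` is taken here to be the number of elements reached by that path, read exactly from an unsummarized tree.

```python
import xml.etree.ElementTree as ET


class PathTreeNode:
    """One node per distinct root-to-element tag path (hypothetical helper)."""

    def __init__(self, tag):
        self.tag = tag
        self.freq = 0        # number of elements reached by this path
        self.children = {}   # child tag -> PathTreeNode


def build_path_tree(xml_text):
    """Build a path tree from an XML string."""
    root_elem = ET.fromstring(xml_text)
    root = PathTreeNode(root_elem.tag)

    def visit(elem, node):
        node.freq += 1
        for child in elem:
            if child.tag not in node.children:
                node.children[child.tag] = PathTreeNode(child.tag)
            visit(child, node.children[child.tag])

    visit(root_elem, root)
    return root


def selectivity(root, tags):
    """Count elements matching //t1/t2/.../tn by summing the frequencies
    of all path tree nodes whose root path ends with the given tags."""
    tags = list(tags)
    total = 0

    def walk(node, path):
        nonlocal total
        path = path + [node.tag]
        if path[-len(tags):] == tags:
            total += node.freq
        for child in node.children.values():
            walk(child, path)

    walk(root, [])
    return total


DOC = ("<A><B></B><B><D></D></B>"
       "<C><D></D><E></E><E></E><E></E></C></A>")
tree = build_path_tree(DOC)
print(selectivity(tree, ["D"]))       # //D
print(selectivity(tree, ["C", "E"]))  # //C/E
```

On the example document, `//D` matches both D nodes of the tree, so their frequencies are added, while `//C/E` matches the single E node under C.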

Summarizing Path Trees
• Path trees contain all the information needed for selectivity estimation
• Problem: May not fit in available memory
  • Small available memory
  • Internet scale
• Remove low frequency nodes
• Removed nodes replaced with *-nodes
  • Tag name: * meaning "any tag"
  • Frequency: Average frequency of replaced nodes
• Sibling-*, Level-*, Global-*, No-*

Sibling-* Summarization
[Figure: animation over an example path tree with node frequencies, by level, A=1; B=13, C=9; D=7, E=5, F=15, G=10, H=6; I=2, J=4, K=11, K=12. The frames single out the low-frequency nodes in increasing order of frequency (I, J, E, H, D, C, G) and coalesce siblings into *-nodes: I and J into * (f=6, n=2), D and E into * (f=12, n=2), G and H into * (f=16, n=2). The two K nodes merge into K (f=23, n=2). The final frame shows A=1, B=13, C=9, F=15, K (f=23, n=2), and three *-nodes with average frequencies 6, 8, and 3.]
• *-nodes represent deleted sibling nodes
• Memory saved by coalescing nodes
• Try to retain as much information as possible about the deleted nodes
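The f and n labels on the *-nodes in the sibling-* frames can be reproduced with a few lines of arithmetic. This is a sketch of that bookkeeping only, assuming f is the total frequency of the coalesced siblings and n is their count; `coalesce_siblings` is a hypothetical name, and the selection of which nodes to delete is not modeled.

```python
def coalesce_siblings(deleted):
    """Coalesce sibling nodes marked for deletion into a single *-node.

    `deleted` is a list of (tag, frequency) pairs. The *-node keeps the
    total frequency f, the number of coalesced nodes n, and their
    average frequency, which is what the *-node finally reports.
    """
    total = sum(freq for _, freq in deleted)
    count = len(deleted)
    return {"tag": "*", "f": total, "n": count, "avg": total / count}


# Sibling groups taken from the slide frames.
print(coalesce_siblings([("I", 2), ("J", 4)]))   # f=6,  n=2
print(coalesce_siblings([("D", 7), ("E", 5)]))   # f=12, n=2
print(coalesce_siblings([("G", 10), ("H", 6)]))  # f=16, n=2
```

Keeping f and n separately until the end, rather than storing only the average, allows further siblings to be folded into an existing *-node without losing accuracy.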

Level-* Summarization
[Figure: animation over the example path tree from the sibling-* slides. The low-frequency nodes D=7, E=5, H=6, I=2, and J=4 are singled out. The final frame shows A=1, B=13, C=9, F=15, G=10, K=11, K=12, and two *-nodes: one with frequency 6 on the level of F and G, and one with frequency 3 on the level of the K nodes.]
• Less information about deleted nodes than sibling-*
• Deletes fewer nodes than sibling-*

Global-* Summarization
[Figure: animation over the example path tree from the sibling-* slides. The final frame shows B=13, C=9, D=7, H=6, F=15, G=10, K=11, K=12, and a single *-node with frequency 3.]

No-* Summarization
[Figure: animation over the example path tree from the sibling-* slides. The final frame shows B=13, C=9, D=7, E=5, H=6, F=15, G=10, K=11, K=12, with no *-nodes.]
• Memory savings similar to global-*
• Conservative assumption about deleted nodes

Outline
• Introduction
• Path Trees
• Markov Tables
• Performance Evaluation
• Conclusions

Markov Tables
• A table of all distinct paths of length up to m and their frequencies
• For paths of length greater than m, combine paths from the Markov table
• Example: $f(A/B/C/D) = f(A/B/C) \cdot \frac{f(B/C/D)}{f(B/C)}$
• Uses "short memory" or "Markov" property
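The combination rule in the example generalizes to any path longer than m: start from the first m tags, then slide a window of length m along the path, multiplying by the window's frequency and dividing by the frequency of the part it shares with the previous window. The sketch below assumes m ≥ 2 and a table keyed by tuples of tags; `estimate_selectivity` is a hypothetical name, and the frequencies in the usage example are made up for illustration, not taken from the talk.

```python
def estimate_selectivity(path, table, m):
    """Estimate f(t1/.../tn) from a Markov table of paths up to length m.

    `table` maps tuples of tags to frequencies. Paths of length <= m are
    looked up directly; longer paths are estimated by chaining windows.
    Missing entries are treated as frequency 0 (an assumption).
    """
    path = tuple(path)
    if len(path) <= m:
        return float(table.get(path, 0))

    estimate = float(table.get(path[:m], 0))
    for i in range(1, len(path) - m + 1):
        window = path[i:i + m]   # next m tags
        overlap = window[:-1]    # m-1 tags shared with the previous window
        denom = table.get(overlap, 0)
        if denom == 0:
            return 0.0
        estimate *= table.get(window, 0) / denom
    return estimate


# Hypothetical frequencies, for illustration only.
table = {
    ("A", "B", "C"): 10,
    ("B", "C", "D"): 6,
    ("B", "C"): 12,
}
# f(A/B/C/D) = f(A/B/C) * f(B/C/D) / f(B/C) = 10 * 6 / 12
print(estimate_selectivity(["A", "B", "C", "D"], table, m=3))
```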

Markov Tables
[Figure: example path tree with node frequencies, by level, A=1; B=11, C=6, D=4; C=9, D=7; D=8.]
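A Markov table can be derived from a path tree by adding each node's frequency to every path of up to m tags that ends at that node. The sketch below uses the node frequencies from the figure, but the edges are an assumption: the transcript preserves only the levels, so B, C, and D are taken as children of A, C=9 and D=7 as children of B, and D=8 as the child of that lower C. `build_markov_table` is a hypothetical name.

```python
from collections import Counter

# (tag, frequency, children); the tree shape is assumed, see above.
TREE = ("A", 1, [
    ("B", 11, [
        ("C", 9, [("D", 8, [])]),
        ("D", 7, []),
    ]),
    ("C", 6, []),
    ("D", 4, []),
])


def build_markov_table(node, m, ancestors=(), table=None):
    """Return a Counter mapping tag tuples of length <= m to frequencies."""
    if table is None:
        table = Counter()
    tag, freq, children = node
    path = ancestors + (tag,)
    # Every suffix of the root path, up to m tags, ends at this node.
    for length in range(1, min(m, len(path)) + 1):
        table[path[-length:]] += freq
    for child in children:
        build_markov_table(child, m, path, table)
    return table


markov = build_markov_table(TREE, m=2)
for key in sorted(markov, key=lambda k: (len(k), k)):
    print("/".join(key), markov[key])
```

Under the assumed tree shape, the length-2 entries for A/D, B/D, and C/D come out as 4, 7, and 8.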

Summarizing Markov Tables
• Exact selectivities for paths of length up to m
• Approximate selectivities for paths longer than m
• Problem: May not fit in available memory
• Remove low frequency paths
  • Discard removed paths of length > 2
  • Replace removed paths of length 1 or 2 with *-paths
• Suffix-*, Global-*, No-*

Suffix-* Summarization
• SD: set of deleted paths of length 2
[Figure: animation of suffix-* summarization on a Markov table. Across the frames, SD goes from { } to { (AD,4) }, back to { }, then to { (BD,7) } and finally { (BD,7), (CD,8) }.]
