FoXtrot: Distributed Structural & Value XML Filtering

Iris Miliaraki*
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
*Supported by Microsoft Research through the European PhD Scholarship Programme

Outline of the talk
• XML Filtering scenario
• FoXtrot system
• Distributed structural matching
• Distributed value matching
• Experimental evaluation
• Sum up and future work

XML Filtering scenario
• Centralized: YFilter, Index-Filter, XTrie, FiST, XPush
• Distributed (parallel/hierarchical): ONYX, Gong et al. [ICDE05], Li et al. [ICDCS08], Snoeren [SOSP 2001]
[Figure: subscribers (XPath/XQuery queries) and publishers connected to an XML filtering system]

XML Filtering scenario: mesh or tree-based overlays
• Load imbalances
• Potential bottlenecks due to centralized control
[Figure: subscribers (XPath/XQuery queries) and publishers attached to a mesh or tree-based overlay]

FoXtrot: Filtering of XML data using structured overlay networks
• Load balanced
• Scalable
• Fully distributed
[Figure: subscribers (XPath/XQuery queries) and publishers attached to a DHT]

XML data model - example
  <bib>
    <article title="XML Filtering" conf="VLDB" year="2007">
      <author institute="Harvard">John Smith</author>
    </article>
  </bib>
Queries (structural matching and value matching)
  Q1: /bib/*/author[text()="John Smith"]
  Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek]
  Q3: /bib/article[@conf=www]
  Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]

Automata-based approaches
• XFilter and YFilter, ONYX, XTrie, IndexFilter, FiST etc.
• Main idea
  • Construct an automaton from a set of XPath/XQuery queries
  • Use it as a matching engine against the XML documents
Structural matching!

Example NFA (YFilter)
  Q1: /bib/phdthesis/year = '2010'
  Q2: /bib/proceedings/school = 'Univ. of Athens'
  Q3: /bib/proceedings/title = 'XML Dissemination'
  Q4: /bib/*/author = 'Michael Smith'
  Q5: //*/cite [@id = 12743]
[Figure: NFA with states 0-11 built from Q1-Q5; transitions labelled bib, phdthesis, year, proceedings, school, title, *, author, ε and cite; accepting states annotated with Q1-Q5]

Designing FoXtrot
• Moving to a distributed solution
• Utilize automata-based techniques
• Instead of a single centralized automaton, the automaton is shared by the DHT peers
• Design and employ methods for filtering XML data against a distributed automaton

Distributing the NFA on top of DHT
[Figure: the NFA states 0-11 are assigned to the FoXtrot peers P1-P10 on the DHT ring; the last build of the slide is labelled ℓ=0 and ℓ=1]
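
The placement shown in the figure can be sketched as follows. This is a minimal illustration, not FoXtrot's code: it assumes that each NFA state is identified by the label path leading to it, that keys and peer names are hashed with SHA-1, and that a key is stored at the first peer clockwise from its hash. The key strings, the peer names and the `Ring` class are hypothetical, and FreePastry's own routing rule (numerically closest node ID) is not reproduced here.

```python
import bisect
import hashlib


def ring_id(name: str) -> int:
    """Hash a string to a 160-bit position on the ring (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class Ring:
    """Toy identifier ring: a key belongs to its clockwise successor peer."""

    def __init__(self, peers):
        self._entries = sorted((ring_id(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._entries]

    def responsible_peer(self, key: str) -> str:
        idx = bisect.bisect_left(self._positions, ring_id(key))
        # Wrap around to the first peer when the key hashes past the last one.
        return self._entries[idx % len(self._entries)][1]


PEERS = [f"P{i}" for i in range(1, 11)]

# Hypothetical keys for the 12 states of the example NFA (label paths).
STATE_KEYS = [
    "start", "bib", "bib/phdthesis", "bib/phdthesis/year",
    "bib/proceedings", "bib/proceedings/school", "bib/proceedings/title",
    "bib/*", "bib/*/author", "//", "//*", "//*/cite",
]

ring = Ring(PEERS)
placement = {key: ring.responsible_peer(key) for key in STATE_KEYS}

if __name__ == "__main__":
    for key, peer in placement.items():
        print(f"{key:28s} -> {peer}")
```

Because placement depends only on the hash of the key, any peer can locate the peer responsible for a state without central coordination.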

Load balancing in FoXtrot
Static replication
• Create a fixed number r of replicas for each state
• The load previously carried by one peer is now shared by r+1 peers

Load balancing in FoXtrot (cont.)
Assumption: the frequency f with which an NFA state is visited during filtering is inversely proportional to the NFA depth d of this state.
Dynamic replication
• Create r/d replicas for each state, where d is the NFA depth of the state
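
The two replication rules can be compared with a few lines of arithmetic. The sketch below makes two assumptions that the slides do not state: r/d is rounded up to a whole number of replicas, and the start state is treated as depth 1 to avoid dividing by zero. The function names are illustrative.

```python
import math


def static_replicas(r: int) -> int:
    """Static replication: every state gets the same r replicas."""
    return r


def dynamic_replicas(r: int, depth: int) -> int:
    """Dynamic replication: r/d replicas for a state at NFA depth d.

    Assumptions: round up, and clamp the depth to at least 1.
    """
    return math.ceil(r / max(depth, 1))


def load_per_peer(load: float, replicas: int) -> float:
    """Load of one state spread over its original peer plus its replicas."""
    return load / (replicas + 1)


if __name__ == "__main__":
    r = 8
    for depth in range(1, 6):
        print(depth, static_replicas(r), dynamic_replicas(r, depth))
```

Under the stated assumption that shallow states are visited most often, the dynamic rule concentrates replicas near the root and saves storage for deep, rarely visited states.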

Centralized NFA execution (YFilter)
Incoming XML document
  <bib>
    <proceedings>
      <school>Univ. of Athens</school>
      <title>XML and DHTs</title>
    </proceedings>
  </bib>
Runtime stack (active states, bottom to top): {0} at the start of the document, {1, 9, 10} after <bib>, {4, 7, 9, 10} after <proceedings>, then {5, 9, 10} for <school> and {6, 9, 10} for <title>
These paths can be executed in parallel!
[Figure: the example NFA next to the incoming document and the runtime stack]
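
The stack-based execution can be reproduced with a short sketch. The transition table below is a reading of the example NFA figure (states 0-11 for Q1-Q5) and should be treated as an assumption; the class and variable names are illustrative. Only structural matching is performed, so the value predicates of the queries are ignored.

```python
import xml.sax

# Assumed transition table for the example NFA; "*" is the wildcard label.
TRANSITIONS = {
    0: {"bib": [1]},
    1: {"phdthesis": [2], "proceedings": [4], "*": [7]},
    2: {"year": [3]},
    4: {"school": [5], "title": [6]},
    7: {"author": [8]},
    9: {"*": [9, 10]},   # self-loop on * models the descendant axis of Q5
    10: {"cite": [11]},
}
EPSILON = {0: [9]}        # epsilon transition introduced by the leading //
ACCEPTING = {3: "Q1", 5: "Q2", 6: "Q3", 8: "Q4", 11: "Q5"}


class NFARunner(xml.sax.ContentHandler):
    """Drives the NFA with SAX events and keeps a stack of active state sets."""

    def __init__(self):
        super().__init__()
        self.stack = [{0}]
        self.trace = []       # (element name, active states after its start tag)
        self.matched = set()  # queries whose structural path was matched

    def startElement(self, name, attrs):
        next_states = set()
        for state in self.stack[-1]:
            # Follow the state itself and the targets of its epsilon transitions.
            for source in [state, *EPSILON.get(state, [])]:
                moves = TRANSITIONS.get(source, {})
                next_states.update(moves.get(name, []))
                next_states.update(moves.get("*", []))
        self.stack.append(next_states)
        self.trace.append((name, sorted(next_states)))
        self.matched.update(ACCEPTING[s] for s in next_states if s in ACCEPTING)

    def endElement(self, name):
        # Backtrack to the active states of the parent element.
        self.stack.pop()


DOCUMENT = b"""<bib>
  <proceedings>
    <school>Univ. of Athens</school>
    <title>XML and DHTs</title>
  </proceedings>
</bib>"""


def run(document: bytes) -> NFARunner:
    runner = NFARunner()
    xml.sax.parseString(document, runner)
    return runner


if __name__ == "__main__":
    result = run(DOCUMENT)
    for name, states in result.trace:
        print(name, states)
    print(sorted(result.matched))
```

With this table the trace reproduces the stack contents listed on the slide, and the paths of Q2 and Q3 are matched structurally. Each active state in a stack frame can be advanced independently, which is what makes parallel execution possible.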

Distributed NFA execution – Iterative
Publisher becomes overloaded!
[Figure: DHT ring with peers P1-P10, numbered messages and the runtime stack]

Distributed NFA execution – Recursive
Several parallel executions
[Figure: DHT ring with peers P1-P10 and numbered messages]

Distributed NFA
Structural matching! What about value matching?
I. Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using Automata on top of Structured Overlay Networks. In WWW 2008.

What about value matching?
• Automata-based approaches are efficient for structural matching
• Besides defining a structural path, queries also contain value-based predicates, e.g. /bib/phdthesis[@year>2005]/author[@nationality=greek]
• We want FoXtrot to scale with both the size of the query set and the number of predicates per query

Definitions
• Attribute predicates: element[@attr op value], e.g. /bib/phdthesis[@published=2007]
• Textual predicates: element[text() op value], e.g. /bib/*/author[text()="John Smith"]
So, how can we deal with value matching along with structural matching?

Direct evaluation with automaton/trie
• Treat predicates as elements!
  Q1: /dblp/phdthesis[@year=2005]/author[@nationality=greek]
  Q2: /bib/*/author[text()=Michael Smith]
  Q3: /bib/article/conference[text()=WWW 2009]
• Huge increase of NFA states!
• Destroy sharing of path expressions!
[Figure: NFA/trie in which the predicates year, nationality and text() appear as additional transitions after phdthesis, author and conference]
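
The loss of sharing can be made concrete by counting trie states. The sketch below indexes three hypothetical queries that all follow the path /bib/article/author but carry different predicates, once by structure only and once with every predicate turned into an extra transition after its element. The encoding of predicates as steps and all names are assumptions made for illustration.

```python
def count_states(paths):
    """Number of trie nodes (root included) needed to index the step sequences."""
    root = {}
    count = 1
    for steps in paths:
        node = root
        for step in steps:
            if step not in node:
                node[step] = {}
                count += 1
            node = node[step]
    return count


# Each query is a list of (element, predicates) steps; illustrative only.
QUERIES = [
    [("bib", []), ("article", ["@year=2009"]), ("author", ["@degree-from=UOA"])],
    [("bib", []), ("article", ["@conf=www"]), ("author", [])],
    [("bib", []), ("article", []), ("author", ["text()=John Smith"])],
]


def structure_only(query):
    """Keep only the element names of the path."""
    return [name for name, _ in query]


def predicates_as_elements(query):
    """Insert every predicate as an additional transition after its element."""
    steps = []
    for name, predicates in query:
        steps.append(name)
        steps.extend(predicates)
    return steps


shared = count_states(structure_only(q) for q in QUERIES)
expanded = count_states(predicates_as_elements(q) for q in QUERIES)

if __name__ == "__main__":
    print("structure only:", shared)
    print("predicates as elements:", expanded)
```

Here the three queries share a single four-state path when only the structure is indexed, while the expanded trie needs ten states because the predicates split the common prefix.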

Bottom-up evaluation
• Common rule in relational query optimization: apply selections as early as possible
• Works well for relational query processing
Drawback: a lot of effort is spent evaluating predicates while the structure may not be matched

Top-down evaluation
• Check predicates after structural matching
Drawback: depending on predicate selectivity, the number of false positives may be very large

Step-by-step evaluation
• XPath queries consist of distinct steps
• Each step contains one or more value-based predicates
• Perform value matching together with structural matching in a stepwise manner
Drawback: effort is spent evaluating predicates while the structure may not be fully matched

Moving on to details
• Parse the XML document and generate a set of candidate predicates to perform predicate evaluation
Enriched parsing events
Candidate predicates
  CP1: article[@title="XML Filtering"]
  CP2: article[@conf=VLDB]
  CP3: article[@year=2007]
  CP4: author[text()="John Smith"]
  CP5: author[@institute=Harvard]
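
Generating candidate predicates from a document can be sketched with the standard library parser. The function name and the string format of the predicates are illustrative; unlike the slide, this sketch quotes every value for uniformity.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<bib>
  <article title="XML Filtering" conf="VLDB" year="2007">
    <author institute="Harvard">John Smith</author>
  </article>
</bib>"""


def candidate_predicates(document: str):
    """Return one equality predicate per text value and per attribute value."""
    candidates = []
    for element in ET.fromstring(document).iter():
        # Whitespace-only text (indentation) does not produce a predicate.
        text = (element.text or "").strip()
        if text:
            candidates.append(f'{element.tag}[text()="{text}"]')
        for attr, value in element.attrib.items():
            candidates.append(f'{element.tag}[@{attr}="{value}"]')
    return candidates


if __name__ == "__main__":
    for number, predicate in enumerate(candidate_predicates(DOCUMENT), start=1):
        print(f"CP{number}: {predicate}")
```

For the example document this yields the five candidate predicates CP1-CP5 listed above, which can then be looked up against the predicates of the indexed queries.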

Top-down evaluation: delay value matching until after structural matching
• Execute the distributed NFA
• Only check predicates if a final state is reached
• Each peer uses a local hash index mapping predicates to the list of queries that contain them

Example
  Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek]
  Q2: /bib/*/author[text()="John Smith"]
  Q3: /bib/article[@conf=www]
  Q4: /bib/article[@year=2009]/author[@degree-from="UOA"]
  Q5: /bib/article[@year=2009]/cite[@paper-id=2392]
  Q6: /bib/article/cite[@paper-id=2770]
Candidate predicates
  CP1: article[@title="XML Filtering"]
  CP2: article[@conf=VLDB]
  CP3: article[@year=2007]
  CP4: author[text()="John Smith"]
  CP5: author[@institute=Harvard]
[Figure: NFA for the structural parts of Q1-Q6 with states 0-8 and transitions labelled bib, phdthesis, article, *, author, cite and conference]
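
The example can be worked through with a small predicate index. This is a sketch under assumptions: predicates are encoded as (element, attribute or text(), value) tuples, the set of structurally matched queries is supplied by hand instead of coming from the distributed NFA, and the function names are hypothetical.

```python
from collections import defaultdict

# Equality predicates of Q1-Q6, encoded as (element, attribute or text(), value).
QUERY_PREDICATES = {
    "Q1": {("phdthesis", "@published", "2005"), ("author", "@nationality", "greek")},
    "Q2": {("author", "text()", "John Smith")},
    "Q3": {("article", "@conf", "www")},
    "Q4": {("article", "@year", "2009"), ("author", "@degree-from", "UOA")},
    "Q5": {("article", "@year", "2009"), ("cite", "@paper-id", "2392")},
    "Q6": {("cite", "@paper-id", "2770")},
}

# Candidate predicates CP1-CP5 of the example document.
CANDIDATES = {
    ("article", "@title", "XML Filtering"),
    ("article", "@conf", "VLDB"),
    ("article", "@year", "2007"),
    ("author", "text()", "John Smith"),
    ("author", "@institute", "Harvard"),
}


def build_index(query_predicates):
    """Hash index: predicate -> set of queries that contain it."""
    index = defaultdict(set)
    for query_id, predicates in query_predicates.items():
        for predicate in predicates:
            index[predicate].add(query_id)
    return index


def top_down_match(structural_matches, candidates, index, query_predicates):
    """Keep the structurally matched queries whose predicates are all satisfied."""
    satisfied = defaultdict(int)
    for predicate in set(candidates):
        for query_id in index.get(predicate, ()):
            if query_id in structural_matches:
                satisfied[query_id] += 1
    return {
        query_id
        for query_id in structural_matches
        if satisfied[query_id] == len(query_predicates[query_id])
    }


INDEX = build_index(QUERY_PREDICATES)

# The document path bib/article/author structurally matches Q2, Q3 and Q4.
STRUCTURAL_MATCHES = {"Q2", "Q3", "Q4"}

if __name__ == "__main__":
    print(top_down_match(STRUCTURAL_MATCHES, CANDIDATES, INDEX, QUERY_PREDICATES))
```

Of the three structural matches only Q2 survives the predicate check, so Q3 and Q4 are false positives of the structural phase. This illustrates why the cost of top-down evaluation depends on predicate selectivity.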

Top-down evaluation with pruning
• At each step of the execution, part of the NFA is revealed
• IDEA: use a compact summary of predicate information to stop (prune) the NFA execution if we can deduce that no match can be found
• Applies to equality predicates
[Figure: NFA for the structural parts of the example queries with states 0-8]
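
One way to realise such a summary is a Bloom filter stored with each NFA state. The slides do not say which structure FoXtrot uses, so the Bloom filter, its parameters and the pruning test below are assumptions made for illustration. The test is conservative: it prunes only when every query reachable through the state has at least one equality predicate and none of the document's candidate predicates can be in the summary.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 256, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits |= 1 << position

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> position) & 1 for position in self._positions(item))


def can_prune(summary, all_queries_have_equality_predicate, candidates) -> bool:
    """True if no query below the current state can match the document."""
    if not all_queries_have_equality_predicate:
        # A query without equality predicates could still match on structure alone.
        return False
    return not any(summary.might_contain(candidate) for candidate in candidates)


# Summary for a hypothetical state reached via /bib/article (Q3-Q6 lie below it).
article_summary = BloomFilter()
for predicate in [
    "article[@conf=www]", "article[@year=2009]", "author[@degree-from=UOA]",
    "cite[@paper-id=2392]", "cite[@paper-id=2770]",
]:
    article_summary.add(predicate)

DOCUMENT_CANDIDATES = [
    "article[@title=XML Filtering]", "article[@conf=VLDB]", "article[@year=2007]",
    "author[text()=John Smith]", "author[@institute=Harvard]",
]

if __name__ == "__main__":
    print(can_prune(article_summary, True, DOCUMENT_CANDIDATES))
```

A false positive in the filter only means that an execution branch is not pruned, so the summary never causes a matching query to be missed.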

Experimental evaluation
• Implemented FoXtrot in Java using the FreePastry release (http://freepastry.org)
• Environment
  • 400 peers in PlanetLab (http://www.planet-lab.org/)
  • 112 peers in a local shared cluster (http://www.grid.tuc.gr)

Experimental evaluation – Datasets
• Sets of 10^6 distinct XPath queries
  • depth 5-15
  • predicates 1-3
  • wildcard probability 0.2
  • descendant axis probability 0.2
• 1000 XML documents
  • depth 5-25

Indexing throughput

Filtering latency & notifications

Load balancing I – 10 most loaded peers

Load balancing II – storage overhead

Network size

Parameter l

Cluster (4 predicates per query)

Sum up & future work
• Overcome weaknesses of distributed XML filtering systems
• Described methods to combine both structural and value XML filtering in a distributed environment
• Future work
  • ….

Other research
Atlas system
• Distributed RDF query processing

Thank you for your attention
Questions?
References
• I. Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using Automata on top of Structured Overlay Networks. 17th International World Wide Web Conference (WWW 2008), Beijing, China, April 21-25, 2008.
• I. Miliaraki and M. Koubarakis. Distributed Structural and Value XML Filtering. 4th ACM International Conference on Distributed Event-Based Systems (DEBS 2010), Cambridge, United Kingdom, July 12-15, 2010.
• I. Miliaraki and M. Koubarakis. FoXtrot: Distributed Structural and Value XML Filtering. Journal paper, to be submitted to ACM TWEB.
