520 likes | 680 Views
Distributed Structural and Value XML Filtering. Iris Miliaraki * and Manolis Koubarakis Department of Informatics and Telecommunications University of Athens. * Supported by Microsoft Research through European PhD Scholarship Programme.
E N D
Distributed Structural and Value XML Filtering Iris Miliaraki* and ManolisKoubarakis Department of Informatics and Telecommunications University of Athens *Supported by Microsoft Research through European PhD Scholarship Programme 4th ACM International Conference on Distributed Event-Based Systems (DEBS 2010), Cambridge, UK
Outline • XML Filtering scenario • Background • DHTs • Structural matching • Value matching • Experiments • Sum up and future work
XML Filtering scenario Centralized Distributed Subscriber Publisher XML Filtering system XPath/XQuery ? Index-Filter Parallel/Hierarchical XTrie ONYX YFilter Miliaraki [WWW 2008] XTrie FiST Subscriber Gong et al. [ICDE05] Li et al. [ICDCS08] Publisher XPath/XQuery ? XPush Snoeren [SOSP 2001]
XML Filtering scenario Mesh or tree-based overlays Subscriber Publisher XPath/XQuery ? Subscriber Publisher XPath/XQuery ? • Potential bottlenecks due to centralized control • Load imbalances
Background: DHTs • Structured overlay networks • Solve the item location problem in a distributed and dynamic network of nodes (in O(log N) hops): • Let x be some data item. Find x! • Distributed version of hash tabledata structure • id=Hash(K) • Main operations: • Put: given a key (for a data item), map the key onto a node. • Get: Find the location of a data item with a given a key.
XML Filtering scenario Subscriber Publisher XPath/XQuery ? DHT Subscriber Publisher XPath/XQuery ? • Fully distributed • Load balanced • Scalable
XML data model - example <bib> <article title=“XML Filtering” conf=“VLDB” year=“2007”> <author institure=“Harvard”> John Smith </author> </article> </bib> <bib> <article title=“XML Filtering” conf=“VLDB” year=“2007”> <authorinstiture=“Harvard” John Smith </author> </article> </bib> <bib> <article title=“XML Filtering” conf=“VLDB” year=“2007”> <authorinstiture=“Harvard” John Smith </author> </article> </bib> Value Matching Structural matching Q1: /bib/*/author[text()="John Smith"] Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Q1: /bib/*/author[text()="John Smith"] Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Q1: /bib/*/author[text()="John Smith"] Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770]
Automata-based approaches • XFilter and YFilter, ONYX, XTrie, IndexFilter, FiST etc. • Main idea • Construct an automaton from a set of XPath/Xquery queries • Use it as a matching engine against the XML documents
bib title Q3 Example NFA (YFilter) Q1: /bib/phdthesis/year = ‘2008’ Q2: /bib/proceedings/school = ‘Univ. of Athens’ Q3: /bib/proceedings/title = ‘XML Dissemination’ Q1 year 3 Q4: /bib/*/author = ‘John Smith’ 2 phdthesis Q5: //*/cite [@id = 12743] Q2 school proceedings 5 1 4 * 6 0 Q4 author 7 8 * ε Q5 cite * 11 9 10
Distributed structural matching • Utilize a distributed version of a state-of-the-art approach YFilter • Instead of a centralized NFA • Distribute the NFA in the DHT Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.
Distributed NFA Structural matching! What about value matching? Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.
What about value matching? • Automata-based approaches efficient for structural matching • Queries apart from defining a structural path also contain value-based predicates /dblp/phdthesis[@year=2005]/author[@nationality=greek] • Our goal: Scale for both the size of the query set and the number of predicates per query
Definitions • Attribute predicates: element[@attr op value]e.g. /bib/phdthesis[@published=2007] • Textual predicates: element[text() op value]e.g. /bib/*/author[text()=“John Smith”] So, how can we deal with value matching along with structural matching?
Direct evaluation with automaton/trie author 3 • Treat predicates as elements! • XNet [Felber et al. 2008] • XTrie structure • Lazy DFA [Gupta and Suciu, 2003] • Hope is that only a small set of DFA states will be computed at runtime author Q1: /dblp/phdthesis[@year=2005]/author[@nationality=greek] 5 2 phdthesis year author nationality conference 3 8 10 Huge increase of NFA states! 7 bib * 0 1 4 author text() Q2: /bib/*/author[text()=Michael Smith] 5 9 article Destroy sharing of path expressions! 6 conference text() 7 11 Q3: /bib/article/conference[text()=WWW 2009]
Bottom-up evaluation • Common rule in relational query optimization apply selections as early as possible • Works well for relational query processing • pFist [Kwon et al. 2005] A lot of effort evaluating predicates while the structure may not be matched
Step-by-step evaluation • XPath queries consist of distinct steps • Each step contains one or more value-based predicates • Perform value matching with structural matching in a stepwise manner • YFilter – Inline [Diao et al. 2003] • process predicates when NFA state is reached Effort spent for evaluating predicates while the structure may not be fully matched
Top-down evaluation • Check predicates after structural matching • YFilter – Selection-Postponed [Diao et al. 2003] • performs predicate evaluation after the execution of the NFA • VA-RoXSum [Vagena et al. 2007] • Focus on message aggregation depending on predicate selectivity number of false positives may be very large
Moving on to details • Parse XML document and generate a set of candidate predicates to perform predicate evaluation Enriched parsing events Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard]
Top-down evaluation Top-down evaluation with pruning Step-by-step evaluation Bottom-up evaluation
Top-down evaluation Delay value matching after structural matching • Execute distributed NFA • Only check predicates if an accepting state is reached • Each peer uses a local index mapping predicates to the list of queries that contain them (hash index)
Example Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] author 2 5 phdthesis 6 author bib article 0 1 3 cite * 7 conference 4 8
Top-down evaluation • Top-down evaluation with pruning Step-by-step evaluation Bottom-up evaluation
Top-down evaluation with pruning • At each step of the execution, part of the NFA is revealed • Applies on equality predicates 2 5 IDEA: Use a compact summary of predicate information to stop NFA execution (prune)if we can deduce that no match can be found phdthesis 6 author bib article 0 1 3 cite * 7 conference 4 8
TD with pruning – Details • Each peer responsible for storing many NFA fragments • Each peer keeps one Bloom filter which summarizes predicates of queries indexed in the relevant NFA fragments Value filter (VF) • Assuming a peer p and a state st, for each query q whose NFA accepting path contains st, we insert one predicate of q in the VF of p
TD with pruning - Main idea cont. • Predicates are inserted as a whole in VFs using their string representation: • element[@attr=value] element + attr + value • element[text()=value] element + text + value • VFs are updated during query indexing • Since we traverse the NFA accepting path of a query to index all relevant VFs will be updated
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Example: Constructing Value filters 1 0 0 Select 1 predicate per query to insert … author 2 5 1 0 phdthesis 1 6 author bib article 0 1 3 is responsible for cite * 7 conference m-bit filter 4 8 keeps value filter DHT
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Q7: //article[@year=2007] Example – Querying Value filters Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] author 2 5 phdthesis 6 author 1 1 bib article 0 1 3 0 0 cite * 7 0 0 … … e conference check Value filter 4 8 1 1 0 0 * article 1 1 9 10 Step 1: expanding state 0 DHT Step 2: expanding state 1 Execution stops MATCH! MISS! Execution continues
Top-down evaluation • Top-down evaluation with pruning • Step-by-step evaluation Bottom-up evaluation
Step-by-step evaluation Check value-predicates while matching the structure • Associate NFA states with relevant predicate information organized using a hash index • At each step of the execution • check predicates • update list with partially matched queries Q • continue with expanding state if Q not empty
Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Example • At each step of the execution • check predicates • update list with partially matched queries Q • continue with expanding state if Q not empty Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] author 2 5 phdthesis 6 author bib article 0 1 3 cite * 7 conference 4 8
Top-down evaluation • Top-down evaluation with pruning • Step-by-step evaluation • Bottom-up evaluation
Check values as early as possible Bottom-up evaluation • Queries are indexed in the network using their predicates • For each distinct predicate in query set select a responsible peer using DHT • peer organizes its queries using a local index mapping predicates to the list of queries that contain them • This indexing model resembles works from area of Information Filtering
Check values as early as possible Bottom-up evaluation cont. • Construct set of candidate predicates • For each candidate predicate contact responsible peer • Peer checks its local index • Performs locally structural matching
Example Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Find responsible Find responsible Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] DHT
Experiments • Implemented methods using FreePastry release in Java • Environment • Cluster (http://www.grid.tuc.gr/) – 28 machines (4 peers per machine) • 253 peers from Planetlab network • Queries • Sets of 1000000 queries • 1 to 8 predicates per query • Data • NITF DTD • Bloom filter size • 100K bits
Cluster (4 predicates per query) Arrival time in msecs
Sum up & future work • Described methods to combine both structural and value XML filtering in a distributed environment • Experimental evaluation of our methods • Future work • Potential improvements for SBS method • More sophisticated methods for selectivity estimation • Range predicates • Textual predicates