EDBT’2011. GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher. Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen. Middleware Systems Research Group, University of Toronto. An X-ToPSS Project. http://msrg.org/tags/x-topss
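The Problem in a Nutshell • XML filtering: an XML document is filtered against XPath expressions (XPEs, potentially millions of them), yielding the matched XPEs • Pub/sub matching: an event/publication is matched by a pub/sub engine against subscriptions (Boolean expressions), yielding the matched subscriptions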
Publish/Subscribe Systems • Publishers (stock markets: TSX, NASDAQ, NYSE) send publications to the broker: AMGN=58, IBM=84, ORCL=12, JNJ=58, HON=24, INTC=19, MSFT=27 • Subscribers register subscriptions with the broker: IBM > 85, ORCL < 10, JNJ > 60 • The broker sends notifications to the subscribers whose subscriptions match
Pub/Sub Matching Algorithms • Rete algorithm [Forgy, late 70s] • A graph structure to correlate events and process rules (solves a more general problem) • SIFT [Yan et al. TODS’94] • Predicate counting, among others • Gough algorithm [Gough et al. ACSC’95] • Based on a finite state representation of subscriptions • Gryphon algorithm [Aguilera et al. PODC’99] • Decision tree over predicates • Clustering algorithm [Fabret et al. SIGMOD’01] • Clusters subscriptions based on common predicates • k-Index [Whang et al. VLDB’09] • Hardware-based matching acceleration [Sadoghi et al. VLDB’10] • BE-Tree [Sadoghi & Jacobsen, SIGMOD’2011]
The Key Question • Can XML filtering benefit from the efficient publish/subscribe matching algorithms that have been developed over more than three decades?
XML Filtering Challenges • Filter XML documents according to XPath expressions (XPEs) and report the matched XPEs • Do so efficiently, at Internet scale, for millions of XPEs and for many XML documents per unit of time
XML Filtering Systems • XML filtering systems are publish/subscribe systems: XPath expressions and XML documents are the subscriptions and publications, respectively • Growing need for XML filtering • Application-level firewalls • Malware detection and prevention • Document routing • RSS aggregators • XML-based messaging and application integration • Selected industry players (XML appliances) • Solace Systems • IBM DataPower • Talerian • Sarvega (Intel)
The Core Problem • XML Document Filtering Problem • Given a set of XPath expressions Q and an XML document d, find all expressions in Q that are matched by d • An expression q is matched by an XML document d if and only if q selects a non-empty set of nodes in d • XPath expressions are used to select entire documents or fragments of documents
Agenda • Supported XPath Language • Mapping XML Filtering to Pub/Sub Matching • XPath encoding • XML encoding • Experimental results • Outlook
XML and XPath • XML fragment: <section> <subsection> <figure> … </figure> </subsection> <figure> … </figure> </section> • XML tree: section has the children subsection and figure; subsection has the child figure • XML paths: section-subsection-figure, section-figure • XPath queries: /section/*/figure, */figure, /section/subsection/figure, section/figure, /section//subsection/figure, section//figure • Terminology: absolute query (leading /), relative query (no leading /), location step, child operator (/), descendant operator (//), wildcard (*)
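The step from the XML fragment to its XML paths can be sketched as follows. This is a minimal illustration that assumes Python's standard `xml.etree.ElementTree` parser; the function name `root_to_leaf_paths` is hypothetical and not part of GPX-Matcher.

```python
import xml.etree.ElementTree as ET

def root_to_leaf_paths(xml_text):
    """Return every root-to-leaf tag path of the document as a list of tags."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + [node.tag]
        children = list(node)
        if not children:          # a leaf closes one document path
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths

# The fragment from the slide, with empty figure elements.
fragment = "<section><subsection><figure/></subsection><figure/></section>"
paths = ["-".join(p) for p in root_to_leaf_paths(fragment)]
print(paths)  # ['section-subsection-figure', 'section-figure']
```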
XPath 2.0 Subset Considered • Absolute path expressions • /a/b • Relative path expressions • a/b/c • Descendant operators in path expressions • a/b//a/d • Wildcards in path expressions • a/*/*/b • Not discussed in this talk, but addressable by the approach • Filter predicates in path expressions • <path>[@x>1]/<path> • Nested path filters (the XPE becomes a tree) • <path>[a/b]/<path>
Our Question(s) • How can we map XPath expressions onto subscriptions? • Conjunctive Boolean formula over predicates • S = (a1 op v1) ∧ (a2 op v2) ∧ … ∧ (an op vn) • How can we map XML documents onto publications? • Set of attribute-value pairs • P = {(a1, v1), (a2, v2), …, (am, vm)}
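The subscription and publication model above can be made concrete with a small sketch. It assumes a publication is stored as a dictionary and a subscription as a list of (attribute, operator, value) triples. The stock values come from the publish/subscribe example earlier in the deck, while the subscription `s1` and the helper name `matches` are made up for illustration.

```python
import operator

# Comparison operators allowed in predicates (an assumed, minimal set).
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}

def matches(subscription, publication):
    """True if every predicate (attr, op, value) holds for the publication."""
    return all(
        attr in publication and OPS[op](publication[attr], value)
        for attr, op, value in subscription
    )

publication = {"IBM": 84, "ORCL": 12, "JNJ": 58}   # P = {(a, v), ...}
s1 = [("IBM", ">", 80), ("ORCL", "<", 20)]         # hypothetical conjunction
s2 = [("IBM", ">", 85)]                            # from the stock example

print(matches(s1, publication), matches(s2, publication))
```

A conjunction fails as soon as one predicate fails or refers to an attribute that the publication does not carry.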
Predicate Calculus • Single-tag predicate • Double-tags predicate • End-tag predicate • Length-constraint predicate
Single-tag Predicate Example • XPath expression: /b/… • Predicate: tag b is at position 1 • Example document path: b-a-c, encoded as (b, 1), (a, 2), (c, 3)
Double-tags Predicate Example I • XPath expression: … a/b … • Predicate: the distance between tag a and tag b is one location step • Example document path: x-a-b, encoded as (x, 1), (a, 2), (b, 3)
Double-tags Predicate Example II • XPath expression: a//b • Predicate: the distance between tag a and tag b is at least one location step • Example document path: a-x-b, encoded as (a, 1), (x, 2), (b, 3)
End-tag Predicate Example • XPath expression: /a/*/* • Predicate: tag a is at least two location steps away from the path end • Example document path: a-x-y, encoded as (a, 1), (x, 2), (y, 3), (length, 3)
Length-constraint Predicate Example • XPath expression: */*/* • Predicate: the length of the path is at least 3 • Example document path: x-y-z, encoded as (x, 1), (y, 2), (z, 3), (length, 3)
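The four predicate types can be checked directly against a document path. The sketch below is only an illustration: the function names and argument order are assumptions, a path is a plain list of tags, and duplicate tags are handled by testing whether any occurrence satisfies the predicate rather than by occurrence numbers.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge}

def single_tag(path, tag, position):
    """Tag occurs at the given 1-based position of the path."""
    return len(path) >= position and path[position - 1] == tag

def double_tag(path, first, second, op, distance):
    """Some occurrence of `second` follows `first` at a distance satisfying op."""
    return any(
        OPS[op](j - i, distance)
        for i, t1 in enumerate(path) if t1 == first
        for j, t2 in enumerate(path) if t2 == second and j > i
    )

def end_tag(path, tag, distance):
    """Some occurrence of tag lies at least `distance` steps before the path end."""
    return any(len(path) - (i + 1) >= distance
               for i, t in enumerate(path) if t == tag)

def length_constraint(path, minimum):
    """The path has at least `minimum` location steps."""
    return len(path) >= minimum

# The example paths from the predicate slides.
print(single_tag(["b", "a", "c"], "b", 1))            # /b/...
print(double_tag(["x", "a", "b"], "a", "b", "=", 1))  # a/b
print(double_tag(["a", "x", "b"], "a", "b", ">=", 1)) # a//b
print(end_tag(["a", "x", "y"], "a", 2))               # /a/*/*
print(length_constraint(["x", "y", "z"], 3))          # */*/*
```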
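Putting it Together: XPath Query Encoding Example • Queries: Q1: a/b//a, Q2: a//b/d, Q3: a/*/*/*//b/d • With occurrence numbers: Q1: a1/b1//a2, Q2: a1//b1/d1, Q3: a1/*/*/*//b1/d1 • Each query is encoded as a conjunction of predicates drawn from P1 to P5; predicates may be shared between queries (P4 appears twice in the figure) • Our XPath encoding grows linearly in the size of the XPath expression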
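One way to see why the encoding grows linearly is to write a small encoder for the path-only subset. The sketch below is an assumption-laden illustration, not the authors' encoder: the tuple format, the name `encode_xpe`, and the choice to emit a single-tag predicate for the first named step are all hypothetical, and it ignores filter predicates and the alignment of occurrence numbers between relative queries and documents.

```python
import re
from collections import defaultdict

def encode_xpe(xpe):
    """Encode a path-only XPE as a list of predicate tuples (hypothetical format)."""
    steps = re.findall(r"(//|/)?([^/]+)", xpe)
    absolute = xpe.startswith("/") and not xpe.startswith("//")
    occurrences = defaultdict(int)
    predicates = []
    last_tag = None        # most recent named step, with its occurrence number
    gap = 0                # wildcards seen since last_tag (or since the start)
    loose = not absolute   # True while the distance is only a lower bound
    for sep, name in steps:
        if sep == "//":
            loose = True
        if name == "*":
            gap += 1
            continue
        occurrences[name] += 1
        tag = f"{name}{occurrences[name]}"
        op = ">=" if loose else "="
        if last_tag is None:
            predicates.append(("single", tag, op, gap + 1))
        else:
            predicates.append(("double", last_tag, tag, op, gap + 1))
        last_tag, gap, loose = tag, 0, False
    if gap and last_tag is not None:     # trailing wildcards after a tag
        predicates.append(("end", last_tag, ">=", gap))
    elif gap:                            # the query consists of wildcards only
        predicates.append(("length", ">=", gap))
    return predicates

for query in ["a/b//a", "a//b/d", "a/*/*/*//b/d", "/a/*/*", "*/*/*"]:
    print(query, encode_xpe(query))
```

Each location step contributes at most one predicate, so the number of predicates never exceeds the number of steps.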
XML Document Path Encoding • Document path: a-b-c-d • With occurrence numbers: a1-b1-c1-d1 (no duplicate tags, i.e., all occurrence numbers are 1) • Publication (attribute-value pairs): (length, 4), (a1, 1), (b1, 2), (c1, 3), (d1, 4), (a1, b1, 1), (a1, c1, 2), (a1, d1, 3), (b1, c1, 1), (b1, d1, 2), (c1, d1, 1) • The resulting set of attribute-value “pairs” has O(n²) elements for a path of n tags
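The document path encoding can be reproduced with a few lines of code. This sketch assumes that occurrence numbers are assigned by counting each tag along the path; the function name `encode_path` is hypothetical.

```python
from collections import defaultdict

def encode_path(tags):
    """Encode one root-to-leaf path as a list of attribute-value entries."""
    counts = defaultdict(int)
    labelled = []
    for tag in tags:                      # a, b, c, d -> a1, b1, c1, d1
        counts[tag] += 1
        labelled.append(f"{tag}{counts[tag]}")
    publication = [("length", len(tags))]
    # One entry per tag: its 1-based position in the path.
    publication += [(name, pos) for pos, name in enumerate(labelled, start=1)]
    # One entry per ordered tag pair: the distance between the two tags.
    for i in range(len(labelled)):
        for j in range(i + 1, len(labelled)):
            publication.append((labelled[i], labelled[j], j - i))
    return publication

publication = encode_path(["a", "b", "c", "d"])
print(publication)
print(len(publication))  # 1 + n + n(n-1)/2 = 11 for n = 4
```

The pairwise loop is the source of the quadratic growth: a path of n tags yields n(n-1)/2 distance entries in addition to the n position entries and the length entry.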
Mapping XML Filtering to Pub/Sub Matching • XML documents map to events/publications • XPath expressions (XPEs, millions of them) map to subscriptions (Boolean expressions) • The pub/sub engine returns the matched subscriptions, which correspond to the matched XPEs
Matching Algorithms • Pick any pub/sub matching algorithm • We used • Counting algorithm [exact origin is unknown] • Clustering algorithm [Fabret, Jacobsen et al., 2001] • Both are two-phase matching algorithms • Phase 1, predicate matching: match all predicates • Phase 2, subscription matching: match subscriptions using the result of phase 1
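The second phase of a counting algorithm can be sketched as below. The sketch assumes that phase 1 has already produced the set of matched predicate identifiers. The data layout, the function names, and the assignment of predicates to Q1 to Q3 are illustrative assumptions rather than the encoding used in the talk.

```python
from collections import defaultdict

def build_index(subscriptions):
    """Map each predicate id to the subscriptions that contain it."""
    pred_to_subs = defaultdict(list)
    for sid, pids in subscriptions.items():
        for pid in pids:
            pred_to_subs[pid].append(sid)
    return pred_to_subs

def counting_match(subscriptions, pred_to_subs, matched_pids):
    """Phase 2: a subscription matches when all of its predicates were matched."""
    hits = defaultdict(int)
    for pid in matched_pids:
        for sid in pred_to_subs.get(pid, ()):
            hits[sid] += 1
    return sorted(sid for sid, n in hits.items()
                  if n == len(subscriptions[sid]))

# Hypothetical subscriptions as sets of predicate ids.
subscriptions = {"Q1": {1, 2}, "Q2": {3, 4}, "Q3": {4, 5}}
index = build_index(subscriptions)
print(counting_match(subscriptions, index, {1, 2, 4}))  # ['Q1']
```

Only subscriptions that share at least one matched predicate are ever touched, which is what makes counting attractive when most subscriptions do not match.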
Predicate Matching: Single Tag • Publication: (length, 4), (a1, 1), (b1, 2), (c1, 3), (d1, 4), (a1, b1, 1), (a1, c1, 2), (a1, d1, 3), (b1, c1, 1), (b1, d1, 2), (c1, d1, 1) • Hash on the tag to locate the predicates defined on that tag • Compare the predicate value with the position (1, 2, 3, 4) given in the publication • If the predicate with id i matches, set bit i in the predicate bit vector (e.g., 1 0 0 0)
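The single-tag predicate matching step, hashing on the tag and recording matches in a predicate bit vector, might look like the following sketch. It assumes equality predicates of the form (tag, position) and uses a plain dictionary as the hash table; the predicate set and the function names are hypothetical.

```python
def build_tag_index(predicates):
    """predicates: {pid: (tag, position)} -> {tag: [(position, pid), ...]}"""
    index = {}
    for pid, (tag, position) in predicates.items():
        index.setdefault(tag, []).append((position, pid))
    return index

def match_single_tag(index, publication, num_predicates):
    """Phase 1 for single-tag predicates: return the predicate bit vector."""
    bits = [0] * num_predicates
    for entry in publication:
        if len(entry) != 2 or entry[0] == "length":
            continue                      # skip the length and tag-pair entries
        tag, position = entry
        for pred_position, pid in index.get(tag, ()):   # hash on the tag
            if pred_position == position:               # compare the value
                bits[pid] = 1
    return bits

# Hypothetical predicates with ids 0..3.
predicates = {0: ("a1", 1), 1: ("c1", 1), 2: ("b1", 3), 3: ("e1", 2)}
publication = [("length", 4), ("a1", 1), ("b1", 2), ("c1", 3), ("d1", 4),
               ("a1", "b1", 1), ("a1", "c1", 2), ("a1", "d1", 3),
               ("b1", "c1", 1), ("b1", "d1", 2), ("c1", "d1", 1)]
bits = match_single_tag(build_tag_index(predicates), publication, 4)
print(bits)  # [1, 0, 0, 0]
```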
Subscription Matching: Clustering Algorithm • Cluster queries based on their access predicates • An access predicate is shared by all queries in a cluster • Only check clusters whose access predicates are matched • Open question: how to choose an effective access predicate
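The pruning effect of access predicates can be illustrated with a short sketch. It assumes each cluster is keyed by one access predicate id and stores, per query, the remaining predicate ids. The cluster contents and the name `cluster_match` are made up for the example, and the sketch does not address how access predicates are chosen.

```python
def cluster_match(clusters, matched_pids):
    """clusters: {access_pid: {query_id: set of remaining predicate ids}}"""
    result = []
    for access_pid, queries in clusters.items():
        if access_pid not in matched_pids:
            continue                      # the whole cluster is skipped
        for qid, remaining in queries.items():
            if remaining <= matched_pids: # all other predicates matched too
                result.append(qid)
    return sorted(result)

# Hypothetical clusters: Q2 and Q3 share the access predicate 4.
clusters = {1: {"Q1": {2}}, 4: {"Q2": {3}, "Q3": {5}}}
print(cluster_match(clusters, {1, 2, 4, 5}))  # ['Q1', 'Q3']
print(cluster_match(clusters, {2, 3, 5}))     # []
```

In the second call no access predicate is matched, so no query is inspected even though several of their other predicates hold.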
Experimental Evaluation • All algorithms implemented in C • GPX: the base encoding with counting • GPX-ap: the base encoding with clustering (access predicates) • YFilter and BPA • DTDs used for generating workloads • NITF DTD (News Industry Text Format) • PSD DTD (Protein Sequence Database) • Total filtering time averaged over 500 XML documents • XML parsing time is negligible relative to the overall filtering time • Intel Quad-Core 2.66 GHz, 4 GB
Scalability in Number of XPEs • All XPEs are distinct • [Plot; annotations: 1 ms vs. 18 ms; ap on first; ap on last]
Scalability in Number of XPEs • XPE workload contains duplicates • [Plot]
Effect of Path Length • [Plot]
Effect of Wildcards • [Plot]
Conclusions • Novel XML/XPath encoding • Leverages existing matching techniques • Differs significantly from the predominantly automata-based related work • Outperforms related approaches by an order of magnitude under many experimental conditions
Thank You! • To learn more about X-ToPSS, please see • http://msrg.org/tags/x-topss
Agenda • XML-based Filtering Systems • Mapping XML Filtering to Pub/Sub Matching • XPath encoding • XML encoding • Experimental results • Outlook
Content-based Publish/Subscribe • Subscription: Boolean expression over predicates (each an attribute-operator-value triple), e.g., (subject = news) ∧ (topic = travel) ∧ (date > 21.2.2011) • Publication (a.k.a. event): set of attribute-value pairs, e.g., (subject, news), (topic, travel), (date, 21.2.2011), …
The Pub/Sub Matching Problem • Given an event, e, and a set of subscriptions, S, determine all subscriptions, s ∈ S, that match e.
Wide Applicability • Selective information dissemination • Location-based services • Personalization, alerting services • Application integration • Service & resource discovery • Network and distributed system management • Monitoring, surveillance, and control • Workforce management • Workload management & job scheduling • Business activity monitoring • Business process management, monitoring, and execution
Matching Algorithm Techniques • Amortized storage & processing • Access predicates • Cost model-driven subscription partitioning • Cache-conscious data structure layout • Asynchronous cache-level pre-fetching • Event queue re-ordering and batch processing • Parallelization of algorithms for SMP & multi-core • FPGA-based acceleration (hardware-level)
eXtensible Markup Language • XML – de facto standard for data exchange • Web Services, data and application integration, information dissemination • XPath – XML query language • Also used as the basis for other languages (e.g., XQuery, XPointer, and XSLT)
Our Research Goal • Solve the XML filtering problem using content-based pub/sub matching algorithms. • Why? • Build on and exploit several decades’ worth of insights, rather than construct special-purpose solutions.
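In a Nutshell • The XML tree (section with the children subsection and figure; subsection with the child figure) is decomposed into the paths section-subsection-figure and section-figure • The paths are matched against the encoded XPath expressions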
Special-purpose XML/XPath Filtering Algorithms • XFilter [Altinel et al. VLDB’00] • WebFilter [Pereira et al. VLDB’01] • YFilter [Diao et al. TODS’03] • XTrie [Chan et al. ICDE’03] • AFilter [Candan et al. VLDB’06] • BPA [Huo & Jacobsen, ICDE’06] • BoXFilter [Moro et al. VLDB’07] • pFiST [Kwon et al. DKE’08]
From XML Filtering to Publish/Subscribe Matching • XPath expressions are encoded in a predicate calculus • XML documents are expressed as a set of paths from the root to a leaf in the document tree • Each path is translated into sets of attribute-value pairs (tags and their location in the path) • Matching algorithm • The attribute-value pairs are matched against the predicates with traditional pub/sub matching algorithms
Possible Extensions • Extend predicate calculus to encompass other XPath 2.0 features • Alternative encodings • Exploit DTD or schema information • Exploit information about XPath expressions processed
X-ToPSS: XML-based Toronto Publish/Subscribe System • Distributed, content-based publish/subscribe (cf. ICDCS’08) • Exploit DTDs (Document Type Definitions) to optimize subscription routing in distributed pub/sub systems • Explain covering and merging optimizations for XML/XPath • Alternative predicate-based XML/XPath matching algorithm that cannot exploit traditional pub/sub schemes (cf. ICDE’06) • Encoding presented herein, cf. EDBT’2011 (forthcoming) • http://msrg.org/tags/x-topss
Example: XPath Query Encoding • [Figure: table of predicates P1 to P5 over the tags a1, b1, a2, and d1, each assigned a predicate identifier (pid) from 1 to 5]
That’s Like Database Querying! • Query and subscription are very similar • Data tuples and publication are very similar • However, the two problem statements are inverse: a query is evaluated over stored data tuples (about the past), whereas stored subscriptions are evaluated against incoming publications (about the future)