Early Profile Pruning on XML-aware Publish-Subscribe Systems

Early Profile Pruning on XML-aware Publish-Subscribe Systems Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras University of California, Riverside

Overview • Motivation • Bottom-up Filtering FSM (BUFF) • Bounding-based XML Filtering (BoxFilter) • Core Modules • Filtering algorithms • Experimental results

Motivation • Publish-subscribe systems: The message transmission is defined by the message content • Examples: notification websites hotwire.com or ticketmaster.com Publisher Publisher Publisher Publisher Docu ments Docu ments Docu ments Docu ments Matching algorithm Re su l t Re su l t Re su l t Re su l t Prof ile Prof ile Prof ile Prof ile Submit, Update, Delete Submit, Update, Delete Submit, Update, Delete Submit, Update, Delete Subscriber Subscriber Subscriber Subscriber

Bib article article vol year title title no 7 1996 proceedings t1 11 t2 year SIGMOD journal author 2006 TPDS last first author Florescu Daniela last mi first DeWitt J David Publish-subscribe systems • The data is exchanged in XML format. • Nodes - correspond to elements, attributes or text values • Edges represent immediate element-subelement or element-value relationships <Bib> <article vol=“7” no=“11”> <title>t1</title> <author> <last>DeWitt</last> <mi>J</mi> <first>David</first> </author> <journal>TPDS</journal> <year>1996</year> </article> <article> <title>t2</title> <author> <last>Florescu</last> <first>Daniela</first> </author> <proceedings>SIGMOD </proceedings> <year>2006</year> </article> </Bib> (a) Document (b) Tree representation

Publish-subscribe systems (cont.) • The user profiles are expressed in XML query language (XPath, XQuery) • XML query contains • structural constraints • value-based constraints Structural constraints: ////article[/author[@last=``Smith'']]//procs[@conf=``VLDB''] Tree pattern: article author proceedings last conf

Related Work/Our Contribution • Current work • Construction of overlay network • Dissemination/indexing of profiles (queries) • Processing of stream of messages • We focus on the matching process that takes place within a broker • Improves the performance of regular FSM by using a bottom-up evaluation of the document • Develop index-based filtering technique that performs early pruning of the query profile

Bottom-up vs. Top-down filtering • State machines are among the most common methods for the XML matching process • Top-down approach: (i.e. in-order traversal or depth first order): advancing the state machine for each XML element (or attribute) read. • Do not consider any form of early pruning • Bottom-up approach: This approach takes into consideration the (usual) fact that an XML document has its more selective elements located at its leaves

Q1 a Q4 a Q6 e b e g c f h d h Example • Top-down approach groups the queries according to their common prefixes • Bottom up: groups them according to their common suffixes. root Q2 a Q5 e Q3 a a a a a a a a a a a a c e f b b b b b b b b b b b d f h c c c c c c c c c c c d (a) Document (b) Queries c d a 3 2 4 3 4 b b c Q1 1 2 Q1 d a c d 6 1 5 5 e Q2 Q2 a f h f e a 7 8 9 6 7 0 8 0 Q4 Q3 Q3 h e h e a f f 12 11 11 12 10 Q5 10 Q5 Q4 9 g g h e 13 14 13 14 Q6 Q6 (c) Top-down (d) Bottom up

BUFF • FSM-based Bottom-up approach for XML filtering. • BUFF avoids translating documents and queries to Prüfer sequences (as the other algorithms do), and employs a more direct evaluation algorithm. • The document is parsed through a SAX parser, which triggers events for specific marks (tags) in the XML document • The machine keeps a runtime stack that stores the current document path being processed.

a1 1 </e> b2 b5 c3 c6 d c d4 d7 e9 b e8 a f10 5 1 </f> </e> 2 </d> 3,6 </c> 5 1 e 1,2 1,2 1,2,5 c c c b b b b a a a a BUFF Example e <e> d <d> c d b 1 2 3 4 e c <c> Q1 0 f b <b> c b a 8 5 6 7 a <a> Q2 (a) Document and BUFF (b)‏ (c)‏ (d)‏ (e)‏ (f)‏ (g)‏

Profile Index Profiles P1 P2 P3 Bounding-based XML Filtering • Two major processes working asynchronously • Profile Management • Profile Matching Prüfer Sequence Profile Manager Matching Algorithm Matching Module Profiles (queries)‏ Input Documents Matched Documents

Prüfer Sequence • A unique sequential encoding of a labeled tree • Algorithm: • Iteratively removes nodes from the tree until all nodes but the last two have been removed. • At each iteration, the algorithm finds and removes the leaf with the smallest label and adds to the Prüfer sequence the label of that leaf's parent. • Theorem: If a query tree Q is a subgraph of a document tree D then the Prüfer sequence of Q is a subsequence of the Prüfer sequence of D

Sequence Envelope • Assume a set of k Prüfer sequences representing user profiles S1,..,Sk • We can derive two new sequences • Upper bound U: for each position take largest element • Lower bound L: for each position take smallest element • L and U form the smallest possible bounding envelope that encompasses all members of the set of sequences from above and below.

Example • Assume 3 sequences with 11 symbols each abcabababcd cdcdecdcdec dedededebab

Sequence Envelope (Cont.) • The sequence envelope structure is that it can be used as an aggregation of the sustaining set of sequences

BoXFilter Tree • Sequence envelopes can be nested forming BoXFilter tree

Filtering algorithms • The profiles in the system are organized in BoXFilter tree. Documents are traversed thought the tree • There are two variations of the filtering algorithm • Sequential – documents are processed one by one • Batch processing – documents are organized in a tree like the queries and both trees are joined • After the traversal of the BoXFilter tree, there is a verification step

Experimental Results • We have generated datasets with 1000, 10000 and 100000 small documents (with up to 8KB) • We generated up to 100000 queries with selectivity fixed to 50% (a)‏ (b)‏ (c)‏

Experimental Results (cont.) In this set of experiments, we vary the number of documents that match any of the profile queries. (selectivity 1\% means that one percent of the documents satisfy \textit{any} of the queries.)

Thank You!

Early Profile Pruning on XML-aware Publish-Subscribe Systems