HY-561 PRESENTATION

HY-561 PRESENTATION Φιλιππάκη Χρυσή, Student ID: 584

1st paper: Distributed Query Evaluation with Performance Guarantees

Problem • Partial evaluation is an effective technique for evaluating Boolean XPath queries over a fragmented tree that is distributed over a number of sites. • Is the technique applicable to generic data-selecting XPath queries? • Yes! • Evaluation algorithms (PaX3, PaX2) • Optimizations

Partial Evaluation • Function f(x1, x2) • We are given part of its input, e.g. x1 • Partial evaluation specializes function f with respect to the known argument x1, without waiting for the other argument x2. • It performs the part of f's computation that depends only on x1, and generates a partial answer, i.e. a residual function f′ that depends on the as yet unavailable argument x2.
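As a rough illustration of the idea above (not taken from the paper), partial evaluation of a two-argument function can be sketched in Python. The function `f` and the helper `specialize` are hypothetical names chosen only for this example.

```python
from typing import Callable


def f(x1: int, x2: int) -> int:
    """Toy two-argument function: a costly part that needs only x1, plus x2."""
    return sum(i * i for i in range(x1)) + x2


def specialize(x1: int) -> Callable[[int], int]:
    """Partially evaluate f with respect to the known argument x1.

    The work that depends only on x1 is done immediately; the returned
    residual function f' depends only on the still-unavailable argument x2.
    """
    known_part = sum(i * i for i in range(x1))  # computed once, up front

    def residual(x2: int) -> int:
        return known_part + x2

    return residual


f_prime = specialize(4)  # x1 is known, x2 is not yet available
result = f_prime(10)     # later, x2 arrives and the residual is applied
```

The residual function gives the same value as calling `f` with both arguments, but the x1-dependent work is not repeated when x2 finally arrives.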

XML Tree Fragmentation (1/2)

XML Tree Fragmentation (2/2)

ParBoX • Algorithm based on partial evaluation, which evaluates Boolean XML queries over a fragmented tree that is distributed over a number of different sites • Partially evaluates the whole query Q, in parallel, over each fragment of the tree. • Partial answers are all collected at a single coordinator site and are composed, resulting in the final answer to Q.
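The evaluate-per-fragment, then compose-at-the-coordinator pattern described above can be sketched as follows. This is a deliberately simplified illustration, not the ParBoX algorithm itself: the only "query" supported is "does the tree contain a node with a given label", and the names `Node`, `partial_eval` and `compose` are assumptions made for this example. Placeholders for fragments stored elsewhere are modelled as Boolean variables, one per missing sub-fragment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    virtual: bool = False  # placeholder for a fragment stored at another site


def partial_eval(node: Node, target: str) -> Tuple[bool, Set[str]]:
    """Evaluate the toy query on one fragment.

    Returns (known, pending): `known` is True if `target` occurs locally;
    `pending` holds one Boolean variable per virtual node. The partial
    answer stands for: known OR (any variable in pending).
    """
    if node.virtual:
        return False, {node.label}
    known = node.label == target
    pending: Set[str] = set()
    for child in node.children:
        child_known, child_pending = partial_eval(child, target)
        known = known or child_known
        pending |= child_pending
    return known, pending


def compose(fragment_id: str, partial: Dict[str, Tuple[bool, Set[str]]]) -> bool:
    """Coordinator step: resolve the variables using the collected partial answers."""
    known, pending = partial[fragment_id]
    return known or any(compose(var, partial) for var in pending)


# Three hypothetical fragments; a virtual node's label names the missing fragment.
fragments = {
    "F0": Node("client", [Node("country"), Node("F1", virtual=True)]),
    "F1": Node("broker", [Node("market", [Node("F2", virtual=True)])]),
    "F2": Node("name"),
}
# Each call below could run at a different site, in parallel.
partial_answers = {fid: partial_eval(root, "name") for fid, root in fragments.items()}
answer = compose("F0", partial_answers)
```

Only the small partial answers travel to the coordinator, never the fragments themselves.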

Parallel XPath (PaX3) • Evaluation algorithm, based on partial evaluation, for generic data-selecting XPath queries. Guarantees: • Max 3 visits per site • Parallel query processing • Total computation comparable to the best-known centralized algorithm • Total network traffic determined by the size of: • the query • the query answer • not the XML tree

Three stages of PaX3 • Each stage: a single visit to each site holding tree fragments • Stage 1: Partially evaluate the qualifiers of query Q. At the end, for each node we know: • the actual value of each qualifier, or • a Boolean formula whose value is yet to be determined • Stage 2: Partially evaluate the selection part of query Q. At the end, for each node we know: • whether or not the node is part of the answer of query Q, or • that the node is a candidate to be part of the answer • Stage 3: Determine which candidate answer nodes are true answer nodes; all nodes belonging to the answer of Q are transmitted to site S

PaX3 Algorithm (1/2) • SVect(Q): vector to store the prefixes of the selection path η1/ . . . /ηn, such that SVect(Q)[i] indicates the query η1/ . . . /ηi • QVect(Q): Boolean vector to store the list of all sub-queries of the qualifiers of Q
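The prefix structure of SVect(Q) can be shown with a small sketch. It assumes a qualifier-free selection path made only of child steps separated by "/", and the function name `svect` is hypothetical; handling of qualifiers and of "//" is left out.

```python
from typing import List


def svect(selection_path: str) -> List[str]:
    """Return the prefixes eta_1, eta_1/eta_2, ..., eta_1/.../eta_n of a path.

    Assumption: the path contains only child steps and no qualifiers,
    so splitting on "/" is enough to recover the individual steps.
    """
    steps = selection_path.split("/")
    return ["/".join(steps[:i]) for i in range(1, len(steps) + 1)]


prefixes = svect("client/broker/name")
```

Here `prefixes[i - 1]` plays the role of SVect(Q)[i], since Python lists are 0-indexed.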

PaX3 Algorithm (2/2) • Simple query Q over T • At each node v we compute the values of all the sub-queries in QVect(Q) and store them in a vector QVv. • Consult the (already computed) values of the QVect(Q) sub-queries at the children (QCVv) and descendants (QDVv) of v • Each fragment is processed in parallel, so the values of QVect(Q) are unknown for the virtual nodes • Partial evaluation: introduce Boolean variables, one for each missing value of QVect(Q) at each virtual node

Example (1/2) • Q = client[country/text() = “us”]/broker[market/name/text() = “nasdaq”]/name • normalize(Q) = client/ε[country/ε[text() = “us”]]/broker/ε[market/name/ε[text() = “nasdaq”]]/name • SVect(Q) = [q1, q2, q3], where • q1 = client, q2 = q1/broker, q3 = q2/name • QVect(Q) = [q1, q2, q3, q4, q5, q6, q7, q8, q9], where • q1 = country, q2 = [text() = “us”], q3 = q1/ε[q2], q4 = */ε[q3], q5 = name, q6 = [text() = “nasdaq”], q7 = q5/ε[q6], q8 = market/q7, q9 = */ε[q8]

Example (2/2) • QVname = <0, 0, 0, 0, 1, 0, 0, 0, 0> • QVcountry = <1, 0, 1, 0, 0, 0, 0, 0, 0> • QVF1 = <x1, x2, x3, x4, x5, x6, x7, x8, x9> • CQVF1 = <cx1, cx2, cx3, cx4, cx5, cx6, cx7, cx8, cx9> • QVclient = <0, 0, 0, 1, 0, 0, 0, 0, x8>

Analysis • Communication cost: O((|Q| |FT|) + |ans|) (optimal) • cost of transmitting our query to the various sites + • cost of retrieving the actual answers to our query • Total computation cost: O(|Q| |T|) • at each node v of Fj at most O(|Q|) operations are performed • total computation for each fragment is O(|Q| |Fj|) • Parallel computation cost: O(|Q| max_Si |FSi|) • |FSi|: total size of the fragments in site Si • Correctness: correct answer Q(T) on any XML tree T, no matter how T is fragmented and distributed

PaX2 • Two stages and max two visits to each site • Combines the first two stages of PaX3 into a single stage • evaluation of qualifiers + evaluation of selection paths • Querying site SQ makes a remote procedure call to all the sites holding fragments • At each such site, the procedure combines the partial evaluation of selection paths with that of qualifiers, over a fragment Fj. • The procedure performs a top-down traversal of fragment Fj. • At each node v of Fj, two types of computation are performed: a pre-order computation and a post-order computation.

Optimizing Query Evaluation (1/2) • Identify fragments that do not contain any nodes that are in the query answer • Requires that each edge (Fj, Fk) of the fragment tree FT of T is annotated with a simple XPath expression describing the path in T connecting the root of fragment Fj with the root of fragment Fk

Optimizing Query Evaluation (2/2) • XPath-annotations are used before the beginning of Stage 2 in PaX3 and before Stage 1 in PaX2 to identify fragments that are relevant to a query • If the input query Q has no qualifiers, then we can use XPath-annotations to skip the last step of both PaX3 and PaX2

Experimental Study • Q1: without qualifiers (with and without annotations) • Q4: with qualifiers

Experimental Study Conclusions • Distributing tree fragments over various sites proves an effective strategy, with significant reductions in evaluation times • In the presence of a ‘//’ in the selection path of a query, XPath-annotations might not help much • Using PaX2 along with XPath-annotations gives the best results

Conclusions • Developed algorithms and optimizations for evaluating generic XPath queries on fragmented and distributed XML trees. • Shown analytically and experimentally that these techniques are scalable and efficient. • Partial evaluation can also be combined with recent techniques developed for P2P systems and be applied to P2P query processing

2nd paper: XML Processing in DHT Networks

Problem • Study the management of XML data in P2P networks based on distributed hash tables (DHTs). • Identify performance limitations and propose an array of techniques to lift them: • DHT improvements • DPP algorithm to speed up query processing • Bloom filters to reduce the data transfers entailed by query processing

KadoP System • Peers publish XML documents and share the tasks of indexing the data and processing queries • Indexes the XML data in the form of postings, where each posting encodes information on an element or a keyword • “Responsibility” mechanism, by which (typically) a single peer stores all the postings for a given term • Given a query, the system combines the postings stored in the index to locate the peers that can contribute to the query, and forwards the query to these peers, where the final results are computed • Potential problem: posting lists for very popular terms grow very large and limit the system's scalability

KadoP Data and Query Model • Each document in the system is identified by a pair (p, d) • p: the numerical identifier of the peer that checked it in • d: the document identifier within this peer • Document (p, d): a labeled unranked tree (element and text nodes) • Element: uniquely identified by a structural identifier sid = (start, end, lev) • start: the number assigned to the opening tag of the element • end: the number assigned to the closing tag of the element • lev: denotes the element's level in the tree • Element e1 is an ancestor of element e2 if e1.start < e2.start < e1.end
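The ancestor test on structural identifiers above translates directly into code. In this minimal sketch the class name `Sid`, the `is_parent` helper, and the choice of level 1 for the root are assumptions made for illustration; only the `is_ancestor` condition comes from the slide.

```python
from typing import NamedTuple


class Sid(NamedTuple):
    start: int  # number assigned to the element's opening tag
    end: int    # number assigned to the element's closing tag
    lev: int    # level of the element in the tree


def is_ancestor(e1: Sid, e2: Sid) -> bool:
    """e1 is an ancestor of e2 if e1.start < e2.start < e1.end."""
    return e1.start < e2.start < e1.end


def is_parent(e1: Sid, e2: Sid) -> bool:
    """Assumed extension: a parent is an ancestor exactly one level up."""
    return is_ancestor(e1, e2) and e2.lev == e1.lev + 1


# Sids for <a><b><c/></b><d/></a>, numbering tags in document order
# and (as an assumption) giving the root level 1.
a = Sid(1, 8, 1)
b = Sid(2, 5, 2)
c = Sid(3, 4, 3)
d = Sid(6, 7, 2)
```

Both checks need only the two identifiers, so structural relationships can be tested on postings without fetching the documents.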

Indexing scheme of KadoP (1/3) • XML documents are stored at their publishing peer • The Term relation is distributed among the peers of the system using a distributed hash table (DHT) • Interface: • locate(k) returns the id of the peer in charge of key k • put(k, α) enters a new value for k • get(k) returns the value for k • delete(k) deletes the key k
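A toy, single-process stand-in for the four-operation interface above might look like the following. It is only a sketch: the class `ToyDHT` is hypothetical, all "peers" live in one Python process, and keys are placed by a simple hash-modulo rule, which is an assumption and not how a real DHT routes keys or handles peers joining and leaving.

```python
import hashlib
from typing import Any, Dict, Iterable, Optional


class ToyDHT:
    """In-memory imitation of the locate/put/get/delete interface."""

    def __init__(self, peer_ids: Iterable[int]) -> None:
        self.peer_ids = sorted(peer_ids)
        # One local key/value store per simulated peer.
        self.store: Dict[int, Dict[str, Any]] = {p: {} for p in self.peer_ids}

    def locate(self, k: str) -> int:
        """Return the id of the peer in charge of key k (hash modulo peers)."""
        digest = int(hashlib.sha1(k.encode("utf-8")).hexdigest(), 16)
        return self.peer_ids[digest % len(self.peer_ids)]

    def put(self, k: str, value: Any) -> None:
        """Enter a new value for k at the responsible peer."""
        self.store[self.locate(k)][k] = value

    def get(self, k: str) -> Optional[Any]:
        """Return the value for k, or None if the key is absent."""
        return self.store[self.locate(k)].get(k)

    def delete(self, k: str) -> None:
        """Delete the key k if it exists."""
        self.store[self.locate(k)].pop(k, None)


dht = ToyDHT([1, 2, 3])
dht.put("author", [(1, 7, (2, 5, 2))])  # term -> list of (p, d, sid) postings
owner = dht.locate("author")
```

Every operation goes through `locate`, which is what makes a single peer responsible for all values stored under one key.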

Indexing scheme of KadoP (2/3) • Term(p, d, sid, l): l is the label of element (p, d, sid) • Term(p, d, sid, w): w is a word under element (p, d, sid) • Posting: a tuple in Term • Posting list for a term a (La): the set of its postings

Indexing scheme of KadoP (3/3) • The DHT assigns the keys automatically among the peers (hash function), and handles the redistribution of keys when peers join or leave the network • The keys of the relation are the terms, and the values are the corresponding posting lists • An important property of the KadoP index is that it identifies precisely the documents that contain results for a query q, which in turn can considerably limit the set of peers to which q is forwarded

Challenge • Evaluation of index queries that involve long posting lists, since they represent the true challenge for a DHT-based approach

Improving indexing time • Postings of the same term are buffered and sent in batches • Extending the DHT API • insert: n successive entries associated with key k lead to a total I/O complexity of n² • new operation append(key, entry) to obtain indexing of linear cost • Replacing its data store • DHT's communication buffers to cope with many small messages generated by small posting lists • Tuning of the index storage
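One plausible way to see the gap between quadratic and linear indexing cost is the toy cost model below. It is an assumption made for illustration, not a description of the actual DHT store: it supposes that an insert rewrites the whole stored list for the key, while an append writes only the new entry. Both function names are hypothetical.

```python
def io_cost_with_rewrite(n: int) -> int:
    """Entries written when every insert rewrites the full list for the key."""
    stored = []
    io = 0
    for entry in range(n):
        stored = stored + [entry]  # the whole list is written back
        io += len(stored)
    return io


def io_cost_with_append(n: int) -> int:
    """Entries written when each new entry is appended in place."""
    stored = []
    io = 0
    for entry in range(n):
        stored.append(entry)  # only the new entry is written
        io += 1
    return io


rewrite_cost = io_cost_with_rewrite(100)
append_cost = io_cost_with_append(100)
```

Under this model the rewrite cost is n(n + 1)/2 entries, which grows as n², while the append cost is exactly n.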

Improving query response time • Peer p in charge of a query q performs a holistic twig join on the posting lists received from other peers • get: returns only when the content of the posting list has been fully retrieved • Adding a pipelined get method, which transfers posting lists asynchronously.
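The difference between the blocking get and a pipelined get can be hinted at with a generator. This is only a sketch: the names `blocking_get` and `pipelined_get` and the `batch_size` parameter are hypothetical, and real network asynchrony is not modelled; the point is just that a consumer can start working on the first batch before the last one exists.

```python
from typing import Iterator, List


def blocking_get(posting_list: List[int]) -> List[int]:
    """Return only once the full posting list is available."""
    return list(posting_list)


def pipelined_get(posting_list: List[int], batch_size: int) -> Iterator[List[int]]:
    """Hand over the posting list batch by batch, as soon as each is ready."""
    for i in range(0, len(posting_list), batch_size):
        yield posting_list[i:i + batch_size]


postings = list(range(10))
seen_batches = []
for batch in pipelined_get(postings, batch_size=4):
    # A join operator could consume this batch right away,
    # instead of waiting for the whole list.
    seen_batches.append(batch)
```

Concatenating the batches yields the same list that `blocking_get` returns, so the join sees identical input in both cases.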

DPP Algorithm • Distributed posting partitioning (DPP) is a distributed hierarchical data structure for managing posting lists • A DPP is used to split a posting list for a given term over several peers

Implementation of DPP in KadoP • Originally, the entries of one posting list are all in one data block • The system sets a bound on the number of entries in a data block • When inserting entries, a block may overflow and be split
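The bound-and-split behaviour above can be sketched with a local data structure. This is a simplified illustration under stated assumptions: the class `PostingBlocks` is hypothetical, blocks are kept in one process rather than spread over peers, postings are kept sorted, and an overflowing block is split in half. The slide does not say how KadoP chooses the split point or where the new block is placed.

```python
import bisect
from typing import List, Tuple

Posting = Tuple[int, int, int]  # (peer id, document id, start number)


class PostingBlocks:
    """Posting list stored as ordered blocks with a bound on block size."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.blocks: List[List[Posting]] = [[]]  # initially a single block

    def insert(self, posting: Posting) -> None:
        # Pick the last block whose first entry is <= the new posting
        # (or the first block if the posting precedes everything).
        target = 0
        for j, block in enumerate(self.blocks):
            if block and block[0] <= posting:
                target = j
        block = self.blocks[target]
        bisect.insort(block, posting)
        if len(block) > self.max_entries:
            # Overflow: split the block into two halves (assumed policy).
            mid = len(block) // 2
            self.blocks[target:target + 1] = [block[:mid], block[mid:]]


dpp = PostingBlocks(max_entries=4)
incoming = [(1, 1, s) for s in (50, 10, 40, 20, 30, 90, 70, 60, 80, 5)]
for p in incoming:
    dpp.insert(p)
```

After the inserts, no block exceeds the bound and reading the blocks in order gives the postings in sorted order.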

Implementation of DPP in KadoP

Experiment (Indexing time) • DPP block splitting has a moderate cost • Many publishers drastically cut indexing time, as they work in parallel

Experiment (Query response time) • Benefits of the DPP: • query processing time is cut by a factor of three • its growth is really slow as the data volumes grow

Structural Bloom Filters • Mechanism for reducing the volume of transferred data during the evaluation of index queries • A Bloom Filter provides a concise representation of a set S in a form that is suitable for membership queries
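For readers unfamiliar with the underlying structure, here is a minimal sketch of a plain (non-structural) Bloom filter. The class name, the default sizes, and the use of SHA-256 to derive bit positions are choices made for this example; the structural filters of the paper encode element positions rather than strings, which is not shown here.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Concise set representation: no false negatives, possible false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


members = ["p1:d3", "p2:d1", "p4:d9"]
bf = BloomFilter()
for m in members:
    bf.add(m)

# Filtering a remote list keeps every true match, plus possibly a few extras.
candidates = ["p1:d3", "p7:d2", "p4:d9", "p8:d8"]
reduced = [c for c in candidates if c in bf]
```

Because membership tests never miss an added item, the reduced list is always a superset of the true matches, which is the property the filtering strategies rely on.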

Ancestor Bloom Filter (AB Filter) • Tags a and b and the respective posting lists La and Lb • ABF(a): AB Filter for La, enables the filtering of Lb and the computation of a sub-list F(b, ABF(a)) • F(b, ABF(a)): contains a superset of b[\\a], the set of Lb postings that have an ancestor in La

Descendant Bloom Filter (DB Filter) • DBF(b): DB Filter for Lb, enables the filtering of La and the computation of a sub-list F(a, DBF(b)) • F(a, DBF(b)): contains a superset of a[//b], that is, the postings in La that have at least one descendant in Lb
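The two exact sets that the AB and DB filters approximate, b[\\a] and a[//b], can be computed directly from structural identifiers when both posting lists are available in one place. The sketch below does exactly that, with no Bloom filter involved; the `Posting` layout and the function names are assumptions for this example.

```python
from typing import List, NamedTuple


class Posting(NamedTuple):
    p: int      # peer identifier
    d: int      # document identifier within the peer
    start: int  # number of the opening tag
    end: int    # number of the closing tag
    lev: int    # level in the tree


def contains(anc: Posting, desc: Posting) -> bool:
    """True if anc is an ancestor of desc within the same document."""
    same_doc = anc.p == desc.p and anc.d == desc.d
    return same_doc and anc.start < desc.start < anc.end


def b_with_ancestor_a(la: List[Posting], lb: List[Posting]) -> List[Posting]:
    """Exact b[\\\\a]: postings of Lb that have an ancestor in La."""
    return [b for b in lb if any(contains(a, b) for a in la)]


def a_with_descendant_b(la: List[Posting], lb: List[Posting]) -> List[Posting]:
    """Exact a[//b]: postings of La that have at least one descendant in Lb."""
    return [a for a in la if any(contains(a, b) for b in lb)]


la = [Posting(1, 1, 1, 10, 1), Posting(1, 2, 1, 4, 1)]
lb = [Posting(1, 1, 3, 4, 2), Posting(1, 2, 6, 7, 2), Posting(2, 1, 3, 4, 2)]
kept_b = b_with_ancestor_a(la, lb)
kept_a = a_with_descendant_b(la, lb)
```

A filter-based strategy would ship a compact summary of one list instead of the list itself, accepting a superset of these results in exchange for less data transfer.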

Query Evaluation with Bloom Filters • Three query processing strategies based on Structural Bloom Filters: • Ancestor Bloom Reducer (AB Filter), • Descendant Bloom Reducer (DB Filter), • Bloom Reducer (a hybrid of the previous two) • Phases: • the peers exchange structural Bloom Filters and reduce their posting lists • the reduced lists are sent to the query peer for the final join

Performance of Bloom-based strategies • DB Reducer: very effective in filtering postings that are irrelevant to the query, leading to a reduction of more than 90% in transfer load • Bloom Reducer and AB Reducer: less effective, as they transfer a large AB filter on article without getting any significant benefit from filtering the small list of Ullman

Performance of Bloom-based strategies • AB Reducer and Bloom Reducer become more efficient, since the overhead of the AB filter on article is now offset by the savings of reducing author, the dominant list in this query

Performance of Bloom-based strategies • The proposed strategies do not enable any savings for this particular query • This is due to the existence of the title branch, which has a detrimental effect on the performance of each strategy

Conclusions • Investigating several improvements on the KadoP system • Exploring techniques to improve index construction time, as it remains significant when a large collection of documents is published

End • Thank you for your attention! Any questions?
