XQuery Processing with Relevance Ranking
Leonidas Fegaras
University of Texas at Arlington
fegaras@cse.uta.edu
http://lambda.uta.edu/
Motivation
• Many IR techniques for approximate matching over text-rich documents
  • keyword search queries over flat documents only
  • ranked results
• XQuery is very powerful in expressing exact queries over XML
• Our goal:
  • XQuery processing + IR-style approximate matching
• Challenges:
  • relevance ranking functions
  • ordering the results by their relevance scores
  • propagating and combining scores in a query
  • complexity of XQuery
  • multiple documents
  • nested queries
XQuery with Approximate Matching
XQuery syntax extensions:
• full-text search: e ~ S
  • search specification S: “phrase”, S1 and S2, S1 or S2, not S
• relevance assessment: score(e)
• all indexed documents: document()

    <answer>{
      ( for $db in document()/biblio,
            $b in $db/bib[title ~ ("XQuery processing" and "relevance")]
        where $b/abstract ~ ("SAX" and not "DOM")
        order by score($b) descending
        return <paper>{ $b/author/name, $b/title, score($b) }</paper>
      )[position()<=10]
    }</answer>
Design Goals
• Design an XML indexing scheme for
  • path navigation
  • search predicates
• Provide relevance ranking functions based on the indexes
• Build a highly pipelined XQuery engine
  • that uses merge-joins exclusively
  • does not materialize intermediate results
• This engine should consist of operators that
  • propagate and combine relevance scores
  • naturally reflect the syntactic structures of XQuery and
  • can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
• The XQuery translation should be concise, clean, and completely compositional
Related Work
• TIX algebra [Al-Khalifa et al, SIGMOD'03]
• TeXQuery [Amer-Yahia et al, WWW'04]
• XQuery/IR [Bremer et al, WebDB'02]
• Languages: ELIXIR, XIRQL
• Systems: XXL, XRANK, XIRCUS
Relevance Ranking
• e ~ “term”: the term is associated with a pair: (weight, position)
  • the position is relative to the beginning of the e element
  • the weight is based on
    • the standard IR tf-idf (term-frequency/inverse-document-frequency)
    • the difference between the nesting levels of the term and e
• The scores for phrases & boolean connectives are based on term proximity
  • a conjunction of terms is summarized by their center of mass
    • position: (Σ position_i * weight_i) / Σ weight_i
    • weight: (Σ position_i * weight_i) / Σ position_i
  • two sets of position/weight pairs:
    • positive terms (T): a disjunction of possibilities
    • negative terms (F): a conjunction of forbidden terms
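The center-of-mass summary above can be sketched in a few lines of Java. This is only an illustration of the two formulas on the slide, not the engine's code: the names `TermHit` and `CenterOfMass.summarize` are hypothetical, and the sketch assumes that at least one weight and one position are non-zero.

```java
import java.util.Arrays;
import java.util.List;

final class CenterOfMass {
    /** Summarizes a conjunction of term hits by one (weight, position) pair. */
    static TermHit summarize(List<TermHit> hits) {
        double sumPW = 0, sumW = 0, sumP = 0;
        for (TermHit h : hits) {
            sumPW += h.position * h.weight;
            sumW += h.weight;
            sumP += h.position;
        }
        // Assumption of this sketch: reject inputs that would divide by zero.
        if (sumW == 0 || sumP == 0) {
            throw new IllegalArgumentException("weights and positions must not all be zero");
        }
        // weight = sum(p*w) / sum(p), position = sum(p*w) / sum(w)
        return new TermHit(sumPW / sumP, sumPW / sumW);
    }

    public static void main(String[] args) {
        TermHit c = summarize(Arrays.asList(new TermHit(0.5, 10), new TermHit(0.5, 30)));
        System.out.println("position=" + c.position + " weight=" + c.weight);
        // prints position=20.0 weight=0.5
    }
}

/** One term occurrence: a (weight, position) pair, as on the slide. */
final class TermHit {
    final double weight;   // assumed to be a tf-idf style weight
    final double position; // offset from the beginning of the enclosing element

    TermHit(double weight, double position) {
        this.weight = weight;
        this.position = position;
    }
}
```

Two equally weighted hits at positions 10 and 30 are summarized by a single hit halfway between them, which is what makes the result usable as a proximity measure.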
Relevance Ranking (cont.)
• Merging terms from sets A and B:
    A B = { ( (p1*w1+p2*w2)/(p1+p2) , (p1*w1+p2*w2)/(w1+w2) ) | (p1,w1) ∈ A, (p2,w2) ∈ B }
• Position/weights of search specifications:
    [S1 and S2].T = [S1].T [S2].T
    [S1 and S2].F = [S1].F [S2].F
    [S1 or S2].T = [S1].T [S2].T
    [S1 or S2].F = [S1].F [S2].F
    [not S].T = [S].F
    [not S].F = [S].T
• Cost of a search specification S:
  • calculate: { |p1-p2| / size * w1 * (1-w2) | (p1,w1) ∈ [S].T, (p2,w2) ∈ [S].F }
  • reduce the set by the function: x ⊕ y = x+y-x*y
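The pairwise merge and the cost reduction can be sketched as follows. The names `PosWeight`, `ProximityScore`, `merge`, `oplus`, and `cost` are hypothetical. The sketch makes three assumptions: positions and weights are positive; `size` is the length of the enclosing element; and the two quotients of the merge are assigned to position and weight according to the center-of-mass definitions (position = Σ p*w / Σ w).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProximityScore {
    /** Pairwise merge: each pair of hits is replaced by its center of mass. */
    static List<PosWeight> merge(List<PosWeight> a, List<PosWeight> b) {
        List<PosWeight> out = new ArrayList<>();
        for (PosWeight x : a) {
            for (PosWeight y : b) {
                double pw = x.position * x.weight + y.position * y.weight;
                out.add(new PosWeight(pw / (x.weight + y.weight),
                                      pw / (x.position + y.position)));
            }
        }
        return out;
    }

    /** The reduction function from the slide: x (+) y = x + y - x*y. */
    static double oplus(double x, double y) {
        return x + y - x * y;
    }

    /** Combines every positive/negative pair and folds the results with oplus. */
    static double cost(List<PosWeight> t, List<PosWeight> f, double size) {
        double acc = 0.0; // 0 is the identity of oplus
        for (PosWeight pos : t) {
            for (PosWeight neg : f) {
                double term = Math.abs(pos.position - neg.position) / size
                        * pos.weight * (1 - neg.weight);
                acc = oplus(acc, term);
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        List<PosWeight> t = Arrays.asList(new PosWeight(10, 0.5));
        List<PosWeight> f = Arrays.asList(new PosWeight(30, 0.5));
        System.out.println(cost(t, f, 100));
    }
}

/** A position/weight pair with named fields, so tuple order is not an issue. */
final class PosWeight {
    final double position;
    final double weight;

    PosWeight(double position, double weight) {
        this.position = position;
        this.weight = weight;
    }
}
```

Because `oplus` maps values in [0, 1] back into [0, 1], the fold can absorb any number of pairs without the result leaving that range.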
Inverted Indexes
• Four indexes:
  • XML tags: each hit has a begin/end position
  • text terms: each hit has a position
  • attribute names
  • attribute values
• Each index delivers the posting/hit pairs in (document_number, begin_position) order
[Figure: index layout showing keys, postings, and hits]
The Pipeline Units

    abstract class Element {
        float score;            // relevance assessment of element
    }

    class Fragment extends Element {
        int document;           // document ID
        short begin;            // the start position in document
        short end;              // the end position in document
        short level;            // depth of term in document
    }

    class ConstructedElement extends Element {
        String tagname;
        Element[] sequence;     // children
        Attributes attributes;  // SAX-like attributes
    }

    class PCData extends Element {
        String data;
    }
The Pipeline Units (cont.)
• Need an element to capture all indexed elements:

    class Pattern extends Element {
        int min_level;   // minimum depth in document
        int max_level;   // maximum depth in document
    }

  • for queries such as: count(document()/*/*)
  • as a starting element for document()
• The unit of communication between pipeline operators is a tuple:

    class Tuple {
        Element[] components;
    }

  • one element for each for-variable in a FLWOR expression
Pipeline Iterators

    abstract class Iterator {
        Tuple current;                          // the tuple most recently made current
        Tuple current () { return current; }    // current tuple from stream
        abstract void open ();                  // open the stream iterator
        abstract Tuple next ();                 // get the next tuple from stream
        abstract boolean eos ();                // is this the end of stream?
    }

• An iterator reads data from the input stream(s) and delivers data to the output stream
• Iterators are connected through pipelines
  • an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
  • to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
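The demand-driven behavior described on this slide can be illustrated with a deliberately small sketch that streams integers instead of tuples. The names `PullIterator`, `ArraySource`, and `EvenFilter` are hypothetical, and the convention that `next()` returns the current element and then advances is an assumption of this sketch. The `pulls` counter only exists to show how many elements the consumer actually requested from the producer.

```java
final class EvenFilter extends PullIterator {
    private final PullIterator input;

    EvenFilter(PullIterator input) { this.input = input; }

    void open() { input.open(); skipOdd(); }
    boolean eos() { return input.eos(); }
    int current() { return input.current(); }

    int next() {
        int v = input.next(); // hand out the current even element
        skipOdd();            // pull as many inputs as needed to reach the next one
        return v;
    }

    private void skipOdd() {
        while (!input.eos() && input.current() % 2 != 0) input.next();
    }

    public static void main(String[] args) {
        EvenFilter evens = new EvenFilter(new ArraySource(1, 2, 3, 4, 5));
        for (evens.open(); !evens.eos(); ) System.out.println(evens.next());
    }
}

/** Simplified iterator over ints; next() returns the current element, then advances. */
abstract class PullIterator {
    abstract void open();
    abstract boolean eos();
    abstract int current(); // valid only while !eos()
    abstract int next();
}

final class ArraySource extends PullIterator {
    private final int[] data;
    private int pos;
    int pulls = 0; // number of elements requested so far

    ArraySource(int... data) { this.data = data; }

    void open() { pos = 0; }
    boolean eos() { return pos >= data.length; }
    int current() { return data[pos]; }
    int next() { pulls++; return data[pos++]; }
}
```

After `open()` and a single `next()` on the filter, the source has been asked for three of its five elements, not all of them: nothing is read ahead of demand, and nothing is materialized.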
Example

    class Child extends Iterator {
        String tag;
        Iterator input;
        IndexIterator ti;

        Tuple next () {
            while (!ti.eos() && !input.eos()) {
                if (input.current().components[0] instanceof Fragment) {
                    Fragment f = (Fragment) input.current().components[0];
                    Posting p = ti.posting();
                    TagHit h = (TagHit) ti.hit();
                    // the hit is a child of f: same document, strictly inside f, one level deeper
                    if ( f.document == p.document
                         && f.begin < h.begin && f.end > h.end
                         && h.level == f.level+1) {
                        ti.next();
                        return new Tuple(new Fragment(p.document,h.begin,h.end,h.level));
                    ...
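The containment test in the Child operator is the core of a merge-join between two streams that are both sorted by (document, begin). The sketch below is not the deck's operator: it works on in-memory lists rather than iterators, the names `Span` and `ChildJoin.children` are hypothetical, and it assumes that the parent fragments do not nest inside one another, so that a single forward pass over both inputs suffices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChildJoin {
    /** Both inputs must be sorted by (doc, begin); parents are assumed not to nest. */
    static List<Span> children(List<Span> parents, List<Span> tagHits) {
        List<Span> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < parents.size() && j < tagHits.size()) {
            Span f = parents.get(i), h = tagHits.get(j);
            if (h.doc < f.doc || (h.doc == f.doc && h.begin <= f.begin)) {
                j++; // hit starts before this parent: it cannot be inside it or any later one
            } else if (h.doc > f.doc || h.begin > f.end) {
                i++; // hit lies beyond this parent: move on to the next parent
            } else {
                // hit starts inside the parent; keep it only if it is a direct child
                if (h.end < f.end && h.level == f.level + 1) out.add(h);
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Span> parents = Arrays.asList(new Span(1, 0, 100, 1), new Span(2, 0, 50, 1));
        List<Span> hits = Arrays.asList(new Span(1, 10, 20, 2), new Span(1, 12, 15, 3),
                                        new Span(1, 30, 40, 2), new Span(2, 5, 9, 2));
        for (Span s : children(parents, hits)) System.out.println(s.doc + ":" + s.begin);
    }
}

/** Position of an element in the index: document, begin/end offsets, nesting level. */
final class Span {
    final int doc, begin, end, level;

    Span(int doc, int begin, int end, int level) {
        this.doc = doc;
        this.begin = begin;
        this.end = end;
        this.level = level;
    }
}
```

Each input is scanned once and never rewound, which is the property that lets such a join run inside a pipeline without buffering either side.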
For-Loops using Iterators
Need a stepper for a for-loop:

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;

        Tuple next () {
            // when the right stream is exhausted, advance the left stream and restart the right one
            while (!left.eos() && right.eos()) {
                left.next();
                if (!left.eos()) {
                    right_step.set(left.current());
                    right.open();
                }
            }
            if (left.eos()) return null;   // end of stream (convention assumed here)
            current = left.current().append(right.current());
            right.next();
            return current;
        }
    }

[Figure: a Loop with its left and right inputs; the Step right_step, updated through set, sits at the start of the right pipeline]
Let-Bindings using Iterators
Let-bindings are the hardest to implement:
• the let-value may be a sequence
• one producer -- many consumers
• we do not want to materialize the let-value in memory
[Figure: a queue (head, tail) showing the fastest consumer, the slowest consumer, and the backlog]
Some cases are hopeless:

    let $v:=e return ($v,$v)
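One way to picture the one-producer, many-consumers queue is a buffer that keeps only the elements some consumer has not yet read. The sketch below is a guess at the idea rather than the engine's data structure: `SharedStream`, its per-consumer `cursor` array, and `backlogSize()` are all hypothetical names, and the producer is an ordinary `java.util.Iterator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class SharedStream<T> {
    private final Iterator<T> producer;
    private final List<T> backlog = new ArrayList<>(); // elements not yet read by everyone
    private final long[] cursor;                       // next absolute index per consumer
    private long base = 0;                             // absolute index of backlog.get(0)

    SharedStream(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.cursor = new long[consumers];
    }

    boolean hasNext(int c) {
        return cursor[c] < base + backlog.size() || producer.hasNext();
    }

    T next(int c) {
        int offset = (int) (cursor[c] - base);
        if (offset == backlog.size()) {                // fastest consumer: pull from producer
            if (!producer.hasNext()) throw new NoSuchElementException();
            backlog.add(producer.next());
        }
        T item = backlog.get(offset);
        cursor[c]++;
        trim();
        return item;
    }

    /** Drops everything that even the slowest consumer has already read. */
    private void trim() {
        long slowest = cursor[0];
        for (long p : cursor) slowest = Math.min(slowest, p);
        int drop = (int) (slowest - base);
        if (drop > 0) {
            backlog.subList(0, drop).clear();
            base = slowest;
        }
    }

    int backlogSize() { return backlog.size(); }

    public static void main(String[] args) {
        SharedStream<Integer> v = new SharedStream<>(Arrays.asList(1, 2, 3, 4).iterator(), 2);
        v.next(0);
        v.next(0);
        System.out.println(v.backlogSize()); // prints 2
        v.next(1);
        System.out.println(v.backlogSize()); // prints 1
    }
}
```

When the consumers advance at similar rates, the backlog stays short. In the hopeless case `($v,$v)`, the first consumer drains the whole sequence before the second one starts, so the backlog grows to the full let-value and nothing is saved.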
Future Work
• Integration with pull-based, event-oriented processing of local XML files (instead of DOM-based)
• Incorporate evaluation techniques for top-K selection queries
• Use it in a peer-to-peer system as a distributed XML database
  • current P2P indexing techniques (based on DHTs) are overkill
    • for query /A/B: need to send all A index entries from peer “A” to peer “B”
  • preprocessing of XQueries using Bloom filters