XQuery Processing with Relevance Ranking

Leonidas Fegaras
University of Texas at Arlington
fegaras@cse.uta.edu
http://lambda.uta.edu/

Motivation
• Many IR techniques for approximate matching over text-rich documents
  • keyword search queries over flat documents only
  • ranked results
• XQuery is very powerful in expressing exact queries over XML
• Our goal:
  • XQuery processing + IR-style approximate matching
• Challenges:
  • relevance ranking functions
  • ordering the results by their relevance scores
  • propagating and combining scores in a query
  • complexity of XQuery
  • multiple documents
  • nested queries

XQuery with Approximate Matching
XQuery syntax extensions:
• full-text search: e ~ S
  • search specification S: “phrase”, S1 and S2, S1 or S2, not S
• relevance assessment: score(e)
• all indexed documents: document()
    <answer>{
      ( for $db in document()/biblio,
            $b in $db/bib[title ~ ("XQuery processing" and "relevance")]
        where $b/abstract ~ ("SAX" and not "DOM")
        order by score($b) descending
        return <paper>{ $b/author/name, $b/title, score($b) }</paper>
      )[position()<=10]
    }</answer>

Design Goals
• Design an XML indexing scheme for
  • path navigation
  • search predicates
• Provide relevance ranking functions based on the indexes
• Build a highly pipelined XQuery engine
  • that uses merge-joins exclusively
  • does not materialize intermediate results
• This engine should consist of operators that
  • propagate and combine relevance scores
  • naturally reflect the syntactic structures of XQuery and
  • can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
• The XQuery translation should be concise, clean, and completely compositional

Related Work
• TIX algebra [Al-Khalifa et al., SIGMOD'03]
• TeXQuery [Amer-Yahia et al., WWW'04]
• XQuery/IR [Bremer et al., WebDB'02]
• Languages: ELIXIR, XIRQL
• Systems: XXL, XRANK, XIRCUS

Relevance Ranking
• e ~ “term”: the term is associated with a pair: (weight, position)
  • the position is relative to the beginning of the e element
  • the weight is based on
    • the standard IR tf-idf (term-frequency/inverse-document-frequency)
    • the difference between the nesting levels of term and e
• The scores for phrases & boolean connectives are based on term proximity
  • a conjunction of terms is summarized by their center of mass
    • position: $\left(\sum_i \mathit{position}_i \cdot \mathit{weight}_i\right) / \sum_i \mathit{weight}_i$
    • weight: $\left(\sum_i \mathit{position}_i \cdot \mathit{weight}_i\right) / \sum_i \mathit{position}_i$
  • two sets of position/weight pairs:
    • positive terms (T): a disjunction of possibilities
    • negative terms (F): a conjunction of forbidden terms
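The center-of-mass summary above can be sketched in a few lines of Java. This is a minimal sketch, not the author's implementation: the class names TermHit and CenterOfMass are hypothetical, and it assumes positions and weights are positive so that neither denominator is zero.

```java
import java.util.Arrays;
import java.util.List;

/** A term occurrence: position relative to the start of the element, plus its weight. */
final class TermHit {
    final double position;
    final double weight;
    TermHit(double position, double weight) {
        this.position = position;
        this.weight = weight;
    }
}

final class CenterOfMass {
    /** Summarizes a conjunction of terms by a single (position, weight) pair. */
    static TermHit summarize(List<TermHit> hits) {
        double sumPW = 0, sumP = 0, sumW = 0;
        for (TermHit h : hits) {
            sumPW += h.position * h.weight;
            sumP += h.position;
            sumW += h.weight;
        }
        // position = sum(p*w) / sum(w); weight = sum(p*w) / sum(p), as on the slide.
        // Assumption: both sums are non-zero.
        return new TermHit(sumPW / sumW, sumPW / sumP);
    }

    public static void main(String[] args) {
        TermHit c = summarize(Arrays.asList(new TermHit(10, 0.5), new TermHit(30, 0.5)));
        System.out.println(c.position + " " + c.weight);
    }
}
```

With two equally weighted terms at positions 10 and 30, the summary sits halfway between them at position 20 and keeps the common weight 0.5.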

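Relevance Ranking (cont.)
• Merging terms from sets A and B:
    $A \otimes B = \left\{ \left( \frac{p_1 w_1 + p_2 w_2}{p_1 + p_2},\ \frac{p_1 w_1 + p_2 w_2}{w_1 + w_2} \right) \;\middle|\; (p_1, w_1) \in A,\ (p_2, w_2) \in B \right\}$
• Position/weights of search specifications:
    $[S_1 \text{ and } S_2].T = [S_1].T \otimes [S_2].T$
    $[S_1 \text{ and } S_2].F = [S_1].F \cup [S_2].F$
    $[S_1 \text{ or } S_2].T = [S_1].T \cup [S_2].T$
    $[S_1 \text{ or } S_2].F = [S_1].F \otimes [S_2].F$
    $[\text{not } S].T = [S].F$
    $[\text{not } S].F = [S].T$
• Cost of a search specification S:
  • calculate: $\left\{ \frac{|p_1 - p_2|}{\mathit{size}} \cdot w_1 \cdot (1 - w_2) \;\middle|\; (p_1, w_1) \in [S].T,\ (p_2, w_2) \in [S].F \right\}$
  • reduce the set by the function: $x \oplus y = x + y - x \cdot y$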
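The cost computation can be sketched as a double loop over the positive and negative sets, folding the per-pair values with the reduction function. This is a sketch under stated assumptions: the names PW and SearchCost are hypothetical, and size is passed in as a plain parameter because the slide does not define it.

```java
import java.util.Arrays;
import java.util.List;

/** A (p, w) pair as written on the slide: position p and weight w. */
final class PW {
    final double p, w;
    PW(double p, double w) { this.p = p; this.w = w; }
}

final class SearchCost {
    /** Reduction function from the slide: x + y - x*y. */
    static double oplus(double x, double y) {
        return x + y - x * y;
    }

    /** Folds |p1-p2|/size * w1 * (1-w2) over all positive/negative pairs. */
    static double cost(List<PW> positive, List<PW> negative, double size) {
        double acc = 0.0; // 0 is the identity of oplus
        for (PW t : positive) {
            for (PW f : negative) {
                acc = oplus(acc, Math.abs(t.p - f.p) / size * t.w * (1 - f.w));
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        List<PW> t = Arrays.asList(new PW(10, 0.8), new PW(20, 0.5));
        List<PW> f = Arrays.asList(new PW(60, 0.5));
        System.out.println(cost(t, f, 100)); // about 0.28
    }
}
```

Because x + y - x*y equals 1 - (1-x)(1-y), the reduction is associative and commutative, so the order in which the pairs are visited does not change the result.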

Inverse Indexes
• Four indexes:
  • XML tags: each hit has a begin/end position
  • text terms: each hit has a position
  • attribute names
  • attribute values
• Each index delivers the posting/hit pairs in (document_number,begin_position) order
(Figure: index structure with keys, postings, and hits)
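One way to picture the delivery order is a nested sorted map: key to documents (postings) to begin positions (hits). The sketch below is an in-memory illustration only. The class TermIndex and its methods are hypothetical, and the slides do not say how the real indexes are stored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class TermIndex {
    // key -> (document number -> sorted begin positions)
    private final Map<String, TreeMap<Integer, TreeSet<Integer>>> index = new TreeMap<>();

    void add(String key, int document, int begin) {
        TreeMap<Integer, TreeSet<Integer>> postings = index.get(key);
        if (postings == null) {
            postings = new TreeMap<>();
            index.put(key, postings);
        }
        TreeSet<Integer> hits = postings.get(document);
        if (hits == null) {
            hits = new TreeSet<>();
            postings.put(document, hits);
        }
        hits.add(begin);
    }

    /** All {document, begin} pairs for a key, in (document_number, begin_position) order. */
    List<int[]> scan(String key) {
        List<int[]> out = new ArrayList<>();
        TreeMap<Integer, TreeSet<Integer>> postings = index.get(key);
        if (postings == null) return out;
        for (Map.Entry<Integer, TreeSet<Integer>> e : postings.entrySet()) {
            for (int begin : e.getValue()) {
                out.add(new int[] { e.getKey(), begin });
            }
        }
        return out;
    }
}
```

Keeping every index in the same (document, begin) order is what lets two index scans be combined with a single forward pass, which is the precondition for the merge-joins named in the design goals.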

The Pipeline Units
    abstract class Element {
        float score;            // relevance assessment of element
    }
    class Fragment extends Element {
        int document;           // document ID
        short begin;            // the start position in document
        short end;              // the end position in document
        short level;            // depth of term in document
    }
    class ConstructedElement extends Element {
        String tagname;
        Element[] sequence;     // children
        Attributes attributes;  // SAX-like attributes
    }
    class PCData extends Element {
        String data;
    }

The Pipeline Units (cont.)
• Need an element to capture all indexed elements:
    class Pattern extends Element {
        int min_level;          // minimum depth in document
        int max_level;          // maximum depth in document
    }
  • for queries such as: count(document()/*/*)
  • as a starting element for document()
• The unit of communication between pipeline operators is a tuple:
    class Tuple {
        Element[] components;
    }
  • one element for each for-variable in a FLWOR expression

Pipeline Iterators
    abstract class Iterator {
        Tuple current;                        // set by the operator as it advances
        Tuple current () { return current; }  // current tuple from stream
        abstract void open ();                // open the stream iterator
        abstract Tuple next ();               // get the next tuple from stream
        abstract boolean eos ();              // is this the end of stream?
    }
• An iterator reads data from the input stream(s) and delivers data to the output stream
• Connected through pipelines
  • an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
  • to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
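The demand-driven behaviour can be illustrated with a tiny two-stage pipeline. This is a simplified sketch, not the engine's code: PullIterator, ListSource, Select, and PipelineDemo are hypothetical names, and the elements are plain Java objects rather than Tuple instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Pull-based stream mirroring open/current/next/eos from the slide. */
abstract class PullIterator<T> {
    abstract void open();     // (re)start the stream
    abstract T current();     // current element; only valid while !eos()
    abstract T next();        // advance; returns the new current element or null at end
    abstract boolean eos();   // true once the stream is exhausted
}

/** Leaf producer: streams the elements of a list. */
final class ListSource<T> extends PullIterator<T> {
    private final List<T> items;
    private int index;
    ListSource(List<T> items) { this.items = items; }
    void open() { index = 0; }
    T current() { return items.get(index); }
    T next() { index++; return eos() ? null : current(); }
    boolean eos() { return index >= items.size(); }
}

/** Consumer of its input and producer for the next operator: keeps matching elements. */
final class Select<T> extends PullIterator<T> {
    private final PullIterator<T> input;
    private final Predicate<T> keep;
    Select(PullIterator<T> input, Predicate<T> keep) { this.input = input; this.keep = keep; }
    void open() { input.open(); skip(); }
    T current() { return input.current(); }
    T next() { input.next(); skip(); return eos() ? null : current(); }
    boolean eos() { return input.eos(); }
    // Pull as many input elements as needed to produce one output element.
    private void skip() {
        while (!input.eos() && !keep.test(input.current())) input.next();
    }
}

final class PipelineDemo {
    static <T> List<T> drain(PullIterator<T> it) {
        List<T> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }

    public static void main(String[] args) {
        PullIterator<Integer> source = new ListSource<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        PullIterator<Integer> evens = new Select<>(source, n -> n % 2 == 0);
        System.out.println(drain(evens)); // [2, 4, 6]
    }
}
```

Nothing is buffered between the two stages: Select only advances ListSource when its own consumer asks for another element.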

Example
    class Child extends Iterator {
        String tag;
        Iterator input;
        IndexIterator ti;
        Tuple next () {
            while (!ti.eos() && !input.eos()) {
                if (input.current().components[0] instanceof Fragment) {
                    Fragment f = (Fragment) input.current().components[0];
                    Posting p = ti.posting();
                    TagHit h = (TagHit) ti.hit();
                    // the hit is a child of f: same document, contained in f, one level deeper
                    if ( f.document == p.document
                         && f.begin < h.begin
                         && f.end > h.end
                         && h.level == f.level+1) {
                        ti.next();
                        return new Tuple(new Fragment(p.document,h.begin,h.end,h.level));
                        ...

For-Loops using Iterators
Need a stepper for a for-loop:
    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }
    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
        Tuple next () {
            if (!left.eos()) {
                // when the right pipeline is exhausted, restart it on the next left tuple
                while (right.eos()) {
                    left.next();
                    right_step.set(left.current());
                    right.open();
                }
                current = left.current().append(right.current());
                right.next();
                return current;
            }
            return null;    // left input exhausted
        }
    }
(Figure: Loop with its left input and its right pipeline; the right pipeline starts at the Step right_step, which Loop drives through set)

Let-Bindings using Iterators
Let-bindings are the hardest to implement:
• the let-value may be a sequence
• one producer -- many consumers
• we do not want to materialize the let-value in memory
(Figure: queue with tail and head; labels: fastest consumer, slowest consumer, backlog)
Some cases are hopeless:
    let $v:=e return ($v,$v)
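The queue idea can be sketched as a shared buffer that holds only the elements the slowest consumer has not read yet. This is an illustrative sketch, not the engine's implementation: SharedLetValue is a hypothetical name, the producer is a plain java.util.Iterator, and the front-of-list removal is linear-time for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** One producer shared by several consumers; buffers only the backlog. */
final class SharedLetValue<T> {
    private final Iterator<T> producer;
    private final List<T> backlog = new ArrayList<>();
    private int head = 0;          // stream index of backlog.get(0)
    private final int[] position;  // next stream index wanted by each consumer

    SharedLetValue(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.position = new int[consumers]; // assumes at least one consumer
    }

    boolean hasNext(int consumer) {
        return position[consumer] < head + backlog.size() || producer.hasNext();
    }

    T next(int consumer) {
        int pos = position[consumer];
        // Only the fastest consumer ever pulls a new element from the producer.
        if (pos == head + backlog.size()) backlog.add(producer.next());
        T value = backlog.get(pos - head);
        position[consumer] = pos + 1;
        trim();
        return value;
    }

    int backlogSize() { return backlog.size(); }

    /** Drops elements that every consumer has already read. */
    private void trim() {
        int slowest = position[0];
        for (int p : position) slowest = Math.min(slowest, p);
        while (head < slowest) {
            backlog.remove(0);
            head++;
        }
    }

    public static void main(String[] args) {
        SharedLetValue<Integer> v =
            new SharedLetValue<>(Arrays.asList(1, 2, 3, 4, 5).iterator(), 2);
        v.next(0); v.next(0); v.next(0);       // consumer 0 runs ahead
        System.out.println(v.backlogSize());   // 3
        v.next(1); v.next(1);                  // consumer 1 catches up
        System.out.println(v.backlogSize());   // 1
    }
}
```

In a case like ($v,$v), the second consumer does not start until the first has read the whole sequence, so under this scheme the backlog would grow to the full let-value.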

Future Work
• Integration with pull-based, event-oriented processing of local XML files (instead of DOM-based)
• Incorporate evaluation techniques for top-K selection queries
• Use it in a peer-to-peer system as a distributed XML database
  • current P2P indexing techniques (based on DHTs) are overkill
  • for query /A/B: need to send all A index entries from peer “A” to peer “B”
  • preprocessing of XQueries using Bloom filters
