XRANK: Ranked Keyword Search over XML Documents

Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram • Presentation by: Meghana Kshirsagar, Nitin Gupta, Indian Institute of Technology, Bombay

Outline • Motivation • Problem Definition, Query Semantics • Ranking Function • A New Datastructure – Dewey Inverted List (DIL) • Algorithms • Performance Evaluation

Motivation

Motivation - I • Why do we need search over XML data? • Why not use search techniques used on WWW (keyword search on HTML)?

Motivation - II: Keyword Search: XML vs. HTML
• XML • Structural • Links: IDREFs and XLinks • Tags: content specifiers • Ranking • Result: XML element (a tree) • Element-level ranking • Proximity: width and height
• HTML • Structural • Links: document-to-document • Tags: format specifiers • Ranking • Result: document • Page-level ranking • Proximity: width (distance between words)

Problem Definition, Query Semantics, and Ranking

Problem Definition • Input: Set of keywords • Output: Ranked XML elements • What is a result? • How to rank results?

Bird's eye view of the system [Figure: XML doc repository, Preprocessing (ElemRank computation), Data Structures (DIL), Query Evaluator, Query Keywords, Results]

What is a result? • A minimal Steiner tree of XML elements • The result set is the set of XML elements that contain all query keywords at least once, after excluding the occurrences of keywords in contained results (if any).

[Figure: result 1, result 2]

Result: Graphical representation [Figure legend: containment edge, ancestor, descendant]

Ranking: Which results to return first? • Properties: • The Ranking function should • reflect Result Specificity • consider Keyword-Proximity • be Hyperlink Aware • Ranking function: • f (height, width, link-structure)

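[Figure: less specific result vs. more specific result]

Ranking Function • For a single XML element (node): $r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathrm{decay}^{t-1}$ • [Figure: containment path from $v_1$ down to $v_t$, which directly contains $k_i$]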

Ranking Function Combining ranks in case of multiple occurrences: Overall Rank:

Semantics of the ranking function • $r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathrm{decay}^{t-1}$ • $\mathrm{ElemRank}(v_t)$: link structure • $\mathrm{decay}^{t-1}$: specificity (height) • Proximity
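The per-keyword formula on the slides above can be sketched in Python. This is only an illustration, not the authors' implementation: the function names, the default decay of 0.8, and the use of max to combine multiple occurrences are assumptions made for the example, since the slides do not spell out the combining function.

```python
def keyword_rank(elem_rank_vt: float, t: int, decay: float = 0.8) -> float:
    """r(v1, ki) = ElemRank(vt) * decay^(t-1).

    vt is the element that directly contains the keyword, and t is the
    length of the containment path v1 ... vt (t = 1 when v1 is vt).
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    return elem_rank_vt * decay ** (t - 1)


def combined_rank(occurrences, decay=0.8, combine=max):
    """Combine the ranks of several occurrences of one keyword under v1.

    occurrences: iterable of (ElemRank(vt), t) pairs.
    combine: assumed aggregation function (max here; sum is another option).
    """
    return combine(keyword_rank(e, t, decay) for e, t in occurrences)
```

A deeper occurrence is penalised once per level: with decay 0.5, an occurrence three levels down keeps only a quarter of its ElemRank, so more specific results score higher.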

ElemRank Computation – adopt PageRank? • PageRank shortcomings: it fails to capture • bidirectional transfer of “ElemRanks” • discrimination between edge types (containment and hyperlink) • aggregation of “ElemRanks” for reverse containment relationships

ElemRank Computation - I • Consider both forward and reverse ElemRank propagation. • Ne = total # of XML elements • Nh(u) = # hyperlinks from 'u' • Nc(u) = # children of 'u' • E = HE ∪ CE ∪ CE' • CE' = reverse containment edges

ElemRank Computation - II • Separate containment and hyperlink edges • CE = containment edges • HE = hyperlink edges • ElemRank(sub-elements) ∝ 1 / (# sibling sub-elements)

ElemRank Computation - III • Sum over the reverse-containment edges, instead of distributing the weight • Nd(u) = total # XML documents • Nde(v) = # elements in the XML doc containing v • ElemRank(parent) ∝ Sum(ElemRank(sub-elements))
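The three refinements above can be put together in a small iterative sketch. The update rule below is an assumption reconstructed from the symbols on the slides (Nh, Nc, Nd, Nde, and the weights d1, d2, d3 listed in the evaluation setup); the function name, the input layout, and the fixed iteration count are hypothetical, and this should not be read as the authors' exact formula.

```python
from collections import Counter


def elem_rank(elements, children, hyperlinks, doc_of,
              d1=0.35, d2=0.25, d3=0.25, iters=100):
    """Iteratively compute ElemRank-style scores (illustrative sketch).

    elements:   list of element ids
    children:   dict parent -> list of child elements (containment edges, CE)
    hyperlinks: dict source -> list of target elements (hyperlink edges, HE)
    doc_of:     dict element -> id of the document that contains it
    """
    n_docs = len(set(doc_of.values()))      # Nd
    doc_size = Counter(doc_of.values())     # Nde(v), looked up per document
    rank = {v: 1.0 / len(elements) for v in elements}
    for _ in range(iters):
        # Base term: random jump to a document, then to one of its elements.
        new = {v: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[v]])
               for v in elements}
        for u in elements:
            for v in hyperlinks.get(u, []):
                new[v] += d1 * rank[u] / len(hyperlinks[u])  # split over Nh(u)
            for v in children.get(u, []):
                new[v] += d2 * rank[u] / len(children[u])    # split over Nc(u)
                new[u] += d3 * rank[v]   # reverse containment: summed, not split
        rank = new
    return rank
```

For a single document with a root and two children and no hyperlinks, this converges to 0.08 for the root and 0.06 for each child, so the parent aggregates the ranks of its sub-elements as the third slide describes.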

Datastructures and Algorithms

Naïve Algorithm • Approach: • XML element ~ doc • Use “keyword search on WWW” • Limitations: • Space overhead (in inverted indices) • Failure to model hierarchical relationships (ancestor~descendant) • Inaccurate ranking • Need a new datastructure which can model hierarchical relationships! • Answer: Dewey Inverted Lists

Labeling nodes using Dewey Ids

Dewey Inverted Lists • One entry per keyword • Entry for keyword 'k' has Dewey-IDs of elements directly containing 'k' • Simple equi merge-join of Dewey-ID-lists won't work ! • Need to compute prefixes.
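A minimal sketch of building such a list is shown here. It assumes a toy document model of nested `(text, children)` tuples and Dewey IDs stored as Python tuples whose first component is the document ID; `build_dil` and this layout are hypothetical names chosen for the example, and ranks and position lists are left out.

```python
from collections import defaultdict


def build_dil(doc_id, root):
    """Map each keyword to the sorted Dewey IDs of elements directly containing it.

    root is a nested (text, [children]) tuple; a child's Dewey ID is its
    parent's ID extended by the child's position among its siblings.
    """
    dil = defaultdict(list)

    def visit(node, dewey):
        text, children = node
        for word in set(text.lower().split()):
            dil[word].append(dewey)
        for i, child in enumerate(children):
            visit(child, dewey + (i,))

    visit(root, (doc_id,))
    return {word: sorted(ids) for word, ids in dil.items()}


doc = ("", [("XQL intro", []),
            ("", [("Ricardo", []), ("XQL", [])])])
index = build_dil(5, doc)
# index["xql"] is [(5, 0), (5, 1, 1)] and index["ricardo"] is [(5, 1, 0)]
```

No element appears in both lists here, which is why an equality join returns nothing: the answer (5, 1) only shows up as a common prefix of (5, 1, 0) and (5, 1, 1).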

System Architecture

DIL : Query Processing • Simple equality merge-join will not work • Need to find LCP (longest common prefix) over all elements with query keyword-match. • Single pass over the inverted lists suffices! • Compute LCP while merging the ILs of individual keywords. • ILs are sorted on Dewey-IDs

Datastructures • Array of all inverted lists : invertedList[] • invertedList[i] for keyword 'i' • each invertedList[i] is sorted on Dewey-ID • Heap to maintain top-m results : resultHeap • Stack to store current Dewey-ID, ranks, position List, longest common prefixes : deweyStack

Algorithm on DILs - Abstract
• While not all inverted lists are fully processed:
  • Read the next entry from the DIL having the smallest Dewey-ID; call this 'currentEntry'
  • Find the longest common prefix (lcp) between the stack components and the entry read from the DIL: lcp (deweyStack, currentEntry)
  • Pop non-matching entries from deweyStack, one component at a time; on each pop:
    • check if the popped top-of-stack contains all keywords
    • if yes, compute OverallRank and put this result onto the heap
    • else, update (rank, posList) of the new top of stack
  • Push the non-matching components of 'currentEntry.deweyID' onto deweyStack
  • Update components of the top entry of deweyStack
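The steps above can be sketched as a single merge pass in Python. This is a simplified illustration under stated assumptions, not the authors' code: Dewey IDs are tuples, each list entry carries only an ElemRank (no position list), multiple occurrences are combined with max, the overall rank is a plain sum with no proximity factor, and a sorted list stands in for the top-m heap. The name `dil_query` and the decay of 0.8 are hypothetical.

```python
import heapq


def lcp_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def dil_query(inverted_lists, decay=0.8):
    """inverted_lists: keyword -> list of (dewey_id, elem_rank), sorted by dewey_id.

    Returns (score, dewey_id) pairs, best first.
    """
    keywords = list(inverted_lists)
    streams = [[(dewey, kw, rank) for dewey, rank in inverted_lists[kw]]
               for kw in keywords]
    stack = []      # one frame per Dewey component: [component, {keyword: rank}]
    results = []

    def pop():
        comp, ranks = stack.pop()
        dewey = tuple(frame[0] for frame in stack) + (comp,)
        if len(ranks) == len(keywords):
            # Element contains all keywords: report it. Its occurrences are
            # not passed up, so ancestors cannot reuse them.
            results.append((sum(ranks.values()), dewey))
        elif stack:
            parent = stack[-1][1]
            for kw, r in ranks.items():
                parent[kw] = max(parent.get(kw, 0.0), r * decay)

    # heapq.merge yields entries of all lists in ascending Dewey-ID order.
    for dewey, kw, rank in heapq.merge(*streams):
        keep = lcp_len([frame[0] for frame in stack], dewey)
        while len(stack) > keep:          # pop non-matching components
            pop()
        for comp in dewey[keep:]:         # push non-matching part of the entry
            stack.append([comp, {}])
        top = stack[-1][1]
        top[kw] = max(top.get(kw, 0.0), rank)
    while stack:
        pop()
    return sorted(results, reverse=True)


hits = dil_query({"xql": [((5, 0, 3, 0, 0), 0.4)],
                  "ricardo": [((5, 0, 3, 0, 1), 0.5)]})
```

In this usage example the only result is the element 5.0.3.0, the deepest common ancestor of the two occurrences; its ancestors are not reported because the keyword occurrences stop propagating once a result is found.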

Example Query: “XQL Ricardo”

Algorithm Trace – Step 1 Rank[i] = Rank due to keyword 'i' PosList[i] = List of occurrences of keyword 'i' Smallest ID: 5.0.3.0.0 DeweyStack DIL: invertedList[] push all components and find rank, posL

Algorithm Trace – Step 2 Smallest ID: 5.0.3.0.1 DeweyStack DIL: invertedList[] find lcp and pop nonmatching components

Algorithm Trace – Step 3 Smallest ID: 5.0.3.0.1 DeweyStack DIL: invertedList[] updated rank, posL

Algorithm Trace – Step 4 Smallest ID: 5.0.3.0.1 DeweyStack DIL: invertedList[] push non-matching components

Algorithm Trace – Step 5 Smallest ID: 6.0.3.8.3 DeweyStack DIL: invertedList[] find lcp, update, finally pop all components

Problems with DIL • Scans the entire inverted-list for all keywords before a result is output • Very inefficient for top-k computation

Other Techniques - RDIL • Ranked Dewey Inverted List: • For efficient top-k result computation • IL is ordered by ElemRank • Each IL has a B+ tree index on the Dewey-IDs • Algorithm with RDIL uses a threshold

Algorithm using RDIL (Abstract)
• Choose the next entry from one of the invertedList[] in a round-robin fashion
  • say chosen IL = invertedList[i]
  • d = top-ranked Dewey-ID from invertedList[i]
• Find the longest common prefix that contains all query keywords
  • Probe the B+ tree index of all other keyword ILs for the longest common prefix
  • Claim:
    • d2 = smallest Dewey-ID greater than d in invertedList[j] of query keyword 'j'
    • d3 = immediate predecessor of d2
    • lcp = max_prefix (lcp (d, d2), lcp (d, d3))
• Check if 'lcp' is a complete result
• Recompute 'threshold' = sum (ElemRank of last processed element in each query keyword IL)
• If (rank of top-k results on heap >= threshold) return;

Performance of RDIL • Works well for queries with highly correlated keywords • But becomes equivalent to (actually worse than) DIL for totally uncorrelated keywords • Need an intermediate technique
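The probe in the claim above can be illustrated with a sorted Python list and `bisect` standing in for the B+ tree index on Dewey-IDs. The function name `probe_lcp` is hypothetical, and the sketch takes d2 as the smallest ID not less than d, a slight generalisation that also covers an exact match.

```python
from bisect import bisect_left


def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def probe_lcp(d, sorted_ids):
    """Longest prefix that d shares with any ID in sorted_ids.

    Only the two neighbours of d in sorted order need to be checked:
    d2 (the first ID >= d) and d3 (the ID just before d2).
    """
    i = bisect_left(sorted_ids, d)
    best = ()
    if i < len(sorted_ids):                       # d2 exists
        best = common_prefix(d, sorted_ids[i])
    if i > 0:                                     # d3 exists
        cand = common_prefix(d, sorted_ids[i - 1])
        if len(cand) > len(best):
            best = cand
    return best


ids_j = [(5, 0, 1), (5, 0, 3, 0, 1), (6, 0)]
prefix = probe_lcp((5, 0, 3, 0, 0), ids_j)
```

Here d2 is 5.0.3.0.1 and d3 is 5.0.1, so the probe returns 5.0.3.0. Checking only the two neighbours is enough because IDs sharing a longer prefix with d sort closer to d.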

HDIL • Uses both DIL and RDIL • Adaptive strategy: • Start with RDIL • Switch to DIL if performance is bad • Performance? • Estimated remaining time for RDIL = (m – r ) * t / r • t = time spent so far • r = no. of results above threshold so far • m = desired no. of results • Estimated remaining time for DIL ? • No. of query-keywords is known • Size of each IL is known
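The RDIL estimate on this slide is simple enough to write down directly. In the sketch below the function names are hypothetical, and the DIL estimate is passed in by the caller because the slide only hints at how it is derived (from the number of query keywords and the sizes of their inverted lists).

```python
import math


def rdil_remaining_time(m: int, r: int, t: float) -> float:
    """Estimated remaining RDIL time: (m - r) * t / r.

    m: desired number of results
    r: results found above the threshold so far
    t: time spent so far
    """
    if r <= 0:
        return math.inf      # no results yet, so no basis for an estimate
    return (m - r) * t / r


def should_switch_to_dil(m: int, r: int, t: float, dil_estimate: float) -> bool:
    """Switch when RDIL is expected to take longer than a full DIL scan."""
    return rdil_remaining_time(m, r, t) > dil_estimate
```

For example, with m = 10 desired results, r = 2 found after t = 4 seconds, the RDIL estimate is (10 - 2) * 4 / 2 = 16 seconds, so a DIL scan estimated at 5 seconds would be preferred.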

HDIL • Datastructures? • Store full IL sorted on Dewey-ID • Store small fraction of IL sorted on ElemRank • Share the leaf level between IL and B+ tree (in RDIL) • Overhead: top levels of B+ tree

Updating the lists • Updating is easy • Insertion: very bad! • techniques from Tatarinov et al. • we've seen a better technique in this course :) – OrdPath

Evaluation • Criteria: • no. of query-keywords • correlation between query-keywords • desired no. of query results • selectivity of keywords • Setup: • Datasets used: DBLP, XMark • d1 = 0.35, d2 = 0.25, d3 = 0.25 • 2.8GHz Pentium IV + 1GB RAM + 80GB HDD

Performance - 1

Performance - 2

Critique • New datastructure (DIL) defined to represent hierarchical relationships accurately and efficiently. • Hyperlinks and IDREFs are considered only while computing ElemRank. Not used while returning results. • Only containment edges (ancestor-descendant) are considered while computing result trees. • Works only on trees, can't handle graphs.

The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents • Jens Graupmann, Ralf Schenkel, Gerhard Weikum, Max-Planck-Institut für Informatik • Presentation by: Nitin Gupta, Meghana Kshirsagar, Indian Institute of Technology, Bombay

Why another search engine? • To cope with diversity in the structures and annotations of the data • Ranked retrieval paradigm for producing relevance-ordered result lists rather than mere boolean retrieval • Shortcomings of current search engines: they lack • Concept awareness • Context awareness (or link-awareness) • Abstraction awareness • A query language

Concept awareness • Example: researcher max planck yields many results about researchers who work at the institutes of the “Max Planck” Society • Better formulation would be researcher person=“max planck” • Objective attained by • Transformation to XML • Data Annotation
