
XRANK: Ranked Keyword Search over XML Documents



Presentation Transcript


  1. XRANK: Ranked Keyword Search over XML Documents Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Presentation by: Meghana Kshirsagar Nitin Gupta Indian Institute of Technology, Bombay

  2. Outline • Motivation • Problem Definition, Query Semantics • Ranking Function • A New Data Structure – Dewey Inverted List (DIL) • Algorithms • Performance Evaluation

  3. Motivation

  4. Motivation - I • Why do we need search over XML data? • Why not use search techniques used on WWW (keyword search on HTML)?

  5. Motivation - II: Keyword Search, XML vs. HTML • XML • structural: links are IDREFs and XLinks; tags are content specifiers • ranking: the result is an XML element (a tree); element-level ranking; proximity has both width and height • HTML • structural: links are document-to-document; tags are format specifiers • ranking: the result is a document; page-level ranking; proximity has only width (distance between words)

  6. Problem Definition, Query Semantics, and Ranking

  7. Problem Definition • Input: a set of keywords • Output: ranked XML elements • What is a result? How do we rank results?

  8. Bird's-eye view of the system [architecture figure: query keywords enter the Query Evaluator, which uses data structures (the DIL) produced by preprocessing (ElemRank computation) over the XML document repository, and returns ranked results]

  9. What is a result? • A minimal Steiner tree of XML elements • The result set is the set of XML elements that contain all query keywords at least once, after excluding the occurrences of keywords in contained results (if any)

  10. [Example document with two highlighted results: result 1 and result 2]

  11. Result: graphical representation [figure: result tree with containment edges linking ancestor and descendant elements]

  12. Ranking: Which results to return first? • Properties: • The Ranking function should • reflect Result Specificity • consider Keyword-Proximity • be Hyperlink Aware • Ranking function: • f (height, width, link-structure)

  13. [Figure contrasting a less specific result with a more specific result]

  14. Ranking Function For a single XML element (node): r(v1, ki) = ElemRank(vt) · decay^(t-1), where v1, v2, ..., vt is the containment path from the result element v1 down to the element vt that directly contains keyword ki

  15. Ranking Function Combining ranks in case of multiple occurrences: r̂(v1, ki) combines the ranks r(v1, ki) of the individual occurrences of ki (e.g., by taking the maximum). Overall Rank: the combined per-keyword ranks are summed and scaled by a keyword-proximity factor p(v1, k1, ..., kn)

  16. Semantics of the ranking function: in r(v1, ki) = ElemRank(vt) · decay^(t-1), the ElemRank(vt) factor captures link structure, the decay^(t-1) factor captures specificity (height), and keyword proximity enters through the proximity factor in the overall rank
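
As a concrete illustration of slides 14-16, here is a minimal Python sketch of this ranking computation. It assumes the per-occurrence rank decays with the depth of the containment path and that multiple occurrences of a keyword are combined by taking the maximum; the decay value and the proximity argument are illustrative placeholders, not values from the paper.

```python
DECAY = 0.75  # illustrative value; decay is a tunable parameter in (0, 1)

def keyword_rank(elem_rank_vt, path_length):
    """r(v1, ki) = ElemRank(vt) * decay^(t-1), where v1, ..., vt is the
    containment path from the result element v1 down to the element vt
    that directly contains keyword ki (path_length = t)."""
    return elem_rank_vt * DECAY ** (path_length - 1)

def overall_rank(occurrence_ranks_per_keyword, proximity):
    """Combine multiple occurrences of each keyword (here: max) and sum
    over keywords, scaled by a keyword-proximity factor p(v1, k1, ..., kn)."""
    return sum(max(ranks) for ranks in occurrence_ranks_per_keyword) * proximity

# Example: keyword 1 occurs twice (at depths 3 and 2 below v1), keyword 2
# occurs once at depth 1; every containing element has ElemRank 0.4.
r1 = [keyword_rank(0.4, 3), keyword_rank(0.4, 2)]
r2 = [keyword_rank(0.4, 1)]
print(overall_rank([r1, r2], proximity=1.0))   # 0.3 + 0.4 = 0.7
```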

  17. ElemRank Computation – adopt PageRank? • PageRank shortcomings – it fails to capture: • bidirectional transfer of “ElemRanks” • discrimination between edge types (containment vs. hyperlink) • aggregation of “ElemRanks” over reverse containment relationships

  18. ElemRank Computation - I • Consider both forward and reverse ElemRank propagation • Ne = total # of XML elements • Nh(u) = # hyperlinks from 'u' • Nc(u) = # children of 'u' • E = HE ∪ CE ∪ CE' • CE' = reverse containment edges

  19. ElemRank Computation - II • Separate containment and hyperlink edges • CE = containment edges • HE = hyperlink edges • ElemRank(sub-element) ∝ 1 / (# sibling sub-elements)

  20. ElemRank Computation - III • Sum over the reverse-containment edges, instead of distributing the weight • Nd = total # of XML documents • Nde(v) = # of elements in the XML document containing v • ElemRank(parent) ∝ Σ ElemRank(sub-elements)
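
A minimal sketch of how the ElemRank recurrence described on slides 18-20 could be computed by fixed-point iteration. The graph representation (edge lists, doc_of, elems_in_doc) and the exact placement of the random-jump term are assumptions made for illustration; the d1, d2, d3 defaults are the values quoted later on the evaluation slide.

```python
from collections import defaultdict

def elem_rank(elements, hyperlink_edges, containment_edges,
              doc_of, elems_in_doc, d1=0.35, d2=0.25, d3=0.25, iterations=30):
    """Fixed-point iteration of one reading of the ElemRank recurrence:
    e(v) = (1 - d1 - d2 - d3) / (Nd * Nde(v))
           + d1 * sum over hyperlink edges (u, v) of e(u) / Nh(u)
           + d2 * sum over containment edges (u, v) of e(u) / Nc(u)
           + d3 * sum of the children's e-values (reverse containment)."""
    Nd = len(elems_in_doc)                       # total number of XML documents
    Nh, Nc = defaultdict(int), defaultdict(int)
    for u, _ in hyperlink_edges:
        Nh[u] += 1                               # hyperlinks out of u
    for u, _ in containment_edges:
        Nc[u] += 1                               # children of u

    e = {v: 1.0 / len(elements) for v in elements}
    for _ in range(iterations):
        new = {v: (1 - d1 - d2 - d3) / (Nd * elems_in_doc[doc_of[v]])
               for v in elements}
        for u, v in hyperlink_edges:
            new[v] += d1 * e[u] / Nh[u]          # hyperlink propagation
        for u, v in containment_edges:           # u is the parent, v the child
            new[v] += d2 * e[u] / Nc[u]          # forward: split among children
            new[u] += d3 * e[v]                  # reverse: aggregate, don't split
        e = new
    return e
```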

  21. Datastructures and Algorithms

  22. Naïve Algorithm • Approach: • treat each XML element as a document • use “keyword search on WWW” techniques • Limitations: • space overhead (in inverted indices) • failure to model hierarchical (ancestor-descendant) relationships • inaccurate ranking • Need a new data structure that can model hierarchical relationships! • Answer: Dewey Inverted Lists

  23. Labeling nodes using Dewey IDs
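
A small Python sketch of Dewey ID labeling, assuming Dewey IDs are represented as tuples of child positions rooted at a per-document prefix (here simply the document number); the helper name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def assign_dewey_ids(root, doc_id=0):
    """Label each element of an XML tree with a Dewey ID: the root gets
    (doc_id,) and the i-th child of an element with ID p gets p + (i,)."""
    labels = {}
    def visit(elem, dewey):
        labels[elem] = dewey
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))
    visit(root, (doc_id,))
    return labels

# Usage: the <author> element below gets the Dewey ID (0, 1),
# i.e. the second child of document 0's root.
doc = ET.fromstring("<paper><title>XQL</title><author>Ricardo</author></paper>")
labels = assign_dewey_ids(doc)
```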

  24. Dewey Inverted Lists • One entry per keyword • The entry for keyword 'k' has the Dewey IDs of elements directly containing 'k' • A simple equi merge-join of the Dewey ID lists won't work! • Need to compute prefixes
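
Building on the labeling sketch above, here is one possible in-memory layout for a Dewey Inverted List; the exact posting format (position list, placeholder ElemRank) is an assumption for illustration, not the paper's on-disk layout.

```python
from collections import defaultdict

def build_dil(labels, elem_rank=None):
    """Sketch of a Dewey Inverted List: keyword -> list of
    (dewey_id, ElemRank, position list), sorted by Dewey ID.
    Only the element that directly contains a keyword gets a posting."""
    elem_rank = elem_rank or (lambda dewey: 1.0)     # placeholder ElemRank
    postings = defaultdict(lambda: defaultdict(list))
    for elem, dewey in labels.items():
        for pos, word in enumerate((elem.text or "").lower().split()):
            postings[word][dewey].append(pos)
    return {
        word: sorted((dewey, elem_rank(dewey), positions)
                     for dewey, positions in by_elem.items())
        for word, by_elem in postings.items()
    }

# Usage with the labels from the previous sketch:
# build_dil(labels)["xql"] -> [((0, 0), 1.0, [0])]
```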

  25. System Architecture

  26. DIL: Query Processing • A simple equality merge-join will not work • Need to find the LCP (longest common prefix) over all elements that match the query keywords • A single pass over the inverted lists suffices! • Compute the LCP while merging the ILs of the individual keywords • ILs are sorted on Dewey IDs
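
A tiny helper for the longest-common-prefix step, assuming Dewey IDs are tuples of integer components as in the earlier sketches.

```python
def dewey_lcp(a, b):
    """Longest common prefix of two Dewey IDs, e.g.
    dewey_lcp((5, 0, 3, 0, 0), (5, 0, 3, 0, 1)) == (5, 0, 3, 0)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)
```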

  27. Data structures • Array of all inverted lists: invertedList[] • invertedList[i] is for keyword 'i' • each invertedList[i] is sorted on Dewey ID • Heap to maintain the top-m results: resultHeap • Stack to store the current Dewey ID, ranks, position list, and longest common prefixes: deweyStack

  28. Algorithm on DILs - Abstract • While the inverted lists are not all processed: • read the next DIL entry with the smallest Dewey ID - call it 'currentEntry' • find the longest common prefix (lcp) between the stack contents and the entry read from the DIL: lcp(deweyStack, currentEntry) • pop non-matching entries from the Dewey stack, adding a result to the heap if appropriate: if the current top of the stack contains all keywords, compute its OverallRank and put the result onto the heap; otherwise pop non-matching entries one component at a time, updating (rank, posList) on each pop • push the non-matching components of 'currentEntry.deweyID' onto 'deweyStack' • update the components of the top entry of deweyStack
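
The following Python sketch follows the spirit of this stack-based merge; it is a simplification under stated assumptions, not the paper's exact procedure. Postings from all keyword lists are consumed in Dewey-ID order, a stack mirrors the current Dewey ID one component per frame, and a frame that turns out to contain all keywords is emitted as a result without propagating its keyword ranks upward (so contained results are excluded). Position lists and the proximity factor are omitted; the decay factor and the max-combination of occurrences follow the earlier ranking sketch.

```python
import heapq

DECAY = 0.75  # same illustrative decay as in the ranking sketch

def dil_keyword_search(inverted_lists, m):
    """inverted_lists: one sorted list of (dewey_id, elem_rank) postings per keyword.
    Returns up to m results as (overall_rank, dewey_id), best first."""
    k = len(inverted_lists)
    merged = heapq.merge(*[[(dewey, rank, kw) for dewey, rank in il]
                           for kw, il in enumerate(inverted_lists)])

    top = []     # min-heap of the best m (rank, dewey_id) results seen so far
    stack = []   # one frame per Dewey component: [component, per-keyword ranks]

    def pop_to_depth(depth):
        while len(stack) > depth:
            component, kw_ranks = stack.pop()
            if all(r is not None for r in kw_ranks):
                # This element contains every keyword: it is a result, and its
                # keyword occurrences are *not* propagated to its ancestors.
                dewey = tuple(f[0] for f in stack) + (component,)
                heapq.heappush(top, (sum(kw_ranks), dewey))
                if len(top) > m:
                    heapq.heappop(top)
            elif stack:
                # Otherwise the parent inherits the keyword ranks, decayed one level.
                parent = stack[-1][1]
                for i, r in enumerate(kw_ranks):
                    if r is not None:
                        parent[i] = max(parent[i] or 0.0, r * DECAY)

    for dewey, rank, kw in merged:
        # Longest common prefix between the stack contents and the new entry.
        lcp = 0
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        pop_to_depth(lcp)
        for component in dewey[lcp:]:
            stack.append([component, [None] * k])
        stack[-1][1][kw] = rank    # the element directly containing the keyword

    pop_to_depth(0)
    return sorted(top, reverse=True)

# Usage: dil_keyword_search([[((5,0,3,0,0), 0.4)], [((5,0,3,0,1), 0.3)]], m=10)
# returns [(0.525, (5, 0, 3, 0))]: the common parent element contains both keywords.
```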

  29. Example Query: “XQL Ricardo”

  30. Algorithm Trace – Step 1 [figure: DIL invertedList[] and the DeweyStack] Rank[i] = rank due to keyword 'i'; PosList[i] = list of occurrences of keyword 'i'. Smallest Dewey ID: 5.0.3.0.0 – push all of its components onto the stack and compute its rank and posList.

  31. Algorithm Trace – Step 2 [figure] Smallest Dewey ID: 5.0.3.0.1 – find the lcp and pop the non-matching components.

  32. Algorithm Trace – Step 3 [figure] Smallest Dewey ID: 5.0.3.0.1 – rank and posList updated.

  33. Algorithm Trace – Step 4 [figure] Smallest Dewey ID: 5.0.3.0.1 – push the non-matching components.

  34. Algorithm Trace – Step 5 [figure] Smallest Dewey ID: 6.0.3.8.3 – find the lcp, update, and finally pop all components.

  35. Problems with DIL • Scans the entire inverted-list for all keywords before a result is output • Very inefficient for top-k computation

  36. Other Techniques - RDIL • Ranked Dewey Inverted List: • For efficient top-k result computation • IL is ordered by ElemRank • Each IL has a B+ tree index on the Dewey-IDs • Algorithm with RDIL uses a threshold

  37. Algorithm using RDIL (Abstract) • Choose the next entry from one of the invertedList[] in a round-robin fashion • say the chosen IL is invertedList[i], and d = the top-ranked Dewey ID from invertedList[i] • Find the longest common prefix that contains all query keywords • probe the B+ tree index of every other keyword's IL for the longest common prefix • claim: with d2 = the smallest Dewey ID in invertedList[j] for query keyword 'j', and d3 = the immediate predecessor of d2, lcp = max_prefix(lcp(d, d2), lcp(d, d3)) • Check whether 'lcp' is a complete result • Recompute threshold = sum of the ElemRanks of the last processed element in each query keyword's IL • If the rank of the top-k results on the heap >= threshold, return
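
A minimal sketch of just the threshold-based stopping test from the last two bullets (the B+-tree probing and lcp logic are omitted); the heap layout and parameter names are assumptions for illustration.

```python
def can_stop(result_heap, last_seen_elem_ranks, k):
    """Stopping rule sketched on the slide: stop once we hold k complete results
    whose ranks all reach the threshold, i.e. the sum of the ElemRanks of the
    last entry processed in each keyword's ElemRank-ordered inverted list."""
    threshold = sum(last_seen_elem_ranks)
    if len(result_heap) < k:
        return False
    kth_best = min(rank for rank, _ in result_heap)   # worst of the current top-k
    return kth_best >= threshold
```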

  38. Performance of RDIL • Works well for queries with highly correlated keywords • But it becomes equivalent to (actually worse than) DIL for totally uncorrelated keywords • Need an intermediate technique

  39. HDIL • Uses both DIL and RDIL • Adaptive strategy: • Start with RDIL • Switch to DIL if performance is bad • Performance? • Estimated remaining time for RDIL = (m – r ) * t / r • t = time spent so far • r = no. of results above threshold so far • m = desired no. of results • Estimated remaining time for DIL ? • No. of query-keywords is known • Size of each IL is known
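
The adaptive switch can be driven by the two estimates above. A sketch follows, with the DIL-side estimate modeled simply as time proportional to the remaining inverted-list entries; that model and the function names are assumptions, since the slide only notes that the DIL estimate can be derived from the number of query keywords and the IL sizes.

```python
def remaining_time_rdil(t_spent, results_so_far, m):
    """Slide 39's estimate for RDIL: (m - r) * t / r."""
    if results_so_far == 0:
        return float("inf")          # no results yet: no basis for an estimate
    return (m - results_so_far) * t_spent / results_so_far

def remaining_time_dil(entries_left_per_list, time_per_entry):
    """Assumed model for DIL: it must scan every remaining posting once."""
    return sum(entries_left_per_list) * time_per_entry

def should_switch_to_dil(t_spent, results_so_far, m,
                         entries_left_per_list, time_per_entry):
    return (remaining_time_rdil(t_spent, results_so_far, m)
            > remaining_time_dil(entries_left_per_list, time_per_entry))
```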

  40. HDIL • Datastructures? • Store full IL sorted on Dewey-ID • Store small fraction of IL sorted on ElemRank • Share the leaf level between IL and B+ tree (in RDIL) • Overhead: top levels of B+ tree

  41. Updating the lists • Updates are easy • Insertions are very bad! • techniques from Tatarinov et al. can be used • we've seen a better technique in this course :) – ORDPATH

  42. Evaluation • Criteria: • no. of query keywords • correlation between query keywords • desired no. of query results • selectivity of keywords • Setup: • datasets used: DBLP, XMark • d1 = 0.35, d2 = 0.25, d3 = 0.25 • 2.8 GHz Pentium IV, 1 GB RAM, 80 GB HDD

  43. Performance - 1

  44. Performance - 2

  45. Critique • A new data structure (the DIL) is defined to represent hierarchical relationships accurately and efficiently. • Hyperlinks and IDREFs are considered only while computing ElemRank; they are not used while returning results. • Only containment edges (ancestor-descendant) are considered while computing result trees. • The approach works only on trees; it can't handle graphs.

  46. The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents Jens Graupmann Ralf Schenkel Gerhard Weikum Max-Planck-Institut für Informatik Presentation by: Nitin Gupta Meghana Kshirsagar Indian Institute of Technology Bombay

  47. Why another search engine? • To cope with diversity in the structure and annotations of the data • A ranked-retrieval paradigm that produces relevance-ordered result lists rather than mere Boolean retrieval • Shortcomings of current search engines (what they lack): • concept awareness • context awareness (link awareness) • abstraction awareness • an expressive query language

  48. Concept awareness • Example: the query researcher max planck yields many results about researchers who work at the Max Planck Society • A better formulation would be researcher person=“max planck” • The objective is attained by • transformation to XML • data annotation
