An Adaptive XML Retrieval System



  1. An Adaptive XML Retrieval System
     Yosi Mass, Michal Shmueli-Scheuer, IBM Haifa Research Lab
     XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008

  2. The XML retrieval tasks
     • Query formulation
       • CO: Content only
       • CAS: Content and structure (NEXI)
     • Retrieval tasks
       • Thorough: “find all highly exhaustive and specific elements”
         • Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query
       • Focussed: “find the most exhaustive and specific element in a path”
         • No overlap in returned results

  3. Approaches for XML retrieval
     • Index full documents
       • Score documents and then components inside the documents
       • Problem: works well for “fetch and browse” but not for the general thorough task
     • Index only leaf elements
       • Score leaves and propagate scores along the XML tree
       • Problem: the weights used for propagation are either set manually by the user or set empirically
     • Index all elements into the same index
       • Score all possible elements
       • Problem: distorted “element-level” statistics due to overlapping elements
     • Can we fix the distorted statistics?

  4. An adaptive XML retrieval system
     • Split all collection elements into separate indices such that
       • Coverage: each element is indexed in at least one index
       • No overlap: elements in each index do not nest
     • Run the query on each index
     • Merge the results into a single result list
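
The last step, merging per-index results into one list, can be sketched as follows. The slides do not say how the merge is done, so this is only an illustration under two assumptions: each index returns `(score, xpath)` pairs already sorted by descending score (a hypothetical result format), and scores are comparable across indices. The name `merge_results` is made up for this sketch.

```python
import heapq


def merge_results(per_index_results):
    """Merge per-index ranked lists into one list ordered by descending score.

    per_index_results: list of lists of (score, xpath) tuples; each inner list
    must already be sorted by descending score (assumed result format).
    """
    # heapq.merge does a lazy k-way merge; reverse=True expects descending inputs.
    return list(heapq.merge(*per_index_results, key=lambda r: r[0], reverse=True))


# Hypothetical results from two of the indices.
results_index_0 = [(0.9, "/article[1]")]
results_index_2 = [
    (0.8, "/article[1]/bdy[1]/sec[2]"),
    (0.4, "/article[1]/bdy[1]/sec[1]"),
]
merged = merge_results([results_index_0, results_index_2])
```

For the Thorough task the merged list may contain nested elements from different indices; a Focussed run would need an additional overlap-removal step that is not shown here.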

  5. Split to indices: example
     [Figure: tree diagrams of two example documents, with levels labeled Index 0 to Index 3]
     • Index 0
       • Left tree: /article[1]
       • Right tree: /article[1]
     • Index 1
       • Left tree: /article[1]/bdy[1]
       • Right tree: /article[1]/bdy[1]
     • Index 2
       • Left tree: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2]
       • Right tree: /article[1]/bdy[1]/sec[1]
     • Index 3
       • Left tree: /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3]
       • Right tree: /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2]

  6. An adaptive indexing schema
     SplitToIndices(doc, minCompSize, nInd)
     • Find all leaves in doc that are larger than minCompSize
     • If no such leaves are found, return G0 = {root}
     • Let d be the length of the longest path among those leaves
     • Create groups {G0,…,Gd-1}, where each Gi contains all elements given by the XPath prefixes of length i of the matched leaves
     • Remove repeated elements in each group
     • Split the groups {G0,…,Gd-1} into indices {I0,…,InInd-1} (several strategies)
     • Return {I0,…,InInd-1}
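
The grouping steps of SplitToIndices (everything before the final split into indices) might look roughly like this in Python. This is a sketch, not the authors' implementation. It assumes the document is given as a mapping from leaf XPath to a size (for example a word count, which the slides do not define), that the root path is passed in explicitly, and that G0 holds the one-step prefix (the root), G1 the two-step prefixes, and so on. The name `build_groups` and the sizes below are hypothetical.

```python
def build_groups(root, leaf_sizes, min_comp_size):
    """Return groups [G0, ..., Gd-1] of XPath prefixes, one set per depth level."""
    # Keep only leaves larger than min_comp_size.
    leaves = [path for path, size in leaf_sizes.items() if size > min_comp_size]
    # No qualifying leaf: a single group that holds only the root.
    if not leaves:
        return [{root}]
    steps_per_leaf = [path.strip("/").split("/") for path in leaves]
    # d is the number of steps in the longest qualifying leaf path.
    d = max(len(steps) for steps in steps_per_leaf)
    # Sets drop repeated elements automatically.
    groups = [set() for _ in range(d)]
    for steps in steps_per_leaf:
        for i in range(len(steps)):
            groups[i].add("/" + "/".join(steps[: i + 1]))
    return groups


# Paths modeled on the slide example; the sizes are made up.
leaf_sizes = {
    "/article[1]/bdy[1]/sec[1]": 40,
    "/article[1]/bdy[1]/sec[2]/p[1]": 25,
    "/article[1]/bdy[1]/sec[2]/p[2]": 4,   # too small, filtered out
    "/article[1]/bdy[1]/sec[2]/p[3]": 30,
}
groups = build_groups("/article[1]", leaf_sizes, min_comp_size=10)
```

Because every element in a group sits at the same depth, no two elements of one group can nest, which is what the no-overlap requirement asks for.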

  7. Examples: cut long paths
     • Minimal element: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
     • Split to indices
       • index 0: /article[1]
       • index 1: /article[1]/body[1]
       • index 2: /article[1]/body[1]/section[7]
       • index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
       • index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
       • index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
       • index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
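
One way to reproduce the split shown in this example is to keep the top levels and the bottom levels of a long path and drop the levels in between. The sketch below does that with `n_ind // 2` levels from the top and the rest from the bottom. That 3/4 division is an assumption chosen because it matches this single example for nInd=7; the slides only say that several strategies exist. The function name `cut_long_paths` is hypothetical.

```python
def cut_long_paths(groups, n_ind):
    """Map depth groups to at most n_ind indices by dropping middle levels."""
    d = len(groups)
    if d <= n_ind:
        # Few enough levels: one index per level.
        return list(groups)
    top = n_ind // 2          # levels kept from the root side
    bottom = n_ind - top      # levels kept from the leaf side
    return list(groups[:top]) + list(groups[d - bottom:])


# The minimal element from the example, expanded into one group per level.
leaf = "/article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]"
steps = leaf.strip("/").split("/")
groups = [{"/" + "/".join(steps[: i + 1])} for i in range(len(steps))]
indices = cut_long_paths(groups, 7)
```

With these inputs the ten levels collapse to seven indices, and the elements at depths 4 to 6 are not indexed at all, which is one source of missed in-between elements.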

  8. Experiments
     • IEEE collection 1995-2004
       • 17,000 articles, 700MB
       • Average document length ~41K
       • Average depth 6.9
       • 29 topics from INEX 2005
     • Wikipedia collection
       • 660,000 pages, 4.5GB
       • Average document length 6.8K
       • Average depth 6.72
       • 111 topics from INEX 2006

  9. Coverage
     • For nInd=7 and minCompSize=10:
       • 87% coverage of the IEEE collection recall base
       • 75% coverage of the Wikipedia collection filtered recall base
     • The filtered recall base was generated by removing all link elements from the recall base
     • We still miss some small elements and some in-between elements that have depth > 7

  10. Doc pivot
      • Some low-level indices hold only part of the collection's content and therefore have incomplete statistics
      • Solution: compensate with the score of the containing document
        Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)
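
The doc pivot formula is a linear interpolation between the element score and the score of its containing document. A minimal sketch, assuming docPivot lies in [0, 1] (the slides do not state a range) and using made-up score values:

```python
def pivot_score(element_score, doc_score, doc_pivot):
    """Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)."""
    # The [0, 1] range check is an assumption, not something the slides specify.
    if not 0.0 <= doc_pivot <= 1.0:
        raise ValueError("doc_pivot is assumed to lie in [0, 1]")
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score


# Hypothetical values: a weakly scored element inside a strongly scored document.
adjusted = pivot_score(element_score=0.2, doc_score=0.8, doc_pivot=0.5)
```

With docPivot = 0 the element score is used unchanged; with docPivot = 1 every element inherits its document's score.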

  11. Elements distribution

  12. Tuning number of indices
      • Set minCompSize=10

  13. Tuning minimum component size
      • Set number of indices nInd=7

  14. Summary
      • Adaptive indexing schema
        • Split XML elements into separate indices
        • Same parameters for different collections
      • XML retrieval system
        • Achieved by running existing IR engines on each index
        • Can be used for CAS
      • Relatively low MAep results
      • Does XML structure reflect any semantic structure?

  15. Thank you!
