An Adaptive XML Retrieval System
Yosi Mass, Michal Shmueli-Scheuer
IBM Haifa Research Lab
XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008
The XML retrieval tasks
• Query formulation
  • CO: content only
  • CAS: content and structure (NEXI)
• Retrieval tasks
  • Thorough: “find all highly exhaustive and specific elements”
    • Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query
  • Focussed: “find the most exhaustive and specific element in a path”
    • No overlap in returned results
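The "no overlap" requirement of the Focussed task can be illustrated with a small sketch. The slides do not say how overlap is removed, so the greedy rule below (keep the best-scoring element, drop its ancestors and descendants) is an assumption, as are the names `is_nested` and `remove_overlap` and the `(doc_id, path, score)` hit layout.

```python
from typing import List, Tuple

Hit = Tuple[str, str, float]  # (document id, element XPath, score); hypothetical layout


def is_nested(a: str, b: str) -> bool:
    """True if the two XPaths are equal or one is an ancestor of the other."""
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def remove_overlap(hits: List[Hit]) -> List[Hit]:
    """Greedy filter (an assumption, not the authors' method): visit hits by
    descending score and keep a hit only if it does not nest with a kept hit
    from the same document."""
    kept: List[Hit] = []
    for doc, path, score in sorted(hits, key=lambda h: h[2], reverse=True):
        if not any(d == doc and is_nested(path, p) for d, p, _ in kept):
            kept.append((doc, path, score))
    return kept
```

A Thorough run would return the unfiltered list; a Focussed run would pass the same list through a filter of this kind.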
Approaches for XML retrieval
• Index full documents
  • Score documents and then components inside the documents
  • Problem: works well for “fetch and browse” but not for the general thorough task
• Index only leaf elements
  • Score leaves and propagate scores along the XML tree
  • Problem: the weights used to propagate are set either manually by the user or empirically
• Index all elements into the same index
  • Score all possible elements
  • Problem: distorted element-level statistics due to overlapping elements
• Can we fix the distorted statistics?
An adaptive XML retrieval system
• Split all collection elements into separate indices such that
  • Coverage: each element is indexed in at least one index
  • No overlap: elements in each index do not nest
• Run the query on each index
• Merge the results into a single result list
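The "run on each index, then merge" step might look roughly like the sketch below. Each index is modelled as a plain function from a query string to scored hits, which is a simplification, and merging by raw score is an assumption: the slides do not state how scores from different indices are made comparable. The name `run_and_merge` is hypothetical.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]                  # (element XPath, score)
SearchFn = Callable[[str], List[Hit]]    # stand-in for querying one index


def run_and_merge(query: str, indices: List[SearchFn], top_k: int = 10) -> List[Hit]:
    """Run the query on every index and merge all hits into one ranked list.

    Hits are ordered by raw score (an assumption); a real system may need to
    normalise scores per index before merging.
    """
    merged: List[Hit] = []
    for search in indices:
        merged.extend(search(query))
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:top_k]
```

Because elements within one index do not nest, each per-index run sees non-overlapping units; overlap can only appear after the merge.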
Split to indices - example
• First document
  • Index 0: /article[1]
  • Index 1: /article[1]/bdy[1]
  • Index 2: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2]
  • Index 3: /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3]
• Second document
  • Index 0: /article[1]
  • Index 1: /article[1]/bdy[1]
  • Index 2: /article[1]/bdy[1]/sec[1]
  • Index 3: /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2]
An adaptive indexing schema
SplitToIndices(doc, minCompSize, nInd)
• Find all leaves in doc that are larger than minCompSize
• If no such leaves are found, return G0 = {root}
• Let d be the longest path among all those leaves
• Create groups {G0,…,Gd-1}, where each Gi contains all elements given by the XPath prefixes of length i of all matched leaves
• Remove repeating elements in each group
• Split the groups {G0,…,Gd-1} into indices {I0,…,InInd-1} (several strategies)
• Return {I0,…,InInd-1}
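The grouping part of SplitToIndices can be sketched as follows. This covers only the steps up to the de-duplicated groups; the final mapping of groups to nInd indices is left out because the slide mentions several strategies without fixing one. The input layout (a dict from leaf XPath to size), the explicit `root` argument, the 0-based convention that group i holds prefixes with i+1 steps, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def xpath_prefixes(path: str) -> List[str]:
    """All prefixes of an XPath, from the root step down to the full path."""
    steps = path.strip("/").split("/")
    return ["/" + "/".join(steps[:i]) for i in range(1, len(steps) + 1)]


def split_to_groups(leaf_sizes: Dict[str, int], root: str,
                    min_comp_size: int) -> List[Set[str]]:
    """Group elements by depth, using only leaves larger than min_comp_size.

    leaf_sizes maps a leaf XPath to its size, in whatever unit min_comp_size
    uses (the slides do not name the unit).
    """
    leaves = [p for p, size in leaf_sizes.items() if size > min_comp_size]
    if not leaves:
        return [{root}]                       # G0 = {root}
    d = max(len(xpath_prefixes(p)) for p in leaves)
    groups: List[Set[str]] = [set() for _ in range(d)]
    for leaf in leaves:
        for i, prefix in enumerate(xpath_prefixes(leaf)):
            groups[i].add(prefix)             # sets drop repeated elements
    return groups
```

Elements in one group all sit at the same depth of the same kind of path, so they cannot nest, which is what the no-overlap condition asks for.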
Examples: cut long paths
• Minimal element: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
• Split to indices
  • index 0: /article[1]
  • index 1: /article[1]/body[1]
  • index 2: /article[1]/body[1]/section[7]
  • index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
  • index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
  • index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
  • index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
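One strategy consistent with this example is to keep the top levels and the bottom levels of a long path and skip the middle. The sketch below reproduces the seven indices listed above for nInd=7. The split point `n_top=3` is read off the example and is not stated in the slides, and the function name `cut_long_path` is hypothetical.

```python
from typing import List


def cut_long_path(path: str, n_ind: int, n_top: int = 3) -> List[str]:
    """Pick at most n_ind prefixes of a path: the first n_top levels and the
    deepest (n_ind - n_top) levels. n_top=3 is inferred from the example."""
    steps = path.strip("/").split("/")
    prefixes = ["/" + "/".join(steps[:i]) for i in range(1, len(steps) + 1)]
    if len(prefixes) <= n_ind:
        return prefixes                      # short path: nothing to cut
    n_top = min(n_top, n_ind)
    n_bottom = n_ind - n_top
    # Slice from an explicit start so that n_bottom == 0 yields an empty tail.
    return prefixes[:n_top] + prefixes[len(prefixes) - n_bottom:]
```

With the ten-level path above, levels 4 to 6 are the ones dropped, which fits the later remark that some in-between elements are missed.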
Experiments
• IEEE collection 1995-2004
  • 17,000 articles, 700MB
  • Average document length ~41K
  • Average depth 6.9
  • 29 topics from INEX 2005
• Wikipedia collection
  • 660,000 pages, 4.5GB
  • Average document length 6.8K
  • Average depth 6.72
  • 111 topics from INEX 2006
Coverage
• For nInd=7 and minCompSize=10:
  • 87% coverage for the IEEE collection recall base
  • 75% coverage for the Wikipedia collection filtered recall base
• The filtered recall base was generated by removing all link elements from the recall base
• We still miss some small elements and some in-between elements that have depth > 7
Doc pivot
• Some low-level indices hold only part of the collection's content, so their statistics are incomplete
• Solution: compensate with the score of the containing document
  Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)
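The doc pivot formula is a linear interpolation between the element score and the score of its containing document. A minimal sketch, assuming docPivot lies in [0, 1] (the slides do not state the range) and using the hypothetical name `pivot_score`:

```python
def pivot_score(element_score: float, doc_score: float, doc_pivot: float) -> float:
    """Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)."""
    if not 0.0 <= doc_pivot <= 1.0:
        # Range check is an assumption; the slides give no bounds for docPivot.
        raise ValueError("doc_pivot is expected to be between 0 and 1")
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score
```

With docPivot = 0 the element score is used unchanged; with docPivot = 1 every element inherits its document's score.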
Elements distribution (figure)
Tuning number of indices (figure; minCompSize=10)
Tuning min component size (figure; nInd=7)
Summary
• Adaptive indexing schema
  • Split XML elements into separate indices
  • Same parameters for different collections
• XML retrieval system
  • Achieved by running existing IR engines on each index
  • Can be used for CAS
• Relatively low MAep results
• Does XML structure reflect any semantic structure?
Thank you!