TopX @ INEX ’05
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken
[Figure: two example article trees, titled “Current Approaches to XML Data Management.” and “The XML Files”. Each contains sec, par and bib elements with nested title, item, url and inproc elements and text contents such as “Native XML databases.”, “The Ontology Game” and “XML-QL: A Query Language for XML.”]

Example query:
//article[//sec[about(.//, “XML retrieval”)]
          //par[about(.//, “native XML database”)]
         ]//bib[about(.//item, “W3C”)]

An Efficient and Versatile Query Engine for TopX Search
TopX: Efficient XML-IR [VLDB ’05]
Goal: Efficiently retrieve the best results of a similarity query
• Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ’01] to XML data
• Non-schematic, heterogeneous data sources
• Combined inverted index for content & structure
• Avoid full index scans, postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant indexing
Data Model
<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>
[Figure: element tree of the example article. Each node carries its pre/postorder labels and its full-content terms: “xml ir ir technique xml clustering xml evaluation” for article, “xml ir” for title, “ir technique xml” for abs, “clustering xml evaluation” for sec, “clustering xml” for the section title and “evaluation” for par.]
ftf(“xml”, article1) = 3
• Simplified XML model
  • Disregarding IDRef & XLink/XPointer
• Redundant full-contents
  • Per-element term frequencies ftf(ti, e) for full-contents
• Pre/postorder labels for each tag-term pair
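The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `tokenize`, `label` and `is_ancestor` are hypothetical, the tokenizer merely lowercases and splits on non-alphanumeric characters (no stemming or stopword removal, unlike the figure), and the label numbering is one plain depth-first scheme that need not match the numbers in the figure.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>"""


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumerics only.
    return re.findall(r"[a-z0-9]+", text.lower())


def label(root):
    """Return {element: {tag, pre, post, ftf}} for every element below root.

    ftf counts terms over the element's full content, i.e. its own text
    plus the text of all of its descendants (stored redundantly).
    """
    info = {}
    rank = {"pre": 0, "post": 0}

    def visit(elem):
        rank["pre"] += 1
        pre = rank["pre"]
        terms = Counter(tokenize(elem.text or ""))
        for child in elem:
            terms += visit(child)                      # descendant content
            terms.update(tokenize(child.tail or ""))   # text after the child
        rank["post"] += 1
        info[elem] = {"tag": elem.tag, "pre": pre,
                      "post": rank["post"], "ftf": terms}
        return terms

    visit(root)
    return info


def is_ancestor(a, d):
    # a is an ancestor of d iff it starts earlier and finishes later.
    return a["pre"] < d["pre"] and a["post"] > d["post"]


root = ET.fromstring(DOC)
info = label(root)
print(info[root]["ftf"]["xml"])  # full-content frequency of "xml" in article
```

With these labels, an ancestor-descendant test between two elements of the same document needs only two integer comparisons, which is what makes in-memory structural joins cheap.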
Full-Content Scoring Model
• Full-content scores cast into an Okapi-BM25 probabilistic model with element-specific parameterization, based on per-element statistics
• Basic scoring idea within the IR-style family of TF*IDF ranking functions
• Additional static score mass c for relaxable structural conditions
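The slide's formula did not survive extraction, so the following is only a sketch of what a BM25-shaped element score could look like, not the authors' exact definition. It assumes the textbook Okapi BM25 form with the common defaults k1 = 1.2 and b = 0.75, and it assumes that "element-specific" means all statistics (element count, element frequency, average length) are collected per tag. The function names and the way the static mass c is added are hypothetical.

```python
import math


def bm25_element_score(ftf, length, avg_length, n_elements, element_freq,
                       k1=1.2, b=0.75):
    """BM25-shaped score of one term in one element's full content.

    ftf          -- full-content term frequency of the term in the element
    length       -- full-content length of the element (in terms)
    avg_length   -- average full-content length over elements with this tag
    n_elements   -- number of elements with this tag (assumed per-tag N)
    element_freq -- number of those elements that contain the term
    """
    k = k1 * ((1.0 - b) + b * length / avg_length)    # length normalization
    tf_part = (k1 + 1.0) * ftf / (k + ftf)            # saturating TF component
    idf_part = math.log((n_elements - element_freq + 0.5)
                        / (element_freq + 0.5))       # IDF-like component
    return tf_part * idf_part


def candidate_score(content_scores, satisfied_structural_conditions, c=1.0):
    # Assumption: each satisfied relaxable structural condition adds a
    # constant score mass c on top of the summed content scores.
    return sum(content_scores) + c * satisfied_structural_conditions
```

Note that the IDF-like part turns negative when a term occurs in more than half of the elements of a tag; a production scorer would need to decide how to handle that case.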
Inverted Block-Index for Content & Structure
• Inverted index over tag-term pairs (full-contents), e.g., sec[clustering], title[xml], par[evaluation]
  • Benefits from increased selectivity of combined tag-term pairs
  • Accelerates child-or-descendant axis, e.g., sec//“clustering”
• Sequential block-scans
  • Re-order elements in descending order of (maxscore, docid, score) per list
  • Fetch all tag-term pairs per doc in one sequential block-access
  • docid limits the range of in-memory structural joins
• Stored as inverted files or database tables (B+-tree indexes)
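The block ordering described above can be illustrated with a small in-memory sketch. It assumes postings arrive as (tag, term, docid, pre, score) tuples with non-negative scores; `build_block_index` and `scan_blocks` are hypothetical names, and a real index would live in inverted files or B+-tree-indexed tables rather than Python lists.

```python
from collections import defaultdict
from itertools import groupby


def build_block_index(postings):
    """Group postings by (tag, term) and order each list by
    (maxscore, docid, score) in descending order.

    maxscore is the best score of any element of the same document in the
    same list, so all elements of a document end up in one contiguous block.
    """
    lists = defaultdict(list)
    for tag, term, docid, pre, score in postings:
        lists[(tag, term)].append((docid, pre, score))

    index = {}
    for key, entries in lists.items():
        maxscore = defaultdict(float)
        for docid, _pre, score in entries:
            maxscore[docid] = max(maxscore[docid], score)
        index[key] = sorted(
            entries,
            key=lambda e: (maxscore[e[0]], e[0], e[2]),
            reverse=True,
        )
    return index


def scan_blocks(inverted_list):
    """Yield (docid, entries) one document block at a time, in list order."""
    for docid, block in groupby(inverted_list, key=lambda e: e[0]):
        yield docid, list(block)


# Hypothetical postings, for illustration only.
postings = [
    ("sec", "clustering", 1, 4, 0.7),
    ("sec", "clustering", 2, 3, 0.5),
    ("sec", "clustering", 2, 9, 0.9),
    ("title", "xml", 2, 5, 0.4),
]
index = build_block_index(postings)
for docid, block in scan_blocks(index[("sec", "clustering")]):
    print(docid, block)
```

Because a whole document block is read in one sequential access, the structural join for that document only has to consider the handful of elements in the block.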
Navigational Index
• Additional navigational index over plain tags (e.g., sec, title, par), complementing the tag-term lists such as title[xml] and par[evaluation]
  • Non-redundant element directory
  • Supports element paths and branching path queries
  • Random accesses using (docid, tag) as key
  • Schema-oblivious indexing & querying
TopX Query Processing [Fagin et al., PODS ’01; Güntzer et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]
• Adapt Threshold Algorithm (TA) paradigm
  • Focus on inexpensive sequential/sorted accesses
  • Postpone expensive random accesses
• Candidate d = connected sub-pattern with element ids and scores
  • Incrementally evaluate path constraints using pre/postorder labels
  • In-memory structural joins (nested loops, staircase, or holistic twig joins)
• Upper/lower score guarantees per candidate
  • Remember set of evaluated dimensions E(d)
    $worstscore(d) = \sum_{i \in E(d)} score(t_i, e)$
    $bestscore(d) = worstscore(d) + \sum_{i \notin E(d)} high_i$
• Early threshold termination
  • Candidate queuing
  • Stop, if
• Extensions
  • Batching of sorted accesses & efficient queue management
  • Cost model for random access scheduling
  • Probabilistic candidate pruning for approximate top-k results [VLDB ’04]
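The score bookkeeping above can be made concrete with a small sketch. It is a simplification, not the TopX engine: candidates are whole documents rather than connected sub-patterns, there are no structural joins or random accesses, and lists are scanned round-robin. The stopping rule is the generic threshold test from the TA family (stop once no queued or unseen candidate can beat the current k-th worstscore), used here as an assumption because the slide's own condition was lost. The name `top_k_threshold` and the sample lists are hypothetical.

```python
def top_k_threshold(lists, k):
    """Top-k by summed score using sorted accesses only.

    lists -- one list per query dimension, each holding (docid, score)
             pairs sorted by descending score
    Returns [(docid, worstscore)] for the top-k documents.
    """
    n = len(lists)
    seen = {}                                  # docid -> {dimension: score}
    high = [l[0][1] if l else 0.0 for l in lists]
    pos = [0] * n
    top, worst = [], {}

    while any(pos[i] < len(lists[i]) for i in range(n)):
        for i in range(n):                     # one sorted access per list
            if pos[i] < len(lists[i]):
                docid, score = lists[i][pos[i]]
                pos[i] += 1
                seen.setdefault(docid, {})[i] = score
                high[i] = score                # bound for unseen entries
            else:
                high[i] = 0.0                  # list exhausted

        # worstscore: sum over evaluated dimensions E(d)
        worst = {d: sum(s.values()) for d, s in seen.items()}
        # bestscore: worstscore plus high_i for every dimension not in E(d)
        best = {d: worst[d] + sum(high[i] for i in range(n) if i not in s)
                for d, s in seen.items()}

        ranked = sorted(worst, key=worst.get, reverse=True)
        top, queue = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            unseen_best = sum(high)            # best case for unseen docs
            if unseen_best <= min_k and all(best[d] <= min_k for d in queue):
                break                          # nothing can enter the top-k

    return [(d, worst[d]) for d in top]


# Hypothetical index lists, for illustration only.
lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.2)],
    [("b", 0.9), ("a", 0.7), ("c", 0.1)],
    [("a", 0.8), ("c", 0.6), ("b", 0.1)],
]
print([d for d, _ in top_k_threshold(lists, 2)])
```

The returned scores are lower bounds: when the scan stops early, a top-k document may not have been seen in every list, so its worstscore can be below its full score even though its membership in the top-k is settled.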
TopX Query Processing By Example
[Figure: step-by-step run of a top-2 query over the index lists sec[clustering], title[xml] and par[evaluation]. Candidates are shown with their worst and best score bounds as the scans proceed, and the min-2 threshold rises from 0.0 through 0.5 and 0.9 to 1.6. The figure labels the top-2 results, the candidate queue and a pseudo-candidate; documents doc1, doc2, doc3, doc5 and doc17 appear.]
CO.Thorough
• Element-granularity
• Turn query into pseudo CAS query using “//*”
• No post-filtering on specific element types
• nxCG@10 = 0.0379 (rank 22 of 55)
• MAP = 0.008 (rank 37 of 55)
• Old INEX_eval: MAP = 0.058 (rank 3)
COS.Fetch&Browse
• Document-granularity
• Rank documents according to their best target element
• Strict evaluation of support & target elements
• Return all target elements per doc using the document score (no overlap)
• MAP = 0.0601 (rank 4 of 19)
SSCAS
• Element-granularity with strict support & target elements (no overlap)
• nxCG@10 = 0.45 (ranks 1 & 2 of 25)
• MAP = 0.0322 & 0.0272 (ranks 1 & 6)
Top-k Efficiency

Method            k      epsilon  # SA       # RA       CPU sec
Join&Sort         10     n/a      9,122,318  0          0.261
StructIndex       10     n/a      761,970    3,25,068   0.37
StructIndex+      10     n/a      77,482     5,074,384  1.87
TopX – MinProbe   10     0.0      635,507    64,807     0.03
TopX – BenProbe   10     0.0      723,169    84,424     0.07
TopX – BenProbe   1,000  0.0      882,929    1,902,427  0.35

Result quality for k = 10: P@k = 0.34, MAP@k = 0.09, relPrec = 1.00
Result quality for k = 1,000: P@k = 0.03, MAP@k = 0.17, relPrec = 1.00
Probabilistic Pruning

TopX – MinProbe
k   epsilon  # SA     # RA    CPU sec  P@k   MAP@k  relPrec
10  0.00     635,507  64,807  0.03     0.34  0.09   1.00
10  0.25     392,395  56,952  0.05     0.34  0.08   0.77
10  0.50     231,109  48,963  0.02     0.31  0.08   0.65
10  0.75     102,118  42,174  0.01     0.33  0.08   0.51
10  1.00     36,936   35,327  0.01     0.30  0.07   0.38
Conclusions & Ongoing Work
• Efficient and versatile TopX query processor
  • Extensible framework for text, semi-structured & structured data
• Probabilistic extensions
  • Probabilistic cost model for random access scheduling
  • Very good precision/runtime ratio for probabilistic candidate pruning
• Full NEXI support
  • Phrase matching, mandatory terms “+”, negation “-”, attributes “@”
  • Query weights (e.g., relevance feedback, ontological similarities)
• Scalability
  • Optimized for runtime, exploits cheap disk space (redundancy factor 4-5 for INEX)
  • Participated in the TREC Terabyte Efficiency Task
• Dynamic and self-tuning query expansions [SIGIR ’05]
  • Incrementally merges inverted lists for a set of active expansions
• Vague Content & Structure (VCAS) queries (maybe next year..)