This research article delves into the development of an efficient and versatile query engine for TopX search, focusing on XML-IR techniques and data management strategies. The study explores the history and related work in the field of information retrieval on structured data and XML, highlighting the advancements made in query languages and commercial software offerings. Key elements covered include the data model, scoring techniques, database schema, indexing approaches, top-k query processing, scheduling, and probabilistic candidate pruning. The implementation outcomes and conclusions drawn from experiments are also discussed.
An Efficient and Versatile Query Engine for TopX Search
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany
VLDB ’05
An XML-IR Scenario…
[Figure: two example article documents drawn as element trees (article, title, sec, par, bib, item, inproc, url). Text contents include “Current Approaches to XML Data Management”, “The XML Files”, “Native XML databases”, “The Ontology Game”, “The Dirty Little Secret”, “Native XML database systems can store schemaless data ...” and “XML-QL: A Query Language for XML”.]
Query: //article[//sec[about(.//, “XML retrieval”)] //par[about(.//, “native XML database”)] ]//bib[about(.//item, “W3C”)]
TopX: Efficient XML-IR
Goal: Efficiently retrieve the best results of a similarity query
• Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ’01] to XML data
• Non-schematic, heterogeneous data sources
• Combined inverted index for content & structure
• Avoid full index scans, postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant indexing
XML-IR: History and Related Work
Timeline markers: 1995, 2000, 2005
• IR on structured data (SGML) and Web query languages: OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU), W3QS (Technion Haifa), Araneus (U Roma), Lorel (Stanford U), WebSQL (U Toronto)
• IR on XML and XML query languages: XIRQL (U Dortmund / Essen), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD), XML-QL (AT&T Labs), XPath 1.0 (W3C), NEXI (INEX Benchmark), XPath & XQuery Full-Text (W3C), XPath 2.0 (W3C), XQuery (W3C), TeXQuery (AT&T Labs)
• Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
Outline
• Data & Scoring model
• Database schema & indexing
• Top-k query processing for XML
• Scheduling & probabilistic candidate pruning
• Experiments & Conclusions
Data Model
Example document:
  <article>
    <title>XML-IR</title>
    <abs>IR techniques for XML</abs>
    <sec>
      <title>Clustering on XML</title>
      <par>Evaluation</par>
    </sec>
  </article>
[Figure: element tree with pre/postorder labels and the full contents of each element: article “xml ir ir technique xml clustering xml evaluation”, title “xml ir”, abs “ir technique xml”, sec “clustering xml evaluation”, sec/title “clustering xml”, par “evaluation”.]
ftf(“xml”, article1) = 3
• Simplified XML model, disregarding IDRef & XLink/XPointer
• Redundant full-contents
• Per-element term frequencies ftf(t_i, e) for full contents
• Pre/postorder labels for each tag-term pair
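The data model above can be illustrated with a small sketch that labels the example document with pre/postorder numbers and counts full-content term frequencies. The tokenizer, the two-word stopword list and the function names are assumptions made for illustration; no stemming is applied, so "techniques" is not reduced to "technique" as on the slide.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = ("<article><title>XML-IR</title><abs>IR techniques for XML</abs>"
       "<sec><title>Clustering on XML</title><par>Evaluation</par></sec></article>")

STOPWORDS = {"for", "on"}  # assumption: just enough to mirror the slide's example


def terms(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def label_elements(root):
    """Return (tag, pre, post, ftf) per element, ordered by preorder label.

    ftf counts terms over the element's full content, i.e. its own text
    plus the text of all of its descendants.
    """
    rows, counter = [], {"pre": 0, "post": 0}

    def visit(elem):
        counter["pre"] += 1
        pre = counter["pre"]
        for child in elem:
            visit(child)
        counter["post"] += 1
        ftf = Counter(terms(" ".join(elem.itertext())))
        rows.append((elem.tag, pre, counter["post"], ftf))

    visit(root)
    return sorted(rows, key=lambda r: r[1])


rows = label_elements(ET.fromstring(DOC))
for tag, pre, post, ftf in rows:
    print(tag, pre, post, dict(ftf))
```

Because full contents are stored redundantly, the term "xml" is counted once in each of the three elements that contribute it to the article, which gives ftf(“xml”, article1) = 3.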
Full-Content Scoring Model
• Full-content scores cast into an Okapi-BM25 probabilistic model with element-specific model parameterization (element statistics)
• Basic scoring idea within the IR-style family of TF*IDF ranking functions
• Additional static score mass c for relaxable structural conditions
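The scoring formula itself is not preserved on this slide. As a rough sketch, the standard Okapi BM25 shape can be applied per element instead of per document. The parameter names, the defaults k1 = 1.2 and b = 0.75, and the choice of per-tag statistics are assumptions, not the authors' exact parameterization.

```python
import math


def bm25_element_score(ftf, length, avg_length, n_elems, ef, k1=1.2, b=0.75):
    """BM25-style score of one term for one element.

    ftf        : full-content term frequency of the term in the element
    length     : full-content length of the element
    avg_length : average full-content length of elements with the same tag (assumed)
    n_elems    : number of elements with that tag (assumed)
    ef         : number of those elements that contain the term (assumed)
    """
    k = k1 * ((1 - b) + b * length / avg_length)      # length normalization
    tf_part = (k1 + 1) * ftf / (k + ftf)              # saturating TF component
    idf_part = math.log((n_elems - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


print(bm25_element_score(ftf=3, length=8, avg_length=10, n_elems=1000, ef=10))
```

The TF part saturates as ftf grows, so an element repeating a term many times gains little over one that mentions it a few times.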
Inverted Block-Index for Content & Structure
Example lists: sec[clustering], title[xml], par[evaluation]
• Inverted index over tag-term pairs (full-contents)
  • Benefits from increased selectivity of combined tag-term pairs
  • Accelerates child-or-descendant axis, e.g., sec//“clustering”
• Sequential block-scans
  • Re-order elements in descending order of (maxscore, docid, score) per list
  • Fetch all tag-term pairs per doc in one sequential block-access
  • docid limits range of in-memory structural joins
• Stored as inverted files or database tables (B+-tree indexes)
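The block ordering described above can be sketched as follows. The posting layout (tag, term, docid, pre, score) and the function name are hypothetical, and ties on maxscore are broken by ascending docid, which is an assumption.

```python
from collections import defaultdict


def build_block_index(postings):
    """Group postings by (tag, term) and order each list into document blocks.

    postings: iterable of (tag, term, docid, pre, score).
    Each list is sorted so that documents appear in descending order of their
    best element score (maxscore), and the elements of one document stay
    together in descending score order.
    """
    lists = defaultdict(list)
    for tag, term, docid, pre, score in postings:
        lists[(tag, term)].append((docid, pre, score))

    index = {}
    for key, entries in lists.items():
        maxscore = defaultdict(float)
        for docid, _, score in entries:
            maxscore[docid] = max(maxscore[docid], score)
        index[key] = sorted(entries, key=lambda e: (-maxscore[e[0]], e[0], -e[2]))
    return index


postings = [
    ("sec", "clustering", 2, 3, 0.3),
    ("sec", "clustering", 1, 4, 0.5),
    ("sec", "clustering", 2, 7, 0.9),
]
index = build_block_index(postings)
print(index[("sec", "clustering")])
```

Document 2 comes first because its best element scores 0.9, and both of its elements are read in the same sequential block before document 1 is reached.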
Navigational Index
Example: sec, title[xml], par[evaluation] (tags sec, title, par)
• Additional navigational index
  • Non-redundant element directory
  • Supports element paths and branching path queries
  • Random accesses using (docid, tag) as key
• Schema-oblivious indexing & querying
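A minimal in-memory sketch of such an element directory, keyed by (docid, tag) and using pre/postorder labels for the descendant test, might look like this. The class and method names are hypothetical, and a real system would keep the directory on disk.

```python
from collections import defaultdict


class NavigationalIndex:
    """Element directory: (docid, tag) -> list of (pre, post) labels."""

    def __init__(self):
        self._directory = defaultdict(list)

    def add(self, docid, tag, pre, post):
        self._directory[(docid, tag)].append((pre, post))

    def descendants(self, docid, tag, pre, post):
        """One random access: elements with `tag` below the element (pre, post).

        x is a descendant of e exactly if pre(x) > pre(e) and post(x) < post(e).
        """
        return [(p, q) for p, q in self._directory.get((docid, tag), [])
                if p > pre and q < post]


nav = NavigationalIndex()
# Labels of the example document: article(1,6), title(2,1), abs(3,2),
# sec(4,5), sec/title(5,3), par(6,4).
for tag, pre, post in [("article", 1, 6), ("title", 2, 1), ("abs", 3, 2),
                       ("sec", 4, 5), ("title", 5, 3), ("par", 6, 4)]:
    nav.add(1, tag, pre, post)

print(nav.descendants(1, "title", 4, 5))   # titles below the sec element
```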
TopX Query Processing
• Adapt Threshold Algorithm (TA) paradigm [Fagin et al., PODS ’01; Güntzer et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]
  • Focus on inexpensive sequential/sorted accesses
  • Postpone expensive random accesses
• Candidate d = connected sub-pattern with element ids and scores
  • Incrementally evaluate path constraints using pre/postorder labels
  • In-memory structural joins (nested loops, staircase, or holistic twig joins)
• Upper/lower score guarantees per candidate
  • Remember set of evaluated dimensions E(d)
  • $\mathrm{worstscore}(d) = \sum_{i \in E(d)} \mathrm{score}(t_i, e)$
  • $\mathrm{bestscore}(d) = \mathrm{worstscore}(d) + \sum_{i \notin E(d)} \mathrm{high}_i$
• Early threshold termination
  • Candidate queuing
  • Stop, if no queued candidate d has bestscore(d) > min-k
• Extensions
  • Batching of sorted accesses & efficient queue management
  • Cost model for random access scheduling
  • Probabilistic candidate pruning for approximate top-k results [Theobald, Schenkel & Weikum, VLDB ’04]
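The bookkeeping of worstscore, bestscore and the min-k threshold can be sketched with a simplified document-level loop that uses sorted accesses only. It leaves out element-level candidates, structural joins, random accesses and batching, so it is an illustration of the TA-style paradigm under those assumptions and not the TopX engine itself. The function name and input layout are hypothetical, and k is assumed to be at least 1.

```python
def top_k_sorted_access(lists, k):
    """lists: m lists of (doc, score), each sorted by descending score.

    Returns (top-k as [(doc, worstscore)], number of sorted accesses).
    """
    m = len(lists)
    pos = [0] * m                                        # scan position per list
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # score bound at each scan position
    seen = {}                                            # doc -> {dimension: score}, i.e. E(d)
    while True:
        progressed = False
        for i, lst in enumerate(lists):                  # one round of sorted accesses
            if pos[i] < len(lst):
                doc, score = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(doc, {})[i] = score
                high[i] = score
                progressed = True
            else:
                high[i] = 0.0                            # list exhausted
        worst = {d: sum(dims.values()) for d, dims in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        ranked = sorted(seen, key=lambda d: -worst[d])
        top, queue = ranked[:k], ranked[k:]
        min_k = worst[top[-1]] if len(top) == k else 0.0
        # Stop when neither a queued nor an unseen document can still beat min-k.
        done = (len(top) == k and sum(high) <= min_k
                and all(best[d] <= min_k for d in queue))
        if done or not progressed:
            return [(d, worst[d]) for d in top], sum(pos)


lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    [("d2", 0.9), ("d1", 0.7), ("d3", 0.2)],
]
results, accesses = top_k_sorted_access(lists, k=1)
print(results, accesses)
```

In this run the scan stops after two rounds, so the two entries for d3 are never read. The reported scores are lower bounds and can be below the final scores when the loop stops before a top-k document has been seen in every list.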
TopX Query Processing By Example
[Figure: step-by-step top-2 run over the lists sec[clustering], title[xml] and par[evaluation]. Candidates (doc1, doc2, doc3, doc5, doc17 and a pseudo-element) carry worst/best score bounds in a candidate queue, while the threshold min-2 rises from 0.0 to 0.5, 0.9 and 1.6.]
Incremental Path Validations
Query: //article[//sec//par//“xml java”] //bib//item//title//“security”
• Complex query DAGs
• Transitive closure of descendant constraints
• Aggregate additional static score mass c for a structural condition i, if all edges rooted at i are satisfiable
• Incrementally test structural constraints
• Quickly decrease best scores for early pruning
• Schedule random accesses in ascending order of structural selectivities
[Figure: query DAG with child-or-descendant edges over article, sec, bib, item, par=“xml”, par=“java” and title=“security”. Content conditions carry score intervals [0.0, high_i], structural conditions carry c = [1.0]. For a “promising candidate” d with worst(d) = 1.5, random accesses (RA) on bib and item both return 0.0, so best(d) drops from 6.5 to 5.5 and then to 4.5, below min-k = 4.8.]
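The effect of incremental structural tests on a candidate's score bounds can be sketched as below. The function name and the boolean probe results are hypothetical. The numbers follow the figure on this slide, where each failed structural condition removes its static score mass c = 1.0 from the best score.

```python
def validate_paths(worst, best, probes, c, min_k):
    """Apply structural probe results to a candidate's score bounds.

    probes: iterable of booleans, one per tested structural condition
            (True = satisfied). Passing a generator lets the remaining
            random accesses be skipped once the candidate is pruned.
    Returns (worst, best, still_alive).
    """
    for satisfied in probes:
        if satisfied:
            worst += c          # condition holds: its score mass is now guaranteed
        else:
            best -= c           # condition fails: its score mass is lost
        if best <= min_k:
            return worst, best, False   # can no longer beat the top-k threshold
    return worst, best, True


# Candidate from the figure: worst = 1.5, best = 6.5, min-k = 4.8,
# and the random accesses on bib and item both fail.
print(validate_paths(1.5, 6.5, iter([False, False]), c=1.0, min_k=4.8))
```

Testing the most selective conditions first tends to lower the best score soonest, which is the motivation for scheduling random accesses in ascending order of structural selectivity.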
Random Access Scheduling - Minimal Probes
• MinProbe-Scheduling
  • Structural conditions as “soft filters” (Expensive Predicates & Minimal Probes [Chang & Hwang, SIGMOD ’02])
  • Schedule random accesses only for the most promising candidates
  • Schedule batch of RAs on d, if $\mathrm{worstscore}(d) + o_d \cdot c > \text{min-}k$
    • worstscore(d): evaluated content & structure-related score
    • o_d · c: unevaluated structural score mass (constant!)
[Figure: query DAG over article, sec, bib, item, par=“xml”, par=“java” and title=“security”, with static score c = [1.0] per structural condition.]
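The MinProbe test is a single comparison. In the sketch below, the function name is hypothetical and `open_conditions` is read as the number of structural conditions not yet evaluated for d, which is an interpretation of o_d rather than something the slide states.

```python
def minprobe_schedule(worstscore, open_conditions, min_k, c=1.0):
    """Schedule a batch of random accesses on d only if d could enter the
    top-k once all of its still-open structural conditions are satisfied."""
    return worstscore + open_conditions * c > min_k


print(minprobe_schedule(worstscore=1.5, open_conditions=2, min_k=3.0))  # True
print(minprobe_schedule(worstscore=1.5, open_conditions=1, min_k=3.0))  # False
```

Because the open structural score mass is a constant per candidate, the test needs no probability estimates.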
Cost-based Scheduling
• BenProbe-Scheduling
  • Analytic cost model
• Basic idea
  • Compare expected random access costs to an optimal schedule
  • Access costs on d are wasted, if d does not make it into the final top-k (considering both content & structure)
• Compare different Expected Wasted Costs (EWC)
  • EWC-RA_s(d) of looking up d in the structure
  • EWC-RA_c(d) of looking up d in the content
  • EWC-SA(d) of not seeing d in the next batch of b sorted accesses
• Schedule batch of RAs on d, if EWC-RA_{s|c}(d) [RA] < EWC-SA [SA]
• EWC-SA = [formula not preserved]
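The scheduling comparison can be sketched under strong simplifying assumptions. The slide says that access costs on d are wasted if d misses the final top-k, so the wasted random access cost is taken here as the probability of missing the top-k times the number of lookups. The EWC-SA term is passed in as a number because its formula is not preserved. All names and the cost-ratio conversion between [RA] and [SA] units are hypothetical.

```python
def expected_wasted_ra_cost(p_topk, n_random_accesses):
    """Expected number of wasted random accesses on d (assumed form):
    the lookups are wasted whenever d misses the final top-k."""
    return (1.0 - p_topk) * n_random_accesses


def schedule_ra_batch(p_topk, n_random_accesses, ewc_sa, ra_sa_cost_ratio):
    """True if probing d now is expected to waste less than continuing to scan.

    ewc_sa           : expected wasted cost of the next batch of sorted
                       accesses, in sorted-access units (given, not derived here)
    ra_sa_cost_ratio : cost of one random access in sorted-access units (assumed)
    """
    ewc_ra = expected_wasted_ra_cost(p_topk, n_random_accesses)
    return ewc_ra * ra_sa_cost_ratio < ewc_sa


print(schedule_ra_batch(p_topk=0.9, n_random_accesses=2, ewc_sa=500.0, ra_sa_cost_ratio=1000.0))
print(schedule_ra_batch(p_topk=0.1, n_random_accesses=2, ewc_sa=500.0, ra_sa_cost_ratio=1000.0))
```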
Structural Selectivity Estimator
Query: //sec[//figure=“java”] [//par=“xml”] [//bib=“vldb”]
• Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag-term pairs
• Consider structural selectivities
  • P[d satisfies all structural conditions Y] = [formula not preserved]
  • P[d satisfies a subset Y’ of structural conditions Y] = [formula not preserved]
• Consider binary correlations between structural patterns and/or tag-term pairs (estimated from data sampling, query logs, etc.)
• Used for EWC-RA_s(d)
Pattern selectivities:
  p1 = 0.682  //sec[//figure]//par
  p2 = 0.001  //sec[//figure]//bib
  p3 = 0.002  //sec[//par]//bib
  p4 = 0.688  //sec//figure
  p5 = 0.968  //sec//par
  p6 = 0.002  //sec//bib
  p7 = 0.023  //bib=“vldb”
  p8 = 0.067  //par=“xml”
  p9 = 0.011  //figure=“java”
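The probability formulas are not preserved on this slide. As a baseline sketch only, the probability that a document satisfies a set of patterns can be estimated as a product of per-pattern selectivities, which assumes the patterns are independent. The slide also mentions binary correlations, which this sketch ignores. The pairing of the p values with the patterns is inferred from their order on the slide, and the function name is hypothetical.

```python
import math

# Per-pattern selectivities from the slide (pairing inferred from listing order).
SELECTIVITY = {
    "//sec//figure": 0.688,
    "//sec//par": 0.968,
    "//sec//bib": 0.002,
    '//bib="vldb"': 0.023,
    '//par="xml"': 0.067,
    '//figure="java"': 0.011,
}


def p_satisfies_all(patterns, selectivity=SELECTIVITY):
    """Estimate P[d satisfies all given patterns], assuming independence."""
    return math.prod(selectivity[p] for p in patterns)


print(p_satisfies_all(["//sec//figure", "//sec//par"]))
print(p_satisfies_all(["//sec//figure", "//sec//par", "//sec//bib"]))
```

The slide's twig statistics show why independence is only a baseline: //sec[//figure]//par is listed with 0.682, whereas the product of //sec//figure and //sec//par is about 0.666.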
Full-content Score Predictor
• For each inverted list L_i (i.e., all tag-term pairs)
  • Approximate local score distribution S_i by an equi-width histogram
• Periodically test all d in the candidate queue
  • Consider aggregated score predictor P[d gets in the final top-k] = [formula not preserved]
• Probabilistic candidate pruning: Drop d from the candidate queue, if P[d gets in the final top-k] < ε
• Used for EWC-RA_c(d)
[Figure: histograms of S1 over [0, high1] and S2 over [0, high2], e.g. for title[xml] and par[evaluation], and their convolution (S1, S2) with the threshold δ(d) marked.]
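The histogram convolution behind the score predictor can be sketched as follows. The sketch assumes that all histograms share one bucket width, that each bucket is represented by its upper bound (a conservative choice), and that delta is the score gap d still has to close, for example min-k minus worstscore(d). These choices and the function names are assumptions, not the authors' exact predictor.

```python
def convolve(h1, h2):
    """Distribution of S1 + S2 from two equi-width histograms.

    h1, h2: bucket probabilities; bucket i covers (i * width, (i + 1) * width].
    """
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out


def prob_exceeds(histograms, width, delta):
    """Estimate P[sum of the remaining scores > delta].

    Each bucket is represented by its upper bound, so the estimate errs on
    the side of keeping a candidate.
    """
    dist = [1.0]
    for h in histograms:
        dist = convolve(dist, h)
    n = len(histograms)
    # Bucket index s of the sum corresponds to the value (s + n) * width.
    return sum(p for s, p in enumerate(dist) if (s + n) * width > delta)


s1 = [0.5, 0.5]   # scores still possible in list 1, buckets of width 0.5
s2 = [0.5, 0.5]   # same for list 2
p = prob_exceeds([s1, s2], width=0.5, delta=1.2)
print(p)          # candidate d would be dropped if p < epsilon
```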
Data Collections & Competitors
• INEX ’04 benchmark setting
  • 12,223 docs; 12M elements; 119M index entries; 534 MB
  • 46 queries with official relevance judgments, e.g., //article[.//bib=“QBIC” and .//par=“image retrieval”]
• IMDB (Internet Movie Database)
  • 386,529 docs; 34M elements; 130M index entries; 1,117 MB
  • 20 queries, e.g., //movie[.//casting[.//actor=“John Wayne”] and .//role=“Sheriff”]//[.//year=“1959” and .//genre=“Western”]
• Competitors
  • DBMS-style Join&Sort
    • Using index full scans on the TopX schema
  • StructIndex [Kaushik et al., SIGMOD ’04]
    • Top-k with separate inverted indexes for content & structure
    • DataGuide-like structural index
    • Full evaluations: no uncertainty about final document scores
    • No candidate queuing, eager random accesses
  • StructIndex+
    • Extent chaining technique for DataGuide-based extent identifiers (skip scans)
INEX Results
Method            k      epsilon  # SA       # RA       CPU sec  P@k   MAP@k  relPrec
Join&Sort         10     n/a      9,122,318  0          0.261
StructIndex       10     n/a      761,970    3,25,068   0.37
StructIndex+      10     n/a      77,482     5,074,384  1.87     0.34  0.09   1.00
TopX – MinProbe   10     0.0      635,507    64,807     0.03
TopX – BenProbe   10     0.0      723,169    84,424     0.07
TopX – BenProbe   1,000  0.0      882,929    1,902,427  0.35     0.03  0.17   1.00
IMDB Results
Method            k   epsilon  # SA       # RA     CPU sec  P@k, MAP@k  relPrec
Join&Sort         10  n/a      14,510077  0        37.7
StructIndex       10  n/a      346,697    291,655  0.16
StructIndex+      10  n/a      22,445     301,647  0.17     n/a         1.00
TopX – MinProbe   10  0.0      317,380    72,196   0.08
TopX – BenProbe   10  0.0      241,471    50,016   0.06
INEX with Probabilistic Pruning
Method            k   epsilon  # SA     # RA    CPU sec  P@k   MAP@k  relPrec
TopX - MinProbe   10  0.00     635,507  64,807  0.03     0.34  0.09   1.00
TopX - MinProbe   10  0.25     392,395  56,952  0.05     0.34  0.08   0.77
TopX - MinProbe   10  0.50     231,109  48,963  0.02     0.31  0.08   0.65
TopX - MinProbe   10  0.75     102,118  42,174  0.01     0.33  0.08   0.51
TopX - MinProbe   10  1.00     36,936   35,327  0.01     0.30  0.07   0.38
Conclusions & Ongoing Work
• Efficient and versatile TopX query processor
  • Extensible framework for text, semi-structured & structured data
  • Probabilistic cost model for random access scheduling
  • Very good precision/runtime ratio for probabilistic candidate pruning
• Scalability
  • Optimized for runtime, exploits cheap disk space (factor 4-5 for INEX)
  • Experiments on TREC Terabyte text collection (see paper)
• Support for typical IR extensions
  • Phrase matching, mandatory terms “+”, negation “-”
  • Query weights (e.g., relevance feedback, ontological similarities)
• Dynamic and self-tuning query expansions [SIGIR ’05]
  • Incrementally merges inverted lists on demand
  • Dynamically opens scans on additional expansion terms
• Vague Content & Structure (VCAS) queries
Demo available! Thank you!