Warning: Non-technical Content! To Be Taken with Grain of SALT.
Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?
Surajit Chaudhuri (Microsoft Research), Raghu Ramakrishnan (U Wisconsin, ex QUIQ), Gerhard Weikum (Max-Planck Institute of CS)
DB and IR: Two Parallel Universes (parallel universes forever?)
 | Database Systems | Information Retrieval
canonical application: | accounting | libraries
data type: | numbers, short strings | text
foundation: | algebraic / logic based | probabilistic / statistics based
search paradigm: | Boolean retrieval (exact queries, result sets/bags) | ranked retrieval (vague queries, result lists)
Take-home Message or Food for Disagreement
Claim 1: DB&IR applications require and justify a new platform / kernel system with an appropriately designed API for a Scoring Algebra for Lists and Text (SALT)
Claim 2: One key challenge lies in reconciling flexible scoring with query optimizability
Outline
• Top-down Motivation: DB&IR Applications
• Bottom-up Motivation: Algorithms & Tricks
• Towards SALT: Scoring Algebra(s) for Lists and Text
• Key Problem: Query Optimization
Top-down Motivation: Applications (1) - Customer Support -
Typical data:
Customers (CId, Name, Address, Area, Category, Priority, ...)
Requests (RId, CId, Date, Product, ProblemType, Body, RPriority, WFId, ...)
Answers (AId, RId, Date, Class, Body, WFId, WFStatus, ...)
Typical queries:
premium customer from Germany: "A notebook, model ... configured with ..., has a problem with the driver of its Wave-LAN card. I already tried the fix ..., but received error message ..."
• request classification & routing
• find similar requests
Platform desiderata (from app developer's viewpoint):
• Flexible ranking and scoring on text, categorical, numerical attributes
• Incorporation of dimension hierarchies for products, locations, etc.
• Efficient execution of complex queries over text and data attributes
• Support for high update rates concurrently with high query load
Why customizable scoring?
• wealth of different apps within this app class
• different customer classes
• adjustment to evolving business needs
• scoring on text + structured data (weighted sums, language models, skyline, w/ correlations, etc.)
Top-down Motivation: Applications (2)
More application classes:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer relationship management in banks, insurance, telecom, etc.
• Bulletin boards for social communities
• P2P personalized & collaborative Web search
• etc.
Top-down Motivation: Applications (3)
Next wave Text2Data: use Information-Extraction technology (regular expressions, HMMs, lexicons, other NLP and ML techniques) to convert text docs into relational facts, moving up in the value chain
Example: "The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, and is organized by D.J. DeWitt, Mike Stonebreaker, ..."
Conference
Name | Year | Location | Date | Prob
CIDR | 2005 | Asilomar | 05/01/04 | 0.95
ConfOrganization
Name | Year | Chair | Prob
CIDR | 2005 | P68 | 0.9
CIDR | 2005 | P35 | 0.75
People
Id | Name
P35 | Michael Stonebraker
P68 | David J. DeWitt
• facts now have confidence scores
• queries involve probabilistic inferences and result ranking
• relevant for "business intelligence"
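As a rough illustration of the regular-expression end of Text2Data, the sketch below pulls one Conference fact out of the example sentence. The pattern, the function name extract_conference, and the rule that a two-digit year means 20xx are assumptions made for this sketch; it does not compute confidence scores, and real extractors would be far more robust.

```python
import re

SENTENCE = ("The CIDR'05 conference takes place in Asilomar from Jan 4 to Jan 7, "
            "and is organized by D.J. DeWitt, Mike Stonebreaker, ...")

# Hypothetical pattern tailored to this one sentence shape.
PATTERN = re.compile(
    r"The (?P<name>[A-Z]+)'(?P<yy>\d{2}) conference takes place in "
    r"(?P<location>[A-Z][a-z]+) from (?P<start>[A-Z][a-z]{2} \d{1,2})"
)

def extract_conference(text):
    """Return a dict shaped like a Conference tuple, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "Name": m.group("name"),
        "Year": 2000 + int(m.group("yy")),  # assumption: two-digit years are 20xx
        "Location": m.group("location"),
        "Start": m.group("start"),
    }

fact = extract_conference(SENTENCE)
```

The extracted dict would still need entity resolution (for example, mapping organizer names to People ids) before it resembles the relational facts shown on the slide.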
Top-down Motivation: Applications (4)
Essential requirements for DB&IR platform:
1) Customizable scoring and ranking
2) Composite queries incl. joins, filters & top-k
3) Optimizability of query expressions
4) Metadata and ontologies
5) Simple, sufficiently expressive data model (XML light)
6) Data preparation (entity recognition, entity resolution, etc.)
7) Personalization (profile learning)
8) Usage patterns (query logs, click streams, etc.)
1, 2, 3 most strongly affect platform architecture and API
Bottom-up Motivation: Algorithms & Tricks
[Figure: B+ tree on terms, categories, values, ... pointing to index lists for t1, t2, t3; each list holds (ID, s = tf*idf) entries sorted by ID]
Google: > 10 million terms, > 8 billion docs, > 4 TB index
Vanilla algorithm "join&sort" for query q: t1 t2 t3:
top-k ( σ[term=t1](index) ⋈ID σ[term=t2](index) ⋈ID σ[term=t3](index) order by sum(s) desc )
Good search engines use a variety of heuristics and tricks for shortcutting:
• keeping short lists of best docs per term in memory
• global statistics for index list selection
• early pruning of result candidates
• bounded priority queue of candidates
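A minimal sketch of the vanilla "join&sort" plan may help make the baseline concrete. It assumes each index list is held as an in-memory dict from doc ID to score, joins the lists on ID by intersecting their keys, and ranks by summed score. The function name join_and_sort and the sample scores are hypothetical; the figure's actual entries are not recoverable from the slide.

```python
import heapq

def join_and_sort(index_lists, k):
    """Join all index lists on doc ID, sum the scores, return the k best docs."""
    common = set(index_lists[0])
    for lst in index_lists[1:]:
        common &= set(lst)          # join on ID: keep docs present in every list
    scored = ((sum(lst[d] for lst in index_lists), d) for d in common)
    # order by sum(s) desc, keep only k results
    return [(d, s) for s, d in heapq.nlargest(k, scored)]

# Hypothetical index lists for terms t1, t2, t3 (doc ID -> tf*idf score).
T1 = {11: 0.6, 17: 0.3, 44: 0.4, 52: 0.1}
T2 = {17: 0.1, 28: 0.1, 44: 0.2, 52: 0.7}
T3 = {12: 0.5, 17: 0.1, 44: 0.2, 51: 0.6, 52: 0.3}

top2 = join_and_sort([T1, T2, T3], k=2)
```

The point of the baseline is that it touches every entry of every list before any result is produced, which is what the shortcutting tricks try to avoid.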
Bottom-up Motivation: Algorithms & Tricks
TA: efficient & principled top-k query processing with monotonic score aggr.
TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01):
scan index lists; consider d at pos_i in L_i;
E(d) := E(d) ∪ {i}; high_i := s(t_i, d);
worstscore(d) := aggr{s(t_ν, d) | ν ∈ E(d)};
bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
if worstscore(d) > min-k then
  add d to top-k
  min-k := min{worstscore(d') | d' ∈ top-k};
else if bestscore(d) > min-k then
  cand := cand ∪ {d};
threshold := max{bestscore(d') | d' ∈ cand};
if threshold ≤ min-k then exit;
Example:
Data items: d1, …, dn, with s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3), k = 1
Index lists:
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
Scan depth 1, scan depth 2, scan depth 3: STOP!
• TA flavor w/ early termination is great
• Implementation details are crucial
• DB&IR needs to combine it with filter, join, phrase matching, etc.
• Unclear how to abstract TA and integrate into relational algebra
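The NRA pseudocode can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: it assumes sum as the monotonic aggregation, scans all lists in lockstep, recomputes bounds after each round rather than maintaining a bounded queue, and also bounds documents not yet seen in any list by the sum of the high_i values. The function name nra_top_k is hypothetical, and only the first three entries of each slide list are used.

```python
def nra_top_k(index_lists, k):
    """Top-k by summed score using sorted access only.

    index_lists: one list per query term of (doc_id, score) pairs,
    sorted by descending score. Returns (top_k, scan_depth).
    """
    m = len(index_lists)
    seen = {}                 # doc_id -> {list index: score}; the keys play the role of E(d)
    high = [0.0] * m          # high_i: last score read from list i
    max_depth = max(len(lst) for lst in index_lists)
    top = []
    for depth in range(max_depth):
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted: nothing more to gain from it
        worst = {d: sum(e.values()) for d, e in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in e)
                for d, e in seen.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        top = [(d, worst[d]) for d in ranked[:k]]
        if len(top) == k:
            min_k = top[-1][1]
            # Best possible score of anything outside the current top-k,
            # including documents not seen yet (bounded by the sum of high_i).
            threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
            if threshold <= min_k:
                return top, depth + 1
    return top, max_depth

# First three entries of each index list from the slide example.
LISTS = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8)],   # t1
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6)],   # t2
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4)],   # t3
]

top, depth = nra_top_k(LISTS, k=1)
```

On these prefixes the sketch settles on d10 once every other candidate's bestscore has dropped to or below d10's worstscore, which is the early-termination behavior the slide highlights.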
SALT Algebra: Three Proposals
SALT = Scoring Algebra for Lists and Text
Goals:
• reconcile relational algebra with TA-flavor operators
• reconcile flexible scoring with query optimizability
Three proposals:
• Speculative filters and stretchable operators
• Operators with scoring modalities
• Scoring operator
Related prior work:
• probabilistic relations
• approximate query processing
• query algebras on lists
• SQL user-defined aggregation
Speculative Filters and Stretchable Operators (SALT with SQL Flavor)
Rationale: map ranked-retrieval queries to multidimensional SQL filters such that they return approx. k results
Ex.: recent WLAN device driver problems on notebook T40 (with Debian)
σ[date > 11/30/04 ∧ class="/network/drivers" ∧ product="Thinkpad" ∧ software="Linux"] (Requests)
Techniques:
• ranking many answers: speculative filters generate additional conjunctive conditions to approximate top-k
• finding enough answers: stretchable operators relax (range or categorical) conditions to ensure at-least-k
• similar to IR query expansion, by (pseudo-)feedback, thesaurus, query log
Proposal: ~[k, date > 1/4/05 ∧ class="/network/drivers/wlan" ∧ product="T40"] (...)
generally: ~[k], ~[k], ~[k], ... (one stretchable variant per operator)
Properties and problems:
+ can leverage multidim. histograms
? composability of operators
? choice of filters for approx. k top-level results
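To make the "stretchable" idea concrete, here is a small sketch that relaxes a date condition until at least k requests qualify. The relaxation policy (step back a fixed number of days, give up after max_steps) and the names stretch_date_filter, step_days, and max_steps are assumptions for illustration; the slide does not prescribe how conditions are relaxed.

```python
from datetime import date, timedelta

def stretch_date_filter(rows, k, since, step_days=7, max_steps=10):
    """Relax the lower date bound until at least k rows match, or give up."""
    bound = since
    hits = [r for r in rows if r["date"] > bound]
    steps = 0
    while len(hits) < k and steps < max_steps:
        bound -= timedelta(days=step_days)          # stretch the range condition
        hits = [r for r in rows if r["date"] > bound]
        steps += 1
    return hits, bound

# Hypothetical Requests rows; only the date attribute matters here.
REQUESTS = [
    {"rid": 1, "date": date(2005, 1, 5)},
    {"rid": 2, "date": date(2005, 1, 2)},
    {"rid": 3, "date": date(2004, 12, 28)},
    {"rid": 4, "date": date(2004, 12, 20)},
    {"rid": 5, "date": date(2004, 11, 15)},
]

hits, bound = stretch_date_filter(REQUESTS, k=3, since=date(2005, 1, 4))
```

A production operator would presumably pick the relaxed bound from histograms in one step instead of iterating, which is where the multidimensional-histogram remark comes in.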
Scoring Operator (SALT with TA Flavor)
Rationale:
• all operators produce lists of tuples
• a scoring operator encapsulates customizable scoring
• can be efficiently implemented in relational kernel
Technique: the scoring operator, with parameters [aggregation functions; scoring function, F; T] and applied to an input list R, consumes prefixes of R with
• a set of simple aggregation functions, each with O(1) space and O(|prefix|) time ("accumulators")
• a scoring function that maps dom(R) and the accumulator outputs to a real
• a filter condition F as in standard selection, referring to current tuple and accumulator values
• a stopping condition T, of the same form as F
Ex.: sort[k, Score, desc] ( scoring operator [accumulators: min-k := min{Score(t) | t ∈ input}; threshold := ...; scoring function: (t) := sum(R1.Score, R2.Score, C1.Score) as Score; F: Score > min-k ∨ |input| < k; T: min-k ≥ threshold ∧ |input| ≥ k] ( merge( sort[...] ([...](Requests R1 ...)), sort[...] ([...](Requests R2 ...)), sort[...] ([...](Customers C1 ...)) ) ) )
Properties and problems:
+ pipelined processing of list prefixes
+ can be implemented by TA with bounded queue
? difficult to integrate into query rewriting
? difficult for cost estimation
similar to SQL rank( ) with user-defined aggregation (and LDL++ aggregation), but with early termination!
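The operator's shape (accumulators, scoring function, filter F, stopping condition T over a list prefix) can be sketched as a small higher-order function. Everything below is an assumption made for illustration: the function name scoring_operator, the callback signatures, and the simplified top-1 example in which rows arrive sorted by a first partial score s1 and a second partial score s2 is known to be at most CAP.

```python
def scoring_operator(rows, state, score, keep, update, stop):
    """Consume a prefix of rows; return (kept (row, score) pairs, rows consumed)."""
    out = []
    consumed = 0
    for row in rows:
        consumed += 1
        s = score(row, state)
        if keep(row, s, state):        # filter condition F
            out.append((row, s))
        update(state, row, s)          # accumulators: O(1) work per tuple
        if stop(state):                # stopping condition T
            break
    return out, consumed

CAP = 1.0  # assumed upper bound on the second partial score s2

# Hypothetical input list of (id, s1, s2), sorted by descending s1.
ROWS = [("r1", 0.9, 0.1), ("r2", 0.8, 0.7), ("r3", 0.6, 0.2),
        ("r4", 0.4, 0.9), ("r5", 0.3, 0.3)]

def update(state, row, s):
    state["best"] = max(state["best"], s)   # running best total score
    state["last_s1"] = row[1]               # bounds s1 of all later rows

kept, consumed = scoring_operator(
    ROWS,
    state={"best": float("-inf"), "last_s1": float("inf")},
    score=lambda row, st: row[1] + row[2],
    keep=lambda row, s, st: s > st["best"],
    update=update,
    # stop once no later row can beat the best score seen so far
    stop=lambda st: st["best"] >= st["last_s1"] + CAP,
)
```

Here the operator stops before reading the whole list; how far it reads depends on the data, which is exactly why cost estimation for such an operator is hard.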
Key Problem: Query Rewriting
Goal: establish algebraic equivalences for SALT expressions as a basis for query rewriting
Examples:
• commutativity of stretchable top-k and standard selection: ~[k, date > 1/4/05] (σ[product="T40"] (R)) versus σ[product="T40"] (~[k, date > 1/4/05] (R))
• commutativity of scoring operator and standard selection
• distributivity of scoring operator over union
• ...
Technical challenge: Either work out correct & useful rewriting rules or establish "approximate equivalences" of the kind
~[k, F] (σ[G] (R)) ≈ sort[k, ...] (σ[G] (~[k*, F] (R)))
with proper k*, ideally with quantifiable error probabilities.
Wishful thinking!
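A tiny example shows why such equivalences are not free. The sketch uses a plain top-k as a stand-in for the stretchable operator ~[k, F] (an assumption of this sketch) and hypothetical rows; it checks that selection and top-k do not commute, and that pushing the top-k below the selection works only with a larger k*.

```python
def top_k(rows, k):
    """Plain top-k by score, standing in for the stretchable ~[k, F]."""
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:k]

def select(rows, product):
    """Standard selection on the product attribute."""
    return [r for r in rows if r["product"] == product]

# Hypothetical scored requests.
R = [
    {"id": "a", "product": "T40", "score": 0.9},
    {"id": "b", "product": "X31", "score": 0.8},
    {"id": "c", "product": "X31", "score": 0.7},
    {"id": "d", "product": "T40", "score": 0.6},
    {"id": "e", "product": "T40", "score": 0.5},
]

select_then_topk = top_k(select(R, "T40"), 2)            # ids a, d
topk_then_select = select(top_k(R, 2), "T40")            # id a only
with_k_star = top_k(select(top_k(R, 4), "T40"), 2)       # k* = 4 recovers a, d
```

Choosing k* = 4 works here only because the data is known; an optimizer would have to estimate k* from selectivity statistics, hence the call for quantifiable error probabilities.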
Key Problem: Cost Estimation
1) usual DB cost estimation: selectivity of multidimensional filters
2) cost estimation for top-k ranked retrieval: when will we stop? (for the scoring operator: length of input prefix; for TA: scan depth on index lists)
We claim that 2 is harder than 1!
Technical challenge: Develop full estimator for top-k execution cost
Possible approaches (Ilyas et al.: Sigmod'04, Theobald et al.: VLDB'04):
• Probabilistically predict (quantile of) aggregated score of data item d:
  • precompute score-distribution histogram for each single dim
  • compute convolution of histograms at query time to predict the probability that Σ_i S_i exceeds a given bound
• View scores X1 > X2 > ... > Xn of n data items as samples from S = Σ_i S_i; use order statistics to predict score of rank-k item and scan depth at stopping time
[Figure: index lists t1, t2, t3 as in the TA example]
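The histogram-convolution step can be sketched as follows. The sketch assumes per-dimension scores are discretized into integer levels 0, 1, 2, ..., that dimensions are independent, and that each histogram is a list of probabilities per level; the function names and the sample histograms are hypothetical.

```python
def convolve(h1, h2):
    """Distribution of S1 + S2 for independent discrete scores S1 ~ h1, S2 ~ h2."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def tail_probability(hist, bound):
    """P[S > bound] for a discrete score distribution over levels 0, 1, 2, ..."""
    return sum(p for level, p in enumerate(hist) if level > bound)

# Hypothetical precomputed score histograms for two dimensions.
H1 = [0.5, 0.3, 0.2]   # P[S1 = 0], P[S1 = 1], P[S1 = 2]
H2 = [0.6, 0.4]        # P[S2 = 0], P[S2 = 1]

joint = convolve(H1, H2)
p_exceeds = tail_probability(joint, bound=1)
```

Folding in further dimensions is a matter of repeated convolution; the independence assumption is the weak point when scores are correlated across dimensions.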
Conclusion: Caveats and Rebuttals
Is there business value in DB&IR?
Yes, for both individual apps and general text2data.
Is there anything new here?
Literature has bits and pieces, but no strategic view.
Don't eXtensible DBSs or intranet SEs cover 90%?
XDBSs with UDFs too complex, SEs lack query optimization.
Do IR people believe in DB&IR?
Yes: prob. Datalog, XML IR, statistical relational learning, etc.
Do IR people believe in SALT and query opt.?
No: mostly driven by search result quality, largely disregard performance.
Does SE industry believe in SALT and query opt.?
No: simple consumer-oriented search or small content mgt. apps.
Where do we go from here?
Detailed design & impl. of SALT, with query optimization.
DB&IR is important, SALT algebra is one key aspect
Additional Resources
• http://dblife.cs.wisc.edu/
• Information Extraction and Integration: an Overview, William Cohen: http://www.cs.cmu.edu/~wcohen/ie-survey.ppt