Metasearch – Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007. Zhenyu (Victor) Liu, Software Engineer, Google Inc., vicliu@google.com
Roadmap • The problem • Database content modeling • Database selection • Summary
Metasearch – the problem • [Figure: a user issues the query "applied mathematics" without knowing which databases ("???") to ask; the Metasearch Engine forwards the query to the underlying text databases and returns the search results]
Subproblems • Database content modeling • How does a Metasearch engine "perceive" the content of each database? • Database selection • Selectively issue the query to the "best" databases • Query translation • Different databases have different query formats • "a+b" / "a AND b" / "title:a AND body:b" / etc. • Result merging • Query "applied mathematics" returns top-10 results from both science.com and nature.com – how to present them?
Database content modeling and selection: a simplified example • A "content summary" of each database • Selection based on # of matching docs • Assuming independence between words • Database A (10,000 docs total): 10,000 × 0.4 × 0.25 = 1,000 documents are estimated to match "applied mathematics" • Database B (60,000 docs total): 60,000 × 0.00333 × 0.005 ≈ 1 document is estimated to match "applied mathematics" • 1,000 > 1, so select Database A
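A minimal sketch of this independence-based estimate in Python (the document counts and word probabilities are the illustrative numbers from this slide; the helper name is ours):

```python
# Estimate the number of documents matching a multi-word query, assuming
# the query words occur independently of each other within a database.
def estimate_matches(total_docs, word_probs, query_words):
    """total_docs: |db|; word_probs: word -> Pr(word appears in a doc of db)."""
    estimate = total_docs
    for w in query_words:
        estimate *= word_probs.get(w, 0.0)
    return estimate

db_a = (10_000, {"applied": 0.4, "mathematics": 0.25})
db_b = (60_000, {"applied": 0.00333, "mathematics": 0.005})
query = ["applied", "mathematics"]

print(estimate_matches(*db_a, query))  # 10,000 * 0.4 * 0.25 = 1000
print(estimate_matches(*db_b, query))  # 60,000 * 0.00333 * 0.005 ≈ 1
```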
Roadmap • The problem • Database content modeling • Database selection • Summary
Database content modeling • Replicate the entire text database: most storage demanding, requires a fully cooperative database • Obtain a full content summary: less storage demanding, requires a fully cooperative database • Approximate the content summary via sampling: least storage demanding, works with a non-cooperative database • Download part of a text database: more storage demanding, works with a non-cooperative database
Replicate the entire database • E.g. • www.google.com/patents, replica of the entire USPTO patent document database
Download a non-cooperative database • Objective: download as much as possible • Basic idea: "probing" (querying with short queries such as "applied" or "mathematics") and downloading all results • Practically, can only issue a fixed # of probes (e.g., 1,000 queries per day) • [Figure: the Metasearch Engine issues probe words through the search interface of a text database]
Harder than the "set-coverage" problem • All docs in a database db form the universe • assuming all docs are equally valuable • Each probe (e.g., "applied", "mathematics") corresponds to a subset: the docs it matches • Find the least # of subsets (probes) that covers db • or, the max coverage with a fixed # of subsets (probes) • NP-complete • The greedy algorithm is proved to be the best-possible polynomial-time approximation algorithm (a sketch follows below) • Harder still: the cardinality of each subset (# of matching docs for each probe) is unknown!
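For the idealized case where the result set of every candidate probe is known, a minimal sketch of the greedy set-cover heuristic mentioned above (the document ids and probe words are made up):

```python
def greedy_probes(probe_results, max_probes):
    """Pick up to max_probes probe words, each time choosing the word whose
    result set adds the most not-yet-covered documents.
    probe_results: dict mapping probe word -> set of matching doc ids."""
    covered, chosen = set(), []
    for _ in range(max_probes):
        best = max(probe_results, key=lambda w: len(probe_results[w] - covered))
        gain = probe_results[best] - covered
        if not gain:          # nothing new can be covered
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered

probes = {"applied": {1, 2, 3}, "mathematics": {2, 3, 4, 5}, "theory": {5, 6}}
print(greedy_probes(probes, 2))   # picks "mathematics" first, then "applied"
```

In the real setting the cardinality of each result set is unknown before probing, which is exactly what the next slides address.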
Pseudo-greedy algorithms [NPC05] • Greedy set coverage: at each step, choose the subset with the max "cardinality gain" • When the cardinality of the subsets is unknown: • Assume word frequencies are proportionally the same across databases • e.g., build a reference database from Web pages crawled from the Internet and rank single words according to their frequency there • Or start with certain "seed" queries and adaptively choose the next query word from within the docs returned • so the choice of probing words varies from database to database
An adaptive method • D(wi): the subset of docs returned by probing with word wi • Suppose w1, w2, …, wn have already been issued; the gain of a candidate word wi+1 is the # of new docs, |D(wi+1)| − |D(wi+1) ∩ (D(w1) ∪ … ∪ D(wn))| • Rewritten as |db|·Pr(wi+1) − |db|·Pr(wi+1 Λ (w1 V … V wn)) • Pr(w): probability of w appearing in a doc of db • The second term can be measured on the docs already downloaded; the first term must be estimated
An adaptive method (cont'd) • How to estimate P̃r(wi+1)? • Zipf's law: Pr(w) = α(R(w) + β)^−γ, where R(w) is the rank of w in descending order of Pr(w) • Assume the relative ranking of w1, w2, …, wn and the other words is the same in the downloaded subset as in db • Interpolate: rank the single words by their frequency in the downloaded documents, fit the Zipf's-law curve to the known Pr(w) values of w1, w2, …, wn, and read the interpolated P̃r(w) of any other word off the fitted curve at its rank
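A sketch of the curve-fit-and-interpolate step; the ranks and Pr(w) values are illustrative, and SciPy's generic least-squares fitter stands in for whatever fitting routine [NPC05] actually used:

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(rank, alpha, beta, gamma):
    # The generalized Zipf form from the slide: Pr(w) = alpha * (R(w) + beta) ** (-gamma)
    return alpha * (rank + beta) ** (-gamma)

# Ranks (by frequency in the downloaded docs) and known Pr(w) in db for already-issued probe words.
known_ranks = np.array([1.0, 3.0, 7.0, 15.0, 40.0])
known_probs = np.array([0.30, 0.12, 0.05, 0.02, 0.008])

params, _ = curve_fit(zipf, known_ranks, known_probs,
                      p0=(0.3, 1.0, 1.0),
                      bounds=([0.0, 0.0, 0.1], [10.0, 10.0, 5.0]))

# Interpolated estimate P~r(w) for a candidate word that ranks 10th in the downloaded docs.
print(zipf(10.0, *params))
```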
Obtain an exact content summary • C(db) for a database db • Statistics about words in db, e.g., df – document frequency • Standards and proposals for co-operative databases to follow to export C(db) • STARTS [GCM97] • Initiated by Stanford, attracted main search engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc. • SDARTS [GIG01] • Initiated by Columbia U.
Approximate the content summary • Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy • Basic idea: probing and downloading sample docs [CC01]; see the sketch below • Example, with df as the content-summary statistic: 1. Pick a single word as the query, probe the database 2. Download a fraction of the results, e.g., the top-k 3. If the terminating condition is not satisfied, go to step 1 4. Output <w, df̃> pairs based on the sample docs downloaded
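A sketch of the sampling loop; `search` stands in for the database's query interface (its signature is our assumption, not part of [CC01]):

```python
from collections import Counter

def sample_content_summary(search, probe_words, top_k=4, max_probes=100):
    """Approximate df statistics by probing a database and downloading top results.
    search(word, top_k) is assumed to return a list of documents, each a list of words."""
    sampled_docs, df = 0, Counter()
    for i, word in enumerate(probe_words):
        if i >= max_probes:                  # terminating condition: probe budget exhausted
            break
        for doc in search(word, top_k):
            sampled_docs += 1
            df.update(set(doc))              # document frequency: count each word once per doc
    return df, sampled_docs                  # <w, df~> pairs over the sampled docs
```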
Vocabulary coverage • Can a small sample of docs cover the vocabulary of a big database? • Yes, based on Heaps' law [Hea78]: |W| = Kn^β • n: # of words scanned • W: set of distinct words encountered • K: constant, typically in [10, 100] • β: constant, typically in [0.4, 0.6] • Empirically verified [CC01]
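A quick check of Heaps' law with assumed mid-range constants (K = 30, β = 0.5):

```python
def heaps_vocabulary(n_words, K=30, beta=0.5):
    # Heaps' law: |W| = K * n^beta
    return K * n_words ** beta

# Even a modest sample yields broad vocabulary coverage:
print(round(heaps_vocabulary(100_000)))      # ~9,500 distinct words
print(round(heaps_vocabulary(10_000_000)))   # ~95,000 distinct words
```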
Estimate document frequency • How to estimate the df̃ of a word in the entire database? • w used as a query during sampling: its df is typically revealed in the search results • w' merely appearing in the sampled docs: estimate df̃ from the doc sample • Apply Zipf's law & interpolate [IG02]: • Rank the w and w' words by their frequency in the sample • Curve-fit based on the true df of the query words w • Read the estimated df̃ of w' off the fitted curve
What if db changes over time? • So do its content summary C(db) and C̃(db) [INC05] • Empirical study: • 152 Web databases, a snapshot downloaded weekly, for 1 year • df as the statistics measure • Kullback-Leibler (KL) divergence as the "change" measure, between the "latest" snapshot and the snapshot taken time t ago • db does change! • How do we model the change? • When to resample and get a new C̃(db)? • [Figure: KL divergence plotted against t]
Model the change • KLdb(t): the KL divergence between the current C̃(db) and C̃(db, t) from time t ago • T: the time when KLdb(t) exceeds a pre-specified threshold τ • Apply principles of Survival Analysis: • Survival function Sdb(t) = 1 − Pr(T ≤ t) • Hazard function hdb(t) = −(dSdb(t)/dt) / Sdb(t) • How to compute hdb(t), and then Sdb(t)?
Learn the hdb(t) of database change • Cox proportional hazards regression model • ln(hdb(t)) = ln(hbase(t)) + β1x1 + …, where the xi are predictor variables • Predictors: • the pre-specified threshold τ • the Web domain of db: ".com" / ".edu" / ".gov" / ".org" / "others", encoded as 5 binary "domain variables" • ln(|db|) • the average KLdb(1 week) measured in the training period • …
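One plausible way to fit such a stratified model in practice is the `lifelines` package; the sketch below uses made-up illustrative data, not the 152-database study, and a small ridge penalty to keep the toy fit stable:

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per database: weeks until the KL divergence exceeded tau (T), whether that
# was actually observed within the study (observed = 0 means censored), and predictors.
df = pd.DataFrame({
    "T":         [3, 5, 8, 12, 4, 9, 15, 20, 6, 11],
    "observed":  [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
    "tau":       [0.1, 0.1, 0.5, 0.5, 0.1, 0.5, 0.5, 0.5, 0.1, 0.1],
    "log_size":  [11.2, 9.8, 10.5, 8.9, 12.1, 10.0, 9.1, 8.5, 11.8, 10.9],   # ln(|db|)
    "avg_kl_1w": [0.09, 0.05, 0.03, 0.01, 0.12, 0.04, 0.02, 0.01, 0.10, 0.06],
    "domain":    ["com", "com", "com", "com", "edu", "edu", "edu", "edu", "com", "edu"],
})

cph = CoxPHFitter(penalizer=0.1)
# Stratify on the Web domain (see the next slide): each domain gets its own baseline hazard.
cph.fit(df, duration_col="T", event_col="observed", strata=["domain"])
cph.print_summary()
```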
Train the Cox model • A stratified Cox model is applied • the domain variables did not satisfy the Cox proportional-hazards assumption • stratify on each domain, i.e., a separate hbase(t) / Sbase(t) for each domain • Training Sbase(t) for each domain: • assume a Weibull distribution, Sbase(t) = e^(−λt^γ)
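A sketch of the assumed Weibull survival function and the hazard it implies; h(t) = λγt^(γ−1) follows directly from S(t) = e^(−λt^γ), and the parameter values are illustrative:

```python
import numpy as np

def weibull_survival(t, lam, gamma):
    # S_base(t) = exp(-lambda * t^gamma)
    return np.exp(-lam * t ** gamma)

def weibull_hazard(t, lam, gamma):
    # h(t) = -S'(t) / S(t) = lambda * gamma * t^(gamma - 1)
    return lam * gamma * t ** (gamma - 1)

lam, gamma = 0.1, 0.7   # gamma < 1: the hazard decreases as the time since the last change grows
for t in (1, 4, 16, 52):   # weeks
    print(t, round(float(weibull_survival(t, lam, gamma)), 3),
             round(float(weibull_hazard(t, lam, gamma)), 4))
```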
Training result • γ ranges in (0.57, 1.08), so Sbase(t) is not an exponential distribution • [Figure: the fitted Sbase(t) curves plotted against t]
Training result (cont'd) • A larger db takes less time for KLdb(t) to exceed τ • Databases that change faster during a short period are more likely to keep changing later on
How to use the trained model? • The model gives Sdb(t): the likelihood that db "has not changed much" by time t • Goal: an update policy to periodically resample each db • Intuitively, maximize ∑db Sdb(t) • More precisely, maximize the time average S̄ = lim(t→∞) (1/t) ∫₀ᵗ [∑db Sdb(t')] dt' • A policy: {fdb}, where fdb is the update frequency of db, e.g., 2/week • Subject to practical constraints, e.g., a total update cap per week
Derive an optimal update policy • Find {fdb} that maximizes S̄ under the constraint ∑db fdb = F, where F is a global frequency limit • Solvable by the Lagrange-multiplier method (a numerical sketch follows below) • Sample results: [table not reproduced]
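A numerical sketch of the constrained maximization, using SciPy's SLSQP solver in place of a closed-form Lagrange-multiplier solution. It assumes a periodic policy, under which the time-averaged survival of a database refreshed every 1/f weeks is f·∫₀^(1/f) Sdb(t) dt; the Weibull parameters and the budget F are made up:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

params = [(0.05, 0.8), (0.20, 0.7), (0.60, 0.9)]   # illustrative (lambda, gamma) per database
F = 3.0                                            # total updates allowed per week

def avg_survival(f, lam, gamma):
    # Time-averaged S_db(t) when db is re-sampled every 1/f weeks.
    integral, _ = quad(lambda t: np.exp(-lam * t ** gamma), 0.0, 1.0 / f)
    return f * integral

def objective(freqs):
    # Negative of S-bar, since scipy minimizes.
    return -sum(avg_survival(f, lam, g) for f, (lam, g) in zip(freqs, params))

res = minimize(objective,
               x0=[F / len(params)] * len(params),
               bounds=[(1e-3, F)] * len(params),
               constraints=[{"type": "eq", "fun": lambda f: sum(f) - F}],
               method="SLSQP")
print(np.round(res.x, 2))   # the update frequencies {f_db} under the weekly budget
```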
Roadmap • The problem • Database content modeling • Database selection • Summary
Database selection • Select the databases to issue a given query to • Necessary when the Metasearch engine does not have an entire replica of each database – most likely it only has content summaries • Reduces the query load on the entire system • Formalization: • Query q = <w1, …, wm>, databases db1, …, dbn • Rank the databases according to their "relevancy score" r(dbi, q) for query q
Relevancy score • # of matching docs in db • Similarity between q and top docs returned by db • Typically vector-space similarity (dot-product) between q and a doc • Sum / Avg of similarities of top-k docs of each db, e.g., top-10 • Sum / Avg of similarities of top docs of each db exceeding a similarity threshold • Relevancy of db as judged by users • Explicit relevance feedback • User click behavior data
Estimating r(db,q) • Typically, r(db, q) unavailable • Estimate r̃(db, q) based on C(db), or C̃(db)
Estimating r(db,q), example 1 [GGT99] • r(db, q): # of matching docs in db • Independence assumption: query words w1, …, wm appear independently in db • r̃(db, q) = |db| · ∏j df(db, wj) / |db| • df(db, wj): document frequency of wj in db – could be the df̃(db, wj) from C̃(db)
Estimating r(db,q), example 2 [GGT99] • r(db, q) = ∑{d∈db | sim(d, q) > l} sim(d, q) • d: a doc in db • sim(d, q): vector dot-product between d & q • each word in d & q weighted with a common tf·idf weighting • l: a pre-specified threshold
Estimating r(db,q), example 2 (cont'd) • Content summary C(db) required: • df(db, w): doc frequency • v(db, w): ∑{d∈db} weight of w in d's vector • <v(db, w1), v(db, w2), …> – the "centroid" of the entire db viewed as a "cluster of doc vectors"
Estimating r(db,q), example 2 (cont'd) • Case l = 0: the sum of all q–doc similarity values in db • r(db, q) = ∑{d∈db} sim(d, q) • r̃(db, q) = r(db, q) = <v(q, w1), v(q, w2), …> · <v(db, w1), v(db, w2), …> • v(q, w): weight of w in the query vector • What about l > 0?
Estimating r(db,q), example 2 (cont'd) • Case l > 0: assume a uniform weight of w among all docs that contain w • i.e., the weight of w in any such doc = v(db, w) / df(db, w) • Highly-correlated query words scenario: if df(db, wi) < df(db, wj), every doc containing wi also contains wj • sort the words in q s.t. df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm) • r̃(db, q) = ∑i=1…p v(q, wi)·v(db, wi) + df(db, wp)·[∑j=p+1…m v(q, wj)·v(db, wj)/df(db, wj)], where p is determined by some criterion [GGT99] • Disjoint query words scenario: no doc containing wi contains wj • r̃(db, q) = ∑ over the i with df(db, wi) > 0 and v(q, wi)·v(db, wi)/df(db, wi) > l of v(q, wi)·v(db, wi) • (a sketch of these closed forms follows below)
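A sketch of the two cases that reduce to simple closed forms, the l = 0 centroid dot product from the previous slide and the disjoint-words scenario above (the dictionary representation and helper names are ours):

```python
def estimate_l0(v_q, v_db):
    # l = 0: the estimate is exactly the dot product of the query vector and the db centroid.
    return sum(vq * v_db.get(w, 0.0) for w, vq in v_q.items())

def estimate_disjoint(v_q, v_db, df_db, l):
    # Disjoint-words scenario: the docs containing w contribute v(q,w)*v(db,w) in total,
    # spread uniformly over df(db,w) docs; keep only words whose per-doc similarity exceeds l.
    total = 0.0
    for w, vq in v_q.items():
        df = df_db.get(w, 0)
        if df > 0 and vq * v_db.get(w, 0.0) / df > l:
            total += vq * v_db.get(w, 0.0)
    return total
```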
Estimating r(db,q), example 2 (cont’d) • Ranking of databases based on r̃(db, q) empirically evaluated [GGT99]
A probabilistic model for errors in estimation [LLC04] • Any estimation method makes errors • Model an (observed) error distribution for each db • the distribution of db1 ≠ the distribution of db2 • Definition of error: relative, so that r(db, q) = r̃(db, q)·(1 + err(db, q)), as used on the following slides
Modeling the errors: a motivating experiment • dbPMC: PubMed Central, www.pubmedcentral.nih.gov • Two healthcare-related query sets, Q1 and Q2 • |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅ • Compute err(dbPMC, q) for each sample query q ∈ Q1 or q ∈ Q2 • The two error probability distributions (over Q1 and over Q2) are nearly identical • Further verified through statistical tests (Pearson χ²) • [Figure: the error probability distributions of err(dbPMC, q) for q ∈ Q1 and for q ∈ Q2]
Implications of the experiment • On a text database • Similar error behavior among sample queries • Can sample a database and summarize the error behavior into an Error Distribution (ED) • Use ED to predict the error for a future unseen query • Sampling size study [LLC04] • A few hundred sample queries good enough
From an Error Distribution (ED) to a Relevancy Distribution (RD) • Database: db1. Query: qnew • The ED of err(db1, qnew), obtained from sampling, e.g., over the error values −50% / 0% / +50% • The point estimate r̃(db1, qnew) = 1000, from an existing estimation method • By definition of the relative error, each error value maps the estimate to a possible actual relevancy: −50% / 0% / +50% map r̃ = 1000 to r = 500 / 1000 / 1500 • The result: a Relevancy Distribution (RD) for r(db1, qnew), assigning the ED's probabilities to those values
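A sketch of the ED-to-RD conversion, using r = r̃·(1 + err) as in the figure; the ED probabilities are illustrative:

```python
def relevancy_distribution(r_estimate, error_distribution):
    """Combine a point estimate r~(db, q) with an error distribution
    {relative error: probability} into {possible r(db, q): probability}."""
    return {r_estimate * (1.0 + err): p for err, p in error_distribution.items()}

ed_db1 = {-0.5: 0.5, 0.0: 0.4, +0.5: 0.1}      # illustrative ED learned from sample queries
print(relevancy_distribution(1000, ed_db1))     # {500.0: 0.5, 1000.0: 0.4, 1500.0: 0.1}
```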
RD-based selection • Estimation-based: r̃(db1, qnew) = 1000 > r̃(db2, qnew) = 650, so db1 is ranked above db2 • RD-based: combine each estimate with its ED • db1: error values −50% / 0% / +50% give possible relevancies 500 / 1000 / 1500 • db2: error values 0% / +100% give possible relevancies 650 / 1300 • Under these RDs, db1 < db2: Pr(r(db1, qnew) < r(db2, qnew)) = 0.85, so db2 should be ranked first
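A sketch of comparing two discrete RDs, assuming the databases' errors are independent; the distributions are made up and do not reproduce the slide's exact 0.85:

```python
from itertools import product

def prob_first_wins(rd_a, rd_b):
    # Pr( r(db_a, q) > r(db_b, q) ), treating the two RDs as independent.
    return sum(pa * pb
               for (ra, pa), (rb, pb) in product(rd_a.items(), rd_b.items())
               if ra > rb)

rd1 = {500: 0.5, 1000: 0.4, 1500: 0.1}   # illustrative RD for db1 (r~ = 1000)
rd2 = {650: 0.1, 1300: 0.9}              # illustrative RD for db2 (r~ = 650)
print(prob_first_wins(rd2, rd1))         # 0.86 here: db2 is the better bet despite its lower estimate
```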
Correctness metric • Terminology: • DBk: the k databases returned by some selection method • DBtopk: the actual top-k answer • How correct is DBk compared to DBtopk? • Absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, 0 otherwise • Partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k • Cora(DBk) = Corp(DBk) for k = 1
Effectiveness of RD-based selection • 20 healthcare-related text databases on the Web • Q1 (training, 1000 queries) to learn the ED of each database • Q2 (testing, 1000 queries) to test the correctness of database selection
Probing to improve correctness • RD-based selection (k = 1, continuing the example): E[Cora({db2})] = 1·Pr({db2} = DBtop1) + 0·Pr({db2} ≠ DBtop1) = Pr(db2 > db1) = 0.85 • Probe dbi: contact dbi to obtain its exact relevancy • After probing db1 and observing r(db1, q) = 500: both possible values of r(db2, q), 650 and 1300, exceed 500, so E[Cora({db2})] = Pr(db2 > db1) = 1
Computing the expected correctness • Expected absolute correctness: E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0) = Pr(Cora(DBk) = 1) = Pr(DBk = DBtopk) • Expected partial correctness: E[Corp(DBk)] = E[|DBk ∩ DBtopk|] / k = (1/k)·∑db∈DBk Pr(db ∈ DBtopk) • Both are computable from the databases' RDs (a brute-force sketch follows below)
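A brute-force sketch of both expectations, enumerating the joint outcomes of independent discrete RDs (fine for a handful of databases, though a real system would want something cheaper):

```python
from itertools import product

def expected_correctness(rds, chosen):
    """rds: {db: {relevancy value: probability}}, assumed independent.
    chosen: the set DB_k returned by the selection method.
    Returns (E[Cor_a(DB_k)], E[Cor_p(DB_k)])."""
    dbs, k = list(rds), len(chosen)
    e_abs = e_part = 0.0
    for outcome in product(*(rds[db].items() for db in dbs)):
        prob, scores = 1.0, {}
        for db, (r, p) in zip(dbs, outcome):
            scores[db] = r
            prob *= p
        actual_topk = set(sorted(dbs, key=scores.get, reverse=True)[:k])
        e_abs += prob * (1.0 if set(chosen) == actual_topk else 0.0)
        e_part += prob * len(set(chosen) & actual_topk) / k
    return e_abs, e_part

rds = {"db1": {500: 0.5, 1000: 0.4, 1500: 0.1}, "db2": {650: 0.1, 1300: 0.9}}
print(expected_correctness(rds, {"db2"}))   # E[Cor_a({db2})] = Pr(db2 is the actual top-1) = 0.86 here
```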
Adaptive probing algorithm: APro • User-specified correctness threshold: t • Maintain the RDs of all databases: probed databases (db1, …, dbi) have exact relevancies, unprobed databases (dbi+1, …, dbn) keep their estimated RDs • Is there any DBk with E[Cor(DBk)] ≥ t? • YES: return this DBk • NO: probe one more database and repeat
Which database to probe? • The stopping condition: E[Cor(DBk)] ≥ t • A greedy strategy: once probed, which database leads to the highest E[Cor(DBk)]? • The outcome of probing a database, e.g., db3, is itself uncertain: • if r(db3, q) = ra, max E[Cor(DBk)] = 0.85 • if r(db3, q) = rb, max E[Cor(DBk)] = 0.8 • if r(db3, q) = rc, max E[Cor(DBk)] = 0.9 • Probe the database that leads to the largest "expected" max E[Cor(DBk)], averaging over that database's own RD (a sketch follows below)
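A sketch of the APro loop with this greedy probe choice, built on the `expected_correctness` helper sketched a few slides back; the function names and the `probe` callback are ours, not the authors' implementation:

```python
from itertools import combinations

def best_selection(rds, k):
    # The candidate DB_k with the highest expected absolute correctness, and that value.
    best = max(combinations(rds, k), key=lambda c: expected_correctness(rds, set(c))[0])
    return set(best), expected_correctness(rds, set(best))[0]

def choose_probe(rds, k):
    """Greedy step: probe the database whose exact answer is expected to raise
    max E[Cor_a(DB_k)] the most, averaging over that database's own RD."""
    def expected_benefit(db):
        benefit = 0.0
        for r, p in rds[db].items():
            collapsed = {**rds, db: {r: 1.0}}       # pretend the probe returned r
            benefit += p * best_selection(collapsed, k)[1]
        return benefit
    unprobed = [db for db in rds if len(rds[db]) > 1]
    return max(unprobed, key=expected_benefit) if unprobed else None

def apro(rds, probe, k, t):
    """Adaptively probe until some DB_k reaches expected correctness >= t.
    probe(db) is a stand-in that returns the exact relevancy r(db, q)."""
    rds = dict(rds)
    selection, correctness = best_selection(rds, k)
    while correctness < t:
        db = choose_probe(rds, k)
        if db is None:                              # everything already probed
            break
        rds[db] = {probe(db): 1.0}                  # replace the RD with the exact value
        selection, correctness = best_selection(rds, k)
    return selection
```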
Effectiveness of adaptive probing • 20 healthcare-related text databases on the Web • Q1 (training, 1000 queries) to learn the RD of each database • Q2 (testing, 1000 queries) to test the correctness of database selection • [Figure: three plots of average correctness vs. # of databases probed (0–5), comparing adaptive probing (APro) against the term-independence estimator: avg Cora for k = 1, avg Cora for k = 3, and avg Corp for k = 3]
The "lazy TA problem" • The same problem, generalized & "humanized" • After the final exam, the TA wants to find out the top-scoring students • The TA is "lazy" and does not want to score all exam sheets • Input: every student's score as a known distribution • observed from previous quizzes and mid-term exams • Output: a scoring strategy • that maximizes the correctness of the "guessed" top-k students
Further study of this problem [LSC05] • Proves that greedy probing is optimal in special cases • More interesting factors to be explored: • an "optimal" probing strategy in general cases • non-uniform probing cost • time-variant distributions
Roadmap • The problem • Database content modeling • Database selection • Summary