Metasearch – Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007. Zhenyu (Victor) Liu, Software Engineer, Google Inc., vicliu@google.com
Roadmap • The problem • Database content modeling • Database selection • Summary
Metasearch – the problem • [Figure: a user issues the query "applied mathematics" without knowing which databases ("???") to ask; the Metasearch Engine forwards the query to the underlying text databases and returns the search results]
Subproblems • Database content modeling • How does a Metasearch engine "perceive" the content of each database? • Database selection • Selectively issue the query to the "best" databases • Query translation • Different databases have different query formats • "a+b" / "a AND b" / "title:a AND body:b" / etc. • Result merging • Query "applied mathematics" returns top-10 results from both science.com and nature.com – how to present them?
Database content modeling and selection: a simplified example • A "content summary" of each database • Selection based on # of matching docs • Assuming independence between words • Database A (10,000 docs total): 10,000 × 0.4 × 0.25 = 1,000 documents are estimated to match "applied mathematics" • Database B (60,000 docs total): 60,000 × 0.00333 × 0.005 ≈ 1 document is estimated to match "applied mathematics" • 1,000 > 1, so select Database A
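A minimal sketch of this independence-based estimate in Python (the document counts and word probabilities are the illustrative numbers from this slide; the helper name is ours):

```python
# Estimate the number of documents matching a multi-word query, assuming
# the query words occur independently of each other within a database.
def estimate_matches(total_docs, word_probs, query_words):
    """total_docs: |db|; word_probs: word -> Pr(word appears in a doc of db)."""
    estimate = total_docs
    for w in query_words:
        estimate *= word_probs.get(w, 0.0)
    return estimate

db_a = (10_000, {"applied": 0.4, "mathematics": 0.25})
db_b = (60_000, {"applied": 0.00333, "mathematics": 0.005})
query = ["applied", "mathematics"]

print(estimate_matches(*db_a, query))  # 10,000 * 0.4 * 0.25 = 1000
print(estimate_matches(*db_b, query))  # 60,000 * 0.00333 * 0.005 ≈ 1
```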
Roadmap • The problem • Database content modeling • Database selection • Summary
Database content modeling • Replicate the entire text database: most storage demanding, requires a fully cooperative database • Obtain a full content summary: less storage demanding, requires a fully cooperative database • Approximate the content summary via sampling: least storage demanding, works with a non-cooperative database • Download part of a text database: more storage demanding, works with a non-cooperative database
Replicate the entire database • E.g. • www.google.com/patents, replica of the entire USPTO patent document database
Download a non-cooperative database • Objective: download as much as possible • Basic idea: "probing" (querying with short queries such as "applied" or "mathematics") and downloading all results • Practically, can only issue a fixed # of probes (e.g., 1,000 queries per day) • [Figure: the Metasearch Engine issues probe words through the search interface of a text database]
Harder than the "set-coverage" problem • All docs in a database db form the universe • assuming all docs are equally valuable • Each probe (e.g., "applied", "mathematics") corresponds to a subset: the docs it matches • Find the least # of subsets (probes) that covers db • or, the max coverage with a fixed # of subsets (probes) • NP-complete • The greedy algorithm is proved to be the best-possible polynomial-time approximation algorithm (a sketch follows below) • Harder still: the cardinality of each subset (# of matching docs for each probe) is unknown!
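For the idealized case where the result set of every candidate probe is known, a minimal sketch of the greedy set-cover heuristic mentioned above (the document ids and probe words are made up):

```python
def greedy_probes(probe_results, max_probes):
    """Pick up to max_probes probe words, each time choosing the word whose
    result set adds the most not-yet-covered documents.
    probe_results: dict mapping probe word -> set of matching doc ids."""
    covered, chosen = set(), []
    for _ in range(max_probes):
        best = max(probe_results, key=lambda w: len(probe_results[w] - covered))
        gain = probe_results[best] - covered
        if not gain:          # nothing new can be covered
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered

probes = {"applied": {1, 2, 3}, "mathematics": {2, 3, 4, 5}, "theory": {5, 6}}
print(greedy_probes(probes, 2))   # picks "mathematics" first, then "applied"
```

In the real setting the cardinality of each result set is unknown before probing, which is exactly what the next slides address.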
Pseudo-greedy algorithms [NPC05] • Greedy set coverage: at each step, choose the subset with the max "cardinality gain" • When the cardinality of the subsets is unknown: • Assume word frequencies are proportionally the same across databases • e.g., build a reference database from Web pages crawled from the Internet and rank single words according to their frequency there • Or start with certain "seed" queries and adaptively choose the next query word from within the docs returned • so the choice of probing words varies from database to database
An adaptive method • D(wi): the subset of docs returned by probing with word wi • Suppose w1, w2, …, wn have already been issued; the gain of a candidate word wi+1 is the # of new docs, |D(wi+1)| − |D(wi+1) ∩ (D(w1) ∪ … ∪ D(wn))| • Rewritten as |db|·Pr(wi+1) − |db|·Pr(wi+1 Λ (w1 V … V wn)) • Pr(w): probability of w appearing in a doc of db • The second term can be measured on the docs already downloaded; the first term must be estimated
An adaptive method (cont'd) • How to estimate P̃r(wi+1)? • Zipf's law: Pr(w) = α(R(w) + β)^−γ, where R(w) is the rank of w in descending order of Pr(w) • Assume the relative ranking of w1, w2, …, wn and the other words is the same in the downloaded subset as in db • Interpolate: rank the single words by their frequency in the downloaded documents, fit the Zipf's-law curve to the known Pr(w) values of w1, w2, …, wn, and read the interpolated P̃r(w) of any other word off the fitted curve at its rank
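A sketch of the curve-fit-and-interpolate step; the ranks and Pr(w) values are illustrative, and SciPy's generic least-squares fitter stands in for whatever fitting routine [NPC05] actually used:

```python
import numpy as np
from scipy.optimize import curve_fit

def zipf(rank, alpha, beta, gamma):
    # The generalized Zipf form from the slide: Pr(w) = alpha * (R(w) + beta) ** (-gamma)
    return alpha * (rank + beta) ** (-gamma)

# Ranks (by frequency in the downloaded docs) and known Pr(w) in db for already-issued probe words.
known_ranks = np.array([1.0, 3.0, 7.0, 15.0, 40.0])
known_probs = np.array([0.30, 0.12, 0.05, 0.02, 0.008])

params, _ = curve_fit(zipf, known_ranks, known_probs,
                      p0=(0.3, 1.0, 1.0),
                      bounds=([0.0, 0.0, 0.1], [10.0, 10.0, 5.0]))

# Interpolated estimate P~r(w) for a candidate word that ranks 10th in the downloaded docs.
print(zipf(10.0, *params))
```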
Obtain an exact content summary • C(db) for a database db • Statistics about words in db, e.g., df – document frequency • Standards and proposals for co-operative databases to follow to export C(db) • STARTS [GCM97] • Initiated by Stanford, attracted main search engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc. • SDARTS [GIG01] • Initiated by Columbia U.
Approximate the content summary • Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy • Basic idea: probing and downloading sample docs [CC01]; see the sketch below • Example, with df as the content-summary statistic: 1. Pick a single word as the query, probe the database 2. Download a fraction of the results, e.g., the top-k 3. If the terminating condition is not satisfied, go to step 1 4. Output <w, df̃> pairs based on the sample docs downloaded
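A sketch of the sampling loop; `search` stands in for the database's query interface (its signature is our assumption, not part of [CC01]):

```python
from collections import Counter

def sample_content_summary(search, probe_words, top_k=4, max_probes=100):
    """Approximate df statistics by probing a database and downloading top results.
    search(word, top_k) is assumed to return a list of documents, each a list of words."""
    sampled_docs, df = 0, Counter()
    for i, word in enumerate(probe_words):
        if i >= max_probes:                  # terminating condition: probe budget exhausted
            break
        for doc in search(word, top_k):
            sampled_docs += 1
            df.update(set(doc))              # document frequency: count each word once per doc
    return df, sampled_docs                  # <w, df~> pairs over the sampled docs
```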
Vocabulary coverage • Can a small sample of docs cover the vocabulary of a big database? • Yes, based on Heaps' law [Hea78]: |W| = Kn^β • n: # of words scanned • W: set of distinct words encountered • K: constant, typically in [10, 100] • β: constant, typically in [0.4, 0.6] • Empirically verified [CC01]
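A quick check of Heaps' law with assumed mid-range constants (K = 30, β = 0.5):

```python
def heaps_vocabulary(n_words, K=30, beta=0.5):
    # Heaps' law: |W| = K * n^beta
    return K * n_words ** beta

# Even a modest sample yields broad vocabulary coverage:
print(round(heaps_vocabulary(100_000)))      # ~9,500 distinct words
print(round(heaps_vocabulary(10_000_000)))   # ~95,000 distinct words
```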
Estimate document frequency • How to estimate the df̃ of a word in the entire database? • w used as a query during sampling: its df is typically revealed in the search results • w' merely appearing in the sampled docs: estimate df̃ from the doc sample • Apply Zipf's law & interpolate [IG02]: • Rank the w and w' words by their frequency in the sample • Curve-fit based on the true df of the query words w • Read the estimated df̃ of w' off the fitted curve
What if db changes over time? • So do its content summary C(db) and C̃(db) [INC05] • Empirical study: • 152 Web databases, a snapshot downloaded weekly, for 1 year • df as the statistics measure • Kullback-Leibler (KL) divergence as the "change" measure, between the "latest" snapshot and the snapshot taken time t ago • db does change! • How do we model the change? • When to resample and get a new C̃(db)? • [Figure: KL divergence plotted against t]
Model the change • KLdb(t): the KL divergence between the current C̃(db) and C̃(db, t) from time t ago • T: the time when KLdb(t) exceeds a pre-specified threshold τ • Apply principles of Survival Analysis: • Survival function Sdb(t) = 1 − Pr(T ≤ t) • Hazard function hdb(t) = −(dSdb(t)/dt) / Sdb(t) • How to compute hdb(t), and then Sdb(t)?
Learn the hdb(t) of database change • Cox proportional hazards regression model • ln(hdb(t)) = ln(hbase(t)) + β1x1 + …, where the xi are predictor variables • Predictors: • the pre-specified threshold τ • the Web domain of db: ".com" / ".edu" / ".gov" / ".org" / "others", encoded as 5 binary "domain variables" • ln(|db|) • the average KLdb(1 week) measured in the training period • …
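One plausible way to fit such a stratified model in practice is the `lifelines` package; the sketch below uses made-up illustrative data, not the 152-database study, and a small ridge penalty to keep the toy fit stable:

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per database: weeks until the KL divergence exceeded tau (T), whether that
# was actually observed within the study (observed = 0 means censored), and predictors.
df = pd.DataFrame({
    "T":         [3, 5, 8, 12, 4, 9, 15, 20, 6, 11],
    "observed":  [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
    "tau":       [0.1, 0.1, 0.5, 0.5, 0.1, 0.5, 0.5, 0.5, 0.1, 0.1],
    "log_size":  [11.2, 9.8, 10.5, 8.9, 12.1, 10.0, 9.1, 8.5, 11.8, 10.9],   # ln(|db|)
    "avg_kl_1w": [0.09, 0.05, 0.03, 0.01, 0.12, 0.04, 0.02, 0.01, 0.10, 0.06],
    "domain":    ["com", "com", "com", "com", "edu", "edu", "edu", "edu", "com", "edu"],
})

cph = CoxPHFitter(penalizer=0.1)
# Stratify on the Web domain (see the next slide): each domain gets its own baseline hazard.
cph.fit(df, duration_col="T", event_col="observed", strata=["domain"])
cph.print_summary()
```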
Train the Cox model • A stratified Cox model is applied • the domain variables did not satisfy the Cox proportional-hazards assumption • stratify on each domain, i.e., a separate hbase(t) / Sbase(t) for each domain • Training Sbase(t) for each domain: • assume a Weibull distribution, Sbase(t) = e^(−λt^γ)
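A sketch of the assumed Weibull survival function and the hazard it implies; h(t) = λγt^(γ−1) follows directly from S(t) = e^(−λt^γ), and the parameter values are illustrative:

```python
import numpy as np

def weibull_survival(t, lam, gamma):
    # S_base(t) = exp(-lambda * t^gamma)
    return np.exp(-lam * t ** gamma)

def weibull_hazard(t, lam, gamma):
    # h(t) = -S'(t) / S(t) = lambda * gamma * t^(gamma - 1)
    return lam * gamma * t ** (gamma - 1)

lam, gamma = 0.1, 0.7   # gamma < 1: the hazard decreases as the time since the last change grows
for t in (1, 4, 16, 52):   # weeks
    print(t, round(float(weibull_survival(t, lam, gamma)), 3),
             round(float(weibull_hazard(t, lam, gamma)), 4))
```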
Training result • γ ranges in (0.57, 1.08), so Sbase(t) is not an exponential distribution • [Figure: the fitted Sbase(t) curves plotted against t]
Training result (cont'd) • A larger db takes less time for KLdb(t) to exceed τ • Databases that change faster during a short period are more likely to keep changing later on
How to use the trained model? • The model gives Sdb(t): the likelihood that db "has not changed much" by time t • Goal: an update policy to periodically resample each db • Intuitively, maximize ∑db Sdb(t) • More precisely, maximize the time average S̄ = lim(t→∞) (1/t) ∫₀ᵗ [∑db Sdb(t')] dt' • A policy: {fdb}, where fdb is the update frequency of db, e.g., 2/week • Subject to practical constraints, e.g., a total update cap per week
Derive an optimal update policy • Find {fdb} that maximizes S̄ under the constraint ∑db fdb = F, where F is a global frequency limit • Solvable by the Lagrange-multiplier method (a numerical sketch follows below) • Sample results: [table not reproduced]
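A numerical sketch of the constrained maximization, using SciPy's SLSQP solver in place of a closed-form Lagrange-multiplier solution. It assumes a periodic policy, under which the time-averaged survival of a database refreshed every 1/f weeks is f·∫₀^(1/f) Sdb(t) dt; the Weibull parameters and the budget F are made up:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

params = [(0.05, 0.8), (0.20, 0.7), (0.60, 0.9)]   # illustrative (lambda, gamma) per database
F = 3.0                                            # total updates allowed per week

def avg_survival(f, lam, gamma):
    # Time-averaged S_db(t) when db is re-sampled every 1/f weeks.
    integral, _ = quad(lambda t: np.exp(-lam * t ** gamma), 0.0, 1.0 / f)
    return f * integral

def objective(freqs):
    # Negative of S-bar, since scipy minimizes.
    return -sum(avg_survival(f, lam, g) for f, (lam, g) in zip(freqs, params))

res = minimize(objective,
               x0=[F / len(params)] * len(params),
               bounds=[(1e-3, F)] * len(params),
               constraints=[{"type": "eq", "fun": lambda f: sum(f) - F}],
               method="SLSQP")
print(np.round(res.x, 2))   # the update frequencies {f_db} under the weekly budget
```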
Roadmap • The problem • Database content modeling • Database selection • Summary
Database selection • Select the databases to issue a given query to • Necessary when the Metasearch engine does not have an entire replica of each database – most likely it only has content summaries • Reduces the query load on the entire system • Formalization: • Query q = <w1, …, wm>, databases db1, …, dbn • Rank the databases according to their "relevancy score" r(dbi, q) for query q
Relevancy score • # of matching docs in db • Similarity between q and top docs returned by db • Typically vector-space similarity (dot-product) between q and a doc • Sum / Avg of similarities of top-k docs of each db, e.g., top-10 • Sum / Avg of similarities of top docs of each db exceeding a similarity threshold • Relevancy of db as judged by users • Explicit relevance feedback • User click behavior data
Estimating r(db,q) • Typically, r(db, q) unavailable • Estimate r̃(db, q) based on C(db), or C̃(db)
Estimating r(db,q), example 1 [GGT99] • r(db, q): # of matching docs in db • Independence assumption: query words w1, …, wm appear independently in db • r̃(db, q) = |db| · ∏j df(db, wj) / |db| • df(db, wj): document frequency of wj in db – could be the df̃(db, wj) from C̃(db)
Estimating r(db,q), example 2 [GGT99] • r(db, q) = ∑{d∈db | sim(d, q) > l} sim(d, q) • d: a doc in db • sim(d, q): vector dot-product between d & q • each word in d & q weighted with a common tf·idf weighting • l: a pre-specified threshold
Estimating r(db,q), example 2 (cont'd) • Content summary C(db) required: • df(db, w): doc frequency • v(db, w): ∑{d∈db} weight of w in d's vector • <v(db, w1), v(db, w2), …> – the "centroid" of the entire db viewed as a "cluster of doc vectors"
Estimating r(db,q), example 2 (cont'd) • Case l = 0: the sum of all q–doc similarity values in db • r(db, q) = ∑{d∈db} sim(d, q) • r̃(db, q) = r(db, q) = <v(q, w1), v(q, w2), …> · <v(db, w1), v(db, w2), …> • v(q, w): weight of w in the query vector • What about l > 0?
Estimating r(db,q), example 2 (cont'd) • Case l > 0: assume a uniform weight of w among all docs that contain w • i.e., the weight of w in any such doc = v(db, w) / df(db, w) • Highly-correlated query words scenario: if df(db, wi) < df(db, wj), every doc containing wi also contains wj • sort the words in q s.t. df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm) • r̃(db, q) = ∑i=1…p v(q, wi)·v(db, wi) + df(db, wp)·[∑j=p+1…m v(q, wj)·v(db, wj)/df(db, wj)], where p is determined by some criterion [GGT99] • Disjoint query words scenario: no doc containing wi contains wj • r̃(db, q) = ∑ over the i with df(db, wi) > 0 and v(q, wi)·v(db, wi)/df(db, wi) > l of v(q, wi)·v(db, wi) • (a sketch of these closed forms follows below)
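A sketch of the two cases that reduce to simple closed forms, the l = 0 centroid dot product from the previous slide and the disjoint-words scenario above (the dictionary representation and helper names are ours):

```python
def estimate_l0(v_q, v_db):
    # l = 0: the estimate is exactly the dot product of the query vector and the db centroid.
    return sum(vq * v_db.get(w, 0.0) for w, vq in v_q.items())

def estimate_disjoint(v_q, v_db, df_db, l):
    # Disjoint-words scenario: the docs containing w contribute v(q,w)*v(db,w) in total,
    # spread uniformly over df(db,w) docs; keep only words whose per-doc similarity exceeds l.
    total = 0.0
    for w, vq in v_q.items():
        df = df_db.get(w, 0)
        if df > 0 and vq * v_db.get(w, 0.0) / df > l:
            total += vq * v_db.get(w, 0.0)
    return total
```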
Estimating r(db,q), example 2 (cont’d) • Ranking of databases based on r̃(db, q) empirically evaluated [GGT99]
A probabilistic model for errors in estimation [LLC04] • Any estimation method makes errors • Model an (observed) error distribution for each db • the distribution of db1 ≠ the distribution of db2 • Definition of error: relative, so that r(db, q) = r̃(db, q)·(1 + err(db, q)), as used on the following slides
Modeling the errors: a motivating experiment • dbPMC: PubMed Central, www.pubmedcentral.nih.gov • Two healthcare-related query sets, Q1 and Q2 • |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅ • Compute err(dbPMC, q) for each sample query q ∈ Q1 or q ∈ Q2 • The two error probability distributions (over Q1 and over Q2) are nearly identical • Further verified through statistical tests (Pearson χ²) • [Figure: the error probability distributions of err(dbPMC, q) for q ∈ Q1 and for q ∈ Q2]
Implications of the experiment • On a text database • Similar error behavior among sample queries • Can sample a database and summarize the error behavior into an Error Distribution (ED) • Use ED to predict the error for a future unseen query • Sampling size study [LLC04] • A few hundred sample queries good enough
From an Error Distribution (ED) to a Relevancy Distribution (RD) • Database: db1. Query: qnew • The ED of err(db1, qnew), obtained from sampling, e.g., over the error values −50% / 0% / +50% • The point estimate r̃(db1, qnew) = 1000, from an existing estimation method • By definition of the relative error, each error value maps the estimate to a possible actual relevancy: −50% / 0% / +50% map r̃ = 1000 to r = 500 / 1000 / 1500 • The result: a Relevancy Distribution (RD) for r(db1, qnew), assigning the ED's probabilities to those values
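A sketch of the ED-to-RD conversion, using r = r̃·(1 + err) as in the figure; the ED probabilities are illustrative:

```python
def relevancy_distribution(r_estimate, error_distribution):
    """Combine a point estimate r~(db, q) with an error distribution
    {relative error: probability} into {possible r(db, q): probability}."""
    return {r_estimate * (1.0 + err): p for err, p in error_distribution.items()}

ed_db1 = {-0.5: 0.5, 0.0: 0.4, +0.5: 0.1}      # illustrative ED learned from sample queries
print(relevancy_distribution(1000, ed_db1))     # {500.0: 0.5, 1000.0: 0.4, 1500.0: 0.1}
```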
RD-based selection • Estimation-based: r̃(db1, qnew) = 1000 > r̃(db2, qnew) = 650, so db1 is ranked above db2 • RD-based: combine each estimate with its ED • db1: error values −50% / 0% / +50% give possible relevancies 500 / 1000 / 1500 • db2: error values 0% / +100% give possible relevancies 650 / 1300 • Under these RDs, db1 < db2: Pr(r(db1, qnew) < r(db2, qnew)) = 0.85, so db2 should be ranked first
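A sketch of comparing two discrete RDs, assuming the databases' errors are independent; the distributions are made up and do not reproduce the slide's exact 0.85:

```python
from itertools import product

def prob_first_wins(rd_a, rd_b):
    # Pr( r(db_a, q) > r(db_b, q) ), treating the two RDs as independent.
    return sum(pa * pb
               for (ra, pa), (rb, pb) in product(rd_a.items(), rd_b.items())
               if ra > rb)

rd1 = {500: 0.5, 1000: 0.4, 1500: 0.1}   # illustrative RD for db1 (r~ = 1000)
rd2 = {650: 0.1, 1300: 0.9}              # illustrative RD for db2 (r~ = 650)
print(prob_first_wins(rd2, rd1))         # 0.86 here: db2 is the better bet despite its lower estimate
```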
Correctness metric • Terminology: • DBk: the k databases returned by some selection method • DBtopk: the actual top-k answer • How correct is DBk compared to DBtopk? • Absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, 0 otherwise • Partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k • Cora(DBk) = Corp(DBk) for k = 1
Effectiveness of RD-based selection • 20 healthcare-related text databases on the Web • Q1 (training, 1000 queries) to learn the ED of each database • Q2 (testing, 1000 queries) to test the correctness of database selection
Probing to improve correctness • RD-based selection (k = 1, continuing the example): E[Cora({db2})] = 1·Pr({db2} = DBtop1) + 0·Pr({db2} ≠ DBtop1) = Pr(db2 > db1) = 0.85 • Probe dbi: contact dbi to obtain its exact relevancy • After probing db1 and observing r(db1, q) = 500: both possible values of r(db2, q), 650 and 1300, exceed 500, so E[Cora({db2})] = Pr(db2 > db1) = 1
Computing the expected correctness • Expected absolute correctness: E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0) = Pr(Cora(DBk) = 1) = Pr(DBk = DBtopk) • Expected partial correctness: E[Corp(DBk)] = E[|DBk ∩ DBtopk|] / k = (1/k)·∑db∈DBk Pr(db ∈ DBtopk) • Both are computable from the databases' RDs (a brute-force sketch follows below)
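A brute-force sketch of both expectations, enumerating the joint outcomes of independent discrete RDs (fine for a handful of databases, though a real system would want something cheaper):

```python
from itertools import product

def expected_correctness(rds, chosen):
    """rds: {db: {relevancy value: probability}}, assumed independent.
    chosen: the set DB_k returned by the selection method.
    Returns (E[Cor_a(DB_k)], E[Cor_p(DB_k)])."""
    dbs, k = list(rds), len(chosen)
    e_abs = e_part = 0.0
    for outcome in product(*(rds[db].items() for db in dbs)):
        prob, scores = 1.0, {}
        for db, (r, p) in zip(dbs, outcome):
            scores[db] = r
            prob *= p
        actual_topk = set(sorted(dbs, key=scores.get, reverse=True)[:k])
        e_abs += prob * (1.0 if set(chosen) == actual_topk else 0.0)
        e_part += prob * len(set(chosen) & actual_topk) / k
    return e_abs, e_part

rds = {"db1": {500: 0.5, 1000: 0.4, 1500: 0.1}, "db2": {650: 0.1, 1300: 0.9}}
print(expected_correctness(rds, {"db2"}))   # E[Cor_a({db2})] = Pr(db2 is the actual top-1) = 0.86 here
```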
Adaptive probing algorithm: APro • User-specified correctness threshold: t • Maintain the RDs of all databases: probed databases (db1, …, dbi) have exact relevancies, unprobed databases (dbi+1, …, dbn) keep their estimated RDs • Is there any DBk with E[Cor(DBk)] ≥ t? • YES: return this DBk • NO: probe one more database and repeat
Which database to probe? • The stopping condition: E[Cor(DBk)] ≥ t • A greedy strategy: once probed, which database leads to the highest E[Cor(DBk)]? • The outcome of probing a database, e.g., db3, is itself uncertain: • if r(db3, q) = ra, max E[Cor(DBk)] = 0.85 • if r(db3, q) = rb, max E[Cor(DBk)] = 0.8 • if r(db3, q) = rc, max E[Cor(DBk)] = 0.9 • Probe the database that leads to the largest "expected" max E[Cor(DBk)], averaging over that database's own RD (a sketch follows below)
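A sketch of the APro loop with this greedy probe choice, built on the `expected_correctness` helper sketched a few slides back; the function names and the `probe` callback are ours, not the authors' implementation:

```python
from itertools import combinations

def best_selection(rds, k):
    # The candidate DB_k with the highest expected absolute correctness, and that value.
    best = max(combinations(rds, k), key=lambda c: expected_correctness(rds, set(c))[0])
    return set(best), expected_correctness(rds, set(best))[0]

def choose_probe(rds, k):
    """Greedy step: probe the database whose exact answer is expected to raise
    max E[Cor_a(DB_k)] the most, averaging over that database's own RD."""
    def expected_benefit(db):
        benefit = 0.0
        for r, p in rds[db].items():
            collapsed = {**rds, db: {r: 1.0}}       # pretend the probe returned r
            benefit += p * best_selection(collapsed, k)[1]
        return benefit
    unprobed = [db for db in rds if len(rds[db]) > 1]
    return max(unprobed, key=expected_benefit) if unprobed else None

def apro(rds, probe, k, t):
    """Adaptively probe until some DB_k reaches expected correctness >= t.
    probe(db) is a stand-in that returns the exact relevancy r(db, q)."""
    rds = dict(rds)
    selection, correctness = best_selection(rds, k)
    while correctness < t:
        db = choose_probe(rds, k)
        if db is None:                              # everything already probed
            break
        rds[db] = {probe(db): 1.0}                  # replace the RD with the exact value
        selection, correctness = best_selection(rds, k)
    return selection
```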
Effectiveness of adaptive probing • 20 healthcare-related text databases on the Web • Q1 (training, 1000 queries) to learn the RD of each database • Q2 (testing, 1000 queries) to test the correctness of database selection • [Figure: three plots of average correctness vs. # of databases probed (0–5), comparing adaptive probing (APro) against the term-independence estimator: avg Cora for k = 1, avg Cora for k = 3, and avg Corp for k = 3]
The "lazy TA problem" • The same problem, generalized & "humanized" • After the final exam, the TA wants to find out the top-scoring students • The TA is "lazy" and does not want to score all exam sheets • Input: every student's score as a known distribution • observed from previous quizzes and mid-term exams • Output: a scoring strategy • that maximizes the correctness of the "guessed" top-k students
Further study of this problem [LSC05] • Proves that greedy probing is optimal in special cases • More interesting factors to be explored: • an "optimal" probing strategy in general cases • non-uniform probing cost • time-variant distributions
Roadmap • The problem • Database content modeling • Database selection • Summary