Efficient Search Engine Measurements
Maxim Gurevich (Technion)
Ziv Bar-Yossef (Technion and Google)
Search Engine Benchmarks
• State of the art: no objective benchmarks for search engines
• Need to rely on "anecdotal" studies or on subjective search engine reports
• Users, advertisers, and partners cannot compare search engines
• Our goal: design search engine benchmarking techniques that are:
  • Accurate
  • Efficient
  • Objective
  • Transparent
Search Engine Corpus Evaluation
• Corpus size: how many pages are indexed?
• Search engine overlap: what fraction of the pages indexed by search engine A are also indexed by search engine B?
• Freshness: how old are the pages in the index?
• Spam resilience: what fraction of the pages in the index are spam?
• Duplicates: how many unique pages are there in the index?
Search Engine Corpus Metrics
[Diagram: a search engine's index holds a set D of documents drawn from the web, accessible only through the public interface; a target function is computed over D.]
• Corpus size (focus of this talk)
• Overlap
• Average age of a page
• Number of unique pages
Search Engine Estimators
[Diagram: the estimator submits queries to the search engine's public interface, receives the top k results for each query, and outputs an estimate of |D|.]
Success Criteria
Estimation accuracy:
• Bias: E(Estimate − |D|)
Amortized cost (cost × variance):
• Amortized query cost
• Amortized fetch cost
• Amortized function cost
Previous Work
Average metrics:
• Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
• Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
• Random queries [BharatBroder98, CheneyPerry05, GulliSignorini05, BarYossefGurevich06, Broder et al 06]
• Random sampling from the web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
Sum metrics:
• Random queries [Broder et al 06]
Our Contributions
• A new search engine estimator, applicable to both sum metrics and average metrics
  • Arbitrary target functions
  • Arbitrary target distributions (measures)
• Less bias than the Broder et al estimator: in one experiment, the empirical relative bias was reduced from 75% to 0.01%
• More efficient than the BarYossefGurevich06 estimator: in one experiment, the query cost was reduced by a factor of 375
• Techniques:
  • Approximate ratio importance sampling
  • Rao-Blackwellization
Roadmap
• Recast the Broder et al corpus size estimator as an importance sampling estimator
• Describe the "degree mismatch problem" (DMP)
• Show how to overcome DMP using approximate ratio importance sampling
• Discuss Rao-Blackwellization
• Gloss over some experimental results
Query Pools
Pre-processing step: create a query pool P from a training corpus C of web documents.
• Working example: P = all length-3 phrases that occur in C
• If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
• Choose P that "covers" most documents in D
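A minimal sketch of the pool construction, assuming a plain whitespace tokenizer (a real construction would parse HTML and normalize the text):

```python
# Sketch: build a pool P of length-3 phrases from a small training corpus C.
# The whitespace tokenizer is a simplifying assumption.

def length3_phrases(text):
    """All consecutive 3-word phrases occurring in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

C = ["to be or not to be", "news from the bbc today"]
P = set()
for doc in C:
    P |= length3_phrases(doc)
# P contains "to be or", "be or not", "or not to", "not to be", ...
```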
The Search Engine Graph
• P = query pool
• neighbors(q) = { documents returned on query q }; deg(q) = |neighbors(q)|
• neighbors(x) = { queries that return x as a result }; deg(x) = |neighbors(x)|
Example graph (see the sketch below): "news" → www.cnn.com, www.foxnews.com, news.bbc.co.uk, news.google.com; "bbc" → news.bbc.co.uk, www.bbc.co.uk, en.wikipedia.org/wiki/BBC; "google" → www.google.com, maps.google.com; "maps" → maps.google.com, maps.yahoo.com, www.mapquest.com
• deg("news") = 4, deg("bbc") = 3
• deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
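The toy graph can be written down directly; a sketch (the edge set is read off the slide's drawing, so the "google"/"maps" edges are an assumption consistent with the stated degrees):

```python
# Sketch: the toy search-engine graph as adjacency sets (query -> results).
neighbors = {
    "news":   {"www.cnn.com", "www.foxnews.com", "news.bbc.co.uk", "news.google.com"},
    "bbc":    {"news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"},
    "google": {"www.google.com", "maps.google.com"},
    "maps":   {"maps.google.com", "maps.yahoo.com", "www.mapquest.com"},
}

def deg_query(q):
    return len(neighbors[q])                            # deg(q)

def deg_doc(x):
    return sum(x in res for res in neighbors.values())  # deg(x)

assert deg_query("news") == 4 and deg_query("bbc") == 3
assert deg_doc("www.cnn.com") == 1 and deg_doc("news.bbc.co.uk") == 2
```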
Corpus Size as an Integral
E = edges in the queries-documents graph
Lemma: |D| = Σ_{(q,x) ∈ E} 1/deg(x)
Proof: the contribution of edge (q,x) to the sum is 1/deg(x); the total contribution of the edges incident to x is 1; so the total contribution of all edges is |D|. (A numeric check follows below.)
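A quick numeric check of the lemma on the toy graph, reusing `neighbors` and `deg_doc` from the previous sketch:

```python
# Each document x lies on deg(x) edges, each contributing 1/deg(x),
# so the sum over all edges is exactly the number of distinct documents |D|.
D = set().union(*neighbors.values())
total = sum(1.0 / deg_doc(x) for res in neighbors.values() for x in res)
assert abs(total - len(D)) < 1e-9
```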
Corpus Size as an Integral
• Express corpus size as an integral: |D| = Σ_{(q,x) ∈ E} f(q,x) · π(q,x)
• Target measure: π(q,x) = 1/deg(x)
• Target function: f(q,x) = 1
Monte Carlo Estimation
• Monte Carlo estimation of the integral:
  • Sample (Q,X) according to π
  • Output f(Q,X)
• Works only if:
  • π is a proper distribution
  • We can easily sample from π
• BUT, in our case π is not a distribution
• Even if it were, sampling from π(q,x) = 1/deg(x) may not be easy
• So instead, we sample (Q,X) from an easy "trial distribution" p
Sampling Edges, Easily
Sample an edge (q,x) with probability p(q,x) = 1/(|P| · deg(q)) (see the sketch below):
• Pick Q – a random query from P – and submit it to the search engine
• Pick X – a random result of Q from the top k results
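A sketch of the edge sampler; `search(q, k)` is an assumed stand-in for the engine's public interface, not a real API:

```python
import random

def sample_edge(pool, search, k=100):
    """Sample an edge (Q, X): Q uniform from the pool (given as a list),
    X uniform among Q's top-k results. Queries with no results are retried."""
    while True:
        q = random.choice(pool)           # uniform query from P
        results = search(q, k)            # neighbors(q), up to k results
        if results:
            return q, random.choice(results)
```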
Importance Sampling (IS) [Marshall56]
• We have: a sample (Q,X) from p
• We need: to estimate the integral
• So we cannot use simple Monte Carlo estimation
• Importance sampling comes to the rescue…
• Compute an "importance weight" for (Q,X): w(Q,X) = π(Q,X) / p(Q,X) = |P| · deg(Q) / deg(X)
• Importance sampling estimator: IS(Q,X) = f(Q,X) · w(Q,X)
Computing the Importance Sampling Estimator
• We need to compute w(Q,X) = |P| · deg(Q) / deg(X)
• Computing |P| is easy – we know P
• How to compute deg(Q) = |neighbors(Q)|? Since Q was submitted to the search engine, we know deg(Q)
• How to compute deg(X) = |neighbors(X)|?
  • Fetch the content of X from the web
  • pdeg(X) = number of distinct queries from P that X contains
  • Use pdeg(X) as an estimate for deg(X) (see the sketch below)
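Putting the pieces together for f ≡ 1, a single-sample estimate is a one-liner; here pdeg(X) stands in for deg(X), which is exactly where the trouble discussed next comes from. The numbers are illustrative only:

```python
def is_estimate(pool_size, deg_q, pdeg_x):
    """Importance-sampling estimate of |D| from one sampled edge (Q, X):
    f(Q,X) * pi(Q,X) / p(Q,X) = 1 * (1/deg(X)) / (1/(|P| * deg(Q)))."""
    return pool_size * deg_q / pdeg_x

# Assumed numbers: |P| = 1,000,000 phrases, deg(Q) = 20 results,
# and 150 distinct pool phrases found in X's fetched content.
print(is_estimate(1_000_000, 20, 150))   # one sample's estimate of |D|
```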
The Degree Mismatch Problem (DMP)
• In reality, pdeg(X) may differ from deg(X)
• Neighbor recall problem: there may be q ∈ neighbors(x) that do not occur in x
  • q occurs as "anchor text" in a page linking to x
  • q occurs in x, but our parser failed to find it
• Neighbor precision problem: there may be q that occur in x, but q ∉ neighbors(x)
  • q "overflows": more than k documents match q, and x is not among the top k results
  • q occurs in x, but the search engine's parser failed to find it
Implications of DMP
• We can only approximate document degrees
• The bias of the importance sampling estimator may become significant
• In one of our experiments, the relative bias was 75%
Eliminating the Neighbor Recall Problem
• The predicted search engine graph:
  • pneighbors(x) = queries that occur in x
  • pneighbors(q) = documents in whose text q occurs
• An edge (q,x) is "valid" if it occurs both in the search engine graph and in the predicted search engine graph
• The valid search engine graph:
  • vneighbors(x) = neighbors(x) ∩ pneighbors(x)
  • vneighbors(q) = neighbors(q) ∩ pneighbors(q)
Eliminating the Neighbor Recall Problem (cont.)
• We use the valid search engine graph rather than the real search engine graph (see the sketch below):
  • vdeg(q) = |vneighbors(q)|
  • vdeg(x) = |vneighbors(x)|
  • P+ = queries q in P with vdeg(q) > 0
  • D+ = documents x in D with vdeg(x) > 0
• Assuming D+ = D, we get E(IS(Q,X)) = |D|
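A sketch of the valid query degree; `search(q)` and `contains(q, x)` (does q occur in x's fetched text?) are assumed helpers:

```python
def vdeg_query(q, search, contains):
    """|vneighbors(q)|: the results of q whose fetched content contains q."""
    return sum(contains(q, x) for x in search(q))
```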
Approximate Importance Sampling (AIS)
• We need to compute w(Q,X) = |P+| · vdeg(Q) / vdeg(X)
  • vdeg(Q) – easy
  • vdeg(X) – hard
  • |P+| – hard
• We therefore approximate |P+| by |P| and vdeg(X) via pdeg(X):
  AIS(Q,X) = f(Q,X) · |P| · vdeg(Q) · IVD(X) / pdeg(X)
• IVD(X) = unbiased probabilistic estimator for pdeg(X)/vdeg(X)
Estimating pdeg(x)/vdeg(x)
• Given: a document x
• Want: estimate pdeg(x) / vdeg(x)
• Geometric estimation (see the sketch below):
  n = 1
  forever do:
    choose a random phrase Q that occurs in content(x)
    send Q to the search engine
    if x ∈ neighbors(Q), return n
    n ← n + 1
• Probability to hit a "valid" query: vdeg(x) / pdeg(x)
• So, the expected number of iterations is pdeg(x) / vdeg(x)
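The geometric procedure above, sketched in Python; `phrases_in(x)` (a list of the pool phrases occurring in x's content) and `search(q)` (the engine's result list) are assumed helpers:

```python
import random

def estimate_ivd(x, phrases_in, search):
    """Return n ~ Geometric(vdeg(x)/pdeg(x)); E[n] = pdeg(x)/vdeg(x)."""
    n = 1
    while True:
        q = random.choice(phrases_in(x))   # uniform phrase occurring in x
        if x in search(q):                 # (q, x) is a valid edge
            return n
        n += 1
```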
Approximate Importance Sampling: Bias Analysis
• Lemma: the multiplicative bias of AIS(Q,X) is |P| / |P+|
Approximate Importance Sampling: Bias Elimination
• How to eliminate the bias in AIS?
  • Estimate the bias |P| / |P+|
  • Divide AIS by this estimate
• Well, this doesn't quite work: the expected ratio ≠ the ratio of expectations
• So, use a standard trick for estimating ratio statistics: average several AIS samples and several bias-estimator samples separately, and output the ratio of the two averages
• BE = estimator of |P| / |P+|
Bias Analysis
• Theorem: the multiplicative bias of the corrected estimator approaches 1 as the number of samples grows; the ratio trick leaves only a lower-order bias term
Estimating |P|/|P+|
• Also by geometric estimation (see the sketch below):
  n = 1
  forever do:
    choose a random query Q from P
    send Q to the search engine
    if vdeg(Q) > 0, return n
    n ← n + 1
• Probability to hit a "valid" query: |P+| / |P|
• So, the expected number of iterations is |P| / |P+|
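The same geometric pattern, sketched for |P|/|P+|; `vdeg(q)` would be computed as in the valid-graph sketch earlier:

```python
import random

def estimate_be(pool, vdeg):
    """Return n ~ Geometric(|P+|/|P|); E[n] = |P|/|P+|."""
    n = 1
    while True:
        q = random.choice(pool)   # uniform query from P
        if vdeg(q) > 0:           # q has at least one valid edge
            return n
        n += 1
```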
Recap
• Sample valid edges (Q1,X1), …, (Qn,Xn) from p
• Compute vdeg(Qi) for each query Qi
• Compute pdeg(Xi) for each document Xi
• Estimate IVD(Xi) = pdeg(Xi)/vdeg(Xi) for each Xi
• Compute AISi = |P| · vdeg(Qi) · IVD(Xi) / pdeg(Xi)
• Estimate the expected bias BEi = |P|/|P+|
• Output SizeEstimator = (Σi AISi) / (Σi BEi) (see the sketch below)
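A sketch of the assembled size estimator, reusing the helpers from the earlier sketches; all helper names are assumptions of this sketch, not the authors' code:

```python
def size_estimator(pool, search, phrases_in, vdeg, n=1000):
    """Corpus-size estimate from n valid-edge samples (f = 1)."""
    ais, be = [], []
    for _ in range(n):
        q, x = sample_edge(pool, search)
        while q not in phrases_in(x):      # keep only valid edges
            q, x = sample_edge(pool, search)
        ivd = estimate_ivd(x, phrases_in, search)
        ais.append(len(pool) * vdeg(q) * ivd / len(phrases_in(x)))
        be.append(estimate_be(pool, vdeg))
    # mean(AIS) estimates (|P|/|P+|) * |D| and mean(BE) estimates |P|/|P+|,
    # so the ratio of the sums estimates |D|.
    return sum(ais) / sum(be)
```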
Rao-Blackwellization
• Question: we currently use only one (random) result per query submitted to the search engine. Can we also use the rest?
• Rao & Blackwell: sure! Use them as additional samples. It can only help!
• The Rao-Blackwellized AIS estimator averages AIS over all results of the query:
  AISRB(Q) = (1/deg(Q)) · Σ_{X ∈ neighbors(Q)} AIS(Q,X)
• Recall: AIS(Q,X) = f(Q,X) · |P| · vdeg(Q) · IVD(X) / pdeg(X)
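One way to realize the Rao-Blackwellized per-query estimate in the same sketch framework, treating invalid edges as contributing zero (reusing `estimate_ivd`; `results` is the query's top-k result list):

```python
def ais_rb(q, results, pool, phrases_in, search):
    """Average AIS(q, x) over all results x of q; vdeg(q) is the number
    of results whose content actually contains q (the valid edges)."""
    valid = [x for x in results if q in phrases_in(x)]
    vdeg_q = len(valid)
    total = sum(len(pool) * vdeg_q * estimate_ivd(x, phrases_in, search)
                / len(phrases_in(x)) for x in valid)
    return total / len(results) if results else 0.0
```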
RB-AIS: Analysis
• The Rao-Blackwell Theorem:
  • AISRB has exactly the same bias as AIS
  • The variance of AISRB can only be lower
• Variance is reduced if the query results are sufficiently "variable"
• Now, use AISRB instead of AIS in SizeEstimator