Efficient Search Engine Measurements
Maxim Gurevich (Technion)
Ziv Bar-Yossef (Technion and Google)
Search Engine Benchmarks
• State of the art: no objective benchmarks for search engines
• Need to rely on "anecdotal" studies or on subjective search engine reports
• Users, advertisers, and partners cannot compare search engines
• Our goal: design search engine benchmarking techniques that are:
  • Accurate
  • Efficient
  • Objective
  • Transparent
Search Engine Corpus Evaluation
• Corpus size: how many pages are indexed?
• Search engine overlap: what fraction of the pages indexed by search engine A are also indexed by search engine B?
• Freshness: how old are the pages in the index?
• Spam resilience: what fraction of the pages in the index are spam?
• Duplicates: how many unique pages are there in the index?
Search Engine Corpus Metrics
[Diagram: a search engine's index holds a set D of documents drawn from the web, accessible only through the public interface; a target function is computed over D.]
• Corpus size (focus of this talk)
• Overlap
• Average age of a page
• Number of unique pages
Search Engine Estimators
[Diagram: the estimator submits queries to the search engine's public interface, receives the top k results for each query, and outputs an estimate of |D|.]
Success Criteria
Estimation accuracy:
• Bias: E(Estimate − |D|)
Amortized cost (cost × variance):
• Amortized query cost
• Amortized fetch cost
• Amortized function cost
Previous Work
Average metrics:
• Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
• Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
• Random queries [BharatBroder98, CheneyPerry05, GulliSignorini05, BarYossefGurevich06, Broder et al 06]
• Random sampling from the web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
Sum metrics:
• Random queries [Broder et al 06]
Our Contributions
• A new search engine estimator, applicable to both sum metrics and average metrics
  • Arbitrary target functions
  • Arbitrary target distributions (measures)
• Less bias than the Broder et al estimator: in one experiment, the empirical relative bias was reduced from 75% to 0.01%
• More efficient than the BarYossefGurevich06 estimator: in one experiment, the query cost was reduced by a factor of 375
• Techniques:
  • Approximate ratio importance sampling
  • Rao-Blackwellization
Roadmap
• Recast the Broder et al corpus size estimator as an importance sampling estimator
• Describe the "degree mismatch problem" (DMP)
• Show how to overcome DMP using approximate ratio importance sampling
• Discuss Rao-Blackwellization
• Gloss over some experimental results
Query Pools
Pre-processing step: create a query pool P from a training corpus C of web documents.
• Working example: P = all length-3 phrases that occur in C
• If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
• Choose P that "covers" most documents in D
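A minimal sketch of the pool construction, assuming a plain whitespace tokenizer (a real construction would parse HTML and normalize the text):

```python
# Sketch: build a pool P of length-3 phrases from a small training corpus C.
# The whitespace tokenizer is a simplifying assumption.

def length3_phrases(text):
    """All consecutive 3-word phrases occurring in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

C = ["to be or not to be", "news from the bbc today"]
P = set()
for doc in C:
    P |= length3_phrases(doc)
# P contains "to be or", "be or not", "or not to", "not to be", ...
```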
The Search Engine Graph
• P = query pool
• neighbors(q) = { documents returned on query q }; deg(q) = |neighbors(q)|
• neighbors(x) = { queries that return x as a result }; deg(x) = |neighbors(x)|
Example graph (see the sketch below): "news" → www.cnn.com, www.foxnews.com, news.bbc.co.uk, news.google.com; "bbc" → news.bbc.co.uk, www.bbc.co.uk, en.wikipedia.org/wiki/BBC; "google" → www.google.com, maps.google.com; "maps" → maps.google.com, maps.yahoo.com, www.mapquest.com
• deg("news") = 4, deg("bbc") = 3
• deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
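The toy graph can be written down directly; a sketch (the edge set is read off the slide's drawing, so the "google"/"maps" edges are an assumption consistent with the stated degrees):

```python
# Sketch: the toy search-engine graph as adjacency sets (query -> results).
neighbors = {
    "news":   {"www.cnn.com", "www.foxnews.com", "news.bbc.co.uk", "news.google.com"},
    "bbc":    {"news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"},
    "google": {"www.google.com", "maps.google.com"},
    "maps":   {"maps.google.com", "maps.yahoo.com", "www.mapquest.com"},
}

def deg_query(q):
    return len(neighbors[q])                            # deg(q)

def deg_doc(x):
    return sum(x in res for res in neighbors.values())  # deg(x)

assert deg_query("news") == 4 and deg_query("bbc") == 3
assert deg_doc("www.cnn.com") == 1 and deg_doc("news.bbc.co.uk") == 2
```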
Corpus Size as an Integral
E = edges in the queries-documents graph
Lemma: |D| = Σ_{(q,x) ∈ E} 1/deg(x)
Proof: the contribution of edge (q,x) to the sum is 1/deg(x); the total contribution of the edges incident to x is 1; so the total contribution of all edges is |D|. (A numeric check follows below.)
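A quick numeric check of the lemma on the toy graph, reusing `neighbors` and `deg_doc` from the previous sketch:

```python
# Each document x lies on deg(x) edges, each contributing 1/deg(x),
# so the sum over all edges is exactly the number of distinct documents |D|.
D = set().union(*neighbors.values())
total = sum(1.0 / deg_doc(x) for res in neighbors.values() for x in res)
assert abs(total - len(D)) < 1e-9
```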
Corpus Size as an Integral
• Express corpus size as an integral: |D| = Σ_{(q,x) ∈ E} f(q,x) · π(q,x)
• Target measure: π(q,x) = 1/deg(x)
• Target function: f(q,x) = 1
Monte Carlo Estimation
• Monte Carlo estimation of the integral:
  • Sample (Q,X) according to π
  • Output f(Q,X)
• Works only if:
  • π is a proper distribution
  • We can easily sample from π
• BUT, in our case π is not a distribution
• Even if it were, sampling from π(q,x) = 1/deg(x) may not be easy
• So instead, we sample (Q,X) from an easy "trial distribution" p
Sampling Edges, Easily
Sample an edge (q,x) with probability p(q,x) = 1/(|P| · deg(q)) (see the sketch below):
• Pick Q – a random query from P – and submit it to the search engine
• Pick X – a random result of Q from the top k results
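A sketch of the edge sampler; `search(q, k)` is an assumed stand-in for the engine's public interface, not a real API:

```python
import random

def sample_edge(pool, search, k=100):
    """Sample an edge (Q, X): Q uniform from the pool (given as a list),
    X uniform among Q's top-k results. Queries with no results are retried."""
    while True:
        q = random.choice(pool)           # uniform query from P
        results = search(q, k)            # neighbors(q), up to k results
        if results:
            return q, random.choice(results)
```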
Importance Sampling (IS) [Marshall56]
• We have: a sample (Q,X) from p
• We need: to estimate the integral
• So we cannot use simple Monte Carlo estimation
• Importance sampling comes to the rescue…
• Compute an "importance weight" for (Q,X): w(Q,X) = π(Q,X) / p(Q,X) = |P| · deg(Q) / deg(X)
• Importance sampling estimator: IS(Q,X) = f(Q,X) · w(Q,X)
Computing the Importance Sampling Estimator
• We need to compute w(Q,X) = |P| · deg(Q) / deg(X)
• Computing |P| is easy – we know P
• How to compute deg(Q) = |neighbors(Q)|? Since Q was submitted to the search engine, we know deg(Q)
• How to compute deg(X) = |neighbors(X)|?
  • Fetch the content of X from the web
  • pdeg(X) = number of distinct queries from P that X contains
  • Use pdeg(X) as an estimate for deg(X) (see the sketch below)
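Putting the pieces together for f ≡ 1, a single-sample estimate is a one-liner; here pdeg(X) stands in for deg(X), which is exactly where the trouble discussed next comes from. The numbers are illustrative only:

```python
def is_estimate(pool_size, deg_q, pdeg_x):
    """Importance-sampling estimate of |D| from one sampled edge (Q, X):
    f(Q,X) * pi(Q,X) / p(Q,X) = 1 * (1/deg(X)) / (1/(|P| * deg(Q)))."""
    return pool_size * deg_q / pdeg_x

# Assumed numbers: |P| = 1,000,000 phrases, deg(Q) = 20 results,
# and 150 distinct pool phrases found in X's fetched content.
print(is_estimate(1_000_000, 20, 150))   # one sample's estimate of |D|
```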
The Degree Mismatch Problem (DMP)
• In reality, pdeg(X) may differ from deg(X)
• Neighbor recall problem: there may be q ∈ neighbors(x) that do not occur in x
  • q occurs as "anchor text" in a page linking to x
  • q occurs in x, but our parser failed to find it
• Neighbor precision problem: there may be q that occur in x, but q ∉ neighbors(x)
  • q "overflows": more than k documents match q, and x is not among the top k results
  • q occurs in x, but the search engine's parser failed to find it
Implications of DMP
• We can only approximate document degrees
• The bias of the importance sampling estimator may become significant
• In one of our experiments, the relative bias was 75%
Eliminating the Neighbor Recall Problem
• The predicted search engine graph:
  • pneighbors(x) = queries that occur in x
  • pneighbors(q) = documents in whose text q occurs
• An edge (q,x) is "valid" if it occurs both in the search engine graph and in the predicted search engine graph
• The valid search engine graph:
  • vneighbors(x) = neighbors(x) ∩ pneighbors(x)
  • vneighbors(q) = neighbors(q) ∩ pneighbors(q)
Eliminating the Neighbor Recall Problem (cont.)
• We use the valid search engine graph rather than the real search engine graph (see the sketch below):
  • vdeg(q) = |vneighbors(q)|
  • vdeg(x) = |vneighbors(x)|
  • P+ = queries q in P with vdeg(q) > 0
  • D+ = documents x in D with vdeg(x) > 0
• Assuming D+ = D, we get E(IS(Q,X)) = |D|
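A sketch of the valid query degree; `search(q)` and `contains(q, x)` (does q occur in x's fetched text?) are assumed helpers:

```python
def vdeg_query(q, search, contains):
    """|vneighbors(q)|: the results of q whose fetched content contains q."""
    return sum(contains(q, x) for x in search(q))
```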
Approximate Importance Sampling (AIS)
• We need to compute w(Q,X) = |P+| · vdeg(Q) / vdeg(X)
  • vdeg(Q) – easy
  • vdeg(X) – hard
  • |P+| – hard
• We therefore approximate |P+| by |P| and vdeg(X) via pdeg(X):
  AIS(Q,X) = f(Q,X) · |P| · vdeg(Q) · IVD(X) / pdeg(X)
• IVD(X) = unbiased probabilistic estimator for pdeg(X)/vdeg(X)
Estimating pdeg(x)/vdeg(x)
• Given: a document x
• Want: estimate pdeg(x) / vdeg(x)
• Geometric estimation (see the sketch below):
  n = 1
  forever do:
    choose a random phrase Q that occurs in content(x)
    send Q to the search engine
    if x ∈ neighbors(Q), return n
    n ← n + 1
• Probability to hit a "valid" query: vdeg(x) / pdeg(x)
• So, the expected number of iterations is pdeg(x) / vdeg(x)
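The geometric procedure above, sketched in Python; `phrases_in(x)` (a list of the pool phrases occurring in x's content) and `search(q)` (the engine's result list) are assumed helpers:

```python
import random

def estimate_ivd(x, phrases_in, search):
    """Return n ~ Geometric(vdeg(x)/pdeg(x)); E[n] = pdeg(x)/vdeg(x)."""
    n = 1
    while True:
        q = random.choice(phrases_in(x))   # uniform phrase occurring in x
        if x in search(q):                 # (q, x) is a valid edge
            return n
        n += 1
```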
Approximate Importance Sampling: Bias Analysis
• Lemma: the multiplicative bias of AIS(Q,X) is |P| / |P+|
Approximate Importance Sampling: Bias Elimination
• How to eliminate the bias in AIS?
  • Estimate the bias |P| / |P+|
  • Divide AIS by this estimate
• Well, this doesn't quite work: the expected ratio ≠ the ratio of expectations
• So, use a standard trick for estimating ratio statistics: average several AIS samples and several bias-estimator samples separately, and output the ratio of the two averages
• BE = estimator of |P| / |P+|
Bias Analysis
• Theorem: the multiplicative bias of the corrected estimator approaches 1 as the number of samples grows; the ratio trick leaves only a lower-order bias term
Estimating |P|/|P+|
• Also by geometric estimation (see the sketch below):
  n = 1
  forever do:
    choose a random query Q from P
    send Q to the search engine
    if vdeg(Q) > 0, return n
    n ← n + 1
• Probability to hit a "valid" query: |P+| / |P|
• So, the expected number of iterations is |P| / |P+|
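The same geometric pattern, sketched for |P|/|P+|; `vdeg(q)` would be computed as in the valid-graph sketch earlier:

```python
import random

def estimate_be(pool, vdeg):
    """Return n ~ Geometric(|P+|/|P|); E[n] = |P|/|P+|."""
    n = 1
    while True:
        q = random.choice(pool)   # uniform query from P
        if vdeg(q) > 0:           # q has at least one valid edge
            return n
        n += 1
```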
Recap
• Sample valid edges (Q1,X1), …, (Qn,Xn) from p
• Compute vdeg(Qi) for each query Qi
• Compute pdeg(Xi) for each document Xi
• Estimate IVD(Xi) = pdeg(Xi)/vdeg(Xi) for each Xi
• Compute AISi = |P| · vdeg(Qi) · IVD(Xi) / pdeg(Xi)
• Estimate the expected bias BEi = |P|/|P+|
• Output SizeEstimator = (Σi AISi) / (Σi BEi) (see the sketch below)
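A sketch of the assembled size estimator, reusing the helpers from the earlier sketches; all helper names are assumptions of this sketch, not the authors' code:

```python
def size_estimator(pool, search, phrases_in, vdeg, n=1000):
    """Corpus-size estimate from n valid-edge samples (f = 1)."""
    ais, be = [], []
    for _ in range(n):
        q, x = sample_edge(pool, search)
        while q not in phrases_in(x):      # keep only valid edges
            q, x = sample_edge(pool, search)
        ivd = estimate_ivd(x, phrases_in, search)
        ais.append(len(pool) * vdeg(q) * ivd / len(phrases_in(x)))
        be.append(estimate_be(pool, vdeg))
    # mean(AIS) estimates (|P|/|P+|) * |D| and mean(BE) estimates |P|/|P+|,
    # so the ratio of the sums estimates |D|.
    return sum(ais) / sum(be)
```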
Rao-Blackwellization
• Question: we currently use only one (random) result per query submitted to the search engine. Can we also use the rest?
• Rao & Blackwell: sure! Use them as additional samples. It can only help!
• The Rao-Blackwellized AIS estimator averages AIS over all results of the query:
  AISRB(Q) = (1/deg(Q)) · Σ_{X ∈ neighbors(Q)} AIS(Q,X)
• Recall: AIS(Q,X) = f(Q,X) · |P| · vdeg(Q) · IVD(X) / pdeg(X)
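One way to realize the Rao-Blackwellized per-query estimate in the same sketch framework, treating invalid edges as contributing zero (reusing `estimate_ivd`; `results` is the query's top-k result list):

```python
def ais_rb(q, results, pool, phrases_in, search):
    """Average AIS(q, x) over all results x of q; vdeg(q) is the number
    of results whose content actually contains q (the valid edges)."""
    valid = [x for x in results if q in phrases_in(x)]
    vdeg_q = len(valid)
    total = sum(len(pool) * vdeg_q * estimate_ivd(x, phrases_in, search)
                / len(phrases_in(x)) for x in valid)
    return total / len(results) if results else 0.0
```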
RB-AIS: Analysis
• The Rao-Blackwell Theorem:
  • AISRB has exactly the same bias as AIS
  • The variance of AISRB can only be lower
• Variance is reduced if the query results are sufficiently "variable"
• Now, use AISRB instead of AIS in SizeEstimator