Random Sampling from a Search Engine‘s Index

Random Sampling from a Search Engine‘s Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

Search Engine Samplers Search Engine Web Public Interface D Index Top k results Queries Indexed Documents Random document x  D Sampler

Motivation • Useful tool for search engine evaluation: • Freshness • Fraction of up-to-date pages in the index • Topical bias • Identification of overrepresented/underrepresented topics • Spam • Fraction of spam pages in the index • Security • Fraction of pages in index infected by viruses/worms/trojans • Relative Size • Number of documents indexed compared with other search engines

Size Wars August 2005 : We index 20 billion documents. September 2005 : We index 8 billion documents, but our indexis 3 times larger than our competition’s. So, who’s right?

Why Does Size Matter, Anyway? • Comprehensiveness • A good crawler covers the most documents possible • Narrow-topic queries • E.g., get homepage of John Doe • Prestige • A marketing advantage

Measuring size using random samples[BharatBroder98, CheneyPerry05, GulliSignorni05] • Sample pages uniformly at random from the search engine’s index • Two alternatives • Absolute size estimation • Sample until collision • Collision expected after k ~ N½ random samples (birthday paradox) • Return k2 • Relative size estimation • Check how many samples from search engine A are present in search engine B and vice versa

Other Approaches • Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00] • Queries from user query logs [LawrenceGiles98, DobraFeinberg04] • Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

The Bharat-Broder Sampler: Preprocessing Step Lexicon Large corpus L t1, freq(t1,C) C t2, freq(t2,C) … …

The Bharat-Broder Sampler Search Engine Top k results t1 AND t2 Two random terms t1, t2 L Random document from top k results BB Sampler • Only if: • all queries return the same number of results ≤ k • all documents are of the same length • Then, samples are uniform.

The Bharat-Broder Sampler:Drawbacks • Documents have varying lengths • Bias towards long documents • Some queries have more than k matches • Bias towards documents with high static rank

Our Contributions • A pool-based sampler • Guaranteed to produce near-uniform samples • A random walk sampler • After sufficiently many steps, guaranteed to produce near-uniform samples • Does not need an explicit lexicon/pool at all! Focus of this talk

Search Engines as Hypergraphs “news” “google” • results(q)= { documents returned on query q } • queries(x)= { queries that return x as a result } • P = query pool = a set of queries • Query pool hypergraph: • Vertices: Indexed documents • Hyperedges: { result(q) | q  P } www.cnn.com news.google.com www.google.com news.bbc.co.uk www.foxnews.com www.mapquest.com maps.google.com en.wikipedia.org/wiki/BBC www.bbc.co.uk maps.yahoot.com “maps” “bbc”

Query Cardinalities and Document Degrees “news” “google” • Query cardinality: card(q) = |results(q)| • card(“news”) = 4, card(“bbc”) = 3 • Document degree: deg(x) = |queries(x)| • deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2 • Cardinality and degree are easily computable www.cnn.com news.google.com www.google.com news.bbc.co.uk www.foxnews.com www.mapquest.com maps.google.com en.wikipedia.org/wiki/BBC www.bbc.co.uk maps.yahoot.com “maps” “bbc”

Sampling documents uniformly • Sampling documents from D uniformly Hard • Sampling documents from D non-uniformly: Easier • Will show later: can sample documents proportionally to their degrees:

Sampling documents by degree “news” “google” • p(news.bbc.co.uk) = 2/13 • p(www.cnn.com) = 1/13 www.cnn.com news.google.com www.google.com news.bbc.co.uk www.foxnews.com www.mapquest.com maps.google.com en.wikipedia.org/wiki/BBC www.bbc.co.uk maps.yahoot.com “maps” “bbc”

Monte Carlo Simulation • We need: Samples from the uniform distribution • We have: Samples from the degree distribution • Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution? • Yes! Monte Carlo Simulation Methods Rejection Sampling Importance Sampling Metropolis-Hastings Maximum-Degree

Monte Carlo Simulation • : Target distribution • In our case:  = uniform on D • p: Trial distribution • In our case: p = degree distribution • Bias weight of p(x) relative to (x): • In our case: -Sampler Sample from  Samples from p (x1,w(x)), (x2,w(x)),… Monte Carlo Simulator p-Sampler x

Bias Weights • Unnormalized forms of  and p: • : (unknown) normalization constants • Examples: •  = uniform: • p = degree distribution: • Bias weight:

Rejection Sampling [von Neumann] • C: envelope constant • C ≥ w(x) for all x • The algorithm: • accept := false • while (not accept) • generate a sample x from p • toss a coin whose heads probability is • if coin comes up heads, accept := true • return x • In our case: C = 1 and acceptance prob = 1/deg(x)

Pool-Based Sampler • Degree distribution: p(x) = deg(x) / x’deg(x’) Search Engine results(q1), results(q2),… q1,q2,… Pool-Based Sampler (x1,1/deg(x1)),(x2,1/deg(x2)),… Degree distribution sampler Rejection Sampling x Documents sampled from degree distribution with corresponding weights Uniform sample

Sampling documents by degree “google” “news” www.cnn.com news.google.com • Select a random query q • Select a random x  results(q) • Documents with high degree are more likely to be sampled • If we sample q uniformly  “oversample” documents that belong to narrow queries • We need to sample q proportionally to its cardinality www.google.com news.bbc.co.uk www.foxnews.com www.mapquest.com maps.google.com en.wikipedia.org/wiki/BBC www.bbc.co.uk maps.yahoot.com “maps” “bbc”

Sampling documents by degree (2) “google” “news” www.cnn.com news.google.com • Select a query q proportionally to its cardinality • Select a random x  results(q) • Analysis: www.google.com news.bbc.co.uk www.foxnews.com www.mapquest.com maps.google.com en.wikipedia.org/wiki/BBC www.bbc.co.uk maps.yahoot.com “maps” “bbc”

Degree Distribution Sampler Search Engine Query sampled from cardinality distribution Document sampled from degree distribution results(q) q Degree Distribution Sampler Cardinality Distribution Sampler Sample x uniformly from results(q) x

Sampling queries by cardinality • Sampling queries from pool uniformly: Easy • Sampling queries from pool by cardinality: Hard • Requires knowing cardinalities of all queries in the search engine • Use Monte Carlo methods to simulate biased sampling via uniform sampling: • Target distribution: the cardinality distribution • Trial distribution: uniform distribution on the query pool

Sampling queries by cardinality • Bias weight of cardinality distribution relative to the uniform distribution: • Can be computed using a single search engine query • Use rejection sampling: • Envelope constant for rejection sampling: • Queries are sampled uniformly from the pool • Each query q is accepted with probability

Complete Pool-Based Sampler Search Engine Uniform Query Sampler (q,card(q)),… Rejection Sampling Uniform query sample Query sampled from cardinality distribution (q,results(q)),… (x,1/deg(x)),… Degree Distribution Sampler Rejection Sampling x Uniform document sample Documents sampled from degree distribution with corresponding weights

Dealing with Overflowing Queries • Problem: Some queries may overflow (card(q) > k) • Bias towards highly ranked documents • Solutions: • Select a pool P in which overflowing queries are rare (e.g., phrase queries) • Skip overflowing queries • Adapt rejection sampling to deal with approximate weights Theorem:Samples of PB sampler are at most -away from uniform. ( = overflow probability of P)

Creating the query pool Query Pool Large corpus P q1 C q2 … … • Example: P = all 3-word phrases that occur in C • If “to be or not to be” occurs in C, P contains: • “to be or”, “be or not”, “or not to”, “not to be” • Choose P that “covers” most documents in D

A random walk sampler • Define a graph G over the indexed documents • (x,y)  E iff queries(x) ∩ queries(y) ≠  • Run a random walk on G • Limit distribution = degree distribution • Use MCMC methods to make limit distribution uniform. • Metropolis-Hastings • Maximum-Degree • Does not need a preprocessing step • Less efficient than the pool-based sampler

Bias towards Long Documents

Relative Sizes of Google, MSN and Yahoo! Google = 1 Yahoo! = 1.28 MSN Search = 0.73

Top-Level Domains in Google, MSN and Yahoo!

Conclusions • Two new search engine samplers • Pool-based sampler • Random walk sampler • Samplers are guaranteed to produce near-uniform samples, under plausible assumptions. • Samplers show no or little bias in experiments.

Thank You

Random Sampling from a Search Engine‘s Index