Web search engines Paolo Ferragina Dipartimento di Informatica Università di Pisa
The Web:
• Size: more than tens of billions of pages
• Language and encodings: hundreds…
• Distributed authorship: SPAM, format-less,…
• Dynamic: in one year 35% of pages survive, 20% remain untouched
The User:
• Query composition: short (2.5 terms on average) and imprecise
• Query results: 85% of users look at just one result page
• Several needs: informational, navigational, transactional
Two main difficulties:
• Extracting "significant data" is difficult!
• Matching "user needs" is difficult!
Evolution of Search Engines
• First generation (1995-1997: AltaVista, Excite, Lycos, etc.): use only on-page, web-text data
  • Word frequency and language
• Second generation (1998: Google): use off-page, web-graph data
  • Link (or connectivity) analysis
  • Anchor text (how people refer to a page)
• Third generation (Google, Yahoo, MSN, ASK, …, ~2009): answer "the need behind the query"
  • Focus on the "user need", rather than on the query
  • Integrate multiple data sources
  • Click-through data
• Fourth generation (2009-12): Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8
Is it good?
• How fast does it index?
  • Number of documents/hour
  • (Average document size)
• How fast does it search?
  • Latency as a function of index size
• Expressiveness of the query language
Measures for a search engine
• All of the preceding criteria are measurable
• The key measure: user happiness… useless answers won't make a user happy
Happiness: elusive to measure
• The most common proxy is the relevance of search results
• How do we measure it? It requires 3 elements:
  • A benchmark document collection
  • A benchmark suite of queries
  • A binary assessment, Relevant or Irrelevant, for each query-doc pair
Evaluating an IR system
• Standard benchmarks
  • TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
  • Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant
• On the Web everything is more complicated, since we cannot mark the entire corpus!
General scenario
[Figure: Venn diagram of the document collection, with overlapping Retrieved and Relevant subsets]
Precision vs. Recall
• Precision: % of retrieved docs that are relevant [the "how much junk was found" issue]
• Recall: % of relevant docs that are retrieved [the "how much info was found" issue]
How to compute them
• Precision: fraction of retrieved docs that are relevant
• Recall: fraction of relevant docs that are retrieved
• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)
  (tp = true positives, fp = false positives, fn = false negatives)
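As a minimal sketch (not from the slides), precision and recall can be computed directly from the retrieved and relevant document sets:

def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document ids."""
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant (P = 0.75),
# and 3 of the 6 relevant docs were found (R = 0.5).
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))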
Some considerations
• One can get high recall (but low precision) by retrieving all docs for all queries!
• Recall is a non-decreasing function of the number of docs retrieved
• Precision usually decreases as more docs are retrieved
Precision-Recall curve
• We measure precision at various levels of recall
• Note: it is an AVERAGE over many queries
[Figure: precision-recall curve, precision on the y-axis and recall on the x-axis; in the common picture, precision decreases as recall increases]
F measure
• Combined measure (weighted harmonic mean of precision and recall):
  1/F = α (1/P) + (1 − α) (1/R)
• People usually use the balanced F1 measure, i.e., α = ½, thus 1/F = ½ (1/P + 1/R), that is F1 = 2PR/(P + R)
• Use this if you need to optimize a single measure that balances precision and recall
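Continuing the sketch above, the weighted harmonic mean follows directly from P and R:

def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean: 1/F = alpha/P + (1 - alpha)/R.
    With alpha = 0.5 this is the balanced F1 = 2PR / (P + R)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# P = 0.75, R = 0.5  ->  F1 = 2 * 0.75 * 0.5 / 1.25 = 0.6
print(f_measure(0.75, 0.5))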
The web-graph: properties Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.1 and 19.2
The Web's Characteristics
• Size
  • 1 trillion pages reported available (Google, 7/2008)
  • 50 billion static pages
  • 5-40 KB per page => terabytes & terabytes
  • Size grows every day!
• Change
  • 8% new pages and 25% new links appear every week
  • Average life time of a page is about 10 days
Some definitions
• Weakly connected component (WCC)
  • A set of nodes such that any node can reach any other node via an undirected path
• Strongly connected component (SCC)
  • A set of nodes such that any node can reach any other node via a directed path
Find the CORE
• Iterate the following process:
  • Pick a random vertex v
  • Compute all nodes reachable from v: O(v)
  • Compute all nodes that reach v: I(v)
  • Compute SCC(v) := I(v) ∩ O(v)
  • Check whether it is the largest SCC found so far
• If the CORE contains about ¼ of the vertices, then after 20 iterations the probability of not finding it is below 1% (given that the graph is available)
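A minimal in-memory sketch of this probing loop (assuming the graph fits in RAM as adjacency dicts, not the external-memory setting the slides are concerned with; radj is the transposed graph):

import random
from collections import deque

def reachable(adj, start):
    """BFS over the given adjacency lists; returns the visited set."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def probe_core(adj, radj, nodes, iterations=20):
    """Estimate the largest SCC: SCC(v) = O(v) ∩ I(v) for random v."""
    best = set()
    for _ in range(iterations):
        v = random.choice(nodes)
        scc = reachable(adj, v) & reachable(radj, v)  # O(v) ∩ I(v)
        if len(scc) > len(best):
            best = scc
    return best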
Compute SCCs
• Classical algorithm (Kosaraju):
  1. DFS(G)
  2. Compute the transpose graph Gᵀ
  3. DFS(Gᵀ), visiting vertices in decreasing order of the time their visit ended at step 1
  4. Every resulting DFS tree is an SCC
• DFS is hard to compute on disk: no locality
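A compact in-memory version of this classical algorithm (a sketch; recursion-free DFS is used so it does not blow the call stack on larger graphs):

def kosaraju_sccs(adj, nodes):
    """Strongly connected components via two DFS passes (Kosaraju)."""
    # Pass 1: iterative DFS on G, recording vertices by finish time.
    visited, order = set(), []
    for s in nodes:
        if s in visited:
            continue
        stack = [(s, iter(adj.get(s, ())))]
        visited.add(s)
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    break
            else:
                order.append(u)  # u's visit has just finished
                stack.pop()
    # Build the transpose graph G^T.
    radj = {}
    for u in nodes:
        for v in adj.get(u, ()):
            radj.setdefault(v, []).append(u)
    # Pass 2: DFS on G^T in decreasing finish time; each tree is one SCC.
    assigned, sccs = set(), []
    for s in reversed(order):
        if s in assigned:
            continue
        component, stack = [], [s]
        assigned.add(s)
        while stack:
            u = stack.pop()
            component.append(u)
            for v in radj.get(u, ()):
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        sccs.append(component)
    return sccs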
DFS Classical Approach

main() {
  foreach vertex v do color[v] ← WHITE endFor
  foreach vertex v do
    if (color[v] == WHITE) DFS(v)
  endFor
}

DFS(u: vertex)
  color[u] ← GRAY
  d[u] ← time; time ← time + 1
  foreach v in succ[u] do
    if (color[v] == WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time; time ← time + 1
Semi-External DFS
• Bit array of nodes (visited or not)
• Array of successors
• Stack of the DFS recursion
Key observation: if the bit array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.
What about millions/billions of nodes?
Key observation: a forest is a DFS forest if and only if there are no FORWARD CROSS edges among the non-tree edges.
Algorithm? Construct incrementally a tentative DFS forest which minimizes the overall number of such edges, in passes…
A Semi-External DFS
• Bit array of nodes (visited or not)
• Array of successors (and the stack of the DFS recursion)
Key assumption: 2n edges, together with the auxiliary data structures, fit in internal memory.
Heuristic: rearrange the nodes in the adjacency lists so that the ones with large subtrees go to the front.
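To make the in-memory ingredients concrete, here is a minimal sketch of a DFS driven by exactly these structures, a visited bit array plus an explicit stack; the successors callback is a placeholder for the successor array, which in the semi-external setting is streamed from disk:

def dfs_preorder(n, successors, root=0):
    """DFS with a bit array of visited nodes and an explicit stack
    (no recursion). `successors(u)` stands in for the on-disk
    successor array; scanning succ(u) costs |succ(u)|/B I/Os."""
    visited = bytearray(n)   # the in-memory "bit" array (1 byte/node here)
    stack = [root]           # explicit DFS stack
    visited[root] = 1
    order = []
    while stack:
        u = stack.pop()
        order.append(u)
        for v in reversed(successors(u)):
            if not visited[v]:
                visited[v] = 1
                stack.append(v)
    return order

# Tiny usage example on an in-memory adjacency list:
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(dfs_preorder(4, lambda u: adj[u]))   # [0, 1, 3, 2]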
Observing the Web graph
• We do not know which percentage of it we actually see
• The only way to discover the graph structure of the Web is via large-scale crawls
• Warning: the picture might be distorted by
  • size limitations of the crawl
  • crawling rules
  • perturbations of the "natural" process of birth and death of nodes and links
Why is it interesting?
• The largest artifact ever conceived by humankind
• Exploit the structure of the Web for
  • crawl strategies
  • search
  • spam detection
  • discovering communities on the Web
  • classification/organization
• Predict the evolution of the Web
• Sociological understanding
Many other large graphs…
• Physical network graph
  • V = routers
  • E = communication links
• The "cosine" graph (undirected, weighted)
  • V = static web pages
  • E = semantic distance between pages
• Query-log graph (bipartite, weighted)
  • V = queries and URLs
  • E = (q, u) if u is a result for q and has been clicked by some user who issued q
• Social graph (undirected, unweighted)
  • V = users
  • E = (x, y) if x knows y (Facebook, address book, email, …)
The size of the web Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.5
What is the size of the web?
• Issues
  • The web is really infinite: dynamic content, e.g., a calendar
  • The static web contains syntactic duplication, mostly due to mirroring (~30%)
  • Some servers are seldom connected
• Who cares?
  • Media, and consequently the users
  • Engine design
The relative sizes of search engines
What can we attempt to measure? The coverage of a search engine relative to another, under a particular crawling process. Two caveats:
• Document extension: engines index pages not yet crawled, by indexing their anchor text
• Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)
Relative size from overlap
Given two engines A and B:
• Sample URLs randomly from A; check if contained in B, and vice versa
• Suppose A ∩ B = (1/2) · Size A and A ∩ B = (1/6) · Size B
• Then (1/2) · Size A = (1/6) · Size B, therefore Size A / Size B = (1/6)/(1/2) = 1/3
Each test involves: (i) sampling a URL, (ii) checking that URL.
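A back-of-the-envelope sketch of this estimator: sample URLs from each engine, measure the two containment fractions, and take their ratio. The sampling and containment callbacks are placeholders for real engine queries:

import random

def relative_size(sample_from_a, a_in_b, sample_from_b, b_in_a, n=1000):
    """Estimate |A| / |B| from overlap fractions.
    If x = |A∩B|/|A| and y = |A∩B|/|B|, then |A|/|B| = y / x."""
    x = sum(a_in_b(sample_from_a()) for _ in range(n)) / n  # Pr[URL from A is in B]
    y = sum(b_in_a(sample_from_b()) for _ in range(n)) / n  # Pr[URL from B is in A]
    return y / x

# Toy example with known sets: |A| = 600, |B| = 1800, |A∩B| = 300,
# so x = 1/2, y = 1/6 and |A|/|B| should come out near 1/3.
A, B = set(range(600)), set(range(300, 2100))
est = relative_size(lambda: random.choice(tuple(A)), lambda u: u in B,
                    lambda: random.choice(tuple(B)), lambda u: u in A)
print(round(est, 2))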
Sampling URLs
• Ideal strategy: generate a random URL and check for containment in each index
• Problem: random URLs are hard to find!
• Approach 1: generate a random URL surely contained in a given search engine
• Approach 2: random walks or random IP addresses
#1: Random URL in SE via random queries
• Generate a random query:
  • Lexicon: 400,000+ words from a web crawl
  • Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
• Get the top 100 result URLs from engine A
• Choose a random URL as the candidate to check for presence in search engine B (next slide)
• This induces a probability weight W(p) for each page p
• Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
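A sketch of the sampling step; search(engine, query, k) is a hypothetical hook for issuing a query (not a real API), and lexicon is assumed to be a list of crawled words:

import random

def random_url_from_engine(engine, lexicon, search, k=100):
    """Sample a URL from `engine` via a random conjunctive query.
    `search(engine, query, k)` is a hypothetical hook returning
    up to k result URLs."""
    w1, w2 = random.sample(lexicon, 2)   # two random lexicon words
    results = search(engine, f"{w1} AND {w2}", k)
    return random.choice(results) if results else None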
URL checking
• Download the document D at the sampled URL
• Extract its list of words
• Use 8 low-frequency words of D as an AND query to engine B
• Check whether D is present in the result set
• Problems:
  • near duplicates
  • engine time-outs
  • is an 8-word query good enough?
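A sketch of the check, with the same hypothetical search hook; freq is an assumed word-frequency table from the crawl's lexicon statistics:

def contained_in_engine(doc_url, doc_words, freq, engine, search, k=100):
    """Check whether the page at doc_url is indexed by `engine`:
    build a strong query from its 8 rarest words and look for the
    URL among the results. `search` and `freq` are assumed hooks."""
    rare = sorted(set(doc_words), key=lambda w: freq.get(w, 0))[:8]
    results = search(engine, " AND ".join(rare), k)
    return doc_url in results   # caveat: misses near-duplicate URLs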
Advantages & disadvantages • Statistically sound under the induced weight. • Biases induced by random query • Query bias: Favors content-rich pages in the language(s) of the lexicon • Ranking bias [Solution: Use conjunctive queries & fetch all] • Query restriction bias:engine might not deal properly with 8 words conjunctive query • Malicious bias: Sabotage by engine • Operational Problems: Time-outs, failures, engine inconsistencies, index modification.
#2: Random IP addresses
• Generate a random IP address and find whether a web server responds at it
• If there is one:
  • collect all pages from that server
  • from this set, choose a page at random
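A highly simplified sketch of this procedure; probing random hosts on the real Internet raises rate-limiting and etiquette issues, so the fetch_site_pages crawler hook is left hypothetical:

import random
import socket

def random_ip():
    """Draw a random 32-bit IPv4 address (no reserved-range filtering here)."""
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_web_server(ip, timeout=2.0):
    """Check whether something answers on port 80 at the given IP."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

def sample_page(fetch_site_pages):
    """Keep probing random IPs until a web server is found, then pick
    one of its pages uniformly. `fetch_site_pages(ip)` is a hypothetical
    crawler hook returning the URLs reachable from the root page."""
    while True:
        ip = random_ip()
        if has_web_server(ip):
            pages = fetch_site_pages(ip)
            if pages:
                return random.choice(pages)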
Advantages & disadvantages • Advantages • Clean statistics • Independent of crawling strategies • Disadvantages • Many hosts might share one IP, or not accept requests • No guarantee all pages are linked to root page, and thus can be collected. • Power law for # pages/hosts generates bias towards sites with few pages.
Conclusions
• No sampling solution is perfect
• Lots of new ideas… but the problem is getting harder
• Quantitative studies are fascinating and a good research problem
The web-graph: storage Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.4
Definition
Directed graph G = (V, E):
• V = URLs; E = (u, v) if u has a hyperlink to v
• Isolated URLs are ignored (no in-links and no out-links)
Three key properties:
• Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1
The in-degree distribution
[Figure: in-degree distributions from the AltaVista crawl (1999) and the WebBase crawl (2001)]
In-degree follows a power-law distribution. The same holds for the out-degree, the sizes of connected components, …
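As an aside (a sketch, not from the slides): given a sample of in-degrees, the power-law exponent can be estimated with the standard maximum-likelihood (Hill) estimator, α ≈ 1 + n / Σ ln(x_i / x_min):

import math

def power_law_alpha(degrees, x_min=1):
    """Maximum-likelihood (Hill) estimate of the power-law exponent:
    alpha = 1 + n / sum(ln(x / x_min)) over all samples x >= x_min."""
    xs = [x for x in degrees if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)

# Usage: feed it the observed in-degrees; for the web graph one
# expects a value around 2.1.
print(power_law_alpha([1, 1, 2, 1, 3, 5, 1, 2, 8, 1, 1, 2]))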