Web search engines Paolo Ferragina Dipartimento di Informatica Università di Pisa
The Web:
• Size: more than tens of billions of pages
• Language and encodings: hundreds…
• Distributed authorship: SPAM, format-less, …
• Dynamic: in one year 35% survive, 20% untouched
The User:
• Query composition: short (2.5 terms avg) and imprecise
• Query results: 85% users look at just one result-page
• Several needs: Informational, Navigational, Transactional
Two main difficulties:
• Extracting “significant data” is difficult!!
• Matching “user needs” is difficult!!
Evolution of Search Engines
• First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data: word frequency and language
• Second generation (1998: Google) -- use off-page, web-graph data: link (or connectivity) analysis, anchor-text (how people refer to a page)
• Third generation (Google, Yahoo, MSN, ASK, …) -- answer “the need behind the query”: focus on “user need” rather than on the query, integrate multiple data-sources, click-through data
• Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
The web-graph: properties Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.1 and 19.2
The Web’s Characteristics • Size • 1 trillion pages reported available (Google, 7/08) • 50 billion static pages • 5-40K per page => terabytes & terabytes • Size grows every day!! • Change • 8% new pages, 25% new links change weekly • Lifetime of about 10 days
Some definitions • Weakly connected components (WCC) • Set of nodes such that from any node one can reach any other node via an undirected path. • Strongly connected components (SCC) • Set of nodes such that from any node one can reach any other node via a directed path.
Find the CORE • Iterate the following process: • Pick a random vertex v • Compute all nodes reachable from v: O(v) • Compute all nodes that reach v: I(v) • Compute SCC(v) := I(v) ∩ O(v) • Check whether it is the largest SCC found so far. If the CORE contains about ¼ of the vertices, then after 20 iterations the probability of not having hit the CORE is below 1% (since (3/4)^20 ≈ 0.3%).
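A minimal sketch of one iteration of this randomized search, assuming the graph fits in memory as forward (succ) and backward (pred) adjacency lists; the class and method names are made up for illustration:

import java.util.*;

// Hypothetical in-memory representation: succ.get(v) / pred.get(v) are v's out- and in-neighbours.
class CoreFinder {

    // All nodes reachable from v in the graph described by adj (iterative DFS).
    static Set<Integer> reach(int v, List<List<Integer>> adj) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        seen.add(v);
        stack.push(v);
        while (!stack.isEmpty()) {
            int u = stack.pop();
            for (int w : adj.get(u))
                if (seen.add(w)) stack.push(w);
        }
        return seen;
    }

    // One iteration of the process above: SCC(v) = O(v) ∩ I(v).
    // Repeating ~20 times and keeping the largest result finds the CORE with probability > 99%
    // when the CORE holds about 1/4 of the vertices.
    static Set<Integer> sccOf(int v, List<List<Integer>> succ, List<List<Integer>> pred) {
        Set<Integer> scc = reach(v, succ);   // O(v): nodes reached from v
        scc.retainAll(reach(v, pred));       // I(v): nodes that reach v
        return scc;
    }
}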
Compute SCCs • Classical algorithm (Kosaraju): • DFS(G) • Transpose G into GT • DFS(GT), visiting vertices in decreasing order of the finishing time computed at step 1. • Every tree of this second DFS forest is an SCC. DFS is hard to compute on disk: no locality.
DFS Classical Approach

main() {
  foreach vertex v do color[v] = WHITE endFor
  foreach vertex v do
    if (color[v] == WHITE) DFS(v)
  endFor
}

DFS(u: vertex)
  color[u] = GRAY
  d[u] = time; time = time + 1
  foreach v in succ[u] do
    if (color[v] == WHITE) then
      p[v] = u
      DFS(v)
  endFor
  color[u] = BLACK
  f[u] = time; time = time + 1
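For completeness, a compact in-memory Java version of the two-pass (Kosaraju-style) algorithm of the previous slide, written with explicit stacks instead of recursion; the adjacency-list representation is an assumption, not part of the slides:

import java.util.*;

class Kosaraju {
    // succ = adjacency lists of G, pred = adjacency lists of the transposed graph GT.
    static List<List<Integer>> scc(List<List<Integer>> succ, List<List<Integer>> pred) {
        int n = succ.size();
        boolean[] done = new boolean[n];
        Deque<Integer> order = new ArrayDeque<>();        // vertices by decreasing finish time

        for (int s = 0; s < n; s++) {                     // first pass: DFS(G), record finish order
            if (done[s]) continue;
            Deque<int[]> stack = new ArrayDeque<>();      // frame = {vertex, next successor index}
            stack.push(new int[]{s, 0}); done[s] = true;
            while (!stack.isEmpty()) {
                int[] fr = stack.peek();
                if (fr[1] < succ.get(fr[0]).size()) {
                    int w = succ.get(fr[0]).get(fr[1]++);
                    if (!done[w]) { done[w] = true; stack.push(new int[]{w, 0}); }
                } else { order.push(fr[0]); stack.pop(); }
            }
        }

        boolean[] seen = new boolean[n];
        List<List<Integer>> comps = new ArrayList<>();
        while (!order.isEmpty()) {                        // second pass: DFS(GT) in that order
            int s = order.pop();
            if (seen[s]) continue;
            List<Integer> comp = new ArrayList<>();
            Deque<Integer> st = new ArrayDeque<>();
            st.push(s); seen[s] = true;
            while (!st.isEmpty()) {
                int u = st.pop(); comp.add(u);
                for (int w : pred.get(u)) if (!seen[w]) { seen[w] = true; st.push(w); }
            }
            comps.add(comp);                              // each tree of the second pass is one SCC
        }
        return comps;
    }
}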
Semi-External DFS (1) • Bit array of nodes (visited or not) • Array of successors (stack of the DFS-recursion) Key observation: if the bit array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.
What about millions/billions of nodes? The previous assumptions no longer hold. Key observation: a forest is a DFS forest if and only if there are no forward cross edges among the non-tree edges. Algorithm? Construct incrementally a tentative DFS forest which minimizes the number of such edges, in passes…
Another Semi-External DFS (3) • Bit array of nodes (visited or not) • Array of successors (stack of the DFS-recursion) Key assumption: 2n edges, plus the auxiliary data structures, fit in internal memory.
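A minimal sketch of the semi-external setting, assuming only the visited bits and the stack of successor iterators stay in RAM while adjacency lists are streamed from disk; the OnDiskAdjacency interface is hypothetical:

import java.util.*;

// Hypothetical reader that streams the successor list of a node from disk.
interface OnDiskAdjacency {
    int numNodes();
    Iterator<Integer> successors(int u);   // sequential read: ~|succ(u)|/B disk accesses
}

class SemiExternalDFS {
    // Only the visited bits (|V| bits) and the stack of iterators are kept in RAM.
    static void dfs(OnDiskAdjacency g, int root, BitSet visited) {
        Deque<Iterator<Integer>> stack = new ArrayDeque<>();
        visited.set(root);
        stack.push(g.successors(root));
        while (!stack.isEmpty()) {
            Iterator<Integer> it = stack.peek();
            if (!it.hasNext()) { stack.pop(); continue; }
            int v = it.next();
            if (!visited.get(v)) {
                visited.set(v);
                stack.push(g.successors(v));   // each list is read once => about |V| + |E|/B I/Os overall
            }
        }
    }
}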
Observing Web Graph • We do not know which percentage of it we know • The only way to discover the graph structure of the web as hypertext is via large scale crawls • Warning: the picture might be distorted by • Size limitation of the crawl • Crawling rules • Perturbations of the "natural" process of birth and death of nodes and links
Why is it interesting? • The largest artifact ever conceived by humans • Exploit the structure of the Web for • Crawl strategies • Search • Spam detection • Discovering communities on the web • Classification/organization • Predicting the evolution of the Web • Sociological understanding
Many other large graphs… • Physical network graph • V = routers • E = communication links • The “cosine” graph (undirected, weighted) • V = static web pages • E = semantic distance between pages • Query-Log graph (bipartite, weighted) • V = queries and URLs • E = (q,u) if u is a result for q that has been clicked by some user who issued q • Social graph (undirected, unweighted) • V = users • E = (x,y) if x knows y (facebook, address book, email, …)
The size of the web Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.5
What is the size of the web? • Issues • The web is really infinite • Dynamic content, e.g., calendar • Static web contains syntactic duplication, mostly due to mirroring (~30%) • Some servers are seldom connected • Who cares? • Media, and consequently the user • Engine design
The relative sizes of search engines What can we attempt to measure? The coverage of a search engine relative to another (i.e., their relative sizes), for a particular crawling process. Caveats: • Document extension: e.g. engines index pages not yet crawled, by indexing their anchor-text. • Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)
Relative Size from Overlap (Sec. 19.5) Given two engines A and B: sample URLs randomly from A, check if they are contained in B, and vice versa. Suppose A ∩ B = (1/2) · Size(A) and A ∩ B = (1/6) · Size(B). Then (1/2) · Size(A) = (1/6) · Size(B), hence Size(A) / Size(B) = (1/6)/(1/2) = 1/3. Each test involves: (i) sampling, (ii) checking.
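The same back-of-the-envelope computation as a tiny helper (the 1/2 and 1/6 are the overlap fractions of the example above):

class RelativeSize {
    // fracAinB = fraction of URLs sampled from A that are found in B (estimates |A∩B|/|A|);
    // fracBinA = fraction of URLs sampled from B that are found in A (estimates |A∩B|/|B|).
    static double sizeRatioAoverB(double fracAinB, double fracBinA) {
        return fracBinA / fracAinB;          // |A|/|B| = (|A∩B|/|B|) / (|A∩B|/|A|)
    }

    public static void main(String[] args) {
        System.out.println(sizeRatioAoverB(0.5, 1.0 / 6.0));   // prints ≈ 0.333 = 1/3
    }
}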
Sampling URLs • Ideal strategy: Generate a random URL and check for containment in each index. • Problem: Random URLs are hard to find! • Approach 1: Generate a random URL contained in a given engine • Suffices for the estimation of relative size • Approach 2: Random walks or IP addresses • In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)
Random URLs from random queries • Generate a random query: how? • Lexicon: 400,000+ words from a web crawl • Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi • Get 100 result URLs from engine A • Choose a random URL as the candidate to check for presence in engine B (next slide) • This distribution induces a probability weight W(p) for each page. • Conjecture: W(SE_A) / W(SE_B) ~ |SE_A| / |SE_B|
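A sketch of this sampling step; the SearchEngine interface below is purely hypothetical (real engines would be queried through their own web or API front-ends):

import java.util.*;

// Hypothetical engine client; topUrls and contains stand for whatever access method is available.
interface SearchEngine {
    List<String> topUrls(String query, int k);
    boolean contains(String url);
}

class RandomQuerySampler {
    static final Random RNG = new Random();

    // Pick two random lexicon words, issue the conjunctive query to engine A,
    // and return one of the (at most) 100 result URLs, chosen uniformly at random.
    static String sampleUrl(SearchEngine a, List<String> lexicon) {
        String q = lexicon.get(RNG.nextInt(lexicon.size())) + " AND "
                 + lexicon.get(RNG.nextInt(lexicon.size()));
        List<String> results = a.topUrls(q, 100);
        return results.isEmpty() ? null : results.get(RNG.nextInt(results.size()));
    }
}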
Query-based checking • Strong Query to check whether an engine B has a document D: • Download D. Get list of words. • Use 8 low frequency words as AND query to B • Check if D is present in result set. • Problems: • Near duplicates • Redirects • Engine time-outs • Is 8-word query good enough?
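A sketch of the strong-query check, reusing the hypothetical SearchEngine interface from the previous sketch; the document-frequency map used to find low-frequency words is also an assumption:

import java.util.*;

class StrongQueryCheck {
    // Sort D's words by (precomputed) document frequency, AND together the 8 rarest ones,
    // and ask engine B whether D's URL shows up in the result set.
    // Near-duplicates, redirects and time-outs (the problems listed above) are not handled here.
    static boolean isIndexedBy(SearchEngine b, String docUrl, List<String> docWords,
                               Map<String, Integer> docFrequency) {
        List<String> words = new ArrayList<>(docWords);
        words.sort(Comparator.comparingInt((String w) -> docFrequency.getOrDefault(w, 0)));
        String query = String.join(" AND ", words.subList(0, Math.min(8, words.size())));
        return b.topUrls(query, 100).contains(docUrl);
    }
}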
Advantages & disadvantages • Statistically sound under the induced weight. • Biases induced by random queries: • Query bias: favors content-rich pages in the language(s) of the lexicon • Ranking bias (solution: use conjunctive queries & fetch all results) • Checking bias: duplicates • Query restriction bias: the engine might not deal properly with 8-word conjunctive queries • Malicious bias: sabotage by the engine • Operational problems: time-outs, failures, engine inconsistencies, index modification.
Random IP addresses • Generate random IP addresses • Find a web server at the given address • If there’s one • Collect all pages from server • From this, choose a page at random
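A sketch of one probe, using only standard java.net calls; the port (80) and the timeout are simplifications (a real study would also handle virtual hosts, robots.txt, and then crawl the server):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Random;

class RandomIpProbe {
    static final Random RNG = new Random();

    // Draw a random IPv4 address and test whether something answers on port 80.
    static String probeOnce(int timeoutMs) {
        String ip = RNG.nextInt(256) + "." + RNG.nextInt(256) + "."
                  + RNG.nextInt(256) + "." + RNG.nextInt(256);
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(ip, 80), timeoutMs);
            return ip;                      // something is listening on port 80 at this address
        } catch (IOException e) {
            return null;                    // no reachable web server here
        }
    }
}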
Advantages & disadvantages • Advantages • Clean statistics • Independent of crawling strategies • Disadvantages • Many hosts might share one IP, or not accept requests • No guarantee all pages are linked to root page, and thus can be collected. • Power law for # pages/hosts generates bias towards sites with few pages.
Random walks • View the Web as a directed graph • Build a random walk on this graph • Includes various “jump” rules back to visited sites • Does not get stuck in spider traps! • Can follow all links! • Converges to a stationary distribution • Must assume graph is finite and independent of the walk. • Conditions are not satisfied (many traps…) • Time to convergence not really known • Sample from stationary distribution of walk
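A sketch of such a walk with a simple jump rule back to already-visited nodes; the jump probability and the in-memory graph are assumptions for illustration:

import java.util.*;

class WebRandomWalk {
    static final Random RNG = new Random();

    // Walk `steps` steps: with probability `jumpProb` restart from a random already-visited
    // node (so the walk never gets stuck in spider traps); otherwise follow a random out-link.
    static List<Integer> walk(List<List<Integer>> succ, int start, int steps, double jumpProb) {
        List<Integer> visited = new ArrayList<>();
        int u = start;
        visited.add(u);
        for (int i = 0; i < steps; i++) {
            List<Integer> out = succ.get(u);
            if (RNG.nextDouble() < jumpProb || out.isEmpty())
                u = visited.get(RNG.nextInt(visited.size()));   // jump rule
            else
                u = out.get(RNG.nextInt(out.size()));           // follow a random link
            visited.add(u);
        }
        return visited;      // in theory, late samples approach the walk's stationary distribution
    }
}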
Advantages & disadvantages • Advantages • “Statistically clean” method at least in theory! • Disadvantages • List of seeds is a problem. • Practical approximation might not be valid. • Non-uniform distribution • Subject to link spamming
Conclusions • No sampling solution is perfect. • Lots of new ideas ... • ....but the problem is getting harder • Quantitative studies are fascinating and a good research problem
The web-graph: storage Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.4
Definition Directed graph G = (V,E) • V = URLs, E = (u,v) if u has a hyperlink to v • Isolated URLs are ignored (no IN & no OUT) Three key properties: • Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
The in-degree distribution (Altavista crawl, 1999; WebBase crawl, 2001): in-degree follows a power-law distribution. This also holds for out-degree, the size of components, …
Definition Directed graph G = (V,E) • V = URLs, E = (u,v) if u has a hyperlink to v • Isolated URLs are ignored (no IN, no OUT) Three key properties: • Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1 • Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%) • Similarity: pages that are close in lexicographic order tend to share many outgoing links (similar successor lists)
A Picture of the Web Graph: 21 million pages, 150 million links
URL-sorting (figure: URLs from the Berkeley and Stanford hosts)
Front-compression (front-coding) of URLs + delta encoding of IDs.
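A minimal front-coding sketch over a lexicographically sorted URL list, assuming the integer IDs are simply positions in this sorted order (delta encoding of the IDs is not shown):

import java.util.*;

class FrontCoding {
    // Encode a lexicographically sorted list of URLs as (shared-prefix length, remaining suffix).
    static List<Map.Entry<Integer, String>> encode(List<String> sortedUrls) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int lcp = 0;
            while (lcp < prev.length() && lcp < url.length() && prev.charAt(lcp) == url.charAt(lcp))
                lcp++;                                   // longest common prefix with the previous URL
            out.add(Map.entry(lcp, url.substring(lcp)));
            prev = url;
        }
        return out;
    }
}

For instance, "http://a.com/x" followed by "http://a.com/y" is stored as (0, "http://a.com/x") and (13, "y").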
The WebGraph library Uncompressed adjacency list vs. adjacency list with compressed gaps (exploits locality). The successor list S(x) = {s_1, s_2, ..., s_k} is stored as the gaps {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_(k-1) - 1}. Only the first gap can be negative, and it is remapped to a non-negative integer before coding.
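A sketch of this gap transformation; the zig-zag mapping used for the (possibly negative) first gap and the toy successor list in main are assumptions for illustration:

import java.util.*;

class GapCoding {
    // Map a possibly-negative value to a non-negative integer (zig-zag style).
    static int toNat(int x) { return x >= 0 ? 2 * x : 2 * (-x) - 1; }

    // S(x) = {s1, s2, ..., sk} (sorted)  ->  {toNat(s1 - x), s2 - s1 - 1, ..., sk - s(k-1) - 1}
    static int[] gaps(int x, int[] successors) {
        int[] g = new int[successors.length];
        for (int i = 0; i < successors.length; i++)
            g[i] = (i == 0) ? toNat(successors[0] - x)
                            : successors[i] - successors[i - 1] - 1;
        return g;
    }

    public static void main(String[] args) {
        // Toy example: node 15 with successors {13, 15, 16, 17, 18}  ->  [3, 1, 0, 0, 0]
        System.out.println(Arrays.toString(gaps(15, new int[]{13, 15, 16, 17, 18})));
    }
}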
Copy-lists (exploit similarity; reference chains are possibly limited in length) Uncompressed adjacency list vs. adjacency list with copy lists. Node y picks a reference node x, with the reference index chosen in [0,W] so as to give the best compression; each bit of y’s copy list tells whether the corresponding successor of the reference x is also a successor of y.
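A sketch of how a copy list and the remaining extra nodes could be derived for a node y against a reference x, following the description above (choosing the best reference within the window is not shown):

import java.util.*;

class CopyList {
    // For each successor of the reference x, emit bit 1 iff it is also a successor of y.
    static boolean[] bits(int[] succX, Set<Integer> succY) {
        boolean[] b = new boolean[succX.length];
        for (int i = 0; i < succX.length; i++) b[i] = succY.contains(succX[i]);
        return b;
    }

    // Successors of y not covered by x remain as "extra nodes", to be coded separately.
    static List<Integer> extraNodes(int[] succX, Set<Integer> succY) {
        Set<Integer> covered = new HashSet<>();
        for (int s : succX) covered.add(s);
        List<Integer> extra = new ArrayList<>();
        for (int s : succY) if (!covered.contains(s)) extra.add(s);
        Collections.sort(extra);
        return extra;
    }
}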
Copy-blocks = RLE(Copy-list) From the adjacency list with copy lists to the adjacency list with copy blocks (RLE on the bit sequences): • The first copy block is 0 if the copy list starts with 0 • The last block is omitted (its length is known) • The length is decremented by one for all blocks but the first.
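A sketch of the run-length step under the conventions just listed (runs alternate 1s/0s starting with 1s, so the first block may be 0; the last run is dropped; later blocks are stored decremented by one):

import java.util.*;

class CopyBlocks {
    static List<Integer> encode(boolean[] copyList) {
        List<Integer> runs = new ArrayList<>();
        boolean expected = true;                 // runs are counted for 1-bits first
        int i = 0;
        while (i < copyList.length) {
            int len = 0;
            while (i < copyList.length && copyList[i] == expected) { len++; i++; }
            runs.add(len);
            expected = !expected;
        }
        if (!runs.isEmpty()) runs.remove(runs.size() - 1);              // last block omitted
        for (int j = 1; j < runs.size(); j++) runs.set(j, runs.get(j) - 1);
        return runs;                             // e.g. [0,1,1,0] -> runs [0,1,2,1] -> blocks [0,0,1]
    }
}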
Extra nodes: compressing intervals The WebGraph library is a Java (and C++) library achieving ≈3 bits/edge. Starting from the adjacency list with copy blocks, consecutive runs among the extra nodes are compressed further: • Intervals: coded by their left extreme and length • Interval length: decremented by Lmin = 2 (the minimum interval length) • Residuals: coded as differences between consecutive residuals, or from the source. Worked figures from the slide: 0 = (15-15)*2 (positive), 2 = (23-19)-2 (jump >= 2), 600 = (316-16)*2, 12 = (22-16)*2 (positive), 3018 = 3041-22-1 (difference).
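A sketch that splits the extra nodes into intervals and residuals under the rules above; the real WebGraph format additionally codes interval extremes relative to previous values and applies universal codes on top, which is omitted here:

import java.util.*;

class IntervalCoding {
    static final int LMIN = 2;                                 // minimum interval length

    // Zig-zag mapping for values that may be negative (e.g. a residual smaller than the source).
    static int toNat(int x) { return x >= 0 ? 2 * x : 2 * (-x) - 1; }

    // Split the sorted extra nodes of source x into intervals (runs of consecutive IDs of length
    // >= LMIN, stored as left extreme + (length - LMIN)) and residual gaps (first one relative
    // to x through toNat, the others as differences minus one).
    static void encode(int x, int[] extra, List<int[]> intervals, List<Integer> residualGaps) {
        List<Integer> residuals = new ArrayList<>();
        int i = 0;
        while (i < extra.length) {
            int j = i;
            while (j + 1 < extra.length && extra[j + 1] == extra[j] + 1) j++;
            if (j - i + 1 >= LMIN) intervals.add(new int[]{extra[i], j - i + 1 - LMIN});
            else for (int k = i; k <= j; k++) residuals.add(extra[k]);
            i = j + 1;
        }
        for (int k = 0; k < residuals.size(); k++)
            residualGaps.add(k == 0 ? toNat(residuals.get(0) - x)
                                    : residuals.get(k) - residuals.get(k - 1) - 1);
    }
}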