
Web search engines



Presentation Transcript


  1. Web search engines Paolo Ferragina Dipartimento di Informatica Università di Pisa

  2. The Web:
  • Size: more than tens of billions of pages
  • Language and encodings: hundreds…
  • Distributed authorship: SPAM, format-less, …
  • Dynamic: in one year 35% survive, 20% untouched
  The User:
  • Query composition: short (2.5 terms avg) and imprecise
  • Query results: 85% of users look at just one result page
  • Several needs: informational, navigational, transactional
  Two main difficulties:
  • Extracting “significant data” is difficult!!
  • Matching “user needs” is difficult!!

  3. Evolution of Search Engines
  • First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data
    • Word frequency and language
  • Second generation (1998: Google) -- use off-page, web-graph data
    • Link (or connectivity) analysis
    • Anchor text (how people refer to a page)
  • Third generation (Google, Yahoo, MSN, ASK, …) -- answer “the need behind the query”
    • Focus on the “user need”, rather than on the query
    • Integrate multiple data sources
    • Click-through data
  • Fourth generation → Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

  4. The web-graph: properties Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.1 and 19.2

  5. The Web’s Characteristics
  • Size
    • 1 trillion pages are known (Google, 7/08)
    • 50 billion static pages
    • 5-40 KB per page => terabytes & terabytes
    • The size grows every day!!
  • Change
    • 8% new pages, 25% new links change weekly
    • Life time of about 10 days

  6. The Bow Tie

  7. Some definitions
  • Weakly connected components (WCC)
    • Set of nodes such that from any node one can reach any other node via an undirected path.
  • Strongly connected components (SCC)
    • Set of nodes such that from any node one can reach any other node via a directed path.

  8. Find the CORE
  • Iterate the following process:
    • Pick a random vertex v
    • Compute all nodes reached from v: O(v)
    • Compute all nodes that reach v: I(v)
    • Compute SCC(v) := I(v) ∩ O(v)
    • Check whether it is the largest SCC found so far
  If the CORE contains about ¼ of the vertices, a random vertex misses it with probability ¾, so after 20 iterations the probability of never finding the CORE is (¾)^20 < 1%. A sketch of this sampling loop follows.
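A minimal in-memory sketch of the sampling loop above, assuming the graph and its transpose fit in RAM as plain dictionaries of successor/predecessor lists (the names and the representation are illustrative, not taken from the slide):

import random
from collections import deque

def reachable(adj, v):
    # BFS from v over the given adjacency: forward lists give O(v),
    # predecessor lists give I(v).
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def find_core(succ, pred, iterations=20):
    nodes, best = list(succ), set()
    for _ in range(iterations):
        v = random.choice(nodes)
        scc_v = reachable(succ, v) & reachable(pred, v)   # SCC(v) = O(v) ∩ I(v)
        if len(scc_v) > len(best):
            best = scc_v                                  # keep the largest SCC seen so far
    return best

More iterations only increase the chance that some sampled v falls inside the CORE, exactly as the (¾)^20 bound above suggests.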

  9. Compute SCCs
  • Classical algorithm:
    • DFS(G)
    • Transpose G into GT
    • DFS(GT), visiting the vertices in decreasing order of the finishing time computed at step 1.
    • Every resulting DFS tree is an SCC.
  DFS is hard to compute on disk: no locality.

  10. DFS: Classical Approach
  main() {
    foreach vertex v do color[v] = WHITE endFor
    foreach vertex v do
      if (color[v] == WHITE) then DFS(v)
    endFor
  }
  DFS(u: vertex)
    color[u] = GRAY
    time = time + 1; d[u] = time
    foreach v in succ[u] do
      if (color[v] == WHITE) then
        p[v] = u
        DFS(v)
    endFor
    color[u] = BLACK
    time = time + 1; f[u] = time
  A Python sketch of the whole two-pass SCC algorithm follows.
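As a concrete companion to slides 9-10, here is a minimal, purely in-memory Python sketch of the classical two-pass (Kosaraju-style) SCC algorithm; the dictionary-of-successor-lists representation and the recursive DFS are illustrative simplifications that would not scale to a web-sized graph, which is exactly why the semi-external variants on the next slides exist.

def kosaraju_scc(graph):
    # Pass 1: DFS on G, recording vertices by increasing finishing time.
    visited, order = set(), []
    def dfs1(u):
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                dfs1(v)
        order.append(u)                    # u finishes here
    for u in graph:
        if u not in visited:
            dfs1(u)

    # Build the transposed graph G^T.
    gt = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            gt.setdefault(v, []).append(u)

    # Pass 2: DFS on G^T in decreasing finishing time; every tree is an SCC.
    assigned, sccs = set(), []
    def dfs2(u, comp):
        assigned.add(u)
        comp.append(u)
        for v in gt.get(u, []):
            if v not in assigned:
                dfs2(v, comp)
    for u in reversed(order):
        if u not in assigned:
            comp = []
            dfs2(u, comp)
            sccs.append(comp)
    return sccs

# Example: two SCCs, {a, b, c} and {d}
print(kosaraju_scc({'a': ['b'], 'b': ['c'], 'c': ['a', 'd'], 'd': []}))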

  11. Semi-External DFS (1)
  • Bit array of nodes (visited or not)
  • Array of successors (the stack of the DFS recursion)
  Key observation: if the bit array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses. A sketch of the idea follows.
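A rough sketch of that idea, assuming a hypothetical read_successors(u) accessor that fetches one node's adjacency list from disk; only the visited array and the explicit DFS stack are kept in memory, and the per-node byte array stands in for a real bit array.

def semi_external_dfs(num_nodes, read_successors, root=0):
    visited = bytearray(num_nodes)                  # in-memory "bit" array (one byte per node here)
    visited[root] = 1
    stack = [(root, iter(read_successors(root)))]   # explicit stack replacing the DFS recursion
    preorder = [root]
    while stack:
        u, succ_it = stack[-1]
        for v in succ_it:
            if not visited[v]:
                visited[v] = 1
                preorder.append(v)
                stack.append((v, iter(read_successors(v))))
                break
        else:
            stack.pop()                             # all successors of u have been examined
    return preorder

# Toy usage: a dict stands in for the on-disk successor file. Each successor
# list is read exactly once, which is where the |V| + |E|/B bound comes from
# when lists are read block-wise.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(semi_external_dfs(4, lambda u: adj[u]))       # e.g. [0, 1, 3, 2]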

  12. What about millions/billions of nodes? The previous approach no longer works.
  Key observation: a spanning forest is a DFS forest if and only if there are no FORWARD CROSS edges among the non-tree edges.
  Algorithm: construct incrementally a tentative DFS forest which minimizes the number of such edges, in passes…

  13. Another Semi-External DFS (3)
  • Bit array of nodes (visited or not)
  • Array of successors (the stack of the DFS recursion)
  Key assumption: we assume that 2n edges, and the auxiliary data structures, fit in memory.

  14. Observing the Web graph
  • We do not know which percentage of it we know
  • The only way to discover the graph structure of the web as hypertext is via large-scale crawls
  • Warning: the picture might be distorted by
    • Size limitations of the crawl
    • Crawling rules
    • Perturbations of the "natural" process of birth and death of nodes and links

  15. Why is it interesting?
  • The largest artifact ever conceived by humans
  • Exploit the structure of the Web for
    • Crawl strategies
    • Search
    • Spam detection
    • Discovering communities on the web
    • Classification/organization
  • Predict the evolution of the Web
    • Sociological understanding

  16. Many other large graphs…
  • Physical network graph
    • V = routers
    • E = communication links
  • The “cosine” graph (undirected, weighted)
    • V = static web pages
    • E = semantic distance between pages
  • Query-log graph (bipartite, weighted)
    • V = queries and URLs
    • E = (q,u) if u is a result for q, and has been clicked by some user who issued q
  • Social graph (undirected, unweighted)
    • V = users
    • E = (x,y) if x knows y (Facebook, address book, email, …)

  17. The size of the web Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.5

  18. What is the size of the web?
  • Issues
    • The web is really infinite
      • Dynamic content, e.g., calendars
    • The static web contains syntactic duplication, mostly due to mirroring (~30%)
    • Some servers are seldom connected
  • Who cares?
    • Media, and consequently the users
    • Engine design

  19. The relative sizes of search engines
  • Document extension: e.g., engines index pages not yet crawled, by indexing their anchor text.
  • Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)
  What can we attempt to measure? The coverage of a search engine relative to another, for a particular crawling process.

  20. Relative Size from Overlap (Sec. 19.5)
  Given two engines A and B:
  • Sample URLs randomly from A, check if they are contained in B, and vice versa.
  • Suppose A ∩ B = (1/2) * Size(A) and A ∩ B = (1/6) * Size(B).
  • Then (1/2) * Size(A) = (1/6) * Size(B), hence Size(A) / Size(B) = (1/6) / (1/2) = 1/3.
  Each test involves: (i) sampling, (ii) checking (a numeric sketch follows).
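A tiny numeric sketch of that estimate: if a fraction f_A of A's sampled URLs is found in B, and a fraction f_B of B's sampled URLs is found in A, both quantities estimate |A ∩ B|, so Size(A)/Size(B) ≈ f_B / f_A.

def relative_size_A_over_B(frac_of_A_in_B, frac_of_B_in_A):
    # frac_of_A_in_B estimates |A ∩ B| / |A|, frac_of_B_in_A estimates |A ∩ B| / |B|,
    # so their ratio estimates |A| / |B|.
    return frac_of_B_in_A / frac_of_A_in_B

# The slide's numbers: half of A's samples are in B, one sixth of B's samples are in A.
print(relative_size_A_over_B(1/2, 1/6))   # 0.333..., i.e. Size(A) = (1/3) Size(B)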

  21. Sampling URLs
  • Ideal strategy: generate a random URL and check for containment in each index.
  • Problem: random URLs are hard to find!
  • Approach 1: generate a random URL contained in a given engine
    • Suffices for the estimation of relative size
  • Approach 2: random walks or random IP addresses
    • In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)

  22. Random URLs from random queries
  • Generate a random query: how?
    • Lexicon: 400,000+ words from a web crawl
    • Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
  • Get 100 result URLs from engine A
  • Choose a random URL as the candidate to check for presence in engine B (next slide)
  • This distribution induces a probability weight W(p) for each page p.
  • Conjecture: W(SE_A) / W(SE_B) ~ |SE_A| / |SE_B|

  23. Query-based checking
  • Strong query to check whether an engine B has a document D:
    • Download D and get its list of words.
    • Use 8 low-frequency words as an AND query to B.
    • Check whether D is present in the result set (a sketch follows).
  • Problems:
    • Near duplicates
    • Redirects
    • Engine time-outs
    • Is an 8-word query good enough?
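A minimal sketch of that check, assuming a word_frequency lexicon (word → frequency in some crawl) and a hypothetical search(engine, query) call standing in for whatever engine API is actually available; both names are illustrative, not the slide's.

import re

def strong_query(document_text, word_frequency, k=8):
    # Tokenize D and keep the k rarest words according to the lexicon.
    words = set(re.findall(r"[a-z]+", document_text.lower()))
    known = [w for w in words if w in word_frequency]
    rarest = sorted(known, key=lambda w: word_frequency[w])[:k]
    return " AND ".join(rarest)

def is_indexed_by(engine, document_url, document_text, word_frequency, search):
    query = strong_query(document_text, word_frequency)
    results = search(engine, query)      # hypothetical engine call
    return document_url in results       # ignores the near-duplicate / redirect problems above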

  24. Advantages & disadvantages
  • Statistically sound under the induced weight.
  • Biases induced by the random queries:
    • Query bias: favors content-rich pages in the language(s) of the lexicon
    • Ranking bias: solution: use conjunctive queries & fetch all results
    • Checking bias: duplicates
    • Query restriction bias: the engine might not deal properly with 8-word conjunctive queries
    • Malicious bias: sabotage by the engine
    • Operational problems: time-outs, failures, engine inconsistencies, index modification.

  25. Random IP addresses
  • Generate random IP addresses
  • Find a web server at the given address, if there is one (a sketch of this probe follows)
  • Collect all pages from that server
  • From these, choose a page at random
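A minimal sketch of the first two steps only, under the assumption that "find a web server" simply means something answers on port 80; a real study would also skip reserved/private ranges, respect robots.txt, and then crawl the whole host.

import random
import socket

def random_ipv4():
    # Draw a uniformly random IPv4 address (no filtering of special ranges here).
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_web_server(ip, timeout=1.0):
    # Probe port 80; any connection failure or timeout counts as "no server".
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False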

  26. Advantages & disadvantages
  • Advantages
    • Clean statistics
    • Independent of crawling strategies
  • Disadvantages
    • Many hosts might share one IP, or not accept requests
    • No guarantee that all pages are linked to the root page, and thus can be collected.
    • The power law for # pages/host generates a bias towards sites with few pages.

  27. Random walks
  • View the Web as a directed graph
  • Build a random walk on this graph
    • Includes various “jump” rules back to visited sites
      • Does not get stuck in spider traps!
      • Can follow all links!
    • Converges to a stationary distribution
      • Must assume the graph is finite and independent of the walk.
      • Conditions are not satisfied (many traps…)
      • Time to convergence not really known
    • Sample from the stationary distribution of the walk (a sketch follows)
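A minimal sketch of such a walker, with out_links(p) as a hypothetical accessor for a page's outgoing links and a fixed jump probability; the convergence caveats above apply unchanged, so the "stationary" sample is only as good as the burn-in allows.

import random

def random_walk_sample(seeds, out_links, steps=100_000, jump_prob=0.15, burn_in=10_000):
    visited = set(seeds)                           # pages we may jump back to
    current = random.choice(list(seeds))
    after_burn_in = []
    for t in range(steps):
        succ = out_links(current)
        if not succ or random.random() < jump_prob:
            current = random.choice(list(visited))  # jump rule: escape traps and dead ends
        else:
            current = random.choice(succ)
            visited.add(current)
        if t >= burn_in:
            after_burn_in.append(current)
    # If the walk has (approximately) mixed, the post-burn-in states approximate
    # samples from its stationary distribution; return one of them.
    return random.choice(after_burn_in)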

  28. Advantages & disadvantages
  • Advantages
    • “Statistically clean” method, at least in theory!
  • Disadvantages
    • The list of seeds is a problem.
    • The practical approximation might not be valid.
    • Non-uniform distribution
    • Subject to link spamming

  29. Conclusions
  • No sampling solution is perfect.
  • Lots of new ideas…
  • …but the problem is getting harder
  • Quantitative studies are fascinating and a good research problem

  30. The web-graph: storage Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.4

  31. Definition
  Directed graph G = (V,E):
  • V = URLs, E = (u,v) if u has a hyperlink to v
  • Isolated URLs are ignored (no IN & no OUT links)
  Three key properties:
  • Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1

  32. The in-degree distribution (Altavista crawl 1999, WebBase crawl 2001)
  • The in-degree follows a power-law distribution.
  • This is true also for: out-degree, the size of the components, …

  33. Definition
  Directed graph G = (V,E):
  • V = URLs, E = (u,v) if u has a hyperlink to v
  • Isolated URLs are ignored (no IN, no OUT links)
  Three key properties:
  • Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1
  • Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  • Similarity: pages that are close in lexicographic order tend to share many outgoing lists

  34. A Picture of the Web Graph (figure: 21 million pages, 150 million links)

  35. URL-sorting (figure: example with sorted Berkeley and Stanford URLs)

  36. Front-coding: front-compression of the sorted URLs + delta encoding of the node IDs (a sketch follows).
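A minimal sketch of front-coding over a lexicographically sorted URL list: each URL is stored as the length of the prefix it shares with the previous URL plus the remaining suffix (the URLs below are illustrative).

def front_code(sorted_urls):
    coded, prev = [], ""
    for url in sorted_urls:
        # Length of the common prefix with the previous URL.
        lcp = 0
        while lcp < min(len(prev), len(url)) and prev[lcp] == url[lcp]:
            lcp += 1
        coded.append((lcp, url[lcp:]))
        prev = url
    return coded

def front_decode(coded):
    urls, prev = [], ""
    for lcp, suffix in coded:
        url = prev[:lcp] + suffix
        urls.append(url)
        prev = url
    return urls

urls = ["http://www.stanford.edu/", "http://www.stanford.edu/admissions",
        "http://www.stanford.edu/admissions/apply"]
assert front_decode(front_code(urls)) == urls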

  37. The WebGraph library
  • Uncompressed adjacency list vs. adjacency list with compressed gaps (exploiting locality).
  • The successor list S(x) = {s1, s2, …, sk} of node x is stored as the gaps s1 − x, s2 − s1 − 1, …, sk − s(k−1) − 1.
  • The first gap may be negative, so it is remapped to a non-negative integer before coding.
  A sketch of the gap transformation follows.
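A minimal sketch of that gap transformation; the fold used for the possibly-negative first gap (2g for g ≥ 0, 2|g| − 1 otherwise) is one common convention, stated here as an assumption rather than read off the slide.

def encode_gaps(x, successors):
    gaps, prev = [], None
    for i, s in enumerate(sorted(successors)):
        if i == 0:
            g = s - x                                   # only this gap can be negative
            gaps.append(2 * g if g >= 0 else 2 * (-g) - 1)
        else:
            gaps.append(s - prev - 1)                   # consecutive gaps, minus one
        prev = s
    return gaps

def decode_gaps(x, gaps):
    successors = []
    for i, g in enumerate(gaps):
        if i == 0:
            first = g // 2 if g % 2 == 0 else -((g + 1) // 2)
            successors.append(x + first)
        else:
            successors.append(successors[-1] + g + 1)
    return successors

assert decode_gaps(15, encode_gaps(15, [13, 15, 16, 17, 20])) == [13, 15, 16, 17, 20]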

  38. Copy-lists (exploiting similarity)
  • Uncompressed adjacency list vs. adjacency list with copy-lists.
  • Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y.
  • The reference index is chosen in [0, W], picking the one that gives the best compression.
  • Reference chains are possibly limited in length.

  39. Copy-blocks = RLE(copy-list)
  From the adjacency list with copy-lists to the adjacency list with copy-blocks (RLE on the bit sequences):
  • The first copy-block is 0 if the copy-list starts with 0;
  • The last block is omitted (we know the total length…);
  • The length is decremented by one for all blocks.
  A small sketch of the run-length step follows.
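A minimal sketch of the run-length step only, following the first two conventions above (the first block counts the leading 1s, so it is 0 when the copy-list starts with 0, and the last block is omitted because the total length equals the reference's out-degree); the "minus one" length optimization is deliberately left out to keep this toy encoder/decoder unambiguous.

def copy_blocks(copy_list):
    # Run-length encode alternating runs, the first run being the run of 1s.
    blocks, current_bit, run = [], 1, 0
    for bit in copy_list:
        if bit == current_bit:
            run += 1
        else:
            blocks.append(run)
            current_bit, run = bit, 1
    blocks.append(run)
    return blocks[:-1]                                 # the last block is omitted

def copy_list_from_blocks(blocks, total_len):
    bits, bit = [], 1
    for run in blocks:
        bits.extend([bit] * run)
        bit = 1 - bit
    bits.extend([bit] * (total_len - len(bits)))       # reconstruct the omitted block
    return bits

cl = [0, 1, 1, 1, 0, 0, 1]
assert copy_list_from_blocks(copy_blocks(cl), len(cl)) == cl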

  40. Extra-nodes: compressing intervals
  (WebGraph is a Java and C++ library, achieving ≈3 bits/edge.)
  Exploit consecutivity among the extra-nodes:
  • Intervals: represented by their left extreme and their length; the length is decremented by Lmin = 2.
  • Residuals: encoded as differences between consecutive residuals, or with respect to the source.
  Worked values from the slide's example:
  • 0 = (15 − 15) * 2 (positive)
  • 2 = (23 − 19) − 2 (jump ≥ 2)
  • 600 = (316 − 16) * 2
  • 12 = (22 − 16) * 2 (positive)
  • 3018 = 3041 − 22 − 1 (difference)
