Crawling. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading: Sections 20.1, 20.2 and 20.3
Spidering • 24h, 7 days: "walking" over a Graph • What about the Graph? • BowTie structure • Directed graph G = (N, E) • N changes (insert, delete): >> 50 * 10^9 nodes • E changes (insert, delete): > 10 links per node • 10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix
Crawling Issues • How to crawl? • Quality: "best" pages first • Efficiency: avoid duplication (or near-duplication) • Etiquette: robots.txt, server-load concerns (minimize load) • How much to crawl? How much to index? • Coverage: How big is the Web? How much do we cover? • Relative coverage: How much do competitors have? • How often to crawl? • Freshness: How much has changed? • How to parallelize the process?
Page selection • Given a page P, define how “good” P is. • Several metrics: • BFS, DFS, Random • Popularity driven (PageRank, full vs partial) • Topic driven or focused crawling • Combined
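A minimal Python sketch (not from the slides) of how such a "goodness" metric could drive the crawl frontier: a priority queue ordered by a pluggable score function. The Frontier class and the scoring policies below are hypothetical placeholders for the metrics listed above (BFS, popularity-driven, topic-driven).

import heapq

# A crawl frontier ordered by a pluggable "goodness" score (higher = crawled earlier).
class Frontier:
    def __init__(self, score_fn):
        self.score_fn = score_fn      # maps a URL to its priority
        self.heap, self.counter = [], 0

    def push(self, url):
        heapq.heappush(self.heap, (-self.score_fn(url), self.counter, url))
        self.counter += 1             # tie-breaker: FIFO among equal scores

    def pop(self):
        return heapq.heappop(self.heap)[2]

# Hypothetical policies standing in for the metrics listed above.
def bfs_score(url):
    return 0                          # constant score -> plain FIFO/BFS order

def popularity_score(url):
    # Placeholder: a real policy would plug in a (full or partial) PageRank estimate here.
    return 0.0

frontier = Frontier(bfs_score)
frontier.push("http://www.di.unipi.it/")
next_url = frontier.pop()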
Is this page a new one? • Check whether the URL has been parsed or downloaded before • After 20 million pages, we have "seen" over 200 million URLs • Each URL is at least 100 bytes on average • Overall, that is about 20 GB of URLs • Options: compress URLs in main memory, or use disk • Bloom filter (Archive) • Disk access with caching (Mercator, AltaVista) • Also, two-level indexing with front-coding compression
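A minimal Bloom-filter sketch for the "have we seen this URL?" test, along the lines of the Archive option above; the bit-array size, the number of hash functions and the double-hashing scheme are illustrative assumptions, not the slides' exact design.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=8 * 2**20, n_hashes=7):   # ~1 MB of bits (illustrative)
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from two base hashes of the URL.
        d = hashlib.sha1(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
if "http://www.di.unipi.it/" not in seen:     # no false negatives, few false positives
    seen.add("http://www.di.unipi.it/")

The trade-off: the filter fits in main memory, but a (tunable) fraction of false positives means some genuinely new URLs are mistakenly skipped.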
Crawler "cycle of life": the Link Extractor, Crawler Manager and Downloaders communicate through the Page Repository (PR), the Priority Queue (PQ) and the Assigned Repository (AR).

Link Extractor:
while (<Page Repository is not empty>) {
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract ...>
  <insert these links into the Priority Queue>
}

Crawler Manager:
while (<Priority Queue is not empty>) {
  <extract some URLs u having the highest priority>
  foreach u extracted {
    if ( (u ∉ "Already Seen Pages") ||
         (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ) {
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}

Downloaders:
while (<Assigned Repository is not empty>) {
  <extract url u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}
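A compact, single-process Python stand-in for the Crawler Manager step above, just to make the ∉/∈ check concrete; already_seen, assigned_repository and web_last_modified are hypothetical structures, not the course's implementation.

import socket
from urllib.parse import urlparse

already_seen = {}          # url -> timestamp of the version we stored (hypothetical)
assigned_repository = []   # URLs handed over to the downloaders

def web_last_modified(url):
    # Placeholder: a real crawler would use a HEAD request, a sitemap, or a change model.
    return 0

def crawler_manager(priority_queue):
    while priority_queue:
        url = priority_queue.pop(0)        # assume the list is kept sorted by priority
        is_new = url not in already_seen
        is_stale = (not is_new) and web_last_modified(url) > already_seen[url]
        if is_new or is_stale:
            ip = socket.gethostbyname(urlparse(url).hostname)   # resolve u wrt DNS
            assigned_repository.append((url, ip))               # send u to the AR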
Parallel Crawlers. The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication. • Dynamic assignment • A central coordinator dynamically assigns URLs to crawlers • Extracted links are sent back to the central coordinator (a possible bottleneck?) • Static assignment • The Web is statically partitioned and assigned to crawlers • Each crawler only crawls its own part of the Web
Two problems with static assignment. Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U s.t. hash(U) = x. Which hash would you use? • Load balancing the #URLs assigned to the downloaders: • Static schemes based on hosts may fail • www.geocities.com/…. • www.di.unipi.it/ • Dynamic "relocation" schemes may be complicated • Managing fault tolerance: • What about the death of a downloader? D → D-1, new hash!!! • What about a new downloader? D → D+1, new hash!!!
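A tiny numeric sketch (assumptions: MD5 as the hash, synthetic URLs) of the fault-tolerance problem above: with a plain hash(URL) mod D assignment, changing D from 16 to 17 reassigns almost every URL.

import hashlib

def assign(url, D):
    # Static assignment: downloader id = hash(URL) mod D
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big") % D

urls = [f"http://www.di.unipi.it/page{i}" for i in range(10000)]    # synthetic URLs
moved = sum(assign(u, 16) != assign(u, 17) for u in urls)           # D -> D+1
print(f"{moved/len(urls):.0%} of the URLs change downloader")       # roughly 16/17, i.e. ~94%

This is exactly the failure mode that consistent hashing (next slide) is designed to avoid.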
A nice technique: Consistent Hashing • Items and servers are mapped to the unit circle • Item K is assigned to the first server N such that ID(N) ≥ ID(K) • Each server gets replicated log S times • [monotone] Adding a new server moves items only from one old server to the new one • [balance] The probability that an item goes to a given server is ≤ O(1)/S • [load] Any server gets ≤ (I/S) log S items w.h.p. • [scale] You can replicate each server more times... • What if a downloader goes down? • What if a new downloader appears? • A tool for: Spidering • Web caches • P2P • Routers' load balancing • Distributed FS
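A minimal consistent-hashing sketch in Python, with the unit circle implemented over a 64-bit hash space; the use of SHA-1 and ~log S replicas per server are illustrative assumptions.

import bisect, hashlib, math

def _h(key):
    # Map a string to a point on the "circle" (a 64-bit integer hash space).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, servers):
        # Each server is replicated ~log S times on the circle.
        self.replicas = max(1, int(math.log2(max(len(servers), 2))))
        self.ring = []                      # sorted list of (point, server)
        for s in servers:
            self.add(s)

    def add(self, server):
        for r in range(self.replicas):
            bisect.insort(self.ring, (_h(f"{server}#{r}"), server))

    def remove(self, server):
        self.ring = [(p, s) for (p, s) in self.ring if s != server]

    def lookup(self, key):
        # Item K goes to the first server N (clockwise) with ID(N) >= ID(K).
        i = bisect.bisect_left(self.ring, (_h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHash(["d0", "d1", "d2", "d3"])
owner_before = ring.lookup("http://www.di.unipi.it/")
ring.add("d4")                                          # a new downloader appears...
owner_after = ring.lookup("http://www.di.unipi.it/")    # ...and only a few keys move

By the monotone property, adding or removing a downloader only remaps the keys that fall on the affected arcs of the circle, instead of rehashing (almost) everything as with mod-D assignment.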
Examples: Open Source • Nutch, also used by WikiSearch • http://nutch.apache.org/