Crawling. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading: Sections 20.1, 20.2 and 20.3
Spidering • 24h, 7 days: "walking" over a Graph • What about the Graph? • BowTie structure • Directed graph G = (N, E) • N changes (insert, delete): >> 50 * 10^9 nodes • E changes (insert, delete): > 10 links per node • 10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix
Crawling Issues • How to crawl? • Quality: "best" pages first • Efficiency: avoid duplication (or near-duplication) • Etiquette: robots.txt, server-load concerns (minimize load) • How much to crawl? How much to index? • Coverage: How big is the Web? How much do we cover? • Relative coverage: How much do competitors have? • How often to crawl? • Freshness: How much has changed? • How to parallelize the process?
Page selection • Given a page P, define how “good” P is. • Several metrics: • BFS, DFS, Random • Popularity driven (PageRank, full vs partial) • Topic driven or focused crawling • Combined
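A minimal Python sketch (not from the slides) of how such a "goodness" metric could drive the crawl frontier: a priority queue ordered by a pluggable score function. The Frontier class and the scoring policies below are hypothetical placeholders for the metrics listed above (BFS, popularity-driven, topic-driven).

import heapq

# A crawl frontier ordered by a pluggable "goodness" score (higher = crawled earlier).
class Frontier:
    def __init__(self, score_fn):
        self.score_fn = score_fn      # maps a URL to its priority
        self.heap, self.counter = [], 0

    def push(self, url):
        heapq.heappush(self.heap, (-self.score_fn(url), self.counter, url))
        self.counter += 1             # tie-breaker: FIFO among equal scores

    def pop(self):
        return heapq.heappop(self.heap)[2]

# Hypothetical policies standing in for the metrics listed above.
def bfs_score(url):
    return 0                          # constant score -> plain FIFO/BFS order

def popularity_score(url):
    # Placeholder: a real policy would plug in a (full or partial) PageRank estimate here.
    return 0.0

frontier = Frontier(bfs_score)
frontier.push("http://www.di.unipi.it/")
next_url = frontier.pop()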
Is this page a new one? • Check whether the URL has been parsed or downloaded before • After 20 million pages, we have "seen" over 200 million URLs • Each URL is at least 100 bytes on average • Overall, that is about 20 GB of URLs • Options: compress URLs in main memory, or use disk • Bloom filter (Archive) • Disk access with caching (Mercator, AltaVista) • Also, two-level indexing with front-coding compression
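A minimal Bloom-filter sketch for the "have we seen this URL?" test, along the lines of the Archive option above; the bit-array size, the number of hash functions and the double-hashing scheme are illustrative assumptions, not the slides' exact design.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=8 * 2**20, n_hashes=7):   # ~1 MB of bits (illustrative)
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from two base hashes of the URL.
        d = hashlib.sha1(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
if "http://www.di.unipi.it/" not in seen:     # no false negatives, few false positives
    seen.add("http://www.di.unipi.it/")

The trade-off: the filter fits in main memory, but a (tunable) fraction of false positives means some genuinely new URLs are mistakenly skipped.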
Crawler "cycle of life": the Link Extractor, Crawler Manager and Downloaders communicate through the Page Repository (PR), the Priority Queue (PQ) and the Assigned Repository (AR).

Link Extractor:
while (<Page Repository is not empty>) {
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract ...>
  <insert these links into the Priority Queue>
}

Crawler Manager:
while (<Priority Queue is not empty>) {
  <extract some URLs u having the highest priority>
  foreach u extracted {
    if ( (u ∉ "Already Seen Pages") ||
         (u ∈ "Already Seen Pages" && <u's version on the Web is more recent>) ) {
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}

Downloaders:
while (<Assigned Repository is not empty>) {
  <extract url u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}
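A compact, single-process Python stand-in for the Crawler Manager step above, just to make the ∉/∈ check concrete; already_seen, assigned_repository and web_last_modified are hypothetical structures, not the course's implementation.

import socket
from urllib.parse import urlparse

already_seen = {}          # url -> timestamp of the version we stored (hypothetical)
assigned_repository = []   # URLs handed over to the downloaders

def web_last_modified(url):
    # Placeholder: a real crawler would use a HEAD request, a sitemap, or a change model.
    return 0

def crawler_manager(priority_queue):
    while priority_queue:
        url = priority_queue.pop(0)        # assume the list is kept sorted by priority
        is_new = url not in already_seen
        is_stale = (not is_new) and web_last_modified(url) > already_seen[url]
        if is_new or is_stale:
            ip = socket.gethostbyname(urlparse(url).hostname)   # resolve u wrt DNS
            assigned_repository.append((url, ip))               # send u to the AR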
Parallel Crawlers. The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication. • Dynamic assignment • A central coordinator dynamically assigns URLs to crawlers • Extracted links are sent back to the central coordinator (a possible bottleneck?) • Static assignment • The Web is statically partitioned and assigned to crawlers • Each crawler only crawls its own part of the Web
Two problems with static assignment. Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U s.t. hash(U) = x. Which hash would you use? • Load balancing the #URLs assigned to the downloaders: • Static schemes based on hosts may fail • www.geocities.com/…. • www.di.unipi.it/ • Dynamic "relocation" schemes may be complicated • Managing fault tolerance: • What about the death of a downloader? D → D-1, new hash!!! • What about a new downloader? D → D+1, new hash!!!
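A tiny numeric sketch (assumptions: MD5 as the hash, synthetic URLs) of the fault-tolerance problem above: with a plain hash(URL) mod D assignment, changing D from 16 to 17 reassigns almost every URL.

import hashlib

def assign(url, D):
    # Static assignment: downloader id = hash(URL) mod D
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big") % D

urls = [f"http://www.di.unipi.it/page{i}" for i in range(10000)]    # synthetic URLs
moved = sum(assign(u, 16) != assign(u, 17) for u in urls)           # D -> D+1
print(f"{moved/len(urls):.0%} of the URLs change downloader")       # roughly 16/17, i.e. ~94%

This is exactly the failure mode that consistent hashing (next slide) is designed to avoid.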
A nice technique: Consistent Hashing • Items and servers are mapped to the unit circle • Item K is assigned to the first server N such that ID(N) ≥ ID(K) • Each server gets replicated log S times • [monotone] Adding a new server moves items only from one old server to the new one • [balance] The probability that an item goes to a given server is ≤ O(1)/S • [load] Any server gets ≤ (I/S) log S items w.h.p. • [scale] You can replicate each server more times... • What if a downloader goes down? • What if a new downloader appears? • A tool for: Spidering • Web caches • P2P • Routers' load balancing • Distributed FS
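A minimal consistent-hashing sketch in Python, with the unit circle implemented over a 64-bit hash space; the use of SHA-1 and ~log S replicas per server are illustrative assumptions.

import bisect, hashlib, math

def _h(key):
    # Map a string to a point on the "circle" (a 64-bit integer hash space).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, servers):
        # Each server is replicated ~log S times on the circle.
        self.replicas = max(1, int(math.log2(max(len(servers), 2))))
        self.ring = []                      # sorted list of (point, server)
        for s in servers:
            self.add(s)

    def add(self, server):
        for r in range(self.replicas):
            bisect.insort(self.ring, (_h(f"{server}#{r}"), server))

    def remove(self, server):
        self.ring = [(p, s) for (p, s) in self.ring if s != server]

    def lookup(self, key):
        # Item K goes to the first server N (clockwise) with ID(N) >= ID(K).
        i = bisect.bisect_left(self.ring, (_h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHash(["d0", "d1", "d2", "d3"])
owner_before = ring.lookup("http://www.di.unipi.it/")
ring.add("d4")                                          # a new downloader appears...
owner_after = ring.lookup("http://www.di.unipi.it/")    # ...and only a few keys move

By the monotone property, adding or removing a downloader only remaps the keys that fall on the affected arcs of the circle, instead of rehashing (almost) everything as with mod-D assignment.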
Examples: Open Source • Nutch, also used by WikiSearch • http://nutch.apache.org/