Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Pipeline diagram: Crawl → Index → Search]
What’s Wrong?
• Users have a limited search interface.
• Today’s web is dynamic and growing:
  • Timely re-crawls are required.
  • Not feasible for all web sites.
• Search engines control your search results:
  • They decide which sites get crawled:
    • 550 billion documents estimated in 2001 (BrightPlanet).
    • Google indexes 3.3 billion documents.
  • They decide which sites get updated more frequently.
  • They may censor or skew result rankings.
Challenge: user-customizable searches that scale.
Our Solution: A Distributed Crawler
• P2P users donate excess bandwidth and computation resources to crawl the web.
• Organized using Distributed Hash Tables (DHTs).
• DHT- and query-processor-agnostic crawler:
  • Designed to work over any DHT (a minimal interface sketch follows this list).
  • Crawls can be expressed as declarative recursive queries, making user customization easy.
  • Queries can be executed over PIER, a DHT-based relational P2P query processor.
[Diagram: crawlers are PIER nodes; crawlees are web servers.]
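To make the "works over any DHT" claim concrete, here is a minimal sketch (hypothetical; not the paper's actual interface) of the narrow key-based routing API such a crawler could be written against, so that Bamboo, Chord, or any other DHT could be plugged in underneath:

```python
from abc import ABC, abstractmethod
from typing import Iterable

class DHT(ABC):
    """Narrow interface the crawler assumes; any DHT that supports
    key-based routing (Bamboo, Chord, ...) could implement it."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None:
        """Store value at the node responsible for key."""

    @abstractmethod
    def get(self, key: bytes) -> Iterable[bytes]:
        """Retrieve all values stored under key."""

    @abstractmethod
    def owner(self, key: bytes) -> str:
        """Identify the node currently responsible for key
        (used to decide which crawler downloads a URL)."""
```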
Potential
• Infrastructure for crawl personalization:
  • User-defined focused crawlers.
  • Collaborative crawling/filtering (special-interest groups).
• Other possibilities:
  • A bigger, better, faster web crawler.
  • Enables new search and indexing technologies:
    • P2P web search.
    • Web archival and storage (with OceanStore).
  • A generalized crawler for querying distributed graph structures:
    • Monitoring file-sharing networks, e.g. Gnutella.
    • P2P network maintenance: routing information, OceanStore metadata.
Challenges that We Investigated
• Scalability and throughput:
  • DHT communication overheads.
• Balancing network load on crawlers:
  • Two components of network load: download bandwidth and DHT bandwidth.
• Network proximity: exploit the network locality of crawlers.
• Limiting download rates on web sites (a per-host throttle sketch follows this list):
  • Prevents denial-of-service attacks on crawled sites.
• Main trade-off: tension between coordination and communication:
  • Balance load either on crawlers or on crawlees!
  • Exploit network proximity at the cost of communication.
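As an illustration of the per-site rate limiting point, here is a minimal sketch (hypothetical; not the paper's implementation) of a per-host politeness throttle that a crawler node could apply before each download:

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Per-host politeness delay: at most one download per host
    every `min_interval` seconds. Hypothetical sketch."""

    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self.last_fetch: dict[str, float] = {}

    def wait(self, url: str) -> None:
        host = urlparse(url).hostname or ""
        now = time.monotonic()
        earliest = self.last_fetch.get(host, 0.0) + self.min_interval
        if now < earliest:
            time.sleep(earliest - now)   # stay polite to this web server
        self.last_fetch[host] = time.monotonic()

# Usage: throttle.wait(url) just before the downloader fetches url.
```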
Crawl as a Recursive Query
[Dataflow diagram: seed URLs feed input URLs into a crawler thread (downloader, then extractor); output links are published as Link(sourceUrl, destUrl); rate-throttle and reorder filters, duplicate elimination, and a redirect step feed the recursive rule that publishes WebPage(url) for each Link.destUrl, driven by a DHT scan over WebPage(url). A fixpoint sketch of this query follows.]
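The recursive-query view of crawling amounts to computing a fixpoint over two relations, WebPage and Link. A minimal single-machine sketch of that semantics (helper names are hypothetical; in the system, PIER evaluates the query over the DHT) might look like:

```python
from urllib.parse import urljoin

def crawl_fixpoint(seed_urls, fetch, extract_links, max_pages=1000):
    """Compute the crawl as a fixpoint:
       WebPage(url)   :- url in seeds
       Link(src, dst) :- WebPage(src), dst extracted from fetched src
       WebPage(dst)   :- Link(src, dst)
    `fetch(url)` returns page content and `extract_links(content)` returns
    href strings; both are assumed helpers, not part of the paper."""
    webpage = set()                  # WebPage relation
    link = set()                     # Link relation
    frontier = list(seed_urls)       # newly derived WebPage tuples
    while frontier and len(webpage) < max_pages:
        url = frontier.pop()
        if url in webpage:
            continue                 # duplicate elimination
        webpage.add(url)
        content = fetch(url)
        for href in extract_links(content):
            dst = urljoin(url, href)
            link.add((url, dst))     # publish Link(sourceUrl, destUrl)
            if dst not in webpage:
                frontier.append(dst) # publish WebPage(destUrl)
    return webpage, link
```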
Crawl Distribution Strategies
• Partition by URL:
  • Ensures an even distribution of crawler workload.
  • High DHT communication traffic.
• Partition by hostname:
  • One crawler per hostname.
  • Creates a "control point" for per-server rate throttling.
  • May lead to uneven crawler load distribution.
  • Single point of failure: a "bad" choice of crawler affects per-site crawl throughput.
  • Slight variation: X crawlers per hostname.
(A sketch of both partitioning functions follows.)
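A minimal sketch (hypothetical names) of the two partitioning functions: both hash a key onto the DHT identifier space, differing only in whether the key is the full URL or just its hostname.

```python
import hashlib
from urllib.parse import urlparse

def dht_key(s: str) -> int:
    """Map a string onto a 160-bit DHT identifier space."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

def partition_by_url(url: str) -> int:
    # Every URL can land on a different crawler: good load balance,
    # but every extracted link crosses the DHT.
    return dht_key(url)

def partition_by_hostname(url: str, replicas: int = 1) -> int:
    # All URLs of a host map to one crawler (or to `replicas` crawlers),
    # giving a natural control point for per-server rate throttling.
    host = urlparse(url).hostname or ""
    bucket = dht_key(url) % replicas   # the "X crawlers per hostname" variant
    return dht_key(f"{host}#{bucket}")
```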
Redirection
• A simple technique that lets a crawler redirect, i.e. pass on, its assigned work to another crawler (and so on).
• A second-chance distribution mechanism, orthogonal to the partitioning scheme.
• Example, under partition by hostname:
  • The node responsible for www.google.com dispatches work (by URL) to other nodes.
  • Load-balancing benefits of partition by URL.
  • Control benefits of partition by hostname.
• When? Policy-based:
  • Crawler load (queue size).
  • Network proximity.
• Why not? The cost of redirection:
  • Increased DHT control traffic.
  • Hence, we put a limit on the number of redirections per URL (see the sketch below).
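A minimal sketch of such a redirect policy (thresholds and helpers are hypothetical): redirect when the local queue is too long, and stop after a fixed number of hops.

```python
MAX_REDIRECTS = 1       # e.g. one level of redirection on overload
QUEUE_THRESHOLD = 100   # hypothetical overload threshold

def dispatch(url, redirects, local_queue,
             pick_other_crawler, crawl_locally, forward):
    """Decide whether to crawl `url` locally or redirect it once more.
    `pick_other_crawler` chooses a target (e.g. lightly loaded or nearby);
    `forward` ships the URL and its redirect count over the DHT."""
    overloaded = len(local_queue) > QUEUE_THRESHOLD
    if overloaded and redirects < MAX_REDIRECTS:
        target = pick_other_crawler(url)
        forward(target, url, redirects + 1)  # costs extra DHT control traffic
    else:
        crawl_locally(url)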
Experiments
• Deployment:
  • Web crawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
  • 3 crawl threads per crawler; 15-minute crawl duration.
• Distribution (partition) schemes:
  • URL.
  • Hostname.
  • Hostname with 8 crawlers per unique host.
  • Hostname with one level of redirection on overload.
• Crawl workloads:
  • Exhaustive crawl:
    • Seed URL: http://www.google.com
    • 78,244 different web servers.
  • Crawl of a fixed number of sites:
    • Seed URL: http://www.google.com
    • 45 web servers within google.com.
  • Crawl of a single site, http://groups.google.com.
Crawl of Multiple Sites I
[CDF of per-crawler downloads, 80 nodes] Partition by hostname shows severe imbalance (70% of crawlers idle); results improve when more crawlers are kept busy.
[Crawl throughput scale-up] Hostname can exploit at most 45 crawlers; redirect (the hybrid hostname/URL scheme) does best.
Crawl of Multiple Sites II
[Per-URL DHT overheads] Redirect: per-URL DHT overheads reach their maximum at around 70 nodes; redirection incurs higher overheads only after the queue size exceeds a threshold. Hostname incurs low overheads since the crawl only looks at google.com, which has many self-links.
Network Proximity
• Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
• Partition by hostname approximates random assignment.
• Best-of-3 random is "close enough" to best-of-5 random (a selection sketch follows this list).
• Sanity check: what if a single host crawls all targets?
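The "best-of-k random" policy can be sketched as follows (helper names are hypothetical): probe a few random candidate crawlers and hand the URL to the one with the lowest measured round-trip time to the target site.

```python
import random

def best_of_k(url, crawlers, rtt_to_target, k=3):
    """Pick k random candidate crawlers and choose the one with the
    lowest round-trip time to the URL's web server.
    `rtt_to_target(crawler, url)` is an assumed measurement helper."""
    candidates = random.sample(crawlers, min(k, len(crawlers)))
    return min(candidates, key=lambda c: rtt_to_target(c, url))
```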
Related Work
• Herodotus, at MIT (Chord-based):
  • Partition by URL.
  • Batching with ring-based forwarding.
  • Evaluated on 4 local machines.
• Apoidea, at Georgia Tech (Chord-based):
  • Partition by hostname.
  • Forwards crawl work to the DHT neighbor closest to the website.
  • Evaluated on 12 local machines.
Conclusion
• Our main contributions:
  • A DHT- and query-processor-agnostic distributed crawler.
  • Crawls expressed as queries, permitting user-customizable refinement of crawls.
  • Important trade-offs discovered in distributed crawling: coordination comes with extra communication costs.
  • Deployment and experimentation on PlanetLab:
    • Examined crawl distribution strategies under different workloads on live web sources.
    • Measured the potential benefits of network proximity.
Existing Crawlers
• Cluster-based crawlers:
  • Google: a centralized dispatcher sends URLs to be crawled.
  • Hash-based parallel crawlers.
• Focused crawlers:
  • BINGO!: crawls the web given a basic training set.
• Peer-to-peer:
  • Grub, a SETI@Home-style infrastructure with 23,993 members.
Exhaustive Crawl
[Per-crawler downloads] Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
[Throughput] Little difference in throughput; most crawler threads are kept busy.
Single Site
Partition by URL is best, followed by redirect and hostname.
Future Work
• Fault tolerance.
• Security.
• Single-node throughput.
• Work sharing between crawl queries:
  • Essential for overlapping users.
• Global crawl prioritization:
  • A requirement of personalized crawls.
• Online relevance feedback.
• Deep-web retrieval.