Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Diagram labels: Search, Index, Crawl]
What’s Wrong?
• Users have a limited search interface.
• Today’s web is dynamic and growing:
  • Timely re-crawls required.
  • Not feasible for all web sites.
• Search engines control your search results:
  • Decide which sites get crawled:
    • 550 billion documents estimated in 2001 (BrightPlanet).
    • Google indexes 3.3 billion documents.
  • Decide which sites get updated more frequently.
  • May censor or skew result rankings.
Challenge: User-customizable searches that scale.
Our Solution: A Distributed Crawler
• P2P users donate excess bandwidth and computation resources to crawl the web.
  • Organized using Distributed Hash Tables (DHTs).
• DHT and Query Processor agnostic crawler:
  • Designed to work over any DHT.
  • Crawls can be expressed as declarative recursive queries.
    • Easy for user customization.
    • Queries can be executed over PIER, a DHT-based relational P2P Query Processor.
[Diagram labels: Crawlees: Web Servers; Crawlers: PIER nodes]
Potential
• Infrastructure for crawl personalization:
  • User-defined focused crawlers.
  • Collaborative crawling/filtering (special interest groups).
• Other possibilities:
  • Bigger, better, faster web crawler.
  • Enables new search and indexing technologies:
    • P2P Web Search.
    • Web archival and storage (with OceanStore).
• Generalized crawler for querying distributed graph structures:
  • Monitor file-sharing networks, e.g. Gnutella.
  • P2P network maintenance:
    • Routing information.
    • OceanStore meta-data.
Challenges that We Investigated
• Scalability and Throughput
  • DHT communication overheads.
  • Balance network load on crawlers.
    • 2 components of network load: Download and DHT bandwidth.
  • Network Proximity. Exploit network locality of crawlers.
• Limit download rates on web sites
  • Prevents denial of service attacks.
• Main tradeoff: Tension between coordination and communication
  • Balance load either on crawlers or on crawlees!
  • Exploit network proximity at the cost of communication.
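The download-rate limit on web sites can be illustrated with a small per-host throttle. This is a minimal sketch, not the crawler's actual mechanism: the class name `HostThrottle`, the minimum-interval policy, and the explicit `now` argument (used to keep the example deterministic) are all assumptions made for illustration.

```python
class HostThrottle:
    """Allow at most one download per host every `min_interval_s` seconds.

    Hypothetical helper; the slides do not specify the throttling policy.
    """

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        # host -> earliest time at which the next download may start
        self._next_allowed: dict[str, float] = {}

    def try_acquire(self, host: str, now: float) -> bool:
        """Return True and reserve the slot if `host` may be fetched at `now`."""
        if now < self._next_allowed.get(host, 0.0):
            return False  # too soon: caller should requeue or reorder the URL
        self._next_allowed[host] = now + self.min_interval_s
        return True


throttle = HostThrottle(min_interval_s=2.0)
first = throttle.try_acquire("www.google.com", now=0.0)      # allowed
too_soon = throttle.try_acquire("www.google.com", now=1.0)   # rejected
other_host = throttle.try_acquire("groups.google.com", now=1.0)  # independent host
later = throttle.try_acquire("www.google.com", now=2.0)      # allowed again
```

A throttle like this only works if all requests for a host pass through one place, which is why the choice of partitioning scheme matters for rate limiting.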
Crawl as a Recursive Query
[Dataflow diagram. Labels: Seed Urls; Input Urls; DHT Scan: WebPage(url); DupElim; Rate Throttle & Reorder; Redirect; CrawlWrapper; Crawler Thread; Downloader; Extractor; Output Links; Filters; Dup Elim; Publish Link (sourceUrl, destUrl); Publish WebPage(url) : Link.destUrl]
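The recursion in this dataflow (each extracted Link.destUrl is published as a new WebPage(url), which is then crawled in turn) can be sketched on a single node as a fixpoint loop. This is only an illustration under stated assumptions: `crawl`, `fetch_links`, `max_pages`, and the toy link graph are hypothetical names, and the DHT, rate throttling, and filters are omitted.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seed_urls: Iterable[str],
    fetch_links: Callable[[str], list[str]],
    max_pages: int = 1000,
) -> tuple[set[str], list[tuple[str, str]]]:
    """Evaluate the recursive query: WebPage(url) holds for every seed URL
    and for every destUrl of a Link whose source is itself a WebPage."""
    seen: set[str] = set()              # duplicate elimination on WebPage(url)
    links: list[tuple[str, str]] = []   # Link(sourceUrl, destUrl) tuples
    frontier = deque(seed_urls)         # input URLs waiting to be crawled
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        for dest in fetch_links(url):   # stands in for downloader + extractor
            links.append((url, dest))   # publish Link(sourceUrl, destUrl)
            frontier.append(dest)       # publish WebPage(destUrl)
    return seen, links


# Toy link graph standing in for the live web (made-up data).
TOY_WEB = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
pages, found_links = crawl(["a"], lambda u: TOY_WEB.get(u, []))
```

In the distributed setting the frontier is not a local queue: publishing WebPage(url) into the DHT is what routes the URL to the crawler responsible for it.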
Crawl Distribution Strategies
• Partition by URL
  • Ensures even distribution of crawler workload.
  • High DHT communication traffic.
• Partition by Hostname
  • One crawler per hostname.
  • Creates a “control point” for per-server rate throttling.
  • May lead to uneven crawler load distribution.
  • Single point of failure:
    • “Bad” choice of crawler affects per-site crawl throughput.
  • Slight variation: X crawlers per hostname.
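The difference between the schemes comes down to which string is hashed to pick the responsible crawler. The sketch below is an assumption-laden illustration: `dht_key`, `crawler_for`, and the `replicas` parameter are hypothetical names, and a simple modulo over a fixed crawler count stands in for real DHT key routing.

```python
import hashlib
from urllib.parse import urlsplit


def dht_key(text: str) -> int:
    """Hash a string to a large integer key (SHA-1, as many DHTs do)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def crawler_for(url: str, num_crawlers: int, scheme: str = "url", replicas: int = 1) -> int:
    """Pick a crawler index for `url`.

    scheme="url":      hash the full URL, spreading pages of one site everywhere.
    scheme="hostname": hash the hostname, so one site maps to `replicas`
                       crawlers (replicas=1 is plain partition by hostname).
    """
    if scheme == "url":
        return dht_key(url) % num_crawlers
    if scheme == "hostname":
        host = urlsplit(url).hostname or ""
        slot = dht_key(url) % replicas  # which of the X crawlers for this host
        return dht_key(f"{host}#{slot}") % num_crawlers
    raise ValueError(f"unknown scheme: {scheme}")


urls = [f"http://www.google.com/page{i}" for i in range(200)]
by_url = {crawler_for(u, 80, "url") for u in urls}
by_host = {crawler_for(u, 80, "hostname") for u in urls}
by_host_x8 = {crawler_for(u, 80, "hostname", replicas=8) for u in urls}
```

With hostname partitioning, every URL of a site lands on one crawler, which gives the rate-throttling control point but concentrates load; hashing the full URL removes the control point and spreads the load.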
Redirection
• Simple technique that allows a crawler to redirect or pass on its assigned work to another crawler (and so on).
• A second-chance distribution mechanism orthogonal to the partitioning scheme.
• Example: Partition by hostname
  • Node responsible for google.com (red) dispatches work (by URL) to grey nodes.
  • Load balancing benefits of partition by URL.
  • Control benefits of partition by hostname.
• When? Policy-based
  • Crawler load (queue size).
  • Network proximity.
• Why not? Cost of redirection
  • Increased DHT control traffic.
  • Hence, put a limit on the number of redirections per URL.
[Diagram label: www.google.com]
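A load-based redirection policy with a cap on redirections per URL might look like the following. This is a sketch under assumptions, not the authors' implementation: `Crawler`, `assign`, `queue_threshold`, and `max_redirects` are hypothetical names, and "redirect to the least-loaded peer" is just one possible policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Crawler:
    name: str
    queue: deque = field(default_factory=deque)  # URLs waiting to be downloaded


def assign(url, owner, peers, redirects=0, queue_threshold=100, max_redirects=1):
    """Give `url` to `owner`, or redirect it to a peer when `owner` is overloaded.

    The URL is accepted locally if the owner's queue is below the threshold,
    if the redirection limit has been reached, or if no peers remain.
    """
    if len(owner.queue) < queue_threshold or redirects >= max_redirects or not peers:
        owner.queue.append(url)
        return owner
    target = min(peers, key=lambda c: len(c.queue))   # least-loaded peer
    remaining = [p for p in peers if p is not target]
    return assign(url, target, remaining, redirects + 1, queue_threshold, max_redirects)


owner = Crawler("owner", deque(range(5)))   # already at the threshold used below
busy = Crawler("busy", deque(range(3)))
idle = Crawler("idle")
accepted_by = assign("http://www.google.com/x", owner, [busy, idle], queue_threshold=5)
```

Setting `max_redirects=0` recovers the plain partitioning scheme, and each extra allowed hop adds one more message per redirected URL.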
Experiments
• Deployment
  • WebCrawler over PIER, Bamboo DHT, up to 80 PlanetLab nodes.
  • 3 crawl threads per crawler, 15 min crawl duration.
• Distribution (Partition) Schemes
  • URL
  • Hostname
  • Hostname with 8 crawlers per unique host
  • Hostname, one-level redirection on overload
• Crawl Workload
  • Exhaustive crawl
    • Seed URL: http://www.google.com
    • 78244 different web servers
  • Crawl of fixed number of sites
    • Seed URL: http://www.google.com
    • 45 web servers within google
  • Crawl of single site within http://groups.google.com
Crawl of Multiple Sites I
[Figure: CDF of Per-crawler Downloads (80 nodes)]
• Partition by Hostname shows poor balance (70% idle).
• Better off when more crawlers are busy.
[Figure: Crawl Throughput Scaleup]
• Hostname: Can only exploit at most 45 crawlers.
• Redirect (hybrid hostname/url) does the best.
Crawl of Multiple Sites II
[Figure: Per-URL DHT Overheads]
• Redirect: The per-URL DHT overheads hit their maximum around 70 nodes.
• Redirection incurs higher overheads only after queue size exceeds a threshold.
• Hostname incurs low overheads since the crawl only looks at google.com, which has lots of self-links.
Network Proximity
• Sampled 5100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
• Partition by hostname approximates random assignment.
• Best-3 random is “close enough” to Best-5 random.
• Sanity check: what if a single host crawls all targets?
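One plausible reading of "Best-k random" is: sample k crawlers at random and assign the target to the one with the lowest ping time. The slides do not define the term, so the sketch below rests on that assumption; `best_of_k` and the latency figures are hypothetical.

```python
import random


def best_of_k(latency_ms: dict[str, float], k: int, rng: random.Random) -> str:
    """Sample k candidate crawlers uniformly and return the one closest to the target.

    `latency_ms` maps crawler name -> measured ping time to one crawl target.
    """
    candidates = rng.sample(sorted(latency_ms), k)  # sorted() gives a stable key list
    return min(candidates, key=latency_ms.__getitem__)


# Made-up ping times from four crawlers to a single crawl target.
pings = {"n1": 120.0, "n2": 35.0, "n3": 80.0, "n4": 200.0}
rng = random.Random(0)
closest_overall = best_of_k(pings, k=len(pings), rng=rng)  # k = all crawlers
random_choice = best_of_k(pings, k=1, rng=rng)             # k = 1 is random assignment
```

Under this reading, k = 1 corresponds to random assignment (which partition by hostname approximates) and larger k trades extra latency measurements for closer crawlers.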
Related Work
• Herodotus, at MIT (Chord-based)
  • Partition by URL.
  • Batching with ring-based forwarding.
  • Experimented on 4 local machines.
• Apoidea, at GaTech (Chord-based)
  • Partition by hostname.
  • Forwards crawl to DHT neighbor closest to website.
  • Experimented on 12 local machines.
Conclusion
• Our main contributions:
  • Propose a DHT and QP agnostic Distributed Crawler.
    • Express crawl as a query.
    • Permits user-customizable refinement of crawls.
  • Discover important trade-offs in distributed crawling:
    • Coordination comes with extra communication costs.
  • Deployment and experimentation on PlanetLab.
    • Examine crawl distribution strategies under different workloads on live web sources.
    • Measure the potential benefits of network proximity.
Existing Crawlers
• Cluster-based crawlers
  • Google: Centralized dispatcher sends URLs to be crawled.
  • Hash-based parallel crawlers.
• Focused Crawlers
  • BINGO!
    • Crawls the web given basic training set.
• Peer-to-Peer
  • Grub SETI@Home infrastructure.
    • 23993 members.
Exhaustive Crawl
• Partition by Hostname shows imbalance. Some crawlers are over-utilized for downloads.
• Little difference in throughput. Most crawler threads are kept busy.
Single Site
• URL is best, followed by redirect and hostname.
Future Work
• Fault Tolerance
• Security
• Single-Node Throughput
• Work-Sharing between Crawl Queries
  • Essential for overlapping users.
• Crawl Global Prioritization
  • A requirement of personalized crawls.
  • Online relevance feedback.
• Deep web retrieval.