Mining the Web: Crawling the Web
Schedule • Search engine requirements • Components overview • Specific modules: the crawler • Purpose • Implementation • Performance metrics
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • Is it really that simple?
What does it do? • Processes user queries • How is a query represented? • Finds pages with related information • Returns a list of resources • Is it really that simple?
What does it do? • Processes user queries • Finds pages with related information • How do we find pages? • Where in the web do we look? • How do we match queries and documents? • Returns a list of resources • Is it really that simple?
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • In what order? • How are the pages ranked? • Is it really that simple?
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • Is it really that simple? • Limited resources • Time/quality tradeoff
Search Engine Structure • General Design • Crawling • Storage • Indexing • Ranking
Search Engine Structure • [Architecture diagram: crawlers, steered by a crawl control module, fetch pages from the Web into a page repository; the indexer and collection analysis modules build the text, structure, and utility indexes; the query engine and ranking module use these indexes to turn queries into results.]
Is it an IR system? • The web is • Used by millions • Contains lots of information • Link based • Incoherent • Changes rapidly • Distributed • Traditional information retrieval was built with the exact opposite in mind
Web Dynamics • Size • ~10 billion publicly indexable pages • 10 kB/page ⇒ ~100 TB • Doubles every 18 months • Dynamics • 33% of pages change weekly • 8% new pages every week • 25% new links every week
Weekly change • [Figure: Fetterly, Manasse, Najork, Wiener 2003]
Collecting “all” Web pages • For searching, for classifying, for mining, etc. • Problems: • No catalog of all accessible URLs on the Web • Volume, latency, duplication, dynamicity, etc.
The Crawler • A program that downloads and stores web pages: • Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized. • From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. • This process is repeated until the crawler decides to stop.
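As a rough illustration (not code from the lecture), here is a minimal Python sketch of this loop. It uses a plain FIFO queue for the frontier and a crude regular-expression link extractor standing in for a real HTML parser; the page limit is an arbitrary stopping criterion.

```python
from collections import deque
from urllib.parse import urljoin
import re
import urllib.request

def crawl(seed_urls, max_pages=100):
    """Minimal crawl loop: fetch a URL, store the page, extract links, enqueue, repeat."""
    frontier = deque(seed_urls)              # S0: initial URLs, kept in a queue
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:   # crawler decides when to stop
        url = frontier.popleft()                 # get a URL (FIFO order here)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                             # skip pages that fail to download
        pages[url] = html                        # store the downloaded page
        # crude href extraction; a production crawler would use an HTML parser
        for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
            absolute = urljoin(url, href)        # resolve relative URLs
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```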
Crawling Issues • How to crawl? • Quality: “Best” pages first • Efficiency: Avoid duplication (or near duplication) • Etiquette: Robots.txt, Server load concerns • How much to crawl? How much to index? • Coverage: How big is the Web? How much do we cover? • Relative Coverage: How much do competitors have? • How often to crawl? • Freshness: How much has changed? • How much has really changed? (why is this a different question?)
Before discussing crawling policies… • Some implementation issues
HTML • HyperText Markup Language • Lets the author • specify layout and typeface • embed diagrams • create hyperlinks • a hyperlink is expressed as an anchor tag with an HREF attribute • HREF names another page using a Uniform Resource Locator (URL) • URL = • protocol field (“HTTP”) + • a server hostname (“www.cse.iitb.ac.in”) + • a file path (“/”, the ‘root’ of the published file system)
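To make these three URL fields concrete, a small Python sketch (the example URL on the www.cse.iitb.ac.in host is assumed, not taken from the slides) splits a URL into them with the standard library:

```python
from urllib.parse import urlparse

# A hyperlink is expressed as an anchor tag, e.g.
#   <a href="http://www.cse.iitb.ac.in/index.html">CSE home</a>
# Its HREF names another page with a URL made of three fields:
url = "http://www.cse.iitb.ac.in/index.html"
parts = urlparse(url)
print(parts.scheme)   # protocol field:  "http"
print(parts.netloc)   # server hostname: "www.cse.iitb.ac.in"
print(parts.path)     # file path:       "/index.html"
```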
HTTP (HyperText Transfer Protocol) • Built on top of the Transmission Control Protocol (TCP) • Steps (from the client end) • resolve the server host name to an Internet address (IP) • use the Domain Name System (DNS) • DNS is a distributed database of name-to-IP mappings maintained at a set of known servers • contact the server using TCP • connect to the default HTTP port (80) on the server • send the HTTP request header (e.g., GET) • fetch the response header • MIME (Multipurpose Internet Mail Extensions) • a meta-data standard for email and Web content transfer • fetch the HTML page
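A hedged Python sketch of these client-side steps with raw sockets; the host name and path are only examples, and error handling is omitted for brevity:

```python
import socket

host, path = "www.cse.iitb.ac.in", "/"          # example server and path

# 1. resolve the host name to an IP address (the resolver queries DNS)
ip = socket.gethostbyname(host)

# 2. contact the server over TCP on the default HTTP port (80)
sock = socket.create_connection((ip, 80), timeout=10)

# 3. send the HTTP request header (a simple GET)
request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
sock.sendall(request.encode("ascii"))

# 4. fetch the response: header (with MIME metadata such as Content-Type),
#    a blank line, then the HTML page itself
response = b""
while chunk := sock.recv(4096):
    response += chunk
sock.close()
header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("ascii", errors="replace"))
```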
Crawling procedure • Simple in principle • Great deal of engineering goes into industry-strength crawlers • Industry crawlers crawl a substantial fraction of the Web • E.g.: Google, Yahoo • No guarantee that all accessible Web pages will be located • Crawler may never halt… • pages will be added continually even as it is running
Crawling overheads • Delays involved in • Resolving the host name in the URL to an IP address using DNS • Connecting a socket to the server and sending the request • Receiving the requested page in response • Solution: Overlap the above delays by • fetching many pages at the same time
Anatomy of a crawler • Page fetching by (logical) threads • Starts with DNS resolution • Finishes when the entire page has been fetched • Each page • stored in compressed form to disk/tape • scanned for outlinks • Work pool of outlinks • maintain network utilization without overloading it • Dealt with by load manager • Continue till the crawler has collected a sufficient number of pages.
Large-scale crawlers: performance and reliability considerations • Need to fetch many pages at same time • utilize the network bandwidth • single page fetch may involve several seconds of network latency • Highly concurrent and parallelized DNS lookups • Multi-processing or multi-threading: impractical at low level • Use of asynchronous sockets • Explicit encoding of the state of a fetch context in a data structure • Polling socket to check for completion of network transfers • Care in URL extraction • Eliminating duplicates to reduce redundant fetches • Avoiding “spider traps”
DNS caching, pre-fetching and resolution • A customized DNS component with… • Custom client for address resolution • Caching server • Prefetching client
Custom client for address resolution • Tailored for concurrent handling of multiple outstanding requests • Allows issuing of many resolution requests together • polling at a later time for completion of individual requests • Facilitates load distribution among many DNS servers.
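One way such a client could behave, sketched in Python with a thread pool standing in for a custom asynchronous resolver (a real component would issue and poll its own DNS queries); `resolve_all` and its parameters are illustrative, not part of any standard crawler API:

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def resolve_all(hostnames, workers=50):
    """Issue many resolution requests together and collect the answers
    as they complete, rather than resolving one host at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(socket.gethostbyname, h): h for h in hostnames}
        for fut in as_completed(futures):        # poll later for completion
            host = futures[fut]
            try:
                results[host] = fut.result()
            except socket.gaierror:
                results[host] = None             # resolution failed
    return results
```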
Caching server • With a large cache, persistent across DNS restarts • Residing largely in memory if possible.
Prefetching client • Steps • Parse a page that has just been fetched • extract host names from HREF targets • Make DNS resolution requests to the caching server • Usually implemented using UDP • User Datagram Protocol • connectionless, packet-based communication protocol • does not guarantee packet delivery • Does not wait for resolution to be completed.
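A rough Python approximation of the prefetching step; a fire-and-forget thread pool stands in here for the UDP-based resolver described above, and `prefetch_dns` is a made-up helper name:

```python
import re
import socket
from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor

# After a page is fetched, extract the host names in its HREF targets and
# ask the (caching) resolver for them without waiting for the answers, so
# the cache is already warm when those URLs come up for crawling.
_prefetch_pool = ThreadPoolExecutor(max_workers=20)

def prefetch_dns(base_url, html):
    hosts = set()
    for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
        host = urlparse(urljoin(base_url, href)).hostname
        if host:
            hosts.add(host)
    for host in hosts:
        # fire and forget: do not wait for resolution to complete
        _prefetch_pool.submit(socket.gethostbyname, host)
```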
Multiple concurrent fetches • Managing multiple concurrent connections • A single download may take several seconds • Open many socket connections to different HTTP servers simultaneously • Multi-CPU machines not useful • crawling performance limited by network and disk • Two approaches • using multi-threading • using non-blocking sockets with event handlers
Multi-threading • threads • physical thread of control provided by the operating system (E.g.: pthreads) OR • concurrent processes • fixed number of threads allocated in advance • programming paradigm • create a client socket • connect the socket to the HTTP service on a server • Send the HTTP request header • read the socket (recv) until • no more characters are available • close the socket. • use blocking system calls
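A simplified Python sketch of this paradigm at the level of the work pool: a fixed number of threads, allocated in advance, draw URLs from a shared queue and perform blocking fetches (urllib replaces the raw-socket steps above for brevity). Names such as `threaded_fetch` are illustrative only.

```python
import queue
import threading
import urllib.request

def worker(work_pool, results, lock):
    """Each thread repeatedly takes a URL and performs a blocking fetch."""
    while True:
        try:
            url = work_pool.get_nowait()
        except queue.Empty:
            return
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                page = resp.read()             # blocking read until the page is done
        except Exception:
            page = None
        with lock:                             # mutual exclusion on shared state
            results[url] = page
        work_pool.task_done()

def threaded_fetch(urls, num_threads=10):
    work_pool = queue.Queue()
    for u in urls:
        work_pool.put(u)
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(work_pool, results, lock))
               for _ in range(num_threads)]    # fixed number of threads, allocated in advance
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```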
Multi-threading: Problems • performance penalty • mutual exclusion • concurrent access to data structures • slow disk seeks. • great deal of interleaved, random input-output on disk • Due to concurrent modification of document repository by multiple threads
Non-blocking sockets and event handlers • non-blocking sockets • connect, send or recv call returns immediately without waiting for the network operation to complete. • poll the status of the network operation separately • “select” system call • lets application suspend until more data can be read from or written to the socket • timing out after a pre-specified deadline • Monitor polls several sockets at the same time • More efficient memory management • code that completes processing not interrupted by other completions • No need for locks and semaphores on the pool • only append complete pages to the log
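A compact Python sketch of the event-driven alternative: one thread drives several non-blocking sockets, polling them with `select` instead of blocking on each in turn. It assumes plain HTTP on port 80 and skips error handling and partial-send corner cases.

```python
import select
import socket

def fetch_many_nonblocking(targets, timeout=15):
    """targets: list of (host, path) pairs. Returns {host: raw response bytes}."""
    contexts = {}                                  # socket -> state of this fetch
    for host, path in targets:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)                       # connect returns immediately
        try:
            s.connect((host, 80))
        except BlockingIOError:
            pass                                   # connection still in progress
        req = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        contexts[s] = {"request": req, "sent": False, "data": b"", "host": host}

    results = {}
    while contexts:
        writers = [s for s, c in contexts.items() if not c["sent"]]
        readers = [s for s, c in contexts.items() if c["sent"]]
        readable, writable, _ = select.select(readers, writers, [], timeout)
        if not (readable or writable):
            break                                  # nothing progressed before the deadline
        for s in writable:
            s.sendall(contexts[s]["request"])      # connection established: send request
            contexts[s]["sent"] = True
        for s in readable:
            ctx = contexts[s]
            chunk = s.recv(4096)
            if chunk:
                ctx["data"] += chunk
            else:                                  # server closed: page is complete
                results[ctx["host"]] = ctx["data"]
                s.close()
                del contexts[s]
    return results
```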
Link extraction and normalization • Goal: obtaining a canonical form of each URL • URL processing and filtering • Avoid multiple fetches of pages known by different URLs • many IP addresses per host name • for load balancing on large sites • mirrored contents / contents on the same file system • “proxy pass” • mapping of different host names to a single IP address • needed to publish many logical sites • Relative URLs • need to be interpreted w.r.t. a base URL
Canonical URL • Formed by • Using a standard string for the protocol • Canonicalizing the host name • Adding an explicit port number • Normalizing and cleaning up the path
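One possible canonicalization routine, sketched in Python; the exact normalization rules differ between crawlers, so treat this as an assumption-laden example rather than a standard:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Standard lower-case protocol string, lower-case host name,
    explicit port number, and a normalized, cleaned-up path."""
    parts = urlsplit(url)
    scheme = (parts.scheme or "http").lower()              # standard protocol string
    host = (parts.hostname or "").lower()                  # canonical host name
    port = parts.port or (443 if scheme == "https" else 80)  # explicit port number
    path = posixpath.normpath(parts.path or "/")           # clean up the path
    if path in (".", ""):
        path = "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# canonicalize("HTTP://WWW.cse.iitb.ac.in/a/./b/../c")
#   -> "http://www.cse.iitb.ac.in:80/a/c"
```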
Robot exclusion • Check • whether the server prohibits crawling a normalized URL • In robots.txt file in the HTTP root directory of the server • specifies a list of path prefixes which crawlers should not attempt to fetch. • Meant for crawlers only
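Python's standard library ships a robots.txt parser, so a check along these lines is straightforward; the "MyCrawler" user-agent string is a placeholder:

```python
from urllib import robotparser
from urllib.parse import urlsplit

def allowed_by_robots(url, user_agent="MyCrawler"):
    """Consult the server's robots.txt before fetching a normalized URL."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")  # robots.txt at the HTTP root
    try:
        rp.read()                       # fetch and parse the exclusion rules
    except Exception:
        return True                     # no readable robots.txt: assume crawling is allowed
    return rp.can_fetch(user_agent, url)
```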
Eliminating already-visited URLs • Checking if a URL has already been fetched • Before adding a new URL to the work pool • Needs to be very quick. • Achieved by computing MD5 hash function on the URL • Exploiting spatio-temporal locality of access • Two-level hash function. • most significant bits (say, 24) derived by hashing the host name plus port • lower order bits (say, 40) derived by hashing the path • concatenated bits used as a key in a B-tree • qualifying URLs added to frontier of the crawl. • hash values added to B-tree.
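A sketch of the two-level key, assuming MD5 as above and an in-memory set standing in for the B-tree; the 24/40-bit split follows the numbers given on the slide:

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """Two-level hash: 24 most significant bits from the host name plus port,
    40 lower-order bits from the path, concatenated into one 64-bit key.
    URLs from the same server share a key prefix, so lookups exploit the
    crawl's spatio-temporal locality of access."""
    parts = urlsplit(url)
    host_part = f"{parts.hostname}:{parts.port or 80}".encode()
    path_part = (parts.path or "/").encode()
    host_bits = int.from_bytes(hashlib.md5(host_part).digest(), "big") >> (128 - 24)
    path_bits = int.from_bytes(hashlib.md5(path_part).digest(), "big") >> (128 - 40)
    return (host_bits << 40) | path_bits

seen = set()                            # stand-in for the on-disk B-tree of hash values
def already_visited(url):
    key = url_key(url)
    if key in seen:
        return True
    seen.add(key)                       # qualifying URL: record it and add to the frontier
    return False
```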
Spider traps • Protecting from crashing on • Ill-formed HTML • E.g.: page with 68 kB of null characters • Misleading sites • indefinite number of pages dynamically generated by CGI scripts • paths of arbitrary depth created using soft directory links and path remapping features in HTTP server
Spider Traps: Solutions • No automatic technique can be foolproof • Check for URL length • Guards • Preparing regular crawl statistics • Adding dominating sites to guard module • Disable crawling active content such as CGI form queries • Eliminate URLs with non-textual data types
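A few of these guards could be combined along the following lines; every threshold here (URL length, per-site budget, extension list) is a made-up illustration, and, as noted above, none of it is foolproof:

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256          # hypothetical limit; traps often show up as very long URLs
MAX_PAGES_PER_SITE = 10_000   # hypothetical per-site budget derived from crawl statistics
NON_TEXTUAL = (".jpg", ".png", ".gif", ".zip", ".exe")

pages_per_site = Counter()    # updated elsewhere from regular crawl statistics
blocked_sites = set()         # dominating sites added to the guard module

def passes_guards(url):
    """A few cheap checks applied before a URL enters the frontier."""
    parts = urlsplit(url)
    if len(url) > MAX_URL_LENGTH:
        return False                                  # suspiciously long URL
    if parts.hostname in blocked_sites:
        return False                                  # site already dominates the crawl
    if pages_per_site[parts.hostname] > MAX_PAGES_PER_SITE:
        blocked_sites.add(parts.hostname)
        return False
    if "?" in url or "cgi-bin" in parts.path:
        return False                                  # skip active content / CGI form queries
    if parts.path.lower().endswith(NON_TEXTUAL):
        return False                                  # eliminate non-textual data types
    return True
```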
Avoiding repeated expansion of links on duplicate pages • Reduce redundancy in crawls • Duplicate detection • Mirrored Web pages and sites • Detecting exact duplicates • Checking against MD5 digests of stored URLs • Representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v) • Detecting near-duplicates • Even a single altered character will completely change the digest! • E.g.: date of update / name and email of the site administrator
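A minimal sketch of the exact-duplicate check via MD5 content digests; as noted above, this breaks down for near-duplicates, which call for shingling or similar fingerprinting techniques instead (not shown):

```python
import hashlib

page_digests = {}   # MD5 digest of page content -> first URL seen with that content

def is_exact_duplicate(url, content):
    """Identical pages (e.g. mirrors) hash to the same digest, so their
    outlinks are expanded only once; a relative link v then corresponds to
    the same work item (digest, v) regardless of which alias produced it."""
    digest = hashlib.md5(content).hexdigest()
    if digest in page_digests:
        return True
    page_digests[digest] = url
    return False
```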
Load monitor • Keeps track of various system statistics • Recent performance of the wide area network (WAN) connection • E.g.: latency and bandwidth estimates. • Operator-provided/estimated upper bound on open sockets for a crawler • Current number of active sockets.
Thread manager • Responsible for • Choosing units of work from frontier • Scheduling issue of network resources • Distribution of these requests over multiple ISPs if appropriate. • Uses statistics from load monitor
Per-server work queues • Denial of service (DoS) attacks • servers limit the speed or frequency of responses to any fixed client IP address • Avoiding looking like a DoS attack • limit the number of active requests to a given server IP address at any time • maintain a queue of requests for each server • use the HTTP/1.1 persistent socket capability • Distribute attention relatively evenly between a large number of sites • Access locality vs. politeness dilemma
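A single-threaded Python sketch of per-server queues with a politeness delay; the two-second delay and the class name are arbitrary choices for illustration, and persistent HTTP/1.1 connections are not modeled:

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """One FIFO queue per server, plus a heap of (next_allowed_time, host),
    so no single server is contacted more often than `delay` seconds apart
    and attention is spread across many sites."""
    def __init__(self, delay=2.0):
        self.delay = delay
        self.queues = defaultdict(deque)     # host -> pending URLs for that server
        self.ready = []                      # (time when host may be contacted again, host)

    def add(self, url):
        host = urlsplit(url).hostname
        if not self.queues[host]:
            heapq.heappush(self.ready, (time.monotonic(), host))
        self.queues[host].append(url)

    def next_url(self):
        while self.ready:
            not_before, host = heapq.heappop(self.ready)
            if not self.queues[host]:
                continue                     # queue drained in the meantime
            wait = not_before - time.monotonic()
            if wait > 0:
                time.sleep(wait)             # politeness: respect the per-host delay
            url = self.queues[host].popleft()
            if self.queues[host]:
                heapq.heappush(self.ready, (time.monotonic() + self.delay, host))
            return url
        return None
```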
Crawling Issues • How to crawl? • Quality: “Best” pages first • Efficiency: Avoid duplication (or near duplication) • Etiquette: Robots.txt, Server load concerns • How much to crawl? How much to index? • Coverage: How big is the Web? How much do we cover? • Relative Coverage: How much do competitors have? • How often to crawl? • Freshness: How much has changed? • How much has really changed? (why is this a different question?)
Crawl Order • Want best pages first • Potential quality measures: • Final In-degree • Final PageRank • Crawl heuristics: • Breadth First Search (BFS) • Partial Indegree • Partial PageRank • Random walk
Breadth-First Crawl • Basic idea: • start at a set of known URLs • explore in “concentric circles” around these URLs • [diagram: start pages, distance-one pages, distance-two pages] • used by broad web search engines • balances load between servers
Web Wide Crawl (328M pages) [Najo01] • [Figure omitted] • BFS crawling brings in high-quality pages early in the crawl
Stanford Web Base (179K pages) [Cho98] • [Plot: overlap with the best x% of pages by in-degree vs. the x% crawled, for each ordering metric O(u)]
Queue of URLs to be fetched • What constraints dictate which queued URL is fetched next? • Politeness: don’t hit a server too often, even from different threads of your spider • How far into a site you’ve crawled already • most sites: stay within ≤ 5 levels of the URL hierarchy • Which URLs are most promising for building a high-quality corpus • This is a graph traversal problem: given a directed graph you’ve partially visited, where do you visit next?
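A toy Python frontier that folds two of these constraints, a depth limit and a crude "promise" score from in-link counts, into one priority queue; the scoring formula and names are invented purely for illustration:

```python
import heapq
from urllib.parse import urlsplit

class PriorityFrontier:
    """Order queued URLs by a composite score: prefer shallow URLs
    (few path levels) and URLs with many known in-links."""
    MAX_DEPTH = 5                            # stay within ~5 levels of the URL hierarchy

    def __init__(self):
        self.heap = []
        self.inlinks = {}                    # url -> number of links seen to it so far
        self.done = set()                    # URLs already handed out

    def add(self, url):
        self.inlinks[url] = self.inlinks.get(url, 0) + 1
        depth = urlsplit(url).path.count("/")
        if depth > self.MAX_DEPTH:
            return
        score = depth - self.inlinks[url]    # lower is better: shallow and well linked
        heapq.heappush(self.heap, (score, url))

    def pop(self):
        while self.heap:
            _, url = heapq.heappop(self.heap)
            if url not in self.done:         # skip stale entries from earlier pushes
                self.done.add(url)
                return url
        return None
```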
Where do we crawl next? • Complex scheduling optimization problem, subject to constraints • Plus operational constraints (e.g., keeping all machines load-balanced) • Scientific study – limited to specific aspects • Which ones? • What do we measure? • What are the compromises in distributed crawling?
Page selection • Importance metric • Web crawler model • Crawler method for choosing page to download
Importance Metrics • Given a page P, define how “good” that page is • Several metric types: • Interest driven • Popularity driven • Location driven • Combined