Mining the Web: Crawling the Web
Schedule • Search engine requirements • Components overview • Specific modules: the crawler • Purpose • Implementation • Performance metrics
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • Is it really that simple?
What does it do? • Processes user queries • How is a query represented? • Finds pages with related information • Returns a list of resources • Is it really that simple?
What does it do? • Processes user queries • Finds pages with related information • How do we find pages? • Where in the web do we look? • How do we match queries and documents? • Returns a list of resources • Is it really that simple?
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • In what order? • How are the pages ranked? • Is it really that simple?
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • Is it really that simple? • Limited resources • Time/quality tradeoff
Search Engine Structure • General Design • Crawling • Storage • Indexing • Ranking
Search Engine Structure • [Architecture diagram: crawlers, steered by a crawl control module, fetch pages from the Web into a page repository; the indexer and collection analysis modules build the text, structure, and utility indexes; the query engine and ranking module use these indexes to turn queries into results.]
Is it an IR system? • The web is • Used by millions • Contains lots of information • Link based • Incoherent • Changes rapidly • Distributed • Traditional information retrieval was built with the exact opposite in mind
Web Dynamics • Size • ~10 billion publicly indexable pages • 10 kB/page ⇒ ~100 TB • Doubles every 18 months • Dynamics • 33% of pages change weekly • 8% new pages every week • 25% new links every week
Weekly change • [Figure: Fetterly, Manasse, Najork, Wiener 2003]
Collecting “all” Web pages • For searching, for classifying, for mining, etc. • Problems: • No catalog of all accessible URLs on the Web • Volume, latency, duplication, dynamicity, etc.
The Crawler • A program that downloads and stores web pages: • Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized. • From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. • This process is repeated until the crawler decides to stop.
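As a rough illustration (not code from the lecture), here is a minimal Python sketch of this loop. It uses a plain FIFO queue for the frontier and a crude regular-expression link extractor standing in for a real HTML parser; the page limit is an arbitrary stopping criterion.

```python
from collections import deque
from urllib.parse import urljoin
import re
import urllib.request

def crawl(seed_urls, max_pages=100):
    """Minimal crawl loop: fetch a URL, store the page, extract links, enqueue, repeat."""
    frontier = deque(seed_urls)              # S0: initial URLs, kept in a queue
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:   # crawler decides when to stop
        url = frontier.popleft()                 # get a URL (FIFO order here)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                             # skip pages that fail to download
        pages[url] = html                        # store the downloaded page
        # crude href extraction; a production crawler would use an HTML parser
        for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
            absolute = urljoin(url, href)        # resolve relative URLs
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```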
Crawling Issues • How to crawl? • Quality: “Best” pages first • Efficiency: Avoid duplication (or near duplication) • Etiquette: Robots.txt, Server load concerns • How much to crawl? How much to index? • Coverage: How big is the Web? How much do we cover? • Relative Coverage: How much do competitors have? • How often to crawl? • Freshness: How much has changed? • How much has really changed? (why is this a different question?)
Before discussing crawling policies… • Some implementation issues
HTML • HyperText Markup Language • Lets the author • specify layout and typeface • embed diagrams • create hyperlinks • a hyperlink is expressed as an anchor tag with an HREF attribute • HREF names another page using a Uniform Resource Locator (URL) • URL = • protocol field (“HTTP”) + • a server hostname (“www.cse.iitb.ac.in”) + • a file path (“/”, the ‘root’ of the published file system)
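To make these three URL fields concrete, a small Python sketch (the example URL on the www.cse.iitb.ac.in host is assumed, not taken from the slides) splits a URL into them with the standard library:

```python
from urllib.parse import urlparse

# A hyperlink is expressed as an anchor tag, e.g.
#   <a href="http://www.cse.iitb.ac.in/index.html">CSE home</a>
# Its HREF names another page with a URL made of three fields:
url = "http://www.cse.iitb.ac.in/index.html"
parts = urlparse(url)
print(parts.scheme)   # protocol field:  "http"
print(parts.netloc)   # server hostname: "www.cse.iitb.ac.in"
print(parts.path)     # file path:       "/index.html"
```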
HTTP (HyperText Transfer Protocol) • Built on top of the Transmission Control Protocol (TCP) • Steps (from the client end) • resolve the server host name to an Internet address (IP) • use the Domain Name System (DNS) • DNS is a distributed database of name-to-IP mappings maintained at a set of known servers • contact the server using TCP • connect to the default HTTP port (80) on the server • send the HTTP request header (e.g., GET) • fetch the response header • MIME (Multipurpose Internet Mail Extensions) • a meta-data standard for email and Web content transfer • fetch the HTML page
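A hedged Python sketch of these client-side steps with raw sockets; the host name and path are only examples, and error handling is omitted for brevity:

```python
import socket

host, path = "www.cse.iitb.ac.in", "/"          # example server and path

# 1. resolve the host name to an IP address (the resolver queries DNS)
ip = socket.gethostbyname(host)

# 2. contact the server over TCP on the default HTTP port (80)
sock = socket.create_connection((ip, 80), timeout=10)

# 3. send the HTTP request header (a simple GET)
request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
sock.sendall(request.encode("ascii"))

# 4. fetch the response: header (with MIME metadata such as Content-Type),
#    a blank line, then the HTML page itself
response = b""
while chunk := sock.recv(4096):
    response += chunk
sock.close()
header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("ascii", errors="replace"))
```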
Crawling procedure • Simple in principle • Great deal of engineering goes into industry-strength crawlers • Industry crawlers crawl a substantial fraction of the Web • E.g.: Google, Yahoo • No guarantee that all accessible Web pages will be located • Crawler may never halt… • pages will be added continually even as it is running
Crawling overheads • Delays involved in • Resolving the host name in the URL to an IP address using DNS • Connecting a socket to the server and sending the request • Receiving the requested page in response • Solution: Overlap the above delays by • fetching many pages at the same time
Anatomy of a crawler • Page fetching by (logical) threads • Starts with DNS resolution • Finishes when the entire page has been fetched • Each page • stored in compressed form to disk/tape • scanned for outlinks • Work pool of outlinks • maintain network utilization without overloading it • Dealt with by load manager • Continue till the crawler has collected a sufficient number of pages.
Large-scale crawlers: performance and reliability considerations • Need to fetch many pages at same time • utilize the network bandwidth • single page fetch may involve several seconds of network latency • Highly concurrent and parallelized DNS lookups • Multi-processing or multi-threading: impractical at low level • Use of asynchronous sockets • Explicit encoding of the state of a fetch context in a data structure • Polling socket to check for completion of network transfers • Care in URL extraction • Eliminating duplicates to reduce redundant fetches • Avoiding “spider traps”
DNS caching, pre-fetching and resolution • A customized DNS component with… • Custom client for address resolution • Caching server • Prefetching client
Custom client for address resolution • Tailored for concurrent handling of multiple outstanding requests • Allows issuing of many resolution requests together • polling at a later time for completion of individual requests • Facilitates load distribution among many DNS servers.
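One way such a client could behave, sketched in Python with a thread pool standing in for a custom asynchronous resolver (a real component would issue and poll its own DNS queries); `resolve_all` and its parameters are illustrative, not part of any standard crawler API:

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def resolve_all(hostnames, workers=50):
    """Issue many resolution requests together and collect the answers
    as they complete, rather than resolving one host at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(socket.gethostbyname, h): h for h in hostnames}
        for fut in as_completed(futures):        # poll later for completion
            host = futures[fut]
            try:
                results[host] = fut.result()
            except socket.gaierror:
                results[host] = None             # resolution failed
    return results
```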
Caching server • With a large cache, persistent across DNS restarts • Residing largely in memory if possible.
Prefetching client • Steps • Parse a page that has just been fetched • extract host names from HREF targets • Make DNS resolution requests to the caching server • Usually implemented using UDP • User Datagram Protocol • connectionless, packet-based communication protocol • does not guarantee packet delivery • Does not wait for resolution to be completed.
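A rough Python approximation of the prefetching step; a fire-and-forget thread pool stands in here for the UDP-based resolver described above, and `prefetch_dns` is a made-up helper name:

```python
import re
import socket
from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor

# After a page is fetched, extract the host names in its HREF targets and
# ask the (caching) resolver for them without waiting for the answers, so
# the cache is already warm when those URLs come up for crawling.
_prefetch_pool = ThreadPoolExecutor(max_workers=20)

def prefetch_dns(base_url, html):
    hosts = set()
    for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
        host = urlparse(urljoin(base_url, href)).hostname
        if host:
            hosts.add(host)
    for host in hosts:
        # fire and forget: do not wait for resolution to complete
        _prefetch_pool.submit(socket.gethostbyname, host)
```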
Multiple concurrent fetches • Managing multiple concurrent connections • A single download may take several seconds • Open many socket connections to different HTTP servers simultaneously • Multi-CPU machines not useful • crawling performance limited by network and disk • Two approaches • using multi-threading • using non-blocking sockets with event handlers
Multi-threading • threads • physical thread of control provided by the operating system (E.g.: pthreads) OR • concurrent processes • fixed number of threads allocated in advance • programming paradigm • create a client socket • connect the socket to the HTTP service on a server • Send the HTTP request header • read the socket (recv) until • no more characters are available • close the socket. • use blocking system calls
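A simplified Python sketch of this paradigm at the level of the work pool: a fixed number of threads, allocated in advance, draw URLs from a shared queue and perform blocking fetches (urllib replaces the raw-socket steps above for brevity). Names such as `threaded_fetch` are illustrative only.

```python
import queue
import threading
import urllib.request

def worker(work_pool, results, lock):
    """Each thread repeatedly takes a URL and performs a blocking fetch."""
    while True:
        try:
            url = work_pool.get_nowait()
        except queue.Empty:
            return
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                page = resp.read()             # blocking read until the page is done
        except Exception:
            page = None
        with lock:                             # mutual exclusion on shared state
            results[url] = page
        work_pool.task_done()

def threaded_fetch(urls, num_threads=10):
    work_pool = queue.Queue()
    for u in urls:
        work_pool.put(u)
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(work_pool, results, lock))
               for _ in range(num_threads)]    # fixed number of threads, allocated in advance
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```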
Multi-threading: Problems • performance penalty • mutual exclusion • concurrent access to data structures • slow disk seeks. • great deal of interleaved, random input-output on disk • Due to concurrent modification of document repository by multiple threads
Non-blocking sockets and event handlers • non-blocking sockets • connect, send or recv call returns immediately without waiting for the network operation to complete. • poll the status of the network operation separately • “select” system call • lets application suspend until more data can be read from or written to the socket • timing out after a pre-specified deadline • Monitor polls several sockets at the same time • More efficient memory management • code that completes processing not interrupted by other completions • No need for locks and semaphores on the pool • only append complete pages to the log
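A compact Python sketch of the event-driven alternative: one thread drives several non-blocking sockets, polling them with `select` instead of blocking on each in turn. It assumes plain HTTP on port 80 and skips error handling and partial-send corner cases.

```python
import select
import socket

def fetch_many_nonblocking(targets, timeout=15):
    """targets: list of (host, path) pairs. Returns {host: raw response bytes}."""
    contexts = {}                                  # socket -> state of this fetch
    for host, path in targets:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)                       # connect returns immediately
        try:
            s.connect((host, 80))
        except BlockingIOError:
            pass                                   # connection still in progress
        req = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        contexts[s] = {"request": req, "sent": False, "data": b"", "host": host}

    results = {}
    while contexts:
        writers = [s for s, c in contexts.items() if not c["sent"]]
        readers = [s for s, c in contexts.items() if c["sent"]]
        readable, writable, _ = select.select(readers, writers, [], timeout)
        if not (readable or writable):
            break                                  # nothing progressed before the deadline
        for s in writable:
            s.sendall(contexts[s]["request"])      # connection established: send request
            contexts[s]["sent"] = True
        for s in readable:
            ctx = contexts[s]
            chunk = s.recv(4096)
            if chunk:
                ctx["data"] += chunk
            else:                                  # server closed: page is complete
                results[ctx["host"]] = ctx["data"]
                s.close()
                del contexts[s]
    return results
```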
Link extraction and normalization • Goal: obtaining a canonical form of each URL • URL processing and filtering • Avoid multiple fetches of pages known by different URLs • many IP addresses per host name • for load balancing on large sites • mirrored contents / contents on the same file system • “proxy pass” • mapping of different host names to a single IP address • needed to publish many logical sites • Relative URLs • need to be interpreted w.r.t. a base URL
Canonical URL • Formed by • Using a standard string for the protocol • Canonicalizing the host name • Adding an explicit port number • Normalizing and cleaning up the path
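One possible canonicalization routine, sketched in Python; the exact normalization rules differ between crawlers, so treat this as an assumption-laden example rather than a standard:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Standard lower-case protocol string, lower-case host name,
    explicit port number, and a normalized, cleaned-up path."""
    parts = urlsplit(url)
    scheme = (parts.scheme or "http").lower()              # standard protocol string
    host = (parts.hostname or "").lower()                  # canonical host name
    port = parts.port or (443 if scheme == "https" else 80)  # explicit port number
    path = posixpath.normpath(parts.path or "/")           # clean up the path
    if path in (".", ""):
        path = "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

# canonicalize("HTTP://WWW.cse.iitb.ac.in/a/./b/../c")
#   -> "http://www.cse.iitb.ac.in:80/a/c"
```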
Robot exclusion • Check • whether the server prohibits crawling a normalized URL • In robots.txt file in the HTTP root directory of the server • specifies a list of path prefixes which crawlers should not attempt to fetch. • Meant for crawlers only
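Python's standard library ships a robots.txt parser, so a check along these lines is straightforward; the "MyCrawler" user-agent string is a placeholder:

```python
from urllib import robotparser
from urllib.parse import urlsplit

def allowed_by_robots(url, user_agent="MyCrawler"):
    """Consult the server's robots.txt before fetching a normalized URL."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")  # robots.txt at the HTTP root
    try:
        rp.read()                       # fetch and parse the exclusion rules
    except Exception:
        return True                     # no readable robots.txt: assume crawling is allowed
    return rp.can_fetch(user_agent, url)
```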
Eliminating already-visited URLs • Checking if a URL has already been fetched • Before adding a new URL to the work pool • Needs to be very quick. • Achieved by computing MD5 hash function on the URL • Exploiting spatio-temporal locality of access • Two-level hash function. • most significant bits (say, 24) derived by hashing the host name plus port • lower order bits (say, 40) derived by hashing the path • concatenated bits used as a key in a B-tree • qualifying URLs added to frontier of the crawl. • hash values added to B-tree.
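A sketch of the two-level key, assuming MD5 as above and an in-memory set standing in for the B-tree; the 24/40-bit split follows the numbers given on the slide:

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """Two-level hash: 24 most significant bits from the host name plus port,
    40 lower-order bits from the path, concatenated into one 64-bit key.
    URLs from the same server share a key prefix, so lookups exploit the
    crawl's spatio-temporal locality of access."""
    parts = urlsplit(url)
    host_part = f"{parts.hostname}:{parts.port or 80}".encode()
    path_part = (parts.path or "/").encode()
    host_bits = int.from_bytes(hashlib.md5(host_part).digest(), "big") >> (128 - 24)
    path_bits = int.from_bytes(hashlib.md5(path_part).digest(), "big") >> (128 - 40)
    return (host_bits << 40) | path_bits

seen = set()                            # stand-in for the on-disk B-tree of hash values
def already_visited(url):
    key = url_key(url)
    if key in seen:
        return True
    seen.add(key)                       # qualifying URL: record it and add to the frontier
    return False
```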
Spider traps • Protecting from crashing on • Ill-formed HTML • E.g.: page with 68 kB of null characters • Misleading sites • indefinite number of pages dynamically generated by CGI scripts • paths of arbitrary depth created using soft directory links and path remapping features in HTTP server
Spider Traps: Solutions • No automatic technique can be foolproof • Check for URL length • Guards • Preparing regular crawl statistics • Adding dominating sites to guard module • Disable crawling active content such as CGI form queries • Eliminate URLs with non-textual data types
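A few of these guards could be combined along the following lines; every threshold here (URL length, per-site budget, extension list) is a made-up illustration, and, as noted above, none of it is foolproof:

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256          # hypothetical limit; traps often show up as very long URLs
MAX_PAGES_PER_SITE = 10_000   # hypothetical per-site budget derived from crawl statistics
NON_TEXTUAL = (".jpg", ".png", ".gif", ".zip", ".exe")

pages_per_site = Counter()    # updated elsewhere from regular crawl statistics
blocked_sites = set()         # dominating sites added to the guard module

def passes_guards(url):
    """A few cheap checks applied before a URL enters the frontier."""
    parts = urlsplit(url)
    if len(url) > MAX_URL_LENGTH:
        return False                                  # suspiciously long URL
    if parts.hostname in blocked_sites:
        return False                                  # site already dominates the crawl
    if pages_per_site[parts.hostname] > MAX_PAGES_PER_SITE:
        blocked_sites.add(parts.hostname)
        return False
    if "?" in url or "cgi-bin" in parts.path:
        return False                                  # skip active content / CGI form queries
    if parts.path.lower().endswith(NON_TEXTUAL):
        return False                                  # eliminate non-textual data types
    return True
```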
Avoiding repeated expansion of links on duplicate pages • Reduce redundancy in crawls • Duplicate detection • Mirrored Web pages and sites • Detecting exact duplicates • Checking against MD5 digests of stored URLs • Representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v) • Detecting near-duplicates • Even a single altered character will completely change the digest! • E.g.: date of update / name and email of the site administrator
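A minimal sketch of the exact-duplicate check via MD5 content digests; as noted above, this breaks down for near-duplicates, which call for shingling or similar fingerprinting techniques instead (not shown):

```python
import hashlib

page_digests = {}   # MD5 digest of page content -> first URL seen with that content

def is_exact_duplicate(url, content):
    """Identical pages (e.g. mirrors) hash to the same digest, so their
    outlinks are expanded only once; a relative link v then corresponds to
    the same work item (digest, v) regardless of which alias produced it."""
    digest = hashlib.md5(content).hexdigest()
    if digest in page_digests:
        return True
    page_digests[digest] = url
    return False
```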
Load monitor • Keeps track of various system statistics • Recent performance of the wide area network (WAN) connection • E.g.: latency and bandwidth estimates. • Operator-provided/estimated upper bound on open sockets for a crawler • Current number of active sockets.
Thread manager • Responsible for • Choosing units of work from frontier • Scheduling issue of network resources • Distribution of these requests over multiple ISPs if appropriate. • Uses statistics from load monitor
Per-server work queues • Denial of service (DoS) attacks • servers limit the speed or frequency of responses to any fixed client IP address • Avoiding looking like a DoS attack • limit the number of active requests to a given server IP address at any time • maintain a queue of requests for each server • use the HTTP/1.1 persistent socket capability • Distribute attention relatively evenly between a large number of sites • Access locality vs. politeness dilemma
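A single-threaded Python sketch of per-server queues with a politeness delay; the two-second delay and the class name are arbitrary choices for illustration, and persistent HTTP/1.1 connections are not modeled:

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """One FIFO queue per server, plus a heap of (next_allowed_time, host),
    so no single server is contacted more often than `delay` seconds apart
    and attention is spread across many sites."""
    def __init__(self, delay=2.0):
        self.delay = delay
        self.queues = defaultdict(deque)     # host -> pending URLs for that server
        self.ready = []                      # (time when host may be contacted again, host)

    def add(self, url):
        host = urlsplit(url).hostname
        if not self.queues[host]:
            heapq.heappush(self.ready, (time.monotonic(), host))
        self.queues[host].append(url)

    def next_url(self):
        while self.ready:
            not_before, host = heapq.heappop(self.ready)
            if not self.queues[host]:
                continue                     # queue drained in the meantime
            wait = not_before - time.monotonic()
            if wait > 0:
                time.sleep(wait)             # politeness: respect the per-host delay
            url = self.queues[host].popleft()
            if self.queues[host]:
                heapq.heappush(self.ready, (time.monotonic() + self.delay, host))
            return url
        return None
```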
Crawling Issues • How to crawl? • Quality: “Best” pages first • Efficiency: Avoid duplication (or near duplication) • Etiquette: Robots.txt, Server load concerns • How much to crawl? How much to index? • Coverage: How big is the Web? How much do we cover? • Relative Coverage: How much do competitors have? • How often to crawl? • Freshness: How much has changed? • How much has really changed? (why is this a different question?)
Crawl Order • Want best pages first • Potential quality measures: • Final In-degree • Final PageRank • Crawl heuristics: • Breadth First Search (BFS) • Partial Indegree • Partial PageRank • Random walk
Breadth-First Crawl • Basic idea: • start at a set of known URLs • explore in “concentric circles” around these URLs • [diagram: start pages, distance-one pages, distance-two pages] • used by broad web search engines • balances load between servers
Web Wide Crawl (328M pages) [Najo01] • [Figure omitted] • BFS crawling brings in high-quality pages early in the crawl
Stanford Web Base (179K pages) [Cho98] • [Plot: overlap with the best x% of pages by in-degree vs. the x% crawled, for each ordering metric O(u)]
Queue of URLs to be fetched • What constraints dictate which queued URL is fetched next? • Politeness: don’t hit a server too often, even from different threads of your spider • How far into a site you’ve crawled already • most sites: stay within ≤ 5 levels of the URL hierarchy • Which URLs are most promising for building a high-quality corpus • This is a graph traversal problem: given a directed graph you’ve partially visited, where do you visit next?
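A toy Python frontier that folds two of these constraints, a depth limit and a crude "promise" score from in-link counts, into one priority queue; the scoring formula and names are invented purely for illustration:

```python
import heapq
from urllib.parse import urlsplit

class PriorityFrontier:
    """Order queued URLs by a composite score: prefer shallow URLs
    (few path levels) and URLs with many known in-links."""
    MAX_DEPTH = 5                            # stay within ~5 levels of the URL hierarchy

    def __init__(self):
        self.heap = []
        self.inlinks = {}                    # url -> number of links seen to it so far
        self.done = set()                    # URLs already handed out

    def add(self, url):
        self.inlinks[url] = self.inlinks.get(url, 0) + 1
        depth = urlsplit(url).path.count("/")
        if depth > self.MAX_DEPTH:
            return
        score = depth - self.inlinks[url]    # lower is better: shallow and well linked
        heapq.heappush(self.heap, (score, url))

    def pop(self):
        while self.heap:
            _, url = heapq.heappop(self.heap)
            if url not in self.done:         # skip stale entries from earlier pushes
                self.done.add(url)
                return url
        return None
```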
Where do we crawl next? • Complex scheduling optimization problem, subject to constraints • Plus operational constraints (e.g., keeping all machines load-balanced) • Scientific study – limited to specific aspects • Which ones? • What do we measure? • What are the compromises in distributed crawling?
Page selection • Importance metric • Web crawler model • Crawler method for choosing page to download
Importance Metrics • Given a page P, define how “good” that page is • Several metric types: • Interest driven • Popularity driven • Location driven • Combined