CRAWLER DESIGN YÜCEL SAYGIN These slides are based on the book “Mining the Web” by Soumen Chakrabarti Refer to “Crawling the Web” Chapter for more information
Challenges • The amount of information • In 1994 the World Wide Web Worm indexed 110K pages • In 1997: millions of pages • In 2004: billions of pages • In 2010: ???? of pages • Complexity of the link graph
Basics • HTTP: Hypertext Transfer Protocol • TCP: Transmission Control Protocol • IP: Internet Protocol • HTML: Hypertext Markup Language • URL: Uniform Resource Locator • Example link: <a href="http://www.cse.iitb.ac.in/">The IIT Bombay Computer Science Department</a> • In this URL the protocol is http, the server host name is www.cse.iitb.ac.in, and the file path is /
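The three parts of the URL named on this slide can be pulled apart with Python's standard urllib.parse module. This is a minimal sketch; the helper name split_url is an illustrative assumption, not something from the slides.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return (protocol, server host name, file path) for an absolute URL."""
    parts = urlsplit(url)
    # An empty path means the server root, which HTTP requests as "/".
    return parts.scheme, parts.hostname, parts.path or "/"

protocol, host, path = split_url("http://www.cse.iitb.ac.in/")
print(protocol, host, path)
```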
Basics • A click on the hyperlink is converted to a network request by the browser • The browser then fetches and displays the web page pointed to by the URL • The server host name (like www.cse.iitb.ac.in) needs to be translated into an IP address such as 144.16.111.14 to contact the server using TCP
Basics • DNS (Domain Name System) is a distributed database of name-to-IP address mappings • This database is maintained by known servers • A click on the hyperlink is translated into the equivalent of • telnet www.cse.iitb.ac.in 80 • 80 is the default HTTP port
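What the telnet session above does by hand can be sketched with Python's socket module: look up the IP address, open a TCP connection to port 80, send a request, and read until the server closes the connection. The function names build_request and fetch are illustrative assumptions, and the request is a bare HTTP/1.0 GET with no error handling.

```python
import socket

def build_request(host, path="/"):
    """Build the bytes of a minimal HTTP/1.0 GET request."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")

def fetch(host, path="/", port=80, timeout=10.0):
    """Resolve the host name, connect over TCP, and return the raw response."""
    ip_address = socket.gethostbyname(host)  # DNS name-to-IP lookup
    with socket.create_connection((ip_address, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: end of response
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("www.cse.iitb.ac.in") needs network access; the response starts with the status line and headers, followed by the page body.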
MIME Header MIME: Multipurpose Internet Mail Extensions, a standard for email and web content transfer.
Crawling • There is no directory of all accessible URLs • The main strategy is to • Start from a set of seed web pages • Extract URLs from those pages • Apply the same technique to the pages at those URLs • It may not be possible to retrieve all the pages on the Web with this technique, since new pages are added every day • Use a queue structure and mark visited nodes
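The queue-and-visited-set strategy on this slide can be sketched as a breadth-first traversal. To keep the sketch runnable without a network, page fetching and link extraction are replaced by a caller-supplied get_links function, and TOY_WEB is a made-up link graph; both are assumptions for illustration.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(seeds)
    visited = set(seeds)  # mark on enqueue so each URL is queued at most once
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

# Hypothetical in-memory link graph standing in for fetched pages.
TOY_WEB = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

The max_pages cap is there because a real crawl never runs out of new URLs on its own.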
Crawling • Writing a basic crawler is easy • Writing a large-scale crawler is challenging • The basic steps of crawling are • URL to IP conversion using the DNS server • Socket connection to the server and sending the request • Receiving the requested page • For small pages, the DNS lookup and socket connection take more time than receiving the requested page • We need to overlap the processing and waiting times for the above three steps
Crawling • Storage requirements are huge • Need to store the list of URLs and the retrieved pages on disk • Storing the URLs on disk is also needed for persistence • Pages are stored in compressed form (Google uses zlib for compression, about 3 to 1)
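Compressing a page before it is written to disk can be sketched with Python's zlib module. The helper names store_page and load_page are assumptions, and the ratio achieved depends on the page content, so no particular ratio is claimed here.

```python
import zlib

def store_page(html_text):
    """Compress page text into bytes suitable for writing to disk."""
    return zlib.compress(html_text.encode("utf-8"), 9)  # 9 = best compression

def load_page(blob):
    """Recover the original page text from its compressed form."""
    return zlib.decompress(blob).decode("utf-8")

page = "<html><body>" + "<p>crawler design</p>" * 200 + "</body></html>"
blob = store_page(page)
print(len(page.encode("utf-8")), "bytes before,", len(blob), "bytes after")
```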
Large Scale Crawler Tips • Fetch hundreds of pages at the same time to increase bandwidth utilization • Use more than one DNS server for concurrent DNS lookup • Using asynchronous sockets is better than multi-threading • Eliminate duplicates to reduce the number of redundant fetches and to avoid spider traps (infinite set of fake URLs)
DNS Caching • Address mapping is a significant bottleneck • A crawler can generate more requests per unit time than a DNS server can handle • Caching the DNS entries helps • DNS cache needs to be refreshed periodically (whenever it is idle)
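A DNS cache can be sketched as a dictionary from host name to address with an expiry time. The slide only says entries need periodic refreshing; the fixed time-to-live used here, the class name DnsCache, and the injectable resolver and clock are all assumptions made so the sketch can run without a network.

```python
import socket
import time

class DnsCache:
    """Cache name-to-IP mappings and re-resolve entries older than the TTL."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=3600.0,
                 clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # host -> (address, time the entry was stored)

    def lookup(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh cached answer, no DNS request made
        address = self._resolver(host)  # cache miss or stale entry
        self._entries[host] = (address, now)
        return address
```

With the defaults, DnsCache().lookup("www.cse.iitb.ac.in") performs a real lookup the first time and answers from memory for the next hour.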
Concurrent page requests • Can be achieved by • Multithreading • Non-blocking sockets with event handlers • Multithreading • A set of threads is created • After the server name is translated to an IP address, • A thread creates a client socket • Connects to the HTTP service on the server • Sends the HTTP request header • Reads the socket until EOF • Closes the socket • Blocking system calls are used to suspend the thread until the requested data is available
Multithreading • A fixed number of worker threads share a work-queue of pages to fetch • Handling concurrent access to data structures is a problem: mutual exclusion needs to be handled properly • Disk access cannot be orchestrated when multiple concurrent threads are used • Non-blocking sockets could be a better approach!
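The shared work-queue design can be sketched with Python's queue and threading modules. The fetch argument is a caller-supplied blocking function (an assumption standing in for the socket code), and the lock around the results dictionary shows the mutual exclusion the slide warns about.

```python
import queue
import threading

def fetch_all(urls, fetch, num_workers=4):
    """Fetch every URL with a fixed pool of worker threads."""
    work = queue.Queue()
    for url in urls:
        work.put(url)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread finish
            page = fetch(url)  # blocking call; other threads keep running
            with results_lock:  # one thread at a time updates shared state
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```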
Non-blocking sockets • Connect, send, and receive calls return immediately without blocking for network data • The status of the network can be polled later on • The "select" system call lets the application wait for data to be available on the socket • This way completion of page fetching is serialized • No need for locks or semaphores • Can append the pages to the file on disk without being interrupted by other writers
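A single-threaded select loop over several non-blocking sockets can be sketched as follows. The function name read_all is an assumption, and the sketch covers only the receive side: it expects sockets that are already connected and have had their requests sent.

```python
import select
import socket

def read_all(socks, timeout=5.0):
    """Read every socket to EOF from one thread; returns {socket: bytes}."""
    buffers = {sock: bytearray() for sock in socks}
    pending = set(socks)
    for sock in pending:
        sock.setblocking(False)  # recv now returns at once instead of waiting
    while pending:
        readable, _, _ = select.select(list(pending), [], [], timeout)
        if not readable:
            break  # nothing arrived within the timeout, give up
        for sock in readable:
            try:
                data = sock.recv(4096)
            except BlockingIOError:
                continue  # reported readable but no data yet, retry later
            if data:
                buffers[sock].extend(data)
            else:
                pending.discard(sock)  # empty read means EOF: page complete
    return {sock: bytes(buf) for sock, buf in buffers.items()}
```

Because only one thread touches the buffers, each completed page can be appended to the repository file without any locking.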
Link Extraction and Normalization • An HTML page is searched for links to add to the work-pool • URLs extracted from pages need to be preprocessed before they are added to the work-pool • Duplicate elimination is necessary but difficult • The mapping between hostnames and IP addresses is many-to-many, i.e., a computer may have many IP addresses and many hostnames • Extracted URLs are converted to canonical form by • Using the canonical hostname provided by the DNS response • Adding an explicit port number • Converting the relative addresses to absolute addresses
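Two of the three canonicalization steps, resolving relative addresses and adding an explicit port, can be sketched with urllib.parse. The function name canonicalize is an assumption. Substituting the canonical hostname from the DNS response is left out because it needs a live lookup, and dropping the fragment is an extra step not listed on the slide.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(base_url, link):
    """Resolve link against base_url and return a normalized absolute URL."""
    absolute = urljoin(base_url, link)  # relative -> absolute
    parts = urlsplit(absolute)
    host = parts.hostname or ""  # urlsplit lowercases the host name
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = f"{host}:{port}" if port is not None else host
    path = parts.path or "/"
    # The fragment only names a position inside the page, so it is dropped.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

print(canonicalize("http://www.cse.iitb.ac.in/dept/index.html", "../people/"))
```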
Some more tips • A server may disallow crawling using "robots.txt" found in the HTTP root directory • robots.txt specifies a list of path prefixes that crawlers should not try to fetch
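Checking a URL against robots.txt can be sketched with Python's urllib.robotparser. The file content in SAMPLE_ROBOTS_TXT and the agent name "example-crawler" are made up for illustration; a real crawler would download /robots.txt from each server before fetching anything else from it.

```python
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_txt, user_agent="example-crawler"):
    """Return a function that says whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content listing disallowed path prefixes.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

allowed = make_robots_checker(SAMPLE_ROBOTS_TXT)
print(allowed("http://www.cse.iitb.ac.in/index.html"))
print(allowed("http://www.cse.iitb.ac.in/private/notes.html"))
```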
Eliminating already visited URLs • The IsUrlVisited module in the architecture does that job • The same page could be linked from many different sites • Checking if the page is already visited eliminates redundant page requests • Comparing the URL strings may take a long time since it involves disk access and checking against all the stored URLs
Eliminating already visited URLs • Duplicate checking is done by applying a hash function, MD5, originally designed for digital signature applications • The MD5 algorithm takes a message of arbitrary length as input and produces a 128-bit "fingerprint" or "message digest" as output • “it is computationally infeasible to produce two messages having the same message digest” • http://www.w3.org/TR/1998/REC-DSig-label/MD5-1_0 • Even the hashed URLs need to be stored on disk due to storage and persistence requirements • Spatial and temporal locality of URL access means fewer disk accesses when URL hashes are cached
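The fingerprint check can be sketched with Python's hashlib. The class name VisitedUrls is an assumption, and the set is kept in memory for brevity, whereas the slide notes that a real crawler has to keep the hashes on disk with a cache in front.

```python
import hashlib

def url_fingerprint(url):
    """Return the 128-bit (16-byte) MD5 digest of a URL string."""
    return hashlib.md5(url.encode("utf-8")).digest()

class VisitedUrls:
    """Remember fixed-size fingerprints instead of full URL strings."""

    def __init__(self):
        self._seen = set()

    def check_and_add(self, url):
        """Return True if the URL was seen before; otherwise record it."""
        fingerprint = url_fingerprint(url)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```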
Eliminating already visited URLs • We need to utilize spatial locality as much as possible • But MD5 will distribute similar URL strings uniformly over its output range, which destroys that locality • A two-block or two-level hash function is used • Use different hash functions for the host address and the path • A B-tree could be used to index the host name, since a retrieved page will mostly contain URLs on the same host
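A two-level key can be sketched by hashing the host and the path separately and concatenating the results, so that all URLs on one host share a key prefix and sort next to each other in a B-tree. The choice of MD5 for the host and SHA-1 for the path, the 4-byte and 8-byte lengths, and the name two_level_key are all assumptions; the slides do not say which functions or sizes are used.

```python
import hashlib
from urllib.parse import urlsplit

def two_level_key(url, host_bytes=4, path_bytes=8):
    """Return host-hash prefix + path-hash suffix for a URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").encode("utf-8")
    path = (parts.path or "/").encode("utf-8")
    host_hash = hashlib.md5(host).digest()[:host_bytes]    # first level
    path_hash = hashlib.sha1(path).digest()[:path_bytes]   # second level
    return host_hash + path_hash

key_a = two_level_key("http://www.cse.iitb.ac.in/a.html")
key_b = two_level_key("http://www.cse.iitb.ac.in/b.html")
print(key_a[:4] == key_b[:4])  # same host, so the prefixes match
```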
Spider Traps • Malicious pages designed to crash the crawlers • Simply add 64K of null characters in the middle of URL to crash the lexical analyzer • Infinitely deep web sites • Using dynamically generated links via CGI scripts • Need to check the link length • No technique is foolproof • Generate periodic statistics for the crawler to eliminate dominating sites • Disable crawling active content
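A link-length check of the kind mentioned above can be sketched as a filter applied before a URL enters the work-pool. The thresholds MAX_URL_LENGTH and MAX_PATH_DEPTH and the name looks_like_trap are assumptions chosen for illustration, and, as the slide says, no such check is foolproof.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048  # assumed limit, not taken from the slides
MAX_PATH_DEPTH = 16    # assumed limit, not taken from the slides

def looks_like_trap(url, max_length=MAX_URL_LENGTH, max_depth=MAX_PATH_DEPTH):
    """Flag URLs that are too long, contain null characters, or nest too deep."""
    if len(url) > max_length or "\x00" in url:
        return True
    segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    return len(segments) > max_depth
```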
Avoiding duplicate pages • A page can be accessed via different URLs • Eliminating duplicate pages will also help eliminate spider traps • MD5 can be used for that purpose • Minor changes cannot be handled with MD5, since any change to the page produces a different digest • Can divide the page into blocks and hash each block
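Hashing a page block by block can be sketched as follows: two pages that differ in one small place still share most of their block digests. Splitting into fixed groups of lines, the block size of 4, and the function names are assumptions, since the slide does not say how blocks are formed.

```python
import hashlib

def block_digests(text, block_size=4):
    """MD5 digest of each consecutive block of block_size lines."""
    lines = text.splitlines()
    return [
        hashlib.md5("\n".join(lines[i:i + block_size]).encode("utf-8")).hexdigest()
        for i in range(0, len(lines), block_size)
    ]

def shared_block_fraction(page_a, page_b, block_size=4):
    """Fraction of distinct block digests that the two pages have in common."""
    a = set(block_digests(page_a, block_size))
    b = set(block_digests(page_b, block_size))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = "\n".join(f"line {i}" for i in range(12))
edited = original.replace("line 11", "line eleven")
print(shared_block_fraction(original, edited))
```

A whole-page MD5 would report these two pages as simply different, while the block view shows that two of the three blocks are unchanged.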
Denial of Service • HTTP servers protect themselves against denial of service (DoS) attacks • DoS attacks send frequent requests to the same server to slow down its operation • Therefore frequent requests from the same IP are prohibited • Crawlers need to respect such limits, both as a courtesy and to avoid legal action • Need to limit the active requests to a given server IP address at any time • Maintain a queue of requests for each server • This will also reduce the effect of spider traps
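Per-server request queues with a minimum gap between requests can be sketched as follows. The class name PoliteFrontier and the one-second default delay are assumptions, and the queues are keyed by host name for simplicity where the slide speaks of server IP addresses.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """One queue per server; a server is contacted at most once per min_delay."""

    def __init__(self, min_delay=1.0):
        self._queues = defaultdict(deque)  # host -> URLs waiting to be fetched
        self._next_allowed = {}            # host -> earliest next request time
        self._min_delay = min_delay

    def add(self, url):
        self._queues[urlsplit(url).hostname].append(url)

    def next_url(self, now):
        """Return a URL whose server may be contacted at time now, or None."""
        for host, pending in self._queues.items():
            if pending and self._next_allowed.get(host, float("-inf")) <= now:
                self._next_allowed[host] = now + self._min_delay
                return pending.popleft()
        return None
```

A site that generates endless links only fills its own queue, so it cannot crowd out requests to other servers.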
Text Repository • The pages that are fetched are dumped into a text repository • The text repository is very large • Needs to be compressed (Google uses zlib for about 3-to-1 compression) • Google implements its own file system • Berkeley DB (www.sleepycat.com) can also be used • Stores a database within a single file • Provides several access methods such as B-tree or sequential
Refreshing Crawled Pages • HTTP could be used to check if a page has changed since the last time it was crawled • But using HTTP to check if a page is modified takes a lot of time • If a page expires after a certain time, this could be extracted from the HTTP header • Suppose we had a score that reflects the probability that a page has changed since the last time it was visited • We can sort the pages with respect to that score and crawl them in that order • Use the past behavior to model the future!
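One way to turn past behavior into a score is to estimate the change probability from how often earlier visits found the page modified. The smoothed frequency estimate below, the function names, and the sample histories are all assumptions; the slides do not specify a scoring model.

```python
def change_score(history):
    """Estimate the probability of change from past visits.

    history is a list of booleans, True where a visit found the page changed.
    Add-one smoothing keeps pages with little history away from 0 and 1.
    """
    return (sum(history) + 1) / (len(history) + 2)

def refresh_order(histories):
    """Return URLs sorted from most to least likely to have changed."""
    return sorted(histories, key=lambda url: change_score(histories[url]),
                  reverse=True)

# Hypothetical visit histories for three pages.
HISTORIES = {
    "about": [False, False, False],
    "news": [True, True, True],
    "blog": [True, False],
}
print(refresh_order(HISTORIES))
```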
Your crawler • Use w3c-libwww API to implement your crawler • Start from a very simple implementation and go on from that! • Sample codes and algorithms are provided in the handouts