
Web Crawler & Distributed IR



Presentation Transcript


  1. Web Crawler & Distributed IR Rong Jin

  2. A Basic Crawler • Initialize queue with URLs of known seed pages • Repeat • Take URL from queue • Fetch and parse page • Extract URLs from page • Add URLs to queue • Fundamental assumption • The web is well linked.
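A minimal sketch of this loop in Python, assuming the third-party requests and BeautifulSoup libraries are available; the max_pages cap and the error handling are illustrative choices, not part of the slide:

    import requests
    from bs4 import BeautifulSoup
    from collections import deque
    from urllib.parse import urljoin

    def basic_crawl(seed_urls, max_pages=100):
        queue = deque(seed_urls)        # initialize queue with seed URLs
        seen = set(seed_urls)           # avoid re-queueing the same URL
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()       # take URL from queue
            try:
                page = requests.get(url, timeout=10)        # fetch page
            except requests.RequestException:
                continue
            fetched += 1
            soup = BeautifulSoup(page.text, "html.parser")  # parse page
            for link in soup.find_all("a", href=True):      # extract URLs from page
                absolute = urljoin(url, link["href"])
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)                  # add URLs to queue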

  3. Challenges • Fetch 1,000,000,000 pages in one month • almost 400 pages per second! • Latency/bandwidth • Politeness – don’t hit a server too often • Duplicates • Spider traps • Malicious server that generates an infinite sequence of linked pages • Sophisticated spider traps generate pages that are not easily identified as dynamic.
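For reference, the arithmetic behind that rate: 1,000,000,000 pages ÷ (30 days × 86,400 seconds/day) = 1,000,000,000 ÷ 2,592,000 ≈ 386 pages per second.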

  4. What Should a Crawler Do? • Be polite • Don’t hit a site too often • Only crawl pages you are allowed to crawl: robots.txt • Be robust • Be immune to spider traps, duplicates, very large pages, very large websites, dynamic pages, etc.

  5. robots.txt • Protocol for giving crawlers (“robots”) limited access to a website, originally from 1994 • Examples:
    User-agent: *
    Disallow: /private/
    Allow: /other/public/

    User-agent: favoredcrawler
    Disallow:
  • Important: cache the robots.txt file of each site we are crawling
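A minimal sketch of checking these rules with Python's standard urllib.robotparser, feeding it the example rules from this slide (example.com and the paths are placeholders):

    from urllib import robotparser

    rules = """\
    User-agent: *
    Disallow: /private/
    Allow: /other/public/

    User-agent: favoredcrawler
    Disallow:
    """

    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())   # in a real crawler: rp.set_url(...); rp.read(); cache per site

    print(rp.can_fetch("*", "http://example.com/private/page.html"))      # False: disallowed
    print(rp.can_fetch("favoredcrawler", "http://example.com/private/"))  # True: no restrictions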

  6. What Should a Crawler Do? • Be capable of distributed operation • Be scalable: need to be able to increase the crawl rate by adding more machines • Fetch pages of higher quality or dynamic pages first • Continuous operation: get fresh versions of already crawled pages

  7. Basic Crawler Architecture

  8. Basic Processing Steps • Pick a URL from the frontier • Fetch the document at the URL • Check if the document has content already seen (if yes: skip following steps) • Index document • Parse the document and extract URLs to other docs • For each extracted URL: • Does it fail certain tests (e.g., spam)? Yes: skip • Already in the frontier? Yes: skip
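These steps can be collected into one function. A sketch only: the frontier and index objects and the fetch, fingerprint, extract_urls, and url_ok callables are assumed to be supplied by the rest of the crawler and are not defined here:

    def process_one_url(frontier, seen_fingerprints, index,
                        fetch, fingerprint, extract_urls, url_ok):
        # One iteration of the processing loop; the helpers are assumed, not defined here.
        url = frontier.pop()                 # 1. pick a URL from the frontier
        doc = fetch(url)                     # 2. fetch the document at that URL
        fp = fingerprint(doc)                # 3. content-seen test
        if fp in seen_fingerprints:
            return                           #    content already indexed: skip
        seen_fingerprints.add(fp)
        index.add(url, doc)                  # 4. index the document
        for out_url in extract_urls(doc):    # 5. parse and extract outgoing URLs
            if not url_ok(out_url):          # 6a. fails URL tests (e.g., spam): skip
                continue
            if out_url in frontier:          # 6b. already in the frontier: skip
                continue
            frontier.add(out_url)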

  9. URL Normalization • Some URLs extracted from a document are relative URLs. • E.g., at http://mit.edu, we may have /aboutsite.html • This is the same as: http://mit.edu/aboutsite.html • During parsing, we must normalize (expand) all relative URLs.
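Relative-URL expansion can be done with urllib.parse.urljoin from the Python standard library; using the slide's example:

    from urllib.parse import urljoin

    print(urljoin("http://mit.edu", "/aboutsite.html"))
    # -> http://mit.edu/aboutsite.html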

  10. Content Seen • For each page fetched: check if the content is already in the index • Check this using document fingerprints or shingles • Skip documents whose content has already been indexed
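A minimal sketch of both checks in Python: an exact-duplicate fingerprint via hashing, and word shingles with Jaccard similarity for near-duplicates (the shingle size k=4 and the use of SHA-1 are illustrative choices):

    import hashlib

    def fingerprint(text):
        # Exact-duplicate check: hash of the full page content.
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def shingles(text, k=4):
        # Near-duplicate check: the set of all k-word sequences in the page.
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        # Overlap of two shingle sets; values near 1.0 indicate near-duplicates.
        return len(a & b) / len(a | b) if a | b else 0.0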

  11. Distributing Crawler • Run multiple crawl threads, potentially at different nodes • Partition the hosts being crawled across the nodes
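One simple partitioning scheme (an illustrative sketch, not necessarily the one shown on the slide) assigns each host to a node by hashing the host name, so all URLs from a host are crawled by the same node and politeness bookkeeping stays local:

    import hashlib
    from urllib.parse import urlparse

    def node_for_url(url, num_nodes):
        # Stable hash of the host name, so every URL on a host maps to the same node.
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_nodes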

  12. URL Frontier • May contain multiple pages from the same host: must avoid trying to fetch them all at the same time • Must keep all crawling threads busy

  13. URL Frontier • Politeness: Don’t hit a web server too frequently • E.g., insert a time gap between successive requests to the same server • Freshness: Crawl some pages (e.g., news sites) more often than others • Not an easy problem
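A minimal single-threaded sketch of the time-gap idea; the 10-second gap is an arbitrary choice:

    import time

    last_access = {}     # host -> time of the most recent request to that host
    MIN_GAP = 10.0       # seconds to wait between requests to the same host (arbitrary)

    def wait_for_politeness(host):
        # Sleep until the gap since the last request to this host has elapsed.
        now = time.time()
        earliest = last_access.get(host, 0.0) + MIN_GAP
        if now < earliest:
            time.sleep(earliest - now)
        last_access[host] = time.time()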

  14. URL Frontier • Front queues manage priority • Each front queue corresponds to a level of priority • Back queues enforce politeness • URLs in each back queue share the same web server

  15. Multi-Thread Crawlers • Extract the root of the back queue heap • Fetch the URL at the head of the corresponding back queue q • Check if q is now empty • If yes: pull URLs from the front queues and add them to their corresponding back queues, until a URL arrives whose host does not yet have a back queue; put that URL in q and create a heap entry for it
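A compressed sketch of the front-queue/back-queue frontier described on slides 14-15; the number of priority levels, the politeness gap, and all names are illustrative:

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlparse

    class Frontier:
        def __init__(self, num_front=3, gap=10.0):
            self.front = [deque() for _ in range(num_front)]  # one front queue per priority level
            self.back = {}      # host -> deque of URLs waiting for that host
            self.heap = []      # entries (earliest allowed fetch time, host)
            self.gap = gap      # politeness gap per host (arbitrary)

        def add(self, url, priority=0):
            self.front[priority].append(url)

        def _refill(self):
            # Pull URLs from the front queues (highest priority first), routing each to
            # its host's back queue, until a URL arrives whose host has no back queue yet.
            for q in self.front:
                while q:
                    url = q.popleft()
                    host = urlparse(url).netloc
                    if host in self.back:
                        self.back[host].append(url)
                    else:
                        self.back[host] = deque([url])
                        heapq.heappush(self.heap, (time.time(), host))
                        return

        def next_url(self):
            if not self.heap:
                self._refill()
            if not self.heap:
                return None                                   # nothing to crawl right now
            ready_at, host = heapq.heappop(self.heap)         # extract the heap root
            time.sleep(max(0.0, ready_at - time.time()))      # wait out the politeness gap
            url = self.back[host].popleft()                   # head of that host's back queue
            if self.back[host]:
                heapq.heappush(self.heap, (time.time() + self.gap, host))
            else:
                del self.back[host]                           # back queue empty: refill it
                self._refill()
            return url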

  16. Federated Search • Visible Web: • Accessible to (crawled by) conventional search engines like Google or Yahoo! • Hidden Web: • Hidden from conventional engines (cannot be crawled) • Accessible via source-specific search engines • Larger than the Visible Web (2-50 times, Sherman 2001) • Created by professionals

  17. Example of Hidden Web

  18. Example of Hidden Web

  19. Example of Hidden Web

  20. Distributed Information Retrieval • [Diagram: Engine 1 … Engine N, queried in three steps: (1) Resource Representation, (2) Resource Selection, (3) Results Merging] • Source recommendation: recommend sources for given queries • Federated search: search selected sources and merge the individual ranked lists into a single list

  21. Resource Description • Description by word occurrences • Cooperative protocols • E.g., the STARTS protocol (Gravano et al., 1997) • Query sampling based approaches • Collect word occurrences by sending random queries and analyzing the returned docs • Build a centralized sample database from the docs returned by the random queries • Good for uncooperative environments
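A minimal sketch of query-based sampling; search_source is an assumed callable that returns document texts for a query, and the query and document counts are arbitrary:

    import random
    from collections import Counter

    def sample_source(search_source, seed_terms, num_queries=75, docs_per_query=4):
        # Build a word-occurrence description of a source by sending random one-term
        # queries and counting words in the returned documents.
        vocab = Counter()
        sampled_docs = []
        query_pool = list(seed_terms)
        for _ in range(num_queries):
            term = random.choice(query_pool)           # pick a random query term
            for doc in search_source(term)[:docs_per_query]:
                sampled_docs.append(doc)
                words = doc.lower().split()
                vocab.update(words)                    # accumulate word occurrences
                query_pool.extend(words)               # later queries drawn from sampled text
        return vocab, sampled_docs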

  22. Resource Selection • Select information sources that are most likely to provide relevant docs for given queries • Basic steps • Build a word histogram profile for each database • Treat each database as a “big” doc, and select the database with the highest relevance for given queries
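A minimal sketch of the "big document" idea with a simple log-tf score (an illustrative weighting, not the exact formula of CORI or any particular system); histograms maps each database name to its word-count profile:

    import math

    def score_database(query_terms, histogram):
        # Treat the database's word histogram as one big document: log-tf score.
        score = 0.0
        for term in query_terms:
            tf = histogram.get(term, 0)
            if tf > 0:
                score += 1.0 + math.log(tf)
        return score

    def select_databases(query_terms, histograms, k=3):
        # histograms: database name -> word-count dict; return the top-k names by score.
        ranked = sorted(histograms,
                        key=lambda name: score_database(query_terms, histograms[name]),
                        reverse=True)
        return ranked[:k]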

  23. Result Merging • Merge the returned docs from different databases into a single list • Challenges: • Each database only returns a list, no scores • Each database may use a different retrieval algorithm, making their scores incomparable • Solutions • Round robin • CORI merging algorithm (Callan, 1997) • Calibrate the scores from different databases
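Minimal sketches of two merging strategies: plain round-robin interleaving, and min-max normalization of per-database scores before merging (a simple calibration, not the CORI formula):

    from itertools import zip_longest

    def round_robin_merge(ranked_lists):
        # Interleave: first result of every list, then second of every list, and so on.
        merged = []
        for tier in zip_longest(*ranked_lists):
            merged.extend(doc for doc in tier if doc is not None)
        return merged

    def normalize_and_merge(scored_lists):
        # scored_lists: one [(doc, score), ...] list per database. Min-max normalize
        # each database's scores so they are roughly comparable, then sort globally.
        merged = []
        for results in scored_lists:
            if not results:
                continue
            scores = [s for _, s in results]
            lo, hi = min(scores), max(scores)
            for doc, s in results:
                norm = (s - lo) / (hi - lo) if hi > lo else 1.0
                merged.append((doc, norm))
        merged.sort(key=lambda pair: pair[1], reverse=True)
        return merged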
