
CRAWLING


Presentation Transcript


  1. CRAWLING - Sashank S Jupudi, Rohit Srivatsav.

  2. Agenda • Introduction • Architecture • Basic • Various Components • Distributed • Additional component (that makes the system distributed) • Distributed Crawling.

  3. Introduction To Crawling • What is a Crawler and what is Crawling • Basic Features • The “MUST” Features • The “SHOULD” Features • The Working

  4. Basic Features • The “Must” Features • Robustness • Politeness • The “Should” Features • Distributed • Scalable • Performance and Efficiency • Quality • Extensible

  5. The Working • Simple Crawling Cycle (flow diagram): Begin with the seed set from the URL frontier → Fetch the page → Parse the fetched page → extracted text goes to the indexer (where WebPages are ranked) and extracted links pass through a series of tests before re-entering the URL frontier • In case of continuous crawling, the cycle repeats.
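The cycle above can be sketched as a short loop (a minimal single-threaded sketch; `fetch`, `parse`, and `index` stand in for the modules described on the following slides and are placeholders, not the authors' implementation):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, parse, index, max_pages=100):
    """Minimal crawling cycle: frontier -> fetch -> parse -> index -> frontier."""
    frontier = deque(seed_urls)   # the URL frontier, seeded initially
    seen = set(seed_urls)         # URLs already enqueued (duplicate test)
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        page = fetch(url)                  # Fetch module
        if page is None:
            continue
        text, links = parse(page)          # Parsing module
        index(url, text)                   # extracted text goes to the indexer
        for link in links:                 # extracted links re-enter the frontier
            absolute = urljoin(url, link)
            if absolute not in seen:       # series of tests (here: dedup only)
                seen.add(absolute)
                frontier.append(absolute)
        pages += 1
```

In continuous crawling, already-fetched URLs would re-enter the frontier with a recrawl priority rather than being filtered out by `seen`.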

  6. Architecture of the Basic Crawler

  7. Architecture Of The Basic Crawler • The URL Frontier • DNS Resolution Module • Fetch Module • Parsing Module • Content Seen Module • Duplicate Eliminator • Host Splitter (D)

  8. Modules Of the Crawler • Content Seen Module • Checks whether the page’s content has already been seen before (the same page may be served under different URLs). • DNS Resolution Module • What is DNS resolution? • Any disadvantages/issues? • Solution.
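The usual answer to the DNS issue hinted at above is a cache: resolution can add significant latency per URL, so the crawler keeps its own host-to-IP map. A minimal sketch (the `lookup` callable is a placeholder for a real resolver such as `socket.gethostbyname`):

```python
class CachingResolver:
    """Sketch of a DNS cache: resolving the same host repeatedly is wasteful,
    so cache host -> IP mappings after the first lookup."""

    def __init__(self, lookup):
        self._lookup = lookup   # e.g. socket.gethostbyname in a real crawler
        self._cache = {}        # host -> IP address
        self.misses = 0         # how many real lookups we performed

    def resolve(self, host):
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._lookup(host)
        return self._cache[host]
```

A production cache would also honor DNS TTLs and expire stale entries; that is omitted here for brevity.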

  9. Modules Of the Crawler • Fetch Module: • It retrieves the WebPages from the server (via the HTTP protocol) using the data provided by the DNS resolver. • Duplicate Eliminator: • It removes duplicate URLs, thus reducing the burden on the frontier.
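The content-seen check can be sketched with page fingerprints. This exact-hash version catches only byte-identical duplicates; shingles (slide 10) would be needed to catch near-duplicates as well:

```python
import hashlib

class ContentSeen:
    """Content-seen test via fingerprints: store a hash of each page body so
    an identical page served under a different URL is processed only once."""

    def __init__(self):
        self._fingerprints = set()

    def is_duplicate(self, body: bytes) -> bool:
        fp = hashlib.sha1(body).digest()   # fingerprint of the page content
        if fp in self._fingerprints:
            return True
        self._fingerprints.add(fp)
        return False
```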

  10. Modules Of the Crawler • URL Frontier • Basic tests to accept the URL: • Fingerprints and shingles • Filtering of the URLs (exclusive and inclusive) • Normalizing the URL • Checking for duplicates • Does the URL frontier assign priority? On what basis? • Responsibilities: • only one connection is open at a time to any host; (D) • high-priority pages are crawled preferentially. • Housekeeping (logs such as URLs crawled). • Checkpointing – state of the frontier
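The frontier's priority ordering and duplicate check can be sketched with a heap (a minimal single-queue sketch; production frontiers such as Mercator's use separate front queues for priority and back queues for per-host politeness):

```python
import heapq

class URLFrontier:
    """Priority-ordered frontier sketch: lower number = higher priority,
    with FIFO order among equal priorities (the counter breaks ties)."""

    def __init__(self):
        self._heap = []
        self._counter = 0
        self._enqueued = set()   # duplicate check on incoming URLs

    def add(self, url, priority):
        if url in self._enqueued:
            return               # duplicate eliminated
        self._enqueued.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```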

  11. Normalizing The URL • Converting the scheme and host to lower case. HTTP://www.Example.com/ → http://www.example.com/ • Adding a trailing slash: http://www.example.com → http://www.example.com/ • Removing the directory index. Default directory indexes are not needed in URLs. http://www.example.com/a/index.html → http://www.example.com/a/ • Converting the entire URL to lower case. URLs from a case-insensitive web server may be converted to lowercase to avoid ambiguity. www.example.com/BAR.html → www.example.com/bar.html • Removing the fragment. The fragment component of a URL is removed. www.example.com/bar.html#sec1 → www.example.com/bar.html Reference: wikipedia.org/wiki/URL_normalization
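The rules above can be collected into one function (a sketch using Python's `urllib.parse`; the directory-index list is an assumption, and whole-URL lowercasing is omitted since it is only safe for servers known to be case-insensitive):

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of default directory-index filenames to strip.
DIRECTORY_INDEXES = ("index.html", "index.htm", "default.asp")

def normalize(url):
    """Apply the normalization rules above: lowercase scheme and host,
    add a trailing slash, strip the directory index, drop the fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()          # lowercase scheme
    host = parts.netloc.lower()            # lowercase host
    path = parts.path or "/"               # add trailing slash to bare host
    for index in DIRECTORY_INDEXES:        # remove default directory index
        if path.endswith("/" + index):
            path = path[: -len(index)]
    # empty last component drops the fragment
    return urlunsplit((scheme, host, path, parts.query, ""))
```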

  12. Few more rules to be respected by the crawler: • 1. Only one connection should be open to any given host at a time. • 2. A waiting time of a few seconds should occur between successive requests to a host. • 3. Politeness restrictions should be obeyed.
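Rules 1 and 2 can be enforced with a small per-host gate (a sketch; the 2-second default delay is an assumed value, and a real crawler would also consult robots.txt for rule 3):

```python
import time

class PolitenessGate:
    """Enforce a minimum delay between successive requests to the same host.
    The clock is injectable so the logic can be tested without sleeping."""

    def __init__(self, delay=2.0, clock=time.monotonic):
        self.delay = delay
        self._clock = clock
        self._last = {}   # host -> timestamp of the last request

    def seconds_to_wait(self, host):
        """How long the caller should sleep before hitting this host again."""
        last = self._last.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (self._clock() - last))

    def record_request(self, host):
        self._last[host] = self._clock()
```

Calling `seconds_to_wait` from a single worker per host also gives rule 1 (one open connection per host) for free.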

  13. Distributing The Basic Crawler • What makes the basic system distributed? • Function of the host splitter. • Issues with making the crawler distributed: • Fingerprints for the same page may differ across nodes. • Caching fingerprints helps little, since no fingerprints are popular enough to show locality. • Fingerprints need to be saved along with URLs, since pages can be updated.
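The host splitter itself can be sketched as a hash of the hostname, so every URL from a given host is routed to the same crawler node and the one-connection-per-host rule stays local to one node (the modulus scheme below is an assumption; real systems may use consistent hashing so nodes can join and leave):

```python
import hashlib
from urllib.parse import urlsplit

def host_splitter(url, num_nodes):
    """Route a URL to a crawler node by hashing its host, so all URLs from
    the same host always land on the same node."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes
```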

  14. Distributing The Basic Crawler
