CRAWLING - Sashank S Jupudi, Rohit Srivatsav
Agenda • Introduction • Architecture • Basic • Various Components • Distributed • Additional component (that makes the system distributed) • Distributed Crawling.
Introduction To Crawling • What is a Crawler and what is Crawling • Basic Features • The “MUST” Features • The “SHOULD” Features • The Working
Basic Features • The "Must" Features • Robustness • Politeness • The "Should" Features • Distributed • Scalable • Performance and Efficiency • Quality • Extensible
The Working • Simple crawling cycle (diagram): begin with a seed set from the URL frontier → fetch the page → parse the fetched page → extracted text goes to the indexer (where the webpages are ranked) → extracted links go through a series of tests → surviving URLs are added back to the URL frontier; in continuous crawling the cycle repeats.
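Not part of the original diagram, but as a concrete illustration, here is a minimal sketch of this cycle in Python using only the standard library. The `print` call stands in for handing text to the indexer, and the "series of tests" is reduced to a duplicate check; politeness, robots.txt, and normalization are covered on later slides.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Parsing module, reduced to link extraction: collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)      # the URL frontier, seeded initially
    seen = set(seed_urls)            # the "series of tests", reduced to a duplicate check
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:   # fetch module
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # unreachable page: move on
        print("indexed:", url)       # stand-in for handing the text to the indexer
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:    # extracted links flow back to the frontier
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        max_pages -= 1
```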
Architecture Of The Basic Crawler • The URL Frontier • DNS Resolution Module • Fetch Module • Parsing Module • Content Seen Module • Duplicate Eliminator • Host Splitter (D, distributed version only)
Modules Of the Crawler • Content Seen Module • Checks whether a page with the same content has already been seen at another URL (repeated URLs are handled by the Duplicate Eliminator, below). • DNS Resolution Module • What is DNS resolution? • Any disadvantages/issues? DNS lookup is a well-known crawler bottleneck. • Solution: cache resolved addresses and resolve asynchronously.
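A minimal sketch of the content-seen test, assuming a simple SHA-256 fingerprint of the page text; production crawlers also use shingles to catch near-duplicate pages, not just exact copies.

```python
import hashlib

seen_fingerprints = set()   # fingerprints of page content already processed

def content_seen(page_text):
    """Content Seen test (sketch): fingerprint the page text and check
    whether an identical page was already fetched at another URL."""
    fp = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```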
Modules Of the Crawler • Fetch Module: • Retrieves the webpages from the server (over HTTP) using the address supplied by the DNS resolver. • Duplicate Eliminator: • Removes duplicate URLs before they enter the frontier, reducing redundant work.
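A sketch of the fetch module with a cached DNS step. The names `resolve` and `fetch` are illustrative, and note that `urllib` performs its own lookup internally, so the explicit resolve here only marks where a crawler's custom caching resolver would sit.

```python
import socket
import urllib.request
from functools import lru_cache
from urllib.parse import urlparse

@lru_cache(maxsize=10_000)
def resolve(host):
    """DNS resolution module (sketch): plain lookups are a crawler
    bottleneck, so results are cached; production crawlers also
    resolve asynchronously in batches."""
    return socket.gethostbyname(host)

def fetch(url, timeout=10.0):
    """Fetch module (sketch): resolve the host, then retrieve the page
    over HTTP. Returns None on any network error."""
    host = urlparse(url).hostname
    if host is None:
        return None
    try:
        resolve(host)   # illustrates the caching idea; urllib re-resolves internally
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None
```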
Modules Of the Crawler • URL Frontier • Basic tests to accept a URL: • Fingerprints and shingles • Filtering of the URLs (exclusive and inclusive) • Normalizing the URL • Checking for duplicates • Does the URL frontier assign priority? On what basis? • Responsibilities (see the sketch below): • Only one connection is open at a time to any host (D) • High-priority pages are crawled preferentially • Housekeeping (logs, such as the URLs crawled) • Checkpointing – saving the state of the frontier so a crash can be recovered from
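The real Mercator-style frontier uses separate front (priority) and back (per-host) queues; the sketch below compresses both ideas into one priority heap with per-host cool-down times. All names are illustrative, not part of the slides.

```python
import heapq
import time
from urllib.parse import urlparse

class URLFrontier:
    """Sketch of a priority- and politeness-aware URL frontier."""

    def __init__(self, per_host_delay=2.0):
        self.heap = []        # (priority, counter, url); lower priority = sooner
        self.seen = set()     # duplicate check on insertion
        self.next_ok = {}     # host -> earliest next-fetch time
        self.delay = per_host_delay
        self.counter = 0      # tie-breaker so the heap never compares URLs

    def add(self, url, priority=1):
        if url in self.seen:  # checking for duplicates
            return
        self.seen.add(url)
        heapq.heappush(self.heap, (priority, self.counter, url))
        self.counter += 1

    def pop(self):
        """Return the highest-priority URL whose host is polite to contact,
        or None if every queued host is still cooling down."""
        deferred, url_out = [], None
        while self.heap:
            priority, c, url = heapq.heappop(self.heap)
            host = urlparse(url).hostname or ""
            if time.monotonic() >= self.next_ok.get(host, 0.0):
                self.next_ok[host] = time.monotonic() + self.delay
                url_out = url
                break
            deferred.append((priority, c, url))   # host still cooling down
        for item in deferred:
            heapq.heappush(self.heap, item)
        return url_out
```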
Normalizing The URL (see the sketch after these rules) • Converting the scheme and host to lower case: HTTP://www.Example.com/ → http://www.example.com/ • Adding a trailing slash: http://www.example.com → http://www.example.com/ • Removing the directory index (default directory indexes are not needed in URLs): http://www.example.com/a/index.html → http://www.example.com/a/ • Converting the entire URL to lower case (URLs from a case-insensitive web server may be lowercased to avoid ambiguity): www.example.com/BAR.html → www.example.com/bar.html • Removing the fragment (the fragment component of a URL is removed): www.example.com/bar.html#sec1 → www.example.com/bar.html • Reference: wikipedia.org/wiki/URL_normalization
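A sketch implementing the rules above with the standard library. The whole-URL lowercasing rule is deliberately omitted here, since the slide itself notes it is only safe for case-insensitive servers.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Apply the normalization rules listed above (a sketch)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = scheme.lower()                  # HTTP:// -> http://
    netloc = netloc.lower()                  # www.Example.com -> www.example.com
    if path == "":
        path = "/"                           # add a trailing slash to an empty path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]    # drop the default directory index
    fragment = ""                            # remove the fragment (#sec1)
    return urlunsplit((scheme, netloc, path, query, fragment))

# e.g. normalize("HTTP://www.Example.com/a/index.html#sec1")
#      -> "http://www.example.com/a/"
```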
Few more rules to be respected by the crawler: • 1. Only one connection should be open to any given host at a time. • 2. A waiting time of a few seconds should occur between successive requests to a host. • 3. Politeness restrictions published by the site (e.g., in robots.txt) should be obeyed, as in the sketch below.
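A politeness check for rule 3, using the standard library's robots.txt parser. The user-agent name is illustrative, and a production crawler would cache one parser per host rather than re-reading robots.txt for every URL.

```python
import urllib.robotparser
from urllib.parse import urlsplit

def allowed(url, agent="ExampleCrawler"):
    """Politeness check (sketch): consult the host's robots.txt before fetching."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()            # fetch and parse robots.txt
    except Exception:
        return True          # robots.txt unreachable: treated as permissive here (a policy choice)
    return rp.can_fetch(agent, url)
```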
Distributing The Basic Crawler • What makes the basic system distributed? • Function of the host splitter (see the sketch below). • Issues with making the crawler distributed: • Fingerprints for identical pages may end up on different nodes. • Caching helps little, since no node has a local set of popular fingerprints. • Fingerprints need to be saved along with the URLs, since pages can be updated.
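A sketch of the host splitter's core decision, assuming a simple modulo scheme over a stable digest (Python's built-in hash() is randomized per process, so it cannot be used across nodes). Routing every URL of a given host to the same crawler node preserves the one-connection-per-host politeness guarantee across the cluster.

```python
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Host splitter (sketch): hash the host name so that all URLs of a
    given host always route to the same crawler node."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# e.g. assign_node("http://www.example.com/a", 4) always yields the same node id
```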