CRAWLING - Sashank S Jupudi, Rohit Srivatsav
Agenda • Introduction • Architecture • Basic • Various Components • Distributed • Additional component (that makes the system distributed) • Distributed Crawling.
Introduction To Crawling • What is a Crawler and what is Crawling • Basic Features • The “MUST” Features • The “SHOULD” Features • The Working
Basic Features • The "Must" Features • Robustness • Politeness • The "Should" Features • Distributed • Scalable • Performance and Efficiency • Quality • Extensible
The Working • Simple crawling cycle (diagram): begin with a seed set from the URL frontier → fetch the page → parse the fetched page → extracted text goes to the indexer (where the webpages are ranked) → extracted links go through a series of tests → surviving URLs are added back to the URL frontier; in continuous crawling the cycle repeats.
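Not part of the original diagram, but as a concrete illustration, here is a minimal sketch of this cycle in Python using only the standard library. The `print` call stands in for handing text to the indexer, and the "series of tests" is reduced to a duplicate check; politeness, robots.txt, and normalization are covered on later slides.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Parsing module, reduced to link extraction: collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)      # the URL frontier, seeded initially
    seen = set(seed_urls)            # the "series of tests", reduced to a duplicate check
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:   # fetch module
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # unreachable page: move on
        print("indexed:", url)       # stand-in for handing the text to the indexer
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:    # extracted links flow back to the frontier
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        max_pages -= 1
```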
Architecture Of The Basic Crawler • The URL Frontier • DNS Resolution Module • Fetch Module • Parsing Module • Content Seen Module • Duplicate Eliminator • Host Splitter (D, distributed version only)
Modules Of the Crawler • Content Seen Module • Checks whether a page with the same content has already been seen at another URL (repeated URLs are handled by the Duplicate Eliminator, below). • DNS Resolution Module • What is DNS resolution? • Any disadvantages/issues? DNS lookup is a well-known crawler bottleneck. • Solution: cache resolved addresses and resolve asynchronously.
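A minimal sketch of the content-seen test, assuming a simple SHA-256 fingerprint of the page text; production crawlers also use shingles to catch near-duplicate pages, not just exact copies.

```python
import hashlib

seen_fingerprints = set()   # fingerprints of page content already processed

def content_seen(page_text):
    """Content Seen test (sketch): fingerprint the page text and check
    whether an identical page was already fetched at another URL."""
    fp = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```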
Modules Of the Crawler • Fetch Module: • Retrieves the webpages from the server (over HTTP) using the address supplied by the DNS resolver. • Duplicate Eliminator: • Removes duplicate URLs before they enter the frontier, reducing redundant work.
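A sketch of the fetch module with a cached DNS step. The names `resolve` and `fetch` are illustrative, and note that `urllib` performs its own lookup internally, so the explicit resolve here only marks where a crawler's custom caching resolver would sit.

```python
import socket
import urllib.request
from functools import lru_cache
from urllib.parse import urlparse

@lru_cache(maxsize=10_000)
def resolve(host):
    """DNS resolution module (sketch): plain lookups are a crawler
    bottleneck, so results are cached; production crawlers also
    resolve asynchronously in batches."""
    return socket.gethostbyname(host)

def fetch(url, timeout=10.0):
    """Fetch module (sketch): resolve the host, then retrieve the page
    over HTTP. Returns None on any network error."""
    host = urlparse(url).hostname
    if host is None:
        return None
    try:
        resolve(host)   # illustrates the caching idea; urllib re-resolves internally
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except Exception:
        return None
```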
Modules Of the Crawler • URL Frontier • Basic tests to accept a URL: • Fingerprints and shingles • Filtering of the URLs (exclusive and inclusive) • Normalizing the URL • Checking for duplicates • Does the URL frontier assign priority? On what basis? • Responsibilities (see the sketch below): • Only one connection is open at a time to any host (D) • High-priority pages are crawled preferentially • Housekeeping (logs, such as the URLs crawled) • Checkpointing – saving the state of the frontier so a crash can be recovered from
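The real Mercator-style frontier uses separate front (priority) and back (per-host) queues; the sketch below compresses both ideas into one priority heap with per-host cool-down times. All names are illustrative, not part of the slides.

```python
import heapq
import time
from urllib.parse import urlparse

class URLFrontier:
    """Sketch of a priority- and politeness-aware URL frontier."""

    def __init__(self, per_host_delay=2.0):
        self.heap = []        # (priority, counter, url); lower priority = sooner
        self.seen = set()     # duplicate check on insertion
        self.next_ok = {}     # host -> earliest next-fetch time
        self.delay = per_host_delay
        self.counter = 0      # tie-breaker so the heap never compares URLs

    def add(self, url, priority=1):
        if url in self.seen:  # checking for duplicates
            return
        self.seen.add(url)
        heapq.heappush(self.heap, (priority, self.counter, url))
        self.counter += 1

    def pop(self):
        """Return the highest-priority URL whose host is polite to contact,
        or None if every queued host is still cooling down."""
        deferred, url_out = [], None
        while self.heap:
            priority, c, url = heapq.heappop(self.heap)
            host = urlparse(url).hostname or ""
            if time.monotonic() >= self.next_ok.get(host, 0.0):
                self.next_ok[host] = time.monotonic() + self.delay
                url_out = url
                break
            deferred.append((priority, c, url))   # host still cooling down
        for item in deferred:
            heapq.heappush(self.heap, item)
        return url_out
```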
Normalizing The URL (see the sketch after these rules) • Converting the scheme and host to lower case: HTTP://www.Example.com/ → http://www.example.com/ • Adding a trailing slash: http://www.example.com → http://www.example.com/ • Removing the directory index (default directory indexes are not needed in URLs): http://www.example.com/a/index.html → http://www.example.com/a/ • Converting the entire URL to lower case (URLs from a case-insensitive web server may be lowercased to avoid ambiguity): www.example.com/BAR.html → www.example.com/bar.html • Removing the fragment (the fragment component of a URL is removed): www.example.com/bar.html#sec1 → www.example.com/bar.html • Reference: wikipedia.org/wiki/URL_normalization
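A sketch implementing the rules above with the standard library. The whole-URL lowercasing rule is deliberately omitted here, since the slide itself notes it is only safe for case-insensitive servers.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Apply the normalization rules listed above (a sketch)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = scheme.lower()                  # HTTP:// -> http://
    netloc = netloc.lower()                  # www.Example.com -> www.example.com
    if path == "":
        path = "/"                           # add a trailing slash to an empty path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]    # drop the default directory index
    fragment = ""                            # remove the fragment (#sec1)
    return urlunsplit((scheme, netloc, path, query, fragment))

# e.g. normalize("HTTP://www.Example.com/a/index.html#sec1")
#      -> "http://www.example.com/a/"
```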
Few more rules to be respected by the crawler: • 1. Only one connection should be open to any given host at a time. • 2. A waiting time of a few seconds should occur between successive requests to a host. • 3. Politeness restrictions published by the site (e.g., in robots.txt) should be obeyed, as in the sketch below.
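A politeness check for rule 3, using the standard library's robots.txt parser. The user-agent name is illustrative, and a production crawler would cache one parser per host rather than re-reading robots.txt for every URL.

```python
import urllib.robotparser
from urllib.parse import urlsplit

def allowed(url, agent="ExampleCrawler"):
    """Politeness check (sketch): consult the host's robots.txt before fetching."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()            # fetch and parse robots.txt
    except Exception:
        return True          # robots.txt unreachable: treated as permissive here (a policy choice)
    return rp.can_fetch(agent, url)
```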
Distributing The Basic Crawler • What makes the basic system distributed? • Function of the host splitter (see the sketch below). • Issues with making the crawler distributed: • Fingerprints for identical pages may end up on different nodes. • Caching helps little, since no node has a local set of popular fingerprints. • Fingerprints need to be saved along with the URLs, since pages can be updated.
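A sketch of the host splitter's core decision, assuming a simple modulo scheme over a stable digest (Python's built-in hash() is randomized per process, so it cannot be used across nodes). Routing every URL of a given host to the same crawler node preserves the one-connection-per-host politeness guarantee across the cluster.

```python
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Host splitter (sketch): hash the host name so that all URLs of a
    given host always route to the same crawler node."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# e.g. assign_node("http://www.example.com/a", 4) always yields the same node id
```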