Creating a Web Crawler in 3 Steps

Issac Goldstand
isaac@cpan.org
Mirimar Networks
http://www.mirimar.net/
The 3 steps
• Creating the User Agent
• Creating the content parser
• Tying it together
Step 1 – Creating the User Agent
• Lib-WWW Perl (LWP)
• An OO interface for creating user agents that interact with remote websites and web applications
• We will look at LWP::RobotUA
Creating the LWP Object
• User agent
• Cookie jar
• Timeout
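A minimal sketch of configuring those three pieces on a plain LWP::UserAgent (the cookie filename and timeout value are illustrative, not from the slides):

use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
$ua->agent('MyBot/1.0');                  # User agent string sent to servers
$ua->cookie_jar(HTTP::Cookies->new(
    file     => 'cookies.txt',            # Persist cookies between runs
    autosave => 1,
));
$ua->timeout(30);                         # Give up on a request after 30 seconds

LWP::RobotUA, used below, is a subclass of LWP::UserAgent, so the same methods apply.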
Robot UA extras
• Robot rules
• Delay
• use_sleep
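As a sketch of wiring these up: LWP::RobotUA honors robots.txt via a WWW::RobotRules object, and the rules can be cached persistently with WWW::RobotRules::AnyDBM_File; the cache filename below is an assumption:

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;

# Cache fetched robots.txt rules on disk (filename is illustrative)
my $rules = WWW::RobotRules::AnyDBM_File->new('MyBot/1.0', 'robot-rules.db');
my $ua = LWP::RobotUA->new(
    agent => 'MyBot/1.0',
    from  => 'isaac@cpan.org',
    rules => $rules,
);
$ua->delay(15/60);     # Wait at least 15 seconds between requests to a host
$ua->use_sleep(1);     # sleep() when a request comes too soon, instead of failing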
Implementation of Step 1

use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
my $ua = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);    # 15 seconds delay
$ua->use_sleep(1);    # Sleep if delayed
Step 2 – Creating the content parser
• HTML::Parser
• Event-driven parsing mechanism
• OO and function-oriented interfaces
• Hooks that call your functions at certain points in the document
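To show the event-driven idea, here is a minimal sketch using the handler-based (function-oriented) interface rather than subclassing; the handler name on_start is illustrative:

use HTML::Parser;

# Call on_start() for every opening tag, passing the tag name and attributes
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&on_start, 'tagname, attr'],
);

sub on_start {
    my ($tagname, $attr) = @_;
    print "Found tag: $tagname\n";
}

$parser->parse('<html><head><title>Hi</title></head></html>');
$parser->eof;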
Subclassing HTML::Parser
• Biggest issue is non-persistence
• CGI authors may be used to this, but it still creates many caveats
• You must implement your own state-preservation mechanism
Implementation of Step 2

package My::LinkParser;             # Parser class
use base qw(HTML::Parser);

use constant START    => 0;         # Define simple state constants
use constant GOT_NAME => 1;

sub state {                         # Simple accessor methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)

sub reset {                         # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {                         # Parser hook, called for each opening tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
Shortcut – HTML::SimpleLinkExtor
• Simple package to extract links from HTML
• Handles many link types – we only want HREF-type links
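A quick sketch of the shortcut in use (the HTML snippet is illustrative):

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse('<a href="http://www.example.com/">Example</a>');

my @all_links  = $extor->links;    # Every link type (img src, a href, ...)
my @href_links = $extor->a;        # Only <a href="..."> links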
Step 3 – Tying it together
• Simple application
• Instantiate objects
• Enter request loop
• Spit data out somewhere
• Add parsed links to the queue
Implementation of Step 3

for (my $i = 0; $i < 10; $i++) {                # Parse loop
    my $response = $ua->get(pop @urls);         # Get HTTP response
    if ($response->is_success) {                # If response is OK
        $p->reset;
        $p->parse($response->content);          # Parse for author
        $p->eof;
        if ($p->state == 1) {                   # If state is GOT_NAME
            $authors{$p->author}++;             # then add author count
        } else {
            $authors{'Not Specified'}++;        # otherwise add default count
        }
        $linkex->parse($response->content);     # Parse for links
        unshift @urls, $linkex->a;              # and add links to queue
    }
}
End result

#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;                                       # List of URLs to visit
my %authors;

# First, create & set up the user agent
my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
$ua->delay(15/60);                              # 15 seconds delay
$ua->use_sleep(1);                              # Sleep if delayed

# Create parsers
my $p      = My::LinkParser->new;
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://www.beamartyr.net/";         # Initialize list of URLs
End result (cont)

for (my $i = 0; $i < 10; $i++) {                # Parse loop
    my $response = $ua->get(pop @urls);         # Get HTTP response
    if ($response->is_success) {                # If response is OK
        $p->reset;
        $p->parse($response->content);          # Parse for author
        $p->eof;
        if ($p->state == 1) {                   # If state is GOT_NAME
            $authors{$p->author}++;             # then add author count
        } else {
            $authors{'Not Specified'}++;        # otherwise add default count
        }
        $linkex->parse($response->content);     # Parse for links
        unshift @urls, $linkex->a;              # and add links to queue
    }
}

print "Results:\n";                             # Print results
map { print "$_\t$authors{$_}\n" } keys %authors;
End result (cont)

package My::LinkParser;             # Parser class
use base qw(HTML::Parser);

use constant START    => 0;         # Define simple state constants
use constant GOT_NAME => 1;

sub state {                         # Simple accessor methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}

sub reset {                         # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}
End result (cont)

sub start {                         # Parser hook, called for each opening tag
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE}  = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
What’s missing?
• Full URLs for relative links (see the sketch below)
• Non-HTTP links
• Queues & caches
• Persistent storage
• Link (and data) validation
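For the first gap, a minimal sketch of resolving relative links against the page they came from, using the URI module ($response and $linkex are the objects from the parse loop above):

use URI;

# Resolve each extracted link against the base URL of the fetched page
my $base     = $response->base;    # Base URL reported by HTTP::Response
my @absolute = map { URI->new_abs($_, $base)->as_string } $linkex->a;

# Keep only HTTP links before queueing them
unshift @urls, grep { m{^http://} } @absolute;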
In review
• Create a robot user agent to crawl websites nicely
• Create parsers to extract data from sites, and links to the next sites
• Create a simple program to work through a queue of URLs
Thank you!

For more information:
Issac Goldstand
isaac@cpan.org
http://www.beamartyr.net/
http://www.mirimar.net/