Data-Mining the Web Using Perl Burt L. Monroe Director, Quantitative Social Science Initiative Department of Political Science The Pennsylvania State University
Data-Mining the Web • Examples • Election Returns in Luxembourg • Luxembourg Official Election Results, 2004 • http://qssi.psu.edu/files/luxembourg.pl • Parliamentary Speech • The Congressional Record
How’d You Do That? • There are several programming languages with “straightforward” facilities for doing this. Most notably, • Perl • Python • Java • I’m going to talk about Perl, because • it’s the most established • it’s the one I know • It appears that Python may be preferable, but that’s for someone else to say.
What’s Perl? • Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language. • It is very very good at processing text. • note, webpages are just texts. • note, datasets (like a flat spreadsheet or Stata file) are just texts. • Social scientists might have some use for turning one into the other, no? • It has very useful facilities for building • Spiders • Scrapers • (and “agents”, “robots”, “crawlers”, etc.)
What’s a Spider? • A spider is a program designed to automatically gather webpages. • If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.
What’s a scraper? • A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage. • If you want to know who said “health” and how many times, you might want to build a scraper.
BEWARE! • Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use: • appropriating copyrighted materials • extracting email addresses for spammers • overwhelming servers to create “denial of service” • generally violating a site’s “terms of service” or “acceptable use policy” • If you are not careful to use legal and ethical good practices, you can • be denied access to a website altogether • get yourself or the university sued or even subjected to criminal penalties
Perl • Open-source • Cross-platform • (Windows – I recommend “ActivePerl” from http://www.activestate.com) • There are many websites with resources: • http://www.cpan.org (Comprehensive Perl Archive Network) • http://www.perlmonks.org (PerlMonks) • http://www.perl.org • http://perl.oreilly.com (O’Reilly Publishing) • Lots of mailing lists, etc.
Books • Basics of Perl • The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover. • Learning Perl (the Llama) • or, Learning Perl on Win32 Systems (the Gecko) • Programming Perl (the Camel) • Web-mining • Perl & LWP (the Blesbok, apparently) • Spidering Hacks • These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).
Running Perl • For machines with approved ActivePerl installations in Pond ... • Perl is located in c:\Perl\ • For today, • we will operate entirely in the directory c:\Perl\eg\ • To get there, • open Programs -> Accessories -> Command Prompt • At the prompt, type c: • Type cd \Perl\eg • (In your particular installation, or on a Mac, or on something like Unix on high performance computing, these details will be different.)
The First Perl Program • Go to the QuaSSI Website for the example scripts for today’s workshop: • http://qssi.psu.edu/files/howdy.pl • Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\ • Open up the text editor WinEdt (you could use almost anything) and then open howdy.pl • That’s a complete Perl program. • Note: that’s all a program is – a text file.
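Running a Perl Program • Go back to your command prompt. • Type perl -w howdy.pl • (The -w switch tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything typed after the script name is handed to the script as an argument instead.)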
Modifying a program • Go back to WinEdt • Edit the text between the quotation marks to say something new • Click File -> Save • Go back to the command prompt • Hit the up arrow (to get the last command, perl -w howdy.pl) and press Enter • Look at that – you’re a programmer!
Break the program • Go back to WinEdt • Delete the semicolon at the end of the line • Save the file • Go back to the command prompt and run the program, with -w, again • What happened?
Perl at 30,000 feet • Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com • (I’m skipping even more than he did.)
Some generalities about Perl • Statements in Perl are, or usually can be, constructed in a fairly natural English-like way. • There are many ways to do any one thing. • The syntax can be offputting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally. • Main syntax rule: end every statement with ; (a statement may span several lines, and the last statement in a block may omit it).
Data Types • Scalars • Arrays and Lists • Hashes • References • Filehandles • Objects
Scalars • Numbers • Generally double-precision floating point • (Can also be written as integer, octal, or hexadecimal literals) • Strings • Can contain any character • Can be empty: "" • Can be arbitrarily large
Strings • Single-quoted • characters are taken literally, with only two exceptions. • a single quote in a single-quoted string requires \' • a backslash in a single-quoted string requires \\ • Double-quoted • it will interpolate: variables are replaced by their values and control sequences are converted. • For example • $foo = "myfile"; • $datafile = "$foo.txt"; • will result in the variable $datafile holding the string "myfile.txt" • Another example • print 'Howdy\n'; will print: • Howdy\n • print "Howdy\n"; will print • Howdy • (\n is a control sequence, standing for “new line”).
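A minimal sketch of the two quoting styles side by side. The variable $literal is a name introduced here for illustration; the rest follows the slide's own example.

```perl
use strict;
use warnings;

my $foo = "myfile";

# Double quotes interpolate variables and control sequences.
my $datafile = "$foo.txt";

# Single quotes keep everything literal, including $ and \n.
my $literal = '$foo.txt\n';

print "$datafile\n";     # the value of $foo followed by .txt
print $literal, "\n";    # the ten characters exactly as typed
```

Note that use strict requires the my declarations, which the slide's shorter example leaves out.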
Scalar operators • Math • *, /, % (for modulo), ** (for exponentiation), etc. • Strings • x to repeat the thing on the left • "b" x 10 gives "bbbbbbbbbb" • . concatenates strings • ("na" x 16) . " Batman!" gives ... • Perl knows to convert when mixing these two types: • "3" * 4 gives 12 • "3" . 4 gives "34"
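The operators above can be tried out in a short script. This is only a sketch; the variable names are chosen here for illustration, and the chant is shortened to four repetitions.

```perl
use strict;
use warnings;

my $remainder = 17 % 5;        # modulo
my $power     = 2 ** 10;       # exponentiation

my $repeated  = "b" x 10;                   # repetition
my $chant     = ("na" x 4) . " Batman!";    # repetition, then concatenation

# The operator, not the operand, decides how a value is treated.
my $product = "3" * 4;    # * is numeric, so the string "3" becomes the number 3
my $joined  = "3" . 4;    # . is a string operator, so 4 becomes the string "4"

print "$remainder $power $repeated\n";
print "$chant\n";
print "$product $joined\n";
```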
Comparing Scalars
Comparison        Numeric   String
Equal             ==        eq
Not equal         !=        ne
Less than         <         lt
Greater than      >         gt
Less / equal      <=        le
Greater / equal   >=        ge
• 8 < 25 is TRUE! • "8" lt "25" is FALSE! (String comparison goes character by character, and "8" comes after "2".)
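The difference matters most when sorting. The sketch below uses the three-way operators <=> (numeric) and cmp (string), which are not in the table above but belong to the same two families.

```perl
use strict;
use warnings;

my @values = (8, 25, 100);

# Numeric comparison orders by magnitude.
my @numeric = sort { $a <=> $b } @values;

# String comparison orders character by character, like a dictionary.
my @stringy = sort { $a cmp $b } @values;

print "numeric: @numeric\n";
print "string:  @stringy\n";

print "8 < 25 is true\n"          if 8 < 25;
print "'8' lt '25' is false\n" unless "8" lt "25";
```

Using the string operators on numeric data (or the reverse) is a common source of quietly wrong results, so it is worth checking which family a comparison uses.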
Variables • A sign, followed by a letter, followed by pretty much whatever. • Sign determines the type: • $foo is a scalar • @foo is an array • %foo is a hash • Variables default to global (they apply in all parts of your program). This can be problematic. • local $var temporarily gives a global variable a new value until the current “block” of code exits (the change is also seen by any subroutine called from that block). • my $var creates a new variable that exists only within the current block, and is the more usual construction. • the very common use strict; at the beginning of code forces good practice by requiring variables to be declared (creates more syntax errors, but prevents more whoppers that could blow everything up.)
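A small sketch of how local and my differ in practice. All of the names here ($level, show_level, with_local, with_my) are invented for illustration.

```perl
use strict;
use warnings;

our $level = "global";               # a declared package (global) variable

sub show_level { return $level; }    # reads whatever the global currently holds

sub with_local {
    local $level = "temporary";      # global's value is swapped until this sub returns
    return show_level();             # the called sub sees the temporary value
}

sub with_my {
    my $level = "private";           # a brand-new variable, visible only in this block
    return show_level();             # the called sub still sees the global
}

print with_local(), "\n";
print with_my(), "\n";
print $level, "\n";                  # the global is back to its original value
```

In everyday scripts my is almost always what you want; local is mostly reserved for Perl's built-in special variables.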
Lists and Arrays • A list is an ordered set of (usually) scalars. • An array is a variable holding a list. • my @foo = (1,2,3); • my @bar = ("elephant", 3.14); • Can be constructed as lists of scalar variables: • my @data = ($name, $address, $SSN);
Using Arrays • Elements are indexed, from 0. • my @animals = ("frog", "bear", "elephant"); • print $animals[2]; # prints elephant • Note: element is a scalar, so $ rather than @ • Subsections are “slices”. • my @mammals = @animals[1,2]; • Lots of functions for • using as a stack or queue (moving things on and off the right or left side of the array). • sorting • joining two arrays • splitting a scalar string into an array • my $sentence = "This is my sentence."; • my @words = split(" ", $sentence); • # now @words contains ("This", "is", "my", "sentence."), with the period still attached to the last word
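The array functions mentioned above, sketched in one place. push, shift, sort, join, and split are all Perl built-ins; the extra animal is made up.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");

push @animals, "newt";               # add an element on the right
my $first = shift @animals;          # remove an element from the left

my @sorted = sort @animals;          # alphabetical by default
my $joined = join(", ", @sorted);    # array to a single string

my @words = split(" ", "This is my sentence.");   # string to array
my $count = scalar @words;           # an array in scalar context gives its length

print "$first\n$joined\n$count words, last is '$words[-1]'\n";
```

pop and unshift are the mirror images of shift and push, working on the opposite ends of the array.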
Programming Controls • Control structures • if / elsif / else (there is no “then”; the block in braces plays that role) • while • do {} while • do {} until • for () • foreach () # loops over a list • Errors / warnings • die "message" prints “message” and kills the program. • warn "message" prints message and keeps going.
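A brief sketch of the most common control structures working together. The scores and labels are invented data.

```perl
use strict;
use warnings;

my @scores = (42, 77, 91);
my @labels;

# foreach visits each element of a list in turn.
foreach my $score (@scores) {
    if ($score >= 90) {
        push @labels, "high";
    }
    elsif ($score >= 60) {
        push @labels, "middle";
    }
    else {
        push @labels, "low";
    }
}

# while repeats as long as its condition stays true.
my ($n, $total) = (0, 0);
while ($n < @scores) {
    $total += $scores[$n];
    $n++;
}

warn "No scores to total\n" unless @scores;   # warn reports and carries on
print "@labels\n$total\n";
```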
Hashes • “Associative arrays” • A set of • values (any scalar), indexed by • keys (strings) • Example • my %info; • $info{ "name" } = "Burt Monroe"; • $info{ "age" } = 39; • With hashes and arrays you can create almost any arbitrary data structure (even arrays of arrays, arrays of hashes, hashes of arrays, etc.)
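As an illustration of nesting, here is a sketch of a hash of arrays. The speakers and topics are invented; nested structures are built from references, which is why the square brackets and the @{ ... } syntax appear.

```perl
use strict;
use warnings;

# A simple hash, built in one step with the => ("fat comma") notation.
my %info = (name => "Burt Monroe", age => 39);

# A hash of arrays: each speaker (key) maps to a list of topics.
my %speeches = (
    "Smith" => ["health", "budget"],
    "Jones" => ["health"],
);

# Add a topic to one speaker's list.
push @{ $speeches{"Jones"} }, "trade";

# keys returns the keys in no particular order, so sort them for stable output.
foreach my $speaker (sort keys %speeches) {
    my @topics = @{ $speeches{$speaker} };
    print "$speaker: ", scalar(@topics), " topics (@topics)\n";
}
```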
File Handling • open() function opens a file for processing. • Prefix the filename to define how • "<" for input from existing file (read) • ">" to create for output (write; this erases any existing file of that name) • ">>" to append to a file (that may not yet exist) • open (IN, "<myfile.txt") or die "Can't open myfile.txt: $!"; • ($! holds the system’s reason for the failure.) • Can then use <> to refer to the file. The above would be <IN>.
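A sketch of writing, appending, and reading back. It departs from the slide in two ways that should be read as stylistic choices rather than requirements: it uses the three-argument form of open with the mode as a separate argument, and it stores filehandles in ordinary scalars ($out, $in) instead of bare names like IN. The temporary file comes from the core File::Temp module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Get a scratch file name that is removed when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# ">" creates the file (or empties an existing one) for writing.
open(my $out, ">", $filename) or die "Can't write $filename: $!";
print $out "first line\n";
close $out;

# ">>" adds to the end of the file.
open($out, ">>", $filename) or die "Can't append to $filename: $!";
print $out "second line\n";
close $out;

# "<" reads; in list context <$in> returns every line at once.
open(my $in, "<", $filename) or die "Can't read $filename: $!";
my @lines = <$in>;
close $in;

chomp @lines;    # strip the trailing newline from each line
print scalar(@lines), " lines read\n";
```

For large files, reading one line at a time with while (my $line = <$in>) { ... } avoids holding the whole file in memory.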
Matching string patterns using regular expressions • This is where much of the power of Perl lies. • m/pattern/ will check the default variable ($_) for pattern. • $var =~ m/pattern/; will check $var for pattern. • If the pattern is in $var, then • $var =~ m/pattern/ is TRUE. • If you “group” part of the pattern with parentheses and it is present, • $var =~ m/(pattern)/ is true, AND now a variable named $1 contains the text that the group matched. • Group more pieces of the pattern and the matches are stored in $2, $3, etc. • This only grabs the *first* match. To grab all, say • my @matches = ($var =~ m/(pattern)/g); • This will store every match in the array @matches.
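A sketch of the three uses described above: testing for a match, capturing a group, and collecting every match with /g. The sample line is invented, loosely in the style of a speech transcript.

```perl
use strict;
use warnings;

my $line = "Mr. SMITH: We need health care and health insurance.";

# Capture the speaker's name with a group; $1 is only valid after a successful match.
my $speaker;
if ($line =~ m/^Mr\. ([A-Z]+):/) {
    $speaker = $1;
}

# With /g in list context, every match of the group is returned.
my @matches = ($line =~ m/(health)/g);
my $count   = scalar @matches;

# Several groups can be assigned to named variables in one step.
my ($month, $year) = ("June 2004" =~ m/(\w+) (\d{4})/);

print "$speaker said 'health' $count times\n";
print "$month / $year\n";
```

Always check that the match succeeded before using $1; after a failed match it still holds whatever an earlier successful match left there.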
What’s a “regular expression”? • A combination of literal characters (letters, digits, etc.) and special symbols:
.          any single character
*          zero or more of the previous
+          one or more of the previous
?          zero or one of the previous
[aeiou]    character class – this is the vowels
^          beginning of the line
$          end of the line
\b         word boundary
\d \D      digit / non-digit
\s \S      space / non-space
\w \W      word character / non-word character
|          or – match this or that
()         grouping
• See handout for more.
Examples
Romeo|Juliet                                    “Romeo” or “Juliet”
\d\d\d-\d\d\d\d                                 a phone number
(\d\d\d-)?\d\d\d-\d\d\d\d                       phone #, maybe w/ area code
\b[aeiou]\w+                                    a word starting w/ a (lowercase) vowel
\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b       email address (add the /i modifier so that lowercase letters match too)
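The patterns above can be checked against sample strings. In this sketch the patterns are stored with qr// so they can be reused; the phone pattern is anchored with ^ and $ so that it must match the whole string, the @ is written \@ to rule out any confusion with array variables, and the sample text and address are invented.

```perl
use strict;
use warnings;

my $phone      = qr/^(\d\d\d-)?\d\d\d-\d\d\d\d$/;
my $email      = qr/\b[A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i;
my $vowel_word = qr/\b[aeiou]\w+/;

# Test each candidate against the phone pattern.
my %is_phone;
foreach my $candidate ('555-1234', '814-555-1234', '5551234') {
    $is_phone{$candidate} = ($candidate =~ $phone) ? 1 : 0;
    print "$candidate: ", ($is_phone{$candidate} ? "phone" : "not a phone"), "\n";
}

# Pull an address out of surrounding text (single quotes keep the @ literal).
my $text = 'Please write to someone@example.org today.';
my ($address) = ($text =~ m/($email)/);

# Collect every word that begins with a lowercase vowel.
my @vowel_words = ("An elephant ate apples" =~ m/($vowel_word)/g);

print "$address\n@vowel_words\n";
```

The email pattern is a rough filter, not a full validator; for instance, it rejects top-level domains longer than four letters.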
Modules • Hundreds of modules / packages available through CPAN. • ActivePerl gives a GUI for installing them in its “Perl Package Manager”.
A basic Perl example • Counting words. • counter1.pl
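The contents of counter1.pl are not reproduced in these slides, so the following is only a sketch of how a word counter might look, combining split, a hash, and sort. The sample text is invented.

```perl
use strict;
use warnings;

my $text = "The health bill is a bill about health. Health matters.";

# Lowercase the text, split on runs of non-word characters, and tally each word.
my %count;
foreach my $word (split(/\W+/, lc $text)) {
    next if $word eq "";    # guard against an empty leading field
    $count{$word}++;
}

# Most frequent first; ties broken alphabetically.
foreach my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```

To count words in a file instead, the same loop can be placed inside a while loop that reads the file line by line.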
Grabbing from the web • The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does -- requesting and interpreting webpages. • There are a few basic modules that can do this.
LWP::Simple • lwpsimpleget.pl
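The script lwpsimpleget.pl is not shown here, so this is a sketch under assumptions: that the LWP::Simple module is installed, and that the URL is supplied on the command line. The subroutine name fetch_page is invented. LWP::Simple's get function returns the page content, or undef if the request fails.

```perl
use strict;
use warnings;

# fetch_page is a hypothetical helper: returns the page's content or dies.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;                    # loaded only when a fetch is requested
    my $html = LWP::Simple::get($url);      # undef signals failure
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Only touch the network if a URL was given, e.g.  perl -w fetch.pl http://www.perl.org
if (@ARGV) {
    my $html = fetch_page($ARGV[0]);
    print "Fetched ", length($html), " characters\n";
}
```

More commonly the module is loaded with use LWP::Simple; at the top of the script, after which get($url) can be called directly.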
LWP::UserAgent • More elaborate than LWP::Simple. • I’m going to skip that one today, but it’s covered in detail in the main books • Perl & LWP • Spidering Hacks • Pretty much all of the functionality has been wrapped more intuitively into ...
WWW::Mechanize • mechanizeget.pl
Scraping • At its base, this is just extracting information from the page(s) you download. • Simple example: • freshair.pl
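The script freshair.pl is not shown here. As a stand-in, this sketch scrapes link addresses and titles out of a small, invented HTML fragment held in a string, so it runs without any network access. It relies on the fragment's very regular markup; real pages are rarely this tidy.

```perl
use strict;
use warnings;

# Invented sample page; in practice this string would come from a download.
my $html = <<'HTML';
<ul>
<li><a href="/show/101">Interview with a Senator</a></li>
<li><a href="/show/102">Health Policy Roundtable</a></li>
</ul>
HTML

# Each pass of the loop finds the next link; $1 is the address, $2 the title.
my @rows;
while ($html =~ m/<a href="([^"]+)">([^<]+)<\/a>/g) {
    push @rows, [$1, $2];
}

# Print one tab-separated line per link: the beginnings of a dataset.
foreach my $row (@rows) {
    print join("\t", @$row), "\n";
}
```

Here [^"]+ means "one or more characters that are not a double quote", which stops the match at the end of the attribute.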
Your agent can interact ... • For example, what if the webpage involves a form ... • Example • abstracts.pl • You can authenticate with username and password, run through proxy servers, and so on.
Spiders • Type 1 Requester • Requests a few items with known urls from a website. • Type 2 Requester • Requests a few items, then requests (some set of) pages to which those items link. • Type 3 Requester • Starts at a given url, and then requests everything linked, everything linked by that, etc. at the same host server. The idea here is usually to download an entire website. • Type 4 Requester • Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web. • YOU – I am talking to YOU – in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry causing mayhem of many sorts. Write only spiders with known finite scope.
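One way to keep a spider's scope finite is to cap both the link depth and the total number of pages. The sketch below shows that bookkeeping for a Type 2 requester. To stay runnable without a network it walks an invented in-memory "website" (the %site hash) instead of fetching anything; the limits $max_depth and $max_pages are illustrative values, not recommendations.

```perl
use strict;
use warnings;

# Hypothetical site: URL => page content. A real spider would download these.
my %site = (
    "http://example.org/index.html" =>
        '<a href="http://example.org/a.html">A</a> <a href="http://example.org/b.html">B</a>',
    "http://example.org/a.html" => '<a href="http://example.org/c.html">C</a>',
    "http://example.org/b.html" => '<a href="http://elsewhere.example.com/x.html">X</a>',
    "http://example.org/c.html" => 'no links here',
);

my $max_depth = 1;     # follow links from the start page only
my $max_pages = 10;    # hard ceiling on total requests

my (%seen, @visited);
my @queue = (["http://example.org/index.html", 0]);    # [url, depth] pairs

while (@queue and @visited < $max_pages) {
    my ($url, $depth) = @{ shift @queue };
    next if $seen{$url}++;               # never request the same page twice
    next unless exists $site{$url};      # stands in for "the request failed"
    push @visited, $url;

    next if $depth >= $max_depth;        # deep enough: do not follow this page's links
    my $page = $site{$url};
    while ($page =~ m/href="([^"]+)"/g) {
        push @queue, [$1, $depth + 1];
    }
    # A real spider should also pause here (for example, sleep 1) between requests.
}

print "$_\n" for @visited;
```

With these limits the start page and the two pages it links to are visited, and nothing beyond them.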
Back to the Luxembourg Miner • Commune-level election results from Luxembourg. • luxembourg.pl
More on Scraping • All of the examples above scrape / parse using regular expressions. • More structured data like HTML is often better (or only) addressed with more specialized tools: • HTML::TokeParser • HTML::TreeBuilder • There are modules for scraping from XML, spreadsheets, databases, Word docs, PDFs.