
Data-Mining the Web Using Perl

Data-Mining the Web Using Perl. Burt L. Monroe Director, Quantitative Social Science Initiative Department of Political Science The Pennsylvania State University. Data-Mining the Web. Examples Election Returns in Luxembourg Luxembourg Official Election Results, 2004




  1. Data-Mining the Web Using Perl Burt L. Monroe Director, Quantitative Social Science Initiative Department of Political Science The Pennsylvania State University

  2. Data-Mining the Web • Examples • Election Returns in Luxembourg • Luxembourg Official Election Results, 2004 • http://qssi.psu.edu/files/luxembourg.pl • Parliamentary Speech • The Congressional Record

  3. How’d You Do That? • There are several programming languages with “straightforward” facilities for doing this. Most notably, • Perl • Python • Java • I’m going to talk about Perl, because • it’s the most established • it’s the one I know • It appears that Python may be preferable, but that’s for someone else to say.

  4. What’s Perl? • Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language. • It is very very good at processing text. • note, webpages are just texts. • note, datasets (like a flat spreadsheet or Stata file) are just texts. • Social scientists might have some use for turning one into the other, no? • It has very useful facilities for building • Spiders • Scrapers • (and “agents”, “robots”, “crawlers”, etc.)

  5. What’s a Spider? • A spider is a program designed to automatically gather webpages. • If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.

  6. What’s a scraper? • A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage. • If you want to know who said “health” and how many times, you might want to build a scraper.

  7. BEWARE! • Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use: • appropriating copyrighted materials • extracting email addresses for spammers • overwhelming servers to create “denial of service” • generally violating a site’s “terms of service” or “acceptable use policy” • If you are not careful to use legal and ethical good practices, you can • be denied access to a website altogether • get yourself or the university sued or even subjected to criminal penalties

  8. Perl • Open-source • Cross-platform • (Windows – I recommend “ActivePerl” from http://www.activestate.com) • There are many websites with resources: • http://www.cpan.org (Comprehensive Perl Archive Network) • http://www.perlmonks.org (PerlMonks) • http://www.perl.org • http://perl.oreilly.com (O’Reilly Publishing) • Lots of mailing lists, etc.

  9. Books • Basics of Perl • The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover. • Learning Perl (the Llama) • or, Learning Perl on Win32 Systems (the Gecko) • Programming Perl (the Camel) • Web-mining • Perl & LWP (the Blesbok, apparently) • Spidering Hacks • These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).

  10. Running Perl • For machines with approved ActivePerl installations in Pond ... • Perl is located in c:/Perl/ • For today, • we will operate entirely in the directory c:/Perl/eg/ • To get there, • open Programs -> Accessories -> Command Prompt • At the prompt, type c: • Type cd Perl/eg • (In your particular installation, or on a Mac, or on something like Unix on high performance computing, these details will be different.)

  11. The First Perl Program • Go to the QuaSSI Website for the example scripts for today’s workshop: • http://qssi.psu.edu/files/howdy.pl • Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\ • Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl • That’s a complete Perl program. • Note: that’s all a program is – a text file.

  12. Running a Perl Program • Go back to your command prompt. • Type perl -w howdy.pl • (The -w tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything typed after the script name is passed to the script itself.)

  13. Modifying a program • Go back to WinEdt • Edit the text between the quotation marks to say something new • Click File -> Save • Go back to the command prompt • Hit the up arrow (to get the last command, perl -w howdy.pl) and press Enter • Look at that – you’re a programmer!

  14. Break the program • Go back to WinEdt • Delete the semicolon at the end of the line • Save the file • Go back to the command prompt and run the program, with -w, again • What happened?

  15. Perl at 30,000 feet • Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com • (I’m skipping even more than he did.)

  16. Some generalities about Perl • Statements in Perl are, or usually can be, constructed in a fairly natural English-like way. • There are many ways to do any one thing. • The syntax can be offputting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally. • Main syntax rule: end every statement with a semicolon (;)

  17. Data Types • Scalars • Arrays and Lists • Hashes • References • Filehandles • Objects

  18. Scalars • Numbers • Generally double-precision floating point • (Can also be written as integer, octal, or hexadecimal literals) • Strings • Can contain any character • Can be empty: "" • Can be arbitrarily large

  19. Strings • Single-quoted • characters are taken as shown, with only two exceptions. • a single quote in a single-quoted string requires \' • a backslash in a single-quoted string requires \\ • Double-quoted • the string is interpolated: variables and control sequences are replaced by their values. • For example • $foo = "myfile"; • $datafile = "$foo.txt"; • will result in the variable $datafile holding the string "myfile.txt" • Another example • print 'Howdy\n'; will print: • Howdy\n • print "Howdy\n"; will print • Howdy • (\n is a control sequence, standing for “new line”).
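
The difference between the two kinds of quotes can be checked directly. The sketch below reuses $foo and $datafile from the slide; the other variable names are made up for illustration.

```perl
use strict;
use warnings;

my $foo = "myfile";

# Double quotes interpolate: $foo is replaced by its value.
my $datafile = "$foo.txt";      # holds myfile.txt

# Single quotes do not interpolate: the text is taken literally.
my $literal = '$foo.txt';       # holds the eight characters $foo.txt

# \n is a newline only inside double quotes.
my $single = 'Howdy\n';         # 7 characters: H o w d y \ n
my $double = "Howdy\n";         # 6 characters: Howdy plus a newline

print "$datafile\n";
print "$literal\n";
print length($single), " ", length($double), "\n";
```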

  20. Scalar operators • Math • *, /, % (for modulo), ** (for exponentiation), etc. • Strings • x to repeat the thing on the left • "b" x 10 gives "bbbbbbbbbb" • . concatenates strings • ("na" x 16)." Batman!" gives ... • Perl knows to convert when mixing these two types: • "3"*4 gives 12 • "3".4 gives "34"
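
A short sketch of these operators follows. The variable names are illustrative, and the chant is shortened to four repeats to keep the output readable.

```perl
use strict;
use warnings;

my $remainder = 17 % 5;                      # modulo: 2
my $power     = 2 ** 10;                     # exponentiation: 1024
my $repeated  = "b" x 10;                    # ten copies of "b"
my $chant     = ("na" x 4) . " Batman!";     # repeat, then concatenate

# The operator, not the operand, decides whether a value
# is treated as a number or as a string.
my $product = "3" * 4;                       # numeric: 12
my $joined  = "3" . 4;                       # string: "34"

print "$remainder $power $repeated\n";
print "$chant\n";
print "$product $joined\n";
```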

  21. Comparing Scalars
      Comparison        Numeric   String
      Equal             ==        eq
      Not equal         !=        ne
      Less than         <         lt
      Greater than      >         gt
      Less / equal      <=        le
      Greater / equal   >=        ge
      • 8 < 25 TRUE! • "8" lt "25" FALSE! (Strings are compared character by character, and "8" comes after "2".)
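
The numeric/string distinction also decides how a list is sorted. The sketch below uses the sorting counterparts of these operators, <=> (numeric) and cmp (string), which are standard Perl but do not appear on the slide.

```perl
use strict;
use warnings;

my $numeric = (8 < 25)      ? "true" : "false";   # compared as numbers
my $string  = ("8" lt "25") ? "true" : "false";   # compared as strings

# The same distinction matters when sorting.
my @values     = (8, 25, 100);
my @as_numbers = sort { $a <=> $b } @values;      # 8, 25, 100
my @as_strings = sort { $a cmp $b } @values;      # 100, 25, 8

print "numeric: $numeric, string: $string\n";
print "@as_numbers | @as_strings\n";
```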

  22. Variables • A sign (“sigil”), followed by a letter or underscore, followed by pretty much any letters, digits, or underscores. • Sign determines the type: • $foo is a scalar • @foo is an array • %foo is a hash • Variables default to global (they apply in all parts of your program). This can be problematic. • my $var declares a variable that exists only within the current “block” of code. This is the usual construction. • local $var instead gives an existing global variable a temporary value until the current block finishes. • the very common use strict; at the beginning of code forces good practice by requiring variables to be declared (creates more compile-time errors, but prevents more whoppers that could blow everything up.)
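
A minimal sketch of block scoping with my under use strict. The variable names are made up for illustration.

```perl
use strict;      # undeclared variables become compile-time errors
use warnings;

my $count = 1;                 # visible from here to the end of the file

{
    my $count = 99;            # a separate variable, private to this block
    print "inside the block: $count\n";
}

print "outside the block: $count\n";   # still 1

my @list = (1, 2, 3);          # @ marks an array
my %hash = (one => 1);         # % marks a hash
my $size = scalar @list;       # an array in scalar context gives its length
print "size: $size\n";
```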

  23. Lists and Arrays • A list is an ordered set of (usually) scalars. • An array is a variable holding a list. • my @foo = (1,2,3); • my @bar = ("elephant", 3.14); • Can be constructed as lists of scalar variables: • my @data = ($name, $address, $SSN);

  24. Using Arrays • Elements are indexed, from 0. • my @animals = ("frog", "bear", "elephant"); • print $animals[2]; # prints elephant • Note: element is a scalar, so $ rather than @ • Subsections are “slices”. • my @mammals = @animals[1,2]; • Lots of functions for • using as a stack (moving things on and off the right or left side of the array). • sorting • joining two arrays • splitting a scalar string into an array • my $sentence = "This is my sentence."; • my @words = split(" ", $sentence); • # now @words contains ("This", "is", "my", "sentence.")
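
The array operations listed above can be sketched as follows. The functions push, pop, shift, sort, split, and join are built into Perl; the extra list items are invented for the example.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");
my $third   = $animals[2];            # a single element is a scalar
my @mammals = @animals[1, 2];         # slice: bear, elephant

push @animals, "newt";                # add to the right-hand end
my $last  = pop @animals;             # remove from the right: newt
my $first = shift @animals;           # remove from the left: frog

my @sorted = sort { $a cmp $b } (@mammals, "aardvark");
my @words  = split " ", "This is my sentence.";
my $joined = join "-", @words;        # put the pieces back together

print "$third | @mammals | $last | $first\n";
print "@sorted\n";
print "$joined\n";
```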

  25. Programming Controls • Control structures • if / elsif / else (there is no “then” keyword) • while • do {} while • do {} until • for () • foreach() # loops over a list • Errors / warnings • die "message" kills program and prints “message”. • warn "message" prints message and keeps going.
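
A small sketch combining foreach, if / elsif / else, while, and warn. The thresholds and labels are arbitrary choices for the example.

```perl
use strict;
use warnings;

my @sizes;
foreach my $n (3, 12, 250) {
    if ($n < 10) {
        push @sizes, "small";
    }
    elsif ($n < 100) {
        push @sizes, "medium";
    }
    else {
        push @sizes, "large";
    }
}

my $i = 0;
while ($i < 3) {
    $i++;
}

# warn prints to STDERR and continues; die would stop the program here.
warn "No sizes were classified\n" unless @sizes;
print "@sizes (after $i passes of the while loop)\n";
```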

  26. Hashes • “Associative arrays” • A set of • values (any scalar), indexed by • keys (strings) • Example • my %info; • $info{ "name" } = "Burt Monroe"; • $info{ "age" } = 39; • With hashes and arrays you can create almost any arbitrary data structure (even arrays of arrays, arrays of hashes, hashes of arrays, etc.)
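
The sketch below builds the %info hash from the slide and then a hash of arrays. The hash %words_by_speaker and its contents are placeholders, not data from the workshop.

```perl
use strict;
use warnings;

my %info;
$info{"name"} = "Burt Monroe";
$info{"age"}  = 39;

# keys returns the keys in no particular order, so sort them for stable output.
foreach my $key (sort keys %info) {
    print "$key: $info{$key}\n";
}

# A hash of arrays: each value is a reference to an array.
my %words_by_speaker = (
    speaker_a => ["health", "budget"],      # placeholder names and words
    speaker_b => ["health"],
);
push @{ $words_by_speaker{speaker_b} }, "energy";
my $count_b = scalar @{ $words_by_speaker{speaker_b} };
print "speaker_b used $count_b words\n";
```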

  27. File Handling • open() function opens a file for processing. • Prefix the filename to define how • "<" for input from existing file (read) • ">" to create for output (write) • ">>" to append to a file (that may not yet exist) • open (IN, "<myfile.txt") or die "Can't open myfile.txt"; • Can then use <> to refer to the file. The above would be <IN>.
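
The three modes can be tried on a scratch file. Note that this sketch uses the three-argument form of open with a lexical filehandle ($out, $in), a variant of the two-argument form on the slide that is widely recommended in current Perl; it also uses the core File::Temp module so that no existing file is touched.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file, so that nothing in the current directory is overwritten.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# ">" creates (or overwrites) the file for writing.
open(my $out, ">", $filename) or die "Can't open $filename for writing: $!";
print $out "first line\n";
close $out;

# ">>" appends to the end of the file.
open($out, ">>", $filename) or die "Can't open $filename for appending: $!";
print $out "second line\n";
close $out;

# "<" reads; <$in> returns one line at a time.
my @lines;
open(my $in, "<", $filename) or die "Can't open $filename for reading: $!";
while (my $line = <$in>) {
    chomp $line;                 # strip the trailing newline
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read\n";
```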

  28. Matching string patterns using regular expressions • This is where much of the power of Perl lies. • m/pattern/ will check the default variable ($_) for pattern. • $var =~ m/pattern/; will check $var for pattern. • If the pattern is in $var, then • $var =~ m/pattern/ is TRUE. • If you “group” part of the pattern with parentheses and it is present, • $var =~ m/(pattern)/ is true, AND now a variable named $1 contains the text that the group matched. • Group more pieces of the pattern and the matches are stored in $2, $3, etc. • This only grabs the *first* match. To grab all, say • my @matches = ($var =~ m/(pattern)/g); • This will store every match in the array @matches.
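
A sketch of the three uses described above: testing for a match, capturing groups into $1 and $2, and collecting every match with /g. The sample sentence and its fictitious phone numbers are invented for the example.

```perl
use strict;
use warnings;

my $var = "Call 555-1234 or 555-9876 today.";

# A plain match is true or false.
my $found = ($var =~ m/\d\d\d-\d\d\d\d/) ? 1 : 0;

# Groups are captured from the first match only.
my $first;
if ($var =~ m/(\d\d\d)-(\d\d\d\d)/) {
    $first = "$1 / $2";
}

# With /g in list context, every match is returned.
my @matches = ($var =~ m/(\d\d\d-\d\d\d\d)/g);

print "found: $found\n";
print "first: $first\n";
print "all:   @matches\n";
```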

  29. What’s a “regular expression”? • Combination of any literal character, number, etc., and special characters:
      .          any single character
      *          zero or more of the previous
      +          one or more of the previous
      ?          zero or one of the previous
      [aeiou]    character class – this is the vowels
      ^          beginning of the line
      $          end of the line
      \b         word boundary
      \d \D      digit / non-digit
      \s \S      space / non-space
      \w \W      word character / non-word character
      |          or – match this or that
      ()         grouping
      • See handout for more.

  30. Examples
      Romeo|Juliet                                  “Romeo” or “Juliet”
      \d\d\d-\d\d\d\d                               a phone number
      (\d\d\d-)?\d\d\d-\d\d\d\d                     a phone number, maybe with area code
      \b[aeiou]\w+                                  a word starting with a (lower-case) vowel
      \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b     an email address (add the /i modifier so that lower-case letters also match)
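
These patterns can be tried against short test strings. In the sketch below the sample strings are invented, the area-code group is written as a non-capturing group (?:...) so that each list element is one whole phone number, and the email pattern is given the /i modifier.

```perl
use strict;
use warnings;

my @names  = ("Romeo and Juliet" =~ m/(Romeo|Juliet)/g);

# (?:...) groups without capturing, so only whole numbers are returned.
my @phones = ("555-0100 or 814-555-0199"
              =~ m/((?:\d\d\d-)?\d\d\d-\d\d\d\d)/g);

# Needs at least two characters, so a lone "a" would not match.
my @vowels = ("an apple fell on us" =~ m/(\b[aeiou]\w+)/g);

# /i makes the upper-case character classes match lower case too.
my @emails = ('write to someone@example.org'
              =~ m/(\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)/gi);

print "names:  @names\n";
print "phones: @phones\n";
print "vowels: @vowels\n";
print "emails: @emails\n";
```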

  31. Modules • Hundreds of modules / packages available through CPAN. • ActivePerl gives a GUI for installing them in its “Perl Package Manager”.

  32. A basic Perl example • Counting words. • counter1.pl
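
The workshop script counter1.pl is not reproduced in this transcript. As a rough idea of what a word counter can look like, here is a minimal sketch that combines split, a hash, and sort; the sample text and all variable names are placeholders, and it should not be read as the author's script.

```perl
use strict;
use warnings;

my $text = "The cat saw the dog, and the dog saw the cat.";   # placeholder text

my %count;
foreach my $word (split " ", lc $text) {
    $word =~ s/[^a-z']//g;          # drop punctuation
    next if $word eq "";
    $count{$word}++;                # one more occurrence of this word
}

# Most frequent first; ties broken alphabetically.
foreach my $word (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```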

  33. Grabbing from the web • The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does -- requesting and interpreting webpages. • There are a few basic modules that can do this.

  34. LWP::Simple • lwpsimpleget.pl
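
The workshop script lwpsimpleget.pl is not reproduced in this transcript. A minimal sketch of fetching one page with LWP::Simple might look like the following, assuming the module is installed (it is part of the libwww-perl distribution on CPAN). The get function returns the page content, or undef if the request fails; the helper fetch_page and the command-line handling are made up for this example.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Return the content of $url, or stop with an error message.
# (fetch_page is a hypothetical helper, not part of LWP::Simple.)
sub fetch_page {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Only go to the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $html = fetch_page($url);
    printf "Fetched %d characters from %s\n", length($html), $url;
}
```

Saved under a name of your choosing, say fetch.pl, it would be run as perl -w fetch.pl followed by the URL of a page you are permitted to download.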

  35. LWP::UserAgent • More elaborate than LWP::Simple. • I’m going to skip that one today, but it’s covered in detail in the main books • Perl & LWP • Spidering Hacks • Pretty much all of the functionality has been wrapped more intuitively into ...

  36. WWW::Mechanize • mechanizeget.pl

  37. Scraping • At its base, this is just extracting information from the page(s) you download. • Simple example: • freshair.pl
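
The workshop script freshair.pl is not reproduced in this transcript. To illustrate the idea of extracting data from a downloaded page, the sketch below runs a regular expression over a small HTML string that stands in for the page; the table rows and numbers are invented placeholders.

```perl
use strict;
use warnings;

# Placeholder HTML standing in for a downloaded page; the rows are invented.
my $html = <<'END_HTML';
<table>
<tr><td>Commune A</td><td>1200</td></tr>
<tr><td>Commune B</td><td>850</td></tr>
</table>
END_HTML

# Each pass of the loop finds the next table row and captures its two cells.
my %votes;
while ($html =~ m{<tr><td>([^<]+)</td><td>(\d+)</td></tr>}g) {
    $votes{$1} = $2;
}

# Tab-separated output that a statistics package can read.
foreach my $commune (sort keys %votes) {
    print "$commune\t$votes{$commune}\n";
}
```

A pattern like this depends on the exact layout of the page, so it will need adjusting whenever the page's markup changes.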

  38. Your agent can interact ... • For example, what if the webpage involves a form ... • Example • abstracts.pl • You can authenticate with username and password, run through proxy servers, and so on.

  39. Spiders • Type 1 Requester • Requests a few items with known urls from a website. • Type 2 Requester • Requests a few items, then requests (some set of) pages to which those items link. • Type 3 Requester • Starts at a given url, and then requests everything linked, everything linked by that, etc. at the same host server. The idea here is usually to download an entire website. • Type 4 Requester • Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web. • YOU – I am talking to YOU – in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry causing mayhem of many sorts. Write only spiders with known finite scope.
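
One way to keep a link-following spider within a known finite scope is to give it a hard limit on the number of pages it may fetch. The sketch below illustrates that idea only: the crawl function, the %site hash, and the page names are all hypothetical, and the "fetching" is done from an in-memory hash rather than the network so the logic can be tested safely.

```perl
use strict;
use warnings;

# A stand-in for the web: hypothetical page names mapped to HTML.
my %site = (
    "index.html" => '<a href="a.html">A</a> <a href="b.html">B</a>',
    "a.html"     => '<a href="index.html">home</a> <a href="c.html">C</a>',
    "b.html"     => 'no links here',
    "c.html"     => '<a href="d.html">D</a>',
    "d.html"     => 'the end',
);

# Visit pages breadth-first from $start, never fetching more than $max_pages.
# $fetch is a code reference that returns the HTML for a URL (or undef).
sub crawl {
    my ($start, $max_pages, $fetch) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;

    while (@queue and @visited < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @visited, $url;

        # A real spider would also pause here (e.g. sleep 1) between requests.
        foreach my $link ($html =~ m/href="([^"]+)"/g) {
            push @queue, $link unless $seen{$link}++;   # queue each link once
        }
    }
    return @visited;
}

my @pages = crawl("index.html", 3, sub { $site{ $_[0] } });
print "@pages\n";
```

Passing the fetch step in as a code reference means the same loop could later be pointed at a real download function, with the page limit still in force.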

  40. Back to the Luxembourg Miner • Commune-level election results from Luxembourg. • luxembourg.pl

  41. More on Scraping • All of the examples here scrape / parse using regular expressions. • More structured data like HTML is often better (or only) handled with more specialized tools: • HTML::TokeParser • HTML::TreeBuilder • There are modules for scraping from XML, spreadsheets, databases, Word docs, PDFs.
