Internet
Simple Things with LWP::Simple
• Often you simply need to copy a web page into a file (or into a string variable) so you can parse out some useful bit of information.
• The LWP::Simple module makes this process very easy. It uses a procedural style instead of an object-oriented one.
• The “get” function returns the HTML code from the web site given as its argument, as a string (or undef if the request fails):
    use LWP::Simple;
    my $html = get("http://biolinx.bios.niu.edu/bios546/start_hello3.html");
    print $html;
• “getstore” fetches the web page given as its first argument and stores it in the file given as its second argument:
    getstore($www_file, $local_file);
More LWP::Simple
• “is_error” takes the status code returned by a web operation and returns a true value if the operation didn’t work, which happens entirely too frequently. Using is_error is a good general practice:
    unless (is_error(getstore($www_file, $local_file))) {
        print "Success transferring file: $local_file\n";
    }
    else {
        print "Error transferring file: $www_file\n";
    }
• These same commands can be used for simple FTP requests as well, assuming you don’t have to deal with password and user name issues. This example comes from the FTP download section of the NCBI’s website. The index for this section is found at http://www.ncbi.nlm.nih.gov/Ftp/.
    my $ftp_file = "ftp://ftp.ncbi.nih.gov/blast/documents/developer/readdb.txt";
    my $local_file = "ncbi_file.txt";
    getstore($ftp_file, $local_file);
Multiple Downloads
• A word of caution: if you request many documents in rapid succession from a single site, you risk overloading their server and drawing their anger. This is easy to do if you just put an internet request into a loop and let it run: it will issue the requests one after another with no pause at all.
• To allow servers to keep up with you, it is useful to pause a bit between requests. The command “sleep” accomplishes this: the Perl program stops executing for the number of seconds given as the parameter to sleep.
• For example:
    for (my $i = 0; $i <= 100; $i++) {
        my $html = get("http://biolinx.bios.niu.edu/bios546/start_hello3.html");
        print $html;
        sleep(5);   # pause for 5 seconds
    }
Review of Object-Oriented Syntax
• An object in Perl is usually a reference to an anonymous hash. The hash has a set of key-value pairs, which are usually accessed through specific functions (methods) rather than being directly addressed.
• Objects are usually associated with specific Perl modules. A module is a set of functions (which can also be called object methods) that are imported into your program with a statement such as “use GD;” or “use LWP::Simple;”.
• To use the object methods, you first need to create the object: the hash that contains the data to be manipulated by the methods. Most, but not all, Perl modules create objects with a “new” method. For example:
    my $ftp = Net::FTP->new("your_ftp_site");
• In this example, $ftp is a reference to the anonymous hash that is the object you have created. Net::FTP->new is an invocation of the “new” method from the Net::FTP module, and ("your_ftp_site") is a parameter that is being passed to that method.
• Some modules allow you to invoke “new” with a procedure-oriented syntax (Perl calls this indirect object syntax). That is, the function is written first, followed by the module name and its parameters:
    my $im = new GD::Image(760, 420);
• Some modules use some term other than “new” to create objects. In this example, “connect” is used instead of “new”:
    my $dbh = DBI->connect($database, $name, $password);
Object Methods
• Once the object has been created, other methods can be invoked on it. These methods almost always use the arrow notation, which implies that the given method is invoked on the data contained in that specific object. For example:
    $ftp->get("blast_out.txt", "local.txt");
• In OO-speak, the “new” method is a class method, in that it is not operating on any specific data. Methods like “get” are instance methods (or object methods), which operate on a pre-existing object that contains a specific set of data.
• Class methods start with the module’s name, followed by the arrow, while instance methods start with the scalar variable that references the object:
    my $ftp = Net::FTP->new("your_ftp_site");   # class method
    $anchor = $parse->get_tag("a");             # instance method
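To make the class method / instance method distinction concrete, here is a minimal sketch of a home-made class. The Counter package and its methods are hypothetical names invented for this illustration; they do not come from any real module. It assumes the common case where the object is a blessed reference to an anonymous hash.

```perl
use strict;
use warnings;

# Hypothetical class, for illustration only.
package Counter;

# Class method: called as Counter->new(...), so $class is the string "Counter".
sub new {
    my ($class, $start) = @_;
    my $self = { count => defined $start ? $start : 0 };   # anonymous hash holding the data
    return bless $self, $class;                            # bless ties the hash to the class
}

# Instance method: called as $counter->increment, so $self is the object.
sub increment {
    my ($self) = @_;
    $self->{count}++;
    return $self->{count};
}

# Accessor, so callers don't need to reach into the hash directly.
sub count {
    my ($self) = @_;
    return $self->{count};
}

package main;

my $counter = Counter->new(5);   # class method
$counter->increment;             # instance method
print "Count is now ", $counter->count, "\n";
```

The arrow does the work in both calls: whatever is on its left side (the class name or the object) is passed to the method as its first argument.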
Full FTP Access through Perl
• Using LWP::Simple you can download a file through FTP if it is accessible to the general public. At the server end, this means that you are accessing an account whose user name is “anonymous”, with no required password.
• Many times you will need to use FTP to send or receive a file for a password-protected account. Unfortunately, LWP::Simple won’t help you with this. Instead, use the Net::FTP module, which allows full control of all the standard FTP commands.
• Net::FTP uses object-oriented syntax. So, the first thing you need to do is create a new FTP object, using the address of the FTP site you wish to contact:
    use Net::FTP;
    my $ftp = Net::FTP->new("your_ftp_site")
        or die "Couldn't connect: $@\n";
• The $@ gives an error message if the connection couldn’t be made.
More FTP Access
• Next you need to log in:
    $ftp->login("user_name", "password")
        or die "Couldn't login: ", $ftp->message, "\n";
• The $ftp->message section gives you the error message generated by the Net::FTP module; the $@ error messages are much less useful once the FTP object has been created with “new”.
• Notice that your password is going to be written in plain text in your program. Are you sure this will be secure?
• Once you have logged in, it is just a matter of using various FTP commands. Here are the most useful of them:
• Change working directory (usually we would use “cd”, but you need “cwd” here):
    $ftp->cwd("sub_directory_name");
• Download (get) the file “blast_out.txt” from the FTP server to your local file “local.txt”. If you don’t supply the second argument, the local file gets the same name as the original file.
    $ftp->get("blast_out.txt", "local.txt");
• Upload (put) your local file “junk.txt” onto the FTP server with the name “remote.txt”. If you don’t supply the second argument, the remote file gets the same name as the local file.
    $ftp->put("junk.txt", "remote.txt")
        or die "Couldn't put file\n";
• End the FTP session:
    $ftp->quit;
SFTP Access
• Biolinx and most of the other servers I deal with use SSH and SFTP instead of telnet and FTP. SSH and SFTP use encryption to send data, especially passwords, over the net.
• Thus, the FTP server on biolinx is not functional and the Net::FTP module won’t work.
• We can use the Net::SFTP module instead.
    use Net::SFTP;
    my $sftp = Net::SFTP->new("host_name");
    $sftp->get("remote_file", "local_file");
    $sftp->put("local_file", "remote_file");
    $sftp->ls("remote_path");   # gets directory listing
Filling Out Forms
• Many web applications are forms that require you to fill out some information, and then they return an HTML file containing the requested information. This situation can best be dealt with using the HTTP::Request::Common and LWP::UserAgent modules.
• The HTTP::Request::Common module creates an HTTP Request object, a standard header to be sent over the internet to the server requesting that it take some action. In this case, the request will contain name=value pairs for the CGI program on the server to process.
• The LWP::UserAgent module is what actually submits the HTTP request to the lower layers of the computer, which in turn send it out over the web. This module also then collects and interprets the response.
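It can help to look at a request object before sending anything. The following is a small sketch, not part of the original slides, that builds a POST request with HTTP::Request::Common and prints its parts without contacting the server. The two form fields are taken from the hello3 form used in this lecture; passing them as an array reference instead of a hash reference is my choice here, so that the pairs stay in a fixed order.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

my $form_action = "http://biolinx.bios.niu.edu/cgi-bin/bios546/hello3.cgi";

# Build the request object only; nothing is sent over the network here.
# An array reference keeps the name=value pairs in the order given.
my $request = POST $form_action, [ your_name => "Hortense", line => "thin" ];

print "Method:  ", $request->method, "\n";
print "URL:     ", $request->uri, "\n";
print "Type:    ", $request->header("Content-Type"), "\n";
print "Content: ", $request->content, "\n";
```

The content line shows the name=value pairs joined with “&”, which is what the CGI program on the server will receive once a UserAgent object submits the request.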
More Forms
• First, you need to create a new UserAgent object.
    use HTTP::Request::Common;
    use LWP::UserAgent;
    my $ua = LWP::UserAgent->new;
• You also need the URL specified in the form’s action.
    my $form_action = "http://biolinx.bios.niu.edu/cgi-bin/bios546/hello3.cgi";
• The form name=value pairs need to be put into a hash, and then a reference to that hash needs to be supplied as an argument.
• Note that you can supply any values you want; you are not limited to the choices given in the form. Of course, there’s no telling what the CGI program will do with your unexpected values.
    my %form_responses = (
        your_name => "Hortense",
        greeting  => "OH NOOOO IT's ",
        line      => "thin",
        ck        => "this_one",
    );
• Then supply this information to the UserAgent object, specifying the POST method (which actually comes from the HTTP::Request::Common module).
    my $response = $ua->request(POST $form_action, \%form_responses);
• The $response is a reference to an anonymous hash. The contents of the response (the HTML supplied by the CGI program in response to your request) are the value of the key “_content”; the “content” method, $response->content, returns the same thing without reaching into the hash. So, printing out the response could be done by:
    print "$response->{_content}\n";
More Forms
• So, the whole program to fill out and submit this form looks like:
    use HTTP::Request::Common;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $form_action = "http://biolinx.bios.niu.edu/cgi-bin/bios546/hello3.cgi";
    my %form_responses = (
        your_name => "Hortense",
        greeting  => "OH NOOOO It's ",
        line      => "thin",
        ck        => "this_one",
    );
    my $response = $ua->request(POST $form_action, \%form_responses);
    print "$response->{_content}\n";
Using LWP::UserAgent to GET an HTML File
• This can be done using LWP::Simple, as discussed earlier, but LWP::UserAgent and HTTP::Request::Common can also be used. The method parallels the form-filling method described above, except that GET is used instead of POST. Both POST and GET are functions found in the HTTP::Request::Common module; LWP::UserAgent also has its own “get” method, which builds and submits the request in one step, and that is what is used here.
    use HTTP::Request::Common;
    use LWP::UserAgent;

    my $html_file = "http://biolinx.bios.niu.edu/bios546/start_hello3.html";
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get($html_file);
    my $file = $response->content;
    print "$file\n";
Password-Protected HTML Sites 1
• Rather than using FTP or SFTP, many servers have download areas that use regular HTML code, but with a password protection mechanism called “HTTP Authentication”.
• When you request access to a document on such a server, the HTTP Authentication method returns a message saying that the document is part of a “realm” that requires a user name and password.
• The trick is, you need to get that realm name. It is listed in the HTTP headers, under “www-authenticate”.
• It is always possible to examine all key-value pairs in the hash returned by the “get” function, and also in all sub-hashes within it:
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get($html_file);
    print " : $response->{_headers}{'www-authenticate'}\n";
Password-Protected HTML Sites 2
• The “credentials” method of LWP::UserAgent can then be used to give the user name and password. This is done before using the “get” function.
• The credentials method does have some syntax issues:
    $ua->credentials('server-name:port', 'realm-name', 'username' => 'password');
• Server-name is the web site URL without the http://, and before the first /. Thus, the main Biology department server’s name is www.bios.niu.edu
• The port number for most HTTP transactions is 80. It follows the server name after a colon, with no space.
• Realm-name was found on the previous slide, and we presume you have the username and password.
• Put everything in single quotes, and note the => between username and password.
• Once the credentials command is given, the rest of the site can be processed as shown on the other slides.
Parsing an HTML Document
• Another useful module is HTML::TokeParser. It can extract the contents of tags (i.e. tag attributes) or the text between tags. It also has an object-oriented interface, and I am only going to describe a few of its functions; other functions are covered in the module’s documentation.
• To create a new object for the module, you have to supply it with a file name or with a reference to a string containing the file’s contents. I prefer to download the HTML file first, save it to a local file, then open and parse it as a separate operation. Since parsing is often a process that needs to be repeated with minor variations until you get it right, saving the HTML locally saves having to download the file more than once.
    use HTML::TokeParser;
    my $parse = HTML::TokeParser->new("local_file.txt");
More HTML::TokeParser
• One important function is get_tag. You supply it with a list of tags that you want to detect, and it returns them one by one; it is normally called in a while loop.
• The return value of get_tag is a reference to an array. If the tag was a start tag, the array has 4 elements:
• 0: the type of tag it is (the tag name, such as “a”)
• 1: a reference to a hash in which the keys are the attributes within that tag and the values are the attribute values
• 2: a reference to an array containing all the attribute names (without their values)
• 3: the actual text of the tag
• For an end tag, the returned array reference has only 2 elements: the type of tag it is (the name with a leading “/”, such as “/a”), and the text of the tag.
get_tag Example
    while (my $token = $parse->get_tag("textarea", "select", "input")) {
        # print tag type
        print "type: $token->[0]\n";

        # print attribute-value hash
        foreach my $key (sort keys %{$token->[1]}) {
            print " $key : $token->[1]{$key}\n";
        }

        # print attribute array
        foreach my $item (@{$token->[2]}) {
            print " attribute: $item\n";
        }

        # print tag text
        print "$token->[3]\n\n";
    }
More HTML::TokeParser
• Two other useful functions: get_text and get_trimmed_text. The latter removes leading and trailing whitespace, and reduces all internal whitespace to a single space character. This proves to be quite useful, because HTML documents often contain excess whitespace that the web browsers ignore.
• These two functions take a list of tags as parameters, and they return a string that is all the text between the current position in the file and the next occurrence of one of the listed tags.
• It is easiest to use these functions in conjunction with get_tag.
    while (my $anchor = $parse->get_tag("a")) {
        # the href attribute comes from the attribute hash of the start tag
        my $url = $anchor->[1]{href};
        my $link_text = $parse->get_trimmed_text('/a');
        print "$url : $link_text\n";
    }
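For experimenting without a download, the parser can be given a reference to a string instead of a file name. Here is a small sketch along those lines; the HTML fragment and the @links array are made up for this example, and HTML::TokeParser must be installed for it to run.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up HTML with the kind of excess whitespace browsers ignore.
my $html = <<'END_HTML';
<html><body>
<a href="http://www.cpan.org">  CPAN
   home </a>
<a href="start_hello3.html">Hello form</a>
</body></html>
END_HTML

# Passing a reference to the string makes the module parse it directly.
my $parse = HTML::TokeParser->new(\$html)
    or die "Couldn't create parser\n";

my @links;   # each entry is [url, link text]
while (my $anchor = $parse->get_tag("a")) {
    my $url = $anchor->[1]{href};
    my $link_text = $parse->get_trimmed_text("/a");
    push @links, [$url, $link_text];
    print "$url : $link_text\n";
}
```

The first link shows what the trimming does: the text between the tags spans two lines in the source, but comes back as a single string with one space between the words.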
Perl Modules
• Lots of useful functions can be performed without you ever having to write any code, because someone else has already done it for you, and written it into a module.
• We have a number of modules already installed: they either came with the standard Perl distribution or I have installed them.
• A way to test for the existence of a module: nearly all modules have built-in documentation, so on the command line type “perldoc” followed by the module’s name. If the module is installed, you get the documentation; if not, you get a “No documentation found” message.
• Another way to test for a module’s existence: in a Perl program, type “use” followed by the module’s name. If it doesn’t exist, you will generate an error message saying that it can’t locate that module in any one of several directories.
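The “use” test can also be done from inside a running program without killing it. This is a minimal sketch: module_installed is a helper name I made up, and No::Such::Module is a deliberately nonexistent module. It relies on “require”, which loads a module at run time, wrapped in “eval” so that a failure is caught instead of ending the program.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, 0 otherwise.
sub module_installed {
    my ($module) = @_;
    # Turn the module name into a relative file path:
    # HTTP::Request::Common becomes HTTP/Request/Common.pm
    (my $file = "$module.pm") =~ s{::}{/}g;
    # require dies if the file can't be found or loaded; eval catches that.
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module ("List::Util", "LWP::Simple", "No::Such::Module") {
    printf "%-20s %s\n", $module,
        module_installed($module) ? "installed" : "not found";
}
```

List::Util comes with the standard Perl distribution, so it should always be reported as installed; what you see for LWP::Simple depends on your machine.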
Finding and Downloading Modules
• Nearly all Perl modules are at the Comprehensive Perl Archive Network (CPAN), at www.cpan.org. There are several search functions associated with this site, because the list is quite long. I like the http://search.cpan.org function, which allows a regular search as well as search by category.
• Following the link for a given module gives you the documentation for that module. Looking over this documentation should give you an idea as to whether this module will do what you need it to.
• The documentation will also have a link for downloading the module. Usually this is a file ending in “.tar.gz” or some such. It is a compressed file that turns into many files when uncompressed.
• You uncompress the .tar.gz file with the command “tar xvzf file_name.tar.gz”. There are other compression schemes: look at the documentation for “gunzip” and “tar” for some help with this, or search the internet.
• When you uncompress it, it will create a new sub-directory under the directory that the .tar.gz file is in, whose name is the module’s name and version number. For example, the Bit::Vector module, version 6.4, produced a sub-directory whose name is Bit-Vector-6.4.
Installing Modules
• Move into the newly created directory, then read the INSTALL.txt and/or README.txt documents. These should give detailed installation instructions.
• One thing often found in these documents is a list of prerequisites: other modules that need to be installed before the one you want can be installed. Getting all the prerequisites in can be a real pain.
• Especially painful is the need some modules have for installing C libraries (they start with “lib”). This usually requires root access on the computer: seek help (from me for biolinx).
• In general, most modules are installed by first running a program called Makefile.PL. Since you probably don’t have the privileges to install modules so they will be available to anyone on the computer, you need to install locally. To do this, start with the command “perl Makefile.PL PREFIX=directory_you_want_to_install_to”.
• Next, issue the command “make”. Then, try “make test”. This may or may not do anything, depending on the module. Finally, run “make install”. If all goes well, you now have a working module!
• Module names like HTTP::Request::Common imply that the actual module is called Common.pm and it is located in a directory called Request, which is a sub-directory of the HTTP directory.
Using Modules
• Include the line “use yourModuleName;” near the top of your program.
• If this is a locally installed module, installed under your home directory and only available to you, you need to put a statement like “use lib 'path_to_directory';” before you put in the “use yourModuleName;” statement. For instance, say you install the module Junk.pm into the /home/z123456/perl_modules directory. The module has a function called “print_junk” that takes a string to print and a number of times to print it as parameters. To use this function:
    use lib '/home/z123456/perl_modules';
    use Junk;
    Junk::print_junk("red", 7);
• Functions in that module can always be invoked by using their fully-qualified name, something like “yourModuleName::yourFunction()”. That is, the module’s name followed by 2 colons (::), followed by the function name, along with whatever parameters that function needs.
• Some modules export some or all of their functions, so you can just use the function names without the module name. The GD graphics module is an example: you just use the function name without having to put GD:: first.
• Other modules need you to specifically import the functions you need. An example is CGI::Carp, where you need to import the fatalsToBrowser function with the line “use CGI::Carp qw(fatalsToBrowser);”. If you don’t do this, that function won’t work.
URL
• Components of a URL: http://biolinx.bios.niu.edu:80/cgi-bin/bios546/hello2.cgi?your_name=Fred#section2
• http:// is the protocol used for communication. http is “hypertext transfer protocol”, for World Wide Web communication, especially hyperlinked text, images, etc. ftp is “file transfer protocol”, used to move files around. smtp is “simple mail transfer protocol”, used for e-mail. There are other protocols as well, but these are the main ones in use today.
• biolinx.bios.niu.edu is the host name. It can also be an IP number instead of a name: 131.156.41.4 is the equivalent of biolinx.bios.niu.edu. In reality, the internet runs on IP numbers: host names get translated into IP numbers by Domain Name Servers, which maintain and update lists of names and numbers.
• :80 is the port number that the web server is listening to. Port 80 is the default for HTTP communications, so the number is usually left off. Other protocols have different default ports. For instance, telnet listens to port 23.
• /cgi-bin/bios546/hello2.cgi is the path within the server to the requested file.
• ?your_name=Fred is the query string, which is one way of encoding name=value pairs from forms.
• #section2 is the document fragment identifier. It moves you to locations within the HTML document that are identified by <a name="section2"> tags. This is an anchor tag, the same type of tag used for hyperlinks, with a “name” attribute.
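As a rough illustration of these components, the sketch below pulls the example URL apart with one regular expression. The pattern and the variable names are my own simplification for this one URL shape; it is not a general-purpose URL parser, and a module such as URI is the safer choice for real work.

```perl
use strict;
use warnings;

my $url = "http://biolinx.bios.niu.edu:80/cgi-bin/bios546/hello2.cgi?your_name=Fred#section2";

# Capture groups, in order: protocol, host, port (optional), path,
# query string (optional), fragment (optional).
my ($protocol, $host, $port, $path, $query, $fragment) =
    $url =~ m{^(\w+)://([^/:?#]+)(?::(\d+))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?\z};

# Split the query string into name=value pairs.
my %params;
if (defined $query) {
    %params = map { my ($name, $value) = split /=/, $_, 2; ($name => $value) }
              split /&/, $query;
}

print "protocol: $protocol\n";
print "host:     $host\n";
print "port:     ", (defined $port ? $port : "80 (default)"), "\n";
print "path:     $path\n";
print "query:    $_ = $params{$_}\n" for sort keys %params;
print "fragment: ", (defined $fragment ? $fragment : "(none)"), "\n";
```

Leaving “:80” out of the URL makes $port undefined, and the sketch then reports the HTTP default instead.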
URL Encoding
• Only a few characters can be part of a URL as plain text: mainly the alphanumeric symbols a-z A-Z 0-9, but also a few others: $-_.+!*'(),
• However, certain characters can be used safely in specific places: http:// for instance, or the # that starts a fragment identifier.
• All other symbols need to be encoded. The symbol codes start with % and are followed by a 2-digit hexadecimal number. For instance, a space is encoded as %20.
• The code numbers are the same as in ASCII (hex numbers 00 through 7F, equivalent to decimal numbers 0-127). The same numbers are also used for encoding HTML character entities, except that HTML character entities use decimal numbers instead of hexadecimal. For example, the ASCII numerical code for a space is 32, which is 20 in hexadecimal. A space written as a character entity is &#32; and in URL encoding it is %20.
• Also, the rest of the ISO-Latin (ISO-8859-1) character set, hex numbers 80-FF, can be used in URLs. These characters cover several other western European alphabets.
• The languages covered include: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish. (Don’t worry, I won’t test you on this list; it is here for entertainment purposes.)
• HTML 4 supports Unicode symbols, including a very wide range of alphabetical characters. It encodes these as 4 hexadecimal digits, which gives room for 65,536 different symbols. However, these can’t be used in URL encoding.
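The encoding rule is simple enough to sketch by hand. In the code below, url_encode and url_decode are hypothetical helper names, and the set of characters left unencoded (letters, digits, and - _ .) is a deliberately conservative assumption, narrower than the list on the slide. It also assumes single-byte (ASCII or ISO-8859-1) input. The URI::Escape module provides ready-made functions for the same job.

```perl
use strict;
use warnings;

# Replace every character outside the conservative safe set with
# % followed by its 2-digit hexadecimal code.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.-])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Reverse the process: turn each %XX back into the character it stands for.
sub url_decode {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

my $encoded = url_encode("Fred Flintstone & Co.");
print "$encoded\n";                  # prints Fred%20Flintstone%20%26%20Co.
print url_decode($encoded), "\n";    # prints Fred Flintstone & Co.

# The same code number, written as an HTML character entity (decimal).
printf "&#%d;\n", ord(" ");          # prints &#32;
```

The last line makes the hexadecimal/decimal point from the slide: the space is code 32 either way, written as 20 in a URL and as 32 in a character entity.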
More on HTTP, TCP/IP, and Internet Theory • when I get around to it....