Perl Tutorial
Practical Extraction and Report Language
http://www.comp.leeds.ac.uk/Perl/start.html
Why Perl? • Perl is built around regular expressions • REs are good for string processing • Therefore Perl is a good scripting language • Perl is especially popular for CGI scripts • Perl makes full use of the power of UNIX • Short Perl programs can be very short • “Perl is designed to make the easy jobs easy, without making the difficult jobs impossible.” -- Larry Wall, Programming Perl
Why not Perl? • Perl is very UNIX-oriented • Perl is available on other platforms... • ...but isn’t always fully implemented there • However, Perl is often the best way to get some UNIX capabilities on less capable platforms • Perl does not scale well to large programs • Weak subroutines, heavy use of global variables • Perl’s syntax is not particularly appealing
What is a scripting language? • Operating systems can do many things • copy, move, create, delete, compare files • execute programs, including compilers • schedule activities, monitor processes, etc. • A command-line interface gives you access to these functions, but only one at a time • A scripting language is a “wrapper” language that integrates OS functions
Major scripting languages • UNIX has sh, Perl • Macintosh has AppleScript, Frontier • Windows has no major scripting languages • probably due to the weaknesses of DOS • Generic scripting languages include: • Perl (most popular) • Tcl (easiest for beginners) • Python (new, Java-like, best for large programs)
Perl Example 1
#!/usr/local/bin/perl
#
# Program to do the obvious
#
print 'Hello world.';     # Print a message
Comments on “Hello, World”
• Comments run from # to the end of the line
• But the first line, #!/usr/local/bin/perl, is special: it tells the system where to find the Perl interpreter
• Perl statements end with semicolons
• Perl is case-sensitive
• Perl is compiled and run in a single operation
Variables
• A variable is the name of a place where some information is stored. For example:
$yearOfBirth = 1976;
$currentYear = 2000;
$age = $currentYear - $yearOfBirth;
print $age;
• The same name can also store a string:
$yearOfBirth = 'None of your business';
• The variables in the example program can be identified as such because their names start with a dollar sign ($). Perl uses different prefix characters for different kinds of names. Here is an overview:
• $: variable containing a scalar value such as a number or a string
• @: variable containing a list with numeric keys (an array)
• %: variable containing a list with strings as keys (a hash)
• &: subroutine
Operations on numbers • Perl contains the following arithmetic operators: • +: sum • -: subtraction • *: product • /: division • %: modulo division • **: exponent • Apart from these operators, Perl contains some built-in arithmetic functions. Some of these are mentioned in the following list: • abs($x): absolute value • int($x): integer part • rand(): random number between 0 and 1 • sqrt($x): square root
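The operators and functions listed above can be tried out in a few lines. This is only a small sketch; the variable names and sample values are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = -7.5;              # hypothetical sample value

my $abs  = abs($x);        # absolute value: 7.5
my $int  = int($x);        # integer part, truncated toward zero: -7
my $root = sqrt(16);       # square root: 4
my $mod  = 17 % 5;         # modulo division: 2
my $pow  = 2 ** 10;        # exponent: 1024
my $r    = rand();         # random number, 0 <= $r < 1

print "abs=$abs int=$int sqrt=$root mod=$mod pow=$pow\n";
```

Note that int() truncates rather than rounds, so int(-7.5) is -7, not -8.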
Test your understanding • $text =~ s/bug/feature/; • $text =~ s/bug/feature/g; • $text =~ tr/[A-Z]/[a-z]/; • $text =~ tr/AEIOUaeiou//d; • $text =~ tr/[0-9]/x/cs; • $text =~ s/[A-Z]/CAPS/g;
Examples
# replace first occurrence of "bug"
$text =~ s/bug/feature/;
# replace all occurrences of "bug"
$text =~ s/bug/feature/g;
# convert to lower case (tr takes plain character lists, so [ and ] are
# literal characters here; tr/A-Z/a-z/ is the cleaner form)
$text =~ tr/[A-Z]/[a-z]/;
# delete vowels
$text =~ tr/AEIOUaeiou//d;
# replace each run of non-digit characters with a single x
# (because of the literal brackets, [ and ] are left untouched as well)
$text =~ tr/[0-9]/x/cs;
# replace each capital letter by CAPS
$text =~ s/[A-Z]/CAPS/g;
Regular expressions
Examples:
1. Clean an HTML formatted text
2. Grab URLs from a Web page
3. Transform all lines from a file into lower case
• \b: word boundaries
• \d: digits
• \n: newline
• \r: carriage return
• \s: white space characters
• \t: tab
• \w: word characters (letters, digits and underscore)
• ^: beginning of string
• $: end of string
• .: any character except a newline
• [bdkp]: characters b, d, k and p
• [a-f]: characters a to f
• [^a-f]: all characters except a to f
• abc|def: string abc or string def
• *: zero or more times
• +: one or more times
• ?: zero or one time
• {p,q}: at least p times and at most q times
• {p,}: at least p times
• {p}: exactly p times
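The three example tasks named on this slide can be sketched with the constructs from the list. This is a deliberately naive illustration: the sample HTML string is invented, and stripping tags with a regular expression is only adequate for simple input, not a substitute for a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input
my $html = '<p>See <a href="http://www.perl.org/">Perl</a> and '
         . '<a href="http://www.cpan.org/">CPAN</a>.</p>';

# 2. Grab URLs: collect whatever sits between href=" and the next "
my @urls = $html =~ /href="([^"]+)"/g;

# 1. Clean HTML: delete everything that looks like a tag
(my $text = $html) =~ s/<[^>]+>//g;

# 3. Transform into lower case
my $lower = lc $text;

print "$_\n" for @urls;
print "$text\n";
print "$lower\n";
```

With the sample string above, $text becomes "See Perl and CPAN." and @urls holds the two addresses.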
Lists and arrays
@a = ();                                  # empty list
@b = (1,2,3);                             # three numbers
@c = ("Jan","Piet","Marie");              # three strings
@d = ("Dirk",1.92,46,"20-03-1977");       # a mixed list
• Variables and sublists are interpolated in a list
@b = ($a,$a+1,$a+2);                      # variable interpolation
@c = ("Jan",("Piet","Marie"));            # list interpolation
@d = ("Dirk",1.92,46,(),"20-03-1977");    # the empty list disappears
• You don't get lists containing lists, just a simple flat list
@e = ( @b, @c );     # same as (1,2,3,"Jan","Piet","Marie"), taking $a to be 1
Lists and arrays • Practical construction operators • ($x..$y) • @x = (1..6); # same as (1, 2, 3, 4, 5, 6) • @z = (2..5,8,11..13); # same as (2,3,4,5,8,11,12,13) • qw() "quote word" function • qw(Jan Piet Marie) is a shorter notation for ("Jan","Piet","Marie").
Split
• split takes a regular expression and a string, and splits the string into a list, breaking it into pieces at the places where the regular expression matches.
$string = "Jan Piet\nMarie \tDirk";
@list = split /\s+/, $string;     # yields ( "Jan","Piet","Marie","Dirk" )
                                  # remember \s is a white space
$string = " Jan Piet\nMarie \tDirk\n";
@list = split /\s+/, $string;     # yields ( "", "Jan","Piet","Marie","Dirk" )
                                  # empty string at the beginning! (trailing empty fields are dropped)
$string = "Jan:Piet;Marie---Dirk";     # use any regular expression...
@list = split /[:;]|---/, $string;     # yields ( "Jan","Piet","Marie","Dirk" )
$string = "Jan Piet";     # use an empty regular expression to split into single characters
@letters = split //, $string;     # yields ( "J","a","n"," ","P","i","e","t" )
More about arrays
@array = ("an","bert","cindy","dirk");
$length = @array;           # $length now has the value 4
print $length;              # prints 4
print $#array;              # prints 3, last valid subscript
print $array[$#array];      # prints "dirk"
print scalar(@array);       # prints 4
Working with lists
• Subscripts and interpolation convert lists to strings
@array = ("an","bert","cindy","dirk");
print "The array contains $array[0] $array[1] $array[2] $array[3]";
print "The array contains @array";     # interpolate the whole array
• The function join STRING LIST joins the items with a separator
$string = join ":", @array;     # $string now has the value "an:bert:cindy:dirk"
• Iteration over lists
for ($i = 0; $i <= $#array; $i++) {
    $item = $array[$i];
    $item =~ tr/a-z/A-Z/;
    print "$item ";
}
foreach $item (@array) {
    $item =~ tr/a-z/A-Z/;     # $item is an alias, so this also changes @array itself
    print "$item ";           # prints an upper-case version of each item
}
More about arrays – multiple value assignments
($a, $b) = ("one","two");
($onething, @manythings) = (1,2,3,4,5,6);
# now $onething equals 1
# and @manythings = (2,3,4,5,6)
($array[0],$array[1]) = ($array[1],$array[0]);
# swap the first two
• Note that an assignment first evaluates the right-hand side of the expression, and then stores a copy of the result in the variable
@array = ("an","bert","cindy","dirk");
@copyarray = @array;          # copies every element; the two arrays are now independent
$copyarray[2] = "XXXXX";      # @array is unchanged
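A short sketch can confirm the copy and swap behaviour described on this slide. The variable names follow the slide; the printed results are what the assignments above imply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first scalar takes one value, the array takes all the rest
my ($onething, @manythings) = (1, 2, 3, 4, 5, 6);

my @array     = ("an", "bert", "cindy", "dirk");
my @copyarray = @array;           # element-by-element copy
$copyarray[2] = "XXXXX";          # changes the copy only

# Swap the first two items; the right-hand side is evaluated first
($array[0], $array[1]) = ($array[1], $array[0]);

print "onething=$onething manythings=@manythings\n";
print "array=@array\n";           # bert an cindy dirk
print "copyarray=@copyarray\n";   # an bert XXXXX dirk
```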
Manipulating lists and their elements PUSH • push ARRAY LIST • appends the list to the end of the array. • if the second argument is a scalar rather than a list, it appends it as the last item of the array. • @array = ("an","bert","cindy","dirk"); • @brray = ("eve","frank"); • push @array, @brray; • # @array is ("an","bert","cindy","dirk","eve","frank") • push @brray, "gerben"; • # @brray is ("eve","frank","gerben")
Manipulating lists and their elements POP
• pop ARRAY does the opposite of push: it removes the last item of its argument list and returns it.
• If the list is empty it returns undef.
@array = ("an","bert","cindy","dirk");
$item = pop @array;
# $item is "dirk" and @array is ( "an","bert","cindy" )
• shift @array removes and returns the first element. It works on the left end of the list, but is otherwise the same as pop.
• unshift (@array, @newStuff) puts items on the left side of the list, just as push does for the right side.
Grep • grep CONDITION LIST • returns a list of all items from list that satisfy some condition. • For example: • @large = grep $_ > 10, (1,2,4,8,16,25); # returns (16,25) • @i_names = grep /i/, @array; # returns ("cindy","dirk")
map • map OPERATION LIST • is an extension of grep, and performs an arbitrary operation on each element of a list. • For example: • @array = ("an","bert","cindy","dirk"); • @more = map $_ + 3, (1,2,4,8,16,25); • # returns (4,5,7,11,19,28) • @initials = map substr($_,0,1), @array; • # returns ("a","b","c","d")
Hashes (Associative Arrays)
• associate keys with values; named with %
• allow almost instantaneous lookup of the value that is associated with some particular key
Examples
If %wordfrequency is the hash table:
$wordfrequency{"the"} = 12731;     # creates key "the", value 12731
$phonenumber{"An De Wilde"} = "+31-20-6777871";
$index{$word} = $nwords;
$occurrences{$a}++;     # if this is the first reference,
                        # the value associated with $a will
                        # be increased from 0 to 1
Hash Operations • %birthdays = ("An","25-02-1975","Bert","12-10-1953","Cindy","23-05-1969","Dirk","01-04-1961"); • # fill the hash • %birthdays = (An => "25-02-1975", Bert => "12-10-1953", Cindy => "23-05-1969", Dirk => "01-04-1961" ); • # fill the hash; the same as above, but more explicit • @list = %birthdays; # make a list of the key/value pairs • %copy_of_bdays = %birthdays; # copy a hash
Hashes (What if not there?)
• Three different tests: existing, defined and true.
• If a key does not exist in the hash, accessing it returns the undef value.
• The special test function exists(HASHENTRY) returns true if the key exists in the hash, whatever its value
• if($hash{$key}){...} tests whether the value is true; if(defined($hash{$key})){...} tests whether the value is defined
• Both return false if the key $key has no associated value
print "Exists\n" if exists $hash{$key};
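The difference between the three tests shows up with values such as 0, the empty string and undef. The following sketch uses an invented hash (%hash and its keys are illustrative only) and records the result of each test per key.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: false values, an undefined value, and a normal one
my %hash = (zero => 0, empty => "", undef_val => undef, name => "An");

my %report;
for my $key (qw(zero empty undef_val name missing)) {
    my $e = exists  $hash{$key} ? 1 : 0;   # is the key present at all?
    my $d = defined $hash{$key} ? 1 : 0;   # is the value something other than undef?
    my $t = $hash{$key}         ? 1 : 0;   # is the value true?
    $report{$key} = "exists=$e defined=$d true=$t";
    print "$key: $report{$key}\n";
}
```

Only exists distinguishes a key that is stored with the value undef from a key that is not in the hash at all.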
Perl Example 2
#!/usr/bin/perl
# Remove blank lines from a file
# Usage: singlespace < oldfile > newfile
while ($line = <STDIN>) {
    if ($line eq "\n") { next; }     # skip lines that contain only a newline
    print "$line";
}
More Perl notes
• On the UNIX command line:
• < filename means to get input from this file
• > filename means to send output to this file
• In Perl, <STDIN> reads a line from standard input; STDOUT is the standard output filehandle, which print writes to by default
• Scalar variables start with $
• Scalar variables hold strings or numbers, and they are interchangeable
• Examples:
$priority = 9;
$priority = '9';
• Array variables start with @
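What "interchangeable" means for strings and numbers is easiest to see by applying a numeric and a string operator to the same scalar. A minimal sketch, with variable names other than $priority invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $priority = '9';            # stored as a string

my $sum = $priority + 1;       # used as a number: 10
my $cat = $priority . 1;       # used as a string: "91"

my $same_number = ($priority == 9)   ? "yes" : "no";   # numeric comparison
my $same_string = ($priority eq "9") ? "yes" : "no";   # string comparison

print "sum=$sum cat=$cat same_number=$same_number same_string=$same_string\n";
```

The operator, not the variable, decides whether the value is treated as a number or as a string.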
Perl Example 3
#!/usr/local/bin/perl
# Usage: fixm <filenames>
# Replace \r with \n -- replaces input files
foreach $file (@ARGV) {
    print "Processing $file\n";
    if (-e "fixm_temp") { die "*** File fixm_temp already exists!\n"; }
    if (! -e $file) { die "*** No such file: $file!\n"; }
    open DOIT, "| tr \'\\015' \'\\012' < $file > fixm_temp"
        or die "*** Can't: tr '\\015' '\\012' < $file > fixm_temp\n";
    close DOIT;
    open DOIT, "| mv -f fixm_temp $file"
        or die "*** Can't: mv -f fixm_temp $file\n";
    close DOIT;
}
Comments on example 3
• In # Usage: fixm <filenames>, the angle brackets just mean to supply a list of file names here
• In UNIX text editors, the \r (carriage return) character usually shows up as ^M (hence the name fixm_temp)
• The UNIX command tr '\015' '\012' replaces all \015 characters (\r) with \012 (\n) characters
• The format of the open and close commands is:
• open fileHandle, fileName
• close fileHandle
• "| tr \'\\015' \'\\012' < $file > fixm_temp" says: take input from $file, pipe it to the tr command, and put the output in fixm_temp
Arithmetic in Perl
$a = 1 + 2;      # Add 1 and 2 and store in $a
$a = 3 - 4;      # Subtract 4 from 3 and store in $a
$a = 5 * 6;      # Multiply 5 and 6
$a = 7 / 8;      # Divide 7 by 8 to give 0.875
$a = 9 ** 10;    # Nine to the power of 10, that is, 9^10
$a = 5 % 2;      # Remainder of 5 divided by 2
++$a;            # Increment $a and then return it
$a++;            # Return $a and then increment it
--$a;            # Decrement $a and then return it
$a--;            # Return $a and then decrement it
String and assignment operators
$a = $b . $c;    # Concatenate $b and $c
$a = $b x $c;    # $b repeated $c times
$a = $b;         # Assign $b to $a
$a += $b;        # Add $b to $a
$a -= $b;        # Subtract $b from $a
$a .= $b;        # Append $b onto $a
Single and double quotes • $a = 'apples'; • $b = 'bananas'; • print $a . ' and ' . $b; • prints: apples and bananas • print '$a and $b'; • prints: $a and $b • print "$a and $b"; • prints: apples and bananas
Arrays
@food = ("apples", "bananas", "cherries");
• But a single element is a scalar, so it is written with $:
print $food[1];     # prints "bananas"
@morefood = ("meat", @food);
# @morefood is now ("meat", "apples", "bananas", "cherries")
($a, $b, $c) = (5, 10, 20);
push and pop • push adds one or more things to the end of a list • push (@food, "eggs", "bread"); • push returns the new length of the list • pop removes and returns the last element • $sandwich = pop(@food); • $len = @food; # $len gets length of @food • $#food # returns index of last element
foreach
# Visit each item in turn and call it $morsel
foreach $morsel (@food) {
    print "$morsel\n";
    print "Yum yum\n";
}
Tests
• “Zero” is false. This includes: 0, '0', "0", '', "" and the undefined value
• Anything not false is true
• Use == and != for numbers, eq and ne for strings
• &&, ||, and ! are and, or, and not, respectively.
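The rules above can be checked directly. In this sketch the helper truth() and the sample values are invented for illustration; note that strings such as "0.0" and "00" are true, because only the exact string "0" and the empty string count as false.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in a test
sub truth { my ($v) = @_; return $v ? "true" : "false"; }

print "0     is ", truth(0),     "\n";   # false
print "'0'   is ", truth('0'),   "\n";   # false
print "''    is ", truth(''),    "\n";   # false
print "'0.0' is ", truth('0.0'), "\n";   # true: not the exact string "0"
print "'00'  is ", truth('00'),  "\n";   # true
print "' '   is ", truth(' '),   "\n";   # true: a blank is not empty

# == compares as numbers, eq compares as strings
my $numeric_equal = ("1.0" == 1)   ? "equal" : "different";
my $string_equal  = ("1.0" eq "1") ? "equal" : "different";
print "numeric: $numeric_equal, string: $string_equal\n";
```

Using == on strings or eq on numbers is a common source of surprises, as the last two lines show.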
for loops • for loops are just as in C or Java • for ($i = 0; $i < 10; ++$i){ print "$i\n";}
while loops
#!/usr/local/bin/perl
print "Password? ";
$a = <STDIN>;
chomp $a;     # Remove the newline at end
while ($a ne "fred") {
    print "sorry. Again? ";
    $a = <STDIN>;
    chomp $a;
}
do..while and do..until loops
#!/usr/local/bin/perl
do {
    print "Password? ";
    $a = <STDIN>;
    chomp $a;     # Remove the newline at end
} while ($a ne "fred");
if statements
if ($a) {
    print "The string is not empty\n";
} else {
    print "The string is empty\n";     # also reached when $a is "0", which counts as false
}
if - elsif statements
if (!$a) {
    print "The string is empty\n";
} elsif (length($a) == 1) {
    print "The string has one character\n";
} elsif (length($a) == 2) {
    print "The string has two characters\n";
} else {
    print "The string has many characters\n";
}
Why Perl? • Two factors make Perl important: • Pattern matching/string manipulation • Based on regular expressions (REs) • REs are similar in power to those in Formal Languages… • …but have many convenience features • Ability to execute UNIX commands • Less useful outside a UNIX environment
Basic pattern matching
• $sentence =~ /the/
• True if $sentence contains "the"
$sentence = "The dog bites.";
if ($sentence =~ /the/)     # is false
• …because pattern matching in Perl is case-sensitive
• !~ is "does not contain"
RE special characters
.     # Any single character except a newline
^     # The beginning of the line or string
$     # The end of the line or string
*     # Zero or more of the last character
+     # One or more of the last character
?     # Zero or one of the last character
RE examples
^.*$        # matches the entire string
hi.*bye     # matches from "hi" to "bye" inclusive
x +y        # matches x, one or more blanks, and y
^Dear       # matches "Dear" only at beginning
bags?       # matches "bag" or "bags"
hiss+       # matches "hiss", "hisss", "hissss", etc.
Square brackets
[qjk]        # Either q or j or k
[^qjk]       # Neither q nor j nor k
[a-z]        # Anything from a to z inclusive
[^a-z]       # Any character that is not a lower case letter
[a-zA-Z]     # Any letter
[a-z]+       # Any non-zero sequence of lower case letters
More examples
[aeiou]+          # matches one or more vowels
[^aeiou]+         # matches one or more nonvowels
[0-9]+            # matches an unsigned integer
[0-9A-F]          # matches a single hex digit
[a-zA-Z]          # matches any letter
[a-zA-Z0-9_]+     # matches identifiers
More special characters
\n     # A newline
\t     # A tab
\w     # Any word character (letter, digit or underscore); same as [a-zA-Z0-9_]
\W     # Any non-word char; same as [^a-zA-Z0-9_]
\d     # Any digit. The same as [0-9]
\D     # Any non-digit. The same as [^0-9]
\s     # Any whitespace character
\S     # Any non-whitespace character
\b     # A word boundary, outside [] only
\B     # No word boundary
Quoting special characters
\|     # Vertical bar
\[     # An open square bracket
\)     # A closing parenthesis
\*     # An asterisk
\^     # A caret symbol
\/     # A slash
\\     # A backslash
Alternatives and parentheses
jelly|cream     # Either jelly or cream
(eg|le)gs       # Either eggs or legs
(da)+           # Either da or dada or dadada or...
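Several of the patterns from the regular expression slides can be tried against sample strings. This is a small sketch: the helper check() and the test strings are made up for the example, while the patterns themselves are taken from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: does the string match the pattern?
sub check {
    my ($string, $pattern) = @_;
    return $string =~ /$pattern/ ? "match" : "no match";
}

print "bags? vs 'bag':             ", check("bag", 'bags?'), "\n";             # match
print "hiss+ vs 'his':             ", check("his", 'hiss+'), "\n";             # no match
print "x +y vs 'x   y':            ", check("x   y", 'x +y'), "\n";            # match
print "^Dear vs 'Oh Dear':         ", check("Oh Dear", '^Dear'), "\n";         # no match
print "jelly|cream vs 'ice cream': ", check("ice cream", 'jelly|cream'), "\n"; # match
print "(eg|le)gs vs 'legs':        ", check("legs", '(eg|le)gs'), "\n";        # match
print "(da)+ vs 'dadada':          ", check("dadada", '(da)+'), "\n";          # match
print "the vs 'The dog bites.':    ", check("The dog bites.", 'the'), "\n";    # no match
```

The patterns are passed as single-quoted strings so that characters such as $ and \ reach the regular expression unchanged.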