Introduction to Perl, Part II. By: Cédric Notredame (Adapted from BT McInnes)
Command Line Arguments
• Command line arguments in Perl are extremely easy.
• @ARGV is the array that holds all arguments passed in from the command line.
• Example:
    ./prog.pl arg1 arg2 arg3
• @ARGV would contain ('arg1', 'arg2', 'arg3').
• $#ARGV returns the index of the last command line argument, which is one less than the number of arguments (2 in the example above).
• Remember: $#array is the last index of the array, not its size. Use scalar(@ARGV) to get the number of arguments.
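The difference between the argument count and the last index is easy to check. The following is a minimal sketch; the helper name describe_args and the fallback argument list are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the element count and the last index of a list.
sub describe_args {
    my @args  = @_;
    my $count = scalar @args;   # number of elements
    my $last  = $#args;         # index of the last element (count - 1)
    return ($count, $last);
}

# With no real command line arguments, fall back to the example list.
my @argv = @ARGV ? @ARGV : ('arg1', 'arg2', 'arg3');
my ($count, $last) = describe_args(@argv);
print "count = $count, last index = $last\n";
```

For an empty list the count is 0 and the last index is -1, which is why $#ARGV alone should not be used as the number of arguments.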
File Handles
• Opening a file:
    open (SRC, "my_file.txt");
• Reading from a file:
    $line = <SRC>;   # reads up to and including the next newline character
• Closing a file:
    close (SRC);
File Handles cont...
• Opening a file for output (an existing file is overwritten):
    open (DST, ">my_file.txt");
• Opening a file for appending:
    open (DST, ">>my_file.txt");
• Writing to a file:
    print DST "Printing my first line.\n";
• Safeguarding against opening a nonexistent file:
    open (SRC, "file.txt") || die "Could not open file.\n";
File Test Operators
• Check to see if a file exists:
    if ( -e "file.txt" ) {
        # The file exists!
    }
• Other file test operators:
    -r   readable
    -x   executable
    -d   is a directory
    -T   is a text file
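The file test operators can be combined to classify a path. This is a minimal sketch; the helper name describe_path and its return strings are assumptions chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: describes a path using the file test operators.
sub describe_path {
    my ($path) = @_;
    return "missing"   unless -e $path;   # -e: the path exists
    return "directory" if -d $path;       # -d: it is a directory
    return "text file" if -T $path;       # -T: it looks like a text file
    return "other file";
}

print "$_: ", describe_path($_), "\n" for @ARGV;
```

Testing -e first keeps the later tests from running on a path that does not exist.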
Quick Program with File Handles
• Program to copy a file to a destination file:
    #!/usr/bin/perl -w
    open(SRC, "file.txt") || die "Could not open source file.\n";
    open(DST, ">newfile.txt") || die "Could not open destination file.\n";
    while ( $line = <SRC> ) {
        print DST $line;
    }
    close SRC;
    close DST;
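The same copy can be written with the three-argument form of open and lexical file handles, a style commonly preferred in newer Perl code. This is a sketch rather than the slides' own version; the helper name copy_file is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copies $src to $dst line by line.
sub copy_file {
    my ($src, $dst) = @_;
    # The mode ('<' or '>') is passed separately from the file name.
    open(my $in,  '<', $src) or die "Could not open $src: $!\n";
    open(my $out, '>', $dst) or die "Could not open $dst: $!\n";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close($in);
    close($out) or die "Could not close $dst: $!\n";
    return;
}

copy_file($ARGV[0], $ARGV[1]) if @ARGV == 2;
```

Including $! in the message reports why the open failed, for example a missing file or a permissions problem.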
Some Default File Handles
• STDIN : standard input
    $line = <STDIN>;   # takes input from stdin
• STDOUT : standard output
    print STDOUT "File handling in Perl is sweet!\n";
• STDERR : standard error
    print STDERR "Error!!\n";
The <> File Handle
• The "empty" file handle reads from the file(s) named on the command line, or from STDIN if none are given:
    $line = <>;
• If the program is run as ./prog.pl file.txt, this automatically opens file.txt and reads the first line.
• If the program is run as ./prog.pl file1.txt file2.txt, this reads file1.txt and then file2.txt as one continuous stream; the input itself gives no sign of where one file ends and the next begins.
The <> File Handle cont...
• If the program is run as ./prog.pl with no arguments, it waits for you to type text on standard input and continues until you enter the end-of-file (EOF) character:
• CTRL-D in UNIX
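When the boundary between input files does matter, Perl's built-in $ARGV variable (the name of the file currently being read) and the eof function can expose it. This is a minimal sketch; the helper name read_labelled and the label format are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads the named files through <> and labels each line.
sub read_labelled {
    my @files = @_;
    local @ARGV = @files;                        # <> reads the files named in @ARGV
    my @out;
    while (my $line = <>) {
        chomp $line;
        push @out, "$ARGV: $line";               # $ARGV holds the current file name
        push @out, "-- end of $ARGV --" if eof;  # eof without parentheses: end of the current file
    }
    return @out;
}

if (@ARGV) {
    print "$_\n" for read_labelled(@ARGV);
}
```

The guard on @ARGV keeps the sketch from waiting on standard input when it is run without file names.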
Example Program with STDIN
• Suppose you want to determine whether you are one of the Three Stooges:
    #!/usr/local/bin/perl
    %stooges = (larry => 1, moe => 1, curly => 1);
    print "Enter your name: ";
    $name = <STDIN>;
    chomp $name;
    if ($stooges{ lc($name) }) {
        print "You are one of the Three Stooges!!\n";
    } else {
        print "Sorry, you are not a Stooge!!\n";
    }
Combining File Content
• Given the two following files:
    File1.txt
        1
        2
        3
    File2.txt
        a
        b
        c
• Write a program that takes the two files as arguments and outputs a third file that looks like:
    File3.txt
        1
        a
        2
        b
        3
        c
• Tip: ./mix_files File1.txt File2.txt File3.txt
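Combining File Content
    #!/usr/bin/perl
    open (F, "$ARGV[0]") || die "Could not open $ARGV[0].\n";
    open (G, "$ARGV[1]") || die "Could not open $ARGV[1].\n";
    open (H, ">$ARGV[2]") || die "Could not open $ARGV[2].\n";
    # Read one line from each file per iteration; stop when either file runs out.
    while ( defined($l1 = <F>) && defined($l2 = <G>) ) {
        print H "$l1$l2";
    }
    close (F);
    close (G);
    close (H);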
Chomp and Chop
• Chomp : function that deletes a trailing newline from the end of a string.
    $line = "this is the first line of text\n";
    chomp $line;    # removes the newline character
    print $line;    # prints "this is the first line of text" without the trailing newline
• Chop : function that chops off the last character of a string, whatever it is.
    $line = "this is the first line of text";
    chop $line;
    print $line;    # prints "this is the first line of tex"
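The practical difference shows up when the string has no trailing newline. A minimal sketch, with variable names chosen only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $with_newline = "text\n";
my $plain_chomp  = "text";
my $plain_chop   = "text";

chomp $with_newline;   # "text": the trailing newline is removed
chomp $plain_chomp;    # still "text": chomp only removes a trailing newline
chop  $plain_chop;     # "tex": chop always removes the last character

print "[$with_newline] [$plain_chomp] [$plain_chop]\n";
```

This is why chomp is the safer choice for cleaning up lines read from a file or from STDIN.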
Regular Expressions
• What are regular expressions? A few definitions:
    • A regular expression specifies a set of strings: the formal (regular) language that it defines.
    • In other words, a formula for matching strings that follow a specified pattern.
• Some things you can do with regular expressions:
    • Parse text
    • Add and/or replace subsections of text
    • Remove pieces of text
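Regular Expressions cont..
• A regular expression characterizes a regular language.
• A familiar analogy in UNIX is the shell wildcard (glob) pattern, a simpler notation built on the same idea of describing a set of names with a pattern:
    • ls *.c
        Lists all the files in the current directory whose names end in '.c'
    • ls *.txt
        Lists all the files in the current directory whose names end in '.txt'
• Note that * in a shell pattern means "any sequence of characters", which is not what * means in a regular expression.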
Simple Example for Clarity
• In its simplest form, a regular expression is just the string of characters that you are looking for.
• Suppose we want to find all the words that contain the string 'ing' in our text.
• The regular expression we would use: /ing/
The Match Operator
• What would our program then look like?
    if ($word =~ m/ing/) { print "$word\n"; }
Exercise:
• Download any text you wish from the internet and count all the words in it that contain 'ing'.
• wget "http://www.trinity.edu/~mkearl/family.html"
Exercise:
    #!/usr/local/bin/perl
    $ing = 0;
    while (<>) {
        chomp;
        @words = split / /;
        foreach $word (@words) {
            if ($word =~ m/ing/) {
                print "$word\n";
                $ing++;
            }
        }
    }
    print "$ing words containing 'ing'\n";
Regular Expressions Types
• Regular expressions are composed of two types of characters:
• Literals
    • Normal text characters
    • Like what we saw in the previous program ( /ing/ )
• Metacharacters
    • Special characters
    • Add a great deal of flexibility to your search
Metacharacters
• Match more than just literal characters
• Match line position:
    ^   start of a line (caret)
    $   end of a line (dollar sign)
• Match any one character from a list: [ ... ]
• Example:
    /[Bb]ridget/   matches Bridget or bridget
    /Mc[Ii]nnes/   matches McInnes or Mcinnes
Our Simple Example Revisited
• Now suppose we only want to match words that end in 'ing' rather than just contain 'ing'.
• How would we change our regular expression to accomplish this?
• Previous regular expression: $word =~ m/ing/
• New regular expression: $word =~ m/ing$/
Ranges of Regular Expressions
• Ranges can be specified inside a character class
• Valid ranges:
    [A-Z]      upper-case Roman alphabet
    [a-z]      lower-case Roman alphabet
    [A-Za-z]   upper- or lower-case Roman alphabet
    [A-F]      upper-case A through F
    [A-z]      valid, but be careful: in ASCII it also matches the characters [ \ ] ^ _ ` that lie between Z and a
• Invalid ranges (the start of a range must not come after its end in character order):
    [a-Z]      not valid
    [F-A]      not valid
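The [A-z] pitfall can be seen by testing a few single characters against both classes. A minimal sketch, assuming ASCII input; the helper names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: test one character against each class.
sub is_letter_strict { my ($char) = @_; return $char =~ /^[A-Za-z]$/ ? 1 : 0; }
sub is_letter_loose  { my ($char) = @_; return $char =~ /^[A-z]$/    ? 1 : 0; }

for my $char ('A', 'z', '_', '[', '5') {
    printf "'%s'  [A-Za-z]: %s  [A-z]: %s\n",
        $char,
        is_letter_strict($char) ? "yes" : "no",
        is_letter_loose($char)  ? "yes" : "no";
}
```

The underscore and the opening bracket are accepted by [A-z] but rejected by [A-Za-z], so the longer form is the one to use when only letters are wanted.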
Ranges cont ...
• Ranges of digits can also be specified:
    [0-9]   valid
    [9-0]   invalid
• Negating ranges:
    /[^0-9]/    matches any character except a digit
    /[^a]/      matches any character except an a
    /^[^A-Z]/   matches a line whose first character is anything other than an upper-case letter
        first ^  : start of line
        second ^ : negation
Our Simple Example Again
• Now suppose we want to create a list of all the words in our text that do not end in 'ing'.
• How would we change our regular expression to accomplish this?
• Previous regular expression: $word =~ m/ing$/
• New regular expression: !($word =~ m/ing$/), which can also be written $word !~ m/ing$/
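Matching Interrogations
    $string =~ /([^.?]+\?)/;               # a run of characters other than '.' and '?', ending in '?'
    $string =~ /[.?]([A-Z0-9][^.?]+\?)/;   # the same, but starting with an upper-case letter or digit that directly follows a '.' or '?'
    $string =~ /([\w\s]+\?)/;              # a run of word and whitespace characters ending in '?'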
Removing HTML Tags
    $string =~ s/\<[^>]+\>/ /g;
• g: substitute EVERY instance, not just the first
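The substitution can be wrapped in a small function to see its effect on a sample string. This is a sketch only: the helper name strip_tags is an assumption, and a single pattern like this is not a full HTML parser (for example, it does not handle a '>' inside an attribute value).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replaces every <...> tag with a single space.
sub strip_tags {
    my ($string) = @_;
    $string =~ s/<[^>]+>/ /g;   # /g replaces every tag, not just the first
    return $string;
}

print "[", strip_tags("<b>bold</b> and <i>italic</i>"), "]\n";
```

Each tag becomes one space, so the result can contain runs of spaces that a later substitution such as s/\s+/ /g could collapse.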
Literal Metacharacters
• Suppose that you actually want to look for a literal '$' in your text
• Use the \ symbol to escape it:
    /\$/    regular expression that searches for a literal $
• What do the following regular expressions match?
    /[ABCDEFGHIJKLMNOP$]\$/
    /[A-P$]\$/
• Both match any line that contains one character from A-P or $, followed immediately by a $
Patterns Provided in Perl
• Some patterns:
    \d   [0-9]
    \w   [a-zA-Z0-9_]
    \s   [ \r\t\n\f]   (white space pattern)
    \D   [^0-9]
    \W   [^a-zA-Z0-9_]
    \S   [^ \r\t\n\f]
• Example: (19\d\d)
• Looks for any year in the 1900s
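The year pattern can be used with the /g modifier in list context to collect every match in a string. A minimal sketch; the helper name find_years is an assumption, and the \b anchors are an addition so that longer digit runs such as 21984 are not reported.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every four-digit year of the form 19xx in the text.
sub find_years {
    my ($text) = @_;
    # In list context, /g returns all the captured groups.
    my @years = $text =~ /\b(19\d\d)\b/g;
    return @years;
}

my @years = find_years("Founded in 1969, renamed in 1984, closed in 2001.");
print "@years\n";
```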
Using Patterns in our Example
• Commonly, words are not separated by just a single space but by tabs, returns, etc.
• Let's modify our split function to handle multiple white space characters:
    #!/usr/local/bin/perl
    while (<>) {
        chomp;
        @words = split /\s+/, $_;
        foreach $word (@words) {
            if ($word =~ m/ing$/) {
                print "$word\n";
            }
        }
    }
Word Boundary Metacharacter
• Regular expression to match the start or the end of a 'word': \b
• Examples:
    /Jeff\b/     matches Jeff but not Jefferson
    /Carol\b/    matches Carol but not Caroline
    /Rollin\b/   matches Rollin but not Rolling
    /\bform/     matches form or formation but not information
    /\bform\b/   matches form but neither information nor formation
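The effect of \b on each side of a pattern can be checked against the three words from the examples. A minimal sketch; the helper names are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: one pattern anchored at the start of a word, one at both ends.
sub starts_word { my ($text) = @_; return $text =~ /\bform/   ? 1 : 0; }
sub whole_word  { my ($text) = @_; return $text =~ /\bform\b/ ? 1 : 0; }

for my $text (qw(form formation information)) {
    printf "%-12s /\\bform/: %-3s  /\\bform\\b/: %s\n",
        $text,
        starts_word($text) ? "yes" : "no",
        whole_word($text)  ? "yes" : "no";
}
```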
DOT Metacharacter
• The DOT metacharacter, '.', matches any single character except a newline
    /b.bble/   could match: bobble, babble, bubble
    /.oat/     could match: boat, coat, goat
• Note: '.*' matches any run of characters, as long as possible. This can be handy but can also have hidden ramifications, such as matching far more text than intended.
PIPE Metacharacter
• The PIPE metacharacter, '|', is used for alternation
    /Bridget (Thomson|McInnes)/   matches the text 'Bridget Thomson' or 'Bridget McInnes' (one alternative at a time, never both)
    /B|bridget/                   matches B or bridget
    /^(B|b)ridget/                matches Bridget or bridget at the beginning of a line
Our Simple Example
• Now with our example, suppose that we want to get not only all words that end in 'ing' but also those that end in 'ed'.
• How would we change our regular expression to accomplish this?
• Previous regular expression: $word =~ m/ing$/
• New regular expression: $word =~ m/(ing|ed)$/
The ? Metacharacter
• The metacharacter ? indicates that the character immediately preceding it occurs zero or one time
• Examples:
    /worl?ds/    matches either 'worlds' or 'words'
    /m?ethane/   matches either 'methane' or 'ethane'
The * Metacharacter
• The metacharacter * indicates that the character immediately preceding it occurs zero or more times
• Example:
    /ab*c/   matches 'ac', 'abc', 'abbc', 'abbbc', etc.
• Matches any string that contains an a, optionally followed by a sequence of b's, and then a c.
• Sometimes called the Kleene star
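Both quantifiers can be tried against a handful of candidate words. A minimal sketch; the helper name whole_match is an assumption, and the ^ and $ anchors are added so that the entire word has to match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the whole word matches the given pattern.
sub whole_match {
    my ($word, $pattern) = @_;
    return $word =~ /^(?:$pattern)$/ ? 1 : 0;
}

for my $word (qw(ac abc abbbc adc)) {
    printf "%-8s /ab*c/    %s\n", $word,
        whole_match($word, qr/ab*c/) ? "matches" : "does not match";
}
for my $word (qw(words worlds worllds)) {
    printf "%-8s /worl?ds/ %s\n", $word,
        whole_match($word, qr/worl?ds/) ? "matches" : "does not match";
}
```

The words 'adc' and 'worllds' fail: * allows any number of b's but nothing else between a and c, and ? allows at most one l.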
Our Simple Example again
• Now suppose we want to create a list of all the words in our text that end in 'ing' or 'ings'.
• How would we change our regular expression to accomplish this?
• Previous regular expression: $word =~ m/ing$/
• New regular expression: $word =~ m/ings?$/
Exercise
• For each of the strings (1)-(5), say which of the patterns (1)-(12) it matches. Where there is a match, what would be the values of $MATCH (the English-module name for $&), $1, $2, etc.?
• Strings:
    1) the quick brown fox jumped over the lazy dog
    2) The Sea! The Sea!
    3) (.+)\s*\1
    4) 9780471975632
    5) C:\DOS\PATH\NAME
• Patterns:
    1) /[a-z]/
    2) /(\W+)/
    3) /\W*/
    4) /^\w+$/
    5) /[^\w+$]/
    6) /\d/
    7) /(.+)\s*\1/
    8) /((.+)\s*\1)/
    9) /(.+)\s*((\1))/
    10) /\DOS/
    11) /\\DOS/
    12) /\\\DOS/
Exercise: Answers
• Each string is followed by the numbers of the patterns it matches; each pattern is followed by the numbers of the strings it matches.
• Strings:
    1) the quick brown fox jumped over the lazy dog    1, 2, 3, 5
    2) The Sea! The Sea!                               1, 2, 3, 5, 7, 9
    3) (.+)\s*\1                                       1, 2, 3, 5, 6
    4) 9780471975632                                   3, 4, 6
    5) C:\DOS\PATH\NAME                                2, 3, 5, 10, 11, 12
• Patterns:
    1) /[a-z]/            1, 2, 3
    2) /(\W+)/            1, 2, 3, 5
    3) /\W*/              1, 2, 3, 4, 5 (it can match the empty string, so it matches everything)
    4) /^\w+$/            4
    5) /[^\w+$]/          1, 2, 3, 5
    6) /\d/               3, 4
    7) /(.+)\s*\1/        2
    8) /((.+)\s*\1)/      none
    9) /(.+)\s*((\1))/    2
    10) /\DOS/            5
    11) /\\DOS/           5
    12) /\\\DOS/          5
Modifying Text
• Match
    • Up to this point, we have only attempted to match a given regular expression
    • Example: $variable =~ m/regex/
• Substitution
    • Takes matching one step further: if there is a match, replace it with the given string
    • Example: $variable =~ s/regex/replacement/
    $var =~ s/Cedric/Notredame/g;
    $var =~ s/ing/ed/;
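The difference between a plain substitution and one with the g modifier can be seen on a string with two matches. A minimal sketch; the helper names replace_first and replace_all are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replaces only the first 'ing'.
sub replace_first {
    my ($text) = @_;
    $text =~ s/ing/ed/;
    return $text;
}

# Hypothetical helper: replaces every 'ing' and reports how many were replaced.
sub replace_all {
    my ($text) = @_;
    my $count = ($text =~ s/ing/ed/g) || 0;   # s///g returns the number of substitutions
    return ($text, $count);
}

my $sample = "walking and talking";
my ($all, $count) = replace_all($sample);
print replace_first($sample), "\n";
print "$all ($count substitutions)\n";
```

Both helpers work on a copy of their argument, so the original string is left unchanged.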
Substitution Example
• Suppose that when we find all our words that end in 'ing' we want to replace the 'ing' with 'ed'.
    #!/usr/local/bin/perl -w
    while (<>) {
        chomp $_;
        @words = split /\s+/, $_;
        foreach $word (@words) {
            if ($word =~ s/ing$/ed/) {
                print "$word\n";
            }
        }
    }
Special Variables Modified by a Match
    $target = "I have 25 apples";
    $target =~ /(\d+)/;
• $&  => "25"
    • A copy of the text matched by the regex
• $`  => "I have "
    • A copy of the target text before the match
• $'  => " apples"
    • A copy of the target text after the match
• $1, $2, $3, etc.  ($1 = "25")
    • The text matched by the 1st, 2nd, etc. set of parentheses. Note: $0 is not part of this series; it holds the name of the program.
• $+
    • A copy of the highest-numbered of $1, $2, $3, etc. that was set by the match
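The match variables can be inspected directly after a successful match. A minimal sketch using the same target string; the helper name split_on_number is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the text before, inside and after the first run of digits.
sub split_on_number {
    my ($target) = @_;
    if ($target =~ /(\d+)/) {
        # Copy the special variables right away; the next match overwrites them.
        my ($before, $matched, $after, $group1) = ($`, $&, $', $1);
        return ($before, $matched, $after, $group1);
    }
    return;   # empty list when there is no match
}

my ($before, $matched, $after, $group1) = split_on_number("I have 25 apples");
print "before  : [$before]\n";
print "matched : [$matched]\n";
print "after   : [$after]\n";
print "group 1 : [$group1]\n";
```

The brackets in the output make the leading and trailing spaces of the before and after text visible.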
Our Simple Example once again
• Now let's revise our program to find all the words that end in 'ing' without splitting our line of text into an array of words. A while loop is needed with the g modifier so that every match on the line is visited, not just the first.
    #!/usr/local/bin/perl -w
    while (<>) {
        chomp $_;
        while ($_ =~ /([A-Za-z]*ing\b)/g) {
            print "$&\n";
        }
    }
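An alternative is to evaluate the match in list context, where the g modifier returns every captured word at once. A minimal sketch; the helper name ing_words is an assumption, and a leading \b is added so that each match starts at the beginning of a word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every word on the line that ends in 'ing'.
sub ing_words {
    my ($line) = @_;
    my @words = $line =~ /\b([A-Za-z]*ing)\b/g;   # list context: all captures
    return @words;
}

my @found = ing_words("singing in the spring rain");
print "$_\n" for @found;
```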
Example
    #!/usr/local/bin/perl
    $exp = <STDIN>;
    chomp $exp;
    if ($exp =~ /^([A-Za-z\s]*)\bcrave\b([\sA-Za-z]+)/) {
        print "$1\n";
        print "$2\n";
    }
• Run the program with the string: I crave to rule the world!
• Results:
    "I "
    " to rule the world"   (the '!' is not captured because it is not in the character class)
Example
    #!/usr/local/bin/perl
    $exp = <STDIN>;
    chomp $exp;
    if ($exp =~ /\bcrave\b/) {
        print "$`\n";
        print "$&\n";
        print "$'\n";
    }
• Run the program with the string: I crave to rule the world!
• Results:
    "I "
    "crave"
    " to rule the world!"