Perl in an Hour or maybe two (depends on how fast I talk)
Hello, World! #! /usr/bin/perl use warnings; use strict; # enforces variable scoping print "Hello, world!\n"; • invoke the Perl interpreter with the #! in the first column of the first line. • statements end with semicolons ; • comments are single line, starting with # • “print” followed by a double-quoted string interprets variables and metacharacters. • print by default prints to STDOUT, normally the terminal. • note that many built-in functions, such as print, do not require parentheses (but they can be used for disambiguation). • Processing: a quick compilation/optimization step, followed by execution. Execution starts at the top of the program and proceeds line-by-line. There is no “main” block: it is implied, consisting of the code that is not part of any other block. • for SEED interactions: • put this line into the .bashrc file in your home directory: source ~fig/FIGdisk/config/fig-user-env.sh • then, create your file with: tool_hdr whatever.pl . This installs some code needed for file access.
Numerical Operations • Number representation oddities: • you can use underscores as punctuation within a number: 1_000_000 is the same as 1000000. Perl ignores the _’s. • base 10 exponents are represented by e or E: 1.3e-12 is the Perl representation of 1.3 x 10^-12. • Warning: BLAST programs often give scores like “e-12”. Perl needs to see a 1 in front of this. $score = "1" . $score if substr($score, 0, 1) eq "e"; # the parentheses are required: without them, 1 eq "e" is evaluated first and passed to substr as the length • Numerical operations are mostly as in C (+, -, *, /, % (mod) ), plus ** (exponent), which C lacks. • However, division is floating point: 10 / 3 gives 3.3333. • to get integer division, use “int”: int(10/3) gives 3. • Operator precedence is as in C, but use parentheses!
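A minimal sketch of the points on this slide. The helper name normalize_evalue is hypothetical (it is not part of BLAST or of the original slides), and the sample score "e-12" is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix a bare exponent such as "e-12" with "1"
# so that Perl can treat the string as a number.
sub normalize_evalue {
    my ($score) = @_;
    $score = "1" . $score if lc(substr($score, 0, 1)) eq "e";
    return $score;
}

my $fixed   = normalize_evalue("e-12");   # now a usable number string
my $int_q   = int(10 / 3);                # floating-point result truncated
my $million = 1_000_000;                  # underscores are ignored

print "$fixed $int_q $million\n";
```

The parentheses around the substr arguments matter: they keep eq from binding to the last argument.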
Strings • Strings are a fundamental data type in Perl (as opposed to characters). • A string is anything surrounded by quotes: “dog”. Variables and metacharacters are interpreted in the string (unless you use single quotes). • Concatenation is done using the dot ( . ). “the” . “ “ . “dog” is interpreted as “the dog” • Numbers and strings are the two main types of scalar variable. They are freely interconverted as needed. Thus, “5” and 5 are the same thing. • Non-numerical strings have a numerical value of 0: “the dog” + 3 equals 3 because “the dog” is interpreted as a number due to the + sign. “5” + 3 equals 8, because even though “5” is written as a string, the + sign causes it to be interpreted as a number.
Scalar Variables also: logical comparisons, if, while
Scalar Variables • Scalar variables hold a single value. • Scalar variable names start with $ in Perl. This makes them easy to spot. Names can contain letters, numbers and underscores, but must not start with a number, and they are case-sensitive. • Perl variables are loosely typed: $var can hold a number or a string, and there is no distinction between different types of number. • Variables are declared with the “my” keyword, and values are assigned with “=“. • Variables are visible only within the smallest code block (enclosed by {} ) containing them. • Variables have global scope if declared in the main section of the program, outside any code block. • Using the contents of a variable as the name of another variable, a “symbolic reference”, is considered Very Bad in Perl. • For example, you have $foo = 'snonk', and then want to operate on the value of $snonk. • Binary assignment operators: $dog++, ++$dog, $dog += 3, $dog .= “cat”, etc.
Logical Operations • Perl considers these values false: 0 (zero), “0” (the string zero), “” (the empty string) and undef (undefined, the default value for a declared but undefined variable). Everything else is true. • Comparison between numbers uses different operators than comparison between strings: == vs. eq, != vs. ne, > vs. gt, <= vs. le, etc. • Logical: ! is “not”, && is “and”, || is “or”. The words (not, and, or) also work, but they have a much lower precedence than the symbols. • C’s ternary operator ?: works in Perl too.
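The difference between numeric and string comparison is easy to get wrong, so here is a small sketch. All variable names are illustrative only.

```perl
use strict;
use warnings;

my $num_equal = ("10" == 10.0)    ? "yes" : "no";   # numeric comparison
my $str_equal = ("10" eq "10.0")  ? "yes" : "no";   # string comparison
my $num_order = (9 < 10)          ? "yes" : "no";   # numerically, 9 is smaller
my $str_order = ("9" lt "10")     ? "yes" : "no";   # as strings, "9" sorts after "1"

# 0, "0", "" and undef are false; strings such as "0.0" and "00" are true.
my @truth = map { $_ ? "true" : "false" } (0, "0", "", "0.0", "00");

print "$num_equal $str_equal $num_order $str_order\n";
print "@truth\n";
```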
If Statements • if (logical_expression_in_parentheses) { code set off by curly braces; } elsif (another logical test) { # note the spelling! more code; } else { code; } • Even single statements must be enclosed within {} • There is a backwards logic for single statements: print “yes” if ($var > 17);
While loops • while (logical test) { code block; } • There is also a do-while loop, as in C. • “next” ends the current iteration and returns you to the logical test at the top. “last” breaks you out of the loop altogether.
Arrays and Lists also: for and foreach loops, scalar context, $_
Lists • A list is a set of elements enclosed within parentheses and separated by commas. String elements must be quoted. You can mix numerical and string values. (1,3,5) is a list. So is (1, 3.1416, “duh”). • The empty list is (). • The qw operator (“quote word”) adds commas and quotes at spaces: qw(3 dog day) is equivalent to (“3”, “dog”, “day”).
Arrays • An array is a variable that holds a list. Array names start with @. • Arrays adjust their size automatically: no need to pre-declare an array size. • You can assign lists to arrays, arrays to lists, etc: @arr = (1, 3, “duh”); ($dog, $cat) = @arr; • Note that in the last case, $dog gets 1, $cat gets 3, and “duh” is discarded. • More subtly: (@pets, $dog, $cat) = qw( rover fido spot rex duke fluffy); causes @pets to get all the names, and $dog and $cat to remain undefined. • switching positions: ($last, $first) = ($first, $last); # not in C! • print @arr; runs all the elements together. print “@arr”; separates the elements by a space.
Accessing Array Elements • Arrays are numbered from 0, as in C. • An individual array element is a scalar, so its name starts with $. • Array indexes are given in square brackets. $arr[3] is the 4th element of @arr. • Important: $arr is a completely separate and independent variable from @arr (and its elements such as $arr[0] ). • Negative numbers are used to access array elements from the end: $arr[-1] returns the last element in the array, $arr[-2] returns the next-to-last element, etc. • $#arr gives the index of the last array element. Thus $arr[-1] == $arr[ $#arr ]; Very useful for loops.
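The list-assignment and indexing rules from the last two slides, as a short runnable sketch. The names and values are made up for illustration.

```perl
use strict;
use warnings;

my @arr = (1, 3, "duh");
my ($dog, $cat) = @arr;              # $dog = 1, $cat = 3; "duh" is discarded

my ($first, $last) = ("Ann", "Lee");
($last, $first) = ($first, $last);   # swap without a temporary variable

my $joined   = "@arr";               # elements separated by spaces
my $last_elt = $arr[-1];             # same element as $arr[$#arr]
my $last_idx = $#arr;                # index of the last element

print "$joined | $last_elt | $last_idx | $first $last\n";
```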
Array Operations • You can easily add or remove elements from either end of an array. • “push” and “pop” operate on the end (right side) of an array. push @arr, $var; is standard syntax for adding an element to an array. • “shift” and “unshift” operate on the left end of the array. • my $var = shift @arr; is standard syntax for unloading an array. • Standard C-style “for” loops: for (my $i = 0; $i < 10; $i++) { # note variable declaration do something; } • “foreach” loops, in which each element of the array is substituted into the scalar in turn: foreach my $element ( @arr ) { do something with $element; }
Two Perl Oddities: Context-sensitive variables and $_ • The value of an array changes when used as an array or when used as a scalar, i.e. “in scalar context”. @arr = (1, 3, 5, 7); # array (or list) context print “@arr”; gives “1 3 5 7”; but: $var = @arr; # scalar context = number of elements print “$var”; gives “4” (the number of elements in @arr) • In many cases, if you don’t assign input to a variable, Perl automatically assigns it to the variable “$_”, which can often be used without being written explicitly. foreach (@pets) { print; } foreach (@pets) { print $_, “ “; }
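A small sketch of both oddities: an array in scalar context, and the implicit $_ variable. The array contents are illustrative.

```perl
use strict;
use warnings;

my @pets  = qw(rover fido spot);
my $count = @pets;             # scalar context: number of elements
my $also  = scalar(@pets);     # the same thing, made explicit

my @shouted;
foreach (@pets) {              # each element is placed in $_ in turn
    push @shouted, uc;         # uc with no argument operates on $_
}

print "$count pets: @shouted\n";
```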
Hashes • a “hash” is another fundamental data structure, like scalars and arrays. Hashes are sometimes called “associative arrays”. • Basically, a hash associates a key with a value. A hash is composed of a set of key-value pairs. • A key is a string: any collection of characters, generally enclosed in quotes. Any scalar can be a key, but they are all converted to strings. • A value can be almost anything: the values are just scalar variables. • One hash oddity: neither the keys nor the values is sorted or stored in a useful order. The order you enter hash items is not related to the order with which you retrieve them.
Hash Specifics • The punctuation mark used to denote a hash is % (percent sign). • Hash elements are accessed by enclosing the key in curly braces. For example, the hash %stoplight can be populated as follows: $stoplight{red} = “stop”; $stoplight{yellow} = “caution”; $stoplight{green} = “go”; • Each key can refer to only a single value. You can’t have duplicate keys. If you try, the first value will be lost and only the second will work. • A hash is really a list with alternating keys and values. Thus it is possible to load a hash like: %stoplight = (“red”, “stop”, “yellow”, “caution”, “green”, “go”); • A better way is to use the => operator (“big arrow”), which is really just a synonym for a comma (and it also quotes the keys): %stoplight = (red => “stop”, yellow => “caution”, green => “go” );
Hash Operations • “keys” gives a list of all the keys used in the hash. Here’s a common use: foreach (keys %stoplight) { print "$_ stands for $stoplight{$_}\n"; } • “values” lists all the values. “each” returns one key-value pair (a 2-member list) per call. while ( my ($key, $value) = each %stoplight) { print "$key : $value\n"; } • Removing elements in a hash is done with “delete”: delete $stoplight{red}; • Testing for existence with “exists”: exists $stoplight{red} returns true if that key exists in the hash, and false if it doesn’t.
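The stoplight hash from the slides, put through keys, delete, and exists in one runnable sketch. The keys are sorted here only so that the output order is predictable.

```perl
use strict;
use warnings;

my %stoplight = (red => "stop", yellow => "caution", green => "go");

# Sort the keys, because hash order is not predictable.
my @lines = map { "$_ stands for $stoplight{$_}" } sort keys %stoplight;

delete $stoplight{red};
my $has_red   = exists $stoplight{red}   ? 1 : 0;
my $has_green = exists $stoplight{green} ? 1 : 0;
my $remaining = keys %stoplight;    # scalar context: number of keys

print "$_\n" for @lines;
```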
Subroutines a.k.a. functions Also: running external programs such as BLAST
Subroutines • Subroutines do not need to be pre-declared. They can be defined before, after, or in the middle of the main program. Although not often used, subroutines use & as a punctuation mark. • Subroutines are defined with the keyword “sub” followed by the actual code within curly braces. For example: sub print_qwerty { print "qwerty\n"; } • Subroutines are invoked using their names. Any arguments need to be put inside parentheses following the subroutine name: print_qwerty(); • Subroutines return a value using the “return” keyword. • More than one value can be returned. They are returned as a list. sub print_qwerty3 { print "qwerty\n"; return 5, 17, "uiop"; } ($var1, $var2, $var3) = print_qwerty3();
More on Subroutines • You can pass arguments into a subroutine as a list enclosed in parentheses: print_words(“dog”, “cat”); • The arguments are copied into an array called “@_”, and they can be accessed as elements of that array from within the subroutine. sub print_words { foreach my $word (@_) { print “$word\n”; } } • Note that you aren’t required to specify the number of arguments in advance. • Variables declared in the main body are global in scope, visible from within any subroutine. Variables declared within a subroutine are visible only within that subroutine.
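As a sketch of argument passing and list return together, the hypothetical subroutine min_max below (not from the original slides) accepts any number of arguments through @_ and returns two values.

```perl
use strict;
use warnings;

# Hypothetical example: returns the smallest and largest of its arguments.
sub min_max {
    my @nums = sort { $a <=> $b } @_;   # @_ holds whatever was passed in
    return ($nums[0], $nums[-1]);       # two values, returned as a list
}

my ($min, $max) = min_max(7, 2, 19, 4);
print "min=$min max=$max\n";
```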
File Interactions • The “open” command assigns a file name to a file handle. open INFILE, “my_file.txt”; # opened for reading • best to test for success. Standard syntax: open INFILE, “my_file.txt” or die “Couldn’t open read-file\n”; • Files are read one line at a time, by enclosing the file handle in angle brackets: while (<INFILE>) { print; } # each line goes into $_ by default • the “chomp” command removes the terminal newline character from input lines while (my $line = <INFILE>) { chomp $line; print $line; } • To open a file for writing, use “>” before the file name: open OUTFILE, “>my_file.txt”; • To actually write to this file, use the file handle’s name with “print”: print OUTFILE “something interesting\n”; • Appending is done by using “>>” in front of the file’s name: open APPENDFILE, “>>my_file.txt”; • Files are closed automatically when the program terminates, but sometimes you need to specifically close them: close INFILE; • Command line arguments are passed to the program in the @ARGV array, equivalent to C’s argv.
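A runnable sketch of writing and then reading a file. It departs from the slides in two labeled ways: it uses lexical file handles ($out, $in) with the three-argument form of open, a common alternative to the bareword handles shown above, and it writes to a temporary file from the core File::Temp module so that nothing in your directory is overwritten.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: a throwaway temporary file stands in for my_file.txt.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open(my $out, '>', $path) or die "Couldn't open $path for writing: $!\n";
print $out "first line\n", "second line\n";
close $out;

open(my $in, '<', $path) or die "Couldn't open $path for reading: $!\n";
my @lines;
while (my $line = <$in>) {
    chomp $line;               # remove the trailing newline
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read\n";
```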
Running External Programs with Perl • The most commonly used Perl command for running external programs is “system”. This command executes the program specified by its arguments, then returns control to the next line in your Perl program. • You can also enclose the command in backticks; the program’s output to STDOUT is the return value of this: $output_string = `blastall -p blastn -i my_input_file -d my_database`; • “system” returns the exit status of the process it executed (the program’s own exit code is in the high byte: $status >> 8). If all went well, this is 0. Numbers other than 0 indicate some kind of error. • The simplest way to use “system” is to simply enclose the command line you need in quotes: system( "blastall -p blastn -i my_input_file -d my_database -o my_blast_output.txt" ); • In this form the command line is split at spaces into separate arguments; if it contains shell metacharacters, it is handed to the shell (/bin/sh on Unix) for interpretation. • You can avoid invoking a shell (a somewhat more secure method), by separating out the individual space-delimited segments yourself: system( "blastall", "-p", "blastn", "-i", "my_input_file", "-d", "my_database", "-o", "my_blast_output.txt" );
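A sketch of checking the result of “system” and capturing output with backticks. So that it runs without BLAST installed, it uses the Perl interpreter itself (available in the special variable $^X) as a stand-in for the external program; that substitution is an assumption for illustration only, and the backtick line assumes a Unix-like shell.

```perl
use strict;
use warnings;

# $^X is the path of the running perl, used here in place of blastall.
my @cmd = ($^X, "-e", "exit 3");

my $status    = system(@cmd);      # list form: no shell is involved
my $exit_code = $status >> 8;      # the program's exit code is in the high byte

my $output = `$^X -e "print 42"`;  # backticks capture STDOUT

print "exit code $exit_code, output $output\n";
```

In a real script you would typically die or warn when the exit code is non-zero.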
String Manipulations • Don’t forget: “.” is the concatenation operator. • “split” takes a string and separates it into an array of strings at whatever pattern of characters is indicated as the first argument between slashes: split /,/, “cat,dog,bird”; This expression splits the string at each comma, returning “cat”, “dog”, “bird”. The splitting characters (the comma in this case) are discarded. • Note the comma after the splitting pattern: /,/, . It is necessary! • To split a string into individual characters, use: my @chars = split //, “The dog”; • “join” takes the elements of an array and joins them into a single string, separated by whatever symbol(s) you like. join “:“, “dog”, “cat”, “bird”; # gives “dog:cat:bird”. • “substr” extracts part of a string, based on the start position and length of the desired substring. Note that the first position is 0. $my_substring = substr $string, start_pos, length
More String Functions • “reverse” reverses the string. • When used on an array, it reverses the order of the elements • The transliteration operator tr/// substitutes one character for another. It is invoked with the binding operator =~, which is extensively used with regular expressions. It uses two argument lists separated by slashes. It substitutes every instance of the first list with the corresponding element in the second list. my $sequence = “AAGCTG”; $sequence =~ tr/ACGT/TGCA/; # $sequence is now “TTCGAC” • tr returns the number of characters converted, so it can be used to count them: $num = ($sequence=~ tr/CG// ); returns the number of G’s and C’s. • to reverse-complement a DNA strand: $sequence = reverse $sequence; $sequence =~ tr/ACGT/TGCA/;
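The string functions from the last two slides in one short sketch. The subroutine name reverse_complement is hypothetical; the sequence is the one used above.

```perl
use strict;
use warnings;

sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;        # scalar assignment, so the string is reversed
    $rc =~ tr/ACGT/TGCA/;         # complement each base
    return $rc;
}

my $sequence = "AAGCTG";
my $revcomp  = reverse_complement($sequence);
my $gc_count = ($sequence =~ tr/CG//);     # counts without changing $sequence

my @fields   = split /,/, "cat,dog,bird";
my $rejoined = join ":", @fields;
my $piece    = substr $sequence, 2, 3;     # 3 characters starting at position 2

print "$revcomp $gc_count $rejoined $piece\n";
```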
Regular Expressions • Regular expressions are the main way Perl matches patterns within strings. For example, finding pieces of text within a larger document, or finding a restriction site within a larger sequence. • Note: regular expressions DO NOT work very well for aligning DNA sequences, because they don’t deal well with gaps. • There are 2 main operators that use regular expressions: 1. matching (which returns TRUE if a match is found and FALSE if no match is found). m/regex/ or just /regex/ 2. substitution, which substitutes one pattern of characters for another within a string. s/original_pattern/new_string/ • “split” also uses regex. • Strings are associated with match or substitution operations using the “binding operator” =~ . • Syntax: if ($str =~ /dog/ ) { print "matches"; } # matching $str =~ s/dog/cat/; # substitutes “cat” for “dog” in $str
Pattern Matching • Literal matching: exact match of each character with no gaps or mismatches: $str = "doggie"; # matches /dog/, /do/, /og/, but NOT /dg/ or /dm/ • an “i” after the match pattern makes it case-insensitive: /dog/i matches “Dog”. • Position assertions: ^ at the beginning means that the matched string must be at the beginning of the line; $ at the end means it must be at the end: • “dog” matches /do/, /^do/, and /og$/, but NOT /^og/ or /do$/ • Quantifiers are placed after the character you want to match. • * means 0 or more of the preceding character • + means 1 or more • ? means 0 or 1 • {3} means exactly 3; {3,5} means 3, 4, or 5 • for example: /do*g/ matches “dg” or “dog” or “dooog” /do+g/ matches “dog” but not “dg”
Character Classes • There are several built-in classes: • “.” stands for any single character except newline • note that /.*/ matches anything, including the empty string • \d is a digit (0-9) and \D is any non-digit • \s is a whitespace character: space, tab, or newline; \S is non-whitespace • \w is a word character: a letter, a digit, or underscore; \W is any other character • Your own character classes are enclosed in square brackets: [acf] is any single a, c, or f. • you negate a character class with a ^ first: [^acf] is any single character except a, c, or f. • you can use hyphens to indicate a range (ASCII): [a-z] is any lowercase letter, [a-zA-Z] is any letter
Pattern Memory • To capture the matched pattern, surround it with parentheses. Then, the special variables $1, $2, $3, etc. contain the matched pattern. $str = "The z number is z576890"; $str =~ /is z(\d+)/; print $1; # prints “576890” • the numbered variables are assigned left to right on the basis of the opening (left) parenthesis /(the ((cat) (runs)))/ ; captures: $1 = the cat runs; $2 = cat runs; $3 = cat; $4 = runs. • these variables only exist within the smallest block of code (delimited by { } ) containing the regex. • Matching is “greedy” and not “lazy” by default: "doggg" =~ /(dog+)/ extracts “doggg” not “dog” • a ? after the quantifier converts to lazy matching: "doggg" =~ /(dog+?)/ extracts “dog” not “doggg”
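A sketch of capturing, using the examples from this slide. One addition that is not on the slide: the match is tested before $1 is used, since $1 keeps its old value when a match fails.

```perl
use strict;
use warnings;

my $str = "The z number is z576890";
my $digits;
if ($str =~ /is z(\d+)/) {      # only trust $1 after a successful match
    $digits = $1;
}

# In list context a match returns its captures directly.
my ($greedy) = "doggg" =~ /(dog+)/;
my ($lazy)   = "doggg" =~ /(dog+?)/;
my @groups   = "the cat runs" =~ /(the ((cat) (runs)))/;

print "$digits $greedy $lazy\n";
print join("|", @groups), "\n";
```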
Substitution • Basic syntax: $string =~ s/original_pattern/replacement_string/; • The original pattern is a regular expression and can capture parts of the pattern with parentheses. • The replacement string is just a string, not a regex, although it can contain $1, $2, etc. memory variables $str = "A cat is a nice pet"; $str =~ s/cat/dog/; print $str; # prints “A dog is a nice pet” • Modifiers: by default, only 1 substitution is made on the string. To substitute all instances of the pattern, put a g after the expression. Also, an i after the expression makes it case-insensitive. $str = "A cat is a cat is a CAT"; $str =~ s/cat/dog/; # gives “A dog is a cat is a CAT” $str =~ s/cat/dog/gi; # gives “A dog is a dog is a dog” • You can also use substitution to remove characters. Thus s/[^ACGT]//g finds any character that isn’t A, C, G, or T and replaces it with nothing. • Substitution and assignment: keeps the original string intact and assigns the altered string to a new variable: ($newstr = $oldstr) =~ s/cat/dog/;
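The substitution variants above as a runnable sketch. The copy-then-substitute idiom is used throughout so that the original string stays intact; the DNA string and the name are made-up test data.

```perl
use strict;
use warnings;

my $str = "A cat is a cat is a CAT";
(my $once = $str) =~ s/cat/dog/;       # first match only
(my $all  = $str) =~ s/cat/dog/gi;     # every match, ignoring case

my $dna = "ACG-TNN acgt";
(my $clean = $dna) =~ s/[^ACGT]//g;    # delete everything except A, C, G, T

my $name = "Smith, Jane";
$name =~ s/(\w+), (\w+)/$2 $1/;        # captures reused in the replacement

print "$once\n$all\n$clean\n$name\n";
```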
References and Data Structures (including multidimensional arrays and hashes, and Sorting)
References • In Perl, the backslash is used to create a reference (i.e. a pointer): my $var = 5; my $var_ref = \$var; • To dereference a simple reference, put it inside curly braces with another $ in front of it. Thus, ${$var_ref} is the same as $var, that is, the value “5”. • In many cases you can leave the curly braces out: $$var_ref works just as well as ${$var_ref}. But, in complicated expressions this can cause havoc due to precedence problems. • To dereference array elements, the arrow notation is preferred: my $arr_ref = [1, 3, 5, "duh"]; print $arr_ref->[3]; # prints “duh” (for a named array, use \@arr; note that a backslash in front of a parenthesized list, \(1, 3, 5, "duh"), does NOT give an array reference but a list of references to the individual elements) • ${$arr_ref}[3] also works (de-referencing $arr_ref with {} ) • use {} for hash references instead of [] • References to arrays and hashes are the standard way of passing these items into and out of subroutines, to avoid copying them.
Multidimensional Arrays • Square brackets are used to generate a reference to an anonymous array, which is then assigned to a scalar variable. $arr_ref = [1, 3, 5, 7]; $arr2_ref = [@array]; • Similarly, curly braces generate references to anonymous hashes. • A two-dimensional array consists of an array of references to a set of anonymous arrays. @array_2d = ([1,2,4], [7,8,9] , [5,6,3]); • Dereferencing is as in C: print $array_2d[0][1]; # prints “2” • Multidimensional arrays, or mixtures of arrays and hashes, are generated similarly. • Autovivification: you need to declare the top-level array or hash, but all lower levels come into existence automatically. my $hash_ref; $hash_ref->{dog}[3]{color} = “brown”; # array elements 0, 1, and 2 are “undef”
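A sketch that ties the reference and multidimensional-array slides together. The subroutine name total is hypothetical; it shows an array being passed by reference rather than copied.

```perl
use strict;
use warnings;

my @array   = (1, 3, 5, "duh");
my $arr_ref = \@array;                   # reference to a named array
my $fourth  = $arr_ref->[3];             # arrow notation
my $count   = @{$arr_ref};               # whole array in scalar context

my @array_2d = ([1, 2, 4], [7, 8, 9], [5, 6, 3]);
my $cell     = $array_2d[0][1];

# Hypothetical subroutine: sums an array passed by reference.
sub total {
    my ($nums) = @_;
    my $sum = 0;
    $sum += $_ for @{$nums};
    return $sum;
}
my $row_sum = total($array_2d[1]);       # each row is already a reference

my $hash_ref;
$hash_ref->{dog}[3]{color} = "brown";    # lower levels autovivify
my $slots = @{ $hash_ref->{dog} };       # indexes 0..3 now exist

print "$fourth $count $cell $row_sum $slots\n";
```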
Sorting • Perl has a built-in sort function (but of course if you really enjoy writing sort routines, please feel free to indulge yourself). • By default, “sort” goes in ASCII order: my @sorted = sort @array; my @sorted_keys = sort keys %hash; • For numerical sort: my @sorted = sort {$a <=> $b} @array; my @sorted_indexes = sort {$a <=> $b} 0 .. $#array; • Each pair of elements in @array is substituted into the special $a and $b variables. Use these names only! • largest-to-smallest (reverse) numerical sort: my @sorted = sort {$b <=> $a} @array; • A sorting function is written within the curly braces. It needs to return a negative number if $a should come before $b, 0 if they are equal, and a positive number if $a should come after $b. • Perl uses the built-in <=> operator for numerical comparisons, and “cmp” for ASCII comparisons. • A multi-level sort is done using “||” (or), since the comparison returns 0 (false) if the top level items are equal. @last_names = qw(Coburn Smith Jones Jones Smith); @first_names = qw(Fred Harold Mary Jane Hortense); @sorted_indexes = sort {$last_names[$a] cmp $last_names[$b] || $first_names[$a] cmp $first_names[$b] } 0 .. $#last_names;
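A runnable version of the sorting examples, including the multi-level sort by last name and then first name. The number list is made up to show how ASCII order differs from numerical order.

```perl
use strict;
use warnings;

my @nums    = (10, 9, 100, 1);
my @ascii   = sort @nums;                  # compared as strings
my @numeric = sort { $a <=> $b } @nums;    # compared as numbers
my @reverse = sort { $b <=> $a } @nums;    # largest first

my @last_names  = qw(Coburn Smith Jones Jones Smith);
my @first_names = qw(Fred Harold Mary Jane Hortense);

# Sort the indexes, not the names, so both arrays stay in step.
my @sorted_indexes = sort {
    $last_names[$a] cmp $last_names[$b]
        || $first_names[$a] cmp $first_names[$b]
} 0 .. $#last_names;

my @people = map { "$last_names[$_], $first_names[$_]" } @sorted_indexes;
print "$_\n" for @people;
```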
Modules • Commonly used subroutines are often put into a separate file, a module. • a module is just a text file, not made executable, with no invocation of Perl at the top. • modules are given a “.pm” extension • modules must return a true value, so nearly all of them have “1;” as their last line. • the content of a module is a “package”, which is given the same name as the module. The package is set off by curly braces, and it contains your subroutines. • for example, MyModule.pm looks like: { package MyModule; sub my_sub1 { whatever; } sub my_sub2 { whatever else; } 1; } • Module files need to be located in one of the directories listed in the built-in @INC array. • To put your own directory in this array: use lib "path_to_your_lib"; (a plain unshift @INC, "path_to_your_lib"; runs too late, because “use” statements are processed at compile time, unless it is wrapped in a BEGIN block). • modules can also be in the same directory as the main program if the current directory (.) is listed in @INC; as of Perl 5.26 it no longer is by default.
More Modules • The subroutines and variables in a module are in a separate namespace from the main program (whose namespace is called ”main”). • To use them, you need to have a line like “use MyModule;” in your program (the module name without the .pm). • also, you need to provide the fully-qualified name of the variable or subroutine, which is the module name followed by 2 colons :: $MyModule::var1 or &MyModule::my_sub2(); • Some modules allow you to import specific subroutines with a construction like: use MyModule qw(mysub1 mysub2); • In this case, the module name does not have to be used when invoking those subroutines. • Details of exporting and importing are found with the standard Perl Exporter module: see the documentation for that. • One of the joys of Perl is that people share a lot of useful code in the form of modules. The central repository is CPAN (www.cpan.org). Before writing your own module to do something obvious, look there first. • caveat emptor: some modules are very high quality and others aren’t • some require other Perl modules, or compiled C libraries, to be installed first • In general, after downloading a module, installation is done by these 3 commands at the Unix prompt: perl Makefile.PL make make install
Object-Oriented Perl • A class is defined by a package. Class methods are subroutines in that package. • CRITICAL: Arrow notation: Class->method(par1, par2) is interpreted in Perl as: Class::method(“Class”, par1, par2). That is, the class name becomes the first member of the @_ array passed to the method (subroutine). • For example, in the main program: Cow->sound(); in Cow.pm: {package Cow; sub sound { my $class = shift; print “the $class says Moooo”; } } # end of Cow • Inheritance is done with the @ISA array (“is-a”), which must be declared in each package with the “our” keyword. @ISA lists all the superclasses for this class. {package Cow; our @ISA = qw(Animal MethaneProducer); ... }
Instances • An instance of a class is defined by a reference to an anonymous hash. • The hash reference gets associated with its class using the keyword “bless”. my $elsie = {}; bless $elsie, “Cow”; • In Perl, most constructors are called “new”. An example: {package Cow; sub new { my $class = shift; my $self = {}; # anonymous hash ref bless $self, $class; } } # end of Cow my $elsie = Cow->new; # invocation in main program, creating a new instance • Default properties of the instances are put into the anonymous hash in the “new” method: my $self = { legs => 4, color => “brown” , sound => “moo” }; • my $bossie = Cow->new(“color” => “white”); # override the default color
Instance Methods • Accessor methods (also called “set or get”). Note that by default, instance data members are NOT private. Access using accessor methods is a matter of politeness, not force. sub color { my $self = shift; if (@_) { # arguments exist, so it’s a set $self->{color} = shift; } else { # no arguments, so it’s a get return $self->{color}; } } • Destructors: Perl uses an automatic garbage collection system. When the last reference to an object is removed, the object is automatically destroyed. Thus class modules rarely contain explicit destructors. • Operator overloading: “overload” is a standard pragma. To overload an operator, define the altered method as a subroutine within the package, and put in a line like: use overload '+' => \&my_add; (note the backslash: the value must be a reference to the subroutine, not a call to it).
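A compact sketch that combines the constructor, accessor, and overloading ideas from the last three slides in a single file. Several details are assumptions for illustration: the constructor accepts key-value overrides (the slides show the call but not the code), the accessor returns the color on both set and get, and “+” is given the arbitrary meaning “total number of legs”.

```perl
use strict;
use warnings;

{
    package Cow;

    use overload '+' => \&my_add;    # a reference to the sub, hence the backslash

    sub new {
        my ($class, %args) = @_;
        # Defaults first; anything passed in overrides them.
        my $self = { legs => 4, color => "brown", sound => "moo", %args };
        return bless $self, $class;
    }

    sub color {
        my $self = shift;
        $self->{color} = shift if @_;    # with an argument, acts as a setter
        return $self->{color};
    }

    # Hypothetical meaning for "+": add up legs.
    sub my_add {
        my ($x, $y) = @_;
        my $other = ref($y) ? $y->{legs} : $y;
        return $x->{legs} + $other;
    }
}

my $elsie  = Cow->new;
my $bossie = Cow->new(color => "white");   # override the default color
$elsie->color("black");
my $legs = $elsie + $bossie;

print ref($elsie), " ", $elsie->color, " ", $bossie->color, " $legs\n";
```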