Introduction to Perl

Introduction to Perl Part II

Associative arrays or Hashes • Like arrays, but instead of numbers as indices can use strings • @array = (‘john’, ‘steve’, ‘aaron’, ‘max’, ‘juan’, ‘sue’) • %hash = ( ‘apple’ => 12, ‘pear’ => 3, ‘cherry’ =>30, ‘lemon’ => 2, ‘peach’ => 6, ‘kiwi’ => 3); Array Hash apple 12 pear 3 0 1 2 3 4 5 cherry 30 lemon 2 ‘john’ ‘steve’ ‘aaron’ ‘max’ ‘juan’ ‘sue’ peach 6 kiwi 3

Using hashes • { } operator • Set a value • $hash{‘cherry’} = 10; • Access a value • print $hash{‘cherry’}, “\n”; • Remove an entry • delete $hash{‘cherry’};

Get the Keys • keys function will return a list of the hash keys • @keys = keys %fruit; • for my $key ( keys %fruit ) { print “$key => $hash{$key}\n”;} • Would be ‘apple’, ‘pear’, ... • Order of keys is NOT guaranteed!

Get just the values • @values = values %hash; • for my $val ( @values ) { print “val is $val\n”;}

Iterate through a set • Order is not guaranteed! while( my ($key,$value) = each %hash){ print “$key => $value\n”;}

Subroutines • Set of code that can be reused • Can also be referred to as procedures and functions

Defining a subroutine • sub routine_name { } • Calling the routine • routine_name; • &routine_name; (& is optional)

Passing data to a subroutine • Pass in a list of data • &dosomething($var1,$var2); • sub dosomething { my ($v1,$v2) = @_;}sub do2 { my $v1 = shift @_; my $v2 = shift;}

Returning data from a subroutine • The last line of the routine set the return valuesub dothis { my $c = 10 + 20;}print dothis(), “\n”; • Can also use return specify return value and/or leave routine early

Write subroutine which returns true if codon is a stop codon (for standard genetic code).-1 on error, 1 on true, 0 on false sub is_stopcodon { my $val = shift @_; if( length($val) != 3 ) { return -1; } elsif( $val eq ‘TAA’ || $val eq ‘TAG’ || $val eq ‘TGA’ ) { return 1; } else { return 0; }

Context • array versus scalar context • my $length = @lst;my ($first) = @lst; • Want array used to report context subroutines are called in • Can force scalar context with scalarmy $len = scalar @lst;

subroutine context sub dostuff { if( wantarray ) { print “array/list context\n”; } else { print “scalar context\n”; } } dostuff(); # scalar my @a = dostuff(); # array my %h = dostuff(); # array my $s = dostuff(); # scalar

Why do you care about context? sub dostuff { my @r = (10,20); return @r; } my @a = dostuff(); # array my %h = dostuff(); # array my $s = dostuff(); # scalar print “@a\n”; # 10 20 print join(“ “, keys %h),”\n”; # 10 print “$s\n”; # 2

References • Are “pointers” the data object instead of object itsself • Allows us to have a shorthand to refer to something and pass it around • Must “dereference” something to get its actual value, the “reference” is just a location in memory

Reference Operators • \ in front of variable to get its memory location • my $ptr = \@vals; • [ ] for arrays, { } for hashes • Can assign a pointer directly • my $ptr = [ (‘owlmonkey’, ‘lemur’)];my $hashptr = { ‘chrom’ => ‘III’, ‘start’ => 23};

Dereferencing • Need to cast reference back to datatype • my @list = @$ptr; • my %hash = %$hashref; • Can also use ‘{ }’ to clarify • my @list = @{$ptr}; • my %hash = %{$hashref};

Really they are not so hard... • my @list = (‘fugu’, ‘human’, ‘worm’, ‘fly’); • my $list_ref = \@list; • my $list_ref_copy = [@list]; • for my $item ( @$list_ref ) { print “$item\n”;}

Why use references? • Simplify argument passing to subroutines • Allows updating of data without making multiple copies • What if we wanted to pass in 2 arrays to a subroutine? • sub func { my (@v1,@v2) = @_; } • How do we know when one stops and another starts?

Why use references? • Passing in two arrays to intermix. • sub func { my ($v1,$v2) = @_; my @mixed; while( @$v1 || @$v2 ) { push @mixed, shift @$v1 if @$v1; push @mixed, shift @$v2 if @$v2; } return \@mixed;}

References also allow Arrays of Arrays • my @lst;push @lst, [‘milk’, ‘butter’, ‘cheese’];push @lst, [‘wine’, ‘sherry’, ‘port’];push @lst, [‘bread’, ‘bagels’, ‘croissants’]; • my @matrix = [ [1, 0, 0], [0, 1, 0], [0, 0, 1] ];

Hashes of arrays • $hash{‘dogs’} = [‘beagle’, ‘shepherd’, ‘lab’];$hash{‘cats’} = [‘calico’, ‘tabby’, ‘siamese’];$hash{‘fish’} = [‘gold’,’beta’,’tuna’]; • for my $key (keys %hash ) { print “$key => “, join(“\t”, @{$hash{$key}}), “\n”;}

More matrix use my @matrix; open(IN, $file) || die $!; # read in the matrix while(<IN>) { push @matrix, [split]; } # data looks like # GENENAME EXPVALUE STATUS # sort by 2nd column for my $row ( sort { $a->[1] <=> $b->[1] } @matrix ) { print join(“\t”, @$row), “\n”;}

Funny operators • my @bases = qw(C A G T) • my $msg = <<EOFThis is the message I wanted to tell you aboutEOF;

Regular Expressions • Part of “amazing power” of Perl • Allow matching of patterns • Syntax can be tricky • Worth the effort to learn!

A simple regexp • if( $fruit eq ‘apple’ || $fruit eq ‘Apple’ || $fruit eq ‘pear’) { print “got a fruit $fruit\n”;} • if( $fruit =~ /[Aa]pple|pear/ ){ print “matched fruit $fruit\n”;}

Regular Expression syntax • use the =~ operator to match • if( $var =~ /pattern/ ) {} - scalar context • my ($a,$b) = ( $var =~ /(\S+)\s+(\S+)/ ); • if( $var !~ m// ) { } - true if pattern doesn’t • m/REGEXPHERE/ - match • s/REGEXP/REPLACE/ - substitute • tr/VALUES/NEWVALUES/ - translate

m// operator (match) • Search a string for a pattern match • If no string is specified, will match $_ • Pattern can contain variables which will be interpolated (and pattern recompiled)while( <DATA> ) { if( /A$num/ ) { $num++ }}while( <DATA> ) { if( /A$num/o ) { $num++ }}

Pattern extras • m// -if specify m, can replace / with anything e.g. m##, m[], m!! • /i - case insensitive • /g - global match (more than one) • /x - extended regexps (allows comments and whitespace) • /o - compile regexp once

Shortcuts • \s - whitespace (tab,space,newline, etc) • \S - NOT whitespace • \d - numerics ([0-9]) • \D - NOT numerics • \t, \n - tab, newline • . - anything

Regexp Operators • + - 1 -> many (match 1,2,3,4,... instances )/a+/ will match ‘a’, ‘aa’, ‘aaaaa’ • * - 0 -> many • ? - 0 or 1 • {N}, {M,N} - match exactly N, or M to N • [], [^] - anything in the brackets, anything but what is in the brackets

Saving what you matched • Things in parentheses can be retrieved via variables $1, $2, $3, etc for 1st,2nd,3rd matches • if( /(\S+)\s+([\d\.\+\-]+)/) { print “$1 --> $2\n”;} • my ($name,$score) = ($var =~ /(\S+)\s+([\d\.\+\-]+)/);

Simple Regexp my $line = “aardvark”;if( $line =~ /aa/ ) { print “has double a\n” }if( $line =~ /(a{2})/ ) { print “has double a\n” }if( $line =~ /(a+)/ ) { print “has 1 or more a\n” }

Matching gene names # File contains lots of gene names # YFL001C YAR102W - yeast ORF names# let-1, unc-7 - worm names http://biosci.umn.edu/CGC/Nomenclature/nomenguid.htm # ENSG000000101 - human Ensembl gene names while(<IN>) { if( /^(Y([A-P])(R|L)(\d{3})(W|C)(\-\w)?)/ ) { printf “yeast gene %s, chrom %d,%s arm, %d %s strand\n”, $1, (ord($2)-ord(‘A’))+1, $3, $4; } elsif( /^(ENSG\d+)/ ) { print “human gene $1\n” } elsif( /^(\w{3,4}\-\d+)/ ) { print “worm gene $1\n”; } }

Putting it all together • A parser for output from a gene prediction program

GlimmerM (Version 3.0) Sequence name: BAC1Contig11 Sequence length: 31797 bp Predicted genes/exons Gene Exon Strand Exon Exon Range Exon # # Type Length 1 1 + Initial 13907 13985 79 1 2 + Internal 14117 14594 478 1 3 + Internal 14635 14665 31 1 4 + Internal 14746 15463 718 1 5 + Terminal 15497 15606 110 2 1 + Initial 20662 21143 482 2 2 + Internal 21190 21618 429 2 3 + Terminal 21624 21990 367 3 1 - Single 25351 25485 135 4 1 + Initial 27744 27804 61 4 2 + Internal 27858 27952 95 4 3 + Internal 28091 28576 486 4 4 + Internal 28636 28647 12 4 5 + Internal 28746 28792 47 4 6 + Terminal 28852 28954 103 5 3 - Terminal 29953 30037 85 5 2 - Internal 30152 30235 84 5 1 - Initial 30302 30318 17

Putting it together • while(<>) { if(/^(Glimmer\S*)\s+$(.+)$/ { $method = $1; $version = $2; } elsif( /^(Predicted genes)|(Gene)|(\s+\#)/ || /^\s+$/ ) { next } elsif( # glimmer 3.0 output /^\s+(\d+)\s+ # gene num (\d+)\s+ # exon num ([\+\-])\s+ # strand (\S+)\s+ # exon type (\d+)\s+(\d+) # exon start, end \s+(\d+) # exon length /ox ) {my ($genenum,$exonnum,$strand,$type,$start,$end, $len) = ( $1,$2,$3,$4,$5,$6,$7); }}

s/// operator (substitute) • Same as m// but will allow you to substitute whatever is matched in first section with value in the second section • $sport =~ s/soccer/football/ • $addto =~ s/(Gene)/$1-$genenum/;

The tr/// operator (translate) • Match and replace what is in the first section, in order, with what is in the second. • lowercase - tr/[A-Z]/[a-z]/ • shift cipher - tr/[A-Z]/[B-ZA]/ • revcom - $dna =~ tr/[ACGT]/[TGCA]/; $dna = reverse($dna);

(aside) DNA ambiguity chars • aMino - {A,C}, Keto - {G,T} • puRines - {A,G}, prYmidines - {C,T} • Strong - {G,C}, Weak - {A,T} • H (Not G)- {ACT}, B (Not A), V (Not T), D(Not C) • $str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;

Introduction to Perl