Regular Expressions

Regular Expressions
• Provide a way of writing a compact description of a set of strings
• Sort of like wildcards
• Single character patterns
  • A single character matches itself
  • A “.” matches any single character except newline
  • [characters] matches any one of the listed characters
  • A ^ as the first character inside the brackets negates the class: [^characters] matches any one character that is not listed

Examples
• G
• [0123456789]
• [0-9]
• [a-zA-Z]
• [^0-9]
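A minimal sketch of the example classes in action. The sample strings and the helper name count_matches are made up for illustration; they are not from the slides. The last check shows why a range such as A-z is a trap: it also covers the punctuation characters that sit between Z and a in ASCII.

```perl
use strict;
use warnings;

# Count how many of the given strings contain a match for the pattern.
# (count_matches is a hypothetical helper, not part of the course code.)
sub count_matches {
    my ($pattern, @strings) = @_;
    return scalar grep { /$pattern/ } @strings;
}

my @samples = ('G', 'g7', '42', 'x_y', '');

my $has_digit     = count_matches(qr/[0-9]/,    @samples);  # 'g7', '42'
my $has_letter    = count_matches(qr/[a-zA-Z]/, @samples);  # 'G', 'g7', 'x_y'
my $has_non_digit = count_matches(qr/[^0-9]/,   @samples);  # 'G', 'g7', 'x_y'

print "digit: $has_digit, letter: $has_letter, non-digit: $has_non_digit\n";

# The range A-z spans more than letters: it includes [ \ ] ^ _ and `
my $underscore_slips_in = ('_' =~ /[A-z]/) ? 1 : 0;
print "underscore matches [A-z]: $underscore_slips_in\n";
```

Note that [^0-9] still needs a character to match, so the empty string matches none of the three classes.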

Multipliers
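The table from this slide did not survive in the text, so the following is only a sketch of the standard Perl multipliers (quantifiers): *, +, ?, {n} and {n,m}. The test strings are invented for illustration.

```perl
use strict;
use warnings;

my @results;
push @results, ('aaa'   =~ /^a*$/      ? 1 : 0);  # *     : zero or more
push @results, (''      =~ /^a+$/      ? 1 : 0);  # +     : one or more, so the empty string fails
push @results, ('color' =~ /^colou?r$/ ? 1 : 0);  # ?     : zero or one
push @results, ('4653'  =~ /^\d{4}$/   ? 1 : 0);  # {4}   : exactly four
push @results, ('12345' =~ /^\d{2,4}$/ ? 1 : 0);  # {2,4} : two to four, so five digits fail

print "@results\n";
```

Each pattern is anchored with ^ and $ so that the multiplier has to account for the whole string; without the anchors, /a*/ would succeed on any string at all.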

Character Class Abbreviations
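The body of this slide is also missing from the text. As a hedged sketch, these are the standard Perl abbreviations: \d (digit), \w (word character), \s (whitespace), and their negations \D, \W, \S. The sample line is a shortened, assumed roster entry.

```perl
use strict;
use warnings;

my $line = 'XXXX, ROBERT 4653 N';

my ($id)   = $line =~ /(\d+)/;        # first run of digits
my @words  = $line =~ /(\w+)/g;       # every run of word characters
my $spaces = () = $line =~ /\s/g;     # count of whitespace characters

(my $digits_only = $line) =~ s/\D//g; # delete everything that is not a digit

print "id=$id words=", scalar(@words), " spaces=$spaces digits=$digits_only\n";
```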

Alternation
• Sometimes we would like to match different possible words or character strings
• This is accomplished by using ‘|’
• To match dog or cat
  • dog|cat
• At each character position, perl first tries to match the first alternative, dog. If dog doesn't match, perl tries the next alternative, cat. If cat doesn't match either, the match fails at that position and perl moves to the next position in the string.
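A small sketch of the position-by-position behaviour described above. The helper name first_pet and the sample sentences are invented for illustration. Because every alternative is tried at one position before the engine advances, the leftmost match wins even when it comes from the second alternative.

```perl
use strict;
use warnings;

# Return the first of "dog" or "cat" found in the text, or undef if neither occurs.
# (first_pet is a hypothetical helper name.)
sub first_pet {
    my ($text) = @_;
    return $text =~ /(dog|cat)/ ? $1 : undef;
}

my $leftmost = first_pet('my cat chased the dog');  # "cat" occurs earlier in the string
my $embedded = first_pet('hotdog stand');           # matches inside a longer word
my $neither  = first_pet('bird');                   # no alternative matches anywhere

print "$leftmost $embedded ", (defined $neither ? $neither : 'no match'), "\n";
```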

Your Turn!!!
• To check your understanding of regular expressions…
• There is redundancy in the genetic code
  • GCU, GCC, GCA, and GCG all encode Alanine
• Write a series of regular expressions that could be used to match a codon to the amino acid it encodes
• For example
  • (GC.) would match all the codons that encode Alanine

Fill In The Blanks

The Answers
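The answer table itself is not present in this text. The sketch below is not the original slide; it covers only a handful of amino acids from the standard genetic code, using the RNA alphabet from the exercise, to show the shape such a solution might take. The names @codon_patterns and amino_acid_for are assumptions.

```perl
use strict;
use warnings;

# Partial list only: each entry pairs a pattern with the amino acid it encodes.
my @codon_patterns = (
    [ qr/^GC.$/,              'Alanine'   ],
    [ qr/^GG.$/,              'Glycine'   ],
    [ qr/^CC.$/,              'Proline'   ],
    [ qr/^GU.$/,              'Valine'    ],
    [ qr/^AC.$/,              'Threonine' ],
    [ qr/^(?:UC.|AG[UC])$/,   'Serine'    ],
    [ qr/^(?:CU.|UU[AG])$/,   'Leucine'   ],
    [ qr/^(?:CG.|AG[AG])$/,   'Arginine'  ],
);

# Return the amino acid name for a codon, or '?' if no pattern in the
# partial list above matches it.
sub amino_acid_for {
    my ($codon) = @_;
    for my $entry (@codon_patterns) {
        my ($pattern, $name) = @$entry;
        return $name if $codon =~ $pattern;
    }
    return '?';
}

print join(', ', map { amino_acid_for($_) } qw(GCA AGU AGA UUA AUG)), "\n";
```

Amino acids whose codons do not share a two-letter prefix, such as Serine, need alternation or a character class in the third position.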

The Regular Expression Engine
• Think of Perl using a “railway” diagram of connected states
• Perl moves through the diagram by matching characters
• If the engine reaches the final state, it has matched the input string

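[Figure: railway diagram for the pattern abc, with the states Start, ‘a’, ‘b’, and Match joined by the transitions a, b, and c, traced against the input string 12ababc]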
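To see where the engine ends up on the diagram's input, the sketch below matches abc against 12ababc and reads the match offsets from Perl's built-in @- and @+ arrays. The attempt that starts at the first "ab" fails when the next character is "a" rather than "c", so the engine restarts further along.

```perl
use strict;
use warnings;

my $input = '12ababc';
my ($start, $end);

if ($input =~ /abc/) {
    $start = $-[0];   # offset where the match begins
    $end   = $+[0];   # offset just past the end of the match
    printf "matched '%s' at offsets %d..%d\n",
        substr($input, $start, $end - $start), $start, $end;
}
```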

My Problem
    XXXX, ROBERT 4653 N VCSG-4 rma9999
    XXXXXX, ADAM 3976 N VCSG-4 716-555-4281 alb9999
    XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999
    XXXXXXX, JOHN 1906 N VCSG-4 716-555-4780
    XXXX, DERRICK 6432 N VCSG-2 716-555-3161 dxc9999
    XXXXXXXXX, JOHN 5034 N VCSG-2 716-555-3894 jak9999
    XXX, JASON 9020 N VCSG-2 716-555-3145 jsl9999
    XXXXXXX, SARAH 7610 N VCSG-2 716-555-3147 sem9999
    XXXXXXXX, CHRISTOPHER 6309 N VCSG-2 716-555-3427 cco9999
    XXXXXXX, MICHAEL 8195 N VCSG-2 716-555-3166 mpp9999
    XXXXXX, SHAUN 9925 N VCSG-2 716-555-3145 sls9999
    XXXXXX, WILLIAM 2568 N VCSG-2 716-555-3144 wjw9999
    XXXXXX, PATRICK 2335 N EECC-2 716-555-3144 psw9999

Roster to CSV
    XXXXXXX, EDWARD 4637 N VCSG-2 716-555-4780 esb9999
• ([^,]+) matches 1 or more non-comma characters
• (\S+) matches 1 or more non-whitespace characters
• (\d{4}) matches 4 digits
• (\S*) matches 0 or more non-whitespace characters (the fields may not be in the input)
• .* matches anything!!

    while (<>) {
        ($last,$first,$id,$ntid,$gradeType,$program,$phone,$email) =
            /([^,]+), (\S+) (\d{4}) (\S*) (\S*) (\S+) (\S*) (\S*).*/;
        print "\"$last,$first\",$id,$program,$email\@cs.rit.edu\n";
    }
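The sketch below wraps the same pattern in a subroutine so it can be tried on individual lines. It rests on an assumption that cannot be confirmed from the text: that the original roster separates every field with exactly one space, so an empty field shows up as two consecutive spaces (the whitespace in the sample lines appears to have been collapsed). The name roster_to_csv and the spacing of the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Convert one roster line to a CSV row; returns nothing if the line
# does not fit the pattern. Assumes single-space field separators.
sub roster_to_csv {
    my ($line) = @_;
    my ($last, $first, $id, $ntid, $grade_type, $program, $phone, $email) =
        $line =~ /([^,]+), (\S+) (\d{4}) (\S*) (\S*) (\S+) (\S*) (\S*)/
        or return;
    return qq{"$last,$first",$id,$program,$email\@cs.rit.edu};
}

# Assumed spacing: the empty fourth field leaves a double space after the id,
# and the missing phone number leaves a double space before the user name.
my $full     = roster_to_csv('XXXXXXX, EDWARD 4637  N VCSG-2 716-555-4780 esb9999');
my $no_phone = roster_to_csv('XXXX, ROBERT 4653  N VCSG-4  rma9999');

print "$full\n$no_phone\n";
```

Using (\S*) for the optional fields is what lets a line with a missing phone number still line up with the eight captures.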

The Result
    "XXXX,ROBERT",4653,VCSG-4,rma9999@cs.rit.edu
    "XXXXXX,ADAM",3976,VCSG-4,alb9999@cs.rit.edu
    "XXXXXXX,EDWARD",4637,VCSG-2,esb9999@cs.rit.edu
    "XXXXXXX,JOHN",1906,VCSG-4,@cs.rit.edu
    "XXXX,DERRICK",6432,VCSG-2,dxc9999@cs.rit.edu
    "XXXXXXXXX,JOHN",5034,VCSG-2,jak9999@cs.rit.edu
    "XXX,JASON",9020,VCSG-2,jsl9999@cs.rit.edu
    "XXXXXXX,SARAH",7610,VCSG-2,sem9999@cs.rit.edu
    "XXXXXXXX,CHRISTOPHER",6309,VCSG-2,cco9999@cs.rit.edu
    "XXXXXXX,MICHAEL",8195,VCSG-2,mpp9999@cs.rit.edu
    "XXXXXX,SHAUN",9925,VCSG-2,sls9999@cs.rit.edu
    "XXXXXX,WILLIAM",2568,VCSG-2,wjw9999@cs.rit.edu
    "XXXXXX,PATRICK",2335,EECC-2,psw9999@cs.rit.edu

What Can We Do?
• /pattern/
• m/pattern/
  • Find an occurrence of pattern
• s/pattern/replacement/
  • Replace an occurrence of pattern with replacement
• All of these work on $_
• Both m// and s/// can be followed by a trailing ‘g’, which says to do the operation globally (on every occurrence, not just the first)
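A short sketch of these operators working on $_. The sample sentence is made up. A for loop over a single variable is used here only as a convenient way to make $_ refer to that variable.

```perl
use strict;
use warnings;

my ($found, $count);
my $text = 'the cat sat on the mat';
for ($text) {                  # inside the loop, $_ is $text
    $found = /cat/ ? 1 : 0;    # same as m/cat/
    $count = () = /at/g;       # with g, every occurrence is found
}

my ($once, $all) = ($text, $text);
for ($once) { s/at/og/;  }     # replaces the first occurrence only
for ($all)  { s/at/og/g; }     # replaces every occurrence

print "$found $count\n$once\n$all\n";
```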

split/join
• split() can be used to break a string into fields
    $line = "merlin::118:10:Randal:/home/meryln:/usr/bin/perl";
    @fields = split(/:/,$line);
• join() can be used to glue them back together
    $outline = join(":",@fields);
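Run as a complete script, the example above can be checked as follows. One detail worth noticing is that the empty field between the two adjacent colons is kept, so the round trip through join() reproduces the original string.

```perl
use strict;
use warnings;

my $line   = 'merlin::118:10:Randal:/home/meryln:/usr/bin/perl';
my @fields = split /:/, $line;       # the empty second field is preserved
my $count  = @fields;                # number of fields

my $outline = join ':', @fields;     # glue the fields back together

print "$count fields, second field is '", $fields[1], "'\n";
print "round trip ok\n" if $outline eq $line;
```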

=~
• If you want to apply the matching operators to something other than $_, use =~
    $line =~ /foo/;
    $line =~ /^ACCESSION/;
    $line =~ s/ACCESSION(\s)*//;

accession.perl
    open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)";
    while ( !eof( INFILE ) ) {
        $line = <INFILE>;
        if ( $line =~ /^ACCESSION/ ) {
            $line =~ s/ACCESSION(\s)*//;    # strip the keyword and the spaces after it
            print $line,"\n";
        }
    }
    close INFILE;

Your Turn!!!
• Modify the script accession.perl so that in addition to the accession number it also prints out the locus and the organism
• The output from your program should look like this
    Accession: AF165912
    Organism: Arabidopsis thaliana
    Locus: AF165912 5485 bp DNA linear PLN 29-JUL-1999
• The information obtained from the file must be printed in the order specified

accession.perl
    open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)";
    while ( !eof( INFILE ) ) {
        $line = <INFILE>;
        if ( $line =~ /^ACCESSION/ ) {
            $accession = $line;
            $accession =~ s/ACCESSION(\s)*//;
        } elsif ( $line =~ /^SOURCE/ ) {
            $organism = $line;
            $organism =~ s/SOURCE(\s)*//;
        } elsif ( $line =~ /^LOCUS/ ) {
            $locus = $line;
            $locus =~ s/LOCUS(\s)*//;
        }
    }
    # Each saved line still ends with its newline, so none is added here
    print "Accession: ",$accession;
    print "Organism: ",$organism;
    print "Locus: ",$locus;
    close INFILE;

sequence.perl
    open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)";
    $in_sequence = 0;               # 0 is false
    while ( !eof( INFILE ) ) {
        $line = <INFILE>;
        if ( $line =~ /^\/\/\n/ ) {
            $in_sequence = 0;       # "//" ends the record, so stop printing
        } elsif ( $line =~ /^ORIGIN/ ) {
            $in_sequence = 1;       # non-zero value is true
        } elsif ( $in_sequence ) {
            print $line,"\n";
        }
    }
    close INFILE;

Your Turn!!!
• Modify the script sequence.perl so that the sequence data is placed into a scalar variable named $sequence_data
• All of the spaces, newlines, and line numbers should be removed from the sequence data
• Write your program so that it prints $sequence_data to verify it works correctly
• Testing hint
  • Modify the Genbank record so that it has many fewer sequence lines

sequence.perl
    open INFILE,"<$ARGV[0]" or die "Unable to open Genbank file ($!)";
    $in_sequence = 0;               # 0 is false
    $sequence_data = "";
    while ( !eof( INFILE ) ) {
        $line = <INFILE>;
        if ( $line =~ /^\/\// ) {
            $in_sequence = 0;       # "//" ends the record, so stop collecting
        } elsif ( $line =~ /^ORIGIN/ ) {
            $in_sequence = 1;       # non-zero value is true
        } elsif ( $in_sequence ) {
            $sequence_data = $sequence_data . $line;
        }
    }
    $sequence_data =~ s/[\s0-9]//g; # remove whitespace and line numbers
    print $sequence_data;
    close INFILE;
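Following the testing hint, the sketch below runs the same logic on a tiny in-memory record instead of a file. The record text is invented sample data, not a real Genbank entry, and the subroutine name extract_sequence is an assumption. It treats the "//" line as the end of the record.

```perl
use strict;
use warnings;

# Collect the lines between ORIGIN and "//", then strip whitespace and digits.
sub extract_sequence {
    my ($record) = @_;
    my ($in_sequence, $sequence_data) = (0, '');
    for my $line (split /^/, $record) {      # split into lines, keeping newlines
        if    ($line =~ m{^//})    { $in_sequence = 0; }
        elsif ($line =~ /^ORIGIN/) { $in_sequence = 1; }
        elsif ($in_sequence)       { $sequence_data .= $line; }
    }
    $sequence_data =~ s/[\s0-9]//g;
    return $sequence_data;
}

# Made-up record with only two sequence lines.
my $record = <<'END';
LOCUS       SAMPLE
ORIGIN
        1 gatcctccat atacaacggt
       21 atctccacct caggtttaga
//
END

my $sequence = extract_sequence($record);
print "$sequence\n";
```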

Where Are We?
• So now we have a program that extracts the DNA sequence information out of a Genbank record
• Let's go one step further and convert the sequence data into an equivalent sequence of amino acids
• To simplify things, assume that a reading frame starts with the first nucleotide in the sequence

Hashes
• Associative arrays
• Key value pairs
• Given the key, the table returns the value
• You can build them by hand
    %codons = ('TCA','S','TCC','S',…)
• Alternatively…
    %codons = (
        'TCA' => 'S',
        'TCC' => 'S',
        'TCG' => 'S',
        …
    );

Building a Hash
• The hash can be built under program control
• The file codons.txt contains 64 lines in the following format
  • TCA S Serine
• A perl script could read the lines one at a time
  • Extract the codon and the amino acid from the line
  • Add the information to the hash

Hash Operations
• Some operations you can perform on a hash
    $codons{'TCA'}='S';
  • Adds the entry (TCA,S) to the hash
    delete $codons{'XXX'};
  • Removes the key XXX from the hash
    keys %codons
  • Returns a list of the keys in the hash
    sort keys %codons
  • Returns the keys in sorted order (the hash itself is not reordered)
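The operations above, put together in one runnable sketch. The starting entries, including the placeholder key XXX, are chosen only for illustration.

```perl
use strict;
use warnings;

my %codons = ('TCA' => 'S', 'TCC' => 'S', 'XXX' => '?');

$codons{'TCG'} = 'S';            # add the entry (TCG,S)
delete $codons{'XXX'};           # remove the key XXX

my @sorted  = sort keys %codons; # keys come back in no fixed order, so sort them
my $has_xxx = exists $codons{'XXX'} ? 1 : 0;

print "@sorted\n";
print "XXX present: $has_xxx\n";
```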

buildHash.perl
    open CODONS,"<$ARGV[0]" or die "Unable to open codon file ($!)";
    while (<CODONS>) {
        # \s+ tolerates tabs or runs of spaces between the fields
        ($codon,$amino,$name)=split /\s+/;
        $codons{$codon}=$amino;
    }
    foreach $d ( sort keys %codons ) {
        print "$d: $codons{$d}\n";
    }
    close CODONS;

Now This is Cool!!
    $dna = 'CGACGTTTCGTACGGACTAGCT';
    $amino_acids = "";
    # Step through the sequence three nucleotides (one codon) at a time
    for ($i=0; $i<length($dna)-2; $i=$i+3) {
        $amino_acids .= $codons{ substr( $dna, $i, 3 ) };
    }
    print $amino_acids,"\n";
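A self-contained sketch of the same loop. The hash here holds only the seven codons that occur in the sample sequence rather than the full 64-entry table from codons.txt, and the subroutine name translate and the ? fallback for unknown codons are assumptions made for this example.

```perl
use strict;
use warnings;

# Partial table: just enough entries for the sample sequence.
my %codons = (
    CGA => 'R', CGT => 'R', CGG => 'R',
    TTC => 'F', GTA => 'V', ACT => 'T', AGC => 'S',
);

# Translate a DNA string codon by codon, starting at the first nucleotide.
# Any trailing one or two nucleotides are ignored.
sub translate {
    my ($dna) = @_;
    my $amino_acids = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = substr($dna, $i, 3);
        $amino_acids .= exists $codons{$codon} ? $codons{$codon} : '?';
    }
    return $amino_acids;
}

my $protein = translate('CGACGTTTCGTACGGACTAGCT');
print "$protein\n";
```

The sample sequence has 22 nucleotides, so the final T is left over and produces no amino acid.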

Homework Assignment
• Write a perl script that will read sequence data from a Genbank record and will print out the amino acid sequence for each of the 6 reading frames associated with the sequence data in the record
• Take advantage of subroutines
  • Clearly a translate subroutine will be helpful
• If a codon cannot be translated, print ? in the output
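One possible starting point, offered as a sketch rather than a model solution: three of the six reading frames start at offsets 0, 1, and 2 of the sequence, and the other three start at the same offsets of its reverse complement. The subroutine names reverse_complement and reading_frames are assumptions, and translating each frame is left to a translate subroutine.

```perl
use strict;
use warnings;

# Reverse the string, then swap each base for its complement.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Return the six frame strings: forward offsets 0..2, then reverse offsets 0..2.
sub reading_frames {
    my ($dna) = @_;
    my $rc = reverse_complement($dna);
    return (
        (map { substr($dna, $_) } 0 .. 2),
        (map { substr($rc,  $_) } 0 .. 2),
    );
}

my @frames = reading_frames('ATGCGT');
print "$_\n" for @frames;
```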
