Outline
BINF634 Lecture 5
• Program 1 Solution
• Quiz 2 Solution
• Program 2 Discussions
• Regular Expressions
• Regular Expressions Lab
• Time to Work on Program 2
BINF 634 Fall 2013 Lect 5
Program 1 Discussions
• You must test all of your code on binf
• I am testing your code on binf
• I cannot know the configuration of every machine your code might run on
• The path to perl on binf must be the first line in your program
• #!/usr/bin/perl
Program 1 Solution
    #!/usr/bin/perl
    use strict;
    use warnings;
    # File: cpg.pl
    # Author: Jeff Solka
    # Date: 01 Aug 2011
    #
    # Purpose: Read sequences from a FASTA format file
    # Programming Assignment #1

    # the argument list should contain the file name
    die "usage: $0 filename\n" if scalar @ARGV < 1;
    # get the filename from the argument list
    my ($filename) = @ARGV;
    # Open the file given as the first argument on the command line
    open(INFILE, '<', $filename) or die "Can't open $filename: $!\n";
    # variable declarations:
    my @header = ();     # array of headers
    my @sequence = ();   # array of sequences
    my $count = 0;       # number of sequences
    # read FASTA file
    my $n = -1;          # index of current sequence
    while (my $line = <INFILE>) {
        chomp $line;                 # remove trailing \n from line
        if ($line =~ /^>/) {         # line starts with a ">"
            $n++;                    # this starts a new header
            $header[$n] = $line;     # save header line
            $sequence[$n] = "";      # start a new (empty) sequence
        }
Program 1 Solution (cont.)
        else {
            next if not @header;       # ignore data before first header
            $sequence[$n] .= $line;    # append to end of current sequence
        }
    }
    $count = $n+1;   # set count to the number of sequences
    close INFILE;
    die "No sequences found in $filename\n" if $count == 0;   # avoid dividing by zero below
    # remove white space from all sequences
    for (my $i = 0; $i < $count; $i++) {
        $sequence[$i] =~ s/\s//g;
    }
    ########## Sequence processing starts here:
    ##### REST OF PROGRAM
    my $maxlength = 0;
    my $minlength = 1E99;
    my $sumlength = 0;
    my $avlength = 0;
    # process the sequences
    for (my $i = 0; $i < $count; $i++) {
        $sumlength += length($sequence[$i]);
        if (length($sequence[$i]) > $maxlength) {
            $maxlength = length($sequence[$i]);
        }
        if (length($sequence[$i]) < $minlength) {
            $minlength = length($sequence[$i]);
        }
    }
    $avlength = $sumlength/$count;
    # print out statistics
    print "Report for file $filename \n";
    print "There are $count sequences in the file \n";
    print "Total sequence length = $sumlength \n";
    print "Maximum sequence length = $maxlength \n";
    print "Minimum sequence length = $minlength \n";
    print "Ave sequence length = $avlength \n";
Program 1 Solution (cont.)
    # print out sequence information
    for (my $i = 0; $i < $count; $i++) {
        print "$header[$i]\n";
        print "Length:",length($sequence[$i]),"\n";
        # Notice that we can use scalar variables to hold numbers.
        my $a = 0;
        my $c = 0;
        my $g = 0;
        my $t = 0;
        my $cg = 0;
        # Use a regular expression "trick", and five while loops,
        # to find the counts of the four bases plus CpG dinucleotides:
        # each /g match in scalar context resumes where the previous one ended
        while($sequence[$i] =~ /a/ig){$a++}
        while($sequence[$i] =~ /c/ig){$c++}
        while($sequence[$i] =~ /g/ig){$g++}
        while($sequence[$i] =~ /t/ig){$t++}
        while($sequence[$i] =~ /cg/ig){$cg++}
        printf "A:%d %0.2f \n", $a, $a/length($sequence[$i]);
        printf "C:%d %0.2f \n", $c, $c/length($sequence[$i]);
        printf "G:%d %0.2f \n", $g, $g/length($sequence[$i]);
        printf "T:%d %0.2f \n", $t, $t/length($sequence[$i]);
        printf "CpG:%d %0.2f \n", $cg, $cg/length($sequence[$i]);
    }
    exit;
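The five counting loops above can also be written without loops. The following is a minimal sketch, not the course solution: it assumes a hypothetical helper named count_bases, uses tr/// to count single bases, and uses a list-context /g match to count CpG dinucleotides.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the course solution): count the four
# bases and CpG dinucleotides in one sequence, ignoring case.
sub count_bases {
    my ($seq) = @_;
    my %count;
    # tr/// with an empty replacement list leaves the string unchanged and
    # returns the number of characters that matched the search list.
    $count{A} = ($seq =~ tr/Aa//);
    $count{C} = ($seq =~ tr/Cc//);
    $count{G} = ($seq =~ tr/Gg//);
    $count{T} = ($seq =~ tr/Tt//);
    # A list-context /g match returns every match; assigning that list to ()
    # and then to a scalar yields the number of matches.
    $count{CpG} = () = $seq =~ /cg/ig;
    return %count;
}

my %n = count_bases("ACGTcgCGAT");
print "$_: $n{$_}\n" for qw(A C G T CpG);
```

Both approaches give the same counts; tr/// is usually the more compact choice when only single characters are being counted.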
Quiz 2 Solution
    #!/usr/bin/perl
    use strict;
    use warnings;
    #quiz2 Fall 2013
    #Jeff Solka
    my(@a)=(1..10);
    my $b = 'alucarD';
    print "array a and string b prior to the function call \n";
    print "@a \n";
    print "$b \n";
    myfun(\@a,\$b);
    print "array a and string b after the function call \n";
    print "@a \n";
    print "$b \n";
    exit;
    sub myfun{
        my($i,$j)=@_;
        my $element;
        $$j=reverse($$j);
        foreach $element(@$i) {
            $element = $element - 1;
        }
    }
Quiz 2 Solution
Program in Action
    array a and string b prior to the function call
    1 2 3 4 5 6 7 8 9 10
    alucarD
    array a and string b after the function call
    0 1 2 3 4 5 6 7 8 9
    Dracula
Program 2 Discussions
• Any questions on Program 2?
Regular Expression (Humor)
• A relevant cartoon
Regular Expression (Why?)
Regular Expressions
• Bioinformatics programs often have to look for patterns in strings:
  • Find DNA sequences containing only C's and G's
  • Look for a sequence that begins with ATG and ends with TAG
• Regular expressions are a way of describing a PATTERN:
  • "all the words that begin with the letter A"
  • "every 10-digit phone number"
• We create a regular expression to match the different parts of the pattern we're looking for
  • Ordinary characters match themselves
  • Meta-characters are special symbols that match a group of characters
  • for example, \d matches any digit
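The two bioinformatics examples in the bullets above can be sketched as follows. The helper names only_cg and atg_to_tag are hypothetical, chosen for illustration, and the patterns assume upper-case sequences with no embedded white space.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the sequence consists only of C's and G's
# (at least one character).
sub only_cg {
    my ($seq) = @_;
    return $seq =~ /^[CG]+$/ ? 1 : 0;
}

# Hypothetical helper: true if the sequence begins with ATG and ends with TAG.
sub atg_to_tag {
    my ($seq) = @_;
    return $seq =~ /^ATG.*TAG$/ ? 1 : 0;
}

for my $seq (qw(CCGGCG ATGCCCTAG ATGCCC)) {
    printf "%-10s only C/G: %d  ATG...TAG: %d\n",
        $seq, only_cg($seq), atg_to_tag($seq);
}
```

The anchors ^ and $ make each pattern describe the whole string. Without them the first pattern would match any sequence that merely contains a C or a G.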
Regular Expression (How?)
Meta Characters (see Camel Book, Ch. 5)
Regular Expression (How?)
Ways to Control Patterns (see Camel Book, Ch. 5)
Regular Expression (Practice)
Examples
    # match if string $str is empty or consists only of white space characters
    $str =~ /^\s*$/;
    # string $str consists only of capital letters (at least one)
    $str =~ /^[A-Z]+$/;
    # string $str contains a capital letter followed by 0 or more digits
    $str =~ /[A-Z]\d*/;
    # number $n consists of one or more digits before and after a decimal point
    $n =~ /^\d+\.\d+$/;
    # string contains A and B separated by any two characters
    $s =~ /A..B/;
    # string does NOT contain ATG
    $s !~ /ATG/;
Regular Expression (Practice)
Examples
    # match if string $str contains any sequence of three consecutive A's
    $str =~ /AAA/;
    $str =~ /A{3}/;
    # match if string $str consists of exactly three A's
    $str =~ /^AAA$/;
    $str =~ /^A{3}$/;
    # match if $str contains a codon for Alanine (GCA, GCT, GCC, GCG)
    $str =~ /GC./;        # also matches GC followed by any other non-newline character
    $str =~ /GC[ACGT]/;   # restricts the third base to A, C, G, or T
    # match if $str contains a STOP codon (TAA, TAG, TGA)
    $str =~ /TA[AG]|TGA/;
    $str =~ /T(AA|AG|GA)/;
    $str =~ /T(A[AG]|GA)/;
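The STOP codon patterns above match TAA, TAG, or TGA anywhere in the string, whether or not the three bases line up with a reading frame. The following sketch assumes reading frame 0 and a hypothetical helper named first_inframe_stop. It splits the sequence into consecutive triplets and tests each triplet separately.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based index of the first in-frame stop
# codon (reading frame 0), or -1 if there is none.
sub first_inframe_stop {
    my ($dna) = @_;
    my @codons = $dna =~ /(...)/g;    # consecutive triplets; leftover bases are dropped
    for my $i (0 .. $#codons) {
        return 3 * $i if $codons[$i] =~ /^T(A[AG]|GA)$/;
    }
    return -1;
}

# ATG GCT TGA AAA: the stop codon TGA starts at index 6
print first_inframe_stop("ATGGCTTGAAAA"), "\n";
# ATG ATA GCC: TAG occurs at index 4, but not in frame
print first_inframe_stop("ATGATAGCC"), "\n";
```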
Regular Expression (Practice)
Examples
    # string contains any word consisting only of capital letters
    $str =~ /\b[A-Z]+\b/;
    # A followed by any number of C's or G's followed by T or A
    $str =~ /A[CG]*(T|A)/;
    $str =~ /A[CG]{0,}[TA]/;
    # TT followed by one or more CA's followed by any character except G
    $str =~ /TT(CA)+[^G]/;
    # string begins with B and has between 5 and 10 characters
    $str =~ /^B.{4,9}$/;
    # string consists of a 10 digit phone number: ddd-ddd-dddd
    $str =~ /^\d\d\d\-\d\d\d\-\d\d\d\d$/;
    $str =~ /^\d{3}\-\d{3}\-\d{4}$/;
Regular Expression (What Did We Match?)
Capturing Matches
• When we match a string with a regular expression, we may want to find out what matched
• Do this by surrounding the part of interest with ( )
• Then access special variables $1, $2, etc. to get matches:
    $str = "Perl is a programming language used for bioinformatics.";
    $str =~ /(.*) is.*(b.*)\./;
    $first = $1;
    $second = $2;
    print "$first $second\n";   # prints "Perl bioinformatics"
    # or, you can capture the results in a list assignment:
    ($first, $second) = $str =~ /(.*) is.*(b.*)\./;
    print "$first $second\n";   # prints "Perl bioinformatics"
Regular Expression (What Did We Match?)
Capturing Matches (cont.)
    $str = "Perl is a programming language used for bioinformatics.";
    # .* is greedy: it matches as much as possible, up to the last "l"
    $str =~ /(P.*l)/;
    $word = $1;
    print $word;   # prints "Perl is a programming l"
    # .*? is non-greedy: it matches as little as possible, up to the first "l"
    $str =~ /(P.*?l)/;
    $word = $1;
    print $word;   # prints "Perl"
    # first word that begins with "u"
    $str =~ /\b(u.*?)\b/;
    $word = $1;
    print $word;   # prints "used"
Regular Expression (What Did We Match?)
Capturing Matches
• If no string is given to the match operators, $_ is assumed
    @A = qw / ATGGCT CCCCGGTAT GCAGTGG /;
    for (@A) {
        ($first, $second) = /(.+)GG(.+)/;
        print "$first $second\n" if ($first and $second);
    }
OUTPUT:
    AT CT
    CCCC TAT
Q. Why no output for third string?
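One way to explore the question above: the third string, GCAGTGG, ends in GG, so nothing is left for the second (.+), which needs at least one character. The sketch below uses (.*) so that either side may be empty. The helper name split_on_gg is hypothetical, and the test for a match uses defined rather than truth, because an empty capture is false in Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before and after the last "GG",
# allowing either side to be empty. Returns an empty list if there is no "GG".
sub split_on_gg {
    my ($seq) = @_;
    if ($seq =~ /(.*)GG(.*)/) {
        return ($1, $2);
    }
    return;
}

for my $seq (qw(ATGGCT CCCCGGTAT GCAGTGG)) {
    my ($first, $second) = split_on_gg($seq);
    print "[$first] [$second]\n" if defined $first;
}
```

With this version the third string is reported as well, with an empty second part.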
Regular Expression (What Did We Match?)
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";
    # find all words containing "RNA"
    while ( $string =~ /(\w*RNA\w*)/g ) {
        print "$1\n";
    }
    exit;
Output:
    RNA
    RNAi
    RNAi
    RNA
    dsRNA
    dsRNA
    ssRNAs
    RNAs
    mRNAs
Regular Expression (What Did We Match?)
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";
    # find all words with at least one word character on each side of "RNA"
    while ( $string =~ /(\w+RNA\w+)/g ) {
        print "$1\n";
    }
    exit;
Output:
    ssRNAs
    mRNAs
Regular Expression (What Did We Match?)
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";
    # find all runs of non-white-space characters with at least one
    # character on each side of "RNA" (punctuation counts)
    while ( $string =~ /(\S+RNA\S+)/g ) {
        print "$1\n";
    }
    exit;
Output:
    (RNAi)
    (dsRNA)
    (ssRNAs)
    mRNAs.
Regular Expression (What Did We Match?)
Capturing Matches
When we match a string with a regular expression, several special variables get set automatically:
    $string =~ /REGEXP/;
    $` = part of string to the left of the match
    $& = part of string matched by the regular expression REGEXP
    $' = part of string to the right of the match

    $string = "ATCGCAT";
    $string =~ /T.G/;
    print "left part: $` \n";
    print "match: $& \n";
    print "right part: $' \n";
Output:
    left part: A
    match: TCG
    right part: CAT
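The same three pieces can be recovered from the match offsets that Perl stores in the arrays @- and @+, which avoids the speed penalty that $`, $&, and $' impose on every match in Perl versions before 5.20. This is a minimal sketch; the helper name match_parts is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into (before, match, after) around the
# first match of $re, using the offset arrays @- and @+.
sub match_parts {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    # $-[0] is the offset where the match starts; $+[0] is the offset just past it
    my ($start, $end) = ($-[0], $+[0]);
    return (
        substr($string, 0, $start),
        substr($string, $start, $end - $start),
        substr($string, $end),
    );
}

my ($left, $match, $right) = match_parts("ATCGCAT", qr/T.G/);
print "left part: $left\n";
print "match: $match\n";
print "right part: $right\n";
```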
Regular Expression (A Regular Expression Tester)
A Nice Application of Capturing Matches
    #!/usr/bin/perl
    print ("\nEnter string or Ctrl-D to quit\n");
    print ("Square brackets indicate text that matched pattern\n\n");
    $prompt = "test> ";
    print $prompt;
    while(<STDIN>) {
        chomp;
        if(/REGEXP Goes Here/) {
            print("$`\[$&]$'\n");
        }
        else {
            print("no match\n");
        }
        print $prompt;
    }
    exit;
Regular Expression (A Nicer Regular Expression Tester)
An Even Nicer Implementation of This Idea - I
    #!/usr/bin/perl
    use strict;
    use warnings;
    # File: regex_tester.pl
    # Author: Jim Logan
    #
    # Fully interactive version (i.e., no recompiles required) of a regular expression
    # tester based on a script by Fernando J. Pineda as presented to
    # class of BINF623 by Jeff Solka on 10/1/12.
    # Particularly useful in an Eclipse environment using its cut and paste facility.

    # instructions for use
    print "\nAccepts keyboard entry of a regular expression and then permits\n";
    print "successive entry of strings to test that expression.\n";
    print "Square brackets in output indicate the text that matched pattern\n\n";
    print "Note: Depending upon the environment (e.g. Eclipse), you may be\n";
    print "able to cut and paste into both the \"Next expression\" and the\n";
    print "\"New test string\" fields and then edit as desired.\n";
Regular Expression (A Nicer Regular Expression Tester)
An Even Nicer Implementation of This Idea - II
    # initialization
    my $regex = '/^.*$/';   # default regex to start and to demonstrate
    my $string = 'This is a test string';
    my $input = "";
    my $stripped_regex = "";
    while (1) {   # outer loop to sequence regular expressions
        print "\nCurrent regular expression: $regex\n";
        print "Enter a new expression to change or ENTER to continue without change.\n";
        print "(\"quit\" terminates the program)\n";
        print "New expression: ";
        $input = <STDIN>;
        chomp $input;
        if ($input =~ /^q.*$/i) {exit};
        if ($input !~ /^$/) {
            $regex = $input;
        }
        $stripped_regex = substr ($regex, 1, length ($regex) -2);
Regular Expression (A Nicer Regular Expression Tester)
An Even Nicer Implementation of This Idea - III
        # User includes the two slashes for a regular expression
        # but they are stripped here so that variable is just the pattern
        # that will be interpolated in /pattern/ context.
        while (1) {   # inner loop to sequence strings to test the expression
            print "\nCurrent test string: $string\n";
            print "Enter a new test string to change or ENTER to reset the regex.\n";
            print "New test string: ";
            $input = <STDIN>;
            chomp $input;
            if ($input =~ /^$/) {   # for blank line, go back to set expression
                last;
            }
            else {
                $string = $input;   # else run regex over input
            }
Regular Expression (A Nicer Regular Expression Tester)
An Even Nicer Implementation of This Idea - IV
            if( $string =~ /$stripped_regex/) {
                print("$`\[$&]$'\n");   # show match in context of input
            }
            else {
                print("no match\n");
            }
        }
    }
    exit;
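Because the tester interpolates whatever the user typed into a match, a mistyped pattern such as T(G stops the program with a compile error. One possible guard, sketched here with a hypothetical helper named compile_pattern, is to compile the pattern with qr// inside eval and check the result before using it.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern, returning the compiled
# regex, or undef (instead of dying) when the pattern is not valid.
sub compile_pattern {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return $re;    # undef if qr// died; the error message is left in $@
}

for my $pattern ('T.G', 'T(G') {
    if (defined(my $re = compile_pattern($pattern))) {
        print "'$pattern' compiled; ATCGCAT ",
            ("ATCGCAT" =~ $re ? "matches" : "does not match"), "\n";
    }
    else {
        print "'$pattern' is not a valid regular expression\n";
    }
}
```

In the tester, the same check could run right after the slashes are stripped, returning to the "New expression" prompt when the pattern does not compile.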
Regular Expression (Where Did the Match Occur?)
Finding the position of matches
If we use the global modifier g, then pos($string) returns the position just after the match:
    $string = "ATCGCATGGAA";
    $string =~ /T.G/g;
    print "$& ends at position ", pos($string)-1, "\n";
    $string =~ /T.G/g;
    print "$& ends at position ", pos($string)-1, "\n";
Output:
    TCG ends at position 3
    TGG ends at position 8
Regular Expression (Where Did the Match Occur?)
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";
    # find all runs of non-white-space characters with at least one
    # character on each side of "RNA", and report where each one ends
    while ( $string =~ /(\S+RNA\S+)/g ) {
        print "$1 ends at position ", pos($string)-1, "\n";
    }
    exit;
Output:
    (RNAi) ends at position 49
    (dsRNA) ends at position 211
    (ssRNAs) ends at position 374
    mRNAs. ends at position 431
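pos() reports only where a match ended. To get the start as well, one option is to read the offset arrays @- and @+ inside the loop, as in this sketch. The helper name match_spans is hypothetical, and positions are 0-based with an inclusive end, matching the pos($string)-1 convention used above.

```perl
use strict;
use warnings;

# Hypothetical helper: return [matched text, start, end] for every match of
# $re in $string, with 0-based positions and an inclusive end.
sub match_spans {
    my ($string, $re) = @_;
    my @spans;
    while ($string =~ /$re/g) {
        # $-[0] is the start of the match; $+[0] equals pos($string)
        my ($start, $end) = ($-[0], $+[0] - 1);
        push @spans, [ substr($string, $start, $end - $start + 1), $start, $end ];
    }
    return @spans;
}

for my $span (match_spans("ATCGCATGGAA", qr/T.G/)) {
    printf "%s spans positions %d-%d\n", @$span;
}
```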
Additional Reading
Some Useful URLs
• http://docs.python.org/library/re.html
• http://www.regular-expressions.info/
• http://www.regular-expressions.info/tutorial.html
• http://www.bjnet.edu.cn/tech/book/perl/
  • Nice tutorial; regular expressions are discussed on Day 7
• http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
On the Horizon
Homework
• Remember that we meet on Tuesday, 10/15/13, at the usual place and time because of the Columbus Day holiday.
• Program 2 is due Tuesday 10/7/13 at 7:00 pm.
• Quiz 3 will occur next week.
• Remember that on Tuesday, October 15, 2013 we will have our in-class midterm exam. It will be open book and notes.
Our Regular Expression Lab
• Counts as a quiz grade
• 100 possible points