BINF634 Lecture 5

Presentation Transcript


  1. Outline: BINF634 Lecture 5
     • Program 1 Solution
     • Quiz 2 Solution
     • Program 2 Discussions
     • Regular Expressions
     • Regular Expressions Lab
     • Time to Work on Program 2
     BINF 634 Fall 2013 Lect 5

  2. Program 1 Discussions
     • You must test all of your code on binf.
     • I am testing your code on binf.
     • I can't possibly know the configuration of every machine your code might run on.
     • The path to perl on binf must be the first line in your program:
       #!/usr/bin/perl

  3. Program 1 Solution

         #!/usr/bin/perl
         use strict;
         use warnings;

         # File: cpg.pl
         # Author: Jeff Solka
         # Date: 01 Aug 2011
         #
         # Purpose: Read sequences from a FASTA format file
         # Programming Assignment #1

         # the argument list should contain the file name
         die "usage: fasta.pl filename\n" if scalar @ARGV < 1;

         # get the filename from the argument list
         my ($filename) = @ARGV;

         # Open the file given as the first argument on the command line
         open(INFILE, $filename) or die "Can't open $filename\n";

         # variable declarations:
         my @header = ();     # array of headers
         my @sequence = ();   # array of sequences
         my $count = 0;       # number of sequences

         # read FASTA file
         my $n = -1;          # index of current sequence
         while (my $line = <INFILE>) {
             chomp $line;                  # remove trailing \n from line
             if ($line =~ /^>/) {          # line starts with a ">"
                 $n++;                     # this starts a new header
                 $header[$n] = $line;      # save header line
                 $sequence[$n] = "";       # start a new (empty) sequence
             }

  4. Program 1 Solution (cont.)

             else {
                 next if not @header;       # ignore data before first header
                 $sequence[$n] .= $line;    # append to end of current sequence
             }
         }
         $count = $n+1;   # set count to the number of sequences
         close INFILE;

         # remove white space from all sequences
         for (my $i = 0; $i < $count; $i++) {
             $sequence[$i] =~ s/\s//g;
         }

         ########## Sequence processing starts here:
         ##### REST OF PROGRAM
         my $maxlength = 0;
         my $minlength = 1E99;
         my $sumlength = 0;
         my $avlength = 0;

         # process the sequences
         for (my $i = 0; $i < $count; $i++) {
             $sumlength += length($sequence[$i]);
             if(length($sequence[$i]) > $maxlength){
                 $maxlength = length($sequence[$i]);
             }
             if(length($sequence[$i]) < $minlength){
                 $minlength = length($sequence[$i]);
             }
         }
         $avlength = $sumlength/$count;

         # print out statistics
         print "Report for file $filename \n";
         print "There are $count sequences in the file \n";
         print "Total sequence length = $sumlength \n";
         print "Maximum sequence length = $maxlength \n";
         print "Minimum sequence length = $minlength \n";
         print "Ave sequence length = $avlength \n";
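The length statistics computed by the loop on this slide could also be obtained with the core List::Util module. This is only an alternative sketch, not the posted solution: the subroutine name length_stats and the sample sequences are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical helper: returns (count, total, max, min, average) of the sequence lengths
sub length_stats {
    my @lengths = map { length } @_;
    return (0, 0, 0, 0, 0) unless @lengths;   # avoid dividing by zero when there are no sequences
    my $total = sum(@lengths);
    return (scalar @lengths, $total, max(@lengths), min(@lengths), $total / @lengths);
}

# Made-up sample data standing in for the @sequence array
my @sequence = qw(ACGT ACGTACGT AC);
my ($count, $sum, $max, $min, $avg) = length_stats(@sequence);
print "There are $count sequences; total = $sum, max = $max, min = $min, ave = $avg\n";
```

The early return covers the case of a file with no sequences, where the slide's $sumlength/$count would divide by zero.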

  5. Program 1 Solution (cont.)

         # print out sequence information
         for (my $i = 0; $i < $count; $i++) {
             print "$header[$i]\n";
             print "Length:",length($sequence[$i]),"\n";

             # Notice that we can use scalar variables to hold numbers.
             my $a = 0;
             my $c = 0;
             my $g = 0;
             my $t = 0;
             my $cg = 0;

             # Use a regular expression "trick", and five while loops,
             # to find the counts of the four bases plus CpG dinucleotides
             while($sequence[$i] =~ /a/ig){$a++}
             while($sequence[$i] =~ /c/ig){$c++}
             while($sequence[$i] =~ /g/ig){$g++}
             while($sequence[$i] =~ /t/ig){$t++}
             while($sequence[$i] =~ /cg/ig){$cg++}

             printf "A:%d %0.2f \n", $a, $a/length($sequence[$i]);
             printf "C:%d %0.2f \n", $c, $c/length($sequence[$i]);
             printf "G:%d %0.2f \n", $g, $g/length($sequence[$i]);
             printf "T:%d %0.2f \n", $t, $t/length($sequence[$i]);
             printf "CpG:%d %0.2f \n", $cg, $cg/length($sequence[$i]);
         }
         exit;
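As a point of comparison with the five while loops on this slide, single bases can also be counted with the tr/// operator, which returns the number of characters it matched. The sketch below is one possible alternative, not the course solution; base_counts and the test sequence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of base counts plus the CpG count
sub base_counts {
    my ($seq) = @_;
    my %n;
    # tr with an empty replacement list counts characters without changing $seq
    $n{A} = ($seq =~ tr/Aa//);
    $n{C} = ($seq =~ tr/Cc//);
    $n{G} = ($seq =~ tr/Gg//);
    $n{T} = ($seq =~ tr/Tt//);
    # tr works on single characters only, so the dinucleotide still needs a regex;
    # assigning to the empty list "() =" forces list context and yields the match count
    $n{CpG} = () = $seq =~ /cg/ig;
    return %n;
}

# Made-up test sequence
my %n = base_counts("ACGTCGcgaa");
print join(" ", map { "$_:$n{$_}" } qw(A C G T CpG)), "\n";
```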

  6. Quiz 2 Solution

         #!/usr/bin/perl -w
         use strict;
         use warnings;
         # quiz2 Fall 2013
         # Jeff Solka
         my(@a)=(1..10);
         my $b = 'alucarD';
         print "array a and string b prior to the function call \n";
         print "@a \n";
         print "$b \n";
         myfun(\@a,\$b);
         print "array a and string b after the function call \n";
         print "@a \n";
         print "$b \n";
         exit;

         sub myfun{
             my($i,$j)=@_;
             my $element;
             $$j=reverse($$j);
             foreach $element(@$i) {
                 $element = $element - 1;
             }
         }

  7. Quiz 2 Solution: Program in Action

         array a and string b prior to the function call
         1 2 3 4 5 6 7 8 9 10
         alucarD
         array a and string b after the function call
         0 1 2 3 4 5 6 7 8 9
         Dracula

  8. Program 2 Discussions
     • Any questions on Program 2?

  9. Regular Expression Humor
     • A relevant cartoon

  10. Regular Expressions (Why?)
      • Bioinformatics programs often have to look for patterns in strings:
        • Find DNA sequences containing only C's and G's
        • Look for a sequence that begins with ATG and ends with TAG
      • Regular expressions are a way of describing a PATTERN:
        • "all the words that begin with the letter A"
        • "every 10-digit phone number"
      • We create a regular expression to match the different parts of the pattern we're looking for
      • Ordinary characters match themselves
      • Meta-characters are special symbols that match a group of characters
        • for example, \d matches any digit
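The two bioinformatics patterns mentioned on this slide might be written in Perl roughly as follows. The subroutine names only_cg and atg_to_tag and the sample sequences are illustrative assumptions, not part of the lecture.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the sequence consists only of C's and G's
sub only_cg {
    my ($seq) = @_;
    return $seq =~ /^[CG]+$/ ? 1 : 0;
}

# Hypothetical helper: true if the sequence begins with ATG and ends with TAG
sub atg_to_tag {
    my ($seq) = @_;
    return $seq =~ /^ATG.*TAG$/ ? 1 : 0;
}

# Made-up sample sequences
for my $seq (qw(CCGGCG ATGCCCTAG ATGCCC)) {
    printf "%-10s only C/G: %d  ATG...TAG: %d\n", $seq, only_cg($seq), atg_to_tag($seq);
}
```

Here the ordinary characters (A, T, G, C) match themselves, while ^, $, [ ], + and .* are the meta-characters that describe where and how often they may occur.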

  11. Regular Expression (How?): Meta Characters (see Camel Book, Ch. 5)

  12. Regular Expression (How?): Ways to Control Patterns (see Camel Book, Ch. 5)

  13. Regular Expression (Practice): Examples

          # match if string $str is empty or contains only white space characters
          $str =~ /^\s*$/;

          # string $str consists entirely of capital letters (at least one)
          $str =~ /^[A-Z]+$/;

          # string $str contains a capital letter followed by 0 or more digits
          $str =~ /[A-Z]\d*/;

          # number $n consists of some digits before and after a decimal point
          $n =~ /^\d+\.\d+$/;

          # string contains A and B separated by any two characters
          $s =~ /A..B/;

          # string does NOT contain ATG
          $s !~ /ATG/;

  14. Regular Expression (Practice): Examples

          # match if string $str contains any sequence of three consecutive A's
          $str =~ /AAA/;
          $str =~ /A{3}/;

          # match if string $str consists of exactly three A's
          $str =~ /^AAA$/;
          $str =~ /^A{3}$/;

          # match if $str contains a codon for Alanine (GCA, GCT, GCC, GCG)
          # (note: . matches any character, not just A, C, G or T)
          $str =~ /GC./;

          # match if $str contains a STOP codon (TAA, TAG, TGA)
          $str =~ /TA[AG]|TGA/;
          $str =~ /T(AA|AG|GA)/;
          $str =~ /T(A[AG]|GA)/;
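One way to convince yourself that the three STOP codon patterns on this slide are equivalent is to try each of them against all 64 possible codons. This is a small sketch for experimentation; the helper name matching_codons is made up.

```perl
use strict;
use warnings;

# The three stop-codon patterns from the slide, precompiled with qr//
my @patterns = (qr/TA[AG]|TGA/, qr/T(AA|AG|GA)/, qr/T(A[AG]|GA)/);

# Hypothetical helper: list the codons (out of all 64) that a pattern matches
sub matching_codons {
    my ($re) = @_;
    my @bases = qw(A C G T);
    my @hits;
    for my $x (@bases) {
        for my $y (@bases) {
            for my $z (@bases) {
                my $codon = "$x$y$z";
                push @hits, $codon if $codon =~ $re;
            }
        }
    }
    return @hits;
}

for my $re (@patterns) {
    print "$re matches: @{[ matching_codons($re) ]}\n";
}
```

Each pattern should report the same three codons; the patterns differ only in how the alternatives are grouped.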

  15. Regular Expression (Practice): Examples

          # string contains any word consisting entirely of capital letters
          $str =~ /\b[A-Z]+\b/;

          # A followed by any number of C's or G's followed by T or A
          $str =~ /A[CG]*(T|A)/;
          $str =~ /A[CG]{0,}[TA]/;

          # TT followed by one or more CA's followed by any character except G
          $str =~ /TT(CA)+[^G]/;

          # string begins with B and has between 5 and 10 characters
          $str =~ /^B.{4,9}$/;

          # string consists of a 10 digit phone number: ddd-ddd-dddd
          $str =~ /^\d\d\d\-\d\d\d\-\d\d\d\d$/;
          $str =~ /^\d{3}\-\d{3}\-\d{4}$/;

  16. Regular Expression (What Did We Match?): Capturing Matches
      • When we match a string with a regular expression, we may want to find out what matched
      • Do this by surrounding the part of interest with ( )
      • Then access the special variables $1, $2, etc. to get the matches:

          $str = "Perl is a programming language used for bioinformatics.";
          $str =~ /(.*) is.*(b.*)\./;
          $first = $1;
          $second = $2;
          print "$first $second\n";   # prints "Perl bioinformatics"

          # or, you can capture the results in a list assignment:
          ($first, $second) = $str =~ /(.*) is.*(b.*)\./;
          print "$first $second\n";   # prints "Perl bioinformatics"

  17. Regular Expression (What Did We Match?): Capturing Matches
      • When we match a string with a regular expression, we may want to find out what matched
      • Do this by surrounding the part of interest with ( )
      • Then access the special variables $1, $2, etc. to get the matches:

          $str = "Perl is a programming language used for bioinformatics.";

          $str =~ /(P.*l)/;
          $word = $1;
          print $word;   # prints "Perl is a programming l"

          $str =~ /(P.*?l)/;
          $word = $1;
          print $word;   # prints "Perl"

          $str =~ /\b(u.*?)\b/;
          $word = $1;
          print $word;   # prints "used"

  18. Regular Expression (What Did We Match?): Capturing Matches
      • If no string is given to the match operators, $_ is assumed

          @A = qw/ ATGGCT CCCCGGTAT GCAGTGG /;
          for (@A) {
              ($first, $second) = /(.+)GG(.+)/;
              print "$first $second\n" if ($first and $second);
          }

      OUTPUT:

          AT CT
          CCCC TAT

      Q. Why no output for the third string?
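One way to explore the question on this slide: .+ needs at least one character, and in GCAGTGG nothing follows the GG, so the whole match fails and both variables stay undefined. The sketch below contrasts .+ with .* on the same data; split_on_gg and its flag argument are hypothetical names introduced for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence around GG.
# With a true second argument the flanks may be empty (.*), otherwise they may not (.+).
sub split_on_gg {
    my ($seq, $allow_empty) = @_;
    my $re = $allow_empty ? qr/(.*)GG(.*)/ : qr/(.+)GG(.+)/;
    my ($first, $second) = $seq =~ $re;   # both are undef if the match fails
    return ($first, $second);
}

for my $seq (qw(ATGGCT CCCCGGTAT GCAGTGG)) {
    my ($first, $second) = split_on_gg($seq, 1);
    # test with defined() because an empty capture is a valid but false value
    print "$seq: '$first' '$second'\n" if defined $first and defined $second;
}
```

Note that the slide's test "if ($first and $second)" would also skip a successful match whose capture is an empty string or "0", which is why this sketch checks defined() instead.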

  19. Regular Expression (What Did We Match?)

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";

          # find all words containing "RNA"
          while ( $string =~ /(\w*RNA\w*)/g ) {
              print "$1\n";
          }
          exit;

      Output:

          RNA
          RNAi
          RNAi
          RNA
          dsRNA
          dsRNA
          ssRNAs
          RNAs
          mRNAs

  20. Regular Expression (What Did We Match?)

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";

          # find words with at least one word character on each side of "RNA"
          while ( $string =~ /(\w+RNA\w+)/g ) {
              print "$1\n";
          }
          exit;

      Output:

          ssRNAs
          mRNAs

  21. Regular Expression (What Did We Match?)

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";

          # find runs of non-whitespace with at least one character on each side of "RNA"
          while ( $string =~ /(\S+RNA\S+)/g ) {
              print "$1\n";
          }
          exit;

      Output:

          (RNAi)
          (dsRNA)
          (ssRNAs)
          mRNAs.

  22. Regular Expression (What Did We Match?): Capturing Matches
      When we match a string with a regular expression, several special variables get set automatically:

          $string =~ /REGEXP/;

          $` = part of string to the left of the match
          $& = part of string matched by the regular expression REGEXP
          $' = part of string to the right of the match

          $string = "ATCGCAT";
          $string =~ /T.G/;
          print "left part: $` \n";
          print "match: $& \n";
          print "right part: $' \n";

      Output:

          left part: A
          match: TCG
          right part: CAT
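In Perl versions before 5.20, using $`, $& or $' anywhere in a program slowed down every regular expression match in it. Perl 5.10 and later offer the /p modifier together with ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} as an alternative that carries the same information. The sketch below assumes Perl 5.10 or later; match_parts is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return (left, match, right) without using $`, $& and $'
sub match_parts {
    my ($string, $re) = @_;
    return unless $string =~ /$re/p;    # /p asks Perl to keep the three pieces for this match
    return (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

# Same string and pattern as on the slide
my ($left, $match, $right) = match_parts("ATCGCAT", qr/T.G/);
print "left part: $left\nmatch: $match\nright part: $right\n";
```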

  23. Regular Expression (A Regular Expression Tester): A Nice Application of Capturing Matches

          #!/usr/bin/perl
          print ("\nEnter string or ctrl-D to quit\n");
          print ("Square brackets indicate text that matched pattern\n\n");
          $prompt = "test> ";
          print $prompt;
          while(<STDIN>) {
              chomp;
              if(/REGEXP Goes Here/) {
                  print("$`\[$&]$'\n");
              }
              else {
                  print("no match\n");
              }
              print $prompt;
          }
          exit;

  24. Regular Expression (A Nicer Regular Expression Tester): An Even Nicer Implementation of This Idea - I

          #!/usr/bin/perl
          use strict;
          use warnings;

          # File: regex_tester.pl
          # Author: Jim Logan
          #
          # Fully interactive version (i.e., no recompiles required) of a regular expression
          # tester based on a script by Fernando J. Pineda as presented to
          # class of BINF623 by Jeff Solka on 10/1/12.
          # Particularly useful in an Eclipse environment using its cut and paste facility.

          # instructions for use
          print "\nAccepts keyboard entry of a regular expression and then permits\n";
          print "successive entry of strings to test that expression.\n";
          print "Square brackets in output indicate the text that matched pattern\n\n";
          print "Note: Depending upon the environment (e.g. Eclipse), you may be\n";
          print "able to cut and paste into both the \"Next expression\" and the\n";
          print "\"New test string\" fields and then edit as desired.\n";

  25. Regular Expression (A Nicer Regular Expression Tester): An Even Nicer Implementation of This Idea - II

          # initialization
          my $regex = '/^.*$/';   # default regex to start and to demonstrate
          my $string = 'This is a test string';
          my $input = "";
          my $stripped_regex = "";

          while (1) {   # outer loop to sequence regular expressions
              print "\nCurrent regular expression: $regex\n";
              print "Enter a new expression to change or ENTER to continue without change.\n";
              print "(\"quit\" terminates the program)\n";
              print "New expression: ";
              $input = <STDIN>;
              chomp $input;
              if ($input =~ /^q.*$/i) {exit};
              if ($input !~ /^$/) {
                  $regex = $input;
              }
              $stripped_regex = substr ($regex, 1, length ($regex) -2);

  26. Regular Expression (A Nicer Regular Expression Tester): An Even Nicer Implementation of This Idea - III

              # User includes the two slashes for a regular expression
              # but they are stripped here so that variable is just the pattern
              # that will be interpolated in /pattern/ context.
              while (1) {   # inner loop to sequence strings to test the expression
                  print "\nCurrent test string: $string\n";
                  print "Enter a new test string to change or ENTER to reset the regex.\n";
                  print "New test string: ";
                  $input = <STDIN>;
                  chomp $input;
                  if ($input =~ /^$/) {   # for blank line, go back to set expression
                      last;
                  }
                  else {
                      $string = $input;   # else run regex over input
                  }

  27. Regular Expression (A Nicer Regular Expression Tester): An Even Nicer Implementation of This Idea - IV

                  if( $string =~ /$stripped_regex/) {
                      print("$`\[$&]$'\n");   # show match in context of input
                  }
                  else {
                      print("no match\n");
                  }
              }
          }
          exit;
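A tester like the one above interpolates whatever the user types into a pattern, so a malformed expression (for example, an unclosed parenthesis) would make the program die. One possible refinement, sketched here under the assumption that patterns are entered with surrounding slashes as in the script above, is to compile the input with qr// inside an eval block. The helper names compile_pattern and show_match are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: compile user input into a regex, or return undef if it is invalid
sub compile_pattern {
    my ($text) = @_;
    $text =~ s{^/(.*)/$}{$1}s;       # strip the surrounding slashes if the user typed them
    my $re = eval { qr/$text/ };     # qr// dies on a malformed pattern; eval traps that
    return $re;
}

# Hypothetical helper: show the match in square brackets, as the testers above do
sub show_match {
    my ($re, $string) = @_;
    return "no match" unless $string =~ $re;
    # $-[0] and $+[0] are the start and end offsets of the last successful match
    return substr($string, 0, $-[0])
         . "[" . substr($string, $-[0], $+[0] - $-[0]) . "]"
         . substr($string, $+[0]);
}

my $re = compile_pattern('/t.st/');
print show_match($re, 'This is a test string'), "\n";
print defined(compile_pattern('/(unclosed/')) ? "pattern ok\n" : "invalid pattern\n";
```

In an interactive loop, an undef return from compile_pattern could be used to print a message and ask for the expression again instead of exiting.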

  28. Regular Expression (Where Did the Match Occur?): Finding the position of matches
      If we use the global modifier g, then pos($string) returns the position just after the match:

          $string = "ATCGCATGGAA";
          $string =~ /T.G/g;
          print "$& ends at position ", pos($string)-1, "\n";
          $string =~ /T.G/g;
          print "$& ends at position ", pos($string)-1, "\n";

      Output:

          TCG ends at position 3
          TGG ends at position 8
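The two separate matches on this slide can be folded into a loop that also records where each match starts. The sketch below uses the built-in @- array, whose first element $-[0] holds the start offset of the last successful match; match_positions is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return [match, start, end] for every non-overlapping match
sub match_positions {
    my ($string, $re) = @_;
    my @found;
    while ($string =~ /($re)/g) {
        # $-[0] is the offset where the match starts; pos() is the offset just past its end
        push @found, [ $1, $-[0], pos($string) - 1 ];
    }
    return @found;
}

# Same string and pattern as on the slide
for my $m (match_positions("ATCGCATGGAA", qr/T.G/)) {
    print "$m->[0] spans positions $m->[1] to $m->[2]\n";
}
```

As on the slide, positions are counted from 0, so the first character of the string is at position 0.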

  29. Regular Expression (Where Did the Match Occur?)

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $string = "Several rapidly developing RNA interference (RNAi) methodologies hold the promise to selectively inhibit gene expression in mammals. RNAi is an innate cellular process activated when a double-stranded RNA (dsRNA) molecule of greater than 19 duplex nucleotides enters the cell, causing the degradation of not only the invading dsRNA molecule, but also single-stranded (ssRNAs) RNAs of identical sequences, including endogenous mRNAs.";

          # find runs of non-whitespace with at least one character on each side of "RNA"
          while ( $string =~ /(\S+RNA\S+)/g ) {
              print "$1 ends at position ", pos($string)-1, "\n";
          }
          exit;

      Output:

          (RNAi) ends at position 49
          (dsRNA) ends at position 211
          (ssRNAs) ends at position 374
          mRNAs. ends at position 431

  30. Additional Reading: Some Useful URLs
      • http://docs.python.org/library/re.html
      • http://www.regular-expressions.info/
      • http://www.regular-expressions.info/tutorial.html
      • http://www.bjnet.edu.cn/tech/book/perl/
        • Nice tutorial; regexp discussed on Day 7
      • http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

  31. On the Horizon: Homework
      • Remember we meet Tuesday of week 10/15/13 at the usual place and time due to the Columbus Day holiday.
      • Program 2 due Tuesday 10/7/13 at 7:00 pm.
      • Quiz 3 will occur next week.
      • Remember that on Tuesday, October 15, 2013 we will have our in-class midterm exam. It will be open book and notes.

  32. Our Regular Expression Lab
      • Counts as a quiz grade
      • 100 possible points