The Power of Perl Regular Expressions
What is a regular expression (regex)? • It is a description for a group of characters you want to search for in a string, a file, a website, etc. • Think of the group of characters as a pattern that you want to find within a string • Use regular expressions to search text quickly and accurately
Pattern Matching Syntax • $variable_name =~ /pattern/; • $variable_name – the variable containing the string you want to search • =~ – the binding operator, used to test a string against a regular expression • Letters before and after the slashes are operators (in front) and modifiers (behind) that affect the regular expression search
Matching operator • You have already been introduced to the substitution and translation operators • m// or just // is used to find patterns in a string • Test if a string contains the sequence ATG: • $dnastr = 'TTCGATGCCAC'; • if ($dnastr =~ /ATG/) { • print "ATG found.\n"; • } • else { • print "ATG not found.\n"; • } • exit;
Case modifier • /atg/ would not find a match in the previous example • However, /atg/i would • i is the case-insensitive modifier • We will introduce additional modifiers when necessary
Global modifier • If there is more than one ATG in the sequence, the previous examples only find the first one they run into • /ATG/g • g is the modifier for a global search: it searches a string for ALL instances of the pattern, not just the first one
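The effect of the i and g modifiers can be seen side by side in a short sketch. The sequence below is a made-up test string (not from the slides), and the match is done in list context so that /g returns every match at once.

```perl
use strict;
use warnings;

# Hypothetical test sequence containing ATG in upper and lower case.
my $dnastr = 'TTCGATGCCACatgTTATGA';

# Case-sensitive global match: collects only the upper-case ATGs.
my @exact = $dnastr =~ /ATG/g;

# Adding /i ignores case, so the lower-case atg is collected as well.
my @any_case = $dnastr =~ /atg/gi;

print "Upper-case ATG count: ", scalar(@exact), "\n";
print "Any-case ATG count:   ", scalar(@any_case), "\n";
```

Without /g, each match would stop at the first ATG it finds, which is enough for a yes/no test but not for counting.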
Other operators for regex • s/// – the substitution operator is used to change strings. Put the old string (or pattern) between the first and second /, and the new string between the second and third • tr/// – the translation operator is used to change individual characters. Put the old characters between the first and second /, and the new characters between the second and third
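As a small illustration of the two operators, the sketch below transcribes DNA to RNA with s/// and builds a reverse complement with tr///. The sequence is an arbitrary example, and the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

# Arbitrary example sequence.
my $dna = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# s/// with /g replaces every T with U (DNA -> RNA), working on a copy.
(my $rna = $dna) =~ s/T/U/g;

# tr/// maps characters one-for-one; applied to the reversed string
# it produces the reverse complement.
my $revcom = reverse $dna;
$revcom =~ tr/ACGT/TGCA/;

# With an empty replacement list, tr/// changes nothing and simply
# returns how many characters it matched.
my $g_count = ($dna =~ tr/G//);

print "RNA:                $rna\n";
print "Reverse complement: $revcom\n";
print "Number of G's:      $g_count\n";
```

Note that tr/// works on single characters only and does not understand regular expression syntax, whereas the first part of s/// is a full pattern.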
Metacharacters help search for complicated patterns • \d or [0-9] – match any digit • \w or [a-zA-Z_0-9] – match a word character (letter, digit, or underscore) • \D – match a non-digit character • \W – match a non-word character • \s or [ \t\n\r\f] – match a whitespace character • \S – match a non-whitespace character • \n – match a newline character • \r – match a carriage return • \t – match a tab • \f – match a formfeed • . – match any SINGLE character (except a newline, by default) • There are more!
Regex quantifiers • These syntax structures allow you to specify how many times a part of a pattern should match • * match 0 or more times • + match 1 or more times • ? match 1 or 0 times • {n} match exactly n times • {n,} match at least n times • {n,m} match at least n, but not more than m times
Examples of quantifier use • /A+CGC?A/ # match one or more A's followed by CG, followed by an optional C, followed by an A • /A{3}/ # match exactly 3 A's • /A{3,}/ # match 3 or more A's • /A{3,8}/ # match 3 to 8 A's • The transcription factor binding site for SSP protein is GGCGGCGGCTGGCTAGGG • /(GGC){3}TG{2}CTAG{3}/
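The quantifier patterns can be checked against real strings with a few lines of Perl. In this sketch the flanking bases around the SSP site and the other test strings are invented for illustration; only the binding site itself comes from the slide.

```perl
use strict;
use warnings;

# SSP binding site from the slide, embedded in hypothetical flanking bases.
my $site = 'GGCGGCGGCTGGCTAGGG';
my $seq  = 'TTAC' . $site . 'CATG';

# (GGC){3} repeats the whole group; G{2} and G{3} repeat a single base.
my $ssp_found = ($seq =~ /(GGC){3}TG{2}CTAG{3}/) ? 1 : 0;
print "SSP binding site found.\n" if $ssp_found;

# {3,} means "three or more".
my $poly_a = ('CCAAAAACC' =~ /A{3,}/) ? 1 : 0;
print "Run of 3 or more A's found.\n" if $poly_a;

# C? makes the second C optional, so both strings below match.
my $with_c    = ('AACGCA' =~ /A+CGC?A/) ? 1 : 0;
my $without_c = ('AACGA'  =~ /A+CGC?A/) ? 1 : 0;
print "Optional C: with=$with_c without=$without_c\n";
```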
Alternation • Vertical bar (|) allows you to match one of several alternatives • /song|blue/ # match either ‘song’ or ‘blue’ • /a|b|c/ # match a, b, or c, same as [abc] • The GATA-1 TF binding site is defined by a T or an A, followed by GATA followed by an A or G. In regex that would be: /(T|A)GATA(A|G)/
Anchoring patterns • ^ matches the beginning of a string, while $ matches the end of a string • /^this/ #matches ‘this one’ but not ‘watch this’ • /this$/ #matches ‘watch this’ but not ‘this one’
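Alternation and anchors combine naturally. The sketch below, using invented candidate strings, keeps only those strings that consist of a GATA-1 site and nothing else, and then shows the unanchored pattern matching inside a longer string.

```perl
use strict;
use warnings;

# Hypothetical candidates; only the first two fit the GATA-1 definition.
my @candidates = ('TGATAA', 'AGATAG', 'CGATAA', 'TGATAC');

# ^ and $ force the pattern to cover the whole string, not just part of it.
my @hits = grep { /^(T|A)GATA(A|G)$/ } @candidates;
print "Full-length GATA-1 sites: @hits\n";

# Without anchors, the same pattern also matches inside a longer string.
my $embedded = ('CCTGATAACC' =~ /(T|A)GATA(A|G)/) ? 1 : 0;
print "Embedded site found.\n" if $embedded;
```

For single-character alternatives such as (T|A), a character class like [TA] is an equivalent and slightly more compact choice.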
Pattern memory • Now that you know how to match characters, you need a way to find out what was matched by saving the matching portions • Putting parentheses around any part of a pattern causes the part of the string it matched to be remembered and stored in a special variable called $1. If there are multiple sets of parentheses, their matches are stored in $2, $3, …, numbered by the position of the opening parenthesis
Finding and storing GATA-1 binding site • $seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA"; • $seq =~ /((T|A)GATA(A|G))/; # the outer parentheses store the whole site in $1; $2 and $3 hold the first and last base • print "$1\n"; • Output: TGATAA
Other special variables • $& the part of the string that actually matched • $` everything before the match • $' everything after the match • Modify the previous program to: • print "$`\n"; • print "$&\n"; • print "$'\n"; • Output: • AAAGAGAGGGATAGAATAGAGA • TGATAA • GAAA
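Putting pattern memory and the special variables together, a runnable sketch of the GATA-1 search might look like this. It assumes an extra outer pair of parentheses so that $1 holds the whole site, with $2 and $3 holding the first and last base; the copies into named variables are a choice made here for readability.

```perl
use strict;
use warnings;

my $seq = 'AAAGAGAGGGATAGAATAGAGATGATAAGAAA';

my ($site, $first, $last, $before, $match, $after);
if ($seq =~ /((T|A)GATA(A|G))/) {
    # Groups are numbered by the position of their opening parenthesis.
    ($site, $first, $last) = ($1, $2, $3);
    # Copy the special variables right away; the next match overwrites them.
    ($before, $match, $after) = ($`, $&, $');

    print "Whole site: $site\n";
    print "First base: $first, last base: $last\n";
    print "Before:     $before\n";
    print "Matched:    $match\n";
    print "After:      $after\n";
}
```

Testing the match inside an if before reading $1 avoids printing stale values left over from an earlier successful match.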
Websites on RegEx • http://www.perldoc.com/perl 5.6.1/pod/perlre.html • http://www.troubleshooters.com/codecorn/littperl/perlreg.htm • http://www.devshed.com/Server_Side/Administration/RegExp/page2.html • http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex.html
Exercises • Try some regular expressions with your motif.pl program pg. 67-69 • Read pages 70-75, work through example 5-4 (pick your own nucleotide file from NCBI) • Next, do Example 5-7 to learn how to write to files
What are subroutines? • A unique function that you generate to perform some action • Save a lot of typing • Make for “neater” programs • Syntax: • sub subroutinenameofyourchoice • { block of code • } • Place subroutines either at the beginning or end of a script
Example 6-1. A subroutine to append ACGT to DNA • $dna = 'CGACGTCTTC..'; • $longer_dna = addACGT($dna); • print "I added ACGT to $dna and got $longer_dna\n\n"; • exit; • (subroutine on next slide) • Output: I added ACGT to CGACGTCTTC.. and got CGACGTCTTC..ACGT
A subroutine example • sub addACGT { • my ($dna) = @_; • $dna .= 'ACGT'; • return $dna; • }
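Combined into one file, the call and the subroutine definition might look like the following sketch. The short sequence is a stand-in chosen here, not the one from the book, and the subroutine is placed at the end of the script, one of the two placements mentioned earlier.

```perl
use strict;
use warnings;

# Stand-in sequence for illustration.
my $dna = 'CGACGTCTTC';

# Call the subroutine and collect its return value.
my $longer_dna = addACGT($dna);
print "I added ACGT to $dna and got $longer_dna\n\n";

# The definition can follow the code that calls it.
sub addACGT {
    my ($dna) = @_;     # private copy of the first argument
    $dna .= 'ACGT';     # append to the copy only
    return $dna;
}
```

The subroutine name in the call and in the definition must match exactly, including case.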
Two types of variables in subroutines • Variables passed into subroutines are called "arguments" • When a list of arguments is passed to a subroutine, it is stored in the "magical" array called @_ • Other variables are declared with my and are restricted to the scope of the subroutine • This protects them from interacting with other variables in the program
Returning results via return • Most subroutines return their results via the return function • It can return a single scalar, several scalars, an array, etc.
Calling subroutines • To call a subroutine means to type its name, give it arguments, and collect the results • The call looks like: $longer_dna = addACGT($dna); • You could put as many variables as you want, and they would all be stored in the @_ array
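To illustrate both points, several arguments arriving in @_ and more than one value coming back, here is a minimal sketch. The subroutine join_fragments is a hypothetical helper invented for this example.

```perl
use strict;
use warnings;

# join_fragments is a hypothetical helper: it accepts any number of
# sequence fragments and returns two values as a list.
sub join_fragments {
    my @fragments = @_;                       # every argument arrives in @_
    my $joined    = join '', @fragments;
    return ($joined, scalar @fragments);      # joined string and fragment count
}

# Collect the returned list in two scalars.
my ($dna, $n) = join_fragments('ACG', 'TTA', 'GGC');
print "Joined $n fragments into $dna\n";
```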
Scoping – restricting variables using 'my' • Declare a variable with my: • my ($x); • my $x; • Or combine declaration with initialization: • my ($x) = '49'; • my ($x) = @_; • Once declared, the variable exists only until the end of the block it was declared in.
Working through 6-2 • Inherent flaw – the same variable name is used both within and outside the subroutine • To prevent the use of undeclared variables, add: use strict; • This requires every variable in the program to be declared (for example, with my)
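The protection that my provides can be demonstrated with a small sketch. This is not the book's Example 6-2; the subroutine lowercase_copy is a hypothetical name, and it deliberately reuses the name $dna inside the subroutine.

```perl
use strict;
use warnings;

my $dna = 'ACGT';             # file-level variable

sub lowercase_copy {
    my ($dna) = @_;           # a new, private $dna that hides the outer one
    $dna = lc $dna;           # changes only the private copy
    return $dna;
}

my $lower = lowercase_copy($dna);
print "Outer: $dna, returned: $lower\n";
```

If the my were left out inside the subroutine, the assignment would overwrite the file-level $dna, which is the kind of interaction scoping is meant to prevent.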
Example 6-3 shows how to use command line arguments • Many programs are run from the command line (e.g., in Unix) • @ARGV is an array that contains all command line arguments • This example counts the number of G's in a sequence entered on the command line
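A minimal sketch in the spirit of that example (not the book's listing) reads the sequence from @ARGV and counts G's with tr///. The subroutine name count_g and the fallback sequence used when no argument is given are assumptions made here.

```perl
use strict;
use warnings;

# count_g is a hypothetical helper: tr/// with an empty replacement
# list returns the number of matching characters (upper or lower case).
sub count_g {
    my ($dna) = @_;
    return ($dna =~ tr/Gg//);
}

# Use the first command line argument, or a fallback sequence if none is given.
my $dna   = @ARGV ? $ARGV[0] : 'AGGCTGGA';
my $num_g = count_g($dna);

print "The DNA $dna has $num_g G's in it.\n";
```

Saved under a name of your choice, for example countg.pl, it could be run as: perl countg.pl AAGGGGTTTCCC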
Keeping a library of subroutines • The book suggests writing a module called BegPerlBioinfo.pm • This can hold your subroutines, just as they appear in your programs • The last line in a module must be 1; • To use any of the subroutines in this module, put the following statement in your code: use BegPerlBioinfo; (.pm is not added)
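A module file along these lines might look like the sketch below. It assumes the simplest layout, with no package declaration, so the subroutines land in the calling script's namespace; the usage lines at the bottom are shown as comments because they belong in a separate script.

```perl
# File: BegPerlBioinfo.pm  (a sketch, not the book's full module)
use strict;
use warnings;

# Subroutines are pasted in just as they appear in your programs.
sub addACGT {
    my ($dna) = @_;
    $dna .= 'ACGT';
    return $dna;
}

# In a script stored in the same directory you would then write:
#   use lib '.';            # newer Perls no longer search the current directory
#   use BegPerlBioinfo;
#   print addACGT('CGAC'), "\n";

# A module must end with a true value.
1;
```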
Fixing bugs in your code • Prevention – use strict; and use warnings; • These pick up subtle errors when your program is not running the way you want it to • Quick fixes – comments and print statements • Insert print commands at various stages to follow the progress of your program, or comment out statements (quick in Komodo) • Perl debugger
Examples • Work through example 6-2, 6-3 and use the debugger on Example 6-4 pages 106-116 • Homework
Hashes or Associative Arrays • Hash names begin with % • Hashes are similar to arrays. The difference is that an array uses integers as index values, while a hash uses arbitrary scalars called keys • Keys are used to retrieve values from a hash
For example • An array and a hash holding the same data:
Array                      Hash
Element   Value            Key          Value
[0]       Rob              first name   Rob
[1]       professor        profession   professor
[2]       UW-Parkside      location     UW-Park
• For the array, you have to access the contents using their index numbers; for the hash, you access values via keys
Hash syntax • %hash1 = (key, value, key, value); • Or • %hash1 = ( key => value, key => value ); • Or you can assign keys and values line by line, as in FASTA formatted files ftp://ftp.tigr.org/pub/data/
Operators for Hashes • keys() returns a list of all the current keys • values() returns a list of all the current values • each() returns the next key-value pair as a two-element list • delete() removes both a key and its value from the hash
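The four operators can be tried out on a small hash. The sketch below reuses the name, profession, and location entries from these slides; the sorting is added here only to make the printed order predictable, since a hash has no guaranteed order.

```perl
use strict;
use warnings;

my %hash1 = (
    'first name' => 'Rob',
    'profession' => 'professor',
    'location'   => 'UW-Parkside',
);

# keys and values return lists; sort them for a stable order.
my @k = sort keys %hash1;
my @v = sort values %hash1;
print "Keys:   @k\n";
print "Values: @v\n";

# each hands back one key-value pair per call until the hash is exhausted.
while (my ($key, $value) = each %hash1) {
    print "$key => $value\n";
}

# delete removes the pair and returns the value that was stored.
my $removed = delete $hash1{'location'};
print "Removed $removed; ", scalar(keys %hash1), " entries left\n";
```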
A hash in action: • %hash1 = ( • "first name" => "Rob", • "profession" => "professor", • "location" => 'UW-Parkside' • ); • print "Contents of hash: \n"; • foreach $k (keys %hash1) { • print "$k => $hash1{$k}\n"; • } • exit; • Output (the order of the keys is not guaranteed): • Contents of hash: • first name => Rob • profession => professor • location => UW-Parkside
Another look: • %hash1 = ( • "first name" => "Rob", • "profession" => "professor", • "location" => 'UW-Parkside' • ); • print "Contents of hash: \n"; • foreach $v (values %hash1) { • print "\$v now contains $v\n"; • } • exit; • Output (the order of the values is not guaranteed): • Contents of hash: • $v now contains Rob • $v now contains UW-Parkside • $v now contains professor
There are many actions you can perform on a hash • Sort the hash by its keys or values • sort (keys %hash1); • Assign keys or values of a hash to an array • @ary1 = keys (%hash1); • Assign an individual hash entry to a variable • $var1 = $hash1{$k}; • Delete an entry from a hash • delete $hash1{$k};
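These actions come together in a common bioinformatics task: counting bases. The sketch below, using an invented sequence, builds a hash of counts, copies its keys into an array, and sorts the keys by their values.

```perl
use strict;
use warnings;

# Invented example sequence.
my $dna = 'AAGCTTGAACG';

# Build a hash with one entry per base; the value is its count.
my %count;
$count{$_}++ for split //, $dna;

# Assign the keys to an array, sorted alphabetically.
my @bases = sort keys %count;

# Sort the keys by their values (highest count first, ties alphabetical).
my @by_count = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

for my $base (@by_count) {
    my $n = $count{$base};      # individual hash entry copied to a variable
    print "$base: $n\n";
}
```

Sorting by value needs a custom comparison block because sort on its own only sees the keys; the block looks up each key's value in the hash.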