The Power of Perl Regular Expressions

What is a regular expression (regex)?
• It is a description of a group of characters you want to search for in a string, a file, a website, etc.
• Think of the group of characters as a pattern that you want to find within a string
• Use regular expressions to search text quickly and accurately

Pattern Matching Syntax
• $variable_name =~ /pattern/;
• $variable_name – this is the variable containing the string you want to search
• =~ – the binding operator is used for testing regular expressions
• Letters before the first / are operators and letters after the last / are modifiers; both affect the regular expression search

Matching operator
• You have already been introduced to the substitution and translation operators
• m// or just // is used to find patterns in a string
• Test if a string contains the sequence ATG:
    $dnastr = 'TTCGATGCCAC';
    if ($dnastr =~ /ATG/) {
        print "ATG found.\n";
    }
    else {
        print "ATG not found.\n";
    }
    exit;

Case modifier
• /atg/ would not find a match in the previous example
• However, /atg/i would
• i is a case-independent modifier
• We will introduce additional modifiers when necessary

Global modifier
• If there were more than one ATG in the sequence, the previous examples would only acknowledge the first one they run into
• /ATG/g
• g is a modifier for a global search: it searches a string for ALL instances of the pattern, not just the first one
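
The effect of the g modifier can be sketched as follows. The sequence and variable names are made up for illustration; the point is that /g behaves differently in list context (all matches at once) and in scalar context (one match per evaluation).

```perl
use strict;
use warnings;

# Hypothetical sequence containing three copies of ATG
my $dna = 'ATGCCATGTTATGC';

# In list context, /g returns every match at once
my @hits  = $dna =~ /ATG/g;
my $count = scalar @hits;
print "Found $count ATG(s)\n";

# In scalar context, /g steps through the matches one at a time;
# pos() gives the offset just past the current match
my @starts;
while ($dna =~ /ATG/g) {
    push @starts, pos($dna) - 3;
}
print "Start positions (0-based): @starts\n";

# Modifiers can be combined: g with the case-independent i
my @any_case = 'atgATG' =~ /atg/gi;
print "Case-independent matches: ", scalar(@any_case), "\n";
```

A while loop is the usual choice when you need the position of each match, while the list form is enough for simple counting.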

Other operators for regex
• s/// – the substitution operator is used to change strings. Put the old string between the first and second /, and the new string between the second and third
• tr/// – the translation operator is used to change individual characters. Put the old characters between the first and second /, and the new characters between the second and third
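
A minimal sketch of both operators on a made-up DNA string (the variable names are illustrative only). It also shows a side effect that is handy for counting: tr/// returns the number of characters it matched.

```perl
use strict;
use warnings;

my $dna = 'ATGCGTTA';   # made-up sequence

# s/// with the g modifier: replace every T with U (DNA -> RNA)
(my $rna = $dna) =~ s/T/U/g;

# tr/// maps characters one-for-one: reverse the string,
# then complement each base to get the reverse complement
(my $revcom = scalar reverse $dna) =~ tr/ACGT/TGCA/;

# With an empty replacement list, tr/// changes nothing
# and simply returns how many characters matched
my $g_count = ($dna =~ tr/G//);

print "RNA:                $rna\n";
print "Reverse complement: $revcom\n";
print "Number of G's:      $g_count\n";
```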

Metacharacters help search for complicated patterns
• \d or [0-9] – match any digit
• \w or [a-zA-Z_0-9] – match a word character
• \D – match a non-digit character
• \W – match a non-word character
• \s or [ \t\n\r\f] – match a whitespace character
• \S – match a non-whitespace character
• \n – match a newline character
• \r – match a carriage return
• \t – match a tab
• \f – match a formfeed
• . – match any SINGLE character (except a newline, by default)
• There are more!
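
As a rough illustration, the metacharacters above can pull fields out of a header line. The header text below is invented for the example and does not come from a real database record.

```perl
use strict;
use warnings;

# Hypothetical FASTA-style header line (note the tab before the organism)
my $header = ">seq42 length=1280\tHomo sapiens";

# \w+ grabs the run of word characters after '>'
my ($id) = $header =~ /^>(\w+)/;

# \d+ grabs the run of digits after 'length='
my ($length) = $header =~ /length=(\d+)/;

# \s+ matches any run of whitespace, spaces and tabs alike
my @fields = split /\s+/, $header;

# . stands for any single character, so s.q matches 'seq'
my $dot_match = $header =~ /s.q/ ? 1 : 0;

print "id=$id length=$length fields=", scalar(@fields), "\n";
```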

Regex quantifiers
• These syntax structures allow you to specify how many times a part of a regular expression pattern should match
• * match 0 or more times
• + match 1 or more times
• ? match 1 or 0 times
• {n} match exactly n times
• {n,} match at least n times
• {n,m} match at least n, but not more than m times

Examples of quantifier use
• /A+CGC?A/ # match one or more A's followed by CG, followed by an optional C, followed by an A
• /A{3}/ # match exactly 3 A's
• /A{3,}/ # match 3 or more A's
• /A{3,8}/ # match 3 to 8 A's
• The transcription factor binding site for SSP protein is GGCGGCGGCTGGCTAGGG
• /(GGC){3}TG{2}CTAG{3}/
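
The quantifier examples can be checked with a short script. This is only a sketch: the test strings other than the SSP binding site are made up, and the ^ and $ anchors are added here so that each pattern has to match the whole string.

```perl
use strict;
use warnings;

# The SSP binding site quoted on the slide
my $site = 'GGCGGCGGCTGGCTAGGG';

# (GGC){3} = GGCGGCGGC, G{2} = GG, G{3} = GGG
my $site_ok = $site =~ /^(GGC){3}TG{2}CTAG{3}$/ ? 1 : 0;

# A+CGC?A : one or more A's, then CG, an optional C, then A
my @candidates = ('ACGA', 'AAACGCA', 'CGCA');
my @matched    = grep { /^A+CGC?A$/ } @candidates;

# Quantifiers are greedy: {3,8} takes as many A's as it can, up to 8
my ($run) = 'TTAAAAAAAAAATT' =~ /(A{3,8})/;

print "Site matches: $site_ok\n";
print "Matched candidates: @matched\n";
print "Length of A run captured: ", length($run), "\n";
```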

Alternation
• Vertical bar (|) allows you to match one of several alternatives
• /song|blue/ # match either 'song' or 'blue'
• /a|b|c/ # match a, b, or c, same as [abc]
• The GATA-1 TF binding site is defined by a T or an A, followed by GATA, followed by an A or G. In regex that would be: /(T|A)GATA(A|G)/
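
A small sketch comparing the alternation form of the GATA-1 pattern with the equivalent character-class form. The four candidate strings are invented for the test.

```perl
use strict;
use warnings;

# Made-up candidates: the first two contain a valid site,
# the third has G before GATA, the fourth has C after it
my @candidates = ('CCTGATAACC', 'CCAGATAGCC', 'CCGGATAACC', 'CCTGATACCC');

# Alternation, as written on the slide
my @by_alternation = grep { /(T|A)GATA(A|G)/ } @candidates;

# Single-character alternatives can also be written as character classes
my @by_class = grep { /[TA]GATA[AG]/ } @candidates;

print "Alternation: @by_alternation\n";
print "Char class:  @by_class\n";
```

For single characters the class form is usually shorter; alternation is needed when the alternatives are longer than one character, as in /song|blue/.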

Anchoring patterns
• ^ matches the beginning of a string, while $ matches the end of a string
• /^this/ # matches 'this one' but not 'watch this'
• /this$/ # matches 'watch this' but not 'this one'
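
The two anchors can be tried out on the strings from the slide. The third string and the combined pattern /^this$/ are additions for illustration.

```perl
use strict;
use warnings;

my @lines = ('this one', 'watch this', 'this');

# ^ anchors the pattern to the start of the string
my @at_start = grep { /^this/ } @lines;

# $ anchors the pattern to the end of the string
my @at_end = grep { /this$/ } @lines;

# Using both anchors means the whole string must be 'this'
my @whole = grep { /^this$/ } @lines;

print "Starts with 'this': ", join(', ', @at_start), "\n";
print "Ends with 'this':   ", join(', ', @at_end), "\n";
print "Is exactly 'this':  ", join(', ', @whole), "\n";
```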

Pattern memory
• Now that you know how to match characters, you need a way to find out what was matched by storing the matching portions
• Putting parentheses around any pattern allows the part of the string matched by that pattern to be remembered and stored in a special variable called $1
• If there are multiple parenthesized patterns, the later ones are stored in $2, $3, … in the order of their opening parentheses

Finding and storing GATA-1 binding site
    $seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
    $seq =~ /((T|A)GATA(A|G))/;
    print "$1\n";
• Output: TGATAA
• The outer parentheses store the whole site in $1; the inner groups are stored in $2 (T) and $3 (A)

Other special variables
• $& the part of the string that actually matched
• $` everything before the match
• $' everything after the match
• Modify the previous program to:
    print "$`\n";
    print "$&\n";
    print "$'\n";
• Output:
    AAAGAGAGGGATAGAATAGAGA
    TGATAA
    GAAA

Websites on RegEx
• http://www.perldoc.com/perl 5.6.1/pod/perlre.html
• http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
• http://www.devshed.com/Server_Side/Administration/RegExp/page2.html
• http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex.html

Exercises
• Try some regular expressions with your motif.pl program, pg. 67-69
• Read pages 70-75, work through Example 5-4 (pick your own nucleotide file from NCBI)
• Next, do Example 5-7 to learn how to write to files

What are subroutines?
• A unique function that you generate to perform some action
• Save a lot of typing
• Make for "neater" programs
• Syntax:
    sub subroutinenameofyourchoice
    {
        block of code
    }
• Place subroutines either at the beginning or end of a script

Example 6-1. A subroutine to append ACGT to DNA
    $dna = 'CGACGTCTTC..';
    $longer_dna = addACGT($dna);
    print "I added ACGT to $dna and got $longer_dna\n\n";
    exit;
• (subroutine on next slide)
• Output: I added ACGT to CGACGTCTTC.. and got CGACGTCTTC..ACGT

A subroutine example
    sub addACGT {
        my ($dna) = @_;
        $dna .= 'ACGT';
        return $dna;
    }

Two types of variables in subroutines
• Variables passed into subroutines are called "arguments"
• When a list of arguments is passed to a subroutine, it is stored in the "magical" array called @_
• Other variables are declared with my and are restricted to the scope of the subroutine
• This protects them from interacting with other variables in the program

Returning results via return
• Most subroutines return their results via the return function
• A subroutine can return a single scalar, several scalars, an array, etc.
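
A minimal sketch of the two most common cases, returning one scalar and returning a list. Both subroutines (gc_fraction and base_counts) are hypothetical helpers written for this illustration; they are not taken from the book.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a single scalar (the G+C fraction)
sub gc_fraction {
    my ($dna) = @_;
    my $gc = ($dna =~ tr/GC//);      # tr returns the number of G's and C's
    return $gc / length($dna);
}

# Hypothetical helper: returns a list of four counts (A, C, G, T)
sub base_counts {
    my ($dna) = @_;
    my $n_a = ($dna =~ tr/A//);
    my $n_c = ($dna =~ tr/C//);
    my $n_g = ($dna =~ tr/G//);
    my $n_t = ($dna =~ tr/T//);
    return ($n_a, $n_c, $n_g, $n_t);
}

# A scalar result is collected in a scalar variable
my $fraction = gc_fraction('ATGC');

# A list result is collected in a list of variables (or an array)
my ($as, $cs, $gs, $ts) = base_counts('AACGTT');

print "GC fraction: $fraction\n";
print "A=$as C=$cs G=$gs T=$ts\n";
```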

Calling subroutines
• To call a subroutine means to type its name, give it arguments, and collect the results
• The call looks like: $longer_dna = addACGT($dna);
• You could pass as many arguments as you want, and they would all be stored in the @_ array

Scoping – restricting variables using 'my'
• Declare a variable as a my variable with:
    my ($x);
    my $x;
• Or combine declaration with initialization:
    my ($x) = '49';
    my ($x) = @_;
• Once declared, the variable exists only until the end of the block it was declared in
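
Block scoping can be observed directly in a short sketch. The variable names and the shout subroutine are made up; the point is that a my variable declared inside a block or subroutine is separate from one with the same name outside it.

```perl
use strict;
use warnings;

my $x = 'outer';
my $seen_inside;

{
    # A new $x that exists only inside these braces
    my $x = 'inner';
    $seen_inside = $x;
}

# Hypothetical subroutine with its own private $x
sub shout {
    my ($text) = @_;
    my $x = uc $text;
    return $x;
}

my $loud = shout('acgt');

# The outer $x was never touched by the block or the subroutine
print "Inside the block: $seen_inside\n";
print "From the sub:     $loud\n";
print "Outer \$x is still: $x\n";
```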

Working through 6-2
• Inherent flaw – the same variable name is used both within and outside the subroutine
• To prevent the use of undeclared variables, put this at the top of your program: use strict;
• This insists that programs have all their variables declared as my variables

Example 6-3 shows how to use command line arguments
• Many programs run from command lines (e.g. in Unix)
• @ARGV is an array that contains all command line arguments
• This example will count the number of G's in a sequence entered on the command line
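
The idea can be sketched as below. This is not the book's Example 6-3 verbatim: the subroutine name count_G, the fallback sequence, and the use of tr/// for counting are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count G's (upper or lower case) in a sequence
sub count_G {
    my ($dna) = @_;
    my $count = ($dna =~ tr/Gg//);
    return $count;
}

# Use the first command-line argument if one was given;
# the fallback sequence is made up so the script also runs without arguments
my $dna   = @ARGV ? $ARGV[0] : 'AGGCTGG';
my $num_g = count_G($dna);

print "The DNA $dna has $num_g G's in it\n";
```

Saved under a hypothetical name such as countg.pl, it would be run as: perl countg.pl AAGGGGTTTCCC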

Keeping a library of subroutines
• The book suggests writing a module called BegPerlBioinfo.pm
• This can hold your subroutines, just as they appear in your programs
• The last line in a module must be 1;
• To use any of the subroutines in this module, you put the following statement in your code: use BegPerlBioinfo; (.pm is not added)
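
A minimal sketch of what such a file might contain, assuming it holds only the addACGT subroutine from Example 6-1. The real BegPerlBioinfo.pm from the book contains many more subroutines; this is just the smallest file that follows the rules on the slide.

```perl
# File: BegPerlBioinfo.pm (sketch with a single subroutine)
use strict;
use warnings;

# Subroutines are pasted in just as they appear in your programs
sub addACGT {
    my ($dna) = @_;
    $dna .= 'ACGT';
    return $dna;
}

# The last line of a module must be a true value
1;
```

A script in the same directory would then start with use BegPerlBioinfo; and call addACGT($dna) as before. On recent versions of Perl the current directory is no longer searched for modules by default, so a line such as use lib '.'; may be needed before the use statement.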

Fixing bugs in your code
• Prevention – use strict; and use warnings;
• These pick up subtle errors when your program is not running the way you want it to
• Quick fixes – comments and print statements
• You can insert print commands at various stages to watch the progress of your program, or comment out statements (quick in Komodo)
• Perl Debugger

Examples
• Work through Examples 6-2 and 6-3, and use the debugger on Example 6-4, pages 106-116
• Homework

Hashes or Associative Arrays
• Hash names begin with %
• Hashes are similar to arrays. The difference is that an array uses integers as index values, but a hash uses arbitrary scalars called keys
• Keys are used to retrieve values from a hash

For example
    An array                 A hash
    Element  Value           Key         Value
    [0]      Rob             first name  Rob
    [1]      professor       profession  professor
    [2]      UW-Parkside     location    UW-Park
• For the array, you have to access the contents using their index number; for the hash, you access values via keys

Hash syntax
• %hash1 = (key, element, key, element);
• Or
• %hash1 = ( key => element, key => element, );
• Or you can assign keys and values line by line, for example from FASTA-formatted files: ftp://ftp.tigr.org/pub/data/

Operators for Hashes
• keys() returns a list of all the current keys
• values() returns a list of all current values
• each() returns the next key-value pair as a two-element list
• delete() removes both key and value from the hash
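
The four operators can be seen together in a short sketch. The hash below is a made-up fragment of a codon table; the results of keys and values are sorted only so that the printed order is predictable.

```perl
use strict;
use warnings;

# Made-up fragment of a codon table: codon => one-letter amino acid
my %codon = (ATG => 'M', TGG => 'W', TAA => '*');

# keys and values return lists in no particular order, so sort them
my @codons = sort keys %codon;
my @aminos = sort values %codon;

# each hands back one (key, value) pair per call until the hash is exhausted
my %copy;
while (my ($key, $value) = each %codon) {
    $copy{$key} = $value;
}

# delete removes the key and its value, and returns the removed value
my $removed   = delete $codon{TAA};
my $remaining = scalar keys %codon;

print "Codons: @codons\n";
print "Amino acids: @aminos\n";
print "Removed '$removed', $remaining entries remain\n";
```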

A hash in action:
    %hash1 = (
        "first name" => "Rob",
        "profession" => 'professor',
        "location"   => 'UW-Parkside'
    );
    print "Contents of hash: \n";
    foreach $k (keys %hash1) {
        print "$k => $hash1{$k}\n";
    }
    exit;
• Output (the order of the keys is not guaranteed):
    Contents of hash:
    first name => Rob
    profession => professor
    location => UW-Parkside

Another look:
    %hash1 = (
        "first name" => "Rob",
        "profession" => 'professor',
        "location"   => 'UW-Parkside'
    );
    print "Contents of hash: \n";
    foreach $v (values %hash1) {
        print "\$v now contains $v\n";
    }
    exit;
• Output (the order of the values is not guaranteed):
    Contents of hash:
    $v now contains Rob
    $v now contains UW-Parkside
    $v now contains professor

There are many actions you can perform on a hash
• Sort the hash by its keys or values
    sort (keys %hash1);
• Assign keys or values of a hash to an array
    @ary1 = keys (%hash1);
• Assign an individual hash entry to a variable
    $var1 = $hash1{$k};
• Delete an entry from a hash
    delete $hash1{$k};
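
These actions can be sketched on a small hash. The gene names and lengths are invented, and the custom sort block for ordering by value is an addition that the slide does not spell out.

```perl
use strict;
use warnings;

# Made-up hash: gene name => sequence length
my %len = (geneA => 1200, geneB => 450, geneC => 900);

# Sort by key (alphabetical)
my @by_key = sort keys %len;

# Sort the keys by their values (numeric, smallest first)
my @by_value = sort { $len{$a} <=> $len{$b} } keys %len;

# Assign all keys to an array
my @ary1 = keys %len;

# Assign one entry to a variable
my $var1 = $len{geneB};

# Delete an entry
delete $len{geneA};
my $still_there = exists $len{geneA} ? 1 : 0;

print "By key:   @by_key\n";
print "By value: @by_value\n";
print "geneB length: $var1\n";
```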
