Regular Expressions & Pattern Matching

James Wasmuth, University of Edinburgh, james.wasmuth@ed.ac.uk

Definitions
• Pattern match – searching for a specified pattern within a string.
• For example:
• finding a sequence motif,
• finding the accession number of a sequence,
• parsing HTML,
• validating user input.
• Regular expression (regex) – the notation used to describe the pattern to match.

Regular Expressions
• Effectively a separate small programming language.
• Available in most popular languages, usually as a separate library.
• Perl – built directly into the core language syntax rather than supplied as a library.

How Regex Work
• [Diagram labels: regex code, Perl compiler, regex engine, input data (e.g. sequence file), output]
Overview:
• how to create regular expressions
• how to use them to match and extract data
• biological context

Simple Patterns
• Place the regex between a pair of forward slashes ( / / ).
• Try:
    #!/usr/bin/perl
    # Read lines typed on the terminal and report those containing 'abc'.
    while (<STDIN>) {
        if (/abc/) {
            print ">> found 'abc' in $_\n";
        }
    }
• Save, then run the program. Type something on the terminal, then press return. Press Ctrl+C to exit the script.
• If you type anything containing 'abc', the message is printed.

Binding Operator
• The previous example matched against $_.
• Want to match against a scalar variable?
• The binding operator "=~" matches the pattern on the right against the string on the left.
• Usually add the m operator for clarity of code.
• $string =~ m/pattern/
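The binding operator can be tried without any input file. The following is a minimal sketch; the helper name mentions_elegans and the two description strings are made up for illustration and are not part of the course material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains 'elegans', else 0.
sub mentions_elegans {
    my ($string) = @_;
    return $string =~ m/elegans/ ? 1 : 0;
}

# Made-up example strings, not taken from any real file.
for my $description ('Caenorhabditis elegans genome', 'Brugia malayi genome') {
    my $verdict = mentions_elegans($description) ? 'match' : 'no match';
    print "$description: $verdict\n";
}
```

Here the pattern is bound to $string rather than to $_, so the same test can be applied to any scalar.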

Simple Patterns (2)
• Can also read files and pattern match using I/O.
• Try:
    #!/usr/bin/perl
    # Print every line of the file that mentions 'elegans'.
    open IN, "<genomes_desc.txt" or die "cannot open genomes_desc.txt: $!";
    while ($line = <IN>) {
        if ($line =~ m/elegans/) {    # true if it finds 'elegans'
            print $line;
        }
    }
    close IN;

Flexible matching
• Within a regex there are many characters with special meanings – metacharacters.
• star (*) matches zero or more instances of the preceding character
• /ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
• plus (+) matches at least one instance
• /ab+c/ => 'a' followed by 1 or more 'b' followed by 'c'
• question mark (?) matches zero or one instance
• /ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'

More Flexibility
• Match a character a specific number or range of instances.
• {x} will match exactly x instances.
• /ab{3}c/ => abbbc
• {x,y} will match between x and y instances.
• /a{2,4}bc/ => aabc or aaabc or aaaabc
• {x,} will match x or more instances.
• /abc{3,}/ => abccc or abccccccc or abcccccccc
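The quantifiers above can be compared side by side. This is a minimal sketch: the helper matching_strings and the test strings are made up for illustration, and qr// (which compiles a pattern so it can be passed around as a value) is a convenience not covered in these slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the strings that a compiled pattern matches.
sub matching_strings {
    my ($regex, @strings) = @_;
    return grep { $_ =~ $regex } @strings;
}

# Made-up test strings.
my @strings = ('ac', 'abc', 'abbc', 'abbbc');

# Each pair is a label for printing and the compiled pattern itself.
my @patterns = (
    [ 'ab*c'    => qr/ab*c/ ],
    [ 'ab+c'    => qr/ab+c/ ],
    [ 'ab?c'    => qr/ab?c/ ],
    [ 'ab{3}c'  => qr/ab{3}c/ ],
    [ 'ab{2,}c' => qr/ab{2,}c/ ],
);

for my $entry (@patterns) {
    my ($label, $regex) = @$entry;
    my @hits = matching_strings($regex, @strings);
    print "/$label/ matches: @hits\n";
}
```

Changing the strings in @strings is a quick way to check how each quantifier behaves before using it on real data.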

More Flexibility
• dot (.) is a wildcard character – matches any character except new line (\n)
• /a.c/ => 'a' followed by any character followed by 'c'
• Combine metacharacters
• /a.{4}c/ => 'a' followed by 4 instances of any character followed by 'c'
• so will match addddc, afgthc, ab569c

Escaping Metacharacters
• To use *, +, ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.
• /C\. elegans/ => C. elegans only
• /C. elegans/ => the 'C.' part will match Ca, Cb, C3, C>, C., etc.
• The 'delimiter' of the regex, forward slash '/', and the 'escape' character, backslash '\', are also metacharacters. These need to be escaped if required in the regex.
• Important when trying to match URLs and email addresses.
• /joe\.bloggs\@darwin\.co\.uk/
• /www\.envgen\.nox\.ac\.uk\/biolinux\.html/
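The difference between the escaped and unescaped dot can be checked directly. A minimal sketch follows; the helper names and the test strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns from the text.
sub matches_unescaped {
    my ($name) = @_;
    return $name =~ m/C. elegans/ ? 1 : 0;     # dot is a wildcard here
}

sub matches_escaped {
    my ($name) = @_;
    return $name =~ m/C\. elegans/ ? 1 : 0;    # dot must be a literal '.'
}

# Made-up strings; only the first is a real species abbreviation.
for my $name ('C. elegans', 'Ca elegans', 'C3 elegans') {
    printf "%-12s unescaped=%d escaped=%d\n",
        $name, matches_unescaped($name), matches_escaped($name);
}
```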

Finding Sequence Identifiers
• The file nemaglobins.embl contains EMBL database entries for globins of the phylum Nematoda. Write a script that counts the number of entries.
• Try:
    #!/usr/bin/perl
    # Count the lines that look like accession (AC) lines.
    $count = 0;
    open IN, "<nemaglobins.embl" or die;
    while ($line = <IN>) {
        if ($line =~ m/AC   .*/) {    # that's three spaces
            $count++;
        }
    }
    print "total=$count\n";

Grouping Patterns
• So far we have used metacharacters with a single character.
• Can group patterns – place them within parentheses "()".
• Powerful when coupled with quantifiers.
• /MLSTSTG+/ => MLSTSTGGGGGGGGG…
• /MLS(TSTG)+/ => MLSTSTGTSTGTSTG…TSTG
• /ML(ST){2}G/ => MLSTSTG

Alternative Matching
• Match this or that.
• Two ways, depending on the nature of the pattern.
• 1) Use a vertical bar '|'
• matches if either the left side or the right side matches,
• /(human|mouse|rat)/ => any string containing human or mouse or rat.

• 2) A character class is a list of characters within '[]'. It will match any single character within the class.
• /[wxyz1234\t]/ => any of the nine.
• A range can be specified with '-'.
• /[w-z1-4\t]/ => as above
• To match a literal hyphen, place it first (or last) in the class, or escape it.
• /[-a-zA-Z]/ => any letter or a hyphen
• Negate a class with '^' as its first character.
• /[^z]/ => any character except z
• /[^abc]/ => any character except a or b or c
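A negated character class gives a compact check for unexpected characters in a sequence. This is a minimal sketch in a biological setting; the helper name has_non_dna_character and the test sequences are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains any character
# other than the four unambiguous DNA bases A, C, G and T, else 0.
sub has_non_dna_character {
    my ($sequence) = @_;
    return $sequence =~ m/[^ACGT]/ ? 1 : 0;
}

# Made-up sequences: one clean, one with an ambiguity code, one with a gap.
for my $sequence ('GATTACA', 'GATNACA', 'GAT-ACA') {
    my $status = has_non_dna_character($sequence)
        ? 'contains other characters'
        : 'only A, C, G, T';
    print "$sequence: $status\n";
}
```

The class is case sensitive as written, so lower-case bases would also be reported as other characters.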

Revisiting EMBL File
• Want to find the number of globins from Ascaris or Toxocara.
    #!/usr/bin/perl
    # Count organism (OS) lines that name either genus.
    $count = 0;
    open IN, "<nemaglobins.embl" or die;
    while ($line = <IN>) {
        if ($line =~ m/OS   (Ascaris|Toxocara)/) {    # three spaces after OS
            $count++;
        }
    }
    print "Found $count globins from Ascaris or Toxocara\n";
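The same counting logic can be tried without the data file. The following is a minimal sketch: the helper count_matching_lines is hypothetical, the lines are made up in the style of EMBL OS and DE lines, and qr// is used only to pass the pattern as a value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many lines match a compiled pattern.
sub count_matching_lines {
    my ($regex, @lines) = @_;
    my $count = 0;
    for my $line (@lines) {
        $count++ if $line =~ $regex;
    }
    return $count;
}

# Made-up lines in the style of an EMBL file; not real records.
my @lines = (
    "OS   Ascaris suum\n",
    "OS   Toxocara canis\n",
    "OS   Caenorhabditis elegans\n",
    "DE   globin from Ascaris suum\n",
);

my $count = count_matching_lines(qr/OS   (Ascaris|Toxocara)/, @lines);
print "Found $count globins from Ascaris or Toxocara\n";
```

The DE line mentions Ascaris but is not counted, because the pattern requires the text 'OS' and three spaces immediately before the genus name.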

Shortcuts
• \d => any digit [0-9]
• \w => any "word" character [A-Za-z0-9_]
• \s => any white space [\t\n\r\f ]
• \D => any character except a digit [^\d]
• \W => any character except a "word" character [^\w]
• \S => any character except a white space [^\s]
• Can use any of these in conjunction with quantifiers,
• /\s*/ => any amount of white space

Anchoring a Pattern
• /pattern/ will match anywhere in the string.
• Anchors hold the pattern to a point in the string.
• caret "^" (shift 6) marks the beginning of the string while dollar "$" marks the end of the string.
• /^elegans/ => elegans only at start of string. Not C. elegans.
• /Canis$/ => Canis only at end of string. Not Canis lupus.
• /^\s*$/ => a blank line.
• '$' still matches when the string ends with a new line character '\n', because it can match just before a trailing new line.
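The anchors above can be checked on a few short strings. This is a minimal sketch; the helper names and the test strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, one per anchored pattern from the text.
sub starts_with_elegans {
    my ($string) = @_;
    return $string =~ m/^elegans/ ? 1 : 0;
}

sub ends_with_canis {
    my ($string) = @_;
    return $string =~ m/Canis$/ ? 1 : 0;
}

sub is_blank_line {
    my ($line) = @_;
    return $line =~ m/^\s*$/ ? 1 : 0;
}

print "starts_with_elegans('C. elegans'): ", starts_with_elegans('C. elegans'), "\n";
print "ends_with_canis('Canis lupus'):    ", ends_with_canis('Canis lupus'), "\n";
# A line read from a file normally still carries its trailing newline.
print "ends_with_canis(\"Canis\\n\"):        ", ends_with_canis("Canis\n"), "\n";
print "is_blank_line(\"   \\n\"):            ", is_blank_line("   \n"), "\n";
```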

Memory Variables
• Able to extract sections of the pattern and store them in a variable.
• The part of the pattern within parentheses '()' is stored in a special variable.
• First instance is $1, second $2, the fourth $4…
• Extract from file:
    Organism: Homo sapiens
• From Perl script:
    if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
        $genus = $1;
        $species = $2;
    }
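The capture can be wrapped in a small function so that the memory variables are copied out straight after the match. A minimal sketch, in which the helper name parse_organism is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return (genus, species) from an 'Organism:' line,
# or an empty list if the line does not match.
sub parse_organism {
    my ($line) = @_;
    if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
        return ($1, $2);
    }
    return;
}

my ($genus, $species) = parse_organism('Organism: Homo sapiens');
print "genus=$genus species=$species\n";
```

Copying $1 and $2 into named variables right away matters because a later successful match overwrites them.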

Revisiting EMBL File (again)
• Use shortcuts and anchors to find what you want.
• Previously:
    if ($line =~ m/AC   .*/) { ... }    # found lots
• Try:
    if ($line =~ m/^AC\s{3}([.\w]+)\s*/) {
        $accession = $1;    # info stored to use later
    }
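The anchored pattern can be tested on a few lines held in memory. This is a minimal sketch: the helper extract_accession is hypothetical, and the lines and the accession X00001 are made up in the style of an EMBL entry rather than copied from a real record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first accession on an AC line, or undef
# if the line is not an AC line.
sub extract_accession {
    my ($line) = @_;
    return $line =~ m/^AC\s{3}([.\w]+)\s*/ ? $1 : undef;
}

# Made-up lines; the DE line contains 'AC   ' but not at the start.
my @lines = (
    "ID   EXAMPLE1; SV 1; linear; mRNA; STD; INV; 500 BP.\n",
    "AC   X00001;\n",
    "DE   Example globin mRNA, see AC   X00001 above\n",
);

for my $line (@lines) {
    my $accession = extract_accession($line);
    print "accession: $accession\n" if defined $accession;
}
```

Only the true AC line is reported. The unanchored pattern /AC   .*/ would also have matched the DE line, which is one way it can find more than intended.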

Substitutions
• Match a pattern within a string and replace it with another string.
• Uses the 's' operator.
• s/abc/xyz/ => find abc and replace with xyz
• Only replaces the first instance of a match. Using the 'g' modifier will find and replace all.
    $line = 'abcaabbcabca';
    $line =~ s/abc/xyz/g;
    print $line;    # xyzaabbcxyza
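The effect of the 'g' modifier is easiest to see by running the same substitution with and without it. A minimal sketch, in which the helper names replace_first and replace_all are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: each works on a copy, so the caller's string is untouched.
sub replace_first {
    my ($text) = @_;
    $text =~ s/abc/xyz/;     # no 'g': only the first match is replaced
    return $text;
}

sub replace_all {
    my ($text) = @_;
    $text =~ s/abc/xyz/g;    # 'g': every match is replaced
    return $text;
}

my $line = 'abcaabbcabca';
print "first only: ", replace_first($line), "\n";
print "all:        ", replace_all($line), "\n";
```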

More Substitutions
• Remove all gap characters from a multiple sequence alignment:
    $aln = 'AADG--ASD--P-GSTST';
    $aln =~ s/-//g;
    print $aln;    # AADGASDPGSTST
• Inserting information:
    $line = 'vector:';
    $line =~ s/(vector:)/$1 M13MP7/;    # vector: M13MP7
    $name = 'Daniel';
    $name =~ s/(Daniel)/Jack $1/;       # Jack Daniel
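The gap-removal substitution can also report how many characters it removed, because s///g returns the number of replacements it made. A minimal sketch, in which the helper name strip_gaps is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the ungapped sequence and the number of
# gap characters that were removed.
sub strip_gaps {
    my ($row) = @_;
    # s///g returns the number of replacements, or a false value if none.
    my $removed = ($row =~ s/-//g) || 0;
    return ($row, $removed);
}

my ($ungapped, $removed) = strip_gaps('AADG--ASD--P-GSTST');
print "$ungapped ($removed gap characters removed)\n";
```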

Resources
• Learning Perl (O'Reilly) Ch. 7-9
• Regular Expression Pocket Reference (O'Reilly)
• Mastering Regular Expressions (O'Reilly)
• perldoc perlre
• http://etext.lib.virginia.edu/helpsheets/regex.html
• http://www.nematodes.org/~jamesw/Perl/regex
