Recap: Two ways of using regular expression

Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) • Compile regExp to a special format (an SRE_Pattern object) • Search for this SRE_Pattern in text • Result is an SRE_Match object or precompile the expression: compiledRE = re.compile( regExp) 1. Now compiledRE is an SRE_Pattern object compiledRE.search( text ) 2. Use search method in this SRE_Pattern to search text 3. Result is same SRE_Match object

A few more metacharacters ^: indicates placement at the beginning of the string $: indicates placement at the end of the string # search for zero or one t, followed by two a’s # at the beginning of the string: regExp1 = “^t?aa“ # search for g followed by one or more c’s followed by a # at the end of the string: regExp1 = “gc+a$“ # whole string should match ct followed by zero or more # g’s followed by a: regExp1 = “^ctg*a$“

This time we usere.search() to search the text for the regular expressions directly without compiling them in advance Text1 contains the regular expression ^t?aa Text1 contains the regular expression gc+a$ Text2 contains the regular expression ^ctg*a$

Yet more metacharacters.. {}: indicate repetition | : match either regular expression to the left or to the right (): indicate a group (a part of a regular expression) # search for four t’s followed by three c’s: regExp1 = “t{4}c{3}“ # search for g followed by 1, 2 or 3 c’s: regExp1 = “gc{1,3}$“ # search for either gg or cc: regExp1 = “gg|cc“ # search for either gg or cc followed by tt: regExp1 = “(gg|cc)tt“

Microsatellites: follow-up on exercise Microsatellites are small consecutive DNA repeats which are found throughout the genome of organisms ranging from yeasts through to mammals. • AAAAAAAAAAA would be referred to as (A)11 • GTGTGTGTGTGT would be referred to as (GT)6 • CTGCTGCTGCTG would be referred to as (CTG)4 • ACTCACTCACTCACTC would be referred to as (ACTC)4 Microsatellites have high mutation rates and therefore may show high variation between individuals within a species. Source: http://www.amonline.net.au/evolutionary_biology/tour/microsatellites.htm

Looking for microsatellites microsatellites.py Sequence contains the pattern AA+ Sequence does not contain the pattern GT(GT)+ Sequence contains the pattern CTG(CTG)+ Sequence does not contain the pattern ACTC(ACTC)+

Escaping metacharacters \: used to escape a metacharacter (“to take it literally”) # search for x followed by + followed by y: regExp1 = “x\+y“ # search for ( followed by x followed by y: regExp1 = “\(xy“ # search for x followed by ? followed by y: regExp1 = “x\?y“ # search for x followed by at least one ^ followed by 3: regExp1 = “x\^+3“

Character Classes A character class matches one of the characters in the class: [abc] matches eithera or b or c. d[abc]d matches dad and dbd and dcd [ab]+c matches e.g. ac, abc, bac, bbabaabc, .. • Metacharacter ^ at beginning negates character class: [^abc] matches any character other thana, b and c • A class can use – to indicate a range of characters: [a-e] is the same as [abcde] • Characters except ^ and – are taken literally in a class: [a+b*] matches a or + or b or *

Special Sequences Special sequence: shortcut for a common character class regExp1 = “\d\d:\d\d:\d\d [AP]M” # (possibly illegal) time stamps 04:23:19 PM regExp2 = "\w+@[\w.]+\.dk“ # any Danish email address

Regular expression functions sub, split, match

*a*b*c*d*e*f *a*b*c4d5e6f ['', 'a', 'b', 'c', 'd', 'e', 'f'] ['1', '2', '3', '4', '5', '6', ''] method search found \db method match found \da

Groups text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address“ regExp = "\w+@[\w.]+\.dk“ # match Danish email address compiledRE = re.compile( regExp) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish email address:", SRE_Match.group() We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object: Text contains this Danish email address: chili@daimi.au.dk

The substring that matches the whole RE is called a group • The RE can be subdivided into smaller groups (parts) • Each group of the matching substring can be extracted • Metacharacters ( and ) denote a group text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address“ # Match any Danish email address; define two groups: username and domain: regExp = “(\w+)@([\w.]+\.dk)“ compiledRE = re.compile( regExp ) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish email address:", SRE_Match.group() print “Username:”, SRE_Match.group(1), “\nDomain:”, SRE_Match.group(2) Text contains this Danish email address: chili@daimi.au.dk Username: chili Domain: daimi.au.dk

Greedy vs. non-greedy operators • + and * are greedy operators • They attempt to match as many characters as possible • +? and *? are non-greedy operators • They attempt to match as few characters as possible

# Task: Find a space-separated list of digits, report the first digit. import re text = "1 2 3 4 5 blah blah" # use greedy operator + regExp = "(\d )+" print "Greedy operator:", re.match( regExp, text ).group() # use non-greedy version instead (by putting a ? after the +) regExp = "(\d )+?" print "Non-greedy operator:", re.match( regExp, text ).group() Greedy operator: 1 2 3 4 5 Non-greedy operator: 1

Recap: Two ways of using regular expression