1 / 15

Recap: Two ways of using regular expression

Recap: Two ways of using regular expression. Search directly: re.search( regExp, text ) Compile regExp to a special format (an SRE_Pattern object) Search for this SRE_Pattern in text Result is an SRE_Match object or precompile the expression: compiledRE = re.compile( regExp)

wcarmen
Download Presentation

Recap: Two ways of using regular expression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recap: Two ways of using regular expression Search directly: re.search( regExp, text ) • Compile regExp to a special format (an SRE_Pattern object) • Search for this SRE_Pattern in text • Result is an SRE_Match object or precompile the expression: compiledRE = re.compile( regExp) 1. Now compiledRE is an SRE_Pattern object compiledRE.search( text ) 2. Use search method in this SRE_Pattern to search text 3. Result is same SRE_Match object

  2. A few more metacharacters ^: indicates placement at the beginning of the string $: indicates placement at the end of the string # search for zero or one t, followed by two a’s # at the beginning of the string: regExp1 = “^t?aa“ # search for g followed by one or more c’s followed by a # at the end of the string: regExp1 = “gc+a$“ # whole string should match ct followed by zero or more # g’s followed by a: regExp1 = “^ctg*a$“

  3. This time we usere.search() to search the text for the regular expressions directly without compiling them in advance Text1 contains the regular expression ^t?aa Text1 contains the regular expression gc+a$ Text2 contains the regular expression ^ctg*a$

  4. Yet more metacharacters.. {}: indicate repetition | : match either regular expression to the left or to the right (): indicate a group (a part of a regular expression) # search for four t’s followed by three c’s: regExp1 = “t{4}c{3}“ # search for g followed by 1, 2 or 3 c’s: regExp1 = “gc{1,3}$“ # search for either gg or cc: regExp1 = “gg|cc“ # search for either gg or cc followed by tt: regExp1 = “(gg|cc)tt“

  5. Microsatellites: follow-up on exercise Microsatellites are small consecutive DNA repeats which are found throughout the genome of organisms ranging from yeasts through to mammals. • AAAAAAAAAAA would be referred to as (A)11 • GTGTGTGTGTGT would be referred to as (GT)6 • CTGCTGCTGCTG would be referred to as (CTG)4 • ACTCACTCACTCACTC would be referred to as (ACTC)4 Microsatellites have high mutation rates and therefore may show high variation between individuals within a species. Source: http://www.amonline.net.au/evolutionary_biology/tour/microsatellites.htm

  6. Looking for microsatellites microsatellites.py Sequence contains the pattern AA+ Sequence does not contain the pattern GT(GT)+ Sequence contains the pattern CTG(CTG)+ Sequence does not contain the pattern ACTC(ACTC)+

  7. Escaping metacharacters \: used to escape a metacharacter (“to take it literally”) # search for x followed by + followed by y: regExp1 = “x\+y“ # search for ( followed by x followed by y: regExp1 = “\(xy“ # search for x followed by ? followed by y: regExp1 = “x\?y“ # search for x followed by at least one ^ followed by 3: regExp1 = “x\^+3“

  8. Character Classes A character class matches one of the characters in the class: [abc] matches eithera or b or c. d[abc]d matches dad and dbd and dcd [ab]+c matches e.g. ac, abc, bac, bbabaabc, .. • Metacharacter ^ at beginning negates character class: [^abc] matches any character other thana, b and c • A class can use – to indicate a range of characters: [a-e] is the same as [abcde] • Characters except ^ and – are taken literally in a class: [a+b*] matches a or + or b or *

  9. Special Sequences Special sequence: shortcut for a common character class regExp1 = “\d\d:\d\d:\d\d [AP]M” # (possibly illegal) time stamps 04:23:19 PM regExp2 = "\w+@[\w.]+\.dk“ # any Danish email address

  10. Regular expression functions sub, split, match

  11. *a*b*c*d*e*f *a*b*c4d5e6f ['', 'a', 'b', 'c', 'd', 'e', 'f'] ['1', '2', '3', '4', '5', '6', ''] method search found \db method match found \da

  12. Groups text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address“ regExp = "\w+@[\w.]+\.dk“ # match Danish email address compiledRE = re.compile( regExp) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish email address:", SRE_Match.group() We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object: Text contains this Danish email address: chili@daimi.au.dk

  13. The substring that matches the whole RE is called a group • The RE can be subdivided into smaller groups (parts) • Each group of the matching substring can be extracted • Metacharacters ( and ) denote a group text = "But here: chili@daimi.au.dk what a *(.@#$ silly @#*.( email address“ # Match any Danish email address; define two groups: username and domain: regExp = “(\w+)@([\w.]+\.dk)“ compiledRE = re.compile( regExp ) SRE_Match = compiledRE.search( text ) if SRE_Match: print "Text contains this Danish email address:", SRE_Match.group() print “Username:”, SRE_Match.group(1), “\nDomain:”, SRE_Match.group(2) Text contains this Danish email address: chili@daimi.au.dk Username: chili Domain: daimi.au.dk

  14. Greedy vs. non-greedy operators • + and * are greedy operators • They attempt to match as many characters as possible • +? and *? are non-greedy operators • They attempt to match as few characters as possible

  15. # Task: Find a space-separated list of digits, report the first digit. import re text = "1 2 3 4 5 blah blah" # use greedy operator + regExp = "(\d )+" print "Greedy operator:", re.match( regExp, text ).group() # use non-greedy version instead (by putting a ? after the +) regExp = "(\d )+?" print "Non-greedy operator:", re.match( regExp, text ).group() Greedy operator: 1 2 3 4 5 Non-greedy operator: 1

More Related