Microsatellites: follow-up on exercise
Microsatellites are small consecutive DNA repeats which are found throughout almost all genomes
• AAAAAAAAAAA would be referred to as (A)11
• GTGTGTGTGTGT would be referred to as (GT)6
• CTGCTGCTGCTG would be referred to as (CTG)4
• ACTCACTCACTCACTC would be referred to as (ACTC)4
Microsatellites have high mutation rates and therefore may show high variation between individuals within a species.
Looking for microsatellites
Output of microsatellites.py:
Sequence contains the pattern AA+
Sequence does not contain the pattern GT(GT)+
Sequence contains the pattern CTG(CTG)+
Sequence does not contain the pattern ACTC(ACTC)+
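The source of microsatellites.py is not shown, so the following is only a minimal sketch of how such output could be produced. The test sequence and the helper name has_pattern are assumptions made for illustration; only the four patterns come from the slide.

```python
import re

# Hypothetical test sequence (not from the slides): it holds a run of A's
# and a CTG repeat, but no GT repeat and no ACTC repeat.
sequence = "CCAAAAAAGGCTGCTGCTGCC"

# Each pattern demands at least two consecutive copies of the repeat unit.
patterns = ["AA+", "GT(GT)+", "CTG(CTG)+", "ACTC(ACTC)+"]


def has_pattern(pattern, seq):
    """Return True if the pattern occurs anywhere in seq."""
    return re.search(pattern, seq) is not None


for pattern in patterns:
    if has_pattern(pattern, sequence):
        print("Sequence contains the pattern", pattern)
    else:
        print("Sequence does not contain the pattern", pattern)
```

Writing the repeat unit once outside the group and once inside it, as in CTG(CTG)+, is what rules out a single isolated CTG.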
Character Classes
A character class matches one of the characters in the class:
• [abc] matches either a or b or c
• d[abc]d matches dad and dbd and dcd
• [ab]+c matches e.g. ac, abc, bac, bbabaabc, ..
• Metacharacter ^ at the beginning negates the character class: [^abc] matches any character other than a, b and c
• A class can use - to indicate a range of characters: [a-e] is the same as [abcde]
• Characters except ^, -, ] and \ are taken literally in a class: [a+b*] matches a or + or b or *
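The rules above can be checked interactively. This is a small sketch; the helper name matches is an assumption, and re.fullmatch is used so that each test string has to match the pattern as a whole.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("d[abc]d", "dbd"))       # True: b is in the class
print(matches("d[abc]d", "ded"))       # False: e is not in the class
print(matches("[ab]+c", "bbabaabc"))   # True: one or more of a/b, then c
print(matches("[^abc]", "x"))          # True: x is none of a, b, c
print(matches("[^abc]", "a"))          # False: the class is negated
print(matches("[a-e]", "d"))           # True: d lies in the range a-e

# Inside a class, + and * are ordinary characters.
print(re.findall("[a+b*]", "a+b*c"))   # ['a', '+', 'b', '*']
```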
Common character classes
regExp1 = r"\d\d:\d\d:\d\d [AP]M" # (possibly illegal) time stamps, e.g. 04:23:19 PM
regExp2 = r"\w+@[\w.]+\.dk" # any Danish email address
• Inside the character class [\w.], a backslash before the dot is not necessary
• Outside a class, as in \.dk, the backslash is necessary to match a literal dot
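A short sketch of the two patterns in use. The test strings, including the address someone@example.dk, are made up for illustration. Here \d stands for any digit and \w for any letter, digit or underscore.

```python
import re

time_stamp = re.compile(r"\d\d:\d\d:\d\d [AP]M")
danish_email = re.compile(r"\w+@[\w.]+\.dk")

# A legal time stamp is found inside a longer string.
print(time_stamp.search("Logged in at 04:23:19 PM").group())  # 04:23:19 PM

# The pattern checks only the shape, so an illegal time also matches.
print(time_stamp.search("99:99:99 AM") is not None)  # True

# Hypothetical address: the escaped dot before dk must be a literal dot.
print(danish_email.search("Write to someone@example.dk today").group())
print(danish_email.search("someone@exampleXdk") is None)  # True
```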
Regular expression functions sub, split
Output of regExpfunctions.py:
*a*b*c*d*e*f
*a*b*c4d5e6f
['', 'a', 'b', 'c', 'd', 'e', 'f']
['1', '2', '3', '4', '5', '6', '']
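The code of regExpfunctions.py is not shown. One way to obtain exactly the four output lines above is sketched here; the input string and the patterns are assumptions reconstructed from the output, not the original script.

```python
import re

# Assumed input string, inferred from the output shown on the slide.
s = "1a2b3c4d5e6f"

# sub replaces every match of the pattern with the replacement string.
print(re.sub(r"\d", "*", s))            # *a*b*c*d*e*f

# The optional count limits the number of replacements.
print(re.sub(r"\d", "*", s, count=3))   # *a*b*c4d5e6f

# split cuts the string at every match of the delimiter pattern. A match at
# the very start or end of the string leaves an empty string in the list.
print(re.split(r"\d", s))      # ['', 'a', 'b', 'c', 'd', 'e', 'f']
print(re.split(r"[a-z]", s))   # ['1', '2', '3', '4', '5', '6', '']
```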
• If you put ()'s around the delimiter pattern, the delimiters are returned also
• Recall the trypsin exercise (regExpsplit.py):
['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K', 'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
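A sketch of how the list above could arise. The protein sequence is rebuilt by joining the pieces from the slide; the delimiter pattern ([RK]) is an assumption (cutting at every R or K), since regExpsplit.py itself is not shown.

```python
import re

# The pieces as printed on the slide; joining them restores the sequence.
expected = ['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K',
            'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
protein = "".join(expected)

# Without a group, the delimiters R and K are dropped from the result.
fragments_only = re.split("[RK]", protein)
print(fragments_only)

# With a group around the delimiter pattern, the delimiters are kept.
with_delimiters = re.split("([RK])", protein)
print(with_delimiters)
```

In the grouped version every second list element is a delimiter, so the fragments sit at the even indices.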
Greedy vs. non-greedy operators
• + and * are greedy operators
• They attempt to match as many characters as possible
• +? and *? are non-greedy operators
• They attempt to match as few characters as possible
Output of nongreedy.py:
ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG (greedy match)
ATGCGACTGACTCGTAG (non-greedy match)
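The two output strings are consistent with searching for a stretch from ATG to TAG. The patterns below are an assumption inferred from that output, since nongreedy.py is not shown; the sequence is the longer of the two strings.

```python
import re

sequence = "ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG"

# Greedy: .* grabs as much as possible, so the match ends at the last TAG.
greedy = re.search("ATG.*TAG", sequence).group()
print(greedy)      # ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG

# Non-greedy: .*? grabs as little as possible, so the match ends at the
# first TAG that follows the ATG.
non_greedy = re.search("ATG.*?TAG", sequence).group()
print(non_greedy)  # ATGCGACTGACTCGTAG
```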
Non-greedy
• NB: won't skip a match to get to a shorter match
• Search string is read left to right, first match is reported
>>> import re
>>> s = "aa---aa----bb"
>>> re.search("aa.*?bb", s).group()
'aa---aa----bb'
Extract today's exercises from course calendar
<tr body bgcolor="#ccddbb">
<td valign=top> Mon<br>30/10 </td>
<td valign=top> Practical matters. Introduction to Unix and Python and the dynamic interpreter.<br> <i>Read: <a href="emacs.html">Emacs tricks</a>, Course notes chapters <a href="Notes/ch01.pdf">1</a> and <a href="Notes/ch02.pdf">2</a></i>. </td>
<td valign=top> <a href="Exercises/fileorganization.html">File organization</a>, <a href="Exercises/unix.html">Pattern counting using Unix commands</a>, <a href="pythonmode.html">Python mode</a>, <a href="Exercises/mathfunctions.html">Math functions</a>, <a href="Exercises/functionsfile.html">Math functions in a file</a> <br> </td>
<td valign=top> </td>
</tr>
<tr body bgcolor="#ccddbb">
<td valign=top> Thu<br>2/11<!-- Thu<br> Nov 3rd<br> 14-16 --></td>
<td valign=top> String formatting, string methods, if/else, while loops, for loops, functions, ..<br> <i>Read: Course notes, chapters <a href="Notes/ch03.pdf">3</a> and <a href="Notes/ch04.pdf">4</a></i>.
• Idea: Get some date from user; then..
• Extract entry for date
• Extract all exercises in this entry
Extract today's exercises from course calendar
Step 1: extract the entry for the date. For 30/10, the match runs from 30/10 up to the first </tr> that follows it.
• Regular expression: r"\b%s\b.*?/tr>" % date (DOTALL mode)
• The word boundaries \b are necessary, otherwise "4/1" is matched inside "24/11". Hence the raw string, so that \b is not read as a backspace character.
• Use the non-greedy version .*? to avoid matching several date entries.
• The > in /tr> is necessary to avoid matching exercise names containing tr (e.g. the trypsin exercise).
• DOTALL mode is necessary to have . match across several lines.
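The date pattern can be tried on a small page. This is a sketch only: the page string is an abbreviated, made-up stand-in for the course calendar, the file name Exercises/trypsin.html is hypothetical, and the helper name extract_entry is an assumption.

```python
import re

# Abbreviated, hypothetical stand-in for the calendar page.
page = """<tr body bgcolor="#ccddbb">
<td valign=top> Mon<br>30/10 </td>
<td valign=top> <a href="Exercises/unix.html">Pattern counting using Unix commands</a>,
<a href="Exercises/mathfunctions.html">Math functions</a> </td>
</tr>
<tr body bgcolor="#ccddbb">
<td valign=top> Thu<br>2/11 </td>
<td valign=top> <a href="Exercises/trypsin.html">Trypsin</a> </td>
</tr>
"""


def extract_entry(date, text):
    """Return the text from the date up to the next /tr>, or None."""
    pattern = r"\b%s\b.*?/tr>" % date
    match = re.search(pattern, text, re.DOTALL)  # DOTALL: . also matches \n
    return match.group() if match else None


entry = extract_entry("30/10", page)
print(entry)

# The word boundary stops 0/1 from matching inside 30/10.
print(extract_entry("0/1", page))  # None
```

For the date 2/11 the match runs past /trypsin.html and stops only at the real /tr>, which is the case the trailing > guards against.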
Extract today's exercises from course calendar
Step 2: extract all exercises in the entry found for the date.
• Regular expressions:
• r"\b%s\b.*?/tr>" % date (DOTALL mode)
• r"(Exercises/\w+\.html).*?\b([ \w]+)"
• Use the non-greedy version .*? in case there are several exercises on the same line
regex_webpage.py
• Several subgroups in pattern: findall returns a list of subgroup tuples
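The code of regex_webpage.py is not shown; the following sketch applies the exercise pattern from the slides to three of the links from the calendar excerpt. The variable names are assumptions.

```python
import re

# Three links taken from the calendar excerpt, all on one line.
entry = ('<a href="Exercises/fileorganization.html">File organization</a>, '
         '<a href="Exercises/unix.html">Pattern counting using Unix commands</a>, '
         '<a href="pythonmode.html">Python mode</a>')

# Group 1 captures the file name, group 2 the link text that follows it.
exercise_pattern = r"(Exercises/\w+\.html).*?\b([ \w]+)"

# With two groups in the pattern, findall returns one tuple per match.
exercises = re.findall(exercise_pattern, entry)
for filename, title in exercises:
    print(filename, "->", title)
```

The link to pythonmode.html is skipped because its address does not start with Exercises/.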