Microsatellites: follow-up on exercise

Presentation Transcript


  1. Microsatellites: follow-up on exercise
     Microsatellites are small consecutive DNA repeats which are found throughout almost all genomes:
     • AAAAAAAAAAA would be referred to as (A)11
     • GTGTGTGTGTGT would be referred to as (GT)6
     • CTGCTGCTGCTG would be referred to as (CTG)4
     • ACTCACTCACTCACTC would be referred to as (ACTC)4
     Microsatellites have high mutation rates and may therefore show high variation between individuals within a species.

  2. Looking for microsatellites
     Output of microsatellites.py:
     Sequence contains the pattern AA+
     Sequence does not contain the pattern GT(GT)+
     Sequence contains the pattern CTG(CTG)+
     Sequence does not contain the pattern ACTC(ACTC)+
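The script itself is not reproduced in the transcript. A minimal sketch of how such a check could look, assuming a made-up test sequence (the name `sequence`, its contents, and the helper `report` are illustrative, not taken from microsatellites.py):

```python
import re

# Hypothetical test sequence; the one used on the slide is not shown.
sequence = "CCAAAAAGTCTGCTGCTGACTC"

# Patterns from the slide: each one requires at least two copies of the unit.
patterns = ["AA+", "GT(GT)+", "CTG(CTG)+", "ACTC(ACTC)+"]

def report(seq, pats):
    """Return one message per pattern saying whether it occurs in seq."""
    messages = []
    for pat in pats:
        if re.search(pat, seq):
            messages.append("Sequence contains the pattern " + pat)
        else:
            messages.append("Sequence does not contain the pattern " + pat)
    return messages

for line in report(sequence, patterns):
    print(line)
```

With this particular sequence, a single GT or a single ACTC is not reported, because the patterns demand the repeat unit at least twice.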

  3. Character Classes
     A character class matches one of the characters in the class:
     [abc] matches either a or b or c.
     d[abc]d matches dad and dbd and dcd
     [ab]+c matches e.g. ac, abc, bac, bbabaabc, ..
     • Metacharacter ^ at the beginning negates the character class: [^abc] matches any character other than a, b and c
     • A class can use - to indicate a range of characters: [a-e] is the same as [abcde]
     • Most other metacharacters are taken literally in a class (besides ^ and -, only \ and ] keep a special meaning): [a+b*] matches a or + or b or *
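The rules above can be checked directly in Python. This is a small sketch; the test strings and variable names are chosen here for illustration and are not from the slides:

```python
import re

# d[abc]d: exactly one of a, b or c between two d's.
class_hit = re.fullmatch(r"d[abc]d", "dbd") is not None
class_miss = re.fullmatch(r"d[abc]d", "ded") is not None

# [ab]+c: one or more a's and b's in any order, then c.
candidates = ["ac", "abc", "bac", "bbabaabc", "c"]
plus_class = [w for w in candidates if re.fullmatch(r"[ab]+c", w)]

# [^abc]: any single character other than a, b and c.
negated = re.findall(r"[^abc]", "abcdea")

# [a-e]: a range, the same as [abcde].
ranged = re.findall(r"[a-e]", "hedge")

# [a+b*]: + and * are ordinary characters inside a class.
literal = re.findall(r"[a+b*]", "a+b*c")

print(class_hit, class_miss, plus_class, negated, ranged, literal)
```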

  4. Common character classes
     regExp1 = "\d\d:\d\d:\d\d [AP]M"   # (possibly illegal) time stamps, e.g. 04:23:19 PM
     regExp2 = "\w+@[\w.]+\.dk"         # any Danish email address
     In regExp2, the . inside the character class [\w.] needs no backslash. The . before dk does need one, since outside a class an unescaped . matches any character.
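A short sketch of how the two expressions behave, written with raw strings. The email addresses are invented examples, and the variable names are not from the slides:

```python
import re

time_stamp = r"\d\d:\d\d:\d\d [AP]M"
danish_email = r"\w+@[\w.]+\.dk"

# A sensible time stamp matches, but so does an impossible one:
# \d accepts any digit, hence "possibly illegal".
valid_time = re.fullmatch(time_stamp, "04:23:19 PM") is not None
illegal_time = re.fullmatch(time_stamp, "99:99:99 AM") is not None

# Hypothetical addresses.
email_dk = re.fullmatch(danish_email, "someone@mail.example.dk") is not None
email_com = re.fullmatch(danish_email, "someone@example.com") is not None

# Without the backslash, the last . matches any character, here the X.
unescaped = re.fullmatch(r"\w+@[\w.]+.dk", "someone@exampleXdk") is not None
escaped = re.fullmatch(danish_email, "someone@exampleXdk") is not None

print(valid_time, illegal_time, email_dk, email_com, unescaped, escaped)
```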

  5. Regular expression functions sub, split
     Output of regExpfunctions.py:
     *a*b*c*d*e*f
     *a*b*c4d5e6f
     ['', 'a', 'b', 'c', 'd', 'e', 'f']
     ['1', '2', '3', '4', '5', '6', '']
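The code of regExpfunctions.py is not in the transcript. The sketch below reproduces the four output lines, assuming the input string was "1a2b3c4d5e6f"; that string and the patterns are a guess consistent with the output, not the original code:

```python
import re

s = "1a2b3c4d5e6f"  # assumed input, inferred from the output on the slide

# sub replaces every match, or only the first `count` matches.
all_digits = re.sub(r"\d", "*", s)
first_three = re.sub(r"\d", "*", s, count=3)

# split cuts the string at every match; a match at the very start or end
# of the string leaves an empty string in the result.
split_on_digits = re.split(r"\d", s)
split_on_letters = re.split(r"[a-z]", s)

print(all_digits)
print(first_three)
print(split_on_digits)
print(split_on_letters)
```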

  6. If you put ()'s around the delimiter pattern, the delimiters are returned as well
     • Recall the trypsin exercise. Output of regExpsplit.py:
       ['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K', 'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
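A sketch of the difference, assuming the delimiter pattern is simply the class [KR] (the pattern in regExpsplit.py is not shown, so this is an assumption). The protein string is rebuilt from the pieces listed on the slide:

```python
import re

# Pieces as printed on the slide; joining them gives back the protein.
pieces = ["DCQ", "R", "VYAPFM", "K", "LIHDQWGWDYNNWTSM", "K",
          "GDA", "R", "EILIMPFCQWTSPF", "R", "NMGCHV"]
protein = "".join(pieces)

# Plain delimiter pattern: the K and R residues are dropped.
without_groups = re.split(r"[KR]", protein)

# Parenthesised delimiter pattern: the delimiters are kept in the list.
with_groups = re.split(r"([KR])", protein)

print(without_groups)
print(with_groups)
```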

  7. Greedy vs. non-greedy operators
     • + and * are greedy operators
       • They attempt to match as many characters as possible
     • +? and *? are non-greedy operators
       • They attempt to match as few characters as possible

  8. Output of nongreedy.py:
     ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG
     ATGCGACTGACTCGTAG
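The two lines are consistent with matching from a start codon ATG to a stop codon TAG, first greedily and then non-greedily. A sketch under that assumption (the patterns and the input sequence are inferred from the output, not copied from nongreedy.py):

```python
import re

# Assumed input: the longer of the two sequences on the slide.
dna = "ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG"

# Greedy: .* runs as far as it can, so the match ends at the last TAG.
greedy = re.search(r"ATG.*TAG", dna).group()

# Non-greedy: .*? stops as early as it can, at the first TAG after ATG.
non_greedy = re.search(r"ATG.*?TAG", dna).group()

print(greedy)
print(non_greedy)
```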

  9. Non-greedy
     • NB: a non-greedy operator won't skip a match to get to a shorter match
     • The search string is read left to right, and the first match is reported
     >>> import re
     >>> s = "aa---aa----bb"
     >>> re.search("aa.*?bb", s).group()
     'aa---aa----bb'
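If the shorter match is what you want, one option is to forbid the pattern from running past a second aa, for instance with a negated character class. This is an illustrative alternative, not something shown on the slides:

```python
import re

s = "aa---aa----bb"

# Non-greedy still starts at the leftmost possible position.
leftmost = re.search(r"aa.*?bb", s).group()

# [^a]* cannot cross another a, so the match must start at the second aa.
shorter = re.search(r"aa[^a]*bb", s).group()

print(leftmost)
print(shorter)
```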

  10. Extract today’s exercises from course calendar

  11. Extract today’s exercises from course calendar
      <tr body bgcolor="#ccddbb">
      <td valign=top> Mon<br>30/10 </td>
      <td valign=top> Practical matters. Introduction to Unix and Python and the dynamic interpreter.<br>
      <i>Read: <a href="emacs.html">Emacs tricks</a>, Course notes chapters <a href="Notes/ch01.pdf">1</a> and <a href="Notes/ch02.pdf">2</a></i>. </td>
      <td valign=top>
      <a href="Exercises/fileorganization.html">File organization</a>,
      <a href="Exercises/unix.html">Pattern counting using Unix commands</a>,
      <a href="pythonmode.html">Python mode</a>,
      <a href="Exercises/mathfunctions.html">Math functions</a>,
      <a href="Exercises/functionsfile.html">Math functions in a file</a>
      <br> </td>
      <td valign=top> </td>
      </tr>
      <tr body bgcolor="#ccddbb">
      <td valign=top> Thu<br>2/11<!-- Thu<br> Nov 3rd<br> 14-16 --></td>
      <td valign=top> String formatting, string methods, if/else, while loops, for loops, functions, ..<br>
      <i>Read: Course notes, chapters <a href="Notes/ch03.pdf">3</a> and <a href="Notes/ch04.pdf">4</a></i>.
      • Idea: Get some date from user; then..
        • Extract entry for date
        • Extract all exercises in this entry

  12. Extract today’s exercises from course calendar
      Entry for 30/10:
      30/10 </td>
      <td valign=top> Practical matters. Introduction to Unix and Python and the dynamic interpreter.<br>
      <i>Read: <a href="emacs.html">Emacs tricks</a>, Course notes chapters <a href="Notes/ch01.pdf">1</a> and <a href="Notes/ch02.pdf">2</a></i>. </td>
      <td valign=top>
      <a href="Exercises/fileorganization.html">File organization</a>,
      <a href="Exercises/unix.html">Pattern counting using Unix commands</a>,
      <a href="pythonmode.html">Python mode</a>,
      <a href="Exercises/mathfunctions.html">Math functions</a>,
      <a href="Exercises/functionsfile.html">Math functions in a file</a>
      <br> </td>
      <td valign=top> </td>
      </tr>
      • Regular expressions:
        • r"\b%s\b.*?/tr>"%date (DOTALL mode)
      • The > in /tr> is necessary to avoid matching exercise names containing tr (e.g. the trypsin exercise)
      • Use the non-greedy version to avoid matching several date entries
      • Word boundaries are necessary, otherwise "4/1" is matched inside "24/11". Hence the raw string: in an ordinary string, \b would mean backspace
      • DOTALL mode is necessary to have . match across several lines
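The role of each part of the pattern can be seen on a toy calendar. In this sketch the calendar text, the dates 24/11 and 4/1 as table entries, the revision exercise, and the helper `extract_entry` are all hypothetical. `re.escape` is an addition here to guard against special characters in the user's input; the slide inserts the date directly.

```python
import re

# Hypothetical two-row calendar, much shorter than the real page.
calendar = """<tr> <td valign=top> Fri<br>24/11 </td>
<td valign=top> <a href="Exercises/trypsin.html">Trypsin</a> </td> </tr>
<tr> <td valign=top> Thu<br>4/1 </td>
<td valign=top> <a href="Exercises/revision.html">Revision</a> </td> </tr>"""

def extract_entry(date, text):
    """Return the text from the date up to the end of its table row."""
    pattern = r"\b%s\b.*?/tr>" % re.escape(date)
    match = re.search(pattern, text, re.DOTALL)
    return match.group() if match else None

entry = extract_entry("4/1", calendar)

# Without \b, "4/1" is found inside "24/11" and the wrong row is returned.
no_boundary = re.search(r"4/1.*?/tr>", calendar, re.DOTALL).group()

# Without the final >, the match stops inside the link "Exercises/trypsin".
no_bracket = re.search(r"\b24/11\b.*?/tr", calendar, re.DOTALL).group()

print(entry)
```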

  13. Extract today’s exercises from course calendar
      (Same calendar entry for 30/10 as on the previous slide.)
      • Regular expressions:
        • r"\b%s\b.*?/tr>"%date (DOTALL mode)
        • r"(Exercises/\w+\.html).*?\b([ \w]+)"
      • Use the non-greedy version in case there are several exercises on the same line

  14. regex_webpage.py
      The pattern contains several subgroups, so findall returns a list of subgroup tuples
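The script is not included in the transcript. A sketch of how the two regular expressions from the previous slides might be combined follows; the function name `exercises_for`, the abridged calendar text, and the use of `re.escape` are assumptions made for this example:

```python
import re

# Calendar entry abridged from the excerpt on the earlier slides.
calendar = """<tr body bgcolor="#ccddbb"> <td valign=top> Mon<br>30/10 </td>
<td valign=top> Practical matters. </td>
<td valign=top>
<a href="Exercises/fileorganization.html">File organization</a>,
<a href="Exercises/unix.html">Pattern counting using Unix commands</a>,
<a href="pythonmode.html">Python mode</a>,
<a href="Exercises/mathfunctions.html">Math functions</a> <br> </td>
<td valign=top> </td> </tr>"""

def exercises_for(date, text):
    """Return (link, title) tuples for the exercises listed on a date."""
    entry = re.search(r"\b%s\b.*?/tr>" % re.escape(date), text, re.DOTALL)
    if entry is None:
        return []
    # Two subgroups in the pattern, so findall returns a list of 2-tuples.
    return re.findall(r"(Exercises/\w+\.html).*?\b([ \w]+)", entry.group())

for link, title in exercises_for("30/10", calendar):
    print("%-40s %s" % (title, link))
```

The Python mode link is skipped because its address does not start with Exercises/.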

  15. Trial runs

  16. .. on to the exercises
