Microsatellites: follow-up on exercise

Microsatellites are short consecutive DNA repeats that are found throughout almost all genomes:
• AAAAAAAAAAA would be referred to as (A)11
• GTGTGTGTGTGT would be referred to as (GT)6
• CTGCTGCTGCTG would be referred to as (CTG)4
• ACTCACTCACTCACTC would be referred to as (ACTC)4
Microsatellites have high mutation rates and may therefore show high variation between individuals within a species.
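The shorthand notation can be illustrated with a small sketch. The helper name repeat_notation is made up for this example and is not part of the course material; it assumes the whole input consists of one repeated unit and reports the shortest such unit.

```python
def repeat_notation(seq):
    """Return the (unit)count shorthand for seq, using the shortest repeat unit."""
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    # Try unit lengths from shortest to longest so the smallest unit wins.
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return "(%s)%d" % (seq[:size], n // size)


# The four examples from the list above.
for seq in ["A" * 11, "GT" * 6, "CTG" * 4, "ACTC" * 4]:
    print(seq, "->", repeat_notation(seq))
```

A sequence that is not a repeat at all comes back as a single copy of itself, e.g. (ACGT)1.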

Looking for microsatellites
Output of microsatellites.py:
Sequence contains the pattern AA+
Sequence does not contain the pattern GT(GT)+
Sequence contains the pattern CTG(CTG)+
Sequence does not contain the pattern ACTC(ACTC)+
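The source of microsatellites.py is not reproduced here, so the following is only a sketch of how such a search might look. The test sequence and the helper name report are assumptions, chosen so that the printed lines agree with the output shown above.

```python
import re

# Hypothetical test sequence (not from the original exercise).
seq = "TTGAACTGCTGCTGACTCGT"

# Each pattern requires at least two consecutive copies of the repeat unit.
patterns = ["AA+", "GT(GT)+", "CTG(CTG)+", "ACTC(ACTC)+"]


def report(sequence, pattern):
    """Return a one-line message saying whether pattern occurs in sequence."""
    if re.search(pattern, sequence):
        return "Sequence contains the pattern " + pattern
    return "Sequence does not contain the pattern " + pattern


for p in patterns:
    print(report(seq, p))
```

Note that a single GT or a single ACTC in the sequence is not enough: the parenthesised group followed by + demands a second copy.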

Character Classes
A character class matches one of the characters in the class:
• [abc] matches either a or b or c
• d[abc]d matches dad and dbd and dcd
• [ab]+c matches e.g. ac, abc, bac, bbabaabc, ..
• Metacharacter ^ at the beginning negates the character class: [^abc] matches any character other than a, b and c
• A class can use - to indicate a range of characters: [a-e] is the same as [abcde]
• Most other characters, including metacharacters such as + and *, are taken literally in a class: [a+b*] matches a or + or b or * (the backslash and ] keep their special meaning)
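The rules above can be checked interactively. This is a minimal sketch; the helper name matches is made up for the example, and it uses re.fullmatch so that the whole test string has to match the pattern.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches("d[abc]d", "dbd"))       # True: b is in the class
print(matches("[ab]+c", "bbabaabc"))   # True: a run of a's and b's, then c
print(matches("[^abc]", "d"))          # True: d is not a, b or c
print(matches("[^abc]", "a"))          # False: a is excluded by the negation
print(matches("[a-e]", "e"))           # True: e lies in the range a to e
print(matches("[a+b*]", "+"))          # True: + is literal inside a class
```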

Common character classes
regExp1 = "\d\d:\d\d:\d\d [AP]M"   # (possibly illegal) time stamps, e.g. 04:23:19 PM
regExp2 = "\w+@[\w.]+\.dk"         # any Danish email address
• Inside the character class [\w.], no backslash is necessary before the dot
• Outside the class, in \.dk, the backslash is necessary to match a literal dot
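A short sketch of the two patterns in use. The test strings and email addresses are invented for illustration, and raw strings are used here so that Python leaves the backslashes alone.

```python
import re

regExp1 = r"\d\d:\d\d:\d\d [AP]M"   # \d matches a digit
regExp2 = r"\w+@[\w.]+\.dk"         # \w matches a letter, digit or underscore

# A legal time stamp is found inside a longer string.
print(re.search(regExp1, "Logged at 04:23:19 PM").group())

# The pattern only checks the shape, so an illegal time also matches.
print(re.search(regExp1, "99:99:99 AM") is not None)

# Only the address ending in .dk is reported.
text = "Contact: someone@example.dk or other@example.com"
print(re.findall(regExp2, text))
```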

Regular expression functions sub, split
Output of regExpfunctions.py:
*a*b*c*d*e*f
*a*b*c4d5e6f
['', 'a', 'b', 'c', 'd', 'e', 'f']
['1', '2', '3', '4', '5', '6', '']
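The code of regExpfunctions.py is not shown, so the following is a guess at calls that would produce the four output lines above. The input string and the patterns are assumptions.

```python
import re

s = "1a2b3c4d5e6f"  # assumed input string

# sub replaces every match of the pattern with the replacement string.
print(re.sub(r"\d", "*", s))

# The count argument limits the number of replacements.
print(re.sub(r"\d", "*", s, count=3))

# split cuts the string at every match; the delimiters are dropped.
print(re.split(r"\d", s))
print(re.split(r"[a-z]", s))
```

The empty strings in the split results arise because the string starts (or ends) with a delimiter.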

If you put ()'s around the delimiter pattern, the delimiters are returned as well
• Recall the trypsin exercise. Output of regExpsplit.py:
['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K', 'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
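A sketch of the difference the parentheses make. The protein sequence is rebuilt by joining the pieces of the output above; the pattern [KR] is an assumption, since the actual pattern in regExpsplit.py is not shown (the real exercise may also treat a following P specially).

```python
import re

# Pieces as listed in the output above; joining them restores the sequence.
pieces = ['DCQ', 'R', 'VYAPFM', 'K', 'LIHDQWGWDYNNWTSM', 'K',
          'GDA', 'R', 'EILIMPFCQWTSPF', 'R', 'NMGCHV']
seq = "".join(pieces)

# Without a group the delimiters K and R are thrown away.
without_delims = re.split(r"[KR]", seq)
print(without_delims)

# With a group around the delimiter pattern they are kept in the result.
with_delims = re.split(r"([KR])", seq)
print(with_delims)
```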

Greedy vs. non-greedy operators
• + and * are greedy operators
  • They attempt to match as many characters as possible
• +? and *? are non-greedy operators
  • They attempt to match as few characters as possible

Output of nongreedy.py:
ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG
ATGCGACTGACTCGTAG
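The two output lines are consistent with a search from ATG to TAG, done first greedily and then non-greedily. The patterns below are an assumption, as is the choice of the longer output line as the input sequence.

```python
import re

seq = "ATGCGACTGACTCGTAGCGATGCTATGCGATCGATGTAG"  # assumed input

# Greedy: .* runs as far as it can, so the match ends at the last TAG.
greedy = re.search("ATG.*TAG", seq).group()
print(greedy)

# Non-greedy: .*? stops as early as it can, so the match ends at the first TAG.
nongreedy = re.search("ATG.*?TAG", seq).group()
print(nongreedy)
```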

Non-greedy
• NB: a non-greedy operator won't skip a match to get to a shorter match
• The search string is read left to right, and the first match is reported, even though the shorter match aa----bb starts later in the string
>>> import re
>>> s = "aa---aa----bb"
>>> re.search("aa.*?bb", s).group()
'aa---aa----bb'

Extract today’s exercises from course calendar

<tr body bgcolor="#ccddbb">
<td valign=top> Mon 30/10 </td>
<td valign=top> Practical matters. Introduction to Unix and Python and the dynamic interpreter. Read: <a href="emacs.html">Emacs tricks</a>, Course notes chapters <a href="Notes/ch01.pdf">1</a> and <a href="Notes/ch02.pdf">2</a>. </td>
<td valign=top> <a href="Exercises/fileorganization.html">File organization</a>, <a href="Exercises/unix.html">Pattern counting using Unix commands</a>, <a href="pythonmode.html">Python mode</a>, <a href="Exercises/mathfunctions.html">Math functions</a>, <a href="Exercises/functionsfile.html">Math functions in a file</a> </td>
<td valign=top> </td>
</tr>
<tr body bgcolor="#ccddbb">
<td valign=top> Thu 2/11</td>
<td valign=top> String formatting, string methods, if/else, while loops, for loops, functions, .. Read: Course notes, chapters <a href="Notes/ch03.pdf">3</a> and <a href="Notes/ch04.pdf">4</a>.
• Idea: Get some date from user; then..
  • Extract entry for date
  • Extract all exercises in this entry

• Regular expressions:
  • r"\b%s\b.*?/tr>" % date (DOTALL mode)
• Word boundary \b necessary, otherwise "4/1" is matched inside "24/11". Hence raw string: in an ordinary string, \b would mean backspace.
• Non-greedy .*? necessary to avoid matching several date entries
• > necessary to avoid matching exercise names containing tr (e.g. the trypsin exercise)
• DOTALL mode necessary to have . match across several lines
Text matched for the date 30/10:
30/10 </td>
<td valign=top> Practical matters. Introduction to Unix and Python and the dynamic interpreter. Read: <a href="emacs.html">Emacs tricks</a>, Course notes chapters <a href="Notes/ch01.pdf">1</a> and <a href="Notes/ch02.pdf">2</a>. </td>
<td valign=top> <a href="Exercises/fileorganization.html">File organization</a>, <a href="Exercises/unix.html">Pattern counting using Unix commands</a>, <a href="pythonmode.html">Python mode</a>, <a href="Exercises/mathfunctions.html">Math functions</a>, <a href="Exercises/functionsfile.html">Math functions in a file</a> </td>
<td valign=top> </td>
</tr>
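The date-entry pattern can be tried on a cut-down calendar. This is a sketch only: the two-entry calendar string is made up (the 24/11 entry and its trypsin link are hypothetical), and the helper name find_entry is not from the course files.

```python
import re

# Hypothetical, shortened calendar with two date entries.
calendar = """<tr body bgcolor="#ccddbb">
<td valign=top> Mon 30/10 </td>
<td valign=top> <a href="Exercises/unix.html">Pattern counting using Unix commands</a> </td>
</tr>
<tr body bgcolor="#ccddbb">
<td valign=top> Fri 24/11 </td>
<td valign=top> <a href="Exercises/trypsin.html">Trypsin</a> </td>
</tr>
"""


def find_entry(date, html):
    """Return the text from the date up to the end of its table row, or None."""
    pattern = r"\b%s\b.*?/tr>" % date
    match = re.search(pattern, html, re.DOTALL)
    return match.group() if match else None


print(find_entry("30/10", calendar))   # stops at the first </tr>
print(find_entry("4/1", calendar))     # None: \b rejects the 4/1 inside 24/11
print(re.search("4/1", calendar) is not None)  # without \b it would be found
```

The entry for 24/11 runs past Exercises/trypsin.html, because the /tr in that link is not followed by >.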

• Regular expressions:
  • r"\b%s\b.*?/tr>" % date (DOTALL mode)
  • r"(Exercises/\w+\.html).*?\b([ \w]+)"
• Use non-greedy version in the second expression in case several exercises are on the same line

regex_webpage.py
Several subgroups in pattern: findall returns a list of subgroup tuples
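The behaviour of findall with two subgroups can be sketched on part of the calendar entry for 30/10. This is not the code of regex_webpage.py itself, only an illustration using the exercise pattern quoted in the slides; the variable names are assumptions.

```python
import re

# Three links from the 30/10 entry, all on one line.
entry = ('<td valign=top> '
         '<a href="Exercises/fileorganization.html">File organization</a>, '
         '<a href="Exercises/unix.html">Pattern counting using Unix commands</a>, '
         '<a href="pythonmode.html">Python mode</a> </td>')

# Group 1 captures the file name, group 2 the link text that follows it.
exercises = re.findall(r"(Exercises/\w+\.html).*?\b([ \w]+)", entry)

# findall returns one (file, title) tuple per match.
for url, title in exercises:
    print(title, "->", url)
```

The link to pythonmode.html is skipped because it does not lie in the Exercises directory.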

Trial runs

.. on to the exercises
