420 likes | 432 Views
C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Master Course Sequence Alignment Lecture 11 Sequence Motif Searches. Pattern matching.
E N D
C E N T E R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Master Course Sequence Alignment Lecture 11Sequence Motif Searches
Pattern matching • Functional genomics: finding out the function of all genes (and other parts) in a genome • Ability to recognise protein function paramount • Database searching is crucial strategy • trypsin has catalytic triad (His, Asp, Ser). How to recognize this? • (local) alignments not always suitable • short patterns, too many ‘don’t cares’, etc.
Degenerate DNA codes Four bases: A, C, G, T Two-fold degenerate IUB codes: • R=[AG] • Y=[CT] • K=[GT] • M=[AC] • S=[GC] • W=[AT] Four-fold degenerate: N=[AGCT]
Degenerate protein codes 20 bases: ACDEFGHIKLMNPQRSTVWY Degenerate codes: • X = unknown, all (20-fold degenerate) • B = [DE] • Z = [NQ]
Transcription factors • Required but not a part of the RNA polymerase complex • Many different roles in gene regulation • Binding • Interaction • Initiation • Enhancing • Repressing • Various structural classes (e.g. zinc finger domains) • Consist of both a DNA-binding domain and an interactive domain
Transcription factors Transcription factor – polymerase interaction sets off gene transcription… TF binding site TF mRNA transcription Pol II TATA mRNA transcription TF binding site (closed) TATA TF binding site (open) Nucleosomes (chromatin structures composed of histones) are structures round which DNA coils. This blocks access of TFs … many TFBSs are possible
Motifs • Short sequences of DNA (or RNA or amino acids) • Often consist of 5- 16 nucleotides • Protein motifs can be more variable • May contain gaps • Examples include: • Splice sites • Start/stop codons • Transmembrane domains • Centromeres • Phosphorylation sites • Coiled-coil domains • Transcription factor binding site (regulatory motifs)
pattern matching This lecture: • Regular expressions • Pre-processing data for pattern matching: suffix trees • Hidden Markov models (brief)
Rationale for regular expressions • “I want to see all sequences that ... • ... contain a C” • ... contain a C or an F” • ... contain a C and an F” • ... contain a C immediately followed by an F” • ... contain a C later followed by an F” • ... begin with a C” • ... do not contain a C” • ... contain at least three Cs” • ... contain exactly three Cs” • ... has a C at the seventh position” • ... either contain a C, an E, and an F in any order except CFE, unless there are also at most three Ps, or there is a ....
regular expressions • alphabet: set of symbols • {A, C, T, G} • string: sequence of symbols from alphabet • AACTG, CATG, GGA, ACFT, ε • regex: formal method to define (sub)set of strings • [^C].AG?T* • used for pattern matching • check if database sequence ∈ regex
contruction of a regex • regex contains: • symbols from alphabet • C→ { C } • operators • operations on regex(es) yield new regex • concatenation, union, repetition, ...
basic operators r1r2concatenation AC → { AC } AAC → { AAC } [s1s2 ... sn] union (of symbols) [ACG] → { A, C, G } [AC]G → { AG, CG } r1|r2union (of regexes) A|CC → { A, CC } [AC]|AC → { A, C, AC } r+ repeat once or more C+ → { C, CC, CCC, CCCC, ... } A[AC]+ → { AA, AC, AAA, AAC, ACA, ACC, AAAA, AAAC, ... }
derived operators r? optional C? → { ε, C } AC?G → { AG, ACG } r* repeat zero or more times C* → { ε, C, CC, CCC, CCCC, ... } A*C → { C, AC, AAC, AAAC, ... } [AC]* → { ε, A, C, AA, AC, CA, CC, AAA, AAC, ACA, ACC, ... } rn-mrepeat n – m times C4 → { CCCC } C2-4 → { CC, CCC, CCCC } C-3 → { ε, C, CC, CCC } C3- → { CCC, CCCC, CCCCC, ... }
miscellaneous . any symbol . → { A, C, G, T } A.C → { AAC, ACC, AGC, ATC } .? → { ε, A, C, G, T } .* → { ε, A, C, G, T, AA, AC, AG, AT, CA, CC, CG, CT, GA, ... } [^s1s2 ... sn] exclude symbols [^A] → { C, G, T } [^AC] → { G, T } (r) grouping (AC)? → { ε, AC } AC? → { A, AC } (AC)* → { ε, AC, ACAC, ACACAC, ACACACAC, ... } AC* → { A, AC, ACC, ACCC, ... }
limitations • regex cannot remember indeterminate counts !!! • “I want to see all sequences with ... • ... six Cs followed by six Ts” • C6T6 • ... any number of Cs followed by any number of Ts” • C*T* • ... Cs followed by an equal number of Ts” • CnTn • (CT|CCTT|CCCTTT|C4T4| ... )? • use (context-free) grammar
regexes in pattern matching • pattern described by regex • check if sequence ∈ regex • matching done very efficiently • O(n) • using state machine
state machines • compile regex to state machine • match sequence with regex AC*T|GGC
Example from BLAST:Determining Query Words • Given: • query sequence: QLNFSAGW • word length w = 3 • word score threshold T = 8 • Step 1: determine all words of length w in query sequence QLN LNF NFS FSA SAG AGW
Example from BLAST:Determining Query Words • Step 2: determine all words that score at least T when compared to a word in the query sequence: words from query words w/ T=8 sequence QLN QLN=11, QMD=9, HLN=8, ZLN=9,… LNF LNF=9, LBF=9, LBY=8, FNW=8,… NFS NFS=12, AFS=8, NYS=8, DFT=10,… … SAG none ...
Example from BLAST:Scanning the Database • search database for all occurrences of query words • approach: • build a DFA (deterministic finite-state automaton) that recognizes all query words • run DB sequences through DFA • remember hits
Example from BLAST: Scanning the Database • consider a DFA to recognize the query words: QL, QM, ZL • All that a DFA does is read strings, and output "accept" or "reject." • use Mealy paradigm (accept on transitions) to save space and time Moore paradigm: the alphabet is (a, b), the states are q0, q1, and q2, the start state is q0 (denoted by the arrow coming from nowhere), the only accepting state is q2 (denoted by the double ring around the state), and the transitions are the arrows. The machine works as follows. Given an input string, we start at the start state, and read in each character one at a time, jumping from state to state as directed by the transitions. When we run out of input, we check to see if we are in an accept state. If we are, then we accept. If not, we reject. Moore paradigm: accept/reject states Mealy paradigm: accept/reject transitions
Example from BLAST: a DFA to recognize query words: QL, QM, ZL Q Mealy paradigm not (L or M or Q) L or M start Q Z Z L not (L or Z) not (Q or Z) Accept on red transitions (Mealy paradigm)
other uses • many programs use regular expressions • command-line interpreter • del *.* • editor • search • replace • compilers • perl, grep, sed, awk • many different syntaxes
local vs. global matching • global: regex describes entire string to be matched • ACCCCTG C3- • local: regex describes substring to be matched • ACCCCTG C3- • ^ matches start-of-string • ^CG: match everything starting with CG • ^[^CG]: match everything not starting with C or G • $ matches end-of-string • AC$: match everything ending with AC
Regular expressions Alignment ADLGAVFALCDRYFQ SDVGPRSCFCERFYQ ADLGRTQNRCDRYYQ ADIGQPHSLCERYFQ Regular expression [AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q {PG} = not (P or G) For short sequence stretches, regular expressions are often more suitable to describe the information than alignments (or profiles)
Regular expressions Regular expression No. of exact matches in DB D-A-V-I-D 71 D-A-V-I-[DENQ] 252 [DENQ]-A-V-I-[DENQ] 925 [DENQ]-A-[VLI]-I-[DENQ] 2739 [DENQ]-[AG]-[VLI]2-[DENQ] 51506 D-A-V-E 1088
Motif-based function prediction • Prediction of protein functions based on identified sequence motifs • PROSITE contains patterns specific for more than a thousand protein families. • ScanPROSITE -- it allows to scan a protein sequence for occurrence of patterns and profiles stored in PROSITE http://www.expasy.org/prosite/
Prosite example: extended profile Acyl carrier protein phosphopantetheine domain profile. /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=71; /DISJOINT: DEFINITION=PROTECT; N1=6; N2=66; /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=2.3; R2=.02281121; TEXT='-LogE'; /CUT_OFF: LEVEL=0; SCORE=271; N_SCORE=8.5; MODE=1; TEXT='!'; /CUT_OFF: LEVEL=-1; SCORE=184; N_SCORE=6.5; MODE=1; TEXT='?'; /DEFAULT: D=-20; I=-20; B1=-80; E1=-80; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-1; A B C D E F G H I K L M N P Q R S T V W Y Z /I: B1=0; BI=-105; BD=-105; /M: SY='T'; M= -5,-15,-20,-17,-12,-10,-22,-18, 2,-13, -1, 0,-13, -6,-10,-13, -5, 4, 1,-23, -9,-12; /M: SY='E'; M= -6, -6,-22, -6, 9,-13,-21, -9,-11, 0, -8, -7, -7,-13, 1, 1, -4, -3, -8,-24,-10, 4; /M: SY='E'; M= -5, 9,-24, 11, 15,-24,-12, -3,-23, 3,-20,-15, 6, -9, 5, 1, 4, -2,-19,-29,-16, 9; /M: SY='E'; M= -5, 2,-26, 4, 8,-22,-13, -7,-21, 7,-17,-12, 0,-13, 3, 7, -2, -6,-16,-22,-12, 5; /M: SY='L'; M= -6,-27,-19,-30,-23, 4,-30,-23, 26,-25, 28, 17,-25,-27,-21,-20,-19, -5, 23,-23, -3,-23; /M: SY='R'; M= -3,-10,-10,-11, 2,-16,-19,-11,-13, -1, -8, -7, -8,-17, -1, 3, -5, -6, -9,-26,-13, -1; /M: SY='E'; M= -1, 3,-23, 4, 9,-24,-11, -7,-22, 8,-19,-13, 2,-11, 5, 6, 2, -2,-17,-26,-15, 7; /M: SY='I'; M= -5,-22,-20,-27,-19, -4,-29,-20, 20,-19, 13, 10,-18,-21,-14,-17,-15, -6, 14,-20, -4,-18; /M: SY='I'; M= -8,-30,-24,-33,-27, 8,-29,-26, 19,-24, 15, 9,-28,-27,-23,-21,-22,-10, 17, 9, 4,-25; /M: SY='A'; M= 11, -8, -8,-12, -5,-19,-11,-14,-14, -1,-14, -9, -6,-15, -4, -4, 2, -2, -6,-25,-15, -5; /M: SY='E'; M= -5, 10,-26, 15, 22,-28,-12, -2,-26, 6,-21,-16, 4, -8, 10, 0, 2, -6,-23,-28,-16, 16; /M: SY='V'; M= -5,-14,-15,-16, -6,-11,-23,-14, 4,-11, 0, 1,-13,-19, -5,-12, -7, -5, 6,-24, -8, -6; /M: SY='L'; M= -2,-24,-21,-26,-19, 5,-24,-20, 10,-23, 22, 7,-23,-24,-18,-19,-18, -7, 6, -7, 0,-18; /M: SY='G'; M= 3, -4,-25, -5, -4,-27, 24,-12,-29, -6,-25,-16, 1,-12, -4, -8, 5, -9,-23,-24,-22, -4; /M: SY='V'; M= -1,-12,-19,-14, -8,-11,-20,-14, 4,-12, 0, 1,-11,-18,-10,-13, -5, -2, 7,-25, -9,-10; /I: I=-4; MI=0; MD=-15; IM=0; /M: M= -2, -6,-13, -6, -5, -9,-11,-10, -2, -8, 0, -2, -7, -7, -7, -8, -5, -3, -1,-18, -8, -7; D=-3; /I: DM=-15; . . .
Suffix Trees • A suffix tree (also called PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string, allowing fast implementation of many important string operations. • A suffix is defined as the shortest sub-sequence starting at a given position that is unique in the complete sequence and can therefore be used to clearly identify that position. • The suffix tree for a string S is a tree whose edges are labelled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. • Constructing such a tree for the string S takes time and space linear in the length of S. • Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. • Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself. With what kind of strings would this not be the case?
Suffix Trees Given the string `mississippi', `miss' is a prefix, `ippi' is a suffix, & `issi' is a substring. Note that a substring is a prefix of a suffix. If txt=t1t2...ti...tn is a string, then Ti=titi+1...tn is the suffix of txt that starts at position i, e.g. T1 = mississippi = txt T2 = ississippi T3 = ssissippi T4 = sissippi T5 = issippi T6 = ssippi T7 = sippi T8 = ippi T9 = ppi T10 = pi T11 = i T12 = (empty)
Suffix Trees From http://www.allisons.org/ll/AlgDS/Tree/Suffix/, please study the information provided by this link! 11: i 8: ippi 5: issippi 2: ississippi 1: mississippi 10: pi 9: ppi 7: sippi 4: sissippi 6: ssippi 3: ssissippi |(1:mississippi)|leaf tree:| | | |(6:ssippi)|leaf | |(3:ssi)| | | |(9:ppi)|leaf |(2:i)| | |(9:ppi)|leaf | | | |(6:ssippi)|leaf | |(4:si)| | | |(9:ppi)|leaf |(3:s)| | | |(6:ssippi)|leaf | |(5:i)| | | |(9:ppi)|leaf | | |(10:pi)|leaf |(9:p)| | |(11:i)|leaf 7 branching nodes
Building a Suffix Tree Take a sequence: • Group all positions according to base (nucleotide, a.a.) type leading to the first level in the tree (symbol ‘$’ is often included to indicate the end of the string) • Regroup each group according to the following base, giving the second row of the tree • Continue this process and stop when a (sub)group only contains one sequence position (but include complete suffix)
Finding motifs in sequences • Many methods exist • MEME, Gibbs • Typically, these programs try and find motifs of a given length W in a set of N sequences. • Using probabilistic formalisms, they distinguish background residues, those not in the pattern, and residues at specific positions in the pattern.
Example of HMM repository:The PFAM Database Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures • Search with Hidden Markov Model (HMM) for each alignment
The PFAM Database Pfam is a database of two parts, the first is the curated part of Pfam containing about 9000 protein families (Pfam-A). Pfam-A comprises manually crafted multiple alignments and profile-HMMs . To give Pfam a more comprehensive coverage of known proteins we automatically generate a supplement called Pfam-B. This contains a large number of small families taken from the PRODOM database that do not overlap with Pfam-A. Although of lower quality Pfam-B families can be useful when no Pfam-A families are found.
The PFAM Database Sequence coverage Pfam-A : 74% (Yellow) Sequence coverage Pfam-B : 13% (Blue) Other (Grey) 74% of proteins have at least one match with Pfam. Version 21.0 - November 2006: Pfam-A contains 8957 families
Clan pages in Pfam. (A) A screen shot of a clan summary page, containing the description, annotation and membership of the clan. From this page, the user can view the family relationship diagram (B). Each family in the clan is represented by a blue box and its relationship to other families is represented by solid lines (significant profile–profile comparison score) or dashed lines (non-significant profile-profile comparison score). Beside each line, the profile–profile comparison E-value score is presented. This score is also linked to a visualization of the profile–profile comparison alignment (C). The clan summary page also provides a link to the clan alignment (D) (for more details see text). The clan alignment is a multiple sequence alignment of all of the clan members seed alignments (each set of seed sequences are separated by the alternate background shading). The alignments are coloured using Jalview.
HMM-based homology searching Transition probabilities and Emission probabilities Gapped HMMs also have insertion and deletion states
d1 d2 d3 d4 I0 I1 I2 I3 I4 m0 m1 m2 m3 m4 m5 End Start Profile HMM: m=match state, I-insert state, d=delete state; go from left to right. I and m states output amino acids; d states are ‘silent”. Transition probabilities and Emission probabilities
A hidden Markov model accompanying a PFAM alignment HMMER2.0 [2.2g] NAME cytochrome_b_N ACC PF00033 DESC Cytochrome b(N-terminal)/b6/petB LENG 222 ALPH Amino RF no CS no MAP yes COM hmmbuild -F HMM_ls.ann SEED.ann COM hmmcalibrate --seed 0 HMM_ls.ann NSEQ 8 DATE Thu Dec 12 02:48:53 2002 CKSUM 8731 GA -41.9 -41.9 TC -41.9 -41.9 NC -42.4 -42.4 XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455 NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -170.913223 0.138730 HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -300 * -2414 1 -2605 -2478 -3823 -3810 -1719 -3245 -3021 -1267 -3347 -794 5096 -3495 -3580 -3266 -3203 -3106 -2761 -1627 -2687 -2398 1 • -564 -3141 -2265 -289 -2463 -701 -1378 -1300 –8788 2 1405 -805 -1720 -1394 -2431 567 -1366 -2119 -1298 -2328 -1482 -1065 -1670 -1131 -1568 1592 2088 -1424 -2629 -2251 3 • -148 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 – HMMs are good for profile searches… but optimising the many parameters when using HMMs to do alignments from scratch is a problem.