1.5k likes | 1.67k Views
Sequence comparison and Phylogeny. based on Chapter 4 Lesk, Introduction to Bioinformatics. Contents. Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST
E N D
Sequence comparisonand Phylogeny based on Chapter 4 Lesk, Introduction to Bioinformatics
Contents Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST Significance of alignments Multiple sequence alignments Phylogenetic trees
Motivation From where are we? Recent Africa vs. Multi-regional Hypothese In 1999 Encephalitis caused by the West Nile Virus broke out in New York. How did the virus come to New York? How did the nucleus get into the eucaryotic cells? To answer such questions we will need sequence comparison and phylogenetic trees
Sequence Similarity Searches Sequence similarity can be clue to common evolutionary ancestor… E.g. globin genes in chimpanzees and humans … or common function E.g. v-sys onco genes in simian sarcoma virus leading to cancer in monkeys and the seemingly unrelated growth stimulating hormone PDGF, which stimulates cell growth (first success of similarity idea, 1983) In general: If an unknown sequences is found, deduce its function/structure indirectly by finding similar sequences, whose function/structure is known Assumption: Evolution changes sequences “slowly” often maintaining main features of a sequence’s function/structure
Sequence alignment Substitutions, insertions and deletions can be interpreted in evolutionary terms But: distinguish chance similarity and real biological relationship CCGTAA TCGTAA CCGTAT TCGTAC TTGTAA TAGTAC TCGTAG
Evolution Convergent evolution: same sequence evolved from different ancestors Back evolution - mutate to a previous sequence CCGTAA TCGTAA CCGTAT TAGTAA TCGTAC TAGTAC TAGTAC TCGTAG CCGTAA
Similarity vs. Homology Any sequence can be similar Sequences homologues if evolved from common ancestor Homologous sequences: Orthologs: similar biological function Paralogs: different biological function (after gene duplication), e.g. lysozyme and α-lactalbumin, a mammalian regulatory protein Assumption: Similarity indicator for homology Note, altered function of the expressed protein will determine if the organism will survive to reproduce, and hence pass on the altered gene
Sequence alignments Given two or more sequences, we wish to Measure their similarity Determine the residue-residue correspondences Observe patterns of conservation and variability Infer evolutionary relationships
What is the best alignment? Uninformative: -------gctgaacg ctataatc------- Without gaps: gctgaacg ctataatc With gaps: gctga-a--cg --ct-ataatc Another one: gctg-aa-cg -ctataatc- Formally: The best alignment have only a minimal number of mismatches (insertions, deletions, replace) We need a method to systematically explore and to compute alignments
Scores for an alignment Percentage of matches Score each match, mismatch, gap opening, gap extension Example match +1 mismatch -1 Gap opening -3 Gap extension -1 Uninformative: 0%, score= -21 -------gctgaacg ctataatc------- Without gaps:25%,score= -4 gctgaacg ctataatc With gaps: 0%, score= -23 gctga-a--cg --ct-ataatc Another one: 50%, score=-12 gctg-aa-cg -ctataatc-
Scores for an alignment Percentage of matches Score each match, mismatch, gap opening, gap extension Example match +2 mismatch -1 Gap opening -1 Gap extension -1 Uninformative: 0%, score= -17 -------gctgaacg ctataatc------- Without gaps:25%,score= -2 gctgaacg ctataatc With gaps: 0%, score= -15 gctga-a--cg --ct-ataatc Another one: 50%, score=0 gctg-aa-cg -ctataatc-
Dot plots A convenient way of comparing 2 sequences visually Use matrix, put 1 sequence on X-axis, 1 on Y-axis Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)
Dot plots D O R O T H Y C R O W F O O T H O D G K I N D O R O T H Y H O D G K I N
Dot plots D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N
Interpreting dot plots What do identical sequences look like? What do unrelated sequences look like? What do distantly related sequences look like? What does reverse sequence look like? Relevant for detections of stems in RNA structure What does a palindrome look like? Relevant for restriction enzymes What do repeats look like? What does a protein with domains A and B and another one with domains B and C look like?
Dot plot for identical sequences D O R O T H Y H O D G K I N D D D O O O O R R O O O O T T H H H Y Y H H H O O O O D D D G G K K I I N N
Dotplot for unrelated sequences D O R O T H Y H O D G K I N O O O O T T T T O O O O D D D I I E T T E R R
Dotplot for distantly related sequences D O R O T H Y H O D G K I N T T I I M O O O O T T H H H Y Y J E N N K K I I N N
Dotplot for reverse sequences D O R O T H Y H O D G K I N N N I I K K G G D D D O O O O H H H Y Y H H H T T O O O O R R O O O O D D
Dotplot for reverse sequences Relevant to identify stems in RNA structures Plot sequence against its reverse complement
Palindromes and restriction enzymes Madam, I'm Adam Able was I ere I saw Elba (supposedly said by Napoleon) Doc note I dissent, a fast never prevents a fatness, I diet on cod. Because DNA is double stranded and the strands run antiparallel, palindromes are defined as any double stranded DNA in which reading 5’ to 3’ both are the same The HindIII cutting site: 5'-AAGCTT-3' 3'-TTCGAA-5' The EcoRI cutting site: 5'-GAATTC-3' 3'-CTTAAG-5'
Dotplot of a Palindrome M A D A M M M M A A A D D A A A M M M
Dotplot of repeats T W E N T Y O N E T W E N T Y T W O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y T T T T T T W W W W O O O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y O O O N N N N E E E E
Dotplot of Repeats/Palindrome M A D A M I M A D A M M M M M M A A A A A D D D A A A A A M M M M M I I M M M M M A A A A A D D D A A A A A M M M M M
Dotplot for shared domain D O R O T H Y H O D G K I N D D O O O O R R O O O O T T H H H Y Y M I I L L E R R
Result Dot plot dorothycrowfoothodgkin d* * o * * * ** * r * * o * * * ** * t * * h * * y * h * * o * * * ** * d* * g * k * i * n *
Result Dot plot dorothycrowfoothodgkin d* * o * * * ** * r * * o * * * ** * t * * h * * y * h * * o * * * ** * d* * g * k * i * n *
Dotplots Window size 15 Dot if 6 matches in window
Window size 15 • Dot if • 6 matches in window
>gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64 Length = 216 Score = 271 bits (693), Expect = 1e-73 Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%) Query: 1 IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60 +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS Sbjct: 1 LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60 Query: 61 YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120 +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G Sbjct: 61 HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176 LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180 Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR GNS GVCGLY SS+YP KN Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn 121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns 181 wgtawgekgy irikrapgns pgvcglykss yyptkn
Window size 15 • Dot if • 6 matches in window
>gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine- Fluoromethylketone Length = 215 Score = 121 bits (303), Expect = 3e-28 Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%) Query: 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61 P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+ Sbjct: 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61 Query: 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115 GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q Sbjct: 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121 Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171 Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178 Query: 172 LIKNSWGTGWGENGYIRIKRGT 193 +IKNSW T WGE GYIRI +G+ Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200
Window size 15 • Dot if • 6 matches in window
gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X Length = 242 Score = 52.0 bits (123), Expect = 2e-07 Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%) Query: 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51 +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q Sbjct: 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60 Query: 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103 ++DC C GG S Q+GI Y+ + C + K + Sbjct: 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119 Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157 A + + +V Y + + AN P+S + A + Y GGI+ + Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178 Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199 +H V+ G+ G Y +++NSWG WGE G++RI K G G Y + Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229
Window size 5 • Dot if • 2 matches in window
Window size 1 • Dot if • 1 match in window
From Dotplots to Alignments Obvious best alignment: DOROTHYCROWFOOTHODGKIN DOROTHY--------HODGKIN D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N
From Dotplots to Alignments Find “best” path from top left corner to bottom right Moving “east” corresponds to “-” in the second sequence Moving “south” corresponds to “-” in the first sequence Moving “southeast” corresponds to a match if the characters are the same or a mismatch otherwise Can we automate this?
From Dotplots to Alignments Algorithm (Dynamic Programming): Insert a row 0 and column 0 initialised with 0 Starting from the top left, move down row by row from row 1 and right column by column from column 1 visiting each cell Consider The value of the cell north The value of the cell west The value of the cell northwest if the row/column character mismatch 1 + the value of the cell northwest if the row/column character match Put down the maximum of these values as the value for the current cell Trace back the path with the highest values from the bottom right to the top left and output the alignment
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Insert a row 0 and column 0 initialised with 0
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07 T 0 • Consider • Value north • Value west • Value northwest if the row/column character mismatch • 1 + value northwest if the row/column character match • Put down the maximum of these values for current celll
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4
Reading the Alignment 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 -tgcat-a- at-c-tgat
Reading the Alignment: there are more than one possibility 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 ---tgcata atctg-at-
Formally:Longest Common Subsequence LCS What is the length s(V,W) of the longest common subsequence of two sequencesV=v1..vn and W=w1..wm ? Find sequences of indices1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ msuch that vit = wjt for 1 ≤ t ≤ k How? Dynamic programming: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and si-1,jsi,j = maxsi,j-1si-1,j-1 + 1, if vi = wj Then s(V,W) = sn,m is the length of the LCS {
Example LCS 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T
Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Initialisation: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and