1.15k likes | 1.69k Views
Sequence comparison. based on Chapter 4 Lesk, Introduction to Bioinformatics. Contents. Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST Significance of alignments
E N D
Sequence comparison based on Chapter 4 Lesk, Introduction to Bioinformatics
Contents Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST Significance of alignments Multiple sequence alignments
Motivation From where are we? Recent Africa vs. Multi-regional Hypothese In 1999 Encephalitis caused by the West Nile Virus broke out in New York. How did the virus come to New York? How did the nucleus get into the eucaryotic cells? To answer such questions we will need sequence comparison and phylogenetic trees
Sequence alignment Substitutions, insertions and deletions can be interpreted in evolutionary terms But: distinguish chance similarity and real biological relationship CCGTAA TCGTAA CCGTAT TCGTAC TTGTAA TAGTAC TCGTAG
Evolution Convergent evolution: same sequence evolved from different ancestors Back evolution - mutate to a previous sequence CCGTAA TCGTAA CCGTAT TAGTAA TCGTAC TAGTAC TAGTAC TCGTAG CCGTAA
Sequence alignments Given two or more sequences, we wish to Measure their similarity Determine the residue-residue correspondences Observe patterns of conservation and variability Infer evolutionary relationships
What is the best alignment? Uninformative: -------gctgaacgctataatc------- Without gaps: gctgaacgctataatc With gaps: gctga-a--cg --ct-ataatc Another one: gctg-aa-cg -ctataatc- Formally: The best alignments have only a minimal number of mismatches (insertions, deletions, replace) We need a method to systematically explore and to compute alignments
Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +1 mismatch -1 Gap opening -3 Gap extension -1 Uninformative: 0%, score= -19-------gctgaacgctataatc------- Without gaps: 25%, score= -4 gctgaacgctataatc With gaps: 0%, score= -19gctga-a--cg --ct-ataatc Another one: 50%, score= -8 gctg-aa-cg -ctataatc-
Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +2 mismatch -1 Gap opening -1 Gap extension -1 Uninformative: 0%, score= -15-------gctgaacgctataatc------- Without gaps: 25%, score= -2gctgaacgctataatc With gaps: 0%, score= -11gctga-a--cg --ct-ataatc Another one: 50%, score= 5gctg-aa-cg -ctataatc-
Dot plots A convenient way of comparing 2 sequences visually Use matrix, put 1 sequence on X-axis, 1 on Y-axis Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)
Dot plots D O R O T H Y C R O W F O O T H O D G K I N D O R O T H Y H O D G K I N
Dot plots D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N
Interpreting dot plots What do identical sequences look like? What do unrelated sequences look like? What do distantly related sequences look like? What does reverse sequence look like? Relevant for detections of stems in RNA structure What does a palindrome look like? Relevant for restriction enzymes What do repeats look like? What does a protein with domains A and B and another one with domains B and C look like?
Dot plot for identical sequences D O R O T H Y H O D G K I N D D D O O O O R R O O O O T T H H H Y Y H H H O O O O D D D G G K K I I N N
Dotplot for unrelated sequences D O R O T H Y H O D G K I N O O O O T T T T O O O O D D D I I E T T E R R
Dotplot for distantly related sequences D O R O T H Y H O D G K I N T T I I M O O O O T T H H H Y Y J E N N K K I I N N
Dotplot for reverse sequences Relevant to identify stems in RNA structures Plot sequence against its reverse complement
Dotplot for reverse sequences D O R O T H Y H O D G K I N N N I I K K G G D D D O O O O H H H Y Y H H H T T O O O O R R O O O O D D
Dotplot of a Palindrome M A D A M M M M A A A D D A A A M M M
Dotplot of repeats T W E N T Y O N E T W E N T Y T W O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y T T T T T T W W W W O O O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y O O O N N N N E E E E
Dotplot of Repeats/Palindrome M A D A M I M A D A M M M M M M A A A A A D D D A A A A A M M M M M I I M M M M M A A A A A D D D A A A A A M M M M M
Dotplot for shared domain D O R O T H Y H O D G K I N D D O O O O R R O O O O T T H H H Y Y M I I L L E R R
Result Dot plot dorothycrowfoothodgkin d* * o * * * ** * r * * o * * * ** * t * * h * * y * h * * o * * * ** * d* * g * k * i * n *
Dotplots Window size 15 Dot if 6 matches in window
Window size 15 • Dot if • 6 matches in window Cacain and Caricain, two proteases from papaya
Cacain and Caricain, two proteases from papaya >gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64 Length = 216 Score = 271 bits (693), Expect = 1e-73 Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%) Query: 1 IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60 +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS Sbjct: 1 LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60 Query: 61 YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120 +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G Sbjct: 61 HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176 LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180 Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR GNS GVCGLY SS+YP KN Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn 121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns 181 wgtawgekgy irikrapgns pgvcglykss yyptkn
Window size 15 • Dot if • 6 matches in window Cacain and Cruzain, a protease from Trypanosoma cruzi
>gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine- Fluoromethylketone Length = 215 Score = 121 bits (303), Expect = 3e-28 Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%) Query: 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61 P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+ Sbjct: 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61 Query: 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115 GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q Sbjct: 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121 Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171 Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178 Query: 172 LIKNSWGTGWGENGYIRIKRGT 193 +IKNSW T WGE GYIRI +G+ Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200 Cacain and Cruzain, a protease from Trypanosoma
Window size 15 • Dot if • 6 matches in window Cacain and Cathepsin, a human protease
gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X Length = 242 Score = 52.0 bits (123), Expect = 2e-07 Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%) Query: 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51 +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q Sbjct: 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60 Query: 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103 ++DC C GG S Q+GI Y+ + C + K + Sbjct: 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119 Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157 A + + +V Y + + AN P+S + A + Y GGI+ + Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178 Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199 +H V+ G+ G Y +++NSWG WGE G++RI K G G Y + Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229 Cacain and Cathepsin, a human protease
Window size 5 • Dot if • 2 matches in window Cacain and Cathepsin, a human protease
Window size 1 • Dot if • 1 match in window Cacain and Cathepsin, a human protease
From Dotplots to Alignments Obvious best alignment: DOROTHYCROWFOOTHODGKIN DOROTHY--------HODGKIN D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N
From Dotplots to Alignments Find “best” path from top left corner to bottom right Moving “east” corresponds to “-” in the second sequence Moving “south” corresponds to “-” in the first sequence Moving “southeast” corresponds to a match (if the characters are the same) or a mismatch (otherwise) Can we automate this?
From Dotplots to Alignments Algorithm (Dynamic Programming): Insert a row 0 and column 0 initialised with 0 Starting from the top left, move down row by row from row 1 and right column by column from column 1 visiting each cell Consider The value of the cell north The value of the cell west The value of the cell northwest if the row/column character mismatch 1 + the value of the cell northwest if the row/column character match Put down the maximum of these values as the value for the current cell Trace back the path with the highest values from the bottom right to the top left and output the alignment
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Insert a row 0 and column 0 initialised with 0
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 2 T 03 C 04 T 05 G 06 A 07 T 0 0 0 1 1 1 • Consider • Value north • Value west • Value northwest if the row/column character mismatch • 1 + value northwest if the row/column character match • Put down the maximum of these values for current celll
From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4
Reading the Alignment 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 -tgcat-a- at-c-tgat
Reading the Alignment: there are more than one possibility 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 ---tgcata atctg-at-
Formally:Longest Common Subsequence LCS What is the length s(V,W) of the longest common subsequence of two sequencesV=v1..vn and W=w1..wm ? Find sequences of indices1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ msuch that vit = wjt for 1 ≤ t ≤ k How? Dynamic programming: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and si-1,jsi,j = maxsi,j-1si-1,j-1 + 1, if vi = wj Then s(V,W) = sn,m is the length of the LCS {
Example LCS 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T
Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Initialisation: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m
Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07 T 0 Computing each cell: si-1,jsi,j= maxsi,j-1si-1,j-1 + 1, if vi = wj {
Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 Computing each cell: si-1,jsi,j= maxsi,j-1si-1,j-1 + 1, if vi = wj {
LCS Algorithm LCS(V,W) For i = 0 to n si,0 = 0 For j = 0 to m s0,j = 0 For i = 1 to n For j = 1 to m If vi = wjand si-1,j-1 +1≥si-1,j and si-1,j-1 +1≥si,j-1 Then si,j= si-1,j-1 +1 bi,j= North West Else if si-1,j ≥si,j-1 Then si,j= si-1,j bi,j= North Else si,j= si,j-1 bi,j= West Return s and b Complexity: LCS has quadratic complexity: O(n m)
Printing the alignment of LCS PRINT-LCS(b,V,i,j) If i=0 or j=0 Then Return If bi,j = North West Then PRINT-LCS(V,b,i-1,j-1) Print vi Else if bi,j = North Then PRINT-LCS(V,b,i-1,j) Else PRINT-LCS(V,b,i,j-1)