Sequence comparison

Sequence comparison based on Chapter 4 Lesk, Introduction to Bioinformatics

Contents Motivation Sequence comparison and alignments Dot plots Dynamic programming Substitution matrices Dynamic programming: Local and global alignments and gaps BLAST Significance of alignments Multiple sequence alignments

Motivation From where are we? Recent Africa vs. Multi-regional Hypothese In 1999 Encephalitis caused by the West Nile Virus broke out in New York. How did the virus come to New York? How did the nucleus get into the eucaryotic cells? To answer such questions we will need sequence comparison and phylogenetic trees

Sequence alignment Substitutions, insertions and deletions can be interpreted in evolutionary terms But: distinguish chance similarity and real biological relationship CCGTAA TCGTAA CCGTAT TCGTAC TTGTAA TAGTAC TCGTAG

Evolution Convergent evolution: same sequence evolved from different ancestors Back evolution - mutate to a previous sequence CCGTAA TCGTAA CCGTAT TAGTAA TCGTAC TAGTAC TAGTAC TCGTAG CCGTAA

Sequence alignments Given two or more sequences, we wish to Measure their similarity Determine the residue-residue correspondences Observe patterns of conservation and variability Infer evolutionary relationships

What is the best alignment? Uninformative: -------gctgaacgctataatc------- Without gaps: gctgaacgctataatc With gaps: gctga-a--cg --ct-ataatc Another one: gctg-aa-cg -ctataatc- Formally: The best alignments have only a minimal number of mismatches (insertions, deletions, replace) We need a method to systematically explore and to compute alignments

Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +1 mismatch -1 Gap opening -3 Gap extension -1 Uninformative: 0%, score= -19-------gctgaacgctataatc------- Without gaps: 25%, score= -4 gctgaacgctataatc With gaps: 0%, score= -19gctga-a--cg --ct-ataatc Another one: 50%, score= -8 gctg-aa-cg -ctataatc-

Scores for an alignment Sequence Identity: Percentage of matches Score each match, mismatch, gap opening, gap extension [attg] a t t g [accc] a c - - Example match +2 mismatch -1 Gap opening -1 Gap extension -1 Uninformative: 0%, score= -15-------gctgaacgctataatc------- Without gaps: 25%, score= -2gctgaacgctataatc With gaps: 0%, score= -11gctga-a--cg --ct-ataatc Another one: 50%, score= 5gctg-aa-cg -ctataatc-

Dot plots

Dot plots A convenient way of comparing 2 sequences visually Use matrix, put 1 sequence on X-axis, 1 on Y-axis Cells with identical characters filled with a ‘1’, non-identical with ‘0’ (simplest scheme - could have weights)

Dot plots D O R O T H Y C R O W F O O T H O D G K I N D O R O T H Y H O D G K I N

Dot plots D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N

Interpreting dot plots What do identical sequences look like? What do unrelated sequences look like? What do distantly related sequences look like? What does reverse sequence look like? Relevant for detections of stems in RNA structure What does a palindrome look like? Relevant for restriction enzymes What do repeats look like? What does a protein with domains A and B and another one with domains B and C look like?

Dot plot for identical sequences D O R O T H Y H O D G K I N D D D O O O O R R O O O O T T H H H Y Y H H H O O O O D D D G G K K I I N N

Dotplot for unrelated sequences D O R O T H Y H O D G K I N O O O O T T T T O O O O D D D I I E T T E R R

Dotplot for distantly related sequences D O R O T H Y H O D G K I N T T I I M O O O O T T H H H Y Y J E N N K K I I N N

Dotplot for reverse sequences Relevant to identify stems in RNA structures Plot sequence against its reverse complement

Dotplot for reverse sequences D O R O T H Y H O D G K I N N N I I K K G G D D D O O O O H H H Y Y H H H T T O O O O R R O O O O D D

Dotplot of a Palindrome M A D A M M M M A A A D D A A A M M M

Dotplot of repeats T W E N T Y O N E T W E N T Y T W O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y T T T T T T W W W W O O O T T T T T T W W W W E E E E N N N N T T T T T T Y Y Y O O O N N N N E E E E

Dotplot of Repeats/Palindrome M A D A M I M A D A M M M M M M A A A A A D D D A A A A A M M M M M I I M M M M M A A A A A D D D A A A A A M M M M M

Dotplot for shared domain D O R O T H Y H O D G K I N D D O O O O R R O O O O T T H H H Y Y M I I L L E R R

Result Dot plot dorothycrowfoothodgkin d* * o * * * ** * r * * o * * * ** * t * * h * * y * h * * o * * * ** * d* * g * k * i * n *

Dotplots Window size 15 Dot if 6 matches in window

Window size 15 • Dot if • 6 matches in window Cacain and Caricain, two proteases from papaya

Cacain and Caricain, two proteases from papaya >gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64 Length = 216 Score = 271 bits (693), Expect = 1e-73 Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%) Query: 1 IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60 +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS Sbjct: 1 LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60 Query: 61 YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120 +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G Sbjct: 61 HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176 LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180 Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR GNS GVCGLY SS+YP KN Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn 121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns 181 wgtawgekgy irikrapgns pgvcglykss yyptkn

Window size 15 • Dot if • 6 matches in window Cacain and Cruzain, a protease from Trypanosoma cruzi

>gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine- Fluoromethylketone Length = 215 Score = 121 bits (303), Expect = 3e-28 Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%) Query: 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61 P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+ Sbjct: 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61 Query: 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115 GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q Sbjct: 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121 Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171 Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178 Query: 172 LIKNSWGTGWGENGYIRIKRGT 193 +IKNSW T WGE GYIRI +G+ Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200 Cacain and Cruzain, a protease from Trypanosoma

Window size 15 • Dot if • 6 matches in window Cacain and Cathepsin, a human protease

gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X Length = 242 Score = 52.0 bits (123), Expect = 2e-07 Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%) Query: 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51 +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q Sbjct: 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60 Query: 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103 ++DC C GG S Q+GI Y+ + C + K + Sbjct: 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119 Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157 A + + +V Y + + AN P+S + A + Y GGI+ + Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178 Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199 +H V+ G+ G Y +++NSWG WGE G++RI K G G Y + Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229 Cacain and Cathepsin, a human protease

Window size 5 • Dot if • 2 matches in window Cacain and Cathepsin, a human protease

Window size 1 • Dot if • 1 match in window Cacain and Cathepsin, a human protease

Dynamic programming

From Dotplots to Alignments Obvious best alignment: DOROTHYCROWFOOTHODGKIN DOROTHY--------HODGKIN D O R O T H Y C R O W F O O T H O D G K I N D D D O O O O O O O R R R O O O O O O O T T T H H H Y Y H H H O O O O O O O D D D G G K K I I N N

From Dotplots to Alignments Find “best” path from top left corner to bottom right Moving “east” corresponds to “-” in the second sequence Moving “south” corresponds to “-” in the first sequence Moving “southeast” corresponds to a match (if the characters are the same) or a mismatch (otherwise) Can we automate this?

From Dotplots to Alignments Algorithm (Dynamic Programming): Insert a row 0 and column 0 initialised with 0 Starting from the top left, move down row by row from row 1 and right column by column from column 1 visiting each cell Consider The value of the cell north The value of the cell west The value of the cell northwest if the row/column character mismatch 1 + the value of the cell northwest if the row/column character match Put down the maximum of these values as the value for the current cell Trace back the path with the highest values from the bottom right to the top left and output the alignment

From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T

From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Insert a row 0 and column 0 initialised with 0

From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 2 T 03 C 04 T 05 G 06 A 07 T 0 0 0 1 1 1 • Consider • Value north • Value west • Value northwest if the row/column character mismatch • 1 + value northwest if the row/column character match • Put down the maximum of these values for current celll

From Dotplots to Alignments 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4

Reading the Alignment 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 -tgcat-a- at-c-tgat

Reading the Alignment: there are more than one possibility 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 ---tgcata atctg-at-

Formally:Longest Common Subsequence LCS What is the length s(V,W) of the longest common subsequence of two sequencesV=v1..vn and W=w1..wm ? Find sequences of indices1 ≤ i1 < … < ik ≤ n and 1 ≤ j1 < … < jk ≤ msuch that vit = wjt for 1 ≤ t ≤ k How? Dynamic programming: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m and si-1,jsi,j = maxsi,j-1si-1,j-1 + 1, if vi = wj Then s(V,W) = sn,m is the length of the LCS {

Example LCS 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3 C4 T5 G6 A7 T

Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 02 T 03 C 04 T 05 G 06 A 07 T 0 Initialisation: si,0 = s0,j = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m

Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07 T 0 Computing each cell: si-1,jsi,j= maxsi,j-1si-1,j-1 + 1, if vi = wj {

Example LCS: 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01 A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23 C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35 G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47 T 0 1 2 2 3 4 4 Computing each cell: si-1,jsi,j= maxsi,j-1si-1,j-1 + 1, if vi = wj {

LCS Algorithm LCS(V,W) For i = 0 to n si,0 = 0 For j = 0 to m s0,j = 0 For i = 1 to n For j = 1 to m If vi = wjand si-1,j-1 +1≥si-1,j and si-1,j-1 +1≥si,j-1 Then si,j= si-1,j-1 +1 bi,j= North West Else if si-1,j ≥si,j-1 Then si,j= si-1,j bi,j= North Else si,j= si,j-1 bi,j= West Return s and b Complexity: LCS has quadratic complexity: O(n m)

Printing the alignment of LCS PRINT-LCS(b,V,i,j) If i=0 or j=0 Then Return If bi,j = North West Then PRINT-LCS(V,b,i-1,j-1) Print vi Else if bi,j = North Then PRINT-LCS(V,b,i-1,j) Else PRINT-LCS(V,b,i,j-1)

Sequence comparison

Sequence comparison

Presentation Transcript

Sequence comparison: More dynamic programming

Sequence Comparison

Sequence comparison: Score matrices

Sequence Comparison

Sequence comparison and Phylogeny

Sequence Comparison

Sequence comparison: Local alignment

Sequence Comparison

Pairwise Sequence Comparison

Multiple sequence comparison (MSC)

Pairwise sequence comparison

CAP5510 – Bioinformatics Sequence Comparison

Sequence comparison: Dynamic programming

Part I : SEQUENCE COMPARISON

Part II : Sequence Comparison Multiple Sequence Alignment

Sequence comparison: Dynamic programming

Sequence comparison: Local alignment

CAP5510 – Bioinformatics Sequence Comparison