Multiple Sequence Alignment by Iterative Tree-Neighbor Alignments
Susan Bibeault
June 9, 2000
Outline
• Problem Statement and Importance
• Terminology
• Current Approaches
• Our Alignment Heuristic
• Performance Results
• Conclusions
• Future Work
Multiple Sequence Alignment
• Problem
  • Given Sequence Set:
      VLSPADNVKAAWGKVGAHAGEYGAEALERMF
      VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY
      GLSDGEWQLVLNVWGKVEADIPGHVLIRLFK
  • Insert gaps into sequences so that evolutionarily conserved regions are aligned
  • Example alignments:
      V-LSPADN--VKAAWGKVGAHAGEYGAEALERM---F-
      VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
      G-LSDGEWQLVLNVWGKVEA---DIPGHVLIRL---FK

      -VLSPADN--VKAAWGKVGAHAGEYGAEALERMF----
      VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
      -GLSDGEWQLVLNVWGKVEA---DIPGHVLIRLFK---
• Important tool
  • Relate Homologous Proteins
  • Discover Conserved Regions
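Scoring Multiple Alignments
• Sum of Pairs: total of cost(i,j) over all pairs of sequences i, j
• Tree based: total of cost(edge) over all edges of the tree
[Figure: tree over gorilla, human, orangutan, chimpanzee, and gibbon; figure labels: cost(i,j) = 6, cost(edge) = 1m]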
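The two scoring schemes can be compared on a toy example. The following is a minimal sketch, assuming a unit mismatch cost and a unit gap cost (the slide does not give the actual cost function); `column_cost`, `pair_cost`, `sum_of_pairs`, and `tree_cost` are hypothetical names introduced only for illustration.

```python
from itertools import combinations

def column_cost(a, b, mismatch=1, gap=1):
    """Cost of one aligned column pair (assumed unit costs, not from the slides)."""
    if a == b:
        return 0          # identical residues, or two gaps
    if a == "-" or b == "-":
        return gap        # residue aligned against a gap
    return mismatch       # two different residues

def pair_cost(row_a, row_b):
    """cost(i,j): sum of column costs between two rows of an alignment."""
    return sum(column_cost(a, b) for a, b in zip(row_a, row_b))

def sum_of_pairs(rows):
    """Sum-of-pairs score: add cost(i,j) over every pair of rows."""
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))

def tree_cost(rows_by_node, edges):
    """Tree-based score: add cost(edge) over the edges of a given tree."""
    return sum(pair_cost(rows_by_node[u], rows_by_node[v]) for u, v in edges)

leaves = {"a": "AC-T", "b": "ACGT", "c": "AGGT"}
# Hypothetical interior node "x" joining the three leaves in a star tree.
nodes = dict(leaves, x="ACGT")
star = [("x", "a"), ("x", "b"), ("x", "c")]

sp = sum_of_pairs(list(leaves.values()))
tc = tree_cost(nodes, star)
```

The sum-of-pairs score counts every pair of sequences, while the tree-based score counts only pairs joined by a tree edge, so it needs sequences at interior nodes as well as at the leaves.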
Scoring Alignments
• Cost Matrix: C(aa1, aa2)
• Gap Penalties:
  • Simple: C(aa, -)
  • Affine: C(-) + Len * C(aa, -)
• Example sequences:
    V L S P A D N V K A
    G L S D G E W Q L V L
• Recurrence, with gap cost g:
    Cost(s[1..i], t[1..j]) = min(
        Cost(s[1..i], t[1..j-1]) + g,
        Cost(s[1..i-1], t[1..j-1]) + C(s[i], t[j]),
        Cost(s[1..i-1], t[1..j]) + g)
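The pairwise recurrence on this slide can be sketched as a standard dynamic program. This is a minimal illustration, assuming costs are minimized, a constant per-position gap cost `g`, and a caller-supplied substitution cost function; the function names are hypothetical and the unit cost used in the example is an assumption, not the slides' cost matrix.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Affine penalty from the slide: C(-) + Len * C(aa,-)."""
    return gap_open + length * gap_extend if length > 0 else 0

def global_alignment_cost(s, t, sub_cost, g):
    """Minimum cost of globally aligning s and t with a simple gap cost g."""
    rows, cols = len(s) + 1, len(t) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * g                      # s[1..i] against all gaps
    for j in range(1, cols):
        cost[0][j] = j * g                      # t[1..j] against all gaps
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i][j - 1] + g,                               # gap in s
                cost[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # match/mismatch
                cost[i - 1][j] + g,                               # gap in t
            )
    return cost[-1][-1]

def unit_cost(a, b):
    """Assumed substitution cost: 0 for identical residues, 1 otherwise."""
    return 0 if a == b else 1

example = global_alignment_cost("VLSPADNVKA", "GLSDGEWQLVL", unit_cost, 1)
```

The table uses simple gap costs only; `affine_gap_cost` shows how the affine penalty would be computed for a single gap of a given length, and using it inside the recurrence would need the usual extra state for open gaps.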
Current Approaches
Global Alignment
    ABCDEFGHI
    :::: ::::
    ABCD-FGHI
Local Alignment
    XXXABCDYYY
       ::::
    ZZZABCDEEEE
• Global Methods
  • Optimal Algorithms (MSA, MWT, MUSEQAL)
  • Progressive (MULTALIGN, PILEUP, CLUSTAL, MULTAL, AMULT, DFALIGN, MAP, PRRP, AMPS)
• Local methods
  • PIMA, DIALIGN, PRALIGN, MACAW, BlockMaker, Iteralign
• Combined (GENALIGN, ASSEMBLE, DCA)
• Statistical (HMMT, SAGA, SAM, Match Box)
• Parsimony (MALIGN, TreeAlign)
Our Heuristic
• Distance Estimation
• Tree Construction
• Node Initialization
• Tree Partitioning
• Iteration
Estimation of Protein Distance
• Aligned Sequences
• Estimated Pair Distances
• Issue: Implied vs. Optimal Pair Alignments

Pair alignments of the same two sequences:
    PESLALYNKFSIKSDVW
    PEALNYGRY-SSESDVW

    PESLALYNKF---SIKSDVW
    PEALNYGRY----SSESDVW

    PESLALYNKFSIKSDVW
    PEAL-NYGRYSSESDVW

Aligned sequences:
    PEAAALYGRFT---IKSDVW
    PESAALYGRFT---IKSDVW
    PESLALYNKF---SIKSDVW
    PEALNYGRY----SSESDVW
    PEALNYGWY----SSESDVW
    PEVIRMQDDNPFSFSQSDVY

Pair taken from the aligned sequences:
    PEALNYGWY----SSESDVW
    PEVIRMQDDNPFSFSQSDVY
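The difference between an implied and an optimal pair alignment can be made concrete by projecting two rows out of a multiple alignment. This is a minimal sketch: dropping gap-only columns is the standard way to obtain the implied pair alignment, while the distance measure used here (fraction of mismatched residue pairs) is an assumption, since the slides do not state how pair distances are estimated. Both function names are hypothetical.

```python
def implied_pair(row_a, row_b):
    """Project two rows of a multiple alignment onto a pair alignment
    by dropping the columns where both rows hold a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

def p_distance(row_a, row_b):
    """Assumed distance estimate: mismatches / compared residue pairs,
    ignoring columns that contain a gap."""
    a, b = implied_pair(row_a, row_b)
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0        # nothing comparable: treat as maximally distant
    mismatches = sum(x != y for x, y in compared)
    return mismatches / len(compared)

# Two rows from the aligned set on this slide.
pair = implied_pair("PESLALYNKF---SIKSDVW", "PEALNYGRY----SSESDVW")
distance = p_distance("PESLALYNKF---SIKSDVW", "PEALNYGRY----SSESDVW")
```

A distance computed from the implied pair can differ from one computed on an optimal pair alignment of the same two sequences, which is the issue the slide raises.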
Interior Node Classification
• Interior Nodes Classified by Percent Identity
  • PID = (# matched residues) / (# total residues)
• User Specified Tiers
• User Specified Cost Criterion
• Example:
  • PID > 60%: PAM 40, High Gap Penalties
  • PID > 40%: PAM 120, Medium Gap Penalties
  • PID < 40%: PAM 200, Low Gap Penalties
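The tiered classification can be sketched as a lookup on percent identity. This is a minimal sketch under stated assumptions: the denominator of PID is taken to be the number of columns that are not gap-only (the slide only says "total residues"), a PID of exactly 40% falls into the lowest tier (the slide leaves that case open), and the tier table simply mirrors the example on the slide. All names are hypothetical.

```python
# Example tiers from the slide: (PID threshold, matrix, gap penalty level).
TIERS = [(0.60, "PAM40", "high"), (0.40, "PAM120", "medium")]
DEFAULT_TIER = ("PAM200", "low")

def percent_identity(row_a, row_b):
    """PID = matched residues / total columns, skipping gap-only columns
    (the choice of denominator is an assumption)."""
    matched = total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue
        total += 1
        if a == b:
            matched += 1
    return matched / total if total else 0.0

def choose_cost_criterion(pid, tiers=TIERS, default=DEFAULT_TIER):
    """Return (substitution matrix, gap penalty level) for a given PID."""
    for threshold, matrix, gap_level in tiers:
        if pid > threshold:
            return matrix, gap_level
    return default

pid = percent_identity("PEALNYGRY-SSESDVW", "PESLALYNKFSIKSDVW")
criterion = choose_cost_criterion(pid)
```

Because the tiers are passed in as data, user-specified thresholds and cost criteria only require a different table.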
Ordering Alignments
• Isolate Sub Trees
  • Threshold PID
• Order Alignments
  • Sub Tree
  • Border Nodes
  • Integrate All
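One way to picture the sub-tree isolation step is to cut tree edges whose percent identity falls below the threshold and treat the remaining connected pieces as sub trees. This is only a sketch of that idea: the slides do not spell out the partitioning rule, so the cut criterion, the definition of border nodes as endpoints of cut edges, and the name `partition_tree` are all assumptions.

```python
def partition_tree(nodes, edge_pid, threshold):
    """Split a tree into sub trees by removing edges with PID below threshold.

    nodes:     iterable of node names
    edge_pid:  dict mapping (u, v) edges to a percent identity in [0, 1]
    Returns (list of sub trees as sorted node lists, set of border nodes).
    """
    adjacency = {n: [] for n in nodes}
    border = set()
    for (u, v), pid in edge_pid.items():
        if pid >= threshold:
            adjacency[u].append(v)
            adjacency[v].append(u)
        else:
            border.update((u, v))     # endpoints of a cut edge
    seen, subtrees = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:                   # depth-first walk over kept edges
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        subtrees.append(sorted(component))
    return subtrees, border

tree_nodes = ["a", "b", "x", "c", "d", "y"]
tree_edges = {("a", "x"): 0.8, ("b", "x"): 0.7, ("x", "y"): 0.3,
              ("c", "y"): 0.9, ("d", "y"): 0.65}
subtrees, border = partition_tree(tree_nodes, tree_edges, threshold=0.5)
```

Under this reading, alignments would be processed within each sub tree first, then at the border nodes, and finally across the whole tree, following the order listed on the slide.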
Interior Alignments
• Sum of Pairs Bounded Search
• Implementation
  • Modular
  • Reentrant
• Flexible Cost Criterion
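Generating Consensus
• Alignment (A1, A2, A3) → Consensus X
• Min ( D1(A1,X) + D2(A2,X) + D3(A3,X) )
• For Each Position i: Xi = the symbol x that gives
    Min ( cost(x, A1i) + cost(x, A2i) + cost(x, A3i) )
[Figure: node X joined to its three neighbors A1, A2, A3 by edges D1, D2, D3]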
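The per-position rule for the consensus can be sketched directly. This is a minimal sketch, assuming a unit cost between symbols and an alphabet that includes the gap character; the real cost criterion is user-specified in the slides, and ties are broken here by alphabet order, which is an arbitrary choice. The name `consensus` is hypothetical.

```python
def consensus(rows, cost, alphabet):
    """For each column i, pick the symbol x minimizing the summed cost
    cost(x, A1[i]) + cost(x, A2[i]) + ... over all aligned rows."""
    result = []
    for column in zip(*rows):
        best = min(alphabet, key=lambda x: sum(cost(x, c) for c in column))
        result.append(best)
    return "".join(result)

def unit_cost(a, b):
    """Assumed cost: 0 for identical symbols (gaps included), 1 otherwise."""
    return 0 if a == b else 1

x = consensus(["ACG-", "ACGT", "AGGT"], unit_cost, alphabet="ACGT-")
```

With three neighbors and a unit cost this reduces to a majority vote per column; a substitution-matrix cost would instead favor symbols that are cheap to mutate into all three neighbors.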
Testing the Method
• BAliBASE benchmark
  • “Correct” Alignments
  • Core Blocks of Conserved Motifs
  • Typical “Hard Problem” Sets
• Protein Parsimony
  • Measures “Evolutionary Steps” of Alignment
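The two BAliBASE measures reported in the results (SP and TC) compare a test alignment against a reference alignment. The following is a simplified sketch of the usual definitions: SP as the fraction of reference residue pairs that the test alignment also aligns, and TC as the fraction of reference columns reproduced exactly. It scores every column rather than only the core blocks, and the function names are hypothetical, so it illustrates the idea rather than reproducing the benchmark's own scoring program.

```python
from itertools import combinations

def residue_columns(rows):
    """Rewrite each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(rows)
    columns = []
    for column in zip(*rows):
        indexed = []
        for seq, symbol in enumerate(column):
            if symbol == "-":
                indexed.append(None)
            else:
                indexed.append(counters[seq])
                counters[seq] += 1
        columns.append(tuple(indexed))
    return columns

def aligned_pairs(columns):
    """Set of (seq i, residue, seq j, residue) pairs placed in the same column."""
    pairs = set()
    for column in columns:
        for i, j in combinations(range(len(column)), 2):
            if column[i] is not None and column[j] is not None:
                pairs.add((i, column[i], j, column[j]))
    return pairs

def sp_score(test_rows, ref_rows):
    """Fraction of reference residue pairs also aligned in the test alignment."""
    ref = aligned_pairs(residue_columns(ref_rows))
    test = aligned_pairs(residue_columns(test_rows))
    return len(ref & test) / len(ref) if ref else 0.0

def tc_score(test_rows, ref_rows):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_columns = residue_columns(ref_rows)
    test_columns = set(residue_columns(test_rows))
    if not ref_columns:
        return 0.0
    return sum(c in test_columns for c in ref_columns) / len(ref_columns)

reference = ["AC-T", "ACGT"]
candidate = ["A-CT", "ACGT"]
sp = sp_score(candidate, reference)
tc = tc_score(candidate, reference)
```

Both scores lie between 0 and 1, with higher values meaning closer agreement with the reference.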
[Figure: Baseline, BAliBASE SP]
[Figure: Baseline, BAliBASE TC]
[Figure: Baseline, ProtPars]
[Figure: Orphans/Families, BAliBASE SP]
[Figure: Orphans/Families, ProtPars]
[Figure: Larger Families]
Conclusions
• Solution Quality
• Captures Evolutionary Information
• Iterations Converge Quickly
• Useful Tool
Future Work
• Improved Alignment Consensus
• Multiple Partitioning Thresholds
• Multiple Solutions
• Integrated Phylogeny Modifications
• Parallel Implementation