Multiple Sequence Alignment: NP-Hardness and How to Deal with It
Jens Stoye
Bielefeld University, Germany
Preliminaries: Pairwise Alignment

>pdb|1KSW|A Chain A, Structure Of Human C-Src Tyrosine Kinase (Thr338gly Mutant) In Complex With N6-Benzyl Adp
Length=452

 Score = 161 bits (408), Expect = 5e-47, Method: Compositional matrix adjust.
 Identities = 81/85 (95%), Positives = 81/85 (95%), Gaps = 1/85 (1%)

Query  1    PRESLRLEAKLGQGCFGEVWMGTWNDTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL  60
            PRESLRLE KLGQGCFGEVWMGTWN TTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL
Sbjct  182  PRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL  241

Query  61   VQLYAVVS-EPIYIVIEYMSKGSLL  84
            VQLYAVVS EPIYIV EYMSKGSLL
Sbjct  242  VQLYAVVSEEPIYIVGEYMSKGSLL  266

The same alignment without annotation:

PRESLRLEAKLGQGCFGEVWMGTWNDTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL
PRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKL

VQLYAVVS-EPIYIVIEYMSKGSLL
VQLYAVVSEEPIYIVGEYMSKGSLL
Preliminaries: Pairwise Alignment
Find the best alignment of two sequences: highest score / lowest cost
Analysis: O(n^2) time
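The O(n^2) bound comes from filling a table with one cell per pair of prefixes. A minimal sketch in Python, assuming a simple unit-cost scheme (mismatch = 1, gap = 1) that is not taken from the slides; the function name `alignment_cost` is hypothetical, and BLAST itself uses substitution matrices and different gap costs.

```python
def alignment_cost(a: str, b: str, mismatch: int = 1, gap: int = 1) -> int:
    """Minimum cost of a global alignment of a and b (lowest-cost formulation).

    Assumed cost scheme for illustration only: match 0, mismatch 1, gap 1.
    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    # Row 0: aligning the empty prefix of a against each prefix of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # prefix a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            delete = prev[j] + gap      # a[i-1] aligned to a gap
            insert = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Second block of the BLAST example above, with gap characters removed.
    query = "VQLYAVVSEPIYIVIEYMSKGSLL"
    sbjct = "VQLYAVVSEEPIYIVGEYMSKGSLL"
    print(alignment_cost(query, sbjct))
```

The two nested loops visit each of the (n+1) x (m+1) cells once and do constant work per cell, which is where the quadratic running time comes from.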
Multiple Alignment
k sequences, not just 2

sp|P00526|SRC   ---GLAK--DAWEIPRESLRLEAKLGQGCFGEVWMGTWND-TTRVAIKTLKPGT--MSPE 52
sp|P00527|YES   ---GLAK--DAWEIPRESLRLEVKLGQGCFGEVWMGTWNG-TTKVAIKTLKLGT--MMPE 52
sp|P00521|ABL   TIYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDT--MEVE 58
sp|P00542|FES   -VLNRAVPKDKWVLNHEDLVLGEQIGRGNFGEVFSGRLRADNTLVAVKSCRETLPPDIKA 59
sp|P00530|FPS   -VLTRAVLKDKWVLNHEDVLLGERIGRGNFGEVFSGRLRADNTPVAVKSCRETLPPELKA 59
sp|P00532|KRAF  -------SSYYWKMEASEVMLSTRIGSGSFGTVYKGKWHGDVAVKILKVVDPTP--EQLQ 51

sp|P00526|SRC   AFLQEAQVMKKLRHEKLVQLYAVVSEEP-IYIVIEYMSKGSLLDFLKGEMGKYLRLPQLV 111
sp|P00527|YES   AFLQEAQIMKKLRHDKLVPLYAVVSEEP-IYIVTEFMTKGSLLDFLKEGEGKFLKLPQLV 111
sp|P00521|ABL   EFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVSAVVLL 118
sp|P00542|FES   KFLQEAKILKQYSHPNIVRLIGVCTQKQPIYIVMELVQGGDFLTFLRT-EGARLRMKTLL 118
sp|P00530|FPS   KFLQEARILKQCNHPNIVRLIGVCTQKQPIYIVMELVQGGDFLSFLRS-KGPRLKMKKLI 118
sp|P00532|KRAF  AFRNEVAVLRKTRHVNILLFMGYMTKDN-LAIVTQWCEGSSLYKHLHV-QETKFQMFQLI 109

sp|P00526|SRC   DMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAK- 170
sp|P00527|YES   DMAAQIADGMAYIERMNYIHRDLRAANILVGDNLVCKIADFGLARLIEDNEYTARQGAK- 170
sp|P00521|ABL   YMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAK- 177
sp|P00542|FES   QMVGDAAAGMEYLESKCCIHRDLAARNCLVTEKNVLKISDFGMSREAADGIYAASGGLRQ 178
sp|P00530|FPS   KMMENAAAGMEYLESKHCIHRDLAARNCLVTEKNTLKISDFGMSRQEEDGVYASTGGMKQ 178
sp|P00532|KRAF  DIARQTAQGMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPT 169

sp|P00526|SRC   FPIKWTAPEAALYG---RFTIKSDVWSFGILLTELTTKGRVPYPGMVNR-EVLDQVERGY 226
sp|P00527|YES   FPIKWTAPEAALYG---RFTIKSDVWSFGILLTELVTKGRVPYPGMVNR-EVLEQVERGY 226
sp|P00521|ABL   FPIKWTAPESLAYN---KFSIKSDVWAFGVLLWEIATYGMSPYPGIDLS-QVYELLEKDY 233
sp|P00542|FES   VPVKWTAPEALNYG---RYSSESDVWSFGILLWETFSLGASPYPNLSNQ-QTREFVEKGG 234
sp|P00530|FPS   IPVKWTAPEALNYG---WYSSESDVWSFGILLWEAFSLGAVPYANLSNQ-QTREAIEQGV 234
sp|P00532|KRAF  GSVLWMAPEVIRMQDDNPFSFQSDVYSYGIVLYELMAG-ELPYAHINNRDQIIFMVGRGY 228

sp|P00526|SRC   RMPCP----PECPESLHDLMCQCWRKDPEERPTFKYLQAQLLPACVLEVAE- 273
sp|P00527|YES   RMPCP----QGCPESLHELMKLCWKKDPDERPTFEYIQSFLEDYFTAAEPSG 274
sp|P00521|ABL   RMERP----EGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSIS- 280
sp|P00542|FES   RLPCP----ELCPDAVFRLMEQCWAYEPGQRPSFSAIYQELQSIRKRHR--- 279
sp|P00530|FPS   RLEPP----EQCPEDVYRLMQRCWEYDPHRRPSFGAVHQDLIAIRKRHR--- 279
sp|P00532|KRAF  ASPDLSRLYKNCPKAIKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 280
Multiple Alignment – Why?
Highlight similarities of the sequences in a family:
• sequence assembly
• molecular modeling, structure-function conclusions
• database search (sequence families)
• protein domains
• primer design
Highlight dissimilarities between the sequences in a family:
• reconstruction of phylogenetic trees
• analysis of single nucleotide polymorphisms (SNPs)
"One or two homologous sequences whisper... a full multiple alignment shouts out loud" (Hubbard et al., 1996)
Multiple Alignment Objective Functions
• Find the best alignment of k sequences: highest score / lowest cost
• Based on pairwise projections:
  • sum of all pairs (SP) score
  • tree alignment score
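To make "based on pairwise projections" concrete, here is a minimal sketch of the sum-of-pairs objective in its usual textbook form: project the multiple alignment onto every pair of rows, drop columns where both rows have a gap, and add up the pairwise costs. The unit costs (mismatch = 1, gap = 1) and the name `sp_cost` are assumptions for illustration, not the slides' notation. The tree alignment score is not sketched here.

```python
from itertools import combinations


def sp_cost(rows, mismatch=1, gap=1):
    """Sum-of-pairs cost of a multiple alignment given as equal-length rows.

    Assumed unit costs for illustration: match 0, mismatch 1, gap 1.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for r1, r2 in combinations(rows, 2):   # every pairwise projection
        for x, y in zip(r1, r2):
            if x == "-" and y == "-":
                continue                    # column vanishes in this projection
            if x == "-" or y == "-":
                total += gap
            elif x != y:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sp_cost(["AC-T", "ACGT", "A-GT"]))
```

With k rows there are k(k-1)/2 projections, so evaluating the objective for a given alignment is cheap; the hard part is searching over all alignments.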
Alignment of 2 Sequences
2 sequences: O(n^2) time
Alignment of 3 Sequences
3 sequences: O(n^3) time
Alignment of k Sequences
k sequences: O(n^k) time
In fact even worse: O(n^k · 2^k) time, because each of the n^k table cells has up to 2^k − 1 predecessor cells.
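The factor 2^k can be seen by enumerating the possible last columns of an alignment: each sequence either contributes a character or a gap, and the all-gap column is excluded. A small sketch, with hypothetical helper names `predecessor_moves` and `dp_work`:

```python
from itertools import product


def predecessor_moves(k):
    """All 0/1 vectors of length k except the all-zero one.

    A 1 in position i means sequence i contributes a character to the last
    alignment column; a 0 means it contributes a gap.
    """
    return [move for move in product((0, 1), repeat=k) if any(move)]


def dp_work(n, k):
    """Upper bound on cell updates for k sequences of length n:
    (n + 1)^k table cells times at most 2^k - 1 predecessors each."""
    return (n + 1) ** k * (2 ** k - 1)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, len(predecessor_moves(k)), dp_work(100, k))
```

For k = 2 this gives the three familiar moves of pairwise alignment (diagonal, up, left); for larger k both the table size and the number of moves grow exponentially in k.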
NP-Hardness
CS terminology: The computational problem of SP multiple sequence alignment is NP-hard.
In practice: Don't even try it for more than 10 or 12 sequences.
What can we do?
• compute anyway
• running time heuristics
• approximation algorithms
• fixed-parameter algorithms
• correctness heuristics
Carrillo/Lipman Heuristics
Running time heuristics: often faster, but not in the worst case.
Center Star Algorithm
Approximation algorithm: never worse than 2 times the optimum
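A minimal sketch of the center star idea as it is usually described: pick the sequence with the smallest total pairwise cost to all others, align every other sequence to it optimally, and merge these pairwise alignments using "once a gap, always a gap". The unit costs and the names `align_pair` and `center_star` are assumptions for illustration; the factor-2 guarantee is normally stated for SP cost under a cost function that satisfies the triangle inequality.

```python
def align_pair(a, b, mismatch=1, gap=1):
    """Optimal global alignment under assumed unit costs.
    Returns (cost, gapped_a, gapped_b)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            D[i][j] = min(sub, D[i - 1][j] + gap, D[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right cell
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (
                0 if a[i - 1] == b[j - 1] else mismatch):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def center_star(seqs):
    """Returns (index of center, alignment rows in input order)."""
    k = len(seqs)
    dist = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = align_pair(seqs[i], seqs[j])[0]
    c = min(range(k), key=lambda i: sum(dist[i]))  # center sequence
    order, rows = [c], [seqs[c]]                   # rows[0] is the center row
    for s in range(k):
        if s == c:
            continue
        _, rc, rs = align_pair(seqs[c], seqs[s])
        cur, merged, new = rows[0], [[] for _ in rows], []
        i = j = 0
        while i < len(cur) or j < len(rc):
            if i < len(cur) and cur[i] == "-":     # keep an existing center gap
                for out, row in zip(merged, rows):
                    out.append(row[i])
                new.append("-"); i += 1
            elif j < len(rc) and rc[j] == "-":     # new gap in the center
                for out in merged:
                    out.append("-")
                new.append(rs[j]); j += 1
            else:                                  # same center character
                for out, row in zip(merged, rows):
                    out.append(row[i])
                new.append(rs[j]); i += 1; j += 1
        rows = ["".join(r) for r in merged] + ["".join(new)]
        order.append(s)
    result = [None] * k
    for idx, row in zip(order, rows):
        result[idx] = row
    return c, result


if __name__ == "__main__":
    center, msa = center_star(["AT", "ACT", "AGT"])
    print(center)
    print("\n".join(msa))
```

The sketch needs k(k-1)/2 pairwise alignments to choose the center, so it runs in polynomial time, in contrast to the exact O(n^k · 2^k) dynamic program.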
Divide and Conquer Alignment
No performance guarantee, but often very good
Multiple Alignment in Practice
Mostly progressive, e.g. CLUSTAL W
Not covered:
• hybrid approaches, e.g. T-COFFEE, MAUVE, Clustal Omega
• local multiple alignment, e.g. DIALIGN