Visualisation of Multiple Sequence Alignments
VIZBI 2011
Des Higgins
Conway Institute, University College Dublin, Ireland
Multiple Alignment?
• Align 3 or more sequences together
• Homologous residues lined up in columns

Whale myoglobin  ----VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   GSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTP---EFFPKFKGLTT
Lupin globin     ---GALTESQAALVKSSWEEF--NIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

• Needed because of
• Orthologues from different species
But mainly:
• Paralogues from gene duplications
• Multi-gene families
• e.g. humans have approx. 500 protein kinases
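To make "homologous residues lined up in columns" concrete, here is a minimal Python sketch that reads the three-sequence globin fragment from this slide column by column. The helper names `columns` and `conserved_columns` are illustrative only and are not part of any alignment package.

```python
# Three aligned globin fragments copied from the slide; '-' marks a gap.
alignment = {
    "Whale myoglobin": "----VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT",
    "Lamprey globin":  "GSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTP---EFFPKFKGLTT",
    "Lupin globin":    "---GALTESQAALVKSSWEEF--NIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE",
}


def columns(aln):
    """Return the alignment as a list of column strings, one per position."""
    rows = list(aln.values())
    # Every row of a multiple alignment must have the same padded length.
    assert len({len(row) for row in rows}) == 1, "rows differ in length"
    return ["".join(col) for col in zip(*rows)]


def conserved_columns(aln):
    """Indices of gap-free columns where every sequence has the same residue."""
    return [i for i, col in enumerate(columns(aln))
            if "-" not in col and len(set(col)) == 1]


if __name__ == "__main__":
    cols = columns(alignment)
    print(len(cols), "columns")
    for i in conserved_columns(alignment):
        print(f"column {i}: {cols[i][0]} in all sequences")
```

Treating the alignment as a list of columns is also the starting point for the multivariate methods later in the talk, where each column becomes a set of variables.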
Human Protein Kinases
The human kinome comprises 40 atypical PKs and 478 classical PKs. The latter consist of 388 serine/threonine kinases and 90 tyrosine kinases; 50 of these sequences lack a functional catalytic site. (Manning et al., Science, 2002)
Globin Multiple Alignment

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

1. Visualise the residues/gaps?
Globin Multiple Alignment (same alignment as on the first globin slide): Alpha helices
Globin Multiple Alignment (same alignment as on the first globin slide): Haem binding Histidines
Globin Multiple Alignment
[Tree drawn alongside the same alignment, with leaves: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]
2. Visualise the sequence groupings?
So: What is the Problem?
• What if N >> 100,000?
• e.g. SSU rRNA
  • www.arb-silva.de
  • 1,471,257 seqs
• e.g. ABC transporters
  • PFAM
  • ABC_tran PF00005
  • 127,458 seqs
• Metagenomics
Sequence 10,000 vertebrate genomes!
• => 5,000,000 protein kinases, GPCRs
SequenceJuxtaposer: Fluid Navigation For Large-Scale Sequence Comparison In Context. James Slack, Kristian Hildebrand, Tamara Munzner, Katherine St. John. Proc. German Conference on Bioinformatics 2004, pp 37-42
Poster D03, VIZBI 2011. Sequence Surveyor: scalable multiple sequence alignment overview visualisation. Danielle Albers, Colin Dewey, Michael Gleicher
Poster D09, VIZBI 2011. JProfileGrid: visualising very large multiple sequence alignments. Alberto Roca, Aaron Abajian, David Vigerust
This talk
• How to make huge multiple alignments
• How to cluster > 100,000 sequences
• MDS/PCA on big datasets
Multiple Sequence Alignment
• NP complete
• Mainly use: “Progressive Alignment”
  • Greedy heuristic
  • Use a tree/clustering of the seqs
• Barton and Sternberg (1988); Feng and Doolittle (1987); Higgins and Sharp (1988); Hogeweg and Hesper (1984); Willie Taylor (1987)
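The "tree/clustering of the seqs" step can be sketched as follows. This is a naive UPGMA clustering, chosen here only as one simple way to build a guide tree; the function name `upgma` and the toy distances are assumptions for illustration and are not taken from Clustal. A progressive aligner would then align sequences and sub-alignments in the order given by the nested tuples, from the leaves towards the root.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix.

    Naive UPGMA: repeatedly merge the two closest clusters and average
    their distances to all other clusters, weighted by cluster size.
    """
    # Active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del d[pair]
        for c in clusters:                # update distances to the new cluster
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((next_id, c))] = (
                (size_a * dac + size_b * dbc) / (size_a + size_b))
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    # Toy distances (made up): the two betas are close, the two alphas are close.
    names = ["Human beta", "Horse beta", "Human alpha", "Horse alpha"]
    dist = [
        [0.0, 0.1, 0.6, 0.6],
        [0.1, 0.0, 0.6, 0.6],
        [0.6, 0.6, 0.0, 0.2],
        [0.6, 0.6, 0.2, 0.0],
    ]
    print(upgma(names, dist))
```

The greedy character of progressive alignment comes from this fixed merge order: once two groups have been aligned, that decision is not revisited.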
“Guide Tree”
[Guide tree with leaves Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin, drawn alongside the globin alignment]
Clustal
• 66,000 citations
• Clustal1-Clustal4
  • 1988, Paul Sharp, Dublin
• Clustal V 1992
  • EMBL Heidelberg
  • Rainer Fuchs
  • Alan Bleasby
• Clustal W, Clustal X 1994-2005
  • Toby Gibson, EMBL, Heidelberg
  • Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 2007
  • University College Dublin
www.clustal.org
Complexity
• Guide tree construction O(N²)
• Later Progressive Alignment O(N)
• Guide tree construction is limiting
>10,000 seq alignment is tough
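The quadratic term comes from the full matrix of pairwise distances that a conventional guide tree needs: N(N-1)/2 sequence comparisons. A small Python sketch of that arithmetic, using sequence counts that appear elsewhere in the talk (the function name is illustrative):

```python
def pairwise_comparisons(n):
    """Number of distinct sequence pairs in a full N x N distance matrix."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # 127,458 and 1,471,257 are the Pfam ABC_tran and SSU rRNA counts quoted earlier.
    for n in (1_000, 10_000, 127_458, 1_471_257):
        print(f"N = {n:>9,}: {pairwise_comparisons(n):>19,} pairwise distances")
```

At N = 10,000 this is already 49,995,000 sequence comparisons, which is why the guide tree, not the progressive alignment itself, becomes the bottleneck.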
PartTree
• MAFFT Package
• Select n sequences where n << N
• UPGMA on n sequences
• Cluster the remainder (N-n) with their closest clusters
Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374.
Embedding?
• Replace each sequence by a vector
• Vector-vector distances
  • MUCH faster than
  • Seq.-Seq. distances
• Vectors very fast/simple to cluster
• e.g. cluster 10,000 vectors of length 150
  • <<1 min on 1 processor
  • UPGMA
• e.g. cluster 300,000 vectors of length 300
  • 6 mins
  • k-means, k = 300
Embedding papers
• FastMap
  • Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualisation of Traditional and Multimedia Datasets. Proc. 1995 ACM SIGMOD International Conf. on Management of Data, pp. 163–174.
• Sparsemap
  • G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University, 1999.
mBED
• Select k seqs “randomly”
  • k << N
  • k ∝ log N
• Use distance to each of these k “references”
  • k long vector for each sequence
• Use heuristics
  • avoid duplicates
  • find outliers
• Very fast and simple
• Complexity O(kN) i.e. O(N log N)
• Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol. 5:21.
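The embedding idea on this slide can be sketched in a few lines of Python. This is not the mBED implementation: the k-mer Jaccard distance is a stand-in for whatever sequence distance is actually used, the seed count ceil(log2 N) is one arbitrary choice consistent with "k ∝ log N", and the outlier heuristic is omitted. All function names are hypothetical.

```python
import math
import random


def kmer_set(seq, k=2):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Toy alignment-free distance: 1 - Jaccard similarity of k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def embed(seqs, scale=1.0, rng=None):
    """Replace each sequence by its vector of distances to k seed sequences.

    Requires at least one sequence. Costs N * k distance evaluations
    instead of N * (N - 1) / 2.
    """
    rng = rng or random.Random(0)
    unique = sorted(set(seqs))            # heuristic: avoid duplicate seeds
    k = min(len(unique), max(1, math.ceil(scale * math.log2(len(seqs)))))
    seeds = rng.sample(unique, k)         # "random" reference sequences
    vectors = [[kmer_distance(s, seed) for seed in seeds] for s in seqs]
    return seeds, vectors


if __name__ == "__main__":
    seqs = ["VHLTPEEKSAVTALWGKVN", "VQLSGEEKAAVLALWDKVN",
            "VLSPADKTNVKAAWGKVGA", "VLSAADKTNVKAAWSKVGG",
            "VLSEGEWQLVLHVWAKVEA", "PLSAAEKTKIRSAWAPVYS",
            "ALTESQAALVKSSWEEFNA", "MKLSGEEKAAVTALWGKVD"]
    seeds, vectors = embed(seqs)
    for s, v in zip(seqs, vectors):
        print(s, [round(x, 2) for x in v])
```

The resulting vectors can be clustered with ordinary vector methods such as UPGMA or k-means, which is what makes guide trees for very large N feasible.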
mBED
[Figure: an N × N sequence distance matrix compared with the N × k matrix of distances to k seeds]
MDS visualisation?
• Do PCA on embedded sequences
• 3994 H3N2 HA sequences
• 1967 (blue) to 2008 (orange)
Guide Tree Quality
• 1000 random guide trees
• 1000 sparsemap trees
• Clustal tree
• mBED
Clustal Ω
• Release first version by April 2011
• Scalable
  • mBed
  • Gordon Blackshields
• Accurate
  • HMM-HMM alignment
  • HHalign
  • Johannes Söding, Munich
• Re-use old alignments
  • Kevin Karplus
  • UCSC
Align 120,000 ABC transporters
• 6 hours on 1 core
• More accurate than MUSCLE or MAFFT
• Coming soon...
Fabian Sievers, Andreas Wilm, David Dineen
MDS/PCA etc.
• Dimension reduction
• Treat alignment columns as variables
  • PCA
    • Principal Components Analysis
  • CA
    • Correspondence Analysis, Jean Paul Benzécri
• Use N × N distance matrix
  • MDS
  • PCOORD
Use CA, PCA for Sequences?
• every alignment column:
  • 20 binary variables
  • Or several physicochemical properties
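The "20 binary variables per column" encoding can be illustrated with a short Python sketch. The residue ordering in `AMINO_ACIDS` and the choice to encode a gap as twenty zeros are assumptions made here for illustration; the talk does not specify either.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, arbitrary fixed order


def one_hot_alignment(rows):
    """Encode aligned sequences as a binary matrix.

    Each alignment column becomes 20 indicator variables, so an alignment
    of length L yields 20 * L variables per sequence. Gaps (and any
    non-standard symbols) are encoded as twenty zeros.
    """
    length = len(rows[0])
    assert all(len(row) == length for row in rows), "rows differ in length"
    matrix = []
    for row in rows:
        vec = []
        for residue in row:
            vec.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
        matrix.append(vec)
    return matrix


if __name__ == "__main__":
    rows = ["VLSPADK", "VLSAADK", "ALTESQ-"]
    for row, vec in zip(rows, one_hot_alignment(rows)):
        print(row, len(vec), "variables,", sum(vec), "set")
```

The resulting sequences-by-variables table is the kind of input that PCA or Correspondence Analysis operates on; physicochemical properties would replace the 20 indicators with a few numeric values per column.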
Trypsin-like serine proteases
• 15 Chymotrypsins
• 10 Elastases
• 31 Trypsins
• Correspondence Analysis
• Supervise:
  • Between Groups Analysis
  • Dolédec and Chessel (1987) (similar to PLS discriminant analysis)
Trypsin
Wallace IM, Higgins DG. (2007) Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics. 8:135.
MDS
• Multidimensional Scaling
• Fit distances to an N × N distance matrix
• Use Euclidean distances?
• “Classical scaling” = Principal Co-Ordinates Analysis
  • PCOORD, John Gower
  • Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-328.
  • Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.
• Complexity at least O(N²)
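Classical scaling can be sketched in plain Python: square the distances, double-centre them, and take the leading eigenvectors scaled by the square roots of their eigenvalues. The version below uses power iteration with deflation so that it needs no external library; that choice, the function name `classical_mds` and the iteration count are assumptions for illustration, and a real analysis would normally use a linear algebra package.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Principal co-ordinates analysis of a symmetric distance matrix.

    Returns one coordinate list of length `dims` per object.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J = I - (1/n) 11^T
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        v = [math.sin(i + 1 + axis) for i in range(n)]  # deterministic start
        for _ in range(iters):                           # power iteration
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 1e-12:
            break                                        # no variance left
        for i in range(n):
            coords[i][axis] = v[i] * math.sqrt(lam)
        for i in range(n):                               # deflate found axis
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords


if __name__ == "__main__":
    points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
    D = [[math.dist(p, q) for q in points] for p in points]
    for xy in classical_mds(D):
        print([round(c, 3) for c in xy])
```

Both the N × N matrix and the eigen-decomposition are what make the cost at least quadratic in N, which motivates the large-scale variants on the next slide.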
Large scale MDS?
• SC-MDS
  • Jengnan Tzeng, Henry Horng-Shing Lu, and Wen-Hsiung Li (2008) Multidimensional scaling for large genomic data sets. BMC Bioinformatics 9:179.
• mBED
  • Blackshields et al. (2010)
• PCOORD or MDS on a subset of the sequences
  • add the rest later
• Landmark MDS + Nyström approximation
  • V. de Silva, J.B. Tenenbaum (2004) “Sparse multidimensional scaling using landmark points.” Technical report, Stanford University.
Easily do MDS on >100,000 seqs
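The "subset first, add the rest later" idea can be written out for landmark MDS. The notation below is chosen here for illustration: n landmark sequences out of N, Δ_n the n × n matrix of squared distances between landmarks, δ_a the vector of squared distances from a non-landmark sequence a to the landmarks, δ_μ the mean of the columns of Δ_n, and m the number of embedding dimensions. The eigenvectors v_i are taken to have unit length.

```latex
% Step 1: classical scaling on the n landmarks only
\begin{align*}
B_n &= -\tfrac{1}{2}\, H \Delta_n H,
      \qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} \\
B_n v_i &= \lambda_i v_i,
      \qquad \lambda_1 \ge \dots \ge \lambda_m > 0
\end{align*}
% Step 2: place every remaining sequence a from its distances to the landmarks
\begin{align*}
L^{\#} &= \begin{pmatrix}
            v_1^{\top} / \sqrt{\lambda_1} \\
            \vdots \\
            v_m^{\top} / \sqrt{\lambda_m}
          \end{pmatrix} \\
x_a &= -\tfrac{1}{2}\, L^{\#} \left( \delta_a - \delta_\mu \right)
\end{align*}
```

Only the n × n landmark matrix is decomposed, and each of the remaining N - n sequences needs just n distances, so the number of distance evaluations grows as nN rather than N(N-1)/2.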
H3N2 flu sequences
• Weifeng Shi
• 8167 HA sequences
  • human H3N2 influenza viruses
• DNAdist in Phylip
  • K2P (Kimura two parameter) model
• Python: Matplotlib
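For reference, the closed-form Kimura two-parameter distance between two aligned DNA sequences can be sketched as below. This is the textbook formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), with P and Q the proportions of transitions and transversions. It is not a reimplementation of DNAdist, whose computation may differ in detail. Skipping sites with gaps or ambiguity codes is an assumption made here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1              # A<->G or C<->T
        else:
            transversions += 1            # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError when the sequences are too divergent
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    print(k2p_distance("ACGTACGTAC", "ACGTACGTAC"))
    print(k2p_distance("ACGTACGTAC", "GCGTACGTAC"))  # one transition in 10 sites
```

A matrix of such distances between all sequence pairs is the kind of N × N input that the MDS methods above take.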