
SeedMap and mBED: Fast scalable clustering of sequences by Embedding

Des Higgins, University College Dublin, Ireland.


Presentation Transcript


  1. SeedMap and mBED: Fast scalable clustering of sequences by Embedding • Des Higgins • University College Dublin, Ireland

  2. SeedMap • Gordon Blackshields • Iain Wallace • Mark Larkin • Andreas Wilm • Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG. (2008) Fast embedding methods for clustering tens of thousands of sequences. Comput Biol Chem. 32(4):282-6. • Science Foundation Ireland

  3. mBED • Gordon Blackshields • Fabian Sievers • Andreas Wilm • Unpublished/in prep. 2009 • SFI

  4. Sequence clustering? • Database organisation • Sequence assembly • Phylogeny/evolution/epidemiology • Guide trees for progressive alignment

  5. [Figure: tree with leaves Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]

     Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
     Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
     Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
     Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
     Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
     Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
     Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

     Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
     Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
     Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
     Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
     Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
     Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
     Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

     Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
     Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
     Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
     Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
     Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
     Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
     Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

     • Progressive Alignment: • Feng and Doolittle, 1987 • Willie Taylor, 1987, 1988 • Hogeweg and Hesper, 1984

  6. Clustal • Clustal1-Clustal4 1988 • Paul Sharp, Genetics, TC Dublin • Clustal V 1992 • EMBL Heidelberg, • Rainer Fuchs • Alan Bleasby • Clustal W 1994-2006, Clustal X 1997-2006 • EMBL, EBI, UCC • Toby Gibson, EMBL, Heidelberg • Julie Thompson, ICGEB, Strasbourg • Clustal W and Clustal X 2.0 August 2007 • University College Dublin

  7. Guide trees? • For N sequences: • N×N distances (or “similarities”) • UPGMA? • O(N²) or O(N³)? • For N >> 10,000? • can get prohibitive

     VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
     -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
       * *  *  * * ****     * * *** *     * *   *  * ***      *
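To put the N×N cost on this slide into numbers, the short sketch below counts the unordered sequence pairs that a full distance matrix requires. The function name `pairwise_count` is illustrative only and does not come from the talk.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered sequence pairs (i, j), i < j, among n sequences."""
    return n * (n - 1) // 2


# Each pair costs one sequence-to-sequence comparison in a full distance matrix.
for n in (100, 10_000, 100_000):
    print(f"N = {n:>7,}: {pairwise_count(n):>13,} pairwise distances")
```

For N = 10,000 this is already close to 50 million comparisons, before any tree-building step is run.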

  8. PartTree • MAFFT Package • Select n sequences where n << N • UPGMA on n sequences • Cluster the remainder (N-n) with their closest clusters • Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374.

  9. Embedding? • Replace each sequence by a Vector • Vector-Vector distances • MUCH faster than • Seq. – Seq. distances • Vectors very fast/simple to cluster • e.g. cluster 10,000 vectors of length 150 • 1 min on 1 processor • UPGMA • e.g. cluster 300,000 vectors of length 300 • 6 mins • k-means, k = 300
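Once sequences are vectors, clustering needs only cheap arithmetic. The sketch below is a deliberately naive average-linkage (UPGMA) clustering of vectors in plain Python. It is an illustration under assumptions, not the implementation behind the timings on the slide: the names `euclidean` and `upgma` are invented here, Euclidean distance is assumed, and the search for the closest pair is the simple cubic-time version.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Vector-to-vector distance; far cheaper than aligning two sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma(vectors):
    """Naive average-linkage clustering; returns the tree as nested tuples of indices."""
    clusters = {i: (i, 1) for i in range(len(vectors))}  # id -> (subtree, size)
    dist = {frozenset((i, j)): euclidean(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}
    next_id = len(vectors)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del dist[pair]
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Size-weighted average keeps the distance an average over all members.
            dist[frozenset((next_id, c))] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


tree = upgma([[0, 0], [0, 1], [5, 5], [5, 6]])
```

Here the two nearby pairs are joined first and the two resulting clusters are merged last.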

  10. Embedding seqs • How? • MDS • PCA • CA • PCOORD • Higgins, D.G. (1992) CABIOS • Wallace, I. and Higgins, D.G. (2006) BMC Bioinformatics • But these need a distance matrix and/or a multiple alignment

  11. Embedding seqs • How? • Karhunen-Loève Transform (KLT) • Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Boston, Academic Press. • FastMap • Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualisation of Traditional and Multimedia Datasets, Proc. 1995 ACM SIGMOD International Conf. on Management of Data, pp. 163–174. • Vector elements are positions along “axes” • Between References or seeds or landmarks • Find Pivot Points

  12. Pivot Points!!

  13. Co-ordinate for point i

  14. 2nd – kth co-ordinate
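The figures for slides 12 to 14 did not survive in this transcript. As a stand-in, the sketch below follows the FastMap construction as published by Faloutsos and Lin: pick two far-apart pivots a and b, place object i at x_i = (d(a,i)² + d(a,b)² − d(b,i)²) / (2·d(a,b)), then repeat on the residual distance d'(i,j)² = d(i,j)² − (x_i − x_j)² for the 2nd to kth co-ordinates. The function names and the toy points are assumptions made for illustration, and the slides' own notation may differ.

```python
import math


def choose_pivots(n, dist):
    """FastMap-style heuristic: from object 0, jump to the farthest object, twice."""
    b = max(range(n), key=lambda i: dist(0, i))
    a = max(range(n), key=lambda i: dist(b, i))
    return a, b


def fastmap(n, d, k):
    """Embed n objects into k dimensions using only the distance function d(i, j)."""
    coords = [[] for _ in range(n)]

    def residual(i, j):
        # Distance left over once the axes found so far are projected out.
        sq = d(i, j) ** 2 - sum((x - y) ** 2 for x, y in zip(coords[i], coords[j]))
        return math.sqrt(max(sq, 0.0))

    for _ in range(k):
        a, b = choose_pivots(n, residual)
        dab = residual(a, b)
        if dab == 0.0:
            axis = [0.0] * n  # nothing left to separate
        else:
            axis = [(residual(a, i) ** 2 + dab ** 2 - residual(b, i) ** 2) / (2 * dab)
                    for i in range(n)]
        for row, x in zip(coords, axis):
            row.append(x)
    return coords


# Toy data: four corners of a 3 x 4 rectangle, seen only through their distances.
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
d = lambda i, j: math.dist(points[i], points[j])
embedded = fastmap(len(points), d, k=2)
```

Because the toy points are truly two-dimensional, two FastMap axes reproduce all pairwise distances; for sequence distances the embedding is only approximate.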

  15. SparseMap • G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University, 1999. • Each vector element is • Distance from a group of seqs. • For N seqs: • k groups, where k = log2N. • Each group is 2..2log2N in size

  16. Lipschitz Embedding Bourgain, J. (1985). "On Lipschitz Embedding of Finite Metric Spaces in Hilbert Space." Israel Journal of Mathematics 52(1-2): 46-52.
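A minimal sketch of the Lipschitz embedding idea that SparseMap builds on: each co-ordinate of an object is its distance to a reference set, taken here as the minimum distance to any member of that set. The reference sets are passed in explicitly, so the sketch makes no claim about how many sets to use or how large they should be. The function name and the toy data are illustrative assumptions.

```python
def lipschitz_embed(n, dist, reference_sets):
    """Return one vector per object; co-ordinate r is the distance to reference set r."""
    return [[min(dist(i, ref) for ref in refs) for refs in reference_sets]
            for i in range(n)]


# Toy data: four objects on a line, compared by absolute difference.
positions = [0, 1, 5, 9]
dist = lambda i, j: abs(positions[i] - positions[j])
vectors = lipschitz_embed(len(positions), dist, reference_sets=[[0], [2, 3]])
```

When the distance is a metric, the triangle inequality ensures that no single co-ordinate of two objects differs by more than their original distance.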

  17. SparseMap • Cumbersome/complicated in practice • Stochastic • Every run is different • Related embedding developments by Nat and Michal Linial • Linial N, London E, Rabinovich Y: The Geometry of Graphs and Some of Its Algorithmic Applications. Combinatorica 1995, 15:215-245. • Linial M, Linial N, Tishby N, Yona G: Global self-organization of all known protein sequences reveals inherent biological signatures. J Mol Biol 1997, 268(2):539-56.

  18. SeedMap • Derived from SparseMap • Choose groups deterministically • Look for “natural” groups • Use heuristics to • Find Outliers • Find Duplicates • Use UPGMA on small subsets to generate groups

  19. Benchmarking? • Use the clustering as a guide tree • ClustalW • Benchmark it • Balibase • Oxbench • Pfam+Homstrad

  20. SeedMap on Benchmarks

  21. HOMSTRAD-Pfam

  22. Times

  23. mBED • Why not just use single sequences as references? • Select k seqs “randomly” • With constant stride from seqs sorted by length • k ∝ log N • Use heuristics • avoid duplicates • find outliers • Very fast and simple • Complexity O(kN)
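The seed-based idea on this slide can be sketched as follows: sort the sequences by length, take k of them at a constant stride as seeds, and describe every sequence by its k distances to those seeds, which costs k·N distance calculations. This is an illustration under assumptions, not the mBED code. The function names, the choice k = round(log2 N), and the k-mer Jaccard distance used as a stand-in for the real sequence distance are all hypothetical, and the duplicate and outlier heuristics are left out.

```python
import math


def pick_seeds(seqs, k):
    """Sort sequences by length, then take k of them at a constant stride."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    stride = max(len(order) // k, 1)
    return order[::stride][:k]


def kmer_distance(s, t, w=2):
    """Stand-in distance (an assumption): 1 - Jaccard similarity of w-mer sets."""
    a = {s[i:i + w] for i in range(len(s) - w + 1)}
    b = {t[i:i + w] for i in range(len(t) - w + 1)}
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def embed(seqs, seeds, dist=kmer_distance):
    """One vector per sequence: its distances to the k seeds (k * N distance calls)."""
    return [[dist(s, seqs[j]) for j in seeds] for s in seqs]


seqs = ["VHLTPEEK", "VQLSGEEKAA", "VLSPADKTNVKA", "VLSAADKT",
        "VLSEGEWQLVLHVW", "PIVDTGSVAPLSAAEK", "GALTESQAAL", "VHLTPEEKSAVTALWG"]
k = max(1, round(math.log2(len(seqs))))  # proportionality constant assumed to be 1
seeds = pick_seeds(seqs, k)
vectors = embed(seqs, seeds)
```

The resulting N vectors of length k can then be clustered directly, without ever filling in the N×N matrix.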

  24. [Figure: mBED diagram; labels: k seeds, N, N, k, N]

  25. Benchmarking • 10 Biggest families in Pfam • With < 10,000 seqs • With 2 or more known structures • Compare structural alignment to Homstrad

  26. Homstrad-Pfam
      Methods: 1 = Full d(x, y) distance matrix • 2 = mBed • 3 = mBed + usePivotObjects • 4 = mBed + usePivotGroups

                             Embedding            Distance Matrix        Alignment
                             Time (s)             Calculation Time (s)   Column Score (%)
      Name      Size  Len      1    2    3    4       1    2    3    4       1     2     3     4
      PF01381   9993   53      -   25   55  136     764   57   55  175    13.3  26.7  25.3  34.7
      PF00006   9796  209      -  134  248  280    4364   48   49   88    42.8  36.6  36.6  38
      PF00989   9681   95      -   43   88  197    1281   50   51  159    46.5  33.3  31.8  34.1
      PF00486   9615   75      -   34   69  107     950   55   52  104    63.9  92.8  64.9  89.7
      PF00571   9551  119      -   73  143  268    1993   54   50  152     6.2   3.1   1.5   1.5
      PF00097   9423   41      -   18   38   94     517   44   43  115    53.2  54.8  61.3  54.8
      PF01479   9352   47      -   17   40   90     496   45   46  124    58.3  91.7  89.6  79.2
      PF00046   9305   54      -   20   43   85     651   41   42   77    59.4  44.9  46.4  60.9
      PF00550   9249   63      -   28   59  136     794   47   47  141    51.3  32.9  55.3  59.2
      PF00149   9072  198      -  133  256  552    3515   47   46  172    75.4  71.9  72.3  76.1
      Average   9503   95      0   53  104  195    1533   49   48  131    47    48.9  48.5  52.8


  29. Guide Tree Quality • Shuffle labels on 1000 trees as a bootstrap • Clustal tree • SparseMap • mBED


  33. MDS visualisation? • Do PCA on Embedded sequences • 3994 H3N2 HA sequences • 1967 (blue) - 2008 (orange)

  34. Very large datasets • e.g. 381,602 tRNA from RF00005 • 40 mins embedding • Plus 6 mins to cluster with k-means • k = 300
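The talk does not say which k-means implementation produced these timings. As a rough illustration of the clustering step applied to embedded vectors, here is plain Lloyd's algorithm in standard-library Python. The function names, the fixed iteration count, and the random initialisation are assumptions made for this sketch.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(vectors, k=2)
```

Each iteration costs on the order of N·k vector comparisons, which is why hundreds of thousands of embedded sequences remain tractable.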

  35. Alignment Column Score (%) on HOMSTRAD/Pfam • No. of alignments = 650

      Method                                                   Column Score (%)

      Guide trees constructed internal to method
      ClustalW                                                 60.12
      ClustalW -quicktree -ktuple=2                            59.92
      ClustalW -quicktree -ktuple=2 -noweights -maxdiv=0       59.83
      MAFFT                                                    66.51
      MAFFT -retree 2                                          61.46
      MAFFT -retree 1                                          60.09
      MAFFT -retree 2 -parttree                                60.65
      MAFFT -retree 1 -parttree                                59.27

      Guide trees constructed external to method
      ClustalW + “MAFFT -retree 0 -parttree”                   54.75
      ClustalW + “mBed –fullMatrix”                            61.03
      ClustalW + “mBed –SparseMap”                             58.85
      ClustalW + “mBed –SeedMap”                               59.24
      MAFFT + “mBed –SeedMap”                                  57.57


  38. Thank you! Grazie! Merci!
