SeedMap and mBED: Fast scalable clustering of sequences by Embedding. Des Higgins University College Dublin, Ireland. SeedMap. Gordon Blackshields Iain Wallace Mark Larkin Andreas Wilm Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG. (2008)
SeedMap and mBED: Fast scalable clustering of sequences by Embedding Des Higgins University College Dublin, Ireland
SeedMap • Gordon Blackshields • Iain Wallace • Mark Larkin • Andreas Wilm • Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG. (2008) Fast embedding methods for clustering tens of thousands of sequences. Comput Biol Chem. 32(4):282-6. • Science Foundation Ireland
mBED • Gordon Blackshields • Fabian Sievers • Andreas Wilm • Unpublished/in prep. 2009 • SFI
Sequence clustering? • Database organisation • Sequence assembly • Phylogeny/evolution/epidemiology • Guide trees for progressive alignment
Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

• Progressive Alignment: • Feng and Doolittle, 1987 • Willie Taylor, 1987, 1988 • Hogeweg and Hesper, 1984
Clustal • Clustal1-Clustal4 1988 • Paul Sharp, Genetics, TC Dublin • Clustal V 1992 • EMBL Heidelberg • Rainer Fuchs • Alan Bleasby • Clustal W 1994-2006, Clustal X 1997-2006 • EMBL, EBI, UCC • Toby Gibson, EMBL, Heidelberg • Julie Thompson, ICGEB, Strasbourg • Clustal W and Clustal X 2.0 August 2007 • University College Dublin
Guide trees? • For N sequences: • N×N distances (or “similarities”) • UPGMA? • O(N²) or O(N³)? • For N >> 10,000? • can get prohibitive

VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
  * *  *  * * ****     * * *** *     * *   *  * ***      *
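To make the scaling on this slide concrete, here is a minimal sketch of the arithmetic: a full distance matrix needs one comparison per unordered pair of sequences, N(N-1)/2 in total. The function name is illustrative and not taken from any of the tools discussed.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered sequence pairs behind a full N x N distance matrix."""
    return n * (n - 1) // 2


# Print how quickly the number of pairwise alignments grows with N.
for n in (100, 10_000, 100_000):
    print(f"N = {n:>7,}: {pairwise_comparisons(n):,} pairwise distances")
```

Each of those pairs is a sequence-to-sequence comparison, which is why the full matrix becomes the bottleneck well before the clustering step does.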
PartTree • MAFFT Package • Select n sequences where n << N • UPGMA on n sequences • Cluster the remainder (N-n) with their closest clusters • Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374.
Embedding? • Replace each sequence by a Vector • Vector-Vector distances • MUCH faster than • Seq. – Seq. distances • Vectors very fast/simple to cluster • e.g. cluster 10,000 vectors of length 150 • 1 min on 1 processor • UPGMA • e.g. cluster 300,000 vectors of length 300 • 6 mins • k-means, k = 300
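Once sequences are vectors, clustering reduces to ordinary vector arithmetic. The following is a minimal sketch of Lloyd's k-means in plain Python, under the assumption that the first k vectors serve as initial centres; the slides do not say how k-means was initialised or which implementation was used.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; the first k vectors are the initial centres (an assumption)."""
    centres = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centre.
        labels = [min(range(k), key=lambda c: euclidean(v, centres[c])) for v in vectors]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centres = kmeans(points, k=2)
print(labels)
```

Every distance here costs time proportional to the vector length only, independent of sequence length, which is the source of the speed-up claimed on the slide.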
Embedding seqs • How? • MDS • PCA • CA • PCOORD • Higgins, D.G. (1992) CABIOS • Wallace, I. and Higgins, D.G. (2006) BMC Bioinformatics • But these need a distance matrix and/or a multiple alignment
Embedding seqs • How? • Karhunen-Loève Transform (KLT) • Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Boston, Academic Press. • FastMap • Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualisation of Traditional and Multimedia Datasets, Proc. 1995 ACM SIGMOD International Conf. on Management of Data, pp. 163–174. • Vector elements are positions along “axes” • Between References or seeds or landmarks • Find Pivot Points
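As an illustration of "positions along axes between pivot points", here is a minimal sketch of a single FastMap coordinate. It uses the cosine-law projection from the FastMap paper and the usual "farthest from farthest" pivot heuristic. Function names are illustrative, and Euclidean points stand in for sequences so that the result can be checked by hand.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def choose_pivots(objects, dist=euclidean):
    """Heuristic pivot search: start at object 0, jump to the farthest object, then repeat once."""
    b = max(range(len(objects)), key=lambda i: dist(objects[0], objects[i]))
    a = max(range(len(objects)), key=lambda i: dist(objects[b], objects[i]))
    return a, b


def fastmap_coordinate(d_ai, d_bi, d_ab):
    """Position of object i along the axis from pivot a to pivot b (cosine law)."""
    return (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2 * d_ab)


def project(objects, a, b, dist=euclidean):
    """One FastMap coordinate for every object, using pivots a and b."""
    d_ab = dist(objects[a], objects[b])
    return [fastmap_coordinate(dist(objects[a], o), dist(objects[b], o), d_ab)
            for o in objects]


points = [(0, 0), (4, 0), (1, 3)]
print(choose_pivots(points))
print(project(points, 0, 1))
```

Only distances to the two pivots are needed per coordinate, so each axis costs 2N distance evaluations rather than N².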
SparseMap • G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University, 1999. • Each vector element is • Distance from a group of seqs. • For N seqs: • k groups, where k = log2N. • Each group is 2..2log2N in size
Lipschitz Embedding Bourgain, J. (1985). "On Lipschitz Embedding of Finite Metric Spaces in Hilbert Space." Israel Journal of Mathematics 52(1-2): 46-52.
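In a Lipschitz embedding, each coordinate of an object x is its distance to a reference set A, defined as the smallest distance from x to any member of A. A minimal sketch follows. The reference sets and the toy distance (absolute difference between numbers) are assumptions chosen only to keep the example checkable, and the function name is illustrative.

```python
def lipschitz_embed(objects, reference_sets, dist):
    """Coordinate j of object x is d(x, A_j) = min over a in A_j of dist(x, a)."""
    return [[min(dist(x, a) for a in ref) for ref in reference_sets]
            for x in objects]


# Toy data: numbers on a line stand in for sequences.
objects = [0, 3, 10]
reference_sets = [[0], [3, 10]]
vectors = lipschitz_embed(objects, reference_sets, dist=lambda x, y: abs(x - y))
print(vectors)
```

By the triangle inequality, two objects can differ in any one coordinate by at most their original distance, which is why this kind of embedding tends to preserve clusters.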
SparseMap • Cumbersome/complicated in practice • Stochastic • Every run is different • Related embedding developments by Nat and Michal Linial • Linial N, London E, Rabinovich Y: The Geometry of Graphs and Some of Its Algorithmic Applications. Combinatorica 1995, 15:215-245. • Linial M, Linial N, Tishby N, Yona G: Global self-organization of all known protein sequences reveals inherent biological signatures. J Mol Biol 1997, 268(2):539-56.
SeedMap • Derived from SparseMap • Choose groups deterministically • Look for “natural” groups • Use heuristics to • Find Outliers • Find Duplicates • Use UPGMA on small subsets to generate groups
Benchmarking? • Use the clustering as a guide tree • ClustalW • Benchmark it • Balibase • Oxbench • Pfam+Homstrad
mBED • Why not just use single sequences as references? • Select k seqs “randomly” • With constant stride from seqs sorted by length • k ∝ log N • Use heuristics • avoid duplicates • find outliers • Very fast and simple • Complexity O(kN)
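The steps on this slide can be sketched as follows: sort by length, take seeds at a constant stride, skip duplicates, then describe every sequence by its distances to the seeds. This is a rough sketch, not the mBED implementation: the function names are hypothetical, the k-mer distance is a stand-in for whatever sequence distance is actually used, and the outlier heuristic is left out.

```python
def select_seeds(seqs, k):
    """Pick up to k seed indices at a constant stride from sequences sorted by length."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    stride = max(1, len(seqs) // k)
    seeds, seen = [], set()
    for idx in order[::stride]:
        if seqs[idx] in seen:      # heuristic: avoid duplicate seeds
            continue
        seen.add(seqs[idx])
        seeds.append(idx)
        if len(seeds) == k:
            break
    return seeds


def kmer_distance(a, b, k=2):
    """Hypothetical stand-in distance: 1 - Jaccard similarity of k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 0.0 if not union else 1.0 - len(ka & kb) / len(union)


def embed(seqs, seeds, dist=kmer_distance):
    """One vector per sequence; element j is the distance to seed j (k * N distances in total)."""
    return [[dist(s, seqs[j]) for j in seeds] for s in seqs]


seqs = ["ACGT", "ACGTT", "ACGT", "GGGGGG", "AC", "ACGTAC"]
seeds = select_seeds(seqs, k=3)
vectors = embed(seqs, seeds)
print(seeds)
print(vectors)
```

Because each of the N sequences is compared with only k seeds, the number of sequence comparisons is kN, matching the complexity quoted on the slide.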
mBED • [Figure: schematic with labels “k seeds”, N and k]
Benchmarking • 10 Biggest families in Pfam • With < 10,000 seqs • With 2 or more known structures • Compare structural alignment to Homstrad
Homstrad-Pfam

Columns: Embedding Time (s), Distance Matrix Calculation Time (s) and Alignment Column Score (%), each for methods 1 to 4.

                     Embedding time (s)        Distance matrix (s)       Column score (%)
Name      Size  Len       1     2     3     4       1     2     3     4       1     2     3     4
PF01381   9993   53       -    25    55   136     764    57    55   175    13.3  26.7  25.3  34.7
PF00006   9796  209       -   134   248   280    4364    48    49    88    42.8  36.6  36.6    38
PF00989   9681   95       -    43    88   197    1281    50    51   159    46.5  33.3  31.8  34.1
PF00486   9615   75       -    34    69   107     950    55    52   104    63.9  92.8  64.9  89.7
PF00571   9551  119       -    73   143   268    1993    54    50   152     6.2   3.1   1.5   1.5
PF00097   9423   41       -    18    38    94     517    44    43   115    53.2  54.8  61.3  54.8
PF01479   9352   47       -    17    40    90     496    45    46   124    58.3  91.7  89.6  79.2
PF00046   9305   54       -    20    43    85     651    41    42    77    59.4  44.9  46.4  60.9
PF00550   9249   63       -    28    59   136     794    47    47   141    51.3  32.9  55.3  59.2
PF00149   9072  198       -   133   256   552    3515    47    46   172    75.4  71.9  72.3  76.1
Average   9503   95       0    53   104   195    1533    49    48   131      47  48.9  48.5  52.8

Methods: 1. Full d(x, y) distance matrix • 2. mBed • 3. mBed + usePivotObjects • 4. mBed + usePivotGroups
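For readers unfamiliar with the column score used in this table, here is a minimal sketch of one common definition: the percentage of reference-alignment columns that are reproduced exactly in the test alignment. Whether the benchmark above used precisely this variant is an assumption, and the function names are illustrative.

```python
def columns(alignment):
    """Describe each column as a tuple of residue indices per sequence (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for row, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        cols.append(tuple(entry))
    return cols


def column_score(test, reference):
    """Percentage of reference columns found identically in the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return 100.0 * sum(c in test_cols for c in ref_cols) / len(ref_cols)


reference = ["AC-GT", "ACTGT"]
test_alignment = ["ACG-T", "ACTGT"]
print(column_score(test_alignment, reference))
```

Both alignments must list the same sequences in the same order. A single misplaced gap breaks every column it touches, so the score is strict compared with pairwise measures.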
Guide Tree Quality • Shuffle labels on 1000 trees as a bootstrap • Clustal tree • SparseMap • mBED
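One way to read "shuffle labels as a bootstrap" is as a permutation test: compare a guide tree with a reference tree, then repeat the comparison many times with the leaf labels randomly reassigned to see what agreement arises by chance. The sketch below works under that reading. The tree representation (nested tuples), the similarity measure (number of shared clades) and all function names are assumptions, since the slides do not specify them.

```python
import random


def leaves(tree):
    """Leaf names of a tree given as nested tuples of strings."""
    return [tree] if isinstance(tree, str) else [l for child in tree for l in leaves(child)]


def clades(tree, out=None):
    """Set of leaf sets, one per internal node."""
    out = set() if out is None else out
    if not isinstance(tree, str):
        out.add(frozenset(leaves(tree)))
        for child in tree:
            clades(child, out)
    return out


def relabel(tree, mapping):
    """Copy of the tree with leaf names replaced according to mapping."""
    if isinstance(tree, str):
        return mapping[tree]
    return tuple(relabel(child, mapping) for child in tree)


def shared_clades(t1, t2):
    return len(clades(t1) & clades(t2))


def shuffle_null(test_tree, ref_tree, n=1000, seed=0):
    """Similarity scores for n random reassignments of the test tree's leaf labels."""
    rng = random.Random(seed)
    names = leaves(test_tree)
    scores = []
    for _ in range(n):
        perm = names[:]
        rng.shuffle(perm)
        shuffled = relabel(test_tree, dict(zip(names, perm)))
        scores.append(shared_clades(shuffled, ref_tree))
    return scores


ref = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
observed = shared_clades(ref, ref)
null = shuffle_null(ref, ref, n=1000)
print(observed, sum(s >= observed for s in null) / len(null))
```

The fraction printed at the end estimates how often a random labelling matches the reference as well as the real tree does.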
MDS visualisation? • Do PCA on Embedded sequences • 3994 H3N2 HA sequences • 1967 (blue) - 2008 (orange)
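Because the embedded sequences are ordinary vectors, PCA can be applied to them directly. Here is a minimal pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. The all-ones starting vector and the fixed iteration count are simplifying assumptions, and a linear algebra library would be the practical choice for thousands of sequences.

```python
import math


def first_principal_component(vectors, iterations=100):
    """Return (axis, scores): the leading PCA axis and each vector's position along it."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centred = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    axis = [1.0] * dim  # assumed start; must not be orthogonal to the true axis
    for _ in range(iterations):
        # Power iteration: multiply by the covariance matrix and renormalise.
        axis = [sum(cov[i][j] * axis[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    scores = [sum(r[j] * axis[j] for j in range(dim)) for r in centred]
    return axis, scores


# Toy vectors lying on the line y = 2x.
axis, scores = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(axis, scores)
```

Plotting the scores for the first two or three components gives the kind of scatter described on this slide.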
Very large datasets • e.g. 381,602 tRNA from RF00005 • 40 mins embedding • Plus 6 mins to cluster with k-means • k = 300
Method                                                  Alignment Column Score (%), HOMSTRAD/Pfam

Guide Trees constructed internal to method
ClustalW                                                60.12
ClustalW -quicktree -ktuple=2                           59.92
ClustalW -quicktree -ktuple=2 -noweights -maxdiv=0      59.83
MAFFT                                                   66.51
MAFFT -retree 2                                         61.46
MAFFT -retree 1                                         60.09
MAFFT -retree 2 -parttree                               60.65
MAFFT -retree 1 -parttree                               59.27

Guide Trees constructed external to method
ClustalW + “MAFFT -retree 0 -parttree”                  54.75
ClustalW + “mBed –fullMatrix”                           61.03
ClustalW + “mBed –SparseMap”                            58.85
ClustalW + “mBed –SeedMap”                              59.24
MAFFT + “mBed –SeedMap”                                 57.57

No. of alignments = 650