SeedMap and mBED: Fast scalable clustering of sequences by Embedding

Des Higgins • University College Dublin, Ireland

SeedMap • Gordon Blackshields • Iain Wallace • Mark Larkin • Andreas Wilm • Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG. (2008) Fast embedding methods for clustering tens of thousands of sequences. Comput Biol Chem. 32(4):282-6. • Science Foundation Ireland

mBED • Gordon Blackshields • Fabian Sievers • Andreas Wilm • Unpublished/in prep. 2009 • SFI

Sequence clustering? • Database organisation • Sequence assembly • Phylogeny/evolution/epidemiology • Guide trees for progressive alignment

Sequences: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                 *: : : * . : .: * : * : .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                 . .:: *. : . : *. * . : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                 : : .: . .. . :

Progressive Alignment: • Feng and Doolittle, 1987 • Willie Taylor, 1987, 1988 • Hogeweg and Hesper, 1984

Clustal • Clustal1-Clustal4 1988 • Paul Sharp, Genetics, TC Dublin • Clustal V 1992 • EMBL Heidelberg • Rainer Fuchs • Alan Bleasby • Clustal W 1994-2006, Clustal X 1997-2006 • EMBL, EBI, UCC • Toby Gibson, EMBL, Heidelberg • Julie Thompson, ICGEB, Strasbourg • Clustal W and Clustal X 2.0 August 2007 • University College Dublin

Guide trees? • For N sequences: • N×N distances (or “similarities”) • UPGMA? • O(N²) or O(N³)? • For N >> 10,000? • can get prohibitive

VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
* * * * * **** * * *** * * * * * *** *
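To make the scaling concrete, here is a minimal sketch that counts the distinct sequence pairs a full distance matrix needs, N(N-1)/2. The function name `pairwise_count` is made up for this illustration; each pair stands for one pairwise sequence comparison.

```python
def pairwise_count(n: int) -> int:
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Print how quickly the number of pairwise comparisons grows with N.
    for n in (100, 10_000, 300_000):
        print(f"N = {n:>7,}: {pairwise_count(n):>14,} pairwise comparisons")
```

The count grows quadratically, and that is before the clustering step itself is run on the resulting matrix.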

PartTree • MAFFT Package • Select n sequences where n << N • UPGMA on n sequences • Cluster the remainder (N-n) with their closest clusters • Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374.

Embedding? • Replace each sequence by a vector • Vector-vector distances are MUCH faster than sequence-sequence distances • Vectors are very fast/simple to cluster • e.g. cluster 10,000 vectors of length 150 with UPGMA: 1 min on 1 processor • e.g. cluster 300,000 vectors of length 300 with k-means (k = 300): 6 mins
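A minimal sketch of why vectors are simple to cluster: once each sequence is a vector, a distance is a few arithmetic operations and UPGMA needs nothing else. This is a naive pure-Python UPGMA written for illustration, not the implementation behind the timings above; `euclidean` and `upgma` are names chosen here.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma(vectors):
    """Naive UPGMA on vectors; returns the tree as nested tuples of input indices."""
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (i, 1) for i in range(len(vectors))}
    dist = {frozenset((i, j)): euclidean(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}
    next_id = len(vectors)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        del dist[frozenset((a, b))]
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Size-weighted average of the two old distances (the UPGMA update).
            dist[frozenset((next_id, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    print(upgma([(0, 0), (0, 1), (10, 10), (10, 11)]))
```

The nested tuples can be read directly as a guide tree: the innermost pairs are joined first.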

Embedding seqs • How? • MDS • PCA • CA • PCOORD • Higgins, D.G. (1992) CABIOS • Wallace, I. and Higgins, D.G. (2006) BMC Bioinformatics • But these need a distance matrix and/or a multiple alignment

Embedding seqs • How? • Karhunen-Loève Transform (KLT) • Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Boston, Academic Press. • FastMap • Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, Proc. 1995 ACM SIGMOD International Conf. on Management of Data, pp. 163–174. • Vector elements are positions along “axes” • Between references or seeds or landmarks • Find pivot points

Pivot Points!!

Co-ordinate for point i

2nd – kth co-ordinate
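The three slides above were figures in the original deck. As a stand-in, here is a sketch of the standard FastMap recurrence from Faloutsos and Lin (1995); it is not the speaker's code. With pivots a and b, the coordinate of object i is x_i = (d(a,i)² + d(a,b)² − d(b,i)²) / (2·d(a,b)). For each further axis the same formula is applied to the residual distance d'(i,j)² = d(i,j)² − (x_i − x_j)². The pivot search used here (start at object 0, hop to the farthest object, then hop once more) is a simplified version of the heuristic, and all names are chosen for illustration.

```python
import math


def fastmap(objects, dist, k):
    """Embed objects into k dimensions using the FastMap recurrence (a sketch)."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # Squared distance left over after projecting out axes 0..dim-1.
        base = dist(objects[i], objects[j]) ** 2
        used = sum((coords[i][t] - coords[j][t]) ** 2 for t in range(dim))
        return max(base - used, 0.0)

    for dim in range(k):
        # Simplified pivot heuristic: farthest from object 0, then farthest from that.
        b = max(range(n), key=lambda j: d2(0, j, dim))
        a = max(range(n), key=lambda j: d2(b, j, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords


if __name__ == "__main__":
    # Three points on a line, with absolute difference as the distance.
    print(fastmap([0, 3, 10], lambda x, y: abs(x - y), 2))
```

Each axis costs O(N) distance evaluations, so k axes cost O(kN) rather than the O(N²) of a full matrix.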

SparseMap • G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University, 1999. • Each vector element is • Distance from a group of seqs. • For N seqs: • k groups, where k = log2N. • Each group is 2..2log2N in size

Lipschitz Embedding Bourgain, J. (1985). "On Lipschitz Embedding of Finite Metric Spaces in Hilbert Space." Israel Journal of Mathematics 52(1-2): 46-52.
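A minimal sketch of a Lipschitz embedding: each coordinate of an object is its distance to a reference group, taken as the distance to the nearest member of that group. How the groups are chosen is the part that differs between methods. The random selection and the group sizes below are assumptions for illustration only, and the function names are made up here.

```python
import random


def lipschitz_embed(objects, dist, reference_sets):
    """One coordinate per reference set: distance to the nearest member of the set."""
    return [[min(dist(x, r) for r in refs) for refs in reference_sets]
            for x in objects]


def random_reference_sets(objects, sizes, seed=0):
    """Hypothetical helper: draw one random reference group per requested size."""
    rng = random.Random(seed)
    return [rng.sample(objects, size) for size in sizes]


if __name__ == "__main__":
    objs = [0, 2, 5, 9]
    groups = random_reference_sets(objs, sizes=[1, 2])
    print(groups)
    print(lipschitz_embed(objs, lambda a, b: abs(a - b), groups))
```

For any metric, the difference between two objects in one coordinate is never larger than their original distance, which is what makes the mapping "Lipschitz".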

SparseMap • Cumbersome/complicated in practice • Stochastic • Every run is different • Related embedding developments by Nat and Michal Linial • Linial N, London E, Rabinovich Y: The Geometry of Graphs and Some of Its Algorithmic Applications. Combinatorica 1995, 15:215-245. • Linial M, Linial N, Tishby N, Yona G: Global self-organization of all known protein sequences reveals inherent biological signatures. J Mol Biol 1997, 268(2):539-56.

SeedMap • Derived from SparseMap • Choose groups deterministically • Look for “natural” groups • Use heuristics to • Find Outliers • Find Duplicates • Use UPGMA on small subsets to generate groups

Benchmarking? • Use the clustering as a guide tree • ClustalW • Benchmark it • BAliBASE • OXBench • Pfam+Homstrad

SeedMap on Benchmarks

HOMSTRAD-Pfam

Times

mBED • Why not just use single sequences as references? • Select k seqs “randomly” • With constant stride from seqs sorted by length • k ∝ log N • Use heuristics • avoid duplicates • find outliers • Very fast and simple • Complexity O(kN)
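The steps above can be sketched as follows, under stated assumptions. The k-mer distance is a hypothetical stand-in for whatever sequence distance is actually used. The default k = ceil(log2 N) is just one choice that grows like log N. Only the duplicate heuristic is shown; outlier detection is left out. None of the names come from the mBED code.

```python
import math


def kmer_distance(a, b, k=2):
    """Stand-in distance (an assumption): 1 - Jaccard similarity of k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def pick_seeds(seqs, k):
    """Pick k seeds at constant stride from the length-sorted, de-duplicated sequences."""
    unique = sorted(set(seqs), key=lambda s: (len(s), s))
    k = max(1, min(k, len(unique)))
    stride = len(unique) / k
    return [unique[int(i * stride)] for i in range(k)]


def mbed_like_embedding(seqs, k=None, dist=kmer_distance):
    """Embed every sequence as its vector of distances to the k seeds: O(kN) distances."""
    if k is None:
        k = max(1, math.ceil(math.log2(len(seqs))))  # assumption: k grows like log N
    seeds = pick_seeds(seqs, k)
    return seeds, [[dist(s, seed) for seed in seeds] for s in seqs]


if __name__ == "__main__":
    seqs = ["ACGT", "ACGTT", "ACGT", "GGGGCC", "AC", "TTTTTTTT"]
    seeds, vectors = mbed_like_embedding(seqs, k=3)
    print(seeds)
    for s, v in zip(seqs, vectors):
        print(s, [round(x, 2) for x in v])
```

Every sequence is compared with the k seeds only, so the number of sequence-sequence distances is k·N instead of N(N-1)/2.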

[Figure: mBED; a full N × N distance matrix compared with the N × k matrix of distances to the k seeds]

Benchmarking • 10 Biggest families in Pfam • With < 10,000 seqs • With 2 or more known structures • Compare structural alignment to Homstrad

Homstrad-Pfam

Methods: 1 = Full d(x, y) distance matrix • 2 = mBed • 3 = mBed + usePivotObjects • 4 = mBed + usePivotGroups
Columns: E1-E4 = embedding time (s) • D1-D4 = distance matrix calculation time (s) • CS1-CS4 = alignment column score (%), for methods 1-4

Name      Size  Len   E1   E2   E3   E4     D1   D2   D3   D4    CS1   CS2   CS3   CS4
PF01381   9993   53    -   25   55  136    764   57   55  175   13.3  26.7  25.3  34.7
PF00006   9796  209    -  134  248  280   4364   48   49   88   42.8  36.6  36.6    38
PF00989   9681   95    -   43   88  197   1281   50   51  159   46.5  33.3  31.8  34.1
PF00486   9615   75    -   34   69  107    950   55   52  104   63.9  92.8  64.9  89.7
PF00571   9551  119    -   73  143  268   1993   54   50  152    6.2   3.1   1.5   1.5
PF00097   9423   41    -   18   38   94    517   44   43  115   53.2  54.8  61.3  54.8
PF01479   9352   47    -   17   40   90    496   45   46  124   58.3  91.7  89.6  79.2
PF00046   9305   54    -   20   43   85    651   41   42   77   59.4  44.9  46.4  60.9
PF00550   9249   63    -   28   59  136    794   47   47  141   51.3  32.9  55.3  59.2
PF00149   9072  198    -  133  256  552   3515   47   46  172   75.4  71.9  72.3  76.1
Average   9503   95    0   53  104  195   1533   49   48  131     47  48.9  48.5  52.8

Guide Tree Quality • Shuffle labels on 1000 trees as a bootstrap • Clustal tree • SparseMap • mBED

MDS visualisation? • Do PCA on Embedded sequences • 3994 H3N2 HA sequences • 1967 (blue) - 2008 (orange)

Very large datasets • e.g. 381,602 tRNA from RF00005 • 40 mins embedding • Plus 6 mins to cluster with k-means • k = 300
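For reference, a minimal sketch of k-means (Lloyd's algorithm) on embedded vectors. It is written in plain Python for readability and is far slower than whatever produced the timing above. The function name, the fixed iteration count and the random initialisation are assumptions of this sketch.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = kmeans(data, k=2)
    print(labels)
    print(centroids)
```

Each iteration costs O(N·k·L) for N vectors of length L, with no N × N matrix at any point, which is why this step scales to hundreds of thousands of sequences.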

HOMSTRAD/Pfam • No of alignments = 650

Method                                                Alignment Column Score (%)

Guide trees constructed internal to method
ClustalW                                              60.12
ClustalW -quicktree -ktuple=2                         59.92
ClustalW -quicktree -ktuple=2 -noweights -maxdiv=0    59.83
MAFFT                                                 66.51
MAFFT -retree 2                                       61.46
MAFFT -retree 1                                       60.09
MAFFT -retree 2 -parttree                             60.65
MAFFT -retree 1 -parttree                             59.27

Guide trees constructed external to method
ClustalW + “MAFFT -retree 0 –parttree”                54.75
ClustalW + “mBed –fullMatrix”                         61.03
ClustalW + “mBed –SparseMap”                          58.85
ClustalW + “mBed –SeedMap”                            59.24
MAFFT + “mBed –SeedMap”                               57.57


Thank you! Grazie! Merci!
