Sequence Clustering

Sequence Clustering. Reducing Search Space in Protein and DNA/RNA Sequence Analysis. Denis Kaznadzey, Prokaryotic Super Program. MGM Workshop, January 30, 2011. Sequence Clustering Outline. Classification of Sequences General Problem of Clustering Distance Measures


Presentation Transcript


  1. Sequence Clustering Reducing Search Space in Protein and DNA/RNA Sequence Analysis Denis Kaznadzey, Prokaryotic Super Program MGM Workshop January 30, 2011

  2. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  3. Classification as Research Tool To deal with a huge variety of individual objects: • Classify into groups of essentially similar objects • When new data arrives, assign objects to existing groups • Classify ‘leftovers’ • Occasionally review the entire classification Problem: What is ‘essentially similar’? • Finding properties that are important (ontological relevancy) • Does classification reflect reality in any way?

  4. Classification Ways to classify objects: • Spectral methods • Parametric decomposition • Clustering

  5. Sequence Data Abundance In modern biology, the most abundant type of data is sequence: • DNA • Genomic • Meta-Genomic • Environmental Samples (16S rDNA) • RNA (cDNA libraries; RNA-Seq) • Derived Proteins How to compare sequences? - Criteria depend on application, e.g. GC content vs. order of bases.

  6. Sequence Clustering Select Applications in Genomic Sciences: Genome Assembly: Binning, Scaffolding Transcriptomics: RNAseq (read) clustering Protein Function and Evolution studies: Protein families Phylogenetic profiling: OTUs

  7. Clustering is Crucial for MetaGenomics METAGENOMICS • Thousands of samples • Hundreds of millions of reads per sample • Trillions of base pairs • Billions of genes, impossible to observe/analyze individually Clustering becomes a strict requirement: - Find what classes of sequences are seen - Analyze classes rather than individual sequences

  8. MetaGenomics Analysis Tasks • Primary tasks: • Assess diversity • Find genes • Predict functions • Predict pathways • Estimate capabilities Based on sequence comparison.

  9. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  10. Clustering in General • Any Clustering is based on the Distance in some Metric • Initial clustering is based on pair-wise distances • Subsequent classification is based on distances from objects to clusters: Pledging

  11. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  12. Similarity Metrics • What is “similar”: • The similarity measure should reflect “reality” • This “reality” depends on the application: • Assembly: find identical sub-strings • Orthology detection: identify homologous proteins across species • Functional prediction: identify proteins with similar evolutionarily conserved motifs Possible measures: • Identity percentage • Substitution matrix-based score • Match to an HMM or PSSM
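
As an illustration of the simplest measure listed on this slide, identity percentage, here is a minimal Python sketch. It assumes the two sequences are already aligned to equal length with "-" marking gaps, and it counts identity over all columns that are not gap-only. That denominator is one convention among several; the function name and the convention are assumptions, not something the slides specify.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences ('-' marks a gap).

    Assumption: identity is counted over every column that is not a
    gap in both sequences. Other conventions use a different denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT", "ACGA"))
```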

  13. Similarity Measure Computing similarity measure: • Edit distance or (ungapped) statistics P-value: BLAST, Fasta, needle, water, etc. • Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-coffee • K-mer statistics: CD-HIT, USEARCH, MUSCLE • Suffix trees (and probabilistic suffix trees): MUMmer, Reputer, CLUSEQ • Suffix Arrays: Bowtie, BWT • Position-Specific scoring matrix: PSI-Blast, Impala • Hidden Markov Models: HMMer, HHSearch/HHPred, SAM
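
To make the k-mer statistics entry concrete, the following sketch compares two sequences by the Jaccard index of their k-mer sets. This is a generic alignment-free measure chosen for illustration; it is not the specific statistic implemented by CD-HIT, USEARCH, or MUSCLE, and the function names and the default `k=3` are assumptions.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets: |A & B| / |A | B|."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0  # both sequences are shorter than k
    return len(ka & kb) / len(ka | kb)


print(kmer_jaccard("ACGTACGT", "ACGTTCGT"))
```

Because no alignment is computed, the cost per pair is roughly linear in sequence length, which is why measures of this kind scale to large data sets.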

  14. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  15. Assembling Clusters There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology): • Linkage-based • Average linkage • Complete linkage • Single linkage • Hierarchy-based • Fitting function-based (k-means) • Non-linear classifiers (SOM, etc.) • Greedy methods (iterative, suboptimal)

  16. Linkage-Based Clustering Average linkage Complete linkage Single linkage

  17. Hierarchical Clustering • Build a tree representation of relationships • Cut the branches using some quantitative criteria

  18. Building the Tree Criteria: More similar sequences appear at closer branches This goal is not achievable for practical distance measures [Figure: four sequences A, B, C, D with pairwise distances, and alternative tree arrangements] • Solutions: • Approximation methods: neighbor joining, UPGMA • Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)

  19. Suboptimal Tree Building Neighbor joining (corresponds to single-linkage clustering): • Order edges by distance • Join in order from short to long, merging branches as needed Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering): • For every pair of clusters (A, B), starting with all singletons: • Compute average of distances between every object in A and every object in B • Merge the clusters of the closest average distance
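
The UPGMA steps listed on this slide can be sketched directly. The version below is a deliberately naive illustration: it recomputes every average from the original pairwise distances at each round instead of updating a distance matrix, so it is only suitable for tiny inputs. The function name and the `frozenset`-keyed distance dictionary are assumptions made for the example.

```python
from itertools import combinations


def upgma_merges(names, dist):
    """Return the UPGMA merge order as (cluster_a, cluster_b, avg_distance).

    names: list of unique object labels.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    """
    clusters = [(n,) for n in names]  # start with all singletons
    merges = []

    def average(a, b):
        # Mean distance between every object in a and every object in b.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


d = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 8}
for step in upgma_merges(["A", "B", "C"], d):
    print(step)
```

Cutting the resulting merge list at a chosen average distance yields flat clusters, which matches the "build a tree, then cut branches" description of hierarchical clustering.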

  20. Global Fitting-Function Based K-means clustering • Pre-define the number of clusters • Find a distribution so that the sum of distances to the means is minimal • Computationally hard • Heuristics are used; application-specific heuristics may be efficient
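
One common heuristic for this problem is Lloyd's algorithm, sketched below under two assumptions that the slide does not state: objects are represented as numeric vectors (for sequences this could be, for example, k-mer frequency vectors), and the quantity minimized is the sum of squared Euclidean distances. Initial centers are passed in explicitly so the run is deterministic.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's heuristic: alternate assignment and mean-update steps.

    points:  list of equal-length numeric tuples.
    centers: list of initial center tuples (its length fixes k).
    Returns (centers, groups) where groups[i] holds points of center i.
    """
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for group, old in zip(groups, centers):
            if group:
                new_centers.append(
                    tuple(sum(col) / len(group) for col in zip(*group)))
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, groups


pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(kmeans(pts, [(0, 0), (10, 0)]))
```

The result depends on the initial centers and is only a local optimum, which is the sense in which the method is heuristic.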

  21. Non-Linear Methods • Self-Organizing Maps: “self-learning” method • A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space

  22. Pledging Based on distance to cluster • Representative • Set of representatives (all at extreme) • Other measure, may be unrelated to the initial one (profile, model)
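
Pledging to a single representative per cluster can be sketched as follows. The Hamming distance, the cutoff parameter `max_distance`, and the cluster identifiers are all illustrative assumptions; in practice the distance could be any of the measures discussed earlier, including a match to a profile or model.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pledge(seq, representatives, distance, max_distance):
    """Assign seq to the cluster with the closest representative.

    representatives: dict mapping cluster id to representative sequence.
    Returns the cluster id, or None if no representative is close
    enough (the sequence is then a 'leftover' to be clustered later).
    """
    best = min(representatives,
               key=lambda cid: distance(seq, representatives[cid]),
               default=None)
    if best is None or distance(seq, representatives[best]) > max_distance:
        return None
    return best


reps = {"c1": "AAAA", "c2": "CCCC"}
print(pledge("AAAT", reps, hamming, max_distance=1))
print(pledge("GGGG", reps, hamming, max_distance=1))
```

A new object costs one comparison per cluster rather than one per previously seen object, which is what makes pledging cheaper than re-clustering.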

  23. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  24. Performance Considerations Distance computing is harder than clustering (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ) • For large data sets only k-mer and suffix array measures are practical • However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible. • For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations) • Binning: pre-clustering by rough and fast methods [Figure: 33 objects, 528 pairs; 4 groups, 127 pairs]
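
The saving from a greedy, incremental approach can be shown with a small sketch. Each sequence is compared only to the current cluster representatives, so the full matrix of n(n-1)/2 pairs (528 pairs for 33 objects) is never computed. This is written in the spirit of greedy tools but is not the algorithm of CD-HIT or UCLUST; the Hamming distance, the threshold, and the "first sequence becomes the representative" rule are assumptions.

```python
def all_pairs(n: int) -> int:
    """Number of pairwise comparisons in a full distance matrix."""
    return n * (n - 1) // 2


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(seqs, distance, max_distance):
    """Greedy incremental clustering against cluster representatives.

    Returns (clusters, comparisons); clusters[i][0] is the representative.
    """
    clusters = []
    comparisons = 0
    for s in seqs:
        for members in clusters:
            comparisons += 1
            if distance(s, members[0]) <= max_distance:
                members.append(s)  # join the first close-enough cluster
                break
        else:
            clusters.append([s])   # no match: s founds a new cluster
    return clusters, comparisons


seqs = ["AAAA", "AAAT", "CCCC", "CCCG", "AATT"]
clusters, used = greedy_cluster(seqs, hamming, max_distance=1)
print(clusters, used, "of", all_pairs(len(seqs)), "possible pairs")
```

The outcome depends on input order, which is the price of the greedy shortcut; sorting the input (for example by length) is a common way to make it reproducible.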

  25. Single Linkage is Fast • Time- and space-efficient clustering method: transitive closure-based • Requires ‘boolean’ distances (two sequences can be linked or not linked) • Requires the number of nodes to be known • Space ~ NodesNo • Run-time (worst) ~ EdgesNo * AveClustSize • Run-time (average) ~ EdgesNo * log2(AveClustSize)
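
A transitive-closure clustering over boolean links can be sketched with a union-find (disjoint-set) structure. This is one standard way to obtain the connected components; it is an assumption that it resembles the implementation behind the run-time figures on the slide. Like the slide's method, it needs the number of nodes up front and uses space proportional to that number.

```python
def single_linkage(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 by the transitive closure of links.

    edges: iterable of (a, b) pairs meaning 'a and b are linked'.
    Returns a sorted list of clusters, each a sorted list of node ids.
    """
    parent = list(range(n_nodes))  # one slot per node

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


print(single_linkage(6, [(0, 1), (1, 2), (3, 4)]))
print(single_linkage(6, [(0, 1), (1, 2), (3, 4), (2, 3)]))
```

In the second call a single extra link, (2, 3), fuses two clusters into one, which is the mechanism by which a few spurious links can snowball in large data sets.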

  26. Single Linkage is Prone to Aggregation Single-linkage clustering killer: CLUSTER AGGREGATION In large clusters, even a small number of random links leads to huge conglomerates.

  27. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  28. Case Study: RNA-Seq Pipeline • Goals: • Compute transcript structures • Compute expression profiles (“virtual”) [Diagram labels: Reads/EST clusters; Reads/EST db; Reads/clones attributed to particular source/condition; Counting reads originating from different sources; Source/condition-specific expression profiles]

  29. RNAseq Analysis Solutions Source: bioinfo.org, Macquarie University, Sydney

  30. RNAseq Clustering Approach Outline: • 1. Detect identities (common segments): • Compute similarities • Select the “good” ones • 2. Merge sequences into groups with shared segments: SINGLE LINKAGE Outcome: The single biggest cluster contains more than 60% of all sequences (selection by better similarity does not help). What causes aggregation and how to fight it?

  31. Aggregation in RNA-Seq Clustering • “Bad” identities: • Pieces of vector constructs / adaptors • Repeats • Redundant sequences • Spurious matches (short infrequent repeats) • Chimeras (if pre-amplification is used)

  32. Similarities Selection • Computing ‘boolean’ distances: • Threshold – based • Additional rules (match arrangement) • % identity + length + arrangement: OK
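
A boolean link rule that combines % identity, length, and match arrangement might look like the sketch below. Everything specific here is hypothetical: the `Match` fields, the thresholds, and in particular the arrangement rule, which accepts a match only if it reaches (within a small overhang) the end of at least one of the two sequences on each side, so that isolated internal matches such as short repeats are not treated as links. The slides do not state the actual rules used.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """A pairwise local match; coordinates are 0-based, half-open."""
    identity: float  # percent identity over the matched region
    length: int      # length of the matched region
    q_start: int
    q_end: int
    q_len: int       # full length of the query sequence
    s_start: int
    s_end: int
    s_len: int       # full length of the subject sequence


def is_linked(m: Match, min_identity=95.0, min_length=50, max_overhang=5):
    """Hypothetical boolean distance: True if the two sequences link."""
    if m.identity < min_identity or m.length < min_length:
        return False
    # Unmatched sequence before the match, on the shorter side.
    left = min(m.q_start, m.s_start)
    # Unmatched sequence after the match, on the shorter side.
    right = min(m.q_len - m.q_end, m.s_len - m.s_end)
    return left <= max_overhang and right <= max_overhang


overlap = Match(98.0, 100, 100, 200, 200, 0, 100, 300)
internal = Match(99.0, 60, 100, 160, 300, 120, 180, 400)
print(is_linked(overlap), is_linked(internal))
```

Here `overlap` is a proper end-to-end overlap between the tail of one sequence and the head of the other, while `internal` is a high-identity match buried in the middle of both sequences and is rejected.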

  33. Trimming / Masking • Fighting aggregation • Vector / adapter trimming: • Lucy, Figaro, etc. – integrated in many assembly suites (newbler, velvet, AMOS, CLCbio, etc.) • Low complexity detection / masking: • SEG, DUST, FastQC, WindowMasker etc. – often integrated in search tools

  34. Repeat Elimination Regular (tandem) repeats: • Pre-search masking: Based on structure (IMEx, SRF) or on database (TRDB) • Post-search detection based on similarity properties (multiple parallel threads)

  35. Repeat Elimination Irregular (long) repeats: • Database based: RepeatMasker • De-novo: • RepeatScout, • orrb, • PILER, etc. (Require genome as input, construct database)

  36. Detecting Chimeras • Detecting chimeric sequences: • Abundance-based: Perseus, UCHIME • Chimeras undergo fewer amplification cycles, so chimera segments in native arrangements are more frequent • Specific to 16S: ChimeraSlayer, Bellerophon • Chimera ‘arms’ are closer to originating clades than the entire chimera

  37. Detecting Chimeras • Similarity coverage-based: Mira assembler

  38. Detecting Chimeras • Similarity graph topology-based: dchim Alignment view Connectivity view

  39. Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

  40. Protein Clustering • Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species • Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW • Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight • No ‘one size fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale • The results of clustering are precious; they are kept as databases (PFAM, COGs, KOGs, eggNOG)

  41. Protein Clustering at JGI • Functional annotation of metagenome genes through protein clusters (IMG): • Build a set of functionally homogeneous clusters of similar proteins for annotated genomes • Build an HMM for each cluster, compose a model database • Pledge metagenome proteins to clusters by matching to models • Cluster unpledged proteins, build models, update the model database

  42. Protein Clustering Use of protein clusters reduces the search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort. However, for proteins, which form dense relationship networks, clustering is a great tool. Konstantinos Mavrommatis will elaborate on protein clustering techniques.

  43. Thank you!