Sequence Clustering

Sequence Clustering: Reducing Search Space in Protein and DNA/RNA Sequence Analysis. Denis Kaznadzey, Prokaryotic Super Program. MGM Workshop, May 14, 2012

Sequence Clustering Outline: • Classification of Sequences • General Problem of Clustering • Distance Measures • Ab Initio Clustering and Pledging • Performance Considerations • Case Study: Transcriptomics • Introduction to Protein Clustering

Classification as Research Tool To deal with a huge variety of individual objects: • Classify into groups of essentially similar objects • When new data arrives, assign objects to existing groups • Classify ‘leftovers’ • Occasionally review the entire classification Problem: What is ‘essentially similar’? • Finding properties that are important (ontological relevancy) • Does classification reflect reality in any way?

Classification Ways to classify objects: • Spectral methods • Parametric decomposition • Clustering

Sequence Data Abundance In modern biology, the most abundant type of data is sequence data: • DNA • Genomic • Meta-Genomic • Environmental Samples (16S rDNA) • RNA (cDNA libraries; RNA-Seq) • Derived Proteins How to compare sequences? Criteria depend on the application, e.g. GC content vs. order of bases.

Sequence Clustering Select Applications in Genomic Sciences: • Genome Assembly: Binning, Scaffolding • Transcriptomics: RNA-Seq (read) clustering • Protein Function and Evolution studies: Protein families • Phylogenetic profiling: OTUs

Clustering is Crucial for MetaGenomics Metagenomics: • Thousands of samples • Hundreds of millions of reads per sample • Trillions of base pairs • Billions of genes, impossible to observe/analyze individually Clustering becomes a strict requirement: • Find what classes of sequences are seen • Analyze classes rather than individual sequences

MetaGenomics Analysis Tasks • Primary tasks: • Assess diversity • Find genes • Predict functions • Predict pathways • Estimate capabilities Based on sequence comparison.

Clustering in General • Any Clustering is based on the Distance in some Metric • Initial clustering is based on pair-wise distances • Subsequent classification is based on distances from objects to clusters: Pledging

Similarity Metrics • What is “similar”: • The similarity measure should reflect “reality” as well as possible • This “reality” depends on the application: • Assembly: find identical sub-strings • Orthology detection: identify homologous proteins across species • Functional prediction: identify proteins with similar evolutionarily conserved motifs Measure is: • Identity percentage • Substitution matrix based • Match to HMM or PSSM

Similarity Measure Computing similarity measure: • Edit distance or (ungapped) statistics P-value: BLAST, Fasta, needle, water, etc. • Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-coffee • K-mer statistics: CD-HIT, USEARCH, MUSCLE • Suffix trees (and probabilistic suffix trees): MUMmer, Reputer, CLUSEQ • Suffix Arrays: Bowtie, BWT • Position-Specific scoring matrix: PSI-Blast, Impala • Hidden Markov Models: HMMer, HHSearch/HHPred, SAM
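As an illustration of the k-mer statistics family, here is a minimal sketch that scores two sequences by the Jaccard index of their k-mer sets. The function names and the choice of the Jaccard index are assumptions made for this example; this is not how CD-HIT or USEARCH compute their word-based filters.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=3):
    """Jaccard index of the two k-mer sets: shared k-mers / all k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both sequences are shorter than k
    return len(ka & kb) / len(ka | kb)


# Toy usage: sequences differing only in the last base share one of three 3-mers.
score = kmer_similarity("ACGT", "ACGA", k=3)
```

No alignment is computed here, which is why measures of this kind scale to large data sets.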

Assembling Clusters There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology): • Linkage-based • Average linkage • Complete linkage • Single linkage • Hierarchy-based • Fitting function-based (K-means) • Non-linear classifiers (SOM, etc.) • Greedy methods (iterative, suboptimal)

Linkage-Based Clustering • Average linkage • Complete linkage • Single linkage
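The three linkage rules differ only in how the distance between two clusters is derived from the pairwise distances of their members. A minimal sketch, assuming a user-supplied distance function `dist` and toy one-dimensional data (both hypothetical):

```python
from itertools import product


def pair_distances(cluster_a, cluster_b, dist):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [dist(x, y) for x, y in product(cluster_a, cluster_b)]


def single_linkage(a, b, dist):
    """Cluster distance = distance of the closest pair."""
    return min(pair_distances(a, b, dist))


def complete_linkage(a, b, dist):
    """Cluster distance = distance of the farthest pair."""
    return max(pair_distances(a, b, dist))


def average_linkage(a, b, dist):
    """Cluster distance = mean over all cross-cluster pairs."""
    d = pair_distances(a, b, dist)
    return sum(d) / len(d)


def abs_diff(x, y):
    """Toy distance on numbers, used only for this illustration."""
    return abs(x - y)


a, b = [1, 2], [4, 8]
print(single_linkage(a, b, abs_diff))    # 2
print(complete_linkage(a, b, abs_diff))  # 7
print(average_linkage(a, b, abs_diff))   # 4.5
```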

Hierarchical Clustering • Build a tree representation of relationships • Cut the branches using some quantitative criteria

Building the Tree Criteria: More similar sequences appear at closer branches. This goal is not achievable for practical distance measures. (Figure: four sequences A, B, C, D with pairwise distances, and alternative tree arrangements of the same four leaves.) • Solutions: • Approximation methods: neighbor joining, UPGMA • Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)

Suboptimal Tree Building Neighbor joining (corresponds to single-linkage clustering): • Order edges by distance • Join in order from short to long, merging branches as needed Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering): • For every pair of clusters (A, B), starting with all singletons: • Compute the average of distances between every object in A and every object in B • Merge the two clusters with the smallest average distance
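The UPGMA steps above can be sketched as follows. This is a naive version that recomputes every average from the original pairwise distances, so it is only suitable for small inputs. The labels A to D and all distance values are hypothetical.

```python
from itertools import combinations


def upgma(labels, dist):
    """Naive UPGMA. dist maps frozenset({x, y}) to a distance.

    Returns the list of merges as (cluster_a, cluster_b, average_distance),
    where clusters are tuples of labels.
    """
    clusters = [(label,) for label in labels]  # start with all singletons
    merges = []

    def avg(a, b):
        # Mean distance over every cross-cluster pair of members.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distances between four sequences.
toy = {
    frozenset("AB"): 2, frozenset("CD"): 4,
    frozenset("AC"): 8, frozenset("AD"): 8,
    frozenset("BC"): 8, frozenset("BD"): 8,
}
merges = upgma("ABCD", toy)
```

The merge order defines the tree: clusters merged at smaller average distances sit on closer branches.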

Global Fitting-Function Based K-means clustering • Pre-define the number of clusters • Find a distribution so that the sum of distances to the means is minimal • Computationally hard • Heuristics are used; application-specific heuristics may be efficient
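One widely used heuristic for this problem is Lloyd's iteration, which alternates between assigning points to the nearest mean and recomputing the means (it minimizes the sum of squared distances). A minimal one-dimensional sketch, with hypothetical points and starting centers:

```python
def kmeans_1d(points, centers, max_iter=100):
    """Lloyd's heuristic on numbers; the number of clusters is len(centers)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, groups


centers, groups = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(centers)  # [2.0, 11.0]
print(groups)   # [[1, 2, 3], [10, 11, 12]]
```

The result depends on the starting centers, which is one reason application-specific initialization can matter.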

Non-Linear Methods • Self-Organizing Maps: “self-learning” method • A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space

Pledging Based on distance to cluster • Representative • Set of representatives (all at extreme) • Other measure, may be unrelated to the initial one (profile, model)
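A minimal sketch of pledging by a single representative per cluster: a new object joins the cluster whose representative is nearest, provided it is close enough, and is otherwise left over for later clustering. The Hamming distance, the cutoff, and the cluster names are assumptions made for this example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pledge(obj, representatives, dist, max_dist):
    """Return the id of the nearest cluster, or None if none is close enough.

    representatives maps cluster id -> representative object.
    """
    best_id, best_rep = min(representatives.items(),
                            key=lambda item: dist(obj, item[1]))
    return best_id if dist(obj, best_rep) <= max_dist else None


reps = {"c1": "ACGTACGT", "c2": "TTTTGGGG"}  # hypothetical representatives
print(pledge("ACGTACGA", reps, hamming, max_dist=2))  # c1
print(pledge("CCCCCCCC", reps, hamming, max_dist=2))  # None
```

Swapping `dist` for a match score against a profile or model gives the third variant on the slide without changing the control flow.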

Performance Considerations Distance computing is harder than clustering (bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ) • For large data sets only k-mer and suffix array measures are practical • However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible. • For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations) • Binning: pre-clustering by rough and fast methods (figure example: 33 objects give 528 pairs; split into 4 groups, 127 pairs)
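A minimal sketch of the incremental / greedy idea: each sequence is compared only with the current cluster representatives, not with every other sequence. Processing the longest sequences first follows the general style of tools such as CD-HIT, but the function, the similarity rule, and the data below are hypothetical.

```python
def greedy_cluster(seqs, is_similar):
    """Greedy incremental clustering.

    Returns (clusters, comparisons), where clusters maps each
    representative to its members and comparisons counts similarity calls.
    """
    representatives = []
    clusters = {}
    comparisons = 0
    for s in sorted(seqs, key=len, reverse=True):  # longest first
        for rep in representatives:
            comparisons += 1
            if is_similar(s, rep):
                clusters[rep].append(s)  # join the first matching cluster
                break
        else:
            representatives.append(s)    # no match: s founds a new cluster
            clusters[s] = [s]
    return clusters, comparisons


def at_most_one_mismatch(a, b):
    """Toy similarity rule for this illustration only."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


seqs = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG", "AAAATA"]
clusters, comparisons = greedy_cluster(seqs, at_most_one_mismatch)
print(comparisons)  # 5, versus 10 for all pairs of 5 sequences
```

Because few comparisons are made, each one can afford a slower and more sensitive measure. The result is order-dependent and suboptimal, as is typical for greedy methods.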

Single Linkage is Fast • Time- and space-efficient clustering method: transitive closure-based • Requires ‘boolean’ distances (two sequences can be linked or not linked) • Requires the number of nodes to be known • Space ~ NodesNo • Run-time (worst) ~ EdgesNo * AveClustSize • Run-time (average) ~ EdgesNo * log2 (AveClustSize)
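One common way to compute the transitive closure over boolean links is a union-find (disjoint-set) structure, sketched below. It needs the node count up front and space proportional to the number of nodes, matching the requirements listed above. It is an assumption that this resembles the implementation behind the run-time estimates on the slide.

```python
def single_linkage_clusters(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 by transitive closure over the edges."""
    parent = list(range(n_nodes))  # each node starts as its own cluster

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


print(single_linkage_clusters(6, [(0, 1), (1, 2), (3, 4)]))
# [[0, 1, 2], [3, 4], [5]]
print(single_linkage_clusters(6, [(0, 1), (1, 2), (3, 4), (2, 3)]))
# [[0, 1, 2, 3, 4], [5]]
```

The second call shows how a single extra link fuses two clusters into one.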

Single Linkage is Prone to Aggregation Single-linkage clustering killer: CLUSTER AGGREGATION In large clusters, even a small number of random links leads to huge conglomerates.

Case Study: RNA-Seq Pipeline • Goals: • Compute transcript structures • Compute expression profiles (“virtual”) Pipeline elements: • Reads / EST clusters • Reads / EST db • Reads / clones attributed to a particular source / condition • Counting reads originating from different sources • Source / condition specific expression profiles
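The counting step can be sketched as follows: once each read is assigned to a cluster and attributed to a source, a "virtual" expression profile is the per-source read count of each cluster. The read ids, cluster ids, and source names are hypothetical.

```python
from collections import Counter, defaultdict


def expression_profiles(read_cluster, read_source):
    """Count reads per source within each cluster.

    read_cluster maps read id -> cluster id;
    read_source maps read id -> source / condition label.
    """
    profiles = defaultdict(Counter)
    for read, cluster in read_cluster.items():
        profiles[cluster][read_source[read]] += 1
    return {cluster: dict(counts) for cluster, counts in profiles.items()}


read_cluster = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2"}
read_source = {"r1": "leaf", "r2": "leaf", "r3": "root", "r4": "root"}
print(expression_profiles(read_cluster, read_source))
# {'c1': {'leaf': 2, 'root': 1}, 'c2': {'root': 1}}
```

Raw counts like these would normally be normalized by library size before comparing sources; that step is omitted here.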

RNAseq Analysis Solutions Source: bioinfo.org, Macquarie University, Sydney

RNAseq Clustering Approach Outline: 1. Detect identities (common segments): • Compute similarities • Select the “good” ones 2. Merge sequences into groups with shared segments: SINGLE LINKAGE Outcome: The single biggest cluster contains more than 60% of all sequences (selection by better similarity does not help). What causes aggregation and how to fight it?

Aggregation in RNA-Seq Clustering • “Bad” identities: • Pieces of vector constructs / adaptors • Repeats • Redundant sequences • Spurious matches (short infrequent repeats) • Chimeras (if pre-amplification is used)

Similarities Selection • Computing ‘boolean’ distances: • Threshold-based • Additional rules (match arrangement) • % identity + length + arrangement: OK
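A minimal sketch of turning an alignment hit into a boolean link using % identity, length, and arrangement. The slide does not specify the arrangement rule; the one assumed here is a proper overlap, where at each end of the alignment at least one of the two sequences must end (within a few bases). The `Hit` fields, thresholds, and coordinates (0-based, end-exclusive) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identity: float  # percent identity of the alignment
    length: int      # alignment length
    q_start: int     # alignment start on the query
    q_end: int       # alignment end on the query (exclusive)
    q_len: int       # full query length
    s_start: int     # alignment start on the subject
    s_end: int       # alignment end on the subject (exclusive)
    s_len: int       # full subject length


def is_link(hit, min_identity=95.0, min_length=50, slack=5):
    """True if the hit passes the thresholds and looks like a proper overlap."""
    if hit.identity < min_identity or hit.length < min_length:
        return False
    # Assumed arrangement rule: the alignment may only stop where a sequence ends.
    left_ok = hit.q_start <= slack or hit.s_start <= slack
    right_ok = (hit.q_len - hit.q_end <= slack
                or hit.s_len - hit.s_end <= slack)
    return left_ok and right_ok


dovetail = Hit(98.0, 100, 100, 200, 200, 0, 100, 300)  # query end meets subject start
internal = Hit(98.0, 100, 50, 150, 300, 80, 180, 400)  # match buried inside both
print(is_link(dovetail), is_link(internal))  # True False
```

A match buried inside both sequences is typical of repeats or chimeras, which is why an arrangement rule can reject links that pass on identity and length alone.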

Trimming / Masking • Fighting aggregation • Vector / adapter trimming: • Lucy, Figaro, etc. – integrated in many assembly suites (newbler, velvet, AMOS, CLCbio, etc.) • Low complexity detection / masking: • SEG, DUST, FastQC, WindowMasker etc. – often integrated in search tools
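To illustrate what adapter trimming does, here is a minimal exact-match sketch that removes an adapter found at the 3' end of a read, either in full or as a partial prefix. The adapter sequence and parameters are hypothetical. The tools named above also handle mismatches and base qualities, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (full, or a prefix of it at the read end)."""
    # Case 1: the whole adapter occurs in the read; cut from there on.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first n bases of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for n in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:len(read) - n]
    return read  # no adapter detected


adapter = "AGATCGGAAG"  # hypothetical adapter sequence
print(trim_adapter("ACGTACGTAGATCGGAAGTTT", adapter))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC", adapter))          # ACGTACGT
print(trim_adapter("ACGTACGT", adapter))               # ACGTACGT
```

The `min_overlap` cutoff trades missed adapter remnants against trimming genuine sequence that happens to match the first few adapter bases.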

Repeat Elimination Regular (tandem) repeats: • Pre-search masking: Based on structure (IMEx, SRF) or on database (TRDB) • Post-search detection based on similarity properties (multiple parallel threads)

Repeat Elimination Irregular (long) repeats: • Database based: RepeatMasker • De-novo: • RepeatScout, • orrb, • PILER, etc. (Require genome as input, construct database)

Detecting Chimeras • Detecting chimeric sequences: • Abundance-based: Perseus, UCHIME • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent • Specific to 16S: ChimeraSlayer, Bellerophon • Chimera ‘arms’ are closer to their originating clades than the entire chimera is

Detecting Chimeras • Similarity coverage-based: Mira assembler

Detecting Chimeras • Similarity graph topology-based: dchim (figure panels: alignment view, connectivity view)

Protein Clustering • Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species • Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW • Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight • No ‘one fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale • The results of clustering are precious; they are kept as databases (PFAM, COGs, KOGs, eggNOG)

Protein Clustering at JGI • Functional annotation of metagenome genes through protein clusters (IMG): • Build a set of functionally homogeneous clusters of similar proteins for annotated genomes • Build an HMM for each cluster, compose a model database • Pledge metagenome proteins to clusters by matching to models • Cluster unpledged proteins, build models, update the model database

Protein Clustering Use of protein clusters reduces the search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort. However, for proteins, which form dense relationship networks, clustering is a great tool. Konstantinos Mavrommatis will elaborate on protein clustering techniques.

Thank you!
