Sequence Clustering

Sequence Clustering: Reducing Search Space in Protein and DNA/RNA Sequence Analysis
Denis Kaznadzey, GBP MGM Workshop, September 26, 2011

Sequence clustering
To deal with a huge variety of individual ‘objects’:
• Classify into groups of essentially similar objects
• When new data arrives, assign objects to existing groups
• Classify ‘leftovers’
• Occasionally review the entire classification
• Problem: What is ‘essentially similar’?
• Finding properties that are important (ontological relevancy)
• Does the classification reflect reality in any way?

Sequence clustering
Taxonomical Classification vs. Continuity of the Great Chain of Being
Even if reductionist, classification is a tool to study the world, and biology in particular.
When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.
Carl Linnaeus; Georges Buffon

Sequence clustering
• In modern biology, the most abundant type of data is sequence:
  • Genomic DNA
  • RNA (through RNA-Seq)
  • Derived proteins
• The primary feature is primary structure, but classification criteria depend on the application.

Sequence Clustering
Select Applications in Genomic Sciences:
• Genome assembly: binning, scaffolding
• Transcriptomics: EST (read) clustering
• Protein function and evolution studies: protein families
• Phylogenetic profiling: OTUs

Sequence Clustering
• In metagenomics, the primary tasks are:
  • Assess diversity
  • Find genes
  • Predict functions
  • Predict pathways
  • Estimate capabilities
• All are based on sequence comparison.

Sequence Clustering
• Any clustering is based on distance in some metric.
• Initial clustering is based on pair-wise distances.
• Subsequent classification is based on distances from an object to clusters, measured against:
  • A representative
  • A set of representatives (at the extreme, all members)
  • Another measure, possibly unrelated to the initial one

Sequence Clustering
• Once a distance measure is chosen and distances are obtained / computed:
• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)
  • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.
• However, options for large-volume clustering are limited by the performance of the algorithms.
• Single linkage can be computed very efficiently
  • (The method for pledging new sequences to clusters may be computationally more intensive)

Sequence clustering
• Most efficient clustering: transitive-closure based.
• Requires ‘boolean’ distances (two sequences are either linked or not linked)
• Requires the number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)

Sequence clustering
• Practical transitive closure algorithm:
• Allocate an array of sequence numbers A[0..N], initialized so that A[n] = n
• Phase I: connect linked vertices through the vertex of smallest index
  • For each edge (m, n):
    • While A[n] != n: n = A[n]
    • While A[m] != m: m = A[m]
    • A[max(m, n)] = min(m, n)
• Phase II: propagate smallest indices as cluster identifiers
  • For each n from 0 to N:
    • If A[n] != A[A[n]]: A[n] = A[A[n]]
• Phase III: collect clusters (implementation dependent)
  • Count the number of distinct cluster “id”s => M (1 pass)
  • Allocate an array of sizes; count the size of each cluster (1 pass)
  • Allocate an array of clusters; fill it in (1 pass)
• Example: adding edges (1,3), (5,6), (6,1) over nodes 0..6 gives clusters (0); (1,3,5,6); (2); (4)
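The three phases above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `transitive_closure_clusters` and the dictionary-based Phase III are assumptions made for brevity (the slide describes array-based collection in three passes).

```python
def transitive_closure_clusters(num_nodes, edges):
    """Cluster nodes 0..num_nodes-1 by transitive closure over boolean links."""
    # A[n] = n initially: every node is its own cluster.
    parent = list(range(num_nodes))

    def find_root(v):
        # Follow links until reaching a node that points to itself.
        while parent[v] != v:
            v = parent[v]
        return v

    # Phase I: connect linked vertices through the vertex of smallest index.
    for m, n in edges:
        root_m = find_root(m)
        root_n = find_root(n)
        parent[max(root_m, root_n)] = min(root_m, root_n)

    # Phase II: propagate the smallest index as the cluster identifier.
    # Parents always have a smaller index, so one ascending pass suffices.
    for v in range(num_nodes):
        parent[v] = parent[parent[v]]

    # Phase III (simplified assumption): group members by cluster id.
    clusters = {}
    for v, cluster_id in enumerate(parent):
        clusters.setdefault(cluster_id, []).append(v)
    return list(clusters.values())


# The example from the slide: edges (1,3), (5,6), (6,1) over nodes 0..6.
print(transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)]))
```

Only the `parent` array is kept in memory, which matches the Space ~ NodesNo estimate; edges can be streamed one at a time.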

Sequence clustering
• Computing ‘boolean’ distances:
  • Threshold-based
  • Additional rules (match arrangement)
• Example: read/EST clustering
  • % identity + length + arrangement: OK
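A threshold-based boolean distance of this kind might look like the sketch below. Everything here is hypothetical: the `PairwiseHit` record, the `arrangement_ok` flag (standing in for whatever match-arrangement rule an application uses), and the default thresholds are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class PairwiseHit:
    """Summary of one pairwise alignment (hypothetical record layout)."""
    percent_identity: float  # 0..100
    aligned_length: int      # alignment length in residues
    arrangement_ok: bool     # result of an application-specific arrangement rule


def is_linked(hit, min_identity=95.0, min_length=100):
    """Return True if the two sequences should be treated as linked.

    The thresholds are placeholders; real values depend on the application.
    """
    return (
        hit.percent_identity >= min_identity
        and hit.aligned_length >= min_length
        and hit.arrangement_ok
    )


print(is_linked(PairwiseHit(98.0, 250, True)))  # all three criteria pass
print(is_linked(PairwiseHit(98.0, 50, True)))   # fails the length criterion
```

Each pair for which `is_linked` is true becomes one edge in the transitive-closure clustering.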

Computing similarity measure:
• Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
• Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
• K-mer statistics: CD-HIT, USEARCH, MUSCLE
• Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
• Suffix arrays: Bowtie, BWT
• Position-specific scoring matrix: PSI-BLAST, IMPALA
• Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
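As a rough illustration of the k-mer statistics family, the sketch below computes a Jaccard similarity over k-mer sets. This is a generic textbook measure chosen for simplicity; it is not the word-counting scheme of CD-HIT, USEARCH, or MUSCLE, and the function names and the default k are assumptions.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(seq_a, seq_b, k=3):
    """Jaccard similarity of two sequences' k-mer sets (0.0 to 1.0)."""
    kmers_a = kmer_set(seq_a, k)
    kmers_b = kmer_set(seq_b, k)
    if not kmers_a or not kmers_b:
        # A sequence shorter than k has no k-mers; treat as dissimilar.
        return 0.0
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


print(kmer_jaccard("ACGTACGT", "ACGTACGT"))  # identical sequences
print(kmer_jaccard("AAAA", "CCCC"))          # no shared k-mers
```

No alignment is computed, which is why measures of this kind scale to large data sets.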

Sequence clustering
• Computing distances is harder than clustering. (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)
• For large data sets, only k-mer and suffix array measures are practical.
• However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
• For boolean distances, iterative similarity detection is possible: fast binning -> slow comparing. (No off-the-shelf implementations(?))
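The incremental / greedy idea can be sketched as follows: each incoming sequence is compared only against current cluster representatives, so the full pairwise matrix is never built. This is a minimal sketch under stated assumptions, not the algorithm of any specific tool; `greedy_cluster` and `toy_identity` are hypothetical names, and `toy_identity` is a deliberately naive ungapped position-wise comparison standing in for a real (possibly sensitive and slow) similarity measure.

```python
def greedy_cluster(sequences, similarity, threshold):
    """Assign each sequence to the first representative it matches.

    A sequence that matches no representative founds a new cluster
    and becomes that cluster's representative.
    """
    representatives = []
    clusters = []
    for seq in sequences:
        for idx, rep in enumerate(representatives):
            if similarity(seq, rep) >= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No representative was similar enough.
            representatives.append(seq)
            clusters.append([seq])
    return clusters


def toy_identity(seq_a, seq_b):
    """Fraction of matching positions without gaps (placeholder measure)."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return matches / longest


print(greedy_cluster(["AAAA", "AAAT", "CCCC", "CCCG"], toy_identity, 0.75))
```

The number of comparisons grows with (sequences × clusters) rather than (sequences × sequences), at the cost of results that depend on input order.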

Sequence clustering • Boolean distance clustering killer: • CLUSTER AGGREGATION. • In large clusters, even a small number of random links lead to huge conglomerates.

Common causes:
• Contamination with standard constructs
• Repeats
• Chimeras
• Spurious similarities (low-complexity zones, etc.)
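A toy example shows why aggregation is so damaging under boolean linking: a single spurious edge is enough to fuse two otherwise separate clusters. The node numbering and the `connected_components` helper are assumptions made for this illustration.

```python
from collections import defaultdict


def connected_components(num_nodes, edges):
    """Return clusters as sorted lists of node indices (depth-first search)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in range(num_nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        members = []
        while stack:
            node = stack.pop()
            members.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(members))
    return components


# Two genuine clusters: {0, 1, 2} and {3, 4, 5}.
true_links = [(0, 1), (1, 2), (3, 4), (4, 5)]
# One spurious link, e.g. from a shared repeat or untrimmed adapter.
spurious_links = [(2, 3)]

before = connected_components(6, true_links)
after = connected_components(6, true_links + spurious_links)
print(before)
print(after)
```

With larger clusters the chance that at least one member carries such a link grows, which is how conglomerates form.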

Sequence clustering
• Fighting aggregation
• Vector / adapter trimming:
  • Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)
• Low complexity detection / masking:
  • SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools.

Sequence clustering
• Repeat detection / masking:
• Regular (tandem) repeats:
  • Pre-search masking: based on structure (IMEx, SRF) or on a database (TRDB)
  • Post-search detection based on similarity properties (multiple parallel threads)
• Irregular (long) repeats:
  • Database based: RepeatMasker
  • De novo: RepeatScout, orrb, PILER, etc. These require a genome as input and construct a database.

Sequence clustering
• Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangement are more frequent.
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera ‘arms’ are closer to their originating phyla than the entire chimera is

Sequence clustering • Detecting chimeric sequences • Similarity coverage based: Mira assembler

Sequence clustering
• Detecting chimeric sequences
• Similarity graph topology based: dchim
[Figure panels: Alignment view; Connectivity view]

Protein Clusters: various criteria
• Primary structure similarity
• Close evolutionary relationship
• Similarity in physical properties
• 3-D structure similarity
• Similar fold arrangement
• Domain structure similarity
• Common or similar functions
• etc.

Sequence clustering • Functional and structural classifications in IMG

Sequence clustering
• A direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species
• Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER.
• For individual genomes (10^3 to 5×10^4 proteins) they could be used with massively parallel computations (while the number of genomes stays in the thousands)
• For metagenomes they cannot be used with foreseeable computing resources.

Sequence clustering
• Functional annotation of metagenome genes through protein clusters (under development):
  • Build a set of functionally homogeneous clusters of similar proteins for annotated genomes
  • Build an HMM for each cluster; compose a model database
  • Pledge metagenome proteins to clusters by matching them to models
  • Cluster unpledged proteins, build models, update the model database
  • Balance the model database by creating a model tree: aggregating small related clusters and dissecting large ones
  • Perform hierarchical searches through the profile tree

Sequence clustering
• Clustering reduces the search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort.
• It improves only searches within the parameter space used for clustering
  • (structure-based clusters are not useful when searching for a certain codon usage, etc.)

However, for proteins, which form dense relationship networks, clustering is a great tool.

Thank you!
