PatternHunter II: Highly Sensitive and Fast Homology Search

Bioinformatics and Computational Molecular Biology (Fall 2005): Representation • PatternHunter II: Highly Sensitive and Fast Homology Search • Ming Li, Bin Ma, Derek Kisman, John Tromp • R94922059 林語君

Overview • Homology search • Local alignment algorithms • PH I • PH II • Multiple Spaced Seeds • Computing hit probability • Finding a good seed set • PH II Design • Performance

Local alignment • Smith-Waterman • Smith and Waterman, 1981; Waterman and Eggert, 1987 • SSearch • FastA • Wilbur and Lipman, 1983; Lipman and Pearson, 1985 • BLAST • Altschul et al., 1990; Altschul et al., 1997 • Blast Family: BLASTN, BLASTP, etc. • MEGABLAST

PatternHunter • Seed • Tradeoff: sensitivity <-> computation • Consecutive k letters • k=11 in Blastn, k=28 in MegaBlast • Nonconsecutive k letters • Spaced seed • A 0/1 model whose number of 1s, k, is its weight
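To make the spaced-seed idea concrete, here is a minimal Python sketch. It assumes a seed is written as a 0/1 string in which 1 marks a position that must match, and it uses the weight-11 seed that appears later in these slides. The helper names (`seed_weight`, `seed_span`, `spaced_key`) are illustrative and are not taken from PatternHunter itself.

```python
def seed_weight(seed: str) -> int:
    """Weight = number of positions that must match (the 1s)."""
    return seed.count("1")


def seed_span(seed: str) -> int:
    """Span = total length of the seed, including the don't-care 0s."""
    return len(seed)


def spaced_key(seq: str, start: int, seed: str) -> str:
    """Letters of seq sampled at the 1 positions of the seed, from `start`.

    Two sequences produce the same key at a position exactly when they
    agree on all of the seed's 1 positions there (hypothetical helper).
    """
    return "".join(seq[start + i] for i, c in enumerate(seed) if c == "1")


consecutive_seed = "1" * 11           # consecutive k letters, k=11
spaced_seed = "111010010100110111"    # same weight, spread over a longer span

print(seed_weight(consecutive_seed), seed_span(consecutive_seed))
print(seed_weight(spaced_seed), seed_span(spaced_seed))
print(spaced_key("ACGTACGTACGTACGTACGT", 0, spaced_seed))
```

Both seeds sample 11 letters per position, so they cost about the same to index; only the layout of the sampled positions differs.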

PatternHunter II • Genome Informatics 14 (2003) • Extends the single optimized spaced seed of PH to multiple seeds • Speed: comparable to BLASTN (MEGABLAST) • Sensitivity: approaches Smith-Waterman (SSearch)

Definition • A homologous region, R • A seed hits R • A seed set A={a1,…,ak} hits R • Similarity • R has p=x% identities • Sensitivity • Hit probability • Optimal (DP) = 1
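The definitions above can be sketched in Python. The sketch assumes the homologous region is ungapped and is represented as a 0/1 string (1 = identical letters at that column). A seed hits the region if all of its 1 positions land on 1s at some offset, and a seed set hits if any member does. The function names are illustrative.

```python
def match_string(s: str, t: str) -> str:
    """0/1 string for two equal-length, ungapped sequences: 1 = identity."""
    return "".join("1" if a == b else "0" for a, b in zip(s, t))


def seed_hits(seed: str, region: str) -> bool:
    """True if every 1 position of the seed falls on a 1 of the region
    at some offset."""
    for start in range(len(region) - len(seed) + 1):
        if all(region[start + i] == "1"
               for i, c in enumerate(seed) if c == "1"):
            return True
    return False


def seed_set_hits(seeds, region: str) -> bool:
    """A seed set hits the region if at least one of its seeds hits."""
    return any(seed_hits(a, region) for a in seeds)


def similarity(region: str) -> float:
    """Fraction of identities in the region (the similarity level)."""
    return region.count("1") / len(region)


region = match_string("ACGTACG", "ACTTCCG")
print(region, similarity(region))
print(seed_hits("111", region), seed_hits("1101", region))
print(seed_set_hits(["111", "1101"], region))
```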

Computing Hit Probability • NP-hard on multiple seeds • DP on 1 seed • Extend DP to multiple seeds

Computing Hit Probability of Multiple Seeds • Let A={a1,…,ak} be a set of k seeds and R a random region of length L with similarity level p. • f(i, b): the probability that A hits R[0:i], given that the binary string b is a suffix of R[0:i] • Answer: f(L, ε), where ε is the empty string
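As a rough illustration of a suffix-based dynamic program for this quantity, the Python sketch below tracks, for each possible recent suffix, the probability that the region generated so far has not yet been hit. It assumes each position of R is independently a match with probability p. It is a simplified variant for small seeds (the state space grows exponentially with the longest seed span), not necessarily the exact recurrence on f(i, b) used in the paper.

```python
from collections import defaultdict


def ends_with_hit(seed: str, window: str) -> bool:
    """True if the seed matches the window aligned at the window's end."""
    if len(seed) > len(window):
        return False
    offset = len(window) - len(seed)
    return all(window[offset + i] == "1"
               for i, c in enumerate(seed) if c == "1")


def hit_probability(seeds, L: int, p: float) -> float:
    """Probability that at least one seed hits a random 0/1 region of
    length L whose positions are 1 independently with probability p."""
    max_span = max(len(a) for a in seeds)
    # no_hit[suffix]: probability that the prefix built so far ends with
    # `suffix` and has not been hit by any seed.
    no_hit = {"": 1.0}
    for _ in range(L):
        nxt = defaultdict(float)
        for suffix, prob in no_hit.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                if any(ends_with_hit(a, window) for a in seeds):
                    continue  # this probability mass is now "hit"
                # Only the last max_span-1 bits can affect later hits.
                key = window[-(max_span - 1):] if max_span > 1 else ""
                nxt[key] += prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


print(hit_probability(["11"], 3, 0.5))
print(hit_probability(["11", "101"], 3, 0.5))
```

Adding a seed to the set can only keep or raise the hit probability, which is what makes a seed-by-seed search strategy meaningful.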

Computing Hit Probability of Multiple Seeds

Finding a Good Seed Set • NP-hard for both optimal seed and multiple seeds • Greedy

Finding a Good Seed Set • Compute the 1st seed a1 which maximizes the hit probability of {a1} • Compute the 2nd seed a2 which maximizes the hit probability of {a1, a2} • Repeat until • Reach the desired number of seeds • Reach the desired hit probability
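The greedy procedure above can be sketched in Python as follows. To stay self-contained the block repeats a small hit-probability routine (independent positions, state space exponential in the seed span), and it searches exhaustively over candidate seeds of a fixed weight and bounded span. The function names, the candidate restriction (seeds start and end with 1, weight at least 2), and the small parameters are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def hit_probability(seeds, L, p):
    """P(some seed hits a random 0/1 region of length L, P(1) = p)."""
    max_span = max(len(a) for a in seeds)

    def ends_with_hit(seed, window):
        off = len(window) - len(seed)
        return off >= 0 and all(window[off + i] == "1"
                                for i, c in enumerate(seed) if c == "1")

    no_hit = {"": 1.0}  # suffix -> probability of reaching it with no hit
    for _ in range(L):
        nxt = defaultdict(float)
        for suffix, prob in no_hit.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                if not any(ends_with_hit(a, window) for a in seeds):
                    key = window[-(max_span - 1):] if max_span > 1 else ""
                    nxt[key] += prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) that start and end with 1."""
    out = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            out.append("".join("1" if i in ones else "0"
                               for i in range(span)))
    return out


def greedy_seed_set(weight, max_span, L, p, max_seeds, target=1.0):
    """Add, one at a time, the seed that maximizes the hit probability of
    the set so far; stop at max_seeds or once `target` is reached."""
    candidates = candidate_seeds(weight, max_span)
    chosen, best_prob = [], 0.0
    while len(chosen) < max_seeds and best_prob < target:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda c: hit_probability(chosen + [c], L, p))
        chosen.append(best)
        best_prob = hit_probability(chosen, L, p)
    return chosen, best_prob


seeds, prob = greedy_seed_set(weight=3, max_span=5, L=12, p=0.7, max_seeds=2)
print(seeds, prob)
```

Each greedy step fixes the earlier choices, so the resulting set is not guaranteed to maximize the combined hit probability.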

Finding a Good Seed Set • May not optimize the combined hit probability • Good enough • Optimal • 16 weight, 11 seeds, L=64, similarity=70%, first four seeds: {111010010100110111, 111100110010100001011, 110100001100010101111, 1110111010001111} • Greedy • 16 weight, 12 seeds, L=64, similarity=70%, first four seeds: {111010010100110111, 1111000100010011010111, 1100110100101000110111, 1110100011110010001101}

Performance of the seeds • From low to high • Solid: weight-11 k=1,2,4,8,16 seeds • Dashed: 1-seed, weight=10,9,8,7

Performance of the seeds • Reducing the weight by 1 • Increase the expected number of hits by a factor of 4 • Doubling the number of seeds • Increase the expected number of hits by a factor of 2 • Better: Multiple seeds
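The two factors above follow from a simple back-of-the-envelope model, sketched here in Python. It assumes unrelated DNA with four equally likely, independent letters, so a weight-w seed matches at a given pair of positions with probability 4^-w, and it ignores edge effects. The function name and the sequence lengths are illustrative only.

```python
def expected_random_hits(weight: int, num_seeds: int,
                         query_len: int, db_len: int) -> float:
    """Expected chance hits: (seeds) x (position pairs) x 4^-weight.

    Assumes uniform, independent letters over {A, C, G, T}.
    """
    return num_seeds * query_len * db_len * 0.25 ** weight


base = expected_random_hits(weight=11, num_seeds=1,
                            query_len=1000, db_len=10**6)
lighter = expected_random_hits(weight=10, num_seeds=1,
                               query_len=1000, db_len=10**6)
doubled = expected_random_hits(weight=11, num_seeds=2,
                               query_len=1000, db_len=10**6)

print(lighter / base)   # cost factor of lowering the weight by 1
print(doubled / base)   # cost factor of doubling the number of seeds
```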

PH II Performance • Compare with Blast(Blastn), Smith-Waterman(SSearch) • Sensitivity of SSearch = 1 • Alignment score • BLAST methods (hash, DP) • match=1, mismatch=-1, gapopen=-5, gapextension=-1
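For reference, the scoring scheme listed above can be applied to a given pairwise alignment as in the Python sketch below. It assumes the alignment is supplied as two equal-length strings with "-" for gaps, and that a gap of length g scores gapopen + g × gapextension; some tools instead charge the opening penalty in place of the first extension, so treat that convention as an assumption. The function name is illustrative.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_ext: int = -1) -> int:
    """Score a gapped pairwise alignment under affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_row = None  # which row the current run of gaps is in, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both rows")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != gap_row:
                score += gap_open   # a new gap starts in this row
            score += gap_ext        # every gap column pays the extension
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


print(alignment_score("ACGT", "ACCT"))
print(alignment_score("AC--GT", "ACTTGT"))
```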

PH II Performance • From low to high • Solid: PH II, 1, 2, 4, 8 seeds weight 11 • Dashed: Blastn, seed weight 11

Complexity Proof • Finding optimal spaced seeds • NP-hard • Finding one optimal seed • NP-hard • Computing the hit probability of multiple seeds • NP-hard
