PatternHunter II: Highly Sensitive and Fast Homology Search

Bioinformatics and Computational Molecular Biology (Fall 2005): Presentation • Paper authors: Ming Li, Bin Ma, Derek Kisman, John Tromp • Presented by: R94922059 林語君

Overview • Homology search • Local alignment algorithms • PH I • PH II • Multiple Spaced Seeds • Computing hit probability • Finding a good seed set • PH II Design • Performance

Local alignment • Smith-Waterman • Smith and Waterman, 1981; Waterman and Eggert, 1987 • SSearch • FastA • Wilbur and Lipman, 1983; Lipman and Pearson, 1985 • BLAST • Altschul et al., 1990; Altschul et al., 1997 • Blast Family: BLASTN, BLASTP, etc. • MEGABLAST

PatternHunter (PH I) • Seed • Tradeoff: sensitivity <-> computation • Consecutive seed: k consecutive matching letters • k=11 in Blastn, k=28 in MegaBlast • Spaced seed: k nonconsecutive matching letters • Written as a 0/1 model; the number of 1s, k, is its weight

PatternHunter II • Genome Informatics 14 (2003) • Extends the single optimized spaced seed of PH to multiple seeds • Speed: comparable to BLASTN (MEGABLAST) • Sensitivity: approaches Smith-Waterman (SSearch)

Definition • A homologous region, R • A seed hits R • A seed set A={a1,…,ak} hits R if at least one of its seeds hits R • Similarity • R has p=x% identities • Sensitivity = hit probability • Full DP (optimal alignment) has sensitivity 1
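The definitions above can be made concrete with a small sketch. It assumes an ungapped region encoded as a 0/1 string (1 = identical letters, 0 = mismatch) and seeds written as 0/1 models; the function names are illustrative and do not come from PatternHunter itself.

```python
def alignment_to_bits(s, t):
    """Encode an ungapped alignment as a 0/1 string: 1 = match, 0 = mismatch."""
    if len(s) != len(t):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return "".join("1" if a == b else "0" for a, b in zip(s, t))


def seed_hits_at(seed, region, pos):
    """True if every 1-position of `seed` lines up with a 1 of `region` at offset `pos`."""
    if pos < 0 or pos + len(seed) > len(region):
        return False
    return all(region[pos + j] == "1" for j, c in enumerate(seed) if c == "1")


def seed_set_hits(seeds, region):
    """A seed set hits the region if at least one seed hits at some offset."""
    return any(
        seed_hits_at(a, region, pos)
        for a in seeds
        for pos in range(len(region) - len(a) + 1)
    )


region = alignment_to_bits("ACGTACGTAC", "ACGTTCGTAC")  # one mismatch in the middle
print(region)                              # 1111011111
print(seed_set_hits(["111111"], region))   # consecutive seed, weight 6: False
print(seed_set_hits(["1110111"], region))  # spaced seed, weight 6: True
```

In this toy region the spaced seed tolerates the mismatch under its 0 position, while the consecutive seed of the same weight finds no hit.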

Computing Hit Probability • NP-hard on multiple seeds • DP on 1 seed • Extend DP to multiple seeds

Computing Hit Probability of Multiple Seeds • Let A={a1,…,ak} be a set of k seeds and R a random region of length L with similarity level p. • f(i, b): probability that A hits R[0:i], given that binary string b is a suffix of R[0:i] • Answer: f(L, ε), ε = empty string

Computing Hit Probability of Multiple Seeds
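The recurrence itself did not survive on this slide. The following is a minimal memoized sketch in the spirit of f(i, b), assuming each position of R is an independent match with probability p. The three cases (a seed already hits the tail of b, no seed can end at the tail, otherwise branch on the next bit to the left) are a reconstruction for illustration and are not claimed to be the paper's exact formulation; `hit_probability` and `compatible` are hypothetical names.

```python
from functools import lru_cache


def hit_probability(seeds, L, p):
    """Probability that at least one seed hits a random 0/1 region of length L."""
    seeds = tuple(seeds)
    min_len = min(len(a) for a in seeds)

    def compatible(a, b, i):
        # Seed `a`, ending at the last position of R[0:i], does not contradict suffix b.
        if len(a) > i:
            return False
        overlap = min(len(a), len(b))
        return all(b[-1 - j] == "1" for j in range(overlap) if a[-1 - j] == "1")

    @lru_cache(maxsize=None)
    def f(i, b):
        # f(i, b): probability of a hit in R[0:i], given that R[0:i] ends with b.
        if i < min_len:
            return 0.0                      # region too short for any seed
        if any(len(a) <= len(b) and compatible(a, b, i) for a in seeds):
            return 1.0                      # some seed already hits the tail of b
        if not any(compatible(a, b, i) for a in seeds):
            return f(i - 1, b[:-1])         # no hit can end here: drop the last bit
        # Otherwise condition on the next bit to the left of b.
        return (1 - p) * f(i, "0" + b) + p * f(i, "1" + b)

    return f(L, "")


print(hit_probability(["11"], 3, 0.5))         # 0.375 (3 of the 8 strings contain "11")
print(hit_probability(["11", "101"], 3, 0.5))  # 0.5
```

Only suffixes that remain compatible with some seed are ever extended, which keeps the number of memoized states far below 2^L for seed sets of the size discussed in these slides.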

Finding a Good Seed Set • NP-hard for both optimal seed and multiple seeds • Greedy

Finding a Good Seed Set • Compute the 1st seed a1 which maximizes the hit probability of {a1} • Compute the 2nd seed a2 which maximizes the hit probability of {a1, a2} • Repeat until • Reach the desired number of seeds • Reach the desired hit probability
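The greedy steps above can be sketched as follows. To stay self-contained, the sketch scores seed sets by brute-force enumeration of all 2^L regions, which is only feasible for toy sizes; a real search would plug in the dynamic-programming hit probability instead. The candidate generator and all function names are assumptions made for illustration.

```python
from itertools import combinations, product


def hits(seeds, region):
    """True if any seed hits the 0/1 region at any offset."""
    for a in seeds:
        ones = [j for j, c in enumerate(a) if c == "1"]
        for pos in range(len(region) - len(a) + 1):
            if all(region[pos + j] == "1" for j in ones):
                return True
    return False


def hit_probability_bruteforce(seeds, L, p):
    """Exact hit probability by enumerating every region (toy sizes only)."""
    total = 0.0
    for bits in product("01", repeat=L):
        if hits(seeds, bits):
            k = bits.count("1")
            total += p ** k * (1 - p) ** (L - k)
    return total


def candidate_seeds(weight, max_len):
    """All 0/1 models of the given weight (>= 2) that start and end with 1."""
    out = []
    for length in range(weight, max_len + 1):
        inner = length - 2
        for ones in combinations(range(inner), weight - 2):
            mid = ["0"] * inner
            for j in ones:
                mid[j] = "1"
            out.append("1" + "".join(mid) + "1")
    return out


def greedy_seed_set(candidates, L, p, max_seeds, target=1.0):
    """Add one seed at a time, each maximizing the hit probability of the set so far."""
    chosen, prob = [], 0.0
    while len(chosen) < max_seeds and prob < target:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda c: hit_probability_bruteforce(chosen + [c], L, p))
        chosen.append(best)
        prob = hit_probability_bruteforce(chosen, L, p)
    return chosen, prob


seeds, prob = greedy_seed_set(candidate_seeds(3, 5), L=10, p=0.7, max_seeds=2)
print(seeds, round(prob, 4))
```

The two stopping conditions in the loop correspond to the two bullets on the slide: a desired number of seeds (`max_seeds`) or a desired hit probability (`target`).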

Finding a Good Seed Set • Greedy may not maximize the combined hit probability, but is good enough • Optimal • 16 seeds of weight 11, L=64, similarity=70%, first four seeds: {111010010100110111, 111100110010100001011, 110100001100010101111, 1110111010001111} • Greedy • 16 seeds of weight 12, L=64, similarity=70%, first four seeds: {111010010100110111, 1111000100010011010111, 1100110100101000110111, 1110100011110010001101}

Performance of the seeds • From low to high • Solid: weight-11 k=1,2,4,8,16 seeds • Dashed: 1-seed, weight=10,9,8,7

Performance of the seeds • Reducing the weight by 1 • Increases the expected number of hits by a factor of 4 • Doubling the number of seeds • Increases the expected number of hits by a factor of 2 • Better: multiple seeds
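The factors 4 and 2 follow from a simple count, sketched here under the assumption of uniformly random, unrelated DNA: a weight-w seed matches at a given pair of positions by chance with probability 4^-w, so k seeds produce about k · m · n · 4^-w random hits between sequences of lengths m and n. Edge effects from the seed length are ignored, and the function name is illustrative.

```python
def expected_random_hits(weight, num_seeds, m, n):
    """Approximate chance hits between two random DNA sequences of lengths m and n."""
    return num_seeds * m * n * 0.25 ** weight


base = expected_random_hits(weight=11, num_seeds=1, m=1_000_000, n=1_000_000)
lighter = expected_random_hits(weight=10, num_seeds=1, m=1_000_000, n=1_000_000)
doubled = expected_random_hits(weight=11, num_seeds=2, m=1_000_000, n=1_000_000)

print(lighter / base)  # 4.0: one less unit of weight
print(doubled / base)  # 2.0: twice as many seeds
```

Both changes raise sensitivity, but doubling the seed count costs half as many extra random hits as lowering the weight by one, which is the case for multiple seeds made on the slide.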

PH II Performance • Compare with Blast(Blastn), Smith-Waterman(SSearch) • Sensitivity of SSearch = 1 • Alignment score • BLAST methods (hash, DP) • match=1, mismatch=-1, gapopen=-5, gapextension=-1

PH II Performance • From low to high • Solid: PH II, 1, 2, 4, 8 seeds weight 11 • Dashed: Blastn, seed weight 11

Complexity Proof • Finding optimal spaced seeds • NP-hard • Finding one optimal seed • NP-hard • Computing the hit probability of multiple seeds • NP-hard
