1.19k likes | 1.21k Views
PatternHunter is a novel algorithm for homology search that offers improved speed and sensitivity compared to traditional methods like BLAST. This paper explores the concept, performance, and advantages of PatternHunter in comparison to existing algorithms. The algorithm utilizes spaced seed patterns to efficiently identify homologous sequences, enhancing its performance and accuracy in detecting similarities between sequences. The paper also delves into the technical details of PatternHunter II, discussing its algorithmic advancements and performance metrics. Overall, PatternHunter demonstrates superior performance in identifying nonconsecutive seeds, proof of the BLAST algorithm, finding seeded matches, and facilitating gapped extensions through dynamic programming.
E N D
PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086 洪錫全 B92902087 郭立翔
Agenda PatternHunter Spaced Seed Algorithm Performance PatternHunter II Algorithm Performance Translated PatternHunter
Outline A short review about BLAST. Some definition and background. What’s the difference and the same between BLAST and PatternHunter. Why PatternHunter is better?? Nonconsecutive seeds Proof
Blast Algorithm Find seeded matches Extent to HSP’s (High scoring Segment Pairs) Gapped Extension, dynamic programming Report significant local alignments
A short review about BLAST Find hits. BLAST first scans the database for words that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit.
A short review about BLAST Find HSPs HSP (High scoring Segment Pair) is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal within a relative shot distance of one another.
A short review about BLAST Generate gapped alignment This means that two or more HSPs in BLAST with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result.
A short review about BLAST In summary, the new gapped BLAST algorithm requires two non-overlapping hits of score at least T, within a distance A of one another, to invoke an ungapped extension of the second hit. If the HSP generated normalized score at least Sg bits, then a gapped extension is triggered.
Some definition, some background Similarity How similar it is between two sequences? Usually mean that the probability of the same symbol appear in anywhere of two sequences. Sensitivity The probability to find a local alignment. Specificity In all local alignments, how many alignments are homologous.
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Define the Seed Defining the seed: w w -> weight or number of positions to match Blastn: 11 MegaBlast: 28 model -> relative position of letters for each w w model m m -> length of model “window”
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Seed Parameters: w = 11 w = 11 letters: letters: 0, 1 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1 m = 18 m = 18 1 – exact match required { model model 0 – no match required, any value Patternhunter most sensitive model Blastn seed is all “1”s
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Seed, Hit, Homology What is a seed? Seeds determine how an algorithm looks for hits What is a hit? Hits indicate a similarity that may indicate a homology
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 hit GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA---- || ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | || GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG Human-Mouse genome homology
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Example: Consider the following two sequences: GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT What’s the differences in finding the seed between Blast and PatternHunter?
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 BLAST uses “consecutive seeds” In BLAST, we often use the consecutive model with weight 11. GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT → 11111111111 → … →… … → 11111111111 ← However, it fails to find the alignment in the two sequence.
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Consecutive seeds There’s also a dilemma for BLAST type of search. Dilemma Sensitivity – needs shorter seeds too many random hits, slow computation Speed – needs longer seeds lose distant homologies
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 PatternHunter uses “non-consecutive seed” In PatternHunter, we often use the spaced model with weight 11 and length 18. GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT 111010010100110111
Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Consecutive vs. Nonconsecutive? The non-consecutive seed is the primary difference and strength of Patternhunter Blastn: 1 1 1 1 1 1 1 1 1 1 1 PatternHunter: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1
Reference Ming Li, NHC2005 A trivial comparison between spaced and consecutive seed Consider 111 and 1101. To fail seed 111, we can use 110110110110… 66.66% similarity But we can prove, seed 1101 will hit every region with 61% similarity for sufficient long region.
Reference Ming Li, NHC2005 Proof Suppose there is a length 100 region which is not hit by 1101. We can break the region into blocks of 1a0b. Besides the last block, the other blocks have the following few cases: 10bfor b>=1 110bfor b>=2 1110bfor b>=2 In each block, similarity <= 3/5. The last block has at most 3 matches. So, in total there are at most 61 matches in 100 positions. The similarity is <=61%.
Reference Ming Li, NHC2005 Formalize Given i.i.d. sequence (homology region) with Pr(1)=p and Pr(0)=1-p for each bit: 1100111011101101011101101011111011101 111*1**1*1**11*111 Which seed is more likely to hit this region: BLAST seed: 11111111111 Spaced seed: 111*1**1*1**11*111
Reference Ming Li, NHC2005 Expect Less, Get More Lemma: The expected number of hits of a weight W length M seed model within a length L region with homology level p is (L-M+1)pW Proof. E(#hits) = ∑i=1 … L-M+1pW ■ Example: In a region of length 64 with p=0.7 Pr(BLAST seed hits)=0.3 E(# of hits by BLAST seed)=1.07 Pr(optimal spaced seed hits)=0.466, 50% more E(# of hits by spaced seed)=0.93, 14% less
Reference Ming Li, NHC2005 Why Is Spaced Seed Better? A wrong, but intuitive, proof: seed s, interval I, similarity p E(#hits) = Pr(s hits) E(#hits | s hits) Thus: Pr(s hits) = Lpw/ E(#hits | s hits) For optimized spaced seed, E(#hits | s hits) 111*1**1*1**11*111 Non overlap Prob 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 7 p7 ….. For spaced seed: the divisor is 1+p6+p6+p6+p7+ … For BLAST seed: the divisor is bigger: 1+ p + p2+ p3 + …
Reference Ming Li, NHC2005 Simulated sensitivity curves
Reference Ming Li, NHC2005 Observations of spaced seeds Seed models with different shapes can detect different homologies. Two consequences: Some models may detect more homologies than others More sensitive homology search PatternHunter I Can use several seed models simultaneously to hit more homologies Approaching 100% sensitive homology search PatternHunter II
PatternHunter – Algorithm & Performance
Outline Hit generation Hit extension Gapped extension Performance
Hit generation Index created for each position in the query sequence
Hit generation Similar to MegaBlast: Hash tables Encode ATCG into binary code 00, 01, 10, 11 respectively Find each situations in one of the sequence and record the offsets in the hash table
Hit generation An example: Now we want to find hits between sequences S and T
Spaced seed For sequence T: A 00 Scan T 01 A T A T G C A T 1 1 0 1 0 1 1 0 Model Seed ‧‧ ‧‧ C 10 G 11 A T T C A 0001011000 = 88 Weight=5 the value is between 0~2^10-1
After filling in the hash table… ‧‧‧ Position in T For each position in S: 0 1 2 3 ‧‧‧ 10 19 34 (NULL) 14 10 48 134 1.Calculate int value 2. Find hits in S by the lookup value ‧‧‧ 87 88 2 8 33 ‧‧‧
Hash tables: space required ‧‧‧ Position in T 0 10 (NULL) 14 48 10 19 34 4^w integers 1 2 |T| integers 3 ‧‧‧ 134 ‧‧‧ Total: 4^(w+1)+4|T| bytes 87 88 2 8 33 ‧‧‧
Cost a lot to make a hash table? If the number of hits found for one index is large, the cost of computing index is relatively negligible.
Hit extension HSP: Highscoring Segment Pair Scan those hits with a window, and choose the highest-scored one.
Hit extension S The chosen hit T
Hit extension Set the mid point of the chosen hit as the cut point, split the graph into 4
Hit extension S T
Hit extension And then do the Smith-Waterman in 2 of the 4, until it reaches the dropoff score.
Hit extension S Smith-Waterman Cost=1/2*O(mn) Smith-Waterman T
Hit extension If the resulting segment pair has a score below certain minimum, then ignore it. Else we gain a HSP and do the next step-gap extension.
Hit extension A question: when doing extension in 2 ways, how to synchronize the score?
Gapped Extension To find the best way to extend an HSP to the left across gaps. To extend an HSP we try all candidates from a diagonal-sorted set. Penalty for gap open + gap extension + cropping
Gapped Extension Search front
From left to right Optimal Left Too Far Right Too Far Right Optimal Left
From left to right Optimal Left Too Far Right
Descriptions in the paper We use a red-black tree for this. Insert HSP when the optimal alignment to its left is found Retired from the tree once newly generated HSPs are too far beyond its right endpoint to make use of it.
Thought 1 The first one will be inserted Fast
Thought 1 May not find the best one End Start Better Worse