PatternHunter: faster and more sensitive homology search

PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B92902019 鍾承宏 B92902033 王凱平 B92902039 莊謹譽 B92902072 張智翔 B92902086 洪錫全 B92902087 郭立翔

Agenda  PatternHunter  Spaced Seed  Algorithm  Performance  PatternHunter II  Algorithm  Performance  Translated PatternHunter

PatternHunter – Spaced Seed

Outline  A short review about BLAST.  Some definition and background.  What’s the difference and the same between BLAST and PatternHunter.  Why PatternHunter is better??  Nonconsecutive seeds  Proof

Blast Algorithm  Find seeded matches  Extent to HSP’s (High scoring Segment Pairs)  Gapped Extension, dynamic programming  Report significant local alignments

A short review about BLAST  Find hits.  BLAST first scans the database for words that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit.

A short review about BLAST  Find HSPs  HSP (High scoring Segment Pair) is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal within a relative shot distance of one another.

A short review about BLAST  Generate gapped alignment  This means that two or more HSPs in BLAST with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result.

A short review about BLAST  In summary, the new gapped BLAST algorithm requires two non-overlapping hits of score at least T, within a distance A of one another, to invoke an ungapped extension of the second hit. If the HSP generated normalized score at least Sg bits, then a gapped extension is triggered.

Some definition, some background  Similarity  How similar it is between two sequences?  Usually mean that the probability of the same symbol appear in anywhere of two sequences.  Sensitivity  The probability to find a local alignment.  Specificity  In all local alignments, how many alignments are homologous.

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Define the Seed  Defining the seed: w w -> weight or number of positions to match  Blastn: 11 MegaBlast: 28 model -> relative position of letters for each w w model m m -> length of model “window”

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Seed Parameters: w = 11 w = 11 letters: letters: 0, 1 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1 m = 18 m = 18 1 – exact match required { model model 0 – no match required, any value Patternhunter most sensitive model Blastn seed is all “1”s

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Seed, Hit, Homology  What is a seed?  Seeds determine how an algorithm looks for hits  What is a hit?  Hits indicate a similarity that may indicate a homology

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 hit GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA---- || ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | || GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG Human-Mouse genome homology

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Example:  Consider the following two sequences: GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT  What’s the differences in finding the seed between Blast and PatternHunter?

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 BLAST uses “consecutive seeds”  In BLAST, we often use the consecutive model with weight 11. GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT → 11111111111 → … →… … → 11111111111 ←  However, it fails to find the alignment in the two sequence.

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Consecutive seeds  There’s also a dilemma for BLAST type of search.  Dilemma  Sensitivity – needs shorter seeds  too many random hits, slow computation  Speed – needs longer seeds  lose distant homologies

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 PatternHunter uses “non-consecutive seed”  In PatternHunter, we often use the spaced model with weight 11 and length 18. GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT 111010010100110111

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Consecutive vs. Nonconsecutive?  The non-consecutive seed is the primary difference and strength of Patternhunter Blastn: 1 1 1 1 1 1 1 1 1 1 1 PatternHunter: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1

Reference Ming Li, NHC2005 A trivial comparison between spaced and consecutive seed  Consider 111 and 1101.  To fail seed 111, we can use  110110110110…  66.66% similarity  But we can prove, seed 1101 will hit every region with 61% similarity for sufficient long region.

Reference Ming Li, NHC2005 Proof  Suppose there is a length 100 region which is not hit by 1101.  We can break the region into blocks of 1a0b. Besides the last block, the other blocks have the following few cases:  10bfor b>=1  110bfor b>=2  1110bfor b>=2  In each block, similarity <= 3/5.  The last block has at most 3 matches.  So, in total there are at most 61 matches in 100 positions. The similarity is <=61%.

Reference Ming Li, NHC2005 Formalize  Given i.i.d. sequence (homology region) with Pr(1)=p and Pr(0)=1-p for each bit: 1100111011101101011101101011111011101 111*1**1*1**11*111  Which seed is more likely to hit this region:  BLAST seed: 11111111111  Spaced seed: 111*1**1*1**11*111

Reference Ming Li, NHC2005 Expect Less, Get More  Lemma: The expected number of hits of a weight W length M seed model within a length L region with homology level p is (L-M+1)pW Proof. E(#hits) = ∑i=1 … L-M+1pW ■  Example: In a region of length 64 with p=0.7  Pr(BLAST seed hits)=0.3 E(# of hits by BLAST seed)=1.07  Pr(optimal spaced seed hits)=0.466, 50% more E(# of hits by spaced seed)=0.93, 14% less

Reference Ming Li, NHC2005 Why Is Spaced Seed Better? A wrong, but intuitive, proof: seed s, interval I, similarity p E(#hits) = Pr(s hits) E(#hits | s hits) Thus: Pr(s hits) = Lpw/ E(#hits | s hits) For optimized spaced seed, E(#hits | s hits) 111*1**1*1**11*111 Non overlap Prob 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 7 p7 …..  For spaced seed: the divisor is 1+p6+p6+p6+p7+ …  For BLAST seed: the divisor is bigger: 1+ p + p2+ p3 + …

Reference Ming Li, NHC2005 Simulated sensitivity curves

Reference Ming Li, NHC2005 Observations of spaced seeds  Seed models with different shapes can detect different homologies.  Two consequences:  Some models may detect more homologies than others  More sensitive homology search  PatternHunter I  Can use several seed models simultaneously to hit more homologies  Approaching 100% sensitive homology search  PatternHunter II

PatternHunter – Algorithm & Performance

Outline  Hit generation  Hit extension  Gapped extension  Performance

Hit generation  Index created for each position in the query sequence

Hit generation  Similar to MegaBlast: Hash tables  Encode ATCG into binary code 00, 01, 10, 11 respectively  Find each situations in one of the sequence and record the offsets in the hash table

Hit generation  An example: Now we want to find hits between sequences S and T

Spaced seed For sequence T: A 00 Scan T 01 A T A T G C A T 1 1 0 1 0 1 1 0 Model Seed ‧‧ ‧‧ C 10 G 11 A T T C A 0001011000 = 88 Weight=5  the value is between 0~2^10-1

After filling in the hash table… ‧‧‧ Position in T For each position in S: 0 1 2 3 ‧‧‧ 10 19 34 (NULL) 14 10 48 134 1.Calculate int value 2. Find hits in S by the lookup value ‧‧‧ 87 88 2 8 33 ‧‧‧

Hash tables: space required ‧‧‧ Position in T 0 10 (NULL) 14 48 10 19 34 4^w integers 1 2 |T| integers 3 ‧‧‧ 134 ‧‧‧ Total: 4^(w+1)+4|T| bytes 87 88 2 8 33 ‧‧‧

Cost a lot to make a hash table?  If the number of hits found for one index is large, the cost of computing index is relatively negligible.

Hit extension  HSP: Highscoring Segment Pair  Scan those hits with a window, and choose the highest-scored one.

Hit extension S The chosen hit T

Hit extension  Set the mid point of the chosen hit as the cut point, split the graph into 4

Hit extension S T

Hit extension  And then do the Smith-Waterman in 2 of the 4, until it reaches the dropoff score.

Hit extension S Smith-Waterman Cost=1/2*O(mn) Smith-Waterman T

Hit extension  If the resulting segment pair has a score below certain minimum, then ignore it.  Else we gain a HSP and do the next step-gap extension.

Hit extension  A question: when doing extension in 2 ways, how to synchronize the score?

Gapped Extension  To find the best way to extend an HSP to the left across gaps.  To extend an HSP we try all candidates from a diagonal-sorted set.  Penalty for gap open + gap extension + cropping

Gapped Extension Search front

From left to right Optimal Left Too Far Right Too Far Right Optimal Left

From left to right Optimal Left Too Far Right

Descriptions in the paper  We use a red-black tree for this.  Insert HSP when the optimal alignment to its left is found  Retired from the tree once newly generated HSPs are too far beyond its right endpoint to make use of it.

Thought 1  The first one will be inserted  Fast

Thought 1  May not find the best one End Start Better Worse

PatternHunter: faster and more sensitive homology search

PatternHunter: faster and more sensitive homology search

Presentation Transcript

Get Faster Website Traffic by Setting up Your WordPress Blog

Locality Sensitive Hashing and Large Scale Image Search

Topic-Sensitive SourceRank: Extending SourceRank for Performing Context-Sensitive Search over Deep-Web

Internet Search Strategy

DockoMatic : Automated Tool for Homology Modeling and Docking Studies

Odd homology of tangles and cobordisms

Homology Modeling

Homology modeling with SWISS-MODEL

Anatomical Homology

Homology and Homologs

Search for t g e g / mg

Simplicial Homology

PatternHunter II: Highly Sensitive and Fast Homology Search

Dark Matter Search with Direction sensitive Scintillator Ⅱ

6. Homology Modeling

Exercises

Exploring the need for a new Antibody Homology Modeling Protocol

CSF: Current DIAGNOSTIC in Cuba

Homology

Homology Search Tools

Waiting Time and Seed Selection for Homology Search

Homology Modeling