Modern Homology Search

Modern Homology Search Ming Li Canada Research Chair in Bioinformatics Computer Science. U. Waterloo

I want to show you how one simple theoretical idea can make big impact to a practical field.

Outline • Motivation, market, biology, homology search • Review dynamic programming, BLAST heuristics • Optimized spaced seeds • The idea • How to compute them • Data models, coding regions, HMM, vector seeds • Multiple optimized spaced seeds • PatternHunter program • Mathematical theory of spaced seeds • Why they are better • Complexity of finding optimal spaced seeds • Applications beyond bioinformatics & Open problems

1. Introduction • Biology • Motivation and market • Definition of homology search problem

Protein DNA (Genotype) Biology Phenotype

Human: 3 billion bases, 30k genes. E. coli: 5 million bases, 4k genes T A A T C G cDNA T A reverse transcription A T translation transcription G C Protein mRNA C G (20 amino acids) (A,C,G,U) G C Codon: three nucleotides encode an amino acid. 64 codons 20 amino acids, some w/more codes A T C G T A A T

A gigantic gold mine • The trend of genetic data growth • 400 Eukaryote genome projects underway • GenBank doubles every 18 months • Comparative genomics  all-against-all search • Software must be scalable to large datasets. 30 billion in year 2005

Homology search • Given two DNA sequences, find all local similar regions, using “edit distance” (match=1, mismatch=-1, gapopen=-5, gapext=-1). • Example. Input: • E. coli genome: 5 million base pairs • H. influenza genome: 1.8 million base pairs Output: all local alignments (PH demo) java –jar phn.jar –i ecoli.fna –j hinf.fna –o out.txt –b

Homology search vs Google search • Internet search • Size limit: 5 billion people x homepage size • Supercomputing power used: ½ million CPU-hours/day • Query frequency: Google --- 112 million/day • Query type: exact keyword search --- easy to do • Homology search • Size limit: 5 billion people x 3 billion basepairs + millions of species x billion bases • 10% (?) of world’s supercomputing power • Query frequency: NCBI BLAST -- 150,000/day, 15% increase/month • Query type: approximate search

Bioinformatics Companies living on Blast • Paracel (Celera) • TimeLogic (recently liquidated) • TurboGenomics (TurboWorx) • NSF, NIH, pharmaceuticals proudly support many supercomputing centers mainly for homology search • Hardware high cost & become obsolete in 2 years.

History • Dynamic programming (1970-1980) • Full sensitivity, but slow. • Human vs mouse genomes: 104 CPU-years • BLAST, FASTA heuristics (1980-1990) • Low sensitivity, still not fast enough. • Human vs mouse genomes: 19 CPU-years • BLAST paper was referenced 100000 times • Modern technology: full sensitivity & greater speed!

2. Old Technology • Dynamic programming – full sensitivity homology search • BLAST heuristics --- trading sensitivity with time.

Longest Common Subsequence (LCS). V=v1v2 … vn W=w1w2 … wm s(i,j) = length of LCS of V[1..i] and W[1..j] Dynamic Programming: s(i-1,j) s(i,j)=max s(i,j-1) s(i-1,j-1)+1,vi=wj Sequence Alignment (Needleman-Wunsch, 1970) s(i,j)=max s(i-1,j) + d(vi,-) s(i,j-1)+d(-,wj) s(i-1,j-1)+d(vi,wj) 0 (Smith-Waterman, 1981) * No affine gap penalties, where d, for proteins, is either PAM or BLOSUM matrix. For DNA, it can be: match 1, mismatch -1, gap -3 (In fact: open gap -5, extension -1.) d(-,x)=d(x,-)= -a, d(x,y)=-u. When a=0, u=infinity, it is LCS. Dynamic Programming:

Misc. Concerns • Local sequence alignment, add s[i,j]=0. • Gap penalties. For good reasons, we charge first gap cost a, and then subsequent continuous insertions b<a. • Space efficient sequence alignment. Hirschberg alg. in O(n2) time, O(n) space. • Multiple alignment of k sequences: O(nk)

Blast • Popular software, using heuristics. By Altschul, Gish, Miller, Myers, Lipman, 1990. • E(xpected)-value: e= dmne -lS, here S is score, m is database length and n is query length. • Meaning: e is number of hits one can “expect” to see just by chance when searching a database of size m. • Basic Strategy: For all 3 a.a. (and closely related) triples, remember their locations in database. Given a query sequence S. For all triples in S, find their location in database, then extend as long as e-value significant. Similar strategy for DNA (7-11 nucleotides). Too slow in genome level.

Blast Algorithm • Find seeded (11 bases) matches • Extent to HSP’s (High Scoring Pairs) • Gapped Extension, dynamic programming • Report all local alignments

BLAST Algorithm Example: • Find seeded matches of 11 base pairs • Extend each match to right and left, until the scores drop too much, to form an alignment • Report all local alignments Example: AGCGATGTCACGCGCCCGTATTTCCGTA TCGGATCTCACGCGCCCGGCTTACCGTG 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 0 G | | | | | | | | | | | | | | x | | | | | |

BLAST Dilemma: • If you want to speed up, have to use a longer seed. However, we now face a dilemma: • increasing seed size speeds up, but loses sensitivity; • decreasing seed size gains sensitivity, but loses speed. • How do we increase sensitivity & speed simultaneously? For 20 years, many tried: suffix tree, better programming ..

3. Optimized Spaced Seeds • Why are they better • How to compute them • Data models, coding regions, HMM • PatternHunter • Multiple spaced seeds • Vector seeds

New thinking • Three lines of existing approaches: • Smith-Waterman exhaustive dynamic programming. • Blast family: find seed matches (11 for Blastn, 28 for MegaBlast), then extend. Dilemma: increasing seed size speeds up but loses sensitivity; descreasing seed size gains sensitivity but loses speed. • Suffix tree: MUMmer, Quasar, REPuter. Only good for precise matches, highly similar sequences. • Blast approach is the only way to deal with real and large problems, but we must to solve Blast dilemma: We need a way to improve sensitivity and speed simultaneously. • Goals: Improve (a) Blast, (b) Smith-Waterman.

Optimal Spaced Seed(Ma, Tromp, Li: Bioinformatics, 18:3, 2002, 440-445) • Spaced Seed: nonconsecutive matches and optimized match positions. • Represent BLAST seed by 11111111111 • Spaced seed: 111*1**1*1**11*111 • 1 means a required match • * means “don’t care” position • This seemingly simple change makes a huge difference: significantly increases hit prob. to homologous region while reducing bad hits.

Formalization • Given i.i.d. sequence (homology region) with Pr(1)=p and Pr(0)=1-p for each bit: 1100111011101101011101101011111011101 • Which seed is more likely to hit this region: • BLAST seed: 11111111111 • Spaced seed: 111*1**1*1**11*111 111*1**1*1**11*111

Sensitivity: PH weight 11 seed vs Blast 11 & 10

PH 2-hit sensitivity vs Blastn 11, 12 1-hit

Expect Less, Get More • Lemma: The expected number of hits of a weight W length M seed model within a length L region with similarity p is (L-M+1)pW Proof: The expected number of hits is the sum, over the L-M+1 possible positions of fitting the seed within the region, of the probability of W specific matches, the latter being pW. ■ • Example: In a region of length 64 with 0.7 similarity, PH has probability of 0.466 to hit vs Blast 0.3, 50% increase. On the other hand, by above lemma, Blast expects 1.07 hits, while PH 0.93, 14% less.

Why Is Spaced Seed Better? A wrong, but intuitive, proof: seed s, interval I, similarity p E(#hits) = Pr(s hits) E(#hits | s hits) Thus: Pr(s hits) = Lpw / E(#hits | s hits) For optimized spaced seed, E(#hits | s hits) 111*1**1*1**11*111 Non overlap Prob 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 7 p7 ….. • For spaced seed: the divisor is 1+p6+p6+p6+p7+ … • For BLAST seed: the divisor is bigger: 1+ p + p2 + p3 + …

Finding Optimal Spaced Seeds(Keich, Li, Ma, Tromp, Discrete Appl. Math.138(2004), 253-263 ) Let f(i,b) be the probability that seed s hits the length i prefix of R that ends with b. Thus, if s matches b, then f(i,b) = 1, otherwise we have the recursive relationship: f(i,b)= (1-p)f(i-1,0b') + pf(i-1,1b') where b' is b deleting the last bit. Then the probability of s hitting R is, Σ|b|=M Prob(b) f(|R|,b). Choose s with the highest probability.

Improvements • Brejova-Brown-Vinar (HMM) and Buhler-Keich-Sun (Markov): The input sequence can be modeled by a (hidden) Markov process, instead of iid. • Multiple seeds • Brejova-Brown-Vinar: Vector seeds • Csuros: Variable length seeds – e.g. shorter seeds for rare query words.

PatternHunter(Ma, Tromp, Li: Bioinformatics, 18:3, 2002, 440-445) • PH used optimal spaced seeds, novel usage of data structures: red-black tree, queues, stacks, hashtables, new gapped alignment algorithm. • Written in Java. Yet many times faster than BLAST. • Used in Mouse/RAT Genome Consortia (Nature, Dec. 5, 2002), as well as in hundreds of institutions and industry.

Comparison with BLAST • On Pentium III 700MH, 1GB BLAST PatternHunter E.coli vs H.inf 716s 14s/68M Arabidopsis 2 vs 4 -- 498s/280M Human 21 vs 22 -- 5250s/417M Human(3G) vs Mouse(x3=9G)* 19 years 20 days • All with filter off and identical parameters • 16M reads of Mouse genome against Human genome for MIT Whitehead. Best BLAST program takes 19 years at the same sensitivity

Quality Comparison:x-axis: alignment ranky-axis: alignment scoreboth axes in logarithmic scale A. thaliana chr 2 vs 4 E. Coli vs H. influenza

Genome Alignment by PatternHunter (4 seconds)

Prior Literature • Over the years, it turns out, that the mathematicians know about certain patterns appear more likely than others: for example, in a random English text, ABC is more likely to appear than AAA, in their study of renewal theory. • Random or multiple spaced q-grams were used in the following work: • FLASH by Califano & Rigoutsos • Multiple filtration by Pevzner & Waterman • LSH of Buhler • Praparata et al

Coding Region & HMM The Hidden Markov Model is a finite set of states, each of which is associated with a probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated, according to the associated probability distribution. It is only the outcome, not the state, which is visible to an external observer and therefore states are ``hidden'' from the outside; hence the name Hidden Markov Model. An HMM has the following components satisfying usual probability laws: • The number of states of the model, N. • The number of observation symbols in the alphabet, M. • A set of state transition probabilities, depending on current state. • A probability distribution of the observable symbol at each of the states.

Modeling coding region with HMM • PatternHunter original assumption: I is a sequence of N independent Bernoulli variables X0, … XN-1 with P(Xi)=p. • Coding region third position usually is less conserved. Skipping it should be a good idea (BLAT 110110110). With spaced seeds, we can model it using HMM. • Brejova, Brown, Vinar: M(3) HMM: I is a sequence of N independent Bernoulli random variables X0,X1, … XN-1 where Pr(Xi=1)=pi mod 3. In particular: p0 p1 p2 human/mouse 0.82 0.87 0.61 human/fruit fly 0.67 0.77 0.40 • DP algorithm naturally extends to M(3), • Opt seed (weight 8) for coding region: 11011011000011

Picture of M(3) Extending M(3), Brejova-Brown-Vinar proposed M(8), above, to model dependencies among positions within codons: Σpijk = 1, each codon has conservation pattern ijk with probability pijk. Figures from Brejova-Brown-Vinar

Sensitivity of all seeds Under 4 models From Brejova-Brown-Vinar

Running PHDownload at: www.BioinformaticsSolutions.com java –jar ph.jar –i query.fna –j subject.fna –o out.txt –b –multi 4 -Xmx512m --- for large files -j missing: query.fna self-comparison -db: multiple sequence input, 0,1,2,3 (no, query, subject, both) -W: seed weight -G: open gap penalty (default 5) -E: gap extension (default 1) -q: mismatch penalty (default 1) -r: reward for match (default 1) -model: specify model in binary -H: hits before extension -P: show progress -b: Blast output format -multi: number of seeds to be used (default 1; max 16)

Multiple Spaced Seeds: ApproachingSmith-Waterman Sensitivity(Li, Ma, Kisman, Tromp, PatternHunter II, J. Bioinfo Comput. Biol. 2004) • The biggest problem for Blast was low sensitivity (and low speed). Massive parallel machines are built to do Smith-Waterman exhaustive dynamic programming. • Spaced seeds give PH a unique opportunity of using several optimal seeds to achieve optimal sensitivity, this was not possible for Blast technology. • Economy --- 2 seeds (½ time slow down) achieve sensitivity of 1 seed of shorter length (4 times speed). • With multiple optimal seeds. PH II approaches Smith-Waterman sensitivity, and 3000 times faster. • Finding optimal seeds, even one, is NP-hard. • Experiment: 29715 mouse EST, 4407 human EST. Sensitivity and speed comparison next two slides.

Finding Multiple Seeds (PH II) • Computing the hit probability of given k seeds on a uniformly distributed random region is NP-hard. • Finding a set of k optimal seeds to maximize the hit probability cannot be approximated within 1-1/e unless NP=P • Given k seeds, algorithm DP can be adapted to compute their sensitivity (probability one of the seeds has a hit). • Greedy Strategy • Compute one optimal seed a. S={a} • Given S, find b maximizing hit probability of S U {b}. • Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3 positions) • We took 12 CPU-days on 3GHz PC to compute 16 weight 11 seeds. General seeds (first 4): 111010010100110111, 111100110010100001011, 11010000110001010111, 1110111010001111.

Sensitivity Comparison with Smith-Waterman (at 100%) The thick dashed curve is the sensitivity of Blastn, seed weight 11. From low to high, the solid curves are the sensitivity of PH II using 1, 2, 4, 8 weight 11 coding region seeds, and the thin dashed curves are the sensitivity 1, 2, 4, 8 weight 11 general purpose seeds, respectively

Speed Comparison with Smith-Waterman • Smith-Waterman (SSearch): 20 CPU-days. • PatternHunter II with 4 seeds: 475 CPU-seconds. 3638 times faster than Smith-Waterman dynamic programming at the same sensitivity.

Vector Seeds Definition. A vector seed is an ordered pair Q=(w,T), where w is a weight vector (w1, … wM) and T is a threshold value. An alignment sequence x1,…,xn contains a hit of the seed Q at position k if i=1..M(wi∙xk+i-1) ≥ T The number of nonzero positions in w is called the support of the seed. Comments. Other seeds are special cases of vector seeds.

Experiment (Brejova PhD. thesis, 2005) • Predicted sensitivity and false positive rate (prob. hit 0.25-homology region of seed length) comparison of various types of seeds in a Bernoulli region of 0.7 match probability, length 64. ----------------------------------------------------------------------------------------- Name Weight vector T Support Sensitivity False positive rate ================================================== BLAST 1111111111 10 10 41% 9.5x10-7 PH-10 1110101100011011 10 10 60% 9.5x10-7 BLAST-13-15 111111111111111 13 15 73% 9.2x10-7 VS-13-15 111111011011101111 13 15 85% 9.2x10-7 VS-12-13 111101100110101111 12 13 74% 6.0x10-7 VS-11-12 111011001011010111 11 12 84% 2.2x10-6

Variable weight spaced seeds • M. Csuros proposal: use variable weighted spaced seeds depending on the composition of the strings to be hashed. Rarer strings are expected to produce fewer false positives, use seeds with smaller weight --- this increases the sensitivity.

Open Question • Can we use different patterns (of same weight) at different positions. So this can be called variable position-spaced seeds. At the same weight, we do not increase expected number of false positive hits. Can we increase sensitivity? Find these patterns? How much increase will there be? Practical enough? • If the above is not the case, then prove: single optimal seed is always better than variable positions.

Old field, new trend • Research trend • Dozens of papers on spaced seeds have appeared since the original PH paper, in 3 years. • Many more have used PH in their work. • Most modern alignment programs (including BLAST) have now adopted spaced seeds • Spaced seeds are serving thousands of users/day • PatternHunter direct users • Pharmaceutical/biotech firms. • Mouse Genome Consortium, Nature, Dec. 5, 2002. • Hundreds of academic institutions.

Modern Homology Search