Heuristic alignment algorithms; Cost matrices

Heuristic alignment algorithms;Cost matrices 2.5 – 2.9 Thomas van Dijk

Content • Dynamic programming • Improve to linear space • Heuristics for database searches • BLAST, FASTA • Statistical significance • What the scores mean • Score parameters (PAM, BLOSUM)

Dynamic programming Improve to linear space

Dynamic programming • NW and SW run in O( nm ) time • And use O( nm ) space • For proteins, this is okay. • But DNA strings are huge! • Improve this to O( n+m ) space • while still using O( nm ) time.

Basic idea • Full DP requires O( nm ) space

Basic idea for linear space • Cells don’t directly depend on cells more than one column to the left. • So keep only two columns; forget the rest No back-pointers! How to find the alignment?

Divide and conquer • If we happen to know a cell on the optimal alignment…

Divide and conquer • We could then repeat this trick!

Divide and conquer • But how to find such a point?

Modified DP • Determine where the optimal alignment crossed a certain column: • at every cell, remember at which row it crossed the column.

Space analysis • Always only two columns at a time. • Clearly O( n+m ) space. • But what about the time?

What are we doing? • Using linear space DP at every step of a divide and conquer scheme.

Time analysis • We do more work now, but how much? • Look at case with two identical strings. • n2 n2

Time analysis • We do more work now, but how much? • Look at case with two identical strings. • n2 + ½ n2 ¼ n2 ¼ n2

Time analysis • We do more work now, but how much? • Look at case with two identical strings. • n2 + ½ n2 + ¼ n2 1/16 n2 1/16 n2 1/16 n2 1/16 n2

Time analysis • We do more work now, but how much? • Look at case with two identical strings. • n2 + ½ n2 + ¼ n2 + … < 2n2 et cetera…

Time analysis • We do more work now, but how much? • Look at case with two identical strings. • n2 + ½ n2 + ¼ n2 + … < 2n2 • Along the same lines,the algorithm in generalis still O( nm). • And actually only abouttwice as much work! et cetera…

Questions?

Heuristics for database search BLAST, FASTA

Searching a database • Database of strings • Query string • Select best matching string(s) from the database.

Algorithms we already know • Until now, the algorithms were exact and correct • (For the model!) • But for this purpose, too slow: • Say 100M strings with 1000 chars each • Say 10M cells per second • Leads to ~3 hours search time

Getting faster • Full DP is too expensive • Space issue: fixed. • With guaranteed same result  • Now the time issue

BLAST and FASTA • Heuristics to prevent doing full DP •  - Not guaranteed same result. •  - But tend to work well. • Idea: • First spend some time to analyze strings • Don’t calculate all DP cells; • Only some, based on analysis.

Basic idea • In a good alignment, it is likely that several short parts match exactly. • A C C A B B D B C D C B B C B A • A B B A D A C C B B C C D C D A • A C C A B B D B C D CB B C B A • A B B A D A C CB B C C D C D A • A C C A B B DB C D C B B C … • A B B A D A C C B B C CD C D A

k-tuples • Decompose strings into k-tuples with corresponding offset. • E.g. with k=3, “A C C A B B” becomes • 0: A C C • 1: C C A • 2: C A B • 3: A B B • Do this for the database and the query

Example strings • A C C A B B D B C D C B B C B A • A B B A D A C C B B C C D C D A • A C C A B B D B C D CB B C B A • A B B A D A C CB B C C D C D A • A C C A B B DB C D C B B C … • A B B A D A C C B B C CD C D A

0: ACC 1: CCA 2: CAB 3: ABB 4: BBD 5: BDB 6: DBC 7: BCD 8: CDC 9: DCB 10: CBB 11: BBC 12: BCB 13: CBA 0: ABB 1: BBA 2: BAD 3: ADA 4: DAC 5: ACC 6: CCB 7: CBB 8: BBC 9: BCC 10: CCD 11: CDC 12: DCD 13: CDA 3 -5 3 -3 3-tuple join

Offset in query Matches / hits • Lots on the same diagonal: might be a good alignment. Offset in db string

Offset in query Don’t do full DP • Do e.g. “banded DP” around diagonals with many matches Offset in db string

Some options • If no diagonal with multiple matches, don’t DP at all. • Don’t just allow exact ktup matches, but generate ‘neighborhood’ for query tuples. • …

Personal experience • “Database architecture”practical assignment • MonetDB: main memory DBMS • SWISS-PROT decomposedinto 3-tuples • 150k strings • 150M 3-tuples • Find database strings with more than one match on the same diagonal.

Personal experience 43 char query string • ~500k matches (in ~122k strings) • ~32k diagonals with more than one match • in ~25k strings With some implementation effort: ~1s (Kudos for Monet here!)

Personal experience • From 150k strings to 15k ‘probable’ strings in 1 second. • This discards 90% percent of database for almost no work. • And even gives extra information to speed up subsequent calculations. • … but might discard otherwise good matches.

Personal experience Tiny query 6 char query • 122 diagonals in 119 strings in no time at all An actual protein from the database: 250 char query • ~285k diagonals in ~99k strings in about 5 seconds

BLAST/FASTA conclusion •  - Not guaranteed same result. •  - But tend to work well.

Questions?

Statistical significance What do the scores mean? Score parameters (PAM, BLOSUM)

What do the scores mean? • We are calculating ‘optimal’ scores, but what do they mean? • Used log-odds to get an additive scoring scheme • Biologically meaningful versusJust the best alignment between random strings • Bayesian approach • Classical approach

Bayesian approach • Interested in: • Probability of a match, given the strings • P( M | x,y ) • Already know: • Probability of strings given the models, i.e.P( x,y | M ) and P( x,y | R ) • So … Bayes rule.

Bayesian approach • Bayes rule gives us: P( x,y | M ) P( M ) P( M | x,y ) = P( x,y ) • …rewrite… …rewrite… …rewrite…

Bayesian approach • P( M | x,y ) = σ( S’ )S’ = log( P(x,y|M)/P(x,y|R) ) + log( P(M)/P(R) ) Our score!

Take care! • Requires that substitution matrix contains probabilities. • Mind the prior probabilities:when matching against a database, subtract a log(N) term. • Alignment score is for the optimal alignment between the strings; ignores possible other good alignments.

Classical approach • Call the maximum score among N random strings: MN. • P( MN < x ) • means “Probability that the best match from a search of a large number N of unrelated sequences has score lower than x.” • is an Extreme Value Distribution • Consider x = our score. Then if this probability is very large: likely that our match was not just random.

Correcting for length • Additive score • So longer strings have higher scores!

Correcting for length • If match with any string should be equally likely • Correct for this bias by subtracting log(length) from the score • Because score is log-odds, this `is a division’ that normalizes score

Scoring parameters • How to get the values in cost matrices? • Just count frequencies? • PAM • BLOSUM

Just count frequencies? • Would be maximum likelihood estimate •  - Need lots of confirmed alignments •  - Different amounts of divergence

PAM • Grouped proteins by ‘family.’ • Phylogenetic tree • PAM1 matrix probabilities for 1 unit of time. • PAMn as (PAM1)n • PAM250 often used.

BLOSUM • Long-term PAM are inaccurate: • Inaccuracies in PAM1 multiply! • Actually differences between short-term and long-term changes. • Different BLOSUM matrices are specifically determined for different levels of divergence • Solves both problems.

Gap penalties • No proper time-dependent model • But seems reasonable that: • expected number of gaps linear in time • length of gaps constant distribution

Heuristic alignment algorithms; Cost matrices