
Heuristic alignment algorithms; Cost matrices



  1. Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk

  2. Content • Dynamic programming • Improve to linear space • Heuristics for database searches • BLAST, FASTA • Statistical significance • What the scores mean • Score parameters (PAM, BLOSUM)

  3. Dynamic programming Improve to linear space

  4. Dynamic programming • NW (Needleman–Wunsch) and SW (Smith–Waterman) run in O(nm) time • and use O(nm) space • For proteins, this is okay, but DNA strings are huge! • Improve this to O(n+m) space • while still using O(nm) time.

  5. Basic idea • Full DP requires O( nm ) space

  6. Basic idea for linear space • Cells don't directly depend on cells more than one column to the left. • So keep only two columns and forget the rest. • But then there are no back-pointers, so how do we find the alignment? (A sketch of the two-column score computation follows below.)
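The two-column idea in code: a minimal sketch in Python, with an illustrative scoring scheme (match +1, mismatch −1, gap −2) that is an assumption for this sketch, not taken from the slides.

    def score(a, b):
        return 1 if a == b else -1   # illustrative match/mismatch scores

    GAP = -2                         # illustrative linear gap penalty

    def nw_score_last_column(x, y):
        # Needleman-Wunsch score of x vs. y, keeping only two columns of
        # the DP table at a time: O(len(y)) space, O(len(x)*len(y)) time.
        prev = [j * GAP for j in range(len(y) + 1)]   # column for empty x prefix
        for i in range(1, len(x) + 1):
            cur = [i * GAP] + [0] * len(y)
            for j in range(1, len(y) + 1):
                cur[j] = max(prev[j - 1] + score(x[i - 1], y[j - 1]),
                             prev[j] + GAP,     # gap in y
                             cur[j - 1] + GAP)  # gap in x
            prev = cur
        return prev   # scores of aligning all of x against every prefix of y

The final column gives the optimal score, but since all earlier columns were discarded there is nothing to trace back through.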

  7. Divide and conquer • If we happen to know a cell on the optimal alignment…

  8. Divide and conquer • We could then repeat this trick!

  9. Divide and conquer • But how to find such a point?

  10. Modified DP • Determine where the optimal alignment crossed a certain column: • at every cell, remember at which row it crossed the column.

  11. Space analysis • Always only two columns at a time. • Clearly O( n+m ) space. • But what about the time?

  12. What are we doing? • Using linear space DP at every step of a divide and conquer scheme.
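Put together, this is Hirschberg's scheme. A minimal sketch, assuming the nw_score_last_column helper and the illustrative scores from the earlier sketch; the single-character base case is valid for those particular scores (in general a tiny full-DP base case is used instead):

    def hirschberg(x, y):
        # Optimal global alignment of x and y in O(len(x) + len(y)) space.
        if len(x) == 0:
            return '-' * len(y), y
        if len(y) == 0:
            return x, '-' * len(x)
        if len(x) == 1:
            # Place the single character of x at its best match in y.
            j = max(range(len(y)), key=lambda j: score(x[0], y[j]))
            return '-' * j + x + '-' * (len(y) - j - 1), y
        mid = len(x) // 2
        # Forward pass over the left half, backward pass over the right half:
        left = nw_score_last_column(x[:mid], y)
        right = nw_score_last_column(x[mid:][::-1], y[::-1])[::-1]
        # The row where the optimal alignment crosses column `mid`:
        k = max(range(len(y) + 1), key=lambda j: left[j] + right[j])
        xl, yl = hirschberg(x[:mid], y[:k])
        xr, yr = hirschberg(x[mid:], y[k:])
        return xl + xr, yl + yr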

  13. Time analysis • We do more work now, but how much? • Look at the case of two identical strings of length n. • First pass: the full table, n² cells.

  14. Time analysis • We do more work now, but how much? • Look at the case of two identical strings of length n. • n² + ½n² (two subproblems of ¼n² cells each)

  15. Time analysis • We do more work now, but how much? • Look at the case of two identical strings of length n. • n² + ½n² + ¼n² (four subproblems of 1/16 n² cells each)

  16. Time analysis • We do more work now, but how much? • Look at the case of two identical strings of length n. • n² + ½n² + ¼n² + … < 2n², et cetera.

  17. Time analysis • We do more work now, but how much? • Look at the case of two identical strings of length n. • n² + ½n² + ¼n² + … < 2n² • Along the same lines, the algorithm in general is still O(nm). • And it is actually only about twice as much work!

  18. Questions?

  19. Heuristics for database search BLAST, FASTA

  20. Searching a database • Database of strings • Query string • Select best matching string(s) from the database.

  21. Algorithms we already know • Until now, the algorithms were exact and correct • (for the model!) • But for this purpose they are too slow: • say 100M strings with 1000 chars each • say 10M cells per second • this leads to ~3 hours of search time (see the back-of-envelope check below).
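The ~3 hours is consistent with roughly 10^11 DP cells in total; exactly how the slide composes that number (in particular which query length is assumed) is not stated, so the breakdown here is only an assumption:

    total_cells = 1e11           # assumed: database characters x query length
    cells_per_second = 10e6      # 10M cells per second, as on the slide
    hours = total_cells / cells_per_second / 3600
    print(f"{hours:.1f} hours")  # ~2.8 hours, i.e. the slide's "~3 hours"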

  22. Getting faster • Full DP is too expensive. • The space issue: fixed, with the same result guaranteed. • Now for the time issue.

  23. BLAST and FASTA • Heuristics to avoid doing full DP • Downside: not guaranteed to give the same result. • Upside: they tend to work well. • Idea: • first spend some time analyzing the strings; • don't calculate all DP cells, only some, based on that analysis.

  24. Basic idea • In a good alignment, it is likely that several short parts match exactly. • A C C A B B D B C D C B B C B A • A B B A D A C C B B C C D C D A • (The slide highlights several short exact matches shared by the two strings, such as ABB, CBB and BBC.)

  25. k-tuples • Decompose strings into k-tuples with corresponding offset. • E.g. with k=3, “A C C A B B” becomes • 0: A C C • 1: C C A • 2: C A B • 3: A B B • Do this for the database and the query
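A minimal sketch of this decomposition in Python:

    def ktuples(s, k=3):
        # All k-tuples of s with their offsets; e.g. ktuples("ACCABB")
        # gives [(0, "ACC"), (1, "CCA"), (2, "CAB"), (3, "ABB")].
        return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]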

  26. Example strings • A C C A B B D B C D C B B C B A • A B B A D A C C B B C C D C D A

  27. 3-tuple join • String 1: 0: ACC, 1: CCA, 2: CAB, 3: ABB, 4: BBD, 5: BDB, 6: DBC, 7: BCD, 8: CDC, 9: DCB, 10: CBB, 11: BBC, 12: BCB, 13: CBA • String 2: 0: ABB, 1: BBA, 2: BAD, 3: ADA, 4: DAC, 5: ACC, 6: CCB, 7: CBB, 8: BBC, 9: BCC, 10: CCD, 11: CDC, 12: DCD, 13: CDA • Joining on equal tuples gives hits at the diagonals (offset in string 1 minus offset in string 2): ABB at +3, ACC at −5, CBB at +3, BBC at +3, CDC at −3.
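The join as code, reusing the ktuples helper from above; it reproduces the diagonals listed on the slide:

    from collections import defaultdict

    query = "ACCABBDBCDCBBCBA"   # string 1
    db    = "ABBADACCBBCCDCDA"   # string 2

    index = defaultdict(list)    # tuple value -> offsets in the db string
    for off, t in ktuples(db):
        index[t].append(off)

    diagonals = defaultdict(int) # diagonal -> number of hits
    for q_off, t in ktuples(query):
        for d_off in index[t]:
            diagonals[q_off - d_off] += 1

    print(dict(diagonals))       # {-5: 1, 3: 3, -3: 1}: three hits on diagonal +3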

  28. Matches / hits • Lots of hits on the same diagonal: might be a good alignment. • (Figure: hits plotted by offset in query against offset in db string.)

  29. Don't do full DP • Instead do e.g. "banded DP" around diagonals with many matches (see the sketch below). • (Figure: a band around such a diagonal in the offset-in-query vs. offset-in-db plot.)
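A sketch of banded (local) DP, reusing score and GAP from the linear-space sketch earlier; the band width w is an arbitrary illustrative choice:

    from collections import defaultdict

    def banded_sw(x, y, d, w=4):
        # Smith-Waterman-style local score, restricted to cells (i, j)
        # within w of the diagonal d, i.e. |i - j - d| <= w.
        H = defaultdict(int)   # sparse table; cells outside the band stay 0
        best = 0
        for i in range(1, len(x) + 1):
            lo = max(1, i - d - w)
            hi = min(len(y), i - d + w)
            for j in range(lo, hi + 1):
                h = max(0,
                        H[i - 1, j - 1] + score(x[i - 1], y[j - 1]),
                        H[i - 1, j] + GAP,
                        H[i, j - 1] + GAP)
                H[i, j] = h
                best = max(best, h)
        return best

For the example above, banded_sw(query, db, 3) explores only the band around diagonal +3 instead of the full 16 × 16 table.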

  30. Some options • If there is no diagonal with multiple matches, don't do DP at all. • Don't just allow exact ktup matches, but generate a "neighborhood" for the query tuples. • …

  31. Personal experience • "Database architecture" practical assignment • MonetDB: a main-memory DBMS • SWISS-PROT decomposed into 3-tuples • 150k strings • 150M 3-tuples • Find database strings with more than one match on the same diagonal.

  32. Personal experience • 43 char query string • ~500k matches (in ~122k strings) • ~32k diagonals with more than one match • in ~25k strings • With some implementation effort: ~1s (kudos to Monet here!)

  33. Personal experience • From 150k strings to 15k "probable" strings in 1 second. • This discards 90% of the database for almost no work. • And it even gives extra information to speed up subsequent calculations. • … but it might discard otherwise good matches.

  34. Personal experience • Tiny query (6 chars): 122 diagonals in 119 strings, in no time at all. • An actual protein from the database (250 char query): ~285k diagonals in ~99k strings, in about 5 seconds.

  35. BLAST/FASTA conclusion • Downside: not guaranteed to give the same result. • Upside: they tend to work well.

  36. Questions?

  37. Statistical significance What do the scores mean? Score parameters (PAM, BLOSUM)

  38. What do the scores mean? • We are calculating "optimal" scores, but what do they mean? • We used log-odds to get an additive scoring scheme. • Biologically meaningful match versus just the best alignment between random strings • Bayesian approach • Classical approach
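The log-odds construction behind the additive scheme: s(a, b) = log( p_ab / (q_a q_b) ), where p_ab is the probability of a aligned to b under the match model M and q_a, q_b are background frequencies under the random model R. A tiny numeric illustration (the frequencies here are made up):

    from math import log2

    p_ab = 0.04           # P(a aligned to b) under the match model M
    q_a, q_b = 0.1, 0.1   # background frequencies under the random model R
    s = log2(p_ab / (q_a * q_b))
    print(s)              # 2.0 bits: products of odds become sums of scores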

  39. Bayesian approach • Interested in: • the probability of a match, given the strings • P(M | x,y) • Already known: • the probability of the strings given the models, i.e. P(x,y | M) and P(x,y | R) • So: Bayes' rule.

  40. Bayesian approach • Bayes' rule gives us: • P(M | x,y) = P(x,y | M) P(M) / P(x,y) • …rewrite…

  41. Bayesian approach • P(M | x,y) = σ(S′), where • S′ = log( P(x,y|M) / P(x,y|R) ) + log( P(M) / P(R) ) • The first term is our alignment score!
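Here σ is the logistic function, so the posterior follows directly from the shifted score. A sketch, assuming bit scores (base-2 logs) and a made-up prior:

    def sigma(z):
        # Logistic function; 2 ** (-z) because the scores are in bits.
        return 1 / (1 + 2 ** (-z))

    S = 12.0                 # alignment score: log2 P(x,y|M) / P(x,y|R)
    prior = -10.0            # log2 P(M) / P(R); hypothetical value
    print(sigma(S + prior))  # P(M | x,y) = 0.8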

  42. Take care! • This requires that the substitution matrix contains probabilities. • Mind the prior probabilities: when matching against a database of N strings, subtract a log(N) term. • The alignment score is for the optimal alignment between the strings; it ignores possible other good alignments.

  43. Classical approach • Call the maximum score among N random strings M_N. • P(M_N < x) • means "the probability that the best match from a search of a large number N of unrelated sequences has score lower than x" • and follows an Extreme Value Distribution. • Take x = our score. If this probability is very large, it is likely that our match was not just random.
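The slide only names the Extreme Value Distribution; as one concrete and commonly used instance (Karlin–Altschul statistics, as in BLAST), the tail probability can be sketched as follows, where all parameter values are hypothetical:

    from math import exp

    def p_value(x, K=0.1, lam=0.3, mn=1e6):
        # P(M_N >= x) under a Gumbel (extreme value) distribution;
        # mn ~ product of query length and database size.
        e_value = K * mn * exp(-lam * x)   # expected number of random hits >= x
        return 1 - exp(-e_value)

    print(p_value(60.0))   # ~0.0015: a score of 60 is unlikely by chance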

  44. Correcting for length • Additive score • So longer strings have higher scores!

  45. Correcting for length • If a match with any string should be equally likely, • correct for this bias by subtracting log(length) from the score. • Because the score is log-odds, this "is a division" that normalizes the score.
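As a one-line sketch (bit scores assumed):

    from math import log2

    def length_corrected(score_bits, length):
        # Longer strings offer ~length more places for a random match,
        # which multiplies the odds; subtracting log2(length) divides it out.
        return score_bits - log2(length)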

  46. Scoring parameters • How to get the values in cost matrices? • Just count frequencies? • PAM • BLOSUM

  47. Just count frequencies? • That would be the maximum-likelihood estimate. • But it needs lots of confirmed alignments, • and those show different amounts of divergence.

  48. PAM • Proteins grouped by "family" • Phylogenetic tree • The PAM1 matrix gives substitution probabilities for 1 unit of evolutionary time. • PAMn = (PAM1)^n (see the sketch below) • PAM250 is often used.
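The matrix-power construction in code. The 3-letter alphabet, the PAM1 entries and the background frequencies are all made-up illustrations; real PAM matrices are 20 × 20 and estimated from data:

    import numpy as np

    # Hypothetical PAM1: substitution probabilities for 1 unit of time
    # (rows sum to 1).
    P1 = np.array([[0.990, 0.005, 0.005],
                   [0.010, 0.980, 0.010],
                   [0.005, 0.005, 0.990]])
    q = np.array([0.4, 0.3, 0.3])           # background frequencies

    P250 = np.linalg.matrix_power(P1, 250)  # PAM250 = (PAM1)^250
    S = np.log2(P250 / q)                   # log-odds score matrix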

  49. BLOSUM • Long-term PAM matrices are inaccurate: • inaccuracies in PAM1 multiply! • And there are real differences between short-term and long-term changes. • Different BLOSUM matrices are determined directly from data at different levels of divergence. • This solves both problems.

  50. Gap penalties • There is no proper time-dependent model. • But it seems reasonable that: • the expected number of gaps is linear in time, • and the distribution of gap lengths is constant over time. (A common concrete scheme is sketched below.)
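One standard scheme consistent with these assumptions is the affine gap penalty (the slide does not name it, and the values of d and e here are arbitrary):

    def gap_penalty(k, d=11, e=1):
        # Affine gap cost: gamma(k) = -d - (k - 1) * e for a gap of length k.
        # A fixed opening cost plus a per-position extension cost corresponds
        # to a geometric gap-length distribution, independent of divergence time.
        return -d - (k - 1) * e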
