A Table-Driven, Full-Sensitivity Similarity Search Algorithm

This presentation, given by Wang, Jia-Nan and Huang, Yu-Feng, outlines the table-driven similarity search algorithm of Gene Myers and Richard Durbin. It reviews local alignment methods such as Smith-Waterman and BLAST for protein sequence similarity, then describes how the approach gains efficiency by computing only about 4% of the dynamic programming matrix while keeping full sensitivity.


Presentation Transcript


  1. A Table-Driven, Full-Sensitivity Similarity Search Algorithm • Gene Myers and Richard Durbin • Presented by Wang, Jia-Nan and Huang, Yu-Feng

  2. Outline • Introduction • Background • Preliminary • Method • Experiment

  3. Introduction • Given a query and a database, perform local alignment • Smith-Waterman: guaranteed to find all local alignments, but expensive • BLAST • FASTA

  4. Improvement • Hardware: more investment in computers and CPUs • Software: Phil Green’s SWAT appeals to sparsity and some machine-level coding tricks • 60% of the dynamic programming matrix has value 0 • SWAT avoids computing most of these unproductive entries

  5. Focus on improving protein similarity searches • This approach examines and computes only 4% of the underlying dynamic programming matrix

  6. Recall • Sequence alignment • Local sequence alignment • Global sequence alignment • Goal – matching path with highest score • Table-based computation and dynamic programming

  7. Dynamic Programming • Three basic components • Recurrence relation • Tabular computation • Traceback

  8. Smith-Waterman Method • Dynamic programming algorithm • Finds the most similar subsequences of two sequences • Problem • Lots of computation → the amount of work becomes enormous (a "googol") • Programmers → driven crazy, and excited • Why? → How to accelerate

  9. Background • Scoring System • Simple scoring scheme • Affine gap penalty scoring scheme • PAM120 (PAMn) • BLOSUM62 (BLOSUMn)

  10. Simple Scoring Scheme • Match (e.g. +8) • Mismatch (e.g. -5) • Gap constant penalty (e.g. -20)
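The three dynamic programming components recalled above (recurrence relation, tabular computation, traceback) can be sketched for Smith-Waterman under the simple scoring scheme. This is a minimal illustration, not the presenters' code: the function name `smith_waterman` and its parameter names are assumptions, and the default scores are the example values from this slide.

```python
def smith_waterman(query, target, match=8, mismatch=-5, gap=-20):
    """Return (best_score, aligned_query, aligned_target) for one best local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    # Tabular computation: H[i][j] is the best score of a local alignment
    # ending at query[i-1] and target[j-1] (0 means "start fresh here").
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Recurrence relation: substitution, gap in target, gap in query, or restart.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback: walk back from the best cell until a zero entry is reached.
    i, j = best_pos
    out_q, out_t = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_q.append(query[i - 1]); out_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_q.append(query[i - 1]); out_t.append("-")
            i -= 1
        else:
            out_q.append("-"); out_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(out_q)), "".join(reversed(out_t))


score, aligned_q, aligned_t = smith_waterman("XXACGTYY", "ZZACGTWW")
```

Every cell of `H` is filled here, which is exactly the cost that the table-driven scheme discussed later tries to avoid.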

  11. Affine Gap Penalty Scoring Scheme • Match (e.g. +8) • Mismatch (e.g. -5) • Gap symbol (e.g. -5) • Gap open penalty (e.g. -10)

  12. PAM • PAM: Percent Accepted Mutation • Dayhoff et al. (1978) • PAM unit • Evolutionary time corresponding to an average of 1 mutation per 100 residues → 1% accepted • PAMn • Relates to mutation probabilities in an evolutionary interval of n PAM units • Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf
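In the Dayhoff model, the mutation probabilities for n PAM units are obtained by multiplying the 1-PAM probability matrix by itself n times. The sketch below shows this on a hypothetical two-letter alphabet; `PAM1_TOY` is invented for illustration and is not real amino-acid data, and the helper names are assumptions. The published PAM120 scoring matrix is a log-odds matrix derived from such probabilities, a step not shown here.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical toy alphabet of two letters: each letter has a 1% chance of
# being replaced by the other over one PAM unit (illustrative values only).
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

# Mutation probabilities after 120 PAM units.
pam120_toy = mat_pow(PAM1_TOY, 120)
```

Each row of `pam120_toy` still sums to 1, and the diagonal entry is the probability that a letter is unchanged after 120 PAM units, allowing for repeated and back mutations.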

  13. PAM120 Source: http://eta.embl-heidelberg.de:8000/misc/mat/pam120.html

  14. BLOSUM62 • BLOSUM: BLOcks SUbstitution Matrix • Steven Henikoff and Jorja G. Henikoff (1992) • Paper: Amino acid substitution matrices from protein blocks [PubMed] • BLOSUMn • Relates to substitution probabilities observed between pairs of related proteins, with sequences above n% identity clustered together • Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf

  15. BLOSUM62

  16. Preliminaries • Σ: the alphabet over which sequences are composed • |Σ| × |Σ| substitution matrix S giving the score of aligning each pair of letters • Uniform gap penalty g > 0 • Query = q1q2. . .qP of P letters • Target = t1t2. . .tN of N letters • Threshold T > 0

  17. Score Table → Edit Graph • Picture source: http://searchlauncher.bcm.tmc.edu/help/Pictures/S-Wexample.gif

  18. Problem • Find a high-scoring local alignment between Query and Target whose path score is ≥ T • Edit graph (Figure 1) • Limit our attention to prefix-positive paths • If there is a path of score T or greater in the edit graph, then there is a prefix-positive path of score T or greater
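The last bullet can be checked with a short sketch. A path is modelled here as a plain list of edge scores, which is an assumption made for illustration, and both function names are hypothetical. Cutting the path just after the last point where its running score is lowest leaves a suffix whose every prefix is positive and whose total is at least the original total.

```python
from itertools import accumulate


def is_prefix_positive(edge_scores):
    """True if every non-empty prefix of the path has a positive score."""
    return all(s > 0 for s in accumulate(edge_scores))


def trim_to_prefix_positive(edge_scores):
    """Drop the leading edges up to the last minimum of the running score.

    If the whole path scores T > 0, the returned suffix is prefix-positive
    and scores at least as much as the whole path.
    """
    sums = [0] + list(accumulate(edge_scores))   # running score, starting at 0
    lowest = min(sums)
    # Last position where the running score hits its minimum.
    cut = max(i for i, s in enumerate(sums) if s == lowest)
    return edge_scores[cut:]


# Example path: match, mismatch, mismatch, match, match, mismatch, match
path = [8, -5, -5, 8, 8, -5, 8]
trimmed = trim_to_prefix_positive(path)
```

Here `path` scores 17 but dips to -2 after three edges, so it is not prefix-positive; `trimmed` starts after that dip.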

  19. Definition • A set P of index-value pairs { (i,v): i is [0,P]

  20. The start and extension tables • Consider a vertex x in row j of the edit graph of Query vs. Target

  21. Start Trimming • Limiting the dynamic programming to the startable vertices requires a table Start(w), where w ranges over the |Σ|^ks strings of length ks

  22. Start Trimming • Worst case • Let α be the expected percentage of vertices that are seeds

  23. Extension Trimming • A table that eliminates vertices that are not extendable • (i,j) is an extendable vertex iff C(i,j) > Extend(i, Target[j+1…j+ke])

  24. Extension Trimming

  25. A Table-Driven Scheme for DP • Goal: to restrict the SW computation to productive vertices • Jump table: captures the effect of Advance and Delete over kJ > 0 rows • Space → unmanageably large • But only record those for which

  26. Jump table • Start table • Space-saving version for Jump and Start tables

  27. Check for paths scoring T or more

  28. Recall – Affine Gap Penalty • Score • Match • Mismatch • Gap symbol - gsp • Gap open penalty - gop • Affine cost of gap of length k • g + kh, g = gop, h = gsp
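The affine cost g + kh can be illustrated with a one-line function. The name `affine_gap_cost` is an assumption, and the defaults g = 8, h = 4 are taken from the 8+4n gap cost listed in the experiment slides.

```python
def affine_gap_cost(k, g=8, h=4):
    """Cost of a single gap of length k: open penalty g plus h per gap symbol."""
    return g + k * h


one_long_gap = affine_gap_cost(3)          # one gap of length 3
three_short_gaps = 3 * affine_gap_cost(1)  # three separate gaps of length 1
```

Under these values one gap of length 3 costs less than three separate gaps of length 1, which is the point of charging the open penalty only once per gap.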

  29. Diagram of Affine Gap Penalty • Three vertex types: D, I, C • Edge weights: δ(ai,bj) for a substitution, -g-h to open a gap (into D or I), -h to extend a gap • Source: kmchao’s lecture note

  30. Recurrence system - Gotoh
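The slide's equations did not survive in this transcript. As a stand-in, here is a sketch of the standard Gotoh recurrences for local alignment with affine gaps, using the C, D, I vertex types and the -g-h / -h edge weights from the previous slide. The function name, the match/mismatch substitution scores, and the defaults g = 10, h = 5 are assumptions borrowed from the example values on the earlier scoring slides, not necessarily the exact system the presenters showed.

```python
NEG_INF = float("-inf")


def gotoh_local_score(a, b, match=8, mismatch=-5, g=10, h=5):
    """Best local alignment score with affine gap cost g + k*h (Gotoh-style DP)."""
    n, m = len(a), len(b)
    # C: best score of an alignment ending at (i, j), whatever the last edge is.
    # D: best score ending with a gap that consumes a[i-1] (vertical edge).
    # I: best score ending with a gap that consumes b[j-1] (horizontal edge).
    C = [[0] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend an existing gap (-h) or open a new one (-g-h).
            D[i][j] = max(D[i - 1][j] - h, C[i - 1][j] - g - h)
            I[i][j] = max(I[i][j - 1] - h, C[i][j - 1] - g - h)
            delta = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below 0.
            C[i][j] = max(0, C[i - 1][j - 1] + delta, D[i][j], I[i][j])
            best = max(best, C[i][j])
    return best


# Eight matches separated by one gap of length 2: 8*8 - (10 + 2*5) = 44
score_with_gap = gotoh_local_score("ABCDEFGH", "ABCDXXEFGH")
```

Row i of all three tables depends only on row i-1 and the current row, so a full implementation would keep two rows instead of whole matrices.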

  31. The Case of Affine Gap Costs • Simple scoring scheme → affine gap penalty scheme • Affine edit graph and vertex structure • Question: how to modify the equations defined above?

  32. Recurrence System for Affine Gap Costs • Two observations • To compute the jth row from the (j-1)st requires knowing only the vectors of and values in row j-1, and not the values in that row • If then the value at vertex need not be recorded, as any maximal path through it will have score less than the maximal path passing through the corresponding

  33. Recurrence System

  34. Results

  35. Experiment • Method: edit-graph-based approach vs. SWAT • Scoring matrix: PAM120 • Affine gap cost: 8+4n • Database (target): 3-million-residue subset of the PIR database • Queries: a periodic clock protein of length 173 (pcp); a lactate dehydrogenase of length 319 (dehydro); a cGMP kinase of length 670 (kinase); a growth factor of length 1210 (g factor)

  36. PAM120 & Gap Cost 8+4n

  37. BLOSUM62 & Gap Cost 8+2n

  38. Ending Thanks for Your Attention
