This presentation, given by Wang, Jia-Nan and Huang, Yu-Feng, outlines the table-driven, full-sensitivity similarity search algorithm of Gene Myers and Richard Durbin. It reviews local alignment methods such as Smith-Waterman, BLAST, and FASTA for protein sequence similarity search. The approach improves efficiency by computing only 4% of the dynamic programming matrix.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu-Feng
Outline • Introduction • Background • Preliminary • Method • Experiment
Introduction • Given a query and a database, perform local alignment • Smith-Waterman: guaranteed to find all local alignments, but expensive • BLAST • FASTA
Improvement • Hardware: more investment in computers and CPUs • Software: Phil Green’s SWAT appeals to sparsity and some machine-level coding tricks • 60% of the dynamic programming matrix has value 0 • Avoid computing most of these unproductive entries
Focus on improving protein similarity searches • This approach examines and computes only 4% of the underlying dynamic programming matrix
Recall • Sequence alignment • Local sequence alignment • Global sequence alignment • Goal – matching path with highest score • Table-based computation and dynamic programming
Dynamic Programming • Three basic components • Recurrence relation • Tabular computation • Traceback
Smith-Waterman Method • Dynamic programming algorithm • Finds the most similar subsequences of two sequences • Problem • The amount of computation is enormous • Programmers will go crazy • Why? How to accelerate?
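To make the cost concrete, here is a minimal sketch of the Smith-Waterman score computation in Python. It is an illustration only, not the presenters' code: the function name `smith_waterman` and the default scores (match +8, mismatch -5, constant gap penalty -20) are assumptions, and the traceback step is omitted. Every cell of the (len(a)+1) × (len(b)+1) table is filled, which is the full cost that the table-driven method tries to avoid.

```python
def smith_waterman(a, b, match=8, mismatch=-5, gap=-20):
    """Return (best_score, end_i, end_j) for the best local alignment of a and b.

    Illustrative sketch: simple match/mismatch scores and a constant
    per-symbol gap penalty (all assumed values); no traceback is performed.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

The nested loops visit every cell, so the running time is proportional to the product of the two sequence lengths, which is what makes a full database scan expensive.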
Background • Scoring System • Simple scoring scheme • Affine gap penalty scoring scheme • PAM120 (PAMn) • BLOSUM62 (BLOSUMn)
Simple Scoring Scheme • Match (e.g. +8) • Mismatch (e.g. -5) • Gap constant penalty (e.g. -20)
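As a small worked example of the simple scheme, the sketch below scores an already-gapped pairwise alignment column by column using the example values from this slide (+8, -5, -20). The function name `score_alignment` and the use of "-" as the gap symbol are assumptions made for illustration.

```python
def score_alignment(row_a, row_b, match=8, mismatch=-5, gap=-20):
    """Score a gapped pairwise alignment under the simple scoring scheme.

    row_a and row_b are the two alignment rows, of equal length, with "-"
    (an assumed convention) marking a gap symbol.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # constant penalty per gap symbol
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    # Four matches and one gap symbol: 4 * 8 - 20 = 12
    print(score_alignment("AC-GT", "ACAGT"))
```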
Affine Gap Penalty Scoring Scheme • Match (e.g. +8) • Mismatch (e.g. -5) • Gap symbol (e.g. -5) • Gap open penalty (e.g. -10)
PAM • PAM – Percent Accepted Mutation • Dayhoff et al. (1978) • PAM unit • Evolutionary time corresponding to an average of 1 mutation per 100 residues (1% accepted) • PAMn • Relates to mutation probabilities over an evolutionary interval of n PAM units • Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf
PAM120 Source: http://eta.embl-heidelberg.de:8000/misc/mat/pam120.html
BLOSUM62 • BLOSUM – BLOcks SUbstitution Matrix • Steven Henikoff and Jorja G. Henikoff (1992) • Paper: Amino acid substitution matrices from protein blocks • BLOSUMn • Relates to substitution probabilities observed between pairs of related proteins, derived from blocks in which sequences above n% identity are clustered • Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf
Preliminaries • Σ : alphabet over which sequences are composed • |Σ| × |Σ| substitution matrix S giving the score of aligning each pair of letters • Uniform gap penalty g > 0 • Query = q1 q2 ... qP of P letters • Target = t1 t2 ... tN of N letters • Threshold T > 0
Score Table and Edit Graph • Picture source: http://searchlauncher.bcm.tmc.edu/help/Pictures/S-Wexample.gif
Problem • Find a high-scoring local alignment between Query and Target whose path score is ≥ T • Edit graph (Figure 1) • Limit our attention to prefix-positive paths • If there is a path of score T or greater in the edit graph, then there is a prefix-positive path of score T or greater
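The last claim can be checked with a short sketch. It assumes that "prefix-positive" means every non-empty prefix of the path has a positive score (the slides do not spell out the definition), and it represents a path simply as a list of edge scores. The function names are hypothetical. Cutting the path just after the last point where its running score is lowest leaves a suffix that is prefix-positive and scores at least as much as the whole path.

```python
from itertools import accumulate


def is_prefix_positive(edge_scores):
    """True if every non-empty prefix of the path has a positive score."""
    return all(p > 0 for p in accumulate(edge_scores))


def prefix_positive_suffix(edge_scores):
    """Return the suffix starting after the last minimum of the running score.

    The running score includes the empty prefix (score 0), so the removed
    part never has a positive score and the suffix scores at least as much
    as the whole path.
    """
    prefix = [0] + list(accumulate(edge_scores))
    low = min(prefix)
    # Last index where the running score hits its minimum
    start = max(i for i, p in enumerate(prefix) if p == low)
    return edge_scores[start:]


if __name__ == "__main__":
    path = [-5, 8, -20, 8, 8, -5, 8]  # assumed example edge scores
    trimmed = prefix_positive_suffix(path)
    print(sum(path), trimmed, sum(trimmed), is_prefix_positive(trimmed))
```

Because the removed prefix has a score of at most 0, any path scoring T or more yields a prefix-positive path scoring T or more, which is why the search can ignore all other paths.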
Definition • A set P of index-value pairs { (i,v): i is [0,P]
The start and extension tables • Consider a vertex x in row j of the edit graph of Query vs. Target
Start Trimming • Limiting the dynamic programming to the startable vertices requires a table Start(w) where w = |Σ|ks
Start Trimming • Worst case • Let α be the expected percentage of vertices that are seeds
Extension Trimming • A table that eliminates vertices that are not extendable • (i,j) is an extendable vertex iff C(i,j) > Extend(i, Target[j+1…j+ke])
A Table-Driven Scheme for DP • Goal: to restrict the SW computation to productive vertices • Jump table – captures the effect of Advance and Delete over kJ > 0 rows • space unmanageably large • But only record those for which
Jump table • Start table • Space-saving version for Jump and Start tables
Recall – Affine Gap Penalty • Score • Match • Mismatch • Gap symbol - gsp • Gap open penalty - gop • Affine cost of gap of length k • g + kh, g = gop, h = gsp
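A brief sketch of the cost formula, assuming the example magnitudes from the earlier affine scoring slide (gap open 10, gap symbol 5); the function names are illustrative. A gap of length k costs g + kh, so long gaps are charged the opening penalty only once, unlike a constant per-symbol penalty.

```python
def affine_gap_cost(k, gop=10, gsp=5):
    """Cost of a gap of length k under the affine scheme: g + k*h.

    gop (g) is charged once per gap, gsp (h) once per gap symbol.
    The default values are assumed example magnitudes.
    """
    if k < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if k == 0 else gop + k * gsp


def constant_gap_cost(k, g=20):
    """Cost of k gap symbols under a constant per-symbol penalty."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    return k * g


if __name__ == "__main__":
    for k in (1, 3, 10):
        print(k, affine_gap_cost(k), constant_gap_cost(k))
```

With these values a gap of length 3 costs 10 + 3 × 5 = 25 under the affine scheme, compared with 3 × 20 = 60 under the constant penalty of the simple scheme.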
[Figure: Diagram of Affine Gap Penalty, showing the D, I, and C vertices with edge weights -h, -g-h, and δ(ai,bj)] Source: kmchao’s lecture note
The Case of Affine Gap Costs • Simple scoring scheme → affine gap penalty scheme • Affine edit graph and vertex structure • Question: how to modify the equations defined above?
Recurrence System for Affine Gap Costs • Two observations • To compute the jth row from the (j-1)st requires knowing only the vectors of and values in row j-1, and not on the values in that row • If then the value at vertex need not be recorded as any maximal path through its will have score less than the maximal path passing through the corresponding
Experiment • Method • Edit graph based approach vs. SWAT • Scoring matrix • PAM120 • Affine gap cost • 8+4n • Database (target) • 3 million residue subset of the PIR database • Query • A periodic clock protein of length 173 (pcp) • A lactate dehydrogenase of length 319 (dehydro) • A cGMP kinase of length 670 (kinase) • A growth factor of length 1210 (g factor)
Ending Thanks for Your Attention