A Table-Driven, Full-Sensitivity Similarity Search Algorithm

A Table-Driven, Full-Sensitivity Similarity Search Algorithm • Gene Myers and Richard Durbin • Presented by Wang, Jia-Nan and Huang, Yu-Feng

Outline • Introduction • Background • Preliminary • Method • Experiment

Introduction • Given a query and a database, perform local alignment • Smith-Waterman: guaranteed to find all local alignments, but expensive • BLAST • FASTA

Improvement • Hardware: invest more in computers and CPUs • Software: Phil Green’s SWAT appeals to sparsity and some machine-level coding tricks • About 60% of the dynamic programming matrix has value 0 • SWAT avoids computing most of these unproductive entries

Focus on improving protein similarity searches • This approach examines and computes only 4% of the underlying dynamic programming matrix

Recall • Sequence alignment • Local sequence alignment • Global sequence alignment • Goal – matching path with highest score • Table-based computation and dynamic programming

Dynamic Programming • Three basic components • Recurrence relation • Tabular computation • Traceback

Smith-Waterman Method • Dynamic programming algorithm • Finds the most similar subsequences of two sequences • Problem • Lots of computation: the cost becomes enormous on large databases • Programmers get both frustrated and excited by the challenge • Why? And how can it be accelerated?

Background • Scoring System • Simple scoring scheme • Affine gap penalty scoring scheme • PAM120 (PAMn) • BLOSUM62 (BLOSUMn)

Simple Scoring Scheme • Match (e.g. +8) • Mismatch (e.g. -5) • Gap constant penalty (e.g. -20)
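As a minimal sketch of the Smith-Waterman method under the simple scoring scheme above, the following Python function fills the table row by row and returns the best local score. The default values (+8, -5, -20) are the slide's examples; the function name and the extra "fraction of zero cells" return value are illustrative assumptions, added only to show how sparse the table can be.

```python
def smith_waterman_score(query, target, match=8, mismatch=-5, gap=-20):
    """Return (best local alignment score, fraction of table cells equal to 0)."""
    p, n = len(query), len(target)
    prev = [0] * (p + 1)          # row j-1 of the table; row 0 is all zeros
    best = 0
    zero_cells = 0
    for j in range(1, n + 1):
        curr = [0] * (p + 1)      # column 0 stays 0
        for i in range(1, p + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[i] = max(
                0,                    # start a new local alignment here
                prev[i - 1] + sub,    # align query[i-1] with target[j-1]
                prev[i] + gap,        # target[j-1] aligned to a gap in the query
                curr[i - 1] + gap,    # query[i-1] aligned to a gap in the target
            )
            best = max(best, curr[i])
            if curr[i] == 0:
                zero_cells += 1
        prev = curr
    total = p * n
    return best, (zero_cells / total if total else 0.0)
```

Only two rows are kept in memory because the score, not the alignment itself, is requested; recovering the alignment would need the full table and a traceback.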

Affine Gap Penalty Scoring Scheme • Match (e.g. +8) • Mismatch (e.g. -5) • Gap symbol (e.g. -5) • Gap open penalty (e.g. -10)

PAM • PAM – Percent Accepted Mutation • Dayhoff et al. (1978) • PAM unit • Evolutionary time corresponding to an average of 1 mutation per 100 residues (1% accepted) • PAMn • Relates to mutation probabilities over an evolutionary interval of n PAM units • Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf
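The relation between one PAM unit and n PAM units can be illustrated by raising a one-unit mutation probability matrix to the n-th power. The sketch below uses a made-up two-letter alphabet with a 1% mutation rate, not Dayhoff's data; `TOY_PAM1`, `mat_mul` and `mat_pow` are hypothetical names. Published PAM scoring matrices such as PAM120 are log-odds scores derived from such probabilities, a step not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    """Return m raised to the n-th power (n >= 0) by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy one-unit matrix: each letter mutates to the other with probability 1%.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

# Mutation probabilities after 120 toy PAM units.
toy_pam120 = mat_pow(TOY_PAM1, 120)
```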

PAM120 Source: http://eta.embl-heidelberg.de:8000/misc/mat/pam120.html

BLOSUM62 • BLOSUM – BLOcks SUbstitution Matrix • Steven Henikoff and Jorja G. Henikoff (1992) • Paper: Amino acid substitution matrices from protein blocks [PubMed] • BLOSUMn • Built from substitutions observed in blocks of related proteins, where sequences sharing n% identity or more are clustered and counted as one • Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf

BLOSUM62

Preliminaries • Σ: the alphabet over which sequences are composed • |Σ| × |Σ| substitution matrix S giving the score of aligning each pair of symbols • Uniform gap penalty g > 0 • Query = q1q2...qP of P letters • Target = t1t2...tN of N letters • Threshold T > 0
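In the notation just introduced, the standard Smith-Waterman recurrence with a uniform gap penalty can be written as follows. This is the textbook form, with C(i,j) denoting the best score of a local alignment ending at query position i and target position j; the indexing convention is an assumption and may differ from the paper's.

```latex
C(i,0) = C(0,j) = 0, \qquad
C(i,j) = \max\bigl\{\, 0,\;
  C(i-1,j-1) + S(q_i, t_j),\;
  C(i-1,j) - g,\;
  C(i,j-1) - g \,\bigr\}
\quad (1 \le i \le P,\ 1 \le j \le N)
```

A local alignment of score T or more exists exactly when some C(i,j) ≥ T.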

Score Table to Edit Graph • Picture source: http://searchlauncher.bcm.tmc.edu/help/Pictures/S-Wexample.gif

Problem • Find a high-scoring local alignment between Query and Target whose path score is ≥ T • Edit graph (Figure 1) • Limit our attention to prefix-positive paths • If there is a path of score T or greater in the edit graph, then there is a prefix-positive path of score T or greater
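The last claim above can be checked on a path written as a list of edge scores. The sketch assumes that "prefix-positive" means every non-empty prefix of the path has a positive total score; the function names are illustrative. Cutting the path just after the last point where the running total reaches its minimum leaves a suffix that is prefix-positive and scores at least as much as the whole path.

```python
from itertools import accumulate

def is_prefix_positive(edge_scores):
    """True if every non-empty prefix of the path has a positive total score."""
    return all(total > 0 for total in accumulate(edge_scores))

def prefix_positive_suffix(edge_scores):
    """Drop the lowest-scoring prefix, keeping a prefix-positive suffix."""
    # prefix_sums[k] is the score of the first k edges (0 for the empty prefix).
    prefix_sums = [0] + list(accumulate(edge_scores))
    lowest = min(prefix_sums)
    # Cut at the last position where the running total hits its minimum.
    cut = max(k for k, s in enumerate(prefix_sums) if s == lowest)
    return edge_scores[cut:]

# A path with score 2 whose first three edges drag the running total below zero.
path = [8, -5, -20, 8, 8, -5, 8]
trimmed = prefix_positive_suffix(path)
```

Here `trimmed` keeps the last four edges. Its score is the original score minus the minimum running total, so it can only go up.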

Definition • A set P of index-value pairs { (i,v): i is [0,P]

The start and extension tables • Consider a vertex x in row j of the edit graph of Query vs. Target

Start Trimming • Limiting the dynamic programming to the startable vertices requires a table Start(w), with one entry for each of the |Σ|^ks possible ks-mers w

Start Trimming • Worst case • Let α be the expected percentage of vertices that are seeds

Extension Trimming • A table that eliminates vertices that are not extendable • (i,j) is an extendable vertex iff C(i,j) > Extend(i, Target[j+1…j+ke])

Extension Trimming

A Table-Driven Scheme for DP • Goal: to restrict the SW computation to productive vertices • Jump table – captures the effect of Advance and Delete over kJ > 0 rows • Space: unmanageably large • But only record those for which

Jump table • Start table • Space-saving version for Jump and Start tables

Check for paths scoring T or more

Recall – Affine Gap Penalty • Score • Match • Mismatch • Gap symbol - gsp • Gap open penalty - gop • Affine cost of gap of length k • g + kh, g = gop, h = gsp
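A small sketch of the affine gap cost g + kh, using the example values from the affine scoring slide (gap open 10, gap symbol 5) as defaults. The function name is an illustrative assumption, and costs are returned as positive numbers to be subtracted from the alignment score.

```python
def affine_gap_cost(k, gop=10, gsp=5):
    """Cost of a gap of length k under the affine scheme: gop + k * gsp."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    if k == 0:
        return 0                  # no gap, no cost
    return gop + k * gsp          # one opening charge plus a charge per gap symbol

# With the slide's example values, a gap of length 3 costs 10 + 3 * 5 = 25.
example_cost = affine_gap_cost(3)

# With the 8 + 4n gap cost used in the experiments, the same gap costs 20.
experiment_cost = affine_gap_cost(3, gop=8, gsp=4)
```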

[Figure: Diagram of affine gap penalty, with three vertex types D, I and C; edges are weighted -h (extend a gap), -g-h (open a gap) and δ(ai,bj) (substitution)] Source: kmchao’s lecture note

Recurrence system - Gotoh
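The formulas on this slide did not survive in the transcript. As a hedged sketch, the standard Gotoh recurrences for local alignment can be written in the notation of the preceding diagram (C, D, I, g, h, δ): D(i,j) = max(D(i-1,j) - h, C(i-1,j) - g - h), I(i,j) = max(I(i,j-1) - h, C(i,j-1) - g - h), and C(i,j) = max(0, C(i-1,j-1) + δ(ai,bj), D(i,j), I(i,j)). The Python below implements this textbook form; the function name, the match/mismatch stand-in for δ, and the default values are assumptions.

```python
NEG_INF = float("-inf")

def gotoh_local_score(a, b, match=8, mismatch=-5, g=10, h=5):
    """Best local alignment score where a gap of length k costs g + k * h."""
    m, n = len(a), len(b)
    c_prev = [0] * (n + 1)            # C values of row i-1
    d_prev = [NEG_INF] * (n + 1)      # D values of row i-1 (no gap is open yet)
    best = 0
    for i in range(1, m + 1):
        c_curr = [0] * (n + 1)
        d_curr = [NEG_INF] * (n + 1)
        i_val = NEG_INF               # I(i, j-1), carried along the row
        for j in range(1, n + 1):
            # D: a[i-1] is aligned to a gap (extend or open, coming from above).
            d_curr[j] = max(d_prev[j] - h, c_prev[j] - g - h)
            # I: b[j-1] is aligned to a gap (extend or open, coming from the left).
            i_val = max(i_val - h, c_curr[j - 1] - g - h)
            delta = match if a[i - 1] == b[j - 1] else mismatch
            c_curr[j] = max(0, c_prev[j - 1] + delta, d_curr[j], i_val)
            best = max(best, c_curr[j])
        c_prev, d_prev = c_curr, d_curr
    return best
```

For example, aligning "AAAACCCC" against "AAAATTCCCC" matches all eight letters around a gap of length 2, giving 8 × 8 - (10 + 2 × 5) = 44.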

The Case of Affine Gap Costs • From the simple scoring scheme to the affine gap penalty scheme • Affine edit graph and vertex structure • Question: how to modify the equations defined above?

Recurrence System for Affine Gap Costs • Two observations • To compute the jth row from the (j-1)st requires knowing only the vectors of and values in row j-1, and not the values in that row • If then the value at vertex need not be recorded as any maximal path through its will have score less than the maximal path passing through the corresponding

Recurrence System

Results

Experiment • Method • Edit graph based approach vs. SWAT • Scoring matrix • PAM120 • Affine gap cost • 8+4n • Database (target) • 3 million residue subset of the PIR database • Query • A periodic clock protein of length 173 (pcp) • A lactate dehydrogenase of length 319 (dehydro) • A cGMP kinase of length 670 (kinase) • A growth factor of length 1210 (g factor)

PAM120 & Gap Cost 8+4n

BLOSUM62 & Gap Cost 8+2n

Ending Thanks for Your Attention
