A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System

A Sequence Similarity Search Algorithm Based on a ProbabilisticInterpretation of an Alignment Scoring System Philipp Bucher and Kay Hofmann Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51

Goal • Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation

Introduction 1 • Goal: find a local alignment between a query sequence and a sequence in a database • Local similarity to find conserved domains • Conservation implies function

Introduction 2 • Smith-Waterman (SW) Algorithm (dynamic programming) is the most sensitive algorithm to identify local alignment between two sequences • Heuristic algorithms such as FASTA and BLAST are modifications or special cases of SW algorithm • O (mxn)

Definition • a = a1 a2 . . . am b = b1 b2 . . . bn a,bS, S containing N elements u alignment path u= (x1,y1), (x2,y2), . . . (xl,yl) x k+1>xk, y k+1>y k, x £m, y£n m=8 n=7 l=6 EGAWGHE-E P-AW-HEAE EAWHEE PAWHEE

Sequence dependent Sequence independent Gap score Scoring Substitution matrix s(a, b) SA(a, b, u) = SM (a, b, u) + SG(u) • Gap weighting function w(k ) • w(k ) = a + bk for k ³1, • w(0) = 0 if k=0

Defines a probability distribution over the sequence space by means of a stochastic process involving arandom walkthrough the model Defines a probability distribution over the space of sequence pairs by means of a stochastic process involving a random walk through an alignment path matrix ? ? Probabilistic Smith-Waterman (PSW) Algorithm ASS HMM

Length distribution (same for ASS and Null model) Null model residue probability distribution over the alphabet S residue a Null probability

PSW score

Defined by model

Scoring fxn of local alignment Length normalizing fxn

Length normalizing fxn SM(a, b, u) SG(u) Scoring fxn of localalignment SA(a, b, u)= SM(a, b, u) + SG(u) Z is some logarithmic base that satisfies:

SA(a, b, u)= SM(a, b, u) + SG(u)

G RKE GAWG--HE- AAW-RKHEE GAWHE AAWHE Length of unmatched pairs Length of matched pairs P0(a,b) vk, wk unmatched residues in a and b, respectively xk, yk matched residues in a and b, respectively

= 1

Performance evaluation of PSW • BLAST (Blosum 62) • SSEARCH • Native SW • Blosum 45 • default gap weighting fxn • PSW • Blosum 45 • Same weighing fxn as SSEARCH • Search the Swissprot protein database • Query: from well known protein family and domains

Typically 50-90% true positives retrieved for a single query sequence • % True positives affected by • Divergence of sequence family • Stringency of significant criterion applied • Stringency of criterion determined by fixing thenumber of false positives accepted • Not appropriate if the status of sequences is not known in advance

5% 9% 14% 14% 26% 33% 54% 54% 53% Comparison Equivalent performance of SSEARCH and PSW on GPC receptors, SH2-domain, SH3-domain

Comparison II • Improved or equivalent performance of PSW over native SW • PSW is specially more sensitive for stringent criterion

Summary • Pairwise sequence alignments can be improved by interpreting a scoring system as a probabilistic model • Probabilistic interpretation gives higher sensitivity • Log-likelihood ratio eliminates scoring bias due to sequence length or choice of the scoring matrix • Facilitates optimization of gap weighting matrices

Advantages of PSW • No assumption about evolutionary relatedness is made • Therefore, any scoring matrix can be used

A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System