220 likes | 422 Views
A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System. Philipp Bucher and Kay Hofmann. Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51. Goal. Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation.
E N D
A Sequence Similarity Search Algorithm Based on a ProbabilisticInterpretation of an Alignment Scoring System Philipp Bucher and Kay Hofmann Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51
Goal • Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation
Introduction 1 • Goal: find a local alignment between a query sequence and a sequence in a database • Local similarity to find conserved domains • Conservation implies function
Introduction 2 • Smith-Waterman (SW) Algorithm (dynamic programming) is the most sensitive algorithm to identify local alignment between two sequences • Heuristic algorithms such as FASTA and BLAST are modifications or special cases of SW algorithm • O (mxn)
Definition • a = a1 a2 . . . am b = b1 b2 . . . bn a,bS, S containing N elements u alignment path u= (x1,y1), (x2,y2), . . . (xl,yl) x k+1>xk, y k+1>y k, x £m, y£n m=8 n=7 l=6 EGAWGHE-E P-AW-HEAE EAWHEE PAWHEE
Sequence dependent Sequence independent Gap score Scoring Substitution matrix s(a, b) SA(a, b, u) = SM (a, b, u) + SG(u) • Gap weighting function w(k ) • w(k ) = a + bk for k ³1, • w(0) = 0 if k=0
Defines a probability distribution over the sequence space by means of a stochastic process involving arandom walkthrough the model Defines a probability distribution over the space of sequence pairs by means of a stochastic process involving a random walk through an alignment path matrix ? ? Probabilistic Smith-Waterman (PSW) Algorithm ASS HMM
Length distribution (same for ASS and Null model) Null model residue probability distribution over the alphabet S residue a Null probability
Scoring fxn of local alignment Length normalizing fxn
Length normalizing fxn SM(a, b, u) SG(u) Scoring fxn of localalignment SA(a, b, u)= SM(a, b, u) + SG(u) Z is some logarithmic base that satisfies:
G RKE GAWG--HE- AAW-RKHEE GAWHE AAWHE Length of unmatched pairs Length of matched pairs P0(a,b) vk, wk unmatched residues in a and b, respectively xk, yk matched residues in a and b, respectively
Performance evaluation of PSW • BLAST (Blosum 62) • SSEARCH • Native SW • Blosum 45 • default gap weighting fxn • PSW • Blosum 45 • Same weighing fxn as SSEARCH • Search the Swissprot protein database • Query: from well known protein family and domains
Typically 50-90% true positives retrieved for a single query sequence • % True positives affected by • Divergence of sequence family • Stringency of significant criterion applied • Stringency of criterion determined by fixing thenumber of false positives accepted • Not appropriate if the status of sequences is not known in advance
5% 9% 14% 14% 26% 33% 54% 54% 53% Comparison Equivalent performance of SSEARCH and PSW on GPC receptors, SH2-domain, SH3-domain
Comparison II • Improved or equivalent performance of PSW over native SW • PSW is specially more sensitive for stringent criterion
Summary • Pairwise sequence alignments can be improved by interpreting a scoring system as a probabilistic model • Probabilistic interpretation gives higher sensitivity • Log-likelihood ratio eliminates scoring bias due to sequence length or choice of the scoring matrix • Facilitates optimization of gap weighting matrices
Advantages of PSW • No assumption about evolutionary relatedness is made • Therefore, any scoring matrix can be used