
A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System

A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System. Philipp Bucher and Kay Hofmann. Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51. Goal. Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation.


Presentation Transcript


  1. A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System • Philipp Bucher and Kay Hofmann • Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51

  2. Goal • Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation

  3. Introduction 1 • Goal: find a local alignment between a query sequence and a sequence in a database • Local similarity to find conserved domains • Conservation implies function

  4. Introduction 2 • The Smith-Waterman (SW) algorithm (dynamic programming) is the most sensitive algorithm for identifying a local alignment between two sequences • Heuristic algorithms such as FASTA and BLAST are modifications or special cases of the SW algorithm • Time complexity: O(m × n)
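The dynamic-programming recursion behind SW can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the presentation: the match/mismatch scores and the linear gap penalty are placeholder assumptions (a real search would use a substitution matrix such as BLOSUM and the gap weighting function defined on a later slide), and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumed toy scoring: `match`/`mismatch` stand in for a substitution
    matrix and `gap` is a linear per-residue gap penalty.
    """
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):          # m rows ...
        for j in range(1, n + 1):      # ... times n columns: O(m x n) cells
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment here
                H[i - 1][j - 1] + s,    # extend with a matched pair
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman("EGAWGHEE", "PAWHEAE"))
```

Each cell is filled once from three neighbours, which is where the O(m × n) cost comes from.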

  5. Definition • a = a1 a2 . . . am, b = b1 b2 . . . bn • a, b over the alphabet S, S containing N elements • Alignment path u = (x1,y1), (x2,y2), . . . (xl,yl) with xk+1 > xk, yk+1 > yk, x ≤ m, y ≤ n • Example: m = 8, n = 7, l = 6 • Alignment: EGAWGHE-E / P-AW-HEAE • Matched residues: EAWHEE / PAWHEE
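The path definition can be made concrete with a short sketch that recovers u from the two gapped rows of the slide's example. The function names `path_from_alignment` and `is_valid_path` are hypothetical, and 1-based indexing is assumed to match the slide's notation.

```python
def path_from_alignment(row_a, row_b, gap="-"):
    """Return the alignment path u as 1-based (x, y) pairs of matched positions."""
    if len(row_a) != len(row_b):
        raise ValueError("gapped rows must have equal length")
    u, x, y = [], 0, 0
    for ca, cb in zip(row_a, row_b):
        if ca != gap:
            x += 1                      # advance position in a
        if cb != gap:
            y += 1                      # advance position in b
        if ca != gap and cb != gap:
            u.append((x, y))            # column pairs two residues
    return u


def is_valid_path(u, m, n):
    """Check that x and y strictly increase and stay within 1..m and 1..n."""
    in_bounds = all(1 <= x <= m and 1 <= y <= n for x, y in u)
    increasing = all(x1 > x0 and y1 > y0 for (x0, y0), (x1, y1) in zip(u, u[1:]))
    return in_bounds and increasing


if __name__ == "__main__":
    a, b = "EGAWGHEE", "PAWHEAE"        # m = 8, n = 7
    u = path_from_alignment("EGAWGHE-E", "P-AW-HEAE")
    print(u)                            # [(1, 1), (3, 2), (4, 3), (6, 4), (7, 5), (8, 7)]
    print(len(u), is_valid_path(u, len(a), len(b)))   # 6 True
    print("".join(a[x - 1] for x, _ in u))            # EAWHEE
    print("".join(b[y - 1] for _, y in u))            # PAWHEE
```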

  6. Scoring • SA(a, b, u) = SM(a, b, u) + SG(u) • SM(a, b, u): sequence dependent, from the substitution matrix s(a, b) • SG(u): sequence independent, gap score • Gap weighting function w(k) • w(k) = a + bk for k ≥ 1 • w(0) = 0
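As a rough illustration of the decomposition SA = SM + SG, the sketch below scores the example path from the Definition slide. Several things here are assumptions rather than details from the presentation: the toy substitution scores (+2 match, -1 mismatch) stand in for a real matrix, the gap parameters `w_a` and `w_b` (the a and b of w(k)) are given arbitrary negative values so that SG can simply be added, and SG(u) is taken to be the sum of w(k) over the gaps between consecutive path pairs.

```python
def gap_weight(k, w_a=-3.0, w_b=-1.0):
    """w(k) = a + b*k for k >= 1, and w(0) = 0 (w_a, w_b are assumed values)."""
    return 0.0 if k == 0 else w_a + w_b * k


def substitution_score(p, q):
    """Toy stand-in for the substitution matrix s(a, b)."""
    return 2.0 if p == q else -1.0


def alignment_score(a, b, u):
    """Return (SM, SG, SA) for sequences a, b and a 1-based alignment path u."""
    # SM depends on the residues paired by the path
    sm = sum(substitution_score(a[x - 1], b[y - 1]) for x, y in u)
    # SG depends only on the path: residues skipped between consecutive pairs
    sg = 0.0
    for (x0, y0), (x1, y1) in zip(u, u[1:]):
        sg += gap_weight(x1 - x0 - 1) + gap_weight(y1 - y0 - 1)
    return sm, sg, sm + sg


if __name__ == "__main__":
    u = [(1, 1), (3, 2), (4, 3), (6, 4), (7, 5), (8, 7)]
    print(alignment_score("EGAWGHEE", "PAWHEAE", u))   # (9.0, -12.0, -3.0)
```

With these assumed parameters the five matches and one mismatch give SM = 9, and the three single-residue gaps each cost w(1) = -4, so SG = -12.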

  7. Probabilistic Smith-Waterman (PSW) Algorithm • HMM: defines a probability distribution over the sequence space by means of a stochastic process involving a random walk through the model • ASS (alignment scoring system): defines a probability distribution over the space of sequence pairs by means of a stochastic process involving a random walk through an alignment path matrix

  8. Null model • Length distribution (same for ASS and null model) • Residue probability distribution over the alphabet S • Null probability of residue a

  9. PSW score

  10. Defined by model

  11. Scoring function of local alignment • Length-normalizing function

  12. Length-normalizing function • Scoring function of local alignment: SA(a, b, u) = SM(a, b, u) + SG(u) • Z is some logarithmic base that satisfies:

  13. SA(a, b, u)= SM(a, b, u) + SG(u)

  14. Example • Alignment: GAWG--HE- / AAW-RKHEE • Unmatched residues: G / RKE • Matched residues: GAWHE / AAWHE • Length of unmatched pairs • Length of matched pairs • P0(a,b) • vk, wk: unmatched residues in a and b, respectively • xk, yk: matched residues in a and b, respectively

  15. = 1

  16. Performance evaluation of PSW • BLAST (BLOSUM62) • SSEARCH: native SW, BLOSUM45, default gap weighting function • PSW: BLOSUM45, same gap weighting function as SSEARCH • Searched the Swiss-Prot protein database • Queries: from well-known protein families and domains

  17. Typically 50-90% of true positives are retrieved for a single query sequence • % true positives affected by • Divergence of the sequence family • Stringency of the significance criterion applied • Stringency of criterion determined by fixing the number of false positives accepted • Not appropriate if the status of sequences is not known in advance

  18. Comparison • Equivalent performance of SSEARCH and PSW on GPC receptors, SH2 domain, SH3 domain • [Figure values, labels not preserved: 5%, 9%, 14%, 14%, 26%, 33%, 54%, 54%, 53%]

  19. Comparison II • Improved or equivalent performance of PSW over native SW • PSW is especially more sensitive under stringent criteria

  20. Summary • Pairwise sequence alignments can be improved by interpreting a scoring system as a probabilistic model • Probabilistic interpretation gives higher sensitivity • Log-likelihood ratio eliminates scoring bias due to sequence length or choice of the scoring matrix • Facilitates optimization of gap weighting matrices

  21. Advantages of PSW • No assumption about evolutionary relatedness is made • Therefore, any scoring matrix can be used
