1 / 33

Comparing Database Search Methods & Improving the Performance of PSI-BLAST

Comparing Database Search Methods & Improving the Performance of PSI-BLAST. Stephen Altschul. “Gold standards” for protein classification. Traditional curated sequence databases with family and superfamily classifications: PIR SWISS-PROT Structure-based protein domain classification:

beryl
Download Presentation

Comparing Database Search Methods & Improving the Performance of PSI-BLAST

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Comparing Database Search Methods & Improving the Performance of PSI-BLAST Stephen Altschul

  2. “Gold standards” for protein classification Traditional curated sequence databases with family and superfamily classifications: PIR SWISS-PROT Structure-based protein domain classification: SCOP

  3. Measuring retrieval accuracy Sensitivity: TP/R Specificity: TP/P

  4. Receiver Operating Characteristic curve False – True + False + True –

  5. Random retrieval on a ROC plot

  6. Line of fixed sensitivity

  7. Line of fixed specificity

  8. Line of fixed crossover ratio

  9. Line of fixed crossover ratio

  10. ROC score: area under the ROC curve

  11. Region of interest in ROC analysis

  12. Region of interest in ROC analysis

  13. Truncated ROC, or ROCn curve Fraction unrelated accepted 0 10–3

  14. ROCn score: area under the ROCn curve

  15. Questions concerning ROC analysis • What false-positive cutoff value should be used? • When does it make sense to pool the results of database searches? • When are the ROC scores for two different methods significantly different?

  16. Marginal ratio of true to false positives

  17. Definition of the ROCn score ti : Number of related sequences (true positives) returned before the ith false positive t: Total number of related sequences

  18. “Random distribution” of ROCn scores • Bootstrap resampling can be used to assign a statistical significance to differences in ROCn scores. • Under reasonable assumptions, the distribution of bootstrapped ROCn scores is approximately normal. • Resampling a small subset in a large database is equivalent to resampling the subset with independent Poisson distributions with mean 1.

  19. 1 1 2 2 3 3 4 4 4 4 5 5 7 7 8 8 10 10 10 10 1 2 3 4 5 6 7 8 9 10 The true records are well characterized. Only false records are resampled with replacement. The false records are the noise. Bootstrap resampling of false positives Retrieval Ranking of the Database

  20. Mean and variance for the normal distribution of ROCn scores yielded by resampling only the false positives

  21. Mean and variance for the normal distribution of the difference of two ROCn scores, yielded by resampling only the false positives

  22. PSI-BLAST in a nutshell • With a protein sequence as query, use BLAST to search a protein sequence database. • Collapse significant local alignments (those with E-value less than or equal to a set threshold h) into a multiple alignment, using the residues of the query sequence as alignment-column placeholders. • Abstract a position-specific score matrix from the multiple alignment. • Search the database with the score matrix as query. • Iterate a fixed number of times, or until convergence.

  23. Protocol for evaluating PSI-BLAST • For each query sequence, search a comprehensive protein sequence database (e.g. NCBI’s nr) through a fixed number of PSI-BLAST iterations, or until convergence. • Use the resulting position-specific score matrix to search the “gold standard” database. • Pool the search results for ROC analysis.

  24. The effect of acceptance threshold h on PSI-BLAST accuracy

  25. 1. New statistical parameters 2. Smith-Waterman alignment 3. Substitution matrix frequency ratios 4. Apply SEG to database sequences 5. Composition-based statistics 6. “Concentrated” accounting of gaps 7. “Dispersed” accounting of gaps 8. Exponentiate Henikoff weights 9. Reverse sequence normalization 10. Window for amino acid composition Some ideas for improving PSI-BLAST 11. Use pseudocounts with composition window 12. Vary gap costs 13. Generalized affine gap costs 14. Substitution score offset 15. Information-dependent pseudocount parameter 16. Database-sequence length-normalization 17. Restricted score rescaling 18. Adjust purging percentage 19. Adjust pseudocount parameter 20. Adjust acceptance threshold

  26. The effect of composition-based statistics on PSI-BLAST accuracy

  27. Composition-based statistics Statistics based on “standard” amino acid frequencies can differ by orders of magnitude from those based upon the peculiar composition of two proteins. Standard protein: 4.5 % N DNA pol III, β chain [M. genitalium]: 12.1 % N DNA pol III, β chain [C. jejuni]: 7.6 % N Depending upon the composition assumed, a search of nr with M. genitalium DNA pol III as query yields different E-values for C. jejuni DNA pol III, as well as for the highest-scoring false positive: “Standard” statistics: 10-10 0.0002 Composition-based statistics: 0.001 0.2 At a threshold of 0.0001, “standard” statistics yield 54 true positives, while at 0.1, composition-based statistics yield 55 true positives.

  28. The effect of dispersed accounting of gaps on PSI-BLAST accuracy

  29. The effect of restricted score rescaling and parameter tuning on PSI-BLAST accuracy

  30. Accuracy of PSI-BLAST

  31. PSI-BLAST accuracy as a function of the number of iterations

  32. Literature ROC analysis Swets, J.A. (1988) Science240:1285-1293 Gribskov, M. & Robinson, N.L. (1996) Comput. Chem.20:25-33 PSI-BLAST Altschul, S.F. et al. (1997) Nucl. Acids Res.25:3389-3402 Composition-based statistics Karplus, K. et al. (1998) Bioinformatics14:846-856 Schäffer, A.A. et al. (1999) Bioinformatics15:1000-1011 Mott, R. (2000) J. Mol. Biol.300:649-659 Statistics of ROCn resampling Schäffer, A.A. et al. (2001) Nucl. Acids Res.29:2994-3005 Spouge, J.L. & Czabarka, E. (2002) ISMB Poster 133A

  33. Analysis of ROCn score distribution John Spouge Eva Czabarka Improvements to PSI-BLAST Alejandro Schäffer L. Aravind Thomas Madden Sergei Shavirin John Spouge Yuri Wolf Eugene Koonin Acknowledgements

More Related