
Testing statistical significance scores of sequence comparison methods with structure similarity

Presentation Transcript


  1. Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

  2. Introduction • Sequence comparison: Important for finding similar proteins (homologs) for a protein with unknown function • Algorithms: BLAST, FASTA, Smith-Waterman • Statistical scores: E-value (standard), Z-value

  3. E-value or Z-value? • Smith-Waterman sequence comparison with Z-value statistics: 100 randomized shuffles to test the significance of the SW score • Example (figure): the original sequence MFTGQEYHSV is shuffled (1. GQHMSVFTEY, 2. YMSHQFTVGE, etc.); the shuffled SW scores form a distribution (# seqs vs. SW score), and an original score lying 5 SD above the mean of the randomized scores gives Z = 5
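The Z-value idea on this slide can be made concrete with a small sketch. This is not the Biofacet implementation: `score_fn` stands for any Smith-Waterman scoring routine supplied by the caller, and all names are illustrative.

```python
import random
import statistics

def z_value(query, subject, score_fn, n_shuffles=100, seed=0):
    """Z = (original SW score - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    original = score_fn(query, subject)            # e.g. score_fn("MFTGQEYHSV", subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(subject)
        rng.shuffle(residues)                      # randomize residue order, keep composition
        shuffled_scores.append(score_fn(query, "".join(residues)))
    mu = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (original - mu) / sd
```

Because each pair needs on the order of 100 extra alignments, this scheme is far more expensive than an E-value, which is consistent with the run times reported on the timing slide.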

  4. E-value or Z-value? • Z-value calculation is time-consuming (2x100 randomizations) • Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value • BUT: the advantage of the Z-value has never been demonstrated with experimental results

  5. How to compare? • Structural similarity is a better indicator of homology than sequence similarity • ASTRAL SCOP: Structural Classification Of Proteins • e.g. a.2.1.3, c.1.2.4; same code ~ same structure • Use the structural classification as a benchmark for the sequence comparison methods
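As an illustration of how the SCOP classification can serve as a benchmark, a hit can be labelled a true positive when query and hit carry the same code down to the family level. A minimal sketch, using simple string comparison of the SCOP code (names are illustrative):

```python
def same_scop_family(sccs_a: str, sccs_b: str) -> bool:
    """True when both SCOP codes agree on class, fold, superfamily and family."""
    return sccs_a.split(".")[:4] == sccs_b.split(".")[:4]

print(same_scop_family("a.2.1.3", "a.2.1.3"))  # True  -> such a hit counts as a true positive
print(same_scop_family("a.2.1.3", "c.1.2.4"))  # False -> such a hit counts as a false positive
```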

  6. ASTRAL SCOP statistics

  7. Methods (1) • Smith-Waterman algorithms: dynamic programming; computationally intensive • Paracel with e-value (PA E): • SW implementation of Paracel • Biofacet with z-value (BF Z): • SW implementation of Gene-IT • ParAlign with e-value (PA E): • SW implementation of Sencel • SSEARCH with e-value (SS E): • SW implementation from the FASTA package (see next slide)

  8. Methods (2) • Heuristic algorithms: • FASTA (FA E) • Pearson & Lipman, 1988 • Heuristic approximation; performs better than BLAST on strongly diverged proteins • BLAST (BL E): • Altschul et al., 1990 • Heuristic approximation; extends local alignments (HSPs) into longer alignments • Should be faster than FASTA

  9. Method parameters • all methods: • matrix: BLOSUM62 • gap open penalty: 12 • gap extension penalty: 1 • Biofacet with z-value: 100 randomizations

  10. Receiver Operating Characteristic • R.O.C.: statistical measure, mostly used in clinical medicine • Proposed by Gribskov & Robinson (1996) for evaluating sequence comparison methods

  11. ROC50 Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each of the first 50 false positives: count the number of true positives higher in the list (0, 4, 4, 4, 5, 5, 6, 9, 12, 12, 12, 12, 12, ...) • Divide the sum of these counts by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC50 (0.167) • Take the average of the ROC50 scores over all entries
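A hedged sketch of the ROC50 calculation described on this slide, assuming the hit list has already been reduced to the 100 best hits and labelled against SCOP (names are illustrative):

```python
def roc50(hits, family_size, n_fp=50):
    """`hits`: ordered list of booleans (True = same SCOP family), best hit first."""
    tp_above = 0        # true positives seen so far in the ranked list
    fp_counts = []      # for each false positive, the number of true positives ranked above it
    for is_tp in hits:
        if is_tp:
            tp_above += 1
        else:
            fp_counts.append(tp_above)
            if len(fp_counts) == n_fp:
                break
    possible_tp = family_size - 1          # the query itself is not counted
    if possible_tp <= 0:
        return None                        # singleton family: no homologs to find
    return sum(fp_counts) / (n_fp * possible_tp)
```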

  12. ROC50 results

  13. Coverage vs. Error • C.V.E. = Coverage vs. Error (Brenner et al., 1998) • E.P.Q. = errors per query, a selectivity indicator (how many false positives per query?) • Coverage = sensitivity indicator (how many of the possible true positives are found?)

  14. CVE Example • Vary the threshold above which a hit is counted as a positive: e.g. e=10, e=1, e=0.1, e=0.01 • True positives: in the same SCOP family; false positives: not in the same family • For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives • For each threshold, calculate the errors per query: the number of false positives divided by the number of queries • Plot coverage on the x-axis and errors per query on the y-axis; bottom right (high coverage, low error) is best
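A minimal sketch of the coverage/errors-per-query calculation, assuming hits have been collected per query with their E-values and SCOP-based true/false labels; names are illustrative and not tied to any particular tool:

```python
def cve_points(results, possible_tp_total, thresholds=(10.0, 1.0, 0.1, 0.01)):
    """`results`: dict mapping each query to a list of (e_value, is_true_positive) hits."""
    n_queries = len(results)
    points = []
    for t in thresholds:
        tp = fp = 0
        for hits in results.values():
            for e_value, is_tp in hits:
                if e_value <= t:           # hit accepted as a positive at this threshold
                    if is_tp:
                        tp += 1
                    else:
                        fp += 1
        coverage = tp / possible_tp_total  # sensitivity: fraction of possible true positives found
        epq = fp / n_queries               # selectivity: errors per query
        points.append((coverage, epq))     # one point per threshold on the CVE curve
    return points
```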

  15. CVE results (for PDB010)

  16. Mean Average Precision • A.P.: borrowed from information retrieval (Salton, 1991) • Recall: true positives divided by the number of homologs • Precision: true positives divided by the number of hits • A.P. = approximation of the area under the recall-precision curve

  17. Mean AP Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each true positive: divide its rank among the true positives (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) by its rank in the full hit list (2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20) • Divide the sum of all these ratios by the total number of hits (100) = AP (0.140) • Take the average of the AP scores over all entries = mean AP
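A hedged sketch of the AP calculation as described on this slide, including its normalization by the total number of hits considered (100); names are illustrative and the input is again a ranked list of SCOP-labelled hits:

```python
def average_precision(hits, n_hits=100):
    """`hits`: ordered list of booleans (True = same SCOP family), best hit first."""
    tp_seen = 0
    total = 0.0
    for rank, is_tp in enumerate(hits[:n_hits], start=1):
        if is_tp:
            tp_seen += 1
            total += tp_seen / rank        # rank among true positives / rank in full list
    return total / n_hits                  # normalize by the total number of hits, as on the slide

def mean_average_precision(per_query_hits):
    """Average the per-query AP scores to obtain the mean AP."""
    scores = [average_precision(h) for h in per_query_hits]
    return sum(scores) / len(scores)
```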

  18. Mean AP results

  19. Time consumption • PDB095 all-against-all comparison: • Biofacet: multiple days (Z-value calc.!) • SSEARCH: 5h49m • ParAlign: 47m • FASTA: 40m • BLAST: 15m

  20. Conclusions • e-value better than Z-value(!) • The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all • Use FASTA/BLAST only when speed is important • A larger structural comparison database is needed for better analysis

  21. Credits Peter Groenen Wilco Fleuren Jack Leunissen
