Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27
Introduction • Sequence comparison: Important for finding similar proteins (homologs) for a protein with unknown function • Algorithms: BLAST, FASTA, Smith-Waterman • Statistical scores: E-value (standard), Z-value
E-value or Z-value? • Smith-Waterman sequence comparison with Z-value statistics: 100 randomized shuffles to test the significance of the SW score • Z = (original SW score - mean of shuffled scores) / SD; e.g. an original score 5 SDs above the shuffled mean gives Z = 5 • Example: original MFTGQEYHSV; shuffle 1: GQHMSVFTEY; shuffle 2: YMSHQFTVGE; etc. [histogram: number of shuffled sequences vs. SW score]
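The shuffling procedure above can be sketched in Python. This is a minimal sketch: the toy scoring function is a hypothetical stand-in for a real Smith-Waterman score, used only to make the example self-contained.

```python
import random

def shuffle_seq(seq, rng=random):
    """Return a random permutation of the sequence (same composition)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def z_value(original_score, shuffled_scores):
    """Z = (original SW score - mean of shuffled scores) / standard deviation."""
    n = len(shuffled_scores)
    mean = sum(shuffled_scores) / n
    sd = (sum((s - mean) ** 2 for s in shuffled_scores) / n) ** 0.5
    return (original_score - mean) / sd

# Hypothetical toy score standing in for a real Smith-Waterman score:
# counts identical positions between two equal-length strings.
def toy_score(a, b):
    return sum(x == y for x, y in zip(a, b))

# Procedure from the slide: score the original pair, then 100 shuffles.
query, target = "MFTGQEYHSV", "MFTGQEYHSW"
shuffled = [toy_score(shuffle_seq(query), target) for _ in range(100)]
z = z_value(toy_score(query, target), shuffled)
```

A real implementation would shuffle both sequences and rescore with the full SW algorithm, which is why the next slide counts 2×100 randomizations per comparison.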
E-value or Z-value? • Z-value calculation is time-consuming (2×100 randomizations per comparison) • Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value • BUT the advantage of the Z-value has never been demonstrated experimentally
How to compare? • Structural comparison is a more reliable indicator of homology than sequence comparison • ASTRAL SCOP: Structural Classification of Proteins • e.g. a.2.1.3, c.1.2.4; same number ~ same structure • Use the structural classification as a benchmark for the sequence comparison methods
Methods (1) • Smith-Waterman algorithms: dynamic programming; computationally intensive • Paracel with e-value (PA E): • SW implementation by Paracel • Biofacet with z-value (BF Z): • SW implementation by Gene-IT • ParAlign with e-value (PL E): • SW implementation by Sencel • SSEARCH with e-value (SS E): • SW implementation from the FASTA package (see next page)
Methods (2) • Heuristic algorithms: • FASTA (FA E): • Pearson & Lipman, 1988 • Heuristic approximation; performs better than BLAST on strongly diverged proteins • BLAST (BL E): • Altschul et al., 1990 • Heuristic approximation; extends local alignments (HSPs) toward a global alignment • Should be faster than FASTA
Method parameters • all: • matrix: BLOSUM62 • gap open penalty: 12 • gap extension penalty: 1 • Biofacet with z-value: 100 randomizations
Receiver Operating Characteristic • R.O.C.: statistical measure, mostly used in clinical medicine • Proposed by Gribskov & Robinson (1996) for the analysis of sequence comparison methods
ROC50 Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each of the first 50 false positives, count the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12,...) • Divide the sum of these counts by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC50 (0.167) • Take the average of the ROC50 scores over all entries
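The steps above can be sketched in Python. This is a minimal sketch of the slide's procedure, assuming the hit list is already ranked and given as booleans (true positive or not):

```python
def roc_n(hit_is_tp, family_size, n=50):
    """ROCn as described on the slide: for each of the first n false
    positives, count the true positives ranked above it, then divide the
    sum of those counts by n and by the number of possible true
    positives (family size - 1)."""
    counts = []   # true positives seen before each false positive
    tps = 0
    for is_tp in hit_is_tp:
        if is_tp:
            tps += 1
        else:
            counts.append(tps)
            if len(counts) == n:
                break
    return sum(counts) / (n * (family_size - 1))
```

For example, `roc_n([True, False, True, False], family_size=3, n=2)` gives (1 + 2) / (2 × 2) = 0.75.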
Coverage vs. Error • C.V.E. = Coverage vs. Error (Brenner et al., 1998) • E.P.Q. = errors per query: selectivity indicator (how many false positives per query?) • Coverage: sensitivity indicator (what fraction of all possible true positives is found?)
CVE Example • Vary the threshold above which a hit is counted as a positive: e.g. e=10, e=1, e=0.1, e=0.01 • True positives: in the same SCOP family; false positives: not in the same family • For each threshold, calculate the coverage: number of true positives divided by the total number of possible true positives • For each threshold, calculate the errors-per-query: number of false positives divided by the number of queries • Plot coverage on the x-axis and errors-per-query on the y-axis; bottom-right (high coverage, low error) is best
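The per-threshold calculation can be sketched in Python. A minimal sketch, assuming all hits from all queries are pooled into one list of (e-value, is-true-positive) pairs:

```python
def cve_points(hits, thresholds, total_true, n_queries):
    """For each e-value threshold, return (coverage, errors-per-query).

    hits        -- list of (e_value, is_true_positive) over all queries
    total_true  -- total number of possible true positives
    n_queries   -- number of queries run"""
    points = []
    for t in thresholds:
        tp = sum(1 for e, is_tp in hits if e <= t and is_tp)
        fp = sum(1 for e, is_tp in hits if e <= t and not is_tp)
        points.append((tp / total_true, fp / n_queries))
    return points
```

Plotting these (coverage, EPQ) pairs for a sweep of thresholds yields the CVE curve; a method whose curve lies further toward the bottom-right is both more sensitive and more selective.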
CVE results (for PDB010) [figure: coverage vs. errors-per-query curves per method]
Mean Average Precision • A.P.: Average Precision, borrowed from information retrieval (Salton, 1991) • Recall: true positives divided by the number of homologs • Precision: true positives divided by the number of hits • A.P.: approximate integral of the area under the recall-precision curve
Mean AP Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each true positive, divide its rank among the true positives (1,2,3,4,5,6,7,8,9,10,11,12) by its rank in the full hit list (2,3,4,5,9,12,14,15,16,18,19,20) • Divide the sum of all of these ratios by the total number of hits (100) = AP (0.140) • Take the average of the AP scores over all entries = mean AP
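The per-entry calculation can be sketched in Python. A minimal sketch of the standard information-retrieval form of average precision; note that the slide instead divides the sum of the ratios by the total number of hits (100), so this is the textbook variant, not necessarily the exact normalisation used in the study:

```python
def average_precision(hit_is_tp):
    """Standard IR average precision: at each true positive, divide its
    rank among the true positives by its rank in the full hit list
    (i.e. the precision at that point), then average over the true
    positives."""
    ratios = []
    tps = 0
    for rank, is_tp in enumerate(hit_is_tp, start=1):
        if is_tp:
            tps += 1
            ratios.append(tps / rank)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

For example, `average_precision([True, False, True])` averages the precisions 1/1 and 2/3, giving 5/6; the mean AP is then the average of this value over all database entries.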
Time consumption • PDB095 all-against-all comparison: • Biofacet: multiple days (Z-value calc.!) • SSEARCH: 5h49m • ParAlign: 47m • FASTA: 40m • BLAST: 15m
Conclusions • e-value better than Z-value(!) • The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all • Use FASTA/BLAST only when time is important • A larger structural comparison database is needed for better analysis
Credits Peter Groenen Wilco Fleuren Jack Leunissen