380 likes | 613 Views
Statistical Evaluation of Pairwise Protein Sequence Comparison with the Bayesian Bootstrap. Gavin A. Price, Gavin E. Cooks, Richard E. Green and Steven E. Brenner. Goals. Detail an improved statistical benchmark of pairwise protein sequence comparison algorithm
E N D
Statistical Evaluation of Pairwise Protein Sequence Comparison with the Bayesian Bootstrap Gavin A. Price, Gavin E. Cooks,Richard E. Green and Steven E. Brenner
Goals • Detail an improved statistical benchmark of pairwise protein sequence comparison algorithm • Apply analysis to the comparative study of amino acid substitution matrix families
Goals • Detail an improved statistical benchmark of pairwise protein sequence comparison algorithm • Apply analysis to the comparative study of amino acid substitution matrix families
Background • What is a superfamily? • Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable are placed together in superfamilies‡ ‡-Introduction to Structural Classification of Proteins, http://scop.mrc-lmb.cam.ac.uk/scop/intro.html
Building the Database • SCOP (version 1.61) • Evolutionary interrelations are known • Homologous domains placed in same superfamily • No two sequences have >40% pairwise identity • Focus on remote homologs • Subset is non-redundant • ASTRAL provides SCOP subsets • Further divided into training (2592 sequences) and test (2182 sequences) sets
Alignment scoring • Collect alignment scores (E-values) from pair-wise alignment of all sequences in the database • E-value: number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search • Vary the alignment score threshold • Allow few false positive errors with stringent cutoff • Allow more errors but find more true relations with more permissive threshold • Plot the results in a coverage versus error per query (CVE) plot
Many false positives, few true homologs Few false positives, many true homologs Example CVE plot Number of false positive matches Proportion of true relations found
Superfamily Size Normalization • Why Normalize? • Large superfamilies account for most of the true relations in the database • Protein domains are more diverse • Interrelations are harder to discover • Unnormalized coverage is dominated by largest SCOP superfamilies • As superfamily size grows, the number of relationships will grow quadratically • Normalization prevents skewing of performance evaluations
1 1 = n(n-1) n2-n Superfamily Size Normalization • Unnormalized case • Coverage is the fraction of all true relations that are found • Each correctly deduced sequence relationship contributes to the coverage • n – number of sequences in the database • n2-n – number of possible sequence relationships • n -1 – weight of each sequence match
Superfamily Size Normalization • Two possible types of Normalization • Linear • Quadratic • Normalization downweights larger superfamilies
1 n(s – 1) Superfamily Normalization • Linear Normalization • Average fraction of true relations per sequence • Weight of each sequence match divided by no. of true homologs of the query: • (s-1) – no. of true homologs in the query • s – superfamily size
1 S(s2– s) Superfamily Normalization • Quadratic Normalization • Average fraction of true relations per superfamily • Weight of each sequence match is divided by the total coverage of the superfamily, i.e. number of true relations within the superfamily: • (s2-s) – number of true relations within the superfamily • s – size of superfamily • S – number of superfamilies in the database
Standard Bootstrapping • Tests if measured results are statistically significant, given finite size of datasets • Randomly sample N sequences with replacement • N – original database size • Collect many replica sets, analyze, and interpret
Standard Bootstrapping • Problems • Replicas are biased relative to the original data, underestimating the true coverage • Why? • Each sequence is represented zero, one or more times • Chance of not including a particular sequence is ~1/e or 37% (Weights are approx. Poissonian) • Each replica has a 60% chance of entirely neglecting each size-2 superfamily
Standard Bootstrapping • Problems • Sequences preferentially selected from larger, more diverse superfamilies • Correct sequence relationships are harder to discover • Undersampling leads to unwarranted reduction in homology
Thick lines represent original CVE Lighter areas represent bootstrapped replicas Standard Bootstrap
Bayesian Bootstrap • Prior Assumption: All sequences are equally probable • Sample sequences with replacement • Combine prior with sample likelihood via Bayes theorem • Result is the Dirichlet posterior distribution on the fraction of the original population that each sampled sequence represents
Bayesian Bootstrap • With non-integer weights, smaller superfamilies do not drop out • Bias is eliminated
Overpredicted variance • Replicas are skewed left Comparison of Bootstrap Methods
Normalization Formulas • wi – weight of query sequence • wj – weight of target sequence • S – number of superfamilies *Summing to s includes only weights of sequences in query’s superfamily
Goals • Detail an improved statistical benchmark of pairwise protein sequence comparison algorithm • Apply analysis to the comparative study of amino acid substitution matrix families
Example of Methodology • Compare performance of four matrix families • Choose a single matrix from each family based on ability to align remote homologs given that it matches close homologs relatively easily • Use SSEARCH algorithm • Implementation of Smith-Waterman algorithm
Example of Methodology • Matrix families tested: • Classic PAM matrices (Dayhoff) • VTML (variable time maximum likelihood) matrices • Based on Dayhoff model • Trained on large set of diverse homologs • Standard BLOSUM matrices • Derived from the BLOCKS 5 database of protein sequence alignments • BLOCKS 13+ BLOSUM matrices • BLOSUM family reparameterized using BLOCKS 13+ database, contains many more sequences than standard BLOCKS 5
Selecting Matrices • Vary matrix number in each family • PAM: • 10 310 • VTML: • 50 350 • BLOSUM: • 30 100 • BLOCKS 13+ BLOSUM: • 40 100 • Vary gap open (520) & extension (13) penalties
Coverage v. Matrix No. & Gap Params Global Maxima
Optimal Parameters • Parameters are optimized for coverage on the training dataset under linear normalization at 0.01 EPQ
Coverage Results * Standard Deviation from bootstrapping test datasets
BLOSUM VTML PAM BLOCKS 13+ BLOSUM Composite CVE • Under Bayesian Bootstrap • Thick lines are original • Thin lines are bootstrap replicas
Bootstrap Distributions • Replicate distribution for each family at 0.01 EPQ • PAM is worse than others BLOSUM VTML PAM BLOCKS 13+ BLOSUM
Statistic of Interest • Interested in mean difference in coverage • Different than difference in mean coverage • Does not underestimate statistical significance • Results obtained from a single data replica are correlated across different parameters • Rather than generate independent replicas for each method
VTML and BLOSUM nearly equivalent BLOCKS 13+ and PAM differ significantly Mean Difference in Coverage • Most Extreme Cases Shown
Test for Statistical Significance • Compute Z-statistic (mean / standard deviation) • If |Z-statistic| > 1.96, reject the hypothesis that the methods posses equivalent performance at 95% confidence
Z-Scores For Distribution Pairs *Statistics are of the form (difference in mean coverage)/(mean difference in coverage) The latter is the statistic of interest
Summation • PAM is least effective • Trained on a small collection of relatively close homologs • VTML performs better than PAM but not BLOSUM or BLOCKS 13+ BLOSUM • Result of training on larger sets of remote homologs
Summation • Bayesian Bootstrap • Estimate statistics in database homology search • Lacks anomalies introduced by standard bootstrap • Why? • Bayesian does not underrepresent small superfamilies
Alternative to Bayesian Bootstrap • Resample entire superfamilies • Largest superfamilies contain majority of the intersequence relations • Leads to noisy replica ensembles • Assumes transitivity • Groups sequences into non-overlapping families • Many interesting datasets are not transitive • Multi-domain protein sequences • Not universally applicable
Why Accept Bayesian Bootstrap? • Researchers predict most undiscovered superfamilies contain few sequences • Bias inherent to standard bootstrap will be exacerbated as sequence databases grow • Bayesian bootstrap provides better resampling without increasing computational or conceptual complexity