Testing sequence comparison methods with structure similarity @ Organon, Oss 2006-02-07 Tim Hulsen
Introduction • Main goal: transfer function of proteins in model organisms to proteins in humans • Make use of “orthology”: proteins evolved from common ancestor in different species (very similar function!) • Several ortholog identification methods, relying on: • Sequence comparisons • (Phylogenies)
Introduction • Quality of ortholog identification depends on: 1.) Quality of sequence comparison algorithm: - Smith-Waterman vs. BLAST, FASTA, etc. - Z-value vs. E-value 2.) Quality of ortholog identification itself (phylogenies, clustering, etc.) 2 -> previous research 1 -> this presentation
Previous research • Comparison of several ortholog identification methods • Orthologs should have similar function • Functional data of orthologs should behave similar: • Gene expression data • Protein interaction data • Interpro IDs • Gene order
Orthology method comparison • Compared methods: • BBH, Best Bidirectional Hit • INP, InParanoid • KOG, euKaryotic Orthologous Groups • MCL, OrthoMCL • PGT, PhyloGenetic Tree • Z1H, Z-value > 1 Hundred
Orthology method comparison • e.g. correlation in expression profiles • Affymetrix human and mouse expression data, using SNOMED tissue classification • Check whether the expression profile of a protein is similar to the expression profile of its ortholog (Hs vs. Mm)
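The expression-profile check on this slide boils down to correlating two tissue-expression vectors. A minimal sketch, assuming profiles are plain lists of expression values over matched tissues (the function name and data are illustrative, not from the original analysis):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical profiles of a human protein and its mouse ortholog
# over the same ordered set of tissues:
hs_profile = [120.0, 35.0, 310.0, 8.0, 95.0]
mm_profile = [110.0, 40.0, 290.0, 12.0, 88.0]
r = pearson(hs_profile, mm_profile)
```

A high correlation between the Hs and Mm profiles supports the functional equivalence of the ortholog pair.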
Orthology method comparison • e.g. conservation of protein interaction • DIP (Database of Interacting Proteins) • Check if the orthologs of two interacting proteins are still interacting in the other species -> calculate fraction (Hs-Hs pair vs. Mm-Mm pair)
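The "calculate fraction" step can be sketched as follows, assuming interactions are stored as sets of unordered protein pairs and orthology as a simple one-to-one mapping (all names and data structures here are illustrative assumptions, not the original pipeline):

```python
def conserved_fraction(interactions_a, interactions_b, ortholog):
    """Fraction of species-A interactions whose ortholog pair also
    interacts in species B. `ortholog` maps A-proteins to B-proteins;
    interaction sets contain frozenset pairs."""
    # Only pairs where both partners have a known ortholog can be tested.
    mappable = [p for p in interactions_a if all(x in ortholog for x in p)]
    if not mappable:
        return 0.0
    conserved = sum(
        1 for p in mappable
        if frozenset(ortholog[x] for x in p) in interactions_b)
    return conserved / len(mappable)

# Toy example: two human interactions, one conserved in mouse.
hs = {frozenset({"a1", "a2"}), frozenset({"a1", "a3"})}
mm = {frozenset({"b1", "b2"})}
orth = {"a1": "b1", "a2": "b2", "a3": "b3"}
frac = conserved_fraction(hs, mm, orth)  # 0.5
```

A higher conserved fraction for an orthology method suggests its predicted pairs are more often functionally equivalent.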
Orthology method comparison • Trade-off between sensitivity and selectivity • BBH and INP are most sensitive but also most selective • Results can differ depending on what sequence comparison algorithm is used: - BLAST, FASTA, Smith-Waterman? - E-value or Z-value?
E-value or Z-value? • Smith-Waterman with Z-value statistics: 100 randomized shuffles to test the significance of the SW score • e.g. original sequence MFTGQEYHSV; shuffles: 1. GQHMSVFTEY, 2. YMSHQFTVGE, etc. • The Z-value is the distance, in standard deviations, between the original SW score and the mean of the shuffled scores (e.g. original score 5 SD above the random distribution -> Z = 5)
E-value or Z-value? • Z-value calculation takes much time (2 x 100 randomizations per pair) • Comet et al. (1999) and Bastien et al. (2004) argue that the Z-value is theoretically more sensitive and more selective than the E-value • This claimed advantage of the Z-value has never been confirmed by experimental results
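The shuffling procedure on the previous slide can be sketched in a few lines. This is a simplified illustration: it uses a toy match/mismatch scoring scheme instead of BLOSUM62, and shuffles only one of the two sequences (the slide's "2 x 100 randomizations" shuffles both); all function names are assumptions:

```python
import random

def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Minimal Smith-Waterman local-alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def z_value(a, b, n_shuffles=100, seed=0):
    """Z = (SW(a, b) - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    ori = sw_score(a, b)
    scores = []
    for _ in range(n_shuffles):
        sa = list(a)
        rng.shuffle(sa)  # randomize order, keep amino-acid composition
        scores.append(sw_score("".join(sa), b))
    mean = sum(scores) / n_shuffles
    sd = (sum((s - mean) ** 2 for s in scores) / n_shuffles) ** 0.5
    return (ori - mean) / sd if sd > 0 else float("inf")
```

The 100 extra alignments per pair are exactly why the slide notes that Z-value calculation "takes much time" compared with the analytically derived E-value.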
How to compare? • Structural comparison is better than sequence comparison • ASTRAL SCOP: Structural Classification Of Proteins • e.g. a.2.1.3, c.1.2.4; same number ~ same structure • Use structural classification as benchmark for sequence comparison methods
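Using SCOP as a benchmark means deciding whether two domains share a classification level from their sccs codes (e.g. a.2.1.3): the four dot-separated fields are class, fold, superfamily, and family. A small helper, with illustrative names, might look like:

```python
def same_scop_level(sccs1, sccs2, level):
    """Compare two SCOP sccs codes (e.g. 'a.2.1.3') down to a given
    depth: 1 = class, 2 = fold, 3 = superfamily, 4 = family."""
    return sccs1.split(".")[:level] == sccs2.split(".")[:level]

# Same superfamily (first three fields agree) but different family:
same_scop_level("a.2.1.3", "a.2.1.4", 3)  # True
same_scop_level("a.2.1.3", "a.2.1.4", 4)  # False
```

In the benchmark below, a hit counts as a true positive when query and hit fall in the same SCOP family, i.e. all four fields match.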
Methods (1) • Smith-Waterman algorithms: dynamic programming; computationally intensive • Biofacet with z-value (BF Z): SW implementation of Gene-IT • Paracel with e-value: SW implementation of Paracel • ParAlign with e-value (PA E): SW implementation of Sencel • SSEARCH with e-value (SS E): SW implementation of the FASTA package (see next page)
Methods (2) • Heuristic algorithms: • FASTA (FA E) • Pearson & Lipman, 1988 • Heuristic approximation; performs better than BLAST with strongly diverged proteins • BLAST (BL E): • Altschul et al., 1990 • Heuristic approximation; extends local alignment seeds (HSPs) into larger alignments • Should be faster than FASTA
Method parameters • all: • matrix: BLOSUM62 • gap open penalty: 12 • gap extension penalty: 1 • Biofacet with z-value: 100 randomizations
Receiver Operating Characteristic • R.O.C.: statistical measure, mostly used in clinical medicine • Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
ROC50 Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each of the first 50 false positives, count the number of true positives ranked higher in the list (0, 4, 4, 4, 5, 5, 6, 9, 12, 12, 12, 12, 12, ...) • Divide the sum of these counts by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC50 (0.167) • Take the average of the ROC50 scores over all entries
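The steps above can be sketched directly from the slide's recipe. This assumes the hit list is already sorted by score and represented as booleans (True = same SCOP family); per the slide, the divisor stays fixed at 50 false positives. Function name and example data are illustrative:

```python
def roc50(hits, n_family):
    """ROC50 for one query, following the slide's recipe:
    `hits` is the ranked list of booleans (True = same SCOP family),
    `n_family` = possible true positives (family size minus one)."""
    tp, fp, counts = 0, 0, []
    for is_tp in hits:
        if is_tp:
            tp += 1
        else:
            fp += 1
            counts.append(tp)  # true positives ranked above this FP
            if fp == 50:       # stop at the 50th false positive
                break
    return sum(counts) / (50 * n_family)

# Toy ranked list: TP, FP, TP, FP -> counts [1, 2], family size - 1 = 2
score = roc50([True, False, True, False], 2)  # 3 / (50 * 2) = 0.03
```

The per-query scores are then averaged over all database entries, as the last bullet says.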
Coverage vs. Error • C.V.E. = Coverage vs. Error (Brenner et al., 1998) • E.P.Q. = errors per query, a selectivity indicator (how many false positives?) • Coverage = sensitivity indicator (what fraction of all possible true positives is found?)
CVE Example • Vary the threshold above which a hit is seen as a positive: e.g. e=10, e=1, e=0.1, e=0.01 • True positives: in the same SCOP family; false positives: not in the same family • For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives • For each threshold, calculate the errors per query: the number of false positives divided by the number of queries • Plot coverage on the x-axis and errors per query on the y-axis; bottom right is best
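A minimal sketch of the CVE computation, assuming the pooled hits of all queries are available as (e-value, is-true-positive) pairs; names and data are illustrative assumptions:

```python
def coverage_vs_error(hits, n_possible_tp, n_queries, thresholds):
    """For each e-value threshold, return (threshold, coverage, EPQ).
    `hits`: pooled list of (e_value, is_true_positive) over all queries."""
    curve = []
    for t in thresholds:
        tp = sum(1 for e, ok in hits if e <= t and ok)
        fp = sum(1 for e, ok in hits if e <= t and not ok)
        curve.append((t, tp / n_possible_tp, fp / n_queries))
    return curve

# Toy pooled hit list for 2 queries with 3 possible true positives:
hits = [(0.001, True), (0.05, True), (0.5, False), (5, True), (20, False)]
curve = coverage_vs_error(hits, n_possible_tp=3, n_queries=2,
                          thresholds=[0.01, 0.1, 1, 10])
```

Plotting coverage (x) against errors per query (y) for each threshold traces the CVE curve; methods whose curves lie toward the bottom right dominate.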
CVE results (only PDB095)
Mean Average Precision • A.P.: borrowed from information retrieval (Salton, 1991) • Recall: true positives divided by the number of homologs • Precision: true positives divided by the number of hits • A.P. = approximate integral calculating the area under the recall-precision curve
Mean AP Example • Take the 100 best hits • True positives: in the same SCOP family; false positives: not in the same family • For each true positive, divide its rank among the true positives (1, 2, 3, ..., 12) by its overall rank in the hit list (2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20) • Divide the sum of all of these ratios by the total number of hits (100) = AP (0.140) • Take the average of the AP scores over all entries = mean AP
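The slide's AP recipe can be sketched as below. Note this follows the slide's definition, which divides by the total number of hits; classical IR average precision instead divides by the number of relevant items. Names and example data are illustrative:

```python
def average_precision(hits, n_hits=None):
    """Slide's AP: for each true positive, divide its rank among the
    true positives by its overall rank; sum, then divide by the total
    number of hits. `hits` is a ranked list of booleans."""
    n_hits = n_hits if n_hits is not None else len(hits)
    tp, total = 0, 0.0
    for pos, is_tp in enumerate(hits, 1):
        if is_tp:
            tp += 1
            total += tp / pos  # precision at this true positive
    return total / n_hits

# Toy ranked list: TPs at overall ranks 1 and 3 out of 4 hits
ap = average_precision([True, False, True, False])  # (1/1 + 2/3) / 4
```

The mean AP is then the average of these per-query AP scores over all entries.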
Time consumption • PDB095 all-against-all comparison: • Biofacet: multiple days (z value calc.!) • BLAST: 2d,4h,16m • SSEARCH: 5h49m • ParAlign: 47m • FASTA: 40m
Preliminary conclusions • SSEARCH gives best results • When time is important, FASTA is a good alternative • Z-value seems to have no advantage over E-value
Problems • Bias in PDB? • Sequence length • Amino acid composition • Difference in matrices? • Difference in SW implementations?
Bias in PDB sequence length? Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets
Bias in PDB aa distribution? No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets
Conclusions • E-value better than Z-value! • SW implementations perform (more or less) the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all • Larger structural comparison database needed for better analysis
Credits • NV Organon: • Peter Groenen • Wilco Fleuren • Wageningen UR: • Jack Leunissen