240 likes | 361 Views
Using the Fisher kernel method to detect remote protein homologies. Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16). Abstract. Detecting remote protein homologies Fisher kernel method Variant of Support Vector Machines using new kernel function
E N D
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
Abstract • Detecting remote protein homologies • Fisher kernel method • Variant of Support Vector Machines using new kernel function • Derived from Hidden Markov Models
Introduction (1) • Detecting protein homologies (sequence-based algorithm) • BLAST, Fasta, PROBE, templates, profiles, position-specific weight matrices, HMM • Comparison by (Brenner 1996; Park et al. 1998) • SCOP classification of protein structures • Remote protein homologies existing between protein domain in the same structural superfamily. • Statistical models like PSI-BLAST and HMMs are better than simple pairwise comparison methods.
Introduction (2) • Generative statistical models (HMMs) • Extracting features from protein sequences • Mapping all protein sequences to points in a Euclidean feature space of fixed dimension. • General discriminative statistical method to classify the points. • Improvements acquired • Over HMMs alone.
Methods • How generative models work. (HMMs) • Training examples ( sequences known to be members of protein family ) : positive • Tuning parameters with a priori knowledge • Model assigns a probability to any given protein sequence. • The sequence from that family yield a higher probability than that of outside family. • Log-likelihood ratio as score
Discriminative approaches • Using both positive and negative examples • Parameter is tuned so that the model can optimally discriminate members of the family from nonmembers. • When training examples are few • Likelihood ratio is optimal if generative models perfectly fit to data but… • Discriminative methods often performs better.
Kernel methods • Discriminant function L(X) • Where { Xi, i = 1,…,n} and hypothesis class H1, H2 • + : the sequence of the family, - : outside of the family • Contribution of Kernel • i : overall importance of the example Xi. • Measure of pairwise similarity : K(Xi, X) • User supplies the type of kernel for the application area!!
The Fischer kernel (1) • Deriving kernel function from generative models • Advantage 1 : handle variable length protein sequences!! • Advantage 2 : encoding of prior knowledge about protein sequences • HMMs (difference) • Kernel function specifies a similarity score for any pair of sequences. • Likelihood score from an HMM only measures the closeness of the sequence to the model itself.
The Fischer kernel (2) • Sufficient statistics • Each parameter in HMM : Posterior frequencies • Of particular transition. • Of generating one of the residues of the query sequence. • Reflects the process of generating the query sequence from HMM. • Alterative of sufficient statistics : Fischer score • Magnitude of the components : how each contributes to generating the query sequence.
The Fischer kernel (3) • Kernel function used in this paper. note that its fixed vector. • Summary • Train HMM with positive examples. • Map each new protein sequence X into a fixed vector, Fisher score. • Calculate the kernel function • Get resulting discriminant function (SVM-Fisher)
The Fischer kernel (4) • Combination of scores • There might be more than one HMM model for the family or superfamily of interest. • Average score • Maximum score
Experimental Methods • Methods • SVM-Fisher (this paper) • BLAST (Altshul et al. 1990; Gish & States 1993) • HMMs using SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995l 1996) • Measurement of recognition rate for members of superfamilies of the SCOP protein structure classification (Hubbard et al. 1997) • Withholding all members of SCOP family • Train with the remaining members of SCOP superfamily • Test with withheld data • Question: “Could the method discover a new family of a known superfamily?”
Overview of experiments • Database • SCOP version 1.37 PDB90 : consisting of protein domains, no two of which have 90% of more residue identity • PDB90 eliminates redundant sequences. • Generative models • SAM-T98 HMMs • Data selection • Get 33 test families from 16 superfamilies. • Evaluation strategy • Assessing to what extent it gave better scores to the positive test examples thant it gave to the negative test examples.
SCOP: a Structural Classification of Proteins database • Hierachical levels • Family: clustered proteins by common evolutionary origin: residue identities of above 30%, lower sequence identities but very similar functions and structures • Superfamily: low sequence identities but probably common evolutionary origin • Fold: same major secondary structure in the same arrangement and with the same topological connections
Figure 1: Separation of the SCOP PDB90 database into training and test sequences, shown for the G proteins test family
Multiple models used • Modeling superfamily • SAM-T98 : starts with a single sequence (the guide sequence for the domain) and build a model • Too many sequences! • Using a subset of PDB90. • Train SVM-Fisher method using each of models in turn
Details on the training and test sets • All PDB90 sequence outside the fold of the test family were used as either negative training or negative test examples. • Reverse test/training allocation of negative examples, and repeat experiments. • Fold-by-fold basis split of negative examples. • For positive examples • PDB90 sequences in the superfamily of the test family are used. • Homologs found by each individual SAM-T98 model are used.
BLAST methods • WU-BLAST version 2.0a16 (Althcshul & Gish 1996) • PDB90 database was queried with each positive training examples, and E-values were recorded. • BLAST:SCOP-only • BLAST:SCOP+SAM-T98-homologs • Scores were combined by the maximum method
Generative HMM models • SAM-T98 method • Null model: reverse sequence model • Same data and same set of models as in the SVM-Fisher • Combined with maximum methods
Results • Metric : the rate of false positives (RFP) • RFP for a positive test sequence : the fraction of negative test sequences that score as good of better than positive sequence.
G-proteins • The result of the family of the nucleotide triphosphate hydrolases SCOP superfamily • Test the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds. • Table 1 • In SVM-Fisher • 5 of the 8 G proteins are better than all 2439 negative test sequences. • Maximum RFP • Median RFP • Figure 2 • RFP curve
Table 1. Rate of false positives for G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAMT-98-homologs, S-T98 = SAMT-98, and SVM-F = SVM-Fisher method
Figure 2: 4 methods on the 33 test families. Curve of median RFP
Discussion • New approach • to recognition of remote protein homologies make a discriminative method built on top of a generative model (HMMs) • Discriminative method on top of HMM methods • Significant improvement • Combining multiple score would be improved. • Allocation problem • Different training set for tuning HMM and different training set for discriminative model • Extend the method to identify multiple domains within large protein sequences