CISC 841 Bioinformatics
Combining HMMs with SVMs
Li Liao, CISC841, F07
HMM gradients
• Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$
• The gradient of a sequence X with respect to a given model is computed using the forward-backward algorithm.
• Each dimension corresponds to one parameter of the model.
• The feature space is tailored to the sequences from which the model was trained.
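
As a rough illustration of how forward-backward yields one gradient component per model parameter, the sketch below computes Fisher scores for the emission parameters of a toy discrete HMM in plain Python. The two-state model, its numbers, and the function names `forward_backward` and `emission_fisher_scores` are made up for illustration. The score uses a commonly cited form for emission parameters (expected count divided by the parameter, minus the state's total expected count); transition parameters and numerical scaling are left out for brevity.

```python
def forward_backward(obs, init, trans, emit):
    """Return (gamma, likelihood), where gamma[t][j] = P(state_t = j | obs)."""
    n, T = len(init), len(obs)
    # Forward pass: fwd[t][j] = P(obs[0..t], state_t = j)
    fwd = [[init[j] * emit[j][obs[0]] for j in range(n)]]
    for t in range(1, T):
        prev = fwd[-1]
        fwd.append([
            emit[j][obs[t]] * sum(prev[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ])
    # Backward pass: bwd[t][i] = P(obs[t+1..] | state_t = i)
    bwd = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            bwd[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * bwd[t + 1][j]
                for j in range(n)
            )
    likelihood = sum(fwd[-1])
    gamma = [[fwd[t][j] * bwd[t][j] / likelihood for j in range(n)]
             for t in range(T)]
    return gamma, likelihood


def emission_fisher_scores(obs, init, trans, emit):
    """Fisher scores w.r.t. emission parameters, one per (state, symbol)."""
    gamma, _ = forward_backward(obs, init, trans, emit)
    n, m = len(emit), len(emit[0])
    # expected[j][i]: posterior expected number of times state j emits symbol i
    expected = [[0.0] * m for _ in range(n)]
    for t, sym in enumerate(obs):
        for j in range(n):
            expected[j][sym] += gamma[t][j]
    scores = [
        [expected[j][i] / emit[j][i] - sum(expected[j]) for i in range(m)]
        for j in range(n)
    ]
    return scores, expected


# Toy two-state, two-symbol model (all numbers are made up for illustration).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0, 1]

scores, expected = emission_fisher_scores(obs, init, trans, emit)
```

Flattening `scores` row by row gives a fixed-length vector for the sequence, whatever its length, which is what makes it usable as SVM input.
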
SVM-Fisher discrimination
• A probabilistic hidden Markov model is trained from some example sequences $x_1, x_2, x_3, \ldots, x_N$.
• Usually the probability model $P(x_i \mid \theta)$ (or a function of it) is used as a measure of sequence-model membership, and a threshold on this measure decides membership.
• The Fisher vector is the vector of gradients of $P(x_i \mid \theta)$ (or of a function of it, such as its logarithm) with respect to the parameters of the model: $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$
• One can take the training example sequences (positive set) and other sequences known to be non-members (negative set) and transform them into Fisher vectors.
• A Support Vector Machine (SVM) can be trained on the positive and negative Fisher vectors and then used to classify other sequences.
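
The final training step can be sketched without any external package. The example below fits a linear SVM by plain batch subgradient descent on the regularised hinge loss, which is a stand-in chosen for brevity and not the solver used in the lecture. The two-dimensional "Fisher vectors", the hyperparameters, and the names `train_linear_svm` and `predict` are all hypothetical.

```python
def train_linear_svm(vectors, labels, lam=0.01, step=0.1, epochs=200):
    """Minimise lam/2 * |w|^2 + mean hinge loss by batch subgradient descent."""
    dim, n = len(vectors[0]), len(vectors)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1.0:  # point violates the margin: hinge term is active
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wk - step * gk for wk, gk in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (member) or -1 (non-member) for one vector."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Toy stand-ins for Fisher vectors (made-up numbers, two model parameters).
positive = [[2.0, 1.0], [1.5, 2.0], [2.5, 1.5]]
negative = [[-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.0]]
vectors = positive + negative
labels = [1] * len(positive) + [-1] * len(negative)

w, b = train_linear_svm(vectors, labels)
predictions = [predict(w, b, x) for x in vectors]
```

In a real experiment the rows of `vectors` would be Fisher vectors computed from the trained HMM, and a dedicated SVM package with a kernel would normally replace this hand-written linear trainer.
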
Application: Protein remote homology detection
SVM-Pairwise method
[Figure: SVM-Pairwise workflow, steps 1 to 3. Positive training set (protein homologs) and negative training set (protein non-homologs); positive and negative pairwise score vectors; support vector machine; testing data (target protein of unknown function); binary classification.]
Experiment: known protein families
Jaakkola, Diekhans and Haussler 1999
A measure of sensitivity and specificity
[Figure: three examples, with ROC = 1, ROC = 0.67 and ROC = 0.]
ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.
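
A minimal sketch of the ROC score follows. It relies on the standard identity that the normalized area under the curve equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_score` and the toy scores are hypothetical.

```python
def roc_score(scores, labels):
    """Normalized area under the ROC curve, computed by pairwise comparison."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy rankings (made-up scores): label 1 = true member, 0 = non-member.
perfect = roc_score([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = roc_score([0.9, 0.8, 0.7, 0.1], [0, 1, 0, 0])
worst = roc_score([0.9, 0.8, 0.3, 0.1], [0, 0, 1, 1])
```

The three toy rankings give scores of 1, about 0.67, and 0: every positive above every negative, one positive beating two of three negatives, and every positive below every negative.
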
Application: Discriminating signal peptide from transmembrane proteins
Feature selection
[Figure labels: SignalP, TM protein]
• We expect gradients w.r.t. transition parameters to be better discrimination features.
• We look for those transitions that are used differentially by TM proteins and SP proteins:
- transform each signal peptide (SP) sequence (1275 in total) into a Fisher vector w.r.t. the transition parameters and find the resultant vector
- transform each transmembrane (TM) sequence into a Fisher vector w.r.t. the transition parameters and find the resultant vector
- compare the two resultant vectors
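
The comparison of resultant vectors can be sketched as follows. The slide does not say how the two vectors are compared, so this example makes an assumption: each resultant is divided by its set size, and transitions are ranked by the absolute difference between the two class averages. The names `resultant` and `rank_transitions` and the three-dimensional toy vectors are hypothetical.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def rank_transitions(tm_vectors, sp_vectors):
    """Rank transition parameters by how differently the two classes use them."""
    # Assumption: divide by set size so the larger set does not dominate.
    tm_mean = [v / len(tm_vectors) for v in resultant(tm_vectors)]
    sp_mean = [v / len(sp_vectors) for v in resultant(sp_vectors)]
    diffs = [t - s for t, s in zip(tm_mean, sp_mean)]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]), reverse=True)
    return order, diffs


# Toy Fisher vectors over three transition parameters (made-up numbers).
tm_vectors = [[1.0, 0.2, -0.5], [0.8, 0.1, -0.3]]
sp_vectors = [[-0.9, 0.2, -0.4], [-1.1, 0.0, -0.6], [-1.0, 0.1, -0.5]]

order, diffs = rank_transitions(tm_vectors, sp_vectors)
```

In this toy data the first transition separates the two classes most strongly, so it comes first in `order`.
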
Gradients of P(s|x)
In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$.
$U_{s|x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$
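
The decomposition follows from the definition of conditional probability. As a worked illustration for a single transition parameter, the block below introduces symbols that do not appear on the slide: $a_{kl}$ is the probability of the transition from state $k$ to state $l$, $n_{kl}(s)$ is the number of times the path $s$ uses that transition, and $A_{kl} = \sum_{s'} P(s' \mid x, \theta)\, n_{kl}(s')$ is its expected count over all paths. The partial derivatives are taken without enforcing the normalization constraint on the $a_{kl}$.

```latex
\begin{align*}
P(s \mid x, \theta) &= \frac{P(s, x \mid \theta)}{P(x \mid \theta)} \\
U_{s|x} &= \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta) \\
\frac{\partial \log P(s, x \mid \theta)}{\partial a_{kl}} &= \frac{n_{kl}(s)}{a_{kl}},
\qquad
\frac{\partial \log P(x \mid \theta)}{\partial a_{kl}} = \frac{A_{kl}}{a_{kl}} \\
\left(U_{s|x}\right)_{kl} &= \frac{n_{kl}(s) - A_{kl}}{a_{kl}}
\end{align*}
```

Under these assumptions, each component measures how much more or less often the labelled path uses a transition than the model expects on average for that sequence.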
Classification experiment
[Figure: TMMOD converts each sequence x into a vector $U_{s|x}$; subsets of the 247 TM proteins and of the 1275 SP proteins go to SVM Learn, and the resulting SVM classifier labels sequences of unknown class (?).]
• 10-fold cross validation experiment using
- positive set (247 TM proteins)
- negative set (1275 signal peptide containing proteins)
• The SVM-light package is used.
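
The 10-fold split for these set sizes can be sketched as below. The slide does not say whether the folds were stratified, so keeping the positive-to-negative ratio roughly equal in every fold is an assumption here, as are the function name `stratified_folds` and the fixed random seed. Training and scoring with an SVM package are not shown.

```python
import random


def stratified_folds(n_pos, n_neg, k=10, seed=0):
    """Split example indices into k folds with balanced class proportions."""
    rng = random.Random(seed)
    pos = list(range(n_pos))                  # indices of positive examples
    neg = list(range(n_pos, n_pos + n_neg))   # indices of negative examples
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    # Deal each class round-robin so fold sizes differ by at most one per class.
    for rank, idx in enumerate(pos):
        folds[rank % k].append(idx)
    for rank, idx in enumerate(neg):
        folds[rank % k].append(idx)
    return folds


n_pos, n_neg = 247, 1275  # TM proteins, signal peptide containing proteins
folds = stratified_folds(n_pos, n_neg)

# Each fold serves once as the test set; the other nine form the training set.
splits = []
for i, test in enumerate(folds):
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    splits.append((train, test))
```
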
Discrimination results
• Results
• A third (68) more of the SP proteins that had been incorrectly classified as TM proteins are identified correctly.
Application: Protein-Protein Interaction Prediction
Interaction Profile Hidden Markov Model (ipHMM)
Friedrich et al. (2006)
Knowledge transfer:
• Build an ipHMM from proteins whose structural information is available.
• Align the sequences of proteins whose structural information is not available to the model.
Likelihood Score Vector
$\langle LS_{a_i,A},\ LS_{a_i,B},\ LS_{b_j,A},\ LS_{b_j,B} \rangle$
Fisher Score Vector
$U(x) = \nabla_\theta \log P(x \mid \theta)$
$U_{ij} = \dfrac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$
Data set
Friedrich et al. (2006): 2018 proteins in 36 domain families
Conclusions
• Structural information at binding sites enhances protein-protein interaction prediction.
• An interaction profile HMM can transfer structural information.
• Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.