This paper presents a method for inferring the strengths of protein-protein interactions using linear programming, based on experimental data. The proposed method combines probabilistic modeling and linear programming to accurately predict protein-protein interaction strengths. Experimental data, including both binary and numerical data, is used to validate the effectiveness of the proposed methods.
Inferring strengths of protein-protein interactions from experimental data using linear programming Morihiro Hayashida, Nobuhisa Ueda, Tatsuya Akutsu Bioinformatics Center, Kyoto University
Overview • Background • Probabilistic model • Related work • Biological experimental data • Proposed methods • For binary data • For numerical data • Results of computational experiments • Conclusion
Background (1/3) • Understanding protein-protein interactions helps in understanding protein functions. • Transcription factors • Proteins interact with a factor. • Regulate the gene. • Receptors, etc.
Background (2/3) • Various methods have been developed for inferring protein-protein interactions • Gene fusion/Rosetta stone (Enright et al., 1999; Marcotte et al., 1999) • Applicable only to a limited number of genes. • Molecular dynamics • Long CPU time • Difficult to predict precisely
Background (3/3) • A model based on domain-domain interactions has been proposed. • Use domains defined by databases like InterPro or Pfam. • [Figure: two interacting domains]
Probabilistic model of interaction (1/2) • Model (Deng et al., 2002) • Two proteins interact if and only if at least one pair of their domains interacts. • Interactions between domains are independent events. • [Figure: proteins P1 and P2 and their domains D1, D2, D3, D4]
Probabilistic model of interaction (2/2) • Pij = 1: Proteins Pi and Pj interact • Dmn = 1: Domains Dm and Dn interact • (Dm, Dn) ∈ Pi × Pj: Domain pair (Dm, Dn) is included in protein pair Pi × Pj
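Under the two assumptions of the model (a protein pair interacts exactly when at least one of its domain pairs interacts, and domain interactions are independent), the probability that Pi and Pj interact is 1 minus the product of (1 - Pr(Dmn=1)) over all domain pairs in Pi × Pj. A minimal sketch of that computation follows; the function name, the dictionary layout, and the treatment of domain pairs as unordered are illustrative assumptions, not taken from the slides.

```python
from math import prod


def protein_interaction_prob(domains_i, domains_j, domain_pair_prob):
    """Pr(Pij=1) = 1 - product over (Dm, Dn) in Pi x Pj of (1 - Pr(Dmn=1)).

    domain_pair_prob maps an unordered domain pair (a frozenset) to
    Pr(Dmn=1); pairs missing from the mapping are treated as probability 0
    (an assumption made for this sketch).
    """
    return 1.0 - prod(
        1.0 - domain_pair_prob.get(frozenset((dm, dn)), 0.0)
        for dm in domains_i
        for dn in domains_j
    )


# Hypothetical example: P1 has domains D1, D2 and P2 has domains D2, D3.
example_probs = {
    frozenset(("D1", "D2")): 0.5,
    frozenset(("D2", "D3")): 0.2,
}
example_result = protein_interaction_prob(["D1", "D2"], ["D2", "D3"], example_probs)
```

In the example, only two of the four domain pairs have non-zero probability, so the result is 1 - (0.5 × 0.8) = 0.6.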
Overview • Background • Probabilistic model • Related work • Association method (Sprinzak et al., 2001) • EM method (Deng et al., 2002) • Biological experimental data • Proposed methods • Results of computational experiments • Conclusion
Related work • INPUT: • interacting protein pairs (positive examples) • non-interacting protein pairs (negative examples) • OUTPUT: Pr(Dmn=1) for all domain pairs
Association method (Sprinzak et al., 2001) • Inference of probabilities of domain-domain interactions using ratios of frequencies: Pr(Dmn=1) = Imn / Nmn • Imn: Number of interacting protein pairs that include (Dm, Dn) • Nmn: Number of protein pairs that include (Dm, Dn)
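The frequency ratio can be sketched as follows: for every domain pair, count the protein pairs that contain it and the interacting protein pairs that contain it, then divide. The names `association_scores`, `protein_domains`, `pairs`, and `interacting` are hypothetical, and domain pairs are again treated as unordered.

```python
from collections import Counter
from itertools import product


def association_scores(protein_domains, pairs, interacting):
    """Estimate Pr(Dmn=1) as (# interacting pairs with Dmn) / (# pairs with Dmn)."""
    n_total = Counter()        # protein pairs that include the domain pair
    n_interacting = Counter()  # interacting protein pairs that include it
    for pi, pj in pairs:
        # Each domain pair is counted once per protein pair.
        domain_pairs = {
            frozenset((dm, dn))
            for dm, dn in product(protein_domains[pi], protein_domains[pj])
        }
        for dp in domain_pairs:
            n_total[dp] += 1
            if (pi, pj) in interacting:
                n_interacting[dp] += 1
    return {dp: n_interacting[dp] / n_total[dp] for dp in n_total}


# Hypothetical toy data.
toy_domains = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D2", "D3"}}
toy_pairs = [("P1", "P2"), ("P1", "P3")]
toy_interacting = {("P1", "P2")}
toy_scores = association_scores(toy_domains, toy_pairs, toy_interacting)
```

Here (D1, D2) appears in two protein pairs, one of which interacts, so its score is 0.5; (D1, D3) appears only in a non-interacting pair, so its score is 0.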
EM method (Deng et al., 2002) • Probability (likelihood L) that experimental data {Oij={0,1}} are observed. • Use the EM algorithm to (locally) maximize L. • Estimate Pr(Dmn=1)
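For reference, the likelihood can be written out as below. This follows the usual presentation of Deng et al. (2002), in which the observation model includes a false positive rate fp and a false negative rate fn; those two symbols do not appear on the slide and are stated here as an assumption about the underlying model.

```latex
\begin{align*}
\Pr(O_{ij}=1) &= \Pr(P_{ij}=1)\,(1 - fn) + \bigl(1 - \Pr(P_{ij}=1)\bigr)\,fp \\
L &= \prod_{(i,j)} \Pr(O_{ij}=1)^{O_{ij}} \,\bigl(1 - \Pr(O_{ij}=1)\bigr)^{1 - O_{ij}}
\end{align*}
```

Because Pr(Pij=1) is itself a product over domain pairs, L is not concave in the Pr(Dmn=1) values, which is why EM reaches only a local maximum.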
Biological experimental data • Related methods (Association and EM) use only binary data (interact or not). • Experimental data from yeast two-hybrid experiments • Ito et al. (2000, 2001) • Uetz et al. (2001) • For many protein pairs, different results (Oij= {0,1}) were observed. • We developed new methods that use the raw numerical data.
Numerical data • Ito et al. (2000,2001) • For each protein pair, experiments were performed multiple times. • IST (Interaction Sequence Tag) • Number of observed interactions • By using a threshold, we obtain binary data.
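The thresholding step can be sketched as below. The function name, the dictionary layout, and the choice of "greater than or equal to" as the comparison are assumptions; the slide does not state the threshold value or the exact rule.

```python
def binarize_ist(ist_hits, threshold):
    """Turn IST hit counts into binary labels Oij (1 = interact, 0 = not).

    ist_hits maps a protein pair to its number of observed interactions.
    A pair is labeled 1 when its count reaches the threshold (assumed rule).
    """
    return {pair: int(hits >= threshold) for pair, hits in ist_hits.items()}


# Hypothetical counts for three protein pairs.
toy_hits = {("P1", "P2"): 5, ("P1", "P3"): 1, ("P2", "P3"): 3}
toy_labels = binarize_ist(toy_hits, threshold=3)
```

Thresholding discards the difference between a pair observed 3 times and one observed 5 times, which is the information the numerical methods try to keep.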
Proposed methods • For binary data • Linear programming: LPBN • Combined methods: LPEM, EMLP • SVM-based method • For numerical data • ASNM • Linear programming: LPNM • It seems difficult to modify the EM method for numerical data.
LPBN (LP-based method) (1/2) • Transformation into linear inequalities • Pi and Pj interact
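The inequalities themselves did not survive on this slide. One way to obtain linear inequalities from the probabilistic model is to take logarithms, as worked out below. The threshold symbol Θ (the probability above which a pair is predicted to interact) is an assumed name, and the derivation requires 0 ≤ Θ < 1 and Pr(Dmn=1) < 1.

```latex
\begin{align*}
\Pr(P_{ij}=1) \ge \Theta
  &\iff 1 - \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1 - \Pr(D_{mn}=1)\bigr) \ge \Theta \\
  &\iff \prod_{(D_m,D_n)\in P_i\times P_j} \bigl(1 - \Pr(D_{mn}=1)\bigr) \le 1 - \Theta \\
  &\iff \sum_{(D_m,D_n)\in P_i\times P_j} \ln\bigl(1 - \Pr(D_{mn}=1)\bigr) \le \ln(1 - \Theta)
\end{align*}
```

With the substitution x_mn = ln(1 - Pr(Dmn=1)) ≤ 0, the last line is linear in the unknowns x_mn. For a non-interacting pair the inequality is reversed.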
LPBN (LP-based method) (2/2) • Linear programming for inference of protein-protein interactions
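The linear program on this slide was lost in extraction. As an illustration only, a margin-maximizing program of the following shape is consistent with the log-transformed model; it is a sketch under stated assumptions and not necessarily the authors' exact formulation. The variables x_mn = ln(1 - Pr(Dmn=1)), the margin α, and the threshold Θ are assumed names.

```latex
\begin{align*}
\text{maximize}\quad & \alpha \\
\text{subject to}\quad
 & \sum_{(D_m,D_n)\in P_i\times P_j} x_{mn} \le \ln(1-\Theta) - \alpha
   && \text{for interacting pairs } (P_i,P_j) \\
 & \sum_{(D_m,D_n)\in P_i\times P_j} x_{mn} \ge \ln(1-\Theta) + \alpha
   && \text{for non-interacting pairs } (P_i,P_j) \\
 & x_{mn} \le 0 && \text{for all domain pairs}
\end{align*}
```

The domain-pair probabilities are then recovered as Pr(Dmn=1) = 1 - exp(x_mn). A negative optimal α indicates that the training pairs cannot all be classified correctly by the model.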
Combination of EM and LPBN • LPEM method • Use the results of LPBN as initial parameter values for EM. • EMLP method • Add inequalities to LPBN as constraints so that LP solutions stay close to EM solutions.
Simple SVM-based method • Feature vector • Simple linear kernel with • Interacting pairs = Positive examples • Non-interacting pairs = Negative examples
Strength of protein-protein interaction • For each protein pair, experiments were performed multiple times. • The ratio Kij / Mij can be considered as the strength of the interaction. • Kij : Number of observed interactions for a protein pair (Pi,Pj) • Mij : Number of experiments for (Pi,Pj)
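As a small worked example, the strength of a pair is the number of observed interactions divided by the number of experiments. The function name and the input checks are illustrative assumptions.

```python
def interaction_strength(k_ij, m_ij):
    """Return Kij / Mij, the fraction of experiments in which (Pi, Pj) interacted."""
    if m_ij <= 0:
        raise ValueError("the number of experiments Mij must be positive")
    if not 0 <= k_ij <= m_ij:
        raise ValueError("Kij must lie between 0 and Mij")
    return k_ij / m_ij


# Hypothetical pair observed interacting in 3 of 4 experiments.
toy_strength = interaction_strength(3, 4)
```

A pair that interacted in 3 of 4 experiments gets strength 0.75, whereas binary labeling would record only "interacts".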
LPNM method (1/2) • Minimize the gap between Pr(Pij=1) and Kij / Mij using LP.
LPNM method (2/2) • Linear programming for inference of strengths of protein-protein interactions
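The program itself is missing from this slide. One way to express "minimize the gap" as a linear program is to measure the gap in log space with one slack variable per protein pair, as sketched below. This is an assumed formulation for illustration: ρ_ij = K_ij / M_ij is the observed strength, x_mn = ln(1 - Pr(Dmn=1)), and ε_ij are slack variables; none of these symbols come from the slides. It also assumes ρ_ij < 1, so pairs with ρ_ij = 1 would need to be clipped slightly below 1.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{(i,j)} \varepsilon_{ij} \\
\text{subject to}\quad
 & -\varepsilon_{ij} \;\le\; \sum_{(D_m,D_n)\in P_i\times P_j} x_{mn} - \ln(1-\rho_{ij}) \;\le\; \varepsilon_{ij}
   && \text{for all training pairs } (P_i,P_j) \\
 & x_{mn} \le 0, \qquad \varepsilon_{ij} \ge 0
\end{align*}
```

The predicted strength of a pair is then 1 - exp of the sum of x_mn over its domain pairs, which is a continuous value rather than a 0/1 label.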
ASNM • Modified Association method for numerical data • For binary data (Sprinzak et al., 2001)
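The formulas on this slide were lost. A natural numerical analogue of the Association method replaces the count of interacting pairs with the sum of observed strengths Kij / Mij, so that each domain pair receives the average strength of the protein pairs that contain it. The sketch below implements that reading; it is an assumption about what ASNM computes, and all names in it are hypothetical.

```python
from collections import defaultdict
from itertools import product


def asnm_scores(protein_domains, strengths):
    """Score each domain pair by the mean strength of protein pairs containing it.

    strengths maps a protein pair (Pi, Pj) to its observed ratio Kij / Mij.
    """
    total = defaultdict(float)  # sum of strengths over pairs containing Dmn
    count = defaultdict(int)    # number of protein pairs containing Dmn
    for (pi, pj), rho in strengths.items():
        # Each domain pair is counted once per protein pair.
        domain_pairs = {
            frozenset((dm, dn))
            for dm, dn in product(protein_domains[pi], protein_domains[pj])
        }
        for dp in domain_pairs:
            total[dp] += rho
            count[dp] += 1
    return {dp: total[dp] / count[dp] for dp in count}


# Hypothetical toy data.
asnm_domains = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D2", "D3"}}
asnm_strengths = {("P1", "P2"): 0.5, ("P1", "P3"): 0.25}
asnm_result = asnm_scores(asnm_domains, asnm_strengths)
```

When every strength is exactly 0 or 1, this reduces to the binary Association ratio.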
Computational experiments for binary data • DIP database (Xenarios et al., 2002) • 1767 protein pairs as positive examples • 2/3 of the pairs for training, 1/3 for testing • Computational environment • Xeon processor 2.8 GHz • LP solver: loqo
Results on training data (binary data) • [Figure: results for EM, Association, LPBN, and SVM]
Results on test data (binary data) • [Figure: results for EM, EMLP, LPEM, SVM, and Association]
Computational experiments for numerical data • YIP database (Ito et al., 2001, 2002) • IST (Interaction Sequence Tag) • 1586 protein pairs • 4/5 for training, 1/5 for testing • Computational environment • Xeon processor 2.8 GHz • LP solver: lp_solve
Results on test data (numerical data) • [Figure: results for ASNM, LPNM, EM, and Association]
Results on test data (numerical data) • LPNM is the best. • EM and Association methods classify Pr(Pij=1) into either 0 or 1.
Conclusion • We have defined a new problem to infer strengths of protein-protein interactions. • We have proposed LP-based methods. • For binary data • LPBN, LPEM, EMLP • SVM-based method • For numerical data • ASNM • LPNM • LPNM outperformed the other methods.
Future work • Improve the methods to avoid overfitting. • Improve the probabilistic model to understand protein-protein interactions more accurately.