Protein Quaternary Fold Recognition Using Conditional Graphical Models. Yan Liu, Jaime Carbonell, Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT). Language Technologies Institute, School of Computer Science, Carnegie Mellon University. IJCAI-2007 – Hyderabad, India.
Snapshot of Cell Biology [Figure: protein sequence (DSCTFTTAAAAKAGKAKAG), protein structure, protein function; image source: Nobelprize.org]
Example Protein Structures • Triple beta-spiral fold in the adenovirus fiber shaft [Figure panels: adenovirus fiber shaft; virus capsid]
Predicting Protein Structures • Protein structure is a key determinant of protein function • Resolving protein structures experimentally (in vitro) by crystallography is very expensive, and NMR can only resolve very small proteins • The gap between known protein sequences and resolved structures: • 3,023,461 sequences vs. 36,247 resolved structures (1.2%) • Therefore we need to predict structures in silico
Quaternary Folds and Alignments • Protein fold • Identifiable regular arrangement of secondary structural elements • Thus far, a limited number of protein folds have been discovered (~1000) • Very little research on quaternary folds • Complex structures and little labeled data • Quaternary fold recognition [Figure: Seq 1: APA FSVSPA … SGACGPECAESG; Seq 2: DSCTFT…TAAAAKAGKAKCSTITL]
Previous Work • Sequence similarity perspective • Sequence similarity searches, e.g. PSI-BLAST [Altschul et al., 1997] • Profile HMMs, e.g. HMMER [Durbin et al., 1998] and SAM [Karplus et al., 1998] • Window-based methods, e.g. PSIPRED [Jones, 2001] • Limitation: fail to capture structural properties and long-range dependencies • Physical forces perspective • Homology modeling or threading, e.g. Threader [Jones, 1998] • Limitation: generative models based on a rough approximation of free energy; perform very poorly on complex structures • Structural biology perspective • Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001] • Limitation: very hard to generalize due to built-in constants and fixed features
Conditional Random Fields • Hidden Markov model (HMM) [Rabiner, 1989] • Conditional random fields (CRFs) [Lafferty et al., 2001] • Model the conditional probability directly (discriminative models, directly optimizable) • Allow arbitrary dependencies in the observations • Adaptable to different loss functions and regularizers • Promising results in multiple applications • But they need to scale up (computationally) and extend to long-distance dependencies
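To make "model the conditional probability directly" concrete, here is a minimal Python sketch of a linear-chain CRF on a toy sequence. It is an illustration under stated assumptions, not the model from the talk: the label set, the weight tables `NODE_W` and `EDGE_W`, and the function names are all hypothetical. The sketch scores a labeling with node and edge weights and normalizes by a partition function computed with the forward algorithm.

```python
import itertools
import math

# Hypothetical toy weights: (residue, label) node scores and (label, label) edge scores.
NODE_W = {("A", "H"): 1.0, ("V", "E"): 1.2, ("G", "E"): -0.5}
EDGE_W = {("H", "H"): 0.8, ("E", "E"): 0.8}

def score(y, x, node_w, edge_w):
    """Unnormalized log-score of label sequence y given observations x."""
    s = sum(node_w.get((x[t], y[t]), 0.0) for t in range(len(x)))
    s += sum(edge_w.get((y[t - 1], y[t]), 0.0) for t in range(1, len(x)))
    return s

def log_partition(x, labels, node_w, edge_w):
    """Forward algorithm: log Z(x) in O(len(x) * len(labels)^2) time."""
    alpha = {l: node_w.get((x[0], l), 0.0) for l in labels}
    for t in range(1, len(x)):
        alpha = {
            l: node_w.get((x[t], l), 0.0)
            + math.log(sum(math.exp(alpha[p] + edge_w.get((p, l), 0.0)) for p in labels))
            for l in labels
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))

def cond_prob(y, x, labels, node_w, edge_w):
    """P(y | x) = exp(score(y, x)) / Z(x)."""
    return math.exp(score(y, x, node_w, edge_w) - log_partition(x, labels, node_w, edge_w))
```

Because the normalizer sums over label sequences only, the observation `x` can enter the features in any way without changing the inference cost, which is the property the slide refers to as allowing arbitrary dependencies in the observations.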
Our Solution: Conditional Graphical Models [Figure labels: local dependency; long-range dependency] • Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i} • Feature definition • Node feature • Local interaction feature • Long-range interaction feature
Linked Segmentation CRF [Figure label: joint labels] • Nodes: secondary structure elements and/or simple folds • Edges: local interactions and long-range inter-chain and intra-chain interactions • L-SCRF: the conditional probability of y given x is defined as [equation image not reproduced]
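For orientation, a conditional model over a graph with node features and edge (interaction) features is commonly written in the log-linear form below. This is a generic sketch, not necessarily the exact L-SCRF definition from the slide: the symbols $V_G$ and $E_G$ (node and edge sets of the graph), $f_k$ and $g_l$ (node and edge feature functions), and the weights $\lambda_k$, $\lambda_l$ are assumed notation.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Bigl(
      \sum_{y_i \in V_G} \sum_{k} \lambda_k \, f_k(\mathbf{x}, y_i)
      + \sum_{(y_i, y_j) \in E_G} \sum_{l} \lambda_l \, g_l(\mathbf{x}, y_i, y_j)
    \Bigr),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Bigl(
      \sum_{y'_i \in V_G} \sum_{k} \lambda_k \, f_k(\mathbf{x}, y'_i)
      + \sum_{(y'_i, y'_j) \in E_G} \sum_{l} \lambda_l \, g_l(\mathbf{x}, y'_i, y'_j)
    \Bigr)
```

In this reading, the edge set covers both the local interactions between neighboring segments and the long-range intra-chain and inter-chain interactions, so both kinds of dependency share one normalizer $Z(\mathbf{x})$.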
Linked Segmentation CRF (II) • Classification: [equation image not reproduced] • Training: learn the model parameters λ • Minimize the regularized negative log loss • Iterative search algorithms seek the direction in which the empirical feature values agree with their expectations under the model • Complex graphs result in huge computational complexity
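The training bullets can be spelled out with a standard log-linear derivation. The following is a sketch under assumptions: $F_k(\mathbf{x}, \mathbf{y})$ denotes feature $k$ summed over the whole graph, the training set is $\{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}$, and the regularizer is taken to be a Gaussian prior with variance $\sigma^2$; the slide does not state which regularizer was used.

```latex
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}; \lambda)

L(\lambda) = -\sum_{n} \log P\bigl(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \lambda\bigr)
             + \frac{\lVert \lambda \rVert^2}{2\sigma^2}

\frac{\partial L}{\partial \lambda_k}
  = -\sum_{n} \Bigl(
      F_k\bigl(\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\bigr)
      - \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}^{(n)}; \lambda)}
        \bigl[ F_k\bigl(\mathbf{x}^{(n)}, \mathbf{y}\bigr) \bigr]
    \Bigr)
    + \frac{\lambda_k}{\sigma^2}
```

Ignoring the regularization term, the gradient vanishes exactly when the empirical feature totals equal their model expectations. Computing those expectations requires summing over all labelings of the graph, which is the source of the computational cost on complex graphs.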
Approximate Inference of L-SCRF • Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so… • Reversible jump MCMC sampling [Green, 1995; Schmidler et al., 2001] with four types of Metropolis operators • State switching • Position switching • Segment split • Segment merge • Simulated annealing reversible jump MCMC [Andrieu et al., 2000] • Replace the sample with RJ MCMC • Theoretically converges to the global optimum
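The four move types can be sketched as a small annealed search over segmentations in Python. This is a simplified illustration, not the authors' sampler: the acceptance rule is plain Metropolis-style annealing and omits the reversible-jump proposal ratios, and the state set `STATES`, the scoring function `toy_score`, and the cooling schedule are all hypothetical. A segmentation is stored as boundary positions `cuts` plus one state per segment, so split and merge change the number of segments.

```python
import math
import random

STATES = ["B", "T"]  # hypothetical states: strand-like and turn-like

def toy_score(seq, cuts, states):
    """Hypothetical score: reward 'B' on hydrophobic residues and 'T' elsewhere,
    minus a fixed cost per segment."""
    total = 0.0
    for (a, b), s in zip(zip(cuts, cuts[1:]), states):
        for ch in seq[a:b]:
            hydrophobic = ch in "AVILMFWY"
            total += 1.0 if (s == "B") == hydrophobic else -1.0
    return total - 0.5 * len(states)

def propose(cuts, states, rng):
    """Apply one randomly chosen move; returns an unchanged copy if the move does not apply."""
    cuts, states = list(cuts), list(states)
    move = rng.choice(["state", "position", "split", "merge"])
    i = rng.randrange(len(states))
    if move == "state":                                   # state switching
        states[i] = rng.choice([s for s in STATES if s != states[i]])
    elif move == "position" and len(states) > 1:          # position switching
        j = rng.randrange(1, len(cuts) - 1)               # pick an internal boundary
        new = cuts[j] + rng.choice([-1, 1])
        if cuts[j - 1] < new < cuts[j + 1]:               # keep both neighbors non-empty
            cuts[j] = new
    elif move == "split" and cuts[i + 1] - cuts[i] >= 2:  # segment split
        cuts.insert(i + 1, rng.randrange(cuts[i] + 1, cuts[i + 1]))
        states.insert(i + 1, rng.choice(STATES))
    elif move == "merge" and len(states) > 1:             # segment merge
        j = rng.randrange(1, len(cuts) - 1)
        del cuts[j]
        del states[j]                                     # merged segment keeps the left state
    return cuts, states

def anneal(seq, steps=5000, t0=2.0, seed=0):
    """Annealed search; returns (best_score, cuts, states)."""
    rng = random.Random(seed)
    cuts, states = [0, len(seq)], [rng.choice(STATES)]
    cur = toy_score(seq, cuts, states)
    best = (cur, cuts, states)
    for k in range(steps):
        temp = t0 / (1 + k)                               # hypothetical cooling schedule
        c2, s2 = propose(cuts, states, rng)
        new = toy_score(seq, c2, s2)
        if new >= cur or rng.random() < math.exp((new - cur) / temp):
            cuts, states, cur = c2, s2, new
            if cur > best[0]:
                best = (cur, cuts, states)
    return best
```

Calling `anneal("AVILGGSNAVIL")` returns the best-scoring segmentation found for that toy sequence. A faithful reversible jump sampler would also weight each split and merge by the ratio of forward and reverse proposal probabilities, which this sketch leaves out for brevity.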
Experiments: Target Quaternary Fold • Triple beta-spirals [van Raaij et al. Nature 1999] • Virus fibers in adenovirus, reovirus and PRD1 • Double barrel trimer [Benson et al, 2004] • Coat protein of adenovirus, PRD1, STIV, PBCV
Tertiary Fold Recognition: β-Helix Fold • Histogram and ranks for known β-helices against the PDB-minus dataset • The chain graph model reduces the real running time of the SCRF model by around 50 times
Fold Alignment Prediction: β-Helix • Predicted alignment for known β-helices on cross-family validation
Discovery of New Potential β-helices • Run the structural predictor to seek potential β-helices in UniProt (structurally unresolved) databases • Full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html • Verification on 3 proteins from different organisms whose structures were later resolved experimentally • 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase • 1PXZ: The Major Allergen From Cedar Pollen • GP14 of Shigella bacteriophage as a β-helix protein • Not a single false positive!
Experiment Results: Fold Recognition [Result panels: triple beta-spirals; double barrel trimer]
Experiment Results: Alignment Prediction • Triple beta-spirals • Four states: B1, B2, T1 and T2 • Correct alignment: B1: i–o; B2: a–h [Figure: predicted alignment for B1 and B2]
Experiment Results: Discovery of New Membership Proteins • Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/~yanliu/swissprot_list.xls • Membership proteins of the double barrel trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions
Conclusion • Conditional graphical models for protein structure prediction • Effective representation for protein structural properties • Feasibility to incorporate different kinds of informative features • Efficient inference algorithms for large-scale applications • A major extension compared with previous work • Knowledge representation through graphical models • Ability to handle long-range interactions within one chain and between chains • Future work • Automatic learning of graph topology • Applications to other domains
Graphical Models • A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999] • Node: random variables • Edges: dependency relations • Directed graphical model (Bayesian networks) • Undirected graphical model (Markov random fields)
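The two families differ in how the joint distribution factorizes over the graph. The standard forms are shown below; the notation is assumed for illustration ($\mathrm{pa}(x_i)$ for the parents of node $i$, $\mathcal{C}$ for the set of cliques, $\psi_C$ for a non-negative clique potential).

```latex
\text{Directed (Bayesian network):}\quad
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

\text{Undirected (Markov random field):}\quad
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z = \sum_{x_1, \dots, x_n} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

The directed form is normalized locally, one conditional distribution per node, while the undirected form needs the global normalizer $Z$.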