Conditional Graphical Models for Protein Structure Prediction Yan Liu Language Technologies Institute School of Computer Science Carnegie Mellon University Oct 24, 2006
Snapshot of Cell Biology • Protein sequence (DSCTFTTAAAAKAGKAKAG), protein structure, and protein function [figure courtesy of Nobelprize.org]
Protein Structures and Functions • Example: triple beta-spiral fold (adenovirus fibre shaft, virus capsid) [Courtesy of Nobelprize.org]
Protein Structure Determination • Lab experiments: time- and labor-consuming • X-ray crystallography (Nobel Prize, Kendrew & Perutz, 1962) • NMR spectroscopy (Nobel Prize, Kurt Wüthrich, 2002) • The gap between sequence and structure necessitates computational methods of protein structure determination • 3,023,461 sequences vs. 36,247 resolved structures (1.2%) [example structures: 1MBN, 1BUS]
Protein Structure Hierarchy • We focus on predicting the topology of the structures from sequences (e.g. APAFSVSPASGACGPECA)
Major Challenges • Protein structures are non-linear • Long-range dependencies • Structural similarity often does not indicate sequence similarity • Sequence alignment reaches the twilight zone (under 25% similarity) • Example: β-α-β motif; ubiquitin (blue) vs. Ubx-Faf1 (gold)
Previous Work • Sequence similarity perspective • Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997] • Profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998] • Window-based methods, e.g. PSIPRED [Jones, 2001] (drawback: fail to capture structural properties) • Physical forces perspective • Homology modeling or threading, e.g. Threader [Jones, 1998] (drawback: generative models based on physical free energy) • Structural biology perspective • Methods carefully designed for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001] (drawback: hard to generalize due to the variety of informative features)
Structured Prediction • Many prediction tasks involve outputs with correlations or constraints • Output structures: sequence, tree, grid • Example: input sequence XS…WGIKQLQAR (or a sentence such as "John ate the cat."), output labels HHHCCCEEE…EECCCCEEE • Fundamental importance in many areas • Potential for significant theoretical and practical advances
Graphical Models • A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999] • Nodes: random variables • Edges: dependency relations • Directed graphical models (Bayesian networks) • Undirected graphical models (Markov random fields)
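For reference, the standard factorizations behind the two families named above (the textbook forms the bullets allude to, not formulas from the slides): a Bayesian network factors the joint into local conditionals given parents, while a Markov random field factors it into clique potentials with a global normalizer.

```latex
% Directed graphical model (Bayesian network): product of local conditionals
P(X_1,\dots,X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)

% Undirected graphical model (Markov random field): normalized product of clique potentials
P(X_1,\dots,X_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C),
\qquad Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```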
Conditional Random Fields • Hidden Markov model (HMM) [Rabiner, 1989] • Conditional random fields (CRFs) [Lafferty et al, 2001] • Model the conditional probability directly • Allow arbitrary dependencies on the observations • Adaptive to different loss functions and regularizers • Promising results in multiple applications
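For reference, the linear-chain CRF form from Lafferty et al. (2001) that these bullets summarize; f_k are feature functions over adjacent labels and the observation, and λ_k their weights:

```latex
P_\lambda(y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i) \Big)
```

Because the model conditions on x rather than generating it, the features may inspect arbitrary (long-range, overlapping) parts of the observation, which is the key contrast with the HMM above.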
Protein Structure Prediction • Dependency between residues (single observation) • Dependency between components (subsequences of observations)
Outline • Brief introduction to protein structures • Graphical models for structured prediction • Conditional graphical models for protein structure prediction • General framework • Specific models • Experiment results • Conclusion and discussion
Our Solution: Conditional Graphical Models (local dependency, long-range dependency) • Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i} • Feature definition • Node features • Local interaction features • Long-range interaction features
Conditional Graphical Models (II) • Conditional probability of the outputs Y given the observed sequence x (see the sketch below) • Prediction: the most likely assignment of Y • Training phase: learn the model parameters λ • Minimize the regularized negative log loss • Iterative search algorithms that seek the direction where the empirical feature values agree with the model expectations
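The formulas on this slide appear only graphically in the original; the sketch below is the general form implied by the definitions above (outputs Y = {M, {W_i}}, node features f_k, and pairwise interaction features g_k over the edges E of the model's graph), so the exact notation in the thesis may differ.

```latex
% Conditional probability of a segmentation Y = \{M, \{W_i\}\} given the sequence x
P_\lambda(Y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{i} \sum_{k} \lambda_k\, f_k(x, W_i)
          + \sum_{(i,j) \in E} \sum_{k'} \mu_{k'}\, g_{k'}(x, W_i, W_j) \Big)

% Prediction: the most likely segmentation
\hat{Y} = \arg\max_{Y} P_\lambda(Y \mid x)

% Training: minimize the regularized negative log loss over training pairs (x^{(t)}, Y^{(t)})
\hat{\lambda} = \arg\min_{\lambda}\;
  -\sum_{t} \log P_\lambda\big(Y^{(t)} \mid x^{(t)}\big) + \frac{\lVert\lambda\rVert^2}{2\sigma^2}
```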
Major Components • Graph topology • Secondary structure prediction: CRF, kernel CRF • Tertiary fold recognition: segmentation CRF, chain graph model • Quaternary fold recognition: linked segmentation CRF • Efficient inference • Prefer exact inference with O(nd) complexity • Otherwise resort to approximate inference • Features • Flexible and rich feature definitions
Protein Secondary Structure Prediction • Given a protein sequence, predict its secondary structure assignments • Three classes: helix (H), sheet (E), and coil (C) • Input: APAFSVSPASGACGPECA • Output: CCEEEEECCCCCHHHCCC
CRF on Secondary Structure Prediction [Liu et al, Bioinformatics 2004] • Node semantics - secondary structure assignment per residue (C, C, E, E, …, C) • Graphical model - conditional random fields (CRFs) or kernel CRFs • Inference algorithm - efficient exact inference exists, such as the forward-backward or Viterbi algorithm (see the sketch below)
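A minimal sketch of Viterbi decoding over the three secondary-structure labels, assuming per-residue label scores and label-transition scores have already been computed from the CRF features and weights; the array names and the random scores in the usage line are illustrative, not from the thesis.

```python
import numpy as np

LABELS = ["H", "E", "C"]  # helix, sheet, coil

def viterbi_decode(node_scores, trans_scores):
    """Return the highest-scoring label string for a linear-chain model.

    node_scores:  (n_positions, n_labels) per-residue label scores
    trans_scores: (n_labels, n_labels) label-transition scores
    """
    n, k = node_scores.shape
    best = np.full((n, k), -np.inf)
    backptr = np.zeros((n, k), dtype=int)
    best[0] = node_scores[0]
    for i in range(1, n):
        # candidate[prev, cur] = best score ending in prev + transition + node score
        candidate = best[i - 1][:, None] + trans_scores + node_scores[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        best[i] = candidate.max(axis=0)
    # trace the best path backwards
    path = [int(best[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    path.reverse()
    return "".join(LABELS[j] for j in path)

# Illustrative usage with random scores for a 10-residue sequence
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(10, 3)), rng.normal(size=(3, 3))))
```

The same dynamic program, with max replaced by sum-product, gives the forward-backward quantities needed for training.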
Protein Fold Recognition and Alignment • Protein fold: an identifiable regular arrangement of secondary structural elements • Different from previous simple fold classification • Provides important information and novel biological insights • Training and testing phases • Input: ..APAFSVSPASGACGPECA.. • Output 1: does the target fold exist? (Yes) • Output 2: ..NNEEEEECCCCCHHHCCC..
Conditional Graphical Model for Fixed-Template Folds [Liu et al, RECOMB 2005] • Node semantics - secondary structure elements of variable lengths • Graphical model - segmentation conditional random fields (SCRFs) • Inference - forward-backward and Viterbi-like algorithms can be derived under some assumptions (example: β-α-β motif)
Conditional Graphical Model for Repetitive Fold Recognition [Liu et al, ICML 2005] • Node semantics - two-layer segmentation Y = {M, {Ξ_i}, T} • Level 1: envelope (one repeat); level 2: components of one repeat • Graphical model - chain graph model • A graph consisting of both directed and undirected edges • Inference - forward-backward and Viterbi-like algorithms
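For reference, the general chain graph factorization that this model builds on: chain components (blocks) are ordered, edges between blocks are directed, and edges within a block are undirected, so each block conditional is itself a Markov random field. How the thesis instantiates the blocks for envelopes and their components may differ from this generic form.

```latex
P(V) = \prod_{\tau} P\big(V_\tau \mid \mathrm{pa}(V_\tau)\big),
\qquad
P\big(V_\tau \mid \mathrm{pa}(V_\tau)\big)
  = \frac{1}{Z\big(\mathrm{pa}(V_\tau)\big)} \prod_{C} \psi_C(V_C)
```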
Conditional Graphical Model for Quaternary Fold Recognition [Liu et al, IJCAI 2007] • Node semantics - secondary structure elements and/or simple folds • Graphical model - linked segmentation CRF (L-SCRF) • Fixed-template and/or repetitive subunits • Inter-chain and intra-chain interactions
Approximate Inference • Varying output dimensionality requires reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001] • Four types of Metropolis proposals • State switching • Position switching • Segment split • Segment merge • Simulated annealing reversible jump MCMC [Andrieu et al, 2000] • Replace the sampling step with RJMCMC moves under an annealing schedule • Theoretically converges to the global optimum
Conditional Graphical Models for Protein Structure Prediction
Model Roadmap • Conditional random fields • Kernelization → kernel CRFs • Segment correlations → segmentation CRFs • Local and global tradeoff → chain graph model • Inter-chain segment correlations → linked segmentation CRFs • All generalized as conditional graphical models
Outline • Brief introduction to protein structures • Graphical models for structured prediction • Conditional graphical models for protein structure prediction • Experiment results • Fold recognition • Fold alignment prediction • Discovery of potential membership proteins • Conclusion and discussion
Experiments: Target Folds • Right-handed β-helix fold [Yoder et al, 1993] • Bacterial infection of plants, binding of the O-antigen, and so on • Leucine-rich repeats (LRR) [Kobe & Deisenhofer, 1994] • Structural framework for protein-protein interactions
Experiments: Target Quaternary Fold • Triple beta-spirals [van Raaij et al. Nature 1999] • Virus fibers in adenovirus, reovirus and PRD1 • Double barrel trimer [Benson et al, 2004] • Coat protein of adenovirus, PRD1, STIV, PBCV
Tertiary Fold Recognition: β-Helix Fold • Histogram and ranks for known β-helices against the PDB-minus dataset • The chain graph model reduces the running time of the SCRF model by around 50 times
Quaternary Fold Recognition: Triple β-Spirals • Histogram and ranks for known triple β-spirals against PDB-minus dataset
Quaternary Fold Recognition: Double-Barrel Trimer • Histogram and ranks for known double-barrel trimers against PDB-minus dataset
Fold Alignment Prediction: β-Helix • Predicted alignment for known β-helices on cross-family validation
Fold Alignment Prediction: LRR and Triple β-Spirals • Predicted alignments for known LRRs using the chain graph model (left) and triple β-spirals using L-SCRFs (right)
Discovery of Potential β-Helices • Hypothesized potential β-helices from the UniProt reference databases • Full list can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html • Verification against proteins whose structures were later resolved, from different organisms • 1YP2: potato tuber ADP-glucose pyrophosphorylase • 1PXZ: major allergen from cedar pollen • GP14 of Shigella bacteriophage predicted as a β-helix protein
Conclusion • Thesis Statement • Conditional graphical models are effective for protein structure prediction • Strong claims • Effective representation for protein structural properties • Flexibility to incorporate different kinds of informative features • Efficient inference algorithms for large-scale applications • Weak claims • Ability to handle long-range interactions • Best performance bounded by prior knowledge
Contribution and Limitation • Contribution to machine learning • Enrichment of graphical models • Formulation to incorporate domain knowledge • Contribution to computational biology • Effective for protein structure prediction and fold recognition • Solutions for the long-range interactions (inter-chain and intra-chain) • Limitation • Manual feature extraction • Difficulty in verification • High complexity
Future Work • Computational biology: protein structure prediction; protein function and protein-protein interaction prediction; drug target design • Machine learning: graph-based semi-supervised learning; active learning for structured data; graph topology learning
Acknowledgement • Jaime Carbonell, Eric Xing, John Lafferty, Vanathi Gopalakrishnan • Chris Langmead, Yiming Yang, Roni Rosenfeld, Peter Weigele, Jonathan King, Judith Klein-Seetharaman, Ivet Bahar, James Conway and many more • And fellow graduate students …
Features for Tertiary Fold Recognition • Node features • Regular expression templates, HMM profiles • Secondary structure prediction scores • Segment length • Inter-node features • β-strand side-chain alignment scores • Preference scores for parallel alignments • Distance between adjacent B23 segments • Features are general and easy to extend
Discovery of Potential Double Barrel-Trimer • Potential proteins suggested in [Benson, 2005]
Inference Algorithm for SCRFs • Backward-forward algorithm* • Viterbi algorithm* • Key quantity: p(state y_r ends at position r | x_{l+1} x_{l+2} … x_{r-1} x_r, and state y_l ends at position l)
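The right-hand side of this recursion is shown only graphically in the original slide; a sketch of the standard semi-Markov-style forward recursion for the quantity named above, using the segment features f_k and weights λ_k defined earlier (the thesis's exact recursion, which may also restrict segment lengths, can differ):

```latex
% \alpha(r, y): total unnormalized score of all segmentations of x_1 \dots x_r
% whose last segment ends at position r in state y
\alpha(r, y) = \sum_{l < r} \sum_{y'} \alpha(l, y')\,
  \exp\Big( \sum_{k} \lambda_k\,
      f_k\big(x,\ \langle l{+}1,\, r,\, y\rangle,\ \langle \cdot,\, l,\, y'\rangle \big) \Big),
\qquad
Z(x) = \sum_{y} \alpha(n, y)
```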
Reversible Jump MCMC Algorithm • Three types of proposals • Position switching: randomly select a segment j and a new position assignment d_j^(i+1) ~ U(d_{j-1}^(i), d_{j+1}^(i)) • Segment split: randomly select a segment j and split it into two segments, where (d_j^(i+1), d_{j+1}^(i+1)) = G(d_{j-1}^(i), u^(i)) with u^(i) ~ U • Segment merge: randomly select a segment j and merge segments j and j+1 • Simulated annealing reversible jump MCMC for computing y = argmax P(y|x) [Andrieu et al, 2000] (see the sketch below)
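A minimal sketch of the three proposal moves over a segmentation represented by its boundary positions, with a simplified Metropolis accept step; log_score stands in for the model's unnormalized log conditional, and the acceptance ratio here omits the dimension-matching terms of a full reversible-jump treatment.

```python
import math
import random

def propose(d, n):
    """Propose new boundaries given current sorted boundaries d over positions 1..n-1."""
    d = list(d)
    move = random.choice(["switch", "split", "merge"])
    if move == "switch" and d:
        # Position switching: move one boundary uniformly between its neighbours
        j = random.randrange(len(d))
        lo = d[j - 1] + 1 if j > 0 else 1
        hi = d[j + 1] - 1 if j + 1 < len(d) else n - 1
        if lo <= hi:
            d[j] = random.randint(lo, hi)
    elif move == "split" and len(d) < n - 1:
        # Segment split: insert a new boundary inside a randomly chosen segment
        bounds = [0] + d + [n]
        j = random.randrange(len(bounds) - 1)
        if bounds[j + 1] - bounds[j] >= 2:
            d.append(random.randint(bounds[j] + 1, bounds[j + 1] - 1))
            d.sort()
    elif move == "merge" and d:
        # Segment merge: delete a boundary, fusing the two adjacent segments
        d.pop(random.randrange(len(d)))
    return d

def metropolis_step(d, n, log_score):
    """One simplified Metropolis step using the usual acceptance ratio."""
    d_new = propose(d, n)
    if math.log(random.random()) < log_score(d_new) - log_score(d):
        return d_new
    return d

# Illustrative usage: a toy score favouring about five boundaries on a length-60 sequence
toy_score = lambda d: -abs(len(d) - 5)
boundaries = [30]
for _ in range(1000):
    boundaries = metropolis_step(boundaries, 60, toy_score)
print(sorted(boundaries))
```

With an annealing schedule (scaling the score by an increasing inverse temperature), the same loop becomes the simulated-annealing variant used for the argmax computation.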
Protein Structure Determination • Lab experiments: time- and labor-consuming • X-ray crystallography • NMR spectroscopy • Electron microscopy and many more • Computational methods • Homology modeling: ≥ 30% sequence similarity • Fold recognition: < 30% sequence similarity • Ab initio methods: no template structure needed • Active research area in multiple scientific fields
Evaluation Measures • Q3 (per-residue accuracy; see the sketch below) • Precision, recall • Segment overlap measure (SOV) • Matthews correlation coefficient
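A minimal sketch of the per-residue Q3 accuracy; SOV and the Matthews correlation coefficient have more involved definitions and are omitted here. Labels are strings over H/E/C as in the earlier secondary-structure example, and the two example strings below are illustrative.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose H/E/C label is predicted correctly."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

# Illustrative usage
print(q3_accuracy("CCEEEEECCCCCHHHCCC", "CCEEEECCCCCCHHHCCC"))
```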
Outline • Brief introduction to protein structures • Discriminative graphical models • Generalized discriminative graphical models for protein fold recognition • Experiment results • Conclusion and discussion
Graphical Models for Structured Prediction • Conditional random fields • Model the conditional probability directly, not the joint probability • Allow arbitrary dependencies on the observations (e.g. long-range, overlapping) • Adaptive to different loss functions and regularizers • Promising results in multiple applications • Recent developments • Alternative estimation algorithms (Collins, 2002; Dietterich et al, 2004) • Alternative loss functions and use of kernels (Taskar et al, 2003; Altun et al, 2003; Tsochantaridis et al, 2004) • Bayesian formulation (Qi and Minka, 2005) and semi-Markov version (Sarawagi and Cohen, 2004)