
Machine Learning in the Study of Protein Structure


Presentation Transcript


  1. Machine Learning in the Study of Protein Structure Rui Kuang Columbia University Candidacy Exam Talk May, 2004

  2. Table of contents • Introduction to protein structure and its prediction • HMM, SVM and string kernels • Protein ranking and structural classification • Protein secondary and higher order structure prediction • Protein domain segmentation • Future work

  3. Part I Introduction to Protein Structure and Its Prediction

  4. Why do we study protein structure? • Protein: named in 1838 by Jöns J. Berzelius, from the Greek word proteios, meaning “of the first rank” • Crucial in all biological processes, such as enzymatic catalysis, transport and storage, immune protection… • Functions depend on structures --- structure can help us to understand function

  5. Building blocks • Amino acids: Hydrophobic: AVLIFPM; Charged: DEKR; Polar: STCNQHYW; Special: G • Polypeptide chain: extends from its amino terminus to its carboxy terminus
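The residue grouping above can be captured in a small lookup table. A minimal sketch (the group names and the `classify` helper are illustrative, not from the talk):

```python
# Residue groups from the slide (one-letter amino acid codes).
GROUPS = {
    "hydrophobic": set("AVLIFPM"),
    "charged": set("DEKR"),
    "polar": set("STCNQHYW"),
    "special": set("G"),
}

def classify(residue):
    """Return the physicochemical group of a one-letter residue code."""
    for name, members in GROUPS.items():
        if residue.upper() in members:
            return name
    raise ValueError("unknown residue: %s" % residue)
```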

  6. How to Describe Protein Structure • Primary: amino acid sequence • Secondary structure: alpha helices, beta sheets and loops • Tertiary: the three-dimensional fold of a chain (described by Phi-Psi angles) • Quaternary: arrangement of several polypeptide chains

  7. Secondary Structure : Alpha Helix Hydrogen bonds between residues n and n+i (i=3,4,5)

  8. Secondary Structure : Beta Sheet Antiparallel beta sheets, parallel beta sheets, and mixed sheets also occur.

  9. Secondary Structure : Loop Regions • Less conserved structure • Insertions and deletions occur more often • Conformations are flexible

  10. Tertiary Structure Phi: rotation about the N-Cα bond; Psi: rotation about the Cα-C' bond

  11. Protein Domains • A polypeptide chain or a part of a polypeptide chain that can fold independently into a stable tertiary structure. • Built from different combinations of secondary structure elements and motifs

  12. Three Main Classes of Domain Structures • During evolution, the structural core tends to be conserved • Alpha domains: the core is built up exclusively from alpha helices • Beta domains: the core comprises antiparallel beta sheets packed against each other • Alpha/Beta domains: a predominantly parallel beta sheet surrounded by alpha helices

  13. Determination of Protein Structures • X-ray crystallography: the interaction of X-rays with the electrons arranged in a crystal produces an electron-density map, which can be interpreted into an atomic model. Crystals are very hard to grow. • Nuclear magnetic resonance (NMR): some atomic nuclei have a magnetic spin. The molecule is probed with radio-frequency pulses to obtain distances between atoms. Only applicable to small molecules.

  14. Sequence, Structure and Function • Sequence space: dense (~1,000,000) • Structure space: sparse (2,000-24,000) • Function space: ill-defined (~20,000 by GO) Thanks to Michal Linial

  15. From sequence to structure • Significant sequence similarity (>30%) usually suggests strong resemblance in structure • Remote homologs can also share similar structures • Structure space can be represented by discrete groups of folds • Yet the boundaries between these classes are often difficult to define

  16. From structure to function • Structure similarity implies evolutionary relationship and functional similarity • However, functions can be associated with different structures: different superfamilies can have the same fold, and homologous superfamilies can evolve into distinct functions • 66% of proteins with a similar fold also have a similar function

  17. Protein structure prediction in three areas • Comparative modeling: where there is a clear sequence relationship between the target structure and one or more known structures. • Fold recognition ('threading'): no sequence homology with known structures; find consistent folds (remote homology detection). • Ab initio ('de novo') structure prediction: deriving structures, approximate or otherwise, from sequence.

  18. Comparative Modeling • Find homologous proteins with known structure as templates • Align target with template sequences in terms of sequence/PSSM (most important) • Evolution makes sequence similarity weaker • No one-to-one correspondence due to insertions/deletions • Full-atom refinement and loop modeling

  19. Fold recognition • There is no known structure among the homologs of the target sequence • Find remote homologs with consistent (similar) structures • Does structural information help? • Then do comparative modeling

  20. De novo • No template available for use; predict the structure by folding simulation • Rosetta: • short segments independently sample distinct distributions of local conformations drawn from known structures • folding happens when orientations and conformations allow low free-energy interactions, optimized by a Monte Carlo search procedure

  21. How to study protein structure with machine learning? • With over 24,000 known structures and functions in the Protein Data Bank: • protein pairwise comparison • protein structural classification • protein structure prediction • protein segmentation…

  22. Part II Hidden Markov Model, Support Vector Machine and String Kernels Thanks to Nello Cristianini

  23. Hidden Markov Models for Modeling Protein Alignment Maximum likelihood or maximum a posteriori estimation of the HMM Krogh, Brown, Mian, Sjolander and Haussler, 1993

  24. Hidden Markov Models for Modeling Protein • Probability of sequence x through path q • Viterbi algorithm for finding the best path • Forward and backward for posterior decoding Krogh, Brown, Mian, Sjolander and Haussler, 1993
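The decoding step above can be sketched with the Viterbi algorithm on a toy two-state HMM (the states and probabilities below are made up for illustration; the profile HMMs of Krogh et al. use match/insert/delete states instead):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence, in log space."""
    V = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            # pick the best predecessor state for s at position t
            score, path = max(
                (V[t - 1][p][0] + math.log(trans_p[p][s]), V[t - 1][p][1])
                for p in states)
            V[t][s] = (score + math.log(emit_p[s][obs[t]]), path + [s])
    return max(V[-1].values())[1]

# toy model: state 'H' emits G/C-rich symbols, 'L' emits A/T-rich symbols
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {"H": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
        "L": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4}}
path = viterbi("GGCA", states, start, trans, emit)
```

Because staying in a state is much more likely than switching, a single A at the end is not enough to pull the path out of the G/C-rich state.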

  25. Hidden Markov Models for Modeling Protein Build an HMM from unaligned sequences with the EM algorithm: • Choose an initial length and parameters • Iterate until the change in likelihood is small: • Calculate the expected number of times each transition or emission is used • Maximize the likelihood to get new parameters

  26. String kernels for text classification • String subsequence kernel (SSK) • A recursive computation of SSK has complexity O(n|s||t|), quadratic in the length of the input sequences; not practical for large data sets. Lodhi, Cristianini et al., 2002
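A direct memoized implementation of the SSK recursion makes the cost concrete: the table over subsequence length and the two prefix lengths has O(n|s||t|) entries. The decay parameter and toy strings below are illustrative:

```python
from functools import lru_cache

def ssk(s, t, n, lam=0.5):
    """String subsequence kernel K_n(s, t) with decay lam,
    memoized recursion (Lodhi et al., 2002)."""
    @lru_cache(maxsize=None)
    def kprime(i, ls, lt):
        # auxiliary K'_i over prefixes s[:ls], t[:lt]
        if i == 0:
            return 1.0
        if ls < i or lt < i:
            return 0.0
        x = s[ls - 1]
        total = lam * kprime(i, ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += kprime(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(i, ls, lt):
        if ls < i or lt < i:
            return 0.0
        x = s[ls - 1]
        total = k(i, ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += kprime(i - 1, ls - 1, j) * lam ** 2
        return total

    return k(n, len(s), len(t))
```

For example, "cat" and "car" share only the length-2 subsequence "ca" (contiguous in both), so K_2 = lam**4.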

  27. Part III Protein ranking and structural classification Where are my relatives?

  28. Structural Classification Databases • SCOP, CATH, FSSP • Sequence pairwise comparison • Smith-Waterman, BLAST, PSI-BLAST, rank-propagation, SAM-T98 • Discriminative classification • SVM-pairwise, mismatch kernel, EMOTIF kernel, I-Site kernel, semi-supervised kernel

  29. SCOP (Diagram: positive and negative training/test sets drawn at the family, superfamily and fold levels.) • Family: sequence identity > 30%, or functions and structures are very similar • Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin • Common fold: same major secondary structures in the same arrangement with the same topological connections

  30. CATH • Class • Architecture • Topology • Homologous superfamily • Sequence family

  31. Local alignment: Smith-Waterman algorithm • For two strings x and y, a local alignment with gaps pairs a substring of x with a substring of y • The score rewards aligned residue pairs and penalizes gaps • The Smith-Waterman score is the maximum score over all such local alignments Thanks to Jean Philippe
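A minimal dynamic-programming sketch of the Smith-Waterman score (the match/mismatch/gap scores are illustrative; real protein alignment uses a substitution matrix such as BLOSUM62 and affine gap penalties):

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, O(|x||y|) dynamic programming."""
    H = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                     # restart a local alignment
                          H[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          H[i - 1][j] + gap,     # gap in y
                          H[i][j - 1] + gap)     # gap in x
            best = max(best, H[i][j])
    return best
```

The clamp at 0 is what makes the alignment local: a badly scoring prefix is simply dropped rather than dragging down the rest.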

  32. BLAST: a heuristic algorithm for matching DNA/protein sequences • Idea: true matches are likely to contain a short stretch of identity • Build a list of 'neighborhood words' of the query sequence • Search the database with the list; whenever there is a match, do a 'hit extension', stopping at the maximum-scoring extension Altschul, Madden, Schaffer, Zhang et al., 1997
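The seed-and-extend idea can be sketched as follows. This simplification seeds only on exact word matches and extends while residues are identical; real BLAST also seeds on near-matching neighborhood words scored with a substitution matrix, and extends by score:

```python
def blast_hits(query, db_seq, w=3):
    """Toy seed-and-extend: exact w-mer hits, then ungapped extension.
    Returns (query_pos, db_pos, length) triples for each extended hit."""
    # index all w-mers of the database sequence
    index = {}
    for j in range(len(db_seq) - w + 1):
        index.setdefault(db_seq[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            # extend the hit in both directions while residues match
            l, r = 0, w
            while i - l - 1 >= 0 and j - l - 1 >= 0 \
                    and query[i - l - 1] == db_seq[j - l - 1]:
                l += 1
            while i + r < len(query) and j + r < len(db_seq) \
                    and query[i + r] == db_seq[j + r]:
                r += 1
            hits.append((i - l, j - l, l + r))
    return hits
```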

  33. PSI-BLAST: Position-Specific Iterated BLAST • Only extend those double hits within a certain range • A gapped alignment uses dynamic programming to extend a central pair of aligned residues in both directions • PSI-BLAST can take a PSSM as input to search the database Altschul, Madden, Schaffer, Zhang et al., 1997

  34. Local and Global Consistency • Affinity matrix W • D is the diagonal matrix of row sums of W; normalize S = D^(-1/2) W D^(-1/2) • Iterate F(t+1) = αSF(t) + (1-α)Y • F* is the limit of the sequence {F(t)} Zhou, Bousquet, Lal, Weston, and Scholkopf, 2003
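The iteration on this slide can be sketched with NumPy; the graph, labels, and alpha below are toy values for illustration:

```python
import numpy as np

def label_spread(W, Y, alpha=0.9, iters=200):
    """Iterate F(t+1) = alpha * S @ F(t) + (1 - alpha) * Y with
    S = D^{-1/2} W D^{-1/2}; converges because alpha < 1."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))   # elementwise D^{-1/2} W D^{-1/2}
    F = Y.astype(float)
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F
```

On a toy graph with two tight pairs joined by a weak edge, labeling one node of each pair spreads the right class to its unlabeled neighbor.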

  35. Rank propagation • Protein similarity network: • Graph nodes: protein sequences in the database • Directed edges: an exponential function of the PSI-BLAST e-value (destination node as query) • Activation value at each node: the similarity to the query sequence • Exploit the structure of the protein similarity network Weston, Elisseeff, Zhou, Leslie and Noble, 2004

  36. SAM-T98 • The first iteration: use the query sequence to search the NR database with WU-BLASTP and build an alignment of the homologs found • 2nd-4th iterations: take the alignment from the previous iteration to find more homologs with WU-BLASTP and update the alignment with the new homologs found • Build an HMM from the final alignment. The HMM of the query sequence is used to search the database, or we can use the query sequence to search against an HMM database Karplus, Barrett and Hughey, 1999

  37. Use discriminative methods, such as SVMs, to utilize negative data as well as positive data…

  38. Fisher Kernel • An HMM (or more than one) is built for each family • Derive a kernel function from the Fisher scores of each sequence given an HMM H1: Jaakkola, Diekhans and Haussler, 2000

  39. SVM-pairwise • Represent sequence P as a vector of pairwise similarity scores with all training sequences • The similarity score could be a Smith-Waterman score or a PSI-BLAST e-value. Liao and Noble, 2002
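The representation is just one feature per training sequence. A minimal sketch, where a shared-k-mer count stands in for the Smith-Waterman score or PSI-BLAST e-value used in the paper:

```python
def kmer_sim(a, b, k=3):
    """Stand-in similarity for the sketch: number of shared k-mers.
    (SVM-pairwise uses Smith-Waterman scores or PSI-BLAST e-values.)"""
    kmers = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))

def pairwise_features(seq, train_seqs, sim=kmer_sim):
    """SVM-pairwise representation: one feature per training sequence."""
    return [sim(seq, t) for t in train_seqs]
```

The resulting vectors can be fed to any standard SVM; the feature dimension grows with the training set, which is the method's main cost.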

  40. Mismatch Kernel For a sequence such as AKQDYYYYE…, its k-mers (AKQ, KQD, QDY, DYY, YYY, …) are mapped to a feature vector indexed by all possible k-mers (…AAQ, AKQ, DKQ, EKQ…); each k-mer also contributes to neighboring k-mers within m mismatches (e.g. AKQ matches CKQ, AKY, DKQ, AAQ, …). An implementation with a suffix tree achieves linear time complexity O(|Σ|^m k^(m+1) (|x|+|y|)). Leslie, Eskin, Cohen, Weston and Noble, 2002
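A naive version of the (k,m)-mismatch kernel makes the feature map explicit; it enumerates all |Σ|^k candidate k-mers, so it is only feasible for the toy alphabet used here (the suffix-tree version on the slide avoids this):

```python
from itertools import product

def mismatch_feature(seq, k, m, alphabet):
    """Naive (k,m)-mismatch feature map: each k-mer of seq adds a count
    to every k-mer within Hamming distance m of it."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        for cand in product(alphabet, repeat=k):
            cand = "".join(cand)
            if sum(a != b for a, b in zip(kmer, cand)) <= m:
                counts[cand] = counts.get(cand, 0) + 1
    return counts

def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two (k,m)-mismatch feature vectors."""
    fx = mismatch_feature(x, k, m, alphabet)
    fy = mismatch_feature(y, k, m, alphabet)
    return sum(c * fy.get(u, 0) for u, c in fx.items())
```

With k=3, m=1 over a 4-letter alphabet, a single k-mer lights up 1 + 3*3 = 10 coordinates, so its self-kernel is 10.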

  41. EMOTIF Database • A motif database of protein families • Substitution groups from separation score Nevill-Manning, Wu and Brutlag, 1998

  42. EMOTIF Database (continued) • All possible motifs are enumerated from sequence alignments Nevill-Manning, Wu and Brutlag, 1998

  43. EMOTIF Kernel • EMOTIF TRIE built from eBLOCKS • EMOTIF feature vector: each coordinate is the number of occurrences of a motif m in x Ben-Hur and Brutlag, 2003
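The feature vector itself is simple to sketch. Here regular expressions stand in for EMOTIF patterns (a bracketed class like `[KR]` plays the role of a substitution group); the motifs below are made up for illustration, and the real kernel matches against the EMOTIF TRIE rather than running regexes one by one:

```python
import re

def emotif_features(seq, motifs):
    """EMOTIF-style feature vector: one coordinate per motif,
    counting the motif's occurrences in the sequence."""
    return [len(re.findall(m, seq)) for m in motifs]

# hypothetical motifs for illustration
motifs = [r"A[KR]Q", r"G..G"]
```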

  44. I-SITE Motif Library • Sequence segments (3-15 amino acids long) are clustered via k-means • Within each cluster, structure similarity is calculated in terms of dme and mda • Only those clusters with good dme and mda are refined and considered motifs afterwards

  45. I-SITE Kernel • Similar to the EMOTIF kernel, the I-SITE kernel encodes protein sequences as a vector of confidence levels against the structural motifs in the library

  46. Cluster kernels • Profile kernels: implicitly average the feature vectors for sequences in the PSI-BLAST neighborhood of the input sequence (dependent on the size of the neighborhood and the total length of the unlabeled sequences) • Bagged kernels: run bagged k-means to estimate p(x,y), the empirical probability that x and y are in the same cluster. The new kernel is the product of p(x,y) and the base kernel K(x,y) Weston, Leslie, Zhou, Elisseeff and Noble, 2003

  47. Part IV Protein secondary and higher order structure prediction Can we really do that?

  48. PHD: Profile network from HeiDelberg B. Rost and C. Sander, 1993

  49. PSIPRED D. T. Jones, 1999
