
Predicting the function of a protein from either a sequence or a structure (is not trivial)

Predicting the function of a protein from either a sequence or a structure (is not trivial). Adam Godzik, The Sanford-Burnham Medical Research Institute. Summary - overview. Homology based methods. Analogy based methods. Physics based methods. Why function prediction?


Presentation Transcript


  1. Predicting the function of a protein from either a sequence or a structure (is not trivial) Adam Godzik, The Sanford-Burnham Medical Research Institute

  2. Summary - overview • Homology based methods • Analogy based methods • Physics based methods • Why function prediction?

  3. What we mean by function: multilevel definition • Phenotype • Cellular function • Molecular function (activity) • Substrates, inhibitors, cofactors • Several attempts to develop a unified function classification • EC classification for enzymes, 4.2.31.101 • Merops (proteases), CAZY (hydrolases) • Gene ontology

  4. Two complementary views of the evolution and diversity of life • Genes (proteins) • Organisms (species)

  5. Both are amazingly large and diverse • Organisms (species) • About 1.5M known today, 10-100 million species estimated to exist, depending on the definition of species and other assumptions • Their relations can be described in a tree of life, at least for eukaryotes. • The bacterial and archaeal tree of life is much more controversial; some even dispute the concept of species for bacteria • Proteins • With a 20 amino acid alphabet, the number of possible protein sequences is very large (20^100, i.e. about 1.3*10^130, for short proteins of 100 residues (!)) • Total number: >10 billion? • 10-100M species, with ~4K genes in a bacterial and ~10K in a eukaryotic genome • Over 25 million known today, i.e. ~0.2% • Representative sample?
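The two back-of-the-envelope numbers on this slide can be checked directly. The sketch below assumes a "short protein" means 100 residues and takes the slide's rough estimates (25 million known proteins, 10 billion in total) at face value; the variable names are illustrative only.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids
LENGTH = 100         # assumed length of a "short" protein

# Exact size of sequence space, and its order of magnitude.
n_sequences = ALPHABET_SIZE ** LENGTH
exponent = LENGTH * math.log10(ALPHABET_SIZE)
mantissa = 10 ** (exponent - math.floor(exponent))
print(f"20^100 is about {mantissa:.2f} x 10^{math.floor(exponent)}")

# Fraction of natural proteins sampled so far, using the slide's rough figures.
known_proteins = 25e6
estimated_total = 10e9
print(f"known fraction is about {known_proteins / estimated_total:.2%}")
```

The point of the comparison is that even the optimistic estimate of all natural proteins is a vanishing fraction of the possible sequence space.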

  6. From the 25 million proteins known today • Direct experimental data are available for a few thousand proteins • Indirect experimental data are available for perhaps a few hundred thousand • Structures of ~60 thousand have been solved

  7. The protein universe seems to be very large. But is it random?

  8. Many proteins (like species) are close relatives • Histone H1 (human) - histone H1 (chicken) • SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA • | | || || || ||| ||| | |||||||||||||||||| ||| |||||| || • SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT • similarity: 77% id, BLAST E-value 0.0 • function: two H1 histones from different species (orthologs) • Their functions and structures are obviously very similar
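Percent identity, as quoted on this slide, is the share of aligned positions carrying the same residue. A minimal sketch follows, assuming the two sequences are already aligned to equal length and that "-" marks a gap; `percent_identity` is a hypothetical helper, not a function from BLAST. The value it gives for this 60-residue excerpt need not match the figure on the slide, which was presumably computed over a different alignment span.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of gap-free aligned positions where both residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which neither sequence has a gap.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

# Excerpts shown on the slide (human and chicken histone H1).
human_h1 = "SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA"
chicken_h1 = "SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT"
print(f"{percent_identity(human_h1, chicken_h1):.1f}% identical")
```

Excluding gapped columns from the denominator is one common convention; other tools divide by the full alignment length or by the shorter sequence, which is one reason identity figures differ between programs.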

  9. Can we organize the protein universe into neighborhoods (families)?

  10. How many protein families are still out there? The number of protein clusters (modeling families) grows linearly with the number of protein sequences (and exponentially in time) – cumulative total. From Yooseph et al., PLoS Biology (2007) 5:e16

  11. How far can we go? • Histone H5 - histone H1 • TYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRLA • | | | | | | | | | ||| | | | |||| |||||||| • SVTELITKAVSASKERKGLSLAALKKALAAGGYDVEKNNSRIKLGLKSLVSKGTLVQTKGTGASGSFRLS • similarity: 40% seq id, BLAST E-value 10^-15 • function: two histones (paralogs) • Structures still very similar, functions somewhat different, but obviously similar

  12. This is surely too far? • Histone H5 - transcription factor E2F-4 • PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | | • GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW • similarity: 7% seq id, BLAST E-value 1

  13. Is it? • Structure: obviously similar (2.4 Å RMSD over 80 aa) • Function: clearly related (both bind DNA) • More subtle similarity can be detected with more sophisticated methods
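The RMSD quoted on this slide is the root-mean-square deviation between equivalent atoms of two structures. The sketch below assumes the structures have already been superposed and that the atom pairing is known; finding the optimal superposition is a separate step that is not shown. The coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points, in the input units."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between paired atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical C-alpha positions (in Å), the second set shifted by 1 Å along y.
toy_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
toy_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(f"RMSD = {rmsd(toy_a, toy_b):.2f} Å")
```

An RMSD is only meaningful together with the number of residues it was computed over, which is why the slide reports both.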

  14. We can keep adding more layers

  15. Similarity -> homology • Unknown protein: GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW • Well-studied protein: SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA • Most “function assignments” are provided by predicted homology: similarity -> ? -> prediction

  16. Similarity -> homology based annotations • Recognition of close and/or distant homologs based on similarity • Sequence • Sequence/profile, profile/profile • Structure • Problems • How to predict differences? Even homologous proteins evolve and change!

  17. Prediction by homology • Recognition: Are there any well-characterized proteins similar to my protein? Can we assume they are homologous? • Alignment: What is the position-by-position target/template equivalence? • Modeling: The structure of my protein is similar to the other one • Function prediction: The function of my protein is similar to the other one

  18. We could predict • Role in the whole organism • Structure of a complex • Activity • 3D structure

  19. Important distinction • Similarity • Two proteins have similar sequences/structures/functions if by some metric the s/s/f of one protein is more similar to the s/s/f of another than to a randomly chosen protein • Homology • Two proteins are homologous if they have evolved from a common ancestor • Common error • Two proteins are 65% homologous • What we really meant • The sequences of two proteins are 65% similar, therefore we can safely assume they are homologous; why else would they be so similar?
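The definition of similarity above ("more similar than to a randomly chosen protein") can be made concrete by comparing an observed score with the scores of shuffled sequences. The sketch below is a crude stand-in and not how BLAST computes its statistics: it assumes ungapped sequences of equal length, uses plain identity as the metric, and models a "random protein" by shuffling one sequence. The function names are hypothetical.

```python
import random
import statistics

def identity(a: str, b: str) -> float:
    """Fraction of positions with identical residues (ungapped, equal length)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def shuffled_z_score(a: str, b: str, n_shuffles: int = 1000, seed: int = 0) -> float:
    """How many standard deviations the observed identity lies above shuffled controls."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = identity(a, b)
    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition as b, random order
        background.append(identity(a, "".join(residues)))
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        return 0.0  # no spread in the controls: the comparison is uninformative
    return (observed - mu) / sigma

human_h1 = "SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA"
chicken_h1 = "SKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLVTT"
print(f"z-score = {shuffled_z_score(human_h1, chicken_h1):.1f}")
```

A large z-score says the similarity is unlikely to be accidental. Concluding that the proteins are homologous is a further inference, which is exactly the distinction this slide draws.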

  20. If life were easy, this is how it would look • similar = homologous • not similar = unrelated

  21. Not (obviously) similar, but (probably) homologous • Histone H5 and transcription factor E2F4, identity 7%, similar fold, similar function (DNA binding) • PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL | | | | | • GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW

  22. Similar, but not homologous • Phosphoribosyltransferase and viral coat protein, identity: 42%, different folds, different functions • . . . . . 99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173 : ||. ||| || |. || | : | | | | || | || |:| | ||.| | 214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

  23. Similarity vs. homology • similar = homologous • not similar, but homologous • similar, but not homologous • not similar = unrelated

  24. Can we return to this simple picture by redefining similarity? • similar = homologous • not similar = unrelated

  25. Are these two protein families related? • New protein (target): KAAELEMEKEQILRSLGEISVHNCMFKLEECDREEIEAITDRLTKRTKTVQVVVETPRNEEQKKALEDATLMIDEVGEMMHSNIEKAKLCLQ • Known protein (template): VKKDALENLRVYLCEKIIAERHFDHLRAKKILSREDTEEISCRTSSRKRAGKLLDYLQENPKGLDTLVESIRREKTQNF

  26. How to compare two families? Score = ?

  27. Profile-profile similarity: compare as vectors in 21-dimensional space (FFAS)
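One way to read "compare as vectors" is sketched here: each alignment column becomes a frequency vector, and two columns are scored by their dot product. This is an illustration under stated assumptions, not the actual FFAS scoring function. The slide does not say what the 21st dimension is, so a gap symbol is assumed, and the two toy families are invented.

```python
from collections import Counter

# 20 amino acids plus a gap symbol (the gap as 21st dimension is an assumption).
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-"

def column_profile(column):
    """Frequency vector of one alignment column over the 21 symbols."""
    counts = Counter(column)
    total = len(column)
    # Characters outside SYMBOLS (e.g. X) are ignored in this sketch.
    return [counts.get(symbol, 0) / total for symbol in SYMBOLS]

def family_profile(msa):
    """One 21-dimensional vector per column of a multiple sequence alignment."""
    return [column_profile(column) for column in zip(*msa)]

def column_score(p, q):
    """Dot product of two column vectors: 1.0 for identical invariant columns."""
    return sum(x * y for x, y in zip(p, q))

# Two hypothetical families, each given as a small alignment.
family_a = ["ACDK", "ACEK", "SCDK"]
family_b = ["ACDR", "GCDK", "ACDK"]
profile_a = family_profile(family_a)
profile_b = family_profile(family_b)

# Score every column of family A against every column of family B.
scores = [[column_score(p, q) for q in profile_b] for p in profile_a]
for row in scores:
    print(" ".join(f"{value:.2f}" for value in row))
```

A matrix like `scores` would then feed a dynamic-programming alignment in place of a residue substitution matrix, so that whole families are aligned to each other instead of single sequences.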

  28. How to validate a protocol 1. Recognition • Folding benchmarks • from structural clustering of PDB (several sets, 700 pairs used here) compared to sequence based clustering of the same group of proteins • correct predictions vs. wrong predictions • CASP meetings, CAFASP, LiveBench • published and/or publicly available predictions, fold prediction servers, available prediction programs

  29. Summary - overview • Homology based methods • Analogy based methods • Physics based methods • Why function prediction?

  30. Similarity -> analogy based annotations • Recognition of potential analogs based on similarity in • Genome organization (non-homologous replacements) • Genomic fingerprints • Expression patterns • Specific features • Charge distribution • Presence of specific patterns • Problems • Is this similarity related to function?

  31. TM0449 (thy1) – from prediction to proof • TM0449 • Hypothetical, uncharacterized protein • Multiple homologs in pathogenic and thermophilic bacteria • Novel fold • evidence • Phylogenetic profile complementing thymidylate synthase • A homolog complements TS in Dictyostelium • Confirmed experimentally
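A phylogenetic profile records which genomes contain a gene. Two profiles are complementary when each genome tends to carry exactly one of the two genes, which hints that one gene may replace the other. The sketch below uses invented presence/absence data; `complementarity` and the genome labels are hypothetical and do not come from the TM0449 study.

```python
def complementarity(profile_a, profile_b):
    """Fraction of shared genomes that contain exactly one of the two genes."""
    genomes = profile_a.keys() & profile_b.keys()
    if not genomes:
        raise ValueError("profiles share no genomes")
    exclusive = sum(profile_a[g] != profile_b[g] for g in genomes)
    return exclusive / len(genomes)

# Hypothetical presence (True) / absence (False) of three genes in four genomes.
known_enzyme = {"genome_1": True, "genome_2": False, "genome_3": True, "genome_4": False}
candidate = {"genome_1": False, "genome_2": True, "genome_3": False, "genome_4": True}
housekeeping = {"genome_1": True, "genome_2": True, "genome_3": True, "genome_4": True}

print(complementarity(known_enzyme, candidate))     # perfectly complementary
print(complementarity(known_enzyme, housekeeping))  # no particular pattern
```

A high score is only a lead. As the slide notes, the prediction for TM0449 still had to be confirmed experimentally.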

  32. 3D motif search finds an identical arrangement binding phosphate in a different protein

  33. Summary - overview • Homology based methods • Analogy based methods • Physics based methods • Why function prediction?

  34. “Ab initio function prediction” – substrate docking

  35. We know the structure of one protein in the family and functions of some others – is the function conserved? Newly solved target Gallery of models

  36. We can analyze conservation of surface features by mapping them on the sphere

  37. And then compare maps between homologs

  38. And come up with new (predicted) functions Phospholipid vs. retinol vs. short peptide binding

  39. Summary - overview • Homology based methods • Analogy based methods • Physics based methods • Why function prediction?

  40. Why my interest in function prediction? • Structural genomics: the structure is often the easiest experimental information to obtain (after sequence)

  41. Function vs function • We have witnessed dramatic technological advances in sequencing and now in structure determination, while function analysis remains a painstaking, manual effort. • We used to know a lot about function even before we started working on a protein. Well, not anymore. [Figure: timeline (1970, 1990, 2005, 2010, ?) of function discovery, structure determination and sequencing; scale label: 1 year]

  42. Structure determination is now done on an assembly line [Figure: pipeline stages, each with a multiplicity label: target selection, xtal screening, bl xtal mounting, data collection, phasing, tracing, expression, imaging, purification, crystallization, harvesting, cloning, struc. validation, struc. refinement, annotation, publication, PDB]

  43. Even a few years ago functional annotation seemed trivial [Figure: pipeline stages, each with a multiplicity label: target selection, xtal screening, bl xtal mounting, data collection, phasing, tracing, expression, imaging, purification, crystallization, harvesting, cloning, struc. validation, struc. refinement, annotation, publication, PDB]

  44. After a few years, the reality seems to be very different [Figure: pipeline stages, each with a multiplicity label: target selection, xtal screening, bl xtal mounting, data collection, phasing, tracing, expression, imaging, purification, crystallization, harvesting, cloning, struc. validation, struc. refinement, annotation, publication, PDB]

  45. “Reverse order” of function and structure determination and its challenges • The classical way: 1. A function is discovered and studied 2. The gene responsible for this function is identified 3. Function is confirmed 4. The product of this gene is isolated, crystallized, and solved 5. We have a whole story! Structure “rationalizes” function and provides molecular details • Post-genomic: 1. A new, uncharacterized gene is found in a genome 2. Predictions or high-throughput methods prioritize this gene for further studies 3. The protein is studied in detail • Structure is solved in a high-throughput center • Structure is the first experimental information about the “hypothetical” protein

  46. We now have hundreds of structures of proteins with unknown functions

  47. Summary • For some, function prediction is a practical, day-to-day problem • Analogy based approaches dominate the field • Homology seen from sequence similarity • Structural similarities • Potential active sites, clefts, surface features • Many useful tools exist, but they are very scattered and not very user-friendly

  48. Summary (2) • Avoid overconfidence - “easy” predictions contain many surprises • Only synergy of several independent lines of reasoning can give a correct answer • Elimination of “easy”, but inconsistent predictions is critical • So far, AFP doesn’t even come close to expert analysis
