
Molecular Modeling: building a 3D protein structure from its sequence

Understand the importance of protein folding, structure levels, and bioinformatics tools for building accurate protein models. Explore how proteins function based on their 3D shapes and sequences.

jromine


Presentation Transcript


  1. Prof Shoba Ranganathan, Dept of Chemistry & Biomolecular Sciences, Macquarie University, Sydney (shoba.ranganthan@mq.edu.au) • Molecular Modeling: building a 3D protein structure from its sequence

  2. Why protein structure? • In the factory of the living cell, proteins are the workers, performing a variety of tasks • Each protein adopts a particular folding pattern that determines its function • The 3D structure of a protein brings into close proximity residues that are far apart in the amino acid sequence

  3. How does a protein fold? • Most newly synthesized proteins fold without assistance! • Ribonuclease A: the denatured protein could refold and recover its activity (C. Anfinsen, 1966) • “Structure implies function” • The amino acid sequence encodes the protein’s structural information

  4. Understanding Protein Structure • A Quick Overview of Sequence Analysis • Finding a Structural Homologue • Template Selection • Aligning the Query Sequence to Template Structure(s) • Building the Model

  5. The basics • Proteins are linear heteropolymers: one or more polypeptide chains • Repeat units: 20 types of amino acid residue • Chain lengths range from a few tens to thousands of residues • Three-dimensional shapes (“folds”) adopted vary enormously • Experimental methods: X-ray crystallography, electron microscopy and NMR (nuclear magnetic resonance)

  6. The (L-)amino acid [Figure: backbone N, Cα, C and O atoms with the amino (+) and carboxylate (−) groups; side chain R = H, CH3, …]

  7. The peptide bond

  8. Coplanar atoms

  9. Levels of protein structure • Zeroth: amino acid composition • Primary • This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself
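The "zeroth" level, amino acid composition, is simple enough to compute directly. The sketch below is only an illustration: the function name `composition` and the example fragment are made up for this note and do not come from the slides.

```python
from collections import Counter


def composition(sequence):
    """Return the fraction of each residue type in a one-letter sequence."""
    seq = sequence.upper()
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


# Made-up fragment, used only to exercise the function.
example = "MKCCAAKK"
comp = composition(example)
for aa in sorted(comp):
    print(aa, round(comp[aa], 3))
```

Composition discards residue order entirely, which is why the primary structure (the sequence itself) is treated as the next level up.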

  10. Levels of protein structure • Secondary • Local organization of the protein backbone: α-helix, β-strand (which assemble into β-sheets), turn and interconnecting loop

  11. Ramachandran / phi-psi plot [Figure: φ on the horizontal axis, ψ on the vertical axis; labeled regions: β-sheet, α-helix (right-handed), α-helix (left-handed)]
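Each point on a phi-psi plot is a pair of backbone dihedral angles: φ is defined by the atoms C(i-1), N, Cα, C and ψ by N, Cα, C, N(i+1). The following is a minimal sketch of the dihedral calculation for four points in pure Python; the helper names are illustrative and the test coordinates are synthetic, not taken from a real structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Synthetic geometry: the first three points are fixed, the fourth is moved.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Applying `dihedral` to the appropriate backbone atoms of every residue gives the (φ, ψ) pairs that populate the plot.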

  12. Levels of protein structure • Tertiary • packing of secondary structure elements into a compact spatial unit • “Fold” or domain – this is the level to which structure prediction is currently possible

  13. Levels of protein structure • Quaternary • Assembly of homo- or heteromeric protein chains • Usually the functional unit of a protein, especially for enzymes

  14. Structural classes • All-α (helical) • All-β (sheet)

  15. Structural classes • α/β (parallel β-sheet) • α+β (antiparallel β-sheet)

  16. Structural information • Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics • http://www.rcsb.org/pdb • > 74,888 structures of proteins • Also contains structures of DNA, carbohydrates, protein-DNA complexes and numerous small ligand molecules.

  17. The PDB data • Text files • Each entry is identified by a unique 4-character code, e.g. 1emg • The 1emg entry contains: • Header information • Atomic coordinates in Å (1 Ångstrom = 1.0e-10 m)

  18. PDB Header details • identifies the molecule, any modifications, date of release of PDB entry • organism, keywords, method • Authors, reference, resolution if X-ray structure • Sequence, x-reference to sequence databases
      HEADER    GREEN FLUORESCENT PROTEIN    12-NOV-98   1EMG
      TITLE     GREEN FLUORESCENT PROTEIN (65-67 REPLACED BY CRO, S65T
      TITLE    2 SUBSTITUTION, Q80R)
      COMPND    MOL_ID: 1;
      COMPND   2 MOLECULE: GREEN FLUORESCENT PROTEIN;
      COMPND   3 CHAIN: A;
      COMPND   4 ENGINEERED: YES;
      COMPND   5 MUTATION: 65 - 67 REPLACED BY CRO, S65T SUBSTITUTION, Q80R
      COMPND   6 SUBSTITUTION;
      COMPND   7 BIOLOGICAL_UNIT: MONOMER

  19. The data itself • Coordinates for each heavy (non-hydrogen) atom from the first residue to the last • Any ligands (starting with HETATM) follow the biomacromolecule • O of water molecules (also HETATM) at the end
      ATOM      1  N   SER A   2      29.089   9.397  51.904  1.00 81.75
      ATOM      2  CA  SER A   2      27.883  10.162  52.185  1.00 79.71
      ATOM      3  C   SER A   2      26.659   9.634  51.463  1.00 82.64
      ATOM      4  O   SER A   2      26.718   8.686  50.686  1.00 81.02
      ATOM      5  CB  SER A   2      28.039  11.660  51.932  1.00 75.59
      ATOM      6  OG  SER A   2      27.582  12.038  50.639  1.00 43.28
      -------
      ATOM   1737  CD1 ILE A 229      39.535  21.584  52.346  1.00 41.62
      TER    1738      ILE A 229
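ATOM records are fixed-column text, so they are normally read by slicing character positions, not by splitting on whitespace. The sketch below assumes the standard PDB column layout; `parse_atom` and `pdb_atom_line` are illustrative helpers written for this note, and the second one exists only to lay the first two records of the slide back into columns.

```python
import math


def parse_atom(line):
    """Parse one fixed-column ATOM/HETATM record into a dict."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


def pdb_atom_line(serial, name, res, chain, seq, x, y, z, occ, b):
    """Lay values out in ATOM columns (atom-name alignment is simplified)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {res:>3s} {chain}{seq:>4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


# First two atoms of the 1EMG excerpt shown on the slide.
n_atom = parse_atom(pdb_atom_line(1, "N", "SER", "A", 2,
                                  29.089, 9.397, 51.904, 1.00, 81.75))
ca_atom = parse_atom(pdb_atom_line(2, "CA", "SER", "A", 2,
                                   27.883, 10.162, 52.185, 1.00, 79.71))

# Distance between the backbone N and CA atoms, in Angstrom.
n_ca_distance = math.dist(n_atom["xyz"], ca_atom["xyz"])
print(round(n_ca_distance, 3))
```

The resulting N-Cα distance comes out close to a typical single-bond length, which is a quick sanity check that the coordinate columns were read correctly.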

  20. Structural Families • SCOP - Structural Classification Of Proteins • http://scop.mrc-lmb.cam.ac.uk/scop • FSSP – Family of Structurally Similar Proteins • http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-lib+FSSP • CATH – Class, Architecture, Topology, Homology • http://www.cathdb.info/

  21. Structure comparison facts • Proteins adopt a limited number of topologies. • Homologous sequences show very similar structures, with strong conservation in secondary structural elements: variations in non-conserved regions. • In the absence of sequence homology, some folds are preferred by vastly different sequences.

  22. Structure comparison facts • The “active site” (a collection of functionally critical residues) is remarkably conserved, even when the protein fold is different. • Structural models (especially those based on homology) provide insights into possible function for new proteins. • Implications for • protein engineering • ligand/drug design, • function assignment of genomic data.

  23. Visualizing PDB information • RASMOL: most popular, available for all platforms (Sayle et al, 2005) http://www.bernstein-plus-sons.com/software/rasmol • DeepView Swiss-PDBViewer: from Swiss-Prot (Guex & Peitsch, 1997) http://spdbv.vital-it.ch/ • Chemscape Chime Plug-in: for PC and Mac http://accelrys.com/resource-center/downloads/freeware/ • PyMOL: Available for all platforms (DeLano, W.L. The PyMOL Molecular Graphics System, 2002) http://www.pymol.org/ • ICM: Very good, available for all platforms (Abagyan et al, 1994) http://www.molsoft.com/index.html

  24. RASMOL views - SH2 domain • All-atom model • Space-filling model • Atom colors: N O C S

  25. RASMOL views – 1sha • Cα trace • Ribbon • Rainbow coloring: N to C • Coloring: by structural units

  26. Homologous folds • Hemoglobin and erythrocruorin: 31% sequence identity

  27. Analogous folds • Hemoglobin and phycocyanin: 9% sequence identity

  28. Surface Properties Cro repressor – DNA complex • Basic residues in blue • Acidic residues in red

  29. Mapping Functional Regions Immunoglobulin λ light chain - dimer • Hydrophobic residues in magenta • Hydrophilic and charged residues in cyan

  30. Understanding Protein Structure • A Quick Overview of Sequence Analysis • Finding a Structural Homologue • Template Selection • Aligning the Query Sequence to Template Structure(s) • Building the Model

  31. Siblings and Cousins • Siblings or homologues: sequences with at least 30% sequence identity over an alignment length of at least 125 residues and conservation of function. • Cousins or paralogues: < 30% identity but with conservation of function • Both show structural conservation • Homologues located using a database search tool such as BLAST (free webserver): http://www.ncbi.nlm.nih.gov/BLAST • Paralogues require a more sensitive method such as PSI-BLAST
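The sibling/cousin rule of thumb on this slide can be written as a small decision function. This is a sketch of the thresholds as stated above (30% identity, 125 aligned residues, conserved function); the function name, the "unclassified" fallback and the example alignment lengths are assumptions made for illustration.

```python
def classify_relationship(identity_pct, aligned_length, function_conserved):
    """Apply the slide's sibling/cousin thresholds to one pairwise alignment."""
    if not function_conserved:
        return "unclassified"
    if identity_pct >= 30.0 and aligned_length >= 125:
        return "sibling"
    if identity_pct < 30.0:
        return "cousin"
    # High identity over a short alignment: the rule does not decide this case.
    return "unclassified"


# Identity values from the fold-comparison slides; the alignment lengths
# are hypothetical placeholders.
print(classify_relationship(31.0, 140, True))
print(classify_relationship(9.0, 140, True))
print(classify_relationship(45.0, 40, True))
```

The last call shows why the length condition matters: a high identity over a short stretch is not treated as evidence for a sibling relationship.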

  32. Multiple Sequence Alignment Finding the best way to match the residues of related sequences • Identical residues must be lined up • The rest should be arranged, based on • observed substitution in protein families • chemical similarity • charge similarity • Where it is impossible to get the residues to line up, the biological concept of insertion/deletion is invoked: the ‘gap’ in alignments

  33. MSA Methods • CLUSTALW / CLUSTALX (Thompson et al, 1997): freely available for all platforms and one of the best alignment programs http://bips.u-strasbg.fr/fr/Documentation/ClustalX/ • MAXHOM (Sander & Schneider, 1991): alignment based on maximum homology; available via the PredictProtein webserver, free for academics http://www.predictprotein.org/ • MALIGN (Johnson et al, 1994): freely available web server that uses MALIGN, based on the structural alignment of protein families http://caps.ncbs.res.in/iws/malign.html

  34. Alignment Checks • Conservation of functionally important residues: e.g. the catalytic triad (Asp-Ser-His) that is essential for serine proteinase activity • Line up of structurally important residues: e.g. cysteines forming disulfide bonds • Overall, maximizing the alignment of “like” residues • Completely conserved residues usually indicate some conserved structural or functional role, especially buried charges

  35. Sequence Motifs & Patterns • From the analysis of the alignment of protein families • Conserved sequence features, usually associated with a specific function • PROSITE (Hulo et al, 2006) database for protein “signature” patterns: http://prosite.expasy.org/

  36. Aligned Sequence Families • From alignments of homologous sequences: • PRINTS • PRODOM: http://prodom.prabi.fr/prodom/current/html/home.php • From Hidden Markov Model based methods: • PFAM: http://pfam.sanger.ac.uk/

  37. Protein Domains • Most proteins are composed of structural subunits called domains • A domain is a compact unit of protein structure, usually associated with a function. • It is usually a “fold” - in the case of monomeric soluble proteins. • A domain comprises normally only one protein chain: rare examples involving 2 chains are known. • Domains can be shared between different proteins: like a LEGO block

  38. Protein Architectures • Beads-on-a-string: sequential location: tyrosine-protein kinase receptor TIE-1 (immunoglobulin, EGF, fibronectin type-3 and protein kinase). • Domain insertions: “plugged-in” - pyruvate kinase (1pyk) • SMART: smart.embl-heidelberg.de Simple Modular Architecture Retrieval Tool

  39. Dissection into Domains • A sequence, usually > 125 residues should be routinely checked to see how many domains are present. • Conserved Domain Architecture Retrieval Tool (CDART) uses information in Pfam and SMART to assign domains along a sequence • E.g. NP_002917 shows similarity to G-protein regulators:

  40. Understanding Protein Structure • A Quick Overview of Sequence Analysis • Finding a Structural Homologue • Template Selection • Aligning the Query Sequence to Template Structure(s) • Building the Model

  41. Structural Homologues • BLASTP vs. PDB database or PSI-BLAST: look for 4-character PDB ID • E < 0.005 • Domain coverage: at least 60% coverage is recommended • Gaps: we don’t want them. Choose between: • few gaps and reasonable similarity scores or • lots of gaps and high similarity scores?
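The screening criteria on this slide can be applied mechanically once the search results are in hand. The sketch below assumes the hits have already been parsed into simple records; the `Hit` fields, the PDB-style IDs and all numbers are hypothetical, coverage is simplified to aligned residues divided by query length, and ranking by fewest gaps first is only one possible answer to the trade-off the slide leaves open.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_id: str            # 4-character PDB ID of the hit (made up here)
    evalue: float          # E-value reported by the search
    aligned_residues: int  # query residues covered by the alignment
    gaps: int              # number of gap positions in the alignment


def select_templates(hits, query_length, max_evalue=0.005, min_coverage=0.60):
    """Keep hits with E < max_evalue and coverage >= min_coverage."""
    kept = [
        h for h in hits
        if h.evalue < max_evalue
        and h.aligned_residues / query_length >= min_coverage
    ]
    # Assumption: prefer fewer gaps, then the lower E-value.
    return sorted(kept, key=lambda h: (h.gaps, h.evalue))


# Hypothetical search results for a 200-residue query.
hits = [
    Hit("1abc", 1e-30, 180, 12),
    Hit("2xyz", 1e-20, 150, 2),
    Hit("3def", 0.5, 190, 0),     # rejected: E-value too high
    Hit("4ghi", 1e-40, 90, 0),    # rejected: coverage below 60%
]
selected = [h.pdb_id for h in select_templates(hits, query_length=200)]
print(selected)
```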

  42. Small Proteins: Disulfide bonds • BLAST-type methods may not locate homologues, if Conserved Domain search is not turned on. • Are the Cys residues conserved? • Gaps: where are they on the structure?
      gnl|Pfam|pfam00095, wap, WAP-type (Whey Acidic Protein) 'four-disulfide core'. CD-Length = 46 residues, 100.0% aligned
      Score = 43.9 bits (102), Expect = 1e-06
      Q: 49 KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP 97
      D:  1 KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP 46
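The two questions asked on this slide (are the Cys residues conserved, and where are the gaps) can be checked programmatically on the query/domain alignment shown above. A minimal sketch follows, with the two aligned strings copied from the slide; the function name and the returned keys are illustrative.

```python
QUERY = "KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP"
DOMAIN = "KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP"


def alignment_stats(a, b, gap="-"):
    """Count identities, gap columns and conserved cysteines in a pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(a, b))
    gap_columns = [i for i, (x, y) in enumerate(columns) if gap in (x, y)]
    identical = sum(1 for x, y in columns if x == y and x != gap)
    ungapped = len(columns) - len(gap_columns)
    return {
        "identical": identical,
        "gap_columns": gap_columns,  # 0-based alignment columns
        "identity_pct": 100.0 * identical / ungapped,
        "cys_in_query": a.count("C"),
        "cys_conserved": sum(1 for x, y in columns if x == y == "C"),
    }


stats = alignment_stats(QUERY, DOMAIN)
print(stats)
```

Here identity is computed over ungapped columns only; other denominators (full alignment length, shorter sequence length) are also in common use, so the percentage should always be reported together with the convention.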

  43. Metal-binding domains C2H2 Zinc Finger • 2 Cys & 2 His binding to Zinc • Not detected even by CD-search in BLAST • Detected by Pfam & SMART • Sequence Pattern: #-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
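A sequence pattern of this kind translates almost directly into a regular expression. The sketch below makes two assumptions that the slide does not state: that `#` stands for a hydrophobic residue, and that the hydrophobic set is `AVILMFYW`. The test sequence is a synthetic finger-like peptide written to fit the pattern, not a database entry.

```python
import re

# Assumption: '#' in the slide's pattern means "any hydrophobic residue".
HYDROPHOBIC = "[AVILMFYW]"

# #-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
C2H2_PATTERN = re.compile(
    HYDROPHOBIC + r".C.{1,5}C.{3}"
    + HYDROPHOBIC + r".{5}"
    + HYDROPHOBIC + r".{2}H.{3,6}[HC]"
)


def find_fingers(sequence):
    """Return (start, matched_text) for each non-overlapping pattern match."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(sequence)]


# Synthetic peptide built to satisfy the pattern, flanked by two linker residues.
finger = "YKCPECGKSFSQKSNLQKHQRTH"
matches = find_fingers("GS" + finger + "GS")
print(matches)
```

Because `finditer` reports non-overlapping matches only, tandem fingers that share residues would need a lookahead-based variant of the pattern.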

  44. Structure Prediction Methods • Secondary Structure Prediction: identify local structural elements such as helices, strands and loops. • > 75% accuracy achievable • PredictProtein or PHD http://www.predictprotein.org/ • PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/ • SSPro http://scratch.proteomics.ics.uci.edu/

  45. Folds from Secondary Structure Predictions • Assembling SSEs into folds is a combinatorial problem • Current methods depend on available structural data for mapping predictions: • FORESST http://abs.cit.nih.gov/foresst/foresst.html • TOPITS from the PHD server http://www.rostlab.org/papers/1995_topits/

  46. Tertiary Structure Prediction • Fold recognition/Threading: < 20% identity typically • Best results obtained by combining several database search and knowledge-based tools: • 3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/ • FUGUE http://tardis.nibio.go.jp/fugue/

  47. Understanding Protein Structure • A Quick Overview of Sequence Analysis • Finding a Structural Homologue • Template Selection • Aligning the Query Sequence to Template Structure(s) • Building the Model

  48. One or many templates? • Sequence similarity: extract template sequences and align with query: select the most similar structure • Completeness: Missing data?
      REMARK 465 MISSING RESIDUES
      REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE
      REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
      REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)
      REMARK 465
      REMARK 465   M RES C SSSEQI
      REMARK 465     MET A     1
      REMARK 465     THR A   230
      REMARK 470   M RES CSSEQI  ATOMS
      REMARK 470     GLU A   5    OE2
      REMARK 470     GLU A   6    CG   CD   OE1  OE2
      REMARK 470     GLU A  17    OE1
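Template completeness can be checked by listing the residues reported as missing in the REMARK 465 records. The sketch below splits on whitespace for brevity, so it would miss rows where an insertion code is fused to the residue number; the function name is illustrative and the input lines are the ones shown on the slide.

```python
def missing_residues(lines):
    """Collect (residue, chain, number) tuples from REMARK 465 data rows."""
    found = []
    for line in lines:
        tokens = line.split()
        if tokens[:2] != ["REMARK", "465"]:
            continue
        fields = tokens[2:]
        if len(fields) == 4:      # optional leading model number
            fields = fields[1:]
        if len(fields) != 3:      # free-text lines and blank remarks
            continue
        res, chain, seq = fields
        if len(res) == 3 and res.isupper() and len(chain) == 1:
            try:
                found.append((res, chain, int(seq)))
            except ValueError:
                pass              # e.g. the column-header row
    return found


remarks = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE",
    "REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN",
    "REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)",
    "REMARK 465",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     THR A   230",
    "REMARK 470     GLU A   5    OE2",
]
missing = missing_residues(remarks)
print(missing)
```

A template with missing residues in a region the query needs is a reason to bring in a second template for that region.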

  49. One or many templates? • X-ray or NMR?: • X-ray structure with the lowest resolution value in Å (i.e. the best-resolved structure) • X-ray first, then NMR • NMR average over assembly • One or many?: • Structure alignment of Cα atoms • If 2 templates are very close, keep only one • Keep templates that provide new information
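The "keep only one of two very close templates" rule can be sketched with a root-mean-square deviation (RMSD) over equivalent Cα atoms. This example assumes the templates have already been structurally superposed and that their Cα atoms are paired one-to-one; the 0.5 Å cutoff, the template names and the coordinates are all hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of pre-superposed (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def drop_redundant(templates, cutoff=0.5):
    """Greedily keep templates that differ from every kept one by more than cutoff."""
    kept = {}
    for name, coords in templates.items():
        if all(rmsd(coords, other) > cutoff for other in kept.values()):
            kept[name] = coords
    return list(kept)


# Toy Calpha traces (Angstrom); B is a near copy of A, C bends away.
templates = {
    "A": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "B": [(0.0, 0.1, 0.0), (3.8, 0.1, 0.0), (7.6, 0.1, 0.0)],
    "C": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)],
}
kept_names = drop_redundant(templates)
print(kept_names)
```

The greedy pass depends on input order, so putting the preferred template (for example the best-resolved X-ray structure) first keeps it as the representative of its cluster.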

  50. Many templates • Sequence alignment from structure comparison of templates (SSA) can be different from a simple sequence alignment (SA). • For model building, • align templates structurally • extract the corresponding SSA
