Biomolecular Modeling: building a 3D protein structure from its sequence

Prof Shoba Ranganathan, Dept. of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, Australia & Dept of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore (shoba@bic.nus.edu.sg)

Why protein structure? • In the factory of the living cell, proteins are the workers, performing a variety of tasks • Each protein adopts a particular folding pattern that determines its function • The 3D structure of a protein brings into close proximity residues that are far apart in the amino acid sequence

How does a protein fold? • Most newly synthesized proteins fold without assistance! • Ribonuclease A: denatured protein could refold and recover its activity (C. Anfinsen -1966) • “Structure implies function” • The amino acid sequence encodes the protein’s structural information

Understanding Protein Structure • A Quick Overview of Sequence Analysis • Finding a Structural Homologue • Template Selection • Aligning the Query Sequence to Template Structure(s) • Building the Model

The basics • Proteins are linear heteropolymers: one or more polypeptide chains • Repeat units: amino acid residues of 20 types • Chain lengths range from a few tens to thousands of residues • Three-dimensional shapes (“folds”) adopted vary enormously • Experimental methods: X-ray crystallography, electron microscopy and NMR (nuclear magnetic resonance)

The (L-)amino acid [figure: backbone N, Cα, C with amino (+) and carboxylate (−) groups; side chain R = H, CH3, …]

The peptide bond

Coplanar atoms

Levels of protein structure • Zeroth: amino acid composition • Primary • This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself
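The "zeroth" level, amino acid composition, can be computed directly from a one-letter sequence. The following is a minimal sketch; the function name `composition` is illustrative, and the test fragment is the query row of the WAP-domain alignment shown later in these slides.

```python
from collections import Counter


def composition(sequence):
    """Return the fraction of each residue type in a one-letter sequence."""
    seq = sequence.upper()
    if not seq:
        return {}
    counts = Counter(seq)
    total = len(seq)
    # Sort by residue letter so the output order is stable.
    return {aa: n / total for aa, n in sorted(counts.items())}


# Query fragment copied from the WAP-domain alignment in these slides.
fragment = "KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP"
comp = composition(fragment)
print(f"Cys fraction: {comp['C']:.3f}")
```

Composition discards the residue order, which is why it sits below the primary structure in this hierarchy.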

Levels of protein structure • Secondary • Local organization of the protein backbone: α-helix, β-strand (which assemble into β-sheets), turn and interconnecting loop

Ramachandran / phi-psi plot [figure: axes φ and ψ; labelled regions: β-sheet, α-helix (right handed), α-helix (left handed)]
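Each point on a Ramachandran plot is a pair of backbone dihedral angles. As a rough sketch of how such an angle can be obtained from four atomic positions, the helper below (the name `dihedral` and the toy coordinates are illustrative, not taken from the slides) returns the signed torsion in degrees. For residue i, φ uses the atoms C(i-1), N(i), Cα(i), C(i) and ψ uses N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy geometry: the central bond lies along z.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Applying this to every residue of a structure and plotting ψ against φ gives the plot described above.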

Levels of protein structure • Tertiary • packing of secondary structure elements into a compact spatial unit • “Fold” or domain – this is the level to which structure prediction is currently possible

Levels of protein structure • Quaternary • Assembly of homo- or heteromeric protein chains • Usually the functional unit of a protein, especially for enzymes

Structural classes • All-α (helical) • All-β (sheet)

Structural classes • α/β (parallel β-sheet) • α+β (antiparallel β-sheet)

Structural information • Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics • http://www.rcsb.org/pdb • > 45,744 structures of proteins • Also contains structures of DNA, carbohydrates, protein-DNA complexes and numerous small ligand molecules.

The PDB data • Text files • Each entry is identified by a unique 4-character code: say 1emg • 1emg entry • Header information • Atomic coordinates in Å (1 Ångstrom = 1.0e-10 m)

PDB Header details • identifies the molecule, any modifications, date of release of PDB entry • organism, keywords, method • Authors, reference, resolution if X-ray structure • Sequence, cross-reference to sequence databases
HEADER    GREEN FLUORESCENT PROTEIN    12-NOV-98   1EMG
TITLE     GREEN FLUORESCENT PROTEIN (65-67 REPLACED BY CRO, S65T
TITLE    2 SUBSTITUTION, Q80R)
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: GREEN FLUORESCENT PROTEIN;
COMPND   3 CHAIN: A;
COMPND   4 ENGINEERED: YES;
COMPND   5 MUTATION: 65 - 67 REPLACED BY CRO, S65T SUBSTITUTION, Q80R
COMPND   6 SUBSTITUTION;
COMPND   7 BIOLOGICAL_UNIT: MONOMER

The data itself • Coordinates for each heavy (non-hydrogen) atom from the first residue to the last • Any ligands (starting with HETATM) follow the biomacromolecule • O of water molecules (also HETATM) at the end
ATOM      1  N   SER A   2      29.089   9.397  51.904  1.00 81.75
ATOM      2  CA  SER A   2      27.883  10.162  52.185  1.00 79.71
ATOM      3  C   SER A   2      26.659   9.634  51.463  1.00 82.64
ATOM      4  O   SER A   2      26.718   8.686  50.686  1.00 81.02
ATOM      5  CB  SER A   2      28.039  11.660  51.932  1.00 75.59
ATOM      6  OG  SER A   2      27.582  12.038  50.639  1.00 43.28
-------
ATOM   1737  CD1 ILE A 229      39.535  21.584  52.346  1.00 41.62
TER    1738      ILE A 229
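ATOM records are fixed-column text, so they are usually read by slicing character positions rather than by splitting on whitespace. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, coordinates in 31-54, and so on); the `Atom` tuple and `parse_atom_line` are illustrative names, and the two records are the first two from the listing above, re-spaced to that layout.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_line(line):
    """Parse one ATOM/HETATM record using fixed PDB columns (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )


records = [
    "ATOM      1  N   SER A   2      29.089   9.397  51.904  1.00 81.75",
    "ATOM      2  CA  SER A   2      27.883  10.162  52.185  1.00 79.71",
]

# Keep only protein atoms; HETATM lines (ligands, water) would be skipped here.
atoms = [parse_atom_line(r) for r in records if r.startswith("ATOM")]
for atom in atoms:
    print(atom.name, atom.res_name, atom.res_seq, atom.x, atom.y, atom.z)
```

The last two numeric fields are the occupancy and the temperature (B) factor. For real work, an established parser is a safer choice than hand-rolled slicing.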

Structural Families • SCOP - Structural Classification Of Proteins • http://scop.mrc-lmb.cam.ac.uk/scop • FSSP – Family of Structurally Similar Proteins • http://www.ebi.ac.uk/dali/fssp/ • CATH – Class, Architecture, Topology, Homology • http://www.biochem.ucl.ac.uk/bsm/cath

Structure comparison facts • Proteins adopt a limited number of topologies. • Homologous sequences show very similar structures, with strong conservation in secondary structural elements: variations in non-conserved regions. • In the absence of sequence homology, some folds are preferred by vastly different sequences.

Structure comparison facts • The “active site” (a collection of functionally critical residues) is remarkably conserved, even when the protein fold is different. • Structural models (especially those based on homology) provide insights into possible function for new proteins. • Implications for • protein engineering • ligand/drug design, • function assignment of genomic data.

Visualizing PDB information • RASMOL: most popular, available for all platforms (Sayle et al, 2005) http://www.bernstein-plus-sons.com/software/rasmol • DeepView Swiss-PDBViewer: from Swiss-Prot (Guex & Peitsch, 1997) http://tw.expasy.org/spdbv/ • Chemscape Chime Plug-in: for PC and Mac http://www.mdli.com/products/framework/chemscape • PyMOL: Very good, available for all platforms (DeLano, W.L. The PyMOL Molecular Graphics System, 2002) http://pymol.sourceforge.net

RASMOL views - SH2 domain • All-atom model • Space-filling model • Atom colors: N, O, C, S

RASMOL views – 1sha • Cα trace • Ribbon • Rainbow coloring: N to C • Coloring: by structural units

Homologous folds • Hemoglobin and erythrocruorin: 31% sequence identity

Analogous folds • Hemoglobin and phycocyanin: 9% sequence identity

Surface Properties Cro repressor – DNA complex • Basic residues in blue • Acidic residues in red

Mapping Functional Regions Immunoglobulin λ light chain - dimer • Hydrophobic residues in magenta • Hydrophilic and charged residues in cyan

Siblings and Cousins • Siblings or homologues: sequences with at least 30% sequence identity over an alignment length of at least 125 residues and conservation of function. • Cousins or paralogues: < 30% identity but with conservation of function • Both show structural conservation • Homologues located using a database search tool such as BLAST (free webserver): http://www.ncbi.nlm.nih.gov/BLAST • Paralogues require a more sensitive method such as PSI-BLAST
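The sibling/cousin rule above can be written down as a small decision function. This is a sketch only: identity is computed here over aligned, non-gap columns (definitions of percent identity vary between tools), conservation of function is passed in as a flag because it cannot be derived from the alignment, and the names `percent_identity` and `relationship` are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (percent identity, aligned length) over non-gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs), len(pairs)


def relationship(identity, aligned_length, same_function):
    """Apply the thresholds from the slide: 30% identity, 125 residues."""
    if not same_function:
        return "unclassified"
    if identity >= 30.0 and aligned_length >= 125:
        return "sibling"
    if identity < 30.0:
        return "cousin"
    return "unclassified"  # high identity but too short an alignment


# Hypothetical values, for illustration only.
print(relationship(42.0, 180, same_function=True))
print(relationship(18.0, 180, same_function=True))
print(percent_identity("AC-D", "ACED"))
```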

Multiple Sequence Alignment Finding the best way to match the residues of related sequences • Identical residues must be lined up • The rest should be arranged, based on • observed substitution in protein families • chemical similarity • charge similarity • Where it is impossible to get the residues to line up, the biological concept of insertion/deletion is invoked: the ‘gap’ in alignments

MSA Methods • CLUSTALW / CLUSTALX (Thompson et al, 1997): freely available for all platforms and one of the best alignment programs http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html • MAXHOM (Sander & Schneider, 1991): alignment based on maximum homology; available via the PredictProtein webserver, free for academics http://cubic.bioc.columbia.edu/predictprotein/ • MALIGN (Johnson et al, 1994): freely available UNIX program, based on the structural alignment of protein families http://www.abo.fi/fak/mnf/bkf/research/johnson/software.html

Alignment Checks • Conservation of functionally important residues: e.g. the catalytic triad (Asp-Ser-His) that is essential for serine proteinase activity • Line up of structurally important residues: e.g. cysteines forming disulfide bonds • Overall, maximizing the alignment of “like” residues • Completely conserved residues usually indicate some conserved structural or functional role, especially buried charges

Sequence Motifs & Patterns • From the analysis of the alignment of protein families • Conserved sequence features, usually associated with a specific function • PROSITE (Hulo et al, 2006) database for protein “signature” patterns: http://www.expasy.ch/prosite

Aligned Sequence Families • From alignments of homologous sequences: • PRINTS • PRODOM: http://www.toulouse.inra.fr/prodom.html • From Hidden Markov Model based methods: • PFAM: http://www.sanger.ac.uk/Pfam

Protein Domains • Most proteins are composed of structural subunits called domains • A domain is a compact unit of protein structure, usually associated with a function. • It is usually a “fold” - in the case of monomeric soluble proteins. • A domain comprises normally only one protein chain: rare examples involving 2 chains are known. • Domains can be shared between different proteins: like a LEGO block

Protein Architectures • Beads-on-a-string: sequential location: tyrosine-protein kinase receptor TIE-1 (immunoglobulin, EGF, fibronectin type-3 and protein kinase). • Domain insertions: “plugged-in” - pyruvate kinase (1pyk) • SMART: smart.embl-heidelberg.de Simple Modular Architecture Retrieval Tool

Dissection into Domains • A sequence of more than about 125 residues should be routinely checked to see how many domains are present. • Conserved Domain Architecture Retrieval Tool (CDART) uses information in Pfam and SMART to assign domains along a sequence • E.g. NP_002917 shows similarity to G-protein regulators.

Structural Homologues • BLASTP vs. PDB database or PSI-BLAST: look for 4-character PDB ID • E < 0.005 • Domain coverage: at least 60% coverage is recommended • Gaps: we don’t want them. Choose between: • few gaps and reasonable similarity scores or • lots of gaps and high similarity scores?
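The two numeric criteria above (E < 0.005 and at least 60% coverage) lend themselves to a simple filter over search hits. The sketch below assumes each hit has already been reduced to an E-value and the query range it covers; the `Hit` class, the function names and the four hits (including their PDB-style IDs) are invented for illustration and do not come from a real search.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pdb_id: str
    e_value: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def coverage(hit, domain_length):
    """Fraction of the query domain covered by the aligned region."""
    return (hit.query_end - hit.query_start + 1) / domain_length


def select_templates(hits, domain_length, max_e=0.005, min_coverage=0.60):
    """Keep hits passing both thresholds, best E-value first."""
    kept = [h for h in hits
            if h.e_value < max_e and coverage(h, domain_length) >= min_coverage]
    return sorted(kept, key=lambda h: h.e_value)


# Made-up hits against a 150-residue query domain.
hits = [
    Hit("1aaa", 1e-20, 5, 120),   # passes both tests
    Hit("2bbb", 1e-30, 1, 60),    # strong E-value but only 40% coverage
    Hit("3ccc", 0.01, 1, 150),    # full coverage but E-value too weak
    Hit("4ddd", 1e-4, 10, 140),   # passes both tests
]
selected = [h.pdb_id for h in select_templates(hits, domain_length=150)]
print(selected)
```

The gap question raised on the slide is not captured here; it still needs a look at the alignment itself.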

Small Proteins: Disulfide bonds • BLAST-type methods may not locate homologues, if Conserved Domain search is not turned on. • Are the Cys residues conserved? • Gaps: where are they on the structure?
gnl|Pfam|pfam00095, wap, WAP-type (Whey Acidic Protein) 'four-disulfide core'. CD-Length = 46 residues, 100.0% aligned
Score = 43.9 bits (102), Expect = 1e-06
Q: 49 KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP 97
D:  1 KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP 46
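The two questions on this slide (are the Cys residues conserved, and where are the gaps?) can be answered mechanically from the aligned rows. The sketch below uses the query (Q) and domain (D) rows of the alignment above; the helper names are illustrative, and column numbers are 1-based positions in the alignment, not residue numbers.

```python
query = "KAGFCPWNLLQMISSTGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPP"
domain = "KPGVCPWVSISE---AGQCLELNPCQSDEECPGNKKCCPGSCGMSCLTP"


def conserved_columns(row_a, row_b, residue="C"):
    """1-based alignment columns where both rows carry the given residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return [i for i, (a, b) in enumerate(zip(row_a, row_b), start=1)
            if a == residue and b == residue]


def gap_columns(row_a, row_b):
    """1-based alignment columns where either row has a gap."""
    return [i for i, (a, b) in enumerate(zip(row_a, row_b), start=1)
            if "-" in (a, b)]


cys = conserved_columns(query, domain)
gaps = gap_columns(query, domain)
print("conserved Cys columns:", cys)
print("gap columns:", gaps)
```

All eight cysteines of the domain line up with cysteines in the query, which is what a four-disulfide core requires, and the single three-residue gap falls between the first and second Cys.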

Metal-binding domains C2H2 Zinc Finger • 2 Cys & 2 His binding to Zinc • Not detected even by CD-search in BLAST • Detected by Pfam & SMART • Sequence Pattern: #-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
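A sequence pattern of this kind maps naturally onto a regular expression. The sketch below makes two assumptions that the slide does not state: `#` is taken to mean a hydrophobic residue, represented here by the set AVILMFYW, and `X` is taken to mean any residue. The test sequence is constructed by hand to satisfy the pattern; it is not a real protein.

```python
import re

# Assumption: '#' in the slide's pattern stands for a hydrophobic residue.
HYD = "[AVILMFYW]"

# #-X-C-X(1-5)-C-X3-#-X5-#-X2-H-X(3-6)-[H/C]
C2H2 = re.compile(
    HYD + r".C.{1,5}C.{3}" + HYD + r".{5}" + HYD + r".{2}H.{3,6}[HC]"
)


def find_fingers(sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in C2H2.finditer(sequence)]


# Hand-built example: two flanking residues on each side of one match.
toy = "MS" + "YKCPECGKSFSQKSNLQKHQRTH" + "GE"
found = find_fingers(toy)
print(found)

# Replacing the two cysteines removes the match.
print(find_fingers(toy.replace("C", "A")))
```

A plain regular expression only tests the spacing of the key residues; profile-based tools such as Pfam and SMART, named on the slide, also weigh the residues in between.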

Structure Prediction Methods • Secondary Structure Prediction: identify local structural elements such as helices, strands and loops. • > 75% accuracy achievable • PredictProtein or PHD http://cubic.bioc.columbia.edu/pp/ • PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/ • SSPro http://promoter.ics.uci.edu/BRNN-PRED/

Folds from Secondary Structure Predictions • Assembling SSEs into folds is a combinatorial problem • Current methods depend on available structural data for mapping predictions: • FORESST http://abs.cit.nih.gov/foresst/foresst.html • TOPITS from the PHD server http://cubic.bioc.columbia.edu/pp

Tertiary Structure Prediction • Fold recognition/Threading: < 20% identity typically • Best results obtained by combining several database search and knowledge-based tools: • 3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/ • FUGUE http://www-cryst.bioc.cam.ac.uk/fugue/

One or many templates? • Sequence similarity: extract template sequences and align with query: select the most similar structure • Completeness: Missing data?
REMARK 465 MISSING RESIDUES
REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE
REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)
REMARK 465
REMARK 465   M RES C SSSEQI
REMARK 465     MET A     1
REMARK 465     THR A   230
REMARK 470   M RES CSSEQI  ATOMS
REMARK 470     GLU A   5    OE2
REMARK 470     GLU A   6    CG   CD   OE1  OE2
REMARK 470     GLU A  17    OE1
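Checking a template for completeness can start from the REMARK 465 block shown above. The following sketch pulls out the missing residues by whitespace splitting; it assumes simple records without insertion codes, treats a leading integer as an optional model number, and the function name `missing_residues` is illustrative.

```python
def missing_residues(pdb_lines):
    """Return (residue name, chain, residue number) for REMARK 465 entries."""
    missing = []
    for line in pdb_lines:
        if not line.startswith("REMARK 465"):
            continue
        fields = line.split()[2:]
        if len(fields) == 4 and fields[0].isdigit():
            fields = fields[1:]  # drop the optional model number
        if len(fields) != 3:
            continue  # explanatory header text, not a residue entry
        res_name, chain, seq = fields
        try:
            missing.append((res_name, chain, int(seq)))
        except ValueError:
            continue  # e.g. the 'RES C SSSEQI' column header
    return missing


header = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465 THE FOLLOWING RESIDUES WERE NOT LOCATED IN THE",
    "REMARK 465 EXPERIMENT. (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN",
    "REMARK 465 IDENTIFIER; SSSEQ=SEQUENCE NUMBER; I=INSERTION CODE.)",
    "REMARK 465",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     THR A   230",
    "REMARK 470     GLU A   5    OE2",
]
gaps_in_template = missing_residues(header)
print(gaps_in_template)
```

Here the missing residues are the first and last of the chain, which would be consistent with the coordinate listing earlier in these slides running from residue 2 to residue 229. Missing residues in the middle of a template matter more, since the model has no coordinates to copy there.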

One or many templates? • X-ray or NMR?: • X-ray structure with the best resolution (lowest value in Å) • X-ray first, then NMR • NMR: average over the ensemble • One or many?: • Structure alignment of Cα atoms • If 2 templates are very close, keep only one • Keep templates that provide new information

Many templates • Sequence alignment from structure comparison of templates (SSA) can be different from a simple sequence alignment (SA). • For model building, • align templates structurally • extract the corresponding SSA
