Introduction to Proteomics and Protein Structure Modeling BMI 705

Introduction to Proteomics and Protein Structure Modeling BMI 705 Kun Huang Department of Biomedical Informatics Ohio State University

Review of Protein Structure (5 min) Introduction to Proteomics (10 min) Protein Structure Database and Classification (15 min) Protein Structure Prediction (15 min) 3-D Alignment (left for next lab session)

Review of Biology – Protein Structure Obtaining 3-D structure (Computation)

Review of Biology – Protein Structure Levels of structure

Review of Biology – Protein Topology

Review of Biology – Protein Structure Obtaining 3-D structure

Review of Biology – Protein Structure Obtaining 3-D structure (NMR)

Review of Biology – Protein Structure Obtaining 3-D structure (Bioinformatics)

Review of Biology – Protein Structure 3-D structure (dynamics / computation) Subdomain Rearrangement in HIV-1 Reverse Transcriptase

Review of Biology – Protein Structure 3-D structure (modulation): binding with ligand, methylation, phosphorylation, glycosylation, ubiquitination, etc.

Post-Translational Modification (PTM)
• PTMs involving addition include:
• acetylation, the addition of an acetyl group, usually at the N-terminus of the protein
• alkylation, the addition of an alkyl group (e.g. methyl, ethyl)
• methylation, the addition of a methyl group, usually at lysine or arginine residues (a type of alkylation)
• biotinylation, acylation of conserved lysine residues with a biotin appendage
• glutamylation, covalent linkage of glutamic acid residues to tubulin and some other proteins
• glycylation, covalent linkage of one to more than 40 glycine residues to the tubulin C-terminal tail
• glycosylation, the addition of a glycosyl group to either asparagine, hydroxylysine, serine, or threonine, resulting in a glycoprotein
• isoprenylation, the addition of an isoprenoid group (e.g. farnesol and geranylgeraniol)
• lipoylation, attachment of a lipoate functionality
• phosphopantetheinylation, the addition of a 4'-phosphopantetheinyl moiety from coenzyme A, as in fatty acid, polyketide, non-ribosomal peptide and leucine biosynthesis
• phosphorylation, the addition of a phosphate group, usually to serine, tyrosine, threonine or histidine
• sulfation, the addition of a sulfate group to a tyrosine
• selenation
• C-terminal amidation

Post-Translational Modification (PTM)
• PTMs involving addition of other proteins or peptides:
• ISGylation, the covalent linkage to the ISG15 protein (Interferon-Stimulated Gene 15)
• SUMOylation, the covalent linkage to the SUMO protein (Small Ubiquitin-related MOdifier)
• ubiquitination, the covalent linkage to the protein ubiquitin
• PTMs involving changing the chemical nature of amino acids:
• citrullination, or deimination, the conversion of arginine to citrulline
• deamidation, the conversion of glutamine to glutamic acid or asparagine to aspartic acid

Review of Protein Structure (5 min) Introduction to Proteomics (10 min) Protein Structure Database and Classification (15 min) Protein Structure Prediction (15 min) 3-D Alignment (10 min)

Proteomics The term proteome was coined by Marc Wilkins in 1995 and is used to describe the entire complement of proteins in a given biological organism or system at a given time, i.e. the protein products of the genome. The term has been applied to several different types of biological systems. A cellular proteome is the collection of proteins found in a particular cell type under a particular set of environmental conditions such as exposure to hormone stimulation.

Proteomics vs. Genomics The proteome is larger than the genome, especially in eukaryotes, in the sense that there are more proteins than genes. This is due to alternative splicing of genes and post-translational modifications such as glycosylation or phosphorylation. The proteome has at least two levels of complexity lacking in the genome. While the genome is defined by the sequence of nucleotides, the proteome cannot be limited to the sum of the sequences of the proteins present. Knowledge of the proteome requires knowledge of (1) the structure of the proteins in the proteome and (2) the functional interaction between the proteins.

Proteomics Techniques – 2D Gel Proteomics, the study of the proteome, has largely been practiced through the separation of proteins by two-dimensional gel electrophoresis. In the first dimension, the proteins are separated by isoelectric focusing, which resolves proteins on the basis of charge. In the second dimension, proteins are separated by molecular weight using SDS-PAGE. The gel is stained with Coomassie Blue or silver to visualize the proteins. Spots on the gel are proteins that have migrated to specific locations. Matching spots between gels is a major difficulty.

Proteomics Techniques – Mass Spec Peptide mass fingerprinting identifies a protein by cleaving it into short peptides and then deduces the protein's identity by matching the observed peptide masses against a sequence database. Tandem mass spectrometry, on the other hand, can get sequence information from individual peptides by isolating them, colliding them with a nonreactive gas, and then cataloging the fragment ions produced.
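The peptide mass fingerprinting idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: it assumes trypsin as the protease (cleavage after K or R, but not before P), uses standard monoisotopic residue masses, and scores a database entry simply by counting observed masses that fall within a tolerance of a theoretical peptide mass. The function names, the toy database, and the tolerance value are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(seq):
    """Cleave after K or R unless the next residue is P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(observed, protein, tol=0.01):
    """Number of observed masses explained by the protein's tryptic peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def best_match(observed, database, tol=0.01):
    """Name of the database entry that explains the most observed masses."""
    return max(database, key=lambda name: count_matches(observed, database[name], tol))


# Toy database with made-up sequences, for illustration only.
toy_db = {"protA": "MKAARPGK", "protB": "GGGKAAAR"}
observed = [peptide_mass("MK"), peptide_mass("AARPGK")]
print(best_match(observed, toy_db))
```

Real search tools also handle missed cleavages, modifications, and statistical scoring, none of which is modeled here.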

Proteomics Techniques – Mass Spec

Proteomics Techniques – Microarray Measures mRNA levels. No change in mRNA does not necessarily mean no change in protein expression and function, because of the effects of post-translational modification.

Review of Protein Structure (5 min) Introduction to Proteomics (10 min) Protein Structure Database and Classification (15 min) Protein Structure Prediction (15 min) 3-D Alignment (left for next lab)

Protein Databases
• UniProt is the universal protein database, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information.
• The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource to support genomic and proteomic research, and scientific studies.
• Swiss-Prot is a curated biological database of protein sequences from different species created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute.
• Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
• PDB
• NCBI
• http://proteome.nih.gov/links.html

PubMed – Protein Databases The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to Protein Information Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures). The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez. Tutorial: http://www.pdb.org/pdbstatic/tutorials/tutorial.html

Example – PDB • http://www.pdb.org • Only proteins with known structures are included.

Example – PDB

Protein Visualization Software • Cn3d • RasMol • TOPS • Chime • DSSP • Molscript • Ribbons • MSMS • Surfnet • …

PubMed Structure Database

Protein Structure Classification - SCOP
• Structural Classification of Proteins database
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Hierarchical clustering
• Family – clear evolutionary relationship
• Superfamily – probable common evolutionary origin
• Fold – major structural similarity
• Boundaries between levels are more or less subjective
• Conservative evolutionary classification leads to many new divisions at the family and superfamily levels; it is therefore recommended to focus first on the higher levels of the classification tree.

Protein Structure Classification - SCOP

Protein Structure Classification - SCOP • all α • all β • α/β • α+β • Misc

Protein Structure Classification - SCOP SCOP Classification Statistics. SCOP: Structural Classification of Proteins, 1.69 release: 25973 PDB entries (1 Oct 2004), 70859 domains, 1 literature reference (excluding nucleic acids and theoretical models).

Protein Structure Classification - SCOP

Protein Structure Classification - CATH • CATH Protein Structure Classification • http://www.cathdb.info/latest/index.html • CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels, Class(C), Architecture(A), Topology(T) and Homologous superfamily (H). • Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. • Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. • The topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. • The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons.
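The four-level hierarchy can be made concrete with a short Python sketch. It assumes the common convention of writing a CATH classification as four dot-separated numbers, one per level (C.A.T.H), and finds the deepest level at which two domains are still classified together. The function names are hypothetical and the identifiers in the example are used only for illustration.

```python
# The four CATH levels, from the top of the hierarchy to the bottom.
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_id(cath_id):
    """Split a 'C.A.T.H' identifier into a dict keyed by level name."""
    parts = cath_id.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected four dot-separated numbers: " + cath_id)
    return dict(zip(LEVELS, (int(p) for p in parts)))


def deepest_shared_level(id_a, id_b):
    """Deepest level at which two identifiers agree, or None if the classes differ."""
    shared = None
    for level, x, y in zip(LEVELS, id_a.split("."), id_b.split(".")):
        if x != y:
            break
        shared = level
    return shared


# Illustrative identifiers: the first two share class, architecture and topology.
print(parse_cath_id("3.40.50.300"))
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))
print(deepest_shared_level("3.40.50.300", "1.10.8.10"))
```

Two domains that agree down to the Topology level belong to the same fold group but to different homologous superfamilies.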

Protein Structure Classification - CATH

CATH vs. SCOP

Protein Fold Space Map: similarity (DALI score), distance matrix, embedding in 3-D space (multidimensional scaling). Kim, PNAS, Mar 4, 2003
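The step from a distance matrix to 3-D coordinates can be sketched with classical (metric) multidimensional scaling. This is a minimal pure-Python illustration under stated assumptions, not the procedure of the cited paper: it assumes a symmetric distance matrix is already available (how DALI scores are converted to distances is not covered), and it finds eigenvectors by simple power iteration, where a numerical library's eigensolver would normally be used. All function names are hypothetical.

```python
import math
import random


def euclid(p, q):
    """Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classical_mds(dist, dims=3, iters=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the current dominant eigenvector of B.
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 1e-9:
            break  # no further positive eigenvalue; remaining axes stay zero
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy example: four points on a 4 x 1 rectangle, described only by distances.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
dist = [[euclid(p, q) for q in points] for p in points]
embedding = classical_mds(dist, dims=3)
print(embedding)
```

For distances that are exactly Euclidean, as in the toy example, the embedding reproduces the pairwise distances up to rotation and reflection; for similarity-derived distances the result is only an approximation.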

Review of Protein Structure (5 min) Introduction to Proteomics (10 min) Protein Structure Database and Classification (15 min) Protein Structure Prediction (15 min) 3-D Alignment (Left for next lab)

Secondary Structure Prediction
• AGADIR - An algorithm to predict the helical content of peptides
• APSSP - Advanced Protein Secondary Structure Prediction Server
• GOR - Garnier et al, 1996
• HNN - Hierarchical Neural Network method (Guermeur, 1997)
• Jpred - A consensus method for protein secondary structure prediction at University of Dundee
• JUFO - Protein secondary structure prediction from sequence (neural network)
• nnPredict - University of California at San Francisco (UCSF)
• Porter - University College Dublin
• PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
• Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
• PSA - BioMolecular Engineering Research Center (BMERC) / Boston
• PSIpred - Various protein structure prediction methods at Brunel University
• SOPMA - Geourjon and Deléage, 1995
• SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
• DLP - Domain linker prediction at RIKEN
http://us.expasy.org/tools/#secondary

Secondary Structure Prediction - HNN
• http://npsa-pbil.ibcp.fr/cgi-bin/secpred_hnn.pl
>gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2 (Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase subunit II)
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
ILLYTAWCYWKMFGRITKEDIERNTHSLY

Secondary Structure Prediction - HNN
Sequence length : 379
HNN :
Alpha helix (Hh) : 209 is 55.15%
3-10 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 14.51%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 115 is 30.34%
Ambiguous states (?) : 0 is 0.00%
Other states : 0 is 0.00%

        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
ccchhhhhhhhhhhhhhheeeeehccchhcchhhhhheecccccceeeeeeccccccccceeeeeeccch
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccchhhhhhhhhhcceeehccchccheehhhhhc
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
hhcccccchhhhheeeeccchhhhhcchceccceeeeeeeeeccchhhhhhhchhhhhhchhhhhhhhhh
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
hhhhhhccceeeeeeccceeeeeccccccccccchhhhhhhhhhhheeccccceeeeccchhhhhhhhhh
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
hhhhhhhhhhhhhhhhhhhhhhhhhcchhhcccccccchhhccccchhcccchhhhhhhhhhhhhhhhhh
ILLYTAWCYWKMFGRITKEDIERNTHSLY
hhhhhhhhhhhhhhhcchhhhhhhccccc
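The summary percentages in output of this kind are simple state counts over the per-residue prediction string. A minimal Python sketch of that bookkeeping follows; it assumes a three-state string using the letters h (helix), e (extended strand) and c (coil), and the function name is hypothetical.

```python
from collections import Counter

# Three-state alphabet assumed for the prediction string.
STATE_NAMES = {"h": "Alpha helix", "e": "Extended strand", "c": "Random coil"}


def state_composition(prediction):
    """Return {state: (count, percent)} for a per-residue prediction string."""
    counts = Counter(prediction.lower())
    n = len(prediction)
    return {
        state: (counts.get(state, 0), round(100.0 * counts.get(state, 0) / n, 2))
        for state in STATE_NAMES
    }


# Short made-up prediction string, for illustration only.
for state, (count, percent) in state_composition("ccchhhhhee").items():
    print(f"{STATE_NAMES[state]} : {count} is {percent:.2f}%")
```

The same arithmetic applied to the reported counts gives the reported percentages, e.g. 209 helical residues out of 379 is 55.15%.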

Secondary Structure Prediction - HNN

Motifs Readily Identified from Sequence
• Zinc finger – a characteristic order and spacing of cysteine and histidine residues.
• Leucine zippers – two parallel alpha helices held together by interactions between hydrophobic leucine residues at every seventh position in each helix.
• Coiled coils – 2-3 helices coiled around each other in a left-handed supercoil (3.5 residues per turn instead of 3.6, i.e. 7 residues per two turns); the first and fourth positions of each heptad are hydrophobic, the others hydrophilic; 5-10 heptads.
• Transmembrane-spanning proteins – alpha helices comprising amino acids with hydrophobic side chains, typically 20-30 residues.
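Because these motifs are defined by residue order and spacing, a first-pass scan can be written with regular expressions. The Python sketch below is illustrative only: the two patterns are PROSITE-style simplifications (a leucine at every seventh position over four repeats, and a C2H2-type cysteine/histidine spacing), and a pattern match is a hint rather than proof that the motif is present. The names and test sequences are hypothetical.

```python
import re

# Simplified PROSITE-style patterns; lookahead lets overlapping hits be reported.
MOTIFS = {
    # L-x(6)-L-x(6)-L-x(6)-L : leucine at every seventh position
    "leucine_zipper": re.compile(r"(?=(L.{6}L.{6}L.{6}L))"),
    # C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H : C2H2-type zinc finger spacing
    "c2h2_zinc_finger": re.compile(r"(?=(C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H))"),
}


def find_motifs(seq):
    """Return (motif name, 1-based start, matched text) for every pattern hit."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits


# Made-up sequences built to satisfy each pattern exactly once.
zipper_like = "MK" + "LAAAAAA" * 3 + "L"
finger_like = "CAACAAAFAAAAAAAAHAAAH"
print(find_motifs(zipper_like))
print(find_motifs(finger_like))
```

The leucine repeat pattern in particular matches many unrelated proteins by chance, so hits are usually checked against predicted helical structure.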

Topology Prediction
• PSORT - Prediction of protein subcellular localization
• TargetP - Prediction of subcellular location
• DAS - Prediction of transmembrane regions in prokaryotes using the Dense Alignment Surface method (Stockholm University)
• HMMTOP - Prediction of transmembrane helices and topology of proteins (Hungarian Academy of Sciences)
• PredictProtein - Prediction of transmembrane helix location and topology (Columbia University)
• SOSUI - Prediction of transmembrane regions (Nagoya University, Japan)
• TMAP - Transmembrane detection based on multiple sequence alignment (Karolinska Institut; Sweden)
• TMHMM - Prediction of transmembrane helices in proteins (CBS; Denmark)
• TMpred - Prediction of transmembrane regions and protein orientation (EMBnet-CH)
• TopPred - Topology prediction of membrane proteins (France)
http://us.expasy.org/tools
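The simplest idea behind transmembrane region prediction is a sliding-window hydropathy plot. The Python sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6, a conventional choice for that scale. It is a baseline illustration only and is not the algorithm used by any of the servers listed above, which rely on alignments, hidden Markov models, or other methods. The function names and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge adjacent high-scoring windows into 1-based (start, end) ranges."""
    segments = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score < threshold:
            continue
        start, end = i + 1, i + window
        if segments and start <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], end)  # extend the previous range
        else:
            segments.append((start, end))
    return segments


# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "D" * 20 + "L" * 20 + "D" * 20
print(candidate_tm_segments(toy))
```

Because every window that passes the cutoff is merged, the reported range is wider than the hydrophobic stretch itself; dedicated predictors refine the helix ends and also assign the inside/outside orientation.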
