INF380 – Proteomics-I Chapter 1 - Proteome - proteomics

INF380 – Proteomics-IChapter 1 - Proteome - proteomics • While the genome is essentially identical in all cells in an organism, the set of expressed proteins varies extensively through time as well as according to the specific conditions a cell finds itself in. • The term proteome may therefore be used with several meanings. • All the proteins encoded by the genome of a species. • The set of proteins expressed in a particular cell (or tissue or organ) at a particular time and under specific conditions. • The term can also be used for the set of proteins of a subcellular structure or organelle. • Proteomics is therefore the study of the subsets of proteins present in different parts of the organism and how they change with time and varying conditions. INF380 - Proteomics-1

Goal of proteomics • The overall goal of proteomics is to understand the function of all proteins found in an organism. • This implies that collection of large quantities of proteomics data, frameworks are therefore needed for the storage and presentation of such data. • The Gene Ontology Consortium (GO), an international collaboration that aims to catalogue and standardize the existing knowledge about protein function, has established a large controlled vocabulary (CV) for gene and protein function. This CV is subdivided in three orthogonal types: • molecular function • biological process • cellular component. INF380 - Proteomics-1

Tasks of proteomics • Protein identification is the determination of which protein we have in our sample. • This can be done by determining the sequence of the protein, or by measuring so many properties of the protein that it is statistically unlikely that it could be another protein. In this book we primarily consider determining the sequence. • Protein characterization is the determination of the various biophysical and/or biochemical properties of the protein (regardless of whether the protein has been identified). Although there are many important protein properties, in this book we mainly concentrate on the problem of determining the posttranslational modifications. INF380 - Proteomics-1

Tasks of proteomics • Protein quantification is the determination of the amount (abundance) of a protein in the sample, either as a relative or an absolute value. It should be noted that the determination of protein abundance is most often far from trivial. • Protein sample comparison is the determination of the similarities and the differences in the protein composition of different samples. Some aspects of sample comparison are listed. • Relative occurrence; the presence of proteins in some samples but not in others. • Relative abundance; the presence of proteins in different amounts in different samples. • Differential modification; the presence of different modified forms of proteins in different samples. INF380 - Proteomics-1

Protein identity • The objective of proteomics is to assign experiment-derived information on protein function to a particular protein. • To do this, it is necessary to identify the protein, a task which can be achieved by protein mass spectrometry. • In this approach, the acquired mass data must then be matched to the entries in a relevant protein sequence database. • Ideally, a protein in the database should have a unique amino acid sequence and a single source of origin (a species name). Fulfilling these requirements are not always straightforward. • A single organism may contain several genes that encode proteins of identical sequence, violating the first of the above constraints (sequence uniqueness). • Second, two proteins from two different organisms may also have identical sequences, violating the single source of origin constraint. • The species concept itself is far from trivial and sometimes controversial. INF380 - Proteomics-1

Amino acid properties INF380 - Proteomics-1

Protein properties • A protein can be considered as having a set of properties. • A common way of describing a property of an object is to use an (attribute,value) pair. • Since we mainly concentrate on protein identification, characterization of posttranslational modifications, and comparison of proteins, we are primarily interested in attributes that can be of help in solving these tasks. • For such attributes there exist some desirable requirements. • The value of the attribute should be (easily) measurable. • The attribute should have a (high) degree of specificity, meaning that different proteins should ideally have different attribute values. • The value should be constant (minimal variation with time and other external factors). • It should be possible to calculate or predict the value from the sequence of the protein, if the sequence is known (this is useful for protein identification). INF380 - Proteomics-1

Protein properties • We can distinguish between properties that • are intrinsic to the protein sequence (given standard buffer conditions) • depend on the molecular or cellular context of the protein • We are mainly interested in the intrinsic properties. • There exist a large number of protein attributes, and many of them can be used in protein analysis in some way. • We will concentrate on the physio-chemical ones that are used in the analytic methods described later in the book. • We will also briefly mention how they can be measured, whether the values can be calculated or estimated from the sequence and, if possible, how this can be done. INF380 - Proteomics-1

Protein properties • The amino acid sequence is the most fundamental attribute of a protein. • Correspondingly, it is also referred to as the primary structure of the protein. • If the sequence is known, the protein is considered identified. • When we consider a protein we should therefore first try to determine its complete sequence. This is however not a straightforward process. INF380 - Proteomics-1

Protein properties - molecular mass • Molecular mass (symbol m) of a molecule or atom is expressed in unified atomic mass units (symbol u), defined as 1/12 the mass of carbon 12 (which is 1.6605402 * 10-27 kg). • A more common term for unified atomic mass unit is Dalton (Da). • All chemical elements have naturally occurring isotopes, elements that have the same atomic number (and therefore similar chemical properties), but different molecular mass (slightly different physical properties). This is important for the mass definitions. • Exact mass is the calculated mass of an ion or molecule containing a single isotope for each atom (most frequently the lightest isotope of the element). It is calculated using an appropriate degree of accuracy. • Monoisotopic mass is the exact mass of an ion or molecule calculated using the mass of the most abundant isotope of each element. • Average mass is the mass of an ion or molecule calculated using the average mass of each element weighted for its natural isotopic abundance. • Nominal mass is the mass of an ion or molecule calculated using the mass of the most abundant isotope of each element rounded to the nearest integer value. It is equivalent to the sum of the mass numbers of all constituent atoms. • Accurate mass is an experimentally determined mass of an ion that is used to determine an elemental formula. • Apparent mass is used in the literature to indicate a molecular mass that is estimated from an inaccurate experiment INF380 - Proteomics-1

Protein properties - molecular mass • Measuring the mass of a protein in the laboratory is performed in different ways • estimated by SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel electrophoresis). This method has limited accuracy and the inaccuracy increases with increasing mass • gel filtration and analytical ultracentrifugation are used to estimate the masses, in particular of larger proteins and protein complexes. • The mass of (small) proteins can however be accurately determined by mass spectrometry, • Experimentally measured masses necessarily carry errors or uncertainties. • The total error of a general measurement is often given as either a relative or an absolute value, and the two scales used for these are the • part-per-million (ppm) scale for relative error. • milli-mass unit (mmu) or Dalton for absolute error. 1 mmu is 0.001 Da. • For ions less than 200 Da, a measurement with 5 ppm accuracy is considered sufficient to determine the elemental composition. INF380 - Proteomics-1

Protein properties – isoelectric point • The isoelectric point of a protein is defined using the term pH, which is a measure of the acidity of a solution. • pH is an abbreviation for potential of Hydrogen, and is measured using the number of H+ and OH--ions in the solution. • A neutral solution of pure water (at 25 degrees C) implies equal amounts of H+ and OH--ions • By definition, pH is the negative logarithm of the concentration of H+. For neutral solutions we have pH = -log1010-7 = 7 • When the concentration of H+ is high (low pH) the solution is considered acidic, otherwise basic. • Proteins (like amino acids and many other molecules) have both acidic and basic characteristics distributed along the chain of the molecule • Depending on the availability of protons in the solution (the pH) in which the protein is dissolved, the number of positive and negative charges of the protein will vary • At a certain pH of the solution, the protein will have an equal number of positive and negative charges • The net charge is then zero, and the protein is electrically neutral • This pH value is called the isoelectric point (pI) of the protein • Different proteins will have different pIs, depending on their amino acid compositions INF380 - Proteomics-1

Protein properties – hydrophobicity • Compounds that tend to repel water are termed hydrophobic, as opposed to hydrophilic compounds that dissolve easily in water. • Though hydrophobicity is one of the most important physio-chemical property of amino acids and proteins, it is still poorly defined. • The hydrophobicity of the amino acids has been determined through either calculation or measurement, using different ways. • Because of this, many scales exist for the hydrophobicity values of amino acids. Fortunately most of them are fairly similar, and for most applications it does not matter which of them is used. The most often used scale is the Kyte-Doolittle scale from 1982 in which the most hydrophobic amino acids have the highest positive values. • A hydrophobicity value for a protein can be calculated simply by adding the hydrophobicity values of all the residues and dividing the sum by the number of residues. This is called a GRAVY score (GRand AVerages of hYdrophobicity). • For proteins however, it is generally more common to finding hydrophobic regions in proteins. A sliding window is used in this approach, and the average hydrophobicity of the residues inside the window is plotted at the residue in the middle of the window. INF380 - Proteomics-1

Protein properties – amino acid composition • The amino acid composition of a protein can be determined by first cleaving (hydrolyzing) all the peptide bonds in the protein to release the constituent amino acids. The amino acids are subsequently separated, and after staining the intensity of each amino acid is determined. From this the prevalence of each amino acid is calculated. • Theoretical calculation of the amino acid composition from a protein sequence is performed simply by counting the number of occurrences of each amino acid in the protein. INF380 - Proteomics-1

Posttranslational modifications • A posttranslational modification (PTM) can be defined as any alteration to the chemical structure of the protein after the formation of the sequence by the translation. • PTMs thus occur strictly in vivo and this implies that the chemically induced modifications that often occur in the lab during sample preparation are not PTMs. • These are usually described as chemical artefactual or in vitro modifications • PTMs are very important to the survival of the cell • Some examples of common PTMs ar: • Cleavage of the protein, for example the removal of inhibitory sequences or sorting signals. • Ligand-binding. • Covalent addition of further groups, for example glycosylation, acetylation etc. • Phosphorylation and dephosphoylation. • In this book we will only consider modifications that affect the structure of one residue. • The modifications will change some of the attributes of the proteins, most notably the mass but also the pI and the hydrophobicity. • Several databases of modifications exist, one of which is UniMod, also including chemical modifications. • Each modification has one entry in the UniMod database (several hundreds) • Each entry can have several specifications, and each specification refers to the modification of a specific amino acid. A specification consists of a site, a position, and a mass. INF380 - Proteomics-1

Sequence databases • The most commonly used sequence databases for protein identification are • UniProt KnowledgeBase (Swiss-Prot/TrEMBL, PIR) • The NCBI non-redundant database • The International Protein Index (IPI) • The key difference between Swiss-Prot and PIR on one hand and TrEMBL on the other hand, lies in the manual curation effort that underlies the former two databases. • We will focus here on the Swiss-Prot database • All entries in Swiss-Prot have passed through a rigorous manual control by human curators. • Swiss-Prot contains a lot of curated protein information and annotations, some are: • The accession number, which supplies a unique identifier for the entry. • The sequence. • Molecular mass (denoted as molecular weight). • Observed and predicted modifications. • The sequence is the sequence after translation, yet before any modifications that may cleave off parts of it. • A feature table contains reported differences in the sequence for a protein. They are of different types: INF380 - Proteomics-1

Sequence databases • A feature table contains reported differences in the sequence for a protein. They are of different types: • Conflict, positions where different (bibliography) sources report different amino acids. • Variants, where the sources report that sequence variants (alleles) exist. • Mutagen,positions where alterations have experimentally been performed. • Varsplic, is the description of sequence variants produced by alternative splicing. Note that following our definition this would be different proteins, and that a derived database called Swiss-Prot Varsplic exists in which all splice variants are included as separate protein entries. • The feature table also includes observed posttranslational modifications. The type of modification, and the position, is specified for each observed modification. • Finally, it is important to keep in mind that even Swiss-Prot may contain errors, although the curation process is specifically designed to minimize potential errors. INF380 - Proteomics-1

Identification and characterization • Protein identification can be performed by collecting properties of the proteins under consideration, and explore whether we can recognize these properties among the known proteins. • This is generally done by comparing the measured properties to the calculated or documented properties of the proteins in a protein database, and could be performed by the following procedure: • Choose a protein database, or a subset of a protein database, with the candidate proteins. • Compare the properties of an unknown protein to the properties of the database proteins, and select the proteins that best fit the observed properties. • Compute a probability for the hypothesis that the best fit database protein and the unknown protein really are the same protein. Note that we do not always know whether the unknown protein is in the database in the first place. The probability calculation mentioned earlier will be different depending on whether we know the protein to be present in the database, or not. • We have seen that, especially due to the modifications, there does not exist a single, or even a set of, protein attributes that make this procedure workable in the general case. It has therefore been necessary to develop other techniques and procedures for identifying proteins. • Mass measurement has shown to be the most useful technique, with mass spectrometers as the instruments of choice. INF380 - Proteomics-1

Mass spectrometers • Mass spectrometers are general instruments for measuring the masses of the molecules in a sample. • However, mass spectrometers are only able to recognize charged molecules, therefore the molecules must be ionized. • In proteomics the ionization is commonly achieved by the addition of protons. Hence the mass of the peptide or protein is increased by the nominal mass of 1 Da times the number of charges (protons) • By convention the number of added (or lost) protons are denoted by z. It is then important to recognize that the mass is only indirectly determined, it is the ratio mass-over-charge, denoted m/z, that is measured. • The masses are only obtained after processing of the m/z values. • Depending on the instrument and the experimental conditions, the processing of m/z data to masses can be anything from trivial to exceedingly difficult. INF380 - Proteomics-1

Mass spectra INF380 - Proteomics-1

Top down and bottom up proteomics • Although proteomics can be performed in a number of different ways, it may be useful to divide the existing approaches into two types of paradigms: top down and bottom up. • In the top down paradigm, intact proteins are directly used for the analysis. • In the bottom up paradigm, the proteins are first cleaved into smaller parts, and these parts are then used for identification and characterization. These smaller parts are called peptides. • Methods using a combination of both paradigms (hybrid methods) are also in use. • The bottom up paradigm is mostly used, especially since it is difficult to perform large scale analysis of intact proteins and top down proteomics techniques. INF380 - Proteomics-1

Use of peptides • There are several reasons for doing mass spectrometry on peptides, and not (solely) on intact proteins. • The absolute error in the measurement increases with the measured m/z. • A protein does not have one defined mass, but rather a mass distribution. This is due to the occurrence of stable isotopes. Due to their long sequence, proteins will have a much more complex mass distribution than the much shorter peptides. As a result, it is a lot easier to obtain a single, monoisotopic mass for a peptide than for a protein. • It is not possible to measure the mass of all proteins, especially (very) large, and hydrophobic proteins. • Sensitivity of measurement of intact protein masses is not nearly as good as sensitivity for peptide mass measurement. • The existence of modifications complicates the analysis. This effect is much more noticeable for whole proteins as their long sequence can be modified several times with several different modifications. For the much shorter peptides, the combinatorial possibilities are more manageable. INF380 - Proteomics-1

Protein digestion into peptides • A peptide means a contiguous stretch of only a few tens of residues (in practice, the desired length lies between 6 and 20 residues). • Protein cleavage into such peptides is generally done by enzymes called proteases, and the cleaving operation is called digestion • The measured masses (experimental masses) of the resulting peptides can be used for comparison against the predicted masses (theoretical masses) of the peptides obtained after an in silico digest of candidate protein sequences in a database. • The cleavage can be simulated on a computer when the specificity of the protease is known (for example that it cleaves only after arginine residues) • The peptide masses can theoretically be calculated as the sum of the residue masses with the addition of the masses of the intact termini: a hydrogen (H) at the N-terminus and a hydroxyl group (OH) at the C-terminus. INF380 - Proteomics-1

Two approaches for bottom up proteomics • The foundation for bottom up proteomics are to digest the proteins into peptides, and compare peptide properties to the properties of theoretical peptides in a protein sequence database. • Two main approaches have evolved to perform such analysis, differing in • the peptide properties used, • the MS instruments used, • the method for separation to reduce the complexity. • Peptide masses or peptide sequences • The process of comparing experimental masses to theoretical masses has been named Peptide mass fingerprinting, PMF • If our unknown protein corresponds to a database protein, each of the theoretical peptide masses of the database protein should coincide with an experimental peptide mass, and vice versa. • In reality full experimental coverage of the protein sequence is never achieved; 20-40 % sequence coverage is more usually obtained. • There are probably hundreds, or even thousands, of peptides in a database that have a mass within the accuracy limit used. • Thus deriving the sequence (or part of the sequence) of a peptide gives much higher confidence in the subsequent identification of the protein of origin. INF380 - Proteomics-1

Two approaches for bottom up proteomics • Mass spectrometers Different mass spectrometers are used for measuring the mass, and deriving the sequence. • One mass spectrometry analysis is sufficient for measuring the masses, using an MS instrument. • For deriving the sequence two mass analyzers in series are used, MS/MS instrument. The first mass analyzer is used to specifically select ionized molecules from a particular m/z interval. • Separation When presented with the problem of analyzing a mixture of proteins mass spectrometers are easily overcome by a too complex mixture, resulting in the analysis of only a minor part of the total protein complement of the sample. • By fractionating the initial sample into fractions the mass spectrometer can be used to analyze each obtained fraction separately. • This will result in that more of the proteins in the sample are analyzed, • Fractionation is usually achieved by different methods of separation. • When proteins or peptides are separated, they are split into fractions, in which the members share those properties that are used by the separation process. • The (main) separation steps may be performed on either the proteins or the peptides. The separation step can therefore be done either before or after digestion. INF380 - Proteomics-1

Two approaches for bottom up proteomics INF380 - Proteomics-1

Peptide mass fingerprinting (PMF) INF380 - Proteomics-1

MS/MS (Tandem MS) INF380 - Proteomics-1

INF380 – Proteomics-I Chapter 1 - Proteome - proteomics