Protein Identification by Sequence Database Search

Nathan Edwards, Department of Biochemistry and Mol. & Cell. Biology, Georgetown University Medical Center

Outline • Proteomics • Mass Spectrometry • Protein Identification • Peptide Mass Fingerprint • Tandem Mass Spectrometry

Proteomics • Proteins are the machines that drive much of biology • Genes are merely the recipe • Proteomics: the direct characterization of a sample’s proteins en masse • What proteins are present? • How much of each protein is present?

2D Gel-Electrophoresis • Protein separation • Molecular weight (MW) • Isoelectric point (pI) • Staining • Bird's-eye view of protein abundance

2D Gel-Electrophoresis Bécamel et al., Biol. Proced. Online 2002;4:94-104.

Paradigm Shift • Traditional protein chemistry assay methods struggle to establish identity. • Identity requires: • Specificity of measurement (precision): mass spectrometry • A reference for comparison (measurement → identity): protein sequence databases

Mass Spectrometer • Sample → Ionizer → Mass Analyzer → Detector • Ionizer: MALDI, Electrospray Ionization (ESI) • Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap • Detector: Electron Multiplier (EM)

Mass Spectrometer (MALDI-TOF) [Figure: instrument schematic. A UV laser (337 nm) strikes the analyte/matrix on the grounded backing plate in the source (length s, pulse voltage); ions pass the extraction grid (source voltage -Vs), cross the field-free drift zone (length D, Ed = 0) to the detector grid (-Vs), and reach the microchannel plate detector.]

Mass Spectrum

Mass is fundamental

Peptide Mass Fingerprint Cut out 2D-Gel Spot

Peptide Mass Fingerprint Trypsin Digest

Peptide Mass Fingerprint MS

Peptide Mass Fingerprint

Peptide Mass Fingerprint • Trypsin: a highly specific digestion enzyme that cuts after K and R, except when followed by P • For each protein sequence in the sequence database in turn: in silico digest, then mass computation • Compare the computer-generated masses with the observed spectrum
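
The loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring used by any particular search engine: the residue masses are standard monoisotopic values supplied here, the function names and the 0.05 Da tolerance are made up for the example, and the score is simply the number of observed peaks explained by a singly protonated tryptic peptide. The two sequences are the myoglobin sequence and the start of the ALBU_HUMAN sequence shown elsewhere in this deck.

```python
import re

# Standard monoisotopic residue masses (Da); supplied here as an assumption.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # C-terminal mass from the slides
PROTON = 1.007276   # mass of the ionizing proton


def tryptic_peptides(protein):
    """In silico digest: cut after K or R unless the next residue is P."""
    # Splitting on a zero-width match needs Python 3.7 or later.
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def singly_charged_mz(peptide):
    """m/z of the peptide carrying one proton."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def count_matches(protein, peaks, tol=0.05):
    """Number of observed peaks within tol of some computed peptide mass."""
    masses = [singly_charged_mz(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - peak) <= tol for m in masses) for peak in peaks)


def rank_proteins(database, peaks, tol=0.05):
    """Score every protein in turn and sort by matched peak count."""
    scores = [(name, count_matches(seq, peaks, tol))
              for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


DATABASE = {
    "myoglobin": (
        "GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLK"
        "TEAEMKASEDLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIP"
        "IKYLEFISDAIIHVLHSKHPGDFGADAQGAMTKALELFRNDIAAKYKELG"
        "FQG"
    ),
    "albumin_prefix": "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFK",
}
PEAKS = [748.43, 1271.66, 1378.83, 1606.85]  # four of the listed peptide masses

print(rank_proteins(DATABASE, PEAKS))
```

Real tools also allow missed cleavages and weight matches statistically; the sketch only shows the digest, compute, compare cycle.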

Protein Sequence • Myoglobin GLSDGEWQQV LNVWGKVEAD IAGHGQEVLI RLFTGHPETL EKFDKFKHLK TEAEMKASED LKKHGTVVLT ALGGILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISDA IIHVLHSKHP GDFGADAQGA MTKALELFRN DIAAKYKELG FQG

Amino-Acid Masses

Peptide Mass & m/z • Peptide Molecular Weight: N-terminal mass (0.00) + Sum (AA residue masses) + C-terminal mass (18.010560) • Observed Peptide m/z: (Peptide Molecular Weight + z × proton mass (1.007276)) / z • Monoisotopic mass values!
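
As a worked example of the two formulas, the sketch below computes the molecular weight and m/z of the myoglobin peptide ALELFR at charge states 1 to 3. The residue masses are standard monoisotopic values supplied here (the deck's amino-acid mass table is a figure), the proton mass is taken as 1.007276 Da, and the function names are illustrative only.

```python
# Standard monoisotopic residue masses (Da); supplied here as an assumption.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
N_TERM = 0.0        # N-terminal mass
C_TERM = 18.01056   # C-terminal mass (one water)
PROTON = 1.007276   # mass of one ionizing proton


def peptide_mw(peptide):
    """Monoisotopic molecular weight of the neutral peptide."""
    return N_TERM + sum(MONO[aa] for aa in peptide) + C_TERM


def peptide_mz(peptide, z):
    """Observed m/z of the peptide carrying z protons."""
    return (peptide_mw(peptide) + z * PROTON) / z


for z in (1, 2, 3):
    print(z, round(peptide_mz("ALELFR", z), 4))
```

At z = 1 the result agrees with the 748.43 listed for ALELFR to within rounding; higher charge states move the same peptide to lower m/z.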

Peptide Masses
1811.90 GLSDGEWQQVLNVWGK
1606.85 VEADIAGHGQEVLIR
1271.66 LFTGHPETLEK
1378.83 HGTVVLTALGGILK
1982.05 KGHHEAELKPLAQSHATK
1853.95 GHHEAELKPLAQSHATK
1884.01 YLEFISDAIIHVLHSK
1502.66 HPGDFGADAQGAMTK
748.43 ALELFR

Peptide Mass Fingerprint [Figure: mass spectrum with peaks labeled by their matching peptides: YLEFISDAIIHVLHSK, GHHEAELKPLAQSHATK, GLSDGEWQQVLNVWGK, HPGDFGADAQGAMTK, HGTVVLTALGGILK, VEADIAGHGQEVLIR, KGHHEAELKPLAQSHATK, ALELFR, LFTGHPETLEK]

Sample Preparation for Tandem Mass Spectrometry • Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS)

Peptide Fragmentation • Peptides consist of amino-acids arranged in a linear backbone: N-terminus H…-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH C-terminus, where R_{i-1}, R_i and R_{i+1} are the side chains of AA residues i-1, i and i+1

Peptide Fragmentation

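Peptide Fragmentation [Figure: backbone segment -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-. Breaking the peptide bond after residue i gives the N-terminal fragment b_i and the C-terminal fragment y_{n-i}; breaking the next peptide bond gives b_{i+1} and y_{n-i-1}.]

Peptide Fragmentation [Figure: the same backbone segment with all cleavage sites marked. Around residue i the N-terminal fragments are a_i, b_i, c_i and the complementary C-terminal fragments are x_{n-i}, y_{n-i}, z_{n-i}; the next peptide bond gives b_{i+1} and y_{n-i-1}.]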

Peptide Fragmentation Peptide: S-G-F-L-E-E-D-E-L-K

Peptide Fragmentation: b and y ions of S-G-F-L-E-E-D-E-L-K (nominal m/z)

b ions:    88  145  292  405  534  663  778  907 1020 1166
residue:    S    G    F    L    E    E    D    E    L    K
y ions:  1166 1080 1022  875  762  633  504  389  260  147

[Figure: MS/MS spectrum, % intensity (0 to 100) vs. m/z (250 to 1000). Successive builds of the slide show the unlabeled spectrum, then the y-ion peaks labeled (y2 to y9), then the b-ion peaks labeled as well (b3 to b9).]
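
The b- and y-ion ladders for S-G-F-L-E-E-D-E-L-K can be recomputed from residue masses. The sketch below assumes singly charged fragments, standard monoisotopic residue masses (supplied here, not taken from the deck), and illustrative function names. It lists b1 to b9 and y1 to y9 only; the value 1166 at the end of each ladder on the slide corresponds to the whole protonated peptide.

```python
# Standard monoisotopic residue masses (Da); supplied here as an assumption.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def fragment_ions(peptide):
    """Return ({i: b_i}, {i: y_i}) for singly charged fragments, i = 1..n-1."""
    n = len(peptide)
    # b_i: first i residues plus one proton.
    b = {i: sum(MONO[aa] for aa in peptide[:i]) + PROTON
         for i in range(1, n)}
    # y_i: last i residues plus water plus one proton.
    y = {i: sum(MONO[aa] for aa in peptide[n - i:]) + WATER + PROTON
         for i in range(1, n)}
    return b, y


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i in sorted(b_ions):
    print(f"b{i} {b_ions[i]:9.2f}    y{i} {y_ions[i]:9.2f}")
```

The computed values are exact monoisotopic masses, so they sit a fraction of a dalton away from the nominal values on the slide.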

Peptide Identification Given: • The mass of the precursor ion, and • The MS/MS spectrum Output: • The amino-acid sequence of the peptide

Peptide Identification Two paradigms: • De novo interpretation • Sequence database search

De Novo Interpretation [Figure: MS/MS spectrum, % intensity (0 to 100) vs. m/z (250 to 1000). Successive builds of the slide label the gaps between peaks with the amino acids whose masses they match: first E and L, then many overlapping assignments drawn from S, G, F, L, E, D and K.]

De Novo Interpretation

De Novo Interpretation …from Lu and Chen (2003), JCB 10:1


De Novo Interpretation • Find good paths in the spectrum graph • Can’t use the same peak twice • Forbidden pairs: NP-hard • “Nested” forbidden pairs: dynamic programming • Simple peptide fragmentation model • Usually many apparently good solutions • Needs better fragmentation model • Needs better path scoring
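
A toy version of the spectrum graph idea, as a sketch only: peaks are nodes, two peaks are joined when their mass difference matches a residue mass, and the longest path is read off as a sequence tag. This deliberately omits the forbidden-pair constraint and any path scoring mentioned on the slide. The residue masses are standard monoisotopic values supplied here, "L" stands for L or I (identical mass), and the function name, tolerance and peak list are illustrative.

```python
# Standard monoisotopic residue masses (Da); "L" stands for both L and I.
RESIDUES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def longest_tag(peaks, tol=0.02):
    """Longest residue ladder readable from the peak list.

    Edges always run from a lighter peak to a heavier one, so the graph
    is acyclic and one left-to-right pass finds the longest path.
    """
    peaks = sorted(peaks)
    best = [""] * len(peaks)  # best[j]: longest tag ending at peak j
    for j in range(len(peaks)):
        for i in range(j):
            diff = peaks[j] - peaks[i]
            for aa, mass in RESIDUES.items():
                if abs(diff - mass) <= tol and len(best[i]) + 1 > len(best[j]):
                    best[j] = best[i] + aa
    return max(best, key=len)


# b1..b9 of SGFLEEDELK plus one made-up noise peak at 300.00.
PEAKS = [88.04, 145.06, 292.13, 405.21, 534.26, 663.30,
         778.33, 907.37, 1020.45, 300.00]
print(longest_tag(PEAKS))
```

With a complete, clean ladder the tag is unambiguous; real spectra mix b and y ions, miss peaks and contain noise, which is where the harder variants of the problem come from.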

De Novo Interpretation • Amino-acids have duplicate masses! • Incomplete ladders create ambiguity. • Noise peaks and unmodeled fragments create ambiguity • “Best” de novo interpretation may have no biological relevance • Current algorithms cannot model many aspects of peptide fragmentation • Identifies relatively few peptides in high-throughput workflows
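
The duplicate-mass problem can be made concrete with a short check. The sketch below, using standard monoisotopic residue masses supplied here and illustrative function names, lists residue pairs that are indistinguishable at a given mass tolerance, and pairs of residues whose combined mass equals that of a single residue (which matters when a ladder is missing a peak).

```python
from itertools import combinations, combinations_with_replacement

# Standard monoisotopic residue masses (Da); supplied here as an assumption.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def isobaric_residues(tol):
    """Pairs of residues whose masses differ by at most tol."""
    return [(a, b) for a, b in combinations(sorted(MONO), 2)
            if abs(MONO[a] - MONO[b]) <= tol]


def pairs_matching_one_residue(tol):
    """(pair, residue) where two residues together weigh what one does."""
    hits = []
    for a, b in combinations_with_replacement(sorted(MONO), 2):
        for c in sorted(MONO):
            if abs(MONO[a] + MONO[b] - MONO[c]) <= tol:
                hits.append((a + b, c))
    return hits


print(isobaric_residues(0.001))   # identical at any resolution
print(isobaric_residues(0.05))    # also confusable at modest accuracy
print(pairs_matching_one_residue(0.001))
```

I and L cannot be separated by mass at all, K and Q need good mass accuracy, and a gap of about 114 or 128 Da can be read either as one residue or as two.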

Sequence Database Search • Compares peptides from a protein sequence database with spectra • Filter peptide candidates by precursor mass and digest motif • Score each peptide against the spectrum: generate all possible peptide fragments, then match putative fragments with peaks • Score and rank
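
The filter, score and rank steps can be sketched as follows. This is a hedged illustration, not the scoring function of any real search engine: candidates are assumed to be tryptic already (the digest motif filter is not repeated), only singly charged b and y ions are generated, and the score is a plain shared-peak count. The residue masses are standard monoisotopic values supplied here; the function names, tolerances, peak list and the permuted candidate SGFLEDEELK are made up for the example.

```python
# Standard monoisotopic residue masses (Da); supplied here as an assumption.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def neutral_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER


def fragment_mzs(peptide):
    """Singly charged b and y ions for every backbone position."""
    n = len(peptide)
    b = [sum(MONO[aa] for aa in peptide[:i]) + PROTON for i in range(1, n)]
    y = [sum(MONO[aa] for aa in peptide[i:]) + WATER + PROTON
         for i in range(1, n)]
    return b + y


def shared_peak_count(peptide, peaks, tol):
    """Number of observed peaks explained by some predicted fragment."""
    predicted = fragment_mzs(peptide)
    return sum(any(abs(f - p) <= tol for f in predicted) for p in peaks)


def search(candidates, precursor_mass, peaks,
           precursor_tol=1.0, fragment_tol=0.5):
    """Filter by precursor mass, score the survivors, rank best first."""
    hits = [(shared_peak_count(p, peaks, fragment_tol), p)
            for p in candidates
            if abs(neutral_mass(p) - precursor_mass) <= precursor_tol]
    return sorted(hits, reverse=True)


CANDIDATES = ["SGFLEEDELK", "SGFLEDEELK", "ALELFR"]
PEAKS = [260.2, 292.1, 389.2, 405.2, 504.3, 534.3,
         633.3, 762.4, 875.4, 1022.5, 1079.5]
print(search(CANDIDATES, 1165.55, PEAKS))
```

The precursor filter discards ALELFR outright, and the permuted candidate loses to the correct peptide only because one fragment peak no longer matches, which shows how much the ranking depends on fragment coverage.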

Sequence Database Search: candidate peptide S-G-F-L-E-E-D-E-L-K (nominal m/z)

b ions:    88  145  292  405  534  663  778  907 1020 1166
residue:    S    G    F    L    E    E    D    E    L    K
y ions:  1166 1080 1022  875  762  633  504  389  260  147

[Figure: MS/MS spectrum, % intensity (0 to 100) vs. m/z (250 to 1000). Successive builds of the slide show the candidate peptide over the spectrum, then its predicted b and y ions, then the matching peaks labeled (y2 to y9, b3 to b9).]

Sequence Database Search • No need for complete ladders • Possible to model all known peptide fragments • Sequence permutations eliminated • All candidates have some biological relevance • Practical for high-throughput peptide identification • Correct peptide might be missing from database!

Peptide Candidate Filtering • Digestion Enzyme: Trypsin • Cuts just after K or R unless followed by a P • Basic residues (K & R) at the C-terminus attract ionizing charge, leading to strong y-ions • “Average” peptide length is about 10-15 amino-acids • Must allow for “missed” cleavage sites

Peptide Candidate Filtering
>ALBU_HUMAN
MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK…
No missed cleavage sites: MK, WVTFISLLFLFSSAYSR, GVFR, R, DAHK, SEVAHR, FK, DLGEENFK, ALVLIAFAQYLQQCPFEDHVK, LVNEVTEFAK, …
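
The digest above, extended to allow missed cleavage sites, can be sketched like this. The function name and the `missed` parameter are illustrative, and only the portion of the ALBU_HUMAN sequence shown on the slide is used (the slide truncates the rest).

```python
import re


def digest(protein, missed=0):
    """Tryptic peptides with up to `missed` missed cleavage sites.

    Trypsin cuts just after K or R unless the next residue is P.
    Splitting on a zero-width match needs Python 3.7 or later.
    """
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join piece i with up to `missed` of the pieces that follow it.
        last = min(i + missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Only the part of the sequence shown on the slide.
ALBU_PREFIX = ("MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK"
               "ALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK")

print(digest(ALBU_PREFIX))            # no missed cleavage sites
print(digest(ALBU_PREFIX, missed=1))  # adds e.g. GVFRR and RDAHK
```

Each extra allowed missed cleavage enlarges the candidate list, so search tools usually cap it at a small number.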
