Proteomics: A Challenge for Technology and Information Science

CBCB Seminar, November 21, 2005
Tim Griffin, Dept. of Biochemistry, Molecular Biology and Biophysics
tgriffin@umn.edu

What is proteomics?
“Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function.” - Stan Fields in Science, 2001.

Genomics vs. Proteomics
Similarities:
• Large datasets
• Tools needed for annotation and interpretation of results
Differences:
• Genomics: generally mature technologies and data processing methods; questions asked usually involve quantitative changes in RNA transcripts (microarrays)
• Proteomics: still evolving; protein biochemical properties are complex (expression changes, modifications, interactions, activities), so there are many questions to ask and much data to interpret; methods are changing and approaches differ (mass spec, arrays, etc.)

Genomics, Proteomics, and Systems Biology
[Figure: diagram linking genomics, proteomics, and computational biology. Labels: genomic DNA, mRNA, protein products, functional protein, system; identify system components; measure and define properties; interactions between components, catalytic activity, subcellular location, protein modifications, protein dynamics, 3D structure; sequencing, arrays, protein cataloguing, quantitative profiling, protein phosphorylation, protein interaction maps; descriptive; maturity scale: mature, prototype, emerging.]

“Shotgun” identification of proteins in mixtures by LC-MS/MS
Liquid chromatography coupled to tandem mass spectrometry (MS/MS)
[Figure: workflow diagram. Labels: peptides; µLC separation (50-100 µm); ionization: MALDI or electrospray; mass analysis (m/z); isolation; fragmentation; peptide fragments; tandem mass spectrum (thousands in a matter of hours).]

Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series:
y-series: y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1
H2N-N--S--G--D--I--V--N--L--G--S--I--A--G--R-COOH
b-series: b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14
[Figure: MS/MS spectrum, relative abundance vs. m/z (200 to 1200)]
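The b- and y-series on this slide can be worked out numerically. The sketch below is an illustration, not the software used in the talk: it assumes singly charged fragments, standard monoisotopic residue masses, and no modifications, and the function name `fragment_ions` is hypothetical. It enumerates the 13 backbone cleavages of the 14-residue example peptide.

```python
# Monoisotopic residue masses (Da), restricted to residues in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # H2O
PROTON = 1.00728   # charge carrier for a 1+ ion


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged m/z values; index 0 holds b1 / y1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    running = 0.0
    for m in masses[:-1]:              # N-terminal fragments
        running += m
        b.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):     # C-terminal fragments
        running += m
        y.append(running + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("NSGDIVNLGSIAGR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:9.4f}   y{i:<2} {y:9.4f}")
```

Each b ion and its complementary y ion sum to the peptide mass plus two protons, which is a convenient sanity check on any fragment table.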

Peptide sequence identifies the protein
H2N-NSGDIVNLGSIAGR-COOH
Fragment ladder: GDIVNLGSIAGR, DIVNLGSIAGR, IVNLGSIAGR, VNLGSIAGR, NLGSIAGR, LGSIAGR, GSIAGR, SIAGR, IAGR, AGR, GR, R
[Figure: MS/MS spectrum, relative abundance vs. m/z (200 to 1200)]
YMR134W, yeast protein involved in iron metabolism

High-throughput protein identification by LC-MS/MS and automated sequence database searching
Raw MS/MS spectrum → protein sequence and/or DNA sequence database search → peptide sequence match → protein identification
Direct identification of 1000+ proteins from complex mixtures
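The first step of a database search can be sketched as an in silico digest followed by a peptide-mass lookup. This is a simplified illustration and not the algorithm of any particular search engine: the two database entries are made-up toy sequences (one is built to contain the example peptide), the function names are hypothetical, trypsin is assumed as the protease, and the 0.05 Da tolerance is an arbitrary choice.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(observed_mass, database, tol=0.05):
    """Return (protein_id, peptide) pairs whose mass lies within tol Da."""
    hits = []
    for protein_id, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            if abs(peptide_mass(pep) - observed_mass) <= tol:
                hits.append((protein_id, pep))
    return hits


# Toy database: hypothetical sequences, for illustration only.
toy_db = {
    "toy_A": "MSEKNSGDIVNLGSIAGRVLPK",
    "toy_B": "MAGRLDEKPFTWR",
}
# 1371.71 Da is the computed neutral mass of NSGDIVNLGSIAGR.
matches = candidates(1371.71, toy_db)
print(matches)
```

A real search would then score the observed fragment ions against each candidate; the mass filter only narrows the list.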

Dealing with the data
1. Data acquisition
• Experimental information, metadata capture
2. Peak analysis
• Sequence database searching
• Quantitative analysis
3. Knowledge annotation and interpretation
• Database mining
• Assignment of function, pathway, localization etc.
• Output for database archiving, publication
Integrated workflow?

1. Data acquisition: capturing experimental information
Proteomics Experimental Data Repository (PEDRo): proposed schema
• Similar to genomic needs, but the experimental information differs somewhat

2. Peak Analysis
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences
• ProFound
• Mascot
• PepSea
• MS-Fit
• MOWSE
• Peptident
• Multident
• Sequest
• PepFrag
• MS-Tag
Protein identification
• Need CPU horsepower (parallel computing)
[Figure: MS/MS spectrum, relative abundance vs. m/z (200 to 1200)]

2. Peak Analysis: data formats
[Figure: Format 1, Format 2, Format 3; Output 1, Output 2, Output 3; question marks]
• Lack of flexibility
• Slow to evolve
• Lack of incorporation of competing products, methods

2. Peak Analysis: need general, flexible, in-house solutions
[Figure: Format 1, Format 2, Format 3 → reverse engineering of data formats → general tools for analysis of multiple data formats]
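One way to picture a "general tool" is a thin reader layer that converts each format into one shared in-memory record. The sketch below is an assumption-laden illustration: it handles only two simple text formats (Mascot Generic Format and Sequest-style .dta, where the first line is taken to hold the singly protonated mass and the charge), reads from strings rather than files, and the names `Spectrum`, `parse_mgf`, `parse_dta`, and `read_spectra` are hypothetical. The peak values are made up.

```python
from dataclasses import dataclass, field

PROTON = 1.00728


@dataclass
class Spectrum:
    precursor_mz: float
    charge: int
    peaks: list = field(default_factory=list)  # (m/z, intensity) pairs


def parse_mgf(text):
    """Read BEGIN IONS / END IONS blocks; ignores headers other than PEPMASS and CHARGE."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = Spectrum(0.0, 0)
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is None or not line:
            continue
        elif line.startswith("PEPMASS="):
            current.precursor_mz = float(line.split("=", 1)[1].split()[0])
        elif line.startswith("CHARGE="):
            current.charge = int(line.split("=", 1)[1].rstrip("+-"))
        elif "=" in line:
            continue  # other key=value headers such as TITLE
        else:
            mz, intensity = line.split()[:2]
            current.peaks.append((float(mz), float(intensity)))
    return spectra


def parse_dta(text):
    """First line: (M+H)+ mass and charge; remaining lines: m/z and intensity."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    mh, z = float(rows[0][0]), int(rows[0][1])
    precursor_mz = (mh - PROTON) / z + PROTON
    peaks = [(float(r[0]), float(r[1])) for r in rows[1:]]
    return [Spectrum(precursor_mz, z, peaks)]


READERS = {"mgf": parse_mgf, "dta": parse_dta}


def read_spectra(text, fmt):
    """Dispatch on format name; supporting a new format means adding one reader."""
    return READERS[fmt](text)


mgf_text = "BEGIN IONS\nTITLE=demo\nPEPMASS=686.8626\nCHARGE=2+\n175.119 100\n232.140 40\nEND IONS\n"
dta_text = "1372.7179 2\n175.119 100\n232.140 40\n"
from_mgf = read_spectra(mgf_text, "mgf")[0]
from_dta = read_spectra(dta_text, "dta")[0]
print(from_mgf, from_dta, sep="\n")
```

Downstream analysis code then depends only on `Spectrum`, not on any vendor or search-engine format.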

2. Peak Analysis: reverse engineering data formats
http://sashimi.sourceforge.net/software_glossolalia.html

2. Peak Analysis: quality control of protein matches
Unfiltered: 10^5+ matches (lots of noise and junk)
↓ filtering
Filtered: thousands of “true” matches
• Statistical analysis of database results (tools are available)
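The slide does not say which statistical method is meant. As one possible illustration, the sketch below uses a target-decoy estimate of the false discovery rate, a common approach that is swapped in here as an assumption. The scores are invented, and `filter_by_fdr` is a hypothetical name. Ties in score are not treated specially.

```python
def filter_by_fdr(matches, max_fdr=0.01):
    """Keep target matches above the lowest score cutoff with estimated FDR <= max_fdr.

    matches: list of (score, is_decoy) tuples; higher score means a better match.
    FDR at a cutoff is estimated as decoys / targets among matches above it.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    keep = 0  # number of top-ranked entries to accept
    for i, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            keep = i
    return [m for m in ranked[:keep] if not m[1]]


# Hypothetical scores: True marks a hit against a decoy (e.g. reversed) sequence.
toy_matches = [(9.0, False), (8.5, False), (8.0, False),
               (7.0, True), (6.5, False), (6.0, True)]
accepted = filter_by_fdr(toy_matches, max_fdr=0.25)
print(accepted)
```

With the toy data, four target matches pass at a 25% threshold and three pass when no decoys are tolerated.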

2. Peak Analysis: Quantitative analysis
• External chemical labeling
• Metabolic labeling (SILAC)
• Enzymatic incorporation (O16/O18)
• Flexibility is key: need tools to handle different quantitative methods

2. Peak Analysis: Quantitative analysis
[Figure: paired peaks labeled Sample 1 and Sample 2]
Relative intensity = relative protein abundance
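The relation "relative intensity = relative protein abundance" can be turned into a small calculation. The sketch below assumes that each peptide of a protein yields one intensity pair (Sample 1, Sample 2) and that the protein-level value is the median of the peptide log2 ratios. The median is an illustrative choice, not something stated on the slide, and the intensities are made up.

```python
import math
from statistics import median


def protein_log2_ratio(peptide_pairs):
    """Median log2(sample2 / sample1) over the peptides of one protein.

    peptide_pairs: list of (intensity_sample1, intensity_sample2).
    Pairs with a non-positive intensity are skipped; returns None if none remain.
    """
    ratios = [math.log2(s2 / s1) for s1, s2 in peptide_pairs if s1 > 0 and s2 > 0]
    return median(ratios) if ratios else None


# Hypothetical peptide intensities for a single protein.
pairs = [(1000.0, 2000.0), (500.0, 1100.0), (800.0, 1500.0)]
ratio = protein_log2_ratio(pairs)
print(ratio)  # a value of 1.0 means a two-fold increase in Sample 2
```

Working in log2 space treats a two-fold increase and a two-fold decrease symmetrically.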

Evolving methodologies: iTRAQ
• Samples 1, 2, 3, and 4 are each digested to peptides
• iTRAQ labels: +114 (sample 1), +115 (sample 2), +116 (sample 3), +117 (sample 4)
• Multidimensional separation, then MS/MS
• In the MS/MS spectrum, diagnostic ions at m/z 114, 115, 116, and 117 are used for quantitative analysis; peptide fragments are used for sequence identification
• 4-way multiplexing: simultaneous comparison of multiple states, replicates
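Reading the diagnostic ions out of an MS/MS spectrum can be sketched as a lookup against a table of reporter channels. The sketch below uses the nominal m/z values from the slide (114 to 117) with an assumed tolerance window of 0.2 Da, where a production tool would likely use exact reporter masses and a tighter window. The peak intensities are invented and the function names are hypothetical. Keeping the channel table as a parameter means a different labeling scheme changes only the table.

```python
# Nominal reporter m/z values taken from the slide; tolerance is an assumption.
REPORTER_CHANNELS = {"114": 114.0, "115": 115.0, "116": 116.0, "117": 117.0}


def reporter_intensities(peaks, channels=REPORTER_CHANNELS, tol=0.2):
    """Sum the intensity of peaks falling within tol of each reporter m/z."""
    found = {name: 0.0 for name in channels}
    for mz, intensity in peaks:
        for name, target in channels.items():
            if abs(mz - target) <= tol:
                found[name] += intensity
    return found


def normalize(intensities):
    """Express each channel as a fraction of the summed reporter signal."""
    total = sum(intensities.values())
    return {k: (v / total if total else 0.0) for k, v in intensities.items()}


# Hypothetical spectrum: four reporter peaks plus one sequence fragment.
toy_peaks = [(114.1005, 100.0), (115.0963, 200.0), (116.0972, 400.0),
             (117.1025, 300.0), (175.119, 50.0)]
raw = reporter_intensities(toy_peaks)
fractions = normalize(raw)
print(raw)
print(fractions)
```

The fragment peak at m/z 175.119 is ignored by the reporter lookup and would be used for sequence identification instead.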

Need for “changeable” tools
[Figure: “old” vs. “new” spectra; peaks labeled 1 to 4; m/z values 114.1005, 115.0963, 116.0972, 117.1025; y-axis: intensity]
Automated analysis tools?

3. Knowledge annotation: making sense of lists of data

3. Knowledge annotation: mining proteomic/genomic databases

3. Knowledge annotation: needs
• Annotation: accession numbers and protein names
• Functional assignments (functional degeneracy?)
• Pathway assignments
• Subcellular localization
• Disease implications
• Comparison of different proteomic datasets (e.g. expression profiles compared to modification state profiles, other protein properties)
• Automated and streamlined??
• Publication and deposit in databases
• Visualization of complex phenomena, interpretation of biological relevance
• Modeling, integration with genomics data: computational and systems biology
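Two of the needs listed above, attaching annotation to accession numbers and comparing datasets, reduce to simple lookups and set operations once the identifiers are consistent. The sketch below is a toy illustration: only the YMR134W entry comes from this talk, the `HYPO_` accessions are placeholders, and the function names are hypothetical.

```python
def annotate(accessions, annotation_table):
    """Map each accession to its annotation, flagging entries with none."""
    return {acc: annotation_table.get(acc, "unannotated") for acc in accessions}


def compare_datasets(first, second):
    """Split two accession lists into shared and dataset-specific proteins."""
    a, b = set(first), set(second)
    return {
        "both": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


# Annotation table: YMR134W is from the talk; other entries are placeholders.
annotations = {"YMR134W": "iron metabolism", "HYPO_0001": "placeholder function"}

expression_hits = ["YMR134W", "HYPO_0001", "HYPO_0002"]   # hypothetical dataset
modification_hits = ["YMR134W", "HYPO_0003"]              # hypothetical dataset

annotated = annotate(expression_hits, annotations)
overlap = compare_datasets(expression_hits, modification_hits)
print(annotated)
print(overlap)
```

In practice the hard part is the identifier mapping between databases, which this sketch takes as given.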
