Protein Identification by Database Searching John Cottrell Matrix Science
Three ways to use mass spectrometry data for protein identification
• Peptide Mass Fingerprint: a set of peptide molecular masses from an enzyme digest of a protein
PMF Servers on the Web
• ASCQ_ME: https://www.genopole-lille.fr/logiciel/ascq_me/
• Bupid: http://zlab.bu.edu/Amemee/
• Mascot: http://www.matrixscience.com/search_form_select.html
• MassSearch: http://www.cbrg.ethz.ch/services/MassSearch_new
• MS-Fit (Protein Prospector): http://prospector.ucsf.edu/prospector/mshome.htm
• PepMAPPER: http://www.nwsr.manchester.ac.uk/mapper/
• Profound (Prowl): http://prowl.rockefeller.edu/prowl-cgi/profound.exe
• Mowse, PeptideSearch, Protocall, Aldente, XProteo
Search
• Parameters
  • database
  • taxonomy
  • enzyme
  • missed cleavages
  • fixed modifications
  • variable modifications
  • protein MW
  • estimated mass measurement error
Henzel, W. J., Watanabe, C., Stults, J. T., JASMS 2003, 14, 931-942.
Peptide Mass Fingerprint
• Fast, simple analysis
• High sensitivity
• Need database of protein sequences
  • not ESTs or genomic DNA
• Sequence must be present in database
  • or close homolog
• Not good for mixtures
  • especially a minor component
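The matching step behind a peptide mass fingerprint can be sketched in a few lines. This is only an illustration, not how any of the listed servers scores a match: the function names, the 0.2 Da default tolerance and the simple match count are assumptions, and the residue masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # charge carrier for a singly protonated peptide


def peptide_mh(sequence):
    """Monoisotopic [M+H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


def count_matches(observed_masses, candidate_peptides, tol_da=0.2):
    """Count observed masses that fall within tol_da of any candidate peptide."""
    theoretical = [peptide_mh(p) for p in candidate_peptides]
    return sum(
        any(abs(obs - theo) <= tol_da for theo in theoretical)
        for obs in observed_masses
    )
```

A real search engine would turn such matches into a probability-based or empirical score rather than a raw count.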
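Peptide fragment ion nomenclature
[Diagram: backbone of a protonated four-residue peptide, H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH. Fragments containing the N-terminus are labelled a1, b1, c1, a2, b2, c2, a3, b3, c3, counting from the N-terminus; fragments containing the C-terminus are labelled x3, y3, z3, x2, y2, z2, x1, y1, z1, counting from the C-terminus.]
• Roepstorff, P. and Fohlman, J. (1984). Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11, 601.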
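As a worked illustration of the nomenclature, the sketch below computes singly charged b and y ion m/z values for an unmodified peptide. The function name is hypothetical, only the b and y series are covered (a, c, x and z are omitted), and the residue masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_y_ions(sequence):
    """Return {'b1': m/z, ..., 'y1': m/z, ...} for singly charged fragments."""
    ions = {}
    n = len(sequence)
    for i in range(1, n):
        # b_i holds the first i residues (N-terminal fragment).
        ions[f"b{i}"] = sum(RESIDUE_MASS[aa] for aa in sequence[:i]) + PROTON
        # y_i holds the last i residues plus water (C-terminal fragment).
        ions[f"y{i}"] = (
            sum(RESIDUE_MASS[aa] for aa in sequence[n - i:]) + WATER + PROTON
        )
    return ions
```

A useful sanity check is that b_i + y_(n-i) equals the peptide [M+H]+ plus one proton mass, since the two fragments are complementary.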
Three ways to use mass spectrometry data for protein identification
• Peptide Mass Fingerprint: a set of peptide molecular masses from an enzyme digest of a protein
• Sequence Query: mass values combined with amino acid sequence or composition data
Mann, M. and Wilm, M., Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66 4390-9 (1994).
1489.430 tag(650.213,GWSV,1079.335)
Sequence Tag Servers on the Web
• Mascot: http://www.matrixscience.com/search_form_select.html
• MS-Seq (Protein Prospector): http://prospector.ucsf.edu/prospector/mshome.htm
• MultiIdent (TagIdent, etc.): http://www.expasy.org/tools/multiident/
• PeptideSearch, Spider
Sequence Tag
• Rapid search times
  • Essentially a filter
• Error tolerant
  • Match peptide with unknown modification or SNP
• Requires interpretation of spectrum
  • Usually manual, hence not high throughput
• Tag has to be called correctly
  • Although ambiguity is OK
  • 2060.78 tag(977.4,[Q|K][Q|K][Q|K]EE,1619.7)
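How a tag acts as a filter can be sketched as follows. This is a simplified illustration under loud assumptions: the two flanking masses are treated as singly charged b-ion m/z values, both flanks must match (so the error-tolerant case, where only one flank agrees, is not covered), and the function name and 0.5 Da tolerance are hypothetical. Ambiguous positions written as [Q|K] are converted to a regular-expression character class.

```python
import re

# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def match_tag(peptide, low_mz, tag, high_mz, tol=0.5):
    """Return start positions where the tag and both flanking b-ion masses fit."""
    # "[Q|K][Q|K]EE" becomes the regex "[QK][QK]EE".
    pattern = re.compile(tag.replace("|", ""))
    hits = []
    for start in range(len(peptide)):
        m = pattern.match(peptide, start)
        if m is None:
            continue
        # b ion just before the tag and just after it (assumed ion series).
        b_before = sum(RESIDUE_MASS[aa] for aa in peptide[:start]) + PROTON
        b_after = sum(RESIDUE_MASS[aa] for aa in peptide[:m.end()]) + PROTON
        if abs(b_before - low_mz) <= tol and abs(b_after - high_mz) <= tol:
            hits.append(start)
    return hits
```

For example, `match_tag("AGGWSVK", 129.07, "GWSV", 558.27)` reports a match at position 2 of this made-up peptide.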
Three ways to use mass spectrometry data for protein identification
• Peptide Mass Fingerprint: a set of peptide molecular masses from an enzyme digest of a protein
• Sequence Query: mass values combined with amino acid sequence or composition data
• MS/MS Ions Search: uninterpreted MS/MS data from a single peptide or from a complete LC-MS/MS run
SEQUEST
• Eng, J. K., McCormack, A. L. and Yates, J. R., 3rd., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5 976-89 (1994)
MS/MS Ions Search Servers on the Web [list not reproduced]
MS/MS Ions Search
• Easily automated for high throughput
• Can get matches from marginal data
• Can be slow
  • No enzyme
  • Many variable modifications
  • Large database
  • Large dataset
• MS/MS is peptide identification
  • Proteins by inference
Search Parameters
• Sequence Database
  • Swiss-Prot (~500,000 entries)
    • High quality, non-redundant
  • NCBInr, UniRef100 (~19,000,000 entries)
    • Comprehensive, non-identical
  • EST databases (>400,000,000 entries)
    • Very large and very redundant
  • Sequences from a single genome
    • A consensus sequence
    • Peptides are lost at exon-intron boundaries
(Entry counts are from mid-2012)
Search Parameters
• Taxonomy
  • Swiss-Prot 2010_08
    Mammalia (mammals) = 65104
      Primates = 26940
        Homo sapiens (human) = 20292
        Other primates = 6648
      Rodentia (Rodents) = 25473
        Mus. = 16358
          Mus musculus (house mouse) = 16307
        Rattus = 7533
        Other rodentia = 1582
      Other mammalia = 12691
Search Parameters
• Mass Tolerances
  • Most search engines support separate mass tolerances for precursors and fragments
  • May allow fixed units (Da, mmu) or proportional units (ppm, %)
  • Some search engines can correct for selection of the 13C peak
  • Unless the search engine performs some type of re-calibration, you need to provide a conservative estimate of mass accuracy, not precision
  • This doesn’t have to be a guessing game: run a standard, then look at the error graphs for strong matches
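The difference between fixed and proportional tolerance units is easy to see in code. The sketch below is illustrative only; the function names and the set of unit strings are assumptions, not any search engine's interface. One mmu is taken as 0.001 Da.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed, theoretical, tol, unit="ppm"):
    """Check a match against a tolerance given in Da, mmu, ppm or %."""
    delta = abs(observed - theoretical)
    if unit == "Da":
        return delta <= tol
    if unit == "mmu":
        return delta <= tol * 1e-3                 # 1 mmu = 0.001 Da
    if unit == "ppm":
        return delta <= theoretical * tol * 1e-6   # scales with mass
    if unit == "%":
        return delta <= theoretical * tol * 1e-2
    raise ValueError(f"unknown tolerance unit: {unit}")
```

An error of 0.01 Da at m/z 1000 is 10 ppm, so it passes a 20 ppm window and fails a 5 ppm window. Plotting `ppm_error` for the strong matches from a standard run is one way to choose a realistic tolerance.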
Search Parameters
• Enzyme can be
  • Fully specific
  • Non-specific (“no enzyme”)
• Some search engines support
  • Limited number of missed cleavage points
  • Semi-specific enzymes
  • Enzyme mixtures
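A fully specific digest with a limited number of missed cleavages might be sketched like this. The cleavage rule used here (cut after K or R, but not before P) is the commonly quoted trypsin rule and is an assumption, as are the function names; semi-specific enzymes and enzyme mixtures are not covered.

```python
def cleavage_sites(protein):
    """Indices where the chain is cut, including both ends of the protein."""
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        # Assumed trypsin rule: cleave C-terminal to K/R unless followed by P.
        if aa in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    return sites


def digest(protein, missed=0):
    """List peptides with up to `missed` internal cleavage sites left uncut."""
    sites = cleavage_sites(protein)
    peptides = []
    for i in range(len(sites) - 1):
        last = min(i + 1 + missed, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides
```

For the made-up sequence `"AKRPGKM"`, `digest(..., missed=0)` yields three peptides and `missed=1` yields five, which shows how each additional missed cleavage enlarges the search space.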
Search Parameters
• Common peak list formats
  • DTA (Sequest)
  • PKL (Masslynx)
  • MGF (Mascot)
  • mzData (.XML)
  • mzML (.mzML)
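Of these formats, MGF is plain text and simple to read. The sketch below is a minimal reader, assuming the usual layout of BEGIN IONS / END IONS blocks holding KEY=value headers followed by "m/z intensity" lines; it ignores parameters outside the ion blocks and treats only "#" lines as comments, so it is not a complete implementation of the format. The function name and the returned structure are hypothetical.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': list} spectra."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=, PEPMASS= or CHARGE=.
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by an optional intensity.
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((mz, intensity))
    return spectra
```

Header values are kept as strings so that entries such as `CHARGE=2+` survive unchanged.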
Search Parameters
• Modifications
  • Fixed / static / quantitative modifications cost nothing
  • Variable / differential / non-quantitative modifications are very expensive
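The cost difference can be made concrete by enumerating the forms a search engine has to consider. A fixed modification only changes a residue mass, so each peptide still has one form; a variable modification doubles the number of forms for every site it could occupy. The sketch below assumes at most one variable modification per residue type, and the function name and example mass deltas (approximate values for oxidation and acetylation) are illustrative.

```python
from itertools import product


def modified_forms(peptide, variable_mods):
    """Enumerate per-residue mass deltas for every variable-modification form.

    variable_mods maps a residue letter to the mass delta of one variable
    modification (assumption: one modification per residue type).
    """
    options = []
    for aa in peptide:
        if aa in variable_mods:
            options.append((0.0, variable_mods[aa]))  # unmodified or modified
        else:
            options.append((0.0,))                    # only one possibility
    return list(product(*options))


# Example: a peptide with two Met and one Lys.
oxidation_only = modified_forms("MAMK", {"M": 15.99491})
oxidation_and_acetyl = modified_forms("MAMK", {"M": 15.99491, "K": 42.01057})
```

Here `oxidation_only` has 4 forms and `oxidation_and_acetyl` has 8, so the count grows as 2 to the power of the number of modifiable sites.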
Search Parameters
• Modifications
  • Common artefacts [table not reproduced]
Site Analysis [figures not reproduced]
Multi-pass Searches
• Implemented under a variety of names
  • X!Tandem: Model refinement
  • Mascot: Error tolerant search
  • Spectrum Mill: Search saved hits, homology mode, unassigned single mass gap
  • Phenyx: 2-rounds
  • Paragon: Thorough ID, fraglet-taglet
Scoring
[Figure: distributions of total matches, incorrect matches and correct matches plotted against score]
Scoring: Receiver Operating Characteristic [figure not reproduced]
Sensitivity & Specificity [figure not reproduced]
Sensitivity & Specificity
• Search a “decoy” database
  • Decoy entries can be reversed or shuffled or randomised versions of target entries
  • Decoy entries can be a separate database or concatenated to target entries
  • Gives a clear estimate of false discovery rate
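A minimal sketch of the decoy approach, assuming reversed decoys and a separate decoy search: the false discovery rate at a score threshold is estimated as the number of decoy matches divided by the number of target matches above that threshold. The function names are hypothetical, and concatenated target-decoy searches commonly use a different estimator.

```python
def reversed_decoy(sequence):
    """Build a decoy entry by reversing the target protein sequence."""
    return sequence[::-1]


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as decoy matches / target matches at or above threshold."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    if targets == 0:
        return 0.0
    return decoys / targets
```

With made-up target scores `[50, 40, 30, 20, 10]` and decoy scores `[25, 12, 5]`, a threshold of 20 keeps four target matches and one decoy match, giving an estimated FDR of 0.25. Raising the threshold to 30 removes the decoy match and one target match.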
Sensitivity & Specificity
[Figure: distributions of total matches, incorrect matches and correct matches plotted against score]
Protein Inference
General approach is to create a minimal list of proteins: the “principle of parsimony” or “Occam’s razor”.
• Protein A: Peptide 1, Peptide 2, Peptide 3
• Protein B: Peptide 1, Peptide 3
• Protein C: Peptide 2
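One simple way to build a minimal protein list is a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. This is a sketch of the parsimony idea, not the algorithm of any particular search engine, and a greedy pass is only an approximation to the true minimum. The function name is hypothetical; ties are broken alphabetically.

```python
def minimal_protein_list(protein_peptides):
    """Greedy parsimony: choose proteins until every peptide is explained."""
    remaining = set().union(*protein_peptides.values())
    chosen = []
    while remaining:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(
            sorted(protein_peptides),
            key=lambda name: len(protein_peptides[name] & remaining),
        )
        chosen.append(best)
        remaining -= protein_peptides[best]
    return chosen


# The example from the slide: peptides 1-3 matched to proteins A, B and C.
matches = {
    "Protein A": {1, 2, 3},
    "Protein B": {1, 3},
    "Protein C": {2},
}
```

For the slide's example, `minimal_protein_list(matches)` returns only Protein A, because it alone accounts for all three peptides; Proteins B and C add no further evidence.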
Further Reading: Exercises: http://www.ms-ms.com/exercises/exercises.html