Protein Identification via Database Searching
Attila Kertész-Farkas, kfattila@icgeb.org
Protein Structure and Bioinformatics Group, ICGEB, Trieste
Workflow: Biological sample → Mass spectra analysis → Results report
Computational analysis of MS/MS
• Two approaches:
  • De novo sequencing
  • Database searching
• Hybrid: a combination of the two
De novo sequencing
• Advantages:
  • Can identify new peptides and proteins
  • Able to discover (new) PTMs
  • Independent of protein databases
• Limitations:
  • Requires MS/MS data of good quality
  • No statistics-based validation
Database searching-based identification of MS/MS (tandem mass) spectra
• Pipeline:
  • Input data (data formats)
  • Peptide assignment (peptide identification by database searching)
  • Validation (statistical methods for validation)
  • Protein inference (protein assembling)
  • Interpretation and quantitation
Input data: mass spectrum
• A mass spectrum is a histogram of the mass-over-charge (m/z) values of the observed fragment ions.
• Spectrum normalization: intensity is usually scaled to the [0, 100] interval.
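The normalization step can be sketched as follows. This is a minimal illustration that assumes a spectrum is stored as a list of (m/z, intensity) pairs; the function name and the data layout are illustrative choices, not taken from the slides.

```python
def normalize_spectrum(peaks, top=100.0):
    """Scale intensities so that the most intense peak equals `top`.

    `peaks` is a list of (m/z, intensity) pairs (an assumed layout).
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        # Nothing to scale: report every peak with zero intensity.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, top * intensity / max_intensity) for mz, intensity in peaks]
```

After this step the tallest peak has intensity 100 and all other peaks are expressed relative to it.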
Input data: file formats
• The most common formats are mzXML, MGF and DAT.
Input data: MGF file format
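As a rough illustration of the MGF layout, the sketch below reads the commonly used subset of the format: each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=value` parameter lines followed by "m/z intensity" peak lines. The parser name, the returned dictionary layout and the example values are all made up for this illustration; real files may carry additional fields.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {"params": {...}, "peaks": [...]} dicts."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Parameter line such as PEPMASS=498.27
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up example record, for illustration only.
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=example spectrum (made-up values)
PEPMASS=498.27
CHARGE=2+
147.11 12.0
276.16 45.5
END IONS
"""
```

Calling `parse_mgf(EXAMPLE_MGF)` returns one spectrum with three parameters and two peaks.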
Input data: .mzXML file format
Peptide assignment
• Each experimental spectrum (input data) is compared with the peptides of the entries in the protein sequence DB (spectra comparison); every comparison yields a score.
• Protein sequence DB entries used in the example:
>IPI:IPI00000044.1|SWISS-PROT:P01127
MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA
>IPI:IPI00000045.1|SWISS-PROT:P18510-1
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE
• The scores are kept in a ranked list, written as "candidate number. score". Final list for the first spectrum: 15. 32, 3. 4, 14. 3, 1. 2, 7. 2, 2. 1, 4. 1, 9. 1, 12. 1
• The top-scoring peptide is reported for each experimental spectrum:
  • Score: 32 Peptide: SHLITLLLFLFHSETICR
  • Score: 4 Peptide: AELDLNMTR
  • Score: 3 Peptide: MEICRGLR
  • Score: 15 Peptide: LLHGDPGEEDK
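The search loop on these slides can be sketched as below. Two things are assumptions rather than statements from the slides: candidate peptides are generated by a simplified tryptic digestion (cleavage after K or R, up to one missed cleavage, no proline rule), which is at least consistent with the four reported peptides all ending in K or R; and the scoring function is passed in as a parameter because the concrete score is a separate choice. All function names are hypothetical.

```python
import re


def tryptic_peptides(protein, missed_cleavages=1):
    """Enumerate peptides from a simplified tryptic digest (assumed rule)."""
    # Split after every K or R; a trailing piece without K/R is kept too.
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", protein)
    peptides = []
    for start in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            end = start + extra + 1
            if end <= len(pieces):
                peptides.append("".join(pieces[start:end]))
    return peptides


def search(spectra, proteins, score):
    """Return (best score, best peptide) for every experimental spectrum.

    `score(spectrum, peptide)` is any user-supplied scoring function.
    """
    results = []
    for spectrum in spectra:
        best_score, best_peptide = None, None
        for protein in proteins:
            for peptide in tryptic_peptides(protein):
                s = score(spectrum, peptide)
                if best_score is None or s > best_score:
                    best_score, best_peptide = s, peptide
        results.append((best_score, best_peptide))
    return results
```

The nested loops make the cost explicit: every spectrum is scored against every candidate peptide of every database entry.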
Peptide assignment: two components
• 1. Spectra comparison
• 2. Protein sequence DB
Spectra comparison: Shared Peak Count (SPC)
• The number of peaks in the theoretical spectrum that are matched to peaks in the experimental spectrum.
• Example: SPC = 7
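A minimal sketch of the shared peak count, assuming both spectra are given as lists of m/z values and that two peaks "match" when they lie within a fixed tolerance. The tolerance value and the function name are illustrative assumptions.

```python
def shared_peak_count(theoretical_mz, experimental_mz, tol=0.5):
    """Count theoretical peaks that have an experimental peak within `tol`."""
    return sum(
        1
        for t in theoretical_mz
        if any(abs(t - e) <= tol for e in experimental_mz)
    )
```

For example, `shared_peak_count([100.0, 200.0, 300.0], [100.2, 250.0, 299.8])` counts the first and third theoretical peaks as matched.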
Spectra comparison: Inner product (I)
• The sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum.
• Example: I = 3.5
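The inner product as defined on this slide can be sketched in the same style. The sketch assumes experimental peaks are (m/z, intensity) pairs, theoretical peaks are bare m/z values, and matching uses a fixed tolerance; these choices and the function name are illustrative.

```python
def inner_product(theoretical_mz, experimental_peaks, tol=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak."""
    return sum(
        intensity
        for mz, intensity in experimental_peaks
        if any(abs(mz - t) <= tol for t in theoretical_mz)
    )
```

Unlike the shared peak count, this score rewards matches to intense peaks more than matches to weak ones.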
Spectra comparison: Hyperscore
• H = I * Nb! * Ny!
  - I is the sum of the intensities of the matched peaks
  - Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum
  - ! is the factorial function
• Example: H = 3.2 * 3! * 4! = 3.2 * 6 * 24 = 460.8
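The hyperscore formula translates directly into code. The sketch below takes the three quantities as inputs (how they are obtained from the spectra is left out) and reproduces the worked example; the function name is an illustrative choice.

```python
from math import factorial


def hyperscore(intensity_sum, n_b, n_y):
    """H = I * Nb! * Ny!, with I the summed intensity of matched peaks."""
    return intensity_sum * factorial(n_b) * factorial(n_y)
```

With I = 3.2, Nb = 3 and Ny = 4 this gives 3.2 * 6 * 24, which is approximately 460.8. The factorials make the score grow very quickly with the number of matched ions.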
Spectra comparison: Xcorr
• q is the query spectrum; t is the theoretical spectrum.
• I(q,t) = 3.2
• The inner product is evaluated again for shifted copies of the theoretical spectrum: I(q,t[-75]), ..., I(q,t[-32]), ..., I(q,t[0]), ..., I(q,t[32]), and so on.
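The Xcorr formula itself is not part of the slide text, so the sketch below assumes the commonly cited form: Xcorr(q,t) = I(q,t) minus the average of I(q,t[s]) over all shifts s from -75 to 75. It also assumes both spectra have already been binned into equal-length intensity vectors, so that a shift moves the theoretical spectrum by whole bins. Function names and the binning are illustrative assumptions.

```python
def shifted_inner(q, t, shift=0):
    """I(q, t[shift]): dot product of q with t moved right by `shift` bins.

    Bins that move outside the vector are dropped.
    """
    return sum(
        q[i] * t[i - shift]
        for i in range(len(q))
        if 0 <= i - shift < len(t)
    )


def xcorr(q, t, max_shift=75):
    """Inner product at zero shift minus the mean over all shifts (assumed form)."""
    shifts = range(-max_shift, max_shift + 1)
    background = sum(shifted_inner(q, t, s) for s in shifts) / len(shifts)
    return shifted_inner(q, t, 0) - background
```

The subtraction corrects for matches that occur by chance: a spectrum that scores equally well at every shift gets an Xcorr near zero. Evaluating 151 shifted inner products per candidate is also why this score is comparatively slow.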
Protein sequence DB: protein sequence databases
• Completeness: a more complete database means a longer searching time
• Redundancy:
  • Sequence variations can be found
  • A redundant database can distort the statistics
• Quality of sequence annotation
Protein sequence DB: common databases
• Entrez Protein DB
  • http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
  • Most complete, redundant
• Reference Sequence (RefSeq) and UniProt (Swiss-Prot and TrEMBL)
  • http://www.ncbi.nlm.nih.gov/RefSeq/
  • http://www.uniprot.org/
  • Well annotated, non-redundant
• International Protein Index (IPI)
  • http://www.ebi.ac.uk/IPI/IPIhelp.html
  • Represents a good balance between redundancy and completeness.
  • Contains cross-references to Ensembl, UniProt and RefSeq.
• Sequences from a single genome
  • Difficult to obtain good statistics on small datasets.
Protein sequence DB: taxonomy
• Allows searches to be limited to entries from particular species or groups of species.
• Speeds up a search and ensures that the hit list contains only entries from the selected species.
• For non-redundant databases, a single entry may represent identical sequences from multiple species. The accession string and title text from the FASTA entry, listed on the master results page, will usually describe just one of these entries. To see the equivalent entries, and to explore their taxonomy, follow the accession number link in the results list to the Protein View. If the hit is from a non-redundant database and represents multiple entries with identical sequences, the Protein View will include links to NCBI Entrez and the NCBI Taxonomy Browser for all equivalent entries.
Run time
• A database search has to enumerate all peptides and compare them to all experimental spectra.
• This can be slow with large protein sequence databases, especially when a slow scoring function such as Xcorr is applied.
Speedup techniques
• Fast database indexing
  • Fast implementation of sequence indexing in the database
• Parent mass check
  • PTMs can be lost (a modified peptide no longer matches the unmodified parent mass)
• SEQUEST's preliminary score
• Tag-based filtering (de novo hybrid)
  • Increases the specificity (or sensitivity)
Advanced database indexing
• Better implementation of the sequence indexing
• Better representation of protein sequences
Parent mass check
• Before the spectra comparison, each candidate peptide from the protein sequence DB goes through a parent mass check.
• Only candidates that pass the check are scored; for the others no score is computed.
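A parent mass check combined with a simple mass-sorted index could look like the sketch below. It assumes unmodified peptides, standard monoisotopic residue masses, a neutral parent mass as the query and a symmetric tolerance window; the function names and the index layout are hypothetical and not taken from any particular search engine.

```python
from bisect import bisect_left, bisect_right

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_index(peptides):
    """Sort peptides by mass once; return parallel (masses, peptides) lists."""
    pairs = sorted((peptide_mass(p), p) for p in peptides)
    return [m for m, _ in pairs], [p for _, p in pairs]


def candidates(index, parent_mass, tol=0.5):
    """Return the peptides whose mass lies within `tol` of `parent_mass`."""
    masses, peptides = index
    lo = bisect_left(masses, parent_mass - tol)
    hi = bisect_right(masses, parent_mass + tol)
    return peptides[lo:hi]
```

Because the index is sorted, each lookup is a binary search instead of a scan over all peptides. The same sketch also shows why PTMs can be lost: a modified peptide has a different mass and falls outside the window of its unmodified sequence.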
Fast prescoring (used in SEQUEST)
• The so-called Sp score.
• R(q,t) is the maximum number of consecutive matched b-y ions.
• Example: Sp = 3.2 * 7 * (1 + 0.0075 * 4) / 10 = 2.3072
• SEQUEST selects the 500 top-scoring peptides, ranked by Sp, and rescores them using Xcorr.
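The two-stage scheme can be sketched as follows. The general Sp formula is not in the slide text; the sketch reads the worked example as Sp = I * SPC * (1 + 0.0075 * R) / N, with I = 3.2 the matched intensity sum, SPC = 7 the shared peak count, R = 4 the longest run of consecutive matches and N = 10 the number of theoretical peaks. That reading, the function names and the way scores are supplied are assumptions.

```python
import heapq


def sp_score(intensity_sum, shared_peaks, consecutive, n_theoretical):
    """Preliminary score, as read from the worked example (assumed form)."""
    return intensity_sum * shared_peaks * (1 + 0.0075 * consecutive) / n_theoretical


def two_stage_rank(peptides, sp_of, xcorr_of, keep=500):
    """Keep the `keep` best peptides by Sp, then order them by Xcorr.

    `sp_of` and `xcorr_of` are user-supplied functions mapping a peptide
    to its score against the current spectrum.
    """
    shortlist = heapq.nlargest(keep, peptides, key=sp_of)
    return sorted(shortlist, key=xcorr_of, reverse=True)
```

The point of the design is that the expensive score is computed for at most `keep` candidates per spectrum, while the cheap score handles the full candidate list.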