Bioinformatics Tools for Proteomics

Presentation Transcript


  1. Bioinformatics Tools for Proteomics. Simon Hubbard, Faculty of Life Sciences

  2. Overview of projects/problems @ Manchester
     • Data management
       • Data standards: COGEME (Oliver, Paton), 3GP (Gaskell, Hubbard), PEDRO (Oliver, Paton, Brass, Hubbard)
       • Databases: PEDRO/Pierre, MSMS PepSeeker (Hubbard)
       • Data integration: ISPIDER (Hubbard, Embury, Goble, Stevens, Oliver, Paton), SILAC ratios (Hubbard, Sims)
     • Protein and peptide identification problems
       • PMF: GAPSIA (Oliver, Hubbard), ISPIDER (Hubbard)
       • Tandem MS: MSMS/machine learning (Hubbard, Yin)
       • Chicken proteomics (Hubbard)

  3. Diversity of proteome data
     • Gels
     • Sequences, e.g. >A01562 MAPKATYLIGAADKFHW and >A01567 MAQQPKEMLNILADKFHWFLYC
     • Mass spec
     • Structures/folds
     • Other data: species, PTMs, pathways, functional annotation, transcriptome data

  4. Integrated Proteomics Informatics Platform - Architecture
     [Architecture diagram; components carry RA1-6 and WP1-6 labels.]
     • ISPIDER Proteomics Clients: 2D Gel Visualisation Client (+ Phosph. Extensions, + Aspergil. Extensions), Vanilla Query Client, PPI Validation + Analysis Client, Protein ID Client
     • ISPIDER Proteomics Grid Infrastructure: Proteome Request Handler, Proteomic Ontologies/Vocabularies, Source Selection Services, Instance Ident/Mapping Services, Data Cleaning Services
     • Existing E-Science Infrastructure: myGrid Ontology Services, myGrid DQP, myGrid Workflows, AutoMed, DAS
     • Web services over Public Proteomic Resources, ISPIDER Resources and Existing Resources: PRIDE, PEDRo, GS, PS, PF, TR, FA, PPI, Phos, PID
     • KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package

  5. ISPIDER workflow enacted using Taverna
     [Workflow diagram; labels: PepMapper identification software, Web services, GO entry.]

  6. Chicken ESTs
     • BBSRC collection of 330K ESTs + 20K cDNAs
     • Goals
       • Can we enable proteomics in chicken pre-genome?
       • Help validate gene predictions post-genome?
     [Figure labels: EST contig, Assembled ESTs.]

  7. Chicken proteomics - EORF program
     • Dynamic programming based algorithm
     • Uses synonymous codon bias/probabilities
     • Quality scores, custom gap penalties + stop penalty
     • Can also use BLAST output

     EST sequence: AATTTAGACCGAAGTCCCAGACTGATCCAGTTCAAATGGGAGGCCT
     FRAME 1: Aat Tta Gac Cga Agt Ccc Aga Ctg Atc Cag Ttc Aaa Tgg Gag Gcc
     FRAME 2: Att Tag Acc Gaa Gtc Cca Gac Tga Tcc Agt Tca Aat Ggg Agg Cct
     FRAME 3: Ttt Aga Ccg Aag Tcc Cag Act Gat Cca Gtt Caa Atg Gga Ggc
     [Figure labels: Translation, Frameshift, Mutation.]
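The three reading frames shown on this slide can be reproduced with a few lines of Python. This is only a plain three-frame translation using the standard genetic code, i.e. the baseline that EORF is compared against. It is not the EORF dynamic programming algorithm, whose codon-bias scoring and gap/stop penalties are not specified in the slides. The function names `translate` and `three_frame_translation` are illustrative choices, not part of any EORF software.

```python
# Standard genetic code, laid out with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame):
    """Translate one forward reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons containing ambiguous bases (e.g. N) are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    est = "AATTTAGACCGAAGTCCCAGACTGATCCAGTTCAAATGGGAGGCCT"  # sequence from the slide
    for number, protein in enumerate(three_frame_translation(est), start=1):
        print(f"FRAME {number}: {protein}")
```

Frame 2 of the slide's sequence contains two stop codons (Tag and Tga), which illustrates why a single fixed frame is a poor translation of an error-prone EST and why an approach that can switch frames may help.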

  8. Impact of EORF on Proteome Database Searching
     For 7 of 10 spots, EORF translations give much better pepmap5 scores than the top-scoring frame from a 3-frame translation of the same sequence.

     EORF predicted (top-scoring hit)
     #--> num prot_name bayescore lograw fbayscore lograw
     HIT> 001 356655.3 EORF Prediction   Bayesian Score: 1.00e-00   Raw Score: 45.74

         __num  search_m/z     db_m/z  msdiff  dppm  start  len  z  p  seq : mods
         PEP  6  1018.5110  1018.4880  0.0230    22    196    8  1  0  HAEYTLER
         PEP  1   832.3140   832.3723  0.0582    69    319    7  1  0  EEQEAAR
         PEP  2   870.5370   870.5083  0.0287    32    362    8  1  1  KPSPSKAR
         PEP  5  1003.5630  1003.5345  0.0284    28    555    8  1  1  EQTKQLEK
         PEP  9  1158.6310  1158.6404  0.0094     8    599   10  1  1  QALKNQISEK
         PEP 13  1404.7130  1404.7190  0.0060     4    625   11  1  1  SQRQQELMQLK:M-Oxidation
         PEP 11  1262.5970  1262.6163  0.0193    15    758   11  1  0  HLNQDHTVNGK
         PEP 25  2663.2100  2663.3972  0.1873    70    870   26  1  1  ALSVLSIPNNVTGGRNGVLCADIHSR
         PEP 12  1284.7321  1284.7133  0.0188    14    948   12  1  1  VQLLRVCGPGSR

     Highest-scoring match from 3-frame translations (50th hit)
     HIT> 050 356655.3.frame2   Bayesian Score: 1.13e-08   Raw Score: 27.44

         __num  search_m/z     db_m/z  msdiff  dppm  start  len  z  p  seq : mods
         PEP  3   915.4750   915.4756  0.0007     0    196    8  1  1  TCRIHTGK
         PEP  5  1003.5630  1003.5345  0.0284    28    555    8  1  1  EQTKQLEK
         PEP  9  1158.6310  1158.6404  0.0094     8    599   10  1  1  QALKNQISEK
         PEP 13  1404.7130  1404.7190  0.0060     4    625   11  1  1  SQRQQELMQLK:M-Oxidation
         PEP 11  1262.5970  1262.6163  0.0193    15    758   11  1  0  HLNQDHTVNGK
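The `msdiff` and `dppm` columns in the pepmap5 output relate the measured peak to the database peptide mass. The sketch below assumes that `dppm` is the absolute mass difference expressed in parts per million of the database mass. That reading is inferred from the listed values, not from pepmap5 documentation, and the function names and the 100 ppm tolerance are hypothetical.

```python
def ppm_error(search_mz, db_mz):
    """Absolute mass error in parts per million, relative to the database m/z."""
    return abs(search_mz - db_mz) / db_mz * 1e6


def within_tolerance(search_mz, db_mz, tolerance_ppm=100.0):
    """True if a measured peak matches a database peptide within the tolerance.

    The 100 ppm default is an illustrative value, not a pepmap5 setting.
    """
    return ppm_error(search_mz, db_mz) <= tolerance_ppm


if __name__ == "__main__":
    # (search_m/z, db_m/z, peptide) taken from rows of the slide's output.
    rows = [
        (1018.5110, 1018.4880, "HAEYTLER"),
        (1158.6310, 1158.6404, "QALKNQISEK"),
        (2663.2100, 2663.3972, "ALSVLSIPNNVTGGRNGVLCADIHSR"),
    ]
    for search_mz, db_mz, peptide in rows:
        print(f"{peptide}: {ppm_error(search_mz, db_mz):.1f} ppm")
```

Expressing the error in ppm rather than in absolute mass units makes the tolerance comparable across small and large peptides, which is why peptide mass fingerprinting tools typically report it.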

  9. Chicken proteomics
     [Workflow diagram]
     • Gene prediction on the chicken genome: ~18K genes, ~28K proteins → ENSEMBL proteins (~28K)
     • Chicken ESTs and cDNAs: EST assembly into contigs (80K) → EORF proteins (80K)
     • Chicken sample prep → LC-MS/MS → identification with Mascot
     • Identified peptides from the 2 protocols → validation, re-annotation & new genes?

  10. Example peptides mapped to TIM
      Identified peptides:
      • GAFTGEISPAMIK
      • DIGAAWVILGHSER
      • HVFGESDELIGQK
      • AIADNVK
      • VVLAYEPVWAIGTGK
      • IIYGGSVTGGNCK

      Protein sequence:
      TEVVCGAPSIYLDFARQKLDAKIGVAAQNCYKVPKGAFTGEISPAMIKDIGAAWVILGHS
      ERRHVFGESDELIGQKVAHALAEGLGVIACIGEKLDEREAGITEKVVFEQTKAIADNVKD
      WSKVVLAYEPVWAIGTGKTATPQQAQEVHEKLRGWLKSHVSDAVAQSTRIIYGGSVTGGN
      CKELASQHDVDGFLVGGASLKPEFVDIINAK
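Mapping identified peptides back onto a protein sequence, as on this slide, amounts to exact substring matching plus a sequence coverage calculation. The sketch below uses the TIM sequence and peptides from the slide. The function names are illustrative, and treating I/L as distinct and ignoring modifications are simplifying assumptions.

```python
def map_peptides(protein, peptides):
    """Return {peptide: [0-based start positions]} for exact matches in the protein."""
    hits = {}
    for peptide in peptides:
        starts = []
        position = protein.find(peptide)
        while position != -1:
            starts.append(position)
            position = protein.find(peptide, position + 1)
        hits[peptide] = starts
    return hits


def covered_residues(protein, peptides):
    """Set of 0-based residue indices covered by at least one peptide."""
    covered = set()
    for peptide, starts in map_peptides(protein, peptides).items():
        for start in starts:
            covered.update(range(start, start + len(peptide)))
    return covered


# TIM sequence fragment and identified peptides as shown on the slide.
TIM = (
    "TEVVCGAPSIYLDFARQKLDAKIGVAAQNCYKVPKGAFTGEISPAMIKDIGAAWVILGHS"
    "ERRHVFGESDELIGQKVAHALAEGLGVIACIGEKLDEREAGITEKVVFEQTKAIADNVKD"
    "WSKVVLAYEPVWAIGTGKTATPQQAQEVHEKLRGWLKSHVSDAVAQSTRIIYGGSVTGGN"
    "CKELASQHDVDGFLVGGASLKPEFVDIINAK"
)
PEPTIDES = [
    "GAFTGEISPAMIK", "DIGAAWVILGHSER", "HVFGESDELIGQK",
    "AIADNVK", "VVLAYEPVWAIGTGK", "IIYGGSVTGGNCK",
]

if __name__ == "__main__":
    for peptide, starts in map_peptides(TIM, PEPTIDES).items():
        # Report 1-based positions, as is conventional for protein sequences.
        print(peptide, [start + 1 for start in starts])
    coverage = len(covered_residues(TIM, PEPTIDES)) / len(TIM)
    print(f"Sequence coverage: {coverage:.1%}")
```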

  11. Interesting contig-based matches
      Contig 356557.10 matches ENSGALP00000016523

      >ENSGALP00000016523 ENSGALT00000016542 ENSGALG00000010175 chr=3 start=26593306 end=26598709 strand=-1
      Length = 725

          Query: 1   NPDDITNEEYGEFYK 15
                     NPDDIT EEYGEFYK
          Sbjct: 293 NPDDITQEEYGEFYK 307

      • EORF peptide (MS E-value = 2 x 10^-6)
      • ENSEMBL peptide
      • Other peptides were also identified in this gene from the ENSEMBL and EORF pipelines
      • Heat shock cognate protein HSP 90-beta
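The middle line of the alignment on this slide marks where the EORF-derived peptide and the ENSEMBL protein agree. A minimal sketch of how such a line can be derived for two equal-length, ungapped sequences follows. It only marks identities. BLAST itself also marks conservative substitutions with "+", which is not reproduced here, and the function names are illustrative.

```python
def identity_midline(query, subject):
    """Build a BLAST-style middle line: the residue where both agree, a space otherwise."""
    if len(query) != len(subject):
        raise ValueError("sequences must have equal length (ungapped alignment)")
    return "".join(q if q == s else " " for q, s in zip(query, subject))


def mismatches(query, subject):
    """List (1-based position, query residue, subject residue) for each difference."""
    return [
        (index, q, s)
        for index, (q, s) in enumerate(zip(query, subject), start=1)
        if q != s
    ]


if __name__ == "__main__":
    query = "NPDDITNEEYGEFYK"    # peptide from contig 356557.10
    subject = "NPDDITQEEYGEFYK"  # residues 293-307 of ENSGALP00000016523
    print(query)
    print(identity_midline(query, subject))
    print(subject)
    print(mismatches(query, subject))
```

For the slide's example the two sequences differ at a single position (N in the contig-derived peptide, Q in the ENSEMBL protein), which is the kind of difference that would prevent the ENSEMBL sequence alone from matching the observed peptide exactly.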

  12. Interesting contig-based matches (2)
      Query= 333647.5 (23 letters) maps to ENSGALP00000023917

      >ENSGALP00000023917 ENSGALT00000023963 ENSGALG00000014846 chr=Z start=6315811 end=6343507 strand=-1

          Query: 1 GITAVSNNAGVDNFGLGLLLQTK 23
                             VDNFGLGLLLQTK
          Sbjct: 1 ----------VDNFGLGLLLQTK 13

      [Figure labels: Exon 4, Exon 3.]

  13. EORF peptide hits with no transcript match
      • Exonerate used to map 37 contigs to the genome
      • Contig 356093.1: exonerate raw score 1378, from 35508388–35508979 on chromosome 4
      • Also has BLASTX hits to UniRef:

          UniRef100_Q96I23 Hypothetical protein [Homo sapiens]                    106  3e-22
          UniRef100_Q9D1C3 Mus musculus 18-day embryo whole body cDNA, RIK...    105  5e-22

  14. Isotopic modelling of theoretical fragment ions
      • Modified SEQUEST approach (BISA)
      [Figure panels: Experimental spectrum; SEQUEST theoretical spectrum; BISA theoretical spectrum.]

  15. Specificity – Sensitivity analysis
      • ROC plots demonstrate improvement
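A ROC plot of the kind mentioned on this slide is built by sweeping a score threshold and recording sensitivity (true positive rate) against 1 - specificity (false positive rate). The sketch below shows that calculation in plain Python. The scores and labels in the usage example are made-up toy values, not results from the BISA or SEQUEST comparison, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    scores: higher means more confident that the identification is correct.
    labels: 1 for a correct identification, 0 for an incorrect one.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    index = 0
    while index < len(ranked):
        threshold = ranked[index][0]
        # Consume all items tied at this score before emitting a point.
        while index < len(ranked) and ranked[index][0] == threshold:
            if ranked[index][1]:
                true_pos += 1
            else:
                false_pos += 1
            index += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (fpr, tpr) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6]  # hypothetical match scores
    toy_labels = [1, 0, 1, 0]          # hypothetical ground truth
    curve = roc_points(toy_scores, toy_labels)
    print(curve, area_under_curve(curve))
```

Comparing two scoring schemes then reduces to comparing their curves, or their areas, on the same labelled set of spectra: a curve closer to the top-left corner means higher sensitivity at the same specificity.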

  16. MS/MS relational database

  17. Acknowledgements
      • University of Manchester: Khalid Belhajjame, Jennifer Siepen, Jennifer Lynch, Ian Overton, Haizhou Tang, Thomas McLoughlin, Chris Cole, Julian Selley
      • Sheffield: Stuart Wilson
      • Dundee: Cheryll Tickle
      • Steve Oliver, Norman Paton, Suzanne Embury, Carol Goble, Robert Stevens
