Direct Experimental Observation of Functional Protein Isoforms by Tandem Mass Spectrometry • Nathan Edwards • Center for Bioinformatics and Computational Biology • University of Maryland, College Park
Synopsis • MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. • Key concepts: • Spectrum acquisition is unbiased • Direct observation of amino-acid sequence • Sensitive to small sequence variations
Synopsis • MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. • Applications: • Cancer biomarkers • Genome annotation
Mass Spectrometry for Proteomics • Measure mass of many (bio)molecules simultaneously • High bandwidth • Mass is an intrinsic property of all (bio)molecules • No prior knowledge required
Mass Spectrometer • Sample → Ionizer → Mass Analyzer → Detector • Ionizer: MALDI, Electrospray Ionization (ESI) • Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion-Trap • Detector: Electron Multiplier (EM)
High Bandwidth • [Figure: mass spectrum, % intensity (0 to 100) vs. m/z (250 to 1000)]
Mass Spectrometry for Proteomics • Measure mass of many molecules simultaneously • ...but not too many, abundance bias • Mass is an intrinsic property of all (bio)molecules • ...but need a reference to compare to
Mass Spectrometry for Proteomics • Mass spectrometry has been around since the turn of the 20th century... • ...why is MS-based proteomics so new? • Ionization methods • MALDI, Electrospray • Protein chemistry & automation • Chromatography, Gels, Computers • Protein / genome sequences • A reference for comparison
Sample Preparation for Peptide Identification • Enzymatic digest and fractionation
Single Stage MS • [Figure: MS spectrum, intensity vs. m/z]
Tandem Mass Spectrometry (MS/MS) • Precursor selection • [Figure: MS spectrum vs. m/z with one precursor ion selected]
Tandem Mass Spectrometry (MS/MS) • Precursor selection + collision-induced dissociation (CID) • [Figure: MS/MS fragment spectrum vs. m/z]
Peptide Identification • For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well • Peptide sequences from (any) sequence database • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ... • Automated, high-throughput peptide identification in complex mixtures
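The three steps above (compute fragment masses, compare with the spectrum, retain good matches) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: singly charged b- and y-ions, a fixed m/z tolerance, and a shared-peak count as the score. The function names and the `min_matches` threshold are hypothetical and are not taken from any particular search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments within tol of an observed peak."""
    return sum(any(abs(mz - p) <= tol for p in peaks)
               for mz in fragment_mz(peptide))


def best_peptides(candidates, peaks, min_matches=3, tol=0.5):
    """Step 3: keep candidates that match well, best first (assumed threshold)."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)


if __name__ == "__main__":
    # Hypothetical "observed" peaks: the ideal fragments of MARNR.
    observed = fragment_mz("MARNR")
    print(best_peptides(["MARNR", "MVMAR"], observed))
```

Real search engines such as Mascot, X!Tandem and OMSSA use statistical scoring rather than a raw shared-peak count; the sketch only shows the shape of the loop.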
Peptide Identification • ...can provide direct experimental evidence for the amino-acid sequence of functional proteins. • Evidence for: • Functional protein isoforms • Translation start and frame • Proteins with short open-reading-frames
Why is this useful for... genome annotation? • Evidence for SNPs and alternative splicing stops with transcription • No genomic or transcript evidence for translation start-site • Conservation doesn’t stop at coding bases! • Statistical gene-finders struggle with micro-exons, translation start-sites, and short ORFs.
Why is this useful for... cancer biomarkers? • Alternative splicing is the norm! • Only 20-25K human genes • Each gene makes many proteins • Some splicing is believed to be silencing • Lots of splicing in cancer • Proteins have clinical implications • Statistical biomarker discovery • Putative malfunctioning proteins
What can be observed? • Known coding SNPs • Novel coding mutations • Alternative splicing isoforms • Microexons (non-canonical splice-sites) • Alternative translation start-sites (codons) • Alternative translation frames • “Dark” open-reading-frames
Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications
Novel Mutation • HUPO Plasma Proteome Project • Pooled samples from 10 male & 10 female healthy Chinese subjects • Plasma/EDTA sample protocol • Li, et al. Proteomics 2005. (Lab 29) • TTR gene • Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis. • Familial amyloidotic polyneuropathy • late-onset, dominant inheritance
Novel Mutation • Ala2→Pro, associated with familial amyloid polyneuropathy
Translation Start-Site • Human erythroleukemia K562 cell-line • Depth of coverage study • Resing et al. Anal. Chem. 2004. • THOC2 gene: • Part of the heteromultimeric THO/TREX complex. • Initially believed to be a “novel” ORF • RefSeq mRNA in Jun 2007, no RefSeq protein • TrEMBL entry Feb 2005, no SwissProt entry • Genbank mRNA in May 2002 (complete CDS) • Plenty of EST support • ~ 100,000 bases upstream of other isoforms
Easily distinguish minor sequence variations • Two B. anthracis Sterne α/β SASP annotations: • RefSeq/Gb: MVMARN... (7441 Da) • CMR: MARN... (7211 Da) • Intact proteins differ by 230 Da (7441 Da vs 7211 Da) • N-terminal tryptic peptides: MVMAR (606.3 Da) and MVMARNR (876.4 Da) vs MARNR (646.3 Da) • Very different MS/MS spectra
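The peptide masses quoted above can be reproduced from standard monoisotopic residue masses: sum the residues and add one water. The short check below assumes monoisotopic masses and lists only the five residues that occur in these peptides; the function name is illustrative.

```python
# Monoisotopic residue masses (Da) for the residues used on this slide.
RESIDUE_MASS = {
    "M": 131.04049, "V": 99.06841, "A": 71.03711,
    "R": 156.10111, "N": 114.04293,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


if __name__ == "__main__":
    for pep in ("MVMAR", "MVMARNR", "MARNR"):
        print(pep, round(peptide_mass(pep), 1))
    # Mass contributed by the two extra N-terminal residues (Met + Val).
    print("MV", round(RESIDUE_MASS["M"] + RESIDUE_MASS["V"], 1))
```

The Met + Val residue sum comes to about 230 Da, which matches the stated difference between the two intact-protein annotations.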
Bacterial Gene-Finding • Find all the open-reading-frames... • …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… • [Figure labels: stop codons marked on the sequence] • ...courtesy of Art Delcher
Bacterial Gene-Finding • Find all the open-reading-frames... • ...but they overlap: which ones are correct? • Reverse strand: …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT… • Forward strand: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… • [Figure labels: stop codons on both strands, including a shifted-frame stop] • ...courtesy of Art Delcher
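A minimal sketch of the ORF scan, run on the 42-base fragment shown on the slide. It assumes one common definition of an ORF (first ATG in a frame through the next in-frame stop) and reports coordinates relative to the fragment, since the flanking sequence is elided on the slide. The function names are illustrative; this is not how Glimmer itself is implemented.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def atg_orfs(dna):
    """(start, end) of ATG...stop ORFs in the three frames of one strand.

    end is exclusive and includes the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                found.append((start, i + 3))
                start = None
    return found


def find_orfs(dna):
    """ORFs on both strands, in forward-strand coordinates."""
    n = len(dna)
    orfs = [("+", s, e) for s, e in atg_orfs(dna)]
    # Map reverse-strand hits back onto forward-strand positions.
    orfs += [("-", n - e, n - s) for s, e in atg_orfs(reverse_complement(dna))]
    return orfs


if __name__ == "__main__":
    fragment = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
    for strand, start, end in find_orfs(fragment):
        print(strand, start, end)
```

On this fragment the scan finds one short ORF on each strand, and the two share forward-strand positions 15 to 18, which is the overlap problem the slide poses.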
Coding-Sequence “Score” ...courtesy of Art Delcher
Glimmer3 Performance • Glimmer3 trained & compared to RefSeq genes with annotated function • Correct STOP: 99.6% • Correct START: 84.3% • “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”
N-terminal peptides • (Protein) N-terminal peptides establish the start-site of known & unexpected ORFs • Use: • Directly to annotate genomes • Evaluate and improve algorithms • Map cross-species
N-terminal peptide workflows • Typical proteomics workflows sample peptides from the proteome “randomly” • Caulobacter crescentus (70%) • 3733 Proteins (RefSeq Genome annot.) • 66K tryptic peptides (600 Da to 3000 Da) • 2085 N-terminal tryptic peptides (3%)
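The counts above (2085 of roughly 66,000 peptides, about 3%) come from an in-silico digest of the annotated proteome. A rough sketch of that calculation follows. It assumes the usual trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and monoisotopic masses; the two example proteins are made up for illustration and are not Caulobacter sequences.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def tryptic_peptides(protein):
    """Cleave after K or R, except before P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def n_terminal_share(proteins, lo=600.0, hi=3000.0):
    """Return (N-terminal peptides in mass range, all peptides in mass range)."""
    n_term = total = 0
    for protein in proteins:
        for idx, pep in enumerate(tryptic_peptides(protein)):
            if lo <= peptide_mass(pep) <= hi:
                total += 1
                n_term += (idx == 0)
    return n_term, total


if __name__ == "__main__":
    toy_proteome = ["MVMARGGGGGGK", "MVMARNRAAAAAAAAK"]  # hypothetical sequences
    print(n_terminal_share(toy_proteome))
```

Because each protein contributes at most one N-terminal peptide but many internal ones, a "random" sampling workflow rarely sees the N-terminus, which is the motivation for an enrichment step.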
N-terminal peptide workflow • 1. Protect protein N-terminus • 2. Digest to peptides • 3. Chemically modify free peptide N-termini • 4. Use the chemical modification to capture unwanted peptides • Nat Biotech, Vol. 21, pp. 566-569, 2003.
Increasing N-terminal peptide coverage • Multiple (digest) enzymes: • trypsin-R: 60% (80%) • acid + lys-C + trypsin: 85% (94%) • Repeated LC-MS/MS • Precursor Exclusion / Inclusion lists • MALDI / ESI • Protein separation and/or orthogonal fractionation • Anal Chem, Vol. 76, pp. 4193-4201, 2004.
Proteomics Informatics • Search spectra against: • Entire bacterial genome; • All Met initiated peptides; or • Statistically likely Met initiated peptides. • Easily consider initial Met loss PTM, too • Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
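One way to build the "all Met initiated peptides" search space is to walk a translated reading frame, start a candidate at every Met, and extend it to the first tryptic cleavage site, adding a second variant with the initial Met removed. The sketch below makes several assumptions that are not from the slides: the frame is already translated, `*` marks a stop codon, cleavage is after the first K or R (the proline rule is ignored), and `min_len` is an arbitrary cutoff.

```python
def met_initiated_peptides(frame, min_len=4):
    """Candidate N-terminal tryptic peptides for every Met in a translated frame.

    frame   -- amino-acid string for one reading frame, '*' for stop codons
    min_len -- hypothetical minimum peptide length worth searching
    """
    candidates = set()
    for i, aa in enumerate(frame):
        if aa != "M":
            continue
        end = i
        # Extend until the first K/R (cleavage site) or a stop codon.
        while end < len(frame) and frame[end] not in "KR*":
            end += 1
        if end < len(frame) and frame[end] in "KR":
            end += 1  # keep the cleavage residue in the peptide
        peptide = frame[i:end]
        # Search both the intact form and the form with initial Met removed.
        for variant in (peptide, peptide[1:]):
            if len(variant) >= min_len:
                candidates.add(variant)
    return sorted(candidates)


if __name__ == "__main__":
    print(met_initiated_peptides("QMVMARNR*"))
```

The resulting list can be written out as a FASTA file and searched with an off-the-shelf engine; restricting it to statistically likely start sites is a filter on which Met positions are visited.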
Other Practical Issues • Suitable for commonly available instrumentation • Only the sample prep. is (somewhat) novel. • Need living organism • Stage of life-cycle? • Bang for buck? • N-terminal peptides / $$$$ • In discussions with JCVI (ex TIGR) • Possible pilot project?
Other Research Projects • Improving peptide identification by MS/MS • Spectral matching using HMMs • Combining search engine results • Spectral matching for detection and quantitation • Microorganism identification using MS • Live public web-site and database • (Inexact) uniqueness guarantees • Primer/Probe oligo design • Pathogen detection (DNA & Peptide) • Significant false-positive peptide identifications
Spectral Matching • Detection vs. identification • Increased sensitivity • No novel peptides • NIST GC/MS Spectral Library • Identifies small molecules • 100,000s of (consensus) spectra • Bundled/Sold with many instruments • “Dot-product” spectral comparison • Current project: Peptide MS/MS
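A "dot-product" comparison treats each spectrum as a vector of binned intensities and scores a query against a library spectrum by the normalized dot product (cosine) of the two vectors. The sketch below is a plain, unweighted version with a 1 m/z bin width; both choices are assumptions for illustration, and production library-search tools typically apply intensity and m/z weighting before comparing.

```python
import math


def binned(spectrum, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    vec = {}
    for mz, intensity in spectrum:
        key = int(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec


def dot_product_score(query, library, bin_width=1.0):
    """Cosine similarity of two spectra given as (m/z, intensity) lists."""
    a, b = binned(query, bin_width), binned(library, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical peak lists.
    query = [(100.2, 1.0), (200.1, 1.0)]
    library = [(100.4, 1.0), (300.0, 1.0)]
    print(dot_product_score(query, library))
```

A score near 1 means the peak patterns agree; a score near 0 means they share almost no intensity in common bins. This is why spectral matching detects known peptides sensitively but cannot find novel ones: there is nothing in the library to compare against.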
Hidden Markov Models for Spectral Matching • Capture statistical variation and consensus in peak intensity • Capture semantics of peaks • Extrapolate model to other peptides • Good specificity with superior sensitivity for peptide detection • Assign 1000’s of additional spectra (w/ p-value < 10⁻⁵)
www.RMIDb.org • Statistics: • 16.7 × 10⁶ (6.4 × 10⁶) protein sequences • ~ 40,000 organisms, ~ 19,700 species • 557 (415) complete genomes • Sources: • TIGR’s CMR, SwissProt, TrEMBL, Genbank Proteins, RefSeq Proteins & Genomes • Inclusive Glimmer3 predictions on Genomes • Pfam and GO assignments using BOINC grid
www.RMIDb.org Accessed from all over the world...
Uniqueness guarantees • 20-mer oligo signatures for B. anthracis • In all available strains as exact match • No (inexact) match to other Bacillus species
Uniqueness guarantees • Human genome primer design problem • “4-unique” DNA 20-mers: • Edit-distance ≥ 5 to any non-specific hybridization site • No such valid loci on Chr. 22! • Currently analyzing entire genome • “3-unique” DNA 20-mers: • Initial experiments suggest ~ 0.01% valid • Approx. 1 valid oligo every 10,000 bases
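The "k-unique" condition can be stated directly in code: an oligo is k-unique if every potential off-target site is at edit distance greater than k from it (so "4-unique" means distance ≥ 5). The sketch below uses a textbook dynamic-programming edit distance and checks against an explicit list of candidate sites. That list is an assumption made for illustration; a genome-scale analysis needs an indexed search rather than pairwise comparison, and the sequences shown are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def is_k_unique(oligo, off_target_sites, k):
    """True if every off-target site is more than k edits away from the oligo."""
    return all(edit_distance(oligo, site) > k for site in off_target_sites)


if __name__ == "__main__":
    oligo = "AAAAAAAAAA"            # hypothetical 10-mer for brevity
    off_targets = ["AAAAACCCCC"]    # hypothetical off-target site, 5 edits away
    print(is_k_unique(oligo, off_targets, 4))
    print(is_k_unique(oligo, off_targets, 5))
```

The stated yield for 3-unique 20-mers is self-consistent: 0.01% of positions is 1 in 10,000, or about one valid oligo every 10,000 bases.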