
Proteogenomic Novelty in 105 TCGA Breast Tumors

Proteogenomic Novelty in 105 TCGA Breast Tumors. Karl Clauser, CPTAC Breast Cancer Analysis Group (Broad Institute of MIT and Harvard; Fred Hutchinson Cancer Research Center; Washington University; New York University). CPTAC Data Jamboree, April 16, 2014, National Institutes of Health.


Presentation Transcript


  1. Proteogenomic Novelty in 105 TCGA Breast Tumors. Karl Clauser, CPTAC Breast Cancer Analysis Group (Broad Institute of MIT and Harvard; Fred Hutchinson Cancer Research Center; Washington University; New York University). CPTAC Data Jamboree, April 16, 2014, National Institutes of Health, Bethesda, Maryland

  2. Tumor-specific protein databases for MS/MS-spectra searches. Kelly Ruggles, David Fenyo, NYU

  3. QUILTS: Treatment of different variant types [Diagram; layout not recoverable from the transcript] • Variant categories: Variants, Unannotated Alternative Splicing, Partially Novel Splicing, Completely Novel Expression, Fusion Genes • Database labels: variants db, alternates, frameshifts db, other db • Translation modes: 1-frame or 6-frame translation; for partially novel splicing, novel downstream: 1-frame translation, novel upstream: 6-frame translation
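The difference between 1-frame and 6-frame translation named on the slide can be illustrated with a minimal sketch. This is not QUILTS code; the function names are hypothetical, and it assumes the standard genetic code and an unambiguous ACGT alphabet.

```python
# Sketch only: standard genetic code, DNA alphabet ACGT, no ambiguity codes.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate a single reading frame; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def reverse_complement(dna):
    """Return the reverse-complement strand."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frame(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + [translate(rc, f) for f in range(3)]
```

For the toy input `"ATGGCCTAA"`, `translate` in frame 0 gives `"MA*"`, and the first reverse-strand frame from `six_frame` is `"LGH"`. A 1-frame translation suffices when the reading frame is known from an annotated upstream exon; 6-frame translation covers the case where neither strand nor frame is known.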

  4. Proteogenomic mapping: Genetic alterations can be observed on protein level (105 tumors; work in progress) • Low thresholds applied to genome calls (>1 RNA-seq read; variant QUAL score >2, phred-scaled) • High thresholds applied to proteome calls (<0.1% FDR) • 0.2-2.7% of frameshifts, alternative splices & single AA variants are observable by proteomics • mRNA may not be translated, or may be translated only at low abundance • Proteome coverage is incomplete
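The two-sided thresholding on this slide (permissive on the genome side, strict on the proteome side) can be sketched as a pair of filters. The `GenomeCall` record and both function names are hypothetical, and the thresholds are taken literally from the slide; note that other slides report observed events with a maximum of 1 RNA-seq read, so ">1 read" may in practice have meant "at least 1".

```python
from dataclasses import dataclass


@dataclass
class GenomeCall:
    kind: str          # "variant", "splice" or "frameshift"
    qual: float = 0.0  # phred-scaled QUAL score (used for variants)
    reads: int = 0     # supporting RNA-seq reads (used for splices, frameshifts)


def passes_genome_threshold(call):
    """Low-confidence genome filter, read literally from the slide."""
    if call.kind == "variant":
        return call.qual > 2
    return call.reads > 1


def passes_proteome_threshold(q_value):
    """High-confidence proteome filter: peptide-spectrum-match FDR below 0.1%."""
    return q_value < 0.001
```

The asymmetry is deliberate in this kind of design: a permissive genome filter keeps the candidate database broad, while the strict proteome filter controls false identifications against that enlarged search space.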

  5. Global proteome and phosphoproteome discovery workflow for TCGA breast tumors 1 mg total protein per tumor Internal reference: equal representation of basal, Her2 and Luminal A/B subtypes

  6. Serial Search Strategy with Personalized Databases • Spectra: 25,776,160 (105 patients; 36 iTRAQ experiments; 25 LC-MS/MS runs / experiment); 11,328,955 matched spectra (44% of total, 1% FDR); 14,447,205 leftover spectra • Database entries: RefSeq-Hs-7/2013: 31,852; Variants: 133,241; Alternate spliceforms: 67,853; Frameshifts: 19,944; Concatenated: 252,890 • Personalized databases: concatenated FASTA files, 105 patients; altered proteins; redundant entries removed • Matches: 3247 variants; 197 splice junctions; 22 truncation overlaps; 11 insertion overlaps; 49 deletion junctions • Low confidence thresholds for genome calls: variants >2 QUAL score (phred-scaled); alternative splices and frameshifts >1 read • High confidence for proteome IDs: <0.1% FDR peptide spectrum match • Example FASTA entries: >Canonical Protein SIGNALINGPATHWAYREGULATOR; >Canonical – Variant Patient 1 SIGNALINGPATHWAHREGULATOR; >Canonical – Variant Patient 2 SIKNALINGPATHWAYREGULATOR; >Canonical – Alternate splice Patient 1 SIGNALINGREGULATOR; >Canonical – Alternate splice Patient 2 SIGNALINGPATHREGULATOR; >Canonical – Truncation Patient 1 SIGNALINGPATFRAMESHIF; >Canonical – Novel Exon Insert Patient 2 SIGNALINGPATHWAYINSERTREGULATOR; >Canonical – Partial Exon Deletion Patient 3 SIGNALINGPATHWAYULATOR
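The "concatenate, then remove redundant entries" step and the counts on this slide can be checked with a short sketch. The `merge_fasta` helper and the fourth record are hypothetical; the slide does not say how redundancy was defined, so exact sequence identity is assumed here. The numeric constants are the ones quoted on the slide.

```python
def merge_fasta(records):
    """Concatenate (header, sequence) records, keeping the first header per distinct sequence."""
    seen = {}
    for header, seq in records:
        seen.setdefault(seq, header)
    return [(header, seq) for seq, header in seen.items()]


records = [
    ("Canonical Protein", "SIGNALINGPATHWAYREGULATOR"),
    ("Canonical - Variant Patient 1", "SIGNALINGPATHWAHREGULATOR"),
    ("Canonical - Variant Patient 2", "SIKNALINGPATHWAYREGULATOR"),
    # Hypothetical extra patient carrying the same variant as Patient 1.
    ("Canonical - Variant Patient 3", "SIGNALINGPATHWAHREGULATOR"),
]
merged = merge_fasta(records)

# Entry counts quoted on the slide for each component database.
db_sizes = {
    "RefSeq-Hs-7/2013": 31_852,
    "variants": 133_241,
    "alternate spliceforms": 67_853,
    "frameshifts": 19_944,
}
total_entries = sum(db_sizes.values())

# Spectrum counts quoted on the slide.
matched, leftover = 11_328_955, 14_447_205
matched_fraction = matched / (matched + leftover)
```

With these inputs the four toy records collapse to three entries, the component sizes sum to the slide's concatenated total of 252,890, and matched plus leftover spectra equal the 25,776,160 total, with about 44% matched.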

  7. Frequency of Single AA Variants, Alternative Splices, Frameshifts Across Patients (Genome & Transcriptome Data) • Somatic variants are less frequent than germline variants • Some germline variants are very common • Rare germline variants present in RefSeq • Some alternative splice forms and frameshifts are very common • Should be in RefSeq

  8. How many RNA-seq reads to yield a proteomics observation of an alternate splice or frameshift? • 1 experiment: 3 individual patients + 1 Common control (40 patients) • [Two plots: 197 Alternative splices, 82 Frameshifts; label: Max # Reads; annotations: "17 observed in >1 Expmt", "19 observed in >1 Expmt"]

  9. Frameshift Truncation: ras-Related protein Rab-15 • Observed only in Proteomics Exp 3 • E159 • Max RNA-Seq Reads: 1 • Present in only 1 Common control member

  10. Frameshift Truncation: Cysteine-rich protein 1 • Observed in 9 Proteomics Experiments • E159 • Max RNA-Seq Reads: 1 • Present in only 1 Common control member

  11. Frameshift Truncation: Cullin-2 isoform a • Observed in 3 Proteomics Experiments • E159 • Max RNA-Seq Reads: 1 • Present in only 1 Common control member

  12. Many missing observations even when transcript present in many common control members 1 experiment: 3 individual patients + 1 Common control (40 patients) Alternative splices Frameshifts

  13. Majority of Alternative Splice Junctions and Frameshifts observed in >1 Proteomics Experiment • 1 experiment: 3 individual patients + 1 Common control (40 patients) • [Pie charts] Alternative splices: 150/197 observed in >1 experiment • Frameshifts: 44/82 observed in >1 experiment
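The "majority" claim follows directly from the two fractions on the slide; a small calculation makes the percentages explicit. The dictionary names are illustrative only.

```python
# (observed in >1 experiment, total matched) as quoted on the slide.
observed = {"alternative splices": (150, 197), "frameshifts": (44, 82)}

# Percentage of each category seen in more than one proteomics experiment.
percent = {name: round(100 * n / total, 1) for name, (n, total) in observed.items()}
```

This gives 76.1% for alternative splices and 53.7% for frameshifts, so both exceed one half, though the frameshift majority is narrow.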

  14. Next steps: • Examine “other” category: fusion genes (junction-spanning); novel exon splicing (2 sides); completely novel gene • Use updated somatic variants from QUILTS • Define genomic data thresholds suitable for proteomic observations: RNA-seq min read count; variant calling phred-scaled QUAL score • Sort out Germline/Somatic variant call mix status across patients

  15. Summary of Proteome Re-processing: 105 TCGA patients, 36 iTRAQ experiments

  16. Changes in Re-processing of TCGA data • Extraction: for centroiding, use Xcalibur instead of SM; iTRAQ ratios are little changed, intensities are lower by ~5x (will more closely match the NIST central analysis pipeline); precursor MH+ range expanded from 750-4000 to 750-6000 • Searches: replace the database with the RefSeq version used as reference for the personalized database generation (database content/size very similar; protein identifiers change from gi numbers to RefSeq numbers); allowed modifications will be expanded, which increases the # of identified spectra by ~10% (from Full iTRAQ, M-ox, N-deam, q-pyro to iTRAQ-Full-Lys-only, M-ox, N-deam, q-pyro, c-pyro, Ac-nTermProt) • Autovalidation: proteome initial processing gave a peptide FDR per experiment of 1.1-1.4%, but an overall peptide FDR across all 36 experiments of ~5.5%; phosphoproteome initial processing gave a peptide FDR per experiment of 1.6-2.1%, but an overall peptide FDR across all 36 experiments of ~7.2%; changes will seek to bring the overall peptide FDRs down to ~1% by requiring multiple observations (protein, P-site) across experiments and raising score thresholds • Quantitation: will use PIP (precursor ion purity) filtering to exclude spectra from quantitation but not identification; requiring PIP > 50% excludes ~7.8% of spectra; filtering reduces standard deviations on protein- and phosphosite-level iTRAQ ratios
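Why the overall peptide FDR can be several times the per-experiment FDR is easiest to see with a toy target-decoy calculation. Everything below is synthetic and hypothetical (three experiments, made-up peptide names); it assumes that true peptides recur across experiments while false matches are essentially random and therefore mostly distinct, which is the usual explanation but is not spelled out on the slide.

```python
# Synthetic data: 990 true peptides seen in every experiment,
# plus 10 false target hits and 10 decoy hits unique to each experiment.
shared_true = {f"TRUE{i}" for i in range(990)}

experiments = []
for e in range(3):
    false_targets = {f"FALSE_e{e}_{i}" for i in range(10)}
    decoys = {f"DECOY_e{e}_{i}" for i in range(10)}
    experiments.append((shared_true | false_targets, decoys))


def estimated_fdr(targets, decoys):
    """Target-decoy estimate: distinct decoy peptides over distinct target peptides."""
    return len(decoys) / len(targets)


per_experiment = [estimated_fdr(t, d) for t, d in experiments]

# Pool distinct peptides across experiments: true hits collapse, false hits accumulate.
all_targets = set().union(*(t for t, _ in experiments))
all_decoys = set().union(*(d for _, d in experiments))
overall = estimated_fdr(all_targets, all_decoys)
```

Each toy experiment sits at 1% FDR, yet the pooled list is at roughly 2.9% with only three experiments. Requiring a peptide to be observed in more than one experiment removes most of the accumulated false hits, which is the direction of the changes the slide proposes.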

  17. Y Chromosome Frameshift: CD99 antigen • Observed in 36 Proteomics Experiments • E159 • Partial exon deletion splice, plus frameshift truncation • Max RNA-Seq Reads: 12 • Transcript present in 18/40 Common Control Members

  18. Acknowledgments • Washington U./MD Anderson/NYU • Sherri Davies • Matthew Ellis • David Fenyo • Kelly Ruggles • Reid Townsend • Li Ding • Broad Institute/FHCRC • Steve Carr • Karl Clauser • Michael Gillette • Jana Qiao • Philipp Mertins • DR Mani • Eric Kuhn • Sue Abbatiello • Amanda Paulovich • Pei Wang • Sean Wang • Ping Yan • NCI Staff • Emily Boja • Mehdi Mesri • Rob Rivers • Chris Kinsinger • Henry Rodriguez Funding • National Cancer Institute

  19. Single AA Variants may be Somatic in Some Patients, Germline in Others 81 Patients Nov 2013 Genomic • Highly Interesting, should correlate with prognosis and/or subtype. • May correlate with prognosis? • Might as well be canonical isoforms? • Detectable, but too rare to indicate biology. Proteomic • G&S mix genomic variants have the highest observation rate by Proteomics. • Genomic variants present in only a single patient are observable by Proteomics

  20. Not all Germline & Somatic mix Single AA Variants are “Essentially” Germline 81 Patients Nov 2013 Proteomic Genomic • Is G&S mix status primarily an artifact of variant calling accuracy/sensitivity? • Is there some cancer biology involved for high S/G ratio variants? • Are patients with germline form more cancer prone? • Does somatic form correlate with prognosis, development of drug-resistance?

  21. Wide Range of Somatic Single AA Variants/Patient Skip • Low confidence thresholds applied to calls • Variants: >2 QUAL score (phred-scaled) • Alternative splices: >1 read
