False-Discovery-Rate Aware Protein Inference by Generalized Protein Parsimony
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Peptide-Spectrum Matches
• Sigma49 – 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human
• Yeast – 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD
• X!Tandem E-value (no refinement), 1% FDR
Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L. 2007.
Traditional Protein Parsimony
• Select the smallest set of proteins that explains all identified peptides.
• Sensible principle, implies:
  • Eliminate equivalent/subset proteins
• Equivalent proteins are problematic:
  • Which one to choose?
• Peptides unique to one protein force that protein into the solution
  • True for most tools, even probability-based ones
  • Bad consequences for FDR-filtered identifications
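A minimal sketch of the traditional parsimony principle, written as an exact minimum set cover found by exhaustive search. The function name `minimal_protein_set`, the protein labels, and the peptide labels are hypothetical; real tools use faster methods, and this brute force is only practical for small components.

```python
from itertools import combinations


def minimal_protein_set(protein_peptides):
    """Return a smallest set of proteins that covers every identified peptide.

    protein_peptides maps a protein accession to the set of identified
    peptides it contains. Subsets are tried in order of increasing size,
    so the first cover found has the minimum number of proteins.
    """
    all_peptides = set().union(*protein_peptides.values())
    proteins = sorted(protein_peptides)
    for size in range(1, len(proteins) + 1):
        for subset in combinations(proteins, size):
            covered = set().union(*(protein_peptides[p] for p in subset))
            if covered == all_peptides:
                return set(subset)
    return set()


# Toy component (hypothetical data): P2 is a subset of P1, and peptide "d"
# occurs only in P3, so P3 is forced into the solution.
example = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
}
selected = minimal_protein_set(example)
print(sorted(selected))
```

In the toy data, P3 enters the solution only because of the single peptide "d", which is the behaviour the slide flags as a problem for FDR-filtered identifications.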
Many proteins are easy
• Eliminate equivalent / dominated proteins
  • Sigma49: 277 → 60 proteins
  • Yeast: 1226 → 1085 proteins
• Many components have a single protein:
  • Sigma49: 52 (3 multi-protein)
  • Yeast: 994 (43 multi-protein)
• "Unique" peptides force protein inclusion
  • Sigma49: 16 single-peptide proteins
  • Yeast: 476 single-peptide proteins
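The two reductions on this slide can be sketched as follows, assuming "equivalent" means identical identified-peptide sets and "dominated" means a peptide set that is a proper subset of another protein's. The function names and toy data are hypothetical, not the talk's software.

```python
from collections import defaultdict


def collapse_proteins(protein_peptides):
    """Group equivalent proteins and drop dominated (proper-subset) ones.

    Returns a dict mapping a tuple of equivalent accessions to their
    shared peptide set.
    """
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)
    peptide_sets = list(groups)
    kept = {}
    for peps in peptide_sets:
        # "<" on frozensets tests for a proper subset.
        if not any(peps < other for other in peptide_sets):
            kept[tuple(sorted(groups[peps]))] = set(peps)
    return kept


def components(protein_peptides):
    """Split proteins into connected components linked by shared peptides."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for pep in peptides:
            peptide_to_proteins[pep].add(protein)
    seen, result = set(), []
    for start in protein_peptides:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            prot = stack.pop()
            comp.add(prot)
            for pep in protein_peptides[prot]:
                for neighbour in peptide_to_proteins[pep]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        result.append(comp)
    return result


# Hypothetical toy data: P1 and P2 are equivalent, P3 is dominated by them,
# P4 shares peptide "c" with the first group, and P5 stands alone.
toy = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b", "c"},
    "P3": {"b", "c"},
    "P4": {"c", "d"},
    "P5": {"e"},
}
collapsed = collapse_proteins(toy)
parts = components(toy)
```

Here `collapsed` keeps three entries (the P1/P2 group, P4, and P5), and `parts` contains one four-protein component and one single-protein component.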
Must eliminate redundancy
• Contained proteins should not be selected
[Figure label: 37 distinct peptides]
Must eliminate redundancy
• Contained proteins should not be selected
  • Even if they have some probability mass
• The number of sibling peptides matters less if they are shared.
[Figure: probabilities 1.0, 1.0, 0.8, 0.7, 0.0, 1.0; annotation "Single AA Difference"]
Must ignore some PSMs
• A single additional peptide should not force a protein into the solution
[Figure: probabilities 1.0, 0.0, 0.0, 0.0, 0.0, 1.0; annotation "Single AA Difference"]
Example from Yeast
• "Inosine monophosphate dehydrogenase"
  • 4-gene family
• Contained proteins should not be selected
• Single-peptide evidence for YML056C
[Figure: probabilities 1.0, 0.6, 0.0, 1.0]
Must ignore some PSMs
• Improving peptide identification sensitivity makes things worse!
• False PSMs don't cluster
[Figure labels: PSMs; 2x Proteins; 10%]
Must ignore some PSMs
• Improving peptide identification sensitivity makes things worse!
• False PSMs don't cluster
[Figure labels: PSMs; Select Proteins to Explain True PSM%; 90%]
Must ignore some PSMs
• How do we choose?
  • Maximize # peptides?
  • Minimize FDR (naïve model)?
  • Maximize # PSMs?
Generalized Protein Parsimony
• Weight peptides by number of PSMs
• Constrain unique peptides per protein
• Maximize explained peptides (PSMs)
• Match PSM filtering FDR to % uncovered PSMs
• Readily solved by branch-and-bound
• Permits complex protein/peptide constraints
• Reduces to traditional protein parsimony
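One possible reading of the bullets above, sketched under stated assumptions: each peptide carries a weight equal to its PSM count, a protein may be selected only if at least `min_unique` of its peptides are covered by no other selected protein, and the objective is to maximize the total weight of explained peptides, preferring fewer proteins on ties. Exhaustive search stands in for branch-and-bound here, and all names and data are hypothetical rather than the talk's actual formulation.

```python
from itertools import combinations


def unique_count(protein, selected, protein_peptides):
    """Count peptides of `protein` found in no other selected protein."""
    others = set()
    for q in selected:
        if q != protein:
            others |= protein_peptides[q]
    return len(protein_peptides[protein] - others)


def generalized_parsimony(protein_peptides, psm_counts, min_unique=2):
    """Pick proteins maximizing explained PSMs under a unique-peptide constraint.

    Returns (selected proteins, fraction of PSMs left unexplained).
    """
    proteins = sorted(protein_peptides)
    best, best_weight = (), 0
    for size in range(1, len(proteins) + 1):
        for subset in combinations(proteins, size):
            if not all(
                unique_count(p, subset, protein_peptides) >= min_unique
                for p in subset
            ):
                continue
            covered = set().union(*(protein_peptides[p] for p in subset))
            weight = sum(psm_counts[pep] for pep in covered)
            # Strict ">" keeps the smaller subset when weights tie, because
            # subsets are visited in order of increasing size.
            if weight > best_weight:
                best, best_weight = subset, weight
    total = sum(psm_counts.values())
    uncovered = 1 - best_weight / total if total else 0.0
    return set(best), uncovered


# Hypothetical toy component: 14 PSMs spread over 5 peptides.
toy_proteins = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c", "d"},
    "P3": {"e"},
}
toy_psms = {"a": 5, "b": 3, "c": 4, "d": 1, "e": 1}

strict_selection, strict_uncovered = generalized_parsimony(
    toy_proteins, toy_psms, min_unique=2
)
loose_selection, loose_uncovered = generalized_parsimony(
    toy_proteins, toy_psms, min_unique=1
)
```

With `min_unique=2`, only P1 is selected and 2 of 14 PSMs stay unexplained; with `min_unique=1`, all three proteins are selected and every PSM is explained, which matches the traditional parsimony answer for this toy component.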
Match FDR to uncovered PSMs
Traditional Parsimony at 1% FDR: 1085 (609 2+-Unique) Proteins
Software
• Filter multi-acquisition identifications by:
  • FDR, E-value, probability
• Rewrite PSMs to reflect parsimony analysis
  • PepXML, CSV, Excel
• Component-wise Peptide-Protein matrix:
  • Selected, Dominant, Equivalent, Contained
• Selected protein accessions:
  • …plus equivalents
Conclusions
• Many components are clear
  • Doesn't matter what technique is used
• Traditional techniques do not handle the second protein in a component well
  • A single additional peptide should not force a protein into the solution
• Explain only the true PSM %:
  • Determine protein criteria first
  • Adjust PSM filter until explained peptides match
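The last conclusion can be read as a simple search: fix the protein criteria, then try PSM filter levels until the fraction of PSMs left unexplained is close to the filter's FDR. A minimal sketch under that assumption follows; `match_fdr_to_uncovered` and the lookup table are hypothetical, and in practice the uncovered fraction at each level would come from rerunning the protein selection on the refiltered PSMs.

```python
def match_fdr_to_uncovered(candidate_fdrs, uncovered_fraction):
    """Pick the PSM-filter FDR whose uncovered-PSM fraction is closest to it.

    candidate_fdrs: iterable of FDR levels to try (e.g. 0.01, 0.05, ...).
    uncovered_fraction: callable mapping an FDR level to the fraction of
        PSMs left unexplained by the selected proteins at that level.
    """
    return min(candidate_fdrs, key=lambda fdr: abs(uncovered_fraction(fdr) - fdr))


# Hypothetical numbers for illustration only, not results from the talk.
toy_uncovered = {0.01: 0.0, 0.05: 0.02, 0.10: 0.095, 0.20: 0.13}
chosen_fdr = match_fdr_to_uncovered(sorted(toy_uncovered), toy_uncovered.get)
```

In this toy table the 10% level is chosen, since its uncovered fraction (9.5%) is the closest match to the filter FDR.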