Protein grouping in mzIdentML
ProteinDetectionList
  ProteinAmbiguityGroup id="PAG1"
    ProteinDetectionHypothesis id="PDH1" dbseq_ref="dbseq_Q05421|CP2E1_MOUSE" (anchor protein)
    ProteinDetectionHypothesis id="PDH2" dbseq_ref="dbseq_Q05423|CP2E2_MOUSE" (sequence same-set)
    ProteinDetectionHypothesis id="PDH3" dbseq_ref="dbseq_Q05312|CP2F1_MOUSE" (sequence subset)
  ProteinAmbiguityGroup id="PAG2"
    ....
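The nesting above can be walked programmatically. The following is a minimal sketch, assuming a simplified, namespace-free XML fragment that reuses the slide's shorthand attribute `dbseq_ref` and attaches each relationship as a `cvParam` child. Real mzIdentML files declare an XML namespace and carry more attributes, and the function name `list_groups` is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified fragment mirroring the slide. Assumptions: no XML namespace,
# the slide's shorthand attribute name "dbseq_ref", one cvParam per hypothesis.
SNIPPET = """
<ProteinDetectionList>
  <ProteinAmbiguityGroup id="PAG1">
    <ProteinDetectionHypothesis id="PDH1" dbseq_ref="dbseq_Q05421|CP2E1_MOUSE">
      <cvParam accession="MS:1001591" name="anchor protein"/>
    </ProteinDetectionHypothesis>
    <ProteinDetectionHypothesis id="PDH2" dbseq_ref="dbseq_Q05423|CP2E2_MOUSE">
      <cvParam accession="MS:1001594" name="sequence same-set protein"/>
    </ProteinDetectionHypothesis>
    <ProteinDetectionHypothesis id="PDH3" dbseq_ref="dbseq_Q05312|CP2F1_MOUSE">
      <cvParam accession="MS:1001596" name="sequence sub-set protein"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
</ProteinDetectionList>
"""


def list_groups(xml_text):
    """Return {PAG id: [(PDH id, dbseq_ref, [cvParam names]), ...]}."""
    root = ET.fromstring(xml_text.strip())
    groups = {}
    for pag in root.iter("ProteinAmbiguityGroup"):
        members = []
        for pdh in pag.iter("ProteinDetectionHypothesis"):
            roles = [cv.get("name") for cv in pdh.findall("cvParam")]
            members.append((pdh.get("id"), pdh.get("dbseq_ref"), roles))
        groups[pag.get("id")] = members
    return groups


if __name__ == "__main__":
    for pag_id, members in list_groups(SNIPPET).items():
        print(pag_id)
        for pdh_id, ref, roles in members:
            print("  ", pdh_id, ref, roles)
```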
Existing CV terms for ProteinDetectionHypothesis

id: MS:1001591
name: anchor protein
def: "A representative protein selected from a set of sequence same-set or spectrum same-set proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001592
name: family member protein
def: "A protein with significant homology to another protein, but some distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001593
name: group member with undefined relationship OR ortholog protein
def: "TO ENDETAIL: a really generic relationship OR ortholog protein." [PSI:MS]
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001594
name: sequence same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to an identical set of peptide sequences." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001595
name: spectrum same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
Existing CV terms for ProteinDetectionHypothesis (cont.)

id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set of the peptide sequence matches for another protein, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001597
name: spectrum sub-set protein
def: "A protein with a sub-set of the matched spectra for another protein, where the matches cannot be distinguished using the evidence in the mass spectra, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001598
name: sequence subsumable protein
def: "A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001599
name: spectrum subsumable protein
def: "A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship
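Term stanzas of this shape are plain `tag: value` lines, so they can be read into dictionaries with a few lines of code. This is a minimal sketch, assuming each stanza starts at an `id:` line as on the slides (full OBO files also contain `[Term]` headers and a file header, which this ignores). The names `parse_terms` and `children_of` are illustrative, and the sample text is abbreviated to three tags per term.

```python
# Two stanzas copied from the slides, abbreviated to id, name and is_a.
STANZAS = """\
id: MS:1001596
name: sequence sub-set protein
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001597
name: spectrum sub-set protein
is_a: MS:1001101 ! protein group or subset relationship
"""


def parse_terms(text):
    """Split 'tag: value' lines into one dict per term; a new 'id' starts a term."""
    terms = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # not a tag/value line (e.g. a [Term] header)
        if tag == "id":
            terms.append({})
        if not terms:
            continue  # ignore anything before the first id line
        terms[-1].setdefault(tag, []).append(value)
    return terms


def children_of(terms, parent_accession):
    """Names of terms whose is_a line points at the given parent accession."""
    return [
        t["name"][0]
        for t in terms
        if any(v.split(" ! ")[0] == parent_accession for v in t.get("is_a", []))
    ]


if __name__ == "__main__":
    print(children_of(parse_terms(STANZAS), "MS:1001101"))
```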
Problems
• No requirement for any exporter to use the terms (their use is only “MAY”)
• “anchor protein” doesn’t capture the intended role and isn’t used consistently
• No definition of what should be put in the value slot of CV terms, e.g.:
    id: MS:1001596
    name: sequence sub-set protein
    def: "A protein with a sub-set ...." [PSI:MS]
    xref: value-type:xsd\:string "The allowed value-type for this CV term."
    is_a: MS:1001101 ! protein group or subset relationship
  • Could be the PDH identifier, accession or DBSequence identifier of the group representative, or of any other protein that is a super-set of this protein
  • Or anything else, for that matter
• What does passThreshold="true" on a PDH mean?
• Unclear how to count the number of identified proteins in an mzIdentML file
  • Count PAGs or count PDHs?
• No terms for the protocol describing how inference has been done or how to interpret results
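The counting ambiguity is easy to see on the example group from the earlier slide. The sketch below uses a hypothetical in-memory view of that group; the `passThreshold` values are assumed for illustration, since the slides do not give them.

```python
# PAG id -> list of (PDH id, passThreshold) pairs.
# The three passThreshold values are assumptions, not taken from the slides.
GROUPS = {
    "PAG1": [("PDH1", True), ("PDH2", True), ("PDH3", True)],
}


def count_pags(groups):
    """Count groups that contain at least one hypothesis passing the threshold."""
    return sum(1 for members in groups.values() if any(ok for _, ok in members))


def count_pdhs(groups):
    """Count every individual hypothesis that passes the threshold."""
    return sum(1 for members in groups.values() for _, ok in members if ok)


if __name__ == "__main__":
    print("by PAG:", count_pags(GROUPS))
    print("by PDH:", count_pdhs(GROUPS))
```

With these assumed values the same file reports one protein when counted by group and three when counted by hypothesis, which is why a reader cannot compare totals across exporters without an agreed rule.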
Proposed work group outcomes
• Attach CV terms to <ProteinDetectionProtocol> describing how protein inference has been done
  • Still under discussion, since these effectively describe parts of the algorithm used
• Exactly one mandatory “representative protein” (new name for “anchor protein”) MUST be present per group, on a PDH
  • To be checked by the semantic validator
• ProteinDetectionList MUST have a CV term “number of identified proteins” (count the PAGs that have a “representative protein” PDH with passThreshold="true")
• Each PDH SHOULD be flagged with one term from a group stating whether it is “representative protein”, “sequence|spectrum same-set”, “sequence|spectrum subset”, “sequence|spectrum subsumed” or “marginally distinguished” (i.e. not strictly any of these, but not enough evidence to be a group representative)
  • The value slot of these terms SHOULD contain a comma-separated list of super-set or same-set (as appropriate) PDH IDs
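The proposed rules can be sketched as three small checks: one representative per group, the count of identified proteins, and the comma-separated value slot. This is an illustration under stated assumptions, not the semantic validator itself. The `Hypothesis` class, the function names, and the groups PAG2 and PAG3 are hypothetical; only PAG1 follows the example from the slides.

```python
from dataclasses import dataclass

REPRESENTATIVE = "representative protein"


@dataclass
class Hypothesis:
    pdh_id: str
    role: str             # one relationship term per PDH, as proposed
    pass_threshold: bool
    value: str = ""       # value slot: comma-separated related PDH ids


def related_ids(hypothesis):
    """Split the value slot into a list of PDH ids."""
    return [part.strip() for part in hypothesis.value.split(",") if part.strip()]


def groups_violating_rule(groups):
    """PAG ids that do not carry exactly one representative protein."""
    return [
        pag_id
        for pag_id, members in groups.items()
        if sum(1 for h in members if h.role == REPRESENTATIVE) != 1
    ]


def number_of_identified_proteins(groups):
    """Count PAGs whose representative PDH has passThreshold true."""
    return sum(
        1
        for members in groups.values()
        if any(h.role == REPRESENTATIVE and h.pass_threshold for h in members)
    )


# PAG1 mirrors the slide example; PAG2 and PAG3 are invented test cases.
GROUPS = {
    "PAG1": [
        Hypothesis("PDH1", REPRESENTATIVE, True),
        Hypothesis("PDH2", "sequence same-set", True, "PDH1"),
        Hypothesis("PDH3", "sequence subset", True, "PDH1,PDH2"),
    ],
    "PAG2": [Hypothesis("PDH4", REPRESENTATIVE, False)],
    "PAG3": [
        Hypothesis("PDH5", "sequence same-set", True, "PDH6"),
        Hypothesis("PDH6", "sequence same-set", True, "PDH5"),
    ],
}

if __name__ == "__main__":
    print("identified proteins:", number_of_identified_proteins(GROUPS))
    print("groups breaking the MUST rule:", groups_violating_rule(GROUPS))
    print("PDH3 relates to:", related_ids(GROUPS["PAG1"][2]))
```

In this sketch PAG2 is not counted because its representative fails the threshold, and PAG3 is flagged because it has no representative at all.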
Table 1: New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
Table 2: New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
Unresolved issues
• Are the protocol terms necessary / sensible / overkill?
• Is there general consensus on the idea that the number of identified proteins MUST be reported
  • and must equal the count of PAGs with PDH passThreshold="true"?
• Is it sensible to have SHOULD rules on all subset/same-sets?
• Extra terms for relationships between protein sequences
  • Probably these will be removed
• Mechanism for updating the mzIdentML specifications and validation software
  • Minor update + submission to shortened PSI process?