De Novo Peptide Sequencing via Probabilistic Network Modeling: PepNovo
Peptide Fragmentation [Figure: the sequence N A C F E T P G R C drawn before and after Collision-Induced Dissociation (CID), with the two resulting fragments labeled by masses PM-M and M.]
Peptide Fragmentation • A peptide of mass PM that fragments into a prefix of mass m and a suffix of mass PM-m can produce several different fragment ions. • The intensities at the expected offsets from mass m are used to create an intensity vector.
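As a rough illustration of how such an intensity vector might be collected, the sketch below looks up the strongest peak near the expected m/z of two singly charged ion types. It assumes PM is the sum of the residue masses, that a b-ion appears at m plus one proton and a y-ion at PM-m plus water plus one proton, and that a 0.5 Da tolerance is acceptable. The function names, the tolerance, and the example peaks are hypothetical and not taken from the slides.

```python
from typing import Dict, List, Tuple

PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def expected_mz(m: float, pm: float) -> Dict[str, float]:
    """Expected m/z of singly charged b and y ions for a cleavage at prefix mass m.

    Assumption: pm is the sum of residue masses, so the suffix residues weigh pm - m.
    """
    return {"b": m + PROTON, "y": (pm - m) + WATER + PROTON}


def intensity_vector(peaks: List[Tuple[float, float]], m: float, pm: float,
                     tol: float = 0.5) -> Dict[str, float]:
    """For each ion type, take the strongest peak within tol of its expected m/z (0.0 if none)."""
    vec = {}
    for ion, mz in expected_mz(m, pm).items():
        near = [inten for peak_mz, inten in peaks if abs(peak_mz - mz) <= tol]
        vec[ion] = max(near, default=0.0)
    return vec


# Hypothetical spectrum as (m/z, intensity) pairs; values are made up for illustration.
peaks = [(100.08, 12.0), (90.05, 30.0), (150.0, 5.0)]
vec = intensity_vector(peaks, m=99.06841, pm=170.10552)
print(vec)
```

A full model would cover the other ion types listed later in the deck (a, b2, y2 and the neutral losses) in the same way, one offset per ion type.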
Scoring for De Novo Sequencing • All masses in the spectrum range can be considered putative cleavage sites. • Given the observed intensities, how do we evaluate whether mass m is a cleavage site? • A common statistical tool used by many scoring functions is the likelihood ratio test (Dancik et al. '99, Havilio et al. '03, ...).
Dancik et al. '99 – Hypotheses • The main concept: give a premium for present peaks and a penalty for missing peaks. • Fragmentation hypothesis: uses a probability table for the fragment ions. • Random hypothesis: P_R, the probability of observing a random peak (~0.1).
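Scoring a Cleavage Site (Dancik '99) • Out of k possible ions for a cleavage at m, t are detected (w.l.o.g. fragments 1,..,t) and k-t are missing (t+1,..,k). • Score using a log ratio test, with the probability of cleavage site m according to the Fragmentation hypothesis in the numerator and its probability according to the Random hypothesis in the denominator:
$$\mathrm{Score}(m) = \log \frac{\prod_{i=1}^{t} p_i \, \prod_{i=t+1}^{k} (1 - p_i)}{P_R^{\,t} \, (1 - P_R)^{k-t}}$$
Here $p_i$ is the table probability of observing fragment i under the Fragmentation hypothesis and $P_R$ is the probability of observing a random peak.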
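The log ratio test can be sketched in a few lines. This is a minimal illustration, assuming each possible ion i has a table probability p_i of being observed under the Fragmentation hypothesis and every peak has the same probability P_R under the Random hypothesis. The function name and the example probabilities are hypothetical. Only the value P_R ≈ 0.1 comes from the slides.

```python
import math
from typing import Sequence


def dancik_score(p_detected: Sequence[float], p_missing: Sequence[float],
                 p_random: float = 0.1) -> float:
    """Log likelihood ratio of the Fragmentation hypothesis to the Random hypothesis.

    p_detected: table probabilities of the t ions whose peaks were observed.
    p_missing:  table probabilities of the k-t ions whose peaks are absent.
    """
    t, missing = len(p_detected), len(p_missing)
    log_frag = (sum(math.log(p) for p in p_detected)
                + sum(math.log(1.0 - p) for p in p_missing))
    log_rand = t * math.log(p_random) + missing * math.log(1.0 - p_random)
    return log_frag - log_rand


# Hypothetical table probabilities for three ion types (made-up numbers).
score_two_hits = dancik_score(p_detected=[0.7, 0.5], p_missing=[0.2])
score_no_hits = dancik_score(p_detected=[], p_missing=[0.7, 0.5, 0.2])
print(score_two_hits, score_no_hits)
```

With these made-up numbers, detecting two likely ions gives a positive score and detecting none gives a negative one, which is the premium and penalty behavior described above.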
PepNovo Scoring • PepNovo implements a similar likelihood ratio test mechanism. • It can be viewed as extending the scoring model of Dancik et al. '99. • It includes several factors that are not sufficiently addressed in current scoring functions.
Enhancements to Dancik et al. ('99) • Several intensity values. • Combinations of fragment ions. • Incorporation of additional chemical knowledge (e.g., preferred cleavage sites). • Positional influence of the cleavage site. • Improved random model.
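HCID - Fragmentation Network [Figure: probabilistic network over the fragment ions y, b, y2, a, b2, y-H2O, a-H2O, b-H2O, y-NH3, a-NH3, b-NH3, b-H2O-H2O, y-H2O-NH3, y-H2O-H2O, b-H2O-NH3, together with the nodes N-aa (N-terminal amino acid), C-aa (C-terminal amino acid) and pos(m) (region in peptide). Three kinds of dependency are marked: amino acid influence, ion combinations, positional influence.]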
Discrete Intensity Values • Peak intensity is normalized according to the grass level (the average of the weakest 33% of peaks in the spectrum). • Normalized intensities are discretized into 4 intensity levels: • zero : I < 0.05 • low : 0.05 ≤ I < 2 (62% of peaks) • medium : 2 ≤ I < 10 (26% of peaks) • high : I ≥ 10 (12% of peaks)
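The normalization and the four thresholds translate directly into code. The sketch below is only an illustration: the function names are hypothetical, and "weakest 33%" is approximated by the weakest third of the peaks, rounded down with a minimum of one peak.

```python
from typing import List


def grass_level(intensities: List[float]) -> float:
    """Average intensity of the weakest third of the peaks (assumed reading of 'weakest 33%')."""
    ordered = sorted(intensities)
    n_weak = max(1, len(ordered) // 3)
    return sum(ordered[:n_weak]) / n_weak


def discretize(normalized: float) -> str:
    """Map a grass-normalized intensity to one of the four levels from the slide."""
    if normalized < 0.05:
        return "zero"
    if normalized < 2:
        return "low"
    if normalized < 10:
        return "medium"
    return "high"


# Made-up raw intensities for illustration.
raw = [1.0, 1.0, 1.0, 2.0, 4.0, 30.0]
grass = grass_level(raw)
levels = [discretize(x / grass) for x in raw]
print(grass, levels)
```

A missing peak has normalized intensity 0 and therefore falls into the zero level.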
Combinations of Fragments [Figure: fragmentation network nodes y, b, y2, a, b2, y-H2O, y-NH3, b-H2O, b-NH3, a-NH3, a-H2O, b-H2O-H2O, b-H2O-NH3, y-H2O-H2O, y-H2O-NH3.] • Different combinations have significantly different probabilities: • P(b=high | y=high) = 0.36, vs. P(b=high | y=low) = 0.03. • P(b-H2O > zero | b=high) = 0.5, vs. P(b-H2O > zero | b=zero) = 0.24.
Additional Chemical Knowledge [Figure: network nodes N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), y, b.] • The identity of the flanking amino acids influences the peak intensities: • Increased intensities N-terminal to Proline and Glycine. • Increased intensities C-terminal to Aspartic Acid. • 400 amino acid combinations reduced to 15 equivalence sets (X-P, X-G, etc.).
Positional Influence [Figure: network nodes y, b, y2, a, b2 and pos(m) (region in peptide).] • Creates separate models for different locations in the peptide. • Models phenomena such as: • weak b/y ions near the ends. • prevalence of a-ions in the first half of the peptide. • prevalence of b2 towards the peptide's C-terminal and y2 near the N-terminal.
Probability under HCID • By the decomposition properties of probabilistic networks, each node is conditionally independent of its non-descendants given the values of its parents, so:
$$P(\vec{I} \mid H_{CID}) = \prod_{f} P(I_f \mid \pi(f))$$
where $\pi(f)$ are the parents of node f.
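A toy example may make the factorization concrete. The sketch below uses a two-node network in which b has the single parent y, and intensities take only the levels low and high. The values P(b=high | y=high) = 0.36 and P(b=high | y=low) = 0.03 are the ones quoted in the deck. The marginal for y, the two-level simplification, and all names are made up for illustration and are not PepNovo's actual tables.

```python
import math
from typing import Dict, Tuple

# Parents of each node in the toy network (hypothetical structure: y -> b).
PARENTS: Dict[str, Tuple[str, ...]] = {"y": (), "b": ("y",)}

# Conditional probability tables, keyed by the tuple of parent values.
CPT = {
    "y": {(): {"low": 0.6, "high": 0.4}},          # made-up marginal
    "b": {
        ("high",): {"low": 0.64, "high": 0.36},    # 0.36 quoted in the deck
        ("low",): {"low": 0.97, "high": 0.03},     # 0.03 quoted in the deck
    },
}


def log_prob(assignment: Dict[str, str]) -> float:
    """Sum of log P(node | parents) over all nodes, i.e. the log of the product."""
    total = 0.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[p] for p in parents)
        total += math.log(CPT[node][parent_values][assignment[node]])
    return total


p_both_high = math.exp(log_prob({"y": "high", "b": "high"}))
print(p_both_high)
```

Working in log space keeps the product numerically stable once the network has many nodes.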
HRandom – Regional Density [Figure: spectrum peaks along the m/z axis labeled with their intensity levels (0 to 3); a bin of width 2ε sits inside a window of width w.]
Computing the Random Probability • The probability of a single peak missing the bin is 1-(2ε)/w. • Let n_i, 1≤i≤d, be the count of peaks with intensity level i in window w.
Random Model for HRandom • Peak occurrences are treated as random independent events: • The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.
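One way to turn the peak counts into level probabilities is sketched below. It is not necessarily the exact expression used on the slides. It assumes that peaks land independently, that each misses the bin with probability 1-(2ε)/w, and that the level observed in a bin is that of the strongest peak landing in it. The function name and the values of ε and w are hypothetical.

```python
from typing import List


def random_level_probs(counts: List[int], eps: float, w: float) -> List[float]:
    """Probability of observing each intensity level 0..d in a bin under the random model.

    counts[i-1] is n_i, the number of peaks with intensity level i in the window.
    Assumption: the bin shows the level of the strongest peak that lands in it.
    """
    miss = 1.0 - (2.0 * eps) / w          # a single peak misses the bin
    d = len(counts)
    probs = [0.0] * (d + 1)
    probs[0] = miss ** sum(counts)        # every peak misses the bin
    for level in range(1, d + 1):
        stronger = sum(counts[level:])    # peaks with a higher level must all miss
        at_least_one = 1.0 - miss ** counts[level - 1]
        probs[level] = (miss ** stronger) * at_least_one
    return probs


# Counts for levels 1, 2, 3 (two, five and three peaks); eps and w are made up.
probs = random_level_probs([2, 5, 3], eps=0.5, w=100.0)
print(probs)
```

Under these assumptions the level probabilities sum to one, and a denser window raises the chance of a random hit, which is the regional density effect described above.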
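The Likelihood Ratio Score • A putative cleavage site is scored according to the log ratio test:
$$\mathrm{Score}(m) = \log \frac{P(\vec{I} \mid H_{CID})}{P(\vec{I} \mid H_{Random})}$$
• A peptide can be scored by summing the scores of its prefix masses $m_1, \dots, m_n$:
$$\mathrm{Score}(P) = \sum_{i=1}^{n} \mathrm{Score}(m_i)$$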
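Summing over prefix masses can be sketched as follows. The example assumes that the cleavage sites are the proper prefixes of the peptide (the full-length mass is excluded) and that a per-site scoring function, for instance the log ratio of the CID and random probabilities, is supplied by the caller. The function names and the small residue table are illustrative choices, not part of the slides.

```python
from itertools import accumulate
from typing import Callable, List

# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406}


def prefix_masses(peptide: str) -> List[float]:
    """Masses of all proper prefixes of the peptide (assumed cleavage sites)."""
    totals = list(accumulate(RESIDUE_MASS[aa] for aa in peptide))
    return totals[:-1]


def peptide_score(peptide: str, site_score: Callable[[float], float]) -> float:
    """Sum the cleavage-site scores over the peptide's prefix masses."""
    return sum(site_score(m) for m in prefix_masses(peptide))


masses = prefix_masses("GAS")
# A constant site score stands in for the real log ratio here.
total = peptide_score("GAS", lambda m: 1.0)
print(masses, total)
```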
PepNovo's De Novo Sequencing • A spectrum graph is created from the experimental MS/MS spectrum. • The nodes are scored using our method. • The highest scoring anti-symmetric path is found using a dynamic programming algorithm.
Spectrum Graph • Acyclic graph. • Nodes are cleavage sites; each has a mass m and a score s. • Edges connect nodes with mass differences corresponding to an amino acid. [Figure: example graph with nodes (m: 0, s: 5.0), (m: 71.2, s: 4.3), (m: 99.1, s: 8.1), (m: 113, s: -1.2), (m: 163.2, s: 2.8), (m: 199.4, s: 5.6) and edges labeled Q, V, S, A, L, W.]
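A simplified version of the path search can be sketched with a short dynamic program over nodes sorted by mass. This sketch finds the highest scoring path from the lightest node to the heaviest one and deliberately omits the anti-symmetry constraint, so it is not PepNovo's algorithm. The node scores reuse the values from the example figure, but the masses, the 0.02 Da tolerance, the reduced residue table, and the function name are hypothetical.

```python
from typing import List, Optional, Tuple

# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406}


def best_path(nodes: List[Tuple[float, float]], tol: float = 0.02) -> Tuple[float, str]:
    """Highest scoring path from the lightest to the heaviest node.

    nodes: non-empty list of (mass, score). An edge i -> j exists when the mass
    difference matches a residue mass within tol. Anti-symmetry is NOT enforced.
    """
    nodes = sorted(nodes)
    n = len(nodes)
    neg_inf = float("-inf")
    best = [neg_inf] * n                      # best path score ending at each node
    back: List[Optional[Tuple[int, str]]] = [None] * n
    best[0] = nodes[0][1]
    for j in range(1, n):
        for i in range(j):
            if best[i] == neg_inf:
                continue                      # node i is unreachable from the start
            diff = nodes[j][0] - nodes[i][0]
            candidate = best[i] + nodes[j][1]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol and candidate > best[j]:
                    best[j] = candidate
                    back[j] = (i, aa)
    if best[-1] == neg_inf:
        return neg_inf, ""                    # heaviest node is unreachable
    letters = []
    j = n - 1
    while back[j] is not None:                # walk the back-pointers to the start
        i, aa = back[j]
        letters.append(aa)
        j = i
    return best[-1], "".join(reversed(letters))


# Hypothetical nodes as (mass, score) pairs.
NODES = [(0.0, 5.0), (57.02, 2.8), (71.04, 4.3), (99.07, 8.1),
         (113.08, -1.2), (170.11, 5.6)]
score, sequence = best_path(NODES)
print(score, sequence)
```

Because the nodes are processed in order of increasing mass and every edge points from a lighter to a heavier node, each node's best score is final by the time it is used, which is what makes the graph's acyclicity useful here.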
Results Benchmarking reported for 280 spectra.