De Novo Peptide Sequencing via Probabilistic Network Modeling

De Novo Peptide Sequencing via Probabilistic Network Modeling: PepNovo

Peptide Fragmentation • Collision-Induced Dissociation (CID). [Figure: the peptide N A C F E T P G R C shown before and after CID, split into fragments of mass M and PM-M.]

Peptide Fragmentation • A peptide of mass PM that fragments into a prefix of mass m and a suffix of mass PM-m can produce several different fragment ions. • The intensities at the expected offsets from mass m are used to create an intensity vector.
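The intensity vector described above can be sketched as follows. This is a minimal illustration, not PepNovo's code: it assumes singly charged ions, assumes that PM includes one water, and uses the standard monoisotopic offsets for b, y, a and their neutral losses. The function names, the tolerance, and the toy peak list are hypothetical.

```python
# Assumed monoisotopic masses in Da (standard values, not taken from the slides).
PROTON = 1.00728
WATER = 18.01056
AMMONIA = 17.02655
CO = 27.99491


def expected_mz(m, pm):
    """Expected m/z of singly charged ions for a cleavage at prefix mass m.

    Assumption: m is the sum of prefix residue masses and pm is the
    peptide mass including one water.
    """
    b = m + PROTON
    y = pm - m + PROTON
    return {
        "b": b,
        "y": y,
        "a": b - CO,
        "b-H2O": b - WATER,
        "b-NH3": b - AMMONIA,
        "y-H2O": y - WATER,
        "y-NH3": y - AMMONIA,
    }


def intensity_vector(peaks, m, pm, ion_order, tol=0.02):
    """For each ion type, take the strongest peak within tol of its expected m/z."""
    expected = expected_mz(m, pm)
    vector = []
    for ion in ion_order:
        matches = [inten for mz, inten in peaks if abs(mz - expected[ion]) <= tol]
        vector.append(max(matches, default=0.0))  # 0.0 means the ion is missing
    return vector


# Toy spectrum for the dipeptide AV, cleaved after A (hypothetical peaks).
toy_peaks = [(72.04, 12.0), (118.09, 30.0), (150.0, 5.0)]
toy_pm = 71.03711 + 99.06841 + WATER
toy_vector = intensity_vector(toy_peaks, 71.03711, toy_pm, ["b", "y", "a"])
```

Here the b and y peaks are found and the a ion is missing, so the vector has a zero in the last position.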

The Spectrum Graph

Scoring for De Novo Sequencing • All masses in the spectrum range can be considered putative cleavage sites. • Given the observed intensities, how do we evaluate whether mass m is a cleavage site? • A common statistical tool used by many scoring functions is the likelihood ratio test (Dancik et al. ’99, Havilio et al. ’03, ...).

Dancik et al. ’99 – Hypotheses • The main concept: give a premium for present peaks and a penalty for missing peaks. • Fragmentation hypothesis: uses a probability table. [Table: probabilities under the fragmentation hypothesis.] • Random hypothesis: PR, the probability of observing a random peak (~0.1).

Scoring a Cleavage Site (Dancik ’99) • Out of k possible ions for a cleavage at m, t are detected (w.l.o.g., fragments 1,...,t) and k-t are missing (fragments t+1,...,k). • Score using a log ratio test: the log of the probability of cleavage site m under the Fragmentation hypothesis divided by the probability of cleavage site m under the Random hypothesis.
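A minimal sketch of this log ratio, assuming the k ions are treated as independent so that each detected ion contributes log(p_i / PR) and each missing ion contributes log((1 - p_i) / (1 - PR)). The per-ion probabilities below are hypothetical placeholders, since the probability table itself is not preserved in the slides; only PR ≈ 0.1 comes from the source.

```python
import math


def dancik_score(detected, p_ion, p_random=0.1):
    """Log likelihood ratio for one putative cleavage site.

    detected: set of ion names whose peaks were observed.
    p_ion:    dict ion name -> probability of observing it at a true cleavage
              (hypothetical values in this sketch).
    p_random: probability of observing a random peak (~0.1 per the slides).
    """
    score = 0.0
    for ion, p in p_ion.items():
        if ion in detected:
            score += math.log(p / p_random)               # premium for a present peak
        else:
            score += math.log((1.0 - p) / (1.0 - p_random))  # penalty for a missing peak
    return score


# Hypothetical probability table for three ion types.
toy_table = {"b": 0.7, "y": 0.8, "a": 0.3}
score_with_b_and_y = dancik_score({"b", "y"}, toy_table)
score_with_nothing = dancik_score(set(), toy_table)
```

With these placeholder values, a site supported by both b and y scores positive, while a site with no supporting peaks scores negative.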

PepNovo Scoring • PepNovo implements a similar likelihood ratio test mechanism. • It can be viewed as extending the scoring model of Dancik et al. ’99. • It includes several factors that are not sufficiently addressed in current scoring functions.

Enhancements to Dancik et al. (’99) • Several intensity values. • Combinations of fragment ions. • Incorporation of additional chemical knowledge (e.g., preferred cleavage sites). • Positional influence of the cleavage site. • Improved random model.

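HCID - Fragmentation Network [Figure: probabilistic network over the fragment ion nodes y, b, y2, a, b2, y-H2O, y-NH3, b-H2O, b-NH3, a-H2O, a-NH3, b-H2O-H2O, b-H2O-NH3, y-H2O-H2O, y-H2O-NH3, together with the nodes N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), and pos(m) (region in peptide). The network models amino acid influence, ion combinations, and positional influence.]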

Discrete Intensity Values • Peak intensities are normalized according to the grass level (the average of the weakest 33% of peaks in the spectrum). • Normalized intensities are discretized into 4 intensity levels: • zero : I < 0.05 • low : 0.05 ≤ I < 2 (62% of peaks) • medium : 2 ≤ I < 10 (26% of peaks) • high : I ≥ 10 (12% of peaks)
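The normalization and discretization above can be sketched as follows. The thresholds and the 33% fraction come from the slide; dividing each raw intensity by the grass level is an assumption about what "normalized according to grass level" means, and the function names are hypothetical.

```python
def grass_level(intensities, fraction=0.33):
    """Average intensity of the weakest `fraction` of peaks in the spectrum."""
    ranked = sorted(intensities)
    k = max(1, int(len(ranked) * fraction))  # use at least one peak
    return sum(ranked[:k]) / k


def discretize(normalized):
    """Map a normalized intensity to one of the 4 levels from the slide."""
    if normalized < 0.05:
        return "zero"
    if normalized < 2:
        return "low"
    if normalized < 10:
        return "medium"
    return "high"


def intensity_levels(intensities):
    """Normalize raw intensities by the grass level (assumed: division), then discretize."""
    grass = grass_level(intensities)
    if grass <= 0:
        raise ValueError("grass level must be positive")
    return [discretize(i / grass) for i in intensities]


# Toy spectrum: the single weakest peak (int(6 * 0.33) = 1) sets the grass level to 1.
toy_levels = intensity_levels([1, 1, 1, 5, 30, 2])
```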

Combinations of Fragments [Figure: the fragment ion nodes of the network: y, b, y2, a, b2, y-H2O, y-NH3, b-H2O, b-NH3, a-H2O, a-NH3, b-H2O-H2O, b-H2O-NH3, y-H2O-H2O, y-H2O-NH3.] • Different combinations have significantly different probabilities: • P(b=high | y=high) = 0.36, vs. P(b=high | y=low) = 0.03. • P(b-H2O > zero | b=high) = 0.5, vs. P(b-H2O > zero | b=zero) = 0.24.

Additional Chemical Knowledge [Figure: network nodes N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), y, and b.] • The identity of the flanking amino acids influences the peak intensities: • Increased intensities N-terminal to Proline and Glycine. • Increased intensities C-terminal to Aspartic Acid. • 400 amino acid combinations reduced to 15 equivalence sets (X-P, X-G, etc.).

Positional Influence [Figure: network nodes pos(m) (region in peptide), y, b, y2, a, and b2.] • Creates separate models for different locations in the peptide. • Models phenomena such as: • weak b/y ions near the ends. • prevalence of a-ions in the first half of the peptide. • prevalence of b2 towards the peptide’s C-terminal and y2 near the N-terminal.

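Probability under HCID • From the decomposition properties of probabilistic networks, each node is independent of its non-descendants given the values of its parents, so the joint probability factors into one term per node: $P(\vec{I} \mid m, H_{CID}) = \prod_{f} P(I_f \mid \pi(f))$, where $I_f$ is the value of node $f$ and $\pi(f)$ are the parents of node $f$.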
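The factorization over parents can be illustrated with a toy two-node network in which y has no parent and b has y as its parent. Only P(b=high | y=high) = 0.36 is taken from the slides; every other number, the two-level simplification ("high" vs. "other"), and the function name are hypothetical.

```python
def network_probability(observed, parents, cpt):
    """Joint probability as the product of P(node | parents(node)) over all nodes.

    observed: dict node -> observed level.
    parents:  dict node -> tuple of parent node names.
    cpt:      dict node -> {tuple of parent levels -> {level -> probability}}.
    """
    prob = 1.0
    for node, level in observed.items():
        parent_levels = tuple(observed[p] for p in parents[node])
        prob *= cpt[node][parent_levels][level]
    return prob


# Toy network: y is a root node, b depends on y.
toy_parents = {"y": (), "b": ("y",)}
toy_cpt = {
    "y": {(): {"high": 0.2, "other": 0.8}},           # hypothetical prior
    "b": {
        ("high",): {"high": 0.36, "other": 0.64},      # 0.36 is from the slides
        ("other",): {"high": 0.05, "other": 0.95},     # hypothetical
    },
}

p_both_high = network_probability({"y": "high", "b": "high"}, toy_parents, toy_cpt)
```

In this toy table, observing both ions at high intensity has probability 0.2 × 0.36.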

HRandom – Regional Density [Figure: peaks at discrete intensity levels (0 to 3) along the m/z axis, with a bin of width 2ε inside a window of width w.]

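Computing the Random Probability • The probability of a single peak missing the bin is 1-(2ε)/w, where 2ε is the bin width and w is the window width. • Let ni, 1≤i≤d, be the count of peaks with intensity level i in the window w; the random probability is computed from these counts.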
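One way to turn these counts into probabilities is sketched below. The slide's own formula is not preserved, so this is an assumption rather than PepNovo's stated method: peaks fall into the bin independently, and the bin takes the highest intensity level among the peaks that land in it. Under that assumption the bin shows level t when no higher-level peak hits it and at least one level-t peak does. The function name and the counts are hypothetical.

```python
def random_level_probabilities(counts, epsilon, window):
    """Probability of each intensity level (0..d) appearing in a bin at random.

    counts[i - 1] is the number of peaks with intensity level i in the window.
    Assumption: the bin's level is the highest level among peaks that hit it.
    """
    miss = 1.0 - (2.0 * epsilon) / window   # a single peak misses the bin
    d = len(counts)
    probs = [miss ** sum(counts)]           # level 0: every peak misses
    for t in range(1, d + 1):
        higher = sum(counts[t:])            # peaks with level above t
        at_level = counts[t - 1]
        # No higher-level peak hits the bin, and at least one level-t peak does.
        probs.append(miss ** higher * (1.0 - miss ** at_level))
    return probs


# Hypothetical window: 5 low, 4 medium, 2 high peaks; bin width 1, window width 100.
toy_probs = random_level_probabilities([5, 4, 2], epsilon=0.5, window=100.0)
```

The terms telescope, so the probabilities over all levels sum to 1, which is a quick sanity check on the construction.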

Random Model for HRandom • Peak occurrences are treated as random independent events: • The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.

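The Likelihood Ratio Score • A putative cleavage site is scored according to the log ratio test: $\mathrm{Score}(m) = \log \frac{P(\vec{I} \mid m, H_{CID})}{P(\vec{I} \mid H_{Random})}$. • Can be used to score a peptide by summing the scores of its prefix masses $m_i$: $\mathrm{Score}(\text{peptide}) = \sum_{i} \mathrm{Score}(m_i)$.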
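As a small sketch of how the two bullets fit together, the functions below take the two hypothesis probabilities for each prefix mass as given inputs and combine them. The function names and the numeric inputs are hypothetical.

```python
import math


def cleavage_score(p_cid, p_random):
    """Log ratio of the CID-hypothesis probability to the random-hypothesis probability."""
    return math.log(p_cid / p_random)


def peptide_score(site_probabilities):
    """Sum the cleavage scores over a peptide's prefix masses.

    site_probabilities: list of (p_cid, p_random) pairs, one per prefix mass.
    """
    return sum(cleavage_score(p_cid, p_rand) for p_cid, p_rand in site_probabilities)


# Hypothetical peptide with two prefix masses: one well supported, one not.
toy_peptide_score = peptide_score([(0.072, 0.009), (0.01, 0.02)])
```

A site that is more likely under the random hypothesis contributes a negative term, so weakly supported cleavages lower the peptide's total.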

PepNovo’s De Novo Sequencing • A spectrum graph is created from the experimental MS/MS spectrum. • The nodes are scored using our method. • The highest-scoring anti-symmetric path is found using a dynamic programming algorithm.

Spectrum Graph • Acyclic graph. • Nodes are cleavage sites; each has a mass m and a score s. • Edges connect nodes with mass differences corresponding to an amino acid. [Figure: example graph with nodes (m: 0, s: 5.0), (m: 71.2, s: 4.3), (m: 99.1, s: 8.1), (m: 113, s: -1.2), (m: 163.2, s: 2.8), (m: 199.4, s: 5.6) and edges labeled Q, V, S, A, L, W.]
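A simplified sketch of path finding on such a graph follows. It finds the highest-scoring path from the lowest-mass node to the highest-mass node by dynamic programming over nodes sorted by mass. It deliberately omits the anti-symmetry constraint, so it is not the algorithm PepNovo uses. The residue table is a small subset of standard monoisotopic masses, and the toy nodes, tolerance, and function name are hypothetical.

```python
import math

# Subset of standard monoisotopic residue masses (Da), for illustration only.
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841, "L": 113.08406}


def best_path(nodes, tol=0.02):
    """Highest-scoring source-to-sink path in a spectrum graph.

    nodes: list of (mass, score) pairs. An edge i -> j exists when
    mass[j] - mass[i] matches a residue mass within tol.
    Returns (total score, residue string read along the path).
    """
    nodes = sorted(nodes)
    n = len(nodes)
    best = [-math.inf] * n   # best[j]: best path score ending at node j
    back = [None] * n        # back[j]: (previous node index, residue on the edge)
    best[0] = nodes[0][1]
    for j in range(1, n):
        for i in range(j):
            if best[i] == -math.inf:
                continue     # node i is not reachable from the source
            diff = nodes[j][0] - nodes[i][0]
            for aa, mass in RESIDUES.items():
                if abs(diff - mass) <= tol and best[i] + nodes[j][1] > best[j]:
                    best[j] = best[i] + nodes[j][1]
                    back[j] = (i, aa)
    # Trace the path back from the sink.
    residues = []
    j = n - 1
    while back[j] is not None:
        j, aa = back[j][0], back[j][1]
        residues.append(aa)
    return best[-1], "".join(reversed(residues))


# Toy graph: three competing routes from mass 0 to mass 170.11.
toy_nodes = [(0.0, 5.0), (71.04, 4.3), (99.07, 8.1), (113.08, -1.2), (170.11, 5.6)]
toy_score, toy_sequence = best_path(toy_nodes)
```

In this toy graph the route through the node at 99.07 wins because that node carries the highest score, giving the residue string "VA".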

Results • Benchmarking reported for 280 spectra.

Q & A
