CSE182-L11

camdyn

Presentation Transcript


  1. CSE182-L11 Protein sequencing and Mass Spectrometry CSE182

  2. Whole genome shotgun • Input: • Shotgun sequence fragments (reads) • Mate pairs • Output: • A single sequence created by consensus of overlapping reads • First-generation assemblers did not use mate pairs (Phrap, CAP, ...) • Second generation: CA, Arachne, Euler • We will discuss Arachne, a freely available second-generation sequence assembler

  3. Arachne (also Celera Assembler) • Overlap • Problem 1: Large all-against-all computation • Solution 1: Fast overlap computation using k-mer hashing • Layout • Problem 2: Small contigs even with 10X coverage • Solution 2: Use mate pairs to build supercontigs • Problem 3: Repetitive structure of the genome
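
The k-mer hashing idea for Problem 1 can be sketched as follows. This is a minimal illustration, not Arachne's actual implementation: the function name `candidate_overlaps`, the value of `k`, and the toy reads are assumptions, and reverse complements are ignored. Only read pairs that share a k-mer are reported, so the expensive alignment step need not be run on all pairs.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k=4):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)              # k-mer -> ids of reads containing it
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        # Every pair of reads in the same bucket is a candidate overlap.
        for a, b in combinations(sorted(rids), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical toy reads: the first two share the k-mers ACGG and CGGA.
reads = ["ACGTACGGA", "ACGGATTCA", "TTTTTTTTT"]
print(candidate_overlaps(reads))
```

Only the candidate pairs would then be passed to a full overlap alignment.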

  4. Problem 3: Repeats

  5. Repeats & Chimerisms • 40-50% of the human genome is made up of repetitive elements. • Repeats can cause great problems in the assembly! • Chimerism causes a clone to be from two different parts of the genome. This can again give a completely wrong assembly.

  6. Repeats • How can you detect if your fragment overlap is due to a repeat?

  7. Repeat detection • Lander-Waterman strikes again! • The expected number of clones in a repeat-containing island is MUCH larger than in a non-repeat-containing island (contig). • Thus, every contig can be marked as unique or non-unique. In the first step, throw away the non-unique islands. [Figure label: Repeat]

  8. Detecting Repeat Contigs 1: Read Density • Compute the log-odds ratio of two hypotheses: • H1: The contig is from a unique region of the genome. • H2: The contig is from a region that is repeated at least twice.
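
A minimal sketch of such a log-odds score, under assumptions that are not stated on the slide: read start positions are Poisson-distributed, so a unique contig of length L expects λ = L·N/G reads (N reads, genome length G), and a two-copy repeat expects 2λ. The ratio of the two Poisson likelihoods for k observed reads then simplifies to λ − k·ln 2. The function name and parameters are illustrative.

```python
import math

def unique_log_odds(num_reads, contig_len, total_reads, genome_len):
    """ln[ P(k | unique) / P(k | two-copy repeat) ] under a Poisson model.

    With lam = expected reads in a unique contig, the Poisson terms give
    ln( e^-lam * lam^k / (e^-2lam * (2*lam)^k) ) = lam - k * ln(2).
    """
    lam = contig_len * total_reads / genome_len
    return lam - num_reads * math.log(2)

# Hypothetical numbers: a 10 kb contig, 1000 reads, 1 Mb genome (lam = 10).
print(unique_log_odds(10, 10_000, 1_000, 1_000_000))   # positive: looks unique
print(unique_log_odds(20, 10_000, 1_000, 1_000_000))   # negative: looks repeated
```

A positive score favors H1 (unique), a negative score favors H2 (repeat); where to place the cutoff is a separate design choice.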

  9. Detecting Chimeric reads • Chimeric reads: reads that contain sequence from two genomic locations. • Good overlaps: G(a,b) if a and b overlap with a high score • Transitive overlap: T(a,c) if G(a,b) and G(b,c) • Find a point x across which only transitive overlaps occur. Such an x is a point of chimerism.

  10. Contig assembly • Reads are merged into contigs up to repeat boundaries. • If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also, shift(a,c) = shift(a,b) + shift(b,c). • Most of the contigs are unique pieces of the genome and end at some repeat boundary. • Some contigs might be entirely within repeats. These must be detected.
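
The shift-additivity condition can be checked directly. The sketch below assumes each shift is the offset (in bases) of the second read's start relative to the first, and allows a small `tolerance` for alignment indels; both the function name and the tolerance are illustrative assumptions.

```python
def shifts_consistent(shift_ab, shift_bc, shift_ac, tolerance=2):
    """True if shift(a,c) agrees with shift(a,b) + shift(b,c) within tolerance."""
    return abs(shift_ac - (shift_ab + shift_bc)) <= tolerance

# Reads starting at positions 0, 30 and 75 of the same region are consistent.
print(shifts_consistent(30, 45, 75))    # True
# An overlap induced by a repeat typically breaks additivity.
print(shifts_consistent(30, 45, 120))   # False
```

A triple of reads that fails this check is a hint that one of the overlaps is spurious.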

  11. Creating Super Contigs

  12. Supercontig assembly • Supercontigs are built incrementally. • Initially, each contig is a supercontig. • In each round, a pair of supercontigs is merged, until no more merges can be performed. • Create a priority queue with a score for every pair of 'mergeable' supercontigs. • The score has two terms: • A reward for multiple mate-pair links • A penalty for distance between the links

  13. Supercontig merging • Remove the top-scoring pair (S1,S2) from the priority queue Q. • Merge (S1,S2) to form supercontig T. • Remove all pairs in Q containing S1 or S2. • Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue. • Detect repeated supercontigs and remove them.
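
The merging loop can be sketched with a heap-based priority queue. Everything specific here is an assumption made for illustration: the score `links - gap_penalty * gap`, the `min_links` threshold for "mergeable", string contig ids, and the way links are combined after a merge. Pairs containing S1 or S2 are removed lazily (stale heap entries are skipped when popped), and the repeat-detection step is omitted.

```python
import heapq

def merge_supercontigs(links, min_links=2, gap_penalty=0.001):
    """links: {(s1, s2): (num_mate_pair_links, estimated_gap)} with string ids."""
    def score(n, gap):
        return n - gap_penalty * gap      # reward for links, penalty for distance

    members, neighbors = {}, {}           # id -> contigs; id -> {other: (n, gap)}
    for (a, b), (n, gap) in links.items():
        for s in (a, b):
            members.setdefault(s, (s,))
            neighbors.setdefault(s, {})
        neighbors[a][b] = (n, gap)
        neighbors[b][a] = (n, gap)

    # heapq is a min-heap, so scores are negated to pop the best pair first.
    heap = [(-score(n, g), a, b) for (a, b), (n, g) in links.items() if n >= min_links]
    heapq.heapify(heap)

    while heap:
        _, s1, s2 = heapq.heappop(heap)
        if s1 not in members or s2 not in members:
            continue                      # stale pair: s1 or s2 already merged
        t = s1 + "+" + s2
        members[t] = members.pop(s1) + members.pop(s2)
        merged = {}
        for s in (s1, s2):
            for w, (n, gap) in neighbors.pop(s).items():
                if w in (s1, s2):
                    continue
                old_n, old_gap = merged.get(w, (0, gap))
                merged[w] = (old_n + n, min(old_gap, gap))
        neighbors[t] = merged
        for w, (n, gap) in merged.items():
            neighbors[w].pop(s1, None)
            neighbors[w].pop(s2, None)
            neighbors[w][t] = (n, gap)
            if n >= min_links:            # insert (T, W) into the queue
                heapq.heappush(heap, (-score(n, gap), t, w))
    return sorted(members.values())

# Hypothetical links: D is attached by a single mate pair and stays separate.
links = {("A", "B"): (3, 500), ("B", "C"): (2, 1000), ("C", "D"): (1, 200)}
print(merge_supercontigs(links))
```

Lazy deletion keeps the heap simple at the cost of some stale entries; an indexed priority queue would allow the explicit removal step described on the slide.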

  14. Repeat Supercontigs • If the distance between two supercontigs is not correct, they are marked as repeated. • If transitivity is not maintained, then there is a repeat.

  15. Filling gaps in Supercontigs

  16. Consensus Derivation • Consensus sequence is created by converting pairwise read alignments into multiple-read alignments. • The final sequence is reported as a consensus for each of the supercontigs. • The supercontigs themselves are ordered using physical markers. • Gaps are filled in using directed sequencing efforts.

  17. Summary • Whole genome shotgun is now routine: • Human, Mouse, Rat, Dog, Chimpanzee, ... • Many prokaryotes (one can be sequenced in a day) • Plant genomes: Arabidopsis, Rice • Model organisms: Worm, Fly, Yeast • A lot is not known about genome structure, organization and function. • Comparative genomics offers low-hanging fruit

  18. Course Summary • Sequence comparison (BLAST & other tools) • Protein motifs: • Profiles / regular expressions / HMMs • Discovering protein-coding genes • Gene-finding HMMs • DNA signals (splice signals) • How is the genomic sequence itself obtained? • LW statistics • Sequencing and assembly • Next topic: the dynamic aspects of the cell [Figure labels: Gene finding, ESTs, Protein sequence analysis]

  19. Dynamic aspects of cellular function • Expressed transcripts • Microarrays, ... • Expressed proteins • Mass spectrometry, ... • Protein-protein interactions (protein networks) • Protein-DNA interactions • Population studies

  20. Mass Spectrometry

  21. Nobel citation ’02

  22. The promise of mass spectrometry • Mass spectrometry is coming of age as the tool of choice for proteomics • Protein sequencing, networks, quantitation, interactions, structure, ... • Computation has a big role to play in the interpretation of MS data. • We will discuss algorithms for • Sequencing, modifications, interactions, ...

  23. Sample Preparation • Enzymatic digestion (trypsin) + fractionation

  24. Single Stage MS • Mass spectrometry • LC-MS: 1 MS spectrum / second

  25. Tandem MS • Secondary fragmentation • Ionized parent peptide

  26. The peptide backbone • The peptide backbone breaks to form fragments with characteristic masses. • Backbone, N-terminus to C-terminus: H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH, showing amino-acid residues i-1, i and i+1

  27. Ionization • The peptide backbone breaks to form fragments with characteristic masses. • Ionized parent peptide, N-terminus to C-terminus: H+ H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH, showing amino-acid residues i-1, i and i+1

  28. Fragment ion generation • The peptide backbone breaks to form fragments with characteristic masses. • Ionized peptide fragment, N-terminus to C-terminus: H+ H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH, showing amino-acid residues i-1, i and i+1

  29. Tandem MS for Peptide ID
      b ions:     88   145   292   405   534   663   778   907  1020  1166
      residue:     S     G     F     L     E     E     D     E     L     K
      y ions:   1166  1080  1022   875   762   633   504   389   260   147
      [Spectrum figure: % Intensity (0 to 100) vs. m/z (250 to 1000); [M+2H]2+ peak marked]

  30. Peak Assignment • Peak assignment implies sequence (residue tag) reconstruction!
      b ions:     88   145   292   405   534   663   778   907  1020  1166
      residue:     S     G     F     L     E     E     D     E     L     K
      y ions:   1166  1080  1022   875   762   633   504   389   260   147
      [Spectrum figure: % Intensity (0 to 100) vs. m/z (250 to 1000); labeled peaks b3 to b9, y2 to y9 and [M+2H]2+]

  31. Database Searching for peptide ID • For every peptide from a database • Generate a hypothetical spectrum • Compute a correlation between the hypothetical and experimental spectra • Choose the best • Database searching is very powerful and is the de facto standard for MS. • Sequest, Mascot, and many others
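
The search loop can be sketched as below. This is a simplified stand-in, not the scoring used by Sequest or Mascot: the "correlation" is replaced by a shared-peak count, only singly charged b and y ions are generated, and the function names, tolerance, and toy database are assumptions. Residue masses are standard monoisotopic values.

```python
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
        "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
        "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
        "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
        "R": 156.10111, "Y": 163.06333, "W": 186.07931}
PROTON, WATER = 1.00728, 18.01056

def theoretical_spectrum(peptide):
    """Singly charged b and y ion m/z values for every backbone cleavage."""
    masses = [MONO[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(peptide)):
        peaks.append(sum(masses[:i]) + PROTON)            # b ion (prefix)
        peaks.append(sum(masses[i:]) + WATER + PROTON)    # y ion (suffix)
    return sorted(peaks)

def shared_peak_count(observed, theoretical, tol=0.5):
    """Number of theoretical peaks with an observed peak within tol."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)

def best_match(observed, database, tol=0.5):
    """Peptide in the database whose theoretical spectrum explains most peaks."""
    return max(database,
               key=lambda pep: shared_peak_count(observed, theoretical_spectrum(pep), tol))

# Hypothetical observed spectrum: half of the true peaks, shifted by 0.1, plus noise.
observed = [m + 0.1 for m in theoretical_spectrum("SGFLEEDELK")][::2] + [300.0, 450.0]
print(best_match(observed, ["PEPTIDEK", "SGFLEEDELK", "AAAAAAAAAK"]))
```

Real search engines restrict candidates by parent mass first and use more discriminating scores, but the loop structure is the same.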

  32. Spectra: the real story • Noise peaks • Ions, not prefixes & suffixes • Mass-to-charge ratio, and not mass • Multiply charged ions • Isotope patterns, not single peaks

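  33. Peptide fragmentation possibilities (ion types) • C-terminal fragments: x_{n-i}, y_{n-i}, y_{n-i-1}, v_{n-i}, w_{n-i}, z_{n-i} • N-terminal fragments: a_i, b_i, b_{i+1}, c_i, d_{i+1} • Low-energy fragments vs. high-energy fragments [Diagram: backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH- with cleavage sites and side-chain groups R', R'']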

  34. Ion types, and offsets • P = prefix residue mass • S = suffix residue mass • b-ions = P+1 • y-ions = S+19 • a-ions = P-27
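
As a worked example of these offsets, the sketch below builds the ion ladders for the peptide SGFLEEDELK used on the "Tandem MS for Peptide ID" slide. The residue masses are standard monoisotopic values (only the residues needed here are listed); the function name `ion_ladder` is an assumption, and the offsets are the nominal ones from the slide rather than exact proton and water masses.

```python
# Monoisotopic residue masses (Da) for the residues in SGFLEEDELK only.
RESIDUE_MASS = {"S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
                "E": 129.04259, "D": 115.02694, "K": 128.09496}

# Nominal offsets from the slide: b = P + 1, y = S + 19, a = P - 27.
OFFSETS = {"b": 1.0, "y": 19.0, "a": -27.0}

def ion_ladder(peptide, ion_type):
    """Fragment ion masses for cleavages 1 .. n-1 (the intact peptide is excluded)."""
    # y ions grow from the C-terminus, so walk the reversed sequence.
    residues = peptide[::-1] if ion_type == "y" else peptide
    total, ladder = 0.0, []
    for aa in residues[:-1]:
        total += RESIDUE_MASS[aa]
        ladder.append(total + OFFSETS[ion_type])
    return ladder

print([round(m) for m in ion_ladder("SGFLEEDELK", "b")])
print([round(m) for m in ion_ladder("SGFLEEDELK", "y")])
```

Rounded to integers, the b ladder runs 88, 145, ..., 1020 and the y ladder runs 147, 260, ..., 1080, matching the values listed for that peptide.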

  35. Mass-Charge ratio • The X-axis is not mass, but (M+Z)/Z • Z=1 implies that peak is at M+1 • Z=2 implies that peak is at (M+2)/2 • M=1000, Z=2, peak position is at 501 • Quiz: Suppose you see a peak at 501. Is the mass 500, or is it 1000?
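
The arithmetic on this slide can be written out in a few lines. The sketch uses the slide's nominal proton mass of 1; the function names and the `max_charge` limit are assumptions. It also shows why a single peak position is ambiguous without knowing the charge.

```python
def peak_position(mass, charge):
    """Position on the m/z axis of a peptide of mass M carrying Z protons."""
    return (mass + charge) / charge

def candidate_masses(mz, max_charge=3):
    """All masses consistent with a peak at mz, one per possible charge."""
    return {z: mz * z - z for z in range(1, max_charge + 1)}

print(peak_position(1000, 2))      # 501.0
print(candidate_masses(501, 2))    # {1: 500, 2: 1000}
```

Both masses from the quiz are consistent with a peak at 501, so extra information (such as isotope peak spacing) is needed to fix the charge.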

  36. Isotopic peaks • Ex: Consider peptide SAM • Mass = 308.12802 • You should see: a single peak at 308.13 • Instead, you see: a series of peaks from 308.13 to 310.13 [Figure: expected vs. observed spectrum]

  37. Isotopes • C-12 is the most common. Suppose C-13 occurs with probability 1% • Ex: SAM • Composition: C11 H22 N3 O5 S1 • What is the probability that you will see a single C-13? • Note that C, S, O, N all have isotopes. Can you compute the isotopic distribution?

  38. All atoms have isotopes • Isotopes of atoms • O-16,18; C-12,13; S-32,34; ... • Each isotope has a frequency of occurrence • If a molecule (peptide) has a single copy of C-13, that will shift its peak by 1 Da • With multiple copies of a peptide, we have a distribution of intensities over a range of masses (isotopic profile). • How can you compute the isotopic profile of a peak?

  39. Isotope Calculation • Denote: • Nc: number of carbon atoms in the peptide • Pc: probability of occurrence of C-13 (~1%) • Then: [Figure: isotope profiles for Nc=50 and Nc=200, with the +1 peak marked]
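
The slide's formula is not preserved in this transcript. A natural reading, assuming each carbon is independently C-13 with probability Pc and ignoring all other elements, is the binomial Pr[peak at M+k] = C(Nc, k) · Pc^k · (1 − Pc)^(Nc − k). The sketch below evaluates it for the two carbon counts shown in the figure; the function name is illustrative.

```python
from math import comb

def c13_peak_probability(n_carbon, k, p_c13=0.01):
    """Probability that exactly k of n_carbon atoms are C-13 (binomial model)."""
    return comb(n_carbon, k) * p_c13**k * (1 - p_c13)**(n_carbon - k)

for n_carbon in (50, 200):
    p0 = c13_peak_probability(n_carbon, 0)
    p1 = c13_peak_probability(n_carbon, 1)
    print(n_carbon, round(p0, 3), round(p1, 3))
```

Under this model the monoisotopic peak dominates for Nc=50, while for Nc=200 the +1 peak is the taller of the two, which is why larger peptides show shifted isotope envelopes.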

  40. Isotope Calculation Example • Suppose we consider Nitrogen and Carbon • NN: number of Nitrogen atoms • PN: probability of occurrence of N-15 • Pr(peak at M)? • Pr(peak at M+1)? • Pr(peak at M+2)? • How do we generalize? How can we handle Oxygen (O-16,18)?

  41. General isotope computation • Definition: • Let p_{i,a} be the abundance of the isotope of atom a with mass i Da above the least mass • Ex: p_{0,C}: abundance of C-12; p_{2,O}: abundance of O-18, etc. • Characteristic polynomial • Prob{M+i}: coefficient of x^i in the characteristic polynomial (a binomial convolution)
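
One way to make this concrete, assuming the characteristic polynomial is the product over atoms of (p_{0,a} + p_{1,a}·x + p_{2,a}·x² + ...), is to multiply the per-atom polynomials by repeated convolution. The abundance table holds approximate natural abundances (S-36 is omitted), and the function names are illustrative; none of this is taken from the slide itself.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# Approximate natural abundances; index i = isotope i Da above the lightest.
ABUNDANCES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425],
}

def isotope_profile(composition, abundances=ABUNDANCES):
    """Coefficient i of the result is Prob{M+i} for the given atom counts."""
    profile = [1.0]
    for element, count in composition.items():
        for _ in range(count):
            profile = poly_mul(profile, abundances[element])
    return profile

# Composition of the SAM example: C11 H22 N3 O5 S1.
profile = isotope_profile({"C": 11, "H": 22, "N": 3, "O": 5, "S": 1})
print([round(p, 4) for p in profile[:4]])
```

Repeated multiplication is quadratic in the number of atoms; for large molecules, exponentiation by squaring or an FFT-based convolution would be the usual speed-up.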

  42. Isotopic Profile Application • In DxMS, hydrogen atoms are exchanged with deuterium. • The rate of exchange indicates how buried the peptide is (in the folded state). • Consider the observed characteristic polynomials of the isotope profile at various time points t1, t2, ... Then • The estimates of p_{1,H} can be obtained by a deconvolution. • Such estimates at various time points should give the rate of incorporation of deuterium and, therefore, the accessibility.

  43. Quiz • How can you determine the charge on a peptide? • Difference between the first and second isotope peak is 1/Z • Proposal: • Given a mass, predict a composition, and the isotopic profile • Do a 'goodness of fit' test to isolate the peaks corresponding to the isotope • Compute the difference
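
The last step of the proposal can be sketched as follows, assuming the isotope peaks of one peptide have already been isolated (the composition prediction and goodness-of-fit steps are skipped). The function name and the `max_charge` limit are assumptions. The charge is taken to be the Z whose 1/Z is closest to the mean spacing between consecutive peaks.

```python
def infer_charge(isotope_peaks_mz, max_charge=5):
    """Estimate the charge from the spacing of consecutive isotope peaks."""
    peaks = sorted(isotope_peaks_mz)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    # Isotope peaks of a charge-Z ion are about 1/Z apart on the m/z axis.
    return min(range(1, max_charge + 1), key=lambda z: abs(mean_spacing - 1.0 / z))

print(infer_charge([308.13, 309.13, 310.13]))   # spacing 1.0
print(infer_charge([501.0, 501.5, 502.0]))      # spacing 0.5
```

Averaging the spacings gives some robustness to measurement error, but the estimate is only as good as the peak isolation that precedes it.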

  44. Post-translational modifications

  45. Tandem MS summary • The basics of peptide ID using tandem MS are simple. • Correlate experimental with theoretical spectra • In practice, there might be many confounding problems. • Isotope peaks, noise peaks, varying charges, post-translational modifications, no database. • Recall that we discussed how peptides could be identified by scanning a database. • What if the database did not contain the peptide of interest?

  46. De novo analysis basics • Suppose all ions were prefix ions? Could you tell what the peptide was? • Can post-translational modifications help?
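
For the idealized case in the first question, where every peak is a prefix (b) ion and none is missing, the sequence can be read off from consecutive mass differences. The sketch below assumes the nominal b-ion offset of +1, a partial table of monoisotopic residue masses, and a matching tolerance of 0.02 Da; the function name is illustrative. Residues with equal mass, such as L and I, cannot be told apart this way.

```python
# Partial table of monoisotopic residue masses (Da); L and I are isobaric.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
                "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
                "E": 129.04259, "M": 131.04049, "F": 147.06841}

def read_sequence(prefix_peaks, tol=0.02):
    """Turn a complete ladder of prefix-ion masses into residue labels."""
    masses = [1.0] + sorted(prefix_peaks)     # empty prefix: the offset alone
    sequence = []
    for lo, hi in zip(masses, masses[1:]):
        diff = hi - lo
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol]
        sequence.append("/".join(matches) if matches else "?")
    return sequence

# Build a hypothetical complete prefix ladder for SGFLEEDE and read it back.
peaks, total = [], 1.0
for aa in "SGFLEEDE":
    total += RESIDUE_MASS[aa]
    peaks.append(total)
print(read_sequence(peaks))
```

Real spectra mix prefix and suffix ions, miss peaks, and contain noise, which is what makes de novo sequencing a genuine search problem rather than a table lookup.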
