
SHRiMP: The SHort Read Mapping Package

Michael Brudno, Department of Computer Science, University of Toronto, 11/09/08. Handling NGS data: at least 3 distinct read types: Illumina/Solexa and 454 (letter-space), AB SOLiD (color-space, di-base sequencing), and 2-pass SMS (Helicos).


Presentation Transcript


  1. SHRiMP: The SHort Read Mapping Package • Michael Brudno • Department of Computer Science, University of Toronto • 11/09/08

  2. Handling NGS Data • NGS: at least 3 distinct read types: • Illumina/Solexa, 454: letter-space • AB SOLiD: color-space (di-base sequencing) • 2-pass SMS (Helicos): 2 reads of the same location, higher error rates • Need new algorithms • SOLiD: Biologists want letters, not colors • 2-pass: How to best handle two reads?

  3. SHRiMP Overview • Isolate similarity in stages: • Spaced Seed Filtering • Vectorized Smith-Waterman • Full Alignment • Stages bracketed on the slide as "Common" • Specialized for SOLiD, 2-pass, Letter-space • Compute p-values (and other statistics)

  4. Outline • AB SOLiD Reads • 2-pass (SMS) Reads

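  5. AB SOLiD: Dibase Sequencing • hmm??? HMM!!! • AB SOLiD reads look like this: T012233102, T012033102 • [Figure: di-base color-code diagram over A, C, G, T: self-transitions are color 0, A↔C and G↔T are color 1, A↔G and C↔T are color 2, A↔T and C↔G are color 3; the two reads are shown next to the letter sequences TGAGCGTTC and TGAATAGGA, which agree only in their first three letters]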
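The color code on this slide can be sketched in a few lines. This is an illustrative sketch, not SHRiMP's code: it assumes the standard di-base encoding in which, with A=0, C=1, G=2, T=3, the color of two adjacent bases is the XOR of their codes. The function names `encode` and `decode` are made up for this example.

```python
# Di-base color code: with A=0, C=1, G=2, T=3, the color of two
# adjacent bases is the XOR of their 2-bit codes (assumed encoding).
BASES = "ACGT"

def encode(primer, letters):
    """Letter sequence -> color read (primer base followed by colors)."""
    colors = []
    prev = primer
    for base in letters:
        colors.append(str(BASES.index(prev) ^ BASES.index(base)))
        prev = base
    return primer + "".join(colors)

def decode(read):
    """Color read (primer base + colors) -> letters, assuming no color errors."""
    prev = read[0]
    letters = []
    for color in read[1:]:
        prev = BASES[BASES.index(prev) ^ int(color)]
        letters.append(prev)
    return "".join(letters)

print(decode("T012233102"))       # TGAGCGTTC
print(encode("T", "TGAGCGTTC"))   # T012233102
```

Because each letter is derived from the previous one, a single wrong color changes every letter decoded after it.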

  6. AB SOLiD: Color space is complex!
      INDELS (letters, colors):
        TGAGTTA    122103
        TGA-TTA    12-303
        TGAGTTTA   1221003
        TGAGTATA   1221333
      SNPs (letters, colors):
        TGAGTT   12210
        TGACTT   12120
        TGAATT   12030
        TGATTT   12300
      G: TTGAGTTATGGAT   012210331023
      R: TTGACTTATGGAT   012120331023
      It's bloody complicated!
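The G/R pair on this slide shows that one letter substitution changes two adjacent colors. The sketch below checks that pattern; it assumes the XOR form of the di-base code (A=0, C=1, G=2, T=3), and the helper names `to_colors`, `color_diffs`, and `looks_like_snp` are hypothetical, not SHRiMP functions.

```python
BASES = "ACGT"

def to_colors(letters):
    """Colors between each pair of adjacent letters (assumed XOR encoding)."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(letters, letters[1:])]

def color_diffs(x, y):
    """Positions at which two equal-length color lists disagree."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]

def looks_like_snp(x, y):
    """True if x and y differ in exactly two adjacent colors whose XOR is
    unchanged, which is the signature of a single letter substitution."""
    d = color_diffs(x, y)
    return (len(d) == 2 and d[1] == d[0] + 1
            and x[d[0]] ^ x[d[1]] == y[d[0]] ^ y[d[1]])

genome = to_colors("TTGAGTTATGGAT")   # 012210331023
read = to_colors("TTGACTTATGGAT")     # 012120331023
diffs = color_diffs(genome, read)     # [3, 4]
print(diffs, looks_like_snp(genome, read))
```

A lone color difference fails this check, so it is more plausibly a sequencing error than a SNP.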

  7. AB SOLiD: Translations • Look at: 012233102 • Recall: 012033102 • 4 translations for every color sequence • [Figure: TGAGCGTTC against TGAGCGTTC matches at all 9 positions; TGAGCGTTC against TGAATAGGA matches only at the first 3; the color-code diagram from slide 5 is repeated]
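One way to read "4 translations for every color sequence" is that each possible starting base gives a different letter sequence. The sketch below enumerates them for the slide's color string, assuming the XOR form of the di-base code; `translate` is a name invented for this example.

```python
BASES = "ACGT"

def translate(start, colors):
    """Decode a color string into letters, given an assumed starting base."""
    prev = BASES.index(start)
    letters = []
    for c in colors:
        prev ^= int(c)          # next base = previous base XOR color
        letters.append(BASES[prev])
    return "".join(letters)

translations = {b: translate(b, "012233102") for b in BASES}
for start, letters in translations.items():
    print(start, letters)
```

Starting from T this yields TGAGCGTTC. The four translations disagree at every position, which is why a color error can be modeled as a jump from one translation to another.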

  8. AB SOLiD: Modified Smith-Waterman • 4 S-W matrices, one per translation • Errors transition into other matrix • 'Crossover' penalty charged for errors • [Figure: S-W matrices labeled Translation A and Translation C, with the genome GATACCTTTGAGCGTTCCCATTG on one axis and the read letters … AGCGTTC on the other]
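The idea of four matrices with a crossover penalty can be sketched as a local alignment over a third index, the translation. This is a simplified illustration, not SHRiMP's implementation: the score values, the linear gap model, and the function names are all assumptions made for the example.

```python
BASES = "ACGT"

def translate(start, colors):
    """Decode colors into letters from an assumed starting base."""
    prev, out = BASES.index(start), []
    for c in colors:
        prev ^= int(c)
        out.append(BASES[prev])
    return "".join(out)

def colorspace_sw(genome, colors, match=10, mismatch=-15, gap=-40, crossover=-14):
    """Best local score aligning a color read to a letter genome.
    H[k][i][j] is the best score ending at read position i and genome
    position j while reading letters from translation k. Scores are
    hypothetical."""
    trans = [translate(b, colors) for b in BASES]
    n, m = len(colors), len(genome)
    H = [[[0] * (m + 1) for _ in range(n + 1)] for _ in BASES]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(4):
                # Arrive diagonally from any matrix; switching matrices
                # models a color error and costs the crossover penalty.
                diag = max(H[p][i - 1][j - 1] + (0 if p == k else crossover)
                           for p in range(4))
                s = match if trans[k][i - 1] == genome[j - 1] else mismatch
                h = max(0, diag + s,
                        H[k][i - 1][j] + gap,   # read letter against a gap
                        H[k][i][j - 1] + gap)   # genome letter against a gap
                H[k][i][j] = h
                best = max(best, h)
    return best

genome = "ACGTTGAGTTATGGATCCA"
print(colorspace_sw(genome, "0122103310"))  # colors of TGAGTTATGG after a T
print(colorspace_sw(genome, "0122123310"))  # same read with one color error
```

With these assumed scores the clean read scores 100 and the read with one color error scores 86: all ten letters still match, and the single crossover costs 14 instead of the run of mismatches a one-translation alignment would incur.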

  9. AB SOLiD: Obligatory Comparison • SHRiMP and AB Mapper (1.6) • SHRiMP seed weight 8 (1111001111) • AB 35_2, 35_3 schemas • 10,000 35bp reads • C. savignyi (173Mb), very high polymorphism • Considering single top hits only
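The seed 1111001111 has weight 8 because only its eight 1-positions must match; the two 0-positions are ignored. The sketch below shows that filtering step on letters. It is an assumption-laden illustration: the index layout and the names `seed_key`, `build_index`, and `seed_hits` are invented here and do not describe how SHRiMP organizes its index.

```python
from collections import defaultdict

MASK = "1111001111"  # seed from the slide: weight 8, span 10

def seed_key(window, mask=MASK):
    """Keep only the characters under the 1-positions of the mask."""
    return "".join(c for c, m in zip(window, mask) if m == "1")

def build_index(genome, mask=MASK):
    """Map every spaced k-mer key in the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - len(mask) + 1):
        index[seed_key(genome[pos:pos + len(mask)], mask)].append(pos)
    return index

def seed_hits(read, index, mask=MASK):
    """Candidate genome start positions for the read, from seed matches."""
    hits = set()
    for off in range(len(read) - len(mask) + 1):
        key = seed_key(read[off:off + len(mask)], mask)
        for pos in index.get(key, ()):
            hits.add(pos - off)
    return sorted(hits)

genome = "CCCCCAGGTCATTGACCCCC"
index = build_index(genome)
# The read differs from genome[5:15] only at the two ignored positions.
print(seed_hits("AGGTGGTTGA", index))  # [5]
```

A contiguous 10-mer seed would miss this read; the spaced seed tolerates the two differences because they fall under the 0-positions.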

  10. AB SOLiD: Resultant Alignments • SHRiMP emits letter-space alignments • Clear to biologists • Color-space need not be scary!
      G: 798  GAACCCCTTACAACTGAACCCCTTAC 823
              ||X||||||||||||||||||| |||
      T:      GAaCCCCTTACAACTGAACCCC-TAC
      R:   1 T1211000203110121201000-231 25

  11. Outline • AB SOLiD Reads • 2-pass (SMS) Reads

  12. 2-pass SMS Reads • SMS reads have high error rates • “Dark bases” (skipped letters) • Multiple passes are possible • Ameliorate errors over passes • Good chance of missing base in one read • Acceptable chance of getting it in at least one
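A rough worked example of why two passes help, under two assumptions that are not stated on the slide: each pass skips a given base independently, and the skip probability is $p = 0.07$ (the deletion rate used for the synthetic reads in the results slide).

```latex
\begin{align*}
P(\text{base missing in at least one pass}) &= 1 - (1 - p)^2 = 1 - 0.93^2 = 0.1351 \\
P(\text{base missing in both passes})       &= p^2 = 0.07^2 = 0.0049 \\
P(\text{base seen in at least one pass})    &= 1 - p^2 = 0.9951
\end{align*}
```

Under these assumptions roughly one base in seven is dark in one of the two passes, but fewer than one in two hundred is dark in both.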

  13. Mapping 2-pass Reads • [Figure: the original reads C-GACTTTA, CTGACTTA, and CTGA-T--- shown above the reference genome, with a "?" for where they map]

  14. SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2
      S=9:  CTG-ACT
            CTGCACT
            CAGCA-T

  15. SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2
      S=9:  CTG-ACT   CTGAC-T
            CTGCACT   CTGACAT
            CAGCA-T   CAG-CAT

  16. SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2
      S=9:  CTG-ACT   CTGAC-T
            CTGCACT   CTGACAT
            CAGCA-T   CAG-CAT
      S=8:  C-TG-ACT   CT-GAC-T   C-TGAC-T   CT-GAC-T
            CATGCACT   CTAGACAT   CATGCACT   CTAGACAT
            CA-GCA-T   C-AG-CAT   CA-G-CAT   C-AG-CAT
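The scores on these slides can be reproduced with a plain global alignment of the two reads. The sketch assumes a linear gap penalty and end-to-end (global) alignment, which is consistent with the alignments shown but is not stated on the slides; `global_score` and `score_columns` are names chosen for this example.

```python
def global_score(a, b, match=4, mismatch=-3, gap=-2):
    """Optimal global alignment score of a and b with linear gap costs."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def score_columns(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score an already-written alignment column by column."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

print(global_score("CTGACT", "CAGCAT"))        # 9
print(score_columns("CTG-ACT", "CAGCA-T"))     # 9
print(score_columns("C-TG-ACT", "CA-GCA-T"))   # 8
```

The optimum for the two reads is 9, so the S=8 alignments are suboptimal by one point.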

  17. SMS 2-pass: Near-optimal Alignments • Reads: CTGACT and CAGCAT • Compute a DP matrix • Sum it up with the DP matrix computed in reverse • Scoring: Match = +4, Mismatch = -3, Gap = -2

  18. SMS 2-pass: Near-optimal Alignments • Reads: CTGACT and CAGCAT • Compute a DP matrix • Sum it up with the DP matrix computed in reverse • Leave only near-optimal alignments • Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003) • Scoring: Match = +4, Mismatch = -3, Gap = -2 • [Figure: graph whose nodes are labeled with aligned pairs: CC, AT, GG, TT, AA, and the gapped pairs -T, A-, -A, -C, C-]
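The forward-plus-reverse trick can be sketched as follows: the sum of the two matrices at a cell is the best score of any alignment passing through that cell, so thresholding the sum keeps only near-optimal cells. The sketch assumes global alignment with linear gaps; the `slack` parameter and the function names are made up for the example, and building the graph edges between the kept cells is left out.

```python
def dp(a, b, match=4, mismatch=-3, gap=-2):
    """Global-alignment DP matrix: F[i][j] scores a[:i] against b[:j]."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i and j:
                s = match if a[i - 1] == b[j - 1] else mismatch
                cands.append(F[i - 1][j - 1] + s)
            if i:
                cands.append(F[i - 1][j] + gap)
            if j:
                cands.append(F[i][j - 1] + gap)
            F[i][j] = max(cands)
    return F

def near_optimal_cells(a, b, slack):
    """Cells lying on some alignment scoring within `slack` of the optimum."""
    n, m = len(a), len(b)
    fwd = dp(a, b)
    rev = dp(a[::-1], b[::-1])   # reverse DP: scores of suffix alignments
    best = fwd[n][m]
    cells = {(i, j)
             for i in range(n + 1) for j in range(m + 1)
             if fwd[i][j] + rev[n - i][m - j] >= best - slack}
    return cells, best

optimal, best = near_optimal_cells("CTGACT", "CAGCAT", slack=0)
near, _ = near_optimal_cells("CTGACT", "CAGCAT", slack=1)
print(best, len(optimal), len(near))
```

With `slack=1` the kept cells include those visited only by the S=8 alignments, such as cell (1, 2), which the S=9 alignments never pass through.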

  19. SMS 2-pass: SHRiMP with 2-pass data • Build a DAG representing the (near) optimal alignments of the two reads • Generate seeds (short paths) from the DAG • Do a k-mer scan; if seeds are encountered, align both reads to the location using vectorized S-W • Do full alignment for top hits • [Figure: DAG whose nodes are labeled with aligned pairs: AT, CC, TT, GG, AA, and the gapped pairs A-, -A, -T, -C, C-]

  20. SMS 2-pass: Results (in brief) • 10,000 synthetic reads (~25-65 bp) • 7% deletion, 1% insertion, 1% substitution rate • Mapped to Human chromosome 1 • Spaced seed weight 8: 111101111

  21. SHRiMP Summary • Fast mapping of short reads to a genome • Handles: color-space (SOLiD) reads, 2-pass (SMS) reads, insertions and deletions • Easy to parallelize • Computation of p-values & other statistics for hits

  22. SHRiMP TODO List • Faster Mapping (biggest complaint) • Matepair data support • Transcriptome Data • Suggestions?

  23. Acknowledgements • SHRiMP is brought to you by: • Steve Rumble • Vlad Yanovsky • Adrian Dalca • Marc Fiume • Phil Lacroute • Arend Sidow • University of Toronto, Stanford University • http://compbio.cs.toronto.edu/shrimp
