1 / 33

SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces

Learn how SHRiMP accurately maps short reads in letter and color spaces. This tool uses a multi-phase approach, including fast k-mer scanning and vectorized Smith-Waterman alignment, to efficiently compute statistics and map reads to the target genome.

myronm
Download Presentation

SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces Stephen Rumble, Phil Lacroute, …, Arend Sidow, Michael Brudno

  2. How SHRiMP works: • Stage 1: Map reads to target genome • Stage 2: Compute statistics

  3. Read Mapping • Three phases • Very fast k-mer scan (index reads, scan genome) • Fast, vectorized Smith-Waterman to confirm • Slow, complete backtracking S-W for top ‘n’ hits

  4. Read Mapping: Phase 1 • Create a hash table of size 4^(k-mer length) • 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) • This becomes our kmer to read index AACTGTACCAGTGAG …

  5. Read Mapping: Phase 1 • Create a hash table of size 4^(k-mer length) • 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) • This becomes our kmer to read index AACTGT AACTGTaccagtgag …

  6. Read Mapping: Phase 1 • Create a hash table of size 4^(k-mer length) • 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) • This becomes our kmer to read index AACTGT ACTGTA aACTGTAccagtgag …

  7. Read Mapping: Phase 1 • Create an index of size 4(k-mer length) • 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) • This is our k-mer to read index AACTGT ACTGTA aaCTGTACcagtgag CTGTAC …

  8. Read Mapping: Phase 1 • Create a hash table of size 4^(k-mer length) • 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) • This becomes our kmer to read index AACTGT ACTGTA accTGTACCagtgag CTGTAC TGTACC …

  9. Read Mapping: Phase 1 • Create a hash table of size 4^(k-mer length) • 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) • This becomes our kmer to read index AACTGT Read 7 Read 12 ACTGTA Read 32 Read 13 CTGTAC Read 18 Read 7 TGTACC Read 12 Read 15 …

  10. Read Mapping: Phase 1 • Once we’ve indexed all reads, just scan the genome by k-mer Genome Reads

  11. Read Mapping: Phase 1 • Remember the k-mer hits within a given interval (window) • When sufficient hits, look more closely • “Look more closely” means calculate a fast Smith-Waterman score

  12. Technicalities • We don’t always use full k-mers (q-grams). • We actually support ‘spaced seeds’, but the algorithm doesn’t change much. • For each spaced seed, ‘compress out’ the k-mer and use it as the hash index

  13. Read Mapping: Phase 2 • Smith-Waterman is very expensive • NxM matrix isn’t too big for short reads and windows, but… • We call the vectorized code millions of times • We don’t want a bottleneck – aim for no more than 50% of the total runtime • We only want one score as quickly as possible

  14. Read Mapping: Phase 2 A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

  15. Read Mapping: Phase 2 • Each forward-facing diagonal in S-W matrix depends on: • Small constant # of previous diagonals • Small constant # of scalars • We can compute entire diagonals in parallel • Our speed-up is proportional to the diagonal size

  16. A C T A G A C T T G T G A C C T + ---+ Read Mapping: Phase 2 A C T A G A C T T G T C C A G T Current Previous Penultimate

  17. Read Mapping: Phase 2 • Most commodity processors have vector instructions • Remember the MMX brouhaha? • SIMD – Single Instruction, Multiple Data 4 2 6 12 9 21 + = 8 15 23 7 3 10

  18. A C T A G A C T T G T G A C C T + ---+ Read Mapping: Phase 2 A C T A G A C T T G T C C A G T Current Previous Penultimate

  19. Read Mapping: Phase 2 • Match scores typically use a scoring matrix • ScoringMatrix[SeqA[i]][SeqB[j]] • But this doesn’t scale: Individual cell scores become a bottleneck • Can precompute a ‘query profile’ (expensive), or… • If we only care about strict match/mismatch we can use logical bit-wise operations • SIMD instructions work here (fully parallel)

  20. Read Mapping: Phase 2 • Results: • Our vectorized S-W is as fast, or faster than other very complicated SIMD implementations • 500 million+ matrix cells/second on Core 2 machines • Even with small seeds, S-W accounts for at most half of the total run time

  21. Read Mapping: Phase 3 • Recap: • K-mer scan selects areas of reasonable similarity • Vectorized S-W (dis)confirms similarity • Best ‘n’ hits per read are given a full alignment with backtrace

  22. Read Mapping: Phase 3 • Letter-space alignments are simple: • K-mer scan, Vectorized S-W, Full S-W in letters, give user pretty output • What about AB SOLiD colour-space? • Biologists want to see A,C,G,T, not 0,1,2,3… • Dealing with strange SOLiD properties… • Our solution: • K-mer scan, Vectorized S-W in colour-space • Full S-W in letter-space, but we can’t just convert

  23. AB Di-base Reads • We think in terms of nucleotides: • A, C, G, and T’s. • AB’s NGS machine outputs 4 colours • One colour per pair of bases: A C G T A 0 1 2 3 C 1 0 3 2 T T G A G C G T T C G 2 3 0 1 T 012 233102 T 3 2 1 0

  24. AB Di-base Reads 0 0 A C G T 2 A 0 1 2 3 G A C 1 0 3 2 1 3 3 1 G 2 3 0 1 C T 2 T 3 2 1 0 0 0

  25. SOLiD Translations • Given the following read, there are 4 translations (we need an initial base):

  26. SOLiD Translations • Reads begin with a known primer (‘T’)

  27. SOLiD Translations • What happens if a read error occurs? • The right translation was: T T G A G C G T T C

  28. Colour-space Smith-Waterman • There are four unique translations for every read • An error will cause us to change frames (different translation) • Why not do a S-W across all four letter-space translations with some error penalty?

  29. Colour-space Smith-Waterman Letter • Think of 4 S-W matrices stacked above one another • If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices Genome Read Frame 1 Frame 2 Frame 3 Frame 4

  30. Colour-space Smith-Waterman • End result: G: 1123724 TA-ACCACGGTCACACTTGCATCAC 1123701 || |||||||||| |||X||||||| T: TACACCACGGTCAGACTtGCATCAC R: 0 T0311101130121221211313211 24 Should be ‘0’

  31. Statistics • After reads are mapped, mull over the results • For each read: • P(hit by pure chance – not a valid hit) • P(hit generated by genome – valid hit) • P(hit is best of all for particular read)

  32. Results • Speed • Simple k-mer scan is very fast • Important when seeds are bigger (less S-W) • Vectorized S-W is fast • Important when seeds are smaller (more S-W) • Generally well-balanced run time • Big seeds make k-mer scan the bottleneck (this is good - it’s really fast) • Easily parallelised – just divide the reads over CPUs

  33. Results • C. Savingyi • 22M 25bp reads • 173Mb genome • S-W would take at least a few thousand CPU days • SHRiMP runs in about 50 CPU days with fairly small seeds (length 8, weight 7) • SNP, indel, error rates correspond well to known averages for this organism

More Related