Wrangling Short Read Data with SHRiMP
Stephen M. Rumble
Department of Computer Science, University of Toronto
07/19/08
Handling NGS Data
• NGS: at least 3 distinct read types:
  • Illumina/Solexa, 454: letter-space
  • AB SOLiD: color-space (di-base sequencing)
  • 2-pass SMS (Helicos)
    • 2 reads, same location
    • higher error rates
• Need new algorithms
  • SOLiD: biologists want letters, not colors
  • 2-pass: how best to handle two reads?
SHRiMP Overview
Isolate similarity in stages:
• Spaced Seed Filtering (common)
• Vectorized Smith-Waterman (common)
• Full Alignment
  • Specialized for SOLiD, 2-pass, letter-space
• Compute p-values (and other statistics)
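The first stage can be illustrated with a minimal sketch of spaced-seed filtering. It assumes the reads are indexed and the genome is scanned; the function names are hypothetical and this is not SHRiMP's actual code. The mask 1111001111 is the seed quoted on the SOLiD comparison slide of this talk: positions marked 0 are ignored, so a mismatch there still produces a hit.

```python
from collections import defaultdict

def spaced_kmers(seq, mask):
    """Yield (position, key); key keeps only the letters under a '1' in the mask."""
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    for pos in range(len(seq) - len(mask) + 1):
        window = seq[pos:pos + len(mask)]
        yield pos, "".join(window[i] for i in keep)

def build_index(reads, mask):
    """Map every spaced key found in the reads to (read_id, offset) pairs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset, key in spaced_kmers(read, mask):
            index[key].append((read_id, offset))
    return index

def candidate_hits(genome, index, mask):
    """Scan the genome and report (read_id, genome_start) candidates."""
    hits = set()
    for gpos, key in spaced_kmers(genome, mask):
        for read_id, offset in index.get(key, ()):
            hits.add((read_id, gpos - offset))
    return hits

genome = "ACGTACGTTGCATGCAAGCT"
read = "CGTTACATGC"  # genome[5:15] with one mismatch under a '0' of the mask
hits = candidate_hits(genome, build_index([read], "1111001111"), "1111001111")
```

Only the candidates that survive this filter would be passed on to the more expensive Smith-Waterman stages.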
Outline
• AB SOLiD Reads
• 2-pass (SMS) Reads
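AB SOLiD: Color-space Sequencing
AB SOLiD reads look like this: T012233102, T012033102
[Figure: di-base color code. Color 0: same base (AA, CC, GG, TT); color 1: A-C, G-T; color 2: A-G, C-T; color 3: A-T, C-G]
Translations of the two reads (one differing color, and every letter after it changes):
  T012233102 -> TGAGCGTTC
                |||
  T012033102 -> TGAATACCT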
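The translation from colors to letters can be sketched in a few lines. This assumes the standard di-base code, in which a color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function name is illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def decode(read):
    """Translate a color-space read (primer base followed by colors) into letters."""
    current = CODE[read[0]]          # the primer base fixes the starting point
    letters = []
    for color in read[1:]:
        current ^= int(color)        # each color is the XOR step to the next base
        letters.append(BASE[current])
    return "".join(letters)

first = decode("T012233102")
second = decode("T012033102")
```

Here `first` is TGAGCGTTC. The second read differs from the first in a single color, yet its translation agrees only on the first three letters and differs in every letter after that.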
AB SOLiD: Color space is complex!
INDELS
  TGAGTTA   122103
  TGA-TTA   12-303
  TGAGTTTA  1221003
  TGAGTATA  1221333
SNPs
  TGAGTT  12210
  TGACTT  12120
  TGAATT  12030
  TGATTT  12300
G: TTGAGTTATGGAT  012210331023
R: TTGACTTATGGAT  012120331023
It’s bloody complicated!
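The G/R example above can be checked mechanically. The sketch below encodes both letter sequences into colors (again assuming the XOR form of the di-base code) and lists which color positions differ; the helper names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(letters):
    """Return the colors between consecutive letters (no primer base)."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(letters, letters[1:]))

def changed_positions(a, b):
    """0-based positions at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

genome = "TTGAGTTATGGAT"
read = "TTGACTTATGGAT"   # one SNP relative to the genome: G -> C

genome_colors = encode(genome)
read_colors = encode(read)
diff = changed_positions(genome_colors, read_colors)
```

A single SNP touches exactly two adjacent colors (positions 3 and 4 here), because the changed base takes part in two overlapping di-base pairs. A single changed color, by contrast, looks like a sequencing error rather than a SNP.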
AB SOLiD: Translations
• Look at: 012233102
    TGAGCGTTC
    |||||||||
    TGAGCGTTC
• Recall: 012033102
    TGAGCGTTC
    |||
    TGAATACCT
• 4 translations for every color sequence
[Figure: di-base color code]
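The "4 translations" point can be made concrete with a short sketch: the same colors spell a different letter sequence for each possible starting base. The XOR form of the color code is assumed, and `translations` is an illustrative name.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def translations(colors):
    """Map each possible starting base to the letters the colors spell out."""
    result = {}
    for start in BASE:
        current = CODE[start]
        letters = []
        for color in colors:
            current ^= int(color)
            letters.append(BASE[current])
        result[start] = "".join(letters)
    return result

four = translations("012233102")
```

With starting base T this gives TGAGCGTTC; with starting base A it gives ACTCGCAAG. Any two translations differ at every position, which is why a color error effectively moves the read from one translation to another.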
AB SOLiD: Modified Smith-Waterman
• 4 S-W matrices, one per translation
• Errors transition into other matrix
• ‘Crossover’ penalty charged for errors
[Figure: the genome aligned against the read's translations, shown for Translation A and Translation C]
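A minimal sketch of the four-matrix idea follows. It is not SHRiMP's implementation: the scoring values, the linear gap cost, and the choice to allow crossovers only on diagonal moves are all assumptions made for illustration. Translation k is taken to be the primer-based decoding with every letter XORed by k.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def colorspace_sw(read, genome, match=4, mismatch=-3, gap=-2, crossover=-3):
    """Best local score of a color-space read against a letter-space genome.

    read: primer base followed by colors, e.g. "T0122".
    """
    # Decode with the primer base (translation 0), kept as 2-bit codes.
    letters = []
    current = CODE[read[0]]
    for color in read[1:]:
        current ^= int(color)
        letters.append(current)
    g = [CODE[b] for b in genome]
    n, m = len(letters), len(g)

    # H[k][i][j]: best score of an alignment ending at read letter i and
    # genome letter j, with read letter i taken from translation k.
    H = [[[0] * (m + 1) for _ in range(n + 1)] for _ in range(4)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(4):
                # Diagonal move: may come from any matrix; switching matrices
                # models a color error and is charged the crossover penalty.
                prev = max(
                    H[kp][i - 1][j - 1] + (0 if kp == k else crossover)
                    for kp in range(4)
                )
                sub = match if (letters[i - 1] ^ k) == g[j - 1] else mismatch
                score = max(
                    0,
                    prev + sub,
                    H[k][i - 1][j] + gap,   # read letter aligned to a gap
                    H[k][i][j - 1] + gap,   # genome letter aligned to a gap
                )
                H[k][i][j] = score
                best = max(best, score)
    return best

score = colorspace_sw("T1211000203110121201000231",
                      "GAACCCCTTACAACTGAACCCCTTAC")
```

With these assumed scores, a read carrying one color error and one deleted base still aligns along its full length: 25 matches, one crossover and one gap, for a score of 25·4 − 3 − 2 = 95. A plain letter-space aligner working on a single translation would lose every letter on one side of the color error.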
AB SOLiD: Obligatory Comparison
• SHRiMP and AB Mapper (1.6)
  • SHRiMP seed: 1111001111
  • AB schemas: 35_2, 35_3
• 10,000 35-mers
• C. savignyi (173 Mb), very high polymorphism
• Considering single top hits only
AB SOLiD: Resultant Alignments
• SHRiMP emits letter-space alignments
• Clear to biologists
• Color-space need not be scary!
G: 798  GAACCCCTTACAACTGAACCCCTTAC  823
        ||X||||||||||||||||||| |||
T:      GAaCCCCTTACAACTGAACCCC-TAC
R:   1 T1211000203110121201000-231  25
Outline • AB SOLiD Reads • 2-pass (SMS) Reads
2-pass SMS Reads
• SMS reads have high error rates
  • “Dark bases” (skipped letters)
• Multiple passes are possible
  • Ameliorate errors over passes
  • Good chance of missing a base in one read
  • Acceptable chance of getting it in at least one
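The benefit of a second pass can be put in numbers with a back-of-the-envelope sketch. It assumes that misses are independent between passes, which the talk does not state, and uses the 7% deletion rate from the synthetic-read experiment later in the talk purely as an illustration.

```python
def seen_at_least_once(miss_rate, passes=2):
    """Probability that a base is read in at least one pass.

    Assumes each pass misses the base independently with probability miss_rate.
    """
    return 1.0 - miss_rate ** passes

one_pass = seen_at_least_once(0.07, passes=1)
two_pass = seen_at_least_once(0.07, passes=2)
```

Under that assumption, a base missed 7% of the time in a single pass is missed in both passes only about 0.5% of the time.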
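SMS 2-pass: SHRiMP with 2 reads
Reads: CTGACT and CAGCAT
Scoring: Match = +4, Mismatch = -3, Gap = -2
Alignments with S=9:
  CTG-ACT    CTGAC-T
  CAGCA-T    CAG-CAT
Alignments with S=8:
  C-TG-ACT   CT-GAC-T
  CA-GCA-T   C-AG-CAT
[Figure: graph whose nodes are aligned column pairs (C/C, T/A, G/G, A/A, T/T and gap columns such as T/-, -/A, -/C, C/-, A/-); its paths spell out the alignments above]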
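The scores on this slide can be reproduced with a small helper that scores a gapped alignment column by column, using the slide's values (match +4, mismatch -3, gap -2). The function name is illustrative.

```python
def alignment_score(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score a pairwise alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # one letter aligned to a gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

s1 = alignment_score("CTG-ACT", "CAGCA-T")
s2 = alignment_score("CTGAC-T", "CAG-CAT")
s3 = alignment_score("C-TG-ACT", "CA-GCA-T")
s4 = alignment_score("CT-GAC-T", "C-AG-CAT")
```

The first two alignments each have four matches, one mismatch and two gaps (16 − 3 − 4 = 9); the last two trade the mismatch for two more gaps (16 − 8 = 8).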
SMS 2-pass: SHRiMP with 2-pass data
• Build a DAG representing the (near-)optimal alignments of the two reads
• Generate seeds (short paths) from the DAG
• Do a k-mer scan; when a seed is encountered, align both reads to that location using vectorized SW
• Do a full WSG (weighted sequence graph) alignment for the top hits
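One way to build such a DAG is sketched below. This uses a standard forward/backward dynamic-programming construction, swapped in here for illustration; it is not taken from SHRiMP. A step in the alignment grid is kept if some global alignment through it scores within `delta` of the optimum. The scoring values are those of the two-read example; all names are hypothetical.

```python
def near_optimal_edges(x, y, delta=0, match=4, mismatch=-3, gap=-2):
    """Return (best score, edges of the DAG of alignments within delta of best)."""
    n, m = len(x), len(y)

    def sub(i, j):
        return match if x[i] == y[j] else mismatch

    def steps(i, j):
        """Moves available from grid node (i, j): (next_i, next_j, weight)."""
        out = []
        if i < n and j < m:
            out.append((i + 1, j + 1, sub(i, j)))
        if i < n:
            out.append((i + 1, j, gap))
        if j < m:
            out.append((i, j + 1, gap))
        return out

    # B[i][j]: best score for aligning x[i:] with y[j:] (filled back to front).
    B = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            options = [w + B[ni][nj] for ni, nj, w in steps(i, j)]
            if options:
                B[i][j] = max(options)

    # F[i][j]: best score for aligning x[:i] with y[:j] (filled front to back).
    NEG = float("-inf")
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for ni, nj, w in steps(i, j):
                F[ni][nj] = max(F[ni][nj], F[i][j] + w)

    best = F[n][m]
    edges = [
        ((i, j), (ni, nj))
        for i in range(n + 1)
        for j in range(m + 1)
        for ni, nj, w in steps(i, j)
        if F[i][j] + w + B[ni][nj] >= best - delta
    ]
    return best, edges

best, exact = near_optimal_edges("CTGACT", "CAGCAT", delta=0)
_, near = near_optimal_edges("CTGACT", "CAGCAT", delta=1)
```

For the two example reads the optimum is 9. Setting `delta=1` adds the steps used only by the score-8 alignments, so the graph grows while still containing every optimal path. Short paths through such a graph could then serve as seeds.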
SMS 2-pass: Results (in brief)
• 10,000 synthetic reads (~25-65 bp)
• 7% deletion, 1% insertion, 1% substitution rate
• Mapped to human chromosome 1
• Spaced seed of span 9: 111110111
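Synthetic reads with these error rates could be generated along the following lines. This is a sketch under the assumption of independent per-base errors; the talk does not describe the actual simulator, and the function and parameter names are hypothetical.

```python
import random

def simulate_read(template, del_rate=0.07, ins_rate=0.01, sub_rate=0.01, rng=None):
    """Return a noisy copy of template with independent per-base errors."""
    rng = rng or random.Random()
    out = []
    for base in template:
        if rng.random() < ins_rate:
            out.append(rng.choice("ACGT"))      # spurious inserted base
        r = rng.random()
        if r < del_rate:
            continue                            # "dark base": letter skipped
        if r < del_rate + sub_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(2008)                       # fixed seed for repeatability
template = "CTGACTCAGCATCTGACTCAGCATC"
first_pass = simulate_read(template, rng=rng)
second_pass = simulate_read(template, rng=rng)
```

Calling the function twice on the same template mimics a 2-pass read: both passes cover the same location but lose different bases.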
SHRiMP Summary
• Fast mapping of short reads to a genome
• Handles:
  • color-space (SOLiD) reads
  • 2-pass (SMS) reads
  • insertions and deletions
• Easy to parallelize
• Computation of p-values & other statistics for hits
Acknowledgements
SHRiMP is brought to you by:
• Michael Brudno
• Adrian Dalca
• Marc Fiume
• Vlad Yanovsky
• Phil Lacroute
• Arend Sidow
University of Toronto, Stanford University
http://compbio.cs.toronto.edu/shrimp