GNUMAP-SNP Parallel Pair-HMM SNP Detection Nathan Clement The University of Texas Austin, TX, USA
Outline • Motivation • NGS Issues and Requirements • Pair-HMM • Memory Optimizations • Results • Conclusion
Motivation Mutation Detection: • SNP discovery • HapMap and resequencing • Species Identification • Bisulfite Sequencing • Epigenetic influences • RNA editing
Error Rates* * Data current as of May 2011: Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011
Pair-HMM (Mathematics) • Match • Gap (in both directions)
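The equations from this slide did not survive extraction. As a stand-in, here is a minimal sketch of the forward algorithm for the textbook three-state pair-HMM (one match state and one gap state per direction). It is not necessarily the exact model used in GNUMAP-SNP: the function name, the parameters `delta` (gap open), `epsilon` (gap extend), and `match_prob`, and the uniform gap emission are all assumptions chosen for illustration, and the end state is omitted.

```python
def pair_hmm_forward(x, y, delta=0.1, epsilon=0.3, match_prob=0.9):
    """Sum the probability of all alignments of x and y under a 3-state pair-HMM.

    delta, epsilon, and match_prob are illustrative values, not taken from the talk.
    """
    n, m = len(x), len(y)
    q = 0.25  # gap-state emission: uniform over A, C, G, T (assumption)

    def p(a, b):
        # Match-state joint emission; the 16 base pairs sum to 1.
        return match_prob / 4 if a == b else (1 - match_prob) / 12

    # f_m: match state; f_x: base of x against a gap; f_y: base of y against a gap.
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # start in the match state before any symbol is emitted

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                f_m[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * f_m[i - 1][j - 1]
                    + (1 - epsilon) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                f_x[i][j] = q * (delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q * (delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    return f_m[n][m] + f_x[n][m] + f_y[n][m]
```

Because the forward pass sums over every alignment rather than keeping only the best one, each read position can contribute a probability to each reference base, which is the kind of per-read information the inline SNP caller relies on.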
Why Inline SNP Calling? • Post-Processing • More disk space, less memory • Inline • Requires more memory • Less disk space • Can include specific probabilities for each read
Previous Optimizations • Two methods for speeding up mapping: • Entire genome on one machine • Split memory among different machines • Must normalize across all genome portions • MPI reduction
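The normalization requirement can be illustrated without MPI. The sketch below assumes that each machine holds unnormalized alignment scores for one read against its own genome portion, and that the global sum of those scores is what the MPI reduction supplies. The function name and data layout are hypothetical, and a plain Python sum stands in for the reduction.

```python
def normalize_across_partitions(partition_scores):
    """Normalize one read's alignment scores over every genome portion.

    partition_scores: one list per machine, each holding the unnormalized
    scores of the read against that machine's part of the genome.
    """
    # In a distributed run this global sum is what a sum reduction would
    # compute; here it is simulated in a single process.
    global_total = sum(sum(scores) for scores in partition_scores)
    return [[s / global_total for s in scores] for scores in partition_scores]


# One read, two machines: the first sees two candidate sites, the second one.
scores = [[2.0, 1.0], [1.0]]
posteriors = normalize_across_partitions(scores)
# posteriors == [[0.5, 0.25], [0.25]]
```

Normalizing locally instead would give the first machine 2/3 and 1/3 and the second machine 1.0, overstating every site's weight, which is why the portions cannot be processed fully independently.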
Memory Requirements • Human Genome (3 Gb) • HashMap ≈ 12 GB • 4 bits/character = 1.5 GB • 5 floating point values per base (plus N) = sizeof(float) * 5 * 3 GB = 60 GB • Also stores total for easy computation = sizeof(float) * 3 GB = 12 GB • Total of ≈ 90 GB per run
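The arithmetic on this slide can be reproduced in a few lines. This sketch assumes 4-byte floats and decimal gigabytes, and takes the 12 GB hash figure directly from the slide; the function and constant names are illustrative.

```python
GENOME_BASES = 3_000_000_000  # roughly 3 billion bases in the human genome
SIZEOF_FLOAT = 4              # bytes, assuming 32-bit floats
GB = 1_000_000_000            # decimal gigabytes


def memory_estimate_gb(hash_gb=12.0):
    """Return the per-component memory estimate in GB."""
    sequence = GENOME_BASES * 4 / 8 / GB             # 4 bits per character
    per_base = SIZEOF_FLOAT * 5 * GENOME_BASES / GB  # 5 floats per base
    totals = SIZEOF_FLOAT * GENOME_BASES / GB        # 1 float total per base
    return {
        "hash": hash_gb,
        "sequence": sequence,
        "per_base": per_base,
        "totals": totals,
        "total": hash_gb + sequence + per_base + totals,
    }


est = memory_estimate_gb()
```

The listed components sum to 85.5 GB, which the slide rounds up to about 90 GB. The five per-base floats account for 60 GB of that, so they are the natural target for the discretization schemes.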
Three Memory Optimizations • Normal (no optimization) • Integer discretization • Centroid discretization
Integer Discretization • Only need one floating point value (for total) and 1 byte/nucleotide. • “Parts per 255” • Biggest hit: Going into and out of “integer space”
Integer Discretization • Step 1: Convert from Integer Space • Step 2: Add from r_i (read i) to Genome • Step 3: Convert back to Integer Space
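The three steps above can be sketched as follows, assuming each genome position stores one float total plus one byte per nucleotide channel holding that channel's share in parts per 255. The function names and the five-channel layout are assumptions for illustration, not the GNUMAP-SNP source.

```python
def to_float_space(total, parts):
    """Step 1: expand byte shares (parts per 255) into absolute float values."""
    return [total * p / 255.0 for p in parts]


def to_integer_space(values):
    """Step 3: compress float values into one float total plus byte shares."""
    total = sum(values)
    if total == 0.0:
        return 0.0, [0] * len(values)
    return total, [round(255 * v / total) for v in values]


def add_read(total, parts, contribution):
    """Steps 1 to 3: leave integer space, add a read's contribution, return."""
    values = to_float_space(total, parts)
    values = [v + c for v, c in zip(values, contribution)]  # Step 2
    return to_integer_space(values)


# One genome position with five channels, initially empty.
total, parts = 0.0, [0, 0, 0, 0, 0]
total, parts = add_read(total, parts, [0.75, 0.25, 0.0, 0.0, 0.0])
first = (total, list(parts))   # (1.0, [191, 64, 0, 0, 0])
total, parts = add_read(total, parts, [0.25, 0.75, 0.0, 0.0, 0.0])
second = (total, list(parts))  # total is about 2.0, parts [127, 128, 0, 0, 0]
```

Every update pays for two conversions, which matches the slide's point that moving into and out of integer space is the biggest cost. Each round trip also rounds the shares to the nearest 1/255, so small contributions can be lost.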
Centroid Discretization • Many states not used: • [255, 255, 255, 255, 255] • [0, 0, 0, 0, 0] • Many states not biologically relevant • SNP transition (common) vs. transversion (not likely) • MSA uses this compression to perform fast one-to-many alignment
Centroid Discretization (cont) • Benefits • Doesn’t waste impossible or infrequently used space • Much smaller memory footprint • Drawbacks: • Slight overhead in converting from centroid to floating point spaces • Rounding error (how significant?)
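A minimal sketch of the idea, assuming each genome position stores one float total plus a single index into a small codebook of representative distributions. The codebook below is hand-picked for illustration, the channel order A, C, G, T, N is an assumption, and nearest-centroid lookup by squared Euclidean distance is one plausible choice rather than the method confirmed by the talk.

```python
# Hypothetical codebook over (A, C, G, T, N): pure bases plus the two
# transition heterozygotes (A/G and C/T). Transversion mixes are left out.
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # 0: A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # 1: C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # 2: G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # 3: T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # 4: A/G transition
    (0.0, 0.5, 0.0, 0.5, 0.0),  # 5: C/T transition
]


def nearest_centroid(dist):
    """Return the index of the centroid closest to dist (squared Euclidean)."""
    return min(
        range(len(CENTROIDS)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(CENTROIDS[i], dist)),
    )


def add_read(total, index, contribution):
    """Expand the centroid to floats, add a read, and snap back to a centroid."""
    values = [total * f + c for f, c in zip(CENTROIDS[index], contribution)]
    new_total = sum(values)
    if new_total == 0.0:
        return 0.0, index
    return new_total, nearest_centroid([v / new_total for v in values])


total, index = 0.0, 0
total, index = add_read(total, index, [0.0, 0.0, 1.0, 0.0, 0.0])
after_g = index      # 2: pure G
total, index = add_read(total, index, [1.0, 0.0, 0.0, 0.0, 0.0])
after_a = index      # 4: A/G transition
total, index = add_read(total, index, [0.2, 0.0, 0.0, 0.0, 0.0])
after_small = index  # still 4: the small update is absorbed by rounding
```

The last update shows the rounding drawback from the slide: a contribution too small to move the distribution to a different centroid changes the total but leaves the stored distribution untouched.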
Conclusion • For high error rates, HMM approach is ideal, but requires more memory • Distributing the genome across processors doesn’t scale linearly • Discretization methods provide good memory reductions (up to 42%) • Centroid discretization performs poorly • Integer discretization can be used when available memory is low