
Parallel Pair-HMM SNP Detection

GNUMAP-SNP: Parallel Pair-HMM SNP Detection. Nathan Clement, The University of Texas, Austin, TX, USA. Outline: Motivation; NGS Issues and Requirements; Pair-HMM; Memory Optimizations; Results; Conclusion. Motivation: Mutation Detection: SNP discovery; HapMap and resequencing


Presentation Transcript


  1. GNUMAP-SNP: Parallel Pair-HMM SNP Detection • Nathan Clement • The University of Texas, Austin, TX, USA

  2. Outline • Motivation • NGS Issues and Requirements • Pair-HMM • Memory Optimizations • Results • Conclusion

  3. Motivation Mutation Detection: • SNP discovery • HapMap and resequencing • Species Identification • Bisulfite Sequencing • Epigenetic influences • RNA editing

  4. Error Rates* * Data current as of May 2011: Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011

  5. Pair-HMM

  6. Pair-HMM (Mathematics) • Match • Gap (in both directions)

  7. Pair-HMM (M)

  8. Pair-HMM (X)

  9. Pair-HMM (Y)

  10. Pair-HMM
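The equations on the Pair-HMM slides did not survive in this transcript. As a rough illustration only, the sketch below implements the textbook three-state pair-HMM forward recurrences (match state M, gap states X and Y). The parameter names `delta` (gap open), `eps` (gap extend) and `match`, their default values, and the uniform background `q` are assumptions for the example, not values taken from the talk or from GNUMAP-SNP.

```python
def pair_hmm_forward(x, y, delta=0.1, eps=0.3, match=0.9):
    """Forward likelihood of sequences x and y under a 3-state pair-HMM.

    All parameter values are illustrative assumptions.
    """
    q = 0.25                          # uniform background emission (assumed)
    mismatch = (1.0 - match) / 3.0    # spread the remaining mass over 3 bases
    n, m = len(x), len(y)
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state M
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]  # gap state X (emits x only)
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]  # gap state Y (emits y only)
    f_m[0][0] = 1.0                   # start in the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # joint emission probability of the aligned pair (x_i, y_j)
                p = q * (match if x[i - 1] == y[j - 1] else mismatch)
                f_m[i][j] = p * ((1 - 2 * delta) * f_m[i - 1][j - 1]
                                 + (1 - eps) * (f_x[i - 1][j - 1]
                                                + f_y[i - 1][j - 1]))
            if i > 0:
                f_x[i][j] = q * (delta * f_m[i - 1][j] + eps * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q * (delta * f_m[i][j - 1] + eps * f_y[i][j - 1])
    # sum over the three states at the end of both sequences
    return f_m[n][m] + f_x[n][m] + f_y[n][m]
```

Because the forward algorithm sums over all alignments rather than keeping only the best one, a read that agrees with the reference receives a much higher likelihood than one that does not, while gapped alignments still contribute some probability mass.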

  11. Expected Results

  12. Why Inline SNP Calling? • Post-Processing • More disk space, less memory • Inline • Requires more memory • Less disk space • Can include specific probabilities for each read

  13. Previous Optimizations • Two methods for speeding up mapping: • Entire genome on one machine • Split memory among different machines • Must normalize across all genome portions • MPI reduction
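One way to read "must normalize across all genome portions" is that a read's likelihood at each candidate location has to be divided by the sum over every portion of the genome, not just the local one. The sketch below simulates that in plain Python; the function name and data layout are hypothetical, and the final `sum` stands in for what would be a sum-reduction across MPI ranks in a distributed run.

```python
def normalize_across_portions(local_scores):
    """Normalize one read's alignment likelihoods over all genome portions.

    local_scores: one list per genome portion, holding that portion's raw
    likelihoods for the read (hypothetical layout, for illustration only).
    """
    # Each worker can compute its own partial sum independently.
    local_sums = [sum(scores) for scores in local_scores]
    # Stand-in for an MPI sum-reduction over all workers.
    total = sum(local_sums)
    if total == 0.0:
        return [[0.0 for _ in scores] for scores in local_scores]
    # Every worker divides by the same global total.
    return [[s / total for s in scores] for scores in local_scores]
```

Without the shared total, each portion would normalize only against its own hits, and a read that maps weakly to one portion but strongly to another would be overcounted.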

  14. Previous Optimizations

  15. Memory Requirements • Human Genome (3 Gbp) • HashMap ≈ 12 GB • 4 bits/character = 1.5 GB • 5 floating point values per base (A, C, G, T plus N) = sizeof(float) * 5 * 3 Gbp = 60 GB • Also stores total for easy computation = sizeof(float) * 3 Gbp = 12 GB • Total of ≈ 90 GB per run
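The arithmetic on this slide can be checked with a few lines of code. The sketch assumes a 4-byte `float`, decimal gigabytes, and takes the 12 GB hash map figure directly from the slide; the function and key names are made up for the example.

```python
GENOME_BASES = 3_000_000_000  # human genome, roughly 3 Gbp
FLOAT_BYTES = 4               # assumes a 4-byte float
GB = 1e9                      # decimal gigabytes


def memory_budget_gb(bases=GENOME_BASES, hash_map_gb=12.0):
    """Return the per-component memory estimate in GB."""
    budget = {
        "hash_map": hash_map_gb,                       # figure from the slide
        "packed_genome": bases * 4 / 8 / GB,           # 4 bits per character
        "per_base_floats": FLOAT_BYTES * 5 * bases / GB,  # A, C, G, T, N
        "per_base_total": FLOAT_BYTES * bases / GB,    # one running total
    }
    budget["total"] = sum(budget.values())
    return budget
```

The components sum to 85.5 GB, which the slide rounds up to roughly 90 GB. The 60 GB of per-base floats dominates, which is why the discretization schemes target that term.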

  16. Three Memory Optimizations • Normal (no optimization) • Integer discretization • Centroid discretization

  17. Integer Discretization • Only need one floating point value (for total) and 1 byte/nucleotide. • “Parts per 255” • Biggest hit: Going into and out of “integer space”

  18. Integer Discretization • Step 1: Convert from Integer Space • Step 2: Add from read r_i to Genome • Step 3: Convert back to Integer Space
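The three steps can be sketched as follows, assuming each genome position stores one floating point total plus five bytes holding each nucleotide's share in "parts per 255". The function names and the choice to define the total as the sum of the five values are assumptions for illustration, not details confirmed by the slides.

```python
def to_float_space(counts, total):
    """Step 1: expand byte shares (parts per 255) into float values."""
    return [c / 255.0 * total for c in counts]


def to_integer_space(values):
    """Step 3: compress float values to byte shares plus one float total."""
    total = sum(values)
    if total == 0.0:
        return [0] * len(values), 0.0
    return [round(v / total * 255) for v in values], total


def add_read(counts, total, contribution):
    """Step 2, wrapped by steps 1 and 3: add one read's per-nucleotide
    contribution (5 floats, assumed order A, C, G, T, N) to a position."""
    values = to_float_space(counts, total)
    values = [v + c for v, c in zip(values, contribution)]
    return to_integer_space(values)
```

Every update pays for a round trip through float space, which matches the "biggest hit" noted on the previous slide, and each round trip can lose up to half a part in 255 per nucleotide to rounding.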

  19. Centroid Discretization • Many states not used: • [255, 255, 255, 255, 255] • [0, 0, 0, 0, 0] • Many states not biologically relevant • SNP transition (common) vs. transversion (not likely) • MSA uses this compression to perform fast one-to-many alignment

  20. Centroid Discretization (cont)

  21. Centroid Discretization (cont) • Benefits • Doesn’t waste impossible or infrequently used space • Much smaller memory footprint • Drawbacks: • Slight overhead in converting from centroid to floating point spaces • Rounding error (how significant?)
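A minimal sketch of the centroid idea: instead of five bytes per position, store a single index into a small codebook of plausible nucleotide distributions and snap each position to its nearest entry. The codebook contents, the A, C, G, T, N ordering, and the squared Euclidean distance are all assumptions for the example; the slides do not say how the centroids were chosen.

```python
# Hypothetical codebook: proportions in the assumed order A, C, G, T, N.
CODEBOOK = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # 0: homozygous A
    [0.0, 1.0, 0.0, 0.0, 0.0],  # 1: homozygous C
    [0.0, 0.0, 1.0, 0.0, 0.0],  # 2: homozygous G
    [0.0, 0.0, 0.0, 1.0, 0.0],  # 3: homozygous T
    [0.0, 0.0, 0.0, 0.0, 1.0],  # 4: unknown (N)
    [0.5, 0.0, 0.5, 0.0, 0.0],  # 5: heterozygous A/G (transition)
    [0.0, 0.5, 0.0, 0.5, 0.0],  # 6: heterozygous C/T (transition)
]


def nearest_centroid(proportions, codebook=CODEBOOK):
    """Return the index of the codebook entry closest to `proportions`."""
    def sq_dist(k):
        return sum((a - b) ** 2 for a, b in zip(proportions, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)


def from_centroid(index, codebook=CODEBOOK):
    """Expand a stored index back into floating point proportions."""
    return list(codebook[index])
```

With at most 256 centroids the index fits in one byte per position. The rounding error the slide asks about is the distance between the true proportions and the chosen centroid, so it depends entirely on how well the codebook covers the states that actually occur.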

  22. Speed Comparison

  23. Optimization Stats (chrX)

  24. Conclusion • For high error rates, HMM approach is ideal, but requires more memory • Distributing the genome across processors doesn’t scale linearly • Discretization methods provide good memory reductions (up to 42%) • Centroid discretization performs poorly • Integer discretization can be used when available memory is low

  25. Questions
