Design Goals

zalika

Presentation Transcript


  1. Design Goals

  2. Crash Course: Reference-guided Assembly

  3. Crash Course: Reference-guided Assembly

  4. Crash Course: Reference-guided Assembly

  5. Sequencing Technologies future

  6. Next-Gen Sequence Lengths

  7. Mixing It Up: Paired-end Reads

  8. How Does It Work?

  9. How Does It Work?

  10. C. elegans: a case for INDELs
      • Speed: 100 million Illumina reads
      • Alignment time: 93 min (17,800 reads/s)
      • Assembly time: 100 min
      • INDEL validation rate: 89.3% (216)
      • SNP validation rate: 97.8% (229)

  11. P. stipitis: Co-assembly (Capillary, 454 FLX, 454 GS20, Illumina)

  12. Scaling Up: M. musculus, H. sapiens, D. melanogaster, C. elegans, P. stipitis, H. sapiens ENCODE region, H. sapiens CAPON region, M. musculus mtDNA

  13. Performance: Aligners

  14. Aligners: Feature Set

  15. Performance: Aligner

  16. Performance: Aligner. Using the P. stipitis (15.4 Mbp) 454 FLX data set: 932,565 reads base-called by PyroBayes†.
      † Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

  17. Accuracy: Synthetic Data Sets • 1 million • 1 per 1.3 kb • 1 per 7.2 kb • H. sapiens X chromosome

  18. Accuracy: Classification

  19. Accuracy: Unique Read Alignment

  20. Reasons to use MOSAIK? “One tool, many technologies, many applications”
      • Fast
      • Accurate
      • Multiprocessor (OpenMP)
      • Co-assemblies
      • Gapped alignments
      • Widely used

  21. (Near) Future Development
      • All technologies
      • Pacific Biosciences
      • Helicos
      • All application areas
      • Adapter trimming
      • Coverage graphs
      • Optimization
      • Improved paired-end read support
      • File format standardization (SAF & SRF)

  22. 1000 Genomes Project
      • Many samples with light coverage (1000 dg)
      • 100 samples from 10 populations at 2x coverage
      • Find 90% of the 1% frequency variants per population
      • Trios with moderate coverage (990 dg)
      • 30 trios at 11x coverage
      • If you’re looking for SNPs, are your tools and methods robust?

  23. Scaling Up: Disk Footprint
      • Current situation: files created by MOSAIK are not optimized for speed or size
      • Assembly can take a long time (slow disk speed)
      • Hypothetical solution
      • Optimize the file formats
      • Ditch the built-in index
      • Keep data sorted by aligned location

  24. Scaling Up: Disk Footprint

  25. Scaling Up: Memory Footprint
      • Current situation: the entire human genome is stored with all associated hash locations
      • Optimized hash table ≈ 55 GB RAM
      • File-based hash table (BerkeleyDB)
      • User selects how much RAM to use
      • Dreadfully slow performance
      • Large disk footprint ≈ 65 GB file

  26. Scaling Up: Memory Footprint

  27. Scaling Up: Memory Footprint

  28. Scaling Up: Speed & Sensitivity
      • Current situation: speed increases as the hash size increases, but sensitivity decreases
      • Hypothetical solution: use small hash sizes and require a cluster of hash hits of a predefined length
      • Status: implemented but not tested

  29. BORK! BORK! BORK! (translated: when will MOSAIK get published?)

  30. Acknowledgements
      • Boston College: Gabor Marth, Derek Barnett, Michele Busby, Weichun Huang, Aaron Quinlan, Chip Stewart, Thomas Seyfried, Mike Kiebish
      • Washington University School of Medicine: Elaine Mardis, Jarret Glasscock, Vincent Magrini
      • Agencourt: Douglas Smith, Wei Tao
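
Slide 28 proposes pairing small hash sizes with a required cluster of hits. The idea can be illustrated with a toy sketch. This is not MOSAIK's implementation: the function names `build_index` and `candidate_regions`, the parameter `min_cluster_len`, and the choice to group hits by ungapped diagonal are all assumptions made for illustration. The sketch indexes every k-mer of a reference, looks up each k-mer of a read, and keeps only reference offsets where contiguous hits cover at least `min_cluster_len` read bases.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map each k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_regions(read, index, k, min_cluster_len):
    """Return sorted (reference_offset, covered_bases) pairs.

    Hits are grouped by diagonal (reference position minus read position),
    so every hit on one diagonal is consistent with one ungapped placement.
    A diagonal is kept only if its longest run of overlapping or adjacent
    k-mer hits covers at least min_cluster_len read bases.
    """
    diagonals = defaultdict(list)
    for read_pos in range(len(read) - k + 1):
        kmer = read[read_pos:read_pos + k]
        for ref_pos in index.get(kmer, ()):
            diagonals[ref_pos - read_pos].append(read_pos)

    regions = []
    for diagonal, read_positions in diagonals.items():
        read_positions.sort()
        best = 0
        run_start = prev = read_positions[0]
        for pos in read_positions[1:]:
            if pos > prev + k:
                # The previous hit ends before this one starts: close the run.
                best = max(best, prev + k - run_start)
                run_start = pos
            prev = pos
        best = max(best, prev + k - run_start)
        if best >= min_cluster_len:
            regions.append((diagonal, best))
    return sorted(regions)
```

In this toy model, a larger `k` means fewer lookups and fewer spurious hits, but a single mismatch destroys every k-mer that overlaps it, so a long `k` can miss a read entirely. A small `k` still produces hits on either side of the mismatch, and the cluster-length requirement filters out the extra random hits that a small `k` brings.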
