Sequencing Technologies
C. elegans: a case for INDELs
Speed:
• 100 million Illumina reads
• Alignment time: 93 min (17,800 reads/s)
• Assembly time: 100 min
Indels:
• INDEL validation rate: 89.3% (216)
• SNP validation rate: 97.8% (229)
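As a quick sanity check on the throughput figure (my own back-of-the-envelope arithmetic, not from the slides), 100 million reads aligned in 93 minutes works out to roughly the quoted rate:

```python
# Back-of-the-envelope check of the quoted alignment throughput.
reads = 100_000_000            # 100 million Illumina reads
alignment_minutes = 93

reads_per_second = reads / (alignment_minutes * 60)
print(f"{reads_per_second:,.0f} reads/s")  # ~17,921 reads/s, consistent with the quoted ~17,800
```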
P. stipitis: Co-assembly
• Capillary
• 454 FLX
• 454 GS20
• Illumina
Scaling Up
• M. musculus
• H. sapiens
• D. melanogaster
• C. elegans
• P. stipitis
• H. sapiens ENCODE region
• H. sapiens CAPON region
• M. musculus mtDNA
Performance: Aligner
Using the P. stipitis (15.4 Mbp) 454 FLX data set: 932,565 reads basecalled by PyroBayes†.
† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)
Accuracy: Synthetic Data Sets
• 1 million simulated reads
• Variant rates: 1 per 1.3 kb and 1 per 7.2 kb
• Source: H. sapiens X chromosome
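For readers unfamiliar with this kind of benchmark, here is a toy sketch of how planted-variant test sets are commonly built (my own illustration; the authors' simulation pipeline is not described on the slide): mutate a reference at a fixed rate, then check whether the aligner recovers the planted variants.

```python
import random

def plant_snps(reference: str, rate_bp: int, seed: int = 0):
    """Introduce, on average, one SNP per `rate_bp` bases; return the
    mutated sequence and the positions of the planted variants."""
    rng = random.Random(seed)
    bases, planted = list(reference), []
    for pos, base in enumerate(bases):
        if rng.random() < 1 / rate_bp:
            bases[pos] = rng.choice([b for b in "ACGT" if b != base])
            planted.append(pos)
    return "".join(bases), planted

# ~1 SNP per 1.3 kb, the denser of the two rates quoted on the slide.
mutated, snps = plant_snps("ACGT" * 5_000, rate_bp=1_300)
print(len(snps), "SNPs planted in", len(mutated), "bp")
```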
Reasons to use MOSAIK?
“One tool, many technologies, many applications”
• Fast
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used
(Near) Future Development
• All technologies
  • Pacific Biosciences
  • Helicos
• All application areas
• Adapter trimming
• Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)
1000 Genomes Project
• Many samples with light coverage (1000 dg)
  • 100 samples from 10 populations at 2x coverage
  • Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
  • 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?
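The trio numbers check out if "dg" is read as diploid-genome equivalents of sequence coverage (my interpretation of the abbreviation):

```python
# 30 trios x 3 individuals x 11x coverage per individual.
trios, members, depth = 30, 3, 11
print(trios * members * depth, "genome-equivalents")  # 990, matching "(990 dg)"
```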
Scaling Up: Disk Footprint
• Current situation: files created by MOSAIK are not optimized for speed or size
  • Assembly can take a long time (slow disk speed)
• Hypothetical solution
  • Optimize the file formats
  • Ditch the built-in index
  • Keep data sorted by aligned location (see the sketch after this list)
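A minimal sketch of the "sorted by aligned location" idea (the record layout and file name here are hypothetical, not MOSAIK's actual format): sorting by (reference, position) lets consumers stream the file sequentially instead of consulting a separate index.

```python
import struct
from operator import itemgetter

# Hypothetical alignment records: (reference_index, position, read_name).
alignments = [
    (0, 1_500_200, "read_42"),
    (0,       317, "read_07"),
    (1,    88_412, "read_19"),
]

# Coordinate sort: downstream tools can then scan the file in one pass.
alignments.sort(key=itemgetter(0, 1))

with open("alignments.bin", "wb") as out:
    for ref, pos, name in alignments:
        encoded = name.encode()
        # Fixed-width header (ref index, position, name length), then the name.
        out.write(struct.pack("<IIB", ref, pos, len(encoded)) + encoded)
```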
Scaling Up: Memory Footprint
• Current situation: the entire human genome is stored with all associated hash locations (toy illustration below)
  • Optimized hash table ≈ 55 GB RAM
• File-based hash table (BerkeleyDB)
  • User selects how much RAM to use
  • Dreadfully slow performance
  • Large disk footprint ≈ 65 GB file
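To see why the in-memory table is so large, here is a toy k-mer index (an illustration only; MOSAIK's real hashing scheme is not detailed on these slides). Every genome position contributes an entry, so the ~3.1 billion positions of the human genome push even an optimized table into the tens of gigabytes.

```python
from collections import defaultdict

def build_hash_table(reference: str, hash_size: int) -> dict:
    """Map each k-mer of length `hash_size` to every position where it occurs."""
    table = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        table[reference[pos:pos + hash_size]].append(pos)
    return table

demo = build_hash_table("ACGTACGTGGTA", hash_size=4)
print(demo["ACGT"])  # [0, 4] -- one list entry per genome position
```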
Scaling Up: Speed & Sensitivity
• Current situation: as the hash size increases, speed increases but sensitivity decreases
• Hypothetical solution: use small hash sizes and require a clustering of seed hits spanning a predefined length (sketched below)
• Status: implemented but not tested
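A sketch of what such a clustering filter could look like (hypothetical; the slide only states the idea): small seeds keep sensitivity high, and full gapped alignment is attempted only where consecutive seed hits span at least a predefined length.

```python
def passes_cluster_filter(seed_hits, hash_size, min_cluster_span):
    """seed_hits: sorted reference positions of a read's seed matches.
    Return True if a run of nearby hits covers >= min_cluster_span bases."""
    if not seed_hits:
        return False
    run_start = prev = seed_hits[0]
    for pos in seed_hits[1:]:
        if pos - prev > hash_size:              # a gap breaks the cluster
            run_start = pos
        prev = pos
        if prev + hash_size - run_start >= min_cluster_span:
            return True
    return prev + hash_size - run_start >= min_cluster_span

# Isolated random hits fail the span test; genuine matches pass.
print(passes_cluster_filter([100, 104, 108, 112], hash_size=8, min_cluster_span=20))  # True
print(passes_cluster_filter([100, 900], hash_size=8, min_cluster_span=20))            # False
```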
BORK! BORK! BORK! (translated: when will MOSAIK get published?)
Acknowledgements
Boston College: Gabor Marth, Derek Barnett, Michele Busby, Weichun Huang, Aaron Quinlan, Chip Stewart, Thomas Seyfried, Mike Kiebish
Washington University School of Medicine: Elaine Mardis, Jarret Glasscock, Vincent Magrini
Agencourt: Douglas Smith, Wei Tao