Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data. Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang. The Ohio State University. HiCOMB 2014, May 19th, Phoenix, Arizona. Outline. Introduction Sequence Data Format Converter Design Experimental Results
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University HiCOMB 2014 May 19th, Phoenix, Arizona
Outline Introduction Sequence Data Format Converter Design Experimental Results Conclusion
Explosion of Next-Generation Sequencing Data • NGS Advantages • Faster and cheaper • E.g., over one billion short reads per instrument run • More accurate: higher resolution and deeper coverage • Challenges • Urgent need for turning raw data into knowledge • Parallelism is the key
Historical Trends in Storage Prices vs. DNA Sequencing Costs (reported by Lincoln Stein)
Varieties of NGS Data Formats • Different Formats • SAM (Sequence Alignment/Map) • The de facto text format for storing large nucleotide sequence alignments • BAM (Binary Alignment/Map) • The compressed, indexable, binary form of the SAM format • Indexing is supported by the BAI (BAM Index) file • Other formats • BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.
Analysis Pipeline • Current Pipeline • Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST • Reality • Cross-utilization problem: the format of the sequencing data often differs from the input format a tool expects • Some other analysis steps remain sequential • These remaining sequential bottlenecks need to be removed
Motivation: Removing Other Sequential Bottlenecks • Parallel Format Conversion • Current format conversion commonly uses a single core • Current downstream tools often cannot be interchanged between different aligners • Not hard to implement, but important to scale out • Parallelizing Certain Statistical Analysis Steps • E.g., parallel analysis on the histogram data
Framework (only the first component is discussed today) • Sequence Data Format Converter • Input: SAM/BAM • Output: • BAM/SAM • FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML • Statistical Analysis Module • Parallelizes other statistical analysis steps • E.g., non-local means (NL-Means) and false discovery rate (FDR) computation
Outline Introduction Sequence Data Format Converter Design Experimental Results Conclusion
Sequence Data Format Converter • 3 Converter Instances • SAM Format Converter • BAM Format Converter • Preprocessing-Optimized SAM Format Converter • Supports partial format conversion on a specific chromosome region
SAM Format Converter • No communication among processes after partitioning • Partitioning is the key step for parallelization • Extensibility and programmability
Partitioning Algorithm • Key: each SAM record is delimited by a line break • Initial even partitioning • Adjust partition boundaries by detecting line breaks
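Because each process converts its own records without talking to the others, the per-record work can be a plain function. The sketch below is an illustration of one such conversion (SAM to BED), not the converter from the talk; the function name `sam_to_bed` and the choice of MAPQ as the BED score are assumptions made for the example. It relies on two facts from the format definitions: SAM positions are 1-based, while BED intervals are 0-based and half-open.

```python
import re

# CIGAR operations as defined by the SAM format; M, D, N, = and X consume reference bases.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def sam_to_bed(line):
    """Convert one SAM alignment line to a BED6 line (hypothetical helper).

    Returns None for header lines and unmapped reads.
    """
    if line.startswith("@"):
        return None  # SAM header line, not an alignment record
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped read or missing alignment information
    # Length of the alignment on the reference, derived from the CIGAR string.
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1  # SAM is 1-based, BED is 0-based
    strand = "-" if flag & 0x10 else "+"
    # Assumption: MAPQ is reused as the BED score column.
    return "\t".join([rname, str(start), str(start + ref_len), qname, mapq, strand])
```

Supporting another output format would then amount to swapping in a different per-record function, which is one way to read the "extensibility and programmability" point.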
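The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the talk: the function name `partition_by_lines` is hypothetical, and the file is assumed to use `\n` as the record delimiter.

```python
import os


def partition_by_lines(path, num_parts):
    """Split a line-delimited file into byte ranges that start on record boundaries.

    Returns a list of (start, end) half-open byte ranges, one per partition.
    """
    size = os.path.getsize(path)
    # Step 1: initial even partitioning by byte offset.
    tentative = [size * i // num_parts for i in range(1, num_parts)]
    boundaries = [0]
    with open(path, "rb") as f:
        for offset in tentative:
            if offset <= boundaries[-1]:
                # A long record already swallowed this boundary: empty partition.
                boundaries.append(boundaries[-1])
                continue
            # Step 2: move the boundary forward to the next record start.
            # Reading from offset - 1 leaves the boundary unchanged when the
            # previous byte is already a line break.
            f.seek(offset - 1)
            f.readline()
            boundaries.append(f.tell())
    boundaries.append(size)
    return [(boundaries[i], boundaries[i + 1]) for i in range(num_parts)]
```

Each process can then seek to its own `start` and read up to `end` with no further coordination, which matches the "no communication after partitioning" property of the SAM converter.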
BAM Format Converter • Cannot be parallelized because of the third-party API • Challenge • No explicit delimiter: • Even partitioning -> unparsable records • Solution: add a preprocessing phase • Partition data by supporting random access
BAMX and BAIX • BAMX (BAM eXtended) File • Transform each varying-length BAM record into a regular-layout BAMX record • Align varying-length BAM fields by padding • BAIX (BAI eXtended) File • Index file of the BAMX file • Store the alignment starting positions in BAM (logically) and in BAMX (physically)
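The padding idea can be illustrated with a small sketch. The actual BAMX layout is not given on the slide, so everything here is an assumption for illustration: the record has only three fields (`pos`, `name`, `seq`), variable-length fields are padded with null bytes to the longest value seen, and the helper names are hypothetical. The point is that once every record has the same size, record `i` sits at byte offset `i * record_size` and can be read directly.

```python
import struct


def build_layout(records):
    """Choose a fixed layout wide enough for the longest name and sequence."""
    name_len = max(len(r["name"]) for r in records)
    seq_len = max(len(r["seq"]) for r in records)
    # "<" disables alignment padding; the "s" fields are null-padded on pack.
    return struct.Struct(f"<i{name_len}s{seq_len}s")


def pack_records(records, layout):
    """Serialize varying-length records into fixed-size, padded records."""
    return b"".join(
        layout.pack(r["pos"], r["name"].encode(), r["seq"].encode())
        for r in records
    )


def read_record(blob, layout, index):
    """Random access: jump straight to record `index` without scanning."""
    pos, name, seq = layout.unpack_from(blob, index * layout.size)
    return {
        "pos": pos,
        "name": name.rstrip(b"\0").decode(),
        "seq": seq.rstrip(b"\0").decode(),
    }
```

The trade-off is extra space for the padding in exchange for trivial even partitioning: any range of record indices maps to a contiguous, parsable byte range.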
Partial Conversion • If only interested in a subset, no need for full conversion • Based on the BAIX file • Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search) • Evenly partition the subset and proceed in parallel
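A minimal sketch of the lookup and split, assuming the index for one chromosome is available in memory as a sorted list of alignment starting positions, where list index `i` corresponds to the `i`-th fixed-size record in the BAMX file. The function names and this in-memory representation are assumptions for illustration, not the BAIX format itself.

```python
from bisect import bisect_left, bisect_right


def locate_subset(start_positions, region_start, region_end):
    """Binary-search the sorted start positions for a region.

    Returns a half-open range (first, last) of record indices whose
    alignment start lies in [region_start, region_end].
    """
    first = bisect_left(start_positions, region_start)
    last = bisect_right(start_positions, region_end)
    return first, last


def split_evenly(first, last, num_parts):
    """Evenly divide the record range [first, last) among num_parts processes."""
    count = last - first
    return [
        (first + count * i // num_parts, first + count * (i + 1) // num_parts)
        for i in range(num_parts)
    ]
```

With fixed-size records, multiplying each record index by the record size gives the physical byte offsets each process should read.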
Preprocessing-Optimized SAM Format Converter (Figure labels: M procs, N procs, M × N target files) • Main Ideas • Preprocessing can also optimize the SAM format conversion • Such preprocessing can be parallelized because of the easy partitioning on the SAM format
Outline Introduction Sequence Data Format Converter Design Parallelization of Statistical Analysis Steps Experimental Results Conclusion
Experimental Setup • Dataset • Whole genome DNA-sequencing of three mouse samples • Approximately 125 million sequences providing about 40-fold coverage of the genome • In the SAM/BAM format • Cluster • 8 GB Memory • Up to 32 8-core machines (256 cores in total)
Performance of SAM Format Converter • Input: 100 GB SAM data • Output: BED, BEDGRAPH and FASTA
Performance of BAM Format Converter • Input: 117 GB BAM data • Output: BED, BEDGRAPH and FASTA
SAM Format Converter Comparison: Preprocessing-Optimized vs. Original • Input: 15.7 GB BAM data • Output: BED, BEDGRAPH and FASTA
Outline Introduction Sequence Data Format Converter Design Parallelization of Statistical Analysis Steps Experimental Results Conclusion
Conclusion • In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed • The first framework that can easily support parallel sequence format conversion in a distributed environment • SAM format converter • BAM format converter • Preprocessing-optimized SAM format converter