Cluster-based SNP Calling on Large Scale Genome Sequencing Data Mucahid Kutlu, Gagan Agrawal Department of Computer Science and Engineering, The Ohio State University CCGrid 2014, Chicago, IL
What is an SNP? • Stands for Single-Nucleotide Polymorphism • A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species. • Essential for medical research and for developing personalized medicine. • A single SNP may cause a Mendelian disease. *Adapted from Wikipedia
Motivation • Sequencing costs are decreasing rapidly *Adapted from genome.gov/sequencingcosts
Motivation • A big data problem • The 1000 Genomes Project has already produced 200 TB of data • Parallel processing is inevitable! *Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Outline • Motivation • Parallel SNP Calling • Proposed Scheduling Schemes • Experiments • Conclusion
General Idea of SNP Calling Algorithms • Two main observations: • To detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location. • The existence of an SNP at one location is independent of the others.
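As an illustrative sketch (not VarScan's actual logic), the per-location check implied by the first observation can be written as a small function: gather the bases aligned to one location across all samples and call an SNP when the non-reference fraction reaches a threshold. The parameter `min_var_freq` and its value are assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical per-location SNP check: `bases` holds the n bases
 * aligned to one genome location, pooled across ALL samples.
 * Returns 1 if the fraction of non-reference bases reaches
 * min_var_freq (an illustrative threshold, not VarScan's). */
int is_snp(char ref, const char *bases, int n, double min_var_freq)
{
    int alt = 0;
    for (int i = 0; i < n; i++)
        if (bases[i] != ref)
            alt++;
    return n > 0 && (double)alt / (double)n >= min_var_freq;
}
```

Because each location's decision depends only on its own pooled bases, calls at different locations are independent, which is exactly what makes location-based parallelization possible.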
Parallel SNP Calling • How to distribute data among nodes? • Location-based: each process handles a range of genome locations across all samples • Sample-based: each process handles a subset of the genome files; requires communication among processes, since an SNP call at a location needs the alignments from all samples • Checkerboard: a combination of both
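A minimal sketch of the location-based division: split the genome's G locations into P contiguous regions, one per process, with the remainder spread over the first few ranks. This is a generic block partition, not the paper's exact scheme.

```c
/* Location-based division sketch: split locations [0, G) into P
 * contiguous regions; the first G % P ranks get one extra location. */
void region_for_proc(long G, int P, int rank, long *start, long *len)
{
    long base = G / P, extra = G % P;
    *len = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}
```

Uniform blocks like this are the baseline that the scheduling schemes in the next section improve on, since equal numbers of locations do not mean equal amounts of work.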
Challenges • Load imbalance due to the nature of genomic data • Coverage variance: the data is not just an array of A, G, C and T characters • I/O contention • High overhead of random access to a particular region
Histogram Showing Coverage Variance • Chromosome: 1 • Locations: 1-200M • Number of samples: 256 • Interval size: 1M
Outline • Motivation • Parallel SNP Calling • Proposed Scheduling Schemes • Experiments • Conclusion
Proposed Scheduling Schemes • Dynamic Scheduling • Static Scheduling • Combined Scheduling • Each scheduling scheme uses location-based data division: the genome is divided into regions, and each task is responsible for a region.
Dynamic Scheduling • Master & worker approach • Tasks are assigned dynamically • Two types of data chunks are used • Big chunk: covers B locations • Small chunk: covers S locations (B > S) • Big chunks are assigned first, then small chunks
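The master's big-then-small assignment can be sketched as a chunk generator: hand out chunks of B locations while plenty of work remains, then switch to chunks of S locations for the tail so the last assignments are small enough to balance well. The switch point `tail` is an illustrative parameter, not one from the paper.

```c
/* Sketch of the master's chunk generator for dynamic scheduling.
 * *pos tracks the next unassigned location in [0, G).
 * Returns 1 and fills *c with the next chunk, or 0 when done. */
typedef struct { long start, len; } chunk_t;

int next_chunk(long *pos, long G, long B, long S, long tail, chunk_t *c)
{
    if (*pos >= G) return 0;                /* all work assigned */
    long len = (G - *pos > tail) ? B : S;   /* big first, small at the end */
    if (len > G - *pos) len = G - *pos;     /* clamp the final chunk */
    c->start = *pos;
    c->len = len;
    *pos += len;
    return 1;
}
```

In the full scheme this generator runs on the master, which sends each chunk to the next idle worker (e.g. over MPI) instead of returning it locally.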
Static Scheduling • Tasks are scheduled statically; no master-worker approach • Pre-processing step: we count the number of alignments in each region and generate a histogram • Estimated cost: we use an estimation function together with the histogram for data partitioning • k: histogram interval size • TR: cost of accessing/reading a region • TP: cost of processing one alignment • N(l): number of alignments at location l • Estimated cost of a region R: Cost(R) = TR + TP * Σ_{l ∈ R} N(l) • Each task is responsible for regions having the same estimated cost
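The slide's cost model can be turned into a simple static partitioner: evaluate the cost of each histogram interval, then cut the genome into P regions of roughly equal estimated cost. The greedy cut rule below is a sketch under that cost model, not the paper's exact partitioner.

```c
#include <assert.h>

/* Cost of one histogram interval under the slide's model:
 * fixed access cost TR plus TP per alignment in the interval. */
double interval_cost(double TR, double TP, long n_alignments)
{
    return TR + TP * (double)n_alignments;
}

/* Greedy equal-cost partition sketch: walk the k histogram intervals
 * and cut a new region whenever the accumulated cost reaches
 * total / P. cuts[i] is the (exclusive) end interval of region i;
 * cuts[P-1] == k. */
void partition(const long *hist, int k, double TR, double TP,
               int P, int *cuts)
{
    double total = 0;
    for (int i = 0; i < k; i++)
        total += interval_cost(TR, TP, hist[i]);
    double target = total / P, acc = 0;
    int r = 0;
    for (int i = 0; i < k && r < P - 1; i++) {
        acc += interval_cost(TR, TP, hist[i]);
        if (acc >= target) { cuts[r++] = i + 1; acc = 0; }
    }
    while (r < P) cuts[r++] = k;   /* last region takes the rest */
}
```

Because the cuts are computed once in the pre-processing step, every process can derive its own region without any master, which is what makes the scheme static.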
Combined Scheduling • Combination of static and dynamic scheduling • Master-worker approach • Small and big chunks are used, as in dynamic scheduling • The sizes of the chunks are determined according to the histogram
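One way to read "chunk sizes determined by the histogram" is to size each chunk so it covers roughly the same number of alignments: dense regions get short chunks, sparse regions get long ones. The target A and interval granularity below are illustrative assumptions.

```c
/* Combined-scheduling sketch: starting at histogram interval start_iv,
 * extend the chunk until it covers roughly A alignments (illustrative
 * target), so per-chunk work is balanced despite coverage variance.
 * Returns the chunk length in histogram intervals. */
long chunk_len(const long *hist, int k, int start_iv, long A)
{
    long acc = 0;
    int i = start_iv;
    while (i < k && acc < A)
        acc += hist[i++];
    return i - start_iv;
}
```

The master would then hand these variable-sized chunks out dynamically, as in the dynamic scheme, combining the histogram's load estimate with runtime assignment.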
Parameters of Scheduling Schemes • Our proposed scheduling schemes have user-defined parameters • Dynamic Scheduling • Lengths of big and small chunks • Static Scheduling • Histogram interval size • Estimation function parameters • Combined Scheduling • All parameters of dynamic and static scheduling • All parameters can be determined with an offline training phase
Outline • Motivation • Parallel SNP Calling • Proposed Scheduling Schemes • Experiments • Conclusion
Experiments • Local cluster; each node has • 2 quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM • We obtained genomes of 256 samples from the 1000 Genomes Project • The data is replicated to all local disks unless noted otherwise • Parallel implementation: • We implemented VarScan in the C programming language • We also modified VarScan so that BAM files can be read directly • Used the MPI library for parallelization
Experiments: Scalability • First 192M locations of Chr. 1
Experiments: Data Size Impact • 128 cores are allocated
Experiments: I/O Contention Impact • 128 cores are allocated
Comparison with Hadoop • First 192M locations of Chr. 2 in 512 samples are analyzed • Lower (dark) portions of the bars show pre-processing time
Scheduling With Replication • Data-intensive processing motivates new schemes • Replicate each chunk a fixed/variable number of times • Dynamic scheduling while processing only local chunks • Interesting new tradeoffs • Under submission
Other Work • PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014) • Mappers and reducers are executable programs • Allows us to exploit existing applications • No restriction on programming language
PAGE vs. State-of-the-Art • A middleware system specific to parallel genomic data processing • Allows parallelization of a variety of genomic algorithms • Works with different popular genomic data formats • Allows use of existing programs
Conclusion • We have developed a methodology for parallel identification of variants in large-scale genome sequencing data • Coverage variance and I/O contention are the two main problems • We proposed three scheduling schemes • Combined scheduling gives the best results • Our approach has good speedup and outperforms Hadoop