Cluster-based SNP Calling on Large Scale Genome Sequencing Data Mucahid Kutlu, Gagan Agrawal Department of Computer Science and Engineering, The Ohio State University CCGrid 2014, Chicago, IL
What is an SNP? • Stands for Single-Nucleotide Polymorphism • A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species. • Essential for medical research and for developing personalized medicine. • A single SNP may cause a Mendelian disease. *Adapted from Wikipedia
Motivation • Sequencing costs are decreasing rapidly *Adapted from genome.gov/sequencingcosts
Motivation • A big data problem • The 1000 Genomes Project has already produced 200 TB of data • Parallel processing is inevitable! *Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Outline • Motivation • Parallel SNP Calling • Proposed Scheduling Schemes • Experiments • Conclusion
General Idea of SNP Calling Algorithms • Two main observations: • To detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location. • The existence of an SNP at one location is independent of the others.
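As an illustrative sketch (not VarScan's actual logic), the per-location check implied by the first observation can be written as a small function: gather the bases aligned to one location across all samples and call an SNP when the non-reference fraction reaches a threshold. The parameter `min_var_freq` and its value are assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical per-location SNP check: `bases` holds the n bases
 * aligned to one genome location, pooled across ALL samples.
 * Returns 1 if the fraction of non-reference bases reaches
 * min_var_freq (an illustrative threshold, not VarScan's). */
int is_snp(char ref, const char *bases, int n, double min_var_freq)
{
    int alt = 0;
    for (int i = 0; i < n; i++)
        if (bases[i] != ref)
            alt++;
    return n > 0 && (double)alt / (double)n >= min_var_freq;
}
```

Because each location's decision depends only on its own pooled bases, calls at different locations are independent, which is exactly what makes location-based parallelization possible.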
Parallel SNP Calling • How to distribute data among nodes? • Location-based: each process handles a range of genome locations across all samples • Sample-based: each process handles a subset of the genome files; requires communication among processes, since an SNP call at a location needs the alignments from all samples • Checkerboard: a combination of both
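A minimal sketch of the location-based division: split the genome's G locations into P contiguous regions, one per process, with the remainder spread over the first few ranks. This is a generic block partition, not the paper's exact scheme.

```c
/* Location-based division sketch: split locations [0, G) into P
 * contiguous regions; the first G % P ranks get one extra location. */
void region_for_proc(long G, int P, int rank, long *start, long *len)
{
    long base = G / P, extra = G % P;
    *len = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}
```

Uniform blocks like this are the baseline that the scheduling schemes in the next section improve on, since equal numbers of locations do not mean equal amounts of work.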
Challenges • Load imbalance due to the nature of genomic data • Coverage variance: the data is not just an array of A, G, C and T characters • I/O contention • High overhead of random access to a particular region
Histogram Showing Coverage Variance • Chromosome: 1 • Locations: 1-200M • Number of samples: 256 • Interval size: 1M
Outline • Motivation • Parallel SNP Calling • Proposed Scheduling Schemes • Experiments • Conclusion
Proposed Scheduling Schemes • Dynamic Scheduling • Static Scheduling • Combined Scheduling • Each scheduling scheme uses location-based data division: the genome is divided into regions, and each task is responsible for a region.
Dynamic Scheduling • Master & worker approach • Tasks are assigned dynamically • Two types of data chunks are used • Big chunk: covers B locations • Small chunk: covers S locations (B > S) • Big chunks are assigned first, then small chunks
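The master's big-then-small assignment can be sketched as a chunk generator: hand out chunks of B locations while plenty of work remains, then switch to chunks of S locations for the tail so the last assignments are small enough to balance well. The switch point `tail` is an illustrative parameter, not one from the paper.

```c
/* Sketch of the master's chunk generator for dynamic scheduling.
 * *pos tracks the next unassigned location in [0, G).
 * Returns 1 and fills *c with the next chunk, or 0 when done. */
typedef struct { long start, len; } chunk_t;

int next_chunk(long *pos, long G, long B, long S, long tail, chunk_t *c)
{
    if (*pos >= G) return 0;                /* all work assigned */
    long len = (G - *pos > tail) ? B : S;   /* big first, small at the end */
    if (len > G - *pos) len = G - *pos;     /* clamp the final chunk */
    c->start = *pos;
    c->len = len;
    *pos += len;
    return 1;
}
```

In the full scheme this generator runs on the master, which sends each chunk to the next idle worker (e.g. over MPI) instead of returning it locally.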
Static Scheduling • Tasks are scheduled statically; no master-worker approach • Pre-processing step: we count the number of alignments in each region and generate a histogram • Estimated cost: we use an estimation function together with the histogram for data partitioning • k: histogram interval size • TR: cost of accessing/reading a region • TP: cost of processing one alignment • N(l): number of alignments at location l • Estimated cost of a region R: Cost(R) = TR + TP * Σ_{l ∈ R} N(l) • Each task is responsible for regions having the same estimated cost
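The slide's cost model can be turned into a simple static partitioner: evaluate the cost of each histogram interval, then cut the genome into P regions of roughly equal estimated cost. The greedy cut rule below is a sketch under that cost model, not the paper's exact partitioner.

```c
#include <assert.h>

/* Cost of one histogram interval under the slide's model:
 * fixed access cost TR plus TP per alignment in the interval. */
double interval_cost(double TR, double TP, long n_alignments)
{
    return TR + TP * (double)n_alignments;
}

/* Greedy equal-cost partition sketch: walk the k histogram intervals
 * and cut a new region whenever the accumulated cost reaches
 * total / P. cuts[i] is the (exclusive) end interval of region i;
 * cuts[P-1] == k. */
void partition(const long *hist, int k, double TR, double TP,
               int P, int *cuts)
{
    double total = 0;
    for (int i = 0; i < k; i++)
        total += interval_cost(TR, TP, hist[i]);
    double target = total / P, acc = 0;
    int r = 0;
    for (int i = 0; i < k && r < P - 1; i++) {
        acc += interval_cost(TR, TP, hist[i]);
        if (acc >= target) { cuts[r++] = i + 1; acc = 0; }
    }
    while (r < P) cuts[r++] = k;   /* last region takes the rest */
}
```

Because the cuts are computed once in the pre-processing step, every process can derive its own region without any master, which is what makes the scheme static.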
Combined Scheduling • Combination of static and dynamic scheduling • Master-worker approach • Small and big chunks are used, as in dynamic scheduling • The sizes of the chunks are determined according to the histogram
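One way to read "chunk sizes determined by the histogram" is to size each chunk so it covers roughly the same number of alignments: dense regions get short chunks, sparse regions get long ones. The target A and interval granularity below are illustrative assumptions.

```c
/* Combined-scheduling sketch: starting at histogram interval start_iv,
 * extend the chunk until it covers roughly A alignments (illustrative
 * target), so per-chunk work is balanced despite coverage variance.
 * Returns the chunk length in histogram intervals. */
long chunk_len(const long *hist, int k, int start_iv, long A)
{
    long acc = 0;
    int i = start_iv;
    while (i < k && acc < A)
        acc += hist[i++];
    return i - start_iv;
}
```

The master would then hand these variable-sized chunks out dynamically, as in the dynamic scheme, combining the histogram's load estimate with runtime assignment.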
Parameters of Scheduling Schemes • Our proposed scheduling schemes have user-defined parameters • Dynamic Scheduling • Lengths of big and small chunks • Static Scheduling • Histogram interval size • Estimation function parameters • Combined Scheduling • All parameters of dynamic and static scheduling • All parameters can be determined with an offline training phase
Outline • Motivation • Parallel SNP Calling • Proposed Scheduling Schemes • Experiments • Conclusion
Experiments • Local cluster; each node has • 2 quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM • We obtained genomes of 256 samples from the 1000 Genomes Project • The data is replicated to all local disks unless noted otherwise • Parallel implementation: • We implemented VarScan in the C programming language • We also modified VarScan so that BAM files can be read directly • Used the MPI library for parallelization
Experiments: Scalability • First 192M locations of Chr. 1
Experiments: Data Size Impact • 128 cores are allocated
Experiments: I/O Contention Impact • 128 cores are allocated
Comparison with Hadoop • First 192M locations of Chr. 2 in 512 samples are analyzed • Lower (dark) portions of the bars show pre-processing time
Scheduling With Replication • Data-intensive processing motivates new schemes • Replicate each chunk a fixed/variable number of times • Dynamic scheduling while processing only local chunks • Interesting new tradeoffs • Under submission
Other Work • PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014) • Mappers and reducers are executable programs • Allows us to exploit existing applications • No restriction on programming language
PAGE vs. State-of-the-Art • A middleware system specific to parallel genomic data processing • Allows parallelization of a variety of genomic algorithms • Works with different popular genomic data formats • Allows use of existing programs
Conclusion • We have developed a methodology for parallel identification of variants in large-scale genome sequencing data • Coverage variance and I/O contention are the two main problems • We proposed three scheduling schemes • Combined scheduling gives the best results • Our approach has good speedup and outperforms Hadoop