PAGE: A Framework for Easy Parallelization of Genomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
IPDPS 2014, Phoenix, Arizona
Motivation
• Sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
Motivation
• Big data problem
• The 1000 Genomes Project has already produced 200 TB of data
• Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
• A single SNP may cause a Mendelian disease!
[Figure: Alignment File-1 and Alignment File-2. *Adapted from Wikipedia]
Outline
• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion
Existing Solutions for Implementation
• Serial tools
  • SamTools, VCFTools, BedTools: file merging, sorting etc.
  • VarScan: SNP calling
• Parallel implementations
  • Turboblast: searching local alignments
  • SEAL: read mapping and duplicate removal
  • Biodoop: statistical analysis
• Middleware systems
  • Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  • Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools
Our Goal
• We want to develop a middleware system that
  • Is specific to parallel genetic data processing
  • Allows parallelization of a variety of algorithms for genetic data
  • Works with the different popular genetic data formats
  • Allows the use of existing programs
Challenges
• Load imbalance due to the nature of genomic data
  • It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention
[Figure: coverage variance]
Our Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  • Allows us to exploit existing applications
  • No restriction on programming language
Intra-dependent Processing
• Each file is processed independently
[Figure: each input file (File-1 … File-m) is split into Region-1 … Region-n; each region is handled by a Map task (outputs O-11 … O-1n for File-1, O-m1 … O-mn for File-m), and a Reduce step per file produces Output-1 … Output-m]
Inter-dependent Processing
• Each map task processes a particular region of ALL files
[Figure: the input files are split into Region-1 … Region-n; each region is handled by a Map task (outputs O1 … On), and a single Reduce step produces one Output]
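The difference between the two execution models comes down to how map tasks are formed. A minimal Python sketch, assuming plain strings for files and regions; the function names `intra_dependent_tasks` and `inter_dependent_tasks` are hypothetical and are not part of PAGE's interface:

```python
from itertools import product


def intra_dependent_tasks(files, regions):
    """One map task per (file, region) pair; each file gets its own reduce."""
    return [(f, r) for f, r in product(files, regions)]


def inter_dependent_tasks(files, regions):
    """One map task per region; every task sees ALL input files."""
    return [(tuple(files), r) for r in regions]


files = ["File-1", "File-2"]
regions = ["Region-1", "Region-2", "Region-3"]

intra = intra_dependent_tasks(files, regions)
inter = inter_dependent_tasks(files, regions)
print(len(intra))  # 6 map tasks, reduced into 2 outputs (one per file)
print(len(inter))  # 3 map tasks, reduced into 1 output
```

With m files and n regions, the intra-dependent model yields m × n map tasks and m outputs, while the inter-dependent model yields n map tasks and a single output.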
What Can PAGE Parallelize?
• PAGE can parallelize all applications that have the following property
  • M is the map task
  • R, R1 and R2 are three regions such that R is the concatenation of R1 and R2
  • M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
[Figure: region R split into R1 and R2]
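The property M(R) = M(R1) ⊕ M(R2) can be checked on a toy case. The sketch below is an illustration only: the map task (counting bases) and the reduction (adding counts) are stand-ins chosen for simplicity, not applications from the talk, and regions are modeled as plain strings.

```python
from collections import Counter


def map_task(region):
    """Toy map task M: count each nucleotide base in a region (a string)."""
    return Counter(region)


def reduce_fn(a, b):
    """Toy reduction function (the ⊕ operator): add the per-base counts."""
    return a + b


r1 = "ACGTAC"
r2 = "GGTA"
r = r1 + r2  # R is the concatenation of R1 and R2

whole = map_task(r)
combined = reduce_fn(map_task(r1), map_task(r2))
print(whole == combined)  # True
```

An application whose result on a region depends on data outside that region would fail this check at the split point, which is why the property matters for choosing region boundaries.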
Data Partitioning
• Data is NOT packaged into equal-size data blocks as in Hadoop
  • Each application has a different way of reading the data
  • Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions, and each map task is assigned to a region
  • Takes location information into account
  • The map task is responsible for accessing its particular region of the input files
  • Region-based access is a common feature of many genomic tools (GATK, SamTools)
Genome Partition
• PAGE provides two data partitioning methods
  • By-locus partitioning: Chromosomes are divided into regions
  • By-chromosome partitioning: Chromosomes preserve their unity
[Figure: Chr-1 to Chr-6 shown under each partitioning method]
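The two partitioning methods can be sketched as follows. This is a minimal illustration under stated assumptions: regions are written as `chrom:start-end` strings (the form samtools accepts), by-locus regions are cut to a fixed size, and the function names and chromosome lengths are hypothetical. The slides do not say how PAGE picks region boundaries.

```python
def by_chromosome(chrom_lengths):
    """By-chromosome partitioning: each chromosome is a single region."""
    return [f"{name}:1-{length}" for name, length in chrom_lengths.items()]


def by_locus(chrom_lengths, region_size):
    """By-locus partitioning: cut each chromosome into fixed-size regions."""
    regions = []
    for name, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


# Hypothetical toy lengths, far shorter than real chromosomes.
chrom_lengths = {"chr1": 2500, "chr2": 1000}

print(by_chromosome(chrom_lengths))
# ['chr1:1-2500', 'chr2:1-1000']
print(by_locus(chrom_lengths, region_size=1000))
# ['chr1:1-1000', 'chr1:1001-2000', 'chr1:2001-2500', 'chr2:1-1000']
```

By-locus partitioning produces more, smaller tasks, which gives the scheduler more room to balance load at the cost of more per-task overhead.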
Task Scheduling
• PAGE provides two types of scheduling schemes
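The sample application slide names the two schemes as Static and Dynamic. How PAGE implements them is not described here, so the sketch below rests on assumptions: static scheduling is modeled as round-robin assignment before execution, dynamic scheduling as an idle worker taking the next unprocessed task, and the per-task costs are made up.

```python
import heapq


def static_schedule(task_costs, n_workers):
    """Static: tasks are assigned round-robin before execution starts."""
    loads = [0] * n_workers
    for i, cost in enumerate(task_costs):
        loads[i % n_workers] += cost
    return max(loads)  # makespan = finish time of the busiest worker


def dynamic_schedule(task_costs, n_workers):
    """Dynamic: whichever worker becomes idle first takes the next task."""
    finish_times = [0] * n_workers  # min-heap of worker finish times
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical per-region costs; uneven coverage makes some regions expensive.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_schedule(costs, 2))   # 18
print(dynamic_schedule(costs, 2))  # 11
```

In this toy case both expensive regions land on the same worker under round-robin assignment, while the dynamic scheme spreads the work evenly. Static assignment, in turn, needs no coordination at run time.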
Applications Developed Using PAGE
• We parallelized 4 applications
  • VarScan: SNP detection
  • RealignerTargetCreator: Detects insertions/deletions in alignment files
  • IndelRealigner: Applies local realignment to improve the quality of alignment files
  • Unified Genotyper: SNP detection
Sample Application Development with PAGE
• Serial execution command of the VarScan software
  • samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  • Genome Partition: By-Locus
  • Scheduling Scheme: Dynamic (or Static)
  • Execution Model: Inter-dependent
  • Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
  • Reduction: cat bash shell command
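To make the map and reduction steps concrete, the sketch below builds the per-region shell commands as strings. It only illustrates what PAGE would have to generate from the settings above. The helper names, the output file naming and `snps.out` are hypothetical, and nothing is executed.

```python
def map_command(region, file_list="file_list", reference="reference"):
    """Build the per-region map command from the slide as a shell string."""
    output = f"output_{region.replace(':', '_')}"  # hypothetical naming
    cmd = (
        f"samtools mpileup -b {file_list} -r {region} -f {reference}"
        f" | java -jar VarScan.jar mpileup2snp > {output}"
    )
    return cmd, output


def reduce_command(outputs, final="snps.out"):
    """Build the reduction step: concatenate the map outputs with cat."""
    return f"cat {' '.join(outputs)} > {final}"


regions = ["chr1:1-1000", "chr1:1001-2000"]  # hypothetical regions
commands, outputs = zip(*(map_command(r) for r in regions))
for cmd in commands:
    print(cmd)
print(reduce_command(outputs))
```

Each map command differs from the serial command only in the `-r` region argument and the output redirection, which is why an existing tool can be reused without modification.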
Experiments
• Experimental setup
  • In our cluster, each node has
    • 12 GB memory
    • 8 cores (2.53 GHz)
  • We obtained the data from the 1000 Genomes Project
  • We evaluated PAGE with 4 applications
  • We compared PAGE with Hadoop Streaming and GATK
Comparison with GATK: IndelRealigner tool of GATK
[Figure panels: Scalability; Data Size Impact. Labels: 3.3x, 9x; Data Size: 11 GB; # of cores: 128]
Comparison with GATK: Unified Genotyper tool of GATK
[Figure panels: Scalability; Data Size Impact. Labels: 12.8x, 10.9x; Data Size: 34 GB; # of cores: 128]
Comparison with Hadoop Streaming: VarScan application
[Figure panels: Scalability; Data Size Impact. Labels: 12.7x, 6.9x; Data Size: 52 GB; # of cores: 128]
Summary of Experimental Results
[Table: results when the computing power is increased by 16 times]
Conclusion
• We developed a middleware that
  • Easily parallelizes genomic applications
  • Has high applicability
    • No restriction on programming language or data format
    • Allows the use of existing applications
  • Lets the user control the parallel execution while hiding the details
    • Alternative scheduling schemes, execution models and data partitioning types
  • Shows good scalability
Thank you for listening …
Questions