PAGE: A Framework for Easy Parallelization of Genomic Applications

Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
IPDPS 2014, Phoenix, Arizona

Motivation
• The sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
IPDPS'14

Motivation
• Big data problem
• The 1000 Genomes Project has already produced 200 TB of data
• Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html

Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
• A single SNP may cause a Mendelian disease!
[Figure: SNP calling across Alignment File-1 and Alignment File-2. *Adapted from Wikipedia]

Outline
• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion

Existing Solutions for Implementation
• Serial tools
  • SamTools, VCFTools, BedTools: file merging, sorting, etc.
  • VarScan: SNP calling
• Parallel implementations
  • Turboblast: searching local alignments
  • SEAL: read mapping and duplicate removal
  • Biodoop: statistical analysis
• Middleware systems
  • Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  • Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools

Our Goal
• We want to develop a middleware system that is
  • Specific to parallel genetic data processing
  • Able to parallelize a variety of genetic algorithms
  • Able to work with different popular genetic data formats
  • Able to use existing programs

Challenges
• Load imbalance due to the nature of genomic data
• It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention
[Figure: coverage variance]

Our Work
• PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  • Allows us to exploit existing applications
  • No restriction on programming language

Intra-dependent Processing
• Each file is processed independently
[Figure: each input file (File-1 … File-m) is split into Region-1 … Region-n; one Map task per region produces O-i1 … O-in, and a Reduce step combines them into Output-i]

Inter-dependent Processing
• Each map task processes a particular region of ALL files
[Figure: the input files are split into Region-1 … Region-n; the Map task for Region-k produces Ok, and a single Reduce step combines O1 … On into one Output]
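The difference between the two execution models can be sketched as the set of map tasks each one generates. This is only an illustration under assumptions: the function names, file names and region strings below are hypothetical and are not part of PAGE.

```python
from typing import List, Tuple

# A task is (files read by the task, region label). Hypothetical representation.
Task = Tuple[Tuple[str, ...], str]

def intra_dependent_tasks(files: List[str], regions: List[str]) -> List[Task]:
    """One map task per (file, region) pair; each file gets its own reduce."""
    return [((f,), r) for f in files for r in regions]

def inter_dependent_tasks(files: List[str], regions: List[str]) -> List[Task]:
    """One map task per region; each task reads that region from ALL files."""
    return [(tuple(files), r) for r in regions]

files = ["file1.bam", "file2.bam"]                       # hypothetical inputs
regions = ["chr1:1-1000", "chr1:1001-2000", "chr2:1-500"]

intra = intra_dependent_tasks(files, regions)
inter = inter_dependent_tasks(files, regions)
print(len(intra), "intra-dependent map tasks")
print(len(inter), "inter-dependent map tasks")
```

With m files and n regions, the intra-dependent model yields m × n map tasks and m outputs, while the inter-dependent model yields n map tasks and one output.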

What Can PAGE Parallelize?
• PAGE can parallelize all applications that have the following property
  • M is a map task
  • R, R1 and R2 are three regions such that R is the concatenation of R1 and R2
  • M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
[Figure: region R split into R1 and R2]
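The property M(R) = M(R1) ⊕ M(R2) can be checked on a toy example. In this sketch the map task counts bases in a region and the reduction merges the counts; both are stand-ins chosen for illustration, not applications from the talk.

```python
from collections import Counter

def map_task(sequence: str, start: int, end: int) -> Counter:
    """Toy map task M: count the bases in the half-open region [start, end)."""
    return Counter(sequence[start:end])

def reduce_fn(a: Counter, b: Counter) -> Counter:
    """Toy reduction function: merge two partial base counts."""
    return a + b

genome = "ACGTACGGTCA"  # made-up sequence

# R is the whole sequence; R1 and R2 are its two consecutive halves.
whole = map_task(genome, 0, len(genome))
split = reduce_fn(map_task(genome, 0, 5), map_task(genome, 5, len(genome)))

print(whole == split)
```

An application whose result for a region depends on data outside that region (for example, a statistic that needs a global sort) would not satisfy this property without extra work.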

Data Partitioning
• Data is NOT packaged into equal-size data blocks as in Hadoop
  • Each application has a different way of reading the data
  • Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions and each map task is assigned to a region
  • Takes location information into account
  • The map task is responsible for accessing its particular region of the input files
  • Region-based access is a common feature of many genomic tools (GATK, SamTools)

Genome Partition
• PAGE provides two data partitioning methods
  • By-locus partitioning: chromosomes are divided into regions
  • By-chromosome partitioning: chromosomes preserve their unity
[Figure: Chr-1 … Chr-6 under by-locus and by-chromosome partitioning]
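The two partitioning methods can be sketched as follows. This is an illustration under assumptions: the slides do not say how by-locus region boundaries are chosen, so a fixed region size is assumed here, and the chromosome lengths and function names are made up.

```python
from typing import Dict, List, Tuple

# (chromosome, 1-based start, inclusive end); hypothetical representation.
Region = Tuple[str, int, int]

def by_chromosome(chrom_lengths: Dict[str, int]) -> List[Region]:
    """Each chromosome stays whole and forms exactly one region."""
    return [(chrom, 1, length) for chrom, length in chrom_lengths.items()]

def by_locus(chrom_lengths: Dict[str, int], region_size: int) -> List[Region]:
    """Each chromosome is cut into regions of at most region_size bases
    (assumed fixed-size cutting; regions never span two chromosomes)."""
    regions: List[Region] = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append((chrom, start, end))
    return regions

lengths = {"chr1": 2500, "chr2": 1000}  # toy lengths, not real chromosomes

chrom_regions = by_chromosome(lengths)
locus_regions = by_locus(lengths, region_size=1000)
for region in locus_regions:
    print(region)
```

By-locus partitioning produces more, smaller tasks, which gives a scheduler more room to balance load; by-chromosome partitioning suits tools that must see a chromosome as a whole.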

Task Scheduling
• PAGE provides two types of scheduling schemes: static and dynamic
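The transcript does not preserve the details of the scheduling schemes, so the sketch below rests on a common interpretation and should be read as an assumption: a static scheme fixes the task-to-worker assignment before execution (round-robin here), while a dynamic scheme lets idle workers pull the next task from a shared queue. All names are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List

def static_schedule(tasks: List[str], n_workers: int) -> Dict[int, List[str]]:
    """Assumed static scheme: round-robin assignment fixed before execution."""
    return {w: tasks[w::n_workers] for w in range(n_workers)}

def dynamic_schedule(tasks: List[str], n_workers: int,
                     run: Callable[[str], None]) -> Dict[int, List[str]]:
    """Assumed dynamic scheme: idle workers pull tasks from a shared queue."""
    pending: "queue.Queue[str]" = queue.Queue()
    for task in tasks:
        pending.put(task)
    done: Dict[int, List[str]] = {w: [] for w in range(n_workers)}

    def worker(w: int) -> None:
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            run(task)
            done[w].append(task)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

tasks = [f"region-{i}" for i in range(7)]
static = static_schedule(tasks, 3)
dynamic = dynamic_schedule(tasks, 3, run=lambda task: None)
print(static)
```

Under this reading, the dynamic scheme absorbs the load imbalance caused by coverage variance at the price of some coordination overhead, while the static scheme has no runtime coordination but can leave workers idle.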

Applications Developed Using PAGE
• We parallelized 4 applications
  • VarScan: SNP detection
  • RealignerTargetCreator: detects insertions/deletions in alignment files
  • IndelRealigner: applies local realignment to improve the quality of alignment files
  • Unified Genotyper: SNP detection

Sample Application Development with PAGE
• Serial execution command of the VarScan software
  • samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  • Genome partition: By-Locus
  • Scheduling scheme: Dynamic (or Static)
  • Execution model: Inter-dependent
  • Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
  • Reduction: cat bash shell command
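The VarScan setup above can be sketched as a generator of per-region map commands plus one `cat` reduction. This is an illustration only: the commands are built as strings and printed, not executed, and the region strings, output file names and the final redirect to `all.snp` are hypothetical. How PAGE itself substitutes the region and output placeholders is not shown in the slides.

```python
from typing import List

def map_command(file_list: str, reference: str, region: str, output: str) -> str:
    """Build the map command from the slide for one region (not executed)."""
    return (
        f"samtools mpileup -b {file_list} -r {region} -f {reference} "
        f"| java -jar VarScan.jar mpileup2snp > {output}"
    )

def reduce_command(outputs: List[str], final_output: str) -> str:
    """Build the cat-based reduction; final_output is a hypothetical name."""
    return "cat " + " ".join(outputs) + f" > {final_output}"

# Hypothetical by-locus regions in samtools chr:start-end syntax.
regions = ["chr1:1-1000000", "chr1:1000001-2000000"]
outputs = [f"out_{i}.snp" for i in range(len(regions))]

map_cmds = [
    map_command("file_list", "reference", region, output)
    for region, output in zip(regions, outputs)
]
reduce_cmd = reduce_command(outputs, "all.snp")

for cmd in map_cmds + [reduce_cmd]:
    print(cmd)
```

Because the execution model is inter-dependent, every map command reads the same `file_list` and differs only in its region and output file.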

Experiments
• Experimental setup
  • In our cluster, each node has
    • 12 GB memory
    • 8 cores (2.53 GHz)
  • We obtained the data from the 1000 Genomes Project
  • We evaluated PAGE with 4 applications
  • We compared PAGE with Hadoop Streaming and GATK

Comparison with GATK: IndelRealigner tool of GATK
[Figure: two panels, "Scalability" (data size: 11 GB) and "Data Size Impact" (# of cores: 128), annotated with 3.3x and 9x]

Comparison with GATK: Unified Genotyper tool of GATK
[Figure: two panels, "Scalability" (data size: 34 GB) and "Data Size Impact" (# of cores: 128), annotated with 12.8x and 10.9x]

Comparison with Hadoop Streaming: VarScan Application
[Figure: two panels, "Scalability" (data size: 52 GB) and "Data Size Impact" (# of cores: 128), annotated with 12.7x and 6.9x]

Summary of Experimental Results
• When the computing power is increased by 16 times

Conclusion
• We developed a middleware that
  • Easily parallelizes genomic applications
  • Has high applicability
    • No restriction on programming language or data format
    • Allows use of existing applications
  • Lets the user control the parallel execution while hiding the details
    • Alternative scheduling schemes, execution models and data partitioning types
  • Has good scalability

Thank you for listening …
Questions?
