PAGE: A Framework for Easy Parallelization of Genomic Applications

Mucahid Kutlu, Gagan Agrawal. Department of Computer Science and Engineering, The Ohio State University. IPDPS 2014, Phoenix, Arizona.

Presentation Transcript


  1. PAGE: A Framework for Easy Parallelization of Genomic Applications • Mucahid Kutlu, Gagan Agrawal • Department of Computer Science and Engineering, The Ohio State University • IPDPS 2014, Phoenix, Arizona

  2. Motivation • Sequencing costs are decreasing • [Figure adapted from genome.gov/sequencingcosts] IPDPS'14

  3. Motivation • Big data problem • The 1000 Genomes Project has already produced 200 TB of data • Parallel processing is inevitable! • [Figure adapted from https://www.nlm.nih.gov/about/2015CJ.html]

  4. Typical Analysis on Genomic Data • Single Nucleotide Polymorphism (SNP) calling • A single SNP may cause a Mendelian disease! • [Figure: Alignment File-1 and Alignment File-2; adapted from Wikipedia]

  5. Outline • Motivation • Existing Solutions for Implementation • Our Work • Experimental Evaluation • Conclusion

  6. Existing Solutions for Implementation • Serial tools: SamTools, VCFTools, BedTools (file merging, sorting, etc.); VarScan (SNP calling) • Parallel implementations: Turboblast (searching local alignments); SEAL (read mapping and duplicate removal); Biodoop (statistical analysis) • Middleware systems: Hadoop (not designed for the specific needs of genetic data; limited programmability); Genome Analysis Toolkit (GATK) (designed for genetic data processing; provides special data traversal patterns; limited parallelization for some of its tools)

  7. Outline • Motivation • Existing Solutions for Implementation • Our Work • Experimental Evaluation • Conclusion

  8. Our Goal • We want to develop a middleware system that is • Specific to parallel genetic data processing • Able to parallelize a variety of genetic algorithms • Able to work with different popular genetic data formats • Able to use existing programs

  9. Challenges • Load imbalance due to the nature of genomic data • It is not just an array of A, G, C and T characters • High overhead of tasks • I/O contention • [Figure: coverage variance]

  10. Our Work • PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications • Mappers and reducers are executable programs • Allows us to exploit existing applications • No restriction on programming language
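
Because mappers and reducers are plain executables, a driver only needs to launch processes and collect their output. The following is a minimal sketch of that idea, not PAGE's actual implementation: `run_map`, the stand-in mapper (a Python one-liner that echoes its region argument), and the region strings are all hypothetical, and the reducer is simple concatenation.

```python
import subprocess
import sys

def run_map(command, region):
    """Run one map executable on one region and return its standard output.

    `command` is an argument list for any executable; the region is passed
    as the last argument (an assumption made for this sketch).
    """
    result = subprocess.run(command + [region],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Stand-in mapper: any program in any language would do. Here it is a
# Python one-liner that reports the region it was asked to process.
mapper = [sys.executable, "-c",
          "import sys; print('processed', sys.argv[1])"]

regions = ["chr1:1-100", "chr1:101-200"]   # hypothetical region names
outputs = [run_map(mapper, region) for region in regions]

# Stand-in reducer: concatenate the per-region outputs in order.
combined = "".join(outputs)
print(combined, end="")
```

Swapping `mapper` for a real tool's command line is the only change needed to reuse an existing application in this sketch.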

  11. Intra-dependent Processing • Each file is processed independently • [Figure: each input file File-1 ... File-m is split into Region-1 ... Region-n; a Map task per region produces O-i1 ... O-in, and a Reduce per file combines them into Output-i]

  12. Inter-dependent Processing • Each map task processes a particular region of ALL files • [Figure: the input files are split into Region-1 ... Region-n; a Map task per region produces O1 ... On, and a single Reduce combines them into Output]
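
The difference between the two execution models can be seen by enumerating the tasks each one would create. This is an illustrative sketch only; the function names and the file and region labels are hypothetical, and it plans tasks without running anything.

```python
def intra_dependent_plan(files, regions):
    """Intra-dependent: one map task per (file, region), one reduce per file."""
    return {f: [(f, r) for r in regions] for f in files}

def inter_dependent_plan(files, regions):
    """Inter-dependent: one map task per region, each reading that region
    from ALL files, followed by a single reduce."""
    return [(tuple(files), r) for r in regions]

files = ["File-1", "File-2", "File-3"]
regions = ["Region-1", "Region-2"]

intra = intra_dependent_plan(files, regions)
inter = inter_dependent_plan(files, regions)

intra_maps = sum(len(tasks) for tasks in intra.values())
print(intra_maps, "map tasks and", len(intra), "reduces (intra-dependent)")
print(len(inter), "map tasks and 1 reduce (inter-dependent)")
```

With m files and n regions, the intra-dependent plan has m × n map tasks and m reduces, while the inter-dependent plan has n map tasks and one reduce.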

  13. What Can PAGE Parallelize? • PAGE can parallelize all applications that have the following property • M is the map task • R, R1 and R2 are three regions such that R is the concatenation of R1 and R2 • M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
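
The property M(R) = M(R1) ⊕ M(R2) can be checked on a toy example. In this sketch the map task is a hypothetical position-by-position mismatch finder over two made-up sequences, and ⊕ is list concatenation; neither is taken from PAGE itself.

```python
# Toy data: a made-up reference and a made-up aligned sample sequence.
REFERENCE = "ACGTACGTAC"
SAMPLE = "ACCTACGAAC"

def map_task(start, end):
    """Hypothetical map task M: list mismatches in the half-open region [start, end)."""
    return [(pos, REFERENCE[pos], SAMPLE[pos])
            for pos in range(start, end)
            if REFERENCE[pos] != SAMPLE[pos]]

def reduce_outputs(left, right):
    """Reduction function ⊕: concatenate per-region outputs in genome order."""
    return left + right

# R = [0, 10) is the concatenation of R1 = [0, 5) and R2 = [5, 10).
whole = map_task(0, 10)
split = reduce_outputs(map_task(0, 5), map_task(5, 10))

assert whole == split   # M(R) == M(R1) ⊕ M(R2)
print(whole)
```

A map task whose result at one position depends on data far outside its region would not satisfy this check, which is the kind of application the property rules out.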

  14. Data Partitioning • Data is NOT packaged into equal-size data blocks as in Hadoop • Each application has a different way of reading the data • Equal-size data block packaging ignores nucleotide base location information • The genome is divided into regions and each map task is assigned to a region • Takes location information into account • The map task is responsible for accessing its particular region of the input files • This is a common feature of many genomic tools (GATK, SamTools)

  15. Genome Partition • PAGE provides two data partitioning methods • By-locus partitioning: Chromosomes are divided into regions • By-chromosome partitioning: Chromosomes preserve their unity • [Figure: Chr-1 to Chr-6 under each partitioning method]
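
The two partitioning methods can be sketched as functions that turn chromosome lengths into region lists. This is an assumption-laden illustration: the slide does not say how by-locus region boundaries are chosen, so a fixed `region_size` is assumed here, and the chromosome names and lengths are made up. Regions use 1-based inclusive coordinates.

```python
def by_chromosome(chrom_lengths):
    """By-chromosome: every chromosome stays whole, one region each."""
    return [(chrom, 1, length) for chrom, length in chrom_lengths.items()]

def by_locus(chrom_lengths, region_size):
    """By-locus: cut each chromosome into regions of at most `region_size`
    bases (fixed-size cutting is an assumption of this sketch)."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append((chrom, start, end))
    return regions

# Made-up chromosome lengths, far smaller than real ones.
lengths = {"chr1": 250, "chr2": 120}

whole_chromosomes = by_chromosome(lengths)
loci = by_locus(lengths, region_size=100)
print(whole_chromosomes)
print(loci)
```

By-locus partitioning yields more and smaller tasks, while by-chromosome partitioning yields tasks whose sizes follow the chromosome lengths.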

  16. Task Scheduling • PAGE provides two types of scheduling schemes: static and dynamic
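
The transcript does not preserve how the two schemes work; slide 18 names them as Dynamic and Static. The sketch below assumes the common meaning of those terms (static: tasks are assigned to workers up front, here round-robin; dynamic: an idle worker takes the next pending task) and compares the finish time of the slowest worker on made-up task costs. It is not PAGE's scheduler.

```python
import heapq

def static_makespan(costs, workers):
    """Static (assumed round-robin): task i always goes to worker i % workers."""
    loads = [0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)

def dynamic_makespan(costs, workers):
    """Dynamic (assumed): whichever worker becomes idle first takes the next task."""
    free_at = [0] * workers          # min-heap of times at which workers go idle
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)

# Made-up per-region costs with uneven load, e.g. from coverage variance.
costs = [9, 1, 1, 1, 9, 1, 1, 1]

static_time = static_makespan(costs, workers=2)
dynamic_time = dynamic_makespan(costs, workers=2)
print("static:", static_time, "dynamic:", dynamic_time)
```

On this input the round-robin assignment places both expensive tasks on the same worker, which is the load-imbalance problem that dynamic assignment is usually meant to reduce.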

  17. Applications Developed Using PAGE • We parallelized 4 applications • VarScan: SNP detection • RealignerTargetCreator: Detects insertions/deletions in alignment files • IndelRealigner: Applies local realignment to improve the quality of alignment files • UnifiedGenotyper: SNP detection

  18. Sample Application Development with PAGE • Serial execution command of the VarScan software • samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp • To parallelize VarScan with PAGE, the user needs to define: • Genome Partition: By-Locus • Scheduling Scheme: Dynamic (or Static) • Execution Model: Inter-dependent • Map command: samtools mpileup -b file_list -r region_loc -f reference | java -jar VarScan.jar mpileup2snp > output_loc • Reduction: the cat bash shell command
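
The map and reduce commands on this slide can be generated per region by a small helper. This sketch only builds command strings and does not run them; the function names, the region strings, and the output file names (`out_0`, `out_1`, `final_output`) are hypothetical, while `file_list` and `reference` are the placeholders used on the slide.

```python
import shlex

def map_command(region, output, file_list="file_list", reference="reference"):
    """Build the slide's map command for one region (string only, not executed)."""
    return (
        f"samtools mpileup -b {shlex.quote(file_list)} "
        f"-r {shlex.quote(region)} -f {shlex.quote(reference)} "
        f"| java -jar VarScan.jar mpileup2snp > {shlex.quote(output)}"
    )

def reduce_command(outputs, final="final_output"):
    """Build the reduction step: concatenate the per-region outputs with cat."""
    joined = " ".join(shlex.quote(path) for path in outputs)
    return f"cat {joined} > {shlex.quote(final)}"

regions = ["chr1:1-1000", "chr1:1001-2000"]          # hypothetical regions
outputs = [f"out_{i}" for i in range(len(regions))]  # hypothetical file names

map_commands = [map_command(r, o) for r, o in zip(regions, outputs)]
final_command = reduce_command(outputs)

for command in map_commands:
    print(command)
print(final_command)
```

Each generated map command differs from the serial command only in the added `-r` region argument and the output redirection.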

  19. Outline • Motivation • Existing Solutions for Implementation • Our Work • Experimental Evaluation • Conclusion

  20. Experiments • Experimental setup • In our cluster, each node has 12 GB of memory and 8 cores (2.53 GHz) • We obtained the data from the 1000 Genomes Project • We evaluated PAGE with 4 applications • We compared PAGE with Hadoop Streaming and GATK

  21. Comparison with GATK: IndelRealigner tool of GATK • [Figure panels: Scalability; Data Size Impact. Labels: 3.3x, 9x. Data size: 11 GB. # of cores: 128]

  22. Comparison with GATK: Unified Genotyper tool of GATK • [Figure panels: Scalability; Data Size Impact. Labels: 12.8x, 10.9x. Data size: 34 GB. # of cores: 128]

  23. Comparison with Hadoop Streaming: VarScan application • [Figure panels: Scalability; Data Size Impact. Labels: 12.7x, 6.9x. Data size: 52 GB. # of cores: 128]

  24. Summary of Experimental Results • When the computing power increased by 16 times

  25. Conclusion • We developed a middleware that • Easily parallelizes genomic applications • Has high applicability • Places no restriction on programming language or data format • Allows the use of existing applications • Lets the user control the parallel execution while hiding the details • Offers alternative scheduling schemes, execution models and data partitioning types • Scales well

  26. Thank you for listening … Questions
