Wildfire: Distributed, Grid-Enabled Workflow Construction and Execution
Arun Krishnan, PhD, Assistant Professor, Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan
Two Trends
• Bioinformatics analysis
  • Increasingly complex analyses
  • Several bioinformatics applications assembled into workflows
• Affordable HPC
  • Commodity hardware assembled into Beowulf clusters
  • Pooled hardware in Grids
  • Parallel by design
• Traditional solution: implement workflows as Perl scripts, which are difficult to program, difficult to maintain, and difficult to port
The Problem
• We need
  • A tool for constructing and executing workflows on supercomputers
  • A user interface that is intuitive for non-HPC specialists
  • Execution that supports different supercomputing platforms
Solution
• Objectives:
  • Coarse-grained parallel programming for the Grid
  • Exploit the heterogeneity of the Grid (software licences, data, hardware)
• Approach:
  • An expressive workflow description language, GEL
    • Sequential and parallel composition
    • Conditional execution (if-then-else)
    • Sequential iteration (while loop)
    • Parameterised parallel composition (parameter sweeps)
    • Parameterised sequential composition
• Basic idea: can we do on the Grid what we do using shell scripts on a cluster? For example:
  for i in `ls .`; do blastp -d yeast -I $i; done
GEL: An Overview: Semantics
• A workflow has
  • One input directory
  • One or more output directories
• A workflow cannot modify its input directory
GEL: An Overview: Semantics (Job)
• Job (atomic workflow subunit)
  • Characteristics
    • Executable name
    • Resource/system/software/data requirements
    • One input directory
    • One output directory
  • Semantics
    • Stage files into the input directory
    • Run the executable
    • Present the output directory as the result
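The three job steps above (stage, run, present) can be sketched in a few lines of Python. This is an illustrative sketch only, not the GEL interpreter: the function name `run_job` and the environment variable `JOB_INPUT_DIR` are hypothetical, and it assumes the executable writes its results into its working directory.

```python
import os
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(argv, input_files):
    """Stage files, run the executable, return the output directory."""
    base = Path(tempfile.mkdtemp(prefix="gel_job_"))
    in_dir, out_dir = base / "in", base / "out"
    in_dir.mkdir()
    out_dir.mkdir()
    # 1. Stage files into the job's private input directory (copies only,
    #    so the caller's files are never modified).
    for f in input_files:
        shutil.copy(f, in_dir)
    # 2. Run the executable in the output directory. JOB_INPUT_DIR is a
    #    hypothetical convention for telling it where the inputs live.
    env = {**os.environ, "JOB_INPUT_DIR": str(in_dir)}
    subprocess.run(argv, cwd=out_dir, env=env, check=True)
    # 3. Present the output directory as the result.
    return out_dir


# Usage: a toy "executable" that upper-cases one staged file.
source = Path(tempfile.mkdtemp()) / "a.txt"
source.write_text("acgt")
script = (
    "import os, pathlib; "
    "src = pathlib.Path(os.environ['JOB_INPUT_DIR']) / 'a.txt'; "
    "pathlib.Path('b.txt').write_text(src.read_text().upper())"
)
result_dir = run_job([sys.executable, "-c", script], [source])
```

Copying into a fresh directory per job is one simple way to honour the rule that a workflow cannot modify its input directory.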
GEL: An Overview: Semantics (Conditional)
• Conditional (if E then A else B)
  • E is a job for which we ignore the output files
  • A and B are workflows
  • Executing (if E then A else B) entails
    • Execute E and observe stdout
    • If stdout is non-empty then execute A
    • If stdout is empty then execute B
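The stdout-based test can be mimicked with a subprocess call. The sketch below is an assumption-laden illustration: `guard_is_true` and `if_then_else` are hypothetical names, the branches are plain Python callables rather than GEL workflows, and any output at all (including whitespace) is treated as non-empty.

```python
import subprocess
import sys


def guard_is_true(argv):
    """Run guard job E; it counts as true when it prints anything to stdout."""
    done = subprocess.run(argv, capture_output=True, text=True, check=False)
    return done.stdout != ""


def if_then_else(guard_argv, then_branch, else_branch):
    """Execute then_branch if the guard printed something, else else_branch."""
    if guard_is_true(guard_argv):
        return then_branch()
    return else_branch()


# Usage: two toy guards, one that prints and one that stays silent.
guard_yes = [sys.executable, "-c", "print('hit')"]
guard_no = [sys.executable, "-c", "pass"]
result_yes = if_then_else(guard_yes, lambda: "A", lambda: "B")
result_no = if_then_else(guard_no, lambda: "A", lambda: "B")
```

Using stdout rather than the exit status means ordinary filters such as `grep` can act as guards without any wrapper.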
GEL: An Overview: Semantics (Sequential and parallel composition)
• Sequential composition (A;B)
  • Execute A
  • Copy files from all output directories of A into the input directory of B
  • Execute B
  • Note: implicit merge of the output directories of A
• Parallel composition (A||B)
  • Execute A and B from input directories populated from the same files
  • Output directories are those of A and B
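One way to picture these two rules is to model a workflow as a function from an input directory to a list of output directories. The sketch below does that in Python; `seq`, `par`, `tag_job`, and the helper names are all hypothetical, threads stand in for real job submission, and same-named files simply overwrite each other during the implicit merge.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fresh_dir():
    return Path(tempfile.mkdtemp(prefix="gel_"))


def copy_files(src_dirs, dst):
    """Copy every regular file from each source directory into dst."""
    for d in src_dirs:
        for f in Path(d).iterdir():
            if f.is_file():
                shutil.copy(f, dst)


def seq(a, b):
    """A;B: merge all output directories of A into the input directory of B."""
    def run(in_dir):
        merged = fresh_dir()
        copy_files(a(in_dir), merged)  # implicit merge of A's outputs
        return b(merged)
    return run


def par(a, b):
    """A||B: both run on private copies of the same input files."""
    def run(in_dir):
        def start(workflow):
            private = fresh_dir()
            copy_files([in_dir], private)
            return workflow(private)
        with ThreadPoolExecutor(max_workers=2) as pool:
            fa, fb = pool.submit(start, a), pool.submit(start, b)
            return fa.result() + fb.result()  # outputs of A and of B
    return run


def tag_job(tag):
    """Toy job (hypothetical): copy each input file, appending `tag`."""
    def run(in_dir):
        out = fresh_dir()
        for f in Path(in_dir).iterdir():
            (out / f.name).write_text(f.read_text() + tag)
        return [out]
    return run


# Usage
src = fresh_dir()
(src / "x.txt").write_text("s")
seq_outs = seq(tag_job("a"), tag_job("b"))(src)
par_outs = par(tag_job("a"), tag_job("b"))(src)
```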
GEL: An Overview: Semantics (Sequential Iteration)
• Sequential iteration (while E do A)
  • E is a job for which we ignore the output files
  • Standard while definition: executing while E do A is semantically equivalent to executing if E then (A; while E do A), with nothing executed when E is false
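The unfolding rule can be checked against an ordinary loop. In this hedged Python sketch the guard is a callable returning the guard job's stdout text (non-empty meaning true) and the body is any callable; both function names are hypothetical.

```python
def while_do(guard, body):
    """Run body() for as long as guard() returns non-empty text."""
    iterations = 0
    while guard() != "":
        body()
        iterations += 1
    return iterations


def while_do_unfolded(guard, body):
    """Same behaviour, written as: if E then (A; while E do A)."""
    if guard() != "":
        body()
        return 1 + while_do_unfolded(guard, body)
    return 0


# Usage: keep working while items remain in a to-do list.
todo = ["a", "b", "c"]
loop_count = while_do(lambda: "more" if todo else "", todo.pop)

todo_again = ["a", "b"]
unfolded_count = while_do_unfolded(
    lambda: "more" if todo_again else "", todo_again.pop
)
```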
GEL: An Overview: Semantics (Parameterised parallel composition)
• Parameterised parallel composition (pfor x in xs do A(x))
  • xs is a list expression
    • E.g. 0:50:10 = [0, 10, 20, 30, 40, 50]
  • Variable x is a bound variable
  • Executing pfor x in (a0,xs) do A(x) is semantically equivalent to executing A(a0) || (pfor x in xs do A(x))
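A small Python sketch of a parameter sweep follows. It assumes, from the single example above, that a list expression reads as start:stop:step with an inclusive stop and a positive step; `expand_range` and `pfor` are hypothetical names, and a thread pool stands in for real parallel job submission.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_range(expr):
    """Expand 'start:stop:step' into a list, including the stop value."""
    start, stop, step = (int(part) for part in expr.split(":"))
    return list(range(start, stop + 1, step))


def pfor(values, body):
    """Run body(x) for every x concurrently; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, values))


# Usage
sweep = expand_range("0:50:10")
doubled = pfor(sweep, lambda x: x * 2)
```

Because the instances are independent, the order in which they finish does not matter; only the collected results are ordered.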
GEL: An Overview: Semantics (Parameterised sequential composition)
• Parameterised sequential composition (for x in xs do A(x))
  • xs is a list expression, x is a bound variable
  • Executing for x in (a0,xs) do A(x) is semantically equivalent to executing A(a0); (for x in xs do A(x))
  • The number of iterations is known before the loop executes (cf. while loop)
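Read as code, parameterised sequential composition behaves like a fold: each instance runs after the previous one and receives what it produced. The Python sketch below illustrates that reading under simplifying assumptions; `for_seq` is a hypothetical name and a plain value stands in for the files passed from one instance to the next.

```python
def for_seq(values, body, state):
    """Unfold: A(a0); (for x in xs do A(x)), threading state through."""
    if not values:
        return state
    first, rest = values[0], values[1:]
    # Run the first instance, then the rest on whatever it produced.
    return for_seq(rest, body, body(first, state))


# Usage: the iteration count is len(values), known before the loop starts.
values = [1, 2, 3]
trace = for_seq(values, lambda x, seen: seen + [x], [])
total = for_seq(values, lambda x, acc: acc + x, 0)
```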
So what did we do…?
• Grammar defined
  • Sequential and parallel composition
  • Sequential iteration
  • Intrinsic jobs (e.g. file projection)
• Interpreters implemented
  • Local machine: spawn jobs locally
  • Clusters: spawn jobs using SGE, PBS and LSF
  • Statically scheduled Grid interpreter: GridFTP staging, GramJob spawn
• Required
  • A GUI frontend
Wildfire… Wildfire and GEL bring supercomputing power to the bioinformatician
Features
• Integrated environment
  • Construct and execute workflows from the same interface
• User-friendly
  • Drawing-analogy workflow construction
  • Program options presented using Jemboss-style drop-down lists, buttons, textboxes, etc.
• Supercomputing support
  • Shared memory multiprocessors
  • Cluster schedulers: PBS, SGE, LSF
  • Grids: Globus
W/F Construction: Drawing
• Double-click on components to change options
• Draw arrows between components
• Drag components into containers
[Figure annotations]
• Yellow boxes are atomic components
• An arrow denotes sequential dependence
• Parallel bars denote a parallel container
• A parallel “foreach” repeats its contents for each file matching a pattern
W/F Construction: Components
• Wildfire has been pre-configured with EMBOSS applications
• Custom/new components can be added
W/F Execution
[Diagram: the user uses Wildfire to create a workflow as a GEL script; the GEL script, together with its data, executes locally (fork, laptop), on a cluster (LSF), or on a Grid (Globus)]
W/F Execution: GEL
• “Local”/SMP
  • Run programs directly
  • Use multiple processors if available
• Beowulf cluster
  • Submit job requests through the queue manager: PBS/Torque, Sun GridEngine (SGE), Platform LSF
  • Use processors on compute nodes
  • Use job dependencies
• Grid
  • Stage files using GridFTP
  • Execute programs using GRAM
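On a cluster, sequential composition can map onto the scheduler's own job-dependency options. The Python sketch below only builds command lines and does not submit anything; it is not taken from the GEL interpreters. The function name, script name, and job id are hypothetical; the options shown are the standard dependency flags of each scheduler (`-W depend=afterok:` for PBS/Torque, `-hold_jid` for SGE, `-w "done(...)"` for LSF).

```python
def submit_command(scheduler, script, after=None):
    """Build a submit command that optionally waits for job id `after`."""
    if scheduler == "pbs":
        dep = ["-W", f"depend=afterok:{after}"] if after else []
        return ["qsub", *dep, script]
    if scheduler == "sge":
        dep = ["-hold_jid", str(after)] if after else []
        return ["qsub", *dep, script]
    if scheduler == "lsf":
        dep = ["-w", f"done({after})"] if after else []
        return ["bsub", *dep, script]
    raise ValueError(f"unknown scheduler: {scheduler}")


# Usage: hypothetical script name and job id, for illustration only.
commands = {
    name: submit_command(name, "./blast.sh", after="123")
    for name in ("pbs", "sge", "lsf")
}
first_job = submit_command("pbs", "./extract.sh")
```

Handing dependencies to the queue manager lets the whole workflow be submitted up front, instead of keeping a client process alive to poll for completion.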
More workflow parallel features
• Parallel for loop
  • Loop variable $i iterates over values 0 to 3
  • For each value of $i, an instance of the loop's contents executes in parallel
• Parallel container
  • Denotes independent components
  • The whole container is considered a component
Workflow: while loops
• A while loop allows for iterative workflows
• The component inside the round disc is the loop guard
• If the loop guard is true, then the true branch is taken, after which the loop guard is evaluated again
• If the loop guard evaluates to false, then the break branch is taken
Ex: Transcript Analysis
• Transcripts database from the Mammalian Gene Collection
• Exons from chromosomes from NCBI GenBank
• BLAST each exon against the transcript database to investigate splicing of transcripts
[Diagram labels: Chromosome, Extract, Exons, Transcripts, BLAST, Alignments]
Transcript Analysis: 24 Chromosomes
• The human genome has 24 distinct chromosomes (1-22, X, Y)
• How do we leverage parallel computing?
Transcript Analysis: Parallelism
[Diagram: one copy of the pipeline per chromosome, with stages Chromosome, Extract, Exons, Dice, Blast (against Transcripts), Results]
• Dice splits the big exons file into several smaller files
• Separate BLAST instances align the smaller files against the transcript database
• Alignments are stored in many files
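The slide does not say how Dice chooses its split points, so the following Python sketch simply assumes FASTA input and a fixed number of records per piece. `read_fasta` and `dice` are hypothetical names, and the pieces are returned as strings rather than written to files.

```python
def read_fasta(text):
    """Split FASTA text into records, each starting at a '>' header line."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records and line.strip():
            records[-1].append(line)
    return ["\n".join(lines) + "\n" for lines in records]


def dice(text, records_per_file):
    """Group records into pieces of at most records_per_file records."""
    records = read_fasta(text)
    return [
        "".join(records[i:i + records_per_file])
        for i in range(0, len(records), records_per_file)
    ]


# Usage: three toy exon records diced into pieces of two.
exons = ">exon1\nACGT\nACGT\n>exon2\nGG\n>exon3\nTTTT\n"
pieces = dice(exons, 2)
```

Each piece can then be given to its own BLAST instance, which is what makes the alignment step parallel.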
Transcript Analysis: Workflow
[Figure annotations]
• Parallel “foreach” container executes the inner pipeline once for each file matching *.gbk.gz
• Parallel “foreach” container executes the blastall component for all files matching *_dice*.fna
• Decompress chromosome data
• Extract exons
• Break up the exons file into smaller files
• Format database for BLAST query
• Decompress transcripts file
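The parallel “foreach” container selects files by shell-style pattern. A rough Python equivalent is sketched below, assuming glob-style matching; `foreach_matching` and the file names are hypothetical, and threads stand in for submitted jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase


def foreach_matching(filenames, pattern, body):
    """Run body(name) concurrently for each file name matching pattern."""
    matches = sorted(n for n in filenames if fnmatchcase(n, pattern))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, matches))


# Usage with hypothetical file names.
files = [
    "chr1.gbk.gz",
    "chr2.gbk.gz",
    "chr1_dice0.fna",
    "chr1_dice1.fna",
    "transcripts.fna",
]
chromosome_runs = foreach_matching(files, "*.gbk.gz", lambda n: f"extract {n}")
blast_runs = foreach_matching(files, "*_dice*.fna", lambda n: f"blast {n}")
```

Note that the transcripts file does not match `*_dice*.fna`, so it is used only as the database and never as a query.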
Transcript Analysis: Execution Profile
• The execution profile shows when programs start and stop
• Note: “makespan” can be improved by balancing the duration of blast jobs (modify dice)
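To see why balancing helps, the Python sketch below computes the makespan when jobs start, in order, on whichever worker becomes free first. The durations are invented for illustration and are not measurements from the talk; `makespan` is a hypothetical helper.

```python
import heapq


def makespan(durations, workers):
    """Finish time when each job starts on the earliest free worker."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Same total work (12 time units) on 2 workers, diced two different ways.
uneven = makespan([1, 1, 1, 1, 8], workers=2)  # one very long blast job
balanced = makespan([3, 3, 3, 3], workers=2)   # equal-sized pieces
```

With the uneven split one worker is left finishing the long job alone, so the run takes 10 units instead of 6 even though the total work is identical.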
Summary
• End-user requirements
  • Ease of construction
  • Ease of implementation
  • Ease of recovery
• Is Grid scripting the way to go?
• Interfaces to Grid scripting?
Acknowledgments
• Bioinformatics Institute, Singapore
• Dr. Francis Tang
• Chua Ching Lian
• Liang-Yoong Ho