Design Principles

• Separation of components into a modular system:
  • Independent standalone modules that are also runnable programs
    • A collaborator wants to run srf2FastQ at home, without a MetaDB
    • A researcher tries custom parameters, but still tracks the run in the MetaDB
  • XML workflows that define jobs and data dependencies
    • Parameterized to reuse workflows on different experiments
    • Based on the DAX standard
  • Execution engine uses the open-source Pegasus project
    • Wraps standard executables, so no modification to your code
    • Supports submission to multiple clusters, including clusters living on EC2 and other clouds
    • Uses Globus to support SGE, PBS, Torque, Condor, LSF
    • Stages data and binaries to the appropriate cluster from whichever cluster has them
    • Manages temporary space and processing environment
      • Creating temp directories, moving input files in, staging and running your program, copying results out
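The temporary-space handling in the last bullet (create a temp directory, move inputs in, run, copy results out) can be sketched in plain Java. This is only an illustration of the pattern, not the Pegasus or SeqWare implementation: the class `StagedRun`, the `Job` interface, and the `execute` method are hypothetical names, and the wrapped program is modeled as a callback rather than a real executable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

public class StagedRun {

    /** Hypothetical stand-in for the wrapped program: reads and writes inside workDir. */
    @FunctionalInterface
    public interface Job {
        void run(Path workDir) throws IOException;
    }

    /**
     * Creates a scratch directory, stages the input into it, runs the job,
     * copies the named output into resultDir, and always removes the scratch directory.
     */
    public static Path execute(Path input, String outputName, Path resultDir, Job job)
            throws IOException {
        Path workDir = Files.createTempDirectory("staged-run-");
        try {
            // Move the input file in.
            Files.copy(input, workDir.resolve(input.getFileName()));
            // Run the program inside the scratch directory.
            job.run(workDir);
            // Copy the result out.
            Files.createDirectories(resultDir);
            return Files.copy(workDir.resolve(outputName),
                              resultDir.resolve(outputName),
                              StandardCopyOption.REPLACE_EXISTING);
        } finally {
            deleteRecursively(workDir);
        }
    }

    /** Deletes children before their parent directories. */
    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

In a real deployment the `Job` callback would presumably launch the wrapped executable (for example through `ProcessBuilder`), and staging would cross cluster boundaries rather than local directories.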

Application Wrapper Interface

• Applications conform to a standard interface
  • Developers and users do not have to understand the rest of the pipeline
• Forces developers to adhere to best practices
  • Syntax, --help option
  • Required test harness
  • Verification of input, output, parameters

Java API:

    public interface WrapperInterface {
        int init();                  // Optional
        int get_syntax();
        int do_test();
        int do_verify_input();
        int do_verify_parameters();
        int do_run();
        int do_verify_output();
        int clean_up();              // Optional
    }

Local Execution:

    $ java SeqWareRunner bpostprocess --help
        → Reports get_syntax()
    $ java SeqWareRunner bpostprocess input
        → Run bpostprocess on the command line
    $ java SeqWareRunner bpostprocess --db input
        → Same as above, but without MetaDB feedback
    $ java SeqWareRunner bpostprocess --db input --config=config.txt
    $ java SeqWareRunner bpostprocess --db input -A 0 -n 8
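To make the wrapper contract concrete, here is a minimal sketch of how a runner might drive the interface methods. The slide does not say how SeqWareRunner actually does this, so several things are assumptions: the call order, the convention that 0 means success, and that `clean_up()` runs even after a failure. `WrapperLifecycleSketch`, `runLifecycle`, and `RecordingWrapper` are hypothetical names, and the interface is nested only to keep the sketch in one file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

public class WrapperLifecycleSketch {

    /** Method names copied from the slide. */
    public interface WrapperInterface {
        int init();
        int get_syntax();
        int do_test();
        int do_verify_input();
        int do_verify_parameters();
        int do_run();
        int do_verify_output();
        int clean_up();
    }

    /**
     * Calls the steps in an assumed order and stops at the first non-zero
     * return code (assumption: 0 means success). clean_up() always runs.
     */
    public static int runLifecycle(WrapperInterface w) {
        List<IntSupplier> steps = List.<IntSupplier>of(
                w::init,
                w::do_verify_parameters,
                w::do_verify_input,
                w::do_run,
                w::do_verify_output);
        int status = 0;
        for (IntSupplier step : steps) {
            status = step.getAsInt();
            if (status != 0) {
                break;
            }
        }
        int cleanup = w.clean_up();
        return status != 0 ? status : cleanup;
    }

    /** Toy wrapper that records which methods were called, for illustration only. */
    public static class RecordingWrapper implements WrapperInterface {
        public final List<String> calls = new ArrayList<>();
        private final boolean inputOk;

        public RecordingWrapper(boolean inputOk) {
            this.inputOk = inputOk;
        }

        private int record(String name, int code) {
            calls.add(name);
            return code;
        }

        public int init() { return record("init", 0); }
        public int get_syntax() { return record("get_syntax", 0); }
        public int do_test() { return record("do_test", 0); }
        public int do_verify_input() { return record("do_verify_input", inputOk ? 0 : 1); }
        public int do_verify_parameters() { return record("do_verify_parameters", 0); }
        public int do_run() { return record("do_run", 0); }
        public int do_verify_output() { return record("do_verify_output", 0); }
        public int clean_up() { return record("clean_up", 0); }
    }
}
```

With this ordering, a module whose input check fails never reaches `do_run()`, which is the kind of enforced best practice the slide describes.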

XML Workflow

• Follows the DAX standard, which is the input to Pegasus
• Defines jobs, arguments, configuration, and data dependencies
• Defines dependencies between jobs
• Uses FreeMarker (Java) to populate the XML template for each experiment

    <?xml version="1.0" encoding="UTF-8"?>
    <adag xmlns="http://pegasus.isi.edu/schema/DAX"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-2.1.xsd"
          version="2.1" count="1" index="0" name="bfast"
          jobCount="3" fileCount="0" childCount="2">

      <job id="ID0000001" namespace="seqware" name="runner" version="0.0.1">
        <argument>bfast matches %{reference_file} %{experiment}.fastq...</argument>
        <profile namespace="globus" key="max_memory">24576</profile>
        <profile namespace="globus" key="count">8</profile>
        <uses file="%{experiment}.fastq" link="input"/>
        <uses file="%{experiment}.bmf" link="output" transfer="false" register="false"/>
      </job>
      <job id="ID0000002" namespace="seqware" name="runner" version="0.0.1">
        <argument>bfast localalign ...</argument>
        <uses file="%{experiment}.bmf" link="input"/>
        <uses file="%{experiment}.baf" link="output" transfer="false" register="false"/>
      </job>
      <job id="ID0000003" namespace="seqware" name="runner" version="0.0.1">
        <argument>bfast postprocess ...</argument>
        <uses file="%{experiment}.bmf" link="input"/>
        <uses file="%{experiment}.bam" link="output" transfer="true" register="true"/>
      </job>
      .....
      <child ref="ID0000002">
        <parent ref="ID0000001"/>
      </child>
      <child ref="ID0000003">
        <parent ref="ID0000001"/>
        <parent ref="ID0000002"/>
      </child>
    </adag>
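The per-experiment parameterization can be illustrated with a small placeholder substitution in plain Java. This is not FreeMarker and not the SeqWare code: it is a stand-in that only replaces the `%{name}` placeholders shown in the template above, and the class `WorkflowTemplateSketch`, the `render` method, and the sample value `sample42` are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WorkflowTemplateSketch {

    // Matches placeholders of the form %{name}, as they appear in the template.
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{(\\w+)\\}");

    /**
     * Replaces every %{name} in the template with params.get(name).
     * Fails loudly on a placeholder without a value, so that a
     * half-filled workflow is never submitted.
     */
    public static String render(String template, Map<String, String> params) {
        return PLACEHOLDER.matcher(template).replaceAll(match -> {
            String name = match.group(1);
            String value = params.get(name);
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + name);
            }
            // Escape '$' and '\' so the value is inserted literally.
            return Matcher.quoteReplacement(value);
        });
    }

    public static void main(String[] args) {
        String line = "<uses file=\"%{experiment}.fastq\" link=\"input\"/>";
        System.out.println(render(line, Map.of("experiment", "sample42")));
    }
}
```

Rendering the same template with a different `experiment` value yields a second workflow without editing the XML, which is the reuse the slide refers to.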

Pegasus

Design Principles