Workflow Management
CMSC 491/691 Hadoop-Based Distributed Computing
Spring 2014
Adam Shook
Problem! • "Okay, Hadoop is great, but how do people actually do this?" – A Real Person • Package jobs? • Chain actions together? • Run them on a schedule? • Pre- and post-processing? • Retry failures?
Apache Oozie: Workflow Scheduler for Hadoop • Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs • Workflow jobs are DAGs (directed acyclic graphs) of actions • Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability • Supports several types of jobs: • Java MapReduce • Streaming MapReduce • Pig • Hive • Sqoop • DistCp • Java programs • Shell scripts
Why should I care? • Retry jobs in the event of a failure • Execute jobs at a specific time or when data is available • Correctly order job execution based on dependencies • Provide a common framework for communication • Use the workflow to couple resources instead of some home-grown code base
Layers of Oozie • Bundles • Coordinators • Workflows • Actions
Actions • Have a type, and each type has a defined set of configuration variables • Each action must specify what to do based on success or failure
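For illustration, a minimal action skeleton looks like the following (the element names come from the Oozie workflow schema; the action type and transition targets are placeholders):

<action name="my-action">
  <!-- one action-type element goes here, e.g. <map-reduce>, <pig>, or <java> -->
  ...
  <ok to="next-node"/>     <!-- transition taken on success -->
  <error to="fail"/>       <!-- transition taken on failure -->
</action>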
Workflow DAGs
[Diagram: a sample workflow DAG with start and end nodes; an M/R streaming job, a fork/join running a Pig job and a Java Main in parallel, a decision node (MORE loops back, ENOUGH proceeds), an M/R job, a Java Main, and an FS job, each transitioning on OK]
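As a rough sketch, the control nodes a DAG like this uses look as follows (node names and the decision predicate are made up for illustration; the fork/join/decision elements are from the Oozie workflow schema):

<fork name="forking">
  <path start="pig-node"/>
  <path start="java-node"/>
</fork>
<!-- pig-node and java-node both transition to the join on OK -->
<join name="joining" to="decision-node"/>
<decision name="decision-node">
  <switch>
    <!-- illustrative EL predicate: loop back while there is MORE work -->
    <case to="more-processing">${needMoreWork eq "true"}</case>
    <default to="end"/>
  </switch>
</decision>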
Oozie Workflow Application • An HDFS Directory containing: • Definition file: workflow.xml • Configuration file: config-default.xml • App files: lib/ directory with JAR and other dependencies
WordCount Workflow

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>

[Diagram: start → M-R wordcount → end on OK; errors route to kill]
Coordinators • Oozie executes workflows based on • Time Dependency • Data Dependency
[Diagram: Oozie client → WS API (Tomcat) → Oozie coordinator checks data availability → Oozie workflow → Hadoop]
Time Triggers

<coordinator-app name="coord1" start="2009-01-01T00:00Z" end="2010-01-01T00:00Z"
                 frequency="15" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/apps/processor-wf</app-path>
      <configuration>
        <property>
          <name>key1</name>
          <value>value1</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Frequency is expressed in minutes, so this coordinator triggers the workflow every 15 minutes between start and end.
Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}" ...>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name>
          <value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Bundle • Bundles are a higher-level abstraction that batch a set of coordinators together • There are no explicit dependencies between the coordinators, but together they can be used to define a pipeline (see the sketch below)
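A minimal sketch of a bundle definition, assuming two coordinator applications already deployed at the given HDFS paths (the names and paths here are illustrative):

<bundle-app name="my-bundle" xmlns="uri:oozie:bundle:0.1">
  <coordinator name="ingest-coord">
    <app-path>hdfs://bar:9000/apps/ingest-coord</app-path>
  </coordinator>
  <coordinator name="report-coord">
    <app-path>hdfs://bar:9000/apps/report-coord</app-path>
  </coordinator>
</bundle-app>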
Interacting with Oozie • Read-Only Web Console • CLI • Java client • Web Service Endpoints • Directly with Oozie DB using SQL
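For example, typical CLI interactions look like this (the -oozie server URL is illustrative and can also be supplied via the OOZIE_URL environment variable):

oozie job -oozie http://localhost:11000/oozie -config job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>
oozie job -oozie http://localhost:11000/oozie -log <job-id>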
Extending Oozie • Minimal workflow language containing a handful of controls and actions • Extensibility for custom action nodes • Creation of a custom action requires: • Java implementation, extending ActionExecutor (sketched below) • Implementation of the action's XML schema, which defines the action's configuration parameters • Packaging of the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR • Extending oozie-site.xml to register information about the custom executor
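A rough sketch of a synchronous custom executor (method names follow Oozie's ActionExecutor API as documented for custom actions; the action type string and the behavior shown are purely illustrative and gloss over real error handling):

import org.apache.oozie.action.ActionExecutor;
import org.apache.oozie.action.ActionExecutorException;
import org.apache.oozie.client.WorkflowAction;

public class MyActionExecutor extends ActionExecutor {
    public MyActionExecutor() {
        super("my-action");  // element name used in workflow.xml
    }

    @Override
    public void start(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // do the work synchronously, then signal completion
        context.setExecutionData("OK", null);
    }

    @Override
    public void end(Context context, WorkflowAction action)
            throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.OK, "OK");
    }

    @Override
    public void check(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // no-op: synchronous actions complete in start()
    }

    @Override
    public void kill(Context context, WorkflowAction action)
            throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.KILLED, "KILLED");
    }

    @Override
    public boolean isCompleted(String externalStatus) {
        return true;
    }
}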
What do I need to deploy a workflow? • coordinator.xml • workflow.xml • Libraries • Properties • Contains things like NameNode and ResourceManager addresses and other job-specific properties
Configuring Workflows • Three mechanisms to configure a workflow • config-default.xml • job.properties • Job arguments • Resolution order when a property is looked up: • Use all of the parameters from the command-line invocation • Anything unresolved? Use job.properties • Use config-default.xml for everything else (see the example below)
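For reference, a minimal job.properties might look like the following; the host names, paths, and every property name other than oozie.wf.application.path are placeholders matching the ${nameNode}/${jobTracker} variables used in the earlier XML:

nameNode=hdfs://bar:9000
jobTracker=foo:9001
queueName=default
inputDir=/user/hadoop/input
outputDir=/user/hadoop/output
# tells Oozie where the workflow application lives in HDFS
oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app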
Okay, I've built those • Now you can put it in HDFS and run it:

hdfs dfs -put my_job oozie/app
oozie job -run -config job.properties
Java Action • A Java action will execute the main method of the specified Java class • Java classes should be packaged in a JAR and placed in the workflow application's lib/ directory • wf-app-dir/workflow.xml • wf-app-dir/lib • wf-app-dir/lib/myJavaClasses.JAR
Java Action • The following action is equivalent to running:

$ java -Xms512m a.b.c.MyJavaMain arg1 arg2

<action name='java1'>
  <java>
    ...
    <main-class>a.b.c.MyJavaMain</main-class>
    <java-opts>-Xms512m</java-opts>
    <arg>arg1</arg>
    <arg>arg2</arg>
    ...
  </java>
</action>
Java Action Execution • Executed as an MR job with a single task • So you need the MR information

<action name='java1'>
  <java>
    <job-tracker>foo.bar:8021</job-tracker>
    <name-node>foo1.bar:8020</name-node>
    ...
    <configuration>
      <property>
        <name>abc</name>
        <value>def</value>
      </property>
    </configuration>
  </java>
</action>
Capturing Output • How do I pass a parameter from my Java action to other actions? • Add the <capture-output/> element to your Java action • Reference the parameter in the following actions • Write a few lines of Java to link them (shown on the next slides)
<action name='java1'>
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <main-class>org.apache.oozie.test.MyTest</main-class>
    <arg>${outputFileName}</arg>
    <capture-output/>
  </java>
  <ok to="pig1"/>
  <error to="fail"/>
</action>
<action name='pig1'>
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <script>script.pig</script>
    <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
  </pig>
  <ok to="end"/>
  <error to="fail"/>
</action>
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class MyTest {
    public static void main(String[] args) {
        // ${outputFileName} passed in from the workflow (unused here)
        String fileName = args[0];
        try {
            // Oozie tells the action where to write captured output
            File file = new File(
                System.getProperty("oozie.action.output.properties"));
            Properties props = new Properties();
            props.setProperty("PASS_ME", "123456");
            OutputStream os = new FileOutputStream(file);
            props.store(os, "");
            os.close();
            System.out.println(file.getAbsolutePath());
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}
A Use Case: Daily Jobs • Replace a cron job that runs a bash script once a day • Java main class that pulls data from a file stream and dumps it to HDFS • Runs a MapReduce job on the files • Emails a person when finished • Start within X amount of time • Complete within Y amount of time • And retry Z times on failure
<workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
  <start to="java-node"/>
  <action name="java-node">
    <java>
      <job-tracker>foo:9001</job-tracker>
      <name-node>bar:9000</name-node>
      <main-class>org.foo.bar.PullFileStream</main-class>
    </java>
    <ok to="mr-node"/>
    <error to="fail"/>
  </action>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>foo:9001</job-tracker>
      <name-node>bar:9000</name-node>
      <configuration>
        ...
      </configuration>
    </map-reduce>
    <ok to="email-node"/>
    <error to="fail"/>
  </action>
  <action name="email-node">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>customer@foo.bar</to>
      <cc>employee@foo.bar</cc>
      <subject>Email notification</subject>
      <body>The wf completed</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <end name="end"/>
  <kill name="fail"/>
</workflow-app>
<?xml version="1.0"?>
<coordinator-app name="daily_job_coord" start="${COORD_START}" end="${COORD_END}"
                 frequency="${coord:days(1)}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
    </workflow>
    <sla:info>
      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
      <sla:should-start>${X * MINUTES}</sla:should-start>
      <sla:should-end>${Y * MINUTES}</sla:should-end>
      <sla:alert-contact>foo@bar.com</sla:alert-contact>
    </sla:info>
  </action>
</coordinator-app>
Review • Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff • Advanced control flow and action extensibility lets Oozie do whatever you need it to do at any point in the workflow • XML is gross
References • http://oozie.apache.org • https://cwiki.apache.org/confluence/display/OOZIE/Index • http://www.slideshare.net/mattgoeke/oozie-riot-games • http://www.slideshare.net/mislam77/oozie-sweet-13451212 • http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie