Workflow Management
CMSC 491/691 Hadoop-Based Distributed Computing
Spring 2014
Adam Shook
Problem! • "Okay, Hadoop is great, but how do people actually do this?" – A Real Person • Package jobs? • Chain actions together? • Run them on a schedule? • Pre- and post-processing? • Retry failures?
Apache Oozie: Workflow Scheduler for Hadoop • Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs • Workflow jobs are DAGs (directed acyclic graphs) of actions • Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability • Supports several types of jobs: • Java MapReduce • Streaming MapReduce • Pig • Hive • Sqoop • DistCp • Java programs • Shell scripts
Why should I care? • Retry jobs in the event of a failure • Execute jobs at a specific time or when data is available • Correctly order job execution based on dependencies • Provide a common framework for communication • Use the workflow to couple resources instead of some home-grown code base
Layers of Oozie • Bundles • Coordinators • Workflows • Actions
Actions • Have a type, and each type has a defined set of configuration variables • Each action must specify what to do based on success or failure
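For illustration, a minimal action skeleton looks like the following (the element names come from the Oozie workflow schema; the action type and transition targets are placeholders):

<action name="my-action">
  <!-- one action-type element goes here, e.g. <map-reduce>, <pig>, or <java> -->
  ...
  <ok to="next-node"/>     <!-- transition taken on success -->
  <error to="fail"/>       <!-- transition taken on failure -->
</action>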
Workflow DAGs
[Diagram: a sample workflow DAG with start and end nodes; an M/R streaming job, a fork/join running a Pig job and a Java Main in parallel, a decision node (MORE loops back, ENOUGH proceeds), an M/R job, a Java Main, and an FS job, each transitioning on OK]
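As a rough sketch, the control nodes a DAG like this uses look as follows (node names and the decision predicate are made up for illustration; the fork/join/decision elements are from the Oozie workflow schema):

<fork name="forking">
  <path start="pig-node"/>
  <path start="java-node"/>
</fork>
<!-- pig-node and java-node both transition to the join on OK -->
<join name="joining" to="decision-node"/>
<decision name="decision-node">
  <switch>
    <!-- illustrative EL predicate: loop back while there is MORE work -->
    <case to="more-processing">${needMoreWork eq "true"}</case>
    <default to="end"/>
  </switch>
</decision>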
Oozie Workflow Application • An HDFS Directory containing: • Definition file: workflow.xml • Configuration file: config-default.xml • App files: lib/ directory with JAR and other dependencies
WordCount Workflow

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>

[Diagram: start → M-R wordcount → end on OK; errors route to kill]
Coordinators • Oozie executes workflows based on • Time Dependency • Data Dependency
[Diagram: Oozie client → WS API (Tomcat) → Oozie coordinator checks data availability → Oozie workflow → Hadoop]
Time Triggers

<coordinator-app name="coord1" start="2009-01-01T00:00Z" end="2010-01-01T00:00Z"
                 frequency="15" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/apps/processor-wf</app-path>
      <configuration>
        <property>
          <name>key1</name>
          <value>value1</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Frequency is expressed in minutes, so this coordinator triggers the workflow every 15 minutes between start and end.
Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}" ...>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name>
          <value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Bundle • Bundles are a higher-level abstraction that batch a set of coordinators together • There are no explicit dependencies between the coordinators, but together they can be used to define a pipeline (see the sketch below)
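A minimal sketch of a bundle definition, assuming two coordinator applications already deployed at the given HDFS paths (the names and paths here are illustrative):

<bundle-app name="my-bundle" xmlns="uri:oozie:bundle:0.1">
  <coordinator name="ingest-coord">
    <app-path>hdfs://bar:9000/apps/ingest-coord</app-path>
  </coordinator>
  <coordinator name="report-coord">
    <app-path>hdfs://bar:9000/apps/report-coord</app-path>
  </coordinator>
</bundle-app>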
Interacting with Oozie • Read-Only Web Console • CLI • Java client • Web Service Endpoints • Directly with Oozie DB using SQL
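For example, typical CLI interactions look like this (the -oozie server URL is illustrative and can also be supplied via the OOZIE_URL environment variable):

oozie job -oozie http://localhost:11000/oozie -config job.properties -run
oozie job -oozie http://localhost:11000/oozie -info <job-id>
oozie job -oozie http://localhost:11000/oozie -log <job-id>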
Extending Oozie • Minimal workflow language containing a handful of controls and actions • Extensibility for custom action nodes • Creation of a custom action requires: • Java implementation, extending ActionExecutor (sketched below) • Implementation of the action's XML schema, which defines the action's configuration parameters • Packaging of the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR • Extending oozie-site.xml to register information about the custom executor
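A rough sketch of a synchronous custom executor (method names follow Oozie's ActionExecutor API as documented for custom actions; the action type string and the behavior shown are purely illustrative and gloss over real error handling):

import org.apache.oozie.action.ActionExecutor;
import org.apache.oozie.action.ActionExecutorException;
import org.apache.oozie.client.WorkflowAction;

public class MyActionExecutor extends ActionExecutor {
    public MyActionExecutor() {
        super("my-action");  // element name used in workflow.xml
    }

    @Override
    public void start(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // do the work synchronously, then signal completion
        context.setExecutionData("OK", null);
    }

    @Override
    public void end(Context context, WorkflowAction action)
            throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.OK, "OK");
    }

    @Override
    public void check(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // no-op: synchronous actions complete in start()
    }

    @Override
    public void kill(Context context, WorkflowAction action)
            throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.KILLED, "KILLED");
    }

    @Override
    public boolean isCompleted(String externalStatus) {
        return true;
    }
}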
What do I need to deploy a workflow? • coordinator.xml • workflow.xml • Libraries • Properties • Contains things like NameNode and ResourceManager addresses and other job-specific properties
Configuring Workflows • Three mechanisms to configure a workflow • config-default.xml • job.properties • Job arguments • Resolution order when a property is looked up: • Use all of the parameters from the command-line invocation • Anything unresolved? Use job.properties • Use config-default.xml for everything else (see the example below)
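For reference, a minimal job.properties might look like the following; the host names, paths, and every property name other than oozie.wf.application.path are placeholders matching the ${nameNode}/${jobTracker} variables used in the earlier XML:

nameNode=hdfs://bar:9000
jobTracker=foo:9001
queueName=default
inputDir=/user/hadoop/input
outputDir=/user/hadoop/output
# tells Oozie where the workflow application lives in HDFS
oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app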
Okay, I've built those • Now you can put it in HDFS and run it:

hdfs dfs -put my_job oozie/app
oozie job -run -config job.properties
Java Action • A Java action will execute the main method of the specified Java class • Java classes should be packaged in a JAR and placed in the workflow application's lib/ directory • wf-app-dir/workflow.xml • wf-app-dir/lib • wf-app-dir/lib/myJavaClasses.JAR
Java Action • The following action is equivalent to running:

$ java -Xms512m a.b.c.MyJavaMain arg1 arg2

<action name='java1'>
  <java>
    ...
    <main-class>a.b.c.MyJavaMain</main-class>
    <java-opts>-Xms512m</java-opts>
    <arg>arg1</arg>
    <arg>arg2</arg>
    ...
  </java>
</action>
Java Action Execution • Executed as an MR job with a single task • So you need the MR information

<action name='java1'>
  <java>
    <job-tracker>foo.bar:8021</job-tracker>
    <name-node>foo1.bar:8020</name-node>
    ...
    <configuration>
      <property>
        <name>abc</name>
        <value>def</value>
      </property>
    </configuration>
  </java>
</action>
Capturing Output • How do I pass a parameter from my Java action to other actions? • Add the <capture-output/> element to your Java action • Reference the parameter in the following actions • Write a few lines of Java to link them (shown on the next slides)
<action name='java1'>
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <main-class>org.apache.oozie.test.MyTest</main-class>
    <arg>${outputFileName}</arg>
    <capture-output/>
  </java>
  <ok to="pig1"/>
  <error to="fail"/>
</action>
<action name='pig1'>
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <script>script.pig</script>
    <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
  </pig>
  <ok to="end"/>
  <error to="fail"/>
</action>
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class MyTest {
    public static void main(String[] args) {
        // ${outputFileName} passed in from the workflow (unused here)
        String fileName = args[0];
        try {
            // Oozie tells the action where to write captured output
            File file = new File(
                System.getProperty("oozie.action.output.properties"));
            Properties props = new Properties();
            props.setProperty("PASS_ME", "123456");
            OutputStream os = new FileOutputStream(file);
            props.store(os, "");
            os.close();
            System.out.println(file.getAbsolutePath());
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}
A Use Case: Daily Jobs • Replace a cron job that runs a bash script once a day • Java main class that pulls data from a file stream and dumps it to HDFS • Runs a MapReduce job on the files • Emails a person when finished • Start within X amount of time • Complete within Y amount of time • And retry Z times on failure
<workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
  <start to="java-node"/>
  <action name="java-node">
    <java>
      <job-tracker>foo:9001</job-tracker>
      <name-node>bar:9000</name-node>
      <main-class>org.foo.bar.PullFileStream</main-class>
    </java>
    <ok to="mr-node"/>
    <error to="fail"/>
  </action>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>foo:9001</job-tracker>
      <name-node>bar:9000</name-node>
      <configuration>
        ...
      </configuration>
    </map-reduce>
    <ok to="email-node"/>
    <error to="fail"/>
  </action>
  <action name="email-node">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>customer@foo.bar</to>
      <cc>employee@foo.bar</cc>
      <subject>Email notification</subject>
      <body>The wf completed</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <end name="end"/>
  <kill name="fail"/>
</workflow-app>
<?xml version="1.0"?>
<coordinator-app name="daily_job_coord" start="${COORD_START}" end="${COORD_END}"
                 frequency="${coord:days(1)}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
    </workflow>
    <sla:info>
      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
      <sla:should-start>${X * MINUTES}</sla:should-start>
      <sla:should-end>${Y * MINUTES}</sla:should-end>
      <sla:alert-contact>foo@bar.com</sla:alert-contact>
    </sla:info>
  </action>
</coordinator-app>
Review • Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff • Advanced control flow and action extensibility lets Oozie do whatever you need it to do at any point in the workflow • XML is gross
References • http://oozie.apache.org • https://cwiki.apache.org/confluence/display/OOZIE/Index • http://www.slideshare.net/mattgoeke/oozie-riot-games • http://www.slideshare.net/mislam77/oozie-sweet-13451212 • http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie