
Workflow Management

CMSC 491/691 Hadoop-Based Distributed Computing, Spring 2014. Adam Shook. Topic: Apache Oozie.


Presentation Transcript


  1. Workflow Management CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Apache Oozie

  3. Problem! • "Okay, Hadoop is great, but how do people actually do this?" – A Real Person • Package jobs? • Chaining actions together? • Run these on a schedule? • Pre- and post-processing? • Retry failures?

  4. Apache Oozie: Workflow Scheduler for Hadoop • Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs • Workflow jobs are DAGs of actions • Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability • Supports several types of jobs: • Java MapReduce • Streaming MapReduce • Pig • Hive • Sqoop • DistCp • Java programs • Shell scripts

  5. Why should I care? • Retry jobs in the event of a failure • Execute jobs at a specific time or when data is available • Correctly order job execution based on dependencies • Provide a common framework for communication • Use the workflow to couple resources instead of some home-grown code base

  6. Layers of Oozie • Bundles • Coordinators • Workflows • Actions

  7. Actions • Have a type, and each type has a defined set of configuration variables • Each action must specify what to do based on success or failure
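The rule that every action names a next node for both success and failure can be pictured as a small state machine. The following is a minimal sketch under stated assumptions, not Oozie's implementation: the class `TransitionWalker`, its `Action` type, and the node names are hypothetical and exist only for illustration, and the graph is assumed to be acyclic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of action transitions; these are not Oozie classes.
final class TransitionWalker {

    // An action node records where to go on success ("ok") and on failure ("error").
    static final class Action {
        final String okTo;
        final String errorTo;

        Action(String okTo, String errorTo) {
            this.okTo = okTo;
            this.errorTo = errorTo;
        }
    }

    // Follows ok/error transitions from 'start' until it reaches a node that has
    // no action entry (standing in for an end or kill node). Returns the path taken.
    // Assumes the transitions form a DAG, so the loop terminates.
    static List<String> walk(Map<String, Action> actions, String start,
                             Predicate<String> succeeds) {
        List<String> path = new ArrayList<>();
        String current = start;
        while (actions.containsKey(current)) {
            path.add(current);
            Action action = actions.get(current);
            current = succeeds.test(current) ? action.okTo : action.errorTo;
        }
        path.add(current); // terminal node
        return path;
    }

    public static void main(String[] args) {
        Map<String, Action> actions = new HashMap<>();
        actions.put("java1", new Action("pig1", "fail"));
        actions.put("pig1", new Action("end", "fail"));

        // Every action succeeds.
        System.out.println(walk(actions, "java1", name -> true));
        // The second action fails, so its error transition is taken.
        System.out.println(walk(actions, "java1", name -> !name.equals("pig1")));
    }
}
```

Because both transitions are mandatory, the walk can never get stuck on an action: it always ends at a terminal node.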

  8. Workflow DAGs [Diagram: example workflow DAG. Node labels: start, M/R streaming job, Java Main, fork, Pig job, M/R job, join, decision (branches MORE and ENOUGH), Java Main, FS job, end. Edges between actions are labeled OK.]
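Because a workflow is a DAG, any valid execution order is a topological order of its nodes. The sketch below illustrates that idea with Kahn's algorithm. It is an illustration only: Oozie itself follows the transitions declared in workflow.xml rather than sorting a graph, and the class `DagOrder` and the node names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: orders DAG nodes so that every node comes after its predecessors.
final class DagOrder {

    // 'edges' maps each node to the nodes that may run only after it has finished.
    static List<String> topologicalOrder(Map<String, List<String>> edges) {
        // Count incoming edges for every node.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) {
            inDegree.put(node, 0);
        }
        for (List<String> successors : edges.values()) {
            for (String s : successors) {
                inDegree.merge(s, 1, Integer::sum);
            }
        }

        // Nodes with no unmet dependencies are ready to run.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String s : edges.getOrDefault(node, List.of())) {
                if (inDegree.merge(s, -1, Integer::sum) == 0) {
                    ready.add(s);
                }
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("start", List.of("fork"));
        edges.put("fork", List.of("pig", "mr")); // two branches run in parallel
        edges.put("pig", List.of("join"));
        edges.put("mr", List.of("join"));        // join waits for both branches
        edges.put("join", List.of("end"));
        edges.put("end", List.of());
        System.out.println(topologicalOrder(edges));
    }
}
```

The cycle check reflects why a DAG is required: with a cycle, some node would never have all of its dependencies satisfied.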

  9. Workflow Language

  10. Oozie Workflow Application • An HDFS Directory containing: • Definition file: workflow.xml • Configuration file: config-default.xml • App files: lib/ directory with JAR and other dependencies

  11. WordCount Workflow

          <workflow-app name='wordcount-wf'>
              <start to='wordcount'/>
              <action name='wordcount'>
                  <map-reduce>
                      <job-tracker>foo.com:9001</job-tracker>
                      <name-node>hdfs://bar.com:9000</name-node>
                      <configuration>
                          <property>
                              <name>mapred.input.dir</name>
                              <value>${inputDir}</value>
                          </property>
                          <property>
                              <name>mapred.output.dir</name>
                              <value>${outputDir}</value>
                          </property>
                      </configuration>
                  </map-reduce>
                  <ok to='end'/>
                  <error to='kill'/>
              </action>
              <kill name='kill'/>
              <end name='end'/>
          </workflow-app>

      [Diagram: Start to M-R wordcount; OK to End; Error to Kill]
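The `${inputDir}` and `${outputDir}` placeholders in the WordCount workflow are filled in from job parameters at submission time. A minimal sketch of that substitution follows. It handles plain `${name}` placeholders only, whereas Oozie's expression language also supports functions; the class `ParamSubstitution` and the sample values are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for workflow parameter substitution.
final class ParamSubstitution {

    // Matches ${name}, where name is a simple identifier.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    // Replaces each ${name} with its value; fails loudly if a value is missing.
    static String resolve(String text, Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("unresolved parameter: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = Map.of(
                "inputDir", "/user/demo/in",    // sample values, not from the slides
                "outputDir", "/user/demo/out");
        System.out.println(resolve("<value>${inputDir}</value>", params));
        System.out.println(resolve("<value>${outputDir}</value>", params));
    }
}
```

Failing on an unresolved name, rather than leaving the placeholder in place, surfaces a missing property before any job is launched.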

  12. Coordinators • Oozie executes workflows based on • Time Dependency • Data Dependency [Architecture diagram labels: Oozie Client, WS API, Tomcat, Oozie Coordinator, Oozie Workflow, Check Data Availability, Hadoop]

  13. Time Triggers

          <coordinator-app name="coord1"
                           start="2009-01-01T00:00Z"
                           end="2010-01-01T00:00Z"
                           frequency="15"
                           xmlns="uri:oozie:coordinator:0.1">
              <action>
                  <workflow>
                      <app-path>hdfs://bar:9000/apps/processor-wf</app-path>
                      <configuration>
                          <property>
                              <name>key1</name>
                              <value>value1</value>
                          </property>
                      </configuration>
                  </workflow>
              </action>
          </coordinator-app>
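To make the time trigger concrete, the sketch below lists the nominal times a coordinator like `coord1` would produce, reading `frequency="15"` as 15 minutes. Two assumptions are made here that the slide does not state: the `end` time is treated as exclusive, and the class `TimeTrigger` with its `limit` parameter is hypothetical.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how a time-triggered coordinator materializes run times.
final class TimeTrigger {

    // Returns start, start + frequency, ... strictly before 'end' (assumed exclusive),
    // stopping after 'limit' entries so that long ranges stay manageable.
    static List<OffsetDateTime> nominalTimes(OffsetDateTime start, OffsetDateTime end,
                                             Duration frequency, int limit) {
        if (frequency.isZero() || frequency.isNegative()) {
            throw new IllegalArgumentException("frequency must be positive");
        }
        List<OffsetDateTime> times = new ArrayList<>();
        for (OffsetDateTime t = start;
             t.isBefore(end) && times.size() < limit;
             t = t.plus(frequency)) {
            times.add(t);
        }
        return times;
    }

    public static void main(String[] args) {
        OffsetDateTime start = OffsetDateTime.parse("2009-01-01T00:00Z");
        OffsetDateTime end = OffsetDateTime.parse("2010-01-01T00:00Z");
        // Print the first four nominal times of the year-long schedule.
        for (OffsetDateTime t : nominalTimes(start, end, Duration.ofMinutes(15), 4)) {
            System.out.println(t);
        }
    }
}
```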

  14. Data Triggers

          <coordinator-app name="coord1" frequency="${1*HOURS}"...>
              <datasets>
                  <dataset name="logs" frequency="${1*HOURS}"
                           initial-instance="2009-01-01T00:00Z">
                      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
                  </dataset>
              </datasets>
              <input-events>
                  <data-in name="inputLogs" dataset="logs">
                      <instance>${current(0)}</instance>
                  </data-in>
              </input-events>
              <action>
                  <workflow>
                      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
                      <configuration>
                          <property>
                              <name>inputData</name>
                              <value>${dataIn('inputLogs')}</value>
                          </property>
                      </configuration>
                  </workflow>
              </action>
          </coordinator-app>
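The data trigger waits for the dataset instance that `${current(0)}` selects, then passes its URI to the workflow. The sketch below shows one way to picture that resolution. It assumes UTC throughout and assumes that `current(n)` counts dataset instances relative to the latest one at or before the nominal time; the class `DatasetUri` is hypothetical and is not Oozie's implementation.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical sketch of dataset instance selection and URI template expansion.
final class DatasetUri {

    // Instances occur at initialInstance + k * frequency. Index 0 is the latest
    // instance at or before nominalTime; negative n selects earlier instances.
    static OffsetDateTime current(OffsetDateTime initialInstance, Duration frequency,
                                  OffsetDateTime nominalTime, int n) {
        long elapsedMillis = Duration.between(initialInstance, nominalTime).toMillis();
        long index = Math.floorDiv(elapsedMillis, frequency.toMillis()) + n;
        return initialInstance.plus(frequency.multipliedBy(index));
    }

    // Expands the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders (UTC assumed).
    static String resolve(String template, OffsetDateTime instance) {
        OffsetDateTime utc = instance.withOffsetSameInstant(ZoneOffset.UTC);
        return template
                .replace("${YEAR}", String.format(Locale.ROOT, "%04d", utc.getYear()))
                .replace("${MONTH}", String.format(Locale.ROOT, "%02d", utc.getMonthValue()))
                .replace("${DAY}", String.format(Locale.ROOT, "%02d", utc.getDayOfMonth()))
                .replace("${HOUR}", String.format(Locale.ROOT, "%02d", utc.getHour()));
    }

    public static void main(String[] args) {
        String template = "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}";
        OffsetDateTime initial = OffsetDateTime.parse("2009-01-01T00:00Z");
        OffsetDateTime nominal = OffsetDateTime.parse("2009-01-02T05:00Z");
        OffsetDateTime instance = current(initial, Duration.ofHours(1), nominal, 0);
        System.out.println(resolve(template, instance));
    }
}
```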

  15. Bundle • Bundles are higher-level abstractions that batch a set of coordinators together • No explicit dependencies between them, but they can be used to define a pipeline

  16. Interacting with Oozie • Read-Only Web Console • CLI • Java client • Web Service Endpoints • Directly with Oozie DB using SQL

  17. Extending Oozie • Minimal workflow language containing a handful of controls and actions • Extensibility for custom action nodes • Creation of a custom action requires: • Java implementation, extending ActionExecutor • Implementation of the action's XML schema, which defines the action's configuration parameters • Packaging of the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR • Extending oozie-site.xml to register information about the custom executor

  18. What do I need to deploy a workflow? • coordinator.xml • workflow.xml • Libraries • Properties • Contains things like NameNode and ResourceManager addresses and other job-specific properties

  19. Configuring Workflows • Three mechanisms to configure a workflow • config-default.xml • job.properties • Job Arguments • Processed as such: • Use all of the parameters from command line invocation • Anything unresolved? Use job.properties • Use config-default.xml for everything else
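The precedence described above amounts to layering three maps, with later layers overriding earlier ones. A minimal sketch, assuming each source has already been loaded into a map; the class `WorkflowConfig` and the sample keys and values are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the three-level configuration precedence.
final class WorkflowConfig {

    // Lowest priority first: config-default.xml, then job.properties,
    // then command-line arguments, which win over everything else.
    static Map<String, String> resolve(Map<String, String> configDefault,
                                       Map<String, String> jobProperties,
                                       Map<String, String> commandLine) {
        Map<String, String> merged = new LinkedHashMap<>(configDefault);
        merged.putAll(jobProperties);
        merged.putAll(commandLine);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of("queueName", "default", "retries", "1");
        Map<String, String> jobProps = Map.of("queueName", "etl", "inputDir", "/data/in");
        Map<String, String> cli = Map.of("inputDir", "/data/override");

        Map<String, String> effective = resolve(defaults, jobProps, cli);
        System.out.println("queueName = " + effective.get("queueName"));
        System.out.println("inputDir  = " + effective.get("inputDir"));
        System.out.println("retries   = " + effective.get("retries"));
    }
}
```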

  20. Okay, I've built those • Now you can put it in HDFS and run it

          hdfs dfs -put my_job oozie/app
          oozie job -run -config job.properties

  21. Java Action • A Java action will execute the main method of the specified Java class • Java classes should be packaged in a JAR and placed in the workflow application's lib directory • wf-app-dir/workflow.xml • wf-app-dir/lib • wf-app-dir/lib/myJavaClasses.JAR

  22. Java Action

          $ java -Xms512m a.b.c.MyMainClass arg1 arg2

      Equivalent action definition:

          <action name='java1'>
              <java>
                  ...
                  <main-class>a.b.c.MyMainClass</main-class>
                  <java-opts>-Xms512m</java-opts>
                  <arg>arg1</arg>
                  <arg>arg2</arg>
                  ...
              </java>
          </action>

  23. Java Action Execution • Executed as a MR job with a single task • So you need the MR information

          <action name='java1'>
              <java>
                  <job-tracker>foo.bar:8021</job-tracker>
                  <name-node>foo1.bar:8020</name-node>
                  ...
                  <configuration>
                      <property>
                          <name>abc</name>
                          <value>def</value>
                      </property>
                  </configuration>
              </java>
          </action>

  24. Capturing Output • How to pass parameter from my Java action to other actions? • Add the <capture-output/> element to your Java action • Reference the parameter in your following actions • Write some Java code to link them

  25. Capturing Output: the Java action

          <action name='java1'>
              <java>
                  <job-tracker>${jobTracker}</job-tracker>
                  <name-node>${nameNode}</name-node>
                  <configuration>
                      <property>
                          <name>mapred.job.queue.name</name>
                          <value>${queueName}</value>
                      </property>
                  </configuration>
                  <main-class>org.apache.oozie.test.MyTest</main-class>
                  <arg>${outputFileName}</arg>
                  <capture-output/>
              </java>
              <ok to="pig1"/>
              <error to="fail"/>
          </action>

  26. Capturing Output: the Pig action that reads the value

          <action name='pig1'>
              <pig>
                  <job-tracker>${jobTracker}</job-tracker>
                  <name-node>${nameNode}</name-node>
                  <configuration>
                      <property>
                          <name>mapred.job.queue.name</name>
                          <value>${queueName}</value>
                      </property>
                  </configuration>
                  <script>script.pig</script>
                  <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
              </pig>
              <ok to="end"/>
              <error to="fail"/>
          </action>

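  27. Capturing Output: the Java code that writes the value

          import java.io.File;
          import java.io.FileOutputStream;
          import java.io.OutputStream;
          import java.util.Properties;

          public class MyTest {
              public static void main(String[] args) {
                  String fileName = args[0];
                  try {
                      // Oozie supplies the path of the capture-output file in this system property.
                      File file = new File(System.getProperty("oozie.action.output.properties"));
                      Properties props = new Properties();
                      props.setProperty("PASS_ME", "123456");
                      OutputStream os = new FileOutputStream(file);
                      props.store(os, "");
                      os.close();
                      System.out.println(file.getAbsolutePath());
                  } catch (Exception e) {
                      e.printStackTrace();
                  }
                  System.exit(0);
              }
          }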
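The capture-output mechanism is a round trip through a Java properties file: the action writes key/value pairs, and a later action looks a key up, as `wf:actionData('java1')['PASS_ME']` does. The sketch below imitates both halves locally with a temporary file. How Oozie loads the file internally is not shown in the slides, so the `read` half is an assumption, and the class `CapturedOutput` is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical local imitation of the capture-output round trip.
final class CapturedOutput {

    // What the Java action does: store one property in the output file.
    static void write(Path file, String key, String value) throws IOException {
        Properties props = new Properties();
        props.setProperty(key, value);
        // try-with-resources closes the stream even if store() throws.
        try (OutputStream os = Files.newOutputStream(file)) {
            props.store(os, "");
        }
    }

    // Stand-in for the later lookup: load the file and fetch one key.
    static String read(Path file, String key) throws IOException {
        Properties props = new Properties();
        try (InputStream is = Files.newInputStream(file)) {
            props.load(is);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("action-output", ".properties");
        try {
            write(file, "PASS_ME", "123456");
            System.out.println("PASS_ME = " + read(file, "PASS_ME"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```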

  28. Web Console

  29. Coordinators

  30. Coordinator Details

  31. Job Details

  32. Job DAG

  33. Job Details

  34. Action Details

  35. Job Tracker

  36. A Use Case: Hourly Jobs • Replace a CRON job that runs a bash script once a day • Java main class that pulls data from a file stream and dumps it to HDFS • Runs a MapReduce job on the files • Emails a person when finished • Start within X amount of time • Complete within Y amount of time • And retry Z times on failure

  37. Use Case: workflow.xml

          <workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
              <start to="java-node"/>
              <action name="java-node">
                  <java>
                      <job-tracker>foo:9001</job-tracker>
                      <name-node>bar:9000</name-node>
                      <main-class>org.foo.bar.PullFileStream</main-class>
                  </java>
                  <ok to="mr-node"/>
                  <error to="fail"/>
              </action>
              <action name="mr-node">
                  <map-reduce>
                      <job-tracker>foo:9001</job-tracker>
                      <name-node>bar:9000</name-node>
                      <configuration>
                          ...
                      </configuration>
                  </map-reduce>
                  <ok to="email-node"/>
                  <error to="fail"/>
              </action>
              ...
              <action name="email-node">
                  <email xmlns="uri:oozie:email-action:0.1">
                      <to>customer@foo.bar</to>
                      <cc>employee@foo.bar</cc>
                      <subject>Email notification</subject>
                      <body>The wf completed</body>
                  </email>
                  <ok to="myotherjob"/>
                  <error to="errorcleanup"/>
              </action>
              <end name="end"/>
              <kill name="fail"/>
          </workflow-app>

  38. Use Case: coordinator.xml

          <?xml version="1.0"?>
          <coordinator-app end="${COORD_END}"
                           frequency="${coord:days(1)}"
                           name="daily_job_coord"
                           start="${COORD_START}"
                           timezone="UTC"
                           xmlns="uri:oozie:coordinator:0.1"
                           xmlns:sla="uri:oozie:sla:0.1">
              <action>
                  <workflow>
                      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
                  </workflow>
                  <sla:info>
                      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
                      <sla:should-start>${X * MINUTES}</sla:should-start>
                      <sla:should-end>${Y * MINUTES}</sla:should-end>
                      <sla:alert-contact>foo@bar.com</sla:alert-contact>
                  </sla:info>
              </action>
          </coordinator-app>
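The `sla:should-start` and `sla:should-end` values express deadlines of X and Y minutes for a run. A minimal sketch of such a check follows, assuming both deadlines are measured from the nominal time. The class `SlaCheck` and its `Status` values are hypothetical and do not correspond to Oozie's own SLA event types.

```java
import java.time.Duration;
import java.time.OffsetDateTime;

// Hypothetical sketch of a start/end deadline check relative to a nominal time.
final class SlaCheck {

    enum Status { MET, START_MISS, END_MISS }

    // A run meets the SLA if it starts within shouldStart of the nominal time
    // and finishes within shouldEnd of the nominal time.
    static Status evaluate(OffsetDateTime nominal, Duration shouldStart, Duration shouldEnd,
                           OffsetDateTime actualStart, OffsetDateTime actualEnd) {
        if (actualStart.isAfter(nominal.plus(shouldStart))) {
            return Status.START_MISS;
        }
        if (actualEnd.isAfter(nominal.plus(shouldEnd))) {
            return Status.END_MISS;
        }
        return Status.MET;
    }

    public static void main(String[] args) {
        OffsetDateTime nominal = OffsetDateTime.parse("2014-03-01T00:00Z");
        Duration x = Duration.ofMinutes(10); // sample value for X
        Duration y = Duration.ofMinutes(60); // sample value for Y

        System.out.println(evaluate(nominal, x, y,
                nominal.plusMinutes(5), nominal.plusMinutes(45)));
        System.out.println(evaluate(nominal, x, y,
                nominal.plusMinutes(15), nominal.plusMinutes(45)));
        System.out.println(evaluate(nominal, x, y,
                nominal.plusMinutes(5), nominal.plusMinutes(90)));
    }
}
```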

  39. Review • Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff • Advanced control flow and action extensibility let Oozie do whatever you need it to do at any point in the workflow • XML is gross

  40. References • http://oozie.apache.org • https://cwiki.apache.org/confluence/display/OOZIE/Index • http://www.slideshare.net/mattgoeke/oozie-riot-games • http://www.slideshare.net/mislam77/oozie-sweet-13451212 • http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie
