Oozie-HCatalog Integration Oozie Team
Agenda
• Why does Oozie need HCatalog support?
• Architecture
• How to support existing synchronous data processing using HCatalog?
• Examples
• Future work
Current Oozie Coordinator
<coordinator-app frequency="${coord:hours(4)}" start="2011-01-01T04:00Z" end="2011-01-10T00:00Z" ...>
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2011-01-01T00:00Z" timezone="UTC">
      <uri-template>
        hdfs://<namenode>:8020/data/click/${YEAR}/${MONTH}/${DAY}/${HOUR}
      </uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <start-instance>${coord:current(-3)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  ...
  <workflow>
    <configuration>
      <property>
        <name>myinput</name>
        <value>${coord:dataIn('coordInput1')}</value>
      </property>
      <property>
        <name>MY_VAR</name>
        <value>ANYVALUE</value>
      </property>
    </configuration>
  </workflow>
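To make the input window concrete, here is a minimal Java sketch of how the range current(-3) to current(0) maps to dataset instance times. It assumes the action's nominal time already falls on a dataset instance boundary; the class and method names are hypothetical and this is not Oozie's own implementation.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Oozie).
final class CoordCurrentSketch {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm'Z'");

    // Lists the instance times for current(startOffset)..current(endOffset),
    // assuming 'nominal' is aligned with the dataset frequency.
    static List<String> instances(ZonedDateTime nominal, int frequencyMinutes,
                                  int startOffset, int endOffset) {
        List<String> result = new ArrayList<>();
        for (int n = startOffset; n <= endOffset; n++) {
            // current(n) = nominal time shifted by n dataset periods
            result.add(FMT.format(nominal.plusMinutes((long) n * frequencyMinutes)));
        }
        return result;
    }
}
```

With a nominal time of 2011-01-01T04:00Z, a dataset frequency of 60 minutes, and offsets -3 to 0, the sketch yields the four hourly instances from 01:00Z to 04:00Z.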
High Level Diagram (Oozie-HCat-Notification)
[Diagram: HCatalog, Oozie, and MessageQ, connected by four numbered steps]
1. Query/Poll Partition
2. Register Topic
3. Notify New Partition
4. Push <New Partition>
Architecture
[Diagram: HCat Server, Message Bus, and Oozie. Inside Oozie: Partition Dependency Manager Service, JMS Message handler, Recovery Service, Materialize Action, and Database]
Flow labels: Cold start; Partition is available; Mark as available; Add entry; Action READY?; Update action table for dependencies; Persist missing dependencies
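One possible reading of the "Add entry" and "Mark as available" labels is an in-memory map from each missing partition to the actions waiting on it. The Java sketch below illustrates that idea only; the class name, method names, and data layout are assumptions, not the actual Partition Dependency Manager Service.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory dependency tracker (illustration only).
final class PartitionDependencySketch {
    // partition URI -> ids of actions still waiting on it
    private final Map<String, Set<String>> waiting = new HashMap<>();
    // action id -> partitions that action is still missing
    private final Map<String, Set<String>> missing = new HashMap<>();

    // "Add entry": record that an action depends on a partition not yet available.
    void addMissingDependency(String actionId, String partitionUri) {
        waiting.computeIfAbsent(partitionUri, k -> new HashSet<>()).add(actionId);
        missing.computeIfAbsent(actionId, k -> new HashSet<>()).add(partitionUri);
    }

    // "Mark as available": called when a new-partition message arrives.
    // Returns the actions whose dependencies are now all satisfied.
    List<String> partitionAvailable(String partitionUri) {
        List<String> ready = new ArrayList<>();
        Set<String> actions = waiting.remove(partitionUri);
        if (actions == null) {
            return ready; // nobody was waiting on this partition
        }
        for (String actionId : actions) {
            Set<String> deps = missing.get(actionId);
            deps.remove(partitionUri);
            if (deps.isEmpty()) {
                missing.remove(actionId);
                ready.add(actionId);
            }
        }
        return ready;
    }
}
```

In this sketch an action becomes ready only after every partition registered for it has been reported; persisting the missing dependencies to the database and the cold-start recovery path are left out.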
HCat-based Dataset in Coordinator
• <dataset name="my_ds" initial-instance="DateStamp" frequency="5" type="metadata">
    <uri-template>
      hcat://server:port/db/mydb/table/T1/?p_key1=v1;p_key2=v2;p_key3=v3
    </uri-template>
  </dataset>
• URI-template example: hcat://server:port/db/mydb/table/clicks/?datestamp=${YEAR}${MONTH}${DAY};region=us
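As a rough illustration of how such a URI template turns into a concrete partition URI, the Java sketch below substitutes the date variables for one instance time. The class name is hypothetical, and only the four variables shown in this deck are handled.

```java
import java.time.LocalDateTime;

// Hypothetical template resolver (illustration only, not Oozie code).
final class UriTemplateSketch {
    // Replaces ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} with zero-padded values.
    static String resolve(String template, LocalDateTime t) {
        return template
                .replace("${YEAR}", String.format("%04d", t.getYear()))
                .replace("${MONTH}", String.format("%02d", t.getMonthValue()))
                .replace("${DAY}", String.format("%02d", t.getDayOfMonth()))
                .replace("${HOUR}", String.format("%02d", t.getHour()));
    }
}
```

For the clicks template and the instance 2012-09-15, the partition part resolves to datestamp=20120915;region=us.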
Input/Output partition
• Oozie will pass input/output partitions to the WF application as a string through the <configuration> section.
• Example of a resolved set of partitions:
  [hcat://server:port/db/mydb/table/clicks/?datestamp=20120915;region=us][hcat://server:port/db/mydb/table/clicks/?datestamp=20120916;region=us]
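A workflow application receiving that string has to split it back into individual URIs. A minimal Java sketch follows, assuming the bracket-delimited format shown in the example above; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the "[uri][uri]..." string (illustration only).
final class ResolvedPartitionsSketch {
    // Matches one bracketed entry and captures the text between the brackets.
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]*)\\]");

    static List<String> split(String resolved) {
        List<String> uris = new ArrayList<>();
        Matcher m = BRACKETED.matcher(resolved);
        while (m.find()) {
            uris.add(m.group(1));
        }
        return uris;
    }
}
```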
Pig script Using HCatalog
• A typical Pig script:
  A = LOAD 'dbname1.tablename1' USING org.apache.hcatalog.pig.HCatLoader();
  B = FILTER A BY (datestamp == '2012-09-12' AND region == 'us') OR (datestamp == '2012-09-11' AND region == 'us');
  my_processed_data = ...
  STORE my_processed_data INTO 'dbname2.tablename2' USING org.apache.hcatalog.pig.HCatStorer('date=20120912', 'a:int,b:chararray,c:map[]');
Map-Reduce Job using HCatalog
Configuration conf = new Configuration();
Job job = new Job(conf, "hcatmapreduce read test");
job.setJarByClass(this.getClass());
job.setMapperClass(HCatMapReduceTest.MapRead.class);
job.setInputFormatClass(HCatInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// Describe the HCatalog table and partition filter to read from
InputJobInfo inputJobInfo = InputJobInfo.create(dbName, tableName, filter, thriftUri, null);
HCatInputFormat.setInput(job, inputJobInfo);
A Typical HCatalog App Needs
• DB name
• Table name
• Thrift URI of the HCat server
  • For Pig, it could be passed as a -D option
  • Q: Will any protocol other than Thrift be supported for HCat?
• Filters
  • Same partition: keys are separated by AND
    • Ex: region = us AND date = 20110811
  • Different partitions: partitions are separated by OR
    • Ex: region = us AND date = 20110811 OR region = us AND date = 20110812
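The filter rules above can be sketched in Java as follows: keys within one partition are joined with AND, and partitions are joined with OR. The class name is hypothetical, and the output deliberately mirrors the unparenthesized form of the examples, which relies on AND binding tighter than OR.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical filter builder (illustration only).
final class PartitionFilterSketch {
    // Each map holds the key/value pairs of one partition, in insertion order
    // when the caller passes a LinkedHashMap.
    static String build(List<Map<String, String>> partitions) {
        return partitions.stream()
                .map(p -> p.entrySet().stream()
                        .map(e -> e.getKey() + " = " + e.getValue())
                        .collect(Collectors.joining(" AND ")))
                .collect(Collectors.joining(" OR "));
    }
}
```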
Parameter Passing from Coordinator
• Oozie provides multiple EL functions for the following:
  • Get the DB name of input/output datasets
    • e.g. getDatabaseIn('dsName') & getDatabaseOut('dsName')
  • Get the table name of a dataset
    • e.g. getTableIn('dsName') & getTableOut('dsName')
  • Get the partition filter string for each input-event
    • getPartitionsPigFilter('in-event')
  • Get a specific partition key's value for use in range filtering
    • getPartitionValue('key', 'dsName')
  • Get the partition definition for each output-event
    • getOutputPartitionsPig('out-event')
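To show the kind of values these functions would hand to a workflow, here is a Java sketch that pulls the database name, table name, and one partition value out of a resolved URI. It assumes the hcat://server:port/db/<db>/table/<table>/?k=v;k=v layout used in this deck; the class is hypothetical and does not reflect how Oozie implements the EL functions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical URI parser (illustration only, not Oozie's EL implementation).
final class HCatUriSketch {
    private static final Pattern URI_PATTERN =
            Pattern.compile("^hcat://([^/]+)/db/([^/]+)/table/([^/]+)/\\?(.*)$");

    final String server;
    final String database;
    final String table;
    final Map<String, String> partition = new LinkedHashMap<>();

    HCatUriSketch(String uri) {
        Matcher m = URI_PATTERN.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected HCat URI: " + uri);
        }
        server = m.group(1);
        database = m.group(2);
        table = m.group(3);
        // Partition spec is a ';'-separated list of key=value pairs
        for (String pair : m.group(4).split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                partition.put(kv[0], kv[1]);
            }
        }
    }
}
```

For the clicks URI resolved for 2012-09-15, database is mydb, table is clicks, and partition.get("datestamp") is 20120915.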
Future work
• Support asynchronous data processing?
• Support wildcard-like matching through HCatalog's mark-set-done feature.
Challenges
• Scalability, scalability…