An introduction to Apache HCatalog: what is it, why is it useful, and how can it help Pig, Hive and MapReduce users on Hadoop share data?
Apache HCatalog
• What is it?
• How does it work?
• Interfaces
• Architecture
• Example
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – What is it?
• A set of interfaces to the Hive metastore
• Shared schema and data types for Hadoop tools
• REST interface for external data access
• Assists interoperability between
    • Pig, Hive and MapReduce
• Table abstraction of data storage
• Will provide data availability notifications
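The REST interface mentioned above is provided by WebHCat (formerly Templeton). As a minimal sketch, the snippet below only builds the URL that asks WebHCat to describe one table; it does not contact a server. The host name hcat.example.com and the user sally are hypothetical, 50111 is assumed to be the WebHCat port (its usual default), and the table name rawevents is borrowed from the example slide.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes a single table.

    Assumes the standard /templeton/v1/ddl/... path layout and the
    user.name query parameter used by WebHCat.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host and user; the table name comes from the example slide.
url = webhcat_table_url("hcat.example.com", "default", "rawevents", "sally")
print(url)
```

An HTTP GET on such a URL is expected to return the table description as JSON, so a tool outside Hadoop can read the schema without going through Pig, Hive or MapReduce.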
HCatalog – How does it work?
• Pig
    • HCatLoader + HCatStorer interface
• MapReduce
    • HCatInputFormat + HCatOutputFormat interface
• Hive
    • No interface necessary
    • Direct access to metadata
• Notifications when data available
HCatalog – Interfaces
• Interface via
    • Pig
    • MapReduce
    • Hive
    • Streaming
• Access data via
    • ORC file
    • RCFile
    • Text file
    • SequenceFile
    • Custom format
HCatalog – Interfaces (diagram)
HCatalog – Architecture (diagram)
HCatalog – Example
A data flow example from hive.apache.org

First, Joe in data acquisition uses distcp to get data onto the grid.

    hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
    hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second, Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

    A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
    B = filter A by bot_finder(zeta) == 0;
    …
    store Z into 'data/processedevents/20100819/data';

With HCatalog, a JMS message is sent when the data is available. The Pig job can then be started.

    A = load 'rawevents' using HCatLoader();
    B = filter A by date == '20100819' and bot_finder(zeta) == 0;
    …
    store Z into 'processedevents' using HCatStorer("date=20100819");

Note that the Pig job refers to the data by the table name rawevents rather than by a location.

Now access the data via Hive QL:

    select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
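The hand-off in this example (a producer registers a partition, a consumer is notified and reads the data by table name) can be modelled with a toy in-memory catalog. This is an illustrative sketch only: ToyCatalog, its methods, and start_cleansing_job are hypothetical names, not part of the HCatalog API, and a plain Python callback stands in for the JMS message.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Table:
    name: str
    schema: Dict[str, str]
    # Maps a partition value (e.g. '20100819') to its storage location.
    partitions: Dict[str, str] = field(default_factory=dict)


class ToyCatalog:
    """Hypothetical stand-in for a shared metastore with notifications."""

    def __init__(self) -> None:
        self._tables: Dict[str, Table] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    def create_table(self, name: str, schema: Dict[str, str]) -> None:
        self._tables[name] = Table(name, dict(schema))

    def subscribe(self, listener: Callable[[str, str], None]) -> None:
        # A callback plays the role of a JMS subscriber here.
        self._listeners.append(listener)

    def add_partition(self, table: str, value: str, location: str) -> None:
        self._tables[table].partitions[value] = location
        for listener in self._listeners:
            listener(table, value)  # announce that the data is available

    def locate(self, table: str, value: str) -> str:
        # Consumers ask by table name; the catalog knows the location.
        return self._tables[table].partitions[value]


catalog = ToyCatalog()
catalog.create_table("rawevents", {"alpha": "int", "beta": "chararray"})

started = []


def start_cleansing_job(table: str, partition: str) -> None:
    # Sally's job is triggered by the notification, not by polling HDFS.
    started.append((table, partition, catalog.locate(table, partition)))


catalog.subscribe(start_cleansing_job)

# Joe registers the new partition after copying the data.
catalog.add_partition("rawevents", "20100819", "hdfs://data/rawevents/20100819/data")
print(started)
```

The consumer never hard-codes the HDFS path. If the producer later moves or re-formats the data, only the catalog entry changes.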
Contact Us
• Feel free to contact us at
    • www.semtech-solutions.co.nz
    • info@semtech-solutions.co.nz
• We offer IT project consultancy
• We are happy to hear about your problems
• You pay only for the hours you need to solve your problems