Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
What is Apache Pig?
• An engine that executes Pig Latin, a high-level data processing language, locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig Latin example
• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
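For completeness, a final projection (not on the original slide; alias and output path are illustrative) would produce the page list the query asks for:
RESULT = foreach PVs_u20s generate url;
store RESULT into 'pages_visited_by_20s';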
Why Pig?
• Faster development
• Fewer lines of code
• Don't re-invent the wheel
• Flexible
• Metadata is optional
• Extensible
• Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Pig optimizations
• Ideally, the user should not have to bother
• Reality
  • Pig is still young and immature
  • Pig does not have the whole picture
    • Cluster configuration
    • Data histogram
• Pig philosophy: Pig is docile
Pig optimizations
• What Pig does for you
  • Safe transformations of the query to optimize it
  • Optimized operations (join, sort)
• What you do
  • Organize input in an optimal way
  • Optimize the Pig Latin query
  • Tell Pig which join/group algorithm to use
Rule-based optimizer
• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer
Column pruner
• Pig will do column pruning automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0 + a1;
C = order B by $0;
store C into 'output';
Pig will prune a2 automatically.
• Cases where Pig will not do column pruning automatically: no schema specified in the load statement.
A = load 'input';
B = order A by $0;
C = foreach B generate $0 + $1;
store C into 'output';
• DIY: project the needed columns yourself.
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0 + $1;
store C into 'output';
Column pruner
• Another case where Pig does not do column pruning: Pig does not keep track of unused columns after grouping.
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
• DIY: project before grouping (note the bag inside B is named A1 after grouping A1).
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
Push up filter
• Pig splits the filter condition before pushing it up.
• Original query: join A and B, then filter by a0 > 0 and b0 > 10.
• Split filter condition: the conjunction is split into a0 > 0 and b0 > 10.
• Push up filter: each half is applied to its own input (A or B) before the join.
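A minimal sketch of the rewrite (aliases and inputs are illustrative):
-- what you write
A = load 'input1' as (a0, a1);
B = load 'input2' as (b0, b1);
C = join A by a0, B by b0;
D = filter C by a0 > 0 and b0 > 10;
-- what Pig effectively runs
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
C1 = join A1 by a0, B1 by b0;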
Other push up/down
• Push down flatten: flatten multiplies the number of records, so Pig moves it below the order, sorting fewer records.
A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';
• Push up limit: the limit is moved above the foreach, and where possible into the load or order, so upstream operators produce fewer records; a sketch follows below.
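A sketch of push up limit (aliases are illustrative):
A = load 'input';
B = foreach A generate $0 + $1;
C = limit B 10;
-- Pig applies the limit before the foreach, and the loader can stop reading after 10 records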
Partition pruning
• Prune unnecessary partitions entirely
• HCatLoader: before optimization, all partitions (2010, 2011, 2012) are loaded and then filtered by year >= 2011; after optimization, HCatLoader reads only the 2011 and 2012 partitions.
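A minimal sketch, assuming an HCatalog table partitioned by year (the table name is illustrative, and the loader's package name varies across HCatalog versions):
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;
-- the filter is pushed into HCatLoader, so the 2010 partition is never read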
Intermediate file compression
• A Pig script compiles into a chain of map/reduce jobs, with a Pig temp file written between consecutive jobs.
• Intermediate file between map and reduce: Snappy
• Temp file between mapreduce jobs: no compression by default
Enable temp file compression
• Pig temp files are not compressed by default
  • Issues with Snappy (HADOOP-7990)
  • LZO: not under the Apache license
• Enable LZO compression
  • Install LZO for Hadoop
  • In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
• With LZO, more than 90% disk savings and up to 4x query speedup
Multiquery
• Combines two or more map/reduce jobs into one, e.g. a single load feeding three pipelines (group by $0, $1, $2), each with its own foreach and store
• Happens automatically
• Cases where we want to control multiquery: when it combines too many jobs
Control multiquery
• Disable multiquery: command line option -M
• Use "exec" to mark the boundary:
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
Implement the right UDF
• Algebraic UDF
  • Initial: runs in the map
  • Intermediate: runs in the combiner
  • Final: runs in the reduce
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A.$1);
store C0 into 'output0';
Implement the right UDF
• Accumulator UDF
  • Reduce side UDF
  • Normally takes a bag
• Benefit
  • Big bags are passed in batches
  • Avoids using too much memory
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

public class my_accum extends EvalFunc<Long> implements Accumulator<Long> {
    public void accumulate(Tuple b) throws IOException {
        // called with one chunk of the bag at a time
    }
    public Long getValue() {
        // called after all bag chunks are processed
    }
    public void cleanup() {
        // reset state between keys
    }
}
• Batch size: pig.accumulative.batchsize=20000
Memory optimization
• Control bag size on the reduce side; in mapreduce, the reducer receives an iterator over values, which Pig materializes into bags (e.g. one bag per join input):
reduce(Text key, Iterator<Writable> values, ...)
• If a bag's size exceeds the threshold, it spills to disk
• Control the bag size so the bag fits in memory if possible:
pig.cachedbag.memusage=0.2
Optimization starts before Pig
• Input format
• Serialization format
• Compression
Input format - test query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)
Input formats
Columnar format
• RCFile
  • Columnar format for a group of rows
  • More efficient if you query a subset of columns
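A sketch of reading one column of an RCFile-backed table through HCatalog (table and column names are illustrative, and the loader's package name varies across HCatalog versions):
A = load 'rcfile_table' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate col1;
-- with a columnar format, only col1's data needs to be read from disk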
Tests with RCFile
• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns
RCFile test results
Cost based optimizations
• Optimization decisions based on your query/data
• Often an iterative process: run query, measure, tune
Cost based optimization - aggregation
• Hash-based aggregation (HBA): the map task aggregates its output in a hash table before it is shuffled to the reduce task
• Use pig.exec.mapPartAgg=true to enable
Cost based optimization – hash agg.
• Auto-off feature: switches off HBA if the output reduction is not good enough
• Configuring hash agg (see the sketch below)
  • Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  • Configure the memory used: pig.cachedbag.memusage
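A sketch of setting these from a Pig script (the values are illustrative, not recommendations):
set pig.exec.mapPartAgg true;
set pig.exec.mapPartAgg.minReduction 10;
set pig.cachedbag.memusage 0.2;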
Cost based optimization - join
• Use the appropriate join algorithm
  • Skew on the join key: skew join
  • One input fits in memory: FR (fragment-replicate) join
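Both are selected with the using clause (aliases are illustrative; in a replicated join, the last relation is the one held in memory):
C = join A by k, B by k using 'skewed';            -- skew join
D = join BIG by k, SMALL by k using 'replicated';  -- FR join; SMALL must fit in memory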
Cost based optimization – MR tuning
• Tune MR parameters to reduce IO
  • Control spills using map-side sort parameters
  • Reduce shuffle/sort-merge parameters
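A sketch using mapred-era Hadoop parameter names, passed through Pig's set command; exact names vary by Hadoop version and the values are illustrative:
set io.sort.mb 512;      -- larger map-side sort buffer, fewer spills
set io.sort.factor 100;  -- merge more spill files per pass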
Parallelism of reduce tasks
• Example: number of reduce slots = 6
• Factors affecting runtime
  • Cores simultaneously used / skew
  • Cost of having additional reduce tasks
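Reduce parallelism can be set script-wide or per operator (the value 6 matches the slot count above):
set default_parallel 6;
B = group A by $0 parallel 6;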
Cost based optimization – keep data sorted
• Frequent join operations on the same keys: keep the data sorted on those keys
• Use merge join
• Optimized group on sorted keys
• Works with only a few load functions; needs an additional interface implementation
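A sketch, assuming both inputs are already sorted on k and the load functions support these operators (aliases are illustrative):
C = join A by k, B by k using 'merge';
D = group A by k using 'collected';  -- optimized group on sorted/collocated keys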
Optimizations for sorted data
Future directions
• Optimize using stats
  • Using historical stats with HCatalog
  • Sampling
Questions?