Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
What is Apache Pig?
• An engine that executes Pig Latin, a high-level data processing language, locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
Pig Latin example
• Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
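For completeness, a final projection (not on the original slide; alias and output path are illustrative) would produce the page list the query asks for:
RESULT = foreach PVs_u20s generate url;
store RESULT into 'pages_visited_by_20s';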
Why Pig?
• Faster development
• Fewer lines of code
• Don't re-invent the wheel
• Flexible
• Metadata is optional
• Extensible
• Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
Pig optimizations
• Ideally, the user should not have to bother
• Reality
  • Pig is still young and immature
  • Pig does not have the whole picture
    • Cluster configuration
    • Data histogram
• Pig philosophy: Pig is docile
Pig optimizations
• What Pig does for you
  • Safe transformations of the query to optimize it
  • Optimized operations (join, sort)
• What you do
  • Organize input in an optimal way
  • Optimize the Pig Latin query
  • Tell Pig which join/group algorithm to use
Rule-based optimizer
• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer
Column pruner
• Pig will do column pruning automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0 + a1;
C = order B by $0;
store C into 'output';
Pig will prune a2 automatically.
• Cases where Pig will not do column pruning automatically: no schema specified in the load statement.
A = load 'input';
B = order A by $0;
C = foreach B generate $0 + $1;
store C into 'output';
• DIY: project the needed columns yourself.
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0 + $1;
store C into 'output';
Column pruner
• Another case where Pig does not do column pruning: Pig does not keep track of unused columns after grouping.
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
• DIY: project before grouping (note the bag inside B is named A1 after grouping A1).
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';
Push up filter
• Pig splits the filter condition before pushing it up.
• Original query: join A and B, then filter by a0 > 0 and b0 > 10.
• Split filter condition: the conjunction is split into a0 > 0 and b0 > 10.
• Push up filter: each half is applied to its own input (A or B) before the join.
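A minimal sketch of the rewrite (aliases and inputs are illustrative):
-- what you write
A = load 'input1' as (a0, a1);
B = load 'input2' as (b0, b1);
C = join A by a0, B by b0;
D = filter C by a0 > 0 and b0 > 10;
-- what Pig effectively runs
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
C1 = join A1 by a0, B1 by b0;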
Other push up/down
• Push down flatten: flatten multiplies the number of records, so Pig moves it below the order, sorting fewer records.
A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';
• Push up limit: the limit is moved above the foreach, and where possible into the load or order, so upstream operators produce fewer records; a sketch follows below.
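A sketch of push up limit (aliases are illustrative):
A = load 'input';
B = foreach A generate $0 + $1;
C = limit B 10;
-- Pig applies the limit before the foreach, and the loader can stop reading after 10 records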
Partition pruning
• Prune unnecessary partitions entirely
• HCatLoader: before optimization, all partitions (2010, 2011, 2012) are loaded and then filtered by year >= 2011; after optimization, HCatLoader reads only the 2011 and 2012 partitions.
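A minimal sketch, assuming an HCatalog table partitioned by year (the table name is illustrative, and the loader's package name varies across HCatalog versions):
A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;
-- the filter is pushed into HCatLoader, so the 2010 partition is never read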
Intermediate file compression
• A Pig script compiles into a chain of map/reduce jobs, with a Pig temp file written between consecutive jobs.
• Intermediate file between map and reduce: Snappy
• Temp file between mapreduce jobs: no compression by default
Enable temp file compression
• Pig temp files are not compressed by default
  • Issues with Snappy (HADOOP-7990)
  • LZO: not under the Apache license
• Enable LZO compression
  • Install LZO for Hadoop
  • In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
• With LZO, more than 90% disk savings and up to 4x query speedup
Multiquery
• Combines two or more map/reduce jobs into one, e.g. a single load feeding three pipelines (group by $0, $1, $2), each with its own foreach and store
• Happens automatically
• Cases where we want to control multiquery: when it combines too many jobs
Control multiquery
• Disable multiquery: command line option -M
• Use "exec" to mark the boundary:
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';
Implement the right UDF
• Algebraic UDF
  • Initial: runs in the map
  • Intermediate: runs in the combiner
  • Final: runs in the reduce
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A.$1);
store C0 into 'output0';
Implement the right UDF
• Accumulator UDF
  • Reduce side UDF
  • Normally takes a bag
• Benefit
  • Big bags are passed in batches
  • Avoids using too much memory
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';

public class my_accum extends EvalFunc<Long> implements Accumulator<Long> {
    public void accumulate(Tuple b) throws IOException {
        // called with one chunk of the bag at a time
    }
    public Long getValue() {
        // called after all bag chunks are processed
    }
    public void cleanup() {
        // reset state between keys
    }
}
• Batch size: pig.accumulative.batchsize=20000
Memory optimization
• Control bag size on the reduce side; in mapreduce, the reducer receives an iterator over values, which Pig materializes into bags (e.g. one bag per join input):
reduce(Text key, Iterator<Writable> values, ...)
• If a bag's size exceeds the threshold, it spills to disk
• Control the bag size so the bag fits in memory if possible:
pig.cachedbag.memusage=0.2
Optimization starts before Pig
• Input format
• Serialization format
• Compression
Input format - test query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)
Input formats
Columnar format
• RCFile
  • Columnar format for a group of rows
  • More efficient if you query a subset of columns
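A sketch of reading one column of an RCFile-backed table through HCatalog (table and column names are illustrative, and the loader's package name varies across HCatalog versions):
A = load 'rcfile_table' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate col1;
-- with a columnar format, only col1's data needs to be read from disk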
Tests with RCFile
• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns
RCFile test results
Cost based optimizations
• Optimization decisions based on your query/data
• Often an iterative process: run query, measure, tune
Cost based optimization - aggregation
• Hash-based aggregation (HBA): the map task aggregates its output in a hash table before it is shuffled to the reduce task
• Use pig.exec.mapPartAgg=true to enable
Cost based optimization – hash agg.
• Auto-off feature: switches off HBA if the output reduction is not good enough
• Configuring hash agg (see the sketch below)
  • Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
  • Configure the memory used: pig.cachedbag.memusage
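A sketch of setting these from a Pig script (the values are illustrative, not recommendations):
set pig.exec.mapPartAgg true;
set pig.exec.mapPartAgg.minReduction 10;
set pig.cachedbag.memusage 0.2;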
Cost based optimization - join
• Use the appropriate join algorithm
  • Skew on the join key: skew join
  • One input fits in memory: FR (fragment-replicate) join
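Both are selected with the using clause (aliases are illustrative; in a replicated join, the last relation is the one held in memory):
C = join A by k, B by k using 'skewed';            -- skew join
D = join BIG by k, SMALL by k using 'replicated';  -- FR join; SMALL must fit in memory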
Cost based optimization – MR tuning
• Tune MR parameters to reduce IO
  • Control spills using map-side sort parameters
  • Reduce shuffle/sort-merge parameters
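A sketch using mapred-era Hadoop parameter names, passed through Pig's set command; exact names vary by Hadoop version and the values are illustrative:
set io.sort.mb 512;      -- larger map-side sort buffer, fewer spills
set io.sort.factor 100;  -- merge more spill files per pass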
Parallelism of reduce tasks
• Example: number of reduce slots = 6
• Factors affecting runtime
  • Cores simultaneously used / skew
  • Cost of having additional reduce tasks
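Reduce parallelism can be set script-wide or per operator (the value 6 matches the slot count above):
set default_parallel 6;
B = group A by $0 parallel 6;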
Cost based optimization – keep data sorted
• Frequent join operations on the same keys: keep the data sorted on those keys
• Use merge join
• Optimized group on sorted keys
• Works with only a few load functions; needs an additional interface implementation
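A sketch, assuming both inputs are already sorted on k and the load functions support these operators (aliases are illustrative):
C = join A by k, B by k using 'merge';
D = group A by k using 'collected';  -- optimized group on sorted/collocated keys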
Optimizations for sorted data
Future directions
• Optimize using stats
  • Using historical stats with HCatalog
  • Sampling
Questions?