Making Pig Fly: Optimizing Data Processing on Hadoop

Making Pig Fly: Optimizing Data Processing on Hadoop. Daniel Dai (@daijy), Thejas Nair (@thejasn). What is Apache Pig? An engine that executes Pig Latin locally or on a Hadoop cluster. Pig-latin-cup picture from http://www.flickr.com/photos/frippy/2507970530/

Presentation Transcript


  1. Making Pig Fly: Optimizing Data Processing on Hadoop
     Daniel Dai (@daijy), Thejas Nair (@thejasn)

  2. What is Apache Pig?
     Pig Latin, a high-level data processing language.
     An engine that executes Pig Latin locally or on a Hadoop cluster.
     Pig-latin-cup picture from http://www.flickr.com/photos/frippy/2507970530/
     Architecting the Future of Big Data

  3. Pig Latin example
     • Query: get the list of web pages visited by users whose age is between 20 and 29 years.

         USERS = load 'users' as (uid, age);
         USERS_20s = filter USERS by age >= 20 and age <= 29;
         PVs = load 'pages' as (url, uid, timestamp);
         PVs_u20s = join USERS_20s by uid, PVs by uid;

  4. Why Pig?
     • Faster development
     • Fewer lines of code
     • Don't reinvent the wheel
     • Flexible
     • Metadata is optional
     • Extensible
     • Procedural programming
     Picture courtesy of http://www.flickr.com/photos/shutterbc/471935204/

  5. Pig optimizations
     • Ideally, the user should not have to bother
     • Reality
       • Pig is still young and immature
       • Pig does not have the whole picture
         • Cluster configuration
         • Data histogram
     • Pig philosophy: Pig is docile

  6. Pig optimizations
     • What Pig does for you
       • Applies safe transformations to the query to optimize it
       • Uses optimized operations (join, sort)
     • What you do
       • Organize the input in an optimal way
       • Optimize the Pig Latin query
       • Tell Pig which join/group algorithm to use

  7. Rule-based optimizer
     • Column pruner
     • Push up filter
     • Push down flatten
     • Push up limit
     • Partition pruning
     • Global optimizer

  8. Column pruner
     • Pig will do column pruning automatically. In this script, Pig prunes a2 automatically:

         A = load 'input' as (a0, a1, a2);
         B = foreach A generate a0+a1;
         C = order B by $0;
         store C into 'output';

     • Cases where Pig will not do column pruning automatically
       • No schema specified in the load statement:

         A = load 'input';
         B = order A by $0;
         C = foreach B generate $0+$1;
         store C into 'output';

     • DIY: project the needed columns yourself right after the load:

         A = load 'input';
         A1 = foreach A generate $0, $1;
         B = order A1 by $0;
         C = foreach B generate $0+$1;
         store C into 'output';

  9. Column pruner
     • Another case where Pig does not do column pruning
       • Pig does not keep track of unused columns after grouping:

         A = load 'input' as (a0, a1, a2);
         B = group A by a0;
         C = foreach B generate SUM(A.a1);
         store C into 'output';

     • DIY: drop a2 before the group (the grouped bag is then named A1):

         A = load 'input' as (a0, a1, a2);
         A1 = foreach A generate $0, $1;
         B = group A1 by a0;
         C = foreach B generate SUM(A1.a1);
         store C into 'output';

  10. Push up filter
      [Diagram: three plans over inputs A and B. Original query: Join, then Filter (a0>0 && b0>10). Split filter condition: Join, then separate filters a0>0 and b0>10. Push up filter: filter a0>0 on A and filter b0>10 on B, then Join.]
      • Pig splits the filter condition before pushing it up
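
      A minimal Pig Latin sketch of the kind of script this rule targets. Only the conditions a0>0 and b0>10 come from the slide; the input paths, the extra fields a1 and b1, and the join keys are assumptions made for illustration.

      ```pig
      -- Hypothetical inputs; paths and field names are illustrative only.
      A = load 'input_a' as (a0:int, a1:int);
      B = load 'input_b' as (b0:int, b1:int);

      -- As written by the user: one compound filter after the join.
      J = join A by a1, B by b1;
      F = filter J by a0 > 0 and b0 > 10;
      store F into 'output';

      -- Hand-written equivalent of the pushed-up plan, for comparison:
      -- A_f = filter A by a0 > 0;
      -- B_f = filter B by b0 > 10;
      -- J   = join A_f by a1, B_f by b1;
      ```

      Running `explain F;` on a script like this is one way to check where the filters end up in the optimized plan.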

  11. Other push up/down
      • Push down flatten

          A = load 'input' as (a0:bag{}, a1);
          B = foreach A generate flatten(a0), a1;
          C = order B by a1;
          store C into 'output';

        [Diagram: Load, Flatten, Order becomes Load, Order, Flatten.]
      • Push up limit
        [Diagram: Load, Order, Foreach, Limit becomes Load, Order, Limit, Foreach; the limit is then folded into the earlier operators, shown as "Order (limited)" and "Load (limited)".]
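
      The slide gives a script only for the flatten case. A comparable sketch for push up limit might look like the following; the field names and the limit of 10 are assumptions, not taken from the talk.

      ```pig
      -- Illustrative script: the limit is written last, after the foreach.
      A = load 'input' as (a0, a1);
      B = order A by a0;
      C = foreach B generate a0, a1;
      D = limit C 10;   -- candidate for being moved up toward the order
      store D into 'output';
      ```

      Since only 10 rows are kept, applying the limit before the foreach (and inside the sort) avoids processing rows that would be discarded anyway.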

  12. Partition pruning
      [Diagram: partitions 2010, 2011, 2012. Before: HCatLoader reads all partitions, followed by Filter (year>=2011). After: the condition year>=2011 is handed to HCatLoader, so only 2011 and 2012 are read.]
      • Prune unnecessary partitions entirely
      • HCatLoader
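
      A short sketch of a script that can benefit from partition pruning. The table name `web_logs` is hypothetical, `year` is assumed to be a numeric partition column, and the loader class name assumes the HCatalog packaging of the time (later releases ship it as `org.apache.hive.hcatalog.pig.HCatLoader`).

      ```pig
      -- 'web_logs' is a hypothetical HCatalog table partitioned by year.
      A = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();

      -- A filter on the partition column directly after the load is the
      -- pattern that lets the 2010 partition be skipped entirely.
      B = filter A by year >= 2011;
      store B into 'output';
      ```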

  13. Intermediate file compression
      [Diagram: a Pig script compiled into a chain of jobs: map 1 / reduce 1, Pig temp file, map 2 / reduce 2, Pig temp file, map 3 / reduce 3.]
      • Intermediate file between map and reduce
        • Snappy
      • Temp file between MapReduce jobs
        • No compression by default

  14. Enable temp file compression
      • Pig temp files are not compressed by default
        • Issues with Snappy (HADOOP-7990)
        • LZO: not under the Apache license
      • Enable LZO compression
        • Install LZO for Hadoop
        • In conf/pig.properties:

            pig.tmpfilecompression = true
            pig.tmpfilecompression.codec = lzo

      • With LZO, disk savings of up to 90% or more and a query speedup of up to 4x
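
      The slide sets these properties cluster-wide in conf/pig.properties. As a sketch, the same two properties can usually be set for a single script with Pig's `set` command; whether a given property is honored this way is an assumption worth checking against your Pig version.

      ```pig
      -- Per-script alternative to editing conf/pig.properties.
      set pig.tmpfilecompression true;
      set pig.tmpfilecompression.codec lzo;
      ```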

  15. Multiquery
      [Diagram: a single Load feeding three branches, Group by $0, Group by $1, and Group by $2, each followed by Foreach and Store.]
      • Combines two or more MapReduce jobs into one
      • Happens automatically
      • Cases where we want to control multiquery: too many jobs get combined

  16. Control multiquery
      • Disable multiquery
        • Command line option: -M
      • Use "exec" to mark the boundary:

          A = load 'input';
          B0 = group A by $0;
          C0 = foreach B0 generate group, COUNT(A);
          store C0 into 'output0';
          B1 = group A by $1;
          C1 = foreach B1 generate group, COUNT(A);
          store C1 into 'output1';
          exec
          B2 = group A by $2;
          C2 = foreach B2 generate group, COUNT(A);
          store C2 into 'output2';

  17. Implement the right UDF
      • Algebraic UDF
        • Initial (runs in the map)
        • Intermediate (runs in the combiner)
        • Final (runs in the reduce)

          A = load 'input';
          B0 = group A by $0;
          C0 = foreach B0 generate group, SUM(A);
          store C0 into 'output0';

  18. Implement the right UDF
      • Accumulator UDF
        • Reduce-side UDF
        • Normally takes a bag
      • Benefit
        • Big bags are passed in batches
        • Avoids using too much memory
      • Batch size: pig.accumulative.batchsize=20000

          A = load 'input';
          B0 = group A by $0;
          C0 = foreach B0 generate group, my_accum(A);
          store C0 into 'output0';

          my_accum extends Accumulator {
              public void accumulate() {
                  // take a bag chunk
              }
              public void getValue() {
                  // called after all bag chunks are processed
              }
          }

  19. Memory optimization
      MapReduce: reduce(Text key, Iterator<Writable> values, ……)
      [Diagram: the values iterator is split into Bag of Input 1, Bag of Input 2, and Bag of Input 3.]
      • Control bag size on the reduce side: pig.cachedbag.memusage=0.2
        • If the bag size exceeds the threshold, it spills to disk
        • Control the bag size so that the bag fits in memory if possible

  20. Optimization starts before Pig
      • Input format
      • Serialization format
      • Compression

  21. Input format - test query

          > searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
          > search_thejas = filter searches by Query matches '.*thejas.*';
          > dump search_thejas;
          (1568578, thejasminesupperclub, ….)

  22. Input formats

  23. Columnar format
      • RCFile
        • Columnar format for a group of rows
        • More efficient if you query a subset of columns

  24. Tests with RCFile
      • Tests with load + project + filter out all records
      • Using HCatalog, with compression and types
      • Test 1
        • Project 1 out of 5 columns
      • Test 2
        • Project all 5 columns

  25. RCFile test results

  26. Cost-based optimizations
      [Diagram: a cycle of Run query, Measure, Tune.]
      • Optimization decisions based on your query/data
      • Often an iterative process

  27. Cost-based optimization - Aggregation
      [Diagram labels: Map task containing Map (logic), HBA, and M. Output; Reduce task containing HBA and Output.]
      • Hash-based aggregation (HBA)
      • Use pig.exec.mapPartAgg=true to enable

  28. Cost-based optimization - Hash Agg.
      • Auto-off feature
        • Switches off HBA if the output reduction is not good enough
      • Configuring Hash Agg
        • Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
        • Configure the memory used: pig.cachedbag.memusage
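
      A minimal sketch of turning on hash-based aggregation for one script, using the three property names from the slides. The values 10 and 0.2, the input path, and the field names are illustrative assumptions, not tuning advice from the talk.

      ```pig
      -- Enable map-side hash-based aggregation for this script.
      set pig.exec.mapPartAgg true;
      -- Auto-off threshold: illustrative value, tune for your data.
      set pig.exec.mapPartAgg.minReduction 10;
      -- Memory budget shared with cached bags: illustrative value.
      set pig.cachedbag.memusage 0.2;

      -- Hypothetical group-and-aggregate query that can benefit.
      A = load 'input' as (k, v:long);
      B = group A by k;
      C = foreach B generate group, SUM(A.v);
      store C into 'output';
      ```

      The setting pays off most when many rows share the same key within a map task; with mostly distinct keys, the auto-off check is what keeps the extra hashing from becoming pure overhead.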

  29. Cost-based optimization - Join
      • Use the appropriate join algorithm
        • Skew on the join key: skew join
        • One input fits in memory: FR (fragment-replicate) join
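
      A sketch of how the two join algorithms are requested in Pig Latin with the `using` clause. The relation names, paths, and fields are hypothetical.

      ```pig
      -- Hypothetical inputs.
      big_rel    = load 'big_input' as (k, v);
      small_rel  = load 'small_input' as (k, w);
      skewed_rel = load 'skewed_input' as (k, x);

      -- FR join: the relation listed last must be small enough to fit in memory.
      J1 = join big_rel by k, small_rel by k using 'replicated';
      store J1 into 'output_fr';

      -- Skew join: for inputs where a few join keys carry most of the rows.
      J2 = join big_rel by k, skewed_rel by k using 'skewed';
      store J2 into 'output_skew';
      ```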

  30. Cost-based optimization - MR tuning
      • Tune MR parameters to reduce IO
        • Control spills using map sort params
        • Reduce shuffle/sort-merge params
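
      The slide does not name specific parameters. As a sketch, these are Hadoop 1.x property names commonly associated with map-side sort spills and reduce-side shuffle buffering, passed through from a Pig script with `set`. The values are illustrative assumptions only, and newer Hadoop releases use renamed properties (for example `mapreduce.task.io.sort.mb`).

      ```pig
      -- Map side: larger sort buffer and wider merges to reduce spills.
      set io.sort.mb 200;
      set io.sort.factor 50;

      -- Reduce side: fraction of heap used to buffer shuffled map output.
      set mapred.job.shuffle.input.buffer.percent 0.7;
      ```

      In keeping with the run, measure, tune cycle, change one parameter at a time and compare the spilled-records counters between runs.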

  31. Parallelism of reduce tasks
      • Number of reduce slots = 6
      • Factors affecting runtime
        • Cores simultaneously used/skew
        • Cost of having additional reduce tasks
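
      A sketch of the two usual ways to set the number of reduce tasks from a Pig script. The value 6 mirrors the slide's number of reduce slots; the query itself is a made-up example.

      ```pig
      -- Script-wide default for operators that run a reduce phase.
      set default_parallel 6;

      A = load 'input' as (k, v);
      -- Per-operator override with the parallel clause.
      B = group A by k parallel 6;
      C = foreach B generate group, COUNT(A);
      store C into 'output';
      ```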

  32. Cost-based optimization - keep data sorted
      • Frequent join operations on the same keys
        • Keep data sorted on keys
        • Use merge join
      • Optimized group on sorted keys
        • Works with few load functions; needs an additional interface implementation
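
      A sketch of the two sorted-data features named above. It assumes both inputs are already sorted on k; the paths and fields are hypothetical, and `my.pkg.SortedLoader` is a made-up placeholder for a loader that implements the extra interface the slide refers to.

      ```pig
      -- Merge join: both inputs are assumed to be pre-sorted on k.
      A = load 'sorted_a' as (k, v);
      B = load 'sorted_b' as (k, w);
      J = join A by k, B by k using 'merge';
      store J into 'joined_output';

      -- Optimized group on sorted keys. my.pkg.SortedLoader is hypothetical;
      -- it stands in for a loader that supports this group mode.
      C = load 'sorted_c' using my.pkg.SortedLoader() as (k, v);
      G = group C by k using 'collected';
      R = foreach G generate group, COUNT(C);
      store R into 'grouped_output';
      ```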

  33. Optimizations for sorted data

  34. Future directions
      • Optimize using stats
        • Using historical stats with HCatalog
        • Sampling

  35. Questions?
