How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations

Thejas Nair, Pig team @ Yahoo!, Apache Pig PMC member. http://pig.apache.org

What is Pig?
• Pig Latin, a high-level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup picture from http://www.flickr.com/photos/frippy/2507970530/

Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;

Comparison with MR in Java
• 1/20 the lines of code
• 1/16 the development time
What about performance?

Pig Compared to Map Reduce
• Faster development time
• Data flow versus programming logic
• Many standard data operations (e.g. join) included
• Manages all the details of connecting jobs and data flow
• Copes with Hadoop version change issues

And, You Don't Lose Power
• UDFs can be used to load, evaluate, aggregate, and store data
• External binaries can be invoked
• Metadata is optional
• Flexible data model
• Nested data types
• Explicit data flow programming

Pig performance
PigMix: Pig vs. MapReduce [chart]

Pig optimization principles
Unlike an RDBMS, Pig has no accurate models for the data, the operators, or the execution environment. So:
• Use the reliable information that is available.
• Trust the user's choice.
• Use rules that help in most cases.
• Use rules based on runtime information.

Logical Optimizations
Script -> Parser -> Logical Plan -> Logical Optimizer -> Optimized Logical Plan
Script: A = load; B = foreach; C = filter
Logical plan: A -> B -> C
Optimized logical plan: A -> C -> B
Restructure the given logical dataflow graph:
• Apply filter, project, limit early
• Merge foreach, filter statements
• Operator rewrites
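The "apply filter early" rewrite on this slide can be sketched as a small rule over a linear plan. This is only an illustration in Python, not Pig's optimizer: the plan representation (a list of `(operator, alias)` tuples) and the function name `push_filter_early` are hypothetical, and the rule assumes the filter only reads columns that the foreach passes through unchanged.

```python
def push_filter_early(plan):
    """Move each filter ahead of any foreach that directly precedes it.

    plan: list of (operator, alias) tuples in execution order.
    Assumption: the filter only uses columns the foreach leaves unchanged,
    so swapping the two operators does not change the result.
    """
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            if plan[i][0] == "foreach" and plan[i + 1][0] == "filter":
                # Swap so the filter runs first and the foreach sees fewer rows.
                plan[i], plan[i + 1] = plan[i + 1], plan[i]
                changed = True
    return plan


logical_plan = [("load", "A"), ("foreach", "B"), ("filter", "C")]
optimized_plan = push_filter_early(logical_plan)
```

In a hand-written MapReduce job the same idea means dropping unwanted records and unused fields at the very start of the map function, before any expensive per-record work.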

Physical Optimizations
Optimized Logical Plan -> Translator -> Phy/MR Plan -> Optimizer -> Optimized Phy/MR Plan
Optimized L. plan: X -> Y -> Z
Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z)
Optimized Phy/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
Physical plan: a sequence of MR jobs containing physical operators.
• Built-in rules, e.g. use of combiner
• Specified in query, e.g. join type
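To illustrate why inserting a combiner stage (the C(...) step in the optimized plan) helps, here is a minimal Python simulation of a count aggregation. The function names and the sample data are made up for the example; a real job would implement the combiner through Hadoop's own combiner hook.

```python
from collections import Counter


def map_phase(records):
    """Emit (key, 1) for every input record."""
    for name in records:
        yield name, 1


def combine(pairs):
    """Map-side partial aggregation: one (key, partial_count) per key per split."""
    partial = Counter()
    for key, count in pairs:
        partial[key] += count
    return list(partial.items())


def reduce_phase(pairs):
    """Sum the partial counts that arrive from all map tasks."""
    total = Counter()
    for key, count in pairs:
        total[key] += count
    return dict(total)


# Two input splits, each handled by one simulated map task.
splits = [["fred", "jane", "fred"], ["fred", "jane"]]
combined = [combine(map_phase(split)) for split in splits]
shuffled = [pair for part in combined for pair in part]
result = reduce_phase(shuffled)
```

Here five input records shrink to four shuffled pairs; on real data with many repeated keys per split the reduction in shuffle volume is far larger. The trick only works for aggregations that can be computed from partial results, such as count, sum, min, and max.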

Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram] Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).
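The tagged-record flow in the diagram can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Pig's implementation: the function names are hypothetical, the tag values 1 and 2 are arbitrary markers for the source input, and the shuffle is modelled as an in-memory dictionary.

```python
from collections import defaultdict

PAGES_TAG, USERS_TAG = 1, 2  # arbitrary markers identifying the source input


def map_pages(pages):
    for user, url in pages:
        yield user, (PAGES_TAG, url)


def map_users(users):
    for name, age in users:
        yield name, (USERS_TAG, age)


def shuffle(*streams):
    """Group all tagged values by join key, as the MR shuffle would."""
    groups = defaultdict(list)
    for stream in streams:
        for key, tagged in stream:
            groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Separate the values by tag, then emit the cross product for this key."""
    urls = [v for tag, v in tagged_values if tag == PAGES_TAG]
    ages = [v for tag, v in tagged_values if tag == USERS_TAG]
    for age in ages:
        for url in urls:
            yield key, age, url


users = [("fred", 20), ("jane", 22)]
pages = [("fred", "a.com"), ("fred", "b.com"), ("jane", "c.com"), ("zed", "d.com")]
groups = shuffle(map_pages(pages), map_users(users))
joined = sorted(row for key, vals in groups.items() for row in reduce_join(key, vals))
```

Every record from both inputs crosses the network in this plan, which is why the specialised joins on the following slides matter.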

Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Diagram] Map 1 (with SP) reads Pages block n and emits (1, user); Map 2 (with SP) reads Users block m and emits (2, name). Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).
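The diagram shows the records of a hot key (fred) from Pages split across two reducers while the matching Users record is sent to both. A rough Python simulation of that routing is below. It is a sketch, not Pig's algorithm: the `hot_keys` table is supplied by hand here (how it is obtained is outside the sketch), and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def home_reducer(key, num_reducers):
    """Stable hash partitioner (crc32, so results do not vary between runs)."""
    return zlib.crc32(key.encode()) % num_reducers


def targets(key, hot_keys, num_reducers):
    """Reducers that handle this key: one normally, several for a hot key."""
    spread = min(hot_keys.get(key, 1), num_reducers)
    first = home_reducer(key, num_reducers)
    return [(first + i) % num_reducers for i in range(spread)]


def skew_join(pages, users, hot_keys, num_reducers):
    inputs = [{"pages": defaultdict(list), "users": defaultdict(list)}
              for _ in range(num_reducers)]
    seen = defaultdict(int)
    for user, url in pages:
        # Skewed side: round-robin the records of a hot key over its reducers.
        t = targets(user, hot_keys, num_reducers)
        inputs[t[seen[user] % len(t)]]["pages"][user].append(url)
        seen[user] += 1
    for name, age in users:
        # Other side: replicate the record to every reducer handling the key.
        for r in targets(name, hot_keys, num_reducers):
            inputs[r]["users"][name].append(age)
    output = []
    for r, data in enumerate(inputs):
        for key, urls in data["pages"].items():
            for age in data["users"].get(key, []):
                for url in urls:
                    output.append((r, key, url, age))
    return output
```

For example, with `hot_keys={"fred": 2}` and two reducers, four fred page records are divided two and two, so no single reducer has to process the whole hot key.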

Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram] Pages and Users are both sorted by the join key (aaron ... zach). Map 1 reads the Pages split aaron … amr and reads Users starting at aaron; Map 2 reads the Pages split amy … barb and reads Users starting at amy.
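When both inputs are already sorted on the join key, each map task can join by walking the two inputs in step, with no shuffle or reduce. The core merge loop might look like the following Python sketch; the function name and sample data are illustrative, and it covers only the in-memory merge, not how a map task locates its starting position in the second input.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are both sorted by key.

    Yields (key, left_value, right_value) for every matching pair.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left record with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    yield lk, left[i][1], rv
                i += 1
            j = j_end


pages = [("aaron", "a1"), ("aaron", "a2"), ("amy", "b1"), ("zach", "z1")]
users = [("aaron", 30), ("barb", 41), ("zach", 25)]
merged = list(merge_join(pages, users))
```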

Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Diagram] Map 1 reads the Pages split aaron … amr together with a full copy of Users (aaron ... zach); Map 2 reads the Pages split amy … barb together with a full copy of Users (aaron ... zach).
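A replicated join is a map-only join: every map task holds the whole small input in memory and streams its block of the large input past it. A minimal Python sketch, assuming Users is small enough to fit in memory (function names and data are illustrative):

```python
from collections import defaultdict


def build_lookup(users):
    """Load the small input into an in-memory hash table, once per map task."""
    lookup = defaultdict(list)
    for name, age in users:
        lookup[name].append(age)
    return lookup


def map_task(pages_block, lookup):
    """Stream one block of the large input and probe the hash table."""
    for user, url in pages_block:
        for age in lookup.get(user, ()):
            yield user, url, age


users = [("aaron", 30), ("zach", 25)]
pages_blocks = [[("aaron", "a1"), ("amr", "m1")], [("amy", "y1"), ("zach", "z1")]]
lookup = build_lookup(users)
replicated = [row for block in pages_blocks for row in map_task(block, lookup)]
```

In a hand-written Hadoop job the small input would typically be shipped to every task (for instance via the distributed cache) and loaded in the mapper's setup step.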

Group/cogroup optimizations
• On sorted and 'collected' data
• grp = group Users by name using 'collected';
[Diagram] The input (labeled Pages) is sorted by key: aaron, aaron, barney, carol, ... zach. Map 1 reads aaron, aaron, barney; Map 2 reads carol, ...
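If all records for a key are contiguous and land in the same split, the grouping can be done inside the map task with no reduce phase. A small Python sketch of that map-side grouping, using a count as the aggregate (the function name and data are illustrative):

```python
from itertools import groupby
from operator import itemgetter


def collected_group_count(split):
    """Group consecutive records with the same key and count them.

    Assumes every record for a key is adjacent and inside this one split.
    """
    for key, rows in groupby(split, key=itemgetter(0)):
        yield key, sum(1 for _ in rows)


map1_split = [("aaron", 31), ("aaron", 45), ("barney", 27)]
map2_split = [("carol", 52)]
counts = [list(collected_group_count(s)) for s in (map1_split, map2_split)]
```

If the assumption is violated (a key straddles two splits), each split silently produces its own partial group, so this shortcut is only safe when the input layout guarantees it.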

Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Diagram] A: load -> B: filter, which then feeds two branches: C1: group -> eval udf -> store into 'bydemo', and C2: group -> eval udf -> store into 'bystate'.

Multi-Store Map-Reduce Plan
map: filter -> split -> local rearrange (one per branch)
reduce: multiplex -> package (one per branch) -> foreach (one per branch)
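The plan above runs both group-by branches of the multi-store script in one MR job: the map side filters once and emits one record per branch, and the reduce side routes each key back to its branch. A simplified Python simulation follows; tagging the key with a branch index is an assumption made for this sketch, and the function names are hypothetical.

```python
from collections import defaultdict

BY_DEMO, BY_STATE = 0, 1  # branch indexes, one per store statement


def map_task(rows):
    for name, age, gender, city, state in rows:
        if name is None:               # shared filter, applied once
            continue
        # Split: emit one record per branch, with the branch index in the key.
        yield (BY_DEMO, (age, gender)), 1
        yield (BY_STATE, state), 1


def reduce_all(pairs):
    """Route each key to its branch and count per group within that branch."""
    outputs = {BY_DEMO: defaultdict(int), BY_STATE: defaultdict(int)}
    for (branch, key), n in pairs:
        outputs[branch][key] += n
    return {branch: dict(groups) for branch, groups in outputs.items()}


rows = [
    ("fred", 20, "m", "sf", "CA"),
    ("jane", 20, "f", "la", "CA"),
    (None, 30, "m", "ny", "NY"),
    ("bob", 20, "m", "ny", "NY"),
]
stores = reduce_all(map_task(rows))
```

The gain is that the input is read and filtered once instead of once per store.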

Memory Management
Use disk if large objects don't fit into memory.
• Setting the JVM heap limit above physical memory: very poor performance.
• Spilling on memory-threshold notifications from the JVM: unreliable.
• Use a pre-set limit for large bags, with custom spill logic for different bags, e.g. the distinct bag.
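The "pre-set limit" approach can be pictured as a container that writes its contents to a temporary file whenever it holds more than a fixed number of items, rather than waiting for a low-memory signal. The Python class below is only a sketch of that idea; the class name, the item-count limit, and the use of pickle for the spill format are assumptions made for the example.

```python
import pickle
import tempfile


class SpillableBag:
    """A bag that keeps at most max_in_memory items in RAM and spills the rest."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_files = []
        self.size = 0

    def add(self, item):
        self.memory.append(item)
        self.size += 1
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the in-memory items to a temp file and release them.
        f = tempfile.TemporaryFile()
        for item in self.memory:
            pickle.dump(item, f)
        self.spill_files.append(f)
        self.memory = []

    def __iter__(self):
        # Replay spilled items in insertion order, then the in-memory tail.
        for f in self.spill_files:
            f.seek(0)
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        yield from self.memory

    def __len__(self):
        return self.size


bag = SpillableBag(max_in_memory=3)
for n in range(10):
    bag.add(n)
```

A distinct bag would need different spill logic (for example, sorting and de-duplicating each run before writing it), which is the kind of per-bag customisation the slide mentions.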

Other optimizations • Aggressive use of combiner, secondary sort • Lazy deserialization in loaders • Better serialization format • Faster regex lib, compiled pattern • Compression between MR jobs
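As an illustration of lazy deserialization, a loader can hand downstream operators a record that keeps the raw line and only splits it into fields on first access, so records that are filtered out on some cheap test never pay the parsing cost. A hypothetical Python sketch (the class name and tab-delimited format are assumptions):

```python
class LazyRecord:
    """Holds a raw delimited line and parses it only when a field is read."""

    def __init__(self, raw, delimiter="\t"):
        self._raw = raw
        self._delimiter = delimiter
        self._fields = None

    @property
    def parsed(self):
        return self._fields is not None

    def get(self, index):
        if self._fields is None:
            # First access: split the line once and cache the fields.
            self._fields = self._raw.split(self._delimiter)
        return self._fields[index]


record = LazyRecord("fred\t20\thttp://example.com")
before = record.parsed
age = record.get(1)
after = record.parsed
```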

Future optimization work
• Improve memory management
• Join + group in a single MR job, if the same keys are used
• Even better skew handling
• Adaptive optimizations
• Automated Hadoop tuning
• …

Pig - fast and flexible
More flexibility in 0.8, 0.9:
• UDFs in scripting languages (Python)
• MR job as relation
• Relation as scalar
• Turing-complete Pig (0.9)
Picture courtesy http://www.flickr.com/photos/shutterbc/471935204/

Further reading • Docs - http://pig.apache.org/docs/r0.7.0/ • Papers and talks - http://wiki.apache.org/pig/PigTalksPapers • Training videos in vimeo.com (search ‘hadoop pig’)
