How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations
Thejas Nair, Pig team @ Yahoo!, Apache Pig PMC member
http://pig.apache.org
What is Pig?
• An engine that executes Pig Latin locally or on a Hadoop cluster.
• Pig Latin: a high-level data processing language.
Pig-latin-cup picture from http://www.flickr.com/photos/frippy/2507970530/
Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Comparison with MR in Java
• 1/20 the lines of code
• 1/16 the development time
What about performance?
Pig Compared to Map Reduce
• Faster development time
• Data flow versus programming logic
• Many standard data operations (e.g. join) included
• Manages all the details of connecting jobs and data flow
• Copes with Hadoop version change issues
And, You Don’t Lose Power
• UDFs can be used to load, evaluate, aggregate, and store data
• External binaries can be invoked
• Metadata is optional
• Flexible data model
• Nested data types
• Explicit data flow programming
Pig performance
PigMix: Pig vs. map-reduce
Pig optimization principles
• Versus an RDBMS: there are no accurate models for the data, the operators, or the execution environment.
• Use the reliable information that is available.
• Trust the user's choice.
• Use rules that help in most cases.
• Use rules based on runtime information.
Logical Optimizations
Script -> Parser -> Logical Plan -> Logical Optimizer -> Optimized Logical Plan
• Script: A = load; B = foreach; C = filter
• Logical plan: A -> B -> C
• Optimized logical plan: A -> C -> B
The optimizer restructures the given logical dataflow graph:
• Apply filter, project, limit early
• Merge foreach, filter statements
• Operator rewrites
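To make the A -> B -> C to A -> C -> B rewrite concrete, here is a minimal Python sketch of a "filter early" rule. It is not Pig's optimizer: the `Op` class, the `uses` and `computes` column sets, and the `age_group` column are all hypothetical names invented for illustration. The assumed safety condition is that a filter may move ahead of a foreach only when it reads no column that the foreach creates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical logical operator; not a Pig class."""
    kind: str                          # "load", "foreach", or "filter"
    name: str                          # alias from the script, e.g. "A"
    uses: frozenset = frozenset()      # columns a filter reads
    computes: frozenset = frozenset()  # columns a foreach creates


def push_filters_early(plan):
    """Swap a filter ahead of the preceding foreach whenever that is safe."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            safe = not (second.uses & first.computes)
            if first.kind == "foreach" and second.kind == "filter" and safe:
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan


plan = [
    Op("load", "A"),
    Op("foreach", "B", computes=frozenset({"age_group"})),
    Op("filter", "C", uses=frozenset({"age"})),
]
optimized = push_filters_early(plan)
print(" -> ".join(op.name for op in optimized))
```

In a hand-written map-reduce job the same idea means dropping unwanted records and unused fields at the very start of the mapper, before any expensive per-record work.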
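Physical Optimizations
Optimized Logical Plan -> Translator -> Physical/MR Plan -> Optimizer -> Optimized Physical/MR Plan
• Optimized logical plan: X -> Y -> Z
• Physical/MR plan: M(PX-PYm) R(PYr) -> M(Z)
• Optimized physical/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
(M = map, C = combiner, R = reduce)
Physical plan: a sequence of MR jobs made up of physical operators.
• Built-in rules, e.g. use of the combiner
• Choices specified in the query, e.g. join type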
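The effect of adding a combiner stage (the C(PYc) step in the optimized plan) can be sketched in plain Python. This is an in-memory simulation for a simple count, not Hadoop or Pig code; the function names and the sample blocks are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_phase(block):
    """Emit (key, 1) for every record in one input block."""
    return [(name, 1) for name in block]


def combine(pairs):
    """Pre-aggregate one map task's output before it is shuffled."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_phase(shuffled):
    """Sum all values per key, whether or not they were pre-aggregated."""
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)


blocks = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]
without_combiner = [p for b in blocks for p in map_phase(b)]
with_combiner = [p for b in blocks for p in combine(map_phase(b))]

print(len(without_combiner), len(with_combiner))  # records sent to reducers
print(reduce_phase(with_combiner))
```

The reducer produces the same totals either way; only the number of records crossing the shuffle changes. This works because counting is algebraic: partial sums can be summed again.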
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram: Map 1 reads Pages block n and emits records tagged (1, user); Map 2 reads Users block m and emits records tagged (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]
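A reduce-side join of this shape can be sketched as follows. This is a simplified in-memory simulation, not Pig's implementation: the helper names, the sample rows, and the use of Python's `hash` as the partitioner are all assumptions. The tags follow the diagram (1 for Pages, 2 for Users).

```python
from collections import defaultdict


def map_tagged(records, tag, key_index):
    """Map side: emit (join key, (input tag, record))."""
    for rec in records:
        yield rec[key_index], (tag, rec)


def shuffle(pairs, num_reducers):
    """Partition by hash of the key so equal keys meet in one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_join(groups):
    """Reduce side: cross the two tagged inputs for each key."""
    out = []
    for values in groups.values():
        pages_side = [rec for tag, rec in values if tag == 1]
        users_side = [rec for tag, rec in values if tag == 2]
        for page in pages_side:
            for user in users_side:
                out.append(page + user)
    return out


pages = [("fred", "/a"), ("fred", "/b"), ("jane", "/c"), ("zoe", "/d")]
users = [("fred", 25), ("jane", 19)]
tagged = list(map_tagged(pages, 1, 0)) + list(map_tagged(users, 2, 0))
joined = [row for part in shuffle(tagged, 2) for row in reduce_join(part)]
print(sorted(joined))
```

Every record of both inputs crosses the shuffle, and all records for one key land on a single reducer. Those two costs are what the specialized join types on the following slides try to avoid.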
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name); each map is labeled SP. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
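The diagram suggests that records of a heavily repeated key in Pages are spread over several reducers, while the matching Users record is copied to each of them. A rough Python sketch of that idea follows. It assumes the set of hot keys is already known (for example from sampling, which is not shown), and the function name, round-robin assignment, and sample data are illustrative assumptions rather than Pig's actual algorithm.

```python
from collections import defaultdict
from itertools import count


def skew_join(pages, users, hot_keys, num_reducers):
    """Simulate a join where hot keys are split across all reducers."""
    # Each reducer holds, per key, a (pages records, users records) pair.
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    next_slot = count()

    for user, url in pages:
        if user in hot_keys:
            r = next(next_slot) % num_reducers  # spread the hot key
        else:
            r = hash(user) % num_reducers       # normal hash partitioning
        reducers[r][user][0].append((user, url))

    for name, age in users:
        if name in hot_keys:
            targets = range(num_reducers)       # replicate to every reducer
        else:
            targets = [hash(name) % num_reducers]
        for r in targets:
            reducers[r][name][1].append((name, age))

    joined, load = [], []
    for groups in reducers:
        rows = [p + u for ps, us in groups.values() for p in ps for u in us]
        joined.extend(rows)
        load.append(len(rows))
    return joined, load


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4")]
users = [("fred", 25), ("jane", 19)]
joined, load = skew_join(pages, users, hot_keys={"fred"}, num_reducers=2)
print(load)
```

With plain hash partitioning all four fred rows would go to one reducer; here each of the two reducers produces two of them.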
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram: both inputs are sorted by the join key (aaron ... zach). Map 1 reads the Pages split aaron … amr and reads Users starting at aaron; Map 2 reads the Pages split amy … barb and reads Users starting at amy.]
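When both inputs are already sorted on the join key, each map task can join its own split without any reduce phase. The sketch below assumes the sorted Users input fits in a Python list and uses a binary search as a stand-in for whatever index lets a mapper seek to its starting key; the function name and data are hypothetical.

```python
from bisect import bisect_left


def merge_join_split(pages_split, users_sorted):
    """Join one sorted Pages split against the sorted Users input."""
    if not pages_split:
        return []
    user_keys = [name for name, _ in users_sorted]
    # Seek to the first Users record that could match this split.
    j = bisect_left(user_keys, pages_split[0][0])
    out = []
    for user, url in pages_split:
        # Advance past Users keys smaller than the current Pages key.
        while j < len(users_sorted) and users_sorted[j][0] < user:
            j += 1
        # Emit one row per matching Users record (handles duplicates).
        k = j
        while k < len(users_sorted) and users_sorted[k][0] == user:
            out.append((user, url) + users_sorted[k])
            k += 1
    return out


users_sorted = [("aaron", 30), ("amr", 22), ("amy", 41),
                ("barb", 35), ("zach", 28)]
map1 = merge_join_split(
    [("aaron", "/a"), ("aaron", "/b"), ("amr", "/c")], users_sorted)
map2 = merge_join_split([("amy", "/d"), ("barb", "/e")], users_sorted)
print(map1)
print(map2)
```

Each mapper touches only the slice of Users that overlaps its Pages split, so nothing is shuffled or sorted at join time.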
Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Diagram: Map 1 reads the Pages split aaron … amr and Map 2 reads the Pages split amy … barb; each map holds a full copy of Users (aaron ... zach).]
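The replicated join amounts to a map-side hash lookup: each map task loads the whole small input into memory and probes it for every record of the large input. A minimal sketch, assuming Users is small enough to fit in memory and using hypothetical helper names:

```python
from collections import defaultdict


def build_lookup(users):
    """Load the small input into an in-memory table keyed by name."""
    lookup = defaultdict(list)
    for name, age in users:
        lookup[name].append((name, age))
    return lookup


def replicated_join_map(pages_split, users_by_name):
    """Map task: probe the in-memory table for each Pages record."""
    for user, url in pages_split:
        for user_rec in users_by_name.get(user, ()):
            yield (user, url) + user_rec


users = [("aaron", 30), ("amy", 41), ("zach", 28)]
lookup = build_lookup(users)  # every map task would build its own copy
map1 = list(replicated_join_map([("aaron", "/a"), ("amr", "/b")], lookup))
map2 = list(replicated_join_map([("amy", "/c"), ("amy", "/d")], lookup))
print(map1, map2)
```

No shuffle or reduce phase is needed, at the cost of holding one copy of the small input per map task.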
Group/cogroup optimizations
• On sorted and 'collected' data
• grp = group Users by name using 'collected';
[Diagram: the input (labeled Pages on the slide) is sorted by key: aaron, aaron, barney, carol, ..., zach. Map 1 reads aaron, aaron, barney; Map 2 reads carol, ...]
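If all records for a key sit next to each other inside a single map task's input, grouping can finish in the map phase. The sketch below uses `itertools.groupby`, which groups only adjacent equal keys, so it relies on exactly that assumption; the split contents are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def collected_group(split, key_index=0):
    """Group adjacent records that share a key, within one map task."""
    for key, records in groupby(split, key=itemgetter(key_index)):
        yield key, list(records)


map1_input = [("aaron", 30), ("aaron", 31), ("barney", 22)]
map2_input = [("carol", 40), ("zach", 28)]
groups = [g for split in (map1_input, map2_input)
          for g in collected_group(split)]
print(groups)
```

If a key's records were spread over two splits, each map would emit a partial group, which is why the data must be arranged this way before the optimization applies.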
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Diagram: A: load -> B: filter, which feeds two branches: C1: group -> eval udf -> store into 'bydemo', and C2: group -> eval udf -> store into 'bystate'.]
Multi-Store Map-Reduce Plan
• map: filter -> split -> local rearrange (one per branch)
• reduce: multiplex -> package (one per branch) -> foreach (one per branch)
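The plan above runs both group-by branches in a single job: the input is read and filtered once, each surviving record is emitted once per branch with a branch tag, and the reduce side routes records by tag. A simplified single-process sketch follows; the tagging scheme, function name, and sample rows are assumptions for illustration.

```python
from collections import Counter


def multi_store(users):
    """Compute counts by (age, gender) and by state in one pass."""
    shuffled = []
    for name, age, gender, city, state in users:  # A: load (one scan)
        if name is None:                          # B: filter
            continue
        shuffled.append((0, (age, gender)))       # branch 0: key for C1
        shuffled.append((1, state))               # branch 1: key for C2

    outputs = [Counter(), Counter()]
    for branch, key in shuffled:                  # reduce: route by tag
        outputs[branch][key] += 1                 # COUNT per group
    return dict(outputs[0]), dict(outputs[1])


users = [
    ("fred", 25, "m", "Austin", "TX"),
    ("jane", 25, "f", "Dallas", "TX"),
    (None, 40, "m", "Reno", "NV"),
    ("amy", 25, "f", "Fresno", "CA"),
]
by_demo, by_state = multi_store(users)
print(by_demo)
print(by_state)
```

Written as two separate map-reduce jobs, the load and filter work would be done twice.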
Memory Management
Use disk if large objects don't fit into memory
• JVM limit > physical memory: very poor performance
• Spill on memory threshold notification from the JVM: unreliable
• Pre-set limit for large bags; custom spill logic for different bags, e.g. distinct bag
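The "pre-set limit" approach can be illustrated with a small bag that writes its contents to a temporary file once it holds a fixed number of items. This is a Python sketch of the idea only, not Pig's bag implementation; the class name, the item-count limit, and the use of `pickle` are all assumptions.

```python
import pickle
import tempfile


class SpillableBag:
    """Keeps at most max_in_memory items in memory; spills the rest to disk."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self._items = []
        self._spill_files = []

    def add(self, item):
        self._items.append(item)
        if len(self._items) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the in-memory items to a fresh temporary file and clear them.
        f = tempfile.TemporaryFile()
        for item in self._items:
            pickle.dump(item, f)
        self._spill_files.append(f)
        self._items = []

    def __iter__(self):
        # Replay spilled items in spill order, then the in-memory tail.
        for f in self._spill_files:
            f.seek(0)
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        yield from self._items


bag = SpillableBag(max_in_memory=3)
for i in range(10):
    bag.add(i)
print(list(bag))
```

A fixed limit is predictable, unlike waiting for a low-memory notification. A bag with different semantics, such as a distinct bag, would need its own spill logic, for example de-duplicating while merging the spilled runs.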
Other optimizations
• Aggressive use of combiner, secondary sort
• Lazy deserialization in loaders
• Better serialization format
• Faster regex lib, compiled pattern
• Compression between MR jobs
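Lazy deserialization means a loader keeps the raw record and parses it only if some later operator reads a field. A minimal sketch, assuming tab-delimited text records and a hypothetical `LazyRecord` class:

```python
class LazyRecord:
    """Holds a raw line and splits it into fields only on first access."""

    def __init__(self, line, delimiter="\t"):
        self._line = line
        self._delimiter = delimiter
        self._fields = None

    @property
    def parsed(self):
        """True once the raw line has actually been split."""
        return self._fields is not None

    def __getitem__(self, index):
        if self._fields is None:
            self._fields = self._line.split(self._delimiter)
        return self._fields[index]


records = [LazyRecord(line) for line in ("fred\t25\t/home", "jane\t19\t/faq")]
first_name = records[0][0]  # only the first record gets parsed
print(first_name, [r.parsed for r in records])
```

Records that are filtered out or merely passed through never pay the parsing cost.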
Future optimization work
• Improve memory management
• Join + group in single MR, if same keys used
• Even better skew handling
• Adaptive optimizations
• Automated hadoop tuning
• …
Pig - fast and flexible
Picture courtesy of http://www.flickr.com/photos/shutterbc/471935204/
More flexibility in 0.8, 0.9
• UDFs in scripting languages (Python)
• MR job as relation
• Relation as scalar
• Turing complete Pig (0.9)
Further reading
• Docs - http://pig.apache.org/docs/r0.7.0/
• Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
• Training videos on vimeo.com (search 'hadoop pig')