Pig: Making Hadoop Easy
Wednesday, June 10, 2009, Santa Clara Marriott
What is Pig?
• An engine that executes Pig Latin locally or on a Hadoop cluster.
• Pig Latin, a high-level data processing language.
An Example Problem
• Data
  • User records
  • Pages served
• Question: the 5 pages most visited by users aged 18 - 25.
• Data flow: (Load Users → Filter by age) and (Load Pages) → Join on name → Group on url → Count clicks → Order by clicks → Take top 5
In Pig Latin
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd = join Fltrd by name, Pages by user;
    Grpd = group Jnd by url;
    Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd = order Smmd by clicks desc;
    Top5 = limit Srtd 5;
    store Top5 into 'top5sites';
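To make the semantics of each statement concrete, here is a minimal plain-Python sketch of the same data flow on small in-memory lists. It is only an illustration of what the script computes, not how Pig executes it; the function name `top_pages` and the tuple layouts are assumptions made for this example.

```python
from collections import Counter, defaultdict


def top_pages(users, pages, min_age=18, max_age=25, n=5):
    """Return the n most-clicked URLs among users aged min_age..max_age.

    users: list of (name, age) tuples; pages: list of (user, url) tuples.
    Both layouts are assumptions that mirror the schemas in the script.
    """
    # Fltrd = filter Users by age >= 18 and age <= 25;
    fltrd = [(name, age) for name, age in users if min_age <= age <= max_age]

    # Jnd = join Fltrd by name, Pages by user;  (inner join, one row per match)
    by_name = defaultdict(list)
    for name, age in fltrd:
        by_name[name].append(age)
    jnd = [(user, url) for user, url in pages for _ in by_name.get(user, [])]

    # Grpd = group Jnd by url;  Smmd = ... COUNT(Jnd) as clicks;
    clicks = Counter(url for _, url in jnd)

    # Srtd = order Smmd by clicks desc;  Top5 = limit Srtd 5;
    srtd = sorted(clicks.items(), key=lambda kv: kv[1], reverse=True)
    return srtd[:n]
```

Calling `top_pages(users, pages)` with the two lists returns `(url, clicks)` pairs, highest click count first.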
Comparison
• 1/20 the lines of code
• 1/16 the development time
• Performance: 1.5x Hadoop
Pig Compared to Map Reduce
• Faster development time
• Data flow versus programming logic
• Many standard data operations (e.g. join) included
• Manages all the details of connecting jobs and data flow
• Copes with Hadoop version change issues
And, You Don’t Lose Power
• UDFs can be used to load, evaluate, aggregate, and store data
• External binaries can be invoked
• Metadata is optional
• Flexible data model
• Nested data types
• Explicit data flow programming
How it Works
A Pig Latin script is translated to a set of operators, which are placed in one or more MR jobs and executed.
    A = load 'myfile';
    B = filter A by $1 > 0;
    C = group B by $0;
    D = foreach C generate group, COUNT(B) as cnt;
    E = filter D by cnt > 5;
    dump E;
• Map: Filter $1 > 0
• Combiner: COUNT(B)
• Reducer: SUM(COUNT(B)), Filter cnt > 5
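The placement of operators into map, combiner, and reducer can be imitated in a few lines of plain Python. This is a rough sketch under stated assumptions, not Pig's actual planner or the Hadoop API: each input split is a list of tuples, and the function names (`map_phase`, `combine_phase`, `reduce_phase`, `run`) are hypothetical.

```python
from collections import Counter


def map_phase(split):
    """Map: apply `filter A by $1 > 0` and key each surviving record on $0."""
    return [(rec[0], rec) for rec in split if rec[1] > 0]


def combine_phase(mapped):
    """Combiner: partial COUNT(B) per key over one map task's output."""
    return Counter(key for key, _ in mapped)


def reduce_phase(partials, min_cnt=5):
    """Reducer: SUM the partial counts per key, then `filter D by cnt > 5`."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # adds counts key by key
    return {key: cnt for key, cnt in totals.items() if cnt > min_cnt}


def run(splits):
    """Run all three phases over a list of input splits (one per map task)."""
    return reduce_phase(combine_phase(map_phase(split)) for split in splits)
```

The point of the combiner step is that only one small count per key leaves each map task, rather than every record.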
What Users Do with Pig
• Inside Yahoo (based on user interviews)
  • 60% of ad hoc and 40% of production MR jobs
  • Production
    • Examples: search infrastructure, ad relevance
    • Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debuggability
  • Ad hoc
    • Examples: user intent analysis
    • Attraction: easy to learn, compact readable code, fast iteration when trying new algorithms, easy for collaboration
What Users Do with Pig
• Outside Yahoo (based on mailing list responses)
  • Processing search engine query logs
    “Pig programs are easier to maintain, and less error-prone than native java programs. It is an excellent piece of work.”
  • Image recommendations
    “I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data.”
  • Adsorption Algorithm (video recommendations)
  • Hofmann’s PLSI implementation
    “The E/M logic was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in mapreduce java. Exactly that’s the reason I wanted to try it out in Pig. It took ~ 3-4 days for me to write it, starting from learning pig.”
Users Extending Pig: PigPy
• Created by Marshall Weir at Zattoo
• Uses Python to create Pig Latin scripts on the fly
• Enables looping
• Branching based on job results
• Submits Pig jobs from Python scripts
• Caches intermediate calculations
• Avoids variable name collisions in large scripts
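The idea of generating Pig Latin from Python can be sketched with nothing more than string formatting. The helper below is hypothetical and is not PigPy's API; it only shows how a Python loop can emit repeated statements with unique aliases. The paths, alias scheme, and age bands are assumptions for illustration, and the sketch builds the script text without submitting it.

```python
def age_band_script(bands, users_path="users", out_prefix="users_age"):
    """Build a Pig Latin script with one filter/store pair per age band.

    bands: iterable of (low, high) integer pairs (hypothetical input).
    Returns the script as a single string; nothing is executed.
    """
    lines = [f"Users = load '{users_path}' as (name, age);"]
    for low, high in bands:
        # A distinct alias per iteration avoids name collisions in the script.
        alias = f"Band_{low}_{high}"
        lines.append(f"{alias} = filter Users by age >= {low} and age <= {high};")
        lines.append(f"store {alias} into '{out_prefix}_{low}_{high}';")
    return "\n".join(lines)
```

For example, `print(age_band_script([(18, 25), (26, 35)]))` writes one load statement followed by two filter/store pairs.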
Version 0.2.0
• Released April 2009
• Added type system
• ~5x better performance than 0.1
• More aggressive use of the combiner
• Map side join
• Handles key skew in ORDER BY
• Improved error handling
• Improved documentation
Version 0.3.0
• Release branch created June 8th, 2009
• Supports multiple STOREs in one MR job
• Supports multiple GROUP BYs in one MR job
    students = load 'students' as (name, age, gpa);
    a_ed = filter students by age > 25;
    store a_ed into 'adult_ed';
    gname = group a_ed by name;
    cname = foreach gname generate group, COUNT(a_ed);
    store cname into 'count_by_name';
    g_age = group a_ed by age;
    c_age = foreach g_age generate group, COUNT(a_ed);
    store c_age into 'count_by_age';
In 0.2.0 and before, this would be 3 MR jobs. In 0.3.0 it will be one. Seeing up to 10x speedup for these types of scripts.
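One way to see why sharing a job helps: all three outputs of the students script can be produced from a single scan of the input. The plain-Python sketch below illustrates that idea only; it is not how Pig 0.3.0 is implemented, and the function name `run_students_script` and the tuple layout are assumptions for this example.

```python
from collections import Counter


def run_students_script(students):
    """Produce all three outputs of the script in one pass over the input.

    students: list of (name, age, gpa) tuples (assumed layout).
    Returns (adult_ed, count_by_name, count_by_age).
    """
    adult_ed = []
    count_by_name = Counter()
    count_by_age = Counter()
    for name, age, gpa in students:  # single scan of the input
        if age > 25:  # a_ed = filter students by age > 25;
            adult_ed.append((name, age, gpa))  # store a_ed
            count_by_name[name] += 1  # group a_ed by name, COUNT
            count_by_age[age] += 1  # group a_ed by age, COUNT
    return adult_ed, dict(count_by_name), dict(count_by_age)
```

Running each store as its own job would instead read and filter `students` three times.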
Currently Working On
• Map side merge join
• Handling severe skew in join keys
• Improving memory footprint
• Extending optimizer capabilities
SQL
• Pig will be bilingual, accepting SQL and Pig Latin
• UDFs will work in both languages
• Gives users ability to choose appropriate interface level
• Administrators have one component to maintain
Metadata for the Grid
• Provide metadata model for files and directories as data sets
• Usable from Map Reduce and Pig
• Attach user defined attributes to data sets
• Define hierarchy and associations between data sets
• Record data schema and statistics
• Browsing, searching, and metadata administration via GUI and web services API
• JIRA: PIG-823
Storage Access Layer
• Common abstraction to contain storage access features and optimizations
• Support fast projection
• Support early row filtering
• CPU/space efficient data serialization and compression
• Usable by Map Reduce and Pig
• JIRA: PIG-833
Learn More
• Come to the Hadoop Summit Training, tomorrow
• Watch the training by Yahoo! and Cloudera: http://www.cloudera.com/hadoop-training-pig-introduction
• Get involved: http://hadoop.apache.org/pig