Pig: Making Hadoop Easy

Wednesday, June 10, 2009, Santa Clara Marriott

What is Pig? • An engine that executes Pig Latin locally or on a Hadoop cluster • Pig Latin, a high-level data processing language

An Example Problem • Data: user records and pages served • Question: the 5 pages most visited by users aged 18-25 • Data flow: Load Users → Filter by age; Load Pages; Join on name → Group on url → Count clicks → Order by clicks → Take top 5

In Map Reduce

In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
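To make the semantics of each step concrete, here is a minimal plain-Python sketch of the same data flow run over a few in-memory records. The sample data and the function name `top_pages` are hypothetical and not part of the talk, and the sketch assumes user names are unique; it only mirrors what the script computes, not how Pig executes it.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 25)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("dave", "a.com"),
]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # Fltrd = filter Users by age >= 18 and age <= 25
    # (a set is enough here because names are assumed unique)
    fltrd = {name for name, age in users if lo <= age <= hi}
    # Jnd = join ...; Grpd = group Jnd by url; Smmd = COUNT per url
    clicks = Counter(url for user, url in pages if user in fltrd)
    # Srtd = order by clicks desc (ties broken by url so the result is stable)
    srtd = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    # Top5 = limit Srtd 5
    return srtd[:n]

top5 = top_pages(users, pages)
```

With this sample data only three distinct URLs are visited by users in the age range, so fewer than five rows come back.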

Comparison (Pig Latin versus hand-coded Map Reduce) • 1/20 the lines of code • 1/16 the development time • Performance: 1.5x Hadoop

Pig Compared to Map Reduce • Faster development time • Data flow versus programming logic • Many standard data operations (e.g. join) included • Manages all the details of connecting jobs and data flow • Copes with Hadoop version change issues

And, You Don’t Lose Power • UDFs can be used to load, evaluate, aggregate, and store data • External binaries can be invoked • Metadata is optional • Flexible data model • Nested data types • Explicit data flow programming

Pig Commands

How it Works • A Pig Latin script is translated to a set of operators, which are placed in one or more MR jobs and executed.

A = load 'myfile';
B = filter A by $1 > 0;
C = group B by $0;
D = foreach C generate group, COUNT(B) as cnt;
E = filter D by cnt > 5;
dump E;

• Map: Filter $1 > 0 • Combiner: COUNT(B) • Reducer: SUM(COUNT(B)), Filter cnt > 5
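The placement of operators into map, combiner, and reducer described on this slide can be simulated in a few lines of Python. This is only an illustrative sketch: the function names and the two input splits are hypothetical, and real Pig compiles operators into Hadoop jobs rather than running Python.

```python
from collections import defaultdict

def map_phase(split):
    # Map: B = filter A by $1 > 0, then emit ($0, record) for the group.
    for rec in split:
        if rec[1] > 0:
            yield rec[0], rec

def combine_phase(mapped):
    # Combiner: partial COUNT(B) per key within one map task's output.
    partial = defaultdict(int)
    for key, _ in mapped:
        partial[key] += 1
    return partial

def reduce_phase(partials, min_cnt=5):
    # Reducer: SUM the partial counts, then E = filter D by cnt > 5.
    totals = defaultdict(int)
    for partial in partials:
        for key, cnt in partial.items():
            totals[key] += cnt
    return {k: c for k, c in totals.items() if c > min_cnt}

# Two hypothetical input splits of 'myfile', as ($0, $1) records.
split1 = [("x", 1)] * 4 + [("y", 2)] * 2 + [("x", -1)]
split2 = [("x", 3)] * 3 + [("y", 0)] + [("z", 9)] * 6

result = reduce_phase([combine_phase(map_phase(s)) for s in (split1, split2)])
```

The combiner is what keeps the shuffle small: each map task ships one partial count per key instead of every record.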

What Users Do with Pig • Inside Yahoo (based on user interviews) • 60% of ad hoc and 40% of production MR jobs • Production • Examples: search infrastructure, ad relevance • Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debuggability • Ad hoc • Examples: user intent analysis • Attraction: easy to learn, compact readable code, fast iteration when trying new algorithms, easy for collaboration

What Users Do with Pig • Outside Yahoo (based on mailing list responses) • Processing search engine query logs: “Pig programs are easier to maintain, and less error-prone than native java programs. It is an excellent piece of work.” • Image recommendations: “I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data.” • Adsorption Algorithm (video recommendations) • Hofmann’s PLSI implementation: “The E/M logic was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in mapreduce java. Exactly that’s the reason I wanted to try it out in Pig. It took ~ 3-4 days for me to write it, starting from learning pig.”

Users Extending Pig: PigPy • Created by Marshall Weir at Zattoo • Uses Python to create Pig Latin scripts on the fly • Enables looping • Enables branching based on job results • Submits Pig jobs from Python scripts • Caches intermediate calculations • Avoids variable name collisions in large scripts
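The idea of creating Pig Latin scripts on the fly can be sketched with ordinary string building. The helper below is hypothetical and is not PigPy's API; it only shows how a Python loop can emit repeated statements and how an index suffix keeps aliases from colliding.

```python
def filter_script(src, out_prefix, age_ranges):
    """Build one Pig Latin script that stores one filtered output per age range."""
    lines = [f"Users = load '{src}' as (name, age);"]
    for i, (lo, hi) in enumerate(age_ranges):
        alias = f"Fltrd_{i}"  # index suffix keeps generated aliases unique
        lines.append(f"{alias} = filter Users by age >= {lo} and age <= {hi};")
        lines.append(f"store {alias} into '{out_prefix}_{lo}_{hi}';")
    return "\n".join(lines)

script = filter_script("users", "users_aged", [(18, 25), (26, 35)])
print(script)
```

The generated text could then be written to a file and run with Pig; submitting jobs and reading back results, which PigPy is described as handling, is outside this sketch.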

Version 0.2.0 • Released April 2009 • Added type system • ~5x better performance than 0.1 • More aggressive use of the combiner • Map side join • Handles key skew in ORDER BY • Improved error handling • Improved documentation
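The slide does not say how the map side join is implemented. A common way to do a join without a reduce phase, assuming one input is small enough to fit in memory, is to load that input into a hash table in every map task and stream the large input past it. The sketch below illustrates that general technique with hypothetical data and names; it is not Pig's code.

```python
from collections import defaultdict

def map_side_join(big_split, small_table):
    # Build an in-memory index of the small input once per map task.
    index = defaultdict(list)
    for key, value in small_table:
        index[key].append(value)
    # Stream the large input and probe the index; no shuffle or reducer needed.
    for key, value in big_split:
        for match in index.get(key, ()):
            yield key, value, match

# Hypothetical inputs: a small (name, age) table and a large (user, url) split.
ages = [("alice", 22), ("carol", 19)]
clicks = [("alice", "a.com"), ("bob", "a.com"), ("carol", "c.com")]

joined = list(map_side_join(clicks, ages))
```

Rows of the large input with no match in the small table (here, "bob") are simply dropped, as in an inner join.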

Version 0.3.0 • Release branch created June 8th, 2009 • Supports multiple STOREs in one MR job • Supports multiple GROUP BYs in one MR job

students = load 'students' as (name, age, gpa);
a_ed = filter students by age > 25;
store a_ed into 'adult_ed';
gname = group a_ed by name;
cname = foreach gname generate group, COUNT(a_ed);
store cname into 'count_by_name';
g_age = group a_ed by age;
c_age = foreach g_age generate group, COUNT(a_ed);
store c_age into 'count_by_age';

In 0.2.0 and before, this would be 3 MR jobs. In 0.3.0 it will be one. Seeing up to 10x speedup for these types of scripts.
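The gain comes from reading and filtering the input once and feeding every downstream output from that single pass. A rough Python analogy follows; the sample rows and the function name `one_pass` are hypothetical, and this says nothing about how Pig actually merges the jobs.

```python
from collections import Counter

# Hypothetical rows of the 'students' input: (name, age, gpa).
students = [
    ("ann", 30, 3.5), ("bo", 22, 3.9), ("ann", 41, 2.8),
    ("cy", 30, 3.1), ("di", 26, 3.7),
]

def one_pass(rows):
    adult_ed, count_by_name, count_by_age = [], Counter(), Counter()
    for name, age, gpa in rows:                # single scan of the input
        if age > 25:                           # a_ed = filter students by age > 25
            adult_ed.append((name, age, gpa))  # rows stored into 'adult_ed'
            count_by_name[name] += 1           # group a_ed by name, COUNT
            count_by_age[age] += 1             # group a_ed by age, COUNT
    return adult_ed, dict(count_by_name), dict(count_by_age)

adult_ed, by_name, by_age = one_pass(students)
```

Run as three separate jobs, the same work would scan and filter `students` three times.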

Currently Working On • Map side merge join • Handling severe skew in join keys • Improving memory footprint • Extending optimizer capabilities
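For readers unfamiliar with merge joins: when both inputs are already sorted on the join key, they can be joined in one sequential pass with no hash table and no re-sort. The sketch below shows the textbook algorithm on in-memory lists; the function name and data are hypothetical, and the talk gives no details of Pig's planned implementation.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
joined = merge_join(left, right)
```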

SQL • Pig will be bilingual, accepting SQL and Pig Latin • UDFs will work in both languages • Gives users ability to choose appropriate interface level • Administrators have one component to maintain

Metadata for the Grid • Provide metadata model for files and directories as data sets • Usable from Map Reduce and Pig • Attach user defined attributes to data sets • Define hierarchy and associations between data sets • Record data schema and statistics • Browsing, searching, and metadata administration via GUI and web services API • JIRA: PIG-823

Storage Access Layer • Common abstraction to contain storage access features and optimizations • Support fast projection • Support early row filtering • CPU/space efficient data serialization and compression • Usable by Map Reduce and Pig • PIG-833
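Projection and early row filtering both mean doing less work per record at read time. A minimal sketch of a reader that accepts a column list and a row predicate is shown below; the `scan` function and its signature are hypothetical and are not the API proposed in PIG-833.

```python
def scan(rows, schema, columns, predicate=None):
    """Yield only the requested columns of rows that pass the predicate."""
    idx = [schema.index(c) for c in columns]
    for row in rows:
        # Early row filtering: drop the row before any further work is done.
        if predicate is not None and not predicate(dict(zip(schema, row))):
            continue
        # Projection: hand back only the columns the caller asked for.
        yield tuple(row[i] for i in idx)

schema = ["name", "age", "gpa"]
rows = [("ann", 30, 3.5), ("bo", 22, 3.9), ("cy", 41, 2.8)]
result = list(scan(rows, schema, ["name"], lambda r: r["age"] > 25))
```

A real storage layer would push these decisions down into the file format so that unneeded columns and rows are never deserialized; this in-memory version only shows the interface shape.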

Learn More • Come to the Hadoop Summit Training, tomorrow • Watch the training by Yahoo! and Cloudera: http://www.cloudera.com/hadoop-training-pig-introduction • Get involved: http://hadoop.apache.org/pig

Q & A
