Pig Latin: A Not-So-Foreign Language for Data Processing • Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins • Yahoo! Research, SIGMOD '08 • Presented by Sandeep Patidar • Modified from the original Pig Latin talk
Outline • Map-Reduce and the Need for Pig Latin • Pig Latin Example • Features and Motivation • Pig Latin • Implementation • Debugging Environment • Usage Scenarios • Related Work • Future Work
Data Processing Renaissance • Internet companies swimming in data • E.g. TBs/day at Yahoo! • Data analysis is “inner loop” of product innovation • Data analysts are skilled programmers
Data Warehousing…? • Often not scalable enough • Prohibitively expensive at web scale: up to $200K/TB • Little control over execution method • Query optimization is hard: parallel environment, little or no statistics, lots of UDFs
New Systems For Data Analysis • Map-Reduce • Apache Hadoop • Dryad
Map-Reduce • Map: performs the group-by • Reduce: performs the aggregation • These are two high-level declarative primitives that enable parallel processing
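The map/reduce split can be sketched in plain Python (an illustrative in-memory model, not Hadoop's actual API; the function names and the word-count example are ours):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each record emits zero or more (key, value) pairs.
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)   # shuffle: collect values by key
    # Reduce phase: aggregate the values gathered for each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: word count over two tiny "documents".
docs = ["pig latin", "pig"]
counts = map_reduce(
    docs,
    map_fn=lambda doc: [(w, 1) for w in doc.split()],
    reduce_fn=lambda key, values: sum(values),
)
# counts == {"pig": 2, "latin": 1}
```

In the real system the map and reduce phases run on different machines and the "shuffle" moves data over the network; this sketch only shows the programming model.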
Execution overview of Map-Reduce [2]
1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2) One copy of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign; the master picks idle workers and assigns each one a task.
3) A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
Execution overview of Map-Reduce [2] (cont.)
5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.
6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns to the user code.
[Figure: input records flow through parallel map tasks, are shuffled, and pass through reduce tasks to produce output records]
Map-Reduce Appeal • Scalable due to simpler design: only parallelizable operations, no transactions • Runs on cheap commodity hardware • Procedural control: a processing “pipe”
Limitations of Map-Reduce • 1. Extremely rigid data flow (one map, one reduce); other flows (joins, unions, chains, splits) constantly hacked in • 2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct • 3. Semantics hidden inside the map and reduce functions: difficult to maintain, extend, and optimize
Pros and Cons • SQL: high-level declarative language • Map-Reduce: low-level procedural language • Need a high-level, general data flow language that sits between the two
Enter Pig Latin • Need a high-level, general data flow language Pig Latin
Outline • Map-Reduce and the Need for Pig Latin • Pig Latin Example • Features and Motivation • Pig Latin • Implementation • Debugging Environment • Usage Scenarios • Related Work • Future Work
Pig Latin Example 1 • Suppose we have a table urls: (url, category, pagerank) • A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
Equivalent Pig Latin program:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
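For intuition, here is a hypothetical in-memory Python rendering of the same four steps, with toy data we made up and the count threshold lowered from 10^6 to 1 so the example stays small:

```python
# Toy urls table: (url, category, pagerank) tuples (illustrative data only).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "sports", 0.1),
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = {}
for url, category, pagerank in good_urls:
    groups.setdefault(category, []).append((url, category, pagerank))

# big_groups = FILTER groups BY COUNT(good_urls) > 1;  (threshold shrunk for toy data)
big_groups = {c: g for c, g in groups.items() if len(g) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {c: sum(t[2] for t in g) / len(g) for c, g in big_groups.items()}
# output == {"news": 0.7}
```

Each Pig Latin statement maps to one comprehension or loop, which is exactly the "one transformation per step" style the language encourages.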
Data Flow • Filter good_urls by pagerank > 0.2 • Group by category • Filter category by count > 10^6 • Foreach category generate avg. pagerank
Example Data Analysis Task • Find the top 10 most visited pages in each category • Inputs: Visits (user, url, time) and Url Info (url, category, pagerank)
Data Flow • Load Visits • Group by url • Foreach url generate count • Load Url Info • Join on url • Group by category • Foreach category generate top10 urls
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
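As a rough sketch of what this pipeline computes, here is an in-memory Python version over made-up toy data (the data values and variable names below are ours, not from the talk):

```python
from collections import Counter
import heapq

# Toy stand-ins for /data/visits and /data/urlInfo (illustrative data only).
visits = [("alice", "a.com"), ("bob", "a.com"), ("alice", "b.com")]
url_info = {"a.com": "news", "b.com": "sports"}

# group visits by url; foreach url generate count
visit_counts = Counter(url for _user, url in visits)

# join visitCounts by url, urlInfo by url
joined = [(url, url_info[url], n) for url, n in visit_counts.items()]

# group by category
by_category = {}
for url, category, n in joined:
    by_category.setdefault(category, []).append((url, n))

# foreach category generate top(…, 10)
top_urls = {c: heapq.nlargest(10, g, key=lambda t: t[1])
            for c, g in by_category.items()}
```

The same shape (count, join, group, per-group top-k) is what the Pig compiler later turns into a chain of three map-reduce jobs.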
Outline • Map-Reduce and the Need for Pig Latin • Pig Latin Example • Features and Motivation • Pig Latin • Implementation • Debugging Environment • Usage Scenarios • Related Work • Future Work
Dataflow Language User specifies a sequence of steps where each step specifies only a single high-level data transformation The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data. Jasmine Novak Engineer, Yahoo!
Step-by-step execution • A Pig Latin program supplies an explicit sequence of operations, but it is not necessary that the operations be executed in that order • E.g., find the set of urls of pages that are classified as spam but have a high pagerank score:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
• isSpam might be an expensive UDF; then it is much better to filter the urls by pagerank first
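A small Python sketch makes the cost argument concrete: with a hypothetical stand-in for the expensive isSpam UDF, applying the cheap pagerank filter first means the UDF runs on far fewer tuples (all names and data here are our illustration):

```python
def pagerank_first(urls, is_spam):
    # Reordered plan: cheap pagerank test first, expensive UDF second.
    high = [u for u in urls if u["pagerank"] > 0.8]
    return [u for u in high if is_spam(u["url"])]

calls = []
def is_spam(url):
    # Hypothetical stand-in for an expensive UDF; records each invocation.
    calls.append(url)
    return url.endswith(".spam")

urls = [{"url": "a.spam", "pagerank": 0.9},
        {"url": "b.com", "pagerank": 0.1}]
result = pagerank_first(urls, is_spam)
# is_spam was invoked once instead of once per input tuple
```

An optimizer is free to make this reordering precisely because both steps are filters and the program's step order is not a binding execution order.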
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
gVisits = group visits by $1; -- $1 uses positional notation to refer to the second field
• Schemas are optional and can be assigned dynamically • Operates directly over files
Nested Data Model • Pig Latin has a flexible, fully nested data model (described later) that allows complex, non-atomic data types such as sets, maps, and tuples • The nested model is closer to how programmers think than normalization (1NF) • Avoids expensive joins for web-scale data • Allows programmers to easily write UDFs
UDFs as First-Class Citizens • User-Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach • Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
Outline • Map-Reduce and the Need for Pig Latin • Pig Latin Example • Features and Motivation • Pig Latin • Implementation • Debugging Environment • Usage Scenarios • Related Work • Future Work
Data Model • Atom: contains a simple atomic value, e.g. 'alice' • Tuple: a sequence of fields, e.g. ('lakers', 'iPod') • Bag: a collection of tuples, with possible duplicates
• Map: a collection of data items, where each item has an associated key through which it can be looked up
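One possible Python encoding of these four types, using the paper's 'alice'/'lakers' flavor of example values (the mapping to Python types is our illustration, not Pig's internal representation):

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields; fields may themselves be nested tuples.
tpl = ("alice", ("lakers", "iPod"))

# Bag: a collection of tuples, duplicates allowed (so a list, not a set).
bag = [("lakers", 1), ("lakers", 1), ("iPod", 2)]

# Map: keys mapping to data items of any type, here a bag and an atom.
mp = {"fan of": bag, "age": 20}
```

The point of full nesting is that a bag can sit inside a tuple or a map value, which is exactly what (CO)GROUP produces.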
Pig Latin Commands • Specifying Input Data: LOAD
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
• Per-tuple Processing: FOREACH
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
Pig Latin Commands (Cont.) • Discarding Unwanted Data: FILTER
real_queries = FILTER queries BY userId neq 'bot';
or
real_queries = FILTER queries BY NOT isBot(userId);
• Filtering conditions can involve combinations of expressions, comparison operators such as ==, eq, !=, neq, and the logical connectors AND, OR, NOT
Pig Latin Commands (Cont.) • Getting Related Data Together: COGROUP • Suppose we have two data sets:
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString, revenue BY queryString;
Pig Latin Example 3 • Suppose we are trying to attribute search revenue to search-result urls, to figure out the monetary worth of each url:
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));
where distributeRevenue is a UDF that accepts search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.
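To illustrate the FLATTEN semantics, here is a toy Python sketch; the distribute_revenue function below is a simplified, hypothetical stand-in that just splits revenue evenly, not the paper's actual attribution policy:

```python
def distribute_revenue(results, revenues):
    # Hypothetical UDF: split total revenue for one query evenly
    # across its result urls, returning a bag of (url, share) tuples.
    total = sum(amount for _slot, amount in revenues)
    share = total / len(results) if results else 0.0
    return [(url, share) for url, _pos in results]

# One cogrouped entry: queryString -> (bag of results, bag of revenues).
grouped = {
    "q": ([("a.com", 1), ("b.com", 2)], [("top", 10.0)]),
}

url_revenues = []
for _query, (results, revenues) in grouped.items():
    # FLATTEN: splice each tuple of the UDF's output bag
    # directly into the output relation, one tuple per row.
    url_revenues.extend(distribute_revenue(results, revenues))
# url_revenues == [("a.com", 5.0), ("b.com", 5.0)]
```

Without FLATTEN the output would contain one nested bag per query string; with it, the bags are unnested into a flat relation of (url, revenue) tuples.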
Pig Latin Commands (Cont.) • Special case of COGROUP: GROUP
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;
• JOIN in Pig Latin:
join_result = JOIN result BY queryString, revenue BY queryString;
Pig Latin Commands (Cont.) • Map-Reduce in Pig Latin:
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);
Pig Latin Commands (Cont.) • Other Commands • UNION: returns the union of two or more bags • CROSS: returns the cross product • ORDER: orders a bag by the specified field(s) • DISTINCT: eliminates duplicate tuples in a bag • Nested Operations: Pig Latin allows some commands to be nested within a FOREACH command
Pig Latin Commands (Cont.) • Asking for Output: STORE • The user can ask for the result of a Pig Latin expression sequence to be materialized to a file:
STORE query_revenue INTO 'myoutput' USING myStore();
• myStore is a custom serializer; for plain text files, it can be omitted
Outline • Map-Reduce and the Need for Pig Latin • Pig Latin Example • Features and Motivation • Pig Latin • Implementation • Debugging Environment • Usage Scenarios • Related Work • Future Work
Implementation • [Figure: a user, or an SQL layer with automatic rewriting and optimization, sits on top of Pig, which executes on a Hadoop Map-Reduce cluster] • Pig is open-source: http://incubator.apache.org/pig
Building a Logical Plan • The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid • It builds a logical plan for every bag that the user defines • Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and is executed)
Map-Reduce Plan Compilation • Every group or join operation forms a map-reduce boundary • Other operations pipelined into map and reduce phases
Compilation into Map-Reduce • Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases • Map1: Filter good_urls by pagerank > 0.2 • Reduce1: Group by category; Filter category by count > 10^6; Foreach category generate avg. pagerank
Compilation into Map-Reduce • Map1: Load Visits • Reduce1: Group by url • Map2: Foreach url generate count; Load Url Info • Reduce2: Join on url • Map3: Group by category • Reduce3: Foreach category generate top10(urls) • Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases
Efficiency With Nested Bags • The (CO)GROUP command places tuples belonging to the same group into one or more nested bags • The system can avoid actually materializing these bags, which is especially important when the bags are larger than a machine's main memory • One common case is when the user applies an algebraic aggregation function over the result of a (CO)GROUP operation
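The reason algebraic functions allow this can be sketched in Python: an algebraic aggregate decomposes into per-tuple initialization, a merge of partial states (which can run map-side, like a combiner, so the full bag never materializes in one place), and a finalization step. The function names below are ours, not Pig's API:

```python
from functools import reduce

# AVG as an algebraic function: partial state is (sum, count).
def avg_init(value):
    # Per-tuple initial state.
    return (value, 1)

def avg_combine(a, b):
    # Merge two partial states; associative, so it can run anywhere.
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    # Turn the final merged state into the answer.
    return state[0] / state[1]

# Two map tasks each pre-aggregate their chunk of one group's bag:
chunk1 = [0.9, 0.5]
chunk2 = [0.7]
p1 = reduce(avg_combine, map(avg_init, chunk1))
p2 = reduce(avg_combine, map(avg_init, chunk2))
result = avg_final(avg_combine(p1, p2))   # same as averaging all three values
```

Non-algebraic UDFs (e.g. a median) lack such a decomposition, so for them the whole nested bag must be gathered at one reducer.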
Debugging Environment • Constructing a Pig Latin program is an iterative process: the user makes an initial stab at writing a program, submits it to the system for execution, and inspects the output • To avoid the inefficiency of running over the full data at every iteration, users often create a side data set by hand; unfortunately, this method does not always work well • Pig comes with a debugging environment called Pig Pen, which creates a side data set automatically
Generating a Sandbox Data Set • There are three primary objectives in selecting a sandbox data set • Realism: the sandbox data set should be a subset of the actual data set • Conciseness: example bags should be as small as possible • Completeness: example bags should collectively illustrate the key semantics of each command