
The Pig Experience: Building High-Level Data Flows on top of Map-Reduce




Presentation Transcript


  1. DISTRIBUTED INFORMATION SYSTEMS The Pig Experience: Building High-Level Data Flows on top of Map-Reduce Presenter: Javeria Iqbal Tutor: Dr. Martin Theobald

  2. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Optimization • Future Work

  3. Data Processing Renaissance • Internet companies are swimming in data • TBs/day at Yahoo! or Google • PBs/day at Facebook • Data analysis is the “inner loop” of product innovation

  4. Data Warehousing (SQL)…? • High-level, declarative approach • Little control over the execution method • Scale: often not scalable enough at web scale • Price: prohibitively expensive, up to $200K/TB

  5. Map-Reduce • Map: performs filtering • Reduce: performs aggregation • These two high-level declarative primitives enable parallel processing, BUT there are no complex database operations, e.g. joins

  6. Execution Overview of Map-Reduce • The input is split and the program is started as a master plus worker threads • Each map worker reads its split, parses key/value pairs, and passes them to the user-defined Map function • Buffered pairs are written to local disk partitions, and the locations of the buffered pairs are sent to the reduce workers

  7. Execution Overview of Map-Reduce • Each reduce worker sorts the data by the intermediate keys • Each unique key and its values are passed to the user’s Reduce function • The output is appended to the output file for that reduce partition

  8. The Map-Reduce Appeal • Scale: scalable due to a simpler design; explicit programming model; only parallelizable operations • Price: runs on cheap commodity hardware; less administration • Unlike SQL, procedural control: a processing “pipe”

  9. Disadvantages • 1. Extremely rigid data flow (a single Map then Reduce); other flows such as chains, joins, unions, and splits must be hacked in • 2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct • 3. Semantics are hidden inside the map and reduce functions, so programs are difficult to maintain, extend, and optimize • 4. No combined processing of multiple data sets (joins and other data processing operations)

  10. Motivation Need a high-level, general data flow language

  11. Enter Pig Latin • Need a high-level, general data flow language: Pig Latin

  12. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Optimization • Future Work

  13. Pig Latin: Data Types • Rich and simple data model • Simple types: int, long, double, chararray, bytearray • Complex types: • Atom: string or number, e.g. ‘apple’ • Tuple: collection of fields, e.g. (‘apple’, ‘mango’) • Bag: collection of tuples, e.g. { (‘apple’, ‘mango’), (‘apple’, (‘red’, ‘yellow’)) } • Map: collection of key/value pairs
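  A minimal LOAD sketch showing how these types appear in a declared schema (the file name fruit.txt and the field names are hypothetical, not from the talk):

  fruit = LOAD 'fruit.txt'
          AS (name: chararray,                                   -- simple type
              ratings: bag{t: (user: chararray, score: double)}, -- bag of tuples
              attrs: map[]);                                     -- map of key/value pairs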

  14. Example: Data Model • Atom: contains a single atomic value, e.g. ‘alice’ • Tuple: a sequence of fields, e.g. (‘lakers’, ‘iPod’) • Bag: a collection of tuples with possible duplicates

  15. Pig Latin: Input/Output Data Input: queries = LOAD `query_log.txt' USING myLoad() AS (userId, queryString, timestamp); Output: STORE query_revenues INTO `myoutput' USING myStore();

  16. Pig Latin: General Syntax • Discarding unwanted data: FILTER • Comparison operators such as ==, eq, !=, neq • Logical connectors AND, OR, NOT
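  For example, using the queries relation loaded on the previous slide (the constant 'bot' is illustrative):

  real_queries = FILTER queries BY userId neq 'bot';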

  17. Pig Latin: Expression Table

  18. Pig Latin: FOREACH with FLATTEN
  expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
  -- versus --
  expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
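  To see the effect of FLATTEN, consider one concrete row (this mirrors the expandQuery example from the Pig Latin paper; the values are illustrative):

  -- Suppose expandQuery('lakers') returns the bag { ('lakers rumors'), ('lakers news') }.
  -- Without FLATTEN, the row (alice, 'lakers') yields one output tuple:
  --   (alice, { ('lakers rumors'), ('lakers news') })
  -- With FLATTEN, it yields two output tuples:
  --   (alice, 'lakers rumors') and (alice, 'lakers news')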

  19. Pig Latin: COGROUP • Getting related data together: COGROUP • Suppose we have two data sets: result: (queryString, url, position) and revenue: (queryString, adSlot, amount)
  grouped_data = COGROUP result BY queryString, revenue BY queryString;

  20. Pig Latin: COGROUP vs. JOIN
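  A JOIN is equivalent to a COGROUP followed by flattening the grouped bags, which is why Pig Latin keeps grouping and joining as separate steps (a sketch using the result and revenue relations from the previous slide):

  grouped_data = COGROUP result BY queryString, revenue BY queryString;
  join_result  = FOREACH grouped_data GENERATE FLATTEN(result), FLATTEN(revenue);
  -- The same result in a single step:
  join_result  = JOIN result BY queryString, revenue BY queryString;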

  21. Pig Latin: Map-Reduce • Map-Reduce expressed in Pig Latin:
  map_result = FOREACH input GENERATE FLATTEN(map(*));
  key_group = GROUP map_result BY $0;
  output = FOREACH key_group GENERATE reduce(*);

  22. Pig Latin: Other Commands • UNION: returns the union of two or more bags • CROSS: returns the cross product • ORDER: orders a bag by the specified field(s) • DISTINCT: eliminates duplicate tuples in a bag
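  Minimal usage sketches (the relations some_urls, other_urls, and categories are hypothetical):

  all_urls       = UNION some_urls, other_urls;    -- union of two bags
  url_categories = CROSS some_urls, categories;    -- cross product
  ranked_urls    = ORDER some_urls BY pagerank;    -- order by field(s)
  unique_urls    = DISTINCT some_urls;             -- eliminate duplicate tuples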

  23. Pig Latin: Nested Operations grouped_revenue = GROUP revenue BY queryString; query_revenues = FOREACH grouped_revenue { top_slot = FILTER revenue BY adSlot eq `top'; GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount); };

  24. Pig Pen: Screen Shot

  25. Pig Latin: Example 1 • Suppose we have a table urls: (url, category, pagerank) • A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
  SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6

  26. Data Flow • Filter good_urls by pagerank > 0.2 • Group by category • Filter category by count > 10^6 • Foreach category generate avg. pagerank

  27. Equivalent Pig Latin
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

  28. Example 2: Data Analysis Task • Find the top 10 most visited pages in each category • Input data sets: Visits and UrlInfo

  29. Data Flow • Load Visits • Group by url • Foreach url generate count • Load UrlInfo • Join on url • Group by category • Foreach category generate top-10 urls

  30. Equivalent Pig Latin
  visits = load '/data/visits' as (user, url, time);
  gVisits = group visits by url;
  visitCounts = foreach gVisits generate url, count(visits);
  urlInfo = load '/data/urlInfo' as (url, category, pRank);
  visitCounts = join visitCounts by url, urlInfo by url;
  gCategories = group visitCounts by category;
  topUrls = foreach gCategories generate top(visitCounts,10);
  store topUrls into '/data/topUrls';

  31. Quick Start and Interoperability • Operates directly over files
  visits = load '/data/visits' as (user, url, time);
  gVisits = group visits by url;
  visitCounts = foreach gVisits generate url, count(visits);
  urlInfo = load '/data/urlInfo' as (url, category, pRank);
  visitCounts = join visitCounts by url, urlInfo by url;
  gCategories = group visitCounts by category;
  topUrls = foreach gCategories generate top(visitCounts,10);
  store topUrls into '/data/topUrls';

  32. Quick Start and Interoperability • Schemas are optional; they can be assigned dynamically
  visits = load '/data/visits' as (user, url, time);
  gVisits = group visits by url;
  visitCounts = foreach gVisits generate url, count(visits);
  urlInfo = load '/data/urlInfo' as (url, category, pRank);
  visitCounts = join visitCounts by url, urlInfo by url;
  gCategories = group visitCounts by category;
  topUrls = foreach gCategories generate top(visitCounts,10);
  store topUrls into '/data/topUrls';

  33. User-Code as a First-Class Citizen • User-defined functions (UDFs) can be used in every construct • Load, Store • Group, Filter, Foreach
  visits = load '/data/visits' as (user, url, time);
  gVisits = group visits by url;
  visitCounts = foreach gVisits generate url, count(visits);
  urlInfo = load '/data/urlInfo' as (url, category, pRank);
  visitCounts = join visitCounts by url, urlInfo by url;
  gCategories = group visitCounts by category;
  topUrls = foreach gCategories generate top(visitCounts,10);
  store topUrls into '/data/topUrls';
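  User code is hooked in by registering a jar and invoking the function by name; a hedged sketch (myudfs.jar and the Java class behind top() are assumed for illustration, not part of the talk):

  REGISTER myudfs.jar;   -- makes the user-supplied Java UDFs visible to the script
  topUrls = FOREACH gCategories GENERATE top(visitCounts, 10);   -- top() is one such UDF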

  34. Nested Data Model • Pig Latin has a fully nested data model with: • Atomic values, tuples, bags (lists), and maps • Avoids expensive joins

  35. Nested Data Model • Decouples grouping as an independent operation (e.g. group by url) • Common case: aggregation on these nested sets • Power users: sophisticated UDFs, e.g. sequence analysis • Efficient implementation (see paper) • “I frankly like Pig much better than SQL in some respects (group + optional flatten); I love nested data structures.” (Ted Dunning, Chief Scientist, Veoh)

  36. CoGroup • COGROUP of results and revenue collects, for each key, one bag from each input • Taking the cross-product of the two bags within each group would give the natural join

  37. Pig Features • Explicit data flow language, unlike SQL • Not a low-level procedural language, unlike Map-Reduce • Quick start and interoperability • Modes: interactive, batch, embedded • User-defined functions • Nested data model

  38. Outline Map-Reduce and the need for Pig Latin Pig Latin Compilation into Map-Reduce Optimization Future Work

  39. Pig Process Life Cycle • Parser: Pig Latin to logical plan • Logical optimizer • Logical plan to physical plan • Map-Reduce compiler • Map-Reduce optimizer • Hadoop job manager
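  The plans produced by these stages can be inspected from Pig's Grunt shell with the EXPLAIN command, e.g. for the topUrls alias from the earlier example (a sketch; output omitted):

  grunt> EXPLAIN topUrls;   -- prints the logical, physical, and map-reduce plans for this alias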

  40. Pig Latin to Logical Plan • The script below is translated into a logical plan of operators (LOAD, LOAD, FILTER, JOIN, GROUP, FOREACH, STORE), with schemas such as (x,y,z), (x,y,z,t,u,v), and (group, count) flowing between them:
  A = LOAD 'file1' AS (x,y,z);
  B = LOAD 'file2' AS (t,u,v);
  C = FILTER A BY y > 0;
  D = JOIN C BY x, B BY u;
  E = GROUP D BY z;
  F = FOREACH E GENERATE group, COUNT(D);
  STORE F INTO 'output';

  41. Logical Plan to Physical Plan • (Figure: each logical operator maps to physical operators; GROUP and JOIN expand into LOCAL REARRANGE, GLOBAL REARRANGE, and PACKAGE steps followed by FOREACH, while LOAD, FILTER, and STORE map one-to-one)

  42. Physical Plan to Map-Reduce Plan • (Figure: the physical plan is cut at each GLOBAL REARRANGE; LOAD, FILTER, and LOCAL REARRANGE operators go into map stages, and PACKAGE and FOREACH operators go into the corresponding reduce stages)

  43. Implementation • Pig is open-source: http://hadoop.apache.org/pig • Pig Latin scripts (or SQL, via automatic rewriting and optimization) are compiled into jobs for a Hadoop Map-Reduce cluster • ~50% of Hadoop jobs at Yahoo! are Pig • 1000s of jobs per day

  44. Compilation into Map-Reduce • Every GROUP or JOIN operation forms a map-reduce boundary; other operations are pipelined into the surrounding map and reduce phases • Map1/Reduce1: load Visits, group by url • Map2/Reduce2: foreach url generate count, load UrlInfo, join on url • Map3/Reduce3: group by category, foreach category generate top10(urls)
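  A sketch of where those boundaries fall in the script from slide 30 (the map/reduce stage numbers in the comments are the assumed job boundaries):

  visits = load '/data/visits' as (user, url, time);              -- map 1
  gVisits = group visits by url;                                  -- map 1 / reduce 1 boundary
  visitCounts = foreach gVisits generate url, count(visits);      -- pipelined into reduce 1
  urlInfo = load '/data/urlInfo' as (url, category, pRank);       -- map 2
  visitCounts = join visitCounts by url, urlInfo by url;          -- map 2 / reduce 2 boundary
  gCategories = group visitCounts by category;                    -- map 3 / reduce 3 boundary
  topUrls = foreach gCategories generate top(visitCounts,10);     -- pipelined into reduce 3
  store topUrls into '/data/topUrls';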

  45. Nested Sub-Plans • (Figure: SPLIT and MULTIPLEX operators let several filtered sub-plans, each with its own FOREACH and PACKAGE, share a single LOCAL REARRANGE / GLOBAL REARRANGE pipeline)

  46. Outline Map-Reduce and the need for Pig Latin Pig Latin Compilation into Map-Reduce Optimization Future Work

  47. Using the Combiner • Can pre-process data on the map side to reduce the volume of data shipped to the reducers • Works for algebraic aggregation functions • Also used for DISTINCT processing
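  COUNT, for example, is algebraic, so partial counts can be computed in a map-side combiner and summed in the reducer; a sketch using the visits relation from the earlier example:

  gVisits = GROUP visits BY url;
  visitCounts = FOREACH gVisits GENERATE group, COUNT(visits);   -- partial counts combined map-side, final sum in the reducer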

  48. Skew Join • The default join method is a symmetric hash join; the cross-product for each key is carried out on a single reducer • This is a problem if some keys have too many values • Skew join samples the data to find frequent values • It further splits their tuples across several reducers
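  In Apache Pig this strategy can be requested explicitly with a USING clause; a sketch reusing the relations from the earlier example:

  joined = JOIN visitCounts BY url, urlInfo BY url USING 'skewed';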

  49. Fragment-Replicate Join • The symmetric hash join repartitions both inputs • If size(data set 1) >> size(data set 2), just replicate data set 2 to all partitions of data set 1 • Translates to a map-only job • Data set 2 is opened as a “side file”
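  In Pig syntax this is the replicated join; the small, replicated relation is listed last (a sketch that assumes urlInfo is small enough to fit in memory):

  joined = JOIN visitCounts BY url, urlInfo BY url USING 'replicated';   -- urlInfo is shipped to every map task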

  50. Merge Join • Exploits data sets that are already sorted on the join key • Again a map-only job • The other data set is opened as a “side file”
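  Similarly, a merge join can be requested when both inputs are pre-sorted on the join key (a sketch; sorted_visits and sorted_urlInfo are hypothetical, assumed already sorted by url):

  joined = JOIN sorted_visits BY url, sorted_urlInfo BY url USING 'merge';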
