
Pig : Building High-Level Dataflows over Map-Reduce


Presentation Transcript


  1. Pig : Building High-Level Dataflows over Map-Reduce

  2. Data Processing Renaissance • Internet companies swimming in data • E.g. TBs/day at Yahoo! • Data analysis is “inner loop” of product innovation • Data analysts are skilled programmers

  3. Data Warehousing …? • Scale: often not scalable enough • $: prohibitively expensive at web scale (up to $200K/TB) • SQL: little control over execution method; query optimization is hard (parallel environment, little or no statistics, lots of UDFs)

  4. New Systems For Data Analysis • Map-Reduce • Apache Hadoop • Dryad . . .

  5. Map-Reduce • (Slide figure: input records flow through parallel map tasks and reduce tasks to produce output records.) • Just a group-by-aggregate?

  6. The Map-Reduce Appeal • Scale: scalable due to simpler design; only parallelizable operations; no transactions • $: runs on cheap commodity hardware • SQL: replaced by procedural control, a processing “pipe”

  7. Disadvantages • 1. Extremely rigid data flow (a single map followed by a reduce); other flows (join, union, chains, split) are constantly hacked in • 2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct • 3. Semantics hidden inside map-reduce functions: difficult to maintain, extend, and optimize

  8. Pros And Cons Need a high-level, general data flow language

  9. Enter Pig Latin • Need a high-level, general data flow language ⇒ Pig Latin

  10. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work

  11. Pig Latin Example 1
• Suppose we have a table urls: (url, category, pagerank)
• A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10⁶

  12. Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10⁶;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

  13. Data Flow • Filter good_urls by pagerank > 0.2 → Group by category → Filter category by count > 10⁶ → Foreach category generate avg. pagerank

  14. Pig • Consists of a scripting language known as Pig Latin and the Pig Latin compiler. • Pig Latin is a high-level scripting language used to write code to analyze data. • The compiler converts that code into equivalent MapReduce code. • It is easier to write code in Pig than to program directly in MapReduce. • Pig has an optimizer that decides how to execute the dataflow efficiently.

  15. BENEFITS • Ease of coding: complex tasks involving inter-related data transformations are encoded explicitly as data flow sequences, which makes complex programs easy to write. • Optimization: tasks are encoded in a way that lets the system optimize their execution, so users can concentrate on the data-processing logic rather than on efficiency. • Extensibility: users can create their own custom / user-defined functions (see the sketch below).
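A hedged illustration of the extensibility bullet above (not from the slides): registering a jar of user-defined functions and invoking one like a built-in. The jar name myudfs.jar and the UDF Top10 are hypothetical.
REGISTER myudfs.jar;                                                      -- make the (hypothetical) UDF jar visible to Pig
urls = LOAD 'urls' AS (url, category, pagerank);
groups = GROUP urls BY category;
top_urls = FOREACH groups GENERATE group AS category, myudfs.Top10(urls); -- the UDF is called like a built-in function
STORE top_urls INTO 'top_urls_by_category';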

  16. Why use Pig?

  17. In MAP REDUCE

  18. In Pig Latin

  19. Example • Pig Latin is procedural, so it fits very naturally in the pipeline paradigm. SQL, on the other hand, is declarative.

  20. Consider, for example, a simple pipeline where data from sources users and clicks is to be joined and filtered, then joined to data from a third source geoinfo, aggregated, and finally stored into a table ValuableClicksPerDMA.

  21. SQL Query
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using (ipaddr)
group by dma;
• SQL is declarative rather than step-by-step.

  22. The Pig Latin for this will look like:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
• Pig Latin is procedural (a dataflow programming model)
• The step-by-step query style is much cleaner and easier to write

  23. Example Data Analysis Task • Find the top 10 most visited pages in each category • Inputs: the Visits and Url Info tables

  24. Data Flow • Load Visits → Group by url → Foreach url generate count → Join on url with loaded Url Info → Group by category → Foreach category generate top10 urls

  25. Dataflow Language • The user specifies a sequence of steps, where each step specifies only a single high-level data transformation. • “The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” (Jasmine Novak, Engineer, Yahoo!)

  26. Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
• Operates directly over files

  27. Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
• Schemas are optional; they can be assigned dynamically

  28. User-Code as a First-Class Citizen
• User-defined functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';

  29. UDFs as First-Class Citizens
• User-Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach
• Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

  30. Data Model • Tuple: an ordered set of fields. Example: (raja, 30) • Bag: a collection of tuples. Example: {(raju, 30), (Mohammad, 45)} • Map: a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
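As an illustrative sketch (file and field names assumed, not from the slide), the three types above can all appear together in a LOAD schema:
-- name is an atomic field, details a tuple, friends a bag of tuples, attrs a map
people = LOAD 'people' AS (name:chararray,
                           details:tuple(age:int, city:chararray),
                           friends:bag{t:(fname:chararray)},
                           attrs:map[]);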

  31. Nested Data Model • Pig Latin has a fully nestable data model with: atomic values, tuples, bags (lists), and maps • More natural to programmers than flat tuples • Avoids expensive joins • (Slide figure: an example nested record for yahoo containing items such as finance, email, news.)

  32. Pig Latin – Relational Operations

  33. Pig Latin – Relational Operations
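The two slides above showed Pig Latin's relational operators as a table that did not survive the transcript. As a hedged illustration only (file and field names assumed), a short script touching the most common ones:
urls = LOAD 'urls' AS (url, category, pagerank);                   -- LOAD
good = FILTER urls BY pagerank > 0.2;                              -- FILTER
cats = GROUP good BY category;                                     -- GROUP
stats = FOREACH cats GENERATE group, AVG(good.pagerank) AS avgpr;  -- FOREACH … GENERATE
ordered = ORDER stats BY avgpr DESC;                               -- ORDER
top5 = LIMIT ordered 5;                                            -- LIMIT
STORE top5 INTO 'top_categories';                                  -- STORE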

  34. UDFs as First-Class Citizens
• User-Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach
• Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

  35. Nested Data Model • Decouples grouping as an independent operation (e.g., group by url): grouping produces a nested bag per group rather than forcing an immediate aggregation (see the sketch below).
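A minimal sketch of that decoupling (schema consistent with the earlier visits example; not from the slide): GROUP yields a nested bag per key, and any aggregation is a separate, later step.
gVisits = GROUP visits BY url;                                    -- each tuple: (group, {bag of visits with that url})
counts = FOREACH gVisits GENERATE group AS url, COUNT(visits);    -- aggregation, if wanted, comes afterwards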

  36. CoGroup • (Slide figure: COGROUP of the results and revenue datasets, yielding one nested bag from each input per group.) • Taking the cross-product of the two bags would give the natural join (see the sketch below).
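A hedged sketch of COGROUP over results and revenue (field names assumed, in the spirit of the Pig Latin paper's example); flattening both nested bags gives the per-group cross-product, i.e. the natural join.
grouped = COGROUP results BY query, revenue BY query;                    -- one nested bag from each input per key
joined = FOREACH grouped GENERATE FLATTEN(results), FLATTEN(revenue);    -- cross-product per group = join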

  37. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work

  38. Implementation • (Slide figure: a user writes SQL or Pig Latin; queries are automatically rewritten and optimized by Pig and run on a Hadoop map-reduce cluster.) • Pig is open-source: http://hadoop.apache.org/pig • ~50% of Hadoop jobs at Yahoo! are Pig • 1000s of jobs per day

  39. Building a Logical Plan • The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid • It builds a logical plan for every bag that the user defines • Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed; see the sketch below)
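A small sketch of that lazy behaviour (data and field names assumed): the first four statements only extend the logical plan; nothing executes until STORE.
a = LOAD 'data' AS (x:int, y:chararray);     -- parsed, validated, and added to the logical plan only
b = FILTER a BY x > 0;
c = GROUP b BY y;
d = FOREACH c GENERATE group, COUNT(b);
STORE d INTO 'out';                          -- triggers compilation into a physical (map-reduce) plan and execution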

  40. Compilation into Map-Reduce • Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases • (Slide figure: the top-10-urls dataflow split into three map-reduce jobs: Map1 loads Visits and Reduce1 groups by url; Map2 generates a count per url and loads Url Info, Reduce2 joins on url; Map3 groups by category and Reduce3 generates top10(urls) per category.)
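As a hedged illustration, the earlier top-10-urls script annotated with where those boundaries fall (the job split is the compiler's; the comments, and the assumption that top is a UDF, are ours):
visits = LOAD '/data/visits' AS (user, url, time);                    -- Map 1
gVisits = GROUP visits BY url;                                        -- boundary: Reduce 1
visitCounts = FOREACH gVisits GENERATE group AS url, COUNT(visits);   -- pipelined into Map 2
urlInfo = LOAD '/data/urlInfo' AS (url, category, pRank);             -- Map 2
joined = JOIN visitCounts BY url, urlInfo BY url;                     -- boundary: Reduce 2
gCategories = GROUP joined BY category;                               -- boundary: Reduce 3
topUrls = FOREACH gCategories GENERATE top(joined, 10);               -- pipelined into Reduce 3
STORE topUrls INTO '/data/topUrls';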

  41. Optimizations: Skew Join • The default join method is a symmetric hash join, in which the cross-product for each key is carried out on a single reducer • This is a problem if too many values share the same key • Skew join further splits such keys among reducers (see the sketch below)
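In current Pig, a skew join can also be requested explicitly with a join hint. A minimal sketch (relation and key names assumed):
big = LOAD 'page_views' AS (user, url);
small = LOAD 'users' AS (user, age);
J = JOIN big BY user, small BY user USING 'skewed';   -- heavily skewed keys are split across reducers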

  42. Optimizations: Multiple Data Flows • (Slide figure: one script with two flows: Map1 loads Users and filters bots, then groups by state and, separately, by demographic; Reduce1 applies UDFs and stores into 'bystate' and 'bydemo'.)

  43. Optimizations: Multiple Data Flows • (Slide figure: the same two flows executed as a single map-reduce job: Map1 loads Users, filters bots, and a Split operator feeds both the group-by-state and group-by-demographic flows; Reduce1 demultiplexes the records, applies the UDFs, and stores into 'bystate' and 'bydemo'. See the sketch below.)
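The kind of script that triggers this optimization is one input feeding two groupings and two stores. A hedged sketch (field names and the isBot filter are assumed); Pig inserts the Split/Demultiplex operators so both flows share one map-reduce job.
users = LOAD 'users' AS (name, state, demographic, agent);
clean = FILTER users BY NOT isBot(agent);                   -- isBot: assumed UDF standing in for "filter bots"
byState = GROUP clean BY state;
byDemo = GROUP clean BY demographic;
stateOut = FOREACH byState GENERATE group, COUNT(clean);    -- "apply udfs"
demoOut = FOREACH byDemo GENERATE group, COUNT(clean);
STORE stateOut INTO 'bystate';
STORE demoOut INTO 'bydemo';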

  44. Other Optimizations • Carry data as byte arrays as far as possible • Use a binary comparator for sorting • “Stream” data through external executables (see the sketch below)
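A minimal sketch of the streaming bullet (the external script name is assumed): each record is piped through an external program.
raw = LOAD 'data' AS (line:chararray);
cleaned = STREAM raw THROUGH `perl clean.pl`;   -- the executable reads records on stdin and writes results to stdout
STORE cleaned INTO 'cleaned';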

  45. Performance

  46. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work

  47. Example Dataflow Program • Task: find users that tend to visit high-pagerank pages • LOAD (user, url) → FOREACH: user, canonicalize(url) → JOIN on url with LOAD (url, pagerank) → GROUP on user → FOREACH: user, AVG(pagerank) → FILTER avgPR > 0.5
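A hedged Pig Latin rendering of this slide's dataflow (file and field names assumed; canonicalize is the UDF named on the slide):
visits = LOAD 'visits' AS (user, url);
pages = LOAD 'pages' AS (url, pagerank);
canonV = FOREACH visits GENERATE user, canonicalize(url) AS url;        -- normalize urls with the UDF
joined = JOIN canonV BY url, pages BY url;
byUser = GROUP joined BY user;
avgPR = FOREACH byUser GENERATE group AS user, AVG(joined.pagerank) AS avgpr;
answer = FILTER avgPR BY avgpr > 0.5;                                   -- keep users whose average pagerank is high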

  48. Iterative Process • The same dataflow produces no output after the final FILTER avgPR > 0.5 • Debugging questions: Are we joining on the right attribute? Is there a bug in the UDF canonicalize? Is everything being filtered out?

  49. How to do test runs? • Run with real data? Too inefficient (TBs of data) • Create smaller data sets (e.g., by sampling)? Empty results due to joins [Chaudhuri et al. 99] and selective filters • Biased sampling for joins? Indexes are not always present
