
Utkarsh Srivastava

Pig: Building High-Level Dataflows over Map-Reduce. Research & Cloud Computing.


Presentation Transcript


  1. Pig: Building High-Level Dataflows over Map-Reduce • Utkarsh Srivastava • Research & Cloud Computing

  2. Data Processing Renaissance • Internet companies swimming in data • E.g. TBs/day at Yahoo! • Data analysis is “inner loop” of product innovation • Data analysts are skilled programmers

  3. Data Warehousing …? [Figure labels: Scale, $ $ $ $, SQL] • Often not scalable enough • Prohibitively expensive at web scale • Up to $200K/TB • Little control over execution method • Query optimization is hard • Parallel environment • Little or no statistics • Lots of UDFs

  4. New Systems For Data Analysis • Map-Reduce • Apache Hadoop • Dryad • ...

  5. Map-Reduce [Diagram: input records pass through parallel map tasks, then parallel reduce tasks, to output records] • Just a group-by-aggregate?
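The "group-by-aggregate" reading of map-reduce can be made concrete with a small in-memory sketch. This is plain Python, not Hadoop; `map_reduce` and the sample `visits` records are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run one map-reduce round in memory: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: collect values per key
    # reduce: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (user, url) visit records.
visits = [("amy", "cnn.com"), ("fred", "cnn.com"), ("amy", "frogs.com")]

counts = map_reduce(
    visits,
    map_fn=lambda rec: [(rec[1], 1)],        # key each visit by its url
    reduce_fn=lambda url, ones: sum(ones),   # aggregate the group
)
print(counts)
```

The map function picks the grouping key and the reduce function is the aggregate, which is why a single round looks like a group-by followed by an aggregation.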

  6. The Map-Reduce Appeal [Compared on the same axes as slide 3: Scale, $, SQL] • Scalable due to simpler design • Only parallelizable operations • No transactions • Runs on cheap commodity hardware • Procedural control: a processing “pipe”

  7. Disadvantages [Diagram: fixed chains of map (M) and reduce (R) stages] • 1. Extremely rigid data flow • Other flows constantly hacked in: join, union, chains, split • 2. Common operations must be coded by hand • Join, filter, projection, aggregates, sorting, distinct • 3. Semantics hidden inside map-reduce functions • Difficult to maintain, extend, and optimize

  8. Pros And Cons Need a high-level, general data flow language

  9. Enter Pig Latin • Need a high-level, general data flow language: Pig Latin

  10. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work

  11. Example Data Analysis Task • Find the top 10 most visited pages in each category • Input tables: Visits, Url Info

  12. Data Flow • Load Visits → Group by url → Foreach url generate count • Load Url Info • Join on url → Group by category → Foreach category generate top10 urls

  13. In Pig Latin: visits = load '/data/visits' as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load '/data/urlInfo' as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into '/data/topUrls';
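To check one's reading of the program above, the same dataflow can be sketched over in-memory lists. This is plain Python rather than Pig, the sample rows are hypothetical stand-ins for /data/visits and /data/urlInfo, and the join assumes each url appears once in the url info table.

```python
from collections import Counter, defaultdict

# Hypothetical rows: visits are (user, url, time), url_info is (url, category, pRank).
visits = [
    ("amy", "cnn.com", 1), ("fred", "cnn.com", 2), ("amy", "bbc.com", 3),
    ("amy", "espn.com", 4), ("fred", "espn.com", 5), ("bob", "espn.com", 6),
]
url_info = [
    ("cnn.com", "news", 0.9), ("bbc.com", "news", 0.8), ("espn.com", "sports", 0.7),
]

def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url; then group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:
            by_category[category].append((url, visit_counts[url]))
    # foreach category generate the n most visited urls (ties broken by url)
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }

print(top_urls_per_category(visits, url_info))
```

Each comment names the Pig Latin statement the Python step stands in for, so the two versions can be read side by side.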

  14. Step-by-step Procedural Control • Target users are entrenched procedural programmers • User quotes (Jasmine Novak, Engineer, Yahoo!; David Ciemiewicz, Search Excellence, Yahoo!): “The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” “With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.” • Automatic query optimization is hard • Pig Latin does not preclude optimization

  15. Quick Start and Interoperability • Operates directly over files • Program: visits = load '/data/visits' as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load '/data/urlInfo' as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into '/data/topUrls';

  16. Quick Start and Interoperability • Schemas optional; can be assigned dynamically • Program: visits = load '/data/visits' as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load '/data/urlInfo' as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into '/data/topUrls';

  17. User-Code as a First-Class Citizen • User-defined functions (UDFs) can be used in every construct • Load, Store • Group, Filter, Foreach • Program: visits = load '/data/visits' as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load '/data/urlInfo' as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into '/data/topUrls';

  18. Nested Data Model • Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps • More natural to programmers than flat tuples • Avoids expensive joins [Example figure: yahoo paired with a nested collection of finance, email, news]

  19. Nested Data Model • Decouples grouping as an independent operation (e.g., group by url) • Common case: aggregation on these nested sets • Power users: sophisticated UDFs, e.g., sequence analysis • Efficient implementation (see paper) • “I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).” Ted Dunning, Chief Scientist, Veoh
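Grouping as a standalone step, with aggregation or flattening applied afterwards, might look like the following in-memory sketch. It is plain Python rather than Pig; `group_by`, `flatten`, and the sample rows are hypothetical.

```python
from collections import defaultdict

def group_by(rows, key_index):
    """GROUP: return {key: bag of full rows}; no aggregation is applied yet."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return dict(bags)

def flatten(grouped):
    """FLATTEN: un-nest the bags back into flat rows."""
    return [row for bag in grouped.values() for row in bag]

# Hypothetical (user, url) visits.
visits = [("amy", "cnn.com"), ("fred", "cnn.com"), ("amy", "frogs.com")]

by_url = group_by(visits, key_index=1)                    # nested result
counts = {url: len(bag) for url, bag in by_url.items()}   # common case: aggregate
print(by_url)
print(counts)
```

Because the grouped result is itself a value (a key with a nested bag), any later step can consume it: an aggregate, a custom function over the whole bag, or a flatten.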

  20. CoGroup [Figure: the results and revenue tables cogrouped on a shared key, giving one bag per input for each key] • Cross-product of the 2 bags would give natural join
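The relationship between cogroup and join can be sketched in memory as follows. This is plain Python, not Pig; the `results` and `revenue` rows and their column layout are hypothetical, since the slide's tables did not survive extraction.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, left_key=0, right_key=0):
    """COGROUP: for each key, keep one bag of rows per input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return dict(groups)

def join_from_cogroup(cogrouped):
    """Take the cross-product of the two bags under each key."""
    return [
        l_row + r_row
        for l_bag, r_bag in cogrouped.values()
        for l_row, r_row in product(l_bag, r_bag)
    ]

# Hypothetical rows: (query, url, position) and (query, ad_slot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped = cogroup(results, revenue)
joined = join_from_cogroup(grouped)
print(len(joined))
```

Keeping the two bags separate lets a user-defined function decide what to do with them; the cross-product is only one choice. Note that this sketch keeps the key column from both inputs in each joined row.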

  21. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work

  22. Implementation [Diagram labels: user, SQL, Pig, automatic rewrite + optimize, Hadoop Map-Reduce, cluster] • Pig is open-source: http://hadoop.apache.org/pig • ~50% of Hadoop jobs at Yahoo! are Pig • 1000s of jobs per day

  23. Compilation into Map-Reduce • Every group or join operation forms a map-reduce boundary • Other operations pipelined into map and reduce phases [Diagram: the data flow from slide 12 (Load Visits, Group by url, Foreach url generate count, Load Url Info, Join on url, Group by category, Foreach category generate top10(urls)) divided into phases Map1, Reduce1, Map2, Reduce2, Map3, Reduce3]
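The boundary rule can be sketched for a single linear chain of operators. This is a simplified illustration in Python, not Pig's actual compiler: `assign_phases` is a hypothetical helper, and it ignores the second input branch (Load Url Info) of the real plan.

```python
BLOCKING = {"group", "join"}   # operators that need a shuffle

def assign_phases(ops):
    """Label each operator in a linear plan with the phase it would run in."""
    job = 0
    phase = None
    plan = []
    for op in ops:
        if op.split()[0].lower() in BLOCKING:
            job += 1                                  # each group/join starts a new job
            plan.append((op, f"Map{job} -> Reduce{job}"))
            phase = f"Reduce{job}"                    # later ops pipeline into its reduce
        else:
            plan.append((op, phase or "Map1"))        # before any shuffle: first map
    return plan

plan = assign_phases([
    "load visits",
    "group by url",
    "foreach url generate count",
    "join on url",
    "group by category",
    "foreach category generate top10",
])
for op, phase in plan:
    print(f"{phase:16} {op}")
```

Under these assumptions the example flow needs three jobs: one for each of the two group operations and one for the join, with the foreach steps riding along in the preceding reduce phase.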

  24. Optimizations: Using the Combiner [Diagram: input records, map tasks, reduce tasks, output records] • Can pre-process data on the map-side to reduce data shipped • Algebraic aggregation functions • Distinct processing
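An algebraic aggregate is one that can be computed from partial results, which is what makes map-side pre-aggregation possible. A minimal sketch for AVG, in plain Python with hypothetical function names (`initial`, `combine`, `final`) and hypothetical input splits:

```python
def initial(value):
    """Turn one input value into a partial result (sum, count)."""
    return (value, 1)

def combine(a, b):
    """Merge two partial results; safe to apply in any order, any number of times."""
    return (a[0] + b[0], a[1] + b[1])

def final(partial):
    """Produce the average from the merged partial result."""
    return partial[0] / partial[1]

def map_side(records):
    """Pre-aggregate one input split: one partial per key instead of one row per record."""
    partials = {}
    for key, value in records:
        part = initial(value)
        partials[key] = combine(partials[key], part) if key in partials else part
    return partials

def reduce_side(all_partials):
    """Merge the partials shipped from every map task, then finish."""
    merged = {}
    for partials in all_partials:
        for key, part in partials.items():
            merged[key] = combine(merged[key], part) if key in merged else part
    return {key: final(part) for key, part in merged.items()}

# Two hypothetical input splits of (user, pagerank) records.
split_a = [("amy", 0.9), ("amy", 0.3), ("fred", 0.4)]
split_b = [("amy", 0.6)]
averages = reduce_side([map_side(split_a), map_side(split_b)])
print(averages)
```

Here the first split ships two partials instead of three records; on real data with many records per key the saving is much larger.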

  25. Optimizations: Skew Join • Default join method is symmetric hash join • For each key, the cross product is carried out on 1 reducer • Problem if too many values with same key • Skew join samples data to find frequent values • Further splits them among reducers
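The idea of sampling for frequent keys and spreading them across reducers can be sketched as below. This is an in-memory Python illustration under stated assumptions, not Pig's implementation: the threshold, the round-robin placement, and all names are hypothetical, and reducers are simulated as lists.

```python
import random
from collections import Counter, defaultdict

def find_hot_keys(rows, sample_size, threshold, seed=0):
    """Sample the skewed input and return keys whose sample share is >= threshold."""
    sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
    counts = Counter(row[0] for row in sample)
    return {key for key, n in counts.items() if n / len(sample) >= threshold}

def skew_join(big, small, hot_keys, num_reducers):
    """Join on column 0, spreading rows of hot keys over all reducers."""
    reducers = [([], []) for _ in range(num_reducers)]
    next_slot = 0
    for row in big:
        if row[0] in hot_keys:
            target = next_slot % num_reducers      # hot key: round-robin placement
            next_slot += 1
        else:
            target = hash(row[0]) % num_reducers   # normal key: hash partitioning
        reducers[target][0].append(row)
    for row in small:
        # Rows matching a hot key are replicated to every reducer.
        targets = range(num_reducers) if row[0] in hot_keys else [hash(row[0]) % num_reducers]
        for target in targets:
            reducers[target][1].append(row)
    output = []
    for big_part, small_part in reducers:          # each reducer joins locally
        index = defaultdict(list)
        for row in small_part:
            index[row[0]].append(row)
        for row in big_part:
            output.extend(row + match[1:] for match in index.get(row[0], []))
    return output

# Hypothetical skewed input: most rows share the key "cnn".
big = [("cnn", i) for i in range(6)] + [("bbc", 99)]
small = [("cnn", "news"), ("bbc", "news")]
hot = find_hot_keys(big, sample_size=len(big), threshold=0.5)
joined = skew_join(big, small, hot, num_reducers=3)
print(sorted(joined))
```

Each row of the skewed input lands on exactly one reducer, and the matching rows from the other input are present wherever it lands, so every joined pair is produced once while the hot key's work is shared.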

  26. Optimizations: Fragment-Replicate Join • Symmetric-hash join repartitions both inputs • If size(data set 1) >> size(data set 2), just replicate data set 2 to all partitions of data set 1 • Translates to map-only job • Open data set 2 as “side file”
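A map-only join of this kind might be sketched as follows, with each partition of the large input processed independently against an in-memory copy of the small one. Plain Python; the function name, partition layout, and rows are hypothetical.

```python
from collections import defaultdict

def fragment_replicate_join(big_partitions, small, big_key=0, small_key=0):
    """Join each partition of the big input against a replicated small input."""
    # Every map task would build this index from the "side file".
    index = defaultdict(list)
    for row in small:
        index[row[small_key]].append(row)
    output = []
    for partition in big_partitions:   # one iteration stands in for one map task
        output.append([
            big_row + small_row
            for big_row in partition
            for small_row in index.get(big_row[big_key], [])
        ])
    return output

# Hypothetical data: visits split into two partitions, plus a small url table.
visit_partitions = [
    [("cnn.com", "amy"), ("bbc.com", "amy")],
    [("cnn.com", "fred"), ("unknown.com", "bob")],
]
url_info = [("cnn.com", "news"), ("bbc.com", "news")]

parts = fragment_replicate_join(visit_partitions, url_info)
print(parts)
```

No shuffle is needed because the large input is never repartitioned; the cost is that the small input must fit in each task's memory.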

  27. Optimizations: Merge Join • Exploits data sets that are already sorted • Again, a map-only job • Open other data set as “side file”
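The merge step itself is the classic sort-merge join. A minimal sketch in plain Python, assuming both inputs are lists of rows sorted on their first column (names and data are hypothetical):

```python
def merge_join(left, right):
    """Join two inputs that are both already sorted on column 0."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                out.extend(left[i] + right[k][1:] for k in range(j, j_end))
                i += 1
            j = j_end
    return out

left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
print(merge_join(left, right))
```

Both inputs are read once, front to back, so neither needs to be repartitioned or held in memory.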

  28. Optimizations: Multiple Data Flows • Load Users → Filter bots • Group by state → Apply udfs → Store into 'bystate' • Group by demographic → Apply udfs → Store into 'bydemo' [Diagram phases: Map1, Reduce1]

  29. Optimizations: Multiple Data Flows • Load Users → Filter bots → Split • Group by state → Apply udfs → Store into 'bystate' • Group by demographic → Apply udfs → Store into 'bydemo' • Demultiplex [Diagram phases: Map1, Reduce1]
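One way to picture running both flows in a single job is to tag each shuffled key with the flow it belongs to and route on that tag in the reducer. The sketch below is an in-memory Python illustration, not Pig's implementation; the record fields are hypothetical and a simple count stands in for "apply udfs".

```python
from collections import defaultdict

def multi_flow_job(users):
    """Run two group-by flows over one shared scan of the input."""
    shuffled = defaultdict(list)
    for user in users:                 # Map1: shared load + filter
        if user["is_bot"]:
            continue
        # Split: emit once per downstream flow, tagging the key with the flow name.
        shuffled[("bystate", user["state"])].append(user)
        shuffled[("bydemo", user["demographic"])].append(user)
    outputs = {"bystate": {}, "bydemo": {}}
    for (flow, key), bag in shuffled.items():   # Reduce1: demultiplex on the tag
        outputs[flow][key] = len(bag)           # stand-in for the per-flow UDFs
    return outputs

# Hypothetical user records.
users = [
    {"state": "CA", "demographic": "18-25", "is_bot": False},
    {"state": "CA", "demographic": "26-35", "is_bot": False},
    {"state": "NY", "demographic": "18-25", "is_bot": False},
    {"state": "NY", "demographic": "18-25", "is_bot": True},
]
print(multi_flow_job(users))
```

The input is loaded and filtered once rather than once per output, which is the saving the two slides illustrate.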

  30. Other Optimizations • Carry data as byte arrays as far as possible • Use a binary comparator for sorting • “Streaming” data through external executables

  31. Performance

  32. Outline • Map-Reduce and the need for Pig Latin • Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work

  33. Example Dataflow Program • Find users that tend to visit high-pagerank pages • LOAD (user, url) → FOREACH user, canonicalize(url) • LOAD (url, pagerank) • JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5

  34. Iterative Process [Dataflow as on slide 33] • No output ☹ • Joining on right attribute? • Bug in UDF canonicalize? • Everything being filtered out?

  35. How to do test runs? • Run with real data • Too inefficient (TBs of data) • Create smaller data sets (e.g., by sampling) • Empty results due to joins [Chaudhuri et al. 99], and selective filters • Biased sampling for joins • Indexes not always present

  36. Examples to Illustrate Program • LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html) • LOAD (url, pagerank): (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4) • FOREACH user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com) • JOIN on url: (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4) • GROUP on user: (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)}), (Fred, {(Fred, www.snails.com, 0.4)}) • FOREACH user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4) • FILTER avgPR > 0.5: (Amy, 0.6)
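The example tuples can be reproduced by running the dataflow over the slide's sample data. The sketch below is plain Python, not Pig; in particular `canonicalize` is a hypothetical implementation (drop the scheme and path, ensure a leading "www."), since the slides do not define the UDF.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical UDF: drop scheme and path, ensure a leading 'www.'."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

def high_pagerank_users(visits, pageranks, threshold=0.5):
    # FOREACH user, canonicalize(url)
    canonical = [(user, canonicalize(url)) for user, url in visits]
    # JOIN on url
    rank_of = dict(pageranks)
    joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]
    # GROUP on user
    by_user = defaultdict(list)
    for user, _url, rank in joined:
        by_user[user].append(rank)
    # FOREACH user, AVG(pagerank)
    averages = {user: sum(ranks) / len(ranks) for user, ranks in by_user.items()}
    # FILTER avgPR > threshold
    return {user: avg for user, avg in averages.items() if avg > threshold}

# Sample tuples taken from the slide.
visits = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

print(high_pagerank_users(visits, pageranks))
```

Removing the `canonicalize` step makes the join match nothing on this data, which is exactly the "no output" situation from slide 34 that small, well-chosen examples are meant to expose.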

  37. Value Addition From Examples • Examples can be used for: • Debugging • Understanding a program written by someone else • Learning a new operator, or language

  38. Good Examples: Consistency [Dataflow as on slide 33] • 0. Consistency: output example = operator applied on input example • E.g., FOREACH user, canonicalize(url): input (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html); output (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)

  39. Good Examples: Realism [Dataflow as on slide 33] • 1. Realism • Example tuples: (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html); (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)

  40. Good Examples: Completeness [Dataflow as on slide 33] • 2. Completeness: demonstrate the salient properties of each operator • E.g., FILTER: input (Amy, 0.6), (Fred, 0.4); output (Amy, 0.6)

  41. Good Examples: Conciseness [Dataflow as on slide 33] • 3. Conciseness • Example tuples: (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html); (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)

  42. Implementation Status • Available as ILLUSTRATE command in open-source release of Pig • Available as Eclipse Plugin (PigPen) • See SIGMOD09 paper for algorithm and experiments

  43. Related Work • Sawzall: data processing language on top of map-reduce; rigid structure of filtering followed by aggregation • Hive: SQL-like language on top of Map-Reduce • DryadLINQ: SQL-like language on top of Dryad • Nested data models: object-oriented databases

  44. Future / In-Progress Tasks • Columnar-storage layer • Metadata repository • Profiling and performance optimizations • Tight integration with a scripting language: use loops, conditionals, functions of host language • Memory management • Project suggestions at: http://wiki.apache.org/pig/ProposedProjects

  45. Credits

  46. Summary • Big demand for parallel data processing • Emerging tools that do not look like SQL DBMS • Programmers like dataflow pipes over static files • Hence the excitement about Map-Reduce • But Map-Reduce is too low-level and rigid • Pig Latin: sweet spot between map-reduce and SQL
