Programming and Debugging Large-Scale Data Processing Workflows
Christopher Olston and many others
Yahoo! Research
Context
• Elaborate processing of large data sets, e.g.:
  • web search pre-processing
  • cross-dataset linkage
  • web information extraction
[Diagram labels: ingestion; storage & processing; serving]
Context
Storage & processing stack (numbers mark the components covered in this talk):
• workflow manager, e.g. Nova (Nova, 2)
• dataflow programming framework, e.g. Pig (Pig, 1)
• distributed sorting & hashing, e.g. Map-Reduce
• scalable file system, e.g. GFS
Debugging aids:
• Before: example data generator (Dataflow Illustrator, 3)
• During: instrumentation framework (Inspector Gadget, 4)
• After: provenance metadata manager (Ibis, 5)
Next: PIG (1, the dataflow programming framework)
Pig: A High-Level Language and Runtime for Map-Reduce
• 1/20 the lines of code
• 1/16 the development time
• performs on par with raw Hadoop
Syntax
Web browsing sessions with “happy endings.”
    Visits = load '/data/visits' as (user, url, time);
    Visits = foreach Visits generate user, Canonicalize(url), time;
    Pages = load '/data/pages' as (url, pagerank);
    VP = join Visits by url, Pages by url;
    UserVisits = group VP by user;
    Sessions = foreach UserVisits generate flatten(FindSessions(*));
    HappyEndings = filter Sessions by BestIsLast(*);
    store HappyEndings into '/data/happy_endings';
Pig Latin = Sweet Spot Between SQL & Map-Reduce
“The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig.”
-- Prasenjit Mukherjee, Mahout project
“I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
-- Jasmine Novak, Engineer, Yahoo!
“PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP .. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map-Reduce] doesn’t).”
-- Ricky Ho, Adobe Software
Status
• Productized at Yahoo (~12-person team)
  • 1000s of jobs/day
  • 70% of Hadoop jobs
• Open-source (the Apache Pig Project)
• Offered on Amazon Elastic Map-Reduce
• Used by LinkedIn, Twitter, Yahoo, ...
Next: NOVA (2, the workflow manager)
Why a Workflow Manager?
• Modularity: a workflow connects N dataflow modules
  • Written independently, and re-used in other workflows
  • Scheduled independently
• Optimization: optimize across modules
  • Share read costs among side-by-side modules
  • Pipeline data between end-to-end modules
• Continuous processing: push new data through
  • Selective re-running
  • Incremental algorithms (“view maintenance”)
• Manageability: help humans keep tabs on execution
  • Alerts
  • Metadata (e.g. data provenance)
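Example Workflow
[Diagram labels, in flow order; NEW and ALL mark whether an edge carries only new data or the full data set, stored as HDFS “delta blocks”]
• RSS feed → news articles (NEW)
• template detection (reads ALL) → news site templates
• template tagging (a Pig program; reads NEW articles and ALL templates) → NEW
• shingling (NEW) → shingle hashes (NEW)
• de-duping (reads NEW shingle hashes and ALL shingle hashes seen) → unique articles (NEW)
• programmable “merge” operation (lazy vs. eager)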
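The de-duping stage of this workflow can be sketched in Python to show how a module consumes only NEW data while maintaining ALL state. This is a simplified illustration, not Nova's implementation: `shingle_hash` and `dedupe_delta` are hypothetical names, and the sketch assumes an article is a duplicate only when its full set of 3-word shingles matches one seen before.

```python
import hashlib


def shingle_hash(text, k=3):
    """Hash the set of k-word shingles of an article (hypothetical scheme)."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    joined = "\n".join(sorted(shingles))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def dedupe_delta(new_articles, seen_hashes):
    """Process only the NEW delta and update the ALL set of hashes seen so far."""
    unique = []
    for article in new_articles:
        h = shingle_hash(article)
        if h not in seen_hashes:
            seen_hashes.add(h)      # the ALL state grows incrementally
            unique.append(article)  # only unseen articles flow downstream
    return unique


seen = set()  # stands in for the "shingle hashes seen" (ALL) data set
batch1 = dedupe_delta(["Snails win the race", "Frogs lose the race"], seen)
batch2 = dedupe_delta(["snails win the race", "Cats nap all day"], seen)
```

Each call touches only the new batch plus the running set of hashes, so earlier articles never need to be re-read.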
Nova Performance
[Plots: Merge Overhead; Incremental Join]
Next: DATAFLOW ILLUSTRATOR (3, the example data generator used before execution)
Example Pig Dataflow
Find users who tend to visit “good” pages.
• Load Visits(user, url, time)
• Transform to (user, Canonicalize(url), time)
• Load Pages(url, pagerank)
• Join url = url
• Group by user
• Transform to (user, Average(pagerank) as avgPR)
• Filter avgPR > 0.5
Illustrated!
• Load Visits(user, url, time)
    (Amy, cnn.com, 8am)
    (Amy, http://www.snails.com, 9am)
    (Fred, www.snails.com/index.html, 11am)
• Load Pages(url, pagerank)
    (www.cnn.com, 0.9)
    (www.snails.com, 0.4)
• Transform to (user, Canonicalize(url), time)
    (Amy, www.cnn.com, 8am)
    (Amy, www.snails.com, 9am)
    (Fred, www.snails.com, 11am)
• Join url = url
    (Amy, www.cnn.com, 8am, 0.9)
    (Amy, www.snails.com, 9am, 0.4)
    (Fred, www.snails.com, 11am, 0.4)
• Group by user
    (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
    (Fred, { (Fred, www.snails.com, 11am, 0.4) })
• Transform to (user, Average(pagerank) as avgPR)
    (Amy, 0.65)
    (Fred, 0.4)
• Filter avgPR > 0.5
    (Amy, 0.65)
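The illustrated dataflow can be traced step by step in plain Python. This is a minimal sketch rather than Pig itself: `canonicalize` is a hypothetical stand-in for the Canonicalize UDF, whose real behavior the slides do not specify, and the join assumes each url appears at most once in Pages.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical stand-in for Canonicalize: drop scheme and path, ensure 'www.'."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


def good_page_visitors(visits, pages, threshold=0.5):
    # Transform to (user, Canonicalize(url), time)
    canon = [(user, canonicalize(url), time) for user, url, time in visits]
    # Join url = url (assumes unique urls in pages)
    rank = dict(pages)
    joined = [(user, url, time, rank[url]) for user, url, time in canon if url in rank]
    # Group by user
    groups = defaultdict(list)
    for user, _url, _time, pagerank in joined:
        groups[user].append(pagerank)
    # Transform to (user, Average(pagerank) as avgPR)
    averaged = [(user, sum(prs) / len(prs)) for user, prs in groups.items()]
    # Filter avgPR > threshold
    return [(user, avg) for user, avg in averaged if avg > threshold]


visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "http://www.snails.com", "9am"),
    ("Fred", "www.snails.com/index.html", "11am"),
]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]
result = good_page_visitors(visits, pages)
```

On the slide's example data, only Amy survives the final filter, with an average pagerank of about 0.65.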
Naïve Algorithm
• Load Visits(user, url, time)
    (Amy, cnn.com, 8am)
    (Amy, http://www.snails.com, 9am)
    (Fred, www.snails.com/index.html, 11am)
• Load Pages(url, pagerank)
    (www.youtube.com, 0.9)
    (www.frogs.com, 0.4)
• Transform to (user, Canonicalize(url), time)
    (Amy, www.cnn.com, 8am)
    (Amy, www.snails.com, 9am)
    (Fred, www.snails.com, 11am)
• Join url = url
    (no example records: none of the sampled urls match)
• Group by user
• Transform to (user, Average(pagerank) as avgPR)
• Filter avgPR > 0.5
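The failure of naïve sampling is easy to reproduce. The sketch below joins the slide's canonicalized visits against the naïvely sampled pages and gets nothing, then applies one simple join-aware heuristic for contrast. The heuristic (`sample_pages_for_join`, a hypothetical name) is only an illustration of the idea and is not the algorithm behind Pig's ILLUSTRATE command.

```python
def join_on_url(visits, pages):
    """Inner join of (user, url, time) with (url, pagerank); assumes unique page urls."""
    rank = dict(pages)
    return [(user, url, time, rank[url]) for user, url, time in visits if url in rank]


def sample_pages_for_join(visits, all_pages):
    """Illustrative heuristic: keep only Pages rows whose url occurs in the sampled visits."""
    wanted = {url for _user, url, _time in visits}
    return [page for page in all_pages if page[0] in wanted]


canonical_visits = [
    ("Amy", "www.cnn.com", "8am"),
    ("Amy", "www.snails.com", "9am"),
    ("Fred", "www.snails.com", "11am"),
]
naive_pages = [("www.youtube.com", 0.9), ("www.frogs.com", 0.4)]
all_pages = naive_pages + [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]

naive_join = join_on_url(canonical_visits, naive_pages)
aware_join = join_on_url(canonical_visits, sample_pages_for_join(canonical_visits, all_pages))
```

With the naïve sample the join output is empty, so every downstream operator has no examples to show; the join-aware sample yields three joined records.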
Dataflow Illustrator
• Try to satisfy 3 objectives simultaneously:
  • Realism, conciseness, completeness
• Why it’s hard:
  • Large data set to comb through for “real” examples
  • Selective operators (e.g. filter, join)
  • Noninvertible operators (e.g. UDFs)
• Implemented in Pig as “ILLUSTRATE” command
“ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive.”
-- Russell Jurney, LinkedIn
Next: INSPECTOR GADGET (4, the instrumentation framework used during execution)
Motivated by User Interviews
• Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments)
• Asked them how they (wish they could) debug
Precept
• Add these features to Pig without modifying Pig or tampering with data flowing through Pig
Our Hammer: Pig Script Rewriting
• Instrument Pig script by inserting special “monitoring agent” UDFs between operators
• Each agent observes records flowing through
• Agents communicate “interesting observations” to a central coordinator
• Coordinator delivers findings to end user
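The agent and coordinator roles can be sketched with Python generators. This is an analogy under stated assumptions, not Inspector Gadget's code: in the real system the agents are UDFs inserted into a rewritten Pig script, whereas here `agent`, `Coordinator`, and `instrumented_pipeline` are hypothetical names wrapping an in-memory pipeline. Each agent passes records through unchanged and reports one observation (a record count) when its stream ends.

```python
class Coordinator:
    """Collects observations from agents for delivery to the end user."""

    def __init__(self):
        self.observations = []

    def report(self, agent_name, observation):
        self.observations.append((agent_name, observation))


def agent(name, records, coordinator):
    """Pass records through untouched; report how many were seen at end of stream."""
    count = 0
    for record in records:
        count += 1
        yield record  # the data itself is never modified
    coordinator.report(name, {"records_seen": count})


def instrumented_pipeline(rows, coordinator):
    # "Script rewriting": an agent is spliced in after each operator.
    loaded = agent("after_load", iter(rows), coordinator)
    filtered = (row for row in loaded if row[1] > 0.5)
    observed = agent("after_filter", filtered, coordinator)
    return list(observed)


coordinator = Coordinator()
output = instrumented_pipeline([("a", 0.9), ("b", 0.4), ("c", 0.7)], coordinator)
```

Because the agents only observe, the pipeline's output is identical to what the uninstrumented version would produce, which matches the precept of not tampering with the data.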
Instrumented Dataflow
[Diagram: a dataflow (load, load, filter, join, group, count, store) with IG agents inserted between the operators; the agents connect to the IG coordinator.]
Example: Crash Culprit Determination
[Diagram: the instrumented dataflow (load, load, filter, join, group, count, store) with IG agents and the IG coordinator.]
• Agent messages: phases 1 to n-1: tuple counts; phase n: tuples
• Coordinator state: phases 1 to n-1: maintain count lower bounds; phase n: maintain last-seen tuples
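The core idea, that the last tuple an agent saw before a crash is the likely culprit, can be sketched as follows. This is a deliberate simplification: it skips the count-only phases 1 to n-1 and reports every tuple in a single run. All names (`Coordinator`, `agent`, `parse_time`, `find_crash_culprit`) are hypothetical, and `parse_time` stands in for a user function that fails on malformed input.

```python
class Coordinator:
    """Remembers the last tuple each agent reported."""

    def __init__(self):
        self.last_seen = {}

    def saw(self, agent_name, record):
        self.last_seen[agent_name] = record


def agent(name, records, coordinator):
    for record in records:
        coordinator.saw(name, record)  # report before the record goes downstream
        yield record


def parse_time(record):
    """Stand-in for a user function; int() raises ValueError on a bad hour."""
    user, url, hour = record
    return (user, url, int(hour))


def find_crash_culprit(rows):
    coordinator = Coordinator()
    observed = agent("before_parse", iter(rows), coordinator)
    try:
        for record in observed:
            parse_time(record)
    except ValueError:
        # The tuple seen just before the crash is the suspect.
        return coordinator.last_seen.get("before_parse")
    return None


rows = [
    ("Amy", "www.cnn.com", "8"),
    ("Fred", "www.snails.com", "eleven"),
    ("Amy", "www.snails.com", "9"),
]
culprit = find_crash_culprit(rows)
```

Here the malformed Fred record is returned as the culprit; a run over clean data returns `None`.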
Example: Forward Tracing
[Diagram: the instrumented dataflow (load, load, filter, join, group, count, store). The IG coordinator sends tracing instructions to the IG agents, the agents send traced records back, and the coordinator reports traced records to the user.]
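Flow
[Diagram: the end user supplies a dataflow program + app. parameters to the application, which uses the IG driver library. The driver library launches instrumented dataflow run(s) on the dataflow engine runtime, where IG agents sit between the operators (load, load, filter, join, store) and communicate with the IG coordinator. Raw result(s) return to the driver library, and the application delivers the result to the end user.]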
Next: IBIS (5, the provenance metadata manager used after execution)
Ibis Motivation
[Diagram labels: ingestion; workflow manager, e.g. Nova; dataflow programming framework, e.g. Pig; distributed sorting & hashing, e.g. Map-Reduce; scalable file system, e.g. GFS; low-latency processor; serving; datum X; datum Y]
Question: what is the provenance of X?
Ibis Motivation
• Benefits:
  • Provide uniform view to users
  • Factor out metadata management code
  • Decouple metadata lifetime from data/subsystem lifetime
[Diagram: data processing sub-systems send metadata to Ibis, the integrated metadata manager; users send metadata queries and receive answers.]
Key Challenge: Many Granularities of Data and Processing Elements
• Data granularities: Table, Column group, Row, Column, Cell, Web page, Version, Snapshot
• Process granularities: Workflow, Pig script, Pig job, Pig logical operation, Pig physical operation, MR program, MR job, MR job phase, MR task, Task attempt
Provenance Graph Example
[Diagram labels]
• Data nodes: IMDB web page; Yahoo! Movies web page; IMDB extracted table; Y!Movies extracted table; combined extracted table; map output 1; map output 2
• Process nodes: extract pig script; merge pig script; pig job 1; pig job 2; map task 1, attempt 1; map task 2, attempt 1; reduce task 1, attempt 1
• Attribute annotations: version = 2, wrapper = imdb; version = 3, wrapper = yahoo
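A provenance question such as “what influenced this table?” amounts to a backward traversal of the graph. The sketch below is an illustration only: the node names come from the slide, but the edges are an assumption (the flattened slide text does not preserve them), process nodes are omitted for brevity, and `provenance` is a hypothetical helper rather than part of Ibis.

```python
from collections import deque

# ASSUMED "influences" edges between data nodes; the slide's actual edges are not recoverable.
influences = {
    "IMDB web page": ["IMDB extracted table"],
    "Yahoo! Movies web page": ["Y!Movies extracted table"],
    "IMDB extracted table": ["combined extracted table"],
    "Y!Movies extracted table": ["combined extracted table"],
}


def provenance(datum, edges):
    """Return every node that directly or transitively influences `datum`."""
    # Invert the edge map so we can walk from a datum back to its sources.
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    found, queue = set(), deque([datum])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in found:
                found.add(parent)
                queue.append(parent)
    return found


sources = provenance("combined extracted table", influences)
```

Under the assumed edges, the combined table traces back to both extracted tables and both web pages.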
IQL: Ibis Query Language
• Find all data rows that stem from version 3 of the “extract” pig script:
    select d2.*
    from PigScript p, PigJob j, AnyData d1, Row d2
    where p.id = "extract.pig" and j under p and j.version = 3
      and j emits d1 and d1 influences(2) d2;
• Resolve version conflicts based on source authority:
    select v.id, source.authScore
    from Version v, WebPage source
    where source influences(2) v
      and not exists (
        select *
        from Version v2, WebPage source2, (Row,Column) commonParent
        where source2 influences(2) v2
          and v under commonParent and v2 under commonParent
          and source2.authScore > source.authScore);
Summary
Storage & processing stack:
• workflow manager, e.g. Nova (Nova, 2)
• dataflow programming framework, e.g. Pig (Pig, 1)
• distributed sorting & hashing, e.g. Map-Reduce
• scalable file system, e.g. GFS
Debugging aids:
• Before: example data generator (Dataflow Illustrator, 3)
• During: instrumentation framework (Inspector Gadget, 4)
• After: provenance metadata manager (Ibis, 5)
Related Work
• Pig: DryadLINQ, Hive, Jaql, Scope, relational query languages
• Nova: BigTable, CBP, Oozie, Percolator, scientific workflow, incremental view maintenance
• Dataflow illustrator: [Mannila/Raiha, PODS’86], reverse query processing, constraint databases, hardware verification & model checking
• Inspector gadget: XTrace, taint tracking, aspect-oriented programming
• Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scientific workflows
Collaborators
Shubham Chopra, Anish Das Sarma, Alan Gates, Pradeep Kamath, Ravi Kumar, Shravan Narayanamurthy, Olga Natkovich, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Andrew Tomkins