
Nova: Continuous Pig/Hadoop Workflows


ralph

Presentation Transcript


  1. Nova: Continuous Pig/Hadoop Workflows

  2. Storage & processing layers
     • Workflow manager, e.g. Nova
     • Dataflow programming framework, e.g. Pig
     • Distributed sorting & hashing, e.g. Map-Reduce
     • Scalable file system, e.g. HDFS

  3. Nova Overview
     • Nova: a system for batched incremental processing.
     • Scenarios at Yahoo:
       - Ingesting and analyzing user behavior logs
       - Building and updating a search index from a stream of crawled web pages
       - Processing semi-structured data (news, blogs, etc.)
     • Two-layer programming model (Nova over Pig)
     • Continuous processing
     • Independent scheduling
     • Cross-module optimization
     • Manageability features

  4. Continuous Processing
     - Nova: the outer workflow manager layer. It deals with graphs of interconnected Pig programs, with data passing between them in a continuous fashion.
     - Pig/Hadoop: the inner layer. It merely transforms static input data into static output data.
     Nova keeps track of “delta” data and routes it to the workflow components in the right order.
     (Diagram labels: Delta, Input, Output)

  5. Independent Scheduling
     Different portions of a workflow may be scheduled at different times and rates.
     - Global link analysis algorithms may be run only occasionally, because they are costly and consumers tolerate some staleness.
     - The components that ingest, tag, and index new news articles need to operate continuously.

  6. Cross-module optimization
     • Nova can identify and exploit certain optimization opportunities, e.g.:
       - Two components read the same input data at the same time.
       - Pipelining: the output of one module is the input of the subsequent module, which avoids materializing the intermediate result.
     Manageability features
     • Manage workflow programming and execution.
     • Support debugging and keep track of versions of workflow components.
     • Capture data sources and emit notifications of key events.

  7. Workflow Model
     • Workflow
       - Two kinds of vertices: tasks (processing steps) and channels (data containers).
       - Edges connect tasks to channels and vice versa.
     • [Task] Consumption mode:
       - ALL: read a complete snapshot.
       - NEW: read only the data that is new since the last invocation.
     • [Task] Production mode:
       - B: emit a new complete snapshot.
       - Delta: emit new data that augments any existing data.

  8. Workflow Model
     • [Task] Four common patterns of processing
       - Non-incremental (template detection): process data from scratch every time.
       - Stateless incremental (shingling): process new data only; each data item is handled independently.
       - Stateless incremental with lookup table (template tagging): process new data independently, possibly consulting a side lookup table for reference.
       - Stateful incremental (de-duping): process new data while maintaining and referencing some state derived from the prior input data.
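To make the difference between the stateless and stateful incremental patterns concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: in Nova each task is a Pig program, and the names `stateless_incremental` and `StatefulDeduper` are hypothetical, not part of Nova or Pig.

```python
def stateless_incremental(new_records, fn):
    """Stateless incremental: each new record is processed on its own."""
    return [fn(record) for record in new_records]


class StatefulDeduper:
    """Stateful incremental: keeps state (the records already seen)
    across invocations and consults it when new data arrives."""

    def __init__(self):
        self.seen = set()  # state carried over from prior input

    def process(self, new_records):
        fresh = []
        for record in new_records:
            if record not in self.seen:
                self.seen.add(record)
                fresh.append(record)
        return fresh


deduper = StatefulDeduper()
first_batch = deduper.process(["a", "b", "a"])   # duplicates inside the batch are dropped
second_batch = deduper.process(["b", "c"])       # "b" was seen in the earlier batch
upper = stateless_incremental(["ab", "cd"], str.upper)
```

The stateless function could be rerun on any batch in isolation, whereas the de-duper gives the right answer only if it sees every batch and keeps its state between invocations.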

  9. Workflow Model (Cont.)
     • Data and Update Model
       - Blocks: a channel’s data is divided into blocks, which vary in size.
       - Blocks are atomic units (a block is either processed entirely or discarded).
       - Blocks are immutable.
     • Base block
       - Contains a complete snapshot of data on a channel as of some point in time.
       - Base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn).
     • Delta block
       - Used in conjunction with incremental processing.
       - Contains instructions for transforming a base block into a new base block.

  10. Workflow Model (Cont.)
     • Data and Update Model
       - Operators:
         Merging: combine a base block and a delta block into a new base block.
         Diffing: compare two base blocks to create a delta block.
         Chaining: combine multiple delta blocks.
       - Upsert model: leverages the presence of a primary key attribute to encode updates and inserts in a uniform way. With upserts, delta blocks consist of records to be inserted, each one displacing any pre-existing record with the same key, so only the most recent record with a given key is retained.
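The three operators are easy to illustrate under the upsert model. The sketch below assumes a block can be modeled as a Python dict from primary key to record; that representation and the function names `merge`, `diff`, and `chain` are assumptions made for illustration, not Nova's actual implementation. Note that a pure upsert delta cannot express deletions, so `diff` assumes no key disappears between the two base blocks.

```python
def merge(base, delta):
    """Merging: apply a delta block to a base block, giving a new base block.
    Records in the delta displace records with the same key."""
    merged = dict(base)   # blocks are immutable, so work on a copy
    merged.update(delta)
    return merged


def diff(old_base, new_base):
    """Diffing: the upserts needed to turn old_base into new_base.
    Assumes every key of old_base is still present in new_base."""
    return {key: rec for key, rec in new_base.items()
            if key not in old_base or old_base[key] != rec}


def chain(*deltas):
    """Chaining: combine consecutive delta blocks into one.
    Later deltas win when the same key appears more than once."""
    combined = {}
    for delta in deltas:
        combined.update(delta)
    return combined


b0 = {"url1": "v1", "url2": "v1"}
d01 = {"url2": "v2"}                 # update of an existing key
d12 = {"url3": "v1", "url2": "v3"}   # one insert and one further update
b2_stepwise = merge(merge(b0, d01), d12)
b2_chained = merge(b0, chain(d01, d12))
d02 = diff(b0, b2_stepwise)
```

Applying the deltas one at a time and applying their chained form give the same base block, which is what lets a consumer that fell behind catch up with a single merge.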

  11. Workflow Model (Cont.)
     • Task/Data Interface:
       - [Task] Consumption mode:
         ALL: read a complete snapshot.
         NEW: read only the data that is new since the last invocation.
       - [Task] Production mode:
         B: emit a new complete snapshot.
         Delta: emit new data that augments any existing data.
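The two consumption modes can be sketched with a per-task cursor on a channel. This is a simplified illustration, not Nova's implementation: it assumes an append-only channel whose snapshot is just the concatenation of all its blocks, and the `Channel` class and its method names are hypothetical.

```python
class Channel:
    """A data container holding immutable blocks, plus one cursor per task."""

    def __init__(self):
        self.blocks = []    # blocks in arrival order
        self.cursors = {}   # task name -> number of blocks already consumed

    def append(self, records):
        self.blocks.append(tuple(records))  # tuples keep blocks immutable

    def read(self, task, mode):
        if mode == "ALL":
            selected = self.blocks                              # complete snapshot
        elif mode == "NEW":
            selected = self.blocks[self.cursors.get(task, 0):]  # data since last invocation
        else:
            raise ValueError(f"unknown consumption mode: {mode}")
        self.cursors[task] = len(self.blocks)  # advance this task's cursor
        return [record for block in selected for record in block]


channel = Channel()
channel.append([1, 2])
first_new = channel.read("tagger", "NEW")
channel.append([3])
second_new = channel.read("tagger", "NEW")
snapshot = channel.read("tagger", "ALL")
other_task = channel.read("indexer", "NEW")  # a task that has never read before
```

Each task keeps its own cursor, so a task that runs rarely simply receives a larger batch of new data when it is finally invoked.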

  12. Workflow Model (Cont.)
     • Workflow Programming and Scheduling
       - Workflow programming starts with task definitions, which are then composed into “workflowettes”.
       - Workflowettes have ports to which input and output channels may be connected.
       - A workflowette with channels attached to its input and output ports is a bound workflowette.
       - Three types of trigger can be associated with a workflowette:
         Data-based trigger.
         Time-based trigger.
         Cascade trigger.

  13. Workflow Model (Cont.)
     • Data blocks are immutable and channels accumulate data blocks, so channels can grow without bound.
     • Data Compaction and Garbage Collection
       - If a channel has base block B0 followed by the delta blocks leading up to sequence number 3, the compaction operation computes B3 and adds it to the channel.
       - Once compaction has added B3 to the channel and the current cursor is at sequence number 2, B0 and the two delta blocks leading up to sequence number 2 can be garbage-collected.
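The compaction and garbage-collection example above can be traced in a short Python sketch. It rests on several assumptions that are not in the slides: blocks carry upsert records as key/value pairs, there is a single consumer cursor, and a delta block becomes collectable once the sequence number it ends at is no greater than that cursor. The `Block` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str        # "base" or "delta"
    start: int       # sequence number the block starts from (== end for a base)
    end: int         # sequence number the block brings the channel to
    records: tuple   # (key, value) pairs with upsert semantics


def compact(blocks):
    """Fold the newest base block and the deltas after it into a new base."""
    base = max((b for b in blocks if b.kind == "base"), key=lambda b: b.end)
    state = dict(base.records)
    seq = base.end
    for delta in sorted((b for b in blocks if b.kind == "delta"),
                        key=lambda b: b.start):
        if delta.start == seq:           # only deltas that continue the chain
            state.update(delta.records)
            seq = delta.end
    return Block("base", seq, seq, tuple(sorted(state.items())))


def garbage_collect(blocks, cursor):
    """Drop superseded base blocks and deltas the consumer has already passed."""
    newest_base = max(b.end for b in blocks if b.kind == "base")
    kept = []
    for block in blocks:
        if block.kind == "base" and block.end < newest_base:
            continue                     # an older snapshot is no longer needed
        if block.kind == "delta" and block.end <= cursor:
            continue                     # the consumer is already past this delta
        kept.append(block)
    return kept


channel = [
    Block("base", 0, 0, (("a", 1),)),
    Block("delta", 0, 1, (("b", 2),)),
    Block("delta", 1, 2, (("a", 5),)),
    Block("delta", 2, 3, (("c", 7),)),
]
b3 = compact(channel)
remaining = garbage_collect(channel + [b3], cursor=2)
```

With the cursor at sequence number 2, the last delta block is kept because the consumer still needs it to move from 2 to 3; with several consumers, the smallest cursor would take the place of `cursor`.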

  14. Tying the model to Pig/Hadoop
     • Each data block resides in an HDFS file. Metadata maintains the mapping from blocks to files.
     • The notion of a channel exists only in the metadata.
     • Each task is a Pig program.
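A small sketch may help show what it means for a channel to exist only in metadata: the files themselves are flat, and grouping them into a channel is purely a lookup. The `BlockMetadata` class, the block identifiers, and the file paths below are all hypothetical; the slides do not describe how Nova stores this mapping.

```python
class BlockMetadata:
    """Maps (channel, block id) to the file that holds the block."""

    def __init__(self):
        self._files = {}  # (channel name, block id) -> file path

    def register(self, channel, block_id, path):
        key = (channel, block_id)
        if key in self._files:
            # Blocks are immutable, so an existing block is never re-pointed.
            raise ValueError(f"block {block_id} already registered on {channel}")
        self._files[key] = path

    def files_for(self, channel):
        """All files of a channel, in registration order."""
        return [path for (name, _), path in self._files.items()
                if name == channel]


meta = BlockMetadata()
meta.register("news", "B0", "/example/news/B0")        # made-up paths
meta.register("news", "D0_1", "/example/news/D0_1")
meta.register("pages", "B0", "/example/pages/B0")
news_files = meta.files_for("news")
```

A task reading the "news" channel would be handed these file paths as its input; nothing in the file system itself records that they belong together.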

  15. Tying the model to Pig/Hadoop
     • Each data block resides in an HDFS file. Metadata maintains the mapping from blocks to files.
     • The notion of a channel exists only in the metadata.

  16. Nova System Architecture
