Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012
Joint Work With Colleagues… • Raghu Ramakrishnan (at Yahoo and now at Microsoft) • Adam Silberstein (at Yahoo and now at LinkedIn) • Mike Ovsiannikov and Damian Reeves (at Quantcast)
Motivation • “Big data” is a booming industry: • Collect massive amounts of data (10’s of TB/day) • Use data-intensive compute frameworks (Hadoop, Cosmos, Map-Reduce) to extract value from the collected data • The volume of data processed has become a point of bragging rights • How do frameworks handle data at scale? • Not well studied in the literature
M/R Dataflow (diagram: map tasks read job input from the DFS and reducers write job output back to the DFS)
Disk Overheads • Intermediate data transfer is seek intensive => I/Os are small/random • # of disk seeks for transferring intermediate data is proportional to M * R
Why Is Scale Important? • Yahoo cluster workload characteristics: • Vast majority of jobs (about 95%) are small • A minority of jobs (about 5%) are big • Involve 1000’s of tasks that are run on many machines in the cluster • Run for several hours processing TB’s of data • Size of intermediate data (i.e., map output) is at least as big as the input
Can We Minimize Seeks? • Problem space: Size of intermediate data exceeds amount of RAM in cluster • Reducer reads data from disk => one seek per reducer • Lower bound is proportional to R
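A back-of-the-envelope calculation makes the gap concrete. This is a minimal Python sketch using the job size quoted later in the evaluation (64TB of intermediate data, 1GB of input per map task, 512 partitions); the constants are illustrative, not a general claim.

    # Seek counts for shuffling 64TB of intermediate data (illustrative numbers).
    M = (64 * 2**40) // (1 * 2**30)   # 65536 map tasks at 1GB of input each
    R = 512                           # reduce partitions
    seeks_stock = M * R               # one small random I/O per (map, reduce) pair: ~33.5M seeks
    seeks_lower_bound = R             # ideal: one large sequential read per reducer: 512 seeks
    print(seeks_stock, seeks_lower_bound)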
Solving A Seek Problem… • Minimizing disk seeks via “group commit” is a well-known technique • Why isn’t this idea implemented? • Difficult to implement in the past • Datacenter bandwidth is a contended resource • Any solution that mentioned the “network” was the beginning of a (futile) negotiation
What Is New… • Network b/w in a datacenter is going up… • Lower “over-subscription” • 1/5/10-Gbps between any pair of nodes • Can we leverage this trend to do distributed aggregation and improve disk performance? • Systems built on this trend are being explored: • Flat Datacenter Storage (OSDI’12): Blob store • ThemisMR (SOCC’12): M/R at scale
Key Ideas • I-files, a network-wide data aggregation mechanism • Observe intermediate data in I-files during the map phase to plan the reduce phase (diagram: map output buffered in RAM is aggregated into I-files)
Our Work • Build I-files by extending a DFS • Build Sailfish (by modifying Hadoop) in which I-files are used to transport intermediate data • Leverage I-files to gather statistics on intermediate data to plan the reduce phase: • (1) # of reducers depends on the data, (2) handle skew • Eliminate tuning parameters • No more map-side tuning or choosing # of reducers • Results show 20% to 5x speedup on a representative mix of (large) real jobs/datasets at Yahoo
Talk Outline • Motivation • I-files: A data aggregation mechanism • Sailfish: Map-Reduce using I-files • Experimental Evaluation • Summary and On-going work
Using I-files for Intermediate Data • I-files are a container for data aggregation in general • Per I-file aggregator: • Buffers data from writers in RAM • “Group commits” data to disk • # of disk seeks is proportional to R (diagram: writers send records to a per-I-file aggregator, which buffers them in RAM and commits them to the I-file on disk)
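To make the “group commit” behavior concrete, here is a minimal Python sketch of a per-I-file aggregator (not the actual chunkserver code): many writers append records, the aggregator buffers them in RAM, and data hits disk in one large sequential write. The 64MB buffer size and the record framing are assumptions for illustration.

    import threading

    class Aggregator:
        """Minimal sketch of a per-I-file aggregator using group commit."""
        def __init__(self, path, flush_bytes=64 << 20):   # hypothetical 64MB RAM buffer
            self.path = path
            self.flush_bytes = flush_bytes
            self.buf, self.buffered = [], 0
            self.lock = threading.Lock()

        def record_append(self, key, value):
            # Frame the record as <key length, key, value length, value>.
            rec = len(key).to_bytes(4, "big") + key + len(value).to_bytes(4, "big") + value
            with self.lock:                                # serialize concurrent appenders
                self.buf.append(rec)
                self.buffered += len(rec)
                if self.buffered >= self.flush_bytes:
                    self._flush()

        def _flush(self):
            with open(self.path, "ab") as f:               # one large sequential write
                f.write(b"".join(self.buf))
            self.buf, self.buffered = [], 0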
Issues • Fault-tolerance • Scale • Skew: Suppose there is skew in the data written to I-files • Hot-spots: • Suppose a partition becomes hot • All map tasks generate data for that partition
“Scale out” Aggregation • Build using distributed aggregation (scale out) • Rather than 1 aggregator per I-file, use multiple • Bind a subset of mappers to each aggregator (diagram: multiple aggregators feeding a single I-file)
What Does This Get? • Fault-tolerance • When an aggregator fails, need to re-run the subset of maps that wrote to that aggregator • Re-run in parallel… • Mitigates skew and hot-spots, better scale • Seeks: goes up (!) • Reducer input is now stored at multiple aggregators • Read from multiple places • Will come back to this issue…
I-files Design -> Implementation • Extend a DFS to support data aggregation • Use KFS in our work (KFS ≅ HDFS) • Single metaserver (≅ HDFS NameNode) • Multiple chunkservers (≅ HDFS DataNodes) • Files are striped across nodes in chunks • Chunks can be variable in size, up to a fixed maximum • Currently, the max size of a chunk is 128MB • Adapt KFS to support I-files • Leverage the multi-writer capabilities of KFS
KFS I-files Characteristics • I-files provide a record-oriented interface • Append-only • Clients append records to an I-file: record_append(fd, <key, value>) • An append on a file translates to an append on a chunk • Records do not span chunk boundaries • The chunkserver is the aggregator • Supports data retrieval by key: • scan(fd, <key range>)
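The call pattern can be pictured as below. kfs_ifile is a hypothetical Python-style binding passed in as a parameter, used only to make the shape of record_append()/scan() concrete; the real interface is the KFS client library described above.

    def map_side_write(kfs_ifile, records):
        """Append map output records to an I-file (kfs_ifile is a hypothetical binding)."""
        fd = kfs_ifile.open("/ifiles/partition-42", mode="append")   # append-only
        for key, value in records:
            kfs_ifile.record_append(fd, key, value)                  # one record per call
        kfs_ifile.close(fd)

    def reduce_side_read(kfs_ifile, key_range):
        """Retrieve a reducer's key range from the same I-file, across all of its chunks."""
        fd = kfs_ifile.open("/ifiles/partition-42", mode="read")
        records = list(kfs_ifile.scan(fd, key_range))                # retrieval by key
        kfs_ifile.close(fd)
        return records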
Appending To KFS I-files • A map task asks the KFS metaserver to allocate a chunk, binds to that chunk, and then calls record_append() • Multiple appenders per chunk • Multiple chunks appended to concurrently (diagram: groups of map tasks appending to Chunk1 and Chunk2)
KFS I-files • An I-file is constructed via sequential I/O in a distributed manner • Network-wide batching • Multiple appenders per chunk • Multiple chunks appended to concurrently • On a per-chunk basis, the chunkserver responsible for that chunk is the aggregator • The chunkserver aggregates records and commits them to disk • Append is atomic: the chunkserver serializes concurrent appends
Distributed Aggregation With KFS I-files • Design: Bind a subset of mappers to an aggregator • Use multiple aggregators per I-file • Minimize the # of aggregators per I-file • Implementation: Multiple writers per chunk • Multiple chunks appended to concurrently • # of chunks per I-file scales based on data • Chunk allocation is key: need to pack data into as few chunks as possible
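Since the seek count later depends on the number of chunks per I-file, packing appends into as few chunks as possible matters. The sketch below illustrates that packing rule under the 128MB maximum chunk size mentioned earlier; the chunk object and alloc_new_chunk callback are illustrative names, not the actual metaserver allocation code.

    CHUNK_MAX = 128 << 20              # 128MB maximum chunk size

    def choose_chunk(open_chunks, record_len, alloc_new_chunk):
        """Pick a chunk with room for the record; records never span chunk boundaries."""
        for c in open_chunks:
            if c.size + record_len <= CHUNK_MAX:
                return c               # pack into an existing chunk whenever it fits
        c = alloc_new_chunk()          # otherwise ask the metaserver for a new chunk
        open_chunks.append(c)
        return c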
Talk Outline • Motivation • I-files: A data aggregation mechanism • Sailfish: Map-Reduce using I-files • Experimental Evaluation • Summary and On-going work
Sailfish: MapReduce Using I-files • Modify Hadoop-0.20.2 to use I-files • Mappers append their output to per-partition I-files using record_append() • Map output is appended concurrently with task execution • During the map phase, gather statistics on intermediate data and plan the reduce phase • # of reducers and task assignment are determined at run-time • A reducer scans its input from a per-partition I-file • Merges records from chunks and calls reduce() • For an efficient scan(), sort and index each I-file chunk
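The map-side change amounts to routing each output record to its partition's I-file instead of a local spill file. This is a simplified sketch, not the modified Hadoop code; the ifiles handles are assumed to wrap record_append().

    def emit(key, value, ifiles):
        """Route one map output record to its partition's I-file."""
        p = hash(key) % len(ifiles)            # same hash partitioning as stock Hadoop
        ifiles[p].record_append(key, value)    # appended concurrently with map execution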
Sailfish Dataflow (diagrams: map tasks append intermediate data to I-file chunks on chunkservers, sorters sort the chunks in place, and reducers scan the sorted chunks and write output to the DFS)
Leveraging I-files • Gather statistics on intermediate data whenever a chunk is sorted • Statistics are gathered during the map phase as part of normal execution • During sorting, each chunk is augmented with an index • The index supports efficient scans
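A minimal sketch of the per-chunk sort-and-index step: the sorter orders records by key and records the first key of every block, so that scan() can jump directly to a key range. The sparse-index granularity and record representation are illustrative assumptions.

    def sort_and_index(records, block_records=1024):
        """Sort a chunk's (key, value) records and build a sparse index for scan()."""
        records.sort(key=lambda kv: kv[0])
        index = [(records[i][0], i) for i in range(0, len(records), block_records)]
        return records, index          # sorted chunk plus the index stored alongside it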
Reduce Phase Implementation • Plan the reduce phase based on the data: • # of reduce tasks per I-file = Size of I-file / Work per task • # of tasks scales with the data; handles skew • On a per I-file basis, partition the key space by constructing “split points” • Each reduce task processes a range of keys within an I-file • “Hierarchical partitioning” of the data in an I-file
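A minimal sketch of this data-dependent planning step; the work_per_task target and the sampled-key statistics format are assumptions for illustration only.

    def plan_reduce_phase(ifile_bytes, sampled_keys, work_per_task=1 << 30):
        """Pick the # of reduce tasks for an I-file and the key split points between them."""
        num_tasks = max(1, ifile_bytes // work_per_task)   # scales with the data; absorbs skew
        keys = sorted(sampled_keys)
        step = max(1, len(keys) // num_tasks)
        splits = [keys[i] for i in range(step, len(keys), step)][: num_tasks - 1]
        return num_tasks, splits                           # each task processes one key range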
Sailfish: Reduce Phase • Objectives: Avoid specifying the # of reducers at job submission time • Handle skew • Auto-scale • Implementation: Gather statistics on I-files to plan the reduce phase • # of reduce tasks is determined at run-time in a data-dependent manner • Hierarchical scheme: partition map output into a large number of I-files, then assign key-ranges within an I-file to reduce tasks • Reducers per I-file is data dependent
How Many Seeks? • Goal: # of seeks proportional to R • With Sailfish, • A reducer reads input from all chunks of a single I-file • Suppose that each I-file has c chunks • # of seeks during reads is proportional to c * R • # of seeks during appends is also proportional to c * R • But sorters also cause seeks • # of seeks during sorting is proportional to 2 * c * R • Packing data into as few chunks as possible is critical for I-file effectiveness
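Plugging in the numbers from the evaluation that follows (512 I-files, roughly 1024 chunks per I-file) shows why packing matters; a quick illustrative calculation:

    R = 512                       # I-files (partitions)
    c = 1024                      # chunks per I-file: (64TB / 512) / 128MB
    append_seeks = c * R          # writing the chunks
    sort_seeks = 2 * c * R        # each chunk is read and rewritten by a sorter
    read_seeks = c * R            # reducers read every chunk of their I-file
    total = append_seeks + sort_seeks + read_seeks   # ~2.1M seeks, vs M * R ~= 33.5M for stock Hadoop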
Talk Outline • Motivation • I-files: A data aggregation mechanism • Sailfish: Map-Reduce using I-files • Experimental Evaluation • Summary and On-going work
Experimental Evaluation • Cluster comprises ~150 machines (5 racks) • 2008-vintage machines • 8 cores, 16GB RAM, 4 x 750GB drives/node • 1Gbps between any pair of nodes • Used lzo compression for handling intermediate data (for both Hadoop and Sailfish) • Evaluations involved: • A synthetic benchmark that generates its own data • Actual jobs/data run at Yahoo
How Many Seeks… • With stock Hadoop, the # of seeks is proportional to M * R • # of map tasks generating 64TB of data: M = (64TB / 1GB per map task) = 65536 • With Sailfish, it is proportional to c * R • 64TB of intermediate data split over 512 I-files • Chunks per I-file: ((64TB / 512) / 128MB) = 1024 • In practice, c varies from 1032 to 1048 • Results show that chunks are packed: the chunk allocation policy works well in practice
Sailfish In Practice • Use actual jobs+datasets that are used in production
Handling Skew In Reducer Input (results for the LogRead job)
Fault-tolerance • Sailfish handles the (temporary) loss of intermediate data via recomputes • Bookkeeping tracks the map tasks that wrote to the lost block, and those tasks are re-run • “Scale out” mitigates the impact of data loss • 15% increase in run-time for a run with failure • Described in detail in the paper…
Related Work • ThemisMR (SOCC’12) addresses the same problem as Sailfish • Does not (yet) support fault-tolerance: the design space is small clusters where failures are rare • Design requires reducer input to fit in RAM • [Starfish] Parameter tuning (for Hadoop) • Construct a job profile and use it to tune Hadoop parameters • Gains are limited by Hadoop’s intermediate data handling mechanisms • A lot of work in the DB literature on handling skew • Run the job on a sample of the input and collect statistics to construct partition boundaries • Use the statistics to drive the actual run
Summary • Explore the idea of network-wide aggregation to improve disk subsystem performance • Develop I-files as a data aggregation construct • Implement I-files in KFS (a distributed filesystem) • Use I-files to build Sailfish, an M/R infrastructure • Sailfish improves job completion times: 20% to 5x
On-going Work • Extending Sailfish to support elasticity/preemption (Amoeba, SOCC’12) • Working on integrating many of the core ideas in Sailfish into Hadoop 2.x (aka YARN) • Work started at Yahoo! Labs • Being continued in CISL@Microsoft • http://issues.apache.org/jira/browse/MAPREDUCE-4584 • http://issues.apache.org/jira/browse/YARN-45
Software Available • Sailfish released as open source project • http://code.google.com/p/sailfish