I-Files: Handling Intermediate Data In Parallel Dataflow Graphs Sriram Rao November 2, 2011
Joint Work With… • Raghu Ramakrishnan, Adam Silberstein: Yahoo Labs • Mike Ovsiannikov, Damian Reeves: Quantcast
Motivation • Massive growth in online advertising (read…display ads) • Companies are reacting to this opportunity via behavioral ad-targeting • Collect click-stream logs, mine the data, build models, show ads • “Petabyte scale data mining” using computational frameworks (such as Hadoop and Dryad) is commonplace • Analysis of Hadoop job history logs shows: • Over 95% of jobs are small (run for a few minutes, process small data) • About 5% of jobs are large (run for hours, process big data)
Where have my cycles gone? 5% of jobs take 90% of cycles!
Who is using my network? 5% of jobs account for 99% of network traffic!
So… • Analysis shows 5% of the jobs are “big”: • 5% of jobs use 90% of cluster compute cycles • 5% of jobs shuffle 99% of the data (i.e., 99% of network bandwidth) • To improve cluster performance, improve M/R performance for large jobs • Faster, faster, faster: virtuous cycle • Cluster throughput goes up • Users will run bigger jobs • Our work: Focus on handling intermediate data at scale in parallel dataflow graphs
Handling Intermediate Data in M/R • In an M/R computation, map output is intermediate data • For transferring intermediate data from map to reduce: • Map tasks generate data and write it to disk • When a reduce task pulls map output: • Data has to be read from disk • Transferred over the network • Cannot assume that mappers/reducers can be scheduled concurrently • Transporting intermediate data: • Intermediate data size < RAM size: RAM masks disk I/O • Intermediate data size > RAM size: Cache hit rate masks disk I/O • Intermediate data size >> RAM size: Disk overheads affect performance
Handling Intermediate data at scale • Intermediate Data Transfer: Distributed Merge Sort • # of disk seeks for transferring intermediate data ∝ M * R • Avg. amount of data reducer pulls from a mapper ∝ 1 / R • [Diagram: Map tasks write to the Distributed File System and Reduce tasks pull from it, incurring M * R disk seeks]
Disk Overheads (More detail) • “Fix” the amount of data generated by a map task • Size RAM such that the map output fits in memory and can be sorted in 1 pass • For example, use 1GB • “Fix” the amount of data consumed by a reduce task • Size RAM for a 1-pass merge • For example, use 1GB • Now… • For a job with 1TB of data • 1024 mappers generate 1GB each; 1024 reducers consume 1GB each • On average, data generated by a map for a given reducer = 1GB / 1024 = 1MB • For a job with 16TB of data • 16K mappers generate 1GB each; 16K reducers consume 1GB each • On average, data generated by a map for a given reducer = 1GB / 16K = 64KB • With scale, # of seeks increases; data read/seek decreases
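This scaling effect can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only (it is not code from the talk); it assumes 1GB of intermediate data per map task and per reduce task, as in the example above, and prints how the seek count grows while the data moved per seek shrinks.

```python
# Back-of-the-envelope sketch: how seek count and data-per-seek change with
# job size, assuming 1 GB of intermediate data per mapper and per reducer.
GB = 1 << 30
PER_TASK = 1 * GB   # data generated by each mapper / consumed by each reducer

def shuffle_costs(total_intermediate_bytes):
    m = total_intermediate_bytes // PER_TASK   # number of map tasks
    r = total_intermediate_bytes // PER_TASK   # number of reduce tasks
    seeks = m * r                              # one seek per (mapper, reducer) pair
    data_per_seek = PER_TASK // r              # map output destined for one reducer
    return m, seeks, data_per_seek

for tb in (1, 16, 64):
    m, seeks, per_seek = shuffle_costs(tb * 1024 * GB)
    print(f"{tb:>2} TB job: M = R = {m:,}, seeks = {seeks:,}, data/seek = {per_seek // 1024} KB")
```

For a 1TB job this prints roughly 1M seeks at 1MB per seek; at 16TB it is ~268M seeks at 64KB per seek, matching the slide's arithmetic.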
Disk Overheads • As the volume of intermediate data scales: • Amount of data read per seek decreases • # of disk seeks increases non-linearly • Net result: Job performance will be affected by the disk overheads in handling intermediate data • Intermediate data increases by 2x • Job-run time increases by 2.5x
What is new? • [Diagram: Map tasks append to I-Files in the Distributed File System via a network-wide merge; Reduce tasks read them back with fewer seeks] • One intermediate file per reducer, instead of one per mapper
Our work • New approach for efficient handling of intermediate data at large scale • Minimize the number of seeks • Maximize the amount of data read/written per seek • Primarily geared towards LARGE M/R jobs: • 10’s of TB of intermediate data • 1000’s of mapper/reducer tasks • I-files: Filesystem support for intermediate data • Atomic record append primitive that allows write parallelism at scale • Network-wide batching of intermediate data • Built Sailfish (by modifying Hadoop-0.20.2), where intermediate data is transported using I-files
Talk Outline • Properties of Intermediate data • I-files implementation • Sailfish: M/R implementation that uses I-files for intermediate data • Experimental Evaluation • Summary
Organizing Intermediate Data • Hadoop organizes intermediate data in a format convenient for the mapper • What if we went the opposite way: organize it in a format convenient for the reducer? • Mappers write their output to a per-partition I-file • Data destined for a reducer is in a single file • Build the intermediate data file in a manner that is suitable for the reader rather than the writer
Intermediate data • Reducer input is generated by multiple mappers • The file is a container into which mapper output needs to be stored • Write order is k1, k2, k3, k4 • Processing order is k3, k1, k4, k2 • Because the reducer imposes the processing order, the writer does not care where the output is stored in the file • Once a mapper emits a record, the output is committed • There is no “withdraw” • [Diagram: multiple map tasks (M) append keys k1–k4 to a single file in arbitrary order; the reducer (R) processes them in its own order]
Properties of Intermediate data file • Multiple mappers generate data that will be consumed by a single reducer • Need low-latency multi-writer support • Writers are doing append-only writes • Contents of the I-file are never overwritten • Arbitrary interleaving of data is ok: • Writer does not care where the data goes in the file • Any ordering we need can be done post facto • No ordering guarantees for the writes from a single client • Follows from arbitrary interleaving of writes
Atomic Record Append • Multi-writer support => need an atomic primitive • Intermediate data is append-only…so, need atomic append • With the atomic record append primitive, clients provide just the data; the server chooses the offset, so writes interleave arbitrarily • In contrast, in a traditional write, clients provide data + offset • Since the server is choosing the offset, the design is lock-free • To scale atomic record append with the number of writers, allow: • Multiple writers to append to a single block of the file • Multiple blocks of the file to be concurrently appended to
Atomic Record Append • [Diagram: Client1 and Client2 issue atomic record appends <A, offset = -1> and <B, offset = -1>; the server chooses the placement, laying out records B, A, C, D at offsets 300, 350, 400, 500 within the chunk, and returns the assigned offset (e.g., Offset = 300) to the writer]
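A minimal sketch of the append path, under stated assumptions: record_append is a hypothetical client-facing call (not the actual KFS API), and a single in-process object stands in for the chunk master. It illustrates the point on the slide: clients supply only data, and the server picks the offset.

```python
import threading

# Hypothetical sketch of server-side atomic record append (not actual KFS code).
# The chunk master owns the next free offset of the chunk it serializes;
# writers never name an offset and never coordinate with each other.

class ChunkMaster:
    def __init__(self, chunk_size=128 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.next_offset = 0
        # This local mutex just models the chunk master serializing appends to
        # one chunk; there is no locking among the distributed writers.
        self._mu = threading.Lock()

    def record_append(self, data: bytes):
        """Append data at a server-chosen offset; return the offset, or None
        if the record no longer fits (the caller then retries on a new chunk)."""
        with self._mu:
            if self.next_offset + len(data) > self.chunk_size:
                return None
            offset = self.next_offset
            self.next_offset += len(data)
        # Next (not shown): forward (offset, data) to all replicas; the ack to
        # the client means the data sits in volatile memory at every replica.
        return offset

master = ChunkMaster()
print(master.record_append(b"record-A"))   # e.g. 0
print(master.record_append(b"record-B"))   # e.g. 8
```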
Implementing I-files • We have implemented I-files in the context of the Kosmos distributed filesystem (KFS) • Why KFS? • KFS has multi-writer support • We have designed/implemented/deployed KFS to manage PB’s of storage • KFS is similar to GFS/HDFS • Chunks are striped across nodes and replicated for fault-tolerance • The chunk master serializes all writes to a chunk • For atomic append, the chunk master assigns the offset • With KFS I-files, multiple chunks of the I-file can be concurrently modified
Atomic Record Append • Writers are routed to a chunk that is open for writing • For scale, limit the # of concurrent writers to a chunk • When a client gets an ack back from the chunk master, the data is replicated in volatile memory at all the replicas • Chunkservers are free to commit data to disk asynchronously • Eventually, the chunk is made stable • Data is committed to disk at all the replicas • Replicas are byte-wise identical • Stable chunks are not appended to again
Talk Outline • Properties of Intermediate data • I-files implementation • Sailfish: M/R implementation that uses I-files for intermediate data • Experimental Evaluation • Summary
The Elephant Can Dance… • [Diagram: the Hadoop shuffle pipeline vs. the Sailfish shuffle pipeline, from map() through (de)serialization to reduce()]
Sailfish Overview • Modify Hadoop-0.20.2 to use I-files for MapReduce • Mappers write their output to a per-partition I-file • Replication factor of 1 for all the chunks of an I-file • At-least-once semantics for append; filter dups on the reduce side • Data destined for a reducer is in a single file • Build the intermediate data file in a manner that is suitable for the reader rather than the writer • Automatically parallelize execution of the reduce phase: Set the number of reduce tasks and work assignment dynamically • Assign key-ranges to reduce tasks rather than whole partitions • Extend I-files to support key-based retrieval
Atomic “Record” Append For M/R • M/R computations are about processing records • Intermediate data consists of key/value pairs • Extend atomic append to support “records” • Mappers emit <key, record> • Per-record framing that identifies the mapper task that generated a record • System stores per-chunk index • After chunk is stable, chunk is sorted and an index is built by the sorter • Sorting is a completely local operation: read a block from disk, sort in RAM, and write back to disk • Reducers can retrieve data by <key> • Use per-record framing to discard data from dead mappers
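A sketch of what per-record framing and reduce-side filtering could look like. The header layout (mapper id, attempt, sequence number, key/value lengths) and the live_attempts map are assumptions made for this example; the talk only states that framing identifies the generating mapper and is used to drop data from dead mappers and duplicate appends.

```python
import struct

# Hypothetical per-record frame: mapper_id, attempt, seq_no, key_len, val_len.
HEADER = struct.Struct(">IIQII")

def frame(mapper_id, attempt, seq_no, key: bytes, value: bytes) -> bytes:
    return HEADER.pack(mapper_id, attempt, seq_no, len(key), len(value)) + key + value

def unframe(buf: bytes, pos: int):
    mapper_id, attempt, seq_no, klen, vlen = HEADER.unpack_from(buf, pos)
    pos += HEADER.size
    key, value = buf[pos:pos + klen], buf[pos + klen:pos + klen + vlen]
    return (mapper_id, attempt, seq_no, key, value), pos + klen + vlen

def reduce_side_filter(records, live_attempts):
    """Drop records from dead mapper attempts, and duplicates that arise from
    at-least-once append semantics (same (mapper_id, seq_no) appended twice)."""
    seen = set()
    for mapper_id, attempt, seq_no, key, value in records:
        if live_attempts.get(mapper_id) != attempt:
            continue                    # output of a failed/superseded attempt
        if (mapper_id, seq_no) in seen:
            continue                    # duplicate append
        seen.add((mapper_id, seq_no))
        yield key, value
```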
Sailfish Architecture • [Diagram: a job is submitted to the Hadoop JT; reduce tasks ask the workbuilder “What do I do?” and get an assignment such as I-file 5, key range [a, d); mapper tasks write through an appender into KFS I-files, and reducer tasks read/merge their input via a merger]
Handling Failures • Whenever a chunk of an I-file is lost, need to re-generate the lost data • With I-files, we have multiple mappers writing to a block • For fault-tolerance: • The workbuilder tracks the set of chunks modified by each mapper task • Whenever a chunk is lost, the workbuilder notifies the JT of the set of map tasks that have to be re-run • Reducers reading from the I-file with the lost chunk wait until the data is re-generated • For fault-containment, Sailfish uses per-rack I-files • Mappers running in a rack write to chunks of the I-file stored in that rack
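A small sketch of the recovery bookkeeping this implies, under the assumption (made for the example, beyond what the slide states) that the workbuilder keeps a chunk-to-mapper reverse index.

```python
from collections import defaultdict

# Hypothetical workbuilder bookkeeping: which map tasks appended to which chunk.
mappers_by_chunk = defaultdict(set)    # chunk id -> mapper task ids
chunks_by_mapper = defaultdict(set)    # mapper task id -> chunk ids

def note_append(mapper_id, chunk_id):
    mappers_by_chunk[chunk_id].add(mapper_id)
    chunks_by_mapper[mapper_id].add(chunk_id)

def on_chunk_lost(chunk_id):
    """Map tasks the JT must re-run to regenerate a lost I-file chunk."""
    return sorted(mappers_by_chunk.get(chunk_id, set()))
```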
Fault-tolerance With Sailfish • An alternative option is to replicate map output • Use atomic record append to write to two chunkservers • Probability of data loss due to (concurrent) double failure is low • Performance hit for replicating data is low • Data is replicated using RAM and written to disk asynchronously • However, network traffic increases substantially • Sailfish causes network traffic to double compared to Stock Hadoop • Map output is written to the network and reduce input is read over the network • With replication, data traverses the network three times • An alternative strategy is to selectively replicate map output • Replicate in response to data loss • Replicate output that was generated the earliest
Sailfish Reduce Phase • # of reducers/job and their task assignment is determined by the workbuilder in a data-dependent manner • Dynamically set the # of reducers per job after the map phase execution is complete • # of reducers/I-file = (size of I-file) / (work per reducer) • Work per reducer is set based on RAM (in experiments, 1GB per reduce task) • If the data assigned to a task exceeds the size of RAM, the merger does a network-wide merge by appropriately streaming the data • The workbuilder uses the per-chunk index to determine split points • Each reduce task is assigned a range of keys within an I-file • Data for a reduce task is in multiple chunks and requires a merge • Since chunks are sorted, the data read by a reducer from a chunk is all sequential I/O • Skew in reduce input is handled seamlessly • An I-file with more data has more tasks assigned to it
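A sketch of this planning step, assuming the per-chunk index can be summarized as a sorted sample of keys per I-file; the index format and the function names below are illustrative, not the Sailfish implementation.

```python
import bisect

GB = 1 << 30
WORK_PER_REDUCER = 1 * GB   # per-reducer RAM budget, as in the experiments

def plan_reducers(ifile_size_bytes, sampled_keys):
    """Pick the number of reduce tasks for one I-file and the key-range split
    points, using evenly spaced ranks of a sorted key sample (skew-aware)."""
    n_reducers = max(1, ifile_size_bytes // WORK_PER_REDUCER)
    keys = sorted(sampled_keys)
    splits = [keys[(i * len(keys)) // n_reducers] for i in range(1, n_reducers)]
    return n_reducers, splits

def owning_reducer(key, splits):
    """Reduce task (within this I-file) responsible for the key's range."""
    return bisect.bisect_right(splits, key)

# Example: a 4 GB I-file is split across 4 reduce tasks.
n, splits = plan_reducers(4 * GB, [f"k{i:04d}" for i in range(1000)])
print(n, splits, owning_reducer("k0999", splits))
```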
Experimental Evaluation • Cluster comprises ~150 machines • 6 map tasks, 6 reduce tasks per node • With Hadoop M/R tasks, a JVM is given 1.5GB RAM for a one-pass sort/merge • 8 cores, 16GB RAM, 4 x 750GB drives, 1Gbps between any pair of nodes • Jobs use all the nodes in the cluster • Evaluate with a benchmark as well as a real M/R job • Simple benchmark that generates its own data (similar to terasort) • Measures only the overhead of transporting intermediate data • The job generates records with a random 10-byte key and 90-byte value • Experiments vary the size of intermediate data (1TB – 64TB) • Mappers generate 1GB of data and reducers consume ~1GB of data
I-files in practice • 150 map tasks/rack • 128 map tasks concurrently appending to a block of an I-file • 2 blocks of an I-file are concurrently appended to in a rack • 512 I-files per job • Beyond 512 I-files, we hit system limitations in the cluster (too many open files, too many connections) • KFS chunkservers use direct I/O with the disk subsystem, bypassing the buffer cache
How many seeks? • With Stock Hadoop, number of seeks is ∝ M * R • With Sailfish, it is the product of: • # of chunks per I-file (c) • # of reduce tasks per I-file (R / I) • # of I-files (I) • We get: c * I * (R / I) = c * R • # of chunks per I-file: 64TB intermediate data split over 512 I-files, where the chunksize is 128MB • c = 64TB / (512 * 128MB) = 1024 • # of map tasks at 64TB: 65536 (64TB / 1GB per mapper): c << M
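The same numbers, restated as a runnable snippet (this is just the slide's arithmetic, nothing more):

```python
# Seek comparison for a 64 TB job: 512 I-files, 128 MB chunks, 1 GB per task.
TB, GB, MB = 1 << 40, 1 << 30, 1 << 20

data, ifiles, chunk_size = 64 * TB, 512, 128 * MB
M = R = data // (1 * GB)              # 65,536 map and reduce tasks

stock_hadoop_seeks = M * R            # one seek per (mapper, reducer) pair
c = data // (ifiles * chunk_size)     # chunks per I-file = 1,024
sailfish_seeks = c * R                # c * I * (R / I) = c * R

print(f"Stock Hadoop: ~{stock_hadoop_seeks:,} seeks")
print(f"Sailfish:     ~{sailfish_seeks:,} seeks (c = {c:,} << M = {M:,})")
```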
Why does Sailfish work? • Where are the gains coming from? • Write-path is low-latency and keeps as many disk arms and NICs busy as possible • Read-path: • Lowered the number of disk seeks • Reads are large/sequential • Compared to Hadoop, the read path in Sailfish is very efficient • An efficient disk read path leads to better network utilization
Using Sailfish In Practice • Use a job+data from one of the behavioral ad-targeting pipelines at Yahoo • BT-Join: Build a sliding N-day model of user behavior • Take 1 day of clickstream logs, join it with the previous N days, and produce a new N-day model • Input datasets compressed using bz2: • Dataset A: 1000 files, 50MB apiece (10:1 compression) • Dataset B: 1000 files, 1.2GB apiece (10:1 compression) • Extended Sailfish to support compression for intermediate data • Mappers generate up to 256KB of records, compress, and “append record” • Sorters read compressed data, decompress, sort, and recompress • Merger reads compressed data, decompresses, merges, and passes to the reducer • For performance, use LZO from the Intel IPP package
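A sketch of the compressed-append path this describes: buffer roughly 256KB of records, compress the batch, and hand it to record append. Here zlib stands in for the LZO/IPP codec used in practice, record_append refers to the hypothetical client call sketched earlier, and the record encoding is made up for the example.

```python
import zlib

BATCH_BYTES = 256 * 1024   # buffer ~256 KB of records before compressing

class CompressingAppender:
    def __init__(self, record_append):
        self.record_append = record_append   # e.g. ChunkMaster.record_append above
        self.buf = bytearray()

    def emit(self, key: bytes, value: bytes):
        self.buf += key + b"\t" + value + b"\n"   # toy record encoding
        if len(self.buf) >= BATCH_BYTES:
            self.flush()

    def flush(self):
        if self.buf:
            # zlib stands in here for LZO from the Intel IPP package.
            self.record_append(zlib.compress(bytes(self.buf)))
            self.buf.clear()
```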
BT-Join Analysis • Speedup in the Reduce phase is due to better batching • Speedup in the Map phase: • Stock Hadoop: if map output doesn’t fit in RAM, mappers do an external sort • Sailfish: Sorting is outside the map task and hence there is no limit on the amount of map output a map task can generate • Net result: The job with Sailfish is about 2x faster when compared to Stock Hadoop
Related Work • Atomic append was introduced in the GFS paper (SOSP’03) • GFS, however, seems to have moved away from atomic append, saying it is not usable (at-least-once semantics and replicas can diverge) • Balanced systems: TritonSort • Stage-based sort engine in which the hardware is balanced • 8 cores, 24GB RAM, 10Gig NIC, 16 drives/box • Software is then constructed in a way that balances hardware use • Follow-on work builds an M/R engine on top of TritonSort • Not clear how general their M/R engine is (seems specific to sort) • Sailfish tries to achieve balance via software and is a general M/R engine
Summary • Designed I-files for intermediate data and built Sailfish for doing large-scale M/R • Sailfish will be released as open-source • Build Sailfish on top of YARN • Utilize the per-chunk index: • Improve reduce task planning based on key distributions • “Checkpoint” reduce tasks on key-based boundaries and allow better resource sharing • Support aggregation trees • Having the intermediate data outside an M/R job allows new debugging possibilities • Debug just the reduce phase