
I-Files: Handling Intermediate Data In Parallel Dataflow Graphs



Presentation Transcript


  1. I-Files: Handling Intermediate Data In Parallel Dataflow Graphs Sriram Rao November 2, 2011

  2. Joint Work With… • Raghu Ramakrishnan, Adam Silberstein: Yahoo Labs • Mike Ovsiannikov, Damian Reeves: Quantcast

  3. Motivation • Massive growth in online advertising (read…display ads) • Companies are reacting to this opportunity via behavioral ad-targeting • Collect click-stream logs, mine the data, build models, show ads • “Petabyte scale data mining” using computational frameworks (such as, Hadoop, Dryad) is commonplace • Analysis of Hadoop job history logs shows: • Over 95% of jobs are small (run for a few mins, process small data) • About 5% of jobs are large (run for hours, process big data)

  4. Where have my cycles gone? 5% of jobs take 90% of cycles!

  5. Who is using my network? 5% of jobs account for 99% of network traffic!

  6. So… • Analysis shows 5% of the jobs are “big”: • 5% of jobs use 90% of cluster compute cycles • 5% of jobs shuffle 99% of the data (i.e., 99% of network bandwidth) • To improve cluster performance, improve M/R performance for large jobs • Faster, faster, faster: virtuous cycle • Cluster throughput goes up • Users will run bigger jobs • Our work: Focus on handling intermediate data at scale in parallel dataflow graphs

  7. Handling Intermediate Data in M/R • In an M/R computation, map output is intermediate data • For transferring intermediate data from map to reduce: • Map tasks generate data, write it to disk • When a reduce task pulls map output, • Data has to be read from disk • Transferred over the network • Cannot assume that mappers/reducers can be scheduled concurrently • Transporting intermediate data: • Intermediate data size < RAM size: RAM masks disk I/O • Intermediate data size > RAM size: Cache hit rate masks disk I/O • Intermediate data size >> RAM size: Disk overheads affect performance

  8. Handling Intermediate Data at Scale • Intermediate Data Transfer: Distributed Merge Sort • # of disk seeks for transferring intermediate data ∝ M * R • Avg. amount of data a reducer pulls from a mapper ∝ 1 / R • [Diagram: Map tasks → (M * R) disk seeks through the distributed file system → Reduce tasks]

  9. Disk Overheads (More detail) • “Fix” the amount of data generated by a map task • Size RAM such that the map output fits in-memory and can be sorted in 1 pass • For example, use 1GB • “Fix” the amount of data consumed by a reduce task • Size RAM for a 1-pass merge • For example, use 1GB • Now… • For a job with 1TB of data • 1024 mappers generate 1G each; 1024 reducers consume 1G each • On average, data generated by a map for a given reducer = 1G / 1024 = 1M • For a job with 16TB of data • 16K mappers generate 1G each; 16K reducers consume 1G each • On average, data generated by a map for a given reducer = 1G / 16K = 64K • With scale, # of seeks increases; data read/seek decreases
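
To make the arithmetic above concrete, here is a small back-of-the-envelope script (not from the talk) that reproduces the slide's numbers, assuming 1GB of output per mapper and ~1GB of input per reducer:

```python
# Back-of-the-envelope sketch of the shuffle arithmetic above.
# Assumes, as the slide does, 1GB per map output and ~1GB per reduce input.
GB = 1 << 30

def shuffle_stats(intermediate_bytes, per_task_bytes=1 * GB):
    """Return (#mappers, #reducers, #seeks, avg data moved per seek)."""
    m = intermediate_bytes // per_task_bytes   # mappers, 1GB of output each
    r = intermediate_bytes // per_task_bytes   # reducers, ~1GB of input each
    seeks = m * r                              # one fetch per mapper/reducer pair
    bytes_per_seek = per_task_bytes // r       # each reducer pulls 1GB/R per mapper
    return m, r, seeks, bytes_per_seek

for tb in (1, 16, 64):
    m, r, seeks, bps = shuffle_stats(tb * 1024 * GB)
    print(f"{tb:>2} TB: M={m}, R={r}, seeks={seeks:,}, data/seek={bps // 1024} KB")
```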

  10. Disk Overheads • As the volume of intermediate data scales, • Amount of data read per seek decreases • # of disk seeks increases non-linearly • Net result: Job performance will be affected by the disk overheads in handling intermediate data • Intermediate data increases by 2x • Job-run time increases by 2.5x

  11. What is new? • One intermediate file per reducer, instead of one per mapper • Network-wide merge into I-files: fewer seeks! • [Diagram: Map tasks → network-wide merge into I-files in the distributed file system → Reduce tasks]

  12. Our work • New approach for efficient handling of intermediate data at large scale • Minimize the number of seeks • Maximize the amount of data read/written per seek • Primarily geared towards LARGE M/R jobs: • 10’s of TB of intermediate data • 1000’s of mapper/reducer tasks • I-files: Filesystem support for intermediate data • Atomic record append primitive that allows write parallelism at scale • Network-wide batching of intermediate data • Build Sailfish (by modifying Hadoop-0.20.2) where intermediate data is transported using I-files

  13. How did we do? (Benchmark job)

  14. How did we do? (Actual Job)

  15. Talk Outline • Properties of Intermediate data • I-files implementation • Sailfish: M/R implementation that uses I-files for intermediate data • Experimental Evaluation • Summary

  16. Organizing Intermediate Data • Hadoop organizes intermediate data in a format convenient for the mapper • What if we went the opposite way: organize it in a format convenient for the reducer? • Mappers write their output to a per-partition I-file • Data destined for a reducer is in a single file • Build the intermediate data file in a manner that is suitable for the reader rather than the writer

  17. Intermediate data • Reducer input is generated by multiple mappers • File is a container into which mapper output needs to be stored • Write order is k1, k2, k3, k4 • Processing order is k3, k1, k4, k2 • Because the reducer imposes the processing order, the writer does not care where the output is stored in the file • Once a mapper emits a record, the output is committed • There is no “withdraw” • [Diagram: several map tasks appending records k1–k4 to a single file that one reduce task consumes in its own order]

  18. Properties of Intermediate data file • Multiple mappers generate data that will be consumed by a single reducer • Need low latency multi-writer support • Writers are doing append-only writes • Contents of the I-file are never overwritten • Arbitrary interleaving of data is ok: • Writer does not care where the data goes in the file • Any ordering we need can be done post-facto • No ordering guarantees for the writes from a single client • Follows from arbitrary interleaving of writes

  19. Atomic Record Append • Multi-writer support => need an atomic primitive • Intermediate data is append only…so, need atomic append • With atomic record append primitive clients provide just the data but the server chooses the offset with arbitrary interleaving • In contrast, in a traditional write clients provide data+offset • Since server is choosing the offset, design is lock-free • To scale atomic record append with writers, allow • Multiple writers append to a single block of the file • Multiple blocks of the file concurrently appended to

  20. Atomic Record Append • [Diagram: Client1 and Client2 send ARA requests <A, offset = -1> and <B, offset = -1>; the server places records A, B, C, D at offsets of its choosing (300, 350, 400, 500) and returns the chosen offset (e.g., Offset = 300) to the client]
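
For illustration only, a toy sketch of the server side of atomic record append; the real KFS chunk master is more involved, but the contract shown on this slide is the same: the client sends data with offset = -1, and the server picks and returns the offset, so concurrent appends need no client-side coordination.

```python
# Toy sketch (not the KFS implementation) of server-assigned append offsets.
class ChunkMaster:
    def __init__(self, chunk_size=128 << 20):   # 128MB chunks, as in KFS
        self.next_offset = 0
        self.chunk_size = chunk_size

    def record_append(self, data: bytes) -> int:
        """Pick the next free offset in the chunk and return it to the client."""
        if self.next_offset + len(data) > self.chunk_size:
            raise IOError("chunk full; client retries on a freshly allocated chunk")
        offset = self.next_offset
        self.next_offset += len(data)
        return offset

master = ChunkMaster()
print(master.record_append(b"A" * 300))   # e.g., client1's record lands at offset 0
print(master.record_append(b"B" * 100))   # e.g., client2's record lands at offset 300
```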

  21. Implementing I-files • Have implemented I-files in context of Kosmos distributed filesystem (KFS) • Why KFS? • KFS has multi-writer support • We have designed/implemented/deployed KFS to manage PB’s of storage • KFS is similar to GFS/HDFS • Chunks are striped across nodes and replicated for fault-tolerance • Chunk master serializes all writes to a chunk • For atomic append, chunk master assigns the offset • With KFS I-files, multiple chunks of the I-file can be concurrently modified

  22. Atomic Record Append • Writers are routed to a chunk that is open for writing • For scale, limit the # of concurrent writers to a chunk • When client gets an ack back from chunk master, data is replicated in the volatile memory at all the replicas • Chunkservers are free to commit data to disk asynchronously • Eventually, chunk is made stable • Data is committed to disk at all the replicas • Replicas are byte-wise identical • Stable chunks are not appended to again

  23. Talk Outline • Properties of Intermediate data • I-files implementation • Sailfish: M/R implementation that uses I-files for intermediate data • Experimental Evaluation • Summary

  24. The Elephant Can Dance… • [Diagram: the map() to reduce() data path, contrasting the Hadoop shuffle pipeline with its (de)serialization steps against the Sailfish shuffle pipeline]

  25. Sailfish Overview • Modify Hadoop-0.20.2 to use I-files for MapReduce • Mappers write their output to a per-partition I-file • Replication factor of 1 for all the chunks of an I-file • At-least-once semantics for append; filter dups on the reduce side • Data destined for a reducer is in a single file • Build the intermediate data file in a manner that is suitable for the reader rather than the writer • Automatically parallelize execution of the reduce phase: Set the number of reduce tasks and work assignment dynamically • Assign key-ranges to reduce tasks rather than whole partitions • Extend I-files to support key-based retrieval
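
Because appends have at-least-once semantics, a retried append can land in an I-file twice. A minimal sketch of the reduce-side duplicate filtering mentioned above, assuming (hypothetically) that each record's framing carries the mapper task id and a per-mapper sequence number:

```python
# Minimal sketch of reduce-side dedup; the (mapper_id, seq_no) framing fields
# are an illustrative assumption, not the actual Sailfish record format.
def filter_duplicates(records):
    """Yield each (key, value) exactly once from at-least-once appended records."""
    seen = set()
    for mapper_id, seq_no, key, value in records:
        tag = (mapper_id, seq_no)
        if tag in seen:           # a retried append produced this extra copy
            continue
        seen.add(tag)
        yield key, value

records = [(7, 0, "k1", "v1"), (7, 1, "k2", "v2"), (7, 1, "k2", "v2")]  # last is a dup
print(list(filter_duplicates(records)))   # [('k1', 'v1'), ('k2', 'v2')]
```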

  26. Sailfish Map Phase Execution

  27. Sailfish Reduce Phase Execution

  28. Atomic “Record” Append For M/R • M/R computations are about processing records • Intermediate data consists of key/value pairs • Extend atomic append to support “records” • Mappers emit <key, record> • Per-record framing that identifies the mapper task that generated a record • System stores per-chunk index • After chunk is stable, chunk is sorted and an index is built by the sorter • Sorting is a completely local operation: read a block from disk, sort in RAM, and write back to disk • Reducers can retrieve data by <key> • Use per-record framing to discard data from dead mappers
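
A hypothetical encoding of the per-record framing described above; the header fields and widths are illustrative only, not the actual Sailfish on-disk format, but they show how a sorter or reducer can parse a chunk and drop records from dead mappers:

```python
# Hypothetical per-record framing: mapper task id + key/value lengths.
import struct

HEADER = struct.Struct(">IHI")   # mapper_task_id (4B), key_len (2B), value_len (4B)

def frame(mapper_id: int, key: bytes, value: bytes) -> bytes:
    """Prefix a key/value pair with framing identifying the producing mapper."""
    return HEADER.pack(mapper_id, len(key), len(value)) + key + value

def unframe(buf: bytes, pos: int = 0):
    """Parse one framed record starting at pos; return fields and the next pos."""
    mapper_id, klen, vlen = HEADER.unpack_from(buf, pos)
    pos += HEADER.size
    key = buf[pos:pos + klen]
    value = buf[pos + klen:pos + klen + vlen]
    return mapper_id, key, value, pos + klen + vlen

rec = frame(42, b"user123", b"click")
print(unframe(rec))   # (42, b'user123', b'click', 22)
```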

  29. Sailfish Architecture • [Diagram: a job is submitted to the Hadoop JT; reduce tasks ask the workbuilder “What do I do?” and are assigned a key range within an I-file (e.g., I-file 5, [a, d)); mapper tasks write via the IAppender, reducer tasks read/merge via the IMerger, and intermediate data lives in KFS I-files]

  30. Handling Failures • Whenever a chunk of an I-file is lost, need to re-generate lost data • With I-file, we have multiple mappers writing to a block • For fault-tolerance, • Workbuilder tracks the set of chunks modified by a mapper task • Whenever a chunk is lost, workbuilder notifies the JT of the set of map tasks that have to be re-run • Reducers reading from the I-file with the lost chunk wait until data is re-generated • For fault-containment, in Sailfish, use per-rack I-files • Mappers running in a rack write to chunks of the I-file stored in the rack
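
A minimal sketch of the chunk-to-mapper bookkeeping the workbuilder needs for this recovery path; the class and method names here are assumptions for illustration:

```python
# Sketch: track which map tasks appended to which chunks, so a lost chunk
# maps back to the set of map tasks the JT must re-run.
from collections import defaultdict

class WorkBuilder:
    def __init__(self):
        self.chunk_to_mappers = defaultdict(set)

    def on_append(self, chunk_id: str, mapper_id: str):
        """Record that a mapper appended data to a chunk of an I-file."""
        self.chunk_to_mappers[chunk_id].add(mapper_id)

    def on_chunk_lost(self, chunk_id: str) -> set:
        """Return the map tasks whose output must be regenerated."""
        return self.chunk_to_mappers.get(chunk_id, set())

wb = WorkBuilder()
wb.on_append("ifile5.chunk3", "map_0017")
wb.on_append("ifile5.chunk3", "map_0042")
print(wb.on_chunk_lost("ifile5.chunk3"))   # {'map_0017', 'map_0042'}
```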

  31. Fault-tolerance With Sailfish • Alternate option is to replicate map-output • Use atomic record append to write to two chunkservers • Probability of data loss due to (concurrent) double failure is low • Performance hit for replicating data is low • Data is replicated using RAM and written to disk async • However, network traffic increases substantially • Sailfish causes network traffic to double compared to Stock Hadoop • Map output is written to the network and reduce input is read over the network • With replication, data traverses the network three times • Alternate strategy is to selectively replicate map output • Replicate in response to data loss • Replicate output that was generated the earliest

  32. Sailfish Reduce Phase • # of reducers/job and their task assignment is determined by the workbuilder in a data-dependent manner • Dynamically set the # of reducers per job after the map phase execution is complete • # of reducers/I-file = (size of I-file) / (work per reducer) • Work per reducer is set based on RAM (in experiments, use 1GB per reduce task) • If the data assigned to a task exceeds the size of RAM, the merger does a network-wide merge by appropriately streaming the data • Workbuilder uses the per-chunk index to determine split points • Each reduce task is assigned a range of keys within an I-file • Data for a reduce task is in multiple chunks and requires a merge • Since chunks are sorted, data read by a reducer from a chunk is all sequential I/O • Skew in reduce input is handled seamlessly • An I-file with more data has more tasks assigned to it
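
A simplified sketch of this data-dependent planning; a sorted key sample stands in for the per-chunk index, and the function name is illustrative:

```python
# Sketch: reducers per I-file = ceil(size / work-per-reducer); split points
# come from a sorted key sample (a stand-in for the per-chunk index).
import math

def plan_reducers(ifile_bytes: int, sampled_keys: list, work_per_reducer=1 << 30):
    """Return one (low_key, high_key) range per reduce task for this I-file."""
    n_tasks = max(1, math.ceil(ifile_bytes / work_per_reducer))
    keys = sorted(sampled_keys)
    step = max(1, len(keys) // n_tasks)
    bounds = keys[step::step][:n_tasks - 1]     # interior split points
    lows = [None] + bounds                      # None = open-ended range
    highs = bounds + [None]
    return list(zip(lows, highs))

# A 3GB I-file with 1GB of work per reducer gets three key ranges:
print(plan_reducers(3 << 30, ["a", "c", "f", "k", "p", "z"]))
# [(None, 'f'), ('f', 'p'), ('p', None)]
```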

  33. Experimental Evaluation • Cluster comprises ~150 machines • 6 map tasks, 6 reduce tasks per node • With Hadoop M/R tasks, a JVM is given 1.5G RAM for a one-pass sort/merge • 8 cores, 16GB RAM, 4 x 750GB drives, 1Gbps between any pair of nodes • Job uses all the nodes in the cluster • Evaluate with a benchmark as well as a real M/R job • Simple benchmark that generates its own data (similar to terasort) • Measures only the overhead of transporting intermediate data • Job generates records with a random 10-byte key, 90-byte value • Experiments vary the size of intermediate data (1TB – 64TB) • Mappers generate 1GB of data and reducers consume ~1GB of data

  34. I-files in practice • 150 map tasks/rack • 128 map tasks concurrently appending to a block of an I-file • 2 blocks of an I-file are concurrently appended to in a rack • 512 I-files per job • Beyond 512 I-files we hit system limitations in the cluster (too many open files, too many connections) • KFS chunkservers use direct I/O with the disk subsystem, bypassing the buffer cache

  35. How did we do? (Benchmark job)

  36. How many seeks? • With Stock Hadoop, the number of seeks is ∝ M * R • With Sailfish, it is the product of: • # of chunks per I-file (c) • # of reduce tasks per I-file (R / I) • # of I-files (I) • We get: c * I * (R / I) = c * R • # of chunks per I-file: 64TB of intermediate data split over 512 I-files, where the chunksize is 128MB • c = (64TB / (512 * 128MB)) = 1024 • # of map tasks at 64TB: 65536 (64TB / 1GB per mapper): c << M
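
A quick check of these numbers, using the same assumptions as the slide (64TB of intermediate data, 512 I-files, 128MB chunks, 1GB of output per mapper):

```python
# Seek arithmetic from the slide: Hadoop ~ M * R seeks, Sailfish ~ c * R seeks.
TB, GB, MB = 1 << 40, 1 << 30, 1 << 20

intermediate = 64 * TB
ifiles, chunk_size, per_mapper = 512, 128 * MB, 1 * GB

c = intermediate // (ifiles * chunk_size)   # chunks per I-file
m = intermediate // per_mapper              # map tasks
print(c, m, m // c)                         # 1024 65536 64  ->  c << M
```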

  37. Why does Sailfish work? • Where are the gains coming from? • Write path is low-latency and is able to keep as many disk arms and NICs busy as possible • Read path: • Lowered the number of disk seeks • Reads are large/sequential • Compared to Hadoop, the read path in Sailfish is very efficient • Efficient disk read path leads to better network utilization

  38. Data read per seek

  39. Disk Thruput (during Reduce phase)

  40. Using Sailfish In Practice • Use a job+data from one of the behavioral ad-targeting pipelines at Yahoo • BT-Join: Build a sliding N-day model of user behavior • Take 1 day of clickstream logs, join with the previous N days, and produce a new N-day model • Input datasets compressed using bz2: • Dataset A: 1000 files, 50MB apiece (10:1 compression) • Dataset B: 1000 files, 1.2GB apiece (10:1 compression) • Extended Sailfish to support compression for intermediate data • Mappers generate up to 256K of records, compress, and “append record” • Sorters read compressed data, decompress, sort, and recompress • Merger reads compressed data, decompresses, merges, and passes to the reducer • For performance, use LZO from the Intel IPP package
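
A hedged sketch of that batch-compress-append path on the map side; zlib stands in for the LZO codec the talk actually uses, purely to keep the example self-contained:

```python
# Sketch: batch framed records up to ~256KB, compress the batch, and append
# the compressed blob as a single record. zlib is a stand-in for LZO here.
import zlib

BATCH_LIMIT = 256 * 1024

def compressed_batches(records, append_record):
    """records: iterable of framed byte blobs; append_record: appends one blob."""
    buf, size = [], 0
    for rec in records:
        buf.append(rec)
        size += len(rec)
        if size >= BATCH_LIMIT:                   # flush a full batch
            append_record(zlib.compress(b"".join(buf)))
            buf, size = [], 0
    if buf:                                       # flush the final partial batch
        append_record(zlib.compress(b"".join(buf)))

out = []
compressed_batches([b"x" * 1000] * 600, out.append)
print(len(out), sum(len(b) for b in out))         # a few small compressed blobs
```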

  41. How did we do? (BT-Join)

  42. BT-Join Analysis • Speedup in the Reduce phase is due to better batching • Speedup in the Map phase: • Stock Hadoop: if map output doesn’t fit in RAM, mappers do an external sort • Sailfish: Sorting is outside the map task and hence there are no limits on the amount of map output generated by a map task • Net result: Job with Sailfish is about 2x faster when compared to Stock Hadoop

  43. Related Work • Atomic append was introduced in the GFS paper (SOSP’03) • GFS, however, seems to have moved away from atomic append, as they say it is not usable (at-least-once semantics and replicas can diverge) • Balanced systems: TritonSort • Stage-based sort engine in which the hardware is balanced • 8 cores, 24GB RAM, 10Gig NIC, 16 drives/box • Software is then constructed in a way that balances hardware use • Follow-on work on building an M/R engine on top of TritonSort • Not clear how general their M/R engine is (seems specific to sort) • Sailfish tries to achieve balance via software and is a general M/R engine

  44. Summary • Designed I-files for intermediate data and built Sailfish for doing large-scale M/R • Sailfish will be released as open-source • Build Sailfish on top of YARN • Utilize the per-chunk index: • Improve reduce task planning based on key distributions • “Checkpoint” reduce tasks on key-based boundaries and allow better resource sharing • Support aggregation trees • Having the intermediate data outside a M/R job allows new debugging possibilities • Debug just the reduce phase
