
Phoenix Rebirth: Scalable MapReduce on a NUMA System


Presentation Transcript


  1. Phoenix Rebirth: Scalable MapReduce on a NUMA System Richard Yoo, Anthony Romano

  2. MapReduce and Phoenix
  • MapReduce
    • A functional-style parallel programming framework and runtime for large clusters
    • Users only provide map / reduce functions (see the sketch below)
    • Map: processes input data and generates a set of intermediate key / value pairs
    • Reduce: properly merges the intermediate pairs with the same key
    • Runtime: automatically parallelizes computation and manages data distribution / result collection
  • Phoenix
    • Shared-memory implementation of MapReduce
    • Shown to be an efficient programming model for SMP / CMP
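For concreteness, here is a minimal sketch of the user-provided pieces for a word_count-style application. The map_args_t layout and the emit_intermediate / emit hooks are illustrative assumptions standing in for the Phoenix runtime's actual API, which differs in detail; the stubs simply print the pairs so the sketch compiles on its own.

/* Hedged word_count-style map / reduce pair; names and signatures are
 * illustrative, not the real Phoenix headers. */
#include <stdio.h>
#include <string.h>

typedef struct {
    char  *data;    /* chunk of input text assigned to this map task,
                       assumed NUL-terminated for this sketch */
    size_t length;
} map_args_t;

/* Stub emit hooks so the sketch compiles standalone; the real runtime
 * collects these pairs instead of printing them. */
void emit_intermediate(char *key, long value)  { printf("map: (%s, %ld)\n", key, value); }
void emit(const char *key, long value)         { printf("out: (%s, %ld)\n", key, value); }

/* Map: split the chunk into words and emit (word, 1) for each. */
void wc_map(map_args_t *args)
{
    char *saveptr = NULL;
    for (char *word = strtok_r(args->data, " \t\n\r.,;:", &saveptr);
         word != NULL;
         word = strtok_r(NULL, " \t\n\r.,;:", &saveptr)) {
        emit_intermediate(word, 1);
    }
}

/* Reduce: sum every count the runtime grouped under the same key. */
void wc_reduce(const char *key, long *vals, int num_vals)
{
    long sum = 0;
    for (int i = 0; i < num_vals; i++)
        sum += vals[i];
    emit(key, sum);
}

int main(void)
{
    char text[] = "the quick brown fox jumps over the lazy dog the end";
    map_args_t args = { text, sizeof text - 1 };
    wc_map(&args);                 /* the runtime would then group keys... */
    long ones[] = { 1, 1 };
    wc_reduce("the", ones, 2);     /* ...and invoke reduce per key */
    return 0;
}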

  3. Project Goal
  • Improve the scalability of Phoenix on a NUMA system
    • 4-socket UltraSPARC T2+ machine
    • 256 hardware contexts (8 cores per chip, 8 contexts per core)
  • NUMA
    • 300 cycles to access local memory
    • 100-cycle overhead to access remote memory
    • All off-chip memory traffic is potentially an issue
  [Diagram: chips 0-3, each with local memory (mem 0-3), connected through a central hub]

  4. Motivating Example
  • Baseline Phoenix shows moderate scalability on a single-socket machine
  • Performance plummets on a NUMA machine with a larger number of threads
    • One chip supports 64 threads
    • Utilizing off-chip threads destroys scalability
  [Charts: Speedup on a single-socket UltraSPARC T2; Speedup on a 4-socket UltraSPARC T2+ (NUMA)]

  5. MapReduce on a NUMA System
  • What's happening?
  • It's about locality, locality, locality
    • Reducing off-chip memory traffic is the key to scalability
  • Why is locality a problem in (the invincible) MapReduce?
    • The original Google MapReduce was meant for clusters
      • Communication is implemented via GFS (a distributed file system)
    • Phoenix is a shared-memory implementation, so communication takes place through shared memory
      • How the user-provided map / reduce functions access memory, and how they interact, has a significant impact on overall system performance

  6. Reducing Off-Chip Traffic
  • Where is this traffic generated?
  • MapReduce is actually Split-Map-Reduce-Merge
    • Each phase boundary introduces off-chip memory traffic
    • The location of map data is determined by the split phase
    • Global data shuffle from the map to the reduce phase
    • The merge phase entails a global gathering

  7. Split-to-Map Phase Off-Chip Traffic
  • Split phase
    • User-supplied data are distributed over the system
  • Map phase
    • Workers pull the data to work on a map task
  [Diagram: chips 0-3, each with local memory (mem 0-3), connected through a hub]

  8. Map-to-Reduce Phase Off-Chip Traffic
  • Map phase
    • Map results reside in local memory
  • Reduce phase
    • Has to be performed at a global scope
    • Global gathering of data
    • A similar gathering occurs at the reduce-to-merge phase
  [Diagram: chips 0-3, each with local memory (mem 0-3), connected through a hub]

  9. Applied Optimizations
  • Implement optimization / scheduling logic at each phase boundary to minimize off-chip traffic
  • Split-to-Map Phase Traffic
    • Data-locality-aware map task distribution (sketched below)
    • Installed per-locality-group (chip) task queues
    • Distributed map tasks according to the location of the map data
    • Worker threads work on local map tasks first, then perform task stealing across locality groups
  • Map-to-Reduce Phase Traffic
    • Combiners
      • Perform a local reduction before shipping map results off to the reduce workers
      • Reduces the amount of off-chip data traffic
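A hedged sketch of the locality-aware task distribution: one map-task queue per locality group (chip); a worker drains its own group's queue first and only then steals from other groups. The queue, task, and helper names below are illustrative stand-ins, not Phoenix code.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_GROUPS 4                        /* 4 sockets on the T2+ box */
#define MAX_TASKS  1024

typedef struct { void *data; size_t len; } map_task_t;

typedef struct {
    pthread_mutex_t lock;
    map_task_t tasks[MAX_TASKS];
    int head, tail;                         /* simple bounded FIFO */
} task_queue_t;

static task_queue_t group_queue[NUM_GROUPS];

/* Call once before starting workers. */
void init_queues(void)
{
    for (int g = 0; g < NUM_GROUPS; g++)
        pthread_mutex_init(&group_queue[g].lock, NULL);
}

/* Pop one task; returns false when the queue is empty. */
static bool dequeue(task_queue_t *q, map_task_t *out)
{
    pthread_mutex_lock(&q->lock);
    bool ok = q->head < q->tail;
    if (ok)
        *out = q->tasks[q->head++];
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Placeholder for invoking the user's map function on one task. */
static void run_map_task(map_task_t *t) { (void)t; }

/* Main loop for a map worker pinned to locality group `my_group`. */
void map_worker(int my_group)
{
    map_task_t task;

    /* 1. Prefer tasks whose input data lives in this chip's local memory. */
    while (dequeue(&group_queue[my_group], &task))
        run_map_task(&task);

    /* 2. Local queue drained: steal from other groups so the thread stays
     *    busy, accepting off-chip traffic only when unavoidable. */
    for (int g = 1; g < NUM_GROUPS; g++) {
        int victim = (my_group + g) % NUM_GROUPS;
        while (dequeue(&group_queue[victim], &task))
            run_map_task(&task);
    }
}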

  10. Applied Optimizations (contd.)
  • Reduce-to-Merge Phase Traffic
    • Per-locality-group merge workers (sketched below)
      • The original Phoenix performs a global-scale merge
      • Perform a localized merge first, then merge the entire result in the final phase
      • Reduces chip-crossings during the merge phase
  • And of course, a lot of tuning / optimization for the single-chip case
    • Improved buffer management
    • Fine-tuned performance knobs
      • Size of map tasks
      • Number of key hash buckets
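A hedged sketch of the two-level merge: each locality group first merges the sorted outputs of its own workers, touching only local memory, and a single final pass then merges the per-group results. The keyval_t layout and helpers are illustrative, not the actual Phoenix structures, and memory reclamation of intermediate lists is elided.

#include <stdlib.h>
#include <string.h>

typedef struct { const char *key; long val; } keyval_t;
typedef struct { keyval_t *kv; int len; } kv_list_t;

/* Standard 2-way merge of two key-sorted lists into freshly allocated storage. */
static kv_list_t merge_two(kv_list_t a, kv_list_t b)
{
    kv_list_t out = { malloc((size_t)(a.len + b.len) * sizeof(keyval_t)), 0 };
    int i = 0, j = 0;
    while (i < a.len && j < b.len)
        out.kv[out.len++] =
            (strcmp(a.kv[i].key, b.kv[j].key) <= 0) ? a.kv[i++] : b.kv[j++];
    while (i < a.len) out.kv[out.len++] = a.kv[i++];
    while (j < b.len) out.kv[out.len++] = b.kv[j++];
    return out;
}

/* Phase 1: run once per locality group over that group's worker outputs,
 * so all of this merging stays in the chip's local memory. */
kv_list_t merge_local(kv_list_t *worker_out, int num_workers)
{
    kv_list_t acc = worker_out[0];
    for (int w = 1; w < num_workers; w++)
        acc = merge_two(acc, worker_out[w]);
    return acc;
}

/* Phase 2: a single cross-chip merge of the per-group results, keeping
 * the number of chip-crossings during the merge phase small. */
kv_list_t merge_global(kv_list_t *group_out, int num_groups)
{
    kv_list_t acc = group_out[0];
    for (int g = 1; g < num_groups; g++)
        acc = merge_two(acc, group_out[g]);
    return acc;
}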

  11. Preliminary Results
  [Charts: Execution time on kmeans; Speedup on kmeans]

  12. Preliminary Results (contd.)
  [Charts: Execution time on pca; Speedup on pca]

  13. Summary of Results
  • Significantly improved scalability without sacrificing execution time
  • Utilizing off-chip threads is still an issue
    • Memory bandwidth seems to be a problem
  [Charts: Original Phoenix speedup; Current speedup]

  14. In Progress
  • Measure memory bandwidth
    • Check whether the memory subsystem is the bottleneck or not
  • MapReducing-MapReduce (sketched below)
    • Execute multiple MapReduce instances, one per locality group
    • Globally merge the final result
    • Minimizes off-chip memory accesses as much as possible
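A hedged sketch of the MapReducing-MapReduce idea: launch one independent MapReduce instance per locality group over that group's slice of the input, then merge the per-group outputs once at the end. bind_to_group(), mr_run(), merge_results(), and result_t are assumed hooks for illustration only, not an existing Phoenix API.

#include <pthread.h>
#include <stddef.h>

#define NUM_GROUPS 4

typedef struct { void *keys_vals; size_t count; } result_t;

/* Assumed hooks, elided here: pin a thread (and its allocations) to a chip,
 * run a whole MapReduce instance, and do one global merge of results. */
void     bind_to_group(int group);
result_t mr_run(const void *input, size_t len);
result_t merge_results(result_t *parts, int n);

typedef struct { int group; const void *input; size_t len; result_t out; } job_t;

static void *run_group_instance(void *arg)
{
    job_t *job = arg;
    bind_to_group(job->group);              /* keep every phase on one chip */
    job->out = mr_run(job->input, job->len);
    return NULL;
}

/* Split the input by locality group, run the instances in parallel,
 * and pay for off-chip traffic only in the final merge. */
result_t mr_of_mr(job_t jobs[NUM_GROUPS])
{
    pthread_t tid[NUM_GROUPS];
    result_t parts[NUM_GROUPS];

    for (int g = 0; g < NUM_GROUPS; g++)
        pthread_create(&tid[g], NULL, run_group_instance, &jobs[g]);
    for (int g = 0; g < NUM_GROUPS; g++) {
        pthread_join(tid[g], NULL);
        parts[g] = jobs[g].out;
    }
    return merge_results(parts, NUM_GROUPS);
}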

  15. Questions?

  16. Memory Bandwidth Bottleneck
  • For word_count, realized L2 throughput caps at 64 threads
  [Charts: Speedup on word_count; System-wide L2D load misses per microsecond]
