1 / 37

Condensing the cloud: running C IEL on many-core

This research paper explores the suitability of running CIEL, a dynamic task graph system, on many-core architectures. The study investigates the performance and challenges of fine-grained tasks on different hardware configurations, including commodity clusters and experimental processors.

Download Presentation

Condensing the cloud: running C IEL on many-core

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Condensing the cloud: running CIEL on many-core G T M M R R Malte Schwarzkopf Derek G. Murray Steven Hand University of Cambridge a x b

  2. Task graph Commodity cluster A B G M M M R R R

  3. Elastic resources Fault tolerance Sequential code ! Deterministic parallelism

  4. A B G M M M R R R

  5. Suitable for many-core? Many-core workers in a cluster?

  6. CIEL: dynamic task graphs • Allow tasks to spawn more tasks G a M M T x b R R

  7. masterprocess WP0 WP1 WP47 User code worker processes CIEL ... Obj. store ... Hardware ... Hardware

  8. Test environments 4x AMD Opteron 6168 „Magny-Cours“ • dodeca-core • 1.9 GHz (64-bit), 512 KB L2, 12 MB L3 • 64 GB DDR3 @ 1333 MHz [1] Intel Single-Chip Cloud (SCC) • tetracontakaiocta-core • 533 MHz (32-bit P54C), 16 KB L1, 256 KB L2 • 64 GB DDR3 @ 800 MHz Experimental! [2] [1] Image from http://en.wikipedia.org/wiki/File:Opteron_logo.png [2] Image from http://www.intel.com/communities/pix/marc/scc-v-wafer2.jpg

  9. relative overhead M r = t timespin µ-benchmark n tasks S1 S2 Sn T T’ M t …

  10. timespin on unmodified CIEL 41.6x rel. overhead 5.1x less is better 1.3x seconds 1.04x number of cores

  11. timespin, using lighttpd 2.03x rel. overhead 1.59x 1.1x less is better seconds 1.05x number of cores

  12. masterprocess WP0 WP1 WP47 User code worker processes CIEL ... Obj. store Obj. store ... ... Hardware

  13. timespin, shared object store 1.6x rel. overhead 1.3x less is better 1.06x seconds 1.03x number of cores

  14. masterprocess WP0 WP1 WP47 User code worker processes worker process CIEL ... ... 47 worker threads Object store ... Hardware

  15. timespin, using multi-worker 4.5x time (s) 1.27x 1.51x less is better seconds 1.05x number of cores

  16. masterprocess WP0 WP1 WP47 User code worker processes worker process CIEL ... ... ... 47 worker threads Object. store ... Hardware

  17. timespin, local IPC mode 1.4x rel. overhead 1.3x 1.04x less is better seconds 1.03x number of cores

  18. Intel Single-Chip Cloud L2$1 256 KB P54C Core 1 (IA-32) MC0 MC3 Router MPB (16 KB) MC1 MC2 L2$0 256 KB P54C Core 0 (IA-32)

  19. timespin on the SCC 47.7x ? 5.1x 5.7x 1.4x

  20. Message latency time (s) less is better 7900x 47x

  21. Challenges and Opportunities Contention vs. sharing I/O multiplexing Fault tolerance

  22. Challenges and Opportunities Contention vs. sharing I/O multiplexing Fault tolerance

  23. Conclusions and summary • Investigated performance of the CIEL many-core • Works unmodified, but fine-grained tasks suffer • Started to address various challenges • Next: multi-scale version for hybrid clusters http://www.cl.cam.ac.uk/netos/ciel/

  24. A M B G M A M G B M M M M R R R R R R

  25. User code User code worker processes masterprocess masterprocess CIEL worker process CIEL WP0 WP1 WPn ... ... worker threads Object store Object store Hardware Hardware User code User code worker processes masterprocess CIEL CIEL single process masterthread WP0 WP1 WPn ... ... worker threads ... Obj. store Object store Hardware Hardware

  26. 1) Suitable for many-core? • Many-core workers ina cluster.

  27. [1] [2] [3] [4] [1] Image from http://threadingbuildingblocks.org/images/tbb30.png [3] Image from http://www.mpi-forum.org/tshirt.jpg [2] Image from http://upload.wikimedia.org/wikipedia/en/2/27/Openmp.png [4] From http://mapreduce.stanford.edu

  28. CIEL’s universal execution model A B G M M M + Dynamism R R R

  29. Universal execution model MapReduce  Dryad  CIEL BipartiteMap/Reduce DAG Arbitrary static DAGs Dynamic task graphs

  30. AMD Opteron “Magny Cours” C0 C1 C2 C3 C4 C5 C0 C1 C2 C3 C4 C5 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 System Request Interface System Request Interface L3 tag L3 data (6 MB) L3 tag L3 data (6 MB) Crossbar Crossbar Memory controller Probe filter/ directory Memory controller Probe filter/ directory 4 HyperTransport3ports DRAM DRAM DRAM DRAM

  31. Binomial options pricing

  32. Binomial options pricing B B B B B B

  33. Binomial options pricing 800k (EC2) 800k (MC) 400k (EC2) 400k (MC) 200k (EC2) higheris better 200k (MC)

  34. Test configurations Master Worker Worker Worker

  35. Backup slide with EC2 timespin graph

  36. Backup slide on virtualization?

More Related