370 likes | 391 Views
This research paper explores the suitability of running CIEL, a dynamic task graph system, on many-core architectures. The study investigates the performance and challenges of fine-grained tasks on different hardware configurations, including commodity clusters and experimental processors.
E N D
Condensing the cloud: running CIEL on many-core G T M M R R Malte Schwarzkopf Derek G. Murray Steven Hand University of Cambridge a x b
Task graph Commodity cluster A B G M M M R R R
Elastic resources Fault tolerance Sequential code ! Deterministic parallelism
A B G M M M R R R
Suitable for many-core? Many-core workers in a cluster?
CIEL: dynamic task graphs • Allow tasks to spawn more tasks G a M M T x b R R
masterprocess WP0 WP1 WP47 User code worker processes CIEL ... Obj. store ... Hardware ... Hardware
Test environments 4x AMD Opteron 6168 „Magny-Cours“ • dodeca-core • 1.9 GHz (64-bit), 512 KB L2, 12 MB L3 • 64 GB DDR3 @ 1333 MHz [1] Intel Single-Chip Cloud (SCC) • tetracontakaiocta-core • 533 MHz (32-bit P54C), 16 KB L1, 256 KB L2 • 64 GB DDR3 @ 800 MHz Experimental! [2] [1] Image from http://en.wikipedia.org/wiki/File:Opteron_logo.png [2] Image from http://www.intel.com/communities/pix/marc/scc-v-wafer2.jpg
relative overhead M r = t timespin µ-benchmark n tasks S1 S2 Sn T T’ M t …
timespin on unmodified CIEL 41.6x rel. overhead 5.1x less is better 1.3x seconds 1.04x number of cores
timespin, using lighttpd 2.03x rel. overhead 1.59x 1.1x less is better seconds 1.05x number of cores
masterprocess WP0 WP1 WP47 User code worker processes CIEL ... Obj. store Obj. store ... ... Hardware
timespin, shared object store 1.6x rel. overhead 1.3x less is better 1.06x seconds 1.03x number of cores
masterprocess WP0 WP1 WP47 User code worker processes worker process CIEL ... ... 47 worker threads Object store ... Hardware
timespin, using multi-worker 4.5x time (s) 1.27x 1.51x less is better seconds 1.05x number of cores
masterprocess WP0 WP1 WP47 User code worker processes worker process CIEL ... ... ... 47 worker threads Object. store ... Hardware
timespin, local IPC mode 1.4x rel. overhead 1.3x 1.04x less is better seconds 1.03x number of cores
Intel Single-Chip Cloud L2$1 256 KB P54C Core 1 (IA-32) MC0 MC3 Router MPB (16 KB) MC1 MC2 L2$0 256 KB P54C Core 0 (IA-32)
timespin on the SCC 47.7x ? 5.1x 5.7x 1.4x
Message latency time (s) less is better 7900x 47x
Challenges and Opportunities Contention vs. sharing I/O multiplexing Fault tolerance
Challenges and Opportunities Contention vs. sharing I/O multiplexing Fault tolerance
Conclusions and summary • Investigated performance of the CIEL many-core • Works unmodified, but fine-grained tasks suffer • Started to address various challenges • Next: multi-scale version for hybrid clusters http://www.cl.cam.ac.uk/netos/ciel/
A M B G M A M G B M M M M R R R R R R
User code User code worker processes masterprocess masterprocess CIEL worker process CIEL WP0 WP1 WPn ... ... worker threads Object store Object store Hardware Hardware User code User code worker processes masterprocess CIEL CIEL single process masterthread WP0 WP1 WPn ... ... worker threads ... Obj. store Object store Hardware Hardware
1) Suitable for many-core? • Many-core workers ina cluster.
[1] [2] [3] [4] [1] Image from http://threadingbuildingblocks.org/images/tbb30.png [3] Image from http://www.mpi-forum.org/tshirt.jpg [2] Image from http://upload.wikimedia.org/wikipedia/en/2/27/Openmp.png [4] From http://mapreduce.stanford.edu
CIEL’s universal execution model A B G M M M + Dynamism R R R
Universal execution model MapReduce Dryad CIEL BipartiteMap/Reduce DAG Arbitrary static DAGs Dynamic task graphs
AMD Opteron “Magny Cours” C0 C1 C2 C3 C4 C5 C0 C1 C2 C3 C4 C5 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 System Request Interface System Request Interface L3 tag L3 data (6 MB) L3 tag L3 data (6 MB) Crossbar Crossbar Memory controller Probe filter/ directory Memory controller Probe filter/ directory 4 HyperTransport3ports DRAM DRAM DRAM DRAM
Binomial options pricing B B B B B B
Binomial options pricing 800k (EC2) 800k (MC) 400k (EC2) 400k (MC) 200k (EC2) higheris better 200k (MC)
Test configurations Master Worker Worker Worker