Cores, cores, everywhere

Based on joint work with Martín Abadi, Andrew Baumann, Paul Barham, Richard Black, Vladimir Gajinov, Orion Hodson, Rebecca Isaacs, Ross McIlroy, Simon Peter, Vijayan Prabhakaran, Timothy Roscoe, Adrian Schüpbach, Akhilesh Singhania

• Two hardware trends
• Barrelfish operating system
• Message-passing software
• Managing parallel work

Amdahl’s law “Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with 128 cores, how many cores do you need to use to get a 4x speed-up on the overall program?”

Amdahl’s law, f=70%
[Plot: speedup achieved (perfect scaling on 70%) compared with the desired 4x speedup]
Limit as c→∞ = 1/(1-f) = 3.33
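The question on the Amdahl’s law slide can be worked through directly. A sketch, writing f for the fraction of sequential run time that scales perfectly and c for the number of cores used (both as on the slide); S is a name introduced here for the overall speed-up:

```latex
S(c) = \frac{1}{(1-f) + \dfrac{f}{c}},
\qquad
\lim_{c\to\infty} S(c) = \frac{1}{1-f}.

% With f = 0.7:
S(128) = \frac{1}{0.3 + 0.7/128} \approx 3.27,
\qquad
\lim_{c\to\infty} S(c) = \frac{1}{0.3} \approx 3.33 < 4.
```

No number of cores gives the 4x speed-up: the 30% that stays sequential caps the whole program at about 3.33x, and all 128 cores reach about 3.27x.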

Amdahl’s law, f=10%
[Plot: speedup achieved with perfect scaling]
Amdahl’s law limit, just 1.11x

Amdahl’s law, f=98%

Amdahl’s law & multi-core
Suppose that the same h/w budget (space or power) can make us:
• 1 big core
• 4 medium cores
• 16 small cores
(analysis from Hill & Marty, “Amdahl’s law in the multicore era”)

Perf of big & small cores
Assumption: perf = α√resource
• 1 big core: total perf 1 * 1 = 1
• 16 small cores: total perf 16 * 1/4 = 4
(analysis from Hill & Marty, “Amdahl’s law in the multicore era”)

Amdahl’s law, f=98%
[Plot: speedup curves for 16 small, 4 medium, and 1 big core]
(analysis from Hill & Marty, “Amdahl’s law in the multicore era”)

Amdahl’s law, f=75%
[Plot: speedup curves for 1 big, 4 medium, and 16 small cores]
(analysis from Hill & Marty, “Amdahl’s law in the multicore era”)
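The trade-off between the three symmetric designs can be reproduced by hand. A sketch under stated assumptions: α = 1, the single big core has perf 1, each of k equal cores gets 1/k of the budget and so has perf p = 1/√k, the sequential part runs on one core, and the parallel part uses all k cores with no overhead. The name S_sym is introduced here for illustration.

```latex
S_{\mathrm{sym}}(f,k) = \frac{1}{\dfrac{1-f}{p} + \dfrac{f}{k\,p}},
\qquad p = \frac{1}{\sqrt{k}}.

% f = 0.98
S_{\mathrm{sym}}(0.98, 16) = \frac{1}{0.08 + 0.245} \approx 3.08, \quad
S_{\mathrm{sym}}(0.98, 4)  = \frac{1}{0.04 + 0.49}  \approx 1.89, \quad
S_{\mathrm{sym}}(0.98, 1)  = 1.

% f = 0.75
S_{\mathrm{sym}}(0.75, 16) = \frac{1}{1 + 0.1875}   \approx 0.84, \quad
S_{\mathrm{sym}}(0.75, 4)  = \frac{1}{0.5 + 0.375}  \approx 1.14, \quad
S_{\mathrm{sym}}(0.75, 1)  = 1.
```

Under these assumptions, 16 small cores win clearly at f = 98%, but at f = 75% they are slower than a single big core, because the sequential quarter of the program runs on a core with only a quarter of the big core's performance.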

Asymmetric chips
[Diagram: the same budget split into 1 larger core, taking the area of 4 small cores, plus 12 small cores]

Amdahl’s law, f=75%
[Plot: speedup curves for 1+12, 4 medium, 1 big, and 16 small cores]
(analysis from Hill & Marty, “Amdahl’s law in the multicore era”)
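The advantage of the asymmetric design can be checked with the same perf = α√resource model. A sketch, assuming that "1+12" means one core with 4/16 of the budget (perf 1/2 with α = 1) plus 12 cores with 1/16 each (perf 1/4), that the sequential part runs on the larger core, and that the parallel part uses all 13 cores with no overhead. The name S_asym is introduced here for illustration.

```latex
S_{\mathrm{asym}}(f) =
  \frac{1}{\dfrac{1-f}{1/2} + \dfrac{f}{1/2 + 12\cdot 1/4}}
  = \frac{1}{2(1-f) + \dfrac{f}{3.5}}.

% f = 0.75
S_{\mathrm{asym}}(0.75) = \frac{1}{0.5 + 0.2143} \approx 1.40.
```

Under the same assumptions the symmetric designs give about 1.14 (4 medium), 1 (1 big), and 0.84 (16 small) at f = 75%, so the asymmetric chip comes out ahead.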

Two hardware trends, starting from traditional multi-processor machines:
• Asymmetric performance and/or instruction sets

Cache-coherent multicore
[Diagram: 24 cores, each with its own L2; 4 shared L3 caches; 8 RAM banks]
AMD Istanbul: 6 cores, per-core L2, per-package L3

Single-chip cloud computer (SCC)
[Diagram: a tile with two cores, each with its own L2, sharing a router and an MPB; memory controllers MC-0, MC-1, MC-3, MC-4 attached to RAM; VRC; system interface]
• Non-coherent caches
• Hardware supported messaging
• 24 * 2-core tiles
• On-chip mesh n/w

MSR Beehive
[Diagram: RISC cores 1 to N on a ring, with a MemMux module, messages and locks, a DDR controller attached to RAM, and a display controller]
• Ring interconnect
• Message passing in h/w
• No cache coherence
• Split-phase memory access

Two hardware trends, starting from traditional multi-processor machines:
• Asymmetric performance and/or instruction sets
• Non-cache-coherent access to memory

Messaging vs shared data as default
• Fundamental model is message based
• “It’s better to have shared memory and not need it than to need shared memory and not have it”
[Spectrum, from traditional operating systems to the Barrelfish multikernel:]
• Shared state, one-big-lock
• Fine-grained locking
• Clustered objects, partitioning
• Distributed state, replica maintenance

The Barrelfish multi-kernel OS
[Diagram: four apps, each above an OS node that holds a state replica; the OS nodes communicate by message passing and run on x64, x64, ARM, and accelerator cores joined by a hardware interconnect]
• User-mode programs: several models supported, including conventional shared-memory OpenMP & pthreads
• System components, each local to a specific core, and using message passing
• System runs on heterogeneous hardware, currently supporting ARM, Beehive, SCC, x86 & x64

Shared Resource Database Consensus
Two-Phase Commit

    bool updatePermissions(page_t page, flags_t flags) {
        bool ok = true;
        // Voting phase: blocking RPC before sending to next core,
        // ~400 cycles assuming process is scheduled on other core!
        for (core in cores)
            ok &= permUpdateRequest_rpc(core, page, flags);
        // Commit phase
        if (ok) {
            localUpdatePermissions(page, flags);
            for (core in cores)
                permUpdateCommit_send(core, page, flags);
        } else {
            for (core in cores)
                permUpdateAbort_send(core, page, flags);
        }
        return ok;
    }
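The control flow of the synchronous version can be tried out in plain C by replacing remote cores with local records. This is a minimal sketch, not Barrelfish code: `participant_t`, `vote_request`, and `two_phase_commit` are hypothetical names, and a function call stands in for the blocking RPC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a remote core: it only records its vote and
 * what it was told. A real system would exchange messages instead. */
typedef struct {
    bool will_accept;   /* vote this participant returns in the voting phase */
    int  commits;       /* number of commit notifications received */
    int  aborts;        /* number of abort notifications received */
} participant_t;

/* Voting-phase request: returns only once the participant has voted,
 * which models the blocking RPC. */
bool vote_request(const participant_t *p)
{
    return p->will_accept;
}

/* Two-phase commit over n participants; returns true if committed. */
bool two_phase_commit(participant_t *parts, size_t n)
{
    assert(parts != NULL || n == 0);
    bool ok = true;

    /* Voting phase: ask every participant, one blocking call at a time. */
    for (size_t i = 0; i < n; i++)
        ok = vote_request(&parts[i]) && ok;

    /* Commit phase: tell every participant the outcome. */
    for (size_t i = 0; i < n; i++) {
        if (ok)
            parts[i].commits++;
        else
            parts[i].aborts++;
    }
    return ok;
}
```

With three participants that all accept, `two_phase_commit` returns true and each records one commit; if any one refuses, it returns false and each records one abort. The voting loop is where the cost lands: every iteration waits for a reply before the next request goes out.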

Shared Resource Database Consensus
Stack-Ripped

    // The outcome is now known only in recvReply, so nothing is returned here.
    void updatePermissions(page_t page, flags_t flags) {
        state_t *st = malloc(sizeof(state_t));
        st->ok = true;
        st->page = page;
        st->flags = flags;
        st->count = 0;
        for (core in cores) {
            // Can fail to send immediately (e.g., due to full channel):
            // need to stack-rip here
            permUpdateRequest_send(core, page, flags, st);
            st->count++;
        }
    }

    void recvReply(state_t *st, bool ok) {
        st->ok &= ok;
        if (--st->count == 0) {    // last outstanding reply
            if (st->ok) {
                localUpdatePermissions(st->page, st->flags);
                for (core in cores)
                    permUpdateCommit_send(core, st->page, st->flags);  // and here
            } else {
                for (core in cores)
                    permUpdateAbort_send(core, st->page, st->flags);   // and here…
            }
            free(st);
        }
    }
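The bookkeeping that a stack-ripped handler needs can be isolated as a counter of outstanding replies held in heap or caller-owned state. A minimal sketch in plain C: `op_state_t`, `op_start`, and `op_reply` are hypothetical names, sending the requests is left out, and replies are assumed to be delivered one at a time by an event loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-operation state that survives between callbacks. */
typedef struct {
    bool   ok;         /* conjunction of the votes seen so far */
    size_t pending;    /* replies still outstanding */
    bool   done;       /* set once the last reply has been handled */
    bool   committed;  /* outcome, valid only when done is true */
} op_state_t;

/* First half of the ripped function: record how many replies to expect,
 * then return to the event loop without waiting. */
void op_start(op_state_t *st, size_t n_participants)
{
    assert(st != NULL && n_participants > 0);
    st->ok = true;
    st->pending = n_participants;
    st->done = false;
    st->committed = false;
}

/* Second half: called once per reply. Returns true when this reply was
 * the last one and the outcome has been decided. */
bool op_reply(op_state_t *st, bool vote)
{
    assert(st != NULL && st->pending > 0);
    st->ok = st->ok && vote;
    if (--st->pending == 0) {   /* pre-decrement: fires on the final reply */
        st->committed = st->ok;
        st->done = true;
    }
    return st->done;
}
```

After `op_start(&st, 3)`, the first two calls to `op_reply` return false and the third returns true, with `st.committed` equal to the conjunction of the three votes. Everything that was a local variable in the synchronous version has had to move into `op_state_t`, which is the programming cost the slide points at.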

AC: Asynchronous C
• Synchronous: easy to program, poor performance
• Event-driven: difficult to program, good performance
• AC: similar programming model to synchronous, similar performance to event-driven

Shared Resource Database Consensus
AC

    bool updatePermissions(page_t page, flags_t flags) {
        bool ok = true;
        do {
            for (core in cores)
                // async identifies code that can block;
                // execution can continue after the async.
                // permUpdateRequest_AC is the AC version of the message RPC.
                async { ok &= permUpdateRequest_AC(core, page, flags); }
        } finish;  // don't pass finish until all async work created in
                   // the do {} finish block has completed
        if (ok) {
            localUpdatePermissions(page, flags);
            for (core in cores)
                permUpdateCommit_send(core, page, flags);
        } else {
            for (core in cores)
                permUpdateAbort_send(core, page, flags);
        }
        return ok;
    }

Shared Resource Database Consensus
[Plot comparing the AC, event-driven, and synchronous versions]

Performance
Ping-pong test, minimum-sized messages
• AMD 4 * 4-core machine
• Using cores sharing L3 cache

Performance • “Do not fear async” • Think about correctness: if the callee doesn’t block then perf is basically unchanged

Adding Parallelism

    do {
        async msg_send(core_1, "Computing Forces");
        // Spawn a bunch of parallel tasks that can be run across multiple cores
        parfluidAnimate(computeForces, cells, range);
    } finish;  // Wait for parallel and async tasks to complete before continuing

FluidAnimate
• for each frame
  • move particles to correct cell
  • calculate cell density
  • calculate particle forces
  • calculate particle positions
  • render frame

Static Partitioning
Problems:
• Uneven workload
• Barrier synchronization
• Thread preemption
Approach taken by (e.g.) OpenMP and Intel Parallel Building Blocks. They assume you own the machine and know your workload.
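Static partitioning amounts to fixing each worker's share of the cells before the step starts. A minimal sketch of a balanced block split in plain C; `range_t` and `static_chunk` are hypothetical names, not taken from OpenMP or from the FluidAnimate source.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of items owned by one worker. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Split n_items into n_workers contiguous chunks whose sizes differ by
 * at most one; the first (n_items % n_workers) workers get the extra item. */
range_t static_chunk(size_t n_items, size_t n_workers, size_t worker)
{
    assert(n_workers > 0 && worker < n_workers);
    size_t base  = n_items / n_workers;
    size_t extra = n_items % n_workers;
    range_t r;
    r.begin = worker * base + (worker < extra ? worker : extra);
    r.end   = r.begin + base + (worker < extra ? 1 : 0);
    return r;
}
```

For 10 cells and 4 workers this gives [0, 3), [3, 6), [6, 8), and [8, 10). The split balances the number of cells, not the work: if the particles cluster in a few cells, the worker that owns them finishes last and the others wait for it at the barrier.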

Dynamic Partitioning (Work-Stealing)
Problems:
• Spawn / sync overhead
  • Cilk-5: 218 cycles per task
  • Wool (old version): 97 cycles per task
  • Density calculation task: ~10 cycles per particle
• Cache locality
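The spawn-overhead figures suggest a back-of-the-envelope grain size. At about 10 cycles of work per particle, a one-particle task costs roughly 10 to 20 times more to spawn (97 to 218 cycles) than to run. A minimal sketch of the arithmetic in plain C; `min_items_per_task` is a hypothetical helper, and the target of 10 cycles of work per cycle of overhead is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest number of items per task such that the useful work in the
 * task is at least work_to_overhead times the spawn cost.
 * All costs are in cycles; the result is rounded up. */
size_t min_items_per_task(size_t spawn_cycles,
                          size_t cycles_per_item,
                          size_t work_to_overhead)
{
    assert(cycles_per_item > 0);
    size_t needed = spawn_cycles * work_to_overhead;  /* useful cycles required */
    return (needed + cycles_per_item - 1) / cycles_per_item;
}
```

With the slide's numbers and the assumed 10:1 target, `min_items_per_task(97, 10, 10)` is 97 particles per task and `min_items_per_task(218, 10, 10)` is 218. Coarser tasks amortize the spawn cost but leave fewer tasks to steal, so the grain size trades overhead against load balance.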
