Introduction to Many-Core Architectures

Introduction to Many-Core Architectures
Henk Corporaal
www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems
Soesterberg, March 2010

Trends (K. Olukotun)
[Figure: Intel processor trends; labels: Core i7, 3 GHz, 100 W]

System-level integration (Chuck Moore, AMD, at MICRO 2008)
• Single-chip CPU Era: 1986-2004
  • Extreme focus on single-threaded performance
  • Multi-issue, out-of-order execution plus moderate cache hierarchy
• Chip Multiprocessor (CMP) Era: 2004-2010
  • Early: hasty integration of multiple cores into same chip/package
  • Mid-life: address some of the HW scalability and interference issues
  • Current: homogeneous CPUs plus moderate system-level functionality
• System-level Integration Era: ~2010 onward
  • Integration of substantial system-level functionality
  • Heterogeneous processors and accelerators
  • Introspective control systems for managing on-chip resources & events

Why many core?
• Running into
  • Frequency wall
  • ILP wall
  • Memory wall
  • Energy wall
• Chip area enabler: Moore's law goes well below 22 nm
  • What to do with all this area?
  • Multiple processors fit easily on a single die
• Application demands
• Cost effective (just connect existing processors or processor cores)
• Low power: parallelism may allow lowering Vdd
• Performance/Watt is the new metric!

Low power through parallelism
• Sequential processor
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • $P_1 = f C V^2$
• Parallel processor (two times the number of units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V' < V
  • $P_2 = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2 < P_1$
[Figure: a single CPU next to two parallel units, CPU1 and CPU2]
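
The power comparison on this slide can be checked numerically. The sketch below is only an illustration: the frequency, capacitance, and the two voltages are assumed values, not numbers from the slides. It shows that halving the frequency while doubling the switched capacitance leaves f*C unchanged, so the saving comes entirely from the lower supply voltage.

```python
def dynamic_power(f, c, v):
    """Dynamic switching power P = f * C * V^2 (formula from the slide)."""
    return f * c * v ** 2

# Assumed illustrative values (not from the slides).
f = 1.0e9      # clock frequency in Hz
c = 1.0e-9     # switched capacitance in F
v = 1.2        # supply voltage of the sequential processor in V
v_low = 0.9    # assumed reduced supply voltage V' of the parallel processor

p1 = dynamic_power(f, c, v)              # sequential: f, C, V
p2 = dynamic_power(f / 2, 2 * c, v_low)  # parallel: f/2, 2C, V'

# The ratio P2/P1 reduces to (V'/V)^2 because f*C is the same in both cases.
ratio = p2 / p1
print(f"P1 = {p1:.3f} W, P2 = {p2:.3f} W, P2/P1 = {ratio:.4f}")
```

With these assumed voltages the parallel version uses a little over half the power at the same throughput, provided the workload parallelizes perfectly over the two units.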

How low Vdd can we go?
• Subthreshold JPEG encoder
• Vdd 0.4 to 1.2 Volt
[Figure: encoder with four parallel engines]

Computational efficiency: how many MOPS/Watt? (Yifan He e.a., DAC 2010)

Computational efficiency: what do we need? (Woh e.a., ISCA 2009)
[Figure: performance (Gops) versus power (Watts, 0.1 to 100), with lines of constant power efficiency in Mops/mW. Processors plotted: IBM Cell, SODA (90 nm), SODA (65 nm), Imagine, VIRAM, Pentium M, TI C6X. Application requirements plotted: 3G Wireless, 4G Wireless, Mobile HD Video.]

Intel's opinion: 48-core x86

Outline
• Classifications of Parallel Architectures
• Examples
  • Various (research) architectures
  • GPUs
  • Cell
  • Intel multi-cores
• How much performance do you really get? Roofline model
• Trends & Conclusions

Classifications
• Performance / parallelism driven:
  • 4-5 D
  • Flynn
• Communication & Memory
  • Message passing / Shared memory
  • Shared memory issues: coherency, consistency, synchronization
• Interconnect

Flynn's Taxonomy
• SISD (Single Instruction, Single Data)
  • Uniprocessors
• SIMD (Single Instruction, Multiple Data)
  • Vector architectures also belong to this class
  • Multimedia extensions (MMX, SSE, VIS, AltiVec, …)
  • Examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, …
• MISD (Multiple Instruction, Single Data)
  • Systolic arrays / stream based processing
• MIMD (Multiple Instruction, Multiple Data)
  • Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin
  • Flexible
  • Most widely used

[Figure: Flynn's Taxonomy]

Enhance performance: 4 architecture methods
• (Super)-pipelining
• Powerful instructions
  • MD-technique: multiple data operands per operation
  • MO-technique: multiple operations per instruction
• Multiple instruction issue
  • Single stream: Superscalar
  • Multiple streams
    • Single core, multiple threads: Simultaneous Multi-Threading
    • Multiple cores

Architecture methods: Pipelined Execution of Instructions
• Purpose of pipelining:
  • Reduce #gate_levels in critical path
  • Reduce CPI close to one (instead of a large number for the multicycle machine)
  • More efficient hardware
• Problems
  • Hazards: pipeline stalls
    • Structural hazards: add more hardware
    • Control hazards, branch penalties: use branch prediction
    • Data hazards: bypassing required
Pipeline stages:
  IF: Instruction Fetch
  DC: Instruction Decode
  RF: Register Fetch
  EX: Execute instruction
  WB: Write Result Register
[Figure: simple 5-stage pipeline; instructions 1 to 4 overlapped across cycles 1 to 8, each passing through IF, DC, RF, EX, WB]

Architecture methods: Pipelined Execution of Instructions
• Superpipelining:
  • Split one or more of the critical pipeline stages
• Superpipelining degree S:
  $S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
  where:
  f(Op) is the frequency of operation Op
  lt(Op) is the latency of operation Op
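
To make the definition of the superpipelining degree concrete, the following sketch evaluates the weighted sum for a small instruction mix. The operation classes, their frequencies, and their latencies are hypothetical values chosen only for illustration; they are not taken from the slides.

```python
def superpipelining_degree(mix):
    """Return S = sum over operations of f(Op) * lt(Op).

    `mix` maps an operation name to a (frequency, latency) pair, where the
    frequencies are fractions of the dynamic instruction count and the
    latencies are in cycles.
    """
    return sum(freq * latency for freq, latency in mix.values())

# Hypothetical instruction mix (assumed for illustration only).
example_mix = {
    "alu":    (0.50, 1),
    "load":   (0.20, 2),
    "store":  (0.10, 1),
    "branch": (0.15, 1),
    "mul":    (0.05, 3),
}

degree = superpipelining_degree(example_mix)
print(f"S = {degree:.2f}")
```

A machine in which every operation has a latency of one cycle gives S = 1; splitting critical stages raises the latency of some operations and therefore raises S.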

Architecture methods: Powerful Instructions (1)
• MD-technique
  • Multiple data operands per operation
  • SIMD: Single Instruction Multiple Data
Vector instruction:
  for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];
  or
  c = a + 5*b
Assembly:
  set   vl,64
  ldv   v1,0(r2)
  mulvi v2,v1,5
  ldv   v1,0(r1)
  addv  v3,v1,v2
  stv   v3,0(r3)
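
The correspondence between the scalar loop and the vector assembly can be sketched by emulating each vector instruction as an operation on a whole list. The helper names `mulvi` and `addv` mirror the mnemonics on the slide but are hypothetical Python functions, and the mapping of r1, r2, r3 to the arrays a, b, c is inferred from the order of the operations rather than stated in the source.

```python
VL = 64  # vector length, as set by "set vl,64"

def mulvi(vec, imm):
    """Emulate a vector-times-immediate multiply (hypothetical helper)."""
    return [x * imm for x in vec]

def addv(vec_a, vec_b):
    """Emulate an element-wise vector add (hypothetical helper)."""
    return [x + y for x, y in zip(vec_a, vec_b)]

# Assumed input data for illustration.
a = list(range(VL))
b = [2 * i for i in range(VL)]

# Scalar version: one add and one multiply per loop iteration.
c_scalar = [0] * VL
for i in range(VL):
    c_scalar[i] = a[i] + 5 * b[i]

# Vector version: each step handles all 64 elements at once.
v1 = b[:VL]          # ldv   v1,0(r2)   (r2 assumed to point at b)
v2 = mulvi(v1, 5)    # mulvi v2,v1,5
v1 = a[:VL]          # ldv   v1,0(r1)   (r1 assumed to point at a)
v3 = addv(v1, v2)    # addv  v3,v1,v2
c_vector = v3        # stv   v3,0(r3)   (r3 assumed to point at c)
```

Six vector instructions replace 64 loop iterations, which is the dense encoding that the MD-technique aims for.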

Architecture methods: Powerful Instructions (1)
• SIMD computing
  • All PEs (Processing Elements) execute same operation
  • Typical mesh or hypercube connectivity
  • Exploit data locality of e.g. image processing applications
  • Dense encoding (few instruction bits needed)
[Figure: SIMD execution method; over time, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are each executed by all of PE1, PE2, ..., PEn]

Architecture methods: Powerful Instructions (1)
• Sub-word parallelism
  • SIMD on restricted scale
  • Used for multi-media instructions
• Examples
  • MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II
  • Example: $\sum_{i=1}^{4} |a_i - b_i|$
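
Sub-word parallelism treats one machine word as several narrow lanes. The sketch below assumes the example on the slide is a sum of absolute differences over four 8-bit values packed into a 32-bit word; the function names `pack4` and `sad4` are hypothetical, and the lanes are processed one after another here, whereas a multimedia instruction would handle them in a single operation.

```python
def pack4(values):
    """Pack four unsigned 8-bit values into one 32-bit word (lane 0 = low byte)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for lane, value in enumerate(values):
        if not 0 <= value <= 255:
            raise ValueError("each value must fit in 8 bits")
        word |= value << (8 * lane)
    return word

def sad4(word_a, word_b):
    """Sum of absolute differences over the four byte lanes of two words."""
    total = 0
    for lane in range(4):
        x = (word_a >> (8 * lane)) & 0xFF
        y = (word_b >> (8 * lane)) & 0xFF
        total += abs(x - y)
    return total

wa = pack4([10, 200, 30, 40])
wb = pack4([12, 100, 30, 45])
result = sad4(wa, wb)  # |10-12| + |200-100| + |30-30| + |40-45|
```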

Architecture methods: Powerful Instructions (2)
• MO-technique: multiple operations per instruction
• Two options:
  • CISC (Complex Instruction Set Computer)
  • VLIW (Very Long Instruction Word)
VLIW instruction example (one field per functional unit):
  field   instruction
  FU 1    sub  r8, r5, 3
  FU 2    and  r1, r5, 12
  FU 3    mul  r6, r5, r2
  FU 4    ld   r3, 0(r5)
  FU 5    bnez r5, 13

VLIW architecture: central Register File
[Figure: one register file feeding Exec units 1 to 9, grouped under Issue slots 1 to 3]
Q: How many ports does the register file need for n-issue?

Architecture methods: Multiple instruction issue (per cycle)
• Who guarantees semantic correctness?
  • Can instructions be executed in parallel?
• User: specifies multiple instruction streams
  • Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: run-time detection of ready instructions
  • Superscalar
• Compiler: compile into dataflow representation
  • Dataflow processors

Four-dimensional representation of the architecture design space <I, O, D, S>
[Figure: four axes, Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', and Superpipelining degree 'S', with the positions of CISC, RISC, Superpipelined, Superscalar, VLIW, Vector, SIMD, MIMD, and Dataflow architectures]

Architecture design space
Example values of <I, O, D, S> for different architectures

Architecture     I     O     D     S     Mpar
CISC             0.2   1.2   1.1   1     0.26
RISC             1     1     1     1.2   1.2
VLIW             1     10    1     1.2   12
Superscalar      3     1     1     1.2   3.6
SIMD             1     1     128   1.2   154
MIMD             32    1     1     1.2   38
GPU              32    2     8     24    12288
Top500 Jaguar    ???

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
$Mpar = I \cdot O \cdot D \cdot S$
You should exploit this amount of parallelism!
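
The Mpar column follows directly from the product of the four dimensions. The short sketch below recomputes it from the <I, O, D, S> values listed on the slide; the dictionary name `design_space` and the function name `mpar` are hypothetical, and the Jaguar row is left out because the slide gives no values for it.

```python
# <I, O, D, S> values as listed on the slide.
design_space = {
    "CISC":        (0.2, 1.2, 1.1, 1),
    "RISC":        (1, 1, 1, 1.2),
    "VLIW":        (1, 10, 1, 1.2),
    "Superscalar": (3, 1, 1, 1.2),
    "SIMD":        (1, 1, 128, 1.2),
    "MIMD":        (32, 1, 1, 1.2),
    "GPU":         (32, 2, 8, 24),
}

def mpar(i, o, d, s):
    """Parallelism the machine expects the program to expose: I * O * D * S."""
    return i * o * d * s

table = {name: mpar(*dims) for name, dims in design_space.items()}

for name, value in table.items():
    print(f"{name:12s} Mpar = {value:g}")
```

The slide rounds the results, so SIMD (153.6) appears as 154 and MIMD (38.4) as 38.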

Communication
• Parallel architecture extends traditional computer architecture with a communication network
  • abstractions (HW/SW interface)
  • organizational structure to realize abstraction efficiently
[Figure: five processing nodes connected by a communication network]

Communication models: Shared Memory
• Coherence problem
• Memory consistency issue
• Synchronization problem
[Figure: Process P1 and Process P2 each read and write a common shared memory]

Communication models: Shared memory
• Shared address space
• Communication primitives:
  • load, store, atomic swap
• Two varieties:
  • Physically shared => Symmetric Multi-Processors (SMP)
    • usually combined with local caching
  • Physically distributed => Distributed Shared Memory (DSM)

SMP: Symmetric Multi-Processor
• Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: four processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network]

DSM: Distributed Shared Memory
• Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
[Figure labels: Processor, Cache, Memory (four of each), Interconnection Network, Main memory, I/O System]

Shared Address Model Summary
• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Memory hierarchy model applies:
  • communication moves data to local proc. cache

Three fundamental issues for shared memory multiprocessors
• Coherence, about: Do I see the most recent data?
• Consistency, about: When do I see a written value?
  • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
• Synchronization: How to synchronize processes?
  • how to protect access to shared data?
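
The synchronization question, how to protect access to shared data, can be illustrated with threads that update one shared counter. This is a minimal sketch using Python's standard `threading` module as a stand-in for a shared memory multiprocessor; the function name `run_counter` and the thread and iteration counts are assumptions for illustration. It demonstrates mutual exclusion only and says nothing about hardware coherence or consistency.

```python
import threading

def run_counter(n_threads=4, increments=1000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write of the shared variable is the critical
            # section; the lock makes it atomic with respect to other threads.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

total = run_counter()
```

With the lock in place the final count always equals the number of threads times the number of increments per thread.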

Communication models: Message Passing
• Communication primitives
  • e.g., send, receive library calls
  • standard MPI: Message Passing Interface
    • www.mpi-forum.org
• Note that MP can be built on top of SM and vice versa!
[Figure: Process P1 and Process P2 exchange messages with send and receive over FIFOs]

Message Passing Model
• Explicit message send and receive operations
• Send specifies local buffer + receiving process on remote computer
• Receive specifies sending process on remote computer + local buffer to place data
• Typically blocking communication, but may use DMA
Message structure: Header | Data | Trailer
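
Explicit send and receive operations can be sketched with two threads that communicate only through FIFO queues. The example uses Python's standard `queue.Queue` as the channel, which also illustrates message passing built on top of shared memory; the function name `ping_pong`, the doubling done by the second process, and the `None` stop message are hypothetical choices for illustration, not part of MPI.

```python
import queue
import threading

def ping_pong(values):
    """P1 sends each value to P2 and waits for P2's reply (blocking)."""
    to_p2 = queue.Queue()   # FIFO channel P1 -> P2
    to_p1 = queue.Queue()   # FIFO channel P2 -> P1

    def process_p2():
        while True:
            msg = to_p2.get()        # blocking receive
            if msg is None:          # assumed stop message
                break
            to_p1.put(msg * 2)       # send the reply

    p2 = threading.Thread(target=process_p2)
    p2.start()

    replies = []
    for value in values:
        to_p2.put(value)             # send
        replies.append(to_p1.get())  # blocking receive of the reply

    to_p2.put(None)                  # tell P2 to terminate
    p2.join()
    return replies

answers = ping_pong([1, 2, 3])
```

Each blocking receive also synchronizes the two processes, which is the implicit synchronization that message passing provides.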

Message passing communication
[Figure: four nodes, each with processor, cache, memory, DMA, and network interface, connected by an interconnection network]

Communication Models: Comparison
• Shared Memory:
  • Compatibility with well-understood language mechanisms
  • Ease of programming for complex or dynamic communication patterns
  • Shared-memory applications; sharing of large data structures
  • Efficient for small items
  • Supports hardware caching
• Message Passing:
  • Simpler hardware
  • Explicit communication
  • Implicit synchronization (with any communication)
Battle ongoing

Interconnect
• How to connect your cores?
• Some options:
  • Connect everybody:
    • Single bus
    • Hierarchical bus
    • NoC
      • multi-hop via routers
      • any topology possible
      • easy 2D layout helps
  • Connect with e.g. neighbors only
    • e.g. using shift operation in SIMD
    • or using dual-ported mems to connect 2 cores

Bus (shared) or Network (switched)
• Network:
  • claimed to be more scalable
  • no bus arbitration
  • point-to-point connections
  • but router overhead
Example: NoC with 2x4 mesh routing network
[Figure: two rows of four nodes, each node attached to a router R]

Historical Perspective
• Early machines were:
  • Collection of microprocessors
  • Communication was performed using bi-directional queues between nearest neighbors
• Messages were forwarded by processors on path
  • “Store and forward” networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops => minimize time

Design Characteristics of a Network
• Topology (how things are connected):
  • Crossbar, ring, 2-D and 3-D meshes or torus, hypercube, tree, butterfly, perfect shuffle, ...
• Routing algorithm (path used):
  • Example in 2D torus: all east-west then all north-south (avoids deadlock)
• Switching strategy:
  • Circuit switching: full path reserved for entire message, like the telephone
  • Packet switching: message broken into separately-routed packets, like the post office
• Flow control and buffering (what if there is congestion):
  • Stall, store data temporarily in buffers
  • re-route data to other nodes
  • tell source node to temporarily halt, discard, etc.
• QoS guarantees, error handling, etc.
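
The routing example, all east-west hops first and then all north-south hops, is commonly called dimension-order routing. The sketch below applies it to a 2D mesh without wrap-around links, which is a simplification of the torus mentioned on the slide; the function name `xy_route` and the (x, y) coordinate convention are assumptions for illustration.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst.

    The route first travels along the x dimension (east-west) until the
    destination column is reached, then along the y dimension (north-south).
    """
    x, y = src
    path = [(x, y)]

    step = 1 if dst[0] > x else -1
    while x != dst[0]:           # east-west phase
        x += step
        path.append((x, y))

    step = 1 if dst[1] > y else -1
    while y != dst[1]:           # north-south phase
        y += step
        path.append((x, y))

    return path

route = xy_route((0, 0), (2, 1))
hops = len(route) - 1
```

Because every message finishes its east-west movement before turning, the path between two given nodes is fixed, and the hop count equals the sum of the distances in x and y.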

Switch / Network Topology
• Topology determines:
  • Degree: number of links from a node
  • Diameter: max number of links crossed between nodes
  • Average distance: number of links to random destination
  • Bisection: minimum number of links that separate the network into two halves
  • Bisection bandwidth = link bandwidth * bisection

Bisection Bandwidth
• Bisection bandwidth: bandwidth across smallest cut that divides network into two equal halves
• Bandwidth across “narrowest” part of the network
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
[Figure: two example networks, one with bisection bw = link bw and one with bisection bw = sqrt(n) * link bw; a cut that does not divide the network into two equal halves is marked as not a bisection cut]

Common Topologies

Type        Degree    Diameter           Ave Dist         Bisection
1D mesh     2         N-1                N/3              1
2D mesh     4         2(N^(1/2) - 1)     2N^(1/2) / 3     N^(1/2)
3D mesh     6         3(N^(1/3) - 1)     3N^(1/3) / 3     N^(2/3)
nD mesh     2n        n(N^(1/n) - 1)     nN^(1/n) / 3     N^((n-1)/n)
Ring        2         N/2                N/4              2
2D torus    4         N^(1/2)            N^(1/2) / 2      2N^(1/2)
Hypercube   log2 N    n = log2 N         n/2              N/2
2D Tree     3         2 log2 N           ~2 log2 N        1
Crossbar    N-1       1                  1                N^2/2

N = number of nodes, n = dimension
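
Some of the diameter entries in the table can be checked by building a small network and measuring the longest shortest path with breadth-first search. The sketch below does this for a ring, a 2D mesh, and a hypercube; the builder functions `ring`, `mesh2d`, and `hypercube` and the chosen sizes are hypothetical and serve only as illustration.

```python
from collections import deque

def ring(n):
    """Ring of n nodes: each node links to its two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def mesh2d(k):
    """k x k mesh without wrap-around links (N = k * k nodes)."""
    adj = {}
    for x in range(k):
        for y in range(k):
            adj[(x, y)] = [
                (x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < k and 0 <= y + dy < k
            ]
    return adj

def hypercube(dim):
    """Hypercube with 2**dim nodes: neighbors differ in exactly one bit."""
    return {i: [i ^ (1 << b) for b in range(dim)] for i in range(1 << dim)}

def diameter(adj):
    """Maximum over all node pairs of the shortest hop count (BFS per node)."""
    longest = 0
    for source in adj:
        dist = {source: 0}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        longest = max(longest, max(dist.values()))
    return longest

ring_diameter = diameter(ring(16))        # table: N/2
mesh_diameter = diameter(mesh2d(4))       # table: 2(N^(1/2) - 1) with N = 16
cube_diameter = diameter(hypercube(4))    # table: log2 N with N = 16
```

For N = 16 the three measured diameters agree with the table's formulas for the ring, the 2D mesh, and the hypercube.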

Topologies in Real High End Machines
[Figure: machines listed from older to newer]

Network: Performance metrics
• Network Bandwidth
  • Need high bandwidth in communication
  • How does it scale with number of nodes?
• Communication Latency
  • Affects performance, since processor may have to wait
  • Affects ease of programming, since it requires more thought to overlap communication and computation
• How can a mechanism help hide latency?
  • overlap message send with computation
  • prefetch data
  • switch to other task or thread

Examples of many core / PE architectures
• SIMD
  • Xetal (320 PEs), Imap (128 PEs), AnySP (Michigan Univ)
• VLIW
  • Itanium, TRIPS / EDGE, ADRES
• Multi-threaded
  • idea: hide long latencies
  • Denelcor HEP (1982), SUN Niagara (2005)
• Multi-processor
  • RaW, PicoChip, Intel/AMD, GRID, Farms, …
• Hybrid, like Imagine, GPUs, XC-Core
  • actually, most are hybrid!

IMAP from NEC
• NEC IMAP
• SIMD
  • 128 PEs
  • Supports indirect addressing
    • e.g. LD r1, (r2)
  • Each PE 5-issue VLIW

TRIPS (Austin Univ / IBM): a statically mapped data flow architecture
[Figure: TRIPS tile layout]
  R: register file
  E: execution unit
  D: Data cache
  I: Instruction cache
  G: global control

Compiling for TRIPS
• Form hyperblocks (use unrolling, predication, inlining to enlarge scope)
• Spatially map the operations of each hyperblock
  • registers are accessed at hyperblock boundaries
• Schedule hyperblocks

Multithreaded Categories
• Superscalar
• Fine-Grained
• Coarse-Grained
• Multiprocessing
• Simultaneous Multithreading (Intel calls this 'Hyperthreading')
[Figure: issue slots over time (processor cycles) for each category; legend: Thread 1 to Thread 5, idle slot]

SUN Niagara processing element
• 4 threads per processor
• 4 copies of PC logic, Instr. buffer, Store buffer, Register file
