GPUs and the Future of Parallel Computing
Authors: Stephen W. Keckler et al. (NVIDIA), IEEE Micro, 2011
Taewoo Lee, 2013.05.24, roboticist@voice.korea.ac.kr
Three Challenges for Parallel-Computing Chips
• Limited power budget
• Bandwidth gap between computation and memory
• Parallel programmability
Computers have been constrained by power and energy rather than area
• The power budget is limited
  • About 150 W for desktops or 3 W for mobile devices (∵ leakage, cooling)
• Transistor counts per chip have increased continuously (Moore's law)
• Total power consumption has therefore also increased
Computers have been constrained by power and energy rather than area
• E.g. a supercomputer
  • Power budget = 20 MW
  • Target compute capability = 10^18 Flops/sec = 1 exaFlops/sec
  • Energy/Flop = 20 MW ÷ 10^18 Flops/sec = 20×10^-12 J/Flop = 20 pJ/Flop
• However, modern CPUs (Intel's Westmere) need
  • 1700 pJ/Flop (double-precision, ∵ 130 W / 77 GFlops)
• GPU (Fermi architecture)
  • 225 pJ/Flop (single-precision, ∵ 130 W / 665 GFlops)
• Energy per Flop must shrink to about 1/85 (CPU) and 1/11 (GPU) of these values
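The energy-per-Flop arithmetic on this slide can be checked with a short script. This is only a sketch: the function and variable names are illustrative and do not come from the paper, and the inputs are the figures quoted on the slide. The CPU value is recomputed from 130 W and 77 GFlops, so it lands slightly below the rounded 1700 pJ/Flop.

```python
# Energy-per-Flop arithmetic using the figures quoted on the slide.
# All names here are illustrative, not taken from the paper.

PJ_PER_J = 1e12  # picojoules per joule


def energy_per_flop_pj(power_watts: float, flops_per_sec: float) -> float:
    """Energy per floating-point operation in pJ (W / (Flop/s) = J/Flop)."""
    return power_watts / flops_per_sec * PJ_PER_J


# Exascale target: 20 MW power budget, 10^18 Flops/sec.
target_pj = energy_per_flop_pj(20e6, 1e18)

# Westmere CPU: 130 W at 77 GFlops (double precision).
cpu_pj = energy_per_flop_pj(130, 77e9)

# Fermi GPU: the slide quotes 225 pJ/Flop directly, so it is used as given.
gpu_pj = 225.0

print(f"target: {target_pj:.0f} pJ/Flop")
print(f"CPU:    {cpu_pj:.0f} pJ/Flop, {cpu_pj / target_pj:.1f}x above target")
print(f"GPU:    {gpu_pj:.0f} pJ/Flop, {gpu_pj / target_pj:.1f}x above target")
```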
Energy-efficiency will require reducing both instruction execution and data movement overheads
• Instruction overheads
  • Modern CPUs were optimized for single-thread performance
    • E.g. branch prediction, out-of-order execution, and large primary instruction and data caches
  • So, energy is consumed in the overheads of data supply, instruction supply, and control
  • To get higher throughput, future architectures must spend more of their energy on useful work (i.e. computation)
Energy-efficiency will require reducing both instruction execution and data movement overheads
• Today, the energy consumption of a double-precision fused multiply-add (DFMA) is around 50 pJ
• Data movement power dissipation is also large
  • E.g. energy to read three 64-bit source operands and to write one destination operand
    • to SRAM = 56 pJ (≈ DFMA; ∵ 14 pJ × 4 = 56 pJ)
    • to memory 10 mm farther away = 56 × 6 pJ
    • to external DRAM = 56 × 200 pJ
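The operand-movement figures above can be tabulated against the cost of one DFMA. This is a minimal sketch using only the slide's numbers. Reading "14 pJ × 4" as 14 pJ per 64-bit operand access is an interpretation of the slide annotation, and the constant names are made up for illustration.

```python
# Data-movement energy relative to one DFMA, using the slide's figures.
# Assumption: "14 pJ x 4" means 14 pJ per 64-bit operand access.

DFMA_PJ = 50          # double-precision fused multiply-add
SRAM_ACCESS_PJ = 14   # one 64-bit operand access to SRAM (assumed meaning)
OPERANDS = 4          # three source reads + one destination write

sram_pj = SRAM_ACCESS_PJ * OPERANDS   # local SRAM
wire_10mm_pj = sram_pj * 6            # memory 10 mm farther away
dram_pj = sram_pj * 200               # external DRAM

for label, pj in [
    ("SRAM", sram_pj),
    ("10 mm away", wire_10mm_pj),
    ("external DRAM", dram_pj),
]:
    print(f"{label:<15}{pj:>7} pJ  ({pj / DFMA_PJ:.1f}x one DFMA)")
```

Under these figures, fetching the operands from external DRAM costs a couple of hundred times more than the arithmetic itself, which is why the slides stress locality.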
Energy-efficiency will require reducing both instruction execution and data movement overheads
[Figure annotations: 3.6:1, 3.6:1, 1:23, 1:6.2]
• Because communication dominates energy, both within the chip and across the external memory interface, energy-efficient architectures must decrease the amount of data movement by exploiting locality
• With the scaling projection to 10 nm, the ratios between DFMA, on-chip SRAM, and off-chip DRAM access energy stay relatively constant
• However, the relative energy cost of 10 mm global wires goes up to 23 times the DFMA energy (∵ wire capacitance remains constant)
• Feature size ↓ → relative energy cost of wires ↑
The bandwidth gap between computation and memory is severe, and the power consumed by data movement is also serious
• The bandwidth gap between computation and memory keeps growing → narrowing this gap is very important
• Despite the relatively narrow memory bandwidth, chip-to-chip power consumption is too high (∵ DRAM max. BW 175 GB/sec × 20 pJ/bit = 28 W, plus 21 W for signaling = 49 W; 49 W accounts for 20% of total GPU TDP (thermal design power))
→ Again, reducing data movement is necessary
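The 28 W figure follows from converting bytes per second to bits per second and multiplying by the per-bit energy. A small sketch of that conversion, with an illustrative helper name and the slide's numbers as inputs:

```python
# Power drawn by a memory interface at a given bandwidth and energy per bit.
# The helper name is illustrative; inputs are the slide's figures.


def link_power_watts(bandwidth_gb_per_s: float, energy_pj_per_bit: float) -> float:
    """Power in watts = (bits per second) * (joules per bit)."""
    bits_per_s = bandwidth_gb_per_s * 1e9 * 8
    return bits_per_s * energy_pj_per_bit * 1e-12


dram_w = link_power_watts(175, 20)   # 175 GB/sec at 20 pJ/bit
signaling_w = 21                     # additional signaling power from the slide
total_w = dram_w + signaling_w

print(f"DRAM transfer: {dram_w:.0f} W, total with signaling: {total_w:.0f} W")
```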
To cope with the bandwidth gap problem,
• Architects are trying
  • Multichip modules (MCMs)
  • On-chip DRAM (to reduce latency)
  • CPU + GPU on one chip (to reduce transfer overheads)
    • However, sharing bandwidth between the CPU and GPU can worsen bandwidth utilization
  • 3D chip stacking
  • Deeper memory hierarchy
• Better bandwidth utilization
  • Coalescing
  • Prefetch
  • Data compression (more data per transaction)
http://www.extremetech.com/computing/95319-ibm-and-3m-to-stack-100-silicon-chips-together-using-glue
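To illustrate why coalescing improves bandwidth utilization, the sketch below counts how many memory segments one group of threads touches for different access strides. The parameters are assumptions chosen for illustration and are not stated in the slides: 32 threads per warp, 4-byte elements, 128-byte memory segments, and a segment-aligned base address.

```python
# Count the memory transactions needed by one warp for a strided access.
# Assumed parameters (not from the slides): 32-thread warp, 4-byte elements,
# 128-byte segments, base address aligned to a segment boundary.


def transactions_per_warp(
    stride_elems: int,
    elem_bytes: int = 4,
    warp_size: int = 32,
    segment_bytes: int = 128,
) -> int:
    """Number of distinct memory segments touched by one warp."""
    addresses = (t * stride_elems * elem_bytes for t in range(warp_size))
    return len({addr // segment_bytes for addr in addresses})


for stride in (1, 2, 32):
    n = transactions_per_warp(stride)
    print(f"stride {stride:>2}: {n} transaction(s) per warp")
```

With unit stride, all threads fall into one segment, so a single transaction serves the whole warp. With larger strides, each transaction carries mostly unused bytes, which wastes the scarce off-chip bandwidth.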
For parallel programmability, programmers must be able to
• Represent data access patterns and data placement (∵ the memory model is no longer flat; coalesced access)
• Deal with thousands of threads
• Choose which kind of processing core their tasks run on (∵ heterogeneity will increase)
• Also, coherence and consistency should be relaxed to facilitate greater memory-level parallelism
  • ∵ the cost of the coherence protocol is too high
  • Solution: give programmers selective coherence
To cope with these challenges
• Limited power budget
• Bandwidth gap between computation and memory
• Parallel programmability
Echelon: A Research GPU Architecture
• Goals
  • Double-precision performance = 16 TFlops/sec
  • Memory bandwidth = 1.6 TB/sec
  • Power budget ≤ 150 W
  • Energy efficiency = 20 pJ/Flop
Echelon Block Diagram: Chip Level Architecture
- 16 DRAM memory controllers (MCs)
- 64 tiles
- Each tile consists of 4 throughput-optimized cores (TOCs), i.e. GPUs for throughput-oriented parallel tasks
- 8 latency-optimized cores (LOCs), i.e. CPUs for the operating system and serial portions
Echelon Block Diagram: Throughput Tile Architecture
- 4 TOCs per tile.
- Each TOC has secondary on-chip storage.
- It may be on-chip DRAM.
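The chip-level counts and the Echelon goals can be combined into a rough per-unit budget. This is a back-of-the-envelope sketch only: the class and field names are hypothetical, and dividing throughput evenly across TOCs and bandwidth evenly across memory controllers is an assumption that the slides do not state.

```python
# Rough per-unit budget derived from the slide's chip-level figures.
# Assumption: throughput and bandwidth are split evenly across units.
from dataclasses import dataclass


@dataclass(frozen=True)
class EchelonChip:
    """Hypothetical container for the figures quoted on the slides."""

    tiles: int = 64
    tocs_per_tile: int = 4
    memory_controllers: int = 16
    peak_dp_tflops: float = 16.0
    memory_bandwidth_tb_per_s: float = 1.6

    @property
    def total_tocs(self) -> int:
        return self.tiles * self.tocs_per_tile

    @property
    def gflops_per_toc(self) -> float:
        return self.peak_dp_tflops * 1000 / self.total_tocs

    @property
    def gb_per_s_per_controller(self) -> float:
        return self.memory_bandwidth_tb_per_s * 1000 / self.memory_controllers


chip = EchelonChip()
print(f"TOCs on chip:       {chip.total_tocs}")
print(f"GFlops per TOC:     {chip.gflops_per_toc:.1f}")
print(f"GB/sec per MC:      {chip.gb_per_s_per_controller:.1f}")
```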
Characteristics of a TOC: MIMD + SIMD, configurable and sharable SRAM, and LIW per lane
• Temporal SIMT
  • Divergent code → MIMD
  • Non-divergent code → SIMT (more energy-efficient)
• Two-level register files
  • Operand register file (ORF) for producer-consumer relationships between subsequent instructions
  • Main register file (MRF)
• Multilevel scheduling
  • 4 active and 60 on-deck sets (64 threads in total)
Malleable Memory System
• Selective SRAM
  • H/W-controlled cache + scratch pads (S/W-controlled cache)
  • The ratio can be determined by programmers
    • E.g. 16 KB/48 KB or 48 KB/16 KB (64 KB in total)
  • Where to inherit from can be determined by programmers
    • (GMEMs, L2, ranges)
To make writing a parallel program as easy as writing a sequential program
• Unified memory addressing
  • An address space spanning LOCs and TOCs, as well as multiple Echelon chips
• Selective memory coherence
  • First, place data in a coherence domain; later, remove coherence to get better performance (energy, execution time)
• H/W fine-grained thread creation
  • Automated fine-grained parallelization by H/W
http://www.hardwarecanucks.com/reviews/processors/huma-amds-new-heterogeneous-unified-memory-architecture/
This work is licensed under a Creative Commons Attribution 3.0 Unported License.