Efficient and Easily Programmable Accelerator Architectures

Efficient and Easily Programmable Accelerator Architectures Tor Aamodt University of British Columbia PPL Retreat, 31 May 2013

Decreasing cost per unit computation Advancing Computer Systems without Technology Progress DARPA/ISAT Workshop, March 26-27, 2012 Mark Hill & Christos Kozyrakis 1981: IBM 5150 2007: iPhone 2012: Datacenter 1971: Intel 4004

Better (how to get here?) Single Core OoO Superscalar CPU Wimpy (In-order) Multicore Ease of Programming Brawny (OoO) Multicore 16K thread, SIMT Accelerator ASIC Hardware Efficiency

Heterogeneity helps… Ease of Programming Hardware Efficiency

Review: Amdahl’s Law Hard to accelerate Easy to accelerate

What defines division between hard and easy? Fractionhard = f(problem, prog. model, SW budget)

Goal: easy to accelerate (Acc. Arch1) easy to accelerate (Acc. Arch2)

Better ? Ease of Programming Hardware Efficiency

Increase Accelerator Efficiency (x-axis) • Control Flow • Memory Improve Accelerator Programmability (y-axis) • Easier coding • Fewer bugs • Easier debugging

PC B Active Mask 1111 A 1 2 3 4 B 1 2 3 4 C -- -- 3 4 D 1 2 -- -- SIMT Execution (MIMD on SIMD)(Levinthal SIGGRAPH’84) foo[] = {4,8,12,16}; A: n = foo[tid.x]; B: if (n > 10) C: …; else D: …; E: … E D 1100 C 0011 Time Branch Divergence

Warp 1 Warp 2 Warp 0 A 1 2 3 4 A 5 6 7 8 A 9 10 11 12 B 1 2 3 4 B 5 6 7 8 B 9 10 11 12 C 1 2 7 8 C 1 2 -- -- C 5 -- 7 8 C 5 -- 11 12 C -- -- 11 12 D -- -- 3 4 D -- 6 -- -- D 9 10 -- -- E 1 2 3 4 E 5 6 7 8 E 9 10 11 12 Dynamic Warp Formation (Fung: MICRO 2007, HPCA 2011) Reissue/Memory Latency SIMD Efficiency 58  88% Pack Time 22% average [HPCA’11]

Memory

Scheduler affects access pattern Greedy then Oldest Scheduler Round Robin Scheduler Warp0 Warp1 Warp0 Warp1 Warp Scheduler Warp Scheduler ld A ,B,C,D… ld Z,Y,X,W ld A,B,C,D… ... ... ... DC B A WX Y Z ld A,B,C,D ld Z,Y,X,W ld A,B,C,D… DC B A DC B A Cache Cache

Use scheduler to shape access pattern Cache-Conscious Wavefront Scheduling (Rogers: MICRO 2012, Top Picks 2013) Greedy then Oldest Scheduler Warp0 Warp1 Warp Scheduler Warp Scheduler ld A,B,C,D ld Z,Y,X,W ... working set size per warp WX Y Z ld A,B,C,D 63% perf. improvement DC B A Cache Cache

Easier coding

Accelerator Coherence Challenges Challenges of introducing coherence messages on a GPU Traffic: transferring messages Storage: tracking message Complexity: managing races between messages GPU cache coherence without coherence messages? YES – using global time

Global time Temporal Coherence (Singh: HPCA 2013) Related: Library Cache Coherence Local Timestamp > Global Time  VALID Core 1 Core 2 ▪▪▪ L1D L1D Global Timestamp < Global Time  NO L1 COPIES Interconnect L2 Bank ▪▪▪ 0 0 A=0 A=0

Temporal Coherence Example T=0 T=11 T=15 Core 1 Core 2 L1D L1D No coherence messages Interconnect T=10 L2 Bank Load A Store A=1 ▪▪▪ 10 A=1 A=0 10 10 0 10 A=0 A=0 A=0 A=0

Complexity MESI L2 States Non-Coherent L1 TC-Weak L1 MESI L1 States Non-Coherent L2 TC-Weak L2

Interconnect Traffic Reduces traffic by 53% over MESI and 23% over GPU-VI Lower traffic than 16x-sized 32-way directory NO-COH MESI GPU-VI TC-Weak 2.3 1.50 1.25 Interconnect Traffic 1.00 0.75 0.50 0.25 Do not require coherence 0.00

Performance TC-Weak with simple predictor performs 85% better than disabling L1 caches NO-L1 MESI GPU-VI TC-Weak 2.0 1.5 Speedup 1.0 0.5 Require coherence 0.0

Fewer bugs

Performance Functionality Time Fine-Grained Locking Transactional Memory Time Time • Lifetime of Accelerator Application Development ?

Are TM and GPUs Incompatible? GPU uarch very different from multicore CPU. KILO TM [Fung MICRO’11, Top Picks’12] • Hardware TM for GPUs • Half performance of fine grained locking • Chip area overhead of 0.5% 24

Aborted Committed T0 T0 T0 T1 T1 T1 T2 T2 T2 T3 T3 T3 Hardware TM for GPUs Challenge #1: SIMD Hardware • On GPUs, scalar threads in a warp/wavefront execute in lockstep A Warp with 4 Scalar Threads ... TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit ... Branch Divergence! 25

KILO TM – Solution to Challenge #1: SIMD Hardware Transaction Abort Like a Loop Extend SIMT Stack Abort ... TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit ... 26 26

GPU Core (SM) CPU Core 10s of Registers 32k Registers Register File Register File @ TX Entry @ TX Abort Checkpoint Register File Warp Warp Warp Warp Warp Warp Warp Warp Checkpoint? Hardware TM for GPUs Challenge #2: Transaction Rollback 2MB Total On-Chip Storage 27

Overwritten Abort KILO TM – Solution toChallenge #2: Transaction Rollback • SW Register Checkpoint • Most TX: Reg overwritten first appearance (idempotent) • TX in Barnes Hut: Checkpoint 2 registers TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit 28

Hardware TM for GPUs Challenge #3: Conflict Detection Existing HTMs use Cache Coherence Protocol • Not Available on (current) GPUs • No Private Data Cache per Thread Signatures? • 1024-bit / Thread • 3.8MB / 30k Threads 29

Read-Log Read-Log Write-Log Write-Log TX2 atomic {A=B+2} Private Memory KILO TM: Value-Based Conflict Detection • Self-Validation + Abort: • Only detects existence of conflict (not identity) Global Memory A=1 A=1 TX1 atomic {B=A+1} Private Memory B=0 B=2 A=1 TxBegin LD r1,[A] ADD r1,r1,1 ST r1,[B] TxCommit B=2 B=2 TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit B=0 A=2 30

Easier debugging

V0 0 __global__ void BFS_step_kernel(...) { 1 if( active[tid] ) { 2 active[tid] = false; 3 visited[tid] = true; 4 foreach (int id = neighbour_nodes){ 5 if( visited[id] == false ){ 6 level[id] = level[tid] + 1; 7 active[id] = true; 8 … 9 } } } } V1 V2 level = 1 active = 1 level = 2 active = 1 BFS algorithm Published in HiPC 2007

GPUDet (Jooybar: ASPLOS 2013) Read Only Global Memory Commit Reaching Quantum Boundary Load Op Atomic Op Store Buffers Wavefronts Instruction Count Atomic Operations Memory Fences Workgroup Barriers Execution Complete … Local Memory 2x Slowdown

Summary • Start from efficient architecture and try to improve programmability • Get efficiency and keep programmers happy reasonably

Thanks!Questions?

Efficient and Easily Programmable Accelerator Architectures

Efficient and Easily Programmable Accelerator Architectures

Presentation Transcript

Unified Architectures for Efficient and Compact Crypto-Processing

Efficient Architectures for Eigen Value Decomposition

FlexiTaint A Programmable Accelerator for Dynamic Taint Propagation

Towards Scalable and Energy-Efficient Memory System Architectures

Towards Scalable and Energy-Efficient Memory System Architectures

Hardware accelerator for Efficient error-correcting codes

Efficient Memory Shadowing for 64-bit Architectures

Programmable Logic Device Architectures

An Efficient Texture Cache for Programmable Vertex Shaders

Performance Debugging for Highly Parallel Accelerator Architectures

FlexiTaint: A Programmable Accelerator for Dynamic Taint Propagation

Coarse and Fine Grain Programmable Overlay Architectures for FPGAs

Programming a Hyper-Programmable Architectures for Networked Systems

Architectures of Digital Information Systems Part 2: Programmable I/O and Multiprocessors

Programming Models, Languages, and Compilation for Accelerator-Based Architectures

Programmable Logic Devices and Architectures: A Nano-Course

MemTracker Efficient and Programmable Support for Memory Access Monitoring and Debugging

Architectures and Programmable Logic Devices

SensorWare: A Framework for Efficient and Programmable Sensor Networks

Efficient and Easily Programmable Accelerator Architectures

Processor Architectures and Program Mapping Programmable Digital Signal Processors

Energy Efficient Packet Classification Hardware Accelerator