1 / 35

Efficient and Easily Programmable Accelerator Architectures

Efficient and Easily Programmable Accelerator Architectures. Tor Aamodt University of British Columbia PPL Retreat, 31 May 2013. Decreasing cost per unit computation. Advancing Computer Systems without Technology Progress DARPA/ISAT Workshop, March 26-27, 2012 Mark Hill & Christos Kozyrakis.

erno
Download Presentation

Efficient and Easily Programmable Accelerator Architectures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient and Easily Programmable Accelerator Architectures Tor Aamodt University of British Columbia PPL Retreat, 31 May 2013

  2. Decreasing cost per unit computation Advancing Computer Systems without Technology Progress DARPA/ISAT Workshop, March 26-27, 2012 Mark Hill & Christos Kozyrakis 1981: IBM 5150 2007: iPhone 2012: Datacenter 1971: Intel 4004

  3. Better (how to get here?) Single Core OoO Superscalar CPU Wimpy (In-order) Multicore Ease of Programming Brawny (OoO) Multicore 16K thread, SIMT Accelerator ASIC Hardware Efficiency

  4. Heterogeneity helps… Ease of Programming Hardware Efficiency

  5. Review: Amdahl’s Law Hard to accelerate Easy to accelerate

  6. What defines division between hard and easy? Fractionhard = f(problem, prog. model, SW budget)

  7. Goal: easy to accelerate (Acc. Arch1) easy to accelerate (Acc. Arch2)

  8. Better ? Ease of Programming Hardware Efficiency

  9. Increase Accelerator Efficiency (x-axis) • Control Flow • Memory Improve Accelerator Programmability (y-axis) • Easier coding • Fewer bugs • Easier debugging

  10. PC B Active Mask 1111 A 1 2 3 4 B 1 2 3 4 C -- -- 3 4 D 1 2 -- -- SIMT Execution (MIMD on SIMD)(Levinthal SIGGRAPH’84) foo[] = {4,8,12,16}; A: n = foo[tid.x]; B: if (n > 10) C: …; else D: …; E: … E D 1100 C 0011 Time Branch Divergence

  11. Warp 1 Warp 2 Warp 0 A 1 2 3 4 A 5 6 7 8 A 9 10 11 12 B 1 2 3 4 B 5 6 7 8 B 9 10 11 12 C 1 2 7 8 C 1 2 -- -- C 5 -- 7 8 C 5 -- 11 12 C -- -- 11 12 D -- -- 3 4 D -- 6 -- -- D 9 10 -- -- E 1 2 3 4 E 5 6 7 8 E 9 10 11 12 Dynamic Warp Formation (Fung: MICRO 2007, HPCA 2011) Reissue/Memory Latency SIMD Efficiency 58  88% Pack Time 22% average [HPCA’11]

  12. Memory

  13. Scheduler affects access pattern Greedy then Oldest Scheduler Round Robin Scheduler Warp0 Warp1 Warp0 Warp1 Warp Scheduler Warp Scheduler ld A ,B,C,D… ld Z,Y,X,W ld A,B,C,D… ... ... ... DC B A WX Y Z ld A,B,C,D ld Z,Y,X,W ld A,B,C,D… DC B A DC B A Cache Cache

  14. Use scheduler to shape access pattern Cache-Conscious Wavefront Scheduling (Rogers: MICRO 2012, Top Picks 2013) Greedy then Oldest Scheduler Warp0 Warp1 Warp Scheduler Warp Scheduler ld A,B,C,D ld Z,Y,X,W ... working set size per warp WX Y Z ld A,B,C,D 63% perf. improvement DC B A Cache Cache

  15. Easier coding

  16. Accelerator Coherence Challenges Challenges of introducing coherence messages on a GPU Traffic: transferring messages Storage: tracking message Complexity: managing races between messages GPU cache coherence without coherence messages? YES – using global time

  17. Global time Temporal Coherence (Singh: HPCA 2013) Related: Library Cache Coherence Local Timestamp > Global Time  VALID Core 1 Core 2 ▪▪▪ L1D L1D Global Timestamp < Global Time  NO L1 COPIES Interconnect L2 Bank ▪▪▪ 0 0 A=0 A=0

  18. Temporal Coherence Example T=0 T=11 T=15 Core 1 Core 2 L1D L1D No coherence messages Interconnect T=10 L2 Bank Load A Store A=1 ▪▪▪ 10 A=1 A=0 10 10 0 10 A=0 A=0 A=0 A=0

  19. Complexity MESI L2 States Non-Coherent L1 TC-Weak L1 MESI L1 States Non-Coherent L2 TC-Weak L2

  20. Interconnect Traffic Reduces traffic by 53% over MESI and 23% over GPU-VI Lower traffic than 16x-sized 32-way directory NO-COH MESI GPU-VI TC-Weak 2.3 1.50 1.25 Interconnect Traffic 1.00 0.75 0.50 0.25 Do not require coherence 0.00

  21. Performance TC-Weak with simple predictor performs 85% better than disabling L1 caches NO-L1 MESI GPU-VI TC-Weak 2.0 1.5 Speedup 1.0 0.5 Require coherence 0.0

  22. Fewer bugs

  23. Performance Functionality Time Fine-Grained Locking Transactional Memory Time Time • Lifetime of Accelerator Application Development ?

  24. Are TM and GPUs Incompatible? GPU uarch very different from multicore CPU. KILO TM [Fung MICRO’11, Top Picks’12] • Hardware TM for GPUs • Half performance of fine grained locking • Chip area overhead of 0.5% 24

  25. Aborted Committed T0 T0 T0 T1 T1 T1 T2 T2 T2 T3 T3 T3 Hardware TM for GPUs Challenge #1: SIMD Hardware • On GPUs, scalar threads in a warp/wavefront execute in lockstep A Warp with 4 Scalar Threads ... TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit ... Branch Divergence! 25

  26. KILO TM – Solution to Challenge #1: SIMD Hardware Transaction Abort Like a Loop Extend SIMT Stack Abort ... TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit ... 26 26

  27. GPU Core (SM) CPU Core 10s of Registers 32k Registers Register File Register File @ TX Entry @ TX Abort Checkpoint Register File Warp Warp Warp Warp Warp Warp Warp Warp Checkpoint? Hardware TM for GPUs Challenge #2: Transaction Rollback 2MB Total On-Chip Storage 27

  28. Overwritten Abort KILO TM – Solution toChallenge #2: Transaction Rollback • SW Register Checkpoint • Most TX: Reg overwritten first appearance (idempotent) • TX in Barnes Hut: Checkpoint 2 registers TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit 28

  29. Hardware TM for GPUs Challenge #3: Conflict Detection Existing HTMs use Cache Coherence Protocol • Not Available on (current) GPUs • No Private Data Cache per Thread Signatures? • 1024-bit / Thread • 3.8MB / 30k Threads 29

  30. Read-Log Read-Log Write-Log Write-Log TX2 atomic {A=B+2} Private Memory KILO TM: Value-Based Conflict Detection • Self-Validation + Abort: • Only detects existence of conflict (not identity) Global Memory A=1 A=1 TX1 atomic {B=A+1} Private Memory B=0 B=2 A=1 TxBegin LD r1,[A] ADD r1,r1,1 ST r1,[B] TxCommit B=2 B=2 TxBegin LD r2,[B] ADD r2,r2,2 ST r2,[A] TxCommit B=0 A=2 30

  31. Easier debugging

  32. V0 0 __global__ void BFS_step_kernel(...) { 1 if( active[tid] ) { 2 active[tid] = false; 3 visited[tid] = true; 4 foreach (int id = neighbour_nodes){ 5 if( visited[id] == false ){ 6 level[id] = level[tid] + 1; 7 active[id] = true; 8 … 9 } } } } V1 V2 level = 1 active = 1 level = 2 active = 1 BFS algorithm Published in HiPC 2007

  33. GPUDet (Jooybar: ASPLOS 2013) Read Only Global Memory Commit Reaching Quantum Boundary Load Op Atomic Op Store Buffers Wavefronts Instruction Count Atomic Operations Memory Fences Workgroup Barriers Execution Complete … Local Memory 2x Slowdown

  34. Summary • Start from efficient architecture and try to improve programmability • Get efficiency and keep programmers happy reasonably

  35. Thanks!Questions?

More Related