SMTp: An Architecture for Next-generation Scalable Multi-threading


Presentation Transcript


  1. SMTp: An Architecture for Next-generation Scalable Multi-threading Mainak Chaudhuri Computer Systems Laboratory Cornell University Mark Heinrich School of Computer Science University of Central Florida

  2. Scalable multi-threading • Directory-based hardware DSM • Directory-based coherence requires complex memory controllers (MCs) • So complex that MCs are often made programmable, with embedded protocol processors • Integrated memory controllers are commonplace in high-end microprocessors • Servers are naturally NUMA/DSM, not SMP • Snooping is awkward and bandwidth-limited • This talk: build directory-based scalable DSM with minimal changes to a standard MC

  3. Two major goals • Directory-based coherence without a directory controller • still scalable • can use less complex standard memory controllers • Flexibility in using custom protocol code or any software sequences to do “interesting things” on cache misses • compression/encryption • fault tolerance

  4. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  5. Introducing SMTp • SMTp: SMT with a protocol thread context • Protocol thread executes the control part of the coherence protocol in parallel with the SDRAM data access • Provides flexibility to run custom software sequences on cache misses [motivation #1] • Still uses the standard MC (no directory state machine) [motivation #2] • Build large-scale directory-based DSM out of commodity nodes with integrated MC and SMTp

  6. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  7. Basic extensions for SMTp [Block diagram: an SMT pipeline (ICFE, DE, RE, DC stages; IQ/FPQ; register file; ALU/FPU/AGU; LSQ) extended with a 1-bit PPCV, a 7-bit LDCTXT_ID, instruction and data bypass buffers (IBB 16x64B, DBB 16x32B), an L2 bypass buffer (16x128B), and paths for uncached loads/stores, protocol misses, and application misses into the integrated memory controller]

  8. Memory controller for SMTp [Block diagram: the integrated memory controller, taking uncached loads/stores, protocol misses, and application misses from the core; an address/header handler-dispatch unit (8x128B) driven by PPCV, LA, and LDCTXT_ID; a Local Miss Interface for local miss handlers; SDRAM holding protocol data; and a network interface (NI in/out, handler-miss refill, application data) connected to the router]

  9. Enabling a protocol thread • Statically bound to a thread context • Need an extra thread context (PC, RAS, register map) • No context switch • Not visible to kernel • Protocol code is provided by system (conventional DSM style) • User cannot download arbitrary code to protocol memory
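In software terms, the added context is small. A minimal C sketch of the per-thread front-end state this implies; the slide names only the PC, RAS, and register map, so the field widths and the RAS depth below are assumptions made for illustration:

```c
#include <stdint.h>

#define RAS_DEPTH   8   /* assumed return-address-stack depth */
#define ARCH_REGS  32   /* MIPS-style architectural registers */

/* Hypothetical per-thread front-end state added for the protocol
 * context. It is statically bound: never context-switched and never
 * visible to the kernel. */
typedef struct {
    uint64_t pc;                  /* protocol program counter (PPC) */
    uint8_t  pc_valid;            /* PPCV: fetch from this context? */
    uint64_t ras[RAS_DEPTH];      /* return address stack */
    uint8_t  ras_top;
    uint16_t reg_map[ARCH_REGS];  /* architectural -> physical map */
} protocol_context_t;
```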

  10. Anatomy of a protocol handler • MIPS-style RISC ISA • Short sequence of instructions: calculate directory address (simple hash function), load directory entry (normal cached load), compute on header and directory (integer arithmetic), send cache line/control message (uncached stores), then switch r17 (uncached load of the header) and ldctxt r18 (uncached load of the address)
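Rendered as C rather than the MIPS-style assembly the handlers actually use, a handler for, say, a local read miss might look like the sketch below; the directory-entry layout, the hash, and the message-send primitives are assumptions made for illustration:

```c
#include <stdint.h>

typedef struct { uint64_t state_and_vector; } dir_entry_t; /* assumed layout */

extern dir_entry_t *DIR_BASE;                   /* directory in local SDRAM */
extern void send_data(uint64_t addr, int dest); /* uncached stores to the NI */

/* Hypothetical handler for a local read miss: the control part of the
 * protocol, run on the protocol thread while SDRAM fetches the data. */
void local_read_miss(uint64_t addr, int requester)
{
    dir_entry_t *e = &DIR_BASE[addr >> 7];  /* simple hash: 128B line index */
    uint64_t d = e->state_and_vector;       /* normal cached load */
    d |= 1ULL << requester;                 /* integer arithmetic on entry */
    e->state_and_vector = d;
    send_data(addr, requester);             /* uncached stores send reply */
}
```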

  11-16. Fetching from protocol thread [Animated block diagram, six frames, over a datapath of ICFE, LSQ, PPC/PPCV, address/header jump table, handler dispatch, SDRAM, NI, LMI, router, and front-side bus: a message header arriving at the handler dispatch unit (at home) indexes the jump table, which writes the handler's start PC into PPC and sets PPCV; this unblocks the pending switch, fetch of the handler begins, and ldctxt then executes to pick up the miss address]
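Functionally, the sequence these frames animate is a simple loop: block on switch until a message header arrives, dispatch through the jump table, then fetch the miss address with ldctxt. A C sketch, with switch and ldctxt modeled as hypothetical blocking uncached-load primitives:

```c
#include <stdint.h>

typedef void (*handler_t)(uint64_t addr);

extern handler_t jump_table[256];            /* header -> handler entry */
extern uint64_t uncached_load_header(void);  /* models "switch r17": blocks
                                                until a message header arrives */
extern uint64_t uncached_load_address(void); /* models "ldctxt r18" */

/* Hypothetical rendering of the protocol thread's top-level loop. In
 * hardware the jump-table lookup happens in the handler dispatch unit,
 * which writes PPC and raises PPCV; here it is an indirect call. */
void protocol_thread(void)
{
    for (;;) {
        uint64_t hdr  = uncached_load_header();   /* switch: blocks */
        uint64_t addr = uncached_load_address();  /* ldctxt */
        jump_table[hdr & 0xff](addr);             /* dispatch handler */
    }
}
```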

  17. Fetching from protocol thread • Protocol code/data resides in unmapped portion of local SDRAM • No ITLB access • Share instruction cache with application thread(s) • Fetcher turns off PPCV after the last handler instruction is fetched
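A sketch of the fetch-address selection this implies, with the function and variable names invented for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

extern uint64_t app_fetch_pc(void);  /* normal, ITLB-translated path */
extern uint64_t ppc;                 /* protocol PC */
extern bool     ppcv;                /* protocol PC valid bit */

/* Hypothetical fetch-address mux: when PPCV is set, fetch the protocol
 * thread from its PC directly. Protocol code lives in unmapped local
 * SDRAM, so the address is already physical and the ITLB is skipped;
 * fetched lines still share the ordinary instruction cache. */
uint64_t next_fetch_addr(bool *is_physical)
{
    if (ppcv) {
        *is_physical = true;   /* no ITLB access */
        return ppc;
    }
    *is_physical = false;      /* application thread: translate as usual */
    return app_fetch_pc();
}
```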

  18. Handling protocol load/store • No DTLB access • Share L1 data and L2 caches • L2 cache miss from protocol thread behaves differently • Needs to bypass Local Miss Interface • Talks to local SDRAM directly
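The routing difference on an L2 miss can be summarized as below, a sketch under the assumption of a simple per-miss source tag:

```c
#include <stdint.h>

enum miss_source { APP_MISS, PROTOCOL_MISS };

extern void local_miss_interface(uint64_t addr);  /* invokes a handler */
extern void sdram_access(uint64_t addr);          /* raw SDRAM access */

/* Hypothetical L2-miss routing. An application miss goes to the Local
 * Miss Interface so a protocol handler can run; a miss issued by the
 * protocol thread itself (e.g. a directory-entry load) bypasses the
 * LMI and goes straight to local SDRAM, or the thread would end up
 * waiting on itself. Protocol loads/stores also skip the DTLB. */
void on_l2_miss(enum miss_source src, uint64_t addr)
{
    if (src == PROTOCOL_MISS)
        sdram_access(addr);          /* direct to SDRAM: no recursion */
    else
        local_miss_interface(addr);  /* dispatch a protocol handler */
}
```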

  19. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  20. Deadlock with shared resources • Progress of an app. L2 miss depends on progress of the protocol thread • Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache indices [Diagram: a LOAD at the ROB retire pointer suffers an L2 miss; the protocol instructions of its local miss handler are blocked at the allocate pointer because the IQ is full, closing the cycle]

  21. Solving resource deadlock • General solution: one reserved instance (see the sketch below) • Out of 8 decode queue slots, app. threads get 7, while all 8 are open to the protocol thread • Easier solution: Pentium 4-style static resource partitioning • Cache index conflicts: solved with L1 and L2 bypass buffers (fully associative, LRU) • Allocate a bypass buffer entry instead of stalling • Parallel lookup: hit latency unchanged
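A sketch of the reserved-instance policy for one shared resource pool, using the decode queue's 8/7 split from the slide; the code structure itself is illustrative:

```c
#include <stdbool.h>

#define DQ_SLOTS     8   /* total decode queue slots */
#define DQ_RESERVED  1   /* one instance held back for the protocol thread */

static int dq_used;      /* slots currently allocated */

/* Hypothetical allocator: application threads may use at most 7 of the
 * 8 slots, so one slot is always available to the protocol thread and
 * the app-waits-on-protocol cycle cannot close. On a cache index
 * conflict the analogous move is to allocate a fully associative
 * bypass-buffer entry instead of waiting for the contested set. */
bool dq_alloc(bool is_protocol_thread)
{
    int limit = is_protocol_thread ? DQ_SLOTS : DQ_SLOTS - DQ_RESERVED;
    if (dq_used >= limit)
        return false;    /* caller stalls and retries */
    dq_used++;
    return true;
}
```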

  22. SMTp: deadlock solution [Same pipeline block diagram as slide 7, highlighting the deadlock fixes: the instruction and data bypass buffers (IBB 16x64B, DBB 16x32B) and the L2 bypass buffer (16x128B), alongside the 1-bit PPCV, the 7-bit LDCTXT_ID, and the uncached load/store, protocol-miss, and application-miss paths into the integrated memory controller]

  23. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  24. Evaluation methodology • Applications • SPLASH-2: FFT, LU, Radix, Ocean, Water • FFTW • Simulated machine model (details in paper) • 2GHz, 9 pipe stages • 1, 2, 4 app. threads + one protocol context • ROB: 128 (per thread) • Integer/floating point registers: 160/192/256 • L1 Icache: 32 KB/64B/2-way/LRU/1 cycle • L1 Dcache: 32 KB/32B/2-way/LRU/1 cycle • Unified L2: 2 MB/128B/8-way/LRU/9 cycles
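For quick reference, the cache hierarchy above restated as a hypothetical config record; the values come from the slide, while the struct and field names are illustrative:

```c
/* Cache geometry from the simulated machine model. */
typedef struct {
    int size_kb;         /* total capacity in KB */
    int line_bytes;      /* line size */
    int assoc;           /* set associativity (LRU) */
    int latency_cycles;  /* hit latency */
} cache_cfg_t;

static const cache_cfg_t l1i = {   32,  64, 2, 1 };
static const cache_cfg_t l1d = {   32,  32, 2, 1 };
static const cache_cfg_t l2  = { 2048, 128, 8, 9 };  /* unified */
```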

  25. Simulated machine models [Table: parameters of the Base, Int64KB, Int512KB, IntPerfect, and SMTp machine models]

  26. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  27. Single node (1app,1prot) results

  28. Single node (2app,1prot) results

  29. Single node results: summary • Memory controller integration helps • Ocean and FFTW benefit the most • LU and Water are largely insensitive • SMTp is always faster than Base • SMTp performs on par with Int512KB • In a few cases Int512KB outperforms SMTp, by at most 1.6% • Int64KB suffers from directory cache misses • FFTW and Radix are the most sensitive

  30. 32-node (1app,1prot) results

  31. 32-node (2app,1prot) results

  32. Multi-node results: summary • With increasing system size the integrated models converge in performance • IntPerfect gets a slight edge due to its doubled memory controller speed • SMTp continues to deliver excellent performance • The gap between Int512KB and SMTp: at most 6%, zero on average

  33. Resource occupancy: summary • The protocol thread is active for a very small fraction of time (low protocol occupancy) • When active, it can have high peak resource occupancy • When idle, all its resources are freed except 31 mapped registers and 2 LSQ slots holding switch and ldctxt • Overall, the protocol thread has very low pipeline overhead

  34. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  35. Related work • Simultaneous multi-threading • Assisted execution [HPCA’01][MICRO’01][ISCA’02] • Fault tolerance [ASPLOS’00][ISCA’02] • User-level message passing [MTEAC’01] • Programmable protocol engine • Customized co-processor (FLASH, S3.mp, STiNG, Piranha) • Commodity off-the-shelf processor (Typhoon) • On main processor through low overhead interrupt (Chalmers) [ISCA’95]

  36. Outline • Introducing SMTp • Basic extensions for SMTp • Deadlock avoidance • Evaluation methodology • Simulation results • Related work • Conclusions

  37. Conclusions • First design to exploit SMT to run a directory-based coherence protocol on a spare thread context • Delivers performance close to (within 6%, average 0%) that of integrated coherence controllers with large (512 KB) stand-alone directory data caches • Extremely low pipeline overhead • SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes

  38. Future directions • Not restricted to building DSMs out of commodity nodes • Use SMTp to carry out: on-the-fly compression/encryption of L2 cache lines, software-controlled address remapping to improve locality of cache access, fault tolerance by selectively extending coherence protocols • Alternate CMP designs • Issues with multiple protocol threads

  39. SMTp: An Architecture for Next-generation Scalable Multi-threading Mainak Chaudhuri Computer Systems Laboratory Cornell University Mark Heinrich School of Computer Science University of Central Florida

  40. Protocol occupancy [Chart: 16 nodes, (1 app, 1 protocol) threads per node]

  41. Protocol thread characteristics [Chart: 16 nodes, (1 app, 1 protocol) threads per node]
