SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida
Scalable multi-threading
• Directory-based hardware DSM
• Directory-based coherence requires complex memory controllers (MCs)
• So complex that MCs can be made programmable, with embedded protocol processors
• Integrated memory controllers are commonplace in high-end microprocessors
• Servers are naturally NUMA/DSM, not SMP
• Snooping is awkward and bandwidth-limited
This talk: build a directory-based scalable DSM with nominal changes to a standard MC
Two major goals
• Directory-based coherence without a directory controller
  – still scalable
  – can use less complex, standard memory controllers
• Flexibility to run custom protocol code, or any software sequence, to do "interesting things" on cache misses
  – compression/encryption
  – fault tolerance
Outline
• Introducing SMTp
• Basic extensions for SMTp
• Deadlock avoidance
• Evaluation methodology
• Simulation results
• Related work
• Conclusions
Introducing SMTp
• SMTp: SMT with a protocol thread context
• The protocol thread executes the control part of the coherence protocol in parallel with the SDRAM data access
• Provides the flexibility to run custom software sequences on cache misses [motivation #1]
• Still uses the standard MC (no directory state machine) [motivation #2]
• Build large-scale directory-based DSM out of commodity nodes with an integrated MC and SMTp
Basic extensions for SMTp
[Pipeline block diagram: a conventional SMT front end and back end (ICFE, decode, rename, issue queues, register file, ALU/FPU, LSQ, AGU, L1 caches, unified L2) augmented for SMTp with a 1-bit PPCV flag, a 16x64B instruction bypass buffer (IBB), a 16x32B data bypass buffer (DBB), a 16x128B L2 bypass buffer, and an LDCTXT_ID path; uncached loads/stores, protocol misses, and application misses are routed to the integrated memory controller.]
Memory controller for SMTp
[Block diagram: the integrated memory controller receives uncached loads/stores, protocol misses, application misses, and LDCTXT_ID from the core; a handler-dispatch unit indexes on the message address/header, backed by an 8x128B protocol data buffer; local misses enter through the Local Miss Interface, remote traffic flows through the network interface (NI In/Out) to the router, and data is supplied by SDRAM.]
Enabling a protocol thread
• Statically bound to a thread context
• Needs an extra thread context (PC, RAS, register map)
• No context switch
• Not visible to the kernel
• Protocol code is provided by the system (conventional DSM style)
• The user cannot download arbitrary code into protocol memory
Anatomy of a protocol handler
• MIPS-style RISC ISA
• Short sequence of instructions:
  Calculate directory address      // simple hash function
  Load directory entry             // normal cached load
  Compute on header and directory  // integer arithmetic
  Send cache line/control message  // uncached stores
  switch r17                       // uncached load (header)
  ldctxt r18                       // uncached load (address)
(A sketch of this control flow follows below.)
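To make the shape of such a handler concrete, here is a minimal C sketch of the control flow for one handler: a read miss arriving at the home node. The real handlers are short MIPS-style assembly sequences; the directory layout and the helper names (dir_hash, send_data_reply, forward_to_owner) are illustrative assumptions, not the actual SMTp protocol code.

#include <stdint.h>
#include <stdio.h>

/* Toy directory entry: a presence bit vector plus a dirty bit. */
typedef struct {
    uint64_t sharers;   /* which nodes hold the line           */
    uint8_t  dirty;     /* line is owned exclusively elsewhere */
} dir_entry_t;

static dir_entry_t directory[1024];                /* toy directory store     */

static dir_entry_t *dir_hash(uint64_t line_addr)   /* 1. simple hash function */
{
    return &directory[(line_addr >> 7) % 1024];
}

/* Stand-ins for the uncached stores that push replies into the network
 * interface / protocol data buffers. */
static void send_data_reply(int requester, uint64_t line_addr)
{
    printf("data reply for line 0x%llx -> node %d\n",
           (unsigned long long)line_addr, requester);
}

static void forward_to_owner(int owner, int requester, uint64_t line_addr)
{
    printf("forward line 0x%llx: owner %d, requester %d\n",
           (unsigned long long)line_addr, owner, requester);
}

/* One handler: a read miss arriving at the home node. */
void handle_local_read_miss(int requester, uint64_t line_addr)
{
    dir_entry_t *dir = dir_hash(line_addr);     /* 1. calculate directory address      */
    dir_entry_t entry = *dir;                   /* 2. load entry (normal cached load)  */

    if (!entry.dirty) {                         /* 3. compute on header + directory    */
        send_data_reply(requester, line_addr);  /* 4. uncached store(s) to reply       */
        entry.sharers |= 1ull << requester;
    } else {
        int owner = __builtin_ctzll(entry.sharers);
        forward_to_owner(owner, requester, line_addr);
    }
    *dir = entry;                               /* write the directory entry back      */

    /* In the real handler, 'switch' and 'ldctxt' (uncached loads) would now
     * pick up the next message header and address and dispatch the next
     * handler; the protocol thread goes idle if nothing is pending. */
}

int main(void)
{
    handle_local_read_miss(3, 0x40080ull);      /* clean line: home replies with data */
    return 0;
}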
Fetching from protocol thread
[Animated block diagram, spread over several slides: the handler-dispatch logic in the memory controller uses the incoming message address and header to index a jump table and produce the protocol PC (PPC), setting PPCV; the front end (ICFE) then fetches the handler, the waiting switch in the LSQ is unblocked with the message header, and ldctxt executes to pick up the message address; the same sequence is shown for a miss arriving at the home node through the Local Miss Interface (LMI). Other labeled components: SDRAM, network interface (NI), router, front-side bus.]
Fetching from protocol thread
• Protocol code/data resides in an unmapped portion of local SDRAM
• No ITLB access
• Shares the instruction cache with the application thread(s)
• The fetcher turns off PPCV after the last handler instruction is fetched
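As a rough illustration of the fetch-side mechanism, the following simulator-style C sketch shows how the protocol PC (PPC) and its valid bit (PPCV) could steer handler fetch. All variable and helper names are assumptions made for illustration, and for simplicity the sketch fetches for only one context per cycle, whereas a real SMT front end interleaves threads.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PROTOCOL_CTX 4            /* the extra, statically bound protocol context */

static uint64_t ppc;              /* protocol program counter (written by handler dispatch) */
static bool     ppcv;             /* protocol PC valid bit                                   */
static uint64_t handler_end_pc;   /* PC of the last instruction of the current handler       */

/* Trivial stand-ins for the rest of the front end. */
static int      select_app_thread(void) { return 0; }                    /* e.g. ICOUNT */
static uint64_t app_pc(int thread)      { return 0x1000 + 0x100 * (uint64_t)thread; }
static void     icache_fetch(uint64_t pc, int ctx)
{
    printf("fetch PC 0x%llx for context %d\n", (unsigned long long)pc, ctx);
}

void fetch_cycle(void)
{
    if (ppcv) {
        /* Handler fetch: PPC is used directly. Protocol code sits in an
         * unmapped region of local SDRAM, so no ITLB lookup is needed;
         * the fetch still goes through the shared instruction cache. */
        icache_fetch(ppc, PROTOCOL_CTX);

        if (ppc == handler_end_pc)
            ppcv = false;          /* last handler instruction fetched: clear PPCV */
        else
            ppc += 4;              /* MIPS-style fixed-width instructions */
    } else {
        /* No handler pending: fetch for an application thread as usual. */
        int t = select_app_thread();
        icache_fetch(app_pc(t), t);
    }
}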
Handling protocol loads/stores
• No DTLB access
• Shares the L1 data and L2 caches
• An L2 cache miss from the protocol thread behaves differently:
  – it bypasses the Local Miss Interface
  – it talks to the local SDRAM directly
(A sketch of this routing decision follows below.)
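To make the routing concrete, here is a hypothetical C sketch of the decision, again in a simulator style; the function names (lmi_dispatch_handler, sdram_access) are illustrative, not taken from the paper.

#include <stdint.h>
#include <stdio.h>

typedef enum { APP_THREAD, PROTOCOL_THREAD } thread_kind_t;

/* Stand-ins for the two destinations an L2 miss can take. */
static void lmi_dispatch_handler(uint64_t addr)   /* Local Miss Interface */
{
    printf("LMI: dispatch protocol handler for 0x%llx\n", (unsigned long long)addr);
}

static void sdram_access(uint64_t addr)           /* direct SDRAM access  */
{
    printf("SDRAM: direct access to 0x%llx\n", (unsigned long long)addr);
}

/* Route an L2 miss depending on which kind of thread issued it. */
void route_l2_miss(thread_kind_t who, uint64_t addr)
{
    if (who == APP_THREAD) {
        /* An application miss enters the Local Miss Interface, which builds
         * the request header and dispatches the matching protocol handler. */
        lmi_dispatch_handler(addr);
    } else {
        /* A miss issued by the protocol thread itself (e.g. a directory-entry
         * load) must bypass the LMI -- otherwise servicing it would need yet
         * another handler -- and talks to local SDRAM directly. */
        sdram_access(addr);
    }
}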
Deadlock with shared resources
• Progress of an application L2 miss depends on the progress of the protocol thread
• Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache indices
[Diagram: the ROB retire pointer is stuck at an application LOAD with an outstanding L2 miss, while a protocol instruction of the local miss handler is blocked because the integer queue (IQ) is full, so the miss can never be serviced.]
Solving resource deadlock
• General solution: reserve one instance of each shared resource for the protocol thread
  – e.g. out of 8 decode queue slots the application threads get 7, while all 8 are open to the protocol thread
  – Easier solution: Pentium 4-style static resource partitioning
• Cache index conflicts:
  – Solution: L1 and L2 bypass buffers (fully associative, LRU)
  – Allocate a bypass buffer entry instead of stalling on the conflicting index
  – Lookup is done in parallel: hit latency unchanged
(The reserved-instance allocation policy is sketched below.)
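The reserved-instance policy can be illustrated with a small C sketch; the structure below stands in for any of the shared resources listed above (decode queue, issue queue, MSHRs, ...), and all names are assumptions made for illustration, not the hardware's actual allocation logic.

#include <stdbool.h>

#define NUM_SLOTS 8               /* e.g. decode queue slots */

typedef enum { APP_THREAD, PROTOCOL_THREAD } thread_kind_t;

static int slots_in_use;          /* current occupancy of the shared structure */

/* Try to allocate one slot of a shared resource. Application threads may
 * together fill at most NUM_SLOTS - 1 entries, so one instance is always
 * left for the protocol thread and it can always make forward progress. */
bool try_allocate_slot(thread_kind_t who)
{
    int limit = (who == PROTOCOL_THREAD) ? NUM_SLOTS : NUM_SLOTS - 1;

    if (slots_in_use >= limit)
        return false;             /* caller stalls and retries later */

    slots_in_use++;
    return true;
}

void free_slot(void)
{
    if (slots_in_use > 0)
        slots_in_use--;
}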
SMTp: deadlock solution
[The same pipeline block diagram as in "Basic extensions for SMTp", highlighting the structures that avoid deadlock: the 1-bit PPCV flag, the instruction and data bypass buffers (IBB 16x64B, DBB 16x32B), and the 16x128B L2 bypass buffer between the caches and the integrated memory controller.]
Evaluation methodology
• Applications
  – SPLASH-2: FFT, LU, Radix, Ocean, Water
  – FFTW
• Simulated machine model (details in the paper)
  – 2 GHz, 9 pipeline stages
  – 1, 2, or 4 application threads + one protocol context
  – ROB: 128 entries (per thread)
  – Integer/floating-point registers: 160/192/256
  – L1 I-cache: 32 KB / 64 B lines / 2-way / LRU / 1 cycle
  – L1 D-cache: 32 KB / 32 B lines / 2-way / LRU / 1 cycle
  – Unified L2: 2 MB / 128 B lines / 8-way / LRU / 9 cycles
Single-node results: summary
• Memory controller integration helps
  – Ocean and FFTW get the maximum benefit
  – LU and Water are largely insensitive
• SMTp is always faster than Base
• SMTp performs on par with Int512KB
  – In a few cases Int512KB outperforms SMTp, by at most 1.6%
• Int64KB suffers from directory cache misses
  – FFTW and Radix-Sort are the most sensitive
Multi-node results: summary
• With increasing system size the integrated models converge in performance
  – IntPerfect keeps a slight edge due to its double memory controller speed
• SMTp continues to deliver excellent performance
  – The gap between Int512KB and SMTp is at most 6%, and zero on average
Resource occupancy: summary
• The protocol thread is active for a very small fraction of the time (low protocol occupancy)
• When active, it can have high peak resource occupancy
• When idle, all of its resources are freed except
  – 31 mapped registers
  – 2 LSQ slots holding switch and ldctxt
• Overall, the protocol thread has very low pipeline overhead
Related work
• Simultaneous multi-threading
  – Assisted execution [HPCA'01][MICRO'01][ISCA'02]
  – Fault tolerance [ASPLOS'00][ISCA'02]
  – User-level message passing [MTEAC'01]
• Programmable protocol engines
  – Customized co-processor (FLASH, S3.mp, STiNG, Piranha)
  – Commodity off-the-shelf processor (Typhoon)
  – On the main processor via a low-overhead interrupt (Chalmers) [ISCA'95]
Conclusions
• First design to exploit SMT to run a directory-based coherence protocol on spare thread contexts
• Delivers performance close to (within 6%, 0% on average) integrated coherence controllers with large (512 KB) stand-alone directory data caches
• Extremely low pipeline overhead
• SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes
Future directions
• Need not be restricted to building DSMs out of commodity nodes only
• Use SMTp to carry out
  – On-the-fly compression/encryption of L2 cache lines
  – Software-controlled address remapping to improve locality of cache access
  – Fault tolerance by selectively extending coherence protocols
• Alternate CMP designs
• Issues with multiple protocol threads
SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida
Protocol occupancy
[Chart: protocol occupancy; 16 nodes, (1 application, 1 protocol) threads per node]
Protocol thread characteristics
[Chart: protocol thread characteristics; 16 nodes, (1 application, 1 protocol) threads per node]