SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida
Scalable multi-threading
• Directory-based hardware DSM
• Directory-based coherence requires complex memory controllers (MCs)
• So complex that MCs can be made programmable, with embedded protocol processors
• Integrated memory controllers are commonplace in high-end microprocessors
• Servers are naturally NUMA/DSM, not SMP
• Snooping is awkward and bandwidth-limited
This talk: build a directory-based scalable DSM with nominal changes to a standard MC
Two major goals
• Directory-based coherence without a directory controller
  – still scalable
  – can use less complex, standard memory controllers
• Flexibility to run custom protocol code, or any software sequence, to do "interesting things" on cache misses
  – compression/encryption
  – fault tolerance
Outline
• Introducing SMTp
• Basic extensions for SMTp
• Deadlock avoidance
• Evaluation methodology
• Simulation results
• Related work
• Conclusions
Introducing SMTp
• SMTp: SMT with a protocol thread context
• The protocol thread executes the control part of the coherence protocol in parallel with the SDRAM data access
• Provides the flexibility to run custom software sequences on cache misses [motivation #1]
• Still uses the standard MC (no directory state machine) [motivation #2]
• Build large-scale directory-based DSM out of commodity nodes with an integrated MC and SMTp
Basic extensions for SMTp
[Pipeline block diagram: a conventional SMT front end and back end (ICFE, decode, rename, issue queues, register file, ALU/FPU, LSQ, AGU, L1 caches, unified L2) augmented for SMTp with a 1-bit PPCV flag, a 16x64B instruction bypass buffer (IBB), a 16x32B data bypass buffer (DBB), a 16x128B L2 bypass buffer, and an LDCTXT_ID path; uncached loads/stores, protocol misses, and application misses are routed to the integrated memory controller.]
Memory controller for SMTp
[Block diagram: the integrated memory controller receives uncached loads/stores, protocol misses, application misses, and LDCTXT_ID from the core; a handler-dispatch unit indexes on the message address/header, backed by an 8x128B protocol data buffer; local misses enter through the Local Miss Interface, remote traffic flows through the network interface (NI In/Out) to the router, and data is supplied by SDRAM.]
Enabling a protocol thread
• Statically bound to a thread context
• Needs an extra thread context (PC, RAS, register map)
• No context switch
• Not visible to the kernel
• Protocol code is provided by the system (conventional DSM style)
• The user cannot download arbitrary code into protocol memory
Anatomy of a protocol handler
• MIPS-style RISC ISA
• Short sequence of instructions:
  Calculate directory address      // simple hash function
  Load directory entry             // normal cached load
  Compute on header and directory  // integer arithmetic
  Send cache line/control message  // uncached stores
  switch r17                       // uncached load (header)
  ldctxt r18                       // uncached load (address)
(A sketch of this control flow follows below.)
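To make the shape of such a handler concrete, here is a minimal C sketch of the control flow for one handler: a read miss arriving at the home node. The real handlers are short MIPS-style assembly sequences; the directory layout and the helper names (dir_hash, send_data_reply, forward_to_owner) are illustrative assumptions, not the actual SMTp protocol code.

#include <stdint.h>
#include <stdio.h>

/* Toy directory entry: a presence bit vector plus a dirty bit. */
typedef struct {
    uint64_t sharers;   /* which nodes hold the line           */
    uint8_t  dirty;     /* line is owned exclusively elsewhere */
} dir_entry_t;

static dir_entry_t directory[1024];                /* toy directory store     */

static dir_entry_t *dir_hash(uint64_t line_addr)   /* 1. simple hash function */
{
    return &directory[(line_addr >> 7) % 1024];
}

/* Stand-ins for the uncached stores that push replies into the network
 * interface / protocol data buffers. */
static void send_data_reply(int requester, uint64_t line_addr)
{
    printf("data reply for line 0x%llx -> node %d\n",
           (unsigned long long)line_addr, requester);
}

static void forward_to_owner(int owner, int requester, uint64_t line_addr)
{
    printf("forward line 0x%llx: owner %d, requester %d\n",
           (unsigned long long)line_addr, owner, requester);
}

/* One handler: a read miss arriving at the home node. */
void handle_local_read_miss(int requester, uint64_t line_addr)
{
    dir_entry_t *dir = dir_hash(line_addr);     /* 1. calculate directory address      */
    dir_entry_t entry = *dir;                   /* 2. load entry (normal cached load)  */

    if (!entry.dirty) {                         /* 3. compute on header + directory    */
        send_data_reply(requester, line_addr);  /* 4. uncached store(s) to reply       */
        entry.sharers |= 1ull << requester;
    } else {
        int owner = __builtin_ctzll(entry.sharers);
        forward_to_owner(owner, requester, line_addr);
    }
    *dir = entry;                               /* write the directory entry back      */

    /* In the real handler, 'switch' and 'ldctxt' (uncached loads) would now
     * pick up the next message header and address and dispatch the next
     * handler; the protocol thread goes idle if nothing is pending. */
}

int main(void)
{
    handle_local_read_miss(3, 0x40080ull);      /* clean line: home replies with data */
    return 0;
}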
Fetching from protocol thread
[Animated block diagram, spread over several slides: the handler-dispatch logic in the memory controller uses the incoming message address and header to index a jump table and produce the protocol PC (PPC), setting PPCV; the front end (ICFE) then fetches the handler, the waiting switch in the LSQ is unblocked with the message header, and ldctxt executes to pick up the message address; the same sequence is shown for a miss arriving at the home node through the Local Miss Interface (LMI). Other labeled components: SDRAM, network interface (NI), router, front-side bus.]
Fetching from protocol thread
• Protocol code/data resides in an unmapped portion of local SDRAM
• No ITLB access
• Shares the instruction cache with the application thread(s)
• The fetcher turns off PPCV after the last handler instruction is fetched
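As a rough illustration of the fetch-side mechanism, the following simulator-style C sketch shows how the protocol PC (PPC) and its valid bit (PPCV) could steer handler fetch. All variable and helper names are assumptions made for illustration, and for simplicity the sketch fetches for only one context per cycle, whereas a real SMT front end interleaves threads.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PROTOCOL_CTX 4            /* the extra, statically bound protocol context */

static uint64_t ppc;              /* protocol program counter (written by handler dispatch) */
static bool     ppcv;             /* protocol PC valid bit                                   */
static uint64_t handler_end_pc;   /* PC of the last instruction of the current handler       */

/* Trivial stand-ins for the rest of the front end. */
static int      select_app_thread(void) { return 0; }                    /* e.g. ICOUNT */
static uint64_t app_pc(int thread)      { return 0x1000 + 0x100 * (uint64_t)thread; }
static void     icache_fetch(uint64_t pc, int ctx)
{
    printf("fetch PC 0x%llx for context %d\n", (unsigned long long)pc, ctx);
}

void fetch_cycle(void)
{
    if (ppcv) {
        /* Handler fetch: PPC is used directly. Protocol code sits in an
         * unmapped region of local SDRAM, so no ITLB lookup is needed;
         * the fetch still goes through the shared instruction cache. */
        icache_fetch(ppc, PROTOCOL_CTX);

        if (ppc == handler_end_pc)
            ppcv = false;          /* last handler instruction fetched: clear PPCV */
        else
            ppc += 4;              /* MIPS-style fixed-width instructions */
    } else {
        /* No handler pending: fetch for an application thread as usual. */
        int t = select_app_thread();
        icache_fetch(app_pc(t), t);
    }
}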
Handling protocol loads/stores
• No DTLB access
• Shares the L1 data and L2 caches
• An L2 cache miss from the protocol thread behaves differently:
  – it bypasses the Local Miss Interface
  – it talks to the local SDRAM directly
(A sketch of this routing decision follows below.)
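To make the routing concrete, here is a hypothetical C sketch of the decision, again in a simulator style; the function names (lmi_dispatch_handler, sdram_access) are illustrative, not taken from the paper.

#include <stdint.h>
#include <stdio.h>

typedef enum { APP_THREAD, PROTOCOL_THREAD } thread_kind_t;

/* Stand-ins for the two destinations an L2 miss can take. */
static void lmi_dispatch_handler(uint64_t addr)   /* Local Miss Interface */
{
    printf("LMI: dispatch protocol handler for 0x%llx\n", (unsigned long long)addr);
}

static void sdram_access(uint64_t addr)           /* direct SDRAM access  */
{
    printf("SDRAM: direct access to 0x%llx\n", (unsigned long long)addr);
}

/* Route an L2 miss depending on which kind of thread issued it. */
void route_l2_miss(thread_kind_t who, uint64_t addr)
{
    if (who == APP_THREAD) {
        /* An application miss enters the Local Miss Interface, which builds
         * the request header and dispatches the matching protocol handler. */
        lmi_dispatch_handler(addr);
    } else {
        /* A miss issued by the protocol thread itself (e.g. a directory-entry
         * load) must bypass the LMI -- otherwise servicing it would need yet
         * another handler -- and talks to local SDRAM directly. */
        sdram_access(addr);
    }
}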
Deadlock with shared resources
• Progress of an application L2 miss depends on the progress of the protocol thread
• Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache indices
[Diagram: the ROB retire pointer is stuck at an application LOAD with an outstanding L2 miss, while a protocol instruction of the local miss handler is blocked because the integer queue (IQ) is full, so the miss can never be serviced.]
Solving resource deadlock
• General solution: reserve one instance of each shared resource for the protocol thread
  – e.g. out of 8 decode queue slots the application threads get 7, while all 8 are open to the protocol thread
  – Easier solution: Pentium 4-style static resource partitioning
• Cache index conflicts:
  – Solution: L1 and L2 bypass buffers (fully associative, LRU)
  – Allocate a bypass buffer entry instead of stalling on the conflicting index
  – Lookup is done in parallel: hit latency unchanged
(The reserved-instance allocation policy is sketched below.)
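The reserved-instance policy can be illustrated with a small C sketch; the structure below stands in for any of the shared resources listed above (decode queue, issue queue, MSHRs, ...), and all names are assumptions made for illustration, not the hardware's actual allocation logic.

#include <stdbool.h>

#define NUM_SLOTS 8               /* e.g. decode queue slots */

typedef enum { APP_THREAD, PROTOCOL_THREAD } thread_kind_t;

static int slots_in_use;          /* current occupancy of the shared structure */

/* Try to allocate one slot of a shared resource. Application threads may
 * together fill at most NUM_SLOTS - 1 entries, so one instance is always
 * left for the protocol thread and it can always make forward progress. */
bool try_allocate_slot(thread_kind_t who)
{
    int limit = (who == PROTOCOL_THREAD) ? NUM_SLOTS : NUM_SLOTS - 1;

    if (slots_in_use >= limit)
        return false;             /* caller stalls and retries later */

    slots_in_use++;
    return true;
}

void free_slot(void)
{
    if (slots_in_use > 0)
        slots_in_use--;
}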
SMTp: deadlock solution
[The same pipeline block diagram as in "Basic extensions for SMTp", highlighting the structures that avoid deadlock: the 1-bit PPCV flag, the instruction and data bypass buffers (IBB 16x64B, DBB 16x32B), and the 16x128B L2 bypass buffer between the caches and the integrated memory controller.]
Evaluation methodology
• Applications
  – SPLASH-2: FFT, LU, Radix, Ocean, Water
  – FFTW
• Simulated machine model (details in the paper)
  – 2 GHz, 9 pipeline stages
  – 1, 2, or 4 application threads + one protocol context
  – ROB: 128 entries (per thread)
  – Integer/floating-point registers: 160/192/256
  – L1 I-cache: 32 KB / 64 B lines / 2-way / LRU / 1 cycle
  – L1 D-cache: 32 KB / 32 B lines / 2-way / LRU / 1 cycle
  – Unified L2: 2 MB / 128 B lines / 8-way / LRU / 9 cycles
Single-node results: summary
• Memory controller integration helps
  – Ocean and FFTW get the maximum benefit
  – LU and Water are largely insensitive
• SMTp is always faster than Base
• SMTp performs on par with Int512KB
  – In a few cases Int512KB outperforms SMTp, by at most 1.6%
• Int64KB suffers from directory cache misses
  – FFTW and Radix-Sort are the most sensitive
Multi-node results: summary
• With increasing system size the integrated models converge in performance
  – IntPerfect keeps a slight edge due to its double memory controller speed
• SMTp continues to deliver excellent performance
  – The gap between Int512KB and SMTp is at most 6%, and zero on average
Resource occupancy: summary
• The protocol thread is active for a very small fraction of the time (low protocol occupancy)
• When active, it can have high peak resource occupancy
• When idle, all of its resources are freed except
  – 31 mapped registers
  – 2 LSQ slots holding switch and ldctxt
• Overall, the protocol thread has very low pipeline overhead
Related work
• Simultaneous multi-threading
  – Assisted execution [HPCA'01][MICRO'01][ISCA'02]
  – Fault tolerance [ASPLOS'00][ISCA'02]
  – User-level message passing [MTEAC'01]
• Programmable protocol engines
  – Customized co-processor (FLASH, S3.mp, STiNG, Piranha)
  – Commodity off-the-shelf processor (Typhoon)
  – On the main processor via a low-overhead interrupt (Chalmers) [ISCA'95]
Conclusions
• First design to exploit SMT to run a directory-based coherence protocol on spare thread contexts
• Delivers performance close to (within 6%, 0% on average) integrated coherence controllers with large (512 KB) stand-alone directory data caches
• Extremely low pipeline overhead
• SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes
Future directions
• Need not be restricted to building DSMs out of commodity nodes only
• Use SMTp to carry out
  – On-the-fly compression/encryption of L2 cache lines
  – Software-controlled address remapping to improve locality of cache access
  – Fault tolerance by selectively extending coherence protocols
• Alternate CMP designs
• Issues with multiple protocol threads
SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida
Protocol occupancy
[Chart: protocol occupancy; 16 nodes, (1 application, 1 protocol) threads per node]
Protocol thread characteristics
[Chart: protocol thread characteristics; 16 nodes, (1 application, 1 protocol) threads per node]