64-bit Scalable Chip Multiprocessor(SCMP) Tongji University
Why SCMP? • Memory access latency is the bottleneck • TLP (thread-level parallelism) is the trend • Flexible: scalable from 1 to 4, up to 16 cores • CPU core is small and simple, easier to verify • Higher throughput • Improved wafer utilization
How SCMP? • Full-custom 64-bit CPU core • On-chip switch • L2 cache and controller • Hardware thread scheduler
SCMP Block Diagram (diagram: four cores, each with I$, D$, register files, integer unit and FPU; thread scheduler; crypto coprocessor; IO; non-blocking crossbar switch; multi-bank L2 cache)
4-Core Architecture Features • Target application: server • 4 multithreaded processor cores • 4 MB multi-bank L2 cache • Non-blocking crossbar switch between cores and L2 cache banks • Directory-based cache coherency • Thread scheduler • Reconfigurable crypto-coprocessor • FB-DIMM memory controller (possibly)
Multithreaded Core Architecture • 64-bit MIPS Instruction Set Architecture • 4 threads, coarse-grained multithreading, only one thread at a time • 16 KB L1 instruction cache, 8 KB (or 16 KB) data cache • 5- to 8-stage pipeline • Includes integer unit, floating-point unit and L1 caches
64-bit CPU Core Features Full custom: • ST 90 nm technology • High speed, 1 GHz • Low power consumption • Small die size • Robust • Used as a hard core
64-bit CPU Core Features Multithreading: • Coarse-grained multithreading • Makes the core design easier • Small, simple core • Only one thread runs at a time • Bottleneck: memory access • When a thread waits for memory, it is switched out • Four threads per core in total • Memory latency is even more severe in a conventional multiprocessor • Memory latency is masked by switching threads
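The switch-on-miss behavior above can be sketched as a simple simulation. This is a minimal illustration, not the SCMP's actual switch logic: the round-robin requeueing and the per-thread hit/miss event streams are assumptions.

```python
# Coarse-grained multithreading sketch: the core runs one hardware
# thread at a time and switches only when that thread misses in the L1.
def run_core(events, num_threads=4):
    """events: per-thread lists of 'hit'/'miss' memory accesses.
    Returns the order in which threads occupied the pipeline."""
    ready = list(range(num_threads))      # threads ready to run
    cursors = [0] * num_threads           # next event per thread
    schedule = []
    while ready:
        tid = ready.pop(0)                # pick the next ready thread
        schedule.append(tid)
        # Run until this thread misses in the L1 or finishes.
        while cursors[tid] < len(events[tid]):
            outcome = events[tid][cursors[tid]]
            cursors[tid] += 1
            if outcome == "miss":
                ready.append(tid)         # requeue once the miss is serviced
                break
    return schedule
```

The point of the policy is that the long miss latency of one thread is overlapped with useful work from the other three, without the complexity of simultaneous multithreading.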
On-chip Interconnection Crossbar: • Increases memory bandwidth • More than one core can access the L2 cache at the same time • Allows higher L2 cache associativity • Easier switch design • Optimized for low latency
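Per-cycle crossbar arbitration can be sketched as follows. Distinct banks are served in parallel (non-blocking for conflict-free traffic), and each bank grants at most one core per cycle; the fixed lowest-core-id-wins priority is an assumed policy, not stated on the slide.

```python
# Crossbar arbitration sketch: each core requests one L2 bank per cycle;
# requests to different banks proceed in parallel, conflicts serialize.
def arbitrate(requests):
    """requests: dict core -> bank. Returns dict bank -> granted core."""
    grants = {}
    for core in sorted(requests):         # fixed priority by core id
        bank = requests[core]
        if bank not in grants:            # bank still free this cycle
            grants[bank] = core
    return grants
```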
L2 Cache (diagram: four L1 caches connect through the crossbar to L2 banks 0-3, each bank with its own memory interface) • Multi-banked • Higher bandwidth • Multiple memory interfaces to main memory
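Banking raises bandwidth because consecutive cache lines map to different banks. A minimal sketch of one plausible mapping, assuming 64-byte lines and the slide's four banks (the exact bit slicing is an assumption):

```python
# Address-to-bank mapping sketch: low-order line-index bits select the
# bank, interleaving consecutive cache lines across the four L2 banks.
LINE_SIZE = 64
NUM_BANKS = 4

def l2_bank(addr):
    line = addr // LINE_SIZE          # drop the line-offset bits
    return line % NUM_BANKS           # interleave lines across banks
```

With this interleaving, a streaming access pattern from one core touches all four banks in turn, so up to four cores can be served by the L2 in the same cycle through the crossbar.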
Cache Coherency Directory-based cache coherency: (diagram: each L2 bank 0-3 is paired with a directory bank, connected to the crossbar) • Tracks the processors that hold copies of a block • Tracks the state of each data block in the L2 cache • Shared • Uncached • Exclusive
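A directory entry with the three states above can be sketched as a small state machine. The method names and the returned invalidation lists are illustrative assumptions; the slide only names the states and the sharer-tracking role.

```python
# Directory entry sketch: one entry per L2 block, tracking its state
# (uncached / shared / exclusive) and the set of cores holding a copy.
class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()          # cores holding a copy

    def read(self, core):
        """A core reads the block; returns cores that must downgrade."""
        downgrades = []
        if self.state == "exclusive":
            downgrades = list(self.sharers)   # owner writes back first
        self.state = "shared"
        self.sharers.add(core)
        return downgrades

    def write(self, core):
        """A core writes the block; returns cores to invalidate."""
        invalidations = [c for c in self.sharers if c != core]
        self.state = "exclusive"
        self.sharers = {core}
        return invalidations
```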
Thread Scheduler • Dispatches threads • Hardware logic coupled with the OS • Thread switch • On L1 cache miss • Load balance (hardware counters) • L1 cache hits / misses • Core pipeline idle • Configures the crypto-coprocessor
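Counter-driven load balancing can be sketched as picking a dispatch target from per-core statistics. The specific policy below (prefer free thread slots, break ties by fewer L1 misses) is an assumption; the slide only says hardware counters guide load balance.

```python
# Dispatch sketch: choose the core with the most free thread slots,
# breaking ties by the lowest L1-miss counter (less memory-bound core).
def pick_core(free_slots, miss_counters):
    """free_slots / miss_counters: one value per core; returns a core id."""
    cores = range(len(free_slots))
    return min(cores, key=lambda c: (-free_slots[c], miss_counters[c]))
```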
Reconfigurable Crypto-coprocessor • Supports encryption and decryption for symmetric algorithms: • AES • DES, 3DES, GDES • RCx
Reconfigurable Crypto-coprocessor (diagram: one crypto coprocessor shared by the four 4-thread full-custom cores, attached to the thread scheduler) • Reconfiguration is controlled by the thread scheduler
OS / Software • Uses a commercial operating system such as Linux • Minimize OS/compiler modification • Almost no change to the OS or compiler • Optimize the compiler to improve machine-code efficiency
FB-DIMM (possibly) • Serial data path • Latency is managed with new channel features • Cost-effective • The server memory of the future