64-bit Scalable Chip Multiprocessor(SCMP) Tongji University
Why SCMP? • Memory access latency is the bottleneck • TLP (thread-level parallelism) is the trend • Flexible: scalable from 1 to 4, up to 16 cores • CPU core is small and simple, easier to verify • Higher throughput • Improved wafer utilization
How SCMP? • Full-custom 64-bit CPU core • On-chip switch • L2 cache and controller • Hardware thread scheduler
SCMP Block Diagram (diagram: four cores, each with I$, D$, register files, integer unit and FPU; thread scheduler; crypto coprocessor; IO; non-blocking crossbar switch; multi-bank L2 cache)
4-Core Architecture Features • Target application: server • 4 multithreaded processor cores • 4 MB multi-bank L2 cache • Non-blocking crossbar switch between cores and L2 cache banks • Directory-based cache coherency • Thread scheduler • Reconfigurable crypto-coprocessor • FB-DIMM memory controller (possibly)
Multithreaded Core Architecture • 64-bit MIPS Instruction Set Architecture • 4 threads, coarse-grained multithreading, only one thread at a time • 16 KB L1 instruction cache, 8 KB (or 16 KB) data cache • 5- to 8-stage pipeline • Includes integer unit, floating-point unit and L1 caches
64-bit CPU Core Features Full custom: • ST 90 nm technology • High speed, 1 GHz • Low power consumption • Small die size • Robust • Used as a hard core
64-bit CPU Core Features Multithreading: • Coarse-grained multithreading • Makes the core design easier • Small, simple core • Only one thread runs at a time • Bottleneck: memory access • When a thread waits for memory, it is switched out • Four threads per core in total • Memory latency is even more severe in a conventional multiprocessor • Memory latency is masked by switching threads
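The switch-on-miss behavior above can be sketched as a simple simulation. This is a minimal illustration, not the SCMP's actual switch logic: the round-robin requeueing and the per-thread hit/miss event streams are assumptions.

```python
# Coarse-grained multithreading sketch: the core runs one hardware
# thread at a time and switches only when that thread misses in the L1.
def run_core(events, num_threads=4):
    """events: per-thread lists of 'hit'/'miss' memory accesses.
    Returns the order in which threads occupied the pipeline."""
    ready = list(range(num_threads))      # threads ready to run
    cursors = [0] * num_threads           # next event per thread
    schedule = []
    while ready:
        tid = ready.pop(0)                # pick the next ready thread
        schedule.append(tid)
        # Run until this thread misses in the L1 or finishes.
        while cursors[tid] < len(events[tid]):
            outcome = events[tid][cursors[tid]]
            cursors[tid] += 1
            if outcome == "miss":
                ready.append(tid)         # requeue once the miss is serviced
                break
    return schedule
```

The point of the policy is that the long miss latency of one thread is overlapped with useful work from the other three, without the complexity of simultaneous multithreading.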
On-chip Interconnection Crossbar: • Increases memory bandwidth • More than one core can access the L2 cache at the same time • Allows higher L2 cache associativity • Easier switch design • Optimized for low latency
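Per-cycle crossbar arbitration can be sketched as follows. Distinct banks are served in parallel (non-blocking for conflict-free traffic), and each bank grants at most one core per cycle; the fixed lowest-core-id-wins priority is an assumed policy, not stated on the slide.

```python
# Crossbar arbitration sketch: each core requests one L2 bank per cycle;
# requests to different banks proceed in parallel, conflicts serialize.
def arbitrate(requests):
    """requests: dict core -> bank. Returns dict bank -> granted core."""
    grants = {}
    for core in sorted(requests):         # fixed priority by core id
        bank = requests[core]
        if bank not in grants:            # bank still free this cycle
            grants[bank] = core
    return grants
```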
L2 Cache (diagram: four L1 caches connect through the crossbar to L2 banks 0-3, each bank with its own memory interface) • Multi-banked • Higher bandwidth • Multiple memory interfaces to main memory
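Banking raises bandwidth because consecutive cache lines map to different banks. A minimal sketch of one plausible mapping, assuming 64-byte lines and the slide's four banks (the exact bit slicing is an assumption):

```python
# Address-to-bank mapping sketch: low-order line-index bits select the
# bank, interleaving consecutive cache lines across the four L2 banks.
LINE_SIZE = 64
NUM_BANKS = 4

def l2_bank(addr):
    line = addr // LINE_SIZE          # drop the line-offset bits
    return line % NUM_BANKS           # interleave lines across banks
```

With this interleaving, a streaming access pattern from one core touches all four banks in turn, so up to four cores can be served by the L2 in the same cycle through the crossbar.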
Cache Coherency Directory-based cache coherency: (diagram: each L2 bank 0-3 is paired with a directory bank, connected to the crossbar) • Tracks the processors that hold copies of a block • Tracks the state of each data block in the L2 cache • Shared • Uncached • Exclusive
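A directory entry with the three states above can be sketched as a small state machine. The method names and the returned invalidation lists are illustrative assumptions; the slide only names the states and the sharer-tracking role.

```python
# Directory entry sketch: one entry per L2 block, tracking its state
# (uncached / shared / exclusive) and the set of cores holding a copy.
class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()          # cores holding a copy

    def read(self, core):
        """A core reads the block; returns cores that must downgrade."""
        downgrades = []
        if self.state == "exclusive":
            downgrades = list(self.sharers)   # owner writes back first
        self.state = "shared"
        self.sharers.add(core)
        return downgrades

    def write(self, core):
        """A core writes the block; returns cores to invalidate."""
        invalidations = [c for c in self.sharers if c != core]
        self.state = "exclusive"
        self.sharers = {core}
        return invalidations
```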
Thread Scheduler • Dispatches threads • Hardware logic coupled with the OS • Thread switch • On L1 cache miss • Load balance (hardware counters) • L1 cache hits / misses • Core pipeline idle • Configures the crypto-coprocessor
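Counter-driven load balancing can be sketched as picking a dispatch target from per-core statistics. The specific policy below (prefer free thread slots, break ties by fewer L1 misses) is an assumption; the slide only says hardware counters guide load balance.

```python
# Dispatch sketch: choose the core with the most free thread slots,
# breaking ties by the lowest L1-miss counter (less memory-bound core).
def pick_core(free_slots, miss_counters):
    """free_slots / miss_counters: one value per core; returns a core id."""
    cores = range(len(free_slots))
    return min(cores, key=lambda c: (-free_slots[c], miss_counters[c]))
```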
Reconfigurable Crypto-coprocessor • Supports encryption and decryption for symmetric algorithms: • AES • DES, 3DES, GDES • RCx
Reconfigurable Crypto-coprocessor (diagram: one crypto coprocessor shared by the four 4-thread full-custom cores, attached to the thread scheduler) • Reconfiguration is controlled by the thread scheduler
OS / Software • Uses a commercial operating system such as Linux • Minimize OS/compiler modification • Almost no change to the OS or compiler • Optimize the compiler to improve machine-code efficiency
FB-DIMM (possibly) • Serial data path • Latency is managed with new channel features • Cost-effective • The server memory of the future