
64-bit Scalable Chip Multiprocessor (SCMP)



Presentation Transcript


  1. 64-bit Scalable Chip Multiprocessor (SCMP) • Tongji University

  2. Why SCMP? • Memory access latency is the bottleneck • TLP (thread-level parallelism) is the trend • Flexible: scalable from 1 to 4, up to 16 cores • CPU core is small and simple, so easier to verify • Higher throughput • Improved wafer utilization

  3. How SCMP? • Full-custom 64-bit CPU core • On-chip switch • L2 cache and controller • Hardware thread scheduler

  4. SCMP Block Diagram • [Figure; labels recovered from the diagram: 4 × I$, 16 × RF, 4 × Int., 4 × FPU, 4 × D$, Thread Scheduler, Crypto Coprocessor, IO, Non-Blocking Crossbar Switch, Multi-bank L2 Cache]

  5. 4-Core Architecture Features • Target application: server • 4 multi-thread processor cores • 4 MB L2 cache, multi-bank • Non-blocking crossbar switch between cores and L2 cache banks • Directory-based cache coherency • Thread scheduler • Reconfigurable crypto-coprocessor • FB-DIMM memory controller (possibly)

  6. Multi-thread Core Architecture • 64-bit MIPS instruction set architecture • 4 threads, coarse multithreading, only one thread at a time • 16 KB L1 instruction cache, 8 KB (or 16 KB) data cache • 5-8 stage pipeline • Includes integer unit, floating-point unit and L1 cache

  7. 64-bit CPU Core Feature: Full custom • ST 90 nm technology • High speed, 1 GHz • Low power consumption • Small die size • Robust • Used as hard core

  8. 64-bit CPU Core Feature: Multithreading • Coarse multithreading • Makes the core design easier • Small, simple core • Only one thread at a time • Bottleneck: memory access • When a thread waits for memory, the core switches to another thread • 4 threads in total per core • Memory latency is more severe in a common multiprocessor • Memory latency is masked by switching threads
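The switch-on-miss behaviour described on slide 8 can be illustrated with a small cycle-count model. This is only a sketch under stated assumptions, not the SCMP design: the names `HwThread` and `run_core`, the 20-cycle miss latency, the one-instruction-per-cycle issue rate and the zero-cost thread switch are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class HwThread:
    """One hardware thread context (hypothetical model, not from the slides)."""
    tid: int
    trace: list          # one flag per instruction: True means it misses in L1
    pc: int = 0          # index of the next instruction to issue
    ready_at: int = 0    # cycle at which an outstanding miss is resolved

    def done(self):
        return self.pc >= len(self.trace)


def run_core(threads, miss_latency=20):
    """Coarse multithreading: stay on one thread until it misses, then switch.

    Returns (total_cycles, busy_cycles). Assumes one instruction per cycle
    and a free thread switch, which a real pipeline would not have.
    """
    cycle = busy = 0
    current = 0
    while not all(t.done() for t in threads):
        # Prefer the current thread, then the others in round-robin order.
        order = threads[current:] + threads[:current]
        runnable = [t for t in order if not t.done() and t.ready_at <= cycle]
        if not runnable:
            cycle += 1   # every thread is waiting on memory: pipeline idles
            continue
        t = runnable[0]
        current = threads.index(t)
        missed = t.trace[t.pc]
        t.pc += 1
        busy += 1
        cycle += 1
        if missed:
            # Block this thread and hand the pipeline to the next one.
            t.ready_at = cycle + miss_latency
            current = (current + 1) % len(threads)
    return cycle, busy


# Same 8 instructions, run as one thread or spread over four threads.
single = run_core([HwThread(0, [True, False] * 4)])
four = run_core([HwThread(i, [True, False]) for i in range(4)])
print(single, four)
```

In this toy model the four-thread run overlaps the miss latencies of the threads, so the same amount of work finishes in far fewer cycles than the single-thread run, which is the latency-masking effect the slide refers to.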

  9. Performance gap between processor and memory

  10. Multithreading

  11. Multithreading Multiprocessor

  12. On-chip interconnection: Crossbar • Increases memory bandwidth • More than one core can possibly access the L2 cache • Higher L2 cache associativity • Easier switch design • Optimized for low latency

  13. L2 Cache • [Figure; labels recovered from the diagram: 4 × L1, Crossbar, L2 Bank 0 to L2 Bank 3, 4 × Interface] • Multi-banked • Higher bandwidth • Multiple memory interfaces to main memory
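To illustrate why a multi-banked L2 behind a crossbar gives higher bandwidth, the sketch below maps addresses to banks and counts how many simultaneous requests would have to wait. The slides do not state the line size or the bank-selection function, so the 64-byte line, the line-interleaved mapping, and the names `l2_bank` and `bank_waits` are assumptions made only for this example.

```python
LINE_BYTES = 64   # assumed cache-line size (not given in the slides)
NUM_BANKS = 4     # four L2 banks, as in the slide's diagram


def l2_bank(addr):
    """Assumed line-interleaved mapping: consecutive lines go to consecutive banks."""
    return (addr // LINE_BYTES) % NUM_BANKS


def bank_waits(addrs):
    """Count requests issued in the same cycle that must wait for a busy bank.

    Requests to different banks proceed in parallel through the crossbar;
    a second request to an already claimed bank is serialized.
    """
    busy = set()
    waits = 0
    for addr in addrs:
        bank = l2_bank(addr)
        if bank in busy:
            waits += 1
        else:
            busy.add(bank)
    return waits


# Four cores touching four consecutive lines: all banks serve in parallel.
print(bank_waits([0, 64, 128, 192]))
# Three cores touching lines that map to bank 0: two of them wait.
print(bank_waits([0, 256, 512, 64]))
```

With a single-banked cache every simultaneous request after the first would wait; with four banks only requests that collide on the same bank do.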

  14. Cache Coherency: Directory-Based Cache Coherency • [Figure; labels recovered from the diagram: Crossbar, L2 Bank 0 to L2 Bank 3, Directory Bank 0 to Directory Bank 3] • Tracks the processors that have copies of the block • Tracks the state of each data block in the L2 cache • Shared • Uncached • Exclusive
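The directory bookkeeping on slide 14 (a sharer list plus one of the states Shared, Uncached, Exclusive per block) can be sketched as follows. The slides do not give the actual protocol, so this uses the generic textbook three-state directory scheme as a stand-in; the class `DirectoryEntry` and the message names `fetch`, `invalidate` and `fetch_invalidate` are hypothetical.

```python
from enum import Enum


class State(Enum):
    UNCACHED = "Uncached"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"


class DirectoryEntry:
    """Directory state for one L2 block (textbook scheme, assumed here)."""

    def __init__(self):
        self.state = State.UNCACHED
        self.sharers = set()   # cores that hold a copy of the block

    def read_miss(self, core):
        """Handle a read miss from `core`; return the messages the directory sends."""
        msgs = []
        if self.state is State.EXCLUSIVE:
            (owner,) = self.sharers
            msgs.append(("fetch", owner))   # owner supplies the up-to-date data
        self.sharers.add(core)
        self.state = State.SHARED
        return msgs

    def write_miss(self, core):
        """Handle a write miss from `core`; all other copies must go away."""
        others = sorted(self.sharers - {core})
        if self.state is State.EXCLUSIVE:
            msgs = [("fetch_invalidate", c) for c in others]
        else:
            msgs = [("invalidate", c) for c in others]
        self.sharers = {core}
        self.state = State.EXCLUSIVE
        return msgs


entry = DirectoryEntry()
entry.read_miss(0)
entry.read_miss(1)
print(entry.write_miss(2), entry.state.value, entry.sharers)
```

Keeping one directory bank next to each L2 bank, as the figure labels suggest, means a request that reaches a bank through the crossbar can consult the sharer list locally instead of broadcasting to every core.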

  15. Thread Scheduler • Dispatches threads • Hardware logic coupled with the OS • Thread switch • On an L1 cache miss • Load balancing (hardware counters) • L1 cache hit / miss • Core pipeline idle • Configures the crypto-coprocessor
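Slide 15 says the scheduler balances load using hardware counters for L1 hits/misses and pipeline idle time, but it does not say how the counters are combined. The sketch below shows one possible dispatch policy purely as an illustration: prefer the core with the most idle cycles and break ties by the fewest L1 misses. The policy, the `CoreCounters` fields and the function `pick_core` are all assumptions, not the SCMP scheduler.

```python
from dataclasses import dataclass


@dataclass
class CoreCounters:
    """Snapshot of per-core hardware counters (field names are hypothetical)."""
    core_id: int
    l1_misses: int          # L1 misses in the last sampling window
    idle_cycles: int        # cycles the pipeline was idle in that window
    free_thread_slots: int  # unused hardware thread contexts (0 to 4)


def pick_core(counters):
    """Choose a core for a new thread, or None if every context is taken.

    Assumed policy: most idle cycles first, then fewest L1 misses,
    then lowest core id.
    """
    candidates = [c for c in counters if c.free_thread_slots > 0]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda c: (c.idle_cycles, -c.l1_misses, -c.core_id))
    return best.core_id


cores = [
    CoreCounters(0, l1_misses=50, idle_cycles=100, free_thread_slots=1),
    CoreCounters(1, l1_misses=10, idle_cycles=300, free_thread_slots=0),
    CoreCounters(2, l1_misses=5, idle_cycles=100, free_thread_slots=2),
    CoreCounters(3, l1_misses=70, idle_cycles=40, free_thread_slots=4),
]
print(pick_core(cores))
```

Core 1 is the most idle but has no free thread context, so in this example the choice falls between cores 0 and 2, and the lower miss count decides.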

  16. Reconfigurable Crypto-coprocessor • Supports encryption and decryption with symmetric algorithms: • AES • DES, 3DES, GDES • RCx

  17. Reconfigurable Crypto-coprocessor • [Figure; labels recovered from the diagram: Thread Scheduler, 4 × 4-Thread Full Custom Core, Crypto Coprocessor] • Reconfiguration is controlled by the thread scheduler

  18. OS / Software • Uses a commercial operating system, such as Linux • Minimize OS/compiler modification • Almost no change in OS/compiler • Optimizing compiler to improve machine code efficiency

  19. FB-DIMM (possibly) • [Figure; labels recovered from the diagram: 4 × Interface, FBDIMM Interface] • Serial data path • Latency is managed with new channel features • Cost-effective • Server memory in the future

  20. Thank you!
