SINGLE CHIP MULTIPROCESSORS Computer Architecture Term Paper (11.12.2003) Esra KIRBAŞ 2002701357


Evaluation of Design Alternatives for a Multiprocessor Microprocessor By Basem A. Nayfeh, Lance Hammond and Kunle Olukotun. ISCA 23, 1996, pp. 67-77. 2/36

With advanced integrated-circuit technology, several options for the design of high-performance microprocessors are available. • In the multiprocessor design option, a small number of processors are interconnected on a single chip or on a multi-chip-module (MCM) substrate. • We concentrate on single-chip multiprocessors. 3/36

Our goal is to study two proposed cache-sharing mechanisms for single-chip multiprocessors: • Shared Level-1 (L1) Cache Architecture • Shared Level-2 (L2) Cache Architecture • (The performance of these two architectures will be compared with that of a single-bus-based shared-memory multiprocessor.) 4/36

A multiprocessor architecture whose interconnect is closer to the CPUs in the memory hierarchy will be able to exploit fine-grained parallelism more efficiently than a multiprocessor architecture whose interconnect is further away from the CPUs in the memory hierarchy. • Try to achieve good performance on fine-grained parallel applications without sacrificing the performance of parallel independent jobs. 5/36

CPU CHARACTERISTICS • The same CPU is used in all three architectures. • 2-way issue processor • Dynamic scheduling • Speculative execution • Non-blocking caches 6/36

Instruction Pipeline Functional Units 7/36

16 KB two-way set-associative instruction and data caches • 32-entry centralized instruction window • 32-entry reorder buffer. 8/36

Shared L1-Cache Multiprocessor 9/36

Advantages of this Architecture: • It provides the lowest latency for interprocessor communication by using a shared-memory address space. • Low latency for interprocessor communication helps to achieve high performance in executing fine-grained parallel applications. • Processors may fetch shared data into the cache for each other. • It eliminates the cache coherence logic and implicitly provides a sequentially consistent memory without sacrificing performance. 10/36

Disadvantages of this Architecture: • The crossbar switch increases the access time of the L1 cache. (We assume an average access time of three cycles.) • All memory references go to the shared L1 cache, so bank conflicts may cause extra delays. • If the processors are not executing fine-grained parallel applications, the miss rate will increase. 11/36
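The bank-conflict delay mentioned above can be illustrated with a minimal sketch. The line size, the number of banks, and the line-interleaved address mapping are illustrative assumptions, not parameters taken from the slides.

```python
from collections import Counter

LINE_BYTES = 32   # assumed cache-line size in bytes
NUM_BANKS = 4     # assumed number of banks in the shared L1 cache

def bank_of(addr):
    """Map a byte address to a bank by interleaving consecutive lines."""
    return (addr // LINE_BYTES) % NUM_BANKS

def extra_cycles(addrs):
    """Extra cycles caused by same-cycle requests colliding on one bank.

    Assumes each bank serves one request per cycle, so n requests to the
    same bank cost n - 1 extra cycles; the busiest bank sets the delay.
    """
    if not addrs:
        return 0
    per_bank = Counter(bank_of(a) for a in addrs)
    return max(per_bank.values()) - 1

# Four CPUs touching four consecutive lines: one request per bank.
no_conflict = extra_cycles([0, 32, 64, 96])
# Three CPUs hitting lines that all map to bank 0.
conflict = extra_cycles([0, 128, 256, 32])
```

Under these assumptions the first access pattern incurs no delay, while the second serializes three requests on one bank.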

The secondary cache and main memory are similar to those of a uniprocessor system. • L2 cache: 2 MB, 10-cycle latency, 2-cycle occupancy • Main memory: 50-cycle latency, 6-cycle occupancy 12/36

Shared L2-Cache Multiprocessor 13/36

The write-through primary caches have an access time of 1 cycle. • The latency of the L2 cache increases to 14 cycles due to the crossbar overhead. 14/36

The L2 cache has four independent banks to increase its bandwidth and enable it to support four independent access streams. • The data path is 64 bits wide. • Occupancy is 4 cycles (for the transfer of a 32-byte cache line). 15/36
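The 4-cycle occupancy follows from the data-path width. The sketch below assumes one data-path-wide transfer per cycle and a 32-byte cache line; the function name is hypothetical.

```python
def transfer_cycles(line_bytes, datapath_bits):
    """Cycles needed to move one cache line over the data path.

    Assumes one data-path-wide transfer per cycle (an assumption,
    not a figure stated on the slide).
    """
    datapath_bytes = datapath_bits // 8
    # Ceiling division: a partial final transfer still takes a cycle.
    return -(-line_bytes // datapath_bytes)

occupancy = transfer_cycles(line_bytes=32, datapath_bits=64)
```

With a 64-bit (8-byte) data path, a 32-byte line needs 32 / 8 = 4 transfers, which matches the stated occupancy.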

Only memory accesses that miss in the L1 cache are affected by the reduced performance of the L2 cache. • MCM (multi-chip module) technology can be used (for 1996). • Main memory: 50-cycle latency, 6-cycle occupancy 16/36

To keep the primary caches coherent, a coherence protocol is needed. • For simplicity, we assume that each primary cache uses a write-through policy for shared data. • Additional hardware is required to support this. 17/36
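A minimal sketch of how write-through primary caches can stay coherent through a shared L2 cache. The slide only states the write-through policy; invalidating the other caches' copies on a write is an assumption made here for illustration, and the class and method names are hypothetical.

```python
class SharedL2System:
    """Toy model: private write-through L1 caches over a shared L2."""

    def __init__(self, num_cpus):
        self.l2 = {}                                  # shared L2: addr -> value
        self.l1 = [dict() for _ in range(num_cpus)]   # one private L1 per CPU

    def read(self, cpu, addr):
        # On an L1 miss, fill from the shared L2 (unwritten data reads as 0).
        if addr not in self.l1[cpu]:
            self.l1[cpu][addr] = self.l2.get(addr, 0)
        return self.l1[cpu][addr]

    def write(self, cpu, addr, value):
        self.l1[cpu][addr] = value
        self.l2[addr] = value          # write-through: L2 is always current
        # Assumed invalidation step: drop stale copies in the other L1s.
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.pop(addr, None)

system = SharedL2System(num_cpus=2)
before = system.read(1, 0x40)   # CPU 1 caches the old value
system.write(0, 0x40, 7)        # CPU 0 writes through and invalidates
after = system.read(1, 0x40)    # CPU 1 misses and refetches from L2
```

Because every write reaches the L2 cache immediately, an invalidated L1 cache can always refill with the current value from L2.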

Shared Main Memory Multiprocessor 18/36

Primary cache access time is 1 cycle. • Secondary cache access time is 12 cycles. • All CPUs must access main memory to communicate. 19/36
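The effect of these latencies can be compared with a rough average-memory-access-time estimate. The latencies below are the ones stated in the slides, except the 50-cycle main-memory latency for the shared-memory design, which is assumed here. The miss rates are hypothetical placeholders, and the model ignores occupancy, bank conflicts, and interprocessor communication.

```python
def amat(l1_cycles, l2_cycles, mem_cycles, l1_miss_rate, l2_miss_rate):
    """Average memory access time in CPU cycles for a two-level hierarchy."""
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * mem_cycles)

# (L1 latency, L2 latency, main-memory latency) in cycles.
ARCHITECTURES = {
    "shared-L1": (3, 10, 50),
    "shared-L2": (1, 14, 50),
    "shared-memory": (1, 12, 50),   # 50 is an assumption for this design
}

# Hypothetical miss rates, chosen only to make the example concrete.
L1_MISS, L2_MISS = 0.05, 0.20

estimates = {
    name: amat(l1, l2, mem, L1_MISS, L2_MISS)
    for name, (l1, l2, mem) in ARCHITECTURES.items()
}
```

With equal miss rates, the shared-L1 design pays its 3-cycle access on every reference, so it only wins when sharing lowers its miss rate or communication cost enough to compensate.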

Ideal Memory Latencies of Three Architectures in CPU Clock Cycles 20/36

SIMULATION ENVIRONMENT • SimOS simulation environment is used • IRIX 5.3 operating system is simulated • Hand Parallelized Scientific and Engineering Applications • Compiler Parallelized Scientific and Engineering Applications • Multiprogramming Workload 21/36

Two kinds of simulation are performed: • Simple Simulation (no speculative execution, dynamic scheduling, or non-blocking memory references) • Dynamic Superscalar Simulation 22/36

SIMPLE SIMULATION RESULTS (for high degree of interprocessor communication) EAR 23/36

EQNOTT 24/36

(for moderate degree of interprocessor communication) VOLPACK 25/36

FFT Kernel 26/36

(for low degree of interprocessor communication) MULTIPROGRAMMING WORKLOAD 27/36

OCEAN 28/36

DYNAMIC SUPERSCALAR SIMULATION RESULTS 29/36

In dynamic superscalar simulation, Shared-L1 cache performance can diminish substantially, whereas Shared-L2 and shared-memory architectures retain much of the relative performance predicted by the simple simulation results. 30/36

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing By Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. ISCA 27, 2000, pp. 282-293 31/36

For Online Transaction Processing Systems • Standard ASIC design technology is used • The centerpiece of the Piranha architecture is a highly integrated processing node, with eight simple Alpha processor cores, separate instruction and data caches for each core, a shared second-level cache, eight memory controllers, two coherence protocol engines, and a network router, all on a single chip. 32/36


SIMULATION 36/36
