SINGLE CHIP MULTIPROCESSORS Computer Architecture Term Paper (11.12.2003) Esra KIRBAŞ 2002701357 1/36
Evaluation of Design Alternatives for a Multiprocessor Microprocessor By Basem A. Nayfeh, Lance Hammond and Kunle Olukotun. ISCA 23, 1996, pp. 67-77. 2/36
With advanced integrated-circuit technology, several options for the design of high-performance microprocessors are available. • In the multiprocessor design option, a small number of processors are interconnected on a single chip or on a multi-chip-module (MCM) substrate. • We concentrate on single-chip multiprocessors. 3/36
Our goal is to study two proposed cache-sharing mechanisms for single-chip multiprocessors: • Shared Level-1 (L1) Cache Architecture • Shared Level-2 (L2) Cache Architecture • (The performance of these two architectures will be compared with that of a single-bus-based shared-memory multiprocessor.) 4/36
A multiprocessor architecture whose interconnect is closer to the CPUs in the memory hierarchy will be able to exploit fine-grained parallelism more efficiently than a multiprocessor architecture whose interconnect is further away from the CPUs in the memory hierarchy. • The aim is to achieve good performance on fine-grained parallel applications without sacrificing the performance of parallel independent jobs. 5/36
CPU CHARACTERISTICS • The same CPU is used in all three architectures. • 2-way issue processor • Dynamic scheduling • Speculative execution • Non-blocking caches 6/36
16 KB, 2-way set-associative instruction and data caches • 32-entry centralized instruction window • 32-entry reorder buffer 8/36
Advantages of this Architecture: • It provides the lowest latency for interprocessor communication by using a shared-memory address space. • Low latency for interprocessor communication helps to achieve high performance in executing fine-grained parallel applications. • Processors may fetch shared data into the cache for each other. • It eliminates the cache coherence logic and implicitly provides a sequentially consistent memory without sacrificing performance. 10/36
Disadvantages of this Architecture: • The crossbar switching system increases the access time of the L1 cache. (We assume that the average access time is three cycles.) • All memory references go to the shared L1 cache, so there may be extra delays due to bank conflicts. • If the processors are not executing fine-grained parallel applications, the miss rate will increase. 11/36
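The bank-conflict delay mentioned above can be illustrated with a small sketch. The bank count, the line size, and the line-interleaved mapping are assumptions made for illustration only; the slides do not specify how the shared L1 cache is banked.

```python
from collections import Counter

LINE_SIZE = 32   # bytes per cache line (assumed for illustration)
NUM_BANKS = 4    # number of L1 banks (assumed for illustration)

def bank_of(address: int) -> int:
    """Map a byte address to a bank by interleaving consecutive cache lines."""
    return (address // LINE_SIZE) % NUM_BANKS

def conflict_stalls(addresses: list[int]) -> int:
    """Count requests that must wait when several CPUs hit one bank in a cycle.

    Each bank is assumed to serve one request per cycle, so a bank that
    receives k requests delays k - 1 of them.
    """
    per_bank = Counter(bank_of(a) for a in addresses)
    return sum(count - 1 for count in per_bank.values())

# Four CPUs issue one reference each in the same cycle.
spread = [0x000, 0x020, 0x040, 0x060]      # four different banks
clustered = [0x000, 0x080, 0x100, 0x020]   # three references map to bank 0
print(conflict_stalls(spread))      # 0
print(conflict_stalls(clustered))   # 2
```

In this toy model, references that spread across banks proceed in parallel, while references that cluster on one bank are serialized.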
The secondary cache and main memory are uniprocessor-like systems. • L2 cache: 2 MB, 10-cycle latency, 2-cycle occupancy • Main memory: 50-cycle latency, 6-cycle occupancy 12/36
The write-through primary caches have an access time of 1 cycle. • The latency of the L2 cache increases to 14 cycles due to the crossbar overhead. 14/36
The L2 cache has four independent banks to increase its bandwidth and enable it to support four independent access streams. • The datapath is 64 bits wide. • Occupancy is 4 cycles (for the transfer of a 32-byte cache line over the 64-bit datapath, at 8 bytes per cycle). 15/36
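The occupancy figure can be checked with a short calculation. This is a sketch that assumes a 32-byte cache line, which is the size consistent with a 4-cycle transfer over a 64-bit datapath; the function names are illustrative, and the peak-rate estimate assumes accesses spread evenly over the banks.

```python
def transfer_cycles(line_bytes: int, datapath_bits: int) -> int:
    """Cycles a bank is busy moving one cache line over the datapath."""
    bytes_per_cycle = datapath_bits // 8
    # Round up so that a partial final transfer still costs a full cycle.
    return -(-line_bytes // bytes_per_cycle)

def peak_lines_per_cycle(num_banks: int, occupancy_cycles: int) -> float:
    """Upper bound on line transfers per cycle with evenly spread accesses."""
    return num_banks / occupancy_cycles

occupancy = transfer_cycles(line_bytes=32, datapath_bits=64)
print(occupancy)                            # 4
print(peak_lines_per_cycle(4, occupancy))   # 1.0
```

Under these assumptions, four banks that are each busy for 4 cycles per line can together sustain at most one line transfer per cycle.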
Only memory accesses that miss in the L1 cache have to deal with the reduced performance of the L2 cache. • MCM (multi-chip module) technology can be used (for 1996). • Main memory: 50-cycle latency, 6-cycle occupancy 16/36
To keep the primary caches coherent, a coherence protocol is needed. • For simplicity, we assume that each primary cache uses a write-through policy for shared data. • Additional hardware is required to support this. 17/36
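A minimal sketch of how write-through primary caches can be kept coherent is given below. The invalidate-on-write action is an assumption chosen for illustration; the slides state only that a coherence protocol and extra hardware are needed, not which mechanism is used. The class name and its methods are hypothetical.

```python
class WriteThroughSystem:
    """Toy model: private write-through L1 caches in front of one shared L2."""

    def __init__(self, num_cpus: int) -> None:
        self.l1 = [dict() for _ in range(num_cpus)]  # per-CPU {address: value}
        self.l2 = {}                                 # shared {address: value}

    def read(self, cpu: int, address: int) -> int:
        cache = self.l1[cpu]
        if address not in cache:
            # L1 miss: fetch the current value from the shared L2.
            cache[address] = self.l2.get(address, 0)
        return cache[address]

    def write(self, cpu: int, address: int, value: int) -> None:
        # Write-through: the shared L2 is updated on every store.
        self.l1[cpu][address] = value
        self.l2[address] = value
        # Assumed coherence action: invalidate stale copies in other L1s.
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.pop(address, None)

system = WriteThroughSystem(num_cpus=4)
system.write(0, 0x40, 1)
print(system.read(1, 0x40))   # 1
system.write(2, 0x40, 7)
print(system.read(1, 0x40))   # 7
```

Because every store reaches the shared L2, a CPU whose copy was invalidated always finds the latest value there on its next read.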
Primary cache access time is 1 cycle. • Secondary cache access time is 12 cycles. • All CPUs must access main memory to communicate. 19/36
Ideal Memory Latencies of Three Architectures in CPU Clock Cycles 20/36
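The latencies quoted on the preceding slides can be combined into a rough average memory access time (AMAT) comparison. This is only a sketch: the miss rates are hypothetical placeholders rather than measured values, the same rates are applied to all three architectures even though sharing changes them in practice, occupancy and contention are ignored, and the 50-cycle main memory latency for the shared-memory architecture is an assumption carried over from the other two.

```python
# Ideal latencies in CPU clock cycles, taken from the slides above.
# The main-memory figure for "Shared-memory" is an assumption.
LATENCIES = {
    "Shared-L1":     {"l1": 3, "l2": 10, "mem": 50},
    "Shared-L2":     {"l1": 1, "l2": 14, "mem": 50},
    "Shared-memory": {"l1": 1, "l2": 12, "mem": 50},
}

def amat(lat: dict, l1_miss_rate: float, l2_miss_rate: float) -> float:
    """AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory time)."""
    return lat["l1"] + l1_miss_rate * (lat["l2"] + l2_miss_rate * lat["mem"])

# Hypothetical miss rates, chosen only to make the arithmetic concrete.
L1_MISS = 0.05
L2_MISS = 0.20

for name, lat in LATENCIES.items():
    print(f"{name}: {amat(lat, L1_MISS, L2_MISS):.2f} cycles")
```

With equal miss rates the 3-cycle L1 hit time dominates the Shared-L1 result, so that architecture only pays off when sharing lowers its miss rate or communication cost enough to offset the slower hits.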
SIMULATION ENVIRONMENT • The SimOS simulation environment is used. • The IRIX 5.3 operating system is simulated. • Hand-parallelized scientific and engineering applications • Compiler-parallelized scientific and engineering applications • Multiprogramming workload 21/36
Two kinds of simulation are performed: • Simple simulation (no speculative execution, dynamic scheduling, or non-blocking memory references) • Dynamic superscalar simulation 22/36
SIMPLE SIMULATION RESULTS (for high degree of interprocessor communication) EAR 23/36
EQNOTT 24/36
(for moderate degree of interprocessor communication) VOLPACK 25/36
FFT Kernel 26/36
(for low degree of interprocessor communication) MULTIPROGRAMMING WORKLOAD 27/36
OCEAN 28/36
In dynamic superscalar simulation, Shared-L1 cache performance can diminish substantially, whereas Shared-L2 and shared-memory architectures retain much of the relative performance predicted by the simple simulation results. 30/36
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing By Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. ISCA 27, 2000, pp. 282-293 31/36
Targets online transaction processing (OLTP) systems. • Standard ASIC design technology is used. • The centerpiece of the Piranha architecture is a highly integrated processing node, with eight simple Alpha processor cores, separate instruction and data caches for each core, a shared second-level cache, eight memory controllers, two coherence protocol engines, and a network router, all on a single chip. 32/36
SIMULATION 36/36