Software Systems Lab, Department of Electrical Engineering, Technion
“Nahalal: Cache Organization for Chip Multiprocessors”: New LSU Policy
By: Ido Shayevitz and Yoav Shargil
Supervisor: Zvika Guz
NAHALAL PROJECT
This project is based on the article “Nahalal: Cache Organization for Chip Multiprocessors” by Zvika Guz, Idit Keidar, Avinoam Kolodny, and Uri C. Weiser.
The NAHALAL project deals with a chip-multiprocessor environment.
Project goal: to provide a suitable cache-memory organization for a multiprocessor environment.
Project Steps
• Reading the article and looking for a suitable idea.
• Learning the Simics simulator.
• Installing Simics 3 on the project host.
• Learning the g-cache module C code and the NAHALAL code.
• Writing the implementation code of our project.
• Defining and writing benchmarks.
• Testing and debugging our model.
• Running simulations, collecting statistics, and drawing conclusions.
• Writing the project book, presentation, and website.
NAHALAL ARCHITECTURE
The NAHALAL architecture defines the memory cache banks of the L2 cache. The basic distinction is between the private banks and the public bank: NAHALAL places the public bank in the middle of the chip, with the CPUs and their private banks surrounding it, so hot shared lines stay close to all processors.
[Figure: NAHALAL layout vs. CIM (Cache In the Middle) layout]
NAHALAL ARCHITECTURE
• Placement policy: a new line is placed in the private bank of the CPU that fetched it; a line that turns out to be shared by several CPUs is moved to the public bank in the middle.
• Replacement policy in the private banks: LRU.
• Replacement policy in the public bank: LRU.
LSU Improvement
• Placement policy: unchanged.
• Replacement policy in the private banks: LRU.
• Replacement policy in the public bank: LSU instead of LRU.
LSU Implementation: Way 1
• A shift register with N cells for each CPU.
• Each cell in the shift register holds a line number (a public-bank line this CPU recently accessed).
• On an eviction by CPUi: for each line in CPUi's shift register, check which line appears least in the other CPUs' shift registers, and evict that one (a minimal sketch follows below).
• With a 64-byte line size and a 1 MB public bank, each cell needs 14 bits (2^20 / 2^6 = 2^14 lines). If N = 8, the memory overhead is 8 CPUs * 8 cells * 14 bits = 896 bits.
• Only the 8 lines in the shift register are candidates for replacement!
• A line that no one accesses will be stuck forever in the public bank.
• Complicated, slow algorithm in HW.
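A minimal C sketch of Way 1's bookkeeping (names and sizes here are illustrative, not taken from the project code): each CPU keeps a small FIFO history of public-bank line numbers, and the victim is the entry in the evictor's history that appears least often in the other CPUs' histories.

#include <limits.h>
#include <stdint.h>
#include <string.h>

#define NUM_CPUS 8
#define N 8                      /* history length per CPU */
#define INVALID_LINE 0xFFFF      /* cell not yet filled    */

static uint16_t history[NUM_CPUS][N];   /* 14-bit line numbers */

/* Record a public-bank access: shift the CPU's history and insert the line. */
static void record_access(int cpu, uint16_t line)
{
    memmove(&history[cpu][1], &history[cpu][0], (N - 1) * sizeof(uint16_t));
    history[cpu][0] = line;
}

/* Count how many times `line` appears in the other CPUs' histories. */
static int sharing_count(int cpu, uint16_t line)
{
    int count = 0;
    for (int c = 0; c < NUM_CPUS; c++) {
        if (c == cpu) continue;
        for (int i = 0; i < N; i++)
            if (history[c][i] == line)
                count++;
    }
    return count;
}

/* Way 1 victim selection for an eviction triggered by `cpu`: among the
 * lines in cpu's own history, pick the one the other CPUs reference least. */
static uint16_t pick_victim_way1(int cpu)
{
    uint16_t victim = INVALID_LINE;
    int best = INT_MAX;
    for (int i = 0; i < N; i++) {
        uint16_t line = history[cpu][i];
        if (line == INVALID_LINE) continue;
        int c = sharing_count(cpu, line);
        if (c < best) { best = c; victim = line; }
    }
    return victim;   /* INVALID_LINE means no candidate was found */
}

The nested counting loop is exactly what makes this variant expensive in hardware: every eviction must compare each of the evictor's 8 candidates against all 56 cells of the other CPUs' registers.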
LSU Implementation: Way 2
• A shift register with N cells for each line.
• Each cell in the shift register holds a CPU number (the last N CPUs that accessed the line).
• On an eviction by CPUi: search all the shift registers and evict the line whose register contains CPUi the most.
• With 8 CPUs, 3 bits are needed per cell. With a 64-byte line size and a 1 MB public bank there are 2^14 lines in the public bank. If N = 8, the overhead is 2^14 * 8 * 3 bits = 0.375 Mbit (48 KB), about 4.7% of the 1 MB bank.
• This extra storage enlarges the public bank, which can increase the distance between the lines in the public bank and the CPUs.
• The extra storage also increases power consumption.
• Complicated, slow algorithm in HW.
LSU Implementation: Way 3
• A shift register with N cells for each line.
• Each cell in the shift register holds a CPU number.
• On an eviction by CPUi: for each shift register, XOR each cell with the ID of CPUi. A shift register in which every XOR produces 0 belongs to a line used only by CPUi, and that line is the chosen victim. If none produces 0, fall back to regular LRU (see the sketch below).
• To reduce the memory overhead, we define N = 4. The overhead is then 2^14 * 4 * 3 bits = 0.1875 Mbit (24 KB), about 2.3% of the 1 MB bank.
• Simple, fast algorithm in HW.
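A minimal C sketch of the Way 3 selection (illustrative names, not the project's g-cache code): each public-bank line carries a 4-entry history of 3-bit CPU IDs, and a line qualifies as a victim only if every history cell XORs to 0 against the evicting CPU's ID.

#include <stdint.h>

#define NUM_LINES (1 << 14)   /* 1 MB public bank / 64-byte lines */
#define N 4                   /* history cells per line           */

static uint8_t line_hist[NUM_LINES][N];   /* 3-bit CPU IDs */

/* Record an access: shift the line's history and insert the CPU ID. */
static void record_access(int line, uint8_t cpu)
{
    for (int i = N - 1; i > 0; i--)
        line_hist[line][i] = line_hist[line][i - 1];
    line_hist[line][0] = cpu;
}

/* Way 3 victim selection for an eviction by `cpu`: a line qualifies
 * if XORing every history cell with the CPU ID yields 0, i.e. only
 * this CPU has touched it recently. */
static int pick_victim_way3(uint8_t cpu)
{
    for (int line = 0; line < NUM_LINES; line++) {
        uint8_t acc = 0;
        for (int i = 0; i < N; i++)
            acc |= line_hist[line][i] ^ cpu;   /* 0 iff cell == cpu */
        if (acc == 0)
            return line;    /* private to `cpu`: safe to evict */
    }
    return -1;              /* none found: fall back to regular LRU */
}

The software loop above scans lines one by one, but in hardware each line's check is just a few XOR gates feeding an OR-reduction, and all lines can be tested in parallel; this is what makes Way 3 fast compared with the counting searches of Ways 1 and 2.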
Simics Simulator
• An in-order instruction-set simulator that runs code on a virtual machine.
• Each instruction runs in one time step.
• Statistics are collected during simulations.
• Models can be built for all types of HW units and devices.
• Provides memory transactions, memory spaces, and the timing-model interface.
• The g-cache module is the basis for defining the memory hierarchy and cache modules.
• Stall time is the key for cache-model simulations.
• Round-robin scheduling with the default quantum is used for multiprocessor simulation.
The Implementation
In the project book ….
Build Our Project
Using the Simics g-cache Makefile:

shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake clean
Removing: g-cache
shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake g-cache
Generating: modules.cache
=== Building module "g-cache" ===
Creating dependencies: module_id.c
Creating dependencies: sorter.c
Creating dependencies: splitter.c
Creating dependencies: gc.c
Creating dependencies: gc-interface.c
Creating dependencies: gc-attributes.c
Creating dependencies: gc-cyclic-repl.c
Creating dependencies: gc-lru-repl.c
Creating dependencies: gc-random-repl.c
Creating dependencies: gc-common-attributes.c
Creating dependencies: gc-common.c
Creating exportmap.elf
Compiling gc-common.c
Compiling gc-common-attributes.c
Compiling gc-random-repl.c
Compiling gc-lru-repl.c
Compiling gc-cyclic-repl.c
Compiling gc-attributes.c
Compiling gc-interface.c
Compiling gc.c
Compiling module_id.c
Linking g-cache.so
shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib>
Simulation Script
Using a Python script we defined:
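The script itself did not survive into this page. As a rough sketch of what such a script looks like (attribute names follow the Simics 3 g-cache documentation, but the object names, sizes, and penalties here are illustrative assumptions, not the project's actual configuration):

from configuration import *

# Illustrative g-cache setup for a 1 MB, 64-byte-line L2 bank.
# Values are assumptions; the project's real script may differ.
l2c = pre_conf_object('l2c', 'g-cache')
l2c.cpus = [conf.cpu0, conf.cpu1]        # CPUs issuing transactions
l2c.config_line_number = 16384           # 1 MB / 64 B = 2^14 lines
l2c.config_line_size = 64
l2c.config_assoc = 8
l2c.config_write_back = 1
l2c.config_write_allocate = 1
l2c.config_replacement_policy = 'lru'    # our project adds an LSU policy here
l2c.penalty_read = 10                    # stall cycles, illustrative
l2c.penalty_write = 10
SIM_add_configuration([l2c], None)

# Route memory transactions through the cache's timing model.
conf.phys_mem0.timing_model = conf.l2c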
Writing Benchmarks
The benchmarks are written in the simulated target's console:
Writing Benchmarks
• Threads are created with the pthread library.
• Each thread is bound to a CPU using the sched library (a minimal sketch follows this list).
• The parallel code is written in the benchmark itself.
• OS code and pthread code also generate parallel activity.
• Each benchmark is run first without LSU and then with LSU.
Our goals:
1. To show that LSU reduces cycles in parallel benchmarks.
2. To show that in “LSU dedicated benchmarks” the improvement is larger.
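A minimal sketch of the thread-per-CPU setup (function and constant names are illustrative, not the project's benchmark code), using pthread_create plus sched_setaffinity from the sched library to bind each thread to its own CPU:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 2

static void *worker(void *arg)
{
    int thread_num = (int)(long)arg;

    /* Bind the calling thread to CPU (thread_num - 1). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(thread_num - 1, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        exit(1);
    }

    /* ... benchmark body: touch shared and designated arrays ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}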
Benchmark A
Each thread is associated with a different CPU; it reads from the shared (public) array and then writes to its own designated (private) array:

// read shared array
for (i=0; i<NUM_OF_ITERATIONS; i++)
    for (j=0; j<SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY[j];

// write to designated array
if (thread_num == 1) {
    des_array_ptr = DESIGNATED_ARRAY1;
} else if (thread_num == 2) {
    des_array_ptr = DESIGNATED_ARRAY2;
} else {
    printf("Error 0x100: thread_num has unexpected value\n");
    exit(1);
}
for (i=0; i<NUM_OF_ITERATIONS; i++)
    for (j=0; j<DESIGNATED_ARRAY_SIZE; j++)
        *(des_array_ptr+j) = 1;
return NULL;
}
Benchmark B
Each thread is associated with a different CPU; it reads the whole shared array, then only its own half of it, and then a second shared array:

// read first shared array: moves it into the public bank
for (i=0; i<NUM_OF_ITERATIONS; i++)
    for (j=0; j<SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY1[j];

// now each thread reads only its half of the shared array
if (thread_num == 1) {
    for (i=0; i<NUM_OF_ITERATIONS; i++)
        for (j=0; j<(SHARED_ARRAY_SIZE/2); j++)
            tmp = SHARED_ARRAY1[j];
} else if (thread_num == 2) {
    for (i=0; i<NUM_OF_ITERATIONS; i++)
        for (j=(SHARED_ARRAY_SIZE/2); j<SHARED_ARRAY_SIZE; j++)
            tmp = SHARED_ARRAY1[j];
}

// read second shared array
for (i=0; i<NUM_OF_ITERATIONS; i++)
    for (j=0; j<SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY2[j];
return NULL;
}
Collecting Statistics: Example

Cache statistics: l2c
-----------------
Total number of transactions:    610349
Total memory stall time:         31402835
Total memory hit stall time:     28251635
Device data reads (DMA):         0
Device data writes (DMA):        0
Uncacheable data reads:          17
Uncacheable data writes:         30738
Uncacheable instruction fetches: 0
Data read transactions:          403488
Total read stall time:           17488735
Total read hit stall time:       14383135
Data read remote hits:           0
Data read misses:                10352
Data read hit ratio:             97.43%
Instruction fetch transactions:  0
Instruction fetch misses:        0
Data write transactions:         176106
Total write stall time:          4687600
Total write hit stall time:      4687600
Data write remote hits:          0
Data write misses:               0
Data write hit ratio:            100.00%
Copy back transactions:          0
Number of replacements in the middle (NAHALAL): 557
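As a worked example of how these counters are used (a derived figure, not from the original slides): the average stall time per transaction here is 31,402,835 / 610,349 ≈ 51.5 cycles. This per-transaction average is the metric the Results slide compares with and without LSU.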
Results
• An improvement of 54% in average stall time per transaction in benchmark A, and a 61% improvement in benchmark B.
• In benchmark A, 8.375% of the transactions caused a replacement in the middle without LSU; with LSU, only 0.09% did! Δ = 8.28%
• In benchmark B, 8.75% of the transactions caused a replacement in the middle without LSU; with LSU, only 0.02% did! Δ = 8.73%
Conclusions
The LSU policy significantly improves the average stall time per transaction. Therefore: the LSU policy implemented in the NAHALAL architecture significantly reduces the number of cycles a benchmark takes.
The LSU policy significantly reduces the number of replacements in the middle. Therefore: the LSU policy implemented in the NAHALAL architecture keeps the hot shared lines in the public bank better.
In our implementation, LRU is activated whenever LSU does not find a line. Therefore: the LSU policy as we implemented it is always preferable to LRU.