1. CSCE 930 Advanced Computer Architecture: A Brief Introduction to CMP Memory Hierarchy & Simulators. Dongyuan Zhan
2. From Teraflop Multiprocessor to Teraflop Multicore
3. Intel Teraflop Multicore Prototype
4. From Teraflop Multiprocessor to Teraflop Multicore Pictured here is ASCI Red, the first computer to reach one teraflops of processing, i.e., a trillion calculations per second.
Using about 10,000 Pentium Processors running at 200MHz
Consuming 500kW of power for computation and another 500kW for cooling
Occupying a very large room
Just over 10 years later, Intel announced that it had developed the world's first processor to deliver the same teraflops performance on a single chip:
80 cores on one die running at 5 GHz
Consuming only 62 watts of power
Small enough to rest on the tip of your finger.
5. A Commodity Many-core Processor: the Tile64 Multicore Processor (2007 to present)
6. The Schematic Design of Tile64
7. Outline An Introduction to the Multi-core Memory Hierarchy
Why do we need a memory hierarchy in any processor?
A tradeoff between capacity and latency
Making the common case fast by exploiting programs' locality (a general principle in computer architecture)
What is the difference between the memory hierarchies of single-core and multi-core CPUs?
Quite distinct from each other in how the on-chip caches are organized
Managing the CMP caches is of paramount importance to performance
Again, we still have the capacity and latency issues for CMP caches
How to keep CMP caches coherent
Hardware & software management schemes
8. The Motivation for Mem Hierarchy
9. Programs’ Locality Two Kinds of Basic Locality
Temporal:
if a memory location is referenced, then it is likely that the same memory location will be referenced again in the near future.
int i; register int j;
for (i = 0; i < 20000; i++)
    for (j = 0; j < 300; j++)
        ;   /* i and j are re-referenced on every iteration: temporal locality (j is even kept in a register) */
Spatial:
if a memory location is referenced, then it is likely that nearby memory locations will be referenced in the near future.
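A minimal C sketch of spatial locality (an illustrative example of ours, not taken from the original slide; the function name sum_array is hypothetical):

/* Sequential traversal touches neighboring addresses, so most accesses hit
   in a cache block that an earlier miss already brought in. */
int sum_array(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];   /* a[i] and a[i+1] usually lie in the same cache block */
    return sum;
}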
Locality + small, fast hardware that makes the common case faster = the memory hierarchy
10. The Challenges of Memory Wall The Truths:
In many applications, 30-40% of the total instructions are memory operations
CPU speed scales much faster than DRAM speed
In 1980, CPUs and DRAMs operated at almost the same speed, about 4 MHz to 8 MHz
CPU clock frequency has doubled about every 2 years;
DRAM speed has only doubled about every 6 years.
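As a back-of-the-envelope illustration (our arithmetic, not from the slide): at those rates, over 12 years the CPU clock grows by 2^(12/2) = 64x while DRAM speed grows by only 2^(12/6) = 4x, so the processor-memory speed gap widens by roughly 16x.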
11. Memory Wall
DRAM bandwidth is quite limited: two DDR2-800 modules reach a combined bandwidth of 12.8 GB/s (about 6.4 bytes per CPU cycle if the CPU runs at 2 GHz). So, in a multicore processor, when multiple 64-bit cores need to access memory at the same time, they exacerbate contention for the limited DRAM bandwidth.
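A quick sanity check of these figures (our arithmetic, assuming a standard 64-bit, i.e., 8-byte, DDR2 channel): one DDR2-800 module delivers 800 MT/s x 8 B = 6.4 GB/s, so two modules give 12.8 GB/s; at 2 GHz that is 12.8 / 2 = 6.4 bytes per CPU cycle, and with, say, four cores sharing the memory controllers only about 1.6 bytes per core per cycle.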
Memory Wall: the CPU needs to spend a lot of time on off-chip memory accesses. E.g., the Intel XScale spends on average 35% of its total execution time on memory accesses. The high latency and low bandwidth of the DRAM system become a bottleneck for CPUs.
12. Solutions How to alleviate the memory wall problem
Hiding the memory access latency: prefetching
Reducing the latency: moving memory closer to the CPU, e.g., 3D-stacked on-chip DRAM
Increasing the bandwidth: optical I/O
Reducing the number of memory accesses: keeping as much reusable data in the cache as possible
13. CMP Cache Organizations (Shared L2 Cache)
14. CMP Cache Organizations (Private L2 Cache)
15. How to Address Blocks in a CMP How to address blocks in a single-core processor
L1 caches are typically virtually indexed but physically tagged, while L2 caches are mostly physically indexed and tagged (related to virtual memory).
How to address blocks in a CMP
L1 caches are accessed in the same way as in a single-core processor
If the L2 caches are private, the addressing of a block is still the same
If the L2 caches are shared among all of the cores, then a block's home L2 bank (tile) is typically determined by a fixed subset of its physical address bits, so that consecutive blocks are interleaved across the distributed banks (a minimal sketch of this mapping follows below)
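A minimal C sketch of such a block-interleaved mapping, under assumed parameters (64-byte blocks and a 4x4 tiled CMP); this illustrates the common scheme rather than the exact mapping on any particular chip:

#include <stdint.h>

#define BLOCK_OFFSET_BITS 6    /* assumption: 64-byte cache blocks */
#define NUM_TILES        16    /* assumption: 4x4 tiled CMP        */

/* The home L2 bank/tile is chosen by low-order bits of the block address,
   so consecutive blocks are spread (interleaved) across all tiles. */
static unsigned home_tile(uint64_t paddr) {
    uint64_t block_addr = paddr >> BLOCK_OFFSET_BITS;
    return (unsigned)(block_addr % NUM_TILES);
}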
16. How to Address Blocks in a CMP
17. How to Address Blocks in a CMP
18. CMP Cache Coherence Snoop based:
All caches on the bus snoop the bus to determine if they have a copy of the block of data that is requested on the bus. Multiple copies of a data block can be read without any coherence problems; however, a processor must have exclusive access (either invalidate or update other copies) to the bus in order to write.
Sufficient for small-scale CMPs with a bus interconnect
Directory based
The data being shared is tracked in a common directory that maintains coherence between caches. When a cache line is changed, the directory either updates or invalidates the other caches holding that line (see the sketch below).
Necessary for many-core CMPs with interconnects such as a mesh
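To make the directory idea concrete, here is a minimal C sketch of a per-block directory entry; the field names and sizes are our assumptions, and real protocols (e.g., those shipped with GEMS) differ in detail:

#include <stdint.h>

enum dir_state { DIR_INVALID, DIR_SHARED, DIR_MODIFIED };

/* One entry per memory block tracked by the directory. */
struct dir_entry {
    enum dir_state state;   /* global coherence state of the block        */
    uint64_t sharers;       /* bit-vector: bit i set => core i has a copy */
    uint8_t  owner;         /* owning core when the block is DIR_MODIFIED */
};

/* On a write request, the directory invalidates (or updates) every other
   sharer recorded in the bit-vector before granting exclusive ownership. */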
19. Interference in Caching in Shared L2 Caches The Problem: because the shared L2 caches are accessible to all cores, one core can interfere with another in placing blocks in L2 caches
For example, in a dual-core CMP, if a stream application like a video player is co-scheduled with a scientific computation application that has good locality, then the aggressive stream application will continuously place new blocks in L2 cache and replace the computation application’s cached blocks, thus affecting the computation application’s performance.
Solution:
Regulate each core's usage of the L2 cache based on the utility the core gets from the cache [3]
20. The Capacity Problems in Private L2 Caches The Problems:
the L2 capacity accessible to each core is fixed, regardless of the core’s real cache capacity demand. E.g., if two applications are co-scheduled on a dual core CMP with two 1MB private L2 caches, and if one application has a cache demand of 0.5 MB while the other asks for 1.5MB, then one private L2 cache is underutilized while the other is overwhelmed.
If a parallel program is running on the CMP, different cores will have a lot of data in common. However, the private L2 cache organization requires each core to maintain a copy of the common data in its local cache, leading to a lot of data redundancy and degrading the effective cache capacity.
A Solution: Cooperative Caching [4]
21. Non-Uniform Cache Access Time in Shared L2 Caches
22. Non-Uniform Cache Access Timein Shared L2 Caches Let’s assume that Core0 needs to access a data block stored in Tile15
Assume that accessing an L2 cache bank takes 10 cycles;
Assume that transferring a data block from one router to an adjacent one takes 2 cycles;
Then, a remote access to the block in Tile 15 takes 10 + 2*(2*6) = 34 cycles, much longer than a local L2 access.
Non-Uniform Cache Access (NUCA) time means that the latency of accessing a cache is a function of the physical locations of both the requesting core and the cache.
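A small C sketch of the latency model behind this example, using the 10-cycle bank latency and 2-cycle per-hop latency assumed above:

#include <stdlib.h>

/* Round-trip latency of a remote L2 access in a 2D-mesh NUCA cache:
   the request and the reply each traverse the Manhattan distance between tiles. */
static int nuca_latency(int src_x, int src_y, int dst_x, int dst_y) {
    const int bank_latency = 10;   /* cycles to access an L2 bank     */
    const int hop_latency  = 2;    /* cycles per router-to-router hop */
    int hops = abs(dst_x - src_x) + abs(dst_y - src_y);
    return bank_latency + 2 * hops * hop_latency;
}
/* Tile 0 at (0,0) to Tile 15 at (3,3): hops = 6, so 10 + 2*6*2 = 34 cycles. */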
23. How to reduce the latency of Remote Cache Access At least two solutions:
Place the data close enough to the requesting core
Victim replication [1]: placing L1 victim blocks in the Local L2 cache;
Change the layout of the data: I will talk about one approach pretty soon;
Use faster transmission
Use special on-chip interconnect to transmit data via radio-wave or light-wave signals
24. A Comparison Between Shared and Private L2 Caches
25. The RF-Interconnect [2]
26. Using OS to Manage CMP Caches [5] Two kinds of address space:
virtual (or logic) & physical
Page coloring: there is a correspondence between a physical page and its location in the cache
In CMPs with a shared L2 cache, by changing the virtual-to-physical mapping, the OS can determine where a virtual page required by a core is placed in the L2 cache
Tile# (where a page is cached) = physical page number % #Tiles
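A minimal C sketch of this page-granularity mapping (the page size and tile count are assumed values for illustration):

#include <stdint.h>

#define PAGE_SHIFT 12    /* assumption: 4 KB pages    */
#define NUM_TILES  16    /* assumption: 4x4 tiled CMP */

/* With page-granularity interleaving, the OS decides which tile caches a page
   simply by choosing which physical page (color) to allocate for it. */
static unsigned tile_of_page(uint64_t paddr) {
    uint64_t ppn = paddr >> PAGE_SHIFT;    /* physical page number */
    return (unsigned)(ppn % NUM_TILES);
}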
27. Using OS to Manage CMP Caches The Benefits
Improved Data Proximity
Capacity Sharing
Data Sharing (to be introduced next time)
28. Summary What we have covered in this class
The Memory Wall problem for CMPs
The two basic cache organizations for CMPs
HW & SW approaches to managing the last-level cache.
29. References [1] M. Zhang, et al. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA’05.
[2] F. Chang, et al. CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect. HPCA’08.
[3] A. Jaleel, et al. Adaptive Insertion Policies for Managing Shared Caches. PACT’08.
[4] J. Chang, et al. Cooperative Caching for Chip Multiprocessors. ISCA’06.
[5] S. Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. MICRO’06.
30. Outline An Overview of CMP Research Tools
A Detailed Introduction to SIMICS
A Detailed Introduction to GEMS
Other Online Resources
31. An Overview of CMP Research Tools CMP Simulators
SESC (http://users.soe.ucsc.edu/~renau/rtools.html)
M5 (http://www.m5sim.org/wiki/index.php/Main_Page)
Simics (https://www.simics.net/)
Benchmark Suites
Single-threaded Applications
SPEC2000 (www.spec.org)
SPEC2006
Multi-threaded Applications
SPECOMP2001
SPECWeb2009
SPLASH2 (http://www-flash.stanford.edu/apps/SPLASH/)
Parsec (http://parsec.cs.princeton.edu/)
32. An Overview of CMP Research Tools A Taxonomy of Simulation
Functional vs. Timing
Functional simulation: simulate the functionalities of a system
Timing simulation: simulate the timing behavior of a system
Full System vs. Non-FS
Full system simulation: like a VM that can boot up OSs
Syscall emulation: no OS but syscalls are emulated by the simulator
Simulation Stages
Configuration stage: connect cores, caches, DRAMs, interconnects and I/Os to build up a system
Fast-forward stage: bypass the initialization stage of a benchmark program without timing simulation
Warm-up stage: fill in the pipelines, branch predictors and caches by executing a certain number of instructions but do not count them in the performance statistics
Simulation stage: detailed simulation to obtain performance statistics
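To make the ordering of these stages concrete, here is an illustrative C sketch of a staged run; the function names and instruction counts are hypothetical stand-ins, not the API of any particular simulator:

#include <stdio.h>

/* Hypothetical stubs standing in for a simulator's functional and detailed engines. */
static void execute_functionally(long n) { printf("fast-forwarded %ld instructions\n", n); }
static void execute_detailed(long n)     { printf("simulated %ld instructions in detail\n", n); }
static void reset_statistics(void)       { printf("warm-up statistics discarded\n"); }

int main(void) {
    execute_functionally(1000000000L);  /* fast-forward: skip initialization, no timing    */
    execute_detailed(10000000L);        /* warm-up: fill pipelines, predictors and caches  */
    reset_statistics();                 /* do not count warm-up in the reported statistics */
    execute_detailed(100000000L);       /* detailed simulation that produces the results   */
    return 0;
}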
33. An Overview of CMP Research Tools The Commonly used CMP Simulators
SESC
Only supports timing & syscall simulation
Only supports MIPS ISA
Able to seamlessly cooperate with Cacti (power), Hotspot (temperature) and Hotleakage (static power)
Especially useful in power/thermal research
Cacti is available at http://www.cs.utah.edu/~rajeev/cacti6/
Hotspot: http://lava.cs.virginia.edu/HotSpot/
Hotleakage: http://lava.cs.virginia.edu/HotLeakage/index.htm
34. An Overview of CMP Research Tools The Commonly used CMP Simulators
SIMICS (commercial, but free for academic use)
Only supports functional & full-system simulation
Supports multiple ISAs
SparcV9 (well supported by public-domain add-on modules)
X86, Alpha, MIPS, ARM (seldom supported by 3rd-party modules)
Needs add-on models to do performance & power simulation
GEMS (http://www.cs.wisc.edu/gems/)
It has two components for performance simulation:
OPAL: an out-of-order processing core model
RUBY: a detailed CMP mem hierarchy model
Simflex (http://parsa.epfl.ch/simflex/)
It is similar to GEMS in functionality
It supports statistical sampling for simulation
Garnet (http://www.princeton.edu/~niketa/garnet.html)
It supports the performance and power simulation for NoC
35. An Overview of CMP Research Tools The Commonly used CMP Simulators
M5
Supports both functional and timing simulation
Has two simulation modes: full-system (FS) and syscall emulation (SE)
Supports multiple ISAs
ALPHA: well-developed to support both FS and SE modes
It models
Processor Cores + Memory Hierarchy + I/O Systems
Written in C++, Python & SWIG, and totally open-source
More things about M5
http://www.m5sim.org/wiki/index.php/Main_Page
The most important document: http://www.m5sim.org/wiki/index.php/Tutorials
36. A Detailed Introduction to M5 M5’s Source Tree Structure
37. A Detailed Introduction to M5 CPUs Modeled by M5
SimpleCPU
TimingCPU
O3CPU
38. A Detailed Introduction to M5 Memory Hierarchy Modeled by M5
39. A Detailed Introduction to Simics Directory Tree Organization
Under the root directory of Simics
licenses: licenses for functional simics
doc: detailed documents about all aspects
targets: simics scripts that describe specific computer systems
src: simics header files for user programming
amd64-linux: dynamic modules “*.so” that are invoked by Simics to build up modeled computer systems
40. A Detailed Introduction to Simics Key Features of Simics
Simics can be regarded as a command interpreter
Command Line Interface (CLI): lets users control Simics
Simics is quite modular
It uses Simics scripts to connect different FUNCTIONAL modules (e.g., ISA, DRAM, disk, Ethernet), which are compiled as “lib/*.so” files, to build up a system.
The information of all pre-compiled modules can be found in “doc/simics-reference-manual-public-all.pdf”.
Modules can be designed in C/C++, python, and DML.
Simics has already implemented several specific target systems (defined in scripts) for booting up an operating system
E.g., Sun’s Serengeti system with UltraSPARC-III processors, which is scripted in the directory “targets/serengeti”
41. A Detailed Introduction to Simics Key Features of Simics
DML, MAIs, APIs and CMDs
DML: the Simics-specific Device Modeling Language, a C-like programming language for writing device models for Simics using Transaction Level Modeling. DML is simpler than C/C++ and python in device modeling.
MAI:
the Simics-specific Micro-Architectural Interface; it enables users to define when things happen while letting Simics handle how things happen.
the add-on GEMS uses this feature to implement timing simulation.
APIs: a set of functions that provide access to Simics functionality from script languages in the frontend and from extensions, usually written in C/C++.
CMDs: the Simics-specific commands used in the CLI to let users control Simics, such as loading modules or running Python scripts.
42. A Detailed Introduction to Simics Using Simics
Installing Simics
See “simics-installation-guide-unix.pdf”
Creating Workspace
See Chapter 4 of “doc/simics-user-guide-unix.pdf”
Installing a Solaris OS
Change the disk capacity by modifying the cylinder-head-sector parameters in “targets/serengeti/abisko-sol*-cd-install1.simics”.
E.g., a 32 GB (= 40980 * 20 * 80 * 512 B) disk is created by the command
($scsi_disk.get-component-object sd).create-sun-vtoc-header -quiet 40980 20 80
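(Checking the arithmetic: 40980 cylinders x 20 heads x 80 sectors/track = 65,568,000 sectors, and 65,568,000 x 512 B is about 33.6 GB, i.e., roughly the 32 GB quoted above.)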
Enter the workspace just created
See Chapter 6 of “doc/simics-target-guide-serengeti.pdf”
43. A Detailed Introduction to Simics Using Simics
Modify the Simics script (for describing the Serengeti system) to enable multiple cores
Change $num_cpus in “targets/serengeti/serengeti-6800-system.include”
Booting the Solaris OS in Simics
Under the workspace directory just created, enter the subdirectory “home/serengeti”
Type “./simics abisko-common.simics”
Type “continue”
Install the SimicsFS (used to communicate with your host system)
See Section 7.3 of “doc/simics-user-guide-unix.pdf”
Save a checkpoint, exit, and restart from the saved checkpoint
Type “write-configuration try.conf”
Type “exit”
Type “./simics -c try.conf”
44. A Detailed Introduction to GEMS An Overview of GEMS
45. A Detailed Introduction to GEMS
46. A Detailed Introduction to GEMS Essential Components in Ruby
Caches & Memory
Coherence Protocols
CMP protocols
MOESI_CMP_token: M-CMP token coherence
MSI_MOSI_CMP_directory: 2-level Directory
MOESI_CMP_directory: higher performing 2-level Directory
SMP protocols
MOSI_SMP_bcast: snooping on ordered interconnect
MOSI_SMP_directory
MOSI_SMP_hammer: based on AMD Hammer
User-defined protocols using GEMS SLICC
47. A Detailed Introduction to GEMS Essential Components in Ruby
Interconnection Networks
Either automatically generated by default
Intra-chip network: Single on-chip switch
Inter-chip network: 4 included (next slide)
Or customized by users
Defined in *_FILE_SPECIFIED.txt under the directory “$GEMS_ROOT_DIR/ruby/network/simple/Network_Files”
48. Auto-generated Inter-chip Network Topologies
49. Topology Parameters Link latency
Auto-generated
ON_CHIP_LINK_LATENCY
NETWORK_LINK_LATENCY
Customized
‘link_latency:’
Link bandwidth
Auto-generated
On-chip = 10 x g_endpoint_bandwidth
Off-chip = g_endpoint_bandwidth
Customized
Individual link bandwidth = ‘bw_multiplier:’ x g_endpoint_bandwidth
Buffer size
Infinite by default
Customized network supports finite buffering
Prevent 2D-mesh network deadlock through e-cube restrictive routing
‘link_weight’
Perfect switch bandwidth
50. A Detailed Introduction to GEMS Steps of Using GEMS:
Choosing a Ruby protocol
Building Ruby and Opal
Starting and configuring Simics
Loading and configuring Ruby
Loading and configuring Opal
Running simulation
Getting results
51. Other Online Resources Simics Online Forum
https://www.simics.net/
GEMS Mailing List & Archive
http://lists.cs.wisc.edu/mailman/listinfo/gems-users
A student wrote some articles about installing and using Simics at
http://fisherduyu.blogspot.com/