SST-MacSim
SST
• The Structural Simulation Toolkit: A Parallel Architectural Simulator (for HPC)
• A parallel simulation environment based on MPI
• Fully modular design that enables extensive exploration of an individual system parameter without the need for intrusive changes to the simulator
• Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
• SST download link: http://sst.sandia.gov/
MacSim Tutorial (In ICPADS 2013)
SST (cont’d)
• Processor Components
  • MacSim
  • Gem5
• Memory Components
  • DRAMSim2
  • VaultSim (3D memory model)
  • MemHierarchy
• Network Components
  • Merlin
  • Iris
MacSim as an SST Component
• Multiple MacSim components can be instantiated
• Each can act as
  • An entire GPU node (composed of multiple SMs)
  • A heterogeneous computing node (CPU + GPU)
  • A GPU/CPU core
  • Any combination of the above
MacSim with other SST Components
• MacSim can talk to memHierarchy
  • MacSim can use memHierarchy’s cache hierarchy, so it can be configured with whatever memory system is connected to memHierarchy: DRAMSim2 or VaultSim.
[Figure: Pipeline stages with memHierarchy. Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MH) and D-Cache (MH); SST Link; VaultSim.]
MacSim with other SST Components
• MacSim can talk directly to
  • DRAMSim2
  • VaultSim
• MacSim’s memory controller interface lets it talk directly to DRAMSim2 and VaultSim.
[Figure: Pipeline stages with an external memory component. Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MS) and D-Cache (MS); SST Link; VaultSim.]
memHierarchy
• An SST component that models a memory hierarchy, such as multiple cache levels
• Subcomponents: Cache, Bus, Memory Controller
• Usage
  • Processor Component(s) + memHierarchy(s) + Memory Component(s)
  • MacSim + L1/L2 cache + DRAMSim2
  • MacSim + L1/L2 cache + (3D memory model)
  • (MacSim + private L1 cache) + (Gem5 + private L1 cache) + shared L2 cache + (DRAMSim2 or 3D memory model)
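The linear usage patterns above (processor, then cache levels, then a memory component) can be sketched as a small ordering check. This is an illustration only and not SST code: the `Component` class, the `kind` tags, and `is_valid_chain` are hypothetical names, and the shared-L2 pattern with two processors is a tree rather than a chain, so it is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str  # hypothetical tag: "processor", "cache", or "memory"


def is_valid_chain(chain):
    """Return True if the chain is processor -> zero or more caches -> memory."""
    kinds = [c.kind for c in chain]
    return (
        len(kinds) >= 2
        and kinds[0] == "processor"
        and kinds[-1] == "memory"
        and all(k == "cache" for k in kinds[1:-1])
    )


macsim = Component("MacSim", "processor")
l1 = Component("L1 (memHierarchy)", "cache")
l2 = Component("L2 (memHierarchy)", "cache")
dramsim2 = Component("DRAMSim2", "memory")

print(is_valid_chain([macsim, l1, l2, dramsim2]))  # True
print(is_valid_chain([l1, macsim, dramsim2]))      # False
```

In a real setup the same ordering is expressed through SST links in the SDL file rather than in code like this.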
MacSim + memHierarchy Integration
• MacSim is encapsulated as an SST component; SST feeds clocks into MacSim and provides communication channels.
• By talking to memHierarchy, MacSim can communicate indirectly with a range of memory components without modifying its interface.
[Figure: Two MacSim instances (SST::Component), each linked by an SST::Link to an L1 (memHierarchy); both L1s link to an L2 (memHierarchy); SST::Links lead from the L2 to VaultSim and DRAMSim2 (SST::Component).]
Feasible Configuration Examples
[Figure: Example component stacks joined by SST::Links. Labels: MacSim (three instances) and Gem5 as SST::Components; L1 and L2 caches (memHierarchy); LLC (VaultSim); VaultSim and DRAMSim2 (two instances) as memory SST::Components.]
SST-MacSim
• Make sure macsimComponent does not have a .ignore file; otherwise the SST build system will ignore the component
• How to build: see the instructions on the SST website
• How to execute: pay special attention to the following files
  • SDL (or XML): SST component configuration
  • trace_file_list: which traces to execute; can also be specified in the SDL file
  • params.in: MacSim configuration, in which you can specify
    • Whether MacSim uses its internal cache or memHierarchy as its cache
    • Which DRAM controller to use: the internal FCFS/FRFCFS-based controller, the DRAMSim2 controller, or the VaultSim controller
• Specific examples follow in the next slides
SST-MacSim: Standalone mode
• params.in
  • use_memhierarchy = 0
  • dram_scheduling_policy = FRFCFS or FCFS
• SDL (or XML)
  • Nothing except the macsimComponent configuration
  • In this case, the link configuration is not used
SST-MacSim: With MemHierarchy
• params.in
  • use_memhierarchy = 1
  • Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect at all
• SDL (or XML)
  • Specify memHierarchy’s cache configuration
  • The D-cache takes a similar configuration
SST-MacSim: With MH+DRAMSim2
• params.in
  • use_memhierarchy = 1
  • Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect at all
• SDL (or XML)
  • Specify the MemController configuration for DRAMSim2
  • Note: the DRAMSim2 configuration should be appended
SST-MacSim: With MH+VaultSim
• params.in
  • use_memhierarchy = 1
  • Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect at all
• SDL (or XML)
  • Specify the MemController configuration for VaultSim
  • Note: the VaultSim configuration should be appended
SST-MacSim: With only DRAMSim2
• params.in
  • use_memhierarchy = 0
  • dram_scheduling_policy = DRAMSIM
• SDL (or XML)
  • Specify the configuration for DRAMSim2
SST-MacSim: With only VaultSim
• params.in
  • use_memhierarchy = 0
  • dram_scheduling_policy = VAULTSIM
• SDL (or XML)
  • Nothing special, except that macsimComponent’s mem_link must match VaultSim’s toCPU link
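The params.in settings on the preceding slides reduce to a small decision table. The sketch below restates that table in Python so the combinations can be checked at a glance. It assumes the two knobs have already been read into a dict; the function name `memory_mode` and the returned labels are hypothetical, and the on-disk syntax of params.in is not modeled.

```python
def memory_mode(params):
    """Map the two params.in knobs from the slides to a memory setup label."""
    if int(params["use_memhierarchy"]) == 1:
        # With memHierarchy enabled, MacSim's DRAM controller setting has no
        # effect; the memory backend (DRAMSim2 or VaultSim) is chosen in the SDL.
        return "memHierarchy"

    policy = str(params.get("dram_scheduling_policy", "")).upper()
    if policy in ("FCFS", "FRFCFS"):
        return "standalone"      # MacSim's internal cache and DRAM controller
    if policy == "DRAMSIM":
        return "DRAMSim2 only"   # MacSim talks directly to DRAMSim2
    if policy == "VAULTSIM":
        return "VaultSim only"   # MacSim talks directly to VaultSim
    raise ValueError(f"unrecognized dram_scheduling_policy: {policy!r}")


print(memory_mode({"use_memhierarchy": 0, "dram_scheduling_policy": "FRFCFS"}))
print(memory_mode({"use_memhierarchy": 1, "dram_scheduling_policy": "FCFS"}))
print(memory_mode({"use_memhierarchy": 0, "dram_scheduling_policy": "VAULTSIM"}))
```

The second call shows the note from the slides: once use_memhierarchy is 1, the scheduling policy value is ignored.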
MacSim Architecture Studies
Architecture Studies Using MacSim
• Thread fetch policies
• Branch predictor
• Software and hardware prefetchers
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
• Power model
Study areas: Front-end, Memory System, Misc.
Prefetcher Study
[Figure: MacSim with Trace Generator (PIN, GPUOcelot), Frontend, Memory System, and Hardware Prefetcher.]
• Software prefetch instructions
  • PTX: prefetch, prefetchu
  • x86: prefetcht0, prefetcht1, prefetchnta
• Hardware prefetch requests
  • Hardware prefetcher: stream, stride, GHB, …
• Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
• When prefetching works, when it doesn’t, and why [Lee et al. ACM TACO, 2012]
• Spare Register Aware Prefetching for Graph Algorithms on GPUs [Lakshminarayana, HPCA 2014]
Cache and NoC Studies
• Cache studies: sharing, inclusion property
• On-chip interconnection studies
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
[Figure: Private caches ($), an interconnection, and a shared cache.]
Heterogeneity Aware NoC
• Heterogeneous link configuration
• Different topologies
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. JPDC 2013]
[Figure: Ring network layouts placing CPU cores (C0 to C2), GPU cores (G0 to G2), memory controllers (M0, M1), and L3 slices.]
Instruction Fetch and DRAM Scheduling
• Frontend fetch policies: RR, ICOUNT, FAIR, LRF, …
• DRAM scheduling policies: FCFS, FRFCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
[Figure: Trace Generator (GPUOcelot), Frontend, Execution, DRAM.]
DRAM Scheduling in GPGPUs
• Potential of requests from Core-0: $|W_0|^{\alpha} + |W_1|^{\alpha} + |W_2|^{\alpha} + |W_3|^{\alpha} = 4^{\alpha} + 3^{\alpha} + 5^{\alpha}$, with $\alpha < 1$
• Reduction in potential if
  • a row hit from a queue of length $L$ is serviced next: $L^{\alpha} - (L - 1)^{\alpha}$
  • a row miss from a queue of length $L$ is serviced next: $L^{\alpha} - (L - 1/m)^{\alpha}$
  • $m$ = (cost of servicing a row miss) / (cost of servicing a row hit)
• Tolerance(Core-0) < Tolerance(Core-1), so select Core-0
• Servicing a row hit from W1 (of Core-0) gives the greatest reduction in potential, so service row hits from W1 next
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
[Figure: DRAM controller with request queues W0 to W3 for Core-0 and for Core-1 in front of a DRAM bank; queue entries are marked RH (row hit) or RM (row miss).]
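The potential function on this slide can be worked through numerically. The sketch below uses the Core-0 queue lengths from the slide (4, 3, 5, and one empty queue); the values alpha = 0.5 and m = 3, the row-hit/row-miss status of each queue head, and all function names are assumptions made for illustration, not values from the paper.

```python
def potential(queue_lengths, alpha):
    """Sum of |W_i|^alpha over one core's request queues."""
    return sum(length ** alpha for length in queue_lengths)


def reduction(length, is_row_hit, alpha, m):
    """Drop in potential from servicing the head of a queue of this length.

    A row hit removes one request; a row miss counts as 1/m of a request,
    where m = (row-miss cost) / (row-hit cost).
    """
    serviced = 1.0 if is_row_hit else 1.0 / m
    return length ** alpha - (length - serviced) ** alpha


def pick_queue(queues, alpha, m):
    """Return (name, reduction) for the non-empty queue with the largest reduction."""
    best = None
    for name, (length, head_is_row_hit) in queues.items():
        if length == 0:
            continue  # nothing to service in an empty queue
        gain = reduction(length, head_is_row_hit, alpha, m)
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


ALPHA = 0.5  # assumed; the slide only requires alpha < 1
M = 3.0      # assumed row-miss / row-hit cost ratio

# Queue name -> (length, whether the head request is a row hit); hit/miss is assumed.
core0 = {"W0": (4, False), "W1": (3, True), "W2": (5, True), "W3": (0, False)}

print(round(potential([n for n, _ in core0.values()], ALPHA), 4))  # 5.9681
print(pick_queue(core0, ALPHA, M)[0])                              # W1
```

Because alpha < 1 makes the function concave, a row hit from a shorter queue reduces the potential more than a row hit from a longer one, which is why W1 wins here.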
MacSim with 3D stacked Memory
[Figure: Trace Generator (PIN, GPUOcelot); out-of-the-box MacSim with Frontend, Cache Hierarchy, and Memory System; memory requests; Off-Chip Memory; 3D Stacked DRAM Model (new module) with DRAM stacks.]
• Trace inputs
  • CPU traces (x86)
  • GPU traces (CUDA)
• Configure the 3D stack as
  • DRAM caches
  • Part of main memory
• Resilient Die-stacked DRAM Caches [Sim et al., ISCA-40, 2013]
• A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch [Sim et al., MICRO, 2012]
Power Research & Validation
• Verifying the simulator against a GTX580
• Modeling x86 CPU power
• Modeling GPU power
• Research is still ongoing
MacSim’s Roadmap (2013 ~ 2014)
• ARM Architecture
• Mobile Platform
• Power/Energy Model
Thank You!