
C-AMAT: Concurrent Average Memory Access Time


Presentation Transcript


  1. C-AMAT: Concurrent Average Memory Access Time. Xian-He Sun, Illinois Institute of Technology, sun@iit.edu. With Yuhang Liu and Dawei Wang. April 2015.

  2. Outline
  • Reference
  • X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74–80, May 2014.
  • D. Wang and X.-H. Sun, "APC: A Novel Memory Metric and Measurement Methodology for Modern Memory System," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626–1639, 2014.
  • Motivation
  • Memory System and Metrics
  • C-AMAT: Definition and Contribution
  • Experimental Design and Verification
  • Application and Related Work
  • Conclusion

  3. Motivation
  • Processors are roughly 400x faster than memory, and applications are becoming more data intensive
  • Data access has become THE performance bottleneck of high-end computing
  • Many concurrency-based technologies have been developed to improve data access speed, but their impact on final performance is elusive; they are therefore not fully utilized
  • Existing memory optimization strategies are still primarily based on the sequential, single-access assumption

  4. Memory Wall Problem
  [Figure: processor-DRAM memory gap; µProc improves 1.52x/yr (2x/1.5 yrs, "Moore's Law"), later 1.20x/yr, while DRAM improves ~7%/yr (2x/10 yrs); the processor-memory performance gap grows ~50%/year]
  • 1980: no cache in microprocessors; 2010: 3-level cache on chip, 4-level cache off chip
  • 1989: the first Intel processor with an on-chip L1 cache was the Intel 486 (8 KB)
  • 1995: the first Intel processor with an on-chip L2 cache was the Intel Pentium Pro (256 KB)
  • 2003: the first Intel processor with an on-chip L3 cache was the Intel Itanium 2 (6 MB)
  Source: Computer Architecture: A Quantitative Approach

  5. Extremely Unbalanced Operation Latency
  [Figure: operation latencies in cycles across the hierarchy; an I/O access costs 5~15M cycles]

  6. Data Access Becomes the Performance Bottleneck
  [Figures: memory microstructure traces from three applications]
  • GROMACS (molecular dynamics)
  • MPQC (Massively Parallel Quantum Chemistry)
  • Multi-Grid solver (CFD)

  7. Data Access Becomes the Performance Bottleneck
  [Figures: data-intensive application domains]
  • Computational Fluid Dynamics
  • Adaptive Multigrid
  • Computational Finance
  • Data Mining

  8. Solution: Memory Hierarchy
  (Upper levels are smaller and faster; lower levels are larger and slower)
  • CPU registers: <8 KB, 0.2~0.5 ns; staged by the program/compiler in 1–8 byte units (instruction operands)
  • L1 cache: <128 KB, 0.5–1 ns; staged by the L1 cache controller in 32–128 byte blocks
  • L2 cache: <50 MB, 1–10 ns; staged by the L2 cache controller in 32–128 byte blocks
  • Main memory: gigabytes, 50–100 ns; staged by the OS in 4 KB–4 MB pages
  • Disk: terabytes, ~5 ms

  9. Data Access Concurrency Exists

  10. Solution: Memory Hierarchy & Parallelism
  • CPU: multi-core, multi-threading, multi-issue, out-of-order execution, speculative execution, runahead execution
  • Cache: pipelined cache, non-blocking cache, data prefetching, write buffer, multi-banked cache, multi-level cache
  • Memory: multi-channel, multi-rank, multi-bank
  • Input-Output (I/O): pipelining, non-blocking, prefetching, write buffer, parallel file system, disks

  11. Assumption of Current Solutions
  [Figure: extremely unbalanced operation latencies; an I/O access costs 5~15M cycles; performances vary widely]
  • Memory hierarchy: locality
  • Concurrency: data access pattern
  • Data stream

  12. Existing Memory Metrics (Missing Memory Parallelism/Concurrency)
  • Miss Rate (MR): {the number of missed memory accesses} over {the number of total memory accesses}
  • Misses Per Kilo-Instructions (MPKI): {the number of missed memory accesses} over {the number of total committed instructions / 1000}
  • Average Miss Penalty (AMP): {the sum of individual miss latencies} over {the number of missed memory accesses}
  • Average Memory Access Time (AMAT): AMAT = Hit time + MR × AMP
  • Flaw of existing metrics: each focuses on a single component or a single memory access (see the sketch below)
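As a concrete illustration, here is a minimal Python sketch of these sequential metrics; all counter values below are hypothetical:

    # Classic sequential memory metrics from (hypothetical) event counters.
    accesses = 1_000_000            # total memory accesses
    misses = 40_000                 # accesses that miss in the cache
    instructions = 5_000_000        # total committed instructions
    total_miss_latency = 8_000_000  # sum of individual miss latencies (cycles)
    hit_time = 2                    # cycles per hit

    MR = misses / accesses                 # miss rate: 0.04
    MPKI = misses / (instructions / 1000)  # misses per kilo-instruction: 8.0
    AMP = total_miss_latency / misses      # average miss penalty: 200.0 cycles
    AMAT = hit_time + MR * AMP             # average memory access time: 10.0 cycles

    # Note: none of these inputs capture whether accesses overlap in time.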

  13. Concurrent AMAT (C-AMAT)
  C-AMAT = H/C_H + pMR × pAMP/C_M
  • H is the hit time
  • C_H is the hit concurrency
  • C_M is the pure-miss concurrency
  • pMR and pAMP are the pure-miss ratio and the pure-miss penalty
  • A pure-miss cycle is a miss cycle during which there is no hit
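A minimal sketch of the C-AMAT formula with illustrative (made-up) parameter values, for comparison with the AMAT sketch above:

    # C-AMAT = H/C_H + pMR * pAMP / C_M, with hypothetical values.
    H = 2        # hit time (cycles)
    C_H = 4      # average hit concurrency
    C_M = 8      # average pure-miss concurrency
    pMR = 0.02   # pure-miss ratio (pure misses / total accesses)
    pAMP = 200   # average pure-miss penalty (cycles)

    C_AMAT = H / C_H + pMR * pAMP / C_M
    print(C_AMAT)  # 1.0 cycle per access: concurrency hides much of the latency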

  14. Different perspectives • Sequential perspective: AMAT • Concurrent perspective: C-AMAT

  15. Pure Miss
  • A miss by itself is not important (a pure miss is)
  • The penalty is due to pure misses

  16. C-AMAT is Recursive
  C-AMAT1 = H1/C_H1 + pMR1 × η1 × C-AMAT2
  This equation shows the recurrence relation between C-AMAT1 and C-AMAT2, where η1 = R1/R2 (the next slide gives its physical meaning).

  17. The Physical Meaning of η1
  • R1 = pure miss cycles / miss cycles
  • R2 = pure misses / misses
  • η1 = R1 / R2
  • The penalty at L2 is C-AMAT2; the actual delay impact is η1 × C-AMAT2
  • η1 is the L1 (concurrency) data delay reducer (see the sketch below)
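A sketch of the recursion and of η1, using hypothetical L1 counters:

    # eta1 = R1 / R2; C-AMAT1 = H1/C_H1 + pMR1 * eta1 * C-AMAT2 (hypothetical values).
    pure_miss_cycles = 30_000  # L1 miss cycles with no concurrent hit
    miss_cycles = 100_000      # all L1 miss cycles
    pure_misses = 600          # L1 misses that caused pure-miss cycles
    misses = 1_500             # all L1 misses

    R1 = pure_miss_cycles / miss_cycles  # 0.3
    R2 = pure_misses / misses            # 0.4
    eta1 = R1 / R2                       # 0.75: the L1 data delay reducer

    H1, C_H1, pMR1 = 2, 4, 0.02
    C_AMAT2 = 20                         # penalty seen at L2 (cycles)
    C_AMAT1 = H1 / C_H1 + pMR1 * eta1 * C_AMAT2
    print(C_AMAT1)                       # 0.5 + 0.02 * 0.75 * 20 = 0.8 cycles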

  18. Architecture Impacts
  • C_H can be contributed by:
  • multi-port caches
  • multi-banked caches
  • pipelined cache structures
  • C_M can be contributed by:
  • non-blocking cache structures
  • prefetching logic
  • These techniques can increase both C_H and C_M:
  • out-of-order execution
  • multiple-issue pipelines
  • SMT
  • CMP

  19. Detecting System
  [Figure: structure for detecting cache hit concurrency and cache miss concurrency using the C-AMAT metric]

  20. Experimental Environment
  • Simulator: GEM5
  • Benchmarks: 29 benchmarks from the SPEC CPU2006 suite
  • For each benchmark, 10 million instructions were simulated to collect statistics
  • Average values of the corresponding memory metrics are shown
  • A good memory metric should match the actual design choices of modern processors

  21. Default Configuration
  [Table: default processor and cache configuration parameters for simulated testing of C-AMAT]

  22. Experimental Results
  [Figure: L1 DCache AMAT and C-AMAT when changing issue pipeline width]
  AMAT gets worse while C-AMAT gets better as concurrency increases.

  23. Experimental Results
  [Figure: L1 DCache AMAT and C-AMAT when changing MSHR size]
  AMAT gets worse while C-AMAT gets better as concurrency increases.

  24. Experimental Results
  [Figure: L2 Cache AMAT and C-AMAT when changing MSHR size]
  AMAT gets worse while C-AMAT gets better as concurrency increases.
  More results can be found in X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74–80, May 2014.

  25. Potential of C-AMAT and Data Concurrency
  • Assume the total running time is T
  • Data stall time is d, and d/T is up to 70%, i.e., d = 0.7T
  • Compute time is t, and t = 0.3T
  • Therefore, data stall time can be up to 0.7/0.3 ≈ 2.3 times the compute time
  • If layered performance matching can be achieved, and the overlapping effect of data access concurrency is sufficient, data stall time drops to only 1% of compute time
  • Then memory performance can be improved 230 times (2.3/0.01 = 230), as worked out below!
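The arithmetic behind the 230x figure, as a short sketch (inputs taken from the slide above):

    # Worked arithmetic for the "230x" memory stall improvement claim.
    T = 1.0                  # total running time (normalized)
    stall = 0.7 * T          # data stall time, up to 70% of T
    compute = 0.3 * T        # compute time, the remaining 30%

    ratio = stall / compute          # ~2.33: stall is ~2.3x compute time
    target = 0.01 * compute          # goal: stall only 1% of compute time
    improvement = stall / target     # ~233x, rounded to 230x on the slide
    print(ratio, improvement)        # 2.333... 233.333...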

  26. Improvement Potential Due to Concurrency
  Aided by concurrency, memory system performance can be improved up to hundreds of times (230x) at each layer of a memory hierarchy with layered performance matching.

  27. How the 230x Improvement Is Achieved
  Increasing data access concurrency yields a 230x speedup of memory system performance with our LPM algorithm.

  28. Technique Impact Analysis (Original) Figure 2.11 on page 96 in Hennessy & Patterson’s latest book

  29. Technique Impact Analysis (Ours)
  [Table: a new technique summary table with C-AMAT]

  30. The Impact of C-AMAT
  • New understanding of memory systems, with a rigorous mathematical description
  • Unifies the influence of data locality and concurrency under one formulation
  • Foundation for developing new concurrency-based optimizations and for utilizing existing locality-based optimizations
  • Foundation for automatic tuning toward the best configuration, partitioning, scheduling, etc.

  31. C-AMAT in Action
  • Traditional AMAT model: data stall time
  • New C-AMAT model:
    CPU-time = IC × (CPI_exe + f_mem × C-AMAT × (1 − overlapRatio_c-m)) × cycle-time
  • Only a pure miss will cause the processor to stall, and that penalty is formulated here (see the sketch below)
  Y.-H. Liu and X.-H. Sun, "Reevaluating Data Stall Time with the Consideration of Data Access Concurrency," Journal of Computer Science and Technology, vol. 30, no. 2, pp. 227–245, 2015.
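A minimal sketch of this CPU-time model; every parameter value below is hypothetical, chosen only to show how the pure-miss stall term enters the product:

    # CPU-time = IC * (CPI_exe + f_mem * C-AMAT * (1 - overlapRatio_c-m)) * cycle_time
    IC = 1_000_000       # instruction count
    CPI_exe = 1.0        # CPI of the execution (non-stall) portion
    f_mem = 0.3          # fraction of instructions that access memory
    C_AMAT = 1.0         # concurrent average memory access time (cycles)
    overlap = 0.5        # overlapRatio_c-m: overlap of compute with pure-miss stalls
    cycle_time = 0.5e-9  # seconds per cycle (2 GHz clock)

    cpu_time = IC * (CPI_exe + f_mem * C_AMAT * (1 - overlap)) * cycle_time
    print(cpu_time)      # 1e6 * (1.0 + 0.15) * 0.5e-9 = 5.75e-4 seconds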

  32. C-AMAT in Action
  Y.-H. Liu and X.-H. Sun, "LPM: Layered Performance Matching in Memory Hierarchy," Illinois Institute of Technology Technical Report (IIT/CS-SCS-2014-08), 2014.
  • Layered performance matching at each level of the memory hierarchy
  • Uses recursive C-AMAT to measure and mitigate layered performance mismatch
  • For instance, the impact of C-AMAT2 can be trimmed by pMR1 and η1
  • The key is to reduce pure misses, not misses, and data concurrency can do so

  33. C-AMAT in Action
  Y.-H. Liu and X.-H. Sun, "TuningC: A Concurrency-aware Optimization Tool," Illinois Institute of Technology Technical Report (IIT/CS-SCS-2015-05), 2015.
  • Online reconfiguration and smart scheduling
  • A performance optimization tool has been developed based on C-AMAT
  • Provides measurements and optimization suggestions:
  • measures C-AMAT on existing computing systems
  • optimization via hardware reconfiguration
  • optimization via software task partitioning and scheduling

  34. Related Work: APC Versus C-AMAT
  D. Wang and X.-H. Sun, "Memory Access Cycle and the Measurement of Memory Systems," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626–1639, July 2014.
  • Access Per (memory active) Cycle (APC): APC = A/T
  • APC is a measurement, a companion of C-AMAT; C-AMAT is an analysis and optimization tool
  • APC is very different from the traditional IPC:
  • memory active cycles (data centric/access)
  • overlapping mode (concurrent data access)
  • C-AMAT = 1/APC, so C-AMAT does not depend on its five parameters for its value (see the sketch below)
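A sketch of APC and the C-AMAT = 1/APC relation; the counter values are hypothetical:

    # APC = A / T, where A is the number of accesses and T the number of
    # memory active cycles (cycles with overlapping accesses are counted once).
    A = 50_000      # memory accesses completed
    T = 40_000      # memory active cycles

    APC = A / T       # 1.25 accesses per memory-active cycle
    C_AMAT = 1 / APC  # 0.8 cycles per access, with no need to measure
                      # H, C_H, pMR, pAMP, or C_M separately
    print(APC, C_AMAT)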

  35. Related Work: MLP
  • Memory Level Parallelism (MLP): the average number of outstanding long-latency main memory accesses when there is at least one such outstanding access
  • Assuming each off-chip memory access has a constant latency, say m cycles, APC_M = MLP/m
  • That means APC_M is directly proportional to MLP
  • APC is a superset of MLP
  • C-AMAT is an analytical tool and a measurement; MLP is only a measurement
  • MLP does not consider locality, while APC and C-AMAT do

  36. Conclusions
  • Data access delay is the premier bottleneck of computing
  • Hardware memory concurrency exists but is underutilized
  • C-AMAT unifies data concurrency with locality for combined data access optimizations
  • C-AMAT can improve memory (AMAT) performance up to 230 times
  • This 230x number could be even larger: with multicore technology, CPUs can be built faster; the question is whether data can be moved up fast enough

  37. Conclusions
  Develop C-AMAT-based technologies to reduce data access time!

  38. Thank You & Questions?
