A Self-Tuning Cache Architecture for Embedded Systems
Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported by the National Science Foundation and the Semiconductor Research Corporation.
Caches Consume Much Power
• ARM920T and M*CORE: caches consume 50% of total processor system power (Segars 01, Lee et al. 99)
• Caches are frequently accessed
  • Consume dynamic power
• Caches account for most of the transistors on a die
  • Consume static power
• We showed that a configurable cache can cut that power nearly in half on average (Zhang et al. ISCA 03, ISVLSI 03)
Configurable Cache Architecture
[Figure: four-way set-associative base cache (ways W1 to W4) shown reconfigured as a two-way set-associative cache, as a direct-mapped cache, and with two ways shut down; gated-Vdd control circuit on the bitlines; counter and bus to off-chip memory for line concatenation]
• Way concatenation (Zhang et al. ISCA 03)
  • The four-way set-associative base cache can be configured as two-way set associative or direct mapped
• Way shutdown
  • Example: shut down two ways
  • Uses the sleep transistor (gated-Vdd) method (Powell et al. ISLPED 2000)
• Line concatenation (Zhang et al. ISVLSI 03)
  • The physical line is 16 bytes; 4 physical lines are filled when the line size is 64 bytes
• The way prediction unit can be turned on/off
Computing Total Memory-Related Energy
• Considers CPU stall energy and off-chip memory energy
• Excludes CPU active energy
• Thus, represents all memory-related energy

energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(we varied the k’s to account for different system implementations)

• Measured quantities (underlined on the original slide):
  • SimpleScalar (cache_hits, cache_misses, cycles)
  • Our layout or data sheets (others)
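The energy equations above can be evaluated directly once the counts and per-event energies are known. The following is a minimal Python sketch, not the authors' tooling: the function names and the sample numbers are hypothetical, energies are in arbitrary units, and the second function assumes the ratio form of the model (energy_miss and energy_static_per_cycle derived from k_miss_energy and k_static).

```python
def memory_energy(cache_hits, cache_misses, cycles,
                  energy_hit, energy_miss, energy_static_per_cycle):
    """Total memory-related energy, following the slide's equations."""
    # Dynamic part: every hit and every miss costs a fixed energy.
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    # Static part: leakage accumulates for every executed cycle.
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static


def memory_energy_from_ratios(cache_hits, cache_misses, cycles,
                              energy_hit, energy_total_per_cycle,
                              k_miss_energy, k_static):
    """Same model, with miss and static energy expressed through the k factors."""
    energy_miss = k_miss_energy * energy_hit
    energy_static_per_cycle = k_static * energy_total_per_cycle
    return memory_energy(cache_hits, cache_misses, cycles,
                         energy_hit, energy_miss, energy_static_per_cycle)


# Hypothetical inputs (not measurements from the slides).
example_total = memory_energy_from_ratios(
    cache_hits=1000, cache_misses=10, cycles=2000,
    energy_hit=1.0, energy_total_per_cycle=2.0,
    k_miss_energy=50, k_static=0.25,
)
```

Varying k_miss_energy and k_static in such a function mirrors how the k's are swept to represent different system implementations.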
Cache Self-tuning Hardware
[Figure: SoC block diagram with processor, I$, D$, tuner, and off-chip memory]
• Simulation-based methods
  • Drawback: slowness. Seconds of real-time work may take tens of hours to simulate
  • Setting up the simulation tools may be difficult
• Self-tuning method
  • Incorporates a cache parameter tuner on a SoC platform
  • Detects the cache parameters with the lowest energy dissipation
  • The tuner sits to the side and collects the information used to calculate the energy
• A heuristic algorithm is needed
  • Searching all possible cache configurations is time consuming. Once other configurable parameters are considered (voltage levels, bus width, etc.), the search space quickly grows to millions of configurations
• Cache flushing should be avoided
Designing a Search Heuristic: Evaluating Impact of Cache Parameters on Miss Rate and Energy
[Figure: Average instruction cache miss rate and normalized energy of the benchmarks. Plot annotations: one way; line size 32B.]
Heuristic: Searching for the least-energy cache configuration
[Figure: search flow] Search cache size → search line size → search associativity → way prediction → the least-energy cache configuration
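A one-parameter-at-a-time search in the order shown above can be sketched as follows. This is an illustrative Python sketch, not the tuner's actual logic: the parameter names and values, the toy energy model, and the rule of stopping at the first value that does not lower the energy are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def greedy_search(order: List[Tuple[str, List[int]]],
                  measure_energy: Callable[[Config], float]
                  ) -> Tuple[Config, float, int]:
    """Tune parameters one at a time, keeping earlier choices fixed."""
    # Start from the first listed value of every parameter.
    best = {name: values[0] for name, values in order}
    best_energy = measure_energy(best)
    evaluations = 1
    for name, values in order:
        for value in values[1:]:
            candidate = dict(best, **{name: value})
            energy = measure_energy(candidate)
            evaluations += 1
            if energy < best_energy:
                best, best_energy = candidate, energy
            else:
                break  # assumed rule: stop at the first non-improvement
    return best, best_energy, evaluations


def toy_energy(cfg: Config) -> float:
    """Hypothetical energy model with its minimum at 4 KB, 32 B lines, 1 way."""
    return (10
            + abs(cfg["size_kb"] - 4)
            + abs(cfg["line_b"] - 32) / 16
            + abs(cfg["ways"] - 1)
            + (0.5 if cfg["way_pred"] else 0.0))


# Search order from the slide: size, line size, associativity, way prediction.
SEARCH_ORDER = [
    ("size_kb", [2, 4, 8]),
    ("line_b", [16, 32, 64]),
    ("ways", [1, 2, 4]),
    ("way_pred", [0, 1]),
]

best_cfg, best_energy, n_evals = greedy_search(SEARCH_ORDER, toy_energy)
```

With this toy model the search evaluates 7 configurations, whereas enumerating every combination of the listed values would take 3 × 3 × 3 × 2 = 54. In a real tuner, measure_energy would correspond to running the application under a configuration and computing energy from the collected hit, miss, and cycle counts.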
Implementing the Heuristic in Hardware
[Figure: FSM and data path of the cache explorer. Inputs: hit energies, hit num, miss energies, miss num, static energies, exe time. Blocks: FSM, muxes with control, multiplier, adder, register, comparator, lowest-energy register, configure register.]
• Total size of the tuner
  • About 4,200 gates, or 0.041 mm² in 0.18 micron CMOS technology
• Area overhead
  • Compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead
• Power consumption
  • 2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%
  • Furthermore, the exploring hardware is used only during the exploring stage and can be shut down after the best configuration is determined
Heuristic time-complexity and effectiveness
• Time complexity
  • Searching the whole space: O(m × n × l × p)
  • Heuristic: O(m + n + l + p)
  • m: number of associativities; n: number of cache sizes; l: number of cache line sizes; p: way prediction on/off
• Efficiency
  • On average, 5 configurations are searched instead of all 27
  • 2 out of 19 benchmarks miss the lowest-power cache configuration
• With a different search order (line size, associativity, way prediction, then cache size)
  • 11 out of 19 benchmarks miss the best configuration
Energy Savings
[Figure: Energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented. cnv: conventional cache; cfg: configurable cache; wc: way concatenation; ws: way shutdown; lc: line concatenation. (C. Zhang, ACM TECS, to appear)]
• 100% stands for the energy consumption of a conventional four-way set-associative cache
• A conventional direct-mapped cache may consume unacceptable energy
• Figure annotations: on average, 40% energy reduction; 70% energy reduction
Conclusions
• A highly configurable cache architecture
  • Reduces memory-access-related energy by 40% on average
• A self-tuning mechanism is proposed
  • A special cache parameter explorer
  • A heuristic algorithm to search the parameter space
  • Cache flushing is avoided