
A Self-Tuning Cache Architecture for Embedded Systems

Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky. *Dept. of Electrical Engineering; Dept. of Computer Science and Engineering, University of California, Riverside. **Also with the Center for Embedded Computer Systems at UC Irvine.





Presentation Transcript


  1. A Self-Tuning Cache Architecture for Embedded Systems. Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky. *Dept. of Electrical Engineering; Dept. of Computer Science and Engineering, University of California, Riverside. **Also with the Center for Embedded Computer Systems at UC Irvine. This work was supported by the National Science Foundation and the Semiconductor Research Corporation.

  2. Caches Consume Much Power • ARM920T and M*CORE: caches consume 50% of total processor system power (Segars 01, Lee et al. 99) • Caches are accessed frequently, so they consume dynamic power • Caches account for most of the transistors on a die, so they consume static power • We showed that a configurable cache can cut that power nearly in half on average (Zhang et al. ISCA 03, ISVLSI 03)

  3. Configurable Cache Architecture (Zhang et al. ISVLSI 03) [Figure: a four-way set-associative base cache with ways W1 to W4, a counter, and the bus to off-chip memory, shown reconfigured as a two-way set-associative cache and as a direct-mapped (one-way) cache; a gated-Vdd control circuit on the bitlines] • Way concatenation (Zhang et al. ISCA 03): the four ways can be combined into a two-way set-associative or direct-mapped cache • Way shutdown: ways can be shut down (the figure shows two ways shut down) using the sleep transistor method (Powell et al. ISLPED 2000) • Line concatenation: physical lines are 16 bytes; 4 physical lines are filled when the line size is 64 bytes • The way prediction unit can be turned on/off

  4. Computing Total Memory-Related Energy • Considers CPU stall energy and off-chip memory energy • Excludes CPU active energy • Thus, represents all memory-related energy
     energy_mem = energy_dynamic + energy_static
     energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
     energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
     energy_static = cycles * energy_static_per_cycle
     energy_miss = k_miss_energy * energy_hit
     energy_static_per_cycle = k_static * energy_total_per_cycle
     (we varied the k’s to account for different system implementations)
     • Underlined (in the original slide): measured quantities • SimpleScalar (cache_hits, cache_misses, cycles) • Our layout or data sheets (others)
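
The energy model on this slide can be sketched in a few lines of Python. This is only an illustration of the equations as written: the function name `memory_energy`, the `EnergyParams` container, and every numeric value are assumptions made up for the example, not figures from the presentation.

```python
from dataclasses import dataclass


@dataclass
class EnergyParams:
    """Per-configuration constants (hypothetical container, arbitrary units)."""
    energy_hit: float              # energy of one cache hit
    energy_total_per_cycle: float  # total energy per cycle
    k_miss_energy: float           # energy_miss = k_miss_energy * energy_hit
    k_static: float                # energy_static_per_cycle = k_static * energy_total_per_cycle


def memory_energy(cache_hits: int, cache_misses: int, cycles: int,
                  p: EnergyParams) -> float:
    """Return energy_mem = energy_dynamic + energy_static, as on the slide."""
    energy_miss = p.k_miss_energy * p.energy_hit
    energy_static_per_cycle = p.k_static * p.energy_total_per_cycle
    energy_dynamic = cache_hits * p.energy_hit + cache_misses * energy_miss
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static


# Made-up counter values and constants, purely to exercise the formulas.
params = EnergyParams(energy_hit=1.0, energy_total_per_cycle=2.0,
                      k_miss_energy=50.0, k_static=0.25)
total = memory_energy(cache_hits=900, cache_misses=100, cycles=2000, p=params)
print(total)  # dynamic 5900.0 + static 1000.0
```

In this sketch the hit, miss, and cycle counts play the role of the SimpleScalar outputs, while the constants stand in for the values the slide attributes to layout or data sheets.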

  5. Best Configuration Varies Across Applications

  6. Cache Self-tuning Hardware [Figure: processor with I$ and D$, a tuner, and off-chip memory] • Simulation-based methods • Drawback: slowness. Seconds of real-time work may take tens of hours to simulate • Setting up the simulation tools may be difficult • Self-tuning method • Incorporates a cache parameter tuner on an SoC platform • Detects the cache parameters with the lowest energy dissipation • The tuner sits to the side and collects the information used to calculate the energy • A heuristic algorithm is needed • Searching all possible cache configurations is time consuming. With other configurable parameters (voltage levels, bus width, etc.), the search space quickly grows to millions of configurations • Cache flushing should be avoided

  7. Designing a Search Heuristic: Evaluating Impact of Cache Parameters on Miss Rate and Energy [Figure: average instruction cache miss rate and normalized energy of the benchmarks; panel labels: One Way, Line Size 32B]

  8. Energy Dissipation of On-Chip Cache and Off Chip Memory

  9. Heuristic: Searching for the Least-Energy Cache Configuration [Figure: ways W1 to W4] • Search order: cache size, then line size, then associativity, then way prediction • Result: the least-energy cache configuration
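
A search of this kind, tuning one parameter at a time in a fixed order, might look like the following sketch. The slide gives only the order of the parameters, so the rest is assumed for illustration: the starting point, the rule of sweeping every value of a parameter before moving on, the parameter values, and the `toy_energy` function are all hypothetical and not taken from the presentation.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def tune_cache(options: List[Tuple[str, List[int]]],
               measure_energy: Callable[[Config], float]) -> Tuple[Config, float]:
    """Tune one parameter at a time, in the order given by `options`."""
    # Assumption: start from the first listed value of every parameter.
    config = {name: values[0] for name, values in options}
    best_energy = measure_energy(config)
    for name, values in options:
        # Vary only this parameter; keep the best value found so far.
        for value in values[1:]:
            trial = dict(config, **{name: value})
            energy = measure_energy(trial)
            if energy < best_energy:
                config, best_energy = trial, energy
    return config, best_energy


def toy_energy(cfg: Config) -> float:
    """Made-up energy function; a real tuner would measure this at run time."""
    return (abs(cfg["size_kb"] - 4) + abs(cfg["line_bytes"] - 32) / 16
            + cfg["ways"] + 0.5 * cfg["way_prediction"])


# Search order from the slide; the value lists themselves are hypothetical.
search_order = [
    ("size_kb", [2, 4, 8]),
    ("line_bytes", [16, 32, 64]),
    ("ways", [1, 2, 4]),
    ("way_prediction", [0, 1]),
]
best_config, best_energy = tune_cache(search_order, toy_energy)
print(best_config, best_energy)
```

Because each parameter is settled before the next one is touched, the result depends on the order in `search_order`, which is why the order of the search steps matters.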

  10. Implementing the Heuristic in Hardware [Figure: FSM and data path of the cache explorer; labels: hit energies, hit num, input, miss energies, miss num, static energies, exe time, FSM, mux, control, multiplier, com_out, adder, configure register, register, lowest energy, comparator] • Total size of the tuner: about 4,200 gates, or 0.041 mm² in 0.18 micron CMOS technology • Area overhead: compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead • Power consumption: 2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5% • Furthermore, the exploring hardware is used only during the exploring stage and can be shut down after the best configuration is determined

  11. Heuristic Time Complexity and Effectiveness • Time complexity: • Searching the whole space: O(m x n x l x p) • Heuristic: O(m + n + l + p) • m: number of associativities, n: number of cache sizes, l: number of cache line sizes, p: way prediction on/off • Efficiency • On average, the heuristic searches 5 configurations instead of all 27 • For 2 out of 19 benchmarks, the heuristic misses the lowest-power cache configuration • With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration
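
The gap between the two complexity bounds can be made concrete with a small count. The option counts below are hypothetical and chosen only to show the product-versus-sum difference; they are not the parameter set behind the 27 configurations reported on the slide. The formula for the one-parameter-at-a-time case also assumes that every value of each parameter is tried once, which is an upper bound if the search stops early.

```python
from math import prod
from typing import Sequence


def exhaustive_evaluations(option_counts: Sequence[int]) -> int:
    """Every combination is evaluated: m * n * l * p."""
    return prod(option_counts)


def one_at_a_time_evaluations(option_counts: Sequence[int]) -> int:
    """One starting configuration, then each remaining value of each parameter once."""
    return 1 + sum(count - 1 for count in option_counts)


# Hypothetical counts: 3 associativities, 3 sizes, 3 line sizes, way prediction on/off.
counts = [3, 3, 3, 2]
exhaustive = exhaustive_evaluations(counts)
sequential = one_at_a_time_evaluations(counts)
print(exhaustive, sequential)  # 54 8
```

Adding a further parameter with k options multiplies the exhaustive count by k but adds only k - 1 evaluations to the one-at-a-time count.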

  12. Energy Savings [Figure: energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented. cnv: conventional cache; cfg: configurable cache; wc: way concatenation; ws: way shutdown; lc: line concatenation. Chart callouts: on average, 40% energy reduction; 70% energy reduction] • 100% stands for the energy consumption of a conventional four-way set-associative cache • A conventional direct-mapped cache may consume unacceptable energy (C. Zhang, ACM TECS, to appear)

  13. Conclusions • A highly configurable cache architecture • Reduces memory-access-related energy by 40% on average • A self-tuning mechanism is proposed • A special cache parameter explorer • A heuristic algorithm to search the parameter space • Cache flushing is avoided
