Energy-efficiency potential of a phase-based cache resizing scheme for embedded systems
G. Pokam and F. Bodin
Motivation (1/3)
• High performance is hard to reconcile with low power
• Consider the cache hierarchy, for instance
  • benefits of large caches:
    • maintain the embedded code + data workload on-chip
    • reduce off-chip memory traffic
  • however:
    • caches account for ~80% of the transistor count
    • roughly half of the chip area is usually devoted to caches
Motivation (2/3)
• Cache impact on energy consumption
  • static energy is disproportionately large compared to the rest of the chip: ~80% of the transistors contribute steadily to leakage power
  • dynamic energy (transistor switching activity) represents an important fraction of the total energy because caches are accessed so frequently
• Cache design is therefore critical for high-performance embedded systems
Motivation (3/3)
• We seek to address cache energy management via hardware/software interaction
• Any good way to achieve that?
  • Yes: add flexibility so that the cache can be reconfigured efficiently
• How?
  • Follow program phases and adapt the cache structure accordingly
Previous work (1/2)
• Configurable cache proposals that apply to embedded systems include:
  • Albonesi [MICRO'99]: selective cache ways
    • disable/enable individual cache ways of a highly set-associative cache
  • Zhang et al. [ISCA'03]: way-concatenation
    • reduce the cache associativity while still maintaining the full cache capacity
Previous work (2/2)
• These approaches only consider configuration on a per-application basis
• Problems:
  • empirically, no single best cache configuration exists for a given application
  • the dynamic cache behavior varies within an application, and from one application to another
• Therefore, these approaches do not accommodate program phase changes well
Our approach
• Objective:
  • emphasize application-specific cache architectural parameters
• To do so, we consider a cache with a fixed line size and a modulus set-mapping function
  • power/performance is then dictated by size and associativity
• Not all dynamic program phases have the same cache size and associativity requirements!
  • dynamically vary size and associativity to leverage the power/performance tradeoff at phase level
Cache model (1/8)
• Baseline cache model:
  • way-concatenation cache [Zhang ISCA'03]
• Functionality of the way-concatenation cache
  • on each cache lookup, a small logic block selects the number of active cache ways m out of the n available cache ways
  • virtually, each active cache way is a multiple of the size of a single bank in the base n-way cache (see the sketch below)
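The way-concatenation indexing can be pictured with a minimal C sketch. The sizing constants match the 32KB 4-way, 32B-line cache used later in the experiments; all names are illustrative, not taken from the paper.

```c
#include <stdint.h>

#define NUM_BANKS  4    /* n: banks (ways) in the base cache    */
#define LINE_SIZE  32   /* bytes per cache line                 */
#define BANK_LINES 256  /* lines per bank: 32KB / 4 banks / 32B */

/* With m active ways, each logical way concatenates n/m banks, so the
 * number of sets grows by a factor of n/m while capacity stays 32KB:
 * m = 4 -> 256 sets, m = 2 -> 512 sets, m = 1 -> 1024 sets.          */
static inline uint32_t set_index(uint32_t addr, int m /* active ways */)
{
    uint32_t sets = BANK_LINES * (NUM_BANKS / m);
    return (addr / LINE_SIZE) % sets;   /* modulus set-mapping function */
}
```

Note how reducing m reduces associativity without shrinking capacity: the extra index bits simply spread the sets across the concatenated banks.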
Cache model (2/8)
• Our proposal:
  • modify the associativity while guaranteeing cache coherency
  • modify the cache size while preserving data availability in the unused cache portions
Cache model (3/8)
• First enhancement: associativity level
• Problem with the baseline model; consider the following scenario (banks 0 to 3):
  • Phase 0: 32K 2-way, active banks are 0 and 2; address @A is cached in bank 0
  • Phase 1: 32K 1-way, active bank is 2; @A is modified, so bank 2 holds the up-to-date copy while the old copy of @A left in the now-inactive bank 0 is stale and requires invalidation
Cache model (4/8)
• Proposed solution:
  • assume a write-through cache
  • the unused tag and status arrays must be made accessible on a write to ensure coherency across cache configurations => associative tag array
  • action of the cache controller: access all tag arrays on a write request and set the corresponding status bit to invalid (sketched below)
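A minimal sketch of this controller action, assuming a direct-mapped layout inside each bank; the type and field names are illustrative, since the slides give no implementation:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS  4
#define LINE_SIZE  32
#define BANK_LINES 256

typedef struct { uint32_t tag; bool valid; } tagline_t;

typedef struct {
    tagline_t tags[NUM_BANKS][BANK_LINES]; /* tag + status arrays         */
    uint8_t   active_mask;                 /* banks used by current phase */
} cache_t;

/* On a write request, probe the tag arrays of ALL n banks, not only the
 * active ones (associative tag array), and invalidate any stale copy in
 * an inactive bank. Because the cache is write-through, memory already
 * holds the new value, so dropping the line is safe.                    */
void on_write(cache_t *c, uint32_t addr)
{
    uint32_t idx = (addr / LINE_SIZE) % BANK_LINES;    /* bank-local set */
    uint32_t tag = addr / (LINE_SIZE * BANK_LINES);
    for (int b = 0; b < NUM_BANKS; b++) {
        tagline_t *t = &c->tags[b][idx];
        if (t->valid && t->tag == tag && !(c->active_mask & (1u << b)))
            t->valid = 0;                 /* kill the out-of-date copy */
    }
}
```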
Cache model (5/8)
• Second enhancement: cache size level
• Problem with the baseline model:
  • gated-Vdd is used to disconnect a bank => data are not preserved across two configurations!
• Proposed solution:
  • unused cache ways are put in a low-power mode => drowsy mode [Flautner et al. ISCA'02]
  • the tag portion is left unchanged!
• Main advantage
  • we can reduce the cache size and preserve the state of the unused memory cells across program phases, while still reducing leakage energy!
Cache model (6/8) • Overall cache model
Cache model (8/8)
• The drowsy circuitry accounts for less than 3% of the chip area
• Accessing a line in drowsy mode requires a 1-cycle wake-up delay [Flautner et al. ISCA'02]
• ISA extension
  • we assume the ISA can be extended with a reconfiguration instruction that updates the way-configuration register (WCR) accordingly
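As a rough illustration of the intended instruction semantics, here is a hedged C sketch; the WCR field layout, the `reconf` name, and the `enter_drowsy()` hook are all assumptions, not the paper's actual encoding:

```c
#include <stdint.h>

/* Assumed WCR layout: which banks the next phase uses, and the
 * associativity m exposed by the way-concatenation logic.       */
typedef struct {
    uint8_t active_mask; /* bit b set => bank b is active */
    uint8_t assoc;       /* m in {1, 2, 4}                */
} wcr_t;

static volatile wcr_t WCR;

/* Platform hook (hypothetical): lower a bank's supply voltage to the
 * state-preserving drowsy level.                                     */
static void enter_drowsy(int bank) { (void)bank; }

/* Assumed semantics of the reconfiguration instruction: update WCR,
 * then put the banks that just became unused into drowsy mode,
 * preserving their contents for a later phase.                      */
void reconf(uint8_t new_mask, uint8_t assoc)
{
    uint8_t dropped = WCR.active_mask & (uint8_t)~new_mask;
    WCR.active_mask = new_mask;
    WCR.assoc       = assoc;
    for (int b = 0; b < 4; b++)
        if (dropped & (1u << b))
            enter_drowsy(b);
}
```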
Trace-based analysis (1/3)
• Goal:
  • extract performance and energy profiles from the trace in order to adapt the cache structure to the dynamic application requirements
• Assumptions:
  • LRU replacement policy
  • no prefetching
Trace-based analysis (2/3)
• The trace is divided into sample intervals, and we consider:
  • a set-mapping function (for varying the associativity)
  • an LRU-stack distance d (for varying the cache size)
• Then, define the LRU-stack profiles:
  • performance: for each pair (set-mapping function, d), the number of dynamic references in a sample interval that hit in a cache with that set mapping and LRU-stack distance at most d
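One way to gather such a profile from a trace is the classic per-set LRU stack simulation; the following C sketch uses illustrative constants for a single set-mapping function and counts references by stack distance:

```c
#include <stdint.h>

#define MAX_ASSOC 4      /* deepest stack distance we track */
#define SETS      256
#define LINE_SIZE 32
#define NO_TAG    0xFFFFFFFFu

static uint32_t stack[SETS][MAX_ASSOC]; /* per-set LRU stacks               */
static uint64_t hist[MAX_ASSOC + 1];    /* hist[d]: refs at stack distance d */

void lru_init(void)
{
    for (int s = 0; s < SETS; s++)
        for (int i = 0; i < MAX_ASSOC; i++)
            stack[s][i] = NO_TAG;
}

void reference(uint32_t addr)
{
    uint32_t set = (addr / LINE_SIZE) % SETS;   /* modulus set mapping */
    uint32_t tag = addr / (LINE_SIZE * SETS);

    int d = MAX_ASSOC;               /* MAX_ASSOC means "deeper than tracked" */
    for (int i = 0; i < MAX_ASSOC; i++)
        if (stack[set][i] == tag) { d = i; break; }
    hist[d]++;

    /* Move-to-front update: this is exactly LRU ordering. */
    for (int i = (d < MAX_ASSOC ? d : MAX_ASSOC - 1); i > 0; i--)
        stack[set][i] = stack[set][i - 1];
    stack[set][0] = tag;
}
```

By the LRU inclusion property, a cache with this set mapping and associativity a hits on exactly the references with stack distance below a, so one pass over the trace yields the performance profile for every candidate size at once.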
Trace-based analysis (3/3)
• energy: for each pair (set-mapping function, d), the sum of four components:
  • cache energy
  • tag energy
  • drowsy-transition energy
  • memory energy
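A hedged roll-up of these four components might look as follows; the per-event energy constants are placeholders, since in the study they come from CACTI 3.0 and HotLeakage:

```c
#include <stdint.h>

/* Illustrative per-event energies in joules; NOT the paper's numbers.
 * E_MEM is set to 50x the cache access energy, matching the estimated
 * memory ratio quoted in the experimental setup.                      */
#define E_CACHE  0.30e-9
#define E_TAG    0.05e-9
#define E_DROWSY 0.02e-9
#define E_MEM    (50.0 * E_CACHE)

/* Energy profile of one configuration over a sample interval. */
double interval_energy(uint64_t hits, uint64_t misses,
                       uint64_t drowsy_transitions)
{
    return (double)(hits + misses) * (E_CACHE + E_TAG) /* cache + tag energy  */
         + (double)drowsy_transitions * E_DROWSY       /* wake-up transitions */
         + (double)misses * E_MEM;                     /* off-chip memory     */
}
```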
Experimental setup (1/2)
• Focus on the data cache
• Simulation platform
  • 4-issue VLIW processor [Faraboschi et al. ISCA'00]
  • 32KB 4-way data cache
  • 32B block size
  • 20-cycle miss penalty
• Benchmarks
  • MiBench: fft, gsm, susan
  • MediaBench: mpeg, epic
  • PowerStone: summin, whetstone, v42bis
Experimental setup (2/2)
• CACTI 3.0
  • to obtain dynamic energy values
  • we extend it to provide leakage energy values for each simulated cache configuration
• HotLeakage
  • from which we adapted the leakage energy calculation for each simulated leakage-reduction technique
• estimated ratio of memory to cache access energy = 50
• drowsy energy values from [Flautner et al. ISCA'02]
Program behavior (1/4)
• GSM energy/performance profile (log10 scale on both axes); annotated regions: a sensitive region shaped by capacity misses (all 32K and all 16K configurations), a tradeoff region (8K configuration), and an insensitive region
Program behavior (2/4) • FFT
Program behavior (3/4)
• Working-set size sensitivity property
  • the working set can be partitioned into clusters with similar cache sensitivity
• Capturing sensitivity through working-set size clustering
  • the partitioning is done relative to the base cache configuration
  • we use a simple metric based on the Manhattan distance between two profile points (sketched below)
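A minimal sketch of the clustering metric, treating each sample's profile as a (performance, energy) point; the threshold and field names are assumptions for illustration:

```c
#include <math.h>
#include <stdbool.h>

typedef struct { double perf; double energy; } profile_t;

/* Manhattan (L1) distance between two profile points. */
static double manhattan(profile_t a, profile_t b)
{
    return fabs(a.perf - b.perf) + fabs(a.energy - b.energy);
}

/* Two working-set samples land in the same cluster when their profiles,
 * measured relative to the base cache configuration, are close enough. */
static bool same_cluster(profile_t a, profile_t b, double threshold)
{
    return manhattan(a, b) < threshold;
}
```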
Program behavior (4/4)
• More energy/performance profiles: summin, whetstone
Results (1/3) • Dynamic energy reduction
Results (2/3)
• Leakage energy savings (0.07 µm); chart annotation: better due to gated-Vdd
Results (3/3)
• Performance; chart annotation: worst-case degradation (65% due to drowsy transitions)
Conclusions and future work
• Performance can be improved further:
  • reduce the frequency of drowsy transitions within a phase with refined cache-bank access policies
  • manage reconfiguration at the compiler level
    • insert basic-block annotations in the trace
    • exploit feedback-directed compilation
• Overall, a promising scheme for embedded systems