Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California
Introduction: Memory Power
• On-chip memories are a major contributor to system energy
• Data caches: ~16% in StrongARM [Unsal et al., '01]
Hardware approaches: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
– Limited program information
– Reactive
Software approaches: software-controlled scratch-pads, data/code reorganization
+ Whole-program information
+ Proactive
– No dynamic adaptability
– Conservative
Reducing Data Memory Power: Compiler Managed, Hardware Assisted
A hybrid approach keeps the benefits of both sides:
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations
Data Caches: Tradeoffs
Advantages:
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups
Disadvantages:
– Fixed replacement policy
– Set index ignores program locality
– Set-associativity has high overhead: multiple data/tag arrays activated per access
Traditional Cache Architecture
[Figure: 4-way set-associative cache; the address splits into tag, set, and offset fields, each way holds tag/data/LRU state, four tag comparators feed a 4:1 mux]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways
Partitioned Cache Architecture
[Figure: the same 4-way structure with the ways exposed as partitions P0–P3; each load/store carries a k-bit partition vector and an R/U flag]
• Lookup: restricted to the partitions specified in the bit-vector if 'R'; defaults to all partitions otherwise ('U')
• Replacement: restricted to the partitions specified in the bit-vector
• Advantages:
• Improve performance by controlling replacement
• Reduce cache access power by restricting the number of partitions probed
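The lookup rule above can be sketched in a few lines. This is an illustrative model of the R/U semantics, not the paper's hardware; all names are hypothetical.

```python
def lookup_ways(bitvector, restricted, num_ways=4):
    """Return the list of ways probed for one access.

    bitvector  -- 0/1 flags, one per partition (way)
    restricted -- True for an 'R' access (probe only flagged partitions),
                  False for 'U' (default to probing all partitions)
    """
    if restricted:
        return [w for w in range(num_ways) if bitvector[w]]
    return list(range(num_ways))

# An 'R' access tagged [1,0,0,0] activates a single tag/data pair,
# while a 'U' access falls back to a conventional 4-way probe.
```

The energy saving comes directly from the shorter probe list: each skipped way is one tag compare and one data-array activation avoided.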
Partitioned Caches: Example

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

Partition assignment over a 3-way cache: the y accesses (ld1/st1, ld2/st2) go to way-0, tagged [100] R; the w1/w2 accesses (ld5, ld6) go to way-1, tagged [010] R; the x accesses (ld3, ld4) go to way-2, tagged [001] R.
• Reduces the number of tag checks per iteration from 12 to 4!
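One plausible reading of the 12-to-4 arithmetic, assuming the j-loop body performs four memory accesses per iteration (y load, y store, *w1++ load, x load) against the 3-way cache of the example:

```python
ways = 3
accesses_per_iter = 4  # ld1, st1 (y), ld5 (w1), ld3 (x)

# Unpartitioned: every access probes the tags of all ways.
unpartitioned_checks = accesses_per_iter * ways
# Partitioned with 'R' accesses: each access probes exactly one partition.
partitioned_checks = accesses_per_iter * 1

# 12 tag checks per iteration drop to 4.
```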
Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application's memory characteristics:
• Cache requirements: number of partitions per ld/st
• Predict conflicts
• Place loads/stores into different partitions:
• Satisfy each instruction's caching needs
• Avoid conflicts; overlap if possible
Cache Analysis: Estimating the Number of Partitions
• Minimal partitions needed to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working set to compute the number of partitions
[Figure: access streams of the j-loop and k-loop over the X, W1/W2, and Y blocks; the working set of instruction M spans one block, B1]
• M has working-set size = 1
Cache Analysis: Estimating the Number of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimate the hit rate based on:
• Reuse distance (D), total number of cache blocks (B), and associativity (A) [Brehob et al., '99]
[Figure: estimated hit rates for reuse distances D = 0, 1, 2 across associativities 1–4 and cache sizes of 8–32 blocks]
• In practice, compute energy estimates rather than raw hit rates
• Pick the most energy-efficient configuration per instruction
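A binomial-style estimate in the spirit of this reuse-distance model might look like the sketch below. The exact expression from Brehob et al. '99 may differ; this version assumes the D distinct intervening blocks map uniformly at random to the cache's sets.

```python
from math import comb

def hit_rate(D, B, A):
    """Estimated hit probability for an access with reuse distance D
    in a cache of B total blocks and associativity A.

    The access hits if fewer than A of the D intervening blocks landed
    in its own set (assumed LRU within a set).  Illustrative model only.
    """
    sets = B // A
    p = 1.0 / sets  # chance an intervening block shares our set
    return sum(comb(D, k) * p**k * (1 - p)**(D - k) for k in range(A))
```

A reuse distance of 0 always yields a hit; larger D or a smaller cache lowers the estimate, which is the trend the figure's curves show.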
Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph
[Figure: the j-/k-loop access streams annotated with instructions M1–M4, each with reuse distance D = 1; edges connect references that are live at the same time]
Partition Assignment
• The placement phase can overlap references
• Compute the combined working set
• Use the graph-theoretic notion of a clique
• For each clique, new D = Σ of the D values of its nodes
• Combined D over all overlaps = max over all cliques
Example: M1–M4 each have D = 1. Clique 1 (M1, M2, M4): new reuse distance D = 3. Clique 2 (M1, M3, M4): new D = 3. Combined reuse distance = max(3, 3) = 3.
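The sum-then-max rule can be sketched directly on the slide's example. Clique enumeration itself (e.g. Bron–Kerbosch over the interference graph) is elided here; the cliques are given as input.

```python
def combined_reuse_distance(reuse, cliques):
    """For each clique, sum the reuse distances of its members; then take
    the maximum over all cliques -- the worst-case combined working set
    that overlapped references must share in one partition."""
    return max(sum(reuse[m] for m in clique) for clique in cliques)

# The slide's example: M1-M4 each have D = 1.
reuse = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}
cliques = [{"M1", "M2", "M4"}, {"M1", "M3", "M4"}]
# Each clique sums to 3, so the combined reuse distance is max(3, 3) = 3.
```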
Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
• 1-KB to 32-KB
• 32-byte block size
• 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
• MediaBench suite
• CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Chart: average ways accessed per access for 2-, 4-, and 8-partition caches across 1-KB to 32-KB cache sizes]
• 36% reduction on an 8-partition cache
Improvement in Fetch Energy
[Chart: percentage energy improvement for a 16-KB cache, comparing 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way across the MediaBench programs (epic, cjpeg, djpeg, unepic, pegwit, rawcaudio/rawdaudio, mpeg2, pgp, gsm, g721) and their average]
Summary
• Maintains the advantages of a hardware cache
• Exposes placement and lookup decisions to the compiler
• Avoids conflicts, eliminates redundancies
• 24% energy savings for a 4-KB cache with 4 partitions
• Extensions:
• Hybrid scratch-pads and caches
• Disable selected tags to convert those partitions into scratch-pads
• 35% additional savings in a 4-KB cache with 1 partition as a scratch-pad
Thank You & Questions
Cache Analysis Step 1: Instruction Fusioning
• Combine loads/stores that access the same set of objects
• Avoids coherence problems and duplication
• Uses points-to analysis

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < readInput1(); j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < readInput2(); k++)
    y[i + k] += *w2++ + x[i + k];
}

[Figure: loads/stores that may access the same objects (ld1/st1, ld2/st2, ld3–ld6) are fused into groups M1 and M2]
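Fusioning can be sketched as grouping memory operations whose points-to sets overlap, so that no object is cached in two partitions at once. The function and the points-to sets below are illustrative assumptions, not the paper's implementation.

```python
def fuse_instructions(points_to):
    """points_to maps each ld/st to the set of objects it may access.
    Returns groups of instructions whose points-to sets are connected
    (transitively share at least one object)."""
    groups = []  # list of (instruction set, object set) pairs
    for instr, objs in points_to.items():
        merged, objset = {instr}, set(objs)
        remaining = []
        for g_instrs, g_objs in groups:
            if g_objs & objset:      # overlapping objects -> same group
                merged |= g_instrs
                objset |= g_objs
            else:
                remaining.append((g_instrs, g_objs))
        groups = remaining + [(merged, objset)]
    return [sorted(g) for g, _ in groups]

# Hypothetical points-to sets for the example loop:
points_to = {"ld1": {"y"}, "st1": {"y"}, "ld2": {"y"}, "st2": {"y"},
             "ld3": {"x"}, "ld4": {"x"}, "ld5": {"w1"}, "ld6": {"w2"}}
```

All accesses to y fuse into one group and all accesses to x into another, so later placement treats each group as a single unit.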
Partition Assignment
• Greedily place instructions based on their cache estimates
• Overlap instructions if required:
• Compute the number of partitions for overlapped instructions
• Enumerate cliques within the interference graph
• Compute the combined working set of all cliques
• Assign the R/U bit to control lookup
[Figure: interference graph over M1–M4, each with D = 1, showing cliques 1 and 2]
Related Work
• Direct-addressed and cool caches [Unsal '01, Asanovic '01]
• Tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers '96]
• Hardware managed, two partitions
• Column partitioning [Devadas '00]
• Individual ways can be configured as scratch-pads
• No load/store-based partitioning
• Region-based caching [Tyson '02]
• Heap, stack, globals
• Our approach gives finer-grained control and management
• Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]
• Reduce tag-check power
• Compromise on cycle time
• Orthogonal to our technique
Code Size Overhead
[Chart: percentage of instructions added per MediaBench program, split into annotated LD/STs and extra MOV instructions; overheads peak around 15–16%]