Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett-Packard, Cupertino, California
Introduction: Memory Power
• On-chip memories are a major contributor to system energy
• Data caches: ~16% in StrongARM [Unsal et al., '01]
Hardware approaches: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
– Limited program information
– Reactive
Software approaches: software-controlled scratch-pads, data/code reorganization
+ Whole-program information
+ Proactive
– No dynamic adaptability
– Conservative
Reducing Data Memory Power: Compiler Managed, Hardware Assisted
A hybrid approach keeps the benefits of both sides:
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations
Data Caches: Tradeoffs
Advantages:
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups
Disadvantages:
– Fixed replacement policy
– Set index ignores program locality
– Set-associativity has high overhead: multiple data/tag arrays activated per access
Traditional Cache Architecture
[Figure: 4-way set-associative cache; the address splits into tag, set, and offset fields, each way holds tag/data/LRU state, four tag comparators feed a 4:1 mux]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways
Partitioned Cache Architecture
[Figure: the same 4-way structure with the ways exposed as partitions P0–P3; each load/store carries a k-bit partition vector and an R/U flag]
• Lookup: restricted to the partitions specified in the bit-vector if 'R'; defaults to all partitions otherwise ('U')
• Replacement: restricted to the partitions specified in the bit-vector
• Advantages:
• Improve performance by controlling replacement
• Reduce cache access power by restricting the number of partitions probed
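The lookup rule above can be sketched in a few lines. This is an illustrative model of the R/U semantics, not the paper's hardware; all names are hypothetical.

```python
def lookup_ways(bitvector, restricted, num_ways=4):
    """Return the list of ways probed for one access.

    bitvector  -- 0/1 flags, one per partition (way)
    restricted -- True for an 'R' access (probe only flagged partitions),
                  False for 'U' (default to probing all partitions)
    """
    if restricted:
        return [w for w in range(num_ways) if bitvector[w]]
    return list(range(num_ways))

# An 'R' access tagged [1,0,0,0] activates a single tag/data pair,
# while a 'U' access falls back to a conventional 4-way probe.
```

The energy saving comes directly from the shorter probe list: each skipped way is one tag compare and one data-array activation avoided.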
Partitioned Caches: Example

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

Partition assignment over a 3-way cache: the y accesses (ld1/st1, ld2/st2) go to way-0, tagged [100] R; the w1/w2 accesses (ld5, ld6) go to way-1, tagged [010] R; the x accesses (ld3, ld4) go to way-2, tagged [001] R.
• Reduces the number of tag checks per iteration from 12 to 4!
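One plausible reading of the 12-to-4 arithmetic, assuming the j-loop body performs four memory accesses per iteration (y load, y store, *w1++ load, x load) against the 3-way cache of the example:

```python
ways = 3
accesses_per_iter = 4  # ld1, st1 (y), ld5 (w1), ld3 (x)

# Unpartitioned: every access probes the tags of all ways.
unpartitioned_checks = accesses_per_iter * ways
# Partitioned with 'R' accesses: each access probes exactly one partition.
partitioned_checks = accesses_per_iter * 1

# 12 tag checks per iteration drop to 4.
```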
Compiler Controlled Data Partitioning
• Goal: place loads/stores into cache partitions
• Analyze the application's memory characteristics:
• Cache requirements: number of partitions per ld/st
• Predict conflicts
• Place loads/stores into different partitions:
• Satisfy each instruction's caching needs
• Avoid conflicts; overlap if possible
Cache Analysis: Estimating the Number of Partitions
• Minimal partitions needed to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working set to compute the number of partitions
[Figure: access streams of the j-loop and k-loop over the X, W1/W2, and Y blocks; the working set of instruction M spans one block, B1]
• M has working-set size = 1
Cache Analysis: Estimating the Number of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimate the hit rate based on:
• Reuse distance (D), total number of cache blocks (B), and associativity (A) [Brehob et al., '99]
[Figure: estimated hit rates for reuse distances D = 0, 1, 2 across associativities 1–4 and cache sizes of 8–32 blocks]
• In practice, compute energy estimates rather than raw hit rates
• Pick the most energy-efficient configuration per instruction
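A binomial-style estimate in the spirit of this reuse-distance model might look like the sketch below. The exact expression from Brehob et al. '99 may differ; this version assumes the D distinct intervening blocks map uniformly at random to the cache's sets.

```python
from math import comb

def hit_rate(D, B, A):
    """Estimated hit probability for an access with reuse distance D
    in a cache of B total blocks and associativity A.

    The access hits if fewer than A of the D intervening blocks landed
    in its own set (assumed LRU within a set).  Illustrative model only.
    """
    sets = B // A
    p = 1.0 / sets  # chance an intervening block shares our set
    return sum(comb(D, k) * p**k * (1 - p)**(D - k) for k in range(A))
```

A reuse distance of 0 always yields a hit; larger D or a smaller cache lowers the estimate, which is the trend the figure's curves show.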
Cache Analysis: Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph
[Figure: the j-/k-loop access streams annotated with instructions M1–M4, each with reuse distance D = 1; edges connect references that are live at the same time]
Partition Assignment
• The placement phase can overlap references
• Compute the combined working set
• Use the graph-theoretic notion of a clique
• For each clique, new D = Σ of the D values of its nodes
• Combined D over all overlaps = max over all cliques
Example: M1–M4 each have D = 1. Clique 1 (M1, M2, M4): new reuse distance D = 3. Clique 2 (M1, M3, M4): new D = 3. Combined reuse distance = max(3, 3) = 3.
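The sum-then-max rule can be sketched directly on the slide's example. Clique enumeration itself (e.g. Bron–Kerbosch over the interference graph) is elided here; the cliques are given as input.

```python
def combined_reuse_distance(reuse, cliques):
    """For each clique, sum the reuse distances of its members; then take
    the maximum over all cliques -- the worst-case combined working set
    that overlapped references must share in one partition."""
    return max(sum(reuse[m] for m in clique) for clique in cliques)

# The slide's example: M1-M4 each have D = 1.
reuse = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}
cliques = [{"M1", "M2", "M4"}, {"M1", "M3", "M4"}]
# Each clique sums to 3, so the combined reuse distance is max(3, 3) = 3.
```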
Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
• 1-KB to 32-KB
• 32-byte block size
• 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
• MediaBench suite
• CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Chart: average ways accessed per access for 2-, 4-, and 8-partition caches across 1-KB to 32-KB cache sizes]
• 36% reduction on an 8-partition cache
Improvement in Fetch Energy
[Chart: percentage energy improvement for a 16-KB cache, comparing 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way across the MediaBench programs (epic, cjpeg, djpeg, unepic, pegwit, rawcaudio/rawdaudio, mpeg2, pgp, gsm, g721) and their average]
Summary
• Maintains the advantages of a hardware cache
• Exposes placement and lookup decisions to the compiler
• Avoids conflicts, eliminates redundancies
• 24% energy savings for a 4-KB cache with 4 partitions
• Extensions:
• Hybrid scratch-pads and caches
• Disable selected tags to convert those partitions into scratch-pads
• 35% additional savings in a 4-KB cache with 1 partition as a scratch-pad
Thank You & Questions
Cache Analysis Step 1: Instruction Fusioning
• Combine loads/stores that access the same set of objects
• Avoids coherence problems and duplication
• Uses points-to analysis

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < readInput1(); j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < readInput2(); k++)
    y[i + k] += *w2++ + x[i + k];
}

[Figure: loads/stores that may access the same objects (ld1/st1, ld2/st2, ld3–ld6) are fused into groups M1 and M2]
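Fusioning can be sketched as grouping memory operations whose points-to sets overlap, so that no object is cached in two partitions at once. The function and the points-to sets below are illustrative assumptions, not the paper's implementation.

```python
def fuse_instructions(points_to):
    """points_to maps each ld/st to the set of objects it may access.
    Returns groups of instructions whose points-to sets are connected
    (transitively share at least one object)."""
    groups = []  # list of (instruction set, object set) pairs
    for instr, objs in points_to.items():
        merged, objset = {instr}, set(objs)
        remaining = []
        for g_instrs, g_objs in groups:
            if g_objs & objset:      # overlapping objects -> same group
                merged |= g_instrs
                objset |= g_objs
            else:
                remaining.append((g_instrs, g_objs))
        groups = remaining + [(merged, objset)]
    return [sorted(g) for g, _ in groups]

# Hypothetical points-to sets for the example loop:
points_to = {"ld1": {"y"}, "st1": {"y"}, "ld2": {"y"}, "st2": {"y"},
             "ld3": {"x"}, "ld4": {"x"}, "ld5": {"w1"}, "ld6": {"w2"}}
```

All accesses to y fuse into one group and all accesses to x into another, so later placement treats each group as a single unit.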
Partition Assignment
• Greedily place instructions based on their cache estimates
• Overlap instructions if required:
• Compute the number of partitions for overlapped instructions
• Enumerate cliques within the interference graph
• Compute the combined working set of all cliques
• Assign the R/U bit to control lookup
[Figure: interference graph over M1–M4, each with D = 1, showing cliques 1 and 2]
Related Work
• Direct-addressed and cool caches [Unsal '01, Asanovic '01]
• Tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers '96]
• Hardware managed, two partitions
• Column partitioning [Devadas '00]
• Individual ways can be configured as scratch-pads
• No load/store-based partitioning
• Region-based caching [Tyson '02]
• Heap, stack, globals
• Our approach gives finer-grained control and management
• Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]
• Reduce tag-check power
• Compromise on cycle time
• Orthogonal to our technique
Code Size Overhead
[Chart: percentage of instructions added per MediaBench program, split into annotated LD/STs and extra MOV instructions; overheads peak around 15–16%]