Compiler Managed Dynamic Instruction Placement In A Low-Power Code Cache

Rajiv Ravindran, Pracheeti Nagarkar, Ganesh Dasika, Robert Senger, Eric Marsman, Scott Mahlke, Richard Brown
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor

Introduction
• Instruction fetch power is dominant in low-power embedded processors
  • ~27% for the StrongARM
  • ~50% for the Motorola MCORE
• Two alternatives
  • Instruction cache
    + Hardware managed
    + Transparent to the user
    - Power-hungry tag-checking and comparison logic
  • Scratch-pad
    + No hardware overhead
    + Part of the physical address space
    - Managed in software

Focus Of This Work
• Explore the use of scratch-pad for reducing instruction fetch power
• Two possible software-managed schemes
  • Static
    • Map ‘hot’ regions prior to execution
    • Contents do not change during execution
  • Dynamic
    • Allow contents to change during execution
    • Explicit copying of ‘hot’ regions

Scratch-pad Management: Static Approach
[Figure: control flow graph of basic blocks BB1 to BB14 containing three traces, T1, T2, and T3.]
• Each trace is ranked by profit = size * freq
• Choosing which traces to map is equivalent to bin-packing
• In the example, T1 (64 bytes) and T2 (32 bytes) are mapped into the scratch-pad (96 bytes)
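The static selection can be sketched as a 0/1 knapsack over traces, using the slide's profit = size * freq. This is only an illustration of the packing idea, not the authors' allocator: the names `Trace` and `static_allocate`, the size of T3, and every frequency below are assumptions made up for the example.

```python
from typing import NamedTuple

class Trace(NamedTuple):
    name: str
    size: int  # bytes
    freq: int  # profiled execution count (hypothetical values below)

def static_allocate(traces, capacity):
    """Pick the subset of traces with maximum total profit that fits in
    `capacity` bytes, where profit = size * freq (0/1 knapsack DP)."""
    # best[c] = (total profit, names chosen) using at most c bytes
    best = [(0, ())] * (capacity + 1)
    for t in traces:
        profit = t.size * t.freq
        # Walk capacities downward so each trace is used at most once.
        for c in range(capacity, t.size - 1, -1):
            prev_profit, prev_names = best[c - t.size]
            if prev_profit + profit > best[c][0]:
                best[c] = (prev_profit + profit, prev_names + (t.name,))
    return best[capacity]

# Hypothetical profile: sizes of T1 and T2 follow the slide, the rest is invented.
traces = [Trace("T1", 64, 100), Trace("T2", 32, 400), Trace("T3", 32, 50)]
profit, chosen = static_allocate(traces, 96)
print(profit, chosen)
```

With these invented frequencies the 96-byte scratch-pad holds T1 and T2, and T3 always executes from memory, which is the limitation the dynamic scheme targets.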

Scratch-pad Management: Dynamic Approach
[Figure: the same control flow graph next to a timeline of scratch-pad contents. The scratch-pad is 96 bytes: T1 occupies a 64-byte region, and T2 and T3 take turns in the remaining 32-byte region.]
• Sequence of copies over time: copy T1, copy T2, copy T3 over T2, copy T2 over T3, copy T3 over T2
• Copy points inserted in the code: Copy1 for T1, Copy2 for T2, Copy3 for T2, Copy4 for T3
• Copies executed over time: copy1, copy2, copy4, copy3, copy4
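The copy sequence in this example can be reproduced with a small simulation. It is a sketch under stated assumptions: the function name `count_copies`, the byte offsets, and the 32-byte size for T3 are hypothetical, chosen so that T2 and T3 share the 32-byte region shown in the figure.

```python
def count_copies(placement, execution):
    """placement: trace name -> (offset, size) in the scratch-pad.
    execution: trace names in the order they are entered.
    Returns how many times each trace has to be copied in."""
    resident = {}  # traces currently held in the scratch-pad
    copies = {name: 0 for name in placement}
    for name in execution:
        if name in resident:
            continue  # already present, no copy needed
        off, size = placement[name]
        # Evict every resident trace whose byte range overlaps the new one.
        for other, (o_off, o_size) in list(resident.items()):
            if off < o_off + o_size and o_off < off + size:
                del resident[other]
        resident[name] = (off, size)
        copies[name] += 1
    return copies

# Assumed layout: T1 in bytes 0..63, T2 and T3 sharing bytes 64..95.
placement = {"T1": (0, 64), "T2": (64, 32), "T3": (64, 32)}
copies = count_copies(placement, ["T1", "T2", "T3", "T2", "T3"])
print(copies)  # {'T1': 1, 'T2': 2, 'T3': 2}
```

The five copies match the slide's sequence: one for T1, then T2 and T3 alternately overwriting each other.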

Objectives Of This Work
• Develop a dynamic compiler-managed scheme to exploit scratch-pad
• Prior work [Verma et al., ’04]
  • ILP-based solution
  • Not scalable
  • Limits scope of analysis to a single procedure, loop nests
• Practical solution
  • Scalable
  • Handle arbitrary control flow graphs
  • Inter-procedural analysis

Our Approach
• Two phases
  • Trace selection & scratch-pad (SP) allocation
    • Identify frequently executed traces
    • Select the most energy-beneficial traces
    • Place them with possible overlap to reduce copy overhead
  • Copy placement
    • Insert copies to realize the placement
    • Hoist within the control flow graph to minimize overhead
    • Fix branch targets into selected traces

SP Allocation: Computing Energy Gain
[Figure: control flow graph with traces T1, T2, and T3.]
• Benefit: energy savings when the trace is executed from scratch-pad instead of memory
• CopyCost: overhead associated with copying the trace once
Benefit = ProfileWeight * Size * ΔFetchEnergy
CopyCost = Size * (FetchEnergy + WriteEnergy)
Energy Gain = Benefit - CopyCost
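The three formulas translate directly into code. In this sketch the function name and all numeric inputs are hypothetical (they are not measurements from the talk), and the fetch-energy delta is read as the per-byte energy saved by fetching from scratch-pad instead of memory.

```python
def energy_gain(profile_weight, size, delta_fetch_energy,
                fetch_energy, write_energy):
    """Energy gain of placing one trace in the scratch-pad.

    profile_weight     : profiled execution count of the trace
    size               : trace size in bytes
    delta_fetch_energy : per-byte energy saved per fetch (memory vs. scratch-pad)
    fetch_energy       : per-byte energy to read the trace from memory when copying
    write_energy       : per-byte energy to write the trace into the scratch-pad
    """
    benefit = profile_weight * size * delta_fetch_energy
    copy_cost = size * (fetch_energy + write_energy)
    return benefit - copy_cost

# Hypothetical inputs, in nJ per byte, picked so the arithmetic is easy to follow.
gain = energy_gain(profile_weight=100, size=64,
                   delta_fetch_energy=0.5, fetch_energy=1.0, write_energy=0.5)
print(gain)  # 3104.0
```

Here the benefit is 100 * 64 * 0.5 = 3200 nJ and a single copy costs 64 * 1.5 = 96 nJ, so the trace is worth placing as long as it is not recopied too often.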

SP Allocation: Placing Traces
[Figure: control flow graph with traces T1, T2, and T3, and an execution timeline for the case where T1 and T2 share scratch-pad space.]
• Execution sequence: T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3
• T1: initial copy, then two recopies
• T2: initial copy, then two recopies
Dynamic Copy Cost = # copies of T1 * CopyCost(T1) + # copies of T2 * CopyCost(T2)

Temporal Relationship Graph [Gloy et al., ’97]
[Figure: execution sequence T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3, with two copies of T1 and two copies of T2 marked, and a graph with nodes T1, T2, and T3.]
• Example edge weight: 2 * CopyCost(T1) + 2 * CopyCost(T2)
• Edge weights between two nodes denote the Dynamic Copy Cost
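One way to derive such edge weights from a profiled trace sequence is sketched below. It counts, for each pair of traces, how often each would have to be recopied if the two shared scratch-pad space. The helper names and the copy costs (arbitrary units) are assumptions for illustration, not the construction from Gloy et al. or from the authors.

```python
from itertools import combinations

def recopies_if_overlapped(sequence, a, b):
    """Recopies of a and b (beyond each one's first copy) if they shared
    scratch-pad space, given an execution sequence of trace names."""
    recopies = {a: 0, b: 0}
    seen = set()
    resident = None  # which of a/b currently occupies the shared space
    for t in sequence:
        if t not in (a, b) or t == resident:
            continue
        if t in seen:
            recopies[t] += 1  # it had been evicted by the other trace
        seen.add(t)
        resident = t
    return recopies

def build_trg(sequence, copy_cost):
    """Edge weight = dynamic copy cost of overlapping the two traces."""
    trg = {}
    for a, b in combinations(sorted(copy_cost), 2):
        r = recopies_if_overlapped(sequence, a, b)
        trg[(a, b)] = r[a] * copy_cost[a] + r[b] * copy_cost[b]
    return trg

seq = "T1 T1 T2 T2 T2 T1 T1 T2 T2 T2 T3 T1 T1 T2 T2 T2 T3".split()
copy_cost = {"T1": 2, "T2": 1, "T3": 1}  # hypothetical, arbitrary units
trg = build_trg(seq, copy_cost)
print(recopies_if_overlapped(seq, "T1", "T2"))  # {'T1': 2, 'T2': 2}
```

For the slide's sequence, the T1/T2 pair yields two recopies each, which corresponds to the 2 * CopyCost(T1) + 2 * CopyCost(T2) term.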

SP Allocation: Placing Traces
[Figure: four-step placement into a 96-byte scratch-pad, guided by the temporal relationship graph over T1, T2, and T3.]
• Energy gains: T1 → 3104 nJ, T2 → 15952 nJ, T3 → 752 nJ
• T2 is placed first, then T1
• Graph edge weights: 432 nJ, 144 nJ, 96 nJ
• Final placement: T2 and T3 share one region; T1 has its own region
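A greedy placement in the spirit of this example might look like the sketch below. It is not the authors' algorithm. The energy gains are the ones on the slide, but the function name, the size of T3, and the assignment of the three edge weights to specific trace pairs are assumptions, since the transcript does not preserve which edge carries which weight.

```python
def place_traces(gain, size, trg, capacity):
    """Greedy sketch: visit traces by decreasing energy gain. A trace gets
    its own region while space remains; otherwise it shares the region whose
    occupants have the lowest total TRG edge weight with it, but only if its
    gain exceeds that conflict cost."""
    regions = []  # each region: {"offset", "size", "traces"}
    used = 0
    for t in sorted(gain, key=gain.get, reverse=True):
        if used + size[t] <= capacity:
            regions.append({"offset": used, "size": size[t], "traces": [t]})
            used += size[t]
            continue
        # Regions large enough to hold t, paired with their conflict cost.
        options = [
            (sum(trg[frozenset((t, o))] for o in r["traces"]), i)
            for i, r in enumerate(regions) if r["size"] >= size[t]
        ]
        if options:
            cost, i = min(options)
            if gain[t] > cost:
                regions[i]["traces"].append(t)
    return regions

gain = {"T1": 3104, "T2": 15952, "T3": 752}   # nJ, from the slide
size = {"T1": 64, "T2": 32, "T3": 32}         # bytes; T3's size is assumed
trg = {                                       # nJ; pair assignment is assumed
    frozenset(("T1", "T2")): 432,
    frozenset(("T1", "T3")): 144,
    frozenset(("T2", "T3")): 96,
}
regions = place_traces(gain, size, trg, capacity=96)
print(regions)
```

Under these assumptions T2 and T1 each get their own region and T3 is overlapped with T2, the cheaper conflict, which agrees with the final placement shown on the slide.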

Copy Placement
• Initially, naively place copies at trace entry points
  • Guarantees correct but inefficient execution
• Reduce the copy overhead
  • Identify frequently executed copies
  • Iteratively hoist copies to less frequently executed blocks
  • Remove redundant copies
• Ensure that the hoists and removals are legal
  • Traces are present prior to execution

Copy Placement: Initial Placement
[Figure: control flow graph with copies at every trace entry point: C1-T1, C2-T1, C3-T1, C1-T2, C3-T2, C1-T3, C2-T3.]

Copy Placement: Redundant Copies
[Figure: the same graph with blocks annotated by the traces resident in the scratch-pad (T1, or T2 and T3), which identifies the redundant copies.]

Copy Placement: Hoisting
[Figure: after redundant copies are removed, C1-T1, C1-T2, and C1-T3 remain.]
Live ranges:
• T1: BB4, BB6, BB7
• T2: BB9, BB10
• T3: BB12
[Figure: live range of T2 before and after hoisting C1-T2 from near BB9 up toward BB1 and BB2. The hoist is legal.]
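The legality condition can be approximated as follows: a copy may be hoisted only if, on the blocks it would now cover, no trace sharing its scratch-pad region is live, so the copy never clobbers code that is still about to run. This is a simplified sketch, not the authors' analysis. The live ranges come from the slide, while `hoist_is_legal`, the overlap map, and the block paths are assumptions.

```python
def hoist_is_legal(trace, path_blocks, live_range, overlaps):
    """Return True if copying `trace` early, across `path_blocks`, cannot
    overwrite a conflicting trace that is live in any of those blocks."""
    crossed = set(path_blocks)
    for other in overlaps.get(trace, ()):
        if crossed & live_range[other]:
            return False  # the early copy would evict a live trace
    return True

# Live ranges as listed on the slide.
live_range = {
    "T1": {"BB4", "BB6", "BB7"},
    "T2": {"BB9", "BB10"},
    "T3": {"BB12"},
}
# Assumed from the placement example: T2 and T3 share a region.
overlaps = {"T2": {"T3"}, "T3": {"T2"}}

# Hypothetical paths: blocks between the new copy location and the trace entry.
t2_path = ["BB2", "BB3", "BB4", "BB5", "BB6", "BB7", "BB8"]
t3_path = ["BB9", "BB10", "BB11"]

t2_ok = hoist_is_legal("T2", t2_path, live_range, overlaps)
t3_ok = hoist_is_legal("T3", t3_path, live_range, overlaps)
print(t2_ok, t3_ok)  # True False
```

Hoisting the copy of T2 is allowed because T3 is live only in BB12, whereas hoisting the copy of T3 above BB9 and BB10 would overwrite T2 while it is still live.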

Experimental Setup
• Trimaran compiler framework
• Measured instruction fetch power
• Varied scratch-pad size from 32 bytes to 4 Kbytes
• Two configurations
  • WIMS microcontroller at the Univ. of Michigan
    • On-chip memory and scratch-pad
    • Static vs dynamic schemes
    • PowerMill
  • Conventional processor
    • Off-chip memory, on-chip scratch-pad vs on-chip I-cache
    • CACTI model
    • Scratch-pad vs I-cache
• DMA copying
  • 2 bytes per cycle, stalling

Energy Savings: Static vs Dynamic
[Figure: bar chart, WIMS energy savings with a 64-byte scratch-pad. Y-axis: % energy improvement (0 to 60). Series: dynamic, static. Benchmarks: fir, sha, epic, cjpeg, djpeg, unepic, blowfish, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, rawcaudio, rawdaudio, pgpdecode, pgpencode, gsmencode, gsmdecode, g721encode, g721decode, plus the average.]
• Average savings for dynamic: 28%
• Average savings for static: 17%

Effect of Varying Scratch-pad Size
[Figure: two plots for pegwitenc, % energy savings and % hit rate versus SP size (32 to 4096 bytes). Series: static energy, dynamic energy, static hit rate, dynamic hit rate.]

Scratch-pad Size For 95% Hit Rate
[Figure: bar chart of the scratch-pad size in bytes (0 to 9000) needed to reach a 95% hit rate. Series: static, dynamic. Benchmarks: fir, sha, epic, cjpeg, djpeg, unepic, blowfish, pegwitenc, pegwitdec, mpeg2enc, mpeg2dec, rawcaudio, rawdaudio, gsmencode, gsmdecode, pgpencode, pgpdecode, g721encode, g721decode, plus the average.]
• Dynamic is 2.5x better than static

Energy Savings: SP vs I-Cache
[Figure: bar chart, CACTI energy savings with a 64b scratch-pad/I-cache. Y-axis: % energy improvement (-60 to 120). Series: dynamic, static, I-cache. Benchmarks: fir, sha, epic, cjpeg, djpeg, unepic, blowfish, pegwitenc, pegwitdec, mpeg2enc, mpeg2dec, rawcaudio, rawdaudio, pgpencode, pgpdecode, gsmencode, gsmdecode, g721encode, g721decode, plus the average.]
• Average savings for dynamic: 48%
• Average savings for static: 25%
• Average savings for I-cache: 30%

Conclusions
• Compiler-directed dynamic placement in scratch-pad
  • Arbitrary control flow graph
  • Inter-procedural
  • Two phases: SP allocation & copy placement
• 28% savings for dynamic as compared to 16% for static for a 64-byte scratch-pad
• 41% savings for dynamic as compared to 31% for static for a 256-byte scratch-pad
• 2 to 10% stall cycles
• Within 0 to 11% of optimal, but scalable

For more information: http://cccp.eecs.umich.edu
Thank You!
