Synthesis of Customized Loop Caches for Core-Based Embedded Systems

Susan Cotterell and Frank Vahid* • Department of Computer Science and Engineering, University of California, Riverside • *Also with the Center for Embedded Computer Systems at UC Irvine • This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship

Introduction • Traditional vs. core-based design • Core-based design: opportunity to tune the microprocessor architecture to the program

Introduction • I-cache • Size • Associativity • Replacement policy • JPEG • Compression • Buses • Width • Bus invert/gray code [Figure: system block diagram with Mem, Processor, I$, D$, Bridge, JPEG, USB, CCDP, and P4 blocks]

Introduction • Memory access can consume 50% of an embedded microprocessor’s system power • Caches tend to be power hungry • M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99) • ARM920T: caches consume half of total power (Segars 01)

Introduction • Advantageous to focus on the instruction fetching subsystem [Figure: system block diagram with Mem, Processor, I$, D$, Bridge, JPEG, USB, CCDP, and P4 blocks]

Introduction • Techniques to reduce instruction fetch power • Program Compression • Compress only a subset of frequently used instructions (Benini 1999) • Compress procedures in a small cache (Kirovski 1997) • Lookup table based (Lekatsas 2000) • Bus Encoding • Increment (Benini 1997) • Bus-invert (Stan 1995) • Binary/gray code (Mehta 1996)

Introduction • Techniques to reduce instruction fetch power (cont.) • Efficient Cache Design • Small buffers: victim, non-temporal, speculative, and penalty to reduce miss rate (Bahar 1998) • Memory array partitioning and variation in cache sizes (Ko 1995) • Tiny Caches • Filter cache (Kin/Gupta/Mangione-Smith 1997) • Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999) • Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)

Cache Architectures – Filter Cache • Small L0 direct-mapped cache • Utilizes standard tag comparison and miss logic • Has low dynamic power • Short internal bitlines • Close to the microprocessor • Performance penalty of 21% due to high miss rate (Kin 1997) [Figure: Processor, filter cache (L0), L1 memory]

Cache Architectures – Dynamically Loaded Loop Cache • Small tagless loop cache • Alternative location to fetch instructions • Dynamically fills the loop cache • Triggered by any short backwards branch (sbb) instruction • Flexible variation • Allows loops larger than the loop cache to be partially stored • Example loop: ... add r1,2 ... sbb -5 • Iteration 1: detect sbb instruction • Iteration 2: fill loop cache • Iteration 3: fetch from loop cache [Figure: Processor fetching through a mux from either L1 memory or the dynamic loop cache, shown once per iteration]

Cache Architectures – Dynamically Loaded Loop Cache (cont.) • Limitations • Does not support loops containing changes of control flow (cofs) • cofs terminate loop cache filling and fetching • cofs include commonly found if-then-else statements • Example loop: ... add r1,2 bne r1, r2, 3 ... sbb -5 • Iteration 1: detect sbb instruction • Iteration 2: fill loop cache, terminate at cof • Iteration 3: fill loop cache, terminate at cof [Figure: Processor fetching through a mux from either L1 memory or the dynamic loop cache, shown once per iteration]
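
The iteration-by-iteration behavior on the two dynamically loaded loop cache slides can be sketched as a small state machine. This is a minimal illustration, not the authors' lcsim tool: the `Instr` record, the `loop_trace` helper, and the state names are hypothetical, every loop is assumed to fit in the loop cache, and a cof is modelled only by its effect on the loop cache (it does not skip any instructions in the trace).

```python
from dataclasses import dataclass

IDLE, FILL, ACTIVE = "idle", "fill", "active"

@dataclass
class Instr:
    pc: int
    is_sbb: bool = False   # short backwards branch that closes the loop
    taken: bool = False    # only meaningful for the sbb
    is_cof: bool = False   # any other change of flow inside the loop body

def loop_trace(size, iterations, cof_index=None):
    """Hypothetical helper: trace of one loop call, sbb as the last instruction."""
    trace = []
    for it in range(iterations):
        for i in range(size):
            last = i == size - 1
            trace.append(Instr(pc=i, is_sbb=last,
                               taken=last and it < iterations - 1,
                               is_cof=(i == cof_index)))
    return trace

def simulate(trace):
    """Count L1 fetches, loop cache fills, and loop cache fetches."""
    state = IDLE
    counts = {"l1": 0, "fill": 0, "lc": 0}
    for ins in trace:
        # Where does this instruction come from?
        if state == ACTIVE:
            counts["lc"] += 1
        else:
            counts["l1"] += 1
            if state == FILL:
                counts["fill"] += 1   # copied into the loop cache while fetched
        # State update after the instruction executes.
        if ins.is_sbb:
            if not ins.taken:
                state = IDLE          # loop exit
            elif state == IDLE:
                state = FILL          # iteration 1: sbb detected
            elif state == FILL:
                state = ACTIVE        # iteration 2 filled the whole loop
        elif ins.is_cof:
            state = IDLE              # cof terminates filling and fetching
    return counts

print(simulate(loop_trace(4, 5)))               # straight-line loop body
print(simulate(loop_trace(4, 5, cof_index=1)))  # loop body with an internal cof
```

Under these assumptions, the loop with an internal cof never reaches the fetch state: it refills up to the cof on every iteration after the first, which is the limitation the slide describes.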

Cache Architectures – Preloaded Loop Cache • Small tagless loop cache • Alternative location to fetch instructions • Loop cache filled at compile time and remains fixed • Supports loops with cof • Fetch triggered by any short backwards branch • Start address variation • Fetch begins on first loop iteration • Example loop: ... add r1,2 bne r1, r2, 3 ... sbb -5 • Iteration 1: detect sbb instruction • Iteration 2: check to see if loop preloaded, if so fetch from cache [Figure: Processor fetching through a mux from either L1 memory or the preloaded loop cache, shown once per iteration]
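
The difference between the sbb-triggered design and the start address variation can be made concrete with a rough count of loop cache fetches. This is a sketch based only on a reading of the slide above, not the authors' equations: the function name and parameters are hypothetical, and it assumes every instruction of the loop body executes on every iteration (with cofs inside the loop the real count would be lower).

```python
def preloaded_lc_fetches(loop_size, iterations, calls, preloaded, trigger="sbb"):
    """Rough count of instructions fetched from a preloaded loop cache.

    trigger="sbb":           iteration 1 runs from L1 until the sbb is seen,
                             later iterations come from the loop cache.
    trigger="start_address": fetching begins on the first iteration.
    """
    if not preloaded:
        return 0                      # loop was not selected at compile time
    if trigger == "sbb":
        cached_iterations = max(iterations - 1, 0)
    elif trigger == "start_address":
        cached_iterations = iterations
    else:
        raise ValueError(f"unknown trigger: {trigger}")
    return calls * cached_iterations * loop_size

# A 10-instruction loop, 5 iterations per call, called 3 times.
print(preloaded_lc_fetches(10, 5, 3, preloaded=True))
print(preloaded_lc_fetches(10, 5, 3, preloaded=True, trigger="start_address"))
```

The gap between the two variants is one iteration per loop call, so it matters most for loops that are called often but iterate only a few times.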

Traditional Design • Traditional pre-fabricated IC • Typically optimized for best average case • Intended to run well across a variety of programs • A benchmark suite is used to determine which configuration to use [Figure: Processor, mux, L1 memory, and an undetermined ("?") cache]

Core Based Design • The application is known • Opportunity to tune the microprocessor architecture • Is it worth tuning the architecture to the application, or is the average case good enough?

Evaluation Framework – Candidate Cache Configurations

Evaluation Framework – Motorola's Powerstone Benchmarks

Tool Chain - Simulation [Figure: simulation tool flow. Elements: program instr trace, LOOAN, loop stats, packed loops & explr script, lcsim (many configs.), loop cache stats, tech info, lc power calc, loop cache power]

Results - Averages • Configuration 11 (flexible, 32-entry, dynamically loaded loop cache) • Does well on average: 25% instruction fetch energy savings • Loop cache selection on a per-application basis • Yields an additional 70% instruction fetch energy savings

Tool Chain - Simulation [Figure: simulation tool flow with lcsim repeated over the program instr trace for each of the many configs. Elements: program instr trace, LOOAN, loop stats, packed loops & explr script, lcsim, loop cache stats, tech info, lc power calc, loop cache power]

Tool Chain - Estimation • What kind of statistics? Loop and function call statistics • How can we use this information to model the various loop caches? Equations such as f = s*b [Figure: estimation tool flow. Elements: program instr trace, LOOAN, loop stats, func calls, packed loops, estimator (fast), loop cache stats, tech info, lc power calc, loop cache power]

LOOAN • How big are the loops? • Loop hierarchy, function calls • Once the loop is called, how many times does it iterate? • How many times is the loop called?

Estimation – Original Dynamically Loaded Loop Cache • How many times do we fill the loop cache?
• Loop with no internal cof (mov r5,r4 ... add r1,2 sub r1, r2, 3 ... sbb -5) • iter 1: detect sbb • iter 2: fill
    if (loop size <= lc size && loop iterations >= 2)
        fills = # times loop called * loop size
• Loop with an internal cof (mov r5, r4 ... add r1,2 sub r1, r2, 3 bne r1, r2, 3 ... sbb -5) • iter 1: detect sbb • iter 2: fill, abort at cof • iter 3: fill, abort at cof
    if (loop size <= lc size && loop iterations >= 2)
        if (cof != sbb)
            fills = # times loop called * (loop iterations – 1) * offset to 1st cof
        else
            fills = # times loop called * loop size
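
The fill equations on the slide above can be written as one small function. This is a minimal sketch: the function and parameter names are hypothetical, `first_cof_offset=None` stands for "the only cof is the sbb itself", and returning 0 when the slide's condition does not hold is an assumption, since the slide does not state that case.

```python
def estimate_fills(loop_size, lc_size, iterations, calls, first_cof_offset=None):
    """Estimated number of instructions written into the original
    dynamically loaded loop cache for one loop.

    loop_size        -- instructions in the loop body
    lc_size          -- loop cache capacity in instructions
    iterations       -- iterations per loop call
    calls            -- number of times the loop is called
    first_cof_offset -- instructions up to and including the first internal
                        cof, or None if the sbb is the only cof
    """
    if loop_size > lc_size or iterations < 2:
        return 0  # assumption: nothing is filled outside the slide's condition
    if first_cof_offset is None:
        # Filled once per call, during iteration 2.
        return calls * loop_size
    # Filling restarts on every iteration after the first and aborts at the cof.
    return calls * (iterations - 1) * first_cof_offset

print(estimate_fills(loop_size=4, lc_size=32, iterations=5, calls=1))
print(estimate_fills(loop_size=4, lc_size=32, iterations=5, calls=1,
                     first_cof_offset=2))
```

With an internal cof, the fill count grows with the iteration count, so a long-running loop with an if-then-else pays fill energy on every iteration without ever fetching from the loop cache.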

Estimation - Original Dynamically Loaded Loop Cache • How many times do we fetch from the loop cache?
• Loop with no internal cof (mov r5,r4 ... add r1,2 sub r1, r2, 3 ... sbb -5) • iter 1: detect sbb • iter 2: fill • iter 3: fetch from loop cache
    if (loop size <= lc size && loop iterations >= 3)
        fetch = # times loop called * (loop iterations – 2) * loop size
• Loop with an internal cof (mov r5, r4 ... add r1,2 sub r1, r2, 3 bne r1, r2, 3 ... sbb -5) • iter 1: detect sbb • iter 2: fill, abort at cof • iter 3: fill, abort at cof (no fetches from the loop cache)
    if (loop size <= lc size && loop iterations >= 3)
        if (cof == sbb)
            fetch = # times loop called * (loop iterations – 2) * loop size
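
The fetch equations on the slide above reduce to a single expression guarded by three conditions. This is a minimal sketch with hypothetical names; treating every case outside the slide's conditions as zero fetches is an assumption consistent with the "fill, abort at cof" annotation, not something the slide states as an equation.

```python
def estimate_lc_fetches(loop_size, lc_size, iterations, calls,
                        has_internal_cof=False):
    """Estimated number of instructions fetched from the original
    dynamically loaded loop cache for one loop.

    Iteration 1 detects the sbb and iteration 2 fills the cache, so only
    iterations 3 and later can be served from the loop cache.
    """
    if loop_size > lc_size:
        return 0          # loop does not fit
    if iterations < 3:
        return 0          # loop exits before any cached iteration
    if has_internal_cof:
        return 0          # filling aborts at the cof every iteration
    return calls * (iterations - 2) * loop_size

# 4-instruction loop, 5 iterations, called once: iterations 3..5 are cached.
print(estimate_lc_fetches(4, 32, 5, 1))
print(estimate_lc_fetches(4, 32, 5, 1, has_internal_cof=True))
```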

Estimation • Loop Cache Equations • Each loop cache type is characterized by approximately 5 unique equations • 20 different equations in all

Estimation Results - Accuracy • Ranges from 0-16% difference • Average 2% difference

Estimation Results - Fidelity • Does the estimation method preserve the fidelity? • summin shows the worst case – 10% • On average <1% difference in savings between loop cache chosen via simulation vs. loop cache chosen via estimation
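
One way to read the fidelity question is: if the estimator picks a different configuration than simulation would, how much savings is lost? The sketch below illustrates that check. The function name, configuration labels, and percentages are all made up for illustration and are not results from the talk.

```python
def fidelity_loss(sim_savings, est_savings):
    """Savings lost by trusting the estimator's choice.

    Both arguments map a configuration name to its instruction fetch energy
    savings in percent. The loss is measured with the simulated savings,
    since those are treated as the reference.
    """
    best_by_sim = max(sim_savings, key=sim_savings.get)
    best_by_est = max(est_savings, key=est_savings.get)
    return sim_savings[best_by_sim] - sim_savings[best_by_est]

# Hypothetical numbers for one benchmark and three candidate configurations.
simulated = {"cfg_a": 40, "cfg_b": 55, "cfg_c": 50}
estimated = {"cfg_a": 42, "cfg_b": 51, "cfg_c": 53}

# The estimator picks cfg_c while simulation picks cfg_b.
print(fidelity_loss(simulated, estimated))
```

A loss of zero means the estimator ranked the best configuration correctly even if its absolute numbers were off, which is why fidelity is reported separately from accuracy.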

Time Comparison • Simulation was the bottleneck • Chart label: "Required for both methods" • Biggest example took only 30 minutes, and it is a small program • Started looking at MediaBench: simulation takes hours

Conclusion and Future Work • Important to tune the architecture to the program • Simulation methods are slow • Presented an equation-based methodology that is faster than the simulation-based methodology previously used • Accuracy/fidelity preserved • Future Work • Expand types of tiny caches • Look at more benchmarks • MediaBench: several hours (up to 48 hours) for our simulations • Expand hierarchy search

Thank you for your attention. Questions?
