Frank Vahid from UC Riverside introduces configurable platforms that adjust to executing applications for improved speed and energy efficiency. Learn about dynamic hardware/software partitioning and cache design dilemmas. Explore how dynamic partitioning tools are revolutionizing FPGA implementation.
Self-Improving Configurable IC Platforms

Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
Co-PI: Walid Najjar, Professor, CS&E, UCR
Goal: Platform Self-Tunes to the Executing Application
• Download a standard binary
• The platform adjusts to the executing application
• Result: better speed and energy
• Why and how?
Platforms
• Pre-designed programmable platforms reduce NRE cost, time-to-market, and risk
  • The platform designer amortizes design cost over large volumes
  • Modern IC costs are feasible mostly at very high volumes
• Many (if not most) will include an FPGA
  • Today: Triscend, Altera, Xilinx, Atmel
  • More sure to come, as FPGA vendors license to SoC makers
• Sample platform: processor, cache, memory, FPGA, peripherals
Hardware/Software Partitioning Improves Speed and Energy
• Critical loops move from the processor to the FPGA; the processor idles while the FPGA is active
• But partitioning requires a CAD tool
  • O.K. in some flows
  • In mainstream software flows, a hw/sw partitioner is hard to integrate with the standard software tools
Idea: Perform Partitioning Dynamically (and hence Transparently)
• Add components on-chip to:
  • Profile the running binary
  • Decompile frequent loops
  • Optimize
  • Synthesize
  • Place and route onto the FPGA
  • Update the software to call the FPGA
• Transparent: no impact on the tool flow
• Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies
• But how can you profile, decompile, optimize, synthesize, and place & route, on-chip?
• On-chip organization: processor, L1 cache, memory, and FPGA, plus a dynamic partitioning module (profiler, explorer, decompiler/optimizer, synthesis, place & route)
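The on-chip flow above can be sketched as a pipeline of stages. This is a minimal, purely illustrative model: the function names, data structures, and addresses are invented, not the actual module interfaces from this work.

```python
# Hypothetical sketch of the on-chip dynamic partitioning flow:
# profile -> decompile -> synthesize/place&route -> patch the binary.
# All names, dict layouts, and addresses below are invented.

def profile(binary):
    # Identify the most frequently executed loop (stubbed here).
    return {"loop_start": 0x8000, "loop_body": ["ld", "add", "st", "bne"]}

def decompile(loop):
    # Recover a dataflow graph (DAG) from the loop's instructions.
    return {"dag": loop["loop_body"], "entry": loop["loop_start"]}

def synthesize_and_route(dag):
    # Map the DAG onto the on-chip FPGA fabric; emit a bitfile.
    return {"bitfile": b"\x01\x02", "fpga_entry": 0xF000}

def update_binary(binary, loop, hw):
    # Patch the software so the loop entry branches to the FPGA version.
    patched = dict(binary)
    patched[loop["loop_start"]] = ("call_fpga", hw["fpga_entry"])
    return patched

binary = {0x8000: ("loop", None)}
loop = profile(binary)
hw = synthesize_and_route(decompile(loop))
patched = update_binary(binary, loop, hw)
print(patched[0x8000])  # the loop entry now dispatches to hardware
```

The point of the sketch is the transparency claim: every stage consumes and produces ordinary binaries, so the standard software tool flow never sees the partitioner.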
Dynamic Partitioning Requires Lean Tools
• How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations?
• Key: our tools only need to be good enough to speed up critical loops
  • Most execution time is spent in small loops (e.g., MediaBench, NetBench, EEMBC)
• Created ultra-lean versions of the tools
  • Quality not necessarily as good, but good enough
  • Runs on a 60 MHz ARM7
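The "most time is spent in small loops" observation is what makes a lean on-chip profiler plausible: counting taken backward branches is enough to find the hot loop, since a loop's backward branch fires once per iteration. A toy sketch, with an invented trace format:

```python
# Minimal frequent-loop profiler sketch: count taken backward branches.
# A branch whose target precedes its own PC closes a loop, so its
# target address identifies the loop head. The trace is synthetic.
from collections import Counter

# (pc, branch_target) pairs; one hot loop at 0x100, one cold at 0x200.
trace = [(0x104, 0x100)] * 95 + [(0x204, 0x200)] * 5

loop_counts = Counter(tgt for pc, tgt in trace if tgt < pc)
hottest, count = loop_counts.most_common(1)[0]
print(hex(hottest), count / len(trace))  # 0x100 takes 95% of iterations
```

A hardware profiler can implement the same idea with a small cache of counters indexed by branch target, rather than a full trace.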
Dynamic Hw/Sw Partitioning Tool Chain
• Binary → Loop Profiler → small, frequent loops
• Loop Decompilation → DMA Configuration → Synthesis → Tech. Mapping → Place & Route → Bitfile Creation → hardware
• Binary Modification → updated binary
• We've developed efficient profiler hardware
• The architecture is targeted for loop speedup, allowing simple place & route
• We're continuing to extend these tools to handle more benchmarks
Dynamic Hw/Sw Partitioning Results
• Powerstone, NetBench, and EEMBC examples, most frequent loop only
• Average speedup very close to the ideal speedup of 2.4
  • Not much left on the table in these examples
• Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools
• ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective
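The "ideal speedup" bound for moving only the hottest loop to hardware follows Amdahl's law: if a fraction f of execution time is in the loop and hardware makes it s times faster, overall speedup is 1 / ((1 - f) + f / s). The numbers below are illustrative, not measured values from these benchmarks:

```python
# Amdahl's-law view of single-loop speedup. Illustrative numbers only:
# a loop taking 60% of execution time bounds overall speedup at 2.5x,
# no matter how fast the FPGA implementation of that loop is.

def overall_speedup(f, s):
    # f: fraction of time in the loop; s: loop speedup in hardware.
    return 1.0 / ((1.0 - f) + f / s)

print(round(overall_speedup(0.6, 10.0), 2))   # finite loop speedup: 2.17
print(round(overall_speedup(0.6, 1e12), 2))   # limit 1 / (1 - f) = 2.5
```

This is why an average speedup near the ideal means "not much left on the table": the residual gap is the non-loop code, which partitioning cannot touch.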
Configurable Cache: Why?
• ARM920T: caches consume half of total processor system power (Segars '01)
• M*CORE: unified cache consumes half of total processor system power (Lee/Moyer/Arends '99)
Best Cache for Embedded Systems?
• Diversity of associativity, line size, and total size across programs
Cache Design Dilemmas
• Associativity
  • Low: low power, good performance for many programs
  • High: better performance on more programs
• Total size
  • Small: lower power if the working set is small (and less area)
  • Big: better performance/power if the working set is large
• Line size
  • Small: better when spatial locality is poor
  • Big: better when spatial locality is good
• Most caches are a compromise across many programs: they work best on average
• But embedded systems run one or a few programs; we want the best cache for that one program
Solution to the Cache Design Dilemma
• Configurable cache: design one physical cache that can be reconfigured
• 1, 2, or 4 ways
  • Way concatenation: new technique, ISCA'03 (Zhang/Vahid/Najjar)
  • Four 2 KB ways, plus concatenation logic
• 8 KB, 4 KB, or 2 KB total size
  • Way shutdown, ISCA'03
  • Gates Vdd; saves both dynamic and static power, with some performance overhead (5%)
• 16, 32, or 64 byte line size
  • Variable line fetch size, ISVLSI'03
  • Physical 16-byte line; one, two, or four physical line fetches
• Note: this is a single physical cache, not a synthesizable core
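Way concatenation can be described behaviorally: the same four 2 KB banks serve as a 4-way, 2-way, or 1-way cache of the same total size, with the index widening as ways are concatenated. This is a behavioral model of the idea, not the ISCA'03 circuit:

```python
# Behavioral sketch of way concatenation (invented model, not the
# actual ISCA'03 circuit). Total size stays 8 KB; lowering the way
# count widens the index so extra index bits select among banks.

TOTAL = 8192   # bytes
LINE = 32      # bytes per line

def set_index(addr, ways):
    # With w ways, each way holds TOTAL/w bytes, so there are
    # (TOTAL // ways) // LINE sets; fewer ways -> more sets.
    sets = (TOTAL // ways) // LINE
    return (addr // LINE) % sets

addr = 0x1FE0
print(set_index(addr, 4))  # 63:  64 sets per way, 6 index bits
print(set_index(addr, 2))  # 127: 128 sets, 7 index bits
print(set_index(addr, 1))  # 255: 256 sets, 8 index bits
```

In the real design this widening is done by letting configuration registers gate address bits a11 and a12 into the way-enable logic, which is why the area and delay overhead is negligible.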
Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
• Address split: tag (a31..a13), index (a12..a5), line offset (a4..a0)
• A small configuration circuit (registers reg0, reg1 combined with index bits a12 and a11) produces way-enable signals c0-c3 that concatenate the four 2 KB tag/data ways into 4, 2, or 1 effective ways
• Trivial area overhead, no performance overhead: the configuration logic sits off the critical path through the tag compare, column mux, sense amps, and data output
Configurable Cache Design Metrics
• We computed power, performance, energy, and size using:
  • CACTI models
  • Our own layout (0.13 µm TSMC CMOS) and Cadence tools
• Energy accounting considered cache, memory, bus, and CPU stall energy
• Powerstone, MediaBench, and SPEC benchmarks
• Used SimpleScalar for simulations
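The energy accounting above can be written as a simple sum: per-access cache energy plus, for each miss, memory/bus traffic energy and CPU stall energy. The per-event numbers below are placeholders for illustration, not CACTI or layout data:

```python
# Back-of-the-envelope version of the slide's energy accounting.
# Per-event energies (arbitrary units) are placeholders, not CACTI
# results; the structure, not the numbers, is the point.

def total_energy(accesses, misses, e_cache, e_mem_bus, e_stall):
    return (accesses * e_cache      # cache access energy
            + misses * e_mem_bus    # memory + bus traffic per miss
            + misses * e_stall)     # CPU stall energy per miss

# A low-power 1-way config with more misses vs. a pricier 4-way:
e_1way = total_energy(1_000_000, 30_000, 0.5, 20.0, 5.0)
e_4way = total_energy(1_000_000, 10_000, 0.9, 20.0, 5.0)
print(e_1way, e_4way)  # which wins depends on the program's miss rate
```

This is exactly why the best configuration is program-dependent: the per-access and per-miss terms pull in opposite directions as associativity grows.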
Configurable Cache Energy Benefits
• 40%-50% energy savings on average
  • Compared to conventional 4-way and 1-way caches with 32-byte line size
• And best for every example (remember, a conventional cache is a compromise)
Future Work
• Dynamic cache tuning
• More advanced dynamic partitioning
  • Automatic frequent-loop detection
  • On-chip exploration tool
  • Better decompilation and synthesis
  • Better FPGA fabric, place and route
  • Approach: continue extending the tools to support more benchmarks
• Extend to platforms with multiple processors
  • Scales well: processors can share the on-chip partitioning tools
Conclusions
• Self-improving configurable ICs
  • Provide excellent speed and energy improvements
  • Require no modification to existing software flows, and can thus be widely adopted
• We've shown the idea is practical
  • Lean on-chip tools are possible
  • Now we need to make them even better
• Extensive research into algorithms, designs, and architectures is needed