A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power

Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported by the National Science Foundation and NEC.
International Symposium on Low Power Electronics and Design, 2001

Introduction
• Mass-produced microprocessor ICs prevail in embedded systems
  • Cheap
    • From amortization and high yields
  • Small and low power
    • From optimization and use of new technologies
  • Available immediately
• Typically run one program forever
• QUESTION:
  • Can we “tune” a mass-produced microprocessor to its one program to reduce power?
[Figure: sample mass-produced IC containing a processor, Dmem., Pmem., and peripherals; annual production: 10 million units; cost per unit: $2]

Introduction
• Use configurable (tunable) components and add a tuner circuit
• Make use of abundant transistors
  • Previously, silicon too scarce
  • Today, “transistor budgets have gone ballistic” [Microprocessor Report, 1998]
• Software analogy
  • Previously, program memory was scarce
  • Today, we find a flight simulator hidden in Excel’97
[Figure: Moore’s Law, 2x / 18 months, 1981 to 2002; leading-edge chip in 1981: 10,000 transistors; leading-edge chip in 2002: 150,000,000 transistors]
[Figure: IC with Processor, Dmem., Pmem., Periph., and an added Tuner]

Introduction
• We introduce:
  • Architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
  • Uses self-profiling circuitry and designer-activated self-optimization mode
• To illustrate, we introduce:
  • A tunable component: Loop Table
    • Similar to loop caches, but differs in how and when contents are updated
  • Other tunable components are possible

Problem Description
• Goal:
  • Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
• Constraints
  • Exact instruction set compatibility
    • Avoid changing tool chain
  • Preserve cycle-by-cycle behavior
• These constraints are more stringent than in most previous work

Related Work
• Application-specific instruction-set processors
  • Introduce new instructions for frequent code
  • Pre-fabrication: [Fischer99], [Tensillica00]
  • Post-fabrication: [Kucukcakar99], for mass-produced ICs
  • Modifies instruction set and tool chain
• Code morphing
  • Crusoe: cache frequent code’s translation
  • Helps only if performing dynamic binary translation
  • Changes cycle-by-cycle behavior
• Code compression
  • Compress frequent code [Ishihara00]
  • Modifies tool chain

Related Work
• Cache frequent small loops
  • Reduces memory/bus power
• Filter cache [Kin97]
  • Small L0 cache
  • Many misses (extra cycles)
• Compiler-assisted loop cache [Bellas99]
  • Profiler/compiler marks frequent loops for filter cache placement
  • Modifies tool chain
• Transparent loop cache [Lee99]
  • Fills loop cache only when a short backward branch is detected
  • No tag comparisons, hence greater efficiency
• Our approach
  • Moves profiler onto the chip, and can be more selective in filling the loop cache
[Figure: PID controller example, in which most execution time is spent in two small loops; Proc. with Pmem alone versus Proc. with Pmem and Loop table]

Architecture Overview
• Standard microcontroller
  • ROM access consumes much power
• Added
  • Self-Profiling Controller and Loop Count Table for profiling
  • Loop Table to store common loops
  • Bypass Controller to switch to Loop Table
[Figure: block diagram with RAM, ROM, Configuration Memory (~10’s of bytes), Microprocessor (Datapath, Controller), Loop Table, Bypass Controller, Self-Profiling Controller, and Loop Count Table]

Methodology Overview
• Self-optimizing microcontroller
  • Post-fabrication (hence mass-produced)
  • In-system
• Tuning under designer control
  • Not by end user, hence stable and consistent end-use platform
[Figure: timeline of (Designer: pre-fabrication), Designer: post-fabrication, and User, with self-optimization mode activation marked]

Methodology Overview
• Download application to microcontroller program memory
• Reset microcontroller, causing (optimized) application execution in normal mode
• Activate self-optimizing mode, causing update of configuration memory
• Upload configuration memory for downloading to other microcontrollers

Self-optimizing mode
• Initializing
  • Activated by extra pin or existing pin combination
  • Traverse memory, detect loops, add addresses to Loop Count Table
• Profiling
  • Execute, update loop counts
  • Requires fast increments
    • We use fully associative memory
    • Hardware hash table possible
• Configuring
  • Store most frequent loop addresses at bottom of program memory, set flag
[Figure: flow of Download program, Normal mode, Self-optimizing mode, Upload configuration; ROM, Self-Profiling Controller, and Loop Count Table (columns: loop address, count) with example entries for loop addresses 100 and 200]
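The three phases of self-optimizing mode can be sketched in software. This is a minimal illustration, not the authors' hardware design: the function names, the `(branch_addr, target_addr)` input format, the `MAX_LOOP_BYTES` limit, and the example addresses and counts are all assumptions made for the sketch. A plain dictionary stands in for the Loop Count Table that the slides implement as a fully associative memory.

```python
MAX_LOOP_BYTES = 32  # hypothetical loop table capacity, not from the slides


def detect_loops(branches, max_len=MAX_LOOP_BYTES):
    """Initializing: keep short backward branches as candidate loops.

    `branches` is an iterable of (branch_addr, target_addr) pairs.
    Returns {loop_start: loop_end}.
    """
    loops = {}
    for addr, target in branches:
        if target <= addr and addr - target < max_len:
            loops[target] = addr
    return loops


def profile(trace, loops):
    """Profiling: count how often execution reaches each loop start."""
    counts = {start: 0 for start in loops}
    for pc in trace:
        if pc in counts:
            counts[pc] += 1
    return counts


def configure(counts, n_entries=1):
    """Configuring: pick the most frequently executed loop start addresses."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [addr for addr, _ in ranked[:n_entries]]


if __name__ == "__main__":
    # Illustrative program: two short loops and one long backward jump.
    loops = detect_loops([(110, 100), (215, 200), (300, 50)])
    counts = profile([100] * 5 + [200] * 900 + [300], loops)
    print(configure(counts))
```

In this sketch the long backward jump from 300 to 50 is rejected because it would not fit the assumed table size, and the address returned by `configure` is what would be written to configuration memory.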

Normal mode
• Reset
  • Read loop addresses (if any) into registers (LARs)
  • Read corresponding loops into loop table
  • Set flag in bypass controller
• Execute: check if flag set and address match
  • No: fetch from ROM
  • Yes: begin fetching from loop table
• No tag comparisons, no misses
• Pre-computed extra bits quickly detect table exit
[Figure: flow of Download program, Normal mode, Self-optimizing mode, Upload configuration; RAM, Datapath, Controller, ROM, Loop Table, and Bypass Controller, with a loop at address 200 held in both ROM and the Loop Table and LAR = 200]
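The fetch decision made by the bypass controller can be modeled as below. This is a behavioral sketch under stated assumptions: the class name `BypassFetch`, the single-loop configuration given as a `(start, end)` pair, and the dictionary-based ROM are hypothetical. The slides say table exit is detected with pre-computed extra bits; the range check here is only a software stand-in for that mechanism.

```python
class BypassFetch:
    """Behavioral model of instruction fetch with a one-loop loop table."""

    def __init__(self, rom, config=None):
        self.rom = rom            # address -> instruction
        self.rom_reads = 0
        self.table_reads = 0
        self.in_table = False
        self.flag = False
        self.lar = None           # loop address register
        self.end = None
        self.loop_table = {}
        if config is not None:
            # Reset: load LAR, copy the loop into the loop table, set flag.
            self.lar, self.end = config
            self.loop_table = {a: rom[a] for a in range(self.lar, self.end + 1)}
            self.flag = True

    def fetch(self, pc):
        if self.flag and not self.in_table and pc == self.lar:
            self.in_table = True      # address match: switch to loop table
        elif self.in_table and not (self.lar <= pc <= self.end):
            self.in_table = False     # left the loop: switch back to ROM
        if self.in_table:
            self.table_reads += 1
            return self.loop_table[pc]
        self.rom_reads += 1
        return self.rom[pc]


if __name__ == "__main__":
    rom = {a: "op%d" % a for a in range(196, 206)}
    cpu = BypassFetch(rom, config=(200, 203))
    for pc in [198, 199, 200, 201, 202, 203, 200, 201, 202, 203, 204]:
        cpu.fetch(pc)
    print(cpu.rom_reads, cpu.table_reads)
```

Every fetch returns the same instruction the ROM would have supplied, which mirrors the constraint that instruction-set and cycle-by-cycle behavior stay unchanged; only the source of the fetch differs.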

Results: power
• Savings
  • 34% total power savings after self-optimization
  • Dependent on technology
• Power overhead
  • Negligible when self-optimization idle
  • Slight increase (5%) during self-optimization
• Setup
  • Synopsys synthesis, simulation, and power analysis
  • 8051 synthesizable VHDL model at UCR (www.cs.ucr.edu/~dalton)
[Figure: power results for three examples; Ex1: checksum, Ex2: gcd, Ex3: matrix multiply]
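To see why serving loop instructions from a small table reduces power, a first-order model of instruction-fetch energy can help. The model below is an assumption introduced for illustration: the function name, its parameters, and the example values are hypothetical, and it covers fetch energy only. It does not reproduce the 34% total-power figure above, which comes from synthesis-based power analysis.

```python
def fetch_energy_savings(loop_fraction, e_rom, e_table):
    """Fractional reduction in average energy per instruction fetch.

    loop_fraction: share of fetches served by the loop table (0.0 to 1.0)
    e_rom:         energy per ROM fetch (any unit)
    e_table:       energy per loop table fetch (same unit)
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    baseline = e_rom
    optimized = (1.0 - loop_fraction) * e_rom + loop_fraction * e_table
    return (baseline - optimized) / baseline


if __name__ == "__main__":
    # Hypothetical values: half of all fetches hit the loop table,
    # and a table fetch costs one tenth of a ROM fetch.
    print(fetch_energy_savings(0.5, e_rom=10.0, e_table=1.0))
```

The model makes the technology dependence visible: the savings scale with both the fraction of time spent in the stored loop and the ratio of ROM to loop table access energy.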

Results: size (in cells)
• Big increase, but:
  • 8051 version was small
    • Others much bigger
    • Smaller % overhead
  • Transistors becoming cheaper
  • Product-oriented ICs: loop table and controller, no Self-Profiler or Loop Count Table
    • Transfer configuration from prototype-oriented part to new product-oriented parts
    • Supported by existing upload/download tools
  • We are working on shrinking the Loop Count Table logic

Conclusions
• Mass-produced ICs give big advantages
• Transistor abundance provides new opportunities
• We introduced:
  • A self-optimization methodology and architecture
  • A loop table as an example tunable component
• These items yielded:
  • Power savings by reducing ROM access
    • 34% savings for 8051 microcontroller for target technology
  • No change in instruction set, tools, or performance
• Future work includes:
  • Reducing size overhead while maintaining accuracy
    • Trading off size with accuracy
  • Extending loop table for multiple loops, subroutines, etc.
  • Incorporating into 32-bit processor environment (LEON Sparc)
  • Investigating other tunable components
    • On-chip FPGA, configurable cache, etc.
