A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power
Frank Vahid* and Ann Gordon-Ross
Dept. of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
International Symposium on Low Power Electronics and Design, 2001
This work was supported by the National Science Foundation and NEC.
Introduction
• Mass-produced microprocessor ICs prevail in embedded systems
  • Cheap
    • From amortization and high yields
  • Small and low power
    • From optimization and use of new technologies
  • Available immediately
• Typically run one program forever
• QUESTION: Can we “tune” a mass-produced microprocessor to its one program to reduce power?
[Figure: microcontroller ICs, each containing Processor, Dmem., Pmem., and Periph. blocks. Annual production: 10 million units. Cost per unit: $2]
Introduction
[Figure: Moore’s Law, 2x / 18 months, timeline from 1981 to 2002. Leading edge chip in 1981: 10,000 transistors. Leading edge chip in 2002: 150,000,000 transistors]
• Answer: Yes, by using configurable (tunable) components and adding a tuner circuit
[Figure: Processor, Dmem., Pmem., and Periph. blocks with an added Tuner block]
• Non-obvious use of extra transistors
  • Previously unheard of: silicon too scarce
  • Becoming more common, e.g., self-test circuitry
  • “Transistor budgets have gone ballistic” [Microprocessor Report, 1998]
• Analogous situation in software
  • Yesterday, program memory extremely scarce
  • Today, we find a flight simulator hidden in Excel’97
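As a rough check on the endpoints of the Moore’s Law figure (a back-of-the-envelope calculation, not part of the original slides): 21 years at one doubling every 18 months gives 14 doublings, so starting from 10,000 transistors in 1981:

```latex
\[
n = \frac{(2002 - 1981) \times 12}{18} = 14,
\qquad
10{,}000 \times 2^{14} = 163{,}840{,}000 \approx 1.6 \times 10^{8}
\]
```

This is in line with the 150,000,000 transistors quoted for a leading edge chip in 2002.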
Introduction
• We introduce:
  • A basic architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
    • Involves self-profiling circuitry
    • Uses designer-activated self-optimization mode
• To illustrate, we also introduce:
  • A tunable component: Loop Table
    • Small memory to store frequent loops
    • Similar to previous loop caches
    • Differs in how and when contents are updated
Problem Description and Related Work
• Goal:
  • Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
• Constraints
  • Exact instruction set compatibility
  • Avoid changing tool chain
  • Preserve cycle-by-cycle behavior
• These constraints are more stringent than in most previous work
Problem Description and Related Work
• Application-specific instruction-set processors
  • Introduce new instructions for common actions
  • Pre-fabrication: [Fischer99], [Tensilica00]
  • Post-fabrication: [Kucukcakar99], for mass-produced ICs
  • Obviously modifies instruction set and tool chain
• Dynamic binary translation and code morphing
  • Transmeta’s Crusoe: profiles executing code, caches translation results of frequently executed code
  • Changes cycle-by-cycle behavior, and only helps if performing dynamic binary translation in the first place
• Program compression
  • Profile code, compress frequently executed code [Ishihara00]
  • Modifies the tool chain
Problem Description and Related Work
• Loop caches
  • Cache frequently executed small loops to reduce power for memory
  • Filter cache [Kin97]
    • Small, low-power L0 cache
    • Causes extra cycles due to many misses
  • Compiler-assisted loop cache [Bellas99]
    • Use profiler/compiler to mark only frequent loops for placement in filter cache
    • Modifies tool chain
  • Transparent loop cache [Lee99]
    • Fill loop cache only when a short backward branch is detected, indicating a small loop
    • No tag comparisons, hence greater efficiency
    • We extend this to consider only frequent loops, reducing runtime overhead
[Figure: PID controller example: most execution time spent in two small loops]
[Figure: Proc. fetching from Pmem alone, versus Proc. fetching from Pmem and a Loop table]
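The short-backward-branch test that the transparent loop cache relies on can be sketched in a few lines. This is only an illustration of the idea, not the circuit from [Lee99] or from this work: the function name, the byte-based distance, and the 32-byte default limit are all assumptions.

```python
def is_short_backward_branch(branch_addr: int, target_addr: int,
                             max_loop_bytes: int = 32) -> bool:
    """Return True if a taken branch looks like the end of a small loop.

    A branch is treated as closing a small loop when its target lies
    before the branch itself (backward) and no more than max_loop_bytes
    away (short). max_loop_bytes is a hypothetical capacity limit.
    """
    distance = branch_addr - target_addr
    return 0 < distance <= max_loop_bytes


# A branch at 0x120 back to 0x110 spans 16 bytes: a candidate small loop.
print(is_short_backward_branch(0x120, 0x110))
# A forward branch is never a loop candidate.
print(is_short_backward_branch(0x110, 0x120))
```

Restricting the fill to such branches is what lets the cache avoid tag comparisons; the extension described on this slide adds a frequency filter on top of this test.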
Architecture Overview
• Started with standard microcontroller
  • ROM access consumes much power
• Added Loop Table to store common loops
• Added Bypass Controller to switch to/from Loop Table
• Added Self-Profiling Controller and Loop Count Table to detect most frequent loops
[Figure: block diagram. Microprocessor (Controller, Datapath) connected to Data Memory (RAM), Program Memory (ROM, ~10,000s of bytes), and Configuration Memory (~10s of bytes). Added blocks: Loop Table (~100s of bytes) holding instructions and jump bits, Bypass Controller with LARs, Self-Profiling Controller with Loop Count Table (~100s of bytes). Muxes select the instruction and address paths]
Methodology Overview
• Self-optimizing microcontroller
  • Post-fabrication (hence mass-produced)
  • In-system
  • Tuning under designer control
    • Not by end user, hence stable and consistent end-use platform
[Figure: stages labeled (Designer: pre-fabrication), Designer: post-fabrication, and User, with self-optimization mode activation marked]
Methodology Overview
• Download application to microcontroller program memory
• Reset microcontroller, causing (optimized) application execution in normal mode
• Activate self-optimizing mode, causing update of configuration memory
• Upload configuration memory for downloading to other microcontrollers
[Figure: architecture block diagram as on the Architecture Overview slide: Microprocessor (Controller, Datapath), Data Memory (RAM), Program Memory (ROM, ~10,000s of bytes), Configuration Memory (~10s of bytes), Loop Table (~100s of bytes) with instructions and jump bits, Bypass Controller with LARs, Self-Profiling Controller with Loop Count Table (~100s of bytes)]
Self-optimizing mode
• Initializing
  • Activated by extra pin
  • Traverse memory, detect loops, add addresses to loop count table
• Profiling
  • Execute, update loop counts
  • Requires fast increments
    • We use fully associative memory
    • Hardware hash table possible
• Configuring
  • Store most frequent loop addresses at bottom of program memory, set flag
[Figure: Microprocessor (Controller, Datapath) with Self-Profiler and Loop Count Table, Data Memory (RAM), Program Memory (ROM, ~10,000s of bytes), Configuration Memory (~10s of bytes). Mode flow: Download program, Normal mode, Self-optimizing mode, Upload configuration]
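The three phases of self-optimizing mode can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the hardware described in the slides: a Python dict stands in for the fully associative Loop Count Table, a loop is identified simply by the target of a backward branch, and the function names and the `num_lars` parameter are hypothetical.

```python
def initialize(static_branches):
    """Initializing: traverse the program and record each loop start address.

    static_branches is a list of (branch_addr, target_addr) pairs found in
    program memory; a backward branch (target before branch) is assumed to
    mark a loop. Returns a loop count table mapping loop address -> count.
    """
    return {target: 0 for branch, target in static_branches if target < branch}


def profile(loop_counts, taken_branch_targets):
    """Profiling: increment the count of each known loop as it iterates."""
    for target in taken_branch_targets:
        if target in loop_counts:  # models the associative lookup
            loop_counts[target] += 1
    return loop_counts


def configure(loop_counts, num_lars=2):
    """Configuring: pick the most frequent loop addresses to store."""
    ranked = sorted(loop_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [addr for addr, count in ranked[:num_lars] if count > 0]


# Two loops (starting at 0x30 and 0x80) and one forward branch.
table = initialize([(0x40, 0x30), (0x90, 0x80), (0x50, 0x70)])
profile(table, [0x30] * 5 + [0x80] * 9 + [0x70] * 3)
print([hex(a) for a in configure(table)])  # most frequent loop first
```

In the hardware version the lookup and increment must complete within the profiled program's execution, which is why the slide calls for fast increments via a fully associative memory or a hardware hash table.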
Normal mode
• Reset
  • If self-optimization flag set
    • Read loop addresses into address registers (LARs)
    • Set flag in bypass controller
• If flag unset or no address match, fetch from ROM
• If flag set and address match
  • Begin fetching from loop table
• Extra bits in loop table for fast determination of whether a jump leaves the table
  • 00: instruction can’t exit loop
  • 10: exits loop if jump not taken
  • 01: exits loop if jump taken
[Figure: Microprocessor (Controller, Datapath) with Loop Table (instructions, jump bits), Bypass Controller with LARs, Data Memory (RAM), Program Memory (ROM), Configuration Memory. Mode flow: Download program, Normal mode, Self-optimizing mode, Upload configuration]
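The fetch-source decision made in normal mode can be sketched as a small function. The jump-bit encodings come from the slide; everything else is an assumption made for illustration: the function names, the "rom" and "loop_table" return values, and rejecting the `11` encoding, which the slides do not describe.

```python
CANNOT_EXIT = 0b00        # instruction can't exit loop
EXIT_IF_NOT_TAKEN = 0b10  # exits loop if jump not taken
EXIT_IF_TAKEN = 0b01      # exits loop if jump taken


def leaves_loop_table(jump_bits: int, jump_taken: bool) -> bool:
    """Decide from the two jump bits whether execution leaves the loop table."""
    if jump_bits == CANNOT_EXIT:
        return False
    if jump_bits == EXIT_IF_NOT_TAKEN:
        return not jump_taken
    if jump_bits == EXIT_IF_TAKEN:
        return jump_taken
    # 0b11 is not described in the slides; rejecting it is an assumption.
    raise ValueError(f"undefined jump bits: {jump_bits:#04b}")


def next_fetch_source(flag_set: bool, in_loop_table: bool, next_addr: int,
                      lars, jump_bits: int = CANNOT_EXIT,
                      jump_taken: bool = False) -> str:
    """Return "loop_table" or "rom" for the next instruction fetch."""
    if not flag_set:
        return "rom"  # never self-optimized: always fetch from ROM
    if in_loop_table:
        # Already in the table: only the jump bits decide whether we leave.
        return "rom" if leaves_loop_table(jump_bits, jump_taken) else "loop_table"
    # Fetching from ROM: switch over when the address matches an LAR.
    return "loop_table" if next_addr in lars else "rom"


print(next_fetch_source(True, False, 0x30, [0x30, 0x80]))
print(next_fetch_source(True, True, 0x34, [0x30], EXIT_IF_TAKEN, True))
```

Because the exit decision needs only two stored bits and the jump outcome, no address comparison is required while fetching from the loop table.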
Results: power
• Savings
  • 34% total power savings after self-optimization
  • Depends on technology
• Power overhead
  • Negligible when self-optimization idle
  • Slight increase (5%) during self-optimization
• Setup
  • Synopsys synthesis, simulation, and power analysis
  • 8051 synthesizable VHDL model at UCR (www.cs.ucr.edu/~dalton)
Results: size (in cells)
• Significant increase, but:
  • 8051 version was small
    • Other systems are bigger: ROM (e.g., 2M), RAM, and other processors are even bigger
    • Smaller percentage overhead
  • Transistors becoming cheaper
  • Can build product-oriented ICs with only loop table and controller (no Self-Profiler or Loop Count Table)
    • Upload new binaries from prototype-oriented part, download back to new product-oriented parts
    • Supported by existing standard tools
  • We are investigating ways to shrink the Loop Count Table
Conclusions
• Mass-produced ICs give big advantages
• Abundance of transistors provides new opportunity for self-optimization by tuning
• We introduced:
  • A self-optimization methodology and architecture
  • A loop table as a tunable component
• These items yielded:
  • Significant power savings by reducing ROM access
    • 34% total savings for our particular microcontroller and target technology
  • No change in instruction set, tools, or performance
• Future work includes:
  • Reducing size overhead
  • Investigating other tunable components (e.g., N-way cache)