A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power

A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power. Frank Vahid* and Ann Gordon-Ross Dept. of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems, UC Irvine


Presentation Transcript


  1. A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power. Frank Vahid* and Ann Gordon-Ross, Dept. of Computer Science and Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems, UC Irvine. This work was supported by the National Science Foundation and NEC. International Symposium on Low Power Electronics and Design, 2001.

  2. Introduction
     • Mass-produced microprocessor IC’s prevail in embedded systems
       • Cheap: from amortization and high yields
       • Small and low power: from optimization and use of new technologies
       • Available immediately
     • Typically run one program forever
     • QUESTION: Can we “tune” a mass-produced microprocessor to its one program to reduce power?
     [Figure: block diagram with Processor, Dmem., Pmem., and Periph. blocks. Sample: annual production 10 million units; cost per unit $2.]

  3. Introduction
     • Use configurable (tunable) components and add a tuner circuit
     • Make use of abundant transistors
       • Previously, silicon too scarce
       • Today, “transistor budgets have gone ballistic” [Microprocessor Report, 1998]
     • Software analogy
       • Previously, program memory was scarce
       • Today, we find a flight simulator hidden in Excel’97
     [Figure: Moore’s Law, 2x / 18 months, 1981 to 2002. Leading edge chip in 1981: 10,000 transistors. Leading edge chip in 2002: 150,000,000 transistors.]
     [Figure: block diagram with Processor, Dmem., Pmem., Periph., and Tuner blocks.]
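
As a rough check on the Moore’s Law figure, the 1981 count can be projected forward to 2002 at one doubling per 18 months. This is a minimal sketch; the function name and its parameters are illustrative and not from the presentation.

```python
def projected_transistors(start_count, start_year, end_year, months_per_doubling=18):
    """Project a transistor count forward, assuming one doubling per period."""
    doublings = (end_year - start_year) * 12 / months_per_doubling
    return start_count * 2 ** doublings

# Endpoints taken from the slide: 10,000 transistors in 1981, projected to 2002.
estimate = projected_transistors(10_000, 1981, 2002)
print(f"{estimate:,.0f}")  # prints 163,840,000
```

Fourteen doublings give roughly 164 million, the same order of magnitude as the 150,000,000 quoted for 2002.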

  4. Introduction
     • We introduce:
       • Architecture and methodology for a self-optimizing microprocessor that can tune itself to its program
       • Uses self-profiling circuitry and designer-activated self-optimization mode
     • To illustrate, we introduce:
       • A tunable component: Loop Table
       • Similar to loop caches, differs in how and when contents are updated
       • Other tunable components are possible

  5. Problem Description
     • Goal:
       • Develop a mass-producible standard embedded microprocessor that can tune its configurable components to one application for low power
     • Constraints
       • Exact instruction set compatibility
       • Avoid changing tool chain
       • Preserve cycle-by-cycle behavior
     • These constraints are more stringent than in most previous work

  6. Related Work
     • Application-specific instruction-set processors
       • Introduce new instructions for frequent code
       • Pre-fabrication: [Fischer99], [Tensillica00]
       • Post-fab: [Kucukcakar99] – for mass-produced IC’s
       • Modifies instruction-set and tool chain
     • Code morphing
       • Crusoe: Cache frequent code’s translation
       • Helps only if performing dynamic binary translation
       • Changes cycle-by-cycle behavior
     • Code compression
       • Compress frequent code [Ishihara00]
       • Modifies tool chain

  7. Related Work
     • Cache frequent small loops
       • Reduces memory/bus power
     • Filter cache [Kin97]
       • Small L0 cache
       • Many misses (extra cycles)
     • Compiler-assisted loop cache [Bellas99]
       • Profiler/compiler marks frequent loops for filter cache placement
       • Modifies tool chain
     • Transparent loop cache [Lee99]
       • Fills loop cache only when a short backwards branch is detected
       • No tag comparisons – greater efficiency
     • Our approach
       • Moves profiler to chip, and can be more selective in filling loop cache
     [Figure: PID controller example: most execution time spent in two small loops. Block diagrams with Proc., Pmem, and Loop table.]

  8. Architecture Overview
     • Standard microcontroller
       • ROM access consumes much power
     • Added
       • Self-Profiling Controller and Loop Count Table for profiling
       • Loop Table to store common loops
       • Bypass Controller to switch to Loop Table
     [Figure: block diagram with RAM, ROM, Configuration Memory (~10’s of bytes), Microprocessor, Datapath, Controller, Loop Table, Bypass Controller, Self-Profiling Controller, and Loop Count Table.]

  9. Methodology Overview
     • Self-optimizing microcontroller
       • Post-fabrication (hence mass-produced)
       • In-system
     • Tuning under designer control
       • Not by end user, hence stable and consistent end-use platform
     [Figure labels: (Designer: pre-fabrication); Designer: post-fabrication; User; Self-optimization mode activation.]

  10. Methodology Overview
     • Download application to microcontroller program memory
     • Reset microcontroller, causing (optimized) application execution in normal mode
     • Activate self-optimizing mode, causing update of configuration memory
     • Upload configuration memory for downloading to other microcontrollers

  11. Self-optimizing mode
     • Initializing
       • Activated by extra pin or existing pin combination
       • Traverse memory, detect loops, add addresses to loop count table
     • Profiling
       • Execute, update loop counts
       • Requires fast increments
       • We use fully associative memory
       • Hardware hash table possible
     • Configuring
       • Store most frequent loop addresses at bottom of program memory, set flag
     [Figure: flow of Download program, Normal mode, Self-optimizing mode, Upload configuration. Blocks: ROM, Self-Profiling Controller, Loop Count Table with columns Loop addr. and Count; sample values as extracted: 200, 100, 0, 5, 200, 0, 900.]
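
The three phases of self-optimizing mode can be sketched in software as follows. This is only an illustration under stated assumptions: the `Instr` record, the `max_body` limit on loop size, counting a loop each time its start address is fetched, and the single configuration slot are all hypothetical choices, not details from the presentation. The addresses 100, 200, and 900 echo values that appear in the slide's figure; the trace is made up.

```python
from collections import namedtuple

# Hypothetical instruction record: 'target' is a branch destination or None.
Instr = namedtuple("Instr", ["addr", "target"])

def detect_loops(program, max_body=32):
    """Initializing: traverse memory and record the start address of each
    short backward branch (assumed here to mark a loop) with a zero count."""
    table = {}
    for ins in program:
        if ins.target is not None and 0 < ins.addr - ins.target <= max_body:
            table[ins.target] = 0
    return table

def profile(table, fetch_trace):
    """Profiling: bump a loop's count whenever its start address is fetched."""
    for addr in fetch_trace:
        if addr in table:
            table[addr] += 1
    return table

def configure(table, slots=1):
    """Configuring: pick the most frequently executed loop addresses."""
    ranked = sorted(table, key=lambda a: table[a], reverse=True)
    return [a for a in ranked[:slots] if table[a] > 0]

# Three backward branches (loops starting at 100, 200, 900) and one plain instruction.
program = [Instr(110, 100), Instr(150, None), Instr(215, 200), Instr(905, 900)]
loop_counts = detect_loops(program)
profile(loop_counts, [200, 204, 215] * 5)   # made-up trace: loop at 200 runs 5 times
chosen = configure(loop_counts)
print(loop_counts, chosen)
```

In hardware the dictionary lookup corresponds to the fully associative Loop Count Table, which is why the slide stresses fast increments.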

  12. Normal mode
     • Reset
       • Read loop addresses (if any) into registers (LAR’s)
       • Read corresponding loops into loop table
       • Set flag in bypass controller
     • Execute: Check if flag set and address match
       • No: Fetch from ROM
       • Yes: Begin fetching from loop table
       • No tag comparisons, no misses
       • Pre-computed extra bits quickly detect table exit
     [Figure: flow of Download program, Normal mode, Self-optimizing mode, Upload configuration. Blocks: RAM, ROM, Datapath, Controller, Loop Table, Bypass Controller. Example: address 200 in the LAR, with the entry at 200 shown in both ROM and Loop Table.]
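
A toy software model of the normal-mode fetch path is sketched below. It assumes a single loop, a ROM modeled as a Python list, and a simple address-range check for leaving the table; the slide instead uses pre-computed extra bits for exit detection, which are not modeled. The class and attribute names are hypothetical.

```python
class BypassController:
    """Toy model of the bypass controller for one loop (an assumption)."""

    def __init__(self, rom, loop_start=None, loop_len=0):
        self.rom = rom
        # Reset: read the loop address into the LAR and copy the loop body
        # from ROM into the loop table, then set the flag.
        self.lar = loop_start
        self.flag = loop_start is not None and loop_len > 0
        self.loop_table = rom[loop_start:loop_start + loop_len] if self.flag else []
        self.in_table = False
        self.rom_reads = 0      # execution-time ROM fetches only
        self.table_reads = 0

    def fetch(self, addr):
        if self.flag and addr == self.lar:
            # Flag set and address matches the LAR: start fetching from the table.
            self.in_table = True
        elif self.in_table and not (self.lar <= addr < self.lar + len(self.loop_table)):
            # Simplified exit check standing in for the pre-computed extra bits.
            self.in_table = False
        if self.in_table:
            self.table_reads += 1
            return self.loop_table[addr - self.lar]
        self.rom_reads += 1
        return self.rom[addr]

rom = [f"ins{a}" for a in range(16)]
bc = BypassController(rom, loop_start=4, loop_len=3)
trace = [0, 1, 2, 3] + [4, 5, 6] * 3 + [7, 8]   # loop body at 4..6 runs three times
fetched = [bc.fetch(a) for a in trace]
print(bc.rom_reads, bc.table_reads)  # prints 6 9
```

The fetched instruction stream is identical with or without the table, which is the point of preserving instruction-set and cycle-by-cycle behavior; only the source of each fetch changes.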

  13. Results -- power
     • Savings
       • 34% total power savings after self-optimization
       • Dependent on technology
     • Power overhead
       • Negligible when self-optimization idle
       • Slight increase (5%) during self-optimization
     • Setup
       • Synopsys synthesis, simulation, and power analysis
       • 8051 synthesizable VHDL model at UCR (www.cs.ucr.edu/~dalton)
     [Figure: results for Ex1: checksum, Ex2: gcd, Ex3: matrix multiply.]
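
One way to see why the savings depend on technology is a back-of-the-envelope fetch-energy model. This model is an assumption, not the authors' analysis. It normalizes a ROM read to 1.0 and covers instruction-fetch energy only, so it does not reproduce the total-power figure on the slide. The inputs are placeholders, not measurements.

```python
def fetch_energy_saving(loop_fraction, table_to_rom_ratio):
    """Return the fraction of instruction-fetch energy saved.

    loop_fraction: share of fetches served by the loop table (0..1).
    table_to_rom_ratio: energy of one loop-table read divided by one ROM read.
    Both are hypothetical inputs, not figures from the presentation.
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    if table_to_rom_ratio < 0.0:
        raise ValueError("table_to_rom_ratio must be non-negative")
    # Baseline: every fetch costs one ROM read (normalized to 1.0).
    optimized = (1.0 - loop_fraction) + loop_fraction * table_to_rom_ratio
    return 1.0 - optimized

# Placeholder inputs chosen only to exercise the formula.
saving = fetch_energy_saving(loop_fraction=0.5, table_to_rom_ratio=0.5)
print(f"fetch energy saved: {saving:.0%}")  # prints "fetch energy saved: 25%"
```

The saving grows with the share of time spent in the stored loop and shrinks as a loop-table read approaches the cost of a ROM read, and that ratio is set by the target technology.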

  14. Results – size (in cells)
     • Big increase, but:
       • 8051 version was small
         • Others much bigger
         • Smaller % overhead
       • Transistors becoming cheaper
       • Product-oriented IC’s: loop table and controller, no Self-Profiler or Loop Count Table
         • Transfer configuration from prototype-oriented part to new product-oriented parts
         • Supported by existing upload/download tools
       • We are working on shrinking the Loop Count Table logic

  15. Conclusions
     • Mass-produced IC’s give big advantages
     • Transistor abundance provides new opportunities
     • We introduced:
       • A self-optimization methodology and architecture
       • A loop table as an example tunable component
     • These items yielded:
       • Power savings by reducing ROM access
       • 34% savings for 8051 microcontroller for target technology
       • No change in instruction set, tools, or performance
     • Future work includes:
       • Reducing size overhead while maintaining accuracy
         • Trading off size with accuracy
       • Extending loop table for multiple loops, subroutines, etc.
       • Incorporating into 32-bit processor environment (LEON Sparc)
       • Investigating other tunable components
         • On-chip FPGA, configurable cache, etc.
