Architecture Tuning in Embedded Systems. Roman Lysecky Department of IP Management Conexant Newport Beach. Greg Stitt, Frank Vahid, Tony Givargis Dept. of Computer Science & Engineering University of California, Riverside.
Architecture Tuning in Embedded Systems
Roman Lysecky, Department of IP Management, Conexant, Newport Beach
Greg Stitt, Frank Vahid, Tony Givargis, Dept. of Computer Science & Engineering, University of California, Riverside
This work was supported by the National Science Foundation under grants CCR-9811164 and CCR-9876006, and by a Design Automation Conference graduate scholarship.
This work is being presented at CASES’00 (Compilers, Architectures and Synthesis for Embedded Systems), November 18-19, 2000, San Jose, CA.
A “short list” of embedded systems
Anti-lock brakes, Auto-focus cameras, Automatic teller machines, Automatic toll systems, Automatic transmission, Avionic systems, Battery chargers, Camcorders, Cell phones, Cell-phone base stations, Cordless phones, Cruise control, Curbside check-in systems, Digital cameras, Disk drives, Electronic card readers, Electronic instruments, Electronic toys/games, Factory control, Fax machines, Fingerprint identifiers, Home security systems, Life-support systems, Medical testing systems, Modems, MPEG decoders, Network cards, Network switches/routers, On-board navigation, Pagers, Photocopiers, Point-of-sale systems, Portable video games, Printers, Satellite phones, Scanners, Smart ovens/dishwashers, Speech recognizers, Stereo systems, Teleconferencing systems, Televisions, Temperature controllers, Theft tracking systems, TV set-top boxes, VCRs, DVD players, Video game consoles, Video phones, Washers and dryers
And the list goes on and on
Introduction: Traditional microprocessor use in embedded systems
• Tasks (not necessarily in the given order)
  • (1) Buy a microprocessor IC (integrated circuit)
  • (2) Integrate it with other ICs onto a board and insert it into an embedded system
  • (3) Download a software program
• Notice that the processor IC is designed independently of the software
• Different microprocessor variations thus exist, such as low-power or high-performance ICs
[Figure: software, processor, and board, labeled with steps 1 to 3]
Introduction: Modern core-based approach
• Tasks
  • (1) Buy a microprocessor CORE
    • Hard: layout; Firm: structural HDL; Soft: synthesizable HDL
    • You are buying intellectual property, like a file that may come on a floppy, CD-ROM, over the web, etc. You are NOT buying hardware.
  • (2) Design a system-on-a-chip (SOC) from this and other cores
  • (3) Fabricate a SOC IC
  • (4) Insert the IC into an embedded system
  • (5) Download a software program
[Figure: software and processor HDL, labeled with steps 1 to 5]
Introduction: the fixed program, a unique feature of embedded systems
• SOCs implementing an embedded system have a unique feature
  • Each implements a particular application
  • Thus, the processor may execute a single fixed program that never changes
  • Unlike desktop systems, which execute a variety of programs
  • Examples: digital camera, automobile cruise-controller
• We can exploit this fixed-program feature
  • For example, by using mask-programmed ROM
  • But much more can be done
[Figure caption: The software in here never changes after production]
Introduction: Proposed core-based approach with architecture tuning
• Tasks
  • (1) Buy a microprocessor core
  • (2) Design a system-on-a-chip (SOC) from this and other cores
  • (3) TUNE the SOC architecture to a software program
  • (4) Fabricate a SOC IC
  • (5) Insert the IC into an embedded system
  • (6) Download the software program
[Figure: software and processor HDL, labeled with steps 1 to 6]
Introduction: architecture tuning
• Architecture tuning
  • A way to exploit the fixed-program feature of embedded systems
  • First, do architecture design for the particular application
  • Then, “tune” the core-based system architecture to the particular application program, before IC fabrication
  • Goals: better performance, power, size
[Figure: a core library (PeripheralA, PeripheralB, ProcessorX) and a fixed program pass through three stages: architecture design, architecture tuning, and fabrication. Each stage shows peripheral, program, and processor blocks; the final stage is an IC with tuned cores]
Introduction: architecture tuning
• Examples of tuning optimizations
  • Memory hierarchy: no cache, L1 cache, L1+L2 cache
  • Cache organization: size, associativity, write policies
  • Bus structure, data/address encoding
  • DMA block sizes
  • Microprocessor optimizations
    • Internal small-loop table
    • Controller partitioning
    • Datapath shortcuts
    • Register file copies
Introduction: Tuning is a special case of Y-Chart iteration
• Philips/TriMedia approach of simultaneously developing architecture and its applications
[Figure: Y-chart with nodes Architecture, Applications, Mapping, Analysis, and Numbers; one part is marked “Our focus”]
Problem description
• Focus of this work:
  • Tuning a microcontroller to its program
  • Goal is reduced power without performance loss
• Restrict tuning to maintain exact instruction set compatibility
  • No instructions may be added or deleted
  • Thus, no modification to software development environment
  • Also, no problems with porting software to/from other versions of the microcontroller
• Instruction set incompatibility can be a show stopper
  • Maintenance/upgrades/re-porting of binaries over the lifetime of product and for product variations is a key issue
  • Likewise, a stable software development environment is needed
Previous work
• Application-specific instruction-set processors [Fisher99]
  • Customize a microprocessor to its application(s)
  • Delete unnecessary instructions, add new ones along with accompanying datapath extensions
  • e.g., Tensilica
  • A customized instruction set requires customized development tools (e.g., compiler, debugger)
• Tuning compiler to architecture [Tiwari et al 94]
  • Architectural description languages to inform compiler of architecture features [Halambi et al 99]
• Tuning cache and cache/bus organization to application [Givargis et al 99]
Tuning environment
• Currently for the 8051 microcontroller
  • Starts from VHDL synthesizable model of 8051 (soft core)
  • Uses Synopsys synthesis, simulation and power analysis
  • Uses 8051 instruction-set simulator
  • Uses numerous scripts
• Goal of the environment
  • Understand how power is being consumed for a particular application, so that modifications to the architecture (or application) can be made to minimize that power
• Three main tools
  • Architectural view
  • Instruction-set view
  • Program/data memory view
Tuning environment: architectural view tool
[Figure: tool flow. The microprocessor soft core passes through the RT-synthesizer to give the microprocessor structure; the program binary passes through the ROM generator to give the ROM entity. Both feed the simulator and power analyzer, whose “flat” power data goes to the structural hierarchical power data translator and xdu display]
Sample display:
Component   Power (mW)
ROM         1.04
ALU         1.62
RAM         1.42
CTRL        2.69
DECODER     0.07
Total       7.66
Tuning environment: instruction-set view tool
[Figure: tool flow. Binaries to exercise instruction 1, instruction 2, instruction 3, and so on pass through the ROM generator to give a ROM entity, which is simulated and power-analyzed together with the microprocessor structure. The resulting flat power data for each instruction goes to the power data collector, structural power data translator, and xdu display]
Instruction   Power (mW)
ADDC_1        7.340834
ADD_1         7.350741
ANL_1         6.631394
CLR_1         3.76228
CPL_1         5.481627
DA            5.28897
DEC_1         5.368807
DIV           7.716592
INC_1         4.662862
MOVC_1        6.078014
MOVC_2        5.021021
MOV_1         5.577664
MOV_2         6.164267
MUL           5.522886
NOP           4.900275
ORL_1         6.954121
POP           8.103867
PUSH          8.7116
Tuning environment: program/data memory view tool
[Figure: tool flow. Per-instruction power data (from previous tool) and the program binary feed the instruction-set simulator, which produces program/data memory access frequencies and power; these go to the program hierarchy power translator and xdu display]
Program memory view:
Addr    Ins     Freq   Pwr       Freq*Pwr
00000   LJMP    1      0         0
00003   MOV_9   108    5.46067   589.752
00005   MOV_9   108    5.46067   589.752
00007   MOV_9   108    5.46067   589.752
00009   MOV_9   108    5.46067   589.752
00011   RET     108    0         0
00012   MOV_9   27     5.46067   147.438
00014   MOV_9   27     5.46067   147.438
00016   MOV_9   27     5.46067   147.438
00018   MOV_9   27     5.46067   147.438
00020   MOV_4   27     4.83507   130.547
00022   LCALL   27     0         0
Data memory view:
Addr    Purpose   Accesses
00128   P0        1311
00129   SP        70317
00130   DPL       31189
00131   DPH       7977
00144   P1        161
00208   PSW       413527
00224   ACC       360949
00240   B         2598
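The Freq*Pwr column of the program memory view can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the environment's actual scripts: the names `PROGRAM_VIEW`, `weighted_rows`, and `hottest` are made up for illustration, and the rows are copied from the sample display above.

```python
# Rows copied from the program memory view: (address, instruction, frequency, power in mW).
# The container and function names are illustrative assumptions, not part of the tool.
PROGRAM_VIEW = [
    (0, "LJMP", 1, 0.0),
    (3, "MOV_9", 108, 5.46067),
    (5, "MOV_9", 108, 5.46067),
    (7, "MOV_9", 108, 5.46067),
    (9, "MOV_9", 108, 5.46067),
    (11, "RET", 108, 0.0),
    (12, "MOV_9", 27, 5.46067),
    (14, "MOV_9", 27, 5.46067),
    (16, "MOV_9", 27, 5.46067),
    (18, "MOV_9", 27, 5.46067),
    (20, "MOV_4", 27, 4.83507),
    (22, "LCALL", 27, 0.0),
]


def weighted_rows(rows):
    """Append Freq*Pwr to every row."""
    return [(addr, ins, freq, pwr, freq * pwr) for addr, ins, freq, pwr in rows]


def hottest(rows, count=3):
    """Return the rows with the largest Freq*Pwr, largest first."""
    return sorted(weighted_rows(rows), key=lambda row: row[4], reverse=True)[:count]


weighted = weighted_rows(PROGRAM_VIEW)
total_weighted = sum(row[4] for row in weighted)

for addr, ins, freq, pwr, product in hottest(PROGRAM_VIEW):
    print(f"{addr:05d} {ins:6s} {freq:4d} {pwr:8.5f} {product:9.3f}")
```

Sorting by the product rather than by frequency or power alone points at the instructions where a tuning optimization is likely to pay off most.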
Tuning environment
• Inputs: program binary, microprocessor core
• Instruction-set power view tool (1 day): produces instruction-set power data
• Program/data memory view tool (seconds): produces program power data
• Architectural view tool (1 hour): produces architecture power data
Design flow using the tuning environment
[Flowchart: after changing the application or the architecture, run the program/data memory view tool, the architecture view tool, and the instruction-set view tool as needed. If not satisfied, make another change; if satisfied, DONE]
Experiments
• Started with 8051 soft core in VHDL
• Tuning environment was used to
  • Examine where power consumption was occurring for a given application
  • Quickly evaluate the impact of tuning optimizations
• These are early results; much more work remains
Power consumption of the initial 8051 model
• Power consumption
  • Mainly due to switching wires
  • Any wire whose value changes (from 0 to 1) consumes power
  • Want to minimize switching
• 8051 power consumption
  • 5 main components
  • Controller, RAM, and ALU are the most expensive components
  • These components have potential for general optimizations
• Total gates: 25854
Average power: 37.1824 mW
General optimizations made to the 8051
• Prevent unnecessary switching on wires connecting to memories
  • Wires connecting processor to memories are high capacitance
  • They were switching even when not being used
  • So we inserted latches to hold the previous value, a standard power-saving technique
• Prevent unnecessary switching in decoder and ALU
  • Again, by latching the inputs coming from the controller
• Fetch instruction bytes only when needed
• Hold ROM output when not being read
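The effect of holding a bus value with a latch can be illustrated by counting 0-to-1 transitions on a toy bus trace. The sketch below is a behavioral model in Python, not the VHDL change made to the 8051. The trace `bus`, the enable pattern `used`, and the function names are hypothetical.

```python
def rising_transitions(values, width=8):
    """Count 0->1 bit transitions across consecutive bus values."""
    mask = (1 << width) - 1
    return sum(
        bin(~prev & cur & mask).count("1")
        for prev, cur in zip(values, values[1:])
    )


def latch(values, enables, initial=0):
    """Pass a value through only when its enable bit is 1; otherwise hold the last one."""
    held, out = initial, []
    for value, enable in zip(values, enables):
        if enable:
            held = value
        out.append(held)
    return out


# Hypothetical 8-bit bus trace: the middle three values are "don't care" cycles
# in which the receiving block does not actually use the bus.
bus = [0x00, 0xFF, 0x00, 0xFF, 0x0F]
used = [1, 0, 0, 0, 1]

raw_toggles = rising_transitions(bus)
latched_toggles = rising_transitions(latch(bus, used))
print(raw_toggles, latched_toggles)
```

In this toy trace the latched bus switches far less than the raw one, because the unused cycles no longer propagate to the high-capacitance wires.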
Power after general optimizations
• Overall power reduction from 37.2 to 11.6 mW
• Total gates: 25951
• % improvements
  • ROM 82.9%
  • RAM 70.5%
  • ALU 60.0%
  • CTRL 19.9%
Average power: 11.6025 mW
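The overall figures can be put in relative terms with a short calculation. This sketch only reworks the average-power and gate-count numbers reported on the slides; the helper names `pct_reduction` and `pct_increase` are illustrative.

```python
def pct_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


def pct_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0


# Average power (mW) and gate counts before and after the general optimizations.
power_saving = pct_reduction(37.1824, 11.6025)
gate_overhead = pct_increase(25854, 25951)

print(f"power reduced by {power_saving:.1f}%")
print(f"gate count increased by {gate_overhead:.2f}%")
```

Under these numbers, the added latches cost well under one percent in gates while removing roughly two thirds of the average power.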
Tuning optimizations
• Sought to tune the microprocessor to a particular application
  • GCD (greatest common divisor) computation
• Tuning optimizations invoked
  • 1) Replace frequently-accessed RAM locations by internal registers
  • 2) Create datapath shortcuts for most common instructions
  • 3) Partition the controller into a big controller and a small controller, with the small one handling the most frequently-executed GCD instructions
Sample tuning optimization
• Observation
  • RAM consumes much power
  • Address 224 accessed frequently
• Possible tuning optimization
  • Replace this RAM location by a register
• Steps
  • Modify VHDL model
  • Run all three view tools
• Results
  • Power reduction: 7.67 to 7.27 mW
  • RAM reduced from 1.42 to 0.8 mW, CTRL increased slightly
Architectural view:
Component   Power (mW)
ROM         1.04
ALU         1.62
RAM         1.42
CTRL        2.69
DECODER     0.07
Total       7.66
Data memory view:
Addr    Purpose   Accesses
00128   P0        1311
00129   SP        70317
00130   DPL       31189
00131   DPH       7977
00144   P1        161
00208   PSW       413527
00224   ACC       360949
00240   B         2598
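Picking which RAM locations to move into registers amounts to ranking the data memory view by access count. The following Python sketch does that for the table above; the names `DATA_VIEW` and `register_candidates` are assumptions made for illustration, and choosing the top two locations is just an example cutoff.

```python
# Data memory view copied from the slide: address -> (purpose, access count).
DATA_VIEW = {
    128: ("P0", 1311),
    129: ("SP", 70317),
    130: ("DPL", 31189),
    131: ("DPH", 7977),
    144: ("P1", 161),
    208: ("PSW", 413527),
    224: ("ACC", 360949),
    240: ("B", 2598),
}


def register_candidates(view, count=2):
    """Return the `count` most-accessed locations as (address, purpose, accesses)."""
    ranked = sorted(view.items(), key=lambda item: item[1][1], reverse=True)
    return [(addr, purpose, accesses) for addr, (purpose, accesses) in ranked[:count]]


candidates = register_candidates(DATA_VIEW)
total_accesses = sum(accesses for _, accesses in DATA_VIEW.values())
share = sum(accesses for _, _, accesses in candidates) / total_accesses

for addr, purpose, accesses in candidates:
    print(f"{addr:05d} {purpose:4s} {accesses}")
print(f"share of listed accesses: {share:.1%}")
```

For this table the two most-accessed locations are PSW and ACC, which together account for most of the listed accesses.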
Replacing certain RAM locations by registers
• PSW and accumulator are separated from RAM entity, placed in internal registers
• Total gates: 26465
• % improvements
  • RAM 46.1%
  • Overall 15.8%
Average power: 9.7684 mW
Optimized datapath
Addr    Ins     Freq   Pwr       Freq*Pwr
00000   LJMP    1      0         0
00003   MOV_9   108    5.46067   589.752
00005   MOV_9   108    5.46067   589.752
00007   MOV_9   108    5.46067   589.752
00009   MOV_9   108    5.46067   589.752
00011   RET     108    0         0
00012   MOV_9   27     5.46067   147.438
00014   MOV_9   27     5.46067   147.438
00016   MOV_9   27     5.46067   147.438
00018   MOV_9   27     5.46067   147.438
00020   MOV_4   27     4.83507   130.547
00022   LCALL   27     0         0
• MOV from reg7 to ACC very common
• Add “shortcut” signal to register file
  • Avoids having data go through ALU
• Total gates: 26315
• Power reduced by 0.32 mW (2.7%)
Average power: 11.2857 mW
Controller Partitioning
• Motivation
  • In many applications, 90% of the time is spent in 10% of the code (or some similar ratio)
  • So let’s partition the controller into two, one handling the 10% of frequently executed code
  • This smaller controller should consume less power
• Results
  • Average power reduced from 11.6 mW to 11.3 mW (2.6%)
  • Total gates: 28731
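One way to find the frequently executed code that a small controller would handle is to take instructions in order of execution frequency until a target share of all executions is covered. The Python sketch below applies that idea to the execution frequencies from the program memory view excerpt shown earlier. It is an assumed selection rule for illustration, not the partitioning procedure used in this work, and the 75% target and the names `EXEC_FREQ` and `hot_set` are hypothetical.

```python
# Execution frequency per program address, taken from the program memory view excerpt.
EXEC_FREQ = {
    0: 1, 3: 108, 5: 108, 7: 108, 9: 108, 11: 108,
    12: 27, 14: 27, 16: 27, 18: 27, 20: 27, 22: 27,
}


def hot_set(freqs, coverage):
    """Greedily pick the most-executed addresses until `coverage` of all executions is reached."""
    total = sum(freqs.values())
    chosen, covered = [], 0
    # sorted() is stable, so equally frequent addresses keep their original order.
    for addr, freq in sorted(freqs.items(), key=lambda item: item[1], reverse=True):
        if covered / total >= coverage:
            break
        chosen.append(addr)
        covered += freq
    return chosen, covered / total


hot_addresses, covered_share = hot_set(EXEC_FREQ, coverage=0.75)
print(hot_addresses, f"{covered_share:.1%}")
```

In this excerpt, five of the twelve addresses cover about three quarters of the executions; a full program trace would be needed to check how close a real application comes to the 90/10 ratio.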
Conclusions
• Described an environment for tuning a microprocessor to its application for low power
  • Full instruction set compatibility
  • Multiple views help find power hogs
  • Fully automated
• Focus is now on developing tuning optimizations
  • Controller partitioning, small-loop table, datapath shortcuts, register-file copies, etc.
  • Investigate possibility of automating tuning optimizations, develop more general tuning methodology
• Environment for the 8051 is available on the web:
  • http://www.cs.ucr.edu/~dalton