Automatic Insertion of Low Power Annotations in RTL for Pipelined Microprocessors Vinod Viswanath The University of Texas at Austin
Outline • Power Dissipation in Hardware Circuits • Instruction-driven Slicing to attain lower power dissipation • Automatically annotates microprocessor description • At the Register Transfer Level and Architectural level • Applying Instruction-driven Slicing to pipelined architectures • Applying Instruction-driven Slicing to out-of-order superscalar architectures
Power Dissipation • Switching activity power dissipation • To charge and discharge nodes • Short Circuit power dissipation • High only for output drivers, clock buffers • Static power dissipation • Due to leakage current
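The switching-activity term is the one the rest of the talk targets. As a rough illustration (not taken from the slides), the standard first-order CMOS model puts it at alpha * C * Vdd^2 * f, where alpha is the fraction of nodes that toggle per cycle; some texts include an extra factor of 1/2. All numeric values below are hypothetical.

```python
def switching_power(alpha, c_load, vdd, freq):
    """First-order dynamic switching power in watts: alpha * C * Vdd^2 * f.

    alpha  : average switching activity per cycle (0..1)
    c_load : total switched capacitance in farads
    vdd    : supply voltage in volts
    freq   : clock frequency in hertz
    """
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical numbers, chosen only to show the proportionality:
# halving the switching activity (e.g. by gating idle logic) halves this term.
base = switching_power(alpha=0.2, c_load=1e-9, vdd=1.2, freq=500e6)
gated = switching_power(alpha=0.1, c_load=1e-9, vdd=1.2, freq=500e6)
print(f"baseline: {base:.3f} W, with lower activity: {gated:.3f} W")
```

Because the term is linear in alpha, any annotation that stops unused flops from toggling reduces it in direct proportion.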
Switching Activity Power Dissipation • Transistor-level • Reordering, sizing • Gate-level • Don’t-care optimizations (combinational) • Encoding (sequential) • Pre-computation based optimization (sequential) • Guarded evaluation (sequential) • RT-level • Use program structure and dataflow information available at that level of abstraction
Instruction-driven Slice • An instruction-driven slice of a microprocessor design is • all the relevant circuitry of the design required to completely execute a specific instruction • Parts of the decode, execute, writeback etc. blocks • Cone of influence of the semantics of the instruction
Instruction-driven Slicing • Given a microprocessor design and an instruction • Identify the instruction-driven slice • Shut off the rest of the circuitry • This might include • Gating out parts of different blocks • Gating out floating point units during integer ALU execution • Turning off certain FSMs in different control blocks since exact constraints on their inputs are available due to instruction-driven slicing
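One way to picture the two steps (identify the slice, shut off the rest) is as backward reachability on a netlist graph. The sketch below is a toy model under that assumption; the block names, the `FANIN` table, and the choice of `int_wb` as the root for an integer add are all hypothetical, not taken from the OR1200 or PUMA designs.

```python
from collections import deque


def cone_of_influence(fanin, roots):
    """Return every node reachable backward from `roots` via fan-in edges."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for driver in fanin.get(node, ()):
            if driver not in seen:
                seen.add(driver)
                queue.append(driver)
    return seen


# Hypothetical toy design: each node maps to the nodes that drive it.
FANIN = {
    "int_wb": ["alu_out"],
    "alu_out": ["decode_int", "regfile_rd"],
    "fp_wb": ["fpu_out"],
    "fpu_out": ["decode_fp", "fp_regfile_rd"],
}

# All nodes that appear either as a sink or as a driver.
ALL_NODES = set(FANIN) | {d for drivers in FANIN.values() for d in drivers}

# Assumption: an integer add only has to update the integer writeback port.
add_slice = cone_of_influence(FANIN, ["int_wb"])
gated_for_add = ALL_NODES - add_slice

print("slice:", sorted(add_slice))
print("gated:", sorted(gated_for_add))
```

In this toy, the whole floating-point path falls outside the add slice, which mirrors the "gate out floating point units during integer ALU execution" case on the slide.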
Algorithm (High Level) • Algorithm instruction-driven-slicing. Begin • Inputs: vRTL (Verilog RTL), insts (instructions) • Output: aRTL (Annotated RTL) • Parse vRTL to obtain the Abstract Syntax Program Graph (ASPG) • For each instruction I in insts, repeat: • Slice the ASPG for instruction I • Traverse the ASPG, looking for blocks outside the slice • Add annotation variables if such a block is found • If a particular flop is already gated, then add the current annotation in an optimal fashion • Return the annotated ASPG • Generate Verilog code (aRTL) for the annotated ASPG • End.
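The loop above can be sketched on a toy data structure. This is a minimal illustration, not the authors' tool: the ASPG is reduced to a flat table of blocks, the per-instruction slices are given rather than computed, the `is_<inst>` annotation variables and `_clk_en` signal names are hypothetical, and an existing gating condition is merged by a plain AND rather than in any optimal fashion.

```python
def annotate(blocks, slices, insts):
    """Add one gating term per (instruction, block outside its slice).

    blocks : dict block name -> existing enable expression, or None if ungated
    slices : dict instruction name -> set of block names in its slice
    insts  : instructions to slice on, in order
    """
    annotated = dict(blocks)
    for inst in insts:
        term = f"!is_{inst}"  # hypothetical annotation variable, high while inst executes
        for name in sorted(blocks):
            if name in slices[inst]:
                continue  # block is needed by this instruction, leave it on
            prev = annotated[name]
            # Already gated: conjoin the new term with the existing condition.
            annotated[name] = term if prev is None else f"({prev}) & {term}"
    return annotated


def emit_verilog(annotated):
    """Emit one clock-enable assignment per gated block."""
    return "\n".join(
        f"assign {name}_clk_en = {expr};"
        for name, expr in sorted(annotated.items())
        if expr is not None
    )


# Hypothetical toy input: the load/store unit already has a gating condition.
blocks = {"alu": None, "fpu": None, "lsu": "lsu_busy"}
slices = {"add": {"alu"}, "lw": {"alu", "lsu"}}

annotated = annotate(blocks, slices, ["add", "lw"])
verilog = emit_verilog(annotated)
print(verilog)
```

The ALU is in every slice here, so it receives no annotation, while the FPU accumulates one term per instruction it is not needed for.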
Methodology • In order to demonstrate our technique • We have incorporated instruction-driven slicing as part of the traditional design flow • The vRTL model is annotated to obtain the aRTL model • The Synopsys design environment has been modified to accept the aRTL, the SPEC2000 benchmarks, and power process parameters, and to estimate the power dissipation due to switching activity • The annotated architectural model is fed to the SimpleScalar simulator with the Wattch power estimator to estimate the power dissipation
Experiment: OR1200 • We have used our tool-chain to test our methodology on OR1200 • OR1200 is a single-issue pipelined microprocessor implementing the OpenRISC ISA • 4-stage integer pipeline, issuing one instruction per cycle • We have annotated both the RTL and the architectural models of OR1200
OR1200-RTL Results • Results are shown after annotation insertion • Sliced on 1, 4, 10 instructions • For SPECINT2000 benchmarks • Power dissipation decreases consistently
OR1200-Arch Results • Results are shown after annotation insertion • Sliced on 1, 4, 10 instructions • For SPECINT2000 benchmarks • Power dissipation decreases consistently
OR1200 Results (contd.) • Power gains are consistently good • Power gains far outperform area losses
OR1200 Results (contd.) • Flop distribution shown before slicing (Fig. a), after slicing on add, l.add (Fig. b), and after slicing on load, l.lw (Fig. c)
Experiment: PUMA • We have used our tool-chain to test our methodology on PUMA • PUMA is a dual-issue, out-of-order super-scalar, fixed-point PowerPC core • We have annotated both the RTL and the architectural models of PUMA
PUMA Results (contd.) • Power gains are good upon slicing for a few instructions (~7) before delay losses start dominating (Fig. 1) • Power gains far outperform area losses (Fig. 2) • Flop distribution shown before slicing (Fig. 3a), after slicing on add (Fig. 3b), and after slicing on load (Fig. 3c)
Conclusions • Proposed Instruction-driven Slicing as a new technique to automatically reduce power dissipation • Implemented the methodology of incorporating instruction-driven slicing into the design flow tool-chain • Inserting these annotations preserves the functionality of the circuit
Conclusions (continued) • This technique seems most applicable to single-issue, multi-stage pipelined machines • When there are multiple instructions in flight in the same pipeline stage, the gains of a single-instruction abstraction are lost • Graphics processors and various embedded applications are often better suited for this technique than general-purpose out-of-order superscalars
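A small sketch can show why multiple in-flight instructions erode the gains: a block can only be gated if none of the instructions sharing the stage needs it, so the active set is the union of their slices. The block names and slices below are hypothetical.

```python
def gated_blocks(all_blocks, slices, in_flight):
    """Blocks that can be shut off while every instruction in `in_flight` executes."""
    active = set().union(*(slices[inst] for inst in in_flight))
    return set(all_blocks) - active


# Hypothetical blocks and per-instruction slices.
ALL_BLOCKS = {"alu", "fpu", "lsu", "mul"}
SLICES = {"add": {"alu"}, "lw": {"alu", "lsu"}, "fmul": {"fpu"}}

single = gated_blocks(ALL_BLOCKS, SLICES, ["add"])
dual = gated_blocks(ALL_BLOCKS, SLICES, ["add", "fmul"])
triple = gated_blocks(ALL_BLOCKS, SLICES, ["add", "lw", "fmul"])

for label, gated in (("1 in flight", single), ("2 in flight", dual), ("3 in flight", triple)):
    print(label, "->", sorted(gated))
```

With one instruction, three of the four toy blocks are idle; with three different instructions sharing the stage, only one is.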
PUMA Power Gain Results • Results are shown after annotating the RTL (left) and architectural (right) models • For un-sliced and sliced on 1, 4, 10 instructions • For SPECINT2000 benchmarks • Power dissipation decreases consistently
Correct Annotations • Notion of correctness • The original RTL and the annotated RTL should be functionally equivalent under all conditions • Correctness theorem: (defthm or1200_slicing_correct (equal (or1200_cpu n) (or1200_cpu_sliced n)))
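The shape of the theorem (the original and sliced models agree after n steps) can be illustrated with a bounded check on a toy machine. This is only a sketch under heavy assumptions: the two-register machine, its `add`/`fmul` operations, and the `gate_unused` flag are hypothetical stand-ins, and exhaustive testing up to a small bound is not a substitute for the ACL2 proof, which covers all n.

```python
from itertools import product


def step(state, inst, gate_unused):
    """Advance the toy machine by one instruction.

    Without gating, both units evaluate every cycle; with gating, the unit
    outside the instruction's slice is held idle (modelled as None).
    """
    acc, facc = state
    op, val = inst
    alu_out = acc + val if (op == "add" or not gate_unused) else None
    fpu_out = facc * val if (op == "fmul" or not gate_unused) else None
    if op == "add":
        return (alu_out & 0xFF, facc)  # only the ALU result is architecturally visible
    return (acc, fpu_out & 0xFF)       # only the "FPU" result is architecturally visible


def run(program, gate_unused):
    state = (0, 1)
    for inst in program:
        state = step(state, inst, gate_unused)
    return state


def equivalent_up_to(n):
    """Compare gated and ungated models on every program of length <= n."""
    insts = [(op, v) for op in ("add", "fmul") for v in range(4)]
    return all(
        run(prog, False) == run(prog, True)
        for k in range(n + 1)
        for prog in product(insts, repeat=k)
    )


print(equivalent_up_to(3))
```

The check passes because the gated unit's output is never read by the instruction that gated it, which is the informal reason the annotations preserve functionality.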
ACL2 Theorem Prover • General-purpose first-order logic theorem prover • Breaks the theorem down into sub-goals • Many engines work on the sub-goals and either prove them or break them down further and add them to the central pool of goals to be proved • Success story in hardware • Verified floating-point division (FDIV) in AMD processors
Proof Methodology • The RTL is a shallow embedding in ACL2 • Convert Verilog RTL into ACL2RTL • We have created a large RTL library to recognize as well as analyze ACL2RTL • Slicing is done on the Verilog code • Both original and annotated Verilog are converted into ACL2 and we construct the functional equivalence proof in ACL2
Proof Structure • Create a library of functions to interpret the ACL2 model of the RTL • Functional equivalence theorem is built up block by block