An Energy Efficient Two-Phase Clocking Scheme
Brad Bridgeman, Yanqing Zhang
Outline
• Overview of Place and Route
  - Main steps
  - Introduction to our problem
• Two-Phase Clock Explanation
  - Explanation of two-phase clock
  - Benefits of two-phase clock
• Implementation
  - Step by step project implementation
• Conclusions
  - Rules in using the two-phase clock
• Summary
• Questions, Thoughts, Problems
Overview of Place and Route Steps
1. Import a synthesized netlist
   - We use a PIC design that was synthesized previously
2. Define chip core size
   - This is the region where the cells will be placed
3. Draw power rings and rails
   - Supply and ground rails are placed around the core
4. Place standard cells
   - The tool automatically places standard cells within the defined core
Overview of Place and Route Steps (continued)
5. First/trial route
   - The tool performs a trial, non-final route to estimate how routing will look
6. Clock tree generation/synthesis
   - The tool creates the clock tree and inserts clock buffers in the design
7. Timing closure
   - The tool inserts or deletes buffers to resolve hold and setup time violations
8. Nanoroute
   - With the netlist finalized, the tool performs the actual routing
Our Problem
• Clock buffering
  - 339 buffers are placed to drive the clock and generate an H-tree
• Hold time violation fixes
  - Buffers are inserted on logic paths whose delay is too small to meet the hold time constraint
The problem is that these buffers cost a lot of area and power while performing no logic function. We think we can do better than this.
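To make the hold time fix concrete, here is a minimal sketch of the standard hold check and of how many delay buffers a conventional fix would need. The function names and all numbers in the usage lines are hypothetical and are not taken from the PIC design; skew is taken here as capture clock arrival minus launch clock arrival.

```python
import math


def hold_slack_ps(clk_to_q_ps, min_logic_delay_ps, hold_time_ps, skew_ps):
    """Return hold slack in ps; a negative value is a hold violation.

    New data reaches the capturing register after clk_to_q + min logic delay.
    It must not arrive before the capturing register's hold window closes,
    which happens at skew + hold time after the launching clock edge.
    """
    return (clk_to_q_ps + min_logic_delay_ps) - (hold_time_ps + skew_ps)


def buffers_needed(slack_ps, buffer_delay_ps):
    """Return how many delay buffers are needed to make the slack non-negative."""
    if slack_ps >= 0:
        return 0
    return math.ceil(-slack_ps / buffer_delay_ps)


# Hypothetical shift-register path: no logic between the two registers.
slack = hold_slack_ps(clk_to_q_ps=100, min_logic_delay_ps=0,
                      hold_time_ps=50, skew_ps=200)
print(slack, buffers_needed(slack, buffer_delay_ps=60))
```

Shift-register paths are the typical offenders because their logic delay is close to zero, so any positive skew eats directly into the slack.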
An Intro to Our Clock Buffering Scheme
• We propose a two-phase clocking scheme. Our motivation is that it may reduce overall area and power in the design.
• The central idea of our approach is that the two-phase clock eliminates hold time violations.
An Intro to Our Clock Buffering Scheme
So how will this help reduce buffer area and power?
Benefits of Two-Phase Clock Scheme
First, let’s not forget what the buffers were for:
1. To fix hold time violations
(Figure: old clocking scheme vs. two-phase clocking scheme)
On a micro-architectural level: the two-phase clock removes the need for hold time buffers. Of course, the cost of the 2nd phase generator and the cost of adapting the registers to the two-phase clock must be taken into account. This is discussed later.
Benefits of Two-Phase Clock Scheme
First, let’s not forget what the buffers were for:
2. To drive the clock signal at the sinks (the clock inputs of the registers)
3. To balance paths in the H-tree
(Figure: old clocking scheme vs. two-phase clocking scheme)
On a macro-architectural level: we may be able to reduce the number of clock buffers because the clock load is reduced. We may also be able to remove some of the buffers at the deeper levels of the H-tree. Since the pulse generator can eliminate skew problems of up to 300 ps, we can allow the skew in paths to approach 300 ps, which reduces the buffer requirement. (These benefits are shown only conceptually.)
Implementation 1. Design of the 2nd phase clock generator (pulse generator)
Implementation
Estimated costs of the pulse generator:
- Area: 35 µm²
- Power: 0.06 µW
(Timing annotations from the waveform figure: 2.6 ns, 1.1 ns, 0.8 ns)
Implementation 2. Verification of hold time fix using designed pulse generator
Implementation Latches correctly on the next clock cycle
Implementation
3. We find the paths that violate hold time. These are shift register paths:
- R20_reg_0 -> PC_run_reg_10
- R27_reg_3 -> STACKLEVEL_reg_1
- R26_reg_7 -> PC_run_reg_9
- R23_reg_5 -> PC_run_reg_8
- R10_reg_2 -> STACKLEVEL_reg_0
- R19_reg_6 -> W_int_reg_6
- R19_reg_5 -> W_int_reg_5
- R19_reg_2 -> W_int_reg_2
- R19_reg_4 -> W_int_reg_4
- R19_reg_3 -> W_int_reg_3
- R21_reg_1 -> STATUS_int_reg_2
Implementation
4. We compare how the hold time violation is fixed by the tool and by our method (figure: tool fix vs. our fix).
Power saved by using our method: ∆P = -0.002 µW
The negative value means our method used more power than the tool’s fix in this example.
Implementation
We make the same comparison for every path and see which path(s) benefit from our clocking scheme:
- R20_reg_0 -> PC_run_reg_10
- R27_reg_3 -> STACKLEVEL_reg_1
- R26_reg_7 -> PC_run_reg_9
- R23_reg_5 -> PC_run_reg_8
- R10_reg_2 -> STACKLEVEL_reg_0
- R19_reg_6 -> W_int_reg_6
- R19_reg_5 -> W_int_reg_5
- R19_reg_2 -> W_int_reg_2
- R19_reg_4 -> W_int_reg_4
- R19_reg_3 -> W_int_reg_3
- R21_reg_1 -> STATUS_int_reg_2
Can’t win them all…
Conclusion 1
On a micro-architectural level:
Qualitatively:
- We use our clocking scheme only where its cost is lower than the cost of the buffering it replaces. The scheme becomes attractive as the imbalance in the H-tree, and therefore the skew, increases.
Quantitatively:
- In this case, the clocking scheme saves power when more than 3 buffers would be placed to fix a hold time violation.
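The break-even rule above can be sketched as a simple power comparison. The 0.06 µW pulse generator power is the estimate from the slides; the per-buffer power of 0.018 µW is a hypothetical value chosen only so that the break-even lands just above 3 buffers, and the cost of adapting the registers is ignored.

```python
PULSE_GEN_POWER_UW = 0.06  # estimated pulse generator power from the slides


def power_saved_uw(num_hold_buffers, buffer_power_uw,
                   pulse_gen_power_uw=PULSE_GEN_POWER_UW):
    """Power saved by swapping the hold buffers on one path for one pulse generator.

    A negative result means the pulse generator costs more than the buffers.
    """
    return num_hold_buffers * buffer_power_uw - pulse_gen_power_uw


def worth_replacing(num_hold_buffers, buffer_power_uw):
    """True if the pulse generator uses less power than the buffers it replaces."""
    return power_saved_uw(num_hold_buffers, buffer_power_uw) > 0


# Hypothetical per-buffer power; not a measured value.
for n in range(1, 6):
    print(n, worth_replacing(n, buffer_power_uw=0.018))
```

With these assumed numbers, paths needing 1 to 3 buffers are better left to the tool, while paths needing 4 or more favor the pulse generator.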
Implementation
5. We make the following analysis:
Our clocking scheme does not save much power because few paths are so imbalanced that they need a lot of buffering, and using a pulse generator to drive JUST ONE path is costly.
Around the area where a pulse generator was originally needed, we can remove some buffers near the end of the H-tree for other registers to CREATE skew, and have the pulse generator drive those registers as well.
However, how much can a pulse generator drive?
(Figure labels: Reg11, Reg12: path with less skew, which we make more skewed; Reg21, Reg22: path with more skew; Pulse Gen)
Implementation
6. We simulate to estimate how many registers one pulse generator can drive.
We find that it can drive 3.
Implementation On the brink of failing
Conclusion 2
On a macro-architectural level:
Qualitatively:
- We can take buffers out of other path(s) to create the greatest tolerable skew, and have the pulse generator drive those paths as well.
Quantitatively:
- In this case, the pulse generator can drive at most 2 other paths, and the maximum tolerable skew is 300 ps.
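As a rough illustration of "creating the greatest tolerable skew", the sketch below estimates how many clock-tree buffers could be removed from a path before its skew exceeds the 300 ps limit from the slides. It assumes, as a simplification, that removing one buffer adds exactly one buffer delay of skew; the function name and the example numbers are hypothetical.

```python
MAX_TOLERABLE_SKEW_PS = 300  # maximum skew the pulse generator tolerates (slides)


def removable_buffers(current_skew_ps, buffer_delay_ps,
                      max_skew_ps=MAX_TOLERABLE_SKEW_PS):
    """Return how many buffers can be removed while keeping skew <= max_skew_ps.

    Simplifying assumption: each removed buffer adds one buffer delay of skew.
    """
    headroom_ps = max_skew_ps - current_skew_ps
    return max(0, headroom_ps // buffer_delay_ps)


# Hypothetical path with 60 ps of skew and 80 ps clock buffers.
print(removable_buffers(current_skew_ps=60, buffer_delay_ps=80))
```

In a real flow the skew after each removal would come from re-running timing analysis rather than from a fixed per-buffer delay.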
Summary
Steps to improve buffering conditions:
1. Search for the path(s) that violate the hold time constraint
2. Replace excessive register-to-register buffering with a 2nd phase clock pulse generator driving the downstream register
3. On the same branch but a different path in the H-tree, remove buffers driving the upstream register in that path until the path reaches the maximum tolerable skew
4. Have the pulse generator drive the downstream register in that path
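The steps above can be sketched as a small planning routine. This is only an illustration under stated assumptions: the `Path` fields and the branch labels are hypothetical, a path qualifies as a seed when it needs more than 3 hold buffers, a neighbor qualifies when it sits on the same branch with skew still below 300 ps, and each pulse generator drives at most 3 registers in total.

```python
from dataclasses import dataclass

MAX_REGISTERS_PER_PULSE_GEN = 3  # simulation result from the slides
MAX_TOLERABLE_SKEW_PS = 300      # maximum tolerable skew from the slides
BREAK_EVEN_BUFFERS = 3           # scheme saves power above this count (slides)


@dataclass
class Path:
    name: str          # hypothetical path label
    branch: str        # hypothetical H-tree branch label
    hold_buffers: int  # buffers the tool inserted to fix hold on this path
    skew_ps: int       # current skew on this path


def plan_pulse_generators(paths):
    """Return {seed path name: [names of all paths its pulse generator drives]}."""
    # Steps 1-2: seeds are paths whose hold buffering exceeds the break-even count.
    seeds = [p for p in paths if p.hold_buffers > BREAK_EVEN_BUFFERS]
    assigned = {p.name for p in seeds}
    plan = {}
    for seed in seeds:
        group = [seed.name]
        # Steps 3-4: add same-branch paths that still have skew headroom.
        for p in paths:
            if len(group) >= MAX_REGISTERS_PER_PULSE_GEN:
                break
            if p.name in assigned or p.branch != seed.branch:
                continue
            if p.skew_ps < MAX_TOLERABLE_SKEW_PS:
                group.append(p.name)
                assigned.add(p.name)
        plan[seed.name] = group
    return plan


# Hypothetical data, not taken from the PIC design.
example = [
    Path("path_a", "b0", 5, 250),
    Path("path_b", "b0", 0, 50),
    Path("path_c", "b0", 1, 100),
    Path("path_d", "b0", 0, 20),
    Path("path_e", "b1", 0, 10),
]
print(plan_pulse_generators(example))
```

The greedy grouping takes neighbors in input order; choosing which neighbors give the largest buffer savings would need per-path timing data that the slides do not provide.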
Questions, Thoughts, Problems
1. We didn’t have the sub-vt models for our technology
   - This project was meant for sub-vt, but our models broke down at Vdd = 450 mV
2. Better pulse generator?
   - Our pulse generator costs a lot of power and area, and it is not a good generator in sub-vt
3. Simulation conditions
   - We didn’t simulate within the whole design, because it was hard to figure out the inputs needed