Mapping for Better Than Worst-Case Delays in LUT-Based FPGA Designs. Kirill Minkovich and Jason Cong, VLSI CAD Lab, Computer Science Department, University of California, Los Angeles. Supported by the National Science Foundation under grant CCF-0530261.
Variation and its effects
[Figure: Intra-die variations in ILD thickness]
• Environmental variation
  • Causes: overheating and voltage fluctuations
  • Addressed (in part) by: cooling and better power supplies
• Process variation
  • Causes: dopant density, edge geometry, stress during manufacturing, and much more
  • Addressed (in part) by: adding a slack of as much as 3-sigma for delay variation
• Data variation
  • Causes: output stabilization varying greatly between different data
  • Addressed by: highly restrictive asynchronous designs and the Razor architecture
• Solutions
  • Speed binning and more accurate estimates
    • Only deals with process variation
  • Variable clocking (Razor architecture)
    • Deals with all 3 variations!
High performance circuits
• Worst-case delay minimization
  • Hitting a wall due to feature size limits
  • Can’t keep up with Moore’s law
  • Conservative timing due to variation
• Typical-case delay minimization
  • Defined: delay for expected data to propagate through the circuit
  • Usually much smaller than worst-case delay
  • Harder to optimize circuits
    • Change thinking about circuit optimization
    • Requires special architecture, like the Razor architecture (MICRO ’03)
UCLA VLSICAD LAB
Razor flip-flop implementation
[Figure: Razor flip-flop between logic stages L1 and L2, showing the main flip-flop (D, Q, clk), the shadow latch (clk_del), the comparator, and the Error and Error_L signals]
• Main flip-flop
  • Clocked faster than worst-case delay
• Shadow latch
  • Clocked with a delayed clock to catch any errors
• Error
  • Occurs when the main flip-flop and shadow latch differ
  • In the next clock cycle, the shadow latch value moves into the main flip-flop
Slide borrowed from the Razor (MICRO ’03) presentation
Razor timing error detection
[Figure: pipeline example with main flip-flops, a shadow latch, and a MEM stage, clocked by clk and clk_del]
• Second sample of logic value used to validate earlier sample
• Key design issues:
  • Maintaining forward progress
  • Short path impact on shadow latch
  • Overhead of error detection and correction
Slide borrowed from the Razor (MICRO ’03) presentation
FSM to Razor transformation
[Figure: an FSM (combinational logic with state, input, and output registers) and its Razor black-box counterpart (combinational logic, stallable buffer, Razor state, input, output, and stabilization registers, with Data, Data Ready, and Data Valid signals)]
• Possible to convert most circuits to Razor
Problem formulation
• Definitions
  • Maximum depth
    • Shadow latch (worst-case delay)
  • Target depth
    • Main (overclocked) flip-flop
• Performance measuring
  • Can’t use the clock due to errors!
    • Errors are due to overclocking (any switching between target depth and max depth)
  • So we have to use expected delay instead of delay
    • ExpDelay(d) = d × Pr(no error | clock d) + (d + t_recover) × Pr(error | clock d), where d is the target depth and t_recover is the time needed to recover from an error
• Find d
  • Linear search!
  • BestExpDelay = min(ExpDelay(d) | max_depth/2 ≤ d < max_depth)
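The expected-delay formula and the linear search over target depths can be sketched in a few lines of Python. This is only an illustration: the function names, the `p_error_at` callback, and the error profile in the usage lines are hypothetical, and depths are assumed to be integers measured in the same units as the recovery time.

```python
from typing import Callable, Tuple


def expected_delay(d: float, p_error: float, t_recover: float) -> float:
    """ExpDelay(d) = d * Pr(no error | d) + (d + t_recover) * Pr(error | d)."""
    return d * (1.0 - p_error) + (d + t_recover) * p_error


def best_expected_delay(
    max_depth: int,
    p_error_at: Callable[[int], float],  # hypothetical: Pr(error | clock d)
    t_recover: float,
) -> Tuple[int, float]:
    """Linear search over integer depths with max_depth/2 <= d < max_depth."""
    first = (max_depth + 1) // 2  # smallest integer d with d >= max_depth / 2
    candidates = range(first, max_depth)
    if not candidates:
        raise ValueError("max_depth must be at least 2")
    scored = ((d, expected_delay(d, p_error_at(d), t_recover)) for d in candidates)
    return min(scored, key=lambda pair: pair[1])


# Made-up error profile for a circuit with max_depth = 10.
error_profile = {5: 0.5, 6: 0.1, 7: 0.02, 8: 0.0, 9: 0.0}
best_d, best_delay = best_expected_delay(10, error_profile.__getitem__, t_recover=6.0)
```

With this made-up profile the search settles on a depth of 6: clocking at 5 saves one level of delay but pays the recovery penalty half of the time.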
Optimization goals
[Figure: switching activity versus clock]
• The expected delay
  • Total delay for data to propagate and to recover from any errors
• Reduce probability of an error
  • Straightforward if we are given a target depth
    • Minimize probability of switching after target depth
  • What can we do without the target depth?
    • We try to get the switching to occur as early as possible
    • Extra area overhead
    • Hard to compare solutions (special cost function is needed)
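One way to make "probability of switching after target depth" concrete is to estimate it from simulation. The sketch below assumes that, for every simulated input vector, the depth of the last output transition is known, and that switching exactly at the target depth is still on time. Both the data layout and the function name are assumptions made for illustration.

```python
from typing import Sequence


def error_probability(last_switch_depths: Sequence[int], target_depth: int) -> float:
    """Fraction of simulated vectors whose last transition is after target_depth."""
    if not last_switch_depths:
        raise ValueError("need at least one simulated vector")
    late = sum(1 for depth in last_switch_depths if depth > target_depth)
    return late / len(last_switch_depths)


# Hypothetical last-transition depths for eight simulated vectors.
samples = [2, 3, 3, 4, 6, 2, 3, 5]
p_at_4 = error_probability(samples, 4)  # two of eight vectors switch after depth 4
p_at_6 = error_probability(samples, 6)  # nothing switches after depth 6
```

Moving switching to earlier depths lowers this estimate for every candidate target depth at once, which is why it is a usable goal even before a target depth is chosen.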
BTWMap algorithm overview
• Decompose into 2-input gates
• Cut selection (repeated 400 times)
  • Simulate 256 random input values over all cuts
  • Assign cost based on switching and depth
  • Choose cuts to minimize cost
  • Save the scaled simulation data for next iteration
• Area recovery
  • Target clock assignment
  • Area/performance tradeoff
• Done!
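The random-simulation step can be illustrated on a tiny network of 2-input gates. The sketch below is a simplification: it measures switching per gate rather than per cut, and the netlist format, the operation names, and the function names are hypothetical. Gates are assumed to be listed in topological order.

```python
import random
from typing import Dict, List, Tuple

Gate = Tuple[str, str, str]  # (operation, first input, second input)

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def evaluate(gates: Dict[str, Gate], inputs: Dict[str, int]) -> Dict[str, int]:
    """Evaluate every gate; relies on the dict listing gates in topological order."""
    values = dict(inputs)
    for name, (op, a, b) in gates.items():
        values[name] = OPS[op](values[a], values[b])
    return values


def switching_activity(
    primary_inputs: List[str],
    gates: Dict[str, Gate],
    n_vectors: int = 256,
    seed: int = 0,
) -> Dict[str, float]:
    """Fraction of consecutive random vector pairs on which each gate toggles."""
    if n_vectors < 2:
        raise ValueError("need at least two vectors to observe switching")
    rng = random.Random(seed)

    def random_vector() -> Dict[str, int]:
        return {pi: rng.getrandbits(1) for pi in primary_inputs}

    previous = evaluate(gates, random_vector())
    toggles = {name: 0 for name in gates}
    for _ in range(n_vectors - 1):
        current = evaluate(gates, random_vector())
        for name in gates:
            if current[name] != previous[name]:
                toggles[name] += 1
        previous = current
    return {name: count / (n_vectors - 1) for name, count in toggles.items()}


# g2 XORs a signal with itself, so it is constant 0 and never switches.
gates = {"g1": ("AND", "a", "b"), "g2": ("XOR", "g1", "g1")}
activity = switching_activity(["a", "b"], gates)
```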
Cut selection
• Cut ranking
  • Can’t look at just switching activity for each depth
  • For example, cut 2 is better than cut 1
  [Figure: comparison of cut 1 and cut 2]
• Expired simulation data
  • Keep the old data
  • Assume the previous iteration’s costs are still valid, but scale them down
  • Allows the algorithm to converge on a solution
  • Keeping the old data decreases Pr(error) by an average of 3.5%
    • Huge improvement, since for us Pr(error) <= 5%
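Keeping scaled-down costs from the previous iteration could look like the sketch below. The slides do not give the scale factor or the exact way old and new costs are combined, so the additive blend, the default `scale=0.5`, and the function name are all assumptions.

```python
from typing import Dict


def blend_costs(
    new_costs: Dict[str, float],
    old_costs: Dict[str, float],
    scale: float = 0.5,  # assumed value; not taken from the slides
) -> Dict[str, float]:
    """Add the fresh costs to the previous iteration's costs, scaled down."""
    merged = {cut: scale * cost for cut, cost in old_costs.items()}
    for cut, cost in new_costs.items():
        merged[cut] = merged.get(cut, 0.0) + cost
    return merged


# Cut "c2" was not re-simulated this round, so only its scaled old cost remains.
blended = blend_costs({"c1": 2.0}, {"c1": 4.0, "c2": 1.0})
```

Because old contributions shrink geometrically under repeated blending, recent simulation data dominates while earlier data still damps iteration-to-iteration noise, which matches the convergence argument above.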
BTWMap - area recovery (target depth)
[Figure: example network annotated with target depths from 3 at the outputs down to 0 at the inputs]
• Idea
  • Find a target depth
    • How much to overclock
  • Ignore the switching that happens below this depth
• Implementation
  • Set outputs’ target depth
  • Select cuts PO->PI while propagating the target depth
  • Works similarly to worst-case depth, but calculated PO->PI using MIN instead of PI->PO using MAX
• Benefits
  • Moderate reduction in area
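The PO->PI propagation with MIN can be sketched as follows, in the same spirit as a required-time computation. The data layout (a map from each LUT to the leaves of its selected cut, plus a PI->PO topological order) and the function name are assumptions, and each LUT is assumed to contribute one level of depth.

```python
from typing import Dict, List


def propagate_target_depth(
    cut_leaves: Dict[str, List[str]],  # LUT -> leaves of its selected cut
    topo_order: List[str],             # all nodes, primary inputs first
    outputs: List[str],
    target_depth: int,
) -> Dict[str, int]:
    """Assign each node the MIN over its fanouts of (fanout target - 1)."""
    target = {po: target_depth for po in outputs}
    for node in reversed(topo_order):  # visit outputs first
        if node not in target:
            continue  # node does not feed any output
        for leaf in cut_leaves.get(node, []):
            candidate = target[node] - 1
            target[leaf] = min(target.get(leaf, candidate), candidate)
    return target


# n1 feeds both n2 and out, so it takes the tighter (smaller) target.
cut_leaves = {"n1": ["a", "b"], "n2": ["n1", "c"], "out": ["n2", "n1"]}
order = ["a", "b", "c", "n1", "n2", "out"]
targets = propagate_target_depth(cut_leaves, order, ["out"], 3)
```

Visiting nodes in reverse topological order guarantees that every fanout of a node has been processed before the node itself pushes a target to its own leaves.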
BTWMap – area-performance tradeoff
• Idea
  • Relax the minimum switching cost of each gate
  • Give area recovery techniques room to work
• Implementation
  • Set outputs to the initial amount they can be relaxed
  • Make a relaxation and propagate what your inputs can change using:
    • Depth of the inputs
    • How much switching slack is left
    • Input-to-output switching correlation
      • For example, Pr(y switching | x1 switched) = 75% while Pr(y switching | x2 switched) = 50%
• Benefits
  • Accurate relaxation estimates
  • Large reduction in area
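The input-to-output switching correlation can be estimated directly from simulation traces. In the sketch below each trace records, per simulated vector, whether the signal switched; the trace values are made up to reproduce the 75% and 50% figures from the example, and the function name is hypothetical.

```python
from typing import Sequence


def conditional_switching(x_switched: Sequence[int], y_switched: Sequence[int]) -> float:
    """Estimate Pr(y switching | x switched) from paired 0/1 traces."""
    if len(x_switched) != len(y_switched):
        raise ValueError("traces must cover the same simulated vectors")
    x_count = sum(1 for x in x_switched if x)
    if x_count == 0:
        return 0.0  # x never switched, so there is nothing to condition on
    both = sum(1 for x, y in zip(x_switched, y_switched) if x and y)
    return both / x_count


# Made-up traces over eight simulated vectors (1 = the signal switched).
y = [1, 1, 1, 0, 0, 0, 1, 0]
x1 = [1, 1, 1, 1, 0, 0, 0, 0]
x2 = [1, 1, 0, 0, 1, 1, 0, 0]
p_y_given_x1 = conditional_switching(x1, y)
p_y_given_x2 = conditional_switching(x2, y)
```

Under this reading, an input such as x1 whose switching is strongly tied to the output's switching deserves a larger share of the switching slack than a weakly correlated input such as x2.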
BTWMap results example
• BTWMap mapping comparison
  • Test circuit is PDC from the MCNC benchmark suite
  • Comparing 4 methods
    • A. Depth-optimal mapping with depth relaxation on non-critical paths for area saving
    • B. Depth-optimal mapping without depth relaxation
    • C. BTWMap
    • D. BTWMap with area recovery
What circuits can’t be optimized
• Maximum Razor clock = (max depth)/2
• Already good = switching < 2% at maximum Razor clock
  • Very low switching at maximum Razor clock
  • 4 of the MCNC suite
• Too bad = switching > 90% at max depth
  • All the switching happens at the very last depth, which is very hard to optimize: the switching activity at that depth would have to be reduced by at least 20x
  • 5 of the MCNC suite
• Easy to test and exclude
  • Map using ABC and check the switching probabilities
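The exclusion test could be scripted roughly as below. The interpretation is an assumption: "switching at maximum Razor clock" is read as the fraction of simulated vectors that still switch after depth (max depth)/2, and "switching at max depth" as the fraction whose last transition is at the final depth. The input format and the function name are hypothetical; in practice the depths would come from simulating a baseline mapping.

```python
from typing import Sequence


def classify_circuit(last_switch_depths: Sequence[int], max_depth: int) -> str:
    """Label a circuit 'already good', 'too bad', or 'candidate'."""
    n = len(last_switch_depths)
    if n == 0:
        raise ValueError("need at least one simulated vector")
    # Fraction of vectors still switching after the fastest allowed Razor clock.
    late_at_fastest = sum(1 for d in last_switch_depths if d > max_depth / 2) / n
    # Fraction of vectors whose last transition is at the very last depth.
    at_max_depth = sum(1 for d in last_switch_depths if d == max_depth) / n
    if late_at_fastest < 0.02:
        return "already good"
    if at_max_depth > 0.90:
        return "too bad"
    return "candidate"


label_good = classify_circuit([1] * 99 + [8], max_depth=8)
label_bad = classify_circuit([8] * 95 + [2] * 5, max_depth=8)
label_ok = classify_circuit([2, 5, 6, 8] * 25, max_depth=8)
```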
Sample results
• The example below is for the MCNC benchmark SEQ
[Figure: results for ABC, BTWMap, and BTWMap+area]
Results – expected delay and area
• Performance improvement: 13% with BTWMap and 11% after area recovery
• The area recovery saves over 16% of the lost area
• In the best case (ignoring switching), we’re still 3% away from ABC
• Trading 7% for much better switching activity
Conclusion
• BTWMap work includes:
  • Methodologies for measuring performance on circuits optimized for average-case delay
  • Algorithms for optimizing circuits for average-case delay
  • Implementation and release of these tools (alpha version): http://cadlab.cs.ucla.edu/software_release/btwmap/
• Results summary
  • BTWMap (and the area recovery version)
    • 14% (and 8%) average delay reduction
    • 13% (and 11%) pipeline improvement
    • 26% (and 10%) area increase
Thanks!