Continuing Challenges in Static Timing Analysis
Tom Spyrou, TAU 2013, 3/2013
Goal of this talk
• Higher level than the latest trends
• Remind ourselves of the trade-offs we have made as an industry to have a workable solution for STA
  – Signoff
  – Embedded in design synthesis and optimization
• Plenty of discussion on new effects; let's discuss core STA
• Explain the basis of industrial algorithms to the academic community
• Challenge ourselves to look at the issues again
  – Technology trends
  – Design
  – Compute
Why Static Timing Analysis
• Dynamic simulation is impossible for even a small chip
  – Assume combinational logic only
  – 100 inputs implies 2^100 vectors needed to verify timing, which is about 10^30 vectors
  – If a simulator could process 10^6 vectors per second, this works out to a simulation time of about 10^19 days, or roughly 10^16 years (a quick sanity check of this arithmetic appears below)
  – Talk about a verification bottleneck!
  – Now add in state elements and the problem of making sure the critical path is actually in the vector set
• STA can analyze such a design in 1 minute
  – There are some issues, but they can be mitigated
  – STA's quality of result is not dependent on the quality of the vector set
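A quick sanity check of that arithmetic, using the slide's assumed figures (100 primary inputs, 10^6 vectors per second); the script is purely illustrative:

```python
# Back-of-the-envelope check of the exhaustive-simulation estimate above.
# Assumes the slide's figures: 100 inputs, 1e6 vectors/second.
vectors = 2 ** 100                 # exhaustive input patterns, ~1.27e30
seconds = vectors / 1e6            # at 1e6 vectors per second
days    = seconds / 86_400
years   = days / 365.25

print(f"{vectors:.2e} vectors")    # ~1.27e+30
print(f"{days:.2e} days")          # ~1.47e+19 days
print(f"{years:.2e} years")        # ~4.02e+16 years
```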
What is the trade-off? What are the core issues? These have been unchanged for a long time
• A different kind of setup
  – The result is dependent on the quality of constraints and exceptions
  – If all storage elements are clocked and I/Os are constrained, the analysis is generally safe
• Less accurate delay analysis
  – The exact path is not really known, as it is with event-driven simulation
  – When STA was first introduced this was less of an issue; path-based analysis (PBA) is now essential
• Introduction of false paths, due to topological rather than functional analysis
  – Users have to specify these manually
• Multiple circuit modes take extra effort
  – Not just more vectors
• Loops and level-sensitive latches add complexity
Analysis
• Every circuit looks the same to STA, since it ignores the function of the logic.
Topological analysis
• Simplifies the problem, at the cost of possibly reporting false paths (a small example follows below)
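As a toy illustration (a hypothetical two-mux circuit, not from the talk) of why a purely structural search can report a path that can never be functionally exercised:

```python
# Hypothetical circuit: two muxes sharing one select signal 's'.
# Topologically there is a path a -> m1 -> m2, but sensitizing it would
# require s=1 at m1 (to pass 'a') and s=0 at m2 (to pass m1) at once.
from itertools import product

def m1(a, b, s): return a if s else b
def m2(m1_out, b, s): return b if s else m1_out

# Exhaustive check: does toggling 'a' ever change the output of m2?
sensitizable = any(
    m2(m1(0, b, s), b, s) != m2(m1(1, b, s), b, s)
    for b, s in product((0, 1), repeat=2)
)
print(sensitizable)  # False: a -> m1 -> m2 is a false path, yet STA reports it
```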
What do recent trends mean?
• Design
  – Hyper-optimization means accuracy is critical
  – When a chip is designed in a bleeding-edge technology it will be pushed on all dimensions of power, performance, and area
  – Simulation-based delay calculation
  – Path-based analysis
  – Design size means memory use is the #1 problem
  – The largest chips are approaching 1 TB of RAM needed for flat runs
  – Hierarchical / parallel solutions must prioritize memory use on compute nodes
  – Runtime also needs to be faster, but the first step is to run on machines with reasonable cost
  – A recent design uses 750+ GB of RAM for single-mode/corner STA
• Compute
  – CPU is cheap, data movement is expensive
  – Whenever you hear that it's an expensive calculation, don't avoid it
  – Parallel computing must not only improve performance but also accuracy and features
  – Don't just make the same problem go faster or just divide the data
If you ask a designer what doesn't work well
• Hierarchical timing in the final verification loop
• SI calculations are very conservative
• SDCs are large and hard to verify
• Worst-case timing is done, and process variation is modeled very pessimistically
• Block-based analysis loses too much accuracy
• True-delay reporting (looking at the combinational logic to prove a path true) is slow and can't run during optimization
• Libraries limit the flexibility of analysis
STA: Industry and Academia
• STA technology has been innovated inside industry much more than in academia
  – The key approaches are not documented
  – There is no open-source reference to build from
  – Industry protects the core concepts as trade secrets
• Academia rarely publishes on STA beyond single-clock designs or delay calculation
• We need a book on the core search algorithm
Example: Veritime, from the '90s
• An STA engine that required vectors for the clock
  – Dynamic simulation of the clock
  – Period, multi-cycle paths, and clock-to-clock false paths automatically determined
  – STA for the data portion
• Absorbed by Cadence and forgotten, since at the time SDCs were a lot easier to hand-inspect
Requirements of an STA Engine
• I would like to begin by documenting the basics that everyone in industry knows. There are no company-specific trade secrets here.
• Must run in memory and runtime linear in circuit size, number of clocks, exceptions, and number of storage elements
  – Touch each vertex only once (maybe twice to simplify pre-processing), not once per clock or exception
• Must support SDC timing constraints
  – Clocks, clock-tree assumptions, multi-cycle paths, false paths, path delays, cases, and modes
• Must be nearly SPICE-accurate in delays and support path-based analysis
• Must be incremental enough
  – Netlist changes / full retrace on one extreme
  – Query-based incremental with limited tracing on the other
The Basic Search
• The graph
  – Startpoints are the inputs of the circuit and the clock inputs of storage elements
  – Endpoints are the outputs of the circuit and the data inputs of storage elements
• Propagate the clocks
  – For each clock, BFS from its source to every clock pin it reaches
  – Offset startpoint arrival times and endpoint required times with information from the clock propagation and cycle accounting
• Propagate the data (a minimal sketch follows below)
  – Use a BFS from startpoints to endpoints
  – Use multiple timing totals at every pin to take into account multiple clocks and exceptions
  – Back pointers can optionally be stored to record the K critical paths, but this time/memory is wasted in optimization programs and should be left to a reporting phase
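A minimal sketch of the data-propagation step, under simplifying assumptions (single clock domain, max-delay arrivals only, startpoints with no fanins, a made-up dict-based graph); it is meant only to show the "visit each vertex once" levelized BFS, not any vendor's engine:

```python
from collections import defaultdict, deque

def propagate_arrivals(fanout, delay, startpoints):
    """Levelized BFS: each vertex is processed once, after all of its fanins.

    fanout:      dict pin -> list of downstream pins
    delay:       dict (from_pin, to_pin) -> arc delay
    startpoints: dict pin -> clock-adjusted launch arrival time
    """
    indegree = defaultdict(int)
    for u, outs in fanout.items():
        for v in outs:
            indegree[v] += 1

    arrival = dict(startpoints)          # worst (max) arrival per pin
    ready = deque(startpoints)           # pins whose fanins are all processed
    while ready:
        u = ready.popleft()
        for v in fanout.get(u, ()):
            candidate = arrival[u] + delay[(u, v)]
            if candidate > arrival.get(v, float("-inf")):
                arrival[v] = candidate   # keep the worst-case total
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)          # all fanins seen: visit exactly once
    return arrival
```

In a real engine `arrival` would hold a set of timing totals per pin rather than a single number, which is where the next two slides pick up.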
Multiple Timing Totals with Partial Path Data
• A simplistic implementation is that each clock and each exception gets its own total (sketched below)
  – Simultaneously, or via separate traces
  – Memory and/or runtime increase quickly
• Occurrence pins are the most common netlist object
• There can be thousands of exceptions
• At timing endpoints, like totals can be combined and evaluated
• At timing endpoints, point-to-point exceptions can be evaluated
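One way to picture the cost (an illustrative data model, not any tool's): each pin holds arrival totals keyed by a tag combining the launching clock and exception state, and fanin totals are merged tag by tag:

```python
# Illustrative only: per-pin arrival totals keyed by (clock, exception_tag).
def merge_fanin(dest_totals, fanin_totals, arc_delay):
    """Merge one fanin's totals into a pin, keeping the worst arrival per tag."""
    for tag, arrival in fanin_totals.items():
        candidate = arrival + arc_delay
        if candidate > dest_totals.get(tag, float("-inf")):
            dest_totals[tag] = candidate
    return dest_totals

# With thousands of exceptions the number of distinct tags (and so totals)
# at heavily shared pins grows, which is why memory rather than CPU dominates.
pin = {}
merge_fanin(pin, {("clk1", None): 1.2, ("clk1", "mcp_42"): 1.2}, 0.3)
merge_fanin(pin, {("clk2", None): 0.9}, 0.4)
print(pin)  # {('clk1', None): 1.5, ('clk1', 'mcp_42'): 1.5, ('clk2', None): 1.3}
```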
Multiple Timing Totals with Path-Completion Data
• A BFS by itself has no information about paths
• However, timing exceptions are specified in terms of from, through, and to points, with a boolean expression of pins
  – Mcp -from a -through {b c} -to d
  – From a, through b or c, and then through d
• Each total can carry a small state machine recording which exception points it has seen (see the sketch below)
• At timing endpoints, like totals with like exception-point data can be combined, or, if the exception marks the path false, not combined
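A minimal sketch of that state-machine idea for the exception above; the pin names are the slide's `a`, `b`, `c`, `d`, and everything else (representation, function names) is illustrative:

```python
# Track, per timing total, which exception anchor points the partial path has
# already satisfied. Each set is an OR of pins: {'b', 'c'} means "b or c".
EXCEPTION = [{"a"}, {"b", "c"}, {"d"}]   # -from a -through {b c} -to d

def advance(state, pin):
    """Advance the exception state machine when the BFS reaches `pin`.

    state: index of the next anchor set still to be matched.
    Returns the new state; len(EXCEPTION) means the exception fully matched.
    """
    if state < len(EXCEPTION) and pin in EXCEPTION[state]:
        return state + 1
    return state

# Example walk along the partial path a -> x -> b -> d:
state = 0
for pin in ["a", "x", "b", "d"]:
    state = advance(state, pin)
print(state == len(EXCEPTION))  # True: this total matched the exception

# At endpoints, totals with identical exception state can be combined;
# totals that completed a false-path exception are simply discarded.
```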
Framework can be used for Clock Pessimism Removal
(figure: a clock tree with stage delays d1, d2, d3; some arrivals are tagged d1,d2 ("Arr 1") and others d1,d2,d3 ("Arr 2"), showing how each timing total can record the clock-tree stages it passed through)
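The same tagging enables common path pessimism removal: clock-tree stages shared by the launch and capture paths cannot simultaneously sit at their late and early extremes, so their difference can be credited back. A toy calculation under assumed late/early stage delays (values are made up):

```python
# Illustrative CPPR credit: stages common to launch and capture clock paths
# are credited by their late/early delay difference.
launch_path  = ["d1", "d2", "d3"]              # stages seen by the launch total
capture_path = ["d1", "d2"]                    # stages seen by the capture total
late  = {"d1": 0.30, "d2": 0.25, "d3": 0.20}   # hypothetical late delays (ns)
early = {"d1": 0.27, "d2": 0.22, "d3": 0.18}   # hypothetical early delays (ns)

common = [a for a, b in zip(launch_path, capture_path) if a == b]
cppr_credit = sum(late[s] - early[s] for s in common)
print(f"common stages: {common}, pessimism credit = {cppr_credit:.2f} ns")
# common stages: ['d1', 'd2'], pessimism credit = 0.06 ns
```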
Delay Calculation and Multiple Timing Totals
• Worst-case slew merging is pessimistic, but it allows delay calculation to be a pre-processing step
• If delay calculation is done inside the BFS, critical slew merging can be done instead (see the sketch below)
• It is also possible for each timing total to carry its own slew, to improve accuracy
• Loops can be auto-detected and dynamically broken, avoiding accidental breaks of the critical path
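A toy illustration of the trade-off, using a made-up linear slew-to-delay model: pre-computing every arc's delay with the worst merged slew is safe but pessimistic compared to computing delays per timing total inside the BFS:

```python
# Hypothetical stage whose delay grows with input slew (linear model purely
# for illustration): delay = d0 + k * slew_in.
def stage_delay(slew_in, d0=0.10, k=0.5):
    return d0 + k * slew_in

# Two fanin totals arrive at a merge pin with different slews.
slews = {"total_A": 0.02, "total_B": 0.08}

worst_slew  = max(slews.values())
pessimistic = {t: stage_delay(worst_slew) for t in slews}    # DC as a pre-process
per_total   = {t: stage_delay(s) for t, s in slews.items()}  # DC inside the BFS

print(pessimistic)  # both totals pay for the 0.08 slew: {'total_A': 0.14, 'total_B': 0.14}
print(per_total)    # total_A keeps its own, smaller delay: {'total_A': 0.11, 'total_B': 0.14}
```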
Incremental Timing
• Netlist edits, full retime
• Netlist edits, fanout-cone retime (sketched below)
• Netlist edits, query-based retime
• The choice of how incremental to go depends on the optimization approach
  – More global cost functions require less incrementality
  – More locally greedy approaches require more
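A sketch of the middle option, fanout-cone retiming: after a netlist edit, only pins downstream of the change are invalidated and re-propagated (illustrative only; a real engine must also invalidate required times, SI windows, and pessimism-removal data):

```python
from collections import deque

def invalidate_fanout_cone(fanout, changed_pin):
    """Mark every pin downstream of `changed_pin` as needing a new arrival."""
    dirty, work = set(), deque([changed_pin])
    while work:
        u = work.popleft()
        if u in dirty:
            continue
        dirty.add(u)
        work.extend(fanout.get(u, ()))
    return dirty   # only these pins are re-timed; the rest keep cached totals
```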
STA needs innovation
• Increased sharing with academia
• Increased research on the problems that are still problems
• Redirect solutions in light of the design and compute trends
• There is a lot of interesting work to do!
Some ideas
• A new constraint language that is more functional
  – Try to propagate the function along with the delays
  – Some combination with cycle-based simulation
  – Constraint-language enhancements
• Library-less delay models
• A new data model that is stage-based
  – Focus on data locality
• A hierarchical timing model that is truly context-independent, within acceptable limitations
  – Constraint improvements to help constrain blocks more accurately