Continuing Challenges in Static Timing Analysis
Tom Spyrou
TAU 2013, 3/2013
Goal of this talk
• Higher level than the latest trends
• Remind ourselves of the trade-offs we have made as an industry to have a workable solution for STA
  • Signoff
  • Embedded in design synthesis and optimization
• There is plenty of discussion of new effects; let's discuss core STA
• Explain the basis of industrial algorithms to the academic community
• Challenge ourselves to look at the issues again
  • Technology trends
  • Design
  • Compute
Why Static Timing Analysis?
• Dynamic simulation is impossible for even a small chip
  • Assume combinational logic only
  • 100 inputs implies 2^100 vectors are needed to verify timing, which is about 10^30 vectors
  • If a simulator could process 10^6 vectors per second, this works out to a simulation time of about 10^19 days, or roughly 10^16 years (the arithmetic is worked out below)
  • Talk about a verification bottleneck!
  • Now add in state elements and the problem of making sure the critical path is actually in the vector set
• STA can analyze such a design in 1 minute
  • There are some issues, but they can be mitigated
  • STA's quality of result is not dependent on the quality of the vector set
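For reference, the arithmetic behind those figures works out as follows (assuming 86,400 seconds per day):

```latex
2^{100} \approx 1.3\times10^{30}\ \text{vectors},\qquad
\frac{1.3\times10^{30}\ \text{vectors}}{10^{6}\ \text{vectors/s}}
  \approx 1.3\times10^{24}\ \text{s}
  \approx 1.5\times10^{19}\ \text{days}
  \approx 4\times10^{16}\ \text{years}
```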
What is the trade-off / what are the core issues? These have been unchanged for a long time
• A different kind of setup
  • The result is dependent on the quality of constraints and exceptions
  • If all storage elements are clocked and I/Os are constrained, it is generally safe
• Less accurate delay analysis
  • The exact path is not really known, as it is with event-driven simulation
  • When STA was first introduced this was less of an issue; PBA is now essential
• Introduction of false paths, due to topological rather than functional analysis
  • Users have to specify these manually
• Multiple circuit modes take extra effort
  • Not just more vectors
• Loops and level-sensitive latches add complexity
Analysis
• Every circuit looks the same to STA, since it ignores the function of the logic.
Topological analysis
• Simplifies the problem, but introduces the possibility of reporting false paths (a small illustration follows).
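As a small illustration (not from the talk; all signal names are invented), the classic reconvergent-mux case shows why a purely topological search can report a path that no input vector can sensitize:

```python
# A minimal sketch: two 2:1 muxes share one select signal.  Topological
# longest-path analysis (what STA does) finds a path that can never be
# sensitized, because it would need sel == 1 at mux1 and sel == 0 at mux2.
from collections import defaultdict
from functools import lru_cache

# Directed timing arcs (driver, sink, delay); names are purely illustrative.
edges = [
    ("a_slow", "mux1", 5.0),  # mux1 passes a_slow when sel == 1
    ("a_fast", "mux1", 1.0),  # mux1 passes a_fast when sel == 0
    ("mux1",   "mux2", 1.0),  # mux2 passes mux1's output when sel == 0
    ("b_slow", "mux2", 5.0),  # mux2 passes b_slow when sel == 1
    ("mux2",   "out",  1.0),
]

fanin = defaultdict(list)
for u, v, d in edges:
    fanin[v].append((u, d))

@lru_cache(maxsize=None)
def latest(node):
    """Latest arrival time and worst path to `node`, ignoring logic functions."""
    if not fanin[node]:                      # primary input / startpoint
        return 0.0, (node,)
    return max(((latest(u)[0] + d, latest(u)[1] + (node,)) for u, d in fanin[node]),
               key=lambda t: t[0])

delay, path = latest("out")
print(" -> ".join(path), delay)
# Prints: a_slow -> mux1 -> mux2 -> out 7.0
# That path needs sel == 1 (to pass a_slow) and sel == 0 (to pass mux1's output):
# impossible, so it is a false path.  The worst true path is b_slow -> mux2 -> out (6.0).
```

The reported worst path requires the shared select to be 0 and 1 at the same time; this is exactly the kind of false path a user must declare by hand.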
What do recent trends mean?
• Design
  • Hyper-optimization means accuracy is critical
    • When a chip is designed at a bleeding-edge technology it will be pushed on all dimensions of power, performance, and area
    • Simulation-based delay calculation
    • Path-based analysis
  • Design size means memory use is the #1 problem
    • The largest chips are approaching 1 TB of RAM needed for flat runs
    • Hierarchical / parallel solutions must prioritize memory use on compute nodes
    • Runtime also needs to be faster, but the first step is to run on machines with reasonable cost
    • A recent design uses 750+ GB of RAM for single-mode/corner STA
• Compute
  • CPU is cheap, data movement is expensive
    • When you hear that a calculation is expensive, don't automatically avoid it
  • Parallel computing must not only improve performance but also accuracy and features
    • Don't just make the same problem go faster or just divide the data
If you ask a designer what doesn't work well
• Hierarchical timing in the final verification loop
• SI calculations are very conservative
• SDCs are large and hard to verify
• Worst-case timing is used, and process variation is modeled very pessimistically
• Block-based analysis loses too much accuracy
• True-delay reporting (examining the combinational logic to prove a path true) is slow and can't run during optimization
• Libraries limit the flexibility of the analysis
STA Industry and Academia
• STA technology has been innovated inside industry much more than in academia
  • The key approaches are not documented
  • There is no open-source reference to build from
  • Industry protects the core concepts as trade secrets
• Academia rarely publishes on STA beyond single-clock designs or delay calculation
• We need a book on the core search algorithm
Example: Veritime, from the '90s
• An STA engine that required vectors for the clock
  • Dynamic simulation of the clock
  • Period, multicycle paths, and clock-to-clock false paths were determined automatically
  • STA for the data portion
• Absorbed by Cadence and forgotten, since at the time SDCs were a lot easier to hand-inspect
Requirements of an STA Engine
• I would like to begin by documenting the basics that everyone in industry knows; there are no company-specific trade secrets here
• Must run in memory and runtime linear in circuit size, number of clocks, number of exceptions, and number of storage elements
  • Touch each vertex only once (maybe twice to simplify pre-processing), not once per clock or exception
• Must support SDC timing constraints
  • Clocks, clock-tree assumptions, multicycle paths, false paths, path delays, case analysis, and modes
• Must be nearly SPICE-accurate in delays and support path-based analysis
• Must be incremental enough
  • Netlist changes with a full retrace on one extreme
  • Query-based incremental with limited tracing on the other
The Basic Search
• The graph
  • Startpoints are inputs to the circuit and clock inputs to storage elements
  • Endpoints are outputs of the circuit and data inputs of storage elements
• Propagate the clocks
  • For each clock input, BFS to all clock/data pins
  • Offset startpoint arrival times and endpoint required times with information from the clock propagation and cycle accounting
• Propagate the data (see the sketch below)
  • Use a BFS from startpoints to endpoints
  • Use multiple timing totals at every pin to take into account multiple clocks and exceptions
  • Back pointers can optionally be stored to record the K critical paths, but this time/memory is wasted on optimization programs and should be left to a reporting phase
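A minimal sketch of the data-propagation step described above, assuming a simple pin-level graph and one arrival total per clock tag (all class, pin, and tag names are invented for illustration):

```python
from collections import defaultdict, deque

class TimingGraph:
    """Toy DAG: pins connected by delay arcs (names illustrative)."""
    def __init__(self):
        self.fanout = defaultdict(list)     # pin -> [(sink_pin, delay)]
        self.fanin_count = defaultdict(int)

    def add_arc(self, u, v, delay):
        self.fanout[u].append((v, delay))
        self.fanin_count[v] += 1

def propagate_data(graph, startpoints):
    """
    Forward propagation of arrival times from startpoints to all reachable pins.
    `startpoints` maps pin -> {tag: arrival}, where a tag identifies the launching
    clock (and, in a real engine, exception state).  Each pin keeps one max-arrival
    total per tag, and each vertex is visited exactly once (levelized / Kahn-style BFS).
    """
    arrivals = defaultdict(dict)            # pin -> {tag: latest arrival}
    for pin, tags in startpoints.items():
        for tag, at in tags.items():
            arrivals[pin][tag] = max(arrivals[pin].get(tag, float("-inf")), at)

    indeg = dict(graph.fanin_count)
    ready = deque(p for p in graph.fanout if indeg.get(p, 0) == 0)
    while ready:
        u = ready.popleft()
        for v, d in graph.fanout[u]:
            for tag, at in arrivals[u].items():      # one update per timing total
                cand = at + d
                if cand > arrivals[v].get(tag, float("-inf")):
                    arrivals[v][tag] = cand
            indeg[v] -= 1
            if indeg[v] == 0:                        # all fanin processed: safe to release
                ready.append(v)
    return arrivals
```

A toy usage, with two clocks launching from the same clock pin:

```python
g = TimingGraph()
g.add_arc("ff1/CP", "ff1/Q", 0.12)   # clk-to-q arc
g.add_arc("ff1/Q",  "u1/A",  0.02)   # net
g.add_arc("u1/A",   "u1/Z",  0.25)   # cell arc
g.add_arc("u1/Z",   "ff2/D", 0.03)   # net
ats = propagate_data(g, {"ff1/CP": {"clk_fast": 0.30, "clk_slow": 0.80}})
print(ats["ff2/D"])   # ~{'clk_fast': 0.72, 'clk_slow': 1.22}
```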
Multiple Timing Totals with Partial Paths
• The simplistic implementation is that each clock and each exception gets its own total
  • Simultaneously, or via separate traces
• Memory and/or runtime increase quickly
  • Occurrence pins are the most common netlist object
  • There can be thousands of exceptions
• At timing endpoints, like totals can be combined and evaluated
• At timing endpoints, point-to-point exceptions can be evaluated
Multiple Timing Totals with Path Completion Data
• A BFS has no information about paths
• However, timing exceptions are specified in terms of from, through, and to points with a Boolean expression of pins
  • Mcp (multicycle path) -from a -through {b c} -to d
  • i.e., paths starting at a, passing through b or c, and ending at d
• Each total can carry a small state machine recording which exception points it has seen (sketched below)
• At timing endpoints, like totals with like exception-point data can be combined (or, if the exception marks the path false, not combined)
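One way to picture the per-total state machine, as a sketch (the bit assignments and helper names are invented, not any particular tool's implementation): encode the -from pin and each -through group as a bit, OR bits into the total's state as its partial path visits exception pins, and compare states when totals meet at an endpoint.

```python
# Hypothetical encoding of one exception: "Mcp -from a -through {b c} -to d".
# A total's state is a bitmask of the exception points its partial path has hit.
FROM_A   = 1 << 0                 # -from a
THRU_BC  = 1 << 1                 # -through {b c}  (one bit per -through group)
COMPLETE = FROM_A | THRU_BC       # everything except the -to, checked at the endpoint

PIN_BITS = {"a": FROM_A, "b": THRU_BC, "c": THRU_BC}
TO_PINS  = {"d"}

def advance(state, pin):
    """Update a total's exception state as its partial path reaches `pin`."""
    return state | PIN_BITS.get(pin, 0)

def exception_applies(state, endpoint):
    """The multicycle adjustment applies only if the partial path hit the -from
    pin and at least one pin of every -through group, and ends on a -to pin."""
    return endpoint in TO_PINS and state == COMPLETE

def merge_at_endpoint(totals):
    """Totals with identical exception state can be collapsed to the worst arrival;
    totals with different states must be kept apart until exceptions are applied."""
    worst = {}
    for state, arrival in totals:
        worst[state] = max(worst.get(state, float("-inf")), arrival)
    return worst
```

A toy query against this encoding:

```python
state = 0
for pin in ("a", "x1", "c", "x2"):           # a partial path a -> x1 -> c -> x2
    state = advance(state, pin)
print(exception_applies(state, "d"))          # True: saw -from a and -through {b c}
print(merge_at_endpoint([(COMPLETE, 1.4), (COMPLETE, 1.1), (0, 1.6)]))
# {3: 1.4, 0: 1.6} -- the two completed totals merge; the unrelated one stays separate
```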
Framework can be used for Clock Pessimism
[Figure: clock-tree diagram with stage delays d1, d2, d3; each arrival tag records which clock-tree delays it has accumulated (e.g., Arr 1 through d1, d2 and Arr 2 through d1, d2, d3), so common clock-path pessimism can be identified and removed.]
Delay Calculation and Multiple Timing Totals
• Worst-case slew merging is pessimistic, but it allows delay calculation to be a pre-processing step
• If delay calculation is done inside the BFS, critical-slew merging can be used instead
• It is also possible for each timing total to carry its own slew, to improve accuracy (compared in the sketch below)
• Loops can be auto-detected and dynamically broken, avoiding accidental breaks of the critical path
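A small numeric sketch of the difference (the delay model and coefficients are made up purely for illustration): with worst-slew merging, every total downstream sees the delay computed from the single worst slew at the pin; with per-total slews, each tag gets a delay consistent with its own path.

```python
def cell_delay(input_slew, load, k0=0.05, k1=0.20, k2=0.80):
    """Toy linear delay model (purely illustrative coefficients)."""
    return k0 + k1 * input_slew + k2 * load

# Two totals reach the same pin: a fast clock path with a sharp edge and a slow
# path with a degraded edge.
totals = {"clk_fast": 0.02, "clk_slow": 0.15}   # tag -> slew at the pin (ns)
load = 0.01                                     # illustrative load (pF)

# Worst-slew merging: one delay for the downstream arc, from the single worst slew.
worst_slew = max(totals.values())
d_merged = cell_delay(worst_slew, load)

# Per-total slews: each tag gets a delay consistent with its own path.
d_per_tag = {tag: cell_delay(slew, load) for tag, slew in totals.items()}

print(d_merged)     # ~0.088 -- applied to *both* tags, pessimistic for clk_fast
print(d_per_tag)    # ~{'clk_fast': 0.062, 'clk_slow': 0.088}
```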
Incremental Timing
• Netlist edits with a full retime
• Netlist edits with a fanout-cone retime
• Netlist edits with a query-based retime (sketched below)
• The choice of how incremental to be depends on the optimization approach
  • More global cost functions require less incrementality
  • More locally greedy approaches require more
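A minimal sketch of the query-based flavor (structure and names invented for illustration): a netlist edit only invalidates cached results in its fanout cone, and arrivals are recomputed lazily when a later query touches an invalidated pin.

```python
from collections import defaultdict

class IncrementalTimer:
    """Toy lazy-incremental arrival-time engine (names and structure illustrative)."""
    def __init__(self):
        self.fanin = defaultdict(list)    # pin -> [(driver_pin, delay)]
        self.fanout = defaultdict(list)   # pin -> [sink_pin]
        self.cache = {}                   # pin -> cached arrival, present only if clean

    def add_arc(self, u, v, delay):
        self.fanin[v].append((u, delay))
        self.fanout[u].append(v)
        self.invalidate(v)                # a netlist edit dirties the fanout cone

    def invalidate(self, pin):
        """Drop cached results in the transitive fanout of `pin` (no recompute yet)."""
        stack = [pin]
        while stack:
            p = stack.pop()
            if p in self.cache:
                del self.cache[p]
            stack.extend(s for s in self.fanout[p] if s in self.cache)

    def arrival(self, pin):
        """Query: recompute lazily, only for pins whose cached value was dropped."""
        if pin in self.cache:
            return self.cache[pin]
        if not self.fanin[pin]:                                   # startpoint
            at = 0.0
        else:
            at = max(self.arrival(u) + d for u, d in self.fanin[pin])
        self.cache[pin] = at
        return at
```

A toy edit-and-requery sequence:

```python
t = IncrementalTimer()
t.add_arc("ff1/Q", "u1/A", 0.1)
t.add_arc("u1/A",  "u1/Z", 0.3)
t.add_arc("u1/Z",  "ff2/D", 0.1)
print(t.arrival("ff2/D"))          # ~0.5
t.add_arc("u2/Z", "u1/A", 0.9)     # local edit: only u1/A's fanout cone is re-evaluated
print(t.arrival("ff2/D"))          # ~1.3
```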
STA needs innovation
• Increased sharing with academia
• Increased research on the problems that are still problems
• Redirect solutions in light of the design and compute trends
• There is a lot of interesting work to do!
Some ideas
• A new constraint language that is more functional
  • Try to propagate the function along with the delays
  • Some combination with cycle-based simulation
  • Constraint-language enhancements
• Library-less delay models
• A new data model that is stage-based
  • Focus on data locality
• A hierarchical timing model that is truly context-independent, within acceptable limitations
  • Constraint improvements to help constrain blocks more accurately