Accelerating Statistical Static Timing Analysis Using Graphics Processing Units. Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University, College Station, TX ASPDAC 2009. Outline. Preliminaries Previous works The proposed approach Experimental results Conclusions.
Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University, College Station, TX ASPDAC 2009
Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions
Preliminaries • Static Timing Analysis • Statistical Static Timing Analysis • Monte Carlo method • Some differences between GPU and CPU
Static Timing Analysis (STA) • At each gate, compute the MAX over all input pins i of the SUM of the arrival time at pin i and the pin-to-output rising (or falling) delay from pin i to the output. • Gate delays are either stored in a lookup table (LUT) per gate type or computed from specific equations. • The worst-case delay is used as the representative value.
STA example • We use a 2-input NAND gate as an example.
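The MAX-of-SUM rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `gate_arrival_time` and the NAND arrival times and pin delays are hypothetical values chosen for the example.

```python
def gate_arrival_time(input_arrivals, pin_delays):
    """Output arrival time: MAX over pins of (arrival time + pin-to-output delay)."""
    if len(input_arrivals) != len(pin_delays):
        raise ValueError("one pin-to-output delay is needed per input pin")
    return max(a + d for a, d in zip(input_arrivals, pin_delays))


# Hypothetical 2-input NAND (arbitrary time units, values assumed for illustration).
nand_arrivals = [2.0, 3.5]   # arrival times at pins 0 and 1
nand_delays = [1.5, 1.0]     # pin-to-output delays for pins 0 and 1
nand_output_arrival = gate_arrival_time(nand_arrivals, nand_delays)
```

Here pin 0 contributes 2.0 + 1.5 and pin 1 contributes 3.5 + 1.0, so the later of the two sums determines the output arrival time.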
Pros and Cons of STA • Pros • Very fast to compute. • Easy to interpret. • Cons • Not very precise. • Handles process variation poorly. • Moreover, variations are becoming less systematic.
Statistical Static Timing Analysis (SSTA) • Apply probability and statistics to signals, gates, etc. • The basic idea is the same: MAX and SUM. • Need to generate random samples or deal with probability distribution functions (PDFs) directly.
Why SSTA? • To deal with variations and to move beyond the limitations of the deterministic nature of traditional STA techniques. • The main idea is to include the effect of variations in order to analyze circuit delay more accurately.
Pros and Cons of SSTA • Pros • Can deal with variations. • High accuracy. • Cons • Accurate methods have a high runtime cost. • Results may differ considerably between methods.
Monte Carlo method • There is no single Monte Carlo method; instead, the term describes a large and widely-used class of approaches. • However, these approaches tend to follow a particular pattern: • Define a domain of possible inputs • Generate inputs randomly from the domain using a certain specified probability distribution • Perform a deterministic computation using the inputs • Aggregate the results of the individual computations into the final result
A simple example for Monte Carlo method • How can we approximate π? • Draw a square on the ground and inscribe a circle in it. • Scatter objects of uniform size uniformly over the square. • Counting the number of objects in the circle and dividing by the total number of objects in the square yields an approximation of π / 4
A simple example for Monte Carlo method (cont.) • Generally speaking • The more objects (samples), the higher the precision. • The smaller the objects (unit of samples), the higher the precision. • The distribution of the objects (distribution function of samples) affects the result.
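The π experiment maps directly onto the four Monte Carlo steps. The sketch below assumes point-sized objects and a circle of radius 1 inscribed in the square [-1, 1] x [-1, 1]; the function name `estimate_pi` and the fixed seed are choices made for this illustration.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of uniform points that land inside the circle."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        # Steps 1 and 2: draw a point uniformly from the square domain.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Step 3: deterministic test, is the point inside the unit circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate. inside / num_samples approximates pi / 4.
    return 4.0 * inside / num_samples
```

Calling `estimate_pi` with a larger `num_samples` generally gives a closer estimate, which is the "more samples, higher precision" point made on the slide.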
Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions
Previous works • Block-based SSTA • Performs statistical MAX and SUM operations while traversing the circuit level by level in breadth-first order • Fast but not very accurate • Path-based SSTA • Calculates the delay PDF of each selected path • Can be accurate, but it is hard to decide which paths to select
Previous works (cont.) • Block-based SSTA approaches such as [14][15][16] are fast but only approximate. • Path-based SSTA approaches such as [17], which propagate Gaussian distributions, are also approximate. • [19][20][21] propose faster algorithms that compute only bounds on the result. • [22][23][24][25] operate directly on PDFs.
Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions
The proposed approach • Monte Carlo based SSTA on the GPU, using the Mersenne Twister pseudo-random number generator and Box-Muller transformations. • Gate delays are computed as in the path-based SSTA approach. • The circuit is traversed as in the block-based SSTA approach.
Monte Carlo based SSTA • Generate gate delay samples according to μ and σ (mean and standard deviation). • Run STA for each set of samples. • Aggregate the results to produce the full circuit delay distribution. • The spirit of the Monte Carlo method: the more samples, the higher the precision.
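The three steps can be sketched on the CPU as follows. This is a simplified illustration rather than the GPU kernel from the talk: the three-gate netlist, its μ and σ values, and all names are hypothetical, and each gate has a single delay instead of separate per-pin rising and falling delays.

```python
import random
import statistics

# Hypothetical netlist in topological order: gate -> (fanins, mean delay, std dev).
NETLIST = {
    "g1": (["a", "b"], 1.0, 0.1),
    "g2": (["b", "c"], 1.2, 0.1),
    "g3": (["g1", "g2"], 0.8, 0.05),
}
PRIMARY_INPUTS = ["a", "b", "c"]
OUTPUT = "g3"


def sta_one_sample(delays):
    """Deterministic STA for one set of sampled gate delays."""
    arrival = {pi: 0.0 for pi in PRIMARY_INPUTS}
    for gate, (fanins, _mu, _sigma) in NETLIST.items():
        # With one delay per gate, MAX of SUMs equals MAX of arrivals plus the delay.
        arrival[gate] = max(arrival[f] for f in fanins) + delays[gate]
    return arrival[OUTPUT]


def monte_carlo_ssta(num_samples, seed=0):
    """Return one circuit-delay value per Monte Carlo sample."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_samples):
        # Step 1: sample every gate delay from N(mu, sigma).
        delays = {g: rng.gauss(mu, sigma) for g, (_f, mu, sigma) in NETLIST.items()}
        # Step 2: run STA on this sample set.
        results.append(sta_one_sample(delays))
    # Step 3: the caller aggregates the list, e.g. into a mean or a histogram.
    return results


circuit_delays = monte_carlo_ssta(5000)
mean_delay = statistics.mean(circuit_delays)
std_delay = statistics.stdev(circuit_delays)
```

Each loop iteration is independent of the others, which is what makes the method a natural fit for one-thread-per-sample execution.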
Why Monte Carlo based SSTA on GPU? • Sample parallelism • For a single gate, sample generation and the corresponding static timing analysis can be executed in parallel across samples, with no data dependency • Data parallelism • Gates at the same logic level can run Monte Carlo based SSTA in parallel
Why Monte Carlo based SSTA on GPU? (cont.) • SIMD execution on the GPU • The Mersenne Twister pseudo-random number generator and the subsequent Box-Muller transformations run in parallel • Large GPU memory bandwidth • Extremely fast lookups • Many GPU threads • STA with many samples executes quickly • Memory access latency can be hidden well
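Data parallelism depends on knowing which gates share a logic level. One way to find those groups is sketched below; the `levelize` function and the example netlist are hypothetical, and the netlist is assumed to be listed in topological order.

```python
def levelize(netlist, primary_inputs):
    """Group gates by logic level; gates within one group do not depend on each other."""
    level = {pi: 0 for pi in primary_inputs}
    # Assumes every fanin of a gate appears earlier in the dict than the gate itself.
    for gate, fanins in netlist.items():
        level[gate] = 1 + max(level[f] for f in fanins)
    groups = {}
    for gate in netlist:
        groups.setdefault(level[gate], []).append(gate)
    return [groups[lvl] for lvl in sorted(groups)]


# Hypothetical netlist: gate -> list of fanin signals.
EXAMPLE_NETLIST = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
EXAMPLE_LEVELS = levelize(EXAMPLE_NETLIST, ["a", "b", "c"])
```

In this example `g1` and `g2` land in the same group and could be evaluated concurrently, while `g3` must wait for both.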
Mersenne Twister pseudo-random number algorithm • Developed in 1997 by Makoto Matsumoto and Takuji Nishimura; based on a matrix linear recurrence over the finite binary field F2. • For a k-bit word length, the Mersenne Twister generates numbers with an almost uniform distribution in the range [0, 2^k - 1]. • Long period, efficient use of memory, good distribution properties, and high performance
Box-Muller transformations • A method of generating pairs of independent standard normally distributed (zero expectation, unit variance) random numbers, given a source of uniformly distributed random numbers. • Transforms the uniform samples into N(0,1) samples • Developed by George Edward Pelham Box and Mervin Edgar Muller in 1958.
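Python's standard `random` module uses the Mersenne Twister as its core generator, so the k-bit output range can be observed directly on the CPU. The snippet below is only an illustration of that range and of seeded reproducibility; the seed value and the 32-bit word length are arbitrary choices, and it says nothing about the GPU implementation used in the talk.

```python
import random

WORD_BITS = 32                      # assumed word length k for this illustration

rng = random.Random(2009)           # Mersenne Twister seeded with an arbitrary value
words = [rng.getrandbits(WORD_BITS) for _ in range(5)]

# Re-seeding with the same value reproduces the same sequence.
rng_again = random.Random(2009)
words_again = [rng_again.getrandbits(WORD_BITS) for _ in range(5)]
```

Every value in `words` lies in [0, 2^32 - 1], matching the range stated on the slide for k = 32.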
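The basic (trigonometric) form of the transformation takes two uniform numbers u1 and u2 and returns sqrt(-2 ln u1) cos(2π u2) and sqrt(-2 ln u1) sin(2π u2). A minimal CPU sketch follows; the function names are hypothetical, and Python's `random` module stands in for the uniform source.

```python
import math
import random


def box_muller_pair(rng):
    """Turn two uniform samples into two independent N(0, 1) samples."""
    u1 = 1.0 - rng.random()    # shift [0, 1) to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_samples(num_pairs, seed=0):
    """Generate 2 * num_pairs standard normal samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_pairs):
        samples.extend(box_muller_pair(rng))
    return samples
```

A gate delay sample with mean μ and standard deviation σ is then obtained from a standard normal sample z as μ + σ·z.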
Example • Suppose a random number sequence: • 0.1 -0.2 0.2 -0.2 0.4 0.1 -0.3 0 0.5 0.1 -0.4 0.2 0.3 -0.2 -0.5 0.3 0.1 0
Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions
Experimental results • NVIDIA GeForce 8800 GTX graphics card • 768MB memory • Some specifications are listed in previous slides • Baseline environment for comparison • 3.6GHz CPU with 3GB memory • Linux • Monte Carlo analysis was performed with 64K samples
Experimental results - Some comparisons • Running 16M threads of the SSTA kernel • CPU took 37.158 sec • GPU took 0.115 sec • About 320x faster • Mersenne Twister generator • CPU generates about 2.24*10^7 numbers/sec • GPU generates about 2.33*10^9 numbers/sec • About 100x faster
Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions
Conclusions • Monte Carlo based SSTA on GPU • Mersenne Twister generator and Box-Muller transformation • Combines the path-based and block-based SSTA approaches • Very fast, with no loss of accuracy