Accelerating Statistical Static Timing Analysis Using Graphics Processing Units

Accelerating Statistical Static Timing Analysis Using Graphics Processing Units • Kanupriya Gulati and Sunil P. Khatri • Department of ECE, Texas A&M University, College Station, TX • ASPDAC 2009

Outline • Preliminaries • Previous works • The proposed approach • Experimental results • Conclusions

Preliminaries • Static Timing Analysis • Statistical Static Timing Analysis • Monte Carlo method • Some differences between GPU and CPU

Static Timing Analysis (STA) • At each gate, the output arrival time is the MAX, over all input pins i, of the SUM of the arrival time at pin i and the pin-to-output rising (or falling) delay from pin i to the output. • Gate delays are stored in a look-up table (LUT) per gate type or computed from specific equations. • The worst-case delay is used as the representative value.

STA example • We use a 2-input NAND gate as an example.
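A minimal sketch of the MAX-of-SUM computation for a single 2-input gate. The function name `gate_arrival_time` and the arrival times and pin delays below are hypothetical illustration values, not the numbers from the original slide.

```python
def gate_arrival_time(input_arrivals, pin_delays):
    """Output arrival time of one gate: MAX over pins of (arrival + pin-to-output delay)."""
    if len(input_arrivals) != len(pin_delays):
        raise ValueError("need one pin-to-output delay per input pin")
    return max(a + d for a, d in zip(input_arrivals, pin_delays))


# Hypothetical 2-input NAND: pin 0 arrives at t=1.0, pin 1 at t=2.5;
# assumed pin-to-output delays are 0.75 and 0.5 (arbitrary time units).
arrival = gate_arrival_time([1.0, 2.5], [0.75, 0.5])
print(arrival)  # prints 3.0, because max(1.0 + 0.75, 2.5 + 0.5) = 3.0
```

In a full STA run this computation would be repeated gate by gate in topological order, with separate rising and falling delays per pin.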

Pros and Cons of STA • Pros • Very fast to compute. • Easy to interpret. • Cons • Not very precise. • Handles process variation poorly. • Moreover, variations are becoming less systematic.

Statistical Static Timing Analysis (SSTA) • Applies probability and statistics to signals, gates, etc. • The basic idea is the same: MAX and SUM. • Needs to either generate random samples or operate on probability density functions (PDFs) directly.

Why SSTA? • To deal with variations and to move beyond the limitations of the deterministic nature of traditional STA techniques. • The main idea is to include the effect of variations in order to analyze circuit delay more accurately.

Pros and Cons of SSTA • Pros • Can deal with variations. • High accuracy. • Cons • High runtime cost for accurate methods. • Results may differ significantly between methods.

Monte Carlo method • There is no single Monte Carlo method; instead, the term describes a large and widely-used class of approaches. • However, these approaches tend to follow a particular pattern: • Define a domain of possible inputs • Generate inputs randomly from the domain using a certain specified probability distribution • Perform a deterministic computation using the inputs • Aggregate the results of the individual computations into the final result

A simple example for Monte Carlo method • How can we approximate π? • Draw a square on the ground and inscribe a circle in it. • Scatter objects of uniform size uniformly over the square. • Counting the objects inside the circle and dividing by the total number of objects in the square yields an approximation of π / 4, since the ratio of the two areas is πr² / (2r)² = π / 4.

A simple example for Monte Carlo method (cont.)

A simple example for Monte Carlo method (cont.) • Generally speaking • The more objects (samples), the higher the precision. • The smaller the objects (unit of samples), the higher the precision. • The distribution of the objects (distribution function of the samples) affects the result.
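The π example can be sketched in a few lines. This is an illustrative sketch only: the function name `estimate_pi`, the fixed seed, and the choice of the square [-1, 1] x [-1, 1] are assumptions made here, with random points standing in for the scattered objects.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of uniform points that land in the unit circle."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)     # point in the square [-1, 1] x [-1, 1]
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:       # inside the inscribed circle of radius 1
            inside += 1
    # inside / num_samples approximates pi / 4, so scale by 4
    return 4.0 * inside / num_samples


for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))
```

The estimate typically tightens as the sample count grows, with the error shrinking roughly as 1/sqrt(n).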

Some differences between GPU and CPU

Abstract comparisons of memory between GPU and CPU (cont.)

Previous works • Block-based SSTA • Performs statistical MAX and SUM operations while traversing the circuit in level-wise BFS order • Fast but not very accurate • Path-based SSTA • Calculates the delay PDF of each selected path • Can be accurate, but it is hard to decide which paths to select

Previous works (cont.) • Block-based SSTA approaches such as [14][15][16] are fast but only approximate. • Path-based SSTA approaches such as [17], which propagate Gaussian distributions, are also approximate. • [19][20][21] propose faster algorithms that compute only bounds on the result. • [22][23][24][25] operate on PDFs.

The proposed approach • Monte Carlo based SSTA on GPU with the Mersenne Twister pseudo-random number generator and Box-Muller transformations. • Computes gate delays as in the path-based SSTA approach. • Traverses the circuit as in the block-based SSTA approach.

Monte Carlo based SSTA • Generate gate delay samples according to μ and σ. • Run STA for each set of samples. • Aggregate the results to produce the full circuit delay distribution. • The spirit of the Monte Carlo method: the more samples, the higher the precision.
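The three steps above can be sketched on the CPU as follows. This is a simplified illustration, not the authors' GPU kernel: the 3-gate circuit, its μ and σ values, and the use of a single Gaussian delay per gate (rather than per-pin rising and falling delays) are assumptions made here.

```python
import random
import statistics

# Hypothetical circuit in topological order: (gate, fan-in nodes, mu, sigma).
CIRCUIT = [
    ("g1", ["a", "b"], 1.0, 0.10),
    ("g2", ["b", "c"], 1.2, 0.10),
    ("g3", ["g1", "g2"], 0.8, 0.05),
]
PRIMARY_INPUTS = ["a", "b", "c"]
OUTPUT = "g3"


def sta_one_sample(rng):
    """One deterministic STA pass using one random delay sample per gate."""
    arrival = {name: 0.0 for name in PRIMARY_INPUTS}
    for gate, fanin, mu, sigma in CIRCUIT:
        delay = rng.gauss(mu, sigma)                      # sample the gate delay
        arrival[gate] = max(arrival[src] for src in fanin) + delay  # MAX, then SUM
    return arrival[OUTPUT]


def monte_carlo_ssta(num_samples, seed=0):
    """Aggregate many STA passes into a mean and standard deviation of circuit delay."""
    rng = random.Random(seed)
    delays = [sta_one_sample(rng) for _ in range(num_samples)]
    return statistics.mean(delays), statistics.stdev(delays)


mean_delay, std_delay = monte_carlo_ssta(10_000)
print(f"circuit delay: mean={mean_delay:.3f}, std={std_delay:.3f}")
```

Each pass through `sta_one_sample` is independent of the others, which is what makes the sample loop a natural candidate for one-thread-per-sample execution.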

Why Monte Carlo based SSTA on GPU? • Sample parallelism • For a single gate, the generation of each sample and the corresponding static timing analysis computation can be executed in parallel, with no data dependency between samples • Data parallelism • Gates at the same logic level can execute Monte Carlo based SSTA in parallel
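To make the data-parallelism point concrete, one common way to find gates that can be processed together is to group them by logic level. The sketch below is illustrative: the function name `levelize` and the 4-gate netlist are hypothetical, and the recursion assumes a small acyclic circuit.

```python
from collections import defaultdict


def levelize(fanin, primary_inputs):
    """Group gates by logic level; gates within one level have no mutual dependency."""
    level = {name: 0 for name in primary_inputs}

    def level_of(node):
        # A gate's level is one more than the deepest of its fan-in nodes.
        if node not in level:
            level[node] = 1 + max(level_of(src) for src in fanin[node])
        return level[node]

    by_level = defaultdict(list)
    for gate in fanin:
        by_level[level_of(gate)].append(gate)
    return [sorted(by_level[k]) for k in sorted(by_level)]


# Hypothetical netlist: gate name -> list of fan-in nodes.
fanin = {
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g3": ["g1", "g2"],
    "g4": ["g3", "c"],
}
print(levelize(fanin, ["a", "b", "c"]))  # prints [['g1', 'g2'], ['g3'], ['g4']]
```

Here `g1` and `g2` share a level, so their Monte Carlo samples could all be evaluated in one parallel step before moving on to `g3`.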

Why Monte Carlo based SSTA on GPU? (cont.) • SIMD execution on the GPU • Runs the Mersenne Twister pseudo-random number generator followed by Box-Muller transformations in parallel • Large GPU memory bandwidth • Extremely fast look-ups • Many GPU threads • STA with many samples executes quickly • Memory access latency can be hidden well

Mersenne Twister pseudo-random number algorithm • Developed in 1997 by Makoto Matsumoto and Takuji Nishimura; based on a matrix linear recurrence over the finite binary field F2. • For a k-bit word length, the Mersenne Twister generates numbers with an almost uniform distribution in the range [0, 2^k - 1]. • Long period, efficient use of memory, good distribution properties and high performance

Box-Muller transformations • Given a source of uniformly distributed random numbers, generates pairs of independent standard normally distributed (zero expectation, unit variance) random numbers. • That is, it transforms uniform samples into N(0,1) samples. • Developed by George Edward Pelham Box and Mervin Edgar Muller in 1958.
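A minimal CPU sketch of the basic (trigonometric) form of the Box-Muller transformation. The helper names `box_muller` and `normal_samples` and the μ and σ values are assumptions for illustration. Python's `random` module uses the Mersenne Twister as its core generator, so it serves here as the uniform source.

```python
import math
import random


def box_muller(u1, u2):
    """Map two independent uniforms (u1 in (0, 1], u2 in [0, 1)) to two independent N(0, 1) values."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_samples(n, mu, sigma, seed=0):
    """Draw n samples from N(mu, sigma^2) by scaling and shifting Box-Muller outputs."""
    rng = random.Random(seed)            # Mersenne Twister uniform source
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()          # shift [0, 1) to (0, 1] so log(u1) is defined
        u2 = rng.random()
        z0, z1 = box_muller(u1, u2)
        out.extend((mu + sigma * z0, mu + sigma * z1))
    return out[:n]


# Hypothetical gate delay with mu = 5.0 and sigma = 2.0 (arbitrary units).
delays = normal_samples(20_000, mu=5.0, sigma=2.0)
print(sum(delays) / len(delays))
```

Scaling an N(0,1) value by σ and shifting by μ is how a standard normal sample becomes a gate delay sample with the desired mean and spread.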

Monte Carlo based SSTA kernel

Example • Suppose a random number sequence: • 0.1 -0.2 0.2 -0.2 0.4 0.1 -0.3 0 0.5 0.1 -0.4 0.2 0.3 -0.2 -0.5 0.3 0.1 0

Experimental results • NVIDIA GeForce 8800 GTX graphics card • 768MB memory • Other specifications are listed in previous slides • Baseline environment for comparison • 3.6GHz CPU with 3GB memory • Linux • Monte Carlo analysis was performed with 64K samples

Experimental results - Some comparisons • Running 16M threads of the SSTA kernel • CPU took 37.158 sec • GPU took 0.115 sec • About 320x faster • Mersenne Twister generator • CPU generates about 2.24*10^7 numbers/sec • GPU generates about 2.33*10^9 numbers/sec • About 100x faster

Experimental results – 30 cases

Conclusions • Monte Carlo based SSTA on GPU • Mersenne Twister generator and Box-Muller transformation • Combines the path-based and block-based SSTA approaches • No loss of accuracy and very fast
