WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines
Wei Dong, Peng Li, Xiaoji Ye
Department of ECE, Texas A&M University
{weidong, pli, yexiaoji}@neo.tamu.edu

Multi-Core Implications
[Figures: multi-core processors, courtesy Intel, AMD and IBM]
• The multi-core shift is changing the landscape of computing
• New challenges & opportunities for EDA
• The free ride of single-threaded EDA applications on Moore's Law is coming to an end
• Question: how to fully exploit increasingly parallel hardware and achieve good runtime scaling?

Why Parallel Transient Simulation?
• SPICE-like transient simulation is key to a wide range of ICs
  • Memories, custom digital, analog/RF/mixed-signal
• Long simulation time presents a significant bottleneck in design
  • CPU time > days, weeks (e.g. transistor-level PLL simulation)
  • Can lead to insufficient verification, non-optimal design, chip failure
• Natural target for parallelization!

Prior Work
[Figure: performance of a public parallel matrix solver on an 8-processor server]
• Fine-grained parallelization
  • Parallel matrix solves, device model evaluations
  • The efficiency of parallel matrix solvers deteriorates quickly
• Parallel waveform relaxation [White et al. '87; Reichelt et al., ICCAD'03]
  • Limited convergence property
• Domain decomposition [Wever et al., HICSS'96]
  • Can create dense problems
• Applicability highly application dependent

Our Strategies
• Exploit coarse-grained & application-level parallelism
• Lessons learned before [T. Mattson, Intel]
  • >100 parallel languages/environments were developed in the '90s
  • Only a few, those with significant domain knowledge, were successful
• Develop simulation algorithms parallelizable by construction
• Goals/Benefits
  • Reduce parallel overhead by applying domain knowledge
  • Create rich parallelism for multi-/many-core platforms (pairing with fine-grained methods)
  • Ease of parallel programming, debugging and code reuse
  • Do not jeopardize accuracy & convergence

Proposed Approach
• Time-domain MNA formulation
  [Equation: nonlinear DAEs; legend: dynamic nonlinearities, vector of unknowns, inputs, static nonlinearities]
• How to parallelize along the time axis?
• Data dependency
  [Figure: time points t1 to t5 under one-step integration and under two-step integration]
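
The equation on this slide did not survive extraction. A common way to write a time-domain MNA system that matches the legend is sketched below. The symbol names $q$, $f$, $x$, $u$, the coefficients $\alpha_i$ and the sign convention are assumptions, not taken from the slide.

```latex
% Assumed standard form of the nonlinear DAEs
\frac{d}{dt}\, q\bigl(x(t)\bigr) + f\bigl(x(t)\bigr) = u(t)
% q: dynamic nonlinearities, x: vector of unknowns, u: inputs, f: static nonlinearities

% One-step integration (backward Euler): x_{n+1} depends on x_n only
\frac{q(x_{n+1}) - q(x_n)}{h_{n+1}} + f(x_{n+1}) = u(t_{n+1})

% Two-step integration: x_{n+1} depends on both x_n and x_{n-1}
\alpha_0\, q(x_{n+1}) + \alpha_1\, q(x_n) + \alpha_2\, q(x_{n-1}) + f(x_{n+1}) = u(t_{n+1})
```

Written this way, the data dependency is explicit: each new time point needs one or two earlier solutions, which is what makes parallelizing along the time axis non-trivial.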

Waveform Pipelining (WavePipe)
[Figure: time axis T with a current/base position; backward pipelining and forward pipelining around it, labeled with multi-step numerical integration and predictive computing; granularity of waveform pipelining; solve schedule for threads T1 to T4 on a multi-/many-core machine; fine-grained parallel assists: parallel matrix solve/device evaluation]

Outline
• Motivation
• Overview
• Parallel backward pipelining
• Parallel forward pipelining
• Experimental results
• Summary

Parallel Backward Pipelining
• Move backwards in time
  • Create additional independent computing tasks along the T axis
• Why useful?
  • Employed under variable-stepsize multi-step numerical integration
  • Contributes to a larger future time step
[Figure: time axis T showing the current position, with backward pipelining and forward pipelining; labels: multi-step numerical integration, predictive computing]

Variable-Stepsize Multi-Step Gear's Method
• Gear's integration formula
  [Equation]
• Two-step Gear's method [Shichman, Trans. Circuit Theory, 1970]
  [Equation]
• Legend: order of numerical integration; coefficients; circuit response at time point i
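
The formulas on this slide were lost in extraction. As a sketch only, a variable-stepsize Gear (BDF) derivative approximation and its two-step coefficients can be written as follows. The notation ($k$, $\alpha_i$, $x_i$, $h_n = t_n - t_{n-1}$) is assumed here and may differ from the slide's.

```latex
% k-step Gear formula: k is the order, \alpha_i the coefficients,
% x_i the circuit response at time point i
\dot{x}(t_{n+1}) \approx \sum_{i=0}^{k} \alpha_i\, x_{n+1-i}

% Two-step case (Gear2) with variable steps h_{n+1} = t_{n+1} - t_n and h_n = t_n - t_{n-1}
\alpha_0 = \frac{2h_{n+1} + h_n}{h_{n+1}\,(h_{n+1} + h_n)}, \qquad
\alpha_1 = -\frac{h_{n+1} + h_n}{h_{n+1}\, h_n}, \qquad
\alpha_2 = \frac{h_{n+1}}{h_n\,(h_{n+1} + h_n)}

% Sanity check with equal steps h_{n+1} = h_n = h:
\dot{x}(t_{n+1}) \approx \frac{3x_{n+1} - 4x_n + x_{n-1}}{2h}
```

These coefficients come from differentiating the quadratic interpolant through the three most recent points; they sum to zero, as any consistent derivative approximation must.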

Local Truncation Error (LTE)
• Numerical integration error incurred "locally" at each time point
  • All previous solutions are assumed to be accurate
• LTEs in Gear's methods
  [Equations: LTE expressions for the two-step and three-step methods]

LTE-Based Time Step Control (Gear2)
[Figure: time axis T with the previous step h_n and the next step h_{n+1}, whose size is to be determined]
• Control the time step to meet an LTE tolerance
• LTE's dependency on h_n & h_{n+1}
• Key observation
  • Smaller h_n → greater h_{n+1}, if DD3 is nonincreasing
• Exploit for parallel computing
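
The key observation can be checked numerically. The sketch below assumes a leading-order LTE estimate for variable-step Gear2 of the form |DD3| · h_{n+1}² (h_{n+1} + h_n)² / (2h_{n+1} + h_n), with DD3 read as the third divided difference of the solution. Both that formula and the function names are assumptions for illustration; the slide's own LTE expression was not recoverable.

```python
def gear2_lte(h_next, h_prev, dd3):
    """Assumed leading-order LTE estimate for variable-step Gear2.

    dd3 stands for the third divided difference of the solution (roughly x'''/6).
    """
    return abs(dd3) * h_next**2 * (h_next + h_prev)**2 / (2.0 * h_next + h_prev)


def max_next_step(h_prev, dd3, tol, iters=200):
    """Largest h_next whose estimated LTE stays within tol (found by bisection)."""
    if dd3 == 0.0 or tol <= 0.0 or h_prev <= 0.0:
        raise ValueError("need dd3 != 0, tol > 0 and h_prev > 0")
    lo, hi = 0.0, h_prev
    # Grow the upper bracket until it violates the tolerance.
    while gear2_lte(hi, h_prev, dd3) < tol:
        hi *= 2.0
    # The estimate is monotonically increasing in h_next, so bisection is safe.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gear2_lte(mid, h_prev, dd3) <= tol:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    dd3, tol = 1.0e3, 1.0e-6
    for h_prev in (1.0e-3, 5.0e-4, 2.5e-4):
        print(f"h_n = {h_prev:.2e} -> max h_n+1 = {max_next_step(h_prev, dd3, tol):.3e}")
```

With DD3 held fixed, halving h_n lets the bisection return a larger h_{n+1}, which is the effect backward pipelining sets out to exploit.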

Parallel Backward Pipelining
• Serial Gear2
  [Figure: time axis with t1 to t4 and steps h2, h3, h4]
• Double-threaded Gear2
  • Initial conditions @ t1 & t2
  • Thread 1: t3 (h3, h2); Thread 2: back to t3'
  • Thread 1: t4 (h4, h3'); Thread 2: back to t4'
  [Figure: time axis with t1, t2, t3', t3, t4', t4 and steps h2, h3, h3', h4, h4']
• Balance between efficiency and robustness: [Equation]
• Extensible to multi-step methods (e.g. Gear3)
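
One double-threaded cycle can be sketched as follows, assuming a scalar linear test equation dx/dt = -LAM·x in place of a circuit so that each Gear2 solve has a closed form. The names `gear2_step`, `backward_pipelined_cycle` and `back_frac` (the position of the backward point t3' inside the step) are hypothetical. Python threads are used only to show that the two solves are independent; they are not expected to run faster.

```python
import math
from concurrent.futures import ThreadPoolExecutor

LAM = 1.0  # hypothetical test problem: dx/dt = -LAM * x


def gear2_step(t_new, t_n, x_n, t_nm1, x_nm1):
    """One variable-step Gear2 solve for the linear test equation."""
    h1 = t_new - t_n   # step being taken
    h0 = t_n - t_nm1   # previous step
    a0 = (2.0 * h1 + h0) / (h1 * (h1 + h0))
    a1 = -(h1 + h0) / (h1 * h0)
    a2 = h1 / (h0 * (h1 + h0))
    # a0*x_new + a1*x_n + a2*x_nm1 = -LAM*x_new, solved for x_new
    return -(a1 * x_n + a2 * x_nm1) / (a0 + LAM)


def backward_pipelined_cycle(t1, x1, t2, x2, h3, back_frac=0.5):
    """Thread 1 solves t3; thread 2 solves the backward point t3' in (t2, t3)."""
    t3 = t2 + h3
    t3_back = t2 + back_frac * h3
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both tasks read only the solutions at t1 and t2, so they are independent.
        main_job = pool.submit(gear2_step, t3, t2, x2, t1, x1)
        back_job = pool.submit(gear2_step, t3_back, t2, x2, t1, x1)
        x3, x3_back = main_job.result(), back_job.result()
    return (t3_back, x3_back), (t3, x3)


if __name__ == "__main__":
    (t3b, x3b), (t3, x3) = backward_pipelined_cycle(0.0, 1.0, 0.01, math.exp(-0.01), 0.02)
    print(f"t3' = {t3b:.3f}, x = {x3b:.6f} (exact {math.exp(-t3b):.6f})")
    print(f"t3  = {t3:.3f}, x = {x3:.6f} (exact {math.exp(-t3):.6f})")
    print(f"previous step seen by the next cycle: h3' = {t3 - t3b:.3f} instead of h3 = 0.020")
```

The next cycle would then use the pair (t3', t3) as its history, so the previous step it sees is h3' rather than h3, which per the key observation permits a larger h4.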

Parallel Forward Pipelining
• Move forwards in time
  • Exploit predictive computing along the forward T direction
• Question
  • How to resolve data dependency & ensure accuracy?
[Figure: time axis T showing the current position, with backward pipelining and forward pipelining; labels: multi-step numerical integration, predictive computing]

Parallel Forward Pipelining
• Ex: double-threaded
  • Initial conditions @ t1 & t2
  • Thread 1: time point t3 (h3, h2); Thread 2: time point t4 (h4, h3)
    • FE estimate of sol@t3, then solve sol@t3 & sol@t4
  • Thread 1: time point t5 (h5, h4); Thread 2: time point t6 (h6, h5)
    • FE estimate of sol@t5, then solve sol@t5 & sol@t6
[Figure: time axis with t1 to t6 and steps h2 to h6]

Complications
• Time steps for forward points may not be estimated accurately
  • Data dependency on initial conditions
  • Apply a damping factor (β < 1.0) for time step estimation
  • Revoke forward results in the thread scheduling cycle (covered later)
• Forward points are based on inaccurate initial conditions
  • Addressed by inter-thread communication
  • Tradeoffs provided by fine-/coarse-grained communication
[Figures: forward pipelining from the base position along T, one annotated "h = ?" and one annotated "Accuracy?"]

Coarse Grained Inter-thread Communication
• Iterate on the converged initial condition
[Figure: timelines for threads 1 to 3 (time points 1 to 3); each thread runs FE estimation and a Newton loop, with "one or more iter." stages before convergence]
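
A double-threaded forward cycle with a coarse-grained exchange can be sketched as below. It assumes a scalar nonlinear test equation dx/dt = -x³ standing in for a circuit, and reads "FE" as a forward-Euler predictor. All function names are hypothetical, and Python threads only illustrate the task structure, not real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def f(x):
    """Hypothetical test problem: dx/dt = f(x) = -x**3."""
    return -x**3


def dfdx(x):
    return -3.0 * x**2


def newton_gear2(x_guess, h1, h0, x_n, x_nm1, tol=1e-12, max_iter=50):
    """Newton solve of the Gear2 equation a0*x + a1*x_n + a2*x_nm1 - f(x) = 0."""
    a0 = (2.0 * h1 + h0) / (h1 * (h1 + h0))
    a1 = -(h1 + h0) / (h1 * h0)
    a2 = h1 / (h0 * (h1 + h0))
    x = x_guess
    for iteration in range(1, max_iter + 1):
        residual = a0 * x + a1 * x_n + a2 * x_nm1 - f(x)
        dx = -residual / (a0 - dfdx(x))
        x += dx
        if abs(dx) < tol:
            return x, iteration
    raise RuntimeError("Newton iteration did not converge")


def forward_pipelined_cycle(x1, x2, h2, h3, h4):
    """Thread 1 solves t3; thread 2 solves t4 ahead of time from an FE estimate."""
    x3_fe = x2 + h3 * f(x2)  # forward-Euler estimate of sol@t3
    with ThreadPoolExecutor(max_workers=2) as pool:
        job3 = pool.submit(newton_gear2, x3_fe, h3, h2, x2, x1)
        # The forward point uses the FE estimate as a stand-in initial condition.
        job4 = pool.submit(newton_gear2, x3_fe, h4, h3, x3_fe, x2)
        (x3, _), (x4_pred, _) = job3.result(), job4.result()
    # Coarse-grained exchange: iterate again on the converged initial condition.
    x4, extra_iters = newton_gear2(x4_pred, h4, h3, x3, x2)
    return x3, x4, extra_iters


if __name__ == "__main__":
    h = 0.01
    x1, x2 = 1.0, 1.0 / math.sqrt(1.0 + 2.0 * h)  # exact solution is 1/sqrt(1 + 2t)
    x3, x4, extra = forward_pipelined_cycle(x1, x2, h, h, h)
    print(f"x3 = {x3:.8f}, x4 = {x4:.8f}, extra Newton iterations at t4: {extra}")
```

Because the final solve at t4 restarts from the predicted value but uses the converged solution at t3, the result satisfies the same Gear2 equation as a serial run; the predictive work only supplies a better starting guess.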

Fine Grained Inter-thread Communication
• Communicate at the granularity of NR iterations
• Beneficial to large circuits
[Figure: timelines for threads 1 to 3 (time points 1 to 3); each thread runs FE estimation, then NR iterations 1, 2, 3, ..., then convergence]

Multi-threaded WavePipe
• Combine backward with forward waveform pipelining
• Ex: 4T (1-backward-2-forward) WavePipe
  • T1: standard; T2: backward; T3: forward; T4: 2nd forward
[Figure: one thread scheduling cycle starting from the initial solutions at the base Gear2 point; each of the four threads goes through time step, FE, and Newton stages]

Thread Scheduling
• 4-thread WavePipe (1-backward-2-forward scheme): standard, backward, forward, 2nd forward
• The work done over an overestimated step is discarded
[Figure: scheduling cycles over time, starting from the initial conditions. Without step size overestimation each cycle completes; with step size overestimation a cycle only partially completes]
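
The damping and revocation rules can be sketched as a small bookkeeping routine. The function names, the default β = 0.8, and the rule that every point after the first overestimated step is also dropped are assumptions made for illustration; the slides only state that β < 1.0 and that work done over an overestimated step is discarded.

```python
def damped_step(h_estimate, beta=0.8):
    """Shrink a predicted forward step by a damping factor beta < 1.0."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    return beta * h_estimate


def accepted_points(planned_steps, allowed_steps):
    """Count the leading time points of a scheduling cycle that can be kept.

    planned_steps: steps actually used by the threads, in time order.
    allowed_steps: steps permitted by the LTE check once accurate history is known.
    Assumption: the first overestimated step invalidates itself and all later points,
    since those later points were built on it.
    """
    accepted = 0
    for planned, allowed in zip(planned_steps, allowed_steps):
        if planned > allowed:
            break  # overestimated step: discard this point and the rest of the cycle
        accepted += 1
    return accepted


if __name__ == "__main__":
    planned = [1.0e-9, damped_step(1.0e-9), damped_step(1.0e-9)]
    allowed = [1.0e-9, 0.9e-9, 0.7e-9]
    print(f"{accepted_points(planned, allowed)} of {len(planned)} points kept this cycle")
```

A smaller β makes revocation less likely at the cost of shorter forward steps, which is one way to read the efficiency/robustness tradeoff mentioned earlier in the deck.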

Experimental Setup
• An 8-processor Linux server with four dual-core processors
• WavePipe implemented in C/C++ using Pthreads (Gear2)
• Compared with
  • Reference serial SPICE-like (Gear2) transient simulation
  • Low-level parallel matrix solve (SuperLU) and device evaluation
• Test circuits
  [Table: test circuits]

Experimental Results – Accuracy & Profiling
• 3T (1-backward + 1-forward) WavePipe vs. serial (DB mixer)
• Real-time threading profiling (mesh circuit)

Experimental Results – 2T Speedups
• 2T 1-backward & 2T 1-forward
[Figure: speedup charts; values shown: 1.29X, 1.57X]

Experimental Results – 3T Speedups
• 3T 1-backward-1-forward & 3T 2-forward
[Figure: speedup charts; values shown: 1.73X, 1.83X]

Experimental Results – 4T Speedups
• 4T 1-backward-2-forward & 4T 3-forward
[Figure: speedup charts; values shown: 2.09X, 2.19X]
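
To compare the 2T, 3T and 4T figures on one scale, the quoted speedups can be converted to parallel efficiency (speedup divided by thread count). This is a small arithmetic sketch; the two values per thread count are kept in the order they appear on the slides, without assuming which scheme each belongs to.

```python
# Speedup values as quoted on the 2T, 3T and 4T slides, keyed by thread count.
SPEEDUPS = {2: (1.29, 1.57), 3: (1.73, 1.83), 4: (2.09, 2.19)}


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling achieved: speedup / threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup / threads


if __name__ == "__main__":
    for threads, values in SPEEDUPS.items():
        shown = ", ".join(f"{parallel_efficiency(s, threads):.2f}" for s in values)
        print(f"{threads} threads: efficiency {shown}")
```

Efficiency falls from roughly 0.65 to 0.79 at two threads to roughly 0.52 to 0.55 at four, even though absolute speedup keeps rising.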

Experimental Results – Runtime Scaling
• 2-4 threads

Experimental Results
• Low-level scheme
  • Parallel matrix solve & device model evaluation
• Proposed scheme
  • 1-4 threads: WavePipe
  • 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation

Summary
• Multi-core challenges & opportunities for EDA
• Application-level coarse-grained parallelism for transient simulation
  • Parallelize at the granularity of a single time-point circuit solution
  • Inherently low inter-core communication overhead
  • Maintain accuracy & convergence
  • Ease of implementation and code reuse
• Rich parallelism for multi-core or many-core systems
  • New parallel opportunities orthogonal to fine-grained schemes
  • Pair with parallel matrix solve, device evaluation and low-level parallel programming assists