

  1. WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines Wei Dong, Peng Li, Xiaoji Ye Department of ECE, Texas A&M University {weidong, pli, yexiaoji}@neo.tamu.edu

  2. Multi-Core Implications [Processor die photos courtesy of Intel, AMD, and IBM] • The multi-core shift is changing the landscape of computing • New challenges & opportunities for EDA • The free ride of single-threaded EDA applications on Moore's Law is coming to an end • Question: how can we fully exploit increasingly parallel hardware and achieve good runtime scaling?

  3. Why Parallel Transient Simulation? • SPICE-like transient simulation is key to a wide range of ICs • Memories, custom digital, analog/RF/mixed-signal • Long simulation times are a significant bottleneck in design • CPU time > days or weeks (e.g. transistor-level PLL simulation) • Can lead to insufficient verification, non-optimal design, chip failure. A natural target for parallelization!

  4. Prior Work • Fine-grained parallelization • Parallel matrix solves, device model evaluations • The efficiency of parallel matrix solvers deteriorates quickly [Figure: performance of a public parallel matrix solver on an 8-processor server] • Parallel waveform relaxation [White et al. '87, Reichelt et al. ICCAD'03] • Limited convergence properties • Domain decomposition [Wever et al., HICSS'96] • Can create dense problems • Applicability is highly application dependent

  5. Our Strategies • Exploit coarse-grained & application-level parallelisms • Lessons learned before [T. Mattson, Intel] • >100 parallel languages/environments were developed in the '90s! • Only a few that embedded significant domain knowledge were successful • Develop simulation algorithms that are parallelizable by construction • Goals/Benefits • Reduce parallel overhead by applying domain knowledge • Create rich parallelisms for multi-/many-core platforms (pairing with fine-grained methods) • Ease of parallel programming, debugging and code reuse • Do not jeopardize accuracy & convergence

  6. Proposed Approach • Time-domain MNA formulation: a system of nonlinear DAEs dq(x(t))/dt + f(x(t)) = u(t), where q(·): dynamic nonlinearities, f(·): static nonlinearities, x(t): vector of unknowns, u(t): inputs • How to parallelize along the time axis? • Data dependency: with one-step or two-step integration, each new time point depends on the one or two previously solved points (a single one-step solve is sketched below) [Figure: data dependencies across time points t1–t5 for one-step vs. two-step integration]
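A minimal sketch of the structural unit WavePipe parallelizes: one time point of a SPICE-like transient loop, obtained by one-step (backward Euler) discretization of dq(x)/dt + f(x) = u(t) followed by a Newton solve. A scalar stand-in circuit (linear capacitor plus a diode-like conductance driven by a current source) replaces the full MNA system; all function names, values and the example circuit are illustrative assumptions, not taken from the paper.

```cpp
#include <cmath>
#include <cstdio>

// Scalar stand-in for the MNA system  d/dt q(x) + f(x) = u(t):
// q(x) = C*x (linear capacitor), f(x) = Is*(exp(x/Vt) - 1) (diode), u(t) = current source.
static double q(double x)  { return 1e-9 * x; }                          // charge
static double dq(double)   { return 1e-9; }                              // dq/dx
static double f(double x)  { return 1e-12 * (std::exp(x / 0.0259) - 1.0); }
static double df(double x) { return (1e-12 / 0.0259) * std::exp(x / 0.0259); }
static double u(double)    { return 1e-3; }                              // 1 mA source

// One backward-Euler step: solve  (q(x_{n+1}) - q(x_n))/h + f(x_{n+1}) = u(t_{n+1})
// for x_{n+1} with Newton's method, starting from the previous solution.
double backward_euler_step(double x_n, double t_next, double h) {
    double x = x_n;
    for (int it = 0; it < 50; ++it) {
        double F = (q(x) - q(x_n)) / h + f(x) - u(t_next);   // residual
        double J = dq(x) / h + df(x);                        // scalar Jacobian
        double dx = -F / J;
        x += dx;
        if (std::fabs(dx) < 1e-9) break;                     // Newton convergence check
    }
    return x;
}

int main() {
    double x = 0.0, t = 0.0, h = 1e-9;                       // start at rest, 1 ns steps
    for (int n = 0; n < 20; ++n) {
        t += h;
        x = backward_euler_step(x, t, h);                    // each point depends on the previous one
        std::printf("t = %.1e s  v = %.4f V\n", t, x);
    }
    return 0;
}
```

The serial loop makes the data dependency explicit: x at each point is the initial condition of the next solve, which is exactly what backward and forward pipelining work around.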

  7. Waveform Pipelining (WavePipe) [Figure: WavePipe overview. Backward and forward pipelining around the current/base position on the time axis T, with predictive computing under multi-step numerical integration; waveform pipelining at the granularity of time points (solve/schedule across threads T1–T4) on a multi-/many-core machine, with fine-grained parallel assists (parallel matrix solve / device evaluation) underneath]

  8. Outline • Motivation • Overview • Parallel backward pipelining • Parallel forward pipelining • Experimental results • Summary

  9. Parallel Backward Pipelining • Move backwards in time • Create additional independent computing tasks along the T axis • Why is this useful? • Employed with variable-stepsize multi-step numerical integration • A backward point contributes to a larger future time step [Figure: backward vs. forward pipelining around the current position on the time axis T]

  10. Variable-Stepsize Multi-Step Gear's Method • Gear's integration formula expresses the time derivative at the new point as a weighted sum of recent solutions: x'(t_{n+1}) ≈ a_0 x_{n+1} + a_1 x_n + ... + a_k x_{n+1-k}, where k: order of numerical integration, a_i: coefficients, x_i: circuit response at time point i • Two-step Gear's method (k = 2) [Shichman, Trans. Circuit Theory, 1970] (see the coefficient sketch below)
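A sketch of the variable-stepsize two-step Gear (Gear2/BDF2) formula in the derivative form above. The coefficients are the standard textbook variable-step BDF2 coefficients (my reconstruction, not copied from the slide), checked here on the test ODE x' = -x against the exact solution exp(-t).

```cpp
#include <cmath>
#include <cstdio>

// Variable-step two-step Gear (BDF2) derivative approximation:
//   x'(t_{n+1}) ~= a0*x_{n+1} + a1*x_n + a2*x_{n-1}
// with h1 = t_{n+1} - t_n (current step) and h0 = t_n - t_{n-1} (previous step).
struct Gear2Coeffs { double a0, a1, a2; };

Gear2Coeffs gear2_coeffs(double h1, double h0) {
    Gear2Coeffs c;
    c.a0 =  (2.0 * h1 + h0) / (h1 * (h1 + h0));
    c.a1 = -(h1 + h0) / (h1 * h0);
    c.a2 =  h1 / (h0 * (h1 + h0));
    return c;
}

int main() {
    // Test ODE x' = -x, exact solution exp(-t). For this linear case the Gear2
    // update  a0*x_{n+1} + a1*x_n + a2*x_{n-1} = -x_{n+1}  solves in closed form.
    double h0 = 0.05, h1 = 0.08;                         // deliberately unequal steps
    double t0 = 0.0, t1 = t0 + h0, t2 = t1 + h1;
    double x0 = std::exp(-t0), x1 = std::exp(-t1);       // two exact starting values

    Gear2Coeffs c = gear2_coeffs(h1, h0);
    double x2 = -(c.a1 * x1 + c.a2 * x0) / (c.a0 + 1.0);

    std::printf("Gear2: x(%.2f) = %.6f, exact = %.6f\n", t2, x2, std::exp(-t2));
    return 0;
}
```

With equal steps the coefficients reduce to the familiar constant-step form x' ≈ (3x_{n+1} - 4x_n + x_{n-1})/(2h).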

  11. Local Truncation Error (LTE) • Numerical integration error incurred "locally" at each time point • All previous solutions are assumed to be accurate • LTEs of the two-step and three-step Gear's methods are estimated from higher-order divided differences of the solution (e.g. the third divided difference DD3 for Gear2)

  12. LTE-Based Time Step Control (Gear2) • Control the time step to meet an LTE tolerance • The LTE depends on both hn and hn+1 • Key observation: a smaller hn permits a greater hn+1, provided DD3 (the third divided difference) is non-increasing • Exploit this for parallel computing (see the sketch below) [Figure: choosing the next step hn+1 after hn on the time axis T]
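A generic sketch of LTE-driven step selection in the spirit of this slide: compare an LTE estimate against the tolerance and grow or shrink the step accordingly, with the third divided difference DD3 as the quantity the key observation refers to. The 1/3 exponent, safety factor and clamping are the textbook controller for a second-order method, not necessarily the exact rule used in WavePipe; all names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Next step for a 2nd-order method (Gear2): LTE scales roughly as h^3, so the
// step that would just meet the tolerance is h * (tol/LTE)^(1/3). A safety
// factor and clamping keep the controller from overshooting repeatedly.
double next_step_gear2(double h, double lte_estimate, double lte_tol,
                       double safety = 0.9, double grow_max = 2.0, double shrink_min = 0.1) {
    double ratio = safety * std::cbrt(lte_tol / std::max(lte_estimate, 1e-30));
    ratio = std::min(grow_max, std::max(shrink_min, ratio));   // clamp step changes
    return h * ratio;
}

// Third divided difference DD3 over the last four accepted points: smaller h_n
// tends to allow a larger h_{n+1} when DD3 is non-increasing.
double divided_diff3(const double t[4], const double x[4]) {
    double d1[3], d2[2];
    for (int i = 0; i < 3; ++i) d1[i] = (x[i + 1] - x[i]) / (t[i + 1] - t[i]);
    for (int i = 0; i < 2; ++i) d2[i] = (d1[i + 1] - d1[i]) / (t[i + 2] - t[i]);
    return (d2[1] - d2[0]) / (t[3] - t[0]);
}

int main() {
    double t[4] = {0.0, 0.1, 0.25, 0.4};
    double x[4] = {1.0, 0.905, 0.779, 0.670};                  // samples of a decaying waveform
    double dd3 = divided_diff3(t, x);
    double lte = std::fabs(dd3) * 1e-3;                        // placeholder LTE estimate proportional to DD3
    std::printf("DD3 = %g, next step = %g\n", dd3, next_step_gear2(0.15, lte, 1e-3));
    return 0;
}
```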

  13. Parallel Backward Pipelining • Serial Gear2 vs. double-threaded Gear2: starting from the initial conditions at t1 & t2, Thread 1 solves t3 (step h3 derived from h2) while Thread 2 steps back to an extra point t3'; Thread 1 then solves t4 with a step h4 derived from the smaller backward interval h3', while Thread 2 steps back to t4' • Balance between efficiency and robustness • Extensible to higher-order multi-step methods (e.g. Gear3) [Figure: time axes of the serial and double-threaded Gear2 schemes with points t1–t4, backward points t3', t4' and steps h2–h4, h3', h4']

  14. Parallel Forward Pipelining • Move forwards in time • Exploit predictive computing along the forward T direction • Question: how to resolve data dependency & ensure accuracy? [Figure: backward vs. forward pipelining around the current position on the time axis T]

  15. Parallel Forward Pipelining • Example: double-threaded scheme starting from the initial conditions at t1 & t2 • One thread takes an FE (forward Euler) estimate of the solution at t3 (step h3 from h2) so that the other thread can start on t4 (step h4 from h3); the two threads then solve sol@t3 and sol@t4 in parallel • The pattern repeats for t5 (FE estimate) and t6 (solve sol@t5 & sol@t6) (see the sketch below) [Figure: time axis with points t1–t6 and steps h2–h6 split across threads 1 and 2]
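A toy sketch of the double-threaded forward idea as I read it: while the base thread solves the next time point, a forward thread starts on the point after that, using a forward-Euler prediction of the not-yet-available solution as its initial condition. It uses a scalar nonlinear ODE x' = -x^3 with backward-Euler/Newton solves and std::thread rather than the paper's full MNA system and pThreads implementation; all names are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <thread>

// Toy nonlinear ODE x' = -x^3, discretized with backward Euler:
//   x_{n+1} + h * x_{n+1}^3 = x_n,  solved by Newton from a starting guess.
double solve_be(double x_prev, double h) {
    double x = x_prev;                                   // Newton initial guess
    for (int it = 0; it < 50; ++it) {
        double F = x + h * x * x * x - x_prev;
        double J = 1.0 + 3.0 * h * x * x;
        double dx = -F / J;
        x += dx;
        if (std::fabs(dx) < 1e-12) break;
    }
    return x;
}

int main() {
    double x1 = 1.0;                                     // converged solution at t1
    double h2 = 0.1, h3 = 0.1;                           // steps to t2 and t3
    double x2 = 0.0, x3_spec = 0.0;

    // Base thread: solves the next point t2 normally.
    std::thread base([&] { x2 = solve_be(x1, h2); });

    // Forward thread: needs x(t2), which is not available yet, so it uses a
    // forward-Euler prediction of x(t2) as its initial condition and solves t3
    // speculatively, in parallel with the base thread.
    std::thread forward([&] {
        double x2_fe = x1 + h2 * (-(x1 * x1 * x1));      // FE predictor for x(t2)
        x3_spec = solve_be(x2_fe, h3);
    });

    base.join();
    forward.join();

    // In WavePipe the speculative point is reconciled with the accurate x(t2)
    // through inter-thread communication (next slides); here we just compare.
    double x3_exact = solve_be(x2, h3);
    std::printf("x(t3): speculative %.6f vs. recomputed %.6f\n", x3_spec, x3_exact);
    return 0;
}
```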

  16. Complications • Time steps for forward points may not be estimated accurately • Apply a damping factor (β < 1.0) to the time step estimation • Revoke forward results within the thread scheduling cycle (covered later) • Data dependency on initial conditions: forward points are based on inaccurate initial conditions • Addressed by inter-thread communication • Tradeoffs provided by fine-/coarse-grained communication [Figure: step-size (h = ?) and accuracy questions for forward points beyond the base position on the time axis T]

  17. Coarse-Grained Inter-Thread Communication • Iterate on the converged initial condition: once the preceding time point has converged, the forward thread re-iterates its solution using that converged value as its initial condition [Figure: per-thread timelines for time points 1–3 (threads 1–3), each consisting of an FE estimation followed by a Newton loop of one or more iterations to convergence]

  18. Fine-Grained Inter-Thread Communication • Communicate at the granularity of NR (Newton-Raphson) iterations: a forward thread picks up the latest iterate of the preceding time point instead of waiting for its full convergence (see the sketch below) • Beneficial for large circuits [Figure: per-thread timelines for time points 1–3 (threads 1–3), with FE estimation and NR iterations 1–3 to convergence, exchanging data after each iteration]
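A small sketch of what NR-iteration-granularity communication could look like: the base thread publishes each Newton iterate into a shared, mutex-protected slot, and the forward thread refreshes its initial condition from the freshest iterate instead of waiting for convergence. The structure and names are illustrative assumptions, not the paper's code; the "NR updates" below are faked for brevity.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

// Shared channel through which a base thread publishes its latest NR iterate
// and a forward thread reads it at the granularity of single NR iterations.
struct IterateChannel {
    std::mutex m;
    double latest = 0.0;      // most recent NR iterate of the base time point
    int    iter   = 0;        // which NR iteration produced it

    void publish(double x, int k) {
        std::lock_guard<std::mutex> lock(m);
        latest = x;
        iter = k;
    }
    double snapshot(int* k_out) {
        std::lock_guard<std::mutex> lock(m);
        if (k_out) *k_out = iter;
        return latest;
    }
};

int main() {
    IterateChannel ch;

    // Base thread: fake NR loop converging toward 0.5, publishing every iterate.
    std::thread base([&] {
        double x = 0.0;
        for (int k = 1; k <= 5; ++k) {
            x += (0.5 - x) * 0.7;                        // stand-in for one NR update
            ch.publish(x, k);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    // Forward thread: repeatedly refreshes its initial condition from the freshest iterate.
    std::thread fwd([&] {
        for (int i = 0; i < 5; ++i) {
            int k = 0;
            double ic = ch.snapshot(&k);
            std::printf("forward thread sees iterate %d: %.4f\n", k, ic);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    base.join();
    fwd.join();
    return 0;
}
```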

  19. Multi-Threaded WavePipe • Combine backward with forward waveform pipelining • Example: 4T (1-backward-2-forward) WavePipe; within one thread scheduling cycle, T1 computes the standard (base) Gear2 point, T2 a backward point, T3 a forward point and T4 a second forward point, each performing time-step selection, an FE estimate and a Newton solve [Figure: one thread scheduling cycle of the 4T scheme, showing initial solutions feeding the time-step, FE and Newton stages of threads T1–T4]

  20. Thread Scheduling • Threads proceed in scheduling cycles that start from common initial conditions and complete (or only partially complete) together • The work done over an overestimated step is discarded (see the sketch below) [Figure: timelines of the 4-thread (1-backward-2-forward) scheme with and without step size overestimation, showing the standard, forward, 2nd forward and backward threads over successive scheduling cycles]
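A sketch of the revocation rule at the end of a scheduling cycle, as I interpret it: a speculative forward result survives only if the step it assumed does not exceed the step the LTE controller actually grants once the preceding point has converged. The record type, names and acceptance test are illustrative assumptions, not the paper's scheduler.

```cpp
#include <cstdio>
#include <vector>

// One forward thread's speculative result within a scheduling cycle.
struct SpeculativePoint {
    double h_assumed;        // step size the forward thread guessed (after damping by beta)
    double solution;         // its speculatively computed solution
};

int main() {
    double h_granted = 0.8e-9;                           // step allowed by LTE control
    std::vector<SpeculativePoint> forward = {
        {0.7e-9, 1.23}, {1.1e-9, 1.31}                   // two forward threads' results
    };

    for (const SpeculativePoint& p : forward) {
        if (p.h_assumed <= h_granted)
            std::printf("keep    speculative point (h = %.2g s)\n", p.h_assumed);
        else
            std::printf("discard speculative point (h = %.2g s): step overestimated\n",
                        p.h_assumed);
    }
    return 0;
}
```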

  21. Experimental Setup • An 8-processor Linux server with four dual-core processors • WavePipe implemented in C/C++ using pThreads (Gear2) • Compared with • A reference serial SPICE-like (Gear2) transient simulation • Low-level parallel matrix solve (SuperLU) and device evaluation • Test circuits

  22. Experimental Results – Accuracy & Profiling • 3T (1-backward-1-forward) WavePipe vs. serial simulation (DB mixer) • Real-time threading profiling (mesh circuit)

  23. Experimental Results – 2T Speedups • 2T 1-backward & 2T 1-forward: 1.29X and 1.57X speedups, respectively

  24. Experimental Results – 3T Speedups • 3T 1-backward-1-forward & 3T 2-forward: 1.73X and 1.83X speedups, respectively

  25. Experimental Results – 4T Speedups • 4T 1-backward-2-forward & 4T 3-forward: 2.09X and 2.19X speedups, respectively

  26. Experimental Results – Runtime Scaling • 2-4 threads

  27. Experimental Results • Low-level scheme • Parallel matrix solve & device model evaluation • Proposed scheme • 1-4 threads: WavePipe • 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation

  28. Summary • Multi-core challenges & opportunities for EDA • Application-level, coarse-grained parallelism for transient simulation • Parallelizes at the granularity of single time-point circuit solutions • Inherently low inter-core communication overhead • Maintains accuracy & convergence • Ease of implementation and code reuse • Rich sets of parallelisms for multi-core or many-core systems • New parallel opportunities orthogonal to fine-grained schemes • Pairs with parallel matrix solve, device evaluation and low-level parallel programming assists

  29. Thanks
