Atomistic nanoelectronic device engineering with sustained performances up to 1.44 PFlop/s
M. Luisier, T. Boykin, G. Klimeck, and W. Fichtner
ETH Zurich, University of Alabama, Purdue University
Integrated Systems Laboratory, ETH Zurich
Nanoelectronics in HPC
• Number of transistors per chip doubles every 2 years (Moore’s law)
• Gate length Lg: 0.35 μm (1995), 90 nm (2002), 45 nm (2008), 22 nm (2011), 8 nm (2020)
• 2011 breakthrough: 3-D FinFETs instead of planar MOSFETs
• To keep Moore’s law: new breakthrough required by 2020
Source: Intel Corporation
Tuesday, 16 September 2014
Next Generation Devices
Production: around 2020
Devices shown: CNT, BTB Tunneling, Nanowire, Graphene, III-V UTB
Sources: P. Hashemi et al., EDL 30, 401 (2009); Supratik Guha, IBM Research; L. Tapasztó et al., Nat. Nano. 3, 397 (2008); Y.Q. Wu et al., EDL 30, 700 (2009); W.Y. Choi et al., EDL 28, 743 (2007)
NEEDED: Fast, cheap, and reliable platform to support the development and accelerate the innovation of novel nanoelectronic devices
Physics-based Numerical Device Simulator: OMEN
What is OMEN?
HPC in Nanoelectronics: First Peta-scale Engineering Application
Device Engineering
• Industrial-Strength Nano-electronic Device Simulator
• Multi-Geometry Capabilities
• Explore, Understand, Explain, Optimize Novel Designs
Physical Models
• 3D Quantum Transport Solver
• Accurate Representation of the Semiconductor Properties
• Atomistic Description of Devices
• Multi-Physics Modeling
Efficient Parallel Computing
• Accelerate Simulation Time
• Investigate New Phenomena at the Nanometer Scale
• Move Hero Experiments to a Day-to-Day Basis
Figures: GAA NW, Electron Density, Id-Vgs, Parallelization Scheme
Overview
(1) III-V HEMT Simulations
• OMEN device structure (figure labels): In0.52Al0.48As, Si δ-doping, In0.53Ga0.47As, strained InAs, In0.53Ga0.47As, In0.52Al0.48As; terminals VG, VD, VS
• Thermionic current over a potential barrier (band diagram labels: CB, ON, OFF)
• Expt: J. del Alamo @ MIT
• Publications: IEDM 2008, IEDM 2009, IEEE TED 2011
(2) CNT FET Simulations
• OMEN device structure (figure labels): Lg = 9 nm, Source, Drain, Gate, HfO2, Air
• Ambipolar current flow (band diagram labels: CB, VB, Efl, Efr, Source, Drain)
• Id-Vgs characteristics
• Expt: A. Franklin @ IBM
• Publication: IEDM 2011, submitted to Nano Letters 2011
(3) BTBT Diode Simulations
• OMEN device structure (figure labels): Contact, P+ source, N+ drain, Contact
• Band-to-band tunneling current (plot labels: P+, N+, CB, VB, NDR current, Zener current)
• Discrepancy due to measurement setup
• Expt: S. Rommel @ RIT, S. Datta @ PSU
• Publication: TECHCON 2010, submitted to APL 2011
One single, multi-geometry, multi-physics code for a wide range of different nanoscale applications
OMEN already used by device engineers in industry at Intel and Global Foundries
Overview
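Physical Models: Quantum Transport (i)
• Multi-dimensional Schrödinger equation with open boundary conditions (OBCs): $H\,|\psi_E\rangle = E\,|\psi_E\rangle$
• Tight-binding ansatz for the wave function: $\langle \mathbf{r}|\psi_E\rangle = \sum_{\sigma,ijk,k_t} C^{\sigma}_{ij}(E,k_t)\,\Phi_{\sigma}(\mathbf{r}-\mathbf{R}_{ijk})\,e^{i\mathbf{k}_t\cdot\mathbf{r}_t}$
• Ballistic (wave function), of the form Ax=b: $(E - H - \Sigma^{RB})\cdot C = \mathrm{Inj}$
• Scattering (NEGF), of the form AB=C: $(E - H - \Sigma^{RB} + \Sigma^{RS})\cdot G^{R} = I$ and $G^{<} = G^{R}\cdot(\Sigma^{<B} + \Sigma^{<S})\cdot G^{R\dagger}$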
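To make the "Ax=b" form of the ballistic problem concrete, here is a minimal sketch on a toy system, not OMEN's implementation. It assumes a single-orbital 1D chain with nearest-neighbour hopping -t and identical semi-infinite leads; the names solve_tridiagonal and transmission are hypothetical. The right-hand side is a unit vector rather than a physical injection vector Inj, so the solution is one column of G^R, from which the transmission follows.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs with the Thomas algorithm."""
    n = len(diag)
    c = [0j] * n
    d = [0j] * n
    c[0] = upper[0] / diag[0] if n > 1 else 0j
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    x = [0j] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def transmission(energy, onsite, t=1.0):
    """Transmission through a toy 1D chain (hopping -t) with open boundaries."""
    if abs(energy) >= 2.0 * t:
        return 0.0  # outside the lead band: no propagating state
    n = len(onsite)
    # Retarded boundary self-energy of a semi-infinite lead (assumed identical
    # on both sides, lead on-site energy 0).
    sigma = complex(energy / 2.0,
                    -math.sqrt(4.0 * t * t - energy * energy) / 2.0)
    gamma = -2.0 * sigma.imag  # lead broadening
    # A = E - H - Sigma^RB is tridiagonal for this toy chain.
    diag = [energy - eps for eps in onsite]
    diag[0] -= sigma
    diag[-1] -= sigma
    off = [t] * (n - 1)  # entries of -H between neighbours
    rhs = [0j] * n
    rhs[0] = 1.0  # unit vector: x becomes the first column of G^R
    x = solve_tridiagonal(off, diag, off, rhs)
    return gamma * gamma * abs(x[-1]) ** 2
```

For a perfect chain (all on-site energies zero) the transmission inside the band is 1, and a barrier with on-site energies outside the band reduces it. The point of the sketch is only that one linear solve per (E, k) replaces a full matrix inversion.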
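Physical Models: Quantum Transport (ii)
• Carriers localized around atom positions: $\rho(\mathbf{r}) = F_{\rho}\sum_{i,k}\int dE\,|C_i(E,k)|^2\,\delta(\mathbf{r}-\mathbf{r}_i)$
• Current along bonds connecting two atoms: $\mathbf{J}(\mathbf{r}) = F_{J}\sum_{ij,k}\int dE\,\mathrm{Im}\{C_i(E,k)\cdot H_{ij}\cdot C_j(E,k)\}\,(\mathbf{r}_j-\mathbf{r}_i)\,\delta(\mathbf{r}-\mathbf{r}_i)$
• Solve Poisson equation on FEM grid: $\Delta V(\mathbf{r}) = -\rho(\mathbf{r})/\varepsilon(\mathbf{r})$
• Repeat until ρ(r) and V(r) converge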
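The Poisson step can be illustrated with a small sketch. It is a deliberate simplification: a 1D finite-difference grid with constant permittivity and Dirichlet boundaries stands in for the FEM grid and the position-dependent ε(r) named on the slide. The function name solve_poisson_1d and its parameters are hypothetical.

```python
def solve_poisson_1d(rho, eps, length, v_left=0.0, v_right=0.0):
    """Solve V'' = -rho/eps on a uniform 1D grid with fixed boundary values.

    rho holds the charge density on the interior nodes only; eps is assumed
    constant (the slide uses a position-dependent eps on an FEM grid).
    """
    n = len(rho)
    h = length / (n + 1)
    # Discretised: (V[i-1] - 2 V[i] + V[i+1]) / h^2 = -rho[i] / eps
    rhs = [-r * h * h / eps for r in rho]
    rhs[0] -= v_left
    rhs[-1] -= v_right
    # Thomas algorithm for the tridiagonal stencil (1, -2, 1).
    c = [0.0] * n
    d = [0.0] * n
    c[0] = 1.0 / -2.0
    d[0] = rhs[0] / -2.0
    for i in range(1, n):
        denom = -2.0 - c[i - 1]
        c[i] = 1.0 / denom
        d[i] = (rhs[i] - d[i - 1]) / denom
    v = [0.0] * n
    v[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        v[i] = d[i] - c[i] * v[i + 1]
    return v
```

With uniform charge and grounded ends the exact solution is V(x) = ρ x (L - x) / (2ε), which the finite-difference stencil reproduces at the grid nodes, so it is a convenient sanity check before coupling the solver to a charge model.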
Parallelization Scheme
Objective:
• Nanoelectronic device simulations with quantum transport and atomistic basis
Approach:
• Multi-level parallelism: voltage, momentum, energy, space
• Parameter sweep over voltages
• Dynamic load balancing in double integral
• Leverage of existing linear solvers (Pardiso, MUMPS, SuperLU, Umfpack, …)
Novel:
• Development of new solvers (Block Cyclic Reduction) with computational interleaving between BC and sparse LSE
Flowchart (quad-level parallelisation scheme, tested on multiple platforms): Initialization of structure and Hamiltonian matrix → Initialize new bias V → Update potential → Get momentum k → Get energy E → Solve Schrödinger Eq. for (V,k,E) → All E? (loop over energy) → All k? (loop over momentum) → Charge and current → Poisson Eq. → Convergence? (self-consistent Poisson iterations) → All V? (loop over voltages) → Done
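The control flow of the flowchart can be sketched as a plain loop nest. This is only an illustration of the structure under loud assumptions: solve_point is a hypothetical stand-in for "Solve Schrödinger Eq. for (V,k,E)", the "Poisson" update is a toy fixed-point formula, and everything runs serially. In the scheme on the slide, the voltage, momentum, and energy loops are distributed over groups of cores and the spatial level lives inside the solver.

```python
import math


def fermi_like(x):
    """Smooth occupation-like function with slope bounded by 1/4 (toy model)."""
    return 1.0 / (1.0 + math.exp(x))


def solve_point(v, k, e, potential):
    """Hypothetical stand-in for 'Solve Schroedinger Eq. for (V,k,E)'."""
    charge = fermi_like(e + 0.1 * k * k - potential)
    current = charge * v
    return charge, current


def run_sweep(voltages, momenta, energies, tol=1e-10, max_iter=100):
    """Serial sketch of the voltage / momentum / energy loop nest."""
    weight = 1.0 / (len(momenta) * len(energies))
    results = []
    for v in voltages:                            # level 1: bias points
        potential = v
        converged = False
        for iteration in range(1, max_iter + 1):  # self-consistent iterations
            charge = 0.0
            current = 0.0
            for k in momenta:                     # level 2: momentum
                for e in energies:                # level 3: energy
                    # level 4 (space) would be inside the solver itself
                    dq, dj = solve_point(v, k, e, potential)
                    charge += weight * dq
                    current += weight * dj
            new_potential = v - 0.5 * charge      # toy "Poisson" update
            converged = abs(new_potential - potential) < tol
            potential = new_potential
            if converged:
                break
        results.append({"V": v, "I": current,
                        "iterations": iteration, "converged": converged})
    return results
```

Every (k, E) pair inside one Poisson iteration is independent of the others, and so is every bias point, which is what makes the outer three levels natural candidates for parallel distribution.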
Overview
Benchmarks: End-to-end Device Simulations
Same code executable for both applications: no specific tuning
Single-Gate MQW III-V HEMT (MIT): I-V curve, 20 bias points
Figure labels: In0.52Al0.48As, Si δ-doping, In0.53Ga0.47As, strained InAs, In0.53Ga0.47As, In0.52Al0.48As
• Specifications:
• symmetric multi-quantum-well structure
• electron flow only, mainly in s-InAs
• sp3d5s* tight-binding without SO
• NA=55,226 atoms in active region
• sizeof(A)=552,260 in Ax=b (|| on 9 CPU)
Double-Gate InAs BTBT FET (patent filed): I-V curve, 20 bias points
Figure labels: HfO2, p+ InAs, intrinsic InAs, n- InAs, HfO2
• Specifications:
• unsymmetric single-material structure
• electron and hole current flow
• sp3s* tight-binding with SO coupling
• NA=54,272 atoms in active region
• sizeof(A)=542,720 in Ax=b (|| on 9 CPU)
Dimension labels in the figures: 55 nm, 40 nm, 55 nm, 3 nm, 25 nm, 90 nm, 40 nm, 2 nm, 2 nm, 5 nm, 16 nm, 3 nm, 3 nm, 2 nm
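The two matrix sizes follow from the atom counts and the tight-binding basis: sp3d5s* has 10 orbitals per atom and sp3s* has 5, and spin-orbit coupling doubles the basis. A short check, with the helper name matrix_size chosen here purely for illustration:

```python
# Orbitals per atom without spin for the two tight-binding models.
ORBITALS_PER_ATOM = {"sp3s*": 5, "sp3d5s*": 10}


def matrix_size(n_atoms, basis, spin_orbit):
    """Dimension of A in Ax=b: atoms x orbitals x spin factor."""
    spin_factor = 2 if spin_orbit else 1
    return n_atoms * ORBITALS_PER_ATOM[basis] * spin_factor


hemt = matrix_size(55_226, "sp3d5s*", spin_orbit=False)  # 552,260
btbt = matrix_size(54_272, "sp3s*", spin_orbit=True)     # 542,720
```

Both devices therefore lead to linear systems of nearly the same dimension, even though they use different basis sets.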
Band-to-band Tunneling Transistor
Double precision, strong scaling up to 221,400 cores
• 4 parallel levels
• maximum of 11,070 cores per bias
• ~20 years on a single core
• <1 hour on 221,400 cores
• almost ideal speed-up up to 221,400 cores
• 1.28 PFlop/s
• 55.4% of peak
Plot annotations: 96% || efficiency, 1.28 PFlop/s, 78.5×, 82×
Figure labels: HfO2, p+ InAs, intrinsic InAs, n- InAs, HfO2
High Electron Mobility Transistor
Double and mixed precision scheme, strong scaling from 2,700 up to 221,400 cores
• 4 parallel levels
• maximum of 11,070 cores per bias
• 5 Poisson iterations
• mixed: last Poisson iteration in double precision
• 1.27 PFlop/s double
• 54% of peak
• 1.44 PFlop/s mixed
Plot annotations: 92% || efficiency, 1.27 PFlop/s, 1.44 PFlop/s, 75.5×, 82×
Figure labels: In0.52Al0.48As, In0.53Ga0.47As, strained InAs, In0.53Ga0.47As, In0.52Al0.48As
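The quoted parallel efficiencies can be reproduced from the plot annotations. This is a sketch under two assumptions: that 78.5× (BTBT transistor) and 75.5× (HEMT) are the measured speed-ups and 82× the ideal one, and that both runs use the 2,700-core baseline, which the slides state only for the HEMT. The function name parallel_efficiency is hypothetical.

```python
def parallel_efficiency(measured_speedup, base_cores, max_cores):
    """Strong-scaling efficiency: measured speed-up over ideal speed-up."""
    ideal_speedup = max_cores / base_cores
    return measured_speedup / ideal_speedup


# Ideal speed-up from 2,700 to 221,400 cores is 221400 / 2700 = 82.
btbt_eff = parallel_efficiency(78.5, 2_700, 221_400)  # about 0.957
hemt_eff = parallel_efficiency(75.5, 2_700, 221_400)  # about 0.921
```

Rounded to whole percent these give the 96% and 92% shown on the two scaling slides.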
Evolution of Nanoelectronic Device Simulation
Time to compute 1 Poisson iteration for 1 bias point on 11,070 cores
• NEGF: most popular technique, but not most efficient
• WF: computationally more efficient
• BCR: 20% faster than MUMPS and allows comp. interleaving
• as compared to standard techniques, OMEN 10.7x faster (double precision)
Bar chart: walltime (s), axis from 0 to 8000; bar labels: NEGF, MUMPS, BCR, Load Balance, Computational Interleaving, Mixed, Experiment; speed-up annotations: 10.7x, 4x, 1.7x, 1.2x, 1.3x, 1.1x
Figure labels: In0.52Al0.48As, In0.53Ga0.47As, strained InAs, In0.53Ga0.47As, In0.52Al0.48As
Overview
Outlook: Could we run on larger systems?
So far: end-to-end simulation of I-V curve with 20 bias points on 221,400 cores => 11,070 cores per bias point
Fact: the loop over bias points is embarrassingly parallel
Consequence: “Case 2” with 20 instead of 10 bias points could easily run on 2*219,300 = 438,600 cores and still reach more than 50% of peak performance
Figure labels: In0.52Al0.48As, In0.53Ga0.47As, strained InAs, In0.53Ga0.47As, In0.52Al0.48As
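Because the bias loop is embarrassingly parallel, the core count grows linearly with the number of bias points. The sketch below works through the "so far" configuration only. It assumes that the 9 CPUs per linear solve quoted on the benchmark slide form the innermost (spatial) level of the same run; the function name core_layout and the returned keys are hypothetical.

```python
def core_layout(total_cores, n_bias, cores_per_solve):
    """Split cores evenly over bias points, then into groups per linear solve."""
    if total_cores % n_bias:
        raise ValueError("cores do not divide evenly over bias points")
    per_bias = total_cores // n_bias
    if per_bias % cores_per_solve:
        raise ValueError("cores per bias do not divide evenly into solver groups")
    return {
        "cores_per_bias": per_bias,
        # groups left for the momentum and energy levels combined
        "momentum_energy_groups": per_bias // cores_per_solve,
    }


layout = core_layout(221_400, n_bias=20, cores_per_solve=9)
# cores_per_bias = 11,070 and momentum_energy_groups = 1,230
```

Under these assumptions, adding bias points adds whole blocks of 11,070 cores without changing the work inside each block, which is why the per-bias efficiency should carry over to a larger machine.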
Conclusion
Figures: III-V HEMT (In0.52Al0.48As, Si δ-doping, In0.53Ga0.47As, strained InAs, In0.53Ga0.47As, In0.52Al0.48As) and CNT FET (Source, Drain, Gate, HfO2, Air)
Acknowledgment