High-Performance Quantum Simulation: A challenge to Schrödinger equation on 256^4 grids
*Toshiyuki Imamura (今村俊幸) (1, 3)
Thanks to Susumu Yamada (2, 3), Takuma Kano (2), and Masahiko Machida (2, 3)
1. UEC (University of Electro-Communications, 電気通信大学)
2. CCSE JAEA (Japan Atomic Energy Agency)
3. CREST JST (Japan Science and Technology Agency)
Outline
• Physics: Review of Quantum Simulation
• Mathematics: Numerical Algorithm
• Grand Challenge: Parallel Computing on the Earth Simulator (ES)
• Numerical Results
• Conclusion
RANMEP2008, NCTS, Taiwan (Tsing Hua University, Hsinchu, Taiwan)
1.1 Quantum Simulation (1/2)
[Figure: an S-I-S junction down-sized from width W to W'. Labels: Classical Equation of Motion; Schrödinger Equation]
Crossover from classical to quantum?
1.2 Quantum Simulation (2/2)
Numerical simulation of the coupled Schrödinger equation
[Equation image not preserved]
• Ψ: the possible state; not a value but a vector!
• H: the Hamiltonian
• α: coupling
• β: 1/mass ∝ 1/W
• Numerical method for solving the above equation: spectral expansion in the eigenvectors {u_n}
• This requires exact diagonalization of the Hamiltonian
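For reference, the spectral expansion named on this slide has the following standard textbook form. This is a sketch that assumes a time-independent Hamiltonian H and units with ħ = 1; the symbols E_n and c_n are introduced here and do not appear on the slide.

```latex
\begin{aligned}
H u_n &= E_n u_n, \\
\Psi(t) &= \sum_n c_n \, e^{-i E_n t} \, u_n,
\qquad c_n = \langle u_n, \Psi(0) \rangle .
\end{aligned}
```

Each term satisfies i ∂Ψ/∂t = HΨ on its own, so once the eigenpairs (E_n, u_n) are known the time evolution reduces to updating phases. That is why the eigensolver is the first step of the simulation.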
2.1 Krylov Subspace Iteration
• Lanczos (traditional method)
  • Krylov + GS (Gram-Schmidt): simple, but a shift-and-invert version is needed
• LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient)
  • {Krylov basis, Ritz vector, prior vector}: CG approach
  • Restarts at every iteration
  • Inverse-free -> less communication
[Formulas for Lanczos and LOBPCG not preserved]
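Since the slide's formulas did not survive, here is the usual single-vector form of the LOBPCG iteration as a hedged reminder. It follows the standard description of the method rather than the authors' slide; T denotes a preconditioner approximating H^{-1}, and α, τ, γ are scalar coefficients introduced here.

```latex
\begin{aligned}
\rho(x) &= \frac{x^{T} H x}{x^{T} x}, \\
w^{(k)} &= T \left( H x^{(k)} - \rho\!\left(x^{(k)}\right) x^{(k)} \right), \\
x^{(k+1)} &= \alpha^{(k)} w^{(k)} + \tau^{(k)} x^{(k)} + \gamma^{(k)} p^{(k)}, \\
p^{(k+1)} &= \alpha^{(k)} w^{(k)} + \gamma^{(k)} p^{(k)}, \qquad p^{(0)} = 0 .
\end{aligned}
```

The coefficients are chosen by a Rayleigh-Ritz step so that ρ(x^{(k+1)}) is minimal over span{w^{(k)}, x^{(k)}, p^{(k)}}. Only products with H and applications of T appear, with no linear solve against H itself, which is the sense in which the method is inverse-free.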
2.2 LOBPCG
[Figure labels: 3 × MV per iteration; 1 × MV per iteration]
• Costly! Since the block is updated at every iteration, a matrix-vector (MV) operation is also required at every iteration.
• Other difficulties in implementation
  • Breakdown of linear independence
    • We make our own DSYGV using LDL and deflation (not Cholesky)
  • Growth of numerical error in {W, X, P}
    • Detect the numerical error and recalculate them automatically
  • Choice of the shift
  • Portability
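To see why linear independence matters here, it helps to write out the small projected problem that the Rayleigh-Ritz step solves. The notation below (S, c, μ) is introduced for illustration and is not from the slides.

```latex
S = \begin{bmatrix} W & X & P \end{bmatrix},
\qquad
\left( S^{T} H S \right) c = \mu \left( S^{T} S \right) c .
```

LAPACK's DSYGV solves a generalized symmetric-definite problem of this shape by first taking a Cholesky factorization of the right-hand matrix, so it needs S^T S to be positive definite. If the columns of W, X, and P become nearly dependent, S^T S is nearly singular and that factorization can fail, which is presumably the motivation for the LDL-plus-deflation replacement mentioned above.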
2.3 Preconditioning
[Figure: residual error (1e-6 to 100, log scale) vs. iteration count (0 to 500) for four cases: no preconditioner, H1 (point Jacobi), H2 (LDL), H3 (LDL)]
• T ~ H^-1, where H = A + B1 + B2 + B3 + B4 + C12 + C23 + C34
  • H1: H ~ A
  • H2: H ~ (A + B1)
  • H3: H ~ (A + B1) A^-1 (A + B2)
• Here, A is diagonal and A + Bx is block-tridiagonal
• Shift + LDL^T is used
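The slide says that a shift plus an LDL^T factorization is used to apply the tridiagonal factors. A minimal sketch of that idea follows, using a scalar (not block) symmetric tridiagonal matrix for brevity. The function names, the way the shift enters (as T - shift·I), and the test matrix are all assumptions made for illustration, not the authors' implementation.

```python
def ldlt_tridiagonal(diag, off):
    """Factor a symmetric tridiagonal matrix as L D L^T.

    diag: main diagonal (length n); off: sub-diagonal (length n - 1).
    Returns (d, l): the diagonal of D and the sub-diagonal of unit-lower L.
    """
    n = len(diag)
    d = [0.0] * n
    l = [0.0] * (n - 1)
    d[0] = diag[0]
    for i in range(1, n):
        l[i - 1] = off[i - 1] / d[i - 1]
        d[i] = diag[i] - l[i - 1] * off[i - 1]
    return d, l


def ldlt_solve(d, l, r):
    """Solve (L D L^T) x = r using the factors from ldlt_tridiagonal."""
    n = len(d)
    y = list(r)
    for i in range(1, n):            # forward substitution: L y = r
        y[i] -= l[i - 1] * y[i - 1]
    x = [y[i] / d[i] for i in range(n)]  # diagonal scaling: D z = y
    for i in range(n - 2, -1, -1):   # backward substitution: L^T x = z
        x[i] -= l[i] * x[i + 1]
    return x


def apply_preconditioner(diag, off, shift, r):
    """Apply (T - shift*I)^{-1} to r (hypothetical shift convention)."""
    shifted = [a - shift for a in diag]
    d, l = ldlt_tridiagonal(shifted, off)
    return ldlt_solve(d, l, r)


# Small check: T = tridiag(-1, 2, -1), r = T @ [1, 2, 3] = [0, 0, 4].
x = apply_preconditioner([2.0, 2.0, 2.0], [-1.0, -1.0], 0.0, [0.0, 0.0, 4.0])
print(x)
```

The factorization and the two substitutions each cost O(n), so applying such a preconditioner is about as cheap as one pass over the vector, at the price of a sequential recurrence along the factored dimension.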
3.2 Technical Issues on the Earth Simulator
3-level parallelism
• Inter-node:
  • MPI (Message Passing Interface)
  • Low latency (6.63 µs)
  • High bandwidth (11.63 GB/s)
• Intra-node:
  • Auto-parallelization
  • OpenMP (thread-level parallelism)
• Vector processor (innermost loops):
  • Automatic or manual vectorization
[Figure: several nodes linked at the inter-node level; inside each node, Processors 0 to 7 at the intra-node level, each doing vector processing]
• Programming model: a hybrid of distributed parallelism and thread parallelism.
3.3 Quantum Simulation Parallel Code
• Application flow chart (in order)
  • Eigenmode calculation: parallel LOBPCG solver developed on ES
  • Time integrator: parallel code on ES
  • Quantum state analyzer: parallel code on ES
  • Visualization: visualized by AVS
3.4 Handling of Huge Data
• Data distribution in the case of a 4D array
[Figure: 2-dimensional vs. 1-dimensional loop decomposition of a 4D array with indices (i, j, k, l). Labels: (k, l) / NP; j / MP; loop length = 256; vector processing; intra-node parallelization]
• NP: number of MPI processes
• MP: number of microtasking processes (= 8)
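The "(k, l) / NP" label suggests that the fused (k, l) index range is divided among the MPI processes. The sketch below shows one way such a balanced block split could be computed, along with the memory footprint of a single 256^4 vector. The helper names, the 0-based indexing, the process count of 4096, and the 8-byte word size are assumptions for illustration only.

```python
N = 256  # grid points per dimension (from the slides)


def block_range(n_items, n_parts, part):
    """Half-open [start, stop) range owned by `part` in a balanced block split."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def local_kl_pairs(rank, n_procs, n=N):
    """(k, l) pairs owned by one MPI rank when the fused k-l loop is split.

    k varies fastest, matching an array w(i, j, k, l) with l outermost.
    """
    start, stop = block_range(n * n, n_procs, rank)
    return [(idx % n, idx // n) for idx in range(start, stop)]


def bytes_per_vector(n=N, word=8):
    """Storage for one full 4D vector of `word`-byte reals."""
    return n ** 4 * word


# Hypothetical run with 4096 MPI processes: 256*256 / 4096 = 16 pairs each.
print(len(local_kl_pairs(0, 4096)))
print(bytes_per_vector() / 2 ** 30)  # 32.0 GiB per vector
```

At 32 GiB per vector in double precision, even the handful of work vectors that an eigensolver needs cannot sit on a single node, which is why the data layout is a first-order design concern.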
3.5 Parallel LOBPCG
• LOBPCG
  • The core of the implementation is matrix-vector multiplication.
  • The 3-level parallelism is implemented carefully.
  • For inter-node parallelization, communication pipelining is used.
  • In the Rayleigh-Ritz part, ScaLAPACK is used.

    do l=1,256          ! inter-node parallelism
      do k=1,256        ! inter-node parallelism
        do j=1,256      ! intra-node (thread) parallelism
          do i=1,256    ! vectorization
            w(i,j,k,l)=a(i,j,k,l)*v(i,j,k,l) &
                      +b*(v(i+1,j,k,l)+...) &
                      +c*(v(i+1,j+1,k,l)+...)
          enddo
        enddo
      enddo
    enddo
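The loop nest above is a matrix-free stencil: the Hamiltonian is never stored, only applied. A reduced 2D sketch of the same pattern follows. The slide elides which neighbors enter the b and c terms, so this example assumes the four axis neighbors for b, the four diagonal neighbors for c, and zero values outside the grid; those choices and the function name are illustrative guesses, not the authors' stencil.

```python
def stencil_matvec(a, v, b, c):
    """Compute w = H v for a 2D stencil without forming H.

    a: per-point diagonal coefficients (n x n list of lists)
    v: input vector on the grid (n x n list of lists)
    b: weight of the axis neighbors (assumed: +/-1 in each dimension)
    c: weight of the diagonal neighbors (assumed: all four corners)
    Values outside the grid are treated as zero (assumed boundary condition).
    """
    n = len(v)

    def at(i, j):
        return v[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    w = [[0.0] * n for _ in range(n)]
    for j in range(n):          # on ES this level would be thread-parallel
        for i in range(n):      # on ES this level would be vectorized
            axis = at(i + 1, j) + at(i - 1, j) + at(i, j + 1) + at(i, j - 1)
            diag = (at(i + 1, j + 1) + at(i + 1, j - 1)
                    + at(i - 1, j + 1) + at(i - 1, j - 1))
            w[i][j] = a[i][j] * v[i][j] + b * axis + c * diag
    return w


ones = [[1.0] * 3 for _ in range(3)]
twos = [[2.0] * 3 for _ in range(3)]
print(stencil_matvec(twos, ones, 1.0, 0.5))
```

Because each output point reads only a fixed set of neighbors, the work per point is constant and the only inter-process traffic is the exchange of boundary layers along the distributed dimensions.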
4.1 Numerical Result
• Preliminary test of our eigensolver
• 4-junction system -> 256^4 dimensions
[Figure: convergence history; residual error (1e-12 to 1e+4, log scale) vs. iteration count (0 to 3000), one curve each for the ground state through the 10th lowest state]
[Figure: performance for 5 eigenmodes and for 10 eigenmodes]
4.2 Numerical Result (Scenario)
• 4-junction system: 256^4
• Discretization: 256 grid points
• Question: synchronization or independence (localization)?
• The simplest case (two junctions): capacitive coupling?
• Potential change: only a single junction
[Figure: initial state; axis from 0 to 2π]
4.3 Numerical Result
Two-Stacked Intrinsic Josephson Junction
• Classical regime: independent dynamics
• Quantum regime: ?
[Figure: α = 0.4, β = 0.2; four snapshots over axes q1 and q2 at t = 0.0, 2.9, 9.2, and 10.0 (a.u.)]
[Figure: α = 0.4, β = 1.0; four snapshots over axes q1 and q2 at t = 0.0, 2.5, 4.2, and 10.0 (a.u.)]
Two Junctions
• Weakly quantum (classical): independence
• Strongly quantum: synchronization
Three Junctions
[Figure: α = 0.4, β = 0.2]
[Figure: α = 0.4, β = 1.0]
4 Junctions
[Figure: <q1>, <q2>, <q3>, <q4> vs. t (a.u.); panel (a): α = 0.4, β = 0.2; panel (b): α = 0.4, β = 1.0]
Quantum Assisted Synchronization
5. Conclusion
• Collective MQT (macroscopic quantum tunneling) in intrinsic Josephson junctions via parallel computing on ES
  • Direct quantum simulation (4 junctions)
  • Quantum (synchronous) vs. classical (localized)
  • Quantum-assisted synchronization
• High-performance computing
  • Novel eigenvalue algorithm: LOBPCG
  • Communication-free (or reduced-communication) implementation
  • Sustained 7 TFLOPS (21.4% of peak)
  • Toward peta-scale computing?
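The two performance figures on this slide imply the peak rate of the partition used. The arithmetic below makes that explicit. The per-processor peak of 8 GFLOPS is a known Earth Simulator specification that is not stated on the slides, and the resulting node count is an inference from rounded numbers, not a figure reported by the authors.

```python
SUSTAINED_TFLOPS = 7.0     # from the slide
FRACTION_OF_PEAK = 0.214   # from the slide

# Assumptions not stated on the slides:
PEAK_PER_PROCESSOR_TFLOPS = 0.008  # 8 GFLOPS per ES vector processor
PROCESSORS_PER_NODE = 8            # Processors 0 to 7 in each node


def implied_peak_tflops(sustained, fraction):
    """Peak rate implied by a sustained rate and its fraction of peak."""
    return sustained / fraction


peak = implied_peak_tflops(SUSTAINED_TFLOPS, FRACTION_OF_PEAK)
processors = peak / PEAK_PER_PROCESSOR_TFLOPS
nodes = processors / PROCESSORS_PER_NODE

print(f"implied peak: {peak:.1f} TFLOPS")
print(f"implied processors: {processors:.0f}, nodes: {nodes:.0f}")
```

The implied peak is about 32.7 TFLOPS, which under these assumptions corresponds to roughly 4,100 processors, or a partition of about 512 nodes.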
Thank you! 謝謝
Further information
• Physics: machida.masahiko@jaea.go.jp
• HPC: imamura@im.uec.ac.jp