Stochastic Optimization of Complex Energy Systems on High-Performance Computers • Cosmin G. Petra, Mathematics and Computer Science Division, Argonne National Laboratory, petra@mcs.anl.gov • SIAM CSE 2013 • Joint work with Olaf Schenk (USI Lugano), Miles Lubin (MIT), Klaus Gaertner (WIAS Berlin)
Outline • Application of HPC to power-grid optimization under uncertainty • Parallel interior-point solver (PIPS-IPM) • structure-exploiting • Revisiting the linear algebra • Experiments on BG/P with the new features
Stochastic unit commitment with wind power • Wind forecast – WRF (Weather Research and Forecasting) model • Real-time grid-nested 24h simulation • 30 samples require 1h on 500 CPUs (Jazz@Argonne) • Figure: thermal generators and wind farms • Slide courtesy of V. Zavala & E. Constantinescu
Stochastic Formulation • Discrete distribution leads to block-angular LP
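For reference, a generic two-stage stochastic LP with recourse, written as a sketch; the symbols c_0, A_0, b_0, T_i, W_i, p_i are assumed notation (chosen to match the W_i blocks cited on later slides), not taken verbatim from the talk:

\[
\min_{x_0 \ge 0} \; c_0^T x_0 + \sum_{i=1}^{N} p_i \, Q(x_0, \xi_i)
\quad \text{s.t.} \quad A_0 x_0 = b_0,
\]
\[
Q(x_0, \xi_i) = \min_{x_i \ge 0} \; c_i^T x_i
\quad \text{s.t.} \quad W_i x_i = b_i - T_i x_0,
\]

where the discrete distribution is given by scenarios \(\xi_1,\dots,\xi_N\) with probabilities \(p_1,\dots,p_N\).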
Large-scale (dual) block-angular LPs • Extensive form (sketched below) • In the terminology of stochastic LPs: • First-stage variables (decision now): x0 • Second-stage variables (recourse decision): x1, …, xN • Each diagonal block is a realization of a random variable (scenario)
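A sketch of the extensive form and its dual block-angular constraint matrix, using the same assumed block names as above:

\[
\min \; c_0^T x_0 + p_1 c_1^T x_1 + \dots + p_N c_N^T x_N
\quad \text{s.t.} \quad
\begin{bmatrix}
A_0 & & & \\
T_1 & W_1 & & \\
\vdots & & \ddots & \\
T_N & & & W_N
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_N \end{bmatrix}
=
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_N \end{bmatrix},
\qquad x_0, x_1, \dots, x_N \ge 0 .
\]

Each scenario contributes a diagonal block \(W_i\) and is coupled to the first stage only through its \(T_i\) block.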
Computational challenges and difficulties • May require many scenarios (100s, 1,000s, 10,000s, …) to accurately model uncertainty • "Large" scenarios (Wi up to 250,000 x 250,000) • "Large" 1st stage (1,000s, 10,000s of variables) • Easy to build a practical instance that requires 100+ GB of RAM to solve, which requires distributed memory • Real-time solution needed in our applications
Linear algebra of primal-dual interior-point methods (IPM) • IPM applied to a convex quadratic problem reduces each iteration to solving a linear (KKT) system • 2 solves per IPM iteration: predictor directions and corrector directions • Two-stage SP: arrow-shaped linear system (modulo a permutation); N is the number of scenarios • Multi-stage SP: nested arrow-shaped linear systems
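The two-stage arrow-shaped KKT system, written in assumed notation (scenario diagonal blocks \(K_i\), border blocks \(B_i\), first-stage block \(K_0\); these names are chosen here to match the Schur complement steps on the following slides):

\[
\begin{bmatrix}
K_1 & & & B_1 \\
& \ddots & & \vdots \\
& & K_N & B_N \\
B_1^T & \cdots & B_N^T & K_0
\end{bmatrix}
\begin{bmatrix} \Delta z_1 \\ \vdots \\ \Delta z_N \\ \Delta z_0 \end{bmatrix}
=
\begin{bmatrix} r_1 \\ \vdots \\ r_N \\ r_0 \end{bmatrix}.
\]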
Parallel Solution Procedure for KKT System • Five-step Schur-complement procedure (sketched below) • Steps 1 and 5 trivially parallel • "Scenario-based decomposition" • Steps 1, 2, 3 are >95% of total execution time
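A minimal dense sketch of the five steps as they are referenced on the following slides; the step numbering and the block names K_i, B_i, K_0 are assumptions consistent with those slides, and PIPS-IPM uses sparse, distributed factorizations rather than numpy:

    import numpy as np

    def solve_arrow_kkt(K, B, K0, r, r0):
        """Schur-complement solve of the arrow-shaped KKT system.
        K[i]: scenario diagonal blocks, B[i]: border blocks,
        K0: first-stage block, r[i], r0: right-hand sides."""
        N = len(K)
        # Step 1 (one scenario per MPI process): solve with K_i against the
        # columns of B_i and form the local contribution B_i^T K_i^{-1} B_i
        contrib = [B[i].T @ np.linalg.solve(K[i], B[i]) for i in range(N)]
        # Step 2: assemble the global (first-stage) Schur complement C
        C = K0 - sum(contrib)
        # Steps 3-4: factorize C and solve for the first-stage direction
        rhs0 = r0 - sum(B[i].T @ np.linalg.solve(K[i], r[i]) for i in range(N))
        dz0 = np.linalg.solve(C, rhs0)
        # Step 5 (again one scenario per process): back-substitute per scenario
        dz = [np.linalg.solve(K[i], r[i] - B[i] @ dz0) for i in range(N)]
        return dz, dz0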
Components of Execution Time • (chart; note the break in the y-axis scale)
Scenario Calculations – Steps 1 and 5 • Each scenario is assigned to an MPI process, which locally performs steps 1 and 5 • Matrices are sparse and symmetric indefinite (symmetric with positive and negative eigenvalues) • Computing the local contribution B_i^T K_i^{-1} B_i is very expensive: solving with the factors of K_i against the non-zero columns of B_i and multiplying from the left with B_i^T • 4 hours 10 minutes wall time to solve a 4h-horizon problem with 8k scenarios on 8k nodes • Need to run under strict time requirements • For example, solve a 24h-horizon problem in less than 1h
Revisiting scenario computations for shared-memory • Multiple sparse right-hand sides • Triangular-solve phase is hard to parallelize in shared memory (multi-core) • Factorization phase speeds up very well and achieves considerable peak performance • Our approach: incomplete factorization of the augmented matrix [ K_i  B_i ; B_i^T  0 ] (see the sketch below) • Stop the factorization after the elimination of the (1,1) block • The local Schur complement (up to sign, -B_i^T K_i^{-1} B_i) will sit in the (2,2) block
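A minimal dense sketch of the idea; the augmented matrix and the names K_i, B_i are assumptions consistent with the slides, and PARDISO-SC performs the partial factorization on sparse matrices rather than with dense arithmetic:

    import numpy as np

    def schur_via_solves(K, B):
        """Baseline (steps-1/5 style): solve with the factors of K against each
        non-zero column of B, then multiply from the left by B^T."""
        return B.T @ np.linalg.solve(K, B)

    def schur_via_block_elimination(K, B):
        """Eliminate the (1,1) block of the augmented matrix [[K, B], [B^T, 0]];
        the (2,2) block of the partially factorized matrix then holds
        -B^T K^{-1} B, i.e. the local Schur complement up to sign."""
        n, m = B.shape
        M = np.block([[K, B], [B.T, np.zeros((m, m))]])
        S22 = M[n:, n:] - M[n:, :n] @ np.linalg.solve(M[:n, :n], M[:n, n:])
        return -S22

    # Consistency check on random data (dense here only for illustration)
    rng = np.random.default_rng(0)
    K = rng.standard_normal((6, 6)); K = K + K.T + 10.0 * np.eye(6)
    B = rng.standard_normal((6, 3))
    assert np.allclose(schur_via_solves(K, B), schur_via_block_elimination(K, B))

The point of the reformulation is that the work moves from the poorly scaling triangular-solve kernel into the factorization kernel, which parallelizes well on multi-core nodes.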
Implementation • Requires modification of the linear solver: PARDISO (Schenk) → PARDISO-SC • Pivot perturbations during factorization are needed to maintain numerical stability • Errors due to perturbations are normally absorbed by iterative refinement • This would be extremely expensive in our case (many right-hand sides) • Instead, we let errors propagate into the "global" Schur complement C (Step 2) • Factorize the perturbed C, denoted C~ (Step 3) • After Steps 1, 2 and 3, we have the factorization of an approximation K~ of the KKT matrix K
Pivot error absorption by preconditioned BiCGStab • Still we have to solve with the exact matrix K • "Absorb errors" by solving Kz = r using preconditioned BiCGStab • Numerical experiments showed it is more robust than iterative refinement • Preconditioner is K~ • Each BiCGStab iteration requires • 2 mat-vecs: Kz • 2 applications of the preconditioner: solves with K~ • One application of the preconditioner amounts to performing "solve" steps 4 and 5 for K~
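A minimal sketch of the error-absorption step using SciPy's BiCGStab as a stand-in for PIPS-IPM's own Krylov loop; apply_K and apply_Ktilde_inv are hypothetical callbacks, the latter performing steps 4 and 5 with the perturbed factors:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, bicgstab

    def absorb_pivot_errors(apply_K, apply_Ktilde_inv, r):
        """Solve K z = r with BiCGStab preconditioned by the perturbed factorization K~.
        apply_K(v)          : exact mat-vec with the KKT matrix K
        apply_Ktilde_inv(v) : one 'solve' with the perturbed factors (steps 4 and 5 for K~)"""
        n = r.shape[0]
        K_op = LinearOperator((n, n), matvec=apply_K)
        M_op = LinearOperator((n, n), matvec=apply_Ktilde_inv)
        z, info = bicgstab(K_op, r, M=M_op)
        if info != 0:
            raise RuntimeError(f"BiCGStab did not converge (info={info})")
        return z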
Test architecture • "Intrepid" Blue Gene/P supercomputer • 40,960 nodes • Custom interconnect • Each node has a quad-core 850 MHz PowerPC processor and 2 GB RAM • DOE INCITE Award 2012-2013 – 24 million core-hours
Numerical experiments • 4h (UC4), 12h (UC12), 24h (UC24) horizon problems • 1 scenario per node (4 cores per scenario) • Large-scale: 12h horizon, up to 32k scenarios and 128k cores (k = 1,024) • 16k scenarios – 2.08 billion variables, 1.81 billion constraints, KKT system size = 3.89 billion • LAPACK + SMP ESSL BLAS for first-stage linear systems • PARDISO-SC for second-stage linear systems
Time per IPM iteration • UC12, 32k scenarios, 32k nodes (128k cores) • BiCGStab iteration count ranges from 0 to 1.5 per IPM iteration • Cost of absorbing factorization perturbation errors is between 10% and 30% of total iteration cost
Solve to completion – UC12 • Before: 4 hours 10 minutes wall time to solve UC4 problem with 8k scenarios on 8k nodes • Now: UC12
Conclusions and Future Considerations • Multicore-friendly reformulation of sparse linear algebra computations leads to one order of magnitude faster execution times • Fast factorization-based computation of the Schur complement • Robust and cheap pivot-error absorption via Krylov iterative methods • Parallel efficiency of PIPS remains good • Performance evaluation on today's supercomputers • IBM BG/Q • Cray XK7, XC30