Scalable Stochastic Programming
Cosmin Petra and Mihai Anitescu
Mathematics and Computer Science Division, Argonne National Laboratory
INFORMS Computing Society Conference, Monterey, California, January 2011
petra@mcs.anl.gov
Motivation • Sources of uncertainty in complex energy systems • Weather • Consumer demand • Market prices • Applications @Argonne – Anitescu, Constantinescu, Zavala • Stochastic unit commitment with wind power generation • Energy management of co-generation • Economic optimization of a building energy system
Stochastic Unit Commitment with Wind Power (Zavala’s SA2 talk) • Wind forecast – WRF (Weather Research and Forecasting) model • Real-time grid-nested 24h simulation • 30 samples require 1h on 500 CPUs (Jazz@Argonne) • Slide courtesy of V. Zavala & E. Constantinescu
Optimization under Uncertainty • Two-stage stochastic programming with recourse (“here-and-now” first-stage decisions) • Sample average approximation (SAA): the expectation over a (possibly continuous) distribution is replaced by a discrete average over M samples • Sampling and inference analysis
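In standard two-stage notation (generic symbols, used here for concreteness):

```latex
\min_{x}\; c^{T}x \;+\; \mathbb{E}_{\xi}\big[\,Q(x,\xi)\,\big]
\quad\text{subj. to}\quad Ax=b,\;x\ge 0,
\qquad\text{where}\quad
Q(x,\xi)=\min_{y}\; q(\xi)^{T}y
\quad\text{subj. to}\quad W(\xi)\,y = h(\xi)-T(\xi)\,x,\;y\ge 0.
```

The SAA problem replaces the expectation with an average over M sampled scenarios:

```latex
\min_{x}\; c^{T}x \;+\; \frac{1}{M}\sum_{i=1}^{M} Q(x,\xi_i)
\quad\text{subj. to}\quad Ax=b,\;x\ge 0.
```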
Linear Algebra of Primal-Dual Interior-Point Methods • At each IPM iteration for the convex quadratic problem, a symmetric saddle-point (KKT) linear system is solved • For two-stage SP this system is arrow-shaped (via a permutation); multi-stage SP yields a nested arrow shape
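In the same generic notation, the per-iteration IPM system for the convex QP, and its arrow-shaped form for the two-stage SAA problem with N scenarios, look as follows (symbols are illustrative):

```latex
\min_{x}\;\tfrac{1}{2}x^{T}Qx + c^{T}x \quad\text{subj. to}\quad Ax=b,\;x\ge 0
\;\;\Longrightarrow\;\;
\begin{bmatrix} Q+D & A^{T}\\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x\\ \Delta\lambda \end{bmatrix}
=
\begin{bmatrix} r_{1}\\ r_{2} \end{bmatrix},
\qquad D \text{ diagonal (barrier term)},
```

```latex
H \;=\;
\begin{bmatrix}
K_{1} &        &       & B_{1}\\
      & \ddots &       & \vdots\\
      &        & K_{N} & B_{N}\\
B_{1}^{T} & \cdots & B_{N}^{T} & K_{0}
\end{bmatrix},
```

where K_i is the (symmetric indefinite) KKT block of scenario i, K_0 that of the first stage, and B_i couples scenario i to the first-stage variables.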
The Direct Schur Complement Method (DSC) • Uses the arrow shape of H • 1. Implicit factorization • 2. Solving Hz = r: 2.1 back substitution, 2.2 diagonal solve, 2.3 forward substitution
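With the notation above, DSC eliminates the scenario blocks implicitly; the first-stage Schur complement and the resulting block elimination for Hz = r are (a standard reduction, written here for reference):

```latex
C \;=\; K_{0}-\sum_{i=1}^{N} B_{i}^{T}K_{i}^{-1}B_{i},
\qquad
\tilde r_{0} \;=\; r_{0}-\sum_{i=1}^{N} B_{i}^{T}K_{i}^{-1}r_{i},
\qquad
C\,z_{0} \;=\; \tilde r_{0},
\qquad
z_{i} \;=\; K_{i}^{-1}\big(r_{i}-B_{i}z_{0}\big),\;\; i=1,\dots,N.
```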
Parallelizing DSC – 1. Factorization phase • Scenario blocks are factored in parallel on processes 1..p (diagram on the original slide) • Factorization of the 1st-stage Schur complement matrix = BOTTLENECK
Parallelizing DSC – 2. Backsolve • Scenario backsolves run in parallel on processes 1..p (diagram on the original slide) • 1st-stage backsolve = BOTTLENECK
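A minimal mpi4py sketch of this parallel pattern, with dense NumPy stand-ins and LU factorization in place of the sparse symmetric indefinite solvers PIPS actually uses; all names here are illustrative, not PIPS's API:

```python
from mpi4py import MPI
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def parallel_dsc_solve(K0, my_Ks, my_Bs, r0, my_rs, comm):
    """Each rank factors its own scenario blocks K_i and accumulates its
    contribution to the first-stage Schur complement and reduced RHS;
    rank 0 factors/solves the first-stage system (the DSC bottleneck)
    and broadcasts z0 for the independent scenario backsolves."""
    facts = [lu_factor(Ki) for Ki in my_Ks]            # local factorization phase
    C_contrib = np.zeros_like(K0)
    rhs_contrib = np.zeros_like(r0)
    for f, Bi, ri in zip(facts, my_Bs, my_rs):
        C_contrib += Bi.T @ lu_solve(f, Bi)            # B_i^T K_i^{-1} B_i
        rhs_contrib += Bi.T @ lu_solve(f, ri)          # B_i^T K_i^{-1} r_i

    C_sum, rhs_sum = np.zeros_like(K0), np.zeros_like(r0)
    comm.Reduce(C_contrib, C_sum, op=MPI.SUM, root=0)
    comm.Reduce(rhs_contrib, rhs_sum, op=MPI.SUM, root=0)

    if comm.rank == 0:
        C = K0 - C_sum                                 # first-stage Schur complement
        z0 = np.linalg.solve(C, r0 - rhs_sum)          # serial first-stage solve
    else:
        z0 = None
    z0 = comm.bcast(z0, root=0)                        # first-stage backsolve result

    my_zs = [lu_solve(f, ri - Bi @ z0)                 # local scenario backsolves
             for f, Bi, ri in zip(facts, my_Bs, my_rs)]
    return z0, my_zs
```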
Scalability of DSC • Unit commitment problem: 76.7% parallel efficiency – but not always the case • Large number of 1st-stage variables: 38.6% efficiency on Fusion @ Argonne
Preconditioned Schur Complement (PSC) • The preconditioner is handled on a separate process • REMOVES the factorization bottleneck • Slightly larger backsolve bottleneck
The Stochastic Preconditioner • The exact structure of C is known • Use an IID subset of n scenarios to build an approximation: the stochastic preconditioner (Petra & Anitescu, 2010) • For C, use the constraint preconditioner (Keller et al., 2000)
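In the notation above, a sketch of the construction (the N/n rescaling is the natural subsample-average choice, used here for illustration; see Petra & Anitescu, 2010 for the precise preconditioner):

```latex
C \;=\; K_{0}-\sum_{i=1}^{N} B_{i}^{T}K_{i}^{-1}B_{i},
\qquad
\widetilde C \;=\; K_{0}-\frac{N}{n}\sum_{i\in S} B_{i}^{T}K_{i}^{-1}B_{i},
\qquad S\subset\{1,\dots,N\},\;|S|=n\text{ drawn IID}.
```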
The “Ugly” Unit Commitment Problem • DSC on P processes vs. PSC on P+1 processes • Optimal use of PSC – linear scaling • 120 scenarios: factorization of the preconditioner cannot be hidden anymore
Quality of the Stochastic Preconditioner • “Exponentially” better preconditioning (Petra & Anitescu, 2010) • Proof: Hoeffding inequality • Assumptions on the problem’s random data: boundedness and uniform full rank – not restrictive
Quality of the Constraint Preconditioner • The preconditioned matrix has the eigenvalue 1 with a known order of multiplicity • The remaining eigenvalues satisfy explicit bounds • Proof based on Bergamaschi et al., 2004
Performance of the preconditioner • Eigenvalue clustering & number of Krylov iterations • Affected by the well-known ill-conditioning of IPMs
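One way to realize PSC is to apply C matrix-free inside a Krylov method and precondition with the factored subsample Schur complement. A serial sketch with dense stand-ins (the particular Krylov solver, the matrix-free application of C, the LU stand-ins, and the N/n rescaling are assumptions for illustration, not necessarily PIPS's exact scheme):

```python
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, bicgstab

def psc_solve(K0, Ks, Bs, r, subset):
    """Solve C z = r with C = K0 - sum_i Bi^T Ki^{-1} Bi applied
    matrix-free (one backsolve per scenario per Krylov iteration),
    preconditioned by a Schur complement built from a small IID
    subset of scenarios and factored once."""
    N, n0 = len(Ks), K0.shape[0]
    facts = [lu_factor(Ki) for Ki in Ks]

    def apply_C(z):
        y = K0 @ z
        for f, Bi in zip(facts, Bs):
            y -= Bi.T @ lu_solve(f, Bi @ z)
        return y

    # stochastic preconditioner: subsampled, rescaled Schur complement
    C_prec = K0.copy()
    for i in subset:
        C_prec -= (N / len(subset)) * (Bs[i].T @ lu_solve(facts[i], Bs[i]))
    prec = lu_factor(C_prec)

    C_op = LinearOperator((n0, n0), matvec=apply_C)
    M_op = LinearOperator((n0, n0), matvec=lambda v: lu_solve(prec, v))
    z, info = bicgstab(C_op, r, M=M_op)   # info == 0 on convergence
    return z
```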
Parallelizing the 1st-stage linear algebra • We distribute the 1st-stage Schur complement system • C is treated as dense • Alternative to PSC for problems with a large number of 1st-stage variables • Removes the memory bottleneck of PSC and DSC • We investigated ScaLAPACK and Elemental (successor of PLAPACK) • Neither has a solver for symmetric indefinite matrices (Bunch-Kaufman); LU or Cholesky only • So we had to modify one of them
ScaLAPACK (ORNL) • Classical block distribution of the matrix • Blocked “down-looking” Cholesky – algorithmic blocks • Size of algorithmic block = size of distribution block! • For cache performance – large algorithmic blocks • For good load balancing – small distribution blocks • Must trade off cache performance for load balancing • Communication: basic MPI calls • Inflexible in working with sub-blocks
Elemental (UT Austin) • Unconventional “elemental” distribution: blocks of size 1 • Size of algorithmic block is independent of the size of the distribution block • Both cache performance (large algorithmic blocks) and load balancing (distribution blocks of size 1) • Communication: more sophisticated MPI calls, overhead O(log(sqrt(p))), where p is the number of processes • Friendly to working with sub-blocks • Better performance than ScaLAPACK in a hybrid MPI+SMP approach
Cholesky-based LDL^T-like factorization • Can be viewed as an “implicit” normal equations approach • In-place implementation inside Elemental: no extra memory needed • Idea: modify the Cholesky factorization by changing the sign after processing p columns • Much easier to do in Elemental, since it distributes elements, not blocks • Twice as fast as LU • Works for more general saddle-point linear systems, i.e., positive semi-definite (2,2) block
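A minimal dense sketch of this idea, assuming the system is quasi-definite in the given ordering (leading p-by-p block positive definite, trailing Schur complement negative definite) so no pivoting is needed; this is an illustration, not Elemental's in-place implementation:

```python
import numpy as np

def signed_cholesky(K, p):
    """Factor K = L * D * L^T with D = diag(+1,...,+1, -1,...,-1)
    (p plus signs): a Cholesky-like loop whose pivot sign flips
    after the first p columns."""
    n = K.shape[0]
    L = np.zeros_like(K, dtype=float)
    d = np.ones(n)
    d[p:] = -1.0
    for j in range(n):
        # pivot: K[j,j] minus the signed contributions of previous columns
        s = K[j, j] - np.dot(d[:j] * L[j, :j], L[j, :j])
        L[j, j] = np.sqrt(abs(s))          # s has the sign d[j] by assumption
        for i in range(j + 1, n):
            t = K[i, j] - np.dot(d[:j] * L[i, :j], L[j, :j])
            L[i, j] = t / (d[j] * L[j, j])
    return L, d

# tiny saddle-point example: K = [[Q, A^T], [A, 0]], Q pos. def., A full row rank
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
A = np.array([[1.0, 2.0]])
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
L, d = signed_cholesky(K, p=2)
print(np.allclose(L @ np.diag(d) @ L.T, K))   # True
```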
Distributing the 1st-stage Schur complement matrix • All processes contribute to all of the elements of the (1,1) dense block • A large amount of inter-process communication occurs – possibly more costly than the factorization itself • Solution: use a buffer to reduce the number of messages when doing a Reduce_scatter • This approach also reduces the communication by half – only the lower triangle needs to be sent
Reduce operations • Streamlined copying procedure – Lubin and Petra (2010) • Loop over contiguous memory and copy elements into the send buffer • Avoids the divisions and modulus operations needed to compute the positions • “Symmetric” reduce for the (symmetric) Schur complement matrix • Only the lower triangle is reduced • Fixed buffer size, a variable number of columns reduced per message • Effectively halves the communication (both data and number of MPI calls)
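A simplified mpi4py sketch of the symmetric reduce: only the packed lower triangle is summed across processes (shown here with a single Reduce to one root rather than the buffered Reduce_scatter that distributes column blocks to their owners; names are illustrative):

```python
from mpi4py import MPI
import numpy as np

def reduce_schur_lower_triangle(C_local, comm, root=0):
    """Sum per-process contributions to the symmetric first-stage Schur
    complement while sending only the lower triangle (half the data)."""
    n = C_local.shape[0]
    rows, cols = np.tril_indices(n)
    sendbuf = C_local[rows, cols]                  # pack lower triangle (incl. diagonal)
    recvbuf = np.empty_like(sendbuf) if comm.rank == root else None
    comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=root)
    if comm.rank != root:
        return None
    C = np.zeros_like(C_local)
    C[rows, cols] = recvbuf                        # unpack lower triangle
    return C + np.tril(C, -1).T                    # mirror to restore full symmetry
```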
Large-scale performance • First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and the Cholesky-based factorization described above • Strong scaling of PIPS: 90.1% from 64 to 1024 cores, 75.4% from 64 to 2048 cores • SAA problem with > 4,000 scenarios: 82,000 1st-stage variables, 189 million variables in total, 1,000 thermal units, 1,200 wind farms
Concluding remarks • PIPS – parallel interior-point solver for stochastic SAA problems • Largest SAA problem: 189 million variables = 82k 1st-stage variables + 4k scenarios × 47k 2nd-stage variables, on 2048 cores • Specialized linear algebra layer: small 1st stage – DSC; medium 1st stage – PSC; large 1st stage – distributed Schur complement • Current work: scenario parallelization in a hybrid MPI+SMP programming model
Thank you for your attention! Questions?