Paul Hovland (Argonne National Laboratory) Steven Lee (Lawrence Livermore National Laboratory)

Challenges and Opportunities in Using Automatic Differentiation with Object-Oriented Toolkits for Scientific Computing Paul Hovland (Argonne National Laboratory) Steven Lee (Lawrence Livermore National Laboratory) Lois McInnes (ANL) Boyana Norris (ANL) Barry Smith (ANL) The Computational Differentiation Project at Argonne National Laboratory

Acknowledgments • Jason Abate • Satish Balay • Steve Benson • Peter Brown • Omar Ghattas • Lisa Grignon • William Gropp • Alan Hindmarsh • David Keyes • Jorge Moré • Linda Petzold • Widodo Samyono

Outline • Intro to AD • Survey of Toolkits • SensPVODE • PETSc • TAO • Using AD with Toolkits • Toolkit Level • Parallel Function Level • Subdomain Level • Element/Vertex Function Level • Experimental Results • Conclusions and Expectations

Automatic Differentiation • Technique for augmenting code for computing a function with code for computing derivatives • Analytic differentiation of elementary operations/functions, propagation by chain rule • Can be implemented using source transformation or operator overloading • Two main modes • Forward: propagates derivatives from independent to dependent variables • Reverse (adjoint): propagates derivatives from dependent to independent variables

Comparison of Methods • Finite Differences • Advantages: cheap Jv, easy • Disadvantages: inaccurate (not robust) • Hand Differentiation • Advantages: accurate; cheap Jv, JTv, Hv, … • Disadvantages: hard; difficult to maintain consistency • Automatic Differentiation • Advantages: cheap JTv, Hv; easy? • Disadvantages: Jv costs ~2 function evals; hard?

PVODE: Parallel ODE-IVP solver • Algorithm developers: Hindmarsh, Byrne, Brown and Cohen • ODE Initial-Value Problems • Stiff and non-stiff integrators • Written in C • MPI calls for communication

ODE Initial-Value Problem (standard form): Implicit time-stepping using BDF methods for y(tn) Solve nonlinear system for y(tn)via Inexact Newton Solve update to Newton iterate using Krylov methods PVODE for ODE simulations

Possible Approaches Apply AD to PVODE: ad_F(y,ad_y,p,ad_p) p,ad_p ad_PVODE y, ad_y|t=t1, t2, ... y, ad_y|t=0 Solve sensitivity eqns: ad_F(y,ad_y,p,ad_p) p SensPVODE y, dy/dp |t=t1, t2, ... y|t=0

Sensitivity Differential Equations • Differentiate y= f(t, y, p) with respect to pi: • A linearODE-IVP for the sensitivity vector si(t) :

PETSc • Portable, Extensible Toolkit for Scientific computing • Parallel • Object-oriented • Free • Supported (manuals, email) • Interfaces with Fortran 77/90, C/C++ • Available for virtually all UNIX platforms, as well as Windows 95/NT • Flexible: many options for solver algorithms and parameters

Nonlinear Solvers (SNES) Solve F(u) = 0 Linear Solvers (SLES) PETSc PC KSP Application Initialization Function Evaluation Jacobian Evaluation Post- Processing User code PETSc code AD-generated code Nonlinear PDE Solution Main Routine

TAO The process of nature by which all things change and which is to be followed for a life of harmony • Object-oriented techniques • Component-based (CCA) interaction • Leverage existing parallel computing infrastructure • Reuse of external toolkits The Right Way Toolkit for advanced optimization

TAO Goals • Portability • Performance • Scalable parallelism • An interface independent of architecture

TAO Algorithms • Unconstrained optimization • Limited-memory variable-metric method • Trust region/line search Newton method • Conjugate-gradient method • Levenberg-Marquardt method • Bound-constrained optimization • Trust region Newton method • Gradient projection/conjugate gradient method • Linearly-constrained optimization • Interior-point method with iterative solvers

PETSc and TAO Application Driver Optimization Tools Toolkit for Advanced Optimization (TAO) Linear Solvers Matrices PC KSP Vectors Application Initialization Function & Gradient Evaluation Hessian Evaluation Post- Processing PETSc code User code TAO code

Using AD with Toolkits • Apply AD to toolkit to produce derivative-enhanced toolkit • Use AD to provide Jacobian/Hessian/gradient for use by toolkit. Apply AD at • Parallel Function Level • Subdomain Function Level • Element/Vertex Function Level

Differentiated Version of Toolkit • Makes possible sensitivity analysis, black-box optimization of models constructed using toolkit • Can take advantage of high-level structure of algorithms, providing better performance: see Andreas’ and Linda’s talks • Ongoing work with PETSc and PVODE

Levels of Function Evaluation int FormFunction(SNES snes,Vec X,Vec F,void *ptr){ Parallel Function Level /* Variable declarations omitted */ mx = user->mx; my = user->my; lambda = user->param; Subdomain Function Level hx = one/(double)(mx-1); hy = one/(double)(my-1); sc = hx*hy*lambda; hxdhy = hx/hy; hydhx = hy/hx; ierr = DAGlobalToLocalBegin(user->da,X,INSERT_VALUES,localX);CHKERRQ(ierr); ierr = DAGlobalToLocalEnd(user->da,X,INSERT_VALUES,localX);CHKERRQ(ierr); ierr = VecGetArray(localX,&x);CHKERRQ(ierr); ierr = VecGetArray(localF,&f);CHKERRQ(ierr); ierr = DAGetCorners(user->da,&xs,&ys,PETSC_NULL,&xm,&ym,PETSC_NULL);CHKERRQ(ierr); ierr = DAGetGhostCorners(user->da,&gxs,&gys,PETSC_NULL,&gxm,&gym,PETSC_NULL);CHKERRQ(ierr); for (j=ys; j<ys+ym; j++) { row = (j - gys)*gxm + xs - gxs - 1; for (i=xs; i<xs+xm; i++) { row++; if (i == 0 || j == 0 || i == mx-1 || j == my-1) {f[row] = x[row]; continue;} Vertex/Element Function Level u = x[row]; uxx = (two*u - x[row-1] - x[row+1])*hydhx; uyy = (two*u - x[row-gxm] - x[row+gxm])*hxdhy; f[row] = uxx + uyy - sc*PetscExpScalar(u); } } ierr = VecRestoreArray(localX,&x);CHKERRQ(ierr); ierr = VecRestoreArray(localF,&f);CHKERRQ(ierr); ierr = DALocalToGlobal(user->da,localF,INSERT_VALUES,F);CHKERRQ(ierr); ierr = PLogFlops(11*ym*xm);CHKERRQ(ierr); return 0; }

Parallel Function Level • Advantages • Well-defined interface:int Function(SNES, Vec, Vec, void *);void function(integer, Real, N_Vector, N_Vector, void *); • No changes to function • Disadvantages • Differentiation of toolkit support functions (may result in unnecessary work) • AD of parallel code (MPI, OpenMP) • May need global coloring

Subdomain Function Level • Advantages • No need to differentiate communication functions • Interface may be well defined • Disadvantages • May need local coloring • May need to extract from parallel function

Using AD with PETSc Global-to-local scatter of ghost values Local Function computation Local Function computation Parallel function assembly Script file Global-to-local scatter of ghost values ADIFOR or ADIC Coded manually; can be automated Seed matrix initialization Local Jacobian computation Local Jacobian computation Parallel Jacobian assembly

Global-to-local scatter of ghost values Local Function computation Parallel function assembly Using AD with TAO Local Function computation Script file Global-to-local scatter of ghost values ADIFOR or ADIC Coded manually; can be automated Seed matrix initialization Local gradient computation Local gradient computation Parallel gradient assembly

Element/Vertex Function Level • Advantages • Reduced memory requirements • No need for matrix coloring • Disadvantages • May be difficult to decompose function to this level (boundary conditions, other special cases) • Decomposition to this level may impede efficiency

Example int localfunction2d(Field **x,Field **f,int xs, int xm, int ys, int ym, int mx,int my, void *ptr) { xints = xs; xinte = xs+xm; yints = ys; yinte = ys+ym; if (yints == 0) { j = 0; yints = yints + 1; for (i=xs; i<xs+xm; i++) { f[j][i].u = x[j][i].u; f[j][i].v = x[j][i].v; f[j][i].omega = x[j][i].omega + (x[j+1][i].u - x[j][i].u)*dhy; f[j][i].temp = x[j][i].temp-x[j+1][i].temp; } } if (yinte == my) { j = my - 1; yinte = yinte - 1; for (i=xs; i<xs+xm; i++) { f[j][i].u = x[j][i].u - lid; f[j][i].v = x[j][i].v; f[j][i].omega = x[j][i].omega + (x[j][i].u - x[j-1][i].u)*dhy; f[j][i].temp = x[j][i].temp-x[j-1][i].temp; } } if (xints == 0) { i = 0; xints = xints + 1; for (j=ys; j<ys+ym; j++) { f[j][i].u = x[j][i].u; f[j][i].v = x[j][i].v; f[j][i].omega = x[j][i].omega - (x[j][i+1].v - x[j][i].v)*dhx; f[j][i].temp = x[j][i].temp; } } if (xinte == mx) { i = mx - 1; xinte = xinte - 1; for (j=ys; j<ys+ym; j++) { f[j][i].u = x[j][i].u; f[j][i].v = x[j][i].v; f[j][i].omega = x[j][i].omega - (x[j][i].v - x[j][i-1].v)*dhx; f[j][i].temp = x[j][i].temp - (double)(grashof>0); } } for (j=yints; j<yinte; j++) { for (i=xints; i<xinte; i++) { vx = x[j][i].u; avx = PetscAbsScalar(vx); vxp = p5*(vx+avx); vxm = p5*(vx-avx); vy = x[j][i].v; avy = PetscAbsScalar(vy); vyp = p5*(vy+avy); vym = p5*(vy-avy); u = x[j][i].u; uxx = (two*u - x[j][i-1].u - x[j][i+1].u)*hydhx; uyy = (two*u - x[j-1][i].u - x[j+1][i].u)*hxdhy; f[j][i].u = uxx + uyy - p5*(x[j+1][i].omega-x[j-1][i].omega)*hx; u = x[j][i].v; uxx = (two*u - x[j][i-1].v - x[j][i+1].v)*hydhx; uyy = (two*u - x[j-1][i].v - x[j+1][i].v)*hxdhy; f[j][i].v = uxx + uyy + p5*(x[j][i+1].omega-x[j][i-1].omega)*hy; u = x[j][i].omega; uxx = (two*u - x[j][i-1].omega - x[j][i+1].omega)*hydhx; uyy = (two*u - x[j-1][i].omega - x[j+1][i].omega)*hxdhy; f[j][i].omega = uxx + uyy + (vxp*(u - x[j][i-1].omega) + vxm*(x[j][i+1].omega - u)) * hy +(vyp*(u - x[j-1][i].omega) + vym*(x[j+1][i].omega - u)) * hx -p5 * grashof * (x[j][i+1].temp - x[j][i-1].temp) * hy; u = x[j][i].temp; uxx = (two*u - x[j][i-1].temp - x[j][i+1].temp)*hydhx; uyy = (two*u - x[j-1][i].temp - x[j+1][i].temp)*hxdhy; f[j][i].temp = uxx + uyy + prandtl * ((vxp*(u - x[j][i-1].temp) + vxm*(x[j][i+1].temp - u)) * hy + (vyp*(u - x[j-1][i].temp) + vym*(x[j+1][i].temp - u)) * hx); } } ierr = PetscLogFlops(84*ym*xm);CHKERRQ(ierr); return 0; }

Experimental Results • Toolkit Level – Differentiated PETSc Linear Solver • Parallel Nonlinear Function Level – SensPVODE • Local Subdomain Function Level • PETSc • TAO • Element Function Level – PETSc

Differentiated Linear Equation Solver

Increased Accuracy

SensPVODE: Problem • Diurnl kinetics advection-diffusion equation • 100x100 structured grid • 16 processors of a Linux cluster with 550 MHz processors and Myrinet interconnect

SensPVODE: Time

SensPVODE: Number of Timesteps

SensPVODE: Time/Timestep

PETSc Applications • Toy problems • Solid fuel ignition: finite difference discretization; Fortran & C variants; differentiated using ADIFOR, ADIC • Driven cavity: finite difference discretization; C implementation; differentiated using ADIC • Euler code • Based on legacy F77code from D. Whitfield (MSU) • Finite volume discretization • Up to 1,121,320 unknowns • Mapped C-H grid • Fully implicit steady-state • Tools: SNES, DA, ADIFOR

C-H Structured Grid

Algorithmic Performance

Real Performance

Hybrid Method

Hybrid Method (cont.)

TAO: Preliminary Results

For More Information • PETSc: http://www.mcs.anl.gov/petsc/ • TAO: http://www.mcs.anl.gov/tao/ • Automatic Differentiation at Argonne • http://www.mcs.anl.gov/autodiff/ • ADIFOR: http://www.mcs.anl.gov/adifor/ • ADIC: http://www.mcs.anl.gov/adic/ • http://www.autodiff.org

10 challenges for PDE optimization algorithms • Problem size • Efficiency vs. intrusiveness • “Physics-based” globalizations • Inexact PDE solvers • Approximate Jacobians • Implicitly-defined PDE residuals • Non-smooth solutions • Pointwise inequality constraints • Scalability of adjoint methods to large numbers of inequalities • Time-dependent PDE optimization

7. Non-smoothness • PDE residual may not depend smoothly on state variables (maybe not even continuously) • Solution-adaptivity • Discontinuity-capturing, front-tracking • Subgrid-scale models • Material property evaluation • Contact problems • Elasto(visco)plasticity • PDE residual may not depend smoothly or continuously on decision variables • Solid modeler- and mesh generator-induced

Paul Hovland (Argonne National Laboratory) Steven Lee (Lawrence Livermore National Laboratory)

Paul Hovland (Argonne National Laboratory) Steven Lee (Lawrence Livermore National Laboratory)

Presentation Transcript

Majorana Neutrinoless Double-Beta Decay Experiment

The Earth System Grid: Turning Climate Datasets into Community Resources

Manycore Optimizations: A Compiler and L anguage Independent ManyCore Runtime System

Lawrence Livermore National Laboratory

Visualization with VisIt Part II

Is Hybrid Programming a Bad Idea Whose Time Has Come?

Python’s Role in VisIt

SciDAC Reaction Theory Year-5-End plans

SciDAC Reaction Theory

Computer Science in UNEDF

TotalView Multiprocess Debugger

Rare Isotope Accelerator (RIA) Remote Maintenance Concepts

Lawrence Livermore National Laboratory

Introduction Scope, Tool Categories, Definitions

SciDAC Reaction Theory

Lawrence Livermore National Laboratory

The Earth System Grid: Turning Climate Datasets into Community Resources

TotalView Memory Debugging

Majorana Neutrinoless Double-Beta Decay Experiment

SciDAC Reaction Theory

Security for Open Science Project