1 / 69

David E. Keyes Center for Computational Science Old Dominion University

Algorithms and Software for Terascale Computation of PDEs and PDE-constrained Optimization. David E. Keyes Center for Computational Science Old Dominion University Institute for Scientific Computing Research (ISCR) Lawrence Livermore National Laboratory. Plan of presentation.

ayita
Download Presentation

David E. Keyes Center for Computational Science Old Dominion University

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Algorithms and Software for Terascale Computation of PDEs and PDE-constrained Optimization David E. Keyes Center for Computational Science Old Dominion University Institute for Scientific Computing Research (ISCR) Lawrence Livermore National Laboratory

  2. Plan of presentation • Imperative of “optimal” algorithms for terascale computing • Basic domain decomposition and multilevel algorithmic concepts • Illustration of solver performance on ASCI platforms • Terascale Optimal PDE Simulations (TOPS) software project of the U.S. DOE • Conclusions and outlook

  3. Slides from 14-hour Peking University CS&E short course with Bill Gropp (in August 2002) now on-line Related URLs • Personal homepage: papers, talks, etc. http://www.math.odu.edu/~keyes • SciDAC initiative http://www.science.doe.gov/scidac • TOPS software project http://www.math.odu.edu/~keyes/scidac • PETSc software project http://www.mcs.anl.gov/petsc • Hypre software project http://www.llnl.gov/CASC/hypre

  4. Bibliography • Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys. • Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200 • High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362 • Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17 • Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136 • Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99 • Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278 • How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297

  5. Engineeringcrash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Applied Physics radiation transport supernovae Environment global climate contaminant transport Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

  6. Engineeringcrash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Applied Physics radiation transport supernovae Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

  7. Engineeringcrash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Applied Physics radiation transport supernovae Experiments dangerous Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

  8. Engineering crash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

  9. Engineeringcrash testingaerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Experiments difficult to instrument Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

  10. Engineeringcrash testingaerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Experiments difficult to instrument Environment global climate contaminant transport Experiments controversial Experiments expensive Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

  11. Engineeringcrash testingaerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Experiments difficult to instrument Environment global climate contaminant transport Experiments controversial Experiments expensive Scientific Simulation However, simulation is far from proven! To meet expectations, we need to handle problems of multiple physical scales.

  12. Plan Develop Use Large platforms have been provided 100+ Tflop / 30 TB Livermore 50+ Tflop / 25 TB 7.2 Tflop/s LINPACK 30+ Tflop / 10 TB Capability 10+ Tflop / 4 TB White 3+ Tflop / 1.5 TB Blue Livermore Red 1+ Tflop / 0.5 TB ‘97 ‘98 ‘99 ‘00 ‘01 ‘02 ‘03 ‘04 ‘05 ‘06 Time (CY) Sandia Los Alamos ASCI program of the U.S. DOE has roadmap to go to 100 Tflop/s by 2006 www.llnl.gov/asci/platforms

  13. Caltech Argonne NCSA/PACI 8 TF 240 TB SDSC 4.1 TF 225 TB NSF’s 13.6 TF TeraGrid coming on line TeraGrid: NCSA, SDSC, Caltech, Argonne www.teragrid.org Site Resources Site Resources 26 HPSS HPSS 4 24 External Networks External Networks 8 5 40Gb/s External Networks External Networks Site Resources Site Resources HPSS UniTree c/o I. Foster

  14. Earth Simulator Bird’s-eye View of the Earth Simulator System Disks Cartridge Tape Library System Processor Node (PN) Cabinets 35.6 Tflop/s LINPACK Interconnection Network (IN) Cabinets Air Conditioning System 65m Power Supply System 50m Double Floor for IN Cables

  15. Building platforms is the “easy” part • Algorithms must be • highly concurrent and straightforward to load balance • latency tolerant • cache friendly (temporal and spatial locality of reference) • highly scalable (in the sense of convergence) • Goal for algorithmic scalability: fill up memory of arbitrarily large machines while preserving constant* running times with respect to proportionally smaller problem on one processor • Domain-decomposed multilevel methods “natural” for all of these • Domain decomposition also “natural” for software engineering * or logarithmically growing

  16. Algorithmic requirementsfrom architecture • Must run on physically distributed memory units connected by message-passing network, each serving one or more processors with multiple levels of cache “horizontal” aspects “vertical” aspects T3E message passing, shared memory threads register blocking, cache blocking, prefetching

  17. 200 unscalable 150 Steel/rubber composite Parallel multigrid c/o M. Adams, Berkeley-Sandia 100 Time to Solution 50 scalable 0 1000 1 10 100 Problem Size (increasing with number of processors) The solver is a key part, but not the only part, of the simulation that needs to be scalable Keyword: “Optimal” • Convergence rate nearly independent of discretization parameters • Multilevel schemes for rapid linear convergence of linear problems • Newton-like schemes for quadratic convergence of nonlinear problems • Convergence rate as independent as possible of physical parameters • Continuation schemes • Physics-based preconditioning

  18. Why Optimal Algorithms? • The more powerful the computer, the greater the importance of optimality • Example: • Suppose Alg1 solves a problem in time CN2, where N is the input size • Suppose Alg2 solves the same problem in time CN • Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors • In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger

  19. Why Optimal?, cont. • Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time • Is 10,000 processors a reasonable expectation? • Yes, we have it today (ASCI White)! • Could computational scientists really use 10,000X scaling? • Of course; we are approximating the continuum • A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface • In 2D 10,000X disappears fast; in 3D even faster • However, these machines are expensive (Earth Simulator is $0.5B, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them

  20. Consider, e.g., the implicitly discretized parabolic case Decomposition strategies for Lu=f in  • Operator decomposition • Function space decomposition • Domain decomposition

  21. Operator decomposition • Consider ADI • Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep • two sparse matrix-vector multiplies • two sets of unidirectional bandsolves • Parallelism within each substep • But global data exchanges between bandsolve substeps

  22. System of ordinary differential equations • Perhaps are diagonal matrices • Perfect parallelism across spectral index • But global data exchanges to transformback to physical variables at each step Function space decomposition • Consider a spectral Galerkin method

  23. = Domain decomposition • Consider restriction and extension operators for subdomains, , and for possible coarse grid, • Replace discretized with • Solve by a Krylov method, e.g., CG • Matrix-vector multiplies with • parallelism on each subdomain • nearest-neighbor exchanges, global reductions • possible small global system (not needed for parabolic case)

  24. Comparison • Operator decomposition (ADI) • natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) • Function space decomposition (Fourier) • natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) • Domain decomposition (Schwarz) • natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem

  25. Theoretical scaling of domain decomposition(for three common network topologies) • With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest neighbor interconnects: • optimal number of processors scales linearlywith problem size (“scalable”, assumes one subdomain per processor) • With power-law-time (3D torus-based) global reductions and scalable nearest neighbor interconnects: • optimal number of processors scales as three-fourths power of problem size (“almost scalable”) • With linear-time (common bus) network: • optimal number of processors scales as one-fourth power of problem size (*not* scalable) • bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize “price-performance awards” using multiple commodity NICs per Beowulf node!

  26. Some “Advanced” Concepts • Polynomial combinations of Schwarz projections • Schwarz-Schur combinations • Schwarz on Schur-reduced system • Schwarz inside Schur-reduced system • Nonlinear Schwarz optimization new! Three Basic Concepts • Iterative correction • Schwarz preconditioning • Schur preconditioning

  27. Iterative correction • The most basic idea in iterative methods • Evaluate residual accurately, but solve approximately, where is an approximate inverse to A • A sequence of complementary solves can be used, e.g., with first and then one has • Optimal polynomials of lead to various preconditioned Krylov methods • Scale recurrence, e.g., with , leads to multilevel methods

  28. smoother A Multigrid V-cycle Restrictiontransfer from fine to coarse grid coarser grid has fewer cells (less work & storage) Prolongationtransfer from coarse to fine grid First Coarse Grid Recursively apply this idea until we have an easy problem to solve Multilevel Preconditioning Finest Grid

  29. Schwarz Preconditioning • Given A x = b , partition x into subvectors, corresp. to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of • Let Boolean rectangular matrix extract the subset of : • Let The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support

  30. Preconditioning Type in 2D in 3D Point Jacobi Ο(N1/2) Ο(N1/3) Domain Jacobi (=0) Ο((NP)1/4) Ο((NP)1/6) 1-level Additive Schwarz Ο(P1/2) Ο(P1/3) 2-level Additive Schwarz Ο(1) Ο(1) Iteration count estimates from the Schwarz theory • Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system • In terms of N and P, where for d-dimensional isotropic problems, N=h-d and P=H-d, for mesh parameter hand subdomain diameter H, iteration counts may be estimated as follows:

  31. Comments on the Schwarz theory • Basic Schwarz estimates are for: • self-adjoint operators with smooth coefficients • positive definite operators • exact subdomain solves, • two-way overlapping with • generous overlap, =O(H)(otherwise 2-level result is O(1+H/)) • Extensible to: • nonself-adjointness (e.g, convection) and jumping coefficients • indefiniteness (e.g., wave Helmholtz) • inexact subdomain solves • one-way overlap communication (“restricted additive Schwarz”) • small overlap

  32. Schur Preconditioning • Given a partition • Condense: • Let M be a good preconditioner for S • Then is a preconditioner for A • Moreover, solves with may be done approximately if all degrees of freedom are retained

  33. This leads to algorithm “Hybrid II” in S-B-G’96: • Convenient for “SPMD” (single prog/multiple data) Schwarz polynomials • Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations • We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively)

  34. “Balancing Neumann-Neumann” alg Schwarz-on-Schur • Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem • “Neumann-Neumann” alg • Multigrid on the Schur complement

  35. Schwarz-inside-Schur • Consider Newton’s method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization • Constraint • Objective • Lagrangian • Form the gradient of the Lagrangian with respect to each of x, u, and :

  36. Newton Reduced SQP solves the Schur complement system H u = g , where H is the reduced Hessian Schwarz-inside-Schur • Equality constrained optimization leads to the KKT system for states x , designs u , and multipliers  • Then

  37. Schwarz-inside-Schur, cont. • Problems • is the Jacobian of a PDE  huge! • involve Hessians of objective and constraints  second derivatives and huge • H is unreasonable to form, store, or invert • Solutions • Use Schur preconditioning on full system • Form forward action of Hessians by automatic differentiation (vector-to-vector map) • Form approximate inverse action of state Jacobian and its transpose by Schwarz

  38. wing tip vortices, no control (l); optimal control (r) optimal boundary controls shown as velocity vectors Example of PDE-constrained Optimization Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc • Optimal control of laminar viscous flow • optimization variables are surface suction/injection • objective is minimum drag • 700,000 states; 4,000 controls • 128 Cray T3E processors • ~5 hrs for optimal solution (~1 hr for analysis) c/o G. Biros and O. Ghattas www.cs.nyu.edu/~biros/veltisto/

  39. Nonlinear Schwarz preconditioning • Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free • It replaces with a new nonlinear system possessing the same root, • Define a correction to the partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: where is nonzero only in the components of the partition • Then sum the corrections:

  40. Nonlinear Schwarz, cont. • It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root then and have the same unique root • To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any : • The residual • The Jacobian-vector product • Remarkably, (Cai-Keyes, 2000) it can be shown that where and • All required actions are available in terms of !

  41. Stagnation beyond critical Re Difficulty at critical Re Convergence for all Re Additive Schwarz Preconditioned Inexact Newton (ASPIN) Newton’s method Example of nonlinear Schwarz

  42. Then and — the Schwarz formula! “Unreasonable effectiveness” of Schwarz • When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right! Let be a complete set of orthonormal row eigenvectors for A : or • Good decompositions are a compromise between conditioning and parallel complexity, in practice

  43. Newton-Krylov-Schwarz – a parallel PDE “workhorse” Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994), in PETSc since Balay’s MS project at ODU (1995) Newton nonlinear solver asymptotically quadratic Krylov accelerator spectrally adaptive Schwarz preconditioner parallelizable

  44. Jacobian-Free Newton-Krylov Method • In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products • These are approximated by the Fréchet derivatives so that the actual Jacobian elements are never explicitly needed, where  is chosen with a fine balance between approximation and floating point rounding error  Schwarz preconditions, using approximate elements

  45. Philosophy of Jacobian-free NK • To evaluate the linear residual, we use the true F’(u) , giving a true Newton step and asymptotic quadratic Newton convergence • To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations: • combinations of operator-split Jacobians (for reasons of physics or reasons of numerics) • Jacobian of related discretization (for “fast” solves) • Jacobian of lower-order discretization (for more stability, less storage) • Jacobian with “lagged” values for expensive terms (for less computation per degree of freedom) • Jacobian stored in lower precision (for less memory traffic per preconditioning step) • Jacobian blocks decomposed for parallelism

  46. Philosophy of Jacobian-free NK, cont. • These motivations are not new; most large-scale application codes also take “short cuts” on the approximate Jacobian operator to be inverted – showing physical intuition • The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian • This leads to a weakly nonlinearly converging “defect correction method” • Defect correction: • in contrast to preconditioned Newton:

  47. Newton Outside Physics-based Preconditioning • Recall the example of the shallow-water wave splitting, treated earlier as a solver, leaving a first-order in time splitting error • In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner • The true Jacobian is never formed yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations

  48. Physics-based preconditioning • In Newton iteration, one seeks to obtain a correction (“delta”) to solution, by inverting the Jacobian matrix on (the negative of) the nonlinear residual: • A typical operator-split code also derives a “delta” to the solution, by some implicitly defined means, through a series of implicit and explicit substeps • This implicitly defined mapping from residual to “delta” is a naturalpreconditioner • Software must accommodate this!

  49. 1D Shallow water preconditioning • Define continuity residual for each timestep: • Define momentum residual for each timestep: • Continuity delta-form (*): • Momentum delta form (**):

  50. 1D Shallow water preconditioning, cont. • Solving (**) for and substituting into (*), • After this parabolic equation is solved for  , we have • This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep • Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used • See paper by Mousseau et al. (2002) in Ref [1] for impressive results for longtime weather integration

More Related