HANSO and HIFOO: Two New MATLAB Codes
Michael L. Overton
Courant Institute of Mathematical Sciences, New York University
Singapore, Jan 2006
Two New Codes
• HANSO: Hybrid Algorithm for Nonsmooth Optimization
  – Aimed at finding local minimizers of general nonsmooth, nonconvex optimization problems
  – Joint with Jim Burke and Adrian Lewis
• HIFOO: H-Infinity Fixed-Order Optimization
  – Aimed at solving specific nonsmooth, nonconvex optimization problems arising in control
  – Built on HANSO
  – Joint with Didier Henrion
Many optimization objectives are
• Nonconvex
• Nonsmooth
• Continuous
• Differentiable almost everywhere
• With gradients often available at little additional cost beyond that of computing the function
• Subdifferentially regular
• Sometimes, non-Lipschitz
Numerical Optimization of Nonsmooth, Nonconvex Functions
• Steepest descent: jams
• Bundle methods: better, but mainly intended for nonsmooth convex functions
• We developed a simple method for nonsmooth, nonconvex minimization based on Gradient Sampling
• Intended for continuous functions that are differentiable almost everywhere, and for which the gradient can be easily computed where it is defined
• User need only write a routine returning the function value and gradient – and need not worry about nondifferentiable cases, e.g., ties for a max when coding the gradient
• Very recently, we found that BFGS is often very effective when implemented correctly – and much less expensive
BFGS
• Standard quasi-Newton method for optimizing smooth functions, using gradient differences to update an inverse Hessian approximation H, defining a local quadratic model of the objective function
• Conventional wisdom: jams on nonsmooth functions
• Amazingly, works very well as long as it is implemented right (a weak Wolfe line search is essential)
• It builds an excellent, extremely ill-conditioned quadratic model
• Often runs until cond(H) is about 10^16 before breaking down, taking steplengths of around 1
• Often converges linearly (not superlinearly)
• By contrast, steepest descent and Newton's method usually jam
• Not very reliable, and no convergence theory
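For reference, a minimal sketch of the inverse-Hessian update involved (the standard BFGS formula; a weak Wolfe line search, sketched a few slides below, guarantees the curvature condition needed here):

```matlab
% One BFGS update of the inverse Hessian approximation H (a sketch).
% s = x_new - x_old, y = grad_new - grad_old; the weak Wolfe line search
% guarantees y'*s > 0, so rho is well defined.
function H = bfgs_update(H, s, y)
    rho = 1/(y'*s);
    n   = length(s);
    V   = eye(n) - rho*(y*s');
    H   = V'*H*V + rho*(s*s');   % H <- (I - rho*s*y')*H*(I - rho*y*s') + rho*s*s'
end
```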
Minimizing a Product of Eigenvalues
• An application due to Anstreicher and Lee
• Minimize the log of the product of the K largest eigenvalues of A∘X (Hadamard product) subject to X positive semidefinite and diag(X)=1 (A, X symmetric); we usually set K=N/2
• Would be equivalent to an SDP if the product were replaced by a sum
• Number of variables is n = N(N-1)/2, where N is the dimension of X
• Following Burer, substitute YY' for X, where Y is N by r, so n is reduced to rN; normalize Y so its rows have norm 1: thus both constraints are eliminated
• Typically the objective is not smooth at a minimizer, because the K-th largest eigenvalue is multiple
• A sketch of the resulting objective follows; results for N=10 and rank r = 2 through 10…
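A minimal sketch of the objective after Burer's substitution, assuming A is N-by-N symmetric and Y is N-by-r (the gradient, which BFGS also needs, can be assembled from the eigenvectors of the K largest eigenvalues and is omitted here):

```matlab
% f(Y) = log of the product of the K largest eigenvalues of A o (YY'),
% with the rows of Y normalized so that diag(YY') = 1.
function f = eigprod_obj(Y, A, K)
    Y   = diag(1./sqrt(sum(Y.^2, 2))) * Y;   % normalize rows: diag(X) = 1
    X   = Y*Y';                              % X = YY' is positive semidefinite
    lam = sort(eig(A.*X), 'descend');        % A.*X is the Hadamard product A o X
    f   = sum(log(lam(1:K)));                % log of product = sum of logs
end
```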
N=10, r=6: eigenvalues of optimal A∘X
3.163533558166873e-001
5.406646569314501e-001
5.406646569314647e-001
5.406646569314678e-001
5.406646569314684e-001
5.406646569314689e-001
5.406646569314706e-001
7.149382062194134e-001
7.746621702661097e-001
1.350303124150612e+000
The U and V Spaces
• The condition number of H, the BFGS approximation to the inverse Hessian, typically reaches 10^16 !!
• Let's look at the eigenvalues of H: H = QDQ* (* denotes transpose); the search direction is d = -Hg = -QDQ*g (where g = gradient)
• The next plot shows (see the sketch below):
  – diag(D), sorted into ascending order
  – components of |Q*g|, using the same ordering
  – components of |DQ*g|, using the same ordering (the search direction expanded in the eigenvector basis)
• Tiny eigenvalues of H correspond to the “nonsmooth” V-space
• Other eigenvalues of H correspond to the “smooth” U-space
• From matrix theory, dim(V-space) = m(m+1)/2 – 1, where m is the multiplicity of the K-th eigenvalue of A∘X
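A small sketch of how the plotted quantities can be computed, assuming H is the final BFGS matrix and g the gradient at the final point:

```matlab
% Expand the gradient and search direction in the eigenvector basis of H.
[Q, D]    = eig(H);                   % H = Q*D*Q' for symmetric H
[dH, idx] = sort(diag(D));            % eigenvalues of H, ascending
Q         = Q(:, idx);
coeff_g   = abs(Q'*g);                % |Q*g| components, same ordering
coeff_d   = dH .* coeff_g;            % |DQ*g|: the search direction -Hg in this basis
semilogy([dH, coeff_g, coeff_d], '.-');
legend('diag(D)', '|Q^T g|', '|D Q^T g|');   % tiny eigenvalues of H flag the V-space
```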
More on the U and V Spaces
• H seems to be an excellent approximation to the “limiting inverse Hessian” and evidently gives us bases for the optimal U and V spaces
• Let's look at plots of f along random directions in the U and V spaces, as defined by eigenvectors of H, passing through the minimizer (for the N=10, r=6 example)
Line Search for BFGS
• Sufficient decrease: for 0 < c1 < 1,
    f(x + td) ≤ f(x) + c1 t (grad f)(x)*d
• Weak Wolfe condition on the derivative: for c1 < c2 < 1,
    (grad f)(x + td)*d ≥ c2 (grad f)(x)*d
• Not the strong Wolfe condition on the derivative:
    |(grad f)(x + td)*d| ≤ −c2 (grad f)(x)*d
• Essential when f is nonsmooth
• No reason to use strong Wolfe even if f is smooth
• Much simpler to implement
• Why ever use strong Wolfe? (A sketch of a weak Wolfe line search follows.)
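A minimal sketch of a weak Wolfe line search by bracketing and bisection, in the spirit of the conditions above; fcn is an assumed handle returning both the function value and gradient, with 0 < c1 < c2 < 1:

```matlab
% Find t satisfying both weak Wolfe conditions:
%   f(x0 + t*d) <= f(x0) + c1*t*g0'*d      (sufficient decrease)
%   grad f(x0 + t*d)'*d >= c2*g0'*d        (weak Wolfe condition on the derivative)
function [t, x, f, g] = weak_wolfe(fcn, x0, f0, g0, d, c1, c2)
    a = 0; b = Inf; t = 1;
    gtd0 = g0'*d;                          % directional derivative at t = 0
    for iter = 1:50
        x = x0 + t*d;
        [f, g] = fcn(x);
        if f > f0 + c1*t*gtd0
            b = t;                         % sufficient decrease fails: shrink bracket
        elseif g'*d < c2*gtd0
            a = t;                         % curvature condition fails: grow step
        else
            return                         % both conditions hold
        end
        if isinf(b), t = 2*a; else, t = (a + b)/2; end
    end
end
```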
But how to check if the final x is locally optimal?
• Extreme ill-conditioning of H is a strong clue, but it is expensive to compute and proves nothing
• If x is locally optimal, running a local bundle method from x establishes nonsmooth stationarity: this involves repeated null-step line searches and building a bundle of gradients obtained via the line searches
• When the line search returns a null step, it must also return a gradient at a point lying across the discontinuity
• Then the new d = – arg min { ||d|| : d ∈ conv G }, where G is the set of gradients in the bundle
• Terminate if ||d|| is small
• If ||d|| is 0 and the line search steps were 0, x is Clarke stationary; more realistically it is “close” to Clarke stationary
• I'll call this Clarke Quay Stationary since it is a Quay Idea
First-order local optimality
• Is this implied by Clarke stationarity of f at x?
• No: for example, f(x) = -|x| is Clarke stationary at 0
• Yes, in the sense that f′(x; d) ≥ 0 for all directions d, when f is regular at x (subdifferentially regular, Clarke regular)
• Most of the functions that we have studied are regular everywhere, although this is sometimes hard to prove
• Regularity generalizes smoothness and convexity
Harder Problems
• Eigenvalue product problems are interesting, but the Lipschitz constant L for f is not large
• Harder problems:
  – Chebyshev Exponential Approximation
  – Optimization of Distance to Instability
  – Pseudospectral Abscissa Optimization (arbitrarily large L)
  – Spectral Abscissa Optimization (not even Lipschitz)
• In all cases, BFGS turns out to be substantially less reliable than gradient sampling
Gradient Sampling Algorithm
Initialize ε and x. Repeat:
• Get G, a set of gradients of the function f evaluated at x and at points near x (sampling controlled by ε)
• Let d = – arg min { ||d|| : d ∈ conv G }
• Line search: replace x by x + td, with f(x + td) < f(x) (if d is not 0)
until d = 0 (or ||d|| ≤ tol). Then reduce ε and repeat. (A sketch of the direction subproblem follows.)
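The direction subproblem above – the minimum-norm element of the convex hull of the sampled gradients – is a small convex QP. A minimal sketch, assuming the sampled gradients are the columns of G and that quadprog from the Optimization Toolbox is available:

```matlab
% d = -argmin{ ||d|| : d in conv G }: write d = G*lambda with lambda on the
% simplex and solve  min 0.5*lambda'*(G'*G)*lambda  s.t. sum(lambda)=1, lambda>=0.
function d = min_norm_in_convhull(G)
    k      = size(G, 2);
    lambda = quadprog(G'*G, zeros(k,1), [], [], ones(1,k), 1, zeros(k,1), []);
    d      = -G*lambda;     % negative of the minimum-norm convex combination
end
```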
Convergence Theory for Gradient Sampling Method (SIOPT, 2005)
• Suppose
  – f is locally Lipschitz and coercive
  – f is continuously differentiable on an open dense subset of its domain
  – the number of gradients sampled near each iterate is greater than the problem dimension
  – the line search uses an appropriate sufficient decrease condition
• Then, with probability one and for fixed sampling parameter ε, the algorithm generates a sequence of points with a cluster point x that is ε-Clarke stationary
• If f has a unique Clarke stationary point x, then the set of all cluster points generated by the algorithm converges to x as ε is reduced to zero
• Kiwiel has already improved on this! (For a slightly different, as yet untested, version of the algorithm.)
HANSO
• The user provides a routine to compute f and its gradient at any given x (and need not worry about nonsmooth cases) – see the example below
• BFGS is then run from many randomly generated or user-provided starting points. Quit if the gradient at the best point found is small.
• Otherwise, a local bundle method is run from the best point found, attempting to verify local stationarity.
• If this is not successful, gradient sampling is initiated.
• Regardless, return a final x along with a set of nearby points, the corresponding gradients, and the d which is the smallest vector in the convex hull of these gradients
• If the nearby points are near enough and d is small enough, f is Clarke Quay Stationary at x
• This is an approximate first-order local optimality condition if f is regular at x
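For illustration, here is the kind of user routine described above – a single function returning both the value and a gradient, with no special handling of ties (the function name maxsq and the example objective are hypothetical, not part of HANSO itself):

```matlab
% Example user routine: f(x) = max_i x_i^2, nonsmooth wherever the max is tied.
% At a tie, simply return the gradient of whichever branch attains the max.
function [f, g] = maxsq(x)
    [f, i] = max(x.^2);
    g      = zeros(size(x));
    g(i)   = 2*x(i);
end
```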
Optimization in Control
• Often, not many variables, as in low-order controller design
• Reasons:
  – engineers like simple controllers
  – nonsmooth, nonconvex optimization problems in even a few variables can be very difficult
• Sometimes, more variables are added to the problem in an attempt to make it more tractable
  – Example: adding a Lyapunov matrix variable P and imposing stability on A by AP + PA* ≤ 0
  – When A depends linearly on the original parameters, this results in a bilinear matrix inequality (BMI), which is typically difficult to solve
• Our approach: tackle the original problem
• Looking for locally optimal solutions
H∞ Norm of a Transfer Function
• Transfer function G(s) = C(sI – A)^{-1}B + D (state-space representation)
• The H∞ norm is ∞ if A is not stable (stable means all eigenvalues have negative real part), and otherwise is sup ||G(s)||_2 over all imaginary s
• The standard way to compute it is norm(SS,inf) in the MATLAB Control Toolbox
• LINORM from SLICOT is much faster
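A minimal sketch of the Control System Toolbox computation mentioned above (A, B, C, D assumed given):

```matlab
% H-infinity norm of G(s) = C*(s*I - A)^(-1)*B + D via the Control System Toolbox.
sys  = ss(A, B, C, D);
hinf = norm(sys, Inf);   % sup over imaginary s of the largest singular value of G(s);
                         % returned as Inf if A is unstable
```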
H∞ Fixed-Order Controller Design
• Choose matrices a, b, c, d to minimize the H∞ norm of the closed-loop transfer function
• The dimension of a is k by k (the controller order, between 0 and n = dim(A))
• The dimension of b is k by p (number of system inputs)
• The dimension of c is m (number of system outputs) by k
• The dimension of d is m by p
• Total number of variables is (k+m)(k+p)
• The case k=0 is static output feedback
• When B1, C1 are empty, all I/O channels are in the performance measure
HIFOO: H∞ Fixed-Order Optimization
• Aims to find a, b, c, d for which the H∞ norm is locally optimal
• Begins by minimizing the spectral abscissa max(real(eig(A-block))) until it finds a, b, c, d for which the A-block is stable (and therefore the H∞ norm is finite) – see the sketch below for the static case
• Then locally minimizes the H∞ norm
• Calls HANSO to carry out both optimizations
• Alternative objectives:
  – Optimize the spectral abscissa instead of quitting when stable
  – Optimize the pseudospectral abscissa
  – Optimize the distance to instability (complex stability radius)
• The output a, b, c, d can then be input to optimize for a larger order k, which cannot give a worse result
• Accepts various input formats
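A minimal sketch of the stabilization-phase objective for the static output feedback case (k = 0, so the controller is just the matrix d); here the closed-loop A-block is taken to be A + B·d·C, which assumes zero feedthrough from the control input to the measured output:

```matlab
% Spectral abscissa of the closed-loop A-block for static output feedback.
% HIFOO's first phase drives this below zero; the H-infinity phase then takes over.
function alpha = closedloop_abscissa(d, A, B, C)
    Acl   = A + B*d*C;               % closed-loop state matrix (zero feedthrough assumed)
    alpha = max(real(eig(Acl)));     % spectral abscissa: rightmost eigenvalue's real part
end
```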
HIFOO Provides the Gradients
• For the H∞ norm, it combines
  – left and right singular vector information at the point on the imaginary axis where the sup is achieved
  – the chain rule
• For the spectral abscissa max(real(eig)), it combines
  – left and right eigenvector information for the eigenvalue achieving the max real part
  – the chain rule
• Do not have to worry about ties
• The function is continuous, but the gradient is not
• However, the function is differentiable almost everywhere (and virtually everywhere that it is evaluated)
• Typically, it is not differentiable at an exact minimizer
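For concreteness, a sketch of the eigenvector-based derivative of the spectral abscissa with respect to one design parameter, assuming the rightmost eigenvalue is simple; dA_dxj below is a hypothetical placeholder for the derivative of the closed-loop A-block with respect to that parameter (supplied via the chain rule):

```matlab
% If lambda is a simple eigenvalue of Acl with right eigenvector v and left
% eigenvector w, then  d(Re lambda)/dx_j = Re( w'*(dAcl/dx_j)*v / (w'*v) ).
[V, D, W]  = eig(Acl);                   % three-output eig also returns left eigenvectors
[~, i]     = max(real(diag(D)));         % index of the rightmost eigenvalue
v = V(:, i);  w = W(:, i);
dalpha_dxj = real( (w' * dA_dxj * v) / (w' * v) );
```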
Benchmark Examples: AC Suite
• Made extensive runs for the AC suite of 18 aircraft control problems in Leibfritz's COMPleib
• In many cases we found low-order controllers that had a smaller H∞ norm than was supposedly possible according to the standard full-order controller MATLAB routine HINFSYN!
• Sometimes this was even the case when the order was set to 0 (static output feedback)
• After I announced this at the SIAM Control meeting in July, HINFSYN was extensively debugged and a new version has been released
Benchmark Examples: Mass-Spring
• A well known simple example with n=4
• Optimizing the spectral abscissa:
  – Order 1: we obtain 0, known to be the optimal value
  – Order 2: we obtain about -0.73, previously conjectured to be about -0.5
Benchmarks: Belgian Chocolate Challenge
• A simply stated stabilization problem due to Blondel
• From 1994 to 2000, it was not known whether it was solvable by a controller of any order
• In 2000, an order 11 controller was discovered
• We found an order 3 controller, and using our analytical results, proved that it is locally optimal
Availability of HANSO and HIFOO
• Version 0.9, documented in a paper submitted to ROCOND 2006, is freely available at http://www.cs.nyu.edu/overton/software/hifoo/
• Version 0.91 is available online but not fully tested, so there is no public link yet
• Version 0.92 will have further enhancements and we hope to announce it widely in a month or two
Some Relevant Publications
• A Robust Gradient Sampling Algorithm for Nonsmooth, Nonconvex Optimization, SIOPT, 2005 (with Burke, Lewis)
• Approximating Subdifferentials by Random Sampling of Gradients, MOR, 2002 (with Burke, Lewis)
• Stabilization via Nonsmooth, Nonconvex Optimization, submitted to IEEE Trans. Automatic Control (with Burke, Henrion, Lewis)
• HANSO: A Hybrid Algorithm for Non-Smooth Optimization Based on BFGS, Bundle and Gradient Sampling, in the planning stage
http://www.cs.nyu.edu/faculty/overton/
Gong xi fa cai! Gung hei fat choy! (恭喜发财 / 恭喜發財 – Wishing you prosperity!)