HANSO and HIFOO: Two New MATLAB Codes
Michael L. Overton
Courant Institute of Mathematical Sciences, New York University
Singapore, Jan 2006
Two New Codes
• HANSO: Hybrid Algorithm for Nonsmooth Optimization
  – Aimed at finding local minimizers of general nonsmooth, nonconvex optimization problems
  – Joint with Jim Burke and Adrian Lewis
• HIFOO: H-Infinity Fixed-Order Optimization
  – Aimed at solving specific nonsmooth, nonconvex optimization problems arising in control
  – Built on HANSO
  – Joint with Didier Henrion
Many optimization objectives are
• Nonconvex
• Nonsmooth
• Continuous
• Differentiable almost everywhere
• With gradients often available at little additional cost beyond that of computing the function
• Subdifferentially regular
• Sometimes, non-Lipschitz
Numerical Optimization of Nonsmooth, Nonconvex Functions
• Steepest descent: jams
• Bundle methods: better, but mainly intended for nonsmooth convex functions
• We developed a simple method for nonsmooth, nonconvex minimization based on Gradient Sampling
• Intended for continuous functions that are differentiable almost everywhere, and for which the gradient can be easily computed where it is defined
• User need only write a routine returning the function value and gradient – and need not worry about nondifferentiable cases, e.g., ties for a max when coding the gradient
• Very recently, we found that BFGS is often very effective when implemented correctly – and much less expensive
BFGS
• Standard quasi-Newton method for optimizing smooth functions, using gradient differences to update an inverse Hessian approximation H, defining a local quadratic model of the objective function
• Conventional wisdom: jams on nonsmooth functions
• Amazingly, works very well as long as it is implemented right (a weak Wolfe line search is essential)
• It builds an excellent, extremely ill-conditioned quadratic model
• Often runs until cond(H) is about 10^16 before breaking down, taking steplengths of around 1
• Often converges linearly (not superlinearly)
• By contrast, steepest descent and Newton's method usually jam
• Not very reliable, and no convergence theory
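For reference, a minimal sketch of the inverse-Hessian update involved (the standard BFGS formula; a weak Wolfe line search, sketched a few slides below, guarantees the curvature condition needed here):

```matlab
% One BFGS update of the inverse Hessian approximation H (a sketch).
% s = x_new - x_old, y = grad_new - grad_old; the weak Wolfe line search
% guarantees y'*s > 0, so rho is well defined.
function H = bfgs_update(H, s, y)
    rho = 1/(y'*s);
    n   = length(s);
    V   = eye(n) - rho*(y*s');
    H   = V'*H*V + rho*(s*s');   % H <- (I - rho*s*y')*H*(I - rho*y*s') + rho*s*s'
end
```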
Minimizing a Product of Eigenvalues
• An application due to Anstreicher and Lee
• Minimize the log of the product of the K largest eigenvalues of A∘X (Hadamard product) subject to X positive semidefinite and diag(X)=1 (A, X symmetric); we usually set K=N/2
• Would be equivalent to an SDP if the product were replaced by a sum
• Number of variables is n = N(N-1)/2, where N is the dimension of X
• Following Burer, substitute YY' for X, where Y is N by r, so n is reduced to rN; normalize Y so its rows have norm 1: thus both constraints are eliminated
• Typically the objective is not smooth at a minimizer, because the K-th largest eigenvalue is multiple
• A sketch of the resulting objective follows; results for N=10 and rank r = 2 through 10…
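A minimal sketch of the objective after Burer's substitution, assuming A is N-by-N symmetric and Y is N-by-r (the gradient, which BFGS also needs, can be assembled from the eigenvectors of the K largest eigenvalues and is omitted here):

```matlab
% f(Y) = log of the product of the K largest eigenvalues of A o (YY'),
% with the rows of Y normalized so that diag(YY') = 1.
function f = eigprod_obj(Y, A, K)
    Y   = diag(1./sqrt(sum(Y.^2, 2))) * Y;   % normalize rows: diag(X) = 1
    X   = Y*Y';                              % X = YY' is positive semidefinite
    lam = sort(eig(A.*X), 'descend');        % A.*X is the Hadamard product A o X
    f   = sum(log(lam(1:K)));                % log of product = sum of logs
end
```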
N=10, r=6: eigenvalues of optimal A∘X
3.163533558166873e-001
5.406646569314501e-001
5.406646569314647e-001
5.406646569314678e-001
5.406646569314684e-001
5.406646569314689e-001
5.406646569314706e-001
7.149382062194134e-001
7.746621702661097e-001
1.350303124150612e+000
The U and V Spaces
• The condition number of H, the BFGS approximation to the inverse Hessian, typically reaches 10^16 !!
• Let's look at the eigenvalues of H: H = QDQ* (* denotes transpose); the search direction is d = -Hg = -QDQ*g (where g = gradient)
• The next plot shows (see the sketch below):
  – diag(D), sorted into ascending order
  – components of |Q*g|, using the same ordering
  – components of |DQ*g|, using the same ordering (the search direction expanded in the eigenvector basis)
• Tiny eigenvalues of H correspond to the “nonsmooth” V-space
• Other eigenvalues of H correspond to the “smooth” U-space
• From matrix theory, dim(V-space) = m(m+1)/2 – 1, where m is the multiplicity of the K-th eigenvalue of A∘X
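A small sketch of how the plotted quantities can be computed, assuming H is the final BFGS matrix and g the gradient at the final point:

```matlab
% Expand the gradient and search direction in the eigenvector basis of H.
[Q, D]    = eig(H);                   % H = Q*D*Q' for symmetric H
[dH, idx] = sort(diag(D));            % eigenvalues of H, ascending
Q         = Q(:, idx);
coeff_g   = abs(Q'*g);                % |Q*g| components, same ordering
coeff_d   = dH .* coeff_g;            % |DQ*g|: the search direction -Hg in this basis
semilogy([dH, coeff_g, coeff_d], '.-');
legend('diag(D)', '|Q^T g|', '|D Q^T g|');   % tiny eigenvalues of H flag the V-space
```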
More on the U and V Spaces
• H seems to be an excellent approximation to the “limiting inverse Hessian” and evidently gives us bases for the optimal U and V spaces
• Let's look at plots of f along random directions in the U and V spaces, as defined by eigenvectors of H, passing through the minimizer (for the N=10, r=6 example)
Line Search for BFGS
• Sufficient decrease: for 0 < c1 < 1,
    f(x + td) ≤ f(x) + c1 t (grad f)(x)*d
• Weak Wolfe condition on the derivative: for c1 < c2 < 1,
    (grad f)(x + td)*d ≥ c2 (grad f)(x)*d
• Not the strong Wolfe condition on the derivative:
    |(grad f)(x + td)*d| ≤ −c2 (grad f)(x)*d
• Essential when f is nonsmooth
• No reason to use strong Wolfe even if f is smooth
• Much simpler to implement
• Why ever use strong Wolfe? (A sketch of a weak Wolfe line search follows.)
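A minimal sketch of a weak Wolfe line search by bracketing and bisection, in the spirit of the conditions above; fcn is an assumed handle returning both the function value and gradient, with 0 < c1 < c2 < 1:

```matlab
% Find t satisfying both weak Wolfe conditions:
%   f(x0 + t*d) <= f(x0) + c1*t*g0'*d      (sufficient decrease)
%   grad f(x0 + t*d)'*d >= c2*g0'*d        (weak Wolfe condition on the derivative)
function [t, x, f, g] = weak_wolfe(fcn, x0, f0, g0, d, c1, c2)
    a = 0; b = Inf; t = 1;
    gtd0 = g0'*d;                          % directional derivative at t = 0
    for iter = 1:50
        x = x0 + t*d;
        [f, g] = fcn(x);
        if f > f0 + c1*t*gtd0
            b = t;                         % sufficient decrease fails: shrink bracket
        elseif g'*d < c2*gtd0
            a = t;                         % curvature condition fails: grow step
        else
            return                         % both conditions hold
        end
        if isinf(b), t = 2*a; else, t = (a + b)/2; end
    end
end
```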
But how to check if the final x is locally optimal?
• Extreme ill-conditioning of H is a strong clue, but it is expensive to compute and proves nothing
• If x is locally optimal, running a local bundle method from x establishes nonsmooth stationarity: this involves repeated null-step line searches and building a bundle of gradients obtained via the line searches
• When the line search returns a null step, it must also return a gradient at a point lying across the discontinuity
• Then the new d = – arg min { ||d|| : d ∈ conv G }, where G is the set of gradients in the bundle
• Terminate if ||d|| is small
• If ||d|| is 0 and the line search steps were 0, x is Clarke stationary; more realistically it is “close” to Clarke stationary
• I'll call this Clarke Quay Stationary since it is a Quay Idea
First-order local optimality
• Is this implied by Clarke stationarity of f at x?
• No: for example, f(x) = -|x| is Clarke stationary at 0
• Yes, in the sense that f′(x; d) ≥ 0 for all directions d, when f is regular at x (subdifferentially regular, Clarke regular)
• Most of the functions that we have studied are regular everywhere, although this is sometimes hard to prove
• Regularity generalizes smoothness and convexity
Harder Problems
• Eigenvalue product problems are interesting, but the Lipschitz constant L for f is not large
• Harder problems:
  – Chebyshev Exponential Approximation
  – Optimization of Distance to Instability
  – Pseudospectral Abscissa Optimization (arbitrarily large L)
  – Spectral Abscissa Optimization (not even Lipschitz)
• In all cases, BFGS turns out to be substantially less reliable than gradient sampling
Gradient Sampling Algorithm
Initialize ε and x. Repeat:
• Get G, a set of gradients of the function f evaluated at x and at points near x (sampling controlled by ε)
• Let d = – arg min { ||d|| : d ∈ conv G }
• Line search: replace x by x + td, with f(x + td) < f(x) (if d is not 0)
until d = 0 (or ||d|| ≤ tol). Then reduce ε and repeat. (A sketch of the direction subproblem follows.)
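The direction subproblem above – the minimum-norm element of the convex hull of the sampled gradients – is a small convex QP. A minimal sketch, assuming the sampled gradients are the columns of G and that quadprog from the Optimization Toolbox is available:

```matlab
% d = -argmin{ ||d|| : d in conv G }: write d = G*lambda with lambda on the
% simplex and solve  min 0.5*lambda'*(G'*G)*lambda  s.t. sum(lambda)=1, lambda>=0.
function d = min_norm_in_convhull(G)
    k      = size(G, 2);
    lambda = quadprog(G'*G, zeros(k,1), [], [], ones(1,k), 1, zeros(k,1), []);
    d      = -G*lambda;     % negative of the minimum-norm convex combination
end
```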
Convergence Theory for Gradient Sampling Method (SIOPT, 2005)
• Suppose
  – f is locally Lipschitz and coercive
  – f is continuously differentiable on an open dense subset of its domain
  – the number of gradients sampled near each iterate is greater than the problem dimension
  – the line search uses an appropriate sufficient decrease condition
• Then, with probability one and for fixed sampling parameter ε, the algorithm generates a sequence of points with a cluster point x that is ε-Clarke stationary
• If f has a unique Clarke stationary point x, then the set of all cluster points generated by the algorithm converges to x as ε is reduced to zero
• Kiwiel has already improved on this! (For a slightly different, as yet untested, version of the algorithm.)
HANSO
• The user provides a routine to compute f and its gradient at any given x (and need not worry about nonsmooth cases) – see the example below
• BFGS is then run from many randomly generated or user-provided starting points. Quit if the gradient at the best point found is small.
• Otherwise, a local bundle method is run from the best point found, attempting to verify local stationarity.
• If this is not successful, gradient sampling is initiated.
• Regardless, return a final x along with a set of nearby points, the corresponding gradients, and the d which is the smallest vector in the convex hull of these gradients
• If the nearby points are near enough and d is small enough, f is Clarke Quay Stationary at x
• This is an approximate first-order local optimality condition if f is regular at x
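For illustration, here is the kind of user routine described above – a single function returning both the value and a gradient, with no special handling of ties (the function name maxsq and the example objective are hypothetical, not part of HANSO itself):

```matlab
% Example user routine: f(x) = max_i x_i^2, nonsmooth wherever the max is tied.
% At a tie, simply return the gradient of whichever branch attains the max.
function [f, g] = maxsq(x)
    [f, i] = max(x.^2);
    g      = zeros(size(x));
    g(i)   = 2*x(i);
end
```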
Optimization in Control
• Often, not many variables, as in low-order controller design
• Reasons:
  – engineers like simple controllers
  – nonsmooth, nonconvex optimization problems in even a few variables can be very difficult
• Sometimes, more variables are added to the problem in an attempt to make it more tractable
  – Example: adding a Lyapunov matrix variable P and imposing stability on A by AP + PA* ≤ 0
  – When A depends linearly on the original parameters, this results in a bilinear matrix inequality (BMI), which is typically difficult to solve
• Our approach: tackle the original problem
• Looking for locally optimal solutions
H∞ Norm of a Transfer Function
• Transfer function G(s) = C(sI – A)^{-1}B + D (state-space representation)
• The H∞ norm is ∞ if A is not stable (stable means all eigenvalues have negative real part), and otherwise is sup ||G(s)||_2 over all imaginary s
• The standard way to compute it is norm(SS,inf) in the MATLAB Control Toolbox
• LINORM from SLICOT is much faster
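A minimal sketch of the Control System Toolbox computation mentioned above (A, B, C, D assumed given):

```matlab
% H-infinity norm of G(s) = C*(s*I - A)^(-1)*B + D via the Control System Toolbox.
sys  = ss(A, B, C, D);
hinf = norm(sys, Inf);   % sup over imaginary s of the largest singular value of G(s);
                         % returned as Inf if A is unstable
```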
H∞ Fixed-Order Controller Design
• Choose matrices a, b, c, d to minimize the H∞ norm of the closed-loop transfer function
• The dimension of a is k by k (the controller order, between 0 and n = dim(A))
• The dimension of b is k by p (number of system inputs)
• The dimension of c is m (number of system outputs) by k
• The dimension of d is m by p
• Total number of variables is (k+m)(k+p)
• The case k=0 is static output feedback
• When B1, C1 are empty, all I/O channels are in the performance measure
HIFOO: H∞ Fixed-Order Optimization
• Aims to find a, b, c, d for which the H∞ norm is locally optimal
• Begins by minimizing the spectral abscissa max(real(eig(A-block))) until it finds a, b, c, d for which the A-block is stable (and therefore the H∞ norm is finite) – see the sketch below for the static case
• Then locally minimizes the H∞ norm
• Calls HANSO to carry out both optimizations
• Alternative objectives:
  – Optimize the spectral abscissa instead of quitting when stable
  – Optimize the pseudospectral abscissa
  – Optimize the distance to instability (complex stability radius)
• The output a, b, c, d can then be input to optimize for a larger order k, which cannot give a worse result
• Accepts various input formats
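A minimal sketch of the stabilization-phase objective for the static output feedback case (k = 0, so the controller is just the matrix d); here the closed-loop A-block is taken to be A + B·d·C, which assumes zero feedthrough from the control input to the measured output:

```matlab
% Spectral abscissa of the closed-loop A-block for static output feedback.
% HIFOO's first phase drives this below zero; the H-infinity phase then takes over.
function alpha = closedloop_abscissa(d, A, B, C)
    Acl   = A + B*d*C;               % closed-loop state matrix (zero feedthrough assumed)
    alpha = max(real(eig(Acl)));     % spectral abscissa: rightmost eigenvalue's real part
end
```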
HIFOO Provides the Gradients
• For the H∞ norm, it combines
  – left and right singular vector information at the point on the imaginary axis where the sup is achieved
  – the chain rule
• For the spectral abscissa max(real(eig)), it combines
  – left and right eigenvector information for the eigenvalue achieving the max real part
  – the chain rule
• Do not have to worry about ties
• The function is continuous, but the gradient is not
• However, the function is differentiable almost everywhere (and virtually everywhere that it is evaluated)
• Typically, it is not differentiable at an exact minimizer
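For concreteness, a sketch of the eigenvector-based derivative of the spectral abscissa with respect to one design parameter, assuming the rightmost eigenvalue is simple; dA_dxj below is a hypothetical placeholder for the derivative of the closed-loop A-block with respect to that parameter (supplied via the chain rule):

```matlab
% If lambda is a simple eigenvalue of Acl with right eigenvector v and left
% eigenvector w, then  d(Re lambda)/dx_j = Re( w'*(dAcl/dx_j)*v / (w'*v) ).
[V, D, W]  = eig(Acl);                   % three-output eig also returns left eigenvectors
[~, i]     = max(real(diag(D)));         % index of the rightmost eigenvalue
v = V(:, i);  w = W(:, i);
dalpha_dxj = real( (w' * dA_dxj * v) / (w' * v) );
```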
Benchmark Examples: AC Suite
• Made extensive runs for the AC suite of 18 aircraft control problems in Leibfritz's COMPleib
• In many cases we found low-order controllers that had a smaller H∞ norm than was supposedly possible according to the standard full-order controller MATLAB routine HINFSYN!
• Sometimes this was even the case when the order was set to 0 (static output feedback)
• After I announced this at the SIAM Control meeting in July, HINFSYN was extensively debugged and a new version has been released
Benchmark Examples: Mass-Spring
• A well known simple example with n=4
• Optimizing the spectral abscissa:
  – Order 1: we obtain 0, known to be the optimal value
  – Order 2: we obtain about -0.73, previously conjectured to be about -0.5
Benchmarks: Belgian Chocolate Challenge
• A simply stated stabilization problem due to Blondel
• From 1994 to 2000, it was not known whether it was solvable by a controller of any order
• In 2000, an order 11 controller was discovered
• We found an order 3 controller, and using our analytical results, proved that it is locally optimal
Availability of HANSO and HIFOO
• Version 0.9, documented in a paper submitted to ROCOND 2006, is freely available at http://www.cs.nyu.edu/overton/software/hifoo/
• Version 0.91 is available online but not fully tested, so there is no public link yet
• Version 0.92 will have further enhancements and we hope to announce it widely in a month or two
Some Relevant Publications
• A Robust Gradient Sampling Algorithm for Nonsmooth, Nonconvex Optimization, SIOPT, 2005 (with Burke, Lewis)
• Approximating Subdifferentials by Random Sampling of Gradients, MOR, 2002 (with Burke, Lewis)
• Stabilization via Nonsmooth, Nonconvex Optimization, submitted to IEEE Trans. Automatic Control (with Burke, Henrion, Lewis)
• HANSO: A Hybrid Algorithm for Non-Smooth Optimization Based on BFGS, Bundle and Gradient Sampling, in the planning stage
http://www.cs.nyu.edu/faculty/overton/
Gong xi fa cai! Gung hei fat choy! (恭喜发财 / 恭喜發財 – Wishing you prosperity!)