Unconstrained Optimization Rong Jin
Logistic Regression The optimization problem is to find weights w and b that maximize the log-likelihood of the data. How can we do this efficiently?
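The slide's likelihood formula is not reproduced here; as a reference point, the standard binary logistic regression log-likelihood (assuming labels yi ∈ {−1, +1}) is

$$\log L(\mathbf{w}, b) = \sum_{i=1}^{n} \log \sigma\!\big(y_i(\mathbf{w}^\top \mathbf{x}_i + b)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$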
Gradient Ascent • Compute the gradient • Move the weights w and threshold b in the gradient direction
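A minimal NumPy sketch of gradient ascent for this model (function names, labels in {−1, +1}, and the fixed step size are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, step=0.1, n_iters=1000):
    """Maximize the logistic log-likelihood by plain gradient ascent.

    X: (n, m) feature matrix, y: labels in {-1, +1}.
    Returns the weights w (length m) and threshold b.
    """
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        # d/dw log L = sum_i y_i * x_i * sigmoid(-y_i (w.x_i + b))
        coeff = y * sigmoid(-margins)
        grad_w = X.T @ coeff
        grad_b = coeff.sum()
        w += step * grad_w   # move in the gradient direction
        b += step * grad_b
    return w, b
```

With a fixed step size the iteration may converge slowly or oscillate, which is exactly the issue discussed on the next slide.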
Problem with Gradient Ascent • Difficult to find an appropriate step size • Too small: slow convergence • Too large: oscillation or “bubbling” • Convergence conditions • Robbins-Monro conditions • Together with a “regular” (well-behaved) objective function, they ensure convergence
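For reference, the Robbins-Monro conditions require the step sizes ηt to satisfy

$$\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty,$$

i.e., the steps stay large enough to reach any point but shrink fast enough to settle down.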
Newton Method • Utilize the second-order derivative • Expand the objective function to second order around x0 • The minimum point of this quadratic approximation gives the next iterate (see the update below) • Newton method for optimization • Guaranteed to converge when the objective function is convex
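A standard statement of the one-dimensional Newton step, included since the slide's formulas are not reproduced: the second-order expansion around x0 is

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2,$$

and setting the derivative of the right-hand side to zero gives the update

$$x_{\text{new}} = x_0 - \frac{f'(x_0)}{f''(x_0)}.$$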
Multivariate Newton Method • Objective function involves multiple variables • Example: logistic regression model • Text categorization: thousands of words, hence thousands of variables • Multivariate Newton Method • Multivariate function: • First-order derivative is a vector (the gradient) • Second-order derivative is the Hessian matrix • The Hessian is an m×m matrix • Each element of the Hessian matrix is defined as:
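In standard notation,

$$H_{ij} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \, \partial x_j}, \qquad i, j = 1, \dots, m.$$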
Multivariate Newton Method • Updating equation (see below): • Hessian matrix for the logistic regression model • Can be expensive to compute • Example: text categorization with 10,000 words • The Hessian matrix is of size 10,000 × 10,000, i.e., 100 million entries • Even worse, we have to compute the inverse of the Hessian matrix, H⁻¹
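In standard notation, the multivariate Newton update is

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}\,\nabla f(\mathbf{x}_k).$$

For the logistic regression log-likelihood in the ±1-label formulation assumed earlier (with b absorbed into w through a constant feature), the Hessian works out to

$$H = -\sum_{i=1}^{n} \sigma(m_i)\big(1 - \sigma(m_i)\big)\,\mathbf{x}_i \mathbf{x}_i^\top, \qquad m_i = y_i\,\mathbf{w}^\top\mathbf{x}_i,$$

which is dense and therefore expensive for high-dimensional text data.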
Quasi-Newton Method • Approximate the inverse Hessian H⁻¹ with another matrix B • B is updated iteratively (BFGS; see the update below) • Utilizes the gradients of previous iterations
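For reference (the slide's formula is not reproduced), the standard BFGS update of the inverse-Hessian approximation uses the differences sk = xk+1 − xk and yk = ∇f(xk+1) − ∇f(xk):

$$B_{k+1} = \big(I - \rho_k \mathbf{s}_k \mathbf{y}_k^\top\big)\, B_k \,\big(I - \rho_k \mathbf{y}_k \mathbf{s}_k^\top\big) + \rho_k \mathbf{s}_k \mathbf{s}_k^\top, \qquad \rho_k = \frac{1}{\mathbf{y}_k^\top \mathbf{s}_k}.$$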
Limited-Memory Quasi-Newton • Quasi-Newton • Avoids computing the inverse of the Hessian matrix • But it still requires storing the m×m matrix B → large storage • Limited-Memory Quasi-Newton (L-BFGS) • Avoids even explicitly forming the B matrix • B can be expressed as a product of vectors • Only keeps the most recent vector pairs (typically 3~20); a sketch of this recursion follows
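A compact sketch of the L-BFGS two-loop recursion, which applies the implicit B to a gradient using only the stored vector pairs (function and variable names here are illustrative assumptions, not from the slides):

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: return r = B @ grad, where B is the L-BFGS
    inverse-Hessian approximation defined by the stored (s, y) pairs.
    B itself is never formed.

    s_hist, y_hist: most recent pairs, oldest first, with
        s_k = x_{k+1} - x_k,  y_k = grad_{k+1} - grad_k.
    For minimization, the step is x_new = x - step * r.
    """
    if not s_hist:                      # no history yet: plain gradient step
        return grad.copy()
    q = grad.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # newest to oldest
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    # Scale by an initial guess of the inverse Hessian (a common choice)
    gamma = np.dot(s_hist[-1], y_hist[-1]) / np.dot(y_hist[-1], y_hist[-1])
    r = gamma * q
    for (s, y), alpha, rho in zip(zip(s_hist, y_hist),     # oldest to newest
                                  reversed(alphas), reversed(rhos)):
        beta = rho * np.dot(y, r)
        r += s * (alpha - beta)
    return r
```

Only the last few (s, y) pairs are kept, so both the storage and the per-iteration cost are linear in the number of variables.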
| Number of Variables | Method: Efficiency (cost per iteration) | Convergence Rate |
|---|---|---|
| Small | Standard Newton method: O(n³) | V-Fast |
| Medium | Quasi-Newton method (BFGS): O(n²) | Fast |
| Large | Limited-memory Quasi-Newton method (L-BFGS): O(n) | R-Fast |
Empirical Study: Learning Conditional Exponential Model (figure: convergence comparison of the limited-memory Quasi-Newton method and gradient ascent)
Free Software • http://www.ece.northwestern.edu/~nocedal/software.html • L-BFGS • L-BFGS-B
Linear Conjugate Gradient Method • Consider optimizing a quadratic function • Conjugate vectors • The set of vectors {p1, p2, …, pl} is said to be conjugate with respect to a matrix A if it satisfies the condition below • Important property • The quadratic function can be optimized by simply optimizing it along each individual direction in the conjugate set • Optimal solution (see below): • αk is the minimizer along the kth conjugate direction
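In the usual notation (the slide's formulas are not reproduced, so this states the standard linear CG setup), the quadratic objective is

$$\phi(\mathbf{x}) = \tfrac{1}{2}\,\mathbf{x}^\top A \mathbf{x} - \mathbf{b}^\top \mathbf{x},$$

the conjugacy condition is

$$\mathbf{p}_i^\top A \mathbf{p}_j = 0 \quad \text{for all } i \neq j,$$

and, starting from x0 = 0, the optimal solution can be written as

$$\mathbf{x}^* = \sum_{k} \alpha_k \mathbf{p}_k, \qquad \alpha_k = \frac{\mathbf{p}_k^\top \mathbf{b}}{\mathbf{p}_k^\top A \mathbf{p}_k},$$

where each αk is exactly the one-dimensional minimizer of φ along pk.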
Example • Minimize the following function • Matrix A • Conjugate directions • Optimization • First direction, x1 = x2 = x: • Second direction, x1 = −x2 = x: • Solution: x1 = x2 = 1
How to Efficiently Find a Set of Conjugate Directions • Iterative procedure • Given conjugate directions {p1, p2, …, pk−1} • Set pk as follows (see the formula below): • Theorem: the direction generated in this step is conjugate to all previous directions {p1, p2, …, pk−1} • Note: computing the kth direction pk only requires the previous direction pk−1
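The slide's construction formula is not reproduced; in the standard linear CG procedure, with residual rk = A xk − b, the new direction is

$$\mathbf{p}_k = -\mathbf{r}_k + \beta_k\,\mathbf{p}_{k-1}, \qquad \beta_k = \frac{\mathbf{r}_k^\top \mathbf{r}_k}{\mathbf{r}_{k-1}^\top \mathbf{r}_{k-1}},$$

and the theorem states that $\mathbf{p}_k^\top A \mathbf{p}_i = 0$ for all i < k, even though only pk−1 appears in the update.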
Nonlinear Conjugate Gradient • Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions • Several variants: • Fletcher-Reeves conjugate gradient (FR-CG) • Polak-Ribière conjugate gradient (PR-CG) • More robust than FR-CG • Compared to the Newton method • No need to compute the Hessian matrix • No need to store the Hessian matrix
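For reference (these formulas are not on the slides), the two variants differ only in how the mixing coefficient βk is computed from the gradients gk = ∇f(xk):

$$\beta_k^{\mathrm{FR}} = \frac{\mathbf{g}_k^\top \mathbf{g}_k}{\mathbf{g}_{k-1}^\top \mathbf{g}_{k-1}}, \qquad \beta_k^{\mathrm{PR}} = \frac{\mathbf{g}_k^\top(\mathbf{g}_k - \mathbf{g}_{k-1})}{\mathbf{g}_{k-1}^\top \mathbf{g}_{k-1}}.$$

On a quadratic with exact line searches the two coincide; on general nonlinear functions PR-CG tends to recover better from poor search directions, which is the robustness advantage noted above.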