
Model Selection via Bilevel Optimization


Presentation Transcript


  1. Model Selection via Bilevel Optimization Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang Department of Mathematical Sciences Rensselaer Polytechnic Institute Troy, NY

  2. Convex Machine Learning Convex optimization approaches to machine learning have been a major obsession of the field for the last ten years. But are the problems really convex?

  3. Outline • The myth of convex machine learning • Bilevel Programming Model Selection • Regression • Classification • Extensions to other machine learning tasks • Discussion

  4. Modeler’s Choices [Flow diagram: Data → Function → Loss/Regularization → Optimization Algorithm → w; the optimization step is labeled CONVEX!]

  5. Many Hidden Choices • Data: • Variable Selection • Scaling • Feature Construction • Missing Data • Outlier removal • Function Family: • linear, kernel (introduces kernel parameters) • Optimization model • loss function • regularization • Parameters/Constraints

  6. [Flow diagram: a cross-validation strategy wraps the pipeline Data → Function → Loss/Regularization → Optimization Algorithm → w; cross-validation selects C, ε over [X,y] to minimize the estimated generalization error. The resulting problem is NONCONVEX.]

  7. How does the modeler make choices? • Best training set error • Experience/policy • Estimate of generalization error: cross-validation, bounds • Optimize the generalization error estimate: fiddle around, grid search, gradient methods, bilevel programming

  8. Splitting Data for T-fold CV
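
The fold-splitting figure is not reproduced in this transcript. As a rough illustration only (not the authors' code; data sizes assumed from slide 19), T-fold splitting in Python might look like this:

    # Hypothetical illustration of T-fold splitting for cross validation.
    import numpy as np
    from sklearn.model_selection import KFold

    X = np.random.randn(90, 15)  # e.g., 90 points, 15 features (assumed sizes)
    y = np.random.randn(90)

    T = 3  # number of folds, matching the 3-fold CV used in the experiments
    for t, (train_idx, val_idx) in enumerate(
            KFold(n_splits=T, shuffle=True, random_state=0).split(X)):
        # Fold t: train on the T-1 remaining parts, validate on the held-out part
        X_train, y_train = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]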

  9. CV via Grid Search For every (C, ε): for every validation set, solve the model on the corresponding training set and use it to estimate the loss; then estimate the generalization error for (C, ε). Return the best values of (C, ε) and build the final model using them.
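
As a concrete rendering of this grid-search procedure (a sketch only, using scikit-learn rather than the authors' setup; grid values assumed from slide 20), one could write:

    # Hypothetical grid-search CV over (C, epsilon) for support vector regression.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    X, y = np.random.randn(90, 15), np.random.randn(90)

    grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]}  # 3 values each
    search = GridSearchCV(SVR(kernel="linear"), grid, cv=3,
                          scoring="neg_mean_absolute_error")  # MAD criterion
    search.fit(X, y)                      # solve each training model, estimate loss
    final_model = search.best_estimator_  # final model using the best (C, epsilon)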

  10. CV as a Continuous Optimization Problem • Bilevel program for T folds: an outer-level validation problem constrained by T inner-level training problems • Prior approaches: Golub et al., 1979, Generalized Cross-Validation for one parameter in ridge regression
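
The bilevel program itself appears only as an image in the original slides. A reconstruction consistent with the surrounding description (notation assumed: Ω_t is the validation set of fold t and Ω̄_t its complementary training set) is:

    \min_{C,\,\varepsilon,\,\bar w,\,w^1,\dots,w^T}\ \frac{1}{T}\sum_{t=1}^{T}\frac{1}{|\Omega_t|}\sum_{i\in\Omega_t}\bigl|x_i^\top w^t - y_i\bigr|
    \quad\text{s.t. for } t = 1,\dots,T:\quad
    w^t \in \arg\min_{-\bar w \le w \le \bar w}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{j\in\bar\Omega_t}\max\bigl(|x_j^\top w - y_j| - \varepsilon,\ 0\bigr)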

  11. Benefit: More Design Variables Add the feature box constraint −w̄ ≤ w ≤ w̄ in the inner-level problems, making the bound vector w̄ an additional outer-level design variable.

  12. ε-insensitive Loss Function
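
The loss is shown only as a figure; the standard ε-insensitive loss it plots is

    L_\varepsilon\bigl(y, f(x)\bigr) = \max\bigl(|y - f(x)| - \varepsilon,\ 0\bigr),

which is zero inside a tube of width ε around the regression function and grows linearly outside it.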

  13. Inner-level Problem for t-th Fold

  14. Optimality (KKT) conditions for the inner-level problem with the outer-level variables held fixed
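
The conditions themselves appear as an image; a sketch for the inner problem above (written with slacks ξ, multipliers α⁺, α⁻ for the tube constraints and γ⁺, γ⁻ for the box; notation assumed) is:

    \text{stationarity:}\quad w = \sum_j (\alpha_j^- - \alpha_j^+)\,x_j + \gamma^- - \gamma^+, \qquad 0 \le \alpha_j^+ + \alpha_j^- \le C
    \text{complementarity:}\quad 0 \le \alpha_j^+ \perp \varepsilon + \xi_j - x_j^\top w + y_j \ge 0, \qquad 0 \le \alpha_j^- \perp \varepsilon + \xi_j + x_j^\top w - y_j \ge 0,
    \qquad 0 \le \gamma^+ \perp \bar w - w \ge 0, \qquad 0 \le \gamma^- \perp \bar w + w \ge 0, \qquad 0 \le (C - \alpha_j^+ - \alpha_j^-) \perp \xi_j \ge 0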

  15. Key Transformation • The KKT conditions for the inner-level training problems are necessary and sufficient • Replace the lower-level problems by their KKT conditions • The problem becomes a Mathematical Program with Equilibrium Constraints (MPEC)

  16. Bilevel Problem as MPEC Replace T inner-level problems with corresponding optimality conditions

  17. MPEC to NLP via Inexact Cross Validation Relax the “hard” equilibrium constraints to “soft” inexact constraints, where tol is a user-defined tolerance.
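
Schematically, the relaxation shown on the slide follows the usual MPEC pattern: each complementarity condition

    0 \le a \perp b \ge 0

is replaced by the inexact version

    a \ge 0, \qquad b \ge 0, \qquad a^\top b \le \mathrm{tol},

which recovers exact cross validation as tol → 0.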

  18. Solvers Strategy: proof of concept using general-purpose nonlinear solvers from NEOS on the NLP. FILTER and SNOPT are sequential quadratic programming methods; FILTER results were almost always better. Many possible alternatives: integer programming, branch and bound, Lagrangian relaxations.

  19. Computational Experiments: DATA Synthetic • (5,10,15)-D data with Gaussian and Laplacian noise and (3,7,10) relevant features • NLP: 3-fold CV • Results: 30 to 90 training points, 1000 test points, 10 trials QSAR/Drug Design • 4 datasets, 600+ dimensions reduced to the top 25 principal components • NLP: 5-fold CV • Results: 40 to 100 training points, rest test, 20 trials

  20. Cross-validation Methods Compared • Unconstrained Grid: try 3 values each for C and ε • Constrained Grid: try 3 values each for C and ε, and {0, 1} for each component of the box bound w̄ • Bilevel/FILTER: nonlinear program solved using an off-the-shelf SQP algorithm via NEOS

  21. 15-D Data: Objective Value

  22. 15-D Data: Computational Time

  23. 15-D Data: TEST MAD

  24. QSAR Data: Objective Value

  25. QSAR Data: Computation Time

  26. QSAR Data: TEST MAD

  27. Classification Cross Validation Given sample data from two classes, find a classification function that minimizes an out-of-sample estimate of the classification error.

  28. Lower level - SVM • Define parallel planes • Minimize points on wrong side • Maximize margin of separation
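
The formulation is shown as an image; the standard soft-margin SVC primal these bullets describe is

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_i \xi_i
    \quad\text{s.t.}\quad y_i\,(x_i^\top w - b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,

with parallel planes x^\top w - b = \pm 1 and margin 2/\|w\|.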

  29. Lower Level Loss Function: Hinge Loss Measures the distance of points that violate the appropriate hyperplane constraints.
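
The formula itself is an image in the slides; the hinge loss being described is the standard

    \max\bigl(1 - y\,(x^\top w - b),\ 0\bigr).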

  30. Lower Level Problem: SVC with Box Constraint
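
The program itself is an image; combining the primal above with the feature box constraint of slide 11 gives the likely form

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_i \xi_i
    \quad\text{s.t.}\quad y_i\,(x_i^\top w - b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad -\bar w \le w \le \bar w.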

  31. Inner-level KKT Conditions

  32. Outer-level Loss Functions • Misclassification Minimization Loss (MM): the loss function used in classical CV; loss = 1 if the validation point is misclassified and 0 otherwise (computed using the step function) • Hinge Loss (HL): both inner and outer levels use the same loss function; loss = the distance from the violated hyperplane constraint (computed using the max function)

  33. Hinge Loss is Convex Approx. of Misclassification Minimization Loss

  34. Hinge Loss Bilevel Program (BilevelHL) • Replace max in outer level objective with convex constraints • Replace inner-level problems with KKT conditions

  35. Hinge Loss MPEC

  36. Misclassification Min. Bilevel Program (BilevelMM) Misclassifications are counted using the step function, defined componentwise for an n-vector as follows.
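
The definition is an image; the componentwise step function (the (\cdot)_* notation is assumed from the accompanying paper) is

    (r_*)_i = \begin{cases} 1 & \text{if } r_i > 0,\\ 0 & \text{if } r_i \le 0, \end{cases} \qquad i = 1,\dots,n.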

  37. The Step Function Mangasarian (1994) showed that the step function can be characterized as the solution of a linear program: any solution to that LP recovers the step-function values.

  38. Misclassifications in the Validation Set • A validation point is misclassified when the sign of the product of its label and its predicted value is negative • This can be recast for all validation points (within the t-th fold) as shown below
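
In symbols (fold superscripts assumed), a point i in the validation set Ω_t is misclassified exactly when

    y_i\,\bigl(x_i^\top w^t - b^t\bigr) < 0,

so the fold-t misclassification count can be written with the step function as \sum_{i\in\Omega_t}\bigl(-y_i\,(x_i^\top w^t - b^t)\bigr)_*.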

  39. Misclassification Minimization Bilevel Program (revisited) • Outer level: average misclassification minimization • Inner-level problems to determine the misclassified validation points • Inner-level training problems

  40. Misclassification Minimization MPEC

  41. Inexact Cross Validation NLP • Both the BilevelHL and BilevelMM MPECs are transformed into NLPs by relaxing the equilibrium constraints (inexact CV) • Solved using FILTER on NEOS • Compared with classical cross validation: unconstrained and constrained grid search

  42. Experiments: Data Sets • 3-fold cross validation for model selection • Average results over 20 train/test splits

  43. Computational Time

  44. Training CV Error

  45. Testing Error

  46. Number of Variables

  47. Progress • Cross validation is a bilevel problem solvable by continuous optimization methods • The off-the-shelf NLP solver FILTER handled both classification and regression • Bilevel optimization is extendable to many machine learning problems

  48. Extending Bilevel Approach to other Machine Learning Problems • Kernel Classification/Regression • Variable Selection/Scaling • Multi-task Learning • Semi-supervised Learning • Generative methods

  49. Semi-supervised Learning • Have labeled data and unlabeled data • Treat the missing labels as design variables in the outer level • The lower-level problems are still convex

  50. Semi-supervised Regression • The outer level minimizes the error on the labeled data to find the optimal parameters and labels • ε-insensitive loss on the labeled data in the inner level • ε-insensitive loss on the unlabeled data in the inner level • Inner-level regularization
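
The program appears as an image; a sketch consistent with the annotations above (labeled set L with labels y, unlabeled set U with outer-level label variables ŷ; other notation assumed) is

    \min_{C,\,\varepsilon,\,\hat y,\,w}\ \sum_{i\in L}\bigl|x_i^\top w - y_i\bigr|
    \quad\text{s.t.}\quad
    w \in \arg\min_{w}\ \tfrac{1}{2}\|w\|_2^2
    + C\sum_{i\in L}\max\bigl(|x_i^\top w - y_i| - \varepsilon,\ 0\bigr)
    + C\sum_{j\in U}\max\bigl(|x_j^\top w - \hat y_j| - \varepsilon,\ 0\bigr).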
