Recitation 4 for BigData: LASSO and Coordinate Descent
Jay Gu, Feb 7, 2013
A numerical example
Generate some synthetic data:
• N = 50, P = 200, # nonzero coefficients = 5
• X ~ Normal(0, I)
• beta_1, beta_2, beta_3 ~ Normal(1, 2)
• noise ~ Normal(0, 0.1*I)
• Y = X*beta + noise
Split training vs. testing: 80/20
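A minimal sketch of this data-generating process in Python/NumPy (the variable names and the random seed are illustrative, not from the slides; since the slides state 5 nonzero coefficients, the sketch draws all 5 from Normal(1, 2)):

```python
import numpy as np

rng = np.random.default_rng(0)

N, P, n_nonzero = 50, 200, 5           # 50 samples, 200 features, 5 true signals
X = rng.normal(0.0, 1.0, size=(N, P))  # X ~ Normal(0, I)
beta = np.zeros(P)
beta[:n_nonzero] = rng.normal(1.0, 2.0, size=n_nonzero)  # nonzero coefficients ~ Normal(1, 2)
noise = rng.normal(0.0, 0.1, size=N)   # noise ~ Normal(0, 0.1*I)
Y = X @ beta + noise

# 80/20 train/test split
n_train = int(0.8 * N)
X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]
```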
Practicalities
• Standardize your data:
  • Center X and Y: this removes the need for an intercept term
  • Scale each column of X to unit norm: this makes the regularization fair across covariates
• Warm start: run the lambdas from large to small, starting at lambda_max = max|X'y|, which guarantees the first solution has zero support size (all coefficients are exactly zero); see the sketch after this list.
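A sketch of both practicalities, continuing from the synthetic-data snippet above. Here `lasso_cd` stands for any LASSO solver that accepts a warm-start initializer (one possible implementation is sketched under the Algorithm slide below), and the grid of 50 log-spaced lambdas down to 0.001*lambda_max is an illustrative choice rather than something specified in the slides:

```python
import numpy as np

# Standardize: center X and Y (removes the intercept), scale columns of X to unit norm
X_c = X_train - X_train.mean(axis=0)
Y_c = Y_train - Y_train.mean()
X_s = X_c / np.linalg.norm(X_c, axis=0)

# Warm start: run lambdas from large to small.
# lambda_max = max|X'y| guarantees the all-zero solution at the first lambda.
lam_max = np.max(np.abs(X_s.T @ Y_c))
lambdas = np.logspace(np.log10(lam_max), np.log10(1e-3 * lam_max), 50)

beta_hat = np.zeros(X_s.shape[1])
path = []
for lam in lambdas:
    # Warm start: initialize each fit at the previous lambda's solution
    beta_hat = lasso_cd(X_s, Y_c, lam, beta_init=beta_hat)
    path.append(beta_hat.copy())
```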
Algorithm
Ridge regression: closed-form solution.
LASSO: iterative algorithms:
• Subgradient descent
• Generalized gradient methods (ISTA)
• Accelerated generalized gradient methods (FISTA)
• Coordinate descent (a minimal solver is sketched after this list)
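For ridge regression the closed form is beta_hat = (X'X + lambda*I)^(-1) X'y. For the LASSO, the sketch below shows a minimal cyclic coordinate-descent solver built on the soft-thresholding update; it assumes the objective (1/2)||y - X*b||^2 + lambda*||b||_1 and unit-norm columns of X, and is my illustration rather than code from the recitation:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: solution of the one-dimensional lasso subproblem."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta_init=None, n_iters=100):
    """Cyclic coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1.
    Assumes each column of X has unit norm, so the coordinate update needs no denominator."""
    n, p = X.shape
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    residual = y - X @ beta
    for _ in range(n_iters):
        for j in range(p):
            residual += X[:, j] * beta[j]       # remove coordinate j's contribution
            rho = X[:, j] @ residual            # correlation of feature j with the partial residual
            beta[j] = soft_threshold(rho, lam)  # exact 1-D minimization of the lasso subproblem
            residual -= X[:, j] * beta[j]       # restore the residual with the updated coordinate
    return beta
```

This uses the same `lasso_cd` signature as the warm-start sketch above, so the two pieces can be run together.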
Subdifferentials and Coordinate Descent
• Slides from Ryan Tibshirani:
  http://www.cs.cmu.edu/~ggordon/10725-F12/slides/06-sg-method.pdf
  http://www.cs.cmu.edu/~ggordon/10725-F12/slides/25-coord-desc.pdf
Coordinate descent: does it always find the global optimum?
• Convex and differentiable? Yes
• Convex and non-differentiable? No
• Convex with a separable non-differentiable part? Yes. Proof (sketched below):
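The proof on the original slide is not reproduced in this text; the standard argument (as in Tibshirani's coordinate-descent slides linked above) runs roughly as follows, for f(x) = g(x) + sum_i h_i(x_i) with g convex and differentiable and each h_i convex:

```latex
% Claim: if x minimizes f along every coordinate, then x is a global minimizer.
% Coordinatewise optimality in coordinate i means  -\nabla_i g(x) \in \partial h_i(x_i).
% Then for any y, using convexity of g,
f(y) - f(x) \;\ge\; \nabla g(x)^\top (y - x) + \sum_i \bigl[ h_i(y_i) - h_i(x_i) \bigr]
            \;=\; \sum_i \bigl[ \nabla_i g(x)\,(y_i - x_i) + h_i(y_i) - h_i(x_i) \bigr] \;\ge\; 0,
% since each bracketed term is nonnegative by the subgradient inequality for h_i at x_i.
```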
Rate of convergence?
• Assuming the gradient is Lipschitz continuous:
  • Subgradient descent: 1/sqrt(k)
  • Gradient descent: 1/k
  • Optimal rate for first-order methods: 1/k^2
• Coordinate descent: the rate is known only for some special cases
Summary: Coordinate Descent
• Good for large P
• No tuning parameter (no step size to choose)
• In practice, often converges much faster than the optimal first-order methods
• Only applies to certain cases (e.g., objectives whose non-differentiable part is separable)
• Convergence rate unknown for general function classes