Midterm Exam 02/28 (Thursday), take home, turn in by noon on 02/29 (Friday)
Project • 03/14 (Phase 1): 10% of training data is available for algorithm development • 04/04 (Phase 2): full training data and test examples are available • 04/17 (submission): submit your prediction before 11:59pm Apr. 20 (Wednesday) • 04/23 and 04/25: • Project presentation • Announce the competition results • 04/28: project report is due
Logistic Regression Rong Jin
Logistic Regression • Generative models often lead to a linear decision boundary • Linear discriminative model: directly model the linear decision boundary • w is the parameter to be learned
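As a reference, the standard logistic regression model for labels y ∈ {-1, +1} (the form assumed in the sketches that follow) is

\[
p(y \mid \mathbf{x}; \mathbf{w}) \;=\; \frac{1}{1 + \exp\!\big(-y\,\mathbf{w}^\top \mathbf{x}\big)},
\]

so the decision boundary \(\mathbf{w}^\top \mathbf{x} = 0\) is linear in \(\mathbf{x}\) (a bias term can be absorbed into \(\mathbf{w}\) by appending a constant feature to \(\mathbf{x}\)).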
Logistic Regression • Learn parameter w by maximum likelihood estimation (MLE) • Given training data:
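A sketch of the MLE objective under that model, assuming training data \((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\) with \(y_i \in \{-1, +1\}\):

\[
\max_{\mathbf{w}} \; l(\mathbf{w})
\;=\; \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w})
\;=\; -\sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i\,\mathbf{w}^\top \mathbf{x}_i)\big).
\]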
Logistic Regression • Convex objective function, so gradient descent converges to the global optimum • [Figure: classification error]
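A minimal sketch of batch gradient ascent on this log-likelihood; the function names, step size, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Illustrative sketch: batch gradient ascent on the logistic log-likelihood.
    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)                     # y_i * w.x_i for every example
        # dl/dw = sum_i y_i * sigmoid(-margin_i) * x_i, averaged over n examples
        grad = X.T @ (y * sigmoid(-margins)) / n
        w += lr * grad                            # ascent step on a concave objective
    return w

def predict(w, X):
    """The sign of the linear score gives the predicted label."""
    return np.where(X @ w >= 0, 1, -1)
```

Because the log-likelihood is concave in w, any such first-order method reaches the global optimum given a suitably chosen (or decaying) step size.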
Example: Heart Disease • Input feature x: age group id • 1: 25-29, 2: 30-34, 3: 35-39, 4: 40-44, 5: 45-49, 6: 50-54, 7: 55-59, 8: 60-64 • Output y: whether the patient has heart disease • y = +1: having heart disease • y = -1: no heart disease
Example: Text Categorization • Learn to classify text into two categories • Input d: a document, represented by a word histogram • Output y: +1 for a political document, -1 for a non-political document
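As an illustration of the word-histogram (bag-of-words) representation; the vocabulary, tokenizer, and the helper name word_histogram are assumptions for this sketch, not from the slides:

```python
from collections import Counter

def word_histogram(document, vocabulary):
    """Illustrative sketch: represent a document as word counts over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Tiny made-up vocabulary and document
vocab = ["election", "senate", "policy", "game", "score"]
x = word_histogram("The senate passed the policy before the election", vocab)
# x == [1, 1, 1, 0, 0]; the label y for this document would be +1 (political)
```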
Example: Text Categorization Training data
Example 2: Text Classification • Dataset: Reuters-21578 • Classification accuracy • Naïve Bayes: 77% • Logistic regression: 88%
Logistic Regression vs. Naïve Bayes • Both give linear decision boundaries • Naïve Bayes: weights fixed by the estimated class-conditional probabilities • Logistic regression: weights learned by MLE • Both can be viewed as modeling p(d|y) • Naïve Bayes: independence assumption • Logistic regression: assumes an exponential-family distribution for p(d|y) (a broad family, so a much weaker assumption)
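A standard derivation (not reproduced on the slide) of why Naïve Bayes also gives a linear boundary for word-count features \(c_j(d)\):

\[
\log\frac{p(y=+1\mid d)}{p(y=-1\mid d)}
\;=\; \log\frac{p(+1)}{p(-1)} \;+\; \sum_{j} c_j(d)\,\log\frac{p(w_j \mid +1)}{p(w_j \mid -1)},
\]

which is linear in the word counts. Naïve Bayes fixes these weights from class-conditional estimates; logistic regression learns them directly by MLE.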
Discriminative vs. Generative
• Generative models: model P(x|y)
  • Pros: usually fast convergence, cheap computation, robust to noisy data
  • Cons: usually worse performance
• Discriminative models: model P(y|x)
  • Pros: usually good performance
  • Cons: slow convergence, expensive computation, sensitive to noisy data
Overfitting Problem • Consider text categorization • What is the weight for a word j that appears in only one training document dk? (MLE will grow that weight without bound, since increasing it improves the likelihood of dk without affecting any other training document.)
Overfitting Problem • [Plot: classification accuracy of test data vs. iteration, with and without regularization] • Decrease in the classification accuracy of test data
Solution: Regularization • Regularized log-likelihood • The effects of the regularizer: • Favors small weights • Guarantees a bounded norm of w • Guarantees a unique solution
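One common form of the regularized log-likelihood the slide refers to, with an ℓ2 penalty (the constant s > 0 is a free regularization parameter):

\[
l_{\text{reg}}(\mathbf{w}) \;=\; \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}) \;-\; \frac{s}{2}\,\|\mathbf{w}\|_2^2 .
\]

The penalty keeps \(\|\mathbf{w}\|\) bounded, and adding a strictly concave term to the concave log-likelihood makes the maximizer unique.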
Regularized Logistic Regression • [Plot: classification accuracy of test data vs. iteration, with and without regularization] • Classification performance with regularization
Regularization as Robust Optimization • Assume each data point is unknown but bounded within a sphere of fixed radius centered at xi
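A hedged sketch of the robust formulation this slide appears to describe: each input is only known up to a perturbation of norm at most r, and we optimize the worst case,

\[
\min_{\mathbf{w}} \; \sum_{i=1}^{n} \max_{\|\boldsymbol{\delta}_i\|\le r} \log\!\Big(1 + \exp\!\big(-y_i\,\mathbf{w}^\top(\mathbf{x}_i + \boldsymbol{\delta}_i)\big)\Big).
\]

The worst-case perturbation shrinks each margin by exactly \(r\|\mathbf{w}\|\), so the robust objective equals the original loss evaluated at margins \(y_i\,\mathbf{w}^\top\mathbf{x}_i - r\|\mathbf{w}\|\); this is how robustness to bounded input noise connects to penalizing the norm of \(\mathbf{w}\).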
Sparse Solution by Lasso Regularization • RCV1 collection: • 800K documents • 47K unique words
Sparse Solution by Lasso Regularization • How to solve the optimization problem? • Subgradient descent • Minimax
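A minimal sketch of subgradient descent for the ℓ1-regularized (lasso) objective; the step-size schedule and the penalty weight lam are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lasso_logistic(X, y, lam=0.1, n_iters=2000):
    """Illustrative sketch: minimize (1/n) sum_i log(1 + exp(-y_i w.x_i)) + lam * ||w||_1
    by subgradient descent. X: (n, d); y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        margins = y * (X @ w)
        grad_loss = -X.T @ (y * sigmoid(-margins)) / n
        # Subgradient of lam*||w||_1: lam*sign(w_j) where w_j != 0, any value in [-lam, lam] at 0
        subgrad = grad_loss + lam * np.sign(w)
        w -= (1.0 / np.sqrt(t)) * subgrad        # diminishing step size
    return w
```

In practice, proximal (soft-thresholding) updates are usually preferred over plain subgradient steps, since the latter rarely set weights exactly to zero and therefore give only approximately sparse solutions.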
Bayesian Treatment • Compute the posterior distribution of w • Laplace approximation
Bayesian Treatment • Laplace approximation
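A sketch of the Laplace approximation in this setting: the posterior over w is approximated by a Gaussian centered at the MAP estimate, with covariance given by the inverse Hessian of the negative log-posterior,

\[
p(\mathbf{w} \mid D) \;\approx\; \mathcal{N}\!\big(\mathbf{w}_{\text{MAP}},\; \mathbf{H}^{-1}\big),
\qquad
\mathbf{H} \;=\; -\nabla^2 \log p(\mathbf{w} \mid D)\Big|_{\mathbf{w}=\mathbf{w}_{\text{MAP}}}.
\]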
Multi-class Logistic Regression • How do we extend the logistic regression model to multi-class classification?
Conditional Exponential Model • Let the classes be {1, 2, …, K} • Need to learn a weight vector for each class • Normalization factor (partition function)
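The conditional exponential (softmax) form presumably intended here, for classes y ∈ {1, …, K} with one weight vector \(\mathbf{w}_y\) per class:

\[
p(y \mid \mathbf{x}) \;=\; \frac{\exp\!\big(\mathbf{w}_y^\top \mathbf{x}\big)}{Z(\mathbf{x})},
\qquad
Z(\mathbf{x}) \;=\; \sum_{y'=1}^{K} \exp\!\big(\mathbf{w}_{y'}^\top \mathbf{x}\big),
\]

where \(Z(\mathbf{x})\) is the normalization factor (partition function) mentioned on the slide.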
Conditional Exponential Model • Learn the weights w_y by maximum likelihood estimation • Any problem?
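The corresponding MLE objective, sketched under the softmax form above:

\[
\max_{\mathbf{w}_1,\dots,\mathbf{w}_K} \; \sum_{i=1}^{n}\Big[\mathbf{w}_{y_i}^\top \mathbf{x}_i \;-\; \log Z(\mathbf{x}_i)\Big].
\]

The "Any problem?" prompt plausibly points to the same overfitting issue as in the binary case, and to the fact that the weights are only determined up to a common shift (adding the same vector to every \(\mathbf{w}_y\) leaves \(p(y\mid\mathbf{x})\) unchanged), both of which are addressed by regularization.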