Lecture 4: Logistic Regression


kendis

Presentation Transcript


  1. Lecture 4: Logistic Regression Machine Learning CUNY Graduate Center

  2. Today • Linear Regression • Bayesians v. Frequentists • Bayesian Linear Regression • Logistic Regression • Linear Model for Classification

  3. Regularization: Penalize large weights Regularized Regression (L2-Regularization or Ridge Regression) Introduce a penalty term in the loss function.
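
As a sketch of the penalty term this slide describes, the L2-regularized sum-of-squares loss is commonly written as follows. The notation is assumed rather than taken from the slide: targets t_n, inputs x_n, weight vector w, and regularization strength λ.

```latex
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^{\top}\mathbf{x}_n\right)^2
              + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2
```

Larger values of λ pull the weights more strongly toward zero.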

  4. More regularization • The penalty term defines the style of regularization • L2-Regularization • L1-Regularization • L0-Regularization • Minimizing the L0-norm selects the optimal subset of features
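
The three penalty styles can be compared on a single weight vector. This is a minimal sketch; the function names and the example weights are hypothetical and not from the lecture.

```python
def l2_penalty(w):
    """Sum of squared weights (ridge-style penalty)."""
    return sum(v * v for v in w)


def l1_penalty(w):
    """Sum of absolute weights (lasso-style penalty)."""
    return sum(abs(v) for v in w)


def l0_penalty(w):
    """Number of non-zero weights, i.e. how many features are used."""
    return sum(1 for v in w if v != 0)


w = [0.5, 0.0, -2.0]  # hypothetical weight vector
penalties = (l2_penalty(w), l1_penalty(w), l0_penalty(w))
```

The L0 penalty only counts which features are active, which is why it corresponds to explicit feature selection.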

  5. Curse of dimensionality • Increasing dimensionality of features increases the data requirements exponentially. • For example, if a single feature can be accurately approximated with 100 data points, to optimize the joint over two features requires 100*100 data points. • Models should be small relative to the amount of available data • Dimensionality reduction techniques – feature selection – can help. • L0-regularization is explicit feature selection • L1- and L2-regularizations approximate feature selection.
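
The exponential growth in the slide's example can be made concrete with a short calculation. This is a sketch under the slide's simplifying assumption that each feature needs the same number of points; the function name is hypothetical.

```python
def points_needed(points_per_feature, num_features):
    """Data points needed to cover the joint space at the same resolution."""
    return points_per_feature ** num_features


# 100 points per feature, as in the slide's example.
requirements = {d: points_needed(100, d) for d in (1, 2, 3)}
```

Under this assumption, each added feature multiplies the data requirement by another factor of 100.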

  6. Bayesians v. Frequentists • What is a probability? • Frequentists • A probability is the likelihood that an event will happen • It is approximated by the ratio of the number of observed events to the number of total events • Assessment is vital to selecting a model • Point estimates are absolutely fine • Bayesians • A probability is a degree of believability of a proposition. • Bayesians require that probabilities be prior beliefs conditioned on data. • The Bayesian approach “is optimal”, given a good model, a good prior and a good loss function. Don’t worry so much about assessment. • If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior

  7. Bayesian Linear Regression • The previous MLE derivation of linear regression uses point estimates for the weight vector, w. • Bayesians say, “hold it right there”. • Use a prior distribution over w to estimate parameters • Alpha is a hyperparameter over w, where alpha is the precision or inverse variance of the distribution. • Now optimize:

  8. Optimize the Bayesian posterior As usual it’s easier to optimize after a log transform.

  9. Optimize the Bayesian posterior As usual it’s easier to optimize after a log transform.

  10. Optimize the Bayesian posterior Ignoring terms that do not depend on w IDENTICAL formulation as L2-regularization
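
The equivalence claimed on this slide can be sketched as follows. It assumes a Gaussian likelihood with noise precision β (a symbol not on the slides) and a zero-mean Gaussian prior over w with precision α.

```latex
\ln p(\mathbf{w} \mid \mathbf{t}, \mathbf{X})
  = -\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^{\top}\mathbf{x}_n\right)^2
    - \frac{\alpha}{2}\,\mathbf{w}^{\top}\mathbf{w} + \text{const}
```

Maximizing this expression over w is the same as minimizing the sum-of-squares error plus a squared-norm penalty, which is ridge regression with regularization strength λ = α/β.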

  11. Context • Overfitting is bad. • Bayesians vs. Frequentists • Is one better? • Machine Learning uses techniques from both camps.

  12. Logistic Regression • Linear model applied to classification • Supervised: target information is available • Each data point xi has a corresponding target ti. • Goal: Identify a function

  13. Target Variables • In binary classification, it is convenient to represent ti as a scalar with a range of [0,1] • Interpretation of ti as the likelihood that xi is a member of the positive class • Used to represent the confidence of a prediction. • For K > 2 classes, ti is often represented as a K element vector. • tij represents the degree of membership in class j. • |ti| = 1 • E.g. 5-way classification vector:
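
A minimal sketch of a K-element target vector, assuming a one-hot encoding (full membership in exactly one class). The helper name and the chosen class index are hypothetical.

```python
def one_hot(class_index, num_classes):
    """Return a target vector with 1.0 at class_index and 0.0 elsewhere."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index must be in [0, num_classes)")
    return [1.0 if j == class_index else 0.0 for j in range(num_classes)]


# Hypothetical data point belonging to the third of five classes.
t = one_hot(2, 5)
```

The entries sum to 1, matching the constraint on the slide. Soft targets that spread membership over several classes would satisfy the same constraint.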

  14. Graphical Example of Classification

  15. Decision Boundaries

  16. Graphical Example of Classification

  17. Classification approaches • Generative • Models the joint distribution between c and x • Highest data requirements • Discriminative • Fewer parameters to approximate • Discriminant Function • May still be trained probabilistically, but not necessarily modeling a likelihood.

  18. Treating Classification as a Linear model

  19. Relationship between Regression and Classification [Figure: threshold f(x) >= 0.5? separating Happy/Good/ClassA from Sad/Not Good/ClassB] • Since we’re classifying two classes, why not set one class to ‘0’ and the other to ‘1’, then use linear regression? • Regression: -infinity to infinity, while class labels are 0, 1 • Can use a threshold, e.g. • y >= 0.5 then class 1 • y < 0.5 then class 2
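
The thresholding idea can be sketched with a one-dimensional least-squares fit to 0/1 labels. The toy data and function names are hypothetical; the label 0 here plays the role of the slide's "class 2".

```python
def fit_line(xs, ts):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    slope = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_t - slope * mean_x
    return slope, intercept


def classify(y, threshold=0.5):
    """Map a real-valued regression output to a 0/1 class label."""
    return 1 if y >= threshold else 0


xs = [0.0, 1.0, 2.0, 3.0]  # hypothetical inputs
ts = [0, 0, 1, 1]          # hypothetical 0/1 class labels
slope, intercept = fit_line(xs, ts)
outputs = [slope * x + intercept for x in xs]
labels = [classify(y) for y in outputs]
```

On this toy data the thresholded labels match the targets, but the raw outputs fall below 0 and above 1, so they cannot be read as probabilities.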

  20. Odds-ratio • Rather than thresholding, we’ll relate the regression to the class-conditional probability. • Ratio of the probability of predicting y = 1 to the probability of predicting y = 0 • If p(y=1|x) = 0.8 and p(y=0|x) = 0.2 • Odds ratio = 0.8/0.2 = 4 • Use a linear model to predict odds rather than a class label.
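
The slide's arithmetic can be sketched as a small helper. The function name is hypothetical.

```python
def odds(p):
    """Odds of the positive class given its probability p, with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return p / (1.0 - p)


slide_example = odds(0.8)  # approximately 4, up to floating-point rounding
```

The odds range from 0 to infinity, which is still not the full real line that a linear model produces.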

  21. Logit – Log odds ratio function • LHS: 0 to infinity • RHS: -infinity to infinity • Use a log function. • Has the added bonus of dissolving the division, leading to easy manipulation

  22. Logistic Regression A linear model used to predict log-odds ratio of two classes

  23. Logit to probability
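
The step from log-odds to probability can be sketched by solving the logit equation for the probability. The notation (weight vector w, input x) is assumed here.

```latex
\ln\frac{p(y=1 \mid \mathbf{x})}{1 - p(y=1 \mid \mathbf{x})} = \mathbf{w}^{\top}\mathbf{x}
\quad\Longrightarrow\quad
p(y=1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-\mathbf{w}^{\top}\mathbf{x}\right)}
```

Exponentiating both sides gives the odds, and rearranging for the probability yields the logistic (sigmoid) form.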

  24. Sigmoid function Squashing function that maps the reals to a bounded interval, here (0, 1).
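
A minimal sketch of the sigmoid and its inverse, the logit. Splitting on the sign of the input is an implementation choice made here to avoid overflow in `math.exp`; it is not something the slides specify.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1 (inverse of sigmoid)."""
    return math.log(p / (1.0 - p))


midpoint = sigmoid(0.0)            # the decision threshold of 0.5
round_trip = logit(sigmoid(1.5))   # recovers 1.5 up to rounding
```

An input of 0 maps to a probability of 0.5, so the decision boundary is where the linear model's output changes sign.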

  25. Gaussian Class-conditional • Assume the data is generated from a Gaussian distribution for each class. • Leads to a Bayesian formulation of logistic regression.

  26. Bayesian Logistic Regression

  27. Maximum Likelihood Estimation: Logistic Regression Class-conditional Gaussian. Multinomial Class distribution. As ever, take the derivative of this likelihood function w.r.t.

  28. Maximum Likelihood Estimation of the prior

  29. Maximum Likelihood Estimation of the prior

  30. Maximum Likelihood Estimation of the prior

  31. Discriminative Training • Take the derivatives w.r.t. • Be prepared for this for homework. • In the generative formulation, we need to estimate the joint of t and x. • But we get an intuitive regularization technique. • Discriminative Training • Model p(t|x) directly.

  32. What’s the problem with generative training? Formulated this way, in D dimensions, this function has D parameters. In the generative case, there are 2D mean parameters and D(D+1)/2 covariance values: quadratic growth in the number of parameters. We’d rather have linear growth.
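
The parameter counts on this slide can be tabulated directly. This is a sketch; the function names are hypothetical and the formulas are the ones stated on the slide.

```python
def discriminative_params(d):
    """Parameters of the linear discriminative model in d dimensions."""
    return d


def generative_params(d):
    """Two class means (2*d) plus the covariance values d*(d+1)/2."""
    return 2 * d + d * (d + 1) // 2


counts = {d: (discriminative_params(d), generative_params(d))
          for d in (10, 100)}
```

Going from 10 to 100 dimensions multiplies the discriminative count by 10, while the generative count grows roughly with the square of the dimension.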

  33. Discriminative Training

  34. Optimization Take the gradient with respect to w

  35. Optimization

  36. Optimization

  37. Optimization

  38. Optimization: putting it together
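
For reference, a sketch of the standard result for the cross-entropy error of logistic regression. The notation is assumed rather than taken from the slides: y_n is the model output for input x_n and t_n is the 0/1 target.

```latex
E(\mathbf{w}) = -\sum_{n=1}^{N}\left[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\right],
\qquad y_n = \sigma\left(\mathbf{w}^{\top}\mathbf{x}_n\right)

\nabla_{\mathbf{w}} E(\mathbf{w}) = \sum_{n=1}^{N}\left(y_n - t_n\right)\mathbf{x}_n
```

The gradient follows from the chain rule together with the identity σ'(a) = σ(a)(1 − σ(a)).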

  39. Optimization We know the gradient of the error function, but how do we find the optimal value? Setting the gradient to zero is nontrivial, so use numerical approximation.

  40. Gradient Descent Take a guess. Move in the direction of the negative gradient. Jump again. For a convex function this converges, given a suitable step size. Other methods include Newton-Raphson.
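
A minimal sketch of batch gradient descent for logistic regression, assuming the cross-entropy error whose gradient is the sum of (y_n − t_n) x_n. The learning rate, step count, toy data, and constant bias feature are all hypothetical choices, not values from the lecture.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Model output: sigmoid of the linear combination of weights and features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def fit_logistic(xs, ts, learning_rate=0.1, steps=2000):
    """Batch gradient descent on the cross-entropy error, starting from w = 0."""
    dim = len(xs[0])
    w = [0.0] * dim  # the initial guess
    for _ in range(steps):
        grad = [0.0] * dim
        for x, t in zip(xs, ts):
            y = predict_proba(w, x)
            for j in range(dim):
                grad[j] += (y - t) * x[j]
        # Move against the gradient.
        w = [wi - learning_rate * g for wi, g in zip(w, grad)]
    return w


# Hypothetical toy data; the leading 1.0 in each input is a bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ts = [0, 0, 1, 1]
w = fit_logistic(xs, ts)
predicted = [1 if predict_proba(w, x) >= 0.5 else 0 for x in xs]
```

Because this toy data is linearly separable, the weights keep growing as training continues, which is one practical reason to add a regularization penalty.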

  41. Multi-class discriminant functions • Can extend to multiple classes • Other approaches include constructing K-1 binary classifiers. • Each classifier compares cn to not cn • Computationally simpler, but not without problems

  42. Exponential Model • Logistic Regression is a type of exponential model. • Linear combination of weights and features to produce a probabilistic model.
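
One common way to turn per-class linear scores into probabilities is the softmax function. The sketch below assumes that form; the slide does not spell it out, and the scores used are hypothetical.

```python
import math


def softmax(scores):
    """Exponentiate and normalize scores into a probability distribution."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-class scores, each imagined as a dot product of weights and features.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
```

The output is non-negative and sums to 1, and the class with the highest score receives the highest probability.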

  43. Problems with Binary Discriminant functions

  44. K-class discriminant

  45. Entropy A measure of uncertainty, or a measure of “information”. High uncertainty equals high entropy. Rare events are more “informative” than common events.

  46. Entropy • How much information is received when observing ‘x’? • If independent, p(x,y) = p(x)p(y). • H(x,y) = H(x) + H(y) • The information contained in two independent events is the sum of the information in each.

  47. Entropy • Binary coding of p(x): -log p(x) • “How many bits does it take to represent a value p(x)?” • How many “decimal” places? How many binary decimal places? • Expected value of observed information

  48. Examples of Entropy Uniform distributions have higher entropy.
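
Entropy as the expected value of -log p(x), measured in bits, can be sketched as follows. The example distributions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # hypothetical four-outcome distributions
skewed = [0.7, 0.1, 0.1, 0.1]
h_uniform = entropy(uniform)
h_skewed = entropy(skewed)

# Joint distribution of two independent variables: p(x, y) = p(x) p(y).
px = [0.5, 0.5]
py = [0.7, 0.3]
joint = [a * b for a in px for b in py]
h_joint = entropy(joint)
```

The uniform distribution over four outcomes needs 2 bits, more than the skewed one, and the entropy of the independent joint equals the sum of the two individual entropies up to rounding.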

  49. Maximum Entropy • Logistic Regression is also known as Maximum Entropy. • Entropy is concave, so maximizing it is a convex optimization problem. • Convergence Expectation. • Constrain this optimization to enforce good classification. • Increase maximum likelihood of the data while making the distribution of weights most even. • Include as many useful features as possible.

  50. Maximum Entropy with Constraints From Klein and Manning Tutorial
