Lecture 4: Logistic Regression Machine Learning CUNY Graduate Center
Today • Linear Regression • Bayesians v. Frequentists • Bayesian Linear Regression • Logistic Regression • Linear Model for Classification
Regularization: Penalize large weights • Regularized Regression (L2-Regularization or Ridge Regression): introduce a penalty term in the loss function.
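A minimal sketch of the closed-form ridge solution in Python/NumPy; the function and variable names (ridge_regression, X, t, lam) are illustrative rather than from the lecture.

```python
import numpy as np

def ridge_regression(X, t, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T t."""
    d = X.shape[1]
    # The lam * I term is the L2 penalty; lam = 0 recovers ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
```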
More regularization • The penalty term defines the styles of regularization • L2-Regularization • L1-Regularization • L0-Regularization • L0-norm is the optimal subset of features
Curse of dimensionality • Increasing dimensionality of features increases the data requirements exponentially. • For example, if a single feature can be accurately approximated with 100 data points, to optimize the joint over two features requires 100*100 data points. • Models should be small relative to the amount of available data • Dimensionality reduction techniques – feature selection – can help. • L0-regularization is explicit feature selection • L1- and L2-regularizations approximate feature selection.
Bayesians v. Frequentists • What is a probability? • Frequentists • A probability is the likelihood that an event will happen • It is approximated by the ratio of the number of observed events to the number of total events • Assessment is vital to selecting a model • Point estimates are absolutely fine • Bayesians • A probability is a degree of believability of a proposition. • Bayesians require that probabilities be prior beliefs conditioned on data. • The Bayesian approach “is optimal”, given a good model, a good prior and a good loss function. Don’t worry so much about assessment. • If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
Bayesian Linear Regression • The previous MLE derivation of linear regression uses point estimates for the weight vector, w. • Bayesians say, “hold it right there”. • Use a prior distribution over w to estimate parameters • Alpha is a hyperparameter over w, where alpha is the precision or inverse variance of the distribution. • Now optimize:
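The slide's equations are not reproduced in this transcript; a standard formulation (assumed here, with β denoting the noise precision from the earlier MLE derivation) places a zero-mean Gaussian prior with precision α on w and optimizes the posterior, which is proportional to the likelihood times the prior:

$$p(\mathbf{w}\mid\alpha)=\mathcal{N}(\mathbf{w}\mid\mathbf{0},\,\alpha^{-1}\mathbf{I}),\qquad p(\mathbf{w}\mid\mathbf{t},\mathbf{X},\alpha,\beta)\propto p(\mathbf{t}\mid\mathbf{X},\mathbf{w},\beta)\,p(\mathbf{w}\mid\alpha)$$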
Optimize the Bayesian posterior As usual it’s easier to optimize after a log transform.
Optimize the Bayesian posterior Ignoring terms that do not depend on w, this is an IDENTICAL formulation to L2-regularization.
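A sketch of the derivation this slide refers to (standard, not copied from the slide): taking the log of the posterior and dropping terms independent of w gives

$$\ln p(\mathbf{w}\mid\mathbf{t})=-\frac{\beta}{2}\sum_{n=1}^{N}\bigl(t_n-\mathbf{w}^{\top}\phi(x_n)\bigr)^{2}-\frac{\alpha}{2}\mathbf{w}^{\top}\mathbf{w}+\text{const},$$

which is the L2-regularized squared-error objective with regularization weight λ = α/β.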
Context • Overfitting is bad. • Bayesians vs. Frequentists • Is one better? • Machine Learning uses techniques from both camps.
Logistic Regression • Linear model applied to classification • Supervised: target information is available • Each data point xi has a corresponding target ti. • Goal: Identify a function y(x) that predicts the target ti from xi
Target Variables • In binary classification, it is convenient to represent ti as a scalar with a range of [0,1] • Interpretation of ti as the likelihood that xi is a member of the positive class • Used to represent the confidence of a prediction. • For K > 2 classes, ti is often represented as a K-element vector. • tij represents the degree of membership in class j. • The entries of ti sum to 1 (|ti| = 1). • E.g. a 5-way classification vector such as ti = (0, 0, 1, 0, 0) for a point belonging entirely to class 3.
Classification approaches • Generative • Models the joint distribution between c and x • Highest data requirements • Discriminative • Fewer parameters to approximate • Discriminant Function • May still be trained probabilistically, but not necessarily modeling a likelihood.
Relationship between Regression and Classification • Since we’re classifying two classes, why not set one class to ‘0’ and the other to ‘1’, then use linear regression? • Regression outputs range from -infinity to infinity, while class labels are 0, 1 • Can use a threshold, e.g. • y >= 0.5 then class 1 (Happy/Good/Class A) • y < 0.5 then class 2 (Sad/Not Good/Class B)
Odds-ratio • Rather than thresholding, we’ll relate the regression to the class-conditional probability. • Ratio of the odds of predicting y = 1 versus y = 0 • If p(y=1|x) = 0.8 and p(y=0|x) = 0.2 • Odds ratio = 0.8/0.2 = 4 • Use a linear model to predict odds rather than a class label.
Logit – Log odds ratio function • LHS: 0 to infinity • RHS: -infinity to infinity • Use a log function. • Has the added bonus of dissolving the division, leading to easy manipulation
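For reference, the standard definition of the logit (assumed here; the slide's own equation is not in the transcript), with the linear model on the right-hand side:

$$\operatorname{logit}(p)=\ln\frac{p(y=1\mid x)}{p(y=0\mid x)}=\ln\frac{p}{1-p}=\mathbf{w}^{\top}x$$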
Logistic Regression A linear model used to predict log-odds ratio of two classes
Sigmoid function Squashing function to map the reals to a finite domain.
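A minimal Python/NumPy sketch of the sigmoid and the resulting two-class model; the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Squash a real-valued activation into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_prob(w, x):
    """p(t = 1 | x): the linear activation w.x passed through the squashing function."""
    return sigmoid(w @ x)
```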
Gaussian Class-conditional • Assume the data is generated from a Gaussian distribution for each class. • Leads to a Bayesian formulation of logistic regression.
Maximum Likelihood Estimation: Logistic Regression Class-conditional Gaussian. Multinomial class distribution. As ever, take the derivative of this likelihood function w.r.t. the parameters.
Discriminative Training • In the generative formulation, we need to estimate the joint of t and x. • But we get an intuitive regularization technique. • Discriminative Training: model p(t|x) directly. • Take the derivatives w.r.t. the parameters. • Be prepared for this for homework.
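For the discriminative model p(t|x), the two-class likelihood and its negative log (the cross-entropy error) take the standard form below, stated here as an assumed reference since the slide's own equations are not in the transcript:

$$p(\mathbf{t}\mid\mathbf{w})=\prod_{n=1}^{N} y_n^{t_n}(1-y_n)^{1-t_n},\qquad E(\mathbf{w})=-\sum_{n=1}^{N}\bigl[t_n\ln y_n+(1-t_n)\ln(1-y_n)\bigr],\quad y_n=\sigma(\mathbf{w}^{\top}x_n)$$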
What’s the problem with generative training? Formulated this way, in D dimensions, the discriminative function has D parameters. In the generative case, there are 2D means and D(D+1)/2 covariance values: quadratic growth in the number of parameters. We’d rather have linear growth. For example, with D = 100 features the discriminative model has 100 weights, while the generative model needs 200 means plus 5,050 covariance entries.
Optimization Take the gradient in terms of w
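The gradient referred to here is the standard result for the cross-entropy error (assumed, not copied from the slide):

$$\nabla_{\mathbf{w}} E(\mathbf{w})=\sum_{n=1}^{N}(y_n-t_n)\,x_n,\qquad y_n=\sigma(\mathbf{w}^{\top}x_n)$$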
Optimization We know the gradient of the error function, but how do we find the optimal value? Setting the gradient to zero is nontrivial, so we use numerical approximation.
Gradient Descent Take a guess. Move in the direction of the negative gradient. Jump again. In a convex function this will converge. Other methods include Newton-Raphson.
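A minimal batch gradient-descent sketch for logistic regression in Python/NumPy; the learning rate, step count, and zero initialization are illustrative choices, not from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, n_steps=1000):
    """Batch gradient descent on the cross-entropy error.

    X: (N, D) design matrix, t: (N,) binary targets in {0, 1}.
    """
    w = np.zeros(X.shape[1])          # take a guess (here: all zeros)
    for _ in range(n_steps):
        y = sigmoid(X @ w)            # current predicted probabilities
        grad = X.T @ (y - t)          # gradient of the cross-entropy error
        w -= lr * grad                # move in the direction of the negative gradient
    return w
```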
Multi-class discriminant functions • Can extend to multiple classes • Other approaches include constructing K-1 binary classifiers. • Each classifier compares cn to not-cn • Computationally simpler, but not without problems
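One common multi-class extension, named here as an illustration rather than taken from the slide, is the softmax (normalized exponential) over K linear scores:

$$p(C_k\mid x)=\frac{\exp(\mathbf{w}_k^{\top}x)}{\sum_{j=1}^{K}\exp(\mathbf{w}_j^{\top}x)}$$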
Exponential Model • Logistic Regression is a type of exponential model. • Linear combination of weights and features to produce a probabilistic model.
Entropy Measure of uncertainty, or measure of “information”. High uncertainty equals high entropy. Rare events are more “informative” than common events.
Entropy • How much information is received when observing ‘x’? • If independent, p(x,y) = p(x)p(y). • H(x,y) = H(x) + H(y) • The information contained in two unrelated events is equal to their sum.
Entropy • Binary coding of p(x): -log p(x) • “How many bits does it take to represent a value p(x)?” • How many “decimal” places? How many binary decimal places? • Expected value of observed information
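Written out, the expected information content is the Shannon entropy (standard definition, stated here for reference):

$$H(x)=-\sum_{x}p(x)\log_2 p(x)$$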
Examples of Entropy Uniform distributions have higher entropy.
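As an illustrative calculation (not from the slides): a uniform distribution over four outcomes has entropy 4 · (1/4) · log2(4) = 2 bits, while the skewed distribution (1/2, 1/4, 1/8, 1/8) has entropy 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits, so the flatter distribution carries more uncertainty.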
Maximum Entropy • Logistic Regression is also known as Maximum Entropy. • Entropy is convex, so the optimization is expected to converge. • Constrain this optimization to enforce good classification. • Increase the likelihood of the data while keeping the distribution as even as possible. • Include as many useful features as possible.
Maximum Entropy with Constraints (from the Klein and Manning tutorial)