Logistic Regression
Linear regression • Function f : X → Y is a linear combination of the input components
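The model's equation was an image in the original; the standard linear regression form it refers to is

$$ f(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{d} w_i x_i = \mathbf{w}^\top \mathbf{x} $$

where the input is augmented with a constant component $x_0 = 1$ so the bias $w_0$ folds into the weight vector.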
Binary classification • Two classes • Our goal is to learn to classify two types of examples correctly • Class 0 – labeled as 0, Class 1 – labeled as 1 • We would like to learn f : X → {0, 1} • Zero-one error (loss) function • Error we would like to minimize: the expected zero-one loss • First step: we need to devise a model of the function
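The loss and error formulas were images; the standard zero-one loss and the error to minimize are

$$ \ell(y, \hat{y}) = \begin{cases} 0 & \hat{y} = y \\ 1 & \hat{y} \neq y \end{cases} \qquad \text{Error}(\mathbf{w}) = \frac{1}{n} \sum_{l=1}^{n} \mathbf{1}\!\left[ f(\mathbf{x}^l, \mathbf{w}) \neq y^l \right] $$

i.e., the fraction of training examples the classifier gets wrong.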
Discriminant functions • One convenient way to represent classifiers is through discriminant functions • Works for binary and multi-way classification • Idea: for every class i = 0, 1, …, k define a function g_i(x) mapping X → R • When a decision on input x must be made, choose the class with the highest value of g_i(x) • So what happens with the input space? Assume a binary case.
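In symbols, the decision rule is

$$ \hat{y} = \arg\max_{i \in \{0, 1, \ldots, k\}} g_i(\mathbf{x}) $$

so the discriminant functions carve the input space into regions, one per class.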
Discriminant functions • The discriminant functions define a decision boundary between the classes.
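The defining equation was an image; in the binary case the decision boundary is standardly the set of points where the two discriminants tie:

$$ \{ \mathbf{x} : g_0(\mathbf{x}) = g_1(\mathbf{x}) \} $$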
Logistic regression model • Defines a linear decision boundary • Discriminant functions:
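The functions themselves were images in the original; the standard logistic regression discriminants are

$$ g_1(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}} \qquad g_0(\mathbf{x}) = 1 - g_1(\mathbf{x}) $$

where $\sigma$ is the logistic (sigmoid) function and $g_1(\mathbf{x})$ models $P(Y = 1 \mid \mathbf{x})$.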
Linear decision boundary • The logistic regression model defines a linear decision boundary • Why? • Answer: compare the two discriminant functions.
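Writing the comparison out (a standard derivation, consistent with the discriminants above):

$$ g_1(\mathbf{x}) \geq g_0(\mathbf{x}) \iff \sigma(\mathbf{w}^\top \mathbf{x}) \geq \tfrac{1}{2} \iff \mathbf{w}^\top \mathbf{x} \geq 0 $$

so the boundary $\mathbf{w}^\top \mathbf{x} = 0$ is a hyperplane, i.e., linear in $\mathbf{x}$.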
Form of P(Y|X) for the Gaussian Naive Bayes Classifier • Consider a GNB based on the following modeling assumptions:
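The assumption list was dropped in extraction; the standard GNB setup this section follows (Mitchell's) assumes:
• Y is boolean, governed by a Bernoulli distribution with parameter π = P(Y = 1)
• X = <X1, …, Xn>, where each Xi is a continuous random variable
• For each Xi, P(Xi | Y = yk) is a Gaussian N(μik, σi), with the standard deviation σi independent of the class k
• Given Y, the Xi are conditionally independent of one another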
In general, Bayes rule allows us to write the posterior as a ratio • Dividing both the numerator and the denominator by the numerator yields:
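The equations were images; the standard steps for the binary case are

$$ P(Y{=}1 \mid X) = \frac{P(Y{=}1)\, P(X \mid Y{=}1)}{P(Y{=}1)\, P(X \mid Y{=}1) + P(Y{=}0)\, P(X \mid Y{=}0)} = \frac{1}{1 + \exp\!\left( \ln \dfrac{P(Y{=}0)\, P(X \mid Y{=}0)}{P(Y{=}1)\, P(X \mid Y{=}1)} \right)} $$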
Because of our conditional independence assumption we can write this as:
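With π = P(Y = 1), the standard resulting form is

$$ P(Y{=}1 \mid X) = \frac{1}{1 + \exp\!\left( \ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y{=}0)}{P(X_i \mid Y{=}1)} \right)} $$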
Given our assumption that P(Xi | Y = yk) is Gaussian, we can expand this term as follows:
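Substituting the Gaussian densities, the shared variance makes the quadratic terms in $X_i$ cancel:

$$ \sum_i \ln \frac{P(X_i \mid Y{=}0)}{P(X_i \mid Y{=}1)} = \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right) $$

so the posterior collapses to exactly the logistic regression form

$$ P(Y{=}1 \mid X) = \frac{1}{1 + \exp\!\left( -\left( w_0 + \sum_{i=1}^{n} w_i X_i \right) \right)} $$

with $w_i = \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}$ and $w_0 = \ln \frac{\pi}{1 - \pi} + \sum_i \frac{\mu_{i0}^2 - \mu_{i1}^2}{2\sigma_i^2}$ (signs written to match the sigmoid convention used above).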
Estimating Parameters for Logistic Regression • One approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood. • The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their corresponding X values. We choose the parameters W that satisfy this maximization; equivalently, we can work with the log of the conditional likelihood:
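Both statements were images in the original; in standard notation, with the superscript $l$ indexing training examples,

$$ W \leftarrow \arg\max_W \prod_l P(Y^l \mid X^l, W) \qquad \text{or equivalently} \qquad W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W) $$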
This conditional data log likelihood, which we will denote l(W), can be written as
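Reconstructed from the standard derivation, writing $z^l = w_0 + \sum_i w_i X_i^l$ so that $P(Y^l{=}1 \mid X^l, W) = \sigma(z^l)$:

$$ l(W) = \sum_l Y^l \ln P(Y^l{=}1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l{=}0 \mid X^l, W) = \sum_l Y^l z^l - \ln\!\left( 1 + e^{z^l} \right) $$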
Using gradient ascent, the i-th component of the gradient vector has the form shown below, where P̂(Y^l | X^l, W) is the Logistic Regression prediction computed from the model equations given earlier
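A standard reconstruction of the gradient and the resulting ascent update, where $\eta$ is the step size and $X_0^l = 1$ so the same rule covers $w_0$:

$$ \frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l{=}1 \mid X^l, W) \right) \qquad w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l{=}1 \mid X^l, W) \right) $$

A minimal NumPy sketch of this update rule; the function names and hyperparameter values are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    # Logistic function; clip to avoid overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X: (n_examples, n_features) array of inputs
    y: (n_examples,) array of 0/1 labels
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend x_0 = 1 so w[0] is the bias w_0
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)               # P(Y=1 | x^l, W) for every example
        w += eta * Xb.T @ (y - p)         # w_i += eta * sum_l x_i^l (y^l - p^l)
    return w

# Toy usage: two well-separated 1-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 1)), rng.normal(2, 1, (50, 1))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = train_logistic_regression(X, y, eta=0.01)
preds = (sigmoid(np.hstack([np.ones((100, 1)), X]) @ w) >= 0.5).astype(int)
print("training accuracy:", (preds == y).mean())
```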
Regularization in Logistic Regression • Overfitting the training data is a problem that can arise in Logistic Regression, especially when data is very high dimensional and training data is sparse. • One approach to reducing overfitting is regularization, in which we create a modified “penalized log likelihood function,” which penalizes large values of W.
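A standard way to write the penalized objective (with regularization strength λ > 0) and the modified update it implies:

$$ W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W) - \frac{\lambda}{2} \lVert W \rVert^2 $$

$$ w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l{=}1 \mid X^l, W) \right) - \eta \lambda w_i $$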
Logistic Regression for Functions with Many Discrete Values • More generally, if Y can take on any of the discrete values {y1, …, yK}, then the form of P(Y = yk | X) is:
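The displayed equations were images; the standard multiclass (softmax-style) form is, for $k = 1, \ldots, K{-}1$,

$$ P(Y = y_k \mid X) = \frac{\exp\!\left( w_{k0} + \sum_{i=1}^{n} w_{ki} X_i \right)}{1 + \sum_{j=1}^{K-1} \exp\!\left( w_{j0} + \sum_{i=1}^{n} w_{ji} X_i \right)} $$

and for the final class

$$ P(Y = y_K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\!\left( w_{j0} + \sum_{i=1}^{n} w_{ji} X_i \right)} $$

so the K probabilities sum to 1 (class $y_K$ serves as the reference class with no parameters of its own).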
Relationship Between Naive Bayes Classifiers and Logistic Regression • Logistic Regression directly estimates the parameters of P(Y|X) • Naive Bayes directly estimates parameters for P(Y) and P(X|Y) • We often call the former a discriminative classifier, and the latter a generative classifier.
Generative vs. Discriminative Classifiers Wish to learn f : X → Y, or P(Y|X) Generative classifiers (e.g., Naïve Bayes): • Assume some functional form for P(X|Y), P(Y) • This is the ‘generative’ model • Estimate parameters of P(X|Y), P(Y) directly from training data • Use Bayes rule to calculate P(Y|X = xi) Discriminative classifiers: • Assume some functional form for P(Y|X) • This is the ‘discriminative’ model • Estimate parameters of P(Y|X) directly from training data
Naïve Bayes vs. Logistic Regression Consider Y boolean, Xi continuous, X = <X1 … Xn> Number of parameters to estimate: • NB: 4n + 1 (one prior π, plus a mean μik and a variance σik² for each of the n features under each of the two classes) • LR: n + 1 (one weight per feature plus the bias w0)
Gaussian Naïve Bayes vs. Logistic Regression • Generative and discriminative classifiers • Asymptotic comparison (# training examples → ∞) • When model assumptions are correct: GNB and LR produce identical classifiers • When model assumptions are incorrect: LR is less biased – it does not assume conditional independence – and is therefore expected to outperform GNB
Naïve Bayes vs. Logistic Regression • Generative and discriminative classifiers • Non-asymptotic analysis (see [Ng & Jordan, 2002]) • Convergence rate of parameter estimates – how many training examples are needed to assure good estimates? • GNB: order log n (where n = # of attributes in X) • LR: order n • GNB converges more quickly to its (perhaps less helpful) asymptotic estimates