Linear Model (III) Rong Jin
Announcement • Homework 2 is out and is due 02/05/2004 (next Tuesday) • Homework 1 has been handed out
Recap: Logistic Regression Model • Assume the inputs and outputs are related through a log-linear function • Estimate the weights with the MLE approach
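As a reminder of the form used on the following slides, here is a minimal sketch of the model and its log-likelihood, assuming a weight vector w, a threshold c, and labels y ∈ {+1, −1}:

```latex
p(y \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-y\,(\mathbf{w}\cdot\mathbf{x} + c)\right)},
\qquad
l(D_{\text{train}}) = \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i)
```

The MLE approach picks the w and c that maximize l(Dtrain).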
Example: Text Classification • Input x: a binary vector • Each word is a different dimension • xi = 0 if the ith word does not appear in the document; xi = 1 if it appears (a small sketch follows the example documents below) • Output y: interesting document or not • +1: interesting • -1: uninteresting
Example: Text Classification Doc 1 The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, … Doc 2 Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people…
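A small sketch of how such a binary input vector might be built (the vocabulary and helper name here are illustrative, not from the slides):

```python
# Hypothetical example: turn a document into a binary bag-of-words vector.
VOCABULARY = ["bird", "irrigation", "wildflower", "world"]  # assumed toy vocabulary


def to_binary_vector(document, vocabulary):
    """x_i = 1 if the i-th vocabulary word appears in the document, else 0."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


doc2 = "Rain Bird is one of the leading irrigation manufacturers in the world"
print(to_binary_vector(doc2, VOCABULARY))  # -> [1, 1, 0, 1]
```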
Example 2: Text Classification • Logistic regression model • Each term ti is assigned a weight wi • Learning the parameters: MLE approach • Requires numerical solutions
Example 2: Text Classification • Weight wi • wi > 0: term ti is positive evidence • wi < 0: term ti is negative evidence • wi = 0: term ti is irrelevant to whether or not the document is interesting • The larger |wi| is, the more important the term ti is in determining whether the document is interesting • Threshold c
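A brief sketch of how the weights and the threshold combine in the decision rule, assuming the model form from the recap slide:

```latex
\text{predict interesting } (y = +1)
\quad\text{iff}\quad
\sum_i w_i x_i + c > 0
\;\;\Longleftrightarrow\;\;
p(y = +1 \mid \mathbf{x}) > \tfrac{1}{2}
```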
Example 2: Text Classification • Dataset: Reuters-21578 • Classification accuracy • Naïve Bayes: 77% • Logistic regression: 88%
Why Does Logistic Regression Work Better for Text Classification? • Common words • Small weights in logistic regression • Large weights in naïve Bayes • Weight ~ p(w|+) – p(w|-) • Independence assumption • Naïve Bayes assumes that each word is generated independently • Logistic regression can take the correlations between words into account
Comparison • Discriminative Model • Models P(y|x) directly • Models the decision boundary • Usually good performance • But • Slow convergence • Expensive computation • Sensitive to noisy data • Generative Model • Models P(x|y) • Models the input patterns • Usually converges fast • Cheap computation • Robust to noisy data • But • Usually performs worse
Problems with Logistic Regression? What about words that appear in only one class?
Overfitting Problem with Logistic Regression • Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight • Consider the derivative of l(Dtrain) with respect to w • w will go to infinity!
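A sketch of why this happens, assuming the log-likelihood form from the recap slide: since t appears only in the single positive document d, the derivative of the log-likelihood with respect to its weight w is

```latex
\frac{\partial\, l(D_{\text{train}})}{\partial w}
= \sum_{i} y_i\, x_{i,t}\, \bigl(1 - p(y_i \mid \mathbf{x}_i)\bigr)
= 1 - p(+1 \mid \mathbf{x}_d) \;>\; 0
```

so increasing w always increases the likelihood, and the unregularized MLE drives w to infinity.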
Solution: Regularization • Regularized log-likelihood • Large weights → small weights • Prevents weights from being too large • Small weights → zero • Sparse weights
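A minimal sketch of what the regularized objective might look like, assuming a penalty strength s > 0 (an L2 penalty shrinks large weights; an L1 penalty drives many small weights exactly to zero, giving sparse weights):

```latex
l_{\text{reg}}(D_{\text{train}})
= \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i) - s\,\|\mathbf{w}\|_2^{2}
\qquad\text{or}\qquad
\sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i) - s\,\|\mathbf{w}\|_1
```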
Why Do We Need a Sparse Solution? • Two types of solutions • Many non-zero weights, but many of them are small • Only a small number of non-zero weights, and many of them are large • Occam’s Razor: the simpler the better • A simple model that fits the data is unlikely to be a coincidence • A complicated model that fits the data might be a coincidence • Fewer non-zero weights → less evidence to consider → simpler model → case 2 is preferred
Finding Optimal Solutions • Concave objective function • No local maxima other than the global maximum • Many standard optimization algorithms work
Gradient Ascent • The regularized objective trades off prediction errors against preventing weights from being too large • Maximize the log-likelihood by iteratively adjusting the parameters in small increments • In each iteration, we adjust w in the direction that increases the log-likelihood (along the gradient)
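A minimal sketch of regularized gradient ascent for this model, assuming binary inputs, labels in {+1, −1}, an L2 penalty of strength s, and a learning rate eta (all names are illustrative, not from the slides):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_ascent(X, y, s=0.01, eta=0.1, max_iter=1000, tol=1e-6):
    """Maximize the L2-regularized log-likelihood of logistic regression.

    X: (n, d) array of binary features; y: (n,) array of +1/-1 labels.
    """
    n, d = X.shape
    w = np.zeros(d)
    c = 0.0
    for _ in range(max_iter):
        margin = y * (X @ w + c)
        residual = 1.0 - sigmoid(margin)              # 1 - p(y_i | x_i) for each example
        grad_w = X.T @ (y * residual) - 2.0 * s * w   # penalty pulls weights toward 0
        grad_c = np.sum(y * residual)
        w += eta * grad_w / n                         # small step along the gradient
        c += eta * grad_c / n
        # stop when there is no incentive to move: the gradient is (nearly) zero
        if np.linalg.norm(grad_w) / n < tol and abs(grad_c) / n < tol:
            break
    return w, c
```

Dividing the step by n keeps the increment comparable across dataset sizes; the stopping test anticipates the convergence criterion discussed two slides below.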
Graphical Illustration No regularization case
When Should We Stop? • The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction • In many cases, deciding this can be very tricky • A small first-order derivative does not necessarily mean we are close to the maximum point
Extend Logistic Regression Model to Multiple Classes • y ∈ {1, 2, …, C} • How do we extend the above definition to the case when the number of classes is more than 2?
Conditional Exponential Model • It is simple! • Ensure the probabilities sum to 1
Conditional Exponential Model • Prediction probability • Model parameters: • For each class y, we have a weight vector wy and a threshold cy • Maximum likelihood estimation • Any problem with the above optimization problem?
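A sketch of the form this takes, assuming per-class weight vectors wy and thresholds cy as described above:

```latex
p(y \mid \mathbf{x})
= \frac{\exp(\mathbf{w}_y \cdot \mathbf{x} + c_y)}
       {\sum_{y'=1}^{C} \exp(\mathbf{w}_{y'} \cdot \mathbf{x} + c_{y'})},
\qquad
\max_{\{\mathbf{w}_y,\, c_y\}} \;\sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i)
```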
Conditional Exponential Model • Adding a constant vector to every weight vector (and a constant to every threshold) leaves the log-likelihood function unchanged • Usually set w1 to the zero vector and c1 to zero
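A one-line sketch of why this is the case: for any vector u and scalar b, the common shift cancels between the numerator and the denominator, so the predicted probabilities (and hence the log-likelihood) do not change:

```latex
\frac{\exp\bigl((\mathbf{w}_y + \mathbf{u})\cdot\mathbf{x} + c_y + b\bigr)}
     {\sum_{y'} \exp\bigl((\mathbf{w}_{y'} + \mathbf{u})\cdot\mathbf{x} + c_{y'} + b\bigr)}
= \frac{\exp(\mathbf{w}_y \cdot \mathbf{x} + c_y)}
       {\sum_{y'} \exp(\mathbf{w}_{y'} \cdot \mathbf{x} + c_{y'})}
```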
Maximum Entropy Model: Motivation • Consider a translation example • English ‘in’ → French {dans, en, à, au cours de, pendant} • Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant) • Case 1: no prior knowledge on translation • What is your guess of the probabilities? • p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5 • Case 2: 30% of the time, either dans or en is used • What is your guess of the probabilities? • p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30 • The most uniform distribution consistent with what we know is favored
Maximum Entropy Model: Motivation • Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used • What is your guess of the probabilities? • How do we measure the uniformity of a distribution?
Maximum Entropy Principle (MaxEnt) • The uniformity of a distribution is measured by its entropy • Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
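A sketch of the problem being solved for case 3: maximize the entropy of p subject to the stated constraints (the entropy definition is standard; the constraints come from cases 2 and 3), which can be solved with Lagrange multipliers or numerically:

```latex
\max_{p}\; H(p) = -\sum_{w} p(w)\log p(w)
\quad\text{s.t.}\quad
\sum_{w} p(w) = 1,\;\;
p(\text{dans}) + p(\text{en}) = 0.3,\;\;
p(\text{dans}) + p(\text{à}) = 0.5
```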
MaxEnt for Classification Problems • Require the first-order moments to be consistent between the empirical data and the model prediction • No assumption about the parametric form of the likelihood • Usually only assume it is Cⁿ continuous • What is the solution to this constrained problem?
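A sketch of the resulting constrained optimization, assuming feature functions fi(x, y), the empirical distribution p̃, and the conditional entropy as the objective (a standard MaxEnt formulation; the notation is not from the slides):

```latex
\max_{p}\; -\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\,\log p(y \mid x)
\quad\text{s.t.}\quad
\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y)\;\;\forall i,
\qquad
\sum_{y} p(y \mid x) = 1 \;\;\forall x
```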
Solution to MaxEnt • Surprisingly, the solution is just the conditional exponential model without thresholds • Why?
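A brief sketch of the standard argument (not spelled out on the slide): introduce a Lagrange multiplier λi for each moment constraint, set the derivative of the Lagrangian with respect to p(y|x) to zero, and the maximizer comes out in exponential form, which matches the conditional exponential model when the features are taken to be the input components paired with class labels:

```latex
p(y \mid x) \;=\; \frac{\exp\!\Bigl(\sum_i \lambda_i f_i(x,y)\Bigr)}
                        {\sum_{y'} \exp\!\Bigl(\sum_i \lambda_i f_i(x,y')\Bigr)}
```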