Linear Models (I) Rong Jin
Review of Information Theory • What is information? • What is entropy? • Average information • Minimum coding length • Important inequality: $-\sum_x p(x) \log_2 q(x) \ge -\sum_x p(x) \log_2 p(x)$, where $p$ is the distribution for generating symbols and $q$ is the distribution for coding symbols
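The entropy and the coding-length inequality on this slide can be checked numerically. The snippet below is a minimal sketch, not part of the original slides: the function names and the two four-symbol distributions are hypothetical, chosen only for illustration.

```python
import math

def entropy(p):
    """Average information (bits per symbol) of symbols drawn from p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length (bits per symbol) when symbols drawn from p
    are coded with a code that is optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over four symbols (illustrative values only).
p = [0.5, 0.25, 0.125, 0.125]  # distribution for generating symbols
q = [0.25, 0.25, 0.25, 0.25]   # distribution for coding symbols

h_p = entropy(p)            # minimum average coding length
h_pq = cross_entropy(p, q)  # coding length with the mismatched code
print(h_p, h_pq)
```

Coding with the mismatched distribution never costs fewer bits on average than the entropy, and the two are equal only when the coding distribution matches the generating one.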
Review of Information Theory (cont’d) • Mutual information • Measures the dependence between two random variables • Symmetric • Kullback-Leibler distance • Measures the difference between two distributions • Not symmetric
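Both quantities can be computed directly from probability tables. The following is a minimal sketch under assumed inputs: the function names, the two-outcome distributions, and the 2×2 joint table are hypothetical and serve only to illustrate the definitions.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler distance KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """Mutual information in bits; joint[i][j] = p(X=i, Y=j)."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Hypothetical distributions (illustrative values only).
p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

joint = [[0.3, 0.2], [0.1, 0.4]]
joint_swapped = [list(col) for col in zip(*joint)]  # roles of X and Y exchanged
mi_xy = mutual_information(joint)
mi_yx = mutual_information(joint_swapped)
print(kl_pq, kl_qp, mi_xy, mi_yx)
```

Exchanging the two variables leaves the mutual information unchanged, whereas exchanging the two arguments of the Kullback-Leibler distance generally changes its value.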
Outline • Classification problems • Information theory for text classification • Gaussian generative • Naïve Bayes • Logistic regression
Classification Problems • Given input $X = \{x_1, x_2, \ldots, x_m\}$ • Predict the class label $y$ • $y \in \{-1, 1\}$: binary classification problems • $y \in \{1, 2, 3, \ldots, c\}$: multi-class classification problems • Goal: learn the function $f: X \rightarrow Y$ that maps the input X to the output Y
Examples of Classification Problems • Text categorization: • Example doc: “Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …” Topic: politics • Input features: words ‘campaigning’, ‘efforts’, ‘Iowa’, ‘Democrats’, … • Class label: ‘politics’ and ‘non-politics’ • Image classification: • Example question: which is a bird image? • Input features: color histogram, texture distribution, edge distribution, … • Class label: ‘bird image’ and ‘non-bird image’
Learning Setup for Classification Problems • Training examples: $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ • Independent and identically distributed (i.i.d.) • Training examples are similar to testing examples (drawn from the same distribution) • Goal • Find a model or a function that is consistent with the training data
Information Theory for Text Classification • If the distribution for coding symbols is similar to the distribution for generating symbols → short coding length → good compression rate
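Compression Algorithm for TC • Compress the new document with the compression model of each topic • Compression model M1 (Politics): 16K bits • Compression model M2 (Sports): 10K bits • Predicted topic: Sports (the model with the shorter coding length)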
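The idea of picking the topic whose model compresses the document best can be sketched with a very simple coding distribution. The code below is an illustration under stated assumptions, not the compression models from the slides: it assumes a unigram word model with add-one smoothing per topic, and the training sentences, function names, and the new document are all hypothetical.

```python
import math
from collections import Counter

def train_unigram(docs):
    """Word counts for one topic; they define that topic's coding distribution."""
    return Counter(w for d in docs for w in d.lower().split())

def coding_length_bits(doc, counts, vocab_size):
    """Bits needed to code doc with the topic's smoothed unigram distribution."""
    total = sum(counts.values())
    bits = 0.0
    for w in doc.lower().split():
        q = (counts[w] + 1) / (total + vocab_size)  # add-one smoothing (assumption)
        bits += -math.log2(q)
    return bits

# Hypothetical toy training data (illustrative only).
training = {
    "politics": ["the senator won the election vote",
                 "campaign vote in iowa election"],
    "sports": ["the team won the game",
               "the player scored in the final game"],
}
models = {topic: train_unigram(docs) for topic, docs in training.items()}
vocab = {w for counts in models.values() for w in counts}

new_doc = "the team scored in the game"
lengths = {topic: coding_length_bits(new_doc, counts, len(vocab))
           for topic, counts in models.items()}
predicted = min(lengths, key=lengths.get)  # shortest coding length wins
print(lengths, predicted)
```

A topic whose word distribution is close to the distribution that generated the document yields the shortest code, which is why the shortest coding length is used as the prediction.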
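Probabilistic Models for Classification Problems • Apply statistical inference methods: training examples → learning a statistical model → prediction $p(y|x;\theta)$ • Key: finding the best parameters $\theta$ • Maximum likelihood (MLE) approach • Log-likelihood of data over the $n$ training examples $(x_i, y_i)$: $l(\theta) = \sum_{i=1}^{n} \log p(y_i|x_i;\theta)$ • Find the parameters $\theta$ that maximize the log-likelihood: $\theta^* = \arg\max_\theta l(\theta)$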
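Generative Models • Do not estimate $p(y|x;\theta)$ directly • Using Bayes rule: $p(y|x;\theta) = \frac{p(x|y;\theta)\, p(y;\theta)}{\sum_{y'} p(x|y';\theta)\, p(y';\theta)}$ • Estimate $p(x|y;\theta)$ instead of $p(y|x;\theta)$ • Why $p(x|y;\theta)$? • Most well-known distributions are of the form $p(x|\theta)$ • Allocate a separate set of parameters for each class • $\theta = \{\theta_1, \theta_2, \ldots, \theta_c\}$ • $p(x|y;\theta) \rightarrow p(x|\theta_y)$ • Describes the special input patterns for each class $y$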
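Gaussian Generative Model (I) • Assume a Gaussian model for each class • One-dimensional case: $p(x|y;\theta) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$ • Results for MLE: $\mu_y = \frac{1}{n_y}\sum_{i: y_i = y} x_i$, $\sigma_y^2 = \frac{1}{n_y}\sum_{i: y_i = y} (x_i - \mu_y)^2$, $p(y) = \frac{n_y}{n}$, where $n_y$ is the number of training examples in class $y$ and $n$ is the total number of training examples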
Example • Height histogram for males and females. • Using Gaussian generative model • P(male|1.8) = ? , P(female|1.4) = ?
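Questions such as P(male|1.8) can be answered by fitting one Gaussian per class with maximum likelihood and applying Bayes rule. The sketch below assumes heights in meters and uses a small made-up sample, not the histogram data from the slide; all names and values are hypothetical.

```python
import math

def fit_gaussian(xs):
    """MLE of mean and variance for a one-dimensional Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, params, priors):
    """p(y | x) via Bayes rule from class-conditional Gaussians and priors."""
    joint = {y: gaussian_pdf(x, *params[y]) * priors[y] for y in params}
    z = sum(joint.values())
    return {y: v / z for y, v in joint.items()}

# Hypothetical heights in meters (illustrative values only).
heights = {
    "male": [1.70, 1.75, 1.80, 1.85, 1.78, 1.82],
    "female": [1.55, 1.60, 1.65, 1.62, 1.58, 1.70],
}
n_total = sum(len(xs) for xs in heights.values())
params = {y: fit_gaussian(xs) for y, xs in heights.items()}
priors = {y: len(xs) / n_total for y, xs in heights.items()}

post_18 = posterior(1.8, params, priors)  # P(male | 1.8), P(female | 1.8)
post_14 = posterior(1.4, params, priors)  # P(male | 1.4), P(female | 1.4)
print(post_18, post_14)
```

With real data, only the `heights` dictionary would change; the fitting and the Bayes-rule step stay the same.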
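Gaussian Generative Model (II) • Consider multiple input features • $X = \{x_1, x_2, \ldots, x_m\}$ • Multivariate Gaussian distribution: $p(\mathbf{x}|y;\theta) = \frac{1}{(2\pi)^{m/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\mu_y)^\top \Sigma_y^{-1} (\mathbf{x}-\mu_y)\right)$ • $\Sigma_y$ is an $m \times m$ covariance matrix • Results for MLE: $\mu_y = \frac{1}{n_y}\sum_{i: y_i = y} \mathbf{x}_i$, $\Sigma_y = \frac{1}{n_y}\sum_{i: y_i = y} (\mathbf{x}_i - \mu_y)(\mathbf{x}_i - \mu_y)^\top$, where $\mathbf{x}_i$ is the feature vector of the $i$-th training example and $n_y$ is the number of training examples in class $y$ • Problem: • Singularity of $\Sigma_y$: too many parameters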
Overfitting Issue • Complex model • Insufficient training data • Consider a classification problem with multiple inputs • 100 input features • 5 classes • 1000 training examples • Total number of parameters for a full Gaussian model: • 5 means → 5 × 100 = 500 parameters • 5 covariance matrices → 5 × 100 × 100 = 50,000 parameters (counting every matrix entry) • 50,500 parameters → insufficient training data
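The parameter count above is simple arithmetic and can be reproduced for other problem sizes. The helper below is a hypothetical sketch; the `covariance` option names are assumptions introduced here, with "full" counting every matrix entry as on the slide, "full_symmetric" counting only the distinct entries of a symmetric matrix, and "diagonal" counting one variance per feature.

```python
def gaussian_param_count(num_features, num_classes, covariance="full"):
    """Number of parameters of a per-class Gaussian model."""
    means = num_classes * num_features
    if covariance == "full":
        cov = num_classes * num_features * num_features
    elif covariance == "full_symmetric":
        cov = num_classes * num_features * (num_features + 1) // 2
    elif covariance == "diagonal":
        cov = num_classes * num_features
    else:
        raise ValueError("unknown covariance type: " + covariance)
    return means + cov

full = gaussian_param_count(100, 5, "full")
symmetric = gaussian_param_count(100, 5, "full_symmetric")
diagonal = gaussian_param_count(100, 5, "diagonal")
print(full, symmetric, diagonal)
```

Even when symmetry is exploited, the full model has far more parameters than the 1000 training examples can support, while a diagonal covariance brings the count down to the same order as the data.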
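Naïve Bayes • Simplify the model complexity • Diagonalize the covariance matrix $\Sigma_y$ • Simplified Gaussian distribution: $p(\mathbf{x}|y;\theta) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi\sigma_{y,j}^2}} \exp\left(-\frac{(x_j - \mu_{y,j})^2}{2\sigma_{y,j}^2}\right)$ • Feature independence assumption: $p(\mathbf{x}|y) = \prod_{j=1}^{m} p(x_j|y)$ • Naïve Bayes assumption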
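A Gaussian model with diagonal covariance reduces to fitting one mean and one variance per feature and per class. The following is a minimal sketch of such a classifier, not the lecture's implementation: the function names, the toy data, and the small variance floor of 1e-9 (added to avoid division by zero) are assumptions.

```python
import math

def fit_naive_bayes(X, y):
    """Per class: log prior, per-feature means, per-feature variances (MLE)."""
    model = {}
    n = len(y)
    for label in set(y):
        rows = [x for x, yi in zip(X, y) if yi == label]
        m = len(rows[0])
        mus = [sum(r[j] for r in rows) / len(rows) for j in range(m)]
        # 1e-9 is an assumed variance floor, not part of the MLE itself.
        variances = [sum((r[j] - mus[j]) ** 2 for r in rows) / len(rows) + 1e-9
                     for j in range(m)]
        model[label] = (math.log(len(rows) / n), mus, variances)
    return model

def log_joint(x, log_prior, mus, variances):
    """log p(y) + sum_j log p(x_j | y) under the independence assumption."""
    s = log_prior
    for xj, mu, var in zip(x, mus, variances):
        s += -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
    return s

def predict(model, x):
    return max(model, key=lambda label: log_joint(x, *model[label]))

# Hypothetical two-feature toy data (illustrative only).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y = [-1, -1, -1, 1, 1, 1]
model = fit_naive_bayes(X, y)
pred_a = predict(model, [1.1, 2.1])
pred_b = predict(model, [3.1, 0.4])
print(pred_a, pred_b)
```

Working with log probabilities keeps the product over features numerically stable when the number of features is large.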
Naïve Bayes • A terrible estimator for $p(x|y)$ • But it is a very reasonable estimator for $p(y|x)$ • Why? • The ratio of likelihoods is more important • Naïve Bayes does a reasonable job of estimating the ratio
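The Ratio of Likelihood • Binary class: $\log \frac{p(\mathbf{x}|y=1)}{p(\mathbf{x}|y=-1)}$ • Both classes share the same variance $\sigma_j^2$ for each feature $j$ • $\log \frac{p(\mathbf{x}|y=1)}{p(\mathbf{x}|y=-1)} = \sum_{j=1}^{m} \frac{\mu_{1,j} - \mu_{-1,j}}{\sigma_j^2} x_j - \sum_{j=1}^{m} \frac{\mu_{1,j}^2 - \mu_{-1,j}^2}{2\sigma_j^2}$ • A linear model!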
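The claim that the log-likelihood ratio is linear in the input when both classes share the same per-feature variance can be verified numerically. In this sketch the means, variances, and test point are hypothetical values, and `linear_form` is a name introduced here for the weights and bias obtained by expanding the two Gaussian log densities.

```python
import math

def log_gaussian(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_ratio_direct(x, mu_pos, mu_neg, var):
    """log p(x | y=1) - log p(x | y=-1) with independent Gaussian features."""
    return sum(log_gaussian(xj, mp, v) - log_gaussian(xj, mn, v)
               for xj, mp, mn, v in zip(x, mu_pos, mu_neg, var))

def linear_form(mu_pos, mu_neg, var):
    """Weights w and bias b such that the log ratio equals w . x + b."""
    w = [(mp - mn) / v for mp, mn, v in zip(mu_pos, mu_neg, var)]
    b = sum((mn ** 2 - mp ** 2) / (2 * v)
            for mp, mn, v in zip(mu_pos, mu_neg, var))
    return w, b

# Hypothetical class means and shared per-feature variances.
mu_pos = [1.0, -0.5, 2.0]
mu_neg = [0.0, 0.5, 1.5]
var = [1.0, 0.5, 2.0]

w, b = linear_form(mu_pos, mu_neg, var)
x = [0.3, 1.2, -0.7]
direct = log_ratio_direct(x, mu_pos, mu_neg, var)
linear = sum(wj * xj for wj, xj in zip(w, x)) + b
print(direct, linear)
```

The quadratic terms in x cancel only because the variances are shared; with class-specific variances the log ratio would keep a quadratic term and the boundary would no longer be linear.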
Decision Boundary • Gaussian generative models (with variance shared across classes) == finding a linear decision boundary • Why not do it directly?