Linear Classification Fall 2014 The University of Iowa Tianbao Yang
Content • K nearest neighbor classification • Basic variant • Improved variants • Probabilistic Generative Model • Discriminant Functions • Probabilistic Discriminative Model • Support Vector Machine
Classification Problems • Given input: a feature vector x ∈ R^d • Predict the output (class label) y • Binary classification: y ∈ {+1, −1} • Multi-class classification: y ∈ {1, 2, …, C} • Learn a classification function f that maps inputs to class labels • Regression: the output y is continuous (y ∈ R)
Examples of Classification Problem • Text categorization • Doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …" • Topic: Politics or Sport?
Examples of Classification Problem • Text categorization • Input features: word frequencies, e.g. {(campaigning, 1), (democrats, 2), (basketball, 0), …} • Class label: 'Politics': +1, 'Sport': −1
Examples of Classification Problem • Image Classification: which images have birds, which ones do not? • Input features X: color histogram, e.g. {(red, 1004), (blue, 23000), …} • Class label y: y = +1 for a 'bird image', y = −1 for a 'non-bird image'
Supervised Learning • Training examples: {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} • Independent and identically distributed (i.i.d.) assumption • A critical assumption for machine learning theory
Regression for Classification • It is easy to turn binary classification into a regression problem • Ignore the binary nature of the class label y and fit a regression function to the ±1 targets (see the sketch below) • How to convert multiclass classification into a regression problem? • Pros: computational efficiency • Cons: ignores the discrete nature of the class label
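A minimal sketch (not from the lecture) of this idea: fit ordinary least squares to the ±1 labels and threshold the real-valued output at zero. The data and the names below are illustrative assumptions.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit w for f(x) = w^T x by ordinary least squares on the +/-1 labels."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w

def predict_class(X, w):
    """Threshold the real-valued regression output to get a +/-1 label."""
    return np.where(X @ w >= 0.0, 1, -1)

# toy example: two Gaussian blobs labeled +1 / -1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, size=(50, 2)), rng.normal(-1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w = fit_least_squares(X, y)
print("training accuracy:", np.mean(predict_class(X, w) == y))
```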
K Nearest Neighbor (k-NN) Classifier (k = 1)
K Nearest Neighbor (k-NN) Classifier • Decision boundary for k = 1
K Nearest Neighbor (k-NN) Classifier • How many neighbors should we count? (compare k = 1 and k = 4)
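A minimal k-NN classifier sketch, assuming Euclidean distance and a plain majority vote; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=1):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# usage: X_train is an (N, d) array, y_train an (N,) array of labels, x_query a (d,) array
```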
Leave-One-Out Method • For each candidate k: hold out one training example at a time, classify it with k-NN on the remaining examples, and count the total number of errors • Example error counts: err(k = 1) = 3, err(k = 2) = 2, err(k = 3) = 6 → choose k = 2
Cross-validation • Divide the training examples into two sets • A training set (80%) and a validation set (20%) • Predict the class labels of the validation set using only the examples in the training set • Choose the number of neighbors k that maximizes the classification accuracy on the validation set (see the sketch below)
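A sketch of this selection procedure; scikit-learn's KNeighborsClassifier is used only for brevity and is an assumption, not something the lecture prescribes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def choose_k_by_validation(X, y, candidate_ks, seed=0):
    """Pick the k that maximizes accuracy on a held-out 20% validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    tr, va = idx[:n_train], idx[n_train:]
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[tr], y[tr])
        acc = clf.score(X[va], y[va])          # accuracy on the validation set
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```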
Bayes Optimal Solution for Classification • Expected loss for classification • Consider the 0-1 loss • Minimize the point-wise loss at each input x • Bayes Optimal Classifier: predict the most probable class (see the formulas below)
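The formulas on this slide were lost in extraction; a standard reconstruction of the argument under the 0-1 loss is:

```latex
% expected 0-1 loss, written point-wise so it can be minimized at every x separately
\mathbb{E}[L] \;=\; \sum_{y} \int \mathbb{1}\!\left[\, f(\mathbf{x}) \neq y \,\right] p(\mathbf{x}, y)\, d\mathbf{x}
\;=\; \int \Bigl( 1 - \Pr\!\bigl( f(\mathbf{x}) \mid \mathbf{x} \bigr) \Bigr)\, p(\mathbf{x})\, d\mathbf{x}

% Bayes optimal classifier: pick the most probable class at each x
f^{*}(\mathbf{x}) \;=\; \arg\max_{y}\; \Pr(y \mid \mathbf{x})
```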
Probabilistic Interpretation of KNN • Bayes' theorem: Pr(y | x) = Pr(x | y) Pr(y) / Pr(x) • KNN uses non-parametric density estimation • Given a data set with N_k data points from class C_k and N points in total
Probabilistic Interpretation of KNN • Estimate the density p(x) • Consider a small neighbourhood of volume V containing x, small enough that p(x) is roughly constant inside it • Given the total of N data points, suppose K data points fall inside the neighbourhood; then p(x) ≈ K / (N V)
Probabilistic Interpretation of KNN • Estimate the class-conditional density p(x | C_k) • Consider the same small neighbourhood of volume V containing x • p(x | C_k) ≈ K_k / (N_k V), where K_k = # of neighbors in class k and N_k = # of points in class k
Probabilistic Interpretation of KNN • Given a data set with N_k data points from class C_k and N points in total, we have p(x | C_k) ≈ K_k / (N_k V) • and correspondingly p(x) ≈ K / (N V) • Since Pr(C_k) = N_k / N, Bayes' theorem gives Pr(C_k | x) = K_k / K (see the derivation below)
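The symbols on these slides were lost in extraction; a standard reconstruction of the non-parametric argument (with K neighbors around x, K_k of them from class C_k, and N_k of the N training points in class C_k) is:

```latex
% densities estimated from the counts inside the neighbourhood of volume V
p(\mathbf{x} \mid \mathcal{C}_k) \;\approx\; \frac{K_k}{N_k V},
\qquad
p(\mathbf{x}) \;\approx\; \frac{K}{N V},
\qquad
\Pr(\mathcal{C}_k) \;=\; \frac{N_k}{N}

% Bayes' theorem then gives the k-NN estimate of the posterior
\Pr(\mathcal{C}_k \mid \mathbf{x})
  \;=\; \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, \Pr(\mathcal{C}_k)}{p(\mathbf{x})}
  \;=\; \frac{K_k}{K}
```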
Probabilistic Interpretation of KNN • Estimate the conditional probability Pr(y | x) • Count the data points of class y in the neighborhood of x • Bias and variance tradeoff • A small neighborhood → large variance → unreliable estimation • A large neighborhood → large bias → inaccurate estimation
Content • K nearest neighbor classification • Basic variant • Improved variants • Probabilistic Generative Model • Discriminant Functions • Probabilistic Discriminative Model • Support Vector Machine
Weighted kNN • Weight the contribution of each close neighbor based on its distance to the query point • Weight function: larger weight for closer neighbors • Prediction: a vote (or average) weighted by these weights (see the sketch below)
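One common concrete choice is a Gaussian weight function; the sketch below is an assumption standing in for the slide's lost formulas, and the names sigma and weighted_knn_predict are mine.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, sigma=1.0):
    """Weighted k-NN: closer neighbors get larger (Gaussian) weights."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] ** 2 / (2.0 * sigma ** 2))  # Gaussian weight function
    scores = {}
    for label, w in zip(y_train[nearest], weights):
        scores[label] = scores.get(label, 0.0) + w               # accumulate weighted votes
    return max(scores, key=scores.get)
```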
Kernel Density Estimation • Fix the volume V and estimate K from the data • Let R be a hypercube centred on x
Kernel Density Estimation • The hypercube kernel is discontinuous at the boundary of R • To avoid the discontinuity, use a smooth kernel, e.g. a Gaussian
Kernel Density Estimation • More generally, p(x) ≈ (1/N) Σ_n (1/h^d) k((x − x_n)/h) for any kernel function k(·) that is non-negative and integrates to one
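A sketch of this estimator with a Gaussian kernel of bandwidth h (both the kernel choice and the names are assumptions):

```python
import numpy as np

def gaussian_kde(X_train, x_query, h=0.5):
    """Kernel density estimate p(x) = (1/N) sum_n N(x | x_n, h^2 I)."""
    N, d = X_train.shape
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)      # Gaussian normalization constant
    return np.mean(np.exp(-sq_dists / (2.0 * h ** 2)) / norm)
```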
When to Consider Nearest Neighbor? • Fewer than 20 attributes per example • Advantages: • Training is very fast • Can learn complex target functions • Disadvantages: • Slow at query time • Easily fooled by irrelevant attributes
Curse of Dimensionality • Imagine instances described by 20 attributes, but only 2 are relevant to the target function • Curse of dimensionality: with many irrelevant attributes, the nearest neighbors under the full distance may be far apart in the relevant dimensions, so the assumption that nearby points share a label becomes severely violated
Curse of Dimensionality • Curse of dimensionality: more data points lie close to the boundary • For points uniformly distributed over a unit ball, the fraction near the surface grows with the dimensionality (see the formula below) • [Figure: # of data points near the boundary vs. dimensionality]
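The plot on this slide did not survive extraction; the quantitative version of the claim, for points uniform in the unit ball, is:

```latex
% fraction of the unit ball's volume within distance eps of its surface, in d dimensions
\frac{V_d(1) - V_d(1-\varepsilon)}{V_d(1)} \;=\; 1 - (1-\varepsilon)^{d} \;\longrightarrow\; 1
\quad \text{as } d \to \infty
% e.g. with eps = 0.1: 0.10 for d = 1, about 0.65 for d = 10, above 0.9999 for d = 100
```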
Curse of Dimensionality • High-dimensional problems are not rare • Bioinformatics (microarray gene expression): d ≈ 10^2–10^4 • Computer vision (images): d ≈ 10^4–10^6 • Text analysis: d ≈ 10^4–10^6
Dimensionality Reduction • Can we reduce the dimensionality? • It is possible
Principal Component Analysis • Dimensionality Reduction by Linear Transformation
Principal Component Analysis • Dimensionality Reduction by Linear Transformation • In which direction should we project the data?
Principal Component Analysis • The big picture when we look at the data: its mean and its variance
Principal Component Analysis • Mean-centered data and its variance
Principal Component Analysis • The projection should preserve as much of the variance as possible
Principal Component Analysis • Let us compute the variance after projection • Assume all data points are mean-centered • Data after projection onto a unit direction w: w^T x_n • Variance after projection: (1/N) Σ_n (w^T x_n)^2 = w^T S w, where S is the covariance matrix of the data
Principal Component Analysis • The best projection should maximize the variance w^T S w subject to ||w|| = 1 • The first projection direction (the first principal component) is the eigenvector of the covariance matrix S corresponding to its largest eigenvalue (see the derivation below) • What about the other projections?
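A short version of the standard Lagrange-multiplier derivation behind this claim (S denotes the data covariance matrix):

```latex
% maximize the projected variance over unit-norm directions
\max_{\mathbf{w}}\; \mathbf{w}^{\top} S\, \mathbf{w}
\quad \text{s.t.} \quad \mathbf{w}^{\top}\mathbf{w} = 1
% stationarity of the Lagrangian  w^T S w - \lambda (w^T w - 1)  gives
S\,\mathbf{w} = \lambda\, \mathbf{w},
\qquad
\mathbf{w}^{\top} S\, \mathbf{w} = \lambda
% so the maximum is attained by the eigenvector with the largest eigenvalue
```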
Principal Component Analysis • The next component should maximize the variance of the residual of all data points • Residual: the data with the component along the first direction removed, x_n − (w_1^T x_n) w_1 • Each subsequent direction maximizes the variance of this residual
Principal Component Analysis • The m components are the first m eigenvectors of the covariance matrix (those with the largest eigenvalues) • The variance along each component equals the corresponding eigenvalue of the covariance matrix
Principal Component Analysis • Geometric view: the components are orthogonal axes aligned with the directions of largest variance of the data
Principal Component Analysis • In summary (step by step) • Mean-center the data and compute the covariance matrix S = (1/N) Σ_n x_n x_n^T • Compute the first m eigenvectors of S as the m components (projection directions) W = [w_1, …, w_m] • Compute the new data: z_n = W^T x_n (see the sketch below)
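A compact numpy sketch of these steps, as one possible implementation (an SVD of the centered data matrix would work equally well):

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X onto the top-m principal components."""
    X_centered = X - X.mean(axis=0)                      # step 0: mean-center
    S = np.cov(X_centered, rowvar=False)                 # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)                 # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1][:m]                # indices of the top-m eigenvalues
    W = eigvecs[:, order]                                # step 2: m projection directions
    Z = X_centered @ W                                   # step 3: new data, shape (N, m)
    return Z, W, eigvals[order]
```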