Linear Classification Fall 2014 The University of Iowa Tianbao Yang
Content • K nearest neighbor classification • Basic variant • Improved variants • Probabilistic Generative Model • Discriminant Functions • Probabilistic Discriminative Model • Support Vector Machine
Classification Problems • Given input: a feature vector x ∈ R^d • Predict the output (class label) y • Binary classification: y ∈ {+1, −1} • Multi-class classification: y ∈ {1, 2, …, C} • Learn a classification function f that maps inputs to class labels • Regression: the output y is continuous (y ∈ R)
Examples of Classification Problem • Text categorization • Doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …" • Topic: Politics or Sport?
Examples of Classification Problem • Text categorization • Input features: word frequencies, e.g. {(campaigning, 1), (democrats, 2), (basketball, 0), …} • Class label: 'Politics': +1, 'Sport': −1
Examples of Classification Problem • Image Classification: which images have birds, which ones do not? • Input features X: color histogram, e.g. {(red, 1004), (blue, 23000), …} • Class label y: y = +1 for a 'bird image', y = −1 for a 'non-bird image'
Supervised Learning • Training examples: {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} • Independent and identically distributed (i.i.d.) assumption • A critical assumption for machine learning theory
Regression for Classification • It is easy to turn binary classification into a regression problem • Ignore the binary nature of the class label y and fit a regression function to the ±1 targets (see the sketch below) • How to convert multiclass classification into a regression problem? • Pros: computational efficiency • Cons: ignores the discrete nature of the class label
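A minimal sketch (not from the lecture) of this idea: fit ordinary least squares to the ±1 labels and threshold the real-valued output at zero. The data and the names below are illustrative assumptions.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit w for f(x) = w^T x by ordinary least squares on the +/-1 labels."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w

def predict_class(X, w):
    """Threshold the real-valued regression output to get a +/-1 label."""
    return np.where(X @ w >= 0.0, 1, -1)

# toy example: two Gaussian blobs labeled +1 / -1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, size=(50, 2)), rng.normal(-1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w = fit_least_squares(X, y)
print("training accuracy:", np.mean(predict_class(X, w) == y))
```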
K Nearest Neighbor (k-NN) Classifier (k = 1)
K Nearest Neighbor (k-NN) Classifier • Decision boundary for k = 1
K Nearest Neighbor (k-NN) Classifier • How many neighbors should we count? (compare k = 1 and k = 4)
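A minimal k-NN classifier sketch, assuming Euclidean distance and a plain majority vote; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=1):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# usage: X_train is an (N, d) array, y_train an (N,) array of labels, x_query a (d,) array
```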
Leave-One-Out Method • For each candidate k: hold out one training example at a time, classify it with k-NN on the remaining examples, and count the total number of errors • Example error counts: err(k = 1) = 3, err(k = 2) = 2, err(k = 3) = 6 → choose k = 2
Cross-validation • Divide the training examples into two sets • A training set (80%) and a validation set (20%) • Predict the class labels of the validation set using only the examples in the training set • Choose the number of neighbors k that maximizes the classification accuracy on the validation set (see the sketch below)
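A sketch of this selection procedure; scikit-learn's KNeighborsClassifier is used only for brevity and is an assumption, not something the lecture prescribes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def choose_k_by_validation(X, y, candidate_ks, seed=0):
    """Pick the k that maximizes accuracy on a held-out 20% validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    tr, va = idx[:n_train], idx[n_train:]
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[tr], y[tr])
        acc = clf.score(X[va], y[va])          # accuracy on the validation set
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```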
Bayes Optimal Solution for Classification • Expected loss for classification • Consider the 0-1 loss • Minimize the point-wise loss at each input x • Bayes Optimal Classifier: predict the most probable class (see the formulas below)
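The formulas on this slide were lost in extraction; a standard reconstruction of the argument under the 0-1 loss is:

```latex
% expected 0-1 loss, written point-wise so it can be minimized at every x separately
\mathbb{E}[L] \;=\; \sum_{y} \int \mathbb{1}\!\left[\, f(\mathbf{x}) \neq y \,\right] p(\mathbf{x}, y)\, d\mathbf{x}
\;=\; \int \Bigl( 1 - \Pr\!\bigl( f(\mathbf{x}) \mid \mathbf{x} \bigr) \Bigr)\, p(\mathbf{x})\, d\mathbf{x}

% Bayes optimal classifier: pick the most probable class at each x
f^{*}(\mathbf{x}) \;=\; \arg\max_{y}\; \Pr(y \mid \mathbf{x})
```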
Probabilistic Interpretation of KNN • Bayes' theorem: Pr(y | x) = Pr(x | y) Pr(y) / Pr(x) • KNN uses non-parametric density estimation • Given a data set with N_k data points from class C_k and N points in total
Probabilistic Interpretation of KNN • Estimate the density p(x) • Consider a small neighbourhood of volume V containing x, small enough that p(x) is roughly constant inside it • Given the total of N data points, suppose K data points fall inside the neighbourhood; then p(x) ≈ K / (N V)
Probabilistic Interpretation of KNN • Estimate the class-conditional density p(x | C_k) • Consider the same small neighbourhood of volume V containing x • p(x | C_k) ≈ K_k / (N_k V), where K_k = # of neighbors in class k and N_k = # of points in class k
Probabilistic Interpretation of KNN • Given a data set with N_k data points from class C_k and N points in total, we have p(x | C_k) ≈ K_k / (N_k V) • and correspondingly p(x) ≈ K / (N V) • Since Pr(C_k) = N_k / N, Bayes' theorem gives Pr(C_k | x) = K_k / K (see the derivation below)
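The symbols on these slides were lost in extraction; a standard reconstruction of the non-parametric argument (with K neighbors around x, K_k of them from class C_k, and N_k of the N training points in class C_k) is:

```latex
% densities estimated from the counts inside the neighbourhood of volume V
p(\mathbf{x} \mid \mathcal{C}_k) \;\approx\; \frac{K_k}{N_k V},
\qquad
p(\mathbf{x}) \;\approx\; \frac{K}{N V},
\qquad
\Pr(\mathcal{C}_k) \;=\; \frac{N_k}{N}

% Bayes' theorem then gives the k-NN estimate of the posterior
\Pr(\mathcal{C}_k \mid \mathbf{x})
  \;=\; \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, \Pr(\mathcal{C}_k)}{p(\mathbf{x})}
  \;=\; \frac{K_k}{K}
```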
Probabilistic Interpretation of KNN • Estimate the conditional probability Pr(y | x) • Count the data points of class y in the neighborhood of x • Bias and variance tradeoff • A small neighborhood → large variance → unreliable estimation • A large neighborhood → large bias → inaccurate estimation
Content • K nearest neighbor classification • Basic variant • Improved variants • Probabilistic Generative Model • Discriminant Functions • Probabilistic Discriminative Model • Support Vector Machine
Weighted kNN • Weight the contribution of each close neighbor based on its distance to the query point • Weight function: larger weight for closer neighbors • Prediction: a vote (or average) weighted by these weights (see the sketch below)
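One common concrete choice is a Gaussian weight function; the sketch below is an assumption standing in for the slide's lost formulas, and the names sigma and weighted_knn_predict are mine.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, sigma=1.0):
    """Weighted k-NN: closer neighbors get larger (Gaussian) weights."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] ** 2 / (2.0 * sigma ** 2))  # Gaussian weight function
    scores = {}
    for label, w in zip(y_train[nearest], weights):
        scores[label] = scores.get(label, 0.0) + w               # accumulate weighted votes
    return max(scores, key=scores.get)
```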
Kernel Density Estimation • Fix the volume V and estimate K from the data • Let R be a hypercube centred on x
Kernel Density Estimation • The hypercube kernel is discontinuous at the boundary of R • To avoid the discontinuity, use a smooth kernel, e.g. a Gaussian
Kernel Density Estimation • More generally, p(x) ≈ (1/N) Σ_n (1/h^d) k((x − x_n)/h) for any kernel function k(·) that is non-negative and integrates to one
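A sketch of this estimator with a Gaussian kernel of bandwidth h (both the kernel choice and the names are assumptions):

```python
import numpy as np

def gaussian_kde(X_train, x_query, h=0.5):
    """Kernel density estimate p(x) = (1/N) sum_n N(x | x_n, h^2 I)."""
    N, d = X_train.shape
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)      # Gaussian normalization constant
    return np.mean(np.exp(-sq_dists / (2.0 * h ** 2)) / norm)
```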
When to Consider Nearest Neighbor? • Fewer than 20 attributes per example • Advantages: • Training is very fast • Can learn complex target functions • Disadvantages: • Slow at query time • Easily fooled by irrelevant attributes
Curse of Dimensionality • Imagine instances described by 20 attributes, but only 2 are relevant to the target function • Curse of dimensionality: with many irrelevant attributes, the nearest neighbors under the full distance may be far apart in the relevant dimensions, so the assumption that nearby points share a label becomes severely violated
Curse of Dimensionality • Curse of dimensionality: more data points lie close to the boundary • For points uniformly distributed over a unit ball, the fraction near the surface grows with the dimensionality (see the formula below) • [Figure: # of data points near the boundary vs. dimensionality]
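The plot on this slide did not survive extraction; the quantitative version of the claim, for points uniform in the unit ball, is:

```latex
% fraction of the unit ball's volume within distance eps of its surface, in d dimensions
\frac{V_d(1) - V_d(1-\varepsilon)}{V_d(1)} \;=\; 1 - (1-\varepsilon)^{d} \;\longrightarrow\; 1
\quad \text{as } d \to \infty
% e.g. with eps = 0.1: 0.10 for d = 1, about 0.65 for d = 10, above 0.9999 for d = 100
```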
Curse of Dimensionality • High-dimensional problems are not rare • Bioinformatics (microarray gene expression): d ≈ 10^2–10^4 • Computer vision (images): d ≈ 10^4–10^6 • Text analysis: d ≈ 10^4–10^6
Dimensionality Reduction • Can we reduce the dimensionality? • It is possible
Principal Component Analysis • Dimensionality Reduction by Linear Transformation
Principal Component Analysis • Dimensionality Reduction by Linear Transformation • In which direction should we project the data?
Principal Component Analysis • The big picture when we look at the data: its mean and its variance
Principal Component Analysis • Mean-centered data and its variance
Principal Component Analysis • The projection should preserve as much of the variance as possible
Principal Component Analysis • Let us compute the variance after projection • Assume all data points are mean-centered • Data after projection onto a unit direction w: w^T x_n • Variance after projection: (1/N) Σ_n (w^T x_n)^2 = w^T S w, where S is the covariance matrix of the data
Principal Component Analysis • The best projection should maximize the variance w^T S w subject to ||w|| = 1 • The first projection direction (the first principal component) is the eigenvector of the covariance matrix S corresponding to its largest eigenvalue (see the derivation below) • What about the other projections?
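A short version of the standard Lagrange-multiplier derivation behind this claim (S denotes the data covariance matrix):

```latex
% maximize the projected variance over unit-norm directions
\max_{\mathbf{w}}\; \mathbf{w}^{\top} S\, \mathbf{w}
\quad \text{s.t.} \quad \mathbf{w}^{\top}\mathbf{w} = 1
% stationarity of the Lagrangian  w^T S w - \lambda (w^T w - 1)  gives
S\,\mathbf{w} = \lambda\, \mathbf{w},
\qquad
\mathbf{w}^{\top} S\, \mathbf{w} = \lambda
% so the maximum is attained by the eigenvector with the largest eigenvalue
```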
Principal Component Analysis • The next component should maximize the variance of the residual of all data points • Residual: the data with the component along the first direction removed, x_n − (w_1^T x_n) w_1 • Each subsequent direction maximizes the variance of this residual
Principal Component Analysis • The m components are the first m eigenvectors of the covariance matrix (those with the largest eigenvalues) • The variance along each component equals the corresponding eigenvalue of the covariance matrix
Principal Component Analysis • Geometric view: the components are orthogonal axes aligned with the directions of largest variance of the data
Principal Component Analysis • In summary (step by step) • Mean-center the data and compute the covariance matrix S = (1/N) Σ_n x_n x_n^T • Compute the first m eigenvectors of S as the m components (projection directions) W = [w_1, …, w_m] • Compute the new data: z_n = W^T x_n (see the sketch below)
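A compact numpy sketch of these steps, as one possible implementation (an SVD of the centered data matrix would work equally well):

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X onto the top-m principal components."""
    X_centered = X - X.mean(axis=0)                      # step 0: mean-center
    S = np.cov(X_centered, rowvar=False)                 # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)                 # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1][:m]                # indices of the top-m eigenvalues
    W = eigvecs[:, order]                                # step 2: m projection directions
    Z = X_centered @ W                                   # step 3: new data, shape (N, m)
    return Z, W, eigvals[order]
```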