
Introduction to Machine Learning

Presentation Transcript


  1. Introduction to Machine Learning Manik Varma Microsoft Research India http://research.microsoft.com/~manik manik@microsoft.com

  2. Binary Classification • Is this person Madhubala or not? • Is this person male or female? • Is this person beautiful or not?

  3. Multi-Class Classification • Is this person Madhubala, Lalu or Rakhi Sawant? • Is this person happy, sad, angry or bemused?

  4. Ordinal Regression • Is this person very beautiful, beautiful, ordinary or ugly?

  5. Regression • How beautiful is this person on a continuous scale of 1 to 10? 9.99?

  6. Ranking • Rank these people in decreasing order of attractiveness.

  7. Multi-Label Classification • Tag this image with the set of relevant labels from {female, Madhubala, beautiful, IITD faculty}

  8. Can regression solve all these problems? • Binary classification – predict p(y=1|x) • Multi-class classification – predict p(y=k|x) • Ordinal regression – predict p(y=k|x) • Ranking – predict and sort by relevance • Multi-label classification – predict p(y ∈ {±1}ᵏ|x) • Learning from experience and data • In what form can the training data be obtained? • What is known a priori? • Complexity of training • Complexity of prediction Are These Problems Distinct?

  9. Supervised learning • Classification • Generative methods • Nearest neighbour, Naïve Bayes • Discriminative methods • Logistic Regression • Discriminant methods • Support Vector Machines • Regression, Ranking, Feature Selection, etc. • Unsupervised learning • Semi-supervised learning • Reinforcement learning In This Course

  10. Noise and uncertainty • Unknown generative model Y = f(X) • Noise in measuring input and feature extraction • Noise in labels • Nuisance variables • Missing data • Finite training set size Learning from Noisy Data

  11. Under and Over Fitting

  12. Non-negativity and unit measure • 0 ≤ p(y) ≤ 1, p(Ω) = 1, p(∅) = 0 • Conditional probability – p(y|x) • p(x, y) = p(y|x) p(x) = p(x|y) p(y) • Bayes’ Theorem • p(y|x) = p(x|y) p(y) / p(x) • Marginalization • p(x) = ∫y p(x, y) dy • Independence • p(x1, x2) = p(x1) p(x2) ⇔ p(x1|x2) = p(x1) • Chris Bishop, “Pattern Recognition & Machine Learning” Probability Theory
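The product rule, Bayes' theorem and marginalization above can be checked numerically on a small discrete joint distribution; the sketch below is a minimal illustration in NumPy with made-up probabilities, not material from the slides.

```python
# Toy numeric check of marginalization, the product rule and Bayes' theorem
# on a small made-up discrete joint distribution p(x, y).
import numpy as np

# Rows index x in {0, 1}, columns index y in {0, 1}; entries sum to 1.
p_xy = np.array([[0.10, 0.30],
                 [0.40, 0.20]])

p_x = p_xy.sum(axis=1)                  # marginalization: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                  # p(y) = sum_x p(x, y)
p_y_given_x = p_xy / p_x[:, None]       # conditional: p(y|x) = p(x, y) / p(x)
p_x_given_y = p_xy / p_y[None, :]       # conditional: p(x|y) = p(x, y) / p(y)

# Product rule: p(x, y) = p(y|x) p(x) = p(x|y) p(y)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)
assert np.allclose(p_x_given_y * p_y[None, :], p_xy)

# Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x)
assert np.allclose(p_y_given_x, p_x_given_y * p_y[None, :] / p_x[:, None])
```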

  13. p(x|,) = exp( -(x – )2/22) / (22)½ The Univariate Gaussian Density -3 -2 -1  1 2 3

  14. p(x|,) = exp( -½(x – )t-1 (x – ) )/ (2)D/2||½ The Multivariate Gaussian Density

  15. p(|a,b) = a-1(1 – )b-1(a+b) / (a)(b) The Beta Density

  16. Bernoulli: Single trial with probability of success = θ • n ∈ {0, 1}, θ ∈ [0, 1] • p(n|θ) = θⁿ(1 – θ)^(1–n) • Binomial: N iid Bernoulli trials with n successes • n ∈ {0, 1, …, N}, θ ∈ [0, 1] • p(n|N,θ) = ᴺCₙ θⁿ(1 – θ)^(N–n) • Multinomial: N iid trials, outcome k occurs nk times • nk ∈ {0, 1, …, N}, Σk nk = N, θk ∈ [0, 1], Σk θk = 1 • p(n|N,θ) = N! Πk θk^nk / nk! Probability Distribution Functions
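Likewise, the Bernoulli, Binomial and Multinomial pmfs on slide 16 can be verified against scipy.stats; the parameter values below are arbitrary illustrations, not taken from the slides.

```python
# Sanity-check the three pmfs above against scipy.stats.
import numpy as np
from scipy import stats
from math import comb, factorial

theta, N, n = 0.6, 10, 7

# Bernoulli: p(n|theta) = theta^n (1 - theta)^(1-n), n in {0, 1}
assert np.isclose(theta ** 1 * (1 - theta) ** 0, stats.bernoulli.pmf(1, theta))

# Binomial: p(n|N, theta) = C(N, n) theta^n (1 - theta)^(N-n)
p_binom = comb(N, n) * theta ** n * (1 - theta) ** (N - n)
assert np.isclose(p_binom, stats.binom.pmf(n, N, theta))

# Multinomial: p(n|N, theta) = N! prod_k theta_k^{n_k} / n_k!
counts = np.array([3, 5, 2])           # n_k, sums to N
probs = np.array([0.2, 0.5, 0.3])      # theta_k, sums to 1
p_multi = factorial(N) * np.prod(probs ** counts) / np.prod([factorial(c) for c in counts])
assert np.isclose(p_multi, stats.multinomial.pmf(counts, N, probs))
```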

  17. We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips. • We are asked to predict whether the next coin flip will result in a head or a tail. • Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail • We should predict heads if p(y=1|n,N) > p(y=0|n,N) A Toy Example

  18. Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 – θ so that we should predict heads if θ > ½ • How should we estimate θ? • Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data • θML = argmaxθ p(n|θ) = argmaxθ ᴺCₙ θⁿ(1 – θ)^(N–n) • = argmaxθ n log(θ) + (N – n) log(1 – θ) • = n / N • We should predict heads if n > ½ N The Maximum Likelihood Approach

  19. We should choose the value of θ maximizing the posterior probability of θ conditioned on the data • We assume a • Binomial likelihood : p(n|θ) = ᴺCₙ θⁿ(1 – θ)^(N–n) • Beta prior : p(θ|a,b) = θ^(a–1)(1 – θ)^(b–1) Γ(a+b) / (Γ(a)Γ(b)) • θMAP = argmaxθ p(θ|n,a,b) = argmaxθ p(n|θ) p(θ|a,b) • = argmaxθ θⁿ(1 – θ)^(N–n) θ^(a–1)(1 – θ)^(b–1) • = (n+a–1) / (N+a+b–2) as if we saw an extra a – 1 heads & b – 1 tails • We should predict heads if n > ½ (N + b – a) The Maximum A Posteriori Approach

  20. We should marginalize over θ • p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ • = ∫ θ p(θ|a,b,n) dθ • = ∫ θ β(θ|a + n, b + N – n) dθ • = (n + a) / (N + a + b) as if we saw an extra a heads & b tails • We should predict heads if n > ½ (N + b – a) • The Bayesian and MAP prediction coincide in this case • In the very large data limit, both the Bayesian and MAP prediction coincide with the ML prediction (n > ½ N) The Bayesian Approach
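A quick way to see how the three coin-flip estimators on slides 18–20 differ is to compute them side by side; the sketch below uses made-up counts n, N and prior hyperparameters a, b.

```python
# ML, MAP and Bayesian predictions for the coin-flip example.
n, N = 7, 10          # observed heads, total flips (illustrative values)
a, b = 2.0, 5.0       # Beta prior: as if we had already seen a-1 heads, b-1 tails

theta_ml = n / N                               # maximum likelihood estimate
theta_map = (n + a - 1) / (N + a + b - 2)      # posterior mode
p_heads_bayes = (n + a) / (N + a + b)          # posterior predictive p(y=1|n,a,b)

for name, val in [("ML", theta_ml), ("MAP", theta_map), ("Bayes", p_heads_bayes)]:
    print(f"{name:5s}: p(heads) = {val:.3f} -> predict {'heads' if val > 0.5 else 'tails'}")
```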

  21. Classification

  22. Binary Classification

  23. Memorization • Cannot deal with previously unseen data • Large-scale annotated data acquisition cost might be very high • Rule-based expert systems • Dependent on the competence of the expert • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc. • Rules might not transfer to similar problems • Learning from training data and prior knowledge • Focuses on generalization to novel data Approaches to Classification

  24. Training Data • Set of N labeled examples of the form (xi, yi) • Feature vector – xi ∈ ℝᴰ. X = [x1 x2 … xN] • Label – yi ∈ {±1}. y = [y1, y2 … yN]ᵗ. Y = diag(y) • Example – Gender Identification (x1 = [image], y1 = +1) (x2 = [image], y2 = +1) (x3 = [image], y3 = +1) (x4 = [image], y4 = -1) Notation
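For concreteness, here is the slide 24 notation spelled out in NumPy on a made-up 2-D toy dataset; the feature values are arbitrary placeholders standing in for the image features on the slide.

```python
# The notation X, y, Y = diag(y) from slide 24, with illustrative numbers.
import numpy as np

X = np.array([[1.0, 2.0],       # x1
              [0.5, 1.5],       # x2
              [2.0, 0.5],       # x3
              [-1.0, -2.0]]).T  # x4; columns of X are feature vectors, X is D x N
y = np.array([+1, +1, +1, -1])  # labels y_i in {+1, -1}
Y = np.diag(y)                  # Y = diag(y)
D, N = X.shape
```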

  25. Binary Classification

  26. Binary Classification • wᵗx + b = 0 • θ = [w; b]

  27. Bayes’ decision rule • p(y=+1|x) > p(y=-1|x) ? y = +1 : y = -1 • ⇔ p(y=+1|x) > ½ ? y = +1 : y = -1 Bayes’ Decision Rule

  28. Bayesian versus MAP versus ML • Should we choose just one function to explain the data? • If yes, should this be the function that explains the data the best? • What about prior knowledge? • Generative versus Discriminative • Can we learn from “positive” data alone? • Should we model the data distribution? • Are there any missing variables? • Do we just care about the final decision? Issues to Think About

  29. p(y|x,X,Y) = ∫f p(y,f|x,X,Y) df • = ∫f p(y|f,x,X,Y) p(f|x,X,Y) df • = ∫f p(y|f,x) p(f|X,Y) df • This integral is often intractable. • To solve it we can • Choose the distributions so that the solution is analytic (conjugate priors) • Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods) • Sample from p(f|X,Y) (MCMC) Bayesian Approach

  30. p(y|x,X,Y) = ∫f p(y|f,x) p(f|X,Y) df • = p(y|fMAP,x) when p(f|X,Y) = δ(f – fMAP) • The more training data there is, the better p(f|X,Y) approximates a delta function • We can make predictions using a single function, fMAP, and our focus shifts to estimating fMAP. Maximum A Posteriori (MAP)

  31. fMAP = argmaxf p(f|X,Y) • = argmaxf p(X,Y|f) p(f) / p(X,Y) • = argmaxf p(X,Y|f) p(f) • fML = argmaxf p(X,Y|f) (Maximum Likelihood) • Maximum Likelihood holds if • There is a lot of training data so that p(X,Y|f) >> p(f) • Or if there is no prior knowledge so that p(f) is uniform (improper) MAP & Maximum Likelihood (ML)

  32. fML = argmaxf p(X,Y|f) • = argmaxf Πi p(xi,yi|f) • The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels. • In particular, p(X,Y) ≠ Πi p(xi,yi) IID Data

  33. Generative Methods – Naïve Bayes

  34. θMAP = argmaxθ p(θ) Πi p(xi,yi|θ) • = argmaxθ p(θx) p(θy) Πi p(xi,yi|θ) • = argmaxθ p(θx) p(θy) Πi p(xi|yi,θx) p(yi|θy) • = [argmaxθx p(θx) Πi p(xi|yi,θx)] * [argmaxθy p(θy) Πi p(yi|θy)] • θx and θy can be solved for independently • The parameters of each class decouple and can be solved for independently Generative Methods

  35. The parameters of each class decouple and can be solved for independently Generative Methods

  36. θMAP = [argmaxθx p(θx) Πi p(xi|yi,θx)] * [argmaxθy p(θy) Πi p(yi|θy)] • Naïve Bayes assumptions • Independent Gaussian features • p(xi|yi,θx) = Πj p(xij|yi,θx) • p(xij|yi=±1,θx) = N(xij|μj±, σj) • Improper uniform priors (no prior knowledge) • p(θx) = p(θy) = const • Bernoulli labels • p(yi=+1|θy) = θ, p(yi=-1|θy) = 1 – θ Generative Methods – Naïve Bayes

  37. θML = [argmaxθx Πi Πj N(xij|μjyi, σj)] * [argmaxθ Πi θ^((1+yi)/2) (1 – θ)^((1–yi)/2)] • Estimating θML • θML = argmaxθ Πi θ^((1+yi)/2) (1 – θ)^((1–yi)/2) • = argmaxθ ½(N + Σi yi) log(θ) + ½(N – Σi yi) log(1 – θ) • = N+ / N (by differentiating and setting to zero) • Estimating μML, σML • μ±ML = (1 / N±) Σyi=±1 xi • σ²jML = [Σyi=+1 (xij – μ+jML)² + Σyi=-1 (xij – μ-jML)²] / N Generative Methods – Naïve Bayes
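The closed-form estimates on slide 37 translate almost line by line into NumPy; the sketch below fits them on synthetic 2-D data (my own illustrative data, with the per-feature variance shared across classes as on the slide) and classifies with Bayes' decision rule.

```python
# Minimal Gaussian Naive Bayes with the ML estimates from slide 37.
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 2
y = np.where(rng.random(N) < 0.6, 1, -1)                           # Bernoulli labels
mu_true = {+1: np.array([1.0, 1.0]), -1: np.array([-1.0, -1.0])}
X = np.stack([mu_true[yi] + rng.normal(0.0, 1.0, D) for yi in y])  # N x D features

# ML estimates
theta_ml = np.mean(y == 1)                                         # N+ / N
mu_pos = X[y == 1].mean(axis=0)                                    # mu_+j
mu_neg = X[y == -1].mean(axis=0)                                   # mu_-j
var_j = (np.sum((X[y == 1] - mu_pos) ** 2, axis=0) +
         np.sum((X[y == -1] - mu_neg) ** 2, axis=0)) / N           # sigma_j^2, shared

def log_gauss(x, mu, var):
    # log of a product of independent univariate Gaussians (the NB likelihood)
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))

def predict(x):
    # Bayes' decision rule: compare log p(x|y) + log p(y) for the two classes
    s_pos = log_gauss(x, mu_pos, var_j) + np.log(theta_ml)
    s_neg = log_gauss(x, mu_neg, var_j) + np.log(1 - theta_ml)
    return 1 if s_pos > s_neg else -1

preds = np.array([predict(x) for x in X])
print(f"theta = {theta_ml:.2f}, training accuracy = {np.mean(preds == y):.2f}")
```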

  38. Naïve Bayes – Prediction

  39. Naïve Bayes – Prediction

  40. p(y=+1|x) = p(x|y=+1) p(y=+1) / p(x) • = 1 / (1 + exp(log(p(y=-1) / p(y=+1)) + log(p(x|y=-1) / p(x|y=+1)))) • = 1 / (1 + exp(log(1/θ – 1) – ½ μ-ᵗΣ⁻¹μ- + ½ μ+ᵗΣ⁻¹μ+ – (μ+ – μ-)ᵗΣ⁻¹x)) • = 1 / (1 + exp(-b – wᵗx)) (Logistic Regression) • p(y=-1|x) = exp(-b – wᵗx) / (1 + exp(-b – wᵗx)) • log(p(y=-1|x) / p(y=+1|x)) = -b – wᵗx • y = sign(b + wᵗx) • The decision boundary will be linear! Naïve Bayes – Prediction
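Slide 40's claim that the Naïve Bayes posterior collapses to a sigmoid of a linear function can be checked numerically; the sketch below uses made-up class means, shared diagonal variances and prior θ, and verifies that the direct posterior equals 1 / (1 + exp(-(b + wᵗx))) with w = Σ⁻¹(μ+ – μ-).

```python
# Check that the Naive Bayes posterior is a sigmoid of a linear function.
import numpy as np

mu_pos = np.array([1.0, 0.5])
mu_neg = np.array([-0.5, -1.0])
var = np.array([1.5, 0.8])            # sigma_j^2, shared across the two classes
theta = 0.6                           # p(y = +1)
Sigma_inv = np.diag(1.0 / var)

# Linear form read off from the algebra on slide 40
w = Sigma_inv @ (mu_pos - mu_neg)
b = (-np.log(1.0 / theta - 1.0)
     + 0.5 * mu_neg @ Sigma_inv @ mu_neg
     - 0.5 * mu_pos @ Sigma_inv @ mu_pos)

def posterior_direct(x):
    # p(y=+1|x) computed straight from the class-conditionals and the prior
    def log_gauss(mu):
        return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    log_pos = log_gauss(mu_pos) + np.log(theta)
    log_neg = log_gauss(mu_neg) + np.log(1 - theta)
    return 1.0 / (1.0 + np.exp(log_neg - log_pos))

x = np.array([0.3, -0.2])
sigmoid = 1.0 / (1.0 + np.exp(-(b + w @ x)))
assert np.isclose(posterior_direct(x), sigmoid)   # identical posteriors
```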

  41. Discriminative Methods Logistic Regression

  42. θMAP = argmaxθ p(θ) Πi p(xi,yi|θ) • We assume that • p(θ) = p(w) p(w̃) • p(xi,yi|θ) = p(yi|xi,θ) p(xi|θ) • = p(yi|xi,w) p(xi|w̃) • θMAP = [argmaxw p(w) Πi p(yi|xi,w)] * [argmaxw̃ p(w̃) Πi p(xi|w̃)] • It turns out that w̃ plays no role in determining the posterior distribution • p(y|x,X,Y) = p(y|x,θMAP) = p(y|x,wMAP) • where wMAP = argmaxw p(w) Πi p(yi|xi,w) Discriminative Methods

  43. θMAP = argmaxw,b p(w) Πi p(yi|xi,w) • Regularized Logistic Regression • Gaussian prior – p(w) ∝ exp(-½ wᵗw) • Logistic likelihood – p(yi|xi,w) = 1 / (1 + exp(-yi(b + wᵗxi))) Disc. Methods – Logistic Regression

  44. θMAP = argmaxw,b p(w) Πi p(yi|xi,w) • = argminw,b ½ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi))) • Bad news: No closed form solution for w and b • Good news: We have to minimize a convex function • We can obtain the global optimum • The function is smooth • Tom Minka, “A comparison of numerical optimizers for LR” (Matlab code) • Keerthi et al., “A Fast Dual Algorithm for Kernel Logistic Regression”, ML 05 • Andrew and Gao, “OWL-QN”, ICML 07 • Krishnapuram et al., “SMLR”, PAMI 05 Regularized Logistic Regression
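The regularized objective on slide 44 is easy to minimize with an off-the-shelf quasi-Newton method; the sketch below uses scipy.optimize.minimize with L-BFGS on synthetic data (the data, the unit regularization weight and the optimizer choice are all illustrative, not the slides' setup).

```python
# Minimize 1/2 w^T w + sum_i log(1 + exp(-y_i (b + w^T x_i))) with L-BFGS.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
y = np.where(X @ w_true + 0.5 + 0.3 * rng.normal(size=N) > 0, 1, -1)

def objective(params):
    w, b = params[:D], params[D]
    margins = y * (X @ w + b)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    return 0.5 * w @ w + np.sum(np.logaddexp(0.0, -margins))

def gradient(params):
    w, b = params[:D], params[D]
    margins = y * (X @ w + b)
    s = -y / (1.0 + np.exp(margins))      # d/d(b + w^T x_i) of each log term
    return np.concatenate([w + X.T @ s, [s.sum()]])

res = minimize(objective, np.zeros(D + 1), jac=gradient, method="L-BFGS-B")
w_map, b_map = res.x[:D], res.x[D]
print("w =", np.round(w_map, 2), "b =", np.round(b_map, 2), "converged:", res.success)
```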

  45. Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01 Regularized Logistic Regression

  46. Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01 Regularized Logistic Regression

  47. Naïve Bayes versus Logistic Regression

  48. Naïve Bayes versus Logistic Regression

  49. Naïve Bayes versus Logistic Regression

  50. Convex f : f(λx1 + (1 – λ)x2) ≤ λf(x1) + (1 – λ)f(x2) for all λ ∈ [0, 1] • The Hessian ∇²f is always positive semi-definite • The tangent is always a lower bound to f Convex Functions
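The three characterizations of convexity on slide 50 can be illustrated numerically on the logistic loss from the previous slides; the choice of example function and the random evaluation points below are my own.

```python
# Numeric illustration of convexity for f(z) = log(1 + exp(-z)).
import numpy as np

def f(z):
    return np.logaddexp(0.0, -z)         # logistic loss, a convex function

rng = np.random.default_rng(2)
z1, z2 = 3.0 * rng.normal(size=2)
lam = rng.random()

# Chord lies above the function
assert f(lam * z1 + (1 - lam) * z2) <= lam * f(z1) + (1 - lam) * f(z2)

# 1-D "Hessian": f''(z) = sigmoid(z) (1 - sigmoid(z)) >= 0
sig = 1.0 / (1.0 + np.exp(-z1))
assert sig * (1 - sig) >= 0

# Tangent at z1 is a global lower bound on f
fprime = -(1.0 - sig)                    # f'(z1) for f(z) = log(1 + exp(-z))
z = np.linspace(-5, 5, 101)
assert np.all(f(z) >= f(z1) + fprime * (z - z1) - 1e-12)
```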
