
Classification


Presentation Transcript


  1. Classification Yan Pan

  2. Underfitting and Overfitting

  3. Probability Theory • Non-negativity and unit measure • 0 ≤ p(y), p(Ω) = 1, p(∅) = 0 • Conditional probability – p(y|x) • p(x, y) = p(y|x) p(x) = p(x|y) p(y) • Bayes’ Theorem • p(y|x) = p(x|y) p(y) / p(x) • Marginalization • p(x) = ∫ p(x, y) dy • Independence • p(x1, x2) = p(x1) p(x2) ⇔ p(x1|x2) = p(x1) • Chris Bishop, “Pattern Recognition & Machine Learning”
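
The rules on this slide are easy to sanity-check numerically. The following sketch (the 2×2 joint table is made up purely for illustration) verifies the product rule, Bayes’ theorem, and marginalization with numpy:

    import numpy as np

    # made-up joint distribution p(x, y) over x, y in {0, 1}
    p_xy = np.array([[0.1, 0.3],
                     [0.2, 0.4]])
    assert np.isclose(p_xy.sum(), 1.0)         # unit measure

    p_x = p_xy.sum(axis=1)                     # marginalization: p(x) = sum_y p(x, y)
    p_y = p_xy.sum(axis=0)                     # p(y) = sum_x p(x, y)
    p_y_given_x = p_xy / p_x[:, None]          # conditional: p(y|x) = p(x, y) / p(x)

    # product rule: p(x, y) = p(y|x) p(x)
    assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

    # Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y) agrees with the definition p(x, y) / p(y)
    p_x_given_y_bayes = p_y_given_x * p_x[:, None] / p_y[None, :]
    p_x_given_y_direct = p_xy / p_y[None, :]
    assert np.allclose(p_x_given_y_bayes, p_x_given_y_direct)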

  4. The Univariate Gaussian Density • p(x|μ,σ) = exp( −(x − μ)² / 2σ² ) / (2πσ²)^½

  5. The Multivariate Gaussian Density • p(x|μ,Σ) = exp( −½ (x − μ)^T Σ⁻¹ (x − μ) ) / ( (2π)^(D/2) |Σ|^½ )

  6. The Beta Density • p(θ|a,b) = θ^(a−1) (1 − θ)^(b−1) Γ(a+b) / ( Γ(a) Γ(b) )
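
For completeness, here is a small sketch (parameter values are illustrative, not from the slides) that evaluates the univariate Gaussian, multivariate Gaussian, and Beta densities above with scipy.stats and checks them against the closed-form expressions:

    import numpy as np
    from math import gamma, pi, sqrt, exp
    from scipy.stats import norm, multivariate_normal, beta

    # univariate Gaussian
    x, mu, sigma = 0.5, 0.0, 1.0
    manual = exp(-(x - mu)**2 / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)
    assert np.isclose(norm.pdf(x, loc=mu, scale=sigma), manual)

    # multivariate Gaussian (D = 2)
    xv, muv = np.array([0.5, -1.0]), np.zeros(2)
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    quad = (xv - muv) @ np.linalg.inv(Sigma) @ (xv - muv)
    manual = exp(-0.5 * quad) / ((2 * pi) ** (2 / 2) * np.linalg.det(Sigma) ** 0.5)
    assert np.isclose(multivariate_normal.pdf(xv, mean=muv, cov=Sigma), manual)

    # Beta
    theta, a, b = 0.3, 2.0, 5.0
    manual = theta**(a - 1) * (1 - theta)**(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))
    assert np.isclose(beta.pdf(theta, a, b), manual)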

  7. Probability Distribution Functions • Bernoulli: single trial with probability of success = θ • n ∈ {0, 1}, θ ∈ [0, 1] • p(n|θ) = θ^n (1 − θ)^(1−n) • Binomial: N iid Bernoulli trials with n successes • n ∈ {0, 1, …, N}, θ ∈ [0, 1] • p(n|N,θ) = C(N,n) θ^n (1 − θ)^(N−n), where C(N,n) is the binomial coefficient
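
As a quick check of the Binomial formula (N, n, and θ below are made-up values), the pmf computed from the expression above matches scipy's implementation:

    from math import comb
    from scipy.stats import binom

    N, n, theta = 10, 7, 0.6
    manual = comb(N, n) * theta**n * (1 - theta)**(N - n)
    print(manual, binom.pmf(n, N, theta))   # the two values agree (about 0.215)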

  8. A Toy Example • We don’t know whether a coin is fair or not. We are told that heads occurred n times in N coin flips. • We are asked to predict whether the next coin flip will result in a head or a tail. • Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail. • We should predict heads if p(y=1|n,N) > p(y=0|n,N)

  9. The Maximum Likelihood Approach • Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 − θ, so that we should predict heads if θ > ½ • How should we estimate θ? • Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data • θ_ML = argmax_θ p(n|θ) = argmax_θ C(N,n) θ^n (1 − θ)^(N−n) • = argmax_θ n log(θ) + (N − n) log(1 − θ) • = n / N • We should predict heads if n > ½ N

  10. The Maximum A Posteriori Approach • We should choose the value of θ maximizing the posterior probability of θ conditioned on the data • We assume a • Binomial likelihood: p(n|θ) = C(N,n) θ^n (1 − θ)^(N−n) • Beta prior: p(θ|a,b) = θ^(a−1) (1 − θ)^(b−1) Γ(a+b) / ( Γ(a) Γ(b) ) • θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b) • = argmax_θ θ^(n+a−1) (1 − θ)^(N−n+b−1) • = (n + a − 1) / (N + a + b − 2), as if we had seen an extra a − 1 heads & b − 1 tails • We should predict heads if n > ½ (N + b − a)

  11. The Bayesian Approach • We should marginalize over θ • p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ • = ∫ θ p(θ|a,b,n) dθ • = ∫ θ Beta(θ|a + n, b + N − n) dθ • = (n + a) / (N + a + b), as if we had seen an extra a heads & b tails • We should predict heads if n > ½ (N + b − a) • The Bayesian and MAP predictions coincide in this case • In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction (n > ½ N)
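
The three coin-flip estimates above are easy to compare side by side. The sketch below uses made-up counts and prior pseudo-counts (n, N, a, b) just to illustrate the formulas:

    n, N = 7, 10          # observed heads / total flips
    a, b = 2.0, 2.0       # Beta prior parameters

    theta_ml    = n / N                            # maximum likelihood
    theta_map   = (n + a - 1) / (N + a + b - 2)    # posterior mode
    theta_bayes = (n + a) / (N + a + b)            # posterior mean (Bayesian prediction)

    for name, theta in [("ML", theta_ml), ("MAP", theta_map), ("Bayes", theta_bayes)]:
        decision = "heads" if theta > 0.5 else "tails"
        print(f"{name:5s}: p(heads) = {theta:.3f} -> predict {decision}")

    # MAP and Bayes predict heads exactly when n > (N + b - a) / 2, as derived above.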

  12. Classification

  13. Binary Classification

  14. Approaches to Classification • Memorization • Cannot deal with previously unseen data • Large-scale annotated data acquisition cost might be very high • Rule-based expert system • Dependent on the competence of the expert • Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc. • Rules might not transfer to similar problems • Learning from training data and prior knowledge • Focuses on generalization to novel data

  15. Notation • Training Data • Set of N labeled examples of the form (xi, yi) • Feature vector – xi ∈ R^D. X = [x1 x2 … xN] • Label – yi ∈ {±1}. y = [y1, y2 … yN]^T. Y = diag(y) • Example – Gender Identification: (x1 = , y1 = +1) (x2 = , y2 = +1) (x3 = , y3 = +1) (x4 = , y4 = -1)
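
A tiny sketch of this notation (the data below is random and made up), stacking the N feature vectors as the columns of X and forming Y = diag(y):

    import numpy as np

    D, N = 3, 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(D, N))           # X = [x1 x2 ... xN], each xi in R^D
    y = np.array([+1, +1, +1, -1])        # yi in {+1, -1}
    Y = np.diag(y)                        # Y = diag(y)
    print(X.shape, y.shape, Y.shape)      # (3, 4) (4,) (4, 4)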

  16. Binary Classification

  17. Binary Classification • Separating hyperplane: w^T x + b = 0, with parameters θ = [w; b] • [figure: hyperplane with normal vector w and offset b]

  18. Machine Learning from the Optimization View • Before we go into the details of classification and regression methods, we should take a close look at the objective functions of machine learning • Machine Learning: find the rule underlying the data (pick the best rule among many candidates) – what is the selection criterion? • Apply each candidate rule to the training data and check its prediction error; the rule with the fewest prediction errors is the one we are looking for.

  19. Supervised Learning

  20. Common Form of Supervised Learning Problems • Minimize the following objective function • Regularization term + Loss function • Regularization term: controls the model complexity and avoids overfitting • Loss function: measures the quality of the learned function, i.e. the prediction error on the training data

  21. Ex.1 Linear Regression • E(w) = ½ Σ_n (y_n − w^T x_n)² + ½ w^T w
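
The Ex.1 objective has a closed-form minimizer: setting the gradient X^T(Xw − y) + w to zero gives (X^T X + I) w = X^T y. A minimal sketch on synthetic (made-up) data:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 50, 3
    X = rng.normal(size=(N, D))                      # here the rows of X are the x_n
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=N)

    w_hat = np.linalg.solve(X.T @ X + np.eye(D), X.T @ y)   # solve (X^T X + I) w = X^T y
    print(w_hat)                                             # close to w_true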

  22. Ex.2 Logistic Regression (a classification method) • L(w, b) = ½ w^T w + Σ_i log(1 + exp(−y_i (b + w^T x_i)))
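
A short sketch (synthetic data, illustrative only) evaluating the Ex.2 objective; np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow for large negative margins:

    import numpy as np

    rng = np.random.default_rng(1)
    N, D = 100, 5
    X = rng.normal(size=(N, D))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=N) > 0, 1, -1)

    def logistic_objective(w, b):
        margins = y * (X @ w + b)
        return 0.5 * w @ w + np.sum(np.logaddexp(0.0, -margins))

    print(logistic_objective(np.zeros(D), 0.0))   # equals N * log(2) at w = 0, b = 0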

  23. Ex.3 SVM • E(w) = ½ w^T w + Σ_i max(0, 1 − y_i w^T x_i) • or • E(w) = ½ w^T w + Σ_i max(0, 1 − y_i w^T x_i)²
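
And the corresponding sketch for the two Ex.3 objectives (again on made-up data), using the hinge and squared-hinge losses:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 5))
    y = np.where(X[:, 0] > 0, 1, -1)

    def svm_objective(w, squared=False):
        hinge = np.maximum(0.0, 1.0 - y * (X @ w))       # max(0, 1 - y_i w^T x_i)
        return 0.5 * w @ w + np.sum(hinge**2 if squared else hinge)

    w = np.zeros(5)
    print(svm_objective(w), svm_objective(w, squared=True))   # both equal 100 at w = 0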

  24. How to measure error? • True label: y_i • Prediction: w^T x_i • The closer the prediction is to the true label, the better. Require exact equality? • 0/1 error: I(y_i ≠ w^T x_i) • Squared error: (y_i − w^T x_i)² • If the values are assumed to lie in [−1, 1]: make the product as large as possible • Margin: y_i w^T x_i

  25. Approximate the Zero-One Loss • Squared Error • Exponential Loss • Logistic Loss • Hinge Loss • Sigmoid Loss
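
Written as functions of the margin m = y_i w^T x_i, these surrogates are easy to compare; the sketch below tabulates them (the exact scaling of each surrogate, e.g. of the sigmoid loss, is an illustrative choice rather than something fixed by the slide):

    import numpy as np

    def zero_one(m):    return (m <= 0).astype(float)
    def squared(m):     return (1.0 - m)**2            # (y - f)^2 with y in {+1, -1}
    def exponential(m): return np.exp(-m)
    def logistic(m):    return np.logaddexp(0.0, -m)   # log(1 + exp(-m))
    def hinge(m):       return np.maximum(0.0, 1.0 - m)
    def sigmoid(m):     return 1.0 / (1.0 + np.exp(m))

    margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for loss in (zero_one, squared, exponential, logistic, hinge, sigmoid):
        print(f"{loss.__name__:12s}", np.round(loss(margins), 3))

All of these are smoother (or at least continuous) in the margin and therefore easier to optimize than the discontinuous 0-1 loss.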

  26. Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01 Regularized Logistic Regression

  27. Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01 Regularized Logistic Regression

  28. Convex Functions • Convex f: f(λx1 + (1 − λ)x2) ≤ λ f(x1) + (1 − λ) f(x2) • The Hessian ∇²f is always positive semi-definite • The tangent is always a lower bound to f

  29. Gradient Descent • Iteration: x_{n+1} = x_n − η_n ∇f(x_n) • Step size selection: Armijo rule • Stopping criterion: the change in f is minuscule

  30. Gradient Descent – Logistic Regression • L(w, b) = ½ w^T w + Σ_i log(1 + exp(−y_i (b + w^T x_i))) • ∇_w L(w, b) = w − Σ_i p(−y_i|x_i, w, b) y_i x_i • ∇_b L(w, b) = −Σ_i p(−y_i|x_i, w, b) y_i • Beware of numerical issues while coding!
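
A sketch of these gradients on synthetic data, with p(-y_i|x_i, w, b) = 1 / (1 + exp(y_i(b + w^T x_i))); the finite-difference check at the end is one way to guard against the numerical issues the slide warns about:

    import numpy as np

    def objective(w, b, X, y):
        return 0.5 * w @ w + np.sum(np.logaddexp(0.0, -y * (X @ w + b)))

    def gradients(w, b, X, y):
        p_wrong = 1.0 / (1.0 + np.exp(y * (X @ w + b)))   # p(-y_i | x_i, w, b)
        grad_w = w - X.T @ (p_wrong * y)                   # w - sum_i p(-y_i|x_i) y_i x_i
        grad_b = -np.sum(p_wrong * y)                      # -sum_i p(-y_i|x_i) y_i
        return grad_w, grad_b

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 4))
    y = np.where(X @ np.array([1.0, -1.0, 0.5, 0.0]) > 0, 1, -1)
    w, b = rng.normal(size=4), 0.1

    gw, gb = gradients(w, b, X, y)
    eps = 1e-6
    num_gb = (objective(w, b + eps, X, y) - objective(w, b - eps, X, y)) / (2 * eps)
    print(np.isclose(gb, num_gb))   # True: analytic and numerical gradients agree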

  31. Gradient Descent Algorithm • Input: x_0, objective f(x), tolerance e, maximum iterations T • Output: x* that minimizes f(x) • t = 0 • while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) { • g_t = gradient of f(x) at x_t • for (i = 10; i >= -6; i--) { • s = 2^i • x_{t+1} = x_t − s*g_t • if (f(x_{t+1}) < f(x_t)) break; • } • t++; • } • Output x_t

  32. Newton Methods • Iteration: x_{n+1} = x_n − η_n H⁻¹ ∇f(x_n) • Approximate f by a 2nd-order Taylor expansion • The error can now decrease quadratically

  33. Newton Descent Algorithm • Input: x_0, objective f(x), tolerance e, maximum iterations T • Output: x* that minimizes f(x) • t = 0 • while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) { • g_t = gradient of f(x) at x_t • h_t = Hessian matrix of f(x) at x_t • s = inverse matrix of h_t • x_{t+1} = x_t − s*g_t • t++; • } • Output x_t

  34. Quasi-Newton Methods • Computing and inverting the Hessian is expensive • Quasi-Newton methods can approximate H⁻¹ directly (LBFGS) • Iteration: x_{n+1} = x_n − η_n B_n⁻¹ ∇f(x_n) • Secant equation: ∇f(x_{n+1}) − ∇f(x_n) = B_{n+1} (x_{n+1} − x_n) • The secant equation does not fully determine B • LBFGS updates B_{n+1}⁻¹ using two rank-one matrices
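
In practice a library quasi-Newton solver is usually used rather than a hand-rolled update. The sketch below (synthetic data; stacking b into the parameter vector is just a convenience) fits the regularized logistic regression from the earlier slides with scipy's "L-BFGS-B" method:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 5))
    y = np.where(X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0, 1, -1)

    def objective_and_grad(theta):
        w, b = theta[:-1], theta[-1]
        margins = y * (X @ w + b)
        value = 0.5 * w @ w + np.sum(np.logaddexp(0.0, -margins))
        p_wrong = 1.0 / (1.0 + np.exp(margins))
        grad_w = w - X.T @ (p_wrong * y)
        grad_b = -np.sum(p_wrong * y)
        return value, np.append(grad_w, grad_b)

    res = minimize(objective_and_grad, np.zeros(6), jac=True, method="L-BFGS-B")
    print(res.success, res.x)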

  35. Machine Learning Problems from the Probability View

  36. Bayes’ Decision Rule • p(y=+1|x) > p(y=−1|x) ? y = +1 : y = −1 • ⇔ p(y=+1|x) > ½ ? y = +1 : y = −1

  37. Bayesian Approach • p(y|x,X,Y) = ∫ p(y,f|x,X,Y) df • = ∫ p(y|f,x,X,Y) p(f|x,X,Y) df • = ∫ p(y|f,x) p(f|X,Y) df • This integral is often intractable. To solve it we can • Choose the distributions so that the solution is analytic (conjugate priors) • Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods) • Sample from p(f|X,Y) (MCMC)

  38. Maximum A Posteriori (MAP) • p(y|x,X,Y) = ∫ p(y|f,x) p(f|X,Y) df • = p(y|f_MAP, x) when p(f|X,Y) = δ(f − f_MAP) • The more training data there is, the better p(f|X,Y) approximates a delta function • We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP

  39. MAP & Maximum Likelihood (ML) • f_MAP = argmax_f p(f|X,Y) • = argmax_f p(X,Y|f) p(f) / p(X,Y) • = argmax_f p(X,Y|f) p(f) • f_ML = argmax_f p(X,Y|f) (Maximum Likelihood) • Maximum Likelihood holds if • there is a lot of training data, so that p(X,Y|f) >> p(f) • or if there is no prior knowledge, so that p(f) is uniform (improper)

  40. IID Data • f_ML = argmax_f p(X,Y|f) • = argmax_f Π_i p(x_i, y_i|f) • The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels • In particular, p(X,Y) ≠ Π_i p(x_i, y_i)

  41. Discriminative Methods – Logistic Regression
