Classification Yan Pan
Probability Theory
• Non-negativity and unit measure: 0 ≤ p(y), p(Ω) = 1, p(∅) = 0
• Conditional probability: p(y|x); p(x, y) = p(y|x) p(x) = p(x|y) p(y)
• Bayes' Theorem: p(y|x) = p(x|y) p(y) / p(x)
• Marginalization: p(x) = ∫ p(x, y) dy
• Independence: p(x₁, x₂) = p(x₁) p(x₂), equivalently p(x₁|x₂) = p(x₁)
• Chris Bishop, "Pattern Recognition & Machine Learning"
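A minimal numeric sketch of these identities; the probabilities below are made-up values chosen only for illustration.

# Bayes' theorem and marginalization for a binary label y and binary feature x.
p_y = 0.3                      # prior p(y=1) (arbitrary)
p_x_given_y1 = 0.8             # likelihood p(x=1|y=1) (arbitrary)
p_x_given_y0 = 0.1             # likelihood p(x=1|y=0) (arbitrary)

# Marginalization: p(x=1) = sum over y of p(x=1|y) p(y)
p_x = p_x_given_y1 * p_y + p_x_given_y0 * (1 - p_y)

# Bayes' theorem: p(y=1|x=1) = p(x=1|y=1) p(y=1) / p(x=1)
p_y_given_x = p_x_given_y1 * p_y / p_x
print(p_y_given_x)             # ~0.774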
The Univariate Gaussian Density
• p(x|μ,σ) = exp(−(x − μ)² / 2σ²) / (2πσ²)^½
The Multivariate Gaussian Density
• p(x|μ,Σ) = exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ)) / ((2π)^(D/2) |Σ|^½)
The Beta Density
• p(θ|a,b) = θ^(a−1) (1 − θ)^(b−1) Γ(a+b) / (Γ(a) Γ(b))
Probability Distribution Functions
• Bernoulli: single trial with probability of success = θ
• n ∈ {0, 1}, θ ∈ [0, 1]
• p(n|θ) = θⁿ (1 − θ)^(1−n)
• Binomial: N iid Bernoulli trials with n successes
• n ∈ {0, 1, …, N}, θ ∈ [0, 1]
• p(n|N,θ) = C(N,n) θⁿ (1 − θ)^(N−n)
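As a quick sanity check, here is a hedged sketch that evaluates the Gaussian, Beta, and Binomial formulas above directly and compares each against scipy.stats; all parameter values are arbitrary.

import numpy as np
from scipy.stats import norm, beta, binom
from scipy.special import comb, gamma

mu, sigma, x = 0.0, 1.0, 0.5
# Univariate Gaussian: exp(-(x - mu)^2 / 2 sigma^2) / sqrt(2 pi sigma^2)
gauss = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
assert np.isclose(gauss, norm.pdf(x, loc=mu, scale=sigma))

a, b, theta = 2.0, 3.0, 0.4
# Beta: theta^(a-1) (1 - theta)^(b-1) Gamma(a+b) / (Gamma(a) Gamma(b))
beta_pdf = theta**(a - 1) * (1 - theta)**(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))
assert np.isclose(beta_pdf, beta.pdf(theta, a, b))

N, n = 10, 7
# Binomial: C(N,n) theta^n (1 - theta)^(N-n)
binom_pmf = comb(N, n) * theta**n * (1 - theta)**(N - n)
assert np.isclose(binom_pmf, binom.pmf(n, N, theta))
print(gauss, beta_pdf, binom_pmf)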
A Toy Example
• We don't know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.
• We are asked to predict whether the next coin flip will result in a head or a tail.
• Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail.
• We should predict heads if p(y=1|n,N) > p(y=0|n,N)
The Maximum Likelihood Approach
• Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 − θ, so that we should predict heads if θ > ½
• How should we estimate θ?
• Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data
• θ_ML = argmax_θ p(n|θ) = argmax_θ C(N,n) θⁿ (1 − θ)^(N−n)
• = argmax_θ n log(θ) + (N − n) log(1 − θ)
• = n / N
• We should predict heads if n > ½ N
The Maximum A Posteriori Approach
• We should choose the value of θ maximizing the posterior probability of θ conditioned on the data
• We assume a
• Binomial likelihood: p(n|θ) = C(N,n) θⁿ (1 − θ)^(N−n)
• Beta prior: p(θ|a,b) = θ^(a−1) (1 − θ)^(b−1) Γ(a+b) / (Γ(a) Γ(b))
• θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b)
• = argmax_θ θⁿ (1 − θ)^(N−n) θ^(a−1) (1 − θ)^(b−1)
• = (n + a − 1) / (N + a + b − 2), as if we saw an extra a − 1 heads & b − 1 tails
• We should predict heads if n > ½ (N + b − a)
The Bayesian Approach
• We should marginalize over θ
• p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ
• = ∫ θ p(θ|a,b,n) dθ
• = ∫ θ Beta(θ|a + n, b + N − n) dθ
• = (n + a) / (N + a + b), as if we saw an extra a heads & b tails
• We should predict heads if n > ½ (N + b − a)
• The Bayesian and MAP predictions coincide in this case
• In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction (n > ½ N)
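The three coin-flip predictors reduce to one-line estimates, so a short sketch can compare them side by side; n, N, a, b below are arbitrary example values, and the formulas simply restate the closed forms derived above.

# n heads observed in N flips; a, b are the Beta prior's hyperparameters.
n, N = 7, 10
a, b = 2.0, 2.0

theta_ml = n / N                              # maximum likelihood estimate
theta_map = (n + a - 1) / (N + a + b - 2)     # maximum a posteriori estimate
theta_bayes = (n + a) / (N + a + b)           # Bayesian predictive p(y=1|n,a,b)

for name, theta in [("ML", theta_ml), ("MAP", theta_map), ("Bayes", theta_bayes)]:
    print(name, theta, "-> predict heads" if theta > 0.5 else "-> predict tails")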
Approaches to Classification
• Memorization
• Cannot deal with previously unseen data
• Large-scale annotated data acquisition cost might be very high
• Rule-based expert system
• Dependent on the competence of the expert
• Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
• Rules might not transfer to similar problems
• Learning from training data and prior knowledge
• Focuses on generalization to novel data
Notation
• Training Data: set of N labeled examples of the form (xᵢ, yᵢ)
• Feature vector: x ∈ ℝᴰ, X = [x₁ x₂ … x_N]
• Label: y ∈ {±1}, y = [y₁, y₂ … y_N]ᵗ, Y = diag(y)
• Example – Gender Identification: (x₁ = [image], y₁ = +1), (x₂ = [image], y₂ = +1), (x₃ = [image], y₃ = +1), (x₄ = [image], y₄ = −1)
Binary Classification
• Decision boundary: the hyperplane wᵗx + b = 0, with normal w and offset b
• Parameter vector θ = [w; b]
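A minimal sketch of this decision rule; w and b below are picked arbitrarily rather than learned.

import numpy as np

w = np.array([1.0, -2.0])   # arbitrary weight vector
b = 0.5                     # arbitrary offset

def predict(x):
    # Points with w'x + b > 0 get label +1, the rest -1.
    return 1 if w @ x + b > 0 else -1

print(predict(np.array([3.0, 1.0])))   # lands on the +1 side of the hyperplane
print(predict(np.array([0.0, 2.0])))   # lands on the -1 side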
Machine Learning from the Optimization View
• Before we go into the details of classification and regression methods, we should take a close look at the objective functions of machine learning
• Machine Learning: finding regularities from data (choosing the best among many candidate hypotheses). What is the selection criterion?
• Apply each candidate to the training data, measure its prediction error rate, and pick the candidate with the fewest prediction errors.
Common Form of Supervised Learning Problems
• Minimize the following objective function:
• Regularization term + Loss function
• Regularization term: controls the model complexity and avoids overfitting
• Loss function: measures the quality of the learned function, i.e., the prediction error on the training data
Ex.1 Linear Regression
• E(w) = ½ Σₙ (yₙ − wᵗxₙ)² + ½ wᵗw
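A small NumPy sketch of this objective on synthetic data, using the notation slide's column-per-example convention X = [x₁ … x_N]; setting the gradient to zero gives the closed form used below.

import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 20
X = rng.normal(size=(D, N))           # one column per example
y = rng.normal(size=N)

def E(w):
    r = y - w @ X                     # residuals y_n - w'x_n
    return 0.5 * r @ r + 0.5 * w @ w

# grad E = (X X' + I) w - X y, so the minimizer is w* = (X X' + I)^{-1} X y
w_star = np.linalg.solve(X @ X.T + np.eye(D), X @ y)
print(E(w_star), "<=", E(np.zeros(D)))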
Ex.2 Logistic Regression (a classification method)
• L(w, b) = ½ wᵗw + Σᵢ log(1 + exp(−yᵢ(b + wᵗxᵢ)))
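A sketch evaluating this objective on synthetic data; np.logaddexp is used for log(1 + exp(·)) to avoid the overflow issues mentioned on a later slide.

import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 20
X = rng.normal(size=(D, N))
y = rng.choice([-1.0, 1.0], size=N)

def logistic_objective(w, b):
    margins = y * (b + w @ X)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed without overflow
    return 0.5 * w @ w + np.sum(np.logaddexp(0.0, -margins))

print(logistic_objective(np.zeros(D), 0.0))   # equals N * log(2) at w = 0, b = 0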
Ex.3 SVM
• E(w) = ½ wᵗw + Σᵢ max(0, 1 − yᵢwᵗxᵢ)
• Or
• E(w) = ½ wᵗw + Σᵢ max(0, 1 − yᵢwᵗxᵢ)²
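A similar sketch for the two SVM objectives (hinge and squared hinge), again on synthetic data with an arbitrary w.

import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 20
X = rng.normal(size=(D, N))
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=D)

def svm_objective(w, squared=False):
    hinge = np.maximum(0.0, 1.0 - y * (w @ X))   # max(0, 1 - y_i w'x_i)
    return 0.5 * w @ w + np.sum(hinge**2 if squared else hinge)

print(svm_objective(w), svm_objective(w, squared=True))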
How to measure error?
• True label: yᵢ; predicted value: wᵗxᵢ
• The more alike the better. Require them to be equal?
• Zero-one indicator: I(yᵢ ≠ wᵗxᵢ)
• Squared error: (yᵢ − wᵗxᵢ)²
• If values are restricted to [−1, 1]: make the product as large as possible, i.e. yᵢ wᵗxᵢ
Approximate the Zero-One Loss • Squared Error • Exponential Loss • Logistic Loss • Hinge Loss • Sigmoid Loss
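The sketch below writes each of these losses as a function of the margin m = y·f(x); the exact scalings differ across textbooks, so the forms here are just one common choice.

import numpy as np

def zero_one(m):     return (m <= 0).astype(float)
def squared(m):      return (1.0 - m)**2
def exponential(m):  return np.exp(-m)
def logistic(m):     return np.logaddexp(0.0, -m) / np.log(2.0)  # scaled so it equals 1 at m = 0
def hinge(m):        return np.maximum(0.0, 1.0 - m)
def sigmoid_loss(m): return 1.0 / (1.0 + np.exp(m))              # smooth but non-convex

m = np.linspace(-2.0, 2.0, 5)                                    # margins m = y * f(x)
for loss in (zero_one, squared, exponential, logistic, hinge, sigmoid_loss):
    print(loss.__name__, np.round(loss(m), 3))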
Zhu & Hastie, “KLR and the Import Vector Machine”, NIPS 01 Regularized Logistic Regression
Convex Functions
• Convex f: f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂)
• The Hessian ∇²f is always positive semi-definite
• The tangent is always a lower bound to f
Gradient Descent
• Iteration: xₙ₊₁ = xₙ − ηₙ∇f(xₙ)
• Step size selection: Armijo rule
• Stopping criterion: change in f is minuscule
Gradient Descent – Logistic Regression
• L(w, b) = ½ wᵗw + Σᵢ log(1 + exp(−yᵢ(b + wᵗxᵢ)))
• ∇w L(w, b) = w − Σᵢ p(−yᵢ|xᵢ,w) yᵢ xᵢ
• ∇b L(w, b) = −Σᵢ p(−yᵢ|xᵢ,w) yᵢ
• Beware of numerical issues while coding!
Gradient Descent Algorithm
• Input: x_0, objective f(x), tolerance e, max iterations T
• Output: x_star that minimizes f(x)
• t = 0
• while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) {
•   g_t = gradient of f(x) at x_t
•   for (i = 10; i >= -6; i--) {
•     s = 2^i
•     x_{t+1} = x_t − s * g_t
•     if (f(x_{t+1}) < f(x_t))
•       break;
•   }
•   t++;
• }
• Output x_t
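A NumPy rendering of the pseudocode above, applied to the regularized logistic regression objective and gradients from the previous slides; the data are synthetic and the step-size search mirrors the 2^i loop. This is a sketch, not a reference implementation.

import numpy as np
from scipy.special import expit            # numerically stable sigmoid

rng = np.random.default_rng(0)
D, N = 3, 100
X = rng.normal(size=(D, N))
y = np.where(rng.normal(size=D) @ X + 0.1 * rng.normal(size=N) > 0, 1.0, -1.0)

def objective(theta):
    w, b = theta[:-1], theta[-1]
    return 0.5 * w @ w + np.sum(np.logaddexp(0.0, -y * (b + w @ X)))

def gradient(theta):
    w, b = theta[:-1], theta[-1]
    p_wrong = expit(-y * (b + w @ X))      # p(-y_i | x_i, w, b)
    return np.append(w - X @ (p_wrong * y), -np.sum(p_wrong * y))

theta = np.zeros(D + 1)
for t in range(1000):
    g = gradient(theta)
    for i in range(10, -7, -1):            # step sizes 2^10 down to 2^-6, as in the pseudocode
        candidate = theta - 2.0**i * g
        if objective(candidate) < objective(theta):
            theta = candidate
            break
    else:
        break                              # no step size improved the objective
    if np.linalg.norm(g) < 1e-6:
        break
print(objective(theta), np.linalg.norm(gradient(theta)))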
Newton Methods
• Iteration: xₙ₊₁ = xₙ − ηₙ H⁻¹∇f(xₙ)
• Approximate f by a 2nd-order Taylor expansion
• The error can now decrease quadratically
Newton Descent Algorithm
• Input: x_0, objective f(x), tolerance e, max iterations T
• Output: x_star that minimizes f(x)
• t = 0
• while (t == 0 || (f(x_{t-1}) − f(x_t) > e && t < T)) {
•   g_t = gradient of f(x) at x_t
•   h_t = Hessian matrix of f(x) at x_t
•   s = inverse matrix of h_t
•   x_{t+1} = x_t − s * g_t
•   t++;
• }
• Output x_t
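The same idea in NumPy for regularized logistic regression; the Hessian expression is the standard one for the logistic loss (Σᵢ p(−yᵢ|xᵢ)(1 − p(−yᵢ|xᵢ)) xᵢxᵢᵗ plus the regularizer), the bias is folded into w via a constant feature, and the data are synthetic.

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
D, N = 3, 100
X = np.vstack([rng.normal(size=(D, N)), np.ones((1, N))])  # constant last feature plays the role of b
y = np.where(rng.normal(size=D + 1) @ X > 0, 1.0, -1.0)
reg = np.eye(D + 1)
reg[-1, -1] = 0.0                                          # do not regularize the bias term

def grad_and_hessian(w):
    p_wrong = expit(-y * (w @ X))                          # p(-y_i | x_i, w)
    g = reg @ w - X @ (p_wrong * y)
    H = reg + (X * (p_wrong * (1.0 - p_wrong))) @ X.T      # sum_i p(1-p) x_i x_i' + regularizer
    return g, H

w = np.zeros(D + 1)
for t in range(20):
    g, H = grad_and_hessian(w)
    w = w - np.linalg.solve(H, g)                          # Newton step: subtract H^{-1} grad
    if np.linalg.norm(g) < 1e-8:
        break
print(w)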
Quasi-Newton Methods
• Computing and inverting the Hessian is expensive
• Quasi-Newton methods can approximate H⁻¹ directly (L-BFGS)
• Iteration: xₙ₊₁ = xₙ − ηₙ Bₙ⁻¹∇f(xₙ)
• Secant equation: ∇f(xₙ₊₁) − ∇f(xₙ) = Bₙ₊₁(xₙ₊₁ − xₙ)
• The secant equation does not fully determine B
• L-BFGS updates Bₙ₊₁⁻¹ using two rank-one matrices
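In practice one rarely implements the Bₙ updates by hand; the sketch below simply hands the same regularized logistic objective and gradient (synthetic data again) to SciPy's off-the-shelf L-BFGS-B solver.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
D, N = 3, 100
X = rng.normal(size=(D, N))
y = np.where(rng.normal(size=D) @ X > 0, 1.0, -1.0)

def objective_and_grad(w):
    margins = y * (w @ X)
    value = 0.5 * w @ w + np.sum(np.logaddexp(0.0, -margins))
    grad = w - X @ (expit(-margins) * y)
    return value, grad

# jac=True tells SciPy that the function returns (objective, gradient) together.
result = minimize(objective_and_grad, np.zeros(D), jac=True, method="L-BFGS-B")
print(result.x, result.fun)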
Bayes' Decision Rule
• p(y=+1|x) > p(y=−1|x) ? y = +1 : y = −1
• p(y=+1|x) > ½ ? y = +1 : y = −1
Bayesian Approach
• p(y|x,X,Y) = ∫ p(y,f|x,X,Y) df
• = ∫ p(y|f,x,X,Y) p(f|x,X,Y) df
• = ∫ p(y|f,x) p(f|X,Y) df
• This integral is often intractable. To solve it we can:
• Choose the distributions so that the solution is analytic (conjugate priors)
• Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods)
• Sample from p(f|X,Y) (MCMC)
Maximum A Posteriori (MAP)
• p(y|x,X,Y) = ∫ p(y|f,x) p(f|X,Y) df
• = p(y|f_MAP, x) when p(f|X,Y) = δ(f − f_MAP)
• The more training data there is, the better p(f|X,Y) approximates a delta function
• We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP
MAP & Maximum Likelihood (ML)
• f_MAP = argmax_f p(f|X,Y)
• = argmax_f p(X,Y|f) p(f) / p(X,Y)
• = argmax_f p(X,Y|f) p(f)
• f_ML = argmax_f p(X,Y|f) (Maximum Likelihood)
• Maximum Likelihood holds if
• there is a lot of training data, so that p(X,Y|f) >> p(f)
• or if there is no prior knowledge, so that p(f) is uniform (improper)
IID Data
• f_ML = argmax_f p(X,Y|f) = argmax_f Πᵢ p(xᵢ,yᵢ|f)
• The independent and identically distributed assumption holds only if we know everything about the joint distribution of the features and labels
• In particular, p(X,Y) ≠ Πᵢ p(xᵢ,yᵢ)