ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least-Squares Regression
What have we done so far? The slide shows an overview grid of methods, organized by supervised vs. unsupervised and parametric vs. non-parametric:
• unsupervised: density estimation via Parzen windowing (non-parametric); future example: k-means
• supervised classification: kNN (non-parametric); future example: logistic regression (parametric)
• supervised regression: today: least squares (parametric)
Problem: number of manatee kills versus number of boats
Goal
• Given data, find a linear relationship between the inputs and outputs.
• In 1 dimension we have data {Xn, Yn} (blue dots in the figure).
• We imagine vertical springs (red lines) between the data points and a stiff rod (the line); they can slide along the rod so they remain vertical.
• The springs have rest length 0, so they compete to pull the rod towards their data points.
• The relaxed (minimum-energy) solution is the fit we are after.
Cost Function
We measure the total squared length of all the springs. We can now take derivatives with respect to a and b and set them to zero. After some algebra (done on the whiteboard) we find the closed-form solution.
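In symbols, with data {(x_n, y_n)}, n = 1, ..., N, the cost and its minimizer are as follows (a standard reconstruction consistent with the Matlab function later in the lecture; the slide showed these as images):

E(a, b) = \sum_{n=1}^{N} \left( y_n - a x_n - b \right)^2

\frac{\partial E}{\partial a} = 0,\;\; \frac{\partial E}{\partial b} = 0
\;\Rightarrow\;
a = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  = \frac{\tfrac{1}{N}\sum_n x_n y_n - \bar{x}\,\bar{y}}{\tfrac{1}{N}\sum_n x_n^2 - \bar{x}^2},
\qquad
b = \bar{y} - a\,\bar{x}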
More Variables
• More generally, we want Dx input variables and Dy output variables.
• The cost is now a sum of squared residual vectors, as sketched below.
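A sketch of the multivariate cost and its minimizer (reconstructed to match the Matlab function below), with x_n a Dx-vector, y_n a Dy-vector, A a Dy-by-Dx matrix and b a Dy-vector:

E(A, b) = \sum_{n=1}^{N} \left\| y_n - A x_n - b \right\|^2

A = \operatorname{Cov}(Y, X)\,\operatorname{Cov}(X, X)^{-1},
\qquad
b = \bar{y} - A\,\bar{x}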
In Matlab

function [A, b] = LSRegression(X, Y)
% Least-squares regression. X is Dx-by-N (inputs), Y is Dy-by-N (outputs),
% one data point per column.
[D, N] = size(X);
EX = sum(X, 2) / N;            % mean of the inputs
CovX = X*X'/N - EX*EX';        % input covariance
EY = sum(Y, 2) / N;            % mean of the outputs
CovXY = Y*X'/N - EY*EX';       % cross-covariance between outputs and inputs
A = CovXY * inv(CovX);         % A = Cov(Y,X) Cov(X)^(-1); CovXY / CovX is numerically preferable
b = EY - A*EX;                 % offset so the fit passes through the means
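A minimal usage sketch on synthetic data (A_true, b_true and the noise level are made up for illustration, not part of the lecture):

% 2 inputs, 1 output, N = 1000 data points
N = 1000;
X = randn(2, N);                              % inputs, one column per point
A_true = [1.5 -0.7];                          % hypothetical true weights
b_true = 2.0;                                 % hypothetical true offset
Y = A_true * X + b_true + 0.1 * randn(1, N);  % noisy linear outputs
[A, b] = LSRegression(X, Y);                  % A, b should come out close to A_true, b_true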
Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution P(Y|X).
• We write the output as the linear prediction plus a residual error d, the vector pointing vertically from the line to the data point.
• d is a random vector, and we may assume it has a Gaussian distribution.
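A sketch of this model, assuming an isotropic Gaussian residual with variance sigma^2 (the slide's own equation did not survive extraction):

y_n = A x_n + b + d_n,
\qquad d_n \sim \mathcal{N}(0, \sigma^2 I)
\;\Rightarrow\;
P(y_n \mid x_n) = \mathcal{N}\!\left(y_n \mid A x_n + b,\; \sigma^2 I\right)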
Statistical Interpretation
• We can now maximize the probability of the data under the model by adapting the parameters A, b.
• If we use the negative log-probability we recover the least-squares cost. Look familiar?
• We can also optimize for the noise variance (it won't affect A, b).
• This is called "maximum likelihood learning".
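Under the Gaussian assumption above, the negative log-probability and the resulting maximum-likelihood noise variance work out to the following (a reconstruction; the slide showed these as images):

-\log P(\{y_n\} \mid \{x_n\})
= \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left\| y_n - A x_n - b \right\|^2
+ \frac{N D_y}{2} \log\!\left(2\pi\sigma^2\right)

\sigma^2_{\mathrm{ML}} = \frac{1}{N D_y} \sum_{n=1}^{N} \left\| y_n - A x_n - b \right\|^2
\quad\text{(optimizing over } \sigma^2 \text{ leaves } A, b \text{ unchanged)}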