
Linear Regression & Classification




Presentation Transcript


  1. Linear Regression & Classification

    Prof. Navneet Goyal, CS & IS, BITS Pilani
  2. Fundamentals of Modeling A model is an abstract representation of a real-world process. Y = 3X + 2 is a very simple model of how variable Y might relate to variable X, and an instance of the more general model structure Y = aX + b, where a and b are parameters. θ is generally used to denote a generic parameter or a set (or vector) of parameters, here θ = {a, b}. Values of the parameters are chosen by estimation – that is, by minimizing or maximizing an appropriate score function measuring the fit of the model to the data. Before we can estimate the parameters, we must choose an appropriate functional form of the model itself.
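As a minimal sketch (not part of the slides), the parameters θ = {a, b} of Y = aX + b can be estimated by minimizing a squared-error score function; NumPy's least-squares solver does this directly on illustrative data generated here for the example:

    import numpy as np

    # Illustrative data from the "true" process Y = 3X + 2 plus noise (made up for this sketch)
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 3 * x + 2 + rng.normal(scale=1.0, size=50)

    # Least-squares estimation: minimize the squared-error score over a and b
    A = np.column_stack([x, np.ones_like(x)])        # design matrix for y ≈ a*x + b
    (a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(a_hat, b_hat)                              # estimates close to a = 3, b = 2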
  3. Fundamentals of Modeling Predictive modeling (PM) can be thought of as learning a mapping from an input set of vector measurements x to a scalar output y (a vector output is also possible but rarely used in practice). One of the variables is expressed as a function of the others (the predictor variables): response variable Y and predictor variables Xi, ŷ = f(x1, x2, …, xp; θ). When Y is quantitative, the task of estimating a mapping from the p-dimensional X to Y is called regression. When Y is categorical, the task of learning a mapping from X to Y is called classification learning or supervised classification.
  4. Predictive Modeling Predictive modeling Predicts the value of some target characteristic of an object on the basis of observed values of other characteristics of the object Examples: Regression (Prediction in DM) & Classification
  5. Predictive Modeling Prediction: linear regression, nonlinear regression Classification (supervised learning): decision trees, k-NN, SVM, ANN
  6. Definition of Regression Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. Examples: Sales of a product can be predicted by using the relationship between sales volume and amount of advertising The performance of an employee can be predicted by using the relationship between performance and aptitude tests The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.
  7. Regression Problem Visualisation [figure: scatter of observed values y and fitted values ŷ plotted against x; rmse = s]
  8. Structure of a Linear Regression Model Given a set of features x, a linear predictor has the form ŷ = w0 + w1x1 + … + wpxp. The output is a real-valued, quantitative variable.
  9. Classification Problem Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C, where each ti is assigned to one class. Prediction is similar, but may be viewed as having an infinite number of classes.
  10. Classification Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups A linear classifier has a linear decision boundary The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable
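As a minimal sketch (not from the slides), the perceptron training rule for such a linearly separable data set could look like this, assuming labels in {-1, +1}:

    import numpy as np

    def perceptron_train(X, y, lr=1.0, max_epochs=1000):
        """Learn a linear decision boundary w.x + b = 0; labels y must be in {-1, +1}."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(max_epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:    # misclassified point
                    w += lr * yi * xi                # nudge the boundary toward the point
                    b += lr * yi
                    errors += 1
            if errors == 0:                          # converged (guaranteed if separable)
                break
        return w, b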
  11. What is Classification? Classification is also known as (statistical) pattern recognition The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task. Example: face recognition [diagram: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x), which outputs a predicted class label ŷ for a new pattern x]
  12. Classification: Applications Spam mail IDS (rare event classification) Credit rating Medical diagnosis Categorizing cells as malignant or benign based on MRI scans Classifying galaxies based on their shapes Predicting preterm births Crop yield prediction Identifying mushrooms as poisonous or edible …
  13. Classification: Applications Example: Credit Card Company Every purchase is placed in 1 of 4 classes Authorize Ask for further identification before authorizing Do not authorize Do not authorize but contact police Two functions of Data Mining Examine historical data to determine how the data fit into 4 classes Apply the model to each new purchase
  14. Classification: 3 phase job Model building phase (learning phase) Testing phase Model usage phase
  15. Distance-based Classification: Nearest Neighbors Compute the distance from the test record to the training records and choose the k "nearest" records If it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck
  16. Definition of Nearest Neighbor The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
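A minimal k-NN classifier sketch (not from the slides), assuming NumPy and Euclidean distance:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        """Classify x by majority vote among its k nearest training records."""
        dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distance to every record
        nearest = np.argsort(dists)[:k]                 # indices of the k smallest distances
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]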
  17. Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data
  18. Support Vector Machines One possible solution
  19. Support Vector Machines Another possible solution
  20. Support Vector Machines Other possible solutions
  21. Support Vector Machines Which one is better? B1 or B2? How do you define better?
  22. Support Vector Machines Find a hyperplane that maximizes the margin => B1 is better than B2
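As an illustrative sketch (not from the slides), a maximum-margin linear boundary can be fit with scikit-learn's SVC, assuming a linearly separable two-class data set generated here for the example:

    import numpy as np
    from sklearn.svm import SVC

    # Two well-separated clusters (illustrative data)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=1e6)    # large C approximates the hard-margin SVM
    clf.fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    margin = 2 / np.linalg.norm(w)       # width of the maximized margin
    print(w, b, margin)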
  23. Support Vector Machines What if the problem is not linearly separable?
  24. Nonlinear Support Vector Machines What if decision boundary is not linear?
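One common remedy, sketched here with scikit-learn (my own illustration, not stated on the slide), is a nonlinear kernel such as the RBF kernel, so that the boundary is linear only in an implicit feature space:

    import numpy as np
    from sklearn.svm import SVC

    # Data no straight line can separate: one class inside a ring of the other
    rng = np.random.default_rng(1)
    angles = rng.uniform(0, 2 * np.pi, 100)
    inner = rng.normal(0, 0.3, (100, 2))
    outer = np.column_stack([2 * np.cos(angles), 2 * np.sin(angles)]) + rng.normal(0, 0.2, (100, 2))
    X = np.vstack([inner, outer])
    y = np.array([0] * 100 + [1] * 100)

    clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # nonlinear decision boundary
    print(clf.score(X, y))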
  25. Support Vector Machines Solid line is preferred Geometrically we can characterize the solid plane as being "furthest" from both classes How can we construct the plane "furthest" from both classes?
  26. Support Vector Machines Examine the convex hull of each class’ training data (indicated by dotted lines) and then find the closest points in the two convex hulls (circles labeled d and c). The convex hull of a set of points is the smallest convex set containing the points. If we construct the plane that bisects these two points (w=d-c), the resulting classifier should be robust in some sense. Figure – Best plane bisects closest points in the convex hulls
  27. Convex Sets [figure: a convex set vs. a non-convex or concave set] A function is convex if and only if the region above its graph is a convex set.
  28. Convex Hulls Convex hull: elastic band analogy For planar objects, i.e., lying in the plane, the convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given object; when released, it will assume the shape of the required convex hull.
  29. Disadvantages of Linear Decision Surfaces [figure: two-class data plotted against Var1 and Var2]
  30. Advantages of Non-Linear Surfaces [figure: two-class data plotted against Var1 and Var2]
  31. Linear Classifiers in High-Dimensional Spaces Find a function Φ(x) to map the data to a different space [figure: the original Var1–Var2 space and the transformed space of Constructed Feature 1 and Constructed Feature 2]
  32. Handwriting Recognition Task T recognizing and classifying handwritten words within images Performance measure P percent of words correctly classified Training experience E a database of handwritten words with given classifications
  33. Handwriting Recognition
  34. Pattern Recognition Example Handwriting Digit Recognition
  35. Pattern Recognition Example Handwriting Digit Recognition Each digit is represented by a 28x28 pixel image, i.e., by a vector of 784 real numbers Objective: an algorithm that takes such a vector as input and identifies the digit it represents A non-trivial problem due to the variability in handwriting Take images of a large number N of digits – the training set Use the training set to tune the parameters of an adaptive model Each digit in the training set has been identified by a target vector t, which represents the identity of the corresponding digit The result of running a machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and outputs a vector y, encoded in the same way as t The form of y(x) is determined during the learning (training) phase
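A compact sketch of this pipeline (not from the slides), using scikit-learn's built-in digits data set – note these images are 8x8 (64 pixels) rather than the 28x28 MNIST images described above:

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()                       # each image flattened to a 64-dim vector x
    X_train, X_test, t_train, t_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=5000)    # parameters tuned on the training set
    model.fit(X_train, t_train)                  # learning (training) phase determines y(x)
    print(model.score(X_test, t_test))           # accuracy on previously unseen digits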
  36. Pattern Recognition Example Generalization The ability to correctly categorize new examples that differ from those used in training Generalization is a central goal in pattern recognition Preprocessing Input variables are preprocessed to transform them into some new space of variables where it is hoped that the problem will be easier to solve Images of digits are translated and scaled so that each digit is contained within a box of fixed size, which reduces variability The preprocessing stage is referred to as feature extraction New test data must be preprocessed using the same steps as the training data
  37. Pattern Recognition Example Preprocessing can also speed up computation For example: face detection in a high-resolution video stream Find useful features that are fast to compute and yet preserve useful discriminatory information, enabling faces to be distinguished from non-faces The average value of the image intensity in a rectangular sub-region can be evaluated extremely efficiently, and a set of such features is very effective in fast face detection Because such features are fewer in number than the pixels, this is referred to as a form of dimensionality reduction Care must be taken so that important information is not discarded during preprocessing
  38. Pattern Recognition Example Supervised & unsupervised learning If the training data consist of both input vectors and target vectors – supervised learning Digit recognition problem – classification Predicting crop yield – regression If the training data consist of only input vectors – unsupervised learning Discover groups of similar examples within the data – clustering Find the distribution of the data within the input space – density estimation Project the data from a high-dimensional space to a 2- or 3-dimensional space for the purpose of visualization
  39. Reinforcement Learning The problem of finding suitable actions to take in a given situation in order to maximize a reward
  40. Polynomial Curve Fitting Observe a real-valued input variable x • Use x to predict the value of a target variable t • Synthetic data generated from sin(2πx) • Random noise added to the target values [figure: target variable t plotted against input variable x]
  41. Polynomial Curve Fitting N observations of x: x = (x1, …, xN)T with targets t = (t1, …, tN)T • Goal is to exploit the training set to predict the value of t for a new value of x • Inherently a difficult problem Data generation: N = 10 points spaced uniformly in the range [0,1], generated from sin(2πx) by adding small Gaussian noise; such noise is typical of unobserved variables [figure: target variable t plotted against input variable x]
  42. Polynomial Curve Fitting Fit the polynomial y(x, w) = w0 + w1x + w2x^2 + … + wMx^M • where M is the order of the polynomial • Is a higher value of M better? We'll see shortly! • Coefficients w0, …, wM are denoted by the vector w • Nonlinear function of x, but a linear function of the coefficients w • Such functions are called Linear Models
  43. Sum-of-Squares Error Function
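The slide's formula is not captured in this transcript; in the curve-fitting setup above, the sum-of-squares error minimized during fitting is usually written as

    E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2

where the sum runs over the N training points and the factor 1/2 is a convention that simplifies the gradient.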
  44. Polynomial curve fitting
  45. Polynomial curve fitting The choice of M is called model selection or model comparison
  46. 0th Order Polynomial Poor representations of sin(2πx)
  47. 1st Order Polynomial Poor representations of sin(2πx)
  48. 3rd Order Polynomial Best Fit to sin(2πx)
  49. 9th Order Polynomial Over Fit: Poor representation of sin(2πx)
  50. Polynomial Curve Fitting Good generalization is the objective How does generalization performance depend on M? Consider a separate test data set of 100 points Calculate E(w*) for both the training data and the test data, and choose the M which minimizes E(w*) on the test data Root Mean Square (RMS) error: ERMS = √(2E(w*)/N) is sometimes convenient to use, since division by N allows us to compare data sets of different sizes on an equal footing, and the square root ensures ERMS is measured on the same scale (and in the same units) as the target variable t
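A sketch of this experiment (the data sizes and noise level are my own choices, not the slide's):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n):
        x = np.linspace(0, 1, n)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
        return x, t

    x_train, t_train = make_data(10)     # small training set
    x_test, t_test = make_data(100)      # larger test set for generalization error

    def rms(y, t):
        return np.sqrt(np.mean((y - t) ** 2))   # equivalent to sqrt(2 E(w*) / N)

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_train, t_train, deg=M)            # least-squares coefficients w*
        train_rms = rms(np.polyval(w, x_train), t_train)
        test_rms = rms(np.polyval(w, x_test), t_test)
        print(M, round(train_rms, 3), round(test_rms, 3))  # M = 9 overfits: tiny train error, large test error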
  51. Over-fitting For small M (0, 1, 2) the polynomial is too inflexible to capture the oscillations of sin(2πx) For M = 3–8 it is flexible enough to capture the oscillations of sin(2πx) For M = 9 it is too flexible: the training error (TE) is 0 but the generalization error (GE) is high Why is this happening?
  52. Polynomial Coefficients
  53. Data Set Size: M = 9 - The larger the data set, the more complex the model we can afford to fit to the data - The number of data points should be no less than 5–10 times the number of adaptive parameters in the model
  54. Over-fitting Problem Should we limit the number of parameters according to the size of the available training set? The complexity of the model should depend only on the complexity of the problem! Least-squares estimation (LSE) is a specific case of maximum likelihood, and over-fitting is a general property of maximum likelihood The over-fitting problem can be avoided using the Bayesian approach!
  55. Regularization Penalize large coefficient values
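The regularized error function on this slide did not survive in the transcript; the quadratic-penalty form usually paired with this example is

    \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2

where λ controls how strongly large coefficient values are penalized (this form is also known as ridge regression or weight decay).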
  56. Regularization: [figure: fit for one value of the regularization parameter λ]
  57. Regularization: [figure: fit for another value of λ]
  58. Regularization: [figure comparing the two settings of λ]
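A small sketch of the effect of λ, reusing the polynomial setup above (the parameter values are my own, not the slide's):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

    X = PolynomialFeatures(degree=9).fit_transform(x.reshape(-1, 1))   # 9th-order polynomial features

    for lam in (1e-8, 1e-3, 1.0):
        model = Ridge(alpha=lam).fit(X, t)        # alpha plays the role of λ
        print(lam, np.abs(model.coef_).max())     # larger λ shrinks the coefficient magnitudes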
  59. Polynomial Coefficients
  60. Linear Models for Regression The role of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables We have already discussed polynomial curve fitting for regression; a polynomial is a specific example of a broad class of functions called Linear Regression Models – functions which are linear in the adjustable parameters The simplest linear regression models are also linear functions of the input variables A more useful class of functions is obtained by taking a linear combination of a fixed set of nonlinear functions of the input variables, known as basis functions: such models remain linear in the parameters but are nonlinear with respect to the input variables
  61. Linear Models for Regression Linear models have significant limitations as practical techniques for ML, particularly for problems involving high dimensionality Linear models possess nice analytical properties and form the foundation for more sophisticated models
  62. Linear Basis Function Models Simplest linear model for regression with d input variables: y(x, w) = w0 + w1x1 + … + wdxd, where x1, …, xd are the input variables Compare with linear regression in one variable and with polynomial regression in one variable Linear in both the parameters and the input variables, which is a significant limitation since the model is a linear function of the input variables In the 1-D case this is a straight-line fit
  63. Linear Basis Function Models
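The general form shown on this slide is missing from the transcript; the linear basis function model is normally written as

    y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})

where the φj(x) are the basis functions and a dummy basis function φ0(x) = 1 absorbs the bias term w0.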
  64. Linear Basis Function Models Polynomial regression is a particular example of this model!! How?? With a single input variable x, the polynomial basis is φj(x) = x^j Limitation of polynomial basis functions? They are global: • changes in one region of input space affect the others We can divide the input space into regions • use a different polynomial in each region • equivalent to spline functions
  65. Linear Basis Function Models Polynomial basis functions: φj(x) = x^j These are global; a small change in x affects all basis functions.
  66. Linear Basis Function Models (4) Gaussian basis functions: φj(x) = exp(−(x − μj)^2 / (2s^2)) These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).
  67. Linear Basis Function Models (5) Sigmoidal basis functions: φj(x) = σ((x − μj)/s), where σ(a) = 1/(1 + e^(−a)) These are also local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).
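A short sketch (with illustrative values, not the slide's) of least-squares fitting with a Gaussian basis:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)

    mus = np.linspace(0, 1, 9)    # basis function locations μj
    s = 0.15                      # common width

    def design_matrix(x):
        """Column of ones for w0 plus one Gaussian basis function per μj."""
        gauss = np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))
        return np.column_stack([np.ones_like(x), gauss])

    Phi = design_matrix(x)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares weights
    print(w.shape)                                 # (10,) – w0 plus 9 basis weights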
  68. Home Work Read about Gaussian, Sigmoidal, & Fourier basis functions Sequential Learning & Online algorithms Will discuss in the next class!
  69. The Bias-Variance Decomposition The bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model In the projectile analogy: Bias = the average distance between the target and the location where the projectile hits the ground (depends on the angle) Variance = the deviation between x and the average position where the projectile hits the ground (depends on the force) Noise: if the target is not stationary, then the observed distance is also affected by changes in the location of the target
  70. The Bias-Variance Decomposition Low degree polynomial has high bias (fits poorly) but has low variance with different data sets High degree polynomial has low bias (fits well) but has high variance with different data sets Interactive demo @: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
  71. The Bias-Variance Decomposition True height of Chinese emperor: 200cm, about 6’6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong they are, on average
  72. The Bias-Variance Decomposition Each scenario has an expected value of 180 (i.e., a bias error of 20), but increasing variance in the estimate Expected squared error = (bias error)^2 + variance As the variance increases, the error increases
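Written out, the decomposition behind the emperor example is

    \mathbb{E}\big[(\hat{y} - y)^2\big] = \big(\mathbb{E}[\hat{y}] - y\big)^2 + \mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big] = \text{bias}^2 + \text{variance}

with an additional irreducible noise term when the target itself is observed with noise (the non-stationary-target case mentioned earlier). Here the bias is 20, so the expected squared error is 400 plus the variance of the answers.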
  73. Effect of the regularization parameter on the bias and variance terms [figure: panels contrasting a high-variance, low-bias setting with a low-variance, high-bias setting]
  74. An example of the bias-variance trade-off
  75. Beating the bias-variance trade-off We can reduce the variance term by averaging lots of models trained on different datasets. This seems silly: if we had lots of different datasets it would be better to combine them into one big training set, since with more training data there will be much less variance. Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called "bagging" and it works surprisingly well. But if we have enough computation it's better to do the right Bayesian thing: combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
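A minimal bagging sketch (my own example, not from the slides): bootstrap-resample the training set, fit one model per resample, and average the predictions to reduce variance:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=50)
    X = x.reshape(-1, 1)

    def bagged_predict(X_train, t_train, X_query, n_models=50):
        """Average the predictions of models trained on bootstrap resamples."""
        preds = []
        for _ in range(n_models):
            idx = rng.integers(0, len(t_train), size=len(t_train))    # bootstrap sample
            model = DecisionTreeRegressor().fit(X_train[idx], t_train[idx])
            preds.append(model.predict(X_query))
        return np.mean(preds, axis=0)   # averaging reduces the variance term

    print(bagged_predict(X, t, X[:5]))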