
Supervised Learning


Presentation Transcript


  1. Supervised Learning Linear Models and Gradient Descent Algorithm

  2. SUPERVISED LEARNING* *https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

  3. SUPERVISED LEARNING • Supervised learning problems can be further grouped into regression and classification problems. • Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”. • Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”. • Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively. • Some popular examples of supervised machine learning algorithms are: • Linear regression for regression problems. • Random forest for classification and regression problems. • Support vector machines for classification problems.

  4. SUPERVISED LEARNING • A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regression function (if the output is continuous). • The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

  5. SUPERVISED LEARNING Prediction Accuracy • A good learner is one with good prediction accuracy; in other words, one with the smallest prediction error. • Consider the simple case of fitting a linear regression model to the observed data. A model is a good fit if it provides a high R² value. However, note that the model has used all the observed data and only the observed data. Hence, how it will perform when predicting for a new set of input values is not clear. The assumption is that, with a high R² value, the model is expected to predict well for data observed in the future. • Suppose now the model is more complex than a linear model, and a spline smoother or a polynomial regression needs to be considered. What would be the proper complexity of the model? Would it take a fifth-degree polynomial, or would a cubic spline suffice? • Many modern classification and regression models are highly adaptable and capable of formulating complex relationships. At the same time, they may overemphasize patterns that are not reproducible. Without a methodological approach to evaluating models, the problem will not be detected until the next set of samples is predicted. And this can happen even when the quality of the sample used to develop the model is perfectly good!
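To make the point concrete, here is a minimal R sketch (not from the original slides; the mtcars variables are an illustrative choice): fit a linear model on a random half of the data, then compare the in-sample R² with an R² computed on the held-out rows.

set.seed(1)
idx   <- sample(nrow(mtcars), nrow(mtcars) / 2)   # random train/holdout split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
fit   <- lm(mpg ~ disp, data = train)
summary(fit)$r.squared                            # in-sample R-squared
preds <- predict(fit, newdata = test)             # predictions for unseen rows
1 - sum((test$mpg - preds)^2) /
    sum((test$mpg - mean(test$mpg))^2)            # out-of-sample R-squared

The out-of-sample number, not the in-sample one, is the honest estimate of how the model will predict future data.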

  6. SUPERVISED LEARNING

  7. SUPERVISED LEARNING

  8. SUPERVISED LEARNING • A first issue is the tradeoff between bias and variance. • Imagine that we have available several different, but equally good, training data sets. • A learning algorithm is biased for a particular input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x. • A learning algorithm has high variance for a particular input if it predicts different output values when trained on different training sets. • The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm. • Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. • A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
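A short simulation makes the tradeoff visible (a sketch under assumed conditions: the true function, noise level, and model complexities below are illustrative, not from the slides). We train a rigid model and a flexible model on many independent training sets and look at their predictions at a single input x0:

set.seed(42)
true_f <- function(x) sin(2 * pi * x)             # assumed true relationship
x  <- seq(0, 1, length.out = 25)
x0 <- 0.25                                        # probe point: true value is 1
preds_lin <- preds_poly <- numeric(200)
for (i in 1:200) {                                # 200 independent training sets
  y <- true_f(x) + rnorm(length(x), sd = 0.3)
  preds_lin[i]  <- predict(lm(y ~ x), data.frame(x = x0))            # rigid model
  preds_poly[i] <- predict(lm(y ~ poly(x, 9)), data.frame(x = x0))   # flexible model
}
c(bias = mean(preds_lin)  - true_f(x0), variance = var(preds_lin))   # high bias, low variance
c(bias = mean(preds_poly) - true_f(x0), variance = var(preds_poly))  # low bias, higher variance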

  9. SUPERVISED LEARNING • The best learner is the one which can balance the bias and the variance of a model. For more information, you can visit https://www.saylor.org/site/wp-content/uploads/2011/11/CS405-6.2.1.2-WIKIPEDIA.pdf

  10. LINEAR MODELS AND GRADIENT DESCENT • Starting with an example • How do we predict housing prices? • Collect data on housing prices and how they relate to size in square feet. Example problem: "Given this data, a friend has a house of 750 square feet - how much can they be expected to get?"

  11. What approaches can we use to solve this? • Straight line through the data • Maybe $150,000 • Second-order polynomial • Maybe $200,000 • Each of these approaches represents a way of doing supervised learning • What does this mean? • We gave the algorithm a data set where a "right answer" was provided • So we know the actual prices of the houses • The idea is that we can learn from the training data what makes the price a certain value • The algorithm should then produce more right answers for new data where we don't know the price already, i.e. predict the price • We also call this a regression problem • Predict continuous valued output (price) • No real discrete delineation

  12. What do we start with? Training set (this is your data set) • Notation: m = number of training examples • x's = input variables / features • y's = output variables / "target" variables • (x, y) - a single training example • (x(i), y(i)) - a specific (ith) training example; i is an index into the training set • With our training set defined - how do we use it? • Take the training set • Pass it into a learning algorithm • The algorithm outputs a function f (the hypothesis) • This function takes an input (e.g. size of a new house) • It tries to output the estimated value of y

  13. How do we represent the hypothesis f ? • We write f as: fθ(x) = θ0 + θ1x • f(x) for shorthand • What does this mean? • It means y is a linear function of x! • The θi are parameters • θ0 is the intercept (zero condition) • θ1 is the gradient (slope) • This function is a linear regression with one variable • Also called simple linear regression • So in summary • A hypothesis takes in some variable • Uses parameters determined by a learning system • Outputs a prediction based on that input
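In code, the hypothesis is just a parameterized line. A trivial R sketch (the θ values below are placeholders, not fitted):

f <- function(x, theta0, theta1) theta0 + theta1 * x   # hypothesis f_theta(x)
f(750, 50, 0.13)   # prediction for a 750 sq ft house with illustrative parameters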

  14. Linear regression - implementation (cost function) • A cost function lets us figure out how to fit the best straight line to our data. • Choosing values for the θi (parameters) • Different values give you different functions • If θ0 is 1.5 and θ1 is 0 then we get a straight line parallel with x at y = 1.5 • If θ1 is > 0 then we get a positive slope • Based on our training set we want to generate parameters which make the straight line fit the data well • We choose these parameters so that fθ(x) is close to y for our training examples • Basically, we use the x's in the training set with fθ(x) to give an output which is as close to the actual y value as possible • Think of fθ(x) as a "y imitator" - it tries to convert x into y, and since we already have y we can evaluate how well fθ(x) does this • To formalize this: • We want to solve a minimization problem • Minimize (fθ(x) - y)² • i.e. minimize the difference between fθ(x) and y for each/any/every example • Sum this over the training set

  15. Cost Function
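This slide presumably showed the squared-error cost. Written out in the usual notation (a standard form, reconstructed from the description above rather than copied from the slide):

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_\theta(x^{(i)}) - y^{(i)} \right)^2

and the goal is \min_{\theta_0, \theta_1} J(\theta_0, \theta_1). The 1/(2m) scaling is conventional; the 1/2 simply cancels when the square is differentiated.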

  16. We can see that the height (the vertical axis) indicates the value of the cost function, so we find where it is at a minimum • Instead of a surface plot we can use contour plots • A set of ellipses in different colors • Each color marks the same value of J(θ0, θ1), plotted at different locations because θ0 and θ1 vary • Imagine a bowl-shaped function coming out of the screen, so that the minimum is the center of the concentric circles
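Such a contour plot is easy to produce. A hedged R sketch (the toy data below is assumed for illustration; J is the squared-error cost written out above):

x <- c(1, 2, 3, 4)                          # assumed toy inputs
y <- c(2, 4, 6, 8)                          # assumed toy targets (y = 2x)
J <- function(t0, t1) sum((t0 + t1 * x - y)^2) / (2 * length(x))
t0 <- seq(-10, 10, length.out = 100)
t1 <- seq(-2, 6, length.out = 100)
z  <- outer(t0, t1, Vectorize(J))           # evaluate J over the parameter grid
contour(t0, t1, z, nlevels = 30,
        xlab = "theta0", ylab = "theta1")   # ellipses centered on the minimum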

  17. Each point on the contour plot (like the red one in the figure) represents a pair of parameter values for θ0 and θ1 • Our example here put the values at • θ0 = ~800 • θ1 = ~-0.15 • Not a good fit • i.e. these parameters give a value on our contour plot far from the center • If we have • θ0 = ~360 • θ1 = 0 • This gives a better hypothesis, but still not great - not in the center of the contour plot • Finally we find the minimum, which gives the best hypothesis • Doing this by eye/hand is hard • What we really want is an efficient algorithm for finding the minimum over θ0 and θ1

  18. GRADIENT DESCENT ALGORITHM • At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient. • Minimize cost function J • Gradient descent • Used all over machine learning for minimization • Start by looking at a general J() function • Problem • We have J(θ0, θ1) • We want to get min J(θ0, θ1) • Gradient descent applies to more general functions • J(θ0, θ1, θ2, ..., θp) • min J(θ0, θ1, θ2, ..., θp)
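In symbols, each step of gradient descent performs the standard simultaneous update (α is the learning rate):

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{for } j = 0, 1 \text{ (both updated simultaneously)}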

  19. How does it work? • Start with initial guesses • Start at 0,0 (or any other value) • Keep changing θ0 and θ1 a little bit to try and reduce J(θ0, θ1) • Each time you change the parameters, you pick the direction that reduces J(θ0, θ1) the most • Repeat • Do so until you converge to a local minimum • This has an interesting property • Where you start can determine which minimum you end up in

  20. STEPS IN THE GRADIENT DESCENT ALGORITHM* • Let's now go step by step through the Gradient Descent algorithm: • Step 1: Initialize the weights (θ0 and θ1) with random values and calculate the error (SSE) • Step 2: Calculate the gradient, i.e. the change in SSE when the weights (θ0 and θ1) are changed by a very small amount from their original randomly initialized values. This helps us move the values of θ0 and θ1 in the direction in which SSE is minimized. • Step 3: Adjust the weights with the gradients to move toward the optimal values where SSE is minimized • Step 4: Use the new weights for prediction and to calculate the new SSE • Step 5: Repeat steps 2 and 3 until further adjustments to the weights don't significantly reduce the error *https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html

  21. Gradient descent over multi-dimensional parameters

  22. EXAMPLE • We will now go through each of the steps in detail. But before that, we have to standardize the data as it makes the optimization process faster.

  23. Step 1: To fit a line Ypred = a + b X, start off with random values of a and b and calculate prediction error (SSE)

  24. Step 2: Calculate the error gradient w.r.t. the weights. Here, SSE = ½(Y − YP)² = ½(Y − (a + bX))², so
∂SSE/∂a = −(Y − YP)
∂SSE/∂b = −(Y − YP)·X
∂SSE/∂a and ∂SSE/∂b are the gradients; they give the direction in which to move a and b with respect to SSE.

  25. Step 3: Adjust the weights with the gradients to reach the optimal values where SSE is minimized. We need to update the random values of a and b so that we move in the direction of the optimal a and b. Update rules:
New a = a − α · ∂SSE/∂a = 0.45 − 0.01 × 3.300 = 0.42
New b = b − α · ∂SSE/∂b = 0.75 − 0.01 × 1.545 = 0.73
Here, α is the learning rate (= 0.01), which sets the pace of adjustment to the weights.
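The arithmetic of this update step can be checked in a couple of lines of R (the gradient values 3.300 and 1.545 are taken from the slide; the underlying data is not reproduced here):

a <- 0.45; b <- 0.75; alpha <- 0.01
a - alpha * 3.300   # new a: 0.417, which the slide rounds to 0.42
b - alpha * 1.545   # new b: 0.7346, which the slide rounds to 0.73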

  26. Step 4: Use the new a and b for prediction and to calculate the new total SSE. You can see that with the new prediction, the total SSE has gone down (from 0.677 to 0.553). That means prediction accuracy has improved. Step 5: Repeat steps 3 and 4 until further adjustments to a and b no longer significantly reduce the error. At that point, we have arrived at the optimal a and b with the highest prediction accuracy. • This is the Gradient Descent Algorithm. This optimization algorithm and its variants form the core of many machine learning methods, such as neural networks and even deep learning.

  27. We have only scratched the surface of gradient descent; there are several additional concepts worth being aware of that we weren't able to discuss. A few of these include: • Convexity - In our linear regression problem, there was only one minimum. Our error surface was convex. Regardless of where we started, we would eventually arrive at the absolute minimum. In general, this need not be the case. It's possible to have a problem with local minima that a gradient search can get stuck in. There are several approaches to mitigate this (e.g., stochastic gradient search). • Performance - We used vanilla gradient descent with a learning rate of 0.0005, and ran it for 2000 iterations. There are approaches, such as line search, that can reduce the number of iterations required. For the above example, line search reduces the number of iterations to arrive at a reasonable solution from several thousand to around 50. • Convergence - Determining when the search has found a solution is typically done by looking for small changes in error from iteration to iteration (e.g., where the gradient is near zero).

  28. We have an almost identical rule for multivariate gradient descent • Polynomial regression for non-linear functions • Example • House price prediction • Two features • Frontage - width of the plot of land along the road (x1) • Depth - depth of the plot away from the road (x2) • You don't have to use just these two features • You can create new features • You might decide that an important feature is the land area • So, create a new feature = frontage * depth (x3) • fθ(x) = θ0 + θ1x3 • Area is a better indicator • Often, by defining new features you may get a better model • Polynomial regression • May fit the data better • e.g. fθ(x) = θ0 + θ1x + θ2x², a quadratic function • For the housing data we could use a quadratic function • But it may not fit the data so well - a quadratic eventually turns downward, implying that housing prices decrease when size gets really big • So instead we might use a cubic function

  29. How do we fit the model to this data? • To map our old linear hypothesis and cost functions to these polynomial descriptions, the easy thing to do is set • x1 = x • x2 = x² • x3 = x³ • By selecting the features like this and applying the linear regression algorithms you can do polynomial linear regression, as in the sketch below • Remember, feature scaling becomes even more important here • Instead of a conventional polynomial you could use variable^(1/something) - i.e. square root, cube root, etc.
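A hedged sketch of this feature-mapping idea in R (the data is made up for illustration): create x², x³ as new columns, scale them, and fit an ordinary linear regression on the result.

set.seed(7)
size  <- runif(50, 500, 3500)                   # hypothetical house sizes (sq ft)
price <- 50 + 0.09 * size + rnorm(50, sd = 40)  # made-up price relationship
x1 <- scale(size)                               # feature scaling matters here:
x2 <- scale(size^2)                             # powers of x live on wildly
x3 <- scale(size^3)                             # different numeric scales
fit_cubic <- lm(price ~ x1 + x2 + x3)           # linear regression on polynomial features
coef(fit_cubic)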

  30. R APPLICATION OF THE GRADIENT DESCENT ALGORITHM
> attach(mtcars)
> plot(disp, mtcars[,1], col = "blue", pch = 20)   # mtcars[,1] is mpg
> model <- lm(mtcars[,1] ~ disp, data = mtcars)    # OLS fit for reference
> coef(model)
(Intercept)        disp
29.59985476 -0.04121512

  31.
> y_preds <- predict(model)
> abline(model)                                    # draw the fitted line
> errors <- unname((mtcars[,1] - y_preds) ^ 2)     # squared residuals
> sum(errors) / length(mtcars[,1])                 # MSE of the OLS fit
[1] 9.911209

  32. gradientDesc <- function(x, y, learn_rate, conv_threshold, n, max_iter) {
  plot(x, y, col = "blue", pch = 20)
  m <- runif(1, 0, 1)                  # random initial slope
  c <- runif(1, 0, 1)                  # random initial intercept
  yhat <- m * x + c
  MSE <- sum((y - yhat) ^ 2) / n
  converged = F
  iterations = 0
  while(converged == F) {
    ## Implement the gradient descent algorithm
    m_new <- m - learn_rate * ((1 / n) * (sum((yhat - y) * x)))
    c_new <- c - learn_rate * ((1 / n) * (sum(yhat - y)))
    m <- m_new
    c <- c_new
    yhat <- m * x + c
    MSE_new <- sum((y - yhat) ^ 2) / n
    if(MSE - MSE_new <= conv_threshold) {
      abline(c, m)
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m, "MSE", MSE_new))
    }
    MSE <- MSE_new       # carry forward the current error (missing in the original,
                         # which compared every iteration against the initial MSE)
    iterations = iterations + 1
    if(iterations > max_iter) {
      abline(c, m)
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m, "MSE", MSE_new))
    }
  }
}

  33. # Run the function
> gradientDesc(disp, mtcars[,1], 0.0000293, 0.001, 32, 2500000)
[1] "Optimal intercept: 29.5998515131943 Optimal slope: -0.0412151089777685 MSE 9.91120904007057"
> coef(model)
(Intercept)        disp
29.59985476 -0.04121512

  34. STOCHASTIC GRADIENT DESCENT

  35. STOCHASTIC GRADIENT DESCENT

  36. LINEAR REGRESSION APPLICATION (https://towardsdatascience.com/step-by-step-tutorial-on-linear-regression-with-stochastic-gradient-descent-1d35b088a843) • We have some data: as we observe the independent variables x₁ and x₂, we observe the dependent (or response) variable y along with them. • Our dataset consists of 6 examples (or observations):
     x₁   x₂     y
1)    4    1     2
2)    2    8   -14
3)    1    0     1
4)    3    2    -1
5)    1    4    -7
6)    6    7    -8

  37. Model • The next question to ask: "How are x₁ and x₂ related to y?" • We believe they are connected to each other by the equation ŷ = w₁x₁ + w₂x₂ + b, and our job is to find the 'best' w and b values. • The symbols w and b follow the deep learning conventions, standing for weights and biases respectively. But note that linear regression itself is not deep learning.

  38. Define loss function • Let's say that at the end of this exercise we've figured out our model. How do we know if it is doing well? • We simply compare the predicted ŷ and the observed y through a loss function. There are many ways to define a loss function, but in this post we define it as the squared difference between ŷ and y: L = (ŷ − y)². • Generally, the smaller L is, the better.

  39. Minimize loss function • Because we want the difference between ŷ and y to be small, we make an effort to minimize it. This is done through stochastic gradient descent optimization: iteratively updating the values of w₁ and w₂ using the gradient, as in the update rule w ← w − α · ∂L/∂w (and likewise for b), where α is the learning rate. • This algorithm tries to find the right weights by constantly updating them, bearing in mind that we are seeking values that minimize the loss function.

  40. Implementation • The workflow for training our model is simple: forward propagation (or feed-forward or forward pass) and backpropagation. • Put simply, training just means repeatedly updating the values of your weights.

  41. STEP 0. Build computation graph • To keep track of all the values, we build a ‘computation graph’ that comprises nodes color-coded in • orange — the placeholders (x₁, x₂ and y), • dark green — the weights and bias (w₁, w₂ and b), • light green — the model (ŷ) connecting w₁, w₂, b, x₁ and x₂, and • yellow — the loss function (L) connecting the ŷ and y. For forward propagation, you should read this graph from top to bottom and for backpropagation bottom to top. Fig. 0: Computation graph for linear regression model with stochastic gradient descent.

  42. STEP 1. Forward Propagation: Initialize weights (one-time) • Since gradient descent is all about updating the weights, we need them to start with some values; this is known as initializing the weights. • In this example, we initialize the weights using a truncated normal distribution and the bias with 0. Fig. 1: Weights initialized (dark green nodes)

  43. STEP 2. Forward Propagation: Feed data • We set the batch size to be 1 and we feed in this first batch of data. • Batch and batch size: We can divide our dataset into smaller groups of equal size. Each group is called a batch and consists of a specified number of examples, called batch size. If we multiply these two numbers, we should get back the number of observations in our data. • Here, our dataset consists of 6 examples and since we defined the batch size to be 1 in this training, we have 6 batches altogether. Eqn. 1: First batch of data fed into model Fig. 2.2: Feeding data to model with first batch (orange nodes)

  44. STEP 3. Forward Propagation: Compute ŷ • Now that we have the values of x₁, x₂, w₁, w₂ and b ready, let’s compute ŷ. • The value of ŷ (=0.1) is reflected in the light green node below: Fig. 3: ŷ computed (light green node)
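Pulling the remaining steps together, here is a hedged end-to-end R sketch of stochastic gradient descent on the 6-example dataset from slide 36 (batch size 1 and squared-error loss as in the article; the learning rate, epoch count, and initialization are illustrative choices, not from the article):

x1 <- c(4, 2, 1, 3, 1, 6)                  # dataset from slide 36
x2 <- c(1, 8, 0, 2, 4, 7)
y  <- c(2, -14, 1, -1, -7, -8)
set.seed(0)
w1 <- rnorm(1, sd = 0.1)                   # small random init (stand-in for
w2 <- rnorm(1, sd = 0.1)                   # the truncated normal above)
b  <- 0                                    # bias initialized to 0
alpha <- 0.005                             # assumed learning rate
for (epoch in 1:2000) {
  for (i in sample(length(y))) {           # one example per update (batch size 1)
    yhat <- w1 * x1[i] + w2 * x2[i] + b    # forward pass: model prediction
    err  <- yhat - y[i]                    # L = (yhat - y)^2, so dL/dyhat = 2 * err
    w1 <- w1 - alpha * 2 * err * x1[i]     # backward pass: update each weight
    w2 <- w2 - alpha * 2 * err * x2[i]
    b  <- b  - alpha * 2 * err
  }
}
round(c(w1 = w1, w2 = w2, b = b), 3)       # approaches w1 = 1, w2 = -2, b = 0,
                                           # since the data satisfies y = x1 - 2*x2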
