Chapter 2 Simple Linear Regression – What is it, Why do we do it?
Remember that statistics is an applied branch of mathematics. When you apply mathematics to describe the world we live in, we call this mathematical modeling. Up to this point in your mathematical studies you have been learning the language of algebra, which involves learning about the objects that make up this language (functions, for example) and about its characteristics, syntax, and grammar. The big concept in algebra is the relationship between variables, and the simplest such relationship is the linear one. After a while we learn about a special type of relationship called a function, symbolized by f(x) = y.
The types of relationships/functions we learn about in algebra are called deterministic. What does deterministic mean? It means that we have a perfect relationship between the variables in question. If we know a particular value of one variable, then knowing the relationship (that is, knowing the equation) gives us complete knowledge of the corresponding value of the other variable. Here is a simple example. You work at a job that pays you $10.00 per hour. Let h equal the number of hours you work, let d equal the number of dollars you earn, and let f(h) = d be the symbolic representation of the relationship between h and d; here f(h) = 10h. So, if in your first week on the job you work 20 hours, your gross pay will be $200.00 without question: f(20) = $200.00. The relationship is exact!
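The pay example can be written as a very small program. This is only a sketch: the names HOURLY_RATE and f are chosen here for illustration, and the rate is the $10.00 per hour from the example.

```python
HOURLY_RATE = 10.00  # dollars per hour, taken from the example


def f(h):
    """Deterministic relationship: hours worked in, dollars earned out."""
    return HOURLY_RATE * h


print(f(20))  # 200.0
```

No matter how many times f(20) is evaluated, the answer does not change. That is what deterministic means.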
If the following week you work 20 hours again, your pay will be $200.00 again. This is different from a probabilistic model. In a probabilistic model there is a relationship between two variables, but that relationship is not perfect. For a particular value of one variable we have an expected value for the other variable, but we cannot expect to get exactly that value. Here is a simple example. You are a vendor at an outdoor market. Let h represent the number of hours you work at the market, and let d equal the net amount of money you earn from selling your products. If you work for 8 hours you will earn some amount of money, but you cannot predict from day to day what that amount will be. You do have an expectation of what you will earn; otherwise you would not be involved in this endeavor.
But suppose that you worked 8 hours a day, for many days, at your stall in the outdoor market. Eventually you would gather enough data to create a distribution of sales and would begin to see a pattern emerge. Soon you would recognize that this pattern repeats itself, even though you cannot predict exactly how much money you will earn on any one day. Depending on the time of the season or month, you can expect a certain return for your work. Let's say that sales always seem to be higher at the end of the month. So there is a relationship between the two variables, but it is not deterministic (exact); it is probabilistic.
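To see what "a pattern without exact prediction" looks like, the market stall can be simulated. This is a rough sketch under assumptions that are not in the text: ASSUMED_MEAN and ASSUMED_SD are made-up numbers, and drawing each day's earnings from a bell-shaped (normal) distribution is a choice made here only for illustration.

```python
import random
import statistics

ASSUMED_MEAN = 300.0  # hypothetical average earnings for an 8-hour day
ASSUMED_SD = 40.0     # hypothetical day-to-day spread in earnings


def simulate_days(n_days, seed=0):
    """Return simulated earnings for n_days separate 8-hour market days."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.gauss(ASSUMED_MEAN, ASSUMED_SD) for _ in range(n_days)]


earnings = simulate_days(200)
print("first three days:", [round(e, 2) for e in earnings[:3]])
print("average over all days:", round(statistics.mean(earnings), 2))
```

The individual days differ from one another, yet the average over many days settles close to the assumed mean. No single day is predictable, but the long-run pattern is.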
So here is where we begin talking about the topic of Chapter 2, simple linear regression. The aim is to understand what relationship exists between two variables under a probabilistic model. Before we start, let us cover a couple of preliminary concepts. As mentioned, we are dealing with a probabilistic model, which means our functions will also be probabilistic. More on what this means in a moment. Importantly, we will deal only with linear equations/functions, which makes our task easier.
Now, what do we mean by a probabilistic linear function? Let x be our independent variable, which most of the time we will call the explanatory variable. In algebra the letter y represents the dependent variable, which we now call the response variable (more on the name change later). Using algebra symbols, f is the name of the linear function, and f(x) = y. But since we are dealing with a probabilistic model, in which x and y are not perfectly related, we seem to have the following situation: f(3) = ?. What exactly is the output of a probabilistic equation that defines the relationship between two variables?
Alright, here it is. Let the explanatory variable h be the number of hours you work at the market selling your product. Let us say you work 8 hours every Saturday. The variable d represents the gross amount of dollars earned for working 8 hours on Saturday. The function f represents the mathematical relationship between the two, so f(8) equals the expected (mean) gross amount of dollars earned. In other words, f(8) is the mean for the given situation. Thus, f(8) represents the mean of some distribution.
We will make some assumptions about this distribution. First, f(8) is the mean dollar value of the sales. Second, we will assume the distribution is normal.
We realize that there is also a value of f associated with any other value of h, the number of hours worked. Thus f(4) has some value, f(5) has some value, f(3.5) has some value, and so on. The output is always the mean sales in dollars associated with that number of hours worked, and we will assume that the distributions are all normal. We will also assume that all of these normal distributions have the same standard deviation.
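These assumptions can be checked in a small simulation. The sketch below is illustrative only: INTERCEPT, SLOPE, and SIGMA are hypothetical numbers chosen here, not values from the chapter. For each number of hours h, sales are drawn from a normal distribution whose mean is f(h) and whose standard deviation is the same SIGMA for every h.

```python
import random
import statistics

INTERCEPT = 20.0  # hypothetical: mean sales at h = 0
SLOPE = 35.0      # hypothetical: extra mean dollars per additional hour
SIGMA = 40.0      # hypothetical: common standard deviation for every h


def f(h):
    """Mean sales in dollars for h hours worked (a linear function)."""
    return INTERCEPT + SLOPE * h


def draw_sales(h, n, rng):
    """Draw n sales figures for days on which h hours were worked."""
    return [rng.gauss(f(h), SIGMA) for _ in range(n)]


rng = random.Random(1)
summary = {}
for h in (3.5, 4, 5, 8):
    sample = draw_sales(h, 2000, rng)
    summary[h] = (statistics.mean(sample), statistics.stdev(sample))
    print(h, f(h), round(summary[h][0], 1), round(summary[h][1], 1))
```

Each printed row shows h, the mean f(h) built into the model, and then the sample mean and sample standard deviation. The sample means track f(h), while the sample standard deviations all stay near the single value SIGMA.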
Ok, this is understandable so far. But what exactly is simple linear regression? What is chapter 2 about?
We want to find the equation f(x) = y, or in this example f(h) = d. That is, we want the equation that explains the relationship between the variable x and the variable y (in our example, h and d), where the output of f represents the mean of a normal distribution. We will only consider situations in which f(x) is a linear equation, that is, a linear relationship. So our tasks are to determine whether the relationship is linear, how good the linear relationship is, and whether it will provide any useful information. We will assume that for each value of x, f(x) is the mean of a normal distribution.
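As a preview of what "finding the equation" can look like, here is a sketch that recovers a line from noisy data. It uses the least-squares formulas for slope and intercept, a standard technique that this text has not introduced yet, so treat it as one possible approach and not necessarily the method developed later. The data are simulated, and TRUE_INTERCEPT, TRUE_SLOPE, and SIGMA are hypothetical values chosen here.

```python
import random

TRUE_INTERCEPT = 20.0  # hypothetical value used to generate the data
TRUE_SLOPE = 35.0      # hypothetical value used to generate the data
SIGMA = 40.0           # hypothetical common standard deviation


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


rng = random.Random(2)
hours = [rng.uniform(2, 10) for _ in range(500)]
dollars = [rng.gauss(TRUE_INTERCEPT + TRUE_SLOPE * h, SIGMA) for h in hours]

intercept, slope = fit_line(hours, dollars)
print("estimated f(h) =", round(intercept, 2), "+", round(slope, 2), "* h")
```

The estimated intercept and slope will not match the generating values exactly, because the data are probabilistic, but with enough observations they land close to them.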