Statistics and Quantitative Analysis U4320

Statistics and Quantitative Analysis U4320 Segment 8 Prof. Sharyn O’Halloran

I. Introduction • A. Overview • 1. Ways to describe, summarize and display data. • 2. Summary statements: • Mean • Standard deviation • Variance • 3. Distributions • Central Limit Theorem

I. Introduction (cont.) • A. Overview • 4. Test hypotheses • 5. Differences of Means • B. What's to come? • 1. Analyze the relationship between two or more variables with a specific technique called regression analysis.

I. Introduction (cont.) • A. Overview • B. What's to come? • 2. This tool allows us to predict the impact of one variable on another. • For example, what is the expected impact of a SIPA degree on income?

II. Causal Models • Causal models explain how changes in one variable affect changes in another variable. • Incinerator -------------------------> Bad Public Health • Regression analysis gives us a way to analyze precisely the cause-and-effect relationships between variables: • Direction • Magnitude

II. Causal Models (cont.) • A. Variables • Let us start off with a few basic definitions. • 1. Dependent Variable • The dependent variable is the factor that we want to explain. • 2. Independent Variables • The independent variable is the factor that we believe causes or influences the dependent variable. • Independent variable -------> Dependent variable • Cause ------------------> Effect

II. Causal Models (cont.) • A. Variables • B. Voting Example • Suppose the House of Representatives votes on a health care proposal, and we want to know whether party affiliation influenced individual members' voting decisions. • 1. The raw data looks like this:

II. Causal Models (cont.) • A. Variables • B. Voting Example • 2. Percentages look like this: • 3. Does party affect voting behavior? • Given that the legislator is a Democrat, what is the chance of voting for the health care proposal?

II. Causal Models (cont.) • A. Variables • B. Voting Example • 3. Does party affect voting behavior? (cont.) • What is the probability of being a Democrat? • What is the probability of being a Democrat and voting yes?
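The three probability questions above can be answered directly from a two-way table of counts. The sketch below uses invented counts, since the lecture's raw data table is not reproduced in this text; the function names are also hypothetical and the numbers are purely illustrative.

```python
# Hypothetical vote counts (NOT the lecture's data): (party, vote) -> members
counts = {
    ("Democrat", "Yes"): 200,
    ("Democrat", "No"): 50,
    ("Republican", "Yes"): 40,
    ("Republican", "No"): 145,
}

total = sum(counts.values())


def p_party(party):
    """Marginal probability P(party)."""
    return sum(n for (p, _), n in counts.items() if p == party) / total


def p_joint(party, vote):
    """Joint probability P(party and vote)."""
    return counts[(party, vote)] / total


def p_vote_given_party(vote, party):
    """Conditional probability P(vote | party) = P(party and vote) / P(party)."""
    return p_joint(party, vote) / p_party(party)


print(p_party("Democrat"))                    # P(Democrat)
print(p_joint("Democrat", "Yes"))             # P(Democrat and Yes)
print(p_vote_given_party("Yes", "Democrat"))  # P(Yes | Democrat)
```

If party influences the vote, P(Yes | Democrat) and P(Yes | Republican) should differ noticeably, as they do in these invented counts.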

II. Causal Models (cont.) • A. Variables • B. Voting Example • 4. Causal Model • This is the simplest way to state a causal model: • A -------------> B • Party ---------> Vote • 5. Interpretation • If party influences vote, then as we move from Republicans to Democrats we should see a move from a No vote to a Yes vote.

II. Causal Models (cont.) • A. Variables • B. Voting Example • C. Summary • 1. Regression analysis helps us to explain the impact of one variable on another. • We will be able to answer such questions as what is the relative importance of race in explaining one's income? • Or perhaps the influence of economic conditions on the levels of trade barriers?

II. Causal Models (cont.) • A. Variables • B. Voting Example • C. Summary • 2. Univariate Model • For now, we will focus on the univariate case, or the causal relation between two variables. • We will then relax this assumption and look at the relation of multiple variables in a couple of weeks.

III. Fitted Line • Although regression analysis can be very complicated, the heart of it is actually very simple. • It centers on the notion of fitting a line through the data. • 1. Example • Suppose we have a study of how wheat yield depends on fertilizer. And we observe this relation:

III. Fitted Line (cont.) • 1. Example (cont.) • The observed relation between Fertilizer and Yield then can be plotted as follows:

III. Fitted Line (cont.) • 1. Example • 2. What line best approximates the relation between these observations? • a) Highest and Lowest Value

III. Fitted Line (cont.) • 1. Example • 2. What line best approximates the relation between these observations? (cont.) • b) Median Value

III. Fitted Line (cont.) • 1. Example • 2. What line best approximates the relation between these observations? • 3. Predicted Values • a) Example 1: • The line that is fitted to the data gives the predicted value of Y for any given level of X.

III. Fitted Line (cont.) • 1. Example • 2. What line best approximates the relation between these observations? • 3. Predicted Values (cont.) • a) Example 1: • If X is 400 and all we knew was the fitted line, then we would expect the yield to be around 65.

III. Fitted Line (cont.) • 1. Example • 2. What line best approximates the relation between these observations? • 3. Predicted Values (cont.) • b) Example 2: • Many times we have a lot of data and fitting the line becomes rather difficult.

III. Fitted Line (cont.) • 1. Example • 2. What line best approximates the relation between these observations? • 3. Predicted Values (cont.) • b) Example 2: • For example, if our plotted data looked like this:

IV. OLS Ordinary Least Squares • We want a method that lets us draw the line that best fits the data. • A. The Least Square Criteria • What we want to do is to fit a line whose equation is of the form: $\hat{Y} = a + bX$ • This is just the algebraic representation of a line.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria (cont.) • 1. Intercept: • a represents the intercept of the line. That is, the point at which the line crosses the Y axis. • 2. Slope of the line: • b represents the slope of the line.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria (cont.) • 1. Intercept: • 2. Slope of the line: • Remember: the slope is just the change in Y divided by the change in X (rise over run). • 3. Minimizing the Sum of Squares • a) Problem: • How do we select a and b so that the vertical Y deviations (prediction errors) are as small as possible? • We want to minimize the deviation: $d = Y - \hat{Y}$

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria (cont.) • 1. Intercept: • 2. Slope of the line: • 3. Minimizing the Sum of Squares • b) There are several ways in which we can do this. • 1. First, we could minimize the sum of d. • We could find the line that gives us the lowest sum of all the d's. • The problem, of course, is that some d's would be positive and others negative, and when we add them all up they would cancel each other out. • In effect, we would be picking a line so that the d's add up to zero.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria (cont.) • 1. Intercept: • 2. Slope of the line: • 3. Minimizing the Sum of Squares • b) There are several ways in which we can do this. • 2. Absolute Values: minimize $\sum |d|$ • 3. Sum of Squared Deviations: minimize $\sum d^2$
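To see why the plain sum of deviations is a poor criterion, the sketch below scores two candidate lines on all three criteria. The data set is hypothetical (not the lecture's), and the function names are assumptions made for illustration. Both lines pass through the point of means, so their deviations sum to roughly zero even though one fits much worse.

```python
# Hypothetical data for illustration only
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]


def deviations(a, b):
    """Vertical deviations d = Y - (a + b*X) for the line with intercept a, slope b."""
    return [y - (a + b * x) for x, y in zip(X, Y)]


def criteria(a, b):
    """Return (sum of d, sum of |d|, sum of d^2) for a candidate line."""
    d = deviations(a, b)
    return sum(d), sum(abs(v) for v in d), sum(v * v for v in d)


# Two lines that both pass through the point of means (3, 4)
flat = criteria(4.0, 0.0)    # horizontal line at the mean of Y
sloped = criteria(2.2, 0.6)  # upward-sloping line
print(flat)
print(sloped)
```

The first criterion cannot tell the two lines apart, while the sum of squared deviations is clearly smaller for the sloped line.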

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • 1. Fitted Line • The line that we want to fit to the data is: $\hat{Y} = a + bX$ • This is simply what we call the OLS line. • Remember: we are concerned with how to calculate the slope of the line, b, and the intercept of the line, a.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • 1. Fitted Line • 2. OLS Slope • The OLS slope can be calculated from the formula: $b = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • 1. Fitted Line • 2. OLS Slope • In the book they use the abbreviations: $x = X - \bar{X}$ and $y = Y - \bar{Y}$, so that $b = \dfrac{\sum xy}{\sum x^2}$
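For readers who want to see where the slope formula comes from, here is a brief derivation sketch. It assumes the standard notation $\hat{Y} = a + bX$ and the deviation abbreviations $x = X - \bar{X}$, $y = Y - \bar{Y}$; the lecture and the textbook may present the steps differently.

```latex
\begin{aligned}
S(a, b) &= \sum (Y - a - bX)^2 \\[4pt]
\frac{\partial S}{\partial a} = 0 &\;\Rightarrow\; \sum (Y - a - bX) = 0 \;\Rightarrow\; a = \bar{Y} - b\bar{X} \\[4pt]
\frac{\partial S}{\partial b} = 0 &\;\Rightarrow\; \sum X\,(Y - a - bX) = 0 \\[4pt]
\text{substitute } a: &\quad Y - a - bX = y - bx, \qquad \sum (y - bx) = 0 \\[4pt]
&\;\Rightarrow\; \sum x\,(y - bx) = 0 \;\Rightarrow\; b = \frac{\sum xy}{\sum x^2}
\end{aligned}
```

The step from $\sum X(y - bx) = 0$ to $\sum x(y - bx) = 0$ uses $X = x + \bar{X}$ together with $\sum (y - bx) = 0$.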

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • 1. Fitted Line • 2. OLS Slope • 3. Intercept • Now that we have the slope b, it is easy to calculate a: $a = \bar{Y} - b\bar{X}$ • Note: when b = 0, the intercept is just the mean of the dependent variable.
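The slope and intercept calculations can be sketched in a few lines of plain Python. The function name `ols_fit` is hypothetical, and the data values are illustrative assumptions chosen so that the fit lands near the slope and intercept the lecture quotes for the fertilizer example; they are not taken from the lecture's table, which is not reproduced in this text.

```python
def ols_fit(X, Y):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    # Deviations from the means: x = X - X-bar, y = Y - Y-bar
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in X)
    if sum_xx == 0:
        raise ValueError("X has no variation, so the slope is undefined")
    b = sum_xy / sum_xx      # slope: sum(xy) / sum(x^2)
    a = y_bar - b * x_bar    # intercept: Y-bar - b * X-bar
    return a, b


# Hypothetical fertilizer (lbs) and yield (bushels) values, for illustration only
fertilizer = [100, 200, 300, 400, 500, 600, 700]
wheat_yield = [40, 50, 50, 70, 65, 65, 80]

a, b = ols_fit(fertilizer, wheat_yield)
print(round(a, 1), round(b, 3))
```

The intercept is computed after the slope because it depends on b, mirroring the order of the steps in the lecture.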

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • So to calculate the slope we solve: • We can then use the slope b to calculate the intercept

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • Remember: $\hat{Y} = a + bX$ • Plugging these estimated values into our fitted line equation, we get: $\hat{Y} = 36.4 + 0.059X$

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • What is the predicted yield, in bushels, with 400 lbs of fertilizer? • If we apply 700 lbs of fertilizer, what would be the expected yield?
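Both questions can be answered by plugging X into the fitted line. The sketch below uses the slope (0.059) and intercept (36.4) that the lecture reports for this example; the function name `predict_yield` is a hypothetical helper.

```python
# Coefficients quoted in the lecture for the fertilizer example
A = 36.4    # intercept (bushels)
B = 0.059   # slope (bushels per lb of fertilizer)


def predict_yield(fertilizer_lbs):
    """Predicted yield from the fitted line Y-hat = A + B * X."""
    return A + B * fertilizer_lbs


for lbs in (400, 700):
    print(lbs, round(predict_yield(lbs), 1))
```

With these rounded coefficients the predictions come out to about 60.0 bushels at 400 lbs and about 77.7 bushels at 700 lbs.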

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • 1. Slope b • The change in Y that accompanies a unit change in X. • The slope tells us the predicted effect on the dependent variable of a one-unit change in the independent variable.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • 1. Slope b • The slope then tells us two things: • i) The directional effect of the independent variable on the dependent variable. • There was a positive relation between fertilizer and yield.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • 1. Slope b • The slope then tells us two things: • ii) It also tells you the magnitude of the effect on the dependent variable. • For each additional pound of fertilizer we expect an increased yield of .059 bushels.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • 2. The Intercept • The intercept tells us what to expect when no fertilizer is added: a yield of 36.4 bushels. • So, before fertilizer contributes anything, you can expect 36.4 bushels. • Put another way, 36.4 bushels is the part of the predicted yield that does not depend on fertilizer: the yield we expect with no fertilizer.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • 1. Causal Model • We want to know whether exposure to radioactive waste is linked to cancer. • Radioactive Waste --------------> Cancer

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • 2. Data

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • 3. Graph

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • 4. Calculate the regression line for predicting Y from X • i) Slope • How do we interpret the slope coefficient? • For each unit of radioactive exposure, the cancer mortality rate rises by 9.03 deaths per 10,000 individuals.

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • ii) Calculate the intercept • Plugging these estimated values into our fitted line equation, we get: $\hat{Y} = 118.5 + 9.03X$

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • 5. Predictions: • Let's calculate the mortality rate if X were 5.0. • How about if X were 0?

IV. OLS Ordinary Least Squares (cont.) • A. The Least Square Criteria • B. OLS Formulas • C. Example 1: Fertilizer and Yield • D. Interpretation of b and a • E. Example II: Radio Active Exposure • How can we interpret this result? • Even with no radioactive exposure, the predicted mortality rate would be 118.5 deaths per 10,000 individuals.
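The two predictions for this example can be worked out by hand. The arithmetic below uses the slope (9.03) and intercept (118.5) quoted in the lecture, with mortality measured in deaths per 10,000 individuals as in the lecture's slope interpretation.

```latex
\begin{aligned}
\hat{Y} &= 118.5 + 9.03X \\
X = 5.0:\quad \hat{Y} &= 118.5 + 9.03(5.0) = 118.5 + 45.15 = 163.65 \\
X = 0:\quad \hat{Y} &= 118.5 + 9.03(0) = 118.5
\end{aligned}
```

The prediction at X = 0 is simply the intercept, which is why the intercept can be read as the baseline mortality rate.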

III. Advantages of OLS • A. Easy • 1. The least squares method gives relatively easy, or at least computable, formulas for calculating a and b.

III. Advantages of OLS (cont.) • A. Easy • B. OLS is similar to many concepts we have already used. • 1. We are minimizing the sum of the squared deviations. In effect, this is very similar to how we find the variance. • 2. Also, we saw above that when b = 0, $a = \bar{Y}$. • The interpretation of this is that the best prediction we can make of Y is just the sample mean $\bar{Y}$. • This is the case when the two variables are independent.
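The link between OLS, the variance and the sample mean can be checked numerically: with no slope, the constant that minimizes the sum of squared deviations is the mean. The sketch below searches a grid of candidate constants; the Y values and the helper name `sse` are hypothetical.

```python
# Hypothetical Y values, for illustration only
Y = [40, 50, 50, 70, 65, 65, 80]


def sse(c):
    """Sum of squared deviations of Y from a constant prediction c."""
    return sum((y - c) ** 2 for y in Y)


y_bar = sum(Y) / len(Y)

# Try constants from 30.0 to 90.0 in steps of 0.5; the best should be the mean
candidates = [30 + 0.5 * k for k in range(121)]
best = min(candidates, key=sse)
print(y_bar, best)

# The sample variance is this minimized sum divided by n - 1
sample_variance = sse(y_bar) / (len(Y) - 1)
print(sample_variance)
```

This is the sense in which fitting a line by least squares generalizes computing a mean: the variance calculation is the special case with b = 0.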

III. Advantages of OLS (cont.) • A. Easy • B. OLS is similar to many concepts we have already used. • C. Extension of the Sample Mean • Since OLS is just an extension of the sample mean, it shares many of the same properties, such as efficiency and unbiasedness. • D. Weighted Least Squares • We might want to weight some observations more heavily than others.
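The lecture does not give a formula for weighted least squares, so the following is only a minimal sketch of one common version: minimize the weighted sum of squared deviations, which replaces the ordinary means and sums in the OLS formulas with weighted ones. The function name `wls_fit`, the data and the weights are all hypothetical.

```python
def wls_fit(X, Y, W):
    """Weighted least-squares line: minimizes sum of W * (Y - a - b*X)^2."""
    w_total = sum(W)
    x_bar = sum(w * x for w, x in zip(W, X)) / w_total   # weighted mean of X
    y_bar = sum(w * y for w, y in zip(W, Y)) / w_total   # weighted mean of Y
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(W, X, Y))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(W, X))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data: the last observation looks like an outlier
X = [1, 2, 3, 4]
Y = [1, 2, 3, 10]

equal = wls_fit(X, Y, [1, 1, 1, 1])          # same as ordinary least squares
downweighted = wls_fit(X, Y, [1, 1, 1, 0.1])  # outlier counts for much less
print(equal)
print(downweighted)
```

With equal weights the result matches ordinary least squares; shrinking the weight on the last point pulls the line back toward the other three observations.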

V. Homework Example • In the homework assignment, you are asked to select two interval/ratio level variables and calculate the fitted line that minimizes the sum of the squared deviations (the regression line). • A. Choose 2 Variables • What effect does the number of years of education have on how often one reads the newspaper? • The independent variable is Education. • The dependent variable is Newspaper reading.

V. Homework Example (cont.) • A. Choose 2 Variables • B. Coding the Variables • First, I made a new variable called PAPER. • Recode all the missing data values to a single value. • Remove the missing values from the data set. • Then do the same for education.

V. Homework Example (cont.) • A. Choose 2 Variables • B. Coding the Variables • C. Getting the number of valid observations • Next, see how many valid observations are left by using the “Summarize” command under the “Data” menu.
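The lecture carries out these coding steps through a statistics package's menus. As a rough equivalent, the sketch below performs the same recode, drop and count steps in plain Python. The records, the variable name EDUC, and the missing-value codes (98, 99) are all hypothetical; check the codebook for the real codes.

```python
MISSING_CODES = {98, 99}   # assumed codes for "don't know" / "no answer"

# Hypothetical survey records: (PAPER, EDUC)
records = [(1, 12), (3, 16), (99, 14), (2, 98), (5, 8), (98, 99), (1, 20)]


def recode(value):
    """Map every missing-value code to a single value (None)."""
    return None if value in MISSING_CODES else value


# Step 1: recode missing codes to None for both variables
recoded = [(recode(paper), recode(educ)) for paper, educ in records]

# Step 2: remove observations with a missing value on either variable
valid = [(p, e) for p, e in recoded if p is not None and e is not None]

# Step 3: count the valid observations that remain
n_valid = len(valid)
print(n_valid)
```

Dropping an observation when either variable is missing keeps the two columns aligned, which the regression formulas require.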
