Intermediate Data Collection & Analysis Steven A. Allshouse Coordinator of Research and Analysis November 5, 2008
Organization of the Class • Part I – Discussion of Correlation and Causation. • Part II – Quantitative Examples of Correlation and Causation. • Part III – How to Measure Correlation (OLS Method). • Part IV – Common Pitfalls of the OLS Method. • Part V – MS Excel Exercise.
Correlation • A situation in which one variable or set of variables tends to be associated with a second variable or set of variables, but is not thought to bring about that second variable or set of variables. • Examples: The size of a person’s left foot and the size of his or her right foot; women’s hemlines and the performance of the stock market; and the number of cavities in elementary school children and the size of their vocabulary. • Note: Correlation can be positive or negative; positive means as X increases, so does Y; negative means as X increases, Y decreases.
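As a minimal sketch of what "positive" and "negative" correlation look like numerically, the snippet below computes the Pearson correlation coefficient, one common measure that the slides do not name explicitly. The function name `pearson_r` and all of the numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical illustrative data, not taken from the class.
left_foot = [24.0, 25.5, 26.0, 27.5, 28.0]    # cm
right_foot = [24.2, 25.4, 26.1, 27.4, 28.3]   # cm
temperature = [30, 40, 50, 60, 70]            # average daily temperature
heating_oil = [9.0, 7.5, 5.5, 4.0, 2.0]       # gallons used per day

r_positive = pearson_r(left_foot, right_foot)     # as X increases, so does Y
r_negative = pearson_r(temperature, heating_oil)  # as X increases, Y decreases
print(round(r_positive, 3), round(r_negative, 3))
```

A coefficient near +1 indicates a strong positive correlation and one near -1 a strong negative correlation. Neither value says anything about whether one variable causes the other.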
Causation • A situation in which one variable or set of variables is thought to bring about, or help bring about, a second variable or set of variables. • Examples: Alcohol consumption and traffic accidents; average daily temperatures and heating oil consumption. • Notes: Causation usually implies correlation: if X causes Y, then where we see X we would expect to see Y. Causation can be positive or negative: an increase in X can cause an increase or a decrease in Y. The direction of causation can run one way or both ways: X causes Y, but Y might or might not cause X.
A Case of Causation? • There is a strong positive correlation between the number of fire engines that respond to a fire and the number of fatalities in that fire, i.e., the greater the number of fire engines, the greater the number of deaths. • Question: Does this fact mean that Albemarle County could save lives by decreasing the number of fire engines sent to a given fire?
Additional Notes about Correlation & Causation • The direction of causation usually determines what we identify as the “independent” and “dependent” variables: the independent variable X causes the dependent variable Y. X and Y are correlated, but Y does not cause X. • Identification problem: Smoke does not actually cause the fire alarm to be pulled; fire is the underlying cause. Similarly, an increase in, say, education can be seen as causing an increase in income, but educational attainment might just be a “signal” of some underlying ability.
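One possible reading of the fire-engine question is that a third variable, the severity of the fire, drives both the number of engines and the number of fatalities. The sketch below assumes that reading and uses small, hand-built, entirely hypothetical data to show how a strong overall correlation can vanish once the underlying cause is held fixed.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical records: (severity, engines, fatalities).
# Severity (1 to 5) raises both counts; the "extra" terms vary independently.
records = []
for severity in range(1, 6):
    for extra_engine in (0, 1):
        for extra_death in (0, 1):
            records.append(
                (severity, 2 * severity + extra_engine, severity + extra_death)
            )

engines = [r[1] for r in records]
fatalities = [r[2] for r in records]
overall_r = pearson_r(engines, fatalities)

# Correlation among fires of the same severity.
within_r = {}
for severity in range(1, 6):
    subset = [r for r in records if r[0] == severity]
    within_r[severity] = pearson_r(
        [r[1] for r in subset], [r[2] for r in subset]
    )

print(round(overall_r, 3), within_r)
```

In this constructed data set, engines and fatalities are strongly correlated overall but uncorrelated among fires of equal severity, so sending fewer engines would not be expected to reduce deaths.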
Ordinary Least Squares (OLS) Method • OLS is a mathematical technique that estimates the correlation between two or more variables. Usually, however, if we are measuring correlation, we already are assuming causation. • The OLS technique renders two items: • (1) A formula whose graphical representation (a “regression” or “trend” line) best “fits” the observed data; and • (2) A number (R²) whose value describes how “tightly” the data fits around the regression line.
The “Regression” or “Trend” Line • Data is plotted in a “scatter” diagram. The horizontal axis contains the “x” values (independent variable) and the vertical axis contains the “y” values (dependent variable). • The regression or trend line is expressed in the form y = mx + b, where m is the slope and b is the y-intercept. • The terms “regression” line and “trend” line frequently are used interchangeably but, usually, a “trend” line pertains to data where the value of the dependent variable changes with time.
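The slope and intercept of the line y = mx + b can be computed directly from the data. The following is a minimal sketch of the standard closed-form least-squares formulas for a single independent variable; the function name `ols_line` and the sample points are hypothetical.

```python
def ols_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x   # the fitted line passes through the means
    return m, b

# Hypothetical scatter-diagram points.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, b = ols_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```

The line found this way minimizes the sum of squared vertical distances between the observed y values and the line, which is the sense in which it "best fits" the data.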
The R² Number • Has a value anywhere from zero to 1. • An R² value of zero means that there is absolutely no linear correlation between the independent and dependent variables. • An R² value of 1 means that there is a perfectly deterministic correlation between the independent and dependent variables. • The R² number tells us how much of the variation in the dependent variable is “explained” by variation in the independent variable. • Example: If R² equals 0.70, that means that 70% of the variation in the dependent variable is “explained” by variation in the independent variable.
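As a rough sketch of where this number comes from, R² can be computed as one minus the ratio of the unexplained variation (squared residuals around the fitted line) to the total variation (squared deviations around the mean of y). The function name `r_squared` and the data are hypothetical.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Variation left unexplained by the line.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
perfect_ys = [3, 5, 7, 9, 11]           # exactly y = 2x + 1
noisy_ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # hypothetical, close to a line

r2_perfect = r_squared(xs, perfect_ys)
r2_noisy = r_squared(xs, noisy_ys)
print(round(r2_perfect, 4), round(r2_noisy, 4))
```

Points that lie exactly on a line give an R² of 1, and the further the points scatter around the line, the lower the value falls.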
Part IV – Some Common Pitfalls of Regression / Trend Line Analysis
Pitfall #1: The Regression or Trend Line that is derived from the OLS method might be meaningful only for a limited range of numbers. • Pitfall #2: The most valid Regression or Trend Line for a particular set of data might not necessarily be linear. • Pitfall #3: Usually, a dependent variable is a function of several independent variables, not just one independent variable.
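Pitfalls #1 and #2 can be seen together in a small constructed case. The sketch below fits a straight line to hypothetical data that follow y = x² exactly: the relationship is perfectly deterministic, yet the linear fit reports no relationship at all, and extending the line beyond the observed range gives a badly wrong value. The helper name `fit_and_score` is an assumption made for this example.

```python
def fit_and_score(xs, ys):
    """Return (m, b, r2) for the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, b, 1 - ss_res / ss_tot

# Hypothetical data: y depends on x exactly, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

m, b, r2 = fit_and_score(xs, ys)

# Pitfall #1: using the line outside the observed range of x.
predicted_at_10 = m * 10 + b
actual_at_10 = 10 ** 2

print(m, b, r2, predicted_at_10, actual_at_10)
```

Here the fitted line is flat, R² is zero, and the value projected at x = 10 is far from the true value of 100, even though y is fully determined by x.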
Background • You work in the Planning Department; your boss comes to you with historical development data showing growth in the square footage of non-residential space. • An intern has compiled the data, and has calculated the increase in square footage, by type of non-residential space, that has occurred during a twenty-year time period. • The intern has taken the twenty-year increase and divided that number by twenty in order to derive an average annual increase in each type of square footage. • Your boss has used this average annual increase to estimate the number of square feet, by non-residential type, that the County can expect over the course of the next ten years.
Background (Cont.) • You are somewhat suspicious of the ten-year projection for industrial space, since the County had a net loss of jobs in the manufacturing sector during the course of the twenty years. • Assignment: • (a) Take the historical data for the industrial square footage and use MS Excel to derive an OLS trend line that fits this data; • (b) Graph the trend line, the trend line equation, and the R² value; and • (c) Using the trend line equation, project the total new industrial square footage that the County can expect during the course of the next ten years.
Assignment (Cont.) • Question: Is your estimate different from the estimate that your boss derived? If so, how large is the gap (both in absolute square footage and percentage terms)? • How “tightly” does the data fit around the trend line that you have derived? Do you have much confidence in your trend line?
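The class exercise is done in MS Excel with the County's actual data, which is not reproduced here. As a hedged sketch of the comparison the assignment asks for, the Python snippet below uses entirely hypothetical figures (thousands of square feet of industrial space at the end of each year) and assumes that "new square footage over ten years" is read as ten times the slope of the trend line. All variable names are illustrative.

```python
# Hypothetical industrial space, in thousands of sq ft, for years 0 through 20.
# Growth is fast in the first five years and slow afterward.
stock = [1000, 1080, 1160, 1240, 1320, 1400,
         1410, 1420, 1430, 1440, 1450, 1460, 1470, 1480,
         1490, 1500, 1510, 1520, 1530, 1540, 1550]
years = list(range(len(stock)))

# Intern's method: total twenty-year increase divided by twenty.
avg_annual_increase = (stock[-1] - stock[0]) / 20
endpoint_projection = avg_annual_increase * 10

# OLS trend line: stock = m * year + b.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(stock) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, stock))
sxx = sum((x - mean_x) ** 2 for x in years)
m = sxy / sxx
b = mean_y - m * mean_x
ols_projection = m * 10   # assumed reading: line(30) - line(20)

# How tightly the data fit around the trend line.
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(years, stock))
ss_tot = sum((y - mean_y) ** 2 for y in stock)
r2 = 1 - ss_res / ss_tot

# Gap between the two estimates, in absolute and percentage terms.
gap = ols_projection - endpoint_projection
gap_pct = 100 * gap / endpoint_projection

print(f"endpoint: {endpoint_projection:.1f}  OLS: {ols_projection:.1f}")
print(f"gap: {gap:.1f} ({gap_pct:.1f}%)  R^2: {r2:.3f}")
```

The endpoint method uses only the first and last observations, while the trend line uses every year, so the two estimates can differ whenever growth was uneven over the period. With real data the size and even the direction of the gap may differ from this made-up case.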