120 likes | 287 Views
Correlation: Relationships Can Be Deceiving. Outliers. An outlier is a data point that does not fit the overall trend. Speculate on what influence outliers have on correlation. For each picture, do you think the correlation is higher or lower than it would be without the outlier ?. Outliers.
E N D
Outliers An outlier is a data point that does not fit the overall trend. Speculate on what influence outliers have on correlation. For each picture, do you think the correlation is higher or lower than it would be without the outlier?
Outliers • An outlier that is consistent with the trend of the rest of the data will inflate the correlation. • An outlier that is not consistent with the rest of the data can substantially decrease the correlation.
Ages of Husbands and Wives Example of an outlier that should be removed. Subset of data on ages of husbands and wives, with one outlier added (entered 82 instead of 28 for husband’s age). • Correlation for data with outlier is .39. • If outlier removed, correlation of remainder points is .964 – a very strong linear relationship.
Legitimate Outliers, Illegitimate Correlation Outliers can be legitimate data and in some cases should not be removed (without convincing justification). • Be careful about removing outliers when… • presented with data in which outliers are likely to occur • correlations are presented for a small sample.
Legitimate Correlation Does Not Imply Causation Even if two variables are legitimately related or correlated, do not fall into the trap of believing there is a causal connection. This happens often with observational studies. In the absence of any other evidence, data from observational studies simply cannot be used to establish causation. An example of a silly correlation: List of weekly tissue sales and weekly hot chocolate sales for a city with extreme seasons would probably exhibit a correlation because both tend to go up in the winter and down in the summer.
Prostate Cancer and Red Meat Study Details: • Followed 48,000 men who filled out dietary questionnaires in 1986. • By 1990, 300 men diagnosed with prostate cancer and 126 had advanced cases. • For advanced cases: “men who ate the most red meat had a 164% higher risk than those with the lowest intake.” Possible third variable that both leads men to consume more red meat and increases risk of prostate cancer … the hormone testosterone. Example showing that Legitimate Correlation Does Not Imply Causation
Reasons Two Variables Could Be Related: • Explanatory variable is the direct cause of the response variable. • e.g. Amount of food consumed in past hour and level of hunger. • Response variable is causing a change in the explanatory variable. • e.g. Explanatory = advertising expenditures and Response = occupancy rates for hotels. • Explanatory variable is a contributing but not sole cause of the response variable. • e.g. Carcinogen in diet is not sole cause of cancer, but rather a necessary contributor to it. + C
Reasons Two Variables Could Be Related: • Confounding variables may exist. • A confounding variable is related to the explanatory variable and affects the response variable. So can’t determine how much change is due to the explanatory and how much is due to the confounding variable(s). • e.g. Emotional support is a confounding variable for the relationship between happiness and length of life. • Both variables may result from a common cause. • e.g. Verbal SAT and GPA: causes (such as time dedicated to studying) responsible for one variable being high (or low) are same as those responsible for the other being high (or low).
Reasons Two Variables Could Be Related: • Both variables are changing over time. • Nonsensical associations result from correlating two variables that have both changed over time. Divorce Rates and Drug Offenses Correlation between divorce rate and % admitted for drug offenses is 0.67, quite strong. However, both simply reflect a trend across time. • Correlation between year and divorce rate is 0.92. • Correlation between year and % admitted for drug offenses is 0.78.
Reasons Two Variables Could Be Related: • Association may be nothing more than coincidence. • Association is a coincidence, even though odds of it happening appear to be very small. Example: New office building opened and within a year there was an unusually high rate of brain cancer among workers in the building. Suppose odds of having that many cases in one building were only 1 in 10,000. But there are thousands of new office buildings, so we should expect to see this phenomenon just by chance in about 1 of every 10,000 buildings.
Confirming Causation The only legitimate way to try to establish a causal connection statistically is through the use of randomized experiments. If a randomized experiment cannot be done, then nonstatistical considerations must be used to determine whether a causal link is reasonable. • Evidence of a possible causal connection: • There is a reasonable explanation of cause and effect. • The connection happens under varying conditions. • Potential confounding variables are ruled out.