Explore the concept of relationships between variables and learn how to identify, measure, and analyze these relationships using statistical methods.
Chapter 2 Looking at Data - Relationships
Relations Among Variables • Response variable - Outcome measurement (or characteristic) of a study. Also called: dependent variable, outcome, and endpoint. Labelled as y. • Explanatory variable - Condition that explains or causes changes in response variables. Also called: independent variable and predictor. Labelled as x. • Theories usually are generated about relationships among variables and statistical methods can be used to test them. • Research questions are stated such as: Do changes in x cause changes in y?
Scatterplots • Identify the explanatory and response variables of interest, and label them as x and y • Obtain a set of n individuals and observe the pair (x_i, y_i) for each individual, giving n pairs • Statistical convention places the response variable (y) on the vertical (up/down) axis and the explanatory variable (x) on the horizontal (left/right) axis. (Note: economists reverse the axes in price/quantity demand plots.) • Plot the n points (x, y) on the graph
France August 2003 Heat Wave Deaths • Individuals: 13 cities in France • Response: Excess deaths (%), Aug 1-19, 2003 vs 1999-2002 • Explanatory variable: Change in mean temperature over the period (°C) • Data:
France August 2003 Heat Wave Deaths • Possible outlier
Example - Pharmacodynamics of LSD • Response (y) - Math score (mean of 5 volunteers) • Explanatory (x) - LSD tissue concentration (mean of 5 volunteers) • Raw data and scatterplot of score vs LSD concentration. Source: Wagner et al. (1968)
Manufacturer Production/Cost Relation • y = Amount Produced • x = Total Cost • n = 48 months (not in order)
Correlation • Numerical measure that summarizes the strength of the linear (straight-line) association between two variables • Bounded between -1 and +1 (labelled as r) • Values near -1: strong negative association • Values near 0: weak or no linear association • Values near +1: strong positive association • Not affected by linear transformations of either x or y, such as a change of units (multiplying by a negative constant flips only the sign of r) • Does not distinguish between response and explanatory variable (x and y can be interchanged)
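A minimal Python sketch of the correlation properties listed above. The data values are invented for illustration, and the helper name `correlation` is an assumption of this example rather than anything from the slides. It computes r from the usual sums of cross-products and squares, then recomputes it with x and y swapped and with x changed to different units.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r between paired lists xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(x, y)
r_swapped = correlation(y, x)                            # roles of x and y exchanged
r_rescaled = correlation([32 + 1.8 * v for v in x], y)   # linear change of units in x

print(r, r_swapped, r_rescaled)
```

The three values should agree up to floating-point rounding, which is what the last two bullets claim.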
Least-Squares Regression • Goal: Fit the line that “best fits” the relationship between the response variable and the explanatory variable • Equation of a straight line: y = a + bx • a - y-intercept (value of y when x = 0) • b - slope (amount y changes when x increases by 1 unit) • Prediction: Often want to predict what y will be at a given level of x (e.g. How much will it cost to fill an order of 1000 t-shirts?) • Extrapolation: Using a fitted line outside the range of the explanatory variable observed in the sample: BAD IDEA
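The prediction and extrapolation bullets can be sketched in a few lines of Python. The intercept, slope, and observed range below are hypothetical numbers chosen for illustration (they are not from the t-shirt example), and `predict` is a name invented for this sketch. The function evaluates a + bx only when x lies inside the range of x seen in the sample.

```python
def predict(a, b, x_new, x_min, x_max):
    """Return a + b * x_new, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]"
        )
    return a + b * x_new

# Hypothetical fitted line and observed range, invented for illustration.
a, b = 50.0, 2.5            # intercept and slope
x_min, x_max = 100, 2000    # smallest and largest x seen in the sample

cost_1000 = predict(a, b, 1000, x_min, x_max)   # inside the range: allowed
print(cost_1000)
```

Raising an error is one possible design choice; a warning would also do. The point is only that the fitted line carries no evidence about x values that were never observed.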
Least-Squares Regression • y = a + bx is a deterministic equation • Sample data don’t fall on a straight line, but rather around one • Obtain the equation that “best fits” a sample of data points • Error - Difference between observed response and predicted response (from the equation) • Least-squares criterion: Choose the line that minimizes the sum of squared errors. Resulting regression line:
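The formula that followed "Resulting regression line" did not survive in this transcript. For reference, the standard least-squares solution is given here in the slide's notation (a, b, y-hat). The symbols s_x and s_y, for the sample standard deviations of x and y, are introduced for this block and are an assumption about the original notation.

```latex
\text{Choose } a, b \text{ to minimize } \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2 ,
\qquad
\hat{y} = a + b x ,
\qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = r \, \frac{s_y}{s_x} ,
\qquad
a = \bar{y} - b \, \bar{x}
```

The second form of b shows that the slope has the same sign as the correlation r, and the formula for a shows that the fitted line always passes through the point (x-bar, y-bar).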
Excess French Heatwave Deaths • For each 1 °C increase in mean temperature, excess mortality increases by about 20 percentage points
Effect of an Outlier (Paris) • Re-fitting the model without Paris, which had a very high excess mortality (Using EXCEL):
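The refitting idea can be illustrated with a short Python sketch. The numbers below are invented and are NOT the French heat wave data (the slide's table and Excel output are not reproduced in this transcript). `fit_line` is a hypothetical helper that applies the standard least-squares formulas, once with all points and once with the unusual point dropped.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: the last point has an unusually high y for its x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 120.0]

a_all, b_all = fit_line(x, y)               # fit using every point
a_trim, b_trim = fit_line(x[:-1], y[:-1])   # refit without the unusual point

print("all points:     ", a_all, b_all)
print("without outlier:", a_trim, b_trim)
```

In this made-up example a single point changes the slope noticeably, which is the kind of comparison the slide describes for Paris.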
Squared Correlation • The squared correlation represents the fraction of the variation in the response variable that is “explained” by the explanatory variable • Represents the improvement (reduction in sum of squared errors) from using x (and the fitted equation y-hat) to predict y, as opposed to ignoring x (and simply using the sample mean y-bar) to predict y • 0 ≤ r² ≤ 1 • Values near 0: x does not help predict y (regression line flat) • Values near 1: x predicts y well (data near regression line)
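A small Python check of the "reduction in sum of squared errors" reading, under the assumption that the line is fitted by least squares. The data are invented for illustration. SSE and SST are labels introduced here: SSE is the sum of squared errors around the fitted line, and SST is the sum of squared errors around y-bar.

```python
from math import sqrt

def r_and_r_squared(xs, ys):
    """Return (r, r2), with r2 computed as 1 - SSE/SST from the fitted line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sst = sum((y - y_bar) ** 2 for y in ys)   # squared errors predicting with y-bar
    b = sxy / sxx                              # least-squares slope
    a = y_bar - b * x_bar                      # least-squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # errors using the line
    r = sxy / sqrt(sxx * sst)
    return r, 1 - sse / sst

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r, r2 = r_and_r_squared(x, y)
print(r, r2, r * r)
```

The proportional reduction in squared error, 1 - SSE/SST, matches the square of r up to rounding.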
Residual Analysis • Residuals: Differences between observed responses and their predicted values: residual = y - y-hat • Useful to plot the residuals versus the level of the explanatory variable (x) • Outliers: Large (positive or negative) residuals. Values of y that are inconsistent with prediction • Influential observations: Cases where the level of the explanatory variable is far away from the other individuals (extreme x values)
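A minimal Python sketch of computing residuals and locating the largest one. The data are invented for illustration, with one y value deliberately out of line. Picking the largest absolute residual is a simple choice made for this example, not a rule from the slides.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: y = 2x except for one inconsistent response at x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [2.0, 4.0, 6.0, 20.0, 10.0, 12.0, 14.0]

a, b = fit_line(x, y)
res = [yi - (a + b * xi) for xi, yi in zip(x, y)]     # observed minus predicted

# Index of the observation with the largest absolute residual.
worst = max(range(len(res)), key=lambda i: abs(res[i]))

for xi, ei in zip(x, res):
    print(f"x = {xi:4.1f}   residual = {ei:7.3f}")
print("largest residual at x =", x[worst])
```

Least-squares residuals sum to zero, so one large positive residual is balanced by small negative ones elsewhere. Here the unusual point sits at the middle x value, so it shifts the intercept but leaves the slope alone; a point with an extreme x value would have more pull on the slope.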
France Heatwave Mortality • Paris (outlier)
Miscellaneous Topics • Lurking Variable: Variable not included in the regression analysis that may influence the association between y and x. The resulting association between y and x is sometimes called spurious. • Association does not imply causation (it is one of several steps in demonstrating cause and effect) • Do not extrapolate outside the range of x observed in the study • Some relationships are not linear; the correlation can be low even when the relationship is strong • Correlations based on averages across individuals tend to be higher than those based on the individuals themselves
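The point about nonlinear relationships can be seen in a short Python sketch. The data are constructed for illustration so that y is determined exactly by x (y = x squared), yet the linear correlation is zero. The helper name `correlation` is an assumption of this example.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r between paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Constructed example: a perfect but curved relationship, symmetric about x = 0.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(r)
```

A scatterplot of these points would show the curve at once, which is why plotting the data before computing r is worthwhile.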
Causation • Association between x and y demonstrated • Time order confirmed (x “occurs” before y) • Alternative explanations are considered and explained away: • Lurking variables - Another variable causes both x and y • Confounding - Two explanatory variables are highly related, and which causes y cannot be determined • Dose-Response Effect • Plausible cause