ASSOCIATION: CONTINGENCY, CORRELATION, AND REGRESSION Chapter 3
3.1 The Association between Two Categorical Variables
Response and Explanatory Variables • Response variable (dependent, y): the outcome variable • Explanatory variable (independent, x): defines the groups to compare • Response/Explanatory examples: • Grade on test/Amount of study time • Yield of corn/Amount of rainfall
Association • Association: a value of one variable is more likely to occur with certain values of the other variable • Data analysis with two variables: • Tell whether there is an association • Describe that association
Contingency Table • Displays two categorical variables • The rows list the categories of one variable; the columns list the categories of the other • Entries in the table are frequencies (counts)
Contingency Table • What is the response (outcome) variable? The explanatory variable? • What proportion of organic foods contain pesticides? Of conventionally grown foods? • What proportion of all sampled foods contain pesticides?
Proportions & Conditional Proportions Side-by-side bar charts show conditional proportions and allow for easy comparison
Proportions & Conditional Proportions • If there were no association, the conditional proportions would be the same for each group • Here the conditional proportions differ, so there is an association
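The comparison of conditional proportions can be sketched in a few lines of Python. The counts below are hypothetical (the table from the slide is not reproduced in this transcript), and the function name `conditional_proportions` is only illustrative.

```python
# Hypothetical counts, NOT the data from the slide.
# Rows: explanatory variable (food type). Columns: response (pesticide status).
table = {
    "organic": {"present": 30, "absent": 90},
    "conventional": {"present": 600, "absent": 200},
}


def conditional_proportions(table):
    """For each row, return the proportion of the row total in each column."""
    result = {}
    for row, counts in table.items():
        total = sum(counts.values())
        result[row] = {col: n / total for col, n in counts.items()}
    return result


props = conditional_proportions(table)
for row, p in props.items():
    print(row, {col: round(v, 3) for col, v in p.items()})

# Overall (marginal) proportion with pesticides present, ignoring food type.
grand_total = sum(sum(counts.values()) for counts in table.values())
overall_present = sum(counts["present"] for counts in table.values()) / grand_total
print("overall present:", round(overall_present, 3))
```

With these made-up counts the proportion with pesticides present differs sharply between the two rows, which is what an association looks like; equal row proportions would indicate no association.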
Internet Usage & GDP Data Set
Scatterplot Graph of two quantitative variables: • Horizontal Axis: Explanatory, x • Vertical Axis: Response, y
Interpreting Scatterplots • The overall pattern includes trend, direction, and strength of the relationship • Trend: linear, curved, clusters, no pattern • Direction: positive, negative, no direction • Strength: how closely the points fit the trend • Also look for outliers from the overall trend
Used-car Dealership What association would we expect between the age of the car and mileage? • Positive • Negative • No association
Linear Correlation, r Measures the strength and direction of the linear association between x and y
Correlation Coefficient: Measuring Strength & Direction of a Linear Relationship • Positive r => positive association • Negative r => negative association • r close to +1 or -1 indicates strong linear association • r close to 0 indicates weak linear association
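As a rough illustration of these properties, r can be computed from the sums of squares and cross-products, r = Sxy / sqrt(Sxx * Syy). The car-age and mileage numbers below are hypothetical, not data from the slides.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: car age (years) and mileage (thousands of miles).
age = [1, 2, 3, 5, 7, 10]
mileage = [12, 25, 33, 60, 80, 115]

r = correlation(age, mileage)
print("r =", round(r, 3))  # close to +1: strong positive linear association

# Flipping the sign of one variable flips the sign of r.
r_negative = correlation(age, [-m for m in mileage])
print("r with mileage negated =", round(r_negative, 3))
```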
Regression Line • Predicts y, given x: ŷ = a + bx • a is the y-intercept and b is the slope • Only an estimate; actual data vary • Describes the relationship between x and the estimated means of y
Residuals • Prediction errors: vertical distance between a data point and the regression line • Each residual is: y − ŷ • A large residual indicates an unusual observation • For the least squares line, the sum of the residuals is always zero • Goal: minimize the distance from the data to the regression line
Least Squares Method • Residual sum of squares: Σ(y − ŷ)² • The least squares regression line minimizes this sum, that is, the squared vertical distances between the points and their predictions
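A minimal sketch of the least squares calculation, using the standard formulas b = Sxy / Sxx and a = ȳ − b·x̄. The data are hypothetical and the helper name `least_squares` is illustrative only.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residual_sum_of_squares(xs, ys, a, b):
    """Sum of squared residuals (y - y-hat)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
rss = residual_sum_of_squares(xs, ys, a, b)

print("intercept a =", round(a, 3), "slope b =", round(b, 3))
print("sum of residuals =", round(sum(residuals), 10))
print("residual sum of squares =", round(rss, 4))

# Any other line gives a larger residual sum of squares.
rss_other = residual_sum_of_squares(xs, ys, a + 0.1, b)
print("RSS for a shifted line =", round(rss_other, 4))
```

The residuals of the fitted line sum to zero (up to rounding), and shifting the line in any direction increases the residual sum of squares.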
Regression Analysis Identify response and explanatory variables • Response variable is y • Explanatory variable is x
Anthropologists Predict Height Using Remains? • Regression Equation: • ŷ is predicted height (cm) and x is the length of a femur, the thighbone (cm) • Predict height for femur length of 50 cm
Interpreting the y-Intercept and Slope • y-intercept: predicted y-value when x = 0 • Helps plot the line • Slope: change in predicted y for a 1-unit increase in x • A 1 cm increase in femur length means a 2.4 cm increase in predicted height
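The femur example can be sketched as a small prediction function. The slope of 2.4 comes from the slide; the intercept of 61.4 is an assumption for illustration, since the regression equation itself is not reproduced in this transcript.

```python
SLOPE = 2.4        # cm of predicted height per cm of femur length (from the slide)
INTERCEPT = 61.4   # ASSUMED value; the transcript does not state the intercept


def predict_height(femur_cm, a=INTERCEPT, b=SLOPE):
    """Predicted height in cm from the line y-hat = a + b*x."""
    return a + b * femur_cm


h50 = predict_height(50)
h51 = predict_height(51)

print("predicted height at 50 cm:", round(h50, 1))
# The slope is the change in predicted height per extra cm of femur.
print("change for 1 cm more femur:", round(h51 - h50, 1))
```

Whatever the true intercept is, the difference between the two predictions depends only on the slope, which is why the slope is read as the change in predicted y per unit of x.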
Slope and Correlation • Correlation, r: • Describes strength • No units • Same if x and y are swapped • Slope, b: • Doesn’t tell strength • Has units • Changes if x and y are swapped (not simply to 1/b, unless r = ±1)
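These contrasts can be checked numerically. The sketch below uses hypothetical data and an illustrative helper, `r_and_slope`; it relies on the standard identity that the two slopes (y on x, and x on y) multiply to r².

```python
from math import sqrt


def r_and_slope(xs, ys):
    """Return (r, b): the correlation and the slope of the line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy), sxy / sxx


# Hypothetical data.
x = [1, 2, 3, 4, 5, 6]
y = [3, 7, 5, 11, 9, 14]

r_xy, b_y_on_x = r_and_slope(x, y)
r_yx, b_x_on_y = r_and_slope(y, x)

print("r (x, y) =", round(r_xy, 4), " r (y, x) =", round(r_yx, 4))
print("slope of y on x =", round(b_y_on_x, 4))
print("slope of x on y =", round(b_x_on_y, 4), " vs 1/b =", round(1 / b_y_on_x, 4))

# Changing units (say y in hundredths) rescales the slope but leaves r alone.
y_rescaled = [100 * v for v in y]
r_rescaled, b_rescaled = r_and_slope(x, y_rescaled)
print("after rescaling y: r =", round(r_rescaled, 4), " slope =", round(b_rescaled, 2))
```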
Squared Correlation, r² • Proportional reduction in error, r² • Proportion of the variation in y-values explained by the linear relationship of y to x • A correlation, r, of 0.9 means r² = 0.81: 81% of the variation in y is explained by x
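One way to see r² as a proportional reduction in error is to compare the squared error from predicting every y with the mean of y against the squared error from the least squares line. This is a sketch on hypothetical data; the function name is illustrative.

```python
from math import sqrt


def proportional_reduction_in_error(xs, ys):
    """Return (r, reduction), where reduction = (TSS - RSS) / TSS."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Error when predicting every y by the mean of y (ignores x).
    tss = sum((y - mean_y) ** 2 for y in ys)
    # Error when predicting y from x with the least squares line.
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * tss)
    return r, (tss - rss) / tss


# Hypothetical data.
x = [1, 2, 3, 4, 5, 6]
y = [3, 7, 5, 11, 9, 14]

r, reduction = proportional_reduction_in_error(x, y)
print("r =", round(r, 4))
print("r squared =", round(r ** 2, 4))
print("proportional reduction in error =", round(reduction, 4))
```

The last two printed values agree, which is the sense in which r² measures the share of variation in y explained by the linear relationship with x.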
Extrapolation • Extrapolation: Predicting y for x-values outside range of data • Riskier the farther from the range of x • No guarantee trend holds Neil Weiss, Elementary Statistics, 7th Edition
Outliers and Influential Points • A regression outlier lies far from the trend of the rest of the data • An observation is influential if both: • Its x-value is low or high compared to the rest of the data • It is a regression outlier
Correlation Does Not Imply Causation Strong correlation between x and y means • Strong linear association between the variables • Does not mean x causes y Ex. 95.6% of cancer patients have eaten pickles, so do pickles cause cancer?
Lurking Variables & Confounding • Ice cream sales & drowning => lurking variable: temperature • Reading level & shoe size => lurking variable: age • Confounding: two explanatory variables are both associated with the response variable and with each other • Lurking variables: not measured in the study but may confound
Simpson’s Paradox Example Simpson’s Paradox: • The direction of an association between two variables reverses after a third variable is included • Probability of death, smokers = 139/582 = 24% • Probability of death, nonsmokers = 230/732 = 31%
Simpson’s Paradox Example Break out Data by Age
Simpson’s Paradox Example Associations look quite different after adjusting for third variable
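The reversal can be sketched in Python. The overall totals (139/582 and 230/732) are from the slides; the age-group table itself did not survive in this transcript, so the per-age counts below are assumed for illustration and chosen only so that they add up to the slide totals.

```python
# (deaths, group size) by age group and smoking status.
# ASSUMED breakdown for illustration; only the column totals come from the slides.
by_age = {
    "18-34": {"smoker": (5, 179), "nonsmoker": (6, 219)},
    "35-54": {"smoker": (41, 239), "nonsmoker": (19, 199)},
    "55-64": {"smoker": (51, 115), "nonsmoker": (40, 121)},
    "65+": {"smoker": (42, 49), "nonsmoker": (165, 193)},
}


def overall_rate(status):
    """Death rate for one smoking status, pooling all age groups."""
    deaths = sum(groups[status][0] for groups in by_age.values())
    total = sum(groups[status][1] for groups in by_age.values())
    return deaths / total


overall = {status: overall_rate(status) for status in ("smoker", "nonsmoker")}
print("overall:", {s: round(rate, 3) for s, rate in overall.items()})

# Within each age group, compare the two death rates.
within = {}
for age, groups in by_age.items():
    within[age] = {s: d / n for s, (d, n) in groups.items()}
    print(age, {s: round(rate, 3) for s, rate in within[age].items()})
```

Pooled over ages, smokers show the lower death rate, but under these assumed counts smokers have the higher rate within every age group. The pooled comparison is misleading here because the nonsmokers include far more people in the oldest group.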