This chapter focuses on analyzing the relationships between two variables, including categorical vs. numerical, categorical vs. categorical, and numerical vs. numerical. Learn how to determine if an association exists using graphical and numerical comparisons.
Chapter 3 Association: Contingency and Correlation
Looking Back • In Chapter 2 we learned how to: • Distinguish between categorical and numerical variables • Summarize a single variable using descriptive statistics • In practice, research investigations usually require analyzing more than one variable. • In this chapter we will focus on the relationships between two variables. There are three possible situations: • Categorical vs. Numerical • Categorical vs. Categorical • Numerical vs. Numerical
Relationships Between Two Variables • The first step is to determine which is the explanatory variable and which is the response variable. • Explanatory variable • Response variable • The explanatory variable influences the response variable.
Example 3.1 • Determine which variable is the explanatory and which is the response. Also identify the type of each variable. • Smoking (yes or no) & Survival after 20 years (yes or no) • Smoking: • Survival: • Sucrose level (in grams) & Fruit type (A,B or C) • Sucrose level: • Fruit type: • Daily amount of gasoline used by automobiles (in gallons) & Amount of air pollution (in ppm) • Gasoline use: • Air pollution: Example taken from Statistics: The Art and Science of Learning from Data
The Main Purpose of Two Variable Analysis • Determining whether an association exists. • An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable. • In other words, are the variables _______ or are they ____________ of each other? • If an association exists, then we need to know how to
Categorical vs. Numerical • In this case, • Categorical variable: • Numerical variable: • Use the following to determine if there is an association: • Graphical Comparison • Side-by-side boxplots (one boxplot per group) • Numerical Comparison • Summary statistics like the ______ and ___________ ___________ of each group
Example 3.2 • To determine the effect of pets on stress, researchers recruited 45 females who owned dogs. The subjects were randomly assigned to three groups of fifteen. Each group was asked to complete a stressful task alone, with a good friend, or with their dog present. The heart rate of each subject was measured. • Using the charts below, which group performed best? Example taken from The Practice of Statistics by Yates, Moore and Starnes
Determining Association Between Categorical and Numerical • Side-by-side boxplots • If one is much farther to the right or left than the other, then there is probably an association. • Means and standard deviations • If one of the means or standard deviations is very different, then there may be an association.
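The numerical comparison described above can be sketched in a few lines of Python. This is only an illustration: the heart rates below are made up for the sketch (they are not the data from Example 3.2), and the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical heart rates (beats per minute), NOT the data from Example 3.2.
heart_rates = {
    "alone": [88, 92, 79, 85, 90],
    "friend": [95, 99, 91, 102, 93],
    "dog": [70, 76, 68, 74, 72],
}

def summarize(groups):
    """Return {group: (mean, sample standard deviation)} for each group."""
    return {name: (mean(values), stdev(values)) for name, values in groups.items()}

summary = summarize(heart_rates)
for name, (m, s) in summary.items():
    # One line per group, so the centers and spreads can be compared directly.
    print(f"{name:>6}: mean = {m:.1f}, sd = {s:.1f}")
```

If the group means sit far apart relative to the standard deviations, as they do in this made-up data, that points toward an association in the sample.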
Determining Association Between Categorical and Numerical • When there is no association between the two variables, the boxplot and descriptive statistics will look something like this:
Example 3.3 • A teacher was interested in knowing “On average, is the fastest speed ever driven by TAMU men greater than that of TAMU women?” To study this, she took a random sample of 242 STAT 302 students in the Fall of 2004. • Which variable is the explanatory and which is the response? • Gender: • What are the values of the variable? • Fastest Speed: Used with permission from Dr. Ellen Toby
Example 3.3 • Do you think there is an association between gender and fastest speed?
Categorical vs. Categorical • In this case, determining which variable is the explanatory and which is the response is not always straightforward. • Use the following to determine if there is an association: • Numerical Comparison • Contingency tables • Graphical Comparison • Stacked bar charts • Side-by-Side pie charts
Example 3.4 • Many people purchase organic foods because they believe they are pesticide-free and therefore healthier than conventionally grown foods. Is it really worth the extra cost? Consumers Union led a study that sampled both organic and conventional foods, recording the presence or absence of pesticide residue on the food. • Which variable is the explanatory and which is the response? What are the values of the variables? • Explanatory: • Values: • Response: • Values: Example taken from Statistics: The Art and Science of Learning from Data
Contingency Tables • The rows list the categories of the explanatory variable. • The columns list the categories of the response variable. • Each row and column combination is called a cell. • There are four cells in the table below. • The number in each cell is the frequency of that particular combination.
Contingency Tables • Conditional Probability • The probability of an observation falling into one of the response categories given that it is in a particular explanatory category • The conditional probabilities in each row should sum to ______. • What is the conditional probability that there is pesticide on a randomly selected organic food in this sample? • What is the conditional probability that there is pesticide on a randomly selected conventional food in this sample?
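As a rough sketch of the calculation, each cell count is divided by its row total. The counts below are invented for illustration (they are not the study's data from Example 3.4), and the function name `conditional_proportions` is hypothetical.

```python
# Hypothetical counts, NOT the data from Example 3.4.
# Rows: explanatory variable (food type); columns: response (pesticide status).
table = {
    "organic": {"present": 30, "absent": 90},
    "conventional": {"present": 150, "absent": 50},
}

def conditional_proportions(counts):
    """Divide each cell count by the total of its row."""
    result = {}
    for row, cells in counts.items():
        total = sum(cells.values())
        result[row] = {col: n / total for col, n in cells.items()}
    return result

props = conditional_proportions(table)
for row, cells in props.items():
    print(row, {col: round(p, 2) for col, p in cells.items()})
```

Comparing the "present" proportion across the two rows is the numerical comparison used to judge association between two categorical variables.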
Stacked Bar Charts & Side-by-Side Pie Charts • Stacked Bar Charts • There is one bar for each category of the explanatory variable. • Sections within each bar represent conditional probabilities for each category of the response variable. • Side-by-Side Pie Charts • There is one pie for each category of the explanatory variable. • Slices within each pie represent conditional probabilities for each category of the response variable.
Determining Association Between Categorical and Categorical • Contingency Tables • Compare the percentages in the first row to the percentages in the second row. • If the percentages are very different, then there is probably an association. • Stacked Bar Charts & Side-by-Side Pie Charts • Compare the colored areas in the first bar to those in the second bar. • If they are very different, then there is probably an association.
Determining Association Between Categorical and Categorical • When there is no association between the two variables, the charts and contingency table will look something like this:
Example 3.5 • A study was done to determine if there is any association between gender and ability to differentiate between Coca-Cola and C2. • Are gender and the ability to tell the two sodas apart associated? Used with permission from Dr. Ellen Toby
Numerical vs. Numerical • Once again, determining which variable is the explanatory and which is the response is not always straightforward. • Use the following to determine if there is an association: • Graphical Comparison • Scatterplots • Numerical Comparison • Correlation
Example 3.6 • What is the relationship between the weight of a vehicle and the number of miles it travels per gallon of gasoline? To answer this question, a random sample of 25 vehicles was taken. The weight (in pounds) and the miles per gallon (MPG) were recorded for each car. • Which variable is the explanatory and which is the response? • Explanatory: • Response: Example taken from Statistics: The Art and Science of Learning from Data
Scatterplots • Treat each pair as an (x, y) coordinate:
Determining Association Between Numerical and Numerical • Look for a consistent change or pattern in the response variable as the explanatory variable increases. • The particular kind of association seen in Example 3.6 is called _______ __________. • We describe linear association numerically by a measure called correlation. • Correlation • Correlation summarizes the ____________ and ____________ of the linear relationship between two numerical variables. • It is denoted by the symbol R.
Correlation R • Properties of the Correlation R: • Takes values between -1 and 1 • R = 1 or R = -1 implies • R = 0 implies there is • R < 0 implies there is a ___________ association • R > 0 implies there is a ___________ association • Even if the X and Y variables are switched, the correlation will ________________.
Correlation R • The formula for correlation is • StatTools will calculate this for us • The association in Example 3.6 is _______ and ___________. • The correlation is R =
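The slide's formula did not survive in this text, so the sketch below assumes the standard Pearson sample correlation: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The data are hypothetical (not the 25 vehicles of Example 3.6), and `correlation` is an illustrative helper, not what StatTools does internally.

```python
from math import sqrt

def correlation(x, y):
    """Pearson sample correlation of two equal-length numeric lists (assumed formula)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical weights (pounds) and MPG, NOT the sample from Example 3.6.
weight = [2500, 3000, 3500, 4000, 4500]
mpg = [34, 30, 25, 22, 18]

r = correlation(weight, mpg)
r_switched = correlation(mpg, weight)  # same pairs with X and Y switched
print(round(r, 3), round(r_switched, 3))
```

Printing both values lets you check two of the listed properties against made-up data: where R falls between -1 and 1, and what happens when the X and Y variables are switched.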
Example 3.7 Used with permission from Dr. Ellen Toby
Correlation R • Correlation does not imply causation • Lurking variables • An unobserved variable influences the association between the explanatory variable and response variable. • Confounding variables • Two explanatory variables are both associated with a response variable. It is impossible to determine which variable causes the response.
Two Cases When R Is Not a Good Measure of Linear Association • Case 1: Outliers are present. • Case 2: Relationship between x and y is not linear.
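Both cases can be illustrated with small made-up data sets. The sketch below assumes the standard Pearson sample correlation and uses a hypothetical helper named `correlation`; none of the numbers come from the course examples.

```python
from math import sqrt

def correlation(x, y):
    """Pearson sample correlation of two equal-length numeric lists (assumed formula)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Case 1: a perfectly linear pattern, then the same pattern plus one outlier.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_clean = correlation(x, y)
r_outlier = correlation(x + [6], y + [-20])  # hypothetical outlier at (6, -20)

# Case 2: an exact but non-linear (quadratic) relationship, y = x squared.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v ** 2 for v in x_sym]
r_curve = correlation(x_sym, y_sq)

print(r_clean, r_outlier, r_curve)
```

In this made-up data a single outlier flips the sign of R, and the quadratic pattern produces an R of zero even though y is completely determined by x. This is why the scatterplot should be checked before R is interpreted.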
Important Points • Categorical Explanatory & Numerical Response • Numeric Summary • Measures of center and spread for each group • Graphical Summary • Histograms and QQ plots for each group, side-by-side boxplots • Categorical Explanatory & Response • Numeric Summary • Contingency tables with relative frequencies • Graphical Summary • Stacked bar charts and side-by-side pie charts
Important Points • Numerical Explanatory & Response • Numeric Summary • Correlation R • Graphical Summary • Scatterplot • At this point in the class, we can only report what we see in the sample data. • Later in the semester, we will learn about statistical methods that allow us to apply our findings to the overall population.