Chapter 3 Association: Contingency and Correlation
Looking Back • In Chapter 2 we learned how to: • Distinguish between categorical and numerical variables • Summarize a single variable using descriptive statistics • In practice, research investigations usually require analyzing more than one variable. • In this chapter we will focus on the relationships between two variables. There are three possible situations: • Categorical vs. Numerical • Categorical vs. Categorical • Numerical vs. Numerical
Relationships Between Two Variables • The first step is to determine which is the explanatory variable and which is the response variable. • Explanatory variable • Response variable • The explanatory variable influences the response variable.
Example 3.1 • Determine which variable is the explanatory and which is the response. Also identify the type of each variable. • Smoking (yes or no) & Survival after 20 years (yes or no) • Smoking: • Survival: • Sucrose level (in grams) & Fruit type (A,B or C) • Sucrose level: • Fruit type: • Daily amount of gasoline used by automobiles (in gallons) & Amount of air pollution (in ppm) • Gasoline use: • Air pollution: Example taken from Statistics: The Art and Science of Learning from Data
The Main Purpose of Two Variable Analysis • Determining whether an association exists. • An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable. • In other words, are the variables _______ or are they ____________ of each other? • If an association exists, then we need to know how to
Categorical vs. Numerical • In this case, • Categorical variable: • Numerical variable: • Use the following to determine if there is an association: • Graphical Comparison • Side-by-side boxplots (one boxplot per group) • Numerical Comparison • Summary statistics like the ______ and ___________ ___________ of each group
Example 3.2 • To determine the effect of pets on stress, researchers recruited 45 females who owned dogs. The subjects were randomly assigned to three groups of fifteen. Each group was asked to complete a stressful task alone, with a good friend, or with their dog present. The heart rate of each subject was measured. • Using the charts below, which group performed best? Example taken from The Practice of Statistics by Yates, Moore and Starnes
Determining Association Between Categorical and Numerical • Side-by-side boxplots • If one boxplot sits much farther to the right or left than the others, then there is probably an association. • Means and standard deviations • If the mean or standard deviation of one group is very different from those of the other groups, then there may be an association.
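The numerical comparison described above can be sketched in a few lines of Python. This is only an illustration: the heart-rate values and the helper name `summarize` are invented here and are not the data from Example 3.2.

```python
from statistics import mean, stdev

# Hypothetical heart rates (beats per minute), invented for illustration only.
# The group labels mirror Example 3.2, but these are NOT the study's data.
heart_rates = {
    "alone": [84, 88, 79, 91, 86],
    "friend": [93, 97, 89, 99, 92],
    "dog": [70, 75, 68, 77, 72],
}


def summarize(groups):
    """Return {group name: (mean, sample standard deviation)}."""
    return {name: (mean(values), stdev(values)) for name, values in groups.items()}


summary = summarize(heart_rates)
for name, (group_mean, group_sd) in summary.items():
    print(f"{name:>6}: mean = {group_mean:.1f}, sd = {group_sd:.1f}")
```

If the group means differ by much more than the group standard deviations, as they do in these made-up numbers, that suggests an association between the categorical and the numerical variable in the sample.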
Determining Association Between Categorical and Numerical • When there is no association between the two variables, the boxplot and descriptive statistics will look something like this:
Example 3.3 • A teacher was interested in knowing “On average, is the fastest speed ever driven by TAMU men greater than that of TAMU women?” To study this, she took a random sample of 242 STAT 302 students in the Fall of 2004. • Which variable is the explanatory and which is the response? • Gender: • What are the values of the variable? • Fastest Speed: Used with permission from Dr. Ellen Toby
Example 3.3 • Do you think there is an association between gender and fastest speed?
Categorical vs. Categorical • In this case, determining which variable is the explanatory and which is the response is not always straightforward. • Use the following to determine if there is an association: • Numerical Comparison • Contingency tables • Graphical Comparison • Stacked bar charts • Side-by-Side pie charts
Example 3.4 • Many people purchase organic foods because they believe they are pesticide free and therefore healthier than conventionally grown foods. Is it really worth the extra cost? The Consumers Union led a study which sampled both organic and conventional foods, recording the presence or absence of pesticide residue on the food. • Which variable is the explanatory and which is the response? What are the values of the variables? • Explanatory: • Values: • Response: • Values: Example taken from Statistics: The Art and Science of Learning from Data
Contingency Tables • The rows list the categories of the explanatory variable. • The columns list the categories of the response variable. • Each row and column combination is called a cell. • There are four cells in the table below. • The number in each cell is the frequency of that particular combination.
Contingency Tables • Conditional Probability • The probability of an observation falling into one of the response categories given that it is in a particular explanatory category • The conditional probabilities in each row should sum to ______. • What is the conditional probability that there is pesticide on a randomly selected organic food in this sample? • What is the conditional probability that there is pesticide on a randomly selected conventional food in this sample?
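As a rough sketch of how conditional probabilities are read off a contingency table, the Python below divides each cell by its row total. The counts and the function name `conditional_proportions` are hypothetical, chosen only to keep the arithmetic easy; they are not the counts from the pesticide study.

```python
# Hypothetical counts, invented for illustration; NOT the study's data.
# Rows are the explanatory variable (food type), columns are the response
# variable (pesticide residue present or absent).
counts = {
    "organic": {"present": 30, "absent": 70},
    "conventional": {"present": 150, "absent": 50},
}


def conditional_proportions(table):
    """Divide each cell by its row total to get row-wise conditional proportions."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: n / row_total for col, n in cells.items()}
    return result


props = conditional_proportions(counts)
for row, cells in props.items():
    print(row, {col: round(p, 2) for col, p in cells.items()})
```

With these made-up counts, the proportion with residue present is 30/100 for organic and 150/200 for conventional. Comparing those two row proportions is the numerical check for association.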
Stacked Bar Charts & Side-by-Side Pie Charts • Stacked Bar Charts • There is one bar for each category of the explanatory variable. • Sections within each bar represent conditional probabilities for each category of the response variable. • Side-by-Side Pie Charts • There is one pie for each category of the explanatory variable. • Slices within each pie represent conditional probabilities for each category of the response variable.
Determining Association Between Categorical and Categorical • Contingency Tables • Compare the percentages in the first row to the percentages in the second row. • If the percentages are very different, then there is probably an association. • Stacked Bar Charts & Side-by-Side Pie Charts • Compare the colored areas in the first bar to those in the second bar. • If they are very different, then there is probably an association.
Determining Association Between Categorical and Categorical • When there is no association between the two variables, the charts and contingency table will look something like this:
Example 3.5 • A study was done to determine if there is any association between gender and ability to differentiate between Coca-Cola and C2. • Are gender and soda preference associated? Used with permission from Dr. Ellen Toby
Numerical vs. Numerical • Once again, determining which variable is the explanatory and which is the response is not always straightforward. • Use the following to determine if there is an association: • Graphical Comparison • Scatterplots • Numerical Comparison • Correlation
Example 3.6 • What is the relationship between the weight of a vehicle and the number of miles it travels per gallon of gasoline? To answer this question, a random sample of 25 vehicles was taken. The weight (in pounds) and the miles per gallon (MPG) were recorded for each car. • Which variable is the explanatory and which is the response? • Explanatory: • Response: Example taken from Statistics: The Art and Science of Learning from Data
Scatterplots • Treat each pair as an (x, y) coordinate:
Determining Association Between Numerical and Numerical • Look for a consistent change or pattern in the response variable as the explanatory variable increases. • The particular kind of association seen in Example 3.6 is called _______ __________. • We describe linear association numerically by a measure called correlation. • Correlation • Correlation summarizes the ____________ and ____________ of the linear relationship between two numerical variables. • It is denoted by the symbol R.
Correlation R • Properties of the Correlation R: • Takes values between -1 and 1 • R = 1 or R = -1 implies • R = 0 implies there is • R < 0 implies there is a ___________ association • R > 0 implies there is a ___________ association • Even if the X and Y variables are switched, the correlation will ________________.
Correlation R • The formula for correlation is • StatTools will calculate this for us • The association in Example 3.6 is _______ and ___________. • The correlation is R =
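The slide's formula did not survive in this transcript. One common textbook form, which may or may not match the one originally shown, averages the products of the z-scores of x and y using n - 1 in the denominator. The Python sketch below implements that form. The function name `correlation` and the five weight/MPG pairs are assumptions made up for illustration; they are not the 25 vehicles of Example 3.6, and the class itself uses StatTools rather than Python.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(z_products) / (n - 1)


# Hypothetical vehicle weights (pounds) and fuel economy (MPG), invented here.
weights = [2500, 3000, 3500, 4000, 4500]
mpg = [34, 30, 25, 22, 18]

r = correlation(weights, mpg)
print(f"R = {r:.3f}")
```

For these invented values, heavier vehicles have lower MPG, so the computed R is negative and close to -1.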
Example 3.7 Used with permission from Dr. Ellen Toby
Correlation R • Correlation does not imply causation • Lurking variables • A lurking variable is an unobserved variable that influences the association between the explanatory variable and the response variable. • Confounding variables • Two explanatory variables are both associated with a response variable and with each other, so it is impossible to determine which one causes the response.
Two Cases When R Is Not a Good Measure of Linear Association • Case 1: Outliers are present. • Case 2: Relationship between x and y is not linear.
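Both cases can be illustrated with small made-up data sets. The sketch below is an assumption-laden example, not material from the original slides: the `correlation` helper and all data values are invented for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation computed as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Case 1 (hypothetical data): a perfectly linear pattern, then one outlier added.
xs_line = [1, 2, 3, 4, 5]
ys_line = [2, 4, 6, 8, 10]
r_clean = correlation(xs_line, ys_line)
r_outlier = correlation(xs_line + [6], ys_line + [-20])

# Case 2 (hypothetical data): y depends exactly on x, but along a curve.
xs_curve = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs_curve]
r_curve = correlation(xs_curve, ys_curve)

print(f"linear, no outlier:  R = {r_clean:.3f}")
print(f"linear, one outlier: R = {r_outlier:.3f}")
print(f"curved relationship: R = {r_curve:.3f}")
```

In this sketch a single outlier flips the sign of R, and a perfect but curved relationship gives an R of zero. That is why a scatterplot should be checked before relying on R.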
Important Points • Categorical Explanatory & Numerical Response • Numeric Summary • Measures of center and spread for each group • Graphical Summary • Histograms and QQ plots for each group, side-by-side boxplots • Categorical Explanatory & Response • Numeric Summary • Contingency tables with relative frequencies • Graphical Summary • Stacked bar charts and side-by-side pie charts
Important Points • Numerical Explanatory & Response • Numeric Summary • Correlation R • Graphical Summary • Scatterplot • At this point in the class, we can only report what we see in the sample data. • Later in the semester, we will learn about statistical methods that allow us to apply our findings to the overall population.