Business Statistics for Managerial Decision Making Examining Relationships
Examining Relationships • To study the relationship between two variables, we measure both variables on the same individuals. • Often we think that one of the variables explains or influences the other. • A response variable measures an outcome of a study. • An explanatory variable explains or influences changes in a response variable.
Scatter plot • A scatter plot shows the relationship between two quantitative variables measured on the same individuals. • The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. • Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.
Example • Scatter plot of the gross sales for each day in April 2000 against the number of items sold for the same day at Duck Worth Wearing company. • The dotted lines intersect at the point (72, 594), the data for April 22, 2000.
Examining a Scatter Plot • In any graph of data, look for the overall pattern and for striking deviations from the pattern. • You can describe the overall pattern of a scatter plot by the form, direction, and strength of the relationship.
Positive Association, Negative Association • Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together. • Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.
Example • City and highway fuel consumption for 2002 model two-seater cars. • There is one unusual observation. • Describe the pattern of the relationship between city and highway mileage.
Example • Scatter plot of life expectancy against domestic product per person for all the nations for which data are available. • Describe the form, direction, and strength of the overall pattern. • The three African nations marked on the graph are outliers.
Correlation • The correlation measures the direction and strength of the linear relationship between two quantitative variables. • Correlation is usually written as r. • Suppose we have data on variables x and y for n individuals. • The values for the first individual are $x_1$ and $y_1$, the values for the second individual are $x_2$ and $y_2$, and so on. • The means and standard deviations for the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values.
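Correlation • The correlation r between x and y is
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$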
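As a rough illustration, the usual definition of r (the sum of the products of standardized x and y values, divided by n − 1) can be sketched in Python. The function names and the five data pairs below are made up for illustration; they are not the data behind any plot in these slides.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical data: x and y measured on the same five individuals.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]
r = correlation(x_values, y_values)
print(f"r = {r:.4f}")
```

A positive result here reflects that larger x values tend to go with larger y values in the made-up data.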
Facts about Correlation • Correlation makes no distinction between explanatory and response variables. • Correlation requires that both variables be quantitative. • r does not change when we change the units of measurement of x, y, or both. • Positive r indicates positive linear association between variables, and negative r indicates negative linear association. • The correlation r is always a number between –1 and 1. • Correlation measures the strength of only the linear relationship between two variables.
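Several of these facts can be checked numerically. The sketch below is only an illustration on made-up numbers: it uses a compact form of r that is algebraically equivalent to the standardized-values definition, and the helper name `correlation` is an assumption of this example.

```python
from math import sqrt, isclose


def correlation(xs, ys):
    """Correlation r, written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_xy = correlation(xs, ys)

# No distinction between explanatory and response: swapping roles gives the same r.
r_yx = correlation(ys, xs)

# Changing units (rescaling x, rescaling and shifting y) leaves r unchanged.
xs_rescaled = [100 * x for x in xs]
ys_rescaled = [0.5 * y + 3 for y in ys]
r_rescaled = correlation(xs_rescaled, ys_rescaled)

# A perfect but curved relationship (y = x squared) can still have r = 0,
# because r measures only the linear part of a relationship.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

print(isclose(r_xy, r_yx), isclose(r_xy, r_rescaled), r_curve)
```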
Least Squares Regression • Correlation measures the direction and strength of the linear relationship between two quantitative variables. • If the scatter plot shows a linear relationship, we can summarize this overall pattern by drawing a line on the scatter plot. • A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. • We use a regression line to predict the value of y for a given value of x.
Least Squares Regression • This is a scatter plot of natural gas consumption for the Sanchez family. • Outside temperature, x, is measured by heating degree-days in a month. • The average amount of natural gas that the family uses per day during the month is y.
Least Squares Regression • How do we draw the least squares regression line? • Different people might draw different lines by eye on a scatter plot.
Least Squares Regression • The difference between the observed value and our prediction based on the least squares regression line is the error of our prediction. • This error is also called the residual.
Least Squares Regression • The least squares regression line of y on x is the line that makes the sum of the squares of the errors or residuals (vertical distances of the data points from the line) as small as possible. • Given the data on the explanatory variable, x, and the response variable, y, we can find the equation of the line with the smallest sum of squared residuals.
Least Squares Regression • The least squares regression line is: $\hat{y} = b_0 + b_1 x$ • Where the slope is: $b_1 = r \dfrac{s_y}{s_x}$ • And the intercept is: $b_0 = \bar{y} - b_1 \bar{x}$
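A minimal Python sketch of this calculation, assuming the standard least-squares formulas (slope b1 = r·s_y/s_x, intercept b0 = ȳ − b1·x̄). The function names and the data are hypothetical. The last lines compare the sum of squared residuals with that of two nearby lines, as an informal check that the fitted line gives the smaller sum.

```python
from math import sqrt


def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Nudging the slope or the intercept should increase the sum of squares.
steeper = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
shifted = sum_squared_residuals(xs, ys, b0 + 0.1, b1)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```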
Example • This table gives data on declines of at least 10% in the Standard & Poor's 500-stock index between 1940 and 1999. • The data show how far the index fell from its peak and how long the decline in stock prices lasted.
Example • Scatter plot of percent decline versus duration in months of the bear market. • Is there a linear association? • Is the association positive or negative?
Example • Calculation shows that the mean and standard deviation of the durations are: • For the declines are: • The correlation between duration and decline is: r = 0.6285 • Find the equation of the least-squares line for predicting decline from duration. • One bear market has a duration of 15 months but a very low decline of 14%. What is the predicted decline for a bear market with duration of 15 months? What is the residual for this particular bear market?
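The steps of this exercise can be sketched in Python, assuming the standard least-squares formulas (slope b1 = r·s_y/s_x, intercept b0 = ȳ − b1·x̄). Only r = 0.6285, the 15-month duration, and the 14% decline come from the text. The means and standard deviations did not survive in this transcript, so the four values marked PLACEHOLDER below are invented stand-ins and must be replaced with the real summary statistics before the numbers mean anything.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least-squares intercept and slope computed from summary statistics."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


R = 0.6285                      # correlation, from the text

# PLACEHOLDER values (not from the source): substitute the real statistics.
MEAN_DURATION = 10.0            # x_bar, months
SD_DURATION = 8.0               # s_x, months
MEAN_DECLINE = 25.0             # y_bar, percent
SD_DECLINE = 11.0               # s_y, percent

b0, b1 = line_from_summary(MEAN_DURATION, SD_DURATION, MEAN_DECLINE, SD_DECLINE, R)

duration = 15                   # months, from the text
observed_decline = 14           # percent, from the text

predicted_decline = b0 + b1 * duration
residual = observed_decline - predicted_decline   # observed minus predicted

print(f"decline_hat = {b0:.2f} + {b1:.2f} * duration")
print(f"predicted = {predicted_decline:.2f}, residual = {residual:.2f}")
```

Because the residual is observed minus predicted, a bear market whose decline is smaller than the line predicts has a negative residual.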
Residual plots • Recall: a residual is the difference between an observed value of the response variable and the value predicted by the regression line. • A residual plot is a scatter plot of the regression residuals against the explanatory variable x. • Residual plots help us assess the fit of a regression line. • The mean of the least-squares residuals is always zero.
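The claim that least-squares residuals average to zero can be checked on a small example. This is a sketch on made-up data. The slope is written as S_xy / S_xx, which is algebraically the same as r·s_y/s_x, and the helper names are assumptions of this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Observed y minus predicted y for each individual."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data for illustration only.
xs = [3, 7, 8, 12, 15, 20]
ys = [10, 14, 13, 20, 22, 31]

b0, b1 = fit_line(xs, ys)
res = residuals(xs, ys, b0, b1)
mean_residual = sum(res) / len(res)

# The (x, residual) pairs are what a residual plot displays.
for x, e in zip(xs, res):
    print(f"x = {x:>2}  residual = {e:+.3f}")
print(f"mean residual = {mean_residual:.2e}")
```

The printed mean is zero up to floating-point rounding.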
Residual plots • The residuals should have no systematic pattern. • The residual plot to the right shows a scatter of points with no unusual individual observations and no systematic change as x increases.
Residual plots • The points in this residual plot have a curved pattern, so a straight line fits poorly.
Residual plots • The points in this plot show more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.
Residual plots • Influential observations • An observation is influential if removing it would markedly change the fitted line. • Points that are extreme in the x direction of a scatter plot are often influential for the least squares regression line. • Observation 21 is an influential observation.
Residual plots • The solid line is calculated from all the data. • The dashed line is calculated leaving observation 21 out. • Observation 21 is an influential observation since leaving it out moves the regression line quite a bit.
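The idea of refitting the line with one observation left out can be sketched as follows. The six points are invented for illustration and are not the data behind the plot (there is no "observation 21" here); the last point is deliberately extreme in the x direction.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: five clustered points plus one far out in x.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 6]

# Line from all the data (the "solid line" in the slide's terms).
b0_all, b1_all = fit_line(xs, ys)

# Line with the extreme-x point left out (the "dashed line").
b0_without, b1_without = fit_line(xs[:-1], ys[:-1])

print(f"all points:      y_hat = {b0_all:.3f} + {b1_all:.3f} x")
print(f"point left out:  y_hat = {b0_without:.3f} + {b1_without:.3f} x")
```

In this made-up example the slope changes a great deal when the extreme point is dropped, which is what marks an observation as influential.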
Residual plots • Outlier • An observation that falls outside the overall pattern of the observations is called an outlier. • Points that are outliers in the y direction of a scatter plot have large regression residuals.
Caution about Correlation and Regression • Extrapolation • Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. • These predictions are often not accurate. • Lurking variable • A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
Caution about Correlation and Regression • Association is not causation • An association between an explanatory variable x and a response variable y, even if it is strong, is not by itself good evidence that changes in x actually cause changes in y. • Example: There is a high positive correlation between the number of television sets per person (x) and the average life expectancy (y) for the world’s nations. Could we lengthen the lives of people in Rwanda by shipping them TV sets? • The best way to get evidence that x causes y is to do an experiment in which we change x and keep lurking variables under control.
Relations in Categorical Data • Two-way tables • When two categorical variables, each with several levels, are studied, the data can be organized in a two-way table. • For example, let’s look at the relation between the payment method (cash, check, credit card) and type of purchase (impulse purchase, planned purchase).
Relations in Categorical Data • Marginal distribution • We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percentages. • Example: • 31/97 = 32.0% were impulse shoppers. • 48/97 = 49.5% paid by credit card.
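The marginal percentages come from dividing each total by the grand total. In the Python sketch below the counts 31, 48, and 97 are from the example; the planned-purchase total of 66 is derived here as 97 − 31, and the function name is an assumption of this illustration.

```python
def percent_distribution(counts):
    """Convert a dict of counts into percentages of their total."""
    total = sum(counts.values())
    return {level: 100 * count / total for level, count in counts.items()}


GRAND_TOTAL = 97          # all shoppers, from the example
IMPULSE_TOTAL = 31        # row total for impulse purchases, from the example
CREDIT_CARD_TOTAL = 48    # column total for credit card, from the example

# Marginal distribution of purchase type (planned total derived as 97 - 31).
purchase_type_totals = {
    "impulse": IMPULSE_TOTAL,
    "planned": GRAND_TOTAL - IMPULSE_TOTAL,
}
purchase_type_percents = percent_distribution(purchase_type_totals)

# Share of all shoppers who paid by credit card.
credit_card_percent = 100 * CREDIT_CARD_TOTAL / GRAND_TOTAL

for level, pct in purchase_type_percents.items():
    print(f"{level}: {pct:.1f}%")
print(f"credit card: {credit_card_percent:.1f}%")
```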
Relations in Categorical Data • The marginal distributions summarize each categorical variable independently. But the two-way table actually describes the relationship between both categorical variables. • The cells of a two-way table represent the intersection of a given level of one categorical factor with a given level of the other categorical factor.
Relations in Categorical Data • For example: • 35/97 = 36.1% of shoppers made a planned purchase and paid by credit card. • What percentage made an impulse purchase and paid cash?
Relations in Categorical Data • The percents within the table of the row variable for one specific value of the column variable represent the conditional distribution of the row variable. Comparing the conditional distributions allows you to describe the “relationship” between both categorical variables. • Example: • Among those who paid by credit card, 13/48 = 27.1% of the purchases were on impulse, and 35/48 = 72.9% of the purchases were planned.
Relations in Categorical Data • Similarly, the percents within the table of the column variable for one specific value of the row variable represent the conditional distribution of the column variable. • Example: • Among those who purchased on impulse: 13/31 = 41.9% paid by credit card, 4/31 = 12.9% paid by check, and 14/31 = 45.2% paid cash.
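Both conditional distributions in these examples come from the same step: take one row or one column of the two-way table and divide each cell by that row's or column's total. A small Python sketch, using only the cell counts quoted in the examples (the function name is an assumption of this illustration):

```python
def conditional_distribution(cell_counts):
    """Percent of each cell within a single row or column, rounded to 0.1."""
    total = sum(cell_counts.values())
    return {level: round(100 * count / total, 1) for level, count in cell_counts.items()}


# Credit card column: purchase type among shoppers who paid by credit card.
credit_card_column = {"impulse": 13, "planned": 35}

# Impulse row: payment method among shoppers who bought on impulse.
impulse_row = {"credit card": 13, "check": 4, "cash": 14}

type_given_credit_card = conditional_distribution(credit_card_column)
payment_given_impulse = conditional_distribution(impulse_row)

print(type_given_credit_card)
print(payment_given_impulse)
```

Comparing such distributions across rows, or across columns, is how the relationship between the two categorical variables is described.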