QMS 6351 Statistics and Research Methods: Analyzing the Relationship Between Two and More Variables. Chapter 2.4, Chapter 3.5, Chapter 14 (14.1-14.3, 14.6), Chapter 15 (15.1-15.3, 15.7). Prof. Vera Adamchik
Chapter 2 Section 2.4 Crosstabulations and Scatter Diagrams
Crosstabulations • Crosstabulation is a method that can be used to summarize the data for two variables simultaneously. • Typically, the table’s left and top margin labels define the classes for the two variables. • Crosstabulation can provide insight about the relationship between the variables.
Crosstabulations • Crosstabulation of Enrollment by Gender and Degree Level at a University (column percentages in parentheses)

| Gender | Undergraduate | Graduate | Doctorate | Total |
| --- | --- | --- | --- | --- |
| Male | 7341 (47.0%) | 1937 (53.4%) | 172 (59.1%) | 9450 (48.3%) |
| Female | 8294 (53.0%) | 1688 (46.6%) | 119 (40.9%) | 10101 (51.7%) |
| Total | 15635 (100.0%) | 3625 (100.0%) | 291 (100.0%) | 19551 (100.0%) |
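The percentages in the enrollment crosstabulation are column percentages: each count divided by the total for its degree level. A minimal Python sketch of that arithmetic follows; the counts are transcribed from the slide, and the variable names are only illustrative.

```python
# Counts transcribed from the enrollment crosstabulation.
counts = {
    "Male": {"Undergraduate": 7341, "Graduate": 1937, "Doctorate": 172},
    "Female": {"Undergraduate": 8294, "Graduate": 1688, "Doctorate": 119},
}
levels = ["Undergraduate", "Graduate", "Doctorate"]

# Column totals: total enrollment at each degree level.
col_totals = {lvl: sum(row[lvl] for row in counts.values()) for lvl in levels}

# Column percentages: share of each gender within a degree level.
col_pct = {
    gender: {lvl: 100 * row[lvl] / col_totals[lvl] for lvl in levels}
    for gender, row in counts.items()
}

for gender in counts:
    cells = ", ".join(f"{lvl}: {col_pct[gender][lvl]:.1f}%" for lvl in levels)
    print(f"{gender}: {cells}")
```

Reading across a row shows how the male share rises from the undergraduate to the doctorate level, which is the kind of insight a crosstabulation is meant to surface.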
Scatter Diagram • A scatter diagram is a graphical presentation of the relationship between two quantitative variables.
Scatter Diagram • Scatter Diagram for Engine Size and Gas Mileage of Eight Automobiles [Figure: in-city gas mileage (mpg) on the vertical axis plotted against engine size (number of cylinders) on the horizontal axis.]
Example: Reed Auto Sales • Reed Auto periodically has a special week-long sale. As part of the advertising campaign, Reed runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales, showing the number of TV ads run and the number of cars sold in each sale, are given below. Develop a scatter diagram.
Example (cont.)

| Number of TV Ads | Number of Cars Sold |
| --- | --- |
| 1 | 14 |
| 3 | 24 |
| 2 | 18 |
| 1 | 17 |
| 3 | 27 |
Chapter 3 Section 3.5 Measures of Association Between Two Variables • Covariance • Correlation Coefficient
Covariance • Covariance is a descriptive measure of the linear association between two variables. • The value of covariance depends upon the units of measurement, so its magnitude is hard to interpret. • A measure of the relationship between two variables that avoids this difficulty is the correlation coefficient.
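Covariance • If the data sets are samples, the covariance is denoted by $s_{xy}$: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ • If the data sets are populations, the covariance is denoted by $\sigma_{xy}$: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$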
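Example: Reed Auto Sales • Sample covariance: $s_{xy} = 20/4 = 5$ (autos × TV ads), where 20 is the sum of the products of deviations $\sum (x_i - \bar{x})(y_i - \bar{y})$ and $4 = n - 1$.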
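The sample covariance for the Reed Auto data can be checked with a short Python sketch. The function name `sample_covariance` is only illustrative; the data are the five (TV ads, cars sold) pairs from the example.

```python
# Reed Auto Sales sample: number of TV ads and number of cars sold.
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_products / (n - 1)


s_xy = sample_covariance(tv_ads, cars_sold)
print(s_xy)
```

The divisor is n - 1 because the data are a sample; for a population the divisor would be N.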
Correlation Coefficient • If the data sets are samples, the correlation coefficient is denoted by $r_{xy}$: $r_{xy} = \frac{s_{xy}}{s_x s_y}$ • If the data sets are populations, the correlation coefficient is denoted by $\rho_{xy}$: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Correlation Coefficient • The coefficient can take on values between -1 and +1. • If $r$ or $\rho$ is near -1, it indicates a strong negative linear relationship. • If $r$ or $\rho$ is near +1, it indicates a strong positive linear relationship.
Example: Reed Auto Sales • $s_x^2 = 4/4 = 1$; $s_x = 1$. • $s_y^2 = 114/4 = 28.5$; $s_y = 5.3385$. • Correlation coefficient: $r_{xy} = 5/(1 \times 5.3385) = 0.936586$. • A strong positive linear relationship.
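The same numbers can be reproduced in Python as a check on the hand calculation. This is a sketch under the sample formulas (divisor n - 1 throughout); the helper names are illustrative.

```python
import math

# Reed Auto Sales sample: number of TV ads and number of cars sold.
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    mx, my = mean(x), mean(y)
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    return s_xy / (math.sqrt(sample_variance(x)) * math.sqrt(sample_variance(y)))


var_x = sample_variance(tv_ads)
var_y = sample_variance(cars_sold)
r_xy = sample_correlation(tv_ads, cars_sold)
print(var_x, var_y, round(r_xy, 6))
```

Unlike the covariance, the result has no units, so it would be unchanged if cars sold were recorded in dozens instead of single cars.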
If $r$ or $\rho$ = 1, it is a case of perfect positive linear correlation (all points are on a positively sloped straight line). • If $r$ or $\rho$ = -1, it is a case of perfect negative linear correlation (all points are on a negatively sloped straight line). • If $r$ or $\rho$ = 0, there is no linear correlation between the two variables (the points are scattered all over the diagram).
We would like to find an analytical/mathematical expression (a formula) for the relationship between TV ads and auto sales. • Both a scatter diagram and correlation coefficient suggest that there is a linear relationship between TV ads and auto sales.
Chapter 14 Outline • The simple linear regression model • The Least Squares Method • The coefficient of determination
Regression analysis • Regression analysis is the study and description of the nature of the relationship between variables (for example, linear or non-linear regression, simple or multiple regression).
Functional vs. stochastic relationship • Functional (deterministic) relationship: the variables are perfectly related; the relationship holds for every observation. For example, the area of a square in mathematics, total revenue in economics. • Statistical (stochastic) relationship: the variables are not perfectly related; the relationship holds on average, not for each observation. For example, the marginal propensity to consume (MPC) in economics.
The simple linear regression • The simple linear regression model is a mathematical way of stating the linear statistical relationship between two variables. • The variable being predicted is called the dependent variable. • The variable being used to predict the value of the dependent variable is called the independent variable.
Regression equation • Regression equation: the equation that describes how the mean value (that is, the average) of the dependent variable ($y$) is related to the independent variable(s) ($x$). • Simple Linear Regression Equation: $E(y) = \beta_0 + \beta_1 x$ • $\beta_0$ and $\beta_1$ are referred to as the parameters of the model.
Regression model • Regression model: the equation that describes how the dependent variable is related to the independent variable(s) and an error term. • Simple Linear Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$ • $\varepsilon$ (the Greek letter epsilon) is a random variable referred to as the error term. It absorbs the impact of all other variables on $y$.
Estimated regression equation • We will use a sample to estimate the population parameters $\beta_0$ and $\beta_1$. Sample statistics (denoted $b_0$ and $b_1$) serve as estimates of $\beta_0$ and $\beta_1$. Substituting the values of $b_0$ and $b_1$ in the regression equation, we obtain the estimated regression equation. • Estimated Simple Linear Regression Equation: $\hat{y} = b_0 + b_1 x$ • $\hat{y}$ is the estimated mean value of $y$ for a given value of $x$.
The Least Squares Method • Least Squares Criterion: $\min \sum (y_i - \hat{y}_i)^2$, where $y_i$ = observed value of the dependent variable for the $i$th observation and $\hat{y}_i$ = estimated value of the dependent variable for the $i$th observation.
The Least Squares Method • Slope for the Estimated Regression Equation: $b_1 = \frac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$ (this formula appears in the footnote on p. 568) • y-Intercept for the Estimated Regression Equation: $b_0 = \bar{y} - b_1 \bar{x}$
Example: Reed Auto Sales • Slope for the Estimated Regression Equation: $b_1 = \frac{220 - (10 \times 100)/5}{24 - (10)^2/5} = \frac{20}{4} = 5$ • y-Intercept for the Estimated Regression Equation: $b_0 = 20 - 5(2) = 10$ • Estimated Regression Equation: $\hat{y} = 10 + 5x$
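The slope and intercept for the Reed Auto example can be recomputed from the raw data. The sketch below uses the computational form of the slope that matches the sums in the example (220, 10, 100, 24); the function name `least_squares_line` is illustrative.

```python
# Reed Auto Sales sample: number of TV ads (x) and number of cars sold (y).
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def least_squares_line(x, y):
    """Return (b0, b1) for the estimated regression equation y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: computational formula built from the raw sums.
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: the fitted line passes through the point of means.
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares_line(tv_ads, cars_sold)
print(f"y_hat = {b0:g} + {b1:g}x")
```

Any other choice of intercept and slope gives a larger sum of squared residuals on these five points, which is what the least squares criterion requires.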
Interpretation • $b_0$ is the expected value of $y$ when $x = 0$ (this may be meaningless in practice). In our example, when the number of TV ads is zero, the expected number of cars sold is 10. • $b_1$ is the change in the expected value of $y$ when $x$ changes by 1 unit of its measurement, ceteris paribus. In our example, when the number of TV ads increases by 1, the number of cars sold is expected to increase by 5 cars.
SST, SSR, SSE • Relationship Among SST, SSR, SSE: SST = SSR + SSE • SST: total variation in Y • SSR: variation in Y due to X • SSE: variation in Y due to all other factors
Coefficient of Determination • The coefficient of determination represents the proportion of SST that is explained by the use of the regression model. • Coefficient of Determination: $r^2 = \text{SSR}/\text{SST}$, with $0 \le r^2 \le 1$
Example: Reed Auto Sales • Coefficient of Determination: $r^2 = \text{SSR}/\text{SST} = 100/114 = 0.877193$ • The regression relationship is very strong since 87.7% of the variation in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
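The three sums of squares behind this result can be computed directly. The sketch below assumes the usual definitions (SST from deviations of y around its mean, SSR from deviations of the fitted values around that mean, SSE from the residuals) and takes the fitted line $\hat{y} = 10 + 5x$ from the example.

```python
# Reed Auto Sales sample and the fitted line y_hat = 10 + 5x from the example.
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0

mean_y = sum(cars_sold) / len(cars_sold)
fitted = [b0 + b1 * x for x in tv_ads]

# Total variation in y.
sst = sum((y - mean_y) ** 2 for y in cars_sold)
# Variation in y explained by x through the fitted line.
ssr = sum((y_hat - mean_y) ** 2 for y_hat in fitted)
# Variation in y left unexplained (squared residuals).
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(cars_sold, fitted))

r_squared = ssr / sst
print(sst, ssr, sse, round(r_squared, 6))
```

The printed values also confirm the decomposition SST = SSR + SSE for this sample.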
The Correlation Coefficient • The correlation coefficient measures the strength of the linear association between two variables. • The sample correlation coefficient is plus or minus the square root of the coefficient of determination, taking the sign of $b_1$. • Sample correlation coefficient: $r_{xy} = (\text{sign of } b_1)\sqrt{r^2} = +\sqrt{0.877193} = 0.936586$
Chapter 15 Outline • The multiple linear regression model • The Least Squares Method • The multiple coefficient of determination • Categorical independent variables
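Multiple Regression Equation: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$ • Multiple Regression Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$ • Estimated Multiple Regression Equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$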
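Multiple coefficient of determination: $R^2 = \text{SSR}/\text{SST}$ • Adjusted multiple coefficient of determination: $R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ is the number of independent variables.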
Example: Programmer Salary Survey • A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test. • The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
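| Exper. (Yrs.) | Test Score | Salary ($000s) |
| --- | --- | --- |
| 4 | 78 | 24.0 |
| 7 | 100 | 43.0 |
| 1 | 86 | 23.7 |
| 5 | 82 | 34.3 |
| 8 | 86 | 35.8 |
| 10 | 84 | 38.0 |
| 0 | 75 | 22.2 |
| 1 | 80 | 23.1 |
| 6 | 83 | 30.0 |
| 6 | 91 | 33.0 |
| 9 | 88 | 38.0 |
| 2 | 73 | 26.6 |
| 10 | 75 | 36.2 |
| 5 | 81 | 31.6 |
| 6 | 74 | 29.0 |
| 8 | 87 | 34.0 |
| 4 | 79 | 30.1 |
| 6 | 94 | 33.9 |
| 3 | 70 | 28.2 |
| 3 | 89 | 30.0 |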
Suppose we believe that salary ($y$) is related to the years of experience ($x_1$) and the score on the programmer aptitude test ($x_2$) by the following regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, where $y$ = annual salary ($000), $x_1$ = years of experience, $x_2$ = score on programmer aptitude test.
Solving for the Estimates of β0, β1, β2 • Excel’s Regression Equation Output [output table not reproduced]. Note: Columns F-I are not shown.
Estimated Regression Equation SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE) Note: Predicted salary will be in thousands of dollars.
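The coefficients reported by Excel can be cross-checked without a spreadsheet. The sketch below is not the Excel procedure: it solves the least squares normal equations with a small hand-written Gaussian elimination, and it assumes the 20 survey rows are paired as (experience, test score, salary) in the order shown in the data.

```python
# Programmer salary survey: (experience in years, aptitude test score, salary in $1000s).
# The pairing of the three columns is an assumption based on the survey table.
data = [
    (4, 78, 24.0), (7, 100, 43.0), (1, 86, 23.7), (5, 82, 34.3), (8, 86, 35.8),
    (10, 84, 38.0), (0, 75, 22.2), (1, 80, 23.1), (6, 83, 30.0), (6, 91, 33.0),
    (9, 88, 38.0), (2, 73, 26.6), (10, 75, 36.2), (5, 81, 31.6), (6, 74, 29.0),
    (8, 87, 34.0), (4, 79, 30.1), (6, 94, 33.9), (3, 70, 28.2), (3, 89, 30.0),
]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


# Design rows [1, x1, x2]; the leading 1 carries the intercept b0.
rows = [[1.0, float(x1), float(x2)] for x1, x2, _ in data]
salaries = [salary for _, _, salary in data]

# Normal equations: (X'X) b = X'y.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, salaries)) for i in range(3)]

b0, b1, b2 = solve_linear_system(xtx, xty)
print(f"SALARY = {b0:.3f} + {b1:.3f}(EXPER) + {b2:.3f}(SCORE)")
```

In practice a statistics package would be used instead of hand-rolled elimination; the point here is only that the estimates follow from the data by the least squares method.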
Interpreting the Coefficients • In multiple regression analysis, we interpret each regression coefficient as follows: $b_i$ represents an estimate of the change in $y$ corresponding to a 1-unit increase in $x_i$ when all other independent variables are held constant.
Interpreting the Coefficients • $b_1 = 1.404$: Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on programmer aptitude test is held constant).
Interpreting the Coefficients b2 = 0.251 Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the variable years of experience is held constant).
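The "held constant" interpretation can be illustrated by evaluating the estimated equation twice and comparing. In this sketch the coefficients are the rounded values from the estimated regression equation, and the inputs (4 years of experience, a test score of 78) are just an illustrative choice.

```python
# Rounded coefficients from the estimated regression equation.
B0, B1, B2 = 3.174, 1.404, 0.251


def predicted_salary(experience_years, test_score):
    """Predicted annual salary in $1000s."""
    return B0 + B1 * experience_years + B2 * test_score


base = predicted_salary(4, 78)
# One more year of experience, test score held constant.
gain_per_year = predicted_salary(5, 78) - base
# One more test point, experience held constant.
gain_per_point = predicted_salary(4, 79) - base

print(round(base, 3), round(gain_per_year, 3), round(gain_per_point, 3))
```

Because the model is linear, the two differences equal the coefficients themselves (in $1000s) no matter which starting values are chosen.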
Multiple Coefficient of Determination • Excel’s ANOVA Output [output table not reproduced; the slide marks the SSR and SST entries]
Multiple Coefficient of Determination • $R^2 = \text{SSR}/\text{SST}$ • $R^2 = 500.3285/599.7855 = 0.83418$
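Both goodness-of-fit measures can be computed from the two sums of squares. The sketch below assumes the standard adjusted formula $R_a^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, with n = 20 programmers and p = 2 independent variables taken from the example; the function names are illustrative.

```python
# Sums of squares from the ANOVA output of the programmer salary example.
SSR = 500.3285
SST = 599.7855
N_OBS = 20    # programmers in the sample
N_VARS = 2    # independent variables: experience and test score


def r_squared(ssr, sst):
    """Proportion of total variation explained by the regression."""
    return ssr / sst


def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = r_squared(SSR, SST)
r2_adj = adjusted_r_squared(r2, N_OBS, N_VARS)
print(round(r2, 5), round(r2_adj, 5))
```

The adjusted value is always at most the unadjusted one, and the gap grows as more independent variables are added relative to the sample size.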