Chapter 10. Linear regression and correlation
Relationship between variables • Age and blood pressure • Nutrient level and growth of cells • Height and weight Goal: to determine the strength of the relationship between two variables and to test whether it is statistically significant
Difference, variation and association analysis • Two groups (0, 1; categorical): difference analysis with a t test • Several groups (A, B, C; categorical): analysis of variance • Both variables quantitative: association analysis by regression and correlation
Sir Francis Galton (16 February 1822 – 17 January 1911). Polymath: meteorology (the anti-cyclone and the first popular weather maps); psychology (synaesthesia); biology (the nature and mechanism of heredity); eugenics; criminology (fingerprints); statistics (regression and correlation).
Related but Different Regression analysis: one of the variables (e.g. blood pressure) is dependent on (caused by) the other, which is fixed and measured without error (e.g. age). Correlation analysis: both variables are experimental and measured with error (e.g. height and weight).
Regression analysis The experimental data: repeated experiments at fixed values of X
Correlation analysis The experimental data: more individuals measured, with both variables observed on each
Regression analysis Equation for a straight line: Y = a + bX. If you know a and b, you can predict Y from X: that is the goal of regression.
Concepts • Simple linear regression • Simple linear correlation • Correlation analysis based on ranks
Example • Consider the growth rate of a yeast colony and the nutrient level. • If you increase the nutrient level, the growth rate increases. • Growth rate is dependent on nutrient level, but nutrient level is NOT dependent on growth rate.
Variables in Regression • Growth rate is called the Dependent Variable and is given the symbol Y. • Nutrient level (the causal factor) is called the Independent Variable and is given the symbol X. [Figure: growth rate Y plotted against nutrient level X]
Simple linear model assumptions • Model: Yi = α + βXi + εi, where α and β are constant real numbers and β ≠ 0 • The X's are fixed and measured without error • Homoscedastic: the errors εi are independent, identically distributed N(0, σ²), with the same variance at every X
General steps for simple linear regression analysis • Graphing the data • Fitting the best straight line • Testing whether the linear relationship is statistically significant or not
Graphing the data • A scatterplot may show: no relationship; a relationship that is not straight-lined; a negative linear relationship; a positive linear relationship. Fitting the best straight line • Which line is best? We need a criterion.
Example: area of a yeast colony (y) on successive days (x). [Figure: the best-fit line; the intercept a is the value of y at x = 0, and the slope b = H/L is the rise H over the run L.]
Problem How to estimate a and b? [Figure: area (y) versus time in days (x)]
Method Fitting to the data • Total sum of squares for Y: SSTotal = Σ(Yi − Ȳ)²
Method Fitting to the data • a and b should minimize the residual error • Residual error sum of squares: SSE = Σ(Yi − Ŷi)², where Ŷi = a + bXi [Figure: area (y) versus time in days (x), with the fitted line and residuals]
SSTotal = SSR + SSE • Sum of Squares Total: SSTotal = Σ(Yi − Ȳ)² • Sum of Squares due to regression: SSR = Σ(Ŷi − Ȳ)² (the fit maximizes this) • Sum of Squares Residual or error: SSE = Σ(Yi − Ŷi)² (the fit minimizes this)
Least Squares Regression Equation • Minimize SSE by setting its partial derivatives to zero: ∂SSE/∂a = 0 and ∂SSE/∂b = 0
Result • Least squares regression line: Ŷ = a + bX, with b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and a = Ȳ − bX̄
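The least-squares formulas above can be sketched in Python. The x and y values below are illustrative (days versus colony area), not data from the chapter:

```python
# Least-squares fit of y = a + b*x, following
# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and a = y_bar - b*x_bar.
# The dataset is made up for illustration, not taken from the chapter.

def fit_line(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx               # slope
    a = y_bar - b * x_bar       # intercept
    return a, b

x = [1, 2, 3, 4, 5]             # time, days
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # area
a, b = fit_line(x, y)
print(f"y-hat = {a:.3f} + {b:.3f} * x")
```

For these numbers the slope comes out near 2 and the intercept near 0, matching the roughly linear spacing of the y values.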
Simple Linear Regression Analysis • A global test for regression (ANOVA) • A test for regression coefficient (Student t test)
Hypothesis • H0: The variation in Y is not explained by a linear model, i.e., β = 0 • Ha: A significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0
The ANOVA table for a regression analysis

Source of variation   SS        DF     MS     F          c.v.
Regression            SSR       1      MSR    MSR/MSE    See Table C.7
Error                 SSE       n-2    MSE
Total                 SSTotal   n-1

Test statistic: F = MSR/MSE with 1 and n − 2 degrees of freedom. If H0 is true (β = 0), E(MSR) = E(MSE) and F ≈ 1; if Ha is true, E(MSR) > E(MSE) and F tends to be large.
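A minimal sketch of the regression ANOVA on an illustrative dataset (not the chapter's data): partition SSTotal into SSR + SSE and form F = MSR/MSE with 1 and n − 2 degrees of freedom.

```python
# Regression ANOVA sketch on illustrative data (not from the chapter).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)              # SSTotal, n-1 df
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)            # SSR, 1 df
ss_err = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # SSE, n-2 df

msr = ss_reg / 1
mse = ss_err / (n - 2)
f_stat = msr / mse   # compare to the tabulated F(1, n-2) critical value
print(f"SSR = {ss_reg:.3f}, SSE = {ss_err:.3f}, F = {f_stat:.1f}")
```

The identity SSTotal = SSR + SSE holds for any least-squares fit, so the two computed components sum to the total up to rounding.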
Coefficient of determination r² = SSR/SSTotal: a measure of the amount of the variability in Y that is explained by its dependence on X.
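Continuing the illustrative dataset used in the earlier sketches, r² is simply the regression sum of squares divided by the total:

```python
# r^2 = SSR / SSTotal: fraction of Y's variability explained by X.
# Same illustrative data as the other sketches, not from the chapter.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_reg = sum((a + b * xi - y_bar) ** 2 for xi in x)
r2 = ss_reg / ss_total
print(f"r^2 = {r2:.4f}")  # near 1 here: X explains almost all the variation
```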
Simple Linear Regression Analysis • A global test for regression (ANOVA) • A test for regression coefficient (Student t test)
Hypothesis • H0: The variation in Y is not explained by a linear model, i.e., β = 0 • Ha: A significant portion of the variation in Y is explained by a linear model, i.e., β ≠ 0
t test statistic • Variance of b: σb² = σ² / Σ(Xi − X̄)² • Its estimate: sb² = MSE / Σ(Xi − X̄)² • Standard error of b: sb = √(MSE / Σ(Xi − X̄)²) • Test statistic: t = b / sb, with n − 2 degrees of freedom
Confidence interval for β • (b − β) / sb follows Student's t distribution with n − 2 df • Confidence interval: b ± t(α/2, n−2) · sb
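These formulas can be sketched as follows; the data are illustrative, and 3.182 is the tabulated two-sided 5% critical value t(0.025, 3) for n − 2 = 3 df:

```python
import math

# t test and confidence interval for the slope beta (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

s_b = math.sqrt(mse / sxx)   # standard error of b
t_stat = b / s_b             # t with n - 2 df under H0: beta = 0
t_crit = 3.182               # tabulated t(0.025, n-2 = 3)
ci = (b - t_crit * s_b, b + t_crit * s_b)
print(f"t = {t_stat:.2f}, 95% CI for beta: ({ci[0]:.3f}, {ci[1]:.3f})")
```

A slope is judged significant at the 5% level when |t| exceeds the tabulated critical value, i.e., when the confidence interval excludes 0.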
Sampling error Confidence Interval for μY|x0, the mean of Y at a given x0 • Since Ŷ0 = a + bx0 estimates μY|x0 • And its sampling variance is σ² (1/n + (x0 − X̄)² / Σ(Xi − X̄)²) • Standard error of Ŷ0: sŶ0 = √(MSE · (1/n + (x0 − X̄)² / Σ(Xi − X̄)²))
Confidence Interval for μY|x0 • (Ŷ0 − μY|x0) / sŶ0 follows Student's t distribution with n − 2 df • Confidence interval: L1, L2 = Ŷ0 ± t(α/2, n−2) · sŶ0
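A sketch of the interval for the mean of Y at a chosen x0, again on illustrative data with the tabulated t(0.025, 3) = 3.182:

```python
import math

# Confidence interval for the mean of Y at x0 (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 3.0
y0_hat = a + b * x0
# standard error of the estimated mean response at x0;
# smallest at x0 = x_bar and growing as x0 moves away from it
se = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))
t_crit = 3.182               # tabulated t(0.025, n-2 = 3)
lo, hi = y0_hat - t_crit * se, y0_hat + t_crit * se
print(f"mean Y at x0 = {x0}: {y0_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```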
Example 1: Yield of tomato varieties Summarized data and totals: [table omitted]
A. Student's t test • Variances: there is no difference between the two variances, so accept H0 for the variance test • Means: there is a difference between the two means, so reject H0
B. ANOVA Conclusion: Reject H0
Compare ANOVA with t test • t was 2.16 for 18 df, 0.01 < P < 0.05 • F was 4.67 for 1 and 18 df, 0.01 < P < 0.05 • In fact, F = t² (i.e. 4.67 = 2.16²) Why? Because with t we are dealing with differences, while with F we are dealing with variances (differences squared)
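The identity F = t² can be checked numerically on the illustrative dataset from the earlier sketches (the chapter's own numbers are 4.67 = 2.16²):

```python
import math

# Numerical check that F = t^2 in simple linear regression
# (illustrative data, not the chapter's tomato example).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / (n - 2)

f_stat = (ss_reg / 1) / mse            # ANOVA F with 1 and n-2 df
t_stat = b / math.sqrt(mse / sxx)      # slope t with n-2 df
print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")  # equal up to rounding
```

Algebraically, SSR = b²·Σ(Xi − X̄)², so F = b²·Σ(Xi − X̄)²/MSE, which is exactly (b/sb)² = t².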
Estimation • Regression coefficient: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² • Intercept at x = x̄: ȳ • Intercept: a = ȳ − b·x̄ • Regression equation: ŷ = a + bx
Testing the significance by ANOVA • H0: no linear relation between y and x, i.e., β = 0 • Ha: the variation in y is linearly explained by the variation in x, i.e., β ≠ 0