Chapter 9 Linear Regression and Correlation
Bivariate quantitative data:
Population: a finite or infinite set of paired variable values
Sample: n paired observations on the explanatory variable X and the response variable Y, randomly sampled from the population: (X1,Y1), (X2,Y2), …, (Xn,Yn)
Objective: study the quantitative relationship between X and Y
Method: regression and correlation
Simplest and most basic forms: linear regression and linear correlation
Content
1. Linear regression
2. Linear correlation
3. Rank correlation
4. Curve fitting
Historical background: In the 19th century, the British anthropologist F. Galton and the statistician Karl Pearson, pioneers of correlation and the coefficient of correlation, found a linear relationship between the height of sons (Y, inches) and the height of their fathers (X, inches).
That is, the sons of tall fathers are not necessarily taller still; they tend to be shorter than their fathers. Likewise, the sons of short fathers are not necessarily shorter still; they tend to be taller than their fathers. Galton called this tendency of heights in a population to stay stable "regression".
Today "regression" has become a statistical term for the quantitative dependency between variables, and it has given rise to further statistical concepts such as the "regression equation" and the "regression coefficient". For example:
• study the relationship between blood sugar and insulin level
• study the relationship between the age and the weight of children
9.1.1 Concepts
Objective: study the dependency of the dependent variable Y on the independent variable X.
Feature: a statistical relationship, that is, a relationship between X and the mean of Y. It differs from the functional relationship between X and Y in general mathematics, where X determines Y exactly.
Example 9-1: An endemic disease institute measured the urine creatinine content (mmol/24h) of eight healthy children; the data are given in Table 9-1. Estimate the regression equation of urine creatinine content (Y) on age (X).
Table 9-1 Age X (years) and urine creatinine content Y (mmol/24h) in eight healthy children
Figure 9-1 Scatter plot of urine creatinine content Y (mmol/24h) against age X (years) in eight children
To describe the quantitative dependency of urine creatinine content on age, we take age as the independent variable, denoted X, and urine creatinine content as the dependent variable, denoted Y.
Figure 9-1 shows that urine creatinine content Y increases roughly linearly with age X. This differs from a strict linear functional relationship, because the eight points do not all lie exactly on a line. We therefore call the phenomenon linear regression, and call the equation a linear regression model to distinguish it from a strict linear equation. Linear regression with two variables is the most basic and simplest form of regression, so it is also called simple regression.
The linear regression model is
$\hat{Y} = a + bX$ (9-1)
where $\hat{Y}$ is the estimate of the mean of Y corresponding to X.
1. a is the intercept of the regression line on the Y axis
• a > 0: the line crosses the Y axis above the origin
• a < 0: the line crosses the Y axis below the origin
• a = 0: the line passes through the origin
[Figure: three lines in the X-Y plane illustrating a < 0, a = 0 and a > 0]
2. b is the regression coefficient, that is, the slope of the line
• b > 0: Y increases as X increases
• b < 0: Y decreases as X increases
• b = 0: there is no linear relationship between the two variables
[Figure: three lines in the X-Y plane illustrating b > 0, b = 0 and b < 0]
Statistical meaning of b: when X changes by one unit, Y changes by b units on average.
Formula (9-1) is the sample regression model. It estimates the linear relationship between the two variables in the population. Judging from the scatter plot, we can assume that the mean of the response Y corresponding to each X lies on a line (Figure 9-2).
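9.1.2 Calculating the linear regression equation
• Residual: $e = Y - \hat{Y}$, the vertical distance from an observed point (X, Y) to the regression line
• Calculating a and b amounts to finding the line that best represents the trend of the data.
Principle: least squares, that is, choose a and b so that the residual sum of squares $\sum (Y - \hat{Y})^2$ is as small as possible.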
Besides the linear relationship between the two variables seen in the figure, we assume that for each X the population of Y values is normally distributed, that these normal distributions have equal variances, and that the observations are independent. In formula (9-1), $\hat{Y}$ is the sample estimate of the population mean of Y corresponding to X, and also the predicted value from the regression equation, while a and b are the estimates of α and β in the population regression line $\mu_{Y|X} = \alpha + \beta X$.
Example 9-1: An endemic disease institute measured the urine creatinine content (mmol/24h) of eight healthy children; the data are given in Table 9-1. Estimate the regression equation of urine creatinine content (Y) on age (X).
Table 9-1 Age X (years) and urine creatinine content Y (mmol/24h) in healthy children
Steps of solution
1. The original data and the scatter plot (Figure 9-1) show a linear trend between the two variables, so we can proceed with the following calculation.
2. Calculate the sums, means, sums of squares and sum of products of X and Y that the fit requires.
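4. Calculate the regression coefficient b and the intercept a:
$b = \frac{l_{XY}}{l_{XX}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$, $a = \bar{Y} - b\bar{X}$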
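The least-squares calculation of b and a can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_line` is made up for the example, and the five data points are hypothetical toy values, not the data of Table 9-1.

```python
def fit_line(xs, ys):
    """Return the least-squares intercept a and slope b for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # l_xx: sum of squared deviations of X; l_xy: sum of cross-products
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = l_xy / l_xx          # regression coefficient (slope)
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical toy data for illustration only (NOT Table 9-1)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(f"Y_hat = {a:.2f} + {b:.2f} X")
```

Replacing `xs` and `ys` with the ages and urine creatinine contents of Table 9-1 would give the a and b of Example 9-1.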
The line necessarily passes through the point ($\bar{X}$, $\bar{Y}$) and crosses the Y axis at the intercept a. If the axes of the scatter plot do not start at the origin, we can still draw the regression line by joining the point ($\bar{X}$, $\bar{Y}$) to a second point on the line that is far from it, lies within the range of the independent variable X, and is easy to read off.
Figure 9-1 Scatter plot of urine creatinine content against age (years) in the children
9.1.3 Statistical inference in linear regression
1. Hypothesis test of the regression equation
We build the sample regression equation not only to describe the relationship between the two variables in the sample, but also to infer whether a linear regression relationship exists in the population, that is, whether β ≠ 0.
If β = 0, there is no linear relationship between X and Y. If b ≠ 0, is its difference from 0 larger than sampling error alone would explain? We answer this question with ANOVA and the t-test.
1. ANOVA
To understand the basic idea of ANOVA here, we decompose the total sum of squares of deviations of Y from its mean, $\sum (Y - \bar{Y})^2$:
[Figure 9-4: decomposition of the deviation of an observed point (X, Y) from the mean of Y]
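In Figure 9-4:
$Y - \bar{Y} = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$
It can be proved that:
$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2$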
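This can be written as:
$SS_{total} = SS_{reg} + SS_{res}$
In the formula: $SS_{total} = \sum (Y - \bar{Y})^2$ is the total sum of squares, $SS_{reg} = \sum (\hat{Y} - \bar{Y})^2 = b\,l_{XY} = l_{XY}^2 / l_{XX}$ is the regression sum of squares, and $SS_{res} = \sum (Y - \hat{Y})^2$ is the residual sum of squares, with $l_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$ and $l_{XX} = \sum (X - \bar{X})^2$.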
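The three degrees of freedom are related by:
$\nu_{total} = \nu_{reg} + \nu_{res}$, with $\nu_{total} = n - 1$, $\nu_{reg} = 1$, $\nu_{res} = n - 2$
If the contribution of regression is much larger than that of random error, the regression is meaningful; we calculate the F value to assess its statistical significance:
$F = \frac{SS_{reg} / \nu_{reg}}{SS_{res} / \nu_{res}} = \frac{MS_{reg}}{MS_{res}}$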
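In the formula $F = MS_{reg} / MS_{res}$: $MS_{reg} = SS_{reg} / \nu_{reg}$ is the regression mean square and $MS_{res} = SS_{res} / \nu_{res}$ is the residual mean square. Under $H_0$: β = 0, F follows the F distribution with $\nu_1 = 1$ and $\nu_2 = n - 2$ degrees of freedom.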
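The decomposition of the sum of squares and the F statistic can be checked numerically. The following is a minimal sketch, assuming the usual definitions (total, regression and residual sums of squares with 1 and n − 2 degrees of freedom); the function name `regression_anova` and the data are hypothetical and are not taken from Table 9-1 or Table 9-2.

```python
def regression_anova(xs, ys):
    """Return (SS_total, SS_reg, SS_res, F) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]

    ss_total = sum((y - y_bar) ** 2 for y in ys)              # total variation of Y
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)            # explained by the line
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # left in the residuals

    df_reg, df_res = 1, n - 2
    f_value = (ss_reg / df_reg) / (ss_res / df_res)           # MS_reg / MS_res
    return ss_total, ss_reg, ss_res, f_value


# Hypothetical toy data for illustration only (NOT Table 9-1)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_total, ss_reg, ss_res, f_value = regression_anova(xs, ys)
print(ss_total, ss_reg + ss_res, f_value)
```

The first two printed numbers agree up to rounding, which is the decomposition in numerical form; the F value is then compared with the critical value of the F distribution for (1, n − 2) degrees of freedom.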
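2. t-test
Is β = 0?
$t = \frac{b - 0}{S_b}$, $\nu = n - 2$
where $S_b = \frac{S_{Y \cdot X}}{\sqrt{l_{XX}}}$ is the standard error of b and $S_{Y \cdot X} = \sqrt{\frac{SS_{res}}{n - 2}}$ is the residual standard deviation.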
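A short sketch of the t statistic for the slope follows. It assumes the usual textbook forms: residual standard deviation sqrt(SS_res / (n − 2)) and standard error of b equal to that value divided by sqrt(l_XX). The function name `slope_t_test` and the data are hypothetical, not those of Example 9-1.

```python
import math


def slope_t_test(xs, ys):
    """Return (b, S_b, t) for testing H0: beta = 0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = l_xy / l_xx
    a = y_bar - b * x_bar

    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ss_res / (n - 2))   # residual standard deviation
    s_b = s_yx / math.sqrt(l_xx)         # standard error of the slope
    t = (b - 0) / s_b                    # compare with t distribution, nu = n - 2
    return b, s_b, t


# Hypothetical toy data for illustration only (NOT Table 9-1)
b, s_b, t = slope_t_test([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"b = {b:.3f}, S_b = {s_b:.4f}, t = {t:.2f}")
```

In simple linear regression the square of this t equals the F value of the ANOVA, so the two tests always lead to the same conclusion.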
Example 9-2: Test the linear regression equation fitted to the data of Example 9-1.
(1) ANOVA
The results are displayed in the ANOVA table.
Table 9-2 ANOVA table
With ν1 = 1 and ν2 = 6, the F distribution table gives P < 0.05. At the α = 0.05 level we reject H0 and accept H1: we can conclude that there is a linear relationship between urine creatinine content and age.
(2) t-test
With b = 0.1392 and $S_b$ = 0.0304, t = 0.1392 / 0.0304 ≈ 4.58. With ν = 6, the t distribution table gives 0.002 < P < 0.005. At the α = 0.05 level we reject H0 and accept H1: we can conclude that there is a linear relationship between urine creatinine content and age.
Confidence interval of the population regression coefficient β
The 1−α CI of β is:
$b \pm t_{\alpha/2,\nu}\, S_b$, $\nu = n - 2$
Example 9-3: Estimate the two-sided 95% CI of the population regression coefficient from the sample value b = 0.1392 of Example 9-1.
With $S_b$ = 0.0304 and $t_{0.05/2,6}$ = 2.447:
(0.1392 − 2.447×0.0304, 0.1392 + 2.447×0.0304) = (0.0648, 0.2136)
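The arithmetic of this interval can be reproduced directly. The sketch below uses only the three numbers that appear in the calculation (b = 0.1392, standard error 0.0304, critical value 2.447); the function name `slope_ci` is made up for the example.

```python
def slope_ci(b, s_b, t_crit):
    """Two-sided confidence interval b +/- t_crit * s_b for the slope."""
    half_width = t_crit * s_b
    return b - half_width, b + half_width


# Values from Example 9-3: b, its standard error, and t(0.05/2, nu = 6)
lower, upper = slope_ci(b=0.1392, s_b=0.0304, t_crit=2.447)
print(f"({lower:.4f}, {upper:.4f})")  # prints (0.0648, 0.2136)
```

Because the interval does not contain 0, it agrees with the rejection of H0: β = 0 at the α = 0.05 level.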
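(3) Estimation and prediction
1. Confidence interval of the population mean $\mu_{Y|X_0}$
Standard error of $\hat{Y}_0$ (the sampling error of the estimated mean):
$S_{\hat{Y}_0} = S_{Y \cdot X} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{l_{XX}}}$ (9-14)
When X = X0, the 1−α CI of $\mu_{Y|X_0}$ is:
$\hat{Y}_0 \pm t_{\alpha/2,\nu}\, S_{\hat{Y}_0}$ (9-15)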
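2. Prediction interval for an individual value of Y
$S_{Y_0} = S_{Y \cdot X} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{l_{XX}}}$ (9-16)
$\hat{Y}_0 \pm t_{\alpha/2,\nu}\, S_{Y_0}$ (9-17)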
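The difference between the two intervals can be seen in a small sketch. It assumes the usual textbook forms of the two standard errors (the prediction version adds 1 under the square root). The function name `mean_ci_and_prediction_interval`, the data, and the choice X0 = 4 are hypothetical; the critical value 3.182 is the two-sided 5% point of the t distribution with 3 degrees of freedom, which fits this toy sample of five points, not Example 9-1.

```python
import math


def mean_ci_and_prediction_interval(xs, ys, x0, t_crit):
    """Return (Y0_hat, CI for the mean of Y at x0, prediction interval for Y at x0)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    y0_hat = a + b * x0

    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ss_res / (n - 2))            # residual standard deviation
    core = 1 / n + (x0 - x_bar) ** 2 / l_xx
    s_mean = s_yx * math.sqrt(core)               # standard error of the estimated mean
    s_pred = s_yx * math.sqrt(1 + core)           # standard error for an individual Y

    ci = (y0_hat - t_crit * s_mean, y0_hat + t_crit * s_mean)
    pi = (y0_hat - t_crit * s_pred, y0_hat + t_crit * s_pred)
    return y0_hat, ci, pi


# Hypothetical toy data for illustration only (NOT Table 9-1)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
y0_hat, ci, pi = mean_ci_and_prediction_interval(xs, ys, x0=4, t_crit=3.182)
print(y0_hat, ci, pi)
```

The prediction interval is always wider than the confidence interval of the mean at the same X0, and both widen as X0 moves away from the mean of X.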
Example 9-4: When X0 = 12, calculate the 95% CI of $\mu_{Y|X_0}$ and the 95% prediction interval of Y from the linear regression equation of Example 9-1.