Regression. Petter Mostad 2005.10.10. Some problems you might want to look at. Given the annual number of cancers of a certain type, over a few decades, make a prediction for the future, with uncertainty.
Regression Petter Mostad 2005.10.10
Some problems you might want to look at • Given the annual number of cancers of a certain type, over a few decades, make a prediction for the future, with uncertainty. • There seems to be a connection between efficiency and size for Norwegian hospitals. Given data from many hospitals, determine if there is a connection, and what it is. • Investigate the connection between efficiency and a number of possible explanatory variables.
Connection between variables We would like to study the connection between x and y!
Connection between variables Fit a line!
What can you do with a fitted line? • Interpolation • Extrapolation (sometimes dangerous!) • Interpret the parameters of the line
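How to define the line that ”fits best”? Choose the line for which the sum of the squares of the ”errors” (the vertical distances from the points to the line) is minimized: this is the least squares method! • Note: many other ways to fit the line can be imagined
How to compute the line fit with the least squares method? • Let (x1, y1), (x2, y2),...,(xn, yn) denote the points in the plane. • Find a and b so that y=a+bx fits the points by minimizing $S = \sum (y_i - a - b x_i)^2$ • Solution: $b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$, where $\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$, and all sums are done for i=1,...,n.
How do you get this answer? • Differentiate S with respect to a and b, and set the results to 0 • We get: $\frac{\partial S}{\partial a} = -2\sum (y_i - a - b x_i) = 0$ and $\frac{\partial S}{\partial b} = -2\sum x_i (y_i - a - b x_i) = 0$ • These are two equations in two unknowns, and solving them gives the answer.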
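A minimal Python sketch of the closed-form least squares solution, for illustration only. The function name `fit_line` and the five data points are made up and do not come from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((y - a - b*x)**2) over the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

# Made-up illustration data (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = fit_line(xs, ys)

# The two equations obtained by differentiating S should hold at the solution.
residuals = [y - a - b * x for x, y in zip(xs, ys)]
eq_a = sum(residuals)
eq_b = sum(x * e for x, e in zip(xs, residuals))
print(a, b, eq_a, eq_b)
```

For these points the fit is roughly a = 2.2 and b = 0.6, and both sums from the derivative equations are zero up to rounding.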
Example Some grasshoppers make sound by rubbing their wings against each other. There is a connection between the temperature and the frequency of the movements, unique for each species. Here are some data for Nemobius fasciatus fasciatus: [data table not reproduced] • If you measure 18 movements per second, what is the estimated temperature? • Data from Pierce, G. W. The Songs of Insects. Cambridge, Mass.: Harvard University Press, 1949, pp. 12-21.
Example (cont.) Computation: Answer: Estimated temperature
y against x ≠ x against y • Linear regression of y against x does not give the same line as linear regression of x against y. [Figure: two panels, regression of y against x and regression of x against y]
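A small Python sketch of why the two regressions differ, using made-up data (not from the slides). The helper name `regress` is hypothetical. The x-against-y line is rewritten in the form y = ... so the two slopes can be compared directly.

```python
def regress(us, vs):
    """Least squares fit of v = intercept + slope * u; returns (intercept, slope)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    slope = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
             / sum((u - u_bar) ** 2 for u in us))
    return v_bar - slope * u_bar, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a_yx, b_yx = regress(xs, ys)     # y = a_yx + b_yx * x
a_xy, b_xy = regress(ys, xs)     # x = a_xy + b_xy * y

# Slope of the second line when drawn in the same (x, y) plane.
slope_from_xy = 1.0 / b_xy
print(b_yx, slope_from_xy)
```

Here the slopes come out as about 0.6 and 1.0. The two lines coincide only when all points lie exactly on one line.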
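Centered variables • Assume we subtract the average from both x- and y-values • We get $\bar{x} = 0$ and $\bar{y} = 0$ • We get $a = 0$ and $b = \frac{\sum x_i y_i}{\sum x_i^2}$ • From the definitions of correlation and standard deviation we get $b = r\,\frac{s_y}{s_x}$ (even in the uncentered case) • Note also: The residuals sum to 0.
Analyzing the variance • Define • SSE: Error sum of squares, $\sum (y_i - a - b x_i)^2$ • SSR: Regression sum of squares, $\sum (a + b x_i - \bar{y})^2$ • SST: Total sum of squares, $\sum (y_i - \bar{y})^2$ • We can show that SST = SSR + SSE • Define $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$ • $R^2$ is the ”coefficient of determination”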
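A short Python check of the centered-variable identities, assuming the sample standard deviations and the sample correlation r are computed with the usual n-1 denominator. The data are made up for illustration.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Centered values: both averages become 0.
xc = [x - x_bar for x in xs]
yc = [y - y_bar for y in ys]

# Slope of the centered fit (the intercept is 0).
b_centered = sum(u * v for u, v in zip(xc, yc)) / sum(u * u for u in xc)

# Sample standard deviations and sample correlation.
s_x = math.sqrt(sum(u * u for u in xc) / (n - 1))
s_y = math.sqrt(sum(v * v for v in yc) / (n - 1))
r = sum(u * v for u, v in zip(xc, yc)) / ((n - 1) * s_x * s_y)
b_from_r = r * s_y / s_x

# Residuals of the uncentered fit, which should sum to 0.
a = y_bar - b_centered * x_bar
residual_sum = sum(y - a - b_centered * x for x, y in zip(xs, ys))
print(b_centered, b_from_r, residual_sum)
```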
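A minimal Python sketch of the variance decomposition on made-up data, assuming the usual definitions: SSE from the residuals, SSR from the fitted values around the average of y, and SST from the observations around the average of y.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares line y = a + b*x.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # error sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)           # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)               # total sum of squares
r_squared = ssr / sst
print(sse, ssr, sst, r_squared)
```

For these points SSE is about 2.4, SSR about 3.6 and SST is 6, so the line explains roughly 60% of the variation in y.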
But how to answer questions like: • Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation? • What is a confidence interval for the estimated slope? • What is the prediction, with uncertainty, at a new x value?
The standard simple regression model • We have to do as before, and define a model: $y_i = \alpha + \beta x_i + \varepsilon_i$, where $\varepsilon_1, \dots, \varepsilon_n$ are independent, normally distributed, with expectation zero and equal variance $\sigma^2$ • We can then use data to estimate the model parameters, and to make statements about their uncertainty
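Confidence intervals for simple regression • In a simple regression model, • a estimates the intercept $\alpha$ • b estimates the slope $\beta$ • $s_e^2 = \frac{SSE}{n-2}$ estimates the error variance $\sigma^2$ • Also, $\frac{b - \beta}{s_b} \sim t_{n-2}$, where $s_b^2 = \frac{s_e^2}{\sum (x_i - \bar{x})^2}$ estimates the variance of b • So a confidence interval for $\beta$ with confidence level $1-\gamma$ is given by $b \pm t_{n-2,\gamma/2}\, s_b$, where $t_{n-2,\gamma/2}$ is the upper $\gamma/2$ point of the t distribution with n-2 degrees of freedom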
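A hedged Python sketch of a 95% confidence interval for the slope, assuming the standard formulas $s_e^2 = SSE/(n-2)$ and $s_b^2 = s_e^2 / \sum (x_i - \bar{x})^2$. The data are made up, and the critical value 3.182 is read from a t table for 3 degrees of freedom rather than computed.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
T_CRIT = 3.182  # t table value: two-sided 95%, n - 2 = 3 degrees of freedom

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
s_e2 = sse / (n - 2)             # estimate of the error variance
s_b = math.sqrt(s_e2 / sxx)      # estimated standard deviation of b
ci = (b - T_CRIT * s_b, b + T_CRIT * s_b)
print(b, s_b, ci)
```

With only five points the interval is wide and contains 0, so these made-up data alone would not establish a positive trend.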
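Hypothesis testing for simple regression • Choose hypotheses: $H_0: \beta = \beta^*$ against $H_1: \beta \neq \beta^*$ (often $\beta^* = 0$, meaning no linear trend) • Test statistic: $t = \frac{b - \beta^*}{s_b}$, where $s_b$ is the estimated standard deviation of b; it is $t_{n-2}$-distributed under $H_0$ • Reject H0 if $t < -t_{n-2,\gamma/2}$ or $t > t_{n-2,\gamma/2}$, where $\gamma$ is the significance level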
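A minimal Python sketch of a two-sided test of "slope equals zero" at the 5% level, assuming the t statistic is the estimated slope divided by its estimated standard deviation. The data are made up and the critical value 3.182 comes from a t table (3 degrees of freedom).

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
T_CRIT = 3.182  # t table value: two-sided 5% level, n - 2 = 3 degrees of freedom

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
s_b = math.sqrt(sse / (n - 2) / sxx)

t_stat = (b - 0.0) / s_b          # hypothesized slope is 0
reject_h0 = abs(t_stat) > T_CRIT
print(t_stat, reject_h0)
```

Here the statistic is about 2.12, below the critical value, so the null hypothesis is not rejected for these five points.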
Prediction from a simple regression model • A regression model can be used to predict the response at a new value xn+1 • The uncertainty in this prediction comes from two sources: • The uncertainty in the regression line • The uncertainty of any response, given the regression line • A confidence interval for the prediction: $a + b x_{n+1} \pm t_{n-2,\gamma/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$, where $s_e^2 = \frac{SSE}{n-2}$
Testing for correlation • It is also possible to test whether a sample correlation r is large enough to indicate a nonzero population correlation • Test statistic: $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, which is $t_{n-2}$-distributed when the population correlation is zero • Note: The test only works for normal distributions and linear correlations: Always also investigate the scatter plot!
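A hedged Python sketch of an interval for a new response, assuming the standard formula in which the 1 under the square root carries the uncertainty of a single response and the other two terms carry the uncertainty in the line. The data, the new value x = 6 and the function name `predict_interval` are made up; the critical value is taken from a t table.

```python
import math

def predict_interval(xs, ys, x_new, t_crit):
    """Return (prediction, lower, upper) for a new response at x_new."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    # 1: a single response; 1/n and the last term: the fitted line itself.
    half = t_crit * s_e * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    pred = a + b * x_new
    return pred, pred - half, pred + half

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
pred, lower, upper = predict_interval(xs, ys, 6.0, 3.182)  # 3.182: t table, 3 d.f.
print(pred, lower, upper)
```

The interval widens as the new x value moves away from the average of the observed x values, which is one way to see why extrapolation is risky.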
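A short derivation, offered as a sketch, of why the usual correlation statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ coincides with the t statistic for testing slope zero in simple regression. It assumes the standard relations $b = r\,s_y/s_x$, $R^2 = r^2$, $s_e^2 = SSE/(n-2)$ and $s_b^2 = s_e^2/((n-1)s_x^2)$, where $s_x$ and $s_y$ are sample standard deviations.

```latex
\begin{align*}
s_e^2 &= \frac{SSE}{n-2} = \frac{(1-r^2)\,SST}{n-2},
  \qquad SST = (n-1)\,s_y^2,\\
s_b^2 &= \frac{s_e^2}{(n-1)\,s_x^2}
       = \frac{(1-r^2)\,s_y^2}{(n-2)\,s_x^2},\\
\frac{b}{s_b} &= r\,\frac{s_y}{s_x}\cdot
  \frac{\sqrt{n-2}\;s_x}{\sqrt{1-r^2}\;s_y}
  = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}.
\end{align*}
```

So in the simple regression setting, testing for zero correlation and testing for zero slope lead to the same conclusion.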
Influence of extreme observations • NOTE: The result of a regression analysis is very much influenced by points with extreme values, in either the x or the y direction. • Always investigate visually, and determine if outliers are actually erroneous observations
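A small Python illustration of how a single extreme point can change the fit. The data are made up: five points lying exactly on the line y = x, then the same points plus one outlier at (10, 0). The helper name `slope` is hypothetical.

```python
def slope(xs, ys):
    """Least squares slope of y against x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data on the line y = x
ys = [1.0, 2.0, 3.0, 4.0, 5.0]

b_clean = slope(xs, ys)
b_outlier = slope(xs + [10.0], ys + [0.0])   # one extreme observation added
print(b_clean, b_outlier)
```

The slope is 1 without the extra point and becomes slightly negative with it, even though five of the six points still lie on a perfectly increasing line.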
Example: Transformed variables • The relationship between variables may not be linear • Example: The natural model may be $y = a e^{bx}$ • We want to find a and b so that the curve approximates the points as well as possible
Example (cont.) • When $y = a e^{bx}$ then $\log(y) = \log(a) + bx$ • Use standard formulas on the pairs (x1,log(y1)), (x2, log(y2)), ..., (xn, log(yn)) • We get estimates for log(a) and b, and thus a and b
Another example of transformed variables • Another natural model may be $y = a x^b$ • We get that $\log(y) = \log(a) + b\log(x)$ • Use standard formulas on the pairs (log(x1), log(y1)), (log(x2), log(y2)), ...,(log(xn),log(yn)) • Note: In this model, the curve goes through (0,0) (when b > 0)
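A minimal Python sketch of fitting an exponential model through the log transform, assuming the model y = a·exp(bx) with positive y values. The data are synthetic, generated exactly from a = 2 and b = 0.5 so the fit can be checked; `fit_line` is a hypothetical helper implementing the standard formulas.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Synthetic data generated from y = 2 * exp(0.5 * x) (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Fit a line to the pairs (x, log(y)), then transform the intercept back.
log_a, b = fit_line(xs, [math.log(y) for y in ys])
a = math.exp(log_a)
print(a, b)
```

Note that this minimizes squared errors on the log scale, which is not the same as minimizing squared errors in y itself.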
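A similar Python sketch for a power-law model, assuming y = a·x^b with positive x and y values so that both logarithms exist. The data are synthetic, generated exactly from a = 3 and b = 2; `fit_line` is a hypothetical helper.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Synthetic data generated from y = 3 * x**2 (an assumption for the demo).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * x ** 2 for x in xs]

# Fit a line to the pairs (log(x), log(y)).
log_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
a = math.exp(log_a)
print(a, b)
```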
More than one independent variable: Multiple regression • Assume we have data of the type (x11, x12, x13, y1), (x21, x22, x23, y2), ... • We want to ”explain” y from the x-values by fitting the following model: $y = a + b x_1 + c x_2 + d x_3$ • Just like before, one can produce formulas for a,b,c,d minimizing the sum of the squares of the ”errors”. • x1,x2,x3 can be transformations of different variables, or transformations of the same variable
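A hedged Python sketch of multiple regression with three explanatory variables. It solves the normal equations (X'X)β = X'y by Gaussian elimination, which is one common way to obtain the least squares coefficients and is not necessarily the method behind the slides. The function names and the synthetic data (generated exactly from a = 1, b = 2, c = -1, d = 0.5) are assumptions for the demo.

```python
def solve(A, rhs):
    """Solve the square linear system A z = rhs with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_multiple(rows, ys):
    """Least squares coefficients [a, b, c, ...] for y = a + b*x1 + c*x2 + ..."""
    X = [[1.0] + list(r) for r in rows]            # leading 1 for the intercept
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

# Synthetic data from y = 1 + 2*x1 - 1*x2 + 0.5*x3 (an assumption for the demo).
rows = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
        (1.0, 1.0, 0.0), (0.0, 1.0, 1.0), (2.0, 1.0, 3.0)]
ys = [1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 for x1, x2, x3 in rows]

a, b, c, d = fit_multiple(rows, ys)
print(a, b, c, d)
```

The system has a unique solution only when the explanatory variables are not exactly linearly related; otherwise the elimination step would hit a zero pivot.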
Multiple regression model • The errors are independent random (normal) variables with expectation zero and variance $\sigma^2$ • The explanatory variables x1i, x2i, …, xni cannot be linearly related
Use of multiple regression • Versions of multiple regression are the most used models in econometrics and in health economics • It is a powerful tool to detect and verify connections between variables
Doing a regression analysis • Plot the data first, to investigate whether there is a natural relationship • Linear or transformed model? • Are there outliers which will unduly affect the result? • Fit a model. Different models with the same number of parameters may be compared with $R^2$ • Make tests / confidence intervals for parameters
Interpretation • The parameters may have important interpretations • The model may be used for prediction at new values (caution: Extrapolation can sometimes be dangerous!) • Remember that subjective choices have been made, and interpret cautiously