Understanding Regression and Correlation in Statistics

Learn about the relationship between two quantitative variables: regression, where one variable is dependent and the other independent, and correlation, where both variables have equal status. Discover how to calculate regression coefficients and interpret the best-fit line using the least squares criterion. Explore hypothesis tests, ANOVA models, and the coefficient of determination in regression analysis.

tanat

Presentation Transcript


  1. Regression and correlation: dependence of two quantitative variables

  2. Regression: I know which variable is dependent and which one is independent

  3. For example, the following depend • Height of a plant on nutrient content in the soil • Intensity of photosynthesis on the amount of light • Species diversity on latitude • Rate of an enzymatic reaction on temperature • and not vice versa

  4. Correlation – both variables are “equal”

  5. Similarly, we can be interested in correlations of • Pb and Cd contents in water • Test scores in maths and chemistry • Cover of Cirsium and Agropyron in a square in a meadow • Anywhere it is hard to say what depends on what

  6. Even with equal variables • we can use one of them as a predictor. • Regression is then used even in cases where there is no clear causality. I can predict the height of a tree on the basis of its DBH (which is easier to measure).

  7. Model of simple linear regression: Y = α + βX + ε, where Y is the dependent variable (response), X is the independent variable (predictor), α is the intercept, β is the slope (coefficient of regression), and ε is the error variability, ε ~ N(0, σ²).

  8. Coefficient of regression = slope of the line: how much Y changes when X changes by one unit. Its value therefore depends on the units in which X and Y are measured. It ranges from −∞ to +∞. β = tangent of the slope angle of the line; α = value of Y when X = 0.

  9. So, we assume: • X is measured exactly • the measurement of Y is subject to error • the mean value of Y depends linearly on X • the variance around the line is always the same (homogeneity of variances)

  10. Which line is the best one?

  11. Which line is the best one?

  12. Which line is the best one? Probably not this one, but how can I tell?

  13. The best line is the one that fits best • Criterion of least squares (LS) • i.e. the smallest sum of squared deviations between the predicted and the real values of the dependent variable

  14. I.e. the best line is the one with the smallest sum of squared residuals. Vertical, not horizontal, distance to the line!

  15. Can the parameters of the line be computed from this condition? In SS = Σ(Y − Ŷ)² I substitute Ŷ = a + bX, so SS = Σ(Y − a − bX)². X and Y are measured values; we treat them as fixed. So I am searching for the minimum of a function of two variables, a and b. We calculate the partial derivatives with respect to a and b, set ∂SS/∂a = 0 and ∂SS/∂b = 0, and by solving these equations I get the parameters.

  16. We get b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄. α and β are the real values; a and b are their estimates. The line always goes through the point of the means of both variables.
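
The least-squares estimates of the intercept and slope can be sketched in a few lines of plain Python. The data values and the function name `least_squares_line` are hypothetical, chosen only for illustration; they do not come from the presentation.

```python
# Hypothetical illustration data (not from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def least_squares_line(x, y):
    """Return (a, b) minimising the sum of squared vertical residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of X around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

a, b = least_squares_line(x, y)
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
print(f"a = {a:.3f}, b = {b:.3f}")
# The fitted line passes through the point of the means (up to rounding).
print("fitted value at mean X:", a + b * mean_x, "mean Y:", mean_y)
```

For these five points the slope is 0.6 and the intercept 2.2, and the fitted value at the mean of X equals the mean of Y.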

  17. b is the (sample) estimate of the real value β. Every estimate is subject to error arising from data variability. Statistica computes the standard error of the estimate b.

  18. In the case of independence, β = 0. The P-value for the test of H0: β = 0 is the probability of getting a dependence at least this strong by chance if the variables are independent.

  19. For the test of H0: β = 0 we use t = b / SE(b); the number of degrees of freedom is n − 2. One-tailed tests can be used. A similar test can also be used for the parameter a; it then tests whether the line goes through zero, which is in most cases uninteresting.
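
A minimal sketch of the test statistic for H0: β = 0, assuming the usual form t = b / SE(b) with n − 2 degrees of freedom. The data and the function name `slope_test` are hypothetical; the P-value itself would be read from a t table or statistical software, which is not shown here.

```python
import math

# Hypothetical illustration data (not from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def slope_test(x, y):
    """Return (b, se_b, t, df) for the test of H0: beta = 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual (error) sum of squares around the fitted line.
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2  # two parameters (a and b) were estimated
    se_b = math.sqrt(ss_error / df / sxx)  # standard error of the slope
    return b, se_b, b / se_b, df

b, se_b, t, df = slope_test(x, y)
print(f"b = {b:.3f}, SE(b) = {se_b:.3f}, t = {t:.3f}, df = {df}")
```

With these values t is about 2.12 on 3 degrees of freedom, which is below the tabulated two-tailed 5% critical value of about 3.18, so H0 would not be rejected for this tiny sample.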

  20. Test using the ANALYSIS OF VARIANCE of the regression model. We test the null hypothesis that our model explains nothing (the variables are independent); then β = 0 holds. [So the test should agree with the previous one; it just does not allow one-tailed hypotheses.] Again, as in classic ANOVA, the principle is the analysis of sums of squares.

  21. Total variability = sum of squares of deviations of the observations from the grand mean. Variability explained by the model = sum of squares of deviations of the predicted values from the grand mean. [Figure: wing length, Y, in centimeters, against age, X, in days]

  22. Error variability = sum of squares of deviations between the observed and predicted values. It holds that SS total = SS model + SS error. [Figure: wing length, Y, in centimeters, against age, X, in days]

  23. As in classic ANOVA, MS = SS/DF; it is an estimate of the population variance if the null hypothesis is true. Here too, we test using the ratio of the variance estimates based on the variability explained and unexplained by the model.
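
The decomposition of the sums of squares and the resulting F ratio can be sketched as follows. The data are hypothetical, and the variable names (`ss_total`, `ss_model`, `ss_error`, `f_ratio`) are my own labels for the quantities described on the slides.

```python
# Hypothetical illustration data (not from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x
predicted = [a + b * xi for xi in x]

# Total, explained (model) and unexplained (error) sums of squares.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_model = sum((pi - mean_y) ** 2 for pi in predicted)
ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Mean squares: MS = SS / DF, with 1 DF for the model and n - 2 for error.
df_model, df_error = 1, n - 2
ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_ratio = ms_model / ms_error

print(f"SS total = {ss_total:.2f}, SS model = {ss_model:.2f}, SS error = {ss_error:.2f}")
print(f"F({df_model},{df_error}) = {f_ratio:.2f}")
```

Here SS total (6.0) splits into SS model (3.6) and SS error (2.4), and F is 4.5. In simple regression this F equals the square of the t statistic for the slope, which is why the two tests agree.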

  24. ANOVA model [annotated Statistica output]. The beta shown here is something different from the β used so far. The test of the intercept is a test of the null hypothesis that birds are wingless at hatching (on day zero the wing length is zero).

  25. Coefficient of determination (R²): the percentage of variability explained by the model
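
A short sketch of the coefficient of determination, computed as the explained share of the total sum of squares. The data are hypothetical; the comparison with the squared correlation coefficient is a standard property of simple linear regression, added here as a cross-check.

```python
import math

# Hypothetical illustration data (not from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx
a = mean_y - b * mean_x
ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# R^2 = 1 - SS error / SS total (SS total is syy here).
r_squared = 1.0 - ss_error / syy

# Pearson correlation coefficient; its square equals R^2 in simple regression.
r = sxy / math.sqrt(sxx * syy)

print(f"R^2 = {r_squared:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")
```

For these points R² is 0.6, i.e. the line explains 60% of the variability in Y.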

  26. Confidence belt: where the mean value of Y for a given X lies with a given probability [95% here]. Basically, where the line is.

  27. Prediction (or tolerance) belt: where the next observation will be

  28. Reliability is best around the mean
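
Why both belts are narrowest around the mean of X can be seen from the standard errors behind them. The sketch below uses the textbook formulas for simple linear regression with hypothetical data; the function names are my own, and the belts themselves would be obtained by multiplying these standard errors by the critical t value for n − 2 degrees of freedom, which is not computed here.

```python
import math

# Hypothetical illustration data (not from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x
ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(ss_error / (n - 2))  # residual standard deviation

def se_mean_response(x0):
    """Standard error of the fitted mean of Y at x0 (confidence belt)."""
    return s * math.sqrt(1.0 / n + (x0 - mean_x) ** 2 / sxx)

def se_new_observation(x0):
    """Standard error for a single new observation at x0 (prediction belt)."""
    return s * math.sqrt(1.0 + 1.0 / n + (x0 - mean_x) ** 2 / sxx)

for x0 in (1.0, 3.0, 5.0):
    print(f"x0 = {x0}: SE mean = {se_mean_response(x0):.3f}, "
          f"SE new obs = {se_new_observation(x0):.3f}")
```

Both standard errors grow with the distance of x0 from the mean of X, and the prediction belt is always wider than the confidence belt because it also includes the scatter of individual observations.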

  29. Regression through zero: it is possible, but how was it in reality?

  30. My regression has proved with high certainty that at the time of the volcanic island’s birth there was a negative number of species

  31. Regression through zero: it is possible, but how was it in reality? A regression forced through zero does such a thing.

  32. We use linear regression not because • we believe that the dependence is linear over its whole range, but because we often (and legitimately) believe that we can reasonably approximate it by a linear function within the range of our values. • Be careful with extrapolations (extrapolations to zero are especially dangerous)

  33. Using regression does not imply causal dependence • The following are significant: • Dependence of the number of murders on the number of frost days per year across US states • Dependence of the number of divorces on the number of fridges over the years • Dependence of the number of inhabitants of India on the concentration of CO2 over the years • Causal dependence can be proved only by a manipulative experiment

  34. Dependence of the number of murders (Murders) on the number of frost days (Frost) in individual US states. Results of a regression analysis of the number of murders per 100 000 inhabitants in 1976 (Murders) in individual US states in dependence on the number of frost days in the capital of the given state in 1931-1960 (Frost). P<0.01

  35. Power of the test • Depends on the number of observations and the strength of the relationship (that is, on R² in the whole population) • In experimental studies we can increase R² by increasing the range of the independent variable (keep in mind that this usually makes the linearity of the relationship worse)

  36. In interpretations • Distinguish between cases where we are more interested in the strength of the relationship (and thus the R² value) and cases where we are happy that “it is significant”. • How closely does a new, cheap analytical method track the real concentration? (If I did not already believe that H0, that the method is completely independent of concentration, is false, I would not be doing this at all; I am interested in R² or in the error of estimation.)

  37. The statement • “The method is excellent; the dependence on real concentrations is highly significant (p<0.001)” says only one thing: we are very sure that the method is better than a random number generator. We are interested mainly in R² [and a value of 0.8 can be low for us] (and, especially here, in the error of estimation).

  38. On the other hand • The statement “The number of species is positively dependent on soil pH (F1,33=12.3, p<0.01)” is interesting, because it is not clear a priori that the null hypothesis is false. But I am interested in R² too (although I might be satisfied even with very low values, e.g. 0.2).

  39. Swapping X and Y logically gives different results (as the regression formulas are not inverse functions of each other). But R², F, and P are the same. [Figure: estimating DBH from height and estimating height from DBH; each regression minimises the deviations in its own dependent variable]
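
A small sketch of what happens when X and Y are swapped, using hypothetical data and a hypothetical helper `slope`. The two slopes are not reciprocals of each other, yet their product equals R², which is the same in both directions.

```python
# Hypothetical illustration data (not from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def slope(u, v):
    """Slope of the least-squares regression of v on u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    return suv / suu

b_y_on_x = slope(x, y)  # minimises vertical deviations (in Y)
b_x_on_y = slope(y, x)  # minimises horizontal deviations (in X)

# If the two lines were inverse functions, b_x_on_y would equal 1 / b_y_on_x.
print(f"Y on X slope: {b_y_on_x:.3f}, X on Y slope: {b_x_on_y:.3f}")
print(f"1 / (Y on X slope): {1.0 / b_y_on_x:.3f}")
# The product of the two slopes is R^2, identical for both regressions.
r_squared = b_y_on_x * b_x_on_y
print(f"R^2 = {r_squared:.3f}")
```

For these points the slope of Y on X is 0.6 and the slope of X on Y is 1.0, not 1/0.6; the two lines coincide only when R² = 1.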

  40. Even simple regression • is computed in Statistica using “Multiple regression”. In my results I still write that I used simple regression!

  41. Data transformation in regression • Attention: the variables are not equal • The independent variable is considered exact • The dependent variable contains error (and it is on this variable that I minimise the error sum of squares)

  42. Distinguish • by transforming the independent variable I change the shape of the dependence, but not the distribution of residuals • by transforming the dependent variable I change both the shape and the distribution of residuals

  43. Linearized regression. The most common transformation is the logarithmic one. With the logarithm of the independent variable I get Y = a + b log(X), e.g. S = a + b log(A). Assumption: the residuals did not depend on the mean, so the transformation has not changed them. [Note on the graph: the first title line is usually deleted, and for publication usually the second one too.]

  44. The relationship is exponential. Residuals are linearly dependent on the mean.

  45. It does not matter whether I use ln or log. But if I want to estimate the growth rate, then use ln! I take the logarithm of the dependent variable only, and so I “homogenize” the residuals.
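
The difference between ln and log when estimating a growth rate can be sketched with noise-free hypothetical data generated from Y = 3·exp(0.5·t). The growth rate 0.5, the starting value 3 and the helper `fit_line` are all assumptions made for this illustration.

```python
import math

# Hypothetical noise-free exponential data: Y = 3 * exp(0.5 * t).
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [3.0 * math.exp(0.5 * ti) for ti in t]

def fit_line(u, v):
    """Return (intercept, slope) of the least-squares regression of v on u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    slope = suv / suu
    return mean_v - slope * mean_u, slope

# Only the dependent variable is log-transformed.
a_ln, b_ln = fit_line(t, [math.log(v) for v in y])
a_log10, b_log10 = fit_line(t, [math.log10(v) for v in y])

print(f"ln fit:    slope = {b_ln:.4f}  (the growth rate itself)")
print(f"log10 fit: slope = {b_log10:.4f}  (growth rate divided by ln 10)")
print(f"starting value recovered from ln fit: {math.exp(a_ln):.4f}")
```

Both fits are equally linear, but only the slope of the ln fit is directly the growth rate; the log10 slope has to be multiplied by ln 10 to recover it.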

  46. The power relationship is popular. It always goes through zero. Examples: allometric relationships, species-area.

  47. Use either ln or log. It linearizes most monotonic relationships without an inflection point that go through zero [S = cA^z]. Log transformation of both variables; residuals are assumed to be positively dependent on the mean. Attention: when using the logarithm, positive deviations from the prediction are “decreased” more than negative ones.
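
Fitting the power relationship S = cA^z by log-transforming both variables can be sketched as below. The data are hypothetical and noise-free, generated with c = 2 and z = 0.25, and the helper `fit_line` is my own; real species-area data would scatter around the line, and all values must be positive for the logarithm to exist.

```python
import math

# Hypothetical noise-free species-area data: S = 2 * A ** 0.25.
area = [1.0, 10.0, 100.0, 1000.0]
species = [2.0 * a_i ** 0.25 for a_i in area]

def fit_line(u, v):
    """Return (intercept, slope) of the least-squares regression of v on u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    slope = suv / suu
    return mean_v - slope * mean_u, slope

# log S = log c + z * log A, so the slope estimates z and the intercept log c.
log_area = [math.log10(v) for v in area]
log_species = [math.log10(v) for v in species]
log_c, z = fit_line(log_area, log_species)
c = 10.0 ** log_c

print(f"z = {z:.4f}, c = {c:.4f}")
```

The exponent z is the same whether ln or log10 is used, as long as both variables get the same base; only the intercept has to be back-transformed with the matching base.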
