
Lab: Lecture 9 Review


Presentation Transcript


  1. Lab: Lecture 9 Review November 15, 2012

  2. Scatterplot • The scatterplot is a simple method to examine the relationship between 2 continuous variables: twoway (lowess sleep_hrs bmi) (scatter sleep_hrs bmi), ytitle(Hours of sleep) xtitle(BMI) legend(off)

  3. Correlation • Correlation is a method to examine the relationship between 2 continuous variables • Does one increase with the other? • E.g. Do hours of sleep decrease with increasing BMI? • Both variables are measured on the same people (or unit of analysis) • Correlation assumes a linear relationship between the two variables • Correlation is symmetric • The correlation of A with B is the same as the correlation of B with A

  4. Correlation • Correlation is a measure of the relationship between two random variables X and Y • A correlation is defined as ρ = Cov(X, Y) / (σX σY) • Correlation does not imply causation
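To see this definition in action, here is a minimal Stata sketch, assuming the sleep_hrs and bmi variables used elsewhere in this lab are in memory; it rebuilds the correlation from the covariance and standard deviations and compares it with Stata's built-in estimate:

  * Correlation by its definition: rho = Cov(X,Y) / (sd(X) * sd(Y))
  quietly correlate sleep_hrs bmi, covariance   // stores r(cov_12), r(Var_1), r(Var_2)
  display "r by the definition = " r(cov_12) / (sqrt(r(Var_1)) * sqrt(r(Var_2)))
  correlate sleep_hrs bmi                       // built-in estimate; should agree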

  5. Correlation [Figure: four scatterplots illustrating perfect positive correlation, perfect negative correlation, no correlation, and a small correlation]

  6. Pearson’s Correlation • An estimator of the population correlation ρ is Pearson’s correlation coefficient, denoted r • It is estimated by r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²] • r ranges from -1 to 1
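As a check on the formula, here is a do-file sketch that builds r from the raw sums, again assuming sleep_hrs and bmi (dy, dx, and the scalar names are hypothetical working names):

  * Pearson's r from the sums in the formula, on the casewise sample
  preserve
  keep if !missing(sleep_hrs, bmi)
  quietly summarize sleep_hrs
  generate double dy = sleep_hrs - r(mean)   // deviations of y from its mean
  quietly summarize bmi
  generate double dx = bmi - r(mean)         // deviations of x from its mean
  generate double dxy = dx*dy
  generate double dxx = dx^2
  generate double dyy = dy^2
  quietly summarize dxy
  scalar sxy = r(sum)
  quietly summarize dxx
  scalar sxx = r(sum)
  quietly summarize dyy
  scalar syy = r(sum)
  display "r = " sxy / sqrt(sxx*syy)
  restore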

  7. Pearson’s Correlation: Hypothesis testing • To test whether there is a correlation between two variables, our hypotheses are H0: ρ = 0 and HA: ρ ≠ 0 • The test statistic is t = r√(n − 2) / √(1 − r²) • Under the null hypothesis it follows a t distribution with n − 2 degrees of freedom
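A sketch of this test built from correlate's stored results (variable names as in the example on the next slide):

  * t statistic and p-value for H0: rho = 0
  quietly correlate sleep_hrs bmi
  scalar rho_hat = r(rho)
  scalar nobs    = r(N)
  scalar tstat   = rho_hat * sqrt(nobs-2) / sqrt(1 - rho_hat^2)
  display "t = " tstat " on " nobs-2 " df"
  display "two-sided p-value = " 2*ttail(nobs-2, abs(tstat))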

  8. Pearson’s Correlation example Syntax: pwcorr var1 var2, sig obs

  . pwcorr sleep_hrs bmi, sig obs

               | sleep_~s      bmi
  -------------+------------------
     sleep_hrs |   1.0000
               |
               |      503
               |
           bmi |  -0.1130   1.0000
               |   0.0114
               |      501      513

  Here -0.1130 is the correlation coefficient “r”, and 0.0114 is the p-value for the null hypothesis that ρ = 0. Note that the hypothesis test is only of ρ = 0, no other null. Also note that the correlation captures the linear relationship only.

  9. Spearman’s Rank Correlation • The Pearson correlation coefficient is calculated, but the data values are replaced by their ranks (non-parametric) • With no ties, the Spearman rank correlation coefficient reduces to rs = 1 − 6Σdi² / (n(n² − 1)), where di is the difference between the ranks of xi and yi
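Because it is just Pearson's r on the ranks, one way to verify it in Stata is to rank the variables yourself; a sketch assuming sleep_hrs and bmi (egen's rank() assigns average ranks to ties):

  * Spearman's rho as Pearson's r computed on the ranks
  egen rank_sleep = rank(sleep_hrs)
  egen rank_bmi   = rank(bmi)
  correlate rank_sleep rank_bmi     // compare with: spearman sleep_hrs bmi

(With missing values, restrict to the casewise sample before ranking so both commands use the same observations.)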

  10. Spearman’s Rank Correlation • The Spearman rank correlation ranges between -1 and 1, as does the Pearson correlation • We can test the null hypothesis that ρs = 0 • The test statistic follows a t-distribution with n − 2 degrees of freedom

  11. Spearman’s Rank Correlation spearman sleep_hrs bmi, stats(rho obs p)

  Number of obs =      501
  Spearman’s rho =  -0.1056   (“r”)

  Test of Ho: sleep_hrs and bmi are independent
  Prob > |t| =  0.0181

  12. Matrix of Spearman correlations Here, if you drop the “pw” option, you get all n’s equal.

  . spearman sleep_hrs bmi age child6_n, pw stats(rho obs p)

  +-----------------+
  |       Key       |
  |-----------------|
  |       rho       |
  |  Number of obs  |
  |   Sig. level    |
  +-----------------+

               | sleep_~s      bmi      age child6_n
  -------------+------------------------------------
     sleep_hrs |   1.0000
               |      503
               |
           bmi |  -0.1056   1.0000
               |      501      513
               |   0.0181
               |
           age |  -0.0095   0.2407   1.0000
               |      502      512      520
               |   0.8314   0.0000
               |
      child6_n |  -0.0802   0.0582   0.0283   1.0000
               |      502      511      513      514
               |   0.0725   0.1891   0.5224

  13. Pearson vs. Spearman

  14. Biomarker of alcohol consumption vs. days drinking (raw data vs. ranks) [Figure: two scatterplots comparing the Pearson correlation on the raw data with the Spearman correlation on the ranks]

  15. Pearson and Spearman correlations

  . pwcorr peth21_18and16 daysdrank_21, obs sig

               | peth2~16 days~_21
  -------------+------------------
  peth21_18~16 |   1.0000
               |
               |       77
               |
  daysdrank_21 |   0.4717   1.0000
               |   0.0000
               |       77       85

  . spearman peth21_18and16 daysdrank_21

  Number of obs =       77
  Spearman’s rho =   0.7413

  Test of Ho: peth21_18and16 and daysdrank_21 are independent
  Prob > |t| =  0.0000

  16. Simple Linear Regression • Correlation allows us to quantify a linear relationship between two variables • Regression additionally allows us to estimate how a change in a random variable X corresponds to a change in a random variable Y

  17. Simple Linear Regression: Two continuous variables twoway (lowess fev age, bwidth(0.8)) (scatter fev age, sort), ytitle(FEV) xtitle(Age) legend(off) title(FEV vs age in children and adolescents)

  18. Concept of μy|x and σy|x • μy|x: at each x value, there is a mean value of y • σy|x: at each x value, there is a standard deviation of y

  19. The equation of a straight line y = α + βx

  20. Simple linear regression • The population regression equation is defined as μy|x = α + βx • This is the equation of a straight line • α and β are constants and are called the coefficients of the equation

  21. Simple Linear Regression • α = y intercept, mean value of y when X = 0 • β = slope of the line, the change in the mean value of y that corresponds to a one-unit increase in X

  22. Simple linear regression • Even if there is a linear relationship between Y and X in theory, there will be some variability in the population • At each value of X, there is a range of Y values, with a mean μy|x and a standard deviation σy|x • So when we model the data we collect (rather than the population), we note this by including an error term, ε, in our regression equation: y = α + βx + ε
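A small simulation makes the error term concrete. This sketch uses hypothetical values α = 2, β = 0.5, and ε ~ N(0, 1), and starts from an empty dataset (clear discards whatever is in memory):

  * Simulate y = alpha + beta*x + epsilon, then recover the coefficients
  clear
  set seed 12345
  set obs 200
  generate x = runiform(0, 10)
  generate y = 2 + 0.5*x + rnormal(0, 1)
  regress y x     // estimates should land near alpha = 2 and beta = 0.5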

  23. Simple Linear Regression: Assumptions • X’s are measured without error (violations of this cause the coefficients to attenuate toward zero) • For each value of x, the y’s are normally distributed with mean μy|x and standard deviation σy|x • The regression equation is correct: μy|x = α + βx

  24. Simple Linear Regression: Assumptions (continued) • Homoscedasticity: the standard deviation of y at each value of X is constant; σy|x is the same for all values of X • All the yi’s are independent: you can’t guess the y value for one person (or observation) based on the outcome of another **Note that we do not need the X’s to be normally distributed, just the Y’s at each value of X
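Standard post-estimation graphics give quick visual checks of these assumptions; a sketch assuming the fev/age data from the example a few slides below:

  * Graphical checks after fitting the model
  quietly regress fev age
  rvfplot, yline(0)           // residuals vs. fitted: look for constant spread (homoscedasticity)
  predict ehat, residuals
  qnorm ehat                  // residuals should lie near the line if normality holds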

  25. Least squares • We estimate the coefficients of the population regression line (α and β) using our sample of measurements of y and x • We have a set of data, where the points are (xi, yi), and we want to put a line through them • The distance from a data point (xi, yi) to the line at xi is called the residual: ei = yi − ŷi, where ŷi is the y-value of the regression line at xi

  26. Least squares • The “best” line is the one whose α and β minimize the sum of the squared residuals, Σei² (hence the name “least squares”)
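The minimizers have a closed form: β̂ = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and α̂ = ȳ − β̂x̄. A sketch that checks this against regress, assuming the fev/age data from the next slide with no missing values in either variable:

  * Least-squares slope and intercept by hand
  quietly correlate fev age, covariance
  scalar b_hat = r(cov_12) / r(Var_2)   // Cov(fev,age) / Var(age); age is listed second
  quietly summarize fev
  scalar ybar = r(mean)
  quietly summarize age
  scalar xbar = r(mean)
  display "beta-hat = " b_hat "   alpha-hat = " ybar - b_hat*xbar
  regress fev age                       // should reproduce both values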

  27. Simple linear regression example: regression of FEV on age, FEV = α̂ + β̂·age Syntax: regress yvar xvar

  . regress fev age

        Source |       SS       df       MS              Number of obs =     654
  -------------+------------------------------           F(  1,   652) =  872.18
         Model |  280.919154     1  280.919154           Prob > F      =  0.0000
      Residual |  210.000679   652  .322086931           R-squared     =  0.5722
  -------------+------------------------------           Adj R-squared =  0.5716
         Total |  490.919833   653  .751791475           Root MSE      =  .56753

  ------------------------------------------------------------------------------
           fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
           age |    .222041   .0075185    29.53   0.000     .2072777    .2368043
         _cons |   .4316481   .0778954     5.54   0.000      .278692    .5846042
  ------------------------------------------------------------------------------

  β̂ is the coefficient on age (.222041): the value that FEV increases by for each one-year increase in age. α̂ is _cons (.4316481): the value of FEV when age = 0.

  28. Hypothesis testing of regression coefficients • We can use the estimated coefficient and its standard error to test the null hypothesis H0: β = 0 (no relationship between x and y) against the alternative HA: β ≠ 0 (x and y are related) • The test statistic for this is t = β̂ / se(β̂) • Under the null hypothesis it follows the t distribution with n − 2 degrees of freedom
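The same t statistic can be rebuilt from regress's stored results; a sketch using the fev/age example from slide 27 (it should reproduce t = 29.53 for age):

  * Coefficient test by hand from stored results
  quietly regress fev age
  scalar tstat = _b[age] / _se[age]
  display "t = " tstat " on " e(df_r) " df"
  display "two-sided p-value = " 2*ttail(e(df_r), abs(tstat))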

  29. R² • A summary of the model fit is the coefficient of determination, R² • R² = r², i.e. the Pearson correlation coefficient squared • R² ranges from 0 to 1 and measures the proportion of the variability in y that is explained by the regression of y on x
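A quick check of this identity, again assuming the fev/age data:

  * R-squared equals the squared Pearson correlation in simple linear regression
  quietly regress fev age
  display "R-squared from regress:  " e(r2)
  quietly correlate fev age
  display "squared correlation r^2: " r(rho)^2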
