Ch 17 Correlation 2010/3/10
We now investigate the relationships that can exist among continuous variables. • Correlation analysis: • Correlation is defined as the quantification of the degree to which two random variables are related, provided that the relationship is linear.
17.1 Two-Way Scatter Plot • Suppose that we are interested in a pair of continuous random variables; for example, the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) and the under-five mortality rate. • Data for a random sample of 20 countries are shown in Figure 17.1 (Table 17.1). • X: the percentage of children immunized by age one year • Y: the under-five mortality rate • Before we do any analysis, we should create a two-way scatter plot of the data to see whether a relationship exists between x and y; a sketch of such a plot appears below. • The mortality rate tends to decrease as the percentage of children immunized increases.
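A minimal sketch of such a two-way scatter plot using matplotlib; the data values here are illustrative stand-ins, not the actual figures from Table 17.1:

```python
# Two-way scatter plot of immunization coverage (x) versus under-five mortality (y).
# The values below are illustrative stand-ins, not the data from Table 17.1.
import matplotlib.pyplot as plt

immunized = [77, 92, 26, 55, 85, 95, 73, 40, 98, 65]      # X: % immunized by age one year
mortality = [118, 35, 292, 165, 54, 8, 97, 210, 6, 140]   # Y: deaths per 1,000 live births

plt.scatter(immunized, mortality)
plt.xlabel("Percentage of children immunized against DPT")
plt.ylabel("Under-five mortality rate")
plt.title("Two-way scatter plot (illustrative data)")
plt.show()
```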
17.2 Pearson’s Correlation Coefficient • In the underlying population from which the sample of points (xi, yi) is selected, the population correlation between the variables X and Y is denoted by ρ (the Greek letter rho). • ρ quantifies the strength of the linear relationship between the outcomes x and y. • The estimator of ρ is known as Pearson’s coefficient of correlation, or simply the correlation coefficient, and is denoted by r.
17.2 Pearson’s Correlation Coefficient • The sample correlation coefficient is denoted by r and is computed from the formula below. • sx and sy are the sample standard deviations of the x and y values.
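In standard notation, with x̄ and ȳ the sample means, the sample correlation coefficient can be written as

$$ r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) $$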
The correlation coefficient is a dimensionless number; it has no units of measurement. • -1 ≤ r ≤ 1 • The values r = 1 and r = -1 occur when there is an exact linear relationship between x and y. (Figure 17.2 (a)(b)) • If y tends to increase in magnitude as x increases, r is greater than 0; x and y are said to be positively correlated. (r > 0) • If y decreases as x increases, r is less than 0 and the two variables are negatively correlated. (r < 0) • If r = 0, there is no linear relationship between x and y and the variables are uncorrelated. (Figure 17.2 (c)(d), page 401)
17.2 Pearson’s Correlation Coefficient • In this sample: • Strong linear relationship • Negative association: the mortality rate decreases in magnitude as the percentage of immunization increases • The correlation coefficient merely tells us that a linear relationship exists between two variables; it does not specify whether the relationship is cause-and-effect. • We would also like to be able to draw conclusions about the unknown population correlation ρ using the sample correlation coefficient r.
17.2 Pearson’s Correlation Coefficient • H0: ρ = 0 (no association between X and Y) H1: ρ ≠ 0 (association between X and Y) • The estimated standard error of r and the test statistic (under H0) are given below. • The test is valid if we assume that the pairs of observations were obtained randomly and that both X and Y are normally distributed. • If ρ is equal to some other value, represented by ρ0, the sampling distribution of r is skewed, and the test statistic no longer follows a t distribution.
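In standard form, the estimated standard error of r and the test statistic referred to above are

$$ \widehat{\mathrm{se}}(r) = \sqrt{\frac{1-r^{2}}{n-2}}, \qquad t = \frac{r}{\widehat{\mathrm{se}}(r)} = r\sqrt{\frac{n-2}{1-r^{2}}} $$

Under H0, t follows a t distribution with n − 2 degrees of freedom.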
17.2 Pearson’s Correlation Coefficient • The coefficient of correlation r has several limitations: • It quantifies only the strength of the linear relationship between two variables. • Care must be taken when the data contain any outliers, or pairs of observations that lie considerably outside the range of the other data points. • The estimated correlation should never be extrapolated beyond the observed ranges of the variables; the relationship between X and Y may change outside of this region. • A high correlation between two variables does not imply a cause-and-effect relationship.
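As a computational sketch of this section, the snippet below applies scipy.stats.pearsonr (which returns r together with the two-sided p-value for the test of H0: ρ = 0) to the same illustrative data used in the scatter-plot sketch above, and recomputes the t statistic from the formula given earlier:

```python
# Pearson's r and the test of H0: rho = 0, using the illustrative data from the scatter-plot sketch.
import numpy as np
from scipy import stats

immunized = np.array([77, 92, 26, 55, 85, 95, 73, 40, 98, 65])
mortality = np.array([118, 35, 292, 165, 54, 8, 97, 210, 6, 140])

r, p_value = stats.pearsonr(immunized, mortality)

# t statistic recomputed directly from r, matching the formula in the text
n = len(immunized)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))

print(f"r = {r:.3f}, t = {t_stat:.3f}, p = {p_value:.4f}")
```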
17.3 Spearman’s Rank Correlation Coefficient • Pearson’s correlation coefficient is very sensitive to outlying values. We may therefore be interested in calculating a measure of association that is more robust. • One approach is to rank the two sets of outcomes x and y separately and apply Pearson’s formula to the ranks; the result is known as Spearman’s rank correlation coefficient (a nonparametric method). • Spearman’s rank correlation coefficient is given below, where xri and yri are the ranks associated with the ith subject rather than the actual observations.
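Because Spearman’s coefficient is simply Pearson’s coefficient computed on the ranks, it can be written as

$$ r_s = \frac{\sum_{i=1}^{n}(x_{ri}-\bar{x}_r)(y_{ri}-\bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{ri}-\bar{x}_r)^{2}\sum_{i=1}^{n}(y_{ri}-\bar{y}_r)^{2}}} $$

where x̄r and ȳr are the mean ranks of the x and y values.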
17.3 Spearman’s Rank Correlation Coefficient • An equivalent method for computing rs is provided by the shortcut formula below. • n: the number of data points in the sample • di: the difference between the rank of xi and the rank of yi • -1 ≤ rs ≤ 1 • A high degree of correlation between x and y: rs = -1 or 1 • A lack of linear association between the two variables: rs = 0 • When the data are ordinal, or the conditions required for Pearson’s coefficient do not hold, we should use rs.
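The shortcut formula mentioned above (exactly equivalent to the rank-based Pearson formula when there are no tied ranks):

$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)} $$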
17.3 Spearman’s Rank Correlation Coefficient • Spearman’s rank correlation coefficient may also be thought of as a measure of the concordance (agreement) of the ranks for the outcomes x and y. • Case I
17.3 Spearman’s Rank Correlation Coefficient • Case II
17.3 Spearman’s Rank Correlation Coefficient • If n is not too small and if we can assume that the pairs of ranks are chosen randomly, we can test the null hypothesis H0: ρs = 0 using the test statistic shown below. • This testing procedure does not require that X and Y be normally distributed. • About rs: • It is much less sensitive to outlying values than Pearson’s correlation coefficient. • It can be used when one or both of the relevant variables are ordinal. • It relies on ranks rather than on actual observations.
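The test statistic referred to above has the same form as in the Pearson case, with rs in place of r; under H0 it is compared with a t distribution with n − 2 degrees of freedom:

$$ t_s = r_s\sqrt{\frac{n-2}{1-r_s^{2}}} $$

A minimal computational sketch on the same illustrative data used earlier; scipy.stats.spearmanr returns rs and a p-value (which SciPy may compute with its own approximation rather than the t statistic above):

```python
# Spearman's rank correlation on the illustrative data used earlier (not Table 17.1).
import numpy as np
from scipy import stats

immunized = np.array([77, 92, 26, 55, 85, 95, 73, 40, 98, 65])
mortality = np.array([118, 35, 292, 165, 54, 8, 97, 210, 6, 140])

r_s, p_value = stats.spearmanr(immunized, mortality)

# t statistic from the formula above (appropriate when n is not too small)
n = len(immunized)
t_s = r_s * np.sqrt((n - 2) / (1 - r_s**2))

print(f"rs = {r_s:.3f}, ts = {t_s:.3f}, p = {p_value:.4f}")
```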