Multivariate data


Graphical Techniques
• The scatter plot
• The two-dimensional histogram

Some Scatter Patterns

Non-Linear Patterns

Measures of strength of a relationship (Correlation)
• Pearson’s correlation coefficient (r)
• Spearman’s rank correlation coefficient (rho, ρ)

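Pearson’s correlation coefficient is defined as:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} $$

where:

$$ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$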

Properties of Pearson’s correlation coefficient r
• The value of r is always between -1 and +1.
• If the relationship between X and Y is positive, then r will be positive.
• If the relationship between X and Y is negative, then r will be negative.
• If there is no linear relationship between X and Y, then r will be zero (or, in a sample, close to zero).
• The value of r will be +1 if the points (xi, yi) lie on a straight line with positive slope.
• The value of r will be -1 if the points (xi, yi) lie on a straight line with negative slope.

[Figures: example scatter plots for r = 1, 0.95, 0.7, 0.4, 0, -0.4, -0.7, -0.8, -0.95 and -1]

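Computing formulae for the statistics:

$$ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} $$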
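The computing formulae can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the two small data sets are hypothetical (not from the slides), and the sums used are the standard shortcut forms of S_xx, S_yy and S_xy. The data are chosen to lie exactly on straight lines, so the result should reproduce the ±1 property listed above.

```python
import math

def pearson_r(x, y):
    """Pearson's r from the shortcut sums S_xx, S_yy and S_xy."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: points exactly on a line with positive slope,
# and points exactly on a line with negative slope.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(xs, [2.0 + 3.0 * v for v in xs])
r_down = pearson_r(xs, [10.0 - 0.5 * v for v in xs])
print(r_up, r_down)
```

The shortcut sums are convenient for hand calculation; with values that are large relative to their spread, the centered sums are numerically safer.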

Spearman’s rank correlation coefficient ρ (rho)

Spearman’s rank correlation coefficient is computed as follows:
• Arrange the observations on X in increasing order and assign them the ranks 1, 2, 3, …, n.
• Arrange the observations on Y in increasing order and assign them the ranks 1, 2, 3, …, n.
• For any case (i), let (xi, yi) denote the observations on X and Y and let (ri, si) denote the corresponding ranks on X and Y.

• If the variables X and Y are strongly positively correlated, the ranks on X should generally agree with the ranks on Y. (The largest X should be paired with the largest Y, the smallest X with the smallest Y.)
• If the variables X and Y are strongly negatively correlated, the ranks on X should be in the reverse order to the ranks on Y. (The largest X should be paired with the smallest Y, the smallest X with the largest Y.)
• If the variables X and Y are uncorrelated, the ranks on X should be randomly distributed relative to the ranks on Y.

Spearman’s rank correlation coefficient is defined as follows: For each case let di = ri - si = difference in the two ranks. Then Spearman’s rank correlation coefficient (ρ) is:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)} $$

Properties of Spearman’s rank correlation coefficient ρ
• The value of ρ is always between -1 and +1.
• If the relationship between X and Y is positive, then ρ will be positive.
• If the relationship between X and Y is negative, then ρ will be negative.
• If there is no relationship between X and Y, then ρ will be zero (or, in a sample, close to zero).
• The value of ρ will be +1 if the ranks of X completely agree with the ranks of Y.
• The value of ρ will be -1 if the ranks of X are in reverse order to the ranks of Y.

Example

xi   25.0  33.9  16.7  37.4  24.6  17.3  40.2
yi   24.3  38.7  13.4  32.1  28.0  12.5  44.9

Ranking the X’s and the Y’s we get:

ri    4     5     1     6     3     2     7
si    3     6     2     5     4     1     7

Computing the differences in ranks gives us:

di    1    -1    -1     1    -1     1     0
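The ranking steps for this example can be reproduced with a short Python sketch. The helper name `ranks` is hypothetical, it assumes there are no tied values (true for this data set), and the final line uses the standard difference-of-ranks formula ρ = 1 - 6 Σ di² / (n(n² - 1)).

```python
# Data from the example above.
x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

r_ranks = ranks(x)                                 # ranks on X
s_ranks = ranks(y)                                 # ranks on Y
d = [ri - si for ri, si in zip(r_ranks, s_ranks)]  # rank differences
n = len(x)
rho = 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))
print(r_ranks, s_ranks, d, round(rho, 3))
```

Here Σ di² = 6 and n = 7, so ρ = 1 - 36/336, about 0.893.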

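Computing Pearson’s correlation coefficient, r, for the same problem:

To compute r, first compute the sums

$$ \sum x_i = 195.1, \quad \sum y_i = 193.9, \quad \sum x_i^2 = 5972.35, \quad \sum y_i^2 = 6254.41, \quad \sum x_i y_i = 6053.78 $$

Then

$$ S_{xx} = 5972.35 - \frac{(195.1)^2}{7} = 534.634, \quad S_{yy} = 6254.41 - \frac{(193.9)^2}{7} = 883.380, \quad S_{xy} = 6053.78 - \frac{(195.1)(193.9)}{7} = 649.510 $$

and

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{649.510}{\sqrt{(534.634)(883.380)}} \approx 0.945 $$

Compare with Spearman’s ρ for the same data, computed from the rank differences (Σ di² = 6, n = 7):

$$ \rho = 1 - \frac{6(6)}{7(7^2 - 1)} \approx 0.893 $$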
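As a check on the hand calculation, Pearson's r for the example data can be computed directly. This is a minimal sketch using only the standard library; the variable names are illustrative, and it uses the shortcut sums for S_xx, S_yy and S_xy.

```python
import math

# Data from the example above.
x = [25.0, 33.9, 16.7, 37.4, 24.6, 17.3, 40.2]
y = [24.3, 38.7, 13.4, 32.1, 28.0, 12.5, 44.9]
n = len(x)

# Shortcut sums for S_xx, S_yy and S_xy.
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xx, 3), round(s_yy, 3), round(s_xy, 3), round(r, 3))
```

For this data set r is somewhat larger than Spearman's ρ, which uses only the ordering of the values and not their spacing.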

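Comments: Spearman’s rank correlation coefficient ρ and Pearson’s correlation coefficient r
• The value of ρ can also be computed from:

$$ \rho = \frac{S_{rs}}{\sqrt{S_{rr}\,S_{ss}}}, \qquad S_{rr} = \sum (r_i - \bar{r})^2, \quad S_{ss} = \sum (s_i - \bar{s})^2, \quad S_{rs} = \sum (r_i - \bar{r})(s_i - \bar{s}) $$

• That is, Spearman’s ρ is Pearson’s r computed from the ranks.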
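The equivalence of the two ways of computing Spearman's coefficient can be checked numerically on the ranks from the example. In this sketch the function name `pearson_r` is hypothetical, and the agreement of the two formulas is only guaranteed when there are no tied ranks, as is the case here.

```python
import math

def pearson_r(x, y):
    """Pearson's r from centered sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Ranks from the example above.
r_ranks = [4, 5, 1, 6, 3, 2, 7]
s_ranks = [3, 6, 2, 5, 4, 1, 7]
n = len(r_ranks)

# Spearman via the rank-difference formula.
rho_from_d = 1 - 6 * sum((ri - si) ** 2 for ri, si in zip(r_ranks, s_ranks)) / (n * (n ** 2 - 1))
# Spearman as Pearson's r applied to the ranks.
rho_from_ranks = pearson_r(r_ranks, s_ranks)
print(round(rho_from_d, 4), round(rho_from_ranks, 4))
```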

• Spearman’s ρ is less sensitive to extreme observations (outliers).
• The value of Pearson’s r is much more sensitive to extreme outliers.

This is similar to the comparison between the median and the mean, or between the pseudo-standard deviation and the standard deviation: the mean and standard deviation are more sensitive to outliers than the median and pseudo-standard deviation.
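The difference in sensitivity can be illustrated with a small made-up data set. Everything in this sketch is hypothetical: nine points lying exactly on the line y = x, plus one extreme outlier at (10, -50). Spearman's coefficient is computed as Pearson's r on the ranks, and the `ranks` helper assumes no ties.

```python
import math

def pearson_r(x, y):
    """Pearson's r from centered sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: a perfect line, then the same line plus one outlier.
x_clean = [float(v) for v in range(1, 10)]
y_clean = list(x_clean)
x_out = x_clean + [10.0]
y_out = y_clean + [-50.0]

r_clean, rho_clean = pearson_r(x_clean, y_clean), spearman_rho(x_clean, y_clean)
r_out, rho_out = pearson_r(x_out, y_out), spearman_rho(x_out, y_out)
print(round(r_clean, 3), round(rho_clean, 3), round(r_out, 3), round(rho_out, 3))
```

With the single outlier added, Pearson's r changes sign while Spearman's ρ stays positive: the outlier can move a rank by at most n - 1 places, however extreme its value.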

Simple Linear Regression Fitting straight lines to data

The Least Squares Line The Regression Line • When data is correlated it falls roughly about a straight line.

In this situation one wants to:
• Find the equation of the straight line through the data that yields the best fit.

The equation of any straight line is of the form:

Y = a + bX

b = the slope of the line
a = the intercept of the line

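[Figure: a straight line through the points (x1, y1) and (x2, y2), with Rise = y2 - y1, Run = x2 - x1, and the intercept a marked on the Y axis]

$$ b = \frac{\text{Rise}}{\text{Run}} = \frac{y_2 - y_1}{x_2 - x_1} $$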

• a is the value of Y when X is zero.
• b is the rate at which Y increases per unit increase in X.
• For a straight line this rate is constant.
• For non-linear curves the rate at which Y increases per unit increase in X varies with X.
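The rise-over-run definition of the slope, and the intercept as the value of Y at X = 0, can be sketched as follows. The function name `line_through` and the two points are hypothetical, chosen only for illustration, and the sketch assumes the two points have different X values.

```python
def line_through(p1, p2):
    """Return (a, b) for the line Y = a + bX through two points.

    Assumes x1 != x2, so that the run is non-zero.
    """
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)  # slope = rise / run
    a = y1 - b * x1            # intercept = value of Y when X is zero
    return a, b

# Hypothetical points (1, 5) and (3, 9).
a, b = line_through((1.0, 5.0), (3.0, 9.0))
print(a, b)
```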

Linear

Non-linear

Example: In the following example both blood pressure and age were measured for each female subject. Subjects were grouped into age classes and the median blood pressure measurement was computed for each age class. The data are summarized below:

Graph:

Interpretation of the slope and intercept • Intercept – value of Y at X = 0. • Predicted Blood pressure of a newborn (65.1). • This interpretation remains valid only if linearity is true down to X = 0. • Slope – rate of increase in Y per unit increase in X. • Blood Pressure increases 1.38 units each year.

The Least Squares Line Fitting the best straight line to “linear” data

Reasons for fitting a straight line to data
• It provides a precise description of the relationship between Y and X.
• The interpretation of the parameters of the line (slope and intercept) leads to an improved understanding of the phenomenon under study.
• The equation of the line is useful for prediction of the dependent variable (Y) from the independent variable (X).

Assume that we have collected data on two variables X and Y. Let (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).

Let Y = a + b X denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if X = xi (as for the ith case) then the predicted value of Y is:

$$ \hat{y}_i = a + b\,x_i $$

For example, if Y = a + b X = 25.2 + 2.0 X is the equation of the straight line, and if X = xi = 20 (for the ith case), then the predicted value of Y is:

$$ \hat{y}_i = 25.2 + 2.0\,(20) = 65.2 $$
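The prediction step amounts to evaluating the line at the observed X value. A minimal sketch, using the intercept and slope from the example above; the function name `predict` is hypothetical.

```python
def predict(a, b, x):
    """Predicted value of Y from the line Y = a + bX."""
    return a + b * x

# Line and X value from the example above: a = 25.2, b = 2.0, xi = 20.
y_hat = predict(25.2, 2.0, 20)
print(round(y_hat, 1))
```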
