Introductory Statistics for Laboratorians dealing with High Throughput Data sets

Centers for Disease Control

Regression/Prediction • In this case we can use X to predict Y with perfect accuracy. • The equation is Y = 2X + 1 • For any X we can compute Y • Error is not a factor

In this case we can’t fit a nice straight line to the data. • If we repeat the experiment we will get somewhat different results • Random error is a factor

If we know that the relationship should be a straight line and we believe the deviations from the line are due to error: • We can fit a line to the data • There are many possible lines to choose from. • How do we select the best line?

Principles of Prediction • If we don’t know anything about the person being measured, our best bet is always to predict the mean • If we know the person is average on X we should predict they will be average on Y. • We want to select the line that produces the least error (in least-squares regression, the smallest sum of squared errors).
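As a minimal sketch of these principles, the following Python computes a least-squares line and checks that a case that is average on X is predicted to be average on Y. The function name `fit_line` and the data values are hypothetical and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of X about their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical development sample (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [3.2, 4.9, 7.1, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# The least-squares line always passes through (mean of X, mean of Y)
predicted_at_mean = slope * mean_x + intercept
```

Because the intercept is defined as mean of Y minus slope times mean of X, the prediction at the mean of X equals the mean of Y by construction.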

Development Sample • A sample in which both X and Y are known • Used to develop an equation that can be used to compute a predicted Y from X • Used to compute the standard curve

Unknown Sample • Use the equation developed above to predict Y for people or samples for whom we know X but not Y

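Regression (figure) • Example case: Freddie Bruflot, X = 5, Y = 10 • Levels marked on the plot: Actual Y, Predicted Y, Mean of Y • Deviations marked: Error (Residual) between Actual Y and Predicted Y; Regression between Predicted Y and Mean of Y; Total between Actual Y and Mean of Y • Value shown: .75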

Computation of SSE
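The slide's own computation did not survive, so the following is only a sketch of how SSE (the sum of squared errors) is usually computed, alongside the total and regression sums of squares. The data are hypothetical, and the line Y = 0.6X + 2.2 is the least-squares line for these particular points.

```python
# Hypothetical development sample (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for these points: predicted Y = 0.6 * X + 2.2
slope, intercept = 0.6, 2.2

mean_y = sum(ys) / len(ys)
predicted = [slope * x + intercept for x in xs]

# Error (residual): actual Y minus predicted Y
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
# Regression: predicted Y minus mean of Y
ssr = sum((p - mean_y) ** 2 for p in predicted)
# Total: actual Y minus mean of Y
sst = sum((y - mean_y) ** 2 for y in ys)

print(f"SSE = {sse:.2f}, SSR = {ssr:.2f}, SST = {sst:.2f}")
```

For a least-squares line, the total sum of squares equals the regression sum of squares plus the error sum of squares.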

Correlation/Regression Example • High levels of a particular factor in blood samples (call it BF-Costly) are known to be highly predictive of cervical cancer. • Measuring this specific factor is so expensive and time consuming that it is impractical. • The following data are obtained concerning the relationship between a second, easily measured blood factor (call it BF-Cheap) and BF-Costly.

Here is a scatterplot of the data showing the relationship between BF-Cheap and BF-Costly. • It looks like they are correlated. • If the correlation is significant, this might be worth pursuing. • Null hypothesis: correlation is zero • Alpha = .05

Test Significance of the Correlation • The p-value is .0000724 (7.24E-05): the probability of observing a correlation at least this strong if the true correlation were zero • Because this is below alpha = .05, we can reject the null hypothesis • It may well be worth it to develop a prediction equation that can be used to predict BF-Costly from BF-Cheap
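A rough sketch of the test follows. The BF-Cheap and BF-Costly values are hypothetical because the original data table is not reproduced here, so the result will not match the p-value on the slide. The sketch computes Pearson's r, converts it to a t statistic with n - 2 degrees of freedom, and compares it with the two-tailed critical value for alpha = .05 and 8 degrees of freedom (2.306).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements (illustrative values only), n = 10
bf_cheap = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bf_costly = [35, 38, 50, 49, 62, 60, 75, 74, 88, 87]

n = len(bf_cheap)
r = pearson_r(bf_cheap, bf_costly)

# t statistic for the null hypothesis that the correlation is zero
t = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-tailed critical t for alpha = .05 with n - 2 = 8 degrees of freedom
T_CRITICAL = 2.306
reject_null = abs(t) > T_CRITICAL
```

Statistical software reports the exact p-value from the t distribution; the critical-value comparison here only gives the reject or do-not-reject decision.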

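Regression Analysis • Null Hypothesis: the slope of the regression line is zero • Alpha = .05 • The p-value is .0000724 (with a single predictor, the test of the slope is equivalent to the test of the correlation, so the p-value is the same) • We can reject the null hypothesis • What is the equation? • Is it any good?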

The equation we are looking for will be of the form: Predicted BF-Costly = slope * BF-Cheap + y-intercept • The Y-intercept is called “constant” and is 27.6 • The slope (the number you multiply BF-Cheap by) is .63 • Both are significantly different from zero • The equation is: Predicted BF-Costly = .63(BF-Cheap) + 27.6
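The equation can be applied directly to a new sample. In this short sketch the function name `predict_bf_costly` and the input value of 50 are hypothetical; the slope and intercept are the rounded values from the slide.

```python
SLOPE = 0.63       # slope from the regression output
INTERCEPT = 27.6   # y-intercept ("constant") from the regression output

def predict_bf_costly(bf_cheap):
    """Predicted BF-Costly for a measured BF-Cheap value."""
    return SLOPE * bf_cheap + INTERCEPT

# Hypothetical unknown sample with BF-Cheap = 50
prediction = predict_bf_costly(50)
print(f"Predicted BF-Costly: {prediction:.1f}")
```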

How good is the equation? • What kind of accuracy can we expect? • R-Square (.715) is the proportion of the total variance accounted for by the equation. • About 71.5% of the variance is accounted for • That means about 28.5% is not accounted for
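To put the R-Square value in perspective, the sketch below derives two related quantities from it. The interpretation as a ratio of spreads ignores degrees-of-freedom corrections and is offered as an approximation, not as output from the original analysis.

```python
import math

r_squared = 0.715  # from the regression output

# Proportion of variance not accounted for by the equation
unexplained = 1 - r_squared

# Size of the correlation between BF-Cheap and BF-Costly
# (positive here because the slope is positive)
r = math.sqrt(r_squared)

# Approximate spread of the prediction errors relative to the spread of
# BF-Costly itself, since SSE / SST = 1 - R-Square
residual_sd_ratio = math.sqrt(unexplained)

print(f"unexplained = {unexplained:.3f}, r = {r:.3f}, "
      f"residual spread ratio = {residual_sd_ratio:.3f}")
```

Under these assumptions, prediction errors are roughly half as spread out as BF-Costly values would be if we always predicted the mean.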

Using Linear Regression to Develop a Standard Curve for Real Time PCR • Develop a standard curve from samples with known concentration. • Y is the concentration • X is the CP (crossing point) • Both X and Y are known • The relationship between concentration and output is not linear; it is an S-curve. The relationship between log(concentration) and output, however, is linear.

Standard Sample Data

CP and Concentration are highly correlated. • It is a negative correlation. • The correlation is -.996 • Now fit a regression line to these data and get the equation of the line.

Fit Regression Line • The fit is highly significant • Y-Intercept is 8.11 • Slope is -.22 • Equation is: Predicted Concentration = antilog(-.22(CP) + 8.11)
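A minimal sketch of fitting such a standard curve is shown below. The standards are hypothetical (the original standard sample data are not reproduced here), and base-10 logarithms are an assumption, since the slides do not state the base. The regression is fitted to log(concentration) against CP, not to the raw concentration.

```python
import math

# Hypothetical standards: known concentrations and their measured CP values
concentrations = [1e6, 1e5, 1e4, 1e3, 1e2]
cps = [10.0, 14.5, 19.1, 23.6, 28.2]

# Transform Y to log10(concentration) so the relationship with CP is linear
log_conc = [math.log10(c) for c in concentrations]

n = len(cps)
mean_x = sum(cps) / n
mean_y = sum(log_conc) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cps, log_conc))
sxx = sum((x - mean_x) ** 2 for x in cps)

slope = sxy / sxx                    # negative: higher CP means less template
intercept = mean_y - slope * mean_x

print(f"log10(concentration) = {slope:.3f} * CP + {intercept:.3f}")
```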

Standard Curve • For an unknown with a CP of 18.48, the log of the predicted concentration would be 8.11 - .22(18.48) = 4.04 • The antilog of 4.04 is 1.10E4
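The same arithmetic can be sketched in Python. The function name `predicted_concentration` is hypothetical, and the antilog is assumed to be base 10.

```python
SLOPE = -0.22      # slope of the standard curve
INTERCEPT = 8.11   # y-intercept of the standard curve

def predicted_concentration(cp):
    """Return (log10 concentration, concentration) predicted from a CP value."""
    log_conc = SLOPE * cp + INTERCEPT
    return log_conc, 10 ** log_conc   # antilog assumed to be base 10

log_conc, conc = predicted_concentration(18.48)
print(f"log concentration = {log_conc:.2f}, concentration = {conc:.2E}")
```

Carrying the unrounded log value through the antilog gives about 1.11E4, while rounding the log to 4.04 first gives 1.10E4; the small difference comes only from rounding.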
