Introductory Statistics for Laboratorians Dealing with High-Throughput Data Sets. Centers for Disease Control.
Regression/Prediction • In this case we can use X to predict Y with perfect accuracy. • The equation is Y = 2X + 1. • For any X we can compute Y. • Error is not a factor.
In this case we can’t fit a nice straight line to the data. • If we repeat the experiment we will get somewhat different results • Random error is a factor
If we know that the relationship should be a straight line, and we believe the deviation from a line is due to error: • We can fit a line to the data. • There are many possible lines to choose from. • How do we select the best line?
Principles of Prediction • If we don’t know anything about the person being measured, our best bet is always to predict the mean • If we know the person is average on X we should predict they will be average on Y. • We want to select the line that produces the least error.
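The last two principles can be sketched in code. This is a minimal illustration, assuming "least error" means the smallest sum of squared errors (ordinary least squares); the function names and the five data points are hypothetical and do not come from the slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Choosing the intercept this way forces the line through (mean X, mean Y).
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between actual Y and predicted Y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical development sample (made-up numbers, for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [3.2, 4.8, 7.1, 9.0, 10.9]

slope, intercept = fit_line(xs, ys)
best_error = sum_squared_error(xs, ys, slope, intercept)

# Any other candidate line, such as Y = 2X + 1, has at least as much squared error.
other_error = sum_squared_error(xs, ys, 2.0, 1.0)
```

Because the intercept is computed from the two means, a subject who is average on X is predicted to be average on Y, which is the second principle on the slide.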
Development Sample • A sample in which both X and Y are known • Used to develop an equation that computes a predicted Y from X • Used to compute the standard curve • Unknown Sample • Use the equation from the development sample to predict Y for people or samples for whom we know X but not Y
Regression • [Figure: one subject, Freddie Bruflot (X = 5, Y = 10), plotted against the regression line. Labels: Actual Y, Error (Residual), Predicted Y, Total, Regression, Mean of Y, .75]
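The labels on this slide (Total, Regression, Error/Residual) correspond to the usual way a single observation's deviation from the mean is split in regression. A sketch in standard notation, where the symbols \(Y_i\) (actual Y), \(\hat{Y}_i\) (predicted Y) and \(\bar{Y}\) (mean of Y) are notation assumed here rather than taken from the slides:

```latex
\underbrace{Y_i - \bar{Y}}_{\text{Total}}
  \;=\;
\underbrace{\hat{Y}_i - \bar{Y}}_{\text{Regression}}
  \;+\;
\underbrace{Y_i - \hat{Y}_i}_{\text{Error (Residual)}}
```

For a least-squares line, the same split holds for the sums of squares over the whole sample: \(\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2\).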
Correlation/Regression Example • High levels of a particular factor in blood samples (call it BF-Costly) are known to be highly predictive of cervical cancer. • Measuring this specific factor is so expensive and time-consuming that it is impractical. • The following data are obtained concerning the relationship between a second, easily measured blood factor (call it BF-Cheap) and BF-Costly.
Here is a scatterplot of the data showing the relationship between BF-Cheap and BF-Costly. • It looks like they are correlated. • If it is significant this might be worth pursuing. • Null hypothesis: correlation is zero • Alpha = .05
Test Significance of the Correlation • The p-value (the probability of seeing a correlation at least this strong if the true correlation were zero) is .0000724 (7.24E-05) • This is below alpha = .05, so we can reject the null hypothesis • It may well be worth developing a prediction equation that can be used to predict BF-Costly from BF-Cheap
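A significance test of this kind can be sketched as follows. The slides do not show the underlying data or the software used, so the five data pairs below are hypothetical, and the test shown (a t statistic with n - 2 degrees of freedom, compared against a tabulated critical value) is one common way to test a correlation, not necessarily the one used to get the p-value above.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Hypothetical measurements (made-up numbers, NOT the data behind the slides).
bf_cheap = [10, 20, 30, 40, 50]
bf_costly = [35, 38, 48, 52, 60]

r = pearson_r(bf_cheap, bf_costly)
t = t_statistic(r, len(bf_cheap))

# Two-tailed critical value of t for alpha = .05 and 3 degrees of freedom,
# taken from a standard t table.
T_CRITICAL_DF3 = 3.182
reject_null = abs(t) > T_CRITICAL_DF3
```

Comparing |t| with the critical value is equivalent to comparing the p-value with alpha; statistical software normally reports the p-value directly.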
Regression Analysis • Null hypothesis: the slope of the regression line is zero • Alpha = .05 • The p-value is .0000724 • We can reject the null hypothesis • What is the equation? • Is it any good?
The equation we are looking for will be of the form: Predicted BF-Costly = slope * BF-Cheap + y-intercept • The y-intercept is called “constant” in the output and is 27.6 • The slope (the number you multiply BF-Cheap by) is .63 • Both of these are significantly different from zero • The equation is: Predicted BF-Costly = .63(BF-Cheap) + 27.6
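Applying the equation to an unknown sample is a one-line calculation. A minimal sketch, using the slope and intercept from the slide; the function name and the example BF-Cheap value of 50 are hypothetical.

```python
SLOPE = 0.63       # slope reported on the slide
INTERCEPT = 27.6   # y-intercept ("constant") reported on the slide

def predict_bf_costly(bf_cheap):
    """Predicted BF-Costly for a measured BF-Cheap value."""
    return SLOPE * bf_cheap + INTERCEPT

# Hypothetical unknown sample with BF-Cheap = 50:
# .63 * 50 + 27.6 = 31.5 + 27.6 = 59.1
predicted = predict_bf_costly(50)
```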
How good is the equation? • What kind of accuracy can we expect? • R-Square (.715) is the proportion of the total variance in BF-Costly accounted for by the equation. • About 71.5% of the variance is accounted for • That means about 28.5% is not accounted for
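R-Square can be computed from actual and predicted values as one minus the ratio of residual to total sum of squares. A minimal sketch; the function name and the sample values are hypothetical, chosen only to show the two extremes.

```python
def r_squared(actual, predicted):
    """Proportion of the variance in `actual` accounted for by `predicted`."""
    y_bar = sum(actual) / len(actual)
    ss_total = sum((y - y_bar) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical values (made-up numbers, for illustration only).
actual = [35, 38, 48, 52, 60]
mean_only = [sum(actual) / len(actual)] * len(actual)

perfect_fit = r_squared(actual, actual)    # predictions match exactly
no_better = r_squared(actual, mean_only)   # always predicting the mean
```

An equation that predicts perfectly scores 1, and one that does no better than always predicting the mean scores 0; the .715 on the slide sits between these two extremes.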
Using Linear Regression to Develop a Standard Curve for Real Time PCR • Develop a standard curve from samples with known concentration. • Y is the concentration • X is the CP (crossing point) • Both X and Y are known • The relationship between concentration and output is not linear; it is an S-curve. However, the relationship between log(concentration) and output is linear.
CP and Concentration are highly correlated. • It is a negative correlation. • The correlation is -.996 • Now fit a regression line to these data and get the equation of the line.
Fit Regression Line • The regression is highly significant, as expected given the correlation • Y-intercept is 8.11 • Slope is -.22 • Equation is: • Predicted Concentration = antilog(-.22(CP) + 8.11)
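Fitting a standard curve amounts to a least-squares regression of log(concentration) on CP. A minimal sketch, assuming base-10 logarithms. The slides do not list the standards, so the five points below are synthetic: they are generated from the slide's own equation (not real measurements) purely so that the fit can be seen to recover a slope of -.22 and an intercept of 8.11. All names are hypothetical.

```python
from math import log10
from statistics import mean

def fit_standard_curve(cps, concentrations):
    """Least-squares fit of log10(concentration) = slope * CP + intercept."""
    logs = [log10(c) for c in concentrations]
    cp_bar, log_bar = mean(cps), mean(logs)
    sxy = sum((cp - cp_bar) * (lg - log_bar) for cp, lg in zip(cps, logs))
    sxx = sum((cp - cp_bar) ** 2 for cp in cps)
    slope = sxy / sxx
    intercept = log_bar - slope * cp_bar
    return slope, intercept

# Synthetic standards lying exactly on the slide's line (NOT measured data).
standard_cps = [10.0, 15.0, 20.0, 25.0, 30.0]
standard_concs = [10 ** (8.11 - 0.22 * cp) for cp in standard_cps]

slope, intercept = fit_standard_curve(standard_cps, standard_concs)
```

With real standards the points would scatter around the line, and the fitted slope and intercept would be estimates rather than exact values.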
Standard Curve • For an Unknown with a CP of 18.48 • Log Predicted Concentration would be • 8.11 - .22(18.48) = 4.04 • Antilog of 4.04 is 1.10E4
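The worked example on this slide can be reproduced in a few lines. A minimal sketch, assuming "antilog" means base 10 (which matches the slide's 1.10E4 for a log of 4.04); the function names and the `round_log_to` parameter are hypothetical.

```python
SLOPE = -0.22      # slope of the standard curve from the slide
INTERCEPT = 8.11   # y-intercept of the standard curve from the slide

def predicted_log_concentration(cp):
    """log10 of the predicted concentration for a given CP."""
    return INTERCEPT + SLOPE * cp

def predicted_concentration(cp, round_log_to=None):
    """Predicted concentration; optionally round the log first, as the slide does."""
    log_conc = predicted_log_concentration(cp)
    if round_log_to is not None:
        log_conc = round(log_conc, round_log_to)
    return 10 ** log_conc

log_conc = predicted_log_concentration(18.48)                      # about 4.04
conc_slide_style = predicted_concentration(18.48, round_log_to=2)  # antilog of 4.04
conc_unrounded = predicted_concentration(18.48)                    # antilog of 4.0444
```

Rounding the log to 4.04 before taking the antilog gives about 1.10E4, as on the slide; keeping full precision gives about 1.11E4. The gap shows how sensitive the antilog is to rounding in the intermediate step.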