Introductory Statistics for Laboratorians dealing with High Throughput Data sets

kris

Presentation Transcript


  1. Introductory Statistics for Laboratorians Dealing with High Throughput Data Sets • Centers for Disease Control

  2. Regression/Prediction • In this case we can use X to predict Y exactly. • The equation is Y = 2X + 1. • For any X we can compute Y. • Error is not a factor.

  3. In this case we can’t fit a nice straight line to the data. • If we repeat the experiment we will get somewhat different results • Random error is a factor

  4. If in this case we know that the relationship should be a straight line and we believe the deviation from a line is due to error • We can fit a line to the data • There are many possible lines to choose from. • How do we select the best line?

  5. Principles of Prediction • If we don’t know anything about the person being measured, our best bet is always to predict the mean • If we know the person is average on X we should predict they will be average on Y. • We want to select the line that produces the least error.
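The idea of "the line that produces the least error" can be sketched as an ordinary least-squares fit. This is a minimal illustration, not the presenters' own computation: the data points and the helper names `fit_line` and `sse` are hypothetical. It also shows the principle that a subject who is average on X is predicted to be average on Y.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # forces the line through (mean X, mean Y)
    return slope, intercept

def sse(xs, ys, slope, intercept):
    """Sum of squared errors of a candidate line on the sample."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical development sample (both X and Y known), invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)

# Someone who is average on X is predicted to be average on Y.
prediction_at_mean_x = slope * mean(xs) + intercept

# Any other candidate line, such as Y = 2X + 1, has a larger squared error.
sse_best = sse(xs, ys, slope, intercept)
sse_other = sse(xs, ys, 2.0, 1.0)
```

Among all possible straight lines, the least-squares line is the one with the smallest sum of squared errors on the development sample, which is why `sse_best` is smaller than `sse_other`.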

  6. Development Sample and Unknown Sample • Development sample: a sample in which both X and Y are known; used to develop an equation that computes a predicted Y from X (for example, to compute the standard curve) • Unknown sample: people or samples for whom we know X but not Y; use the equation developed from the development sample to predict Y

  7. Regression • [Figure: one data point (Freddie Bruflot, X = 5, Y = 10) plotted against the regression line. Labels: Actual Y, Predicted Y, Mean of Y; Error (Residual), Regression, Total; .75]

  8. Computation of SSE
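The transcript keeps only the title of this slide, so the following is a hedged sketch of how SSE (the sum of squared errors) is conventionally computed, along with the regression and total sums of squares. The data, the fitted coefficients, and the function name `sums_of_squares` are hypothetical and are not taken from the slide.

```python
def sums_of_squares(ys, predicted):
    """Return (SSE, SSR, SST) for observed values and their predictions."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # actual vs predicted
    ssr = sum((p - y_bar) ** 2 for p in predicted)          # predicted vs mean
    sst = sum((y - y_bar) ** 2 for y in ys)                 # actual vs mean
    return sse, ssr, sst

# Hypothetical sample; 1.97 and 1.09 are the least-squares slope and
# intercept for exactly these points.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
predicted = [1.97 * x + 1.09 for x in xs]

sse, ssr, sst = sums_of_squares(ys, predicted)
```

For a least-squares line, the total sum of squares splits into the regression part plus the error part (SST = SSR + SSE), which corresponds to the Total, Regression, and Error labels in the preceding figure.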

  9. Correlation/Regression Example • High levels of a particular factor in blood samples (call it BF-Costly) are known to be highly predictive of cervical cancer. • Measuring this specific factor is so expensive and time consuming that it is impractical. • The following data were obtained on the relationship between a second, easily measured blood factor (call it BF-Cheap) and BF-Costly.

  10. Here is a scatterplot of the data showing the relationship between BF-Cheap and BF-Costly. • It looks like they are correlated. • If it is significant this might be worth pursuing. • Null hypothesis: correlation is zero • Alpha = .05

  11. Test Significance of the Correlation • The p-value is .0000724 (7.24E-05): the probability of seeing a correlation at least this strong if the true correlation were zero • This is below alpha = .05, so we can reject the null hypothesis • It may well be worth it to develop a prediction equation that can be used to predict BF-Costly from BF-Cheap
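The slides do not say how the p-value was obtained, and the BF-Cheap/BF-Costly data are not in the transcript. As a rough sketch under those limitations, the code below computes a Pearson correlation and its t statistic on invented data, then estimates a p-value with a permutation test. The permutation test is a swapped-in technique chosen because it needs only the standard library; statistical packages usually report a t-based p-value instead. All data values and function names here are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real X-Y pairing
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented measurements, for illustration only.
bf_cheap = [12, 15, 19, 22, 26, 30, 33, 37, 41, 45]
bf_costly = [36, 35, 41, 40, 46, 44, 50, 49, 55, 54]

r = pearson_r(bf_cheap, bf_costly)
n = len(bf_cheap)
# t statistic for H0: correlation is zero, with n - 2 degrees of freedom.
# The two-sided critical value for df = 8 at alpha = .05 is 2.306.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
p_value = permutation_p_value(bf_cheap, bf_costly)
reject_null = p_value < 0.05
```

The null hypothesis is rejected when the p-value falls below the chosen alpha, here .05.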

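  12. Regression Analysis • Null hypothesis: the slope of the regression line is zero • Alpha = .05 • The p-value is .0000724 (the same value as for the correlation test, because with a single predictor the two tests are equivalent) • We can reject the null hypothesis • What is the equation? • Is it any good?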

  13. The equation we are looking for will be of the form: Predicted BF-Costly = slope * BF-Cheap + y-intercept • The y-intercept is called “constant” and is 27.6 • The slope (the number you multiply BF-Cheap by) is .63 • Both of these are significant • The equation is: Predicted BF-Costly = .63(BF-Cheap) + 27.6
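As a small sketch, the fitted equation can be wrapped in a function and applied to a new measurement. The slope and intercept come from the slide; the function name and the example BF-Cheap value of 50 are hypothetical.

```python
SLOPE = 0.63      # from the slide
INTERCEPT = 27.6  # the "constant" from the slide

def predict_bf_costly(bf_cheap):
    """Predicted BF-Costly for a measured BF-Cheap value."""
    return SLOPE * bf_cheap + INTERCEPT

# Hypothetical unknown sample with BF-Cheap = 50.
example_prediction = predict_bf_costly(50)  # .63 * 50 + 27.6, about 59.1
```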

  14. How good is the equation? • What kind of accuracy can we expect? • R-Square (.715) is the proportion of the total variance in BF-Costly accounted for by the equation. • About 71.5% of the variance is accounted for • That means about 28.5% is not accounted for
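R-Square can be computed as one minus the ratio of the error sum of squares to the total sum of squares. The sketch below uses a small invented data set, since the study data are not in the transcript; the function name `r_squared` is hypothetical. The last lines use the slide's value of .715 to show the unexplained share and the size of the correlation it implies.

```python
import math

def r_squared(ys, predicted):
    """Proportion of the variance in ys accounted for by the predictions."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Invented sample; 0.6 and 2.2 are the least-squares slope and intercept
# for exactly these points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predicted = [0.6 * x + 2.2 for x in xs]

r2 = r_squared(ys, predicted)  # SSE = 2.4, SST = 6, so R-Square is about 0.6

# With the slide's R-Square of .715:
unexplained = 1 - 0.715              # share of variance not accounted for
implied_abs_r = math.sqrt(0.715)     # size of the correlation, about .85
```

In simple linear regression with one predictor, R-Square equals the square of the correlation coefficient.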

  15. Using Linear Regression to Develop a Standard Curve for Real Time PCR • Develop a standard curve from samples with known concentration. • Y is the concentration • X is the CP (crossing point) • Both X and Y are known • The relationship between concentration and output is not linear; it is an S-curve. The relationship between log(concentration) and output, however, is linear.

  16. Standard Sample Data

  17. CP and Concentration are highly correlated. • It is a negative correlation. • The correlation is -.996 • Now fit a regression line to these data and get the equation of the line.

  18. Fit Regression Line • The regression is highly significant, as expected • Y-intercept is 8.11 • Slope is -.22 • The line predicts log(concentration), so the equation is: Predicted Concentration = antilog(-.22(CP) + 8.11)
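The standard sample data are not included in the transcript, so the following sketch fits a standard curve to synthetic points that were placed exactly on the slide's line. It only illustrates the mechanics of regressing log10(concentration) on CP. The base-10 logarithm is an assumption, the dilution series is invented, and the function name `fit_line` is hypothetical.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Synthetic tenfold dilution series (not the presenters' data).
concentrations = [1e2, 1e3, 1e4, 1e5, 1e6, 1e7]
log_conc = [math.log10(c) for c in concentrations]

# Synthetic CP values chosen to lie exactly on the slide's line,
# log10(concentration) = 8.11 - .22 * CP.
cps = [(8.11 - lc) / 0.22 for lc in log_conc]

# Regress Y = log10(concentration) on X = CP.
slope, intercept = fit_line(cps, log_conc)
```

Because the synthetic points have no noise, the fit recovers the slope of -.22 and the intercept of 8.11. Real standards would scatter around the line and give a correlation slightly short of -1, such as the -.996 reported in the slides.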

  19. Standard Curve • For an unknown with a CP of 18.48, the predicted log concentration is 8.11 - .22(18.48) = 4.04 • The antilog of 4.04 is 1.10E4, which is the predicted concentration
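The worked example above can be reproduced with a short function. This sketch assumes base-10 logarithms, which is consistent with the antilog of 4.04 being 1.10E4; the function names are hypothetical.

```python
def predicted_log10_concentration(cp, slope=-0.22, intercept=8.11):
    """Predicted log10(concentration) from the standard curve on the slides."""
    return slope * cp + intercept

def predicted_concentration(cp):
    """Antilog (base 10, assumed) of the predicted log concentration."""
    return 10 ** predicted_log10_concentration(cp)

log_pred = predicted_log10_concentration(18.48)  # about 4.0444
conc_unrounded = predicted_concentration(18.48)  # about 1.11E4

# The slide rounds the log to 4.04 before taking the antilog.
conc_slide = 10 ** 4.04                          # about 1.10E4
```

Rounding the log before taking the antilog changes the result by about 1%, so intermediate values are best carried at full precision when the predicted concentration matters.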
