
Statistical Analysis Essentials for Predicting the Future with Correlation and Regression

Learn about Z-Tests, T-Tests, ANOVAs, and more in statistical analysis for predicting outcomes. Understand correlations, linear regression, and interpreting results for informed decisions in various scenarios. Discover the basics of hypothesis testing and error types to ensure accurate statistical interpretations.

deforest

Presentation Transcript


  1. T-Tests and ANOVAs • Jennifer Siegel

  2. Objectives • Statistical background • Z-Test • T-Test • ANOVAs

  3. Predicting the Future from a Sample • Science tries to predict the future • Is an observed effect genuine? • Statistics are used to strengthen predictions • The P-value indicates how surprising our result would be if there were no genuine effect in the whole population (more on this later…)

  4. Normal Distribution

  5. The Basics • Develop an experimental hypothesis • H0 = null hypothesis • H1 = alternative hypothesis • Statistically significant result: P-value below the conventional threshold of .05

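  6. P-Value • Probability of observing a result at least as extreme as ours if the null hypothesis (H0) were true • Conventional significance level = .05 or 5% • P < .05 means a result this extreme would occur less than 5% of the time under H0; it is not a 95% probability that the experimental effect is genuine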

  7. Errors! • Type 1 = false positive • Type 2 = false negative • Significance level (α) = probability of a Type 1 error when H0 is true; confidence level = 1 – α

  8. Research Question Example • Let’s pretend you came up with the following theory… Having a baby increases brain volume (associated with possible structural changes)

  9. Populations versus Samples • Z-test • T-test

  10. Z-Test • Population
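
A minimal sketch of a one-sample z-test, assuming the population mean and standard deviation are known, which is the situation the Z-Test slide refers to. The function name and all numbers are hypothetical and are not taken from the presentation.

```python
import math

def z_test(sample_mean, population_mean, population_sd, n):
    """Return (z, two_sided_p) for a one-sample z-test with known population SD."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-sided p-value from the standard normal distribution:
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical numbers, for illustration only.
z, p = z_test(sample_mean=105.0, population_mean=100.0, population_sd=15.0, n=36)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up values the p-value falls just below the .05 threshold, so H0 would be rejected at the 5% level.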

  11. Some Problems with a Population-Based Study • Cost • Not able to include everyone • Too time consuming • Ethical right to privacy • Realistically, researchers can only do sample-based studies

  12. T-Test • T = difference between sample means / standard error of that difference • Degrees of freedom = sample size – 1 (for a one-sample or paired test)
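
The T formula above can be sketched for a pre/post (paired) design, where the statistic is the mean of the paired differences divided by its standard error. This is an illustration under stated assumptions: `paired_t_statistic` is a hypothetical helper and the measurements are invented, not the brain-volume data from the slides.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired (repeated-measures) t-test."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample SD uses n - 1).
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / standard_error, n - 1

# Hypothetical pre/post measurements, for illustration only.
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 14.0, 10.0, 13.0, 14.0]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f}, df = {df}")
```

The resulting t and degrees of freedom would then be looked up in a t distribution (or passed to a p-value calculator) to obtain the P-value.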

  13. Two-Sample T-Tests: Pre and Post

  14. Hypothesise • H0 = There is no difference in brain size before or after giving birth • H1 = The brain is significantly smaller or significantly larger after giving birth (difference detected)

  15. Absolute Brain Volumes (cm³) • T = (1271 – 1236) / (119 – 113)

  16. Results: p = .003 • Women have a significantly larger brain after giving birth • http://www.danielsoper.com/statcalc/calc08.aspx

  17. Types of T-Tests • One-sample (sample vs. hypothesized mean) • Independent groups (2 separate groups) • Repeated measures (same group, different measure)

  18. More than 1 group???

  19. ANOVA • ANalysis Of VAriance • Factor = what is being compared (type of pregnancy) • Levels = different elements of a factor (age of mother) • F-Statistic • Post hoc testing
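
A minimal sketch of how the F-statistic of a one-way ANOVA can be computed: the variance between group means is divided by the variance within groups. The function name and the three groups are hypothetical; the slides do not give a worked example.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.mean([x for g in groups for x in g])
    group_means = [statistics.mean(g) for g in groups]
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical data: one factor with three levels.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f:.2f}")
```

A large F only says that some group means differ; locating the difference still needs post hoc testing.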

  20. Different types of ANOVA • One-way ANOVA: 1 factor with more than 2 levels • Factorial ANOVA: more than 1 factor • Mixed-design ANOVA: some factors are independent, others are related

  21. What can be concluded from ANOVA • There is a significant difference somewhere between groups • NOT where the difference lies • Finding exactly where the difference lies requires further statistical analysis = post hoc analysis

  22. Conclusions • Z-Tests for populations • T-Tests for samples • ANOVAs compare more than 2 groups in more complicated scenarios

  23. Correlation and Linear Regression • Varun V. Sethi

  24. Objective • Correlation • Linear Regression • Take-Home Points

  25. With a few exceptions, every analysis is a variant of GLM

  26. Correlation: how linear is the relationship between two variables? (descriptive) • Regression: how well does a linear model explain my data? (inferential)

  27. Correlation

  28. Correlation Correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).

  29. Correlation • Strength and direction of the relationship between variables • Scattergrams [Figure: three scattergrams of Y against X showing positive correlation, negative correlation, and no correlation]

  30. Measures of Correlation • 1) Covariance • 2) Pearson Correlation Coefficient (r)

  31. 1) Covariance • The covariance is a statistic representing the degree to which 2 variables vary together • Note that $s_x^2 = \operatorname{cov}(x, x)$

  32. A statistic representing the degree to which 2 variables vary together • Covariance formula • cf. variance formula
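
The formulas themselves did not survive in the transcript. For reference, the standard sample forms are sketched below; the $n-1$ denominator is an assumption (the slide may have used $n$), and $\bar{x}$, $\bar{y}$ denote the sample means.

```latex
\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
\qquad
s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \operatorname{cov}(x, x)
```

Replacing $y$ with $x$ in the covariance formula gives the variance formula, which is the comparison the slide invites.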

  33. 2) Pearson correlation coefficient (r) • r is a kind of ‘normalised’ (dimensionless) covariance: $r = \operatorname{cov}(x, y) / (s_x s_y)$, where s is the standard deviation of the sample • r takes values from -1 (perfect negative correlation) to 1 (perfect positive correlation) • r = 0 means no linear correlation
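
As a rough illustration, r can be computed by dividing the covariance by the product of the two standard deviations. The helper names and data below are hypothetical, and the sample ($n-1$) form of the covariance is assumed.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson correlation: covariance normalised by both standard deviations."""
    # cov(x, x) is the variance of x, so its square root is the standard deviation.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
r_perfect = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"r = {r:.4f}, r for an exact line = {r_perfect:.1f}")
```

Because the units of x and y cancel in the ratio, r is dimensionless, unlike the covariance.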

  34. Pearson – ‘Strength of Linear Relation’ r = 0.816

  35. Limitations: • Sensitive to extreme values • Describes a relationship, not a prediction • Does not imply causality

  36. Linear Regression

  37. Regression: Prediction of one variable from knowledge of one or more other variables

  38. How well does a linear model (y = ax + b) explain the relationship between two variables? • If there is such a relationship, we can ‘predict’ the value of y for a given x. (25, 7.498)

  39. Linear dependence between 2 variables • Two variables are linearly dependent when the increase of one variable is proportional to the increase of the other • Examples: energy needed to boil water; money needed to buy coffeepots [Figure: straight line of y against x]

  40. Fitting data to a straight line (or vice versa) • Model: $\hat{y} = ax + b$ • $\hat{y}$: predicted value of y • a: slope of the regression line • b: intercept • Residual error ($\varepsilon_i$): difference between the observed and predicted values of y, i.e. $\varepsilon_i = y_i - \hat{y}_i$ • The best-fit line (values of a and b) is the one that minimises the sum of squared errors, $SS_{error} = \sum_i (y_i - \hat{y}_i)^2$ [Figure labels: $\hat{y}_i$ = predicted, $y_i$ = observed, $\varepsilon_i$ = residual]

  41. Adjusting the straight line to data: • Minimise $\sum_i (y_i - \hat{y}_i)^2$, which is $\sum_i (y_i - a x_i - b)^2$ • The minimum $SS_{error}$ is at the bottom of the curve, where the gradient is zero, and this can be found with calculus • Take partial derivatives of $\sum_i (y_i - a x_i - b)^2$ with respect to the parameters a and b, set them to 0, and solve as simultaneous equations for a and b • This can always be done
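
The resulting expressions are missing from the transcript. The standard least-squares derivation, written with the same symbols (a = slope, b = intercept), is sketched here; it is the textbook result and not necessarily the exact form shown on the slide.

```latex
\frac{\partial}{\partial a} \sum_i (y_i - a x_i - b)^2 = -2 \sum_i x_i \,(y_i - a x_i - b) = 0
\qquad
\frac{\partial}{\partial b} \sum_i (y_i - a x_i - b)^2 = -2 \sum_i (y_i - a x_i - b) = 0

b = \bar{y} - a \bar{x}
\qquad
a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
  = \frac{\operatorname{cov}(x, y)}{s_x^2}
  = r \, \frac{s_y}{s_x}
```

The second equation gives b once a is known, and substituting it into the first yields the slope. The last equality ties the slope back to the Pearson coefficient r.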

  42. How good is the model? • We can calculate the regression line for any data, but how well does it fit the data? • Total variance = predicted variance + error variance: $s_y^2 = s_{\hat{y}}^2 + s_{er}^2$ • Also, it can be shown that $r^2$ is the proportion of the variance in y that is explained by our regression model: $r^2 = s_{\hat{y}}^2 / s_y^2$ • Insert $s_{\hat{y}}^2 = r^2 s_y^2$ into $s_y^2 = s_{\hat{y}}^2 + s_{er}^2$ and rearrange to get: $s_{er}^2 = s_y^2 (1 - r^2)$ • From this we can see that the greater the correlation, the smaller the error variance, and so the better our prediction
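
The variance decomposition can be checked numerically. The sketch below fits ŷ = ax + b by least squares and then compares total, predicted, and error variance. `fit_line` and `sample_variance` are hypothetical helpers, and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y_hat = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def sample_variance(values):
    """Sample variance with an n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(xs, ys)
predicted = [a * x + b for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

s_y2 = sample_variance(ys)            # total variance
s_pred2 = sample_variance(predicted)  # variance explained by the line
s_err2 = sample_variance(residuals)   # error (residual) variance
r2 = s_pred2 / s_y2                   # proportion of variance explained

print(f"a = {a:.2f}, b = {b:.2f}, r^2 = {r2:.2f}")
print(f"total = {s_y2:.2f}, predicted + error = {s_pred2 + s_err2:.2f}")
```

Here the error variance also equals $s_y^2 (1 - r^2)$, matching the rearranged formula on the slide.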

  43. Is the model significant? • Do we get a significantly better prediction of y from our regression equation than by just predicting the mean? F-statistic
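
The slide's formula for F is not preserved in the transcript. A common form for simple linear regression with one predictor, assumed here, divides the variance explained by the model by the residual variance, with 1 and n - 2 degrees of freedom. The function name and data are hypothetical.

```python
def regression_f(xs, ys):
    """Return (F, df_model, df_error) for a one-predictor least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    # Improvement over predicting the mean for every observation.
    ss_model = sum((a * x + b - mean_y) ** 2 for x in xs)
    # What the line still fails to explain.
    ss_error = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    df_model, df_error = 1, n - 2
    f = (ss_model / df_model) / (ss_error / df_error)
    return f, df_model, df_error

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
f, df_m, df_e = regression_f(xs, ys)
print(f"F({df_m}, {df_e}) = {f:.2f}")
```

For one predictor this is equivalent to $F = r^2 (n - 2) / (1 - r^2)$, so a stronger correlation or a larger sample gives a larger F.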

  44. Practical Uses of Linear Regression • Prediction / forecasting • Quantifying the strength of the relationship between y and each predictor Xj (X1, X2, X3)

  45. General Linear Model • A General Linear Model is just any model that describes the data in terms of a straight line • Linear regression is actually a form of the General Linear Model where the parameters are b, the slope of the line, and a, the intercept: y = bx + a + ε (note that a and b swap roles here compared with ŷ = ax + b in the earlier slides)
