
Outline



Presentation Transcript


  1. Transformations in Statistical Analysis. Outline: assumptions of linear statistical models; types of transformations; alternatives to transformations. Model Assumptions: • Effect additivity • Normality • Homoscedasticity • Independence

  2. Order of Importance. Experimental analysis models (ANOVA): homoscedasticity, normality, additivity, independence. Observational analysis models (regression): additivity, homoscedasticity, normality, independence. All four are so interrelated that which is “most” important may be immaterial!

  3. Independence. When is this important? • Measurements over time on the same individual. • Time series data (rainfall, temperature, etc.). • Repeated measures (split plots in time). • Growth curves. • Measurements near each other in space. • Split plot designs. • Spatial data. How do I know it’s a problem? By design (how the data were collected), and by temporal/spatial autocorrelation analysis (see the sketch below). Rectifying a dependence problem: modify the type of model to be fitted to the data.
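
A minimal R sketch of the autocorrelation check mentioned above, assuming a fitted lm object fm and data recorded in time (or collection) order; these names do not appear on the slides:

  # Sketch: checking residuals for serial dependence
  # (`fm` is a hypothetical fitted lm object; observations assumed in time order)
  r <- resid(fm)
  acf(r, main = "Autocorrelation of residuals")  # spikes outside the bands suggest dependence
  # For spatial data, a variogram or Moran's I from a spatial package plays the same role.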

  4. Homoscedasticity. How do I know I have a problem? Plot predicted (fitted) values versus residuals (see the sketch below). What is the pattern of the spread in the residuals as the predicted values increase? • Spread constant: acceptable. • Spread increases: problem. • Spread decreases then increases: problem. [Schematic residual-versus-fitted plots omitted.]
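
A minimal sketch of this diagnostic in R, again assuming a fitted lm object fm:

  # Sketch: residuals versus fitted values
  plot(fitted(fm), resid(fm),
       xlab = "Predicted (fitted) values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  # A constant band is acceptable; a funnel that widens or narrows signals heteroscedasticity.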

  5. Lack of Homogeneity in Regression. What to do? • Attempt a transformation. • Weighted regression (sketched below). • Incorporate additional covariates. • Non-linear regression. What to do if the residuals plotted versus X show a pattern like those in the (omitted) sketches? You may need another x variable.
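
For the weighted-regression option, a hedged sketch (the weights 1/x^2 assume the residual spread grows roughly in proportion to x; y and x are hypothetical):

  # Sketch: weighted least squares when the spread of the residuals grows with x
  fm_wls <- lm(y ~ x, weights = 1 / x^2)  # the weight choice is an assumption, not a rule
  summary(fm_wls)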

  6. Transforming the Response to Achieve Linearity. If a scatterplot of y versus x curves upward, proceed down the scale of powers (the ladder of transformations: ..., y², y, √y, log y, −1/y, ...) to choose a transformation.

  7. Handling Heterogeneity (flowchart). Is it a regression problem? If no (ANOVA): estimate the group means and test for homoscedasticity; if the test is accepted, OK; if it is rejected, choose a type of transformation (Box/Cox family, power family, or a traditional transformation), transform the observations, and refit. If yes (regression): fit the linear model and plot the residuals; if the plot is OK, stop; otherwise choose a transformation and refit. A Box/Cox sketch follows below.
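
One way to pick a power from the Box/Cox family in R is MASS::boxcox(); a sketch, assuming a fitted lm object fm with a strictly positive response:

  # Sketch: Box-Cox profile likelihood for the transformation parameter lambda
  library(MASS)
  bc <- boxcox(fm, lambda = seq(-2, 2, 0.1))  # plots the log-likelihood against lambda
  lambda_hat <- bc$x[which.max(bc$y)]         # lambda near 0 suggests a log transform
  lambda_hat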

  8. Transformations to Achieve Normality (flowchart). Is it a regression problem? If no (ANOVA): estimate the group means. If yes: fit the linear model. Examine the residuals with a Q-Q plot and formal tests. Are the residuals normal? If yes, OK; if no, transform the data or choose a different model.

  9. Transformations to Achieve Normality. How can we determine if observations are normally distributed? Graphical examination: • Normal quantile-quantile plot (Q-Q plot). • Histogram or boxplot. Goodness-of-fit tests: • Kolmogorov-Smirnov test. • Shapiro-Wilk test. • D’Agostino’s test. (See the sketch below.)
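
A short R sketch of these checks, assuming the residuals of a fitted model fm:

  # Sketch: graphical and formal normality checks on residuals
  r <- resid(fm)
  qqnorm(r); qqline(r)                 # normal quantile-quantile plot
  hist(r)                              # histogram
  shapiro.test(r)                      # Shapiro-Wilk test
  ks.test(r, "pnorm", mean(r), sd(r))  # Kolmogorov-Smirnov (approximate: mean and sd are estimated)
  # D'Agostino's test is available in add-on packages (e.g., agostino.test in the moments package).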

  10. Non-normal! So what? Only very skewed distributions will have a marked effect on the significance level of the F-test for the overall model or for model effects. Often the same transformations used to achieve homoscedasticity will also produce more normal-looking observations (residuals). Transformations to Achieve Model Simplicity. GOAL: to provide as simple a mathematical form as possible for the relationship between the response and the explanatory variables. This may require transforming both response and explanatory variables.

  11. Alternative Models, ordered from low to high complexity: • Regular least squares • Weighted least squares • Non-parametric methods • Generalized linear models • Non-linear regression. (A generalized-linear-model sketch follows below.)
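
As one example from the generalized-linear-model rung of this ladder, a Gamma GLM with a log link handles a positive, multiplicative response without transforming y; a hedged sketch with hypothetical y and x:

  # Sketch: a GLM as an alternative to transforming the response
  fm_glm <- glm(y ~ log(x), family = Gamma(link = "log"))
  summary(fm_glm)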

  12. Example: Predicting brain weight from body weight in mammals via SLR. Data are average brain (Y, g) and body (X, kg) weights for 62 species of mammals; 2 species (tree shrew and red fox) are held out for prediction. Source: Allison & Cicchetti (1976), Science.

Species (common name)   Body weight (kg)   Brain weight (g)
Arctic fox                    3.385             44.500
Owl monkey                    0.480             15.499
Horse                       521.000            655.000
Kangaroo                     35.000             56.000
Human                        62.000           1320.000
African elephant           6654.000           5712.000
Asian elephant             2547.000           4603.000
...
Chimpanzee                   52.160            440.000
Tree shrew (omitted)          0.104              2.500
Red fox (omitted)             4.235             50.400
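
These data are essentially the mammals data shipped with the MASS package (body weight in kg, brain weight in g); a loading sketch under that assumption:

  # Sketch: the body/brain data as distributed in MASS (62 species)
  library(MASS)
  data(mammals)
  x <- mammals$body   # body weight, kg
  y <- mammals$brain  # brain weight, g
  # The slides hold out the tree shrew and red fox, leaving 60 species for fitting.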

  13. A scatterplot of the raw data is non-informative: most species have small weights compared to the elephants. Viewing only the mammals with body weight below 300 kg suggests transforming to a log scale to linearize the relationship (see the plotting sketch below).
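
A plotting sketch, continuing with the x and y vectors defined above:

  # Sketch: raw scatterplot versus log-log scatterplot
  par(mfrow = c(1, 2))
  plot(x, y, xlab = "Body weight (kg)", ylab = "Brain weight (g)")           # dominated by the elephants
  plot(log(x), log(y), xlab = "log body weight", ylab = "log brain weight")  # roughly linear
  par(mfrow = c(1, 1))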

  14. The scatterplot on the log-log scale looks linear. The fitted regression equation is log(brain) = 2.11 + 0.755 log(body). Body weight is a very significant predictor of brain weight (p-value < 0.0001), and R² = 0.922.
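
The fit reported here can be reproduced along these lines (a sketch; because the slides hold out two species, the numbers may differ slightly):

  # Sketch: simple linear regression on the log-log scale
  fm <- lm(log(y) ~ log(x))
  summary(fm)  # slides report log(brain) = 2.11 + 0.755 log(body), R-squared = 0.922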

  15. The residual plot shows no obvious violations of the zero-mean and constant-variance assumptions, and the Q-Q plot indicates that the normality assumption for the residuals is plausible. [Plots omitted; the human and opossum points are labelled in the residual plot.]

  16. Checking for influential observations (R):

> fm <- lm(log(y) ~ log(x))
> influence.measures(fm)
Influence measures of lm(formula = log(y) ~ log(x)) :
     dfb.1.   dfb.lg..    dffit cov.r   cook.d    hat inf
1   0.13501  -8.18e-03  0.14452 1.009 1.04e-02 0.0167
2   0.27274  -1.56e-01  0.27714 0.956 3.71e-02 0.0245      (Owl Monk.)
3  -0.04860   1.62e-02 -0.04876 1.051 1.21e-03 0.0187
...
14 -0.02853   3.42e-02 -0.03775 1.142 7.25e-04 0.0937   *  (Shrew)
...
19  0.00538   1.69e-01  0.18810 1.121 1.79e-02 0.0881   *  (Asian El.)
...
32  0.22151   3.51e-01  0.53207 0.788 1.24e-01 0.0295   *  (Human)
33  0.00130  -5.11e-02 -0.05538 1.164 1.56e-03 0.1110   *  (African El.)
34 -0.31147   1.54e-02 -0.33480 0.846 5.11e-02 0.0167   *  (Opossum)
35  0.27033   5.36e-02  0.32472 0.861 4.85e-02 0.0171   *  (Rhesus Monk.)
...
40 -0.00740   8.39e-03 -0.00945 1.124 4.55e-05 0.0786   *  (Brown Bat)
...
60 -0.00799   2.27e-03 -0.00806 1.054 3.31e-05 0.0181

In MTB: Stat > Regression > Regression > Storage.
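
As a short follow-up sketch, influence.measures() also returns a logical matrix flagging the starred cases, which can be summarized directly:

  # Sketch: summarizing the flagged (potentially influential) observations
  infl <- influence.measures(fm)
  summary(infl)                      # prints only the cases flagged by at least one measure
  which(apply(infl$is.inf, 1, any))  # row numbers of the flagged observations
  cooks.distance(fm)[32]             # e.g. Cook's distance for observation 32 (the human)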

  17. Decision: leave out man (he doesn’t really fit in with the rest of the mammals) and re-run the analysis.

Feature          Full Model   Omit Human
Intercept           2.111        2.090
Slope               0.755        0.745
SE(Slope)           0.029        0.027
R²                  0.922        0.929
Slope p-value     < 0.0001     < 0.0001

Even though the results don’t change much, we will go with this last model: log(brain) = 2.090 + 0.745 log(body).

  18. Predicting the brain weights of the omitted mammals (R):

> xh <- x[-32]; yh <- y[-32]                 # drop the human (observation 32)
> fmh <- lm(log(yh) ~ log(xh))
> new <- data.frame(xh = c(0.104, 4.235))    # tree shrew and red fox body weights (kg)
> predict(fmh, newdata = new, interval = "prediction")
        fit        lwr      upr
1 0.4038624 -0.9269029 1.734628
2 3.1660753  1.8499283 4.482222
> exp(predict(fmh, newdata = new, interval = "prediction"))   # exponentiate final results!
        fit       lwr       upr
1  1.497598 0.3957776  5.666817
2 23.714231 6.3593633 88.430985

Mammal        Predicted Brain Wt   Prediction Interval   Actual Brain Wt
Tree shrew          1.498            (0.396, 5.667)            2.500
Red fox            23.714            (6.359, 88.431)          50.400

This illustrates the idea of cross-validation in regression. It is often recommended that the data be split into two (roughly equal) portions: use one for model fitting and the other for model checking/verification.
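
A hedged sketch of the split-sample idea mentioned above (a random half for fitting, the other half for checking; all names are illustrative):

  # Sketch: random split-sample validation on the log scale
  set.seed(1)
  dat   <- data.frame(lx = log(x), ly = log(y))
  train <- sample(nrow(dat), floor(nrow(dat) / 2))
  fm_tr <- lm(ly ~ lx, data = dat[train, ])
  pred  <- predict(fm_tr, newdata = dat[-train, ])
  mean((dat$ly[-train] - pred)^2)    # hold-out mean squared error on the log scale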

  19. Predicting the brain weights of the omitted mammals (MTB). [Minitab dialog screenshot omitted; influence measures can be selected there.]

  20. MTB output (with man):

The regression equation is
lbrain = 2.11 + 0.755 lbody

Predictor     Coef    SE Coef      T      P
Constant   2.11091    0.09860  21.41  0.000
lbody      0.75467    0.02889  26.12  0.000

S = 0.696924   R-Sq = 92.2%   R-Sq(adj) = 92.0%

Analysis of Variance
Source           DF      SS      MS       F      P
Regression        1  331.35  331.35  682.21  0.000
Residual Error   58   28.17    0.49
Total            59  359.52

Unusual Observations
Obs  lbody  lbrain     Fit  SE Fit  Residual  St Resid
 32   4.13  7.1854  5.2255  0.1197    1.9599     2.85R
 33   8.80  8.6503  8.7542  0.2322   -0.1039    -0.16 X
 34   1.25  1.3610  3.0563  0.0901   -1.6954    -2.45R
 35   1.92  5.1874  3.5575  0.0912    1.6298     2.36R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations
New Obs     Fit  SE Fit            95% CI              95% PI
      1  0.4028  0.1388  (0.1249, 0.6807)  (-1.0196, 1.8253)
      2  3.2002  0.0900  (3.0201, 3.3803)  ( 1.7936, 4.6068)

Only available influence measures are: standardized/studentized residuals, the hat matrix, Cook’s distance, and DFFITS.
