370 likes | 520 Views
BIOL 582. Lecture Set 15 Model Building Transformations redux “Non-parametric” Methods. It is worth revisiting the concept of “statistical power” Recall statistical power is 1 – β, where β is the probability of a type II error β is not something that can be measured, directly
E N D
BIOL 582 Lecture Set 15 Model Building Transformations redux “Non-parametric” Methods
It is worth revisiting the concept of “statistical power” • Recall statistical power is 1 – β, where β is the probability of a type II error • β is not something that can be measured, directly • However, when one is designing an experiment, maintaining sufficient statistical power is important • We have experimental design courses at WKU (STAT 402, AGRI 590), where this concept should be explored in more detail • Chapter 12 of Biometry (Sokal and Rohlf 2011) is dedicated to this topic • The concept involves doing a pre-experiment to understand variability among subjects, selecting the effect size (e.g., MSM/MSE) one wishes to observe, and from this information, estimating sample sizes that would provide enough statistical power to observe a desired effect size
The general rule of thumb is that as sample sizes increase, βdecreases. • But “too much” data can cause a problem with model selection • Basically, if β is small for even small effect sizes (because there are many subjects), model selection methods will favor more complex models • This can become a problem in the world of predictive modeling (forecasting), especially if decision-making is based on data-acquisition. • Also, “too little” data can cause a problem with model selection • Basically, if β is large for even small effect sizes (because there are few subjects), model selection methods will favor simpler models
Cross-validation is useful in cases where too much or too little data might cause model selection problems • One might trust a “robust” model (one that produces consistent error when applied to independent data sets) over less robust models. • But what about model selection/comparison when violations of linear model assumptions are apparent? • One is struck with two concurrent problems • Finding a model which produces normal, homoscedastic error • Finding a model that uses appropriate parameters • Changing one might cause a change in the other! • This lecture deals with simultaneous model and data “behavior” considerations, and what happens when one needs to give up and try a different (but perhaps complimentary) approach
Recall the intent with data transformations • The intent with data transformations is to find a new variable which is a function of the desired variable and which is suitable for linear models (i.e., meets assumptions). • Let y be a variable and z be a function of y • If z is a variable that can be used with a linear model, then it is a suitable transformation of y, if and only if, • is solvable
Recall also that both dependent and independent variables can be transformed! • One can consider the transformation of independent variables as different candidate models • E.g., with the snake data > snake<-read.csv("snake.data.csv") > attach(snake) > Sex<-as.factor(Sex) > > lm.1<-lm(HS ~ Sex) > lm.2<-lm(HS ~ SVL) > lm.3<-lm(HS ~ Sex + SVL) > lm.4<-lm(HS ~ log(SVL)) > lm.5<-lm(HS ~ Sex + log(SVL)) > lm.6<-lm(HS ~ Sex *SVL) > lm.7<-lm(HS ~ Sex*log(SVL)) > > AIC(lm.1, lm.2, lm.3, lm.4, lm.5, lm.6, lm.7) df AIC lm.1 3 222.3051 lm.2 3 192.2506 lm.3 4 193.2802 lm.4 3 190.6777 lm.5 4 192.2606 lm.6 5 195.0234 lm.7 5 193.9147 Log-transforming SVL produces less error than not transforming it. But notice that this does not evaluate LM assumptions
E.g., with the snake data (still need to look at residuals) • But here is an important question: when should one do residual plots/Shapiro-Wilk Test? • Before model selection • After model selection? • Consider this: > par(mfrow=c(2,4)) > plot(lm.7) # the most comprehensive model (top row) > plot(lm.4) # the most parsimonious model (bottom row)
E.g., with the snake data (still need to look at residuals) • But here is an important question: when should one do residual plots/Shapiro-Wilk Test? • Before model selection • After model selection? • Consider this: • Model Selection for nested models means removing parameters from the most comprehensive model. These parameters are deemed to be unessential, because they do not change the error much. Thus, if the residuals pass the assumptions “tests” for the comprehensive model, they will likely pass for the simpler model too. Point: test residuals first then do model selection (but look at residuals after too!) > shapiro.test(resid(lm.7)); shapiro.test(resid(lm.4)) Shapiro-Wilk normality test data: resid(lm.7) W = 0.9611, p-value = 0.1826 Shapiro-Wilk normality test data: resid(lm.4) W = 0.9488, p-value = 0.06919
E.g., with the snake data (still need to look at residuals) • ONE CANNOT DO THE FOLLOWING!!!! • If one does this in R, a result is returned. • All R does is calculate penalized log-likelihoods for various models • The problem is that the model is the same; the response data are not the same • Model selection typically means choosing the best model for a GIVEN set of response data • (This is also a reason why cross-validation is often not considered a model-selection method) > lm.a<-lm(HS ~ Sex *SVL) > lm.b<-lm(log(HS) ~ Sex *SVL) > > AIC(lm.a, lm.b) df AIC lm.a 5 195.02338 lm.b 5 16.91903
More often is the case that dependent variables need to be transformed • A good rule of thumb is to evaluate transformations of dependent variables with the most complex models to be compared (those with many independent variables) • Are assumptions violated with complex models? They will probably be violated with simpler models too. Try a new transformation. • Are assumptions met with complex models? If yes, then do model selection and check again after. • But what about choosing the right transformation? • Data mining is not desirable, but sometimes needed. • The following is a key to help
Step 1 • Do the data represent proportions (especially fixed at certain values)? • Try arcsine transformation • Data are certainly not on a range of 0 -1 and are not proportions. • Go to step 2 • Arcsine transformation example (data from Sokal and Rohlf 2011: Biometry) • Percent fertility of eggs of Drosophilia melanogaster raised in vials of 10 eggs each
Arcsine transformation example (data from Sokal and Rohlf 2011: Biometry) • Percent fertility of eggs of Drosophilia melanogaster raised in vials of 10 eggs each > y<-c(0, + 10,10,10, + 20,20,20,20,20,20,20,20, + 30,30,30,30,30,30,30,30,30,30, + 40,40,40,40,40,40, + 50,50,50,50,50,50,50,50,50,50,50,50,50,50,50, + 60,60,60,60,60,60,60,60,60,60,60,60,60,60, + 70,70,70,70,70,70,70,70,70,70,70,70, + 80,80,80,80,80,80,80,80,80,80,80,80,80, + 90,90,90,90,90,90,90,90,90, + 100,100,100,100,100,100,100,100,100) > > length(y) [1] 100 > p<-y/100 > > p [1] 0.0 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 [26] 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6 [51] 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 [76] 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Arcsine transformation example (data from Sokal and Rohlf 2011: Biometry) • Percent fertility of eggs of Drosophilia melanogaster raised in vials of 10 eggs each • Note: in this example, the transformation just made the data less platykurtic but did not really normalize them • Less egregious erring, but not perfect! > z<-asin(sqrt(p))*180/pi # reported in degrees, not radians > z [1] 0.00000 18.43495 18.43495 18.43495 26.56505 26.56505 26.56505 26.56505 26.56505 26.56505 [11] 26.56505 26.56505 33.21091 33.21091 33.21091 33.21091 33.21091 33.21091 33.21091 33.21091 [21] 33.21091 33.21091 39.23152 39.23152 39.23152 39.23152 39.23152 39.23152 45.00000 45.00000 [31] 45.00000 45.00000 45.00000 45.00000 45.00000 45.00000 45.00000 45.00000 45.00000 45.00000 [41] 45.00000 45.00000 45.00000 50.76848 50.76848 50.76848 50.76848 50.76848 50.76848 50.76848 [51] 50.76848 50.76848 50.76848 50.76848 50.76848 50.76848 50.76848 56.78909 56.78909 56.78909 [61] 56.78909 56.78909 56.78909 56.78909 56.78909 56.78909 56.78909 56.78909 56.78909 63.43495 [71] 63.43495 63.43495 63.43495 63.43495 63.43495 63.43495 63.43495 63.43495 63.43495 63.43495 [81] 63.43495 63.43495 71.56505 71.56505 71.56505 71.56505 71.56505 71.56505 71.56505 71.56505 [91] 71.56505 90.00000 90.00000 90.00000 90.00000 90.00000 90.00000 90.00000 90.00000 90.00000 >
Step 1 • Do the data represent proportions (especially fixed at certain values)? • Try arcsine transformation • Data are certainly not on a range of 0 -1 and are not proportions. • Go to step 2 • Step 2 • Not sure which transformation is best? • Try Box-Cox Transformation • Tried Box-Cox and it did not help, or the data are categorical, or I am at a loss • Go to step 3
Two people named Box and Cox decided to work together, and stats history will never be the same after 1964 • for all λ ≠ 0 • for λ = 0 • Note that when λ = 1, it is a linear transformation (subtract 1 from y values, which means do not change them); when λ = 0, a log-transformation is in order; when λ = ½, z is a square root transformation, because • and the other parts to the equation are just linear transformations, which only scale the variable • When λ = -1, • Which is called a reciprocal transformation (useful for data that describe rates)
Two people named Box and Cox decided to work together, and stats history will never be the same after 1964 • The previous illustration does not mean that this is how data should be necessarily transformed, but it means that many transformations have a general formula • And if that is the case, then the parameter of the formula,λ, is something that can be optimized • It is optimized in the following log-likelihood function: • Which is analogously for model error • Where the df are n – k – 1
The best way to understand Box-Cox transformations is to do them (log-likelihood profile plot) > pupfish<-read.csv("pupfish.parasite.csv") > > attach(pupfish) > library(MASS) > L<-boxcox(GRUBS~SEX*POPULATION*SL) > > L.max<-max(L$y) > lambda<-L$x[which(L$y==L.max)] > > L.max [1] -654.9889 > lambda [1] 0.2222222 > Height of curve is optimized λ (called )
The best way to understand Box-Cox transformations is to do them (perform the transformation and compare to closest standard transformation) • In this case, 0.22222 is close to 0, which suggests a log-transformation > log.grubs<-log(GRUBS+1) > bct.grubs<-(GRUBS^lambda-1)/lambda > pairs(cbind(GRUBS,log.grubs,bct.grubs)) # the log- and box-cox-transformations # are similar
The best way to understand Box-Cox transformations is to do them (perform the transformation and compare to closest standard transformation) • In this case, 0.22222 is close to 0, which suggests a log-transformation > par(mfrow=c(2,4)) # log-transform in top row; box-cox transform in bottom row > plot(lm(log.grubs~SEX*POPULATION*SL)) > plot(lm(bct.grubs~SEX*POPULATION*SL))
Two people named Box and Cox decided to work together, and stats history will never be the same after 1964 • Box-cox transformed data can be back-transformed > z<-bct.grubs > y<-(z*lambda+1)^(1/lambda) > > GRUBS [1] 67 62 164 85 78 41 14 37 216 88 125 194 35 61 487 106 99 71 114 215 13 24 102 [24] 154 90 91 71 166 98 86 21 19 50 84 11 95 42 10 18 114 142 287 100 56 168 420 [47] 189 31 654 886 414 213 572 874 374 332 281 561 407 132 230 112 178 125 84 235 349 156 123 [70] 319 241 72 313 80 387 115 296 198 274 25 67 221 40 23 17 12 18 5 21 8 4 8 [93] 2 314 38 5 2 3 13 15 14 14 153 > y [1] 67 62 164 85 78 41 14 37 216 88 125 194 35 61 487 106 99 71 114 215 13 24 102 [24] 154 90 91 71 166 98 86 21 19 50 84 11 95 42 10 18 114 142 287 100 56 168 420 [47] 189 31 654 886 414 213 572 874 374 332 281 561 407 132 230 112 178 125 84 235 349 156 123 [70] 319 241 72 313 80 387 115 296 198 274 25 67 221 40 23 17 12 18 5 21 8 4 8 [93] 2 314 38 5 2 3 13 15 14 14 153
Two people named Box and Cox decided to work together, and stats history will never be the same after 1964 • The Box-cox transformation – sometimes called the Power Transformation – is like a universal fix-it tool! • One problem though, while it is easy to understand the log, square root, or the square of a variable in a plot, it is not intuitive to work with power-transformed data.
The Box-Cox is so great, it is worth another example… • These data are made up…. • Something like what would be reported by a fisheries biologist, electrofishing in lakes
The Box-Cox is so great, it is worth another example… • These data are made up…. • Something like what would be reported by a fisheries biologist, electrofishing in lakes > # to create appropriate data, the midpoint of > # each size class was chosen > > lake.A<-c(rep(2,33),rep(6,45),rep(10,19),rep(14,3)) > lake.B<-c(rep(6,20),rep(10,32),rep(14,28),rep(18,13),rep(22,7)) > > y<-c(lake.A,lake.B) > x<-as.factor(c(rep("A",100),rep("B",100))) > > library(MASS) > ldahist(y,x)
The Box-Cox is so great, it is worth another example… • These data are made up…. > lm.raw<-lm(y~x) > par(mfrow=c(1,4)) > plot(lm.raw) > shapiro.test(resid(lm.raw)) Shapiro-Wilknormality test data: resid(lm.raw) W = 0.9448, p-value = 6.157e-07
The Box-Cox is so great, it is worth another example… • These data are made up…. > L<-boxcox(y~x) > L.max<-max(L$y) > lambda<-L$x[which(L$y==L.max)] > > L.max [1] -788.7198 > lambda [1] 0.4646465 > > # Most similar to a square-root transformation
The Box-Cox is so great, it is worth another example… • These data are made up…. > z<-(y^lambda-1)/lambda > ldahist(y,x) > ldahist(z,x) > > # Before/After transformation > # shown in left/right panels
The Box-Cox is so great, it is worth another example… • These data are made up…. > par(mfrow=c(2,4)) # raw data in top row; box-cox transform in bottom row > plot(lm(y~x)) > plot(lm(z~x)) > shapiro.test(resid(lm(z~x))) Shapiro-Wilk normality test data: resid(lm(z ~ x)) W = 0.9552, p-value = 6.227e-06 > > # No improvement?
The Box-Cox is so great, it is worth another example… • These data are made up…. > # Before transformation > var(lake.A);var(lake.B) [1] 10.24 [1] 21.45455 > mean(lake.A); mean(lake.B) [1] 5.68 [1] 12.2 > > # After transformation > var(z[1:100]);var(z[101:200]) [1] 1.704771 [1] 1.493079 > mean(z[1:100]);mean(z[101:200]) [1] 2.466642 [1] 4.604127 > > par(mfrow=c(1,2)) > boxplot(y~x,ylab="size") > boxplot(z~x,ylab="z") While the transformation did not help normalize the data, they helped produce variance independent of the mean within groups, plus equal variances among groups
The Box-Cox is so great, it is worth another example… • These data are made up…. > anova(lm(z~x)) Analysis of Variance Table Response: z Df Sum Sq Mean Sq F value Pr(>F) x 1 228.44 228.442 142.87 < 2.2e-16 *** Residuals 198 316.59 1.599 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > summary(lm(z~x)) Call: lm(formula = z ~ x) Residuals: Min 1Q Median 3Q Max -1.8081 -1.6489 0.3293 0.5791 2.7165 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.4666 0.1264 19.51 <2e-16 *** xB 2.1375 0.1788 11.95 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.264 on 198 degrees of freedom Multiple R-squared: 0.4191, Adjusted R-squared: 0.4162 F-statistic: 142.9 on 1 and 198 DF, p-value: < 2.2e-16
Step 1 • Do the data represent proportions (especially fixed at certain values)? • Try arcsine transformation • Data are certainly not on a range of 0 -1 and are not proportions. • Go to step 2 • Step 2 • Not sure which transformation is best? • Try Box-Cox Transformation • Tried Box-Cox and it did not help, or the data are categorical, or I am at a loss • Go to step 3 • Step 3 • My data are frequency data or categorical or ordinal • Try Generalized linear model with appropriate link function, or Contingency Table Analysis, or distribution-free, non-parametric methods • I give up. I have data that I cannot transform to any semblance of normality or homoscedasticity. • Try Permutation procedures of distribution-free, non-parametric methods
The word “non-parametric” has both a literal and figurative meaning • The literal meaning is that any method that does not depend on theoretical distributions, which vary by changing a specific parameter, is a non-parametric method • For example, randomization tests are non-parametric methods • Non-parametric also often means “distribution-free” as there is no specific distribution forced on a test-statistic • The figurative meaning of “non-parametric” is the use of ordered data, instead of original data, to use a class of statistical methods based on the ranks of values • These methods are usually used when parametric assumptions break-down (and then only if the violations of assumptions are large) • One must be aware that switching to these methods means testing different hypotheses! • These methods are often called “tests of location” because they consider the location of values ranked from large to small (or vice versa)
The basis for most non-parametric analogs is to convert data to ranks (ordered or ordinal data) • The distribution example below illustrates this concept, but also illustrates how ordinal data have different meaning • Hypothesis tests based on ranks are often called “Tests of Location” because they report that average location of groups in a distribution • They are also called “Tests of Independence” because rejection of the null hypothesis suggests that groups have different (mainly non-overlapping) distributions
The basic concept of hypothesis tests is the same as parametric tests. For example, one might measure the sums of squares of ranks among groups to ascertain if groups have different mean ranks (not mean values for the original variable). • However, because using ranks simplifies some expectations, non-parametric methods tend to have formulas that reflect algebraic solutions. • For example, if two groups of five values are compared, the expected average rank of each group if a null hypothesis were true would be • If two groups of 6 values are compared, then the expected average rank of each group if a null hypothesis were true would be • No matter how you try to out-think this problem, the expected average group rank for any N values divided equally among g groups (same group size, n) will always be (N+1)/2 • Thus, the overall mean rank will ALWAYS be (N+1)/2. So, determining things like the sums of squares among group means involves subtracting (N+1)/2. This leads to some simpler formulas. It should be noted that these methods gained a lot of popularity when statistics was still done mostly by hand!
The formulas for these tests are pretty straightforward BUT differences in how much to simplify them and differences in nomenclature make them seem (at first) pretty cumbersome. • For example, here is the formula for the Kruskal-Wallis test stat from one source: • Which “simplifies” to this • The Vassar Stats nomenclature is like this • where and where T is the total sum of ranks within groups (g) or for all Note: N(N+1)/12 is ALWAYS the squared error of ranks. Thus H is analogous to a ratio of SSB/SSE
Here is Mike’s personal take on these tests. • They are fundamentally no different than parametric tests, other than • Working with ordinal data • The test statistic distribution used to get a P-value • It is silly to understand the formulas, especially if you understand ANOVA and t-stats • Performing the test is very much like following a recipe • Convert your data to ranks • Find average ranks • Find dispersion in ranks • Calculate test stat • Compare to a distribution to get a P-value • If you need to use one of these tests, get a cook-book (text or web resource) and start cooking! • But these are not inferior methods!!! Some people use these as a last resort. That is unfortunate. Sometimes they have more power for specific situations.
Example of an appropriate time to use ordinal data: A researcher studies male-male competition of male swordtails in the presence of females. • She always randomly selects two males to compete in each “bout” • Here is an hypothetical outcome of several bouts • y = Tail length / body length • Examination of the data reveals that longer tails fish tend to win the bouts, but also long tailed fish tended to be winners. The one time a short-tailed fish won made the mean relative tail lengths about equal But if one compares ranks (below)….. • Note: this is really a paired design, but notice that using paired differences either way will give the same answer!
To learn more about non-parametric procedures, go to Vassar Stats. • Below is a general guide to use for parametric/non-parametric analogs