Data Transformation
Objectives:
• Understand why we often need to transform our data
• The three commonly used data transformation techniques
• Additive effects and multiplicative effects
• Application of data transformation in ANOVA and regression
Why Data Transformation?
• The assumptions of most parametric methods:
  • Homoscedasticity
  • Normality
  • Additivity
  • Linearity
• Data transformation is used to make your data conform to the assumptions of the statistical methods
• Illustrative examples
Homoscedasticity and Normality
The data deviate from both homoscedasticity and normality.
Homoscedasticity and Normality
Wouldn't it be nice if we could make the data look this way?
Types of Data Transformation
• The logarithmic transformation
• The square-root transformation
• The arcsine transformation
• Data transformation can be done conveniently in EXCEL.
• Alternatives: Ranks and nonparametric methods.
Homoscedasticity
• The two groups of data seem to differ greatly in means, but a t-test shows that the means do not differ significantly from each other - a surprising result.
• The two groups of data differ greatly in variance, and both deviate significantly from normality. These results invalidate the t-test.
• We calculate two ratios: the var/mean ratio and the Std/mean ratio (i.e., coefficient of variation).

            Group1    Group2
Var/mean    56.420    416.891
C.V.        1.230     1.230

• Log-transformation
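The two ratios in the table can be computed with a few lines of code. The sketch below is only an illustration: the function name `dispersion_ratios` and both data sets are hypothetical (the deck's raw data are not reproduced here). The second group is the first one multiplied by a constant, which gives the same pattern as the table: the var/mean ratio changes while the C.V. stays the same, the situation in which the slide reaches for the log-transformation.

```python
import statistics

def dispersion_ratios(values):
    """Return (variance/mean, coefficient of variation) for one group."""
    mean = statistics.mean(values)
    var = statistics.variance(values)  # sample variance, n - 1 denominator
    return var / mean, var ** 0.5 / mean

# Hypothetical data (not the deck's data): group2 is group1 scaled by 5,
# so the standard deviation grows in proportion to the mean.
group1 = [2.0, 5.0, 9.0, 14.0, 40.0]
group2 = [5 * x for x in group1]

vm1, cv1 = dispersion_ratios(group1)
vm2, cv2 = dispersion_ratios(group2)
print(f"Var/mean: {vm1:.3f} vs {vm2:.3f}")
print(f"C.V.:     {cv1:.3f} vs {cv2:.3f}")
```

Because scaling multiplies the variance by 25 and the mean by 5, the var/mean ratio differs by a factor of 5 between the groups while the C.V. is unchanged.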
Log-Transformed Data
NewX = ln(X+1)
• The transformation is successful because:
  • The variance is now similar
  • Deviation from normality is now nonsignificant
  • The t-test revealed a highly significant difference in means between the two groups
1.31 2.13
Log-Transformed Data
Transform back: since NewX = ln(X+1), X = exp(NewX) - 1
Compare this back-transformed mean with the original mean. Which one is preferable?
Calculate the standard error, the degrees of freedom, and the 95% CL (t0.025,16 = 2.12).
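The back-transformation exercise can be sketched as follows. This is a minimal illustration under stated assumptions, not the deck's own calculation: the function name, the nine data values, and the single-group setting (8 degrees of freedom, two-sided 95% critical value 2.306 taken from a t table) are all hypothetical. The mean and the confidence limits are computed on the ln(X+1) scale and only then converted back with exp(y) - 1.

```python
import math
import statistics

def back_transformed_mean_and_cl(values, t_crit):
    """Mean and confidence limits computed on ln(X+1), returned on the original scale."""
    logged = [math.log(x + 1) for x in values]
    n = len(logged)
    mean_log = statistics.mean(logged)
    se_log = statistics.stdev(logged) / math.sqrt(n)  # standard error on the log scale

    def back(y):
        # Inverse of NewX = ln(X + 1)
        return math.exp(y) - 1

    lower = back(mean_log - t_crit * se_log)
    upper = back(mean_log + t_crit * se_log)
    return back(mean_log), lower, upper

# Hypothetical right-skewed sample with n = 9, so df = 8.
sample = [1, 3, 4, 7, 9, 15, 22, 48, 120]
T_CRIT_DF8 = 2.306  # two-sided 95% critical value of t for 8 df (table value)

mean_bt, lower, upper = back_transformed_mean_and_cl(sample, T_CRIT_DF8)
print(f"arithmetic mean:       {statistics.mean(sample):.2f}")
print(f"back-transformed mean: {mean_bt:.2f}")
print(f"95% CL:                {lower:.2f} to {upper:.2f}")
```

The back-transformed mean is never larger than the arithmetic mean, and the limits are asymmetric around it, with the upper limit farther away than the lower one.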
Normal but Heteroscedastic
Any transformation that you apply to such data is likely to disturb normality. Fortunately, the t-test and ANOVA are quite robust for this kind of data. Of course, you can also use nonparametric tests.
Normal but Heteroscedastic
The two variances are significantly different. The t-test, however, detects a significant difference in means. You can use nonparametric methods to analyse the data for comparison, and you are likely to find the t-test to be more powerful.
Additivity
• What experimental design is this?
• Compare the group means. Is there an interaction effect?
Additivity means that the difference between levels of one factor is consistent across the levels of another factor.
Multiplicative Effects
• Compare the group means. Is there an interaction effect?
• Does this data set meet the assumption of additivity?
• When the assumption of additivity is not met, we have difficulty in interpreting main effects.
• Now calculate the ratio of group means. What did you find?
Multiplicative Effects
For Factor A, we see that Level 2 has a mean about 2.88 times as large as that for Level 1. For Factor B, Level 2 has a mean about 2.18 times as large as that for Level 1. If you know the value for Level 1 of Factor A, you can obtain the value for Level 2 of Factor A by multiplying the known value by 2.88. Similarly, you can do the same for Factor B. We say that the effects of Factors A and B are multiplicative, not additive.
Log-transformation
Original Data
Mean                    Variance
37.262     108.458      2102.351     17878.648
82.403     234.508      12400.091    80241.944
Transformed data
Mean                    Variance
3.084      4.127        1.302        1.268
3.778      4.803        1.235        1.385
Now log-transform the data. Compare the means. Is the assumption of additivity met now?
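The comparison asked for on this slide can be checked numerically from the cell means alone. The sketch below copies the four original means and the four log-scale means from the slide; the layout is an assumption (rows taken as the two levels of Factor B, columns as the two levels of Factor A), and the helper names are hypothetical.

```python
# Cell means copied from the slide. Assumption: rows are the two levels of
# Factor B and columns are the two levels of Factor A.
original_means = [[37.262, 108.458],
                  [82.403, 234.508]]
log_means = [[3.084, 4.127],
             [3.778, 4.803]]

def column_differences(cells):
    """Level 2 minus Level 1 of Factor A, within each level of Factor B."""
    return [row[1] - row[0] for row in cells]

def column_ratios(cells):
    """Level 2 divided by Level 1 of Factor A, within each level of Factor B."""
    return [row[1] / row[0] for row in cells]

orig_diff = column_differences(original_means)
orig_ratio = column_ratios(original_means)
log_diff = column_differences(log_means)

print("Original scale, differences:", [round(d, 3) for d in orig_diff])
print("Original scale, ratios:     ", [round(r, 3) for r in orig_ratio])
print("Log scale, differences:     ", [round(d, 3) for d in log_diff])
```

On the original scale the two differences disagree strongly (roughly 71 versus 152) while the two ratios nearly agree, which is the multiplicative pattern. On the log scale the two differences nearly agree, which is what additivity requires.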
Why can log-transformation change multiplicative effects into additive effects?
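One way to see the answer is to write down a simple multiplicative model and take logarithms. The symbols below are introduced here only for illustration and do not appear in the deck: \mu is a baseline, a_i is the multiplier for level i of Factor A, and b_j is the multiplier for level j of Factor B.

```latex
\begin{align*}
Y_{ij} &= \mu \, a_i \, b_j \\
\ln Y_{ij} &= \ln \mu + \ln a_i + \ln b_j \\
\ln Y_{2j} - \ln Y_{1j} &= \ln a_2 - \ln a_1 = \ln\!\left(\frac{a_2}{a_1}\right)
\end{align*}
```

The last line does not depend on j, so on the log scale the difference between the levels of Factor A is the same at every level of Factor B, which is the definition of additivity. For example, a constant ratio of 2.88 becomes a constant shift of ln 2.88, about 1.06.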
Square-Root Transformation
• The two groups of data differ greatly in variance.
• Calculate two ratios: the var/mean ratio and the Std/mean ratio (i.e., coefficient of variation).
• Does your calculation suggest log-transformation? When is log-transformation appropriate?
• Use square-root transformation when different groups have similar Variance/Mean ratios.
Notice the means, which do not coincide with the most frequent observations.
Square-Root Transformation
After the square-root transformation, transform the means back to the original scale and compare these means with the original means: 1.17 2.09
The variance is now almost identical between the two groups.
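A minimal sketch of the transform-and-back step is given below. The offset of 3/8 is taken from the SAS statement later in the deck; whether this slide used the same offset is an assumption, and the function names and count data are hypothetical.

```python
import math
import statistics

def sqrt_transform(values, offset=3 / 8):
    """Square-root transformation with a small offset (assumed to be 3/8)."""
    return [math.sqrt(x + offset) for x in values]

def back_transform_mean(mean_sqrt, offset=3 / 8):
    """Convert a mean computed on the square-root scale back to the original scale."""
    return mean_sqrt ** 2 - offset

# Hypothetical count data for two groups (not the deck's data).
group1 = [0, 1, 1, 2, 2, 3, 4, 5]
group2 = [4, 6, 7, 8, 9, 10, 12, 16]

for name, group in (("Group1", group1), ("Group2", group2)):
    transformed = sqrt_transform(group)
    mean_bt = back_transform_mean(statistics.mean(transformed))
    print(f"{name}: original mean {statistics.mean(group):.3f}, "
          f"back-transformed mean {mean_bt:.3f}, "
          f"variance before {statistics.variance(group):.3f}, "
          f"after {statistics.variance(transformed):.3f}")
```

The back-transformed mean is never larger than the arithmetic mean, because squaring the average of square roots gives at most the average of the original values.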
Quiz on Data Transformation The data set is right-skewed for each group. Calculate the variance/mean ratio and C.V. for each group, and decide what transformation you should use. Do the transformation and convert the means back to the original scale.
With Multiple Groups When you have multiple groups, a “Variance vs Mean” or a “Std vs Mean” plot can help you to decide which data transformation to use. The graph on the left shows that the Var/Mean ratio is almost constant. What transformation should you use?
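When a plot is not at hand, the same decision can be roughed out numerically. The rule below is a hypothetical heuristic, not a method from the deck: it computes the Var/Mean ratio and the Std/Mean ratio for every group and reports which of the two is more nearly constant across groups. All names and data are invented for illustration.

```python
import statistics

def relative_spread(ratios):
    """Standard deviation of the ratios divided by their mean (0 = perfectly constant)."""
    return statistics.pstdev(ratios) / statistics.mean(ratios)

def suggest_transformation(groups):
    """Heuristic: constant Var/Mean suggests square-root, constant Std/Mean suggests log."""
    var_mean, std_mean = [], []
    for group in groups:
        mean = statistics.mean(group)
        var = statistics.variance(group)
        var_mean.append(var / mean)
        std_mean.append(var ** 0.5 / mean)
    if relative_spread(var_mean) < relative_spread(std_mean):
        return "square-root"
    return "logarithmic"

# Hypothetical groups whose variance is proportional to the mean.
count_like = [[0, 2, 4, 6, 8], [3, 6, 9, 12, 15], [8, 12, 16, 20, 24]]
# Hypothetical groups whose standard deviation is proportional to the mean.
base = [1, 2, 4, 9]
scaled = [[k * x for x in base] for k in (1, 3, 10)]

print(suggest_transformation(count_like))
print(suggest_transformation(scaled))
```

With only a few groups the two spreads can be close, so a rule like this is best treated as a supplement to looking at the plot.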
Confidence Limits
[Figure: confidence limits before transformation (left) and after transformation (right)]
With the skewness in our data, do confidence limits on the right make more sense? Why?
Arcsine Transformation
• Used for proportions
• Compare the variances before and after transformation
• Do you know how to transform the means and C.L. back to the original scale?
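The back-transformation asked about in the last bullet can be sketched as follows, assuming the transformation is the one in the deck's SAS statement, arsin(sqrt(x)). Its inverse is the squared sine. The function names and the proportions are hypothetical.

```python
import math
import statistics

def arcsine_transform(p):
    """Arcsine square-root transformation of a proportion between 0 and 1."""
    return math.asin(math.sqrt(p))

def back_transform(y):
    """Inverse of the arcsine transformation, valid for y between 0 and pi/2."""
    return math.sin(y) ** 2

# Hypothetical proportions (not the deck's data).
proportions = [0.02, 0.05, 0.10, 0.50, 0.90, 0.95]
transformed = [arcsine_transform(p) for p in proportions]

mean_transformed = statistics.mean(transformed)
mean_back = back_transform(mean_transformed)
print(f"mean on transformed scale: {mean_transformed:.4f}")
print(f"back-transformed mean:     {mean_back:.4f}")
```

Confidence limits are handled the same way: compute them on the transformed scale, then apply `back_transform` to each limit, keeping the limits inside the range 0 to pi/2 before converting.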
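Data Transformation Using SAS
/* The three assignments are alternatives: keep only the one you need, since each overwrites newx. */
data Mydata;
  input x;
  newx = log(x);           /* natural logarithm transformation */
  newx = sqrt(x + 3/8);    /* square-root transformation */
  newx = arsin(sqrt(x));   /* arcsine transformation */
cards;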