Standardization of variables Maarten Buis 5-12-2005
Recap • Central tendency • Dispersion • SPSS
Standardization • Is used to improve the interpretability of variables. • Some variables have a natural, interpretable metric: e.g. income, age, gender, country. • Others, primarily ordinal variables, do not: e.g. education, attitude items, intelligence. • Standardizing these variables makes them more interpretable.
Standardization • Transforming the variable to a comparable metric • known unit • known mean • known standard deviation • known range • Three ways of standardizing: • P-standardization (percentile scores) • Z-standardization (z-scores) • D-standardization (dichotomize a variable)
When you should always standardize • When averaging multiple variables, e.g. when creating a socioeconomic status variable out of income and education. • When comparing the effects of variables with unequal units, e.g. does age or education have a larger effect on income?
P-Standardization • Every observation is assigned a number between 0 and 100, indicating the percentage of observations below it. • Can be read from the cumulative distribution • In case of ties: assign midpoint ranks • The median, quartiles, quintiles, and deciles are special cases of P-scores.
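A minimal sketch of P-standardization in Python, assuming the variable is held in a NumPy array; the data are hypothetical, ties receive midpoint ranks via scipy.stats.rankdata(method="average"), and one common definition of the percentile score, 100·(rank − 0.5)/n, is used.

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([12, 15, 15, 20, 31, 31, 31, 45])  # hypothetical ordinal scores

# Rank each observation (ties get the midpoint/average rank),
# then rescale the ranks to the 0-100 percentile range.
ranks = rankdata(x, method="average")
p_scores = 100 * (ranks - 0.5) / len(x)

print(p_scores)         # one percentile score per observation
print(p_scores.mean())  # 50 by construction for this definition
```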
P-standardization • Turns the variable into a ranking, i.e. it turns the variable into an ordinal variable. • It is a non-linear transformation: relative distances change • Results in a fixed mean, range, and standard deviation: M=50, SD=28.6 (this can change slightly due to ties) • A histogram of a P-standardized variable approximates a uniform distribution
Linear transformation • Say you want income in thousands of guilders instead of guilders. • You divide INCMID by f1000,-
Linear transformation • Say you want to know the deviation from the mean • Subtract the mean (f2543,-) from INCMID
Linear transformation • Adding a constant (X’ = X+c) • M(X’) = M(X)+c • SD(X’) = SD(X) • Multiplying by a constant (X’ = X*c) • M(X’) = M(X)*c • SD(X’) = SD(X) * |c|
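A small numerical check of these rules, sketched in Python; the income values and the variable name are illustrative stand-ins, not the course data.

```python
import numpy as np

# Hypothetical mid-point incomes in guilders (illustrative values only).
incmid = np.array([1800.0, 2300.0, 2543.0, 3100.0, 4200.0])

# Multiplying by a constant: both the mean and the SD are scaled.
c = 1 / 1000.0                     # guilders -> thousands of guilders
scaled = incmid * c
print(scaled.mean(), incmid.mean() * c)                 # M(X*c) = M(X)*c
print(scaled.std(ddof=1), incmid.std(ddof=1) * abs(c))  # SD(X*c) = SD(X)*|c|

# Adding a constant: the mean shifts, the SD is unchanged.
centered = incmid - incmid.mean()  # c = -M(X), i.e. deviations from the mean
print(centered.mean())             # M(X+c) = M(X)+c = 0 (up to rounding)
print(centered.std(ddof=1), incmid.std(ddof=1))         # SD(X+c) = SD(X)
```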
Z-standardization • Z = (X-M)/SD • two steps: • center the variable (mean becomes zero) • divide by the standard deviation (the unit becomes standard deviation) • Results in fixed mean and standard deviation: M=0, SD=1 • Not in a fixed range! • Z-standardization is a linear transformation: relative distances remain intact.
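A minimal sketch of the two steps in Python; the data are hypothetical.

```python
import numpy as np

x = np.array([12.0, 15.0, 20.0, 31.0, 45.0])  # hypothetical variable

# Step 1: center the variable; step 2: divide by the standard deviation.
z = (x - x.mean()) / x.std(ddof=1)

print(round(z.mean(), 10))  # 0: the mean of a z-standardized variable
print(z.std(ddof=1))        # 1: the standard deviation becomes the unit
print(z.min(), z.max())     # no fixed range: it depends on the original data
```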
Z-standardization • Step 1: subtract the mean • c = -M(X) • M(X’) = M(X)+c • M(X’) = M(X)-M(X)=0 • SD(X’)=SD(X)
Z-standardization • Step 2: divide by the standard deviation • c = 1/SD(X) • M(Z) = M(X’) * c • M(Z) = 0 * 1/SD(X) = 0 • SD(Z) = SD(X’) * c • SD(Z) = SD(X) * 1/SD(X) = 1
Normal distribution • Normal distribution = Gauss curve = Bell curve • Formula (McCall p. 120) • Note the (x−μ)² part • apart from that, all you have to remember is that the formula is complicated • A normal distribution occurs when the outcome is the sum of a large number of small random events: e.g. measurement error
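For reference, the formula referred to above (the normal density), written here with mean μ and standard deviation σ; the slide only asks you to notice the (x−μ)² term.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```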
Normal distribution • Other examples: the height of individuals, intelligence, attitudes • But: the variables education, income, and age in Eenzaam98 are not normally distributed
Z-scores and the normal distribution • Z-standardization will not result in a normally distributed variable • Standardization is NOT the same as normalization • We will not discuss normalization (but it does exist) • But: if the original variable is normally distributed, then the z-standardized variable will have a standard normal distribution.
Standard normal distribution • Normal distribution with M=0 and SD=1. • Table A in Appendix 2 of McCall • Important numbers (to be remembered): • 68% of the observations lie between ± 1 SD • 90% of the observations lie between ± 1.64 SD • 95% of the observations lie between ± 1.96 SD • 99% of the observations lie between ± 2.58 SD
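These table values can be checked quickly; a sketch using the standard normal CDF from SciPy (a verification aid only, not part of the course material).

```python
from scipy.stats import norm

# P(-z < Z < z) for the z-values quoted above.
for z in (1.0, 1.64, 1.96, 2.58):
    coverage = norm.cdf(z) - norm.cdf(-z)
    print(f"between ±{z} SD: {coverage:.1%}")
# between ±1.0 SD: 68.3%
# between ±1.64 SD: 89.9%
# between ±1.96 SD: 95.0%
# between ±2.58 SD: 99.0%
```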
Why bother? • If you know: • that a variable is normally distributed • its mean and standard deviation • Then you know the percentage of observations above or below any given observation • These numbers are a good approximation, even if the variable is not exactly normally distributed
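A short illustration of this point; the IQ-style numbers (M = 100, SD = 15) and the score of 130 are a hypothetical example, not taken from the course data.

```python
from scipy.stats import norm

# Hypothetical roughly-normal variable with M = 100 and SD = 15.
mean, sd = 100, 15
score = 130

z = (score - mean) / sd   # z-score of the observation: 2.0
print(norm.cdf(z))        # ~0.977: share of observations below 130
print(1 - norm.cdf(z))    # ~0.023: share of observations above 130
```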
P & Z standardization • Both give a distribution with fixed mean, standard deviation, and unit • P-standardization also gives a fixed range • Both are relative to the sample: if you take observations out, than you have to re-compute the standardized variables
P & Z-standardization • When interpreting Z-standardized variables one uses percentiles • With P-standardization one decreases the scale of measurement to ordinal, BUT this improves interpretability.
Do before Wednesday • Read McCall chapter 5 • Understand Appendix 2, table A • Make exercises 5.7-5.28