Standardization of variables Maarten Buis 5-12-2005
Recap • Central tendency • Dispersion • SPSS
Standardization • Is used to improve the interpretability of variables. • Some variables have a natural, interpretable metric: e.g. income, age, gender, country. • Others, primarily ordinal variables, do not: e.g. education, attitude items, intelligence. • Standardizing these variables makes them more interpretable.
Standardization • Transforming the variable to a comparable metric • known unit • known mean • known standard deviation • known range • Three ways of standardizing: • P-standardization (percentile scores) • Z-standardization (z-scores) • D-standardization (dichotomize a variable)
When you should always standardize • When averaging multiple variables, e.g. when creating a socioeconomic status variable out of income and education. • When comparing the effects of variables with unequal units, e.g. does age or education have a larger effect on income?
P-Standardization • Every observation is assigned a number between 0 and 100, indicating the percentage of observations below it. • Can be read from the cumulative distribution • In case of ties: assign midpoint ranks • The median, quartiles, quintiles, and deciles are special cases of P-scores.
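A minimal sketch of P-standardization in Python, assuming the variable is held in a NumPy array; the data are hypothetical, ties receive midpoint ranks via scipy.stats.rankdata(method="average"), and one common definition of the percentile score, 100·(rank − 0.5)/n, is used.

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([12, 15, 15, 20, 31, 31, 31, 45])  # hypothetical ordinal scores

# Rank each observation (ties get the midpoint/average rank),
# then rescale the ranks to the 0-100 percentile range.
ranks = rankdata(x, method="average")
p_scores = 100 * (ranks - 0.5) / len(x)

print(p_scores)         # one percentile score per observation
print(p_scores.mean())  # 50 by construction for this definition
```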
P-standardization • Turns the variable into a ranking, i.e. it turns the variable into an ordinal variable. • It is a non-linear transformation: relative distances change • Results in a fixed mean, range, and standard deviation: M=50, SD=28.6 (this can change slightly due to ties) • A histogram of a P-standardized variable approximates a uniform distribution
Linear transformation • Say you want income in thousands of guilders instead of guilders. • You divide INCMID by f1000,-
Linear transformation • Say you want to know the deviation from the mean • Subtract the mean (f2543,-) from INCMID
Linear transformation • Adding a constant (X’ = X+c) • M(X’) = M(X)+c • SD(X’) = SD(X) • Multiplying by a constant (X’ = X*c) • M(X’) = M(X)*c • SD(X’) = SD(X) * |c|
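A small numerical check of these rules, sketched in Python; the income values and the variable name are illustrative stand-ins, not the course data.

```python
import numpy as np

# Hypothetical mid-point incomes in guilders (illustrative values only).
incmid = np.array([1800.0, 2300.0, 2543.0, 3100.0, 4200.0])

# Multiplying by a constant: both the mean and the SD are scaled.
c = 1 / 1000.0                     # guilders -> thousands of guilders
scaled = incmid * c
print(scaled.mean(), incmid.mean() * c)                 # M(X*c) = M(X)*c
print(scaled.std(ddof=1), incmid.std(ddof=1) * abs(c))  # SD(X*c) = SD(X)*|c|

# Adding a constant: the mean shifts, the SD is unchanged.
centered = incmid - incmid.mean()  # c = -M(X), i.e. deviations from the mean
print(centered.mean())             # M(X+c) = M(X)+c = 0 (up to rounding)
print(centered.std(ddof=1), incmid.std(ddof=1))         # SD(X+c) = SD(X)
```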
Z-standardization • Z = (X-M)/SD • two steps: • center the variable (mean becomes zero) • divide by the standard deviation (the unit becomes standard deviation) • Results in fixed mean and standard deviation: M=0, SD=1 • Not in a fixed range! • Z-standardization is a linear transformation: relative distances remain intact.
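A minimal sketch of the two steps in Python; the data are hypothetical.

```python
import numpy as np

x = np.array([12.0, 15.0, 20.0, 31.0, 45.0])  # hypothetical variable

# Step 1: center the variable; step 2: divide by the standard deviation.
z = (x - x.mean()) / x.std(ddof=1)

print(round(z.mean(), 10))  # 0: the mean of a z-standardized variable
print(z.std(ddof=1))        # 1: the standard deviation becomes the unit
print(z.min(), z.max())     # no fixed range: it depends on the original data
```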
Z-standardization • Step 1: subtract the mean • c = -M(X) • M(X’) = M(X)+c • M(X’) = M(X)-M(X)=0 • SD(X’)=SD(X)
Z-standardization • Step 2: divide by the standard deviation • c = 1/SD(X) • M(Z) = M(X’) * c • M(Z) = 0 * 1/SD(X) = 0 • SD(Z) = SD(X’) * c • SD(Z) = SD(X) * 1/SD(X) = 1
Normal distribution • Normal distribution = Gauss curve = Bell curve • Formula (McCall p. 120) • Note the (x−μ)² part • apart from that, all you have to remember is that the formula is complicated • A normal distribution occurs when the outcome is the sum of a large number of small random events: e.g. measurement error
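For reference, the formula referred to above (the normal density), written here with mean μ and standard deviation σ; the slide only asks you to notice the (x−μ)² term.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```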
Normal distribution • Other examples: the height of individuals, intelligence, attitudes • But: the variables education, income, and age in Eenzaam98 are not normally distributed
Z-scores and the normal distribution • Z-standardization will not result in a normally distributed variable • Standardization is NOT the same as normalization • We will not discuss normalization (but it does exist) • But: if the original variable is normally distributed, then the z-standardized variable will have a standard normal distribution.
Standard normal distribution • Normal distribution with M=0 and SD=1. • Table A in Appendix 2 of McCall • Important numbers (to be remembered): • 68% of the observations lie between ± 1 SD • 90% of the observations lie between ± 1.64 SD • 95% of the observations lie between ± 1.96 SD • 99% of the observations lie between ± 2.58 SD
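These table values can be checked quickly; a sketch using the standard normal CDF from SciPy (a verification aid only, not part of the course material).

```python
from scipy.stats import norm

# P(-z < Z < z) for the z-values quoted above.
for z in (1.0, 1.64, 1.96, 2.58):
    coverage = norm.cdf(z) - norm.cdf(-z)
    print(f"between ±{z} SD: {coverage:.1%}")
# between ±1.0 SD: 68.3%
# between ±1.64 SD: 89.9%
# between ±1.96 SD: 95.0%
# between ±2.58 SD: 99.0%
```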
Why bother? • If you know: • that a variable is normally distributed • its mean and standard deviation • Then you know the percentage of observations above or below any given observation • These numbers are a good approximation, even if the variable is not exactly normally distributed
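A short illustration of this point; the IQ-style numbers (M = 100, SD = 15) and the score of 130 are a hypothetical example, not taken from the course data.

```python
from scipy.stats import norm

# Hypothetical roughly-normal variable with M = 100 and SD = 15.
mean, sd = 100, 15
score = 130

z = (score - mean) / sd   # z-score of the observation: 2.0
print(norm.cdf(z))        # ~0.977: share of observations below 130
print(1 - norm.cdf(z))    # ~0.023: share of observations above 130
```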
P & Z standardization • Both give a distribution with fixed mean, standard deviation, and unit • P-standardization also gives a fixed range • Both are relative to the sample: if you take observations out, than you have to re-compute the standardized variables
P & Z-standardization • When interpreting Z-standardized variables one uses percentiles • With P-standardization one decreases the scale of measurement to ordinal, BUT this improves interpretability.
Do before Wednesday • Read McCall chapter 5 • Understand Appendix 2, table A • Make exercises 5.7-5.28