Winsorizing: What is it and why could it be inappropriate?
Kyle Allen & Matthew Whitledge
May 7, 2013
What is winsorizing?
• What it isn’t…
  • Trimming
  • Truncating
  • Any other method that completely removes observations from the data
• Term first used in 1960
  • John W. Tukey; W. J. Dixon
• “Numerical value of a wild observation is untrustworthy”
  • However, its direction of deviation is important
  • Decreasing the magnitude of the deviation, retaining its direction
Winsorizing: an example
• Order the observations by value
  • Xi1, Xi2, …, Xi100, where i denotes the ith regressor
• If winsorizing at 1% and 99%, then
  • The value for Xi1 will be replaced by the value for Xi2
  • The value for Xi100 will be replaced by the value for Xi99
Another example:
• Xi1, Xi2, …, Xi100
• Winsorize at 10% (5% from the bottom and 5% from the top)
• Beginning sample:
  • Xi1, Xi2, Xi3, Xi4, Xi5, Xi6, Xi7, …, Xi94, Xi95, Xi96, Xi97, Xi98, Xi99, Xi100
• Winsorized sample (the five lowest values take the value of Xi6, the five highest take the value of Xi95):
  • Xi6, Xi6, Xi6, Xi6, Xi6, Xi6, Xi7, …, Xi94, Xi95, Xi95, Xi95, Xi95, Xi95, Xi95
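The replacement rule can be written as a short Python sketch. It assumes the convention of the 1% and 99% example: with n observations, the k = floor(p × n) lowest values take the value of the (k+1)-th lowest, and likewise at the top. The function name `winsorize` and its integer-percent arguments are illustrative choices, not part of the original slides.

```python
def winsorize(values, lower_pct, upper_pct):
    """Return a winsorized copy of `values`, keeping the original order.

    lower_pct and upper_pct are whole-number percentages (e.g. 1 for 1%).
    The k lowest observations are replaced by the (k+1)-th lowest and the
    m highest by the (m+1)-th highest, where k and m are the percentages
    of n rounded down.
    """
    n = len(values)
    k = n * lower_pct // 100   # number of observations replaced at the bottom
    m = n * upper_pct // 100   # number of observations replaced at the top
    if k + m >= n:
        raise ValueError("percentages leave no observation unchanged")
    ordered = sorted(values)
    low, high = ordered[k], ordered[n - m - 1]
    # Clamping to [low, high] shrinks extreme values but removes none of them.
    return [min(max(v, low), high) for v in values]


# Hypothetical data: 100 observations with values 1, 2, ..., 100.
x = list(range(1, 101))
w = winsorize(x, 1, 1)
print(w[:3], w[-3:])   # [2, 2, 3] [98, 99, 99]
```

With 100 observations, `winsorize(x, 1, 1)` replaces the smallest value with the second smallest and the largest with the second largest. The sample size stays at 100, which is what separates winsorizing from trimming.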
Winsorizing: alternatives
• Are the observations really outliers?
  • Look at Cook’s D measure
• Transform the variables
  • Take the log or square root of the variable
  • This shouldn’t be done only to increase significance
• Median-based estimations
  • Quantile regression
  • Median absolute deviation
• Nonparametric methods
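As one illustration of a median-based check, the sketch below flags observations using the median absolute deviation (MAD). The modified z-score 0.6745 × (x − median) / MAD and the cutoff of 3.5 are a common rule of thumb assumed here, not something the slides specify. The function name `mad_flags` and the sample data are hypothetical.

```python
import statistics


def mad_flags(values, threshold=3.5):
    """Flag observations whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute
    deviation (MAD) instead of the mean and standard deviation, so a
    single extreme value cannot mask itself by inflating the scale.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical sample with one very large value.
days = [2, 3, 3, 4, 5, 4, 3, 60]
flags = mad_flags(days)
print([d for d, f in zip(days, flags) if f])   # [60]
```

A flag only marks an observation for inspection. Whether the value is an error or a valid extreme still has to be judged from the data source.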
Winsorizing: a SAS example
Lift Index Data
• Workers perform lifting tasks
• Each lift has an amount of stress associated with it
• Measuring the number of days an employee missed based on the lift they were performing
• 206 observations
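Winsorizing: SAS code

/* Plot days lost against alr, overlaying dayslost and dayslost1 */
proc sgplot data=isqsdata.lilesmerge;
  scatter y=dayslost x=alr;
  scatter y=dayslost1 x=alr;
run;

/* Winsorize by hand: set dayslost to 27 for subjects 6 and 35 */
data isqsdata.lileswin;
  set isqsdata.lileswin;
  if subject = 6 then dayslost = 27;
  if subject = 35 then dayslost = 27;
run;

/* Tobit model (left-censored at 0) on the original data */
proc qlim data=isqsdata.liles;
  model dayslost = alr;
  endogenous dayslost ~ censored(lb=0);
run;

/* Same model on the winsorized data set */
proc qlim data=isqsdata.lileswin;
  model dayslost1 = alr;
  endogenous dayslost1 ~ censored(lb=0);
run;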
Winsorizing: implications
• May impact significance
• The standard errors will generally decrease
• Depending on how symmetrical the data are, the mean may increase or decrease
  • For example, winsorizing an extremely large positive outlier will decrease the mean
• Significance will be determined by the proportionate change in the estimated coefficient relative to the change in the standard error
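These effects can be checked on a small made-up sample. The Python sketch below winsorizes one observation in each tail and compares the mean and its standard error (s / √n) before and after. The data and helper names are hypothetical and only illustrate the direction of the changes.

```python
import math
import statistics


def mean_and_se(values):
    """Sample mean and its standard error, s / sqrt(n)."""
    se = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), se


def winsorize_one_each_tail(values):
    """Replace the lowest value with the second lowest and the
    highest value with the second highest."""
    ordered = sorted(values)
    low, high = ordered[1], ordered[-2]
    return [min(max(v, low), high) for v in values]


# Hypothetical samples: one with a large positive outlier,
# one with a large negative outlier.
samples = {
    "positive outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "negative outlier": [-100, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

for label, raw in samples.items():
    mean_raw, se_raw = mean_and_se(raw)
    mean_win, se_win = mean_and_se(winsorize_one_each_tail(raw))
    print(f"{label}: mean {mean_raw:.2f} -> {mean_win:.2f}, "
          f"SE {se_raw:.2f} -> {se_win:.2f}")
```

In both samples the standard error falls, while the mean moves down in the first and up in the second. The ratio of estimate to standard error can therefore change in either direction.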
Winsorizing: why could it be inappropriate?
• May be appropriate for
  • Ratios
  • Book to Market
  • Other measures in which the denominator can be extremely small
• Never winsorize valid observations
  • Investment returns
  • R&D expenditures
• Truly exceptional observations
  • Large number of biological elements
  • Extremely low stress tolerances for mechanical implements
• Model should produce data we could actually see
Winsorizing: bibliography
• Brillinger, David R. “John W. Tukey: His Life and Professional Contributions.” The Annals of Statistics. 30(2002): 1535-75.
• Dixon, W. J. “Simplified Estimation from Censored Normal Samples.” The Annals of Mathematical Statistics. 31(1960): 385-91.
• Kafadar, Karen. “John Tukey and Robustness.” Proceedings of the Annual Meeting of the American Statistical Association. 2001.
• Kruskal, William, Thomas Ferguson, John W. Tukey, E. J. Gumbel, and F. J. Anscombe. “Discussion of the Papers of Messrs. Anscombe and Daniel.” Technometrics. 2(1960): 157-66.
• Tukey, John W. and Donald H. McLaughlin. “Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1.” Sankhyā: The Indian Journal of Statistics. 25(1963): 331-52.
• Westfall, Peter H. and Kevin S. S. Henning. Understanding Advanced Statistical Methods. Boca Raton, FL: CRC Press, 2013.