

ROBUST STATISTICS. R. Douglas Martin* and Ruben H. Zamar** *Professor of Statistics, Univ. of Washington **Professor of Statistics, Univ. of British Columbia.



  1. ROBUST STATISTICS R. Douglas Martin* and Ruben H. Zamar** *Professor of Statistics, Univ. of Washington **Professor of Statistics, Univ. of British Columbia

  2. Key Reference Books • Huber, P.J. (1981). Robust Statistics, Wiley • Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986). Robust Statistics, The Approach Based on Influence Functions, Wiley. • Rousseeuw, P.J. and Leroy, A.M. (1987). Robust Regression and Outlier Detection, Wiley.

  3. J. W. Tukey (1979) “… just which robust/resistant methods you use is not important – what is important is that you use some. It is perfectly proper to use both classical and robust/resistant methods routinely, and only worry when they differ enough to matter. But when they differ, you should think hard.”

  4. J. W. Tukey “Statistics is a science in my opinion, and it is no more a branch of mathematics than are physics, chemistry and economics; for if its methods fail the test of experience – not the test of logic – they will be discarded.” Recommended reading: Annals of Statistics Tukey Memorial Volume (Fall, 2002) • “John Tukey’s Contributions to Robust Statistics” (P. J. Huber) • “The Life and Professional Contributions of J. W. Tukey” (D. R. Brillinger)

  5. OUTLINE • DATA-ORIENTED INTRODUCTION • LOCATION AND SCALE ESTIMATES • BASIC ROBUSTNESS CONCEPTS • ROBUST REGRESSION • ROBUST MULTIVARIATE LOCATION AND SCATTER

  6. INTRODUCTION • Outlier Examples • Classical Parameter Estimates are Not Robust • Classical Statistical Inference is Not Robust • Data-Oriented Robustness and Examples • Simple Robust Location and Scale Estimates • Simple Robust Estimates Have Bounded EIFs • Outlier Mining One Dimension at a Time

  7. OUTLIERS • Outliers are atypical observations that are “well” separated from the bulk of the data • They occur in isolation or in small clusters • Dimensionality context: 1-D (relatively easy to detect) • 2-D (harder to detect) • Higher-D (very hard to detect) • Time Series (special challenges)

  8. Classical Statistics • PARAMETER ESTIMATES (“Point” Estimates) • Sample mean and sample standard deviation • Sample correlation and covariance estimates • Linear least squares model fits • Gaussian maximum likelihood • STATISTICAL INFERENCE • t-statistic and t-interval for an unknown mean • Standard errors and t-values for regression coefficients • F-tests for regression model hypotheses • AIC, BIC, Cp model selection statistics

  9. CLASSICAL STATS ARE NOT ROBUST Outliers have “unbounded influence” on classical statistics, resulting in: • Inaccurate parameter estimates and predictions • Inaccurate statistical inference • Standard errors are too large • Confidence intervals are too wide • t-statistics lack power • AIC, BIC, Cp result in wrong models • Unreliable outlier detection

  10. EMPIRICAL INFLUENCE FUNCTION (EIF) Measures the influence of an additional point x on an estimate T, normalized across sample size.

  11. CLASSICAL ESTIMATES HAVE UNBOUNDED EIF Sample Mean
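The EIF formula itself did not survive in this transcript. A minimal sketch, assuming the standard definition EIF(x) = (n + 1) · (T(x_1, ..., x_n, x) − T(x_1, ..., x_n)), where the factor (n + 1) is the normalization across sample size. The sample values and the function name are hypothetical, chosen only for illustration. Under this definition the EIF of the sample mean is x minus the sample mean, which grows without bound, while the EIF of the median levels off.

```python
import statistics


def empirical_influence(estimator, sample, x):
    """EIF(x) = (n + 1) * (T(sample + [x]) - T(sample)), assumed standard form."""
    n = len(sample)
    return (n + 1) * (estimator(sample + [x]) - estimator(sample))


# Hypothetical small sample, for illustration only.
sample = [4.9, 5.1, 5.3, 5.4, 5.5, 5.6, 5.8]

for x in (6.0, 60.0, 600.0):
    eif_mean = empirical_influence(statistics.mean, sample, x)
    eif_median = empirical_influence(statistics.median, sample, x)
    # The mean's EIF keeps growing with x; the median's EIF stays constant.
    print(f"x = {x:6.1f}  EIF(mean) = {eif_mean:8.2f}  EIF(median) = {eif_median:5.2f}")
```

Moving the added point from 6 to 600 changes the mean's EIF by the same amount, whereas the median's EIF does not change at all once x lies beyond the central observations.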

  12. RESISTANCE (J.W. Tukey’s term) • A Fundamental Continuity Concept - Small changes in the data result in only small changes in estimate - “Change a few, so what” J.W. Tukey (Seattle, 1977) • “Small Changes” Generalization - Small changes in all the data (e.g., rounding errors) - Large changes in a small fraction of the data (a few outliers) • Valuable Consequence - A good fit to the bulk of the data - Reliable, automatic outlier detection

  13. 1-D Outliers: Stock Returns • Outliers represent locally large losses/gains • Sometimes you must process thousands of such series • You need to detect the outliers automatically!

  14. 1-D Outliers: Density of Earth Cavendish's 1798 measurements. Because of the low outlier, the median (5.46) is a better estimate of the Earth's density than the mean (5.42).
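The effect of a single low outlier on the mean and the median can be sketched as follows. The numbers are hypothetical and are NOT Cavendish's measurements; they are a tight cluster plus one low value, chosen only to show the mechanism.

```python
import statistics

# Hypothetical measurements (not Cavendish's data): a tight cluster ...
clean = [5.30, 5.36, 5.42, 5.46, 5.47, 5.53, 5.58, 5.62]
# ... plus one low outlier.
with_outlier = clean + [4.10]

for label, data in (("clean", clean), ("with low outlier", with_outlier)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label:17s} mean = {mean:.3f}  median = {median:.3f}")

# Shift caused by the single outlier in each estimate.
mean_shift = abs(statistics.mean(with_outlier) - statistics.mean(clean))
median_shift = abs(statistics.median(with_outlier) - statistics.median(clean))
print(f"mean shift = {mean_shift:.3f}, median shift = {median_shift:.3f}")
```

The outlier pulls the mean down noticeably while the median barely moves, which is the sense in which the median is the more resistant location estimate.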

  15. 2-D Outliers: Predicting EPS You have to predict 2001 EPS (earnings per share)! You have many of these series, e.g., hundreds!

  16. 2-D Outliers: Main Gain Data

  17. 5-D Outliers: Woodmod Data A group of 4 outliers shows up in the plots of V1 vs V2 and V4 vs V5. Corr(V1,V2) = -0.15, RobCorr(V1,V2) = 0.75
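The slides do not say which robust correlation estimate produced RobCorr. As a hedged illustration only, the sketch below uses one well-known option: a Gnanadesikan-Kettenring type estimate built from the normalized MAD of the sum and the difference of robustly standardized variables. The data are synthetic (not the Woodmod data), and the function names are hypothetical.

```python
import statistics


def normalized_mad(values):
    """MAD about the median, scaled by 1.4826 for consistency at the normal."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def pearson_corr(x, y):
    """Classical sample correlation."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def robust_corr(x, y):
    """Gnanadesikan-Kettenring type correlation with MAD as the scale."""
    mx, my = statistics.median(x), statistics.median(y)
    sx, sy = normalized_mad(x), normalized_mad(y)
    u = [(a - mx) / sx for a in x]
    v = [(b - my) / sy for b in y]
    s_plus = normalized_mad([a + b for a, b in zip(u, v)])
    s_minus = normalized_mad([a - b for a, b in zip(u, v)])
    return (s_plus**2 - s_minus**2) / (s_plus**2 + s_minus**2)


# Synthetic data: 16 points on an increasing trend plus a group of 4 outliers.
x = [float(i) for i in range(1, 17)]
y = [xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
x += [30.0, 31.0, 32.0, 33.0]
y += [-30.0, -29.0, -31.0, -30.0]

print(f"classical correlation = {pearson_corr(x, y):.2f}")
print(f"robust correlation    = {robust_corr(x, y):.2f}")
```

As in the slide, a small group of outliers is enough to flip the sign of the classical correlation, while the robust estimate still reflects the trend in the bulk of the data.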

  18. LUNATICS IN MASSACHUSETTS Population densities in Suffolk and Essex are much larger than those in the other counties. Correlation = -0.64, Robust Correlation = -0.97

  19. LUNATICS IN MASSACHUSETTS (Continued) Plot with Suffolk and Essex removed. Now Nantucket shows up as an outlier. Correlation = -0.84, Robust Correlation = -0.93

  20. LUNATICS IN MASSACHUSETTS (Continued) Plot with Suffolk, Essex and Nantucket removed. Now the data show a clear decreasing trend, with smaller percentages in more populated counties. Correlation = -0.97, Robust Correlation = -0.97

  21. Time Series with Outliers and Level Shifts • Need to detect outliers and level shifts as important, distinct events • They reflect key aspects of consumer behavior • Automate the detection of key changes in a few out of many thousands of customers

  22. Gene Expression Data Microarray experiments are typically used to identify differentially expressed genes. DNA probes printed on a glass slide are hybridized to two RNA samples, separately labeled with two fluorescent dyes. After slide scanning, the hybridization intensity values are calculated using image analysis and then used to identify differentially expressed genes.

  23. Three Principal Stages of the Technology • Array fabrication (PCR amplification and clone preparation, reaction clean-up, array printing) • Probe preparation (mRNA extraction, mRNA labeling, probe labeling and purification) and hybridization • Slide scanning and image processing (gridding, segmentation, intensity extraction)

  24. Gene Expression Data (continued) Each of the above-mentioned stages may generate several sources of random variation and of systematic error. For example: • The first stage involves variation in the quantity of probe at a spot and in the hybridization efficiency of the probes with their counterparts (mRNA targets) • The second stage includes variation in the quantity of mRNA in a sample applied to the slide and variation in the amount of target hybridized to the probe • The third stage is subject to variation in optical measurements and in the fluorescent intensities computed from the scanned image.

  25. Gene Expression Data (continued) Different substances can be used to increase or dampen the level of expression of a gene. Hughes et al. (2000), “Functional Discovery via a Compendium of Expression Profiles”, Cell 102: 109-126, considered 6068 genes and ten different substances, abbreviated as: cin, cup, fre, mac, sod, spf, vma, yap, yer and ymr

  26. Gene Expression Data (continued) The sample exposed to the substance (treatment sample) was labeled “green”. The other sample (control sample) was labeled “red”. The normalized green intensity of gene “i” in sample “j” is denoted by [symbol missing in transcript]. The normalized red intensity of gene “i” in sample “j” is denoted by [symbol missing in transcript].

  27. Gene Expression Data (continued) We will examine the differences between normalized gene expression intensities. The expression levels for most genes are similar. Those will appear as “normal data” in the boxplots. There are some genes for which the difference in intensity is large. Those are the genes that are likely to be over- or under-expressed in the “treatment” samples.

  28. Gene Expression Data Red - Green intensity levels for ten samples • Similar intensity levels for most genes • Outliers may correspond to over- or under-expressed genes

  29. NORMALIZED MEAN-MEDIAN DIFFERENCE Diff = (Med - Mean)/SE(Med) In several cases (red rows in the table) the mean and median have different signs. Differences are relatively small. The positive and negative outliers balance each other, limiting their overall effect on the mean.
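The slide does not say how SE(Med) was computed. A minimal sketch, assuming the large-sample standard error of the median under normality, SE(Med) ≈ sqrt(π/2) · σ / sqrt(n), with σ estimated by the normalized MAD. The helper names and the synthetic data are hypothetical; the data are built so that one set has balanced positive and negative outliers and the other has outliers on one side only.

```python
import math
import statistics


def normalized_mad(values):
    """MAD about the median, scaled by 1.4826 for consistency at the normal."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def mean_median_diff(values):
    """Diff = (Med - Mean) / SE(Med), with an assumed normal-theory SE(Med)."""
    n = len(values)
    # Assumption: asymptotic SE of the median at the normal, sigma from the MAD.
    se_med = math.sqrt(math.pi / 2) * normalized_mad(values) / math.sqrt(n)
    return (statistics.median(values) - statistics.mean(values)) / se_med


# Synthetic bulk, symmetric about zero.
base = [i / 10 for i in range(-20, 21)]
balanced = base + [8.0, -8.0]   # outliers on both sides cancel in the mean
one_sided = base + [8.0, 8.0]   # outliers on one side pull the mean up

print(f"balanced outliers : Diff = {mean_median_diff(balanced):.2f}")
print(f"one-sided outliers: Diff = {mean_median_diff(one_sided):.2f}")
```

With balanced outliers the mean and median agree and Diff is essentially zero, which matches the slide's remark that outliers of opposite sign limit their overall effect on the mean.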

  30. NORMALIZED SD - MAD DIFFERENCE Diff = (SD - MAD)/SE(MAD) • The outliers have a bigger impact on the standard deviations • Flagging outliers by using means and SDs becomes more difficult

  31. Standard Deviation vs. MAD SD ≈ 1.45 × MAD: across samples, the SD is about 45% larger than the MAD.

  32. Flagging Outliers Suppose we have a set of numbers such that most of them are independent normal random variables with mean m and variance σ². Suppose that a relatively small fraction of these numbers are expected to be different from the majority.

  33. Flagging Outliers (continued) We need reliable and automatic ways of flagging outliers. We may use the popular rule [rule missing in transcript]. But a better approach (especially for large datasets) is to use a cutoff “c” determined by the equation [equation missing in transcript], to reduce the probability of flagging “wrong outliers”.

  34. Flagging Outliers (continued) It is easy to verify that: [formula missing in transcript]

  35. Flagging Outliers (continued) For the gene expression data, n = 6068, and so: [values missing in transcript]. For such large datasets it is better to use the cutoff c to reduce the probability of flagging “wrong genes”.
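The rule and the equation for c are not recoverable from this transcript. As a hedged sketch, assume the popular rule is the 3-sigma rule and that c is chosen so that the probability of flagging at least one of n independent clean normal observations equals a small α, that is, (2Φ(c) − 1)^n = 1 − α. Both choices, the value α = 0.05, and the function names are assumptions for illustration.

```python
import math
from statistics import NormalDist


def outlier_cutoff(n, alpha=0.05):
    """c with P(at least one of n independent N(0,1) values has |Z| > c) = alpha."""
    # Per-observation two-sided tail probability: 1 - (1 - alpha)**(1/n),
    # computed with expm1/log1p to stay accurate for large n.
    tail = -math.expm1(math.log1p(-alpha) / n)
    return -NormalDist().inv_cdf(tail / 2)


def expected_false_flags(n, c):
    """Expected number of clean N(0,1) observations with |Z| > c."""
    return n * 2 * (1 - NormalDist().cdf(c))


n = 6068  # number of genes quoted in the slides
c = outlier_cutoff(n)
print(f"cutoff c for n = {n}: {c:.2f}")
print(f"expected false flags with the 3-sigma rule: {expected_false_flags(n, 3.0):.1f}")
print(f"expected false flags with cutoff c:         {expected_false_flags(n, c):.3f}")
```

Under these assumptions a fixed 3-sigma rule would flag more than a dozen perfectly ordinary genes out of 6068, while the sample-size-dependent cutoff keeps the chance of any false flag near α.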

  36. Flagging Outliers (continued) We can assume that, for each sample, the differences between normalized intensities are (approximately) independent normal with mean m = 0 and unknown variance σ².

  37. Flagging Outliers (continued) Since sigma is unknown, it must be estimated from the data. • Robust estimate: MAD • Classical estimate: SD • Because of the outliers, the SD will systematically overestimate sigma
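How an inflated SD masks outliers can be sketched as follows. This assumes the MAD is normalized by 1.4826 so that it estimates sigma at the normal, and it uses a cutoff of 3 purely for illustration. The data are synthetic: a bulk of normal scores with sigma close to 1, centered at m = 0, plus ten planted outliers.

```python
import statistics
from statistics import NormalDist


def normalized_mad(values):
    """MAD about the median, scaled by 1.4826 for consistency at the normal."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


# Synthetic bulk: 200 normal scores (deterministic stand-in for N(0, 1) data).
n_bulk = 200
bulk = [NormalDist().inv_cdf((i + 0.5) / n_bulk) for i in range(n_bulk)]
# Ten planted outliers at +/-5, +/-6, ..., +/-9.
outliers = [s * k for k in (5.0, 6.0, 7.0, 8.0, 9.0) for s in (1.0, -1.0)]
diffs = bulk + outliers

c = 3.0  # illustrative cutoff; the mean is taken as m = 0
sd_sigma = statistics.stdev(diffs)
mad_sigma = normalized_mad(diffs)

sd_flags = [d for d in diffs if abs(d) > c * sd_sigma]
mad_flags = [d for d in diffs if abs(d) > c * mad_sigma]

print(f"SD  estimate of sigma = {sd_sigma:.2f}, flagged = {len(sd_flags)}")
print(f"MAD estimate of sigma = {mad_sigma:.2f}, flagged = {len(mad_flags)}")
```

The SD is inflated by the very outliers it is supposed to help detect, so the SD-based rule misses the smallest planted outliers, while the MAD-based rule flags all ten and none of the bulk.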

  38. Flagging Outliers • ymr has relatively few very large outliers, which drastically inflate the SD • cup and yap have a large number of moderate outliers, which inflate the SD

  39. “MAD – SD Outliers” vs. “R = SD/MAD” ymr (bottom-right corner) appears as an outlier in this plot. In this case there are relatively few large outliers, which drastically inflate the standard deviation. Robust Fit: Diff = -95 + 91 × R. LS Fit: Diff = -51 + 60 × R

  40. BEEF SALES IN USA (1925-1941) Beef sales dropped sharply around 1930 and showed a steady increase over 1933-41. High levels of beef consumption in 1925-27 show up as outliers in the plot.
