
Loss Functions for Detecting Outliers in Panel Data

Loss Functions for Detecting Outliers in Panel Data. Charles D. Coleman, Thomas Bryan, Jason E. Devine, U.S. Census Bureau. Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000.


Presentation Transcript


  1. Loss Functions for Detecting Outliers in Panel Data. Charles D. Coleman, Thomas Bryan, Jason E. Devine, U.S. Census Bureau. Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000.

  2. Panel Data A.k.a. “longitudinal data.” $x_{it}$: • $i$ indexes cross-sectional units, which retain their identities over time. Examples: geographic areas, persons, households, companies, autos. • $t$ indexes time. • Time can be chronological or nominal. • Chronological time measures the time elapsed between two dates. • Nominal time indexes different sets of estimates and can also index true values.

  3. Notation • $B_i$ is the base value for unit $i$. • $F_i$ is the “future” value for unit $i$. • $F_{it}$ is the future value for unit $i$ at time $t$. • $B_i, F_i, F_{it} > 0$. • $\epsilon_i = |F_i - B_i|$ is the absolute difference for unit $i$. • Subscripts will be dropped when not needed.

  4. What is an Outlier? “[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” D.M. Hawkins, Identification of Outliers, 1980, p. 1.

  5. Meaning of an Outlier • Either an indication of a problem with the data generation process, • or a true, but unusual, statement about reality.

  6. Loss Functions • Motivations: The absolute differences $\epsilon_i = |F_i - B_i|$ come from unknown distributions. Want to compare multiple size classes on the same basis. • $L(F_i; B_i) \equiv \ell(\epsilon_i, B_i)$ is the loss function for observation $i$. • Loss functions measure “badness.” • Loss functions produce rankings of observations to be examined. • Loss functions are empirically based, except for one special case in nominal time.

  7. Assumption 1 Loss is symmetric in the error $\epsilon$: $L(B + \epsilon; B) = L(B - \epsilon; B)$

  8. Assumption 2 Loss increases in the absolute difference $\epsilon$: $\partial \ell / \partial \epsilon > 0$

  9. Assumption 3 Loss decreases in the base value: $\partial \ell / \partial B < 0$

  10. Property 1 The loss associated with a given absolute percentage difference ($|\epsilon / B|$, where $\epsilon = |F - B|$) increases in $B$.

  11. Simplest Loss Function $L(F;B) = |F - B| B^q$ (1a) or $\ell(\epsilon, B) = \epsilon B^q$ (1b), where $\epsilon = |F - B|$, with $0 > q > -1$.
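A minimal sketch of equation (1a) in Python may help make the assumptions concrete. The function name `loss`, the default `q = -0.5`, and the example values are illustrative choices, not part of the presentation.

```python
def loss(future, base, q=-0.5):
    """Equation (1a): |F - B| * B**q, for positive F and B and -1 < q < 0."""
    if base <= 0 or future <= 0:
        raise ValueError("F and B must be positive")
    if not -1 < q < 0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(future - base) * base ** q


# Same 10% difference at two sizes: the larger unit gets the larger loss (Property 1).
small = loss(110, 100)
large = loss(11000, 10000)

# Same absolute difference of 10: the larger base gets the smaller loss (Assumption 3).
same_diff_small_base = loss(110, 100)
same_diff_large_base = loss(10010, 10000)
```

With these made-up values, `large` exceeds `small` and `same_diff_large_base` falls below `same_diff_small_base`, which is the behavior the three assumptions and Property 1 describe.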

  12. Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference • This generates a loss function with $q = -s/(r + s)$. • An infinite number of pairs $(r, s)$ correspond to any given $q$.
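The combination itself is not written out in the slide text. As an assumed reconstruction only, suppose the absolute difference $\epsilon = |F - B|$ and the absolute percentage difference $\epsilon / B$ are combined geometrically with positive weights $r$ and $s$. That form is a guess chosen because it reproduces the stated $q$; the slide does not confirm it.

```latex
\[
  \epsilon^{r} \left( \frac{\epsilon}{B} \right)^{s}
  = \epsilon^{r+s} B^{-s}
  = \left( \epsilon \, B^{-s/(r+s)} \right)^{r+s}
\]
```

Because raising to the power $1/(r+s)$ is monotone for $r + s > 0$, this ranks observations exactly as $\epsilon B^q$ does with $q = -s/(r+s)$, and $r, s > 0$ gives $-1 < q < 0$. Multiplying both weights by any $k > 0$ leaves $q$ unchanged, which is why infinitely many pairs $(r, s)$ map to the same $q$.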

  13. Outlier Criterion • An outlier is declared whenever $L(F;B) \equiv \ell(\epsilon, B) > C$, where $\epsilon = |F - B|$. • $C$ is the “critical value.” • $C$ can be determined in advance, or as a function of the data (e.g., a quantile or a multiple of a scale measure).
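The criterion can be sketched as follows, with the critical value either supplied in advance or taken as an empirical quantile of the losses. The function name `flag_outliers`, the 0.95 default, and the use of `statistics.quantiles` with the inclusive method are assumptions made for illustration; the presentation does not prescribe a particular quantile estimator.

```python
import statistics


def flag_outliers(pairs, q=-0.5, critical=None, quantile=0.95):
    """Flag (F, B) pairs whose loss |F - B| * B**q exceeds the critical value C.

    If `critical` is None, C is the requested empirical quantile of the losses
    (needs at least two observations). Returns (C, indices of flagged pairs).
    """
    losses = [abs(f - b) * b ** q for f, b in pairs]
    if critical is None:
        # 99 percentile cut points; cut point k-1 is the k-th percentile.
        cuts = statistics.quantiles(losses, n=100, method="inclusive")
        critical = cuts[round(quantile * 100) - 1]
    flagged = [i for i, value in enumerate(losses) if value > critical]
    return critical, flagged


# Hypothetical (F, B) pairs.
pairs = [(110, 100), (11000, 10000), (10010, 10000), (150, 100)]
fixed_c, fixed_flags = flag_outliers(pairs, critical=4.0)
data_c, data_flags = flag_outliers(pairs, quantile=0.75)
```

A fixed `critical` suits a rule set before looking at the data, while the quantile version always flags roughly the same share of observations.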

  14. Loss Function Variants • Time-Invariant Loss Function • Signed Loss Function • Nominal Time

  15. Time-Invariant Loss Function • Idea: Compare multiple dates of data on the same basis. • Time need not be a round number. • $L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$ • Property 1 is satisfied as long as $t < -1/q$. • Thus, the useful horizon is limited.
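A short sketch of the time-invariant loss, including the horizon check. The function name and example values are hypothetical. The bound follows from substituting a fixed percentage difference $p$, so that the loss is $p B^{1 + tq}$, which increases in $B$ only when $1 + tq > 0$.

```python
def time_invariant_loss(future, base, t, q=-0.5):
    """|F_it - B_i| * B_i**(t*q), valid while 0 < t < -1/q (Property 1)."""
    if not -1 < q < 0:
        raise ValueError("q must lie strictly between -1 and 0")
    if not 0 < t < -1 / q:
        raise ValueError("horizon t must satisfy 0 < t < -1/q")
    return abs(future - base) * base ** (t * q)


# With q = -0.5 the usable horizon is t < 2; fractional t is allowed.
one_year = time_invariant_loss(110, 100, t=1)
eighteen_months = time_invariant_loss(115, 100, t=1.5)
```

At `t = 1` this reduces to the simplest loss function, so a single `q` can serve several dates as long as each stays inside the horizon.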

  16. Signed Loss Function • Idea: Account for direction and magnitude of loss. $S(F;B) = (F - B) B^q$ • Can use asymmetric critical values and “$q$”s: • Declare outliers whenever $S_+(F;B) = (F - B) B^{q_+} > C_+$ or $S_-(F;B) = (F - B) B^{q_-} < C_-$, with $C_+ \neq -C_-$, $q_+ \neq q_-$.
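The asymmetric rule can be sketched as below. The function name and all four parameter defaults are invented for illustration; the only constraints assumed here are a positive upper critical value and a negative lower one.

```python
def signed_outlier(future, base, q_plus=-0.5, q_minus=-0.4,
                   c_plus=5.0, c_minus=-3.0):
    """Return True when either signed loss crosses its own critical value."""
    diff = future - base
    s_plus = diff * base ** q_plus    # compared with the upper cutoff C+
    s_minus = diff * base ** q_minus  # compared with the lower cutoff C-
    return s_plus > c_plus or s_minus < c_minus


# Hypothetical checks against a base of 100.
big_gain = signed_outlier(160, 100)
small_gain = signed_outlier(105, 100)
big_drop = signed_outlier(50, 100)
small_drop = signed_outlier(90, 100)
```

Separate `q` and `C` values per direction let an analyst be stricter about, say, declines than about growth.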

  17. Nominal Time • Compare 2 sets of estimates; one set can be the actual values, $A_i$. • Assumptions: • Unbiased: $E B_i = E F_i = A_i$. • Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$. • These imply $q = -1/2$. • Either set of estimates can be used for $B_i$, $F_i$. • Exception: $A_i$ can only be substituted for $B_i$.
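One way to see why proportionate variance leads to $q = -1/2$ is sketched below. The common correlation $\rho$ between the two sets of estimates is an added assumption for this sketch; the slide states only unbiasedness and proportionate variance.

```latex
\[
  \mathrm{Var}(F_i - B_i)
  = \mathrm{Var}(F_i) + \mathrm{Var}(B_i) - 2\,\mathrm{Cov}(F_i, B_i)
  = 2 \sigma^2 A_i (1 - \rho)
\]
\[
  \mathrm{Var}\!\left( \frac{F_i - B_i}{\sqrt{A_i}} \right) = 2 \sigma^2 (1 - \rho)
\]
```

The scaled difference has the same variance for every unit, so dividing by the square root of size puts all size classes on a common basis. Using $B_i$ in place of the unknown $A_i$ gives $|F_i - B_i| B_i^{-1/2}$.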

  18. How to Use: No Preexisting Outlier Criteria • Start with $q = -0.5$. • Adjust by increments of 0.1 to get a “good” distribution of outliers. • Alternative: Start with $q = \log(\text{range})/25 - 1$, where range is the range of the data (Bryan, 1999). • Can adjust.
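The two starting points can be sketched as follows. The base of the logarithm in Bryan's formula is not stated, so `starting_q` takes the log function as a required argument rather than guessing; that parameter, both function names, and the example data are assumptions for illustration.

```python
import math


def starting_q(values, log_fn):
    """q = log(range)/25 - 1 (Bryan, 1999); log_fn supplies the unstated log base."""
    data_range = max(values) - min(values)
    if data_range <= 0:
        raise ValueError("data must not all be equal")
    return log_fn(data_range) / 25 - 1


def rank_by_loss(pairs, q):
    """Indices of (F, B) pairs, largest loss first, for the analyst to examine."""
    losses = [abs(f - b) * b ** q for f, b in pairs]
    return sorted(range(len(pairs)), key=lambda i: losses[i], reverse=True)


# Hypothetical (F, B) pairs; compare rankings at q = -0.5 and its neighbours.
pairs = [(110, 100), (11000, 10000), (10010, 10000), (150, 100)]
rankings = {q: rank_by_loss(pairs, q) for q in (-0.6, -0.5, -0.4)}

# Alternative starting value, here with base-10 logs as one possible reading.
q_alt = starting_q([b for _, b in pairs], math.log10)
```

Whether a given ranking is a “good” distribution of outliers remains the analyst's judgment; the code only makes the comparison across `q` values quick.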

  19. How to Use: Preexisting Discrete Outlier Criteria • Start with a schedule of critical pairs $(\epsilon_j, B_j)$, where $\epsilon_j$ is the absolute difference at the cutoff for base $B_j$. • These pairs (approximately) satisfy the equation $\epsilon B^q = C$ for some $q$ and $C$. They are the cutoffs between outliers and nonoutliers. • Run the regression $\log \epsilon_j = -q \log B_j + K$. • Then $C = e^K$.
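The regression step can be sketched with ordinary least squares on natural logs, written out by hand so that no external library is needed. The function name and the example schedule are hypothetical; the schedule was constructed to satisfy $\epsilon B^{-0.5} = 2$ exactly so the recovered values are easy to check.

```python
import math


def fit_q_and_c(critical_pairs):
    """Fit log(eps_j) = -q * log(B_j) + K by least squares; return (q, C = e**K)."""
    xs = [math.log(base) for _, base in critical_pairs]
    ys = [math.log(eps) for eps, _ in critical_pairs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # slope estimates -q, intercept is K


# Hypothetical schedule of (absolute difference, base) cutoffs.
schedule = [(20, 100), (200, 10000), (2000, 1000000)]
q_hat, c_hat = fit_q_and_c(schedule)
```

Natural logs are used because the slide recovers $C$ as $e^K$; with a real schedule the fit will be approximate rather than exact.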

  20. Loss Functions and GIS • Loss functions can be used with GIS to focus analyst’s attention on problem areas. • Maps compare tax method county population estimates to unconstrained housing unit method estimates. • q = –0.5 in loss function map.

  21. Absolute Differences between the Population Estimates

  22. Percent Absolute Differences between the Population Estimates

  23. Loss Function Values

  24. Outliers Classified by Another Variable • $D_i$ is a function of 2 successive observations. • $R_i$ is the “reference” variable, used to classify outliers. • Start with a schedule of critical pairs $(D_j, R_j)$. • Run the regression $\log D_j = a + b \log R_j$. • Then $L(D, R) = D R^{-b}$ and $C = e^a$, since the critical pairs satisfy $D_j R_j^{-b} = e^a$.

  25. What to Do with Negative Data • From Coleman and Bryan (2000): $L(F,B) = |F - B| (|F| + |B|)^q$ if $B \neq 0$ or $F \neq 0$, and $L(F,B) = 0$ if $B = F = 0$. $S(F,B) = (F - B)(|F| + |B|)^q$ if $B \neq 0$ or $F \neq 0$, and $S(F,B) = 0$ if $B = F = 0$. • $0 > q > -1$. Suggest $q \approx -0.5$.
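A sketch of the two functions for data that may be zero or negative, following the piecewise definitions above. The function names and example values are illustrative.

```python
def loss_any_sign(future, base, q=-0.5):
    """|F - B| * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return abs(future - base) * scale ** q


def signed_loss_any_sign(future, base, q=-0.5):
    """(F - B) * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return (future - base) * scale ** q


# Hypothetical values: a sign change from +50 to -50, and the all-zero case.
flip = loss_any_sign(-50, 50)
signed_flip = signed_loss_any_sign(-50, 50)
both_zero = loss_any_sign(0, 0)
```

Using $|F| + |B|$ as the size measure keeps the power well defined when the base is zero or negative, which $B^q$ alone would not.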

  26. Summary • Defined panel data. • Defined outliers. • Created several types of loss functions to detect outliers in panel data. • Loss functions are empirical (except for nominal time). • Showed several applications, including GIS.

  27. URL for Presentation http://chuckcoleman.home.dhs.org/fscpela.ppt
