Loss Functions for Detecting Outliers in Panel Data. Charles D. Coleman, Thomas Bryan, Jason E. Devine, U.S. Census Bureau. Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000.
Panel Data A.k.a. “longitudinal data.” $x_{it}$: • $i$ indexes cross-sectional units, which retain their identities over time. Examples: geographic areas, persons, households, companies, autos. • $t$ indexes time, which can be chronological or nominal. • Chronological time measures the time elapsed between two dates. • Nominal time indexes different sets of estimates; it can also index true values.
Notation • $B_i$ is the base value for unit $i$. • $F_i$ is the “future” value for unit $i$. • $F_{it}$ is the future value for unit $i$ at time $t$. • $B_i, F_i, F_{it} > 0$. • $\epsilon_i = |F_i - B_i|$ is the absolute difference for unit $i$. • Subscripts are dropped when not needed.
What is an Outlier? “[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” D.M. Hawkins, Identification of Outliers, 1980, p. 1.
Meaning of an Outlier • Either an indication of a problem with the data generation process, • or a true, but unusual, statement about reality.
Loss Functions • Motivations: the absolute differences $\epsilon_i = |F_i - B_i|$ come from unknown distributions; we want to compare multiple size classes on the same basis. • $L(F_i; B_i) \equiv \ell(\epsilon_i, B_i)$ is the loss function for observation $i$. • Loss functions measure “badness.” • Loss functions produce rankings of observations to be examined. • Loss functions are empirically based, except for one special case in nominal time.
Assumption 1 Loss is symmetric in error: $L(B + \epsilon; B) = L(B - \epsilon; B)$
Assumption 2 Loss increases in difference: $\partial \ell / \partial \epsilon > 0$
Assumption 3 Loss decreases in base value: $\partial \ell / \partial B < 0$
Property 1 Loss associated with a given absolute percentage difference ($|\epsilon / B|$) increases in $B$.
Simplest Loss Function $L(F; B) = |F - B|\,B^{q}$ (1a) or $\ell(\epsilon, B) = \epsilon B^{q}$ (1b), where $\epsilon = |F - B|$ and $-1 < q < 0$.
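A minimal Python sketch of equation (1a), assuming the default q = -0.5 suggested later in the talk. The function name `loss` and the sample values are illustrative, not taken from the slides. The two calls apply the same 10% difference to a small and a large base, which is the situation Property 1 describes.

```python
def loss(f: float, b: float, q: float = -0.5) -> float:
    """Simplest loss, equation (1a): |F - B| * B**q, for B > 0 and -1 < q < 0."""
    if b <= 0:
        raise ValueError("base value B must be positive")
    if not -1 < q < 0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(f - b) * b ** q


# Hypothetical values: the same 10% difference at two base sizes.
small = loss(110.0, 100.0)        # 10 * 100**-0.5, about 1
large = loss(11_000.0, 10_000.0)  # 1000 * 10000**-0.5, about 10
print(small, large)
```

The larger base gets the larger loss for the same percentage difference (Property 1), while the loss per unit of absolute difference is smaller for the larger base (Assumption 3).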
Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference • This generates a loss function with $q = -s/(r + s)$. • An infinite number of pairs $(r, s)$ correspond to any given $q$.
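The combining formula itself did not survive on this slide. One form consistent with the stated $q$, offered here as an assumption rather than as the authors' formula, is a product of powers with weights $r, s > 0$, writing $\epsilon = |F - B|$:

```latex
% Assumed form (not shown on the slide): weights r, s > 0
\[
\tilde{L} = \epsilon^{r}\left(\frac{\epsilon}{B}\right)^{s}
          = \epsilon^{r+s} B^{-s},
\qquad
\tilde{L}^{1/(r+s)} = \epsilon\, B^{-s/(r+s)} = \epsilon\, B^{q},
\quad q = -\frac{s}{r+s}.
\]
```

Because $x \mapsto x^{1/(r+s)}$ is increasing, $\tilde{L}$ and $\epsilon B^{q}$ rank observations identically, and $r, s > 0$ gives $-1 < q < 0$. Scaling both weights by the same positive constant leaves $q$ unchanged, which is why infinitely many pairs $(r, s)$ map to one $q$.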
Outlier Criterion • An outlier is declared whenever $L(F; B) \equiv \ell(\epsilon, B) > C$. • $C$ is the “critical value.” • $C$ can be determined in advance or as a function of the data (e.g., a quantile or a multiple of a scale measure).
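As a sketch of a data-driven critical value, the code below sets C to a multiple of the median loss. The multiplier `k`, the function name `flag_outliers`, and the sample pairs are all hypothetical; the slide does not prescribe a particular scale measure.

```python
import statistics


def flag_outliers(pairs, q=-0.5, k=5.0):
    """Return (C, indices of outliers) for a list of (F, B) pairs with B > 0.

    C is k times the median loss: one assumed choice of
    "multiple of a scale measure", not the authors' rule.
    """
    losses = [abs(f - b) * b ** q for f, b in pairs]
    c = k * statistics.median(losses)
    flagged = [i for i, value in enumerate(losses) if value > c]
    return c, flagged


# Hypothetical (F, B) pairs; only the fourth has a large loss.
pairs = [(105, 100), (1020, 1000), (9900, 10000), (300, 100), (50500, 50000)]
critical_value, outliers = flag_outliers(pairs)
print(critical_value, outliers)
```

A fixed C chosen in advance would replace the `k * statistics.median(losses)` line with a constant.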
Loss Function Variants • Time-Invariant Loss Function • Signed Loss Function • Nominal Time
Time-Invariant Loss Function • Idea: compare multiple dates of data on the same basis. • Time need not be a round number. • $L(F_{it}; B_i, t) = |F_{it} - B_i|\,B_i^{tq}$ • Property 1 is satisfied as long as $t < -1/q$. • Thus, the useful horizon is limited.
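The horizon limit can be checked directly, reading the exponent on the base value as $tq$ (the reading under which the stated bound holds). Fix an absolute percentage difference $p$, so that $|F_{it} - B_i| = p B_i$:

```latex
\[
L = p\,B_i \cdot B_i^{tq} = p\,B_i^{\,1 + tq},
\qquad
\frac{\partial L}{\partial B_i} = p\,(1 + tq)\,B_i^{\,tq} > 0
\iff 1 + tq > 0
\iff t < -\frac{1}{q} \quad (q < 0).
\]
```

With $q = -0.5$, for instance, Property 1 holds only for $t < 2$; with $q = -0.1$ it holds for $t < 10$.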
Signed Loss Function • Idea: account for direction and magnitude of loss: $S(F; B) = (F - B)\,B^{q}$ • Can use asymmetric critical values and “q”s: declare outliers whenever $S^{+}(F; B) = (F - B)\,B^{q^{+}} > C^{+}$ or $S^{-}(F; B) = (F - B)\,B^{q^{-}} < C^{-}$, with $C^{+} \neq -C^{-}$, $q^{+} \neq q^{-}$.
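A short sketch of the asymmetric test follows. The parameter defaults (`q_plus`, `q_minus`, `c_plus`, `c_minus`) are hypothetical values chosen only to show that gains and declines can be judged differently; none of them come from the slides.

```python
def signed_loss(f: float, b: float, q: float) -> float:
    """Signed loss S(F; B) = (F - B) * B**q, for B > 0."""
    return (f - b) * b ** q


def is_signed_outlier(f, b, q_plus=-0.5, q_minus=-0.4, c_plus=5.0, c_minus=-3.0):
    """Flag a unit if its gain exceeds C+ or its decline falls below C-.

    All four parameter defaults are illustrative assumptions.
    """
    too_high = signed_loss(f, b, q_plus) > c_plus
    too_low = signed_loss(f, b, q_minus) < c_minus
    return too_high or too_low


print(is_signed_outlier(200.0, 100.0))  # large gain
print(is_signed_outlier(104.0, 100.0))  # small gain
print(is_signed_outlier(70.0, 100.0))   # decline judged by the stricter lower test
```

Here a decline of 30 on a base of 100 is flagged while a gain of 30 would not be, because the lower test uses a different exponent and a critical value closer to zero.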
Nominal Time • Compare 2 sets of estimates; one set can be actual values, $A_i$. • Assumptions: • Unbiased: $E B_i = E F_i = A_i$. • Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^{2} A_i$. • Then $q = -1/2$. • Either set of estimates can be used for $B_i$, $F_i$. • Exception: $A_i$ can only be substituted for $B_i$.
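A brief sketch of why these assumptions point to $q = -1/2$. It additionally assumes that $B_i$ and $F_i$ are uncorrelated, which the slide does not state:

```latex
% Extra assumption (not on the slide): Cov(B_i, F_i) = 0
\[
E(F_i - B_i) = 0,
\qquad
\mathrm{Var}(F_i - B_i) = \mathrm{Var}(F_i) + \mathrm{Var}(B_i) = 2\sigma^{2} A_i,
\qquad
\mathrm{Var}\!\left(\frac{F_i - B_i}{\sqrt{A_i}}\right) = 2\sigma^{2}.
\]
```

Dividing the difference by $\sqrt{A_i}$ puts every unit on the same scale regardless of size. Substituting the unbiased $B_i$ for the unknown $A_i$ gives $|F_i - B_i|\,B_i^{-1/2}$, the simplest loss function with $q = -1/2$.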
How to Use: No Preexisting Outlier Criteria • Start with $q = -0.5$. • Adjust by increments of 0.1 to get a “good” distribution of outliers. • Alternative: start with $q = \log(\text{range})/25 - 1$, where range is the range of the data (Bryan, 1999). • This starting value can also be adjusted.
How to Use: Preexisting Discrete Outlier Criteria • Start with a schedule of critical pairs $(\epsilon_j, B_j)$, where $\epsilon_j$ is the critical absolute difference at base value $B_j$. • These pairs (approximately) satisfy the equation $\epsilon B^{q} = C$ for some $q$ and $C$. They are the cutoffs between outliers and nonoutliers. • Run the regression $\log \epsilon_j = -q \log B_j + K$. • Then, $C = e^{K}$.
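The regression step can be sketched with ordinary least squares written out by hand, so only the standard library is needed. The function name `fit_q_and_c` and the schedule of cutoffs are hypothetical; the schedule was built to lie close to a cutoff of 2 times the square root of the base.

```python
import math


def fit_q_and_c(critical_pairs):
    """Fit log(eps_j) = -q * log(B_j) + K by least squares; return (q, C = e**K).

    critical_pairs: list of (eps_j, B_j), both positive.
    """
    xs = [math.log(b) for _, b in critical_pairs]
    ys = [math.log(eps) for eps, _ in critical_pairs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx              # estimate of -q
    k = y_bar - slope * x_bar      # intercept K
    return -slope, math.exp(k)


# Hypothetical schedule of (critical difference, base value) pairs.
schedule = [(20.0, 100.0), (63.2, 1_000.0), (200.0, 10_000.0), (632.0, 100_000.0)]
q, c = fit_q_and_c(schedule)
print(round(q, 3), round(c, 3))
```

The fitted `q` and `c` then plug into the outlier criterion: a unit is flagged when its loss at the fitted `q` exceeds `c`.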
Loss Functions and GIS • Loss functions can be used with GIS to focus analyst’s attention on problem areas. • Maps compare tax method county population estimates to unconstrained housing unit method estimates. • q = –0.5 in loss function map.
[Map] Percent Absolute Differences between the Population Estimates
Outliers Classified by Another Variable • $D_i$ is a function of 2 successive observations. • $R_i$ is a “reference” variable, used to classify outliers. • Start with a schedule of critical pairs $(D_j, R_j)$. • Run the regression $\log D_j = a + \beta \log R_j$. • Then, $L(D, R) = D R^{b}$ with $b = -\beta$, and $C = e^{a}$.
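What to Do with Negative Data • From Coleman and Bryan (2000): $L(F, B) = |F - B|\,(|F| + |B|)^{q}$ if $B \neq 0$ or $F \neq 0$, and $L(F, B) = 0$ if $B = F = 0$. $S(F, B) = (F - B)\,(|F| + |B|)^{q}$ if $B \neq 0$ or $F \neq 0$, and $S(F, B) = 0$ if $B = F = 0$. • $-1 < q < 0$. Suggest $q \approx -0.5$.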
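A minimal sketch of the two functions for data that can be negative or zero. The function names are illustrative, and the default q = -0.5 is an assumption based on the suggested value.

```python
def loss_any_sign(f: float, b: float, q: float = -0.5) -> float:
    """|F - B| * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(f) + abs(b)
    if scale == 0:
        return 0.0
    return abs(f - b) * scale ** q


def signed_loss_any_sign(f: float, b: float, q: float = -0.5) -> float:
    """(F - B) * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(f) + abs(b)
    if scale == 0:
        return 0.0
    return (f - b) * scale ** q


# Hypothetical values: a swing from +50 to -50.
print(loss_any_sign(-50.0, 50.0), signed_loss_any_sign(-50.0, 50.0))
print(loss_any_sign(0.0, 0.0))
```

Using $|F| + |B|$ as the size measure keeps the base of the power positive whenever either value is nonzero, so the loss is defined for any mix of signs.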
Summary • Defined panel data. • Defined outliers. • Created several types of loss functions to detect outliers in panel data. • Loss functions are empirical (except for nominal time). • Showed several applications, including GIS.
URL for Presentation http://chuckcoleman.home.dhs.org/fscpela.ppt