
STATISTICAL METHODS AND DATA MANAGEMENT TOOLS FOR OUTLIER DETECTION IN TRI DATA

Dr. Nagaraj K. Neerchal and Justin Newcomer, Department of Mathematics and Statistics, UMBC, and Barry Nussbaum, Office of Environmental Information, US EPA


Presentation Transcript


  1. STATISTICAL METHODS AND DATA MANAGEMENT TOOLS FOR OUTLIER DETECTION IN TRI DATA • Dr. Nagaraj K. Neerchal and Justin Newcomer, Department of Mathematics and Statistics, UMBC, and Barry Nussbaum, Office of Environmental Information, US EPA

  2. Background • Challenges with TRI data • Self-reported data • Compare each facility to its “peers” • Objectives: • Investigate the use of statistical methods in identifying anomalous data (outliers) in the TRI database • Develop data management tools to help in the outlier detection process

  3. Comparison Within Peer Groups • Statistical methods, appropriately modified, are useful in identifying potential outliers that are not evident from examining the total release alone • High releases do not necessarily indicate problems with the reported data • Need to cluster the data into groups of peers: • Facilities in the same “peer group” are expected to have similar release values

  4. Statistical Approach • Analyze facilities within their own peer groups • Fit an ANOVA model for total emission releases: • Many ways to estimate the total release • ANOVA Model, Jackknife Technique, ... • Obtain a residual for a facility by comparing its actual emissions to the predictions based on its peers • Facilities reporting on the same chemical are considered “peers”

  5. Jackknife Technique • Facilities reporting on chemical j: • The predicted value of the release of chemical j for facility i:
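
The formulas on this slide did not survive in the transcript. As a minimal sketch, assuming the jackknife prediction for a facility is the mean release of every other facility reporting on the same chemical (a leave-one-out mean), the calculation could look like the following. The function name, facility ids, and release values are hypothetical.

```python
def jackknife_predictions(releases):
    """Leave-one-out prediction for each facility reporting on one chemical.

    `releases` maps a facility id to its reported release of the chemical.
    The prediction for a facility is the mean release of its peers, i.e.
    every other facility reporting on the same chemical (an assumption;
    the slide's own formula is not available).
    """
    n = len(releases)
    if n < 2:
        raise ValueError("need at least two facilities to form a peer group")
    total = sum(releases.values())
    # Removing facility i from the total leaves the sum over its n - 1 peers.
    return {fac: (total - x) / (n - 1) for fac, x in releases.items()}


# Hypothetical releases (in pounds) of one chemical by four facilities.
releases_j = {"A": 10.0, "B": 12.0, "C": 11.0, "D": 50.0}
predicted = jackknife_predictions(releases_j)

# Residual: actual release minus the prediction based on peers.
residuals = {fac: releases_j[fac] - predicted[fac] for fac in releases_j}
```

Because facility D is excluded from its own prediction, its unusually large value does not pull the prediction toward itself, which is the usual motivation for a leave-one-out approach.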

  6. Defining a Metric • Studentized Residuals give us a unitless measure to compare facilities reporting on different chemicals: • For the Jackknife technique we have, • Define a metric that allows us to compare facilities that report on a different number of chemicals:
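
The slide's formulas are missing from the transcript, so the sketch below rests on two stated assumptions. First, the studentized residual is taken to be the leave-one-out residual divided by the standard error of predicting one new value from the peer mean. Second, the per-facility metric is taken to be the average squared studentized residual over the chemicals a facility reports on, which is one plausible way to adjust for a differing number of chemicals. All names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def studentized_residuals(releases):
    """Leave-one-out studentized residual for each facility on one chemical.

    Assumes at least three facilities and peers whose releases are not
    all identical (otherwise the peer standard deviation is zero).
    """
    if len(releases) < 3:
        raise ValueError("need at least three facilities per chemical")
    out = {}
    for fac, x in releases.items():
        peers = [v for f, v in releases.items() if f != fac]
        m = len(peers)
        # Standard error of predicting one new value from the peer mean.
        se = stdev(peers) * sqrt(1 + 1 / m)
        out[fac] = (x - mean(peers)) / se
    return out


def facility_metric(data):
    """Average squared studentized residual per facility (assumed metric).

    `data` maps chemical -> {facility: release}.
    """
    per_facility = {}
    for releases in data.values():
        for fac, r in studentized_residuals(releases).items():
            per_facility.setdefault(fac, []).append(r * r)
    return {fac: mean(sq) for fac, sq in per_facility.items()}


# Hypothetical data: facility D reports on one chemical, the others on two.
data = {
    "toluene": {"A": 10.0, "B": 12.0, "C": 11.0, "D": 50.0},
    "benzene": {"A": 5.0, "B": 6.0, "C": 4.0},
}
metric = facility_metric(data)
```

Dividing by the standard error removes the units of each chemical, and averaging over chemicals keeps facilities with many reports from scoring high simply because they have more terms.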

  7. Flagging the Outliers • Further investigate the facilities corresponding to extreme values of a defined metric: • Outliers are not necessarily wrong - just a place to look

  8. Flagging the Outliers • Is picking out the top 5 enough? • Quick and easy • May not sufficiently represent all outliers • Can we set a cutoff point? • Define c such that if facility i's metric exceeds c, then we consider facility i an outlier • Theoretical work can be done to examine properties of these metrics
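
The two flagging rules discussed on this slide can be compared side by side. This is a sketch only: the metric values are made up, and the cutoff of 5 is an arbitrary illustration rather than a value derived from theory.

```python
def top_n(metric, n=5):
    """Facilities with the n largest metric values, largest first."""
    return sorted(metric, key=metric.get, reverse=True)[:n]


def above_cutoff(metric, c):
    """Facilities whose metric value exceeds the cutoff c, sorted by id."""
    return sorted(fac for fac, value in metric.items() if value > c)


# Hypothetical metric values for seven facilities.
metric = {"A": 0.4, "B": 7.9, "C": 1.1, "D": 12.3, "E": 0.2, "F": 6.5, "G": 0.9}

print(top_n(metric))            # ['D', 'B', 'F', 'C', 'G']
print(above_cutoff(metric, 5))  # ['B', 'D', 'F']
```

Here the top-5 rule also returns C and G, whose values are unremarkable, while the cutoff rule returns only the three facilities that stand out. With more than five extreme facilities the top-5 rule would instead miss some of them.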

  9. Flagging the Outliers • Are there other metrics or distances we can use to compare facilities? • A multivariate analogue to the metric defined previously: • This distance depends on the number of chemicals a facility reports on (k) • Conditionally, given k, this distance follows an F-distribution • What can we say about the marginal distribution? • From here if we estimate  we can use the percentiles of the noncentral F-distribution to find a cutoff point c
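
The question about the marginal distribution can be framed with the law of total probability. Writing D_i for the distance of facility i and K for the number of chemicals it reports on (both symbols are notation introduced here, not taken from the slides), the probability of exceeding a cutoff c is a mixture of the conditional tail probabilities:

```latex
P(D_i > c) \;=\; \sum_{k} P(K = k)\, P(D_i > c \mid K = k)
```

Under this reading, choosing c for a given marginal level requires the conditional distribution of the distance for each k together with the weights P(K = k).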

  10. Flagging the Outliers • We can consider, • Estimate  from the observed distribution of the number of chemicals being reported on:
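
The symbol being estimated is missing from the transcript. One plausible reading, assumed here, is that it is the distribution of k, the number of chemicals a facility reports on. A minimal sketch of estimating that distribution from observed reports follows; the function name and the report records are hypothetical.

```python
from collections import Counter


def chemical_count_distribution(reports):
    """Empirical distribution of k, the number of chemicals per facility.

    `reports` is an iterable of (facility, chemical) pairs. Returns a dict
    mapping each observed k to the fraction of facilities reporting on
    exactly k distinct chemicals.
    """
    chemicals = {}
    for fac, chem in reports:
        chemicals.setdefault(fac, set()).add(chem)
    counts = Counter(len(chems) for chems in chemicals.values())
    n = len(chemicals)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical reports: A has 2 chemicals, B has 1, C has 3, D has 1.
reports = [
    ("A", "toluene"), ("A", "benzene"),
    ("B", "toluene"),
    ("C", "toluene"), ("C", "benzene"), ("C", "lead"),
    ("D", "toluene"),
]
p_k = chemical_count_distribution(reports)
```

The resulting proportions could serve as the weights when combining conditional distributions across values of k.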

  11. TRI Trend Tool • The TRI Trend Tool provides an easy way to acquire yearly emissions data on TRI facilities from 1995 through 2003 • Allows the user to group the data by Chemical, State, and SIC Codes • Provides total emissions data over all facilities or individual data for each facility • Incorporates metrics that allow the user to compare facilities regardless of type or number of chemicals being reported on

  12. Functionality • The tool allows the user to obtain subsets of facilities, calculate totals, and identify possible outliers

  13. Grouping Variables • The user can group records by Chemical, SIC Code, State, or any combination of the three
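
The grouping behavior described here can be illustrated outside the tool. The sketch below is not the TRI Trend Tool's implementation; it only shows, with hypothetical records and field names, how totals might be computed for any combination of grouping variables.

```python
from collections import defaultdict


def total_by_group(records, keys):
    """Sum releases over records sharing the same values of `keys`.

    `records` is a list of dicts with a numeric "release" field;
    `keys` is a list of field names such as ["chemical", "state"].
    """
    totals = defaultdict(float)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += rec["release"]
    return dict(totals)


# Hypothetical facility-level records for a single year.
records = [
    {"facility": "A", "chemical": "toluene", "sic": "2911", "state": "MD", "release": 10.0},
    {"facility": "B", "chemical": "toluene", "sic": "2911", "state": "VA", "release": 12.0},
    {"facility": "C", "chemical": "toluene", "sic": "2819", "state": "MD", "release": 11.0},
    {"facility": "A", "chemical": "benzene", "sic": "2911", "state": "MD", "release": 5.0},
]

by_state = total_by_group(records, ["state"])
by_chemical_and_state = total_by_group(records, ["chemical", "state"])
```

Passing the grouping variables as a list means a single function covers one variable, two, or all three.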

  14. Create Multiple Subsets • The tool provides data over multiple years • Subsets can be requested for multiple levels of a grouping variable

  15. Example 1

  16. Create Multiple Subsets • Subsets can be created using multiple grouping variables

  17. Example 2

  18. Create Multiple Subsets • The user can view and save facility level data

  19. Example 3

  20. Incorporating the Metrics • The user can identify the top 5 facilities with extreme values of the metrics defined previously • The outlier detection process can be refined by grouping by State and/or SIC Code

  21. Example 4
