Erroneous Distribution Data Identification Using Outlier Detection Techniques

Erroneous Distribution Data Identification Using Outlier Detection Techniques. W. Zhuang, Y. Zhang, J.F. Grassle. Rutgers, the State University of New Jersey, USA. Overview: review of OBIS DQ issues; review of existing DQ methods; case study on detecting outliers in multidimensional data.


  1. Erroneous Distribution Data Identification Using Outlier Detection Techniques • W. Zhuang, Y. Zhang, J.F. Grassle • Rutgers, the State University of New Jersey, USA

  2. Overview • Review of OBIS DQ-issues • Review of existing DQ methods • Case study: detecting outliers in multidimensional data • Discussion and future directions

  3. Data Quality (DQ) • DQ problems can arise at every step of the data life cycle:

  4. DQ problems (I) • Data gathering: instrument failures; false identifications; geo-referencing errors • Data storage: missing key metadata; erroneous data entry; database default values masquerading as real values

  5. DQ problems (II) • Data delivery: data corruption due to encoding conversion • Data integration: duplicated records • Data retrieval: missing values • Data analysis/cleaning: inappropriate models used, etc.

  6. DQ solving: a process-based approach • DQ solving is an essential component of data analysis and thus part of the data life cycle • A. It builds a foundation for analysis and modeling • B. It provides feedback to improve the whole data life cycle • C. It could lead to more DQ problems if not carefully executed

  7. DQ solving methods • Harvest metadata close to data • Built-in integrity check and double data entry • Model-based approach: a) statistical b) heuristic
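The built-in integrity checks mentioned above, applied to the storage and integration problems listed on slides 4 and 5, might look roughly like the sketch below. This is only an illustration: the record keys (`name`, `lat`, `lon`), the choice of (0, 0) as a suspicious default value, and the function name are assumptions, not part of the OBIS workflow described in the talk.

```python
from collections import Counter

def integrity_flags(records):
    """Return {index: [reasons]} for records failing simple checks.

    Each record is a dict with the hypothetical keys 'name', 'lat', 'lon'.
    """
    flags = {}
    # Count identical (name, lat, lon) triples to spot redundant records.
    counts = Counter((r.get("name"), r.get("lat"), r.get("lon")) for r in records)
    for i, r in enumerate(records):
        reasons = []
        lat, lon = r.get("lat"), r.get("lon")
        if lat is None or lon is None:
            reasons.append("missing coordinate")
        else:
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                reasons.append("coordinate out of range")
            # Assumption: (0, 0) is treated as a likely database default.
            if lat == 0 and lon == 0:
                reasons.append("possible default value (0, 0)")
        if not r.get("name"):
            reasons.append("missing scientific name")
        if counts[(r.get("name"), lat, lon)] > 1:
            reasons.append("duplicate record")
        if reasons:
            flags[i] = reasons
    return flags
```

Checks of this kind only flag candidates; whether a flagged record is actually wrong is still a decision for someone who knows the data.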

  8. OBIS DQ Study • Metadata-related problems • DQ on scientific names • Integrity checking • Redundant records detection • Outlier detection: a case study • Outliers sometimes represent erroneous data • We are examining data mining tools for detecting erroneous data points

  9. DBSCAN: a clustering tool • DBSCAN is density-based in feature space • It deals with high-dimensional data • There is no need to specify the number of clusters • It identifies outliers during the clustering process • It is a fast algorithm and freely available • M. Ester, H.-P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise

  10. A diagram of DBSCAN • Point types shown: core, border, outlier • Parameters: ε = 1 unit, MinPts = 5
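The core, border and outlier roles in the diagram can be made concrete with a small pure-Python sketch of DBSCAN. It is a brute-force illustration under stated assumptions (Euclidean distance, a point counts itself among its neighbours, function and variable names are my own), not the implementation used in the case study.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return (labels, kinds): a cluster id per point (-1 for outliers)
    and 'core' / 'border' / 'outlier' per point."""
    n = len(points)
    # Brute-force eps-neighbourhoods (each point counts itself), O(n^2).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)  # only core points expand the cluster
        cluster += 1
    kinds = [
        "core" if is_core[i] else "border" if labels[i] != NOISE else "outlier"
        for i in range(n)
    ]
    return labels, kinds
```

With the diagram's settings one would call `dbscan(points, eps=1.0, min_pts=5)`: a point with at least five neighbours within one unit is core, a non-core point within one unit of a core point is border, and everything else is reported as an outlier.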

  11. Total points distribution

  12. Result from DBSCAN

  13. Limitations of the method • Geographical outliers may be used to identify erroneous points in survey data, but may not be suitable for museum collections or literature-based data records. • Are there other methods to identify erroneous distribution data? How about using environmental data as proxies?

  14. Can we get some more information?

  15. Limitations of using environmental variables • Risk of imposing a rigid model at the time of pre-processing • Risk of losing valuable outliers • Risk of circular logic in later analyses

  16. Discussions • Why don’t you use more environmental variables? • Can you use DBSCAN on environmental variables directly?
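One practical point behind the second question: a density-based method relies on a single distance threshold, so environmental variables measured in different units would usually need to be put on a common scale first. The sketch below z-scores each column; the choice of z-scoring and the example variables (temperature and depth) are assumptions for illustration, not something the talk prescribes.

```python
import statistics

def standardize(rows):
    """Z-score each column of a list of equal-length numeric tuples.

    Columns with zero spread are mapped to 0.0 to avoid dividing by zero.
    """
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds))
        for row in rows
    ]
```

For example, rows of hypothetical (temperature in °C, depth in m) pairs could be passed through `standardize` before any distance-based clustering, so that depth does not dominate the distance simply because its numbers are larger.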

  17. Possible improvements • Define multiple methods as DQ components • Assign bootstrap weights • Present outlier candidates to experts • Update weights based on user feedback
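The four improvements above could fit together roughly as in the sketch below: each DQ method contributes a weighted vote, the ranked candidates go to an expert, and the expert's verdicts adjust the weights. The slide does not define its "bootstrap weights", so equal starting weights, the multiplicative update rule, and the `rate` parameter are all assumptions made for illustration.

```python
def score_candidates(flags_by_method, weights):
    """flags_by_method: {method_name: set of flagged record ids}.
    Return record ids sorted by descending weighted vote (ties by id)."""
    scores = {}
    for method, flagged in flags_by_method.items():
        for rid in flagged:
            scores[rid] = scores.get(rid, 0.0) + weights[method]
    return sorted(scores, key=lambda rid: (-scores[rid], rid))

def update_weights(weights, flags_by_method, verdicts, rate=0.5):
    """verdicts: {record_id: True if the expert confirmed an error}.
    Reward methods that flagged confirmed errors, penalise those that
    flagged valid records, then renormalise so the weights sum to 1."""
    new = dict(weights)
    for method, flagged in flags_by_method.items():
        for rid, is_error in verdicts.items():
            if rid in flagged:
                new[method] *= (1 + rate) if is_error else (1 - rate)
    total = sum(new.values())
    return {m: w / total for m, w in new.items()}
```

A typical round would call `score_candidates` to build the review list, collect verdicts on some of the candidates, and then call `update_weights` before the next round.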

  18. Summary • Many data quality problems can arise throughout the data life cycle. • Preliminary checking can eliminate many simple errors. • Expert knowledge should be integrated and be the decisive factor when it comes to DQ solving. • Data mining techniques may act as metal detectors so that experts can focus on a narrowed-down group of candidates.