Erroneous Distribution Data Identification Using Outlier Detection Techniques W. Zhuang, Y. Zhang, J.F. Grassle Rutgers, the State University of New Jersey, USA
Overview • Review of OBIS DQ-issues • Review of existing DQ methods • Case study: detecting outliers in multidimensional data • Discussion and future directions
Data Quality (DQ) DQ problems can arise at every step of the data life cycle:
DQ problems (I) • Data gathering: instrument failures; false identifications; geo-referencing errors • Data storage: missing key metadata; erroneous data entry; database default values masquerading as real values
DQ problems (II) • Data delivery: data corruption due to encoding conversion • Data integration: duplicated records • Data retrieval: missing values • Data analysis/cleaning: inappropriate models used, etc.
DQ solving: a process-based approach • DQ solving is an essential component of data analysis and thus part of the data life cycle • A. It builds a foundation for analysis and modeling • B. It provides feedback to improve the whole data life cycle • C. It could lead to more DQ problems if not carefully executed
DQ solving methods • Harvest metadata close to data • Built-in integrity check and double data entry • Model-based approach: a) statistical b) heuristic
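As an illustration of the built-in integrity check and double data entry mentioned above, here is a minimal sketch. It assumes each record carries a latitude and longitude in decimal degrees; the sentinel values in `SUSPECT_DEFAULTS` and the function names are hypothetical and are not taken from OBIS.

```python
from typing import Dict, List, Optional

# Hypothetical sentinel values a database might store when a field is left empty.
SUSPECT_DEFAULTS = {0.0, -9999.0, 9999.0}


def check_record(lat: Optional[float], lon: Optional[float]) -> List[str]:
    """Return the integrity problems found in one occurrence record."""
    if lat is None or lon is None:
        return ["missing coordinate"]
    problems = []
    if not -90.0 <= lat <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= lon <= 180.0:
        problems.append("longitude out of range")
    # Both coordinates equal to a sentinel suggests a default, not an observation.
    if lat in SUSPECT_DEFAULTS and lon in SUSPECT_DEFAULTS:
        problems.append("coordinates look like database defaults")
    return problems


def double_entry_mismatches(first: Dict, second: Dict) -> List[str]:
    """Compare two independent keyings of one record; return the fields that differ."""
    return sorted(k for k in first.keys() | second.keys() if first.get(k) != second.get(k))
```

Checks of this kind only flag candidates; whether a record at (0, 0) is really an error is still a question for someone who knows the data set.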
OBIS DQ Study • Metadata-related problems • DQ on scientific names • Integrity checking • Redundant record detection • Outlier detection: a case study. Outliers sometimes represent erroneous data. We are examining data mining tools for detecting erroneous data points.
DBSCAN: a clustering tool • DBSCAN is density-based in feature space • It deals with high-dimensional data • There is no need to specify the number of clusters • It identifies outliers during the clustering process • It is a fast algorithm and freely available • M. Ester, H.-P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise
A diagram of DBSCAN [Figure: core, border, and outlier points, drawn with Eps = 1 unit and MinPts = 5]
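The core, border, and outlier roles in the diagram can be made concrete with a small, unoptimized sketch of DBSCAN in plain Python. This is not the implementation used in the study; the function names and the toy points are assumptions chosen for illustration, with a radius of 1 unit and MinPts = 5 to match the diagram.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label given to outliers


def region_query(points: Sequence[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)  # None means unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core, so expand through it
    return labels


# Toy data: a dense 3 x 3 grid, one point on its fringe, one isolated point.
grid = [(x * 0.5, y * 0.5) for x in range(3) for y in range(3)]
points = grid + [(1.9, 0.0), (10.0, 10.0)]
labels = dbscan(points, eps=1.0, min_pts=5)
```

In this toy run the nine grid points are core points of one cluster, the point at (1.9, 0.0) joins it as a border point, and the point at (10.0, 10.0) is labelled as noise, which is the outlier candidate an expert would then review.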
Limitation of the method • Geographical outliers may be used to identify erroneous points in survey data, but may not be suitable for museum collections or literature-based data records. • Are there other methods to identify erroneous distribution data? How about using environmental data as proxies?
Limitations of using environmental variables • Risk of imposing a rigid model at the time of pre-processing • Risk of losing valuable outliers • Risk of circular logic in later analyses
Discussions • Why don’t you use more environmental variables? • Can you use DBSCAN on environmental variables directly?
Possible improvements • Define multiple methods as DQ components • Assign bootstrap weights • Present outlier candidates to experts • Update weights based on user feedback
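One way to read the four steps above is as a weighted ensemble of detectors that is corrected by expert review. The sketch below is an interpretation, not the authors' design: it assumes "bootstrap weights" means equal starting weights, uses a simple multiplicative update, and all class, method, and field names are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Detector = Callable[[Record], float]  # returns a suspicion score in [0, 1]


class DQEnsemble:
    """Combine several DQ components and re-weight them from expert feedback."""

    def __init__(self, detectors: Dict[str, Detector]):
        self.detectors = detectors
        # Assumption: every component starts with the same weight.
        self.weights = {name: 1.0 / len(detectors) for name in detectors}

    def score(self, record: Record) -> float:
        """Weighted sum of the component scores for one record."""
        return sum(self.weights[n] * d(record) for n, d in self.detectors.items())

    def candidates(self, records: List[Record], threshold: float) -> List[Record]:
        """Records suspicious enough to be shown to an expert."""
        return [r for r in records if self.score(r) >= threshold]

    def feedback(self, record: Record, is_error: bool, rate: float = 0.5) -> None:
        """Shrink the weight of components that disagreed with the expert.

        Keep rate below 1 so that no weight can drop to zero.
        """
        target = 1.0 if is_error else 0.0
        for name, detector in self.detectors.items():
            disagreement = abs(detector(record) - target)
            self.weights[name] *= 1.0 - rate * disagreement
        total = sum(self.weights.values())
        for name in self.weights:
            self.weights[name] /= total  # renormalise so the weights sum to 1
```

The expert's verdict stays decisive here: the ensemble only ranks candidates, and the weights drift toward whichever components the experts confirm most often.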
Summary • Many data quality problems can arise throughout the data life cycle. • Preliminary checking can eliminate many simple errors. • Expert knowledge should be integrated and should be the decisive factor in DQ solving. • Data mining techniques may act as metal detectors, so that experts can focus on a narrowed-down group of candidates.