Fast mining of distance-based outliers in high-dimensional datasets

Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey
Data Mining and Knowledge Discovery, Vol. 16, No. 3, 2008
Reporter: CHENG-WEI, CHOU
Jan. 13, 2010
Group members: 89721002 周政緯, 陳永洲

Outline
• Introduction
• Distance-based outlier detection
• Outlier detection algorithm
• Experiment results
• Conclusion

Introduction
• A common problem: automatically finding outliers in a dataset
• Outliers: points that are highly unlikely to occur
• A measure of unusualness: a point's distance to its neighbors
• Existing algorithms perform poorly on high-dimensional data
• This paper further improves the scaling behavior of distance-based outlier detection on large, high-dimensional datasets

Distance-based outlier detection
• Three popular definitions of distance-based outliers:
  • Outliers are the data points for which there are fewer than p other data points within distance d
  • Outliers are the top n data points whose distance to their kth nearest neighbor is greatest
  • Outliers are the top n data points whose average distance to their k nearest neighbors is greatest
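The three definitions above can be written down directly as brute-force functions. This is only a minimal sketch for small inputs, not code from the paper: the function names are hypothetical, Euclidean distance is assumed, and "within distance d" is read as "less than or equal to d".

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest other points."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def outliers_by_count(points, p, d):
    """Definition 1: points with fewer than p other points within distance d."""
    result = []
    for i, x in enumerate(points):
        close = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(x, q) <= d)
        if close < p:
            result.append(i)
    return result

def top_n_by_kth_distance(points, n, k):
    """Definition 2: top n points ranked by distance to the kth nearest neighbor."""
    scores = [(knn_distances(points, i, k)[-1], i) for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]

def top_n_by_mean_distance(points, n, k):
    """Definition 3: top n points ranked by mean distance to the k nearest neighbors."""
    scores = [(sum(knn_distances(points, i, k)) / k, i)
              for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]

# Four points on a unit square plus one far-away point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(outliers_by_count(points, p=2, d=2.0))
print(top_n_by_kth_distance(points, n=1, k=2))
print(top_n_by_mean_distance(points, n=1, k=2))
```

On this toy input all three definitions single out the far-away point, but in general they can disagree, because the first is a yes/no test with fixed p and d while the other two produce a ranking.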

Distance-based outlier detection
• NL (nested loop) algorithm: offers the best performance in high-dimensional spaces
• For each data point in D, scan the dataset and keep track of its k closest neighbors
• Maintain a cutoff threshold, c
• If the distance to a data point's kth closest neighbor found so far drops below c, the data point can no longer be an outlier and is pruned
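The nested-loop procedure with a cutoff can be sketched as follows. This is an illustrative reading of the bullets above, not the authors' implementation: the function name is hypothetical, Euclidean distance is assumed, and the cutoff c is taken to be the score of the weakest of the top n outliers found so far (it stays at 0 until n candidates exist).

```python
import heapq
import math

def nested_loop_outliers(points, n, k):
    """Return up to n (score, index) pairs, where score is the distance
    to the kth nearest neighbor, largest score first."""
    top = []       # min-heap of (score, index): current best n outliers
    cutoff = 0.0   # score of the weakest outlier currently in `top`
    for i, x in enumerate(points):
        nearest = []   # max-heap (negated distances) of the k closest so far
        pruned = False
        for j, q in enumerate(points):
            if j == i:
                continue
            dist = math.dist(x, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -dist)
            elif dist < -nearest[0]:
                heapq.heapreplace(nearest, -dist)
            # The current kth distance only shrinks as the scan continues,
            # so once it is below the cutoff the point cannot be a top-n outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(nested_loop_outliers(points, n=1, k=2))
```

The inner loop stops early whenever a point is pruned, so most non-outliers need only a partial scan once the cutoff has grown; how quickly that happens depends on the order in which points are visited.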

Outlier detection algorithm
• RBRP (Recursive Binning and Re-Projection)
• A two-phase algorithm for fast mining of distance-based outliers in high-dimensional datasets
• Finds the top n outliers in the dataset whose distance to their kth nearest neighbor is the greatest

Outlier detection algorithm
• First phase of RBRP
  • Goal: to partition the dataset into bins
  • Points that are close to each other in space are likely to be assigned to the same bin
  • A recursive procedure similar to divisive hierarchical clustering
• Second phase of RBRP: use an extension of the NL algorithm to find outliers in the dataset
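The binning idea in the first phase can be illustrated with a simplified sketch. Several things here are assumptions rather than details from the slides: the two-way split uses a few deterministic k-means-style iterations, the stopping rule is a hypothetical `max_bin_size` parameter, and `scan_order` is an illustrative helper for the second phase that visits a point's own bin first and then the other bins by centroid distance. The re-projection step of RBRP is not shown.

```python
import math

def centroid(pts):
    """Component-wise mean of a non-empty list of points."""
    dim = len(pts[0])
    return tuple(sum(p[a] for p in pts) / len(pts) for a in range(dim))

def split_in_two(pts, iterations=5):
    """Split pts into two groups around two centers (assumed 2-means style)."""
    c0 = pts[0]
    c1 = max(pts, key=lambda p: math.dist(p, c0))  # farthest point from c0
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in pts if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in pts if math.dist(p, c0) > math.dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = centroid(left), centroid(right)
    return left, right

def recursive_bins(pts, max_bin_size):
    """Phase 1 sketch: recursively split until every bin is small enough."""
    if len(pts) <= max_bin_size:
        return [pts]
    left, right = split_in_two(pts)
    if not left or not right:   # e.g. all points identical: cannot split
        return [pts]
    return recursive_bins(left, max_bin_size) + recursive_bins(right, max_bin_size)

def scan_order(bins, home):
    """Phase 2 helper sketch: own bin first, then other bins nearest first."""
    home_c = centroid(bins[home])
    others = sorted((b for b in range(len(bins)) if b != home),
                    key=lambda b: math.dist(centroid(bins[b]), home_c))
    return [home] + others

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
bins = recursive_bins(points, max_bin_size=4)
print(bins)
print(scan_order(bins, home=1))
```

Scanning nearby bins first means that a point's close neighbors tend to be seen early, so a nested-loop search with a cutoff can prune non-outliers after examining only a small part of the dataset.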

Outlier detection algorithm
• Time complexity of Phase 1:
  • Worst case:
  • Best case:

Outlier detection algorithm
• Average case:

Experiment results

Conclusion
• Presented RBRP
• RBRP improves upon the scaling behavior of the state-of-the-art
  • Provides theoretical arguments
  • Validates its scaling behavior
• Empirical results on real data back the above claim
  • RBRP realizes a significant speedup over ORCA

The End. Thank you!!
