Fast mining of distance-based outliers in high-dimensional datasets. Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey. Data Mining and Knowledge Discovery, Vol. 16, No. 3, 2008. Reporter: Cheng-Wei Chou, Jan. 13, 2010. Group members: 89721002 周政緯, 陳永洲.
Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion
Introduction • A common problem: automatically finding outliers • Outliers: points that are highly unlikely to occur • A measure of unusualness: a point's distance to its nearest neighbors • Existing algorithms do not perform well on high-dimensional data • This paper further improves the scaling behavior of distance-based outlier detection on large, high-dimensional datasets
Distance-based outlier detection • Three popular definitions of distance-based outliers: • Outliers are the data points for which there are fewer than p other data points within distance d • Outliers are the top n data points whose distance to their kth nearest neighbor is greatest • Outliers are the top n data points whose average distance to their k nearest neighbors is greatest
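The three definitions above can be compared side by side with a brute-force sketch. This is only an illustration, not code from the paper: the function names, the Euclidean metric, and the reading of "within distance d" as "less than or equal to d" are assumptions made here.

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest other points (brute force)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[:k]

def outliers_by_count(points, p, d):
    """Definition 1: indices of points with fewer than p other points within distance d."""
    result = []
    for i, x in enumerate(points):
        close = sum(1 for j, q in enumerate(points) if j != i and math.dist(x, q) <= d)
        if close < p:
            result.append(i)
    return result

def top_n_by_kth_distance(points, n, k):
    """Definition 2: indices of the n points with the greatest distance to their kth nearest neighbor."""
    scores = [(knn_distances(points, i, k)[-1], i) for i in range(len(points))]
    return [i for _, i in sorted(scores, reverse=True)[:n]]

def top_n_by_mean_distance(points, n, k):
    """Definition 3: indices of the n points with the greatest average distance to their k nearest neighbors."""
    scores = [(sum(knn_distances(points, i, k)) / k, i) for i in range(len(points))]
    return [i for _, i in sorted(scores, reverse=True)[:n]]

# Four points on a unit square plus one far-away point (index 4).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(outliers_by_count(points, p=2, d=2.0))      # [4]
print(top_n_by_kth_distance(points, n=1, k=2))    # [4]
print(top_n_by_mean_distance(points, n=1, k=2))   # [4]
```

On this toy dataset all three definitions agree. On real data they can rank points differently, and the first definition needs the extra parameters p and d, whereas the other two need only n and k.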
Distance-based outlier detection • NL (nested loop) algorithm: the best-performing approach in high-dimensional spaces • For each data point in D, scan the dataset and keep track of its k closest neighbors • Maintain a cutoff threshold c (the score of the weakest outlier among the current top n) • If the distance from a data point to its current kth closest neighbor falls below c, the data point can no longer be an outlier and its scan can stop early
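A minimal sketch of the nested-loop idea with cutoff pruning, assuming the "distance to the kth nearest neighbor" score and Euclidean distance. The function name and the heap-based bookkeeping are choices made for this illustration, not details taken from the slides; the cutoff c is taken to be the score of the weakest point in the current top-n list.

```python
import heapq
import math

def nested_loop_outliers(data, k, n):
    """Return the top-n (score, index) pairs, where score = distance to the kth nearest neighbor."""
    top = []       # min-heap of (score, index): the best n outliers found so far
    cutoff = 0.0   # c: score of the weakest outlier in `top` once it holds n points
    for i, x in enumerate(data):
        neighbors = []   # max-heap (negated distances) of the k closest points seen so far
        pruned = False
        for j, y in enumerate(data):
            if i == j:
                continue
            dist = math.dist(x, y)
            if len(neighbors) < k:
                heapq.heappush(neighbors, -dist)
            elif dist < -neighbors[0]:
                heapq.heapreplace(neighbors, -dist)
            # The kth-neighbor distance can only shrink as the scan continues,
            # so once it drops below c this point cannot enter the top n.
            if len(neighbors) == k and -neighbors[0] < cutoff:
                pruned = True
                break
        if pruned or len(neighbors) < k:
            continue
        score = -neighbors[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

The pruning pays off when near neighbors are encountered early in the inner scan: a non-outlier is then discarded after only a few distance computations.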
Outlier detection algorithm • RBRP (Recursive Binning and Re-Projection) • A two-phase algorithm for fast mining of distance-based outliers in high-dimensional datasets • Finds the top n outliers in the dataset, ranked by distance to their kth nearest neighbor
Outlier detection algorithm • First phase of RBRP • Goal: partition the dataset into bins • Points that are close to each other in space are likely to be assigned to the same bin • A recursive procedure similar to divisive hierarchical clustering • Second phase of RBRP: use an extension of the NL algorithm to find outliers in the dataset
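The first phase can be pictured as a recursive split of the data into bins of nearby points. The sketch below is an assumption-laden illustration rather than the paper's procedure: it splits each bin in two with a few k-means-style iterations, picks the initial centers deterministically, and stops at a hypothetical `max_bin_size`. The re-projection step that the algorithm's name refers to is not shown.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def split_in_two(points, iterations=5):
    """Split points into two groups with a few k-means-style iterations (k = 2)."""
    a = points[0]                                    # first center: the first point
    b = max(points, key=lambda p: math.dist(a, p))   # second center: the point farthest from it
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            (left if math.dist(p, a) <= math.dist(p, b) else right).append(p)
        if not left or not right:
            break                                    # degenerate split, e.g. all points identical
        a, b = mean(left), mean(right)
    return left, right

def recursive_bins(points, max_bin_size):
    """Recursively split until every bin holds at most max_bin_size points."""
    if len(points) <= max_bin_size:
        return [points]
    left, right = split_in_two(points)
    if not left or not right:
        return [points]                              # cannot split further
    return recursive_bins(left, max_bin_size) + recursive_bins(right, max_bin_size)
```

With bins like these, the second phase can search a point's own bin first, so close neighbors tend to be found early and the cutoff test prunes non-outliers sooner.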
Outlier detection algorithm • Time complexity of Phase 1: • Worst case: • Best case: • Average case:
Experiment results
Conclusion • Presented RBRP • RBRP improves on the scaling behavior of the state of the art • Provided theoretical arguments • Validated its scaling behavior • Empirical results on real data back the above claims • Realized a significant speedup over ORCA
The End Thank you!!