Fast mining of distance-based outliers in high-dimensional datasets

Fast mining of distance-based outliers in high-dimensional datasets. Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey. Data Mining and Knowledge Discovery, Vol. 16, No. 3, 2008. Reporter: CHENG-WEI, CHOU. Jan. 13, 2010. Group members: 89721002 周政緯, 陳永洲.

Presentation Transcript


  1. Fast mining of distance-based outliers in high-dimensional datasets. Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey. Data Mining and Knowledge Discovery, Vol. 16, No. 3, 2008. Reporter: CHENG-WEI, CHOU. Jan. 13, 2010. Group members: 89721002 周政緯, 陳永洲

  2. Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion

  3. Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion

  4. Introduction • A common problem: automatically finding outliers • Outliers: points that are highly unlikely to occur • A measure of unusualness: a point's distance to its nearest neighbors • Existing algorithms perform poorly on high-dimensional data • This paper further improves the scaling behavior of distance-based outlier detection on large, high-dimensional datasets

  5. Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion

  6. Distance-based outlier detection • Three popular definitions of distance-based outliers: • Outliers are the data points for which there are fewer than p other data points within distance d • Outliers are the top n data points whose distance to their kth nearest neighbor is greatest • Outliers are the top n data points whose average distance to their k nearest neighbors is greatest
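The three definitions on this slide can be written down directly as brute-force functions. The following is a minimal sketch, assuming Euclidean distance and points stored as tuples; the function names are illustrative and do not come from the paper or the slides.

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest other points."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[:k]

def outliers_by_count(points, p, d):
    """Definition 1: points with fewer than p other points within distance d."""
    result = []
    for i, x in enumerate(points):
        within = sum(1 for j, q in enumerate(points)
                     if j != i and math.dist(x, q) <= d)
        if within < p:
            result.append(i)
    return result

def top_n_by_kth_distance(points, n, k):
    """Definition 2: top n points by distance to their kth nearest neighbor."""
    scores = [(knn_distances(points, i, k)[-1], i) for i in range(len(points))]
    return [i for _, i in sorted(scores, reverse=True)[:n]]

def top_n_by_avg_distance(points, n, k):
    """Definition 3: top n points by average distance to their k nearest neighbors."""
    scores = [(sum(knn_distances(points, i, k)) / k, i) for i in range(len(points))]
    return [i for _, i in sorted(scores, reverse=True)[:n]]

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(outliers_by_count(data, p=2, d=2.0))
    print(top_n_by_kth_distance(data, n=1, k=2))
    print(top_n_by_avg_distance(data, n=1, k=2))
```

Each function costs quadratic time in the number of points, which is the baseline the pruning techniques in the later slides try to improve on.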

  7. Distance-based outlier detection • NL (nested loop) algorithm: the best-performing approach in high-dimensional spaces • For each data point in D, scan the dataset and keep track of its k closest neighbors • Maintain a cutoff threshold, c • If the distance from a data point to its kth closest neighbor found so far drops below c, the data point can no longer be an outlier and is pruned
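The nested-loop scan with a cutoff threshold can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses the kth-nearest-neighbor definition of an outlier, Euclidean distance, and takes the cutoff c to be the score of the weakest of the top n outliers found so far. The function name is hypothetical.

```python
import heapq
import math

def nested_loop_outliers(points, n, k):
    """Top-n outliers by distance to the kth nearest neighbor, with cutoff pruning."""
    top = []      # min-heap of (score, index): the best n outliers found so far
    cutoff = 0.0  # c: score of the weakest outlier in `top` (0 until `top` is full)
    for i, x in enumerate(points):
        neighbors = []  # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j, q in enumerate(points):
            if j == i:
                continue
            dist = math.dist(x, q)
            if len(neighbors) < k:
                heapq.heappush(neighbors, -dist)
            elif dist < -neighbors[0]:
                heapq.heapreplace(neighbors, -dist)
            # The kth-neighbor distance can only shrink as the scan continues,
            # so once it falls below the cutoff the point cannot be a top-n outlier.
            if len(neighbors) == k and -neighbors[0] < cutoff:
                pruned = True
                break
        if pruned or len(neighbors) < k:
            continue
        score = -neighbors[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return [i for _, i in sorted(top, reverse=True)]
```

The pruning does not change the result; it only cuts the inner scan short. How much work it saves depends on how quickly close neighbors are found for each non-outlier point.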

  8. Distance-based outlier detection

  9. Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion

  10. Outlier detection algorithm • RBRP (Recursive Binning and Re-Projection) • A two-phase algorithm for fast mining of distance-based outliers in high-dimensional datasets • Finds the top n outliers in the dataset, ranked by the distance to their kth nearest neighbor

  11. Outlier detection algorithm • First phase of RBRP • Goal: partition the dataset into bins • Points that are close to each other in space are likely to be assigned to the same bin • Uses a recursive procedure similar to divisive hierarchical clustering • Second phase of RBRP: use an extension of the NL algorithm to find outliers in the dataset
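The two phases can be illustrated with a rough sketch. Everything specific here is an assumption, since the slides give only the outline: phase 1 is approximated by recursive 2-means-style splitting until bins hold at most `max_bin_size` points, and phase 2 is a nested-loop scan that visits nearby bins first so that close neighbors, and hence pruning, come early. The re-projection step suggested by the algorithm's name is not described in the slides and is omitted. All function and parameter names are hypothetical.

```python
import heapq
import math
import random

def centroid(points, idxs):
    """Mean of the points selected by the index list `idxs`."""
    dim = len(points[0])
    return tuple(sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim))

def bin_points(points, idxs, max_bin_size, rng, iters=5):
    """Phase 1 (assumed form): recursively split `idxs` into small bins."""
    if len(idxs) <= max_bin_size:
        return [idxs]
    centers = [points[i] for i in rng.sample(idxs, 2)]
    for _ in range(iters):  # a few 2-means iterations
        parts = ([], [])
        for i in idxs:
            d0 = math.dist(points[i], centers[0])
            d1 = math.dist(points[i], centers[1])
            parts[0 if d0 <= d1 else 1].append(i)
        if not parts[0] or not parts[1]:
            break
        centers = [centroid(points, part) for part in parts]
    if not parts[0] or not parts[1]:  # degenerate split (e.g. duplicates): halve
        half = len(idxs) // 2
        parts = (idxs[:half], idxs[half:])
    return (bin_points(points, parts[0], max_bin_size, rng, iters)
            + bin_points(points, parts[1], max_bin_size, rng, iters))

def rbrp_outliers(points, n, k, max_bin_size=4, seed=0):
    """Phase 2 (assumed form): nested loop with cutoff, scanning nearby bins first."""
    rng = random.Random(seed)
    bins = bin_points(points, list(range(len(points))), max_bin_size, rng)
    centers = [centroid(points, b) for b in bins]
    top, cutoff = [], 0.0  # min-heap of (score, index) and its smallest score
    for b, own in enumerate(bins):
        # Nearest bins first; the point's own bin has centroid distance 0.
        order = sorted(range(len(bins)),
                       key=lambda o: math.dist(centers[b], centers[o]))
        for i in own:
            neighbors, pruned = [], False  # negated k smallest distances
            for o in order:
                for j in bins[o]:
                    if j == i:
                        continue
                    dist = math.dist(points[i], points[j])
                    if len(neighbors) < k:
                        heapq.heappush(neighbors, -dist)
                    elif dist < -neighbors[0]:
                        heapq.heapreplace(neighbors, -dist)
                    if len(neighbors) == k and -neighbors[0] < cutoff:
                        pruned = True
                        break
                if pruned:
                    break
            if pruned or len(neighbors) < k:
                continue
            score = -neighbors[0]
            if len(top) < n:
                heapq.heappush(top, (score, i))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, i))
            if len(top) == n:
                cutoff = top[0][0]
    return [i for _, i in sorted(top, reverse=True)]
```

Because the pruning rule is exact, the binning affects only the amount of work done, not which points are reported: a poor partition degrades toward the plain nested loop rather than producing wrong outliers.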

  12. Outlier detection algorithm

  13. Outlier detection algorithm

  14. Outlier detection algorithm

  15. Outlier detection algorithm • Time Complexity of Phase 1 : • Worst case : • Best case :

  16. Outlier detection algorithm • Average case:

  17. Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion

  18. Experiment results

  19. Experiment results

  20. Experiment results

  21. Experiment results

  22. Experiment results

  23. Experiment results

  24. Experiment results

  25. Experiment results

  26. Experiment results

  27. Experiment results

  28. Outline • Introduction • Distance-based outlier detection • Outlier detection algorithm • Experiment results • Conclusion

  29. Conclusion • Presented RBRP • RBRP improves upon the scaling behavior of the state of the art • Provided theoretical arguments for this claim • Validated its scaling behavior • Empirical results on real data back the above claim • RBRP realizes a significant speedup over ORCA

  30. The End. Thank you!
