Fast and Efficient DBSCAN Algorithm Utilizing Random Partitioning

RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning
SIGMOD, Session 12
Hwanjun Song†, Jae-Gil Lee†*
† Graduate School of Knowledge Service Engineering, KAIST
* Corresponding author

OUTLINE
01. Background and Challenge
02. Overview of RP-DBSCAN
03. Experimental Results
04. Conclusions

01 Background and Challenge

DBSCAN Clustering
• One of the most widely used clustering algorithms
• Received the Test of Time Award at KDD 2014
• Captures clusters of arbitrary shape and does not require the number of clusters in advance
• Finds dense regions and expands them to form clusters
[Figure: (a) core point; (b) (directly) density-reachable]
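The two notions named in the figure can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance and the standard DBSCAN parameters (called `eps` and `min_pts` here); the function names are illustrative and do not come from the talk.

```python
import math

def eps_neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighbors(points, i, eps)) >= min_pts

def directly_reachable(points, i, j, eps, min_pts):
    """points[j] is directly density-reachable from points[i] if
    points[i] is a core point and points[j] lies in its eps-neighborhood."""
    return is_core(points, i, eps, min_pts) and math.dist(points[i], points[j]) <= eps

# Four points packed together and one far away.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(is_core(pts, 0, eps=0.2, min_pts=4))   # dense corner
print(is_core(pts, 4, eps=0.2, min_pts=4))   # isolated point
```

Clusters are then grown by repeatedly following direct reachability from core points.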

Distributed Processing
• A single machine is unlikely to handle data sets of the size that is typical today
• Distributed processing (Hadoop, Spark) has been adopted to keep DBSCAN usable at this scale
[Figure: data set → 1) data partitioning → 2) parallel processing on Worker 1 … Worker N → 3) merging → final result]

Existing Approaches
• Common belief for data partitioning
  • Neighboring points must be assigned to the same partition to calculate the correct density
• Region-based partitioning: 1) contiguous partitions, 2) with overlapping regions
[Figure: (a) data set split by cut 1 and cut 2; (b) data partitions P1, P2, P3 with an overlapping region]

Three Limitations
1. Expensive data split
• Too many possible choices to cut the space
• Increase in # of dimensions → increase in the cost of partitioning
[Figure: (a) Choice 1, (b) Choice 2, …, (c) Choice N for placing cut 1 and cut 2]

2. Load imbalance between data partitions
[Figure: data set split by cut 1 and cut 2 into P1, P2, P3; execution-time bars show Worker 2 (P2) as the slowest while Worker 1 and Worker 3 wait]

3. Duplicated points in overlapping regions
• Increase in # of data points → increase in total execution time
[Figure: (a) data set (27 points); (b) data partitions P1, P2, P3 (40 points, counting duplicated points)]
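As a quick check of the numbers in the figure, the overhead in this toy example works out as follows, assuming the 40 points in panel (b) count each duplicated point once per partition that holds it:

```latex
\[
\frac{40}{27} \approx 1.48,
\qquad
40 - 27 = 13 \ \text{extra copies to process}
\]
```

So the workers together handle roughly 48% more points than the data set contains.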

Our Approach
• What if we use random partitioning?
1. Random partitioning is cheap, i.e., O(N)
2. All partitions have almost the same number and distribution of data points
3. All partitions can be mutually disjoint
[Figure: (a) data set; (b) data partitions]
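The three properties can be illustrated with a minimal Python sketch. It partitions individual items uniformly at random in a single pass; the function name `random_partition` and the fixed seed are assumptions made for the example, not part of the talk.

```python
import random

def random_partition(items, k, seed=0):
    """Assign every item to one of k partitions uniformly at random.

    One pass over the data (O(N)); each item lands in exactly one
    partition, so the partitions are disjoint by construction.
    """
    rng = random.Random(seed)
    parts = [[] for _ in range(k)]
    for item in items:
        parts[rng.randrange(k)].append(item)
    return parts

parts = random_partition(list(range(10_000)), k=4)
print([len(p) for p in parts])  # sizes are close to 2,500 each
```

Because assignment ignores where a point lies, dense and sparse regions are spread over all partitions alike, which is what keeps the sizes and distributions similar.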

Challenge
• How can the density of an ε-neighborhood be calculated in a random partition?
[Figure: density calculation on a random partition; legend: data point in the current partition, data point in other partitions]

Our Solution
• Estimate the number of points in other partitions using a compact summary structure
[Figure: density calculation on a random partition; data points in the current partition plus point counts taken from the summary]

Key Contributions
• Proposed a random partitioning method to solve the three limitations of the region-based method
  • Expensive split, load imbalance, and duplicated points
• Designed a highly compact summary of the entire data set to enable approximate density calculation on a random partition

02 Overview of RP-DBSCAN

Overall Procedure
• Phase I: Data Partitioning
  • Performs random partitioning
  • Builds a compact summary and broadcasts it to all workers
• Phase II: Local Clustering
  • Finds all directly density-reachable relationships in each partition
• Phase III: Merging
  • Merges the results obtained from each partition
  • Labels points based on the reachability relationships

Phase I: Data Partitioning
• Randomly distributes the cells to multiple workers
• Builds a highly compact summary by adopting the concept of a sub-cell with ρ
• As ρ gets smaller, the space can be summarized more precisely
[Figure: (a) data set; (b) random partitioning across Worker 1 (P1) and Worker 2 (P2); (c) summary building (ρ = 0.5)]
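A minimal sketch of this phase in Python follows. It assumes a uniform grid in which a cell has side `eps / sqrt(d)` (so its diagonal equals `eps`) and a sub-cell is `rho` times as large per axis; the slide does not state the cell sizing, so treat it, and all function names, as assumptions of the example.

```python
import math
import random
from collections import Counter

def cell_of(point, eps):
    """Grid cell index of a point; side eps/sqrt(d) is an assumed sizing."""
    side = eps / math.sqrt(len(point))
    return tuple(math.floor(x / side) for x in point)

def subcell_of(point, eps, rho):
    """Index on a grid rho times as fine as the cell grid (1/rho sub-cells per axis)."""
    side = rho * eps / math.sqrt(len(point))
    return tuple(math.floor(x / side) for x in point)

def build_summary(points, eps, rho):
    """The summary keeps only a point count per sub-cell, never the coordinates."""
    return Counter(subcell_of(p, eps, rho) for p in points)

def partition_cells(points, eps, k, seed=0):
    """Send whole cells (not single points) to k partitions at random."""
    rng = random.Random(seed)
    owner = {}
    parts = [[] for _ in range(k)]
    for p in points:
        c = cell_of(p, eps)
        if c not in owner:
            owner[c] = rng.randrange(k)
        parts[owner[c]].append(p)
    return parts

pts = [(0.1, 0.1), (0.2, 0.3), (0.7, 0.2), (1.5, 1.5)]
print(build_summary(pts, eps=math.sqrt(2), rho=0.5))
```

In a distributed setting the summary would be built once and sent to every worker, while each worker receives only the points of its own cells.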

Compactness of the summary structure
• Stores only the density of each sub-cell rather than the exact position of each point
• Represents the position of a sub-cell by its order among the sub-cells inside the cell
[Figure: sub-cells encoded as (order within the cell, density); 2 bits are enough for the position]
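The "2 bits are enough" remark can be reproduced with a small Python sketch. It assumes each axis of a cell is split into `1/rho` equal pieces, where `rho` is the sub-cell size parameter (0.5 in the two-dimensional figure), and that sub-cells are numbered in row-major order; both the numbering scheme and the function names are assumptions of the example.

```python
def subcells_per_cell(dim, rho):
    """Number of sub-cells in one cell when each axis is split into 1/rho pieces."""
    per_axis = round(1 / rho)
    return per_axis ** dim

def position_bits(dim, rho):
    """Bits needed to store the order (position) of a sub-cell inside its cell."""
    return (subcells_per_cell(dim, rho) - 1).bit_length()

def encode_order(local_index, rho):
    """Pack per-axis sub-cell indices within a cell into one integer (row-major)."""
    per_axis = round(1 / rho)
    order = 0
    for i in local_index:
        order = order * per_axis + i
    return order

print(subcells_per_cell(2, 0.5))   # 4 sub-cells per cell in 2-D
print(position_bits(2, 0.5))       # 2 bits
print(encode_order((1, 1), 0.5))   # 3, the last of the four sub-cells
```

Storing this small integer and a count per non-empty sub-cell is far cheaper than storing full coordinates for every point.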

Phase II: Local Clustering
• Approximately calculates the density of the ε-neighborhood
• Finds all directly reachable relationships that appear across two cells in each partition
  • Equivalent to finding all reachability relationships between all points
[Figure: (a) random partition + summary; (b) direct reachability between cells]
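The approximate density calculation can be sketched as below. This Python example assumes the summary maps sub-cell grid indices to point counts, that a sub-cell has side `rho * eps / sqrt(d)`, and that a sub-cell is counted when its center lies within `eps` of the query point. The counting rule and the name `approx_density` are assumptions of the example, and a practical version would inspect only nearby cells instead of scanning the whole summary.

```python
import math

def approx_density(point, summary, eps, rho):
    """Estimate the number of points within eps of `point` from sub-cell counts.

    Every sub-cell whose center is within eps contributes its full count;
    exact point positions are never consulted.
    """
    side = rho * eps / math.sqrt(len(point))
    total = 0
    for idx, count in summary.items():
        center = tuple((i + 0.5) * side for i in idx)
        if math.dist(point, center) <= eps:
            total += count
    return total

# Hand-built 2-D summary: sub-cell index -> number of points inside it.
summary = {(0, 0): 2, (1, 0): 1, (3, 3): 1}
print(approx_density((0.1, 0.1), summary, eps=math.sqrt(2), rho=0.5))
```

The estimate gets closer to the exact count as `rho` shrinks, at the price of a larger summary.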

Phase III: Merging
• Merges all local results obtained from each partition
• Labels points based on the reachability relationships
[Figure: (a) merged result from P1 and P2; (b) labeling result, with outliers marked]
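One common way to merge such local results is a union-find structure, sketched here in Python. The sketch assumes each partition reports reachability as pairs of cell identifiers; that input format, the union-find choice, and the name `merge_clusters` are illustrative assumptions rather than the procedure used in the talk.

```python
def merge_clusters(local_edges, cells):
    """Union cells linked by reachability edges from any partition.

    local_edges: one list of (cell_a, cell_b) pairs per partition.
    Returns a dict mapping each cell to a cluster id.
    """
    parent = {c: c for c in cells}

    def find(c):
        # Follow parent links to the root, shortening the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for edges in local_edges:
        for a, b in edges:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    roots = {}
    return {c: roots.setdefault(find(c), len(roots)) for c in cells}

# Partition 1 found A-B, partition 2 found B-C; D is linked to nothing.
labels = merge_clusters([[("A", "B")], [("B", "C")]], ["A", "B", "C", "D"])
print(labels)
```

Points would then inherit the cluster id of their cell, and points that belong to no cluster are reported as outliers.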

03 Experimental Results

Experimental Setting
• Parallel DBSCAN algorithms for comparison
[Table not recovered; one group of algorithms was labeled "Region-based partitioning"]

Real-world data sets used for experiments
• GeoLife contains user location data, Cosmo50 contains simulation data, OpenStreetMap contains GPS data, and TeraClickLog contains click log data
• GeoLife in particular is heavily skewed because a large proportion of users stayed in Beijing

Algorithm parameters
• ε ∈ {[three smaller values missing], 1} × the value that generates around 10 clusters in each data set
• minPts = [value missing]
• ρ = [value missing]
• Cluster setting
  • Microsoft Azure D12v2 instances located in South Korea
  • Each instance has four cores, 28 GB of RAM, and 200 GB of SSD storage
  • Ten instances were used as worker nodes, and two instances were used as master nodes

Efficiency
• Total elapsed time of five parallel DBSCAN algorithms
• RP-DBSCAN was always the fastest
  • Outperformed the state-of-the-art by up to 180 times on the GeoLife data
  • Only RP-DBSCAN finished for the largest data set
[Figure: total elapsed time per data set; time limit: 20,000 s; annotation: ×180]

Efficiency Detail (1/2)
• Load imbalance of five parallel DBSCAN algorithms
• The load imbalance of RP-DBSCAN was always close to 1
• Existing algorithms failed to achieve good load balance
  • On the GeoLife data, the load imbalance was up to 600
[Figure: load imbalance per data set; annotation: ×600]
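The slide does not define the load-imbalance measure. One reading that fits both "close to 1" and "up to 600" is the ratio of the slowest worker's time to the fastest worker's time. The Python sketch below uses that reading as an explicit assumption, and the timings in it are made-up illustration values, not results from the talk.

```python
def load_imbalance(task_times):
    """Ratio of the slowest to the fastest task time (assumed definition).

    1.0 means perfectly balanced work; larger values mean some workers
    sit idle while the slowest one finishes.
    """
    return max(task_times) / min(task_times)

print(load_imbalance([10.0, 10.5, 9.8]))   # nearly balanced
print(load_imbalance([1.0, 2.0, 600.0]))   # one straggler dominates
```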

Efficiency Detail (2/2)
• Total number of points processed by the algorithms
• The total number of points processed by RP-DBSCAN was always the same as that of the data set
• Except for the GeoLife data, the total number of points in existing algorithms increased as ε increased

Accuracy of RP-DBSCAN
[Table: Rand index; values not recovered]
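For reference, the Rand index compares two clusterings by the fraction of point pairs on which they agree (both put the pair together, or both keep it apart). The Python sketch below is the textbook pairwise definition; how the talk's evaluation was implemented is not stated, and the labels used are toy values.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both clusterings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Same grouping under different label names scores 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Moving one point to the other cluster lowers the score.
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

This pairwise loop is quadratic in the number of points, so it suits small samples rather than full data sets of the size used in the experiments.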

Size of the Summary
• Our summary was very compact
• Its size ranged from only 0.04% to 8.20% of the data size

04 Conclusions

Conclusions
• Region-based partitioning in existing algorithms induces a critical bottleneck in parallel processing
• We proposed a random partitioning method and a highly compact summary of the entire data set
• RP-DBSCAN achieves almost perfect load balance and does not duplicate points
• RP-DBSCAN dramatically outperforms the state-of-the-art parallel DBSCAN algorithms by up to 180 times without much loss of accuracy


Thank you
