Density-Based Clustering
Anushree Garg, Krithika Chandramouli
Types of Clustering Algorithms
• Partitioning based
  • K-Means, K-Medoids
• Hierarchical based
  • BIRCH, Chameleon
• Density based
  • DBSCAN, DENCLUE, D-Stream
• Grid based
  • STING, WaveCluster
DBSCAN
“A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
• Density based
• Discovers clusters of arbitrary shape
• Requires only minimal domain knowledge to choose its input parameters
Definitions
• Core point
  • A point with at least MinPts points in its Eps-neighborhood
• Border point
  • Not a core point, but lies in the Eps-neighborhood of a core point
• Directly density-reachable
  • Point p is directly density-reachable from q if q is a core point and p is in the Eps-neighborhood of q
• Density-reachable
  • Point p is density-reachable from q if there is a chain of points p1, …, pM with p1 = q and pM = p, such that p(i+1) is directly density-reachable from pi
• Density-connected
  • p and q are density-connected if there is a point o such that both p and q are density-reachable from o
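The core and border definitions can be made concrete with a small sketch. This is an illustration only, assuming Euclidean distance and a neighborhood that includes the point itself; the function names `eps_neighborhood` and `classify_points` are hypothetical and not taken from the paper.

```python
from math import dist


def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [eps_neighborhood(points, i, eps) for i in range(len(points))]
    # Core point: at least min_pts points in its Eps-neighborhood.
    is_core = [len(n) >= min_pts for n in neighborhoods]
    labels = []
    for i, neighbors in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors):
            # Not core, but inside the neighborhood of some core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
```

In this toy data set the four corners of the unit square are core points, (2, 1) is a border point attached to (1, 1), and (8, 8) is noise.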
Algorithm
• Start with an arbitrary point p
• Retrieve all points density-reachable from p
• If p is a core point, this yields a cluster
• If p is a border point, no points are density-reachable from p, so no cluster is formed and the next point in the database is visited
• Repeat the process until all points have been visited
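The loop above can be sketched in a few lines of Python. This is a minimal version, assuming Euclidean distance and a brute-force neighborhood search rather than the spatial index a real implementation would use; `dbscan`, `region_query`, and the `NOISE` marker are names chosen for this example.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE marks points in no cluster."""
    labels = [None] * len(points)  # None means "not visited yet"

    def region_query(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            # Not a core point: provisionally noise, may become a border point.
            labels[i] = NOISE
            continue
        # Core point: grow a new cluster from everything density-reachable.
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is core, keep expanding
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two squares come out as clusters 0 and 1, and the isolated point (5, 5) stays labeled as noise.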
Performance Evaluation
[Figure: CLARANS vs. DBSCAN]
Conclusion (DBSCAN)
• A density-based clustering algorithm
• Effectively finds clusters of arbitrary shape
• Needs little domain knowledge
DENCLUE
“An Efficient Approach to Clustering in Large Multimedia Databases with Noise” - Alexander Hinneburg, Daniel A. Keim
• Density-based clustering
• Uses an influence function
• Handles large amounts of noise
Idea
• Each data point has an influence that extends over a range
  • Influence function
• Add the influence functions of all data points
  • Density function
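As an illustration of this idea, the sketch below sums one influence function per data point to obtain the density at an arbitrary location. The Gaussian form is one common choice and is an assumption here, since the slide does not name a specific influence function; `sigma` is the assumed name of the influence range.

```python
from math import dist, exp


def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (assumed Gaussian kernel)."""
    return exp(-dist(x, y) ** 2 / (2 * sigma ** 2))


def density(x, data, sigma):
    """Density function: the sum of the influences of all data points at x."""
    return sum(gaussian_influence(x, y, sigma) for y in data)


data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0)]
near_cluster = density((0.1, 0.1), data, sigma=0.5)
far_away = density((2.5, 2.5), data, sigma=0.5)
```

The density is high near the three nearby points and close to zero in the empty region between them and the isolated point.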
Definitions
• Density attractor x*
  • A local maximum of the density function
• Density-attracted points
  • Points from which a path to x* exists along which the gradient is continuously positive
Center-Defined Clusters
• All points that are density-attracted to a given density attractor x*
• The density function at the maximum must exceed the threshold ξ
• Points that are attracted to smaller maxima are considered outliers
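A rough sketch of how a point can be followed uphill to its density attractor, and then kept or discarded as an outlier, is given below. It assumes a Gaussian influence function, a fixed step size, and a stopping rule of "stop when the density no longer increases"; these choices and the names `hill_climb`, `attractors_or_outliers`, `step`, and `xi` (the density threshold from the slide) are assumptions of this example.

```python
from math import dist, exp, sqrt


def density(x, data, sigma):
    return sum(exp(-dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)


def gradient(x, data, sigma):
    """Direction of steepest ascent of the Gaussian density at x (up to a constant)."""
    grad = [0.0] * len(x)
    for y in data:
        weight = exp(-dist(x, y) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (y[k] - x[k]) * weight
    return grad


def hill_climb(start, data, sigma, step=0.05, max_iter=1000):
    """Follow the gradient from start until the density stops increasing."""
    x = tuple(start)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break
        candidate = tuple(xk + step * gk / norm for xk, gk in zip(x, g))
        if density(candidate, data, sigma) <= density(x, data, sigma):
            break  # x is (approximately) a local maximum
        x = candidate
    return x


def attractors_or_outliers(data, sigma, xi, step=0.05):
    """Map each point to its attractor, or to None if the maximum is below xi."""
    result = []
    for p in data:
        attractor = hill_climb(p, data, sigma, step)
        result.append(attractor if density(attractor, data, sigma) >= xi else None)
    return result


data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
assigned = attractors_or_outliers(data, sigma=0.5, xi=2.0)
```

Because of the fixed step size, points of the same cluster end at slightly different positions near the same maximum, so a complete implementation would still need to merge attractors that lie close together.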
Arbitrary-Shape Clusters
• Merges center-defined clusters if a path exists between them along which the density function continuously exceeds ξ
Algorithm
• Step 1: Construct a map of the data points
  • Uses hypercubes with edge length 2σ
  • Only populated cubes are saved
• Step 2: Determine density attractors for all points using hill-climbing
  • Keeps track of the paths that have been taken and the points close to them
Step 1: Constructing the map
• Hypercubes contain
  • Number of data points
  • Pointers to data points
  • Sum of data values (for the mean)
• Populated hypercubes are saved in a B+-tree
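The map-construction step can be sketched as follows. For brevity this example stores the populated cubes in a plain Python dictionary keyed by integer cube coordinates instead of a B+-tree, which is a substitution made for this sketch; `build_cube_map`, `cube_mean`, and the field names are hypothetical.

```python
from math import floor


def build_cube_map(data, sigma):
    """Group points into hypercubes of edge length 2 * sigma; keep populated cubes only."""
    edge = 2 * sigma
    cubes = {}
    for index, point in enumerate(data):
        key = tuple(floor(c / edge) for c in point)
        cube = cubes.setdefault(
            key, {"count": 0, "points": [], "linear_sum": [0.0] * len(point)}
        )
        cube["count"] += 1              # number of data points
        cube["points"].append(index)    # "pointers" to the data points
        for k, c in enumerate(point):
            cube["linear_sum"][k] += c  # sum of data values, used for the mean
    return cubes


def cube_mean(cube):
    """Mean of the points in a cube, derived from the stored sum and count."""
    return tuple(s / cube["count"] for s in cube["linear_sum"])


data = [(0.1, 0.1), (0.3, 0.2), (2.5, 2.5)]
cubes = build_cube_map(data, sigma=0.5)
```

Only two of the many possible cubes are created here, which is the point of storing populated cubes only.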
Step 2: Clustering Step
• Uses only highly populated cubes and the cubes connected to them
• Hill-climbing based on the local density function and its gradient
• Points within σ/2 of each hill-climbing path are attached to the cluster as well
Time Complexity / Efficiency
• Worst case, for N data points
  • O(N log(N))
• Average case
  • O(log(N))
  • Explanation: only highly populated areas are considered
• Up to 45 times faster than DBSCAN
Comparison with DBSCAN
• Corresponding setup
  • A square wave influence function with radius σ models the Eps-neighborhood (ε) in DBSCAN
  • The definition of core objects in DBSCAN involves MinPts, which corresponds to ξ
  • Density-reachable in DBSCAN becomes density-attracted in DENCLUE
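The correspondence can be checked numerically: with a square wave influence function, the density at a point is simply the number of data points within the radius, so thresholding it reproduces the DBSCAN core-point test. The sketch below assumes Euclidean distance, and the function names are chosen for this example.

```python
from math import dist


def square_wave_influence(x, y, sigma):
    """1 if y lies within distance sigma of x, otherwise 0."""
    return 1 if dist(x, y) <= sigma else 0


def square_wave_density(x, data, sigma):
    return sum(square_wave_influence(x, y, sigma) for y in data)


def is_core_dbscan(x, data, eps, min_pts):
    """DBSCAN core test: at least min_pts points in the Eps-neighborhood."""
    return sum(1 for y in data if dist(x, y) <= eps) >= min_pts


data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]
sigma = eps = 1.0
xi = min_pts = 3
agreement = [
    (square_wave_density(p, data, sigma) >= xi) == is_core_dbscan(p, data, eps, min_pts)
    for p in data
]
```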
Conclusion (DENCLUE)
• DENCLUE is faster than most other algorithms
• Efficient data structure
• Used for large multimedia databases
• Works well with a large number of outliers
D-STREAM
• Chen, Yixin, and Li Tu. "Density-based clustering for real-time stream data." Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007.
Data stream clustering
• Clustering a high-dimensional stream in real time is a challenging task
• Massive volumes of raw data arrive in real time and can be scanned only once
• Applications: stocks, weather monitoring, and others
Clustering algorithms: then vs. now
• Then
  • Used a single-phase model
  • Treated data stream clustering as a continuous version of static clustering
  • Divide and conquer
  • Weighed outdated and recent data equally
  • Did not capture the evolving characteristics of the data
• CluStream: two-phase framework
  • Offline component based on k-means: identifies spherical clusters, not arbitrary shapes
  • Requires multiple scans of the data
Clustering algorithms: then vs. now
• Now
  • D-Stream is density based
  • Does not treat the data stream as a long sequence of static data
  • Captures the dynamics of the stream with a decay factor
  • Does not require the user to specify the number of clusters
  • Discretizes the data space into grids; new data is mapped to these grids
The D-Stream algorithm
• Key features of the algorithm
  • The timestamp of each data point is an integer label
  • Online component + offline component
• Online component
  • Reads an incoming data record
  • Places this multi-dimensional record into the appropriate density grid
  • Updates the characteristic vector of the grid
• Offline component
  • Dynamically adjusts clusters in the time gap (time between arrival of data)
  • Clusters are adjusted periodically
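D-Stream definitions
• Input: records with d dimensions, defined in the space $S = S_1 \times S_2 \times \dots \times S_d$
• Density grid: each dimension $S_i$ is divided into $p_i$ partitions
  • Grid $g = S_{1,j_1} \times S_{2,j_2} \times \dots \times S_{d,j_d}$, written $g = (j_1, j_2, \dots, j_d)$
• Every data record $x = (x_1, x_2, \dots, x_d)$ is mapped onto a grid g
  • Timestamp of arrival: $T(x)$
• Density coefficient of x at time t: $D(x, t) = \lambda^{t - T(x)}$
  • $\lambda \in (0, 1)$ is the decay factor
• Grid density: $D(g, t)$ is the sum of the density coefficients of all records mapped to g
  • For each grid, the time when the last data record was received is stored, so that the density can be updated when the next record arrives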
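The following sketch shows how a record could be mapped to a grid and how a grid density could be updated lazily from the stored time of the last record. It assumes equal-width partitions with known lower and upper bounds per dimension, and the update rule "decay the old density, then add 1 for the new record", which follows from summing decayed per-record coefficients; `grid_index`, `update_density`, `lows`, and `highs` are names introduced for this example.

```python
from math import floor


def grid_index(record, lows, highs, partitions):
    """Map a record to its grid (j1, ..., jd), assuming equal-width partitions."""
    index = []
    for x, lo, hi, p in zip(record, lows, highs, partitions):
        j = floor((x - lo) / (hi - lo) * p)
        index.append(min(max(j, 0), p - 1))  # clamp values on the boundary
    return tuple(index)


def update_density(last_density, last_time, now, decay):
    """Decay the stored density to the current time, then count the new record."""
    return decay ** (now - last_time) * last_density + 1.0


# Two dimensions on [0, 1], split into 4 and 2 partitions respectively.
g = grid_index((0.25, 0.9), lows=(0.0, 0.0), highs=(1.0, 1.0), partitions=(4, 2))

# Records arrive in the same grid at times 0, 1 and 3 with decay factor 0.5.
d = update_density(0.0, 0, 0, 0.5)   # first record
d = update_density(d, 0, 1, 0.5)     # second record, one step later
d = update_density(d, 1, 3, 0.5)     # third record, two steps later
```

The final value equals the direct sum of the three decayed coefficients at time 3, which is why only the last update time has to be stored per grid.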
D-Stream definitions
• The characteristic vector of a grid is (tg, tm, D, label, status)
  • tg: last time g was updated
  • tm: last time g was removed from grid_list
  • D: grid density
  • label: class label
  • status: SPORADIC or NORMAL, used to remove sporadic grids
• Dense grid
• Sparse grid
• Transitional grid
• Sporadic grids: contain very few data points
Components of D-Stream
• New data x is mapped to grid g, and the density is updated
  • The scheme gradually reduces the density of each record and grid
• Periodically form clusters
  • The time interval for inspecting grids can't be too long or too short
  • Compute the minimum time for a dense grid to become a sparse grid
• Remove sporadic grids
  • Grids containing very few data points
  • Removed by density thresholding
• grid_list keeps track of all grids under analysis
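To make the inspection interval concrete, the sketch below classifies a grid by density and counts how many time steps a grid that is just dense needs to decay to sparse when it receives no new data. The slides do not give the threshold values, so `dense_threshold` and `sparse_threshold` are hypothetical parameters, and the step count is derived directly from the decay rule rather than from a formula in the source.

```python
def classify_grid(density, dense_threshold, sparse_threshold):
    """Return 'dense', 'sparse', or 'transitional' for a grid density."""
    if density >= dense_threshold:
        return "dense"
    if density <= sparse_threshold:
        return "sparse"
    return "transitional"


def steps_until_sparse(dense_threshold, sparse_threshold, decay):
    """Time steps for a grid at the dense threshold to decay to sparse with no new data."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie in (0, 1)")
    if not 0.0 < sparse_threshold < dense_threshold:
        raise ValueError("need 0 < sparse_threshold < dense_threshold")
    steps = 0
    density = dense_threshold
    while density > sparse_threshold:
        density *= decay  # one time step without new records
        steps += 1
    return steps


kinds = [classify_grid(d, 4.0, 1.0) for d in (5.0, 2.0, 0.5)]
interval = steps_until_sparse(dense_threshold=4.0, sparse_threshold=1.0, decay=0.5)
```

Inspecting grids more often than this interval wastes work, while inspecting much less often risks missing a change from dense to sparse.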
Results
• Data: network intrusion data stream and synthetic data
• Data points: 30K to 85K
Conclusion
• D-Stream is a clustering technique for fast-changing data streams
• Finds clusters of arbitrary shape
• Sporadic grids are dynamically removed