720 likes | 1.68k Views
DBSCAN. Data Mining algorithm . School of Electrical Engineering, University of Belgrade Department of Computer Engineering. Professor Dr Veljko Milutinović. Student Milan Micić 2011/3323 milan.z.micic@gmail.com. Content. Introduction The DBSCAN basic idea Algorithm DBSCAN on R
E N D
DBSCAN Data Mining algorithm School of Electrical Engineering, University of Belgrade Department of Computer Engineering Professor Dr Veljko Milutinović Student Milan Micić 2011/3323 milan.z.micic@gmail.com
Content • Introduction • The DBSCAN basic idea • Algorithm • DBSCAN on R • Example • Advantages • Disadvantages • References 2/13
Introduction • Data clustering algorithms • Using in machine learning, pattern recognition, image analyses, information retrieval, and bioinformatics • Hierarchical, centroid-based, distribution-based, density-based, etc 3/13
DBSCAN basic idea • Density-Based Spatial Clustering of Applications with Noise • Munich,1996 • Derived from a human natural clustering approach • Input parameters • The size of epsilon neighborhood – ε • Minimum points in cluster – MinPts • Neighborhood of a given radius εhas to contain at least a minimum number of points MinPts 4/13
DBSCAN basic idea • Directly density-reachable, p1 from p2 • p1 belongs to the ε neighborhood of p2 • p2's neighborhood size is greater than a given parameter MinPts • Density-reachable, p0 from pn • Exists a chain of points p1,..., pn-1, where pi+1 is directly density-reachable from pi • Core, border and noise point 5/13
Algorithm DBSCAN(D, eps, MinPts) C = 0 for each unvisited point P in dataset D mark P as visited N = regionQuery(P, eps) if sizeof(N) < MinPts mark P as NOISE else C = next cluster expandCluster(P, N, C, eps, MinPts) expandCluster(P,N,C,eps,MinPts) add P to cluster C for each point P' in N if P' is not visited mark P' as visited N' = regionQuery(P', eps) if sizeof(N') >= MinPts N = N joined with N' if P' is not yet member of any cluster add P' to cluster C • Complexity with indexing structure: O(n*log(n)) 6/13
DBSCAN on R • GNU General Public License • Various methods for clustering and cluster validation • Interface functions for many methods implemented in language R • DBSCAN: O(n2) • FPC - Flexible Procedures for Clustering • dbscan(x,0.2,showplot=2) • dbscan Pts=600 MinPts=5 eps=0.2 • 0 1 2 3 4 5 6 7 8 9 10 11 • seed 0 50 53 51 52 51 54 54 54 53 51 1 • border 28 4 4 8 5 3 3 4 3 4 6 4 • total 28 54 57 59 57 54 57 58 57 57 57 5 7/13
Example • Astronomy task • Identifying celestial objects by capturing the radiation they emit • Captured noise (by sensors, diffuse emission from atmosphere and space itself) • Eliminating method – to constrain the relevant intensity by a known threshold • In this case – only pixels whose intensity are less than 50 (and consequently darker) are being considered 8/13
Example • DBSCAN algorithm applied on individual pixels • Linking together a complete emission area • Each of the generated cluster will define a celestial entity • ε = 5, MinPts = 5, 64 clusters and 224 outliers found 9/13
Disadvantages • Appropriate parameters εand MinPts • Numerous experiments indicates best MinPts = 4 • Clustering datasets with large difference in densities • “Curse of dimensionality” • In every algorithm based on the Euclidean distance for high-dimensional data sets 10/13
Advantages • Does not require number of clusters in the data a priori • Can find arbitrarily shaped clusters • Even clusters completely surrounded by a different cluster • Mostly insensitive to the ordering of the points in the database • Only border points might swap cluster membership • Has a notion of noise • Requires just two parameters 11/13
References • Martin Ester, Hans-Peter Kriegel, Joerg Sander, XiaoweiXu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Institute for Computer Science, University of Munich,1996; • MehmedKantardzic: “Data Mining: Concepts, Models, Methods, and Algorithms”, 2011; • Wikibooks: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering; • Wiki: http://en.wikipedia.org/wiki/DBSCAN 12/13
Thank you for your attention! Questions Milan Micić milan.z.micic@gmail.com 13/13