Clustering By: Avshalom Katz
We will be talking about… • What is Clustering? • Different Kinds of Clustering • What is DBSCAN? • Pseudocode • Example of Clustering • Definitions of parameters • Complexity
What is Clustering? • Clustering is the assignment of a set of observations to subsets (called clusters) so that observations in the same cluster are similar in some sense.
Different types of Clustering • Biology • Information retrieval • Climate • Business • Clustering for utility • Summarization
DBSCAN: Introduction • Density-Based Spatial Clustering of Applications with Noise • Since society started using databases, the amount of information we handle has been increasing exponentially. As a result, automatic algorithms are being applied in every field.
Density-Based Spatial Clustering of Applications with Noise • There are four main steps in the algorithm, and the algorithm takes two parameters: • 1. The minimum number of points required for a dense region (MinPts) • 2. The distance around a point within which the density is checked (Eps)
Definition 1 • Find all adjacent points. A point is called "adjacent" to a point p only if its distance from p is at most Eps. All the points adjacent to p are collected in the set Neps(p), the Eps-neighborhood of p.
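The neighborhood in Definition 1 can be sketched as a brute-force scan. This is a minimal illustration, assuming 2-D points stored as (x, y) tuples and Euclidean distance; the function name `region_query` and the sample data are assumptions made for the example, not part of the slides.

```python
import math

def region_query(points, p, eps):
    """Return the indices of all points within distance eps of points[p].

    The point itself is always included, since its distance to itself is 0.
    """
    px, py = points[p]
    return [i for i, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]

# Hypothetical sample data: three nearby points and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(region_query(points, 0, eps=1.5))
print(region_query(points, 3, eps=1.5))
```

The scan visits every point, so one query costs time linear in the size of the database; a spatial index would replace the list comprehension in a real system.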
Definition 2 • Directly density-reachable: a point p is directly density-reachable from a point q if p is included in Neps(q) and the size of Neps(q) is at least MinPts, that is, q is a core point.
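The core-point condition can be checked directly: p is directly density-reachable from q when p lies in the Eps-neighborhood of q and that neighborhood holds at least MinPts points. The sketch below assumes 2-D tuples and Euclidean distance; the helper names and the sample points are hypothetical. It also shows that the relation is not symmetric when one of the two points is a border point.

```python
import math

def neighborhood(points, q, eps):
    """Indices of all points within distance eps of points[q] (q included)."""
    qx, qy = points[q]
    return {i for i, (x, y) in enumerate(points)
            if math.hypot(x - qx, y - qy) <= eps}

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is in the neighborhood of q and q is a core point."""
    n_q = neighborhood(points, q, eps)
    return p in n_q and len(n_q) >= min_pts

# Hypothetical data: point 1 is a core point, point 3 is only a border point.
points = [(0, 0), (1, 0), (0, 1), (2, 0)]
print(directly_density_reachable(points, 3, 1, eps=1.0, min_pts=3))  # border from core
print(directly_density_reachable(points, 1, 3, eps=1.0, min_pts=3))  # core from border
```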
Definition 3 • Density-reachable: a point p is density-reachable from a point q if there is a sequence of points in which the first is q and the last is p, and every point in the sequence is directly density-reachable from the one before it.
Definition 4 • Density-connected: two points p and q are density-connected if there is a point o from which both p and q are density-reachable. For example, in the diagram, P and Q are both density-reachable from O. Therefore, P and Q are density-connected.
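Density-reachability and density-connectivity can be explored with a breadth-first search that only continues a chain through core points. This is an illustrative sketch, assuming 2-D tuples and Euclidean distance; the function names and the sample points are assumptions, and the brute-force search over every candidate o is meant for clarity, not speed.

```python
import math
from collections import deque

def neighborhood(points, q, eps):
    """Indices of all points within distance eps of points[q] (q included)."""
    qx, qy = points[q]
    return {i for i, (x, y) in enumerate(points)
            if math.hypot(x - qx, y - qy) <= eps}

def reachable_from(points, q, eps, min_pts):
    """Set of indices density-reachable from q (empty if q is not a core point)."""
    reached, expanded = set(), set()
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        if cur in expanded:
            continue
        expanded.add(cur)
        n = neighborhood(points, cur, eps)
        if len(n) < min_pts:
            continue  # not a core point: the chain cannot continue from here
        for i in n:
            if i not in reached:
                reached.add(i)
                queue.append(i)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density-reachable."""
    for o in range(len(points)):
        r = reachable_from(points, o, eps, min_pts)
        if p in r and q in r:
            return True
    return False

# Hypothetical data: five points on a line plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 10)]
print(4 in reachable_from(points, 2, 1.0, 3))   # end point reached from the middle
print(reachable_from(points, 4, 1.0, 3))        # end point is not core
print(density_connected(points, 0, 4, 1.0, 3))  # both ends connected via the middle
```

In this data the two end points are not density-reachable from each other, because neither is a core point, yet they are density-connected through the middle point. That is why the cluster definition relies on density-connectivity rather than reachability alone.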
Definition 5 • A cluster C with respect to Eps and MinPts is a non-empty subset of the database that satisfies the two conditions below: 1. If p is a member of C and q is density-reachable from p (which requires the size of Neps(p) to be at least MinPts), then q is also a member of C. 2. If p and q are both members of C, then p and q are density-connected to each other.
Definition 6 • Noise: given the clusters of the database, every point that does not belong to any cluster is called "noise".
Example of Clustering • DBSCAN (Eps = ε, MinPts = 3) • [Animated figure: a set of labeled points is processed one point at a time. Each step shows the number of adjacent points, the contents of the seed stack, and the current ClusterId (purple or green). Points that end up in no cluster are marked as noise.]
Pseudocode of the algorithm
DBSCAN (Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  FOR i FROM 1 TO SetOfPoints.size DO
    Point := SetOfPoints.get(i);
    IF Point.ClId = UNCLASSIFIED THEN
      IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
        ClusterId := nextId(ClusterId)
      END IF
    END IF
  END FOR
END; // DBSCAN
ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  IF seeds.size < MinPts THEN // no core point
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  ELSE // all points in seeds are density-reachable from Point
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    WHILE seeds <> Empty DO
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      IF result.size >= MinPts THEN
        FOR i FROM 1 TO result.size DO
          resultP := result.get(i);
          IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
            IF resultP.ClId = UNCLASSIFIED THEN
              seeds.append(resultP);
            END IF;
            SetOfPoints.changeClId(resultP, ClId);
          END IF; // UNCLASSIFIED or NOISE
        END FOR;
      END IF; // result.size >= MinPts
      seeds.delete(currentP);
    END WHILE; // seeds <> Empty
    RETURN True;
  END IF
END; // ExpandCluster
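The pseudocode above can be turned into a short runnable sketch. The version below follows the same control flow, assuming 2-D points as (x, y) tuples, Euclidean distance, and a brute-force region query instead of a spatial index. The integer label constants, the function names, and the sample data are assumptions made for the illustration.

```python
import math

UNCLASSIFIED = -2  # assumed label for points not yet visited
NOISE = -1         # assumed label for noise points

def region_query(points, idx, eps):
    """Brute-force Eps-neighborhood: indices within eps of points[idx]."""
    px, py = points[idx]
    return [i for i, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]

def expand_cluster(points, labels, idx, cluster_id, eps, min_pts):
    """Grow a cluster from points[idx]; return False if it is not a core point."""
    seeds = region_query(points, idx, eps)
    if len(seeds) < min_pts:          # no core point
        labels[idx] = NOISE
        return False
    for s in seeds:                   # all seeds are density-reachable from idx
        labels[s] = cluster_id
    seeds.remove(idx)
    while seeds:
        current = seeds[0]
        result = region_query(points, current, eps)
        if len(result) >= min_pts:    # current is a core point: keep expanding
            for r in result:
                if labels[r] in (UNCLASSIFIED, NOISE):
                    if labels[r] == UNCLASSIFIED:
                        seeds.append(r)
                    labels[r] = cluster_id
        seeds.pop(0)
    return True

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] == UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # → [0, 0, 0, 0, 1, 1, 1, -1]
```

As in the pseudocode, a point first marked as noise can later be relabeled as a border point of a cluster, but it is not added to the seed list because it is known not to be a core point.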
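The complexity • With a spatial index such as an R*-tree, a single region query takes O(log n) time on average on a database of size n. DBSCAN performs at most one region query for each of the n points, so the overall average run time is O(n * log n). Without an index, each region query scans the whole database and the total run time becomes O(n^2).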
Bibliography • Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. SIGMOD Rec., 28(2):49-60. • Clustering. (2010, April 19). In Wikipedia, The Free Encyclopedia. Retrieved 14:14, April 19, 2010, from http://en.wikipedia.org/w/index.php?title=Clustering&oldid=357078594 • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. • Ester, M., Kriegel, H.-P., and Xu, X. (1995). A Database Interface for Clustering in Large Spatial Databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, Canada, AAAI Press, 1995. • Schikuta, E. and Erhart, M. (1997). The BANG-clustering system: Grid-based data analysis. Proc. Sec. Int. Symp. IDA-97, Vol. 1280 LNCS, London, UK, Springer-Verlag, 1997.