Clustering

Clustering. By: Avshalom Katz. Topics covered: What is Clustering? Different Kinds of Clustering. What is DBSCAN? Pseudocode. Example of Clustering. Definitions of parameters. Complexity.

Presentation Transcript


  1. Clustering By: Avshalom Katz

  2. We will be talking about… • What is Clustering? • Different Kinds of Clustering • What is DBSCAN? • Pseudocode • Example of Clustering • Definitions of parameters • Complexity

  3. What is Clustering? • Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

  4. Different types of Clustering • Biology • Information retrieval • Climate • Business • Clustering for utility • Summarization

  5. Example

  6. Different kinds of clusters

  7. Well Separated

  8. Prototype based

  9. Graph based

  10. Density based

  11. Shared property (conceptual clusters)

  12. DBSCAN: Introduction. Density-Based Spatial Clustering of Applications with Noise • Since society started using databases, the amount of information we work with has been increasing exponentially. As a result, automatic algorithms are being introduced in every field.

  13. Database Example

  14. Density-Based Spatial Clustering of Applications with Noise • There are four main steps in the algorithm, and the algorithm takes two parameters: • 1. The minimum number of points required for a dense region (MINPTS) • 2. The distance around a point within which the density is checked (EPS).

  15. Definition 1 • Eps-neighborhood: find all points adjacent to a point P. A point is called "adjacent" to P only if its distance from P is at most EPS. All the points adjacent to P are collected into the set Neps(P).
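
As a minimal sketch of Definition 1, the neighborhood Neps(P) can be computed by a linear scan. The function name `region_query` mirrors the `regionQuery` call in the pseudocode; the sample points, the Euclidean distance, and the choice to count P as its own neighbor are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def region_query(points, p, eps):
    """Return the indices of all points within distance eps of points[p].

    The point itself is included, since its distance to itself is 0.
    """
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

# Hypothetical sample data: three nearby points and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
print(region_query(points, 0, eps=1.0))  # → [0, 1, 2]
```

A linear scan costs O(n) per query; a spatial index would replace the list comprehension without changing the result.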

  16. Definition 2 • Directly density-reachable: a point p is directly density-reachable from a point q if p is included in Neps(q) and the size of Neps(q) is at least MINPTS, that is, q is a core point.
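
The following sketch illustrates Definition 2 under the convention used by the pseudocode, where a point with fewer than MinPts neighbors is "no core point". The helper names (`neighborhood`, `is_core`, `directly_density_reachable`) and the sample points are hypothetical.

```python
from math import dist

def neighborhood(points, q, eps):
    """Indices of all points within eps of points[q], including q itself."""
    return [i for i, x in enumerate(points) if dist(points[q], x) <= eps]

def is_core(points, q, eps, min_pts):
    """A core point has at least min_pts points in its neighborhood."""
    return len(neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the neighborhood of q and q is a core point."""
    return p in neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical data: point 1 is a core point, point 3 is a border point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0)]
print(directly_density_reachable(points, 3, 1, eps=0.8, min_pts=3))  # → True
print(directly_density_reachable(points, 1, 3, eps=0.8, min_pts=3))  # → False
```

The two calls show that the relation is not symmetric: a border point can be reached from a core point, but not the other way around.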

  17. Definition 3 • Density-reachable: a point p is density-reachable from a point q if there is a sequence of points in which the first is q and the last is p, and every point in the sequence is directly density-reachable from the one before it.
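
One way to test Definition 3 is a breadth-first search that starts at q and only continues through core points, since only core points can extend a chain. This is a sketch under that assumption; the function names and the sample points are hypothetical.

```python
from collections import deque
from math import dist

def neighborhood(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, x in enumerate(points) if dist(points[i], x) <= eps]

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighborhood(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur is not a core point, so no chain continues through it
        for n in nbrs:
            if n == p:
                return True
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return False

# Hypothetical data: points on a line; only points 1 and 2 are core points.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.5, 0.0)]
print(density_reachable(points, 3, 1, eps=1.0, min_pts=3))  # → True  (1 -> 2 -> 3)
print(density_reachable(points, 1, 3, eps=1.0, min_pts=3))  # → False (3 is not core)
```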

  18. Definition 4 • Density-connected: two points P and Q are density-connected if there is a single point O from which both can be reached, possibly in different directions. For example, in the diagram P and Q are both density-reachable from O. Therefore, P and Q are density-connected.
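
Definition 4 can be sketched by collecting, for every candidate point O, the set of points density-reachable from O and checking whether both P and Q are in it. The helper names and the sample points are hypothetical, and the brute-force search over all O is chosen for clarity, not speed.

```python
from math import dist

def neighborhood(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, x in enumerate(points) if dist(points[i], x) <= eps]

def reachable_from(points, o, eps, min_pts):
    """Set of all points density-reachable from o (empty if o is not core)."""
    reached, expanded, stack = set(), set(), [o]
    while stack:
        cur = stack.pop()
        if cur in expanded:
            continue
        expanded.add(cur)
        nbrs = neighborhood(points, cur, eps)
        if len(nbrs) >= min_pts:  # only core points extend the chain
            for n in nbrs:
                reached.add(n)
                if n not in expanded:
                    stack.append(n)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o reaches both p and q."""
    return any({p, q} <= reachable_from(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical data: points 0 and 3 are border points on opposite sides.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.5, 0.0)]
print(density_connected(points, 0, 3, eps=1.0, min_pts=3))  # → True
print(density_connected(points, 0, 4, eps=1.0, min_pts=3))  # → False
```

In this data neither border point is density-reachable from the other, yet both are reachable from the core point 1, which is exactly the situation the definition covers.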

  19. Definition 5 • Cluster: a cluster C wrt. EPS and MINPTS is a non-empty subset of the database that satisfies the two conditions below: 1. If p is a member of C, q is density-reachable from p, and |Neps(p)| ≥ MINPTS, then q is also a member of C. 2. If p and q are both members of C, then p and q are density-connected to each other.

  20. Definition 6 • Noise: given the clusters that were found, each point that does not belong to any cluster is called "noise".

  21. Example run: DBSCAN ( Eps = ε , MinPts = 3 ) [Animated figure: a set of labelled points. At each step the slide shows the number of adjacent points, the stack of pending points, and the current ClusterId (purple or green). Points that end up in no cluster are marked as noise.]

  22. Pseudocode of the algorithm
      DBSCAN (Eps, MinPts)
        // SetOfPoints is UNCLASSIFIED
        ClusterId := nextId(NOISE);
        FOR i FROM 1 TO SetOfPoints.size DO
          Point := SetOfPoints.get(i);
          IF Point.ClId = UNCLASSIFIED THEN
            IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
              ClusterId := nextId(ClusterId)
            END IF
          END IF
        END FOR
      END; // DBSCAN

  23. ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
        seeds := SetOfPoints.regionQuery(Point, Eps);
        IF seeds.size < MinPts THEN // no core point
          SetOfPoints.changeClId(Point, NOISE);
          RETURN False;
        ELSE // all points in seeds are density-reachable from Point
          SetOfPoints.changeClIds(seeds, ClId);
          seeds.delete(Point);
          WHILE seeds <> Empty DO
            currentP := seeds.first();
            result := SetOfPoints.regionQuery(currentP, Eps);
            IF result.size >= MinPts THEN
              FOR i FROM 1 TO result.size DO
                resultP := result.get(i);
                IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
                  IF resultP.ClId = UNCLASSIFIED THEN
                    seeds.append(resultP);

  24. (ExpandCluster, continued)
                  END IF;
                  SetOfPoints.changeClId(resultP, ClId);
                END IF; // UNCLASSIFIED or NOISE
              END FOR;
            END IF; // result.size >= MinPts
            seeds.delete(currentP);
          END WHILE; // seeds <> Empty
          RETURN True;
        END IF
      END; // ExpandCluster
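
The pseudocode above can be followed fairly directly in Python. This is a sketch, not the presenter's implementation: the label constants, the use of list indices instead of point objects, the linear-scan `region_query`, and the sample data are all assumptions made for illustration.

```python
from collections import deque
from math import dist

UNCLASSIFIED, NOISE = -2, -1  # hypothetical label values; clusters are 0, 1, ...

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:          # no core point
        labels[i] = NOISE
        return False
    for j in seeds:                   # all seeds are density-reachable from i
        labels[j] = cluster_id
    queue = deque(j for j in seeds if j != i)
    while queue:
        current = queue.popleft()
        result = region_query(points, current, eps)
        if len(result) >= min_pts:    # current is a core point: keep expanding
            for j in result:
                if labels[j] in (UNCLASSIFIED, NOISE):
                    if labels[j] == UNCLASSIFIED:
                        queue.append(j)
                    labels[j] = cluster_id
    return True

def dbscan(points, eps, min_pts):
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] == UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Hypothetical data: a group of four, a group of three, and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (9, 0)]
print(dbscan(points, eps=1.5, min_pts=3))  # → [0, 0, 0, 0, 1, 1, 1, -1]
```

As in the pseudocode, a point first marked NOISE can later be absorbed into a cluster as a border point, but it is not used to expand the cluster further.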

  25. Example

  26. Defining the value of the parameter EPS by MINPTS:
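
A hedged sketch of one way to pick EPS for a given MINPTS, assuming the sorted k-distance heuristic described by Ester et al. (1996): compute each point's distance to its k-th nearest neighbor, sort these values in descending order, and look for the first sharp drop. The function name `sorted_k_dist`, the choice of k, and the sample data are hypothetical.

```python
from math import dist

def sorted_k_dist(points, k):
    """Distance from each point to its k-th nearest other point, sorted descending."""
    k_dists = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists, reverse=True)

# Hypothetical data: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (9, 0)]
for value in sorted_k_dist(points, k=2):
    print(round(value, 2))
```

In this data the outlier produces one large value at the top of the list, followed by much smaller values for the clustered points; a threshold just below the drop is a candidate for EPS.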

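  27. The complexity • The cost of ExpandCluster() is dominated by its region queries. On a database of size n, a region query takes O(log n) on average when a spatial index such as an R*-tree is used, and essentially one region query is run for each of the n points, so the overall complexity is O(n * log(n)). Without an index, each region query scans all n points and the total becomes O(n^2).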

  28. Bibliography • Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. SIGMOD Rec., 28(2):49-60. • Clustering. (2010, April 19). In Wikipedia, The Free Encyclopedia. Retrieved 14:14, April 19, 2010, from http://en.wikipedia.org/w/index.php?title=Clustering&oldid=357078594 • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1995). A database interface for clustering in large spatial databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, Canada, AAAI Press. • Schikuta, E., and Erhart, M. (1997). The BANG-clustering system: grid-based data analysis. Proc. 2nd Int. Symp. IDA-97, Vol. 1280 LNCS, London, UK, Springer-Verlag.
