1 / 18

k-Means and DBSCAN

k-Means and DBSCAN. Erik Zeitler Uppsala Database Laboratory. k-Means. Input M (set of points) k (number of clusters) Output µ 1 , …, µ k (cluster centroids) k-Means clusters the M point into K clusters by minimizing the squared error function

Download Presentation

k-Means and DBSCAN

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory

  2. k-Means • Input • M (set of points) • k (number of clusters) • Output • µ1, …, µk(cluster centroids) • k-Means clusters the M point into K clustersby minimizing the squared error function clusters Si; i=1, …, k. µi is the centroid of all xjSi. Erik Zeitler

  3. k-Means algorithm select (m1 … mK) randomly from M % initial centroids do (µ1 … µK) = (m1 … mK) all clusters Ci = {} for each point p in M % compute cluster membership of p [µi, i] = min(dist(µ,p)) % assign p to the corresponding cluster: Ci = Ci {p} end for each cluster Ci% recompute the centroids mi = avg(x in Ci) while exists mi µi% convergence criterion Erik Zeitler

  4. K-Means on three clusters Erik Zeitler

  5. I’m feeling Unlucky Bad initial points Erik Zeitler

  6. Eat this! Non-spherical clusters Erik Zeitler

  7. k-Means in practice • How to choose initial centroids • select randomly among the data points • generate completely randomly • How to choose k • study the data • run k-Means for different k • measure squared error for each k • Run k-Means many times! • Get many choices of initial points Erik Zeitler

  8. k-Means pros and cons Erik Zeitler

  9. Questions • Euclidean distance results in spherical clusters • What cluster shape does the Manhattan distance give? • Think of other distance measures too. What cluster shapes will those yield? • Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point • give an approximation of the complexity of the algorithm expressed in K, I, N, and X. • Can the K-means algorithm be parallellized? • How? Erik Zeitler

  10. DBSCAN • Density Based Spatial Clustering of Applications with Noise • Basic idea: • If an object p is density connected to q, • then p and q belong to the same cluster • If an object is not density connected to any other object • it is considered noise Erik Zeitler

  11. e Definitions • e-neigborhood • The neighborhood within a radius e of an object • core object An object is a core object iffthere are more than MinPts objects in its e-neighbourhood • directly density reachable (ddr) An object p is ddr from q iffq is a core object and p is inside the e­neighbourhood of q p q Erik Zeitler

  12. o p q Reachability and Connectivity • density reachable (dr) An object q is dr from p iffthere exists a chain of objects p1 … pns.t.- p1is ddr from p, - p2is ddr from p1, - p3is ddr from … and q is ddr from pn • density connected (dc) p is dc to q iff- exist an object o such that p is dr from o - and q is dr from o p2 p1 p q Erik Zeitler

  13. Recall… • Basic idea: • If an object p is density connected to q, • then p and q belong to the same cluster • If an object is not density connected to any other object • it is considered noise Erik Zeitler

  14. p DBSCAN, take 1 i = 0 do take a point p from M find the set of points P which are density connected to p if P = {} M = M \ {p} else Ci=P i=i+1 M = M \ P end while M  {} HOW? Erik Zeitler

  15. p DBSCAN, take 2 i = 0 find the set of core points CP in M do take a point p from CP find the set of points P which are density reachable to p if P = {} M = M \ {p} else Ci=P i=i+1 M = M \ P end while M  {} HOW? Let’s call this function dr(p) Erik Zeitler

  16. Implement dr create function dr(p) -> P C = {p} P = {p} do remove a point p' from C find all points X that are ddr(p') C = C  (X \ (P  X)) % add newly discovered % points to C P = P  X while C ≠ {} result P p’ p’ Erik Zeitler

  17. DBSCAN pros and cons Erik Zeitler

  18. Questions • Why is the dc criterion useful to define a cluster, instead of dr or ddr? • For which points are density reachable symmetric?i.e. for which p, q: dr(p, q) and dr(q, p)? • Express using only core objects and ddr, which objects will belong to a cluster Erik Zeitler

More Related