k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory
k-Means
• Input
  • M (set of points)
  • k (number of clusters)
• Output
  • µ1, …, µk (cluster centroids)
• k-Means clusters the points in M into k clusters by minimizing the squared error function
  $E = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
  over the clusters Si, i = 1, …, k, where µi is the centroid of all xj ∈ Si.
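As a small illustration of the squared error function, the sketch below sums the squared Euclidean distance from every point to the centroid of its cluster. The function names and the two tiny clusters are made up for this example and are not part of the slides.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_error(clusters, centroids):
    """Sum over all clusters of the squared distances to the cluster centroid."""
    total = 0.0
    for points, mu in zip(clusters, centroids):
        for x in points:
            total += sum((a - b) ** 2 for a, b in zip(x, mu))
    return total


# Two hypothetical 2-D clusters, purely for illustration.
S = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
mu = [centroid(s) for s in S]   # [(1.0, 0.0), (10.0, 11.0)]
E = squared_error(S, mu)        # every point lies at distance 1 from its centroid
print(E)                        # 4.0
```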
k-Means algorithm
select (m1 … mK) randomly from M       % initial centroids
do
    (µ1 … µK) = (m1 … mK)
    all clusters Ci = {}
    for each point p in M
        % compute cluster membership of p
        [µi, i] = min(dist(µ, p))      % nearest centroid µi and its index i
        % assign p to the corresponding cluster:
        Ci = Ci ∪ {p}
    end
    for each cluster Ci                % recompute the centroids
        mi = avg(x in Ci)
while exists mi ≠ µi                   % convergence criterion
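The loop above could be sketched in Python roughly as follows. This is one possible reading of the pseudocode, not a reference implementation: points are assumed to be tuples of numbers, an empty cluster keeps its previous centroid, and the `max_iter` cap is an added safeguard that the slide does not mention.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def k_means(M, k, rng=None, max_iter=100):
    rng = rng or random.Random()
    m = rng.sample(M, k)                  # initial centroids, drawn from M
    for _ in range(max_iter):             # assumption: cap on the number of iterations
        mu = m
        C = [[] for _ in range(k)]        # all clusters Ci = {}
        for p in M:
            # index of the centroid nearest to p
            i = min(range(k), key=lambda j: dist(mu[j], p))
            C[i].append(p)
        # recompute the centroids; an empty cluster keeps its old centroid (assumption)
        m = [mean(C[i]) if C[i] else mu[i] for i in range(k)]
        if m == mu:                       # convergence: no centroid moved
            break
    return m, C


M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(M, k=2, rng=random.Random(0))
print(centroids)
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing different initializations.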
K-Means on three clusters
I’m feeling Unlucky Bad initial points
Eat this! Non-spherical clusters
k-Means in practice
• How to choose initial centroids
  • select randomly among the data points
  • generate completely randomly
• How to choose k
  • study the data
  • run k-Means for different k
  • measure squared error for each k
• Run k-Means many times!
  • Get many choices of initial points
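One way to act on this advice is sketched below: for each candidate k, run k-Means several times from different random data points and keep the smallest squared error. The helper names, the number of restarts and the toy data are assumptions made for the example; the compact k-Means inside is only there to keep the sketch self-contained.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means_error(M, k, rng, max_iter=100):
    """One k-Means run started from k random data points; returns its squared error."""
    mu = rng.sample(M, k)
    for _ in range(max_iter):
        C = [[] for _ in range(k)]
        for p in M:
            C[min(range(k), key=lambda j: sq_dist(mu[j], p))].append(p)
        # an empty cluster keeps its old centroid (assumption)
        new_mu = [mean(c) if c else mu[i] for i, c in enumerate(C)]
        if new_mu == mu:
            break
        mu = new_mu
    return sum(sq_dist(p, mu[i]) for i, c in enumerate(C) for p in c)


def best_error_per_k(M, ks, restarts=20, seed=0):
    """For every k, the lowest squared error seen over several random restarts."""
    rng = random.Random(seed)
    return {k: min(k_means_error(M, k, rng) for _ in range(restarts)) for k in ks}


M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best = best_error_per_k(M, ks=range(1, 5))
for k, err in best.items():
    print(k, round(err, 3))
```

Plotting or printing the error against k and looking for the point where it stops dropping sharply is a common heuristic for picking k; the error alone keeps shrinking as k grows, so it cannot be minimized over k directly.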
k-Means pros and cons
Questions
• Euclidean distance results in spherical clusters
  • What cluster shape does the Manhattan distance give?
  • Think of other distance measures too. What cluster shapes will those yield?
• Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point
  • give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
• Can the K-means algorithm be parallelized?
  • How?
DBSCAN
• Density Based Spatial Clustering of Applications with Noise
• Basic idea:
  • If an object p is density connected to q,
    • then p and q belong to the same cluster
  • If an object is not density connected to any other object
    • it is considered noise
Definitions
• ε-neighborhood
  • The neighborhood within a radius ε of an object
• core object
  • An object is a core object iff there are more than MinPts objects in its ε-neighborhood
• directly density reachable (ddr)
  • An object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q
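The three definitions can be written down almost literally. In the sketch below the function names are invented for illustration, distances are Euclidean, and an object is counted as a member of its own ε-neighborhood; both are assumptions, not statements from the slide. The "more than MinPts" test follows the slide's wording, whereas many DBSCAN descriptions use "at least MinPts".

```python
import math


def neighborhood(M, p, eps):
    """eps-neighborhood of p: all objects of M within radius eps of p (p included)."""
    return [q for q in M if math.dist(p, q) <= eps]


def is_core(M, p, eps, min_pts):
    """Core object: more than min_pts objects in its eps-neighborhood."""
    return len(neighborhood(M, p, eps)) > min_pts


def is_ddr(M, p, q, eps, min_pts):
    """True if p is directly density reachable from q."""
    return is_core(M, q, eps, min_pts) and math.dist(p, q) <= eps


# Hypothetical data: a dense 2x2 block, one point on its border, one far away.
M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (9.0, 9.0)]
eps, min_pts = 1.5, 3

print(is_core(M, (1.0, 1.0), eps, min_pts))             # True: 5 objects within eps
print(is_core(M, (2.0, 1.0), eps, min_pts))             # False: only 3 objects within eps
print(is_ddr(M, (2.0, 1.0), (1.0, 1.0), eps, min_pts))  # True
print(is_ddr(M, (1.0, 1.0), (2.0, 1.0), eps, min_pts))  # False: (2, 1) is not a core object
```

The last two lines show that ddr is not symmetric: a border object is reachable from a core object, but not the other way round.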
Reachability and Connectivity
• density reachable (dr)
  • An object q is dr from p iff there exists a chain of objects p1 … pn s.t.
    - p1 is ddr from p,
    - p2 is ddr from p1,
    - …
    - and q is ddr from pn
• density connected (dc)
  • p is dc to q iff there exists an object o such that
    - p is dr from o
    - and q is dr from o
Recall…
• Basic idea:
  • If an object p is density connected to q,
    • then p and q belong to the same cluster
  • If an object is not density connected to any other object
    • it is considered noise
DBSCAN, take 1
i = 0
do
    take a point p from M
    find the set of points P which are density connected to p    % HOW?
    if P = {}
        M = M \ {p}
    else
        Ci = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}
DBSCAN, take 2
i = 0
find the set of core points CP in M
do
    take a point p from CP
    find the set of points P which are density reachable from p    % HOW? Let’s call this function dr(p)
    if P = {}
        M = M \ {p}
    else
        Ci = P
        i = i + 1
        M = M \ P
    end
while M ≠ {}
Implement dr
create function dr(p) -> P
    C = {p}
    P = {p}
    do
        remove a point p' from C
        find all points X that are ddr from p'
        C = C ∪ (X \ (P ∩ X))    % add newly discovered points to C
        P = P ∪ X
    while C ≠ {}
    result P
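A Python sketch of `dr`, together with a clustering loop in the spirit of "take 2", might look like this. It is an interpretation rather than the author's code: all names are hypothetical, distances are Euclidean, a point counts as part of its own ε-neighborhood, and the loop simply skips core points that an earlier cluster has already absorbed. Whatever is left in the end is reported as noise.

```python
import math


def ddr_from(M, q, eps, min_pts):
    """All points directly density reachable from q (empty unless q is a core object)."""
    N = [x for x in M if math.dist(x, q) <= eps]
    return set(N) if len(N) > min_pts else set()   # "more than MinPts", as on the slides


def dr(M, p, eps, min_pts):
    """All points density reachable from p, using the worklist loop from the slide."""
    C = {p}                    # points whose neighborhoods still have to be expanded
    P = {p}                    # result so far
    while C:
        q = C.pop()            # remove a point p' from C
        X = ddr_from(M, q, eps, min_pts)
        C |= X - P             # add newly discovered points to C
        P |= X
    return P


def dbscan(M, eps, min_pts):
    """Clusters grown from core points; points never reached are returned as noise."""
    remaining = set(M)
    core = [p for p in M if ddr_from(M, p, eps, min_pts)]
    clusters = []
    for p in core:
        if p in remaining:     # skip core points already absorbed by a cluster
            P = dr(M, p, eps, min_pts)
            clusters.append(P)
            remaining -= P
    return clusters, remaining


# Hypothetical data: a dense block with a border point, a second block, one outlier.
M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (5.0, 5.0)]
clusters, noise = dbscan(M, eps=1.5, min_pts=3)
print(len(clusters), noise)    # 2 {(5.0, 5.0)}
```

Starting `dr` only from core points matters: from a border point such as (2.0, 1.0) nothing else is density reachable, so `dr` would return just that point. A border point that lies within ε of core points from two different clusters would appear in both sets in this sketch.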
DBSCAN pros and cons
Questions
• Why is the dc criterion useful to define a cluster, instead of dr or ddr?
• For which points is density reachability symmetric, i.e. for which p, q do both dr(p, q) and dr(q, p) hold?
• Express, using only core objects and ddr, which objects will belong to a cluster