k-Means and DBSCAN Gyozo Gidofalvi Uppsala Database Laboratory
Announcements
• Updated material for assignment 2 on the lab course home page.
• Posted sign-up sheets for labs and examinations for assignment 2 outside P1321.
• Posted office hours

k-Means
• Input
  • M (set of points)
  • k (number of clusters)
• Output
  • µ1, …, µk (cluster centroids)
• k-Means clusters the points in M into k clusters by minimizing the squared error function
  $E = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
  over clusters Si, i = 1, …, k, where µi is the centroid of all xj ∈ Si.

k-Means algorithm
    select (m1 … mK) randomly from M        % initial centroids
    do
        (µ1 … µK) = (m1 … mK)
        all clusters Ci = {}
        for each point p in M
            % compute cluster membership of p
            i = argminj(dist(µj, p))
            % assign p to the corresponding cluster
            Ci = Ci ∪ {p}
        end
        for each cluster Ci                 % recompute the centroids
            mi = avg(p in Ci)
        end
    while exists mi ≠ µi                    % convergence criterion

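
The loop above can be sketched in Python. This is a minimal illustration, not the course implementation: points are assumed to be tuples of numbers, and the names `k_means`, `dist2`, `mean`, the `max_iter` safety cap, and the rule that an empty cluster keeps its old centroid are assumptions that do not appear on the slide.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise average of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def k_means(M, k, rng=None, max_iter=100):
    """Return (centroids, clusters), following the do/while loop on the slide."""
    rng = rng or random.Random()
    m = rng.sample(M, k)                      # initial centroids drawn from M
    clusters = []
    for _ in range(max_iter):                 # safety cap (assumption, not on the slide)
        mu = list(m)
        clusters = [[] for _ in range(k)]
        for p in M:
            # index of the closest centroid (argmin over j)
            i = min(range(k), key=lambda j: dist2(mu[j], p))
            clusters[i].append(p)
        # recompute centroids; an empty cluster keeps its old centroid (assumption)
        m = [mean(c) if c else mu[i] for i, c in enumerate(clusters)]
        if m == mu:                           # convergence: no centroid moved
            break
    return m, clusters

# Illustrative data: two well-separated groups of three points each.
M = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(M, 2, rng=random.Random(0))
```

Squared distances are used for the argmin because they give the same ordering as Euclidean distances while avoiding the square root.
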
K-Means on three clusters

I’m feeling Unlucky
• Bad initial points

k-Means in practice
• How to choose initial centroids
  • select randomly among the data points
  • generate completely randomly
• How to choose k
  • study the data
  • run k-Means for different k
  • measure squared error for each k
• Run k-Means many times!
  • Get many choices of initial points

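
The advice above (several restarts, several values of k, compare the squared error) can be sketched as follows. The helper names `lloyd`, `best_of_restarts` and `sse_per_k`, the default of 10 restarts, and the fixed seed are assumptions made for this illustration only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centroids, max_iter=100):
    """Iterate k-Means from the given centroids; return (centroids, squared error)."""
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], p))
            groups[j].append(p)
        # an empty group keeps its previous centroid (assumption)
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(sq_dist(c, p) for c in centroids) for p in points)
    return centroids, sse

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-Means from several random initial point choices; keep the lowest error."""
    rng = random.Random(seed)
    runs = [lloyd(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])

def sse_per_k(points, ks, restarts=10):
    """Squared error of the best run for each candidate k."""
    return {k: best_of_restarts(points, k, restarts)[1] for k in ks}

# Illustrative data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
errors = sse_per_k(points, [1, 2, 3])
```

On this data the error drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the kind of pattern to look for when comparing values of k.
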
k-Means iteration step in AmosQL
• Calculate point-to-centroid distances: calp2c_distance(…)
    select p, c, d
    from Vector of Number p, Vector of Number c, Number d
    where p in bag({iota(1,10)})
      and c in bag({iota(1,10)})
      and d = euclid(p,c);
• Assign each point to the closest centroid: calc_cluster_assignment(…)
    groupby((p2c_distances1(…)), #'argminv');
• Recalculate centroids: calc_clust_means(…)
    groupby(calc_cluster_assignment1(…), #'col_means');

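
The same three steps can be mimicked in Python to see what each query produces. This is a sketch of the idea only, not a translation of the AmosQL functions: the names `p2c_distances`, `cluster_assignment`, `cluster_means` and `k_means_step` are hypothetical, points are assumed to be distinct tuples, and a centroid that attracts no point simply disappears from the result.

```python
import math
from collections import defaultdict

def p2c_distances(points, centroids):
    """Cross product of points and centroids, with the Euclidean distance as third column."""
    return [(p, c, math.dist(p, c)) for p in points for c in centroids]

def cluster_assignment(distances):
    """Group the (p, c, d) rows by point and keep the centroid with the smallest d."""
    best = {}
    for p, c, d in distances:
        if p not in best or d < best[p][1]:
            best[p] = (c, d)
    return [(p, c) for p, (c, _) in best.items()]

def cluster_means(assignment):
    """Group the (p, c) rows by centroid and average the points column-wise."""
    groups = defaultdict(list)
    for p, c in assignment:
        groups[c].append(p)
    return [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups.values()]

def k_means_step(points, centroids):
    """One iteration: distances, then assignment, then new centroids."""
    return cluster_means(cluster_assignment(p2c_distances(points, centroids)))

# Illustrative data.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
new_centroids = k_means_step(points, [(0.0, 1.0), (9.0, 0.0)])
```

Each function corresponds to one bullet: a join that computes distances, a group-by with an argmin aggregate, and a group-by with a column-mean aggregate.
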
Transitive closure
• tclose is a second order function to explore graphs where the edges are expressed by a transition function fno
    tclose(Function fno, Object o) -> Bag of Object
• fno(o) produces the children of o
• tclose applies the transition function fno(o), then fno(fno(o)), then fno(fno(fno(o))), etc. until fno returns no new results

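
The behaviour described above can be sketched in Python as a breadth-first traversal. This is an illustration of the idea, not the AmosQL implementation: it returns a set rather than a bag, and it assumes that the start object o is part of the result. The `edges` dictionary is made-up example data.

```python
from collections import deque

def tclose(fno, o):
    """Return every object reachable from o by repeatedly applying fno.

    Assumption: the start object o itself is included in the result.
    """
    seen = {o}
    frontier = deque([o])
    while frontier:
        current = frontier.popleft()
        for child in fno(current):
            if child not in seen:       # only new results keep the iteration going
                seen.add(child)
                frontier.append(child)
    return seen

# Hypothetical edge relation used only for this illustration (note the cycle d -> a).
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
reachable = tclose(lambda node: edges.get(node, []), "a")
```

Tracking the objects already seen is what makes the traversal stop on cyclic graphs: once fno yields nothing new, the frontier empties.
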
Iterate until convergence with tclose in AmosQL
    create function bagidiv2(Bag of Number b) -> Bag of Bag of Number
      as (select floor(n/2)
          from Number n
          where n in b);

    create function vecchild_idiv2(Vector of Number vb) -> Bag of Vector of Number
      as sort(bagidiv2(in(vb)));

    create function vecconverge_tclose(Bag of Number ib) -> Bag of Vector of Number
      /* tclose function iterating the vecchild_idiv2 function until convergence */
      as select ov
         from Vector of Number ov
         where ov in tclose(#'vecchild_idiv2', sort(ib));

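
To see why this iteration terminates, the same halving step can be traced in Python. The sketch below assumes that the sorted start vector is part of the result and that the iteration stops as soon as the transition produces a vector that was already seen; `iterate_until_convergence` is a hypothetical name, and vectors are modelled as tuples.

```python
def vecchild_idiv2(vec):
    """Halve every element (floor division) and return the sorted result."""
    return tuple(sorted(n // 2 for n in vec))

def iterate_until_convergence(start):
    """Collect the vectors visited until the transition yields nothing new."""
    current = tuple(sorted(start))
    visited = [current]
    while True:
        nxt = vecchild_idiv2(current)
        if nxt in visited:          # no new result: the iteration has converged
            return visited
        visited.append(nxt)
        current = nxt

steps = iterate_until_convergence([8, 5, 3])
# steps == [(3, 5, 8), (1, 2, 4), (0, 1, 2), (0, 0, 1), (0, 0, 0)]
```

The all-zero vector maps to itself, so it is a fixed point of the transition function. The k-Means loop has the same shape: it stops when recomputing the centroids yields the centroids it already had.
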
What about this?!
• Non-spherical clusters
• Noise

k-Means pros and cons

Questions
• Euclidean distance results in spherical clusters
  • What cluster shape does the Manhattan distance give?
  • Think of other distance measures too. What cluster shapes will those yield?
• Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point
  • give an approximation of the complexity of the algorithm expressed in K, I, N, and X.
• Can the K-means algorithm be parallelized?
  • How?

DBSCAN
• Density Based Spatial Clustering of Applications with Noise
• Basic idea:
  • If an object p is density connected to q,
    • then p and q belong to the same cluster
  • If an object is not density connected to any other object
    • it is considered noise

Definitions
• ε-neighborhood
  The ε-neighborhood of an object p is the set of objects within ε-distance of p
• core object
  An object q is a core object iff there are at least MinPts objects in q’s ε-neighborhood
• directly density reachable (ddr)
  An object p is ddr from q iff q is a core object and p is inside the ε-neighborhood of q

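
The three definitions translate almost directly into code. The sketch below is an illustration under stated assumptions: objects are points given as tuples, distance is Euclidean, an object counts as a member of its own neighborhood, and the function and parameter names (`eps_neighborhood`, `is_core`, `is_ddr`, `eps`, `min_pts`) are made up for this example.

```python
import math

def eps_neighborhood(points, p, eps):
    """All objects within eps-distance of p (p itself included, an assumption)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """q is a core object iff its eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def is_ddr(points, p, q, eps, min_pts):
    """p is directly density reachable from q iff q is core and p lies in q's neighborhood."""
    return is_core(points, q, eps, min_pts) and math.dist(p, q) <= eps

# Illustrative data: (0,0) and (1,0) are core objects for eps=1, min_pts=3;
# (0,1) and (2,0) are not.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0)]
forward = is_ddr(points, (2.0, 0.0), (1.0, 0.0), eps=1.0, min_pts=3)
backward = is_ddr(points, (1.0, 0.0), (2.0, 0.0), eps=1.0, min_pts=3)
```

Here `forward` holds while `backward` does not, because (2,0) is not a core object: ddr is not symmetric in general.
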
Reachability and Connectivity
• density reachable (dr)
  An object p is dr from q iff there exists a chain of objects q1 … qn s.t.
  - q1 is ddr from q,
  - q2 is ddr from q1,
  - q3 is ddr from …
  - and p is ddr from qn
• density connected (dc)
  p is dc to r iff
  - there exists an object q such that p is dr from q
  - and r is dr from q

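
Both relations can be checked by following ddr chains. The sketch below is one possible reading of the definitions, with assumed names (`ddr_from`, `dr_from`, `is_dc`), Euclidean distance, points as tuples, and a neighborhood that includes the object itself. The `is_dc` check tries every object as the intermediate q, which is simple but not efficient.

```python
import math

def ddr_from(points, q, eps, min_pts):
    """Objects directly density reachable from q (empty unless q is a core object)."""
    near = [p for p in points if math.dist(p, q) <= eps]
    return near if len(near) >= min_pts else []

def dr_from(points, q, eps, min_pts):
    """Objects density reachable from q: follow ddr chains until nothing new appears."""
    reached, stack = set(), [q]
    while stack:
        current = stack.pop()
        for p in ddr_from(points, current, eps, min_pts):
            if p not in reached:
                reached.add(p)
                stack.append(p)
    return reached

def is_dc(points, p, r, eps, min_pts):
    """p and r are density connected iff some object q reaches both of them."""
    return any({p, r} <= dr_from(points, q, eps, min_pts) for q in points)

# Illustrative data: one dense group with two border points, plus an isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 0.0), (5.0, 5.0)]
border_pair = is_dc(points, (0.0, 1.0), (2.0, 0.0), eps=1.0, min_pts=3)
with_outlier = is_dc(points, (5.0, 5.0), (0.0, 0.0), eps=1.0, min_pts=3)
```

The two border points (0,1) and (2,0) are not density reachable from each other, yet they are density connected through the core object (0,0).
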
Recall…
• Basic idea:
  • If an object p is density connected to q,
    • then p and q belong to the same cluster
  • If an object is not density connected to any other object
    • it is considered noise

DBSCAN
    i = 1
    do
        take a point p from M
        find the set of points P which are density connected to p    % HOW?
        if P = {}
            M = M \ {p}
        else
            Ci = P
            i = i + 1
            M = M \ P
        end
    while M ≠ {}

Finding density connected components
• If r is dc to p, there exists q s.t. both p and r are dr from q, i.e., there exists a ddr-chain from q to both r and p, and q is a core object.
• Recall: tclose is a second order function to explore graphs where the edges are expressed by a transition function fno.
• fno = ddr

Finding dc components in AmosQL
• Assuming q is a core object and a ddr function with the following signature is defined:
    ddr(Integer q) -> Bag of Integer p
• Then:
    create function dc(Integer q) -> Bag of Integer
      as select p
         from Integer p
         where p in tclose(#'ddr', q);

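
Putting the pieces together, a whole DBSCAN pass can be sketched in Python by taking the closure of ddr from each core object that is not yet clustered. This is a simplified illustration, not the lab solution: the names `ddr`, `dc_component` and `dbscan` are assumptions, points are tuples with Euclidean distance, an object counts towards its own neighborhood, and a border object reachable from two clusters goes to whichever cluster is found first. Unlike the pseudocode slide, non-core points are skipped rather than removed, so that border points can still join a cluster found later.

```python
import math

def ddr(points, q, eps, min_pts):
    """Objects directly density reachable from q (empty unless q is a core object)."""
    near = [p for p in points if math.dist(p, q) <= eps]
    return near if len(near) >= min_pts else []

def dc_component(points, q, eps, min_pts):
    """Closure of ddr starting at q; empty when q is not a core object."""
    component, stack = set(), [q]
    while stack:
        for p in ddr(points, stack.pop(), eps, min_pts):
            if p not in component:
                component.add(p)
                stack.append(p)
    return component

def dbscan(points, eps, min_pts):
    """Return (clusters, noise), where clusters is a list of sets of points."""
    label = {}                                   # point -> cluster index
    clusters = []
    for p in points:
        if p in label:
            continue
        component = dc_component(points, p, eps, min_pts)
        members = {x for x in component if x not in label}
        if members:                              # empty when p is not a core object
            for x in members:
                label[x] = len(clusters)
            clusters.append(members)
    noise = [p for p in points if p not in label]
    return clusters, noise

# Illustrative data: two dense groups and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (2, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
clusters, noise = dbscan(points, eps=1.0, min_pts=3)
```

Every ddr call scans all points, so this sketch is quadratic in the number of points. A spatial index would be the usual way to speed up the neighborhood lookup.
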
DBSCAN pros and cons

Questions
• Why is the dc criterion useful to define a cluster, instead of dr or ddr?
• For which points is density reachability symmetric, i.e., for which p, q do both dr(p, q) and dr(q, p) hold?
• Express, using only core objects and ddr, which objects will belong to a cluster