  1. Clustering

  2. Example application: posterizing • Lots of pixels, many colors • Want to pick just a few colors • Solution: treat RGB triples as points in $\mathbb{R}^3$ and cluster them • Use the center points of the clusters as the new colors
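
The last bullet can be sketched in a few lines of Python (chosen here for brevity; the Matlab listing later in these slides is separate). This is a minimal sketch that assumes the cluster centers have already been found by some clustering step; the names `nearest_center` and `posterize` and the sample data are hypothetical.

```python
import math

def nearest_center(pixel, centers):
    """Return the cluster center closest to an (r, g, b) pixel."""
    return min(centers, key=lambda c: math.dist(pixel, c))

def posterize(pixels, centers):
    """Replace every pixel with its nearest cluster center."""
    return [nearest_center(p, centers) for p in pixels]

# Hypothetical data: four pixels and two centers assumed to come
# from an earlier clustering step.
pixels = [(0.9, 0.1, 0.1), (0.8, 0.2, 0.0), (0.1, 0.1, 0.9), (0.0, 0.2, 0.8)]
centers = [(0.85, 0.15, 0.05), (0.05, 0.15, 0.85)]
poster = posterize(pixels, centers)  # reddish pixels -> first center, bluish -> second
```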

  3. Posterization problems • The “distance” in RGB space, $\sqrt{(r_1 - r_2)^2 + (g_1 - g_2)^2 + (b_1 - b_2)^2}$, is not “perceptually uniform” • A distance of 0.2 in one area (black to grey, for example) may seem much larger than the same distance in another (yellow to yellow-green) • The approach ignores pixel adjacency in the image • One solution: cluster in $\mathbb{R}^5$, using points $(x, y, r, g, b)$
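
A rough Python sketch of the two ingredients on this slide: the RGB distance, and the five-dimensional points that add pixel position. Scaling x and y to [0, 1] and the `spatial_weight` parameter are assumptions made here for illustration; the slide does not say how position and color should be balanced.

```python
import math

def rgb_distance(c1, c2):
    """Euclidean distance between two (r, g, b) triples."""
    return math.dist(c1, c2)

def embed_r5(image, spatial_weight=1.0):
    """Turn an image (list of rows of (r, g, b)) into (x, y, r, g, b) points.

    x and y are scaled to [0, 1] and multiplied by spatial_weight,
    a hypothetical knob for how much adjacency matters relative to color.
    """
    height, width = len(image), len(image[0])
    points = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            px = spatial_weight * x / max(width - 1, 1)
            py = spatial_weight * y / max(height - 1, 1)
            points.append((px, py, r, g, b))
    return points

# Hypothetical 2 x 2 image: dark pixels on the left, light on the right.
image = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
         [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]]
points = embed_r5(image)
```

Any clustering routine that works on tuples of numbers can then be run on `points` instead of on the bare colors.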

  4. Tissue Classification

  5. Problems • Not really a “clustering” problem, although similar tissues tend to be clustered • Fundamentally a mixture model: a single pixel may contain both bone and soft tissue, for example.

  6. Friendship nets: Facebook • Given Facebook data… • Construct “clusters of friends”

  7. Problems • No coordinates • Information about “closeness” is 0/1 (“have you friended me yet???”)

  8. Netflix

  9. Approach • N = number of movies Netflix has • My coordinates = (1,0,0,1,1,0,…), where $x_i = 1$ means “I liked movie i” • Now finding clusters lets Netflix make recommendations

  10. Problems • My coordinates really look like this: (*,*,0,*,*,*,…,*,1,1,*,*,…), with “*” meaning “Never seen the movie and don’t know.” • Even if we did know all my coordinates, the problem lies in $\{0,1\}^N$ rather than $\mathbb{R}^N$; is our Euclidean intuition really appropriate?

  11. Document classification • Could represent a document by a vector (0,1,0,…) recording whether each English word (aardvark, and, anchovy, …) occurs in the document • Clusters represent “similar topics”

  12. Problems • Non-isotropic distance: two documents having “the” in common are far less likely to be similar than two with “aardvark” in common • Really need a distance metric that compensates for this before applying clustering
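
To make the complaint concrete, here is a small Python sketch of the 0/1 word-occurrence vectors with plain Euclidean distance. The five-word vocabulary and the three documents are invented for illustration. Sharing “the” and sharing “aardvark” count exactly the same, which is the problem the slide describes.

```python
import math

# Hypothetical tiny vocabulary standing in for "every English word".
VOCAB = ["aardvark", "anchovy", "and", "the", "zebra"]

def to_vector(text):
    """0/1 vector: does each vocabulary word occur in the text?"""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in VOCAB]

d1 = to_vector("the aardvark")
d2 = to_vector("the anchovy")      # shares only "the" with d1
d3 = to_vector("zebra aardvark")   # shares only "aardvark" with d1

common_the = math.dist(d1, d2)
common_aardvark = math.dist(d1, d3)
# Both distances are equal: unweighted Euclidean distance cannot tell
# a shared common word from a shared rare one.
```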

  13. Conclusion • Clustering seems to have a lot of interesting applications… • But it’s important, before starting, to have an embedding of your data in $\mathbb{R}^N$ where distance in $\mathbb{R}^N$ is really related to distance between items (at least for small distances!)

  14. A first clustering algorithm • Assuming data that’s really pretty well clustered…how do you find the clusters? • Intro to K-means
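
For orientation, a minimal Python sketch of the standard K-means iteration (Lloyd's algorithm): alternate between assigning each point to its nearest center and moving each center to the mean of its points. The slides do not spell out these details, so the random initialization, the fixed iteration count, and the parameter names are assumptions.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of numbers) into k groups; return (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random data points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return centers, labels

# Hypothetical, well-separated data: two blobs of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

On data this cleanly separated the two blobs come out as the two clusters; which blob gets label 0 depends on the initialization.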

  15. Distance vs. Cluster distance

  16. Conclusion • We might want to use a distance, D, to indicate how much two things are “in the same cluster” • Tempting to write $D(p_i, p_k)$ • Really needs to be $D(\{p_1, p_2, \ldots\}, p_i, p_k)$, because it depends on the whole data set • One view of clustering is that we want to use Euclidean distance, d, to bootstrap discovery of cluster distance, D • K-means works when d and D are very similar.

  17. How are distance and “cluster distance” related? • “If you’re a friend of my friend, you’re my friend” • Suggests a graph-theory approach: find connected components in a graph • Edges in graph when two points are “close enough”
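
The graph-theory approach can be sketched in Python as follows: join two points by an edge when their distance is at most a threshold, then label the connected components. The threshold name `eps` and the sample points are hypothetical.

```python
import math

def threshold_clusters(points, eps):
    """Label connected components of the graph with an edge wherever dist <= eps."""
    n = len(points)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        # Depth-first search from an unlabeled point.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
labels = threshold_clusters(points, eps=1.5)  # [0, 0, 0, 1, 1]
```

Note that (0, 0) and (2, 0) end up together even though they are farther apart than `eps`: the middle point is the “friend of my friend”.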

  18. Problems • Edges in graph when two points are “close enough” • Very data-sensitive: a small perturbation of data can join two clusters • These points are much “closer” than these: • “If you’re friends with lots of my friends, you’re my friend.”

  19. Leads to study of “how connected are nodes in a graph”? • The travelling token problem was an intro to that question

  20. Solution to travelling token

  21. …and so on • Make a VERY long movie • Play it VERY rapidly • How “pink” each node appears tells you what fraction of the time the token spends there.

  22. Insight:

  23. Create a shorter movie! • Two tokens (possibly at same spot) in each frame

  24. “Doubled” movie

  25. Apply repeatedly • Limiting version of the movie… • Has a huge number of moving tokens • every frame must look the same!

  26. Every frame looks the same • Let $u_i$ = number of tokens at node i, $d_i$ = degree of node i, and let j be a neighbor of i • Then i sends j a quantity $u_i / d_i$ of tokens in each frame • “Every frame looks the same” if j sends that many back to i, that is, if $u_i / d_i = u_j / d_j$ • That happens exactly if $u_k / d_k$ is the same constant for every node k • The population of tokens at a node is proportional to the node’s degree!
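
A small Python simulation of this argument, on a hypothetical four-node graph. In each frame every node splits its tokens equally among its neighbors; repeating this from an arbitrary start should approach a population proportional to degree. The graph is an assumption chosen to be connected and to contain a triangle, so that the iteration settles down rather than oscillating.

```python
def diffuse(neighbors, u):
    """One frame: node i sends u[i] / degree(i) tokens to each of its neighbors."""
    new_u = [0.0] * len(u)
    for i, nbrs in enumerate(neighbors):
        share = u[i] / len(nbrs)
        for j in nbrs:
            new_u[j] += share
    return new_u

# Hypothetical graph with edges 0-1, 1-2, 1-3, 2-3; degrees are 1, 3, 2, 2.
neighbors = [[1], [0, 2, 3], [1, 3], [1, 2]]
degrees = [len(nbrs) for nbrs in neighbors]

# Start with all 8 tokens on node 0 (8 = sum of the degrees).
u = [8.0, 0.0, 0.0, 0.0]
for _ in range(200):
    u = diffuse(neighbors, u)
# u is now very close to the degree vector [1, 3, 2, 2].
```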

  27. Matrix form • Let $a_{ij}$ be 1 if i and j are connected, and 0 otherwise • Let D = diag(deg of node 1, deg of node 2, …) • Let $M = D^{-1} A$ • Node j receives $\sum_i a_{ij} u_i / d_i$ tokens per frame, so our solution u satisfies $M^T u = u$ (equivalently $u^T M = u^T$) • Insight: Things related to graph diffusion, neighboring, clustering, are related to eigenvalue problems.
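
The matrix form can be checked numerically with plain Python lists. With $M = D^{-1}A$, row i of M holds the fraction of its tokens that node i sends to each neighbor, so the population after one frame is $M^T u$. The sketch below verifies that the degree vector is left unchanged by that step. The adjacency matrix is a hypothetical four-node example.

```python
def transition_matrix(A):
    """M = D^{-1} A: divide each row of A by that node's degree."""
    return [[a / sum(row) for a in row] for row in A]

def step(M, u):
    """One frame: new_u[j] = sum_i u[i] * M[i][j], which is (M^T u)[j]."""
    n = len(u)
    return [sum(u[i] * M[i][j] for i in range(n)) for j in range(n)]

# Hypothetical graph with edges 0-1, 1-2, 1-3, 2-3.
A = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 0]]
M = transition_matrix(A)
u = [sum(row) for row in A]   # degree vector
after = step(M, u)            # equals u up to rounding: an eigenvector with eigenvalue 1
```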

  28. Some Graph Terminology and Notation

  29. Ng, Jordan, Weiss • $S = \{s_1, s_2, \ldots, s_n\}$ in $\mathbb{R}^p$. Want to cluster into k subsets • Form the n × n matrix A with $a_{ij} = \exp(-\|s_i - s_j\|^2 / (2\sigma^2))$, except $a_{ii} = 0$ • D = diag(row sums of A); $L = D^{-1/2} A D^{-1/2}$ • Find the eigenvectors of L for its k largest eigenvalues; arrange them as columns in an n × k matrix, X

  30. X contains eigenvectors as columns • Normalize each row of X to unit length to get Y. • Rows of Y are points on the unit sphere. • There’s one row per original point • Cluster these points on the unit sphere using k-means; assign each original point of S to the cluster of its row.
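
The matrix-building steps of the algorithm can be sketched in standard-library Python. This covers only the affinity matrix, the normalization $D^{-1/2} A D^{-1/2}$, and the row normalization; the eigenvector computation and k-means are left to a library and are not shown. Function names and sample points are hypothetical.

```python
import math

def affinity(points, sigma):
    """a_ij = exp(-||s_i - s_j||^2 / (2 sigma^2)), with a_ii = 0."""
    n = len(points)
    return [[0.0 if i == j
             else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
             for j in range(n)]
            for i in range(n)]

def normalized_affinity(A):
    """L = D^{-1/2} A D^{-1/2}, where D holds the row sums of A."""
    d = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]

def normalize_rows(X):
    """Scale each row of X to unit length (X -> Y)."""
    return [[x / math.sqrt(sum(v * v for v in row)) for x in row] for row in X]

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
A = affinity(pts, sigma=1.0)
L = normalized_affinity(A)
Y = normalize_rows([[3.0, 4.0], [1.0, 0.0]])  # rows now have length 1
```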

  31. Matlab

      function IDX = njw(s, sigma, k)
      % s = n x 3 array of points; sigma = kernel width; k = number of clusters.
      % Returns IDX, a vector of cluster ids, one per point of s.
      n = size(s, 1);
      [I, J] = meshgrid(1:n, 1:n);
      diffs = s(I, :) - s(J, :);                    % all pairwise differences
      dists = reshape(dot(diffs', diffs'), n, n);   % squared distances
      A = exp(-dists / (2 * sigma^2));
      A(1:n+1:end) = 0;                             % zero the diagonal
      D = sum(A, 2);                                % ith entry is the sum of A's ith row
      L = diag(1 ./ sqrt(D)) * A * diag(1 ./ sqrt(D));
      [X, ~] = eigs(L, k);                          % eigenvectors for the k largest-magnitude eigenvalues of L
      Y = diag(1 ./ sqrt(dot(X', X'))) * X;         % normalize rows to unit length
      IDX = kmeans(Y, k, 'emptyaction', 'singleton');

  32. Visualization part…

      clf; hold on;
      for t = 1:k
          pts = s(IDX == t, :);                     % points assigned to cluster t
          c = hsv2rgb([(t-1)/k, 0.8, 0.8]);         % one hue per cluster
          plot3(pts(:,1), pts(:,2), pts(:,3), ...
                'o', 'Color', c);
      end
      hold off; figure(gcf);

