Clustering
Example application: posterizing • Lots of pixels, many colors • Want to pick just a few colors • Solution: treat RGB triples as points in R^3 and cluster • Use the center points of the clusters as the new colors (a sketch follows below)
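A minimal sketch of this in MATLAB (it assumes the Statistics and Image Processing toolboxes; 'peppers.png' is just a sample image that ships with MATLAB):

% Posterize: treat each pixel's RGB triple as a point in R^3, cluster with k-means,
% and repaint every pixel with its cluster's center color.
img = im2double(imread('peppers.png'));   % h x w x 3, values in [0,1]
[h, w, ~] = size(img);
pixels = reshape(img, h*w, 3);            % one row per pixel: (r, g, b)
k = 8;                                    % number of posterized colors (illustrative)
[idx, centers] = kmeans(pixels, k);       % idx(i) = cluster id of pixel i
posterized = reshape(centers(idx, :), h, w, 3);
imshow(posterized);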
Posterization problems • The “distance” in RGB space, sqrt((r1 - r2)^2 + (g1 - g2)^2 + (b1 - b2)^2), is not “perceptually uniform” • A distance of 0.2 in one region (black to grey, for example) may look much larger than the same distance in another (yellow to yellow-green) • The approach ignores pixel adjacency in the image • One solution: cluster in R^5 = (x, y, r, g, b), sketched below
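A sketch of that fix, continuing the snippet above (the spatial weight lambda is an assumption, not a tuned value): append scaled (x, y) coordinates to each RGB triple and cluster the 5-D points, so nearby pixels tend to land in the same cluster.

% Cluster in R^5 = (x, y, r, g, b); reuses h, w, pixels, and k from the sketch above.
[xs, ys] = meshgrid(1:w, 1:h);            % pixel coordinates (column-major, matching pixels)
lambda = 0.002;                           % illustrative weight on the spatial coordinates
features = [lambda * xs(:), lambda * ys(:), pixels];
[idx5, centers5] = kmeans(features, k);
posterized5 = reshape(centers5(idx5, 3:5), h, w, 3);   % repaint with the RGB part of each center
imshow(posterized5);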
Problems • Not really a “clustering” problem (here, segmenting tissue types in a medical image), although similar tissues do tend to cluster • Fundamentally a mixture-model problem: a single pixel may contain both bone and soft tissue, for example
Friendship nets: Facebook • Given Facebook data… • Construct “clusters of friends”
Problems • No coordinates • Information about “closeness” is 0/1 (“have you friended me yet???”)
Approach • N = number of movies Netflix has • My coordinates = (1, 0, 0, 1, 1, 0, …), where x_i = 1 means “I liked movie i” • Now finding clusters lets Netflix make recommendations
Problems • My coordinates really look like this: (*, *, 0, *, *, *, …, *, 1, 1, *, *, …) with “*” meaning “Never seen the movie and don’t know.” • Even if we did know all my coordinates, the problem lives in {0,1}^N rather than R^N; is our Euclidean intuition really appropriate? (A sketch of one alternative follows below.)
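A sketch of the second concern (the like-vectors below are made up): for points in {0,1}^N, a set-overlap measure such as Jaccard distance, computed only over movies both users have actually seen, is often a better fit than Euclidean distance.

% Toy data: three users' like-vectors over six movies; NaN means "never seen it".
likes = [1 0 NaN 1 1 0;
         1 0 0   1 1 1;
         0 1 1   0 0 NaN];
n = size(likes, 1);
D = zeros(n);
for i = 1:n
    for j = i+1:n
        known = ~isnan(likes(i,:)) & ~isnan(likes(j,:));   % movies both have seen
        a = likes(i, known);  b = likes(j, known);
        interCount = sum(a & b);
        unionCount = sum(a | b);
        if unionCount == 0
            D(i,j) = 1;                                    % no overlap at all: maximally far
        else
            D(i,j) = 1 - interCount / unionCount;          % Jaccard distance
        end
        D(j,i) = D(i,j);
    end
end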
Document classification • Could represent a document by a vector(0,1,0,…) representing whether each English word (aardvark, and, anchovy, …) occurs in the doc • Clusters represent “similar topics”
Problems • Non-isotropic distance: two documents having “the” in common are far less likely to be similar than two with “aardvark” in common • Really need a distance metric that compensates for this before applying clustering
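One common compensation, sketched here purely as an illustration (the toy matrix is made up; pdist and squareform need the Statistics Toolbox): weight each word by how rare it is (an IDF-style weight) before measuring distance, so shared occurrences of “the” count for almost nothing.

% X is a documents x words 0/1 matrix (3 toy documents, 4 toy words).
X = [1 1 0 1;
     1 0 1 1;
     1 1 1 0];
nDocs = size(X, 1);
docFreq = sum(X > 0, 1);              % number of documents containing each word
idf = log(nDocs ./ max(docFreq, 1));  % rare words get large weights; ubiquitous words ~0
Xw = X .* idf;                        % scale the columns (implicit expansion, R2016b+)
dists = squareform(pdist(Xw));        % weighted Euclidean distances between documents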
Conclusion • Clustering seems to have a lot of interesting applications… • But it’s important, before starting, to have an embedding of your data in R^N where distance in R^N is really related to distance between items (at least for small distances!)
A first clustering algorithm • Assuming data that’s really pretty well clustered…how do you find the clusters? • Intro to K-means
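As a preview, here is a bare-bones sketch of the k-means (Lloyd) iteration; the function name is ours, and it needs pdist2 from the Statistics Toolbox. It is an illustration of the idea, not the toolbox kmeans used later.

function [labels, centers] = simple_kmeans(P, k, iters)
% Minimal k-means sketch: P is n x d (one point per row), k = number of clusters.
% Alternate two steps: assign each point to its nearest center, then move each
% center to the mean of its assigned points.
n = size(P, 1);
centers = P(randperm(n, k), :);          % initialize with k distinct data points
for t = 1:iters
    d2 = pdist2(P, centers).^2;          % n x k squared distances to each center
    [~, labels] = min(d2, [], 2);        % nearest center for every point
    for c = 1:k
        members = P(labels == c, :);
        if ~isempty(members)
            centers(c, :) = mean(members, 1);
        end
    end
end
end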
Conclusion • We might want to use a distance, D, to indicate how much two things are “in the same cluster” • Tempting to write D(p_i, p_k) • Really needs to be D({p_1, p_2, …}, p_i, p_k) • One view of clustering is that we want to use Euclidean distance, d, to bootstrap discovery of the cluster distance, D • K-means works when d and D are very similar
How are distance and “cluster distance” related? • “If you’re a friend of my friend, you’re my friend” • Suggests a graph-theory approach: find connected components in a graph • Edges in the graph when two points are “close enough” (a sketch follows below)
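A minimal sketch of that approach in MATLAB (P and epsilon are placeholders for your point matrix and distance threshold; graph/conncomp need R2015b or later, pdist needs the Statistics Toolbox):

% Connect two points whenever they are within epsilon; clusters = connected components.
D = squareform(pdist(P));            % n x n matrix of pairwise distances
A = D < epsilon & ~eye(size(D));     % adjacency: "close enough", no self-loops
labels = conncomp(graph(A));         % labels(i) = id of the component containing point i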
Problems • Edges in the graph when two points are “close enough” • Very data-sensitive: a small perturbation of the data can join two clusters • These points are much “closer” than these: (figure not reproduced here) • “If you’re friends with lots of my friends, you’re my friend.”
Leads to the study of “how connected are the nodes in a graph?” • The travelling-token problem was an intro to that question
…and so on • Make a VERY long movie • Play it VERY rapidly • How “pink” each node appears tells you what fraction of the time the token spends there.
Create a shorter movie! • Two tokens (possibly at the same spot) in each frame
Apply repeatedly • Limiting version of the movie… • Has a huge number of moving tokens • Every frame must look the same!
Every frame looks the same • If u_i = number of tokens at node i • d_i = degree of node i • j is a neighbor of i • Then • i sends j a quantity u_i / d_i of tokens • “Every frame looks the same” if j sends that many back to i • That happens exactly when u_k / d_k = constant (the same value for every node k) • So the population of tokens at a node is proportional to the node’s degree!
Matrix form • Let a_ij be 1 if i and j are connected, 0 otherwise • Let A = (a_ij) and D = diag(degree of node 1, degree of node 2, …) • Node j receives sum_i a_ij u_i / d_i tokens each frame, so “every frame looks the same” means u = A D^(-1) u • Equivalently, with the random-walk matrix M = D^(-1) A, u^T M = u^T: u is a left eigenvector of M with eigenvalue 1, and u_i ∝ d_i is exactly such a u • Insight: things related to graph diffusion, neighborhood structure, and clustering lead to eigenvalue problems (a small numerical check follows below)
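A small numerical check of that claim on a made-up four-node graph (the graph is purely illustrative): with u equal to the degree vector, the token-passing update A D^(-1) u returns u unchanged.

% Verify on a toy graph that u = (degrees) is fixed by the update u <- A*D^(-1)*u.
A = [0 1 1 0;        % adjacency of a 4-node example graph (illustrative only)
     1 0 1 0;
     1 1 0 1;
     0 0 1 0];
d = sum(A, 2);       % degrees of the nodes
M = A / diag(d);     % M = A * D^(-1): column j of A scaled by 1/d_j
u = d;               % candidate steady state: tokens proportional to degree
disp(norm(M*u - u))  % prints ~0: every frame looks the same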
Ng, Jordan, Weiss • S = {s_1, s_2, …, s_n} in R^p; we want to cluster it into k subsets • Form the n x n affinity matrix A with a_ij = exp(-||s_i - s_j||^2 / (2σ^2)), except a_ii = 0 • D = diag(row sums of A); L = D^(-1/2) A D^(-1/2) • Find the k largest eigenvectors of L; arrange them as the columns of an n x k matrix X
X contains the eigenvectors as columns • Normalize each row of X to unit length to get Y • Rows of Y are points on the unit sphere, one row per original point • Cluster these points on the unit sphere using k-means; assign each original point s_i the cluster of row i
Matlab
function IDX = njw(s, sigma, k)
% s = n x 3 array of pts; sigma = kernel width; k = number of clusters.
n = size(s, 1);
[X, Y] = meshgrid(1:n, 1:n);
diffs = s(X(:), :) - s(Y(:), :);
dists = reshape(dot(diffs', diffs'), n, n);    % squared pairwise distances
A = exp(-dists / (2*sigma^2));
A(1:n+1:end) = 0;                              % zero the diagonal (a_ii = 0)
D = sum(A, 2);                                 % ith entry is the sum of A's ith row
L = diag(1 ./ sqrt(D)) * A * diag(1 ./ sqrt(D));
[V, ~] = eigs(L, k);                           % k largest eigenvalues/eigenvectors of L
Y = diag(1 ./ sqrt(dot(V', V'))) * V;          % normalize rows of V to unit length
IDX = kmeans(Y, k, 'emptyaction', 'singleton');
% IDX is a vector of cluster-ids, one per point of S.
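A usage sketch (the synthetic data and parameter values are illustrative guesses); the visualization snippet on the next slide can then be run on the result:

% Two well-separated 3-D blobs; the spectral clustering should recover them.
s = [randn(50, 3); randn(50, 3) + 5];   % synthetic data, purely illustrative
sigma = 1.0;  k = 2;                    % assumed kernel width and cluster count
IDX = njw(s, sigma, k);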
Visualization part…
% Plot each cluster of s in its own color.
clf; hold on;
for t = 1:k
    pts = s(IDX == t, :);
    c = hsv2rgb([(t-1)/k, 0.8, 0.8]);
    plot3(pts(:,1), pts(:,2), pts(:,3), 'o', 'Color', c);
end
hold off; figure(gcf);