Clustering and Modularity
Global View of Clustering Clustering is a data mining technique for analyzing structure in a data set. Many, many, many different criteria are available: k-center, k-median, k-means, Inter-Intra, etc.
k-center: minimize the maximum distance
k-median: minimize the average distance. k-means: minimize the distance squared
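A minimal sketch of the three objectives, assuming points in the plane, Euclidean distance, and that every point is charged the distance to its nearest center. The function names and the sample data are illustrative, not taken from the slides.

```python
import math

def nearest_center_distances(points, centers):
    """Distance from each point to its closest center."""
    return [min(math.dist(p, c) for c in centers) for p in points]

def k_center_cost(points, centers):
    # k-center: the worst-case (maximum) distance.
    return max(nearest_center_distances(points, centers))

def k_median_cost(points, centers):
    # k-median: the average distance.
    costs = nearest_center_distances(points, centers)
    return sum(costs) / len(costs)

def k_means_cost(points, centers):
    # k-means: the sum of squared distances.
    return sum(c * c for c in nearest_center_distances(points, centers))

points = [(0, 0), (0, 2), (10, 0), (10, 4)]
centers = [(0, 1), (10, 2)]
center_cost = k_center_cost(points, centers)
median_cost = k_median_cost(points, centers)
means_cost = k_means_cost(points, centers)
```

The same set of centers can score very differently under the three criteria, which is why the choice of objective matters.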
Inter-Intra T(C): tightness of the clusters. D(C): distance between clusters. Maximize D(C) – T(C)
Axioms of Clustering • Clustering function: $f$ operates on a set $S$ of $n \ge 2$ points and the distances among them, $f(d) = \Gamma$, where $\Gamma$ is a partition of $S$ • Distance function: the distance $d(i,j)$ is 0 only for $i = j$ • Does not require the triangle inequality.
Axiom 1 – Scale-Invariance • For any distance function $d$ and any $\alpha > 0$ we have that $f(d) = f(\alpha \cdot d)$
Axiom 2 - Richness Range(f) is equal to the set of all partitions of S. All possible clusterings can be generated given the right distances.
Axiom 3 - Consistency Let $d$ and $d'$ be two distance functions. If $f(d) = \Gamma$ and $d'$ is such that the distance between all points in a cluster is less than in $d$ and the distance between inter-cluster points is larger than in $d$, then $f(d') = \Gamma$
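The condition in the consistency axiom can be checked mechanically. The sketch below assumes distances are stored in a dictionary keyed by sorted pairs of point names, and it uses non-strict inequalities (distances may also stay the same), which is an assumption about the intended reading. The helper name `is_consistent_change` is hypothetical.

```python
from itertools import combinations

def is_consistent_change(d, d_prime, partition):
    """True if d_prime shrinks (or keeps) every within-cluster distance
    and grows (or keeps) every between-cluster distance of `partition`."""
    label = {p: i for i, block in enumerate(partition) for p in block}
    for i, j in combinations(sorted(label), 2):
        same_cluster = label[i] == label[j]
        if same_cluster and d_prime[(i, j)] > d[(i, j)]:
            return False
        if not same_cluster and d_prime[(i, j)] < d[(i, j)]:
            return False
    return True

partition = [{"a", "b"}, {"c"}]
d = {("a", "b"): 1.0, ("a", "c"): 5.0, ("b", "c"): 5.0}
tighter = {("a", "b"): 0.5, ("a", "c"): 6.0, ("b", "c"): 5.0}
looser = {("a", "b"): 2.0, ("a", "c"): 5.0, ("b", "c"): 5.0}
ok_change = is_consistent_change(d, tighter, partition)
bad_change = is_consistent_change(d, looser, partition)
```

The axiom then demands that a clustering function returning `partition` on `d` also returns it on any distance function that passes this check.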
Main Result For each $n \ge 2$, there is no clustering function $f$ that satisfies Scale-Invariance, Richness and Consistency. Implied by proof that if $f$ satisfies Scale-Invariance and Consistency, then Range(f) is an anti-chain
Sparsest Cut Given a graph $G = (V, E)$, find a cut $(S, \bar{S})$ that minimizes $\frac{|E(S, \bar{S})|}{|S| \cdot |\bar{S}|}$. Favors min cuts that are approximately balanced. ARV gives an $O(\sqrt{\log n})$ approximation to this problem
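A brute-force illustration on a tiny graph, assuming the uniform sparsest-cut ratio (crossing edges divided by the product of the two side sizes). That ratio is an assumption about the intended objective, and exhaustive search is only feasible for a handful of vertices; it is not the ARV algorithm.

```python
from itertools import combinations

def sparsest_cut(nodes, edges):
    """Try every subset S with |S| <= n/2 and return (ratio, S) for the best."""
    nodes = list(nodes)
    best = None
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            side = set(subset)
            crossing = sum((u in side) != (v in side) for u, v in edges)
            ratio = crossing / (len(side) * (len(nodes) - len(side)))
            if best is None or ratio < best[0]:
                best = (ratio, side)
    return best

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
ratio, side = sparsest_cut(range(6), edges)
```

On this graph the balanced cut across the bridge wins over cutting off a single vertex, even though both cut few edges.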
The Null Hypothesis The expected degree model
Modularity Definition Deviation for a subset : Modularity of a cut
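As a minimal sketch, assuming the standard Newman-Girvan form of modularity under the expected degree model: each cluster contributes its fraction of internal edges minus the squared fraction of total degree it holds. The function and variable names are illustrative.

```python
def modularity(edges, communities):
    """Sum over clusters of (internal edges / m) - (cluster degree / 2m)^2."""
    m = len(edges)
    label = {v: i for i, block in enumerate(communities) for v in block}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_whole = modularity(edges, [{0, 1, 2, 3, 4, 5}])
```

Putting every vertex in one cluster always scores zero under this definition, so a positive value means more internal edges than the null model expects.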
Computational Questions How can we find a cut that maximizes the modularity? How can we find multiple components?
Greedy Modularity Start with every vertex in its own cluster. For every pair of clusters, check the modularity value if joined. Join the two with the largest increase in modularity. Stop when only one cluster remains.
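The greedy procedure can be sketched as follows. This is a naive version that recomputes modularity from scratch for every candidate merge, assuming the standard Newman-Girvan modularity formula; practical implementations track the change in modularity incrementally instead. All names are illustrative.

```python
def modularity(edges, communities):
    m = len(edges)
    label = {v: i for i, block in enumerate(communities) for v in block}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))

def greedy_modularity(nodes, edges):
    """Merge the best pair until one cluster remains; return every level."""
    clusters = [frozenset([v]) for v in nodes]
    history = [(modularity(edges, clusters), clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
                merged.append(clusters[i] | clusters[j])
                q = modularity(edges, merged)
                if best is None or q > best[0]:
                    best = (q, merged)
        clusters = best[1]
        history.append(best)
    return history

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
history = greedy_modularity(range(6), edges)
best_q, best_clusters = max(history, key=lambda level: level[0])
```

Because the loop runs until a single cluster remains, the list of levels is exactly the merge sequence a dendrogram draws, and the clustering to report is the level with the highest modularity.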
The Dendrogram Image from Newman
An Aside on HAC A very general and popular clustering algorithm Usually used for point clustering, not graphs. It is an algorithm, not an objective
Agglomerative vs. Divisive Clustering • Agglomerative (bottom-up) • each object in its own cluster • repeatedly merge clusters • Divisive (top-down) • all objects in one cluster • repeatedly split clusters
HAC 3 ways to use the distance metric • Single Link: min distance between points in different clusters • Complete Link: max distance between points in different clusters • Group Average: average distance between points in different clusters • This usually approximates some clustering objective
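The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and clusters given as lists of coordinate tuples (the names are illustrative):

```python
import math
from itertools import product

def pair_distances(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_link(a, b):
    return min(pair_distances(a, b))

def complete_link(a, b):
    return max(pair_distances(a, b))

def group_average(a, b):
    distances = pair_distances(a, b)
    return sum(distances) / len(distances)

left = [(0, 0), (1, 0)]
right = [(3, 0), (5, 0)]
single = single_link(left, right)
complete = complete_link(left, right)
average = group_average(left, right)
```

HAC then repeatedly merges the two clusters whose chosen linkage value is smallest.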
Inter-Intra • New Objective Function: G(C) • Maximize (Distance between clusters – Tightness), i.e. D(C) – T(C) • Single Linkage HAC exactly optimizes this objective. • Most clustering problems are NP-hard. So this is a rarity. • Note: No k.
Reminder of Axioms • Scale-Invariance: For any distance function $d$ and any $\alpha > 0$ we have that $f(d) = f(\alpha \cdot d)$ • Richness: Range(f) is equal to the set of all partitions of S • Consistency: Let $d$ and $d'$ be two distance functions. If $f(d) = \Gamma$ and $d'$ is such that the distance between all points in a cluster is less than in $d$ and the distance between inter-cluster points is larger than in $d$, then $f(d') = \Gamma$
Any two axioms • For every pair of axioms, there is a stopping condition for single linkage that satisfies both • Consistency + Richness: only link if distance is less than r • Consistency + SI: stop when you have k connected components • Richness + SI: if x is the diameter of the data points, only add edges with weight at most βx
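The three stopping conditions plug into the same single-linkage loop. The sketch below assumes Euclidean points, processes candidate edges in increasing order of weight with a union-find structure, and takes the stopping rule as a callback. The callback signature and the sample values of r, k and β are illustrative assumptions.

```python
import math
from itertools import combinations

def single_linkage(points, stop):
    """stop(weight, n_components) returns True to halt before adding an edge."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    components = len(points)
    for weight, i, j in pairs:
        if stop(weight, components):
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())

points = [(0,), (1,), (2,), (10,), (11,), (30,)]
diameter = max(math.dist(p, q) for p, q in combinations(points, 2))

# Consistency + Richness: only link if the distance is less than r.
by_distance = single_linkage(points, lambda w, c: w >= 5)
# Consistency + Scale-Invariance: stop at k connected components.
by_count = single_linkage(points, lambda w, c: c <= 2)
# Richness + Scale-Invariance: only edges of weight at most beta * diameter.
by_scale = single_linkage(points, lambda w, c: w > 0.1 * diameter)
```

Each rule gives up exactly one axiom: a fixed r breaks under rescaling, a fixed k can never produce other cluster counts, and the β rule reacts when between-cluster distances grow.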
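Spectral Modularity Formulate modularity as a matrix calculation: $Q = \frac{1}{4m} s^T B s$, where $s \in \{-1, +1\}^n$ records the side of the cut for each vertex, $m$ is the number of edges, and $B$ is the modularity matrix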
Spectral Modularity Find $s$ that maximizes $s^T B s$ s.t. $s_i \in \{-1, +1\}$ for all $i$. Solution: Relax! Find the top eigenvector of $B$ and round the entries based on the sign.
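A dependency-free sketch of the relax-and-round step, assuming the standard modularity matrix with entries $A_{ij} - d_i d_j / 2m$. The top eigenvector is found by power iteration on a shifted matrix (the shift makes every eigenvalue non-negative, so the iteration converges to the largest one); a linear algebra library would normally be used instead. The function names, start vector and iteration count are illustrative choices.

```python
def modularity_matrix(n, edges):
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] += 1.0
        A[v][u] += 1.0
    deg = [sum(row) for row in A]
    two_m = sum(deg)
    return [[A[i][j] - deg[i] * deg[j] / two_m for j in range(n)]
            for i in range(n)]

def top_eigenvector(B, iterations=200):
    n = len(B)
    # Every eigenvalue lies within the largest absolute row sum of zero,
    # so adding that amount times the identity keeps the spectrum >= 0.
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
B = modularity_matrix(6, edges)
vector = top_eigenvector(B)
s = [1 if x >= 0 else -1 for x in vector]  # round by sign
q = sum(B[i][j] * s[i] * s[j] for i in range(6) for j in range(6)) / (4 * len(edges))
sides = {frozenset(i for i in range(6) if s[i] == 1),
         frozenset(i for i in range(6) if s[i] == -1)}
```

Which side receives +1 depends on the sign of the eigenvector, which is arbitrary, so only the resulting two groups are meaningful.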
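Spectral Modularity vs Partitioning Laplacian of a graph: $L = D - A$, with $A$ the adjacency matrix and $D$ the diagonal degree matrix. Modularity Matrix: $B = A - \frac{d d^T}{2m}$, with $d$ the degree vector and $m$ the number of edges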
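Agarwal, Kempe Paper Modularity as a Linear Program Attempt 1 at bounds: Relax the integrality constraint $x_{uv} \in \{0, 1\}$ and solve the fractional LP. Gives an upper bound on the modularity value. Maximize $\frac{1}{2m} \sum_{u,v} \left(a_{uv} - \frac{d_u d_v}{2m}\right)(1 - x_{uv})$ Subject to $x_{uw} \le x_{uv} + x_{vw}$ for all $u, v, w$ and $0 \le x_{uv} \le 1$ for all $u, v$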
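Rounding the LP • Start with $S = V$ • While $S$ is not empty • Select $u \in S$ • Take $T_u$ to be all vertices of $S$ within distance ½ of $u$ • If the average distance from $u$ to $T_u$ is less than ¼, make $T_u \cup \{u\}$ a cluster • Else make $\{u\}$ a cluster • Remove the new cluster from $S$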
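The rounding loop can be sketched as below. It assumes the fractional LP solution is already available as a symmetric matrix `x` of pairwise distances in [0, 1], and it picks the lowest-numbered remaining vertex as the pivot, which is an arbitrary choice made here for determinism. The function name and sample matrices are illustrative.

```python
def round_lp(x):
    """Turn fractional pairwise distances into disjoint clusters."""
    remaining = set(range(len(x)))
    clusters = []
    while remaining:
        u = min(remaining)  # arbitrary pivot choice
        near = [v for v in remaining if v != u and x[u][v] <= 0.5]
        if near and sum(x[u][v] for v in near) / len(near) < 0.25:
            cluster = {u, *near}  # the ball around u is tight enough
        else:
            cluster = {u}  # otherwise u becomes a singleton
        clusters.append(cluster)
        remaining -= cluster
    return clusters

tight = [[0.0, 0.1, 0.2, 1.0],
         [0.1, 0.0, 0.1, 1.0],
         [0.2, 0.1, 0.0, 1.0],
         [1.0, 1.0, 1.0, 0.0]]
loose = [[0.0, 0.4, 0.5, 1.0],
         [0.4, 0.0, 0.1, 1.0],
         [0.5, 0.1, 0.0, 1.0],
         [1.0, 1.0, 1.0, 0.0]]
tight_clusters = round_lp(tight)
loose_clusters = round_lp(loose)
```

In the second matrix vertex 0 has neighbours within ½, but their average distance is too large, so it is split off alone and the remaining vertices are clustered without it.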
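Quadratic Program Formulation Maximize $\frac{1}{4m} \sum_{u,v} \left(a_{uv} - \frac{d_u d_v}{2m}\right)(1 + x_u x_v)$ Subject to $x_v^2 = 1$ for all v MAX-CUT QP Maximize $\frac{1}{2} \sum_{(u,v) \in E} (1 - x_u x_v)$ Subject to $x_v^2 = 1$ for all v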
Extending the Definition to k Modules How do we extend the definition of modularity to k modules?
Critiques • What value of modularity actually indicates something interesting? • Clauset et al: 0.3 • Guimera et al: G(n,p) graphs can have modularity 0.3
Resolution Limit What is wrong with the null hypothesis? We see a lot of locality in real networks, so assuming you could connect to anyone in the network isn’t right.
Resolution Limit What happens with big graphs? Degree 3 expander graph
Resolution Limit • If G is large enough, small cliques will be merged.
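A small numerical check of this effect on a ring of cliques, assuming the standard Newman-Girvan modularity formula. The construction (30 cliques of 5 vertices, neighbouring cliques joined by one edge) and all names are illustrative choices.

```python
from itertools import combinations

def modularity(edges, communities):
    m = len(edges)
    label = {v: i for i, block in enumerate(communities) for v in block}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))

def ring_of_cliques(k, c):
    """k cliques of c vertices; clique i is joined to clique i + 1 by one edge."""
    edges = []
    for i in range(k):
        edges.extend(combinations(range(i * c, (i + 1) * c), 2))
        edges.append((i * c, ((i + 1) % k) * c + 1))
    return edges

k, c = 30, 5
edges = ring_of_cliques(k, c)
one_per_clique = [set(range(i * c, (i + 1) * c)) for i in range(k)]
merged_pairs = [set(range(i * c, (i + 2) * c)) for i in range(0, k, 2)]
q_cliques = modularity(edges, one_per_clique)
q_pairs = modularity(edges, merged_pairs)
```

Merging neighbouring cliques in pairs scores higher than keeping each clique separate, even though the cliques are the obvious modules. With fewer cliques in the ring the ordering flips, which is the sense in which the limit depends on the size of G.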