Clustering and NLP
Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille, Andrew Moore, and others
Outline
• Clustering Overview
• Sample Clustering Techniques for NLP
  • K-means
  • Agglomerative
  • Model-based (EM)
What is clustering?
• Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns each object to a group.
Why should we care about clustering?
• Clustering is a basic step in most data mining procedures. Examples:
  • Clustering movie viewers for movie ranking
  • Clustering proteins by their functionality
  • Clustering text documents for content similarity
Clustering as Data Exploration
Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.
There are Many Clustering Tasks
"Clustering" is an ill-defined problem: there are many different clustering tasks, leading to different clustering paradigms.
Issues
The clustering problem: given a set of objects, find groups of similar objects.
1. What is similar? Define appropriate metrics.
2. What makes a good group? Groups that contain the highest average similarity between all pairs? Groups that are most separated from neighboring groups?
3. How can you evaluate a clustering algorithm?
Formal Definition
Given a data set S and a clustering "objective" function f, find a partition P of S that maximizes (or minimizes) f(P). A partition is a set of subsets of S such that the subsets do not intersect and their union equals S.
Sample Objective Functions
• Objective 1: Minimize the average distance between points in the same cluster
• Objective 2: Maximize the margin (smallest distance) between neighboring clusters
• Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering plus the number of bits needed to describe the points in each cluster
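To make Objective 1 concrete, here is a minimal sketch (not from the slides) of how one might score a candidate partition by its average within-cluster distance; the example points and labels are made up:

```python
import numpy as np

def avg_within_cluster_distance(points, labels):
    """Objective 1: average pairwise distance between points that
    share a cluster (lower is better)."""
    total, count = 0.0, 0
    for label in np.unique(labels):
        members = points[labels == label]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                total += np.linalg.norm(members[i] - members[j])
                count += 1
    return total / count if count else 0.0

# Toy usage: two tight clusters score better than a random split.
points = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(avg_within_cluster_distance(points, np.array([0, 0, 1, 1])))  # ~1.0
print(avg_within_cluster_distance(points, np.array([0, 1, 0, 1])))  # ~14.1
```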
More Issues
• Having an objective function f gives a way of evaluating a clustering. But the real f is usually not known!
• Efficiency: comparing all N points to each other means making O(N²) comparisons.
• Curse of dimensionality: the more features in your data, the more likely the clustering algorithm is to get it wrong.
Clustering as "Unsupervised" Learning
Clustering is just like machine learning, except…
(ML example figure: labeled inputs and outputs; H = space of Boolean functions; learned f = X1 ∧ ¬X3 ∧ ¬X4)
Clustering as "Unsupervised" Learning
• Supervised learning has:
  • Labeled training examples
  • A space Y of possible labels
• Unsupervised learning has:
  • Unlabeled training examples
  • No information (or limited information) about the space of possible labels
Some Notes on Complexity
• The ML example used a space of Boolean functions of N Boolean variables
  • 2^(2^N) possible functions
  • But many possibilities are eliminated by training data and assumptions
• How many possible clusterings?
  • ≈ K^N / K!, for K clusters (K > 1)
  • No possibilities eliminated by training data
  • Need to search for a good one efficiently!
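As a quick sanity check of these counts (my own illustration, assuming the standard 2^(2^N) count of Boolean functions and the Stirling number of the second kind for K-cluster partitions):

```python
from math import comb, factorial

def boolean_functions(n):
    """Number of Boolean functions of n Boolean variables: 2^(2^n)."""
    return 2 ** (2 ** n)

def stirling2(n, k):
    """Exact count of ways to partition n items into k non-empty clusters
    (approximately k^n / k! for large n)."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(boolean_functions(4))        # 65536
print(stirling2(20, 3))            # 580606446 -- already huge for 20 points
print(3 ** 20 // factorial(3))     # 581130733 -- the K^N / K! approximation
```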
Clustering Problem Formulation
• General Assumptions
  • Each data item is a tuple (vector)
  • Values of tuples are nominal, ordinal, or numerical
  • A similarity (or distance) function is provided
• For pure numerical tuples, for example:
  • Sim(d_i, d_j) = Σ_k d_{i,k} · d_{j,k} (dot product)
  • Sim(d_i, d_j) = cos(d_i, d_j) (cosine similarity)
  • …and many more (slide after next)
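A minimal sketch of these two numerical similarity measures, assuming dense NumPy vectors (the example vectors are made up):

```python
import numpy as np

def dot_similarity(di, dj):
    """Sim(d_i, d_j) = sum over k of d_{i,k} * d_{j,k}."""
    return float(np.dot(di, dj))

def cosine_similarity(di, dj):
    """Dot product normalized by vector lengths; value lies in [-1, 1]."""
    return float(np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj)))

d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 1.0, 1.0])
print(dot_similarity(d1, d2))     # 4.0
print(cosine_similarity(d1, d2))  # ~0.73
```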
Similarity Measures in Data Analysis
• For Ordinal Values
  • E.g., "small," "medium," "large," "X-large"
  • Convert to numerical values assuming constant spacing on a normalized [0, 1] scale, where max(v) = 1, min(v) = 0, and the others interpolate
    • E.g., "small" = 0, "medium" = 0.33, etc.
  • Then, use numerical similarity measures
  • Or, use a similarity matrix (see next slide)
Similarity Measures (cont.)
• For Nominal Values
  • E.g., "Boston", "LA", "Pittsburgh"; or "male", "female"; or "diffuse", "globular", "spiral", "pinwheel"
  • Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
  • Or use an underlying semantic property, e.g.: Sim(Boston, LA) = dist(Boston, LA)⁻¹, or Sim(Boston, LA) = |size(Boston) − size(LA)| / max(size(cities))
  • Or, use a similarity matrix
Similarity Matrix

          tiny  little  small  medium  large  huge
tiny      1.0   0.8     0.7    0.5     0.2    0.0
little          1.0     0.9    0.7     0.3    0.1
small                   1.0    0.7     0.3    0.2
medium                         1.0     0.5    0.3
large                                  1.0    0.8
huge                                          1.0

• Diagonal must be 1.0
• Monotonicity property must hold
• No linearity (value interpolation) assumed
• Qualitative transitive property must hold
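One way such a similarity matrix might be used in code, as a lookup table; this is a sketch, with the lower triangle filled in by symmetry (an assumption; the slide shows only the upper triangle):

```python
SIZES = ["tiny", "little", "small", "medium", "large", "huge"]
SIM = [
    [1.0, 0.8, 0.7, 0.5, 0.2, 0.0],
    [0.8, 1.0, 0.9, 0.7, 0.3, 0.1],
    [0.7, 0.9, 1.0, 0.7, 0.3, 0.2],
    [0.5, 0.7, 0.7, 1.0, 0.5, 0.3],
    [0.2, 0.3, 0.3, 0.5, 1.0, 0.8],
    [0.0, 0.1, 0.2, 0.3, 0.8, 1.0],
]

def ordinal_sim(a, b):
    """Look up the similarity of two ordinal size values in the matrix."""
    return SIM[SIZES.index(a)][SIZES.index(b)]

print(ordinal_sim("small", "medium"))  # 0.7
print(ordinal_sim("tiny", "huge"))     # 0.0
```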
Document Clustering Techniques
• Similarity or Distance Measure: Alternative Choices
  • Cosine similarity
  • Euclidean distance
  • Kernel functions
  • Language modeling: P(y | model_x), where x and y are documents
Document Clustering Techniques
• Kullback-Leibler distance ("relative entropy"): D(p ‖ q) = Σ_k p_k log(p_k / q_k)
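A sketch of KL distance between two documents' term distributions, assuming raw term-count vectors and a small smoothing constant so that zero counts in q do not blow up the log (the smoothing choice is mine, not the slides'):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) = sum_k p_k * log(p_k / q_k), with smoothing so q_k > 0.
    Asymmetric: D(p || q) != D(q || p) in general."""
    p = p / p.sum()
    q = (q + eps) / (q + eps).sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy term-count vectors for two documents over a shared vocabulary.
doc_x = np.array([3.0, 1.0, 0.0, 2.0])
doc_y = np.array([2.0, 2.0, 1.0, 1.0])
print(kl_divergence(doc_x, doc_y))
print(kl_divergence(doc_y, doc_x))  # different value: KL is not symmetric
```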
Some Clustering Methods
• K-means and K-medoids algorithms
  • CLARANS [Ng and Han, VLDB 1994]
• Hierarchical algorithms
  • CURE [Guha et al., SIGMOD 1998]
  • BIRCH [Zhang et al., SIGMOD 1996]
  • CHAMELEON [Karypis et al., IEEE Computer, 32(8), 1999]
• Density-based algorithms
  • DENCLUE [Hinneburg and Keim, KDD 1998]
  • DBSCAN [Ester et al., KDD 1996]
  • Clustering with obstacles [Tung et al., ICDE 2001]
K-Means
K-means and K-medoids algorithms
• Objective function: minimize the sum of squared distances of points to a cluster representative (centroid)
• Efficient iterative algorithms (O(n))
K-Means Clustering
1. Select K seed centroids s.t. d(c_i, c_j) > d_min
2. Assign points to clusters by minimum distance to centroid
3. Compute new cluster centroids: c_j = (1/|C_j|) Σ_{x ∈ C_j} x
4. Iterate steps 2 & 3 until no points change clusters
K-Means Clustering: Initial Data Points
Step 1: Select k random seeds s.t. d(c_i, c_j) > d_min (figure: initial data points, k = 3 seeds)

K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by minimum distance to the initial seeds

K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids

K-Means Clustering: Second-Pass Clusters
Step 4: Recompute centroids

K-Means Clustering: Iterate Until Stability
New centroids, and so on.
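Putting steps 1–4 together, a minimal from-scratch sketch (assuming NumPy points and plain random seed selection; it skips the d_min seed-separation check from step 1):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct points as seed centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once centroids (and hence assignments) are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

pts = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
centroids, labels = kmeans(pts, k=2)
print(labels)  # two groups, e.g. [0 0 0 1 1 1]
```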
Question
If the space of possible clusterings is exponential, why is it that K-means can find one in O(n) time?
Problems with K-means-type algorithms
• Clusters are assumed to be approximately spherical
• High dimensionality is a problem
• The value of K is an input parameter
Hierarchical Clustering
• Quadratic algorithms
• Running time can be improved using sampling [Guha et al., SIGMOD 1998] [Kollios et al., ICDE 2001]
Hierarchical Agglomerative Clustering
• Create N single-document clusters
• For i in 1..N:
  • Merge the two clusters with greatest similarity
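A naive sketch of this loop (my own illustration, assuming average-link similarity over NumPy points and stopping at k clusters rather than merging all the way down to one):

```python
import numpy as np

def hac(points, k, sim=lambda a, b: -np.linalg.norm(a - b)):
    """Naive hierarchical agglomerative clustering down to k clusters,
    using average-link similarity between clusters of point indices."""
    clusters = [[i] for i in range(len(points))]  # N singleton clusters
    while len(clusters) > k:
        best, best_pair = -np.inf, None
        # Find the pair of clusters with greatest average-link similarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([sim(points[i], points[j])
                             for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into a
        del clusters[b]
    return clusters

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], dtype=float)
print(hac(pts, k=3))  # e.g. [[0, 1], [2, 3], [4]]
```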
Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters
• This makes it easier to explore the set of possible k values to choose the best number of clusters (figure: the hierarchy cut at 3, 4, and 5 clusters)
High density variations
• Intuitively "correct" clustering
• HAC-generated clusters
Document Clustering Techniques
• Example: group documents based on similarity
  • (Similarity matrix figure omitted.) Thresholding at a similarity value of 0.9 yields:
    • a complete graph C1 = {1, 4, 5}, namely complete linkage
    • a connected graph C2 = {1, 4, 5, 6}, namely single linkage
• For clustering we need three things:
  • A similarity measure for pairwise comparison between documents
  • A clustering criterion (complete link, single link, …)
  • A clustering algorithm
Document Clustering Techniques
• Clustering Criterion: Alternative Linkages
  • Single-link ("nearest neighbor"): sim(A, B) = max over pairs (a ∈ A, b ∈ B) of sim(a, b)
  • Complete-link: sim(A, B) = min over pairs (a ∈ A, b ∈ B) of sim(a, b)
  • Average-link ("group average clustering", or GAC): sim(A, B) = average over pairs (a ∈ A, b ∈ B) of sim(a, b)
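The three linkage criteria expressed over the same set of pairwise similarities, as a small sketch (the helper names and toy vectors are mine, not the slides'):

```python
import numpy as np

def pairwise_sims(A, B, sim):
    """All pairwise similarities between members of cluster A and cluster B."""
    return np.array([sim(a, b) for a in A for b in B])

# The three linkage criteria differ only in how they aggregate the pairs.
def single_link(A, B, sim):    return pairwise_sims(A, B, sim).max()
def complete_link(A, B, sim):  return pairwise_sims(A, B, sim).min()
def average_link(A, B, sim):   return pairwise_sims(A, B, sim).mean()

cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
B = [np.array([0.0, 1.0]), np.array([0.5, 0.5])]
print(single_link(A, B, cos), complete_link(A, B, cos), average_link(A, B, cos))
```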
Hierarchical Agglomerative Clustering Methods
• Generic Agglomerative Procedure (Salton '89): results in nested clusters via iterations
  1. Compute all pairwise document-document similarity coefficients
  2. Place each of the n documents into a class of its own
  3. Merge the two most similar clusters into one:
     - replace the two clusters by the new cluster
     - recompute inter-cluster similarity scores w.r.t. the new cluster
  4. Repeat step 3 until there are only k clusters left (note k could be 1)
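In practice this generic procedure is usually delegated to a library. A sketch using SciPy's hierarchical clustering; the toy term-frequency matrix and the choice of cosine distance with group-average linkage are assumptions, not part of the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy term-frequency matrix: 6 documents over a 4-word vocabulary.
docs = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 0, 1],
    [0, 3, 1, 1],
    [1, 0, 4, 2],
    [0, 0, 3, 3],
], dtype=float)

# 'average' = group-average (GAC) linkage; cosine distance between documents.
Z = linkage(docs, method='average', metric='cosine')

# Cut the hierarchy to obtain k = 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2 3 3]
```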
Group Agglomerative Clustering
(Figure: nine numbered points being agglomerated step by step into nested clusters.)
Expectation-Maximization
Clustering as Model Selection
Let's look at clustering as a probabilistic modeling problem: I have some set of clusters C1, C2, and C3. Each one has a certain probability distribution for generating points: P(x_i | C1), P(x_i | C2), P(x_i | C3)
Clustering as Model Selection
How can I determine which points belong to which cluster?
Cluster for x_i = argmax_j P(x_i | C_j)
So, all I need is to figure out what P(x_i | C_j) is, for each i and j. But without training data! How can I do that?
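The slides leave this as a question; the section title points at the usual answer, the EM algorithm. A minimal sketch for a 1-D Gaussian mixture follows; the Gaussian model family, the initialization, and the toy data are my assumptions. Once the parameters are fit, each x_i can be assigned to argmax_j P(x_i | C_j) (or to the highest-responsibility cluster).

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Minimal EM sketch for a 1-D Gaussian mixture: learns P(x | C_j)
    (means, variances) and mixture weights without labeled data."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # initial cluster means
    var = np.full(k, x.var())                   # initial variances
    pi = np.full(k, 1.0 / k)                    # mixture weights
    for _ in range(n_iter):
        # E-step: responsibility r[i, j] = P(C_j | x_i) under the current model.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mu, var, pi

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(8, 1, 200)])
mu, var, pi = em_gmm_1d(x, k=2)
print(mu)  # roughly [0, 8] (order may vary)
```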