Clustering and NLP
Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille, Andrew Moore, and others
Outline
• Clustering Overview
• Sample Clustering Techniques for NLP
  • K-means
  • Agglomerative
  • Model-based (EM)
What is clustering?
• Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns each object to a group.
Why should we care about clustering?
• Clustering is a basic step in most data mining procedures. Examples:
  • Clustering movie viewers for movie ranking
  • Clustering proteins by their functionality
  • Clustering text documents for content similarity
Clustering as Data Exploration
Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.
There are Many Clustering Tasks
"Clustering" is an ill-defined problem: there are many different clustering tasks, leading to different clustering paradigms.
Issues
The clustering problem: given a set of objects, find groups of similar objects.
1. What is similar? Define appropriate metrics.
2. What makes a good group? Groups that contain the highest average similarity between all pairs? Groups that are most separated from neighboring groups?
3. How can you evaluate a clustering algorithm?
Formal Definition
Given a data set S and a clustering "objective" function f, find a partition P of S that maximizes (or minimizes) f(P). A partition is a set of subsets of S such that the subsets do not intersect and their union equals S.
Sample Objective Functions
• Objective 1: Minimize the average distance between points in the same cluster
• Objective 2: Maximize the margin (smallest distance) between neighboring clusters
• Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering plus the number of bits needed to describe the points in each cluster
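To make Objective 1 concrete, here is a minimal sketch (not from the slides) of how one might score a candidate partition by its average within-cluster distance; the example points and labels are made up:

```python
import numpy as np

def avg_within_cluster_distance(points, labels):
    """Objective 1: average pairwise distance between points that
    share a cluster (lower is better)."""
    total, count = 0.0, 0
    for label in np.unique(labels):
        members = points[labels == label]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                total += np.linalg.norm(members[i] - members[j])
                count += 1
    return total / count if count else 0.0

# Toy usage: two tight clusters score better than a random split.
points = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(avg_within_cluster_distance(points, np.array([0, 0, 1, 1])))  # ~1.0
print(avg_within_cluster_distance(points, np.array([0, 1, 0, 1])))  # ~14.1
```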
More Issues
• Having an objective function f gives a way of evaluating a clustering. But the real f is usually not known!
• Efficiency: comparing all N points to each other means making O(N²) comparisons.
• Curse of dimensionality: the more features in your data, the more likely the clustering algorithm is to get it wrong.
Clustering as "Unsupervised" Learning
Clustering is just like machine learning, except…
(ML example figure: labeled inputs and outputs; H = space of Boolean functions; learned f = X1 ∧ ¬X3 ∧ ¬X4)
Clustering as "Unsupervised" Learning
• Supervised learning has:
  • Labeled training examples
  • A space Y of possible labels
• Unsupervised learning has:
  • Unlabeled training examples
  • No information (or limited information) about the space of possible labels
Some Notes on Complexity
• The ML example used a space of Boolean functions of N Boolean variables
  • 2^(2^N) possible functions
  • But many possibilities are eliminated by training data and assumptions
• How many possible clusterings?
  • ≈ K^N / K!, for K clusters (K > 1)
  • No possibilities eliminated by training data
  • Need to search for a good one efficiently!
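As a quick sanity check of these counts (my own illustration, assuming the standard 2^(2^N) count of Boolean functions and the Stirling number of the second kind for K-cluster partitions):

```python
from math import comb, factorial

def boolean_functions(n):
    """Number of Boolean functions of n Boolean variables: 2^(2^n)."""
    return 2 ** (2 ** n)

def stirling2(n, k):
    """Exact count of ways to partition n items into k non-empty clusters
    (approximately k^n / k! for large n)."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(boolean_functions(4))        # 65536
print(stirling2(20, 3))            # 580606446 -- already huge for 20 points
print(3 ** 20 // factorial(3))     # 581130733 -- the K^N / K! approximation
```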
Clustering Problem Formulation
• General Assumptions
  • Each data item is a tuple (vector)
  • Values of tuples are nominal, ordinal, or numerical
  • A similarity (or distance) function is provided
• For pure numerical tuples, for example:
  • Sim(d_i, d_j) = Σ_k d_{i,k} · d_{j,k} (dot product)
  • Sim(d_i, d_j) = cos(d_i, d_j) (cosine similarity)
  • …and many more (slide after next)
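A minimal sketch of these two numerical similarity measures, assuming dense NumPy vectors (the example vectors are made up):

```python
import numpy as np

def dot_similarity(di, dj):
    """Sim(d_i, d_j) = sum over k of d_{i,k} * d_{j,k}."""
    return float(np.dot(di, dj))

def cosine_similarity(di, dj):
    """Dot product normalized by vector lengths; value lies in [-1, 1]."""
    return float(np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj)))

d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 1.0, 1.0])
print(dot_similarity(d1, d2))     # 4.0
print(cosine_similarity(d1, d2))  # ~0.73
```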
Similarity Measures in Data Analysis
• For Ordinal Values
  • E.g., "small," "medium," "large," "X-large"
  • Convert to numerical values assuming constant spacing on a normalized [0, 1] scale, where max(v) = 1, min(v) = 0, and the others interpolate
    • E.g., "small" = 0, "medium" = 0.33, etc.
  • Then, use numerical similarity measures
  • Or, use a similarity matrix (see next slide)
Similarity Measures (cont.)
• For Nominal Values
  • E.g., "Boston", "LA", "Pittsburgh"; or "male", "female"; or "diffuse", "globular", "spiral", "pinwheel"
  • Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
  • Or use an underlying semantic property, e.g.: Sim(Boston, LA) = dist(Boston, LA)⁻¹, or Sim(Boston, LA) = |size(Boston) − size(LA)| / max(size(cities))
  • Or, use a similarity matrix
Similarity Matrix

          tiny  little  small  medium  large  huge
tiny      1.0   0.8     0.7    0.5     0.2    0.0
little          1.0     0.9    0.7     0.3    0.1
small                   1.0    0.7     0.3    0.2
medium                         1.0     0.5    0.3
large                                  1.0    0.8
huge                                          1.0

• Diagonal must be 1.0
• Monotonicity property must hold
• No linearity (value interpolation) assumed
• Qualitative transitive property must hold
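One way such a similarity matrix might be used in code, as a lookup table; this is a sketch, with the lower triangle filled in by symmetry (an assumption; the slide shows only the upper triangle):

```python
SIZES = ["tiny", "little", "small", "medium", "large", "huge"]
SIM = [
    [1.0, 0.8, 0.7, 0.5, 0.2, 0.0],
    [0.8, 1.0, 0.9, 0.7, 0.3, 0.1],
    [0.7, 0.9, 1.0, 0.7, 0.3, 0.2],
    [0.5, 0.7, 0.7, 1.0, 0.5, 0.3],
    [0.2, 0.3, 0.3, 0.5, 1.0, 0.8],
    [0.0, 0.1, 0.2, 0.3, 0.8, 1.0],
]

def ordinal_sim(a, b):
    """Look up the similarity of two ordinal size values in the matrix."""
    return SIM[SIZES.index(a)][SIZES.index(b)]

print(ordinal_sim("small", "medium"))  # 0.7
print(ordinal_sim("tiny", "huge"))     # 0.0
```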
Document Clustering Techniques
• Similarity or Distance Measure: Alternative Choices
  • Cosine similarity
  • Euclidean distance
  • Kernel functions
  • Language modeling: P(y | model_x), where x and y are documents
Document Clustering Techniques
• Kullback-Leibler distance ("relative entropy"): D(p ‖ q) = Σ_k p_k log(p_k / q_k)
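A sketch of KL distance between two documents' term distributions, assuming raw term-count vectors and a small smoothing constant so that zero counts in q do not blow up the log (the smoothing choice is mine, not the slides'):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) = sum_k p_k * log(p_k / q_k), with smoothing so q_k > 0.
    Asymmetric: D(p || q) != D(q || p) in general."""
    p = p / p.sum()
    q = (q + eps) / (q + eps).sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy term-count vectors for two documents over a shared vocabulary.
doc_x = np.array([3.0, 1.0, 0.0, 2.0])
doc_y = np.array([2.0, 2.0, 1.0, 1.0])
print(kl_divergence(doc_x, doc_y))
print(kl_divergence(doc_y, doc_x))  # different value: KL is not symmetric
```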
Some Clustering Methods
• K-means and K-medoids algorithms
  • CLARANS [Ng and Han, VLDB 1994]
• Hierarchical algorithms
  • CURE [Guha et al., SIGMOD 1998]
  • BIRCH [Zhang et al., SIGMOD 1996]
  • CHAMELEON [Karypis et al., IEEE Computer, 32(8), 1999]
• Density-based algorithms
  • DENCLUE [Hinneburg and Keim, KDD 1998]
  • DBSCAN [Ester et al., KDD 1996]
  • Clustering with obstacles [Tung et al., ICDE 2001]
K-Means
K-means and K-medoids algorithms
• Objective function: minimize the sum of squared distances of points to a cluster representative (centroid)
• Efficient iterative algorithms (O(n))
K-Means Clustering
1. Select K seed centroids s.t. d(c_i, c_j) > d_min
2. Assign points to clusters by minimum distance to centroid
3. Compute new cluster centroids: c_j = (1/|C_j|) Σ_{x ∈ C_j} x
4. Iterate steps 2 & 3 until no points change clusters
K-Means Clustering: Initial Data Points
Step 1: Select k random seeds s.t. d(c_i, c_j) > d_min (figure: initial data points, k = 3 seeds)

K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by minimum distance to the initial seeds

K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids

K-Means Clustering: Second-Pass Clusters
Step 4: Recompute centroids

K-Means Clustering: Iterate Until Stability
New centroids, and so on.
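Putting steps 1–4 together, a minimal from-scratch sketch (assuming NumPy points and plain random seed selection; it skips the d_min seed-separation check from step 1):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct points as seed centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once centroids (and hence assignments) are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

pts = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
centroids, labels = kmeans(pts, k=2)
print(labels)  # two groups, e.g. [0 0 0 1 1 1]
```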
Question
If the space of possible clusterings is exponential, why is it that K-means can find one in O(n) time?
Problems with K-means-type algorithms
• Clusters are assumed to be approximately spherical
• High dimensionality is a problem
• The value of K is an input parameter
Hierarchical Clustering
• Quadratic algorithms
• Running time can be improved using sampling [Guha et al., SIGMOD 1998] [Kollios et al., ICDE 2001]
Hierarchical Agglomerative Clustering
• Create N single-document clusters
• For i in 1..N:
  • Merge the two clusters with greatest similarity
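A naive sketch of this loop (my own illustration, assuming average-link similarity over NumPy points and stopping at k clusters rather than merging all the way down to one):

```python
import numpy as np

def hac(points, k, sim=lambda a, b: -np.linalg.norm(a - b)):
    """Naive hierarchical agglomerative clustering down to k clusters,
    using average-link similarity between clusters of point indices."""
    clusters = [[i] for i in range(len(points))]  # N singleton clusters
    while len(clusters) > k:
        best, best_pair = -np.inf, None
        # Find the pair of clusters with greatest average-link similarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([sim(points[i], points[j])
                             for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into a
        del clusters[b]
    return clusters

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], dtype=float)
print(hac(pts, k=3))  # e.g. [[0, 1], [2, 3], [4]]
```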
Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters
• This makes it easier to explore the set of possible k values to choose the best number of clusters (figure: the hierarchy cut at 3, 4, and 5 clusters)
High density variations
• Intuitively "correct" clustering
• HAC-generated clusters
Document Clustering Techniques
• Example: group documents based on similarity
  • (Similarity matrix figure omitted.) Thresholding at a similarity value of 0.9 yields:
    • a complete graph C1 = {1, 4, 5}, namely complete linkage
    • a connected graph C2 = {1, 4, 5, 6}, namely single linkage
• For clustering we need three things:
  • A similarity measure for pairwise comparison between documents
  • A clustering criterion (complete link, single link, …)
  • A clustering algorithm
Document Clustering Techniques
• Clustering Criterion: Alternative Linkages
  • Single-link ("nearest neighbor"): sim(A, B) = max over pairs (a ∈ A, b ∈ B) of sim(a, b)
  • Complete-link: sim(A, B) = min over pairs (a ∈ A, b ∈ B) of sim(a, b)
  • Average-link ("group average clustering", or GAC): sim(A, B) = average over pairs (a ∈ A, b ∈ B) of sim(a, b)
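The three linkage criteria expressed over the same set of pairwise similarities, as a small sketch (the helper names and toy vectors are mine, not the slides'):

```python
import numpy as np

def pairwise_sims(A, B, sim):
    """All pairwise similarities between members of cluster A and cluster B."""
    return np.array([sim(a, b) for a in A for b in B])

# The three linkage criteria differ only in how they aggregate the pairs.
def single_link(A, B, sim):    return pairwise_sims(A, B, sim).max()
def complete_link(A, B, sim):  return pairwise_sims(A, B, sim).min()
def average_link(A, B, sim):   return pairwise_sims(A, B, sim).mean()

cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
B = [np.array([0.0, 1.0]), np.array([0.5, 0.5])]
print(single_link(A, B, cos), complete_link(A, B, cos), average_link(A, B, cos))
```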
Hierarchical Agglomerative Clustering Methods
• Generic Agglomerative Procedure (Salton '89): results in nested clusters via iterations
  1. Compute all pairwise document-document similarity coefficients
  2. Place each of the n documents into a class of its own
  3. Merge the two most similar clusters into one:
     - replace the two clusters by the new cluster
     - recompute inter-cluster similarity scores w.r.t. the new cluster
  4. Repeat step 3 until there are only k clusters left (note k could be 1)
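In practice this generic procedure is usually delegated to a library. A sketch using SciPy's hierarchical clustering; the toy term-frequency matrix and the choice of cosine distance with group-average linkage are assumptions, not part of the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy term-frequency matrix: 6 documents over a 4-word vocabulary.
docs = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 0, 1],
    [0, 3, 1, 1],
    [1, 0, 4, 2],
    [0, 0, 3, 3],
], dtype=float)

# 'average' = group-average (GAC) linkage; cosine distance between documents.
Z = linkage(docs, method='average', metric='cosine')

# Cut the hierarchy to obtain k = 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2 3 3]
```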
Group Agglomerative Clustering
(Figure: nine numbered points being agglomerated step by step into nested clusters.)
Expectation-Maximization
Clustering as Model Selection
Let's look at clustering as a probabilistic modeling problem: I have some set of clusters C1, C2, and C3. Each one has a certain probability distribution for generating points: P(x_i | C1), P(x_i | C2), P(x_i | C3)
Clustering as Model Selection
How can I determine which points belong to which cluster?
Cluster for x_i = argmax_j P(x_i | C_j)
So, all I need is to figure out what P(x_i | C_j) is, for each i and j. But without training data! How can I do that?
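The slides leave this as a question; the section title points at the usual answer, the EM algorithm. A minimal sketch for a 1-D Gaussian mixture follows; the Gaussian model family, the initialization, and the toy data are my assumptions. Once the parameters are fit, each x_i can be assigned to argmax_j P(x_i | C_j) (or to the highest-responsibility cluster).

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Minimal EM sketch for a 1-D Gaussian mixture: learns P(x | C_j)
    (means, variances) and mixture weights without labeled data."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # initial cluster means
    var = np.full(k, x.var())                   # initial variances
    pi = np.full(k, 1.0 / k)                    # mixture weights
    for _ in range(n_iter):
        # E-step: responsibility r[i, j] = P(C_j | x_i) under the current model.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return mu, var, pi

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(8, 1, 200)])
mu, var, pi = em_gmm_1d(x, k=2)
print(mu)  # roughly [0, 8] (order may vary)
```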