
Presentation Transcript


  1. Motivation • A Web query is usually only two or three words long. • Prone to ambiguity • Example: “keyboard” • Input device of a computer • Musical instrument • How can we ease the document selection process for the user?

  2. Motivation • Use clusters to represent each topic! • User can quickly disambiguate the query or drill down into a specific topic

  3. Motivation • Moreover, with precomputed clustering of the corpus, the search for documents similar to a query can be computed efficiently. • Cluster pruning • We will learn how to cluster a collection of documents into groups.

  4. Cluster pruning: preprocessing • Pick √N docs at random: call these leaders • For every other doc, pre-compute its nearest leader • Docs attached to a leader: its followers • Likely: each leader has ~ √N followers.

  5. Cluster pruning: query processing • Process a query as follows: • Given query Q, find its nearest leader L. • Seek the K nearest docs from among L’s followers.
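The two slides above can be read as a small algorithm. Below is a minimal sketch in Python/NumPy, assuming documents are stored as unit-normalized tf-idf row vectors so that dot products give cosine similarity; the function names, the random seed, and the k cutoff are illustrative, not the original implementation.

    import numpy as np

    def preprocess(docs, rng=np.random.default_rng(0)):
        """Pick ~sqrt(N) random leaders and attach every other doc to its nearest leader."""
        n = len(docs)
        leader_ids = rng.choice(n, size=int(np.sqrt(n)), replace=False)
        leaders = docs[leader_ids]
        followers = {i: [] for i in range(len(leader_ids))}
        for d in range(n):
            sims = docs[d] @ leaders.T          # cosine similarity (rows are unit-normalized)
            followers[int(np.argmax(sims))].append(d)
        return leaders, followers

    def answer_query(q, docs, leaders, followers, k=10):
        """Find the nearest leader L, then return the K nearest docs among L's followers."""
        L = int(np.argmax(q @ leaders.T))
        cand = followers[L]
        order = np.argsort(-(docs[cand] @ q))[:k]
        return [cand[i] for i in order]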

  6. Visualization • [Figure: a query, its nearest leader, and that leader’s followers]

  7. What is Cluster Analysis? • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes

  8. Quality: What Is Good Clustering? • A good clustering method will produce high quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation

  9. Similarity measures • Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j). • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables. • Weights should be associated with different variables based on the application and data semantics. • It is hard to define “similar enough” or “good enough” • the answer is typically highly subjective.

  10. Vector Objects • Vector objects: keywords in documents. • Cosine measure (similarity)
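For concreteness, the cosine similarity between two term-weight vectors might be computed like this (the toy vectors over a three-word vocabulary are illustrative):

    import numpy as np

    def cosine_similarity(x, y):
        """Cosine of the angle between two term-weight vectors."""
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / denom) if denom else 0.0

    d1 = np.array([2.0, 0.0, 1.0])    # toy term-frequency vectors
    d2 = np.array([1.0, 1.0, 0.0])
    print(cosine_similarity(d1, d2))  # ~0.63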

  11. Distances within a cluster and between clusters • [Figure: intra-cluster vs. inter-cluster distances]

  12. Centroid, Radius and Diameter of a Cluster (for numerical data sets) • Centroid: the “middle” (mean point) of a cluster • Radius: square root of the average squared distance from the points of the cluster to its centroid • Diameter: square root of the average squared distance between all pairs of points in the cluster
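A direct NumPy transcription of these three definitions (reading the averaged distances as squared Euclidean distances, which is what the square roots in the definitions suggest) might look like:

    import numpy as np

    def centroid(X):
        """Mean point of the cluster; rows of X are data points."""
        return X.mean(axis=0)

    def radius(X):
        """Square root of the average squared distance from the points to the centroid."""
        return float(np.sqrt(((X - centroid(X)) ** 2).sum(axis=1).mean()))

    def diameter(X):
        """Square root of the average squared distance between all pairs of distinct points."""
        n = len(X)
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return float(np.sqrt(d2.sum() / (n * (n - 1))))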

  13. Typical Alternatives to Calculate the Similarity between Clusters • Single link: largest similarity between an element in one cluster and an element in the other. • Complete link: smallest similarity between an element in one cluster and an element in the other. • Average link: average similarity over all pairs of elements, one from each cluster. • Centroid: distance between the centroids of the two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj).
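These four alternatives are easy to state in code. The sketch below assumes each cluster is a matrix of unit-normalized document vectors (one row per document), so dot products are cosine similarities; the function names are my own, not the slide's notation.

    import numpy as np

    def pairwise_sims(A, B):
        """Cosine similarities between every element of cluster A and every element of B."""
        return A @ B.T                     # rows assumed unit-normalized

    def single_link(A, B):                 # largest pairwise similarity
        return float(pairwise_sims(A, B).max())

    def complete_link(A, B):               # smallest pairwise similarity
        return float(pairwise_sims(A, B).min())

    def average_link(A, B):                # average pairwise similarity
        return float(pairwise_sims(A, B).mean())

    def centroid_distance(A, B):           # dis(Ki, Kj) = dis(Ci, Cj)
        return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))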

  14. Document Clustering

  15. Hierarchical Clustering • Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition. • [Figure: agglomerative clustering (AGNES) merges objects a–e step by step (steps 0–4), while divisive clustering (DIANA) splits them apart in the reverse order.]

  16. Hierarchical agglomerative clustering (HAC) • HAC is widely used in document clustering

  17. Nearest Neighbor, Level 2, k = 7 clusters. From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

  18. Nearest Neighbor, Level 3, k = 6 clusters.

  19. Nearest Neighbor, Level 4, k = 5 clusters.

  20. Nearest Neighbor, Level 5, k = 4 clusters.

  21. Nearest Neighbor, Level 6, k = 3 clusters.

  22. Nearest Neighbor, Level 7, k = 2 clusters.

  23. Nearest Neighbor, Level 8, k = 1 cluster.

  24. Hierarchical Clustering • Calculate the similarity between all possible pairs of profiles. • The two most similar clusters are grouped together to form a new cluster. • Calculate the similarity between the new cluster and all remaining clusters, and repeat. • Key steps: similarity computation and clustering.

  25. HAC • The hierarchical merging process leads to a tree called a dendrogram. • The earliest merges happen between groups with large similarity; this value becomes lower and lower for later merges. • The user can cut across the dendrogram at a suitable level to obtain any desired number of clusters.
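The merge-then-cut procedure of slides 24–25 is available off the shelf. A minimal SciPy sketch follows; average linkage with cosine distance is one reasonable choice for document vectors, not the only one, and the toy data and the cut into 4 clusters are illustrative.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 50)                          # 20 toy "documents", 50 term weights
    Z = linkage(X, method="average", metric="cosine")   # bottom-up merges: the dendrogram
    labels = fcluster(Z, t=4, criterion="maxclust")     # cut the dendrogram into 4 clusters
    print(labels)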

  26. Partitioning Algorithms: Basic Concept • Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances from each object to the centroid of its cluster. • Given k, find a partition into k clusters that optimizes the chosen partitioning criterion • Globally optimal: exhaustively enumerate all partitions • Heuristic method: k-means • k-means (MacQueen ’67): each cluster is represented by the center (mean) of the cluster • Hard assignment • Soft assignment

  27. K-Means with hard assignment • Given k, the k-means algorithm is implemented in four steps: • Partition the objects into k nonempty subsets • Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) • Assign each object to the cluster with the nearest seed point • Go back to Step 2; stop when no object changes its assignment
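A bare-bones implementation of these four steps, using squared Euclidean distance for the assignment; the function name, the iteration cap, and the guard for empty clusters are my own additions.

    import numpy as np

    def kmeans(X, k, n_iter=100, rng=np.random.default_rng(0)):
        """Plain k-means with hard assignment."""
        centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initial seeds
        for _ in range(n_iter):
            # step 3: assign each object to the nearest center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # step 2: recompute the centroid (mean point) of each cluster;
            # keep the old center if a cluster happens to become empty
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):                # step 4: stop when stable
                return new_centers, labels
            centers = new_centers
        return centers, labels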

  28. The K-Means Clustering Method • Example (K = 2): arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, then reassign and update again until the assignments no longer change. • [Figure: scatter plots showing the assignments and cluster means after each iteration.]

  29. K-means with “soft” assignment • Each cluster c is represented as a vector mc in term space. • mc is not necessarily the centroid of any subset of documents. • The goal of soft k-means is to find the mc that minimize the quantization error.

  30. K-means with “soft” assignment • A simple strategy is to iteratively reduce the error between the mean vectors and the documents closest to them. • We repeatedly scan through the documents and, for each document d, accumulate a “correction” Δmc for the mc that is closest to d. • After scanning once through all documents, all mc are updated in a batch: mc ← mc + Δmc • η is called the learning rate.
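The exact form of the correction is not shown on the slide; the usual choice is Δmc = η (d − mc), which moves the closest mean a little toward d. A minimal one-pass sketch under that assumption, with documents and means as unit-normalized vectors so that dot products act as similarities:

    import numpy as np

    def soft_kmeans_pass(docs, means, eta=0.1):
        """One scan over the documents: accumulate a correction for the nearest mean,
        then apply all corrections in a batch (assumes the form Δm_c = η (d − m_c))."""
        deltas = np.zeros_like(means)
        for d in docs:
            c = int(np.argmax(means @ d))       # the m_c closest to d (by similarity)
            deltas[c] += eta * (d - means[c])   # accumulated correction Δm_c
        return means + deltas                   # batch update: m_c <- m_c + Δm_c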

  31. K-means with “soft” assignment • The contribution from d need not be limited to the single mc that is closest to it. • It can be shared among many clusters, with the portion for cluster c directly related to the current similarity between mc and d.

  32. Comments on the K-Means Method • Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t ≪ n. • Comment: often terminates at a local optimum.

  33. Dimensionality reduction • A significant fraction of the running time is spent computing similarities between documents and clusters. • The time taken for one similarity calculation is proportional to the total number of nonzero components of the two vectors involved. • For example, the total number of unique terms in the Reuters collection is over 30,000! • A simple way to reduce the running time is to truncate each document vector to a fixed number of its largest-magnitude coordinates. • Subspace projection
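Truncating a document vector to its largest-magnitude coordinates is a short operation; the cutoff p below is an illustrative choice, not a value from the slides.

    import numpy as np

    def truncate(doc_vec, p=50):
        """Keep only the p largest-magnitude coordinates, zeroing the rest."""
        keep = np.argsort(-np.abs(doc_vec))[:p]
        out = np.zeros_like(doc_vec)
        out[keep] = doc_vec[keep]
        return out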

  34. Dimensionality reduction • Projections onto orthogonal subspaces (subsets of the original dimensions) may not reveal the clustering structure in the best possible way, because there are usually many ways to express a given concept (synonymy) and most words have multiple meanings (polysemy).

  35. Latent Semantic Indexing (LSI) • Goal • A better approach would allow users to retrieve information on the basis of a conceptual topic or meaning of a document. • LSI tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. • The latent semantic space has fewer dimensions than the original space. • LSI is thus a method for dimensionality reduction.

  36. Latent Semantic Indexing (LSI) • Introduction • Singular value decomposition (SVD) is used to estimate the structure in word usage across documents. • Performance data shows that these statistically derived vectors are more robust indicators of meaning than individual terms.

  37. Basic concepts of LSI • LSI is a technique that projects queries and documents into a space with “latent” semantic dimensions. • In the latent semantic space, a query and a document can have high cosine similarity even if they do not share any terms, as long as their terms are semantically related.

  38. Basic concepts of LSI • Latent semantic indexing is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a word-by-document matrix. • SVD takes a matrix A and represents it in a lower-dimensional space such that the “distance” between the two matrices, as measured by the 2-norm, is minimized: Δ = ‖A − Â‖₂

  39. Basic concepts of LSI • The projection into the latent semantic space is chosen such that the representations in the original space are changed as little as possible when measured by the sum of the squares of the differences. • SVD (and hence LSI) is a least-squares method.

  40. Basic concepts of LSI • SVD projects an n-dimensional space onto a k-dimensional space, where n ≫ k. • In these applications (word-document matrices), n is the number of word types in the collection. • Values of k that are frequently chosen are 100 and 150.

  41. Basic concepts of LSI • There are many different mappings from high-dimensional to low-dimensional spaces. • Latent Semantic Indexing chooses the mapping that is optimal in the sense that it minimizes the distance Δ. • This setup has the consequence that the dimensions of the reduced space correspond to the axes of greatest variation.

  42. Basic concepts of LSI • The SVD projection is computed by decomposing the term-document matrix A_{t×d} into the product of three matrices, T_{t×n}, S_{n×n}, and D_{d×n}: • A_{t×d} = T_{t×n} S_{n×n} (D_{d×n})^T • t is the number of terms, d is the number of documents, n = min(t, d) • T and D have orthonormal columns, i.e., T^T T = D^T D = I • S = diag(σ1, σ2, …, σn), with σi ≥ σj ≥ 0 for 1 ≤ i ≤ j ≤ n • SVD can be viewed as a method for rotating the axes of the n-dimensional space such that the first axis runs along the direction of largest variation among the documents, the second axis runs along the direction of the second largest variation, and so forth.
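With NumPy the decomposition can be checked directly on a toy term-document matrix (30 term types, 10 documents; the numbers are illustrative, not from the slides):

    import numpy as np

    A = np.random.rand(30, 10)                            # toy term-document matrix A (t x d)
    T, sigma, Dt = np.linalg.svd(A, full_matrices=False)  # singular values in descending order
    S, D = np.diag(sigma), Dt.T
    assert np.allclose(A, T @ S @ D.T)                    # A = T S D^T
    assert np.allclose(T.T @ T, np.eye(T.shape[1]))       # columns of T are orthonormal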

  43. Basic concepts of LSI • SVD • T and D represent the terms and documents in this new space. • S contains the singular values of A in descending order. • The ith singular value indicates the amount of variation along the ith axis. • By restricting T and D to their first k < n columns and S to its first k singular values, • we obtain Â_{t×d} = T_{t×k} S_{k×k} (D_{d×k})^T, • which is the best least-squares approximation of A by a matrix of rank k, in the sense defined by the equation above.
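The rank-k truncation itself, continuing the same toy setup (k = 3 is only an illustrative choice of dimensions):

    import numpy as np

    A = np.random.rand(30, 10)                           # toy term-document matrix, as above
    T, sigma, Dt = np.linalg.svd(A, full_matrices=False)
    k = 3
    A_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]      # A_k = T_k S_k D_k^T, rank k
    err = np.linalg.norm(A - A_k, 2)                     # the 2-norm distance that SVD minimizes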

  44. Basic concepts of LSI • Choosing the number of dimensions (k) is an interesting problem. • While a reduction in k can remove much of the noise, keeping too few dimensions or factors may lose important information. • On a test database of medical abstracts, LSI performance improves considerably after 10 or 20 dimensions, peaks between 70 and 100 dimensions, and then begins to diminish slowly. • This pattern of performance (an initial large increase and a slow decrease back to word-based performance) is observed with other datasets as well. • Eventually performance must approach the level attained by standard vector methods, since with k = n factors Â exactly reconstructs the original term-by-document matrix A.

  45. Basic concepts of LSI • Document-to-document similarities • Term-to-term similarities
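These two bullets correspond to the standard LSI identities AᵀA = D S² Dᵀ (document-to-document) and A Aᵀ = T S² Tᵀ (term-to-term). A short sketch in the reduced space, with toy data and an illustrative k:

    import numpy as np

    A = np.random.rand(30, 10)                # toy term-document matrix
    T, sigma, Dt = np.linalg.svd(A, full_matrices=False)
    k = 3
    doc_coords  = Dt[:k, :].T * sigma[:k]     # rows: documents as D_k S_k
    term_coords = T[:, :k] * sigma[:k]        # rows: terms as T_k S_k
    doc_doc   = doc_coords @ doc_coords.T     # rank-k approximation of A^T A
    term_term = term_coords @ term_coords.T   # rank-k approximation of A A^T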

  46. Basic concepts of LSI • Query-to-document similarities • Idea • The user query is represented as a vector in the k-dimensional space and can then be compared to the documents in that space. • q is simply the vector of words in the user’s query, multiplied by the appropriate term weights. • The sum of these k-dimensional term vectors is reflected by the term q^T T_{t×k}, and the right multiplication by S^{-1}_{k×k} differentially weights the separate dimensions. • Thus, the query vector is located at the weighted sum of its constituent term vectors. • The query vector can then be compared to all existing document vectors, and the documents ranked by their similarity (nearness) to the query.
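A sketch of this query folding, q̂ = qᵀ T_k S_k⁻¹, followed by ranking the documents by cosine similarity in the latent space; the toy matrix, the choice k = 3, and the query terms are illustrative assumptions.

    import numpy as np

    A = np.random.rand(30, 10)                    # toy term-document matrix (30 terms, 10 docs)
    T, sigma, Dt = np.linalg.svd(A, full_matrices=False)
    k = 3
    T_k, S_k_inv, D_k = T[:, :k], np.diag(1.0 / sigma[:k]), Dt[:k, :].T

    q = np.zeros(30)
    q[[2, 7]] = 1.0                               # toy query containing term types 2 and 7
    q_hat = q @ T_k @ S_k_inv                     # query located in the latent space

    # Rank documents (rows of D_k) by cosine similarity to the folded-in query.
    sims = (D_k @ q_hat) / (np.linalg.norm(D_k, axis=1) * np.linalg.norm(q_hat))
    print(np.argsort(-sims))                      # document indices, most similar first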

  47. Advantages of LSI • True (latent) dimensions • Synonymy • Synonymy refers to the fact that the same underlying concept can be described using different terms. • In LSI, all documents related to a topic are likely to be represented by a similar weighted combination of indexing variables.

  48. Advantages of LSI • Polysemy • Polysemy describes words that have more than one meaning, which is a common property of language. • Large numbers of polysemous words in the query can reduce the precision of a search significantly. • By using a reduced representation, LSI hopes to remove some "noise" from the data, i.e., rare and less important usages of certain terms. • This works only when the real meaning is close to the average meaning. • Since an LSI term vector is just a weighted average of the different meanings of the term, when the real meaning differs from the average meaning LSI may actually reduce the quality of the search.

  49. Advantages of LSI • Robust with noisy input • Because LSI does not depend on literal keyword matching, it is especially useful when the text input is noisy, as with OCR (optical character recognition), open input, or spelling errors. • For example, suppose scanning errors cause a word (Dumais) to be misspelled (as Duniais). • If correctly spelled context words also occur in documents that contained “Duniais”, then Dumais will probably be near Duniais in the k-dimensional space.
