WEB BAR 2004 Advanced Retrieval and Web Mining. Lecture 12.
Today’s Topic: Clustering 1 • Motivation: Recommendations • Document clustering • Clustering algorithms
Restaurant recommendations • We have a list of all Palo Alto restaurants • with ratings for some of them • as provided by Stanford students • Which restaurant(s) should I recommend to you?
Algorithm 0 • Recommend to you the most popular restaurants • say # positive votes minus # negative votes • Ignores your culinary preferences • And judgements of those with similar preferences • How can we exploit the wisdom of “like-minded” people? • Basic assumption • Preferences are not random • For example, if I like Il Fornaio, it’s more likely I will also like Cenzo
Now that we have a matrix • People × restaurants, with the known ratings filled in • View all other entries as zeros for now.
Similarity between two people • Similarity between their preference vectors. • Inner products are a good start. • Dave has similarity 3 with Estie • but -2 with Cindy. • Perhaps recommend Straits Cafe to Dave • and Il Fornaio to Bob, etc.
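A minimal sketch of this idea in Python, with made-up names and ratings (not the actual matrix from the slides): each person is a preference vector, and similarity is the inner product of two such vectors.

```python
# Illustrative preference vectors over restaurants
# (+1 = like, -1 = dislike, 0 = unknown); data is hypothetical.
prefs = {
    "Dave":  [ 1,  1,  0, -1],
    "Estie": [ 1,  1,  1,  0],
    "Cindy": [-1,  0, -1,  1],
}

def inner_product(u, v):
    """Similarity of two people = inner product of their preference vectors."""
    return sum(a * b for a, b in zip(u, v))

print(inner_product(prefs["Dave"], prefs["Estie"]))  # higher = more alike
print(inner_product(prefs["Dave"], prefs["Cindy"]))  # negative = opposite tastes
```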
Algorithm 1.1 • Goal: recommend restaurants I don’t know • Input: evaluation of restaurants I’ve been to • Basic idea: find the person “most similar” to me in the database and recommend something s/he likes. • Aspects to consider: • No attempt to discern cuisines, etc. • What if I’ve been to all the restaurants s/he has? • Do you want to rely on one person’s opinions? • www.everyonesacritic.net (movies)
Algorithm 1.k • Look at the k people who are most similar • Recommend what’s most popular among them • Issues?
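A small, self-contained sketch of Algorithm 1.k under the same made-up data assumptions: rank people by inner-product similarity, keep the k most similar, and score only the restaurants I have not yet rated.

```python
# Hypothetical data: +1 like, -1 dislike, 0 unrated.
prefs = {
    "me":    [ 1,  0,  0, -1],
    "Estie": [ 1,  1,  1,  0],
    "Bob":   [ 1,  0,  1,  1],
    "Cindy": [-1,  1, -1,  1],
}

def sim(u, v):
    return sum(a * b for a, b in zip(u, v))

def recommend_k(me, others, k):
    # the k people most similar to me
    neighbors = sorted(others, key=lambda p: sim(me, others[p]), reverse=True)[:k]
    scores = {}
    for p in neighbors:
        for i, rating in enumerate(others[p]):
            if me[i] == 0:                       # only restaurants I don't know
                scores[i] = scores.get(i, 0) + rating
    return sorted(scores, key=scores.get, reverse=True)

others = {p: v for p, v in prefs.items() if p != "me"}
print(recommend_k(prefs["me"], others, k=2))     # restaurant indices, best first
```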
Slightly more sophisticated attempt • Group similar users together into clusters • To make recommendations: • Find the “nearest cluster” • Recommend the restaurants most popular in this cluster • Features: • efficient • avoids data sparsity issues • still no attempt to discern why you’re recommended what you’re recommended • how do you cluster?
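A sketch of this cluster-based recommender, assuming the clusters, their centroids, and per-cluster popularity lists have already been computed offline; all names and numbers are illustrative.

```python
# Assumed precomputed offline: centroids of user clusters and each
# cluster's most popular restaurants (hypothetical values).
centroids = {
    "cluster_A": [ 0.9, 0.1, -0.5],
    "cluster_B": [-0.3, 0.8,  0.4],
}
popular_in = {
    "cluster_A": ["Il Fornaio", "Cenzo"],
    "cluster_B": ["Straits Cafe"],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recommend_by_cluster(my_prefs):
    # find the "nearest cluster", then return its most popular restaurants
    nearest = max(centroids, key=lambda c: dot(my_prefs, centroids[c]))
    return popular_in[nearest]

print(recommend_by_cluster([1, 0, -1]))   # falls into cluster_A
```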
How do you cluster? • Two key requirements for “good” clustering: • Keep similar people together in a cluster • Separate dissimilar people • Factors: • Need a notion of similarity/distance • Vector space? Normalization? • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small
Looking beyond • Clustering people for restaurant recommendations • Clustering other things (documents, web pages) • Other approaches to recommendation (e.g., Amazon.com) • General unsupervised machine learning
Why cluster documents? • For improving recall in search applications • Better search results • For speeding up vector space retrieval • Faster search • Corpus analysis/navigation • Better user interface
Improving search recall • Cluster hypothesis - Documents with similar text are related • Ergo, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • The hope if we do this: • The query “car” will also return docs containing automobile • because clustering grouped together docs containing car with those containing automobile. Why might this happen?
Speeding up vector space retrieval • In vector space retrieval, must find nearest doc vectors to query vector • This would entail finding the similarity of the query to every doc – slow (for some applications) • By clustering docs in corpus a priori • find nearest docs in cluster(s) close to query • inexact but avoids exhaustive similarity computation Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.
Speeding up vector space retrieval • Cluster documents into k clusters • Retrieve closest cluster ci to query • Rank documents in ci and return to user • Applications? Web search engines?
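One possible sketch of this pruning scheme (not necessarily the lecture's exact variant), assuming docs have been clustered offline and all vectors are normalized to unit length:

```python
import numpy as np

def cluster_pruned_search(query, docs, assignments, centroids, topn=10):
    """query: unit vector (d,); docs: (n, d) unit row vectors;
    assignments: np.array of cluster ids, one per doc; centroids: (k, d)."""
    best_cluster = int(np.argmax(centroids @ query))        # closest centroid to query
    candidates = np.where(assignments == best_cluster)[0]   # score docs in that cluster only
    scores = docs[candidates] @ query                       # cosine (unit-length vectors)
    order = np.argsort(-scores)[:topn]
    return candidates[order], scores[order]

# Inexact: a relevant doc sitting in another cluster is never scored,
# but we avoid comparing the query against every document.
```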
Clustering for UI (1): Corpus analysis/navigation • Given a corpus, partition it into groups of related docs • Recursively, can induce a tree of topics • Allows user to browse through corpus to find information • Crucial need: meaningful labels for topic nodes. • Yahoo: manual hierarchy • Often not available for new document collection
Clustering for UI (2): Navigating search results • Given the results of a search (say Jaguar, or NLP), partition into groups of related docs • Can be viewed as a form of word sense disambiguation • Jaguar may have senses: • The car company • The animal • The football team • The video game • …
Results list clustering example • Cluster 1: • Jaguar Motor Cars’ home page • Mike’s XJS resource page • Vermont Jaguar owners’ club • Cluster 2: • Big cats • My summer safari trip • Pictures of jaguars, leopards and lions • Cluster 3: • Jacksonville Jaguars’ Home Page • AFC East Football Teams
Search Engine Example: Vivisimo • Search for “NLP” on vivisimo • www.vivisimo.com • Doesn’t always work well: no geographic/coffee clusters for “java”!
Representation for Clustering • Similarity measure • Document representation
What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity • We will use cosine similarity. • Docs as vectors. • For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. • We will describe algorithms in terms of cosine similarity.
Recall doc as vector • Each doc j is a vector of tfidf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, may have 10000+ dimensions • do we really want to use all terms? • Different from using vector space for search. Why?
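A toy illustration of this representation on a made-up three-document corpus: tf-idf weights, normalized to unit length, so the inner product of two doc vectors is their cosine similarity.

```python
import math
from collections import Counter

docs = ["the jaguar is a big cat",
        "jaguar cars home page",
        "big cats lions and jaguars"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
df = Counter(t for doc in tokenized for t in set(doc))   # document frequency per term
N = len(docs)

def tfidf_vector(doc):
    tf = Counter(doc)
    vec = [tf[t] * math.log(N / df[t]) for t in vocab]   # one component per term (axis)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]                       # normalize to unit length

vectors = [tfidf_vector(d) for d in tokenized]
cosine = sum(a * b for a, b in zip(vectors[0], vectors[2]))
print(round(cosine, 3))                                  # similarity of doc 0 and doc 2
```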
Intuition • [Figure: documents D1–D4 plotted as points in a vector space with term axes t1, t2, t3] • Postulate: Documents that are “close together” in vector space talk about the same things.
How Many Clusters? • Number of clusters k is given • Partition n docs into predetermined number of clusters • Finding the “right” number of clusters is part of the problem • Given docs, partition into an “appropriate” number of subsets. • E.g., for query results - ideal value of k not known up front - though UI may impose limits. • Can usually take an algorithm for one flavor and convert to the other.
Clustering Algorithms • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive • Need a notion of cluster similarity • Iterative, “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively
Dendrogram: Example • [Figure: dendrogram over the function words be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was]
Dendrogram: Document Example • As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts. • [Figure: dendrogram over docs d1–d5, merging d1,d2 and merging d3 with d4,d5]
Agglomerative clustering • Given: target number of clusters k. • Initially, each doc is viewed as its own cluster • so we start with n clusters • Repeat: while there are > k clusters, find the “closest pair” of clusters and merge them.
“Closest pair” of clusters • Many variants for defining the closest pair of clusters • “Center of gravity” • Clusters whose centroids (centers of gravity) are the most cosine-similar • Average-link • Average cosine between pairs of elements • Single-link • Similarity of the two most cosine-similar points • Complete-link • Similarity of the “furthest” points, the least cosine-similar
Definition of Cluster Similarity • Single-link clustering • Similarity of two closest points • Can create elongated, straggly clusters • Chaining effect • Complete-link clustering • Similarity of two least similar points • Sensitive to outliers • Centroid-based and average-link • Good compromise
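A naive sketch of bottom-up agglomerative clustering with a pluggable linkage (single, complete, or average), assuming unit-length vectors so the inner product is cosine similarity; this is the slow textbook version with repeated all-pairs comparisons, not an efficient implementation.

```python
import numpy as np

def sim(a, b):
    return float(a @ b)                       # cosine for unit-length vectors

def cluster_sim(ca, cb, vectors, linkage="single"):
    pair_sims = [sim(vectors[i], vectors[j]) for i in ca for j in cb]
    if linkage == "single":                   # two most similar points
        return max(pair_sims)
    if linkage == "complete":                 # two least similar points
        return min(pair_sims)
    return sum(pair_sims) / len(pair_sims)    # average-link

def agglomerate(vectors, k, linkage="single"):
    clusters = [[i] for i in range(len(vectors))]   # start: one cluster per doc
    while len(clusters) > k:
        # find the "closest pair" of clusters and merge them
        a, b = max(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]],
                                             vectors, linkage))
        clusters[a] += clusters.pop(b)
    return clusters

pts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
print(agglomerate(pts, k=2, linkage="complete"))    # -> [[0, 1], [2, 3]]
```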
Key notion: cluster representative • We want a notion of a representative point in a cluster • Representative should be some sort of “typical” or central point in the cluster, e.g., • the point inducing the smallest radius over docs in the cluster • or the smallest sum of squared distances, etc. • the point that is the “average” of all docs in the cluster • Centroid or center of gravity
Centroid • Centroid of a cluster = component-wise average of vectors in a cluster - is a vector. • Need not be a doc. • Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5). • Centroid is a good cluster representative in most cases.
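A tiny check of the slide's example, just to make the definition concrete:

```python
import numpy as np

# Centroid = component-wise average; it need not be one of the documents.
cluster = np.array([[1, 2, 3], [4, 5, 6], [7, 2, 6]])
print(cluster.mean(axis=0))   # -> [4. 3. 5.]
```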
Centroid • Is the centroid of normalized vectors normalized?
Outliers in centroid computation • Can ignore outliers when computing the centroid. • What is an outlier? • Lots of statistical definitions, e.g. • moment of point to centroid > M × some cluster moment, say M = 10. • [Figure: a cluster with its centroid and one distant outlier]
Medoid As Cluster Representative • The centroid does not have to be a document. • Medoid: A cluster representative that is one of the documents • For example: the document closest to the centroid • One reason this is useful • Consider the representative of a large cluster (>1000 documents) • The centroid of this cluster will be a dense vector • The medoid of this cluster will be a sparse vector • Compare: mean/centroid vs. median/medoid
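A sketch of one common choice of medoid, the document closest (by cosine) to the centroid; unlike the centroid, it is an actual (sparse) document vector. The data here is made up.

```python
import numpy as np

def medoid(docs):
    """docs: (n, d) array of unit-length doc vectors; returns index of the medoid."""
    centroid = docs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return int(np.argmax(docs @ centroid))   # the doc most similar to the centroid

docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]])
print(medoid(docs))                          # -> 1 (the "middle" document)
```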
Example: n = 6, k = 3 • [Figure: docs d1–d6; the closest pair of centroids is merged at each step, with the centroid shown after the first and second steps]
Issues • Have to support finding closest pairs continually • compare all pairs? • Potentially n³ cosine similarity computations • To avoid: use approximations. • “points” are switching clusters as centroids change. • Naïve implementation expensive for large document sets (100,000s) • Efficient implementation • Cluster a sample, then assign the entire set • Avoid dense centroids (e.g., by using medoids) Why?
Exercise • Consider agglomerative clustering on n points on a line. Explain how you could avoid n³ distance computations - how many will your scheme use?
“Using approximations” • In the standard algorithm, must find the closest pair of centroids at each step • Approximation: instead, find a nearly closest pair • use some data structure that makes this approximation easier to maintain • simplistic example: maintain the closest pair based on distances in the projection onto a random line • [Figure: points projected onto a random line]
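A simplistic sketch of that last idea: project the centroids onto one random line and take the pair that is closest in the projection as the "nearly closest" pair. This is only an approximation, not the lecture's specific data structure.

```python
import numpy as np

def nearly_closest_pair(centroids, rng=np.random.default_rng(0)):
    """centroids: (k, d) array. Returns indices of an approximately closest pair."""
    line = rng.normal(size=centroids.shape[1])
    line /= np.linalg.norm(line)
    proj = centroids @ line                  # 1-D coordinates on the random line
    order = np.argsort(proj)
    gaps = np.diff(proj[order])              # only adjacent projected points are candidates
    i = int(np.argmin(gaps))
    return int(order[i]), int(order[i + 1])

print(nearly_closest_pair(np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])))
```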
Different algorithm: k-means • K-means generates a “flat” set of clusters • K-means is non-hierarchical • Given: k - the number of clusters desired. • Iterative algorithm. • Hard to get good bounds on the number of iterations to convergence. • Rarely a problem in practice
Basic iteration • At the start of the iteration, we have k centroids. • Subproblem: where do we get them for the first iteration? • Reassignment: each doc is assigned to the nearest centroid. • Centroid recomputation: all docs assigned to the same centroid are averaged to compute a new centroid • thus we have k new centroids.
Iteration example • [Figure: docs assigned to the current centroids, then new centroids recomputed from those assignments]
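A compact sketch of the iteration just described. For simplicity it assigns docs to centroids by Euclidean distance; the lecture's version would use cosine similarity on normalized doc vectors. The data and the fixed iteration cap are illustrative.

```python
import numpy as np

def kmeans(docs, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # initialization: k randomly chosen docs as the starting centroids
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]
    for _ in range(iters):
        # reassignment: each doc goes to its nearest centroid
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # centroid recomputation: average the docs assigned to each centroid
        new = np.array([docs[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):      # termination: centroids don't change
            break
        centroids = new
    return assign, centroids

docs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(docs, k=2)[0])                  # -> two clusters of two docs each
```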
k-Means Clustering: Initialization • We could start with any k docs as centroids • But k random docs are better. • Repeat the basic iteration until a termination condition is satisfied. • Exercise: find a better approach for finding good starting points
Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change. Does this mean that the docs in a cluster are unchanged?
Convergence • Why should the k-means algorithm ever reach a fixed point? • A state in which clusters don’t change. • k-means is a special case of a general procedure known as the EM algorithm. • EM is known to converge. • Number of iterations could be large.
Exercise • Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see? • Is agglomerative clustering likely to produce different results?