
WEB BAR 2004 Advanced Retrieval and Web Mining


Presentation Transcript


  1. WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 12

  2. Today’s Topic: Clustering 1 • Motivation: Recommendations • Document clustering • Clustering algorithms

  3. Restaurant recommendations • We have a list of all Palo Alto restaurants • with positive and negative ratings for some of them • as provided by Stanford students • Which restaurant(s) should I recommend to you?

  4. Input

  5. Algorithm 0 • Recommend to you the most popular restaurants • say # positive votes minus # negative votes • Ignores your culinary preferences • And judgements of those with similar preferences • How can we exploit the wisdom of “like-minded” people? • Basic assumption • Preferences are not random • For example, if I like Il Fornaio, it’s more likely I will also like Cenzo
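To make Algorithm 0 concrete, here is a minimal sketch in Python, assuming votes are stored as +1/-1 values keyed by (person, restaurant); the names and ratings below are invented for illustration, not the actual Palo Alto data.

```python
# Algorithm 0 (sketch): rank restaurants by (# positive votes - # negative votes).
from collections import defaultdict

# Hypothetical stand-in for the ratings data.
ratings = {
    ("Alice", "Il Fornaio"): +1,
    ("Alice", "Straits Cafe"): -1,
    ("Bob", "Il Fornaio"): +1,
    ("Cindy", "Straits Cafe"): +1,
}

def most_popular(ratings):
    score = defaultdict(int)
    for (_person, restaurant), vote in ratings.items():
        score[restaurant] += vote          # vote is +1 or -1
    # Sort restaurants by net votes, best first.
    return sorted(score, key=score.get, reverse=True)

print(most_popular(ratings))  # e.g. ['Il Fornaio', 'Straits Cafe']
```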

  6. Another look at the input - a matrix

  7. Now that we have a matrix • View all other entries as zeros for now.

  8. Similarity between two people • Similarity between their preference vectors. • Inner products are a good start. • Dave has similarity 3 with Estie • but -2 with Cindy. • Perhaps recommend Straits Cafe to Dave • and Il Fornaio to Bob, etc.
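A small sketch of this similarity computation, assuming ±1 preference vectors with 0 for unrated restaurants (as the previous slide suggests); the vectors below are hypothetical, not the matrix from the slide.

```python
import numpy as np

# Rows = people, columns = restaurants; +1 like, -1 dislike, 0 unrated.
# Hypothetical values -- the real matrix is in the slide image.
dave  = np.array([ 1,  1,  0, -1, 0])
estie = np.array([ 1,  1,  1,  0, 0])
cindy = np.array([-1,  0,  1,  1, 0])

def similarity(u, v):
    """Inner product of two preference vectors."""
    return int(np.dot(u, v))

print(similarity(dave, estie))  # higher = more alike
print(similarity(dave, cindy))  # negative = opposite tastes
```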

  9. Algorithm 1.1 • Goal: recommend restaurants I don’t know • Input: evaluation of restaurants I’ve been to • Basic idea: find the person “most similar” to me in the database and recommend something s/he likes. • Aspects to consider: • No attempt to discern cuisines, etc. • What if I’ve been to all the restaurants s/he has? • Do you want to rely on one person’s opinions? • www.everyonesacritic.net (movies)

  10. Algorithm 1.k • Look at the k people who are most similar • Recommend what’s most popular among them • Issues?
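A sketch of Algorithm 1.k under the same ±1/0 representation: rank people by inner-product similarity to me, keep the k most similar, and recommend the restaurants I have not rated with the most net votes among them. All names and values are invented for illustration.

```python
import numpy as np

def recommend_1k(me, others, k=3):
    """Algorithm 1.k (sketch): find the k people most similar to `me` by
    inner product and rank the restaurants I haven't rated by their
    net votes among those k people."""
    ranked = sorted(others.items(), key=lambda kv: np.dot(me, kv[1]), reverse=True)
    neighbours = [vec for _name, vec in ranked[:k]]
    votes = np.sum(neighbours, axis=0)
    unseen = [j for j in range(len(me)) if me[j] == 0]   # restaurants I haven't rated
    return sorted(((j, int(votes[j])) for j in unseen),
                  key=lambda c: c[1], reverse=True)

# Hypothetical data: 5 restaurants, ratings +1 / -1 / 0 (unrated).
me = np.array([1, 0, -1, 0, 1])
others = {
    "Estie": np.array([1, 1, -1, 0, 1]),
    "Cindy": np.array([-1, 1, 1, -1, 0]),
    "Bob":   np.array([1, 0, -1, 1, 1]),
}
print(recommend_1k(me, others, k=2))   # [(restaurant index, net votes), ...]
```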

  11. Slightly more sophisticated attempt • Group similar users together into clusters • To make recommendations: • Find the “nearest cluster” • Recommend the restaurants most popular in this cluster • Features: • efficient • avoids data sparsity issues • still no attempt to discern why you’re recommended what you’re recommended • how do you cluster?

  12. How do you cluster? • Two key requirements for “good” clustering: • Keep similar people together in a cluster • Separate dissimilar people • Factors: • Need a notion of similarity/distance • Vector space? Normalization? • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small

  13. Looking beyond clustering people for restaurant recommendations • Clustering other things (documents, web pages) • Other approaches to recommendation: Amazon.com • General unsupervised machine learning

  14. Why cluster documents? • For improving recall in search applications • Better search results • For speeding up vector space retrieval • Faster search • Corpus analysis/navigation • Better user interface

  15. Improving search recall • Cluster hypothesis - Documents with similar text are related • Ergo, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • Hope if we do this: • The query “car” will also return docs containing automobile • clustering grouped together docs containing car with those containing automobile. Why might this happen?

  16. Speeding up vector space retrieval • In vector space retrieval, must find nearest doc vectors to query vector • This would entail finding the similarity of the query to every doc – slow (for some applications) • By clustering docs in corpus a priori • find nearest docs in cluster(s) close to query • inexact but avoids exhaustive similarity computation Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.

  17. Speeding up vector space retrieval • Cluster documents into k clusters • Retrieve closest cluster ci to query • Rank documents in ci and return to user • Applications? Web search engines?
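One way this pruning could look in code, assuming unit-length document vectors, a precomputed cluster assignment, and per-cluster centroids; all names and the toy data are hypothetical.

```python
import numpy as np

def cluster_retrieve(query, doc_vectors, assignments, centroids, top_n=10):
    """Cluster-pruned retrieval (sketch): score the query only against docs
    in the cluster whose centroid is closest to the query.
    All vectors are assumed unit-normalised, so dot product = cosine."""
    best_cluster = int(np.argmax(centroids @ query))
    member_ids = np.where(assignments == best_cluster)[0]
    scores = doc_vectors[member_ids] @ query
    order = np.argsort(-scores)[:top_n]
    return [(int(member_ids[i]), float(scores[i])) for i in order]

# Toy demo: 100 random unit vectors, 5 clusters of 20 docs each.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 20))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
assign = np.arange(100) % 5
cents = np.vstack([docs[assign == c].mean(axis=0) for c in range(5)])
print(cluster_retrieve(docs[0], docs, assign, cents, top_n=3))
```

The inexactness mentioned on the previous slide shows up here: a doc very close to the query can live in a cluster whose centroid is not the closest one, and it will never be scored.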

  18. Clustering for UI (1): Corpus analysis/navigation • Given a corpus, partition it into groups of related docs • Recursively, can induce a tree of topics • Allows user to browse through corpus to find information • Crucial need: meaningful labels for topic nodes. • Yahoo: manual hierarchy • Often not available for new document collection

  19. Clustering for UI (2): Navigating search results • Given the results of a search (say Jaguar, or NLP), partition into groups of related docs • Can be viewed as a form of word sense disambiguation • Jaguar may have senses: • The car company • The animal • The football team • The video game • …

  20. Results list clustering example • Cluster 1: • Jaguar Motor Cars’ home page • Mike’s XJS resource page • Vermont Jaguar owners’ club • Cluster 2: • Big cats • My summer safari trip • Pictures of jaguars, leopards and lions • Cluster 3: • Jacksonville Jaguars’ Home Page • AFC East Football Teams

  21. Search Engine Example: Vivisimo • Search for “NLP” on vivisimo • www.vivisimo.com • Doesn’t always work well: no geographic/coffee clusters for “java”!

  22. Representation for Clustering • Similarity measure • Document representation

  23. What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity • We will use cosine similarity. • Docs as vectors. • For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. • We will describe algorithms in terms of cosine similarity.

  24. Recall doc as vector • Each doc j is a vector of tfidf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, may have 10000+ dimensions • do we really want to use all terms? • Different from using vector space for search. Why?

  25. Intuition • [Figure: documents D1–D4 plotted in a vector space with term axes t1, t2, t3] • Postulate: Documents that are “close together” in vector space talk about the same things.

  26. Cosine similarity
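The formula on this slide did not survive the transcript; the standard cosine similarity it refers to, for docs d_j and d_k with term weights w_ij, is:

```latex
\mathrm{sim}(d_j, d_k)
  = \frac{\vec{d}_j \cdot \vec{d}_k}{\lVert \vec{d}_j \rVert \, \lVert \vec{d}_k \rVert}
  = \frac{\sum_{i=1}^{m} w_{ij}\, w_{ik}}
         {\sqrt{\sum_{i=1}^{m} w_{ij}^2}\,\sqrt{\sum_{i=1}^{m} w_{ik}^2}}
```

For unit-length vectors this reduces to the plain inner product, which is why the normalization mentioned on slide 24 is convenient.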

  27. How Many Clusters? • Number of clusters k is given • Partition n docs into predetermined number of clusters • Finding the “right” number of clusters is part of the problem • Given docs, partition into an “appropriate” number of subsets. • E.g., for query results - ideal value of k not known up front - though UI may impose limits. • Can usually take an algorithm for one flavor and convert to the other.

  28. Clustering Algorithms • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive • Need a notion of cluster similarity • Iterative, “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively

  29. Dendrogram: Example • [Figure: dendrogram clustering common English function words: be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was]

  30. Dendrogram: Document Example • As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts. • [Figure: dendrogram over docs d1–d5, with merges d1,d2 and d4,d5, then d3,d4,d5]

  31. Agglomerative clustering • Given: target number of clusters k. • Initially, each doc viewed as a cluster • start with n clusters; • Repeat: • while there are > k clusters, find the “closest pair” of clusters and merge them.

  32. “Closest pair” of clusters • Many variants to defining closest pair of clusters • “Center of gravity” • Clusters whose centroids (centers of gravity) are the most cosine-similar • Average-link • Average cosine between pairs of elements • Single-link • Similarity of the most cosine-similar (single-link) • Complete-link • Similarity of the “furthest” points, the least cosine-similar

  33. Definition of Cluster Similarity • Single-link clustering • Similarity of two closest points • Can create elongated, straggly clusters • Chaining effect • Complete-link clustering • Similarity of two least similar points • Sensitive to outliers • Centroid-based and average-link • Good compromise
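A naive sketch of agglomerative clustering with a pluggable linkage criterion, assuming unit-normalised vectors so that dot products are cosine similarities; this is the O(n³) baseline discussed a few slides below, not an optimised implementation.

```python
import numpy as np

def agglomerative(vectors, k, linkage="single"):
    """Merge the closest pair of clusters until only k remain (sketch).
    linkage: 'single' (max pairwise sim), 'complete' (min pairwise sim),
    or 'average' (mean pairwise sim)."""
    sim = vectors @ vectors.T              # all pairwise cosine similarities
    clusters = [[i] for i in range(len(vectors))]

    def cluster_sim(a, b):
        pair_sims = [sim[i, j] for i in a for j in b]
        if linkage == "single":
            return max(pair_sims)
        if linkage == "complete":
            return min(pair_sims)
        return sum(pair_sims) / len(pair_sims)   # average-link

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        a, b = max(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With linkage="single" this reproduces the chaining behaviour described on this slide; switching to complete- or average-link only changes the `cluster_sim` function.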

  34. Key notion: cluster representative • We want a notion of a representative point in a cluster • Representative should be some sort of “typical” or central point in the cluster, e.g., • point inducing smallest radii to docs in cluster • smallest squared distances, etc. • point that is the “average” of all docs in the cluster • Centroid or center of gravity

  35. Centroid • Centroid of a cluster = component-wise average of the vectors in the cluster; it is a vector. • Need not be a doc. • Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5). • Centroid is a good cluster representative in most cases.

  36. Centroid • Is the centroid of normalized vectors normalized?
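A quick numerical check of the example on slide 35 and of the question above, using plain numpy arrays:

```python
import numpy as np

vecs = np.array([[1, 2, 3], [4, 5, 6], [7, 2, 6]], dtype=float)
centroid = vecs.mean(axis=0)
print(centroid)                      # [4. 3. 5.] -- matches the slide

# Is the centroid of unit-length vectors itself unit-length?
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
c = unit.mean(axis=0)
print(np.linalg.norm(c))             # < 1, so: no, not in general
```

So the centroid of normalized vectors is generally not normalized (unless the vectors coincide) and must be re-normalized if unit length is required.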

  37. Outliers in centroid computation • Can ignore outliers when computing centroid. • What is an outlier? • Lots of statistical definitions, e.g. • distance (moment) of point to centroid > M × some cluster moment; say M = 10. • [Figure: cluster with centroid and a distant outlier marked]

  38. Medoid As Cluster Representative • The centroid does not have to be a document. • Medoid: A cluster representative that is one of the documents • For example: the document closest to the centroid • One reason this is useful • Consider the representative of a large cluster (>1000 documents) • The centroid of this cluster will be a dense vector • The medoid of this cluster will be a sparse vector • Compare: mean/centroid vs. median/medoid
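A sketch of picking the medoid as the document closest to the centroid, assuming unit-normalised rows:

```python
import numpy as np

def medoid(vectors):
    """Medoid (sketch): the actual document closest to the cluster centroid,
    measured here by cosine similarity on unit-normalised vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    sims = unit @ centroid
    return int(np.argmax(sims))      # index of the representative document
```

Unlike the centroid, the returned representative is one of the original documents, so it stays as sparse as the documents themselves.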

  39. Example: n=6, k=3, closest pair of centroids • [Figure: points d1–d6, with the centroids marked after the first and second merge steps]

  40. Issues • Have to support finding closest pairs continually • compare all pairs? • Potentially n³ cosine similarity computations • To avoid: use approximations. • “points” are switching clusters as centroids change. • Naïve implementation expensive for large document sets (100,000s) • Efficient implementation • Cluster a sample, then assign the entire set • Avoid dense centroids (e.g., by using medoids) Why?

  41. Exercise • Consider agglomerative clustering on n points on a line. Explain how you could avoid n³ distance computations - how many will your scheme use?

  42. “Using approximations” • In standard algorithm, must find closest pair of centroids at each step • Approximation: instead, find nearly closest pair • use some data structure that makes this approximation easier to maintain • simplistic example: maintain closest pair based on distances in projection on a random line • [Figure: points projected onto a random line]

  43. Different algorithm: k-means • K-means generates a “flat” set of clusters • K-means is non-hierarchical • Given: k - the number of clusters desired. • Iterative algorithm. • Hard to get good bounds on the number of iterations to convergence. • Rarely a problem in practice

  44. Basic iteration • Reassignment • At the start of the iteration, we have k centroids. • Subproblem: where do we get them for the first iteration? • Each doc assigned to the nearest centroid. • Centroid recomputation • All docs assigned to the same centroid are averaged to compute a new centroid • thus we have k new centroids.
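A compact sketch of this basic iteration, assuming dense numpy vectors and Euclidean distance; initialisation with k random docs anticipates slide 47, and the loop stops when the centroids stop moving, one of the termination conditions on slide 48.

```python
import numpy as np

def kmeans(docs, k, iters=20, seed=0):
    """Basic k-means iteration (sketch): assign each doc to the nearest
    centroid, then recompute each centroid as the mean of its docs."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]
    for _ in range(iters):
        # Reassignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recomputation step: new centroid = mean of its assigned docs.
        new_centroids = np.array([
            docs[assign == c].mean(axis=0) if np.any(assign == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return assign, centroids
```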

  45. Iteration example • [Figure: docs assigned to the current centroids]

  46. Iteration example • [Figure: docs with the new, recomputed centroids]

  47. k-Means Clustering: Initialization • We could start with any k docs as centroids • But k random docs are better. • Repeat basic iteration until termination condition satisfied. • Exercise: find a better approach for finding good starting points

  48. Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change. Does this mean that the docs in a cluster are unchanged?

  49. Convergence • Why should the k-means algorithm ever reach a fixed point? • A state in which clusters don’t change. • k-means is a special case of a general procedure known as the EM algorithm. • EM is known to converge. • Number of iterations could be large.

  50. Exercise • Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see? • Is agglomerative clustering likely to produce different results?
