
CS 461: Machine Learning Lecture 7


Presentation Transcript


  1. CS 461: Machine Learning Lecture 7 Dr. Kiri Wagstaff wkiri@wkiri.com CS 461, Winter 2009

  2. Plan for Today • Unsupervised Learning • K-means Clustering • EM Clustering • Homework 4 CS 461, Winter 2009

  3. Review from Lecture 6 • Parametric methods • Data comes from distribution • Bernoulli, Gaussian, and their parameters • How good is a parameter estimate? (bias, variance) • Bayes estimation • ML: use the data (assume equal priors) • MAP: use the prior and the data • Parametric classification • Maximize the posterior probability CS 461, Winter 2009

  4. Clustering Chapter 7 CS 461, Winter 2009

  5. Unsupervised Learning • The data has no labels! • What can we still learn? • Salient groups in the data • Density in feature space • Key approach: clustering • … but also: • Association rules • Density estimation • Principal components analysis (PCA) CS 461, Winter 2009

  6. Clustering • Group items by similarity • Density estimation, cluster models CS 461, Winter 2009

  7. Applications of Clustering • Image Segmentation [Ma and Manjunath, 2004] • Data Mining: Targeted marketing • Remote Sensing: Land cover types • Text Analysis [Selim Aksoy] CS 461, Winter 2009

  8. Applications of Clustering • Text Analysis: Noun Phrase Coreference • Input text: John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-services company’s president. • Cluster JS: John Simon, Chief Financial Officer, his, the 37-year-old, president • Cluster PC: Prime Corp., the financial-services company • Singletons: 1986, pay, 20%, $1.3 million CS 461, Winter 2009

  9. CS 461, Winter 2009 [ Andrew Moore]

  10. K-means • Ask user how many clusters they’d like. (e.g. k=5) CS 461, Winter 2009 [ Andrew Moore]

  11. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations CS 461, Winter 2009 [ Andrew Moore]

  12. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints) CS 461, Winter 2009 [ Andrew Moore]

  13. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns CS 461, Winter 2009 [ Andrew Moore]

  14. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns… • …and jumps there • …Repeat until terminated! CS 461, Winter 2009 [ Andrew Moore]

  15. K-means Start: k=5 • Example generated by Dan Pelleg’s super-duper fast K-means system: • Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html) CS 461, Winter 2009 [Andrew Moore]

  16. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  17. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  18. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  19. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  20. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  21. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  22. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  23. K-means continues… CS 461, Winter 2009 [ Andrew Moore]

  24. K-means terminates CS 461, Winter 2009 [ Andrew Moore]

  25. K-means Algorithm • Randomly select k cluster centers • While (points change membership) • Assign each point to its closest cluster • (Use your favorite distance metric) • Update each center to be the mean of its items • Objective function: Variance • K-means applet CS 461, Winter 2009
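
A minimal NumPy sketch of the loop on this slide (the variable names, the random initialization, and the Euclidean distance are illustrative choices, not the course's reference code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means sketch: X is an (n, d) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Randomly select k of the data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no point changed membership: converged
        labels = new_labels
        # Update each center to be the mean of the points it owns
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels
```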

  26. K-means Algorithm: Example • Randomly select k cluster centers • While (points change membership) • Assign each point to its closest cluster • (Use your favorite distance metric) • Update each center to be the mean of its items • Objective function: Variance • Data: [1, 15, 4, 2, 17, 10, 6, 18] CS 461, Winter 2009
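
Running the kmeans() sketch above on the slide's data, with k = 2 chosen here purely for illustration:

```python
# 1-D data from the slide, reshaped to (n, 1) for the kmeans() sketch above.
data = np.array([[1.], [15.], [4.], [2.], [17.], [10.], [6.], [18.]])
centers, labels = kmeans(data, k=2)
print(centers.ravel(), labels)
# Depending on the random initialization, a typical final partition is
# {1, 2, 4, 6} vs. {10, 15, 17, 18}, with centers near 3.25 and 15.
```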

  27. K-means for Compression • Original image: 159 KB • Clustered, k=4: 53 KB CS 461, Winter 2009
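
The compression on this slide is color quantization: cluster the pixel colors with K-means and replace every pixel by its cluster's mean color, so only k colors need to be stored. A sketch reusing kmeans() from above (the image variable and its (H, W, 3) uint8 shape are assumptions):

```python
def quantize(image, k=4):
    # Treat each pixel's RGB value as a point in 3-D color space.
    pixels = image.reshape(-1, 3).astype(float)
    centers, labels = kmeans(pixels, k)
    # Replace each pixel with the mean color of its cluster (k colors total).
    return centers[labels].reshape(image.shape).astype(np.uint8)
```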

  28. Issue 1: Local Optima • K-means is greedy! • Converging to a non-global optimum: CS 461, Winter 2009 [Example from Andrew Moore]

  29. Issue 1: Local Optima • K-means is greedy! • Converging to a non-global optimum: CS 461, Winter 2009 [Example from Andrew Moore]

  30. Issue 2: How long will it take? • We don’t know! • K-means is O(nkdI) • d = # features (dimensionality) • I = # iterations • # iterations depends on random initialization • “Good” init: few iterations • “Bad” init: lots of iterations • How can we tell the difference, before clustering? • We can’t • Use heuristics to guess “good” init CS 461, Winter 2009
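
For a purely illustrative sense of scale: n = 10,000 points with d = 2 features, k = 5 centers, and I = 20 iterations is on the order of 10,000 × 5 × 2 × 20 = 2,000,000 distance-coordinate operations per run.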

  31. Issue 3: How many clusters? • The “Holy Grail” of clustering CS 461, Winter 2009

  32. Issue 3: How many clusters? • Select k that gives partition with least variance? • Best k depends on the user’s goal [Dhande and Fiore, 2002] CS 461, Winter 2009
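
One common heuristic, not shown on the slide, is the "elbow" method: run K-means for several values of k and look for where the within-cluster variance stops dropping sharply. A sketch reusing the kmeans() function and the data array from the earlier slides:

```python
def within_cluster_variance(X, centers, labels):
    # Sum of squared distances from each point to its own cluster center.
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centers))

for k in range(1, 6):
    centers, labels = kmeans(data, k)
    print(k, within_cluster_variance(data, centers, labels))
# The variance always decreases as k grows, which is why "least variance"
# alone cannot choose k; the elbow of this curve is one practical compromise.
```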

  33. Issue 4: How good is the result? • Rand Index • A = # pairs in the same cluster in both partitions • B = # pairs in different clusters in both partitions • Rand = (A + B) / total number of pairs • (Figure: two different partitions of items 1-10) • Example: Rand = (5 + 26) / 45 CS 461, Winter 2009
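
A small sketch of the Rand index exactly as defined on the slide, comparing two labelings of the same items (the example labelings below are hypothetical, not the figure's partitions):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # A pair of items counts if both partitions treat it the same way:
    # same cluster in both, or different clusters in both.
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Two hypothetical partitions of 10 items (45 pairs in total):
print(rand_index([0, 0, 0, 1, 1, 1, 2, 2, 2, 2], [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]))
```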

  34. K-means: Parametric or Non-parametric? • Cluster models: means • Data models? • All clusters are spherical • Distance in any direction is the same • Cluster may be arbitrarily “big” to include outliers CS 461, Winter 2009

  35. EM Clustering • Parametric solution • Model the data distribution • Each cluster: Gaussian model • Data: “mixture of models” • Hidden value zt is the cluster of item t • E-step: estimate cluster memberships • M-step: maximize likelihood (clusters, params) CS 461, Winter 2009
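
A compact EM sketch for the spherical-Gaussian mixture described on the next few slides (each cluster a Gaussian with covariance σ²I); the initialization and variable names are assumptions rather than the textbook's code:

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians; X is an (n, d) array."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    sigma2 = np.full(k, X.var())      # one variance per (spherical) component
    priors = np.full(k, 1.0 / k)      # P(w_i)
    for _ in range(n_iter):
        # E-step: soft cluster memberships P(w_i | x) for every point
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        log_p = (-0.5 * sq / sigma2
                 - 0.5 * d * np.log(2 * np.pi * sigma2)
                 + np.log(priors))
        log_p -= log_p.max(axis=1, keepdims=True)   # for numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize likelihood given the soft memberships
        Nk = resp.sum(axis=0)
        priors = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (resp * sq).sum(axis=0) / (d * Nk)
    return priors, means, sigma2, resp
```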

  36. The GMM assumption • There are k components. The i’th component is called ωi • Component ωi has an associated mean vector μi • (Figure: component means μ1, μ2, μ3) CS 461, Winter 2009 [Andrew Moore]

  37. The GMM assumption • There are k components. The i’th component is called ωi • Component ωi has an associated mean vector μi • Each component generates data from a Gaussian with mean μi and covariance matrix σ²I • Assume that each datapoint is generated according to the following recipe: CS 461, Winter 2009 [Andrew Moore]

  38. The GMM assumption • There are k components. The i’th component is called ωi • Component ωi has an associated mean vector μi • Each component generates data from a Gaussian with mean μi and covariance matrix σ²I • Assume that each datapoint is generated according to the following recipe: • Pick a component at random. Choose component i with probability P(ωi). CS 461, Winter 2009 [Andrew Moore]

  39. The GMM assumption • There are k components. The i’th component is called ωi • Component ωi has an associated mean vector μi • Each component generates data from a Gaussian with mean μi and covariance matrix σ²I • Assume that each datapoint is generated according to the following recipe: • Pick a component at random. Choose component i with probability P(ωi). • Datapoint x ~ N(μi, σ²I) CS 461, Winter 2009 [Andrew Moore]

  40. The General GMM assumption • There are k components. The i’th component is called ωi • Component ωi has an associated mean vector μi • Each component generates data from a Gaussian with mean μi and covariance matrix Σi • Assume that each datapoint is generated according to the following recipe: • Pick a component at random. Choose component i with probability P(ωi). • Datapoint x ~ N(μi, Σi) CS 461, Winter 2009 [Andrew Moore]
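
The generative "recipe" built up on slides 37-40, as a short sampling sketch; the priors, means, and covariances below are purely illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
priors = np.array([0.5, 0.3, 0.2])                    # P(w_i)
means = np.array([[0., 0.], [5., 5.], [0., 5.]])      # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2., 0.5])])  # Sigma_i

def sample_point():
    i = rng.choice(len(priors), p=priors)              # pick component i w.p. P(w_i)
    return rng.multivariate_normal(means[i], covs[i])  # x ~ N(mu_i, Sigma_i)

X = np.array([sample_point() for _ in range(500)])     # a synthetic GMM dataset
```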

  41. EM in action • http://www.the-wabe.com/notebook/em-algorithm.html CS 461, Winter 2009

  42. Gaussian Mixture Example: Start CS 461, Winter 2009 [ Andrew Moore]

  43. After first iteration CS 461, Winter 2009 [ Andrew Moore]

  44. After 2nd iteration CS 461, Winter 2009 [ Andrew Moore]

  45. After 3rd iteration CS 461, Winter 2009 [ Andrew Moore]

  46. After 4th iteration CS 461, Winter 2009 [ Andrew Moore]

  47. After 5th iteration CS 461, Winter 2009 [ Andrew Moore]

  48. After 6th iteration CS 461, Winter 2009 [ Andrew Moore]

  49. After 20th iteration CS 461, Winter 2009 [ Andrew Moore]

  50. EM Benefits • Model actual data distribution, not just centers • Get probability of membership in each cluster, not just distance • Clusters do not need to be “round” CS 461, Winter 2009
