Unsupervised, Cont’d: Expectation Maximization
Presentation tips
• Practice! Work on knowing what you’re going to say at each point.
  • Know your own presentation
• Practice! Work on timing
  • You have 15 minutes to talk + 3 minutes for questions
  • Will be graded on adherence to time!
  • Timing is hard. Becomes easier as you practice
Presentation tips
• Practice!
  • What appears on your screen is different from what will appear when projected
  • Different size; different font; different line thicknesses; different color
• Avoid hard-to-distinguish colors (red on blue)
• Don’t completely rely on color for visual distinctions
The final report
• Due: Dec 17, 5:00 PM (last day of finals week)
• Should contain:
  • Intro: what was your problem; why should we care about it?
  • Background: what have other people done?
  • Your work: what did you do? Was it novel or re-implementation? (Algorithms, descriptions, etc.)
  • Results: Did it work? How do we know? (Experiments, plots & tables, etc.)
  • Discussion: What did you/we learn from this?
  • Future work: What would you do next/do over?
• Length: Long enough to convey all that
The final report
• Will be graded on:
  • Content: Have you accomplished what you set out to? Have you demonstrated your conclusions? Have you described what you did well?
  • Analysis: Have you thought clearly about what you accomplished, drawn appropriate conclusions, formulated appropriate “future work”, etc.?
  • Writing and clarity: Have you conveyed your ideas clearly and concisely? Are all of your conclusions supported by arguments? Are your algorithms/data/etc. described clearly?
Back to clustering
• Purpose of clustering:
  • Find “chunks” of “closely related” data
  • Uses notion of similarity among points
  • Often, distance is interpreted as similarity
• Agglomerative:
  • Start w/ individuals==clusters; join together clusters
• There’s also divisive:
  • Start w/ all data==one cluster; split apart clusters
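To make the agglomerative idea concrete, here is a minimal sketch using SciPy’s hierarchical clustering; the synthetic data and the choice of average linkage are illustration-only assumptions, not part of the slides.

```python
# Agglomerative clustering sketch: start with every point as its own cluster,
# then repeatedly merge the closest pair of clusters (synthetic data, illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),   # one "chunk" of points
               rng.normal(3.0, 0.5, size=(20, 2))])  # another, farther away

Z = linkage(X, method="average")                  # merge tree (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 clusters
print(labels)
```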
Combinatorial clustering
• General clustering framework:
  • Set target of k clusters
  • Choose a cluster optimality criterion
    • Often a function of “between-cluster variation” vs. “within-cluster variation”
  • Find assignment of points to clusters that minimizes (maximizes) this criterion
• Q: Given N data points and k clusters, how many possible clusterings are there?
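For reference, the count asked about above is the Stirling number of the second kind, S(N, k); a tiny sketch (the function name and example values are mine):

```python
# Number of ways to partition N points into k non-empty clusters: S(N, k).
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind, via the standard inclusion-exclusion formula."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(stirling2(19, 4))   # 11,259,666,950 -- already enormous for tiny N and k
```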
Example clustering criteria
• Define:
  • Cluster i: C_i = { x_j : point j assigned to cluster i }, with N_i = |C_i|
  • Cluster i mean: x̄_i = (1/N_i) Σ_{j∈C_i} x_j
  • Between-cluster variation: B = Σ_i N_i ‖x̄_i − x̄‖², where x̄ is the overall data mean
  • Within-cluster variation: W = Σ_i Σ_{j∈C_i} ‖x_j − x̄_i‖²
Example clustering criteria
• Now want some way to trade off within vs. between
• Usually want to decrease w/in-cluster var, but increase between-cluster var
• E.g., maximize: B − αW
• or: the ratio B / W
• α > 0 controls relative importance of terms
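A small sketch of how these quantities could be computed for a given assignment; the B − αW trade-off below matches the criterion style above, but the helper names and the α value are assumptions for illustration.

```python
# Within- and between-cluster variation for a given cluster assignment.
import numpy as np

def cluster_variation(X, labels):
    """Return (W, B): within- and between-cluster sums of squared deviations."""
    xbar = X.mean(axis=0)                          # overall mean
    W, B = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)                     # cluster mean
        W += np.sum((Xc - mu_c) ** 2)              # spread within the cluster
        B += len(Xc) * np.sum((mu_c - xbar) ** 2)  # separation from the overall mean
    return W, B

def criterion(X, labels, alpha=1.0):
    W, B = cluster_variation(X, labels)
    return B - alpha * W                           # bigger is better: separated, tight clusters
```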
Comb. clustering example
Clustering of seismological data
http://www.geophysik.ruhr-uni-bochum.de/index.php?id=3&sid=5
Unsup. prob. modeling
• Sometimes, instead of clusters, want a full probability model of data
• Can sometimes use prob. model to get clusters
• Recall: in supervised learning, we said:
  • Find a probability model, Pr[X|Ci], for each class Ci
• Now: find a prob. model for data w/o knowing class: Pr[X]
• Simplest: fit your favorite model via maximum likelihood (ML)
• Harder: assume a “hidden cluster ID” variable
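As a minimal example of the “simplest” option (fit one model to all the data by maximum likelihood), the ML fit of a single 1-d Gaussian has a closed form; this tiny helper is just an illustration.

```python
# ML fit of a single 1-d Gaussian to unlabeled data: the sample mean and variance.
import numpy as np

def fit_gaussian_ml(x):
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # ML estimate uses 1/N, not the unbiased 1/(N-1)
    return mu, sigma2
```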
Hidden variables
• Assume data is generated by k different underlying processes/models
  • E.g., k different clusters, k classes, etc.
• BUT, you don’t get to “see” which point was generated by which process
  • Only get the X for each point; the y is hidden
• Want to build complete data model from k different “cluster-specific” models:
  Pr[X] = Σ_{i=1..k} α_i Pr[X | y = i], where α_i = Pr[y = i] and Σ_i α_i = 1
Mixture models
• This form is called a “mixture model”
  • A “mixture” of k sub-models
• Equivalent to the process: roll a weighted die (weighted by the α_i); choose the corresponding sub-model; generate a data point from that sub-model
• Example: mixture of Gaussians:
  Pr[x] = Σ_{i=1..k} α_i N(x; μ_i, σ_i²)
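A sketch of the generative process described above (roll a weighted die, pick a sub-model, sample from it); the component weights, means, and sigmas below are made-up illustration values.

```python
# Sampling from a mixture of 1-d Gaussians, following the "weighted die" description.
import numpy as np

def sample_mixture(n, alphas, mus, sigmas, seed=0):
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(alphas), size=n, p=alphas)   # weighted die roll per point
    return rng.normal(loc=np.take(mus, comps),          # sample from the chosen component
                      scale=np.take(sigmas, comps))

x = sample_mixture(500, alphas=[0.3, 0.7], mus=[-2.0, 1.5], sigmas=[0.5, 1.0])
```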
Parameterizing a mixture
• How do you find the params, etc.?
• Simple answer: use maximum likelihood:
  • Write down joint likelihood function
  • Differentiate
  • Set equal to 0
  • Solve for params
• Unfortunately... it doesn’t work in this case
  • Good exercise: try it and see why it breaks
• Answer: Expectation Maximization
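A one-line sketch of why the direct approach gets stuck, assuming the mixture form above:

```latex
% Log-likelihood of the k-component mixture: a sum of logs of sums.
\ell(\theta) = \sum_{j=1}^{N} \log \left( \sum_{i=1}^{k} \alpha_i \, \mathcal{N}(x_j;\, \mu_i, \sigma_i^2) \right)
% Differentiating w.r.t. any \mu_i leaves every component's parameters coupled inside the
% inner sum, so "set equal to 0 and solve" has no closed-form solution -- hence EM.
```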
Expectation-Maximization
• General method for doing maximum likelihood in the presence of hidden variables
• Identified by Dempster, Laird, & Rubin (1977)
• Called the “EM algorithm”, but is really more of a “meta-algorithm”: a recipe for writing algorithms
• Works in general when you have:
  • A probability distribution over some data set
  • Missing feature/label values for some/all data points
• Special cases:
  • Gaussian mixtures
  • Hidden Markov models
  • Kalman filters
  • POMDPs
The Gaussian mixture case
• Assume: data generated from a 1-d mixture of Gaussians:
  Pr[x] = Σ_{i=1..k} α_i N(x; μ_i, σ_i²)
• Whole data set: D = { x_1, ..., x_N }
• Introduce a “responsibility” variable:
  z_ij = Pr[point x_j was generated by component i | x_j, params]
• If you know model params, can calculate responsibilities:
  z_ij = α_i N(x_j; μ_i, σ_i²) / Σ_l α_l N(x_j; μ_l, σ_l²)
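A sketch of the responsibility computation (the E-step) given current parameters; the variable names are mine, but the formula is the standard one stated above.

```python
# E-step sketch: z[i, j] = Pr[point j came from component i | x_j, current params].
import numpy as np
from scipy.stats import norm

def e_step(x, alphas, mus, sigmas):
    # Unnormalized responsibilities: alpha_i * N(x_j; mu_i, sigma_i^2), shape (k, N)
    z = np.array([a * norm.pdf(x, loc=m, scale=s)
                  for a, m, s in zip(alphas, mus, sigmas)])
    return z / z.sum(axis=0, keepdims=True)    # normalize over components for each point
```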
Parameterizing responsibly
• Assume you know the responsibilities, z_ij
• Can use this to find parameters for each Gaussian (think about the special case where z_ij = 0 or 1):
  μ_i = Σ_j z_ij x_j / Σ_j z_ij
  σ_i² = Σ_j z_ij (x_j − μ_i)² / Σ_j z_ij
  α_i = (1/N) Σ_j z_ij
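And a matching sketch of the parameter update (the M-step): each Gaussian is re-fit to the data weighted by its responsibilities. Alternating e_step and m_step until the parameters stop changing gives the full EM loop; the names and array shapes here are assumptions for illustration.

```python
# M-step sketch: re-estimate each component from responsibility-weighted data.
import numpy as np

def m_step(x, z):
    """x: (N,) data; z: (k, N) responsibilities. Returns updated (alphas, mus, sigmas)."""
    Nk = z.sum(axis=1)                                                # effective points per component
    alphas = Nk / len(x)                                              # mixing weights
    mus = (z @ x) / Nk                                                # responsibility-weighted means
    sigmas = np.sqrt((z * (x - mus[:, None]) ** 2).sum(axis=1) / Nk)  # weighted std devs
    return alphas, mus, sigmas
```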