Unsupervised, Cont’d: Expectation Maximization
Presentation tips
• Practice! Work on knowing what you’re going to say at each point.
  • Know your own presentation
• Practice! Work on timing
  • You have 15 minutes to talk + 3 minutes for questions
  • Will be graded on adherence to time!
  • Timing is hard. Becomes easier as you practice
Presentation tips
• Practice!
  • What appears on your screen is different from what will appear when projected
  • Different size; different font; different line thicknesses; different color
• Avoid hard-to-distinguish colors (red on blue)
• Don’t completely rely on color for visual distinctions
The final report
• Due: Dec 17, 5:00 PM (last day of finals week)
• Should contain:
  • Intro: what was your problem; why should we care about it?
  • Background: what have other people done?
  • Your work: what did you do? Was it novel or re-implementation? (Algorithms, descriptions, etc.)
  • Results: Did it work? How do we know? (Experiments, plots & tables, etc.)
  • Discussion: What did you/we learn from this?
  • Future work: What would you do next/do over?
• Length: Long enough to convey all that
The final report
• Will be graded on:
  • Content: Have you accomplished what you set out to? Have you demonstrated your conclusions? Have you described what you did well?
  • Analysis: Have you thought clearly about what you accomplished, drawn appropriate conclusions, formulated appropriate “future work”, etc.?
  • Writing and clarity: Have you conveyed your ideas clearly and concisely? Are all of your conclusions supported by arguments? Are your algorithms/data/etc. described clearly?
Back to clustering
• Purpose of clustering:
  • Find “chunks” of “closely related” data
  • Uses notion of similarity among points
  • Often, distance is interpreted as similarity
• Agglomerative:
  • Start w/ individuals==clusters; join together clusters
• There’s also divisive:
  • Start w/ all data==one cluster; split apart clusters
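To make the agglomerative idea concrete, here is a minimal sketch using SciPy’s hierarchical clustering; the synthetic data and the choice of average linkage are illustration-only assumptions, not part of the slides.

```python
# Agglomerative clustering sketch: start with every point as its own cluster,
# then repeatedly merge the closest pair of clusters (synthetic data, illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),   # one "chunk" of points
               rng.normal(3.0, 0.5, size=(20, 2))])  # another, farther away

Z = linkage(X, method="average")                  # merge tree (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 clusters
print(labels)
```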
Combinatorial clustering
• General clustering framework:
  • Set target of k clusters
  • Choose a cluster optimality criterion
    • Often a function of “between-cluster variation” vs. “within-cluster variation”
  • Find assignment of points to clusters that minimizes (maximizes) this criterion
• Q: Given N data points and k clusters, how many possible clusterings are there?
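For reference, the count asked about above is the Stirling number of the second kind, S(N, k); a tiny sketch (the function name and example values are mine):

```python
# Number of ways to partition N points into k non-empty clusters: S(N, k).
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind, via the standard inclusion-exclusion formula."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(stirling2(19, 4))   # 11,259,666,950 -- already enormous for tiny N and k
```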
Example clustering criteria
• Define:
  • Cluster i: C_i = { x_j : point j assigned to cluster i }, with N_i = |C_i|
  • Cluster i mean: x̄_i = (1/N_i) Σ_{j∈C_i} x_j
  • Between-cluster variation: B = Σ_i N_i ‖x̄_i − x̄‖², where x̄ is the overall data mean
  • Within-cluster variation: W = Σ_i Σ_{j∈C_i} ‖x_j − x̄_i‖²
Example clustering criteria
• Now want some way to trade off within vs. between
• Usually want to decrease w/in-cluster var, but increase between-cluster var
• E.g., maximize: B − αW
• or: the ratio B / W
• α > 0 controls relative importance of terms
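A small sketch of how these quantities could be computed for a given assignment; the B − αW trade-off below matches the criterion style above, but the helper names and the α value are assumptions for illustration.

```python
# Within- and between-cluster variation for a given cluster assignment.
import numpy as np

def cluster_variation(X, labels):
    """Return (W, B): within- and between-cluster sums of squared deviations."""
    xbar = X.mean(axis=0)                          # overall mean
    W, B = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)                     # cluster mean
        W += np.sum((Xc - mu_c) ** 2)              # spread within the cluster
        B += len(Xc) * np.sum((mu_c - xbar) ** 2)  # separation from the overall mean
    return W, B

def criterion(X, labels, alpha=1.0):
    W, B = cluster_variation(X, labels)
    return B - alpha * W                           # bigger is better: separated, tight clusters
```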
Comb. clustering example
Clustering of seismological data
http://www.geophysik.ruhr-uni-bochum.de/index.php?id=3&sid=5
Unsup. prob. modeling
• Sometimes, instead of clusters, want a full probability model of data
• Can sometimes use prob. model to get clusters
• Recall: in supervised learning, we said:
  • Find a probability model, Pr[X|Ci], for each class Ci
• Now: find a prob. model for data w/o knowing class: Pr[X]
• Simplest: fit your favorite model via maximum likelihood (ML)
• Harder: assume a “hidden cluster ID” variable
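As a minimal example of the “simplest” option (fit one model to all the data by maximum likelihood), the ML fit of a single 1-d Gaussian has a closed form; this tiny helper is just an illustration.

```python
# ML fit of a single 1-d Gaussian to unlabeled data: the sample mean and variance.
import numpy as np

def fit_gaussian_ml(x):
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # ML estimate uses 1/N, not the unbiased 1/(N-1)
    return mu, sigma2
```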
Hidden variables
• Assume data is generated by k different underlying processes/models
  • E.g., k different clusters, k classes, etc.
• BUT, you don’t get to “see” which point was generated by which process
  • Only get the X for each point; the y is hidden
• Want to build complete data model from k different “cluster-specific” models:
  Pr[X] = Σ_{i=1..k} α_i Pr[X | y = i], where α_i = Pr[y = i] and Σ_i α_i = 1
Mixture models
• This form is called a “mixture model”
  • A “mixture” of k sub-models
• Equivalent to the process: roll a weighted die (weighted by the α_i); choose the corresponding sub-model; generate a data point from that sub-model
• Example: mixture of Gaussians:
  Pr[x] = Σ_{i=1..k} α_i N(x; μ_i, σ_i²)
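A sketch of the generative process described above (roll a weighted die, pick a sub-model, sample from it); the component weights, means, and sigmas below are made-up illustration values.

```python
# Sampling from a mixture of 1-d Gaussians, following the "weighted die" description.
import numpy as np

def sample_mixture(n, alphas, mus, sigmas, seed=0):
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(alphas), size=n, p=alphas)   # weighted die roll per point
    return rng.normal(loc=np.take(mus, comps),          # sample from the chosen component
                      scale=np.take(sigmas, comps))

x = sample_mixture(500, alphas=[0.3, 0.7], mus=[-2.0, 1.5], sigmas=[0.5, 1.0])
```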
Parameterizing a mixture
• How do you find the params, etc.?
• Simple answer: use maximum likelihood:
  • Write down joint likelihood function
  • Differentiate
  • Set equal to 0
  • Solve for params
• Unfortunately... it doesn’t work in this case
  • Good exercise: try it and see why it breaks
• Answer: Expectation Maximization
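A one-line sketch of why the direct approach gets stuck, assuming the mixture form above:

```latex
% Log-likelihood of the k-component mixture: a sum of logs of sums.
\ell(\theta) = \sum_{j=1}^{N} \log \left( \sum_{i=1}^{k} \alpha_i \, \mathcal{N}(x_j;\, \mu_i, \sigma_i^2) \right)
% Differentiating w.r.t. any \mu_i leaves every component's parameters coupled inside the
% inner sum, so "set equal to 0 and solve" has no closed-form solution -- hence EM.
```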
Expectation-Maximization
• General method for doing maximum likelihood in the presence of hidden variables
• Identified by Dempster, Laird, & Rubin (1977)
• Called the “EM algorithm”, but is really more of a “meta-algorithm”: a recipe for writing algorithms
• Works in general when you have:
  • A probability distribution over some data set
  • Missing feature/label values for some/all data points
• Special cases:
  • Gaussian mixtures
  • Hidden Markov models
  • Kalman filters
  • POMDPs
The Gaussian mixture case
• Assume: data generated from a 1-d mixture of Gaussians:
  Pr[x] = Σ_{i=1..k} α_i N(x; μ_i, σ_i²)
• Whole data set: D = { x_1, ..., x_N }
• Introduce a “responsibility” variable:
  z_ij = Pr[point x_j was generated by component i | x_j, params]
• If you know model params, can calculate responsibilities:
  z_ij = α_i N(x_j; μ_i, σ_i²) / Σ_l α_l N(x_j; μ_l, σ_l²)
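A sketch of the responsibility computation (the E-step) given current parameters; the variable names are mine, but the formula is the standard one stated above.

```python
# E-step sketch: z[i, j] = Pr[point j came from component i | x_j, current params].
import numpy as np
from scipy.stats import norm

def e_step(x, alphas, mus, sigmas):
    # Unnormalized responsibilities: alpha_i * N(x_j; mu_i, sigma_i^2), shape (k, N)
    z = np.array([a * norm.pdf(x, loc=m, scale=s)
                  for a, m, s in zip(alphas, mus, sigmas)])
    return z / z.sum(axis=0, keepdims=True)    # normalize over components for each point
```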
Parameterizing responsibly
• Assume you know the responsibilities, z_ij
• Can use this to find parameters for each Gaussian (think about the special case where z_ij = 0 or 1):
  μ_i = Σ_j z_ij x_j / Σ_j z_ij
  σ_i² = Σ_j z_ij (x_j − μ_i)² / Σ_j z_ij
  α_i = (1/N) Σ_j z_ij
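And a matching sketch of the parameter update (the M-step): each Gaussian is re-fit to the data weighted by its responsibilities. Alternating e_step and m_step until the parameters stop changing gives the full EM loop; the names and array shapes here are assumptions for illustration.

```python
# M-step sketch: re-estimate each component from responsibility-weighted data.
import numpy as np

def m_step(x, z):
    """x: (N,) data; z: (k, N) responsibilities. Returns updated (alphas, mus, sigmas)."""
    Nk = z.sum(axis=1)                                                # effective points per component
    alphas = Nk / len(x)                                              # mixing weights
    mus = (z @ x) / Nk                                                # responsibility-weighted means
    sigmas = np.sqrt((z * (x - mus[:, None]) ** 2).sum(axis=1) / Nk)  # weighted std devs
    return alphas, mus, sigmas
```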