
  1. Recap • Curse of dimensionality • Data tends to fall into the “rind” of space • Variance grows with dimension (uniform unit cube: total variance d/12) • Data points fall farther apart from each other (uniform unit cube: expected squared distance between two random points is d/6) • Statistics become unreliable, histograms are hard to build, etc. => use simple models

  2. Recap • Multivariate Gaussian • By translation and rotation, it factors into a product of one-dimensional normal distributions • MLE of mean • MLE of covariance (both written out below)
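The MLE formulas referred to on this slide are the standard ones for a multivariate Gaussian fitted to data x_1, …, x_N; in LaTeX, for reference:

\hat{\mu} \;=\; \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\Sigma} \;=\; \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{\mu} \right)\left( x_i - \hat{\mu} \right)^{\mathsf{T}}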

  3. Be cautious: data may not lie in one blob; we may need to separate the data into groups => clustering

  4. CS 498 Probability & Statistics Clustering methods Zicheng Liao

  5. What is clustering? • “Grouping” • A fundamental part in signal processing • “Unsupervised classification” • Assign the same label to data points that are close to each other Why?

  6. We live in a universe full of clusters

  7. Two (types of) clustering methods • Agglomerative/Divisive clustering • K-means

  8. Agglomerative/Divisive clustering • Agglomerative clustering builds the hierarchical cluster tree bottom up • Divisive clustering builds it top down

  9. Algorithm

  10. Agglomerative clustering: an example • “merge clusters bottom up to form a hierarchical cluster tree” Animation from Georg Berber www.mit.edu/~georg/papers/lecture6.ppt
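A minimal MATLAB sketch of the bottom-up merging loop, assuming an n-by-d data matrix X and single-link distance between clusters (pdist2 is from the Statistics Toolbox); it only performs the merges and does not record the tree, which is what linkage() on the next slide does:

clusters = num2cell(1:size(X, 1));       % start: every point is its own cluster
while numel(clusters) > 1
    best = inf; bi = 0; bj = 0;
    for i = 1:numel(clusters)
        for j = i+1:numel(clusters)
            D = pdist2(X(clusters{i}, :), X(clusters{j}, :));
            d = min(D(:));               % single-link: closest pair of points
            if d < best, best = d; bi = i; bj = j; end
        end
    end
    clusters{bi} = [clusters{bi}, clusters{bj}];   % merge the two closest clusters
    clusters(bj) = [];                             % drop the merged-away cluster
end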

  11. Dendrogram (vertical axis: distance) >> X = rand(6, 2); % create 6 points in the plane >> Z = linkage(X); % Z encodes a tree of hierarchical clusters >> dendrogram(Z); % visualize Z as a dendrogram

  12. Distance measure • Popular choices: Euclidean, Hamming, correlation, cosine, … • Should be a metric (in particular, satisfy the triangle inequality) • Critical to clustering performance • No single answer; it depends on the data and the goal • Whiten the data when we know little about it
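For illustration, a small MATLAB sketch of computing several of these distance measures on the same data (pdist and zscore are from the Statistics Toolbox; the choice of measure is an assumption about the data, not something the slide prescribes):

X = randn(10, 5);                             % 10 points in 5 dimensions
D_euc = squareform(pdist(X, 'euclidean'));    % pairwise Euclidean distances
D_cos = squareform(pdist(X, 'cosine'));       % 1 - cosine similarity
D_cor = squareform(pdist(X, 'correlation'));  % 1 - sample correlation
Xw = zscore(X);                               % standardize each dimension (a simple form of whitening)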

  13. Inter-cluster distance • Treat each data point as a singleton cluster • Then we only need to define an inter-cluster distance • A distance between one set of points and another set of points • 3 popular inter-cluster distances • Single-link • Complete-link • Average-link

  14. Single-link • Minimum of all pairwise distances between points from the two clusters • Tends to produce long, loose clusters

  15. Complete-link • Maximum of all pairwise distances between points from the two clusters • Tends to produce tight clusters

  16. Average-link • Average of all pairwise distances between points from the two clusters
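The three inter-cluster distances correspond directly to the method argument of MATLAB's linkage(); a small sketch, assuming a data matrix X:

Zs = linkage(X, 'single');    % single-link: minimum pairwise distance
Zc = linkage(X, 'complete');  % complete-link: maximum pairwise distance
Za = linkage(X, 'average');   % average-link: mean pairwise distance
dendrogram(Za);               % compare the resulting trees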

  17. How many clusters are there? • Intrinsically hard to know • The dendrogram gives some insight into it • Choose a distance threshold and split the dendrogram into clusters at that level
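One way to turn the chosen threshold into cluster labels in MATLAB is cluster() on the linkage output; the cutoff value below is an arbitrary example, not a recommendation:

Z = linkage(X, 'average');
T = cluster(Z, 'Cutoff', 1.0, 'Criterion', 'distance');  % cut the tree at distance 1.0
T3 = cluster(Z, 'MaxClust', 3);                          % or ask directly for 3 clusters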

  18. An example do_agglomerative.m

  19. Divisive clustering • “recursively split a cluster into smaller clusters” • It’s hard to choose where to split: a combinatorial problem • Can be easier when the data has special structure (e.g., a pixel grid)

  20. K-means • Partition data into clusters such that: • Clusters are tight (distance to cluster center is small) • Every data point is closer to its own cluster center than to all other cluster centers (Voronoi diagram) [figures excerpted from Wikipedia]

  21. Formulation • Find K clusters that minimize the within-cluster sum of squared distances (objective written out below) • Two sets of parameters: the cluster assignments and the cluster centers • Finding the global optimum is NP-hard • In practice: an iterative procedure that converges to a local minimum
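In the standard formulation (consistent with the slide, written here for reference), with assignment variables δ_{ij} ∈ {0,1} and cluster centers c_j, the objective is

\Phi(\delta, c) \;=\; \sum_{j=1}^{K} \sum_{i:\,\delta_{ij}=1} \left\| x_i - c_j \right\|^{2},

and the two sets of parameters are the assignments {δ_{ij}} and the centers {c_j}.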

  22. K-means algorithm • Choose the cluster number K • Initialize the cluster centers • Either randomly select K data points as cluster centers • Or randomly assign data to clusters and compute the cluster centers • Iterate: • Assign each point to the closest cluster center • Update the cluster centers (take the mean of the data in each cluster) • Stop when the assignments no longer change
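A minimal MATLAB sketch of the loop just described, assuming a data matrix X and a cluster number K (empty clusters are not handled here; in practice kmeans() from the Statistics Toolbox does this with better initialization):

n = size(X, 1);
C = X(randperm(n, K), :);                        % initialize: K random data points as centers
labels = zeros(n, 1);
while true
    [~, newLabels] = min(pdist2(X, C), [], 2);   % assign each point to the closest center
    if isequal(newLabels, labels), break; end    % stop when the assignment doesn't change
    labels = newLabels;
    for j = 1:K
        C(j, :) = mean(X(labels == j, :), 1);    % update each center to the mean of its cluster
    end
end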

  23. Illustration • Randomly initialize 3 cluster centers (circles) • Assign each point to the closest cluster center • Update the cluster centers • Re-iterate from step 2 [figures excerpted from Wikipedia]

  24. Example do_Kmeans.m (shows step-by-step updates and the effect of the cluster number)

  25. Discussion • How to choose the cluster number K? • No exact answer; guess from the data (with visualization) • Or define a cluster quality measure and optimize it over K. K = 2? K = 3? K = 5?

  26. Discussion • K-means converges to a local minimum => can give counterintuitive clusterings [figures excerpted from Wikipedia]

  27. Discussion • Favors spherical clusters • Gives poor results for long/loose/stretched clusters [figure: input data (color indicates the true labels) vs. K-means results]

  28. Discussion • The cost is guaranteed to decrease in every step • Assigning a point to the closest cluster center minimizes the cost for the current cluster centers • Choosing the mean of each cluster as its new center minimizes the squared distance for the current assignment • Terminates after a finite number of iterations (each iteration takes polynomial time)

  29. Summary • Clustering as grouping “similar” data together • A world full of clusters/patterns • Two algorithms • Agglomerative/divisive clustering: hierarchical clustering tree • K-means: vector quantization

  30. CS 498 Probability & Statistics Regression Zicheng Liao

  31. Example-I • Predict stock price [figure: stock price vs. time, predicting the value at time t+1]

  32. Example-II • Fill in missing pixels in an image: inpainting

  33. Example-III • Discover relationships in data [figure: amount of hormone delivered by devices from 3 production lots vs. time in service]

  34. Example-III • Discover relationships in data

  35. Linear regression • Input: pairs (x_i, y_i) • Linear model with Gaussian noise: y = x^T β + ξ • x: explanatory variable • y: dependent variable • β: parameters of the linear model • ξ: zero-mean Gaussian random variable

  36. Parameter estimation • MLE of the linear model with Gaussian noise • Maximizing the likelihood function => least squares [Least squares, Carl F. Gauss, 1809]
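Written out (a standard derivation, assuming the model y_i = x_i^T β + ξ_i with ξ_i ~ N(0, σ²)):

\log L(\beta) \;=\; -\frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \left( y_i - x_i^{\mathsf{T}} \beta \right)^{2} \;+\; \text{const},

so maximizing the likelihood is equivalent to minimizing the sum of squared errors, i.e. least squares.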

  37. Parameter estimation • Closed-form solution: minimizing the cost function gives the normal equation (written out below) • Expensive to compute the matrix inverse in high dimensions
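In matrix form (X the stacked explanatory variables, y the stacked responses), the cost function and normal equation the slide refers to are

C(\beta) = \left\| y - X\beta \right\|^{2}, \qquad X^{\mathsf{T}} X \hat{\beta} = X^{\mathsf{T}} y \;\Rightarrow\; \hat{\beta} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y.

A short MATLAB sketch; the backslash solve avoids forming the inverse explicitly:

beta = (X' * X) \ (X' * y);   % solve the normal equation directly
% or, numerically preferable in MATLAB:
beta = X \ y;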

  38. Gradient descent • http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=02.5-LinearRegressionI-GradientDescentForLinearRegression&speed=100 (Andrew Ng) • Init β, then repeat gradient steps until convergence • For this convex cost, a suitable step size takes it to the global minimum (though not in a fixed number of steps)
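A minimal MATLAB sketch of gradient descent on the least-squares cost; the step size eta and the tolerance tol are arbitrary illustrative choices:

eta = 1e-3; tol = 1e-8;
beta = zeros(size(X, 2), 1);              % init
while true
    g = -2 * X' * (y - X * beta);         % gradient of ||y - X*beta||^2
    betaNew = beta - eta * g;             % gradient step
    if norm(betaNew - beta) < tol, break; end
    beta = betaNew;
end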

  39. Example do_regression.m

  40. Interpreting a regression

  41. Interpreting a regression • The residual has zero mean • The residual has zero correlation with the explanatory variables

  42. Interpreting the residual

  43. Interpreting the residual • The residual e = y − Xβ̂ has zero mean • e is orthogonal to every column of X • e is therefore also de-correlated from every column of X • e is orthogonal to the regression vector Xβ̂ • e is also de-correlated from the regression vector (by the same line of derivation)
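In symbols (assuming the model includes an intercept column of ones in X), with e = y − Xβ̂ and β̂ from the normal equation:

X^{\mathsf{T}} e \;=\; X^{\mathsf{T}} y - X^{\mathsf{T}} X \hat{\beta} \;=\; 0,

so e is orthogonal to every column of X and to the regression vector X\hat{\beta}; orthogonality to the column of ones gives the zero mean, and zero mean together with orthogonality gives zero correlation.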

  44. How good is a fit? • Information ~ variance • The total variance decomposes into the regression variance plus the error variance • How good is a fit: how much of the variance is explained by the regression (this works because the fitted values and the residual have zero covariance)

  45. How good is a fit? • The R-squared measure • The percentage of variance explained by the regression • Used in hypothesis tests for model selection
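Written out, using var(·) for the sample variance and the zero covariance between the fitted values and the residual:

\operatorname{var}(y) \;=\; \operatorname{var}(X\hat{\beta}) + \operatorname{var}(e), \qquad R^{2} \;=\; \frac{\operatorname{var}(X\hat{\beta})}{\operatorname{var}(y)} \;=\; 1 - \frac{\operatorname{var}(e)}{\operatorname{var}(y)},

a number between 0 and 1; values closer to 1 mean the regression explains more of the variance.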

  46. Regularized linear regression • Cost: the least-squares error plus a penalty on large values of β • Closed-form solution (written out below) • Or gradient descent: init β, repeat gradient steps on the regularized cost, until convergence
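The regularized (ridge) cost and its closed form, followed by a short MATLAB sketch; the value of lambda is a free choice (e.g., by cross-validation), not something the slide specifies:

C(\beta) = \left\| y - X\beta \right\|^{2} + \lambda \left\| \beta \right\|^{2}, \qquad \hat{\beta} = (X^{\mathsf{T}} X + \lambda I)^{-1} X^{\mathsf{T}} y

lambda = 0.1;                                                 % illustrative value only
beta_ridge = (X' * X + lambda * eye(size(X, 2))) \ (X' * y);  % regularized normal equation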

  47. Why regularization? • Handles small eigenvalues of X^T X • Adding the regularizer λI avoids dividing by (near-)zero values when inverting

  48. Why regularization? • Avoids over-fitting • Small parameters => simpler model => less prone to over-fitting • An over-fit model is hard to generalize to new data

  49. L1 regularization (Lasso) • Some features may be irrelevant but still get a small non-zero coefficient in β • L1 regularization pushes small values of β to zero • “Sparse representation”
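If the Statistics Toolbox is available, lasso() fits the L1-regularized model directly; a minimal sketch (the Lambda value is an arbitrary example):

[B, FitInfo] = lasso(X, y, 'Lambda', 0.1);   % L1-regularized linear regression
nnz(B)                                       % many coefficients are driven exactly to zero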

  50. How does it work? • When a coefficient β_i is small, the L1 penalty |β_i| is much larger than the squared penalty β_i² • Causes trouble in optimization (the gradient is discontinuous at zero)
