Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining
Unsupervised Learning
• In supervised learning we were given attributes & targets (e.g. class labels).
• In unsupervised learning we are only given attributes.
• Our task is to discover structure in the data.
• Example: the data may be structured in clusters. Is this a good clustering?
Why Discover Structure?
• Often, the result of an unsupervised learning algorithm is a new representation of the same data. This new representation should be more meaningful and can be used for further processing (e.g. classification).
• Clustering: the new representation is given by the label of the cluster to which a data-point belongs. This tells us which data-cases are similar to each other.
• The new representation is smaller and hence computationally more convenient.
• Clustering: each data-case is now encoded by its cluster label. This is a lot cheaper than storing its attribute values.
• Collaborative filtering (CF): we can group the users into user-communities and/or the movies into movie genres. If we need to predict something we simply pick the average rating in the group (see the sketch below).
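A minimal sketch of the group-average prediction idea above; the cluster assignments `user_group` and `movie_genre` and the ratings dictionary are hypothetical stand-ins for the output of a clustering step.

```python
import numpy as np

# Hypothetical clustering output: user -> community, movie -> genre.
user_group = {"alice": 0, "bob": 0, "carol": 1}
movie_genre = {"alien": 0, "titanic": 1, "heat": 0}

# Observed ratings: (user, movie) -> rating.
ratings = {("alice", "alien"): 5, ("bob", "heat"): 4, ("carol", "titanic"): 2}

def predict(user, movie):
    """Predict a rating as the average over all observed ratings made by users
    in the same community for movies in the same genre."""
    g, c = user_group[user], movie_genre[movie]
    pool = [r for (u, m), r in ratings.items()
            if user_group[u] == g and movie_genre[m] == c]
    return np.mean(pool) if pool else np.mean(list(ratings.values()))

print(predict("bob", "alien"))  # average over the community/genre group -> 4.5
```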
Clustering: K-means
• We iterate two operations:
  1. Update the assignment of data-cases to clusters.
  2. Update the location of the clusters.
• Let $z_i = c$ denote the assignment of data-case "i" to cluster "c", let $\mu_c$ denote the position of cluster "c" in a d-dimensional space, and let $x_i$ denote the location of data-case "i".
• Then iterate until convergence:
  1. For each data-case, compute the distance to each cluster and pick the closest one: $z_i = \arg\min_c \|x_i - \mu_c\|^2$.
  2. For each cluster location, compute the mean location of all data-cases assigned to it: $\mu_c = \frac{1}{N_c} \sum_{i \in S_c} x_i$, where $N_c$ is the nr. of data-cases in cluster c and $S_c$ is the set of data-cases assigned to cluster c.
K-means
• Cost function: $C = \sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2$.
• Each step in k-means decreases this cost function.
• Initialization is often very important since C has very many local minima.
• A relatively good initialization: place the cluster locations on K randomly chosen data-cases.
• How to choose K? Add a complexity term that grows with K (e.g. $\lambda K d$) and minimize also over K.
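A minimal numpy sketch of the two K-means steps and the cost function above; the random initialization on K data-cases and the fixed iteration count are illustrative choices, not part of the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate the assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize cluster locations on K randomly chosen data-cases.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each data-case to the closest cluster location.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        z = d2.argmin(axis=1)
        # Step 2: move each cluster to the mean of its assigned data-cases.
        for c in range(K):
            if np.any(z == c):
                mu[c] = X[z == c].mean(axis=0)
    cost = ((X - mu[z]) ** 2).sum()  # C = sum_i ||x_i - mu_{z_i}||^2
    return z, mu, cost

# Toy data: two blobs in 2-d.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
z, mu, cost = kmeans(X, K=2)
print(mu, cost)
```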
Vector Quantization
• K-means divides the space up into a Voronoi tessellation.
• Every point on a tile is summarized by the code-book vector "+".
• This clearly allows for data compression!
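A tiny sketch of the compression view, assuming the `kmeans` function from the sketch above: each data-case is stored as one small integer index into a shared code-book.

```python
import numpy as np

# Assumes the kmeans() sketch defined above.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
codes, codebook, _ = kmeans(X, K=4)   # codes: one index per data-case

# "Decoding": every data-case is represented by its code-book vector ("+").
X_hat = codebook[codes]
print("avg. squared reconstruction error:", ((X - X_hat) ** 2).sum(axis=1).mean())
```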
Mixtures of Gaussians
• K-means assigns each data-case to exactly 1 cluster. But what if clusters are overlapping? Maybe we are uncertain as to which cluster a data-case really belongs to.
• The mixture-of-Gaussians algorithm assigns data-cases to clusters with a certain probability.
MoG Clustering
• Idea: fit Gaussian densities to the data, one per cluster.
• (Figure: density contours of a Gaussian; the covariance determines the shape of these contours.)
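For reference, the density of each cluster's Gaussian in the standard multivariate form (the slide only shows its contours, so this formula is supplied here, not copied from it):

```latex
\mathcal{N}(x \mid \mu_c, \Sigma_c) =
  \frac{1}{(2\pi)^{d/2}\, |\Sigma_c|^{1/2}}
  \exp\!\Big( -\tfrac{1}{2}\, (x - \mu_c)^{\top} \Sigma_c^{-1} (x - \mu_c) \Big)
```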
EM Algorithm: E-step
• The responsibility $r_{ic}$ is the probability that data-case "i" belongs to cluster "c": $r_{ic} = \frac{\pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(x_i \mid \mu_{c'}, \Sigma_{c'})}$
• $\pi_c$ is the a priori probability of being assigned to cluster "c".
• Note that if the Gaussian has high probability on data-case "i" (i.e. the bell-shape is on top of the data-case) then it claims high responsibility for this data-case.
• The denominator just normalizes the responsibilities so that they sum to 1: $\sum_c r_{ic} = 1$.
EM Algorithm: M-Step
• Total responsibility claimed by cluster "c": $N_c = \sum_i r_{ic}$.
• Expected fraction of data-cases assigned to this cluster: $\pi_c = N_c / N$.
• Weighted sample mean, where every data-case is weighted according to the probability that it belongs to that cluster: $\mu_c = \frac{1}{N_c} \sum_i r_{ic} \, x_i$.
• Weighted sample covariance: $\Sigma_c = \frac{1}{N_c} \sum_i r_{ic} \, (x_i - \mu_c)(x_i - \mu_c)^\top$.
EM-MoG
• EM comes from "expectation maximization". We won't go through the derivation.
• If we are forced to decide, we should assign a data-case to the cluster which claims the highest responsibility.
• For a new data-case, we should compute responsibilities as in the E-step and pick the cluster with the largest responsibility.
• E and M steps should be iterated until convergence (which is guaranteed).
• Every step increases the following objective function, which is the total log-probability of the data under the model we are learning: $L = \sum_i \log \sum_c \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)$
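A compact numpy/scipy sketch of the E- and M-steps above; the random initialization and the small regularization added to the covariances are illustrative choices, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=50, seed=0):
    """EM for a mixture of Gaussians: alternate the E- and M-steps above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                        # prior cluster probabilities
    mu = X[rng.choice(N, size=K, replace=False)]    # means initialized on data-cases
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities r_ic proportional to pi_c * N(x_i | mu_c, Sigma_c).
        r = np.stack([pi[c] * multivariate_normal.pdf(X, mu[c], Sigma[c])
                      for c in range(K)], axis=1)   # (N, K)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: total responsibilities, mixing fractions, weighted means/covariances.
        Nc = r.sum(axis=0)
        pi = Nc / N
        mu = (r.T @ X) / Nc[:, None]
        for c in range(K):
            diff = X - mu[c]
            Sigma[c] = (r[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(d)

    # Log-likelihood L = sum_i log sum_c pi_c * N(x_i | mu_c, Sigma_c).
    p = np.stack([pi[c] * multivariate_normal.pdf(X, mu[c], Sigma[c])
                  for c in range(K)], axis=1)
    return pi, mu, Sigma, np.log(p.sum(axis=1)).sum()

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
pi, mu, Sigma, loglik = em_mog(X, K=2)
print(pi, loglik)
```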
Agglomerative Hierarchical Clustering
• Define a "distance" between clusters (later).
• Initially, every data-case is its own cluster.
• At each iteration, compute the distances between all existing clusters (you can store distances and avoid their re-computation).
• Merge the closest clusters into 1 single cluster.
• Update your "dendrogram".
• This way you build a hierarchy. (Figure: the dendrogram after iteration 3.)
• Complexity: Order … (why?)
Distances
• Single link: $d(A, B) = \min_{x \in A,\, y \in B} \|x - y\|$; produces the minimal spanning tree.
• Complete link: $d(A, B) = \max_{x \in A,\, y \in B} \|x - y\|$; avoids elongated clusters.
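A short sketch using scipy's hierarchical clustering, which implements the single- and complete-link distances above; the toy data and the cut at 2 clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two blobs in 2-d.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])

# Agglomerative clustering with single-link and complete-link distances.
Z_single = linkage(X, method="single")       # min distance between clusters
Z_complete = linkage(X, method="complete")   # max distance between clusters

# Cut the dendrogram to obtain 2 flat clusters.
labels = fcluster(Z_complete, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z_complete) draws the dendrogram.
```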
Gene Expression Data (Micro-array Data)
• The expression level of genes is tested under different experimental conditions.
• We would like to find the genes which co-express in a subset of conditions.
• Both genes and conditions are clustered and shown as dendrograms.
Exercise I
• Imagine I have run a clustering algorithm on some data describing 3 attributes of cars: height, weight, length. I have found two clusters. An expert comes by and tells you that class 1 is really Ferraris while class 2 is Hummers.
• A new data-case (car) is presented, i.e. you get to see its height, weight and length. Describe how you can use the output of your clustering, including the information obtained from the expert, to classify the new car as a Ferrari or a Hummer. Be very precise: use an equation or pseudo-code to describe what to do (one possible answer is sketched below).
• You add the new car to the dataset and run K-means starting at the converged assignments and cluster means obtained before. Is it possible that the assignments of the old data change due to the addition of the new data-case?
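One possible answer to the first part, sketched in code: assign the new car to the nearest converged cluster mean and translate the cluster index through the expert's mapping. The cluster means and the mapping below are hypothetical values standing in for the clustering output described in the exercise.

```python
import numpy as np

mu = np.array([[1.1, 1.4, 4.5],    # converged mean of cluster 1 (hypothetical values)
               [1.9, 3.5, 4.8]])   # converged mean of cluster 2 (hypothetical values)
expert_label = {0: "Ferrari", 1: "Hummer"}   # expert's mapping: cluster -> class

def classify(car):
    """Assign the new car to the closest cluster mean, then use the expert's label."""
    c = np.argmin(((mu - car) ** 2).sum(axis=1))   # c = argmin_c ||x - mu_c||^2
    return expert_label[c]

print(classify(np.array([1.2, 1.5, 4.4])))   # -> "Ferrari"
```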
Exercise II
• We classify data according to the 3-nearest-neighbors (3-NN) rule. Explain in detail how this works (a minimal sketch is given below).
• Which decision surface do you think is smoother: the one for 1-NN or for 100-NN? Explain.
• Is k-NN a parametric or a non-parametric method? Give an important property of non-parametric classification methods.
• We will do linear regression on data of the form $(X_n, Y_n)$ where $X_n$ and $Y_n$ are real values: $Y_n = A X_n + b + \nu_n$, where $A, b$ are parameters and $\nu_n$ is the noise variable.
• Provide the equation for the total error of the data-items.
• We want to minimize the error. With respect to what?
• You are given a new attribute $X_{\mathrm{new}}$. What would you predict for $Y_{\mathrm{new}}$?
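A minimal sketch of the k-NN rule from the first part of the exercise; the toy data and the Euclidean distance are illustrative choices.

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """k-NN rule: take the k closest training points and return the majority label."""
    d2 = ((X_train - x_new) ** 2).sum(axis=1)   # squared Euclidean distances
    nearest = np.argsort(d2)[:k]                # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy training data: two labeled blobs.
X_train = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_classify(np.array([4.0, 4.0]), X_train, y_train, k=3))   # almost surely -> 1
```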