Clustering
Definition • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized • Inter-cluster distances are maximized
Applications • Group related documents for browsing • Group genes and proteins that have similar functionality • Group stocks with similar price fluctuations • Reduce the size of large data sets • Group users with similar buying mentalities
How many clusters? • [Figure: the same set of points grouped as two, four, and six clusters] • Clustering is ambiguous • There is no single correct or incorrect solution for clustering.
Challenges faced • Scalability • Ability to deal with different types of attributes • Noise & Outliers • Complex shapes and types of data • Incremental clustering and insensitivity to the order of input records • High dimensionality • Constraint-based clustering • Interpretability and usability
Types of Data • Data Matrix • n objects with p variables. • The structure is in the form of a relational table, or n × p matrix • Dissimilarity Matrix • Object-by-object structure. Stores a collection of proximities that are available for all pairs of the n objects. • d(i, j) is the dissimilarity between objects i and j. • d(i, j) = d(j, i) and d(i, i) = 0
Types of Data • Interval-scaled variables • Binary variables • Nominal variables • Ordinal variables • Ratio-scaled variables • Variables of mixed types
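The relationship between the two structures can be sketched in a few lines of Python. This is a minimal sketch, not part of the slides: the function name `dissimilarity_matrix` is hypothetical, and Euclidean distance is assumed as the dissimilarity measure, which suits interval-scaled variables only.

```python
import math

def dissimilarity_matrix(data):
    """Build the n x n dissimilarity matrix from an n x p data matrix.

    Assumption: d(i, j) is the Euclidean distance between rows i and j.
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist  # symmetry: d(i, j) = d(j, i)
    return d

# Three objects with two variables each (illustrative values).
matrix = dissimilarity_matrix([[0, 0], [3, 4], [6, 8]])
```

Only the upper triangle is computed; the lower triangle is filled in by symmetry.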
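Binary variables • A binary variable has only two states: 0 and 1 • The dissimilarity between two objects i and j described by binary variables is computed from a 2×2 contingency table (rows: object i, columns: object j) that counts the variables on which the two objects agree and disagree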
Dissimilarity between binary variables • d(Jack, Mary) = 0.33 • d(Jack, Jim) = 0.67 • d(Mary, Jim) = 0.75
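One common way to obtain values like these is the Jaccard-style dissimilarity for asymmetric binary variables, d = (r + s) / (q + r + s), where q counts variables that are 1 for both objects and r and s count the two kinds of mismatch. The sketch below assumes that measure. The three binary vectors are illustrative assumptions, not the slide's table (which is not reproduced here); they were chosen so that the formula gives the three values listed above.

```python
def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); 0-0 matches are ignored (asymmetric case)."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a only
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 in b only
    return (r + s) / (q + r + s)

# Hypothetical records: six binary attributes per person.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = round(asymmetric_binary_dissimilarity(jack, mary), 2)
d_jack_jim = round(asymmetric_binary_dissimilarity(jack, jim), 2)
d_mary_jim = round(asymmetric_binary_dissimilarity(mary, jim), 2)
```

The function divides by zero if the two objects share no 1-valued variable at all; a real implementation would need to decide how to treat that case.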
Other types of data • Ordinal • Similar to nominal variables, but the values are ordered in a meaningful sequence. • E.g., a rank: employees can be assistant, associate, or full • Ratio-scaled variables • Make a positive measurement on a nonlinear scale, such as an exponential scale. E.g., growth of bacteria, radioactive decay • Variables of mixed types
Types of clustering • Hierarchical clustering (e.g., BIRCH) • A set of nested clusters organized as a hierarchical tree • Partitional clustering (e.g., k-means, k-medoids) • A division of data objects into non-overlapping (distinct) subsets (i.e., clusters) such that each data object is in exactly one subset • Density-based (e.g., DBSCAN) • Based on density functions • Grid-based (e.g., STING) • Based on a multiple-level granularity structure • Model-based (e.g., SOM) • Hypothesize a model for each of the clusters and find the best fit of the data to the given model
Partitional Clustering • [Figure: original points and a partitional clustering of them]
Hierarchical Clustering • [Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]
Clustering Algorithms • Partitional • K-means • K-medoids • Hierarchical • Agglomerative • Divisive
K-Means Algorithm • Each cluster is represented by the mean value of the objects in the cluster • Input: a set of n objects and the number of clusters k • Output: a set of k clusters • Algorithm • Randomly select k samples and mark them as the initial cluster centers • Repeat • Assign (or reassign) each sample to the cluster to which it is most similar, based on the cluster's mean • Update each cluster's mean • Until there is no change
K-Means (Array) • Step 1: Randomly assign objects to k clusters • Step 2: Find the mean of each cluster • Step 3: Reassign each object to the cluster with the closest mean • Step 4: Go to Step 2; repeat until there is no change
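The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: samples are plain numbers, "most similar" means smallest absolute difference to the cluster mean, and the function name `kmeans` and its `seed` and `max_iter` parameters are hypothetical.

```python
import random

def kmeans(samples, k, seed=0, max_iter=100):
    """Return (clusters, means) for 1-D samples using the steps above."""
    rng = random.Random(seed)
    means = rng.sample(samples, k)  # k randomly chosen samples as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each sample to the cluster whose mean is closest.
        new_assignment = [
            min(range(k), key=lambda c: abs(x - means[c])) for x in samples
        ]
        if new_assignment == assignment:  # no change: stop
            break
        assignment = new_assignment
        # Update each cluster's mean (an empty cluster keeps its old mean).
        for c in range(k):
            members = [x for x, a in zip(samples, assignment) if a == c]
            if members:
                means[c] = sum(members) / len(members)
    clusters = [[x for x, a in zip(samples, assignment) if a == c] for c in range(k)]
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans(data, k=2)
```

Because the initial centers are random, different seeds can lead to different final partitions; the result is a local optimum, not necessarily the best one.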
Example 1 Given: {2,3,6,8,9,12,15,18,22} Assume k=3. • Solution: • Randomly partition the given data set: • K1 = 2,8,15 mean = 8.33 • K2 = 3,9,18 mean = 10 • K3 = 6,12,22 mean = 13.33 • Reassign • K1 = 2,3,6,8,9 mean = 5.6 • K2 = (empty) mean taken as 0 • K3 = 12,15,18,22 mean = 16.75
Reassign • K1 = 3,6,8,9 mean = 6.5 • K2 = 2 mean = 2 • K3 = 12,15,18,22 mean = 16.75 • Reassign • K1 = 6,8,9 mean = 7.67 • K2 = 2,3 mean = 2.5 • K3 = 12,15,18,22 mean = 16.75 • Reassign (12 is now closer to 7.67 than to 16.75, so it moves to K1) • K1 = 6,8,9,12 mean = 8.75 • K2 = 2,3 mean = 2.5 • K3 = 15,18,22 mean = 18.33 • Reassign • No change • STOP
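The iterations of this example can be replayed mechanically. The sketch below starts from the same initial partition and follows the same convention of treating an empty cluster's mean as 0; the function name `kmeans_from_partition` and the `empty_mean` parameter are hypothetical names introduced for illustration.

```python
def kmeans_from_partition(points, clusters, empty_mean=0.0, max_iter=100):
    """1-D k-means starting from a given partition; returns (final, history)."""
    history = [clusters]
    for _ in range(max_iter):
        # Mean of each cluster; an empty cluster gets `empty_mean`.
        means = [sum(c) / len(c) if c else empty_mean for c in clusters]
        new_clusters = [[] for _ in clusters]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            new_clusters[nearest].append(p)
        if new_clusters == clusters:  # reassignment changed nothing
            break
        clusters = new_clusters
        history.append(clusters)
    return clusters, history

points = [2, 3, 6, 8, 9, 12, 15, 18, 22]
initial = [[2, 8, 15], [3, 9, 18], [6, 12, 22]]
final, history = kmeans_from_partition(points, initial)
for step, partition in enumerate(history):
    print(step, partition)
```

Under these assumptions the run stops with 12 in K1: once K1's mean reaches 7.67, the value 12 lies 4.33 from it and 4.75 from K3's mean of 16.75.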
Example 2 Given {2,4,10,12,3,20,30,11,25} Assume k=2. Solution: K1 = 2,3,4,10,11,12 K2 = 20, 25, 30
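A quick way to check a proposed answer like this one is to confirm that it is stable, meaning every point is closer to its own cluster's mean than to the other mean. The short check below is a sketch with hypothetical variable names.

```python
k1 = [2, 3, 4, 10, 11, 12]
k2 = [20, 25, 30]

mean1 = sum(k1) / len(k1)
mean2 = sum(k2) / len(k2)

# The partition is a fixed point of k-means if no point would switch clusters.
stable = all(abs(x - mean1) < abs(x - mean2) for x in k1) and all(
    abs(x - mean2) < abs(x - mean1) for x in k2
)
print(mean1, mean2, stable)
```

Stability only shows that k-means would stop here; other starting partitions may converge to a different stable result.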
Advantages • K-means is relatively scalable and efficient in processing large data sets • The computational complexity of the algorithm is O(nkt) • n: the total number of objects • k: the number of clusters • t: the number of iterations • Normally k << n and t << n • Disadvantages • Can be applied only when the mean of a cluster is defined • Users need to specify k in advance • K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes • It is sensitive to noise and outlier data points, which can strongly influence the mean value
K-Means (graph) • Step 1: Choose k centroids at random • Step 2: Calculate the distance between each centroid and each object, using the Euclidean distance: $d(A,B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ • Step 3: Assign each object to the cluster whose centroid is nearest • Step 4: Recalculate the centroid of each cluster: $C = \left(\frac{x_1 + x_2 + \dots + x_n}{n}, \frac{y_1 + y_2 + \dots + y_n}{n}\right)$ • Go to Step 2 • Repeat until there is no change in the centroids
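For two-dimensional points, the four steps can be sketched as below. The function names (`euclidean`, `centroid`, `kmeans_2d`) and the sample points are hypothetical; the initial centroids are passed in by the caller rather than drawn at random, so that the run is reproducible.

```python
import math

def euclidean(a, b):
    """Euclidean distance between points a = (x1, y1) and b = (x2, y2)."""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

def centroid(points):
    """Mean x and mean y of a non-empty list of points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def kmeans_2d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Steps 2 and 3: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: euclidean(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Step 4: recompute centroids (an empty cluster keeps its old centroid).
        new_centroids = [
            centroid(c) if c else old for c, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # no change in centroids: stop
            break
        centroids = new_centroids
    return clusters, centroids

# Illustrative data: two well-separated pairs of points.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters, centroids = kmeans_2d(pts, [(0, 0), (10, 10)])
```

Comparing centroids for exact equality works here because the assignments stop changing; a tolerance on centroid movement is a common alternative for real-valued data.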
Example 1 • There are four types of medicine, each with two attributes, as shown below. Find a way to group them into 2 groups based on their features.
Solution • Plot the values on a graph. • Mark any k points as the initial centroids (here, k = 2)
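Calculate the Euclidean distance of each point from the centroids (rows: centroids 1 and 2; columns: points A, B, C, D): $D = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{pmatrix}$ • Based on minimum distance, we assign points to clusters: K1 = A; K2 = B, C, D • Calculate the new centroids. For K2: $C = \left(\frac{2+4+5}{3}, \frac{1+3+4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right)$ • K1 contains only A, so its centroid stays at A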
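This first iteration can be reproduced in a few lines. The coordinates of B, C and D are read off the centroid arithmetic above; A = (1, 1) is an inferred value that is consistent with the listed distances, not one stated in the text. The initial centroids are assumed to be placed on A and B, which matches the zeros in the distance matrix.

```python
import math

# B, C, D come from the centroid calculation; A = (1, 1) is inferred
# from the distance matrix and should be treated as an assumption.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [points["A"], points["B"]]

# Distance from each centroid (row) to each point (column), rounded to 2 places.
distances = [
    [round(math.dist(c, p), 2) for p in points.values()] for c in centroids
]

# Assign each point to the nearest centroid.
clusters = [[] for _ in centroids]
for name, p in points.items():
    nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
    clusters[nearest].append(name)

# New centroid of each cluster: mean of x and mean of y over its members.
new_centroids = [
    tuple(sum(points[n][d] for n in members) / len(members) for d in range(2))
    for members in clusters
]
print(distances, clusters, new_centroids)
```

`math.dist` requires Python 3.8 or later.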
Marking the new centroids • Continue the iteration, until there is no change in the centroids or clusters.
Example 2 • Use K-means algorithm to create two clusters. Given: