
Data Mining: Clustering



Presentation Transcript


  1. Data Mining: Clustering Developed by: Dr Eddie Ip Modified by: Dr Arif Ansari

  2. Clustering Outline Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms • Clustering Problem Overview • Clustering Techniques • Hierarchical Algorithms • Partitional Algorithms • Genetic Algorithm • Clustering Large Databases

  3. Clustering Examples • Segment a customer database based on similar buying patterns. • Group houses in a town into neighborhoods based on similar features. • Identify new plant species. • Identify similar Web usage patterns.

  4. Clustering Example

  5. Clustering Houses [Figure: the same houses clustered two ways: size based and geographic distance based]

  6. Clustering vs. Classification • No prior knowledge of: • the number of clusters • the meaning of the clusters • Unsupervised learning

  7. Clustering Issues • Outlier handling • Dynamic data • Interpreting results • Evaluating results • Number of clusters • Data to be used • Scalability

  8. Impact of Outliers on Clustering

  9. Clustering Problem • Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1,…,k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. • A Cluster, Kj, contains precisely those tuples mapped to it. • Unlike the classification problem, the clusters are not known a priori.

  10. Types of Clustering • Hierarchical – Nested set of clusters created. • Partitional – One set of clusters created. • Incremental – Each element handled one at a time. • Simultaneous – All elements handled together. • Overlapping/Non-overlapping

  11. Clustering Approaches [Figure: taxonomy of clustering approaches: Hierarchical (Agglomerative, Divisive), Partitional, Categorical, Large DB (Sampling, Compression)]

  12. Cluster Parameters

  13. Distance Between Clusters • Single Link: smallest distance between points • Complete Link: largest distance between points • Average Link: average distance between points • Centroid: distance between centroids
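To make the four measures concrete, here is a minimal sketch in Python (NumPy assumed; the function names and sample clusters are mine, not from the slides):

```python
import numpy as np

def single_link(A, B):
    """Smallest pairwise distance between clusters A and B."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_link(A, B):
    """Largest pairwise distance between clusters A and B."""
    return max(np.linalg.norm(a - b) for a in A for b in B)

def average_link(A, B):
    """Average pairwise distance between clusters A and B."""
    return np.mean([np.linalg.norm(a - b) for a in A for b in B])

def centroid_link(A, B):
    """Distance between the cluster centroids."""
    return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(single_link(A, B), complete_link(A, B),
      average_link(A, B), centroid_link(A, B))   # 3.0 5.0 4.0 4.0
```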

  14. Hierarchical Clustering • Clusters are created in levels, producing a set of clusters at each level. • Agglomerative • Initially each item is in its own cluster • Iteratively, clusters are merged together • Bottom Up • Divisive • Initially all items are in one cluster • Large clusters are successively divided • Top Down

  15. Hierarchical Algorithms • Single Link • MST Single Link • Complete Link • Average Link

  16. Dendrogram • Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. • Each level shows clusters for that level. • Leaf – individual clusters • Root – one cluster • A cluster at level i is the union of its children clusters at level i+1.
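One common way to draw a dendrogram is with SciPy's hierarchical-clustering utilities; a minimal sketch on made-up 1-D data (the points and labels are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Five 1-D points; each leaf of the dendrogram is one point,
# and the root is the single cluster containing all of them.
X = np.array([[1.0], [2.0], [9.0], [10.0], [25.0]])
Z = linkage(X, method="single")   # single-link agglomerative clustering
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()
```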

  17. Levels of Clustering

  18. Agglomerative Example [Figure: points A, B, C, D, E merged at distance thresholds 1 through 5, yielding a dendrogram over A, B, C, D, E]

  19. MST Example [Figure: minimum spanning tree over points A, B, C, D, E]

  20. Single Link • View all items with links (distances) between them. • Finds maximal connected components in this graph. • Two clusters are merged if there is at least one edge which connects them. • Uses threshold distances at each level. • Could be agglomerative or divisive.
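A sketch of one level of threshold-based single link: connect every pair of points closer than the threshold, then take the connected components of that graph. (The depth-first component walk is my own illustration, not code from the slides.)

```python
import numpy as np

def single_link_level(points, threshold):
    """Clusters = connected components of the graph whose edges join
    points closer than `threshold` (one level of single link)."""
    n = len(points)
    # adjacency list: an edge wherever the distance is within the threshold
    adj = [[j for j in range(n)
            if j != i and np.linalg.norm(points[i] - points[j]) <= threshold]
           for i in range(n)]
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:                      # depth-first walk of one component
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        clusters.append(comp)
    return clusters

pts = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
print(single_link_level(pts, threshold=1.5))   # [[0, 1, 2], [3, 4]]
```

Note that points 0 and 2 end up in the same cluster even though they are farther apart than the threshold: single link only needs a chain of close pairs, which is what produces its characteristic chaining effect.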

  21. Single Link Clustering

  22. Partitional Clustering • Nonhierarchical • Creates clusters in one step as opposed to several steps. • Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. • Usually deals with static sets.

  23. Partitional Algorithms • MST • Squared Error • K-Means • Nearest Neighbor • PAM • BEA • GA

  24. Squared Error • Minimize the squared error: for clusters K1,…,Kk with centers C1,…,Ck, se = Σj Σt∈Kj ||t - Cj||², i.e. the sum, over all clusters, of the squared distances from each item to its cluster center.
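Assuming that standard definition, a small sketch of the computation (data made up):

```python
import numpy as np

def squared_error(clusters):
    """se = sum over clusters of the squared distances to the centroid."""
    total = 0.0
    for K in clusters:
        center = K.mean(axis=0)
        total += ((K - center) ** 2).sum()
    return total

K1 = np.array([[2.0], [3.0], [4.0]])
K2 = np.array([[10.0], [12.0]])
print(squared_error([K1, K2]))   # 2.0 + 2.0 = 4.0
```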

  25. K-Means • Initial set of clusters randomly chosen. • Iteratively, items are moved among sets of clusters until the desired set is reached. • High degree of similarity among elements in a cluster is obtained. • Given a cluster Ki={ti1,ti2,…,tim}, the cluster mean is mi = (1/m)(ti1 + … + tim)

  26. K-Means Example • Given: {2,4,10,12,3,20,30,11,25}, k=2 • Randomly assign means: m1=3,m2=4 • K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16 • K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18 • K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6 • K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25 • Stop as the clusters with these means are the same.
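A minimal 1-D k-means sketch that reproduces the trace above, seeded with the slide's initial means of 3 and 4 (ties go to the lower-indexed cluster, and the sketch assumes no cluster ever empties out):

```python
def kmeans_1d(points, means, max_iters=100):
    """Plain k-means on 1-D data: assign each point to the nearest mean,
    recompute the means, and repeat until the means stop changing."""
    for _ in range(max_iters):
        clusters = [[] for _ in means]
        for p in points:
            # index of the nearest mean (ties go to the lower index)
            j = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[j].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:           # converged
            return clusters, means
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, means=[3, 4])
print(clusters, means)
# [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]
```

The printed result matches the final step of the trace: K1={2,3,4,10,11,12} with mean 7 and K2={20,25,30} with mean 25.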

  27. Nearest Neighbor • Items are iteratively merged into the existing clusters that are closest. • Incremental • Threshold, t, used to determine if items are added to existing clusters or a new cluster is created.
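A sketch of this incremental scheme with a hypothetical threshold t (the function name and data are mine):

```python
import numpy as np

def nearest_neighbor_clustering(points, t):
    """Incremental clustering: place each new item in the cluster of its
    nearest already-clustered neighbor, unless that neighbor is farther
    than threshold t, in which case the item starts a new cluster."""
    clusters = [[0]]                       # first item seeds the first cluster
    for i in range(1, len(points)):
        # nearest previously seen point
        dists = [np.linalg.norm(points[i] - points[j]) for j in range(i)]
        j = int(np.argmin(dists))
        if dists[j] <= t:
            for c in clusters:             # join the nearest point's cluster
                if j in c:
                    c.append(i)
                    break
        else:
            clusters.append([i])           # too far away: new cluster
    return clusters

pts = np.array([[0.0], [0.5], [5.0], [5.2], [20.0]])
print(nearest_neighbor_clustering(pts, t=1.0))   # [[0, 1], [2, 3], [4]]
```

Because the method is incremental, the result can depend on the order in which the items arrive.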

  28. PAM • Partitioning Around Medoids (PAM) (K-Medoids) • Handles outliers well. • Ordering of input does not impact results. • Does not scale well. • Each cluster represented by one item, called the medoid. • Initial set of k medoids randomly chosen.

  29. PAM

  30. PAM Cost Calculation • At each step in the algorithm, medoids are changed if the overall cost is improved. • Cjih – the change in cost for an item tj associated with swapping medoid ti with non-medoid th.
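A sketch of the total-cost view of a medoid swap: compute the cost (sum of distances from every item to its nearest medoid) before and after swapping ti with th; the per-item changes Cjih sum to this difference. (The helper names and data are mine.)

```python
import numpy as np

def total_cost(points, medoids):
    """Sum over all items of the distance to the nearest medoid."""
    return sum(min(np.linalg.norm(p - points[m]) for m in medoids)
               for p in points)

def swap_cost_change(points, medoids, i, h):
    """Total cost change from swapping medoid i with non-medoid h
    (equal to the sum of the per-item changes Cjih over all items j)."""
    new_medoids = [h if m == i else m for m in medoids]
    return total_cost(points, new_medoids) - total_cost(points, medoids)

pts = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
medoids = [0, 3]                                  # indices of current medoids
print(total_cost(pts, medoids))                   # 1 + 2 + 1 = 4.0
print(swap_cost_change(pts, medoids, i=0, h=1))   # -1.0: the swap helps
```

PAM accepts a swap exactly when this change is negative, which is why it is robust to outliers but expensive on large databases: every medoid/non-medoid pair must be evaluated at each step.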

  31. BEA • Bond Energy Algorithm • Database design (physical and logical) • Vertical fragmentation • Determine affinity (bond) between attributes based on common usage. • Algorithm outline: • Create affinity matrix • Convert to BOND matrix • Create regions of close bonding

  32. BEA Modified from [OV99]

  33. Genetic Algorithm Example • {A,B,C,D,E,F,G,H} • Randomly choose initial solution: {A,C,E} {B,F} {D,G,H} or 10101000, 01000100, 00010011 • Suppose crossover at point four and choose 1st and 3rd individuals: 10100011, 01000100, 00011000 • What should termination criteria be?
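The bit strings above encode cluster membership: bit j of string i is 1 when item j (of A through H) belongs to cluster i. A minimal sketch of the one-point crossover used in the example:

```python
def crossover(a, b, point):
    """One-point crossover on two bit-string cluster encodings."""
    return a[:point] + b[point:], b[:point] + a[point:]

# Items A..H; each string marks the members of one cluster.
solution = ["10101000", "01000100", "00010011"]

# Crossover at point four between the 1st and 3rd individuals,
# as in the example above.
child1, child3 = crossover(solution[0], solution[2], point=4)
print(child1, solution[1], child3)   # 10100011 01000100 00011000
```

Note that crossover on this encoding can leave an item in no cluster or in two, so a GA for clustering usually adds a repair or penalty step; choosing a termination criterion (for instance a fitness plateau or a fixed number of generations) is the question the slide leaves open.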

  34. Clustering Large Databases • Most clustering algorithms assume a large data structure which is memory resident. • Clustering may be performed first on a sample of the database then applied to the entire database. • Algorithms • BIRCH • DBSCAN • CURE

  35. Desired Features for Large Databases • One scan (or less) of DB • Online • Suspendable, stoppable, resumable • Incremental • Work with limited main memory • Different techniques to scan (e.g. sampling) • Process each tuple once

  36. BIRCH • Balanced Iterative Reducing and Clustering using Hierarchies • Incremental, hierarchical, one scan • Save clustering information in a tree • Each entry in the tree contains information about one cluster • New nodes inserted in closest entry in tree

  37. Clustering Feature • CF Triple: (N,LS,SS) • N: Number of points in the cluster • LS: Sum of the points in the cluster • SS: Sum of the squares of the points in the cluster • CF Tree • Balanced search tree • Node has a CF triple for each child • Leaf node represents a cluster and has a CF value for each subcluster in it. • Each subcluster has a maximum diameter (a threshold).
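A sketch of why the CF triple suffices: N, LS, and SS can be merged by simple addition, and summary statistics such as the centroid fall out of the triple without revisiting the raw points. (The class name is mine.)

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) for one (sub)cluster."""
    def __init__(self, points):
        pts = np.asarray(points, dtype=float)
        self.N = len(pts)
        self.LS = pts.sum(axis=0)        # linear sum of the points
        self.SS = (pts ** 2).sum()       # sum of squared coordinates

    def merge(self, other):
        """Merging two clusters just adds the two triples."""
        out = CF([])
        out.N = self.N + other.N
        out.LS = self.LS + other.LS
        out.SS = self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.N

a = CF([[1.0, 1.0], [2.0, 2.0]])
b = CF([[4.0, 4.0]])
m = a.merge(b)
print(m.N, m.LS, m.centroid())   # 3 [7. 7.] [2.333... 2.333...]
```

This additivity is what lets BIRCH cluster in one scan with limited memory: each incoming point only updates one triple along a path in the CF tree.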

  38. Improve Clusters

  39. DBSCAN • Density Based Spatial Clustering of Applications with Noise • Outliers will not affect the creation of clusters. • Input • MinPts – minimum number of points in a cluster • Eps – for each point in a cluster there must be another point in it less than this distance away.

  40. DBSCAN Density Concepts • Eps-neighborhood: Points within Eps distance of a point. • Core point: Eps-neighborhood dense enough (contains at least MinPts points) • Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (within Eps) and q is a core point. • Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
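For a quick feel of the two parameters, scikit-learn's DBSCAN takes exactly these inputs (eps and min_samples correspond to Eps and MinPts); the data here is made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (an outlier / noise point).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 5.2], [5.3, 5.1],
              [9.0, 0.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1] -- the label -1 marks noise
```

The isolated point is labeled noise rather than forced into a cluster, which is the sense in which outliers do not affect cluster creation.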

  41. Density Concepts

  42. CURE • Clustering Using Representatives • Use many points to represent a cluster instead of only one • Points will be well scattered
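A sketch of CURE's representative points: pick c well-scattered points from the cluster (a farthest-first pass is one way to scatter them), then shrink each toward the centroid by a fraction alpha. The values of c and alpha below are my illustrative choices, not the paper's defaults.

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points (farthest-first), then shrink each
    toward the centroid by the fraction alpha."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # start from the point farthest from the centroid
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(c, len(pts)):
        # next representative = point farthest from all reps chosen so far
        d = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(d)])
    # shrinking toward the centroid damps the influence of outliers
    return [r + alpha * (centroid - r) for r in reps]

cluster = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 5]])
for r in cure_representatives(cluster):
    print(r)
```

Using several shrunken representatives lets CURE capture non-spherical clusters (unlike a single centroid) while staying less outlier-sensitive than raw single link.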

  43. CURE Approach

  44. CURE for Large Databases

  45. Comparison of Clustering Techniques

  46. Example: Market Basket Data • Items frequently purchased together: Bread → PeanutButter • Uses: • Placement • Advertising • Sales • Coupons • Objective: increase sales and reduce costs

  47. Sampling Algorithm • Ds = sample of database D; • PL = large itemsets in Ds using smalls (a lowered support threshold); • C = PL ∪ BD⁻(PL), where BD⁻ is the negative border; • Count C in the database using s; • ML = large itemsets in BD⁻(PL); • If ML = ∅ then done; • else C = repeated application of BD⁻; • Count C in the database;
