CSE 634 Data Mining Concepts Techniques

1. Cluster Analysis CSE 634 Data Mining Concepts & Techniques, Cluster Analysis. Group 6: Nam, Kyu Han (105953722); Ju, Jae Won (106112650); Chung, Dong Hwan (105275323)

2. Cluster Analysis References
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 7, Sections 1-4). Morgan Kaufmann, 2005.
- Jiawei Han, Lecture Notes, University of Illinois at Urbana-Champaign, http://www-faculty.cs.uiuc.edu/~hanj/bk2/07.ppt
- Prof. Dr. J. Fürnkranz and Dr. G. Grieser, "Maschinelles Lernen and Data Mining" (3-11), http://www.ke.informatik.tu-darmstadt.de/lehre/ws05/mldm/clustering.pdf
- K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, "Constrained K-means Clustering with Background Knowledge", Proceedings of the 18th International Conference on Machine Learning, 2001 (pp. 577-584). Morgan Kaufmann, San Francisco, CA.

3. Cluster Analysis Data Mining Concepts and Techniques

4. Cluster Analysis What is Cluster Analysis?
- Cluster: a collection of data objects
  - Intraclass similarity: objects are similar to objects in the same cluster
  - Interclass dissimilarity: objects are dissimilar to objects in other clusters
- Cluster analysis: a statistical method for grouping a set of data objects into clusters
- A good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity
- Clustering is unsupervised classification
Notes: Explain cluster analysis. Data objects in a cluster have two properties, intraclass similarity and interclass dissimilarity, which a clustering method tries to improve. Examples of clusters: stars in a galaxy, planets in the solar system, kinds of rocks. Why is it unsupervised? Because it does not rely on predefined classes or labeled training data. It is learning by observation, not learning by examples.

5. Cluster Analysis Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Insurance: identify groups of motor insurance policy holders with a high average claim cost
- City planning: identify groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults

6. Cluster Analysis Data Representation
- Data matrix (two mode): N objects with p attributes
- Dissimilarity matrix (one mode): d(i,j) is the dissimilarity between objects i and j

7. Cluster Analysis Types of Data in Cluster Analysis
- Interval-Scaled Variables
- Binary Variables
- Nominal, Ordinal, and Ratio-Scaled Variables
- Variables of Mixed Types

8. Cluster Analysis Interval-Scaled Variables
- Continuous measurements of a roughly linear scale
- E.g. weight, height, temperature, etc.

9. Cluster Analysis Using Interval-Scaled Values
- Step 1: Standardize the data
  - To ensure all attributes have equal weight
  - To map different scales onto a single, uniform scale
  - Not always needed! Sometimes we require unequal weights for an attribute
- Step 2: Compute dissimilarity between records
  - Use Euclidean, Manhattan or Minkowski distance
Notes: Exception: height may be a more important attribute when the records describe basketball players.
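Step 1 can be sketched in a few lines. The slide does not say which standardization is used, so this sketch assumes a z-score that divides by the mean absolute deviation (one common choice, less sensitive to outliers than the standard deviation). The function name `standardize` and the sample heights are illustrative only.

```python
from statistics import mean


def standardize(values):
    """Return standardized scores for one attribute.

    Assumption: the scale is the mean absolute deviation,
    s = mean(|x - m|), rather than the standard deviation.
    """
    m = mean(values)
    s = mean(abs(v - m) for v in values)
    if s == 0:
        # All values are identical, so the attribute carries no information.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


# Hypothetical heights in centimeters.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z_heights = standardize(heights_cm)
```

After this step every attribute is unit-free, so an attribute measured in centimeters no longer dominates one measured in meters.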

10. Cluster Analysis Data Types and Distance Metrics
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Minkowski distance: d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q), where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, and q is a positive integer

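11. Cluster Analysis Data Types and Distance Metrics (Cont'd)
- If q = 1, d is the Manhattan distance: d(i,j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|
- If q = 2, d is the Euclidean distance: d(i,j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)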

12. Cluster Analysis Data Types and Distance Metrics (Cont'd)
- Properties
  - d(i,j) ≥ 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) ≤ d(i,k) + d(k,j)
- Can also use weighted distance, or other dissimilarity measures.
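The Minkowski family and its properties can be checked with a small sketch. It uses the standard definition, the q-th root of the sum of q-th powers of the coordinate differences. The function name `minkowski` and the three sample points are illustrative assumptions.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional data objects.
i = (1.0, 2.0)
j = (4.0, 6.0)
k = (1.0, 6.0)

manhattan = minkowski(i, j, 1)  # q = 1: |1 - 4| + |2 - 6|
euclidean = minkowski(i, j, 2)  # q = 2: sqrt(3^2 + 4^2)

# Triangle inequality: going through k is never shorter.
detour = minkowski(i, k, 2) + minkowski(k, j, 2)
```

Here `manhattan` is 7.0 and `euclidean` is 5.0, and the detour through `k` has length 7.0, consistent with d(i,j) ≤ d(i,k) + d(k,j).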

13. Cluster Analysis Binary Attributes
- A contingency table for binary data
- Simple matching coefficient (if the binary attribute is symmetric)
- Jaccard coefficient (if the binary attribute is asymmetric)

14. Cluster Analysis Dissimilarity between Binary Attributes Example
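The slide's contingency table and worked example did not survive in this transcript, so the following is a sketch with assumed notation and made-up records. It counts q (both 1), r (1 and 0), s (0 and 1) and t (both 0) over the attributes of two objects. The symmetric dissimilarity is (r + s) / (q + r + s + t); the asymmetric (Jaccard-based) one ignores the 0/0 matches: (r + s) / (q + r + s).

```python
def contingency(x, y):
    """Count the four agreement/disagreement cells for two binary vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def simple_matching_dissimilarity(x, y):
    """Symmetric binary attributes: 0/0 and 1/1 matches count equally."""
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)


def jaccard_dissimilarity(x, y):
    """Asymmetric binary attributes: 0/0 matches are ignored."""
    q, r, s, _ = contingency(x, y)
    if q + r + s == 0:
        return 0.0  # no positive value in either object
    return (r + s) / (q + r + s)


# Hypothetical records with six binary attributes each.
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
d_symmetric = simple_matching_dissimilarity(x, y)
d_asymmetric = jaccard_dissimilarity(x, y)
```

The two records differ in one attribute out of six, so the symmetric measure is 1/6. Dropping the three 0/0 matches raises the asymmetric measure to 1/3, which is why the choice of coefficient matters when a value of 1 is rare and informative.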

15. Cluster Analysis Nominal Attributes
- A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: Simple matching, d(i,j) = (p - m) / p
  - m: number of attributes that are the same for both records; p: total number of attributes
- Method 2: Rewrite the database and create a new binary attribute for each of the M states
  - For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
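Both methods are short enough to sketch. Method 1 is assumed here to use the usual simple-matching dissimilarity (p - m) / p, and Method 2 is the one-binary-attribute-per-state encoding described on the slide. Function names and sample records are illustrative.

```python
def nominal_dissimilarity(x, y):
    """Method 1: fraction of nominal attributes on which two records differ."""
    p = len(x)                               # total number of attributes
    m = sum(1 for a, b in zip(x, y) if a == b)  # number of matches
    return (p - m) / p


def to_binary_attributes(value, states):
    """Method 2: one new binary attribute per state of a nominal attribute."""
    return [1 if value == state else 0 for state in states]


# Hypothetical records with three nominal attributes.
rec_a = ("yellow", "round", "small")
rec_b = ("yellow", "square", "small")
d_nominal = nominal_dissimilarity(rec_a, rec_b)

colors = ["red", "yellow", "blue", "green"]
encoded = to_binary_attributes("yellow", colors)
```

The records agree on two of three attributes, giving a dissimilarity of 1/3, and `encoded` is `[0, 1, 0, 0]`. The binary encoding lets the binary-attribute measures be reused, at the cost of more columns.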

16. Cluster Analysis Ordinal Attributes
- An ordinal attribute can be discrete or continuous
- Order is important (e.g. rank)
- Can be treated like interval-scaled attributes:
  - Replace xif by its rank rif in {1, ..., Mf}
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th attribute by zif = (rif - 1) / (Mf - 1)
  - Compute the dissimilarity using methods for interval-scaled attributes
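A minimal sketch of the rank mapping, assuming the usual normalization z = (r - 1) / (M - 1) for an attribute with M ordered states. The function name and the medal states are made up for illustration.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] via its 1-based rank."""
    m_states = len(ordered_states)
    if m_states < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1  # ranks run from 1 to M
    return (rank - 1) / (m_states - 1)


# Hypothetical ordinal attribute, lowest state first.
medals = ["bronze", "silver", "gold"]
z_values = [ordinal_to_unit_interval(v, medals) for v in medals]

# Once mapped, any interval-scaled distance applies, e.g. absolute difference.
d_bronze_gold = abs(z_values[0] - z_values[2])
```

Mapping onto [0, 1] keeps an attribute with many states from outweighing one with few states when distances are summed across attributes.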

17. Cluster Analysis Ratio-Scaled Attributes
- Ratio-scaled attribute: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
- Methods:
  - Treat them like interval-scaled attributes: not a good choice because the scale may be distorted
  - Apply a logarithmic transformation, yif = log(xif)
  - Treat them as continuous ordinal data and treat their ranks as interval-scaled

18. Cluster Analysis Attributes of Mixed Types
- A database may contain all six types of attributes: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
- Use a weighted formula to combine their effects: d(i,j) = (sum over f of δij(f) dij(f)) / (sum over f of δij(f)), where δij(f) = 0 if xif or xjf is missing (or both are 0 for an asymmetric binary attribute f) and 1 otherwise
  - f is binary or nominal: dij(f) = 0 if xif = xjf, otherwise dij(f) = 1
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute ranks rif and zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled
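The combination can be sketched as an average of per-attribute dissimilarities over the attributes that are usable for a given pair. The assumptions, none of which are stated on the slide: an attribute is skipped when a value is missing (`None`) or when an asymmetric binary attribute is 0 in both records; interval distances are normalized by the attribute's range; ordinal values are already mapped to [0, 1]. The kind labels and sample records are illustrative.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average per-attribute dissimilarity over the usable attributes.

    kinds[f] is "nominal", "binary", "asymmetric_binary" or "interval";
    ranges[f] is max - min of attribute f (needed for "interval" only).
    """
    total = 0.0
    usable = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # indicator is 0: missing value
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # indicator is 0: a 0/0 match carries no information
        if kind in ("nominal", "binary", "asymmetric_binary"):
            d = 0.0 if a == b else 1.0
        elif kind == "interval":
            d = abs(a - b) / ranges[f]
        else:
            raise ValueError("unknown attribute kind: %s" % kind)
        total += d
        usable += 1
    return total / usable if usable else 0.0


# Hypothetical records: color, height in cm, a rare flag, an ordinal z-value.
kinds = ["nominal", "interval", "asymmetric_binary", "interval"]
ranges = {1: 40.0, 3: 1.0}
rec_1 = ("red", 170.0, 0, 0.5)
rec_2 = ("blue", 180.0, 0, 1.0)
d_mixed = mixed_dissimilarity(rec_1, rec_2, kinds, ranges)
```

For these records the flag attribute is skipped, and the remaining contributions are 1 (color), 0.25 (height) and 0.5 (ordinal), so the result is 1.75 / 3.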

19. Cluster Analysis Data Mining Concepts and Techniques

20. Cluster Analysis






















42. Cluster Analysis Data Mining Concepts and Techniques

43. Cluster Analysis Introduction
- Clustering is an unsupervised method of data analysis
  - Data instances are grouped according to some notion of similarity
  - Access only to the set of features describing each object
  - No information as to where each instance should be placed within the partition
- However, there might be background knowledge about the domain or data set that could be useful to the algorithm
- In this paper (Wagstaff et al., 2001) the authors try to integrate this background knowledge into clustering algorithms

44. Cluster Analysis K-means Clustering
- Used to partition a data set into k groups
- Groups instances into k groups based on their attributes
- High intra-cluster similarity; low inter-cluster similarity
- Cluster similarity is measured with respect to the mean value of the objects in the cluster
- How does K-means work?
  - First, select K random instances from the data as the initial cluster centers
  - Second, assign each instance to its closest (most similar) cluster center
  - Third, update each cluster center to the mean of its constituent instances
  - Repeat steps two and three until there is no further change in the assignment of instances to clusters
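The three steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes squared Euclidean distance, a fixed random seed for reproducibility, and a `max_iter` safety cap. The names `kmeans` and `sq_dist` and the sample points are made up for the example.

```python
import random


def sq_dist(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: k random instances
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no instance changed cluster: converged
        assignment = new_assignment
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical 2-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, labels = kmeans(points, k=2)
```

The result depends on the random initial centers, which is why practical use typically runs the algorithm several times and keeps the best partition.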

45. Cluster Analysis Constrained K-means Clustering
- Two pair-wise constraints
  - Must-link: constraints which specify that two instances have to be in the same cluster
  - Cannot-link: constraints which specify that two instances must not be placed in the same cluster
- When using a set of constraints we have to take the transitive closure
- Constraints may be derived from
  - Partially labeled data
  - Background knowledge about the domain or data set

46. Cluster Analysis Constrained Algorithm
- First, select K random instances from the data as the initial cluster centers
- Second, assign each instance to its closest (most similar) cluster center such that VIOLATE-CONSTRAINT(I, K, M, C) is false. If no such cluster exists, fail
- Third, update each cluster center to the mean of its constituent instances
- Repeat steps two and three until there is no further change in the assignment of instances to clusters
- VIOLATE-CONSTRAINT(instance I, cluster K, must-link constraints M, cannot-link constraints C)
  - For each (i, i=) in M, where i= is a must-link partner of instance i: if i= is not in K, return true
  - For each (i, i≠) in C, where i≠ is a cannot-link partner of instance i: if i≠ is in K, return true
  - Otherwise return false
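A sketch of the constrained procedure, written as an illustration rather than as the authors' code. Assumptions beyond the slide: constraints are pairs of instance indices; a must-link partner is checked only if it has already been assigned in the current pass; the transitive closure of the constraints is not computed here; failure is signalled with `ValueError`. All names and data are hypothetical.

```python
import random


def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def violate_constraint(i, cluster, assignment, must_link, cannot_link):
    """True if putting instance i into `cluster` breaks a constraint."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            # Assumption: unassigned partners cannot be violated yet.
            if assignment[other] is not None and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if assignment[other] == cluster:
                return True
    return False


def cop_kmeans(points, k, must_link, cannot_link, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    previous = None
    for _ in range(max_iter):
        assignment = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first valid one.
            for c in sorted(range(k), key=lambda c: sq_dist(p, centers[c])):
                if not violate_constraint(i, c, assignment, must_link, cannot_link):
                    assignment[i] = c
                    break
            else:
                raise ValueError("no valid cluster for instance %d" % i)
        if assignment == previous:
            break
        previous = assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (8.0, 8.0)]
must = [(0, 3)]    # instances 0 and 3 must share a cluster
cannot = [(0, 1)]  # instances 0 and 1 must be separated
centers, labels = cop_kmeans(points, 2, must, cannot)
```

Because instances are assigned greedily in order, an early choice can leave a later instance with no valid cluster, which is the order sensitivity that the fail step exposes.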

47. Cluster Analysis Experimental Results on GPS Lane Finding
- Large databases of digital road maps are available
  - These maps contain only coarse information about the location of the road
  - By refining maps down to the lane level we can enable a host of more sophisticated applications such as lane departure detection
- Approach
  - Based on the observation that drivers tend to drive within lane boundaries
  - Lanes should correspond to "densely traveled" regions, in contrast to the lane boundaries
  - Possible to collect data about the location of cars and then cluster that data to automatically determine where the individual lanes are located

48. Cluster Analysis GPS Lane Finding (cont'd)
- Collect data about the location of cars as they drive along a given road
  - Collect data once per second from several drivers using GPS receivers affixed to the top of their vehicles
  - Each data instance has two features: 1. Distance along the road segment 2. Perpendicular offset from the road centerline
  - For evaluation purposes drivers were asked to indicate which lane they occupied and any lane changes

49. Cluster Analysis GPS Lane Finding (cont'd)
- For the problem of automatic lane detection, two domain-specific heuristics are used for generating constraints
  - Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane.
  - Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane. If two points are separated by at least four meters, then we generate a constraint that will prevent those two points from being placed in the same cluster.
- To better analyze performance in the domain, the authors modified the cluster center representation
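The two heuristics can be sketched as a constraint generator. The record layout `(trace_id, distance_along, offset)` is an assumption made for this example, as are the function name and the sample points. Must-links chain consecutive points of one trace, which relies on the transitive closure to link the whole trace; cannot-links use the four-meter threshold from the slide.

```python
from itertools import combinations


def generate_constraints(points, max_separation=4.0):
    """Build must-link and cannot-link index pairs from GPS points.

    Each point is (trace_id, distance_along_road, perpendicular_offset).
    Assumption: a trace contains no lane change.
    """
    must_link = []
    cannot_link = []
    last_index = {}  # most recent point index seen for each trace
    for idx, (trace_id, _along, _offset) in enumerate(points):
        if trace_id in last_index:
            # Trace contiguity: link to the previous point of the same pass.
            must_link.append((last_index[trace_id], idx))
        last_index[trace_id] = idx
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        # Maximum separation: offsets too far apart to share a lane.
        if abs(p[2] - q[2]) >= max_separation:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Hypothetical points from two passes over the same road segment.
gps_points = [
    ("t1", 0.0, -2.0),
    ("t1", 10.0, -1.9),
    ("t2", 0.0, 2.5),
    ("t2", 10.0, 2.6),
]
must, cannot = generate_constraints(gps_points)
```

The pairwise loop is quadratic in the number of points, which is fine for a sketch but would need spatial indexing or sampling on real traces collected once per second.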

50. Cluster Analysis GPS Lane Finding (cont'd)

51. Cluster Analysis Conclusion
- Measurable improvement in accuracy
- The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive
  - If a poor decision is made early on, the algorithm may later encounter an instance i that has no possible valid cluster
  - Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that i could then be validly assigned to a cluster
- Could be extended to hierarchical algorithms
