1. Cluster Analysis CSE 634 Data Mining Concepts & Techniques Cluster Analysis
Group 6
Nam, Kyu Han (105953722)
Ju, Jae Won (106112650)
Chung, Dong Hwan (105275323)
2. Cluster Analysis References Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 7, Sections 1-4). Morgan Kaufmann, 2005
Jiawei Han, Lecture Notes, University of Illinois at Urbana-Champaign, http://www-faculty.cs.uiuc.edu/~hanj/bk2/07.ppt
Prof. Dr. J. Fürnkranz and Dr. G. Grieser, “Maschinelles Lernen and Data Mining” (3-11)
K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, “Constrained K-means Clustering with Background Knowledge”, Proceedings of the 18th International Conference on Machine Learning, 2001 (pp. 577-584). Morgan Kaufmann, San Francisco, CA.
3. Cluster Analysis Data Mining Concepts and Techniques
4. Cluster Analysis What is Cluster Analysis? Cluster: a collection of data objects
(Intraclass similarity) - Objects are similar to other objects in the same cluster
(Interclass dissimilarity) - Objects are dissimilar to objects in other clusters
Cluster analysis
Statistical method for grouping a set of data objects into clusters
A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity
Clustering is unsupervised classification
A clustering is judged on two properties: intraclass similarity and interclass dissimilarity. A clustering method tries to improve both.
Examples of clusters: stars in a galaxy, planets in the solar system, kinds of rocks
Why is it unsupervised? Because it does not rely on predefined classes or labeled training data. It is learning by observation, not learning by examples.
5. Cluster Analysis Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
6. Cluster Analysis Data Representation
Data matrix (two mode)
N objects with p attributes
Dissimilarity matrix (one mode)
d(i,j): dissimilarity between objects i and j
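The two matrices themselves did not survive in the slide text. As a sketch of the usual layout, assuming n objects (N on the slide), p attributes, and entries $x_{if}$ as in the later slides, the data matrix X and the dissimilarity matrix D would look roughly like this. Only the lower triangle of D is shown because d(i,j) = d(j,i) and d(i,i) = 0.

```latex
\[
X = \begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix},
\qquad
D = \begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
```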
7. Cluster Analysis Types of Data in Cluster Analysis
Interval-Scaled Variables
Binary Variables
Nominal, Ordinal, and Ratio-Scaled Variables
Variables of Mixed Types
8. Cluster Analysis Interval-Scaled Variables Continuous measurements on a roughly linear scale
E.g. weight, height, temperature, etc.
9. Cluster Analysis Using Interval-Scaled Values
Step 1: Standardize the data
To ensure they all have equal weight
To match up different scales into a uniform, single scale
Not always needed! Sometimes we require unequal weights for an attribute
Exception: height may be a more important attribute when clustering basketball players
Step 2: Compute dissimilarity between records
Use Euclidean, Manhattan or Minkowski distance
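The slide does not say which standardization formula is meant. A minimal sketch of Step 1, assuming a z-score based on the mean absolute deviation (one common choice; the function name `standardize` and the sample heights are illustrative only):

```python
def standardize(column):
    """Return standardized values of one attribute.

    Assumption: uses the mean absolute deviation as the scale,
    which is less sensitive to outliers than the standard deviation.
    """
    n = len(column)
    mean = sum(column) / n
    mad = sum(abs(x - mean) for x in column) / n
    if mad == 0:
        # All values are identical; the attribute carries no information.
        return [0.0 for _ in column]
    return [(x - mean) / mad for x in column]


# Illustrative data: heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0]
z_heights = standardize(heights_cm)
```

After this step every attribute is expressed in unit-free deviations from its own mean, so no attribute dominates the distance merely because of its measurement unit.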
10. Cluster Analysis Data Types and Distance Metrics
Distances are normally used to measure the similarity or dissimilarity between two data objects
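Minkowski distance: $d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer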
11. Cluster Analysis Data Types and Distance Metrics (Cont’d)
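If q = 1, d is Manhattan distance: $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
If q = 2, d is Euclidean distance: $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$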
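The relationship between the three distances can be sketched in a few lines. The function names here are illustrative, not taken from the slides:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def manhattan(x, y):
    # Special case q = 1.
    return minkowski(x, y, 1)


def euclidean(x, y):
    # Special case q = 2.
    return minkowski(x, y, 2)


d1 = manhattan((0.0, 0.0), (3.0, 4.0))
d2 = euclidean((0.0, 0.0), (3.0, 4.0))
```

For the points (0, 0) and (3, 4), the Manhattan distance sums the coordinate differences while the Euclidean distance is the straight-line length.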
12. Cluster Analysis Data Types and Distance Metrics (Cont’d)
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Can also use weighted distance, or other dissimilarity measures.
13. Cluster Analysis Binary Attributes A contingency table for binary data
Simple matching coefficient (if the binary attribute is symmetric):
Jaccard coefficient (if the binary attribute is asymmetric):
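The contingency table and both formulas did not survive in the slide text. A minimal sketch, assuming the usual convention: a counts attributes that are 1 in both objects, b counts 1 in the first and 0 in the second, c counts 0 in the first and 1 in the second, and d counts 0 in both. Under that assumption the symmetric dissimilarity is (b + c) / (a + b + c + d), and the asymmetric one drops the 0/0 matches to give (b + c) / (a + b + c). The function names and sample vectors are illustrative.

```python
def contingency(x, y):
    """Count the four agreement/disagreement cells for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching_dissimilarity(x, y):
    # Symmetric attributes: 0/0 matches count as agreement.
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    # Asymmetric attributes: 0/0 matches are ignored.
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)


x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
smd = simple_matching_dissimilarity(x, y)
jac = jaccard_dissimilarity(x, y)
```

The two objects disagree on one attribute out of six, so the symmetric measure is small, while the asymmetric measure is larger because the three shared zeros are not counted as evidence of similarity.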
14. Cluster Analysis Dissimilarity between Binary Attributes Example
15. Cluster Analysis Nominal Attributes A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
$d(i,j) = \frac{p - m}{p}$, where m is the number of attributes on which both records match and p is the total number of attributes
Method 2: Rewrite the database, creating a new binary attribute for each of the M nominal states
For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
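Both methods can be sketched briefly. This assumes the simple-matching dissimilarity is (p - m) / p, with m and p as defined on the slide; the function names and sample records are illustrative.

```python
def nominal_dissimilarity(x, y):
    """Method 1: fraction of attributes on which two records differ."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]


record_a = ["red", "small", "round"]
record_b = ["red", "large", "round"]
d_ab = nominal_dissimilarity(record_a, record_b)

colors = ["red", "yellow", "blue", "green"]
yellow_encoded = one_hot("yellow", colors)
```

After the rewrite in Method 2, the binary-attribute measures can be applied to the encoded records.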
16. Cluster Analysis Ordinal Attributes An ordinal attribute can be discrete or continuous
Order is important (e.g., rank)
Can be treated like interval-scaled attributes:
Replace each $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th attribute by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Compute the dissimilarity using methods for interval-scaled attributes
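A minimal sketch of the rank mapping, assuming ranks start at 1 and the attribute has M_f ordered states, so that each value maps to (rank - 1) / (M_f - 1). The function name and the medal example are illustrative.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via their ranks."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    # Rank 1 is the lowest state, rank m_f the highest.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


medals = ["bronze", "gold", "silver"]
z = ordinal_to_unit_interval(medals, ["bronze", "silver", "gold"])
```

The resulting values can be passed to any interval-scaled distance, such as Manhattan or Euclidean.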
17. Cluster Analysis Ratio-Scaled Attributes Ratio-scaled attribute: a positive measurement on a nonlinear scale, approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Treat them like interval-scaled attributes: not a good choice, because the scale may be distorted
Apply a logarithmic transformation:
$y_{if} = \log(x_{if})$
Treat them as continuous ordinal data and treat their ranks as interval-scaled.
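A short sketch of the logarithmic transformation, assuming the natural logarithm (the slide does not specify the base; any base only rescales the result). The sample values are illustrative.

```python
import math


def log_transform(values):
    """Apply y = log(x) to positive ratio-scaled measurements."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]


# Values that double at each step become evenly spaced after the transform.
growth = [2.0, 4.0, 8.0]
y = log_transform(growth)
```

Because exponential growth turns into equal steps, the transformed values behave like an interval-scaled attribute.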
18. Cluster Analysis Attributes of Mixed Types A database may contain all six types of attributes:
symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio-scaled.
Use a weighted formula to combine their effects; the contribution $d_{ij}^{(f)}$ of attribute f depends on its type.
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled:
compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$,
and treat $z_{if}$ as interval-scaled
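The weighted formula itself is missing from the slide text. A sketch under stated assumptions: each attribute contributes a dissimilarity in [0, 1], and an attribute is skipped (weight 0) when a value is missing or when an asymmetric binary attribute is 0 in both objects. The result is the average over the attributes that were not skipped. The `kinds` labels, the `ranges` argument, and the sample records are hypothetical; ordinal and ratio-scaled attributes are assumed to have been converted beforehand and passed as "interval" with range 1.0.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average per-attribute dissimilarity over the usable attributes.

    kinds[f]  : "nominal", "symmetric_binary", "asymmetric_binary" or "interval"
    ranges[f] : max - min of attribute f (used only for "interval")
    """
    total, usable = 0.0, 0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # missing value: attribute gets weight 0
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue  # shared absence carries no information
        if kind == "interval":
            d = abs(xf - yf) / ranges[f]  # normalized distance
        else:
            d = 0.0 if xf == yf else 1.0  # binary or nominal mismatch
        total += d
        usable += 1
    return total / usable if usable else 0.0


kinds = ("nominal", "asymmetric_binary", "interval")
ranges = (None, None, 40.0)
obj_i = ("red", 1, 170.0)
obj_j = ("blue", 0, 180.0)
d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
```

Here the nominal and binary attributes each contribute 1 and the interval attribute contributes 10 / 40, so the combined dissimilarity is their mean.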
19. Cluster Analysis Data Mining Concepts and Techniques
42. Cluster Analysis Data Mining Concepts and Techniques
43. Cluster Analysis Introduction Clustering is an unsupervised method of data analysis
Data instances are grouped according to some notion of similarity
The algorithm has access only to the set of features describing each object
It is given no information about where each instance should be placed within the partition
However, there may be background knowledge about the domain or data set that could be useful to the algorithm
In the paper by Wagstaff et al. (2001), the authors try to integrate this background knowledge into clustering algorithms.
44. Cluster Analysis K-means Clustering Used to partition a data set into k groups
Group instances based on attributes into k groups
High intra-cluster similarity; Low inter-cluster similarity
Cluster similarity is measured with respect to the mean value of the objects in the cluster.
How does K-means work?
First, select K random instances from the data as the initial cluster centers
Second, assign each instance to its closest (most similar) cluster center
Third, update each cluster center to the mean of its constituent instances
Repeat steps two and three until there is no further change in the assignment of instances to clusters
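The three steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the `initial_centers` parameter is an addition for reproducibility (omit it to pick K random instances as the slide describes), an empty cluster simply keeps its previous center, and the sample points are made up.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, initial_centers=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: K random instances serve as the initial cluster centers.
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each instance to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change in assignment: converged
        assignment = new_assignment
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
assignment, centers = kmeans(points, k=2, initial_centers=[points[0], points[3]])
```

With one starting center in each of the two visibly separate groups, the first three points end up in one cluster and the last three in the other. With random starting centers the result can depend on the draw, which is why K-means is often run several times.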
45. Cluster Analysis Constrained K-means Clustering Two pair-wise constraints
Must-link: constraints which specify that two instances have to be in the same cluster
Cannot-link: constraints which specify that two instances must not be placed in the same cluster
When using a set of constraints we have to take the transitive closure
Constraints may be derived from
Partially labeled data
Background knowledge about the domain or data set
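Taking the transitive closure means that constraints implied by the given ones are made explicit: if a must link with b and b with c, then a must link with c, and a cannot-link between two instances extends to everything must-linked to either of them. A sketch using a small union-find structure, assuming instances are identified by integer indices (the function names are illustrative):

```python
def must_link_components(n, must_link):
    """Group instance indices 0..n-1 into must-link components."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]


def close_constraints(n, must_link, cannot_link):
    """Return the transitively closed must-link and cannot-link sets."""
    comp = must_link_components(n, must_link)
    groups = {}
    for i, c in enumerate(comp):
        groups.setdefault(c, []).append(i)
    closed_ml = {(a, b) for g in groups.values() for a in g for b in g if a < b}
    closed_cl = set()
    for a, b in cannot_link:
        if comp[a] == comp[b]:
            raise ValueError("constraints are inconsistent")
        # Every member of a's component cannot link with every member of b's.
        for u in groups[comp[a]]:
            for v in groups[comp[b]]:
                closed_cl.add((min(u, v), max(u, v)))
    return closed_ml, closed_cl


ml, cl = close_constraints(4, must_link=[(0, 1), (1, 2)], cannot_link=[(2, 3)])
```

In the example, the must-link pair (0, 2) and the cannot-link pairs (0, 3) and (1, 3) are not given but follow from the stated constraints.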
46. Cluster Analysis Constrained Algorithm First, select K random instances from the data – initial cluster centers
Second, each instance is assigned to its closest (most similar) cluster center such that VIOLATE-CONSTRAINT(I, K, M, C) is false. If no such cluster exists, fail
Third, each cluster center is updated to the mean of its constituent instances
Repeat steps two and three until there is no further change in the assignment of instances to clusters
VIOLATE-CONSTRAINT(instance I, cluster K, must-link constraints M, cannot-link constraints C):
For each must-link pair (I, J) in M: if J is not in K, return true
For each cannot-link pair (I, J) in C: if J is in K, return true
Otherwise return false
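The constraint check and the constrained assignment step can be sketched as follows. One assumption is labeled in the code: a must-link partner that has not been assigned yet is not treated as a violation, since otherwise the first instance of every must-link pair could never be placed. Function names and sample data are illustrative.

```python
def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """Return True if putting instance i into `cluster` breaks a constraint."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            # Assumption: only partners that are already assigned are checked.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if assignment.get(other) == cluster:
                return True
    return False


def assign_with_constraints(points, centers, must_link, cannot_link):
    """One assignment pass: nearest center that violates no constraint."""
    assignment = {}
    for i, p in enumerate(points):
        by_distance = sorted(
            range(len(centers)),
            key=lambda c: sum((u - v) ** 2 for u, v in zip(p, centers[c])),
        )
        for c in by_distance:
            if not violates_constraints(i, c, assignment, must_link, cannot_link):
                assignment[i] = c
                break
        else:
            raise RuntimeError(f"no valid cluster for instance {i}")
    return assignment


points = [(0.0,), (0.4,), (1.0,)]
centers = [(0.0,), (1.0,)]
result = assign_with_constraints(points, centers, must_link=[(1, 2)], cannot_link=[(0, 1)])
```

Instance 1 is closer to the first center, but its cannot-link with instance 0 pushes it to the second one. Because instances are processed in order, a different order can change which instance is displaced.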
47. Cluster Analysis Experimental Results on GPS Lane Finding Large database of digital road maps available
These maps contain only coarse information about the location of the road
By refining maps down to the lane level we can enable a host of more sophisticated applications such as lane departure detection
Based on the observation that drivers tend to drive within lane boundaries
Lanes should correspond to “densely traveled” regions in contrast to the lane boundaries
Possible to collect data about the location of cars and then cluster that data to automatically determine where the individual lanes are located
48. Cluster Analysis GPS Lane Finding (cont’d) Collect data about the location of cars as they drive along a given road
Collect data once per second from several drivers using GPS receivers affixed to the tops of their vehicles
Each data instance has two features:
1. Distance along the road segment
2. Perpendicular offset from the road centerline
For evaluation purposes drivers were asked to indicate which lane they occupied and any lane changes
49. Cluster Analysis GPS Lane Finding (cont’d) For the problem of automatic lane detection, two domain-specific heuristics are used to generate constraints:
Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane.
Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane. If two points are separated by at least four meters, then we generate a constraint that will prevent those two points from being placed in the same cluster.
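To better analyze performance in this domain, the authors modified the cluster center representation: because lanes are long and narrow, each lane cluster is represented by a line segment parallel to the centerline rather than by a single point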
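The two heuristics for generating constraints can be sketched as below. The record layout (dictionary keys `trace`, `along`, `offset`) and the sample values are hypothetical; the four-meter threshold is the one stated on the slide, and lane changes are assumed to be absent.

```python
from itertools import combinations


def generate_constraints(instances, max_separation=4.0):
    """Derive must-link and cannot-link pairs from GPS points.

    Each instance is a dict with hypothetical keys:
      "trace"  : id of one vehicle pass over the road segment
      "along"  : distance along the road segment (meters)
      "offset" : perpendicular offset from the centerline (meters)
    """
    must_link, cannot_link = [], []
    for (i, a), (j, b) in combinations(enumerate(instances), 2):
        if a["trace"] == b["trace"]:
            # Trace contiguity: same pass, no lane change assumed.
            must_link.append((i, j))
        elif abs(a["offset"] - b["offset"]) >= max_separation:
            # Maximum separation: too far apart to share a lane.
            cannot_link.append((i, j))
    return must_link, cannot_link


instances = [
    {"trace": "A", "along": 0.0, "offset": -1.8},
    {"trace": "A", "along": 10.0, "offset": -1.7},
    {"trace": "B", "along": 0.0, "offset": 2.4},
    {"trace": "C", "along": 5.0, "offset": -1.0},
]
must_link, cannot_link = generate_constraints(instances)
```

The two points from trace A are must-linked, and each of them is cannot-linked with the point from trace B because their offsets differ by more than four meters. Trace C is close enough to both that no constraint is generated for it.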
50. Cluster Analysis GPS Lane Finding (cont’d)
51. Cluster Analysis Conclusion Measurable improvement in accuracy
The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive.
If a poor decision is made early on, the algorithm may later encounter an instance i that has no possible valid cluster
Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that i could then be validly assigned to a cluster.
Could be extended to hierarchical algorithms