1. Data Mining--Clustering Prof. Sin-Min Lee
12. AprioriTid Algorithm After the first pass, the database itself is not used at all for counting the support of candidate itemsets.
The candidate itemsets are generated the same way as in Apriori algorithm.
Instead, another set C is generated; each of its entries pairs a transaction's TID with the potentially large (candidate) itemsets present in that transaction. This set is used to count the support of each candidate itemset.
The advantage is that the number of entries in C may be smaller than the number of transactions in the database, especially in the later passes.
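The TID-list idea above can be sketched as follows. This is a minimal illustration, not the original paper's implementation; the names `c_bar` and `count_with_cbar` are made up for this sketch, and it uses the standard AprioriTid observation that a candidate k-itemset is contained in a transaction exactly when the two (k-1)-itemsets that generated it were present in that transaction on the previous pass.

```python
# Hedged sketch of AprioriTid's pass-k support counting: instead of
# rescanning the raw database, each entry of c_bar pairs a TID with the
# candidate itemsets from the previous pass found in that transaction.

def count_with_cbar(c_bar, candidates):
    """c_bar: list of (tid, set of frozensets) from the previous pass.
    candidates: list of frozensets, all of size k.
    Returns (support_counts, new_c_bar) for the next pass."""
    support = {c: 0 for c in candidates}
    new_c_bar = []
    for tid, prev_itemsets in c_bar:
        present = set()
        for c in candidates:
            # c is in the transaction iff both (k-1)-subsets used to
            # generate it were present on the previous pass
            items = sorted(c)
            sub1 = frozenset(items[:-1])               # c minus its last item
            sub2 = frozenset(items[:-2] + items[-1:])  # c minus its 2nd-to-last
            if sub1 in prev_itemsets and sub2 in prev_itemsets:
                support[c] += 1
                present.add(c)
        if present:  # entries can shrink or vanish in later passes
            new_c_bar.append((tid, present))
    return support, new_c_bar
```

Because entries with no surviving candidates are dropped, `new_c_bar` shrinks in later passes, which is exactly the advantage described above.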
13. Apriori Algorithm Candidate itemsets are generated using only the large itemsets of the previous pass without considering the transactions in the database.
The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by one.
Each generated itemset that has a subset which is not large is deleted; the remaining itemsets are the candidates.
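The join-then-prune candidate generation described above can be sketched in a few lines; this is an illustrative version (the function name `apriori_gen` follows common usage, but the code is a sketch rather than the canonical implementation):

```python
from itertools import combinations

# Sketch of Apriori candidate generation: join L_{k-1} with itself,
# then prune any candidate that has an infrequent (k-1)-subset.

def apriori_gen(large_prev):
    """large_prev: set of frozensets, all of size k-1. Returns candidate k-itemsets."""
    large_list = sorted(sorted(s) for s in large_prev)
    candidates = set()
    for a, b in combinations(large_list, 2):
        # join step: merge itemsets that agree on all but the last item
        if a[:-1] == b[:-1]:
            candidates.add(frozenset(a) | frozenset(b))
    # prune step: every (k-1)-subset of a candidate must itself be large
    k = len(large_list[0]) + 1 if large_list else 0
    return {c for c in candidates
            if all(frozenset(s) in large_prev
                   for s in combinations(sorted(c), k - 1))}
```

For example, from the large 2-itemsets {1,2}, {1,3}, {2,3}, {1,4}, the join step proposes {1,2,3}, {1,2,4}, and {1,3,4}, and the prune step keeps only {1,2,3}, since {2,4} and {3,4} are not large.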
28. Clustering Group data into clusters
Objects within the same cluster are similar to one another
Objects are dissimilar to those in other clusters
Unsupervised learning: no predefined classes
30. What is Cluster Analysis?
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
to get insight into data
as a preprocessing step
31. What Is A Good Clustering? High intra-class similarity and low inter-class similarity
Depending on the similarity measure
The ability to discover some or all of the hidden patterns
32. General Applications of Clustering Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns
33. Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults
34. What Is A Good Clustering? A good clustering method will produce high-quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
35. Data Structures in Clustering
Data matrix (two modes): n objects described by p variables, i.e. an n × p table
Dissimilarity matrix (one mode): an n × n table of pairwise distances d(i, j)
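The relationship between the two structures can be shown with a small sketch that converts a data matrix into a dissimilarity matrix; Euclidean distance is used here only as one example of a metric:

```python
import math

# Illustrative conversion from the two-mode data matrix (objects x variables)
# to the one-mode dissimilarity matrix of pairwise distances d(i, j).

def dissimilarity_matrix(data):
    """data: list of equal-length numeric rows (one row per object)."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist  # symmetric: d(i, j) = d(j, i)
    return d
```

Only the lower triangle needs to be stored in practice, which is why the dissimilarity matrix is described as one-mode.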
36. Measuring Similarity Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
There is a separate quality function that measures the goodness of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough";
the answer is typically highly subjective.
37. Notion of a Cluster can be Ambiguous
38. Hierarchical algorithms
Agglomerative: each object starts as its own cluster; merge clusters to form larger ones
Divisive: all objects start in one cluster; split it into smaller clusters
39. Types of Clusters: Well-Separated Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
40. Types of Clusters: Center-Based Center-based
A cluster is a set of objects such that an object in a cluster is closer (or more similar) to the center of its cluster than to the center of any other cluster
The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster
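The two notions of "center" mentioned above can be contrasted with a short sketch (function names are illustrative):

```python
import math

# Centroid: the coordinate-wise mean, which may not be an actual data point.
# Medoid: the member of the cluster with minimum total distance to the others.

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def medoid(points):
    def total_dist(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_dist)
```

The distinction matters when the mean is not meaningful (e.g. categorical data) or when the center must itself be a data object, which is when medoid-based methods such as k-medoids are preferred.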
41. Types of Clusters: Contiguity-Based Contiguous Cluster (Nearest neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
42. Types of Clusters: Density-Based Density-based
A cluster is a dense region of points, separated from other high-density regions by regions of low density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
43. Types of Clusters: Conceptual Clusters Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.
45. Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequence of merges or splits
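A toy bottom-up (agglomerative) run can make the merge sequence concrete. This sketch uses single-link distance between clusters; the recorded merge list is exactly what a dendrogram visualizes (function name and return format are illustrative):

```python
import math

# Toy agglomerative hierarchical clustering with single-link distance;
# returns the sequence of merges that a dendrogram would display.

def agglomerative(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

Each entry records the two clusters merged and the distance at which the merge happened, i.e. the height of that join in the dendrogram.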
47. Intermediate Situation After some merging steps, we have some clusters
59. Hierarchical Clustering: Time and Space Requirements O(N²) space, since it uses the proximity matrix
N is the number of points
O(N³) time in many cases
There are N steps, and at each step the N²-size proximity matrix must be updated and searched
Complexity can be reduced to O(N² log N) time for some approaches
60. Hierarchical Clustering: Problems and Limitations Once a decision is made to combine two clusters, it cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different sized clusters and convex shapes
Breaking large clusters
65. MST: Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree)
Start with a tree that consists of any point
In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
Add q to the tree and put an edge between p and q
66. MST: Divisive Hierarchical Clustering Use MST for constructing hierarchy of clusters
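The steps above can be sketched end to end: build the MST with Prim's algorithm exactly as described (grow the tree from any point by repeatedly adding the closest outside point), then obtain k clusters divisively by cutting the k-1 longest tree edges. The function name and union-find bookkeeping are illustrative, not from the original slides:

```python
import math

# Sketch of MST-based divisive clustering: Prim's algorithm builds the
# minimum spanning tree; cutting the k-1 longest edges yields k clusters.

def mst_clusters(points, k):
    n = len(points)
    in_tree = {0}  # start the tree from any point
    edges = []
    while len(in_tree) < n:
        # closest pair (p, q) with p in the tree and q outside it
        d, p, q = min((math.dist(points[p], points[q]), p, q)
                      for p in in_tree for q in range(n) if q not in in_tree)
        edges.append((d, p, q))
        in_tree.add(q)
    # cut the k-1 longest MST edges; remaining components are the clusters
    edges.sort()
    keep = edges[:max(0, n - k)]
    # union-find over the kept edges to label connected components
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, p, q in keep:
        parent[find(p)] = find(q)
    return [find(i) for i in range(n)]
```

Cutting the longest edges first is what makes this divisive: the full tree is one cluster, and each cut splits an existing cluster in two.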