Unsupervised Learning
• Clustering
  • Unsupervised classification, that is, classification without the class attribute
  • Want to discover the classes
• Association Rule Discovery
  • Discover correlations
The Clustering Process
• Pattern representation
• Definition of a pattern proximity measure
• Clustering
• Data abstraction
• Cluster validation
Pattern Representation
• Number of classes
• Number of available patterns
• Circles, ellipses, squares, etc.
• Feature selection
  • Can we use wrappers and filters?
• Feature extraction
  • Produce new features
  • E.g., principal component analysis (PCA)
Pattern Proximity
• Want clusters of instances that are similar to each other but dissimilar to instances in other clusters
• Need a similarity (or distance) measure
• Continuous case
  • Euclidean measure (favors compact, isolated clusters)
  • The squared Mahalanobis distance alleviates problems with correlated attributes
  • Many more measures exist; see the sketch below
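As a concrete illustration, here is a minimal Python/numpy sketch of the two measures just mentioned (the function names and the example covariance matrix are my own, for illustration only):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: works well for compact, isolated clusters."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance: rescales by the data covariance,
    which alleviates problems with correlated attributes."""
    d = x - y
    return float(d @ np.linalg.inv(cov) @ d)

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
cov = np.array([[1.0, 0.6], [0.6, 1.0]])  # assumed attribute covariance
print(euclidean(x, y), mahalanobis_sq(x, y, cov))
```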
Pattern Proximity
• Nominal attributes
  • E.g., similarity as the fraction of attributes on which two instances have matching values (simple matching)
Clustering Techniques
• Hierarchical
  • Single Link
  • Complete Link
  • CobWeb
• Partitional
  • Square Error
    • K-means
  • Mixture Maximization
    • Expectation Maximization
Technique Characteristics
• Agglomerative vs Divisive
  • Agglomerative: each instance starts as its own cluster and the algorithm merges clusters
  • Divisive: begins with all instances in one cluster and divides it up
• Hard vs Fuzzy
  • Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns each instance a degree of membership in every cluster
More Characteristics
• Monothetic vs Polythetic
  • Polythetic: all attributes are used simultaneously, e.g., to calculate distance (most algorithms)
  • Monothetic: attributes are considered one at a time
• Incremental vs Non-Incremental
  • With large data sets it may be necessary to consider only part of the data at a time (data mining)
  • Incremental algorithms work instance by instance
Hierarchical Clustering
• [Dendrogram: leaves A–G on the horizontal axis, similarity on the vertical axis; clusters merge at decreasing similarity]
Hierarchical Algorithms
• Single-link
  • Distance between two clusters is the minimum distance between any pair of instances, one from each cluster
  • More versatile
  • Produces (sometimes too) elongated clusters
• Complete-link
  • Distance between two clusters is the maximum distance between any pair of instances, one from each cluster
  • Produces tightly bound, compact clusters
  • Often more useful in practice
Example: Clusters Found
• [Scatter plots: the same data clustered by single-link (left) and complete-link (right); points labeled by cluster (1/2), with * marking noise points between the clusters]
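A small sketch of both linkages using scipy (the data set here is my own stand-in for the figure: two elongated point clouds):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two elongated point clouds, similar in spirit to the figure above
X = np.vstack([rng.normal([0.0, 0.0], [2.0, 0.2], size=(30, 2)),
               rng.normal([0.0, 1.5], [2.0, 0.2], size=(30, 2))])

# Single-link: inter-cluster distance = minimum pairwise distance
single = fcluster(linkage(X, method="single"), t=2, criterion="maxclust")
# Complete-link: inter-cluster distance = maximum pairwise distance
complete = fcluster(linkage(X, method="complete"), t=2, criterion="maxclust")
```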
Partitional Clustering
• Output a single partition of the data into clusters
• Good for large data sets
• Determining the number of clusters is a major challenge
K-Means
• Predetermined number of clusters
• Start with seed clusters of one element each
• [Figure: data points with the initial seeds marked]
Assign Instances to Clusters
• [Figure: each instance assigned to its nearest seed]
Find New Centroids
• [Figure: each centroid recomputed as the mean of its cluster]
New Clusters
• [Figure: instances reassigned to the new centroids, giving the resulting clusters]
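Putting the walkthrough together, a minimal k-means sketch in Python/numpy (function name and details are illustrative; it keeps an old centroid if a cluster goes empty):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Seeds: k randomly chosen instances
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each instance to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged (to a local optimum)
        centroids = new
    return labels, centroids
```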
Discussion: k-means
• Applicable to fairly large data sets
• Sensitive to the initial centers
  • Use other heuristics to find good initial centers
• Converges to a local optimum
• Specifying the number of centers is very subjective
Clustering in Weka
• Clustering algorithms in Weka
  • K-Means
  • Expectation Maximization (EM)
  • Cobweb
    • Hierarchical, incremental, and agglomerative
CobWeb
• Algorithm (main) characteristics:
  • Hierarchical and incremental
  • Uses category utility:

$$CU(C_1, \ldots, C_k) = \frac{1}{k} \sum_{l=1}^{k} \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)$$

  where $C_1, \ldots, C_k$ are the $k$ clusters and $v_{ij}$ ranges over all possible values of attribute $a_i$
  • Why divide by k?
Category Utility
• If each instance is in its own cluster, then $\Pr[a_i = v_{ij} \mid C_l]$ is 1 for the value the instance actually has and 0 otherwise, so the category utility function becomes

$$CU = \frac{m - \sum_i \sum_j \Pr[a_i = v_{ij}]^2}{k}$$

  where $m$ is the number of attributes
• Without the division by k it would always be best for each instance to have its own cluster: overfitting!
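A direct translation of the category utility formula into Python for nominal data (a sketch; the function name and the data representation as tuples are my own):

```python
from collections import Counter

def category_utility(clusters):
    """`clusters` is a list of clusters, each a list of instances,
    each instance a tuple of nominal attribute values."""
    instances = [x for c in clusters for x in c]
    n, k, m = len(instances), len(clusters), len(instances[0])

    def sq_prob_sum(insts):
        # sum over attributes i and values j of Pr[a_i = v_ij]^2
        total = 0.0
        for i in range(m):
            counts = Counter(x[i] for x in insts)
            total += sum((c / len(insts)) ** 2 for c in counts.values())
        return total

    base = sq_prob_sum(instances)
    return sum(len(c) / n * (sq_prob_sum(c) - base) for c in clusters) / k
```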
The Weather Problem
• [Table: the standard weather data — 14 instances with nominal attributes Outlook, Temperature, Humidity, and Windy]
Weather Data (without Play)
• Label the instances a, b, …, n
• Start by putting the first instance in its own cluster
• Add another instance in its own cluster
• [Figure: hierarchy with cluster a alone, then clusters a and b]
Adding the Third Instance
• Evaluate the category utility of adding the instance to one of the two existing clusters versus adding it as its own cluster
• [Figure: three candidate hierarchies — c merged with a, c merged with b, and c in its own cluster; the last has the highest utility]
Adding Instance f
• First instance not to get its own cluster
• Look at the instances:
  • e) Rainy, Cool, Normal, FALSE
  • f) Rainy, Cool, Normal, TRUE
• Quite similar!
• [Figure: hierarchy with leaves a, b, c, d and a node merging e and f]
Add Instance g
• Look at the instances:
  • e) Rainy, Cool, Normal, FALSE
  • f) Rainy, Cool, Normal, TRUE
  • g) Overcast, Cool, Normal, TRUE
• [Figure: g joins the node containing e and f]
Add Instance h
• Look at the instances:
  • a) Sunny, Hot, High, FALSE
  • d) Rainy, Mild, High, FALSE
  • h) Sunny, Mild, High, FALSE
• Rearrange: the best matching node and the runner-up (a and d) are merged into a single cluster before h is added
• (Splitting is also possible)
• [Figure: hierarchy with leaves b, c, the (e f g) node, and a new node containing a, d, and h]
Final Hierarchy
• [Dendrogram with leaves g, f, j, m, n, a, d, h, c, l, e, i, b, k]
• What next?
Dendrogram → Clusters
• [Figure: the final hierarchy cut into clusters; one cluster contains a, b, c, d, h, k, and l]
• What do a, b, c, d, h, k, and l have in common?
Numerical Attributes
• Assume a normal distribution
• Problems with zero variance!
• The acuity parameter imposes a minimum variance
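To spell out the zero-variance problem: under the normal assumption the inner sum over attribute values in the category utility becomes an integral over the density (this is the standard Classit-style treatment), and

$$\sum_j \Pr[a_i = v_{ij}]^2 \;\longrightarrow\; \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}$$

so as $\sigma_i \to 0$ the term blows up to infinity; the acuity parameter caps it by enforcing $\sigma_i \geq$ acuity.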
Hierarchy Size (Scalability)
• May create a very large hierarchy
• The cutoff parameter is used to suppress growth
• If the gain in category utility from a new node falls below the cutoff, the node is cut off
Discussion
• Advantages
  • Incremental: scales to a large number of instances
  • Cutoff limits the size of the hierarchy
  • Handles mixed attributes
• Disadvantages
  • Incremental: possibly sensitive to the order of instances
  • Arbitrary choices of parameters:
    • the division by k,
    • the artificial minimum value (acuity) for the variance of numeric attributes,
    • the ad hoc cutoff value
Probabilistic Perspective
• Most likely set of clusters given the data
• Probability of each instance belonging to a cluster
• Assumption: instances are drawn from one of several distributions
• Goal: estimate the parameters of these distributions
• Usually: assume the distributions are normal
Mixture Resolution
• Mixture: a set of k probability distributions
  • Represent the k clusters
  • Give the probability that an instance takes certain attribute values, given that it is in the cluster
• What is the probability that an instance belongs to a cluster (or a distribution)?
One Numeric Attribute
• [Figure: two overlapping normal densities, Cluster A and Cluster B, along the attribute axis]
• Two-cluster mixture model
• Given some data, how can you determine the parameters $\mu_A$, $\sigma_A$, $\mu_B$, $\sigma_B$, $p_A$?
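For reference, the standard way to write the two-cluster mixture density, with $N(x; \mu, \sigma)$ the normal density:

$$f(x) = p_A \, N(x; \mu_A, \sigma_A) + p_B \, N(x; \mu_B, \sigma_B), \qquad p_A + p_B = 1$$

and the probability that an instance $x$ belongs to cluster A follows from Bayes' rule:

$$\Pr[A \mid x] = \frac{p_A \, N(x; \mu_A, \sigma_A)}{f(x)}$$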
Problems
• If we knew which cluster each instance came from, we could estimate these parameter values
• If we knew the parameters, we could calculate the probability that an instance belongs to each cluster
EM Algorithm
• Expectation Maximization (EM)
  • Start with initial values for the parameters
  • Calculate the cluster probabilities for each instance
  • Re-estimate the values of the parameters
  • Repeat
• General-purpose maximum likelihood estimation algorithm for missing data
• Can also be used to train Bayesian networks (later)
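A minimal sketch of EM for a two-component 1-D Gaussian mixture in Python/numpy (illustrative only; it omits convergence checks and safeguards against collapsing variances):

```python
import numpy as np

def em_two_gaussians(x, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, 2)               # initial means: two random points
    sd = np.array([x.std(), x.std()])   # initial standard deviations
    p = np.array([0.5, 0.5])            # mixing probabilities p_A, p_B
    for _ in range(iters):
        # E-step: probability of each instance under each cluster
        dens = (np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2)
                / (sd * np.sqrt(2 * np.pi)))
        w = dens * p
        w /= w.sum(axis=1, keepdims=True)   # cluster membership weights
        # M-step: re-estimate parameters from the weighted instances
        p = w.mean(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        sd = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0))
    return mu, sd, p

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(5, 1, 200)])
print(em_two_gaussians(x))
```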
Beyond Normal Models
• More than two classes: straightforward
• More than one numeric attribute
  • Easy if the attributes are assumed independent
  • If the attributes are dependent, treat them jointly using the bivariate normal
• Nominal attributes
  • No more normal distribution!
EM using Weka
• Options
  • numClusters: set the number of clusters
    • Default = -1 selects it automatically
  • maxIterations: maximum number of iterations
  • seed: random number seed
  • minStdDev: set the minimum allowable standard deviation
Other Clustering
• Artificial Neural Networks (ANN)
• Random search
  • Genetic Algorithms (GA)
    • GAs have been used to find initial centroids for k-means
  • Simulated Annealing (SA)
  • Tabu Search (TS)
• Support Vector Machines (SVM)
• Will discuss GA and SVM later
Applications
• Image segmentation
• Object and character recognition
• Data mining:
  • Stand-alone tool to gain insight into the data
  • Preprocessing step before a classifier that then operates on the detected clusters
DM Clustering Challenges
• Data mining deals with large databases
  • Scalability with respect to the number of instances
  • Use a random sample (possible bias)
• Dealing with mixed data
  • Many algorithms only make sense for numeric data
• High-dimensional problems
  • Can the algorithm handle many attributes?
  • How do we interpret a cluster in high dimensions?
Other (General) Challenges
• Shape of clusters
• Minimum domain knowledge (e.g., knowing the number of clusters)
• Noisy data
• Insensitivity to instance order
• Interpretability and usability
Clustering for DM
• The main issue is scalability to large databases
• Many algorithms have been developed for scalable clustering:
  • Partitional methods: CLARA, CLARANS
  • Hierarchical methods: AGNES, DIANA, BIRCH, CURE, Chameleon
Practical Partitional Clustering Algorithms
• Classic k-Means (1967)
• Work from 1990 and later:
  • k-Medoids
    • Uses the medoid instead of the centroid
    • Less sensitive to outliers and noise
    • Computations are more costly
    • PAM (Partitioning Around Medoids) algorithm
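To make the centroid/medoid distinction concrete, a small sketch (my own helper, not PAM itself): the medoid is an actual instance, found by an O(n²) pairwise-distance computation, which is exactly why medoid-based methods cost more than recomputing a mean.

```python
import numpy as np

def medoid(X):
    # The medoid is the instance with the smallest total distance
    # to all other instances (unlike the centroid, which is a mean
    # and need not be a data point)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return X[D.sum(axis=1).argmin()]
```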
Large-Scale Problems
• CLARA: Clustering LARge Applications
  • Select several random samples of instances
  • Apply PAM to each sample
  • Return the best clusters
• CLARANS:
  • Similar to CLARA
  • Draws samples randomly while searching
  • More effective than PAM and CLARA
Hierarchical Methods
• BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
  • Clustering feature: a triplet summarizing information about a subcluster
  • Clustering feature (CF) tree: a height-balanced tree that stores the clustering features
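A sketch of the clustering-feature triplet (N, LS, SS): count, linear sum, and squared sum. The class below is illustrative, not BIRCH's actual implementation; the point is that the triplet is additive under merges and suffices to recover the centroid without storing the instances.

```python
import numpy as np

class CF:
    """BIRCH-style clustering feature: (N, LS, SS) for a subcluster."""
    def __init__(self, x):
        self.n = 1
        self.ls = x.astype(float).copy()  # linear sum of instances
        self.ss = float(x @ x)            # sum of squared norms

    def merge(self, other):
        # CFs are additive: merging two subclusters adds their triplets
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n
```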
BIRCH Mechanism
• Phase I:
  • Scan the database to build an initial CF tree
  • A multilevel compression of the data
• Phase II:
  • Apply a selected clustering algorithm to the leaf nodes of the CF tree
• Has been found to be very scalable
Conclusion
• The use of clustering in data mining practice seems to be somewhat limited due to scalability problems
• More commonly used unsupervised learning: Association Rule Discovery
Association Rule Discovery
• Aims to discover interesting correlations or other relationships in large databases
• Finds rules of the form: if A and B then C and D
• Which attributes will be included in the relation is unknown in advance
Mining Association Rules
• Similar to classification rules
• Use the same procedure?
  • Every attribute is the same
  • Would have to apply it to every possible expression on the right-hand side
  • Huge number of rules → infeasible
• Only want rules with high coverage/support
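For reference, under one common convention, for a rule "if A then C" over N instances (support is sometimes quoted as a raw count rather than a proportion):

$$\text{coverage (support)} = \frac{\#(A \wedge C)}{N}, \qquad \text{accuracy (confidence)} = \frac{\#(A \wedge C)}{\#(A)}$$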