220 likes | 470 Views
Clustering NG GIM MENG WEK070100 Tan Wei Ren WEK Tan Chin Keong WEK. Numerical Method for AI. Clustering a method of unsupervised learning, it’s not trained with examples of correct answers. a common technique for statistical data analysis used in many
E N D
Clustering NG GIM MENG WEK070100 Tan Wei Ren WEKTan Chin Keong WEK Numerical Method for AI
Clustering • a method of unsupervised learning, it’s not trained with • examples of correct answers. • a common technique for statistical data analysis used in many • fields, including machine learning, data mining, pattern • recognition, image analysis and bioinformatics. • It’s like classification but the groups are not predefined, so the • algorithm will try to group similar items together. Numerical Method for AI Introduction
Types of clustering • Hierarchical algorithms find successive clusters using previously established clusters. • Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. • Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. Numerical Method for AI Introduction
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Numerical Method for AI Distance measure
Numerical Method for AI Distance measure
Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters. Numerical Method for AI i.Hierarchical clustering
For example, suppose this data is to be clustered, and the euclidean distance is the distance metric. Raw data Traditional representation Numerical Method for AI i.Hierarchical clustering
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). Numerical Method for AI i.Hierarchical clustering
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). Numerical Method for AI i.Hierarchical clustering
1. k-meansclustering The k-means algorithm assigns each point to the cluster whose center (alsso called centroid) is nearest. The center is the average of all the points in the cluster - that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.Example : The data set has three dimensions and the cluster has two points : X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroidZbecomesZ = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2. Numerical Method for AI ii.Partitional Clustering
1. k-meansclustering • The algorithm steps are : • Choose the number of clusters, k. • Randomly generate k clusters and determine the cluster • centers, or directly generate k random points as cluster centers. • Assign each point to the nearest cluster center. • Recompute the new cluster centers. • Repeat the two previous steps until some convergence criterion • is met (usually that the assignment hasn't changed). Numerical Method for AI ii.Partitional Clustering
1. k-meansclustering Advantages : Its simplicity and speed which allows it to run on large datasets.Disadvantage :It does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.The requirement for the concept of a mean to be definable which is not always the case. For such datasets the k-medoids variant is appropriate. Numerical Method for AI ii.Partitional Clustering
1. k-meansclustering Advantages : Its simplicity and speed which allows it to run on large datasets.Disadvantage :It does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.The requirement for the concept of a mean to be definable which is not always the case. For such datasets the k-medoids variant is appropriate. Numerical Method for AI ii.Partitional Clustering
A.K. Jain M.N. Murty and P.J. Flin:'Data Clustering: a review’ ACM Computing survey vol 31 (3), sept 1999. This paper presents an overview of pattern clustering methods, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. It presents taxonomy of clustering techniques, and recent advances. It also describes some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval. Numerical Method for AI Literature Review
Components of a Clustering Task • Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]: • (1) Pattern representation, • (2) Definition of a pattern proximity measure appropriate to the data domain, • (3) Clustering or grouping, • (4) Data abstraction (if needed), • (5) Assessment of output (if needed). Numerical Method for AI Literature Review
Pattern representationrefers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable. • Pattern proximitydetermines how the similarity of two elements is calculated. It is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities. • The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). • Data abstractionis the process of extracting a simple and compact representation of a data set - a compact description of each cluster. Numerical Method for AI Literature Review
Data mining Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples.Grouping of Shopping Items Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a Stock-Keeping Unit(SKU)) Numerical Method for AI Application
Social network analysis In the study of social networks, clustering may be used to recognize communities within large groups of people. Medicine In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image.Market research Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. Numerical Method for AI Application
Comparisons between data clusterings There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. Recently, a new Mallows Distance based metric was also proposed for soft clustering comparisons. Numerical Method for AI Conclusion