Computacion inteligente

Computacion inteligente Fuzzy Clustering

Agenda • Basic concepts • Types of Clustering • Types of Clusters • Distance functions • Clustering Algorithms

Basic concepts

Classification • Historically, objects are classified into groups • Periodic table of the elements (chemistry) • Taxonomy (zoology, botany) • Why classify? • Understanding • Prediction • Organizational convenience, convenient summary • Summarization • Reduce the size of large data sets • These aims do not necessarily lead to the same classification, e.g. SIZE of object vs. TYPE/USE of object

Classification • Classification divides objects into groups based on a set of values • Unlike a theory, a classification is neither true nor false, and should be judged largely on the usefulness of results • However, a classification (clustering) may be useful for suggesting a theory, which could then be tested

What is clustering? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized • Inter-cluster distances are maximized

Simple example • Composition of mammalian milk

Simple example • Composition of mammalian milk • Clustering [Figure: scatter plot with axes Fat (%) and Proteins (%)]

What is clustering? • No class values denoting an a priori grouping of the data instances are given. • So, it’s a method of data exploration: a way of looking for patterns or structure in the data that are of interest [Figure labels: Feature space; Pattern]

What is clustering? • A form of unsupervised learning • You generally don’t have examples demonstrating how the data should be grouped together • For historical reasons, clustering is often considered synonymous with unsupervised learning.

Clustering vs. class prediction • Clustering: • No learning set, no given classes • Goal: discover the “best” classes or groupings • Class prediction: • A learning set of objects with known classes • Goal: put new objects into existing classes • Also called: supervised learning, or classification

Components of Clustering Task • Pattern Representation • Number of classes and available patterns • Number, type, and scale of features available to the algorithm • Feature selection/extraction • Definition of Pattern Proximity measure • Defined on pairs of patterns • Distance measures and conceptual similarities

Components of Clustering Task • Clustering / Grouping • Data abstraction (optional) • Extraction of a simple and compact data representation • Output Assessment (optional) • How good is it? • The quality of a clustering result depends on the algorithm, the distance function, and the application.

Pattern Representation • Which features do we use? • Currently, there are no theoretical guidelines suggesting appropriate patterns and features to use in a specific situation • The user generally must provide insight • Careful analysis of available features can yield improved clustering results

Pattern Representation • Example: balls of the same colour are clustered into one group [Figure: balls grouped by colour] • Thus, clustering means grouping data, or dividing a large data set into smaller data sets whose members share some similarity.

Notion of a Cluster can be Ambiguous • How many clusters? [Figure panels: the same points grouped as Two Clusters, Four Clusters, and Six Clusters]

Types of Clustering

Types of Clusterings • Important distinction between hierarchical and partitional sets of clusters • Partitional clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree

Partitional Clustering [Figure panels: Original Points; A Partitional Clustering]

Hierarchical Clustering

Other Distinctions Between Sets of Clusters • Exclusive versus non-exclusive • In non-exclusive clusterings, points may belong to multiple clusters. • Fuzzy versus non-fuzzy • In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 • Weights must sum to 1 • Partial versus complete • In some cases, we only want to cluster some of the data
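A minimal sketch of fuzzy membership weights. The slide only states that each weight lies between 0 and 1 and that the weights sum to 1; the inverse-distance formula used here (the usual fuzzy c-means membership with fuzzifier m = 2) is an assumption, as are the function name `fuzzy_memberships` and the sample centers.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Membership weight of `point` in each cluster.

    Assumption: weights come from inverse distances to the cluster
    centers with fuzzifier m > 1 (fuzzy c-means style).
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    inv = [d ** -exponent for d in dists]
    total = sum(inv)
    # Normalizing guarantees the weights sum to 1.
    return [v / total for v in inv]

centers = [(0.0, 0.0), (4.0, 0.0)]        # hypothetical cluster centers
weights = fuzzy_memberships((1.0, 0.0), centers)
print(weights)                            # roughly 0.9 and 0.1
```

The point is three times closer to the first center, so it receives most, but not all, of the membership weight.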

Types of Clusters

Types of Clusters • Well-separated clusters • Center-based clusters • Contiguous clusters • Density-based clusters • Property or Conceptual • Described by an Objective Function

Types of Clusters: Well-Separated • Well-Separated Clusters: • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. [Figure: 3 well-separated clusters]

Types of Clusters: Center-Based • Center-based • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster [Figure: 4 center-based clusters]
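The two kinds of center can be sketched as follows. The centroid is the average of the points, as stated above; for the medoid this sketch assumes one common reading of "most representative", namely the member with the smallest total distance to all other members. The function names and sample data are illustrative.

```python
import math

def centroid(points):
    """Component-wise mean of the points (need not be a data point)."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[k] for p in points) / n for k in range(dims))

def medoid(points):
    """Assumed definition: the member minimizing total distance to the rest."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(cluster))  # (4.0, 0.0), pulled toward the outlier
print(medoid(cluster))    # (2.0, 0.0), always an actual data point
```

The medoid is less affected by the outlier than the centroid, which is one reason it is sometimes preferred as a representative.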

Common ways to represent clusters • Use the centroid of each cluster to represent the cluster • Compute the radius and standard deviation of the cluster to determine its spread in each dimension • The centroid representation alone works well if the clusters are hyper-spherical • If clusters are elongated or have other shapes, centroids are not sufficient
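A small sketch of this representation. The slide does not define "radius"; here it is assumed to be the largest distance from the centroid to any member, and the standard deviation is the population standard deviation per dimension. The function name `summarize_cluster` and the data are illustrative.

```python
import math

def summarize_cluster(points):
    """Return (centroid, radius, per-dimension standard deviation)."""
    n, dims = len(points), len(points[0])
    center = [sum(p[k] for p in points) / n for k in range(dims)]
    # Assumption: radius = distance from the centroid to the farthest member.
    radius = max(math.dist(p, center) for p in points)
    std = [
        math.sqrt(sum((p[k] - center[k]) ** 2 for p in points) / n)
        for k in range(dims)
    ]
    return center, radius, std

# Hypothetical elongated cluster: wide in x, very narrow in y.
elongated = [(-4.0, 0.0), (-2.0, 0.1), (0.0, -0.1), (2.0, 0.1), (4.0, -0.1)]
center, radius, std = summarize_cluster(elongated)
print(center, radius, std)
```

For this cluster the spread in x is far larger than in y, which the centroid alone cannot express; this is the situation the last bullet warns about.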

Types of Clusters: Contiguity-Based • Contiguous Cluster (Nearest neighbor or Transitive) • A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. [Figure: 8 contiguous clusters]

Types of Clusters: Density-Based • Density-based • A cluster is a dense region of points that is separated from other high-density regions by low-density regions. • Used when the clusters are irregular or intertwined, and when noise and outliers are present. [Figure: 6 density-based clusters]

Types of Clusters: Conceptual Clusters • Shared Property or Conceptual Clusters • Finds clusters that share some common property or represent a particular concept. [Figure: 2 overlapping circles]

Types of Clusters: Objective Function • Clusters Defined by an Objective Function • Finds clusters that minimize or maximize an objective function. • Enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters by using the given objective function.
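The enumeration idea can be sketched for a tiny one-dimensional data set. The objective used here, the sum of squared distances to each cluster mean (SSE), is an assumption chosen for illustration; the slide does not name a specific objective. With n points and k clusters there are k^n label assignments, so this brute-force approach is only feasible for very small inputs.

```python
from itertools import product

def sse(points, labels, k):
    """Sum of squared distances of 1-D points to their cluster means."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            return float("inf")  # reject assignments with an empty cluster
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total

def best_partition(points, k):
    """Try every assignment of points to k labels and keep the best one."""
    candidates = product(range(k), repeat=len(points))
    return min(candidates, key=lambda labels: sse(points, labels, k))

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]   # hypothetical data
labels = best_partition(points, 2)
print(labels)  # (0, 0, 0, 1, 1, 1)
```

Practical algorithms avoid this exhaustive search and instead look for a good partition with heuristics.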

Types of Clusters: Objective Function • Clusters Defined by an Objective Function • Can have global or local objectives. • Hierarchical clustering algorithms typically have local objectives • Partitional algorithms typically have global objectives

Distance functions

Clustering Task • Requires introducing a distance measure D (or a measure of similarity or proximity) between sample patterns.

Distance functions • The similarity measure is often more important than the clustering algorithm used • Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures

Quality in Clustering • A good clustering method will produce high quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation
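One simple way to make these two criteria concrete is to compare the mean distance between points in the same cluster with the mean distance between points in different clusters. This is only a sketch under that assumption; the slide does not prescribe a particular quality index, and the function name and data are illustrative.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Mean pairwise distance within clusters and between clusters.

    Assumes at least one same-cluster pair and one cross-cluster pair.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = mean_intra_inter(points, [0, 0, 1, 1])  # nearby points grouped together
bad = mean_intra_inter(points, [0, 1, 0, 1])   # nearby points split apart
print(good)
print(bad)
```

A small intra-cluster distance corresponds to high intra-class similarity, and a large inter-cluster distance to low inter-class similarity, so the first labeling scores better on both counts.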

Distance functions • There are numerous distance functions for • Different types of data • Numeric data • Nominal data • Different specific applications • Weights should be associated with different variables based on applications and data semantics.

Distance functions for numeric data • We denote the distance by $d(x_i, x_j)$, where $x_i$ and $x_j$ are data points (vectors) • The most commonly used functions are • Euclidean distance and • Manhattan (city block) distance • They are special cases of the Minkowski distance

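Metric Spaces • Metric space: a pair $(X, d)$ where $X$ is a set and $d$ is a distance function such that for all $x, y, z$ in $X$: • Symmetry: $d(x, y) = d(y, x)$ • Separation: $d(x, y) = 0$ if and only if $x = y$ • Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$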
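The three properties can be spot-checked numerically on a finite sample of points. This sketch only tests the listed axioms on the given sample, so a `True` result is evidence rather than proof; the helper names are illustrative. Squared Euclidean distance is included as a well-known example that fails the triangle inequality.

```python
import math
from itertools import product

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check symmetry, separation, and the triangle inequality on a sample."""
    for x, y, z in product(points, repeat=3):
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if (d(x, y) <= tol) != (x == y):        # separation
            return False
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(satisfies_metric_axioms(math.dist, points))          # True
print(satisfies_metric_axioms(squared_euclidean, points))  # False
```

For the collinear points 0, 1, 2 on the x-axis, the squared distance from 0 to 2 is 4, which exceeds 1 + 1, so the triangle inequality is violated.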

Minkowski distance, $L_p$ • $d(Q, C) = \left( \sum_{k=1}^{n} |q_k - c_k|^p \right)^{1/p}$ • $p = 1$: Manhattan (Rectilinear, City Block) • $p = 2$: Euclidean • $p = \infty$: Max (Supremum, “sup”)
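A minimal sketch of the Minkowski family, showing how the three named cases arise from a single function. The function name `minkowski` and the sample vectors are illustrative; the $p = \infty$ case is handled separately as the largest coordinate difference.

```python
import math

def minkowski(q, c, p):
    """L_p distance between equal-length vectors (p >= 1 or math.inf)."""
    diffs = [abs(a - b) for a, b in zip(q, c)]
    if p == math.inf:
        return max(diffs)  # limit of the L_p distance as p grows
    return sum(d ** p for d in diffs) ** (1.0 / p)

Q, C = (0.0, 0.0), (3.0, 4.0)
print(minkowski(Q, C, 1))         # 7.0  Manhattan
print(minkowski(Q, C, 2))         # 5.0  Euclidean
print(minkowski(Q, C, math.inf))  # 4.0  Max (supremum)
```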

Euclidean distance, $L_2$ • $d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$ • Here $n$ is the number of dimensions in the data vector.

Euclidean distance • These examples of Euclidean distance match our intuition of dissimilarity pretty well… [Figure: three example pairs with $d_{euc}$ = 0.5846, 1.1345, and 2.6115]

Euclidean distance • …But what about these? What might be going on with the expression profiles on the left? On the right? [Figure: two pairs of expression profiles, with $d_{euc}$ = 1.41 and $d_{euc}$ = 1.22]

Weighted Euclidean distance • $d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} w_k (x_{ik} - x_{jk})^2}$, where $w_k \ge 0$ is the weight of dimension $k$
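A short sketch of the weighted Euclidean distance in its standard form, with one non-negative weight per dimension. The function name and the example weights are illustrative assumptions; how the weights should be chosen depends on the application.

```python
import math

def weighted_euclidean(x, y, w):
    """Euclidean distance with a non-negative weight w[k] per dimension."""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))

x, y = (0.0, 0.0), (3.0, 4.0)
print(weighted_euclidean(x, y, (1.0, 1.0)))  # 5.0, plain Euclidean distance
print(weighted_euclidean(x, y, (1.0, 0.0)))  # 3.0, second dimension ignored
```

Setting all weights to 1 recovers the ordinary Euclidean distance, and a weight of 0 removes a dimension from the comparison entirely.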

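Mahalanobis distance • $d(x_i, x_j) = \sqrt{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}$, where $\Sigma$ is the covariance matrix of the data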
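A minimal sketch of the Mahalanobis distance in its standard form, restricted to two dimensions so that the covariance matrix can be inverted by hand without extra libraries. The function name `mahalanobis_2d` and the covariance values are illustrative assumptions; in practice the covariance matrix would be estimated from the data.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)

x, y = (0.0, 0.0), (3.0, 4.0)
identity = ((1.0, 0.0), (0.0, 1.0))
scaled = ((9.0, 0.0), (0.0, 16.0))   # hypothetical variances 9 and 16
print(mahalanobis_2d(x, y, identity))  # 5.0, same as Euclidean
print(mahalanobis_2d(x, y, scaled))    # about 1.414
```

With the identity covariance the result equals the Euclidean distance; with unequal variances each difference is measured in units of that dimension's standard deviation.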

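More Metrics • Manhattan distance, $L_1$: $d(x_i, x_j) = \sum_{k=1}^{n} |x_{ik} - x_{jk}|$ • $L_\infty$ (Chessboard): $d(x_i, x_j) = \max_{k} |x_{ik} - x_{jk}|$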

Clustering Algorithms
