Pattern Recognition: Statistical & Neural Clustering

Nanjing University of Science & Technology
Pattern Recognition: Statistical and Neural
Lonnie C. Ludeman
Lecture 26, Nov 4, 2005

Lecture 26 Topics
• General Concept of Clustering
• Basic problems in determining clusters
    Example 1
    Example 2
• Definition of distance functions between clusters
• Introduce the K-Means Clustering Algorithm

Clustering is the art of grouping together pattern vectors that in some sense belong together because they have similar characteristics and are different from other pattern vectors. In the most general problem, the number of clusters or subgroups is unknown, as are the properties that make them similar.

Question: How do we start the process of finding clusters and identifying similarities?
Answer: First, realize that clustering is an art and there is no correct answer, only feasible alternatives. Second, explore structures of data, similarity measures, and limitations of various clustering procedures.

Formalization of the Problem of Clustering
Given a set S of $N_S$ n-dimensional pattern vectors,
$S = \{ x_j ;\ j = 1, 2, \ldots, N_S \}$,
clustering is the process of partitioning S into K subsets $Cl_k$, $k = 1, 2, \ldots, K$, called clusters, that satisfy the following conditions:

1. The members in each subset are in some sense similar and not similar to members in the other subsets.
2. $Cl_k \neq \Phi$ (not empty)
3. $Cl_k \cap Cl_j = \Phi$ for $k \neq j$ (pairwise disjoint)
4. $\bigcup_{k=1}^{K} Cl_k = S$ (exhaustive)

where $\Phi$ is the null set.
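Conditions 2 through 4 can be checked mechanically; condition 1 (similarity) cannot, since it depends on the chosen notion of similarity. The following is a minimal sketch, assuming patterns are hashable labels and clusters are Python sets. The function name `is_valid_clustering` is hypothetical and used only for illustration.

```python
def is_valid_clustering(S, clusters):
    """Check conditions 2-4: non-empty, pairwise disjoint, exhaustive."""
    # Condition 2: no cluster is the null set.
    nonempty = all(len(c) > 0 for c in clusters)
    # Condition 3: every pair of distinct clusters has an empty intersection.
    disjoint = all(
        clusters[i].isdisjoint(clusters[j])
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    # Condition 4: the union of all clusters equals S.
    exhaustive = set().union(*clusters) == set(S)
    return nonempty and disjoint and exhaustive


S = {"a", "b", "c", "d"}
print(is_valid_clustering(S, [{"a", "c"}, {"b", "d"}]))       # valid partition
print(is_valid_clustering(S, [{"a", "c"}, {"b", "c", "d"}]))  # "c" appears twice
```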

Illustration of Clusters and Cluster centers

We will now look at two examples that illustrate problems in performing meaningful clustering:
Example 1: Problems with scaling
Example 2: The nonuniqueness of results

Example 1: Given the data below, obtained by measuring the weight and diameter of 4 large foam balls labeled a, b, c, and d. Find two clusters from the set { a, b, c, d }

Solution: The plot of the points in the 2-dimensional pattern space is given below.
By closeness in pattern space, select Cl1 = { a, c } and Cl2 = { b, d }.

The plot of the same points in the 2-dimensional pattern space, with Diameter shown in inches rather than feet (a different scale), is given below.
By closeness in pattern space, select Cl1 = { a, b } and Cl2 = { c, d }.

Which set of clusters is the correct answer?
#1: Cl1 = { a, c }, Cl2 = { b, d } (measured in feet)
#2: Cl1 = { a, b }, Cl2 = { c, d } (measured in inches)
#3: Cl1 = { a, d }, Cl2 = { b, c } (other measurement units)
#4: None of the above
#5: All of the above

Answer: There is no correct answer. The clusters provide us with different interpretations of the data, in which the closeness of patterns is measured with different definitions of similarity.

One approach to solving the scaling problem is to normalize each dimension separately if the dimensions represent different properties, like weight and diameter. For our problem we have: [plot with the Diameter and Weight axes each normalized to 1]
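The effect of scaling, and of per-dimension normalization, can be sketched in a few lines. The (weight, diameter) values below are hypothetical and chosen only to reproduce the behavior of Example 1; they are not the lecture's data. Dividing each dimension by its maximum is one assumed normalization among several possible ones.

```python
import numpy as np

labels = ["a", "b", "c", "d"]

# Hypothetical (weight, diameter in feet) values, NOT the lecture's data.
feet = np.array([[1.0, 1.0], [4.0, 1.0], [1.0, 2.0], [4.0, 2.0]])
# Same balls with the diameter expressed in inches.
inches = feet * np.array([1.0, 12.0])


def nearest_to_first(points):
    """Label of the point closest (Euclidean) to the first point, a."""
    d = np.linalg.norm(points[1:] - points[0], axis=1)
    return labels[1 + int(np.argmin(d))]


def normalize(points):
    """Scale each dimension separately so that its maximum is 1."""
    return points / points.max(axis=0)


print(nearest_to_first(feet))    # nearest neighbor of a with diameter in feet
print(nearest_to_first(inches))  # nearest neighbor of a with diameter in inches
# After normalization both unit systems give the same pattern vectors.
print(np.allclose(normalize(feet), normalize(inches)))
```

With these values the nearest neighbor of a changes when the unit changes, while the normalized vectors are identical for both unit systems, so the clustering no longer depends on the unit of measurement.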

Example 2. Given a standard (USA) deck of 52 playing cards. Each card is specified by the pair of values (denomination, suit), where denomination is from { 2, 3, ..., 10, J, Q, K, A } and suit is from { ♠, ♥, ♦, ♣ }. Find a reasonable clustering of the data.

Given Patterns: [figure of the 52 cards]

Solutions 1 through 5: [figures of alternative clusterings]

Solution 6: Another choice for 26 clusters [figure]

Solution 7: Still another choice for 26 clusters [figure]
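The nonuniqueness in Example 2 can be illustrated by grouping the same deck under different similarity criteria. The groupings below (by suit, by denomination, by color, and by denomination together with color) are illustrative assumptions and are not necessarily the lecture's Solutions 1 through 7. The helper `cluster_by` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product

denominations = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(denominations, suits))  # 52 (denomination, suit) pairs


def color(card):
    """Red for hearts and diamonds, black otherwise."""
    return "red" if card[1] in ("hearts", "diamonds") else "black"


def cluster_by(cards, key):
    """Group cards that share the same value of key(card)."""
    groups = defaultdict(list)
    for card in cards:
        groups[key(card)].append(card)
    return dict(groups)


by_suit = cluster_by(deck, key=lambda c: c[1])
by_denomination = cluster_by(deck, key=lambda c: c[0])
by_color = cluster_by(deck, key=color)
by_denomination_and_color = cluster_by(deck, key=lambda c: (c[0], color(c)))

for name, clustering in [
    ("suit", by_suit),
    ("denomination", by_denomination),
    ("color", by_color),
    ("denomination and color", by_denomination_and_color),
]:
    print(name, len(clustering))
```

Each grouping is a valid partition of the same 52 patterns; which one is "reasonable" depends on the definition of similarity.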

We now concentrate on quantitative data and examine measures of similarity between pattern samples and clusters.

Euclidean distance between two pattern vectors x and y:
$d(x, y) = \| x - y \| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

The smaller the distance, the larger the similarity.

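Measures of Distance between two pattern classes $S_i$ and $S_j$

1. Minimum distance: $d_{min}(S_i, S_j) = \min_{x \in S_i,\ y \in S_j} d(x, y)$
2. Average distance: $d_{avg}(S_i, S_j) = \frac{1}{N_i N_j} \sum_{x \in S_i} \sum_{y \in S_j} d(x, y)$
3. Distance between means: $d_{mean}(S_i, S_j) = d(m_i, m_j)$, where $m_i = \frac{1}{N_i} \sum_{x \in S_i} x$ and $N_i$ is the number of samples in $S_i$
4. Distance between medians: the distance $d$ between the medians of $S_i$ and $S_j$
5. Maximum distance: $d_{max}(S_i, S_j) = \max_{x \in S_i,\ y \in S_j} d(x, y)$

Interpretation of $d_{max}$, $d_{mean}$, $d_{min}$: [figure]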
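The minimum, average, between-means, and maximum distances can be sketched with NumPy. This is a minimal illustration, assuming Euclidean distance between pattern vectors and each class stored as an array with one sample per row. The function names are chosen here for illustration.

```python
import numpy as np


def pairwise_distances(Si, Sj):
    """Matrix of Euclidean distances between every x in Si and y in Sj."""
    return np.linalg.norm(Si[:, None, :] - Sj[None, :, :], axis=2)


def d_min(Si, Sj):
    return float(pairwise_distances(Si, Sj).min())


def d_avg(Si, Sj):
    return float(pairwise_distances(Si, Sj).mean())


def d_mean(Si, Sj):
    """Distance between the two class means."""
    return float(np.linalg.norm(Si.mean(axis=0) - Sj.mean(axis=0)))


def d_max(Si, Sj):
    return float(pairwise_distances(Si, Sj).max())


Si = np.array([[0.0, 0.0], [1.0, 0.0]])
Sj = np.array([[3.0, 0.0], [5.0, 0.0]])
print(d_min(Si, Sj), d_avg(Si, Sj), d_mean(Si, Sj), d_max(Si, Sj))
```

For these two small classes the pairwise distances are 3, 5, 2, and 4, so the minimum is 2, the maximum is 5, and both the average distance and the distance between means are 3.5. The last two coincide here only because all samples lie on one line; in general they differ.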

Measure of Performance for Clustering
Overall performance measure J for a given set of clusters $Cl_k$, $k = 1, 2, \ldots, K$, where the mean of each cluster is $M_k$.

If K = $N_S$, the number of samples, then each cluster center equals the single sample in its cluster and the performance measure J would be 0. If K = 1, then all samples are in just one cluster and J would be maximum. There is no useful information in either of these conditions!
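The two extreme cases can be checked numerically. This is a minimal sketch, assuming J is the sum of squared Euclidean distances from each sample to its cluster mean $M_k$ (a common choice for K-means; the lecture's exact form may differ). The data points and the name `performance_J` are hypothetical.

```python
import numpy as np


def performance_J(X, labels):
    """Assumed form: sum over clusters of squared distances to the cluster mean."""
    J = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        M_k = members.mean(axis=0)          # mean of cluster k
        J += np.sum((members - M_k) ** 2)   # squared distances to M_k
    return float(J)


# Hypothetical samples: two pairs of nearby points.
X = np.array([[0.0, 0.0], [0.0, 2.0], [6.0, 0.0], [6.0, 2.0]])

J_one = performance_J(X, np.array([0, 0, 0, 0]))  # K = 1
J_two = performance_J(X, np.array([0, 0, 1, 1]))  # K = 2
J_all = performance_J(X, np.array([0, 1, 2, 3]))  # K = N_S
print(J_one, J_two, J_all)
```

Under this assumed form, J is 40 for a single cluster, 4 for the two natural clusters, and 0 when every sample is its own cluster, which is why J alone cannot be used to choose K.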

Methods for Clustering Quantitative Data
1. K-Means Clustering Algorithm *
2. Hierarchical Clustering Algorithm
3. ISODATA Clustering Algorithm
4. Fuzzy Clustering Algorithm
* Only introduced in this lecture; details follow in the next lecture.

K-Means Clustering Algorithm: Basic Procedure
1. Randomly select K cluster centers from the pattern space.
2. Distribute the set of patterns to the cluster centers using minimum distance.
3. Compute new cluster centers for each cluster.
4. Continue this process until the cluster centers do not change.
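The basic procedure above can be sketched as follows. This is a minimal illustration rather than a definitive implementation. It assumes Euclidean distance, draws the initial centers from the samples themselves (one simple way to pick points in the pattern space), and keeps the old center if a cluster becomes empty. The name `k_means` and the parameters `max_iter` and `seed` are choices made here.

```python
import numpy as np


def k_means(X, K, max_iter=100, seed=0):
    """Return (centers, labels) for K clusters of the rows of X."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K initial centers (assumption: drawn from the samples).
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each pattern to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned patterns.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Step 4: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Hypothetical data: two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = k_means(X, K=2)
print(centers)
print(labels)
```

On well-separated data like this the result does not depend on the initial centers, but in general K-means can converge to different clusterings from different starting points.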

Flow Diagram for K-Means Algorithm

Summary Lecture 26
• Presented General Concept of Clustering
• Discussed Basic problems in determining clusters by presenting
    Example 1
    Example 2
• Gave the Definition of distance functions between clusters
• Introduced the K-Means Clustering Algorithm

End of Lecture 26
