MIS 542 Data Mining Concepts and Techniques — Chapter 5 — Clustering

MIS 542 Data Mining Concepts and Techniques — Chapter 5 — Clustering 2014/2015 Fall

Chapter 5. Cluster Analysis • What is Cluster Analysis? • Types of Data in Cluster Analysis • A Categorization of Major Clustering Methods • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Clustering Methods • Outlier Analysis • Summary

What is Cluster Analysis? • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Clustering is unsupervised learning: no predefined classes • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms

General Applications of Clustering • Pattern Recognition • Spatial Data Analysis • create thematic maps in GIS by clustering feature spaces • detect spatial clusters and explain them in spatial data mining • Image Processing • Economic Science (especially market research) • WWW • Document classification • Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City planning: Identifying groups of houses according to their house type, value, and geographical location • Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

What Is Good Clustering? • A good clustering method will produce high quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation. • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining • Scalability • Ability to deal with different types of attributes • Ability to handle dynamic data • Discovery of clusters with arbitrary shape • Minimal requirements for domain knowledge to determine input parameters • Able to deal with noise and outliers • Insensitive to order of input records • High dimensionality • Incorporation of user-specified constraints • Interpretability and usability

Data Structures • Data matrix • (two modes) • Dissimilarity matrix • (one mode)

Properties of Dissimilarity Measures • Properties • d(i,j) ≥ 0 for i ≠ j (non-negativity) • d(i,i) = 0 • d(i,j) = d(j,i) (symmetry) • d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality) • Exercise: Can you find examples where the distance between objects does not obey the symmetry property?

Measure the Quality of Clustering • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j) • There is a separate “quality” function that measures the “goodness” of a cluster. • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. • Weights should be associated with different variables based on applications and data semantics. • It is hard to define “similar enough” or “good enough” • the answer is typically highly subjective.

Type of data in clustering analysis • Interval-scaled variables: • Binary variables: • Nominal, ordinal, and ratio variables: • Variables of mixed types:

Classification by Scale • Nominal scale: merely distinguishes classes: with respect to A and B, XA = XB or XA ≠ XB • e.g.: color {red, blue, green, …} • gender {male, female} • occupation {engineering, management, …} • Ordinal scale: indicates an ordering of objects in addition to distinguishing them • XA = XB or XA ≠ XB, and XA > XB or XA < XB • e.g.: education {no school < primary sch. < high sch. < undergrad < grad} • age {young < middle < old} • income {low < medium < high}

Interval scale: assigns a meaningful measure of difference between two objects • Not only XA > XB, but XA is XA − XB units different from XB • e.g.: specific gravity • temperature in °C or °F • The boiling point of water is 100 °C (or 180 °F) higher than its melting point • Ratio scale: an interval scale with a meaningful zero point • Not only XA > XB, but XA is XA/XB times greater than XB • e.g.: height, weight, age (as an integer) • temperature in K or °R • Water boils at 373 K and ice melts at 273 K • The boiling point of water is 1.37 times hotter than the melting point

Comparison of Scales • The strongest scale is ratio; the weakest scale is nominal • Ahmet's height is 2.00 meters (HA) • Mehmet's height is 1.50 meters (HM) • HA ≠ HM, nominal: their heights are different • HA > HM, ordinal: Ahmet is taller than Mehmet • HA − HM = 0.50 meters, interval: Ahmet is 50 cm taller than Mehmet • HA / HM = 1.333, ratio: the same no matter whether height is measured in meters, inches, …

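Interval-valued variables • Standardize data • Calculate the mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$ • Calculate the standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$ • Using mean absolute deviation is more robust than using standard deviation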
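The standardization steps on this slide can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `standardize_mad` and the sample heights are made up for the example and do not come from the course material.

```python
def standardize_mad(values):
    """Standardize one variable f using its mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n                          # m_f
    mad = sum(abs(x - mean) for x in values) / n    # s_f
    if mad == 0:
        raise ValueError("all values are identical; cannot standardize")
    return [(x - mean) / mad for x in values]       # z_if


# Hypothetical heights in centimeters
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize_mad(heights_cm)
print(z)
```

After this transformation the values have zero mean and a mean absolute deviation of one, so variables measured in different units become comparable.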

Other Standardizations • Min-max scaling to the range [0, 1] or [−1, 1] • Decimal scaling • For ratio-scaled variables • Mean transformation • $z_{i,f} = x_{i,f} / \mathrm{mean}_f$ • Measures each value in units of the mean of variable f • Log transformation • $z_{i,f} = \log x_{i,f}$
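A short Python sketch of three of these transformations is given below. The function names and the sample values are assumptions chosen for illustration; decimal scaling is left out.

```python
import math


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot rescale")
    span = new_max - new_min
    return [new_min + (x - lo) * span / (hi - lo) for x in values]


def mean_transform(values):
    """Express each value in units of the variable's mean."""
    mean = sum(values) / len(values)
    return [x / mean for x in values]


def log_transform(values):
    """Natural-log transform; values must be strictly positive."""
    return [math.log(x) for x in values]


sample = [2.0, 4.0, 6.0]                  # hypothetical measurements
print(min_max_scale(sample))              # scaled to [0, 1]
print(min_max_scale(sample, -1.0, 1.0))   # scaled to [-1, 1]
print(mean_transform(sample))
print(log_transform(sample))
```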

[Figure: two scatter plots of X2 against X1, case I and case II] • Both have zero mean and are standardized by z-scores • X1 and X2 are scaled to unity in both cases I and II • X1·X2 = 0 in case I, whereas X1·X2 ≈ 1 in case II • Shall we use the same distance measure in both cases after obtaining the z-scores?

Exercise • [Figure: two scatter plots of X2 against X1, case I and case II, with a point A marked in each] • Suppose d(A,O) = 0.5 in cases I and II, where O is the origin • Does it reflect the distance between A and the origin? • Suggest a transformation to handle the correlation between variables

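Similarity and Dissimilarity Between Objects • Distances are normally used to measure the similarity or dissimilarity between two data objects • Some popular ones include: Minkowski distance: $d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$, where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and $q \geq 1$ • If q = 1, d is the Manhattan distance: $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$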

Similarity and Dissimilarity Between Objects (Cont.) • If q = 2, d is the Euclidean distance: $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$ • Properties • d(i,j) ≥ 0 • d(i,i) = 0 • d(i,j) = d(j,i) • d(i,j) ≤ d(i,k) + d(k,j) • Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures

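Similarity and Dissimilarity Between Objects (Cont.) • Weights can be assigned to variables: $d(i,j) = \left(w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \dots + w_p |x_{ip} - x_{jp}|^q\right)^{1/q}$ • where $w_f$, f = 1, …, p, are weights showing the importance of each variable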
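The distance measures on these slides can be sketched as a single Python function. This is an illustrative sketch, not a library API: the name `minkowski`, its parameters, and the sample points are assumptions. With q = 1 it gives the Manhattan distance, with q = 2 the Euclidean distance, and the optional weights express the importance of each variable.

```python
def minkowski(x, y, q=2, weights=None):
    """Weighted Minkowski distance between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if weights is None:
        weights = [1.0] * len(x)        # unweighted case
    total = sum(w * abs(a - b) ** q for w, a, b in zip(weights, x, y))
    return total ** (1.0 / q)


# Hypothetical two-dimensional objects
xa, xb = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(xa, xb, q=1)
euclidean = minkowski(xa, xb, q=2)
first_only = minkowski(xa, xb, q=2, weights=[1.0, 0.0])  # ignore 2nd variable
print(manhattan, euclidean, first_only)
```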

[Figure: two panels showing the Manhattan distance between XA and XB and the Euclidean distance between XA and XB]

Exercise • Take one of these points as the origin and, in two dimensions, draw the locus of points that are 1, 2, and 3 units away from the origin according to • Manhattan distance • Euclidean distance • Chebyshev distance

Binary Variables • Symmetric and asymmetric • Symmetric: both states are equally valuable and carry the same weight • gender: male, female • arbitrarily coded as 0 or 1 (e.g., 0 male, 1 female) • Asymmetric variables • Outcomes are not equally important • Encoded by 0 and 1 • E.g., whether a patient is a smoker: 1 for smoker, 0 for nonsmoker • Positive and negative outcomes of a disease test • HIV positive coded 1, HIV negative coded 0

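Binary Variables • A contingency table for binary data, comparing object i with object j: a = number of variables equal to 1 for both i and j, b = number equal to 1 for i and 0 for j, c = number equal to 0 for i and 1 for j, d = number equal to 0 for both, with p = a + b + c + d • Simple matching coefficient (invariant, if the binary variable is symmetric): $d(i,j) = \frac{b + c}{a + b + c + d}$ • Jaccard coefficient (noninvariant if the binary variable is asymmetric): $d(i,j) = \frac{b + c}{a + b + c}$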
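A small Python sketch of the two coefficients follows. It assumes the usual contingency counts: a (both objects 1), b (object i is 1, object j is 0), c (object i is 0, object j is 1) and d (both 0). The function names and the two binary vectors are hypothetical.

```python
def contingency(i, j):
    """Return the counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching_dissimilarity(i, j):
    """(b + c) / (a + b + c + d): for symmetric binary variables."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(i, j):
    """(b + c) / (a + b + c): joint absences (d) are ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0      # assumption: objects with no 1s at all count as identical
    return (b + c) / (a + b + c)


# Hypothetical objects described by six binary variables
obj_i = [1, 1, 0, 0, 0, 0]
obj_j = [1, 0, 1, 0, 0, 0]
print(contingency(obj_i, obj_j))
print(simple_matching_dissimilarity(obj_i, obj_j))
print(jaccard_dissimilarity(obj_i, obj_j))
```

Because the Jaccard form drops the joint absences, the same pair of objects looks more dissimilar under it than under simple matching whenever most variables are 0 for both.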

Dissimilarity between Binary Variables • Example • gender is a symmetric attribute • the remaining attributes are asymmetric binary • let the values Y and P be set to 1, and the value N be set to 0

Nominal Variables • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green • Method 1: Simple matching • $d(i,j) = \frac{p - m}{p}$ • m: # of matches, p: total # of variables • Higher weights can be assigned to variables with a large number of states • Method 2: use a large number of binary variables • creating a new binary variable for each of the M nominal states

Example • 2 nominal variables • Faculty and country for students • Faculty {eng, applied Sc., Pure Sc., Admin., …} 5 distinct values • Country {Turkey, USA, …} 10 distinct values • p = 2: just two variables • Weight of country may be increased • Student A (eng, Turkey), B (applied Sc., Turkey) • m = 1: A and B match in one variable • d(A,B) = (2 − 1)/2 = 1/2

Example cont. • A different binary variable for each faculty • Eng: 1 if the student is in engineering, 0 otherwise • AppSc: 1 if the student is in MIS, 0 otherwise • A different binary variable for each country • Turkey: 1 if the student is Turkish, 0 otherwise • USA: 1 if the student is from the USA, 0 otherwise
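Both methods for nominal variables can be sketched in Python using the student example. The function names are assumptions, and the state lists are shortened to the values named on the slides.

```python
def nominal_dissimilarity(x, y):
    """Method 1, simple matching: (p - m) / p."""
    p = len(x)                                  # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


student_a = ("eng", "Turkey")
student_b = ("applied sc", "Turkey")
d_ab = nominal_dissimilarity(student_a, student_b)
print(d_ab)

countries = ["Turkey", "USA"]   # shortened list, for illustration only
print(one_hot("Turkey", countries), one_hot("USA", countries))
```

The binary variables produced by `one_hot` can then be compared with any dissimilarity measure for binary data.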

Ordinal Variables • An ordinal variable can be discrete or continuous • Order is important, e.g., rank • Can be treated like interval-scaled • replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$ • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ • compute the dissimilarity using methods for interval-scaled variables

Example • Credit card type: gold > silver > bronze > normal, 4 states • Education: grad > undergrad > highschool > primary school > no school, 5 states • Two customers • A(gold, highschool) • B(normal, no school) • rA,card = 1, rB,card = 4 • rA,edu = 3, rB,edu = 5 • zA,card = (1-1)/(4-1) = 0 • zB,card = (4-1)/(4-1) = 1 • zA,edu = (3-1)/(5-1) = 0.5 • zB,edu = (5-1)/(5-1) = 1 • Use any interval scale distance measure on the z values
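The rank mapping in this example can be reproduced with a few lines of Python. The helper name `ordinal_to_unit_interval` is an assumption; states are listed from rank 1 upward, as on the slide, and the Manhattan distance at the end is just one possible choice of interval-scale measure.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (r - 1) / (M - 1) in [0, 1]."""
    rank = ordered_states.index(value) + 1   # r_if, starting at 1
    m = len(ordered_states)                  # M_f, number of states
    return (rank - 1) / (m - 1)


card_states = ["gold", "silver", "bronze", "normal"]
edu_states = ["grad", "undergrad", "highschool", "primary school", "no school"]

customer_a = ("gold", "highschool")
customer_b = ("normal", "no school")

z_a = (ordinal_to_unit_interval(customer_a[0], card_states),
       ordinal_to_unit_interval(customer_a[1], edu_states))
z_b = (ordinal_to_unit_interval(customer_b[0], card_states),
       ordinal_to_unit_interval(customer_b[1], edu_states))

# One possible interval-scale measure on the z values: Manhattan distance
manhattan_ab = sum(abs(p - q) for p, q in zip(z_a, z_b))
print(z_a, z_b, manhattan_ab)
```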

Exercise • Find an attribute having both ordinal and nominal characteristics, and define a similarity or dissimilarity measure for two objects A and B

Ratio-Scaled Variables • Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$ • Methods: • treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted) • apply logarithmic transformation $y_{if} = \log(x_{if})$ • treat them as continuous ordinal data and treat their rank as interval-scaled

Example • Cluster individuals based on age, weight and height • All are ratio-scale variables • Mean transformation • $z_{p,i} = x_{p,i} / \mathrm{mean}_p$ • As absolute zero makes sense, measure distance in units of the mean of each variable • Then you may apply $z' = \log z$ • Then use any distance measure for interval scales

Example cont. • A weight difference of 0.5 kg is much more important for babies than for adults • d(3 kg, 3.5 kg) = 0.5 • d(71.0 kg, 71.5 kg) = 0.5 • d′(3 kg, 3.5 kg) = (3.5 − 3)/3 ≈ 0.167: relative difference, very significant, approximately log 3.5 − log 3 ≈ 0.154 (natural logarithm) • d′(71.0 kg, 71.5 kg) = (71.5 − 71.0)/71.0 ≈ 0.007 • Not important: log 71.5 − log 71 ≈ 0.007, almost zero
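The effect of the log transformation on the baby and adult weights can be checked with a short Python sketch. The function names are assumptions and natural logarithms are used; the adult pair is taken as 71.0 kg and 71.5 kg.

```python
import math


def absolute_diff(a, b):
    return abs(b - a)


def relative_diff(a, b):
    """Difference as a fraction of the smaller value."""
    return abs(b - a) / min(a, b)


def log_diff(a, b):
    """Difference after a natural-log transformation."""
    return abs(math.log(b) - math.log(a))


for low, high in [(3.0, 3.5), (71.0, 71.5)]:
    print(low, high,
          absolute_diff(low, high),
          round(relative_diff(low, high), 4),
          round(log_diff(low, high), 4))
```

Both pairs differ by the same 0.5 kg, but on the log scale the difference for the baby is roughly twenty times larger than for the adult.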

Examples from Sports • Weight classes (kg), listed as boxing / wrestling • 48 / 48 • 51 / 52 • 54 / 56 • 57 / 62 • 60 / 68 • 63.5 / 74 • 67 / 82 • 71 / 90 • 75 / 100 • 81 / 130

Variables of Mixed Types • A database may contain all the six types of variables • symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio • One may use a weighted formula to combine their effects: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$ • f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise • f is interval-based: use the normalized distance • f is ordinal or ratio-scaled • compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ • and treat $z_{if}$ as interval-scaled

$\delta_{ij}^{(f)}$ is a count (indicator) variable • $\delta_{ij}^{(f)} = 0$ if • f is a binary and asymmetric variable and • $x_{if} = x_{jf} = 0$ • $\delta_{ij}^{(f)} = 1$ otherwise
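A rough Python sketch of the combined measure is shown below. It assumes the weighted formula is the sum of the per-variable dissimilarities divided by the number of variables that count, where a variable is skipped only when it is asymmetric binary and both objects have the value 0. The function name, the type labels, the range-based normalization for interval variables, and the sample objects are all assumptions for illustration; ordinal and ratio variables are assumed to have been converted to interval values beforehand.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities over variables of mixed types.

    kinds[f] is one of "symmetric", "asymmetric", "nominal", "interval".
    ranges[f] is (max - min) of variable f; used only for interval variables.
    """
    numerator = 0.0
    counted = 0
    for f, kind in enumerate(kinds):
        if kind == "asymmetric" and x[f] == 0 and y[f] == 0:
            continue                    # indicator is 0: joint absence is ignored
        if kind == "interval":
            d_f = abs(x[f] - y[f]) / ranges[f]      # normalized distance
        else:
            d_f = 0.0 if x[f] == y[f] else 1.0      # binary or nominal
        numerator += d_f
        counted += 1                    # indicator is 1
    return numerator / counted if counted else 0.0


kinds = ["symmetric", "asymmetric", "asymmetric", "nominal", "interval"]
ranges = [None, None, None, None, 50.0]     # hypothetical range of the last variable
obj_1 = ("F", 1, 0, "red", 30.0)
obj_2 = ("M", 1, 0, "red", 40.0)
d_12 = mixed_dissimilarity(obj_1, obj_2, kinds, ranges)
print(d_12)
```

In this example the third variable is skipped because both objects have the value 0 on an asymmetric binary variable, so the result averages over the remaining four variables.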

Exercise • Construct an example data set containing all types of variables • Define the variables in that data set • and compute the distance between two sample objects

Basic Measures for Clustering • Clustering: Given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kf, 1 ≤ f ≤ k, such that for all tfp, tfq ∈ Kf and ts ∉ Kf, dis(tfp, tfq) ≤ dis(tfp, ts) • Centroid, radius, diameter • Typical alternatives to calculate the distance between clusters • Single link, complete link, average, centroid, medoid

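Centroid, Radius and Diameter of a Cluster (for numerical data sets) • For a cluster $K_m$ of $N$ points $t_{m1}, \dots, t_{mN}$ • Centroid: the “middle” of a cluster: $C_m = \frac{\sum_{p=1}^{N} t_{mp}}{N}$ • Radius: square root of the average squared distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{\sum_{p=1}^{N} (t_{mp} - C_m)^2}{N}}$ • Diameter: square root of the average squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{p=1}^{N} \sum_{q=1}^{N} (t_{mp} - t_{mq})^2}{N(N-1)}}$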
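These three quantities can be sketched in Python for a small numerical cluster. The function names and the four sample points are assumptions. The sketch takes the radius as the root of the average squared distance to the centroid, and the diameter as the root of the summed squared distances over all ordered pairs divided by N(N − 1).

```python
import math


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean point of the cluster."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / n for k in range(dim))


def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(squared_distance(p, c) for p in points) / len(points))


def diameter(points):
    """Square root of the average squared distance over all pairs."""
    n = len(points)
    if n < 2:
        return 0.0
    # Ordered pairs; a point paired with itself contributes zero
    total = sum(squared_distance(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))


cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]   # hypothetical
print(centroid(cluster), radius(cluster), diameter(cluster))
```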

Typical Alternatives to Calculate the Distance between Clusters • Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \min_{p,q} dis(t_{ip}, t_{jq})$ • Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \max_{p,q} dis(t_{ip}, t_{jq})$ • Average: average distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \mathrm{avg}_{p,q}\, dis(t_{ip}, t_{jq})$ • Centroid: distance between the centroids of two clusters, i.e., $dis(K_i, K_j) = dis(C_i, C_j)$ • Medoid: distance between the medoids of two clusters, i.e., $dis(K_i, K_j) = dis(M_i, M_j)$ • Medoid: one chosen, centrally located object in the cluster
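The five alternatives can be compared on a toy example with the Python sketch below. Everything here is illustrative: the function names and the two clusters are assumptions, Euclidean distance is used between points, and the medoid is taken to be the member with the smallest total distance to the other members, which is one common way to make "centrally located" concrete.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))


def medoid(cluster):
    # Assumption: the member with the smallest summed distance to all members
    return min(cluster, key=lambda m: sum(dist(m, p) for p in cluster))


def single_link(k1, k2):
    return min(dist(p, q) for p in k1 for q in k2)


def complete_link(k1, k2):
    return max(dist(p, q) for p in k1 for q in k2)


def average_link(k1, k2):
    return sum(dist(p, q) for p in k1 for q in k2) / (len(k1) * len(k2))


def centroid_distance(k1, k2):
    return dist(centroid(k1), centroid(k2))


def medoid_distance(k1, k2):
    return dist(medoid(k1), medoid(k2))


k1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical clusters
k2 = [(4.0, 0.0), (5.0, 0.0), (4.0, 1.0)]
for measure in (single_link, complete_link, average_link,
                centroid_distance, medoid_distance):
    print(measure.__name__, round(measure(k1, k2), 3))
```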

Major Clustering Approaches • Partitioning algorithms: Construct various partitions and then evaluate them by some criterion • Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion • Density-based: based on connectivity and density functions • Grid-based: based on a multiple-level granularity structure • Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model

Partitioning Algorithms: Basic Concept • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimum: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method • Given k, the k-means algorithm is implemented in four steps: • Step 1: Partition the objects into k nonempty subsets • Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) • Step 3: Assign each object to the cluster with the nearest seed point • Step 4: Go back to Step 2; stop when there are no more new assignments
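The four steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `kmeans`, the round-robin initial partition, the `max_iter` safeguard, the rule of keeping the previous center when a cluster becomes empty, and the sample points are all assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Return (labels, centers) following the four steps on the slide."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    dim = len(points[0])
    # Step 1: partition the objects into k nonempty subsets (round-robin)
    labels = [i % k for i in range(len(points))]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:     # keep the previous center if a cluster became empty
                centers[c] = tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(dim))
        # Step 3: assign each object to the cluster with the nearest center
        new_labels = [min(range(k),
                          key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical two-dimensional data with two visible groups
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

The result depends on the initial partition, so in practice the procedure is often run from several starting partitions and the best outcome is kept.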

The K-Means Clustering Method • Example (K = 2) • [Figure: sequence of scatter plots on 0–10 axes showing successive k-means iterations] • Arbitrarily choose K objects as initial cluster centers • Assign each object to the most similar center • Update the cluster means • Reassign • Update the cluster means • Reassign
