Course on Data Mining (581550-4)
[Course timeline: lectures 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.; topics Intro/Ass. Rules, Clustering, Episodes, KDD Process, Text Mining, Appl./Summary; Home Exam]
Course on Data Mining (581550-4): Today 14.11.2001
• Today's subject: Classification, clustering
• Next week's program:
  • Lecture: Data mining process
  • Exercise: Classification, clustering
  • Seminar: Classification, clustering
Classification and clustering
• Classification and prediction
• Clustering and similarity
Cluster analysis: Overview
• What is cluster analysis?
• Similarity and dissimilarity
• Types of data in cluster analysis
• Major clustering methods
  • Partitioning methods
  • Hierarchical methods
• Outlier analysis
• Summary
What is cluster analysis?
• Cluster: a collection of data objects
  • similar to one another within the same cluster
  • dissimilar to the objects in the other clusters
• Aim of clustering: to group a set of data objects into clusters
Typical uses of clustering
• As a stand-alone tool to get insight into the data distribution
• As a preprocessing step for other algorithms
Applications of clustering
• Marketing: discovering distinct customer groups in a purchase database
• Land use: identifying areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
What is good clustering?
• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• The quality of a clustering result depends on
  • the similarity measure used
  • the implementation of the similarity measure
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Requirements of clustering in data mining (1)
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
Requirements of clustering in data mining (2)
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Similarity and dissimilarity between objects (1)
• There is no single definition of similarity or dissimilarity between data objects
• The definition of similarity or dissimilarity between objects depends on
  • the type of the data considered
  • what kind of similarity we are looking for
Similarity and dissimilarity between objects (2)
• Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y)
• Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions:
  • d(x,y) ≥ 0 (non-negativity)
  • d(x,y) = 0 if and only if x = y (identity)
  • d(x,y) = d(y,x) (symmetry)
  • d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality)
Types of data in cluster analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
• Complex data types
Interval-scaled variables (1)
• Continuous measurements on a roughly linear scale
  • for example, weight, height, and age
• The measurement unit can affect the cluster analysis
• To avoid dependence on the measurement unit, we should standardize the data
Interval-scaled variables (2)
To standardize the measurements:
• calculate the mean absolute deviation
  s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|),
  where m_f = (1/n) (x_1f + x_2f + … + x_nf)
• calculate the standardized measurement (z-score)
  z_if = (x_if − m_f) / s_f
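As an illustration, a minimal Python sketch of this standardization (assuming NumPy is available; the height values are made up):

```python
import numpy as np

def standardize(x):
    """Standardize one interval-scaled variable (a 1-D array):
    z-scores based on the mean absolute deviation, which is
    less sensitive to outliers than the standard deviation."""
    m = x.mean()              # mean value m_f
    s = np.abs(x - m).mean()  # mean absolute deviation s_f
    return (x - m) / s        # standardized measurements z_if

heights_cm = np.array([162.0, 175.0, 158.0, 181.0, 169.0])  # hypothetical data
print(standardize(heights_cm))
```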
Interval-scaled variables (3)
• One popular group of distance measures for interval-scaled variables are the Minkowski distances
  d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q),
  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
Interval-scaled variables (4)
• If q = 1, the distance measure is the Manhattan (or city block) distance
  d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
• If q = 2, the distance measure is the Euclidean distance
  d(i,j) = (|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)^(1/2)
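A minimal sketch of the Minkowski distance in Python (NumPy assumed; the two example objects are hypothetical):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects.
    q=1 gives the Manhattan (city block) distance, q=2 the Euclidean."""
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, q=1))  # Manhattan: 5.0
print(minkowski(x, y, q=2))  # Euclidean: ~3.61
```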
Binary variables (1)
• A binary variable has only two states: 0 or 1
• A contingency table for binary data (counts of variables over two objects i and j):

                 Object j
                  1     0    sum
  Object i  1     a     b    a+b
            0     c     d    c+d
          sum    a+c   b+d     p
Binary variables (2)
• Simple matching coefficient (invariant similarity, if the binary variable is symmetric):
  d(i,j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):
  d(i,j) = (b + c) / (a + b + c)
Binary variables (3)
Example: dissimilarity between binary variables
• a patient record table
• eight attributes, of which
  • gender is a symmetric attribute, and
  • the remaining attributes are asymmetric binary
Binary variables (4)
• Let the values Y and P be set to 1, and the value N be set to 0
• Compute distances between patients based on the asymmetric variables by using the Jaccard coefficient
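A hedged sketch of this computation in Python: the patient vectors below are illustrative stand-ins coded as described (Y/P = 1, N = 0, asymmetric attributes only), not the original lecture table:

```python
def jaccard_dissimilarity(i, j):
    """d(i,j) = (b + c) / (a + b + c) for asymmetric binary vectors,
    where a = 1-1 matches, b = 1-0, c = 0-1; 0-0 matches are ignored."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Hypothetical patients over six asymmetric attributes
# (e.g., fever, cough, test-1..test-4); the symmetric gender
# attribute is left out of the distance computation
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(jaccard_dissimilarity(jack, mary))  # 1/3
print(jaccard_dissimilarity(jack, jim))   # 2/3
```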
Nominal variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  d(i,j) = (p − m) / p,
  where m is the number of matches and p is the total number of variables
• Method 2: use a large number of binary variables
  • create a new binary variable for each of the M nominal states
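Both methods admit a short Python sketch (the color/size values and state lists are invented for illustration):

```python
def nominal_dissimilarity(i, j):
    """Method 1, simple matching: d(i,j) = (p - m) / p,
    where m = number of matching variables, p = total number."""
    p = len(i)
    m = sum(1 for u, v in zip(i, j) if u == v)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: encode one nominal variable as M binary variables."""
    return [1 if value == s else 0 for s in states]

print(nominal_dissimilarity(["red", "small"], ["blue", "small"]))  # 0.5
print(one_hot("yellow", ["red", "yellow", "blue", "green"]))       # [0, 1, 0, 0]
```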
Ordinal variables
• An ordinal variable can be discrete or continuous
• The order of the values is important, e.g., rank
• Can be treated like interval-scaled variables:
  • replace each x_if by its rank r_if ∈ {1, …, M_f}
  • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    z_if = (r_if − 1) / (M_f − 1)
  • compute the dissimilarity using methods for interval-scaled variables
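A minimal sketch of this rank mapping in Python (the medal scale and its ordering are assumed examples):

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1]: replace each value by its rank
    r_if in 1..M_f given the state ordering, then compute
    z_if = (r_if - 1) / (M_f - 1)."""
    M = len(order)
    rank = {v: k + 1 for k, v in enumerate(order)}
    return [(rank[v] - 1) / (M - 1) for v in values]

print(ordinal_to_interval(["bronze", "gold", "silver"],
                          order=["bronze", "silver", "gold"]))
# [0.0, 1.0, 0.5]
```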
Ratio-scaled variables
• A positive measurement on a nonlinear scale, approximately on an exponential scale
  • for example, Ae^(Bt) or Ae^(−Bt)
• Methods:
  • treat them like interval-scaled variables: not a good choice! (why?)
  • apply a logarithmic transformation y_if = log(x_if)
  • treat them as continuous ordinal data and treat their ranks as interval-scaled
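For instance, the logarithmic transformation is a single NumPy call (the growth series below is fabricated to be roughly exponential):

```python
import numpy as np

growth = np.array([1.0, 2.7, 7.4, 20.1, 54.6])  # roughly Ae^(Bt) with A = B = 1
y = np.log(growth)                              # y_if = log(x_if)
print(y)                                        # approximately 0, 1, 2, 3, 4
```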
Variables of mixed types (1)
• A database may contain all six types of variables
• One may use a weighted formula to combine their effects:
  d(i,j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f),
  where the indicator δ_ij^(f) = 0 if x_if or x_jf is missing, or if x_if = x_jf = 0 and variable f is asymmetric binary; otherwise δ_ij^(f) = 1
Variables of mixed types (2)
Contribution of variable f to the distance d(i,j):
• if f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
• if f is interval-based: use the normalized distance
  d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf)
• if f is ordinal or ratio-scaled:
  • compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  • treat z_if as interval-scaled
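Putting the pieces together, a hedged Python sketch of the weighted mixed-type distance (the variable types, the range normalizer, and the two example objects are all assumptions; ordinal and ratio-scaled variables are expected to be pre-mapped to [0, 1] as above and then handled as interval-based):

```python
def mixed_distance(i, j, types, ranges):
    """Weighted combination over variables of mixed types:
    d(i,j) = sum_f delta_f * d_f / sum_f delta_f, with each
    per-variable contribution d_f normalized to [0, 1]."""
    num, den = 0.0, 0.0
    for f, (xi, xj) in enumerate(zip(i, j)):
        if xi is None or xj is None:   # delta_f = 0: value missing
            continue
        if types[f] in ("binary", "nominal"):
            d = 0.0 if xi == xj else 1.0
        elif types[f] == "interval":   # normalized absolute difference
            d = abs(xi - xj) / ranges[f]
        else:
            raise ValueError(types[f])
        num += d                       # delta_f = 1 here
        den += 1.0
    return num / den

# Hypothetical objects: (color, weight in kg, smoker)
types = ["nominal", "interval", "binary"]
ranges = {1: 50.0}                     # max - min of the weight variable
print(mixed_distance(("red", 70.0, 1), ("blue", 80.0, 1), types, ranges))  # 0.4
```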
Complex data types
• Not all objects considered in data mining are relational => complex types of data
  • examples of such data are spatial data, multimedia data, genetic data, time-series data, text data, and data collected from the World Wide Web
• Often require totally different similarity or dissimilarity measures than those above
  • this can, for example, mean using string and/or sequence matching, or methods of information retrieval
Major clustering methods
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods (conceptual clustering, neural networks)
Partitioning methods
• A partitioning method constructs a partition of a database D of n objects into a set of k clusters such that
  • each cluster contains at least one object
  • each object belongs to exactly one cluster
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Criteria for judging the quality of partitions
• Global optimum: exhaustively enumerate all partitions
• Heuristic methods:
  • k-means (MacQueen '67): each cluster is represented by the center of the cluster (centroid)
  • k-medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster (medoid)
K-means clustering method (1)
• Input to the algorithm: the number of clusters k and a database of n objects
• The algorithm consists of four steps:
  1. partition the objects into k nonempty subsets/clusters
  2. compute a seed point as the centroid (the mean of the objects in the cluster) for each cluster in the current partition
  3. assign each object to the cluster with the nearest centroid
  4. go back to Step 2; stop when there are no more new assignments
K-means clustering method (2)
An alternative algorithm also consists of four steps:
  1. arbitrarily choose k objects as the initial cluster centers (centroids)
  2. (re)assign each object to the cluster with the nearest centroid
  3. update the centroids
  4. go back to Step 2; stop when there are no more new assignments
K-means clustering method: Example
[Figure: example run of k-means clustering]
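A minimal Python sketch of the alternative algorithm above (NumPy assumed; the four 2-D points and the seed are arbitrary), not the original lecture example:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Pick k objects as initial centroids, then alternate
    (re)assignment and centroid updates until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each object to the nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                      # step 4: no more new assignments
        assignment = new_assignment
        # step 3: update each centroid to the mean of its cluster
        for c in range(k):
            members = X[assignment == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return assignment, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels, centers = k_means(X, k=2)
print(labels, centers)
```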
Strengths of the k-means clustering method
• Relatively scalable when processing large data sets
• Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
• Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms
Weaknesses of the k-means clustering method
• Applicable only when the mean of the objects is defined
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes, or clusters of very different sizes
Variations of the k-means clustering method (1)
• A few variants of k-means, which differ in
  • the selection of the initial k centroids
  • dissimilarity calculations
  • the strategies for calculating cluster centroids
Variations of the k-means clustering method (2)
• Handling categorical data: k-modes (Huang '98)
  • replaces the means of clusters with modes
  • uses new dissimilarity measures to deal with categorical objects
  • uses a frequency-based method to update the modes of clusters
• For a mixture of categorical and numerical data: the k-prototype method
K-medoids clustering method
• Input to the algorithm: the number of clusters k and a database of n objects
• The algorithm consists of four steps:
  1. arbitrarily choose k objects as the initial medoids (representative objects)
  2. assign each remaining object to the cluster with the nearest medoid
  3. select a nonmedoid object and replace one of the medoids with it if this improves the clustering
  4. go back to Step 2; stop when there are no more new assignments
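A hedged Python sketch of this greedy swapping idea, working on a precomputed distance matrix D (the example points and the Manhattan metric are arbitrary choices):

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances from each object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def k_medoids(D, k, seed=0):
    """Repeatedly try swapping a medoid with a nonmedoid and keep
    the swap whenever it lowers the total cost."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for m in range(k):
            for o in range(len(D)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                 # swap medoid m for object o
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    return medoids, D[:, medoids].argmin(axis=1)  # medoids + assignment

X = np.array([[0.0], [0.1], [5.0], [5.1]])
D = np.abs(X - X.T)                               # pairwise Manhattan distances
print(k_medoids(D, k=2))
```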
Hierarchical methods
• A hierarchical method constructs a hierarchy of clusterings, not just a single partition of the objects
• The number of clusters k is not required as an input
• Uses a distance matrix as the clustering criterion
• A termination condition can be used (e.g., a desired number of clusters)
A tree of clusterings
• The hierarchy of clusterings is often given as a clustering tree, also called a dendrogram
  • leaves of the tree represent the individual objects
  • internal nodes of the tree represent the clusters
Two types of hierarchical methods (1)
Two main types of hierarchical clustering techniques:
• agglomerative (bottom-up):
  • place each object in its own cluster (a singleton)
  • in each step, merge the two most similar clusters until there is only one cluster left or the termination condition is satisfied
• divisive (top-down):
  • start with one big cluster containing all the objects
  • divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied
Two types of hierarchical methods (2)
[Figure: objects a, b, c, d, e clustered agglomeratively over steps 0-4 (a and b merge into ab, d and e into de, then cde, then abcde), and divisively in the reverse order, steps 4-0]
Inter-cluster distances
• Three widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters C_i and C_j, are the
  • single linkage method (nearest neighbor):
    d(C_i, C_j) = min { d(x, y) | x ∈ C_i, y ∈ C_j }
  • complete linkage method (furthest neighbor):
    d(C_i, C_j) = max { d(x, y) | x ∈ C_i, y ∈ C_j }
  • average linkage method (unweighted pair-group average):
    d(C_i, C_j) = avg { d(x, y) | x ∈ C_i, y ∈ C_j }
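All three linkages are available in SciPy's hierarchical clustering routines; a small sketch (the five 2-D points are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0], [10.0, 5.0]])

# Agglomerative clustering under the three inter-cluster distances
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```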
Strengths of hierarchical methods
• Conceptually simple
• Theoretical properties are well understood
• When clusters are merged/split, the decision is permanent => the number of different alternatives that need to be examined is reduced
Weaknesses of hierarchical methods
• Merging/splitting of clusters is permanent => erroneous decisions are impossible to correct later
• Divisive methods can be computationally hard
• The methods are not (necessarily) scalable for large data sets
Outlier analysis (1)
• Outliers
  • are objects that are considerably dissimilar from the remainder of the data
  • can be caused by a measurement or execution error, or
  • are the result of inherent data variability
• Many data mining algorithms try
  • to minimize the influence of outliers
  • to eliminate the outliers
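As one simple illustration of such outlier handling, a sketch that flags objects far from the mean, measured in mean-absolute-deviation units (the threshold of 2.5 and the data are arbitrary; this is only one classical criterion among many):

```python
import numpy as np

def flag_outliers(x, threshold=2.5):
    """Flag values whose deviation from the mean exceeds
    `threshold` mean absolute deviations."""
    m = x.mean()
    s = np.abs(x - m).mean()
    return np.abs(x - m) / s > threshold

x = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 42.0])
print(flag_outliers(x))   # only the last value is flagged
```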
Outlier analysis (2)
• Minimizing the effect of outliers and/or eliminating the outliers may cause information loss
• Outliers themselves may be of interest => outlier mining
• Applications of outlier mining
  • Fraud detection
  • Customized marketing
  • Medical treatments
Summary (1)
• Cluster analysis groups objects based on their similarity
• Cluster analysis has wide applications
• Measures of similarity can be computed for various types of data
• The selection of a similarity measure depends on the data used and the type of similarity we are searching for
Summary (2)
• Clustering algorithms can be categorized into
  • partitioning methods,
  • hierarchical methods,
  • density-based methods,
  • grid-based methods, and
  • model-based methods
• There are still lots of open research issues in cluster analysis
Seminar Presentations/Groups 7-8
Classification of spatial data
K. Koperski, J. Han, N. Stefanovic: "An Efficient Two-Step Method of Classification of Spatial Data", SDH'98