This research seminar explores hierarchical clustering, density-based clustering, and graph-based clustering methods for mining biological data. Topics include AGNES, DBSCAN, OPTICS, and more.
EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006
Administrative • If you haven’t scheduled a meeting with the instructor for the class project, please do so as soon as possible.
Overview • Hierarchical clustering • Density-based clustering • Graph-based clustering • Subspace clustering
Hierarchical Clustering • Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition. (Figure: dendrogram over points a–e, built bottom-up in steps 0–4 by agglomerative clustering (AGNES) and top-down by divisive clustering (DIANA).)
AGNES (Agglomerative Nesting) • Introduced in Kaufmann and Rousseeuw (1990) • Implemented in statistical analysis packages, e.g., Splus • Use the Single-Link method and the dissimilarity matrix. • Merge nodes that have the least dissimilarity • Go on in a non-descending fashion • Eventually all nodes belong to the same cluster
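To make the single-link merging concrete, here is a minimal sketch (not the AGNES code from Kaufmann and Rousseeuw; SciPy's agglomerative routine is used instead, and the data set and distance cutoff are assumed for illustration):

```python
# Illustrative single-link agglomerative clustering on a tiny 2-D data set.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 0.5]])

# 'single' merges the pair of clusters with the smallest minimum pairwise distance.
Z = linkage(X, method="single")

# Cut the dendrogram at a distance threshold (the termination condition).
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # three clusters: the two tight pairs and the lone point
```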
Density-Based Clustering Methods • Clustering based on density (local cluster criterion), such as density-connected points • Major features: • Discover clusters of arbitrary shape • Handle noise • One scan • Need density parameters as termination condition • Several studies: • DBSCAN: Ester, et al. (KDD’96) • OPTICS: Ankerst, et al (SIGMOD’99). • DENCLUE: Hinneburg & D. Keim (KDD’98) • CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts • Two parameters: • Eps: maximum radius of the neighbourhood • MinPts: minimum number of points in an Eps-neighbourhood of that point • NEps(p) = {q belongs to D | dist(p, q) <= Eps} • Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if • p belongs to NEps(q) • core point condition: |NEps(q)| >= MinPts • Direct density-reachability is asymmetric (Figure: p directly density-reachable from q, with MinPts = 5, Eps = 1 cm.)
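A minimal sketch of the Eps-neighbourhood and the core-point test defined above; the function and variable names are illustrative, not from the original slides or paper:

```python
import numpy as np

def eps_neighborhood(D, p_idx, eps):
    """Indices q with dist(D[p_idx], D[q]) <= eps (includes p itself)."""
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    return np.where(dists <= eps)[0]

def is_core(D, p_idx, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(D, p_idx, eps)) >= min_pts

D = np.random.rand(50, 2)
print(is_core(D, 0, eps=0.2, min_pts=5))
```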
Density-Reachable and Density-Connected • Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi • Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise • A cluster is a maximal set of density-connected points • Discovers clusters of arbitrary shape in spatial databases with noise • A point is a core point if it has more than a specified number of points (MinPts) within Eps; these points lie in the interior of a cluster • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point • A noise point (outlier) is any point that is neither a core point nor a border point (Figure: core, border, and outlier points with Eps = 1 cm, MinPts = 5.)
DBSCAN: The Algorithm • Classify all points as core, border, or noise w.r.t. Eps and MinPts • Select a point p arbitrarily • If p is a core point, a cluster is formed • If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database • Continue the process until all of the points have been processed
DBSCAN Algorithm • Eliminate noise points • Perform clustering on the remaining points
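As a hedged illustration of the overall procedure, the sketch below runs DBSCAN through scikit-learn rather than hand-coding the point-by-point loop; the data set and parameter values are assumed:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([
    np.random.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    np.random.normal(loc=[5, 5], scale=0.3, size=(100, 2)),
    np.random.uniform(low=-2, high=7, size=(20, 2)),   # scattered noise
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                  # -1 marks noise points
core_idx = db.core_sample_indices_   # indices of the core points
print(set(labels), len(core_idx))
```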
DBSCAN: Core, Border, and Noise Points (Figure: the original points and their point types, core, border, and noise, for Eps = 10, MinPts = 4.)
When DBSCAN Does NOT Work Well • Varying densities • High-dimensional data (Figure: the original points and two failing parameter settings, MinPts = 4 with Eps = 9.75 and with Eps = 9.92.)
OPTICS: A Cluster-Ordering Method (1999) • OPTICS: Ordering Points To Identify the Clustering Structure • Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99) • Produces a special ordering of the database w.r.t. its density-based clustering structure • Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure • Can be represented graphically or using visualization techniques • This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings (distance parameters)
Core Distance • MinPts-distance(p) for an object p is the distance between p and its MinPts-th nearest neighbor • Core-distance(p) is MinPts-distance(p) for a core object • Core-distance(p) is not defined for non-core objects
Reachability Distance • The reachability-distance of an object p with respect to another core object o is the smallest distance such that p is directly density-reachable from o • For a pair p and o (o a core object): reachability-distance(p, o) = max(core-distance(o), distance(o, p))
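A small sketch of core-distance and reachability-distance as defined above; the conventions are simplified (the point counts as part of its own neighbourhood) and the names are illustrative:

```python
import numpy as np

def core_distance(D, o, eps, min_pts):
    dists = np.sort(np.linalg.norm(D - D[o], axis=1))  # dists[0] == 0 (o itself)
    d = dists[min_pts - 1]                              # distance to the MinPts-th object
    return d if d <= eps else None                      # undefined for non-core objects

def reachability_distance(D, p, o, eps, min_pts):
    cd = core_distance(D, o, eps, min_pts)
    if cd is None:
        return None                                     # o is not a core object
    return max(cd, np.linalg.norm(D[p] - D[o]))

D = np.random.rand(30, 2)
print(core_distance(D, 0, eps=0.5, min_pts=5),
      reachability_distance(D, 1, 0, eps=0.5, min_pts=5))
```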
Object Ordering (Figure: reachability plot, i.e., the reachability-distance of each object, including undefined values, plotted in the cluster-order of the objects.)
DENCLUE: Using Statistical Density Functions • Uses statistical density estimation • Major features • Solid mathematical foundation • Good for data sets with large amounts of noise • Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets • Significantly faster than existing algorithms (e.g., DBSCAN) • But needs a large number of parameters • “An Efficient Approach to Clustering in Large Multimedia Databases with Noise” by Hinneburg & Keim (KDD’98)
Gradient: the steepness of a slope (example figure).
Example: Density Computation • D = {x1, x2, x3, x4} • fD_Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78 • Remark: the density value at y would be larger than the one at x (Figure: points x1, …, x4 with influences 0.04, 0.06, 0.08, 0.6 at the query point x, and a second query point y.)
Denclue: Technical Essence • Uses grid cells but only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure • Influence function: describes the impact of a data point within its neighborhood • Overall density of the data space can be calculated as the sum of the influence functions of all data points • Clusters can be determined mathematically by identifying density attractors • Density attractors are local maxima of the overall density function
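The following sketch illustrates a Gaussian influence function and the resulting overall density, in the spirit of the density-computation example above; the data points and the smoothing parameter sigma are assumed:

```python
import numpy as np

def gaussian_influence(x, xi, sigma=1.0):
    """Influence of data point xi at location x."""
    return np.exp(-np.linalg.norm(x - xi) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density: sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

data = np.array([[1.0, 1.0], [1.5, 1.2], [4.0, 4.0], [4.2, 3.9]])
print(density(np.array([1.2, 1.1]), data))   # near a dense region -> higher value
print(density(np.array([8.0, 8.0]), data))   # far from the data -> close to 0
```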
Graph-Based Clustering • Graph-Based clustering uses the proximity graph • Start with the proximity matrix • Consider each point as a node in a graph • Each edge between two nodes has a weight which is the proximity between the two points • Initially the proximity graph is fully connected • MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph • In the simplest case, clusters are connected components in the graph.
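A minimal sketch of the simplest case mentioned above: connect points closer than a threshold and report connected components as clusters. The threshold value and the toy data are assumed:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import squareform, pdist

X = np.array([[0, 0], [0.5, 0.2], [0.4, 0.6], [5, 5], [5.3, 5.1]])
dist = squareform(pdist(X))

threshold = 1.0                      # assumed proximity threshold
adj = csr_matrix(dist <= threshold)  # keep only edges between nearby points

n_clusters, labels = connected_components(adj, directed=False)
print(n_clusters, labels)            # two components: the first three points and the last two
```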
Graph-Based Clustering: Sparsification • The amount of data that needs to be processed is drastically reduced • Sparsification can eliminate more than 99% of the entries in a proximity matrix • The amount of time required to cluster the data is drastically reduced • The size of the problems that can be handled is increased
Graph-Based Clustering: Sparsification … • Clustering may work better • Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points • The nearest neighbors of a point tend to belong to the same class as the point itself • This reduces the impact of noise and outliers and sharpens the distinction between clusters • Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning), e.g., Chameleon and hypergraph-based clustering
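As an illustration of sparsification, the sketch below keeps only each point's k nearest neighbours in the proximity matrix; k and the data are assumed values:

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def knn_sparsify(X, k):
    dist = squareform(pdist(X))
    np.fill_diagonal(dist, np.inf)            # ignore self-distances
    sparse = np.zeros_like(dist, dtype=bool)
    for i in range(len(X)):
        nn = np.argsort(dist[i])[:k]          # indices of the k nearest neighbours
        sparse[i, nn] = True
    return sparse | sparse.T                  # symmetrize the kept edges

X = np.random.rand(20, 2)
adj = knn_sparsify(X, k=3)
print(adj.sum(), "edges kept out of", X.shape[0] * (X.shape[0] - 1))
```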
Chameleon: Clustering Using Dynamic Modeling • Adapts to the characteristics of the data set to find the natural clusters • Uses a dynamic model to measure the similarity between clusters • The main properties are the relative closeness and relative interconnectivity of the clusters • Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters • The merging scheme preserves self-similarity • One of the areas of application is spatial data
Characteristics of Spatial Data Sets • Clusters are defined as densely populated regions of the space • Clusters have arbitrary shapes, orientation, and non-uniform sizes • Difference in densities across clusters and variation in density within clusters • Existence of special artifacts and noise The clustering algorithm must address the above characteristics and also require minimal supervision.
Chameleon: Steps • Preprocessing Step: Represent the Data by a Graph • Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors • Concept of neighborhood is captured dynamically (even if region is sparse) • Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices • Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-cluster of a “real” cluster
Chameleon: Steps … • Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters • Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters • Two key properties used to model cluster similarity: • Relative Interconnectivity: absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters • Relative Closeness: absolute closeness of two clusters normalized by the internal closeness of the clusters • CHAMELEON measures the closeness of two clusters by computing the average similarity between the points in Ci that are connected to points in Cj
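The sketch below is an illustrative approximation of these two measures, not the exact CHAMELEON formulas: interconnectivity is the total weight of edges crossing the two sub-clusters normalized by their internal edge weights, and closeness is the average weight of the connecting edges. The similarity matrix W is an assumed toy input:

```python
import numpy as np

def cross_edges(W, ci, cj):
    """Weights of edges connecting cluster ci to cluster cj in weighted adjacency W."""
    return W[np.ix_(ci, cj)].ravel()

def relative_interconnectivity(W, ci, cj):
    ec_cross = cross_edges(W, ci, cj).sum()
    ec_i = W[np.ix_(ci, ci)].sum() / 2.0      # internal connectivity of ci
    ec_j = W[np.ix_(cj, cj)].sum() / 2.0      # internal connectivity of cj
    return ec_cross / (0.5 * (ec_i + ec_j) + 1e-12)

def closeness(W, ci, cj):
    """Average similarity between points in ci that are connected to points in cj."""
    cross = cross_edges(W, ci, cj)
    cross = cross[cross > 0]
    return cross.mean() if cross.size else 0.0

# W: assumed symmetric similarity (k-NN graph) matrix over 6 points
W = np.random.rand(6, 6); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
print(relative_interconnectivity(W, [0, 1, 2], [3, 4, 5]),
      closeness(W, [0, 1, 2], [3, 4, 5]))
```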
Grid-Based Clustering Method • Uses a multi-resolution grid data structure • Several interesting methods • STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) • A multi-resolution clustering approach using the wavelet method • CLIQUE: Agrawal, et al. (SIGMOD’98) • On high-dimensional data (thus covered in the section on clustering high-dimensional data)
STING: A Statistical Information Grid Approach • Wang, Yang and Muntz (VLDB’97) • The spatial area is divided into rectangular cells • There are several levels of cells corresponding to different levels of resolution
The STING Clustering Method • Each cell at a high level is partitioned into a number of smaller cells at the next lower level • Statistical info of each cell is calculated and stored beforehand and is used to answer queries • Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells • count, mean, standard deviation (s), min, max • type of distribution—normal, uniform, etc. • Use a top-down approach to answer spatial data queries • Start from a pre-selected layer—typically with a small number of cells • For each cell in the current level compute the confidence interval
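A small sketch of how a parent cell's parameters can be computed directly from its child cells; the Cell fields shown are an illustrative subset of the parameters listed above (standard deviation and distribution type are omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    count: int
    mean: float
    min: float
    max: float

def merge_cells(children):
    """Parent-cell parameters computed directly from the lower-level cells."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n   # count-weighted mean
    return Cell(count=n,
                mean=mean,
                min=min(c.min for c in children),
                max=max(c.max for c in children))

children = [Cell(10, 2.0, 0.5, 3.0), Cell(5, 4.0, 3.5, 6.0),
            Cell(0, 0.0, float("inf"), float("-inf")), Cell(20, 1.0, 0.1, 2.0)]
print(merge_cells(children))
```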
Comments on STING • Remove the irrelevant cells from further consideration • When finished examining the current layer, proceed to the next lower level • Repeat this process until the bottom layer is reached • Advantages: • Query-independent, easy to parallelize, incremental update • O(K), where K is the number of grid cells at the lowest level • Disadvantages: • All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
WaveCluster: Clustering by Wavelet Analysis (1998) • Sheikholeslami, Chatterjee, and Zhang (VLDB’98) • A multi-resolution clustering approach which applies the wavelet transform to the feature space • How to apply the wavelet transform to find clusters • Summarize the data by imposing a multidimensional grid structure onto the data space • These multidimensional spatial data objects are represented in an n-dimensional feature space • Apply the wavelet transform on the feature space to find the dense regions in the feature space • Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse
Wavelet Transform • Wavelet transform: a signal processing technique that decomposes a signal into different frequency sub-bands (can be applied to n-dimensional signals) • Data are transformed to preserve relative distances between objects at different levels of resolution • Allows natural clusters to become more distinguishable
The WaveCluster Algorithm • Input parameters • # of grid cells for each dimension • the wavelet, and the # of applications of the wavelet transform • Why is the wavelet transform useful for clustering? • Uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information on their boundary • Effective removal of outliers, multi-resolution, cost effective • Major features: • Detects arbitrarily shaped clusters at different scales • Not sensitive to noise, not sensitive to input order • Only applicable to low-dimensional data • Both grid-based and density-based
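To illustrate the grid-plus-wavelet idea, the sketch below quantizes points into a grid of counts and applies one level of a 2-D Haar averaging step, a simple stand-in for the low-pass part of the wavelet transform; the grid size and data are assumed:

```python
import numpy as np

def grid_counts(X, bins=8):
    """Quantize 2-D points into a bins x bins grid of counts."""
    hist, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    return hist

def haar_lowpass(grid):
    """Average 2x2 blocks: the approximation sub-band of a 1-level Haar transform."""
    return (grid[0::2, 0::2] + grid[1::2, 0::2] +
            grid[0::2, 1::2] + grid[1::2, 1::2]) / 4.0

X = np.vstack([np.random.normal([0.3, 0.3], 0.05, (200, 2)),
               np.random.normal([0.8, 0.7], 0.05, (200, 2))])
coarse = haar_lowpass(grid_counts(X, bins=8))
print(coarse.round(1))   # dense regions remain visible at the coarser resolution
```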
Quantization & Transformation • First, quantize the data into an m-dimensional grid structure, then apply the wavelet transform • a) scale 1: high resolution • b) scale 2: medium resolution • c) scale 3: low resolution
Clustering High-Dimensional Data • Clustering high-dimensional data • Many applications: text documents, DNA micro-array data • Major challenges: • Many irrelevant dimensions may mask clusters • Distance measure becomes meaningless—due to equi-distance • Clusters may exist only in some subspaces
Clustering in High Dimensional Space • Methods • Feature transformation: only effective if most dimensions are relevant • PCA & SVD useful only when features are highly correlated/redundant • Feature selection: wrapper or filter approaches • useful to find a subspace where the data have nice clusters • Subspace-clustering: find clusters in all the possible subspaces • CLIQUE, ProClus, and frequent pattern-based clustering
The Curse of Dimensionality(graphs adapted from Parsons et al. KDD Explorations 2004) • Data in only one dimension are relatively packed • Adding a dimension “stretches” the points across that dimension, moving them farther apart • Density decreases dramatically • Distance measures become meaningless—due to equi-distance
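A small demo of the equi-distance effect: as the dimensionality grows, the ratio between the farthest and nearest pairwise distances shrinks toward 1 (the sample size and dimensions are arbitrary choices):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [1, 2, 10, 100, 1000]:
    X = rng.random((200, d))          # 200 uniform points in d dimensions
    dist = pdist(X)
    print(d, round(dist.max() / dist.min(), 2))   # max/min distance ratio
```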
Why Subspace Clustering?(adapted from Parsons et al. SIGKDD Explorations 2004) • Clusters may exist only in some subspaces • Subspace-clustering: find clusters in all the subspaces
CLIQUE (Clustering In QUEst) • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98) • Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space • CLIQUE can be considered both density-based and grid-based • It partitions each dimension into the same number of equal-length intervals • It partitions an m-dimensional data space into non-overlapping rectangular units • A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter • A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps • Partition the data space and find the number of points that lie inside each cell of the partition • Identify the subspaces that contain clusters using the Apriori principle • Identify clusters • Determine dense units in all subspaces of interest • Determine connected dense units in all subspaces of interest • Generate a minimal description for the clusters • Determine maximal regions that cover a cluster of connected dense units for each cluster • Determine the minimal cover for each cluster
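As a sketch of the first two steps restricted to 1-D subspaces, the code below partitions each dimension into equal-length intervals and keeps the units whose fraction of points exceeds the density threshold; xi (number of intervals) and tau (density threshold) are assumed parameter names and values:

```python
import numpy as np

def dense_units_1d(X, xi=10, tau=0.05):
    """For each dimension, return the interval indices whose fraction of points exceeds tau."""
    n, d = X.shape
    dense = {}
    for dim in range(d):
        # equal-length intervals over this dimension's range
        edges = np.linspace(X[:, dim].min(), X[:, dim].max(), xi + 1)
        counts, _ = np.histogram(X[:, dim], bins=edges)
        dense[dim] = np.where(counts / n > tau)[0]
    return dense

X = np.random.rand(500, 3)
print(dense_units_1d(X))   # candidate 1-D dense units; higher-dimensional units are
                           # built Apriori-style (a unit can be dense only if all its
                           # lower-dimensional projections are dense)
```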
(Figure: CLIQUE example showing dense units in the Salary (10,000) vs. age and Vacation (weeks) vs. age subspaces, ages roughly 30–50, and the resulting cluster in the combined (salary, vacation, age) space at Vacation = 3.)
Strength and Weakness of CLIQUE • Strength • automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces • insensitive to the order of records in the input and does not presume any canonical data distribution • scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases • Weakness • The accuracy of the clustering result may be degraded at the expense of the simplicity of the method