Subspace Clustering/Biclustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu
Data Mining: Clustering K-means clustering minimizes the within-cluster sum of squared distances, SSE = Σ_{i=1..k} Σ_{x ∈ C_i} ||x − μ_i||², where μ_i is the centroid (mean) of cluster C_i.
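As a quick illustration, here is a minimal sketch (Python/NumPy; the function name and variable layout are my own, not from the slides) of the objective that k-means minimizes:

    import numpy as np

    def kmeans_objective(X, labels, centroids):
        # X: (n, d) data matrix, labels: (n,) array of cluster indices,
        # centroids: (k, d) cluster means.
        # Sum over clusters of squared distances of members to their centroid.
        return sum(np.sum((X[labels == c] - centroids[c]) ** 2)
                   for c in range(len(centroids)))

K-means alternates between assigning points to the nearest centroid and recomputing the centroids; neither step can increase this quantity.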
Clustering by Pattern Similarity (p-Clustering) • The micro-array “raw” data shows 3 genes and their values in a multi-dimensional space • Parallel Coordinates Plots • Difficult to find their patterns • “non-traditional” clustering
Motivation • E-Commerce: collaborative filtering
Biclustering of Gene Expression Data • Genes not regulated under all conditions • Genes regulated by multiple factors/processes concurrently • Key to determine function of genes • Key to determine classification of conditions
Motivation • DNA microarray analysis
Motivation • Strong coherence is exhibited by the selected objects on the selected attributes. • They are not necessarily close to each other but rather bear a constant shift. • Object/attribute bias • bi-cluster
Challenges • The set of objects and the set of attributes are usually unknown. • Different objects/attributes may possess different biases, and such biases • may be local to the set of selected objects/attributes • are usually unknown in advance • The matrix may have many unspecified entries
What’s Biclustering? • Given an n x m matrix, A, find a set of submatrices, Bk, such that the contents of each Bk follow a desired pattern • Row/Column order need not be consistent between different Bks
Bipartite Graphs • The matrix can be thought of as a graph: rows are one set of vertices L, columns are another set R • Edges are weighted by the corresponding entries in the matrix • If all weights are binary, biclustering becomes biclique finding
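To make the last point concrete, a small sketch (Python/NumPy; the helper name is mine): in a binary matrix, a bicluster whose selected entries are all 1 corresponds exactly to a biclique between its row vertices and column vertices.

    import numpy as np

    def is_biclique(B, rows, cols):
        # In a 0/1 matrix B, the chosen rows and columns form a biclique
        # iff every selected entry is 1, i.e. every cross edge between L and R exists.
        return bool(np.all(B[np.ix_(rows, cols)] == 1))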
Previous Work • Subspace clustering • Identifying a set of objects and a set of attributes such that the set of objects are physically close to each other on the subspace formed by the set of attributes. • Collaborative filtering: Pearson R • Only considers global offset of each object/attribute.
bi-cluster • Consists of a (sub)set of objects and a (sub)set of attributes • Corresponds to a submatrix • Occupancy threshold • Each object/attribute has to be filled by a certain percentage. • Volume: number of specified entries in the submatrix • Base: average value of each object/attribute (in the bi-cluster)
Cheng and Church • Example: • Correlation between any two columns = correlation between any two rows = 1. • aij = aiJ + aIj – aIJ, where aiJ = mean of row i, aIj = mean of column j, aIJ = mean of A. • Biological meaning: the genes have the same (amount of) response to the conditions.
bi-cluster • Perfect δ-cluster • Imperfect δ-cluster • Residue of an entry: r_ij = d_ij − d_iJ − d_Ij + d_IJ, where d_iJ is the row mean, d_Ij the column mean, and d_IJ the overall mean of the bi-cluster
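A hedged sketch (Python/NumPy; function and variable names are mine) of the residue computation: for a matrix generated exactly by the additive model of the previous slide (entry = row effect + column effect), every residue is zero, so it is a perfect δ-cluster.

    import numpy as np

    def residue_matrix(D):
        # r_ij = d_ij - d_iJ - d_Ij + d_IJ for every entry of the bi-cluster D
        row_means = D.mean(axis=1, keepdims=True)   # d_iJ
        col_means = D.mean(axis=0, keepdims=True)   # d_Ij
        overall = D.mean()                          # d_IJ
        return D - row_means - col_means + overall

    # A perfect bi-cluster: pure row effects plus column effects -> all residues are 0.
    row_effect = np.array([[1.0], [3.0], [5.0]])
    col_effect = np.array([[0.0, 2.0, 4.0, 6.0]])
    D = row_effect + col_effect
    print(np.allclose(residue_matrix(D), 0))        # True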
Cheng and Church • Model: • A bicluster is represented by a submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix). • Each entry Aij in the bicluster is the superposition (summation) of: • The background level • The row (gene) effect • The column (condition) effect • A dataset contains a number of biclusters, which are not necessarily disjoint.
Cheng and Church • Finding the largest δ-bicluster: • The problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard. • Objective function for heuristic methods (to minimize): the mean squared residue H(I, J) = (1/(|I||J|)) Σ_{i∈I, j∈J} (r_ij)², which decomposes into a sum of components from each row and column; this suggests simple greedy algorithms that evaluate each row and column independently.
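Written out as code (a sketch that reuses residue_matrix from the earlier sketch; the per-row and per-column terms are what the greedy algorithms exploit):

    import numpy as np

    def mean_squared_residue(D):
        # H(I, J): average of the squared residues over the bi-cluster D
        return float(np.mean(residue_matrix(D) ** 2))

    def row_residues(D):
        # d(i): mean squared residue contributed by each row
        return np.mean(residue_matrix(D) ** 2, axis=1)

    def col_residues(D):
        # e(j): mean squared residue contributed by each column
        return np.mean(residue_matrix(D) ** 2, axis=0)

H(I, J) is simply the average of the d(i) values (equivalently, of the e(j) values), so rows and columns can be scored independently.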
Cheng and Church • Greedy methods: • Algorithm 0: Brute-force deletion (skipped) • Algorithm 1: Single node deletion • Parameter(s): δ (maximum squared residue). • Initialization: the bicluster contains all rows and columns. • Iteration: • Compute all aIj, aiJ, aIJ and H(I, J) for reuse. • Remove a row or column that gives the maximum decrease of H. • Termination: when no action will decrease H or H <= δ. • Time complexity: O(MN)
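A sketch of Algorithm 1 along those lines (not the authors' code; it reuses the helper functions above and takes row/column index lists into the full matrix A):

    import numpy as np

    def single_node_deletion(A, rows, cols, delta):
        # Greedily drop the row or column with the largest residue contribution
        # until H(I, J) <= delta or the bicluster degenerates.
        rows, cols = list(rows), list(cols)
        while len(rows) > 1 and len(cols) > 1:
            D = A[np.ix_(rows, cols)]
            if mean_squared_residue(D) <= delta:
                break
            d, e = row_residues(D), col_residues(D)
            if d.max() >= e.max():
                rows.pop(int(np.argmax(d)))      # remove the worst row
            else:
                cols.pop(int(np.argmax(e)))      # remove the worst column
        return rows, cols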
Cheng and Church • Greedy methods: • Algorithm 2: Multiple node deletion (takes one more parameter α; in iteration step 2, delete all rows and columns with row/column residue > α·H(I, J)). • Algorithm 3: Node addition (allow both additions and deletions of rows/columns).
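A sketch of one multiple-deletion pass in the same style (α ≥ 1 is the extra parameter; again this reuses the helpers above and is only an illustration):

    import numpy as np

    def multiple_node_deletion_step(A, rows, cols, alpha):
        # Delete every row whose residue exceeds alpha * H(I, J), recompute H,
        # then delete every column whose residue exceeds alpha * H(I, J).
        D = A[np.ix_(rows, cols)]
        H = mean_squared_residue(D)
        rows = [r for r, di in zip(rows, row_residues(D)) if di <= alpha * H]
        D = A[np.ix_(rows, cols)]
        H = mean_squared_residue(D)
        cols = [c for c, ej in zip(cols, col_residues(D)) if ej <= alpha * H]
        return rows, cols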
Cheng and Church • Handling missing values and masking discovered biclusters: replace them by random numbers so that no recognizable structures are introduced. • Data preprocessing: • Yeast: x → 100·log(10^5·x) • Lymphoma: x → 100·x (original data is already log-transformed)
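The two transforms written out as I read them (a sketch; the constants are my reconstruction of the expressions above):

    import numpy as np

    def preprocess_yeast(x):
        # x -> 100 * log(10^5 * x)
        return 100.0 * np.log(1e5 * np.asarray(x, dtype=float))

    def preprocess_lymphoma(x):
        # x -> 100 * x  (the data is already log-transformed)
        return 100.0 * np.asarray(x, dtype=float)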
Cheng and Church • Some results on yeast cell cycle data (2884 × 17):
Cheng and Church • Some results on lymphoma data (4026 × 96):
Cheng and Church • Discussion: • Biological validation: comparing with the clusters in previously published results. • No evaluation of the statistical significance of the clusters. • Both the model and the algorithm are not tailored for discovering multiple non-disjoint clusters. • Normalization is of utmost importance for the model, but this issue is not well-discussed.
The FLOC algorithm • Generate initial clusters • Determine the best action for each row and each column • Perform the best action of each row and column sequentially • If the clustering improved, repeat from the second step; otherwise stop
The FLOC algorithm • Action: the change of membership of a row (or column) with respect to a cluster • M + N actions are performed at each iteration • [Figure: an example with N = 3 rows and M = 4 columns illustrating the candidate row and column actions]
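A hedged sketch of how the gain of a single action could be scored (names are mine, and FLOC's actual gain function also accounts for cluster volume, so this is a simplification): toggle the membership of one row or column and measure the change in mean squared residue.

    import numpy as np

    def action_gain(A, rows, cols, row=None, col=None):
        # Positive gain: toggling this row's/column's membership lowers H(I, J).
        # Reuses mean_squared_residue from the earlier sketch.
        before = mean_squared_residue(A[np.ix_(rows, cols)])
        if row is not None:
            rows = [r for r in rows if r != row] if row in rows else list(rows) + [row]
        if col is not None:
            cols = [c for c in cols if c != col] if col in cols else list(cols) + [col]
        after = mean_squared_residue(A[np.ix_(rows, cols)])
        return before - after

At each iteration the best action per row and per column is chosen, giving the M + N actions mentioned above.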
Performance • Microarray data: 2884 genes, 17 conditions • The 100 bi-clusters with the smallest residue were returned. • Average residue = 10.34 • The average residue of clusters found by the state-of-the-art method in the computational biology field is 12.54 • The average volume is 25% bigger • The response time is an order of magnitude faster
Coherent Cluster • Want to accommodate noise but not outliers
Coherent Cluster • Coherent cluster • Subspace clustering • Pair-wise disparity: for a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}, D = |(d_xa − d_ya) − (d_xb − d_yb)|, where d_xa − d_ya is the mutual bias of attribute a and d_xb − d_yb is the mutual bias of attribute b
Coherent Cluster • A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ. • An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster. • A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster. • Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
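A brute-force check of this definition, as a sketch (Python; the O(n²m²) enumeration is only for illustration, the actual mining algorithms avoid it):

    import numpy as np
    from itertools import combinations

    def pairwise_disparity(D, x, y, a, b):
        # |(d_xa - d_ya) - (d_xb - d_yb)| for objects {x, y} and attributes {a, b}
        return abs((D[x, a] - D[y, a]) - (D[x, b] - D[y, b]))

    def is_delta_coherent(D, delta):
        # True iff every 2x2 submatrix of D has disparity <= delta
        n, m = D.shape
        return all(pairwise_disparity(D, x, y, a, b) <= delta
                   for x, y in combinations(range(n), 2)
                   for a, b in combinations(range(m), 2))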
Coherent Cluster • Challenges: • Finding subspace clusters based on distance alone is already a difficult task due to the curse of dimensionality. • The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix. • The actual values of the objects in a coherent cluster may be far apart from each other. • Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.
References • J. Yang, W. Wang, H. Wang, and P. Yu. δ-clusters: capturing subspace correlation in a large data set. In Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002. • H. Wang, W. Wang, J. Yang, and P. Yu. Clustering by pattern similarity in large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002. • S. Yoon, C. Nardini, L. Benini, and G. De Micheli. Enhanced pClustering and its applications to gene expression data. In IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2004. • J. Liu and W. Wang. OP-Cluster: clustering by tendency in high dimensional space. In ICDM, 2003.