
An Unsupervised Learning Approach for Overlapping Co-clustering



  1. An Unsupervised Learning Approach for Overlapping Co-clustering Machine Learning Project Presentation Rohit Gupta and Varun Chandola {rohit,chandola}@cs.umn.edu

  2. Outline • Introduction to Clustering • Description of the Application Domain • From Traditional Clustering to Overlapping Co-clustering • Current State of the Art • A Frequent Itemsets Based Solution • An Alternate Minimization Based Solution • Application to Gene Expression Data • Experimental Results • Conclusions and Future Directions

  3. Clustering • Clustering is an unsupervised machine learning technique • Uses unlabeled samples • In its simplest form, it determines groups (clusters) of data objects such that objects within a cluster are similar to each other and dissimilar to objects in other clusters • Each data object is described by a set of attributes (or features) over which a notion of proximity is defined • Most traditional clustering algorithms • Are partitional in nature: they assign each data object to exactly one cluster • Perform clustering along only one dimension of the data matrix (see the sketch below)
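For concreteness, here is a minimal k-means sketch (NumPy only, with illustrative toy data) showing both properties of traditional partitional clustering: every point is assigned to exactly one cluster, and the grouping is along a single dimension of the data matrix.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: each point goes to exactly one cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center under Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array([X[labels == c].mean(axis=0) if (labels == c).any()
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy example: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```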

  4. Application Domains • Gene Expression Data • Genes vs. Experimental Conditions • Find similar genes based on their expression values for different experimental conditions • Each cluster would represent a potential functional module in the organism • Text Documents Data • Documents vs. Words • Movie Recommendation Systems • Users vs. Movies

  5. Overlapping Clustering • Also known as soft clustering or fuzzy clustering • A data object can be assigned to more than one cluster • The motivation is that many real-world data sets have inherently overlapping clusters • A gene, for example, can be a part of multiple functional modules (clusters)

  6. Co-clustering • Co-clustering is the problem of simultaneously clustering the rows and columns of a data matrix • Also known as bi-clustering, subspace clustering, bi-dimensional clustering, simultaneous clustering, block clustering • The resulting clusters are blocks in the input data matrix • These blocks often represent more coherent and meaningful clusters • In gene expression data, for example, only a subset of genes participates in any cellular process of interest, and that process may be active under only a subset of conditions

  7. Overlapping Co-clustering [Figure: clustering variants and representative work – overlapping clustering: Segal et al, 2003; Banerjee et al, 2005 · co-clustering: Dhillon et al, 2003; Cho et al, 2004; Banerjee et al, 2005 · overlapping co-clustering: Bergmann et al, 2003]

  8. Current State of the Art • Traditional Clustering – numerous algorithms such as k-means • Overlapping Clustering – Probabilistic Relational Model based approaches by Segal et al and Banerjee et al • Co-clustering – Dhillon et al for gene expression data and document clustering (Banerjee et al provided a general framework using a general class of Bregman distortion functions) • Overlapping Co-clustering • Iterative Signature Algorithm (ISA) by Bergmann et al for gene expression data • Uses an alternate minimization technique • Involves thresholding after every iteration • We propose a more formal framework based on the co-clustering approach of Dhillon et al, and another, simpler frequent itemsets based solution

  9. Frequent Itemsets Based Approach • Based on the concept of frequent itemsets from the association analysis domain • A frequent itemset is a set of items (features) that occur together in more than a specified number of transactions (the support threshold) in the data set • The data has to be binary (only the presence or absence of an item is considered)
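A minimal sketch of the frequent-itemset definition; the toy transactions, item names, and the absolute support threshold are illustrative choices.

```python
from itertools import combinations

# Toy binary transactions: each row lists the items present.
transactions = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2", "g3"},
]
support_threshold = 3  # absolute count; often given as a fraction instead

def support(itemset, transactions):
    """Number of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions)

# Enumerate all itemsets up to size 3 and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for size in range(1, 4)
            for c in combinations(items, size)
            if support(set(c), transactions) >= support_threshold]
print(frequent)  # e.g. {g1}, {g2}, {g1, g2}, ...
```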

  10. Frequent Itemsets Based Approach (2) • Application to gene expression data: • Normalization – first along columns (conditions) to remove scaling effects, then along rows (genes) • Binarization – three variants: • Values above a preset threshold λ are set to 1 and the rest to 0 • Values above a preset percentile are set to 1 and the rest to 0 • Split each gene into three components g+, g0 and g–, signifying up-regulation, no change, and down-regulation of the gene's expression; this triples the number of items (genes) • The gene expression matrix is converted to transaction-format data – each experiment is a transaction containing the index values of the genes expressed in that experiment
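A sketch of this preprocessing pipeline, assuming genes as rows and conditions as columns; the threshold λ and the g+/g– item-naming scheme are illustrative choices.

```python
import numpy as np

def binarize_expression(A, lam=1.0):
    """Normalize a genes-x-conditions matrix along columns (conditions)
    and then rows (genes), then threshold at +/- lambda to obtain
    up-/down-regulation calls (the g+/g- binarization variant above)."""
    A = (A - A.mean(axis=0)) / A.std(axis=0)                        # per-condition
    A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)  # per-gene
    return A > lam, A < -lam

def to_transactions(up, down):
    """One transaction per experiment (column), listing the gene items
    that were up- or down-regulated in that experiment."""
    n_genes, n_conds = up.shape
    transactions = []
    for j in range(n_conds):
        t = {f"g{i}+" for i in range(n_genes) if up[i, j]}
        t |= {f"g{i}-" for i in range(n_genes) if down[i, j]}
        transactions.append(t)
    return transactions
```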

  11. Frequent Itemsets Based Approach (3) • Algorithm: • Run a closed frequent itemset mining algorithm to generate closed frequent itemsets at a specified support threshold σ • Post-processing: • Prune frequent itemsets (sets of genes) of length < α • For each remaining itemset, scan the transaction data to record all the transactions (experiments) in which the itemset occurs • (Note: the combination of these transactions (experiments) and the itemset (genes) gives the desired sets of genes together with the subsets of conditions under which they are most tightly co-expressed)
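A brute-force sketch of this pipeline. A real implementation would use a dedicated closed-itemset miner; the enumeration cap max_size is an assumption added here for tractability on toy data.

```python
from itertools import combinations

def mine_co_clusters(transactions, sigma, alpha, max_size=4):
    """Sketch: closed frequent itemsets at support sigma, pruned to
    length >= alpha, each paired with its supporting experiments."""
    items = sorted(set().union(*transactions))

    def occurrences(itemset):
        # Transactions (experiments) containing every item in the itemset.
        return [i for i, t in enumerate(transactions) if itemset <= t]

    # All frequent itemsets up to max_size, with their occurrence lists.
    frequent = {frozenset(c): occurrences(set(c))
                for size in range(1, max_size + 1)
                for c in combinations(items, size)
                if len(occurrences(set(c))) >= sigma}

    # Closed = no proper frequent superset with identical support.
    closed = [s for s in frequent
              if not any(s < t and len(frequent[t]) == len(frequent[s])
                         for t in frequent)]

    # Post-processing: prune short itemsets; each survivor plus its
    # experiments is one (possibly overlapping) co-cluster.
    return [(set(s), frequent[s]) for s in closed if len(s) >= alpha]
```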

  12. Limitations of Frequent Itemsets Based Approach • Binarization of the gene expression matrix may lose some of the patterns in the data • Up-regulation and down-regulation of genes are not directly taken into account • Setting the right support threshold while incorporating domain knowledge is not trivial • A large number of modules is obtained – difficult to evaluate biologically • Traditional association analysis approaches consider only dense blocks, so noise may break apart an actual module – Error-Tolerant Itemsets (ETIs) [Liu et al, 2004] offer a potential solution

  13. Alternate Minimization (AM) Based Approach • Extends the non-overlapping co-clustering approach of [Dhillon et al, 2003, Banerjee et al 2005] • Algorithm • Input: data matrix A (size m × n) and k, l (number of row and column clusters) • Initialize row and column cluster mappings X (size m × k) and Y (size n × l): • Random assignment of rows (or columns) to row (or column) clusters, or • Any traditional one-dimensional clustering algorithm • Objective function: ||A − Â||², where Â is a matrix approximation of A computed in one of two ways (see the sketch below): • Each element of a co-cluster (obtained using the current X and Y) is replaced by the co-cluster mean a_IJ • Each element (i, j) of a co-cluster is replaced by a_iJ + a_Ij − a_IJ, i.e. row mean + column mean − co-cluster mean
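A sketch of both approximation schemes, assuming hard cluster assignments stored as label vectors rather than the indicator matrices X and Y; here a_iJ and a_Ij denote the row and column means within the co-cluster and a_IJ the co-cluster mean.

```python
import numpy as np

def approximate(A, row_labels, col_labels, k, l, scheme="mean"):
    """Compute A-hat for the current co-clustering.

    scheme="mean":     a_hat[i, j] = a_IJ              (co-cluster mean)
    scheme="additive": a_hat[i, j] = a_iJ + a_Ij - a_IJ
    where I, J are the row/column clusters of i, j.
    """
    A_hat = np.zeros_like(A, dtype=float)
    for I in range(k):
        rows = row_labels == I
        for J in range(l):
            cols = col_labels == J
            block = A[np.ix_(rows, cols)]
            if block.size == 0:
                continue
            if scheme == "mean":
                A_hat[np.ix_(rows, cols)] = block.mean()
            else:
                a_iJ = block.mean(axis=1, keepdims=True)  # row means in co-cluster
                a_Ij = block.mean(axis=0, keepdims=True)  # column means in co-cluster
                A_hat[np.ix_(rows, cols)] = a_iJ + a_Ij - block.mean()
    return A_hat  # objective: ((A - A_hat) ** 2).sum()
```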

  14. Alternate Minimization (AM) Based Approach (2) • While (not converged) • Phase 1: • Compute row cluster prototypes (based on the current X and matrix A) • Compute the Bregman distance d_Φ(r_i, R_r) from each row to each row cluster prototype • Compute the probability with which each of the m rows falls into each of the k row clusters • Update the row clustering X keeping the column clustering Y fixed (some thresholding is required here to allow limited overlap) • Phase 2: • Compute column cluster prototypes (based on the current Y and matrix A) • Compute the Bregman distance d_Φ(c_j, C_c) from each column to each column cluster prototype • Compute the probability with which each of the n columns falls into each of the l column clusters • Update the column clustering Y keeping the row clustering X fixed • Compute the objective function ||A − Â||² • Check convergence
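A hard-assignment sketch of the loop above, using squared Euclidean distance as the Bregman divergence and omitting the probabilistic overlap-thresholding step; both are simplifications, and empty clusters are not handled.

```python
import numpy as np

def am_cocluster(A, k, l, max_iter=50, tol=1e-6, seed=0):
    """AM sketch: alternate row and column reassignments, monitoring the
    objective under the co-cluster-mean approximation of A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row = rng.integers(k, size=m)   # initial X, as a label vector
    col = rng.integers(l, size=n)   # initial Y, as a label vector
    prev = np.inf
    for _ in range(max_iter):
        # Phase 1: summarize each row by its means over column clusters,
        # then reassign rows to the nearest row-cluster prototype.
        rmeans = np.stack([A[:, col == J].mean(axis=1) for J in range(l)], axis=1)
        proto = np.stack([rmeans[row == I].mean(axis=0) for I in range(k)])
        row = np.linalg.norm(rmeans[:, None, :] - proto[None], axis=2).argmin(axis=1)
        # Phase 2: the symmetric update for columns, keeping X fixed.
        cmeans = np.stack([A[row == I].mean(axis=0) for I in range(k)], axis=1)
        proto = np.stack([cmeans[col == J].mean(axis=0) for J in range(l)])
        col = np.linalg.norm(cmeans[:, None, :] - proto[None], axis=2).argmin(axis=1)
        # Objective ||A - A_hat||^2 with each block replaced by its mean.
        obj = sum(((A[np.ix_(row == I, col == J)]
                    - A[np.ix_(row == I, col == J)].mean()) ** 2).sum()
                  for I in range(k) for J in range(l)
                  if (row == I).any() and (col == J).any())
        if prev - obj < tol:        # convergence check
            break
        prev = obj
    return row, col
```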

  15. Observations • Each row or column can be assigned to multiple row or column clusters, respectively, with a probability based on its distance from each cluster prototype; this produces an overlapping co-clustering • Maximum number of overlapping co-clusters that can be obtained = k × l • Initialization of X and Y can be done in multiple ways – two are explored in the experiments • Thresholding to control the percentage of overlap is tricky and requires domain knowledge • Cluster evaluation is important – both internal and external: • SSE and entropy of each co-cluster • Biological evaluation using GO (Gene Ontology) for results on gene expression data (see the entropy sketch below)
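As an illustration of external evaluation, a minimal per-cluster entropy computation; the class labels are assumed to come from an external source such as GO annotations, which is a hypothetical setup for this sketch.

```python
import numpy as np
from collections import Counter

def cluster_entropy(cluster_labels, class_labels):
    """Entropy of external class labels within each cluster
    (lower is better; 0 means the cluster is pure)."""
    entropies = {}
    for c in set(cluster_labels):
        counts = Counter(cl for cl, kl in zip(class_labels, cluster_labels)
                         if kl == c)
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()
        entropies[c] = float(-(p * np.log2(p)).sum())
    return entropies

# Toy usage: two clusters, one pure and one mixed.
print(cluster_entropy([0, 0, 1, 1], ["a", "a", "a", "b"]))
```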

  16. Experimental Results (1) • Frequent Itemsets Based Approach • A synthetic data set (40 × 40) • Total number of co-clusters detected = 3

  17. Experimental Results (2) • Frequent Itemsets Based Approach • Another synthetic data set (40 × 40) • Total number of co-clusters detected = 7 • All 4 blocks (in the original data set) were detected • Post-processing is needed to eliminate the unwanted co-clusters

  18. Experimental Results (3) • AM Based Approach • Synthetic data sets (20 × 20) • Finds the co-clusters in each case

  19. Experimental Results (4) • AM Based Approach on a Gene Expression Dataset • Human lymphoma microarray data [described in Cho et al, 2004] • # genes = 854 • # conditions = 96 • k = 5, l = 5; one-dimensional k-means used to initialize X and Y • Total number of co-clusters = 25 • [Figures: input data; objective function vs. iterations] • A preliminary analysis of the 25 co-clusters shows that only one meaningful co-cluster is obtained

  20. Conclusions • The frequent itemsets based approach is guaranteed to find dense overlapping co-clusters • The Error-Tolerant Itemset approach offers a potential solution to the problem of noise • The AM based approach is a formal algorithm for finding overlapping co-clusters • It simultaneously performs clustering in both dimensions while minimizing a global objective function • Results on synthetic data support the correctness of the algorithm • Preliminary results on gene expression data show promise and will be evaluated further • A key insight is that applying these techniques to gene expression data requires domain knowledge for pre-processing, initialization, thresholding, and post-processing of the co-clusters obtained

  21. References • [Bergmann et al, 2003] Sven Bergmann, Jan Ihmels and Naama Barkai, Iterative signature algorithm for the analysis of large-scale gene expression data, Phys. Rev. E 67, p. 031902, 2003 • [Liu et al, 2004] Jinze Liu, Susan Paulsen, Wei Wang, Andrew Nobel and Jan Prins, Mining Approximate Frequent Itemsets from Noisy Data, Proc. IEEE ICDM, pp. 463-466, 2004 • [Cho et al, 2004] Hyuk Cho, Inderjit S. Dhillon, Yuqiang Guan and Suvrit Sra, Minimum Sum-Squared Residue Co-clustering of Gene Expression Data, Proc. SIAM Data Mining Conference, pp. 114-125, 2004 • [Dhillon et al, 2003] Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha, Information-Theoretic Co-Clustering, Proc. ACM SIGKDD, pp. 89-98, 2003 • [Banerjee et al, 2004] Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra S. Modha, A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, Proc. ACM SIGKDD, pp. 509-514, 2004
