
Sparsity-Cognizant Overlapping Co-clustering



Presentation Transcript


  1. March 11, 2010 Sparsity-Cognizant Overlapping Co-clustering Hao Zhu Dept. of ECE, Univ. of Minnesota http://spincom.ece.umn.edu Acknowledgements: G. Mateos, Profs. G. B. Giannakis, N. D. Sidiropoulos, A. Banerjee, and G. Leus NSF grants CCF 0830480 and CON 014658 ARL/CTA grant no. DAAD19-01-2-0011

  2. Outline • Motivation and context • Problem statement and Plaid models • Sparsity-cognizant overlapping co-clustering (SOC) • Uniqueness • Simulated tests • Conclusions and future research

  3. Context • Co-clustering (biclustering) = two-way clustering • Clustering: partition the objects (samples, rows) based on a similarity criterion over their attributes (features, columns) [Tan-Steinbach-Kumar '06] • Co-clustering: simultaneous clustering of objects and attributes [Busygin et al '08] • Dense, approximately constant-valued submatrices • NP-hard: reduces to ordinary clustering as in k-means

  4. Context • Co-clustering (biclustering) = two-way clustering • Clustering: partition the objects (samples, rows) based on a similarity criterion over their attributes (features, columns) [Tan-Steinbach-Kumar '06] • Co-clustering: simultaneous clustering of objects and attributes [Busygin et al '08] • Dense, approximately constant-valued submatrices • NP-hard: reduces to ordinary clustering as in k-means • Application areas • Social networks: cohesive subgroups of actors within a network [Wasserman et al '94] • Bioinformatics: interpretable biological structure in gene expression data [Lazzeroni et al '02] • Internet traffic: dominant host groups with strong interactions [Jin et al '09]

  5. Related Work and Our Focus • Bipartite spectral graph partitioning [Dhillon '01] • Matrix factorization based on the SVD • Orthogonal nonnegative matrix tri-factorization (tNMF) [Ding et al '06] • Searches for non-overlapping co-clusters using orthogonality • Overlapping co-clustering under a Bayesian framework [Fu-Banerjee '09] • Probabilistic model for co-cluster membership indicators and parameters • EM algorithm for inference and parameter estimation • E-step: Gibbs sampling for membership indicator detection • Plaid models [Lazzeroni et al '02] • Superposition of multiple overlapping layers (co-clusters) • Greedy layer search: one at a time

  6. Related Work and Our Focus (cont'd) • Overlapping: some objects/attributes may relate to multiple co-clusters • Borrow the plaid model features • Partial co-clustering motivated by the "uninteresting background" • Linear model incurs a low computational burden • Exploit sparsity in the co-cluster membership vectors • Sparse information hidden in large data sets • Parsimonious models: more interpretable and informative • Simultaneous cross-layer optimization • Compared with the greedy layer-by-layer strategy • Our focus: the sparsity-cognizant overlapping co-clustering (SOC) algorithm

  7. Modeling • Matrix Y: n × p, induced by two groups of interacting nodes (rows and columns) • Yij measures the strength of the relationship between row node i and column node j • Ex-1: Internet traffic: traffic activity graph (TAG)1 • Tracks the traffic flow between inside hosts and outside hosts • Ex-2: Gene expression microarray data2 • Measures the level at which gene i is expressed in sample j • 1,2 The two pictures are taken from [Jin et al '09] and [Lazzeroni et al '02], respectively.

  8. Submatrices • Hidden dense/uniform submatrices • capture a subset of the rows that has similar feature values over a subset of the columns • reveal certain informative behavioral patterns • Features • distributed "sparsely" in Y compared to the data dimension np • may overlap because some nodes serve multiple functions • Goal: efficient co-clustering algorithms that extract the underlying submatrices by exploiting sparsity and accounting for possible overlap

  9. Plaid Models • Matrix Y: superposition of k submatrices (layers), Yij = θij0 + Σl=1..k θijl ρil κjl with θijl = μl + αil + βjl • μl : level of layer l (μ0 : background level) • ρil (κjl) = 1 if row i (column j) is in the l-th layer, and 0 otherwise • Row/column-related effects αil and βjl • μl is common to all the nodes in the layer • αil and βjl express the node-related response
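
The following short numerical sketch (not from the slides; all sizes, memberships, effect magnitudes, and noise levels are illustrative assumptions) generates a matrix Y from this plaid model with two overlapping layers:

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 40, 2                          # objects, attributes, layers (illustrative)
rho = np.zeros((n, k)); kappa = np.zeros((p, k))
rho[:20, 0] = 1; rho[15:35, 1] = 1           # overlapping row memberships
kappa[:15, 0] = 1; kappa[10:25, 1] = 1       # overlapping column memberships
mu = np.array([3.0, 5.0])                    # layer levels (background mu_0 = 0 here)
alpha = rng.normal(0.0, 0.2, (n, k))         # row-related effects alpha_il
beta = rng.normal(0.0, 0.2, (p, k))          # column-related effects beta_jl

Y = np.zeros((n, p))
for l in range(k):
    theta_l = mu[l] + alpha[:, [l]] + beta[:, [l]].T    # theta_ijl = mu_l + alpha_il + beta_jl
    Y += theta_l * np.outer(rho[:, l], kappa[:, l])     # contributes only inside layer l
Y += rng.normal(0.0, 0.1, (n, p))            # additive noise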

  10. Problem Statement • Problem: given the plaid model, seek the optimal membership indicators • Data-fitting error penalized by the L1 norm of the indicators • λ > 0 controls the sparsity enforced • Facilitates extraction of the more informative/interpretable submatrices out of Y • Finding the optimal solution is NP-hard • Binary constraints on the membership indicators [Turner et al '05] • Products of different variables • Efficient sub-optimal algorithm to identify the submatrices jointly • Recall that the submatrices are detected one at a time in [Lazzeroni et al '02]
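
Written out, a plausible form of this penalized fit (reconstructed from the plaid model above and the stated L1 penalty; the exact cost used in the talk may differ in its details) is

$$ \min_{\{\mu_l,\alpha_{il},\beta_{jl}\},\;\rho_{il},\,\kappa_{jl}\in\{0,1\}} \;\sum_{i=1}^{n}\sum_{j=1}^{p}\Big(Y_{ij}-\theta_{ij0}-\sum_{l=1}^{k}\theta_{ijl}\,\rho_{il}\,\kappa_{jl}\Big)^{2} \;+\; \lambda\Big(\sum_{i,l}\rho_{il}+\sum_{j,l}\kappa_{jl}\Big), $$

where the L1 norm of the binary indicators reduces to the plain sums shown, as noted on slide 13.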

  11. Sparsity-Cognizant Overlapping Co-clustering (SOC) • Background-layer-free residue matrix Z • Iterative cycling update of θ, ρ, and κ • Per iteration s, θ(s) collects all θijl(s) values, likewise for ρ(s) and κ(s) • Update cycle: θ(s) given ρ(s-1), κ(s-1); then ρ(s) given θ(s), κ(s-1); then κ(s) given θ(s), ρ(s) • Different from [Lazzeroni et al '02] • All k layers are updated jointly, less prone to error propagation across layers • Membership indicators are updated under the binary constraint, at the price of combinatorial complexity

  12. Updating θ(s) • Given ρ(s-1) and κ(s-1) • Unconstrained quadratic program with a closed-form solution • Inversion of a large matrix leads to numerical instability • Coordinate descent algorithm alternating across all the layers • For l = 1, ..., k • Define the residue matrix by removing the contributions of all layers other than l from Z • Reduce it to a sub-block by extracting the rows with ρil(s-1) = 1 and the columns with κjl(s-1) = 1 • Update the layer parameters over this sub-block • Repeat for T cycles (T small)
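
As a minimal sketch of one such per-layer step (assuming the layer parameters are the plaid effects mu_l, alpha_il, beta_jl and that the closed form is the usual two-way least-squares fit on the layer's sub-block; the talk's exact update may differ), the coordinate-descent update could look like:

import numpy as np

def update_layer(Z, rho, kappa, mu, alpha, beta, l):
    # Residue matrix for layer l: remove the contribution of every other layer from Z.
    n, p = Z.shape
    k = rho.shape[1]
    R_l = Z.copy()
    for m in range(k):
        if m == l:
            continue
        theta_m = mu[m] + alpha[:, [m]] + beta[:, [m]].T
        R_l -= theta_m * np.outer(rho[:, m], kappa[:, m])
    # Extract the sub-block with rho_il = 1 and kappa_jl = 1.
    rows = rho[:, l] == 1
    cols = kappa[:, l] == 1
    B = R_l[np.ix_(rows, cols)]
    # Closed-form least-squares fit of mu_l + alpha_il + beta_jl on the sub-block.
    mu[l] = B.mean()
    alpha[rows, l] = B.mean(axis=1) - mu[l]
    beta[cols, l] = B.mean(axis=0) - mu[l]
    return mu, alpha, beta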

  13. Updating ρ(s) and κ(s) • Given θ(s) and κ(s-1), determine ρ(s) • Obtain jointly the membership indicators for the i-th row across all k layers • Important for overlapping submatrices, to eliminate cross effects • The L1 norm penalty reduces to a linear term due to non-negativity • Quadratic minimization subject to {0,1} binary constraints is NP-hard • Similar problems arise in MIMO/multiuser detection with binary alphabets • (Near-)optimal sphere decoding algorithm (SDA) • Incurs polynomial (cubic) complexity in general • Same technique detects κ(s) given θ(s) and ρ(s)
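
To make the per-row detection concrete, here is a brute-force stand-in for the sphere decoder (not the SDA itself): it enumerates all 2^k candidate indicator vectors and returns the cost minimizer, which is what the (near-)optimal SDA would also return for small k. The function name, argument layout, and exact cost expression are illustrative assumptions.

import itertools
import numpy as np

def detect_row_membership(z_i, theta_i, kappa, lam):
    # z_i: i-th row of Z (length p); theta_i[j, l] holds theta_ijl; kappa: p x k; lam: sparsity weight.
    p, k = kappa.shape
    best_cost, best_rho = np.inf, None
    for bits in itertools.product([0, 1], repeat=k):
        rho_i = np.array(bits, dtype=float)
        fit = z_i - (theta_i * kappa) @ rho_i            # model for row i with these layers active
        cost = np.sum(fit ** 2) + lam * rho_i.sum()      # L1 penalty is linear on {0,1} entries
        if cost < best_cost:
            best_cost, best_rho = cost, rho_i
    return best_rho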

  14. Convergence and Implementation • The SOC algorithm converges (at least) to a stationary point • Data-fitting error cost: bounded below and non-increasing per iteration • Pruning steps [Lazzeroni et al '02], [Turner et al '05] • Initialization • Background-level fitting to obtain matrix Z (recall the submatrix parameter fitting) • Membership indicators ρ(0) and κ(0): k-means [Turner et al '05] • Parameter choices • Number of layers k: explain a certain percentage of the variation • Sparse regularization parameter λ: trial-and-error/bi-cross-validation [Witten et al '09]
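
One possible realization of the k-means initialization mentioned above (a sketch only; the clustering routine and the conversion of labels to binary indicators are illustrative choices, here via scikit-learn's KMeans):

import numpy as np
from sklearn.cluster import KMeans

def init_memberships(Z, k, seed=0):
    # Cluster the rows and the columns of Z into k groups, then turn the
    # cluster labels into binary membership indicators rho(0) and kappa(0).
    n, p = Z.shape
    row_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    col_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z.T)
    rho = np.zeros((n, k)); kappa = np.zeros((p, k))
    rho[np.arange(n), row_labels] = 1
    kappa[np.arange(p), col_labels] = 1
    return rho, kappa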

  15. Uniqueness • Plain plaid models: decomposition Z = R D K^T into a product of unknown matrices • Binary-valued matrices: R = [ρil] (n × k) and K = [κjl] (p × k) • Diagonal matrix D = diag(μ1, ..., μk) • Blind source separation (BSS) [Talwar et al '96], [van der Veen et al '96] • Product of two matrices, finite alphabet (FA)/constant modulus (CM) constraints • (Generally) uniquely identifiable with a large enough number of samples • 3-way arrays (Candecomp/Parafac) [Kruskal '77], [Sidiropoulos et al '00] vs. our two-way setting • Unique up to permutation and scaling • Fails to hold in the two-way case (h = 1)

  16. Uniqueness (cont'd) • Sparsity in blind identification • Sparse component analysis: very sparse representations [Georgiev et al '07] • Non-negative source separation using local dominance [Chan et al '08] • Proposition: Consider Z = R D K^T, where the diagonal matrix D and the binary-valued matrices R, K are all of full rank. Let each column vector κl of K be locally sparse for all l, meaning that there exists an (unknown) row index jl of K at which only layer l is active. Then, given Z, the matrices R, D, and K are unique up to permutations. • Proof relies on convex analysis • The affine hull of the column vectors of K coincides with that of Z • Under local sparseness, its convex hull becomes the intersection of the affine hull and the positive orthant • The columns of K are extreme points of the convex hull • The results also hold when R is locally sparse (by symmetry)
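
Written out in the plaid notation used earlier (this restatement of the decomposition and of the local-sparsity condition is a reconstruction; the talk's exact formulation may differ slightly):

$$ Z = R\,D\,K^{\mathsf T},\qquad R=[\rho_{il}]\in\{0,1\}^{n\times k},\quad K=[\kappa_{jl}]\in\{0,1\}^{p\times k},\quad D=\mathrm{diag}(\mu_1,\dots,\mu_k), $$

$$ \text{local sparsity of } \kappa_l:\ \exists\, j_l \ \text{such that}\ \kappa_{j_l l}=1 \ \text{and}\ \kappa_{j_l m}=0\ \ \forall\, m\neq l. $$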

  17. Preliminary Simulation • Two uniform blocks + noise ~ Unif[0, 0.5] • SOC parameters: k = 2, S = 20, T = 1, and λ = 0, 3 • (Figure: panels "Original", "Plaid", and "Permuted", comparing the recovered co-clusters for λ = 0 and λ = 3.)

  18. Real Data To Simulate • Internet traffic flow data • Uncover different types of co-clusters: in-star, out-star, bi-mesh, ... • Examples from the email application: department servers, Gmail • Overlapping co-clusters may reveal server farms • Gene expression microarray data • Co-clusters may exhibit some biological patterns • Need to be checked against the gene enrichment value

  19. Concluding Summary • Plaid models to reveal overlapping co-clusters • Exploit sparsity for parsimonious recovery • Jointly decide among multiple layers • SOC algorithm iteratively updates the unknown parameters • Coordinate descent solver for the layer level parameters • Sphere decoder detects membership indicators jointly • Local sparseness leads to unique decomposition

  20. Thank You! Future Directions • Implementation issues with parameter choices • Efficient initializations and membership vector detection • Comprehensive numerical experiments on real data
