Consistent Bipartite Graph Co-Partitioning for High-Order Heterogeneous Co-Clustering
Tie-Yan Liu, WSM Group, Microsoft Research Asia
2005.11.11
Joint work with Bin Gao, Peking University
Outline
• Motivation
• What is high-order heterogeneous co-clustering
• Why previous methods cannot work well on this problem
• Consistent Bipartite Graph Co-partitioning (CGBC)
• Experimental Evaluation
• Conclusions and Future Work
Talk at NTU, Tie-Yan Liu
Clustering
• Clustering groups data objects into clusters so that objects in the same cluster are similar to each other.
• Spectral Clustering
  • Models the similarity of data objects by an affinity graph, and assumes that the best clustering result corresponds to the minimal (ratio, normalized or min-max) graph cut.
  • It can be proven that the minimum of the normalized cut is achieved by minimizing the objective function $\min_q \frac{q^T L q}{q^T D q}$ subject to $q^T D e = 0$, $q \neq 0$, where D is the diagonal degree matrix of the affinity graph, L is its Laplacian and e is the all-ones vector. The corresponding solution q is the eigenvector associated with the second smallest eigenvalue of the generalized eigenvalue problem $L q = \lambda D q$.
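The spectral relaxation described on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: it assumes NumPy, a small hand-made affinity matrix `W`, and a hypothetical helper name `normalized_cut_embedding`. It solves the generalized problem L q = λ D q through the equivalent symmetric matrix D^{-1/2} L D^{-1/2} and splits the nodes by the sign of the resulting embedding.

```python
import numpy as np

def normalized_cut_embedding(W):
    """Eigenvector of L q = lambda D q for the second smallest eigenvalue."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                    # node degrees (assumed to be > 0)
    L = np.diag(d) - W                   # graph Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^{-1/2} L D^{-1/2} has the same eigenvalues as the generalized problem.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return d_inv_sqrt * eigvecs[:, 1]          # map back: q = D^{-1/2} v

# Hypothetical affinity graph: two triangles joined by one weak edge (2-3).
W = np.array([
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0,   1, 0, 1],
    [0, 0, 0,   1, 1, 0],
])
q = normalized_cut_embedding(W)
labels = (q > 0).astype(int)   # threshold the embedding at zero
```

Thresholding at zero is only one way to read a partition off the embedding; k-means on `q` is a common alternative.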
Co-Clustering
• Co-clustering groups two types of objects into their own clusters simultaneously.
• Bipartite graph partitioning (Dhillon and Zha)
  • Uses a bipartite graph to model the inter-relationship between the two types of objects: all edges in the bipartite graph are of the same type, so the graph cut is still easy to define.
  • It can be proven that the solutions are the singular vectors associated with the second largest singular value of the normalized inter-relationship matrix.
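A small sketch of bipartite spectral co-clustering may help here. It assumes NumPy and a hypothetical 4×4 relation matrix `A`; the function name `bipartite_embedding` is illustrative. In the usual SVD formulation, the relation matrix is scaled by the inverse square roots of its row and column degrees, and the singular vector pair of the second largest singular value (which corresponds to the second smallest generalized eigenvalue of the bipartite graph) embeds both object types at once.

```python
import numpy as np

def bipartite_embedding(A):
    """Embed row objects and column objects of relation matrix A together."""
    A = np.asarray(A, dtype=float)
    d_rows = A.sum(axis=1)               # degrees of the row objects
    d_cols = A.sum(axis=0)               # degrees of the column objects
    # Normalized inter-relationship matrix D_r^{-1/2} A D_c^{-1/2}.
    A_n = A / np.sqrt(d_rows)[:, None] / np.sqrt(d_cols)[None, :]
    U, s, Vt = np.linalg.svd(A_n, full_matrices=False)  # s is descending
    x = U[:, 1] / np.sqrt(d_rows)        # embedding of the row objects
    y = Vt[1, :] / np.sqrt(d_cols)       # embedding of the column objects
    return x, y

# Hypothetical data: two blocks linked by one weak entry (row 2, column 1).
A = np.array([
    [2, 1,   0, 0],
    [1, 2,   0, 0],
    [0, 0.1, 2, 1],
    [0, 0,   1, 2],
])
x, y = bipartite_embedding(A)
row_labels = (x > 0).astype(int)
col_labels = (y > 0).astype(int)
```

Rows and columns that share a label form one co-cluster, because the left and right singular vectors of the same singular value have coupled signs.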
High-order Heterogeneous Co-Clustering (HHCC)
• HHCC groups multiple (≥2) types of objects into clusters simultaneously.
• “Order” is defined as the number of types of objects.
• If we use a graph to represent the inter-relationships between data objects, the edges within each bipartite graph are of the same type, but edges in different bipartite graphs are of different types. This is what “heterogeneous” refers to, in contrast to spectral clustering and bipartite graph co-clustering.
HHCC is not a Rare Problem
• Typical examples
  • Surrounding Text – Web Image – Visual Features
  • User – Query – Click-through
• Many other examples
  • Category – Document – Term
  • Reader – Newspaper – Article
  • Passenger – Airplane – Airways
  • Webpage – Website – Site-group
  • Article – Magazine – Category
  • Hardware – Computer – Usage
  • Software – People – Community
Why is HHCC a New Problem?
• Although bipartite graph partitioning is just a trivial extension of spectral clustering, the extension to HHCC is non-trivial.
• Since there are different types of edges in the HHCC problem, the cut of high-order data is difficult to define. It may not be reasonable to assign weights to heterogeneous edges just to make their contributions to the graph cut comparable.
• Simply applying spectral clustering may degrade the high-order problem to a 2-order problem.
An Example of Weighting Heterogeneous Edges
[Figure: embeddings produced by spectral clustering for α = 0.01, α = 1 and α = 100]
• No matter how we adjust the weight α to balance the different types of edges, X can never be clustered into two groups successfully.
An Example of Weighting Heterogeneous Edges (Cont.)
• Mathematical Proof. Including X and Z
Order Degradation
[Figure: a 3-order heterogeneous graph degrades to a 2-order heterogeneous graph]
Our Solution
• We tackle the aforementioned problems by proposing a new solution to HHCC: Consistent Bipartite Graph Co-Partitioning (CGBC).
• Where should we get started?
  • Star-structured HHCC
  • The concept of consistency
  • An SDP-based solution
Why “Star-Structured”?
• “Star-structured” means that the heterogeneous graph has a central type of objects that connects to all the other types of objects, and that there are no direct connections between any of the other object types.
• “Star-structured” is the simplest but a very common case of HHCC.
Why “Star-Structured”? (Cont.)
• “Star-structured” is the simplest but a very common case of HHCC.
[Figure: example star structures, with the object types appearing in each]
• Surrounding text, Web Images, Visual features
• Author, Conference, Paper, Key Word
• Customer, Shareholder, Shop, Supplier, Advertisement Media
The Concept of Consistency
• Divide the star-structured HHCC problem into a set of bipartite sub-problems, where each sub-problem has only homogeneous edges.
• Solve each sub-problem separately to avoid order degradation.
• Add a global constraint on the central type of objects to obtain a feasible cut for the original problem.
The Concept of Consistency (Cont.)
• Divide this tripartite graph into two bipartite graphs.
• Partition these two graphs simultaneously and consistently.
Formulating the Optimization Problem
• Minimize the cuts of the two bipartite graphs, under the constraint that their partitioning results on the central type of objects are the same.
• Objective Function:
  • The definitions of q and p encode the consistency between the two graphs: the y in the two embeddings is the same, so the partitioning of the central type of objects is forced to be the same.
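The formula on this slide did not survive in the transcript. One way to write the problem that agrees with the surrounding text is sketched below. The notation is an assumption: $L^{(1)}, D^{(1)}$ and $L^{(2)}, D^{(2)}$ are the Laplacian and degree matrices of the two bipartite graphs (as named on the algorithm slide), $x$, $y$, $z$ are the embeddings of the three object types with $y$ the central type, and $e$ is the all-ones vector. The original slide may use a different form.

```latex
\begin{aligned}
\min_{q,\,p}\quad & \left\{ \frac{q^{T} L^{(1)} q}{q^{T} D^{(1)} q},\;
                          \frac{p^{T} L^{(2)} p}{p^{T} D^{(2)} p} \right\} \\
\text{s.t.}\quad & q = \begin{pmatrix} x \\ y \end{pmatrix},\qquad
                   p = \begin{pmatrix} y \\ z \end{pmatrix}, \\
                 & q^{T} D^{(1)} e = 0,\quad q \neq 0, \\
                 & p^{T} D^{(2)} e = 0,\quad p \neq 0 .
\end{aligned}
```

Each ratio is the normalized-cut Rayleigh quotient of one bipartite graph, and the shared block $y$ is what ties the two partitions together.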
How to Solve the Optimization Problem #1: Convert it to a QCQP Problem
• Simplify the original problem to single-objective programming.
  • Since the normalized Rayleigh quotient is already a scalar measure of the graph structure, a linear combination of the two Rayleigh quotients is reasonable, and its weight indicates which graph we should trust more.
  • Linear combination is only one approach to multi-objective programming; other methods that avoid this concern can also be used.
• With auxiliary notations, the sum-of-ratios quadratic fractional programming problem becomes a Quadratically Constrained Quadratic Programming (QCQP) problem.
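To make the single-objective form concrete, the sketch below evaluates a weighted sum of the two Rayleigh quotients for given embeddings. It is an illustration under assumptions, not the talk's code: the weight is called `beta` after the parameter β on the algorithm slide, the convention that `beta` multiplies the first graph is a guess, and the helper names and the tiny relation matrices are hypothetical.

```python
import numpy as np

def bipartite_laplacian(R):
    """Degree matrix D and Laplacian L of the bipartite graph of relation R."""
    R = np.asarray(R, dtype=float)
    m, n = R.shape
    # Affinity matrix of the bipartite graph: edges only between the two sides.
    W = np.block([[np.zeros((m, m)), R], [R.T, np.zeros((n, n))]])
    D = np.diag(W.sum(axis=1))
    return D, D - W

def rayleigh(L, D, v):
    """Normalized Rayleigh quotient v^T L v / v^T D v."""
    return float(v @ L @ v) / float(v @ D @ v)

def combined_objective(A, B, x, y, z, beta):
    """beta * quotient(X-Y graph) + (1 - beta) * quotient(Y-Z graph)."""
    D1, L1 = bipartite_laplacian(A)      # A relates X (rows) to Y (columns)
    D2, L2 = bipartite_laplacian(B)      # B relates Y (rows) to Z (columns)
    q = np.concatenate([x, y])           # y is shared by both embeddings
    p = np.concatenate([y, z])
    return beta * rayleigh(L1, D1, q) + (1.0 - beta) * rayleigh(L2, D2, p)

# Hypothetical conflict: graph 1 pairs y0 with x0, graph 2 pairs y0 with z1.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([1.0, -1.0])
z = np.array([1.0, -1.0])
follow_graph1 = combined_objective(A, B, x, np.array([1.0, -1.0]), z, beta=0.25)
follow_graph2 = combined_objective(A, B, x, np.array([-1.0, 1.0]), z, beta=0.25)
```

With `beta=0.25` the second graph carries more weight, so the assignment of y that agrees with the second graph gets the lower objective value.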
How to Solve the Optimization Problem #2: Convert QCQP to SDP
• Semi-definite Programming (SDP)
The Final Algorithm (CGBC)
• Set the parameters β, θ1 and θ2.
• Given the inter-relation matrices A and B, form the corresponding diagonal matrices and Laplacian matrices D(1), D(2), L(1) and L(2).
• Extend D(1), D(2), L(1) and L(2) to Π1, Π2, Γ1 and Γ2, and form Γ, so that the coefficient matrices in the SDP problem can be computed.
• Solve the above SDP problem with an iterative SDP solver such as SDPA.
• Extract ω from W and regard it as the embedding vector of the heterogeneous objects.
• Run the k-means algorithm on ω to obtain the desired partitioning of the heterogeneous objects.
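The last step of the algorithm, running k-means on the embedding vector, can be sketched as follows. The SDP step is not shown, so the values in `omega` are made up for illustration and are not results from the talk. The function `kmeans_1d` is a hypothetical, minimal Lloyd-style k-means for scalar embeddings with a deterministic initialization, assuming NumPy.

```python
import numpy as np

def kmeans_1d(values, k=2, n_iter=100):
    """Plain k-means on a 1-D embedding vector; returns labels and centers."""
    values = np.asarray(values, dtype=float)
    # Deterministic start: spread the centers evenly over the value range.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):          # keep the old center if empty
                new_centers[j] = values[labels == j].mean()
        if np.allclose(new_centers, centers):
            break                            # converged
        centers = new_centers
    return labels, centers

# Made-up embedding values standing in for the omega extracted from the SDP.
omega = np.array([-0.52, -0.48, -0.50, 0.47, 0.51, 0.49, -0.45, 0.55])
labels, centers = kmeans_1d(omega, k=2)
```

Because ω stacks the embeddings of all object types, the resulting labels partition every type at once.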
CGBC’s Extension to the k-star-structured HHCC
Experiment on Toy Problem
[Figure: relation matrices A and B, and the embedding values of the heterogeneous objects for β = 0, 0.2, 0.4, 0.6, 0.8, 1.0]
• Totally based on the first graph: Y(8:12)
• Totally based on the second graph: Y(12:8)
• In between: a more reasonable cut, based on information from both the first and the second graph
Experiment on Web Image Clustering
Embedding of the Clustering
[Figure panels: Hill vs. Owl; Flying vs. Map]
Performance Comparison
[Figure: average performance]
Conclusions
• We propose a new problem named high-order heterogeneous co-clustering (HHCC).
• We propose a consistent bipartite graph co-partitioning algorithm to solve the HHCC problem with star-structured inter-relationship.
• Various experiments demonstrate the effectiveness of our proposed algorithm.
References
• Bin Gao, Tie-Yan Liu, et al., Consistent Bipartite Graph Co-Partitioning for Star-Structured High-Order Heterogeneous Data Co-Clustering, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pp. 41–50.
• Bin Gao, Tie-Yan Liu, Tao Qin, Qian-Sheng Cheng, Wei-Ying Ma, Web Image Clustering by Consistent Utilization of Low-level Features and Surrounding Texts, in Proceedings of ACM Multimedia 2005.
Thanks!
Contact: tyliu@microsoft.com
http://research.microsoft.com/users/tyliu/