This research paper presents the RankClus algorithm that combines clustering and ranking techniques to analyze heterogeneous information networks. It improves cluster accuracy and provides meaningful rankings within each cluster. The algorithm is demonstrated using a toy example and a bi-type network case study.
RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, Tianyi Wu Department of Computer Science University of Illinois at Urbana-Champaign EDBT’09, St.-Petersburg, Russia, March 2009
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Information Networks Are Ubiquitous Conference-Author Network Co-author Network
Two Kinds of Information Networks • Homogeneous Network • Objects belong to single type • E.g., co-author network, internet, friendship network, gene interaction network, and so on • Most current studies are on homogeneous networks • Heterogeneous Network • Objects belong to several types • E.g., conference-author network, paper-conference-author-topic network, movie-user network, webpage-tag-user network, and so on • Most real networks are heterogeneous networks, and many homogeneous networks are extracted from a more complex network
How to Better Understand Information Networks? • Problem: Hard to understand large, raw networks • Huge number of objects • Links are “in a mess” • Solution: Extracting aggregate information from networks • Ranking • Clustering
Ranking • Goal • Evaluate importance of objects in the network • A ranking function: map an object into a real non-negative score • Algorithms • PageRank (for homogeneous networks) • HITS (for homogeneous networks) • PopRank (for heterogeneous networks)
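As a concrete illustration of a ranking function, here is a minimal PageRank power iteration on a toy homogeneous network. The adjacency matrix, the damping factor, and the `pagerank` helper are illustrative assumptions, not material from the paper:

```python
import numpy as np

# Hypothetical 4-node directed network; A[i, j] = 1 means node j links to node i.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

def pagerank(A, d=0.85, tol=1e-10):
    """Power iteration for PageRank; maps each node to a non-negative score."""
    n = A.shape[0]
    M = A / A.sum(axis=0)          # column-normalize: each node's out-links sum to 1
    r = np.full(n, 1.0 / n)        # uniform initial ranking
    while True:
        r_new = (1 - d) / n + d * M @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

r = pagerank(A)                    # a probability distribution over the nodes
```

The result is a non-negative score vector summing to 1, matching the definition of a ranking function above.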
Clustering • Goal • Group similar objects together and obtain the cluster label for each object • Algorithms • Spectral clustering: Min-Cut, N-Cut, and MinMax-Cut (for homogeneous networks) • Density-based clustering: SCAN (for homogeneous networks) • How to cluster heterogeneous networks? • Use SimRank to first extract pair-wise similarity for target objects (but time complexity is high) • Combined with spectral clustering
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Why RankClus? • More meaningful clusters • Within each cluster, a ranking score for every object is available as well • More meaningful ranking • Ranking within a cluster is more meaningful than ranking in the whole network • Addresses the problem of clustering in heterogeneous networks • No need to compute pair-wise similarity of objects • Maps each object into a low-dimensional measure space
Global Ranking vs. Within-Cluster Ranking in a Toy Example • Two areas: 10 conferences and 100 authors in each area
Difficulties in Clustering Heterogeneous Networks • Which type of objects should be clustered? • Clustering is performed on one specific type of objects (the target objects), specified by the user • A clustering of the target objects induces a sub-network of the original network • How to cluster efficiently? • How to avoid calculating pair-wise similarities among target objects?
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Algorithm Framework - Illustration Sub-Network Ranking Clustering
Algorithm Framework—Philosophy • Ranking and clustering can mutually improve each other • Ranking: Once a cluster becomes more accurate, ranking within it becomes more reasonable and serves as the distinguishing feature of that cluster • Clustering: Once the rankings become more distinct from each other, the clusters can be adjusted to obtain more accurate results • Objects preserve similarity under the new measure space • E.g., consider VLDB and SIGMOD
Algorithm Framework - Summary • Step 0. Initialization • Randomly partition target objects into K clusters • Step 1. Ranking • Ranking for each sub-network induced from each cluster, which serves as feature for each cluster • Step 2. Generating new measure space • Estimate mixture model coefficients for each target object • Step 3. Adjusting cluster • Step 4. Repeat Step 1-3 until stable
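Steps 0-4 above can be sketched end-to-end. This is a minimal runnable sketch, not the paper's implementation: for brevity it substitutes simple (degree-based) ranking for authority ranking and the naive cosine measure for the mixture model, and the helper names `simple_rank` and `rankclus_naive` are invented for illustration:

```python
import numpy as np

def simple_rank(W_sub):
    """Simple ranking (Step 1 stand-in): score proportional to weighted degree."""
    deg = W_sub.sum(axis=0)
    return deg / deg.sum() if deg.sum() > 0 else np.full(W_sub.shape[1], 1.0 / W_sub.shape[1])

def rankclus_naive(W, K, n_iter=50, seed=0):
    """Sketch of the RankClus loop on a bipartite network.
    W[i, j] is the link weight from target object i (e.g., a conference)
    to attribute object j (e.g., an author)."""
    rng = np.random.default_rng(seed)
    m = W.shape[0]
    labels = rng.integers(K, size=m)                       # Step 0: random K-way partition
    for _ in range(n_iter):
        # Step 1: conditional rank distribution r(Y|k) of each cluster's sub-network
        feats = np.array([simple_rank(W[labels == k]) for k in range(K)])
        # Step 2: new K-dim measure: cosine of each object's link profile to each r(Y|k)
        prof = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
        pn = prof / np.maximum(np.linalg.norm(prof, axis=1, keepdims=True), 1e-12)
        fn = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
        sims = pn @ fn.T                                   # (m, K) measure space
        # Step 3: adjust clusters: assign each object to the nearest center (cosine)
        centers = np.array([sims[labels == k].mean(axis=0) if (labels == k).any()
                            else rng.random(K) for k in range(K)])
        sn = sims / np.maximum(np.linalg.norm(sims, axis=1, keepdims=True), 1e-12)
        cn = centers / np.maximum(np.linalg.norm(centers, axis=1, keepdims=True), 1e-12)
        new_labels = (sn @ cn.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):             # Step 4: stop when stable
            break
        labels = new_labels
    return labels
```

On a toy block-structured conference-author matrix, the loop recovers the two areas after a few iterations.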
Focus on A Bi-type Network Case • Conference-author network, links can exist between • Conference (X) and author (Y) • Author (Y) and author (Y) • Use W to denote the links and their weights • W = [0, W_XY; W_YX, W_YY], where the zero block reflects that conferences have no direct links to each other
Step 1: Feature Extraction — Ranking • Simple Ranking • Proportional to degree counting for objects • E.g., number of publications of authors • Considers only immediate neighborhood in the network • Authority Ranking • Extension to HITS in weighted bi-type network • Rules: • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
Rules in Authority Ranking • Rule 1: r_Y(j) ∝ Σ_i W_YX(j, i) r_X(i) — highly ranked authors publish many papers in highly ranked conferences • Rule 2: r_X(i) ∝ Σ_j W_XY(i, j) r_Y(j) — highly ranked conferences attract many papers from many highly ranked authors • Rule 3: r_Y(j) ∝ α Σ_i W_YX(j, i) r_X(i) + (1 − α) Σ_j' W_YY(j, j') r_Y(j') — the rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
Philosophy in Authority Ranking • Ranking scores are propagated iteratively using Rules 1 and 2, or Rules 2 and 3 • The authority rankings of X and Y turn out to be the primary eigenvectors of symmetric matrices (e.g., W_XY·W_YX for X, since W_YX = W_XY^T) • Considers the impact from the overall network • Should be better than simple ranking
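The propagation can be sketched as a HITS-style power iteration on the weighted bi-type network. The function name, the default for α, and the sum-to-one normalization are illustrative assumptions:

```python
import numpy as np

def authority_rank(W_XY, W_YY=None, alpha=0.95, n_iter=100):
    """Authority ranking sketch for a bi-type (conference-author) network.
    W_XY: (n_x, n_y) conference-author link weights.
    W_YY: optional (n_y, n_y) co-author link weights (Rule 3)."""
    n_x, n_y = W_XY.shape
    r_x = np.full(n_x, 1.0 / n_x)
    r_y = np.full(n_y, 1.0 / n_y)
    for _ in range(n_iter):
        r_y = W_XY.T @ r_x                   # Rule 1: author scores from conference scores
        if W_YY is not None:
            r_y = alpha * r_y + (1 - alpha) * (W_YY @ r_y)   # Rule 3: co-author reinforcement
        r_y /= r_y.sum()                     # keep r_y a distribution
        r_x = W_XY @ r_y                     # Rule 2: conference scores from author scores
        r_x /= r_x.sum()
    return r_x, r_y
```

Because each iteration applies W_XY and its transpose in turn, the fixed point is the principal eigenvector behavior described above.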
Example: Authority Ranking in the 2-Area Conference-Author Network • Given the correct clusters, the rankings of authors in the two areas are quite distinct from each other
Step 2: Generate New Measure Space—A Naive Method • Map each target object x directly to a K-dimensional vector, by comparing the ranking over the sub-network induced by x with each cluster's ranking • r(Y|x) vs. r(Y|k) • Cosine similarity or KL-divergence can be used • E.g., (cos(r(Y|x), r(Y|1)), …, cos(r(Y|x), r(Y|K)))
Step 2: Generate New Measure Space—A Mixture Model Method • Assume each target object's links are generated under a mixture of the ranking distributions of the clusters • Treat each ranking as a distribution: r(Y) → p(Y) • Each target object xi is mapped into a K-dimensional vector (πi,k) • Parameters are estimated using the EM algorithm • Maximize the log-likelihood given all the observed links
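A simplified EM sketch for the coefficients, assuming the per-cluster ranking distributions p_k(Y) from Step 1 are held fixed. The paper's estimator is more involved, and `mixture_coefficients` is an invented name for illustration:

```python
import numpy as np

def mixture_coefficients(W, p, n_iter=200, tol=1e-8):
    """Estimate pi[i, k] such that object i's links look drawn from
    sum_k pi[i, k] * p_k(Y).

    W: (m, n) link weights from m target objects to n attribute objects
    p: (K, n) fixed per-cluster ranking distributions over attribute objects
    Returns: (m, K) coefficients with rows summing to 1."""
    m, n = W.shape
    K = p.shape[0]
    pi = np.full((m, K), 1.0 / K)                        # uniform start
    for _ in range(n_iter):
        # E-step: responsibility q[i, k, j] of cluster k for link (i, j)
        num = pi[:, :, None] * p[None, :, :]             # (m, K, n)
        q = num / np.maximum(num.sum(axis=1, keepdims=True), 1e-300)
        # M-step: weighted fraction of object i's links explained by cluster k
        new_pi = (q * W[:, None, :]).sum(axis=2)
        new_pi /= np.maximum(new_pi.sum(axis=1, keepdims=True), 1e-300)
        if np.abs(new_pi - pi).max() < tol:
            pi = new_pi
            break
        pi = new_pi
    return pi
```

An object whose links concentrate where p_1 puts its mass gets a coefficient vector close to (1, 0, …), which is exactly the low-dimensional measure used in Step 3.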
Example: 2-D Coefficients in the 2-Area Conference-Author Network • The conferences are well separated in the new measure space
Step 3: Cluster Adjustment in New Measure Space • Cluster center in new measure space • Vector mean of objects in the cluster (K-dimensional) • Cluster adjustment • Distance measure: 1- Cosine similarity • Assign to the cluster with the nearest center
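A sketch of the adjustment step, assuming every cluster remains non-empty; the helper name is illustrative:

```python
import numpy as np

def adjust_clusters(coeffs, labels, K):
    """Step 3 sketch: recompute each cluster center as the mean coefficient
    vector of its members, then reassign each target object to the cluster
    whose center is nearest under the distance (1 - cosine similarity).
    Assumes every cluster has at least one member."""
    centers = np.array([coeffs[labels == k].mean(axis=0) for k in range(K)])
    cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    xn = coeffs / np.linalg.norm(coeffs, axis=1, keepdims=True)
    dist = 1.0 - xn @ cn.T            # (m, K): 1 - cosine similarity to each center
    return dist.argmin(axis=1)
```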
A Running Case Illustration for 2-Area Conf-Author Network • Initially, the ranking distributions are mixed together; the two clusters of objects are mixed as well, but similarity is somewhat preserved • After one adjustment, the result improves a little; the two clusters are almost well separated • The result then improves significantly and the clusters become well separated • Finally, the result is stable
Ranking Function Analysis • Why is "Authority Ranking" better than "Simple Ranking"? • For authority ranking, each object's score is determined by • The number of objects linking to it • The strength of these links (link weight) • The quality of these objects (their scores) • For simple ranking, each object's score is determined by • The number of objects linking to it • The strength of these links (link weight) • The quality of all objects is treated as equal
Re-examine the Rules • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • An author publishing many papers in junk conferences will be ranked low • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • A conference accepting most papers from lowly ranked authors will be ranked low • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors • A highly ranked author in an area usually has many co-operations with others
Why Does a Better Ranking Function Derive Better Clustering? • Consider the measure-space generation process • For the naive method, highly ranked objects in a cluster play a more important role in deciding a target object's new measure • The same holds for the mixture model • Intuitively, finding the highly ranked objects in a cluster is equivalent to getting the right cluster
Time Complexity Analysis • At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters • Ranking for sparse network • ~O(|E|) • Mixture model estimation • ~O(K|E|+mK) • Cluster adjustment • ~O(mK^2) • In all, linear to |E| • ~O(K|E|)
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Case Study: Dataset: DBLP • All 2,676 conferences and the 20,000 authors with the most publications, from 1998 to 2007 • Both conference-author and co-author relationships are used • K = 15
Accuracy Study • Dataset: synthetic • Simulate a bipartite network similar to the conf-author network • P: controls the number of attribute objects per cluster • T: transition probability matrix, controls the overlap between clusters • K: fixed to 3 • Generating parameters for the five synthetic datasets:
Data1 (medium separation, medium density): P = [1000, 1500, 2000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data2 (medium separation, low density): P = [800, 1300, 1200], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data3 (medium separation, high density): P = [2000, 3000, 4000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data4 (high separation, medium density): P = [1000, 1500, 2000], T = [0.9, 0.05, 0.05; 0.05, 0.9, 0.05; 0.1, 0.05, 0.85]
Data5 (low separation, medium density): P = [1000, 1500, 2000], T = [0.7, 0.15, 0.15; 0.15, 0.7, 0.15; 0.15, 0.15, 0.7]
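A generation scheme in this spirit can be sketched as follows. The function name and the `links_per_target` / `targets_per_cluster` knobs are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def generate_bipartite(P, T, links_per_target=500, targets_per_cluster=10, seed=0):
    """Generate a synthetic bi-type network: P[k] attribute objects in
    cluster k, and T[k1][k2] the probability that a link from a cluster-k1
    target object lands in attribute cluster k2 (rows of T sum to 1)."""
    P, T = np.asarray(P), np.asarray(T)
    rng = np.random.default_rng(seed)
    K, n = len(P), P.sum()
    offsets = np.concatenate([[0], np.cumsum(P)])        # attribute-cluster boundaries
    m = K * targets_per_cluster
    W = np.zeros((m, n))
    labels = np.repeat(np.arange(K), targets_per_cluster)
    for i, k in enumerate(labels):
        for _ in range(links_per_target):
            k2 = rng.choice(K, p=T[k])                   # pick a destination cluster
            j = rng.integers(offsets[k2], offsets[k2 + 1])  # uniform object within it
            W[i, j] += 1
    return W, labels
```

Larger diagonal entries of T produce better-separated clusters (Data4), smaller ones more overlap (Data5), matching the settings above.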
Accuracy Study (Cont.) • 5 synthetic dataset settings, 4 methods • For each setting, generate 10 datasets and run each method 100 times on each dataset • RankClus with authority ranking is the best overall
Efficiency Study • Varying the number of attribute objects (doubled at each step)
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusions
Conclusions • Proposed a general framework in which ranking and clustering are successfully combined to analyze information networks • Formally studied how ranking and clustering can mutually reinforce each other in information network analysis • Proposed a novel algorithm, RankClus, and verified its correctness and effectiveness • Conducted a thorough experimental study on both synthetic and real datasets against state-of-the-art algorithms; the results demonstrate the accuracy and efficiency of RankClus