This research paper presents the RankClus algorithm that combines clustering and ranking techniques to analyze heterogeneous information networks. It improves cluster accuracy and provides meaningful rankings within each cluster. The algorithm is demonstrated using a toy example and a bi-type network case study.
RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, Tianyi Wu Department of Computer Science University of Illinois at Urbana-Champaign EDBT’09, St.-Petersburg, Russia, March 2009
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Information Networks Are Ubiquitous Conference-Author Network Co-author Network
Two Kinds of Information Networks • Homogeneous Network • Objects belong to single type • E.g., co-author network, internet, friendship network, gene interaction network, and so on • Most current studies are on homogeneous networks • Heterogeneous Network • Objects belong to several types • E.g., conference-author network, paper-conference-author-topic network, movie-user network, webpage-tag-user network, and so on • Most real networks are heterogeneous networks, and many homogeneous networks are extracted from a more complex network
How to Better Understand Information Networks? • Problem: Hard to understand large, raw networks • Huge number of objects • Links are “in a mess” • Solution: Extracting aggregate information from networks • Ranking • Clustering
Ranking • Goal • Evaluate importance of objects in the network • A ranking function: map an object into a real non-negative score • Algorithms • PageRank (for homogeneous networks) • HITS (for homogeneous networks) • PopRank (for heterogeneous networks)
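As a concrete illustration of a ranking function, here is a minimal PageRank power iteration on a toy homogeneous network. The adjacency matrix, the damping factor, and the `pagerank` helper are illustrative assumptions, not material from the paper:

```python
import numpy as np

# Hypothetical 4-node directed network; A[i, j] = 1 means node j links to node i.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

def pagerank(A, d=0.85, tol=1e-10):
    """Power iteration for PageRank; maps each node to a non-negative score."""
    n = A.shape[0]
    M = A / A.sum(axis=0)          # column-normalize: each node's out-links sum to 1
    r = np.full(n, 1.0 / n)        # uniform initial ranking
    while True:
        r_new = (1 - d) / n + d * M @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

r = pagerank(A)                    # a probability distribution over the nodes
```

The result is a non-negative score vector summing to 1, matching the definition of a ranking function above.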
Clustering • Goal • Group similar objects together and obtain the cluster label for each object • Algorithms • Spectral clustering: Min-Cut, N-Cut, and MinMax-Cut (for homogeneous networks) • Density-based clustering: SCAN (for homogeneous networks) • How to cluster heterogeneous networks? • Use SimRank to first extract pair-wise similarity for target objects (but time complexity is high) • Combined with spectral clustering
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Why RankClus? • More meaningful clusters • Within each cluster, a ranking score for every object is available as well • More meaningful ranking • Ranking within a cluster is more meaningful than ranking in the whole network • Addresses the problem of clustering in heterogeneous networks • No need to compute pair-wise similarity of objects • Maps each object into a low-dimensional measure space
Global Ranking vs. Within-Cluster Ranking in a Toy Example • Two areas: 10 conferences and 100 authors in each area
Difficulties in Clustering Heterogeneous Networks • Which type of objects should be clustered? • Clustering is performed on one specific type of objects (the target objects), specified by the user • A clustering of the target objects induces a sub-network of the original network • How to cluster efficiently? • How to avoid calculating pair-wise similarities among target objects?
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Algorithm Framework - Illustration Sub-Network Ranking Clustering
Algorithm Framework—Philosophy • Ranking and clustering can mutually improve each other • Ranking: Once a cluster becomes more accurate, ranking within it becomes more reasonable and serves as the distinguishing feature of that cluster • Clustering: Once the rankings become more distinct from each other, the clusters can be adjusted to obtain more accurate results • Objects preserve similarity under the new measure space • E.g., consider VLDB and SIGMOD
Algorithm Framework - Summary • Step 0. Initialization • Randomly partition target objects into K clusters • Step 1. Ranking • Ranking for each sub-network induced from each cluster, which serves as feature for each cluster • Step 2. Generating new measure space • Estimate mixture model coefficients for each target object • Step 3. Adjusting cluster • Step 4. Repeat Step 1-3 until stable
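Steps 0-4 above can be sketched end-to-end. This is a minimal runnable sketch, not the paper's implementation: for brevity it substitutes simple (degree-based) ranking for authority ranking and the naive cosine measure for the mixture model, and the helper names `simple_rank` and `rankclus_naive` are invented for illustration:

```python
import numpy as np

def simple_rank(W_sub):
    """Simple ranking (Step 1 stand-in): score proportional to weighted degree."""
    deg = W_sub.sum(axis=0)
    return deg / deg.sum() if deg.sum() > 0 else np.full(W_sub.shape[1], 1.0 / W_sub.shape[1])

def rankclus_naive(W, K, n_iter=50, seed=0):
    """Sketch of the RankClus loop on a bipartite network.
    W[i, j] is the link weight from target object i (e.g., a conference)
    to attribute object j (e.g., an author)."""
    rng = np.random.default_rng(seed)
    m = W.shape[0]
    labels = rng.integers(K, size=m)                       # Step 0: random K-way partition
    for _ in range(n_iter):
        # Step 1: conditional rank distribution r(Y|k) of each cluster's sub-network
        feats = np.array([simple_rank(W[labels == k]) for k in range(K)])
        # Step 2: new K-dim measure: cosine of each object's link profile to each r(Y|k)
        prof = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
        pn = prof / np.maximum(np.linalg.norm(prof, axis=1, keepdims=True), 1e-12)
        fn = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
        sims = pn @ fn.T                                   # (m, K) measure space
        # Step 3: adjust clusters: assign each object to the nearest center (cosine)
        centers = np.array([sims[labels == k].mean(axis=0) if (labels == k).any()
                            else rng.random(K) for k in range(K)])
        sn = sims / np.maximum(np.linalg.norm(sims, axis=1, keepdims=True), 1e-12)
        cn = centers / np.maximum(np.linalg.norm(centers, axis=1, keepdims=True), 1e-12)
        new_labels = (sn @ cn.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):             # Step 4: stop when stable
            break
        labels = new_labels
    return labels
```

On a toy block-structured conference-author matrix, the loop recovers the two areas after a few iterations.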
Focus on A Bi-type Network Case • Conference-author network, links can exist between • Conference (X) and author (Y) • Author (Y) and author (Y) • Use W to denote the links and their weights • W = [0, W_XY; W_YX, W_YY], where the zero block reflects that conferences have no direct links to each other
Step 1: Feature Extraction — Ranking • Simple Ranking • Proportional to degree counting for objects • E.g., number of publications of authors • Considers only immediate neighborhood in the network • Authority Ranking • Extension to HITS in weighted bi-type network • Rules: • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
Rules in Authority Ranking • Rule 1: r_Y(j) ∝ Σ_i W_YX(j, i) r_X(i) — highly ranked authors publish many papers in highly ranked conferences • Rule 2: r_X(i) ∝ Σ_j W_XY(i, j) r_Y(j) — highly ranked conferences attract many papers from many highly ranked authors • Rule 3: r_Y(j) ∝ α Σ_i W_YX(j, i) r_X(i) + (1 − α) Σ_j' W_YY(j, j') r_Y(j') — the rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
Philosophy in Authority Ranking • Ranking scores are propagated iteratively using Rules 1 and 2, or Rules 2 and 3 • The authority rankings of X and Y turn out to be the primary eigenvectors of symmetric matrices (e.g., W_XY·W_YX for X, since W_YX = W_XY^T) • Considers the impact from the overall network • Should be better than simple ranking
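The propagation can be sketched as a HITS-style power iteration on the weighted bi-type network. The function name, the default for α, and the sum-to-one normalization are illustrative assumptions:

```python
import numpy as np

def authority_rank(W_XY, W_YY=None, alpha=0.95, n_iter=100):
    """Authority ranking sketch for a bi-type (conference-author) network.
    W_XY: (n_x, n_y) conference-author link weights.
    W_YY: optional (n_y, n_y) co-author link weights (Rule 3)."""
    n_x, n_y = W_XY.shape
    r_x = np.full(n_x, 1.0 / n_x)
    r_y = np.full(n_y, 1.0 / n_y)
    for _ in range(n_iter):
        r_y = W_XY.T @ r_x                   # Rule 1: author scores from conference scores
        if W_YY is not None:
            r_y = alpha * r_y + (1 - alpha) * (W_YY @ r_y)   # Rule 3: co-author reinforcement
        r_y /= r_y.sum()                     # keep r_y a distribution
        r_x = W_XY @ r_y                     # Rule 2: conference scores from author scores
        r_x /= r_x.sum()
    return r_x, r_y
```

Because each iteration applies W_XY and its transpose in turn, the fixed point is the principal eigenvector behavior described above.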
Example: Authority Ranking in the 2-Area Conference-Author Network • Given the correct clusters, the rankings of authors in the two areas are quite distinct from each other
Step 2: Generate New Measure Space—A Naive Method • Map each target object x directly to a K-dimensional vector, by comparing the ranking over the sub-network induced by x with each cluster's ranking • r(Y|x) vs. r(Y|k) • Cosine similarity or KL-divergence can be used • E.g., (cos(r(Y|x), r(Y|1)), …, cos(r(Y|x), r(Y|K)))
Step 2: Generate New Measure Space—A Mixture Model Method • Assume each target object's links are generated under a mixture of the ranking distributions of the clusters • Treat each ranking as a distribution: r(Y) → p(Y) • Each target object xi is mapped into a K-dimensional vector (πi,k) • Parameters are estimated using the EM algorithm • Maximize the log-likelihood given all the observed links
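A simplified EM sketch for the coefficients, assuming the per-cluster ranking distributions p_k(Y) from Step 1 are held fixed. The paper's estimator is more involved, and `mixture_coefficients` is an invented name for illustration:

```python
import numpy as np

def mixture_coefficients(W, p, n_iter=200, tol=1e-8):
    """Estimate pi[i, k] such that object i's links look drawn from
    sum_k pi[i, k] * p_k(Y).

    W: (m, n) link weights from m target objects to n attribute objects
    p: (K, n) fixed per-cluster ranking distributions over attribute objects
    Returns: (m, K) coefficients with rows summing to 1."""
    m, n = W.shape
    K = p.shape[0]
    pi = np.full((m, K), 1.0 / K)                        # uniform start
    for _ in range(n_iter):
        # E-step: responsibility q[i, k, j] of cluster k for link (i, j)
        num = pi[:, :, None] * p[None, :, :]             # (m, K, n)
        q = num / np.maximum(num.sum(axis=1, keepdims=True), 1e-300)
        # M-step: weighted fraction of object i's links explained by cluster k
        new_pi = (q * W[:, None, :]).sum(axis=2)
        new_pi /= np.maximum(new_pi.sum(axis=1, keepdims=True), 1e-300)
        if np.abs(new_pi - pi).max() < tol:
            pi = new_pi
            break
        pi = new_pi
    return pi
```

An object whose links concentrate where p_1 puts its mass gets a coefficient vector close to (1, 0, …), which is exactly the low-dimensional measure used in Step 3.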
Example: 2-D Coefficients in the 2-Area Conference-Author Network • The conferences are well separated in the new measure space
Step 3: Cluster Adjustment in New Measure Space • Cluster center in new measure space • Vector mean of objects in the cluster (K-dimensional) • Cluster adjustment • Distance measure: 1- Cosine similarity • Assign to the cluster with the nearest center
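A sketch of the adjustment step, assuming every cluster remains non-empty; the helper name is illustrative:

```python
import numpy as np

def adjust_clusters(coeffs, labels, K):
    """Step 3 sketch: recompute each cluster center as the mean coefficient
    vector of its members, then reassign each target object to the cluster
    whose center is nearest under the distance (1 - cosine similarity).
    Assumes every cluster has at least one member."""
    centers = np.array([coeffs[labels == k].mean(axis=0) for k in range(K)])
    cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    xn = coeffs / np.linalg.norm(coeffs, axis=1, keepdims=True)
    dist = 1.0 - xn @ cn.T            # (m, K): 1 - cosine similarity to each center
    return dist.argmin(axis=1)
```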
A Running Case Illustration for 2-Area Conf-Author Network • Initially, the ranking distributions are mixed together; the two clusters of objects are mixed as well, but similarity is somewhat preserved • After one adjustment, the result improves a little; the two clusters are almost well separated • The result then improves significantly and the clusters become well separated • Finally, the result is stable
Ranking Function Analysis • Why is "Authority Ranking" better than "Simple Ranking"? • For authority ranking, each object's score is determined by • The number of objects linking to it • The strength of these links (link weight) • The quality of these objects (their scores) • For simple ranking, each object's score is determined by • The number of objects linking to it • The strength of these links (link weight) • The quality of all objects is treated as equal
Re-examine the Rules • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • An author publishing many papers in junk conferences will be ranked low • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • A conference accepting most papers from lowly ranked authors will be ranked low • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors • A highly ranked author in an area usually has many co-operations with others
Why Does a Better Ranking Function Derive Better Clustering? • Consider the measure-space generation process • For the naive method, highly ranked objects in a cluster play a more important role in deciding a target object's new measure • The same holds for the mixture model • Intuitively, finding the highly ranked objects in a cluster is equivalent to getting the right cluster
Time Complexity Analysis • At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters • Ranking for sparse network • ~O(|E|) • Mixture model estimation • ~O(K|E|+mK) • Cluster adjustment • ~O(mK^2) • In all, linear to |E| • ~O(K|E|)
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion
Case Study: Dataset: DBLP • All 2,676 conferences and the 20,000 authors with the most publications, from 1998 to 2007 • Both conference-author and co-author relationships are used • K = 15
Accuracy Study • Dataset: synthetic • Simulate a bipartite network similar to the conf-author network • P: controls the number of attribute objects per cluster • T: transition probability matrix, controls the overlap between clusters • K: fixed to 3 • Generating parameters for the five synthetic datasets:
Data1 (medium separation, medium density): P = [1000, 1500, 2000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data2 (medium separation, low density): P = [800, 1300, 1200], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data3 (medium separation, high density): P = [2000, 3000, 4000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85]
Data4 (high separation, medium density): P = [1000, 1500, 2000], T = [0.9, 0.05, 0.05; 0.05, 0.9, 0.05; 0.1, 0.05, 0.85]
Data5 (low separation, medium density): P = [1000, 1500, 2000], T = [0.7, 0.15, 0.15; 0.15, 0.7, 0.15; 0.15, 0.15, 0.7]
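A generation scheme in this spirit can be sketched as follows. The function name and the `links_per_target` / `targets_per_cluster` knobs are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def generate_bipartite(P, T, links_per_target=500, targets_per_cluster=10, seed=0):
    """Generate a synthetic bi-type network: P[k] attribute objects in
    cluster k, and T[k1][k2] the probability that a link from a cluster-k1
    target object lands in attribute cluster k2 (rows of T sum to 1)."""
    P, T = np.asarray(P), np.asarray(T)
    rng = np.random.default_rng(seed)
    K, n = len(P), P.sum()
    offsets = np.concatenate([[0], np.cumsum(P)])        # attribute-cluster boundaries
    m = K * targets_per_cluster
    W = np.zeros((m, n))
    labels = np.repeat(np.arange(K), targets_per_cluster)
    for i, k in enumerate(labels):
        for _ in range(links_per_target):
            k2 = rng.choice(K, p=T[k])                   # pick a destination cluster
            j = rng.integers(offsets[k2], offsets[k2 + 1])  # uniform object within it
            W[i, j] += 1
    return W, labels
```

Larger diagonal entries of T produce better-separated clusters (Data4), smaller ones more overlap (Data5), matching the settings above.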
Accuracy Study (Cont.) • 5 synthetic dataset settings, 4 methods • For each setting, generate 10 datasets and run each method 100 times on each dataset • RankClus with authority ranking is the best overall
Efficiency Study • Varying the number of attribute objects (doubled at each step)
Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusions
Conclusions • Proposed a general framework in which ranking and clustering are successfully combined to analyze information networks • Formally studied how ranking and clustering can mutually reinforce each other in information network analysis • Proposed a novel algorithm, RankClus, and verified its correctness and effectiveness • Conducted a thorough experimental study on both synthetic and real datasets against state-of-the-art algorithms; the results demonstrate the accuracy and efficiency of RankClus