RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis

This research paper presents the RankClus algorithm that combines clustering and ranking techniques to analyze heterogeneous information networks. It improves cluster accuracy and provides meaningful rankings within each cluster. The algorithm is demonstrated using a toy example and a bi-type network case study.

Presentation Transcript


  1. RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, Tianyi Wu Department of Computer Science University of Illinois at Urbana-Champaign EDBT’09, St.-Petersburg, Russia, March 2009

  2. Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion

  3. Information Networks Are Ubiquitous Conference-Author Network Co-author Network

  4. Two Kinds of Information Networks • Homogeneous Network • Objects belong to single type • E.g., co-author network, internet, friendship network, gene interaction network, and so on • Most current studies are on homogeneous networks • Heterogeneous Network • Objects belong to several types • E.g., conference-author network, paper-conference-author-topic network, movie-user network, webpage-tag-user network, and so on • Most real networks are heterogeneous networks, and many homogeneous networks are extracted from a more complex network

  5. How to Better Understand Information Networks? • Problem: Hard to understand large, raw networks • Huge number of objects • Links are “in a mess” • Solution: Extracting aggregate information from networks • Ranking • Clustering

  6. Ranking • Goal • Evaluate importance of objects in the network • A ranking function: map an object into a real non-negative score • Algorithms • PageRank (for homogeneous networks) • HITS (for homogeneous networks) • PopRank (for heterogeneous networks)

  7. Clustering • Goal • Group similar objects together and obtain the cluster label for each object • Algorithms • Spectral clustering: Min-Cut, N-Cut, and MinMax-Cut (for homogeneous networks) • Density-based clustering: SCAN (for homogeneous networks) • How to cluster heterogeneous networks? • Use SimRank to first extract pair-wise similarity for target objects (but time complexity is high) • Combined with spectral clustering

  8. Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion

  9. Why RankClus? • More meaningful clusters • Within each cluster, a ranking score for every object is available as well • More meaningful ranking • Ranking within a cluster is more meaningful than in the whole network • Addresses the problem of clustering in heterogeneous networks • No need to compute pair-wise similarity of objects • Maps each object into a low-dimensional measure space

  10. Global Ranking vs. Within-Cluster Ranking in a Toy Example • Two areas: 10 conferences and 100 authors in each area

  11. Difficulties in Clustering Heterogeneous Networks • What type of objects to be clustered? • Clustering on one specific type of objects (called target objects): specified by user • Clustering of target objects can induce a sub-network of the original network • Efficient algorithm of clustering • How to avoid calculating pair-wise similarities among target objects?

  12. Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion

  13. Algorithm Framework - Illustration Sub-Network Ranking Clustering

  14. Algorithm Framework - Philosophy • Ranking and clustering can be mutually improved • Ranking: Once a cluster becomes more accurate, its ranking becomes more reasonable and serves as the distinguishing feature of the cluster • Clustering: Once the rankings of different clusters become more distinct from each other, the clusters can be adjusted to get more accurate results • Objects preserve similarity under the new measure space • E.g., consider VLDB and SIGMOD

  15. Algorithm Framework - Summary • Step 0. Initialization • Randomly partition target objects into K clusters • Step 1. Ranking • Ranking for each sub-network induced from each cluster, which serves as feature for each cluster • Step 2. Generating new measure space • Estimate mixture model coefficients for each target object • Step 3. Adjusting cluster • Step 4. Repeat Step 1-3 until stable
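
The loop in Steps 0-4 can be sketched end to end in a few lines. This is only an illustrative sketch under several assumptions, not the authors' implementation: it ranks attribute objects by normalized link weight inside each cluster, builds the K-dimensional vectors with cosine similarity rather than the mixture model, and takes the initial partition as an argument instead of drawing it at random so that the run is reproducible. The names `rankclus_sketch`, `w_xy`, and `max_iters` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rankclus_sketch(w_xy, labels, k_count, max_iters=20):
    """w_xy[i][j]: link weight between target object i and attribute object j.
    labels: initial cluster label of each target object (Step 0)."""
    n_y = len(w_xy[0])
    labels = list(labels)
    for _ in range(max_iters):
        # Step 1: rank attribute objects inside the sub-network of each cluster.
        rankings = []
        for k in range(k_count):
            col = [sum(row[j] for row, lab in zip(w_xy, labels) if lab == k)
                   for j in range(n_y)]
            total = sum(col)
            rankings.append([c / total for c in col] if total > 0 else col)
        # Step 2: map each target object to a K-dimensional vector.
        vectors = [[cosine(row, r) for r in rankings] for row in w_xy]
        # Step 3: recompute centers (empty clusters are skipped) and reassign.
        centers = {}
        for k in range(k_count):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
        new_labels = [min(centers, key=lambda k: 1.0 - cosine(v, centers[k]))
                      for v in vectors]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy network: conferences 0-1 share authors 0-1, conferences 2-3 share authors 2-3.
toy = [[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4], [1, 0, 4, 5]]
final_labels = rankclus_sketch(toy, labels=[0, 0, 0, 1], k_count=2)
```

Starting from a partition that places three conferences in one cluster, the toy run moves the misplaced conference to the other cluster and then stabilizes.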

  16. Focus on A Bi-type Network Case • Conference-author network, links can exist between • Conference (X) and author (Y) • Author (Y) and author (Y) • Use W to denote the links and their weights • W =

  17. Step 1: Feature Extraction — Ranking • Simple Ranking • Proportional to degree counting for objects • E.g., number of publications of authors • Considers only immediate neighborhood in the network • Authority Ranking • Extension to HITS in weighted bi-type network • Rules: • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
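
Simple ranking as described above amounts to weighted degree counting. A minimal sketch, assuming the conference-author links are stored as a dense list of lists `w_xy` (a hypothetical name) and that scores are normalized to sum to 1:

```python
def simple_ranking(w_xy):
    """w_xy[i][j]: weight of the link between conference i and author j
    (for example, the number of papers author j published in conference i)."""
    total = sum(sum(row) for row in w_xy)
    if total == 0:
        raise ValueError("network has no links")
    # Score of a conference: its share of all link weight (row sum).
    rank_x = [sum(row) / total for row in w_xy]
    # Score of an author: its share of all link weight (column sum).
    rank_y = [sum(col) / total for col in zip(*w_xy)]
    return rank_x, rank_y


w_xy = [
    [3, 1, 0],  # conference 0
    [0, 2, 2],  # conference 1
]
rank_x, rank_y = simple_ranking(w_xy)
# rank_x is [0.5, 0.5]; rank_y is [0.375, 0.375, 0.25]
```

Only the immediate neighborhood matters here: two authors with the same publication count get the same score regardless of where they publish.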

  18. Rules in Authority Ranking • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

  19. Philosophy in Authority Ranking • Ranking scores are propagated iteratively using rules 2 and 1, or rules 2 and 3 • The authority rankings of X and Y turn out to be primary eigenvectors of some symmetric matrix • Considers the impact from the overall network • Should be better than simple ranking
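
The iterative propagation can be sketched as a power iteration. This is a sketch under assumptions, not the authors' code: scores are normalized to sum to 1 after every update, and the co-author rule is blended in with a weight `alpha`. The names `authority_ranking`, `w_yy`, `alpha`, `iters`, and `tol` are hypothetical and do not appear in the slides.

```python
def normalize(v):
    """Scale a non-negative vector so it sums to 1 (returned unchanged if all zeros)."""
    s = sum(v)
    return [x / s for x in v] if s > 0 else list(v)


def authority_ranking(w_xy, w_yy=None, alpha=1.0, iters=100, tol=1e-12):
    """w_xy[i][j]: conference-author link weight; w_yy[j][k]: co-author link weight."""
    n_x, n_y = len(w_xy), len(w_xy[0])
    r_x = [1.0 / n_x] * n_x
    r_y = [1.0 / n_y] * n_y
    for _ in range(iters):
        # Rule 1: an author scores high by publishing in highly ranked conferences.
        new_y = [sum(w_xy[i][j] * r_x[i] for i in range(n_x)) for j in range(n_y)]
        if w_yy is not None:
            # Rule 3: co-authoring with highly ranked authors adds to the score.
            co = [sum(w_yy[j][k] * r_y[k] for k in range(n_y)) for j in range(n_y)]
            new_y = [alpha * a + (1.0 - alpha) * c for a, c in zip(new_y, co)]
        new_y = normalize(new_y)
        # Rule 2: a conference scores high by attracting highly ranked authors.
        new_x = normalize([sum(w_xy[i][j] * new_y[j] for j in range(n_y))
                           for i in range(n_x)])
        delta = max(abs(a - b) for a, b in zip(new_x + new_y, r_x + r_y))
        r_x, r_y = new_x, new_y
        if delta < tol:
            break
    return r_x, r_y


r_x, r_y = authority_ranking([[5, 1], [1, 1]])
```

Without the co-author term, the fixed point of this update is the principal eigenvector of the link matrix multiplied by its transpose, which matches the eigenvector remark above.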

  20. Example: Authority Ranking in the 2-Area Conference-Author Network • Given the correct clusters, the rankings of authors are quite distinct from each other

  21. Step 2: Generate New Measure Space—A Naive Method • Mapping target object to a K-dimensional vector directly by considering a sub-network induced by it • r(Y|x) vs. r(Y|k) • Cosine similarity or KL-Divergence can be used • E.g., (cos(r(Y|x), r(Y|1)), …, cos(r(Y|x), r(Y|K)))
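
The naive mapping can be written directly from the formula. In this small sketch, r(Y|x) is assumed to be the author distribution obtained from the links of a single conference x, and `cluster_rankings[k]` is assumed to hold r(Y|k); both names and `naive_measure` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def naive_measure(r_y_given_x, cluster_rankings):
    """Return (cos(r(Y|x), r(Y|1)), ..., cos(r(Y|x), r(Y|K)))."""
    return [cosine(r_y_given_x, r_k) for r_k in cluster_rankings]


# A conference whose authors all belong to the first of two clusters.
vec = naive_measure([0.5, 0.5, 0.0, 0.0],
                    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
```

Here `vec` comes out close to (1, 0): the conference lines up with the first cluster's ranking and shares no authors with the second.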

  22. Step 2: Generate New Measure Space - A Mixture Model Method • Assume each target object’s links are generated from a mixture of the ranking distributions of the clusters • Treat ranking as a distribution: r(Y) → p(Y) • Each target object xi is mapped into a K-vector (πi,k) of mixture coefficients • Parameters are estimated using the EM algorithm • Maximize the log-likelihood given all the observations of links
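
To make the EM step concrete, here is a simplified sketch that estimates the coefficients of one target object at a time. It assumes each link of the object is drawn from a mixture of the cluster distributions p_k(Y), and it is a textbook mixture-weight EM, not necessarily the exact formulation used by the authors. The names `estimate_mixture_coefficients`, `links`, and `cluster_dists` are hypothetical.

```python
def estimate_mixture_coefficients(links, cluster_dists, iters=50):
    """links[j]: weight of the link from one target object to attribute object j.
    cluster_dists[k][j]: p_k(y_j), the ranking of cluster k read as a distribution.
    Returns the K mixture coefficients of this target object."""
    k_count = len(cluster_dists)
    pi = [1.0 / k_count] * k_count  # start from a uniform mixture
    for _ in range(iters):
        new_pi = [0.0] * k_count
        for j, w in enumerate(links):
            if w == 0:
                continue
            # E-step: posterior probability that link j came from each cluster.
            joint = [pi[k] * cluster_dists[k][j] for k in range(k_count)]
            z = sum(joint)
            if z == 0:
                continue  # no cluster explains this link; skipped in this sketch
            for k in range(k_count):
                new_pi[k] += w * joint[k] / z
        # M-step: new coefficient = weighted share of links attributed to cluster k.
        s = sum(new_pi)
        if s == 0:
            break
        pi = [v / s for v in new_pi]
    return pi


cluster_dists = [
    [0.60, 0.30, 0.05, 0.05],  # cluster 0 favors authors 0 and 1
    [0.05, 0.05, 0.30, 0.60],  # cluster 1 favors authors 2 and 3
]
pi = estimate_mixture_coefficients([4, 3, 1, 0], cluster_dists)
```

Each iteration does not decrease the log-likelihood of the observed links, and the resulting vector places most of its mass on the cluster whose top authors the conference actually links to.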

  23. Example: 2-D Coefficients in the 2-Area Conference-Author Network • The conferences are well separated in the new measure space

  24. Step 3: Cluster Adjustment in New Measure Space • Cluster center in new measure space • Vector mean of objects in the cluster (K-dimensional) • Cluster adjustment • Distance measure: 1- Cosine similarity • Assign to the cluster with the nearest center
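
The adjustment step resembles one k-means pass with cosine distance. A minimal sketch, assuming the K-dimensional vectors from Step 2 are already available; the slide does not say what to do with an empty cluster, so this sketch simply skips it. The names `adjust_clusters`, `vectors`, and `labels` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def adjust_clusters(vectors, labels, k_count):
    """Recompute cluster centers as vector means, then reassign every object
    to the center with the smallest distance 1 - cosine similarity."""
    centers = {}
    for k in range(k_count):
        members = [v for v, lab in zip(vectors, labels) if lab == k]
        if members:  # assumption: an empty cluster gets no center
            centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return [min(centers, key=lambda k: 1.0 - cosine(v, centers[k]))
            for v in vectors]


vectors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
new_labels = adjust_clusters(vectors, labels=[0, 0, 0, 1], k_count=2)
# new_labels is [0, 0, 1, 1]: the third object moves to the second cluster
```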

  25. A Running Case Illustration for 2-Area Conf-Author Network • Initially, ranking distributions are mixed together • Two clusters of objects mixed together, but preserve similarity somehow • Improved a little • Two clusters are almost well separated • Improved significantly • Well separated • Stable

  26. Ranking Function Analysis • Why is “Authority Ranking” better than “Simple Ranking”? • For authority ranking, each object’s score is determined by • The number of objects linking to it • The strength of these links (weight of link) • The quality of these objects (score) • For simple ranking, each object’s score is determined by • The number of objects linking to it • The strength of these links (weight of link) • The quality of these objects is treated as equal

  27. Re-examine the Rules • Rule 1: Highly ranked authors publish many papers in highly ranked conferences • An author publishing many papers in junk conferences will be ranked low • Rule 2: Highly ranked conferences attract many papers from many highly ranked authors • A conference accepting most papers from lowly ranked authors will be ranked low • Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors • A highly ranked author in an area usually has many co-operations with others

  28. Why Better Ranking Function Derives Better Clustering? • Consider the measure space generation process • For naive method, highly ranked objects in a cluster play a more important role to decide a target object’s new measure • For mixture model, the same • Intuitively, if we can find the highly ranked objects in a cluster, equivalently, we get the right cluster

  29. Time Complexity Analysis • At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters • Ranking for sparse network • ~O(|E|) • Mixture model estimation • ~O(K|E|+mK) • Cluster adjustment • ~O(mK^2) • In all, linear to |E| • ~O(K|E|)

  30. Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusion

  31. Case Study: Dataset: DBLP • All the 2676 conferences and 20,000 authors with most publications, from the time period of year 1998 to year 2007. • Both conference-author relationships and co-author relationships are used. • K=15

  32. Accuracy Study • Dataset: synthetic dataset • Simulate a bipartite network similar to conf-author network • P: controls the number of attribute objects • T: transition probability matrix, controls the overlap between clusters • K: fixed to 3 • Generating parameters for the five synthetic datasets • Data1: medium separated and medium density, P = [1000, 1500, 2000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85] • Data2: medium separated and low density, P = [800, 1300, 1200], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85] • Data3: medium separated and high density, P = [2000, 3000, 4000], T = [0.8, 0.05, 0.15; 0.1, 0.8, 0.1; 0.1, 0.05, 0.85] • Data4: highly separated and medium density, P = [1000, 1500, 2000], T = [0.9, 0.05, 0.05; 0.05, 0.9, 0.05; 0.1, 0.05, 0.85] • Data5: poorly separated and medium density, P = [1000, 1500, 2000], T = [0.7, 0.15, 0.15; 0.15, 0.7, 0.15; 0.15, 0.15, 0.7]
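
One plausible reading of P and T is sketched below as a generator. Everything beyond P and T is an assumption made for illustration: that P[k] attribute objects belong to cluster k, that T[i][j] is the probability that a link from a target object in cluster i lands on an attribute object of cluster j, that the attribute object is then picked uniformly, and the hypothetical parameters `targets_per_cluster`, `links_per_target`, and `seed`. The slide does not give these details.

```python
import random


def generate_bipartite(p, t, targets_per_cluster, links_per_target, seed=0):
    """Return (links, labels): links is a list of (target_id, attribute_id) pairs,
    labels[target_id] is the true cluster of each target object."""
    rng = random.Random(seed)
    # Attribute objects of cluster k occupy ids offsets[k] .. offsets[k + 1] - 1.
    offsets = [0]
    for size in p:
        offsets.append(offsets[-1] + size)
    links, labels = [], []
    for k in range(len(p)):
        for _ in range(targets_per_cluster):
            x = len(labels)
            labels.append(k)
            for _ in range(links_per_target):
                # Pick the destination cluster from row k of T ...
                dest = rng.choices(range(len(p)), weights=t[k])[0]
                # ... then an attribute object inside it (uniform choice: assumption).
                y = rng.randrange(offsets[dest], offsets[dest + 1])
                links.append((x, y))
    return links, labels


# Data1 parameters from the slide.
p = [1000, 1500, 2000]
t = [[0.8, 0.05, 0.15], [0.1, 0.8, 0.1], [0.1, 0.05, 0.85]]
links, labels = generate_bipartite(p, t, targets_per_cluster=10, links_per_target=100)
```

Under this reading, larger diagonal entries in T give better separated clusters, while the ratio of links to attribute objects determines how dense the network is.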

  33. Accuracy Study (Cont.) • 5 (synthetic) dataset settings, 4 methods • For each setting, generate 10 datasets, run each method for each dataset 100 times • RankClus with authority ranking is the best overall

  34. Efficiency Study • Varying size of attribute type of objects (×2)

  35. Outline • Background • Motivation • The RankClus Algorithm • Experiments • Conclusions

  36. Conclusions • A general framework is proposed in which ranking and clustering are successfully combined to analyze information networks • Formally study how ranking and clustering can mutually reinforce each other in information network analysis • A novel algorithm, RankClus, is proposed and its correctness and effectiveness are verified • A thorough experimental study on both synthetic and real datasets in comparison with the state-of-the-art algorithms, and the experimental results demonstrate the accuracy and efficiency of RankClus
