P-Rank: A Comprehensive Structural Similarity Measure over Information Networks

Peixiang Zhao, Jiawei Han, Yizhou Sun
University of Illinois at Urbana-Champaign
CIKM’09, November 3rd, 2009, Hong Kong
Presented by Prof. Hong Cheng, CUHK

Outline
• Introduction & Motivation
• P-Rank
  • Formula
  • Derivatives
  • Computation
• Experimental Studies
• Future direction & Conclusion
Nov. 3rd 2009 CIKM’09 Hong Kong 1 of 15

Introduction
• Information Networks (INs)
  • Physical, conceptual, and human/societal entities
  • Interconnected relationships among different entities
• INs are ubiquitous and form a critical component of modern information infrastructure
  • The Web
  • Highway and urban transportation networks
  • Research collaboration and publication networks
  • Biological networks
  • Social networks

Problem
• Similarity computation on entities of INs
  • How similar is webpage A to webpage B on the Web?
  • How similar is researcher A to researcher B in the DBLP co-authorship network?
• First of all, how should “similarity” be defined within a massive IN?
  • Textual proximity of entity labels/contents
  • Structural proximity conveyed through links!
• A good structural similarity measure in INs: SimRank (KDD’02)

Why Is SimRank Not Enough?
• Philosophy
  • Two entities are similar if they are referenced by similar entities
• Potential problems
  • Semantically incomplete
    • Only partial structural information, from the in-link direction, is considered during similarity computation
    • Biased similarity results
  • May fail in different IN settings!
  • Inefficient to compute
    • Worst-case O(n^4), can be improved to O(n^3), where n is the number of vertices in the information network

Why Is SimRank Not Enough?
[Figure: (a) a heterogeneous IN and its structural similarity scores; (b) a homogeneous IN and its structural similarity scores]

P(enetrating)-Rank
• Philosophy: two entities are similar if
  • they are referenced by similar entities
  • they reference similar entities
• Advantages
  • Semantically complete
    • Structural information from both the in-link and out-link directions is considered during similarity computation
  • Robust in different IN settings
  • A unified structural similarity framework
    • SimRank is just a special case

P-Rank Formula
• The structural similarity between vertex a and vertex b (a ≠ b), s(a, b):
  • Recursive form
  • Approximate iterative form
[Equations: each form combines an in-link similarity term and an out-link similarity term]
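
The two equations are not reproduced in this text. One form consistent with the slide's labels is sketched below; it follows the SimRank style of averaging similarities over neighbor pairs. The notation is an assumption made here for illustration: I(a) and O(a) denote the in-neighbor and out-neighbor sets of a, λ ∈ [0, 1] weights the in-link term against the out-link term, and C ∈ [0, 1] is the damping factor. A term is taken to be 0 when either neighbor set is empty.

```latex
% Recursive form, for a \neq b; s(a,a) = 1.
\begin{align*}
s(a,b) ={}& \lambda \cdot \frac{C}{|I(a)|\,|I(b)|}
            \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr)
            && \text{(in-link similarity)} \\
        & + (1-\lambda) \cdot \frac{C}{|O(a)|\,|O(b)|}
            \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s\bigl(O_i(a), O_j(b)\bigr)
            && \text{(out-link similarity)}
\end{align*}

% Approximate iterative form.
\begin{align*}
s_0(a,b) &= \begin{cases} 1 & a = b \\ 0 & a \neq b \end{cases} \\
s_{k+1}(a,b) &= \lambda \cdot \frac{C}{|I(a)|\,|I(b)|}
            \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s_k\bigl(I_i(a), I_j(b)\bigr)
          + (1-\lambda) \cdot \frac{C}{|O(a)|\,|O(b)|}
            \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s_k\bigl(O_i(a), O_j(b)\bigr),
          \quad a \neq b
\end{align*}
```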

P-Rank Property
• The iterative P-Rank has the following properties:
  • Symmetry: s_k(a, b) = s_k(b, a)
  • Monotonicity: 0 ≤ s_k(a, b) ≤ s_{k+1}(a, b) ≤ 1
  • Existence: the solution to the iterative P-Rank formula always exists and converges to a fixed point, s(∗, ∗), which is the theoretical solution to the recursive P-Rank formula
  • Uniqueness: the solution to the iterative P-Rank formula is unique when C ≠ 1
• The theoretical solution to P-Rank can be reached by repeatedly applying the iterative form
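
A brief induction sketch for symmetry and monotonicity, under assumptions that go beyond the slide text: the update is a λ-weighted, C-damped sum of two neighbor-pair averages, with λ, C ∈ [0, 1], s_0(a, b) = 0 for a ≠ b, and s_k(a, a) = 1. The shorthand A_k(X, Y) is introduced here only for this argument.

```latex
\begin{align*}
A_k(X,Y) &= \frac{1}{|X|\,|Y|} \sum_{x \in X} \sum_{y \in Y} s_k(x,y), \\
s_{k+1}(a,b) &= \lambda C \, A_k\bigl(I(a), I(b)\bigr)
              + (1-\lambda) C \, A_k\bigl(O(a), O(b)\bigr), \quad a \neq b.
\end{align*}

% Symmetry: s_0 is symmetric. If s_k is symmetric, then A_k(X,Y) = A_k(Y,X),
% hence s_{k+1}(a,b) = s_{k+1}(b,a).

% Bounds: if 0 \le s_k \le 1, every average A_k lies in [0,1], so
\[
0 \le s_{k+1}(a,b) \le \lambda C + (1-\lambda) C = C \le 1 .
\]

% Monotonicity: s_1(a,b) \ge 0 = s_0(a,b) for a \neq b. If s_k \ge s_{k-1}
% pointwise, then A_k \ge A_{k-1} because averaging preserves order, so
\[
s_{k+1}(a,b) \ge s_k(a,b) .
\]
```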

P-Rank Derivatives
• P-Rank provides a unified structural similarity framework, of which many structural similarity measures are special cases
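
As one worked case, assume the P-Rank recursion is a λ-weighted sum of an in-link average and an out-link average, each damped by C (the symbols λ, I(·), and O(·) are notation assumed here, not taken from the slide). Setting λ = 1 then removes the out-link term and leaves the SimRank recursion; setting λ = 0 leaves its out-link mirror image.

```latex
\begin{align*}
\lambda = 1: \quad s(a,b) &= \frac{C}{|I(a)|\,|I(b)|}
    \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr)
    && \text{(SimRank)} \\
\lambda = 0: \quad s(a,b) &= \frac{C}{|O(a)|\,|O(b)|}
    \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s\bigl(O_i(a), O_j(b)\bigr)
    && \text{(out-link direction only)}
\end{align*}
```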

P-Rank Computation
• An iterative algorithm is executed until it reaches the fixed point
  • Space complexity: O(n^2)
  • Time complexity: O(n^4), can be improved to O(n^3) by amortization
• Approximation algorithms for different IN scenarios
  • Homogeneous IN
    • Radius-based pruning: vertex pairs beyond a radius of r are no longer considered in similarity computation
  • Heterogeneous IN
    • Category-based pruning: vertex pairs in different categories are no longer considered in similarity computation
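
The iterative algorithm can be sketched as below. This is a minimal, unoptimized illustration rather than the authors' implementation: the function name `p_rank`, the parameter names, the tolerance-based stopping rule, and the convention that a term is 0 when a neighbor set is empty are all assumptions. It stores a score for every vertex pair (O(n^2) space) and sums over all neighbor pairs without amortization or pruning.

```python
from itertools import product


def p_rank(out_links, C=0.8, lam=0.5, max_iter=100, tol=1e-6):
    """Naive fixed-point iteration; returns {(a, b): score} for all vertex pairs."""
    # Collect every vertex, including ones that only appear as link targets.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    outs = {v: list(out_links.get(v, [])) for v in nodes}
    ins = {v: [] for v in nodes}
    for u in nodes:
        for v in outs[u]:
            ins[v].append(u)

    def mean_sim(prev, xs, ys):
        # Assumption: a term contributes 0 when either neighbor set is empty.
        if not xs or not ys:
            return 0.0
        return sum(prev[x, y] for x, y in product(xs, ys)) / (len(xs) * len(ys))

    # s_0(a, b) = 1 if a == b, else 0.
    scores = {(a, b): float(a == b) for a in nodes for b in nodes}
    for _ in range(max_iter):
        new = {}
        for a, b in scores:
            if a == b:
                new[a, b] = 1.0
                continue
            in_part = mean_sim(scores, ins[a], ins[b])
            out_part = mean_sim(scores, outs[a], outs[b])
            new[a, b] = C * (lam * in_part + (1 - lam) * out_part)
        delta = max(abs(new[p] - scores[p]) for p in scores)
        scores = new
        if delta < tol:  # treat a negligible change as the fixed point
            break
    return scores


# Hypothetical toy graph: a and b both link to c and d.
toy = {"a": ["c", "d"], "b": ["c", "d"]}
both = p_rank(toy)            # lam = 0.5: in-links and out-links weighted equally
in_only = p_rank(toy, lam=1)  # lam = 1: in-link term only
print(both["a", "b"], in_only["a", "b"])
```

On this toy graph a and b have no in-links, so the in-link-only setting scores the pair 0 even though both point to the same vertices; with `lam=0.5` the score approaches 0.25.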

Experimental Studies
• Data sets
  • Heterogeneous IN: DBLP (paper, author, conference, year)
  • Homogeneous IN: DBLP (paper with citation), synthetic data R-MAT
• Methods
  • P-Rank
  • SimRank
• Metrics
  • Compactness of clusters
  • Algorithmic nature
  • Ground truth

Compactness of Clusters
• P-Rank and SimRank are each used as the underlying similarity measure, and K-Medoids is used to cluster the vertices
• Compactness: intra-cluster distance / inter-cluster distance
[Figure panels: homogeneous IN; heterogeneous IN]
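
One possible reading of the compactness ratio is sketched below. The slide does not say how distances are derived or aggregated, so the choices here are assumptions: distance is taken as 1 minus similarity, and each side of the ratio is a mean over vertex pairs. The function name `compactness` and the toy data are hypothetical.

```python
from itertools import combinations


def compactness(sim, labels):
    """Mean intra-cluster distance divided by mean inter-cluster distance."""
    intra, inter = [], []
    for u, v in combinations(labels, 2):
        # Look the pair up in either order; unlisted pairs count as similarity 0.
        s = sim.get((u, v), sim.get((v, u), 0.0))
        d = 1.0 - s  # assumption: distance = 1 - similarity
        (intra if labels[u] == labels[v] else inter).append(d)
    if not intra or not inter:
        raise ValueError("need both intra-cluster and inter-cluster pairs")
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Hypothetical cluster assignment and pairwise similarity scores.
labels = {"a": 0, "b": 0, "c": 1, "d": 1}
sim = {
    ("a", "b"): 0.9, ("c", "d"): 0.8,
    ("a", "c"): 0.1, ("a", "d"): 0.2,
    ("b", "c"): 0.1, ("b", "d"): 0.2,
}
ratio = compactness(sim, labels)
print(ratio)
```

Under this reading, a smaller ratio means vertices sit closer to their own cluster than to other clusters.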

Algorithmic Nature
• Iterative P-Rank converges quickly to the fixed point
[Figure panels: P-Rank vs. the damping factor C; P-Rank vs. lambda]

Ground Truth Ranking Result
• Top-10 ranking results for author vertices in DBLP by P-Rank
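
A top-k ranking of this kind can be read off a table of pairwise scores. The sketch below assumes the scores are held in a dict keyed by (query, candidate) pairs; the function name `top_k_similar` and the sample vertices are hypothetical and do not come from the DBLP experiment.

```python
import heapq


def top_k_similar(scores, query, k=10):
    """Return up to k (vertex, score) pairs most similar to query, best first."""
    candidates = (
        (s, v) for (u, v), s in scores.items() if u == query and v != query
    )
    return [(v, s) for s, v in heapq.nlargest(k, candidates)]


# Hypothetical similarity scores between author vertices.
scores = {
    ("x", "x"): 1.0,
    ("x", "y"): 0.5,
    ("x", "z"): 0.7,
    ("y", "x"): 0.5,
}
ranking = top_k_similar(scores, "x", k=2)
print(ranking)
```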

Conclusion
• The proliferation of information networks calls for effective structural similarity measures in
  • Ranking
  • Clustering
  • Top-k query processing
  • ...
• Compared with SimRank, P-Rank is shown to be a more effective structural similarity measure in large information networks
  • Semantically complete, general, robust, and flexible enough to be employed in different IN settings

Thank you • CIKM’ 09 November 3rd, 2009, Hong Kong
