Diversified Ranking on Large Graphs: An Optimization Viewpoint
Hanghang Tong, Jingrui He, Zhen Wen, Ching-Yung Lin, Ravi Konuru
KDD 2011, August 21-24, San Diego, CA
Background: Why Diversity?
• A1: Uncertainty & ambiguity in an information need
  • Case 1: Uncertainty from the query
  • Case 2: Uncertainty from the user
Background: Why Diversity? (cont.)
• A2: Uncertainty & ambiguity of an information need
  • C1: Product search → want different reviews
  • C2: Political issue debate → desire different opinions
  • C3: Legal search → get an overview of a topic
  • C4: Team assembly → find a set of relevant & diversified experts
• A3: Become a better and safer employee
  • Better: a 1% increase in diversity → an additional $886 of monthly revenue
  • Safer: a 1% increase in diversity → an 11.8% increase in job retention
Problem Definitions & Challenges
• Problem 1 (Evaluate/measure a given top-k ranking list)
  • Given: a large graph A, the query vector p, the damping factor c, and a subset of k nodes S;
  • Measure: the goodness of the subset S by a single number, in terms of (a) the relevance of each node in S w.r.t. the query vector p, and (b) the diversity among all the nodes in S.
• Problem 2 (Find a near-optimal top-k ranking list)
  • Given: a large graph A, the query vector p, the damping factor c, and the budget k;
  • Find: a subset of k nodes S that maximizes the goodness measure f(S).
• Challenges
  • (for Problem 1) No existing measure encodes both relevance and diversity
  • (for Problem 2) Subset-level optimization
Our Solutions (10-second introduction!)
• Problem 1 (Evaluate/measure a given top-k ranking list)
  • A1: a weighted sum between relevance and similarity (the weight trades relevance off against diversity)
• Problem 2 (Find a near-optimal top-k ranking list)
  • A2: a greedy algorithm (near-optimal, linear scalability)
Details
[Figure: an illustrative 12-node example graph]
• Measure relevance (r) by RWR (a.k.a. Personalized PageRank):
  r = c A r + (1 − c) e
  where r is the n×1 ranking vector, A is the n×n adjacency matrix, e is the n×1 starting (query) vector, and 1 − c is the restart probability.
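To make the RWR step concrete, here is a minimal Python sketch (not from the slides) that solves r = cAr + (1 − c)e by power iteration; the function name, tolerance, and the assumption that A is column-normalized are mine:

```python
import numpy as np

def rwr(A, e, c=0.85, tol=1e-9, max_iter=1000):
    """Random Walk with Restart: iterate r = c*A*r + (1-c)*e until convergence.

    A : (n, n) column-normalized adjacency matrix
    e : (n,) starting (query) vector, entries summing to 1
    c : damping factor; 1 - c is the restart probability
    """
    r = e.copy()
    for _ in range(max_iter):
        r_next = c * (A @ r) + (1 - c) * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r
```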
Details (cont.)
• Diversity ≈ the reverse of weighted similarity on the personalized graph
• r = c A r + (1 − c) e = [c A + (1 − c) e 1'] r = B r
• g(S) = w ∑_{i ∈ S} r(i) − ∑_{i,j ∈ S} B(i,j) r(j)
• B: the personalized graph (a.k.a. the 'Google matrix')
• B(i,j): how node i and node j are connected in the personalized graph
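As an illustration of the measure itself, here is a hedged Python sketch that builds B = cA + (1 − c)e1' explicitly and evaluates g(S); the dense construction of B is for clarity only (not how the paper computes it), and all names are assumptions:

```python
import numpy as np

def goodness(A, e, r, S, c=0.85, w=2.0):
    """g(S) = w * sum_{i in S} r[i] - sum_{i,j in S} B[i,j] * r[j],
    with B = c*A + (1-c) * e * 1' (the personalized 'Google' matrix)."""
    n = len(r)
    B = c * A + (1 - c) * np.outer(e, np.ones(n))   # dense B, for illustration only
    S = list(S)
    relevance = w * r[S].sum()                      # weighted relevance of S
    similarity = (B[np.ix_(S, S)] @ r[S]).sum()     # pairwise-similarity penalty
    return relevance - similarity
```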
Details (cont.): Properties of g(S): Why is it a Good Measure?
• P1: g(S) = 0 for an empty set S
• P2: g(S) is sub-modular for any w > 0
• P3: g(S) is monotonically non-decreasing for any w ≥ 2
• For any w ≥ 2, a greedy algorithm (Dragon) leads to a near-optimal solution
  • Quality: g(S) ≥ (1 − 1/e) g(S*), where S* is the optimal subset maximizing g(S)
  • Complexity: O(m) in both time and space
Footnote: Dragon stands for Diversified Ranking on Graph: An Optimization Viewpoint
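The greedy selection can be sketched as follows; this naive version works on a dense B and therefore does not attain the O(m) complexity claimed for Dragon (which exploits the sparsity of A plus the rank-one structure of (1 − c)e1'), but it shows the marginal-gain logic. The function name and update layout are my own:

```python
import numpy as np

def greedy_topk(B, r, k, w=2.0):
    """Greedily pick k nodes, each time adding the node v with the largest
    marginal gain of g(S):
        gain(v) = w*r[v] - B[v,v]*r[v] - sum_{i in S}(B[i,v]*r[v] + B[v,i]*r[i])
    """
    n = len(r)
    S = []
    selected = np.zeros(n, dtype=bool)
    gain = w * r - np.diag(B) * r          # marginal gains w.r.t. the empty set
    for _ in range(k):
        v = int(np.argmax(np.where(selected, -np.inf, gain)))
        S.append(v)
        selected[v] = True
        # adding v to S lowers every other node's gain by its similarity to v
        gain -= B[v, :] * r + B[:, v] * r[v]
    return S
```

Under these assumptions, the pieces compose as r = rwr(A, e, c) followed by S = greedy_topk(B, r, k, w) to produce a diversified top-k list.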
Experimental Results
[Figures: an illustrative example; comparison with alternative choices; scalability; quality-time balance (axes: quality, time, budget)]
Conclusion
• Problem 1 (Evaluate/measure a given top-k ranking list)
  • A1: a weighted sum between relevance and similarity
• Problem 2 (Find a near-optimal top-k ranking list)
  • A2: a greedy algorithm (near-optimal, linear scalability)
• Contact: Hanghang Tong (htong@us.ibm.com)
Academic Literature: More Detailed Comparison
[Table comparing prior work (e.g., [6], [7]) on Problem 1 and Problem 2]
• This work proposes: (1) the first measure that combines both relevance & diversity; (2) the first method that (a) leads to a near-optimal solution with (b) linear complexity.