Ranking

Ranking

Motivation: searching the web

Motivation: Basis for recommendation

Motivation: Personalized search

Why graphs? • The underlying data is naturally a graph • WWW graph: Websites are linked with each other • Papers linked by citation • Authors linked by co-authorship • Bipartite graph of customers and products • Friendship networks: who knows whom

What are we looking for • Rank nodes for a particular query • Top k matches for “Random Walks” from Citeseer • Who are the most likely co-authors of “Manuel Blum”. • Top k book recommendations for Purna from Amazon • Top k websites matching “Sound of Music” • Top k friend recommendations for Purna when she joins “Facebook”

How do search engines decide how to rank your query results? • How does Google rank the query results? • How would you do it?

Naïve ranking of query results • Given query q • Rank the web pages pin your database based on some similarity measure sim(p,q) • Scenarios where this is not a good idea?

Why Link Analysis? • First generation search engines • view documents as flat text files • could not cope with size or user needs • Second generation search engines • ranking becomes critical • use of Web specific data: Link Analysis • shift from relevance to authoritativeness

Link Analysis: Intuition • A link from page p to page q denotes endorsement • page p considers page q an authority on a subject • “mine” the web graph of recommendations • assign an authority value to every page

Link Analysis Ranking (LAR) Algorithms • Start with a collection of web pages • Extract the underlying hyperlink graph • Run the LAR algorithm on the graph • Output: an authority weight for each node w w w w w

Two Types of Algorithms • Query dependent: rank a small subset of pages related to a specific query • HITS (Kleinberg 98) was proposed as query dependent • Query independent: rank the whole Web • PageRank (Brin and Page 98) was proposed as query independent

Query-dependent LAR • Given a query q, find a subset of web pages S that are related to q • Rank the pages in S based on some ranking criterion

Query-dependent input Root set: includes pages relevant to the query q Root Set

Query-dependent input Root set: includes pages relevant to the query q Root Set OUT IN

Query-dependent input Base Set Root Set OUT IN

Properties of a good base set S • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities

How to construct a good seed set S • Given a query q: • collect the t highest-ranked pages for q from a text-based search engine to form set Γ • S = Γ • add to S all the pages pointing to Γ • add to S all the pages that pages from Γpoint to

Link Filtering • Navigational links: serve the purpose of moving within a site (or to related sites) • www.espn.com→ www.espn.com/nba • www.yahoo.com→ www.yahoo.it • Filter out navigational links • same domain name • same IP address

Hubs and Authorities [K98] • Based on the relationship between • authorities (pages) for a topic and • hubs (pages) that link to many related authorities • We observe that: • a certain natural type of equilibrium exists between hubs and authorities in the graph defined by the link structure • We exploit this equilibrium to develop an algorithm that identifies both types of pages simultaneously

Hubs and Authorities [K98] • Authority is not necessarily transferred directly between authorities • Pages have double identity • hub identity • authority identity • Good hubs point to good authorities • Good authorities are pointed by good hubs authorities hubs

HITS Algorithm • Initialize all weights to 1 • Repeat until convergence • O operation : hubs collect the weight of the authorities • Ioperation: authorities collect the weight of the hubs • Normalize weights under some norm • Why do we need normalization?

1 1 1 1 HITS and eigenvectors • LetAbe the adjacencymatrix of the Link Analysis graph A(i , j) = weight on the edge from i to j Adjacency matrix A

HITS and eigenvectors • Let atbe the authority weight vector and htthe hub weight vector at iteration t • Then in vector terms we have: at = ATht-1 and ht = Aat hence: at = ATA at-1and ht = AATht-1 • The authority weight vector a is the strongesteigenvector of ATA • The hub weight vector h is the strongesteigenvector of AAT

Singular Value Decomposition • Linear trendvin matrix A: • the tendency of the row vectors of A to align with vector v • SVD can be used to discover the linear trends in the data • vi : the i-th strongest linear trend • λi: the strength of the i-th strongest linear trend λ2 v2 v1 λ1 • HITS discovers the strongest linear trend in the authority space

[n×r] [r×r] [r×n] Singular Value Decomposition • r: rank of matrix A • λ1≥ λ2≥ … ≥λr: singular values (square roots of eig-valsAAT, ATA) • : left singular vectors (eig-vectors of AAT) • : right singular vectors (eig-vectors of ATA)

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 1 1 1 1 1 1

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 3 3 3 3 3

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 32 32 32 3∙2 3∙2 3∙2

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 33 33 33 32 ∙ 2 32 ∙ 2

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 34 34 34 32∙ 22 32∙ 22 32∙ 22

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 32n 32n after n iterations weight of node p is proportional to the number of paths that leave node p 32n 3n∙ 2n 3n∙ 2n 3n∙ 2n

HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect h a 1 1 after normalization with the max element as n → ∞ 1 0 0 0

Strengths and weaknesses of HITS • Strength: its ability to rank pages according to the query topic, which may be able to provide more relevant authority and hub pages • Weaknesses: • Can be easily spammed: adding out-links in one’s own page is so easy • Topic drift:many pages in the expanded set may not be on topic • Inefficiency at query time: collecting the root set, expanding it, and performing eigenvector computation are all expensive operations

Query-independent LAR • Have an apriori ordering of the web pages • Q: Set of pages that contain the keywords in the query q • Present the pages in Q ordered according to order π

InDegree algorithm • Rank pages according to in-degree • wi = |B(i)| where B(i) is the number of incoming links to node i w=3 w=2 • Red Page • Yellow Page • Blue Page • Purple Page • Green Page w=2 w=1 w=1

Query-independent LAR • In-degree is a local measure • All links to a page are considered equal, regardless of where they come from • Two pages with the same in-degree are considered equally important, even if one is cited by more prestigious sources than the other

PageRank algorithm [BP98] • Good authorities should be pointed by good authorities • Random walk on the web graph • pick a page at random • with probability 1 - α jump to a random page • with probability αfollow a random outgoing link • PageRank of page p: • F(q): the set of outgoinglinks of page q • Red Page • Purple Page • Yellow Page • Blue Page • Green Page

1 1/2 1 1/2 What is a random walk t=0

1 1 1/2 1/2 1 1 1/2 1/2 t=1 t=0

1 1 1 1/2 1/2 1/2 1 1 1 1/2 1/2 1/2 t=1 t=0 t=2

1 1 1 1 1/2 1/2 1/2 1/2 1 1 1 1 1/2 1/2 1/2 1/2 t=1 t=0 t=2 t=3

2 1 1/2 1 1 1/2 3 Markov chains • A Markov chain describes a discrete time stochastic process over a set of states according to a transition probability matrix • Pij = probability of moving to state j when at state i • ∑jPij = 1 (stochastic matrix) • From any node the probability of moving out is 1 • Memorylessness property: The next state of the chain depends only at the current state and not on the past of the process S = {s1, s2, … sn} P = {Pij} Transition matrix P 0 0 1/2 1 0 1/2 0 1 0

Random walks • Random walks on graphs correspond to Markov Chains • The set of states S is the set of nodes of the graph G • The transition probability matrix is the probability that we follow an edge from one node to another

An example v2 v1 v3 v5 v4

State probability vector • Define a vector qt = (qt1,qt2, … ,qtn) that stores the probability of being at state i at time t • q0i = the probability of starting from state i

An example v2 v1 v3 qt+11 = 1/3 qt4+ 1/2 qt5 qt+12 = 1/2 qt1 + qt3+ 1/3 qt4 v5 v4 qt+13= 1/2 qt1 + 1/3 qt4 qt+14 = 1/2 qt5 qt = qt-1P qt+15 = qt2

Stationary Distribution • When the surfer keeps walking for a long time • Then the distribution does not change anymore • i.e., qt= qt-1 = π • We will call πthe stationary distribution • For “well-behaved” graphs this does not depend on the start distribution!!

Ranking

Ranking

Presentation Transcript

ABC Ranking

Ranking

Jobwise Ranking

Forced Ranking

Ranking

Ranking

Tag Ranking

WORLD RANKING

Ranking

Ranking

Forced Ranking

Site Ranking

Competencies Ranking

google ranking

Website Ranking

Web google ranking - Google Ranking - Webgoogleranking

Website Ranking

Suchmaschinen Ranking