Learn about centrality measures in the PageRank algorithm, including Eigenvector Centrality and Katz Centrality. Understand the importance of PageRank in ranking web pages and fighting spam.
Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge
Some slides from last week that we didn’t talk about in class:
PageRank algorithm • Eigenvector centrality: $i$'s Rank score is the (scaled) sum of the Rank scores of all pages $j$ that point to $i$: $x_i = \frac{1}{\lambda} \sum_j A_{ij} x_j$, where $A_{ij} = 1$ if page $j$ links to page $i$ • Then Katz centrality adds the teleportation by adding a small weight edge to each node (using a weight of $\beta$): $x_i = \alpha \sum_j A_{ij} x_j + \beta$ • BUT, since a page $j$ may point to many other pages, its prestige score should be shared among these pages. (For example, NPS pointing to many sites.)
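Matrix notation (1) • Let $x$ be an $n$-dimensional column vector of PageRank values, i.e., $x = (x_1, x_2, \ldots, x_n)^T$. • Let $A$ be the adjacency matrix of our digraph with entries $A_{ij} = 1$ if there is an edge from $j$ to $i$, and $A_{ij} = 0$ otherwise. • Then the PageRank centrality of node $i$ is given by: $x_i = \alpha \sum_j A_{ij} \frac{x_j}{k_j^{\text{out}}} + \beta$, or in matrix form $x = \alpha A D^{-1} x + \beta \mathbf{1}$, where $D$ is the diagonal matrix of out-degrees $k_j^{\text{out}}$ and $\alpha$ is the damping factor, generally set to $\alpha = 0.85$ (more on the next page).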
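Matrix notation (2) So the PageRank centrality of node $i$ is given by: $x_i = \alpha \sum_j A_{ij} \frac{x_j}{k_j^{\text{out}}} + \beta$, where $k_j^{\text{out}}$ is the out-degree of $j$ and $\alpha$ is the damping factor (generally $\alpha = 0.85$). Recall from eigenvector centrality: $Ax = \lambda x$ or $x = \frac{1}{\lambda} A x$. • Small values of $\alpha$ (close to 0): the contribution given by paths longer than one hop is small, so centrality scores are mainly influenced by in-degrees. • Large values of $\alpha$ (close to $1/\lambda_{\max}$): allows long paths to be devalued smoothly, and centrality scores are influenced by the topology of G. • Recommendation: choose $\alpha \in (0, 1/\lambda_{\max})$, where $\lambda_{\max}$ is the largest eigenvalue; the centrality diverges at $\alpha = 1/\lambda_{\max}$. The default is usually 0.85.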
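The update rule on this slide can be sketched as a fixed-point iteration. This is a minimal illustration and not the course code: it assumes the update $x_i = \alpha \sum_j A_{ij} x_j / k_j^{\text{out}} + \beta$, and the function name `pagerank_newman`, the choice $\beta = 1$, the iteration count, and the small edge list are all assumptions made for the example.

```python
def pagerank_newman(edges, n, alpha=0.85, beta=1.0, iters=200):
    """Iterate x_i = alpha * sum_j A_ij * x_j / k_out_j + beta.

    edges: list of (j, i) pairs meaning "page j links to page i" (0-based).
    n:     number of pages.
    """
    # Out-degree of every page (column sums of A in the slides' convention).
    out_deg = [0] * n
    for j, _ in edges:
        out_deg[j] += 1

    x = [1.0] * n
    for _ in range(iters):
        new_x = [beta] * n              # every page receives the constant beta
        for j, i in edges:
            # Page j shares its score equally among the pages it links to.
            # Pages without out-links never appear as j, so no division by 0.
            new_x[i] += alpha * x[j] / out_deg[j]
        x = new_x
    return x


# Hypothetical 4-page web: page 3 has no in-links, page 2 has the most.
example_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
scores = pagerank_newman(example_edges, 4)
```

In this toy graph page 3 has in-degree 0, so its score stays at $\beta$, which is the role the constant term plays in Katz centrality and PageRank.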
Overview PR: one of the best-known and most influential algorithms for computing the relevance of web pages
An example as just described: • Problem vertex: a vertex with no outgoing links. • Recall that the problem with vertices with in-degree = 0 was solved by using $\beta$. • Is the formula above well defined? If not, how could we fix the formula or the matrix? [Figure: example digraph and its in-degree matrix; each row shows the in-degree, each column shows the out-degree.]
How can we fix the problem? • Remove those pages with no out-links during the PageRank computation, as these pages do not affect the ranking of any other page directly (these pages will get outgoing links in the future). • Add a complete set of outgoing links from each such page $i$ to all the pages on the Web. The second choice is used in PR since the matrix may get updated. [Figure: in-degree matrix; each row shows the in-degree, each column shows the out-degree.]
How can we fix the out-degree = 0? [Figure: the in-degree matrix and the inverse of the out-degree matrix.]
PR centrality formula is well defined • By multiplying the in-degree matrix by the inverse of the out-degree matrix we obtain the matrix that: • captures the in- and out-degree per vertex • divides the centrality of each vertex by its degree • The contribution of node 5 is insignificant, and the formula is now well defined. [Figure: the out-degree matrix and the in-degree matrix.]
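The matrix product described here can be sketched in a few lines. This is an illustration under stated assumptions, not the slide's own example: the 5-page matrix `A5` is hypothetical (its last page has no out-links, like the problem vertex), the helper name `column_normalize` is made up, and an out-degree of 0 is replaced by 1 so that the inverse exists, which is one common way to keep the formula well defined.

```python
from fractions import Fraction


def column_normalize(A):
    """Return A * D^{-1}, with D = diag(max(out-degree, 1)).

    A[i][j] = 1 if page j links to page i, so row i sums to the in-degree
    of i and column j sums to the out-degree of j.
    """
    n = len(A)
    out_deg = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [[Fraction(A[i][j], max(out_deg[j], 1)) for j in range(n)]
            for i in range(n)]


# Hypothetical 5-page web; the last page (index 4) has no outgoing links.
A5 = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
]
P = column_normalize(A5)
```

Every column of `P` that belongs to a page with out-links sums to 1, while the column of the page without out-links stays all zero, so that page passes no score on to the others.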
Transition probability matrix • This modified matrix is called the state transition probability matrix. Denote its entries by $p_{ij}$. • $p_{ij}$ represents the transition probability that the surfer in state $j$ (page $j$) will move to state $i$ (page $i$), matching the convention that each column of the matrix describes the out-links of a page. • Here is an example:
A small Internet consisting of just 4 websites Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
A small Internet consisting of just 4 websites • $p_{ij}$ represents the transition probability that the surfer on page $j$ will move to page $i$. [Figure: transition matrix of the 4-website example.] Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
A small Internet consisting of just 4 websites • Random surfer: each page has equal probability ¼ to be chosen as a starting point. • The probability that page $i$ will be visited after $k$ steps (i.e., the random surfer ending up at page $i$) is equal to the $i$-th entry of $A^k x$, where $x$ is the vector of starting probabilities. • Simplification for this example: no $\beta$ was involved since the in-degree of $i$ is greater than 0 for all $i$. Source: http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
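The random-surfer computation can be sketched as repeated multiplication by the transition matrix. The 4-page matrix below is an illustrative assumption (column $j$ lists where a surfer on page $j$ goes next, and every page has at least one in-link); it is not guaranteed to be the matrix in the slide's figure, and the helper name `surf` is made up.

```python
def surf(A, x, k):
    """Return A^k x: the distribution of the surfer after k steps."""
    n = len(A)
    for _ in range(k):
        x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Illustrative column-stochastic matrix: A[i][j] is the probability that a
# surfer on page j moves to page i, so every column sums to 1.
A = [
    [0,     0,   1, 1 / 2],
    [1 / 3, 0,   0, 0],
    [1 / 3, 1 / 2, 0, 1 / 2],
    [1 / 3, 1 / 2, 0, 0],
]

start = [0.25, 0.25, 0.25, 0.25]   # each page equally likely as a start
after_one_step = surf(A, start, 1)
after_many_steps = surf(A, start, 50)
```

For this particular matrix the distribution settles quickly, and the entries of `after_many_steps` approach 12/31, 4/31, 9/31 and 6/31, which can be checked by substituting that vector into $Ax = x$.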
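Some comments • Newman’s book gives: $x_i = \alpha \sum_j A_{ij} \frac{x_j}{k_j^{\text{out}}} + \beta$, where $\alpha$ is called the damping factor, which can be set between 0 and 1 (or the inverse of the largest eigenvalue of A). • And the formula in the original PageRank paper is: $PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$, where $T_1, \ldots, T_n$ are the pages that link to page $A$, $C(T)$ is the number of links going out of page $T$, and $d$ is the damping factor ($d = 0.85$ as default). • Gephi: the default value for the damping factor is the probability $p = 0.85$, and Epsilon is the criterion for eigenvector convergence based on the power method.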
Final Points on PageRank • Fighting spam. • A page is important if the pages pointing to it are important. • Since it is not easy for a Web page owner to add in-links to his/her page from other important pages, it is not easy to influence PageRank. • PageRank is a global measure and is query independent. • The PageRank values of all the pages are computed and saved off-line rather than at query time => fast • Criticism: • There are companies that can increase your PageRank by adding your page to a cluster and increasing its in-degree. • It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic. • But it works based on the keyword search.
Betweenness Centrality Some pages are adapted from Dan Ryan, Mills College
Different types of centralities: Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, Degree Centrality. Source: Discovering Sets of Key Players in Social Networks – Daniel Ortiz-Arroyo – Springer 2010
Betweenness Centrality • Intuition: how many pairs of individuals would have to go through you in order to reach one another in the minimum number of hops? • Interactions between two individuals depend on the other individuals in the set of nodes. The nodes in the middle have some control over the paths in the graph. • Useful for flow, such as information or data packages
Assumptions • When there is more than one geodesic, all geodesics are equally likely to be used. • Flow takes the shortest path (we’ll look at alternatives) • Every pair of nodes in G exchanges a message with equal probability per unit time. • Question: How many messages, on average, will have passed through each vertex en route to their destination? • A node’s betweenness is given by all pairs of nodes, including the node in question.
Meaning of betweenness centrality • Vertices with high betweenness centrality have influence in the network by virtue of their control over information passing between others. • They get to see the messages as they pass through • They could get paid for passing the message along • Thus they get a lot of power: their removal would disrupt communication. • How would you capture it in a mathematical formula?
Formula for betweenness centrality $x_i = \sum_{s,t} n^i_{st}$, where • $n^i_{st}$ is the number of s-t geodesics that $i$ belongs to (default: $i$ could equal $s$ or $t$, but in other versions it cannot and that’s where you see 0 values) • in an undirected graph, an s-t geodesic is the same as a t-s geodesic, so it gets counted twice • It is applicable to directed networks as well.
Bounds for disconnected graphs Let G be a disconnected graph on $n$ vertices: • What is the minimum value of betweenness centrality a vertex can have in disconnected graphs? • an isolated vertex: 0 • What is the maximum value of betweenness centrality a vertex can have in disconnected graphs? • center of a star on $n-1$ vertices, with the remaining vertex isolated: let $v$ be the center node. Then there are $(n-1)^2$ pairs of nodes in the star, from which we take away the $n-2$ paths from a leaf to itself, since $v$ is not on them: $(n-1)^2 - (n-2) = n^2 - 3n + 3$.
Bounds for connected graphs Let G be connected on $n$ vertices: • What is the minimum value of betweenness centrality a vertex can have in connected graphs? • A leaf $x$ would have it: $2n - 1$ (we have $n-1$ paths from $x$ to each other vertex, $n-1$ more paths from each other vertex to $x$, and finally one path from $x$ to $x$). • What is the maximum value of betweenness centrality a vertex can have in connected graphs? • center of a star: $n^2 - (n-1) = n^2 - n + 1$, which is the number of pairs of nodes minus the paths from a leaf to itself.
A refined formula How do we find the relative (to the other nodes) betweenness centrality values? $x_i = \sum_{s,t} \frac{n^i_{st}}{g_{st}}$, where • $n^i_{st}$ is the number of s-t geodesics that $i$ belongs to. • $g_{st}$ is the number of s-t geodesics • Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$ (in an undirected graph, an s-t geodesic is the same as a t-s geodesic, so it gets counted twice)
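The refined formula $x_i = \sum_{s,t} n^i_{st}/g_{st}$ can be sketched with breadth-first search. This is a minimal sketch, not an optimized algorithm: it assumes an unweighted graph given as an adjacency dictionary, it counts every ordered pair $(s,t)$ including $s = t$ and the endpoints themselves, and unreachable pairs contribute 0. The function names `bfs_counts` and `betweenness` are made up for the example.

```python
from collections import deque


def bfs_counts(adj, s):
    """Distances from s and the number of geodesics from s to each reachable node."""
    dist = {s: 0}
    sigma = {s: 1}                      # one geodesic (of length 0) from s to s
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:           # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u is a predecessor of v on a geodesic
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """x_i = sum over ordered pairs (s, t) of n^i_st / g_st."""
    info = {s: bfs_counts(adj, s) for s in adj}
    x = {i: 0.0 for i in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:                # only pairs with at least one path
            for i in dist_s:
                dist_i, sigma_i = info[i]
                # i lies on an s-t geodesic iff d(s,i) + d(i,t) = d(s,t).
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    x[i] += sigma_s[i] * sigma_i[t] / sigma_s[t]
    return x


# Star with center 0 and four leaves (n = 5), and a 4-cycle.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
star_scores = betweenness(star)
cycle_scores = betweenness(cycle)
```

On the 5-node star this convention gives 21 for the center and 9 for each leaf, which agrees with the bounds $n^2 - n + 1$ and $2n - 1$ for connected graphs. On the 4-cycle each vertex gets 8: seven pairs in which it is an endpoint, plus two halves from the opposite pair, which has two geodesics.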
In class activity: betweenness of A? • Fraction of shortest paths that include vertex A • For each of three pairs of nodes, 1 shortest path of 4 goes through A, so the betweenness of A is 1/4 + 1/4 + 1/4 = 0.75
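A normalized refined formula How do we find the normalized relative betweenness centrality values? Normalizing allows us to compare nodes across graphs. $x_i = \left(\sum_{s,t} \frac{n^i_{st}}{g_{st}}\right) / n^2$, where • $n^i_{st}$ is the number of s-t geodesics that $i$ belongs to. • $g_{st}$ is the number of s-t geodesics • $n^2$ is the number of ordered pairs of nodes • Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$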
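Another normalized formula How do we find the normalized relative betweenness centrality values? Normalizing allows us to compare nodes across graphs. $x_i = \left(\sum_{s,t} \frac{n^i_{st}}{g_{st}}\right) / (n^2 - n + 1)$, where • $n^i_{st}$ is the number of s-t geodesics that $i$ belongs to. • $g_{st}$ is the number of s-t geodesics • $n^2 - n + 1$ is the maximum possible value (center of a star) • Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$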
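The effect of normalizing can be sketched numerically. The two helpers below are assumptions for illustration: they divide a raw betweenness value by $n^2$ (the number of ordered pairs of nodes) and by $n^2 - n + 1$ (the star-center maximum when endpoints and $s = t$ are counted). If a different normalization constant is intended, only the denominator changes.

```python
def normalize_by_pairs(raw, n):
    """Fraction of all n^2 ordered pairs whose geodesics pass through the node."""
    return raw / n ** 2


def normalize_by_max(raw, n):
    """Raw value relative to the largest value any node can reach on n nodes."""
    return raw / (n ** 2 - n + 1)


# Raw values for a 5-node star under the endpoint-inclusive convention:
# 21 for the center (n^2 - n + 1) and 9 for a leaf (2n - 1).
n = 5
center_by_pairs = normalize_by_pairs(21, n)   # 21 / 25
center_by_max = normalize_by_max(21, n)       # 21 / 21
leaf_by_max = normalize_by_max(9, n)          # 9 / 21
```

Dividing by the maximum maps the star center to exactly 1, so values from graphs of different sizes land on a common 0 to 1 scale.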
Betweenness Centrality • Used generally for Information flow • Typically distributed over a wide range • Betweenness only uses geodesic paths • Information can also flow on longer paths • Sometimes we hear it through the grapevine • While betweenness focuses just on the geodesic, flow betweenness centrality focuses on how information might flow through many different paths.
Flow betweenness centrality Same expression, $x_i = \sum_{s,t} \frac{n^i_{st}}{g_{st}}$, BUT • $n^i_{st}$ is the maximum flow transmitted from s to t through all possible paths that $i$ belongs to. • $g_{st}$ is the maximum flow transmitted from s to t through all possible paths • Convention: if $n^i_{st} = 0$ and $g_{st} = 0$, then $n^i_{st}/g_{st} = 0$ (in an undirected graph, an s-t path is the same as a t-s path, so it gets counted twice)
Random walk betweenness centrality Same expression, BUT • $n^i_{st}$ is the number of times a random walk from s to t passes through $i$, averaged over many repetitions of a walk • Note that $n^i_{st} \neq n^i_{ts}$ • A good measure for traffic that doesn’t have a particular destination
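The random-walk count can be estimated by simulation. This is a rough sketch under stated assumptions: the walk starts at s, moves to a uniformly random neighbor at each step, and stops the first time it reaches t; visits to the start and end nodes are counted too, which is a convention chosen for this example. The function name `random_walk_visits`, the number of walks, and the seed are all made up.

```python
import random


def random_walk_visits(adj, s, t, i, walks=2000, seed=0):
    """Average number of times a random walk from s to t passes through i.

    adj maps each node to a list of neighbors; t must be reachable from s.
    """
    rng = random.Random(seed)           # fixed seed keeps the estimate repeatable
    total = 0
    for _ in range(walks):
        node = s
        visits = 1 if node == i else 0  # the starting node counts as a visit
        while node != t:
            node = rng.choice(adj[node])
            if node == i:
                visits += 1
        total += visits
    return total / walks


# Path graph 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
from_0_to_2 = random_walk_visits(path, 0, 2, 0)  # walk may return to 0 many times
from_2_to_0 = random_walk_visits(path, 2, 0, 0)  # walk stops on first arrival at 0
```

On this path a walk from 2 to 0 touches node 0 exactly once, while a walk from 0 to 2 starts at node 0 and may wander back to it, so the two averages differ. That asymmetry is what the note about swapping s and t refers to.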
Other extensions of centralities • How would you extend the centralities you have seen? What else would you introduce that would capture the centrality of a vertex? • Would you use it for edges? • This is a good time to share your thoughts • Subgraph/subset centrality? • How central are you to that particular subgraph? • How central is the subgraph to the network? • If so, would you repeat the centralities seen before for that subgraph?
Overview • Local measure: • degree • Relative to rest of network: • closeness, betweenness, eigenvector, Katz, PageRank • How evenly is centrality distributed among nodes? • hubs and authorities • You’ve learned the traditional centralities. Based on your understanding of the methodologies that create them, decide which one is appropriate to use for your application.
Let’s practice in Gephi • And if there is time, in Python (code online, same code as before)