Network Mining
• Finding Important Nodes
• PageRank
• Relative Importance
• Discovering Network Modules
• Inferring Important Paths
PageRank
• Web pages are organized in a network.
• Each webpage is represented as a node.
• Each hyperlink is a directed edge.
• The entire web can be viewed as a directed graph.
PageRank
• PageRank is a numeric value that represents how important a page is on the web.
• Webpage importance: when one page links to another page, it casts a vote for that page.
• A link from page A to page B is a vote by A for B.
• If page A is itself more important, then its vote for B should carry more weight.
• The more votes a page receives, the more important it must be.
• How can we model this importance?
The Random Surfer Model
• PageRank is a model of user behaviour.
• A surfer clicks on links at random, with no regard for content.
PageRank
• Importance Computation
• The importance of a page is distributed among the pages that it points to.
• The importance of a page is the sum of the importance shares it receives from the pages that point to it.
• If a page has 5 outlinks, its importance is divided by 5 and each link carries a one-fifth share.
Algorithm
• From now on we refer to PageRank as “PR”.
• PR(A) = (1 - d) + d * ( PR(T1) / C(T1) + ... + PR(Tn) / C(Tn) )
• PR(A) is the PageRank of page A.
• PR(Ti) is the PageRank of a page Ti that links to page A.
• C(Ti) is the number of outbound links on page Ti.
• d is a damping factor that can be set between 0 and 1 (usually 0.85).
• With d = 0.85: a page’s PageRank = 0.15 + 0.85 * (a “share” of the PageRank of every page that links to it).
• “share” = the linking page’s PageRank divided by the number of outbound links on that page.
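
The formula above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `pagerank`, the `links` dictionary format and the three-page example graph are assumptions made for the example, not part of the slides.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T that link to A.

    `links` maps each page to the list of pages it links to (assumed input format).
    """
    pr = {page: 1.0 for page in links}  # every page starts with PR = 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Add up the "share" received from every page that links to this one.
            share = sum(pr[src] / len(out) for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * share
        pr = new_pr  # all pages are updated together from the previous values
    return pr


# Hypothetical site: A links to B and C; B and C each link back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
print({page: round(value, 4) for page, value in ranks.items()})
```

Updating all pages from the previous iteration's values keeps the result independent of the order in which pages are visited.
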
[Diagram: Page A and Page B link to each other; each starts with PR = 1]
• PR(A) = 0.15 + 0.85 * PR(B)
• PR(B) = 0.15 + 0.85 * PR(A)
• We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's PageRank.
• Iteration is therefore necessary: each pass starts from inaccurate values and produces more accurate ones.
• About 100 iterations are needed to get a good approximation of the PageRank values of the whole web.
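
A minimal sketch of this iteration for the two-page case, assuming deliberately poor starting guesses (the values 0 and 40 are arbitrary choices for illustration). It shows the inaccurate values being pulled towards the solution PR(A) = PR(B) = 1.

```python
d = 0.85
pr_a, pr_b = 0.0, 40.0  # arbitrary, deliberately inaccurate starting values

for _ in range(100):
    # Both pages are recomputed from the previous iteration's values.
    pr_a, pr_b = (1 - d) + d * pr_b, (1 - d) + d * pr_a

print(round(pr_a, 4), round(pr_b, 4))
```

Each pass shrinks the remaining error by the factor d, which is why a fixed number of passes is enough in practice.
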
PageRank of a Site
• The PageRank of a site is the sum of the PageRank of all pages in the site.
• The maximum PageRank of a site is equal to the total number of pages in the site.
Internal Linking
• A website has a maximum amount of PageRank that is distributed between its pages by internal links.
• The maximum amount of PageRank in a site increases as the number of pages in the site increases.
• By linking poorly, it is possible to fail to reach the site’s maximum PageRank, but it is not possible to exceed it.
Internal Linking
[Diagram: Page A, Page B and Page C with no links between them; each starts with PR = 1]
• The maximum PageRank is the number of pages in the site, so this site’s maximum PageRank is 3.
• PR(A) = 0.15 + 0.85 * ( 0 ) = 0.15
• PR(B) = 0.15 + 0.85 * ( 0 ) = 0.15
• PR(C) = 0.15 + 0.85 * ( 0 ) = 0.15
• Total PageRank in this site = 0.45
• The site is wasting most of its potential PageRank!
Internal Linking
[Diagram: Page A links to Page B; Page C has no links; each page starts with PR = 1]
• PR(A) = 0.15 + 0.85 * ( 0 ) = 0.15
• PR(B) = 0.15 + 0.85 * ( PR(A)/1 ) = 0.15 + 0.85 * ( 1 ) = 1
• PR(C) = 0.15 + 0.85 * ( 0 ) = 0.15
• After 100 iterations...
Internal Linking
[Diagram: Page A links to Page B; Page C has no links]
• PR(A) = 0.15
• PR(B) = 0.2775
• PR(C) = 0.15
• Total PageRank in this site = 0.5775
• Slightly better, but still not the best it could be.
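
The two poorly linked sites can be checked with a short sketch. The helper `pagerank` and the dictionary format are assumptions for illustration; the helper applies the PR formula with d = 0.85 for 100 iterations, starting every page at PR = 1.

```python
def pagerank(links, d=0.85, iterations=100):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) repeatedly; all pages start at 1."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[s] / len(out) for s, out in links.items() if p in out)
              for p in links}
    return pr


# Site with no internal links at all.
no_links_total = sum(pagerank({"A": [], "B": [], "C": []}).values())
# Site where only A links to B.
one_link_total = sum(pagerank({"A": ["B"], "B": [], "C": []}).values())

print(round(no_links_total, 4), round(one_link_total, 4))
```

Both totals stay far below the maximum of 3 because pages without outbound links pass nothing on.
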
Internal Linking
[Diagram: Page A links to Page B and Page C; B and C each link back to A; each page starts with PR = 1]
• PR(A) = 0.15 + 0.85 * ( PR(B)/1 + PR(C)/1 ) = 0.15 + 0.85 * ( 1 + 1 ) = 0.15 + 1.7 = 1.85
• PR(B) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575
• PR(C) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575
• After 100 iterations: Page A = 1.459459, Page B = 0.7702703, Page C = 0.7702703
Internal Linking
[Diagram: every page links to both other pages; each page starts with PR = 1]
• PR(A) = 0.15 + 0.85 * ( PR(B)/2 + PR(C)/2 ) = 0.15 + 0.85 * ( 0.5 + 0.5 ) = 0.15 + 0.85 = 1
• PR(B) = 1
• PR(C) = 1
• Total PageRank in this site = 3
• Good linking!
Internal Linking
[Diagram: Page A links to Page B and Page C; B links to A; C links to A and B; each page starts with PR = 1]
• PR(A) = 0.15 + 0.85 * ( PR(B)/1 + PR(C)/2 ) = 0.15 + 0.85 * ( 1 + 0.5 ) = 0.15 + 1.275 = 1.425
• PR(B) = 0.15 + 0.85 * ( PR(A)/2 + PR(C)/2 ) = 0.15 + 0.85 * ( 0.5 + 0.5 ) = 0.15 + 0.85 = 1
• PR(C) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575
• After 100 iterations: Page A = 1.298245, Page B = 0.9999999, Page C = 0.7017543
Internal Linking
[Diagram: Page A links to Page B and Page C; B and C each link back to A]
• PR(A) = 1.46
• PR(B) = 0.77
• PR(C) = 0.77
• Total PageRank in this site = 3
• No PageRank has been wasted!
Internal Linking
[Diagram: Page A links to Page B and Page C; B links to A; C links to A and B]
• PR(A) = 1.298
• PR(B) = 1.000
• PR(C) = 0.702
• Compared with the site where C links only to A, Page A and Page C lose some PageRank and Page B gains some.
• Total PageRank in this site = 3
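
The three well-linked sites can be compared side by side. In this sketch the helper `pagerank`, the `sites` dictionary and its descriptive keys are assumptions for illustration; the link structures are the ones implied by the equations on the slides. The point to observe is that the total stays at 3 while the distribution between pages changes.

```python
def pagerank(links, d=0.85, iterations=100):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) repeatedly; all pages start at 1."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[s] / len(out) for s, out in links.items() if p in out)
              for p in links}
    return pr


sites = {
    "B and C link back to A": {"A": ["B", "C"], "B": ["A"], "C": ["A"]},
    "every page links to both others": {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]},
    "C also links to B": {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]},
}
results = {name: pagerank(links) for name, links in sites.items()}

for name, pr in results.items():
    per_page = {p: round(v, 3) for p, v in pr.items()}
    print(name, per_page, "total:", round(sum(pr.values()), 3))
```
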
Inbound Links
[Diagram: Page X (PR = 10) links to Page A; A links to B, B to C, C to D, and D back to A; pages A to D start with PR = 1]
• Try setting the damping factor “d” to 0.5 in this example to see the influence of “d”.
• PR(A) = 0.5 + 0.5 * ( PR(X) / C(X) + PR(D) ) = 6.33
• PR(B) = 0.5 + 0.5 * PR(A) = 3.67
• PR(C) = 0.5 + 0.5 * PR(B) = 2.33
• PR(D) = 0.5 + 0.5 * PR(C) = 1.67
• Initial effect of the additional inbound link to page A: d x PR(X) / C(X) = 0.5 x 10 / 1 = 5
Inbound Links
[Diagram: Page X (PR = 10) links to Page A; A links to B, B to C, C to D, and D back to A; pages A to D start with PR = 1]
• Now we set the damping factor “d” back to 0.85.
• PR(A) = 0.15 + 0.85 * ( PR(X) / C(X) + PR(D) ) = 18.78
• PR(B) = 0.15 + 0.85 * PR(A) = 16.12
• PR(C) = 0.15 + 0.85 * PR(B) = 13.85
• PR(D) = 0.15 + 0.85 * PR(C) = 11.92
• Initial effect of the additional inbound link to page A: d x PR(X) / C(X) = 0.85 x 10 / 1 = 8.5
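
The influence of the damping factor in this example can be reproduced with a small sketch. The function name `chain_with_inbound` is hypothetical, and the sketch assumes that PR(X) stays fixed at 10 and that X has a single outbound link, as the equations above imply.

```python
def chain_with_inbound(d, pr_x=10.0, iterations=100):
    """Pages A -> B -> C -> D -> A, plus an external page X (fixed PR, C(X) = 1) linking to A."""
    pr_a = pr_b = pr_c = pr_d = 1.0
    for _ in range(iterations):
        # All four pages are recomputed together from the previous values.
        pr_a, pr_b, pr_c, pr_d = (
            (1 - d) + d * (pr_x / 1 + pr_d),
            (1 - d) + d * pr_a,
            (1 - d) + d * pr_b,
            (1 - d) + d * pr_c,
        )
    return pr_a, pr_b, pr_c, pr_d


low = chain_with_inbound(0.5)
high = chain_with_inbound(0.85)
print([round(v, 2) for v in low])
print([round(v, 2) for v in high])
```

A larger d passes more of each page's PageRank along the chain, so the benefit of the inbound link from X reaches B, C and D with less loss.
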
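Outbound Links
[Diagram: Site 1 contains Page A and Page B, which link to each other; Site 2 contains Page C and Page D, which link to each other; each page starts with PR = 1]
• Without a link between the two sites:
• PR(A) = 0.15 + 0.85 * PR(B) = 1
• PR(B) = 0.15 + 0.85 * PR(A) = 1
• PR(C) = 0.15 + 0.85 * PR(D) = 1
• PR(D) = 0.15 + 0.85 * PR(C) = 1
• After Page A adds an outbound link to Page C:
• PR(A) = 0.15 + 0.85 * PR(B) = 0.43
• PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
• PR(C) = 0.15 + 0.85 * ( PR(A)/2 + PR(D) ) = 1.67
• PR(D) = 0.15 + 0.85 * PR(C) = 1.57
• Site 1 loses: 0.77 - 2 = -1.23. Site 2 gains: 3.23 - 2 = 1.23 (totals taken before rounding the per-page values).
• The PageRank benefit for one site equals the PageRank loss of the other.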
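
A sketch of the outbound-link example, assuming the link structure implied by the equations (A and B link to each other, C and D link to each other, and A then adds a link to C). The helper `pagerank` and the variable names are assumptions for illustration. The site totals are computed from unrounded values, so they can differ in the last digit from sums of the rounded per-page figures.

```python
def pagerank(links, d=0.85, iterations=100):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) repeatedly; all pages start at 1."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[s] / len(out) for s, out in links.items() if p in out)
              for p in links}
    return pr


before = pagerank({"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]})
after = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]})

site1_after = after["A"] + after["B"]
site2_after = after["C"] + after["D"]
print("site 1 change:", round(site1_after - (before["A"] + before["B"]), 3))
print("site 2 change:", round(site2_after - (before["C"] + before["D"]), 3))
```
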
Dangling Links
[Diagram: Page A links to Page B and Page C; B links back to A; C has no outbound links; each page starts with PR = 1]
• Dangling links are links that point to a page with no outgoing links.
• PR(A) = 0.15 + 0.85 * PR(B) = 0.43
• PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
• PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33
• The total PageRank is only 1.10, which is about one third of the maximum PageRank of 3.
• To protect PageRank from the negative effects of dangling links, pages without outbound links are removed from the database until the PageRank values have been computed.
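Dangling Links
[Diagram: Page A links to Page B and Page C; B links back to A; C has no outbound links; each page starts with PR = 1]
• With the dangling Page C included:
• PR(A) = 0.15 + 0.85 * PR(B) = 0.43
• PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
• PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33
• With Page C removed:
• PR(A) = 0.15 + 0.85 * PR(B) = 1
• PR(B) = 0.15 + 0.85 * PR(A) = 1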
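
The effect of removing the dangling page can be sketched as follows. The helpers `pagerank` and `drop_dangling` are hypothetical names. The removal here is a single pass, which is enough for this example; on a larger graph, removing one page can leave other pages without outbound links, so the pass may need repeating.

```python
def pagerank(links, d=0.85, iterations=100):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) repeatedly; all pages start at 1."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[s] / len(out) for s, out in links.items() if p in out)
              for p in links}
    return pr


def drop_dangling(links):
    """Remove pages with no outbound links, together with the links pointing at them."""
    keep = {p for p, out in links.items() if out}
    return {p: [t for t in out if t in keep] for p, out in links.items() if p in keep}


site = {"A": ["B", "C"], "B": ["A"], "C": []}
with_dangling = pagerank(site)
without_dangling = pagerank(drop_dangling(site))

print(round(sum(with_dangling.values()), 2))
print({p: round(v, 2) for p, v in without_dangling.items()})
```
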
PageRank
• Page A’s PR is based on all its inbound neighbors.
• PR is defined recursively.
• With each new calculation, every page’s PR may change.
• If B and C point to A, the order in which A, B and C are computed may affect the result for A’s PR.
• How can the problem be solved?
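PageRank
• The PR values of all web pages form a vector P.
• A is a matrix in which each row and each column represents a page.
• A_{v,u} = 1/C(u) if there is a link from u to v, and 0 otherwise, so that row v of A*P adds up the shares page v receives.
• P = c*A*P + c*E, where c is a normalization factor and E is a vector with one entry per page.
• Since ||P||_1 = 1, this can be written as P = c (A + E*1^T) P, where 1^T is the row vector of all ones (so 1^T * P = 1).
• P is an eigenvector of (A + E*1^T).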
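
A small sketch of the eigenvector view, using plain Python lists and power iteration. Several things here are assumptions made for illustration: the link matrix is oriented so that entry [v][u] is 1/C(u) when u links to v, the bracketed matrix is read as A plus E added to every column (E times a row of ones), E is uniform, and its scale is chosen so that c comes out equal to the damping factor 0.85. The three-page graph is the one where A links to B and C, B links to A, and C links to A and B.

```python
d = 0.85
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
n = len(pages)

# Link matrix: row = target page, column = source page, entry = 1/C(source).
A = [[1 / len(links[src]) if dst in links[src] else 0.0 for src in pages] for dst in pages]
# Uniform vector E; this scale is an assumption that makes c equal to d.
E = [(1 - d) / (d * n)] * n
# M = A + E * (row of ones): E[i] is added to every entry of row i.
M = [[A[i][j] + E[i] for j in range(n)] for i in range(n)]

P = [1 / n] * n  # any starting vector with ||P||_1 = 1
for _ in range(100):
    P = [sum(M[i][j] * P[j] for j in range(n)) for i in range(n)]
    total = sum(P)            # growth factor of this step, equal to 1/c
    P = [x / total for x in P]  # renormalize so that ||P||_1 = 1

c = 1 / total
print(round(c, 4), [round(n * x, 3) for x in P])
```

Multiplying the normalized eigenvector by the number of pages gives values on the same scale as the per-page PR formula.
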
PageRank
• When the number of pages is small, P can be computed efficiently.
• However, when the number of pages is large, computing P directly is time-consuming.
• An iterative method is used for the computation instead.
• Any initial value of P is acceptable.
• After applying the equation about 50 times, P converges to a stable state.
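PageRank Algorithm
• R_0 <- S
• Loop:
• R_{i+1} <- A * R_i
• d <- ||R_i||_1 - ||R_{i+1}||_1 (here d is the rank lost in the multiplication, not the damping factor)
• R_{i+1} <- R_{i+1} + d * E
• dif <- ||R_{i+1} - R_i||_1
• While (dif > t)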
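
The loop above can be sketched in plain Python. The function name `pagerank_loop`, the tolerance value and the example inputs are assumptions. In particular, this sketch assumes that the damping factor 0.85 is already multiplied into A, so that each multiplication loses 15% of the rank, and that E is a uniform vector summing to 1 that receives the lost rank. The E step is applied to the freshly computed vector, so that the lost rank is added back to the new values.

```python
def pagerank_loop(A, S, E, t=1e-8):
    """R <- A*R, then redistribute the lost L1 norm over E, until successive vectors differ by <= t."""
    r = list(S)
    n = len(r)
    while True:
        r_next = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        lost = sum(abs(x) for x in r) - sum(abs(x) for x in r_next)  # d in the slide
        r_next = [x + lost * e for x, e in zip(r_next, E)]
        dif = sum(abs(x - y) for x, y in zip(r_next, r))
        r = r_next
        if dif <= t:
            return r


damping = 0.85
# Hypothetical graph: A links to B and C; B links to A; C links to A and B.
# Row = target page, column = source page, entry = damping / C(source).
A = [
    [0.0, damping * 1.0, damping * 0.5],
    [damping * 0.5, 0.0, damping * 0.5],
    [damping * 0.5, 0.0, 0.0],
]
ranks = pagerank_loop(A, S=[1.0, 1.0, 1.0], E=[1 / 3, 1 / 3, 1 / 3])
print([round(x, 3) for x in ranks])
```

Stopping on the change between successive vectors, instead of after a fixed number of passes, lets the loop end as soon as the values have settled.
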
Examples
• P(A) = P(B) = P(C) = 0.15
[Diagram: pages A, B and C]
PageRank
• A = 1.3
• B = 1
• C = 0.7
[Diagram: pages A, B and C]