
Network Mining


hharvey

Presentation Transcript


  1. Network Mining • Finding Important Nodes • PageRank • Relative Importance • Discovering Network Modules • Inferring Important Paths

  2. PageRank • Web pages are organized in a network. • Each webpage is represented as a node. • Each hyperlink is a directed edge. • The entire web can be viewed as a directed graph.

  3. PageRank • PageRank is a numeric value that represents how important a page is on the web. • A link from one page to another is a vote for the other page: a link from page A to page B is a vote by A for B. • The more important page A is itself, the more weight its vote for B should carry. • The more votes a page receives, the more important it must be. • How can we model this importance?

  4. The Random Surfer Model • PageRank is a model of user behaviour: a surfer clicks on links at random, with no regard for content. • Intuition: imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. • Occasionally the surfer gets bored and, instead of following a link out of the current page, jumps to another random page. • In the long run, the percentage of time spent at each page converges to a fixed value. • This value is known as the PageRank of the page.
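
The random surfer can be simulated directly. The sketch below is only an illustration under stated assumptions: the three-page graph is hypothetical, the surfer follows a link with probability d = 0.85 and otherwise jumps to a uniformly random page, and the function name `random_surfer` is made up for this example. It reports the fraction of steps spent on each page.

```python
import random

def random_surfer(links, d=0.85, steps=100_000, seed=0):
    """Estimate the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    pages = list(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if links[current] and rng.random() < d:
            current = rng.choice(links[current])  # follow a random outgoing link
        else:
            current = rng.choice(pages)           # bored (or stuck): jump anywhere
    return {p: visits[p] / steps for p in pages}

# Hypothetical graph: A links to B and C; B and C both link back to A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
share = random_surfer(links)
print(share)
```

The fractions sum to 1. Multiplying them by the number of pages puts them on the same scale as PageRank values that average 1 per page.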

  5. PageRank • Importance Computation • The importance of a page is distributed to the pages that it points to. • The importance of a page is the sum of the importance shares it receives from the pages that point to it. • If a page has 5 outlinks, its importance is divided by 5 and each link carries a one-fifth share.
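
The sharing rule can be written in a few lines. This is a minimal sketch; the function name `distribute` and the page names `p1` to `p5` are hypothetical.

```python
def distribute(importance, outlinks):
    """Split a page's importance equally among the pages it links to."""
    if not outlinks:
        return {}  # a page with no outlinks passes nothing on
    share = importance / len(outlinks)
    return {target: share for target in outlinks}

# Hypothetical page with importance 1.0 and five outlinks.
shares = distribute(1.0, ["p1", "p2", "p3", "p4", "p5"])
print(shares["p1"])  # 0.2
```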

  6. PageRank

  7. Algorithm • From here on, PageRank is abbreviated as “PR”. • PR(A) = (1 - d) + d * ( PR(T1) / C(T1) + ... + PR(Tn) / C(Tn) ) • PR(A) is the PageRank of page A • T1, ..., Tn are the pages that link to page A, and PR(Ti) is the PageRank of page Ti • C(Ti) is the number of outbound links on page Ti • d is a damping factor which can be set between 0 and 1 (usually set to 0.85) • With d = 0.85: a page’s PageRank = 0.15 + 0.85 * (a “share” of the PageRank of every page that links to it) • “share” = the linking page’s PageRank divided by the number of outbound links on that page
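
A single application of the formula can be sketched as follows. The function name `pagerank_update` and the two inbound pages are assumptions made for illustration only.

```python
def pagerank_update(inbound, d=0.85):
    """Apply PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) once.

    inbound is a list of (PR(Ti), C(Ti)) pairs, one per page linking to A.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)

# Hypothetical case: A is linked from T1 (PR 1, 2 outlinks) and T2 (PR 1, 1 outlink).
pr_a = pagerank_update([(1.0, 2), (1.0, 1)])
print(round(pr_a, 3))  # 1.425
```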

  8. Page A and Page B link to each other, and both start with PR = 1. • PR(A) = 0.15 + 0.85 * PR(B) • PR(B) = 0.15 + 0.85 * PR(A) • We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's PageRank. • The way out is to iterate: start from inaccurate values and recompute repeatedly, so that each pass yields more accurate values. • About 100 iterations are needed to get a good approximation of the PageRank values of the whole web.
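
To see that iterating from inaccurate values works, the two equations can be applied repeatedly from different starting guesses. This is a small sketch; the function name `iterate_pair` and the starting values 0 and 40 are arbitrary choices for illustration.

```python
def iterate_pair(pr_a, pr_b, d=0.85, steps=100):
    """Repeatedly apply the two update equations for pages A and B."""
    for _ in range(steps):
        pr_a = (1 - d) + d * pr_b
        pr_b = (1 - d) + d * pr_a
    return pr_a, pr_b

from_zero = iterate_pair(0.0, 0.0)
from_forty = iterate_pair(40.0, 40.0)
print([round(x, 6) for x in from_zero])   # [1.0, 1.0]
print([round(x, 6) for x in from_forty])  # [1.0, 1.0]
```

Both runs settle at PR(A) = PR(B) = 1, so in this example the starting guess does not affect the result.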

  9. PageRank of a Site • The PageRank of a site is the sum of the PageRank of all pages in the site. • The maximum PageRank of a site is equal to the total number of pages in the site.

  10. Internal Linking • A website has a maximum amount of PageRank that is distributed between its pages by internal links • The maximum amount of PageRank in a site increases as the number of pages in the site increases • By linking poorly, it is possible to fail to reach the site’s maximum PageRank, but it is not possible to exceed it
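
The bound can be checked by summing the PageRank formula over all pages. The derivation below is a sketch that assumes a closed site of N pages with no links entering or leaving it; the sum over "T → A" runs over the pages T that link to A.

```latex
\begin{align*}
\sum_{A} PR(A)
  &= N(1-d) + d \sum_{A} \sum_{T \to A} \frac{PR(T)}{C(T)} \\
  &= N(1-d) + d \sum_{T :\, C(T) > 0} PR(T) \\
  &\le N(1-d) + d \sum_{T} PR(T)
  \quad\Longrightarrow\quad \sum_{A} PR(A) \le N .
\end{align*}
```

The second line holds because a page T with C(T) outlinks contributes PR(T)/C(T) exactly C(T) times. Equality holds when every page has at least one outbound link, which is why pages without outbound links lower the total.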

  11. Internal Linking • Example: Page A, Page B and Page C with no links between them; each starts with PR = 1. • Maximum PageRank is the amount of PageRank available in the site, so this site’s maximum PageRank is 3. • PR(A) = 0.15 + 0.85 * ( 0 ) = 0.15 • PR(B) = 0.15 + 0.85 * ( 0 ) = 0.15 • PR(C) = 0.15 + 0.85 * ( 0 ) = 0.15 • Total PageRank in this site = 0.45 • The site is wasting most of its potential PageRank!

  12. Internal Linking • Now Page A links to Page B; each page starts with PR = 1. • First iteration: PR(A) = 0.15 + 0.85 * ( 0 ) = 0.15 • PR(B) = 0.15 + 0.85 * ( PR(A)/1 ) = 0.15 + 0.85 * ( 1 ) = 1 • PR(C) = 0.15 + 0.85 * ( 0 ) = 0.15 • After 100 iterations the values settle as shown on the next slide.

  13. Internal Linking • The same site (Page A links to Page B) after 100 iterations: • PR(A) = 0.15 • PR(B) = 0.15 + 0.85 * 0.15 = 0.2775 • PR(C) = 0.15 • Total PageRank in this site = 0.5775 • Slightly better, but still not the best it could be.
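
The totals for the two link structures above (no links, and a single link from A to B) can be reproduced with a short script. This is a minimal sketch, not a production implementation; the function name `pagerank` and the dictionary representation of the links are assumptions.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pr = {page: 1.0 for page in links}  # every page starts at PR = 1
    for _ in range(iterations):
        # Recompute every page from the previous round's values.
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    return pr

no_links = pagerank({"A": [], "B": [], "C": []})
a_to_b = pagerank({"A": ["B"], "B": [], "C": []})
print(round(sum(no_links.values()), 4))  # 0.45
print(round(sum(a_to_b.values()), 4))    # 0.5775
```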

  14. Internal Linking • Page A links to Page B and Page C; Page B and Page C each link back to Page A; each page starts with PR = 1. • First iteration: PR(A) = 0.15 + 0.85 * ( PR(B)/1 + PR(C)/1 ) = 0.15 + 0.85 * ( 1 + 1 ) = 0.15 + 1.7 = 1.85 • PR(B) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575 • PR(C) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575 • After 100 iterations: Page A = 1.459459, Page B = 0.7702703, Page C = 0.7702703
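
The values reached after 100 iterations can also be obtained by solving the equations directly. The worked derivation below uses the link structure implied by the equations on this slide (A links to B and C, and B and C link back to A) and substitutes PR(B) = PR(C) into the equation for PR(A).

```latex
\begin{align*}
PR(B) = PR(C) &= 0.15 + 0.425\,PR(A) \\
PR(A) &= 0.15 + 0.85\,\bigl(PR(B) + PR(C)\bigr)
       = 0.15 + 1.7\,\bigl(0.15 + 0.425\,PR(A)\bigr)
       = 0.405 + 0.7225\,PR(A) \\
PR(A) &= \frac{0.405}{0.2775} \approx 1.459459 \\
PR(B) = PR(C) &= 0.15 + 0.425 \times 1.459459 \approx 0.770270
\end{align*}
```

The three values add up to 3, the number of pages in the site.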

  15. Internal Linking • Every page links to both of the other pages; each page starts with PR = 1. • PR(A) = 0.15 + 0.85 * ( PR(B)/2 + PR(C)/2 ) = 0.15 + 0.85 * ( 0.5 + 0.5 ) = 0.15 + 0.85 = 1 • PR(B) = 1 • PR(C) = 1 • Total PageRank in this site = 3 • Good linking!

  16. Internal Linking • Page A links to Page B and Page C; Page B links to Page A; Page C links to Page A and Page B; each page starts with PR = 1. • First iteration: PR(A) = 0.15 + 0.85 * ( PR(B)/1 + PR(C)/2 ) = 0.15 + 0.85 * ( 1 + 0.5 ) = 0.15 + 1.275 = 1.425 • PR(B) = 0.15 + 0.85 * ( PR(A)/2 + PR(C)/2 ) = 0.15 + 0.85 * ( 0.5 + 0.5 ) = 0.15 + 0.85 = 1 • PR(C) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575 • After 100 iterations: Page A = 1.298245, Page B = 0.9999999, Page C = 0.7017543

  17. Internal Linking • Result when Page A links to Page B and Page C, and both link back to Page A: • PR(A) = 1.46 • PR(B) = 0.77 • PR(C) = 0.77 • Total PageRank in this site = 3 • No PageRank has been wasted!

  18. Internal Linking • Result when Page C also links to Page B: • PR(A) = 1.298 • PR(B) = 1.000 • PR(C) = 0.702 • Compared with the previous slide, Page A and Page C lose some PageRank and Page B gains some PageRank. • Total PageRank in this site = 3

  19. Inbound Links • Pages A, B, C and D link in a loop (A to B, B to C, C to D, D to A), each starting with PR = 1; an external Page X with PR = 10 links to Page A. • Try setting the damping factor “d” to 0.5 in this example to see its influence: • PR(A) = 0.5 + 0.5 * ( PR(X) / C(X) + PR(D) ) = 6.33 • PR(B) = 0.5 + 0.5 * PR(A) = 3.67 • PR(C) = 0.5 + 0.5 * PR(B) = 2.33 • PR(D) = 0.5 + 0.5 * PR(C) = 1.67 • Initial effect of the additional inbound link to Page A: d x PR(X) / C(X) = 0.5 x 10 / 1 = 5

  20. Inbound Links • Same structure: Pages A, B, C and D link in a loop, each starting with PR = 1, and an external Page X with PR = 10 links to Page A. • Now we set the damping factor “d” back to 0.85: • PR(A) = 0.15 + 0.85 * ( PR(X) / C(X) + PR(D) ) = 18.78 • PR(B) = 0.15 + 0.85 * PR(A) = 16.12 • PR(C) = 0.15 + 0.85 * PR(B) = 13.85 • PR(D) = 0.15 + 0.85 * PR(C) = 11.92 • Initial effect of the additional inbound link to Page A: d x PR(X) / C(X) = 0.85 x 10 / 1 = 8.5
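
The influence of the damping factor in the inbound-link example can be reproduced by iterating the four equations with PR(X) held fixed at 10. This is a sketch under the assumption that X has a single outbound link, C(X) = 1; the function name `loop_with_inbound` is hypothetical.

```python
def loop_with_inbound(d, pr_x=10.0, iterations=200):
    """Loop A -> B -> C -> D -> A plus one inbound link X -> A, with PR(X) fixed."""
    pr_a = pr_b = pr_c = pr_d = 1.0
    for _ in range(iterations):
        # All four values are updated from the previous round's values.
        pr_a, pr_b, pr_c, pr_d = (
            (1 - d) + d * (pr_x / 1 + pr_d),
            (1 - d) + d * pr_a,
            (1 - d) + d * pr_b,
            (1 - d) + d * pr_c,
        )
    return pr_a, pr_b, pr_c, pr_d

low = loop_with_inbound(0.5)
high = loop_with_inbound(0.85)
print([round(x, 2) for x in low])   # [6.33, 3.67, 2.33, 1.67]
print([round(x, 2) for x in high])  # [18.78, 16.12, 13.85, 11.92]
```

With the higher damping factor, the PageRank received from X decays more slowly as it travels around the loop, so every page in the loop ends up with a much larger value.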

  21. Outbound Links • Site 1 consists of Page A and Page B, which link to each other; Site 2 consists of Page C and Page D, which link to each other; each page starts with PR = 1. • Without links between the sites: PR(A) = 0.15 + 0.85 * PR(B) = 1 • PR(B) = 0.15 + 0.85 * PR(A) = 1 • PR(C) = 0.15 + 0.85 * PR(D) = 1 • PR(D) = 0.15 + 0.85 * PR(C) = 1 • After Page A adds an outbound link to Page C: PR(A) = 0.15 + 0.85 * PR(B) = 0.43 • PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33 • PR(C) = 0.15 + 0.85 * ( PR(A)/2 + PR(D) ) = 1.67 • PR(D) = 0.15 + 0.85 * PR(C) = 1.57 • Using these rounded values, Site 1 loses: 0.76 - 2 = -1.24, and Site 2 gains: 3.24 - 2 = 1.24 • The PageRank benefit for one site equals the PageRank loss of the other.
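
The transfer between the two sites can be checked numerically. The sketch below assumes the link structure implied by the equations (A and B link to each other, C and D link to each other, and A adds a link to C); the function name `pagerank` is hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    return pr

# Site 1 = {A, B}, Site 2 = {C, D}; A has added an outbound link to C.
after = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]})
site1 = after["A"] + after["B"]
site2 = after["C"] + after["D"]
print(round(site1, 2), round(site2, 2))  # 0.77 3.23
```

Without rounding the individual pages first, the transfer comes out at about 1.23 rather than 1.24. The combined total of the two sites stays at 4, so what Site 1 loses is what Site 2 gains.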

  22. Dangling Links • Dangling links are links that point to a page with no outgoing links. • Example: Page A and Page B link to each other, and Page A also links to Page C, which has no outgoing links; each page starts with PR = 1. • PR(A) = 0.15 + 0.85 * PR(B) = 0.43 • PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33 • PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33 • The total PageRank is only about 1.10, roughly one third of the maximum PageRank of 3. • To protect PageRank from the negative effects of dangling links, pages without outbound links are removed from the database until the PageRank values have been computed.

  23. Dangling Links • With the dangling Page C included: PR(A) = 0.15 + 0.85 * PR(B) = 0.43 • PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33 • PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33 • With Page C removed, so that Page A and Page B only link to each other: PR(A) = 0.15 + 0.85 * PR(B) = 1 • PR(B) = 0.15 + 0.85 * PR(A) = 1
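
The effect of removing the dangling page can be sketched as below. The helper `drop_dangling` is a hypothetical name, and it makes a single pass only; removing a page can leave another page without outbound links, so a fuller version would repeat the pass until nothing changes.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    return pr

def drop_dangling(links):
    """Remove pages with no outbound links and the links pointing to them (one pass)."""
    keep = {page for page, out in links.items() if out}
    return {p: [t for t in out if t in keep] for p, out in links.items() if p in keep}

site = {"A": ["B", "C"], "B": ["A"], "C": []}
with_dangling = pagerank(site)
without_dangling = pagerank(drop_dangling(site))
print({p: round(v, 2) for p, v in with_dangling.items()})     # {'A': 0.43, 'B': 0.33, 'C': 0.33}
print({p: round(v, 2) for p, v in without_dangling.items()})  # {'A': 1.0, 'B': 1.0}
```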

  24. Speed of Convergence • Early experiments on Google used 322 million links. • PageRank algorithm converged (within small tolerance) in about 52 iterations. • Number of iterations required for convergence is empirically O(log n) (where n is the number of links). • Therefore calculation is quite efficient.

  25. PageRank • Page A’s PR is based on all of its inbound neighbors. • PR is defined recursively. • With each new calculation, each page’s PR may change. • If B and C point to A, the order in which A, B and C are computed may affect the result for A’s PR. • How can this problem be solved?

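  26. PageRank • The PR values of all web pages form a vector $P$. • $X$ is a matrix in which each row and each column represents a page. • $X_{u,v} = 1/C(u)$ if there is a link from $u$ to $v$, and $0$ otherwise. • Writing $A = X^{\top}$, the PageRank vector satisfies $P = c\,A\,P + c\,E$, where $E$ is a vector that acts as a source of rank. • Since $\|P\|_1 = 1$, this can be written as $P = c\,(A + E\,\mathbf{1}^{\top})\,P$, where $\mathbf{1}$ is the vector of all ones. • $P$ is an eigenvector of $(A + E\,\mathbf{1}^{\top})$.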

  27. PageRank • When the number of pages is small, P can be computed efficiently. • However, when the number of pages is large, computing P directly is time-consuming. • An iterative method is used instead. • Any initial value of P will do. • After applying the equation about 50 times, P converges to a stable state.

  28. PageRank Algorithm • R_0 <- S • Loop: • R_{i+1} <- A * R_i • d <- ||R_i||_1 - ||R_{i+1}||_1 • R_{i+1} <- R_{i+1} + d * E • dif <- ||R_{i+1} - R_i||_1 • While (dif > t)
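
The loop can be sketched in plain Python. Several choices here are assumptions that the slide does not state: the damping factor c = 0.85 is folded into A, so that multiplying by A loses 15% of the rank each round; E and the start vector S are both uniform with L1 norm 1; and the function name `pagerank_vector` and the example graph are hypothetical.

```python
def l1(vec):
    """L1 norm of a vector given as a list of numbers."""
    return sum(abs(x) for x in vec)

def pagerank_vector(links, c=0.85, tol=1e-10, max_iter=1000):
    pages = list(links)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    e = [1.0 / n] * n  # E: uniform source of rank (assumption)
    r = [1.0 / n] * n  # R_0 <- S: uniform start vector (assumption)
    for _ in range(max_iter):
        nxt = [0.0] * n
        for src, outs in links.items():
            for dst in outs:
                nxt[index[dst]] += c * r[index[src]] / len(outs)  # R_{i+1} <- A * R_i
        d = l1(r) - l1(nxt)                          # rank lost in this round
        nxt = [x + d * ei for x, ei in zip(nxt, e)]  # R_{i+1} <- R_{i+1} + d * E
        dif = l1([a - b for a, b in zip(nxt, r)])
        r = nxt
        if dif <= tol:                               # stop once dif is within tolerance t
            break
    return dict(zip(pages, r))

# Hypothetical graph: A links to B and C, B links to A, C links to A and B.
ranks = pagerank_vector({"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]})
print({page: round(3 * value, 3) for page, value in ranks.items()})
```

The vector returned here sums to 1. Multiplying by the number of pages (3 in this example) rescales it to values that average 1 per page.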

  29. Examples • P(A) = P(B) = P(C) = 0.15 • [Figure: graph of pages A, B and C]

  30. PageRank • A = 1.3 • B = 1 • C = 0.7 • [Figure: graph of pages A, C and B]
