
Hyperlink Analysis


phillipsm

Presentation Transcript


  1. Hyperlink Analysis A Survey (In Progress)

  2. Overview of This Talk • Introduction to Hyperlink Analysis • Classification of Hyperlink Analysis • Two sub-topics: • Measures and Metrics • Interesting Web Structures

  3. Definition of Hyperlink Analysis • Hyperlink Analysis can be defined as the area of Web Information Retrieval that uses the hyperlink structure of the Web.

  4. Motivation • Hyperlinks serve two main purposes: • Pure navigation. • Pointing to pages with authority* on the same topic as the page containing the link. • The second purpose can be used to retrieve useful information from the Web. * Authority: a set of ideas or statements supporting a topic.

  5. What Information Can Be Retrieved? • Quality of a web page. - The authority of a page on a topic. - Ranking of web pages. • Interesting Web structures. - Graph patterns such as co-citation, social choice, complete bipartite graphs, etc. • Web page classification. - Classifying web pages according to various topics.

  6. What Information Can Be Retrieved? (Cont…) • Which pages to crawl. - Deciding which web pages to add to the collection of web pages. • Finding related pages. - Given one relevant page, find all related pages. • Detection of duplicated pages. - Detection of near-mirror sites to eliminate duplication.

  7. Classification of Hyperlink Analysis Research • Hyperlink Analysis divides into: • Measures and Metrics • Interesting Web Structures • Web Page Classification • Web Search • (Still needs to be refined. Suggestions welcome.)

  8. Measures/Metrics • Standards for measuring properties of a page or a web structure. • Quality of a page. • Distance between pages. • Web page reputation.

  9. PageRank Citation Ranking[1] • Aim • Ranking Metric for Hypertext Documents • Approach • Page has a high rank if the sum of the ranks of its backlinks is high

  10. Authoritative Sources in a Hyperlinked Environment [3] • Aim • Determining the relative “authority” of pages • Approach • A good authority page is one pointed to by many good hubs • A good hub page is one that points to many good authorities • Results • Efficient when the query topic is sufficiently “broad” • Benefits • Locating dense bipartite communities

  11. Does “Authority” Mean Quality? [4] • Aim. • Are any of the metrics we compute for Web documents good predictors of document quality? • Approach. • Do experts agree in their quality judgments? • Do different link-based metrics (in-degree, PageRank and Authority) give different results? • Can we predict human quality judgments? • Compute correlations between each pair of metrics and compare each metric with expert judgment.

  12. Does “Authority” Mean Quality? [4] • Results. • Experts agree on the nature of quality within a topic. • No significant difference between link-based metrics. • In-degree performed as well as PageRank and Authority.

  13. Web Page Reputations [5] • Aim. • Input: URL. Output: ranked set of topics for which the page has a reputation. • Approach. • A page can acquire a high reputation on a topic because it is pointed to by many pages on that topic, or because it is pointed to by some high-reputation pages on that topic. • A page is deemed an authority on the topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.

  14. One-level Influence Propagation • The reputation of a page p on a topic t is the probability that a random surfer looking for topic t will visit page p. • At each step: • with probability d>0, jump to a random page that contains term t, or • with probability (1-d), follow a random link from the current page. • The resulting probability is defined piecewise, with one case if term t appears in page p and another otherwise [equation not preserved].
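
The random-surfer model on this slide can be sketched in Python. This is a minimal sketch, not the authors' implementation: it assumes the random jump lands uniformly on one of the pages containing term t (which is why only those pages receive the jump share), and that a followed link is chosen uniformly among the current page's out-links. The names `one_level_reputation`, `out_links`, `pages_with_term` and the value d = 0.15 are illustrative assumptions.

```python
def one_level_reputation(out_links, pages_with_term, d=0.15, iterations=50):
    """Estimate the probability that a surfer looking for term t visits each page.

    out_links: dict mapping every page to the list of pages it links to
               (every link target must also be a key).
    pages_with_term: set of pages in which term t appears (assumed input).
    """
    pages = list(out_links)
    n_t = len(pages_with_term)
    if n_t == 0:
        raise ValueError("no page contains the term")
    # Start the surfer on a random page that contains the term.
    rep = {p: (1.0 / n_t if p in pages_with_term else 0.0) for p in pages}
    for _ in range(iterations):
        # Jump share d / n_t goes only to pages containing the term.
        new_rep = {p: (d / n_t if p in pages_with_term else 0.0) for p in pages}
        for q, targets in out_links.items():
            # Pages without out-links pass nothing on in this simple sketch.
            for p in targets:
                new_rep[p] += (1 - d) * rep[q] / len(targets)
        rep = new_rep
    return rep


# Hypothetical toy graph: pages "a" and "b" contain the term, "c" does not.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
reputation = one_level_reputation(toy_graph, {"a", "b"})
```

In the toy graph, page "c" does not contain the term but still gains reputation because both term pages point to it.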

  15. Two Level Influence Propagation • with probability d>0 jump to random page that contains term t • with probability (1-d) follow random link forward/backward from the current page, alternating directions • Authority Reputation of a page p on a topic t is the probability that a random surfer looking for a topic t makes a forward visit to the page p • Hub Reputation of a page p on a topic t is the probability that a random surfer looking for a topic t makes a backward visit to the page p

  16. Two Level Influence Propagation • A(p,t) = probability of a forward visit to page p when searching for term t = authority rank of page p on term t. • H(p,t) = probability of a backward visit to page p when searching for term t = hub rank of page p on term t. • Each is defined piecewise, with one case if term t appears in page p and another otherwise [equations not preserved].
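
A(p,t) and H(p,t) can be approximated by iterating the alternating forward/backward surfer. The sketch below is one plausible reading, not the slide's exact equations: it assumes the jump probability d is split evenly between forward and backward visits (d / (2·N_t) per page containing t, where N_t is the number of such pages), that a forward step picks uniformly among out-links, and that a backward step picks uniformly among in-links. All function and parameter names are hypothetical.

```python
def two_level_reputation(out_links, pages_with_term, d=0.15, iterations=50):
    """Return (authority, hub) reputation dicts for a term.

    out_links: dict mapping every page to the list of pages it links to
               (every link target must also be a key).
    pages_with_term: set of pages in which term t appears (assumed input).
    """
    pages = list(out_links)
    n_t = len(pages_with_term)
    if n_t == 0:
        raise ValueError("no page contains the term")
    # Reverse index: which pages link to p.
    in_links = {p: [] for p in pages}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append(q)
    # Assumed jump share, split between forward and backward visits.
    jump = {p: (d / (2 * n_t) if p in pages_with_term else 0.0) for p in pages}
    authority, hub = dict(jump), dict(jump)
    for _ in range(iterations):
        new_authority, new_hub = dict(jump), dict(jump)
        for q in pages:
            # Forward visit to p: the surfer reached q backward, then follows q -> p.
            for p in out_links[q]:
                new_authority[p] += (1 - d) * hub[q] / len(out_links[q])
            # Backward visit to p: the surfer reached q forward, then walks back to p.
            for p in in_links[q]:
                new_hub[p] += (1 - d) * authority[q] / len(in_links[q])
        authority, hub = new_authority, new_hub
    return authority, hub


# Hypothetical toy graph in which every page contains the term.
toy_graph = {"h": ["a", "b"], "g": ["a"], "a": [], "b": []}
authority, hub = two_level_reputation(toy_graph, {"h", "g", "a", "b"})
```

Here "a" ends up with the higher authority reputation because two pages point to it, and "h" with the higher hub reputation because it points to both authorities.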

  17. Factors Affecting Page Reputation • How well a topic is represented. • How well pages on a topic are connected.

  18. Link Analysis and Stability [6] • Aim. • Determine when rankings can be expected to remain stable under small perturbations to hyperlink patterns. • Approach. • The eigengap directly affects the stability of the eigenvectors in the HITS algorithm. • Coupled Markov Chain Theory(?). • As long as the perturbed web pages did not have high overall PageRank scores, the perturbed PageRank scores will not be far from the original. • Result. • HITS: unstable; PageRank: stable.

  19. Stable Algorithms [7] • Aim • Stable link analysis methods • Approach • Randomized HITS: merging the hubs-and-authorities notion with the “reset” mechanism from PageRank • Subspace HITS: combining multiple eigenvectors from HITS to yield aggregate authority scores • Results • Both approaches are more stable than HITS; the latter is a little worse than PageRank

  20. Average Clicks [8] • Aim. • A new definition of distance between two pages. • Approach. • Based on the probability of clicking a link during random surfing. • Benefit. • A good justification for fetching neighboring pages in practical search. • Result. • Distance by average clicks seems to fit well intuitively.
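
A click-probability distance of this kind can be sketched as a shortest-path computation. The details below are assumptions rather than anything stated on the slide: a link on a page with k out-links is taken to be clicked with probability α/k, its length is taken to be -log_n(α/k) so that one "average click" corresponds to choosing one link out of n, and the distance is the shortest path under those lengths. The values α = 0.85 and n = 7 and all names are illustrative.

```python
import heapq
import math


def average_clicks(out_links, source, alpha=0.85, n=7):
    """Shortest-path distances from source, with link length -log_n(alpha / outdegree)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d_p, p = heapq.heappop(heap)
        if d_p > dist.get(p, math.inf):
            continue  # stale queue entry
        targets = out_links.get(p, [])
        if not targets:
            continue
        # Every link on page p has the same (positive) length.
        length = -math.log(alpha / len(targets), n)
        for q in targets:
            if d_p + length < dist.get(q, math.inf):
                dist[q] = d_p + length
                heapq.heappush(heap, (dist[q], q))
    return dist


# Hypothetical toy graph: "hub" has seven links, "p1" has a single link.
toy_graph = {"hub": ["p1", "p2", "p3", "p4", "p5", "p6", "p7"], "p1": ["x"]}
dist = average_clicks(toy_graph, "hub")
```

Under these assumptions a link on a crowded page costs more than a link on a page with few links, so the step from "hub" to "p1" is longer than the step from "p1" to "x".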

  21. Interesting Web Structure • Analyzing interesting graph patterns or Web Structures. • Helpful in identification of ‘Web Communities.’

  22. Interesting Web Structures [11] • Endorsement • Mutual Reinforcement • Social Choice • Co-Citation • Transitive Endorsement

  23. Interesting Web Structures [11] • Directed complete bipartite graph • NK-clan with N=2, K=10 • An NK-clan is a set of K nodes in which there is a path of length N or less (ignoring edge directions) between every pair of nodes
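
The NK-clan definition above translates directly into a check over node pairs. The following is a minimal Python sketch under one assumption the slide does not settle: connecting paths are allowed to pass through nodes outside the candidate set. The function name `is_nk_clan` and the adjacency-dict input are illustrative.

```python
from collections import deque
from itertools import combinations


def is_nk_clan(out_links, nodes, n):
    """True if every pair in `nodes` is joined by a path of length <= n,
    ignoring edge directions. K is simply len(nodes)."""
    # Build an undirected view of the link graph.
    neighbors = {}
    for q, targets in out_links.items():
        for p in targets:
            neighbors.setdefault(q, set()).add(p)
            neighbors.setdefault(p, set()).add(q)

    def within_n(start, goal):
        # Breadth-first search that stops expanding at depth n.
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if node == goal:
                return True
            if depth == n:
                continue
            for nb in neighbors.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, depth + 1))
        return False

    return all(within_n(a, b) for a, b in combinations(nodes, 2))


# Hypothetical toy graph: a centre page linking to three leaves.
star = {"c": ["l1", "l2", "l3"]}
clan_n2 = is_nk_clan(star, ["c", "l1", "l2", "l3"], 2)
clan_n1 = is_nk_clan(star, ["c", "l1", "l2", "l3"], 1)
```

In the star graph the leaves reach each other through the centre in two steps, so the four nodes form a clan for N=2 but not for N=1.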

  24. Interesting Web Structures [11] • In-Tree • Out-Tree

  25. Interesting Web Structures • Web Communities

  26. Friends and Neighbors [9] • Aim. • Techniques to mine information in order to predict relationship between individuals. • Approach. • Similarity measured by analyzing text, in-links, out-links and mailing list. • Result. • In-links were ‘good’ predictors.

  27. References • [1] S. Brin and L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, available at http://www-db.stanford.edu/~backrub/pageranksub.ps, January 1998. • [2] T. Haveliwala (1999), Efficient Computation of PageRank. Technical Report, Stanford University, CA. • [3] J. M. Kleinberg (1998), Authoritative Sources in a Hyperlinked Environment.

  28. References (contd…) • [4] B. Amento, L. Terveen, and W. Hill (2000), Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents, ACM 2000. • [5] D. Rafiei and A. O. Mendelzon (2000), What is this Page Known for? Computing Web Page Reputations, Proceedings of the Ninth International WWW Conference.

  29. References (contd…) • [6] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Link Analysis, Eigenvectors and Stability, IJCAI-01. • [7] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Stable Algorithms for Link Analysis, Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), 2001. • [8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001), Average-clicks: A New Measure of Distance on the WWW, WI-2001, 2001.

  30. References (contd…) • [9] L. A. Adamic and E. Adar (2000), Friends and Neighbors on the Web, Xerox Palo Alto Research Center, Palo Alto, CA 94304. • [10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas (2000), Finding Authorities and Hubs From Link Structures on the World Wide Web, WWW10 Proceedings.

  31. References (contd…) • [11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin (2000), The Shape of the Web and Its Implications for Searching the Web, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Proceedings at http://www.ssgrr.it/en/ssgrr2000/proceedings.htm, Rome, Italy, Jul.-Aug. 2000. • [12] M. Henzinger (2000), Link Analysis in Web Information Retrieval, ICDE Bulletin, Sept 2000, Vol. 23, No. 3.

  32. PageRank Approach • PageRank of a page p: $R(p) = \frac{d}{n} + (1-d)\sum_{(q,p)\in G} \frac{R(q)}{\mathrm{outdegree}(q)}$. • d is the damping factor (the probability that the next page is chosen uniformly at random from all pages). • n is the number of nodes in graph G. • outdegree(q) is the number of edges leaving page q.
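
The symbols on this slide can be turned into a short power-iteration sketch in Python. It assumes the rank of p is d/n plus (1 - d) times the sum of R(q)/outdegree(q) over the pages q that link to p. The handling of pages with no out-links (spreading their rank evenly) is an added assumption the slide does not address, and the names `pagerank` and `out_links` and the value d = 0.15 are illustrative.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Power iteration for PageRank on a small link graph.

    out_links: dict mapping every page to the list of pages it links to
               (every link target must also be a key).
    d: probability of jumping to a page chosen uniformly at random.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the uniform jump share d / n.
        new_rank = {p: d / n for p in pages}
        for q, targets in out_links.items():
            if not targets:
                # Assumption: a page without out-links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += (1 - d) * rank[q] / n
            else:
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In the toy graph "c" outranks "b": both receive half of a's rank, but "c" also receives all of b's.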

  33. HITS Approach • Let z denote the vector (1,1,1,…,1). • Initially set x ← z; y ← z. • For i = 1,2,3,…: • Apply the I operation. • Apply the O operation. • Normalize x and y. • The sequence of (x, y) pairs produced converges to a limit (x*, y*). • Return (x*, y*) as the authority and hub weights.
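
The loop above can be sketched in Python. The slide does not define the two operations, so this sketch assumes the usual reading: the I operation sets each page's authority weight x to the sum of the hub weights y of the pages linking to it, and the O operation sets each page's hub weight to the sum of the authority weights of the pages it links to. A fixed iteration count stands in for a convergence test, and the names are illustrative.

```python
import math


def hits(out_links, iterations=50):
    """Return (authority, hub) weight dicts for a small link graph.

    out_links: dict mapping every page to the list of pages it links to
               (every link target must also be a key).
    """
    pages = list(out_links)
    x = {p: 1.0 for p in pages}  # authority weights, start at z = (1, ..., 1)
    y = {p: 1.0 for p in pages}  # hub weights, start at z = (1, ..., 1)
    for _ in range(iterations):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                new_x[p] += y[q]
        # O operation: hub of p = sum of updated authority weights of pages p links to.
        new_y = {p: sum(new_x[q] for q in out_links[p]) for p in pages}
        # Normalize both vectors to unit Euclidean length.
        x_norm = math.sqrt(sum(v * v for v in new_x.values())) or 1.0
        y_norm = math.sqrt(sum(v * v for v in new_y.values())) or 1.0
        x = {p: v / x_norm for p, v in new_x.items()}
        y = {p: v / y_norm for p, v in new_y.items()}
    return x, y


# Hypothetical toy graph: three hub-like pages and two authority-like pages.
toy_graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
authority, hub = hits(toy_graph)
```

In the toy graph "a1" gets the larger authority weight because all three hubs point to it, and "h1" outscores "h3" as a hub because it points to both authorities.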

  34. Friends and Neighbors • Predicting friendship. • Items that are unique to a few users are weighted more than commonly occurring items. • 2 people mention an item: weight = 1/log(2) = 1.4 • 5 people mention an item: weight = 1/log(5) = 0.62 • (log is the natural logarithm.)
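
The weighting above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the natural logarithm (which matches the two worked values) and assuming that the similarity of two people is the sum of the weights of the items they share. The names `item_weight`, `similarity` and the sample data are hypothetical.

```python
import math


def item_weight(mention_count):
    """Weight of an item mentioned by `mention_count` people (must be >= 2)."""
    return 1.0 / math.log(mention_count)


def similarity(items_a, items_b, mention_counts):
    """Sum the weights of the items two people have in common."""
    shared = set(items_a) & set(items_b)
    return sum(item_weight(mention_counts[item]) for item in shared)


# Hypothetical data: how many people mention each item overall.
mention_counts = {"chess club": 2, "campus home page": 5}
score = similarity(
    ["chess club", "campus home page"],
    ["chess club", "campus home page"],
    mention_counts,
)
```

A shared item mentioned by only two people contributes more than twice as much to the score as one mentioned by five.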
