Hyperlink Analysis A Survey (In Progress)
Overview of This Talk • Introduction to Hyperlink Analysis • Classification of Hyperlink Analysis • Two sub-topics: • Measures and Metrics • Interesting Web Structures
Definition of Hyperlink Analysis • Hyperlink Analysis can be defined as an area of Web Information Retrieval using the hyperlink structure of the Web.
Motivation • Hyperlinks serve two main purposes. • Pure navigation. • Pointing to pages with authority* on the same topic as the page containing the link. • This can be used to retrieve useful information from the web. * - a set of ideas or statements supporting a topic
What Information Can Be Retrieved? • Quality of Web Page. - The authority of a page on a topic. - Ranking of web pages. • Interesting Web Structures. - Graph patterns like co-citation, social choice, complete bipartite graphs etc. • Web Page Classification. - Classifying web pages according to various topics.
What Information Can Be Retrieved? (Cont…) • Which pages to crawl. - Deciding which web pages to add to the collection of web pages. • Finding Related Pages. - Given one relevant page, find all related pages. • Detection of duplicated pages. - Detection of near-mirror sites to eliminate duplication.
Classification of Hyperlink Analysis Research • Hyperlink Analysis divides into: • Measures and Metrics • Interesting Web Structures • Web Page Classification • Web Search • (Still needs to be refined. Suggestions welcome.)
Measures/Metrics • Standards for measuring properties of a page or a web structure. • Quality of a page. • Distance between pages. • Web Page Reputation.
PageRank Citation Ranking[1] • Aim • Ranking Metric for Hypertext Documents • Approach • Page has a high rank if the sum of the ranks of its backlinks is high
Authoritative Sources in a Hyperlinked Environment [3] • Aim • Determining the relative “authority” of pages • Approach • A good authority page is one pointed to by many good hubs • A good hub page is one that points to many good authorities • Results • Efficient when the query topic is sufficiently “broad” • Benefits • Locating dense bipartite communities
Does “Authority” Mean Quality? [4] • Aim. • Are any of the metrics we compute for Web documents good predictors of document quality? • Approach. • Do experts agree in their quality judgments? • Are different link-based metrics different? • In-degree, PageRank and Authority. • Can we predict human quality judgments? • Compute correlations between each pair of metrics and also compare them with expert judgment.
Does “Authority” Mean Quality? [4] • Results. • Experts agree on the nature of quality within a topic. • No significant difference between link-based metrics. • In-degree performed as well as PageRank and Authority.
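The comparison step described for [4] (correlating metrics with each other and with expert judgment) can be illustrated with a rank correlation. This is a minimal sketch, assuming Spearman rank correlation; the slides do not say which correlation measure the study used, and the scores below are made-up illustrative values, not data from [4].

```python
import math

def ranks(scores):
    """Rank positions (1 = lowest score); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical scores for four pages (illustration only).
indegree = [12, 3, 7, 1]
expert_rating = [4.5, 2.0, 4.0, 1.5]
rho = spearman(indegree, expert_rating)
```

A value of rho near 1 would mean the link-based metric orders the pages almost the same way the experts do.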
Web Page Reputations [5] • Aim. • Input: URL, Output: Ranked set of topics for which the page has a reputation. • Approach. • A page can acquire a high reputation on a topic because the page is pointed to by many pages on that topic, or because the page is pointed to by some high-reputation pages on that topic. • A page is deemed an authority on the topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.
One-level Influence Propagation • The reputation of page p on a topic t is the probability that a random surfer looking for topic t will visit page p. • At each step: • with probability d>0 jump to a random page containing term t, or • with probability (1-d) follow a random link from the current page. • The jump contributes to the reputation of page p only if term t appears in page p; otherwise only link-following contributes.
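The one-level random-surfer model can be sketched as a simple iteration. This is an illustrative sketch, not the authors' implementation: it assumes the jump lands uniformly on one of the N_t pages containing term t (contributing d/N_t to each), that a surfer on a page without out-links simply stops, and the function name, the default d = 0.1, and the toy graph are all hypothetical.

```python
def one_level_reputation(links, pages_with_term, d=0.1, iterations=50):
    """links: dict page -> list of pages it links to.
    pages_with_term: set of pages in which term t appears."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n_t = sum(1 for p in pages if p in pages_with_term)
    if n_t == 0:
        return {p: 0.0 for p in pages}
    # Start the surfer on a uniformly chosen page containing the term.
    rep = {p: (1.0 / n_t if p in pages_with_term else 0.0) for p in pages}
    for _ in range(iterations):
        # Jump term: d / N_t if term t appears in the page, 0 otherwise (assumed form).
        new_rep = {p: (d / n_t if p in pages_with_term else 0.0) for p in pages}
        # Link-following term: each page passes (1 - d) of its mass along its out-links.
        for q in pages:
            targets = links.get(q, [])
            for p in targets:
                new_rep[p] += (1 - d) * rep[q] / len(targets)
        rep = new_rep
    return rep

# Toy graph: t1 and t2 mention the term and both point to "target".
toy_links = {"t1": ["target"], "t2": ["target"], "target": ["t1"], "x": ["target"]}
reputation = one_level_reputation(toy_links, {"t1", "t2"})
```

In this toy graph "target" never mentions the term but still earns a high reputation because pages on the topic point to it, while "x" gets none because the surfer can never reach it.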
Two Level Influence Propagation • with probability d>0 jump to random page that contains term t • with probability (1-d) follow random link forward/backward from the current page, alternating directions • Authority Reputation of a page p on a topic t is the probability that a random surfer looking for a topic t makes a forward visit to the page p • Hub Reputation of a page p on a topic t is the probability that a random surfer looking for a topic t makes a backward visit to the page p
Two Level Influence Propagation • A(p,t) = probability of a forward visit to page p when searching for term t = authority rank of page p on term t. • H(p,t) = probability of a backward visit to page p when searching for term t = hub rank of page p on term t. • In both definitions, the random-jump term applies only if term t appears in page p; otherwise only link-following contributes.
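The two-level model can be sketched in the same style as an alternating iteration over forward and backward steps. This is a rough sketch under stated assumptions: the jump mass d is assumed to be split evenly between the forward and backward directions (d/(2·N_t) for each page containing the term), a surfer with no link to follow is assumed to stop, and the function name, default d, and toy graph are hypothetical.

```python
def two_level_reputation(links, pages_with_term, d=0.1, iterations=50):
    """Return (authority, hub) reputation dicts for one term.
    links: dict page -> list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n_t = sum(1 for p in pages if p in pages_with_term)
    if n_t == 0:
        zeros = {p: 0.0 for p in pages}
        return zeros, dict(zeros)
    in_links = {p: [] for p in pages}
    for q, targets in links.items():
        for p in targets:
            in_links[p].append(q)
    # Assumed jump term: d / (2 * N_t) if the term appears in the page, else 0.
    jump = {p: (d / (2 * n_t) if p in pages_with_term else 0.0) for p in pages}
    auth, hub = dict(jump), dict(jump)
    for _ in range(iterations):
        new_auth, new_hub = dict(jump), dict(jump)
        for q in pages:
            outs = links.get(q, [])
            for p in outs:
                # Forward visit to p, made after a backward visit to q.
                new_auth[p] += (1 - d) * hub[q] / len(outs)
            ins = in_links[q]
            for p in ins:
                # Backward visit to p (a page linking to q), made after a forward visit to q.
                new_hub[p] += (1 - d) * auth[q] / len(ins)
        auth, hub = new_auth, new_hub
    return auth, hub

# Toy graph: two pages on the topic both point to page "a".
authority, hub_rank = two_level_reputation({"h1": ["a"], "h2": ["a"]}, {"h1", "h2"})
```

Here "a" ends up with the highest authority reputation and "h1", "h2" with the highest hub reputation, which matches the mutual-reinforcement reading of hubs and authorities.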
Factors Affecting Page Reputation • How well a topic is represented. • How well pages on a topic are connected.
Link Analysis and Stability [6] • Aim. • When to expect stable rankings under small perturbations to hyperlink patterns. • Approach. • The eigengap directly affects the stability of eigenvectors in the HITS algorithm. • Coupled Markov Chain Theory(?). • As long as the perturbed web pages do not have high overall PageRank scores, the perturbed PageRank scores will not be far from the original. • Result. • HITS – Unstable; PageRank – Stable.
Stable Algorithms [7] • Aim • Stable Link Analysis Methods • Approach • Randomized HITS • Merging the Hubs and Authorities notion with the “reset” mechanism from PageRank • Subspace HITS • Combining multiple eigenvectors from HITS to yield aggregate authority scores • Results • Both approaches are more stable than HITS; the latter is a little worse than PageRank
Average Clicks [8] • Aim. • A new definition of distance between two pages. • Approach. • Based on the probability of clicking a link during random surfing. • Benefit. • A good justification of practical search for fetching neighboring pages. • Result. • Distance by average clicks seems to fit well intuitively.
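A click-probability distance can be sketched as a shortest-path computation. The slide does not give the formula, so the following is only one possible formalisation, assumed here for illustration: each link out of a page p gets length -log_n(alpha / outdegree(p)), where alpha is a damping factor and n is a typical number of links per page, and the distance between two pages is the length of the shortest path. The values alpha = 0.85 and n = 7 and all function names are assumptions.

```python
import heapq
import math

def link_length(outdegree, alpha=0.85, n=7):
    """Length of one link from a page with `outdegree` out-links (assumed formula)."""
    return -math.log(alpha / outdegree, n)

def average_clicks(links, source, target, alpha=0.85, n=7):
    """Shortest-path distance from source to target using Dijkstra's algorithm."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d_p, p = heapq.heappop(queue)
        if p == target:
            return d_p
        if d_p > dist.get(p, math.inf):
            continue  # stale queue entry
        targets = links.get(p, [])
        for q in targets:
            d_q = d_p + link_length(len(targets), alpha, n)
            if d_q < dist.get(q, math.inf):
                dist[q] = d_q
                heapq.heappush(queue, (d_q, q))
    return math.inf  # target not reachable from source

chain = {"a": ["b"], "b": ["c"]}
distance_a_c = average_clicks(chain, "a", "c")
```

Under this assumption a link on a page with many out-links is "longer" than a link on a page with few, because a random surfer is less likely to click any particular one of them.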
Interesting Web Structure • Analyzing interesting graph patterns or Web Structures. • Helpful in identification of ‘Web Communities.’
Interesting Web Structures [11] • Endorsement • Mutual Reinforcement • Social Choice • Co-Citation • Transitive Endorsement
Interesting Web Structures [11] • Directed Complete Bipartite Graph • NK-clan with N=2, K=10 • An NK-clan is a set of K nodes in which there is a path of length N or less (ignoring edge directions) between every pair of nodes
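The NK-clan definition can be checked directly with a breadth-first search over the undirected version of the link graph. This is a minimal sketch; the function name is hypothetical, and it assumes that the connecting paths must stay inside the candidate set, which the slide's definition leaves open.

```python
from collections import deque

def is_nk_clan(links, nodes, n):
    """True if every pair of `nodes` is joined by a path of length <= n,
    ignoring edge directions. K is simply len(nodes)."""
    nodes = set(nodes)
    # Build undirected adjacency from the directed links.
    neighbours = {}
    for p, targets in links.items():
        for q in targets:
            neighbours.setdefault(p, set()).add(q)
            neighbours.setdefault(q, set()).add(p)

    def reachable_within_n(start):
        depth = {start: 0}
        queue = deque([start])
        while queue:
            p = queue.popleft()
            if depth[p] == n:
                continue  # do not extend paths beyond length n
            for q in neighbours.get(p, ()):
                # Assumption: intermediate nodes must belong to the set.
                if q in nodes and q not in depth:
                    depth[q] = depth[p] + 1
                    queue.append(q)
        return set(depth)

    return all(nodes <= reachable_within_n(u) for u in nodes)

# A star: one centre page linking to three leaf pages.
star = {"c": ["l1", "l2", "l3"]}
star_is_2_clan = is_nk_clan(star, {"c", "l1", "l2", "l3"}, 2)
```

In the star, any two leaves are two steps apart through the centre, so the four pages form an NK-clan with N=2, K=4 but not with N=1.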
Interesting Web Structures [11] • In-Tree • Out-Tree
Interesting Web Structures • Web Communities
Friends and Neighbors [9] • Aim. • Techniques to mine information in order to predict relationships between individuals. • Approach. • Similarity measured by analyzing text, in-links, out-links and mailing lists. • Result. • In-links were ‘good’ predictors.
References • [1] S. Brin and L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, available at http://www-db.stanford.edu/~backrub/pageranksub.ps, January 1998. • [2] T. Haveliwala (1999), Efficient Computation of PageRank, Technical Report, Stanford University, CA. • [3] J. M. Kleinberg (1998), Authoritative Sources in a Hyperlinked Environment.
References (contd…) • [4] B. Amento, L. Terveen, and W. Hill (2000), Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents (ACM 2000). • [5] D. Rafiei and A. O. Mendelzon (2000), What is this Page Known for? Computing Web Page Reputations, Proceedings of Ninth International WWW Conference.
References (contd…) • [6] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Link Analysis, Eigenvectors and Stability, IJCAI-01. • [7] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Stable Algorithms for Link Analysis, Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), 2001. • [8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001), Average-clicks: A New Measure of Distance on the WWW, WI-2001, 2001.
References (contd…) • [9] L. A. Adamic and E. Adar (2000), Friends and Neighbors on the Web, Xerox Palo Alto Research Center, Palo Alto, CA 94304. • [10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas (2000), Finding Authorities and Hubs From Link Structures on the World Wide Web, WWW10 Proceedings.
References (contd…) • [11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin (2000), The Shape of the Web and Its Implications for Searching the Web, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Proceedings at http://www.ssgrr.it/en/ssgrr2000/proceedings.htm, Rome, Italy, Jul.-Aug. 2000. • [12] M. Henzinger (2000), Link Analysis in Web Information Retrieval, ICDE Bulletin, Sept 2000, Vol. 23, No. 3.
PageRank Approach • PageRank of a page p: R(p) = d/n + (1-d) · Σ R(q)/outdegree(q), summed over all pages q that link to p in graph G. • d is the damping factor (the probability that a page is chosen uniformly at random from all pages). • n is the number of nodes in graph G. • outdegree(q) is the number of edges leaving page q.
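The quantities defined on this slide can be tied together in a short power-iteration sketch. It assumes the common formulation R(p) = d/n + (1-d) · Σ R(q)/outdegree(q) over pages q linking to p, with d used as the random-jump probability as defined here. The default d = 0.15, the uniform redistribution of rank from pages without out-links, and the toy graph are assumptions made for illustration.

```python
def pagerank(links, d=0.15, iterations=50):
    """links: dict page -> list of pages it links to.
    d: probability of jumping to a page chosen uniformly at random."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: d / n + (1 - d) * dangling / n for p in pages}
        for q in pages:
            targets = links.get(q, [])
            for p in targets:
                # q shares its rank equally among its outdegree(q) links.
                new_rank[p] += (1 - d) * rank[q] / len(targets)
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
page_ranks = pagerank(toy_graph)
```

In the toy graph, page "c" outranks page "b" because it collects rank from both "a" and "b", which is the "sum of the ranks of its backlinks" idea in miniature.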
HITS Approach • Let z denote the vector (1,1,1,…,1). • Initially set x ← z; y ← z. • For i = 1,2,3,…: • Apply the I operation. • Apply the O operation. • Normalize x and y. • The sequence of (x, y) pairs produced converges to a limit (x*, y*). • Return (x*, y*) as the authority and hub weights.
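The iteration on this slide can be sketched as follows. The slide does not spell out the two operations, so this sketch assumes the usual definitions: the I operation sets each authority weight x_p to the sum of the hub weights y_q of pages q linking to p, and the O operation sets each hub weight y_p to the sum of the authority weights x_q of pages q that p links to. Normalization to unit Euclidean length, the fixed iteration count, and the toy graph are also assumptions.

```python
import math

def normalize(weights):
    """Scale weights to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {p: w / norm for p, w in weights.items()}

def hits(links, iterations=50):
    """links: dict page -> list of pages it links to.
    Returns (x, y): authority weights and hub weights."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    x = {p: 1.0 for p in pages}  # authority weights, initialised to z
    y = {p: 1.0 for p in pages}  # hub weights, initialised to z
    for _ in range(iterations):
        # I operation: x_p = sum of y_q over pages q that link to p.
        new_x = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_x[p] += y[q]
        # O operation: y_p = sum of x_q over pages q that p links to.
        new_y = {p: sum(new_x[q] for q in links.get(p, [])) for p in pages}
        x, y = normalize(new_x), normalize(new_y)
    return x, y

toy_graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
authority_weights, hub_weights = hits(toy_graph)
```

In the toy graph, "a1" becomes the strongest authority because all three hubs point to it, and "h1" and "h2" become stronger hubs than "h3" because they point to both authorities.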
Friends and Neighbors • Predicting Friendship • Items that are unique to a few users are weighted more than commonly occurring items. • 2 people mention an item: Weight = 1/log(2) = 1.4 • 5 people mention an item: Weight = 1/log(5) = 0.62
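The weighting on this slide can be reproduced with a few lines. The numbers 1.4 and 0.62 match the natural logarithm, so that is what the sketch uses. The similarity function (summing the weights of items two users share) and the sample users are hypothetical illustrations of how such weights might be combined.

```python
import math
from collections import Counter

def item_weight(mention_count):
    """Weight of an item mentioned by `mention_count` people (natural log)."""
    return 1.0 / math.log(mention_count)

def similarity(items_a, items_b, mention_counts):
    """Sum of weights of the items two users have in common.
    Any shared item is mentioned by at least 2 people, so log() is positive."""
    shared = set(items_a) & set(items_b)
    return sum(item_weight(mention_counts[item]) for item in shared)

# Hypothetical users and the items they mention.
users = {
    "ann": {"chess", "music"},
    "bob": {"chess", "music"},
    "cat": {"music"},
    "dan": {"music"},
    "eve": {"music"},
}
mention_counts = Counter(item for items in users.values() for item in items)
ann_bob = similarity(users["ann"], users["bob"], mention_counts)
ann_cat = similarity(users["ann"], users["cat"], mention_counts)
```

Sharing the rare item "chess" (2 mentions) counts for more than sharing the common item "music" (5 mentions), so ann and bob score higher than ann and cat.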