Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada
Overview • Why Do We Care? • Purpose of The Paper? • Solution by Clever Project • Pros / Cons of the Paper • Further Research
Why Do We Care? • Web Link Analysis is crucial for efficient Crawling and Ranking algorithms • Crawling: Google Sitemap Submission, Yahoo Directory • Ranking: Relevant Results
Purpose of The Paper? • To Overcome These Challenges: • Its Size & Growth • Its Content Types • Language Semantics • New Language • Staleness of Results • SPAM • And More…
Solution: Hyperlinks, Hyperlinks, Hyperlinks… • Can Think of the Web as a Directed Graph • Node = Web page (URL) • Edge = Hyperlink
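A minimal sketch of the directed-graph view described on this slide, assuming an adjacency-list representation. The page names (`a.example` and so on) and the helper `in_links` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy web: each key is a page (node), each value lists the
# pages it links to (outgoing edges).
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}


def in_links(graph):
    """Invert the graph: map each page to the pages that point to it."""
    incoming = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = in_links(links)
# Out-degree = number of hyperlinks on a page; in-degree = number of pages citing it.
out_degree = {page: len(targets) for page, targets in links.items()}
in_degree = {page: len(sources) for page, sources in incoming.items()}
```

Link-analysis methods such as HITS work on exactly these two views: who a page points to, and who points to it.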
Solution: HITS Algorithm • Hyperlink-Induced Topic Search (HITS) • A.k.a. Hubs and Authorities • Hubs – Highly-valued lists for a given query • Ex. Yahoo Directory, Open Directory Project and Bookmarking sites. • Authorities – Highly endorsed answers to the query • Ex. New York Times, Huffington Post, Twitter • It is possible for a webpage to be both Hub and Authority • Ex. Restaurant Review Blogs
Solution: HITS Algorithm Cont… • For each page p, we assign it two values hub(p) and auth(p) • Initial Value: For all p, hub(p) = 1, auth(p) = 1 (or any predetermined number) • Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it. • Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to. • Normalize and Repeat
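The update rules can be sketched in a few lines of Python. This is an illustrative sketch, not the Clever Project's implementation: it takes the hub score of p as the sum of the authority scores of the pages p links to, and it assumes normalization by the Euclidean (L2) norm and a fixed iteration count, neither of which the slides specify. The function names and the toy graph are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores so their squares sum to 1 (assumed L2 normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    # Initial value: hub(p) = auth(p) = 1 for every page.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum of hub scores of pages that point to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]

        # Hub update: sum of authority scores of pages that p points to.
        new_hub = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_hub[source] += new_auth[target]

        # Normalize and repeat.
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: three list-like pages linking to three answer pages.
toy_links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a1"],
}
hub, auth = hits(toy_links)
```

On this toy graph, `a1` is cited by all three list pages and ends up with the highest authority score, while `h2` links to the most authorities and ends up with the highest hub score.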
Solution: HITS Algorithm Cont… Calculation
Pros: • Accurately addresses concerns and challenges we currently deal with • Great introduction to search engine algorithm • Briefly covered many topics (Breadth)
Cons: • Some materials are out of date (1999) • Ex. Google vs. Clever Project • Lack of Depth • Ex. Normalization of Hub and Auth values
Further Research: HITS Algorithm – Extreme Cases • Large-in-small-out sites • High Auth(p) • No Problem • Small-in-large-out sites • High Hub(p) • Problem
Further Research: HITS + Relevance Scoring Method • Vector Space Model (VSM) • Documents and queries are represented by vectors • Term Frequency • Okapi Measurement • Term Frequency + Document Length • Cover Density Ranking (CDR) • Phrase Similarity (How close terms appear)
Further Research: HITS + Relevance Scoring Method • Use Cosine Relevance Test • [Figure: diagram labeled "Price" and "Car"]
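A rough sketch of a cosine relevance test in the vector space model, assuming raw term-frequency vectors and whitespace tokenization. The slides do not give the exact weighting, so the helpers `term_vector` and `cosine_similarity` and the sample texts are illustrative assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase terms with raw counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents.
query = term_vector("car price")
doc_relevant = term_vector("used car price guide car price list")
doc_unrelated = term_vector("restaurant review blog")

score_relevant = cosine_similarity(query, doc_relevant)
score_unrelated = cosine_similarity(query, doc_unrelated)
```

A score near 1 means the document uses the query terms in similar proportions; a score of 0 means no terms are shared. Such a score could be combined with hub and authority scores, which is the direction this slide points to.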
Further Research: HITS + Relevance Scoring Method • Three-Level Scoring Method (TLS) • Manual Evaluation of Relevance • Relevant Links = 2 points • Slightly Relevant Links = 1 point • Inactive Links + Error Links (404, 603) = 0 points • Irrelevant Links = 0 points • Order of query terms matters
Further Research: Co-citation Graph • Regular Link Graph: • Co-citation Graph:
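One way to derive a co-citation graph from a regular link graph, sketched under the usual definition that two pages are co-cited when some third page links to both. The function name `cocitation_counts`, the edge weighting by number of citing pages, and the toy data are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(graph):
    """Count, for each unordered pair of pages, how many pages link to both."""
    counts = Counter()
    for targets in graph.values():
        # Each pair of distinct targets on the same page is co-cited once.
        for u, v in combinations(sorted(set(targets)), 2):
            counts[(u, v)] += 1
    return counts


# Hypothetical regular link graph: page -> pages it links to.
link_graph = {
    "x": ["a", "b"],
    "y": ["a", "b", "c"],
    "z": ["c"],
}
cocited = cocitation_counts(link_graph)
```

Here `a` and `b` are co-cited by both `x` and `y`, so their undirected edge has weight 2, while `z` links to a single page and contributes no pairs.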
What’s Next? • Google’s New Search Index: Caffeine • Announced June 8th, 2010 • Up to 50% fresher results • Twice as fast • Real Time Search • Twitter / Facebook • Source: http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
References • Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar & Tomkins, Andrew (1999). "Hypersearching the Web". Scientific American, June 1999. • Li, Longzhuang; Shang, Yi & Zhang, Wei (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511514. • Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet Computing, 5(1), 45-50. • Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632. doi:10.1145/324133.324140. • von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.