The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Link Analysis {week 09} from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
Are you connected? • The Internet (1969) is a network that’s • Global • Decentralized • Redundant • Made up of many different types of machines • How many machines make up the Internet?
Browsing the Web from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN 0-13-609182-2
The World Wide Web • Sir Tim Berners-Lee
Weaving the Web • The World Wide Web (or just Web) is: • Global • Decentralized • Redundant (sometimes) • Made up of Web pages and interactive Web services • How many Web pages are on the Web?
Links • Links are useful to us humans for navigating Web sites and finding things • Links are also useful to search engines • In <a href="http://cnn.com"> Latest News </a>, http://cnn.com is the destination link (URL) and "Latest News" is the anchor text
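A search engine has to pull both parts of a link out of the page source. The following is a minimal sketch using Python's standard `html.parser` module; the class name `AnchorExtractor` and the helper `extract_links` are illustrative names, not something from the slides or the textbook.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # URL of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return a list of (destination URL, anchor text) pairs found in html."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_links('<a href="http://cnn.com"> Latest News </a>'))
```

An indexer could then store each anchor text alongside the destination page rather than the page that contains the link.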
Anchor text • How does anchor text apply to ranking? • Anchor text describes the content of the destination page • Anchor text is short, descriptive, and often coincides with query text • Anchor text is typically written by a non-biased third party
The Web as a graph (i) • We often represent Web pages as vertices and links as edges in a webgraph http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg
The Web as a graph (ii) • An example: http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg
Using webgraphs for ranking • Links may be interpreted as describing a destination Web page in terms of its: • Popularity • Importance • We focus on incoming links (inlinks) • And use this for ranking matching documents • A drawback is the need to obtain incoming link data • Authority • Incoming link count
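As a rough illustration of ranking by incoming link count, the sketch below inverts a table of outgoing links to count inlinks and then orders the matching pages by that count. The four-page graph and the function names `inlink_counts` and `rank_by_inlinks` are hypothetical. Note that the inlink counts can only be computed after the outgoing links of every page are known, which is the drawback mentioned above.

```python
from collections import Counter


def inlink_counts(outlinks):
    """Count incoming links per page, given {page: [pages it links to]}."""
    counts = Counter({page: 0 for page in outlinks})
    for source, targets in outlinks.items():
        for target in set(targets):   # ignore duplicate links
            if target != source:      # ignore links from a page to itself
                counts[target] += 1
    return counts


def rank_by_inlinks(matching_pages, outlinks):
    """Order pages by inlink count, highest first (ties broken by name)."""
    counts = inlink_counts(outlinks)
    return sorted(matching_pages, key=lambda page: (-counts[page], page))


if __name__ == "__main__":
    # Hypothetical webgraph: each key links to the pages in its list.
    webgraph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(rank_by_inlinks(["A", "C", "D"], webgraph))
```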
PageRank (i) • PageRank is a link analysis algorithm • PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!) • The original PageRank paper: • http://infolab.stanford.edu/~backrub/google.html
PageRank (ii) • Browse the Web as a random surfer: • Choose a random number r between 0 and 1 • If r < λ then go to a random page • else follow a random link from the current page • Repeat! • The PageRank of page A (noted PR(A)) is the probability that this “random surfer” will be looking at that page
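The random surfer can be simulated directly. The sketch below follows the steps on this slide and estimates PR as the fraction of steps spent on each page. The function name `random_surfer`, the step count, the fixed seed, and the three-page example graph are all assumptions made for illustration. It also assumes every link target appears as a key in `outlinks`, and it jumps to a random page when the current page has no links.

```python
import random


def random_surfer(outlinks, lam=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps spent on each page."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    pages = sorted(outlinks)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        r = rng.random()               # random number r between 0 and 1
        links = outlinks[current]
        if r < lam or not links:
            current = rng.choice(pages)   # go to a random page
        else:
            current = rng.choice(links)   # follow a random link
    return {page: visits[page] / steps for page in pages}


if __name__ == "__main__":
    # Assumed example graph: A links to B and C, B links to C, C links to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, estimate in random_surfer(graph).items():
        print(page, round(estimate, 3))
```

Because this is a sampling estimate, the values vary slightly with the seed and the number of steps.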
PageRank (iii) • Jumping to a random page avoids getting stuck in: • Pages that have no links • Pages that only have broken links • Pages that loop back to previously visited pages
PageRank (iv) • PageRank of page C is the probability a random surfer is viewing page C • Based on inlinks • PR(C) = PR(A) / 2 + PR(B) / 1 (A has two outgoing links, B has one) • We assume PageRank is initially distributed evenly across all pages (so 0.33 each for A, B, and C) • PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50
PageRank (v) • More generally: $PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$ • $B_u$ is the set of pages that point to $u$ • $L_v$ is the number of outgoing links from page $v$ (not counting duplicate links)
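One application of this rule, where each page v divides its current PageRank evenly among the L_v distinct pages it links to, might look like the sketch below. The function name `pagerank_update` is illustrative. In the example graph, the link from C back to A is an assumption; only A's two outgoing links and B's single link to C follow from the PR(C) calculation on the previous slide.

```python
def pagerank_update(outlinks, pr):
    """Apply PR(u) = sum over v in B_u of PR(v) / L_v once, for every page u."""
    new_pr = {page: 0.0 for page in outlinks}
    for v, targets in outlinks.items():
        distinct = set(targets)            # L_v does not count duplicate links
        for u in distinct:
            new_pr[u] += pr[v] / len(distinct)
    return new_pr


if __name__ == "__main__":
    # Assumed graph: A links to B and C, B links to C, C links to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    start = {page: 1 / 3 for page in graph}   # PageRank spread evenly
    print(pagerank_update(graph, start))
```

With PageRank spread evenly at the start, the value computed for C matches the PR(C) = 0.50 worked out by hand.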
PageRank (vi) • We can account for the “random jumps” by incorporating constant λ into the equation: $PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$ • Typically, λ is low (e.g. λ = 0.15) (N is the number of pages)
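Repeating the update with the random-jump term until the values settle gives a simple iterative computation. The sketch below is one way to do it, not necessarily how any production search engine does. The function name `pagerank`, the fixed iteration count, and the example graph are assumptions. Pages with no outgoing links are treated as if they linked to every page, which is one common convention and is not stated on the slides.

```python
def pagerank(outlinks, lam=0.15, iterations=50):
    """Iterate PR(u) = lam / N + (1 - lam) * sum of PR(v) / L_v over v in B_u."""
    pages = list(outlinks)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}       # start with an even split
    for _ in range(iterations):
        new_pr = {page: lam / n for page in pages}   # random-jump share
        for v in pages:
            distinct = set(outlinks[v])          # ignore duplicate links
            if distinct:
                share = (1 - lam) * pr[v] / len(distinct)
                for u in distinct:
                    new_pr[u] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = (1 - lam) * pr[v] / n
                for u in pages:
                    new_pr[u] += share
        pr = new_pr
    return pr


if __name__ == "__main__":
    # Assumed graph: A links to B and C, B links to C, C links to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, value in pagerank(graph).items():
        print(page, round(value, 3))
```

A fixed iteration count keeps the sketch short; a fuller version would stop once the values change by less than some tolerance between rounds.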
Link quality (and avoiding spam) • A cycle tends to negate the effectiveness of the PageRank algorithm
What next? • Read and study Chapter 4.5