
Link Analysis {week 09}

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Link Analysis {week 09}. from Search Engines: Information Retrieval in Practice , 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.


Presentation Transcript


  1. The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Link Analysis {week 09} from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

  2. Are you connected? • The Internet (1969) is a network that’s • Global • Decentralized • Redundant • Made up of many different types of machines • How many machines make up the Internet?

  3. Browsing the Web from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN 0-13-609182-2

  4. The World Wide Web • Sir Tim Berners-Lee

  5. Weaving the Web • The World Wide Web (or just Web) is: • Global • Decentralized • Redundant (sometimes) • Made up of Web pages and interactive Web services • How many Web pages are on the Web?

  6. Links • Links are useful to us humans for navigating Web sites and finding things • Links are also useful to search engines • Example: <a href="http://cnn.com"> Latest News </a> • Destination link (URL): http://cnn.com • Anchor text: Latest News
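To illustrate how a search engine might pull the destination URL and anchor text out of a link like the one on this slide, here is a minimal sketch using Python's standard `html.parser` module. The class name `AnchorExtractor` and the helper `extract_links` are hypothetical names chosen for this example, not part of the course material.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # URL of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor_text = "".join(self._text).strip()
            self.links.append((self._href, anchor_text))
            self._href = None


def extract_links(html):
    """Return a list of (destination URL, anchor text) pairs found in html."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links('<a href="http://cnn.com"> Latest News </a>')
print(links)
```

For the slide's example, the extracted pair is the destination `http://cnn.com` and the anchor text `Latest News`.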

  7. Anchor text • How does anchor text apply to ranking? • Anchor text describes the content of the destination page • Anchor text is short, descriptive, and often coincides with query text • Anchor text is typically written by a non-biased third party

  8. The Web as a graph (i) • We often represent Web pages as vertices and links as edges in a webgraph http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg

  9. The Web as a graph (ii) • An example: http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg

  10. Using webgraphs for ranking • Links may be interpreted as describing a destination Web page in terms of its: • Popularity • Importance • We focus on incoming links (inlinks) • And use this for ranking matching documents • Drawback is obtaining incoming link data • Authority • Incoming link count
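As a rough sketch of the incoming link count idea, the snippet below stores a webgraph as a dictionary of outgoing links and inverts it to count inlinks. The three-page graph (A, B, C) and the function name `inlink_counts` are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical webgraph: each page maps to the list of pages it links to.
webgraph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def inlink_counts(graph):
    """Count incoming links per page, ignoring duplicate links on a page."""
    counts = Counter({page: 0 for page in graph})
    for source, targets in graph.items():
        for target in set(targets):
            counts[target] += 1
    return dict(counts)


counts = inlink_counts(webgraph)
print(counts)
```

Note that a page only lists its own outgoing links, so counting inlinks requires visiting every page in the graph first. That is one way to see the drawback mentioned on the slide.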

  11. PageRank (i) • PageRank is a link analysis algorithm • PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!) • The original PageRank paper: • http://infolab.stanford.edu/~backrub/google.html

  12. PageRank (ii) • Browse the Web as a random surfer: • Choose a random number r between 0 and 1 • If r < λ then go to a random page • else follow a random link from the current page • Repeat! • The PageRank of page A (noted PR(A)) is the probability that this “random surfer” will be looking at that page
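The random surfer steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the three-page graph is hypothetical, λ = 0.15 and the step count are arbitrary choices, and a page with no outgoing links is handled by jumping to a random page.

```python
import random

# Hypothetical webgraph: each page maps to the list of pages it links to.
webgraph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def random_surfer(graph, lam=0.15, steps=100_000, seed=0):
    """Estimate PR(page) as the fraction of steps the surfer spends on it."""
    rng = random.Random(seed)
    pages = sorted(graph)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        r = rng.random()               # random number r between 0 and 1
        outlinks = graph[current]
        if r < lam or not outlinks:    # assumption: dead ends force a jump
            current = rng.choice(pages)
        else:
            current = rng.choice(outlinks)
    return {page: visits[page] / steps for page in pages}


estimates = random_surfer(webgraph)
print(estimates)
```

The returned fractions sum to 1 and approximate the probability that the surfer is looking at each page.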

  13. PageRank (iii) • Jumping to a random page avoids getting stuck in: • Pages that have no links • Pages that only have broken links • Pages that loop back to previously visited pages

  14. PageRank (iv) • PageRank of page C is the probability a random surfer is viewing page C • Based on inlinks • PR(C) = PR(A) / 2 + PR(B) / 1 • We assume PageRank is distributed evenly across all pages (so 0.33 for A, B, and C) • PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50

  15. PageRank (v) • More generally: $PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$ • $B_u$ is the set of pages that point to u • $L_v$ is the number of outgoing links from page v (not counting duplicate links)
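A single update for one page u adds up PR(v) / L_v over every page v in B_u. The sketch below applies that to page C, assuming a hypothetical graph in which A links to B and C, B links to C, and C links to A, with every page starting at 1/3. The function name `pagerank_update` is made up for this example.

```python
# Hypothetical webgraph consistent with PR(C) = PR(A) / 2 + PR(B) / 1.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def pagerank_update(u, graph, pr):
    """Sum PR(v) / L_v over every page v that links to u (the set B_u)."""
    total = 0.0
    for v, targets in graph.items():
        outlinks = set(targets)        # L_v does not count duplicate links
        if u in outlinks:
            total += pr[v] / len(outlinks)
    return total


pr = {page: 1 / 3 for page in graph}   # evenly distributed starting values
pr_c = pagerank_update("C", graph, pr)
print(round(pr_c, 2))
```

With exact thirds the result is 0.5, which matches the rounded calculation on the previous slide.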

  16. PageRank (vi) • We can account for the “random jumps” by incorporating constant λ into the equation: $PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$ • Typically, λ is low (e.g. λ = 0.15) (N is the number of pages)
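One common way to compute PageRank with the random-jump constant is to repeat the update PR(u) = λ/N + (1 − λ) · Σ PR(v)/L_v until the values settle. The sketch below is a minimal version under stated assumptions: the three-page graph is hypothetical, every link target must also be a key in the graph, the iteration count is arbitrary, and a page with no outgoing links spreads its rank evenly across all pages.

```python
# Hypothetical webgraph: each page maps to the list of pages it links to.
webgraph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def pagerank(graph, lam=0.15, iterations=50):
    """Iterate PR(u) = lam/N + (1 - lam) * sum of PR(v)/L_v for v in B_u."""
    pages = list(graph)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}     # start evenly distributed
    for _ in range(iterations):
        new_pr = {page: lam / n for page in pages}
        for v in pages:
            outlinks = set(graph[v])           # ignore duplicate links
            if outlinks:
                share = (1 - lam) * pr[v] / len(outlinks)
                for u in outlinks:
                    new_pr[u] += share
            else:
                # Assumption: a dead end spreads its rank over all pages.
                for u in pages:
                    new_pr[u] += (1 - lam) * pr[v] / n
        pr = new_pr
    return pr


ranks = pagerank(webgraph)
print({page: round(value, 3) for page, value in ranks.items()})
```

The values sum to 1 at every iteration, so they can still be read as probabilities for the random surfer.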

  17. Link quality (and avoiding spam) • A cycle tends to negate the effectiveness of the PageRank algorithm

  18. What next? • Read and study Chapter 4.5
