CS315 – Link Analysis

Presentation Transcript


  1. CS315 – Link Analysis • Three generations of Search Engines • Anchor text • Link analysis for ranking • Pagerank • HITS

  2. 1st Generation: Content Similarity • Content Similarity Ranking: The more rare words two documents share, the more similar they are • Documents are treated as “bags of words” (no effort to “understand” the contents) • Similarity is measured by vector angles • Query results are ranked by sorting the angles between query and documents [Figure: document vectors d1 and d2 in a term space with axes t1, t2, t3, separated by angle θ]
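
A minimal sketch of ranking by vector angle, assuming whitespace tokenization and raw word counts. It does not weight rare words more heavily, which the slide's notion of similarity would require; the function names are illustrative and not from the slides.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: str, documents: list[str]) -> list[str]:
    """Sort documents by decreasing cosine, i.e. increasing angle to the query."""
    return sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)
```

A larger cosine means a smaller angle, so sorting by decreasing cosine is the same as sorting by increasing angle.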

  3. But we also have “los links!” • Assumption 1: A hyperlink from one page to a second page denotes a vote of confidence in the second page (quality signal) • Assumption 2: The anchor text of the hyperlink describes the target page (textual context) [Figure: Page A and Page B joined by a hyperlink with its anchor text]

  4. 2nd Generation: Add Popularity • A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B • Popularity of a page = f(number of in-links) • Query Processing • First retrieve all relevant pages meeting the text query (e.g., best affordable college). • Order these by taking into account the link popularity of the site on which each page resides [Figure: example sites with their popularity counts: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]
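
The two query-processing steps can be sketched as follows. This is an illustration under assumptions the slide does not state: f is taken to be the plain in-link count, links within the same site are not counted as votes, and both function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def site_popularity(hyperlinks):
    """Count votes per site from (source_url, target_url) pairs.

    Assumption: a link between two pages of the same site is not a vote.
    """
    votes = Counter()
    for source_url, target_url in hyperlinks:
        source_site = urlparse(source_url).netloc
        target_site = urlparse(target_url).netloc
        if source_site != target_site:
            votes[target_site] += 1
    return votes


def rank_by_popularity(relevant_pages, votes):
    """Order already-retrieved pages by the popularity of their site."""
    return sorted(relevant_pages,
                  key=lambda url: votes[urlparse(url).netloc],
                  reverse=True)
```

The URLs must include a scheme such as `http://`, otherwise `urlparse` leaves `netloc` empty.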

  5. 3rd Generation: Add Reputation • Each page starts with some basic “reputation” (e.g., = 1) and repeatedly distributes fair (equal) fractions to its linked pages (while receiving fractions from them) until some “equilibrium” is reached • The reputation (“PageRank”) of a page P = the sum of a fair fraction of the reputations of all pages Pj that point to P • Beautiful Math behind it • PR = principal eigenvector of the web’s link matrix • PR is equivalent to the chance of randomly surfing to the page • Idea similar to academic co-citations

  6. Roots of PageRank: Citation Analysis • Citation frequency • Deans compute it at tenure time • Citation indexing • Who is the author cited by? • Co-citation coupling frequency • Co-citations with a given author measure “impact” • Are you co-cited with influential publications? • Bibliographic coupling frequency • Articles that cite the same articles are related

  7. PageRank PR – Definition • W is a web page • Wi are the web pages that have a link to W • O(Wi) is the number of out-links from Wi • t is the teleportation probability • N is the size of the Web (that we have seen) [Figure: pages W1, W2, W3 and W]
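
The formula these symbols belong to did not survive in the transcript. A form consistent with the definitions above and with the matrix construction on slide 16 (scale by 1 − t, then add t/N) would be the following; it is a reconstruction under that assumption, not the slide's own notation:

```latex
PR(W) = \frac{t}{N} + (1 - t) \sum_{i} \frac{PR(W_i)}{O(W_i)}
```

Each page W_i that links to W passes on an equal share PR(W_i)/O(W_i) of its own score, and the t/N term accounts for teleporting to W from anywhere.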

  8. PageRank: Iterative Computation • t is normally set to 0.15, but for this example, for simplicity let’s set it to 0.5 • Set initial PR values to 1 • Solve the following equations iteratively:
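
A minimal sketch of the iteration, assuming the update rule PR(W) = t/N + (1 − t) · Σ PR(Wi)/O(Wi) over the pages Wi linking to W. The slide's own equations and graph are not in the transcript, so the three-page graph in the usage note is hypothetical, as is the function name.

```python
def pagerank_iterative(links, t=0.15, iterations=100):
    """Repeatedly apply the PageRank equations, starting from PR = 1 everywhere.

    `links` maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        updated = {}
        for w in pages:
            # Sum the fair shares received from every page that links to w.
            received = sum(pr[wi] / len(links[wi])
                           for wi in pages if w in links[wi])
            updated[w] = t / n + (1 - t) * received
        pr = updated  # all pages are updated from the previous round's values
    return pr
```

For example, `pagerank_iterative({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, t=0.5)` settles on values that sum to 1, with C ranked highest and B lowest.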

  9. Excel Computation of PR

  10. Pagerank – Matrix Multiplication Equivalent Def. • Imagine a browser doing a random walk on web pages: • Start at a random page P • At each step, walk with equal probability out of the current page along one of the links on that page • Continue doing this random walk for a long time • “In the steady state” each page has a long-term visit rate • Use this rate as the page’s score [Figure: page P with three out-links, each followed with probability 1/3]

  11. Not quite enough • The web is full of dead-ends. • Random walk can get stuck in dead-ends. • Then it makes no sense to talk about long-term visit rates.

  12. Teleporting • At a dead end, jump to a random web page. • At any non-dead end, • With probability, say, 15%, jump to a random web page. • With remaining probability (85%), go out on a random link. • t=0.15 is the commonly used “teleporting” parameter.
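
The teleporting rules above can be simulated directly. This is a sketch: the function name, step count, and fixed seed are illustrative choices, not part of the slides.

```python
import random


def simulate_surfer(links, t=0.15, steps=200_000, seed=0):
    """Estimate long-term visit rates by simulating a teleporting random walk.

    `links` maps each page to a list of the pages it links to
    (an empty list marks a dead end).
    """
    rng = random.Random(seed)
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        out_links = links[current]
        if not out_links or rng.random() < t:
            # Dead end, or the teleport case: jump to a random page.
            current = rng.choice(pages)
        else:
            # Otherwise follow one of the out-links with equal probability.
            current = rng.choice(out_links)
    return {page: count / steps for page, count in visits.items()}
```

The returned fractions approximate the long-term visit rates; more steps give a closer estimate.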

  13. Result of teleporting • Now the walk cannot get stuck locally. • There exists a computable long-term rate at which any page is visited • This is not obvious, but it has been proven! • How do we compute this visit rate?

  14. Markov chains: abstractions of random walks • A Markov chain consists of n states, and an $n \times n$ transition probability matrix P. • At each step, we are in exactly one of the states. • For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of j being the next state, given we are currently in state i. • Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$

  15. Computing PR with Markov chains • Example (next two slides): Represent the teleporting random walk with teleporting parameter t=15% as a Markov chain, for this graph: [Figure: graph with nodes A, B, C, D]

  16. Computing P with Matrix Multiplication • Start with the adjacency matrix A of the Web graph • If there is a hyperlink from i to j, Aij = 1, else Aij = 0 • If a row has all 0’s, replace each element by 1/N • Else divide each 1 by the number of 1’s in the row • Multiply the matrix by 1 − t • Add t/N to every entry of the resulting matrix [Figure: the resulting matrix P for the graph with nodes A, B, C, D]
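
The construction steps can be sketched in plain Python. The function name is illustrative, and the example adjacency matrix in the usage note is hypothetical, since the slide's graph is not in the transcript.

```python
def transition_matrix(adjacency, t=0.15):
    """Build the teleporting transition matrix P from a 0/1 adjacency matrix.

    `adjacency[i][j]` is 1 if page i links to page j, else 0.
    """
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        ones = sum(row)
        if ones == 0:
            # Dead end: jump to any page with equal probability.
            base = [1.0 / n] * n
        else:
            # Split the probability equally among the out-links.
            base = [entry / ones for entry in row]
        # Scale by 1 - t, then add the teleport share t/N to every entry.
        matrix.append([(1 - t) * value + t / n for value in base])
    return matrix
```

With `adjacency = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]`, every row of the result sums to 1, and the all-zero last row becomes a uniform row of 0.25.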

  17. Computing all Pageranks • Theorem: Regardless of where we start, we eventually reach the steady state a. • Start with any distribution (say x = (1 0 … 0)). • After one step, we’re at $xP$; • after two steps at $xP^2$, • then $xP^3$ and so on. • “Eventually” means for “large” k, $xP^k = a$. • Algorithm: • multiply x by increasing powers of P until the product looks stable. [Figure: matrix P and the graph with nodes A, B, C, D]
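
A minimal sketch of the algorithm, assuming P is given as a list of rows that each sum to 1. “Looks stable” is interpreted here as no entry changing by more than a small tolerance between steps; the tolerance and the function name are illustrative choices.

```python
def steady_state(P, tol=1e-10, max_steps=10_000):
    """Multiply a start distribution by P until it stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)  # start entirely in the first state
    for _ in range(max_steps):
        # One step of the walk: the row vector x times the matrix P.
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x
```

Multiplying the current vector by P once per step gives the same sequence xP, xP^2, xP^3 as raising P to increasing powers, without forming the powers themselves.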

  18. Pagerank summary • Preprocessing: • Given graph of links, build matrix P. • From it compute a. • The entry ai, a number between 0 and 1, is the pagerank of page i. • Query processing: • Retrieve pages meeting query. • Rank them by their pagerank. • Order is query-independent • If PR(A) > PR(B), A is ranked above B in every query that retrieves both

  19. How is Pagerank used? http://www.google.com/corporate/tech.html • PageRank Technology: • “PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.” • This claim has recently changed: • “Today we use more than 200 signals, including PageRank, to order websites, and we update these algorithms on a weekly basis” • Pagerank is dead, long live Pagerank!
