CS315 – Link Analysis

CS315 – Link Analysis • Three generations of Search Engines • Anchor text • Link analysis for ranking • Pagerank • HITS

1st Generation: Content Similarity • Content Similarity Ranking: The more rare words two documents share, the more similar they are • Documents are treated as “bags of words” (no effort to “understand” the contents) • Similarity is measured by vector angles • Query results are ranked by sorting the angles between query and documents [Figure: document vectors d1 and d2 in term space with axes t1, t2, t3, separated by angle θ]
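
The angle-based ranking on this slide can be sketched as follows. This is a minimal illustration, assuming raw word counts as vector components (the slide's emphasis on rare words would call for a weighting such as IDF, which is omitted here); the function names `cosine_angle` and `rank` are made up for this example.

```python
import math
from collections import Counter

def cosine_angle(text_a: str, text_b: str) -> float:
    """Angle in degrees between the bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words of a; Counter returns 0 for words missing in b.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 90.0  # assumption: an empty text is treated as unrelated
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.degrees(math.acos(cos))

def rank(query: str, docs: list[str]) -> list[str]:
    """Sort documents by increasing angle to the query (smallest angle first)."""
    return sorted(docs, key=lambda d: cosine_angle(query, d))
```

For instance, `rank("best affordable college", docs)` places documents sharing more query words ahead of documents sharing none, since the latter sit at 90 degrees to the query.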

But we also have “los links!” • Assumption 1: A hyperlink from one page to another denotes a vote of confidence in the second page (quality signal) • Assumption 2: The anchor text of the hyperlink describes the target page (textual context) [Figure: Page A points to Page B through a hyperlink carrying anchor text]

2nd Generation: Add Popularity • A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B • Popularity of a page = f(number of in-links) • Query Processing • First retrieve all relevant pages meeting the text query (e.g., best affordable college). • Order these by taking into account the link popularity of the site on which each page resides [Figure: example sites labeled with counts: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]

3rd Generation: Add Reputation • Each page starts with some basic “reputation” (e.g., = 1) and repeatedly distributes fair (equal) fractions to its linked pages (while receiving fractions from them) until some “equilibrium” is reached • The reputation (“PageRank”) of a page P = the sum of a fair fraction of the reputations of all pages Pj that point to P • Beautiful math behind it • PR = principal eigenvector of the web’s link matrix • PR is equivalent to the chance of randomly surfing to the page • Idea similar to academic co-citations

Roots of PageRank: Citation Analysis • Citation frequency • Deans compute it at tenure time • Citation indexing • Who is an author cited by? • Co-citation coupling frequency • Co-citations with a given author measure “impact” • Are you co-cited with influential publications? • Bibliographic coupling frequency • Articles that co-cite the same articles are related

PageRank (PR): Definition • W is a web page • Wi are the web pages that have a link to W • O(Wi) is the number of out-links from Wi • t is the teleportation probability • N is the size of the Web (that we have seen) [Figure: pages W1, W2, W3 and W]
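
The formula itself is not present in the text of this slide. Assuming the usual form that uses exactly the symbols defined here, and that agrees with the matrix construction later in the deck (scale by 1-t, then add t/N), the definition would read as below. The upper limit n (the number of pages linking to W) is a symbol introduced for this sketch.

```latex
PR(W) = \frac{t}{N} + (1 - t) \sum_{i=1}^{n} \frac{PR(W_i)}{O(W_i)}
```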

PageRank: Iterative Computation • t is normally set to 0.15, but for this example, for simplicity let’s set it to 0.5 • Set initial PR values to 1 • Solve the following equations iteratively:
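
The slide's equations and example graph are not present in the text, so the sketch below uses a hypothetical three-page graph (A links to B and C, B links to C, C links to A). It assumes the update PR(W) = t/N + (1-t) Σ PR(Wi)/O(Wi), with t = 0.5 and all initial values set to 1 as on the slide, and it assumes every page has at least one out-link.

```python
def iterate_pagerank(out_links, t=0.5, tol=1e-10, max_iter=1000):
    """Repeat the PageRank update until no value changes by more than tol."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 for p in pages}  # initial PR values set to 1
    # For each page, collect the pages that link to it.
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    for _ in range(max_iter):
        new = {
            p: t / n + (1 - t) * sum(pr[q] / len(out_links[q]) for q in in_links[p])
            for p in pages
        }
        delta = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:
            break
    return pr

# Hypothetical example graph, not the one from the slide.
example_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
example_pr = iterate_pagerank(example_graph)
```

Although the values start at 1 each, the t/N term pulls their total toward 1, so the result can be read as a probability distribution over pages.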

Excel Computation of PR

Pagerank – Matrix Multiplication Equivalent Def. • Imagine a browser doing a random walk on web pages: • Start at a random page P • At each step, walk with equal probability out of the current page along one of the links on that page • Continue doing this random walk for a long time • “In the steady state” each page has a long-term visit rate • Use this rate as the page’s score. [Figure: page P with three out-links, each labeled 1/3]

Not quite enough • The web is full of dead-ends. • Random walk can get stuck in dead-ends. • Makes no sense to talk about long-term visit rates.

Teleporting • At a dead end, jump to a random web page. • At any non-dead end, • With probability, say, 15%, jump to a random web page. • With remaining probability (85%), go out on a random link. • t=0.15 is the commonly used “teleporting” parameter.
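
The teleporting rules above can be tried out with a small simulation. This is only a sketch: the graph is hypothetical (D is a dead end), and the function name, step count, and seed are arbitrary choices for this example.

```python
import random

def simulate_surfer(out_links, t=0.15, steps=100_000, seed=0):
    """Estimate long-term visit rates of a random surfer with teleporting."""
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)  # start at a random page
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if not links or rng.random() < t:
            # Dead end, or the teleport case (probability t): jump anywhere.
            current = rng.choice(pages)
        else:
            # Otherwise (probability 1 - t): follow a random out-link.
            current = rng.choice(links)
    return {p: visits[p] / steps for p in pages}

# Hypothetical graph with one dead end (D has no out-links).
surfer_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
surfer_rates = simulate_surfer(surfer_graph)
```

Because of teleporting, even D, which no page links to, ends up with a small positive visit rate, and the walk never stays stuck there.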

Result of teleporting • Now cannot get stuck locally. • There exists a computable long-term rate at which any page is visited • This is not obvious, but it has been proven! • How do we compute this visit rate?

Markov chains: abstractions of random walks • A Markov chain consists of $n$ states and an $n \times n$ transition probability matrix $P$. • At each step, we are in exactly one of the states. • For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of $j$ being the next state, given we are currently in state $i$. • Clearly, for all $i$, $\sum_{j=1}^{n} P_{ij} = 1$.

Computing PR with Markov chains • Example (next two slides): Represent the teleporting random walk with teleporting parameter t=15% as a Markov chain, for this graph: [Figure: example graph on pages A, B, C, D]

Computing P with Matrix Multiplication • Start with the adjacency matrix A of the Web graph • If there is a hyperlink from i to j, Aij = 1, else Aij = 0 • If a row has all 0’s, replace each element by 1/N; else divide each 1 by the number of 1’s in the row • Multiply the matrix by 1-t • Add t/N to every entry of the resulting matrix [Figure: resulting matrix P for the example graph on A, B, C, D]
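
The construction steps on this slide can be sketched in plain Python. The adjacency matrix below is hypothetical (the slide's graph on A, B, C, D is not present in the text); here A links to B and C, B links to C, C links to A, and D is a dead end.

```python
def transition_matrix(adj, t=0.15):
    """Build the teleporting transition matrix P from a 0/1 adjacency matrix."""
    n = len(adj)
    P = []
    for row in adj:
        ones = sum(row)
        if ones == 0:
            # Dead end: replace each element by 1/N.
            base = [1.0 / n] * n
        else:
            # Otherwise divide each 1 by the number of 1's in the row.
            base = [x / ones for x in row]
        # Multiply by 1 - t, then add t/N to every entry.
        P.append([(1 - t) * b + t / n for b in base])
    return P

# Hypothetical adjacency matrix; rows and columns are ordered A, B, C, D.
example_adj = [
    [0, 1, 1, 0],  # A -> B, C
    [0, 0, 1, 0],  # B -> C
    [1, 0, 0, 0],  # C -> A
    [0, 0, 0, 0],  # D is a dead end
]
example_P = transition_matrix(example_adj)
```

Every row of the result sums to 1, as a Markov transition matrix requires, and the dead-end row becomes uniform.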

Computing all Pageranks • Theorem: Regardless of where we start, we eventually reach the steady state a. • Start with any distribution (say x = (1 0 … 0)). • After one step, we’re at xP; • after two steps at xP^2, • then xP^3 and so on. • “Eventually” means for “large” k, xP^k = a. • Algorithm: • multiply x by increasing powers of P until the product looks stable. [Figure: matrix P for the example graph on A, B, C, D]
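
The algorithm on this slide amounts to repeated vector-matrix multiplication. The sketch below assumes "looks stable" means that no entry changes by more than a small tolerance; the two-state matrix is a hypothetical example, not the slide's P.

```python
def steady_state(P, tol=1e-12, max_iter=10_000):
    """Multiply x by P repeatedly until the distribution stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)  # start with x = (1 0 ... 0)
    for _ in range(max_iter):
        # One step of the walk: the row vector x times the matrix P.
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical 2-state chain whose steady state is (1/3, 2/3).
toy_P = [[0.5, 0.5], [0.25, 0.75]]
toy_a = steady_state(toy_P)
```

Computing xP, then (xP)P, and so on gives the same vectors as xP^k without ever forming the matrix powers, which is why the loop keeps only the current vector.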

Pagerank summary • Preprocessing: • Given graph of links, build matrix P. • From it compute a. • The entry ai is a number between 0 and 1 = the pagerank of page i. • Query processing: • Retrieve pages meeting query. • Rank them by their pagerank. • Order is query-independent • If PR(A) > PR(B), then A ranks above B in every query that retrieves both
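
The query-processing half of this summary can be sketched as below. The page texts, URLs, and scores are made up for illustration, and "meeting the query" is assumed to mean that the page contains every query word.

```python
def search(query, pages, pagerank):
    """Retrieve pages containing all query words, then order them by PageRank."""
    terms = query.lower().split()
    hits = [
        url for url, text in pages.items()
        if all(term in text.lower().split() for term in terms)
    ]
    # The ordering uses only the precomputed, query-independent scores.
    return sorted(hits, key=lambda url: pagerank[url], reverse=True)

# Hypothetical pages and precomputed PageRank values.
toy_pages = {
    "a.example": "affordable college rankings",
    "b.example": "best affordable college guide",
    "c.example": "cooking recipes",
}
toy_scores = {"a.example": 0.5, "b.example": 0.3, "c.example": 0.2}
```

With this data, `search("affordable college", toy_pages, toy_scores)` returns a.example ahead of b.example, and it would do so for any other query that both pages match.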

How is Pagerank used? http://www.google.com/corporate/tech.html • PageRank Technology: • “PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.” • This claim has recently changed: • “Today we use more than 200 signals, including PageRank, to order websites, and we update these algorithms on a weekly basis” • Pagerank is dead, long live Pagerank!
