
AMCS/CS 340: Data Mining




  1. Page Rank AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Outline • PageRank • Introduction • Matrix Formulation • Issues • Topic-Specific PageRank • Web Spam 2 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  3. PageRank • PageRank is a link analysis algorithm used by the Google web search engine • Google ranks pages based on many signals: the words on the page, the page title, the domain name, … and PageRank • PageRank is a measure of the importance of a web page • PageRank is a large part of what made Google dominant in web search • Why is it called PageRank? 3

  4. Story of Google’s PageRank • Because PageRank was developed by Larry Page (hence the name Page-Rank) • Sergey Brin and Larry Page (1998). “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. Proceedings of the 7th International Conference on World Wide Web (WWW). 4 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  5. Web search • Input: a query string, e.g. “books” • A word search engine returns an unordered list of 845,740,000 matches • PageRank™ then selects and orders the list of web pages matching the query 5

  6. Challenges of Search • Web is big. How big? • “In the October 2010 survey we received responses from 232,839,963 sites.” --- Netcraft (a large number of stale blogs at wordpress.com and 163.com had expired from the survey) • Web pages per website: 273 (2005) • Estimate: 63.6 billion pages 6 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  7. Challenges of Search • Web is big, how big? • Much duplication (30-40%) • Best estimate of “unique” static HTML pages comes from search engine claims • Google = 25 billion (?), Yahoo = 20 billion (?) • Web pages are not equally “important” • e.g., www.joe-schmoe.com vs. www.stanford.edu • Inlinks as votes • www.stanford.edu has 23,400 inlinks • www.joe-schmoe.com has 1 inlink • Are all inlinks equal? • Recursive question! From Stanford CS345 Data Mining by Anand Rajaraman, Jeffrey D. Ullman 7

  8. The Structure of the Web • What is the structure of the Web? • How is it organized? • As a graph: a directed graph 8

  9. Good or Bad website? • Inlinks are “good” (recommendations) • Inlinks from a “good” site are better than inlinks from a “bad” site • but inlinks from sites with many outlinks are not as “good”... • “Good” and “bad” are relative. (figure: example link structure among web sites xxx, yyyy, pdq, and a–g) 9

  10. Ranking nodes in the Graph • Since there is big diversity in the connectivity of the web graph, we can rank web pages by the link structure • Links as votes • Each link’s vote is proportional to the importance of its source page • If page P with importance x has n out-links, each link gets x/n votes • Page P’s own importance is the sum of the votes on its in-links (figure: example with www.joe-schmoe.com and www.stanford.edu) 10 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  11. Outline • PageRank • Introduction • Matrix Formulation • Issues • Topic-Specific PageRank • Web Spam 11 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  12. Simple “flow” model The web in 1839 12 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  13. Solving the flow equations • 3 equations, 3 unknowns, no constants • No unique solution • All solutions are equivalent up to a scale factor • An additional constraint forces uniqueness: y + a + m = 1 • y = 2/5, a = 2/5, m = 1/5 • Gaussian elimination works for small examples, but we need a better method for large graphs 13 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
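For reference, the three flow equations being solved here can be written out explicitly. This is a reconstruction of mine, assuming the three-page Yahoo (y) / Amazon (a) / Microsoft (m) graph that reappears in the worked example on slide 35 (y links to y and a, a links to y and m, m links to a):

```latex
% Each page splits its importance evenly over its out-links.
\begin{aligned}
  y &= \tfrac{y}{2} + \tfrac{a}{2} \\
  a &= \tfrac{y}{2} + m \\
  m &= \tfrac{a}{2}
\end{aligned}
\qquad \text{with } y + a + m = 1
\;\Longrightarrow\; y = \tfrac{2}{5},\; a = \tfrac{2}{5},\; m = \tfrac{1}{5}.
```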

  14. Transforming the Problem into Matrix Form • The web in 1839 14 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  15. Transforming the Problem into Matrix Form • The web in 1839 ? 15 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  16. Matrix Formulation • Matrix M has one row and one column for each web page • Suppose page j has n out-links • If j → i, then Mij = 1/n • Else Mij = 0 • M is a column-stochastic matrix • Columns sum to 1 • Hyperlink matrix of the example (rows/columns y, a, m): M = [ 1/2 1/2 0 ; 1/2 0 1 ; 0 1/2 0 ] 16
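To make the construction concrete, here is a small sketch of mine (not code from the lecture) that builds the column-stochastic hyperlink matrix from an out-link list, using the y / a / m example:

```python
import numpy as np

def hyperlink_matrix(out_links, pages):
    """Build the column-stochastic hyperlink matrix M.

    out_links maps each page j to the list of pages it links to.
    M[i, j] = 1/n if page j has n out-links and one of them points to i,
    and 0 otherwise, so every non-dangling column sums to 1.
    """
    idx = {p: k for k, p in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for j, targets in out_links.items():
        for i in targets:                # dangling pages keep an all-zero column
            M[idx[i], idx[j]] = 1.0 / len(targets)
    return M

# Three-page example from the lecture: y -> {y, a}, a -> {y, m}, m -> {a}
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = hyperlink_matrix(out_links, pages)
print(M)   # [[0.5 0.5 0. ], [0.5 0.  1. ], [0.  0.5 0. ]] -- columns sum to 1
```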

  17. Problem Formulation (the Yahoo / Amazon / M’soft graph) • Suppose r is a vector with one entry per web page • ri is the importance score of page i • call it the rank vector • |r| = 1 • Flow equation: the importance of page i is the sum of the votes received on i’s in-links, where page j passes a vote of rj / dj to each of its dj out-links: ri = Σ(j → i) rj / dj 17

  18. Eigenvector formulation • The flow equations can be written r = Mr • The rank vector r is an eigenvector of the stochastic hyperlink matrix M • In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1 18 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  19. PageRank • How do we rank the web pages by finding the principal eigenvector? • Power iteration • Inverse iteration • QR algorithm 19 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  20. Power Iteration • Initialize r(0) = (1/n, …, 1/n) • Repeat r(k+1) = M r(k) • Stop when the change |r(k+1) − r(k)| falls below a tolerance (figure: the example hyperlink matrix M) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
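A minimal power-iteration sketch in Python/NumPy (my own illustration, not lecture code); it reproduces the worked y / a / m example shown on slide 35:

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Return the principal eigenvector of a column-stochastic matrix M.

    Starts from the uniform vector r0 = (1/n, ..., 1/n) and repeatedly
    applies r <- M r until the change falls below tol.
    """
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hyperlink matrix of the y / a / m example (rows and columns: y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))   # -> approximately [0.4, 0.4, 0.2], i.e. 2/5, 2/5, 1/5
```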

  21. Outline • PageRank • Introduction • Matrix Formulation • Issues • Topic-Specific PageRank • Web Spam 21 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  22. Problems • 3 questions: Does the sequence rk always converge? Is the vector rk independent of r0? Is the vector rk a good ranking? • For plain power iteration on M, the answer is: No! • 3 issues: Dangling nodes and spider traps • The second eigenvalue (convergence) • Looping nodes (reducibility) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  23. Problems (revisited) • The same three questions and three issues; the next slides take up the first issue: dangling nodes and spider traps. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  24. Dangling Nodes and Spider Traps • Dangling nodes: pages without out-links • Spider trap: Microsoft’s only out-link now points back to itself • Iterating r ← Mr from (y, a, m) = (0.5, 0.5, 0): (0.5, 0.25, 0.25) → (0.375, 0.25, 0.375) → (0.31, 0.19, 0.5) → (0.25, 0.15, 0.6) → (0.2, 0.13, 0.67) → … → (0, 0, 1) • The trap eventually absorbs all the importance Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
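This can be checked numerically; a small sketch of mine (the modified matrix below replaces Microsoft’s out-link m → a with a self-loop m → m, which is how I read the slide’s figure):

```python
import numpy as np

# y / a / m example where m links only to itself (the spider trap)
M_trap = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0]])

r = np.array([0.5, 0.5, 0.0])        # starting vector used on the slide
for _ in range(50):
    r = M_trap @ r                   # plain power iteration, no damping
print(r)   # -> approximately [0, 0, 1]: the trap absorbs all the importance
```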

  25. The Dangling-Nodes Matrix • Pages without out-links Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  26. Dangling nodes • Every entry Sij of S is the probability that a surfer on page j moves to page i • If the surfer reaches a dangling node, he/she jumps to a page chosen uniformly at random

  27. The Stochastic Matrix • S = M + D • S is stochastic: 0 ≤ Sij ≤ 1 and each column sums to 1, Σi Sij = 1 • A dominant (stationary) eigenvector always exists, and the largest eigenvalue is 1: Sr = r, λ1 = 1 [Perron–Frobenius] • ri is the probability that the surfer visits page i Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
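A minimal NumPy sketch of the dangling-node fix (my own code, not from the lecture): every all-zero column of M is replaced by the uniform distribution 1/n, which is exactly S = M + D:

```python
import numpy as np

def make_stochastic(M):
    """Return S = M + D: replace each all-zero (dangling) column of M
    with the uniform distribution 1/n, so every column sums to 1."""
    n = M.shape[0]
    dangling = (M.sum(axis=0) == 0)      # columns with no out-links
    D = np.zeros_like(M)
    D[:, dangling] = 1.0 / n
    return M + D

# Illustration: a variant of the y / a / m example where m has no out-links
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
S = make_stochastic(M)
print(S.sum(axis=0))   # -> [1. 1. 1.]: S is column stochastic
```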

  28. Problems (revisited) • The same three questions and three issues; next we look at convergence and the second eigenvalue. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  29. The second eigenvalue • The convergence rate of rk → r is determined by |λ2|, the magnitude of the second-largest eigenvalue Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  30. Problems (revisited) • The same three questions and three issues; the remaining one is independence of r0, which requires irreducibility. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  31. We are almost done • S must be primitive, i.e. |λ2| < 1 • But S is not necessarily primitive: here λ1 = 1 and |λ2| = 1 !!! Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  32. We are almost done • A matrix is reducible if it can be placed into block upper/lower-triangular form by simultaneous row/column permutations [Wolfram.com] • S must be irreducible, so that the stationary vector has all positive entries • The example S is reducible

  33. Google Matrix • G = αS + (1 − α) U / n • Damping factor 0 ≤ α ≤ 1: α ≈ 1 → slow convergence, α ≈ 0 → fast convergence • U: matrix of ones (built from the personalization vector), n: total number of pages • Trade-off: α = 0.85 • For the Google matrix it has been proven that |λ2| ≤ α • G is stochastic and all its entries are positive, which means it is irreducible and primitive: a dominant eigenvector always exists, and its coefficients are all greater than zero Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
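Putting the pieces together, here is a sketch of mine (function names and the example matrix are my own; α = 0.85 follows the slide) of the Google matrix and the damped power iteration:

```python
import numpy as np

def google_matrix(S, alpha=0.85):
    """G = alpha * S + (1 - alpha) * U / n, with U the all-ones matrix.

    G is column stochastic with strictly positive entries, hence
    irreducible and primitive, so a unique dominant eigenvector exists.
    """
    n = S.shape[0]
    return alpha * S + (1.0 - alpha) * np.ones((n, n)) / n

def pagerank(S, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on the Google matrix built from S."""
    G = google_matrix(S, alpha)
    n = S.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = G @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# The dangling-node-fixed matrix S from the previous sketch
S = np.array([[0.5, 0.5, 1/3],
              [0.5, 0.0, 1/3],
              [0.0, 0.5, 1/3]])
print(pagerank(S))   # entries sum to 1 and give the importance of y, a, m
```

In practice G is never formed explicitly: since every column of the teleport term equals (1 − α)/n and r sums to 1, the correction is just a constant added to every entry of αSr, which keeps the iteration sparse.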

  34. Outline • PageRank • Introduction • Matrix Formulation • Issues • Topic-Specific PageRank • Web Spam 34 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  35. Back to the Power Iteration Example (Yahoo, Amazon, M’soft) • Hyperlink matrix (rows and columns ordered y, a, m): M = [ 1/2 1/2 0 ; 1/2 0 1 ; 0 1/2 0 ] • Iterating r = Mr from r0 = (1/3, 1/3, 1/3): (1/3, 1/2, 1/6) → (5/12, 1/3, 1/4) → (3/8, 11/24, 1/6) → … → (2/5, 2/5, 1/5) 35 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  36. Random Walk Interpretation • Imagine a random web surfer • At any time t, surfer is on some page P • At time t+1, the surfer follows an outlink from P uniformly at random • Ends up on some page Q linked from P • Process repeats indefinitely • Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t • p(t) is a probability distribution on pages 36 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  37. The stationary distribution • Where is the surfer at time t+1? • Follows a link uniformly at random • p(t+1) = M*p(t) • Suppose the random walk reaches a state such that p(t+1) = M*p(t) = p(t) • Then p(t) is called a stationary distribution for the random walk • Our rank vector r satisfies r = Mr • So it is a stationary distribution for the random surfer 37 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
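The random-surfer picture can be checked directly by simulation; a rough Monte Carlo sketch of mine (visit frequencies approximate the stationary distribution):

```python
import random
import numpy as np

def simulate_surfer(M, steps=200_000, seed=0):
    """Simulate a random surfer on a column-stochastic matrix M and
    return the fraction of time spent on each page."""
    rng = random.Random(seed)
    n = M.shape[0]
    visits = np.zeros(n)
    page = rng.randrange(n)
    for _ in range(steps):
        visits[page] += 1
        # follow an out-link of the current page, chosen with probabilities M[:, page]
        page = rng.choices(range(n), weights=M[:, page].tolist())[0]
    return visits / steps

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(simulate_surfer(M))   # -> roughly [0.4, 0.4, 0.2], matching r = Mr
```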

  38. Existence and Uniqueness A central result from the theory of random walks (aka Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0. 38 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  39. Topic-Specific Page Rank • Instead of generic popularity, can we measure popularity within a topic? • E.g., computer science, health • Bias the random walk • Random walker prefers to pick a page from a set S of web pages • S contains only pages that are relevant to the topic • e.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org) • For each set S, we get a different rank vector rS 39 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  40. Matrix formulation • Let Aij = βMij + (1 − β)/|S| if i ∈ S, and Aij = βMij otherwise (β is the damping/teleport parameter) • A is stochastic • We have weighted all pages in the set S equally • Could also assign different weights to them 40 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  41. Example • Suppose S = {1}, β = 0.8 • Graph on four nodes (from the hyperlink matrix): 1 → {2, 3}, 2 → {1}, 3 → {4}, 4 → {3} • M = [ 0 1 0 0 ; 0.5 0 0 0 ; 0.5 0 0 1 ; 0 0 1 0 ] • A = [ 0.2 1 0.2 0.2 ; 0.4 0 0 0 ; 0.4 0 0 0.8 ; 0 0 0.8 0 ] • Iteration (node: values at iterations 0, 1, 2, …, stable): node 1: 1.0, 0.2, 0.52, …, 0.294; node 2: 0, 0.4, 0.08, …, 0.118; node 3: 0, 0.4, 0.08, …, 0.327; node 4: 0, 0, 0.32, …, 0.261 • Note that the initialization of the PageRank vector differs from the unbiased PageRank case: here all the initial mass is on the teleport set S 41 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
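A small sketch of mine (not lecture code) that reproduces this example; with S = {1} and β = 0.8 it converges to roughly (0.294, 0.118, 0.327, 0.261):

```python
import numpy as np

def topic_specific_pagerank(M, teleport_set, beta=0.8, tol=1e-10, max_iter=1000):
    """Biased PageRank: with probability beta follow a link, otherwise
    jump to a page of the teleport set (given as indices into M)."""
    n = M.shape[0]
    v = np.zeros(n)
    v[list(teleport_set)] = 1.0 / len(teleport_set)   # teleport distribution
    r = v.copy()                  # start with all mass on the teleport set
    for _ in range(max_iter):
        r_next = beta * (M @ r) + (1.0 - beta) * v
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Graph from the slide: 1 -> {2, 3}, 2 -> {1}, 3 -> {4}, 4 -> {3}
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(topic_specific_pagerank(M, teleport_set={0}))
# -> approximately [0.294, 0.118, 0.327, 0.261] (node 1 is index 0)
```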

  42. Outline • PageRank • Introduction • Matrix Formulation • Issues • Topic-Specific PageRank • Web Spam 42 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  43. Web Spam • Search has become the default gateway to the web • Very high premium to appear on the first page of search results • e.g., e-commerce sites • advertising-driven sites 43 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  44. What is web spam? • Spamming • any deliberate action to boost a web page’s position in search engine results, • incommensurate with page’s real value • Spam • web pages that are the result of spamming • Approximately 10-15% of web pages are spam 44 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  45. Boosting techniques • Term spamming • Manipulating the text of web pages in order to appear relevant to queries • Repeat one or a few specific terms e.g., free, cheap • Dump a large number of unrelated terms e.g., copy entire dictionaries • Link spamming • Creating link structures that boost page rank • Get as many links from accessible pages as possible to target page t • Construct “link farm” to get page rank multiplier effect 45 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  46. Detecting Spam • Term spamming • Analyze text using statistical methods e.g., Naïve Bayes classifiers • Similar to email spam filtering • Also useful: detecting approximate duplicate pages • Link spamming • Open research area • One approach: TrustRank 46 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  47. TrustRank idea • Basic principle: approximate isolation • It is rare for a “good” page to point to a “bad” (spam) page • Sample a set of “seed pages” from the web • Have an oracle (human) identify the good pages and the spam pages in the seed set • Expensive task, so must make seed set as small as possible 47 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
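One common way to read the trust-propagation step (this is my paraphrase, not the slide’s definition) is as biased PageRank whose teleport set is the oracle-approved seed pages, so the topic_specific_pagerank sketch above can be reused directly:

```python
# Reuses topic_specific_pagerank and the small 4-page matrix M defined in the
# earlier sketch, purely as a stand-in for a real web graph.
trusted_seeds = {0}                  # hypothetical page(s) the oracle marked "good"
trust = topic_specific_pagerank(M, teleport_set=trusted_seeds, beta=0.85)
print(trust)   # pages whose trust score is far below their PageRank are spam candidates
```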

  48. References • S. Brin and L. Page (1998). The anatomy of a large-scale hypertextual web search engine. • Eigenstructure of the Google matrix: Haveliwala & Kamvar (2003), Eldén (2003), Serra-Capizzano (2005) • Rebecca Wills. Google’s PageRank: The Math Behind the Search Engine. • Amy Langville & Carl Meyer. Google’s PageRank and Beyond. • David Austin. How Google Finds Your Needle in the Web’s Haystack. AMS. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  49. Wassily Leontief • In 1941, the Russian-American Harvard economist Wassily Leontief published a paper in which he divided a country's economy into sectors that both supply and receive resources from each other, although not in equal measure. • He developed an iterative method of valuing each sector based on the importance of the sectors that supply it. Sound familiar? • In 1973, Leontief was awarded the Nobel Prize in Economics for this work. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  50. What you should know • How does PageRank work? • What are the main issues with using power iteration to get the principal eigenvector? • How do we solve those issues? • How does topic-specific PageRank work? • What is web spam? 54 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
