CS246 Link-Based Ranking
Problems of TFIDF Vector • Works well on a small controlled corpus, but not on the Web • Top result for the “American Airlines” query: accident reports of American Airlines flights • Do users really care how many times “American Airlines” is mentioned? • Easy to spam • Ranking is based purely on page content • Authors can manipulate page content to get a high ranking • Any ideas?
Link-based Ranking • People “expect” to get AA home page for the query “American Airlines” • Many pages point to AA home page, but not to accident report • Use link-count!
Simple Link Count • Still easy to spam • Create many pages and add links to a page • How to avoid spam?
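The weakness of simple link counting can be sketched in a few lines. The page names and the `links` dictionary format below are hypothetical, chosen only for illustration: each key is a page and its value is the list of pages it links to.

```python
from collections import Counter

def link_count(links):
    """Count in-links per page; `links` maps a page to the pages it links to."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # a linking page counts once per target
            counts[target] += 1
    return counts

# Hypothetical toy web: two pages point to the AA home page.
web = {
    "aa_home": [],
    "travel_blog": ["aa_home"],
    "news_site": ["aa_home", "accident_report"],
    "accident_report": [],
    "spam_target": [],
}
honest = link_count(web)

# A spammer creates 50 pages that all link to a single target page.
spammed = dict(web)
for i in range(50):
    spammed[f"spam_{i}"] = ["spam_target"]
after_spam = link_count(spammed)
```

In this toy web the spam target overtakes the AA home page at almost no cost to the spammer, because every link counts the same regardless of who created it.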
PageRank • A page is important if it is pointed to by many important pages • PR(p) = PR(p1)/n1 + … + PR(pk)/nk • pi: page pointing to p, ni: number of links in pi • PageRank of p is the sum of the shares of PageRank it receives from its parents • One equation for every page • N equations, N unknown variables
Example: Web of 1842 • Netscape, Microsoft and Amazon • PR(n) = PR(n)/2 + PR(a)/2 • PR(m) = PR(a)/2 • PR(a) = PR(n)/2 + PR(m) • [Figure: link graph of Ne, MS, Am]
PageRank: Matrix Notation • Web graph matrix M = { mij } • Each page i corresponds to row i and column i of the matrix M • mij = 1/n if page i is one of the n children of page j • mij = 0 otherwise • PageRank vector: r = [PR(1), …, PR(N)] • PageRank equation: r = M r
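As a rough illustration of the matrix notation, the sketch below builds M for the three-page example. The link structure (Ne links to Ne and Am, MS links to Am, Am links to Ne and MS) is read off the example equations; the function name `build_matrix` and the page order [Ne, MS, Am] are assumptions made here.

```python
def build_matrix(pages, links):
    """Return M where M[i][j] = 1/n if page i is one of the n children of page j."""
    size = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    M = [[0.0] * size for _ in range(size)]
    for parent, children in links.items():
        j = index[parent]
        for child in children:
            M[index[child]][j] = 1.0 / len(children)
    return M

pages = ["Ne", "MS", "Am"]
links = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}
M = build_matrix(pages, links)

# Every page here has at least one child, so each column sums to 1.
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```

Row i of M lists what page i receives, and column j lists how page j splits its importance, which is why the columns rather than the rows sum to 1.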
PageRank: Iterative Computation • Initially every page has a unit of importance • At each round, each page shares its importance among its children and receives new importance from its parents • Eventually the importance of each page reaches a limit • Stochastic matrix: each column of M sums to 1, so the total importance is conserved
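The iterative computation can be sketched as repeated matrix-vector multiplication. This is a minimal illustration on the three-page example with pages ordered [Ne, MS, Am]; the helper name `iterate` and the choice of 100 rounds are assumptions, not part of the slides.

```python
# Column j holds the shares page j sends to its children (order: Ne, MS, Am).
M = [
    [0.5, 0.0, 0.5],  # Ne receives half of Ne and half of Am
    [0.0, 0.0, 0.5],  # MS receives half of Am
    [0.5, 1.0, 0.0],  # Am receives half of Ne and all of MS
]

def iterate(M, rank, rounds):
    """Apply rank <- M * rank for the given number of rounds."""
    size = len(rank)
    for _ in range(rounds):
        rank = [sum(M[i][j] * rank[j] for j in range(size)) for i in range(size)]
    return rank

# Every page starts with one unit of importance.
rank = iterate(M, [1.0, 1.0, 1.0], 100)
```

Solving the three example equations by hand with a total importance of 3 gives PR(n) = PR(a) = 6/5 and PR(m) = 3/5, and the iteration settles on the same values.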
Example: Web of 1842 • [Figure: link graph of Ne, MS, Am]
PageRank: Eigenvector • PageRank equation: r = M r, where r is the PageRank vector • r is the principal eigenvector of M
PageRank: Random Surfer Model • The PageRank of a page is the probability that a Web surfer reaches the page after many clicks, following random links
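A small simulation can make the random surfer model concrete. The sketch below walks the three-page example graph (link structure read off the example equations) and records how often each page is visited; the function name, the fixed seed, and the number of clicks are arbitrary choices made here.

```python
import random

links = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}

def random_surfer(links, clicks, seed=0):
    """Follow random links and return the fraction of clicks landing on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = rng.choice(sorted(links))      # arbitrary starting page
    for _ in range(clicks):
        page = rng.choice(links[page])    # click a random link on the current page
        visits[page] += 1
    return {page: count / clicks for page, count in visits.items()}

freq = random_surfer(links, 200_000)
```

The visit frequencies approach PageRank normalized to sum to 1, which for this graph is roughly 0.4 for Ne, 0.2 for MS, and 0.4 for Am.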
Problems on the Real Web • Dead end • A page with no links to send importance • All importance “leaks out of” the Web • Crawler trap • A group of one or more pages that have no links out of the group • Accumulates all the importance of the Web
Example: Dead End • No link from Microsoft, so Microsoft is a dead end • [Figure: link graph of Ne, MS, Am]
Example: Dead End • [Figure: link graph of Ne, MS, Am]
Solution to Dead End • Assume a surfer jumps to a random page at a dead end • [Figure: link graph of Ne, MS, Am]
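The leak and its repair can be sketched as follows. The sketch assumes the dead-end example keeps the other links of the three-page web (Ne links to Ne and Am, Am links to Ne and MS) and only removes Microsoft's outgoing link; the matrix layout and helper names are illustrative.

```python
def iterate(M, rank, rounds):
    """Apply rank <- M * rank for the given number of rounds."""
    size = len(rank)
    for _ in range(rounds):
        rank = [sum(M[i][j] * rank[j] for j in range(size)) for i in range(size)]
    return rank

# Order [Ne, MS, Am]; the MS column is all zeros because MS has no links.
dead_end = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]
leaked = iterate(dead_end, [1.0, 1.0, 1.0], 100)

# Repair: a surfer at a dead end jumps to a random page, so every
# all-zero column is replaced by a uniform column of 1/N.
size = len(dead_end)
fixed = [row[:] for row in dead_end]
for j in range(size):
    if all(fixed[i][j] == 0.0 for i in range(size)):
        for i in range(size):
            fixed[i][j] = 1.0 / size
repaired = iterate(fixed, [1.0, 1.0, 1.0], 100)
```

Without the repair the total importance shrinks toward zero; with it the columns sum to 1 again and the total stays at 3.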
Example: Crawler Trap • Only a self-link at Microsoft, so Microsoft is a crawler trap • [Figure: link graph of Ne, MS, Am]
Example: Crawler Trap • [Figure: link graph of Ne, MS, Am]
Crawler Trap: Damping Factor • “Tax” each page some fraction of its importance and distribute it equally among all pages • The tax rate is the probability of jumping to a random page • Example: assuming a 20% tax
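A minimal sketch of the taxed iteration on the crawler-trap example, assuming Microsoft links only to itself while the other links stay as in the three-page web. The function name `damped_pagerank` and the round count are choices made here, not taken from the slides.

```python
def damped_pagerank(M, tax=0.2, rounds=200):
    """Iterate PageRank, taxing each page and sharing the tax equally."""
    size = len(M)
    rank = [1.0] * size
    for _ in range(rounds):
        pool = tax * sum(rank) / size          # equal share of the collected tax
        rank = [
            (1.0 - tax) * sum(M[i][j] * rank[j] for j in range(size)) + pool
            for i in range(size)
        ]
    return rank

# Order [Ne, MS, Am]; MS keeps all of its importance through its self-link.
trap = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]
untaxed = damped_pagerank(trap, tax=0.0)
taxed = damped_pagerank(trap, tax=0.2)
```

With no tax Microsoft ends up with all three units of importance. With a 20% tax the limit is 7/11 for Ne, 21/11 for MS, and 5/11 for Am: the trap still ranks highest but no longer absorbs everything.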
Link Spam Problem • Q: What if a spammer creates a lot of pages that all link to a single spam page? • PageRank is better than simple link count, but still vulnerable to link spam • Q: Any way to avoid link spam?
TrustRank [Gyongyi et al. 2004] • Good pages don’t point to spam pages • Trust a page only if it is linked to by pages you trust • Same as PageRank except for the random jump probability term: random jumps go only to trusted pages
TrustRank: Theory [Bianchini et al. 2005] • Consider a set of pages S • [Figure: the set S together with IN(S), OUT(S), and DP(S)]
What Does It Mean? • PS = 0 if BS = 0 and PIN = 0 • You cannot improve your TrustRank simply by creating more pages and linking them to each other • To get non-zero TrustRank, you need to be either trusted or get links from outside
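The sketch below treats TrustRank as PageRank whose random jumps land only on a trusted seed set, which is one common reading of the description above. The page names, the damping value 0.8, the handling of dead ends, and the function name `trust_rank` are all assumptions made for illustration.

```python
def trust_rank(links, trusted, damping=0.8, rounds=100):
    """PageRank-style iteration where random jumps go only to trusted pages."""
    pages = sorted(links)
    jump = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in pages}
    rank = dict(jump)                       # start with all trust on the seeds
    for _ in range(rounds):
        new = {p: (1.0 - damping) * jump[p] for p in pages}
        for p in pages:
            children = links[p]
            if children:
                share = damping * rank[p] / len(children)
                for child in children:
                    new[child] += share
            else:
                # Assumed dead-end rule: hand the rank back to the trusted pages.
                for q in pages:
                    new[q] += damping * rank[p] * jump[q]
        rank = new
    return rank

# Hypothetical web: two good pages, plus a spam farm with no links from outside.
isolated = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["target"],
    "spam2": ["target"],
    "target": ["spam1", "spam2"],
}
no_inlink = trust_rank(isolated, trusted={"good1"})

# Same web, but a good page now links to the spam target.
linked = dict(isolated, good2=["good1", "target"])
with_inlink = trust_rank(linked, trusted={"good1"})
```

In the first web the spam farm is neither trusted nor linked from outside, so all of its pages stay at exactly zero no matter how many pages the spammer adds. Once a good page links in, trust starts to flow to the farm.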
Is TrustRank the Ultimate Solution? • Not really… • Honeypot: A page with good content and hidden links to spam pages • Good users link to the honeypot due to its quality content • Blogs, forums, wikis, mailing lists • Easy to add spam links • Link exchange • Set of sites exchanging links to boost ranking • A never-ending rat race…
Anti-Spamming at Search Engines • Anchor text • Consider what others think about your page • Give higher weights to anchors from high PageRank pages • More difficult to spam • TrustRank • To gain importance, you need to convince many pages under others’ control or convince search engines • More difficult to spam • Consider inter-site links with higher weight
Hub and Authority • More detailed evaluation of importance • A page is useful if • It has good content or • It has links to useful pages (good bookmark) • Hub/Authority • Authority: pages with good content • Hub: pages pointing to good content pages
Hub/Authority: Definition • Recursive definition similar to PageRank • Authority pages are linked to by many hub pages • Hub pages link to many authority pages • H(p) = A(p1) + … + A(pk), where p1, …, pk are the pages that p points to • A(p) = H(p1) + … + H(pm), where p1, …, pm are the pages that point to p
Hub/Authority: Matrix Notation • Web graph matrix A = { aij } • Each page i corresponds to row i and column i of the matrix A • aij = 1 if page i points to page j • aij = 0 otherwise • A is not a stochastic matrix • A^T: similar to PageRank matrix M, without the stochastic restriction
Example: Web of 1842 • [n, m, a]: vector • [Figure: link graph of Ne, MS, Am]
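Hub/Authority: Iterative Computation • Hub/Authority vectors: h = [H(1), …, H(N)], a = [A(1), …, A(N)] • h = λ A a, where λ is a scaling factor that prevents divergence • a = μ A^T h, where μ is a scaling factor that prevents divergence • Compute h and a iteratively with scaling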
Hub/Authority: Eigenvector • Hub vector h: principal eigenvector of A A^T • Authority vector a: principal eigenvector of A^T A
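The iterative Hub/Authority computation can be sketched as below. The graph reuses the link structure of the PageRank example (Ne links to Ne and Am, MS links to Am, Am links to Ne and MS), which may differ from the graph drawn on the Hub/Authority example slide; scaling by the largest entry is one possible choice of scaling factor.

```python
def hits(pages, links, rounds=50):
    """Alternate a = A^T h and h = A a, rescaling so the largest score is 1."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        # Authority of a page: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for source in pages:
            for target in links[source]:
                auth[target] += hub[source]
        scale = max(auth.values())
        auth = {p: score / scale for p, score in auth.items()}
        # Hub score of a page: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        scale = max(hub.values())
        hub = {p: score / scale for p, score in hub.items()}
    return hub, auth

pages = ["Ne", "MS", "Am"]
links = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}
hub, auth = hits(pages, links)
```

Without the rescaling step the scores would grow without bound, since A is not stochastic. On this graph both score vectors converge to roughly 1.0 for Ne, 0.445 for MS, and 0.802 for Am.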
Example: Web of 1842 • [Figure: link graph of Ne, MS, Am]
Hub/Authority and Root Set • Apply the equations on a small neighborhood graph (base set) • Start with, say, 100 pages on “bicycling” (root set) • Add pages pointing to the 100 pages • Add pages that the 100 pages are pointing to • Identified pages are good “Hubs” and “Authorities” on “bicycling”
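Expanding a root set into a base set can be sketched as a single pass over the link graph. The page names, the one-page root set, and the function name `base_set` are hypothetical and stand in for the 100 pages on “bicycling”.

```python
def base_set(root, links):
    """Root pages, plus pages they point to, plus pages pointing to them."""
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base.update(targets)            # pages the root set points to
        if any(target in root for target in targets):
            base.add(page)                  # pages pointing into the root set
    return base

# Hypothetical toy web with one topical root page.
links = {
    "bike_shop": ["bike_review"],
    "bike_review": ["bike_shop", "helmet_guide"],
    "cycling_portal": ["bike_shop"],
    "helmet_guide": [],
    "cooking_blog": ["recipe_site"],
    "recipe_site": [],
}
base = base_set({"bike_shop"}, links)
```

Only direct neighbors of the root set are added, so `helmet_guide`, which is two links away, stays out of the base set along with the unrelated cooking pages.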
Hub/Authority and Web Community • Hub/Authority is often used to identify Web communities • Nice notion of “Hub” and “Authority” of the community • Often Hub and Authority are tightly linked to each other
Questions • Can we apply Hub/Authority to the entire Web like PageRank?
Hub/Authority on the Entire Web? • Hub/Authority works well on a topic-specific subset, but works poorly for the whole Web • Easy to spam • Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.) • The page becomes a good hub page • On the page, add a link to your home page, which then gains a high authority score
Questions • Can we apply PageRank to a small base set?
PageRank on a Small Subset • In general, PageRank works better on larger datasets • We may be able to compute “topic-specific” PageRank • Any other way to compute “topic-specific” PageRank?
Summary: Link-Based Ranking • PageRank • TrustRank variation • Hub/Authority