Combating Web Spam with TrustRank. Zoltán Gyöngyi - Hector Garcia-Molina - Jan Pedersen. Presented By: Mahek Jasani (USC). Web Spam. It refers to hyperlinked pages on WWW that are created with the intention of misleading search engines. How is it done?
Combating Web Spam with TrustRank
Zoltán Gyöngyi - Hector Garcia-Molina - Jan Pedersen
Presented By: Mahek Jasani (USC)
Web Spam
Web spam refers to hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines.
How is it done?
• Adding thousands of keywords to a page, often making the text invisible to humans
• Creating a large number of bogus web pages
Web Spam Detection
Web spam detection is important not only for search engines, but also for users and content providers.
Goal of this Paper
• The goal of this paper is to assist human experts who detect web spam.
• The methods presented in this paper can be used in two ways:
  • As helpers in an initial screening process, suggesting pages that an expert should examine more closely
  • As a counter-bias to be applied when search results are ranked
Preliminaries
• Web model: a graph G = (V, E)
• V corresponds to web pages
• E corresponds to the directed links that connect web pages
[Figure: example web graph with four pages, numbered 1 to 4]
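Web Model
• A page with no inlinks is unreferenced; a page with no outlinks is non-referencing
• Transition matrix T: $T(p,q) = 1/\omega(q)$ if $(q,p) \in E$, and 0 otherwise, where $\omega(q)$ is the out-degree of q
• Inverse transition matrix U: $U(p,q) = 1/\iota(q)$ if $(p,q) \in E$, and 0 otherwise, where $\iota(q)$ is the in-degree of q
[Figure: four-page example graph with the unreferenced and non-referencing pages labeled]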
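To make the two matrices concrete, here is a minimal sketch that builds T and U from an edge list. The four-page edge list is a hypothetical stand-in for the slide's figure (chosen so that page 1 is unreferenced and page 4 is non-referencing), and the function name `transition_matrices` is an illustrative choice, not something from the slides.

```python
import numpy as np

def transition_matrices(n, edges):
    """Build T and U for pages 1..n; each edge (q, p) means page q links to page p."""
    out_deg = np.zeros(n)
    in_deg = np.zeros(n)
    for q, p in edges:
        out_deg[q - 1] += 1
        in_deg[p - 1] += 1
    T = np.zeros((n, n))
    U = np.zeros((n, n))
    for q, p in edges:
        # T[p, q] = 1 / out-degree(q) when q links to p
        T[p - 1, q - 1] = 1.0 / out_deg[q - 1]
        # U[q, p] = 1 / in-degree(p) when q links to p
        U[q - 1, p - 1] = 1.0 / in_deg[p - 1]
    return T, U

# Hypothetical graph: 1 -> 2, 2 -> 3, 3 -> 2, 2 -> 4
EDGES = [(1, 2), (2, 3), (3, 2), (2, 4)]
T, U = transition_matrices(4, EDGES)
print(T)
print(U)
```

In this sketch the row of T for an unreferenced page is all zeros, and so is the column of T for a non-referencing page.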
PageRank
• The PageRank algorithm uses link information to assign global importance scores to all pages on the web
• Two main ideas underlie PageRank:
  • A web page is important if several other pages point to it
  • The importance of a page influences, and is influenced by, the importance of other pages
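PageRank
• Thus the PageRank score r(p) of page p is defined as: $r(p) = \alpha \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \frac{1}{N}$
• The equivalent matrix equation is: $\mathbf{r} = \alpha \, T \, \mathbf{r} + (1 - \alpha) \frac{1}{N} \mathbf{1}_N$
• $\alpha$ is the decay factor; $\omega(q)$ is the out-degree of q; N is the total number of pages
• The first term comes from the pages that point to p
• The second term is static and equal for all web pages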
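The matrix form lends itself to simple fixed-point iteration. The sketch below is one way to do that, under assumptions: the transition matrix is for a hypothetical four-page graph (1 -> 2, 2 -> 3, 3 -> 2, 2 -> 4), the decay factor 0.85 and the 20 iterations are illustrative defaults, and `pagerank` is an illustrative name.

```python
import numpy as np

def pagerank(T, alpha=0.85, iterations=20):
    """Iterate r = alpha * T r + (1 - alpha) * (1/N) * 1_N a fixed number of times."""
    n = T.shape[0]
    static = np.full(n, 1.0 / n)  # the static part, equal for all pages
    r = static.copy()
    for _ in range(iterations):
        r = alpha * (T @ r) + (1 - alpha) * static
    return r

# Hypothetical transition matrix: T[p, q] = 1 / out-degree(q) if q links to p
T = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
])
scores = pagerank(T)
print(np.round(scores, 4))
```

Page 1 has no inlinks, so its score consists of the static part only; page 2, which is pointed to by two pages, ends up with the highest score in this sketch.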
Assessing Trust: Oracle Function
• Determining whether a page is spam is subjective and requires human evaluation
• The oracle function O models a human checking a page for spam: $O(p) = 1$ if p is good, and $O(p) = 0$ if p is bad
[Figure: example]
Trust Function
• Intuition: good pages seldom point to bad ones (approximate isolation of the good set)
• However, good pages may be tricked into pointing to bad ones
• Trust function T(p): the probability that a given page p is good
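Trust Function
• Ideal trust property: $T(p) = \Pr[O(p) = 1]$
• In practice, this is difficult to achieve
• Ordered trust property: $T(p) < T(q) \Leftrightarrow \Pr[O(p) = 1] < \Pr[O(q) = 1]$ and $T(p) = T(q) \Leftrightarrow \Pr[O(p) = 1] = \Pr[O(q) = 1]$
• Another way is the threshold trust property, with threshold value $\delta$: $T(p) > \delta \Leftrightarrow O(p) = 1$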
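Evaluation Metrics
• A binary function I(T, O, p, q) signals whether a bad page received an equal or higher trust score than a good page: $I(T, O, p, q) = 1$ if $T(p) \ge T(q)$ and $O(p) < O(q)$, or if $T(p) \le T(q)$ and $O(p) > O(q)$; otherwise it is 0
1) Pairwise orderedness, over a set P of page pairs: $\mathrm{pairord}(T, O, P) = \frac{|P| - \sum_{(p,q) \in P} I(T, O, p, q)}{|P|}$
• If pairord equals 1, T misrated none of the pairs. Conversely, if pairord equals 0, T misrated all of the pairs.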
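Evaluation Metrics
2) Precision: the fraction of good pages among all pages in X that have a trust score above $\delta$: $\mathrm{prec}(T, O) = \frac{|\{p \in X : T(p) > \delta \text{ and } O(p) = 1\}|}{|\{q \in X : T(q) > \delta\}|}$
3) Recall: the ratio between the number of good pages with a trust score above $\delta$ and the total number of good pages in X: $\mathrm{rec}(T, O) = \frac{|\{p \in X : T(p) > \delta \text{ and } O(p) = 1\}|}{|\{q \in X : O(q) = 1\}|}$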
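Computing Trust
• Ignorant trust function: $T_0(p) = O(p)$ if $p \in S$, and $1/2$ otherwise, where S is the seed set
• Let L = 3 (the number of oracle invocations) and the seed set S = {1, 3, 6}
• Let $\mathbf{o}$ and $\mathbf{t}_0$ denote the vectors of oracle and trust scores for each page, respectively
• Performance in this case: pairwise orderedness 17/21, precision 1, recall 0.5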
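The reported figures can be reproduced with a short sketch of the ignorant trust function and the three metrics. The slide's vectors did not survive, so the inputs here are assumptions: seven pages of which 1 to 4 are good and 5 to 7 are bad, all 21 page pairs used for pairwise orderedness, and a threshold of 1/2 for precision and recall. Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def misrated(T, O, p, q):
    """I(T, O, p, q): 1 if T orders the pair (p, q) against the oracle, else 0."""
    if T[p] >= T[q] and O[p] < O[q]:
        return 1
    if T[p] <= T[q] and O[p] > O[q]:
        return 1
    return 0

def pairord(T, O, pairs):
    bad = sum(misrated(T, O, p, q) for p, q in pairs)
    return Fraction(len(pairs) - bad, len(pairs))

def precision(T, O, X, delta):
    above = [p for p in X if T[p] > delta]
    return Fraction(sum(O[p] for p in above), len(above))

def recall(T, O, X, delta):
    good = [p for p in X if O[p] == 1]
    return Fraction(sum(1 for p in good if T[p] > delta), len(good))

def ignorant_trust(O, seeds, pages):
    """T_0(p) = O(p) for seeds, 1/2 for every other page."""
    return {p: Fraction(O[p]) if p in seeds else Fraction(1, 2) for p in pages}

# Assumed example: pages 1-4 good, pages 5-7 bad, seed set from the slide
PAGES = range(1, 8)
ORACLE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}
SEEDS = {1, 3, 6}
DELTA = Fraction(1, 2)  # assumed threshold

T0 = ignorant_trust(ORACLE, SEEDS, PAGES)
PAIRS = list(combinations(PAGES, 2))
po = pairord(T0, ORACLE, PAIRS)
pr = precision(T0, ORACLE, PAGES, DELTA)
rc = recall(T0, ORACLE, PAGES, DELTA)
print(po, pr, rc)  # prints: 17/21 1 1/2
```

Under these assumptions the four misrated pairs are the non-seed good pages (2 and 4) paired with the non-seed bad pages (5 and 7), which all share the default score 1/2.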
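Computing Trust
• M-step trust function: $T_M(p) = O(p)$ if $p \in S$; $T_M(p) = 1$ if $p \notin S$ and p can be reached from some good seed in at most M steps; $T_M(p) = 1/2$ otherwise
• Comparing different values of M: performance decreases for M > 2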
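A breadth-first expansion from the good seeds is one way to sketch the M-step trust function. The link structure below is hypothetical (it is not taken from the slide's figure), with pages 1 to 4 good, pages 5 to 7 bad, and a single good-to-bad link 4 -> 5; `m_step_trust` is an illustrative name.

```python
def m_step_trust(oracle, seeds, out_links, pages, m):
    """Seeds keep their oracle value; pages within m links of a good seed get 1; others 1/2."""
    good_seeds = {s for s in seeds if oracle[s] == 1}
    reached = set(good_seeds)
    frontier = set(good_seeds)
    for _ in range(m):
        # Follow one more link from the pages reached in the previous step
        frontier = {t for s in frontier for t in out_links.get(s, ())} - reached
        reached |= frontier
    trust = {}
    for p in pages:
        if p in seeds:
            trust[p] = float(oracle[p])
        elif p in reached:
            trust[p] = 1.0
        else:
            trust[p] = 0.5
    return trust

# Hypothetical graph and oracle values
OUT_LINKS = {1: [2], 2: [3, 4], 3: [2], 4: [5], 5: [6, 7]}
ORACLE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}
SEEDS = {1, 3, 6}
PAGES = range(1, 8)

for m in (1, 2, 3):
    t = m_step_trust(ORACLE, SEEDS, OUT_LINKS, PAGES, m)
    print(m, [t[p] for p in PAGES])
```

On this hypothetical graph, M = 2 reaches all good pages and no bad ones, while M = 3 follows the link 4 -> 5 and assigns full trust to a bad page, which illustrates why results can get worse as M grows.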
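Trust Attenuation
• Trust dampening: trust is reduced by a factor $\beta < 1$ with each link followed away from a good seed
• Trust splitting: the trust of a page is divided equally among the pages it links to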
Selecting Seeds
• Two heuristics:
  • Inverse PageRank: select pages with the maximum number of outlinks
  • High PageRank: give preference to pages with high PageRank
• Example, L = 2? S = {1, 3} or {2, 3} (desirable); S = {1, 2} (Inverse PageRank)
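Inverse PageRank can be sketched as ordinary PageRank run on the inverse transition matrix U, so that pages with many outlinks (and pages pointing to such pages) score highly. The matrix below is for a hypothetical four-page graph (1 -> 2, 2 -> 3, 3 -> 2, 2 -> 4); the decay factor, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

def inverse_pagerank(U, alpha=0.85, iterations=20):
    """PageRank-style iteration on the inverse transition matrix U."""
    n = U.shape[0]
    static = np.full(n, 1.0 / n)
    s = static.copy()
    for _ in range(iterations):
        s = alpha * (U @ s) + (1 - alpha) * static
    return s

def select_seeds(U, L):
    """Return the L pages (numbered from 1) with the highest inverse PageRank."""
    s = inverse_pagerank(U)
    order = np.argsort(-s, kind="stable")  # descending score, ties keep page order
    return [int(i) + 1 for i in order[:L]]

# Hypothetical inverse transition matrix: U[p, q] = 1 / in-degree(q) if p links to q
U = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
ranking = select_seeds(U, 4)
print(ranking)
```

In this sketch page 2, which has the most outlinks, ranks first, and the non-referencing page 4 ranks last.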
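TrustRank Algorithm
• Input: web graph, transition matrix T, and parameters L = 3, $\alpha_B$ = 0.85, $M_B$ = 20
• Select seeds
• Rank seeds
• Invoke the oracle function L times (assign 1 to good seeds)
• Normalize the result of the oracle function to obtain the static score distribution vector $\mathbf{d}$
• Iterate the biased PageRank equation $M_B$ times: $\mathbf{t}^* = \alpha_B \, T \, \mathbf{t}^* + (1 - \alpha_B) \, \mathbf{d}$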
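The steps above can be sketched end to end as follows. This is a minimal illustration under assumptions, not the authors' implementation: the seven-page graph, the oracle (pages 1 to 4 good), and the seed ranking are hypothetical inputs, the seed ranking is passed in directly rather than computed, and the function names are illustrative.

```python
import numpy as np

def transition_matrix(n, edges):
    """T[p, q] = 1 / out-degree(q) if page q links to page p (pages numbered from 1)."""
    out_deg = np.zeros(n)
    for q, _ in edges:
        out_deg[q - 1] += 1
    T = np.zeros((n, n))
    for q, p in edges:
        T[p - 1, q - 1] = 1.0 / out_deg[q - 1]
    return T

def trustrank(T, seed_order, oracle, L=3, alpha_b=0.85, m_b=20):
    n = T.shape[0]
    d = np.zeros(n)
    # Invoke the oracle on the L top-ranked seed candidates; good seeds get 1
    for page in seed_order[:L]:
        if oracle(page) == 1:
            d[page - 1] = 1.0
    if d.sum() == 0:
        raise ValueError("no good seed among the first L candidates")
    d /= d.sum()  # normalize into the static score distribution vector
    t = d.copy()
    for _ in range(m_b):
        # Biased PageRank: the static part goes only to good seeds
        t = alpha_b * (T @ t) + (1 - alpha_b) * d
    return t

# Hypothetical inputs
EDGES = [(1, 2), (2, 3), (3, 2), (2, 4), (4, 5), (5, 6), (5, 7)]
GOOD = {1, 2, 3, 4}
SEED_ORDER = [2, 4, 5, 1, 3, 6, 7]  # assumed output of the seed-ranking step

T = transition_matrix(7, EDGES)
trust = trustrank(T, SEED_ORDER, lambda p: 1 if p in GOOD else 0)
print(np.round(trust, 3))
```

With these inputs, page 1 receives no trust because nothing links to it and it is not a seed, while bad page 5 still receives some trust through the good-to-bad link 4 -> 5, attenuated by the splitting and dampening built into the iteration.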
Experiments
• Data set of 31,003,946 sites
• Seeds were selected using the Inverse PageRank algorithm
• The top 1,250 seed candidates were manually evaluated
• Of these, 178 sites turned out to be good seeds
Results • Pairwise Orderedness
CONCLUSION • Guess what? QUESTIONS?