400 likes | 491 Views
The Efficacy of Collusions in Web Ranking and the Countermeasurements. Hui Zhang University of Southern California. Outline. Problem Statement. PageRank algorithm : a brief introduction. Study of PageRank’s robustness to collusion.
E N D
The Efficacy of Collusions in Web Ranking and the Countermeasurements Hui Zhang University of Southern California
Outline • Problem Statement. • PageRank algorithm : a brief introduction. • Study of PageRank’s robustness to collusion. • Adaptive-resetting: make PageRank robust to collusion. • Conclusions. USC CS599
Search Engine Optimization (SEO) USC CS599
Web spam [Gyongyin et al. 2004] • Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. • A spammer will play with two factors which decide the rank score of a page in a query: • Relevance – textual similarity between the query and a page. • Importance – the global popularity of a page, which is query-independent. USC CS599
Collusion in Web ranking • A manipulation of the hyperlink structure by a group of users with the intention of improving the rating one or more users in the group. USC CS599
PageRank [Brin1998] • An eigenvector-based rating scheme to rank hypertext documents on the WWW. • An iterative algorithm to calculate the importance of a web page based on the importance of its parent pages. • Can be applied to other systems than WWW. USC CS599
With prob. , I will restart the walk at a random node. : resetting probability : resetting probability • With prob. (1-), I will continue the walk to a random successor node. PageRank: random walk model node referential link The walker X 1/2 1/3 Z Y • As time goes on, the expected percentage of steps the walker is at each node vconverges to the PageRank weight PR(v). USC CS599
I’m not colluding! • I’m not colluding! • I’m not colluding! • I’m not colluding! PageRank: is it collusion-proof? • Can a node easily boost its rank by manipulating its out-going links with others’? USC CS599
Amp(G): a metric on group collusion : resetting probability real group weight + (1-) (1-) • “actual” group weight (1-W(G’)) • In the system of node group G, for a subgroup G’, • the amplification factor Amp(G’) = + y x G 2 j PR(x) PR(x) PR(y) PR(y) PR(y) i G’ 4 4 2 3 3 N WG(G’) =PR(i)+PR(j) Win(G’) = USC CS599
Answer for (1+1 = ?) in PageRank • In the original PageRank system, • where is the resetting probability. USC CS599
Two experimental topologies • W, a Web link topology • Contains the link structure of upwards of 80 million URLs. • Source: the Stanford WebBase. • B, a weblog blogrolling topology • Contains the blogrolling structure of upwards of 72,000 blogs. • Source: www.blogstreet.com, the XML-RPC webblog service. USC CS599
Experiment 1: Collusion200 • Model a small number of web pages simultaneously colluding. • Methodology: • 100 colluding groups of 200 nodes; • Each colluding group has the circle topology consisting of two nodes with adjacent ranks; • Arbitrarily chose node pairs originally ranked around 1000th, 2000th, …, 100000th. • = 0.15. USC CS599
Experiment result of Collusion200 (I) Figure 1: W - Amplification factors of the 100 colluding groups in Collusion200. USC CS599
Old rank: 100009th New rank: 5038th Old rank: 1005th Old rank: 10001th New rank: 67th New rank: 450th Experiment result of Collusion200 (III) Figure 2: W – new PR rank after Collusion200. USC CS599
There is a long flat portion… Figure 3: The PR weight distribution of 4 topologies. USC CS599
Next step: how to detect collusions? • Identifying colluding groups is unlikely to be computationally tractable. • The densest k-subgraph problem[Feige et al. 1997]. • The classical CLIQUE problem. • The problem of finding hiding large cliques in random graphs[Juels 1998]. USC CS599
Hardness on Amp • Theorem on Hardness. • Max G’G Amp(G’) is a NP-Hard problem. USC CS599
How about using finer statistics of the random walk N-2 N-1 0 N N+1 1 2 A counterexample: a star+dangling circle topology Figure E: • The revisit intervals of the random walk on a colluding node will likely to have a large variance compared to its expectation. USC CS599
G G’ An observation on collusion behaviors • To increase their PR weight, i.e., the stationary weight in the random walk, the colluding nodes will stall the random walk. • When the resetting probability increases, the colluding nodes must suffer a significant drop in PR weight. • Therefore, we expect the PR weight of colluding nodes to be highly correlated with 1/ (the average walk length), while that of non-colluding nodes is relatively insensitive to the change in . USC CS599
An intuitive example node referential link USC CS599
An intuitive example node referential link A colluding group USC CS599
y x • A colluding node x: PR(x) = , and co-co(PR(x), 1/ ) 1. (co-co: correlation coefficient) • A non-colluding node y: PR(x) = , and co-co(PR(y), 1/ ) 0. N: the system size; K: the colluding group size; K << N. An intuitive example node referential link A colluding group USC CS599
Adaptive-resetting scheme • Part II – personalization: • Calculate each node x's out-link personalized- = F(default, co-co(x)). • Exponential function FExp= . • Linear function FLinear= default+(0.5-default)*co-co(x) • The final PR weight vector is calculated with these personalized resetting values. • Part I – collusion detection: • Given the topology, calculate the PR vector under different values. • {} = {0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6}, default = 0.15. • Calculate the correlation coefficient between the curve of each node x's PR weight and the curve of 1/ . Label it as co-co(x). USC CS599
Experiment result of Collusion200 (IV) Figure 5: W - Amplification factors of the 100 colluding groups in Collusion200. USC CS599
Experiment result of Collusion200 (VI) Figure 6: W – new PR rank after Collusion200. USC CS599
Experiment 2: Collusion22 • Model various colluding subgraphs. • Methodology: • 3 colluding groups: node referential link G1: 10-node ring G2: 10-node star topology G3: 2-node ring USC CS599
Experiment result of Collusion22 (I) Figure 7: Amplification factors of the 3 colluding groups in Collusion22. USC CS599
Experiment result of Collusion22 (II) Figure 8: W – new PR weight after Collusion22. USC CS599
Dropped out New top-25 URL list in W Dropping New USC CS599
Conclusions • Simple collusions lead to effective Web ranking improvement. • A simple scheme based on PageRank algorithm effectively counteracts Web ranking collusions. USC CS599
Backup slides USC CS599
Reputation systems [Okita2003] • A means of describing social trust networks. • The basic concept is a democratic meritocracy. • A rating system is used to evaluate individual members, and those results are then collated to produce a consensus about the merit of any given member. • Examples: • Livejournal, Friendster, eBay, Advogato USC CS599
PageRank algorithm [Brin1998] • Basic algorithm • v Rank(v) = • Enhanced algorithm against rank sinks • v Rank(v) = : damping factor • Assume N pages. • Assign all pages the initial value 1/N • Let Nu be the out-degree of Page u, Rank(v) the importance of Page v, Bv the set of pages pointing to v. USC CS599
Co-co distribution in real-world graphs Figure 4: the co-co PDF distribution in W and B: the [0, 0.1] range actually corresponds to [-1, 0.1] range. USC CS599
Experiment result of Collusion200 (II) Figure A: W – new PR weight after Collusion200. USC CS599
Experiment result of Collusion200 (VII) Figure B: B – new PR rank after Collusion200 USC CS599
Experiment result of Collusion200 (X) Figure C: B – new PR weight after Collusion200 USC CS599
Experiment result of Collusion200 (V) Figure 6: W – new PR weight after Collusion200. USC CS599
Correlation coefficient USC CS599
Experiment result of Collusion22 (III) Figure D: W – new PR rank after Collusion22. USC CS599