LEHIGH UNIVERSITY Models of Trust for the Web (MTW) WWW2006 Workshop
Introduction: Web Search
Google, Yahoo!, MSN Search, Ask, A9, Exalead, Gigablast + metasearch + many more!
• Web search: the access point to the Web for hundreds of millions of people
• Hundreds of millions of queries per day
• Queries + people = TRAFFIC
• A HUGE incentive for web site owners to rank highly in search engine results
  • Communicate some message (advertising, political statement)
  • Install viruses, adware, etc.
Introduction: Web Spam
• a.k.a. search engine spam, spamdexing
• Any technique to manipulate search engine results
  • Target page gets an undeservedly higher ranking
• Many methods
  • Link farms, keyword stuffing, cloaking, link bombs, and more
• The target of much of our work!
Propagating Trust and Distrust to Demote Web Spam
Baoning Wu, Vinay Goel, and Brian D. Davison
Computer Science & Engineering, Lehigh University, Bethlehem, PA, USA
Outline
• Background and motivation
• Proposed methods
• Experimental results
Background: PageRank
• (Page and Brin, 1998)
• Uses the number and status of “parents” to determine the status of a child
• r(i+1) = (1-α) * T * r(i) + α * s
  • r: PageRank score vector (one entry per node, N nodes)
  • T: transition matrix (N×N)
  • (1-α): decay factor; α: jump probability
  • s: uniform distribution, 1/N per node
• The PageRank scores yield a ranking of the nodes by importance
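The PageRank iteration can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `pagerank`, the adjacency-dict input, the fixed iteration count, and the uniform handling of nodes without out-links are all assumptions made for the example.

```python
def pagerank(out_links, alpha=0.15, iterations=50):
    """Iterate r(i+1) = (1 - alpha) * T * r(i) + alpha * s with uniform s.

    out_links maps every node to the list of nodes it links to;
    every linked node must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}              # start from the uniform vector
    for _ in range(iterations):
        nxt = {v: alpha / n for v in nodes}      # jump term: alpha * s
        for parent, children in out_links.items():
            if children:
                share = (1 - alpha) * r[parent] / len(children)
                for child in children:
                    nxt[child] += share
            else:
                # Assumption: a node without out-links spreads its score uniformly.
                share = (1 - alpha) * r[parent] / n
                for v in nodes:
                    nxt[v] += share
        r = nxt
    return r


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

In this toy graph, node c is linked from both a and b, so it ends up with a higher score than b, which only receives half of a's score.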
Background: TrustRank
• (Gyöngyi, Garcia-Molina, and Pedersen, VLDB 2004)
• Uses the number and trust of “parents” to determine the trust status of a child
• t(i+1) = (1-α) * T * t(i) + α * s
  • t: TrustRank score vector (one entry per node, N nodes)
  • T: transition matrix (N×N)
  • (1-α): decay factor
  • s: seed set trust score distribution
    • Vector of size N, but only seed nodes are non-zero
• Demotes web spam by propagating trust from a known good seed set
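The only change from PageRank is the vector s, which is concentrated on the seed set. The sketch below illustrates that difference under assumptions: the name `trustrank`, the equal weighting of seeds, the fixed iteration count, and the choice to drop trust at nodes without out-links are not taken from the slides.

```python
def trustrank(out_links, seeds, alpha=0.15, iterations=50):
    """Iterate t(i+1) = (1 - alpha) * T * t(i) + alpha * s with s on the seeds.

    out_links maps every node to the list of nodes it links to;
    every linked node must also appear as a key.
    """
    nodes = list(out_links)
    # Assumption: each seed gets an equal share of s; non-seeds get zero.
    s = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    t = dict(s)
    for _ in range(iterations):
        nxt = {v: alpha * s[v] for v in nodes}
        for parent, children in out_links.items():
            if not children:
                continue  # assumption: trust at nodes without out-links is dropped
            share = (1 - alpha) * t[parent] / len(children)
            for child in children:
                nxt[child] += share
        t = nxt
    return t


graph = {
    "good": ["a"], "a": ["b"], "b": [],
    "spam": ["spam2"], "spam2": ["spam"],
}
trust = trustrank(graph, seeds={"good"})
```

Here the two spam nodes link only to each other and are unreachable from the seed, so they receive no trust, while trust decays along the chain good, a, b.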
Specific Motivation
• In TrustRank, a parent divides its trust among its children.
  • This may not be optimal: real-world trust relationships are independent of the number of trusted entities.
• Distrust can also be propagated.
[Figure: nodes A and B joined by a hyperlink, with arrows labeled "Trust Propagation" and "Distrust Propagation"]
Key steps in propagation
• Decay of trust (d)
  • Trust is not perfectly transitive.
• Splitting of trust
  • For each parent, how to divide its score among its children.
• Accumulation of trust
  • For each child, how to accumulate the overall score given the portions from all of its parents.
Outline
• Background and motivation
• Proposed methods
• Experimental results
Choices for Trust Splitting
• Given a node i with trust score TR(i) and O(i) outgoing links:
  • Equal splitting: gives d*TR(i)/O(i) to each child (used by TrustRank)
  • Constant splitting: gives d*TR(i) to each child
  • Logarithmic splitting: gives d*TR(i)/log(1+O(i)) to each child
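The three splitting rules can be compared side by side with a small helper. This is a sketch under assumptions: the function name `split_trust`, the default d = 0.85, and the use of the natural logarithm are illustrative choices; the slides do not state the logarithm base or a value for d.

```python
import math


def split_trust(score, out_degree, d=0.85, method="equal"):
    """Return the share of trust that one child receives from a parent."""
    if out_degree == 0:
        return 0.0  # no children, nothing to pass on
    if method == "equal":
        return d * score / out_degree
    if method == "constant":
        return d * score
    if method == "logarithmic":
        # Assumption: natural logarithm; the base is not given in the slides.
        return d * score / math.log(1 + out_degree)
    raise ValueError(f"unknown splitting method: {method}")


equal_share = split_trust(1.0, 4, method="equal")
constant_share = split_trust(1.0, 4, method="constant")
log_share = split_trust(1.0, 4, method="logarithmic")
```

For a parent with four children, the logarithmic share falls between the equal share and the constant share, which is the intermediate behavior the rule is meant to provide. With the natural logarithm and a single child, the logarithmic share exceeds d*TR(i), so the choice of base matters for low out-degrees.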
Choices for Trust Accumulation
• Simple summation: sum the trust values from each parent
• Maximum share: use the maximum of the trust values sent by the parents
• Maximum parent: sum the trust values, but never exceed the trust score of the most-trusted parent
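The accumulation choices differ only in how a child combines the shares it receives. The helper below is a hypothetical sketch: the name `accumulate` and its two-list interface (shares sent by the parents, and the parents' own trust scores) are assumptions made for illustration.

```python
def accumulate(shares, parent_scores, method="sum"):
    """Combine the trust shares a child receives from its parents.

    shares[k] is the share sent by parent k; parent_scores[k] is that
    parent's own trust score.
    """
    if not shares:
        return 0.0
    if method == "sum":
        return sum(shares)
    if method == "max_share":
        return max(shares)
    if method == "max_parent":
        # Sum the shares, capped at the most-trusted parent's score.
        return min(sum(shares), max(parent_scores))
    raise ValueError(f"unknown accumulation method: {method}")


shares = [0.3, 0.4, 0.2]
parent_scores = [0.5, 0.6, 0.4]
total_sum = accumulate(shares, parent_scores, "sum")
total_max_share = accumulate(shares, parent_scores, "max_share")
total_max_parent = accumulate(shares, parent_scores, "max_parent")
```

With these numbers the plain sum is about 0.9, the maximum share is 0.4, and the maximum-parent rule caps the sum at 0.6, the score of the most-trusted parent.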
Propagating Distrust
• Distrust can be propagated from a seed set of bad nodes.
• Similar to trust propagation, but in reverse: follow incoming links, not outgoing links
• Same key choices for decay, splitting and accumulation
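One way to picture reverse propagation is to invert the link graph and then run the same kind of iteration from the bad seeds. The sketch below assumes equal splitting and simple summation; the name `propagate_distrust`, the parameter values, and the toy graph are hypothetical.

```python
def propagate_distrust(out_links, bad_seeds, alpha=0.15, iterations=50):
    """Propagate distrust from bad seeds backwards along hyperlinks.

    out_links maps every node to the list of nodes it links to;
    every linked node must also appear as a key.
    """
    # Invert the graph: in_links[v] lists the pages that link to v.
    in_links = {v: [] for v in out_links}
    for parent, children in out_links.items():
        for child in children:
            in_links[child].append(parent)

    s = {v: (1.0 / len(bad_seeds) if v in bad_seeds else 0.0) for v in out_links}
    dis = dict(s)
    for _ in range(iterations):
        nxt = {v: alpha * s[v] for v in out_links}
        for node, parents in in_links.items():
            if not parents:
                continue
            # Assumption: equal splitting among the pages that link to this node.
            share = (1 - alpha) * dis[node] / len(parents)
            for parent in parents:
                nxt[parent] += share
        dis = nxt
    return dis


graph = {"a": ["spam"], "b": ["a"], "spam": ["c"], "c": []}
distrust = propagate_distrust(graph, bad_seeds={"spam"})
```

Page a links to the spam seed and so picks up distrust, as does b, which links to a. Page c is only linked from the spam page, so it receives none.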
Combining Trust and Distrust
• For each node i with trust score TR(i) and distrust score DIS_TR(i), the combined score Total(i) can be
  Total(i) = η * TR(i) - β * DIS_TR(i)
  where 0 ≤ η ≤ 1 and 0 ≤ β ≤ 1
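The combination is a weighted difference applied per node. In the sketch below, the function name `combine`, the dict-based inputs, and the default weights are assumptions; nodes missing from either score table are treated as having a score of zero.

```python
def combine(trust, distrust, eta=0.5, beta=0.5):
    """Compute Total(i) = eta * TR(i) - beta * DIS_TR(i) for every node."""
    if not (0 <= eta <= 1 and 0 <= beta <= 1):
        raise ValueError("eta and beta must lie in [0, 1]")
    nodes = set(trust) | set(distrust)
    return {
        v: eta * trust.get(v, 0.0) - beta * distrust.get(v, 0.0)
        for v in nodes
    }


trust = {"a": 0.4, "b": 0.1}
distrust = {"b": 0.3, "c": 0.2}
total = combine(trust, distrust, eta=0.5, beta=0.5)
# Rank nodes from most to least trustworthy under the combined score.
ranking = sorted(total, key=total.get, reverse=True)
```

Setting eta to 1 and beta to 0 recovers a trust-only ranking, and the opposite setting gives a distrust-only ranking.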
Outline
• Background and motivation
• Proposed methods
• Experimental results
Data set
• 20M pages from the Swiss search engine [search.ch] in 2004
  • 350K sites with “.ch” domain
  • We used only this site graph
• Seed sets
  • 3,589 sites labeled as using various web spam techniques (labels provided)
  • 20,005 sites with pages in dir.search.ch topics, used as the trusted set
Experimental Design
• Explore various combinations of trust and distrust propagation
• Evaluation
  • TrustRank's performance is measured as the number of spam sites found among the highest-ranked ~1% of sites.
  • We use the same metric in this work.
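The evaluation metric amounts to counting known spam sites in the top slice of the ranking. The sketch below is illustrative only: the name `spam_in_top`, the rounding used to size the top slice, and the synthetic scores are assumptions. Since the goal is demotion, a lower count is presumably better.

```python
def spam_in_top(scores, spam_sites, fraction=0.01):
    """Count spam sites among the highest-ranked fraction of sites."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: round to the nearest whole site, keeping at least one.
    k = max(1, round(fraction * len(ranked)))
    return sum(1 for site in ranked[:k] if site in spam_sites)


# Synthetic example: 200 sites, where s0 has the highest score.
scores = {f"s{i}": 200 - i for i in range(200)}
spam_sites = {"s1", "s150"}
count = spam_in_top(scores, spam_sites)
```

With 200 sites, the top 1% holds two sites (s0 and s1), and one of them is labeled spam.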
Baseline result
Simple TrustRank Improvement: Increase jump probability (α)
• Default α = 0.15
[Figure: plot over values of α]
Other trust propagation methods
Results of propagating distrust
• Combined equally with TrustRank, 200 seeds
Combining trust and distrust
• Using the best-scoring trust and distrust formulations, β = 1 - η
[Figure: plot ranging from "Distrust Only" to "Trust Only"; one value is labeled ">2200"]
Coverage of trust propagation
• Percentage of sites affected by each approach. TrustRank reached 76.05%.
Conclusions
• Propagating trust based on outdegree does not appear to be optimal.
• Alternative splitting and accumulation methods can help to demote top-ranked spam sites.
• Propagating distrust can also help to demote top-ranked spam sites.
• Additional tests needed!
  • E.g., to examine impact on retrieval
Thank You! Questions?
Contact Info:
Dr. Brian D. Davison, davison(at)cse.lehigh.edu
WUME Laboratory, Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015 USA
The WUME Lab: http://wume.cse.lehigh.edu/