Web Spam 2008.12.20
Outline • Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions
Motivation • Increased exposure on the World Wide Web may yield significant financial gains • E-commerce is growing rapidly • Projected to reach $329 billion by 2010; 13% of all US retail sales • More traffic, more money • A large fraction of traffic comes from search engines • Ways to increase search engine referrals: • Place ads • Provide genuinely better content • Create Web spam …
Outline • Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions
Defining Web Spam • Spamming = misleading search engines to obtain a higher-than-deserved ranking • Ranking is based on importance and relevance
Why Web Spam is Bad • Bad for users • Makes it harder to satisfy information need • Leads to frustrating search experience • Bad for search engines • Wastes bandwidth, CPU cycles, storage space • Pollutes corpus (infinite number of spam pages!) • Distorts ranking of results
Outline • Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions • References
Web Spam Techniques • Two categories of techniques associated with web spam • Boosting • Term-based • Link-based • Hiding
Techniques/Boosting • Used to increase ranking • Hypertext boosting • Term: relevance (one/many queries); target: TF-IDF variants • Link: importance; target: inlink/outlink count
Techniques/Boosting/Term
• Meta tags: heavily spammed, so search engines give them low priority or ignore them completely
• Title: search engines give a higher weight to terms that appear in the title
• Body: the simplest and most popular technique, as old as search engines
    <html>
    <head>
    <meta name="keywords" content="buy,cheap,cameras,Lens,accessories,nikon,canon">
    <title>free,free,free, cheap</title>
    </head>
    <body>
    Our customers agree that we are the best online retailer of cameras!
    …
    </body>
    </html>
• URL: the URL of a page is split into a set of terms that are used to determine the relevance of the page
• Anchor text: offers a summary of the pointed-to document, so it receives a higher weight
    <html>
    …A great <a href="buy-canon-rebel-20d-lens-case.camerasx.com">free,great deals,cheap,inexpensive,cheap,free</a> store.
    </html>
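To illustrate why repeating a keyword can raise a term-based relevance score, here is a minimal sketch of one textbook TF-IDF variant. The function name `tf_idf`, the toy corpus, and the exact weighting are assumptions for illustration only; real search engines use more elaborate and undisclosed ranking functions.

```python
import math

def tf_idf(term, doc, corpus):
    """Score `term` in `doc` (a list of tokens) against `corpus` (a list of token lists).

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    tf = doc.count(term) / len(doc) if doc else 0.0
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical three-page corpus: one honest page, one keyword-stuffed page, one unrelated page.
honest = "cheap cameras and lenses for sale".split()
stuffed = "cheap cheap cheap cameras cheap".split()
unrelated = "weather report for today".split()
corpus = [honest, stuffed, unrelated]

print("honest :", tf_idf("cheap", honest, corpus))
print("stuffed:", tf_idf("cheap", stuffed, corpus))
```

Under this variant the stuffed page scores higher for the query term simply because the term makes up a larger share of its text.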
Techniques/Boosting/Link • Outgoing links to well-known pages • Pages that provide useful resources but also have links to spam pages • Spammers can control a large number of sites and create arbitrary link structures • Buying expired domains takes advantage of the false relevance/importance conveyed by the pool of old links • A group of spammers sets up a link exchange structure so that their sites point to each other • Posting messages (containing links) to blogs, forums, and wikis • Sites that allow webmasters to post links to their own sites, which may be spam links
Techniques/Hiding • Serve different web pages to users and web crawlers • Example markup used for redirection and hidden content:
    <script type="text/javascript"><!--
    location.replace("target.html")
    //--></script>
    <body bgcolor="red">
    <font color="red">hidden text</font>
    …
    </body>
    <meta http-equiv="refresh" content="0;url=plush.com">
    <div style="visibility:hidden">You can't see me!</div>
    <a href="target.html"><img src="tinyimg.gif"></a>
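The hiding patterns on this slide can be spotted with simple pattern matching. The sketch below is only an illustration: the pattern names, the regular expressions, and the function `hiding_signals` are assumptions, and each pattern also has legitimate uses, so a match is a hint rather than proof of spam.

```python
import re

# Hypothetical patterns for three of the hiding tricks shown on the slide.
HIDING_PATTERNS = {
    "meta_refresh": re.compile(r'<meta[^>]+http-equiv\s*=\s*["\']?refresh', re.IGNORECASE),
    "js_redirect": re.compile(r'location\.replace\s*\(', re.IGNORECASE),
    "hidden_style": re.compile(r'visibility\s*:\s*hidden', re.IGNORECASE),
}

def hiding_signals(html):
    """Return the sorted names of all hiding patterns found in the HTML source."""
    return sorted(name for name, pattern in HIDING_PATTERNS.items() if pattern.search(html))

page = ('<meta http-equiv="refresh" content="0;url=plush.com">'
        '<div style="visibility:hidden">You can\'t see me!</div>')
print(hiding_signals(page))
```

Same-colour text and tiny image links need the rendered style or image size to detect, so they are left out of this sketch.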
Outline • Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions
How do we detect spam? • Detection techniques: • Content-based • Link-based • Cloaking-based • Other
Content-based Detection
Related Work • D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004] • A. Ntoulas, M. Najork, M. Manasse and D. Fetterly. Detecting Spam Web Pages through Content Analysis. [WWW 2006] • …
Content-based Detection • Number of words in the page and title • Average word length • Amount of anchor text • Compression rate • Fraction of page drawn from globally popular words • Fraction of globally popular words • Source: Detecting Spam Web Pages through Content Analysis, WWW 2006
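A few of the features listed above can be computed directly from the page source. The following is a rough sketch, not the extraction code used in the cited paper: the function name `content_features`, the tokenisation rule, and the tag-stripping regular expression are all assumptions chosen for brevity.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")  # assumed tokenisation: runs of letters and digits

def content_features(html):
    """Return word count, title word count, and average word length for a page."""
    title_match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = title_match.group(1) if title_match else ""
    # Crude tag removal; the title text stays in, attribute values do not.
    text = re.sub(r"<[^>]+>", " ", html)
    words = WORD.findall(text)
    return {
        "page_words": len(words),
        "title_words": len(WORD.findall(title)),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

sample = "<html><head><title>free,free,free, cheap</title></head><body>Best cameras!</body></html>"
print(content_features(sample))
```

A production extractor would use a real HTML parser and handle scripts, entities, and encodings; the regular expressions here are only good enough for small examples.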
Number of Words in <title> • 110 words
Distribution of Word-counts in <title> • Spam is more likely in pages with more words in the title
Compressibility of a Page
zipRatio of a page
Distribution of zipRatios • Spam is more likely in pages with a high zipRatio
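The slides do not spell out how zipRatio is computed, so the sketch below assumes it is the uncompressed size of the page divided by its compressed size, with Python's `zlib` (DEFLATE) standing in for whatever compressor the original study used. The function name `zip_ratio` and the sample strings are illustrative.

```python
import zlib

def zip_ratio(text):
    """Assumed definition: uncompressed byte length divided by compressed byte length."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

# Highly repetitive content (typical of keyword stuffing) compresses very well.
repetitive = "cheap cameras " * 200
# Less redundant content for comparison: a sequence of varied numbers.
varied = " ".join(str(i * i * 7919 % 100003) for i in range(400))

print("repetitive:", zip_ratio(repetitive))
print("varied    :", zip_ratio(varied))
```

Under this definition a page built from repeated phrases gets a much higher ratio than a page with varied content, which matches the trend stated on the slide.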
Combine heuristics • Use the previously presented metrics as features for a classifier • Show results for a decision tree
Link-based Detection
Related Work • B. Davison. Recognizing Nepotistic Links on the Web. 2000 • Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004] • B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005] • R. Baeza-Yates, C. Castillo and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005] • V. Krishnan and R. Raj. Web Spam Detection with Anti-Trust-Rank. [AIRWeb 2006] • L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link Based Spam Detection. [WebKDD 2006] • B. Wu, V. Goel and B. Davison. Topical TrustRank: Using Topicality to Combat Web Spam. [WWW 2006] • Z. Gyongyi, P. Berkhin, H. Garcia-Molina and J. Pedersen. Link Spam Detection Based on Mass Estimation. [VLDB 2006]
Our target • Detect pages that achieve a high PageRank through link spamming • Source: Link Spam Detection Based on Mass Estimation, VLDB 2006
PageRank Contribution
Spam Mass: Definition • Absolute mass • Amount (part) of PageRank coming from spam • Relative mass • Fraction of PageRank coming from spam • More useful in practice • Example: absolute mass $= p_0^- = 5$; relative mass $= p_0^- / p_0 = 5/7$
Spam Mass: Estimation • Approximate the set of good nodes by a subset called the good core
Spam Mass: Algorithm • Create the good core • Compute PageRank scores $p_i$ and $p_i^+$ • For all pages $i$ with large PageRank: • Compute the estimated relative mass $m_i = (p_i - p_i^+) / p_i$ • Mark the page as spam if $m_i$ > threshold
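The algorithm above can be sketched on a toy graph. Everything specific here is an assumption for illustration: the function names, the damping factor of 0.85, the toy graph, the thresholds, and the choice of a linear PageRank in which the core-based score uses a random-jump vector that is non-zero only on the good core. The cited paper discusses further refinements, such as scaling the core-based jump vector, that are omitted.

```python
def linear_pagerank(out_links, jump, damping=0.85, iterations=100):
    """Iterate p = damping * (link contributions) + (1 - damping) * jump."""
    n = len(out_links)
    scores = list(jump)
    for _ in range(iterations):
        nxt = [(1.0 - damping) * jump[i] for i in range(n)]
        for src, targets in enumerate(out_links):
            if targets:
                share = damping * scores[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        scores = nxt
    return scores

def flag_spam(out_links, good_core, min_pagerank, threshold):
    """Return pages with large PageRank whose estimated relative spam mass exceeds threshold."""
    n = len(out_links)
    uniform = [1.0 / n] * n
    core_jump = [1.0 / n if i in good_core else 0.0 for i in range(n)]
    p = linear_pagerank(out_links, uniform)         # p_i
    p_plus = linear_pagerank(out_links, core_jump)  # p_i^+ (core-based PageRank)
    flagged = []
    for i in range(n):
        if p[i] >= min_pagerank:                    # only pages with large PageRank
            relative_mass = (p[i] - p_plus[i]) / p[i]
            if relative_mass > threshold:
                flagged.append(i)
    return flagged

# Toy graph: pages 0 and 1 link to each other (0 is the good core);
# pages 2, 3, 4 are boosters that all link to target page 5.
graph = [[1], [0], [5], [5], [5], []]
print(flag_spam(graph, good_core={0}, min_pagerank=0.05, threshold=0.9))
```

In this toy graph the target page receives none of its PageRank from the good core, so its estimated relative mass is 1 and it is the only page flagged; the boosters fall below the PageRank cut-off and are not examined.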
Hiding-based Detection
Related Work • M. Najork. System and method for identifying cloaked web servers, June 21 2005. U.S. Patent number 6,910,077. • B. Wu and B. D. Davison. Cloaking and Redirection: A Preliminary Study. [AIRWeb 2005] • B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. [WWW 2006]
Motivation • C: a copy of a page from the crawler's perspective • B: a copy of a page from the browser's perspective • Web pages may be updated frequently, so two copies can differ without any cloaking • Compare the copies: if the difference between C1 and B1 is bigger than the difference between C1 and C2, this is taken as sufficient evidence that the page is cloaking • Source: Cloaking and Redirection: A Preliminary Study, AIRWeb 2005
Detecting Cloaking • Compare C1 with C2 and C1 with B1 • Compare both the terms and the links of each pair • Select a threshold for the term difference and a threshold for the link difference
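The term comparison can be sketched as follows. This is only an illustration under stated assumptions: the difference measure (size of the symmetric bag-of-words difference), the function names, the sample pages, and the threshold value are hypothetical and are not taken from the cited study.

```python
from collections import Counter

def term_difference(page_a, page_b):
    """Number of term occurrences that appear in one page but not the other."""
    a = Counter(page_a.lower().split())
    b = Counter(page_b.lower().split())
    return sum(((a - b) + (b - a)).values())

def looks_cloaked(c1, c2, b1, threshold):
    """Flag the page if crawler vs. browser copies differ by more than
    two crawler copies do, by a margin greater than `threshold`."""
    return term_difference(c1, b1) > term_difference(c1, c2) + threshold

# c1, c2: two copies fetched as a crawler; b1: a copy fetched as a browser.
c1 = "cheap cameras buy cheap lens"
c2 = "cheap cameras buy cheap lens today"
b1 = "welcome to our casino"
print(looks_cloaked(c1, c2, b1, threshold=2))
```

The same structure would apply to links by comparing the sets of outgoing URLs instead of terms, with its own threshold.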
Other Detection Methods
Related Work • C. Castillo, D. Donato, A. Gionis, V. Murdock and F. Silvestri. Know Your Neighbors: Web Spam Detection Using the Web Topology. [SIGIR 2007] • S. Webb, J. Caverlee and C. Pu. Characterizing Web Spam Using Content and HTTP Session Analysis. [CEAS 2007] • S. Webb, J. Caverlee and C. Pu. Predicting Web Spam with HTTP Session Information. [CIKM 2008] • …
Web Topology Detection • Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages • Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000] • Spam tends to be clustered on the Web (black on figure) • Source: Know Your Neighbors: Web Spam Detection Using the Web Topology, SIGIR 2007
Clustering • If the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
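The clustering rule can be sketched as a simple majority vote. The function name `smooth_by_cluster`, the host names, and the decision to leave tied clusters unchanged are assumptions for illustration; how the clusters are obtained from the host graph is outside this sketch.

```python
def smooth_by_cluster(predictions, clusters):
    """Overwrite each host's prediction with the majority label of its cluster.

    predictions: dict mapping host -> True (spam) or False (nonspam)
    clusters: list of lists of hosts
    """
    smoothed = dict(predictions)
    for hosts in clusters:
        labelled = [h for h in hosts if h in predictions]
        if not labelled:
            continue
        spam_votes = sum(1 for h in labelled if predictions[h])
        if 2 * spam_votes > len(labelled):
            majority = True
        elif 2 * spam_votes < len(labelled):
            majority = False
        else:
            continue  # tie: keep the original predictions (an assumption)
        for h in labelled:
            smoothed[h] = majority
    return smoothed

# Hypothetical per-host classifier output and clusters.
predictions = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": True}
clusters = [["a", "b", "c"], ["d", "e", "f"]]
print(smooth_by_cluster(predictions, clusters))
```

Here host `c` is relabelled as spam because two of its three cluster neighbours are predicted spam, and host `f` is relabelled as nonspam for the opposite reason.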