Web Spam

Web Spam. 2008.12.20. Outline: Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions.

brookec

Presentation Transcript


  1. Web Spam 2008.12.20

  2. Outline • Motivation • Introduction to Web Spam • Web Spam techniques • Web Spam Detection • Conclusions

  4. Motivation • Increased exposure on the World Wide Web may yield significant financial gains • E-commerce is rapidly growing • Projected to reach $329 billion by 2010; 13% of all US retail sales • More traffic → more money • A large fraction of traffic comes from search engines • Ways to increase search engine referrals: • Place ads • Provide genuinely better content • Create Web spam …

  5. Outline • Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions

  6. Web Spam

  7. Defining Web Spam • Spamming = misleading search engines to obtain a higher-than-deserved ranking • Ranking depends on: • Importance • Relevance

  8. Why Web Spam is Bad • Bad for users • Makes it harder to satisfy information need • Leads to frustrating search experience • Bad for search engines • Wastes bandwidth, CPU cycles, storage space • Pollutes corpus (infinite number of spam pages!) • Distorts ranking of results

  9. Outline • Motivation • Introduction to Web Spam • Web Spam Techniques • Web Spam Detection • Conclusions • References

  10. Web Spam Techniques • Two categories of techniques are associated with web spam • Boosting: term-based and link-based • Hiding

  11. Techniques/Boosting • Used to increase ranking • Hypertext boosting • Term boosting: raises relevance (for one or many queries); target: TF-IDF variants • Link boosting: raises importance; target: inlink/outlink count
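To illustrate why term boosting targets TF-IDF variants, here is a minimal sketch. It assumes the plain weighting "raw term count times log(N / document frequency)"; real search engines use other variants, and the function name `tf_idf` and the toy corpus are made up for this example.

```python
import math

def tf_idf(term, doc, corpus):
    """Plain TF-IDF: raw count of term in doc times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Toy corpus (hypothetical): one honest page, one keyword-stuffed page, one unrelated page.
honest = "our store sells cheap cameras and lenses".split()
stuffed = "cheap cheap cheap cameras cheap free cheap".split()
other = "news about the weather today".split()
corpus = [honest, stuffed, other]

print("honest :", tf_idf("cheap", honest, corpus))
print("stuffed:", tf_idf("cheap", stuffed, corpus))
```

Repeating a query term raises the term-frequency factor, so under this weighting the stuffed page scores higher for "cheap" than the honest one.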

  12. Techniques/Boosting/Term • Meta tags: heavily spammed, so search engines give them low priority or ignore them completely • Title: search engines give a higher weight to terms that appear in the title • Body: the simplest and most popular technique, as old as search engines • URL: search engines break the URL of a page into a set of terms to determine the relevance of the page • Anchor text: offers a summary of the pointed-to document, so it gets a higher weight • Example (meta tags, title, body): <html> <head> <meta name="keywords" content="buy,cheap,cameras,Lens,accessories,nikon,canon"> <title>free,free,free, cheap</title> </head> <body> Our customers agree that we are the best online retailer of cameras! … </body> </html> • Example (URL, anchor text): <html> …A great <a href="buy-canon-rebel-20d-lens-case.camerasx.com">free,great deals,cheap,inexpensive,cheap,free</a> store. </html>

  13. Techniques/Boosting/Link • Create pages with outgoing links to well-known pages: they provide useful resources, but also link to spam pages • Spammers can control a large number of sites and create arbitrary link structures • Buy expired domains to take advantage of the false relevance/importance conveyed by the pool of old links • A group of spammers sets up a link exchange structure in which their sites point to each other • Post messages (containing links) to blogs, forums, and wikis, which allow webmasters to post links to their sites, possibly spam links

  14. Techniques/Hiding • Cloaking: different web pages are served to users and to web crawlers • Redirection by script: <script type="text/javascript"><!-- location.replace("target.html") //--> </script> • Redirection by meta refresh: <meta http-equiv="refresh" content="0;url=plush.com"> • Text in the background color: <body bgcolor="red"> <font color="red">hidden text</font> … </body> • Invisible elements: <div style="visibility:hidden">You can't see me!</div> • Tiny image links: <a href="target.html"><img src="tinyimg.gif"></a>

  15. Outline • Motivation • Introduction to Web Spam • Web Spam Taxonomy • Web Spam Detection • Conclusions

  16. How do we detect spam? • Detection techniques: content-based, link-based, cloaking-based, other

  17. Content-based Detection

  18. Related Work • D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004] • A. Ntoulas, M. Najork, M. Manasse and D. Fetterly. Detecting spam web pages through content analysis. [WWW 2006] • …

  20. Content-based Detection • Number of words in the page and title • Average word length • Amount of anchor text • Compression rate • Fraction of page drawn from globally popular words • Fraction of globally popular words (Source: Detecting spam web pages through content analysis, WWW 2006)
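A few of the features listed above can be sketched as follows. This is a rough illustration rather than the feature extractor from the cited paper: the tokenisation rule, the choice to count title words separately from page words, and the names `_PageText` and `content_features` are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

class _PageText(HTMLParser):
    """Collects title words, body words, and anchor-text words from an HTML page."""

    def __init__(self):
        super().__init__()
        self.depth = {"title": 0, "a": 0}  # how many <title>/<a> tags are currently open
        self.words, self.title_words, self.anchor_words = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        tokens = re.findall(r"[A-Za-z0-9]+", data)
        if self.depth["title"]:
            self.title_words += tokens  # title words are counted separately
            return
        self.words += tokens
        if self.depth["a"]:
            self.anchor_words += tokens

def content_features(html):
    """Return a dict with a handful of content-based features for one page."""
    parser = _PageText()
    parser.feed(html)
    parser.close()
    n = len(parser.words)
    return {
        "page_words": n,
        "title_words": len(parser.title_words),
        "avg_word_length": sum(map(len, parser.words)) / n if n else 0.0,
        "anchor_fraction": len(parser.anchor_words) / n if n else 0.0,
    }

page = ("<html><head><title>free free cheap cameras</title></head>"
        "<body>Buy <a href='x.html'>cheap cameras</a> now</body></html>")
print(content_features(page))
```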

  21. Number of Words in <title> • 110 words (Source: Detecting spam web pages through content analysis, WWW 2006)

  22. Distribution of Word-counts in <title> • Spam is more likely in pages with more words in the title (Source: Detecting spam web pages through content analysis, WWW 2006)

  23. Compressibility of a Page (Source: Detecting spam web pages through content analysis, WWW 2006)

  24. zipRatio of a page (Source: Detecting spam web pages through content analysis, WWW 2006)

  25. Distribution of zipRatios • Spam is more likely in pages with a high zipRatio (Source: Detecting spam web pages through content analysis, WWW 2006)
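The compressibility idea can be sketched in a few lines. This assumes that zipRatio means the uncompressed size divided by the compressed size, and it uses Python's `zlib` as a stand-in compressor; the function name `zip_ratio` and the sample strings are made up for illustration.

```python
import zlib

def zip_ratio(text):
    """Uncompressed size divided by compressed size (assumed definition of zipRatio)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

# Hypothetical samples: a keyword-stuffed page repeats itself, ordinary prose does not.
repetitive = "cheap cameras free cheap cameras free " * 200
varied = "Our customers agree that we are the best online retailer of cameras!"

print("repetitive:", zip_ratio(repetitive))
print("varied    :", zip_ratio(varied))
```

Highly repetitive content compresses far better than ordinary prose, which is why a high ratio hints at keyword stuffing.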

  26. Combine heuristics • Use the previously presented metrics as features for a classifier • Show results for a decision tree

  27. Decision Tree
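As a much-reduced illustration of feeding such features to a tree classifier, the sketch below learns a single split (a depth-1 decision tree, or "stump") by trying every feature and threshold. It is not the classifier used in the cited work; the function name `best_split`, the feature choice, and the labelled rows are invented for this example.

```python
def best_split(rows, labels):
    """Return (feature_index, threshold, errors) for the single rule
    'feature >= threshold means spam' with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            errors = sum((r[f] >= threshold) != y for r, y in zip(rows, labels))
            if best is None or errors < best[2]:
                best = (f, threshold, errors)
    return best

# Hypothetical training data: (words in title, zipRatio) per page, True = spam.
rows = [(5, 1.8), (8, 2.1), (40, 2.0), (6, 5.5), (60, 7.0), (7, 2.4)]
labels = [False, False, True, True, True, False]

feature, threshold, errors = best_split(rows, labels)
print(f"split on feature {feature} at {threshold} with {errors} training error(s)")
```

A full decision tree would apply the same search recursively to the rows on each side of the split, which is how several weak heuristics get combined into one classifier.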

  28. Link-based Detection

  29. Related Work • B. Davison. Recognizing nepotistic links on the Web. 2000 • Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004] • B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005] • R. Baeza-Yates, C. Castillo and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005] • V. Krishnan and R. Raj. Web Spam Detection with Anti-Trust-Rank. [AIRWeb 2006] • L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link Based Spam Detection. [WebKDD 2006] • B. Wu, V. Goel and B. Davison. Topical TrustRank: Using Topicality to Combat Web Spam. [WWW 2006] • Z. Gyongyi, P. Berkhin, H. Garcia-Molina and J. Pedersen. Link Spam Detection Based on Mass Estimation. [VLDB 2006]

  31. Our target • Detect pages that achieve high PageRank through link spamming (Source: Link Spam Detection Based on Mass Estimation, VLDB 2006)

  32-36. PageRank Contribution (Source: Link Spam Detection Based on Mass Estimation, VLDB 2006)

  37. Spam Mass: Definition • Absolute mass • Amount (part) of PageRank coming from spam • Relative mass • Fraction of PageRank coming from spam • More useful in practice • Example: absolute mass = p0⁻ = 5; relative mass = p0⁻ / p0 = 5/7 (Source: Link Spam Detection Based on Mass Estimation, VLDB 2006)
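The example numbers above can be written out step by step. This assumes, as in the mass-estimation paper, that a page's PageRank splits into a part contributed by good nodes and a part contributed by spam nodes; the superscripts $+$ and $-$ denote those two parts, and $p_0 = 7$ follows from the ratio $5/7$.

```latex
\begin{align*}
p_0 &= p_0^{+} + p_0^{-}, \\
\text{absolute mass} &= p_0^{-} = 5, \\
\text{relative mass} &= \frac{p_0^{-}}{p_0} = \frac{5}{7}, \\
p_0^{+} &= p_0 - p_0^{-} = 7 - 5 = 2 .
\end{align*}
```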

  38. Spam Mass: Estimation • Approximate the set of good nodes by a subset called the good core (Source: Link Spam Detection Based on Mass Estimation, VLDB 2006)

  40. Spam Mass: Algorithm • Create the good core • Compute PageRank scores pi and pi+ • For all pages i with large PageRank: • Compute the estimated relative mass mi as (pi – pi+) / pi • Mark the page as spam if mi > threshold (Source: Link Spam Detection Based on Mass Estimation, VLDB 2006)
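The algorithm steps can be sketched on a toy graph. This is a simplified reading, not the paper's implementation: it assumes linear (unnormalised) PageRank, a core-based jump vector that puts 1/n on each good-core page and 0 elsewhere without any rescaling, and made-up values for the PageRank cut-off and the mass threshold. The function names and the graph are hypothetical.

```python
def pagerank(out_links, jump, damping=0.85, iterations=100):
    """Linear PageRank by power iteration: p = damping * T^T p + (1 - damping) * jump.
    Dangling pages simply leak score; the result is not renormalised."""
    p = dict(jump)
    for _ in range(iterations):
        nxt = {node: (1 - damping) * jump[node] for node in out_links}
        for node, targets in out_links.items():
            for t in targets:
                nxt[t] += damping * p[node] / len(targets)
        p = nxt
    return p

def detect_link_spam(out_links, good_core, min_pagerank, threshold, damping=0.85):
    """Flag pages with large PageRank whose estimated relative mass exceeds threshold."""
    n = len(out_links)
    uniform = {x: 1 / n for x in out_links}
    core_only = {x: (1 / n if x in good_core else 0.0) for x in out_links}
    p = pagerank(out_links, uniform, damping)         # p_i
    p_plus = pagerank(out_links, core_only, damping)  # p_i^+, driven by the good core only
    flagged = []
    for x in out_links:
        if p[x] < min_pagerank:
            continue  # only pages with large PageRank are examined
        mass = (p[x] - p_plus[x]) / p[x]
        if mass > threshold:
            flagged.append(x)
    return flagged

# Toy graph (hypothetical): three spam pages boost "target"; g1 and g2 form the good core.
graph = {
    "g1": ["g2"], "g2": ["g1", "honest"], "honest": [],
    "s1": ["target"], "s2": ["target"], "s3": ["target"], "target": [],
}
print(detect_link_spam(graph, {"g1", "g2"}, min_pagerank=0.05, threshold=0.5))
```

In this toy graph "honest" still ends up with a non-zero estimated mass because it is not itself in the good core, which is one reason a threshold is needed rather than a test for zero.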

  41. Hiding-based Detection

  42. Related Work • M. Najork. System and method for identifying cloaked web servers, June 21 2005. U.S. Patent number 6,910,077. • B. Wu and B. D. Davison. Cloaking and redirection: A preliminary study. [AIRWeb 2005] • B. Wu and B. D. Davison. Detecting Semantic Cloaking on the Web. [WWW 2006]

  44. Motivation • C: a page from the crawler's perspective • B: a page from the browser's perspective • Web pages may be updated frequently, so two copies fetched at different times can differ even without cloaking • Compare: if the difference between C1 and B1 is bigger than the difference between C1 and C2, this is enough evidence that the page is cloaking (Source: Cloaking and redirection: A preliminary study, AIRWeb 2005)

  45. Detecting Cloaking • Compare C1 with C2, and C1 with B1 • Compare both terms and links • Choose a threshold for each comparison (Source: Cloaking and redirection: A preliminary study, AIRWeb 2005)
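The term comparison can be sketched as follows, with C1 and C2 standing for two crawler copies and B1 for a browser copy of the same URL. The difference measure used here (the number of distinct terms found in only one of the two copies) and the default threshold of 0 are assumptions for illustration, not the measure from the cited study.

```python
def term_difference(page_a, page_b):
    """Number of distinct terms that appear in exactly one of the two copies."""
    return len(set(page_a.lower().split()) ^ set(page_b.lower().split()))

def looks_cloaked(c1, c2, b1, threshold=0):
    """Flag a page when the crawler-vs-browser difference exceeds the
    crawler-vs-crawler difference by more than threshold."""
    return term_difference(c1, b1) - term_difference(c1, c2) > threshold

# Hypothetical copies of one URL.
c1 = "cheap cameras buy cheap lenses free deals today"
c2 = "cheap cameras buy cheap lenses free deals now"
b1_cloaked = "welcome to our casino play poker online"
b1_dynamic = "cheap cameras buy cheap lenses free deals tonight"

print(looks_cloaked(c1, c2, b1_cloaked))
print(looks_cloaked(c1, c2, b1_dynamic))
```

The second crawler copy acts as a baseline for ordinary page churn, so a frequently updated page is not flagged merely because its browser copy differs slightly. The same comparison could be repeated for links.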

  46. Other Detection Methods

  47. Related Work • C. Castillo, D. Donato, A. Gionis, V. Murdock and F. Silvestri. Know your neighbors: web spam detection using the web topology. [SIGIR 2007] • S. Webb, J. Caverlee and C. Pu. Characterizing web spam using content and HTTP session analysis. [CEAS 2007] • S. Webb, J. Caverlee and C. Pu. Predicting Web Spam with HTTP Session Information. [CIKM 2008] • …

  49. Web Topology Detection • Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages • Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000] • Spam tends to be clustered on the Web (black on figure) (Source: Know your neighbors: web spam detection using the web topology, SIGIR 2007)

  50. Clustering • If the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam • The inverse holds true too (Source: Know your neighbors: web spam detection using the web topology, SIGIR 2007)
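The majority rule above can be sketched as a post-processing step over a classifier's per-host predictions. The clusters are assumed to be given already (how the hosts are clustered is outside this sketch), ties are left untouched as a choice made here, and the function name `smooth_by_cluster` and the sample hosts are hypothetical.

```python
def smooth_by_cluster(predictions, clusters):
    """predictions: host -> bool (True = spam); clusters: list of lists of hosts.
    Every host in a cluster receives the cluster's majority label."""
    smoothed = dict(predictions)
    for hosts in clusters:
        spam_votes = sum(predictions[h] for h in hosts)
        if spam_votes * 2 == len(hosts):
            continue  # exact tie: keep the classifier's own predictions
        majority_is_spam = spam_votes * 2 > len(hosts)
        for h in hosts:
            smoothed[h] = majority_is_spam
    return smoothed

# Hypothetical base predictions and clusters.
predictions = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": True}
clusters = [["a", "b", "c"], ["d", "e", "f"]]
print(smooth_by_cluster(predictions, clusters))
```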
