Link Spamming Wikipedia for Profit
Andrew G. West, Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee
CEAS '11 – September 1, 2011

Presentation Transcript


  1. Link Spamming Wikipedia for Profit. Andrew G. West, Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee. CEAS '11 – September 1, 2011

  2. Overview/Outline • How do wikis/Wikipedia prevent link spam? • How common is wiki/Wikipedia link spam? • How “successful” are the attack vectors? • Might there be more effective ones? (yes) • How would one defend against them?

  3. Defining Link Spam • Any violation of external link policy [2] • Commercial; non-notable sources: fan pages, blogs, etc. • Two dimensions: destination (URL) and presentation • HTML nofollow (Figure: link spam example)

  4. Motivations Spam not uncommon in collaborative/UGC apps (surveyed in [9,12]) • Wikipedia/wikis are unique: • Edit-anywhere (no append-only semantics) • Global editing (not network limited) • Community-driven mitigation • Extremely high traffic (#7 in Alexa) • Potential for traffic/profit (e.g., Amazon [14])

  5. STATUS QUO OF DEFENSE MECHANISMS

  6. Single-link Mitigation Assume a “clean” account adds a “new” spam link:

  7. Aggregate Mitigation • Problematic URLs: URL blacklists [3]; manually maintained; local + global versions; ≈17k entries combined • Malicious Accounts: warning system [10]; 4 warnings without consequence; 5th blocks the account • Malicious Collectives: either Sybil "sock-puppets" or actual collectives; manual signature detection or IP correlation • Unauthorized Bots: bots can be very fast; rate-limits, CAPTCHAs [15], special software
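
To make the URL-blacklist mechanism concrete, the following is a minimal sketch of checking an added link against a regex-style blacklist (the MediaWiki spam blacklist stores one regular-expression fragment per line). The patterns and function name are illustrative, not the actual extension code.

    import re

    # Illustrative blacklist entries: regex fragments matched against added URLs,
    # in the spirit of MediaWiki's SpamBlacklist extension (not its real rule set).
    BLACKLIST_PATTERNS = [
        r"cheap-pills-online\.example",
        r"free-.*-downloads\.example",
    ]

    def is_blacklisted(url: str) -> bool:
        """Return True if the URL matches any blacklist pattern."""
        return any(re.search(p, url, re.IGNORECASE) for p in BLACKLIST_PATTERNS)

    print(is_blacklisted("http://cheap-pills-online.example/buy"))   # True
    print(is_blacklisted("http://en.wikipedia.org/wiki/Spam"))       # False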

  11. Aggregate Mitigation (cont.) • Same mechanisms as above (blacklists, warning system, collective/bot detection) • TAKEAWAY: only humans can catch "new" instances; aggregate mechanisms must wait for atomic instances to compound before they can take effect • HUMAN LATENCY!

  12. STATUS QUO OF WIKIPEDIA SPAMMING

  13. Corpus Creation • "Spam" edits are those that: (1) added exactly one external link; (2) made no changes outside the context of that link; (3) were "rolled back" (expedited admin. undo) • Edits meeting (1 && 2 && !3) = "Ham" • Edits meeting (3) = "Damaging"
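
A minimal sketch of the corpus-labeling logic described on this slide; the function and argument names are illustrative, not the authors' actual tooling.

    def label_edit(added_one_link: bool, only_link_changed: bool, rolled_back: bool) -> str:
        """Label a Wikipedia edit per criteria (1), (2), (3) above."""
        if added_one_link and only_link_changed and rolled_back:
            return "SPAM"       # (1) && (2) && (3)
        if added_one_link and only_link_changed and not rolled_back:
            return "HAM"        # (1) && (2) && !(3)
        if rolled_back:
            return "DAMAGING"   # (3) alone: rolled back, but other changes present
        return "OTHER"          # outside the corpus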

  14. Corpus Example (1) • Because the link was the ONLY change made, the privileged user's decision to roll back that edit speaks DIRECTLY to that link's inappropriateness.

  15. Corpus Example (2)

  16. Spam Genres • TAKEAWAY: spam is categorically diverse; "subtle" (information-adjacent services); perhaps not monetarily driven? (Figure: spam by ODP/DMOZ category)

  17. Spam Placement TAKEAWAY: • Conventions followed • Subtlety for persistence?

  18. Bad Domains + Blacklist • TAKEAWAYS: wiki spammers ≠ email spammers (Venn: email-spam URLs ∩ wiki-spam URLs = ∅) • Domain statistics don't suggest maximal utility: only 2 of the 25 worst domains were blacklisted; only 14 domains appear 10+ times in {SPAM}
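
As a sketch of how such domain statistics could be derived, assuming only a list of spam URLs and a set of blacklisted domains (both hypothetical stand-ins here), domain frequencies can be tallied and compared against blacklist coverage:

    from collections import Counter
    from urllib.parse import urlparse

    # Hypothetical inputs standing in for the {SPAM} corpus and the blacklist.
    spam_urls = ["http://a.example/x", "http://a.example/y", "http://b.example/z"]
    blacklisted_domains = {"b.example"}

    domain_counts = Counter(urlparse(u).netloc for u in spam_urls)
    frequent = [d for d, n in domain_counts.items() if n >= 10]
    worst_25 = [d for d, _ in domain_counts.most_common(25)]
    covered = sum(1 for d in worst_25 if d in blacklisted_domains)

    print(f"{len(frequent)} domains appear 10+ times; {covered} of the 25 worst are blacklisted")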

  19. Spam Perpetrators • 57% of spam added by non-registered users, yet we will show registered accounts beneficial • Worst users map onto worst domains: dedicated spam accounts, most now blocked (Figure: geo-locating spammers)

  20. Spam Life/Impact • Spam lifespan: 19 minutes at median, vs. 85 secs. for general damage; note the reason for the difference • Spam page views: proxy for "link views" and our metric of choice; 6.05 views per spam link
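
A minimal sketch of how the lifespan and view statistics could be computed, assuming hypothetical per-link records of (time added, time reverted, estimated page views while live); the records and field layout are illustrative.

    from statistics import mean, median
    from datetime import datetime, timedelta

    # Hypothetical corpus records: (added, reverted, estimated views while live)
    corpus = [
        (datetime(2011, 1, 1, 12, 0), datetime(2011, 1, 1, 12, 19), 7.0),
        (datetime(2011, 1, 2, 9, 30), datetime(2011, 1, 2, 9, 45), 4.5),
        (datetime(2011, 1, 3, 8, 0),  datetime(2011, 1, 3, 8, 2),  1.1),
    ]

    lifespans_min = [(reverted - added) / timedelta(minutes=1) for added, reverted, _ in corpus]
    print(f"median lifespan: {median(lifespans_min):.1f} minutes")
    print(f"mean views per spam link: {mean(v for _, _, v in corpus):.2f}")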

  21. Broadening Search • Maybe our corpus just missed something? • Archives show some abuse (but non-automated) • Deleted revisions; media coverage • SUMMARY: Status quo strategies unsuccessful • ≈ 6 views/link not likely to be profitable • Patrollers un-fooled by subtle strategies which seem to aim for “link persistence” • Cause or effect of unsophisticated strategies?

  22. A NOVEL ATTACK MODEL (inspired by [15])

  23. Attack Summary MODEL: Abandon deception, aggressively exploit latency of human detection process. Attack characterized by 4 vectors: • Target high-traffic pages • Autonomous attainment of privileged accounts; mechanized operation thereof • Prominent link placement/style • Distributed

  24. Popular Pages (1)

  26. Popular Pages (1) • Imagine 85 seconds on these pages! • Why not just protect these somehow? • Next: Account-level vulnerabilities

  27. Popular Pages (2)

  28. Privileged Accounts • Becoming autoconfirmed: outsource the CAPTCHA solve [15]; requires 10 "good" edits (i.e., ones not drawing warnings/a block); non-vetted namespaces, helpful bots, and thesaurus attacks make those edits cheap • Conduct campaigns via the API [1] at high speed • "Anti-bot" software found ineffective

  29. Prominent Placement • Wiki markup that renders an external link in huge, bold text: <p style="font-size:5em;font-weight:bolder">[http://www.example.com Example link]</p>

  30. Distributed Attack Two notions of “distributed”: • Need IP-agility to avoid IP (range) blocks • What spammer doesn’t? • Use open-proxies, existing botnet, etc. • There are many sites one can target • Wiki language editions; WMF sister-sites • Universal API [1] into MediaWiki installs

  31. MODEL EFFECTIVENESS & DEFENSE STRATEGIES

  32. User Responses • Administrative response: expected flow to campaign termination • Very conservative example: 1 min. account survival = 70 links placed; top 70 articles @ 1 min. each (≈30 views/min per article) = 2,100 active views • Reader response: sources of link exposure • Active views: link in the default page version • Inactive views: version histories and watchlisters; content scrapers and mashup apps • Click-through desensitization (email spam? [13])
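
A back-of-envelope sketch of the active-view arithmetic above; the per-article view rate is simply the value implied by the slide's totals (2,100 views / 70 articles / 1 minute), not an independent measurement.

    account_survival_min = 1        # assumed minutes before the account is blocked
    links_placed = 70               # links one mechanized account places in that time (slide's figure)
    link_lifetime_min = 1           # each link assumed live ~1 minute before reversion
    views_per_article_min = 30      # implied rate: 2,100 / 70 / 1

    active_views = links_placed * link_lifetime_min * views_per_article_min
    print(active_views)             # 2100 active views from one short-lived account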

  33. Economics • Cost of campaigns (about $1 marginal) • Affiliate programs; 50% commissions [13] • CAPTCHA per account; $1 per thousand [15] • Domain names; $1-$2 each • Minimal labor costs (< 100 LOC) • Expected return-on-investment; extrapolate from “male enhancement pharmacy” study [13] • 2100 exposures -> 20 click-through -> $5.20 gross • Affiliate fees: $5.20 -> $2.60 net >> $1 marginal • Why not seen live? Naivety? Scale?
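
A minimal sketch of the return-on-investment extrapolation; the click-through and per-click revenue figures are back-solved from the slide's own numbers (which extrapolate from the Spamalytics study [13]) rather than measured independently.

    exposures = 2100                 # active views from one campaign (previous slide)
    click_rate = 20 / 2100           # ~0.95%, back-solved from "2100 exposures -> 20 click-throughs"
    revenue_per_click = 5.20 / 20    # $0.26 gross per click, back-solved from "$5.20 gross"
    affiliate_share = 0.50           # spammer keeps a 50% affiliate commission [13]
    marginal_cost = 1.00             # ~$1 per campaign: CAPTCHA solves, amortized domains

    gross = exposures * click_rate * revenue_per_click
    net = gross * affiliate_share
    print(f"gross ${gross:.2f}, net ${net:.2f}, profit ${net - marginal_cost:.2f}")
    # gross $5.20, net $2.60, profit $1.60 -> positive expected return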

  34. Defense Strategies (1) • Ethical issues; WMF notification • Focus here on technical defenses (setting aside sociological aspects) • Require explicit approval: prevent links from going live until vetted; the controversial "Flagged Revisions" proposal • Privilege configuration: edit count is a poor metric (see [8]); no human can do 70 edits/min. – maybe 5 edits/min.?; tool-expedited users should have separate status
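
To illustrate the per-account rate-limit idea (no human sustains 70 edits/min), here is a minimal sliding-window throttle sketch; the 5 edits/min threshold follows the slide, and everything else is an illustrative assumption rather than MediaWiki's actual throttling code.

    import time
    from collections import defaultdict, deque

    EDITS_PER_MINUTE = 5               # threshold suggested on the slide
    WINDOW_SECONDS = 60.0

    _recent = defaultdict(deque)       # account name -> timestamps of recent edits

    def allow_edit(account: str, now: float | None = None) -> bool:
        """Permit the edit only if the account made fewer than EDITS_PER_MINUTE
        edits within the past WINDOW_SECONDS (sliding window)."""
        now = time.time() if now is None else now
        window = _recent[account]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= EDITS_PER_MINUTE:
            return False
        window.append(now)
        return True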

  35. Defense Strategies (2) • Autonomous signature-driven detection [19]: human latency gone (and a dwindling human workforce [11]) • Machine-learning classifier over: Wikipedia metadata [5] (URL addition rates, editor permissions); landing-site analysis [7] (commercial intent, SEO); third-party data (Alexa web crawler, Google Safe Browsing) • Implemented and operational on English Wikipedia • Offline analysis: 66% status-quo spam catch rate at 0.5% false-positive rate • (Architecture diagram: Wikipedia edit feed via IRC #enwiki → STiki services fetch edits over the Wiki-API, score them with the anti-spam algorithm, and maintain an edit queue ordered from likely vandalism to likely innocent; STiki clients display and classify queued edits; bot logic: if score > threshold, revert, else leave for review)
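
A minimal sketch of the threshold-based bot logic shown in the diagram, with a placeholder scoring function standing in for the machine-learning classifier over metadata, landing-site, and third-party features; the feature names, weights, and threshold are illustrative, not STiki's real implementation.

    SPAM_THRESHOLD = 0.9   # illustrative operating point (trades catch rate against false positives)

    def spam_score(edit: dict) -> float:
        """Placeholder for the ML classifier: combine a few metadata-style signals
        into a [0, 1] score. Real features include URL addition rates, editor
        permissions, landing-site analysis, and third-party reputation data."""
        score = 0.0
        if edit.get("editor_is_anonymous"):
            score += 0.4
        if edit.get("links_added", 0) >= 1:
            score += 0.3
        if edit.get("landing_site_commercial"):
            score += 0.3
        return min(score, 1.0)

    def handle_edit(edit: dict) -> str:
        """Bot logic from the diagram: revert if the score exceeds the threshold,
        otherwise queue the edit for human review (most suspicious first)."""
        if spam_score(edit) > SPAM_THRESHOLD:
            return "REVERT"
        return "ENQUEUE_FOR_REVIEW"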

  36. References (1)
[01] MediaWiki API. http://en.wikipedia.org/w/api.php
[02] Wikipedia: External links. http://en.wikipedia.org/wiki/WP:EL
[03] Wikipedia spam blacklists. http://en.wikipedia.org/wiki/WP:BLACKLIST
[04] WikiProject spam. http://en.wikipedia.org/wiki/WP:WPSPAM
[05] B. Adler, et al. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In CICLing 2011.
[06] J. Antin and C. Cheshire. Readers are not free-riders: Reading as a form of participation on Wikipedia. In CSCW 2010.
[07] H. Dai, et al. Detecting online commercial intention (OCI). In WWW 2006.
[08] P. K.-F. Fong and R. P. Biuk-Aghai. What did they do? Deriving high-level edit histories in wikis. In WikiSym 2010.
[09] H. Gao, et al. Detecting and characterizing social spam campaigns. In CCS 2010.
[10] R. S. Geiger and D. Ribes. The work of sustaining order in Wikipedia: The banning of a vandal. In CSCW 2010.

  37. References (2)
[11] E. Goldman. Wikipedia's labor squeeze and its consequences. Journal of Telecomm. and High Tech. Law, 8, 2009.
[12] P. Heymann, et al. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Computing, 11(6):36–45, 2007.
[13] C. Kanich, et al. Spamalytics: An empirical analysis of spam marketing conversion. In CCS 2008.
[14] C. McCarthy. Amazon adds Wikipedia to book-shopping. http://news.cnet.com/8301-13577_3-20024297-36.html, 2010.
[15] M. Motoyama, et al. Re: CAPTCHAs - Understanding CAPTCHA-solving services in an economic context. In USENIX Security 2010.
[16] R. Priedhorsky, et al. Creating, destroying, and restoring value in Wikipedia. In GROUP 2007 (ACM Conference on Supporting Group Work).
[17] Y. Shin, et al. The nuts and bolts of a forum spam automator. In LEET 2011.
[18] B. E. Ur and V. Ganapathy. Evaluating attack amplification in online social networks. In W2SP 2009 (Web 2.0 Security and Privacy).
[19] A. G. West, et al. Autonomous link spam detection in purely collaborative environments. In WikiSym 2011.
