A Quantitative Study of Forum Spamming Using Context-Based Analysis
Yi-Min Wang^, Ming Ma^, Yuan Niu*, Hao Chen*, Francis Hsu*
*UC Davis, ^Microsoft Research
A Look at the Web [figure labels: User, Spammer]
Why do we care about spam? • Users want to: look at quality pages on the web; interact without the trouble of moderation; surf safely • Search engines want to: provide good search results; profit from ads • We want to investigate the landscape of the problem • Popular battleground: web forums
Why Web Forums? • Open communities: wiki, forums, blogs • Increasingly easy to contribute
How Spammers Operate [diagram] • Entities: User, Spammer, Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain • Numbered steps: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User • Unnumbered edge label: Returns
How to deal with the problem? • Content-based approach: constrained by the content retrieved; may be deceived by tricks like cloaking and redirection • We propose: context-based analysis
Context-based Analysis • Consists of redirection analysis and cloaking analysis • Sees dynamic content not served to crawlers • Uses the Strider URL Tracer • Flags large numbers of doorway pages that lead to spam domains • Based on the intuition that: publishing links is necessary to increase popularity; we must see the destination URL eventually
Doorways & Redirections Google search: Coach handbag
Redirection Analysis • Fed URLs to the Strider URL Tracer, which records all pages visited • Ranked the top third-party domains by number of redirections • Started from a known spammer domain as the seed • Identified doorway pages based on association with spammer domains • Manually investigated unknown domains to expand the blacklist
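The ranking and flagging steps above can be sketched as follows. This is a minimal illustration, assuming each trace is available as a doorway URL mapped to the list of URLs visited while loading it. The helper names, the record format, and the sample URLs are hypothetical and do not reflect the Strider URL Tracer's actual output.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Return the lowercase hostname of a URL (empty string if absent)."""
    return (urlparse(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, per third-party host, how many doorway URLs led to it.

    `traces` maps a doorway URL to the list of URLs visited while loading it
    (a hypothetical stand-in for a URL tracer's log).
    """
    counts = Counter()
    for doorway, visited in traces.items():
        first_party = host(doorway)
        third_party = {host(u) for u in visited} - {first_party, ""}
        counts.update(third_party)  # each doorway counts once per host
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return doorway URLs whose traces touch a blacklisted host."""
    return sorted(
        d for d, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Made-up example data for illustration only.
traces = {
    "http://a.example/page1": ["http://a.example/page1",
                               "http://spam.example/land"],
    "http://b.example/post": ["http://b.example/post",
                              "http://spam.example/land",
                              "http://ads.example/x"],
    "http://c.example/ok": ["http://c.example/ok"],
}
ranking = rank_third_party_domains(traces)
flagged = flag_doorways(traces, blacklist={"spam.example"})
```

Hosts near the top of `ranking` that are not yet on the blacklist are the candidates for the manual investigation step.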
Cloaking Analysis • Diff-based check • Run URL twice – once with anti-cloaking, once without • Crawler-browser cloaking (User-agent, scripting-on/off) • Click-through cloaking (Referer)
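The diff-based check above can be sketched as a comparison of two already-fetched views of the same URL. This is only an illustration: fetching the two views (for example with different User-Agent, Referer, or scripting settings) is out of scope here, and the function names and the 0.5 threshold are assumptions, not values from the study.

```python
import difflib


def similarity(a, b):
    """Similarity ratio in [0, 1] between two page bodies."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def looks_cloaked(crawler_view, browser_view, threshold=0.5):
    """Flag a URL when the two views are less similar than the threshold.

    The 0.5 default is an arbitrary placeholder, not a value from the study.
    """
    return similarity(crawler_view, browser_view) < threshold


# Made-up page bodies for illustration only.
same = looks_cloaked("<p>forum post</p>", "<p>forum post</p>")
different = looks_cloaked("a" * 50, "b" * 50)
```

Legitimate dynamic pages can also differ between two fetches, so a simple threshold like this would need tuning or a manual review step in practice.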
Crawler-Browser Cloaking • Example from Google search: ringtones download • www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html [screenshots: JavaScript enabled vs. JavaScript disabled]
Click-Through Cloaking • [screenshot panels: advertising page from click-throughs; cached page / scripting off / crawler view; directly visiting the page]
Three Perspectives [diagram] • Perspectives: Search User, Webhost, Spammer • Entities: Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain • Numbered steps: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User • Unnumbered edge label: Returns
Search User • Chose 9 popular forum software packages (written in Perl/PHP, hosted/unhosted) • WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin • Compiled popular tags and common spam terms into a list of 190 keywords • “Myspace, jewelry, casino, shopping, baseball…” • Searched for all <keyword, forum-software> pairs in Google & MSN
Search User • 189 of 190 search terms returned spammed forums in the top 20 results from both Google and MSN • The only exception is “palm-texas-holdem-game” • Top 5 most spammed forums:
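Enumerating the <keyword, forum-software> pairs can be sketched as below. The query string format and the `build_queries` helper are assumptions for illustration; only the five keywords quoted on the slide are used, since the full 190-keyword list is not given.

```python
from itertools import product

forum_software = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]
# Only the five keywords quoted on the slide; the full list had 190.
keywords = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, software):
    """Return one query string per <keyword, forum-software> pair."""
    return [f'{kw} "{sw}"' for kw, sw in product(keywords, software)]


queries = build_queries(keywords, forum_software)
# With the full keyword list, the same pairing yields this many queries
# per search engine.
full_query_count = 190 * len(forum_software)
```

With 9 packages and 190 keywords, the full pairing comes to 1,710 queries per search engine.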
Honeyblogs • Spammers: • Create their own doorway pages, and • Promote the doorways by posting to other people’s pages • Honeyblogs lure the spammer in: • No moderation, default accept all policy • Pinged blog aggregators with every post • Abandoned within three months
Honeyblogs • 41,100 comments collected over 339 days • 19,297 comments received in the last month • Ilium – 930/1432 • Litlog – 3734/5714 • Spammer activity got me kicked off my hosting server
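A quick back-of-the-envelope check of the totals above shows how strongly activity was concentrated at the end of the collection period. The variable names are illustrative; the three input numbers are taken from the slide.

```python
total_comments = 41_100
days_observed = 339
last_month_comments = 19_297

# Average over the whole collection period.
avg_per_day = total_comments / days_observed
# Fraction of all comments that arrived in the last month.
last_month_share = last_month_comments / total_comments

print(f"average: {avg_per_day:.1f} comments/day")
print(f"last month: {last_month_share:.1%} of all comments")
```

That is roughly 121 comments per day on average, with about 47% of all comments arriving in the last month alone.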
Honeyblog Activity [chart; surviving data label: 3142]
Webhost Perspective • Focus on splog doorways • Reported numbers are lower bounds • Consider only pages using cloaking & redirection
Webhost Perspective • Blogspot: 1,091 splogs • Most popular • Randomly sampled 1% of profile pages created in July and extracted all blog links – 13,389 • 60% of splogs used cloaking • 24% of splogs redirected to filldirect.com
Webhost Perspective • Blogspoint: 3535 splogs • 2166 redirected to finance-web-search.com • 917 redirected to casino-web-search.com • Blogstudio: 198 splogs • 130 redirected to finance-web-search.com • 54 redirected to casino-web-search.com • Blogsharing: 82 splogs • Plumber related link spamming in splogs
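The counts above can be turned into per-host shares with a few lines of arithmetic. This is a small sketch using only the numbers on the slide; the `redirect_shares` helper and the data layout are illustrative.

```python
# Counts copied from the slide: host -> (total splogs, {target domain: count}).
splog_counts = {
    "Blogspoint": (3535, {"finance-web-search.com": 2166,
                          "casino-web-search.com": 917}),
    "Blogstudio": (198, {"finance-web-search.com": 130,
                         "casino-web-search.com": 54}),
}


def redirect_shares(counts):
    """Return {host: {domain: fraction of that host's splogs}}."""
    return {
        host: {domain: n / total for domain, n in targets.items()}
        for host, (total, targets) in counts.items()
    }


shares = redirect_shares(splog_counts)
for host, by_domain in shares.items():
    for domain, share in by_domain.items():
        print(f"{host}: {share:.1%} -> {domain}")
```

Assuming each splog redirects to only one of the two domains, they together account for about 87% of the Blogspoint splogs and about 93% of the Blogstudio splogs.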
Also of note… • Malicious URLs • Previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities • We tested 8 known malicious URLs for presence on the web • Found 5 spammed in forums, 2 in link farms, 1 in referrer logs • Universal redirectors • Redirects user to any URL (sometimes destination is obfuscated): • www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=[your url here] • http://tinyurl.com/3c7twl • http://www.canadianpharmacyltd.com/group.php?id=59&aid=860 • Could be used to serve malicious URLs, particularly those on .edu and .gov sites
[1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
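One simple way to spot the non-obfuscated form of a universal redirector is to look for a query parameter whose value is itself a URL. The sketch below is a hypothetical heuristic, not a method from the study; `embedded_targets` and the `target.example` destination are made up for illustration.

```python
from urllib.parse import parse_qsl, urlparse


def embedded_targets(url):
    """Return query-parameter values that are themselves http(s) URLs."""
    query = urlparse(url).query
    targets = []
    for _, value in parse_qsl(query):
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            targets.append(value)
    return targets


# The first URL follows the pattern on the slide, with a placeholder target.
plain = embedded_targets(
    "www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=http://target.example/"
)
# Obfuscated redirectors carry no visible destination, so nothing is found.
short = embedded_targets("http://tinyurl.com/3c7twl")
opaque = embedded_targets(
    "http://www.canadianpharmacyltd.com/group.php?id=59&aid=860"
)
```

Obfuscated cases such as URL shorteners or opaque numeric IDs are not caught by this check; following the redirection is the only way to see where those lead.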
Related Work (Part 1) • Diff-based cloaking • Wu & Davison – Diff-based cloaking combined with content-based analysis • Our approach detects click-through cloaking • Content-based approaches • Fetterly, Manasse and Najork – URL properties, clustering pages of similar content • Mishne, Carmel, Lempel – Compared statistical models of comments & target pages against post content • Kolari, Finin and Joshi – Meta tag text, anchor text, URLs • Our approach is complementary to content-based approaches
Related Work (Part 2) • Measurements of Trust • Metaxas et al – Defined trust neighborhoods • Benczur et al – SpamRank: Identify outliers by looking at PageRank of the site and its “supporters” • Similarly, our approach propagates distrust by following redirections • Plugins to aid moderating forums/blogs • Akismet • Bad Behavior, Spam Karma • Our approach does not require cooperation from forum owners
Conclusions • Context-based approach successfully detects advanced cloaking- and redirection-based spam • Spammers are pervasive • 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN • Same spammer redirecting to two domains on Blogspoint and Blogstudio
Future work • There is hope! • Economic solution • Identifies middlemen in online advertising • Read our WWW07 paper [1] • http://wwwcsif.cs.ucdavis.edu/~niu • http://research.microsoft.com/csm/strider/
[1] Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.