How to beat an Adaptive Spam Filter
John Graham-Cumming
Creator and Maintainer of POPFile
Research Director, Sophos’s Anti-Spam Task Force
Token Space • [Figure: token space diagram; the only surviving label is “neither”]
“Red Coat” Spams • Obfuscated spam is trivial to spot and filter • No need to even read the text, the obfuscations are enough • No real email contains the word Viagra written V<font size=0> </font>i<font size=0> </font>a<font size=0> </font>g<font size=0> </font>r<font size=0> </font>a • “Field Guide to Spam” highlights spammer obfuscations: Invisible Ink, Camouflage, Hypertextus Interruptus... • www.sophos.com/spaminfo
POPFile's working great for me... but not 100% • November 3, 2003 through December 22, 2003 • Total mails received: 52,931 • Total spams: 35,928 (68%!) • Total spams missed: 125 • So POPFile ~99.7% accurate • 1 in 254 spams gets through... why?
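The arithmetic behind this slide can be checked with a few lines of Python, using only the figures shown above. Note that the counts as given work out to roughly 1 missed spam in 287 rather than 1 in 254; the slide's ratio may rest on counts that are not shown here.

```python
# Figures copied from the slide above.
total_mail = 52_931
total_spam = 35_928
missed_spam = 125

spam_share = total_spam / total_mail   # fraction of all mail that was spam
miss_rate = missed_spam / total_spam   # fraction of spam that got through
catch_rate = 1 - miss_rate             # fraction of spam that was caught
one_in_n = total_spam / missed_spam    # "1 in N spams gets through"

print(f"spam share: {spam_share:.1%}")
print(f"catch rate: {catch_rate:.2%}")
print(f"1 in {one_in_n:.0f} spams missed")
```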
Taxonomy of filter busting spams • 52%: “picospams” • 13%: RTF • 9%: Challenge/Response • 9%: NDR • 4%: Totally blank • 13%: Other, including: multiple copies of an offer for an “Incredible Spam Filter”; a message in Hebrew
RTF • Microsoft email clients sniff Rich Text Format • (actually they sniff a lot of different formats) Content-Type: text/plain {\rtf1\ansi\ansicpg1252\uc1 \deff0\deflang1046\deflangfe1046{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;} {\f16\froman\fcharset129\fprq2{\*\panose 02030600000101010101}Batang{\*\falt??};}{\f28\froman\fcharset0\fprq2{\*\panose 02040602050305030304}Book Antiqua;}{\f29\froman\fcharset129\fprq2{\*\panose 00000000000000000000}@Batang;}{\f40\froman\fcharset238\fprq2 Times New Roman CE;}
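A filter can apply the same kind of sniffing the mail client does. The following is a minimal sketch, not POPFile's behaviour: it flags a message that is declared as text/plain but whose body starts with the RTF signature. The function name `looks_like_rtf` and the sample message are hypothetical.

```python
from email import message_from_string

RTF_MAGIC = "{\\rtf"  # RTF documents begin with this control word


def looks_like_rtf(raw_message: str) -> bool:
    """Return True if a text/plain message body actually starts with RTF."""
    msg = message_from_string(raw_message)
    if msg.is_multipart() or msg.get_content_type() != "text/plain":
        return False
    body = msg.get_payload()  # a str for non-multipart messages
    return body.lstrip().startswith(RTF_MAGIC)


# Hypothetical sample: the header says plain text, the body is RTF.
sample = (
    "Content-Type: text/plain\n"
    "\n"
    "{\\rtf1\\ansi\\ansicpg1252\\uc1 \\deff0 ...}\n"
)
print(looks_like_rtf(sample))
```

A filter that spots the mismatch could either treat it as a spam signal in its own right or convert the RTF to text before tokenizing.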
Challenge/Response • Received a number of “fake challenges” • Challenges directed me to a spammer's web site • This is how spammers can kill C/R • Personal note: I don't “do” C/R. If I mail you and you challenge me I hit delete, because, as Dan Quinlan put it: “C/R is the ultimate email diss. By using it you are saying, 'my time is more important than yours.'”
Non-deliverable Response • As well as faking C/R messages, spammers fake NDRs • The NDR has the “original email” (actually a spam) as an attachment • Spammers can even get NDRs generated for them by badly configured mail servers • Send spam to a known-bad address on a mail server, with a forged From address • The mail server sends an NDR to the forged From address, attaching the spam
picospams • Spam containing either: • As few tokens as possible, e.g. robin: http://www.xg187.com • Only HTML tokens, e.g. <a href=http://www.spammersite.com/><img src=http://www.spammersite.com/img></a> • Picospams got through because they: • Hadn't been seen before • Contained “good” headers • Had “word salad” • Thanks to Robin Keir for the tiny “robin:” mail
“Good Headers” • The combination of two things leads to the ham tokens outweighing the spam • picospam text • Relaying the message through a good server • Suitable good servers are: • Mail relays like acm.org, alum.mit.edu • SourceForge.net • Mailing lists
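Why a couple of “good” header tokens can outweigh a picospam body is easiest to see with numbers. The sketch below uses a naive Bayes style sum of log-likelihood ratios; every token name and probability in it is invented for illustration and none of it is taken from POPFile.

```python
import math

# Hypothetical per-token probabilities: (P(token | spam), P(token | ham)).
# These numbers are invented purely for illustration.
TOKEN_PROBS = {
    "http://www.spammersite.com/": (0.020, 0.0001),
    "header:received:acm.org": (0.0002, 0.010),
    "header:list-id": (0.0005, 0.020),
}


def log_odds(tokens, prior_spam=0.5):
    """Sum of log-likelihood ratios; > 0 leans spam, < 0 leans ham."""
    score = math.log(prior_spam / (1 - prior_spam))
    for tok in tokens:
        if tok in TOKEN_PROBS:  # unseen tokens contribute nothing in this toy
            p_spam, p_ham = TOKEN_PROBS[tok]
            score += math.log(p_spam / p_ham)
    return score


body_only = ["http://www.spammersite.com/"]
relayed = body_only + ["header:received:acm.org", "header:list-id"]

print(log_odds(body_only))  # positive: the lone spammy token wins
print(log_odds(relayed))    # negative: two hammy header tokens outweigh it
```

With so few body tokens, there is little spam evidence to set against whatever the relay adds to the headers.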
“Word Salad” • Spam stuffed with randomly selected words: <a href="http://www.2004hosting.net/cable/"><img border="0" src="http://www.2004hosting.net/fiter1.jpg"></a> deliverance banister haploid sin beachcomb case stub doublet bread confucius buckaroo questionnaire tech issuance diagnose anglican finance pirouette u.s.a agree faculty nomenclature sheik insinuate pack dutchmen inhibition dubious patriotic aluminate • Sometimes words are hidden using Invisible Ink, Camouflage, MIME is Money or other tricks • The term “word salad” was coined by Cindy Harris in a POPFile forum.
“Word Salad” Experiment • Took a real picospam (HTML style) that had previously been caught by POPFile: Subject: cialis is now ready <DIV align=center><FONT face="arial black" size=2>Save over 70% on</FONT></DIV><CENTER><FONT face="arial black" size=2>USA approved meds</B></FONT><BR></CENTER><center><a href="http://cfcliihhp.646fgfg5.com/v95/index.php?id=v95">Come visit us</a> • Added 100s of words from /usr/share/dict/words • Scored for spam vs. ham against my POPFile installation
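The shape of the experiment can be sketched as follows. This is an illustration under stated assumptions, not the script used for the talk: `add_word_salad`, `count_passing` and `toy_is_ham` are hypothetical names, the short word list stands in for /usr/share/dict/words, and the toy classifier stands in for a trained POPFile installation.

```python
import random


def add_word_salad(spam_body, words, n_words, rng):
    """Return spam_body with n_words randomly chosen words appended."""
    salad = rng.sample(words, n_words)
    return spam_body + "\n" + " ".join(salad)


def count_passing(spam_body, words, n_words, copies, is_ham, seed=0):
    """Generate `copies` salad variants and count how many the filter calls ham."""
    rng = random.Random(seed)
    return sum(
        bool(is_ham(add_word_salad(spam_body, words, n_words, rng)))
        for _ in range(copies)
    )


# Stand-in word list; the experiment drew from /usr/share/dict/words.
WORDS = ["deliverance", "banister", "haploid", "sin", "beachcomb",
         "case", "stub", "doublet", "bread", "invoice"]


def toy_is_ham(text):
    """Toy stand-in for a trained filter: one word is treated as strongly hammy."""
    return "invoice" in text.split()


print(count_passing("Come visit us", WORDS, n_words=3, copies=1000,
                    is_ham=toy_is_ham))
```

Repeating the count for increasing values of `n_words` gives the kind of curve the results slide plots.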
“Word Salad” Results • [Chart: number of spams (per 10,000) that got through, plotted against number of words added]
“Word Salad” Ineffective • Best result: only 0.04% get through, and only if the spammer • Sends each person 10,000 copies of each spam • AND makes each spam 3x bigger than before • Ineffective because a randomly chosen word is most likely to be: • In neither, • Next most likely in spammy, because spammers send so much spam, • Least likely in hammy
Word Salad • [Figure: token space diagram for word salad; the only surviving labels are “neither”, “neither”]
Word Salad Variants • Got similar results using words pulled from • News stories via news.google.com • Articles from wikipedia.org • Back to basics… • A filter busting spam needs to: • HAVE FEW tokens that look like spam • HAVE MORE tokens that look like my ham • How do you find my hammy tokens?
Bayes vs. Bayes • If adaptive filters are so smart, perhaps they can beat adaptive filters? • Experiment: • Take a trained spam filter (“Good” POPFile) • And an untrained spam filter (“Evil” POPFile) • Take a spam that got through “Good” • Send copies of the spam with 5 random words appended • Train “Evil” on each copy according to whether it got through “Good” or not
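The feedback loop in this experiment can be simulated entirely in memory. The sketch below is a toy, not the setup used in the talk: `toy_good_filter` is a hypothetical stand-in for the trained filter (it lets a message through if it contains any word from a secret set), and “Evil” is reduced to a tally of how often each word appeared in delivered versus blocked copies.

```python
import random
from collections import Counter


def train_evil(words, gets_through, trials, per_message=5, seed=0):
    """Send `trials` copies with random words; tally words by delivery outcome."""
    rng = random.Random(seed)
    passed, blocked = Counter(), Counter()
    for _ in range(trials):
        extra = rng.sample(words, per_message)
        if gets_through(extra):
            passed.update(extra)
        else:
            blocked.update(extra)
    return passed, blocked


def rank_words(passed, blocked):
    """Order the words seen by the fraction of their copies that got through."""
    seen = set(passed) | set(blocked)

    def pass_ratio(word):
        return passed[word] / (passed[word] + blocked[word])

    return sorted(seen, key=pass_ratio, reverse=True)


# Toy stand-in for "Good": any one of these words flips the verdict to ham.
SECRET_HAMMY = {"berkshire", "marriott", "wireless"}


def toy_good_filter(extra_words):
    return any(word in SECRET_HAMMY for word in extra_words)


# Hypothetical dictionary: 200 filler words plus the secret ones.
WORDS = [f"w{i}" for i in range(200)] + sorted(SECRET_HAMMY)

passed, blocked = train_evil(WORDS, toy_good_filter, trials=2000)
top = rank_words(passed, blocked)[:3]
print(sorted(top))
```

The point of the toy is that the sender never sees the filter's internals; a one-bit delivered/blocked signal per message is enough to separate the decisive words from the filler.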
B vs B • [Figure: token space diagram; surviving labels are berkshire, marriott, wireless]
How to get feedback • When sending each message include a unique web bug • Creates an effective feedback loop • Spammer can use web bug to train their POPFile installation • Bad news... this works: • Tested against my POPFile installation • Sent 10,000 emails containing 5 random words from /usr/share/dict/words • Found my kryptonite
Kryptonite Words • accommodations, arrangements, berkshire, category, channel, checking, comment, currency, endless, entitled, flying, hills, independent, invoice, logging, marriott, occupancy, officer, operated, quantity, redeeming, rent, shared, silicon, touch, wireless • Adding just one of these words turns the spam into a ham!
Is B vs B practical? • Took 10,000 messages to one email address to train evil POPFile • But what about 10 messages to 1,000 mail addresses? • Say send 10 copies of a spam to everyone at company.com: might find company.com specific kryptonite
Defense against the dark arts • Absolutely NO feedback to spammers • No rendering HTML • No bouncing • No SMTP server errors • No selective challenge/response • No NDRs • Mailing List/Mail Forwards • Do spam filtering on inbound messages • Integrate header analysis with adaptive filtering
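The “no rendering HTML” rule matters because a rendered message fetches remote content, and that fetch is the feedback. As a minimal sketch, assuming a client or filter that wants to know which URLs a message would fetch on its own, the standard-library HTML parser can list them. The class name, the set of tag/attribute pairs, and the sample URL are all assumptions made for illustration, and the set is not exhaustive.

```python
from html.parser import HTMLParser


class RemoteContentFinder(HTMLParser):
    """Collect URLs a mail client would fetch automatically when rendering."""

    # Tag/attribute pairs that trigger a fetch without any click.
    # Illustrative, not a complete list.
    AUTO_FETCH = {
        ("img", "src"),
        ("body", "background"),
        ("link", "href"),
        ("iframe", "src"),
    }

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if (tag, name) in self.AUTO_FETCH and value:
                if value.lower().startswith(("http://", "https://")):
                    self.urls.append(value)


def remote_urls(html):
    finder = RemoteContentFinder()
    finder.feed(html)
    finder.close()
    return finder.urls


# Hypothetical message body with a per-recipient web bug.
sample = (
    '<a href="http://tracker.example/offer">Come visit us</a>'
    '<img src="http://tracker.example/bug.gif?id=abc123" width="1" height="1">'
)
print(remote_urls(sample))
```

The link in the `<a>` tag is not reported because it is only fetched if the reader clicks; the image is fetched as soon as the message is rendered.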
Conclusions • Current spam is “easy” for adaptive filters to detect • As spammers react to adaptive filtering, spam will get harder to recognize • Feedback mechanisms present a risk to the effectiveness of adaptive filtering • Adaptive filters will need merging with “traditional” anti-spam techniques like DNSBLs
Thank you. All questions will now be answered via telepathy :-)