Talk (II): E-mail Spam: The Problem, Solutions and Potential Industrial Standards
Jenq-Haur Wang, Academia Sinica
Nov. 16-17, 2006

Outline
• Introduction
• Existing Solutions
  • Regulatory Solutions
  • Technical Solutions
• Potential Industrial Standards
E-mail Spam

Introduction
• What is spam?
  • E-mail, netnews, instant messaging (“spim”), “Google-spam”, guestbook spam, Weblog comment spam, VoIP (“spit”), …
  • Unsolicited messages flooded to uninterested receivers, usually sent in bulk
• What is e-mail spam?
  • Junk e-mail
  • Unsolicited bulk e-mail (UBE)
  • Unsolicited commercial e-mail (UCE)

Spam Statistics
• Jan. 2001: 8% of all e-mail traffic in the US is spam [Brightmail Inc.]
• Jan. 2003: 42% [Brightmail Inc.]
• Jul. 2004: 65% [Symantec (Brightmail) Inc.]
• In 2002: 3 pieces/day/user (average) [Ferris Research]
• By 2005: 10 pieces/day/user (average) [Ferris Research]

Costs of Spam
• Enterprises
  • > US$10 billion for US organizations in 2003 [Ferris Research]
  • US$245,000/year for a company with 14,000 employees [IDC]
• End users
  • 5 spam/day, 30 seconds each -> 15 hours/year [Ferris Research]
  • Loss of productivity
• Burden on ISPs
  • System resource consumption on servers
  • Waste of network bandwidth
  • User complaints

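The end-user figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming spam arrives on all 365 days of the year; the slide does not say which day count Ferris Research used.

```python
def hours_lost_per_year(spam_per_day: int, seconds_each: float, days: int = 365) -> float:
    """Time spent handling spam, in hours per year."""
    return spam_per_day * seconds_each * days / 3600


# 5 messages/day at 30 seconds each, as on the slide
print(round(hours_lost_per_year(5, 30), 1))  # prints 15.2
```
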
Latest Spam Statistics [source: Spam Statistics 2006, by Don Evett, TopTenReviews, Inc.]
• Email considered spam: 40%
• Daily spam emails sent: 12.4 billion
• Daily spam received per person: 6
• Annual spam received per person: 2,200
• Spam cost to all non-corp. Internet users: $255 million
• Spam cost to all US corporations in 2002: $8.9 billion
• States with anti-spam laws: 26

Latest Spam Statistics (cont.)
• Email address changes due to spam: 16%
• Estimated spam increase by 2007: 63%
• Annual spam in 1,000 employee company: 2.1 million
• Users who reply to spam email: 28%
• Users who purchase from spam email: 8%
• Corporate email that is considered spam: 15-20%
• Wasted corporate time per spam email: 4-5 sec

Email Statistics
• Daily emails sent: 31 billion
• Daily emails sent per email address: 56
• Daily emails sent per person: 174
• Daily emails sent per corporate user: 34
• Daily emails received per person: 10
• Email addresses per person: 3.1 average
• Cost to all Internet users: $255 million

Spam Categories
• Products: 25%
• Financial: 20% ↑
• Adult: 19% ↑
• Scams: 9%
• Health: 7%
• Internet: 7%
• Leisure: 6%
• Spiritual: 4%
• Other: 3%
(Source: http://www.brightmail.com/spamstats.html, Jun. 2004 & http://spam-filter-review.toptenreviews.com/spam-statistics.html, 2006)

Origins of Spam
• Where does the spam come from? [Sophos, “Dirty Dozen” spam producing countries, Apr. 2005]
  • 35.7% (43%): from the US
  • 25.0% ↑ (16%): from South Korea
  • 9.7% (11%): from China
  • …

Major Factors
• Simple SMTP mail relaying mechanism
  • Cannot verify the identity of the sender
  • Forged IP address/sender e-mail address
  • Open mail relay/proxy
• Low cost for sending bulk e-mails
• Low cost for e-mail address harvesting
  • Web, mailing list, …
• Bulk mailer programs
• Low cost for obtaining “free” e-mail address

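Lifecycle of E-mails
[Figure: the sender's MUA (MUAs) submits mail over SMTP to the sender-domain MTA (MTAs, with queues); the MTA looks up the receiver domain's MX records in DNS and relays over SMTP to the receiver-domain MTA (MTAr, with mailbox); the recipient's MUA (MUAr) retrieves the mail over POP3/IMAP4]
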
Existing Solutions
• Regulatory solutions
  • Anti-spam laws
  • Limitations
• Technical solutions
  • Filtering
  • Postage
  • Disposable e-mail address

Regulatory Solutions
• Anti-spam laws
  • http://www.spamlaws.com/
  • Ex: US federal law CAN-SPAM Act (S.877), in effect since Jan. 1, 2004
• Limitations
  • Dependence on technical evidence
  • Slow and costly process

Current Status of Anti-Spam Laws
• In the US:
  • Enacted federal laws: CAN-SPAM Act of 2003 (Pub. L. 108-187, S. 877)
  • Enacted state laws: Arkansas, California, Colorado, Connecticut, Delaware, Idaho, Illinois, Indiana, Iowa, Kansas, Louisiana, Maryland, Minnesota, Missouri, Nevada, New Mexico, North Carolina, Ohio, Oklahoma, Pennsylvania, Rhode Island, South Dakota, Tennessee, Utah, Virginia, Washington, West Virginia, Wisconsin, Wyoming, …
• In Europe:
  • European Union, Austria, Belgium, Czech Republic, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, United Kingdom, …
• In other countries:
  • Argentina, Australia, Brazil, Canada, India, Japan, Panama, Peru, Russia, South Korea, Yugoslavia, …
• Taiwan: “Anti-Hacker” laws in the Martial Law (Jun. 3, 2003)

Technical Solutions
• Filtering: to separate bad from good
  • Heuristic-based
  • Classification-based: machine learning
  • Others: peer-to-peer, honeypot
• Postage: to increase the cost of sending e-mails
• Hiding email address
  • Encoding (text to image, JavaScript, …)
• Disposable email address: separate e-mail address for different correspondence
• Enhancing SMTP mechanism
  • Email path verification
  • Authenticated SMTP

Filtering Technique: Heuristic-based
• Black/White/Grey lists
  • Blacklist: lists of IP addresses that send spam
    • RBLs (Real-time Blackhole Lists), open mail relays, open proxies, …
  • Whitelist: lists of trusted senders
    • Challenge-response mechanism
  • Greylisting: temporary delay of e-mail from unknown senders
• Problems
  • Easy to make mistakes
  • Forged IP address/sender e-mail address
  • Lists need to be updated frequently
  • Changing spammer e-mail addresses

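Greylisting, the last list type above, can be sketched in a few lines. This is an illustrative sketch only: keying on the (client IP, envelope sender, envelope recipient) triplet and the 300-second delay are common conventions assumed here, not details from the slides, and the `Greylist` class is hypothetical. "tempfail" stands for a temporary SMTP rejection that a well-behaved sending server retries later, while many bulk mailers do not.

```python
class Greylist:
    """Temporarily reject mail from sender triplets not seen before."""

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay  # seconds an unknown triplet must wait (assumed value)
        self.first_seen: dict[tuple[str, str, str], float] = {}

    def check(self, client_ip: str, mail_from: str, rcpt_to: str, now: float) -> str:
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        # Remember when this triplet first appeared; later attempts reuse that time.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"


greylist = Greylist()
print(greylist.check("192.0.2.1", "a@example.com", "b@example.org", now=0.0))
print(greylist.check("192.0.2.1", "a@example.com", "b@example.org", now=400.0))
```

The first call is rejected temporarily and the retry after the delay is accepted. The cost is the delay imposed on legitimate first-time senders.
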
Filtering Technique: Heuristic-based (cont.)
• Keyword-matching rules (ex. MS Outlook)
  • Look for similar messages based on their subject or content
• Problems
  • Exact rules are difficult to formulate and maintain
  • Spam is always changing
  • Chinese menu (madlibs) attack
    • Make thousands of dollars working at home !!!
    • Earn lots of money in the comfort of your own house.

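The weakness of exact keyword rules against the madlibs attack can be shown with the two example sentences from the slide. This is a minimal sketch; the rule list is hypothetical and is not taken from MS Outlook or any other product.

```python
import re

# Hypothetical keyword rules written after seeing the first message.
RULES = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"make thousands of dollars", r"working at home")
]


def matches_rule(text: str) -> bool:
    """Return True if any keyword rule matches the message text."""
    return any(rule.search(text) for rule in RULES)


original = "Make thousands of dollars working at home !!!"
reworded = "Earn lots of money in the comfort of your own house."
print(matches_rule(original), matches_rule(reworded))  # prints True False
```

The reworded message carries the same pitch but shares no phrase with the rules, so each new variant needs a new rule.
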
Filtering Technique: Classification-based
• Machine learning
  • Text classification methods: TF-IDF, Naïve Bayes, SVM (Support Vector Machine), …
  • Learn spam vs. good
  • Adapt to changing spam
• Problems
  • Need lots of training data
  • Diverse contents in e-mail spam
  • Spammers are learning too
    • Images, synonyms, misspellings, …
  • “One man’s spam is another man’s ham”

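As an illustration of the Naïve Bayes method named above, here is a small multinomial classifier with add-one (Laplace) smoothing. It is a sketch under simplifying assumptions: whitespace tokenization, a four-message toy training set invented for the example, and at least one training message per class. Real filters such as those listed later in the talk use far richer tokenization.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text: str, label: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # Log prior: fraction of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((self.word_counts[label][word] + 1) / (total + len(vocab)))
        return score

    def classify(self, text: str) -> str:
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


nb = NaiveBayes()
nb.train("earn money fast", "spam")
nb.train("free money now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
print(nb.classify("free money"), nb.classify("monday meeting"))  # prints spam ham
```
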
Filtering Techniques: Others
• Distributed (peer-to-peer, collaborative) spam filtering
  • To share the knowledge of spam features
  • SpamNet: Cloudmark
  • SpamWatch: UC Berkeley
• Problems
  • Efficacy
  • Efficiency

Distributed Spam Filtering
• Cloudmark’s SpamNet
[Figure: client-side add-ins in each recipient's MUA (MUAr), which fetches mail from MTAr over POP3/IMAP4, report spam to SpamNet and check incoming messages against it]

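The report/check loop in the figure can be sketched with a shared registry of message fingerprints. This is an assumption-laden illustration, not Cloudmark's design: the `SharedSpamRegistry` class, the exact-match SHA-256 fingerprint and the two-reporter threshold are all hypothetical stand-ins.

```python
import hashlib


def fingerprint(body: str) -> str:
    """Hash of the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SharedSpamRegistry:
    """Central store that client add-ins report to and check against."""

    def __init__(self, min_reporters: int = 2):
        self.min_reporters = min_reporters  # assumed threshold
        self.reporters: dict[str, set[str]] = {}

    def report(self, user: str, body: str) -> None:
        self.reporters.setdefault(fingerprint(body), set()).add(user)

    def is_spam(self, body: str) -> bool:
        seen_by = self.reporters.get(fingerprint(body), set())
        return len(seen_by) >= self.min_reporters


registry = SharedSpamRegistry()
registry.report("user1", "Buy   NOW")
registry.report("user2", "buy now")
print(registry.is_spam("Buy now"), registry.is_spam("buy now!"))  # prints True False
```

An exact hash is defeated by a one-character change, as the second check shows, which is one concrete form of the efficacy problem listed on the previous slide.
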
Discussions on Filtering-based Approach
• False-positive vs. false-negative
  • Cost-sensitive e-mail classification
• Incoming vs. outgoing e-mail filtering
  • Ex. corporate mail filtering might focus on preventing leaks of confidential data

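Cost-sensitive classification can be made concrete with a decision threshold. If wrongly blocking a good message (false positive) costs `fp_cost` and letting a spam through (false negative) costs `fn_cost`, then marking a message as spam has the lower expected cost only when its spam probability exceeds `fp_cost / (fp_cost + fn_cost)`. The function names and cost values below are illustrative assumptions.

```python
def spam_threshold(fp_cost: float, fn_cost: float = 1.0) -> float:
    """Spam probability above which blocking has lower expected cost.

    Blocking costs fp_cost * (1 - p); delivering costs fn_cost * p.
    Blocking wins when p > fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)


def decide(p_spam: float, fp_cost: float, fn_cost: float = 1.0) -> str:
    return "spam" if p_spam > spam_threshold(fp_cost, fn_cost) else "ham"


# With equal costs the threshold is 0.5; if a lost good mail is 9 times worse,
# the filter must be more than 90% sure before blocking.
print(decide(0.8, fp_cost=1), decide(0.8, fp_cost=9))  # prints spam ham
```
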
Postage
• Postage: to increase the cost of sending e-mails
  • Money: payment
  • Computation: time
  • Turing tests: challenge-response
• Problems
  • Requires multiple monetary transactions for each e-mail delivery
  • Who pays for infrastructure?

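Computational postage can be illustrated with a proof-of-work stamp: the sender must search for a value whose hash has a required number of leading zero bits, while the receiver verifies it with a single hash. This sketch is in the spirit of Hashcash but simplified; the stamp layout, the use of SHA-256 and the 12-bit difficulty are assumptions made for the example (Hashcash itself uses SHA-1 and a structured stamp that includes a date).

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of the digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def stamp_bits(stamp: str) -> int:
    return leading_zero_bits(hashlib.sha256(stamp.encode("utf-8")).digest())


def mint(recipient: str, bits: int = 12) -> str:
    """Sender side: try nonces until the stamp is hard enough (about 2**bits hashes)."""
    for nonce in count():
        stamp = f"{recipient}:{nonce}"
        if stamp_bits(stamp) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """Receiver side: one hash, and the stamp must name this recipient."""
    return stamp.startswith(recipient + ":") and stamp_bits(stamp) >= bits


stamp = mint("bob@example.org")
print(verify(stamp, "bob@example.org"))  # prints True
```

Because the stamp is bound to one recipient, a bulk sender must repeat the search for every address, which is the intended cost asymmetry.
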
Disposable E-mail Address
• Disposable e-mail address
  • Separate e-mail address for each correspondence
• Channelized e-mail system [R. Hall]
  • Sort incoming mails according to sender address
  • Terminate the address with spam
• Problems
  • How do new senders get your address?
  • What’s the sender address for multiple receivers?
  • Difficult to remember

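One way to picture per-correspondent addresses is to derive a tag for each correspondent and close a channel once it attracts spam. This is a hypothetical sketch, not the channelized system cited above: the "user+tag@domain" form, the HMAC-derived tag, the `SECRET` value and the `Channels` class are all assumptions for illustration.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"  # hypothetical per-user secret


def channel_address(user: str, domain: str, correspondent: str) -> str:
    """Derive a distinct, reproducible address to hand to one correspondent."""
    tag = hmac.new(SECRET, correspondent.lower().encode("utf-8"), hashlib.sha256)
    return f"{user}+{tag.hexdigest()[:8]}@{domain}"


class Channels:
    """Tracks which disposable addresses have been terminated."""

    def __init__(self):
        self.closed: set[str] = set()

    def close(self, address: str) -> None:
        self.closed.add(address.lower())

    def accepts(self, address: str) -> bool:
        return address.lower() not in self.closed


shop = channel_address("alice", "example.org", "shop@example.com")
friend = channel_address("alice", "example.org", "friend@example.net")
channels = Channels()
channels.close(shop)  # the shop's address leaked to spammers
print(channels.accepts(shop), channels.accepts(friend))  # prints False True
```

Deriving the tag from the correspondent avoids having to remember each address, but it does not answer how a new sender obtains one in the first place.
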
Enhancing SMTP Mechanism
• Email path verification
  • To trace the real origin of e-mail (sender)
  • Problem: accounting is needed for packet network
• Authenticated SMTP
  • Trusted environment
  • SMTP authentication (RFC 2554), SMTP over SSL/TLS (RFC 3207), digital signatures (PGP, …)
  • Problem: need client-server cooperation

Other Techniques (cont.)
• Reputation-based approach
  • Based on HITS (Hyperlink Induced Topic Search) algorithm
  • Ranking on email sending/receiving reputation
• Problem
  • Bad reputation for volume senders (mailing lists, newsletters, …)

Existing Anti-Spam Tools
• Open Source Filters
  • SpamAssassin
  • ifile
  • bogofilter
  • POPfile
  • SpamBayes
  • CRM114
• Commercial Products
  • BrightMail
  • SurfControl
  • Anti-virus

Spammers’ Tricks
• Images: MIME
• Invisible ink (hidden text): color
• Misspelling
  • o -> 0
  • i -> l -> 1 -> !
  • S -> 5
  • F R E E, g-i=r-l, …
• Ref: John Graham-Cumming: The Spammers’ Compendium, http://www.jgc.org/tsc/index.htm

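A filter can undo some of the misspelling tricks above before matching words. The sketch below is a partial, assumed normalization: it maps the digit substitutions from the slide back to letters and strips separator characters, but deliberately leaves out "l" and "!" because mapping them to "i" would corrupt ordinary words and punctuation.

```python
import re

# Digit-for-letter substitutions listed on the slide (assumed to be reversible).
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "i", "5": "s"})


def normalize(token: str) -> str:
    """Lowercase, undo digit substitutions, and drop non-letter separators."""
    token = token.lower().translate(SUBSTITUTIONS)
    return re.sub(r"[^a-z]", "", token)


for sample in ("F R E E", "g-i=r-l", "m0ney", "5ale"):
    print(normalize(sample))
```

The samples normalize to "free", "girl", "money" and "sale". Applied to a whole sentence the function would also remove the spaces between words, so it is meant for short suspicious spans rather than full messages.
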
Potential Industrial Standards
• Sender/Domain authentication for e-mails
  • Sender ID Framework (Microsoft)
  • DKIM (Yahoo, Cisco)
    • DomainKeys (Yahoo)
    • Identified Internet Mail (Cisco)
  • SPF
    • Sender Permitted From (AOL)

Structures of E-mails
• Envelope: SMTP (RFC 2821)
• Header & body: RFC 2822

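The split between envelope and header is what lets a sender show one address to the reader while using another for delivery. The sketch below builds a message with Python's standard `email` package and compares its From header with a separately chosen envelope sender; all addresses are made-up examples, and nothing is actually sent.

```python
from email.message import EmailMessage
from email.utils import parseaddr

# Header and body: what the recipient's mail program displays.
msg = EmailMessage()
msg["From"] = "Trusted Bank <service@bank.example>"
msg["To"] = "reader@example.org"
msg["Subject"] = "Account notice"
msg.set_content("Please confirm your details.")

# Envelope: what the SMTP dialogue carries in MAIL FROM and RCPT TO.
# With smtplib these would be the first two arguments of SMTP.sendmail().
envelope_from = "bulk@sender.example"
envelope_to = ["reader@example.org"]


def header_matches_envelope(message: EmailMessage, mail_from: str) -> bool:
    """Compare the address in the From header with the envelope sender."""
    header_addr = parseaddr(str(message["From"]))[1]
    return header_addr.lower() == mail_from.lower()


print(header_matches_envelope(msg, envelope_from))  # prints False
```

Plain SMTP does not require the two to agree, which is the gap the sender-authentication proposals on the following slides try to close.
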
[Figure: Sender ID Framework (MS)]

[Figure: DomainKeys]

IIM: Authentication/Authorization Model
Messages must pass two tests before they are authenticated
• AUTHENTICATE THE MESSAGE: the receiving domain authenticates the message, i.e. verifies that the message was not altered in any consequential manner prior to reaching the receiving domain
• AUTHORIZE THE SENDER: the receiving domain asks the sending domain to confirm that whoever signed the message was authorized to do so (without having to identify the sender)

[Figure: Identified Internet Mail]

DomainKeys Identified Mail (DKIM)
• Derived from Yahoo DomainKeys and Cisco Identified Internet Mail
• IETF Working Group formed
• IETF Internet draft
• Message header authentication
• DNS identifiers
• Public keys in DNS
• End-to-end
• Between origin/receiver administrative domains
• Not path-based

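"Public keys in DNS" means a verifier derives a DNS name from the signature header and fetches the signer's key from there. The sketch below only parses the tag list and builds that name; it performs no DNS lookup and no signature check. The tag names (`d=` for the signing domain, `s=` for the selector) and the `_domainkey` label follow the DKIM specification as later published, and the header value, including its placeholder `bh=` and `b=` fields, is invented for the example.

```python
def parse_tags(header_value: str) -> dict[str, str]:
    """Split a 'name=value; name=value' tag list into a dictionary."""
    tags = {}
    for part in header_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            # Values may be folded across lines; drop embedded whitespace.
            tags[name.strip()] = "".join(value.split())
    return tags


def key_record_name(tags: dict[str, str]) -> str:
    """DNS name whose TXT record holds the signer's public key."""
    return f"{tags['s']}._domainkey.{tags['d']}"


# Hypothetical signature header value; bh= and b= are placeholders.
signature = (
    "v=1; a=rsa-sha256; d=example.com; s=mail2006; "
    "h=from:to:subject; bh=PLACEHOLDER; b=PLACEHOLDER"
)
print(key_record_name(parse_tags(signature)))  # prints mail2006._domainkey.example.com
```

Because the key is looked up under the signing domain rather than tied to the relaying hosts, the check survives forwarding, which is what "not path-based" refers to.
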
SPF
• Sender Policy Framework
• Derived from Sender Permitted From (SPF, AOL)
• By Meng Weng Wong, CTO of Pobox
• Current specification: SPFv1 (RFC 4408)
• Reverse MX records
• Adopted by many mail server implementations

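An SPF record lists which hosts may send mail for a domain, and the receiver compares the connecting IP address against it. The sketch below evaluates only the `ip4` and `all` mechanisms of an SPFv1 record given as a string; mechanisms that need DNS lookups (`a`, `mx`, `include`, …) are skipped, so it is a partial illustration rather than a conforming checker. The record and addresses are made-up examples.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(record: str, client_ip: str) -> str:
    """Evaluate the ip4 and all mechanisms of an SPFv1 record, left to right."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # a mechanism without a qualifier means pass
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term.lower().startswith("ip4:"):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return QUALIFIERS[qualifier]
        elif term.lower() == "all":
            return QUALIFIERS[qualifier]
        # Other mechanisms would require DNS queries and are ignored here.
    return "neutral"  # nothing matched


record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf(record, "192.0.2.25"), check_spf(record, "198.51.100.7"))  # prints pass fail
```
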
Tips for End Users (1/2)
• Never give out your personal e-mail address to strangers
• Use separate e-mail addresses for business and public use (“disposable”)
• Never respond to unsolicited e-mail
• Do not click on links within unsolicited e-mail, including deceptive unsubscribe links

Tips for End Users (2/2)
• Read the subject line of every e-mail carefully, and use the preview feature of mail programs
• If your e-mail address appears on a Web site, ask the site's manager to encode it
• Use e-mail service providers that filter spam
• Install an anti-spam program on your computer

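One simple form of the address encoding mentioned above is to write every character as an HTML numeric character reference, so the page displays the address normally while naive harvesters scanning the raw HTML for "name@domain" patterns do not find it. This is a minimal sketch with a hypothetical function name; it will not stop a harvester that decodes entities first.

```python
import html


def encode_address(address: str) -> str:
    """Encode each character as an HTML numeric character reference."""
    return "".join(f"&#{ord(char)};" for char in address)


encoded = encode_address("alice@example.org")
print("@" in encoded)                               # prints False
print(html.unescape(encoded) == "alice@example.org")  # prints True
```
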
Conclusion
• Anti-spam is a battle
  • “Every time we discover a feature to catch spam, spammers will find a work-around”
• Some advice
  • Filtering is just one part of the solution
  • Try to raise the costs for spammers
  • Be nice to your e-mail address
  • Mail delivery has to be improved

References
• IRTF ASRG: http://asrg.sp.am/
• Sender ID: http://www.microsoft.com/mscorp/safety/technologies/senderid/technology.mspx
• DKIM: http://dkim.org/
• DomainKeys: http://antispam.yahoo.com/domainkeys
• Identified Internet Mail: http://www.identifiedmail.com/
• SPF Project: http://www.openspf.org/
• RFCs and Internet Drafts

References for Research
• MIT Spam Conference (2003-2006)
  • http://www.spamconference.org/
• Conference on Email and Anti-Spam (CEAS) (2004-2006)
  • http://www.ceas.cc/
• TREC (Text REtrieval Conference) Spam Track (2005-2006)
  • http://trec.nist.gov/data/spam.html