Fighting Spam: Techniques on the Table Cynthia Dwork Microsoft Research SVC
Why? • Huge problem • Industry: costs in worker attention, infrastructure • Individuals: increased ISP fees • Hotmail: huge storage costs, 65-85% • FTC: fraud, confidence crimes • Ruining e-mail, devaluing the Internet
Desiderata • Reduce amount of spam seen by me • Don’t harm ordinary e-mail • Let me publish my e-mail address • Let me act autonomously
Techniques on the Table • Filtering • Everyone: text-based • Brightmail: decoys; rules updates • Microsoft: (seeded) trainable filters [Sahami, Dumais, Heckerman, Horvitz'98] • SpamHaus, SpamCop, Osirusoft, … • IP addresses, ISPs, proxies, … • Puzzles • CPU cycles [Dwork-Naor'92] • Memory cycles [Burrows et al.] • Turing tests [Naor'96] • Pay recipient [Gates'96]
Outline • Description and Discussion of • MSR’s filtering and three kinds of Puzzles • Two architectures • Odds and Ends
Cost-Sensitive Filtering [SDDH’98] • Classification Scheme that • Assigns probabilities (cf. Spertus'97) • Differentiates among costs of errors • Feature space: words in message corpus, phrases, punctuation, domain type (“.edu”), etc. • Classes: Junk, Not-Junk; sex junk, etc. • Classifier: maps attribute vector to a distribution on classes
Classifier Construction • Select 500 – 2000 features of maximum mutual information • Bayesian classifier • Naïve Bayesian classifier
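The classification scheme on the two slides above can be illustrated with a small naïve Bayesian classifier over binary word features. This is a minimal sketch, not the [SDDH'98] implementation: the function names, the add-one smoothing, and the 0.999 decision threshold are assumptions chosen for illustration, and feature selection by mutual information is omitted. The high threshold stands in for the cost asymmetry: losing a legitimate message is treated as far worse than letting a spam through.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (tokens, label) pairs, label in {"junk", "ok"}.
    Both classes must appear at least once."""
    counts = {"junk": Counter(), "ok": Counter()}
    docs = Counter()
    for tokens, label in messages:
        docs[label] += 1
        counts[label].update(set(tokens))  # binary feature: word present or not
    vocab = set(counts["junk"]) | set(counts["ok"])
    return counts, docs, vocab

def p_junk(model, tokens):
    """Posterior probability that the message is junk (naive Bayes)."""
    counts, docs, vocab = model
    total = sum(docs.values())
    present = set(tokens) & vocab
    log_score = {}
    for label in ("junk", "ok"):
        score = math.log(docs[label] / total)  # class prior
        for word in vocab:
            # Add-one smoothing (an assumption of this sketch).
            p = (counts[label][word] + 1) / (docs[label] + 2)
            score += math.log(p if word in present else 1 - p)
        log_score[label] = score
    top = max(log_score.values())
    weight = {k: math.exp(v - top) for k, v in log_score.items()}
    return weight["junk"] / (weight["junk"] + weight["ok"])

def is_junk(model, tokens, threshold=0.999):
    """Cost-sensitive decision: only call it junk when very sure."""
    return p_junk(model, tokens) >= threshold

model = train([
    (["free", "money", "now"], "junk"),
    (["free", "offer"], "junk"),
    (["meeting", "agenda"], "ok"),
    (["lunch", "meeting", "now"], "ok"),
])
```

With so little training data the posterior for a message such as "free money" is high but below 0.999, so the cautious threshold leaves it in the inbox; lowering the threshold trades more caught spam for more lost mail.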
Social and Technical Issues Social • Getting the user to train the filter Technical • Seed filter is language-specific • Less effective during training • Training requires many examples (?) • Spammers will adapt
Computational Puzzles [DN’92] • If I don't know you: prove you spent 10 seconds of computation time, just for me and just for this message • User experience: everything works automatically; typical user experience is unchanged • Economics for Hotmail’s billion daily spams: 125,000 CPUs; up-front capital cost circa $150,000,000 • The spammers can’t afford it.
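The economics on this slide can be checked with a few lines of arithmetic. The sketch below assumes the puzzle costs 10 CPU-seconds per message and that the machines run around the clock; the slide's figure of 125,000 CPUs is somewhat above the bare minimum computed here, presumably to allow headroom, though that reading is an assumption.

```python
import math

SPAMS_PER_DAY = 1_000_000_000      # "billion daily spams" from the slide
SECONDS_PER_PUZZLE = 10            # assumed cost per message
SECONDS_PER_DAY = 24 * 60 * 60

# Minimum number of CPUs running flat out, 24 hours a day.
min_cpus = math.ceil(SPAMS_PER_DAY * SECONDS_PER_PUZZLE / SECONDS_PER_DAY)

# Per-machine price implied by the slide's two figures.
implied_cost_per_cpu = 150_000_000 / 125_000

print(min_cpus, implied_cost_per_cpu)
```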
NY Times 6/27/02 "Most of the spammers are not wealthy people," said Stephen Kline, a lawyer for the New York State attorney general's office.
Cryptographic Puzzles • Hard to compute (CPU-intensive) • lots of work for the sender • Easy to check • little work for receiver • Parameterized to scale with Moore's Law • easy to exponentially increase computational cost, while barely increasing checking cost • Can be based on (carefully) weakened signatures, hash collisions
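A hard-to-compute, easy-to-check puzzle can be sketched with a hash function. The code below uses a partial hash inversion (find a nonce whose SHA-256 digest starts with a given number of zero bits), which is a swapped-in, hashcash-style technique rather than the specific functions of [DN'92]. The names `solve`, `verify`, and `difficulty_bits`, and the way message, sender, recipient, and time are bound into the hash, are assumptions for illustration.

```python
import hashlib
from itertools import count

def _digest(message, sender, recipient, timestamp, nonce):
    # Bind the work to this message, this sender/recipient pair, and this time.
    # (A real encoding would need unambiguous field separation.)
    data = f"{message}|{sender}|{recipient}|{timestamp}|{nonce}".encode()
    return hashlib.sha256(data).digest()

def leading_zero_bits(digest):
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()

def solve(message, sender, recipient, timestamp, difficulty_bits):
    """Sender's side: about 2**difficulty_bits hash evaluations on average."""
    for nonce in count():
        d = _digest(message, sender, recipient, timestamp, nonce)
        if leading_zero_bits(d) >= difficulty_bits:
            return nonce

def verify(message, sender, recipient, timestamp, nonce, difficulty_bits):
    """Recipient's side: a single hash evaluation."""
    d = _digest(message, sender, recipient, timestamp, nonce)
    return leading_zero_bits(d) >= difficulty_bits
```

Raising `difficulty_bits` by one doubles the sender's expected work while the check stays at one hash, which is the kind of scaling the slide asks for.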
Memory-Bound Puzzles [ABMW] • Slow CPUs are a lot slower than the fastest • Factor of 10 – 30 within desktops • Memory latencies vary little • factor of 3 • So: design a puzzle leading to a large number of cache misses • Equalizes actual computation time
Candidate Based on “Random” f (simplified construction) • $f$: n bits to n bits • Puzzle generation: • $x_k = f^{(k)}(x_0)$, extra piece $p$, for random $x_0$ • Correct response: $x_0$ ($p$ used to disambiguate) • Hope: puzzle best solved by building a table for $f^{-1}$, working backwards from $x_k$ • Choose n so that $f^{-1}$ fits in small memory, but not in cache
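The simplified construction can be sketched as follows. Everything concrete here is an assumption: f is a seeded random lookup table, the extra piece p is taken to be a short hash of x0, and the parameter sizes are far too small to defeat a real cache. The point is only to show the shape of generation and of the table-driven backward search.

```python
import hashlib
import random

def make_f(n_bits, seed):
    """Stand-in for the "random" f: a lookup table over n-bit values."""
    rng = random.Random(seed)
    size = 1 << n_bits
    return [rng.randrange(size) for _ in range(size)]

def tag(x0):
    """Hypothetical extra piece p: a short hash of x0."""
    return hashlib.sha256(str(x0).encode()).hexdigest()[:8]

def iterate(f, x, k):
    """Apply f to x, k times."""
    for _ in range(k):
        x = f[x]
    return x

def generate(f, k, rng):
    """Puzzle = (x_k, p) for a random starting point x0."""
    x0 = rng.randrange(len(f))
    return iterate(f, x0, k), tag(x0)

def solve(f, k, xk, p):
    """Build a table for the inverse of f, then walk back k steps from x_k."""
    inverse = [[] for _ in f]
    for x, y in enumerate(f):
        inverse[y].append(x)
    frontier = {xk}
    for _ in range(k):
        # Each step is a burst of scattered lookups in the inverse table.
        frontier = {pre for y in frontier for pre in inverse[y]}
    for candidate in sorted(frontier):
        if tag(candidate) == p:  # p disambiguates among the k-step preimages
            return candidate
    return None
```

The scattered lookups into `inverse` are what would turn into cache misses once the table is larger than the cache.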
Computation: Social Issues • Trust • Who chooses f ? • Who writes the code? • Who sets the price?
Computation: Technical Issues • Distribution Lists (!) • Awkward Introductory Period • Old versions of mail programs; bounces • Very Slow Machines • Can implement “post office,” but: Who gets to be the Post Office? • The Subverters
Turing Tests [N’96] • CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) • Defeat automated account generation • 5-10% drop in subscription rate • teams of conjectured-humans (8-hour shifts) • Yes: distorted image of a simple word • No: “Find 3 words in an image with multiple overlapping images of words” • Others: subject classification, audio • M. Blum: people have done preprocessing
Social and Technical Issues • Social (especially in enterprise setting) • ADA, Section 508 (blind, dyslexic) • Not “professional” • Productivity cost: context switch • Irritating • Technical • No theory • If/when broken, these will revert to computational challenges, but with no “hardness parameter” • Idrive, AltaVista, broken [J]
Point-to-Point Architecture (Ideal Message Flow) • Permits send-and-forget • Can add post office to handle money payments [Diagram: Sender client S → Recipient client R: m, f(m,S,R,t)]
Here to There (and There’) • Three e-mail messages • R’s mail client caches m, h(m), S, R, t • Bounce • HTML attachment with JavaScript for f • contains parameters for f(h(m), S, R, t) • clicking on link causes computation, sending e-mail • (optional) link for download of client software [Diagram: Ignorant Sender S → Spam-Protected Recipient R: m; R → S: bounce; S → R: release m]
Point-to-Point: Issues • Social • Unfriendly to very slow machines • Senders trust code (transition period)? • (Who gets to be a post office?) • Technical • No pre-computation possible • Function update requires new download • Sender’s browser must be configured for sending e-mail (transition period)
Ticket Server [ABBW] (Ideal Message Flow) • Ticket kit = (#, puzzle) • Ticket = (#, response) • Any payment method • Tickets may be accumulated in advance (pre-computation) • Refunds (not shown) • Centralization eases updates [Diagram: 1. Sender → Ticket Server (HTTP): Get Ticket Kit; 2. Sender → Recipient Server (SMTP): MSG + Ticket; 3. Recipient Server → Ticket Server (HTTP): Ticket OK?]
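The three flows of the ideal ticket-server exchange can be sketched in memory. The class and method names, the hash-based puzzle, and the difficulty setting are all hypothetical; the HTTP and SMTP transports are replaced by plain method calls, and refunds are left out, as on the slide.

```python
import hashlib
import secrets
from itertools import count

DIFFICULTY_BITS = 12  # hypothetical setting, kept small so the demo runs fast

def _response_ok(puzzle, response):
    digest = hashlib.sha256(f"{puzzle}:{response}".encode()).digest()
    # Accept when the top DIFFICULTY_BITS bits of the 256-bit digest are zero.
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

class TicketServer:
    def __init__(self):
        self._outstanding = {}  # ticket number -> puzzle
        self._next_number = 0

    def get_ticket_kit(self):
        """Flow 1: sender asks for a ticket kit = (#, puzzle)."""
        number = self._next_number
        self._next_number += 1
        puzzle = secrets.token_hex(8)
        self._outstanding[number] = puzzle
        return number, puzzle

    def ticket_ok(self, ticket):
        """Flow 3: recipient server asks whether ticket = (#, response) is good.
        A good ticket is cancelled so that it cannot be spent twice."""
        number, response = ticket
        puzzle = self._outstanding.get(number)
        if puzzle is None or not _response_ok(puzzle, response):
            return False
        del self._outstanding[number]
        return True

def solve(puzzle):
    """Sender's work; this could be done in advance of writing any message."""
    return next(r for r in count() if _response_ok(puzzle, r))

server = TicketServer()
number, puzzle = server.get_ticket_kit()   # flow 1
ticket = (number, solve(puzzle))           # sent with the message in flow 2
```

Because the puzzle is tied to a ticket number rather than to a message, the sender can stockpile tickets, which is the pre-computation the slide mentions.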
Social and Technical Issues • Social • Who gets to be a ticket server? Trust? • Federation? • Trust code (transition period) • Technical • Complex; 5 flows (7 with refund) • Target for subverters
Cycle Stealing • Stealing cycles from humans: Pornography companies require random users to solve a CAPTCHA before seeing the next image [vABL] • Worse for computational challenges • There are lots of cycles, but anyone can buy them.
Politics and Taste • Will users like this? • Transition period is awkward, senders experience some pain, recipients benefit • Establish standards • puzzle functions • mailing lists • How should the community proceed?
Medium Weight • Exploit browser power • Web server can be personal • Easily modified to facilitate changes to function f [Diagram: 1. Sender → Recipient Client Plug-In (SMTP): m; 2. Plug-In → Sender (SMTP): NDR; 3. Sender ↔ WebServer (HTTP): f, f(h(m)); 4. WebServer → Recipient Client Plug-In (SMTP)]
MW Transitions (W = web server) 1. S sends m to R. R’s plug-in checks for S on safelist, blocklist; acts appropriately. If S is on neither list: 2. Plug-in hashes m, caches m, h(m), sends NDR to S. 3. S clicks appropriate link in NDR. URL communicates h(m) to W. New web page appears in S’s browser, containing applet; applet sends results to W. 4. W sends mail to R. Plug-in sees the mail from W and does the right thing.
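The recipient plug-in's part of these transitions can be sketched as a small state machine. The class, its method names, and the string return values are hypothetical, and step 3 (the applet computing f and the web server checking the result) is assumed to have happened before the web server's notice arrives.

```python
import hashlib

class RecipientPlugIn:
    def __init__(self, safelist=(), blocklist=()):
        self.safelist = set(safelist)
        self.blocklist = set(blocklist)
        self.pending = {}   # h(m) -> (sender, m), held until released
        self.inbox = []

    def on_mail(self, sender, message):
        """Steps 1 and 2: check the lists, else cache m and answer with an NDR."""
        if sender in self.blocklist:
            return "dropped"
        if sender in self.safelist:
            self.inbox.append((sender, message))
            return "delivered"
        h = hashlib.sha256(message.encode()).hexdigest()
        self.pending[h] = (sender, message)
        return f"NDR:{h}"   # the NDR's link would carry h(m) to the web server

    def on_web_server_notice(self, h):
        """Step 4: mail from the web server releases the cached message."""
        entry = self.pending.pop(h, None)
        if entry is None:
            return "ignored"
        self.inbox.append(entry)
        return "delivered"

plugin = RecipientPlugIn(safelist={"friend@example.org"},
                         blocklist={"spammer@example.org"})
```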
Web-Based Mail (Hotmail) Ticket Server • Client downloads ActiveX component or applet • Computation done by client • Protocol run by Hotmail • No code downloaded for money • Alternatively: • Hotmail purchases tickets for (paying?) customers, or • Hotmail promises not to send too much outgoing spam, receivers trust Hotmail [Diagram: Sender = Hotmail Server, Client Browser; 1. Sender → Ticket Server (HTTP): Get Ticket Kit; 2. Sender → Recipient Server (Exchange) (SMTP): MSG + Ticket; 3. Recipient Server → Ticket Server (HTTP): Ticket OK?]
TS Transition Period • NDR contains multiple links. User clicks appropriate link. • Sender, having no ticket-handling code, invokes a script in web browser. • Probably need to roll out slowly. Examples: • Start with trivial computation (just click on link?) • NDRs on small fraction of messages • Opt-in in Hotmail • Active safelisting [Diagram: 1. Sender (Outlook Express) → Recipient Server (Exchange) (SMTP): M; 2. Recipient Server → Sender: NDR; 3. Sender → Ticket Server (HTTP): Get Ticket Kit; 4. Sender → Ticket Server (HTTP): Send Ticket; 5. Ticket Server → Recipient Server: “Deliver M”]
TS Transition Details • To handle senders that have no ticket-handling code: invoke a script in web browser • S sends message M • R keeps M, but doesn't deliver it • R returns an NDR (bounce) to S • S gets ticket kit from TS by clicking appropriate URL in NDR • S may download applet; not needed for money • S solves puzzle; sends ticket to TS • TS tells R to deliver M