Source: Pew Internet
3. Email addiction 41% check email first thing in the morning
23% have checked in bed in their pajamas
4. Overview Email
Most important application
Great research problems for people working on NLP
Techniques spammers use
Solutions to Spam
Other kinds of abuse (SPIM and SPAT and BLAM)
Fun problems you find building real systems
5. Part 1: Email A sample of interesting NLP email problems
Finding what’s important
Task Flags
Organizing mail
Auto foldering
Auto tagging
Finding what’s interesting
Automatic search
Contact finding
6. Priorities (Eric Horvitz, Andy Jacobs, David Hovel, etc.) Automatically determines how important your email is
Send to your cell phone
Different sound/toast
Uses machine learning
Sent directly to you?
From your manager?
Uses future tense?
Future dates?
7. Task Flags (S. Corston-Oliver, E. Ringger, M. Gamon, R. Campbell)
8. Task Flags Continued (S. Corston-Oliver, E. Ringger, M. Gamon, R. Campbell)
9. Auto Foldering (Jake Brutlag and Chris Meek) Use machine learning to figure out automatically what folder mail goes in.
Interesting text classification problem
Folders contain as few as three entries
Data changes over time
10. Automatic Tagging for Email (Arun C. Surendran, John C. Platt and Erin Renshaw) Automatically tag email messages to enrich search organization and navigation.
How it works:
Put messages into clusters
Naming clusters is hard
Use domain-dependent filtering (remove common intranet words)
Use noun phrases from subjects
Words do not have to occur in all messages in cluster
11. Automatic Search (Joshua Goodman and Vitor Carvalho) Automatically show users useful search results
Examined over 20 factors
Automatically train machine learning system to weight them.
Frequency of keywords in Internet Search query logs (MSN) is third most helpful feature (after TF and IDF)
Helped solve lots of linguistic problems
Almost everything in query logs is a “meaningful phrase”
Much easier to port to multiple languages
12. Contact Finding (T. Kristjansson, A. Culotta, P. Viola and A. McCallum) Automatically find contact information in an email message.
Machine learning method – train it by showing examples
13. Other Interesting Email Research Most of the research I’ve just shown you is from Microsoft Research
Main reason: much easier to steal slides from colleagues with nearby offices
Why do people in MSR spend so much time working on email problems?
CALO Project
“Cognitive Assistant that Learns and Organizes”: DARPA funded project led by SRI, with 22 organizations participating
Main way you deal with your automated assistant is through email.
RADAR Project
Primarily at CMU (11 research groups) (DARPA funded)
Cognitive assistant that can do tasks like space planning, automated web master, etc.
Primary interface to the assistant is through email
14. Interesting NLP Oriented Email Research Understanding Temporal Expressions in Emails, Han, et al.
TODAY!!! – Semantics 1 – 4:40 to 5:05
Reply Expectation Prediction for Email Management, Dredze et al.
Implicit Queries (IQ) for Contextualized Search, Dumais et al.
Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text, Minkov et al.
Email Task Management: An Iterative Relational Learning Approach, Khoussainov et al.
Inferring Ongoing Activities of Workstation Users by Clustering Email, Huang et al.
Learning to Extract Signature and Reply Lines from Email, Carvalho et al.
User expertise modeling and adaptivity in a speech-based e-mail system, Jokinen et al.
Learning to Classify Email into “Speech Acts”, Cohen et al.
The AthosMail Text Processor, Gamback, et al.
Knowledge intensive e-mail summarization in CARPANTA, Alonso et al.
15. Other interesting email research Not all email research is language oriented
Social Network Analysis (work by Danyel Fisher, Marc Smith, others)
Calendar Research (A. J. Brush, others)
HCI (ReMail project at IBM; Grand Central by Gina Venolia)
Visualization (MailSOM by Florian Mansmann)
Email Storage
Next generation email protocols
16. Part 2: Spam SPAM is the number one problem for email systems
Estimates from about 71% to 87% of mail is spam
At 71%, if you stop 90% of the spam, 1/5 of your mail will be spam
Over a billion spam a day will get past filters worldwide.
Techniques spammers use
Solutions to Spam
Other kinds of abuse (SPIM and SPAT and BLAM)
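The "1/5 of your mail" figure on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name is made up here, and it assumes the filter never deletes good mail.

```python
def spam_share_after_filtering(spam_fraction: float, catch_rate: float) -> float:
    """Fraction of delivered mail that is still spam.

    Assumption for this sketch: the filter removes no good mail.
    """
    good = 1.0 - spam_fraction                      # good mail, all delivered
    spam_left = spam_fraction * (1.0 - catch_rate)  # spam that slips through
    return spam_left / (spam_left + good)


share = spam_share_after_filtering(0.71, 0.90)
print(f"{share:.3f}")  # 0.197, i.e. roughly 1/5
```

With the higher estimate of 87% spam, the same 90% catch rate leaves about 40% of delivered mail as spam.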
17. Techniques spammers use A few examples of tricks spammers use to get past spam filters
Most spam filters have text classification as main or important part, often with linear models (e.g. Naïve Bayes, etc.)
18. The Hitchhiker Chaffer Content Chaff
Random passages from the Hitchhiker’s Guide
Footers from valid mail
19. Hitchhiker Chaffer’s Later Work Can use hidden text, e.g. white on white or many other tricks
User sees only spammy text
Spam filter sees everything, including good words.
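Why chaff works against a linear model can be shown with a toy scorer. This is a minimal sketch, not any real filter: the word weights are invented for illustration, and a real Naïve Bayes model would learn them from labeled mail.

```python
# Hypothetical per-word log-odds (positive = spammy). Illustrative numbers only.
WEIGHTS = {
    "viagra": 3.0, "free": 1.5, "click": 1.0,
    "meeting": -1.5, "towel": -1.0, "galaxy": -1.0, "regards": -1.2,
}


def spam_score(text: str, weights=WEIGHTS, bias: float = 0.0) -> float:
    """Linear score: bias plus the weight of every token occurrence."""
    return bias + sum(weights.get(tok, 0.0) for tok in text.lower().split())


spam = "free viagra click"
chaffed = spam + " towel galaxy meeting regards"  # hidden "good word" chaff

print(spam_score(spam))     # 5.5
print(spam_score(chaffed))  # lower, because the good words subtract from the sum
```

Because the score is a plain sum, every good word the filter sees pulls the total toward "good", even when the user never sees that text.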
20. Hitchhiker Chaffer’s Later Work Can use hidden text, e.g. white on white or many other tricks
21. Weather Report Guy Content in Image
Good Word Chaff
22. Secret Decoder Ring Looks easy
Is it?
23. Secret Decoder Ring Dude Character Encoding
HTML word breaking
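A filter can undo the simplest versions of these two tricks by approximating what the user actually sees before tokenizing. The sketch below is an assumption about how one might do that with the Python standard library; it does not handle CSS-hidden text, images, or malformed HTML.

```python
import html
import re

# Matches tags and comments such as <b>, </b>, <!-- x -->.
TAG_RE = re.compile(r"<[^>]*>")


def visible_text(raw_html: str) -> str:
    """Strip tags, then decode character references."""
    no_tags = TAG_RE.sub("", raw_html)   # "ag<b></b>ra" -> "agra"
    return html.unescape(no_tags)        # "&#118;" -> "v"


print(visible_text("&#118;i<!-- x -->ag<b></b>ra"))  # viagra
```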
24. Diploma Guy Word Obscuring
25. Diploma Guy Word Obscuring
26. Diploma Guy Word Obscuring
27. Diploma Guy Word Obscuring
28. Diploma Guy Word Obscuring
29. More of Diploma Guy Diploma Guy is good at what he does
30. Trends in Spam Exploits (Hulten et al.)
31. Solutions to Spam Filtering
Machine Learning
Matching/Fuzzy Hashing
(Blackhole Lists (IP addresses))
Turing Tests, Money, Computation
(Disposable Email Addresses)
Smart Proof
32. Filtering Technique: Machine Learning Learn spam versus good
Problem: need source of training data
Get users to volunteer GOOD and SPAM
Over 100,000 volunteers on Hotmail, over 50,000 new labeled examples/day.
Use standard text classification features, but also email/spam features
Time of day, number of recipients, etc.
But spammers are adapting to machine learning too
Images, different words, misspellings, etc.
33. Filtering Technique: Matching/Fuzzy Hashing Use “Honeypots” – addresses that should never get mail
All mail sent to them is spam
Look for similar messages that arrive in real mailboxes
Exact match easily defeated
Use fuzzy hashes
How effective?
The Madlibs attack defeats exact match filters and most fuzzy hashing
Spammers already doing this
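The difference between exact matching, fuzzy matching, and a Madlibs-style rewrite can be illustrated as follows. Word-trigram Jaccard similarity is used here as a simple stand-in for a fuzzy hash; it is not the algorithm any particular product uses, and the example messages are invented.

```python
import hashlib


def exact_hash(text: str) -> str:
    """Exact-match fingerprint: any change to the text changes the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard overlap of the two shingle sets (1.0 = identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


honeypot = "buy cheap pills online today and save big on every order"
variant = honeypot + " now"  # one word appended
madlib = "get discount meds online now and keep more on each purchase"

print(exact_hash(honeypot) == exact_hash(variant))  # False: exact match defeated
print(similarity(honeypot, variant))                # 0.9: fuzzy match still fires
print(similarity(honeypot, madlib))                 # 0.0: every phrase was swapped
```

The Madlibs message swaps each phrase for a synonym, so no three-word run survives and the similarity collapses.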
34. Postage Basic problem with email is that it is free
Force everyone to pay (especially spammers) and spam goes away
Send payment pre-emptively, with each outbound message, or wait for challenge
Multiple kinds of payment: Turing Test, computation, money
35. Turing Tests (HIP, CAPTCHA) (Naor ’96) You send me mail; I don’t know you
I send you a challenge: type these letters
Your response is sent to my computer
Your message is moved to my inbox, where I read it
36. Computational Challenge (Dwork and Naor ’92) Sender must perform time consuming computation
Example: find a hash collision
Easy for recipient to verify, hard for sender to find collision
Requires say 10 seconds (or 5 minutes?) of sender CPU time (in background)
Can be done preemptively, or in response to challenge
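The "hard to find, easy to verify" asymmetry can be sketched with a hashcash-style puzzle. Note the swap: instead of a true hash collision, this example searches for a nonce whose SHA-256 digest starts with a given number of zero bits. The function names and the message format are assumptions for illustration.

```python
import hashlib
from itertools import count


def _digest_int(message: str, nonce: int) -> int:
    data = f"{message}:{nonce}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def solve(message: str, bits: int) -> int:
    """Sender: brute-force the first nonce whose hash has `bits` leading zero bits."""
    target = 1 << (256 - bits)
    for nonce in count():
        if _digest_int(message, nonce) < target:
            return nonce


def verify(message: str, nonce: int, bits: int) -> bool:
    """Recipient: a single hash is enough to check the work."""
    return _digest_int(message, nonce) < (1 << (256 - bits))


msg = "to:alice@example.com"
nonce = solve(msg, bits=12)          # about 4,096 hashes on average
print(verify(msg, nonce, bits=12))   # True
```

Each extra bit doubles the sender's expected work while verification stays at one hash, which is how the cost could be tuned toward the 10 seconds or 5 minutes mentioned above.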
37. Money Pay actual money (1 cent?) to send a message
My favorite variation: take money only when user hits “Report Spam” button
Otherwise, refund to sender
Free for non-spammers to send mail, but expensive for spammers
Requires multiple monetary transactions for every message sent – expensive
Who pays for infrastructure?
38. SmartProof: Most challenge-response approaches challenge every message
We use machine learning. Challenge only suspicious messages (avoids annoying challenges)
Can auto-respond with computation
Least annoying to sender – may never see challenge
Can respond by solving a Turing Test
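The policy on this slide amounts to a small decision rule on top of the classifier output. The sketch below is only a guess at its shape: the 0.5 cutoff, the function names, and the idea that the sender's software advertises puzzle support are all assumptions.

```python
def handle(spam_probability: float, suspicious_above: float = 0.5) -> str:
    """Deliver mail the classifier trusts; challenge only the suspicious rest."""
    return "challenge" if spam_probability > suspicious_above else "deliver"


def respond(sender_can_compute: bool) -> str:
    """How a challenged sender might answer, least annoying option first."""
    if sender_can_compute:
        return "computation"   # solved in the background, sender never sees it
    return "turing test"       # a human types the letters


print(handle(0.05))  # deliver
print(handle(0.90))  # challenge
```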
39. Kinds of Abuse Email spam
Chat rooms (SPAT)
Instant Messenger (SPIM)
Blog spam (BLAM)
All great NLP problems
40. Chat Room Spam MSN closed its free chat rooms
Spambots come in and pretend to chat
But really just advertising porn sites
Some spambots trivial
Don’t talk at all, but take up space
Link to porn spam in their profile
Some spambots very sophisticated
You can have a short conversation with them before they try to convince you to go to their website
Randomized conversations so hard for users to spot
41. Chat Bot
joshuagood9: hi there
superchristina: hey there how u doin?
joshuagood9: doing fine, and you?
superchristina: hey there how u doin?
joshuagood9: are you a bot?
superchristina: im not a bot are u? lol
joshuagood9: are you a bot?
superchristina: i hate bots lol
joshuagood9: how old are you?
superchristina: whats up?
joshuagood9: asl?
superchristina: im 21 f usa and u?
joshuagood9: I am fine, thank you
superchristina: right on asl?... im 20 f usa
joshuagood9: 74/M, WA
superchristina: nice age
joshuagood9: thank you
superchristina: yw sweety..could u do me a favor..check out my homepage and my profile see if my cam works? brb
43. Instant Messenger Spam “SPIM” Send messages to people via IM
Microsoft solved this by requiring people to get permission before IMing
Spammers put spam in their “name” – so permission request message now has spam!
44. Blog Spam (BLAM) Post comments with links in blogs
The links used to be used by search engines as part of rankings
Most search engines now completely ignore these links (throwing away valuable information)
Spammer posts links from his blog to victim blog
Trackback software shows victim that there is a link to his blog
Victim uses trackback to see who linked
Many providers disabling trackbacks
45. SPIM, SPAT, BLAM etc. are great NLP problems Tons of ways to obfuscate email spam, because you can send pictures and arbitrary HTML
IM, chat rooms, blog comments all basically restricted to plain text
NLP techniques may be more appropriate for these domains than for email spam
Other kinds of abuse in chat rooms
Pedophiles, phishing, etc.
MSN and Yahoo have both closed off large parts of their chat room systems because of pedophiles
46. Finding Cool Problems by Building Systems Fun problems we found when we shipped adaptation for a spam filter
Fun problems we found when we worried about losing good mail.
47. What Happened When we Shipped an Adaptive Spam Filter The first spam filter we shipped was adaptive
If user corrected mistakes, we improved the filter.
What to do if the user does not correct mistakes?
We assumed the filter was correct
For users who rarely fixed mistakes, this led to catastrophically bad results – the filter got worse and worse and worse
48. Threshold Drift: Conservative Threshold Setting
49. Threshold Drift: Lots of Spam Classified as Good
50. Threshold Drift: New Separator Parallel to Old
51. Threshold Drift: New Separator Parallel to Old
52. Adaptation with partial user feedback is hard Users may correct all errors, or only all spam, all good, 50% spam, 10% spam, no errors, etc.
Need to work no matter what the user correction rate is
Great problem that you find when you try to build a real system
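The drift described on the preceding slides can be reproduced in a toy one-dimensional setting. This sketch is not the shipped filter: it assumes each message is a single score, the separator is the midpoint of the class means plus a hypothetical conservative margin, and the user never corrects anything.

```python
from statistics import mean

MARGIN = 2.0  # hypothetical safety margin that keeps good mail out of junk


def fit_threshold(scores, labels):
    """Separator = midpoint between class means, shifted conservatively."""
    good = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    return (mean(good) + mean(spam)) / 2 + MARGIN


def self_train(scores, true_labels, rounds):
    """Retrain on the filter's own output; return spam caught in each round."""
    threshold = fit_threshold(scores, true_labels)
    caught = []
    for _ in range(rounds):
        predicted = [s > threshold for s in scores]
        caught.append(sum(p and t for p, t in zip(predicted, true_labels)))
        # No user corrections, so every prediction is assumed correct.
        threshold = fit_threshold(scores, predicted)
    return caught


scores = [0, 1, 2, 3, 6, 7, 8, 9, 10]
truth = [False] * 4 + [True] * 5      # the five highest scores are spam
print(self_train(scores, truth, 4))   # [4, 3, 2, 2]
```

Spam that slips under the conservative threshold is relabeled as good, which drags the good mean upward and pushes the next threshold even higher.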
53. Fun problems we found when we worried about losing good mail Most machine learning focuses on accuracy
Assumes all errors equally bad
For spam (and most other problems) cost of deleting good mail much higher than cost of spam in inbox
54. Our technique (Scott Yih and Joshua Goodman) First, learn a model on all training data (e.g. linear classifier)
Pick the subset of the data in the region you care about
Find all messages, good and spam, that are more than, say, 50% likely to be spam according to the first model
Train a new model on only this data
At test time, use both models
Works substantially better than other techniques: at the desired low false positive rate, it reduces spam by 20%-40% compared to normal techniques.
Can make exciting progress even in a well-explored area like text classification when you build a system.
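The two-stage recipe above can be sketched with a deliberately tiny learner. Everything here is an assumption made for illustration: messages are reduced to one number (higher means spammier), `fit_linear` is a toy stand-in for a real linear classifier, and the way the two models are combined at test time is one plausible reading of "use both models".

```python
import math
from statistics import mean


def fit_linear(xs, ys):
    """Toy 1-D linear classifier: logistic curve centred between class means."""
    spam_mean = mean([x for x, y in zip(xs, ys) if y])
    good_mean = mean([x for x, y in zip(xs, ys) if not y])
    midpoint = (spam_mean + good_mean) / 2
    return lambda x: 1.0 / (1.0 + math.exp(-(x - midpoint)))


def fit_two_stage(xs, ys, region=0.5):
    first = fit_linear(xs, ys)                       # model on all data
    hard = [(x, y) for x, y in zip(xs, ys) if first(x) > region]
    second = fit_linear([x for x, _ in hard],        # model on the region only
                        [y for _, y in hard])

    def predict(x):
        p = first(x)
        # Assumed combination: defer to the second model inside the region.
        return second(x) if p > region else p

    return first, predict


xs = [0, 1, 2, 3, 6, 5, 7, 8, 9, 10]
ys = [False] * 5 + [True] * 5          # the good message at 6 looks spammy
first, predict = fit_two_stage(xs, ys)
print(first(6) > 0.5, predict(6) > 0.5)  # True False
```

In this toy data the first model would junk the good message at 6, while the second model, trained only on the spammy-looking region, keeps it. The price is that borderline spam at 7 is also let through, which is the trade a low false positive setting accepts.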
55. Conclusion (1/2) Building systems is a great way to find interesting and important new problems
Some applied research
Search query logs instead of shallow parser
Sometimes leads to fundamental research
56. Conclusion (2/2)
57. Disposable Email Addresses You have one address for each sender
All go to same mailbox
If I give you my address, and you send me spam, I just delete the address
How do new senders get an address?
If I send mail to 3 people, which address is it From?
Hard to remember!
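One way such a scheme could be wired up is to derive each per-sender address from a keyed hash and keep a table of live addresses. This is a hypothetical sketch: the class, the `user+tag@domain` format, and the use of HMAC are assumptions, and it does not answer the open questions on the slide.

```python
import hashlib
import hmac


class DisposableAddresses:
    """Hands out one address per sender; all of them feed the same mailbox."""

    def __init__(self, user: str, domain: str, secret: bytes):
        self.user, self.domain, self.secret = user, domain, secret
        self.active = {}  # address -> sender it was issued to

    def issue(self, sender: str) -> str:
        tag = hmac.new(self.secret, sender.lower().encode("utf-8"),
                       hashlib.sha256).hexdigest()[:10]
        address = f"{self.user}+{tag}@{self.domain}"
        self.active[address] = sender
        return address

    def revoke(self, address: str) -> None:
        """Called after spam arrives at this address."""
        self.active.pop(address, None)

    def accepts(self, address: str) -> bool:
        return address in self.active


book = DisposableAddresses("me", "example.com", b"hypothetical-secret")
shop = book.issue("shop@example.net")
friend = book.issue("friend@example.org")
book.revoke(shop)  # the shop sold the address to spammers
print(book.accepts(shop), book.accepts(friend))  # False True
```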
58. My Favorite Solution If we could get everyone at Hotmail to never answer any spam, spammers would just give up sending to Hotmail.
So, when new Hotmail users sign up, send them 100 really tempting ads
If they answer any of them, terminate account
59. My Favorite Solution If we could get everyone at Hotmail to never answer any spam, spammers would just give up sending to Hotmail.
So, when new Hotmail users sign up, send them 100 really tempting ads
If they answer any of them, terminate account
Hotmail management refuses to consider this.
60. I tried to ship a grammar checker Eric Brill gave a keynote in ???
“Processing Natural Language without Natural Language Processing”
All you need is lots of data
You can build a grammar checker with very simple machine learning.
Solve common grammar problems like “their”/ “they’re”, etc.
Makes NLP sound really boring and problems seem easy.
Grammar checking is actually a very interesting problem
61. Why grammar checking is interesting (and hard) after all Product groups already had good solutions for English
Wanted Brazilian Portuguese
There’s tons of well-edited data for English
Try finding data for Brazilian Portuguese, etc.
“There’s no data like more data” only applies if there is more data
English is uninflected, but most languages have strong inflection
If you don’t morphologically analyze, the vocabulary is effectively huge, multiplying the data sparsity problem
62. What else went wrong Top priority: agreement (singular/plural, gender)
Traditional ML approach to grammar checking (“confusable word pairs”) is local, no structure
Works well for > 90% of “test” instances, because most agreement is local.
People doesn’t make mistakes when the subject and verb is next to each other
People who make a mistake is most likely to do so when the subject and verb is far apart.
Need grammar, or some other powerful technique
No Brazilian Portuguese treebank
Grammar checking is a great problem for NLP
Trying to build a real system helps us find problems we didn’t even know we had.
63. Blackhole Lists Lists of IP addresses that send spam
Open relays, Open proxies, DSL/Cable lines, etc…
Easy to make mistakes
Open relays, DSL, Cable send good and spam…
Who makes the lists?
Some list-makers very aggressive
Some list-makers too slow
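Many such lists are published over DNS: the client reverses the octets of the sending IP, appends the list's zone, and treats any answer as "listed". The sketch below shows that lookup convention; the zone name is a placeholder, not a real list, and the lookup function needs network access.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """True if the list returns an address record for this IP.

    Caveat: a failed lookup (no network, server down) also returns False here.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# "dnsbl.example.org" is a hypothetical zone used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```

The mistakes listed on this slide show up directly in this design: the answer is per IP address, so a shared relay or a reassigned DSL address is blocked or allowed as a whole.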
64. Nigerian Chatter
tatyanaatkins: want to make money?
joshuagood9: how?
tatyanaatkins: have run a textile company and get pay in cheques and money orders
joshuagood9: how do I make money?
tatyanaatkins: i gt my clients to send them to u while u cash em and remove your pay then sen the rest to me
joshuagood9: Why don't you cash them yourself?
tatyanaatkins: because presently i am traveling around and this come in at a rate faster than i can
tatyanaatkins: need assistance in catching up
tatyanaatkins: if u wish i can send u the letter of incoporation
joshuagood9: yes, email it to me
joshuagood9: joshuagood9@yahoo.com
tatyanaatkins: hold on
joshuagood9: you are in nigeria?
tatyanaatkins: yes
tatyanaatkins: that's where the factory is
joshuagood9: how much will you pay me
tatyanaatkins: u get up to 200 dollars every delivery
joshuagood9: what is in a delivery? how do I get the money to you?
tatyanaatkins: i get the clients to send them to u
joshuagood9: and then what?
tatyanaatkins: u cash it and send via western union
joshuagood9: sounds easy
tatyanaatkins: yeah
joshuagood9: why do you pay me so muchmoney?
joshuagood9: how many do I have to cash? Is one "delivery" one check? or a lot?
tatyanaatkins: cos people have eloped with my money n the past
joshuagood9: why will you trust me?
tatyanaatkins: so i have decided to pay good so we all can be satisfied
joshuagood9: that makes sense
joshuagood9: Let me call you on the phone, and we can talk about it
tatyanaatkins: ok
joshuagood9: what is your number?
tatyanaatkins: 2340833830119
joshuagood9: oh, that's international
joshuagood9: I;m at work now
joshuagood9: I'll have to call you later, from home
tatyanaatkins: ok
tatyanaatkins: are u interested?
joshuagood9: of course!
tatyanaatkins: so i'll send u your letter
joshuagood9: my letter?
tatyanaatkins: of employment
joshuagood9: oh, ok