Bayesian Filtering Anti-Phishing Toolbar Benefits P. Likarish, E. Jung, D. Dunbar, T. E. Hansen, and J.-P. Hourcade 12/04/07 presented by EJ Jung
Why study phishing? • Identity theft* • One of the fastest-growing crimes • ~15 million Americans/year, $2.8 billion *Gartner, Inc. 2007 press release. http://www.gartner.com/it/page.jsp?id=501912, March 2007 **Phishing report. http://apwg.org
Phishing leads to malware** • Trojans and keyloggers **Phishing report. http://apwg.org
Phishing and botnets feed into the black market (Franklin et al., 2007) • 6 months of IRC logs
… and into a national security threat • FBI Director Robert Mueller says: • Younis Tsouli and his colleagues stole thousands of credit card accounts through phishing schemes. They ran up charges of more than $3 million for items they thought fellow extremists might need, from night vision goggles to GPS devices. • Botnets are the Swiss Army knives of hackers
Anti-Phishing Tools • Client or server side? • Server-side protection is limited • server-client cooperation • hash of system • Client-side is more common • web browser toolbar • password management
Early Efforts • Largely heuristics-based • Set of rules developed by experts • Still used by most anti-phishing tools • Examples: • IE7 phishing filter • SpoofGuard
SpoofGuard* • IE6 toolbar • Developed by Chou, Ledesma, Teraguchi, Boneh, Mitchell at Stanford • Heuristics+whitelist *N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and J. C. Mitchell. Client-side defense against web-based identity theft. In NDSS '04: Proceedings of the 11th Annual Network and Distributed System Security Symposium, February 2004
Stateless Heuristics • URL check • Suspicious URLs: @, IP, hex • Image check • Hashed image database • Image hashing • Produces same hash for similar images • Link check • Fails if >¼ of links fail URL check • Password check
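The URL and link checks on this slide can be sketched in a few lines of Python. This is an illustration only, not SpoofGuard's code: the function names, the regular expressions, and the choice to look for hex escapes in the host part of the URL are assumptions. Only the three URL signals (@, IP, hex) and the one-quarter link threshold come from the slide.

```python
import re
from urllib.parse import urlsplit

# Assumed patterns: a dotted-quad IPv4 host and a %XX hex escape.
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
_HEX_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def url_is_suspicious(url: str) -> bool:
    """Return True if the URL shows any of the three signals from the slide."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    has_at = "@" in parts.netloc                     # user@host trick
    is_ip = bool(_IPV4.match(host))                  # raw IP instead of a name
    has_hex = bool(_HEX_ESCAPE.search(parts.netloc)) # hex-encoded host
    return has_at or is_ip or has_hex


def link_check_fails(links: list[str]) -> bool:
    """Link check: fails if more than 1/4 of the links fail the URL check."""
    if not links:
        return False
    failed = sum(url_is_suspicious(link) for link in links)
    return failed / len(links) > 0.25
```

Both checks are stateless: they need nothing beyond the page's own URLs, which is why they are cheap to run on every page load.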
Stateful Heuristics • Domain check • Hamming distance to known domains • Referrals • From email site? • May require DNS lookup • Image-domain association • Extension of hashed image heuristic • <image, URL> tuples
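A minimal sketch of the domain check, assuming the known domains come from some stored list such as browser history. The function names and the `max_distance` threshold are hypothetical. Hamming distance is only defined for equal-length strings, so this sketch skips candidates of a different length.

```python
from typing import Iterable, Optional


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def resembles_known_domain(
    domain: str, known_domains: Iterable[str], max_distance: int = 1
) -> Optional[str]:
    """Return the known domain that `domain` imitates, or None.

    An exact match (distance 0) is the real site, so it is not flagged.
    """
    for known in known_domains:
        if len(known) != len(domain):
            continue
        if 0 < hamming_distance(domain, known) <= max_distance:
            return known
    return None
```

For example, `resembles_known_domain("paypa1.com", ["paypal.com"])` flags the look-alike, while the genuine `paypal.com` passes.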
Scoring • TSS = Total Spoof Score • Ex: P1 = URL check (0 if page passes, 1 if it fails), w1 = 0.2 Source: N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and J. C. Mitchell. Client-side defense against web-based identity theft. In NDSS '04: Proceedings of the 11th Annual Network and Distributed System Security Symposium, February 2004
Drawbacks to Heuristics • Difficult to develop accurate rules* • Large number of false positives and negatives** • Heuristics don't evolve; phishing sites do. *M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In AAAI Workshop on Learning for Text Categorization, July 1998. **Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA: a content-based approach to detecting phishing web sites. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 639–648, New York, NY, USA, 2007. ACM Press.
Next: Blacklist/Whitelist • ~2004 to present • Largely blacklist-based • rely on phishing site reports • still used by most anti-phishing tools • Examples: • IE7 phishing filter • Firefox 2 phishing protection & Google Safe Browsing • Netcraft* Toolbar *Netcraft Ltd. http://toolbar.netcraft.com
Drawbacks to Blacklist/Whitelist • Need reliable and timely sources for reports • Window of vulnerability* • after site launch, before being blacklisted • avg lifetime of a phishing site: 3 days • avg lifetime after blacklisted: 22 hours • cost of undoing identity theft: priceless • Adapt classification methods: CANTINA, B-APT *Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA: a content-based approach to detecting phishing web sites. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 639–648, New York, NY, USA, 2007. ACM Press.
CANTINA* • Technique • TF-IDF + Robust Hyperlinks • Domain name • Heuristics *Y. Zhang, J. I. Hong, and L. F. Cranor. CANTINA: a content-based approach to detecting phishing web sites. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 639–648, New York, NY, USA, 2007. ACM Press.
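The formula itself did not survive on this slide, so the sketch below assumes the simplest reading of the example: each heuristic i yields a result P_i (0 = pass, 1 = fail) and the score is the weighted sum of those results. Only w1 = 0.2 for the URL check comes from the slide. The other weights, the heuristic names, and the function name are made up for illustration, and the actual SpoofGuard score may contain further terms.

```python
def total_spoof_score(results: dict[str, int], weights: dict[str, float]) -> float:
    """Assumed first-order score: sum of w_i * P_i over all heuristics."""
    return sum(weights[name] * outcome for name, outcome in results.items())


# Hypothetical weights; only the 0.2 for the URL check appears on the slide.
weights = {"url": 0.2, "image": 0.3, "link": 0.1}

# A page that fails the URL and link checks but passes the image check.
results = {"url": 1, "image": 0, "link": 1}
score = total_spoof_score(results, weights)
```

A tool built this way would compare the score against a user-chosen threshold and warn when the score exceeds it.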
TF-IDF • Text classification technique • Information retrieval • Term Frequency-Inverse Document Frequency • Importance of a word in a document in a given corpus • Document = website • Corpus = English language
Robust Hyperlinks • Phelps and Wilensky • TF-IDF on all words on page • Lexical signature • 5 words with highest TF-IDF scores • Almost uniquely identifies a page among 1,000,000,000 pages…
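The lexical signature can be sketched as follows. This is a toy version under stated assumptions: the corpus here is a small list of documents standing in for the English-language corpus on the TF-IDF slide, the tokenizer keeps ASCII letters only, and the smoothed IDF formula is one common variant, not necessarily the one used by Phelps and Wilensky or by CANTINA.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document: str, corpus: list[str], size: int = 5) -> list[str]:
    """Return the `size` terms of `document` with the highest TF-IDF scores."""
    doc_tokens = tokenize(document)
    term_counts = Counter(doc_tokens)
    corpus_terms = [set(tokenize(text)) for text in corpus]
    n_docs = len(corpus)

    scores = {}
    for term, count in term_counts.items():
        tf = count / len(doc_tokens)
        df = sum(term in terms for terms in corpus_terms)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:size]
```

Terms that are frequent on the page but rare in the corpus (brand names, for instance) rise to the top, which is what makes the signature useful as a search query.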
TF-IDF + Hyperlinks in CANTINA • Calculate lexical signature • Google search on signature • If domain name is within top 30 hits, site is legitimate • Otherwise, it is phishing • Results: • 94% true positives : 30% false positives
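The decision rule on this slide can be written down directly once the search step is abstracted away. In the sketch below, `search` is a hypothetical callable supplied by the caller that returns result URLs in rank order; no real search-engine API is implied. Comparing full host names is also an assumption, since the slide does not say how domains are matched.

```python
from typing import Callable, Sequence
from urllib.parse import urlsplit


def is_legitimate(
    page_url: str,
    signature: Sequence[str],
    search: Callable[[str], Sequence[str]],
    top_n: int = 30,
) -> bool:
    """True if the page's host appears among the top `top_n` search hits."""
    domain = (urlsplit(page_url).hostname or "").lower()
    if not domain:
        return False
    hits = list(search(" ".join(signature)))[:top_n]
    return any((urlsplit(hit).hostname or "").lower() == domain for hit in hits)


def fake_search(query: str) -> list[str]:
    """Stand-in for a search engine, used only for this illustration."""
    return ["https://www.paypal.com/signin", "https://en.wikipedia.org/wiki/PayPal"]
```

With this stand-in, a page served from `www.paypal.com` is judged legitimate, while a copy of the same page hosted elsewhere is judged phishing because its host never appears in the results.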
Improving on TF-IDF • Add domain name to Google search • 97% • 30% • TF-IDF + Zero results-Means-Phishing + domain name • 97% t.p. : 10% f.p. 67% t.p. 10% f.p.
Adding heuristics to CANTINA • Heuristics from SpoofGuard and other sources • Trade-off • Reduces true positive accuracy • 97% → 89% t.p. • Reduces false positive rate • 10% → 1% f.p.
Drawbacks to CANTINA • Relies on outside sources for information • Google • Requires heuristics to reduce false positives • Reduces accuracy… • Language-specific • Different corpus for each foreign language • Difficulties with East Asian languages • Unacceptable false positive rate • Misclassifications undermine user confidence in tool
B-APT: Bayesian Anti-Phishing toolbar • Firefox browser toolbar • will extend to other browsers • goals: detect, communicate, and educate • Bayesian filtering + whitelist • similar to spam filtering • different from spam filtering • phishing sites mirror legitimate sites • hard to find training set (inbox vs. blacklist database) • comprehensive whitelist • Innovative UI • no known effective security indicators for warning user of phishing sites (Dhamija, 2006; Wu, 2007)
Bayesian classification • Bayes' law on conditional probability • Pros • easy to compute • training and tailoring • Cons • assumes independence among words • Bayesian poisoning
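A minimal sketch of the kind of classifier this slide describes: a multinomial naive Bayes model over page tokens with add-one smoothing. It is not B-APT's implementation. The class name, the tokenizer, and the smoothing choices are assumptions made for illustration.

```python
import math
import re
from collections import Counter

LABELS = ("phishing", "legitimate")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesPageClassifier:
    def __init__(self) -> None:
        self.token_counts = {label: Counter() for label in LABELS}
        self.page_counts = {label: 0 for label in LABELS}

    def train(self, text: str, label: str) -> None:
        """Add one labeled page to the training data."""
        self.token_counts[label].update(tokenize(text))
        self.page_counts[label] += 1

    def log_posterior(self, text: str, label: str) -> float:
        """log P(label) + sum of log P(token | label), up to a constant."""
        vocabulary = set().union(*self.token_counts.values())
        total_tokens = sum(self.token_counts[label].values())
        total_pages = sum(self.page_counts.values())
        # Add-one smoothing on the prior and on the token likelihoods; the
        # extra +1 in the denominator reserves mass for unseen tokens.
        score = math.log((self.page_counts[label] + 1) / (total_pages + len(LABELS)))
        denominator = total_tokens + len(vocabulary) + 1
        for token in tokenize(text):
            score += math.log((self.token_counts[label][token] + 1) / denominator)
        return score

    def classify(self, text: str) -> str:
        """Return the label with the higher posterior."""
        return max(LABELS, key=lambda label: self.log_posterior(text, label))
```

The per-token sum is where the independence assumption listed under Cons enters: each word contributes to the score as if it occurred independently of the others. Bayesian poisoning exploits the same sum by padding a phishing page with words typical of legitimate pages.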
Implementation details • Training on phishing pages and legitimate pages • Phishtrack: HTML of phishing pages* • 1200+ phishing sites = 160+ unique sites • Alexa top 500: most popular websites** • same KBs of phishing sites (17k vs 64k tokens) *http://www.dslreports.com/phishtrack **http://www.alexa.com/
B-APT detecting phishing sites • Anti-phishing tools tested on 60 phishing sites
B-APT detecting legitimate sites • Anti-phishing tools tested on 60 legitimate sites
Summary • Classification + heuristics do well • B-APT has no false negatives, some false positives • working on communicating false positives • detect, communicate, and educate • Use of any toolbar is better than none • the lowest detection rate was 42%, for IE7 • blacklist-based ones get better as time passes (Zhang, 2007) • Beware of malware • Badware.org with Google