CANTINA: A Content-Based Approach to Detecting Phishing Web Sites • Yue Zhang, Jason Hong, and Lorrie Cranor • WWW 2007 • 2008.09.09
Agenda • Phishing Attacks • Motivation & Goal • Related Work • CANTINA • Evaluation • Conclusion
Phishing Attacks(1/2) • The act of stealing personal information via the Internet for the purpose of committing financial fraud • Create a fake site that resembles an original site, such as a bank's • Deliver it to users by various methods • Spam e-mail, XSS vulnerabilities, malware … • Technical issues • URL obfuscation • Similar domains, URL encoding … • DNS hijacking • Modifying the hosts file, DNS server settings … • Malware • BHO (Browser Helper Object), browser toolbar, keylogger …
Phishing Attacks(2/2) • Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages • Similar to original web site • Often contain brand names and other terms that are common on a given web page • Owner’s brands
Motivation & Goal • Phishing is a rapidly growing problem, with 9,255 unique phishing sites reported in 2006 • 84 anti-phishing toolbars • Low accuracy • There is a strong need for better automated detection algorithms • A novel content-based approach for detecting phishing web sites • Achieve higher accuracy than existing approaches
Related Work(1/3) • Anti-phishing research falls into four categories • Why people fall for phishing attacks • Studies examining the reasons that people fall for phishing attacks • Educating people about phishing attacks • Focuses on online training materials, testing, and situated learning • Anti-phishing user interfaces • Focuses on developing better user interfaces for anti-phishing tools • Automated detection of phishing
Related Work(2/3) • Anti-phishing user interfaces • Toolbar-based approach • Browser extensions • Dynamic Security Skins • Web Wallet
Related Work(3/3) • Automated detection of phishing • Use heuristics to judge whether a page has phishing characteristics • Host name, domain name, URLs, … • Use a blacklist of reported phishing URLs
CANTINA | Basic Concept • Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages • These pages contain the brand names and terms of the legitimate pages • Robust Hyperlinks • A technique for recovering broken links • Add a lexical signature to URLs • If the link doesn’t work, feed the signature to a search engine • Ex. http://aaa.com/a.html?lexical-signature==“word1+word2+...+word5” • TF-IDF (term frequency / inverse document frequency) • Frequency-based algorithm • Basic algorithm for search engines • Used for comparing and classifying documents • A term has a high TF-IDF weight when it has a high term frequency in a given document and a low document frequency across the whole collection of documents
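The TF-IDF weighting and the five-word lexical signature described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the smoothed IDF formula, and the function names `tokenize`, `tf_idf`, and `lexical_signature` are all assumptions made for the example, and the slides do not say which corpus CANTINA uses for document frequencies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(doc, corpus):
    """Return {term: weight} for `doc`, with document frequencies from `corpus`."""
    terms = tokenize(doc)
    if not terms:
        return {}
    counts = Counter(terms)
    corpus_terms = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_terms)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for s in corpus_terms if term in s)
        # Smoothed IDF (an assumed variant): terms that are rare in the corpus score higher.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = (count / len(terms)) * idf
    return weights


def lexical_signature(doc, corpus, k=5):
    """Return the k terms with the highest TF-IDF weight (ties broken alphabetically)."""
    weights = tf_idf(doc, corpus)
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]
```

A term that is frequent on the page but rare elsewhere, such as a brand name, ends up at the front of the signature, while terms that appear in every document sink to the end.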
CANTINA | Basic Concept • Input: web page • Calculate the TF-IDF weight of each term on the page • Take the five terms with the highest TF-IDF weights • Search for the top five terms (term1+term2+...) using Google • Compare the domain name of the current page with the domain names of the Google search results • Phishing site: the domain name of the current page does not match the domain name of any of the top N search results (N = 30)
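The decision step of this flow can be sketched as below. The search engine is passed in as a plain function so the example stays self-contained; `search`, `domain_of`, and `looks_like_phishing` are hypothetical names, and reducing a host to its last two labels is a crude stand-in for real domain matching (it mishandles suffixes such as `co.uk`). How the basic scheme treats an empty result list is not stated on this slide, so the sketch assumes it is not flagged.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Crude domain extraction: keep the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def looks_like_phishing(page_url, signature, search, top_n=30):
    """Flag the page if its domain is absent from the top N search results.

    `search` is any callable that maps a query string to a list of result URLs.
    """
    results = search(" ".join(signature))[:top_n]
    if not results:
        # Assumption: with no results there is no evidence either way.
        return False
    page_domain = domain_of(page_url)
    return all(domain_of(r) != page_domain for r in results)
```

Passing the search function as a parameter also makes it easy to substitute canned results when experimenting with the matching rule.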
CANTINA | Basic Concept Faked Page TF/IDF Top 5 : eBay, user, sign, help, forgot
CANTINA | Basic Concept Real Page TF/IDF Top 5 : eBay, user, sign, help, forgot
CANTINA | Additional Solutions • Basic CANTINA produces a number of false positives • Solutions • Add the current domain name to the lexical signature • ZMP (Zero results Means Phishing) • If Google returns zero search results, label the page as phishing • E.g., for a meaningless domain such as “u-s-j.be” • Larger set of heuristics based on related work • Drawn from existing approaches (e.g., SpoofGuard, PILFER) • Age of Domain, Known Images, Suspicious URL, …
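The first two refinements can be sketched as small changes to the query and to the decision rule. This is an illustration under assumptions: the slide does not say in what form the domain name is added, so the sketch appends the first label of the domain (e.g. "ebay"), and the names `registered_domain`, `extended_signature`, and `classify` are hypothetical.

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Crude domain extraction: keep the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def extended_signature(signature, page_url):
    """Append the page's domain label to the signature if it is not already there."""
    label = registered_domain(page_url).split(".")[0]
    if label and label not in signature:
        return list(signature) + [label]
    return list(signature)


def classify(page_url, results, zmp=True, top_n=30):
    """Return True if the page is judged to be phishing.

    With zmp=True, an empty result list is itself treated as a phishing signal
    (Zero results Means Phishing).
    """
    if not results:
        return zmp
    page_domain = registered_domain(page_url)
    return all(registered_domain(r) != page_domain for r in results[:top_n])
```

For a legitimate page the added domain term usually repeats a brand name that already ranks well, whereas for a page hosted on a meaningless domain the extra term tends to push the query toward zero results, which ZMP then flags.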
Evaluation | Effectiveness #1(1/2) • Four conditions • Basic TF-IDF • Basic TF-IDF + domain name • Basic TF-IDF + ZMP • Basic TF-IDF + domain + ZMP • 100 phishing URLs and 100 legitimate URLs • Phishing URLs: from PhishTank.com • Legitimate URLs: from a previous study
Evaluation | Effectiveness #1(2/2) • Basic TF-IDF + domain + ZMP is adopted as Final-TF-IDF • False positives are still a little high
Evaluation | Effectiveness #2(1/2) • Goal: reduce false positives • Combine Final-TF-IDF with several heuristics
Evaluation | Effectiveness #2(2/2) • Determining the best weights for these heuristics is a typical classification problem • Use a simple forward linear model • Used 100 phishing URLs and 100 legitimate URLs to find the weights
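A weighted linear combination of heuristic votes can be sketched as follows. The slides do not give the model's exact form, so everything here is an assumption for illustration: each heuristic votes +1 (phishing) or -1 (legitimate), the weights are placeholder values rather than the ones learned from the 200 URLs, and `linear_score` and `is_phishing` are hypothetical names.

```python
def linear_score(votes, weights):
    """Weighted sum of heuristic votes (+1 = phishing, -1 = legitimate)."""
    return sum(weights[name] * vote for name, vote in votes.items())


def is_phishing(votes, weights, threshold=0.0):
    """Label the page as phishing when the weighted score exceeds the threshold."""
    return linear_score(votes, weights) > threshold


# Placeholder weights, NOT the values from the paper.
EXAMPLE_WEIGHTS = {"tf_idf": 0.5, "age_of_domain": 0.2, "suspicious_url": 0.3}
```

With these placeholder weights, a page flagged by the TF-IDF check and the suspicious-URL check is labeled phishing even if the domain-age check votes the other way, which shows how a combination can overrule any single heuristic.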
Evaluation | Effectiveness #3(1/2) • Compare the effectiveness of Final-TF-IDF, Final-TF-IDF + heuristics, SpoofGuard, and Netcraft • SpoofGuard: the highest true positive rate • Relies entirely on heuristics • Netcraft: one of the best toolbars overall • Uses a combination of heuristics and an extensive blacklist • 100 phishing URLs from PhishTank.com • 100 legitimate URLs • 35 sites that are often attacked (e.g., Citibank, PayPal) • 35 top pages from Alexa (most popular sites) • 30 random web pages from random.yahoo.com
Evaluation | Effectiveness #3(2/2) • Reduced false positives from 6% to 1% by combining Final-TF-IDF with simple heuristics • However, the true positive rate also decreased
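The true positive and false positive rates quoted in the evaluation follow the usual definitions, which can be sketched as below. The function name `rates` and the example predictions are illustrative assumptions; the sketch only shows how the percentages relate to counts over a test set of 100 phishing and 100 legitimate URLs.

```python
def rates(labels, predictions):
    """Return (true_positive_rate, false_positive_rate).

    `labels` and `predictions` are parallel lists of booleans (True = phishing).
    Assumes both classes are present in `labels`.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # phishing, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # phishing, missed
    fp = sum(1 for y, p in pairs if not y and p)      # legitimate, flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # legitimate, passed
    return tp / (tp + fn), fp / (fp + tn)
```

On 100 legitimate URLs, a 6% false positive rate corresponds to 6 legitimate pages wrongly flagged, and a 1% rate to a single one.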
Discussion • Limitations • Does not apply to non-English web sites • System performance • Depends on the performance of the Google search engine • Attacks by criminals • Use images instead of words • Add invisible text • Circumvent TF-IDF and PageRank • Use “Google Bombs” • Attempt a DoS attack on Google
Conclusion • CANTINA uses TF-IDF + search engines + heuristics to find phishing web sites • Final-TF-IDF: 97% true positives with 6% false positives • Final-TF-IDF + heuristics: 89% true positives with 1% false positives • Shifts the problem of identifying phishing sites to a search engine problem