CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

CANTINA: A Content-Based Approach to Detecting Phishing Web Sites Reporter: Gia-Nan Gao Advisor: Chin-Laung Lei 2010/6/7 1

Reference • Y. Zhang, J. Hong, and L. Cranor, “Cantina: A content-based approach to detecting phishing web sites,” in proceedings of the International World Wide Web Conference (WWW), 2007.

Outline • Introduction • Automated Detection of Phishing • A Content-based Approach for Detecting Phishing Web Sites • Evaluation • Conclusion

Introduction • Phishing • A kind of attack in which victims are tricked by spoofed emails and fraudulent web sites into giving up personal information • How many phishing sites are there? • 9,255 unique phishing sites were reported in June of 2006 alone • How much phishing costs each year? • $1 billion to 2.8 billion per year

Automated Detection of Phishing • Use heuristics to judge whether a page has phishing characteristics • Can detect phishing attacks as soon as they are launched • Attackers may be able to design their attacks to avoid heuristic detection • Often produce false positives • Use a blacklist that lists reported phishing URLs • Higher accuracy • Require human intervention and verification

CANTINA: Content-based Approach • TF-IDF algorithm • Robust Hyperlinks • Adapting TF-IDF for detecting phishing • A set of auxiliary heuristics

TF-IDF Algorithm • Yield a weight that measures how important a word is to a document in a corpus • Term Frequency (TF) • The number of times a given term appears in a specific document • Measure of the importance of the term within the particular document • Inverse Document Frequency (IDF) • Measure how common a term is across an entire collection of documents • A term has a high TF-IDF weight • A high term frequency in a given document • A low document frequency in the whole collection of documents

Robust Hyperlinks • Overcome the problem of broken links • Basic idea • Add a small number of well-chosen terms, which they called a lexical signature, to URLs • Create signatures • Calculate the TF-IDF value for each word in a document, and then select the words with highest value • Lexical signature of about five terms are sufficient to determine a web resource virtually uniquely

How CANTINA Works? (1/2) • Given a web page, calculate the TF-IDF scores of each term on that web page • Generate a lexical signature by taking the five terms with highest TF-IDF weights • Feed this lexical signature to a search engine, which in the case is Google • If the domain name of the current web page matches the domain name of the N top search results, it will be considered a legitimate web site. Otherwise, it will be considered a phishing site.

How CANTINA Works? (2/2) • Assumption • Google indexes the vast majority of legitimate web sites, and that legitimate sites will be ranked higher than phishing sites • Two heuristics • Domain name • Add the current domain name to the lexical signature • Zero results Means Phishing (ZMP) • If Google fails to return any result, the suspected site will be labeled as phishing • Example: eBay & its phishing site

Example: eBay

Phishing Site of eBay

Legitimate Site of eBay

Auxiliary Heuristics

Evaluation • Four experiments • Evaluation of TF-IDF • Evaluation of heuristics • Evaluation of CANTINA • Evaluation of CANTINA using URLs gathered from email • Two metrics • True positives (correctly labeling a phishing site as phishing, higher is better) • False positives (incorrectly labeling a legitimate site as phishing, lower is better)

Experiment 1 – Evaluation of TF-IDF (1/3) • Basic TF-IDF – Calculate the lexical signature based on the top 5 terms, submit that to Google, and check if the domain name of the page in question matches any of the top 30 results • Basic TF-IDF+domain – Same as Basic TF-IDF, except that the domain name of the page in question is added to the lexical signature • Basic TF-IDF+ZMP – Same as Basic TF-IDF, except that zero search results means that the page in question is labeled as a phishing site (ZMP is “zero means phishing”) • Basic TF-IDF+domain+ZMP – A combination of the two variants above. This combination turned out to have the best results, and is also called Final-TF-IDF in later sections

Experiment 1 – Evaluation of TF-IDF (2/3) • 100 phishing URLs from PhishTank.com • 100 legitimate URLs from a list of 500 used in 3Sharp’s study of anti-phishing toolbars

Experiment 1 – Evaluation of TF-IDF (3/3)

Experiment 2 – Evaluation of Heuristics (1/2) • Determine the best weights for these heuristics. • If S = 1, the page is labeled as a legitimate page, and if S = -1, it is labeled as a phishing site

Experiment 2 – Evaluation of Heuristics (2/2)

Experiment 3 – Evaluation of CANTINA (1/2) • Comparison: • Final-TF-IDF, Final-TD-IDF+heuristics, SpoofGuard, and Netcraft • SpoofGuard • Rely entirely on heuristics • Netcraft • Use a combination of heuristics and a blacklist • 100 phishing URLs with unique domains from PhishTank • 100 legitimate URLs using the following strategy • Select the login pages of 35 sites that are often attacked by phishers • Select the 35 top pages from Alexa Web Search • Select 30 random pages from http://random.yahoo.com/fast/ryl, and manually verify that they are legitimate

Experiment 3 – Evaluation of CANTINA (2/2)

Experiment 4 – Evaluation of CANTINA Using URLs Gathered from Email (1/2) • Evaluate CANTINA using URLs gathered from users’ actual email • Gather 3038 unique URLs, of which only 2519 were active, from the 3385 email messages • Label the active URLs as “phishing,” “spam,” or “legitimate.” manually • Phishing: pages that impersonate known brands and ask for personal data (19) • Spam: those selling unsolicited products or services (388) • All other URLs were deemed legitimate (2100)

Experiment 4 – Evaluation of CANTINA Using URLs Gathered from Email (2/2)

Conclusion • A content-based approach for detecting phishing web sites • Pure TF-IDF approach • Can catch 97% phishing sites with about 6% false positives • Final TF-IDF approach • Can catch about 90% of phishing sites with only 1% false positives

CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

Presentation Transcript

Rigid Structure from Video

A Community-Based Approach to Teenage Pregnancy Prevention

Even Faster Web Sites

Spyware Spam Phishing

Detecting Degradation in DNA samples

Adrenal Incidentaloma: Evidence Based Approach

Detecting Outliers

Acute Pancreatitis Evidence Based Approach

Portfolio-based Assessment: A constructivist approach to measuring learning.

Phishing: Some Technical Suggestions for Banks and Other Financial Institutions

Database as a Service: An Architecture Based Approach to Designing DBaaS

Approach to Altered Mental Status

Coding with ASCII: compact, yet text-based 3D content

Beginning Level Content-Based Language Learning

Behavior Change and Organizational Sustainability: An Evidence-Based Approach

Learning on Relevance Feedback in Content-based Image Retrieval

Too many matches…

Approach to a thyroid nodule

A Strength-based Approach to Supervised Visitation in Child Welfare

EVIDENCE BASED MEDICINE A new approach to clinical care and research

A New Approach for the Performance Based Seismic Design of Structures

Content, Assessment and Pedagogy: An Integrated Design Approach