PhishNet : Predictive Blacklisting to Detect Phishing Attacks

PhishNet: Predictive Blacklisting to Detect Phishing Attacks Pawan Prakash, Manish Kumar, Ramana Rao Kompella, Minaxi Gupta Purdue University, Indiana University INFOCOM (March 2010)

Outline • Introduction • Component1 • Component2 • Evaluation • Related Work • Conclusion

Phishing Attacks • Simplicity and ubiquity of the Web • Attracts many miscreants • They lure innocent users into revealing sensitive information • Phishing is one such attack: already common and increasing by the day • http://www.antifishing.org/events/events.html • APWG's anti-phishing work

Popular Solution • Add additional features within an Internet browser • Often provided by a mechanism known as blacklisting • Simple to design and easy to implement • Major problem: incompleteness • Cyber-criminals are extremely savvy, so they can easily evade blacklists

Observation • Malicious URLs often occur in groups that are close to each other, either syntactically or semantically • e.g., www1.rogue.com, www2.rogue.com • e.g., two URLs whose hostnames resolve to the same IP address

Implication • First, discover new sources of maliciousness in and around the original blacklist entries and add them to the blacklist • Second, relax the exact-match implementation of a blacklist to an approximate match that is aware of several of the legal mutations that often exist within these URLs

PhishNet • Comprises two major components: A. a URL prediction component B. an approximate URL matching component

Predicting Malicious URLs • Predicting new URLs from existing blacklist entries • e.g., PhishTank, http://www.phishtank.com/index.php • Use five heuristics for generating new URLs • Basic idea • combine pieces of known phishing URLs (parents) from a blacklist to generate new URLs (children) • Then, test the existence of these child URLs using a verification process

Heuristics • H1, Replacing TLD (Top Level Domain) • find variants of original blacklist entries obtained by changing the TLDs • use 3210 effective TLDs (e.g., co.in) • H2, IP address equivalence • URLs that have the same IP address are grouped together into clusters • create new URLs by considering all combinations of hostnames and pathnames within a cluster
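H1 and H2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the short `EFFECTIVE_TLDS` list stands in for the 3210 effective TLDs, the function names are hypothetical, and H2 assumes the caller has already grouped blacklist URLs by resolved IP address.

```python
from itertools import product
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the full list of effective TLDs.
EFFECTIVE_TLDS = ["com", "net", "org", "co.in"]

def split_tld(hostname, tlds=EFFECTIVE_TLDS):
    """Return (prefix, tld), preferring the longest matching effective TLD."""
    for tld in sorted(tlds, key=len, reverse=True):
        if hostname.endswith("." + tld):
            return hostname[: -(len(tld) + 1)], tld
    return hostname, None

def replace_tld(url, tlds=EFFECTIVE_TLDS):
    """H1: generate child URLs by swapping the parent's TLD for every other TLD."""
    parts = urlsplit(url)
    prefix, current = split_tld(parts.hostname or "", tlds)
    if current is None:
        return []
    tail = parts.path + (f"?{parts.query}" if parts.query else "")
    return [f"{parts.scheme}://{prefix}.{t}{tail}" for t in tlds if t != current]

def ip_equivalence(urls_by_ip):
    """H2: within each IP cluster, pair every hostname with every pathname."""
    children = set()
    for urls in urls_by_ip.values():
        parsed = [urlsplit(u) for u in urls]
        hosts = {p.hostname for p in parsed}
        paths = {p.path for p in parsed}
        # Assumption: children use plain http, like the parents in this sketch.
        children.update(f"http://{h}{p}" for h, p in product(hosts, paths))
        children.difference_update(urls)  # keep only URLs not already blacklisted
    return children
```

Both functions only generate candidates; each child still has to pass verification before it is added to the blacklist.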

Heuristics (Cont.) • H3, Directory structure similarity • URLs with similar directory structure are grouped together • build new URLs by exchanging the filenames among URLs belonging to the same group • H4, Query string substitution • build new URLs by exchanging the query strings among URLs
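A minimal sketch of H3 and H4, under two assumptions that the slide does not spell out: "similar directory structure" is taken to mean an identical directory path, and query strings are exchanged among whatever URLs the caller passes in. Function names and the example hosts are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def directory_children(urls):
    """H3: group URLs by directory path, then exchange filenames within a group."""
    groups = defaultdict(list)
    for u in urls:
        p = urlsplit(u)
        directory, _, filename = p.path.rpartition("/")
        groups[directory].append((p.scheme, p.netloc, filename))
    children = set()
    for directory, members in groups.items():
        filenames = {name for _, _, name in members if name}
        for scheme, host, _ in members:
            for name in filenames:
                children.add(f"{scheme}://{host}{directory}/{name}")
    return children - set(urls)  # drop the parents themselves

def query_children(urls):
    """H4: give every URL each query string seen among the other URLs."""
    parsed = [urlsplit(u) for u in urls]
    queries = {p.query for p in parsed if p.query}
    children = {p._replace(query=q).geturl() for p in parsed for q in queries}
    return children - set(urls)
```

For example, two parents that share the directory `/online/signin` but end in different filenames yield two children, one per swapped filename.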

Heuristics (Cont.) • H5, Brand name equivalence • phishers often target multiple brand names using the same URL structure • build new URLs by substituting brand names occurring in phishing URLs with other brand names
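H5 amounts to a string substitution over a list of known brand names. The sketch below assumes a small hypothetical brand list and a case-insensitive match anywhere in the URL; the real brand list and matching rules are not given in the slides.

```python
import re

# Hypothetical brand list, for illustration only.
BRANDS = ["paypal", "ebay", "hsbc"]

def brand_children(url, brands=BRANDS):
    """H5: replace each brand name found in the URL with every other brand."""
    children = set()
    for present in brands:
        pattern = re.compile(re.escape(present), flags=re.IGNORECASE)
        if not pattern.search(url):
            continue
        for other in brands:
            if other != present:
                children.add(pattern.sub(other, url))
    return children
```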

Verification • Eliminate URLs that are either non-existent or are non-phishing sites • conduct a DNS lookup • establish a connection to the corresponding server • initiate an HTTP GET request to obtain content from the server • if the request is successful, check the content with a publicly available detection tool
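The verification steps can be sketched as a small filter pipeline. This is an illustration under assumptions: `looks_like_phish` is a hypothetical callback standing in for the detection tool, and the DNS and HTTP steps are passed in as parameters so they can be replaced, for example in tests.

```python
import socket
import urllib.request
from urllib.parse import urlsplit

def resolves(url):
    """Step 1: DNS lookup on the URL's hostname."""
    try:
        socket.gethostbyname(urlsplit(url).hostname)
        return True
    except (OSError, TypeError, UnicodeError):
        return False

def fetch(url, timeout=5.0):
    """Steps 2 and 3: connect and issue an HTTP GET; return the body or None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read() if resp.status == 200 else None
    except (OSError, ValueError):
        return None

def verify(children, looks_like_phish, resolves=resolves, fetch=fetch):
    """Keep only child URLs that exist and that the detector flags as phishing."""
    confirmed = []
    for url in children:
        if not resolves(url):
            continue  # non-existent host
        content = fetch(url)
        if content is None:
            continue  # no usable response
        if looks_like_phish(url, content):
            confirmed.append(url)
    return confirmed
```

Ordering the checks from cheapest (DNS) to most expensive (content analysis) means most of the generated children are discarded early.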

Approximate Matching • Determine whether a given URL is a phishing site or not • Perform approximate match of a given URL to the entries in the blacklist by first breaking the input URL into four different entities • IP address • hostname • directory structure • brand name
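A minimal sketch of the decomposition step, assuming the IP address comes from a caller-supplied resolver (`resolve_ip`, hypothetical) and that brand names are found by substring search against a supplied list. The dictionary layout is illustrative only.

```python
from urllib.parse import urlsplit

def url_entities(url, resolve_ip=None, brands=()):
    """Break a URL into the four entities used for approximate matching."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "ip": resolve_ip(host) if resolve_ip else None,
        "hostname": host,
        # Directory structure: the path without its final filename component.
        "directory": parts.path.rpartition("/")[0],
        "brands": [b for b in brands if b in lowered],
    }
```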

Approximate Matching (Cont.) • M1: Matching IP address • direct match • assign a normalized score based on the number of blacklist entries that map to a given IP address • if IP address IP_i is common to n_i URLs, its score is computed using equation (1)
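The formula for the normalized score does not survive in this text. As one plausible reading, the sketch below assumes min-max normalization of the counts n_i; this is an assumption, not a form confirmed by the slides, and the handling of the all-equal case is likewise a choice made here.

```python
def normalized_scores(counts):
    """Min-max normalize counts n_i into [0, 1] (assumed form of the score)."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # Assumption: when every count is equal, give each entry full confidence.
        return {key: 1.0 for key in counts}
    return {key: (n - lo) / (hi - lo) for key, n in counts.items()}
```

With counts of 1, 3 and 5 blacklisted URLs on three IP addresses, the scores would be 0.0, 0.5 and 1.0 respectively.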

Approximate Matching (Cont.) • M2: Matching hostname • classify hostnames as WHS (Web Hosting Service) or non-WHS

Approximate Matching (Cont.) • M2: Matching hostname • A. Matching WHSes: • if the match succeeds, the confidence score is computed using (1), with n_i being the number of URLs that have the same primary domain • B. Matching non-WHSes: • based on syntactic similarity across labels • if the match succeeds, the confidence score is computed using (1), with n_i being the number of URLs that match a given regular expression
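The slides do not say how the regular expressions for non-WHS hostnames are built. One simple possibility, shown here purely as an assumption, is to generalize each run of digits in a blacklisted hostname so that numbered siblings such as www1 and www2 match the same pattern.

```python
import re

def hostname_pattern(hostname):
    """Build a regex from a blacklisted hostname, generalizing digit runs."""
    # re.split with a capturing group puts the digit runs at odd indices.
    pieces = re.split(r"(\d+)", hostname)
    body = "".join(
        r"\d+" if index % 2 else re.escape(piece)
        for index, piece in enumerate(pieces)
    )
    return re.compile(body)

def hostname_matches(candidate, blacklisted_hostname):
    """True if the candidate hostname fits the generalized pattern."""
    return hostname_pattern(blacklisted_hostname).fullmatch(candidate) is not None
```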

Approximate Matching (Cont.) • M3: Matching directory structure • if the match succeeds, the confidence score is computed using (1), with n_i representing the number of URLs corresponding to a directory structure in the hash map • M4: Matching brand names • if the match succeeds, the confidence score is computed using (1), with n_i being the number of occurrences of the brand name
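The hash map for M3 can be sketched with a counter keyed by directory structure. This is a minimal illustration: the function names are hypothetical, the directory is taken to be the path without its filename, and root-level URLs (empty directory) are ignored so that they do not all match each other.

```python
from collections import Counter
from urllib.parse import urlsplit

def directory_of(url):
    """Directory structure of a URL: its path without the final filename."""
    return urlsplit(url).path.rpartition("/")[0]

def build_directory_map(blacklist):
    """Map each non-empty directory structure to its number of blacklisted URLs."""
    return Counter(d for d in map(directory_of, blacklist) if d)

def match_directory(url, directory_map):
    """Return n_i for the URL's directory structure, or 0 if there is no match."""
    directory = directory_of(url)
    return directory_map.get(directory, 0) if directory else 0
```

The returned count plays the role of n_i; turning it into a confidence score is left to the normalization step.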

Predicting Malicious URLs • Collected URLs over a period of 24 days, from 2nd July 2009 to 25th July 2009 • Generated almost 1.55 million child URLs from approximately 6,000 parent URLs • 34,489 of the 1.55 million (about 2%) could be fetched and were compared with their parent URLs using a page similarity tool • http://www.webconfs.com • Child URLs with greater than 90% similarity are reported as our new malicious URLs

Approximate Matching • Evaluate effectiveness and URL processing time • Use data from four sources • The experimental setup consists of two phases: training and testing • For evaluation, assign the following weights to the normalized scores: • W(M1) = 1.0 • W(M2) = 1.0 • W(M3) = 1.5 • W(M4) = 1.5
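A minimal sketch of how the weights might be applied, assuming the four normalized scores are combined as a weighted sum and compared against a threshold. The weights are the ones listed above; the combination rule and the `threshold` value are assumptions made for illustration.

```python
# Weights as listed for the evaluation.
WEIGHTS = {"M1": 1.0, "M2": 1.0, "M3": 1.5, "M4": 1.5}

def combined_score(scores, weights=WEIGHTS):
    """Weighted sum of normalized scores; a missing component counts as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())

def is_flagged(scores, threshold=1.0):
    """Flag the URL when the combined score reaches a (hypothetical) threshold."""
    return combined_score(scores) >= threshold
```

Under this reading, a match on directory structure or brand name counts 1.5 times as much as a match on IP address or hostname.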

Approximate Matching (figure annotations) • 14% of URLs Gen • 80% of 14% of URLs Gen (11.2% of URLs Gen) • 2.2% of 14% of URLs Gen (2% of URLs Gen)

Related Work • APWG regularly publishes facts and figures about phishing, such as lists of targeted TLDs and brand names and trends in phishing URL structure • Highly predictive blacklisting • Relies on tens of thousands of features based on extra information from outside sources such as WHOIS and registrar information

Conclusion • Blacklisting is the most common technique to defend against phishing attacks • PhishNet incurs few false positives and is remarkably effective at flagging new URLs that were not part of the original blacklist

Thank You
