Detecting Fake Websites: The Contribution of Statistical Learning Theory. Abbasi, Zhang, Zimbra, Chen, and Nunamaker. MISQ, 34(3), 2010. MISQ Best Paper Award, 2011.
Introduction • The increased popularity of the Internet has attracted opportunists seeking to capitalize on the asymmetric nature of online information exchange. • Consequently, many forms of fake and deceptive websites have appeared (Chua & Wareham, 2004): • Web Spam • Sites attempting to deceive search engines in order to boost their ranking (Gyöngyi and Garcia-Molina, 2005). • Objective: search engine optimization (SEO) • Often done for profit (site for sale) • Typically do not attempt to defraud Internet users • Concocted Sites • Fraudulent sites attempting to appear as legitimate commercial service providers. • Objective: failure-to-ship fraud (Chua and Wareham, 2004) • E.g., fake escrow, financial, and delivery company sites (Abbasi and Chen, 2007). • Spoof Sites • Replicas of real commercial sites intended to deceive the authentic sites' customers (Chou et al., 2004). • Objective: identity theft; capturing users' account information
Introduction • We focus on spoof and concocted websites • Since they are used to defraud end users • Concocted sites • Becoming increasingly common, with over one hundred new entries added daily to online databases such as Artists Against 4-1-9. • Spoof sites • According to a 2004 survey, 70% of respondents had visited spoof sites and 15% admitted to providing personal data to spoofs (Wu et al., 2006).
Introduction • Fake websites often look highly professional and are difficult to identify as phony (MacInnes et al., 2005). • In response to increasing Internet user awareness, fraudsters are also becoming more sophisticated (Levy and Arce, 2004). • As a result, there is a need for enhanced fake website detection techniques (Chou et al., 2004). • Such methods are important for decreasing Internet fraud stemming from phony websites.
Introduction • Numerous tools have been proposed; however, they have several shortcomings: • Most are lookup systems: they rely solely on manually crafted blacklists of fake URLs • Lists are generated from user reports, making them reactive. • Few systems using proactive classification techniques have been proposed • Those that exist utilize overly simplistic features and classification heuristics • Most systems are geared towards spoof sites • It is unclear how effective they would be at detecting concocted (generated fraud) sites • We propose a statistical learning theory (SLT) based system for detecting fake websites • Capable of detecting both concocted and spoof sites • Uses a rich feature set and a composite SVM kernel for enhanced fake website detection capabilities • Can be combined with a lookup mechanism for hybridized detection using a dynamic classifier
Fake Website Detection Tools • Several tools have been developed for identifying and protecting against fake websites. • Fake website detection tools fall into two categories: • Lookup Systems • Classifier Systems
Fake Website Detection Tools • Lookup Systems • Description • Use a client-server architecture (Li and Helenius, 2007) • The server side maintains a blacklist of known fake site URLs (Zhang et al., 2007) • Rely on collaborative sanctioning mechanisms similar to reputation ranking (Hariharan et al., 2007) • Examples include the IE7 Phishing Filter, FirePhish, Sitehound, and the Earthlink Toolbar • Advantages • High precision: less likely to report false positives, i.e., considering an authentic site fake (Zhang et al., 2007) • Since all URLs in the database are verified by the online sources from which they are taken. • Computationally faster than classifier systems • Easier to implement • Disadvantages • Lower recall: more likely to report false negatives (i.e., overlooking fake websites) • Since the database is limited to a small number of online resources, it may lack coverage. • Lookup systems are reactive by nature; they depend on users to report URLs (Liu et al., 2006)
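To make the lookup mechanism concrete, here is a minimal sketch of a client-side blacklist check. The domain names and the in-memory set are illustrative assumptions; real lookup systems query a server-maintained, verified blacklist.

```python
# Minimal sketch of a blacklist lookup system (hypothetical, not any actual tool).
# The client normalizes a URL and checks its host against a blacklist of
# verified fake-site domains: high precision, but reactive and limited coverage.
from urllib.parse import urlparse

BLACKLIST = {
    "fake-escrow-example.com",   # hypothetical verified fake-site domains
    "phish-bank-example.net",
}

def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host appears on the blacklist."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return host in BLACKLIST

print(is_blacklisted("http://fake-escrow-example.com/pay"))  # True
print(is_blacklisted("http://legit-bank.example.org/"))      # False
```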
Fake Website Detection Tools • Classifier Systems • Description • Use rule-based heuristics or similarity scores • Applied to website content or domain registration information (Wu et al., 2006; Zhang et al., 2007) • Classifier systems run on the client side • Examples include SpoofGuard, Netcraft, and eBay Account Guard • Advantages • Can provide better coverage (i.e., recall) for spoof and generated fake sites than lookup systems • Depending on the classification heuristics, rules, and/or models used. • Classifier systems are proactive • Disadvantages • Classifiers can be more computationally expensive, taking longer to classify web pages than lookup systems • More prone to false positives • The generalization ability of classification models over time can be an issue • Especially if the fake websites are constantly changing and evolving • In such situations the classification model must also adapt and relearn
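As a rough illustration of the rule-based heuristics such systems apply, the sketch below sums a few simple fraud cues into a suspicion score. The specific rules, weights, and threshold are illustrative assumptions, not the logic of SpoofGuard or any other actual tool.

```python
# Minimal sketch of a rule-based classifier heuristic (illustrative assumptions,
# not any product's actual rules). Each matched cue adds to a suspicion score.
import re

def heuristic_score(url: str, html: str) -> float:
    """Sum simple fraud-cue rules into a suspicion score in [0, 1]."""
    score = 0.0
    if re.match(r"https?://\d{1,3}(\.\d{1,3}){3}", url):
        score += 0.4                     # raw IP address instead of a domain name
    if url.count("-") > 3 or len(url) > 75:
        score += 0.2                     # unusually long or hyphenated URL
    if "<input" in html.lower() and "password" in html.lower():
        score += 0.2                     # page requests credentials
    if "<iframe" in html.lower():
        score += 0.2                     # framing of other content
    return min(score, 1.0)

# Flag a page as suspicious above an illustrative threshold of 0.5.
print(heuristic_score("http://192.168.0.1/login", "<input type='password'>") > 0.5)
```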
Summary of Fake Website Detection Tools • Existing systems' performance is inadequate due to insufficient use of "fraud cues" • Fraud cues could be useful since fake websites are often "templatic" • Fraudsters automatically mass-produce fake websites • There has been no prior evaluation on concocted sites • There has been limited use of classifiers that evaluate page content • Limited utilization of hybrid systems that combine classifiers with a lookup mechanism.
Fraud Cues in Fake Website Templates • Body text • Web page source code • URLs • Images • Linkage information
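The sketch below shows how simple cues from each of the five information types listed above might be extracted from a page. The specific features are illustrative assumptions; the paper's actual system uses a much richer set of over 5,000 features.

```python
# Minimal sketch of extracting one toy fraud cue per information type
# (body text, source code, URLs, images, linkage). Illustrative only.
from html.parser import HTMLParser

class CueExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = 0      # image cue
        self.links = 0       # linkage cue
        self.text_chars = 0  # body text cue

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self.links += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def extract_cues(url: str, html: str) -> dict:
    p = CueExtractor()
    p.feed(html)
    return {
        "url_length": len(url),                             # URL cue
        "num_images": p.images,
        "num_links": p.links,
        "body_text_length": p.text_chars,
        "num_script_tags": html.lower().count("<script"),   # source code cue
    }

print(extract_cues("http://example.com", "<a href='#'>x</a><img src='a.png'>"))
```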
Fake Website Detection using SLT-based Methods • In summary, effective fake website detection systems must: • Generalize across diverse collections of concocted and spoof websites. • Incorporate rich sets of fraud cues. • Leverage important domain-specific knowledge: stylistic similarities and content duplication. • Provide long term sustainability against dynamic adversaries by adapting to changes.
Fake Website Detection using SLT-based Methods • SLT also provides a mechanism for addressing the four important characteristics necessary for effective fake website detection systems. • Ability to generalize • The “maximum margin” principle and corresponding optimization techniques employed by SLT-based classifiers set out to minimize classification error while simultaneously maximizing their generalization capabilities • Rich fraud cues • Since SLT-based classifiers transform input data into a kernel matrix, they are able to utilize sizable input feature spaces • Utilization of domain knowledge • By supporting the use of custom kernels, SLT-based classifiers are able to incorporate unique problem nuances and intricacies, while preserving the semantic structure of the input data space • Dynamic learning • As with other learning-based classifiers, SLT-based classifiers can also update their models by relearning on newer, more up-to-date training collections of real and fake websites
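As a minimal sketch of the SLT approach described above, the following trains a maximum-margin SVM classifier on website feature vectors, assuming scikit-learn and toy random data; the paper's actual feature set, kernel, and implementation differ.

```python
# Minimal sketch of an SLT-based (SVM) website classifier on toy data.
# The max-margin objective of the SVM is what underlies the generalization
# property discussed above; relearning on fresher data gives dynamic updating.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))          # 1,000 sites x 50 toy features
y_train = rng.integers(0, 2, size=1000)   # 1 = fake, 0 = legitimate (toy labels)

clf = SVC(kernel="linear", C=1.0)         # C trades off margin width vs. error
clf.fit(X_train, y_train)

X_new = rng.random((3, 50))               # unseen websites
print(clf.predict(X_new))                 # 1 = flagged as fake
```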
Research Hypotheses • Since classifier systems can better generalize than lookup systems: • H1: Any non-trivial classifier system, rule or learning-based, will outperform systems relying exclusively on a lookup mechanism. • Since SLT-based classifiers can incorporate large sets of fraud cues: • H2: SLT-based website classifiers will outperform rule-based classifiers. • Since SLT-based classifiers can incorporate domain knowledge via custom kernels: • H3: SLT-based learning classifiers will outperform other machine learning algorithms. • Since SLT-based classifiers, equipped with custom, problem-specific kernel functions, can better preserve important fraud cue relations: • H4: SLT-based classifiers using well-designed kernels will outperform those using generic kernel functions
AZProtect System Overview • Developed an SLT-based fake website detection system • Uses a rich feature set and an SVM kernel-based machine learning classifier. • Capable of classifying concocted and spoof sites. • Evaluates multiple web pages from a potential site for improved performance • Prior systems only evaluated a single URL • The feature set utilizes over 5,000 features from 5 information types: • Body text, HTML design, images, linkage, and URLs. • Features were extracted and the classifier built on 1,000 training websites collected 6 months before the testing websites. • Independent of the test bed (no overlap). • Support Vector Machine classifier • Uses a linear composite kernel • Tailored towards representing the content similarity and duplication tendencies of fake websites.
AZProtect System Overview • The linear composite kernel compares pages' feature vectors against training site pages • Considers average and maximum similarity for pattern and duplication detection • Also incorporates page linkage and structure information in each comparison • Considers a website fake if greater than n% of its pages are classified as fake
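The sketch below gives the flavor of the page-level scoring and site-level decision rule just described. Average and maximum cosine similarity stand in for the paper's page-page and page-site scores, and the thresholds are illustrative assumptions, not the system's actual parameters.

```python
# Minimal sketch of page-level similarity scoring plus the n% site-level rule.
import numpy as np

def cosine_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of A and each row of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def page_scores(pages: np.ndarray, fake_pages: np.ndarray) -> np.ndarray:
    """Blend average (pattern) and maximum (duplication) similarity per page."""
    sims = cosine_matrix(pages, fake_pages)
    return 0.5 * sims.mean(axis=1) + 0.5 * sims.max(axis=1)

def site_is_fake(pages, fake_pages, page_thresh=0.8, site_frac=0.5) -> bool:
    """Flag the site if more than site_frac of its pages score as fake."""
    flagged = page_scores(pages, fake_pages) > page_thresh
    return flagged.mean() > site_frac

rng = np.random.default_rng(1)
# 5 candidate pages vs. 50 known fake training pages, 20 toy features each.
print(site_is_fake(rng.random((5, 20)), rng.random((50, 20))))
```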
AZProtect System Overview • [Figure] Illustration of Page-Page and Page-Site Similarity Scores used in the Linear Composite Kernel Function
AZProtect System Overview • [Figure] Kernel Illustration: Comparing Two Web Pages against Legitimate and Fake Websites
Evaluation Test Bed • We evaluated 350 concocted (generated) websites and 350 spoof sites over a 6-week period. • Taken from 4 online databases (Liu et al., 2006; Zhang et al., 2007): • Concocted Sites • Artists Against 4-1-9 • http://wiki.aa419.org • Escrow Fraud Online • http://escrow-fraud.com • Spoof Sites • PhishTank • http://www.phishtank.com • Anti-Phishing Working Group (APWG) • http://www.antiphishing.org • Also evaluated 200 legitimate sites. • Comprised of websites commonly spoofed or those relevant to concocted websites • Resulting in a 900-website test bed
Comparison of Classifier and Lookup Systems • AZProtect had the best overall performance and fake website detection accuracy on both test beds. • Netcraft and SpoofGuard also performed well on both data sets • Sitehound had the worst performance on both test beds • FirePhish and IE7 also fared well on the spoof site test bed, but not on concocted sites
H1 and H2 Results • Conducted pair-wise t-tests on overall accuracy and on concocted and spoof detection rates. • H1: Classifier vs. Lookup Systems • Compared the performance of the four classifier systems against the four lookup-based tools. • AZProtect and Netcraft significantly outperformed the four lookup systems on all three evaluation metrics (p-values < 0.001) • SpoofGuard also significantly outperformed all lookup systems in terms of overall accuracy and concocted site detection rates. • H2: Learning vs. Rule-based Classifier Systems • AZProtect significantly outperformed all three comparison classification systems (all p-values < 0.001). • The SLT-based system's ability to incorporate a rich set of fraud cues allowed it to better detect fake websites than existing rule-based classifier systems.
Comparison of Learning Classifiers • An important element of the AZProtect system is its linear composite SVM kernel. • Compared it with several learning methods applied to related classification problems, including text, style, and website categorization • All algorithms were trained on the same set of 1,000 websites • H3: SLT-based learning classifier vs. other learning classifiers • The linear composite SVM kernel significantly outperformed all six comparison methods in terms of overall accuracy and spoof detection rate. • Also significantly outperformed Naïve Bayes, Winnow, and Neural Net on concocted websites. • However, it was outperformed by J48 on concocted websites.
Comparison of Kernel Functions • We compared the custom linear composite kernel against generic kernel functions. • The comparison kernels did not incorporate problem-specific characteristics of the fake website domain: • A linear kernel that weighted all attributes in the input feature vectors equally (Ntoulas et al., 2006) • A linear kernel that weighted each attribute in the feature vector by its information gain score (attained on the training data). • Additionally, 2nd and 3rd degree polynomial kernels and a radial basis function kernel were included (Drost and Scheffer, 2005). • H4: Custom linear kernel vs. other kernels • The proposed kernel significantly outperformed the comparison kernels in 21 out of 25 conditions.
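As a minimal sketch of one comparison baseline, the information-gain-weighted linear kernel above can be approximated by scaling each attribute before applying a standard linear SVM. This assumes scikit-learn and toy data, with mutual information standing in for information gain.

```python
# Minimal sketch of an information-gain-weighted linear kernel baseline.
# Scaling attributes by sqrt(w_i) yields the kernel K(x, z) = sum_i w_i x_i z_i.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.random((200, 30))                 # 200 toy websites x 30 toy features
y = rng.integers(0, 2, size=200)          # toy fake/legitimate labels

weights = mutual_info_classif(X, y, random_state=0)  # per-attribute weight
Xw = X * np.sqrt(weights)                 # weight attributes before the dot product

clf = SVC(kernel="linear").fit(Xw, y)     # weighted linear kernel SVM
print(clf.score(Xw, y))                   # training accuracy on toy data
```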
Conclusions and Future Directions • Contributions • Advocated the development of SLT-based fake website detection systems • Used experiments to show that SLT-based systems can improve fake website detection capabilities • Due to better generalization ability, the use of rich fraud cues and custom kernels, and dynamic learning. • Proposed an improved SLT-based fake website detection system • An SVM classifier with a composite linear kernel and a rich feature set • Evaluated the effectiveness of static and dynamic classifiers • Compared various state-of-the-art systems for fake site detection • Applied to concocted and spoof sites • Future Directions • Usability study of the proposed AZProtect system • Compare the effectiveness of various toolbar layouts (Wu et al., 2006) • Improve the computation time of the system • Currently 2.9 seconds per website • Other systems take between 0.5 and 2.0 seconds (Chou et al., 2004; Liu et al., 2006)
References • Abbasi, A. and Chen, H. "Detecting Fake Escrow Websites using Rich Fraud Cues and Kernel Based Methods," In Proceedings of the Workshop on Information Technologies and Systems, Montreal, Canada, 2007. • Chou, N., Ledesma, R., Teraguchi, Y., Boneh, D., and Mitchell, J. C. "Client-side Defense Against Web-based Identity Theft," In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, 2004. • Chua, C. E. H. and Wareham, J. "Fighting Internet Auction Fraud: An Assessment and Proposal," IEEE Computer, (37:10), 2004, pp. 31-37. • Gyöngyi, Z. and Garcia-Molina, H. "Spam: It's Not Just for Inboxes Anymore," IEEE Computer, (38:10), 2005, pp. 28-34. • Hariharan, P., Asgharpour, F., and Jean Camp, L. "NetTrust – Recommendation System for Embedding Trust in a Virtual Realm," In Proceedings of the ACM Conference on Recommender Systems, Minneapolis, Minnesota, 2007. • Levy, E. and Arce, I. "Criminals Become Tech Savvy," IEEE Security and Privacy, (2:2), 2004, pp. 65-68. • Li, L. and Helenius, M. "Usability Evaluation of Anti-Phishing Toolbars," Journal in Computer Virology, (3:2), 2007, pp. 163-184. • Liu, W., Deng, X., Huang, G., and Fu, A. Y. "An Antiphishing Strategy Based on Visual Similarity Assessment," IEEE Internet Computing, (10:2), 2006, pp. 58-65. • MacInnes, I., Damani, M., and Laska, J. "Electronic Commerce Fraud: Towards an Understanding of the Phenomenon," In Proceedings of the Hawaii International Conference on System Sciences (HICSS), 2005. • Wu, M., Miller, R. C., and Garfinkel, S. L. "Do Security Toolbars Actually Prevent Phishing Attacks?," In Proceedings of the Conference on Human Factors in Computing Systems, Montreal, Canada, 2006, pp. 601-610. • Zhang, Y., Egelman, S., Cranor, L., and Hong, J. "Phinding Phish: Evaluating Anti-phishing Tools," In Proceedings of the 14th Annual Network and Distributed System Security Symposium (NDSS), 2007.