Detecting Malicious Ads on Craigslist: A Machine Learning Approach

A Machine Learning Approach to Detect Malicious Advertisements on Craigslist • Nithin Reddy Vemula, Rohan Saraf, Idris Wishiwala • Faculty Advisor: Shirin Nilizadeh, Ph.D. • Department of Computer Science and Engineering, University of Texas at Arlington (UTA)

Contents • Motivation • Related Work • Existing Defenses • Overall Design • Data Collection • Research Methodology • Results • Findings • Limitations • Conclusion • Future Work • References • Acknowledgements

Motivation • Malicious ads plague all sorts of online advertisement forums. • Bot fraud impacted up to 37 percent of ads, compared to up to 22 percent observed in a similar study in 2014. • Digital ad fraud takes $1 for every $3 spent on digital ads, and online advertisers were estimated to lose $7.2 billion globally to bots in 2016. • A friend of ours was duped by one such ad, and we have all had similar experiences. • Craigslist was chosen as the forum since it is one of the most widely used (and abused) ad websites in the US. • The popularity of Craigslist and the simplicity of its design made it an ideal choice for this research.

Related Work • Spam detection in online classified advertisements, by Hung Tran et al. • They propose a novel set of features that could strongly discriminate between spam and legitimate advertisement posts. • The study is dated, so its findings may no longer be relevant. • Online recruitment services: another playground for fraudsters, by Sokratis Vidros et al. • Deals mostly with fake and spam ad detection in the recruitment domain. • Its scope is limited to online recruitment services, and it deals mostly with the risk associated with data leaks.

Related Work • In Search of a Toyota Sienna: Prevalence and Identification of Auto Scam on Craigslist, by Shirin Nilizadeh et al. • Deals with online fraud ads in the automobile sector of Craigslist. • The algorithm is focused and may be designed specifically to detect automobile ad fraud. • Craigslist Scams and Community Composition: Investigating Online Fraud Victimization, by Shirin Nilizadeh et al. • Analyses historical scam data and its relationship with the economic, structural, and cultural characteristics of the communities that are exposed to fraudulent advertising.

Existing Defenses • Craigslist users flag postings they find to be in violation via the "prohibited" link at the top of each posting. • Free classified ads that receive enough flags are subject to automated removal. • Postings may also be flagged for removal by Craigslist staff or Craigslist automated systems (very minimal).

Overall Design

Data Collection • Craigslist (Dallas and Fort Worth) • Apartments/Housing, Office, Cars, Appliances, Computers/Computer Parts, and Cell Phones. • The data was collected for 3 weeks (April 6 – April 24). • 60535 posts collected in total over the 3 weeks.

Data Collection Challenges • Craigslist was among the hardest websites to scrape because of its design. • Integration with Selenium WebDriver failed. • Unable to find the desired related field based on the posting ID. • Building our own scraper to search for the tags containing the desired data had limitations.

Data Collection Challenges • Craigslist performs a soft delete, making it difficult to reach a removed post. • The only way to identify a removed post was to visit its dedicated URL. • Structure of a Craigslist advertisement URL: • https://dallas.craigslist.org/ndf/apa/d/addison-cardio-theater-garages-with/6864761185.html • Reaching and extracting the desired field: posting-body
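
The URL structure shown on this slide can be taken apart with the Python standard library. This is a minimal sketch, not the project's code: the field names (site, area, category) are our reading of the example URL, and the function assumes the path always has the five segments seen there.

```python
from urllib.parse import urlparse


def parse_posting_url(url):
    """Split a Craigslist posting URL into its visible components.

    Field names are illustrative assumptions based on the example URL.
    """
    parsed = urlparse(url)
    # Hostname looks like "dallas.craigslist.org"; the first label is the site.
    site = parsed.hostname.split(".")[0]
    # Path looks like "/ndf/apa/d/<slug>/<posting id>.html".
    parts = [p for p in parsed.path.split("/") if p]
    area, category, _marker, slug, filename = parts
    posting_id = filename.rsplit(".", 1)[0]
    return {
        "site": site,
        "area": area,
        "category": category,
        "slug": slug,
        "posting_id": posting_id,
    }


example = (
    "https://dallas.craigslist.org/ndf/apa/d/"
    "addison-cardio-theater-garages-with/6864761185.html"
)
info = parse_posting_url(example)
print(info)
```

Keeping the posting ID alongside the full URL makes it possible to revisit each ad's dedicated page later, which is how removed posts were identified.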

Web Scraper • Inspected the URL of a specific ad and its page elements to locate the posting-body field. • Extraction using Web Scraper lets us access multiple levels of navigation. • Category of Advertisements • Targeted Advertisement • Advertisement Content and Count of Images • Build sitemaps, scrape sites, and export data in CSV format directly from the browser. • 25 minutes to scrape 3000 advertisements

Data Categories

Jupyter Notebook • Python Program • Beautiful Soup • Input to this program was the list of scraped URLs. • The program outputs one of four possible statuses: • Active • Removed by user • Flagged for removal • URL could not be opened. • Used the program to retrieve ground truth from the dataset of classified ads.
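
The four statuses above could be produced along the following lines. This is a hedged sketch rather than the original notebook: it uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without extra packages, and the two marker phrases are assumptions about the wording on removed pages that should be checked against real responses.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumed marker phrases (hypothetical; verify against real removed pages).
REMOVED_BY_USER = "deleted by its author"
FLAGGED = "flagged for removal"


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html):
    """Return the page's text, lowercased, with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split()).lower()


def classify_html(html):
    """Map page content to one of the three content-based statuses."""
    text = page_text(html)
    if FLAGGED in text:
        return "Flagged for removal"
    if REMOVED_BY_USER in text:
        return "Removed by user"
    return "Active"


def classify_url(url, timeout=10):
    """Fetch a posting URL and classify it; network failures get their own status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts; ValueError covers malformed URLs.
        return "URL could not be opened"
    return classify_html(html)
```

Separating `classify_html` from the fetch step keeps the labeling logic testable on saved pages without hitting the site again.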

Ground Truth

Ground Truth Classification

Dataset Limitations • Not all categories were covered. • Fewer records were collected for the targeted categories. • Posts expire after a certain period, making them impossible to categorize. • Active posts were ignored for this research. • "Removed by user" posts may also be malicious. • Imbalanced dataset.

Research Methodology • Training Set • Feature Extraction • Machine Learning Model • Validation

Training Set • 60535 posts reduced to 8056 records • Filtered to remove additional special characters. • 75–25 split • 6042 records were used to train the classifier • 2014 records were used to test the classifier
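
The cleaning and splitting steps on this slide can be sketched in plain Python. Which characters count as "special" and whether the records were shuffled before splitting are not stated in the slides, so both choices below are assumptions made for illustration.

```python
import random
import re


def clean_text(text):
    """Replace every character that is not a letter, digit, or whitespace with a space."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(text.split())


def split_75_25(records, seed=0):
    """Shuffle the records (assumption) and split them 75/25 into train and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_75_25(range(8056))
print(len(train_set), len(test_set))  # 6042 2014
print(clean_text("Call NOW!!! $$$ 555-1234"))  # Call NOW 555 1234
```

With 8056 records, a 75–25 split gives exactly the 6042 training and 2014 test records reported above.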

Training Set Sample Data

Feature Extraction • Text is messy. • A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: • A vocabulary of known words. • A measure of the presence of known words. • Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers, specifically vectors of numbers. • A bag-of-words model is a way of extracting features from text for use in modelling, such as with machine learning algorithms.
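
The two ingredients of a bag-of-words, a vocabulary and a per-document count, fit in a few lines of plain Python. This is a toy illustration with made-up ad text and naive whitespace tokenization, not the feature extractor used in the project.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase word to a column index (alphabetical order)."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


docs = ["cheap phone cheap price", "phone for sale"]  # made-up examples
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)    # {'cheap': 0, 'for': 1, 'phone': 2, 'price': 3, 'sale': 4}
print(vectors)  # [[2, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
```

Word order is discarded: each ad becomes a fixed-length vector of counts that a classifier can consume.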

Feature Extraction • sklearn.feature_extraction.text.CountVectorizer • Stop Words • Max Features • NGram Range (2,3,4) - 1 Word • Ignoring sentiment analysis. • Convert a collection of text documents to a matrix of token counts
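
A hedged sketch of how `CountVectorizer` can be configured with the options named on this slide. The ad texts are made up, and the values for `max_features` and `ngram_range` are assumptions: the slide does not give the feature cap, and `(1, 1)` is only one reading of "1 Word".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up ad texts for illustration only.
ads = [
    "the apartment has two bedrooms and a garage",
    "wire the deposit today and the keys will be mailed",
    "used laptop for sale in the Dallas area",
]

vectorizer = CountVectorizer(
    stop_words="english",  # drop common English words such as "the" and "and"
    max_features=1000,     # assumed cap on vocabulary size
    ngram_range=(1, 1),    # single-word tokens (assumed reading of the slide)
)
X = vectorizer.fit_transform(ads)  # sparse matrix: one row per ad, one column per token

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Raising the upper bound of `ngram_range` (for example to 2, 3, or 4) adds multi-word phrases as features, at the cost of a much larger and sparser matrix.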

Machine Learning Model • Logistic Regression • It is the go-to method for binary classification problems (problems with two class values). • Results were biased towards the majority class • Benign: 7092 • Malicious: 964 • SMOTE: Synthetic Minority Over-sampling Technique • AUROC score 0.56
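
With 7092 benign and 964 malicious records, balancing the classes would take 7092 − 964 = 6128 synthetic minority samples. The core idea of SMOTE is to create each one by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea with made-up points, not the implementation used in the project; libraries such as imbalanced-learn provide a full version.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors.

    Simplified illustration: needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # made-up 2-D features
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Oversampling of this kind should be applied to the training split only, so that the test set keeps the original class balance.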

Machine Learning Model • Random Forest Classifier • Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. • Eliminates the need for resampling. • Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. • Parameters: • n_estimators = 501 • criterion = entropy
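
A minimal sketch of a random forest configured with the two parameters listed on this slide. The ads, the labels, and the `random_state` value are made up for illustration; the real classifier was trained on the 6042-record training set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up ads and labels (1 = malicious, 0 = benign) for illustration only.
ads = [
    "wire money now to reserve this apartment",
    "send deposit by western union before viewing",
    "pay with gift cards and the phone ships today",
    "two bedroom apartment near campus with parking",
    "used washer and dryer pick up in fort worth",
    "office space for lease in downtown dallas",
]
labels = [1, 1, 1, 0, 0, 0]

X = CountVectorizer().fit_transform(ads)

clf = RandomForestClassifier(
    n_estimators=501,     # number of trees, as listed on the slide
    criterion="entropy",  # split quality measured by information gain
    random_state=0,       # assumed, only to make the sketch reproducible
)
clf.fit(X, labels)
predictions = clf.predict(X)
print(predictions)
```

An odd number of trees avoids tied votes between the two classes when the forest's predictions are combined.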

Validation • Machine learning models need to be validated for their performance on the data. • Area Under the Receiver Operating Characteristic curve (AUROC) • Insensitive to imbalanced data • Widely used in prior work • Average AUC score of 0.86, where 1.0 is perfect
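
AUROC can be read as the probability that a randomly chosen malicious ad receives a higher score than a randomly chosen benign one, which is why it does not depend on the class ratio. The sketch below computes it directly from that definition on made-up scores; it is an illustration, not the evaluation code behind the reported 0.86.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = malicious) and classifier scores.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.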

Validation ROC Curve

Results

Findings • The results show that it is possible to detect the malicious nature of an advertisement based on its content. • If used effectively, machine learning can detect such advertisements, flag them as potential spam, and warn users up front.

Limitations • Explored only a select few ad domains. We chose a few frequently used ad domains with a high data inflow (approx. 2000 records or so). • Not tested against poisoning attacks. Due to time constraints, we could not test the classifier against attacks such as poisoning and dynamic modification attacks. • The results are susceptible to change when any updates are made to Craigslist. This could be either good or bad; however, it has not been tested for the current system. • The system is not completely automated and could be subject to human error. • We were also unable to track the source of these ads, which could have eliminated a lot of grunt work.

Conclusion • Targeted the most widely used advertisement forum in the US to detect malicious ads using a machine learning approach. • Studied the impact of an advertisement's content. • Scraped advertisements and their content after overcoming several challenges. • Used each advertisement's content to extract features (words) and train a random forest classifier to detect malicious advertisements. • Achieved an average precision of 90% and a recall of 70%.

Future Work • The system needs to be made more automated to avoid human intervention and eliminate human error. • Collect more data, for more categories, over a longer duration. • Test the system against attacks such as poisoning. • Make the system more generic so that it works for other ad forums. • Work on a different classifier.

References • “Web Scraper Documentation.” Web Scraper, www.webscraper.io/cloud-scraper. • “Dallas / Fort Worth Jobs, Apartments, for Sale, Services, Community, and Events.” Craigslist, dallas.craigslist.org/. • Mongia, Manan. “Python | NLP Analysis of Restaurant Reviews.” www.geeksforgeeks.org/python-nlp-analysis-of-restaurant-reviews/. • Ray, Sunil. “Improve Your Model Performance Using Cross Validation.” 3 May 2018, www.analyticsvidhya.com/blog/2018/05/improve-model-performance-cross-validation-in-python-r/.

Acknowledgements • We would like to thank Prof. Shirin Nilizadeh for providing us the opportunity and independence required to complete this project. Without her support we would not have achieved the results that we did. • We would also like to thank our classmates for their support on and off the field.

Questions?
