
iSRD: Spam Review Detection with Imbalanced Data Distributions



  1. iSRD: Spam Review Detection with Imbalanced Data Distributions • Yan Zhu

  2. Agenda • Overview • Objective • Sentiment Analysis And Imbalanced Data Distributions • iSRD: Methodology • Experiments And Results • Conclusion

  3. Overview • The growing use of the internet in all aspects of our lives means we now rely on it for many daily activities. • Online product reviews have become vital for customers seeking additional user-centered knowledge about products. • However, some vendors pay customers to write favorable reviews in order to boost their revenue from online sales.

  4. Overview • Examples of spam reviews include untruthful/fake reviews and reviews irrelevant to the product (such as advertisements). • One of the most effective ways to distinguish spam from non-spam reviews is to use machine learning techniques. • Non-spam reviews usually form the majority of the population, while spam or fake reviews are relatively rare and difficult to obtain.

  5. Objective • In the paper, the authors discuss sentiment analysis techniques for opinion mining, which convey users' sentiments through document-level sentiment classification based on supervised learning and feature-based sentiment analysis. • To address the problem of imbalanced data, they develop iSRD, a new classifier framework that deals with imbalanced review data.

  6. Sentiment Analysis And Imbalanced Data Distributions • Sentiment Analysis • Sentiments often carry the real message that people want to deliver. • Successfully analyzing and understanding sentiments is useful for many domain-specific applications. • Sentiment analysis is categorized into three levels: document level, sentence level, and aspect level. • Naïve Bayes and support vector machines have proven able to give good results in supervised learning using bag-of-words features (a minimal sketch follows below).
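
To make the supervised setup on this slide concrete, here is a minimal Python sketch of bag-of-words classification with Naïve Bayes using scikit-learn; the reviews and labels are toy placeholders, not the paper's data.

```python
# Minimal sketch: bag-of-words + Naive Bayes for review classification.
# The reviews and labels below are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "Great hotel, friendly staff, would stay again",
    "Terrible room and rude service, never again",
    "Best deal ever, click here for amazing coupons",
    "Comfortable beds and a quiet location",
]
labels = ["non-spam", "non-spam", "spam", "non-spam"]

vectorizer = CountVectorizer()            # bag-of-words features
X = vectorizer.fit_transform(reviews)     # sparse term-count matrix

clf = MultinomialNB().fit(X, labels)      # supervised classifier
print(clf.predict(vectorizer.transform(["Click here for the best prices"])))
```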

  7. Sentiment Analysis And Imbalanced Data Distributions • Sentiment Analysis • Review spam has been identified as falling into three types: • fake reviews, including untruthful reviews; • reviews about the brand only, which describe and comment on the brand rather than the product or service; • non-reviews, which are not reviews at all and may be irrelevant text, questions, or advertisements.

  8. Sentiment Analysis And Imbalanced Data Distributions • Sentiment Analysis • The main challenge is that fake reviews are very hard to detect, even manually, because there is no clear way to distinguish between fake and truthful reviews. • Machine learning is well suited to this task: it achieves good generalization from the provided representations and learns behavior from the given examples in order to classify unseen ones.

  9. Sentiment Analysis And Imbalanced Data Distributions • Imbalanced Data Distributions • In machine learning, imbalanced data distributions often arise because of a lack of examples from the minority class. • The problem appears when users try to train a good classifier from imbalanced training data: classifiers are inherently biased toward the majority class, leading to incorrect generalization rules.

  10. Sentiment Analysis And Imbalanced Data Distributions • Imbalanced Data Distributions • Instead of accuracy, one should focus on precision, recall, sensitivity, and specificity, which give an accurate picture of performance on the minority class; the worked example below shows why.
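
A small worked example (hypothetical counts, not the paper's results) shows why accuracy alone is misleading on imbalanced data: a degenerate classifier that always predicts the majority class scores 95% accuracy while detecting no spam at all.

```python
# Hypothetical example: 95 non-spam vs. 5 spam reviews, and a degenerate
# model that predicts "non-spam" (0) for everything.
from sklearn.metrics import confusion_matrix

y_true = [1] * 5 + [0] * 95      # 1 = spam (minority), 0 = non-spam
y_pred = [0] * 100               # always predict the majority class

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 0.95: looks deceptively good
sensitivity = tp / (tp + fn)                   # 0.0: no spam is ever caught
specificity = tn / (tn + fp)                   # 1.0: trivially perfect on non-spam
print(accuracy, sensitivity, specificity)
```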

  11. Sentiment Analysis And Imbalanced Data Distributions • Imbalanced Data Distributions • Many methods exist to handle data with imbalanced distributions. Examples include sampling and re-weighting. When using those approaches, boosting and bagging are often used to combine classifiers trained from sampled datasets for prediction.

  12. iSRD: Methodology • The main theme is to use under-sampling to generate relatively balanced datasets, and then to use classifiers trained from the sampled datasets for prediction. • The sampling is repeated multiple times, each run generating a balanced dataset, to reduce sample-selection bias. • For each balanced dataset, a classifier is trained, and the ensemble of the classifiers from all sampled datasets is used for spam classification (see the sketch below).
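
The following Python sketch illustrates the repeated-under-sampling ensemble described on this slide; the function names, the choice of Naïve Bayes as the base learner, and m = 5 are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample

def train_under_sampled_ensemble(X_maj, X_min, m=5, seed=0):
    """Train m classifiers, each on all minority (spam) examples plus an
    equal-sized random under-sample of the majority (non-spam) examples."""
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(m):
        maj_sample = resample(X_maj, replace=False,
                              n_samples=X_min.shape[0],
                              random_state=rng.randint(1_000_000))
        X = np.vstack([maj_sample, X_min])             # balanced dataset
        y = np.array([0] * maj_sample.shape[0] + [1] * X_min.shape[0])
        models.append(MultinomialNB().fit(X, y))
    return models

def majority_vote(models, X_test):
    """Predict spam (1) when more than half of the classifiers vote spam."""
    votes = np.array([clf.predict(X_test) for clf in models])
    return (votes.sum(axis=0) * 2 > len(models)).astype(int)

# Toy demo with random nonnegative count features (shapes are assumptions):
rng = np.random.RandomState(1)
X_nonspam = rng.poisson(1.0, size=(100, 20))   # majority class
X_spam    = rng.poisson(1.5, size=(10, 20))    # minority class
models = train_under_sampled_ensemble(X_nonspam, X_spam, m=5)
print(majority_vote(models, rng.poisson(1.0, size=(3, 20))))
```

Repeating the draw m times means every non-spam example has a chance to be seen by some classifier, which is what reduces the sample-selection bias mentioned on the slide.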

  13. iSRD: Methodology • First, split the dataset into a training (FIT) set and a test set, where the training and test sets contain similar data-imbalance ratios. • Use β to change the data-imbalance level in the FIT set, so performance can be evaluated at different imbalance levels. • After obtaining the altered FIT dataset, apply random under-sampling to generate balanced datasets.
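
A hedged sketch of this step: a stratified split keeps the FIT and test imbalance ratios similar, and spam examples in FIT are then down-sampled so the spam-to-non-spam ratio equals β. Treating β as that target ratio is an assumption made for illustration; the paper may define it differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def make_fit_with_beta(X, y, beta=0.1, seed=0):
    """Split into FIT/test with similar imbalance, then set the FIT
    spam:non-spam ratio to beta (beta's meaning is an assumption here)."""
    X_fit, X_test, y_fit, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    spam, nonspam = X_fit[y_fit == 1], X_fit[y_fit == 0]
    n_spam = min(len(spam), int(beta * len(nonspam)))
    spam_sub = resample(spam, replace=False, n_samples=n_spam,
                        random_state=seed)
    X_beta = np.vstack([nonspam, spam_sub])
    y_beta = np.array([0] * len(nonspam) + [1] * n_spam)
    return X_beta, y_beta, X_test, y_test
```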

  14. iSRD: Methodology • A classifier is trained from each balanced dataset, and the majority vote of the m classifiers is used to predict the class labels of the reviews in the test set. • To validate this design, the experiments record the performance of each classifier against the same supplied test set and then compare the results.

  15. Experiments • Data Collection • Review reports were collected for multiple hotels located in different cities and different countries. • Two major data sources: • the Opinion Based Entity Ranking Project dataset (2012); • deceptive (fake) reviews from the Deceptive Opinion Spam Corpus v1.4, gathered from Amazon MTurk.

  16. Data Collection • Data Preprocessing • Form a dataset with two columns, where each row denotes a review: the first column contains the full text of the review and the second column holds the class label. • Convert the texts into a bag-of-words representation using the StringToWordVector filter in Weka. • Store the dataset as an ARFF file to be used in the following steps.
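
The paper's preprocessing uses Weka's StringToWordVector filter and an ARFF file; the sketch below shows the analogous step in Python with scikit-learn's CountVectorizer (the ARFF export itself is not reproduced, and the two sample rows are placeholders).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two-column dataset: one row per review, text first, class label second.
df = pd.DataFrame({
    "text":  ["Great hotel, friendly staff",
              "Unbeatable prices, visit our site now"],
    "label": ["non-spam", "spam"],
})

vectorizer = CountVectorizer()                # analogue of StringToWordVector
X = vectorizer.fit_transform(df["text"])      # sparse bag-of-words matrix
y = (df["label"] == "spam").astype(int)       # 1 = spam, 0 = non-spam
```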

  17. Data Collection • Data Sampling • Randomly select examples to create an imbalanced test dataset with an imbalance ratio very close to that of the training set.

  18. Data Collection • Data Sampling • Build five Fit datasets from the original Fit dataset by under-sampling the minority class (spam) while keeping all non-spam examples.

  19. Data Collection • Data Sampling • For each of the Fit datasets, apply random under-sampling to the majority class to create a set of balanced datasets.

  20. Results • Rather than examining accuracy alone, measures such as precision, recall, sensitivity, and specificity provide a more accurate picture of the algorithm's performance on the minority samples. • The model's performance is compared with a decision-tree-based classifier (C4.5) using different statistical measurements.

  21. Results • The class of interest here is the spam class, which serves as the positive class, so the goal is to increase the true positive rate (TPR) and decrease the false positive rate (FPR); a small comparison sketch follows.
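
A small sketch of this comparison, with hypothetical prediction vectors standing in for the iSRD-style ensemble and the single-tree baseline:

```python
from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    # Spam (1) is the positive class; labels=[0, 1] fixes the matrix layout.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

y_true     = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 3 spam, 7 non-spam
y_ensemble = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]   # hypothetical ensemble output
y_tree     = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # hypothetical C4.5-style output

for name, pred in [("ensemble", y_ensemble), ("tree", y_tree)]:
    tpr, fpr = tpr_fpr(y_true, pred)
    print(f"{name}: TPR={tpr:.2f}  FPR={fpr:.2f}")
```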

  22.–26. Results (performance figures and tables from the original slides; only the slide titles survive in the transcript)

  27. Conclusion • This research addressed the problem of detecting spam online reviews from imbalanced data distributions. • It proposed a new classifier technique to overcome the problem of imbalanced data distributions for review spam detection. • It proposed using random under-sampling to generate balanced training sets.

  28. Conclusion • The experiments show that the proposed method, iSRD, significantly outperforms the baseline C4.5 classifier in terms of TNR, FNR, sensitivity, AUC, and PRC, which are common measures for imbalanced data evaluation.

  29. Reference • Al Najada, H., & Zhu, X. (2014, August). iSRD: Spam review detection with imbalanced data distributions. In Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IRI) (pp. 553–560). IEEE.
