Know your Neighbors: Web Spam Detection using the Web Topology
Carlos Castillo, chato@yahoo-inc.com • Debora Donato, debora@yahoo-inc.com • Aristides Gionis, gionis@yahoo-inc.com • Vanessa Murdock, vmurdock@yahoo-inc.com • Fabrizio Silvestri, f.silvestri@isti.cnr.it
Presented by Anton Rodriguez-Dmitriev
Personal Background • Graduated from FSU • Working on an MSECE, specializing in Controls • CS minor • Work part-time at STW Technic, LP
Web Spam Consequences • Damages the reputation of the search engine • Weakens users' trust in its results • Eiron et al. ranked 100 million pages using PageRank: 11 of the top 20 were pornographic pages • PageRank alone therefore cannot filter spam • Costs are incurred in crawling, indexing, and storing spam pages
Some Popular Spamming Techniques • Link spam: creating link structures, usually a tightly knit community of links, to affect the outcome of link-based ranking algorithms • Content spam: maliciously crafting the content of a Web page, e.g., by keyword stuffing: inserting keywords related to popular queries • Cloaking: serving different content to a search engine crawler than to the regular visitors of a website
Topology of the Dataset • Used the WEBSPAM-UK2006 dataset: a publicly available spam collection • Undirected host graph • Pruned to contain only hosts that share more than 100 links • Black nodes are spam and white nodes are non-spam • Most spammers in the largest connected component are clustered together • The other connected components are single-class
Evaluation of the Process • Confusion matrix, with true classes as rows and predicted classes as columns: the non-spam row is (a, b) and the spam row is (c, d) • a: the number of non-spam examples correctly classified • b: the number of non-spam examples falsely classified as spam • c: the number of spam examples falsely classified as non-spam • d: the number of spam examples correctly classified
Success Measures • True-positive rate (recall): R = d / (c + d) • False-positive rate: b / (a + b) • Precision: P = d / (b + d) • F-measure: F = 2PR / (P + R)
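As a quick illustration (not from the paper), the four measures follow directly from the confusion-matrix entries a, b, c, d; the counts in the example call are made up.

```python
# Minimal sketch: success measures from the confusion-matrix entries.
def spam_metrics(a: int, b: int, c: int, d: int) -> dict:
    """a: non-spam correct, b: non-spam flagged as spam,
    c: spam missed, d: spam correct."""
    recall = d / (c + d)                  # true-positive rate
    fp_rate = b / (a + b)                 # false-positive rate
    precision = d / (b + d)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}

# Invented counts: 900 non-spam and 100 spam hosts.
print(spam_metrics(a=880, b=20, c=25, d=75))
```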
Link-based Features • Degree-related measures: in-degree and out-degree of each host and of its neighbors • Edge reciprocity: the number of links that are reciprocal • Assortativity: the ratio between the degree of a page and the average degree of its neighbors • PageRank • TrustRank: propagates trust from a subset of hand-picked trusted nodes through the Web graph • Truncated PageRank: a variant of PageRank that diminishes the influence of a page on the PageRank of its neighbors
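To make the TrustRank idea concrete, here is a hedged sketch, not the paper's implementation: a PageRank whose teleportation is restricted to a hand-picked trusted seed set. The toy graph, seed set, and parameter values are invented for illustration.

```python
import numpy as np

def trustrank(adj: dict, trusted: set, alpha: float = 0.85, iters: int = 50) -> dict:
    """Power iteration with teleportation biased toward the trusted seeds."""
    nodes = sorted(adj)
    idx = {n: i for i, n in enumerate(nodes)}
    # Teleportation vector: uniform over the trusted seeds only.
    t = np.zeros(len(nodes))
    for s in trusted:
        t[idx[s]] = 1.0 / len(trusted)
    r = t.copy()
    for _ in range(iters):
        nxt = (1 - alpha) * t
        for u, outs in adj.items():
            if outs:
                share = alpha * r[idx[u]] / len(outs)
                for v in outs:
                    nxt[idx[v]] += share
            else:             # dangling node: return its mass to the seeds
                nxt += alpha * r[idx[u]] * t
        r = nxt
    return dict(zip(nodes, r))

# Trust flows within the trusted component; the spam pair gets none.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "spam1": ["spam2"], "spam2": ["spam1"]}
print(trustrank(graph, trusted={"a"}))
```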
Link-based Features • Estimation of supporters: given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d • N_d(x) is the set of d-supporters of page x • Bottleneck number: b_d(x) = min_{j ≤ d} |N_j(x)| / |N_{j−1}(x)| • Spam pages tend to have a smaller bottleneck number than non-spam pages • [Figure: histogram of b_4(x) for spam vs. non-spam]
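A minimal sketch of these quantities under the definitions above. Note that the paper estimates supporter counts with probabilistic counting for scalability; this exact reverse-BFS version and the toy graph are for illustration only.

```python
from collections import deque

def supporter_counts(rgraph: dict, y, d_max: int) -> list:
    """rgraph[v] lists the nodes linking TO v; returns |N_d(y)| for d = 0..d_max."""
    dist = {y: 0}
    counts = [0] * (d_max + 1)
    q = deque([y])
    while q:
        v = q.popleft()
        if dist[v] == d_max:
            continue
        for u in rgraph.get(v, []):
            if u not in dist:                 # first visit = shortest path
                dist[u] = dist[v] + 1
                counts[dist[u]] += 1
                q.append(u)
    return counts

def bottleneck(counts: list, d: int) -> float:
    """b_d(y) = min over j <= d of |N_j(y)| / |N_{j-1}(y)|, with |N_0(y)| = 1."""
    sizes = [1] + counts[1:d + 1]
    b = float("inf")
    for j in range(1, len(sizes)):
        if sizes[j - 1] == 0:
            break                             # no supporters beyond this distance
        b = min(b, sizes[j] / sizes[j - 1])
    return b

rgraph = {"y": ["p", "q"], "p": ["r"], "q": [], "r": []}   # toy reversed graph
c = supporter_counts(rgraph, "y", d_max=2)                  # [0, 2, 1]
print(c, bottleneck(c, 2))                                  # b_2(y) = min(2/1, 1/2) = 0.5
```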
Content-based Features • Most interesting features presented: • From the k most frequent words in the dataset, excluding stopwords: • Corpus precision: the fraction of words in a page that appear in the set of popular terms • Corpus recall: the fraction of popular terms that appear in the page • From the set of q most popular terms in a query log: • Query precision and query recall: analogous to corpus precision and recall • Used k, q ∈ {100, 200, 500, 1000}
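For concreteness, a small sketch of the corpus precision and recall features; the popular-term set and page below are toy placeholders, not the paper's corpus or query log (the query variants are computed the same way over query-log terms).

```python
def corpus_precision_recall(page_words: list, popular: set) -> tuple:
    """Precision: fraction of the page's words that are popular terms;
    recall: fraction of the popular terms that appear in the page."""
    if not page_words or not popular:
        return 0.0, 0.0
    precision = sum(w in popular for w in page_words) / len(page_words)
    recall = len(popular & set(page_words)) / len(popular)
    return precision, recall

popular = {"free", "download", "cheap", "online", "pills"}   # invented term set
page = "cheap cheap pills free download now".split()
print(corpus_precision_recall(page, popular))                # (0.833..., 0.8)
```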
Content-based Features • The best features are corpus precision and query precision • All features were judged based only on histograms • [Figure: histogram of query precision for non-spam vs. spam pages, q = 500]
Classifiers • Cost-sensitive decision tree • Zero cost for correctly classifying an instance • Misclassifying a spam host as normal is R times as costly as misclassifying a normal host as spam • R can be used to tune the balance between the true-positive rate and the false-positive rate • Used bagging to help reduce the false-positive rate
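A hedged approximation of this setup: the paper used a C4.5 cost-sensitive tree with bagging (in Weka); the sketch below emulates the cost ratio R with scikit-learn class weights, which upweight spam errors during training, and wraps the tree in a bagging ensemble. The synthetic data stands in for the real feature matrix.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in for the link/content feature matrix (label 1 = spam).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 1).astype(int)

R = 20   # missing a spam host is R times as costly as a false alarm
tree = DecisionTreeClassifier(class_weight={0: 1.0, 1: float(R)})
clf = BaggingClassifier(tree, n_estimators=10, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```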
Conclusion • Experimental evidence led to two hypotheses: • Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes • Spam nodes are mainly linked by spam nodes • These tendencies can be exploited for better spam detection • Combining link-based and content-based features improved detection • The error rate can be tuned by adjusting the cost matrix
Critique • The paper presented many features, both link-based and content-based, that can be used for spam detection, as well as techniques to optimize detection based on graph topology (smoothing) • The results showed which features and optimizations were effective • The dataset used is outdated, so there is no indication of how well the methods would work against newer or more sophisticated spamming techniques • There was no direct comparison between prior research results and the results obtained