This paper presents a splog detection task and proposes a solution based on temporal and link properties. The unique characteristics of splogs are identified, and a time-sensitive online detection task is proposed to capture these characteristics. The detection technique utilizes temporal and link properties for effective splog detection.
The Splog Detection Task and A Solution Based on Temporal and Link Properties Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Siaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng Presenter: Belle Tseng NEC Laboratories America, Cupertino, CA.
Problem statement Goal: combat spam in the blogosphere • What are splogs? • How to detect splogs? • How to evaluate anti-splog techniques? Approach: splog detection task & solution • Identify unique characteristics of splogs • Propose a time-sensitive online detection task that captures the unique characteristics • Propose a splog detection technique based on temporal & link properties WWW 2006, January 2, 2020
Outline of the talk • Introduction • Splog detection task • Our detection method • Data pre-processing & annotation • Experiment results • Concluding remarks
Introduction • Motivation • Related work • What are splogs?
Motivation • Splogs are polluting the blogosphere… • 10-20% of blogs are splogs [1] • An average of 44 of the top 100 blog search results in three popular blog search engines came from splogs [1] • 75% of new pings came from splogs; more than 50% of claimed blogs pinging weblogs.com are splogs [2] • Research issues • What are splogs? No concrete definition! • How to detect splogs? Splogs are different from web spam! • How to evaluate anti-splog techniques? A comparative evaluation framework on the TREC dataset that captures the unique characteristics of splogs. Splog (spam+blog): a new and serious problem in the blogosphere!
Related work • Web spam detection • Content analysis • [Ntoulas06]: statistical properties in content • Link analysis • [Gyongyi05]: spam mass estimation • Splog detection • [Kolari06]: apply web spam detection & topic identification techniques in splog detection However, splogs are different…
Example (1): keyword stuffing
Example (2): stolen content Traditional content analysis is not enough!
Example (3): link farm
Example (4): via trackback links Traditional link analysis is not enough!
What are splogs? • Splog: a blog created by an author who has the intention of spamming • NOTE: a blog having comment spam or trackback spam is not considered a splog Figure legend: S: splog; W: affiliate website; Ads/ppc: profitable mechanism
Characteristics of splogs • Typical characteristics • Machine-generated content • No value-addition • Hidden agenda, usually an economic goal • Uniqueness of splogs • Dynamic content • Non-endorsement links Splog detection is different from web spam detection!
Task Definition • Framework • Traditional IR-based evaluation • Proposed online evaluation
Framework • Splog detector for blog search engines • Differs from a web search engine in its growing content (feeds) • So, time is crucial • Entries become available gradually, so there is a time delay to gather enough evidence • A splog persists in the index with growing content, so detect it as soon as possible • How fast is the detector? • Make a decision with less evidence Figure legend: b1, b2, b3…: downloaded blogs; e1, e2, e3…: downloaded entries
Detection tasks • Traditional IR-based evaluation • with ground truth • K-fold cross-validation • Performance measures: precision/recall, AUC, ROC plot, etc. • without ground truth • Performance measure: average precision at top N of the ranked list, based on pooling of multiple detection lists
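The measures named on this slide can be illustrated with a minimal sketch. The function names and the toy blog ids are hypothetical and are not taken from the talk; the sketch shows plain precision/recall against ground truth and precision over the top N of a single ranked list, not the pooling step.

```python
def precision_recall(predicted, truth):
    """Precision and recall for a set of blogs flagged as splogs.

    predicted: set of blog ids the detector flagged (hypothetical input)
    truth:     set of blog ids labeled as splogs in the ground truth
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def precision_at_n(ranked, truth, n):
    """Fraction of the top-n ranked blogs that are true splogs."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for blog in top if blog in truth) / len(top)


if __name__ == "__main__":
    p, r = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
    print(p, r)
    print(precision_at_n(["a", "x", "b", "y"], {"a", "b"}, 2))
```

With cross-validation, the same functions would simply be applied to each held-out fold and the results averaged.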
Online evaluation • A framework to evaluate time-sensitive detection performance • B(t_i): a partition consisting of the blogs discovered between t_(i-1) and t_i • p_jk: detection performance at time t_j on the partition B(t_k) • P_i: average performance for each delay i = j - k
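The averaging over delays can be sketched as follows, assuming the per-partition scores p_jk have already been computed by some detector and are stored in a dictionary keyed by (j, k). The function name and the sample scores are hypothetical.

```python
from collections import defaultdict


def average_by_delay(p):
    """Average detection performance for each delay i = j - k.

    p: dict mapping (j, k) -> performance measured at time t_j on the
       partition B(t_k). Only pairs with j >= k are meaningful, because a
       partition cannot be evaluated before its blogs are discovered.
    """
    buckets = defaultdict(list)
    for (j, k), score in p.items():
        if j >= k:
            buckets[j - k].append(score)
    return {delay: sum(scores) / len(scores)
            for delay, scores in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {(1, 1): 0.6, (2, 2): 0.7, (2, 1): 0.8}
    print(average_by_delay(scores))
```

Reading the result by increasing delay shows how much performance improves as the detector is given more time to gather evidence on the same blogs.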
Detection Method • Baseline features • Temporal regularity • Link regularity
Baseline features • A subset of the content features presented in [Ntoulas06] • In practice, • Extract features from 5 parts of a blog • tokenized URLs, blog and post titles, anchor text, blog homepage content, and post (entry) content • Vectorize by word count, average word length, and a tf-idf vector • Prune rarely-used words • Feature selection using Fisher linear discriminant analysis (LDA) to avoid over-fitting
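As a rough illustration of the vectorization step, the sketch below computes word count, average word length, and a tf-idf vector for one text part of a blog. The whitespace tokenizer and the particular tf-idf weighting (relative term frequency times log inverse document frequency) are assumptions; the slide does not say which variant was used, and pruning and feature selection are not shown.

```python
import math
from collections import Counter


def basic_stats(text):
    """Return (word count, average word length) for one text part."""
    words = text.split()
    count = len(words)
    average_length = sum(len(w) for w in words) / count if count else 0.0
    return count, average_length


def tfidf(docs):
    """Return one {word: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common tf-idf variant, chosen here for illustration.
    """
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return vectors


if __name__ == "__main__":
    print(basic_stats("buy cheap pills"))
    print(tfidf(["buy pills", "read blog pills"]))
```

A word that appears in every document gets weight zero under this variant, which is why rarely informative words contribute little to the vector.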
New features • Challenges • Content-based methods: suffer from more sophisticated content generation schemes • Link-based methods: suffer from different semantics of links; link graph is more dynamic and incomplete • Observation • Content: machine-generated posts • How to capture the characteristics in machine-generated content? Temporal regularity estimation • Link: to drive traffic to a specific set of affiliate websites • How to capture the characteristics in specific linking targets? Link regularity estimation Splogs’ motivation is different from normal, human-generated blogs!
Temporal regularity (TCR) • Temporal content regularity (TCR) • Captures the similarity between growing contents • Estimated by autocorrelation of the content • Similarity measure: histogram intersection distance Equation labels: distance between two posts (k posts in between); TCR: autocorrelation; amount of common content of two posts
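A minimal sketch of the idea, assuming word-count histograms and a normalized histogram intersection (shared counts divided by the larger counts). The exact formula on the slide was an image and is not recoverable here, so the normalization, the tokenizer, and the function names are assumptions.

```python
from collections import Counter


def histogram_intersection(post_a, post_b):
    """Similarity in [0, 1] between the word histograms of two posts."""
    hist_a, hist_b = Counter(post_a.split()), Counter(post_b.split())
    words = set(hist_a) | set(hist_b)
    shared = sum(min(hist_a[w], hist_b[w]) for w in words)
    total = sum(max(hist_a[w], hist_b[w]) for w in words)
    return shared / total if total else 0.0


def content_autocorrelation(posts, lag):
    """Average similarity between each post and the post `lag` steps later.

    posts: list of post texts in chronological order.
    """
    pairs = [(posts[i], posts[i + lag]) for i in range(len(posts) - lag)]
    if not pairs:
        return 0.0
    return sum(histogram_intersection(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    templated = ["buy cheap pills now", "buy cheap pills today", "buy cheap pills here"]
    print([content_autocorrelation(templated, lag) for lag in (1, 2)])
```

Under this sketch, a blog whose posts are stamped out from a template keeps a high similarity at every lag, while posts written on varying topics would score low.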
TCR examples
Temporal regularity (TSR) • Temporal structural regularity (TSR) • captures consistency in timing of content creation • estimated by the entropy of the post-time difference distribution • Use hierarchical clustering method Equation labels: blog entropy of post-time differences, normalized by the maximum observed blog entropy
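The entropy estimate can be sketched as below. Note one deliberate substitution: the slide groups post-time differences with hierarchical clustering, whereas this sketch uses fixed-width bins (a hypothetical `bin_seconds` parameter) to stay short. Timestamps are assumed to be seconds.

```python
import math
from collections import Counter


def interval_entropy(post_times, bin_seconds=3600):
    """Entropy (in bits) of the distribution of gaps between posts.

    Gaps are grouped into fixed-width bins, which stands in for the
    hierarchical clustering mentioned on the slide.
    """
    times = sorted(post_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 0.0
    counts = Counter(int(gap // bin_seconds) for gap in gaps)
    n = len(gaps)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def normalized_entropies(blogs):
    """Divide each blog's entropy by the maximum observed entropy.

    blogs: dict mapping blog id -> list of post timestamps.
    """
    raw = {blog: interval_entropy(times) for blog, times in blogs.items()}
    maximum = max(raw.values(), default=0.0)
    return {blog: (h / maximum if maximum else 0.0) for blog, h in raw.items()}


if __name__ == "__main__":
    blogs = {
        "hourly_bot": [0, 3600, 7200, 10800],  # perfectly regular posting
        "human": [0, 100, 4000, 20000],        # irregular posting
    }
    print(normalized_entropies(blogs))
```

Low normalized entropy means the blog posts on a very consistent schedule, which is the regularity the slide associates with machine-generated content.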
TSR examples
Link regularity (LR) • captures consistency in blogs’ targeting websites • Splogs show more consistent behavior because their main intention is to drive traffic to affiliate websites • Affiliate websites are not authoritative to normal bloggers • Analyzing the linking behavior using HITS algorithm • LR: compute hub scores with out-link normalization • Splogs target a focused set of websites, while normal blogs usually have more diverse targets
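A minimal sketch of HITS-style hub scores on a blog-to-website graph. The slide does not spell out how out-link normalization is done, so weighting each of a blog's distinct targets by 1 / (number of distinct targets) is an assumption, as are the function names and the toy graph.

```python
import math


def _normalize(scores):
    """Scale a score dict in place to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    for key in scores:
        scores[key] /= norm


def hub_scores(links, iterations=20):
    """HITS hub scores for blogs, with each blog's out-links normalized.

    links: dict mapping blog id -> list of target website ids.
    """
    weights = {blog: {t: 1.0 / len(set(targets)) for t in set(targets)}
               for blog, targets in links.items() if targets}
    hubs = {blog: 1.0 for blog in weights}
    for _ in range(iterations):
        # Authority of a site: weighted sum of the hubs pointing at it.
        authority = {}
        for blog, out in weights.items():
            for target, w in out.items():
                authority[target] = authority.get(target, 0.0) + hubs[blog] * w
        _normalize(authority)
        # Hub of a blog: weighted sum of the authorities it points at.
        hubs = {blog: sum(authority[t] * w for t, w in out.items())
                for blog, out in weights.items()}
        _normalize(hubs)
    return hubs


if __name__ == "__main__":
    graph = {
        "splog1": ["affiliate"],
        "splog2": ["affiliate"],
        "normal": ["news", "wiki", "friend", "shop"],
    }
    print(hub_scores(graph))
```

In this toy graph the two blogs that point only at the shared affiliate site end up with much higher hub scores than the blog with diverse targets, which matches the focused-versus-diverse contrast on the slide.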
Classification • Binary classification: splog or normal blog • Use an SVM classifier with a radial basis function kernel • Combine baseline features with TCR, TSR, LR Diagram: regularity features R (TCR, TSR, LR) and baseline features base-n feed the SVM, which outputs splog/non-splog
Data-Preprocessing & Ground Truth • Pre-processing • Annotation tool • Disagreement among annotators • Ground truth
Data • TREC dataset: 100,649 feeds • Removing duplicate feeds and feeds without homepage or permalinks leaves 43.6K unique blogs • Most blogs are discovered in the first week • The online experiment uses the blogs discovered in the first week
Annotation (1) • An interface for annotators • Five labels: • (N) Normal • (S) Splog • (B) Borderline • (U) Undecided • (F) Foreign
Annotation (2) • Disagreement among annotators • They agree more on normal blogs but less on near-splog blogs (S/B/U) • Pooling? • Splog recognition: conservative vs. aggressive • Ground truth • Label 9240 blogs (random & stratified sampling) • 7905 labeled as normal, 525 labeled as splogs • Low splog percentage • Some known splogs are pre-filtered • Focus on the 43.6K subset of blogs having both homepage and entries
Experimental Results • Offline detection • Online detection
Offline evaluation Figure legend: base-n: n-dimensional baseline features; R+base-n: with temporal and link regularity features
Online experiment Figure labels: testing period; linking graph; Week 1, Week 2, Week 7
Online evaluation • Without sufficient content data, the regularity features provide a significant boost to the performance
Summary • Splogs are a new and serious problem in the blogosphere • Detection of splogs is different from web spam detection • Identifying new detection tasks • Online evaluation measures how quickly a detector can identify splogs • Introducing useful and unique features of blogs/splogs • temporal and link regularity measures • Annotation • A guideline and tool help reduce annotation effort