Splog Detection: Temporal and Link Properties Solution

The Splog Detection Task and a Solution Based on Temporal and Link Properties
Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Siaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
Presenter: Belle Tseng
NEC Laboratories America, Cupertino, CA.

Problem statement
Goal: combat spam in the blogosphere
• What are splogs?
• How to detect splogs?
• How to evaluate anti-splog techniques?
Approach: splog detection task & solution
• Identify unique characteristics of splogs
• Propose a time-sensitive online detection task that captures the unique characteristics
• Propose a splog detection technique based on temporal & link properties
WWW 2006, January 2, 2020

Outline of the talk
• Introduction
• Splog detection task
• Our detection method
• Data pre-processing & annotation
• Experiment results
• Concluding remarks

Introduction
• Motivation
• Related work
• What are splogs?

Motivation
• Splogs are polluting the blogosphere…
  • 10-20% of blogs are splogs [1]
  • An average of 44 of the top 100 blog search results in three popular blog search engines came from splogs [1]
  • 75% of new pings came from splogs; more than 50% of claimed blogs pinging weblogs.com are splogs [2]
• Research issues
  • What are splogs?
  • How to detect splogs?
  • How to evaluate anti-splog techniques?
no concrete definition!
splogs are different from web spam!
a comparative evaluation framework on the TREC dataset captures the unique characteristics of splogs
Splog (spam+blog): a new and serious problem in the blogosphere!

Related work
• Web spam detection
  • Content analysis
    • [Ntoulas06]: statistical properties in content
  • Link analysis
    • [Gyongyi05]: spam mass estimation
• Splog detection
  • [Kolari06]: apply web spam detection & topic identification techniques in splog detection
However, splogs are different…

Example (1): keyword stuffing

Example (2): stolen content
Traditional content analysis is not enough!

Example (3): link farm

Example (4): via trackback links
Traditional link analysis is not enough!

What are splogs?
• Splog: a blog created by an author who has the intention of spamming
• NOTE: a blog having comment spam or trackback spam is not considered a splog
Legend: S: splog; W: affiliate website; Ads/ppc: profitable mechanism

Characteristics of splogs
• Typical characteristics
  • Machine-generated content
  • No value-addition
  • Hidden agenda, usually an economic goal
• Uniqueness of splogs
  • Dynamic content
  • Non-endorsement link
Splog detection: different from web spam detection!

Task Definition
• Framework
• Traditional IR-based evaluation
• Proposed online evaluation

Framework
• Splog detector for the blog search engines
  • Different from the web search engine in the growing contents (feeds)
• So, time is crucial
  • Entries become available gradually → time delay to gather enough evidence
  • A splog persists in the index with growing content → detect it as soon as possible
• How fast is the detector?
  • Make a decision with less evidence
Legend: b1, b2, b3…: downloaded blogs; e1, e2, e3…: downloaded entries

Detection tasks
• Traditional IR-based evaluation
  • with ground truth
    • K-fold cross-validation
    • Performance measures: precision/recall, AUC, ROC plot, etc.
  • without ground truth
    • Performance measure: average precision at top N of the ranked list based on pooling of multiple detection lists
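The evaluation without ground truth can be sketched roughly as follows. This is only one reading of the slide: the top N of each detector's ranked list is pooled, the pooled blogs are judged by hand, and each list is then scored by its precision at N. The detector names, blog ids and judgments are hypothetical.

```python
def pool_top_n(ranked_lists, n):
    """Union of the top-n blogs from every detector's ranked list."""
    pooled = set()
    for ranking in ranked_lists:
        pooled.update(ranking[:n])
    return pooled


def precision_at_n(ranking, judged_splogs, n):
    """Fraction of the top-n ranked blogs that were judged to be splogs."""
    top = ranking[:n]
    if not top:
        return 0.0
    return sum(1 for blog in top if blog in judged_splogs) / len(top)


# Hypothetical ranked detection lists (most suspicious blog first).
runs = {
    "detector_a": ["b3", "b1", "b7", "b2"],
    "detector_b": ["b1", "b4", "b3", "b9"],
}

pool = pool_top_n(runs.values(), n=2)   # blogs that annotators would judge
judged_splogs = {"b1", "b3"}            # hypothetical judgments on the pool
scores = {name: precision_at_n(r, judged_splogs, 2) for name, r in runs.items()}
```

Only the pooled blogs need labels, which keeps the annotation effort bounded when no full ground truth exists.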

Online evaluation
• A framework to evaluate time-sensitive detection performance
B(t_i): a partition consisting of blogs discovered during t_{i-1} to t_i
p_jk: detection performance at time t_j on the partition at t_k, B(t_k)
P_i: average performance for each delay i = j - k
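The delay-averaged measure can be illustrated with a short sketch. It assumes the p_jk values are already available as a mapping from (j, k) to a score; the function name and the numbers are hypothetical, and the performance measure itself is left open.

```python
from collections import defaultdict


def average_by_delay(p):
    """Average p[(j, k)] over all pairs that share the same delay i = j - k.

    p maps (j, k) to the detection performance at time t_j on partition B(t_k).
    """
    buckets = defaultdict(list)
    for (j, k), score in p.items():
        if j < k:
            raise ValueError("a partition cannot be evaluated before it is discovered")
        buckets[j - k].append(score)
    return {delay: sum(v) / len(v) for delay, v in sorted(buckets.items())}


# Hypothetical scores: rows are evaluation times t_j, columns are partitions B(t_k).
p = {
    (1, 1): 0.5,
    (2, 1): 0.75, (2, 2): 0.25,
    (3, 1): 1.0,  (3, 2): 0.75,
}
P = average_by_delay(p)
```

Reading P from delay 0 upward shows how much the detector gains from each additional period of evidence.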

Detection Method
• Baseline features
• Temporal regularity
• Link regularity

Baseline features
• A subset of the content features presented in [Ntoulas06]
• In practice,
  • Extract features from 5 parts of a blog
    • tokenized URLs, blog and post titles, anchor text, blog homepage content, and post (entry) content
  • Vectorize by word count, average word length, and a tf-idf vector
  • Prune rarely-used words
  • Feature selection using Fisher linear discriminant analysis (LDA), to avoid over-fitting
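The vectorization step above might look roughly like the following sketch. The tokenizer, the `min_df` pruning threshold and the idf formula log(N/df) are assumptions chosen for illustration, not details given in the slides, and the Fisher LDA feature selection step is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simple_stats(tokens):
    """Word count and average word length of one text part."""
    count = len(tokens)
    avg_len = sum(len(t) for t in tokens) / count if count else 0.0
    return count, avg_len


def tfidf_vectors(docs, min_df=1):
    """tf-idf vectors over the words that occur in at least min_df documents."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vocab = sorted(w for w, c in df.items() if c >= min_df)  # prune rare words
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors


# Hypothetical post contents.
docs = ["cheap pills cheap pills buy", "my trip to the lake", "buy cheap tickets"]
vocab, vectors = tfidf_vectors(docs, min_df=2)
stats = simple_stats(tokenize("buy cheap pills"))
```

In the setting described above, the same extraction would be repeated for each of the five parts of a blog and the resulting vectors concatenated.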

New features
• Challenges
  • Content-based methods: suffer from more sophisticated content generation schemes
  • Link-based methods: suffer from different semantics of links; link graph is more dynamic and incomplete
• Observation
  • Content: machine-generated posts
    • How to capture the characteristics in machine-generated content? → Temporal regularity estimation
  • Link: to drive traffic to a specific set of affiliate websites
    • How to capture the characteristics in specific linking targets? → Link regularity estimation
Splogs’ motivation is different from normal, human-generated blogs!

Temporal regularity (TCR)
• Temporal content regularity (TCR)
  • Captures the similarity between growing contents
  • Estimated by autocorrelation of the content
  • Similarity measure: histogram intersection distance
[Equation labels: distance between two posts (k posts in between); TCR: autocorrelation; amount of common contents of two posts]
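A minimal sketch of the idea, assuming whitespace-tokenized word histograms and an intersection-over-union normalization. The exact formula on the slide did not survive extraction, so the normalization and function names here are assumptions rather than the authors' definition.

```python
from collections import Counter


def histogram_intersection(a, b):
    """Share of common content between two word histograms, in [0, 1]."""
    words = set(a) | set(b)
    union = sum(max(a[w], b[w]) for w in words)
    if union == 0:
        return 0.0
    return sum(min(a[w], b[w]) for w in words) / union


def content_autocorrelation(posts, lag):
    """Mean similarity between posts that are `lag` positions apart."""
    hists = [Counter(p.lower().split()) for p in posts]
    pairs = [(hists[i], hists[i + lag]) for i in range(len(hists) - lag)]
    if not pairs:
        return 0.0
    return sum(histogram_intersection(a, b) for a, b in pairs) / len(pairs)


# Hypothetical post sequences, oldest first.
templated = ["buy cheap pills online", "buy cheap watches online", "buy cheap pills online"]
varied = ["went hiking today", "new recipe for soup"]

r1 = content_autocorrelation(templated, lag=1)
r2 = content_autocorrelation(templated, lag=2)
r_varied = content_autocorrelation(varied, lag=1)
```

A templated sequence keeps a high similarity at every lag, while posts written on unrelated topics share few words.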

TCR examples

Temporal regularity (TSR)
• Temporal structural regularity (TSR)
  • Captures consistency in timing of content creation
  • Estimated by the entropy of the post-time difference distribution
  • Use hierarchical clustering method
[Equation labels: blog entropy of post-time; normalized by the maximum observed blog entropy]
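The entropy estimate can be sketched as below. Note the substitution: the slide groups post-time differences with hierarchical clustering, whereas this sketch uses fixed-width bins to stay short. The bin width, function names and timestamps are hypothetical.

```python
import math
from collections import Counter


def interval_entropy(post_times, bin_width):
    """Entropy of the distribution of gaps between consecutive posts.

    Gaps are grouped into fixed-width bins, a simplification that stands in
    for the hierarchical clustering mentioned on the slide.
    """
    times = sorted(post_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(g // bin_width) for g in gaps)
    total = len(gaps)
    return sum((c / total) * math.log(total / c) for c in bins.values())


def normalized_entropies(blogs, bin_width=3600):
    """Each blog's entropy divided by the maximum entropy observed over all blogs."""
    raw = {name: interval_entropy(times, bin_width) for name, times in blogs.items()}
    max_h = max(raw.values(), default=0.0)
    if max_h == 0:
        return {name: 0.0 for name in raw}
    return {name: h / max_h for name, h in raw.items()}


# Hypothetical post timestamps in seconds.
blogs = {
    "robot": [0, 3600, 7200, 10800, 14400],     # posts exactly once an hour
    "human": [0, 1800, 20000, 90000, 100000],   # irregular posting
}
result = normalized_entropies(blogs)
```

A lower normalized entropy means more regular posting times, which is the behavior expected from machine-generated blogs.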

TSR examples

Link regularity (LR)
• Captures consistency in blogs’ targeting websites
  • Splog: more consistent behavior because its main intention is to drive traffic to affiliate websites
  • Affiliate websites: not authoritative to normal bloggers
• Analyzing the linking behavior using HITS algorithm
  • LR: compute hub scores with out-link normalization
  • Splogs target a focused set of websites, while normal blogs usually have more diverse targets
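The hub-score computation could be sketched as follows on a blog-to-website link graph. The slide does not spell out the normalization, so dividing each blog's hub score by its number of out-links is an assumption, as are the function name, the iteration count and the example domains.

```python
import math


def normalized_hub_scores(out_links, iterations=50):
    """HITS on a bipartite blog -> website graph.

    out_links maps each blog to the distinct websites it links to. The hub
    update averages over a blog's out-links (assumed out-link normalization).
    """
    sites = sorted({s for targets in out_links.values() for s in targets})
    hub = {b: 1.0 for b in out_links}
    for _ in range(iterations):
        # Authority of a website: sum of the hub scores of blogs linking to it.
        auth = {s: 0.0 for s in sites}
        for blog, targets in out_links.items():
            for s in targets:
                auth[s] += hub[blog]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {s: v / norm for s, v in auth.items()}
        # Hub score of a blog: average authority of the websites it targets.
        hub = {
            blog: (sum(auth[s] for s in targets) / len(targets) if targets else 0.0)
            for blog, targets in out_links.items()
        }
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {blog: v / norm for blog, v in hub.items()}
    return hub


# Hypothetical link graph: three splogs push one affiliate site,
# one normal blog links to a diverse set of sites.
out_links = {
    "splog1": ["shop.example"],
    "splog2": ["shop.example"],
    "splog3": ["shop.example"],
    "blog1": ["news.example", "photos.example", "friend.example"],
}
hub = normalized_hub_scores(out_links)
```

In this toy graph the blogs that all target the same site reinforce each other, while the blog with diverse targets ends up with a much lower score.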

Classification
• Binary classification: splog or normal blog
• Use an SVM classifier with a radial basis function kernel
• Combine baseline features with TCR, TSR, LR
[Diagram: R (TCR, TSR, LR) and base-n feed the SVM, which outputs splog/non-splog]

Data Pre-processing & Ground Truth
• Pre-processing
• Annotation tool
• Disagreement among annotators
• Ground truth

Data
• TREC dataset: 100,649 feeds
• Removing duplicate feeds and feeds without homepage or permalinks → 43.6K unique blogs
• Most blogs are discovered in the first week
  • Used blogs discovered in the first week in the online experiment

Annotation (1)
• An interface for annotators
• Five labels:
  • (N) Normal
  • (S) Splog
  • (B) Borderline
  • (U) Undecided
  • (F) Foreign

Annotation (2)
• Disagreement among annotators
  • They agree more on normal blogs but less on near-splog blogs (S/B/U)
  • Pooling?
  • Splog recognition: conservative vs. aggressive
• Ground truth
  • Label 9240 blogs (random & stratified sampling)
  • 7905 labeled as normal, 525 labeled as splogs
  • Low splog percentage
    • Some known splogs are pre-filtered
    • Focus on the 43.6K subset of blogs having both homepage and entries

Experimental Results
• Offline detection
• Online detection

Offline evaluation
base-n: n-dimensional baseline features
R+base-n: with temporal and link regularity features

Online experiment
[Figure labels: testing period; linking graph; Week 1, Week 2, Week 7]

Online evaluation
• Without sufficient content data, the regularity features provide a significant boost to the performance

Summary
• Splog: a new and serious problem in the blogosphere
• Detection of splogs is different from web spam detection
• Identifying new detection tasks
  • Online evaluation measures how quickly a detector can identify splogs
• Introducing useful and unique features of blogs/splogs
  • Temporal and link regularity measures
• Annotation
  • Guideline and tool help reduce annotation effort
