Improving Web Spam Classification using Rank-time Features

Improving Web Spam Classification using Rank-time Features • September 25, 2008 • TaeSeob Yun • KAIST DATABASE & MULTIMEDIA LAB

Contents • Introduction • Support Vector Machine • Data Set • Domain Separation • Rank-time features • Evaluation • Summary

Introduction • World Wide Web (WWW) • Definition • An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04] • Description • Holds too much information to browse unaided • Users need Web search engines

Introduction • Web Search Engine • Definition • A search engine designed to search for information on the World Wide Web [WIK08] • Description • Retrieves pages relevant to the user's query • Ranking has become important • Web spam interferes with Web search engines

Web Spam (1/2) • Definition • A page that uses illegitimate methods to improve its ranking [KRY07] • Objective • Mislead the search engine's ranking algorithm • Make a profit by increasing the page's traffic • Reasons to remove Web spam • Users spend too much time searching for information • Ranking on search engines is critical for making a profit • Spam wastes the search engine's resources

Web Spam (2/2) • Types of Web spam • Link stuffing • Keyword stuffing • Cloaking • Web farming • When to remove Web spam • Crawl-time • Index-time • Rank-time • How to remove Web spam • By training a machine-learned classifier: Support Vector Machine (SVM)

Support Vector Machine (1/2) • [Figure: vectors v1 and v2 separated in 2 dimensions and 3 dimensions; how to separate them in n dimensions?] • Definition • A set of related supervised learning methods used for classification and regression [WIK08] • Description • Finds the separating hyperplane with maximal margin in the vector space
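The maximal-margin idea on this slide can be written as an optimization problem. The following is the standard hard-margin formulation, given as a sketch; the symbols (weight vector w, bias b, training vectors x_i with labels y_i in {-1, +1}, and sample count m) are notation assumed here and do not appear in the slides.

```latex
\min_{w,\,b} \; \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w \cdot x_i + b\bigr) \ge 1, \qquad i = 1, \dots, m
```

The separating hyperplane is the set of points where w · x + b = 0, and the margin between the two classes is 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin.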

Support Vector Machine (2/2) • Procedure • Collect a dataset • Split the dataset into training datasets and a test dataset • Train the machine with the training datasets • Test the machine with the test dataset • Problem • We first need to collect a labeled dataset
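The train-then-test procedure above can be illustrated with a toy example. This is a minimal sketch, not the implementation used in the paper: it assumes a linear SVM trained by subgradient descent on the regularized hinge loss, and the function names, hyperparameters, and the four training points are all invented for illustration.

```python
def train_linear_svm(X, Y, lr=0.1, lam=0.01, epochs=100):
    """Toy linear SVM: subgradient descent on L2-regularized hinge loss.

    X is a list of feature tuples, Y a list of labels in {-1, +1}.
    lr, lam and epochs are illustrative values, not tuned settings.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights a little every update.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # The sample violates the margin: move the hyperplane toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable data (+1 = spam, -1 = non-spam).
train_X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
train_Y = [1, 1, -1, -1]
test_X = [(4.0, 1.0), (-1.0, -4.0)]
test_Y = [1, -1]

w, b = train_linear_svm(train_X, train_Y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_X, test_Y)) / len(test_Y)
```

Evaluating only on `test_X`, which the trainer never sees, mirrors the split between training datasets and a test dataset on the slide.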

Dataset • Definition • A set of labeled sample data for training and testing • Collecting procedure • Collect lists of common queries from the MSN Live search engine • Have a human judge label each of the top-10 results as spam, non-spam, or unknown • Split the dataset into training datasets and a test dataset • Splitting method for the dataset • Very important! • We choose domain separation

Domain Separation (1/6) • Definition • A splitting method that partitions the dataset according to domain • Procedure (in this paper) • For each URL in the dataset • Calculate a hash value from its domain • If the hash value is new, assign it randomly to one of 5 files • If the hash value has been seen before, put the URL into the file already assigned • Adjust the 5 files to a similar size • Why should we choose domain separation?
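The procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function name `assign_folds` is hypothetical, a Python dictionary keyed by domain stands in for the explicit hash value, and a new domain is placed in the currently smallest fold as a deterministic stand-in for "assign randomly, then adjust to similar size".

```python
from urllib.parse import urlparse


def assign_folds(urls, n_folds=5):
    """Partition URLs into n_folds lists so that no domain spans two folds."""
    fold_of_domain = {}                    # domain -> index of its fold
    folds = [[] for _ in range(n_folds)]
    for url in urls:
        # netloc is the host part; URLs are assumed to carry a scheme.
        domain = urlparse(url).netloc.lower()
        if domain not in fold_of_domain:
            # First time this domain is seen: pick the smallest fold so far.
            fold_of_domain[domain] = min(range(n_folds), key=lambda i: len(folds[i]))
        # A domain seen before always goes to the fold already assigned.
        folds[fold_of_domain[domain]].append(url)
    return folds


# Hypothetical URLs for illustration only.
urls = [
    "http://a.com/1", "http://b.com/1", "http://a.com/2", "http://c.org/x",
    "http://A.com/3", "http://d.net/", "http://e.io/", "http://f.io/",
]
folds = assign_folds(urls)
```

Because every page of a domain lands in the same fold, a classifier tested on one fold never sees pages from the test domains during training.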

Domain Separation (2/6) • Domain-separated vs. randomly separated • Opinion • Domain-separated datasets are better • Results obtained by training on randomly separated datasets are WRONG! • This is a general problem for classification in machine learning • Reason • If the dataset contains subsets whose members share features, the split should take those subsets into account • In fact, spammers often buy a domain just to host spam pages, so it is common for all pages in that domain to be labeled spam • How do we make domain-separated datasets?

Domain Separation (3/6) • Five-fold cross validation • Definition • The method used in this paper for training and testing the SVM • Procedure • Choose one of the five domain-separated datasets as the test set • Use the other four domain-separated datasets as training datasets • Train the SVM with the 4 training datasets • Test the SVM with the test set • Repeat the above steps for every choice of test set
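The rotation of test and training sets can be sketched as a small generator. This is a minimal illustration: the name `five_fold_splits` and the integer stand-ins for labeled URLs are assumptions made here, and the folds are presumed to be already domain-separated.

```python
def five_fold_splits(folds):
    """Yield (train, test) pairs, holding out each fold once as the test set."""
    for i, test in enumerate(folds):
        # Training data is the concatenation of every fold except fold i.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Hypothetical folds; each integer stands in for one labeled URL.
example_folds = [[1, 2], [3], [4, 5], [6], [7]]
splits = list(five_fold_splits(example_folds))
```

Each of the five iterations trains on four folds and tests on the remaining one, so every sample is used for testing exactly once.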

Domain Separation (4/6) • The result of domain separation • 31,300 URLs in total • 3,133 URLs labeled spam (about 10%) • Problem • A classifier that learns a mapping from feature vector to subset hash to label may produce wildly and incorrectly optimistic results • Left as future work

Domain Separation (5/6) • Description • No duplicated domains • Dataset consists of 25% spam • Domain information could not be used • Worst-case graph

Domain Separation (6/6) • Description • Adds an additional feature • Dataset consists of 10% spam • More difficult to detect than with 25% spam • Result • Still slightly lower than with random separation, but this is the worst case • Note: domain information still could not be used

FEATA (1/2) • Description • Rank-independent features • FEATA includes • Domain-level features • Page-level features • Link information

FEATA (2/2) • Description • Average precision of 60% at 10.8% recall • Dataset consists of 10% spam • Not so good • We will add rank-time features!

Rank-time Features • Definition • Features that are available at rank time • Motivation • Every page has a feature vector • The feature vectors of spam and non-spam pages have different shapes • A spammer cannot guess the distribution of non-spam feature vectors • Consist of • Query-independent features (FEATB) • Query-dependent features (FEATQ)

FEATB • Definition • Query-independent rank-time features • Description • Page-level features • Domain-level features • Popularity features • Time features

FEATQ • Definition • Query-dependent rank-time features • Description • Depend on the match between the query and document properties • Examined for each returned result • Future work • Label spam on the URL only, not on the relevance of a URL to a query

Evaluation • Results are micro-averaged over the five tests
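Micro-averaging pools the raw counts from all five tests before computing precision and recall, instead of averaging five per-test scores. The sketch below assumes each test reports a (true positives, false positives, false negatives) tuple for the spam class; the function name and the example counts are invented for illustration and are not results from the paper.

```python
def micro_average(fold_counts):
    """Micro-averaged precision and recall from per-fold (tp, fp, fn) counts."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    # Guard against empty denominators when nothing was predicted or labeled spam.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts from two tests: (tp, fp, fn).
precision, recall = micro_average([(3, 1, 2), (1, 1, 2)])
```

With these made-up counts the pooled totals are 4 true positives, 2 false positives, and 4 false negatives, so larger folds contribute proportionally more to the final figures.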

Summary • Classification of Web spam is an important problem • We can classify Web spam by training an SVM • Building the training and test datasets as domain-separated datasets is very important • Rank-time features improve classification performance by as much as 25% in recall at a set precision

References • [KRY07] Krysta, M., Qiang, W., Chris, J., Aaswath, R., “Improving Web Spam Classification using Rank-time Features”, AIRWeb ’07, May 8, 2007 • [IAN04] Ian, J., “Architecture of the World Wide Web, Volume One”, W3C Recommendation, Dec 15, 2004 • [WIK08] “Web Search Engine”, “Support Vector Machine”, http://wikipedia.org, Sep 25, 2008

[Appendix A] Receiver Operating Characteristic
