Improving Web Spam Classification using Rank-time Features
September 25, 2008
TaeSeob, Yun
KAIST DATABASE & MULTIMEDIA LAB
Contents
• Introduction
• Support Vector Machine
• Data Set
• Domain Separation
• Rank-time features
• Evaluation
• Summary
Introduction
• World Wide Web (WWW)
  • Definition
    • An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04]
  • Description
    • Contains too much information
    • Needs web search engines
Introduction
• Web Search Engine
  • Definition
    • A search engine designed to search for information on the World Wide Web [WIK08]
  • Description
    • Retrieves pages relevant to a user's query
    • Ranking has become important
    • Web spam interferes with web search engines
Web Spam (1/2)
• Definition
  • A page that uses improper methods to improve its ranking [KRY07]
• Objective
  • Mislead web search engines' ranking algorithms
  • Make a profit by increasing the page's traffic
• Why we should remove web spam
  • Users spend too much time searching for information
  • Ranking on search engines is critical for making a profit
  • Spam consumes search engine resources
Web Spam (2/2)
• Types of web spam
  • Link stuffing
  • Keyword stuffing
  • Cloaking
  • Web farming
• When to remove web spam
  • Crawl-time
  • Index-time
  • Rank-time
• How to remove web spam
  • By training a classifier: Support Vector Machine (SVM)
Support Vector Machine (1/2)
[Figure: separating boundaries in 2 dimensions, 3 dimensions, and n dimensions; axes labeled v1 and v2]
• Definition
  • A set of related supervised learning methods used for classification and regression [WIK08]
• Description
  • Finds the separating hyperplane with the maximal margin in the vector space
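The maximal-margin idea can be written as the standard hard-margin optimization problem. This is the textbook formulation, not a formula taken from the slides; the symbols w (normal vector), b (offset), x_i (feature vector), y_i (label in {-1, +1}), and m (number of training examples) are notation assumed here.

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,m
```

The margin between the two classes is 2 / ||w||, so minimizing ||w|| maximizes the margin.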
Support Vector Machine (2/2)
• Procedure
  • Collect datasets
  • Split the data into training datasets and a test dataset
  • Train the SVM on the training datasets
  • Test the SVM on the test dataset
• Problem
  • We first need to collect datasets
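The four-step procedure can be sketched on toy data. This is a minimal illustration, assuming a linear SVM trained by full-batch subgradient descent on the regularized hinge loss; it is not the implementation used in the paper, and the data points, learning rate, regularization constant, and function names are all made up for the example.

```python
# Toy linear SVM: full-batch subgradient descent on the regularized hinge loss.
# All data and hyperparameters below are illustrative assumptions.

def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Return (w, b) for feature tuples xs and labels ys in {-1, +1}."""
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of (lam / 2) * ||w||^2
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                   # point lies inside the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (say, spam) or -1 (non-spam) for feature tuple x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Steps 1-2: a tiny hand-made dataset, split into training and test parts.
train_x = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
train_y = [1, 1, -1, -1]
test_x = [(1.5, 2.5), (-2.5, -1.0)]
test_y = [1, -1]

# Step 3: train on the training part. Step 4: test on the held-out part.
w, b = train_linear_svm(train_x, train_y)
correct = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(accuracy)
```

In practice a library SVM would replace `train_linear_svm`; the point here is only the order of the steps: collect, split, train, test.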
Dataset
• Definition
  • A set of labeled sample data for training and testing
• Collection procedure
  • Collect common query lists from the MSN Live search engine
  • Have human judges label each of the top-10 results as spam, non-spam, or unknown
  • Split the dataset into training datasets and a test dataset
• How to split the dataset
  • Very important!
  • We choose domain separation
Domain Separation (1/6)
• Definition
  • A splitting method that partitions the data by domain
• Procedure (in this paper)
  • For each URL in the dataset
    • Calculate a hash value from its domain
    • If the hash value is new, assign it randomly to one of 5 files
    • If the hash value has been seen before, put the URL into the file already assigned
  • Adjust the 5 files to similar sizes
• Why should we choose domain separation?
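The hashing procedure can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `domain_separated_split`, the use of SHA-256 as the hash, and the sample URLs are all hypothetical. The final size-adjustment step is left out because the slides do not say how it is done.

```python
import hashlib
import random
from urllib.parse import urlparse


def domain_separated_split(urls, n_files=5, seed=0):
    """Assign URLs to n_files buckets so that each domain lands in exactly one."""
    rng = random.Random(seed)
    bucket_of_hash = {}                       # domain hash -> bucket index
    buckets = [[] for _ in range(n_files)]
    for url in urls:
        domain = urlparse(url).netloc.lower()
        h = hashlib.sha256(domain.encode("utf-8")).hexdigest()
        if h not in bucket_of_hash:           # new hash: pick a file at random
            bucket_of_hash[h] = rng.randrange(n_files)
        buckets[bucket_of_hash[h]].append(url)  # known hash: reuse its file
    return buckets


# Made-up URLs for illustration only.
urls = [
    "http://spam.example/a",
    "http://spam.example/b",
    "http://news.example/1",
    "http://blog.example/x",
    "http://news.example/2",
]
buckets = domain_separated_split(urls)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Because the bucket is chosen per domain rather than per URL, pages from one domain can never appear in both a training file and the test file.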
Domain Separation (2/6)
• Domain-separated vs. randomly separated
• Opinion
  • Domain-separated datasets are better
  • Results obtained by training on randomly separated datasets are WRONG!
  • This is a general classification problem in machine learning
• Reason
  • If the dataset contains subsets that share features, we should use those features
  • In fact, some spammers buy a domain just to host spam pages, so it is common for all pages in that domain to be labeled spam
• How do we make domain-separated datasets?
Domain Separation (3/6)
• Five-fold cross-validation
  • Definition
    • The method used in this paper to train and test the SVM
  • Procedure
    • Choose one of the five domain-separated datasets as the test set
    • Use the other four domain-separated datasets as training datasets
    • Train the SVM on the 4 training datasets
    • Test the SVM on the test set
    • Repeat the steps above for every choice of test set
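The rotation over the five sets can be sketched as a small generator. The function name `five_fold_rounds` and the placeholder fold contents are assumptions for illustration; in the paper the folds would be the five domain-separated files.

```python
def five_fold_rounds(folds):
    """Yield (test_index, training_items, test_items), one round per fold."""
    for i, test_fold in enumerate(folds):
        # Every fold except fold i contributes to the training data.
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield i, training, list(test_fold)


# Hypothetical stand-ins for the five domain-separated files.
folds = [["a1", "a2"], ["b1"], ["c1", "c2"], ["d1"], ["e1"]]
rounds = list(five_fold_rounds(folds))
for i, training, test in rounds:
    # A real run would train the SVM on `training` and evaluate it on `test`.
    print(i, len(training), len(test))
```

Each item is used for testing exactly once and for training in the other four rounds.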
Domain Separation (4/6)
• Result of domain separation
  • 31,300 URLs in total
  • 3,133 URLs labeled spam (about 10%)
• Problem
  • Learning a mapping from feature vector to subset hash to label may turn out to be wildly and incorrectly optimistic
  • Left as future work
Domain Separation (5/6)
• Description
  • No duplicated domains
  • Consists of 25% spam
  • Couldn't use domain information
  • Worst-case graph
Domain Separation (6/6)
• Description
  • Adds an additional feature
  • Consists of 10% spam
  • Harder to detect than with 25% spam
• Result
  • Still slightly lower than random separation, but this is the worst case
  • Note: still couldn't use domain information
FEATA (1/2)
• Description
  • Rank-independent features
• FEATA includes
  • Domain-level features
  • Page-level features
  • Link information
FEATA (2/2)
• Description
  • Average precision of 60% at 10.8% recall
  • Dataset consists of 10% spam
  • Not very good
  • We will add rank-time features!
Rank-time Features
• Definition
  • Features used at rank time
• Motivation
  • Every page has a feature vector
  • The feature vectors of spam and non-spam pages have different shapes
  • Spammers can't guess the distribution of non-spam feature vectors
• Consist of
  • Query-independent features (FEATB)
  • Query-dependent features (FEATQ)
FEATB
• Definition
  • Query-independent rank-time features
• Description
  • Page-level features
  • Domain-level features
  • Popularity features
  • Time features
FEATQ
• Definition
  • Query-dependent rank-time features
• Description
  • Depend on the match between the query and document properties
  • Examined for each returned result
• Future work
  • Label spam on the URL only, not on the relevance of a URL to a query
Evaluation
• Micro-averaged over the five tests
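The slides do not define micro-averaging, so the sketch below assumes the common definition: pool the true-positive, false-positive, and false-negative counts from all five test folds, then compute precision and recall once from the pooled totals. The function name `micro_average` and the per-fold counts are made up for illustration.

```python
def micro_average(fold_counts):
    """fold_counts: list of (true_pos, false_pos, false_neg), one tuple per fold."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    # Guard against empty denominators when nothing was predicted or labeled spam.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented counts for five folds; these are not results from the paper.
counts = [(8, 2, 12), (5, 5, 10), (9, 1, 11), (6, 4, 14), (7, 3, 13)]
precision, recall = micro_average(counts)
print(precision, recall)
```

Pooling counts weights every URL equally, whereas averaging the five per-fold scores (macro-averaging) would weight every fold equally regardless of its size.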
Summary
• Classification of web spam is an important problem
• We can classify web spam by training an SVM
• Building the training datasets as domain-separated datasets is very important
• Rank-time features improve classification performance by as much as 25% in recall at a set precision
References
• [KRY07] Svore, K. M., Wu, Q., Burges, C. J. C., Raman, A., "Improving Web Spam Classification using Rank-time Features", AIRWeb '07, May 8, 2007
• [IAN04] Jacobs, I., Walsh, N. (eds.), "Architecture of the World Wide Web, Volume One", W3C Recommendation, Dec 15, 2004
• [WIK08] "Web Search Engine", "Support Vector Machine", http://wikipedia.org, Sep 25, 2008
[Appendix A] Receiver Operating Characteristic