Detecting Web Spam Created with Markov Chains Text Generators

Detecting Web Spam Created with Markov Chains Text Generators Anton S. Pavlov, Moscow State University, department of Computational Mathematics and Cybernetics Boris V. Dobrov, Moscow State University, Research Computing Center, Information resources analysis Laboratory Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Overview • Web Spam • Markov chains text generators • Proposed Approach • Method overview • Constraints and features • Machine learning • Experiments • Conclusion Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Web Spam (I) • Web Spam – deliberate actions aimed at unjustifiably raising relevance of some pages in a search engine: • Decreases search quality and performance • Aims at specific algorithms, used by search engines (PageRank, BM25, etc.) • Efficient if mass created • Some types of web spam get into search results page (doorways) → automatically generated Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Web Spam (II) • Spammers have to create many web pages, that will be indistinguishable from normal pages: • Create texts manually Too Expensive • Copying texts from other sources Duplicates can be detected • Generate texts automatically: keyword stuffing, Markov chains, text modification Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Markov Chains Text Generators (I) • Markov Chain – a sequence of random variables, each variable depends only on previous one • Can be used as a generative model for texts: • States – word n-gramms • Transitions – probability of seeing a word, after seeing n words Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Markov Chains Text Generators (II) • Markov chains training: • Select a collection of documents • Collect probabilities of seeing a word xin every state • Generation process: • Use collected statistics to generate a text of predefined length Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Markov Chains Text Generators (III) • Resulting texts retain topical coherence and local coherence Again, the prevalence of spam within blog comments. Our work in this paper extend our previous work in this paper approximate what will eventually be perceived by users of search engines. Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Proposed Method (I) • Generated texts can simulate only some constraints of the natural texts: • Local coherence • Topical coherence • Presumably generated texts violate other constraints: • Genre coherence • Readability • Diversity • Etc… Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Proposed Method (II) Style/Genre identification features Authorship identification features • Collect various features: • Style/Genre • Authorship • Readability • Diversity • Use machine learning to create automatic spam classifier Readability metrics Text diversity features Machine learning Automatic spam classifier Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Genre and Style • Part of speech (POS) usage statistics • Especially rare POS statistics • Punctuation statistics: • Expressive punctuation («!», «?») • Smileys («:)») • References ([23]) Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Authorship • Part of speech statistics • Standard deviations of part of speech ratios • Helps identify texts with mixed authorship • Sentences with several verbs Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Readability • Readability metrics: • Average word length • Ratio of long words • Average, minimum, maximum sentence length Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Diversity • Zipf’s law: • For words • For stemmed nouns • Compression rates: • gzip • bz2 • Same words occurring in neighboring sentences Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Machine Learning • Extracted 61 features for each document • Compared two popular ML algorithms: • SVM • C4.5 + Bagging Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Decision Trees • Algorithm based on C4.5: • Each split minimizes informational entropy • Use different subsets of the training set to build a tree and to select it’s weight Avoids overfitting • Use bagging to merge multiple weak classifiers Gzipcompr. ratio > 0.6 N Y Spam Max sentence length > 32 N Y Ham Spam Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Experiments • Experiments were conducted on ROMIP By.Web collection • Web spam sources: • Rusadult doorway generator (rusadult) • Doorway.su doorway generator (doorway_su) • Order 2 Markov chains text generator (markov2) • Generated 3 training and 3 test sets: • Each contained 10000 By.Web documents (not spam) • Each contained 10000 documents from generators (spam) • Measured precision, recall, and F-measure Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Detecting Generated Web Spam Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Real World Spam Examples • Trained classifiers on texts generated by markov2 generator • Checked if the same classifier can be used to detect real spam • Real spam samples from the Yandex Blog Search • Manually collected 600 automatically generated spam texts • Measured recall on the real spam samples, and F-measure on previously used test set • Measured how adding samples of real spam affected classification Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Real World Spam Examples • Продажа автомашин. Каталог цен! автомобильная акустика Большой выбор, хорошие цены Модель Daewoo Nexia представляет собой последнюю модификацию модели Opel Kadett E. В 1986 году лицензированное производство этого автомобиля началось в автомобильный видеорегистратор Купля-продажа авто в Тюмени Много частных объявлений о продаже автомобилей в Тюмени на Новые и. автомобили. Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Detecting Real World Spam Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Conclusion • Proposed a new approach to web spam detection • Proved the possibility of detection web spam, generated using Markov chains • Detected spam not in the original training datasets Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

The End • Questions? Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

References • Веб коллекция BY.Web, http://romip.ru/ru/collections/by.web-2007.html. • Генератор дорвеевDoorway.Su, http://doorway.su/. • Зеленков Ю.Г., Сегалович И.В., Сравнительный анализ методов определения нечетких дубликатов для Web-документов // Труды 9ой Всероссийской научной конференции «Электронные библиотеки: перспективные методы и технологии, электронные коллекции» - RCDL’2007, Переславль, Россия, 2007. – Том 1, С. 166-174. • Парсерmystem http://company.yandex.ru/technology/mystem/. • Серверный генератор дорвеев от RUSADULT.com, http://doorways.rusadult.com/ru/. • Фоменко В.П., Фоменко Т.Г., Авторский инвариант русских литературных текстов, 1981. • Чжун Кай-лай, Однородные цепи Маркова. Перев. с англ. — М.: Мир, 1964. — 425 с. • Яндекс.Поиск по блогам, http://blogs.yandex.ru/. • Benczúr, A. A., Bíró, I., Csalogány, K. and Sárlós, T. Web Spam Detection via Commercial Intent Analysis. In Proceedings of the 3rd international workshop on Adversarial Information Retrieval on the Web, Banff, Alberta, Canada, May 8th, 2007. Pages: 89–92. • Braslavski P. Document Style Recognition Using Shallow Statistical Analysis. In Proceedings of the ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP, Nancy, France, 2004, p. 1–9. • Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., Vigna, S. A Reference Collection for Web Spam. ACM SIGIR Forum Volume 40, Issue 2 (December 2006) Pages: 11–24. Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

References • Dale, E. and J. S. Chall. 1949. “The concept of readability.” Elementary English 26: 23. • Dubay, W.H.. 2004. The Principles of Readability. Costa Mesa, CA: Impact Information • Fetterly, D., Manasse, M., Najork, M. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In Proceedings of WebDB’04, New York, USA, 2004. • Fetterly, D., Manasse, M., Najork, M. Detecting phrase-level duplication on the World Wide Web. In Proceedings of SIGIR’05, pages 170–177, New York, NY, USA, 2005. ACM. • Gyöngyi, Z. and Garcia-Molina H., Web Spam Taxonomy. In Proceedings of AIRWeb 2005, May 2005. • Joachims, T. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999. • Mishne, G., Carmel, D., and Lempel, R. Blocking blog spam with language model disagreement. In Proceedings of AIRWeb 2005, May 2005. • Piskorski, J., Sydow, M., Weiss, D., Exploring Linguistic Features for Web Spam Detection: A Preliminary Study. In Proceedings of the 4th international workshop on Adversarial Information Retrieval on the Web, Beijing, China, Pages 25-28. • Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993. • Urvoy T., Chauveau E., Filoche, P. Tracking Web Spam with HTML Style Similarities. ACM Transactions on the Web, Vol. 2, No. 1, Article 3. Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09

Detecting Web Spam Created with Markov Chains Text Generators

Detecting Web Spam Created with Markov Chains Text Generators

Presentation Transcript

Markov Chains

Markov Chains

Heuristics for Detecting Spam Web Pages

Markov Chains

Detecting Spam Web Pages

Markov Chains

Markov Chains

Markov Chains

Markov Chains

Markov chains

Markov chains

Markov Chains

Markov Chains Regular Markov Chains Absorbing Markov Chains

Markov Chains

Markov Chains

Tutorial: Markov Chains

Detecting Spam and Spam Responses

Markov Chains

Markov Chains

Markov chains

Detecting Spam Web Pages

Detecting Web Spam with CombinedRank