
Detecting Nepotistic Links by Language Model Disagreement

András A. Benczúr, István Bíró, Károly Csalogány, Máté Uher
http://www.ilab.sztaki.hu/websearch


Presentation Transcript


Problem Statement
• Hyperlinks between topically dissimilar pages cover
  • Undeserved PageRank (spam or navigational links)
  • Unrelated anchor hit (links to owners, maintainers)
  • Content spam (text with no meaning for humans)

Language model disagreement
• Unigram language model for text (D) in collection (C): [formula not preserved in the transcript]
• Kullback-Leibler divergence (KL) between the language models of the target and source pages: [formula not preserved in the transcript]
• Nepotistic links: penalize above threshold
• Example language models: "kamera video dvd Vinci …" and "lipstick fashion pickup ..."
• Example: links to maintainer penalized

NRank: PageRank over penalized hyperlinks

NRank evaluation
• Calculate PageRank
• Form 20 buckets, each containing 5% of total PR value
• Pick 50 URLs from each bucket
• → 1000-page sample stratified on PageRank
• Manually classify it as
  • Non-usable: unknown, alias, empty, dead
  • Usable: reputable, ad, weborg, spam
• Within spam: thema-*.de link farm

Distribution of categories: .de domain [Benczúr-Csalogány-Sarlós-Uher 2005]

  Category        Share
  Reputable       70.0%
  Spam            16.5%
  Non-existent     7.9%
  Ad               3.7%
  Weborg           0.8%
  Unknown          0.4%
  Empty            0.4%
  Alias            0.3%

Disagreement in anchor text
• anchor within doc
• anchor of pointing doc
• empty (or short) anchor: use neighboring 5 words

Figure: The assumed Gaussian mixture model [Mishne et al. 2005]
Figure: Distribution of KL between anchor text and target document
Figure: Average demotion of reputable and spam pages into NRank buckets
Figure: Fraction of spam in NRank buckets

Algorithmic Issues
• KL(D1||D2) along all hyperlinks
  • requires docs in internal memory
• KL(A||D) for document and anchors from pointing docs
  • external sort of the anchor text of all docs referencing D
• KL(A||D) for document and own anchor [Mishne et al. 2005]
  • works even with streaming docs

Computer and Automation Research Institute, Hungarian Academy of Sciences
Eötvös University Budapest
Inter-University Center for Telecommunications and Informatics (ETIK)
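
The transcript names the building blocks (a unigram language model per document, KL divergence along each hyperlink, a penalty above a threshold, and PageRank over the penalized graph) but does not preserve the formulas. The sketch below shows one way these pieces could fit together. Everything specific in it is an assumption made for illustration: the linear interpolation of document and collection frequencies with weight `lam`, the direction KL(target || source), the choice to drop a link outright as the "penalty", the threshold value, and all function names. It should not be read as the authors' implementation.

```python
import math
from collections import Counter


def unigram_model(tokens, coll_counts, coll_total, lam=0.8):
    """Smoothed unigram model (assumed form):
    p(w) = lam * tf(w, D) / |D| + (1 - lam) * tf(w, C) / |C|."""
    tf = Counter(tokens)
    n = len(tokens)

    def p(word):
        p_doc = tf[word] / n if n else 0.0
        return lam * p_doc + (1 - lam) * coll_counts[word] / coll_total

    return p


def kl_divergence(p, q, vocab):
    """KL(p || q) = sum over the vocabulary of p(w) * log(p(w) / q(w))."""
    return sum(p(w) * math.log(p(w) / q(w)) for w in vocab if p(w) > 0)


def filter_links(pages, links, threshold, lam=0.8):
    """Keep only links whose KL(target || source) is at most the threshold.

    Dropping the link is a hypothetical penalty; the transcript only says
    "penalize above threshold".
    """
    coll_counts = Counter(w for toks in pages.values() for w in toks)
    coll_total = sum(coll_counts.values())
    vocab = list(coll_counts)
    models = {
        url: unigram_model(toks, coll_counts, coll_total, lam)
        for url, toks in pages.items()
    }
    return [
        (src, dst)
        for src, dst in links
        if kl_divergence(models[dst], models[src], vocab) <= threshold
    ]


def pagerank(nodes, links, damping=0.85, iters=50):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = list(nodes)
    n = len(nodes)
    out = {u: [] for u in nodes}
    for src, dst in links:
        out[src].append(dst)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


# Toy collection (made-up data, loosely echoing the example word lists).
pages = {
    "cam1": "kamera video dvd kamera".split(),
    "cam2": "video dvd kamera video".split(),
    "shop": "lipstick fashion pickup lipstick".split(),
}
links = [("cam1", "cam2"), ("cam1", "shop")]

kept = filter_links(pages, links, threshold=1.0)
nrank = pagerank(pages, kept)  # PageRank over the penalized link set
print(kept, nrank)
```

On this toy data the link between the two camera pages survives, while the link from the camera page to the cosmetics page exceeds the threshold and is removed before PageRank is computed. A real deployment would have to choose the threshold empirically and, as the Algorithmic Issues items suggest, decide whether both documents of a link can be held in memory at once.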
