András A. Benczúr, István Bíró, Károly Csalogány, Máté Uher
http://www.ilab.sztaki.hu/websearch
Detecting Nepotistic Links by Language Model Disagreement

Computer and Automation Research Institute, Hungarian Academy of Sciences
Eötvös University Budapest
Inter-University Center for Telecommunications and Informatics (ETIK)

Problem Statement
• Hyperlinks between topically dissimilar pages cover:
  • Undeserved PageRank (spam or navigational links)
  • Unrelated anchor hit (links to owners, maintainers)
• Content spam (text with no meaning for humans)

Language model disagreement
• Unigram language model for text D in collection C
  • language model: kamera video dvd Vinci …
  • language model: lipstick fashion pickup ...
• Kullback-Leibler (KL) divergence between the language models of the target and source pages
• Nepotistic links: penalize links whose KL divergence is above a threshold
• NRank: PageRank over penalized hyperlinks

Disagreement in anchor text
• anchor within doc
• anchor of pointing doc
• empty (or short) anchor: use the neighboring 5 words
Figure: The Assumed Gaussian Mixture Model [Mishne et al. 2005]. Distribution of KL between anchor text and target document.

NRank evaluation
• Calculate PageRank
• Form 20 buckets, each containing 5% of the total PR value
• Pick 50 URLs from each bucket
  → 1000-page sample stratified on PageRank
• Manually classify each page as
  • Non-usable: unknown, alias, empty, dead
  • Usable: reputable, ad, weborg, spam
  • Within spam: thema-*.de link farm

Distribution of categories: .de domain [Benczúr-Csalogány-Sarlós-Uher 2005]
  Reputable      70.0%
  Spam           16.5%
  Non-existent    7.9%
  Ad              3.7%
  Weborg          0.8%
  Unknown         0.4%
  Empty           0.4%
  Alias           0.3%

Figure: Average demotion of reputable and spam pages into NRank buckets.
Figure: Fraction of spam in NRank buckets.

Algorithmic Issues
• Example: links to the maintainer are penalized
• KL(D1||D2) along all hyperlinks
  • requires docs in internal memory
• KL(A||D) for document and anchors from pointing docs
  • external sort of the anchor text of all docs referencing D
• KL(A||D) for document and own anchor [Mishne et al. 2005]
  • works even with streaming docs
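
The language model disagreement test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the transcript does not preserve the model formula, so the linear interpolation with the collection model, the weight `lam`, the threshold value, the direction KL(source || target), and the names `unigram_model`, `kl_divergence` and `nepotistic_links` are all assumptions made here.

```python
import math
from collections import Counter


def unigram_model(tokens, collection_counts, collection_total, lam=0.8):
    """Unigram model of a text, smoothed with the collection model.

    Assumption: linear interpolation with weight `lam`; the slides do not
    preserve the exact smoothing formula.
    """
    counts = Counter(tokens)
    total = sum(counts.values())

    def prob(word):
        p_doc = counts[word] / total if total else 0.0
        p_col = collection_counts[word] / collection_total
        return lam * p_doc + (1.0 - lam) * p_col

    return prob


def kl_divergence(tokens_a, tokens_b, collection_counts, collection_total, lam=0.8):
    """KL(A || B) summed over the collection vocabulary."""
    p_a = unigram_model(tokens_a, collection_counts, collection_total, lam)
    p_b = unigram_model(tokens_b, collection_counts, collection_total, lam)
    return sum(
        p_a(w) * math.log(p_a(w) / p_b(w))
        for w in collection_counts
    )


def nepotistic_links(docs, links, threshold=1.0, lam=0.8):
    """Return the links whose KL(source || target) exceeds the threshold.

    `threshold` is a hypothetical value chosen for this toy example.
    """
    collection_counts = Counter(w for tokens in docs.values() for w in tokens)
    collection_total = sum(collection_counts.values())
    return [
        (src, dst)
        for src, dst in links
        if kl_divergence(docs[src], docs[dst], collection_counts,
                         collection_total, lam) > threshold
    ]


# Toy collection: two pages on the same topic and one unrelated page.
docs = {
    "a": "kamera video dvd kamera".split(),
    "b": "kamera dvd video".split(),
    "c": "lipstick fashion pickup lipstick".split(),
}
links = [("a", "b"), ("a", "c")]
flagged = nepotistic_links(docs, links)
```

Smoothing with the collection model keeps every probability positive, so the logarithm is always defined. In this toy collection only the link from the camera page to the lipstick page exceeds the threshold; a real threshold would have to be tuned on labeled data.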
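
The PageRank-stratified sample used in the NRank evaluation (20 buckets of 5% of the total PageRank each, 50 URLs per bucket) could be drawn roughly as follows. This is a sketch under assumptions: the function names, the rule for pages that straddle a bucket boundary, and the fixed random seed are illustrative choices, not details from the slides.

```python
import random


def pagerank_buckets(pagerank, n_buckets=20):
    """Split pages into buckets that each hold about 1/n_buckets of total PageRank.

    Pages are visited in decreasing PageRank order and assigned by the
    PageRank mass accumulated before them (an assumed tie-breaking rule).
    """
    ranked = sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(pagerank.values())
    buckets = [[] for _ in range(n_buckets)]
    cumulative = 0.0
    for url, score in ranked:
        index = min(int(cumulative / total * n_buckets), n_buckets - 1)
        buckets[index].append(url)
        cumulative += score
    return buckets


def stratified_sample(pagerank, per_bucket=50, n_buckets=20, seed=0):
    """Pick up to `per_bucket` URLs uniformly at random from each bucket."""
    rng = random.Random(seed)
    sample = []
    for bucket in pagerank_buckets(pagerank, n_buckets):
        sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return sample


# Hypothetical input: 2000 pages with equal PageRank.
scores = {f"http://example.org/page{i}": 1.0 for i in range(2000)}
sample = stratified_sample(scores)
```

With a realistic, heavily skewed PageRank distribution the top buckets contain very few pages, so a bucket may yield fewer than 50 URLs; the sketch then takes every page in that bucket.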