Web Page Language Identification Based on URLs
Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao

Reference
• E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

Outline
• Introduction
• Language Identification Based On URLs
• Experimental Setup
• Experimental Results
• Conclusions

Introduction
• Given only the URL of a web page, can we identify its language?
  • Web crawlers
  • Personalized web browsers
• We consider the problem of determining the language of a web page using only its URL.
  • English, French, German, Spanish, and Italian
  • .com (60%), .org (10%)
  • www.wasserbett-test.com

Introduction
• Applying machine learning techniques
• Features
  • Word features
  • N-gram features
  • Custom-made features
• Machine learning algorithms
  • Naïve Bayes
  • Decision Tree
  • Relative Entropy
  • Maximum Entropy

Extracting Feature Vectors
• Words as features
• Remove "www", "index", "html", …, etc.
• For example, http://www.internetwordstats.com/africa2.htm
  • Split into: internetwordstats, com, africa
• "cnn" and "gov" are indicative of English
• "produits" and "recherche" are indicative of French
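
The word-feature split above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `url_tokens` is made up here, the slide's stop list ends in "etc.", so the extra entries "htm", "http" and "https" are a guess, and dropping one-letter tokens is also an assumption.

```python
import re

# Assumed stop list: the slide names only "www", "index" and "html";
# "htm", "http" and "https" are added here as a guess.
STOP_TOKENS = {"www", "index", "html", "htm", "http", "https"}


def url_tokens(url):
    """Split a URL into lowercase letter-only tokens and drop stop tokens."""
    # Digits, dots, slashes and other punctuation all act as separators.
    tokens = re.findall(r"[a-z]+", url.lower())
    # Dropping one-letter tokens is an assumption, not stated on the slide.
    return [t for t in tokens if len(t) >= 2 and t not in STOP_TOKENS]


print(url_tokens("http://www.internetwordstats.com/africa2.htm"))
# prints ['internetwordstats', 'com', 'africa']
```

With this stop list the example URL yields the same three tokens as on the slide.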

Trigrams as features
• Start with the same tokens as in the method above (words as features)
• E.g., weather
  • "_we", "wea", "eat", "ath", "the", "her", "er_"
• "_th" and "ing" are very common in English
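
A minimal sketch of trigram extraction, assuming each token is padded with one "_" on both sides, as the "weather" example suggests. The function name `trigrams` is hypothetical.

```python
def trigrams(token, pad="_"):
    """Return all character trigrams of a token padded on both sides."""
    padded = pad + token + pad
    # Slide a window of width 3 over the padded token.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(trigrams("weather"))
# prints ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```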

Custom-made features
• Top-level domain country code
• OpenOffice dictionaries
• Dictionary with city names
• Number of hyphens

Classification Algorithms
• Country code top-level domain only (ccTLD)
• Country code top-level domain plus (ccTLD+)
• Naïve Bayes (NB)
• Decision Trees (DT)
• Relative Entropy (RE)
• Maximum Entropy (ME)
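
The custom-made features could be computed roughly as follows. Everything concrete in this sketch is an assumption for illustration: the country-code table, the tiny word lists standing in for the OpenOffice dictionaries, and the function name `custom_features`. The city-name dictionary is omitted, since it would work the same way as the word lists.

```python
from urllib.parse import urlsplit

# Hypothetical country-code table; a real one would cover many more codes.
CCTLD_LANGUAGE = {"de": "German", "fr": "French", "es": "Spanish",
                  "it": "Italian", "uk": "English"}

# Tiny stand-ins for the OpenOffice dictionaries (illustrative only).
DICTIONARIES = {
    "German": {"wasserbett", "test"},
    "English": {"test", "weather"},
    "French": {"produits", "recherche"},
}


def custom_features(url, tokens):
    """Build a feature dict from the ccTLD, hyphen count and dictionary hits."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    features = {
        "cctld_language": CCTLD_LANGUAGE.get(tld),  # None for .com, .org, ...
        "hyphens": url.count("-"),
    }
    for language, words in DICTIONARIES.items():
        # Number of URL tokens found in this language's word list.
        features["dict_" + language] = sum(t in words for t in tokens)
    return features


print(custom_features("http://www.wasserbett-test.com",
                      ["wasserbett", "test", "com"]))
```

For www.wasserbett-test.com the top-level domain carries no language signal, so the dictionary counts are what separate German from English in this toy setup.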
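
To make one of the listed algorithms concrete, here is a minimal Naïve Bayes sketch over URL tokens. It is not the implementation from the paper: it is a plain multinomial model with add-one smoothing, it predicts one of several languages directly for brevity, and the four training URLs are made-up toy data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, examples):
        # examples: iterable of (token_list, language_label) pairs.
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)

        def log_score(label):
            counts = self.token_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(self.class_counts[label] / total)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed likelihood
            return score

        return max(self.class_counts, key=log_score)


# Toy training data, invented for this example.
train = [
    (["cnn", "com", "news"], "English"),
    (["weather", "gov"], "English"),
    (["produits", "fr"], "French"),
    (["recherche", "produits", "com"], "French"),
]
model = NaiveBayes().fit(train)
print(model.predict(["recherche", "fr"]))  # prints French
print(model.predict(["cnn", "gov"]))       # prints English
```

The same class works unchanged on trigram features, since it only needs a list of strings per URL.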

Data Sets
• The algorithms were evaluated on three different data sets
  • Open Directory Project
  • Microsoft's Live Search
  • 1,260 pages from a large web crawl, labeled by hand
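
Experimental Results
• Recall: $R = p(+|+)$, the fraction of positive test pages classified correctly; $p(-|-)$ is the corresponding fraction for negative test pages
• Precision, for $n_+$ positive and $n_-$ negative test pages: $P = \frac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))}$
• When $n_+ = n_-$ and $p(+|+) = p(-|-)$, this reduces to $P = p(+|+) = p(-|-)$
• F-measure: $F = \frac{2}{1/R + 1/P}$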
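
A short numeric sketch of the precision and F-measure formulas. The function names and the test-set sizes are assumptions chosen for illustration; `recall` stands for p(+|+) and `neg_success` for p(−|−).

```python
def precision(n_pos, n_neg, recall, neg_success):
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_pos = n_pos * recall
    false_pos = n_neg * (1.0 - neg_success)
    return true_pos / (true_pos + false_pos)


def f_measure(recall, precision_value):
    """F = 2 / (1/R + 1/P), the harmonic mean of recall and precision."""
    return 2.0 / (1.0 / recall + 1.0 / precision_value)


# Balanced test set: precision equals recall when p(+|+) = p(-|-).
p_balanced = precision(1000, 1000, recall=0.9, neg_success=0.9)

# Unbalanced test set (illustrative sizes): the same classifier rates
# give a much lower precision, because negatives outnumber positives.
p_skewed = precision(100, 900, recall=0.9, neg_success=0.9)

print(round(p_balanced, 3), round(p_skewed, 3))
print(round(f_measure(0.9, p_skewed), 3))
```

The second call shows why precision depends on the class balance of the test set, while p(+|+) and p(−|−) do not.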

Conclusions
• This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
• The largest challenge is to identify English-looking URLs of non-English web pages.