Web Page Language Identification Based on URLs

Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao

Reference
• E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

Outline
• Introduction
• Language Identification Based on URLs
• Experimental Setup
• Experimental Results
• Conclusions

Introduction
• Given only the URL of a web page, can we identify its language?
  • Web crawlers
  • Personalized web browsers
• We consider the problem of determining the language of a web page using only its URL.
  • English, French, German, Spanish, and Italian
  • .com (60%), .org (10%)
  • www.wasserbett-test.com

Introduction
• Applying machine learning techniques
• Features
  • Word features
  • N-gram features
  • Custom-made features
• Machine learning algorithms
  • Naïve Bayes
  • Decision Tree
  • Relative Entropy
  • Maximum Entropy

Extracting Feature Vectors
• Words as features
  • Remove "www", "index", "html", etc.
  • For example, http://www.internetwordstats.com/africa2.htm
  • Split into: internetwordstats, com, africa
  • cnn, gov are indicative of English
  • produits, recherche are indicative of French
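The word-splitting step above can be sketched in a few lines of Python. This is only an illustration: the function name `url_to_tokens` and the exact stop list are assumptions (the slide names only "www", "index", "html" and "etc."), and treating digits and punctuation as separators is inferred from the africa2 example.

```python
import re

# Hypothetical stop list: the slide names "www", "index", "html";
# "htm", "http" and "https" are added here as an assumption.
STOP_TOKENS = {"www", "index", "html", "htm", "http", "https"}

def url_to_tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphabetic tokens and drop stop tokens."""
    # Runs of ASCII letters only, so digits and punctuation act as separators.
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if t not in STOP_TOKENS]

print(url_to_tokens("http://www.internetwordstats.com/africa2.htm"))
# ['internetwordstats', 'com', 'africa']
```

Restricting tokens to ASCII letters keeps the sketch short; URLs with encoded non-ASCII characters would need extra handling.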

Trigrams as features
• Start with the same tokens as the method above (words as features)
• E.g., weather
  • "_we", "wea", "eat", "ath", "the", "her", "er_"
• "_th", "ing" are very common in English
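A minimal sketch of the trigram extraction, assuming each token is padded with one underscore on each side as in the "weather" example. The function name `token_trigrams` is hypothetical.

```python
def token_trigrams(token: str, pad: str = "_") -> list[str]:
    """Return the character trigrams of a token padded on both sides."""
    padded = f"{pad}{token}{pad}"
    # Slide a window of width 3 over the padded token.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(token_trigrams("weather"))
# ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```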

Custom-made features
• Top-level domain country code
• OpenOffice dictionaries
• Dictionary with city names
• Number of hyphens
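As a rough illustration of how such features might be computed, the sketch below derives a country-code hint, a hyphen count, and dictionary hit counts from a URL. Everything named here is an assumption: the tiny word sets stand in for the OpenOffice and city-name dictionaries, and the ccTLD-to-language table is illustrative, not the mapping used in the paper.

```python
import re
from urllib.parse import urlsplit

# Hypothetical miniature word lists standing in for real dictionaries.
DICTIONARIES = {
    "German": {"wasserbett", "test"},
    "French": {"produits", "recherche"},
}
# Illustrative ccTLD-to-language table (an assumption, not from the slides).
CCTLD_LANGUAGE = {"de": "German", "fr": "French", "es": "Spanish",
                  "it": "Italian", "uk": "English"}

def custom_features(url: str) -> dict:
    """Compute a few hand-made features from a URL."""
    # urlsplit only recognises the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    tokens = re.findall(r"[a-z]+", url.lower())
    features = {
        "cctld_language": CCTLD_LANGUAGE.get(tld),  # None for .com, .org, ...
        "hyphen_count": url.count("-"),
    }
    for language, words in DICTIONARIES.items():
        features[f"dict_hits_{language}"] = sum(t in words for t in tokens)
    return features

print(custom_features("www.wasserbett-test.com"))
```

For www.wasserbett-test.com the country code gives no hint, so only the dictionary hits and the hyphen point toward German, which is the situation these features are meant to cover.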

Classification Algorithms
• Country code top-level domain only (ccTLD)
• Country code top-level domain plus (ccTLD+)
• Naïve Bayes (NB)
• Decision Trees (DT)
• Relative Entropy (RE)
• Maximum Entropy (ME)
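To make one of the listed algorithms concrete, here is a minimal Naïve Bayes sketch over URL tokens with add-one smoothing. It is not the paper's implementation: the function names, the smoothing choice, and the four training examples are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: iterable of (tokens, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, token_counts, vocabulary

def classify(model, tokens):
    """Return the label with the highest log-posterior score."""
    label_counts, token_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # tokens never seen in training are skipped
                score += math.log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, not taken from the paper's data sets.
model = train_naive_bayes([
    (["cnn", "com", "news"], "English"),
    (["weather", "gov"], "English"),
    (["produits", "recherche", "fr"], "French"),
    (["recherche", "com"], "French"),
])
print(classify(model, ["recherche", "produits"]))  # French
```

The same token lists could be replaced by trigram or custom-made features without changing the classifier code.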

Data Sets
• The algorithms were evaluated on three different data sets
  • Open Directory Project
  • Microsoft's Live Search
  • 1260 pages from a large web crawl, labeled by hand

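• $n_+$, $n_-$: number of positive and negative test pages; $p(+|+)$, $p(-|-)$: fraction of positive (negative) pages classified correctly
• Recall: $R = p(+|+)$
• Precision: $P = \frac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))}$, which reduces to $P = p(+|+)$ when $n_+ = n_-$ and $p(+|+) = p(-|-)$
• F-measure: $F = \frac{2}{1/R + 1/P}$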
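A short numeric sketch of these measures, with recall taken as R = p(+|+), the standard true-positive rate. The function names and the example counts (100 positive pages, 400 negative pages) are made up for illustration and are not results from the paper.

```python
def precision(n_pos: int, n_neg: int, p_pos_pos: float, p_neg_neg: float) -> float:
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_pos = n_pos * p_pos_pos          # positives classified as positive
    false_pos = n_neg * (1 - p_neg_neg)   # negatives classified as positive
    return true_pos / (true_pos + false_pos)

def f_measure(recall: float, prec: float) -> float:
    """Harmonic mean of recall and precision: F = 2 / (1/R + 1/P)."""
    return 2 / (1 / recall + 1 / prec)

# Hypothetical example values.
R = 0.9                                   # recall, p(+|+)
P = precision(100, 400, 0.9, 0.95)        # 90 / (90 + 20)
print(round(P, 4), round(f_measure(R, P), 4))
# 0.8182 0.8571
```

With many more negative than positive pages, even a small error rate on the negatives pulls precision well below p(+|+), which is why both accuracies are reported alongside F.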

Human Performance

Baseline: ccTLD

Conclusions
• This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
• The largest challenge is to identify English-looking URLs of non-English web pages.
