
Web Page Language Identification Based on URLs

Reporter: 鄭志欣. Advisor: Hsing-Kuo Pao. Reference: E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.


Presentation Transcript


  1. Web Page Language Identification Based on URLs • Reporter: 鄭志欣 • Advisor: Hsing-Kuo Pao

  2. Reference • E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

  3. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

  4. Introduction • Given only the URL of a web page, can we identify its language? • Web crawlers • Personalized web browsers • We consider the problem of determining the language of a web page using only its URL. • English, French, German, Spanish, and Italian • .com (60%), .org (10%) • www.wasserbett-test.com

  5. Introduction • Applying machine learning techniques • Features • Word features • N-gram features • Custom-made features • Machine learning algorithms • Naïve Bayes • Decision Tree • Relative Entropy • Maximum Entropy
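To make the pairing of features and learning algorithms concrete, here is a minimal sketch of a Naïve Bayes classifier over character trigrams. It assumes a multinomial model with add-one smoothing, which the slides do not specify; the class name `NaiveBayes` and the six training tokens are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

def trigrams(token):
    """Return the character trigrams of a token padded with '_' on both sides."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.class_counts = Counter()               # label -> training examples
        self.vocabulary = set()                     # all features seen in training

    def train(self, features, label):
        self.class_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocabulary.update(features)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_examples in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_examples / total)
            for feature in features:
                score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training set; a real system would train on labeled URLs.
classifier = NaiveBayes()
for token in ["weather", "shopping", "news"]:
    classifier.train(trigrams(token), "English")
for token in ["wasserbett", "zeitung", "wetter"]:
    classifier.train(trigrams(token), "German")

print(classifier.predict(trigrams("wasserbetten")))
```

With so little training data the prediction is only a toy result; the point is the shape of the computation, not its accuracy.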

  6. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

  7. Extracting Feature Vectors • Words as features • Remove “www”, “index”, “html”, etc. • For example, http://www.internetwordstats.com/africa2.htm • Split into: internetwordstats, com, africa • cnn and gov are indicative of English • produits and recherche are indicative of French
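The word-splitting step on this slide can be sketched as follows. The slide names only “www”, “index”, and “html” as tokens to remove; the other entries in `STOP_TOKENS`, and the choice to split on every non-letter character, are assumptions made so that the slide's example comes out as shown.

```python
import re

# The slide lists "www", "index", "html"; "http", "https", and "htm" are
# assumed additions so that the slide's example yields the stated tokens.
STOP_TOKENS = {"http", "https", "www", "index", "html", "htm"}

def url_to_words(url):
    """Split a URL into lowercase alphabetic tokens and drop stop tokens."""
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TOKENS]

print(url_to_words("http://www.internetwordstats.com/africa2.htm"))
# ['internetwordstats', 'com', 'africa']
```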

  8. Trigrams as features • Start with the same tokens as the method above (words as features) • E.g., weather • “_we”, “wea”, “eat”, “ath”, “the”, “her”, “er_” • “_th” and “ing” are very common in English
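A short sketch of the trigram extraction, assuming each token is padded with one underscore on each side, as the “weather” example suggests. The function name `trigrams` is illustrative.

```python
def trigrams(token):
    """Return all character trigrams of a token padded with '_' on both sides."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(trigrams("weather"))
# ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```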

  9. Custom-made features • Top-level domain country code • OpenOffice dictionaries • Dictionary with city names • Number of hyphens
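One possible way to compute features of this kind is sketched below. The tiny word sets in `TINY_DICTIONARIES` are hypothetical stand-ins for the OpenOffice and city-name dictionaries, and the feature names are made up for this example; the slide does not say how the features are encoded.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-ins for real per-language dictionaries.
TINY_DICTIONARIES = {
    "German": {"wasserbett", "wetter"},
    "French": {"produits", "recherche"},
    "English": {"weather", "test"},
}

def custom_features(url):
    """Compute the TLD, the hyphen count, and per-language dictionary hits."""
    # urlparse needs a scheme to recognize the host, so add one if missing.
    host = urlparse(url if "://" in url else "http://" + url).hostname or ""
    tokens = [t for t in re.split(r"[^a-z]+", url.lower()) if t]
    features = {
        "tld": host.rsplit(".", 1)[-1],
        "num_hyphens": url.count("-"),
    }
    for language, words in TINY_DICTIONARIES.items():
        features[f"dict_hits_{language}"] = sum(t in words for t in tokens)
    return features

print(custom_features("www.wasserbett-test.com"))
```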

  10. Classification Algorithms • Country code top-level domain only (ccTLD) • Country code top-level domain plus (ccTLD+) • Naïve Bayes (NB) • Decision Trees (DT) • Relative Entropy (RE) • Maximum Entropy (ME)
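The two ccTLD baselines might look roughly like the sketch below. The mapping in `CCTLD_TO_LANGUAGE` is an assumed one, and the slide does not spell out what the “plus” variant adds; treating .com and .org as English is an assumption made here for illustration. The URL `www.example.de` is a placeholder.

```python
from urllib.parse import urlparse

# Assumed mapping from country-code TLDs to the five languages.
CCTLD_TO_LANGUAGE = {
    "uk": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "it": "Italian",
}
# Assumption: the "plus" variant also maps these generic TLDs to English.
GENERIC_ENGLISH_TLDS = {"com", "org"}

def top_level_domain(url):
    """Return the last label of the host name, e.g. 'de' for www.example.de."""
    host = urlparse(url if "://" in url else "http://" + url).hostname or ""
    return host.rsplit(".", 1)[-1]

def classify_cctld(url, plus=False):
    """Guess the language from the TLD alone; return None when undecided."""
    tld = top_level_domain(url)
    if tld in CCTLD_TO_LANGUAGE:
        return CCTLD_TO_LANGUAGE[tld]
    if plus and tld in GENERIC_ENGLISH_TLDS:
        return "English"
    return None

print(classify_cctld("http://www.example.de/"))
print(classify_cctld("www.wasserbett-test.com"))
print(classify_cctld("www.wasserbett-test.com", plus=True))
```

Under these assumptions the last call labels www.wasserbett-test.com as English, which shows why a TLD-only rule struggles with non-English pages hosted under .com.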

  11. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

  12. Data Set • The algorithms were evaluated on three different data sets • Open Directory Project • Microsoft’s Live Search • 1,260 pages from a large web crawl, labeled by hand

  13. Outline • Introduction • Language Identification Based on URLs • Experimental Setup • Experimental Results • Conclusions

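  14. Precision: $P = \frac{n_+\, p(+|+)}{n_+\, p(+|+) + n_-\,(1 - p(-|-))}$ • Recall: $R = p(+|+)$ • $p(-|-)$: fraction of negative examples classified as negative • F-measure: $F = \frac{2}{1/R + 1/P}$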
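The precision and F formulas on this slide can be evaluated with a few lines of code. This sketch reads n+ and n− as the numbers of positive and negative test examples, p(+|+) as the fraction of positives classified as positive (the recall R), and p(−|−) as the fraction of negatives classified as negative. The numbers in the example call are made up.

```python
def precision(n_pos, n_neg, p_pos_pos, p_neg_neg):
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_positives = n_pos * p_pos_pos
    false_positives = n_neg * (1 - p_neg_neg)
    return true_positives / (true_positives + false_positives)

def f_measure(recall, prec):
    """F = 2 / (1/R + 1/P), the harmonic mean of recall and precision."""
    return 2 / (1 / recall + 1 / prec)

# Made-up example: 100 positives, 100 negatives, p(+|+) = 0.9, p(-|-) = 0.8.
recall = 0.9
prec = precision(100, 100, recall, 0.8)
print(round(prec, 3), round(f_measure(recall, prec), 3))
```

In this example 90 positives are found and 20 negatives are misclassified as positive, so P = 90/110 and F = 6/7.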

  15. Human Performance

  16. Baseline: ccTLD

  17. Conclusions • This paper shows that high-quality language identifiers for web pages can be built based on URLs alone. • The largest challenge is to identify English-looking URLs of non-English web pages.
