Language Identification in Web Pages Bruno Martins, Mário J. Silva Faculdade de Ciências da Universidade de Lisboa ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)
Motivation • Goal: Efficiently crawl web pages in a given language, Portuguese in our case. • This requires accurately distinguishing one language from others. We take an n-gram based approach to this problem, which has been reported to give excellent results.
Problems • Web texts differ considerably from the clean corpora used in earlier work: • Multilingual documents. • Spelling errors. • Lack of coherent sentences. • Often small amounts of textual data. These differences motivate revisiting the problem.
Outline • Introduction. • Context and Related Work. • Language identification. • Text categorization with n-grams. • Our Language Identification Algorithm. • Experimental Results. • Future Work. • Conclusions.
Language Identification • Sibun and Reynar provide a good survey. • A variety of features has been tried: • Characters, words, POS tags, n-grams, ... • N-gram based methods seem to be the most promising: • Dunning, Damashek, Cavnar & Trenkle, ...
N-grams in text categorization N-grams = n-character slices of a longer string. • “tumba!” is composed of the following n-grams: • Unigrams: _, t, u, m, b, a, !, _ • Bigrams: _t, tu, um, mb, ba, a!, !_ • Trigrams: _tu, tum, umb, mba, ba!, a!_, !__ • Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___ • Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____ • Advantages: • Robust to spelling and grammatical errors. • No need for tokenization, stemming, ... • Computationally and space efficient.
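The slicing above can be sketched in a few lines. This is an illustration, not the authors' code; it assumes one plausible padding scheme (one leading underscore, n−1 trailing underscores), which reproduces the bigram and trigram lists shown for “tumba!”:

```python
def ngrams(text, n):
    """Return the n-character slices of text, padded with underscores."""
    padded = "_" + text + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("tumba!", 3))
# ['_tu', 'tum', 'umb', 'mba', 'ba!', 'a!_', '!__']
```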
Outline • Introduction. • Context and Related Work. • Our Language Identification Algorithm. • N-gram categorization approach. • Measuring similarity with n-gram profiles. • Heuristics for Web documents. • Experimental Results. • Future Work. • Conclusions.
N-gram categorization approach • Measure similarity among documents through n-gram statistics. • N-grams of multiple lengths simultaneously (1-5)
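The classic baseline this approach builds on is Cavnar & Trenkle's rank-order profile comparison: rank the most frequent n-grams of all lengths 1-5 in the document and in each language profile, then sum the rank displacements ("out-of-place" measure). A minimal sketch under those assumptions (function names are illustrative, not from the paper):

```python
from collections import Counter

def profile(text, max_rank=300):
    """Rank the most frequent 1- to 5-grams of text (Cavnar & Trenkle style)."""
    counts = Counter()
    padded = "_" + text.lower() + "_"
    for n in range(1, 6):
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    ranked = [g for g, _ in counts.most_common(max_rank)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unseen in the language profile get the maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())
```

Classification then picks the language whose profile minimizes the distance to the document's profile.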
More efficient similarity measures • Lin's information theoretic similarity measure: • Jiang and Conrath's distance formula:
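The formulas themselves did not survive in these slides; the standard general forms they refer to (Lin, 1998; Jiang & Conrath, 1997), which the paper adapts to n-gram frequency statistics, are roughly:

```latex
% Lin's information-theoretic similarity (general form):
\mathrm{sim}(A,B) = \frac{2 \cdot I(\mathrm{common}(A,B))}{I(A) + I(B)}

% Jiang & Conrath's distance:
\mathrm{dist}(A,B) = I(A) + I(B) - 2 \cdot I(\mathrm{common}(A,B))

% where I(x) = -\log P(x) is the information content of x.
```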
Heuristics for the Web • Use meta-data information, if available and valid. • Matching strings on the language meta tag. • Filter common or automatically generated strings. • “optimized for Internet Explorer” • Weight n-grams according to HTML markup. • Title, bold typeface, subject and description meta tags. • Handle insufficient data. • Ignore pages with fewer than 40 characters. • Handle multilingualism and hard-to-decide cases. • Weight the longest sentences more heavily.
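Two of these heuristics, the meta-tag check and the insufficient-data filter, can be sketched as below. The helper names and the regex are hypothetical illustrations under the assumptions stated on the slide (a content-language meta tag, a 40-character minimum), not the authors' implementation:

```python
import re

MIN_CHARS = 40  # minimum page length from the slide
BOILERPLATE = ["optimized for Internet Explorer"]  # example of a filtered string

def meta_language(html):
    """Return the language declared in a content-language meta tag, if any."""
    m = re.search(r'<meta[^>]*content-language[^>]*content=["\']([a-zA-Z-]+)',
                  html, re.IGNORECASE)
    return m.group(1).lower() if m else None

def usable_text(text):
    """Strip known boilerplate strings; reject pages shorter than MIN_CHARS."""
    for phrase in BOILERPLATE:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return text if len(text) >= MIN_CHARS else None
```

In practice the declared meta language would still be validated against the n-gram classifier, since meta tags are often wrong or auto-generated.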
Outline • Introduction. • Context and Related Work. • Our Language Identification Algorithm. • Experimental Results. • Future Work. • Conclusions.
Evaluation Experiments • Language profiles for 23 different languages. • Test collection: 500 documents for each of 12 different languages. • HTML documents crawled from portals and online newspapers. • Tested the classification algorithm in different settings. • Lin's measure was the most accurate. • Heuristics improve performance.
Application to the Portuguese Web About 3.5 million pages, of multiple file types. A significant portion of the Portuguese Web is written in foreign languages, especially English.
Limitations • Unable to distinguish dialects of the same language? • Portuguese from Portugal and from Brazil. • British and American English? • Possible directions: • Web linkage information. • “Discriminative” n-grams instead of the most frequent ones.
Future Work • Carefully choose better training data. • Smoothing (Good-Turing). • Use n-grams approach for other classification tasks.
Conclusions • N-grams are effective in language guessing. • Text from the Web presents problems. • Lin's similarity measure seems effective.
Thanks for your attention! bmartins@xldb.di.fc.ul.pt http://www.tumba.pt http://tcatng.sourceforge.net