Measuring Monolinguality

Chris Biemann
NLP Department, University of Leipzig
LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova, 27 May 2006

Why Monolinguality?
Alien language noise disturbs statistics for corpus-based methods:
• Language Models, e.g. n-gram
• Lexical Acquisition
• Semantic Indexing
• Co-occurrence Statistics

What is Monolinguality?
• Foreign language sentences should be removed.
• Sentences containing only a few foreign language words or phrases, such as movie titles, terminology, etc., should remain.

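Korean Example
• Foreign language sentence (should be removed): A:Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B:Lucky you !
• Korean sentence containing a foreign movie title (should remain): 무인도 표류 소년 25명 통해 인간의 야만성 그려 영국 소설가 윌리엄 골딩의 83년 노벨문학상 수상작을 영화화한 `파리대왕'(Lord of the flies)은 결코 편안하게 감상할 수 있는 영화는 아니다 .
  [English gloss: Depicting human savagery through 25 boys stranded on a desert island, `Lord of the Flies', the film adaptation of British novelist William Golding's '83 Nobel Prize-winning work, is by no means a film that can be watched comfortably.]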

Recall Zipf's Law
It also holds for random samples of words.
(Figure: top frequent words)

Measuring Monolinguality
Given a corpus of language A with x% noise of language B, the amount of noise is measured as follows:
• For each of the top frequency words of B, divide its relative frequency in the corpus by its relative frequency in a clean B corpus.
• The amount of noise is the predominant ratio: many ratios will be close to x%.
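The ratio computation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `relative_frequencies` and `frequency_ratios`, the whitespace-tokenized input lists, and the toy data are hypothetical and not taken from the talk.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_ratios(corpus_tokens, clean_b_tokens, top_k=1000):
    """For the top_k words of the clean B corpus, return
    (relative frequency in the measured corpus) / (relative frequency in clean B)."""
    rel_corpus = relative_frequencies(corpus_tokens)
    clean_counts = Counter(clean_b_tokens)
    total_b = sum(clean_counts.values())
    ratios = {}
    for word, count in clean_counts.most_common(top_k):
        rel_b = count / total_b
        # Words of B that never occur in the measured corpus get ratio 0.
        ratios[word] = rel_corpus.get(word, 0.0) / rel_b
    return ratios


# Toy data (assumption): 900 English tokens plus 100 German tokens, i.e. 10% noise.
clean_b = ["der", "die", "und", "der"] * 25
corpus = ["the", "of", "and", "to"] * 225 + clean_b
ratios = frequency_ratios(corpus, clean_b, top_k=3)
```

In this toy mixture none of the German words occurs in the English part, so every ratio lands at the injected noise share of 0.1.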

The top frequency words of B w.r.t. A
• Words that do not occur in language A. Their frequency ratio will be around x%.
• Words that are also amongst the highest frequency words of language A and moreover have the same function. Their frequency ratio will be around 1.
• Words that occur in language A, but at different frequency bands. They are a random sample of words of L and are distributed in a Zipf way.
• Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise.
The second group of words is only present in languages that are very similar to each other.
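Because only the first group of words clusters around x%, one way to read off the "predominant ratio" is to bin the ratios on a logarithmic scale and take the fullest bin. The talk does not say how the predominant ratio is determined, so the binning approach, the function name `predominant_ratio` and the parameter `bins_per_decade` are assumptions made for this sketch.

```python
import math
from collections import Counter


def predominant_ratio(ratios, bins_per_decade=4):
    """Return the geometric centre of the most populated log10 bin.

    Ratios of 0 (words not found in the corpus) are ignored.
    """
    positive = [r for r in ratios if r > 0]
    if not positive:
        return 0.0
    bins = Counter(math.floor(math.log10(r) * bins_per_decade) for r in positive)
    best_bin, _ = bins.most_common(1)[0]
    return 10 ** ((best_bin + 0.5) / bins_per_decade)


# Toy ratios (assumption): most words sit near 1% noise, a few shared
# words sit near 1, and one word is inflated by use in titles.
toy_ratios = [0.011, 0.012, 0.013, 0.012, 0.014, 0.011, 1.0, 0.95, 3.0, 0.0]
estimate = predominant_ratio(toy_ratios)
```

The estimate is only as fine as the bin width; a smaller bin gives a sharper value but needs more words per bin to be stable.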

Lexical overlap in top 1000 words

Experiment 1
Artificial noise mixtures: injecting alien language material into monolingual corpora
• Experiment 1a: Injecting different amounts of German noise into a chunk of the British National Corpus (~20 Million words)
• Experiment 1b: Injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 Million words)
For measuring, we used the top 1000 words.
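An artificial noise mixture of this kind could be built as follows. The talk does not describe how the material was injected, so mixing at sentence level, the function name `inject_noise`, and the toy sentences are all assumptions of this sketch.

```python
import random


def inject_noise(clean_sentences, alien_sentences, noise_share, seed=0):
    """Add alien sentences until roughly noise_share of all tokens are alien.

    noise_share must be in [0, 1). Returns the shuffled mixture and the
    token share of alien material actually reached.
    """
    rng = random.Random(seed)
    clean_tokens = sum(len(s.split()) for s in clean_sentences)
    # Number of alien tokens needed so that alien / (clean + alien) = noise_share.
    target = noise_share / (1 - noise_share) * clean_tokens
    pool = list(alien_sentences)
    rng.shuffle(pool)
    mixed = list(clean_sentences)
    added = 0
    for sentence in pool:
        if added >= target:
            break
        mixed.append(sentence)
        added += len(sentence.split())
    rng.shuffle(mixed)
    return mixed, added / (clean_tokens + added)


# Toy data (assumption): 900 English tokens, German sentences as noise.
english = ["the cat sat"] * 300
german = ["der Hund bellt laut"] * 100
mixed, actual_share = inject_noise(english, german, noise_share=0.1)
```

The returned share can differ slightly from the requested one because whole sentences are added; measuring against the actual share avoids attributing that rounding to the noise measure.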

German in BNC

Invading Denmark

Experiment 2
For a collection of web documents (~700 Million words from .de domains), we measure the effect of a corpus cleaning method that strips alien language material.

            Before cleaning              After cleaning
Language    Words found  Freq. ratio     Words found  Freq. ratio
German      1000         0.708           1000         0.946
English     995          0.126           987          0.0010
French      924          0.0398          906          0.00002
Dutch       995          0.000891        775          0.000006
Turkish     642          0.0000631       562          0.000006

Words found: number of top-1000 words found. Frequency ratios before cleaning are approximate.

Cleaning .de web

Conclusion
• Measure captures well the amount of noise
• Noise measured down to a ratio of 10^-5
• Effective: involves 1000 frequency counts per language

Application: Monolingual Corpora
• Corpora portal (screenshot): http://corpora.uni-leipzig.de

Workflow
(Diagram with the column headings Resources, Techniques, Results. Labels shown:)
• Texts: Web / Newspapers; Dictionaries (Dornseiff, WordNets, Wikipedia, ...); URLs
• Crawling; Language detection, Cleaning; Co-occurrences etc.; POS Tagging; Small Worlds Clustering; Classification; Tools
• Standard Size Corpora (lang. 1, lang. 2, ..., lang. n; Language + Time); Web Statistics; Language Statistics; Dictionaries; Classified Objects; Small Worlds
• Similar objects (words, sentences, documents, URLs)
• Classification (semantic properties, subject areas, ...)
• Combined objects (NE-Recognition, terminology, ...): determine patterns, extract multi-words
• Decomposition, Morphology, Inflection, Translation pairs
• Neologisms, Trend Mining, Topic Tracking

CorpusBrowser
Per word:
• Frequency
• Example sentences
• Co-occurrences: left and right neighbours, sentence-based
• Co-occurrence graph

Only a few copies left!
DVD:
• 15 languages
• Corpus Browser
• Corpora in plain text and database format

Questions?? THANK YOU!
